• Illustrating a fingerprinting function. Source: Wikipedia 2018a.
• Rolling hashes are computed from a sliding window. Source: Ghosh 2013.
• An example of audio fingerprinting done on frequency spectrum. Source: Jovanovic 2014.
• Process flow in YouTube's Content ID. Source: Pacheco 2013, slide 13.

# Fingerprinting Algorithms

arvindpdmn
1250 DevCoins

MUSHTAQ96
791 DevCoins
2 authors have contributed to this article
Last updated by arvindpdmn
on 2018-11-13 12:16:34
Created by MUSHTAQ96
on 2018-09-28 09:13:45

## Summary

Human fingerprints are defined by tiny ridges, spirals and valley patterns on the tip of each finger. They are unique: no two people have the same ones.

Similarly, a fingerprinting algorithm in computer science is one that maps large data (such as documents or images) to a much shorter sequence of bytes. Such a sequence may be called the data's fingerprint. In essence, large data is more tersely represented by its fingerprint. This fingerprint uniquely identifies the original data, just as human fingerprints uniquely identify individual people. Fingerprinting can be applied to different types of inputs: documents, images, audio, video, or any arbitrary data.

While fingerprints may identify the original data, they're not a compression technique. This means that original data cannot be derived from its fingerprint. Fingerprints help in quickly checking the integrity of data or retrieving documents from large filesystems.

## Milestones

1981

Michael O. Rabin publishes Fingerprinting By Random Polynomials. It introduces the concept of rolling hashes based on a sliding window. Instead of having fixed-size data chunks, variable-sized chunks give better performance in data deduplication applications.

1991

Ronald Rivest proposes MD5, a cryptographic hash function. It's offered without licensing. Weaknesses are identified in the following years. In 2004, enhanced differential attacks are discovered. By 2009, this attack is optimized so that MD5 collisions can be discovered in milliseconds.

1993

National Institute for Standards and Technology (NIST), USA, proposes Secure Hash Algorithm (SHA). In 1995, this is replaced with SHA-1 due to a flaw in the original algorithm. SHA-1 produces 160-bit hashes. SHA-1 itself is attacked in 2005. Google publishes full details of another attack in 2017.

1999

Shazam is founded based on the idea of using audio fingerprinting to identify music, movies, ads and shows. In 2018, it was acquired by Apple.

2002

NIST proposes SHA-2 as a replacement of SHA-1. SHA-2 produces longer hashes of 224, 256, 384 or 512 bits.

2007

Google introduces beta version of YouTube Video Identification to help copyright owners manage their content on YouTube, particularly when uploaded by others. The service uses video fingerprinting as the technique and later evolves into Content ID.

## Discussion

• What are the characteristics of a good fingerprinting algorithm?

Like any other algorithm, fingerprinting algorithms must balance speed, memory usage and code size. In addition, they must guarantee virtual uniqueness. No two files, even if they differ by a single bit, should end up with the same fingerprint. When two files generate the same fingerprint, we call it a collision. A good algorithm guarantees that the probability of collision is extremely small.

Conversely, when two files differ even slightly, their fingerprints must be significantly different. This property is called avalanche effect. In the world of cryptography, this is also called diffusion.

For some applications, algorithms must support compounding, which is about fingerprinting a composite file from the fingerprints of its constituent parts. Computer files are often combined in various ways, such as concatenation (archive files) or symbolic inclusion (C preprocessor's #include directive). Compounding property may be useful in some applications, such as detecting when a program needs to be recompiled.

• What are some use cases of fingerprinting algorithms?

When we need to search large filesystems or databases, fingerprints are used to speed up data retrieval. Rather than compare large amounts of content, it's sufficient to compare their fingerprints. When we upload files, cloud providers use fingerprints to do deduplication so that multiple copies of the same file are avoided. When users upload content to public sites, fingerprints can tell if it's banned or copyrighted content. In companies, fingerprints can prevent employees from emailing sensitive content such as patent documents.

When fingerprints of a file don't match we can say that data integrity is compromised, that is, the file's been tampered. In other applications where file changes need to be synchronized over a network, fingerprints help in bandwidth reduction. Rather than transfer whole files, fingerprints can tell what parts of the file have changed and transfer only those. Likewise, public-key fingerprints are used instead of long public keys. In distributed systems where unique names must be maintained, fingerprints are used.

For identification, audio fingerprinting is used to identify songs; image fingerprinting to identify cameras, or pornographic content.

• Could you describe some fingerprinting algorithms?

Hash functions, also called fingerprinting functions, are often used to reduce arbitrary length of data to a fixed length of bytes. The output is called a hash or a fingerprint. Some hash algorithms have cryptographic properties and hence called cryptographic hash functions. With these, it's difficult for two different inputs to have the same hash. Cryptographic hashing algorithms include MD5 and SHA-1.

Rabin's algorithm uses polynomials over a finite field to generate hashes. It's able to give us a precise calculation of the probability of collision. They're also faster than cryptographic hash algorithms. However, they're said to be insecure against malicious attacks. This is a rolling hash. Others include TTTD, FBC and what's used in rsync.

ML algorithms for fingerprinting are trained on deterministic data and then used for prediction. However, these require lots of training data.

• Are there special considerations for audio/video fingerprinting?

Normally, fingerprints must be unique even if the original content changes slightly. However, audio and video are often subject to noise, distortions or variations in encoding. Thus, it's not useful to fingerprint them directly. Algorithms must be robust; that is, even in the presence of noise or distortion, fingerprints must be similar, if not same.

Content that are perceptually different, must produce different fingerprints. For audio, perceptual traits such as zero crossing rate, frequency spectrum and tempo can be used. For video, colour histogram, mean luminance, motion detection, ordinal intensity signature, and spatial distribution of colour or objects in a frame are some features to consider.

Even in text processing, sometimes we want to know if two documents are similar. In this case, we don't want to minimize probability of collision. Instead, we need to maximize it. Thus, we have Locality-sensitive hashing (LSH) or MinHash algorithms. As an example, 5 word n-grams are used to find out if two documents are similar.

• Could you name some applications or products that use fingerprinting?

LBFS is a low-bandwidth networked filesystem that uses SHA-1 hashing to determine if a data chunk needs to be transferred over the network. Venti, an archival storage system, uses SHA-1 hashing of a data block. Blocks are addressed by their hashes. Therefore, block content cannot be changed without changing the address. Thus, Venti enforces write-once policy and deduplication. Pcompress does compression and deduplication for data archives. For deduplication, it uses fingerprinting at chunk level and rolling hash computations.

Since 2007, YouTube's been using a system called Content ID to flag content that infringe copyrights. It uses both audio and visual fingerprinting, and more recently enhanced with machine learning. Shazam identifies songs based on audio fingerprinting. Audible Magic employs its audio/video fingerprinting technology for content providers who wish to protect their copyrighted content.

By computing a hash of hashes, Surety offers a timestamped data integrity service. In blockchain technology, each block is hashed and these hashes influence those of future blocks. These hashes serve to identify and verify the integrity of blocks.

• How's digital fingerprinting related to fingerprinting algorithms?

Digital fingerprinting is a broader concept and fingerprinting algorithms may be treated as a specific case.

The common interpretation is that digital fingerprinting is used to track or profile individuals based on the tools they use or how they interact online. For example, users can be tracked based on their browsers, browser settings, screen resolution, operating system, location, language settings, etc. The term browser fingerprinting has been used to describe this approach. Similarly, there's canvas fingerprinting that's based on GPU and graphics drivers. At device level, MAC address, IP address or TCP/IP configuration could be used.

The purpose is to identify criminals based on their fingerprints or combat online piracy. Another use is to build a general profile of individuals that can be attractive to advertisers to serve personalized ads. Digital fingerprinting therefore brings up the issue of privacy. Users have been tracked across browsers with an accuracy of 99.2%.

• How's a fingerprinting algorithm different from watermarking?

Watermarking is typically employed for documents, images and video. It's a visible marking that serves to discourage content piracy. There's also invisible watermarking that can be used to track the source of a file because each file is stamped with a unique watermark.

With fingerprinting, we don't have to modify the original file. It's more about prevention than tracking. A file's fingerprint is compared against a database to flag copyrighted content or simply to identify the file.

## Milestones

1981

Michael O. Rabin publishes Fingerprinting By Random Polynomials. It introduces the concept of rolling hashes based on a sliding window. Instead of having fixed-size data chunks, variable-sized chunks give better performance in data deduplication applications.

1991

Ronald Rivest proposes MD5, a cryptographic hash function. It's offered without licensing. Weaknesses are identified in the following years. In 2004, enhanced differential attacks are discovered. By 2009, this attack is optimized so that MD5 collisions can be discovered in milliseconds.

1993

National Institute for Standards and Technology (NIST), USA, proposes Secure Hash Algorithm (SHA). In 1995, this is replaced with SHA-1 due to a flaw in the original algorithm. SHA-1 produces 160-bit hashes. SHA-1 itself is attacked in 2005. Google publishes full details of another attack in 2017.

1999

Shazam is founded based on the idea of using audio fingerprinting to identify music, movies, ads and shows. In 2018, it was acquired by Apple.

2002

NIST proposes SHA-2 as a replacement of SHA-1. SHA-2 produces longer hashes of 224, 256, 384 or 512 bits.

2007

Google introduces beta version of YouTube Video Identification to help copyright owners manage their content on YouTube, particularly when uploaded by others. The service uses video fingerprinting as the technique and later evolves into Content ID.

## Tags

• Hash Function
• Audio Fingerprinting
• Video Fingerprinting
• Digital Fingerprinting

Author
No. of Edits
No. of Chats
DevCoins
2
1
1250
4
0
791
1472
Words
1
Chats
6
Edits
4
Likes
2808
Hits

## Cite As

Devopedia. 2018. "Fingerprinting Algorithms." Version 6, November 13. Accessed 2020-01-24. https://devopedia.org/fingerprinting-algorithms
• Site Map