Fingerprinting Algorithms

Illustrating a fingerprinting function. Source: Wikipedia 2018a.
Illustrating a fingerprinting function. Source: Wikipedia 2018a.

Human fingerprints are defined by tiny ridges, spirals and valley patterns on the tip of each finger. They are unique: no two people have the same ones.

Similarly, a fingerprinting algorithm in computer science is one that maps large data (such as documents or images) to a much shorter sequence of bytes. Such a sequence may be called the data's fingerprint. In essence, large data is more tersely represented by its fingerprint. This fingerprint uniquely identifies the original data, just as human fingerprints uniquely identify individual people. Fingerprinting can be applied to different types of inputs: documents, images, audio, video, or any arbitrary data.

While fingerprints may identify the original data, they're not a compression technique. This means that original data cannot be derived from its fingerprint. Fingerprints help in quickly checking the integrity of data or retrieving documents from large filesystems.

Discussion

  • What are the characteristics of a good fingerprinting algorithm?

    Like any other algorithm, fingerprinting algorithms must balance speed, memory usage and code size. In addition, they must guarantee virtual uniqueness. No two files, even if they differ by a single bit, should end up with the same fingerprint. When two files generate the same fingerprint, we call it a collision. A good algorithm guarantees that the probability of collision is extremely small.

    Conversely, when two files differ even slightly, their fingerprints must be significantly different. This property is called avalanche effect. In the world of cryptography, this is also called diffusion.

    For some applications, algorithms must support compounding, which is about fingerprinting a composite file from the fingerprints of its constituent parts. Computer files are often combined in various ways, such as concatenation (archive files) or symbolic inclusion (C preprocessor's #include directive). Compounding property may be useful in some applications, such as detecting when a program needs to be recompiled.

  • What are some use cases of fingerprinting algorithms?

    When we need to search large filesystems or databases, fingerprints are used to speed up data retrieval. Rather than compare large amounts of content, it's sufficient to compare their fingerprints. When we upload files, cloud providers use fingerprints to do deduplication so that multiple copies of the same file are avoided. When users upload content to public sites, fingerprints can tell if it's banned or copyrighted content. In companies, fingerprints can prevent employees from emailing sensitive content such as patent documents.

    When fingerprints of a file don't match we can say that data integrity is compromised, that is, the file's been tampered. In other applications where file changes need to be synchronized over a network, fingerprints help in bandwidth reduction. Rather than transfer whole files, fingerprints can tell what parts of the file have changed and transfer only those. Likewise, public-key fingerprints are used instead of long public keys. In distributed systems where unique names must be maintained, fingerprints are used.

    For identification, audio fingerprinting is used to identify songs; image fingerprinting to identify cameras, or pornographic content.

  • Could you describe some fingerprinting algorithms?

    Hash functions, also called fingerprinting functions, are often used to reduce arbitrary length of data to a fixed length of bytes. The output is called a hash or a fingerprint. Some hash algorithms have cryptographic properties and hence called cryptographic hash functions. With these, it's difficult for two different inputs to have the same hash. Cryptographic hashing algorithms include MD5 and SHA-1.

    Rabin's algorithm uses polynomials over a finite field to generate hashes. It's able to give us a precise calculation of the probability of collision. They're also faster than cryptographic hash algorithms. However, they're said to be insecure against malicious attacks. This is a rolling hash. Others include TTTD, FBC and what's used in rsync.

    ML algorithms for fingerprinting are trained on deterministic data and then used for prediction. However, these require lots of training data.

  • Are there special considerations for audio/video fingerprinting?
    An example of audio fingerprinting done on frequency spectrum. Source: Jovanovic 2014.
    An example of audio fingerprinting done on frequency spectrum. Source: Jovanovic 2014.

    Normally, fingerprints must be unique even if the original content changes slightly. However, audio and video are often subject to noise, distortions or variations in encoding. Thus, it's not useful to fingerprint them directly. Algorithms must be robust; that is, even in the presence of noise or distortion, fingerprints must be similar, if not same.

    Content that are perceptually different, must produce different fingerprints. For audio, perceptual traits such as zero crossing rate, frequency spectrum and tempo can be used. For video, colour histogram, mean luminance, motion detection, ordinal intensity signature, and spatial distribution of colour or objects in a frame are some features to consider.

    Even in text processing, sometimes we want to know if two documents are similar. In this case, we don't want to minimize probability of collision. Instead, we need to maximize it. Thus, we have Locality-sensitive hashing (LSH) or MinHash algorithms. As an example, 5 word n-grams are used to find out if two documents are similar.

  • Could you name some applications or products that use fingerprinting?
    Process flow in YouTube's Content ID. Source: Pacheco 2013, slide 13.
    Process flow in YouTube's Content ID. Source: Pacheco 2013, slide 13.

    LBFS is a low-bandwidth networked filesystem that uses SHA-1 hashing to determine if a data chunk needs to be transferred over the network. Venti, an archival storage system, uses SHA-1 hashing of a data block. Blocks are addressed by their hashes. Therefore, block content cannot be changed without changing the address. Thus, Venti enforces write-once policy and deduplication. Pcompress does compression and deduplication for data archives. For deduplication, it uses fingerprinting at chunk level and rolling hash computations.

    Since 2007, YouTube's been using a system called Content ID to flag content that infringe copyrights. It uses both audio and visual fingerprinting, and more recently enhanced with machine learning. Shazam identifies songs based on audio fingerprinting. Audible Magic employs its audio/video fingerprinting technology for content providers who wish to protect their copyrighted content.

    By computing a hash of hashes, Surety offers a timestamped data integrity service. In blockchain technology, each block is hashed and these hashes influence those of future blocks. These hashes serve to identify and verify the integrity of blocks.

  • How's digital fingerprinting related to fingerprinting algorithms?

    Digital fingerprinting is a broader concept and fingerprinting algorithms may be treated as a specific case.

    The common interpretation is that digital fingerprinting is used to track or profile individuals based on the tools they use or how they interact online. For example, users can be tracked based on their browsers, browser settings, screen resolution, operating system, location, language settings, etc. The term browser fingerprinting has been used to describe this approach. Similarly, there's canvas fingerprinting that's based on GPU and graphics drivers. At device level, MAC address, IP address or TCP/IP configuration could be used.

    The purpose is to identify criminals based on their fingerprints or combat online piracy. Another use is to build a general profile of individuals that can be attractive to advertisers to serve personalized ads. Digital fingerprinting therefore brings up the issue of privacy. Users have been tracked across browsers with an accuracy of 99.2%.

  • How's a fingerprinting algorithm different from watermarking?

    Watermarking is typically employed for documents, images and video. It's a visible marking that serves to discourage content piracy. There's also invisible watermarking that can be used to track the source of a file because each file is stamped with a unique watermark.

    With fingerprinting, we don't have to modify the original file. It's more about prevention than tracking. A file's fingerprint is compared against a database to flag copyrighted content or simply to identify the file.

Milestones

1981
Rolling hashes are computed from a sliding window. Source: Ghosh 2013.
Rolling hashes are computed from a sliding window. Source: Ghosh 2013.

Michael O. Rabin publishes Fingerprinting By Random Polynomials. It introduces the concept of rolling hashes based on a sliding window. Instead of having fixed-size data chunks, variable-sized chunks give better performance in data deduplication applications.

1991

Ronald Rivest proposes MD5, a cryptographic hash function. It's offered without licensing. Weaknesses are identified in the following years. In 2004, enhanced differential attacks are discovered. By 2009, this attack is optimized so that MD5 collisions can be discovered in milliseconds.

1993

National Institute for Standards and Technology (NIST), USA, proposes Secure Hash Algorithm (SHA). In 1995, this is replaced with SHA-1 due to a flaw in the original algorithm. SHA-1 produces 160-bit hashes. SHA-1 itself is attacked in 2005. Google publishes full details of another attack in 2017.

1999

Shazam is founded based on the idea of using audio fingerprinting to identify music, movies, ads and shows. In 2018, it was acquired by Apple.

2002

NIST proposes SHA-2 as a replacement of SHA-1. SHA-2 produces longer hashes of 224, 256, 384 or 512 bits.

2007

Google introduces beta version of YouTube Video Identification to help copyright owners manage their content on YouTube, particularly when uploaded by others. The service uses video fingerprinting as the technique and later evolves into Content ID.

References

  1. Alfresco Docs. 2018. "Document Fingerprints." Alfresco Software. Accessed 2018-11-13.
  2. AppsFlyer. 2016. "6 Things You Need to Know About Fingerprinting for Mobile Attribution [Cheat Sheet]." AppsFlyer. Accessed 2018-11-13.
  3. Briz, Nick. 2018. "This is Your Digital Fingerprint." Blog, Mozilla, July 26. Accessed 2018-11-13.
  4. Business Dictionary. 2018. "digital fingerprint." WebFinance Inc. Accessed 2018-11-13.
  5. Chao, Wei-Lun. 2009. "Introduction to Video Fingerprinting." Tutorial Report, DISP Lab, NTU. Accessed 2018-11-13.
  6. Chow, Joseph. 2016. "Blockchain Underpinnings: Hashing." ConsenSys, on Medium, January 13. Accessed 2018-11-13.
  7. Craig, Christina. 2017. "Device fingerprinting: The tracking we can’t avoid?" Blog, NordVPN, April 03. Accessed 2018-11-13.
  8. Crypto Wiki. 2018. "Confusion and diffusion." FANDOM. Accessed 2018-11-13.
  9. Favi, Daniele. 2018. "How a blockchain works… Let’s make one!" Ledger Projects, January 04. Accessed 2018-11-13.
  10. Fenlon, Wesley. 2011. "How Digital Fingerprinting Works." HowStuffWorks, May 17. Accessed 2018-11-13.
  11. Garfinkel, Simon. 2004. "Fingerprinting Your Files." MIT Technology Review, August 04. Accessed 2018-11-13.
  12. Ghosh, Moinak. 2013. "High Performance Content Defined Chunking." The Pseudo Random Bit Bucket, June 22. Accessed 2018-11-13.
  13. Josephson, William. 2005. "Hashing And Fingerprinting." Computer Science, Princeton University, February 11. Accessed 2018-11-13.
  14. Jovanovic, Jovan. 2014. "How does Shazam work? Music Recognition Algorithms, Fingerprinting, and Processing." Toptal. Accessed 2018-11-13.
  15. King, David. 2007. "Latest content ID tool for YouTube." Google Blog, October 15. Accessed 2018-11-13.
  16. Kuchta, Rafał. 2017. "The hash – a computer file’s digital fingerprint." newtech.law, Wardyński & Partners, October 09. Accessed 2018-11-13.
  17. Leyden, John, Thomas Claburn, and Chris Williams. 2017. "'First ever' SHA-1 hash collision calculated. All it took were five clever brains... and 6,610 years of processor time." The Register, February 23. Accessed 2018-11-13.
  18. Microsoft Docs. 2017. "Document Fingerprinting." Exchange Server 2013, December 10. Accessed 2018-11-13.
  19. Muthitacharoen, Athicha, Benjie Chen, and David Mazières. 2001. "A Low-bandwidth Network File System." Proceedings of the 18th Symposium on Operating Systems Principles, October. Accessed 2018-11-13.
  20. Narayanan, Arvind. 2011. "No Two Digital Cameras Are the Same: Fingerprinting Via Sensor Noise." 33 Bits of Entropy, Wordpress, September 19. Accessed 2018-11-13.
  21. Pacheco, Carlos. 2013. "YouTube Content ID Handbook." Google, via SlideShare, December 10. Accessed 2018-11-13.
  22. Preneel, Bart. 2010. "The First 30 Years of Cryptographic Hash Functions and the NIST SHA-3 Competition." Katholieke Universiteit Leuven and IBBT. Accessed 2018-11-13.
  23. Quinlan, Sean and Sean Dorward. 2002. "Venti: a new approach to archival storage." FAST 2002, Conference on File and Storage Technologies, January 28-30, pp. 89-102. Accessed 2018-11-13.
  24. Schneier, Bruce. 2005. "Cryptanalysis of SHA-1." Schneier on Security, February 18. Accessed 2018-11-13.
  25. Titlow, John Paul. 2016. "YouTube is using AI to police copyright—to the tune of $2 billion in payouts." Fast Company, July 13. Accessed 2018-11-13.
  26. Wikipedia. 2018a. "Fingerprint (computing)." Wikipedia, April 25. Accessed 2018-11-13.
  27. Wikipedia. 2018b. "Public key fingerprint." Wikipedia, February 12. Accessed 2018-11-13.
  28. Wikipedia. 2018c. "Canvas fingerprinting." Wikipedia, August 08. Accessed 2018-11-13.
  29. Wikipedia. 2018d. "TCP/IP stack fingerprinting." Wikipedia, June 10. Accessed 2018-11-13.
  30. Wikipedia. 2018e. "Rabin fingerprint." Wikipedia, February 04. Accessed 2018-11-13.
  31. Wikipedia. 2018f. "Acoustic fingerprint." Wikipedia, August 06. Accessed 2018-11-13.
  32. Wikipedia. 2018g. "Locality-sensitive hashing." Wikipedia, May 11. Accessed 2018-11-13.
  33. Wikipedia. 2018h. "Shazam (application)." Wikipedia, November 11. Accessed 2018-11-13.

Further Reading

  1. Broder, Andrei Z. 1998. "Some applications of Rabin's fingerprinting method." ResearchGate, June. Accessed 2018-11-13.
  2. Garfinkel, Simon. 2004. "Fingerprinting Your Files." MIT Technology Review, August 04. Accessed 2018-11-13.
  3. Ghosh, Moinak. 2013. "High Performance Content Defined Chunking." The Pseudo Random Bit Bucket, June 22. Accessed 2018-11-13.
  4. Kuchta, Rafał. 2017. "The hash – a computer file’s digital fingerprint." newtech.law, Wardyński & Partners, October 09. Accessed 2018-11-13.
  5. Briz, Nick. 2018. "This is Your Digital Fingerprint." Blog, Mozilla, July 26. Accessed 2018-11-13.

Article Stats

Author-wise Stats for Article Edits

Author
No. of Edits
No. of Chats
DevCoins
3
1
1736
4
0
1111
1472
Words
5
Likes
19K
Hits

Cite As

Devopedia. 2020. "Fingerprinting Algorithms." Version 7, July 23. Accessed 2024-06-25. https://devopedia.org/fingerprinting-algorithms
Contributed by
2 authors


Last updated on
2020-07-23 11:38:44
  • Hash Function
  • Audio Fingerprinting
  • Video Fingerprinting
  • Digital Fingerprinting