Similarity Hashing and Perceptual Hashes
We recently presented this paper at the DFRWS (Digital Forensics Research Conference) EU 2023 [here]:
The focus is the analysis of perceptual hashes, and which allow for a matching of images which are approximately the same.
In traditional cryptographic hashing, a change in even a single pixel will cause the hash value to change. While this produces a high trustworthy result, in some cases, though, there can be small differences between two images, and which could be caused by scaling and/or cropping an image. This type of analysis can be used a similarity hashing algorithms, and which is used to measure a distance between two values usually and then applying a threshold to provide an alert on a match. Examples of the measuring the distance between hash values is often defined in a Hamming or Euclidean Distance. Examples of similarity hashing include Microsoft’s PhotoDNA, and Facebook’s PDQ.
An important evaluation element of the success of any method is often defined by the balance between true-positives and false-positives. With a true-positive, the system is able to correctly identify a known contraband image, while a false-positive alerts the system on an image which is not contraband. A false-positive can cause many problems, including incorrectly identifying that a client device has contraband images, and that a reporting authority can become desensitized to too many false alerts. False-negatives are possible the worst, as contraband material has not been detected. The quality of any detection system will thus be to provide no false-negatives, no false-positives, and every true-positive.
Similarity matching
Roussev [1] defines a range of methods that can be used to find similarity:
- Finding known objects (Basic hashing). This includes the well known cryptographic hashing methods, such as MD5, SHA-1, SHA-256 and SHA-3. These hashes are well optimized, and aim to produce a random change of output bits for any changes in the input bits.
- Hash set presentations (Bloom filters).
- Finding similar objects (Data Fingerprints). This includes similarity hashing, which has a base around spam filtering. Similarity hashing can be broken in to two main categories [2]:
- Binary Approximate Matching — Where similarity determinations are made at the binary level.
- Semantic Approximate Matching — where similarity is a function of the perceived content, be it textual (e.g. Google search), visual (e.g. reverse image lookup), or audio based (e.g. Shazam) similarities.
Euclidean distance
A typical method of detecting the similarity of images is to extra different features from an image, such as related to the average luminosity and the detection of background colours, and then to hash these values. Next, these hashes are then compared with other image hashes, and the measure the distance between the hashes (Figure 1).
Hamming distance
With a Hamming distance, we look at the number of bits that change in the hashed values. For example, a similarity hash may produce two hash values of:
1011 1110 1111 0101 1100
1111 0111 1111 0101 1100
This has a Hamming distance of 3, as three bits have changed between the hashes. Thus, the few the bits that change in the similarity hashes, the more likely that we have a match. An example of simularity hashing is with Nilsimsa similarity:
https://asecuritysite.com/hashsim/nil
In the following case, we see that the hash values are similar, but they differ in certain parts of the hash. For this, the hash provides an overall score which relates to the number of bits that have changed between the two values:
Nilsimsa method
String 1: The quick brown fox
String 2: The quicker brown fox
--------------
String 1 digest: 0a31b4be01a0808a29e0ec60e9a258545dc0526770022348380a2128708f2fdb
String 2 digest: 1a31bc3e02a080a28b642864ea224857ddd0526f78022b48380e2269329d3fdb
Score: 91
---Accumulator (String 1)----
[1, 1, 0, 1, 2, 0, 2, 2, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 1, 1, 2, 0, 0, 2, 2, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 3, 2, 1, 0, 1, 1, 1, 0, 2, 0, 0, 0, 1, 0, 2, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 2, 0, 2, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 1, 1, 0, 1, 0, 0, 2, 0, 2, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 2, 0, 1, 0, 0, 0, 0]
---Accumulator (String 2)----
[2, 1, 0, 1, 2, 0, 1, 2, 1, 2, 1, 1, 1, 1, 0, 0, 1, 0, 1, 2, 1, 0, 0, 1, 0, 1, 0, 0, 2, 1, 0, 0, 1, 0, 0, 1, 0, 2, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 0, 0, 0, 1, 1, 2, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 0, 1, 1, 2, 1, 0, 2, 1, 0, 0, 2, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 2, 1, 1, 0, 2, 1, 1, 0, 2, 1, 1, 1, 2, 0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 2, 1, 2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 2, 0, 0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0, 2, 2, 1, 2, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 2, 1, 0, 0, 0, 3, 0, 1, 1, 0, 0, 0]
This method is often used in spam detection. In the following, the spam message has just been changed a little:
Nilsimsa method
String 1: Dear Bill, You have won £10,000,000. Please email me back.
String 2: Dear Mark, You have won £10,000,000. Please email me back
--------------
String 1 digest: 28af87350c35cd842f13c9bd70ba24e232aedc1e7610cef4ed436f342ba82afb
String 2 digest: 28af87370e25cd842f17c9bd62ba20e212aadc0e7690ced569036f342bac221b
Score: 107
---Accumulator (String 1)----
[3, 2, 0, 2, 5, 2, 3, 2, 0, 2, 0, 2, 0, 2, 1, 0, 0, 1, 0, 2, 1, 2, 1, 4, 4, 4, 1, 3, 1, 3, 1, 1, 1, 1, 2, 1, 2, 2, 0, 1, 2, 4, 2, 4, 1, 2, 2, 0, 5, 3, 0, 0, 0, 1, 3, 0, 4, 1, 2, 3, 0, 2, 2, 3, 1, 1, 4, 0, 4, 2, 2, 3, 1, 3, 2, 2, 1, 1, 5, 3, 1, 0, 1, 0, 3, 1, 1, 1, 1, 4, 6, 1, 2, 5, 2, 0, 1, 2, 2, 4, 2, 0, 0, 1, 1, 0, 2, 4, 3, 0, 5, 2, 1, 3, 2, 3, 1, 2, 1, 2, 0, 2, 0, 1, 2, 2, 1, 0, 0, 4, 0, 0, 1, 2, 2, 4, 0, 0, 2, 1, 0, 3, 0, 1, 0, 2, 1, 3, 3, 2, 1, 2, 1, 1, 1, 0, 2, 3, 4, 0, 3, 0, 6, 3, 2, 3, 1, 4, 3, 1, 1, 3, 1, 0, 3, 2, 2, 5, 1, 0, 2, 1, 1, 1, 2, 2, 3, 2, 1, 3, 1, 0, 0, 0, 4, 0, 0, 1, 1, 3, 3, 0, 3, 3, 1, 1, 3, 3, 2, 1, 3, 1, 2, 2, 1, 1, 1, 1, 3, 2, 1, 0, 1, 1, 2, 0, 2, 1, 3, 2, 1, 1, 4, 2, 3, 0, 0, 1, 0, 4, 3, 2, 2, 3, 1, 4, 0, 2, 1, 0, 0, 2, 1, 3, 0, 1]
---Accumulator (String 2)----
[3, 2, 0, 3, 4, 1, 1, 1, 0, 2, 0, 1, 0, 3, 1, 0, 0, 0, 2, 3, 1, 2, 1, 3, 2, 3, 1, 2, 1, 4, 1, 0, 1, 1, 2, 1, 2, 2, 0, 1, 3, 4, 2, 3, 0, 2, 2, 0, 6, 3, 0, 1, 0, 0, 1, 0, 4, 1, 1, 2, 0, 2, 2, 0, 2, 1, 5, 0, 4, 0, 2, 3, 1, 3, 2, 2, 1, 1, 5, 3, 0, 0, 0, 1, 2, 1, 1, 2, 1, 4, 6, 1, 2, 4, 3, 1, 1, 2, 2, 5, 1, 0, 0, 1, 1, 0, 3, 3, 3, 0, 6, 2, 0, 3, 1, 3, 1, 2, 0, 2, 0, 2, 1, 0, 2, 1, 0, 1, 0, 4, 1, 0, 0, 3, 2, 4, 0, 0, 1, 0, 1, 2, 1, 0, 1, 2, 1, 4, 2, 2, 1, 3, 1, 3, 1, 0, 1, 2, 3, 1, 4, 0, 6, 2, 3, 3, 1, 4, 2, 1, 1, 3, 0, 0, 3, 2, 2, 5, 2, 0, 3, 1, 1, 1, 2, 2, 3, 3, 1, 3, 1, 0, 1, 1, 4, 0, 0, 1, 1, 3, 3, 0, 3, 2, 1, 1, 3, 3, 3, 1, 2, 1, 1, 2, 1, 0, 1, 2, 2, 3, 1, 0, 1, 1, 3, 2, 3, 0, 3, 2, 1, 1, 4, 2, 2, 1, 1, 1, 0, 5, 3, 2, 2, 4, 1, 3, 0, 3, 1, 0, 0, 2, 1, 3, 0, 1]
The overall score shows a high match between the two strings.
Now, let’s look at image matching.
Perceptual Hash Analysis
In the paper, there is a number of modification that are made to the images analysed, including adding a border, cropping the image, scaling it, compressing it, and adding a watermark (Figure 2).
The methods that are then analysed are Blockhash, ColourHash, NeuralHash, Facebook PDQ, pHash and Wavehash:
The Inter-score normalised Hamming distances is then analysed be-
tween random images in the Flickr 1 Million dataset and where each
image was compared to 50 random images. Overall, the optimal value of the Hamming space is 0.5. We can see that for Blockhash and pHash and Wavehash, that the border gave the most problems in their matches (and where the bias moves away from 0.5):
When it came to analysing the image with a modified version of itself (intra-score), we can see that most of the methods failed to detect the image in the presense of a border. With Blockhash, the best matches where for scaling and compression, but it failed in most of the other modification. With Neuralhash, it did best in compression and scaling, but failed in most other areas. Generally PDQ did poorly in most areas, apart from compression. pHash and and Wavehash probably did best overall, with Wavehash achieving good detection rates for compression, scale, thumbnails, and watermarking, but still struggled with a border, cropping and mirroring:
Conclusions
Perceptial hashing has improved greatly in its methodology, but still struggles with modified images, especially in areas of mirroring, croping and borders.
References
[1] Roussev, V. (2009). Hashing and data fingerprinting in digital forensics. IEEE Security & Privacy, 7(2), 49–55.
[2] Breitinger, F., Liu, H., Winter, C., Baier, H., Rybalchenko, A., & Steinebach, M. (2013, September). Towards a process model for hash functions in digital forensics. In International Conference on Digital Forensics and Cyber Crime (pp. 170–186). Springer, Cham.