It’s Your Cybersecurity Birthday!
I love giving live cybersecurity demos, and it’s great when they work well. Apart from hacking dolls, kettles, weighing scales, and CCTV cameras, my favourite demos are related to cryptography. This includes demonstrating the RSA and Diffie-Hellman method live, but my favourite one is to perform the birthday attack for hash collisions. For this, if I have a class of 60, I am 99.41% sure that I will be successful in finding two students with the same birthday, and even with 23 students in the class, I have around a 50/50 chance of finding two students with the same birthday. For this, I need to match the same day and month of the birthday (and not the year).
So how does this happen? Well, let’s imagine we have a bag of balls numbered 1 to 365, and where these numbers could represent the days of the year. I will take out a ball and note the number, and then put it back in the bag. The challenge is now to know, on average, how many attempts it will take me before I draw a number which has appeared before.
So let’s try this program to estimate how many students I need:
import random
days=365
array=[]
for i in range(1,days):
x=random.randint(1,days)
if (x in array):
print(f"Duplicate found at {i}")
break
array.append(x)
print(x, end =" ")
For this, I generate a random value between 1 and 365 (just like the students would do when they tell me their birthday. I then store this in a list, and then check if the number has been used before. If I run it a few times I get:
% python 2.py
145 89 267 206 235 37 125 12 72 133 69 148 191 57 56 328 202 199
225 233 144 115 Duplicate found at 23
% python 2.py
68 73 206 53 13 67 32 273 189 65 92 260 100 307 259 279 339 76 359
122 267 Duplicate found at 22
% python 2.py
315 282 76 211 72 146 286 229 152 84 260 26 319 331 70 130 316 163
201 296 365 246 1 307 62 Duplicate found at 26
% python 2.py
327 112 226 263 229 196 224 140 293 14 267 11 16 336 78 34 231 110 94
67 64 188 130 266 111 182 48 Duplicate found at 28
% python 2.py
236 12 101 123 189 305 154 219 15 4 122 350 293 260 183 330 51 321 254
131 203 152 107 Duplicate found at 24
% python 2.py
95 51 226 133 241 202 300 318 239 329 82 351 325 190 187 24 57 65 295 272
257 111 323 364 365 44 Duplicate found at 27
240 291 314 258 352 262 142 9 12 123 162 182 98 15 146 361 29
Duplicate found at 18
% python 2.py
121 211 290 312 55 60 176 97 158 276 159 308 154 325 266 148 206 321
133 5 282 Duplicate found at 22
% python 2.py
45 331 308 327 363 292 5 287 Duplicate found at 9
In this case, the values mainly range from 18 to 27, but I got lucky once in getting a duplicate in just nine attempts. The actual number of attempts I will need, on average, is actually:
This type of approach is known as a square-root attack. So, let’s apply to binary hashing, and where we want to find a hash collision. For this rather than having the days of the year, we will have binary values, and a given bit length. Now, we can represent the values as powers of two for the numbers we will have. So rather than 365, we could have 256 (2⁸) different values, and which represent the values from 0000 0000 to 1111 1111. So, let’s try this:
% python 2.py
133 49 183 170 181 23 173 19 29 Duplicate found at 10
% python 2.py
26 256 183 139 224 229 111 1 209 121 245 180 57 101 179
88 Duplicate found at 17
% python 2.py
209 78 108 5 195 142 14 100 187 151 126 109 129 238 42 244 56 176 45
162 75 159 23 174 97 215 251 132 35 236 204 175 182 1 160 Duplicate found at 36
% python 2.py
27 96 98 162 229 159 15 21 5 131 2 51 88 93 123 173 191 Duplicate found at 18
% python 2.py
43 188 62 163 110 4 59 52 69 51 18 85 252 111 170 145 35 118 66 15
189 181 2 174 71 214 120 104 24 180 97 12 68 235 Duplicate found at 35
% python 2.py
199 129 136 189 186 195 30 192 191 71 52 26 53 Duplicate found at 14
% python 2.py
66 26 147 21 168 247 130 52 78 28 146 64 240 180 37 4 59 166 142 173
156 60 223 2 234 256 82 181 153 194 123 35 138 Duplicate found at 34
% python 2.py
30 36 207 176 113 102 40 186 228 140 81 51 43 173 42 221 56 27 196 9 33
115 78 249 Duplicate found at 25
% python 2.py
46 212 78 214 201 4 26 121 37 195 8 55 102 130 197 199 236 24 162 91
Duplicate found at 21
Again, we will have the square root attack, and where we can now generalise with:
And, so, for an 8-bit value, on average, we will find a collision in 2⁷ (128) tries. For hashing, we could use the MD5 hash, and which has 128 bits. There will thus be 2¹²⁸ different hash values, and where we will find a collision with 2⁶⁴ (18,446,744,073,709,551,616). Thus, if we use a fast cracker that takes 1 nanosecond to try each hash, then the average time to find a collision will be 18,446,744,073 seconds (213,503 days). But let’s say we use GPUs with 4,000 cores, now it will take 533 days. Now, let’s use an array of 32 GPUs. This will only take just over 16 days:
>>> 2**64*1e-9/60/60/24/400/32
16.67999861989073
Now, we can implement the birthday attack with images, and where we can take two images (x_1 and x_2), and then modify each of them to get x_1' and x_2'. If we take an MD5 hash of these, after 2⁶⁴ attempts, on average, we should be able to get:
h(x_1') = h(x_2')
Over, Nat McHugh did this experiment, and took photos of James Brown and Barry White, and managed to get a collision in just 10 hours on the AWS GPU Cloud:
With this, he randomly stuffed bits into the images which could not be seen, and managed to create a collision using the birthday attack.
And larger hashes?
Well, you can see that MD5 is easily suspectable, but what about a larger hash, such as SHA-1? Well, that is massively more difficult as we now have a 160-bit hash, and thus will have a collision within 2⁸⁰. The average time to find a collision from our previous setup is:
>>> 2**80*1e-9/60/60/24/400/32
1093140.3895531588
This is 1,093,140 — over 1 million days rather than 16 days for MD5. But, Google managed to put a massive computing power [here]:
For this, they managed to get a collision on the hashes to two PDF documents for SHA-1. It took a massive amount of computing power and over 18 months to do it, but they at least showed that it could be done. This is then the reason that SHA-1 is being deprecated within the Internet. As for SHA-256, do even attempt to find a collision as it will take you over:
300,000,000,000,000,000,000 days
to find just one, and consume enough energy to boil all the oceans on the planet many times over.
Quantum cracking
Finally, it was Grover who showed that the birthday attack is appliable to quantum cracking, and where the number of hashes or keys that we have to search is:
This means that a 128-bit encryption key for AES is now susceptible to a birthday attack, along with 160-bit hashing values. Thus, we need to move to 256-bit AES and 256-bit hashing methods (SHA-256 or SHA-3) to make sure that quantum computers do not crack our encryption and hashing methods.
Postscript
If you want to find out the probabilities of the birthday attack working for a given size of audience, try here:
Overall, you will find that 23 is the number where there is a 50/50 chance of two people in a room having the same birthday.