My colleague Mike Sierchio wrote a cool post on password strength, and the concept of entropy. As he points out, entropy isn't always entropy. That confusion is apparently not uncommon, as it's been asked about on IT Security Stack Exchange as well. So what's really going on?
Let's step back for a sec and fill in some context. What are we trying to do? We'd like some way to measure how hard it is to guess our passwords, a number that serves as a heuristic standard of password strength. But there are two fundamentally different things we might want to measure:
How hard would it be for someone to guess your password with essentially no knowledge of how you created your password?
How hard would it be for someone to guess your password if they knew the process used to generate it? This is of course assuming that there is a process, for example some script that does some
Math.rand
ing and produces a password string.
The term "entropy" has been used to refer to both kinds of calculations, but they're clearly entirely different things: the former essentially takes a string as input, the latter takes a random process as input. Hence, "entropy is not entropy."
Alright, well if entropy isn't entropy, let's see what entropies are. We'll look at the standard mathematical formulation of the randomprocessentropy which comes from information theory. And we'll look at the function used to calculate particularstringentropy in one of the most popular password strength testers. And that's all we're going to do, we'll look at how the calculations are done, without dwelling too much on the differences between the two approaches or what their use cases are.
What is a random process?
For our purposes, a random process will be determined by the set of all the possible outputs it can produce, and the probability associated with each output. If you roll a fair die, the possible outputs are 1, 2, 3, 4, 5, and 6, and each output happens to have the same probability associated with them, 1/6. If you have a process that rolls a die and then yells "fizz" if the output is divisible by 3, "buzz" if its divisible by 5, and just repeats the number otherwise, then the possible outputs are:
1, 2, "fizz", 4, and "buzz"
with corresponding probabilities:
1/6, 1/6, 1/3, 1/6, and 1/6
We would like to favor processes that can potentially produce a lot of different outputs, and that give all the different outputs a fairly similar chance of occurring. If we had a process that picked a number between 1 and 999999999, but 95% of the time it picked 1, it wouldn't be that great since if an attacker knew this about our process, he or she could just make one guess, namely 1, and have a 95% chance of accessing whatever was supposed to be secured by a password generated by this process.
So here's the formula: given a process with possible outcomes o_{1}, …, o_{n}, and respective probabilities p_{1}, …, p_{n}, the entropy of this process is given by
− [p_{1}log_{2}(p_{1}) + … + p_{n}log_{2}(p_{n})]
This formula satisfies the two criteria we wanted to satisfy above, and additionally has the following aesthetically pleasing feature: the random process which generates n independent random bits has entropy of n. Let's prove it:
That process has 2^{n} possible outcomes, each with equal probability, namely 1 / 2^{n} (it's like rolling a 2^{n}sided fair die).
$ \left [ \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right) +\dots + \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right)\right]\ \ \ (2^n\mbox{ times})$
$=  \left[ 2^n \times \frac{1}{2^n}\log _2\left(\frac{1}{2^n}\right) \right]$
$=  \log_2\left(\frac{1}{2^n}\right)$
= − ( − n)
= n
What about the entropy of our fizzbuzz process?
$ \left[4 \times \frac{1}{6}\log_2\left(\frac{1}{6}\right) + \frac{1}{3}\log_2\left(\frac{1}{3}\right)\right] \approx 2.252$
… And if you don't know the process?
Here's the heart of the code used on the strength test website to estimate password strength:
1 2 3 4 5 6 7 8 9 10 

And here it is, refactored, Rubyfied, and decorated with couple additional comments:
1 2 3 4 5 6 7 

There's a couple questions this should evoke. How is it determining how unlikely it is for two characters to show up consecutively? If you look at the original code you can see it's looking things up in a table. It gets the index of the two characters (what it calls aidx
and bidx
), and then finds the entry in the table corresponding to that pair by finding the aidx * 27 + bidx
th entry in an array representing the frequency table. This suggests that this table is 27 x 27. It treats upper case letters the same as lower case, and it treats all numbers and special characters the exact same! Indeed, it'll tell you that the following two passwords have the same entropy:
^341^)8@#05&*6%%#$7(9!24%
and
!111111111111111111111111
Also, notice how it only considers two consecutive characters at a time, never "learning" from patterns it might have observed earlier in the string. So for instance given a password like aaaaaaaaaaaaa
, it doesn't "catch on" by the 6th or 7th aa
pair that there seems to be a lot of aa
pairs. It's just as "surprised" to see the last aa
pair as it was to see the first.
And what about this character pool size estimation? Some experimenting will show you that if it detects a single numerical digit, it assumes all of the numerical digits were available in the selection pool for every single letter. If it sees a single character from !@#$%^&*()
, it'll assume they were all available, and again, for every letter. So in particular if your process is mainly a bunch of English letters and then you just throw in a single number and one of !@#$%^&*()
at the end, you'll get a bump to your password strength.
Going back to the consecutive pair thing, why just consecutive pairs? What about triples?
Oh, and what about that business of squaring the "unlikeliness" of the consecutive characters? The unlikeliness is a probability, hence a number between 0 and 1, and so squaring it results in a smaller number. Since we're adding these numbers to our calculation of entropy, it results in a smaller value for our final entropy. Therefore it tries to capture the assumption that the attacker is a good guesser. But why squaring specifically?
To be clear, these are simply questions worth asking, not criticisms per se. This way of calculating entropy is supposed to be naive, and is supposed to make as few assumptions as possible about how the given password is generated, and thus what patterns to expect. If it made stronger assumptions, then it would be very bad at estimating the password strength of even a very weak process that simply made sure to contradict those assumptions.
Well, I guess some of them are criticisms. And there are criticisms to be leveled against the ubiquity of the usual "your password must contain 820 characters, at least one number and one special character," etc. But that's all for this post.
Oh, and not only is entropy not entropy, entropy is also this thing:
$\Delta S = \int \frac{dQ_{rev}}{T}$