The science of password guessing – Light Blue Touchpaper

I’ve written quite a few posts about passwords, mainly focusing on poor implementations, bugs and leaks from large websites. I’ve also written on the difficulty of guessing PINs, multi-word phrases and personal knowledge questions. How hard are passwords to guess? How does guessing difficulty compare between different groups of users? How does it compare to potential replacement technologies? I’ve been working on the answers to these questions for much of the past two years, culminating in my PhD dissertation on the subject and a new paper at this year’s IEEE Symposium on Security and Privacy (Oakland), which I presented yesterday. My approach is simple: don’t assume any semantic model for the distribution of passwords (Markov models and probabilistic context-free grammars have been proposed, amongst others), but instead learn the distribution of passwords from lots of data and use this to estimate the efficiency of a hypothetical guesser with perfect knowledge. It’s been a long effort requiring new mathematical techniques and the largest corpus of passwords ever collected for research. My results provide some new insight into the nature of password selection and a good framework for future research on authentication using human-chosen distributions of secrets.

First, the mathematics. Passwords can be modeled as a skewed probability distribution over an infinite space of possible strings. We have a few metrics for the uncertainty of a random value drawn from a skewed distribution. Shannon entropy is the best known, but doesn’t relate to the difficulty of sequential guessing. Guesswork measures the difficulty of a sequential guessing attack as the expected number of guesses before succeeding, but this turns out to be a hopelessly conservative attacker model. The leaked RockYou dataset, for example, contains enough random 128-bit passwords to ensure that the average guessing attack will take more than 2^100 guesses. I introduce a new metric, α-guesswork, which assumes an attacker will only guess enough to compromise a certain proportion α ≤ 1 of users before quitting. The metric has some attractive properties: by varying α, any attacker can be modeled, from an online opportunist trying a few guesses at a large number of accounts to an offline attacker performing an extended brute-force attack. The traditional guesswork metric is simply the 1-guesswork, measuring an attacker who never gives up, while min-entropy is the 0-guesswork, measuring an attacker only focusing on the weakest users. All other attack models can be measured by varying α; guessing efficiency decreases monotonically as α increases. The metric is scaled logarithmically in bits, making it intuitive for security engineers.
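A minimal sketch of the computation may help. It assumes the following definition: sort the passwords by decreasing probability p_1 ≥ p_2 ≥ ..., let μ be the smallest number of guesses whose cumulative probability λ reaches α, take G = (1 − λ)·μ + Σ p_i·i over the first μ passwords as the expected number of guesses per account, and convert to bits as lg(2G/λ − 1) + lg(1/(2 − λ)). The function name `alpha_guesswork_bits` and its interface are hypothetical, chosen only for this illustration.

```python
import math

def alpha_guesswork_bits(counts, alpha):
    """Alpha-guesswork in bits, from a histogram of password frequencies.

    Hypothetical helper for illustration; `counts` holds one integer
    frequency per distinct password, in any order.
    """
    if not 0 < alpha <= 1:
        raise ValueError("alpha must satisfy 0 < alpha <= 1")
    total = sum(counts)
    if total <= 0:
        raise ValueError("histogram is empty")

    # An optimal attacker guesses the most common passwords first.
    ordered = sorted(counts, reverse=True)
    target = alpha * total

    seen = 0       # accounts compromised so far
    weighted = 0   # sum of count * rank over the passwords guessed
    mu = 0         # number of guesses made per account
    for rank, count in enumerate(ordered, start=1):
        seen += count
        weighted += count * rank
        mu = rank
        if seen >= target:
            break

    lam = seen / total                        # actual success rate (>= alpha)
    guesses = (1 - lam) * mu + weighted / total  # expected guesses per account
    # Scale so that a uniform distribution over 2**k values yields k bits.
    return math.log2(2 * guesses / lam - 1) + math.log2(1 / (2 - lam))
```

Under these assumptions, a uniform histogram of 1,024 equally likely passwords gives 10 bits for every α, while a skewed histogram such as `[50, 25, 25]` gives only 1 bit at α = 0.5, because a single guess already compromises half of the accounts.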

Best of all, this metric can be computed using only the histogram of password frequencies and doesn’t require any information about the plaintext. This means it can be computed on a distribution of hashed (but un-salted) passwords. In collaboration with Yahoo!, I was able to collect such a distribution of passwords of real users logging in to Yahoo! services, hashed with a strong secret key which was destroyed after the collection experiment. We gathered a corpus of nearly 70 million passwords, over twice as big as the RockYou leak, and also collected dozens of sub-distributions from different demographic groups for analysis. I can’t thank Yahoo! enough for cooperating to collect this unprecedentedly rich data set.
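To illustrate why a histogram is enough, here is a rough sketch of a privacy-preserving collection step. It is not the pipeline used in the Yahoo! experiment, whose details are not given here; it simply assumes HMAC-SHA256 with a random key as the keyed hash, and the function name `collect_histogram` is hypothetical.

```python
import hashlib
import hmac
import secrets
from collections import Counter

def collect_histogram(passwords):
    """Return password frequency counts, most common first.

    Hypothetical sketch: plaintext passwords are never stored, only a
    tally of their keyed hashes.
    """
    # Assumption: a fresh random key per collection run.
    key = secrets.token_bytes(32)
    tally = Counter()
    for password in passwords:
        digest = hmac.new(key, password.encode("utf-8"), hashlib.sha256).digest()
        tally[digest] += 1
    # Dropping the reference stands in for destroying the key; a real
    # deployment would need stronger guarantees than Python's `del`.
    del key
    # Only the frequencies leave this function, not the hashes.
    return sorted(tally.values(), reverse=True)
```

Because the same key is applied to every password, equal passwords still collide and the frequency counts are preserved; once the key is gone, the hashes can no longer be tested against candidate passwords.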

Even with 70 million passwords, sample size is a problem. Over half of users chose passwords that were unique within the data set, and such passwords are very difficult to reason about statistically. Put another way, even as the last million passwords in the data set were collected, most guessing metrics still appeared to go up, because new passwords kept appearing. Accurately comparing the guessing difficulty of two empirical distributions with different sample sizes was the hardest technical challenge of the project. The good news is that, while Shannon entropy is essentially impossible to estimate in this scenario, the new α-guesswork metric can be estimated accurately for many values of α. I developed two techniques: bootstrap sampling to estimate the range of α over which estimates will be accurate (described only in my PhD thesis, not the Oakland paper), and extrapolating estimates using a model distribution for password frequencies.
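The sample-size effect is easy to reproduce on synthetic data. The sketch below is not either of the techniques mentioned above; it only shows the underlying problem, using a made-up Zipf-like distribution (weight 1/rank over 100,000 hypothetical passwords) as a stand-in for real password frequencies. The naive "plug-in" Shannon entropy estimate keeps climbing as the sample grows, so two samples of different sizes cannot be compared directly.

```python
import math
import random
from collections import Counter

def plugin_entropy_bits(samples):
    """Naive Shannon entropy estimate (bits) from observed frequencies."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_by_sample_size(sizes, vocabulary=100_000, seed=0):
    """Estimate entropy on growing prefixes of one synthetic sample.

    Assumption: passwords follow a Zipf-like law with weight 1/rank.
    This is an illustrative model, not a fit to any real data set.
    """
    rng = random.Random(seed)
    weights = [1.0 / rank for rank in range(1, vocabulary + 1)]
    draws = rng.choices(range(vocabulary), weights=weights, k=max(sizes))
    return [plugin_entropy_bits(draws[:n]) for n in sizes]

sizes = [1_000, 10_000, 100_000]
estimates = entropy_by_sample_size(sizes)
for n, bits in zip(sizes, estimates):
    print(f"{n:>7} samples: {bits:.2f} bits")
```

The estimate can never exceed lg(n) for a sample of n passwords, which is why a heavy tail of rarely chosen passwords biases it downwards at every practical sample size.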

Given these new tools, what can one learn about passwords? The basic numbers rarely change much. For an online attack which is rate-limited to 1-100 guesses, passwords offer about 5-10 bits of security. That is, such guessing attacks are equally difficult against passwords or truly random 5-10 bit values. For an offline attacker aiming to compromise 50% of available accounts, most password distributions offer 15-25 bits of security (with the same interpretation). Of about 300 different subpopulations of Yahoo! users, none produced a password distribution outside this range. Nor did users at RockYou, Gawker, BattlefieldHeroes, or any other leaked dataset I’ve been able to obtain.
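One way to read the "bits" figures for a rate-limited attacker: if β guesses per account compromise a fraction λ_β of accounts, the attack is as hard as guessing a uniformly random value of lg(β/λ_β) bits. The symbols β and λ_β and the numbers below are introduced only for illustration and are not measurements from the study.

```latex
\tilde{\lambda}_\beta = \lg\frac{\beta}{\lambda_\beta},
\qquad
\beta = 10,\ \lambda_{10} = 0.01
\;\Longrightarrow\;
\tilde{\lambda}_{10} = \lg\frac{10}{0.01} = \lg 1000 \approx 9.97 \text{ bits}
```

As a sanity check, for a uniform distribution over 2^k values, β guesses succeed with probability β/2^k, so the formula returns exactly k bits.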

Different groups may choose different passwords, but they produce roughly the same distribution from the point of view of a guessing attacker. There are some interesting demographic trends: older users produce a better password distribution than younger users, for example. But the effects of security motivation are very modest. Users of Yahoo! mail compared to other services barely differ in password choices, whereas users with a payment card registered avoid the very weakest passwords in the distribution but otherwise are nearly identical to the overall population. A graphical password-strength indicator Yahoo! rolled out a few years ago increased the security of the resulting password distribution by about a bit.

There are two possible explanations. Either human beings fundamentally produce a skewed distribution of passwords no matter what their motivation is, or wading through the morass of websites requesting passwords has worn users down to the point that they always produce the same password distribution. I find the first more plausible, especially given the consistency across culture and language groups.

Even though humans produce distributions with pitifully few bits of security, I think passwords will always be with us. As one component in a system with many layers, passwords can be valuable as a low-cost authentication mechanism which nearly all people can use with no special equipment. The important thing is to stop considering them the first and last step in authentication. My research goal has been to understand the limits of passwords in detail so that we can design future systems around these limits.