The science of password guessing

I've written quite a few posts about passwords, mainly focusing on poor implementations, bugs and leaks from large websites. I've also written on the difficulty of guessing PINs, multi-word phrases and personal knowledge questions. How hard are passwords to guess? How does guessing difficulty compare between different groups of users? How does it compare to potential replacement technologies? I've been working on the answers to these questions for much of the past two years, culminating in my PhD dissertation on the subject and a new paper at this year's IEEE Symposium on Security and Privacy (Oakland), which I presented yesterday. My approach is simple: don't assume any semantic model for the distribution of passwords (Markov models and probabilistic context-free grammars have been proposed, amongst others), but instead learn the distribution of passwords from lots of data and use it to estimate the efficiency of a hypothetical guesser with perfect knowledge of that distribution. It's been a long effort requiring new mathematical techniques and the largest corpus of passwords ever collected for research. My results provide some new insight into the nature of password selection and a good framework for future research on authentication using human-chosen distributions of secrets.
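To make the idea of a guesser with perfect knowledge concrete, here is a minimal Python sketch. It is not the method from the paper. It assumes we simply count passwords in a sample and that the best strategy is to try the most common ones first. The function names and the toy sample are made up for illustration, and a naive empirical estimate like this becomes unreliable for rare passwords in a finite sample.

```python
from collections import Counter


def optimal_guess_success(passwords, num_guesses):
    """Fraction of accounts broken by a guesser who knows the exact
    distribution and tries the num_guesses most common passwords.
    Assumes a non-empty sample (illustrative helper, not from the paper)."""
    counts = Counter(passwords)
    total = sum(counts.values())
    top = counts.most_common(num_guesses)
    return sum(count for _, count in top) / total


def guesses_needed(passwords, target_fraction):
    """Smallest number of guesses per account that breaks at least
    target_fraction of the accounts, guessing in order of popularity."""
    counts = Counter(passwords)
    total = sum(counts.values())
    cracked = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        cracked += count
        if cracked / total >= target_fraction:
            return rank
    return len(counts)


# Toy sample, invented for illustration only.
sample = ["123456"] * 5 + ["password"] * 3 + ["qwerty", "zx9!k"]

print(optimal_guess_success(sample, 1))
print(optimal_guess_success(sample, 2))
print(guesses_needed(sample, 0.75))
```

Both functions read the same sorted distribution from opposite directions: one fixes the guessing budget and reports the success rate, the other fixes the desired success rate and reports the budget.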