5 years ago, I compiled a dataset of password histograms representing roughly 70 million Yahoo! users. It was the largest password dataset ever compiled for research purposes. The data was a key component of my PhD dissertation the next year and motivated new statistical methods for which I received the 2013 NSA Cybersecurity Award.
I had always hoped to share the data publicly. It consists only of password histograms, not passwords themselves, so it seemed reasonably safe to publish. But without a formal privacy model, Yahoo! didn’t agree. Given the history of deanonymization work, caution is certainly in order. Today, thanks to new differential privacy methods described in a paper published at NDSS 2016 with colleagues Jeremiah Blocki and Anupam Datta, a sanitized version of the data is publicly available.
Continue reading My Yahoo! password histograms are now available (with differential privacy!)
Yesterday I received the NSA award for the Best Scientific Cybersecurity Paper of 2012 for my IEEE Oakland paper “The science of guessing.” I’m honored to have been recognised by the distinguished academic panel assembled by the NSA. I’d like to again thank Henry Watts, Elizabeth Zwicky, and everybody else at Yahoo! who helped me with this research while I interned there, as well as Richard Clayton and Ross Anderson for their support and supervision throughout.
On a personal note, I’d be remiss not to mention my conflicted feelings about winning the award given what we know about the NSA’s widespread collection of private communications and what remains unknown about oversight over the agency’s operations. Like many in the community of cryptographers and security engineers, I’m sad that we haven’t better informed the public about the inherent dangers and questionable utility of mass surveillance. And like many American citizens I’m ashamed we’ve let our politicians sneak the country down this path.
In accepting the award I don’t condone the NSA’s surveillance. Simply put, I don’t think a free society is compatible with an organisation like the NSA in its current form. Yet I’m glad I got the rare opportunity to visit with the NSA and I’m grateful for my hosts’ genuine hospitality. A large group of engineers turned up to hear my presentation, asked sharp questions, understood and cared about the privacy implications of studying password data. It affirmed my feeling that America’s core problems are in Washington and not in Fort Meade. Our focus must remain on winning the public debate around surveillance and developing privacy-enhancing technology. But I hope that this award program, established to increase engagement with academic researchers, can be a small but positive step.
Today at W2SP I presented a new paper making the case for distributing security policy in hyperlinks. The basic idea is old, but I think the time is right to re-examine it. After the DigiNotar debacle, the community is getting serious about fixing PKI on the web. It was hot topic at this week’s IEEE Security & Privacy (Oakland), highlighted by Jeremy Clark and Paul van Oorschot’s excellent survey paper. There are a slew of protocols under development like key pinning (HPKP), Certificate Transparency, TACK, and others. To these I add s-links, a complementary mechanism to declare support for new proposals in HTML links. Continue reading Revisiting secure introduction via hyperlinks
Today the UK Information Commissioner’s Office levied a record £250k fine against Sony over their 2011 Playstation Network breach in which 77 million passwords were stolen. Sony stated that they hashed the passwords, but provided no details. I was hoping that investigators would reveal what hash algorithm Sony used, and in particular if they salted and iterated the hash. Unfortunately, the ICO’s report failed to provide any such details:
The Commissioner is aware that the data controller made some efforts to protect account passwords, however the data controller failed to ensure that the Network Platform service provider kept up with technical developments. Therefore the means used would not, at the time of the attack, be deemed appropriate, given the technical resources available to the data controller.
Given how often I see password implementations use a single iteration of MD5 with no salt, I’d consider that to be the most likely interpretation. It’s inexcusable though for a 12-page report written at public expense to omit such basic technical details. As I said at the time of the Sony Breach, it’s important to update breach notification laws to require that password hashing details be disclosed in full. It makes a difference for users affected by the breach, and it might help motivate companies to get these basic security mechanics right.
Computers are getting exponentially faster, yet the human brain is constant! Surely password crackers will eventually beat human memory…
I’ve heard this fallacy repeated enough times, usually soon after the latest advance in hardware for password cracking hits the news, that I’d like to definitively debunk it. Password cracking is certainly getting faster. In my thesis I charted 20 years of password cracking improvements and found an increase of about 1,000 in the number of guesses per second per unit cost that could be achieved, almost exactly a Moore’s Law-style doubling every two years. The good news though is that password hash functions can (and should) co-evolve to get proportionately costlier to evaluate over time. This is a classic arms race and keeping pace simply requires regularly increasing the number of iterations in a password hash. We can even improve against password cracking over time using memory-bound functions, because memory speeds aren’t increasing nearly as quickly and are harder to parallellise. The scrypt() key derivation function is a good implementation of a memory-bound password hash and every high security application should be using it or something similar.
The downside of this arms race is that password hashing will never get any cheaper to deploy (even in inflation-adjusted terms). Hashing a password must be as slow and costly in real terms 20 years from now or else security will be lower. Moore’s Law will never reduce the expense of running an authentication system because security depends on this expense. It also needs to be a non-negligible expense. Achieving any real security requires that password verification take on the order of hundreds of milliseconds or even whole seconds. Unfortunately this hasn’t been the experience of the past 20 years. MD5 was launched over 20 years ago and is still the most common implementation I see in the wild, though it’s gone from being relatively expensive to evaluate to extremely cheap. Moore’s Law has indeed broken MD5 as a password hash and no serious application should still use it. Human memory isn’t more of a problem today than it used to be though. The problem is that we’ve chosen to let password verification become too cheap.
Last week, I gave a talk at the Center for Information Technology Policy at Princeton. My goal was to expand my usual research talk on passwords with broader predictions about where authentication is going. From the reaction and discussion afterwards one point I made stood out: authenticating humans is becoming a machine learning problem.
Problems with passwords are well-documented. They’re easy to guess, they can be sniffed in transit, stolen by malware, phished or leaked. This has led to loads of academic research seeking to replace passwords with something, anything, that fixes these “obvious” problems. There’s also a smaller sub-field of papers attempting to explain why passwords have survived. We’ve made the point well that network economics heavily favor passwords as the incumbent, but underestimated how effectively the risks of passwords can be managed in practice by good machine learning.
From my brief time at Google, my internship at Yahoo!, and conversations with other companies doing web authentication at scale, I’ve observed that as authentication systems develop they gradually merge with other abuse-fighting systems dealing with various forms of spam (email, account creation, link, etc.) and phishing. Authentication eventually loses its binary nature and becomes a fuzzy classification problem. Continue reading Authentication is machine learning
Yesterday, I took a critical look at the difficulty of interpreting progress in password cracking. Today I’ll make a broader argument that even if we had good data to evaluate cracking efficiency, recent progress isn’t a major threat the vast majority of web passwords. Efficient and powerful cracking tools are useful in some targeted attack scenarios, but just don’t change the economics of industrial-scale attacks against web accounts. The basic mechanics of web passwords mean highly-efficient cracking doesn’t offer much benefit in untargeted attacks. Continue reading Password cracking, part II: when does password cracking matter?
Password cracking has returned to the news, with a thorough Ars Technica article on the increasing potency of cracking tools and the third Crack Me If You Can contest at this year’s DEFCON. Taking a critical view, I’ll argue that it’s not clear exactly how much password cracking is improving and that the cracking community could do a much better job of measuring progress.
Password cracking can be evaluated on two nearly independent axes: power (the ability to check a large number of guesses quickly and cheaply using optimized software, GPUs, FPGAs, and so on) and efficiency (the ability to generate large lists of candidate passwords accurately ranked by real-world likelihood using sophisticated models). It’s relatively simple to measure cracking power in units of hashes evaluated per second or hashes per second per unit cost. There are details to account for, like the complexity of the hash being evaluated, but this problem is generally similar to cryptographic brute force against unknown (random) keys and power is generally increasing exponentially in tune with Moore’s law. The move to hardware-based cracking has enabled well-documented orders-of-magnitude speedups.
Cracking efficiency, by contrast, is rarely measured well. Useful data points, some of which I curated in my PhD thesis, consist of the number of guesses made against a given set of password hashes and the proportion of hashes which were cracked as a result. Ideally many such points should be reported, allowing us to plot a curve showing the marginal returns as additional guessing effort is expended. Unfortunately results are often stated in terms of the total number of hashes cracked (here are some examples). Sometimes the runtime of a cracking tool is reported, which is an improvement but conflates efficiency with power. Continue reading Password cracking, part I: how much has cracking improved?
UPDATE 2012-06-07: LinkedIn has confirmed the leak is real, that they “recently” switched to salted passwords (so the data is presumably an out-of-date backup) and that they’re resetting passwords of users involved in the leak. There is still no credible information about if the hackers involved have the account names or the rest of the site’s passwords. If so, this incident could still have serious security consequences for LinkedIn users. If not, it’s still a major black eye for LinkedIn, though they deserve credit for acting quickly to minimise the damage.
LinkedIn appears to have been the latest website to suffer a large-scale password leak. Perhaps due to LinkedIn’s relatively high profile, it’s made major news very quickly even though LinkedIn has neither confirmed nor denied the reports. Unfortunately the news coverage has badly muddled the facts. All I’ve seen is a list 6,458,020 unsalted SHA-1 hashes floating around. There are no account names associated with the hashes. Most importantly the leaked file has no repeated hashes. All of the coverage appears to miss this fact. Most likely, the leaker intentionally ran it through ‘uniq’ in addition to removing account info to limit the damage. Also interestingly, 3,521,180 (about 55%) of the hashes have the first 20 bits over-written with 0. Among these, 670,785 are otherwise equal to another hash, meaning that they are actually repeats of the same password stored in a slightly different format (LinkedIn probably just switched formats at some point in the past). So there are really 5,787,235 unique hashes leaked. Continue reading On the (alleged) LinkedIn password leak
Over a year ago, we blogged about a bug at Gawker which replaced all non-ASCII characters in passwords with ‘?’ prior to checking. Along with Rubin Xu and others I’ve investigated issues surrounding passwords, languages, and character encoding throughout the past year. This should be easy: websites using UTF-8 can accept any password and hash it into a standard format regardless of the writing system being used. Instead though, as we report a new paper which I presented last week at the Web 2.0 Security and Privacy workshop in San Francisco, passwords still localise poorly both because websites are buggy and users have been trained to type ASCII passwords only. This has broad implications for passwords’ role as a “universal” authentication mechanism. Continue reading Of contraseñas, סיסמאות, and 密码