Authentication is machine learning – Light Blue Touchpaper

Last week, I gave a talk at the Center for Information Technology Policy at Princeton. My goal was to expand my usual research talk on passwords with broader predictions about where authentication is going. From the reaction and discussion afterwards, one point I made stood out: authenticating humans is becoming a machine learning problem.

Problems with passwords are well-documented. They’re easy to guess, they can be sniffed in transit, stolen by malware, phished or leaked. This has led to loads of academic research seeking to replace passwords with something, anything, that fixes these “obvious” problems. There’s also a smaller sub-field of papers attempting to explain why passwords have survived. We’ve made the point well that network economics heavily favor passwords as the incumbent, but underestimated how effectively the risks of passwords can be managed in practice by good machine learning.

From my brief time at Google, my internship at Yahoo!, and conversations with other companies doing web authentication at scale, I’ve observed that as authentication systems develop they gradually merge with other abuse-fighting systems dealing with various forms of spam (email, account creation, link, etc.) and phishing. Authentication eventually loses its binary nature and becomes a fuzzy classification problem.

This is not a new observation. It’s generally accepted for banking authentication and some researchers like Dinei Florêncio and Cormac Herley have made it for web passwords. Still, much of the security research community thinks of password authentication in a binary way (much of my own research on measuring password guessing difficulty is admittedly guilty). Spam and phishing provide insightful examples: technical solutions (like Hashcash, DKIM signing, or EV certificates) have generally failed, but in practice machine learning has greatly reduced these problems. The theory has largely held up that with enough data we can train reasonably effective classifiers to solve seemingly intractable problems.

Similarly, the ultimate answer to the intractable question of “what will replace passwords?” might be nothing. Passwords will stick around as one of many useful authentication signals fed into various classifiers looking for account compromise at different stages of authentication. Large providers already use many other input signals including the source IP address, browser information and user agent string, cookies cached on the browser, the time of the login and the number of incorrect password guesses. More factors can be added over time: more complex behavioral profiles of users, cryptographic means to identify browsers like origin-bound certificates, one-time codes sent by SMS or generated by a mobile device, or perhaps lightweight biometrics like typing dynamics. No magic combination of these will always be correct or available, but this is exactly where machine learning can shine by taking what signals are available and estimating the probability that the correct human is present.
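To make the idea of combining whatever signals happen to be available more concrete, here is a minimal sketch in Python. It is not how any particular provider does it: the signal names, the weights, and the naive-Bayes-style assumption that signals contribute independent log-odds are all illustrative assumptions. A real system would learn its weights from labelled login data and would likely use a far richer model.

```python
import math
from typing import Dict, Optional

# Hypothetical per-signal log-odds contributions, as (if_true, if_false).
# These numbers are invented for illustration; a real system would learn them.
SIGNAL_WEIGHTS = {
    "password_correct": (3.0, -6.0),
    "known_ip": (1.5, -1.0),
    "cookie_present": (2.0, -0.5),
    "usual_login_hour": (0.5, -0.5),
}

# Assumed prior log-odds that a login attempt is legitimate (0.0 means 50/50).
PRIOR_LOG_ODDS = 0.0


def legit_probability(signals: Dict[str, Optional[bool]]) -> float:
    """Estimate the probability that the correct human is present.

    A signal whose value is None (or that is absent) is treated as
    unavailable and simply skipped, so the estimate degrades gracefully
    instead of failing when a signal is missing.
    """
    log_odds = PRIOR_LOG_ODDS
    for name, value in signals.items():
        if value is None or name not in SIGNAL_WEIGHTS:
            continue  # unavailable or unknown signal: no evidence either way
        if_true, if_false = SIGNAL_WEIGHTS[name]
        log_odds += if_true if value else if_false
    # The logistic function maps log-odds back to a probability in (0, 1).
    return 1.0 / (1.0 + math.exp(-log_odds))
```

For example, `legit_probability({"password_correct": True, "known_ip": False, "cookie_present": None})` uses only the two signals that are present. The password is just one input among several here, which is the point: a correct password from an unfamiliar location yields a lower estimate than the same password from a familiar one.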

As authentication becomes a classification problem, it can become more nuanced. In the face of high uncertainty a site can ask for more signals (such as answering questions about one’s history of interaction with a site or identifying friends in photographs). It will become increasingly common to grant authentication with limited rights such as the ability to read email but not change persistent account settings. Authentication can also continue after password approval, using interactions with the site as additional signals to detect account compromise.
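A rough sketch of what a non-binary outcome might look like follows. The thresholds, tier names, and action names are hypothetical and chosen only for illustration; in practice they would be tuned against the cost of false accepts and false rejects for each site.

```python
from enum import Enum


class Decision(Enum):
    FULL_ACCESS = "full_access"        # all actions permitted
    LIMITED_ACCESS = "limited_access"  # e.g. read email, no settings changes
    CHALLENGE = "challenge"            # ask for more signals before deciding
    DENY = "deny"


# Hypothetical thresholds on the estimated probability of a legitimate user.
FULL_THRESHOLD = 0.95
LIMITED_THRESHOLD = 0.80
CHALLENGE_THRESHOLD = 0.30

# Hypothetical actions permitted at each tier.
ALLOWED_ACTIONS = {
    Decision.FULL_ACCESS: {"read_email", "send_email", "change_settings"},
    Decision.LIMITED_ACCESS: {"read_email"},
    Decision.CHALLENGE: set(),
    Decision.DENY: set(),
}


def decide(p_legit: float) -> Decision:
    """Map a classifier's probability estimate to an access tier."""
    if not 0.0 <= p_legit <= 1.0:
        raise ValueError("p_legit must be a probability in [0, 1]")
    if p_legit >= FULL_THRESHOLD:
        return Decision.FULL_ACCESS
    if p_legit >= LIMITED_THRESHOLD:
        return Decision.LIMITED_ACCESS
    if p_legit >= CHALLENGE_THRESHOLD:
        return Decision.CHALLENGE
    return Decision.DENY


def is_allowed(p_legit: float, action: str) -> bool:
    """Check whether an action is permitted at the current confidence level."""
    return action in ALLOWED_ACTIONS[decide(p_legit)]
```

Because the estimate can be recomputed as the session goes on, the same `decide` call could be repeated after each interaction, so a session that starts with full access can be downgraded if later behavior looks suspicious.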

Possibly the most important implication, though, is that web authentication will be increasingly difficult for all but the largest providers. Some of this is basic economy of scale: when I consult with small-to-medium sites it’s often one overworked web developer who has a dozen priorities higher than hashing and salting passwords correctly, while big sites can afford a large team of dedicated security engineers. Most of the big web companies have also attained their position partially thanks to their experience building large-scale machine learning systems, so they have machine learning experts around to help.

Most importantly though, as authentication becomes machine learning it requires data. Big sites have more data not just because they have more users, but because users interact with the largest sites much more frequently and reveal far more data about themselves. This data is invaluable for building and tuning models of user behavior to detect account compromise. Centralisation amongst Internet companies is being predicted for many reasons, but an underrated driver may be the amount of data required to authenticate users on the web.

NOTE: This blog post represents my own opinions based on my research and consulting work and not those of my employer, Google. In particular, I have not been working closely with the authentication team at Google.

Thanks to Arvind Narayanan, Cormac Herley, Ross Anderson, Umesh Shankar, and many audience members during my Princeton talk for discussions which led to this post.