Authentication is machine learning

Last week, I gave a talk at the Center for Information Technology Policy at Princeton. My goal was to expand my usual research talk on passwords with broader predictions about where authentication is going. From the reaction and discussion afterwards, one point I made stood out: authenticating humans is becoming a machine learning problem.

Problems with passwords are well-documented. They’re easy to guess, they can be sniffed in transit, stolen by malware, phished or leaked. This has led to loads of academic research seeking to replace passwords with something, anything, that fixes these “obvious” problems. There’s also a smaller sub-field of papers attempting to explain why passwords have survived. We’ve made the point well that network economics heavily favor passwords as the incumbent, but underestimated how effectively the risks of passwords can be managed in practice by good machine learning.

From my brief time at Google, my internship at Yahoo!, and conversations with other companies doing web authentication at scale, I’ve observed that as authentication systems develop they gradually merge with other abuse-fighting systems dealing with various forms of spam (email, account creation, link, etc.) and phishing. Authentication eventually loses its binary nature and becomes a fuzzy classification problem.

This is not a new observation. It’s generally accepted for banking authentication, and some researchers like Dinei Florêncio and Cormac Herley have made it for web passwords. Still, much of the security research community thinks of password authentication in a binary way (much of my own research on measuring password guessing difficulty is admittedly guilty). Spam and phishing provide insightful examples: technical solutions (like Hashcash, DKIM signing, or EV certificates) have generally failed, but in practice machine learning has greatly reduced these problems. The theory has largely held up that with enough data we can train reasonably effective classifiers to solve seemingly intractable problems.

Similarly, the ultimate answer to the intractable question of “what will replace passwords?” might be nothing. Passwords will stick around as one of many useful authentication signals fed into various classifiers looking for account compromise at different stages of authentication. Large providers already use many other input signals including the source IP address, browser information and user agent string, cookies cached on the browser, the time of the login and the number of incorrect password guesses. More factors can be added over time: more complex behavioral profiles of users, cryptographic means to identify browsers like origin-bound certificates, one-time codes sent by SMS or generated by a mobile device, or perhaps lightweight biometrics like typing dynamics. No magic combination of these will always be correct or available, but this is exactly where machine learning can shine by taking what signals are available and estimating the probability that the correct human is present.
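To make the idea of combining whatever signals happen to be available more concrete, here is a toy sketch in Python. It is not how any real provider does it: the signal names, the weights and the prior are all invented for illustration, and a production system would learn them from data. Each observed signal adds or subtracts a log-odds weight, missing signals contribute nothing, and the total is squashed into a probability that the correct human is present.

```python
import math

# Hypothetical log-odds weights, invented for illustration only.
# A real system would learn these from labelled login data.
SIGNAL_WEIGHTS = {
    "password_correct": 3.0,
    "known_cookie": 2.5,
    "usual_ip_range": 1.5,
    "usual_login_hour": 0.5,
    "sms_code_correct": 4.0,
}

# Assumed prior log-odds that a login attempt is legitimate
# before any signal has been looked at.
PRIOR_LOG_ODDS = -2.0


def legitimacy_probability(signals):
    """Estimate the probability that the correct human is present.

    `signals` maps a signal name to True (observed and consistent with the
    user) or False (observed and inconsistent). Signals that are simply
    unavailable are left out of the dict and contribute nothing.
    """
    log_odds = PRIOR_LOG_ODDS
    for name, matches in signals.items():
        weight = SIGNAL_WEIGHTS[name]
        log_odds += weight if matches else -weight
    # Logistic function turns log-odds into a probability.
    return 1.0 / (1.0 + math.exp(-log_odds))


password_only = legitimacy_probability({"password_correct": True})
familiar_device = legitimacy_probability(
    {"password_correct": True, "known_cookie": True, "usual_ip_range": True}
)
strange_device = legitimacy_probability(
    {"password_correct": True, "known_cookie": False, "usual_ip_range": False}
)
print(password_only, familiar_device, strange_device)
```

With these made-up weights a correct password on its own gives a middling score, the same password from a familiar device scores much higher, and the same password from an unfamiliar device and network scores low. The password is just one input among several.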

As authentication becomes a classification problem, it can become more nuanced. In the face of high uncertainty a site can ask for more signals (such as answering questions about one’s history of interaction with a site or identifying friends in photographs). It will become increasingly common to grant authentication with limited rights such as the ability to read email but not change persistent account settings. Authentication can also continue after password approval, using interactions with the site as additional signals to detect account compromise.
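A rough sketch of what this nuance could look like in code, again in Python and again with invented numbers: the action names, the confidence thresholds and the challenge floor below are assumptions for illustration, not values from any deployed system. The point is only that the classifier's estimate is compared against a bar that depends on what the user is trying to do, and that there is a middle outcome between yes and no.

```python
# Hypothetical confidence required before each action is allowed.
# Reading mail needs less certainty than changing persistent settings.
REQUIRED_CONFIDENCE = {
    "read_email": 0.80,
    "change_settings": 0.98,
}

# Below this (assumed) floor the attempt is refused outright rather
# than being given the chance to supply more signals.
CHALLENGE_FLOOR = 0.30


def access_decision(p_legitimate, action):
    """Return 'allow', 'challenge' or 'deny' for the requested action.

    `p_legitimate` is the classifier's current estimate that the correct
    human is present. 'challenge' means: ask for additional signals (for
    example questions about account history) and re-evaluate afterwards.
    """
    if p_legitimate >= REQUIRED_CONFIDENCE[action]:
        return "allow"
    if p_legitimate >= CHALLENGE_FLOOR:
        return "challenge"
    return "deny"


# The same session can be good enough for one action but not another.
print(access_decision(0.90, "read_email"))       # allow
print(access_decision(0.90, "change_settings"))  # challenge
print(access_decision(0.10, "read_email"))       # deny
```

Because the estimate can be recomputed as the session goes on, the same function could be called again after the password step, which is one way to express continued authentication.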

Possibly the most important implication, though, is that web authentication will be increasingly difficult for all but the largest providers. Some of this is basic economy of scale: when I consult with small-to-medium sites it’s often one overworked web developer who has a dozen priorities higher than hashing and salting passwords correctly, while big sites can afford a large team of dedicated security engineers. Most of the big web companies have also attained their position partially thanks to their experience building large-scale machine learning systems, so they have machine learning experts around to help.

Most importantly though, as authentication becomes machine learning it requires data. Big sites have more data not just because they have more users, but because users interact with the largest sites much more frequently and reveal far more data about themselves. This data is invaluable for building and tuning models of user behavior to detect account compromise. Centralisation amongst Internet companies is being predicted for many reasons, but an underrated driver may be the amount of data required to authenticate users on the web.

NOTE: This blog post represents my own opinions based on my research and consulting work and not those of my employer, Google. In particular, I have not been working closely with the authentication team at Google.

Thanks to Arvind Narayanan, Cormac Herley, Ross Anderson, Umesh Shankar, and many audience members during my Princeton talk for discussions which led to this post.

7 thoughts on “Authentication is machine learning”

  1. Thought-provoking post, especially since I haven’t really thought of passwords this way.

    One thing I wonder: my understanding of machine learning algorithms is that they tend not to be robust against targeted malicious attacks against them, especially when the attacker knows what the classifier is. For example, it would probably not be a good idea to open source a fraud detection system! If this is true (and I may be misinformed here), then the situation gets even more dire for small providers, since there is now a strong incentive against sharing open models for authentication, algorithms or data-wise. I wonder what can be done here, in that case.

  2. It sounds as if this type of solution could also be used to authenticate people without their knowledge, destroying anonymity.

    By the way, were you in the UK for long enough to come across the idea of “isomorphic controls” in Doctor Who?

  3. I’m excited to see somebody else think about security in terms of classification. I made the point in my 2009 NSPW paper What Is the Shape of Your Security Policy? Security as a Classification Problem that this might be a fruitful direction to explore. What I had in mind there was security evaluation rather than the design of security mechanisms; nevertheless, treating authentication as a machine learning problem seems like an interesting approach to try. I wonder, however, how one might obtain either a good set of training examples or a hard-to-manipulate source of feedback on classifier decisions. Do you have any ideas in that respect?

  4. I agree with Sven: I can see the classification problem and its inputs – Facebook has asked me to tag photos when connecting from an unusual location. What I cannot see right now is the source for a training set to perform supervised learning.
