Over a year ago, we blogged about a bug at Gawker which replaced all non-ASCII characters in passwords with ‘?’ prior to checking. Along with Rubin Xu and others, I’ve spent the past year investigating issues surrounding passwords, languages, and character encoding. This should be easy: websites using UTF-8 can accept any password and hash it into a standard format regardless of the writing system being used. Instead, as we report in a new paper which I presented last week at the Web 2.0 Security and Privacy workshop in San Francisco, passwords still localise poorly, both because websites are buggy and because users have been trained to type ASCII-only passwords. This has broad implications for passwords’ role as a “universal” authentication mechanism.
After finding the Gawker bug we did an informal survey of about 20 popular websites looking for character encoding bugs in passwords. Roughly speaking, about a third of the websites we tried appear to handle long UTF-8 passwords seamlessly, about a third disallow non-ASCII characters in passwords as a matter of policy, and we found bugs in the remaining third. Many of the bugs had no security impact, and others merely circumvented password policies. For example, Walmart and IMDB both count the bytes submitted instead of the characters typed. When non-ASCII characters are replaced with numeric character references and then percent-encoded, a single character can expand to as many as 15 bytes. With Walmart’s password policy limiting passwords to just 11 bytes, this means that a password with just two characters (like 密码) can be rejected for being too long. Other bugs are more serious. Besides the Gawker bug, we discovered a lingering problem in many implementations of DES-crypt() which truncates passwords after any character with a 0x80 byte in its UTF-8 representation, including the character À (here’s an advisory for FreeBSD).
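To make the byte arithmetic concrete, here is a minimal Python sketch of the expansion described above. It assumes the browser substitutes decimal numeric character references and that the server then counts the percent-encoded bytes; the function names and the `MAX_BYTES` constant are ours for illustration, not anything taken from the affected sites. The second helper simply checks whether a password's UTF-8 encoding contains the 0x80 byte that triggers the DES-crypt() truncation.

```python
from urllib.parse import quote

MAX_BYTES = 11  # the Walmart limit quoted above; the name is hypothetical


def submitted_length(password: str) -> int:
    """Bytes a byte-counting length check would see (assumed encoding path)."""
    # Step 1: replace each non-ASCII character with a decimal numeric
    # character reference, e.g. "密" becomes "&#23494;".
    ncr = password.encode("ascii", errors="xmlcharrefreplace").decode("ascii")
    # Step 2: percent-encode the result, so "&", "#" and ";" each
    # become three bytes ("%26", "%23", "%3B").
    return len(quote(ncr, safe=""))


def is_rejected(password: str) -> bool:
    """True if the byte-counting check would call the password too long."""
    return submitted_length(password) > MAX_BYTES


def has_0x80_byte(password: str) -> bool:
    """True if the UTF-8 encoding contains a 0x80 byte (e.g. À is C3 80)."""
    return 0x80 in password.encode("utf-8")
```

Under these assumptions `submitted_length("密码")` is 28, so the two-character password fails the 11-byte check, while the eight-character ASCII string "password" passes. A character outside the Basic Multilingual Plane needs a six-digit reference and reaches the 15-byte maximum.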
Of more fundamental interest, we found evidence that user behavior is significantly affected by character encoding issues. In my study of password statistics at Yahoo!, I found that common password dictionaries work effectively against all language groups. Examining leaked data from websites used primarily by Chinese and Hebrew speakers, we found that this is in part because users almost exclusively choose ASCII passwords even when allowed to do otherwise. Most Chinese speakers rely on graphical Pinyin input methods, which are disabled for password fields to prevent shoulder-surfing; unsurprisingly, Chinese characters are virtually non-existent in passwords. Hebrew speakers usually have a dual-mapped keyboard, so Hebrew and Latin are equally easy to enter, but in a leaked data set where 90% of usernames contained Hebrew characters we found that only 2.5% of passwords did. We even observed Hebrew speakers switching their keyboard mapping to the Latin alphabet and then typing Hebrew words (producing gibberish in ASCII). Users of non-ASCII variants of the Latin alphabet appear less trained to convert to ASCII: looking at Spanish passwords within the leaked RockYou set, we found that roughly half retained the non-ASCII character ‘ñ’, though nearly all users dropped stress accents, which require escape keys to type (e.g. typing “pajaro” instead of “pájaro”).
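The keyboard-switching behavior is easy to picture with a small sketch. The Python below maps each Hebrew letter to the Latin character on the same physical key, assuming the standard Israeli keyboard layout on top of QWERTY; the table and function name are ours for illustration and are not taken from the paper or the leaked data.

```python
# Assumed layout: standard Israeli Hebrew keyboard over QWERTY.
# Each pair lists the Hebrew letters of one key row and the Latin
# characters found on the same physical keys.
_ROWS = [
    ("קראטוןםפ", "ertyuiop"),
    ("שדגכעיחלךף", "asdfghjkl;"),
    ("זסבהנמצתץ", "zxcvbnm,."),
]
HEBREW_TO_LATIN = {h: l for heb, lat in _ROWS for h, l in zip(heb, lat)}


def typed_on_latin_layout(word: str) -> str:
    """What appears if a Hebrew word is typed with the Latin mapping active."""
    # Characters that are not in the table (digits, ASCII) pass through.
    return "".join(HEBREW_TO_LATIN.get(ch, ch) for ch in word)
```

With this table, typing the word שלום with the Latin mapping active yields the ASCII string "akuo": meaningless to an English dictionary, but trivially derivable by an attacker who knows the layout.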
More interestingly, we found that Chinese speakers (and Hebrew speakers to a lesser extent) were far more likely to use digits in their passwords or rely on a geometric keyboard pattern. This leads to a measurable security difference: the most common passwords in our leaked Chinese data sets were far more common than the most common passwords in leaked English-language data sets (our Hebrew data set was too small to compute these statistics reliably). The irony is that linguistic diversity should help password security by making guessing more difficult. Instead, for the roughly half of the planet whose native writing system isn’t the Latin alphabet, passwords appear both less secure and more difficult to use, as these users must remember something in ASCII to ensure compatibility. It’s an interesting challenge to come up with a better solution for these users.