Another Gawker bug: handling non-ASCII characters in passwords

A few weeks ago I detailed how Gawker lost a million of their users’ passwords. Soon afterwards I found an interesting vulnerability in Gawker’s password implementation involving the handling of non-ASCII characters. Specifically, until two weeks ago they didn’t handle them at all: every non-ASCII character was mapped to the ASCII ‘?’ before hashing. This not only greatly limited the theoretical space of passwords, but meant that any password consisting of n non-ASCII characters was equivalent to ‘?’^n, a string of n question marks. Native Telugu or Korean speakers with passwords like ‘రహస్య సంకేత పదం’ or ‘비밀번호’ were vulnerable to an attacker simply guessing a string of question marks. An attacker may in fact know in advance that some users are from countries that use non-Latin scripts (for example by looking at their email addresses), potentially making this more easily exploitable.
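The collapse described above can be reproduced in a few lines. This is only a sketch of the effect, written in Python rather than the Java library Gawker used; the helper name `lossy_ascii` is made up for illustration, and Python's `errors="replace"` encoding mode is assumed to be a fair stand-in for the library's behaviour.

```python
def lossy_ascii(password: str) -> bytes:
    """Encode to ASCII, substituting b'?' for every character that has no
    ASCII representation (a stand-in for the buggy pre-hash conversion)."""
    return password.encode("ascii", errors="replace")


if __name__ == "__main__":
    for pw in ("비밀번호", "Пароль", "p\u00e4ssword"):
        # Every non-ASCII character becomes '?'; ASCII characters survive.
        print(pw, "->", lossy_ascii(pw))

    # Any two all-non-ASCII passwords of equal length become identical
    # hash inputs, and so does a plain run of question marks.
    guess = "?" * len("비밀번호")
    print(lossy_ascii("비밀번호") == lossy_ascii(guess))
```

Whatever hash is applied afterwards, however strong, only ever sees the question marks.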

I came across this issue because I was curious how crypt() would handle non-ASCII characters. Because DES uses 56-bit keys, crypt() must take its 8 characters of input (all passwords longer than this are truncated) and coerce them into 7 bits each. For traditional ASCII characters, which range only from 0-127, this is achieved by simply dropping the high-order bit from each byte in the string. The same thing is done for non-ASCII characters, although because they may be represented by more than one byte in encodings like UTF-8, crypt() will use even fewer than 8 characters. The effect will depend on the encoded size of each individual character, but only the first four characters of ‘Пароль’ will be used by crypt(), and only the first three characters of ‘パスワード’ (the third only in part).
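To make the arithmetic concrete, here is a rough model of the input handling only. It does not run DES or reproduce crypt() output; both function names are hypothetical, and it assumes the password reaches crypt() encoded as UTF-8.

```python
def des_crypt_key_material(password: str) -> bytes:
    """Model the key input of traditional DES crypt(): the first 8 bytes
    of the UTF-8 encoding, each reduced to its low 7 bits."""
    raw = password.encode("utf-8")[:8]      # anything past 8 bytes is ignored
    return bytes(b & 0x7F for b in raw)     # high-order bit of each byte dropped


def chars_reaching_key(password: str) -> int:
    """Count the characters that contribute at least one byte to the key."""
    used_bytes = 0
    count = 0
    for ch in password:
        if used_bytes >= 8:
            break
        count += 1
        used_bytes += len(ch.encode("utf-8"))
    return count


if __name__ == "__main__":
    for pw in ("password123", "Пароль", "パスワード"):
        print(pw, chars_reaching_key(pw), des_crypt_key_material(pw))
```

Under these assumptions the Cyrillic characters take two bytes each and the katakana three, which is where the counts of four and three come from.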

I was quite surprised to find that Gawker managed to drop all non-ASCII characters. In fact, the problem had nothing to do with crypt(). In an interesting twist that nobody seemed to realise after the Gawker hack, they’d actually already switched to the much-stronger, Blowfish-based bcrypt() months ago. All passwords updated since the switch have been hashed with bcrypt(), but Gnosis ignored this column of the SQL database (which also explains the large discrepancy between the database size and the number of crackable passwords). Unfortunately, bcrypt() is not quite as widely supported, and Gawker was using a relatively little-known Java library with the known bug of converting all non-ASCII characters to ‘?’ prior to hashing.

Gawker was very professional in its response. I received thanks and a clear timetable for a fix within 24 hours, and the fix itself took less than 72 hours. They also ran an item in Gizmodo explaining the problem, which was both courteous and very honest. However, Gawker did not force potentially vulnerable users to change their passwords, just as they didn’t with the passwords leaked in December. It seems far-fetched that most users will find the Gizmodo article, understand it, and take action; they should have been forced to change.

This is a very minor problem for the userbase of Gawker’s blogs (which are only available in English). I determined through brute force that about 1 in 50,000 Gawker users has a password which is entirely non-Latin (about the same number as use ‘aolsucks’ as their password), which amounts to only a few dozen registered users in total. It remains to be seen what other vulnerabilities exist in the wild involving password hashing and character encoding (particularly issues surrounding form encoding), a topic which has often been poorly understood by programmers and has been a major source of bugs. Traditionally these issues probably didn’t matter much: when non-ASCII characters were poorly supported by most software, there were few users who couldn’t produce an ASCII password. Increasingly, however, operating systems and browsers are internationalised well enough that more users may use non-ASCII characters in their passwords, particularly at sites which have larger numbers of users outside the West, meaning UTF-8 should be cleanly handled in any password implementation.
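As a rough illustration of what “cleanly handled” might look like, the sketch below encodes the password as UTF-8 explicitly, with no lossy substitution, before hashing. PBKDF2 from Python’s standard library is used purely as a stand-in for bcrypt so that the example needs no third-party package; the function names, salt size and iteration count are illustrative assumptions, not recommendations.

```python
import hashlib
import hmac
import os
from typing import Optional, Tuple


def hash_password(password: str, salt: Optional[bytes] = None,
                  iterations: int = 200_000) -> Tuple[bytes, bytes]:
    """Return (salt, digest) for a password of any script."""
    if salt is None:
        salt = os.urandom(16)
    # Strict UTF-8 encoding: characters are never replaced with '?'.
    pw_bytes = password.encode("utf-8")
    digest = hashlib.pbkdf2_hmac("sha256", pw_bytes, salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = 200_000) -> bool:
    """Recompute the digest and compare it in constant time."""
    _, digest = hash_password(password, salt, iterations)
    return hmac.compare_digest(digest, expected)


if __name__ == "__main__":
    salt, stored = hash_password("비밀번호")
    print(verify_password("비밀번호", salt, stored))   # the real password
    print(verify_password("????", salt, stored))       # a run of question marks
```

The point is only that every byte of the encoded password reaches the hash, so a guess of question marks no longer matches a non-ASCII password.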

Thanks to Rubin Xu, Dongting Yu, Andrew Lewis, and Richard Clayton for help researching this vulnerability.

11 thoughts on “Another Gawker bug: handling non-ASCII characters in passwords”

  1. Hi Joseph –

    Once again, thank you for bringing this issue to our attention. As you note, the issue was posted to Gizmodo, but it also went up on our other sites at roughly the same time.

    What we did do is force those users that successfully logged in using these chars to reset their password (a prompt appears forcing them to do so). The moment the fix went live, we marked users that updated using the fixed bcrypt so that we would not have to force them to do it again. Many users that reset their passwords initially were already covered, as many were not reset until after the weekend the fix went live.

    We will be running another check across our user base for people that have not updated since the fix yet still have the ‘????????’ issue. Those users for whom we have email addresses will receive an email notifying them to change their password, and the reason to do so. Those accounts that lack an email address, and have not updated, will be disabled. We have already done this on quite a few accounts.

    Longer term (beginning early February), we will be migrating all of our users to our new commenting platform that will be described on the blog later this month. This will eliminate the need for email addresses or passwords on our platform. Once this change goes live, new commenters will not be able to register with a user/password — we will support only OAuth or anonymous accounts we are calling ‘burners’.

    Readers should know that if, upon reviewing our code/data, they discover any issues, we encourage them to send a report about it to security at gawker dot com.


    Tom Plunkett
    CTO Gawker Media

  2. Tom,

    Glad to hear you are going to reach out more directly to the users affected. Even more glad to hear that you’re planning to phase out password collection completely! This was something we called for in our survey papers on the web password space. To my knowledge Gawker will be one of the biggest platforms to take that step, which I hope many other websites will take in the future.


  3. Passwords have concerned me for a long time. As I mentioned elsewhere, password authentication for the public internet (sites such as ebay, yahoo mail et al) isn’t really fit for purpose anymore, as passwords are easily cracked even if authentication is done securely. A lot of sites still ask for an email address as a username, as these are assumed to be unique, but this makes it easy for spammers to harvest addresses. Once the address is intercepted, guessing passwords (a lot of people use 12345 or password for simplicity, as you know) is then the next step. User education is key to this, though many users (newer inexperienced users probably even more so) IGNORE advice given, such as using different passwords for different sites; as it becomes cumbersome to do this, folk tend to stick with one password. Password keeper programs also suffer from the same thing; one password gains access to all the rest, thus rendering them useless.

    I also believe that the rush to commercialise the internet led to certain pertinent questions not being asked before it was.

  4. The crypt() function only has 8char*7bit limitations if it still uses the old DES-derived hash.

    Quite a lot of the world these days use the MD5 based crypt() I hacked together 16 years ago for FreeBSD 2.0, and that will take any length password and chew on all the bits.

    If anybody uses the old DES based crypt these days, they are not serious about security at all.

    I would even argue that the MD5 based crypt() is far too fast to calculate these days: It should take at least 100msec to do so at any point in time.


  5. Poul-Henning,

    You raise an interesting point, which is why Gawker or anybody else is using crypt() for a website. My thought was that it was a PHP default, but I checked into it: PHP switched from DES to MD5 by default in version 4.1 in Dec 2001 (and caused some frustration). I’m planning to study how many crypt implementations still exist in the wild; why they still exist is anybody’s guess.


  6. Actually, that is the wrong attitude, we should have more implementations, not fewer.

    Mind you, they need to be good ones, and therefore the DES one needs to die.

    Password hashing is a kleenex application, you can even change your algorithm for new passwords on the fly, as long as you check the old one for old hashes.

    You could even use a per-system salt if you wanted, to make sure that hashed passwords were not shared undesirably.

    If, like most organizations, you have a mandated max lifetime for passwords, the Nyquist theorem applies to how fast you can phase out old algorithms.

    I added the $1$ prefix to the hash, specifically so that you can tell which algorithm is used to hash it, so that multiple different hash algorithms can coexist.

    Unfortunately, rather than a steady progression of new and stronger algorithms to keep pace with cpu speeds, it seems that my md5 based version has become the new gold standard.

    The OpenBSD people did a $2$ based on SHA, but I have not followed the field in detail to know if any others have appeared.

    But yes, after 16 years, it’s time somebody revisited this area again, please do so, and feel free to contact me if you think it will help.

    Somebody at CL may still have a copy of the talk I gave there a couple of years ago about the trouble with cryptographers unionization. (Ask Robert.)


    1. Poul-Henning,

      To answer your question, there actually are a lot of crypt implementations now. PHP supports 6 as of v 5.3, and the more recent SHA ones include a parameter in the hash for the number of iterations as well, which should make them very easy to tune annually as CPU speeds increase. My main point was that, while having multiple options is helpful to tune for performance and possible hash function cryptanalysis (although none of MD5’s problems make much difference for password hashing), migrating off of the old crypt() should be the first goal due to the 8 character limit and the insufficient salt size. Having an array of options may not be helpful for web developers who’ll just choose the fastest or first thing (DES-based crypt()), not realising it is much, much worse than all other options.


  7. Laziness is definitely a major culprit in big systems still using DES crypt(3)

    In the 1990s I was a student at a somewhat well-known university. They used DES crypt(3) across all their systems, with the hashes available via NIS. I felt this was a weakness but could find no-one authorised to do anything about it

    Years later I was briefly employed as a member of the systems team of the world-class CS department at that university. They still mandated DES crypt(3) going so far as to reconfigure machines which tried to use something else – although no-one could point at any machines which couldn’t handle PHK’s improved algorithm cited above. They still used NIS.

    I suggested changing the default for new passwords at least. I got static. I changed my own password hash manually to a PHK-style one, to show that nothing broke. Still static. My career moved on, and I rarely thought about it.

    Most recently their main shared compute system was broken into. That prompted me to ask about it again. Sure enough, they still hadn’t upgraded. They still allow password-based direct SSH access into machines, based on passwords secured only by DES crypt(3) hashes available to anyone who asks NIS for them.

  8. When I was a student at the computer lab there was a Tripos exam question that appeared something like every other year which asked “explain why even experienced programmers have problems with character codes”.

    In earlier days students were supposed to write an essay about how to spot escape sequences if you didn’t start reading a five bit Flexowriter tape at the beginning. In my day you were supposed to write about EBCDIC and/or what to use bit eight on eight-track paper tape for.

    Wot I want to know is … does this question still routinely appear, in the same words, on Tripos papers?

  9. you mention that “Native Georgian or Korean speakers with passwords like ‘రహస్య సంకేత పదం’ or ‘비밀번호’ were vulnerable ….”
    as someone who is a native Telugu speaker (telugu is an indian language spoken by over 70 million people in india), i can assure you that no ‘Native Georgian’ speaker would be using the password ‘రహస్య సంకేత పదం’ which is NOT a Georgian phrase but a Telugu phrase meaning ‘secret password’. just thought you might like to know. 🙂

  10. Dennis,

    Thanks very much and sorry for the error. I originally had three examples of non-ASCII passwords (all of which in fact meant ‘password’), and then deleted the word Telugu but the Georgian password by mistake, so it made no sense as you pointed out. It’s corrected now.
