Five years ago, I compiled a dataset of password histograms representing roughly 70 million Yahoo! users. It was the largest password dataset ever compiled for research purposes. The data was a key component of my PhD dissertation the following year and motivated new statistical methods, for which I received the 2013 NSA Cybersecurity Award.
I had always hoped to share the data publicly. It consists only of password histograms, not the passwords themselves, so it seemed reasonably safe to publish. But without a formal privacy model, Yahoo! didn’t agree. Given the history of deanonymization work, caution is certainly in order. Today, thanks to new differential privacy methods described in a paper I published at NDSS 2016 with colleagues Jeremiah Blocki and Anupam Datta, a sanitized version of the data is publicly available.