One of the defining features of Web 2.0 is user-uploaded content, specifically photos. I believe that photo-sharing has quietly been the killer application which has driven the mass adoption of social networks. Facebook alone hosts over 40 billion photos, over 200 per user, and receives over 25 million new photos each day. Hosting such a huge number of photos is an interesting engineering challenge. The dominant paradigm which has emerged is to host the main website from one server which handles user log-in and navigation, and host the images on separate special-purpose photo servers, usually on an external content-delivery network. The advantage is that the photo server is freed from maintaining any state. It simply serves its photos to any requester who knows the photo’s URL.
This setup combines the two classic forms of enforcing file permissions: access control lists and capabilities. The main website checks each request for a photo against an ACL, then grants a capability to view the photo in the form of an obfuscated URL which can be sent to the photo server. We wrote earlier about how it was possible to forge Facebook’s capability-URLs and gain unauthorised access to photos. Fortunately, this has been fixed and it appears that most sites use capability-URLs with enough randomness to be unforgeable. There’s another traditional problem with capability systems though: revocation. My colleagues Jonathan Anderson, Andrew Lewis, Frank Stajano and I ran a small experiment on 16 social-networking, blogging, and photo-sharing web sites and found that most failed to remove image files from their photo servers after they were deleted from the main web site. It’s often feared that once data is uploaded into “the cloud,” it’s impossible to tell how many backup copies may exist and where, and this provides clear proof that content delivery networks are a major problem for data remanence.
For our experiment, we uploaded a test image onto 16 chosen sites with default permissions, then noted the URL of the uploaded image. Every site served the test image given knowledge of its URL except for Windows Live Spaces, whose photo servers required session cookies (congratulations to Microsoft for, refreshingly, beating the competition in security). We ran our initial study for 30 days, and the results are posted below. A dismal 5 of the 16 sites failed to revoke photos after 30 days:
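As a rough illustration of the ACL-then-capability pattern described above, here is a minimal Python sketch. Everything in it is hypothetical: the `ACL` and `CAPABILITIES` tables, the `photos.example.com` host, and the function names are stand-ins rather than any real site's design. The only point is that the main site checks permissions once, then hands out a URL that is infeasible to guess.

```python
import secrets

# Hypothetical in-memory stand-ins for the main site's state.
ACL = {"photo-1": {"alice", "bob"}}   # photo id -> users allowed to view it
CAPABILITIES = {}                     # unguessable token -> photo id


def grant_capability(user, photo_id):
    """Main site: check the ACL, then mint an unguessable capability URL."""
    if user not in ACL.get(photo_id, set()):
        raise PermissionError(f"{user} may not view {photo_id}")
    token = secrets.token_urlsafe(32)  # 32 random bytes from the OS CSPRNG
    CAPABILITIES[token] = photo_id
    return f"https://photos.example.com/{token}.jpg"


def serve(url):
    """Photo server: knows nothing about users; only the token matters."""
    name = url.rsplit("/", 1)[-1]
    token = name[:-4] if name.endswith(".jpg") else name
    return CAPABILITIES.get(token)  # None means "no such photo"
```

Note what is missing from this sketch: nothing ever removes a token from `CAPABILITIES`, which is exactly the revocation problem discussed next.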
|Site|Type|CDN|Time to revocation|
|---|---|---|---|
|Fotki|Photo Sharing|Fotki|< 1 hour|
|Friendster|Social Networking|Panther Express|6 days|
|Picasa|Photo Sharing| |5 hours|
|Tagged|Social Networking|Limelight|14 days|
|Windows Live Spaces|Social Networking|Microsoft|N/A (cookies)|
Just for fun, we’ve also re-started the experiment to allow live viewing.
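For readers who want to repeat this kind of measurement, the check can be sketched in a few lines of Python. This is an assumed reconstruction, not the script we ran: the function names, the hourly polling interval, and the use of an HTTP 200 status as the test for "still served" are all illustrative choices. A fuller check would also compare the returned bytes against the test image, since a site might return 200 with a placeholder.

```python
import time
import urllib.error
import urllib.request


def http_status(url):
    """Fetch the URL with no cookies and return the HTTP status code."""
    try:
        with urllib.request.urlopen(url, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def hours_until_revoked(url, max_hours=30 * 24, step_hours=1,
                        fetch=http_status, sleep=time.sleep):
    """Poll a deleted photo's URL after deletion.

    Returns the elapsed hours at which the URL first stops being served,
    or None if it is still served when max_hours is reached.
    """
    elapsed = 0
    while elapsed <= max_hours:
        if fetch(url) != 200:
            return elapsed
        sleep(step_hours * 3600)
        elapsed += step_hours
    return None
```

The `fetch` and `sleep` parameters exist only so the loop can be exercised without a network connection or real waiting.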
Most likely, the sites with revocation longer than a few hours aren’t actively revoking at all, but relying on the photos eventually falling out of the photo-server’s cache. This memory-management strategy makes sense technically, as photos are deleted from these types of sites too infrequently to justify the overhead and complexity of removing them from the content delivery network. This paradigm is usually reflected in sites’ Terms of Service, which often give leeway to retain copies for a ‘reasonable period of time.’ Facebook is actually quite explicit about this, stating that ‘when you delete IP content, it is deleted in a manner similar to emptying the recycle bin on a computer.’
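The suspected behaviour can be modelled with a toy cache. The sketch below is an assumption about how such a photo server might work, not a description of any particular CDN: `EdgeCache`, its fixed TTL, and the dictionary used as the origin are all hypothetical. It shows why a photo deleted at the origin stays reachable until its cache entry expires.

```python
class EdgeCache:
    """A toy CDN node that caches origin content for a fixed TTL."""

    def __init__(self, origin, ttl_seconds):
        self.origin = origin      # dict: path -> bytes (the main site's store)
        self.ttl = ttl_seconds
        self.entries = {}         # path -> (bytes, expiry time)

    def get(self, path, now):
        cached = self.entries.get(path)
        if cached is not None and now < cached[1]:
            return cached[0]      # served from cache; origin is not consulted
        self.entries.pop(path, None)
        data = self.origin.get(path)
        if data is not None:
            self.entries[path] = (data, now + self.ttl)
        return data


origin = {"/p.jpg": b"img"}
cache = EdgeCache(origin, ttl_seconds=100)
cache.get("/p.jpg", now=0)        # first request fills the cache
del origin["/p.jpg"]              # user deletes the photo on the main site
```

In this model, a request at `now=50` still returns the image even though the origin no longer has it, and only a request at or after `now=100` finds it gone.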
This architecture is not only fundamentally wrong from a privacy standpoint, but likely illegal under the EU Data Protection Directive of 1995 and its UK implementation, the Data Protection Act of 1998, which both clearly ban keeping personally-identifiable data for longer than necessary given the data’s purpose. In the social web case, the purpose of keeping a photo is to share it. Since this is no longer possible after the photo is marked ‘deleted’, all copies of the photo must be removed. There’s also an interesting violation of the provision that a user should have access to all data stored about her: after marking a photo ‘deleted’ the user no longer has access to it, as there is no way to see which user content is still cached.
Architecture matters, and though it may be more complicated, sensitive personal data must be stored and cached using reference counts to ensure it can be fully deleted, and not simply left to be garbage collected down the road. Unfortunately, as is common with social networking sites, privacy is viewed as a legal add-on and not a design constraint. In the terminology of Larry Lessig, privacy is still considered a matter of law and not of code. As a result a user can have no assurance about where their photos may be floating around in the cloud.
EDIT 22/05/2009: We originally reported that Xanga and LiveJournal left photos unrevoked. After corresponding with developers from both sites, this was revealed to be a UI problem and not a CDN problem in both cases. When a photo is included in a blog post which is deleted, the photo itself is not considered deleted but becomes one of the user’s photos. Unfortunately, in each site the normal photo interface did not reveal this: Xanga showing this in its ‘Photos’ interface, and LiveJournal showing this. In both cases, deleting photos which were included in blog posts requires a separate interface. In LiveJournal’s case the separate interface itself incorrectly stated I had “no galleries.” Due to this UI confusion, I thought the photos were deleted when they weren’t, and thus they weren’t revoked. Apologies for the confusion. I re-tested both and printed updated results, though this has led both sites to re-consider their UIs, which were admittedly confusing and outright buggy in LiveJournal’s case.
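One way to read the reference-counting suggestion is that the origin keeps track of every cached copy it hands out, so that deletion can purge them all and confirm that none remain. The Python sketch below is a hypothetical illustration of that idea, with invented `PhotoStore` and `CacheNode` classes. It ignores real-world complications such as third-party CDNs that may not offer a purge call.

```python
class PhotoStore:
    """Origin store that tracks every cached copy of each photo."""

    def __init__(self):
        self.photos = {}    # photo id -> bytes
        self.holders = {}   # photo id -> set of cache nodes holding a copy

    def upload(self, photo_id, data):
        self.photos[photo_id] = data
        self.holders[photo_id] = set()

    def fetch_for_cache(self, photo_id, cache):
        data = self.photos.get(photo_id)
        if data is not None:
            self.holders[photo_id].add(cache)  # record the new copy
        return data

    def copies(self, photo_id):
        """Number of cache nodes known to hold this photo."""
        return len(self.holders.get(photo_id, ()))

    def delete(self, photo_id):
        """Purge every known copy, then drop the original."""
        for cache in self.holders.pop(photo_id, set()):
            cache.purge(photo_id)
        self.photos.pop(photo_id, None)


class CacheNode:
    """Cache that registers with the store whenever it takes a copy."""

    def __init__(self, store):
        self.store = store
        self.local = {}

    def get(self, photo_id):
        if photo_id not in self.local:
            data = self.store.fetch_for_cache(photo_id, self)
            if data is None:
                return None
            self.local[photo_id] = data
        return self.local[photo_id]

    def purge(self, photo_id):
        self.local.pop(photo_id, None)
```

The design choice here is that a cache can only obtain a photo through `fetch_for_cache`, so the count of outstanding copies is always known and `delete` can drive it to zero instead of waiting for entries to expire.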