One of the defining features of Web 2.0 is user-uploaded content, specifically photos. I believe that photo-sharing has quietly been the killer application which has driven the mass adoption of social networks. Facebook alone hosts over 40 billion photos, over 200 per user, and receives over 25 million new photos each day. Hosting such a huge number of photos is an interesting engineering challenge. The dominant paradigm which has emerged is to host the main website from one server which handles user log-in and navigation, and host the images on separate special-purpose photo servers, usually on an external content-delivery network. The advantage is that the photo server is freed from maintaining any state. It simply serves its photos to any requester who knows the photo’s URL.
This setup combines the two classic forms of enforcing file permissions: access control lists and capabilities. The main website checks each request for a photo against an ACL, then grants a capability to view the photo in the form of an obfuscated URL which can be sent to the photo server. We wrote earlier about how it was possible to forge Facebook’s capability-URLs and gain unauthorised access to photos. Fortunately, this has been fixed and it appears that most sites use capability-URLs with enough randomness to be unforgeable. There’s another traditional problem with capability systems though: revocation. My colleagues Jonathan Anderson, Andrew Lewis, Frank Stajano and I ran a small experiment on 16 social-networking, blogging, and photo-sharing web sites and found that most failed to remove image files from their photo servers after they were deleted from the main web site. It’s often feared that once data is uploaded into “the cloud,” it’s impossible to tell how many backup copies may exist and where, and this provides clear proof that content delivery networks are a major problem for data remanence.
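To make the ACL-plus-capability split concrete, here is a minimal sketch of it. It is not any site’s real implementation: the class names, the host name `photos.example.com`, and the in-memory dictionaries are all assumptions made for illustration. The main site checks an ACL and hands out an unguessable path; the photo server serves any path it is given. Deleting the photo on the main site removes only the ACL entry, so the capability keeps working.

```python
import secrets

PHOTO_HOST = "https://photos.example.com"  # hypothetical host, for illustration only


class PhotoServer:
    """Keeps no user state: serves bytes to anyone who knows the path."""

    def __init__(self):
        self._files = {}  # path -> image bytes

    def put(self, path, data):
        self._files[path] = data

    def get(self, path):
        return self._files.get(path)  # None stands in for a 404


class MainSite:
    """Handles log-in and permissions, and issues capability-URLs."""

    def __init__(self, photo_server):
        self._photo_server = photo_server
        self._acl = {}    # photo_id -> set of users allowed to view
        self._paths = {}  # photo_id -> unguessable path on the photo server

    def upload(self, photo_id, data, viewers):
        # 32 random bytes make the path infeasible to guess or forge.
        path = "/" + secrets.token_urlsafe(32) + ".jpg"
        self._photo_server.put(path, data)
        self._acl[photo_id] = set(viewers)
        self._paths[photo_id] = path

    def url_for(self, user, photo_id):
        if user not in self._acl.get(photo_id, set()):
            raise PermissionError("not allowed to view this photo")
        return PHOTO_HOST + self._paths[photo_id]

    def delete(self, photo_id):
        # Only the main site's records go away; the photo server is never told.
        self._acl.pop(photo_id, None)
        self._paths.pop(photo_id, None)


server = PhotoServer()
site = MainSite(server)
site.upload("holiday", b"jpeg bytes", viewers={"alice"})
url = site.url_for("alice", "holiday")
path = url[len(PHOTO_HOST):]
site.delete("holiday")
still_served = server.get(path) is not None  # the capability was never revoked
```

In this sketch the fix would be for `delete` to also remove the file from the photo server, which is exactly the step the experiment tests for.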
For our experiment, we uploaded a test image onto 16 chosen sites with default permissions, then noted the URL of the uploaded image. Every site served the test image given knowledge of its URL except for Windows Live Spaces, whose photo servers required session cookies (a refreshing congratulations to Microsoft for beating the competition in security). We ran our initial study for 30 days, and posted the results below. A dismal 5 of the 16 sites failed to revoke photos after 30 days:
|Site|Type|CDN|Time to revocation|
|---|---|---|---|
|Fotki|Photo Sharing|Fotki|< 1 hour|
|Friendster|Social Networking|Panther Express|6 days|
|Picasa|Photo Sharing| |5 hours|
|Tagged|Social Networking|Limelight|14 days|
|Windows Live Spaces|Social Networking|Microsoft|N/A (cookies)|
Just for fun, we’ve also re-started the experiment to allow live viewing.
Most likely, the sites with revocation longer than a few hours aren’t actively revoking at all, but relying on the photos eventually falling out of the photo-server’s cache. This memory-management strategy makes sense technically, as photos are deleted from these types of sites too infrequently to justify the overhead and complexity of removing them from the content delivery network. This paradigm is usually reflected in sites’ Terms of Service, which often give leeway to retain copies for a ‘reasonable period of time.’ Facebook is actually quite explicit about this, stating that ‘when you delete IP content, it is deleted in a manner similar to emptying the recycle bin on a computer.’
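The difference between waiting for a photo to fall out of the cache and actively revoking it can be sketched as follows. This is a toy model under stated assumptions, not the behaviour of any particular CDN: the `EdgeCache` class, its idle-timeout eviction policy, and the `purge` method are all hypothetical, and time is supplied by an injectable clock so the example runs instantly.

```python
class EdgeCache:
    """Toy CDN node: keeps a copy until it has gone unrequested for idle_ttl."""

    def __init__(self, origin, idle_ttl, clock):
        self._origin = origin      # dict: path -> bytes, standing in for the main site
        self._idle_ttl = idle_ttl  # assumed eviction policy: idle timeout
        self._clock = clock        # callable returning the current time
        self._entries = {}         # path -> (data, time of last access)

    def get(self, path):
        now = self._clock()
        entry = self._entries.get(path)
        if entry is not None and now - entry[1] < self._idle_ttl:
            # Cache hit: serve the copy and restart its idle timer.
            self._entries[path] = (entry[0], now)
            return entry[0]
        # Missing or expired: drop any stale copy and go back to the origin.
        self._entries.pop(path, None)
        data = self._origin.get(path)
        if data is not None:
            self._entries[path] = (data, now)
        return data

    def purge(self, path):
        """Active revocation: remove the cached copy immediately."""
        self._entries.pop(path, None)


now = [0]  # time in days, advanced by hand
origin = {"/a.jpg": b"img"}
cache = EdgeCache(origin, idle_ttl=5, clock=lambda: now[0])

cache.get("/a.jpg")          # day 0: photo is pulled into the cache
del origin["/a.jpg"]         # user deletes the photo on the main site
now[0] = 4
served_day_4 = cache.get("/a.jpg")   # still cached, and the timer restarts
now[0] = 8
served_day_8 = cache.get("/a.jpg")   # only 4 idle days since the last hit
cache.purge("/a.jpg")
served_after_purge = cache.get("/a.jpg")  # origin no longer has it either
```

Under this assumed policy, every request for a deleted photo extends its life in the cache, whereas a single `purge` call at deletion time removes it at once.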
This architecture is not only fundamentally wrong from a privacy standpoint, but likely illegal under the EU Data Protection Directive of 1995 and its UK implementation, the Data Protection Act of 1998, which both clearly ban keeping personally-identifiable data for longer than necessary given the data’s purpose. In the social web case, the purpose of keeping a photo is to share it. Since this is no longer possible after the photo is marked ‘deleted’, all copies of the photo must be removed. There’s also an interesting violation of the provision that a user should have access to all data stored about her: after marking a photo ‘deleted’, the user no longer has access to it, as there is no way to see which user content is still cached.
Architecture matters, and though it may be more complicated, sensitive personal data must be stored and cached using reference counts to ensure it can be fully deleted, and not simply left to be garbage collected down the road. Unfortunately, as is common with social networking sites, privacy is viewed as a legal add-on and not a design constraint. In the terminology of Larry Lessig, privacy is still considered a matter of law and not of code. As a result, users can have no assurance about where their photos may be floating around in the cloud.
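One way to read “stored and cached using reference counts” is that the site records every location holding a copy of a photo, and treats a deletion as complete only when that count reaches zero. The sketch below illustrates the idea under that assumption; the `CopyRegistry` class and the location names are hypothetical, and real caches would need a network call where this code simply removes a dictionary entry.

```python
class CopyRegistry:
    """Tracks which locations hold a copy of each photo."""

    def __init__(self, stores):
        self._stores = stores  # location name -> dict of photo_id -> bytes
        self._holders = {}     # photo_id -> set of location names holding a copy

    def put(self, location, photo_id, data):
        # Every copy, whether primary or cached, is registered when it is made.
        self._stores[location][photo_id] = data
        self._holders.setdefault(photo_id, set()).add(location)

    def ref_count(self, photo_id):
        return len(self._holders.get(photo_id, ()))

    def delete(self, photo_id):
        """Remove every known copy; return True only if none remain."""
        for location in list(self._holders.get(photo_id, ())):
            self._stores[location].pop(photo_id, None)
            self._holders[photo_id].discard(location)
        remaining = self.ref_count(photo_id)
        if remaining == 0:
            self._holders.pop(photo_id, None)
        return remaining == 0


# Hypothetical locations: one origin store and two cache nodes.
stores = {"origin": {}, "cdn-eu": {}, "cdn-us": {}}
registry = CopyRegistry(stores)
for location in stores:
    registry.put(location, "holiday", b"jpeg bytes")

copies_before = registry.ref_count("holiday")
fully_deleted = registry.delete("holiday")
copies_after = registry.ref_count("holiday")
```

The extra bookkeeping is the complication the paragraph above concedes; what it buys is that the site can tell the user, at any moment, how many copies still exist.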
EDIT 22/05/2009: We originally reported that Xanga and LiveJournal left photos unrevoked. After corresponding with developers from both sites, this was revealed to be a UI problem and not a CDN problem in both cases. When a photo is included in a blog post which is deleted, the photo itself is not considered deleted but becomes one of the user’s photos. Unfortunately, in each site the normal photo interface did not reveal this: Xanga showing this in its ‘Photos’ interface, and LiveJournal showing this. In both cases, deleting photos which were included in blog posts requires a separate interface. In LiveJournal’s case the separate interface itself incorrectly stated I had “no galleries.” Due to this UI confusion, I thought the photos were deleted when they weren’t, and thus they weren’t revoked. Apologies for the confusion. I re-tested both and printed updated results, though this has led both sites to reconsider their UIs, which were admittedly confusing and outright buggy in LiveJournal’s case.
26 thoughts on “Attack of the Zombie Photos”
relying on the photos eventually falling out of the photo-server’s cache
I am not a techie, so I wonder how long that would take… Very interesting post!
Actually, on LiveJournal, the photo servers are the main servers (there is no CDN for ScrapBook pictures, only for userpics), and they’re checked for permissions each time a client requests the URL, so the non-expiring URLs shouldn’t cause a breach of privacy. That said, it’s possible the LiveJournal engineering and ops people overlooked something. If you can recreate a breach of privacy on LiveJournal, could you email details to firstname.lastname@example.org and cc: my email address as indicated above? (Disclaimer: I’m not a LiveJournal employee – I just volunteer there)
If you delete a photo on Xanga, then Xanga *does* completely delete that file – typically within a few hours (the main file gets deleted from our databases sooner, and then it takes a few hours to clear from all our server caches).
We looked into the photo on your “live viewing” experiment. It appears that you have three separate copies on your account and that none of them is actually deleted.
Could you please email me to let me know exactly what steps you took in your Xanga portion of the experiment? I’d like to isolate whatever issues may have led to confusion and see what we can do to address those issues.
This story now on BBC News
I still sometimes have this problem with Facebook and one of my photos. After I found out that my photos could be used by Facebook for things like their marketing – without any payment or credit to me – I removed them. My profile photo was an art study on composition and negative space using a plum. After my removal request, the photo continues to “pop up” in other applications that use Facebook and Facebook main page. As soon as I would send the information to Facebook advising them of this fact, it would mysteriously disappear only to reappear again later. The profile photo hasn’t appeared on Facebook now for several months, BUT…just a week or so ago I saw a degraded version of it (heavy pixels partially blocking its appearance) show up on yet another associated application.
I think this experiment is great in theory, but flawed in practice and conclusions.
You are not testing to see if an image is deleted from the Social Network, but from their CDN. That is a HUGE difference. These social networks may very well delete them from their own servers immediately, but their servers are not exposed to the general internet because an (often third-party) cache is employed to proxy images from their servers to the greater internet. Some of these caches do not have delete functionality through an API; the content – whatever it is – just times out after x hours of not being accessed. It also is often ‘populated’ into the cache by just mapping the cache address onto the main site. Example: http://cdn.img.network.com/a may be showing content for several hours that was deleted from http://img.network.com/a
Perhaps you know this already – but in that case you are presenting these findings in a way that serves your point more than the truth of the architecture.
In terms of your inference of the EU and UK acts, I wouldn’t reach the same conclusions that you have. Firstly, one would have to decide that an unmarked photo, living at an odd cache address with no links in from a network identifying it or its content, would be deemed “personally-identifiable data” — I would tend to disagree. Secondly, while the purpose of it may be to “share it”, it would really be “share it online” – and dealing with cache servers and the inherent architecture of the internet , I think the amount of time for changes to propagate after a request for deletion would easily satisfy that requirement. I also wonder if the provision to access ‘user data’ means that it is done in real time or in general. I’m pretty sure all these sites store metrics about me that i can’t see.
Again, I will also reiterate that we are talking about ‘cached’ data here, and that the primary records of the requested data have been deleted. At what point do you feel that privacy acts and litigation should force the user to access / view *every* bit of data stored:
– primary record
– server caches
– data center/isp caches
– network ( university , business , building , etc ) caches
– computer / browser caches
Your arguments open up a ‘can of worms’ with the concepts of network optimization. I wouldn’t be surprised if your university operates a server on its internet gateway that caches often requested images — would they too be complicit in this scheme for failing to delete them immediately ? How would they even know to do so ? How could the network operator identify and notify every step in the chain that has ever cached an instance of the image ?
Some good points made, though I think I was quite clear in writing that the photos are remaining in CDNs and not the sites themselves (though in a few cases they are the same thing).
I’m not sure if there has been a legal ruling over whether photos count as PII, but I would argue they are, since in many cases they have faces, and facial recognition is fairly strong. Also, the photos often have the person’s UID in their filename.
I don’t agree with your thought that since CDNs are a form of cache that privacy rules don’t apply to them. Some CDNs may not have a delete API call but that was a design decision, not a technological limitation. Half of the sites got this right, so why should we excuse the ones who didn’t?
One, I have to agree that you should not put something on the WWW you would not want your family, future employer or even professors to see – remember, the main thing with most of these photo-sharing sites, or sites like MySpace or FaceBook, is that you allow the free-flowing Sharing of ALL data you put there. And it’s true, for the most part, deleting something is a lot like trashing a file into your computer’s ‘Recycle Bin’ – it remains on your hard drive, but it is not readily accessible. In essence, you can delete something from your MySpace, but it will remain cached in their servers for a time. FaceBook, I’m unfamiliar with their data storage & policies on usage & sharing, but back to the old rule, if you didn’t want to SHARE it for FREE, you should not have put it there.
Create your OWN website – it’s relatively cheap, AND you can copyright your data, even prevent it from being saved to someone else’s server & possibly being used fraudulently/illegally. In this day & age, too many people are jumping on the information sharing bandwagon without knowing who’s driving the damn thing.
If you’re worried about something coming back to bite you in the arse, don’t use your real name and don’t use your regular email address (that’s what services like gmail, hotmail, etc., are for) when setting up an account.
In this day and age of employers “googling” prospective employees, it’s just not safe to be yourself online, anymore.
Read the TOS (Terms of Service) on sites like Fecesbook … they are under no obligation to delete the photos. Once you upload a photo, it isn’t yours anymore. It’s theirs with which to do whatever they please. If this is a problem (and yes, it is a problem) then the correct solution is DON’T USE FECESBOOK.
I am not at all surprised by this situation. Having developed websites for over 10 years I have faced similar situations and the question always comes up – what do we do after the content has been ‘deleted’.
Often, the decision is made to no longer link directly to the content still stored on disc. If the data resides in a database, that data is rarely deleted, it is usually flagged as deleted, and hence not ‘visible’. The data remains for various reasons, one of the most common reasons being data integrity. One piece of information relies on another which relies on another. Deleting one piece can really screw up a database.
However, all too often, the TOS, T&C, Privacy and whatever other rules the websites have will describe this. How many times do people actually bother reading these, though, before moaning about it after the fact?
I’m with Jessica on this one. Creating one’s own website is easy and you have control over it. Learn about which services you intend to use BEFORE using them and if you don’t like their rules, don’t use them.
As for the cached images, there are too many caches out on the net to be certain of it being deleted in time. For instance, if you guys keep checking that the images have been deleted, the cache servers will reset a counter that indicates how long ago the file was last accessed. It will then start counting down again. For instance, if the cache deletes files not accessed within the last five days, but you access it on the fourth day, you’ll now need to wait another five days. I’m not saying this is the case with all caches, but it is quite typical of how caches work.
Facebook’s engineering blog has a posting from a few weeks ago about their photo storage infrastructure:
Good to know all that behind the scene story.
Thanks for putting it on.
I think you’re pushing a dangerous interpretation of the law.
Suppose I’m an employee at a photo-sharing site, and I browse around on my lunch break. If one of the photos I looked at is deleted an hour later, could the site be prosecuted because a copy is still on my (presumably company-owned) hard drive (in my browser’s local cache)? Your interpretation would seem to say yes.
What about backups? It would be irresponsible of a site operator to *not* have the ability to recover from hardware failures, which means they may have copies of deleted photos still lurking in recent backups. In order to avoid prosecution, would they have to expand the “deletion” process to include opening all their backups and removing the content? Your interpretation would seem to say yes.
Similarly, even if it’s deleted from all hard drives immediately, someone with physical access to a drive could probably recover the photo (since “deleting” a file, on many filesystems, doesn’t actually erase the data). Is the logical conclusion of this that photo-sharing sites need to implement military-style data-destruction protocols? Your interpretation would seem to lean that way.
It may turn out, of course, that the law is interpreted in this fashion, but if it is then it’s a law written with no respect for actual reality, and should be amended.
May I suggest another piece of research: what really happens to the message archives/conversations deleted from instant messaging tools or websites that provide instant messaging features (such as Facebook, Yahoo Messenger, etc.)?
This is a very interesting question that I’m certain a lot of people can’t answer for themselves, but really, really want to know.
Thank you again, and if there is any kind of forum to discuss more about architectural “flaws” and privacy matters (or security matters, for that matter), I would appreciate it if anyone would let me know.
How about trying to delete the whole profile? Tried it on Facebook – all data is still there after 23 days:
The rule-of-thumb I use is:
Do not post/email/blog/IM/text/share/upload anything that you don’t want your Mother/Priest/Boss to see. Period.
ANY & ALL items that you post/email/blog/IM/text/share/upload should be automatically considered public domain for the rest of time.
So open your eyes
So that I may replace my loneliness with something wonderful
I finally remembered the tears I shed
A new story is about to be revealed
Your wings, fragile and folded
Are weary of the azure blue sky
This feeling of loneliness keeps creeping up on me
A lonely candle still burns there
Can I bury all of this with empty words
I no longer even know
As long as I can swim freely in my dreams
You will be with me and I will have no need to go into that sky
When I turn the street corner
I mingle with the crowd
I melt into this anonymous mass
I lose all my bearings
I can no longer find my words
But one thing, your path
Remains, again and again
Everything I know about you
Your joys, your angers
Keeps me moving forward
If I raise my eyes to the sky
Yeees)) If only you knew what they write about you on other blogs)))
Thanks! Explained in a clear and understandable way.
but what do you want me to say????? if you have an idea, don’t hesitate to pass it on to me, ok??^^
just before me there is someone who wrote in an unusual language!!!! is there an interpreter?
well now, since 2009 nobody has said a word here, and yet before there were good poems and plenty of interesting things. make an effort all the same!!!!!!!!!!!!!!!!!!
Maybe you could change the post title Light Blue Touchpaper » Blog Archive » Attack of the Zombie Photos to something more suited for your subject you make. I enjoyed the blog post all the same.
Thank you for putting this on! That’s a wonderful post, and yeah, an interesting one. This problem of deleting does exist and some of my friends have faced it. So I’ll give ’em this link to read about as well. Thanks, Joseph! Waiting for the next investigations.
I deleted my photos from MySpace recently and I need to get them back. The reason is that someone stole the memory cards that had my son’s pics on them, and I had quite a few on MySpace that I deleted about 2-3 months ago 🙁 Can anyone help?