Technical aspects of the censoring of archive.org

Back in December I wrote an article here on the “Technical aspects of the censoring of Wikipedia” in the wake of the Internet Watch Foundation’s decision to add two Wikipedia pages to their list of URLs where child sexual abuse images are to be found. This list is used by most UK ISPs (and by blocking systems in other countries) in the filtering systems they deploy in an attempt to prevent access to this material.

A further interesting censorship case was in the news last month, and this article (a little belatedly) explains the technical issues that arose from it.

For some time, the IWF have been adding URLs from The Internet Archive (widely known as “the Wayback Machine”) to their list. I don’t have access to the list and so I am unable to say how many URLs have been involved, but for several months this blocking also caused some technical problems.

The Internet Archive has robots that wander the web taking copies of websites to preserve for posterity. This is incredibly useful, because it is often the case that material isn’t preserved by the original website owner. However, from time to time, it seems that the robots unwittingly preserve sexual abuse images of children. Eventually someone discovers this and reports it to the IWF.

The IWF don’t immediately get in contact with website owners when those sites are based abroad; so they don’t immediately send an email off to the Internet Archive — even though I have no doubt that the Internet Archive has no intention whatsoever of hosting illegal images, and will act immediately once they understand what their robot has been doing. I’ve previously discussed how the IWF’s reticence significantly prolongs the period for which images are available, and won’t labour that aspect here. The important thing from the point of view of this article is that the URL of the image gets added to the IWF blocking list, and ISPs start to act upon it.

In particular, Demon Internet, the well-known UK ISP, takes this list and arranges to block access to the URLs it contains. It does this by ensuring that all access to the Internet Archive passes through a web proxy — which will ensure that the blocked page is not served, but all other pages (and the Internet Archive has 66 billion of them) are available in the normal way.
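
In outline, the decision logic at such a proxy is very simple. Here is a minimal Python sketch of the idea; the function names, the stand-in blocklist entry, and the 404 response are my own assumptions, not details of Demon’s actual system:

    # Sketch of a filtering web proxy's per-request check; everything here
    # (names, the example blocklist entry, the 404 response) is illustrative.
    BLOCKED_URLS = {
        "http://web.archive.org/web/XXXX/http://example.com/bad",  # hypothetical entry
    }

    def handle_request(method, url, fetch_upstream):
        """Refuse exact matches against the blocklist; relay everything else."""
        if url in BLOCKED_URLS:
            return 404, b"Not Found"          # the blocked page is never fetched
        return fetch_upstream(method, url)    # all other pages served as normal

    # Stubbed demonstration: an unlisted page passes straight through.
    status, body = handle_request(
        "GET",
        "http://web.archive.org/web/20010815000000/http://example.com/",
        lambda method, url: (200, b"archived page"),
    )
    print(status)  # 200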

However, some Demon customers unexpectedly found that the unblocked parts of the Internet Archive were being adversely affected. Pages of links to archives were populated with links that didn’t lead elsewhere in the archive, but to Demon’s cache machine instead. This was first detected at the beginning of October 2008, but the problem almost immediately disappeared. It was reported again from time to time during the autumn, but never resolved.

However, over the weekend of 10–11 January 2009, the problem returned, and this time the effect was straightforward to reproduce. Complaints by Demon customers — who deduced that the IWF wanted something on the Internet Archive censored — were picked up by the press, and The Register ran an article. The comments made on the article indicated that customers of other ISPs were seeing the same effect… exactly the same effect! They too were seeing pages with links to Demon’s web proxy.

At this point the basic failure mechanism became clear, because these non-customers would not have been using Demon’s web proxy — the faulty pages were being constructed by the Internet Archive and served up not only to Demon, but also to people at other ISPs. Once that was realised, it was only a matter of time before there was a fix in place.

To understand the failure mechanism, it’s necessary to know that the Internet Archive keeps its pages in a database, with incomplete links from one part of an archived website to another. When these pages are served, the incomplete links are filled in with pointers within the Archive — so that you can navigate around the preserved copy of the website. To reduce load on the database the Internet Archive runs multiple proxy caches of its own, and the links in the pages are tweaked to ensure that your browsing stays on a single cache.
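
The link rewriting might look something like the following Python sketch; the stored-link format and the function are my guesses at the shape of the mechanism, not the Archive’s actual code:

    import re

    def rewrite_links(html, cache_host):
        """Fill in incomplete archived links so navigation stays on one cache."""
        return re.sub(
            r'href="(/web/\d+/[^"]*)"',
            lambda m: 'href="http://%s%s"' % (cache_host, m.group(1)),
            html,
        )

    stored = '<a href="/web/20010815000000/http://example.com/">old page</a>'
    print(rewrite_links(stored, "web.archive.org"))
    # <a href="http://web.archive.org/web/20010815000000/http://example.com/">old page</a>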

However, these caches were not filtering out request headers sent by the Demon web proxy, and so pages were being constructed with the links pointing at Demon’s system rather than at an Internet Archive machine. Since these newly constructed pages were cached, customers of other ISPs could be sent them — and of course they were tending to look at the same pages as the ones that Demon customers were testing and then reporting to the press…
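
In other words, the hostname used when filling in those links was being taken from the incoming request headers, which the upstream proxy had changed. A sketch of the failure, with header names and hostnames invented for illustration:

    def choose_link_host(headers):
        """Buggy host selection: trusts whatever the request carries."""
        return headers.get("X-Forwarded-Host") or headers["Host"]

    # A request relayed through Demon's proxy carries that proxy's identity...
    relayed = {"Host": "web.archive.org",
               "X-Forwarded-Host": "webcache.demon.example"}  # hypothetical names
    print(choose_link_host(relayed))  # webcache.demon.example
    # ...so pages get built with links pointing at Demon's machine, and because
    # the result is stored in the Archive's own cache, it is then served,
    # links and all, to visitors from completely different ISPs.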

The fix was the simple expedient of filtering the request properly, so that the back end only saw Internet Archive cache headers (and of course flushing out the badly constructed pages).
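
What “filtering the request properly” might amount to is sketched below; the whitelist and header names are my own assumptions rather than the Archive’s actual configuration:

    ALLOWED_HEADERS = {"Host", "User-Agent", "Accept"}  # illustrative whitelist

    def filter_headers(headers, cache_host):
        """Drop anything a third-party proxy may have added, then force the
        Host so pages are always built against the Archive's own cache."""
        clean = {k: v for k, v in headers.items() if k in ALLOWED_HEADERS}
        clean["Host"] = cache_host  # canonical hostname for link rewriting
        return clean

    relayed = {"Host": "webcache.demon.example",  # set upstream (hypothetical)
               "X-Forwarded-Host": "webcache.demon.example",
               "User-Agent": "Mozilla/5.0"}
    print(filter_headers(relayed, "web.archive.org"))
    # {'Host': 'web.archive.org', 'User-Agent': 'Mozilla/5.0'}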

There’s a bit more detail in this Usenet article, posted once the fix was in place. There was no comment before that because, once the mechanism was understood, it would have been possible to deliberately craft requests that constructed pages where you were apparently looking at an archived site (of the White House in August 2001, perhaps) but following the links would take you somewhere else entirely (perhaps to fictitious pages showing that the conspiracy theorists are right!), whilst leaving the impression that you were looking back in time at real White House pages.

Sadly, as in the previous Wikipedia debacle, many of the reports generated by the public, and the diagnostic tests that they ran, were either useless or totally misleading. For example, the URLs that were blocked were for http://web.archive.org/ whereas many people checked out http://www.archive.org/ and found that it was not being censored at all — they were right, but they drew entirely the wrong conclusion.

Another issue worth noting is that no-one, so far as I am aware, has been able to determine which page on web.archive.org was actually being blocked — rather different from the Wikipedia event, where the blocked page was known within a few hours.

However, even though the technical failure in this case cannot be laid at the door of the IWF, one must still wonder what the value is of diverting all of the traffic for a high-profile, entirely reputable website through a blocking system, when a three-minute telephone call would have meant that the illegal material was immediately removed. We’ve now seen two high-traffic sites filtered in the past few months; and on both occasions bad things have happened, which have — rightly or wrongly — brought the IWF into disrepute. The underlying policy decisions need reconsideration.

