Technical aspects of the censoring of Wikipedia

Part of the encyclopaedia website Wikipedia was censored in the UK between Friday 5th December 2008 and Tuesday 9th December 2008. Errors in the way that this was done have exposed a number of inconsistencies in the blocking mechanisms employed.

The story is relatively simple. A member of the public made a “hotline” report to the Internet Watch Foundation (IWF) about a scanned image of the album cover art for a 1976 LP from the German heavy-metal band The Scorpions. The IWF concluded that the image was “potentially illegal” because they believed it to be an “indecent” image of a child under the age of 18 (the definition in UK law, see statute as currently amended). They then added two URLs to their “Child Sexual Abuse Content URL List”. The IWF list is distributed twice a day to most UK ISPs, who then use various technologies to block access to the URLs.

The ISPs operate systems that don’t block entire websites. Instead they pass the traffic to suspect websites through a web proxy. This proxy checks the web requests and blocks only the specific URLs that are on the IWF list. However, the use of proxies meant that Wikipedia users at major UK ISPs now all appeared to come from one of a handful of IP addresses (those of the ISP proxy machines). This broke Wikipedia’s security model, which uses IP addresses to distinguish between helpful editors who improve the content of the site and vandals who attempt to trash it. Therefore, rather rapidly (for vandalism is commonplace), all UK-based editors who were editing anonymously rather than from registered accounts were barred. However, they couldn’t register for new accounts because the IP address from which they came was barred… and so this unfortunate situation rapidly came to the notice of Wikipedia’s administrators.

As can be seen from this archived discussion page, it didn’t take very long for people to realise that the use of the proxies was caused by a Wikipedia URL being on the IWF list, a few suggestions were made, and the two URLs that were being blocked were rapidly identified (by 15:13 … just over 3 hours after the censorship will have started). The rapid identification was doubtless because people remembered that there had been a US-based row over the same image in May 2008, when the FBI had investigated — with no action so far.

Wikipedia administrators continued to collect reports about which ISPs were blocking them, and to determine the identity of the proxies. If those proxies used “X-Forwarded-For” (XFF) headers then they could be added to a trusted list, and the individual IP addresses behind the proxies identified again.
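The trusted-list idea can be sketched roughly as follows. This is a minimal illustration and not Wikipedia's actual code: the function name `effective_client_ip`, the `TRUSTED_PROXIES` set and all the addresses (taken from the documentation ranges) are assumptions made up for the example.

```python
import ipaddress

# Hypothetical trusted-proxy list (documentation-range addresses, not real ISP proxies).
TRUSTED_PROXIES = {
    ipaddress.ip_address("192.0.2.10"),
    ipaddress.ip_address("192.0.2.11"),
}


def effective_client_ip(peer_ip, xff_header, trusted=TRUSTED_PROXIES):
    """Return the address an edit should be attributed to.

    The XFF header is only honoured when the TCP peer is a trusted proxy;
    otherwise it could simply have been forged by the client.
    """
    peer = ipaddress.ip_address(peer_ip)
    if peer not in trusted or not xff_header:
        return str(peer)

    # Walk the header right to left: the rightmost entry was appended by the
    # proxy nearest to us, so it is the most trustworthy one.
    for hop in reversed([h.strip() for h in xff_header.split(",")]):
        try:
            addr = ipaddress.ip_address(hop)
        except ValueError:
            return str(peer)  # malformed entry: fall back to what we know
        if addr not in trusted:
            return str(addr)  # first address not vouched for by the list
        peer = addr
    return str(peer)


if __name__ == "__main__":
    # Request arriving via a listed proxy: the customer's own address is recovered.
    print(effective_client_ip("192.0.2.10", "203.0.113.7"))
    # Same header from an unlisted machine: the header is ignored.
    print(effective_client_ip("198.51.100.5", "203.0.113.7"))
```

Until the ISP proxies were on the list, every request fell into the second case, which is why all the customers of an ISP appeared to share one address.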

The reports made to Wikipedia (and also on the ORG discuss and UKCrypto mailing lists) are rather confused, and sometimes downright contradictory — and I have been able to identify a number of technical reasons for this.

The two URLs that were blocked (and I deliberately haven’t given them in full) were:

http://en.wikipedia.org/…/virgin_killer
http://en.wikipedia.org/…/Image:Virgin_Killer.jpg

The first of these is the page discussing the “Virgin Killer” album, the second is a page discussing the copyright status of the scan of the album cover. Despite the type extension on the second URL, both of them cause HTML to be returned. The first page contains an embedded 200×200 pixel image, the second a 300×300 image and a 120×120 thumbnail. The actual images (which in this case are JPEGs) are served from the URLs:

http://upload.wikimedia.org/…/33/Virgin_Killer.jpg/200px-Virgin_Killer.jpg
http://upload.wikimedia.org/…/33/Virgin_Killer.jpg
http://upload.wikimedia.org/…/33/Virgin_Killer.jpg/120px-Virgin_Killer.jpg

Quite why the IWF chose to add the web pages to their list (and not the images that they deemed to be illegal) is currently unknown. They have since made statements to the effect that the ISPs’ blocking systems are designed for blocking pages and not images … but since an analysis shows that about 1/3 of their blocking list has image type extensions this appears to be something they’re not generally concerned about (albeit we’ve just seen that a type extension is not an infallible guide in these matters).

However, much of the confusion about what was blocked stemmed from the IWF’s decision to use the URL ending in “virgin_killer”. They had failed to notice that this URL had returned a 301 “moved permanently” response and redirected them to a “Virgin_killer” URL (with a capital V). Wikipedia treats page names as case sensitive except for the first letter. In fact, Wikipedia also returns identical content for “Virgin_Killer” (with capital V and K) but without a redirection. Their index lists both the “Virgin_Killer” and “Virgin_killer” variants, but not the “virgin_killer” URL that the IWF were considering.
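The first-letter rule can be illustrated with a small sketch. It mimics only the behaviour described above and is not the Wikipedia software itself; `canonical_title` and `needs_redirect` are names invented for the example.

```python
def canonical_title(title: str) -> str:
    """Upper-case only the first character; the rest stays case sensitive."""
    return title[:1].upper() + title[1:]


def needs_redirect(requested: str) -> bool:
    """True when the requested name differs from its canonical form,
    which is the situation in which a 301 would be sent."""
    return requested != canonical_title(requested)


if __name__ == "__main__":
    for name in ("virgin_killer", "Virgin_killer", "Virgin_Killer"):
        print(name, "->", canonical_title(name), "redirect:", needs_redirect(name))
```

Under this rule the name the IWF listed is the only one of the three that is never the final URL shown in a browser's address bar.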

This meant that when people tried to access the page either by following a URL cut and pasted from a browser, or by looking up the topic in the Wikipedia index, they were not accessing the URL that was being listed by the IWF. At all the ISPs (I know of at least two) where the URL matching was case sensitive this meant that they could access the first “blocked” page. At ISPs where the URL matching was case insensitive both pages were blocked.
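The difference between the two kinds of ISP can be shown with a toy matcher. This is a sketch under assumptions: the `is_blocked` function is hypothetical, and the URLs are kept in the same elided form used above, so the strings are placeholders rather than working addresses.

```python
# The two listed URLs, in the deliberately elided form used in this article.
BLOCKED = [
    "http://en.wikipedia.org/…/virgin_killer",
    "http://en.wikipedia.org/…/Image:Virgin_Killer.jpg",
]


def is_blocked(url: str, case_sensitive: bool) -> bool:
    """Exact-match lookup against the list, with or without case folding."""
    if case_sensitive:
        return url in BLOCKED
    return url.lower() in {entry.lower() for entry in BLOCKED}


if __name__ == "__main__":
    # What a user actually requests after following the index or a pasted URL.
    visited = "http://en.wikipedia.org/…/Virgin_Killer"
    print("case-sensitive ISP blocks it:  ", is_blocked(visited, True))
    print("case-insensitive ISP blocks it:", is_blocked(visited, False))
```

The image description page is caught by both kinds of matcher, because users reach it under exactly the capitalisation that was listed.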

Furthermore, ISPs differ as to how they actually block. Some redirect to a custom “403 Forbidden” page (such as this one at Demon), others to a “404 Not Found” page hosted by the ISP. Others merely close the connection cleanly (you see an immediate FIN) and others reset it (with an RST packet). These induced failures will be reported by different browsers in different ways though probably as a “404” (with what you see sometimes being under user control as to whether they want “friendly” error messages or not). Of course some of the failures are indistinguishable from completely different errors, such as the proxy machine being completely overloaded (or the URL being entered incorrectly), and hence it is quite understandable if some of the reports as to which ISPs were blocking are confused.

Further confusion has arisen because of the way that the blocking systems work. UK ISPs are all (so far as I know) operating two-stage systems. The first stage picks out traffic going to the same IP address as any URL on the blocking list, and the second stage checks the URL itself. However, there the similarity ends since the “picks out traffic” mechanism can be implemented by fiddling with routing tables (using custom /32 routes in the internal BGP system) or by arranging for the DNS server to resolve the hostnames to a proxy machine. The second stage can be done by web proxies, or by “Deep Packet Inspection” hardware (recall the shenanigans that Phorm gets up to with this type of kit) or by exploiting Cisco’s proprietary WCCP v2 cache management protocol. All this means that a user who habitually accesses Wikipedia via a remote proxy system, or who uses their own DNS server (or a generic one that offers anti-phishing protection!) could well evade the blocking system. If they don’t realise what they’ve done, then their report of “no block here” will muddy the waters.
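A two-stage design of this general shape can be sketched as below. Everything here is an assumption for illustration: the hostnames, the documentation-range addresses, the `HOST_TO_IP` table standing in for DNS, and the function names. Real deployments do the first stage in routing or DNS rather than in application code.

```python
from urllib.parse import urlsplit

# Hypothetical blocking list and a toy resolver table standing in for DNS.
BLOCKLIST = {"http://blocked.example/bad/page.html"}
HOST_TO_IP = {
    "blocked.example": "192.0.2.80",
    "shared.example": "192.0.2.80",   # innocent site on the same address
    "other.example": "198.51.100.9",
}


def stage_one_ips(blocklist):
    """Stage 1 set-up: every IP address that hosts at least one listed URL."""
    return {HOST_TO_IP[urlsplit(url).hostname] for url in blocklist}


def handle(url, dest_ip, suspect_ips):
    """Decide what happens to one request, given where the packets are going."""
    if dest_ip not in suspect_ips:
        return "direct"    # stage 1: not diverted, the URL is never examined
    if url in BLOCKLIST:
        return "blocked"   # stage 2: exact URL match
    return "proxied"       # stage 2: passed on, but now from the proxy's IP


if __name__ == "__main__":
    suspect = stage_one_ips(BLOCKLIST)
    print(handle("http://other.example/", "198.51.100.9", suspect))
    print(handle("http://shared.example/ok.html", "192.0.2.80", suspect))
    print(handle("http://blocked.example/bad/page.html", "192.0.2.80", suspect))
    # The same listed URL fetched via a remote proxy: packets go elsewhere.
    print(handle("http://blocked.example/bad/page.html", "203.0.113.50", suspect))
```

The last call shows the unintentional evasion: the second stage never sees a request whose packets were not picked out by the first.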

The addition of the Wikipedia page to the IWF list on the last day before the weekend doubtless played a part as well. It’s clear from many stories that ISP helpdesks were completely unaware of the blocking systems that their employers operated. Also, some ISPs, notably BT (a pioneer in this type of blocking), don’t seem to have deployed the block until Monday. This may be because their updating system doesn’t run at the weekend. On the other hand, it may be because they read Chapter 7 of my PhD thesis which contains descriptions of possible attacks on their system. If so, then they could well have a list of high traffic destinations (and Wikipedia is #11 in the most popular websites for UK surfers) for which manual intervention (in working hours) is needed to censor such traffic, lest the high load overwhelm their proxies. It is only because of the specific impact on Wikipedia that the censorship was noticed within a few hours; it might normally be days or weeks before it was spotted, and so a few days’ delay might usually go unremarked.

To sum up the key technical matters: the IWF chose to filter text pages on Wikipedia rather than just the images they were concerned about; the use of proxies by ISPs broke Wikipedia’s security model that prevents vandalism; the previous controversy about the Virgin Killer album cover meant that the IWF’s URLs were quickly identified; however, different capitalisations of URLs, the different blocking technologies, and the different implementation timescales led to considerable confusion as to who blocked what and when.

Some of these matters could be described as “human error” and might be done better in any re-run of these events with any of the other questionable images hosted on Wikipedia (and many other mainstream sites). However, most of the differences in the effectiveness of the attempted censorship stem directly from diverse blocking system designs — and we can expect to see them recur in future incidents. The bottom line is that these blocking systems are fragile, easy to evade (even unintentionally), and little more than a fig leaf to save the IWF’s blushes in being so ineffective at getting child abuse image websites removed in a timely manner.

8 thoughts on “Technical aspects of the censoring of Wikipedia”

Gavin Jamie says:

2008-12-11 at 11:04 UTC

I am certainly no HTTP expert but is the “X-Forwarded-For” header to be trusted? Should Wikipedia (or any site) assume that this is a genuine proxy? It would seem a relatively trivial matter to write a browser extension to add this header to each request and then vandalise away.

Presumably wikipedia (and google and whoever) would need to recognise and authenticate each proxy individually.

Richard Clayton says:

2008-12-11 at 11:22 UTC

@Jamie

Sorry, I didn’t make it clear enough. XFF can indeed be forged, and so Wikipedia has a list of trusted proxies from which they accept it. Since they had never encountered the blocking system before the proxies were not initially listed! But they are now (see here).

Torne says:

2008-12-11 at 11:43 UTC

Gavin:
Yes, people don’t generally trust such headers, but Wikipedia has a list of those it *does* trust, for large ISPs that proxy all their customers.

Ed Davies says:

2008-12-11 at 12:06 UTC

“…(albeit we’ve just seen that a type extension is not an infallible guide in these matters).”

In principle, URLs are opaque strings. They do not have “type extensions” so the last few characters should not be taken as a guide, fallible or otherwise.

Clive Feather says:

2008-12-11 at 15:07 UTC

[IWF] “have since made statements to the effect that the ISP’s blocking systems are designed for blocking pages and not images”

I don’t know why they should think this, since there’s no technical difference between blocking an HTML URL and a JPEG one. I’m not aware of any ISP who cares, or who asked for images not to be on the list.

Phil Nash says:

2008-12-12 at 21:16 UTC

“However, they couldn’t register for new accounts because the IP address from which they came was barred”

Not quite right; to limit vandalism-only accounts, the Wikimedia software imposes a per-day limit on account creation per IP address. This means that once that limit is reached, nobody using that proxy is able to create an account themselves, although we (Wikipedia) will create accounts via email, as long as there aren’t about 10,000 requests pending, of course!

Phil Nash says:

2008-12-12 at 21:20 UTC

@Clive Feather. Having worked for Demon Internet, you should be aware that there is a very great difference, and as a founder member of IWF, you could yourself have made sure that they were aware of that distinction. Their technical expertise seems strangely lacking, and certainly does not seem to be under the scrutiny of the ISPs.

Phil Nash says:

2008-12-12 at 21:24 UTC

furthermore, anyone who knows how to mung a header could get round the block; for example
http://…/w/index.php?title=Virgin_Killer&iwf=please_dont_censor_me
worked; and it wasn’t blocked in the UK on the Finnish Wikipedia, so http://fi.wikipedia.org/…/Virgin_Killer still worked. “Epic fail”, as they say.