Category Archives: Internet censorship

Text mining is harder than you think

Following last year’s row about Apple’s proposal to scan all the photos on your iPhone camera roll, EU Commissioner Johansson proposed a child sex abuse regulation that would compel providers of end-to-end encrypted messaging services to scan all messages in the client, and not just for historical abuse images but for new abuse images and for text messages containing evidence of grooming.

Now that journalists are distracted by the imminent downfall of our great leader, the Home Office seems to think this is a good time to propose some amendments to the Online Safety Bill that will have a similar effect. And while the EU planned to win the argument against the pedophiles first and then expand the scope to terrorist radicalisation and recruitment too, Priti Patel goes for the terrorists from day one. There’s some press coverage in the Guardian and the BBC.

We explained last year why client-side scanning is a bad idea. However, the shift of focus from historical abuse images to text scanning makes the government story even less plausible.

Detecting online wickedness from text messages alone is hard. Since 2016, we have collected over 99m messages from cybercrime forums and over 49m from extremist forums, and these corpora are used by 179 licensees in 55 groups from 42 universities in 18 countries worldwide. Detecting hate speech is a good proxy for detecting terrorist radicalisation. In 2018, we thought we could detect hate speech with a typical precision of 92%, which would mean a false-alarm rate of 8%. But the more complex models of 2022, based on Google’s BERT, don’t do significantly better when tested on the better collections we have now; indeed, now that we understand the problem in more detail, they often do worse. Do read that paper if you want to understand why hate-speech detection is an interesting scientific problem. Some specific kinds of hate speech are even harder to detect; an example is anti-semitism, thanks to the large number of synonyms for Jewish people. So if we were to scan 10bn messages a day in Europe there would be maybe a billion false alarms for Europol to look at.
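
To make the arithmetic concrete, here is a back-of-the-envelope sketch in Python; the traffic volume and the 8% figure are the illustrative numbers from the paragraph above, treated as a per-message false-alarm rate.

```python
# Back-of-the-envelope calculation using the illustrative figures above.
messages_per_day = 10_000_000_000      # roughly 10bn messages a day in Europe
false_alarm_rate = 0.08                # treating the 8% figure as a per-message rate

false_alarms = messages_per_day * false_alarm_rate
print(f"{false_alarms:,.0f} false alarms a day")    # 800,000,000

# To keep the load down to, say, 1,000 alarms a day for human review, the
# per-message false-alarm rate would have to be of the order of 1 in 10 million.
print(f"{1_000 / messages_per_day:.0e}")            # 1e-07
```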

We’ve been scanning the Internet for wickedness for over fifteen years now, and looking at various kinds of filters for everything from spam to malware. Filtering requires very low false-positive rates to be feasible at Internet scale, which means either looking for very specific things (such as indicators of compromise by a specific piece of malware) or having rich metadata (such as a big spam run from some IP address space you know to be compromised). Whatever filtering Facebook can do on Messenger given its rich social context, there will be much less that a WhatsApp client can do by scanning each text on its way through.

So if you really wish to believe that either the EU’s CSA Regulation or the UK’s Online Safety Bill is an honest attempt to protect kids or catch terrorists, good luck.

European Commission prefers breaking privacy to protecting kids

Today, May 11, EU Commissioner Ylva Johansson announced a new law to combat online child sex abuse. This has an overt purpose, and a covert purpose.

The overt purpose is to pressure tech companies to take down illegal material, and material that might possibly be illegal, more quickly. A new agency is to be set up in the Hague, modelled on and linked to Europol, to maintain an official database of illegal child sex-abuse images. National authorities will report abuse to this new agency, which will then require hosting providers and others to take suspect material down. The new law goes into great detail about the design of the takedown process, the forms to be used, and the redress that content providers will have if innocuous material is taken down by mistake. There are similar provisions for blocking URLs; censorship orders can be issued to ISPs in Member States.

The first problem is that this approach does not work. In our 2016 paper, Taking Down Websites to Prevent Crime, we analysed the takedown industry and found that private firms are much better at taking down websites than the police. We found that the specialist contractors who take down phishing websites for banks would typically take six hours to remove an offending website, while the Internet Watch Foundation – which has a legal monopoly on taking down child-abuse material in the UK – would often take six weeks.

We have a reasonably good understanding of why this is the case. Taking down websites means interacting with a great variety of registrars and hosting companies worldwide, and they have different ways of working. One firm expects an encrypted email; another wants you to open a ticket; yet another needs you to phone their call centre during Peking business hours and speak Mandarin. The specialist contractors have figured all this out, and have got good at it. However, police forces want to use their own forms, and expect everyone to follow police procedure. Once you’re outside your jurisdiction, this doesn’t work. Police forces also focus on process more than outcome; they have difficulty hiring and retaining staff to do detailed technical clerical work; and they’re not much good at dealing with foreigners.

Our takedown work was funded by the Home Office, and we recommended that they run a randomised controlled trial where they order a subset of UK police forces to use specialist contractors to take down criminal websites. We’re still waiting, six years later. And there’s nothing in UK law that would stop them running such a trial, or that would stop a Chief Constable outsourcing the work.

So it’s really stupid for the European Commission to mandate centralised takedown by a police agency for the whole of Europe. This will make everything really hard to fix once they find out that it doesn’t work and it becomes obvious that child-abuse websites are staying up longer, causing real harm.

Oh, and the covert purpose? That is to enable the new agency to undermine end-to-end encryption by mandating client-side scanning. This is not evident on the face of the bill, but it is clear from the impact assessment, which praises Apple’s 2021 proposal. Colleagues and I already wrote about that in detail, so I will not repeat the arguments here. I will merely note that Europol coordinates the exploitation of communications systems by law enforcement agencies, and that the Dutch National High-Tech Crime Unit has developed world-class skills at exploiting mobile phones and chat services. The most recent case of continent-wide bulk interception was EncroChat; although reporting restrictions prevent me from telling that story, there have been multiple similar cases in recent years.

So there we have it: an attack on cryptography, designed to circumvent EU laws against bulk surveillance by using a populist appeal to child protection, appears likely to harm children instead.

Bugs in our pockets?

In August, Apple announced a system to check all our iPhones for illegal images, then delayed its launch after widespread pushback. Yet some governments continue to press for just such a surveillance system, and the EU is due to announce a new child protection law at the start of December.

Now, in Bugs in our Pockets: The Risks of Client-Side Scanning, colleagues and I take a long hard look at the options for mass surveillance via software embedded in people’s devices, as opposed to the current practice of monitoring our communications. Client-side scanning, as the agencies’ new wet dream is called, has a range of possible missions. While Apple and the FBI talked about finding still images of sex abuse, the EU was talking last year about videos and text too, and of targeting terrorism once the argument had been won on child protection. It can also use a number of possible technologies; in addition to the perceptual hash functions in the Apple proposal, there’s talk of machine-learning models. And, as a leaked EU internal report made clear, the preferred outcome for governments may be a mix of client-side and server-side scanning.
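
For readers unfamiliar with perceptual hashing, here is a minimal sketch of one very simple scheme – a difference hash, using the Pillow library – which is nothing like Apple’s NeuralHash, but shows the property such systems rely on: visually similar images map to fingerprints that differ in only a few bits. (The file names in the usage comment are placeholders.)

```python
from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: a toy perceptual hash. Shrink the image, then record
    whether each pixel is brighter than its right-hand neighbour."""
    img = Image.open(path).convert("L").resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = []
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits.append("1" if left > right else "0")
    return int("".join(bits), 2)

def hamming(a, b):
    """Number of bits in which two hashes differ."""
    return bin(a ^ b).count("1")

# Two images are treated as a match if their hashes differ in only a few bits:
# hamming(dhash("photo.jpg"), dhash("photo_rescaled.jpg")) <= 10
```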

In our report, we provide a detailed analysis of scanning capabilities at both the client and the server, the trade-offs between false positives and false negatives, and the side effects – such as the ways in which adding scanning systems to citizens’ devices will open them up to new types of attack.

We did not set out to praise Apple’s proposal, but we ended up concluding that it was probably about the best that could be done. Even so, it did not come close to providing a system that a rational person might consider trustworthy.

Even if the engineering on the phone were perfect, a scanner brings within the user’s trust perimeter all those involved in targeting it – in deciding which photos go on the naughty list, or how to train any machine-learning models that riffle through your texts or watch your videos. Even if it starts out trained on images of child abuse that all agree are illegal, it’s easy for both insiders and outsiders to manipulate images to create both false negatives and false positives. The more we look at the detail, the less attractive such a system becomes. The measures required to limit the obvious abuses so constrain the design space that you end up with something that could not be very effective as a policing tool; and if the European institutions were to mandate its use – and there have already been some legislative skirmishes – they would open up their citizens to quite a range of avoidable harms. And that’s before you stop to remember that the European Court of Justice struck down the Data Retention Directive on the grounds that such bulk surveillance, without warrant or suspicion, was a grossly disproportionate infringement on privacy, even in the fight against terrorism. A client-side scanning mandate would invite the same fate.

But ‘if you build it, they will come’. If device vendors are compelled to install remote surveillance, the demands will start to roll in. Who could possibly be so cold-hearted as to argue against the system being extended to search for missing children? Then President Xi will want to know who has photos of the Dalai Lama, or of men standing in front of tanks; and copyright lawyers will get court orders blocking whatever they claim infringes their clients’ rights. Our phones, which have grown into extensions of our intimate private space, will be ours no more; they will be private no more; and we will all be less secure.

Is Apple’s NeuralMatch searching for abuse, or for people?

Apple stunned the tech industry on Thursday by announcing that the next version of iOS and macOS will contain a neural network to scan photos for sex abuse. Each photo will get an encrypted ‘safety voucher’ saying whether or not it’s suspect, and if more than about ten suspect photos are backed up to iCloud, then a clever cryptographic scheme will unlock the keys used to encrypt them. Apple staff or contractors can then look at the suspect photos and report them.
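
The ‘clever cryptographic scheme’ is a form of threshold cryptography. Here is a toy sketch of the underlying idea – Shamir secret sharing over a prime field – in which a key becomes recoverable only once a threshold number of shares has been collected. Apple’s actual protocol is considerably more elaborate; the threshold of ten is just the illustrative figure mentioned above.

```python
import secrets

P = 2**127 - 1   # a Mersenne prime, big enough for a 128-bit-ish key

def make_shares(secret, t, n):
    """Split `secret` into n shares such that any t of them recover it."""
    coeffs = [secret] + [secrets.randbelow(P) for _ in range(t - 1)]
    def f(x):
        return sum(c * pow(x, i, P) for i, c in enumerate(coeffs)) % P
    return [(x, f(x)) for x in range(1, n + 1)]

def recover(shares):
    """Lagrange interpolation at x = 0 reconstructs the secret."""
    secret = 0
    for xi, yi in shares:
        num, den = 1, 1
        for xj, _ in shares:
            if xj != xi:
                num = num * (-xj) % P
                den = den * (xi - xj) % P
        secret = (secret + yi * num * pow(den, P - 2, P)) % P
    return secret

key = secrets.randbelow(P)
shares = make_shares(key, t=10, n=30)    # e.g. one share per flagged photo
assert recover(shares[:10]) == key        # ten shares suffice; nine reveal nothing
```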

We’re told that the neural network was trained on 200,000 images of child sex abuse provided by the US National Center for Missing and Exploited Children. Neural networks are good at spotting images “similar” to those in their training set, and people unfamiliar with machine learning may assume that Apple’s network will recognise criminal acts. The police might even be happy if it recognises a sofa on which a number of acts took place. (You might be less happy, if you own a similar sofa.) Then again, it might learn to recognise naked children, and flag up a snap of your three-year-old child on the beach. So what the new software in your iPhone actually recognises is really important.

Now the neural network described in Apple’s documentation appears very similar to the networks used in face recognition (hat tip to Nicko van Someren for spotting this). So it seems a fair bet that the new software will recognise people whose faces appear in the abuse dataset on which it was trained.

So what will happen when someone’s iPhone flags ten pictures as suspect, and the Apple contractor who looks at them sees an adult with their clothes on? There’s a real chance that they’re either a criminal or a witness, so they’ll have to be reported to the police. In the case of a survivor who was victimised ten or twenty years ago, and whose pictures still circulate in the underground, this could mean traumatic secondary victimisation. It might even be their twin sibling, or a genuine false positive in the form of someone who just looks very much like them. What processes will Apple use to manage this? Not all US police forces are known for their sensitivity, particularly towards minority suspects.

But that’s just the beginning. Apple’s algorithm, NeuralMatch, stores a fingerprint of each image in its training set as a short string called a NeuralHash, so new pictures can easily be added to the list. Once the tech is built into your iPhone, your MacBook and your Apple Watch, and can scan billions of photos a day, there will be pressure to use it for other purposes. The other part of NCMEC’s mission is missing children. Can Apple resist demands to help find runaways? Could Tim Cook possibly be so cold-hearted as to refuse to add Madeleine McCann to the watch list?
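
A minimal sketch of what the on-device check amounts to, assuming some generic perceptual hash rather than Apple’s actual NeuralHash, and with made-up hash values; the point is simply that, once the scanning machinery exists, extending the watch list is trivial.

```python
# Illustrative on-device watch-list check; the hash values are invented.
BLOCKLIST = {0x3B8E1F04A9C2D671, 0x91D006E2FA7734CC}   # distributed to every device

def flagged(photo_hash, max_distance=6):
    """Report a match if the photo's hash is within a few bits of the list."""
    return any(bin(photo_hash ^ h).count("1") <= max_distance for h in BLOCKLIST)

# Extending the system to a missing child, a dissident or a meme is one line:
BLOCKLIST.add(0x7FA2C00D91E45B18)
```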

After that, your guess is as good as mine. Depending on where you are, you might find your photos scanned for dissidents, religious leaders or the FBI’s most wanted. It also reminds me of the Rasterfahndung in 1970s Germany – the dragnet search of all digital data in the country for clues to the Baader-Meinhof gang. Only now it can be done at scale, and not just for the most serious crimes either.

Finally, there’s adversarial machine learning. Neural networks are fairly easy to fool in that an adversary can tweak images so they’re misclassified. Expect to see pictures of cats (and of Tim Cook) that get flagged as abuse, and gangs finding ways to get real abuse past the system. Apple’s new tech may end up being a distributed person-search machine, rather than a sex-abuse prevention machine.
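
As a rough illustration of why such tweaks work, here is a toy example in NumPy: for a simple differentiable classifier, nudging each pixel in the direction of the sign of the gradient pushes the score whichever way the attacker wants. Real attacks on convolutional networks and perceptual hashes take more effort, but the principle is the same.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=64), 0.0     # a toy linear "classifier" over 8x8 images
x = rng.uniform(0, 1, size=64)      # an innocuous image

def score(img):
    """Probability the toy model assigns to the 'abuse' class."""
    return 1 / (1 + np.exp(-(img @ w + b)))

# The gradient of the score with respect to the input is proportional to w,
# so an FGSM-style step of size eps moves the score in the chosen direction.
eps = 0.05
x_false_positive = np.clip(x + eps * np.sign(w), 0, 1)  # innocuous image gets flagged
x_false_negative = np.clip(x - eps * np.sign(w), 0, 1)  # flagged content slips through
print(score(x), score(x_false_positive), score(x_false_negative))
```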

Such a technology requires public scrutiny, and as the possession of child sex abuse images is a strict-liability offence, academics cannot work with them. While the crooks will dig out NeuralMatch from their devices and play with it, we cannot. It is possible in theory for Apple to get NeuralMatch to ignore faces; for example, it could blur all the faces in the training data, as Google does for photos in Street View. But they haven’t claimed they did that, and if they did, how could we check? Apple should therefore publish full details of NeuralMatch plus a set of NeuralHash values trained on a public dataset with which we can legally work. It also needs to explain how the system it deploys was tuned and tested; and how dragnet searches of people’s photo libraries will be restricted to those conducted by court order so that they are proportionate, necessary and in accordance with the law. If that cannot be done, the technology must be abandoned.

A new way to detect ‘deepfake’ picture editing

Common graphics software now offers powerful tools for inpainting – using machine-learning models to reconstruct missing pieces of an image. They are widely used for picture editing and retouching, but like many sophisticated tools they can also be abused. They can remove someone from a picture of a crime scene, or remove a watermark from a stock photo. Could we make such abuses more difficult?

We introduce Markpainting, which uses adversarial machine-learning techniques to fool the inpainter into making its edits evident to the naked eye. An image owner can modify their image in ways that are not themselves very visible, but that will sabotage any attempt to inpaint it by adding visible information determined in advance by the markpainter.
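
The paper has the details; as a rough sketch of the kind of optimisation involved, here is a PGD-style loop in PyTorch run against a toy stand-in inpainter (a tiny untrained network, not any production model): the image owner searches for a small perturbation that steers the inpainter’s output in the masked region towards a visible target mark.

```python
import torch
import torch.nn as nn

# Stand-in "inpainter": a tiny untrained network that takes the masked image
# plus the mask and predicts the full image. Markpainting itself targets real
# inpainting models; this toy just makes the loop below runnable.
inpainter = nn.Sequential(nn.Conv2d(4, 8, 3, padding=1), nn.ReLU(),
                          nn.Conv2d(8, 3, 3, padding=1))

def markpaint(image, mask, target, eps=8/255, steps=40, lr=1/255):
    """Find a perturbation delta with ||delta||_inf <= eps that pushes the
    inpainter's reconstruction of the masked region towards `target`."""
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        x = (image + delta).clamp(0, 1)
        out = inpainter(torch.cat([x * (1 - mask), mask], dim=1))
        loss = ((out - target) * mask).pow(2).mean()   # mismatch inside the mask
        loss.backward()
        with torch.no_grad():
            delta -= lr * delta.grad.sign()            # step towards the target mark
            delta.clamp_(-eps, eps)                    # keep the change imperceptible
            delta.grad.zero_()
    return (image + delta).clamp(0, 1).detach()

image = torch.rand(1, 3, 64, 64)                                   # the owner's image
mask = torch.zeros(1, 1, 64, 64); mask[..., 24:40, 24:40] = 1.0    # region an editor might inpaint
target = torch.ones(1, 3, 64, 64)                                  # e.g. a bright, visible mark
protected = markpaint(image, mask, target)
```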

One application is tamper-resistant marks. For example, a photo agency that makes stock photos available on its website with copyright watermarks can markpaint them in such a way that anyone using common editing software to remove a watermark will fail; the copyright mark will be markpainted right back. So watermarks can be made a lot more robust.

In the fight against fake news, markpainting news photos would mean that anyone trying to manipulate them would risk visible artefacts. So bad actors would have to check and retouch photos manually, rather than trying to use inpainting tools to automate forgery at scale.

This paper has been accepted at ICML.

Infrastructure – the Good, the Bad and the Ugly

Infrastructure used to be regulated and boring; the phones just worked and water just came out of the tap. Software has changed all that, and the systems our society relies on are ever more complex and contested. We have seen Twitter silencing the US president, Amazon switching off Parler and the police closing down mobile phone networks used by crooks. The EU wants to force chat apps to include porn filters, India wants them to tell the government who messaged whom and when, and the US Department of Justice has launched antitrust cases against Google and Facebook.

Infrastructure – the Good, the Bad and the Ugly analyses the security economics of platforms and services. The existence of platforms such as the Internet and cloud services enabled startups like YouTube and Instagram to soar to huge valuations almost overnight, with only a handful of staff. But criminals also build infrastructure, from botnets to malware-as-a-service. There’s also dual-use infrastructure, from Tor to bitcoins, with entangled legitimate and criminal applications. So crime can scale too. And even “respectable” infrastructure has disruptive uses. Social media enabled both Barack Obama and Donald Trump to outflank the political establishment and win power; they have also been used to foment communal violence in Asia. How are we to make sense of all this?

I argue that this is not simply a matter for antitrust lawyers, but that computer scientists also have some insights to offer, and the interaction between technical and social factors is critical. I suggest a number of principles to guide analysis. First, what actors or technical systems have the power to exclude? Such control points tend to be at least partially social, as social structures like networks of friends and followers have more inertia. Even where control points exist, enforcement often fails because defenders are organised in the wrong institutions, or otherwise fail to have the right incentives; many defenders, from payment systems to abuse teams, focus on process rather than outcomes.

There are implications for policy. The agencies often ask for back doors into systems, but these help intelligence more than interdiction. To really push back on crime and abuse, we will need institutional reform of regulators and other defenders. We may also want to complement our current law-enforcement strategy of decapitation – taking down key pieces of criminal infrastructure such as botnets and underground markets – with pressure on maintainability. It may make a real difference if we can push up offenders’ transaction costs, as online criminal enterprises rely more on agility than on long-lived, critical, redundant platforms.

This was a Dertouzos Distinguished Lecture at MIT in March 2021.

Security Engineering: Third Edition

I’m writing a third edition of my best-selling book Security Engineering. The chapters will be available online for review and feedback as I write them.

Today I put online a chapter on Who is the Opponent, which draws together what we learned from Snowden and others about the capabilities of state actors, together with what we’ve learned about cybercrime actors as a result of running the Cambridge Cybercrime Centre. Isn’t it odd that almost six years after Snowden, nobody’s tried to pull together what we learned into a coherent summary?

There’s also a chapter on Surveillance or Privacy which looks at policy. What’s the privacy landscape now, and what might we expect from the tussles over data retention, government backdoors and censorship more generally?

There’s also a preface to the third edition.

As the chapters come out for review, they will appear on my book page, so you can give me comments and feedback as I write them. This collaborative authorship approach is inspired by the late David MacKay. I’d suggest you bookmark my book page and come back every couple of weeks for the latest instalment!

Happy Birthday FIPR!

On May 29th there will be a lively debate in Cambridge between people from NGOs and GCHQ, academia and Deepmind, the press and the Cabinet Office. Should governments be able to break the encryption on our phones? Are we entitled to any privacy for our health and social care records? And what can be done about fake news? If the Internet’s going to be censored, who do we trust to do it?

The occasion is the 20th birthday of the Foundation for Information Policy Research, which was launched on May 29th 1998 to campaign against what became the Regulation of Investigatory Powers Act. Tony Blair wanted to be able to treat all URLs as traffic data and collect everyone’s browsing history without a warrant; we fought back, and our “big browser” amendment defined traffic data to be only that part of the URL needed to identify the server. That set the boundary. Since then, FIPR has engaged in research and lobbying on export control, censorship, health privacy, electronic voting and much else.
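
In today’s terms, the boundary looks roughly like this (a minimal sketch with a made-up URL): the server-identifying part counts as traffic data, while the path and query string – which can reveal exactly what you were reading or searching for – are content.

```python
from urllib.parse import urlparse

# The "big browser" boundary, roughly: only the part of the URL that
# identifies the server is traffic data; the rest is content.
u = urlparse("https://www.example.com/health/hiv-test-results?patient=42")
print(u.netloc)             # www.example.com                       -> traffic data
print(u.path, u.query)      # /health/hiv-test-results patient=42   -> content
```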

After twenty years it’s time to take stock. It’s remarkable how little the debate has shifted despite everything moving online. The police and spooks still claim they need to break encryption but still can’t support that with real evidence. Health administrators still want to sell our medical records to drug companies without our consent. Governments still can’t get it together to police cybercrime, but want to censor the Internet for all sorts of other reasons. Laws around what can be said or sold online – around copyright, pornography and even election campaign funding – are still tussle spaces, only now the big beasts are Google and Facebook rather than the copyright lobby.

A historical perspective might perhaps be of some value in guiding future debates on policy. If you’d like to join in the discussion, book your free ticket here.

What Goes Around Comes Around

What Goes Around Comes Around is a chapter I wrote for a book by EPIC. What are America’s long-term national policy interests (and ours for that matter) in surveillance and privacy? The election of a president with a very short-term view makes this ever more important.

While Britain was top dog in the 19th century, we gave the world both technology (steamships, railways, telegraphs) and values (the abolition of slavery and child labour, not to mention universal education). America has given us the motor car, the Internet, and a rules-based international trading system – and may have perhaps one generation left in which to make a difference.

Lessig taught us that code is law. Similarly, architecture is policy. The architecture of the Internet, and the moral norms embedded in it, will be a huge part of America’s legacy, and the network effects that dominate the information industries could give that architecture great longevity.

So if America re-engineers the Internet so that US firms can microtarget foreign customers cheaply, so that US telcos can extract rents from foreign firms via service quality, and so that the NSA can more easily spy on people in places like Pakistan and Yemen, then in 50 years’ time the Chinese will use it to manipulate, tax and snoop on Americans. In 100 years’ time it might be India in pole position, and in 200 years the United States of Africa.

My book chapter explores this topic. What do the architecture of the Internet, and the network effects of the information industries, mean for politics in the longer term, and for human rights? Although the chapter appeared in 2015, I forgot to put it online at the time. So here it is now.