A backwards way of dealing with image spam

There is a great deal more email spam in your inboxes this Autumn (as noted, for example, here, here and here!). That’s partly because a very great deal more spam is being generated — perhaps twice as much as just a few months ago.

A lot of this junk is “image spam”, where the advertisement is contained within an embedded picture (almost invariably a GIF file). The filtering systems that almost everyone now uses are having significant problems in dealing with these images and so a higher percentage of the spam that arrives at the filters is getting through to your inbox.

So higher volumes and weaker filtering are combining to cause a significant problem for us all 🙁

But I have an interesting suggestion for filtering the images: it might be a lot simpler to go about it backwards 🙂

So read on!

At a large UK ISP with which I am familiar, incoming email volumes were 6 million/day this time last year, 12 million/day in the summer, 16 million/day in October and 26+ million/day on several days last week. In other words, the amount of spam is way up!

At the same time, a lot of the spam email relates to “pump and dump” spams where you are encouraged to buy obscure stocks, thereby raising their price, and the spam senders who bought at a low price make money. Since there’s no need (as there would be if they were drumming up buyers for fake pills or mortgage leads) for a website URL in the spam (just a ticker symbol) this removes a constant string that the email filtering systems can grab hold of, and it exposes the weaknesses of many of the text scanning algorithms (usually “Naive Bayes” schemes) which can be misled by the presence of many “good” words accompanying the “bad” ones.
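
To see why the “good” words matter, here is a minimal sketch of a Naive Bayes style score. The word probabilities are invented purely for illustration (they are not measurements from any real filter), and `spam_log_odds` is a hypothetical helper name. The score is a sum of per-word log ratios, so a handful of newsy words can outweigh the two words that carry the actual pitch.

```python
import math

# Hypothetical per-word probabilities: (P(word | spam), P(word | ham)).
# These numbers are made up for the example, not taken from a real corpus.
WORD_PROBS = {
    "stock": (0.20, 0.01),
    "buy": (0.15, 0.02),
    "meeting": (0.01, 0.10),
    "weather": (0.01, 0.08),
    "minister": (0.01, 0.09),
    "football": (0.01, 0.07),
}


def spam_log_odds(words, prior_spam=0.5):
    """Sum the log likelihood ratios of known words; > 0 leans spam, < 0 leans ham."""
    score = math.log(prior_spam / (1 - prior_spam))
    for word in words:
        if word in WORD_PROBS:
            p_spam, p_ham = WORD_PROBS[word]
            score += math.log(p_spam / p_ham)
    return score


bare = ["buy", "stock"]
# The same pitch, padded with text lifted from a news site.
padded = bare + ["meeting", "weather", "minister", "football"]

print(spam_log_odds(bare) > 0)    # the bare pitch scores as spam
print(spam_log_odds(padded) > 0)  # the padding drags the score below zero
```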

However, the key change in spam in recent months has been the increasing incidence of image spam. A typical spam email now consists of a page of random text (often snarfed from news websites) which persuades the filtering systems that this is email that you’ll want to read — followed by a GIF which actually contains the spammer’s message (buy pills, purchase this stock, personalised Xmas cards, etc).

Image spam is a significant problem for spam filtering systems. Having parsed the text and concluded that it looks legitimate (or at least, not unusual) they would, until recently, ignore the GIF and pass the spam through to your inbox. Hence the combination of a lot more spam being sent and the rise of image spam has led to a significant problem in your inbox.

The spam filtering companies are starting to fight back. The early attempts created a cryptographic hash of the images, so that when they were sent again they would be recognised. The spammers then arranged that every image was different by adding little dots of colour, or by other techniques (some quite exotic) that would ensure that a computer thought that the image was fresh and new and had to be allowed through the filtering — but that a human eye would recognise as the same old advert for slimming pills, erection pills, or this week’s ticker symbol.
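
A small sketch shows how little morphing is needed. I am assuming SHA-256 as the hash (the filtering companies may well use something else), and the “image” here is just stand-in bytes rather than a real GIF. Flipping a single bit is enough to produce a completely different fingerprint.

```python
import hashlib


def image_fingerprint(image_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw image bytes."""
    return hashlib.sha256(image_bytes).hexdigest()


# Stand-in bytes for a spam GIF (not a valid image; only the bytes matter here).
original = b"GIF89a" + bytes(64)

# "Morph" the image by flipping one bit, much as a stray coloured dot would.
tweaked = bytearray(original)
tweaked[-1] ^= 0x01
morphed = bytes(tweaked)

# A blocklist of images already seen in spam.
blocklist = {image_fingerprint(original)}

print(image_fingerprint(original) in blocklist)  # an exact resend is caught
print(image_fingerprint(morphed) in blocklist)   # the morphed copy is not
```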

The filtering companies have counterattacked this fuzziness (for example, IronPort claim to be doing well). They are now using character recognition software (or even wavelets) to try to deduce the message hidden within the images, hoping thereby to feed the text of those messages to their Naive Bayesian systems and hence once again be able to distinguish between ham and spam. But this is extremely processor intensive and, since there’s lots of spam to process, this is becoming a major resource issue for ISPs and others…

… but I have a much simpler solution, which seems so far to have been overlooked. Why don’t we just block every email with an image in it?

Don’t be silly, I hear the cries, there’s lots of legitimate email with images in it and we don’t want to block that!

I agree. But let’s examine the nature of that legitimate email. One major class of image-including email comes from users of Outlook (and other Microsoft email products) who can arrange that email turns up accompanied by some “wallpaper” (so that the background is, for example, a pleasing shade of blue). But there’s only a few dozen wallpaper images, so why not create a cryptographic hash for each of these and then give wallpaper a free pass?

There’s then companies who, for corporate image reasons, send out a copy of their company logo with every email (you may think that’s clueless, but their marketing department begs to differ!) However, once again, there’s only a relatively small number of these logos AND THEY DON’T MORPH INTO NEW SHAPES ON EVERY EMAIL, so it is possible to envisage building a database of their cryptographic hash values and letting them through.

There’s then a lot of other oddments, filler spaces, fancy bars across the page, buttons, smileys and so on. But there’s really not very many of these, and they don’t morph, and so they can be added to the whitelist.

The key point is that cryptographic hashes are USELESS at recognising the spammer’s images because they are intentionally morphed to ensure that they will not be recognised by such a simple test. However, the legitimate images (wallpaper, logos etc) do NOT keep on changing, but remain constant. So once they have all been recorded into a trustworthy database they can be given a free pardon…
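
In code, the backwards test might look something like this. It is a sketch only: it uses Python’s standard `email` package, assumes SHA-256 as the hash, and the names `image_hashes`, `should_block` and `make_message` are my own inventions for the example. A message is blocked if it carries any image whose hash is not in the known-good set; messages with no images are left to the ordinary text filters.

```python
import hashlib
from email.message import EmailMessage


def image_hashes(msg):
    """Yield the SHA-256 hex digest of every image part in the message."""
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            payload = part.get_payload(decode=True) or b""
            yield hashlib.sha256(payload).hexdigest()


def should_block(msg, known_good):
    """Block if any attached image is missing from the known-good set."""
    return any(digest not in known_good for digest in image_hashes(msg))


def make_message(image_bytes=None):
    """Build a small test email, optionally with one GIF attached."""
    msg = EmailMessage()
    msg["Subject"] = "hello"
    msg.set_content("Some perfectly innocent text.")
    if image_bytes is not None:
        msg.add_attachment(image_bytes, maintype="image",
                           subtype="gif", filename="picture.gif")
    return msg


# Stand-in bytes for a company logo and for a freshly morphed spam image.
logo = b"GIF89a-company-logo"
spam_image = b"GIF89a-buy-this-stock-0001"

known_good = {hashlib.sha256(logo).hexdigest()}

print(should_block(make_message(logo), known_good))        # logo gets a free pass
print(should_block(make_message(spam_image), known_good))  # unknown image is blocked
print(should_block(make_message(), known_good))            # no image, nothing to block
```

Note that the spammer gains nothing by morphing here: every new variant is simply one more image that is not on the list.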

So what I suggest for the filtering companies is to build a database of “good” images (a few days’ scanning should pick out the candidate GIFs; anything you see twice the same might be a useful initial selector!). They can then provide somewhere for the marketing department to proactively upload their logos, and (once a human’s checked that no-one is cheating) every other GIF can be, by default, blocked.
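
The “seen twice the same” selector is easy to sketch. Again SHA-256 is my assumption, and `candidate_good_hashes` and its `min_sightings` threshold are hypothetical names. The output is only a list of candidates; the human check described above would still have to happen before anything reached the real database.

```python
import hashlib
from collections import Counter


def candidate_good_hashes(observed_images, min_sightings=2):
    """Return digests of images seen byte-for-byte identical at least min_sightings times."""
    counts = Counter(hashlib.sha256(img).hexdigest() for img in observed_images)
    return {digest for digest, seen in counts.items() if seen >= min_sightings}


# Stand-in bytes: a logo that never changes, and spam images that always do.
logo = b"GIF89a-company-logo"
observed = [
    logo,
    b"GIF89a-buy-this-stock-0001",
    logo,
    b"GIF89a-buy-this-stock-0002",
]

candidates = candidate_good_hashes(observed)
print(len(candidates))  # only the unchanging logo qualifies
```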

The open community already has the systems it needs for this. Long ago when every advertising run used identical content for the emails, people used systems like Razor or DCC to discard incoming email that others had already identified to be spam. That doesn’t work terribly well anymore (because the spam sending engines morph the text sufficiently to fool the systems) but the infrastructure would be ideal for what I’m proposing!

Of course it’s not quite that simple, since there is a final class of image that regularly turns up in email: the JPEG of a new grandchild, the embarrassing shot of the last drunken Friday night, or even an impressive sunset from someone else’s recent holiday. So doubtless the spammers will regroup, replace all their malware on compromised machines, and start shipping JPEG images rather than GIFs. However, I suggest that that might be less of a problem to deal with. Firstly, corporates may be happy blocking JPEGs outright, since they’re not especially common in official company email. Secondly, the filtering problem should be a little simpler: a character recognition program is unlikely to find any character shapes in a sunset, so the majority of images will require only superficial processing.

There’s doubtless many details to work out to create a viable scheme — but I suggest that seizing upon the property of “good” images (that they don’t change) looks like a better bet than attempting to pick information out of “bad” images, which will rapidly evolve protections against image processing techniques — perhaps ending up looking like visual CAPTCHAs (many weak, but possibly strong) and consuming a great deal of computing power to deal with!