An A to Z of confusion

A few days ago I blogged about my paper on email spam volumes — comparing “aardvarks” (email local parts [left of the @] beginning with “A”) with “zebras” (those starting with a “Z”).

I observed that, provided one considers only “real” aardvarks and zebras (addresses that received good email amongst the spam), aardvarks got 35% spam and zebras a mere 20%.
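To make the measurement concrete, here is a minimal sketch of how a per-letter spam percentage might be computed. The record layout, the function name `spam_share_by_initial`, the sample numbers, and the choice to pool message counts per letter are all assumptions made for illustration; the paper's actual method may well differ.

```python
from collections import defaultdict


def spam_share_by_initial(records):
    """Return {initial letter: percentage of email that was spam}.

    `records` is an iterable of (address, spam_count, good_count) tuples.
    This layout is hypothetical, not taken from the paper.
    """
    spam = defaultdict(int)
    total = defaultdict(int)
    for address, spam_count, good_count in records:
        # Only "real" addresses count: those that received some good email.
        if good_count == 0:
            continue
        local_part = address.split("@", 1)[0]
        if not local_part:
            continue
        initial = local_part[0].lower()
        spam[initial] += spam_count
        total[initial] += spam_count + good_count
    return {letter: 100.0 * spam[letter] / total[letter] for letter in total}


# Made-up numbers, chosen only to mirror the headline percentages.
sample = [
    ("alice@example.com", 35, 65),
    ("zack@example.com", 20, 80),
    ("abandoned@example.com", 50, 0),  # no good email, so not "real"
]
shares = spam_share_by_initial(sample)
```

With this sample, the abandoned address is ignored because it never received good email, and the remaining "a" and "z" addresses come out at 35% and 20% spam respectively.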

This has been widely picked up, first in the Guardian, and later in many other papers as well (even in Danish). However, many of these articles have got hold of the wrong end of the stick. So besides mentioning A and Z, it looks as if I should have published this figure from the paper as well…

Figure 3 from the academic paper

… the point being that the effect I am describing has little to do with Z being at the end of the alphabet, and A at the front, but seems to be connected to the relative rarity of zebras.

As you can see from the figure, marmosets and pelicans get around 42% spam (M and P being popular letters for people’s names) and quaggas 21% (there are very few Quentins, just as there are very few Zacks).

There are some outliers in the figure: for example, “3” relates to spammers failing to parse HTML properly and ending up with “3c” (the hex code for a “<” character) at the start of names. However, it isn’t immediately apparent why “unicorns” get quite so much spam; it may just be a quirk of the way that I have assessed “realness”. Doubtless some future research will be able to explain this more fully.
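As a rough illustration of the “3c” effect, the sketch below shows one plausible way a careless harvester could end up with such addresses: the “<” in front of an address is percent-encoded as “%3C”, and a naive pattern then treats the “3C” as part of the local part. The regular expression and the function name `naive_harvest` are hypothetical; the real spammers' tools are unknown.

```python
import re
from urllib.parse import quote

# A deliberately naive address pattern (an assumption for illustration):
# it does not know that "%3C" is an encoded "<" rather than part of a name.
NAIVE_ADDRESS = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9.-]+")


def naive_harvest(text):
    """Return lower-cased strings that look like email addresses in `text`."""
    return [match.lower() for match in NAIVE_ADDRESS.findall(text)]


# An address wrapped in angle brackets, percent-encoded as it might
# appear inside a link on a web page.
encoded = quote("<john@example.com>", safe="@")
harvested = naive_harvest(encoded)
```

Here `encoded` is `%3Cjohn@example.com%3E`, and the naive pattern picks up `3cjohn@example.com`, an address whose local part now begins with “3”.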

4 thoughts on “An A to Z of confusion”

u: Java(Script) unicode escape sequence?

Yeah, “u” seems to be the real outlier there… curious.

Re: unicorns. Looks like the spammers are using a crude method to collect email addresses based on letter frequency in English names, because all the vowels get a lot of spam. But the letter frequency for *initial* letters is actually different from this, with very few starting with “U”, which is why it shows as anomalous.

The “u” outlier could also be from all the places that allocate “uNNNNNN@domain.com” to “user number NNNNNN”. Universities in particular like to do this.


Light Blue Touchpaper

Security Research, Computer Laboratory, University of Cambridge
