An A to Z of confusion

August 29th, 2008 at 05:16 UTC by Richard Clayton

A few days ago I blogged about my paper on email spam volumes — comparing “aardvarks” (email local parts [left of the @] beginning with “A”) with “zebras” (those starting with a “Z”).

I observed that, provided one considered only “real” aardvarks and zebras (addresses that received good email amongst the spam), aardvarks got 35% spam and zebras a mere 20%.
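For readers who want to try this sort of measurement on their own logs, here is a minimal sketch of the calculation. It assumes the percentage is taken over all email sent to “real” addresses sharing an initial; the paper may aggregate differently. The function name, the input format and the sample data are invented for illustration and are not taken from the paper.

```python
from collections import defaultdict


def spam_share_by_initial(messages):
    """Return {initial: percent spam} for "real" addresses only.

    `messages` is an iterable of (address, is_spam) pairs. An address counts
    as "real" if it received at least one non-spam email (the post's
    definition). The aggregation itself is an assumption.
    """
    per_address = defaultdict(lambda: [0, 0])  # address -> [good, spam]
    for address, is_spam in messages:
        per_address[address.lower()][1 if is_spam else 0] += 1

    totals = defaultdict(lambda: [0, 0])  # initial -> [good, spam]
    for address, (good, spam) in per_address.items():
        local_part = address.split("@", 1)[0]
        if good == 0 or not local_part:
            continue  # never received good email, so not "real"
        totals[local_part[0]][0] += good
        totals[local_part[0]][1] += spam

    return {
        initial: 100.0 * spam / (good + spam)
        for initial, (good, spam) in sorted(totals.items())
    }


# Made-up sample data, purely to show the shape of the input.
sample = [
    ("alice@example.com", True),
    ("alice@example.com", True),
    ("alice@example.com", False),
    ("zack@example.com", True),
    ("zack@example.com", False),
    ("zack@example.com", False),
    ("zack@example.com", False),
    ("aaaa@example.com", True),  # spam only, so excluded as not "real"
]
shares = spam_share_by_initial(sample)
```

Filtering on “realness” before aggregating matters: without it, addresses that exist only in spammers’ lists would swamp the figures.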

This has been widely picked up, first in the Guardian, and later in many other papers as well (even in Danish). However, many of these articles have got hold of the wrong end of the stick. So besides mentioning A and Z, it looks as if I should have published this figure from the paper as well…

Figure 3 from the academic paper

… the point being that the effect I am describing has little to do with Z being at the end of the alphabet, and A at the front, but seems to be connected to the relative rarity of zebras.

As you can see from the figure, marmosets and pelicans get around 42% spam (M and P being popular letters for people’s names) and quaggas 21% (there are very few Quentins, just as there are very few Zacks).

There are some outliers in the figure: for example, “3” relates to spammers failing to parse HTML properly and ending up with “3c” (the hex code for a < character) at the start of names. However, it isn’t immediately apparent why “unicorns” get quite so much spam; it may just be a quirk of the way that I have assessed “realness”. Doubtless some future research will be able to explain this more fully.
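To illustrate how a stray “3c” might end up at the front of a harvested address, here is a small sketch. It assumes the < was percent-encoded as %3C in a mailto: link and that the harvester uses a naive regular expression; the post does not say which encoding or scraping method was actually involved, so the pattern and the sample markup below are hypothetical.

```python
import re

# A deliberately naive address pattern (an assumption, not any real
# harvester's code). "%" is not an allowed character, so matching starts
# immediately after it.
NAIVE_ADDRESS = re.compile(r"[A-Za-z0-9._-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+")


def harvest(text):
    """Return every substring of `text` that looks like an email address."""
    return NAIVE_ADDRESS.findall(text)


# Hypothetical markup in which "<john@example.com>" was percent-encoded.
page = '<a href="mailto:%3Cjohn@example.com%3E">mail me</a>'
harvested = harvest(page)
```

Here the harvester records 3Cjohn@example.com rather than john@example.com, so the spam is addressed to a local part beginning with “3”.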

Entry filed under: Academic papers

4 comments

  • 1. RichB  |  August 29th, 2008 at 08:50 UTC

    u: Java(Script) unicode escape sequence?

  • 2. Justin Mason  |  August 29th, 2008 at 09:28 UTC

    Yeah, “u” seems to be the real outlier there… curious.

  • 3. Pete Austin  |  August 29th, 2008 at 14:46 UTC

    Re: unicorns. Looks like the spammers are using a crude method to collect email addresses based on letter frequency in English names, because all the vowels get a lot of spam. But the letter frequency for *initial* letters is actually different from this, with very few starting with “U”, which is why it shows as anomalous.

  • 4. kme  |  September 1st, 2008 at 01:09 UTC

    The “u” outlier could also be from all the places that allocate “uNNNNNN@domain.com” to “user number NNNNNN”. Universities in particular like to do this.
