Zebras and Aardvarks

August 25th, 2008 at 03:02 UTC by Richard Clayton

We all know that different people get different amounts of email “spam”. Some of these differences result from how careful people have been in hiding their address from the spammers: putting it en clair on a webpage will definitely improve your chances of receiving unsolicited email.

However, it turns out there are other effects as well. In a paper I presented last week to the Fifth Conference on Email and Anti-Spam (CEAS 2008), I showed that the first letter of the local part of the email address also plays a part.

Incoming email to Demon Internet where the email address local part (the bit left of the @) begins with “A” (think of these as aardvarks) is almost exactly 50% spam and 50% non-spam. However, where the local part begins with “Z” (zebras) then it is about 75% spam.

However, if one only considers “real” aardvarks and zebras, viz: where a particular email address was legitimate enough to receive some non-spam email, then the picture changes. If one treats an email address as “real” if there’s one non-spam email on average every second day, then real aardvarks receive 35% spam, but real zebras receive only 20% spam.
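The two calculations above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for the paper: the function name `spam_share_by_letter`, its parameters, and the input format (pairs of address and an already-decided spam flag) are all hypothetical, and the “real” test is read as “at least one non-spam email per two days over the whole measurement period”.

```python
from collections import defaultdict


def spam_share_by_letter(messages, days=None, min_ham_per_day=0.5):
    """Return {first letter of local part: percentage of email that is spam}.

    messages: iterable of (address, is_spam) pairs; the spam/non-spam
              decision is assumed to have been made elsewhere.
    days:     length of the measurement period. If given, only "real"
              addresses are counted, i.e. those that received at least
              min_ham_per_day * days non-spam messages (an assumed reading
              of "one non-spam email on average every second day").
    """
    # First pass: count spam and non-spam ("ham") per address.
    per_address = defaultdict(lambda: [0, 0])  # address -> [spam, ham]
    for address, is_spam in messages:
        per_address[address.lower()][0 if is_spam else 1] += 1

    # Second pass: aggregate by the first letter of the local part.
    totals = defaultdict(lambda: [0, 0])  # letter -> [spam, ham]
    for address, (spam, ham) in per_address.items():
        if days is not None and ham < min_ham_per_day * days:
            continue  # not a "real" address under the assumed threshold
        letter = address[0]
        totals[letter][0] += spam
        totals[letter][1] += ham

    return {
        letter: 100.0 * spam / (spam + ham)
        for letter, (spam, ham) in totals.items()
    }
```

Calling the function without `days` gives the all-addresses figures; passing `days` restricts the count to “real” addresses, so guessed addresses that only ever receive spam drop out of the totals.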

The most likely reason for these results is the prevalence of “dictionary” or “Rumpelstiltskin” attacks (where spammers guess addresses). If there are not many other zebras, then guessing zebra names is less likely.

Aardvarks should consider changing species — or asking their favourite email filter designer to think about how this unexpected empirical result can be leveraged into blocking more of their unwanted email.

[[[ ** Note that these percentages are way down from general spam rates because Demon rejects out of hand email from sites listed in the PBL (which are not expected to send email) and greylists email from sites in the ZEN list. This reduces overall volumes considerably -- so YMMV! ]]]

Entry filed under: Academic papers

3 comments

  • 1. Clive Robinson  |  August 28th, 2008 at 15:05 UTC

    @ Richard,

    “If one treats an email address as “real” if there’s one non-spam email on average every second day, then real aardvarks receive 35% spam, but real zebras receive only 20% spam”

    A simple thought, perhaps you should use “warthogs” as the above is about as clear as mud.

    More importantly it acts as a spoiler to what is effectively your paper’s abstract on this blog.

    The saying “it does what it says on the can” springs to mind. Your title does not indicate what is in the “can” (your paper). Which only leaves your abstract, effectively the equivalent of the “ingredients on the can”.

    Ask yourself what incentive is there for somebody to consume the contents of the can if they don’t understand the ingredients printed on its side?

    I am not intending to be unkind or nasty, but I have seen too many bad abstracts in recent times and the trend appears to be for the worse and it is getting under my skin, especially when told “It’s all in the paper” or some such.

    It is not just that sentence that itches, so to give you an idea of what else makes me want to scratch,

    You outline two sets of results for A&Z which are effectively end points; you give no further indication of the other 24 points. A bar chart or graph would be nice, otherwise people might assume it’s a straight line…

    At the end of the abstract you say that the ISP pre filters using grey lists etc. However it is unclear if your first set of results are based on pre or post ISP pre-filtering…

    I’m guessing that they are actually based on post ISP pre-filtering.

    Further I assume that to get your data set for your second results, by email addresses that are “real” you mean that you first qualify an address where the “local part” matches a mailbox that currently exists within the domain of the address?

    And therefore non “real” as either an invented “local part” or a possibly valid “local part” that has been sent to the domain in which a mailbox does not currently exist (but may have once)?

    Secondly you give a further condition on a “real” mailbox as,

    “one non-spam email on average every second day”

    What does this actually mean?

    I’m again guessing that by “second day” you are referring to a 48-hour period, not even days out of odd and even numbered days in the experiment period.

    Further that it is based on the Total Experiment period and not as some sliding window where once it fails the mailbox is excluded?

    And that therefore you effectively mean that the mailbox gets N or more “non-spam” messages, where N is equal to Total Experiment period in days divided by 2?

    Thirdly what is the method you use for determining a “non-spam” message?

    Is it a standard method?

    If not, is it sufficiently reliable to provide meaningful results (ie can it pick up the various forms of spam hiding such as morphing etc)?

    The reason I ask is that your results could conceivably be due to action by the,

    1) address list generator,
    2) spammer,
    3) actions of a third party,
    4) your data set selection method.

    Also it might just be due to data set anomalies such as those that appear in the least significant digits in financial records that enable forensic investigators / accountants to more easily spot fake accounting information etc.

    Your abstract does not indicate the possibilities you might have considered in any way, just that you have seen an anomaly and tested for it using some methodology…

  • 2. Richard Clayton  |  August 28th, 2008 at 16:25 UTC

    @Clive

    “it is getting under my skin, especially when told “It’s all in the paper” or some such”

    Well, the answers to all of your questions (and some of the other data) are indeed in the paper. The paper is only 4 pages long and it contains a proper abstract. It’s also a bit less fun to read, precisely because it contains all the details you seek!

    A blog is never going to be a replacement for proper reporting of academic research, and I think that’s a good thing, not a bad thing!

  • 3. Clive Robinson  |  September 1st, 2008 at 13:04 UTC

    @ Richard,

    I picked the wrong creature from the “Flanders & Swann” Bestiary, it was the Hippo not the Warthog that sang,

    Mud, mud, glorious mud,
    nothing quite like it,
    for soothing the blood.
    So follow me follow,
    down to the hollow,
    and there let us wallow,
    in glorious mud.

    On a more whimsical note there is also an ode to the Gnu.

    And on those days when life is really getting to you there is the famed song about the “British trades person” with “It was on the Monday morning when the gas man came to call”…
