By: Clive Robinson

Clive Robinson — Mon, 01 Sep 2008 13:04:13 +0000

@ Richard,

I picked the wrong creature from the “Flanders & Swann” Bestiary; it was the Hippo, not the Warthog, that sang:

Mud, mud, glorious mud,
nothing quite like it,
for soothing the blood.
So follow me follow,
down to the hollow,
and there let us wallow,
in glorious mud.

On a more whimsical note, there is also an ode to the Gnu.

And on those days when life is really getting to you, there is the famed song about the “British trades person”, with “It was on the Monday morning when the gas man came to call”…

By: Richard Clayton

Richard Clayton — Thu, 28 Aug 2008 16:25:52 +0000

@Clive

it is getting under my skin, especially when told “It’s all in the paper” or some such

Well, the answers to all of your questions (and some of the other data) are indeed in the paper. The paper is only 4 pages long and it contains a proper abstract. It’s also a bit less fun to read, precisely because it contains all the details you seek!

A blog is never going to be a replacement for proper reporting of academic research, and I think that’s a good thing, not a bad thing!

By: Clive Robinson

Clive Robinson — Thu, 28 Aug 2008 15:05:33 +0000

@ Richard,

“If one treats an email address as “real” if there’s one non-spam email on average every second day, then real aardvarks receive 35% spam, but real zebras receive only 20% spam”

A simple thought: perhaps you should use “warthogs”, as the above is about as clear as mud.

More importantly, it acts as a spoiler to what is effectively your paper’s abstract on this blog.

The saying “it does what it says on the can” springs to mind. Your title does not indicate what is in the “can” (your paper), which leaves only your abstract, effectively the equivalent of the “ingredients on the can”.

Ask yourself what incentive there is for somebody to consume the contents of the can if they don’t understand the ingredients printed on its side?

I am not intending to be unkind or nasty, but I have seen too many bad abstracts in recent times, the trend appears to be for the worse, and it is getting under my skin, especially when told “It’s all in the paper” or some such.

It is not just that sentence that itches, so to give you an idea of what else makes me want to scratch:

You outline two sets of results for A&Z, which are effectively end points, but you give no further indication of the other 24 points. A bar chart or graph would be nice, otherwise people might assume it’s a straight line…

At the end of the abstract you say that the ISP pre-filters using grey lists etc. However, it is unclear whether your first set of results is based on the mail before or after the ISP’s pre-filtering…

I’m guessing that they are actually based on the mail left after the ISP’s pre-filtering.

Further, I assume that to get your data set for your second results, by email addresses that are “real” you mean that you first qualify an address where the “local part” matches a mailbox that currently exists within the domain of the address?

And therefore non-“real” means either an invented “local part”, or a possibly valid “local part” sent to a domain in which that mailbox does not currently exist (but may have once)?

Secondly, you give a further condition on a “real” mailbox:

“one non-spam email on average every second day”

What does this actually mean?

I’m again guessing that by “second day” you are referring to a 48-hour period, not to the even days out of the odd- and even-numbered days in the experiment period.

Further, that it is based on the total experiment period, and not on some sliding window where once it fails the mailbox is excluded?

And that you therefore effectively mean that the mailbox gets N or more “non-spam” messages, where N is equal to the total experiment period in days divided by 2?
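
To show why the distinction matters, here is a minimal sketch of the two readings. It is purely illustrative and not taken from the paper: the function names, the per-day count list and the 0.5 messages/day threshold are my own assumptions about how such a test might be written.

```python
def is_real_total_period(non_spam_per_day, min_rate=0.5):
    """Reading 1: average over the whole experiment period.

    The mailbox qualifies if it received at least N = days / 2
    non-spam messages in total (min_rate is an assumed threshold).
    """
    days = len(non_spam_per_day)
    return days > 0 and sum(non_spam_per_day) >= min_rate * days


def is_real_sliding_window(non_spam_per_day, window_days=2):
    """Reading 2: sliding window.

    The mailbox is excluded as soon as any 48-hour window
    contains no non-spam message at all.
    """
    if len(non_spam_per_day) < window_days:
        return False
    last_start = len(non_spam_per_day) - window_days
    return all(
        sum(non_spam_per_day[i:i + window_days]) >= 1
        for i in range(last_start + 1)
    )


# Hypothetical mailbox: a burst of five messages on day one, then silence.
bursty = [5, 0, 0, 0, 0, 0, 0, 0]
print(is_real_total_period(bursty))    # passes the whole-period average
print(is_real_sliding_window(bursty))  # fails the sliding-window test
```

A mailbox that gets all its legitimate mail in one burst is “real” under the first reading but not under the second, so the two definitions can select quite different data sets.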

Thirdly, what is the method you use for determining a “non-spam” message?

Is it a standard method?

If not, is it sufficiently reliable to provide meaningful results (i.e. can it pick up the various forms of spam hiding, such as morphing)?

The reason I ask is that your results could conceivably be due to,

1) the address list generator,
2) the spammer,
3) the actions of a third party,
4) your data set selection method.

Also, it might just be due to data set anomalies, such as those that appear in the least significant digits of financial records and enable forensic investigators / accountants to more easily spot fake accounting information.

Your abstract does not indicate in any way the possibilities you might have considered, just that you have seen an anomaly and tested for it using some methodology…

Comments on: Zebras and Aardvarks