What does Detica detect?

There has been considerable interest in a recent announcement by Detica of “CView” which their press release claims is “a powerful tool to measure copyright infringement on the internet”. The press release continues by saying that it will provide “a measure of the total volume of unauthorised file sharing”.

Commentators have divided as to whether these claims are nonsense, or whether the system must be deeply intrusive. The main reason for this is that when peer-to-peer file sharing flows are encrypted, it is impossible for a passive observer to know what is being transferred.

I met with Detica last Friday, at their suggestion, to discuss what their system actually did (they’ve read some of my work on Phorm’s system, so meeting me was probably not entirely random). With their permission, I can now explain the basics of what they are actually doing. A more detailed account should appear at some later date.

Their system starts by using fibre taps to pick off traffic from an appropriate part of the ISP network. They use a fibre tap rather than “port mirroring” to make it easier for the ISP to be sure that they won’t disrupt any traffic. The links that they monitor need not be carrying all of the ISP’s traffic — they merely hope that it will be a statistically significant sample.

The raw traffic is then sent to the CView box, which can handle multiple 10Gbit links. The first stage of processing is in hardware (FPGAs), then software takes over. The “external” endpoint identity is discarded and the “internal” identity is encrypted using a key that is not made available outside the box (ie: the intent is to make the customer “anonymous” but to be able to link different activity from the same source).
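As a rough illustration of that anonymisation step, here is a minimal Python sketch. It assumes HMAC-SHA256 as the keyed function, which is purely a stand-in: nothing here says which algorithm Detica actually use. The names `Pseudonymiser` and `anonymise_flow` are hypothetical.

```python
import hashlib
import hmac
import secrets


class Pseudonymiser:
    """Maps internal addresses to stable pseudonyms with a key held only in memory."""

    def __init__(self) -> None:
        # Assumption: a 256-bit key drawn from the operating system's CSPRNG.
        self._key = secrets.token_bytes(32)

    def pseudonym(self, internal_ip: str) -> str:
        # Same address and same key give the same pseudonym, so different
        # activity from one source can be linked without naming the source.
        mac = hmac.new(self._key, internal_ip.encode("ascii"), hashlib.sha256)
        return mac.hexdigest()


def anonymise_flow(p: Pseudonymiser, internal_ip: str, external_ip: str) -> str:
    # The external endpoint is discarded; only the pseudonym survives.
    del external_ip
    return p.pseudonym(internal_ip)
```

Two flows from the same internal address yield the same pseudonym whatever the external endpoint, while a second `Pseudonymiser` instance (a different key) yields an unrelated value.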

It should be carefully noted that this anonymity means that this system is intentionally useless for (and has nothing whatsoever to do with) any schemes for writing letters to, slowing down, or disconnecting, people who unlawfully share copyrighted materials. It’s all about measurements, not identification.

The content of the traffic is inspected to try and recognise whether it is peer-to-peer (P2P) traffic and if so, which particular protocol is being used. Most protocols are easy to recognise if you see the whole datastream — even encrypted traffic can be preceded by cleartext messages that are easy to distinguish.
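To give a flavour of how simple this recognition can be, here is a minimal sketch using BitTorrent as a familiar example (the protocols CView handles are not named here, so that choice is an assumption). A cleartext BitTorrent peer handshake starts with a length byte of 19 followed by the string "BitTorrent protocol". The function name `classify` is hypothetical.

```python
from typing import Optional

# Length byte (19) followed by the fixed protocol string.
BT_PREFIX = b"\x13BitTorrent protocol"


def classify(first_bytes: bytes) -> Optional[str]:
    """Return a protocol label for the start of a flow, or None if unrecognised."""
    if first_bytes.startswith(BT_PREFIX):
        return "bittorrent"
    return None
```

A real classifier would need one such rule (or something more elaborate) per protocol, and would return `None` for anything it had not been taught about.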

If what is being seen is P2P traffic with unencrypted content, a unique identifier is extracted that indicates which file is being shared. This is much easier than you might at first imagine — most of the P2P protocols identify content via unique identifiers (usually a cryptographic hash of the file) and then pass this identifier around with every block, in easy-to-locate fields. The CView box then spits out a record containing:

  • the encrypted (and thus anonymised) customer identity
  • the type of P2P protocol
  • the content identifier value
  • the file size
  • a timestamp

Where the content of the P2P communication was encrypted, the content identifier is unavailable, so the record generated is as above, but that field cannot be filled in.
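The record described above could be sketched along these lines. BitTorrent is again used only as a familiar example, where the 20-byte `info_hash` follows the protocol string and 8 reserved bytes in the handshake. The names `CViewRecord`, `info_hash` and `make_record` are hypothetical and do not come from Detica.

```python
from dataclasses import dataclass
from typing import Optional

BT_PREFIX = b"\x13BitTorrent protocol"
HANDSHAKE_LEN = 68  # 1 + 19 + 8 reserved + 20 info_hash + 20 peer_id


@dataclass(frozen=True)
class CViewRecord:
    customer: str              # keyed pseudonym, not the real identity
    protocol: str
    content_id: Optional[str]  # None when the content was encrypted
    file_size: int
    timestamp: float


def info_hash(handshake: bytes) -> Optional[str]:
    """Extract the content identifier from a cleartext BitTorrent handshake."""
    if len(handshake) < HANDSHAKE_LEN or not handshake.startswith(BT_PREFIX):
        return None
    start = len(BT_PREFIX) + 8  # skip the 8 reserved bytes
    return handshake[start:start + 20].hex()


def make_record(customer: str, handshake: Optional[bytes],
                file_size: int, timestamp: float) -> CViewRecord:
    # An encrypted flow has no readable handshake, so content_id stays None.
    content_id = info_hash(handshake) if handshake is not None else None
    return CViewRecord(customer, "bittorrent", content_id, file_size, timestamp)
```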

The records from the CView box are now passed to a statistics system. This looks up the content identifier (where known) in a database to see if it is copyrighted material that should not be seen on P2P networks. The statistics system then scales up its numbers (to adjust for any sampling at the earlier stages) and generates reports and graphs that give information such as the total amount of P2P traffic; what proportion is encrypted; what proportion of the unencrypted traffic appears to be a copyright infringement; the total number of customer accounts that are doing any file sharing; and so on.
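A toy version of that statistics stage might look like the sketch below. It rests on loud assumptions: that file size is an acceptable proxy for traffic volume, that a single `sample_fraction` captures all the sampling, and that the lookup database is just a set of flagged identifiers. `Rec` and `summarise` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Rec(NamedTuple):
    customer: str
    content_id: Optional[str]  # None when the transfer was encrypted
    file_size: int


def summarise(records: Iterable[Rec], flagged: Set[str],
              sample_fraction: float) -> dict:
    """Scale sampled records up to rough whole-network estimates."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    recs = list(records)
    total = sum(r.file_size for r in recs)
    clear = [r for r in recs if r.content_id is not None]
    clear_bytes = sum(r.file_size for r in clear)
    flagged_bytes = sum(r.file_size for r in clear if r.content_id in flagged)
    return {
        "est_total_p2p_bytes": total / sample_fraction,
        "encrypted_share": (total - clear_bytes) / total if total else 0.0,
        # Only the unencrypted part can be checked against the database.
        "flagged_share_of_clear": flagged_bytes / clear_bytes if clear_bytes else 0.0,
        "sharers_in_sample": len({r.customer for r in recs}),
    }
```

Note that `flagged_share_of_clear` says nothing about the encrypted remainder, which is exactly the gap discussed next.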

As can be seen from this description, the claims in the press release are a little wide of the mark: if a substantial amount of traffic is encrypted (as is widely believed to be the case) then the proportion that is “unlawful file sharing” can only be guessed at. Also, the system cannot even be totally sure that the transfer of a copyrighted file is in fact unlawful (it might be covered by one of the statutory exemptions), although the inaccuracy from this is likely to be very small!

The other potential flaw with the whole system is that there may be inaccuracies in detecting P2P protocols. Detica view their current system as a trial, and their system currently only attempts to detect the top three P2P protocols. New protocols, or developments of existing protocols, might well not be recognised, or may look too much like something else, such as “https” traffic. So if the statistics machine says that there is less file sharing going forward, then for quite a number of reasons, this may not quite reflect reality.

There’s also a wider issue as to whether a reduction in P2P traffic means less file sharing overall, since users may migrate back to Usenet, fetch their files from online repositories, move all their traffic over encrypted tunnels such as VPNs or Tor, or just swap multi-gigabyte “thumb drives” at the pub or in the playground.

Detica are giving the impression that ISPs will be happy to see their new product. I’m less sure that ISPs actually want to measure this traffic quite so exactly. They’re keen purchasers of “traffic shaping” kit that detects P2P and slows it down, and the statistics from these boxes may be quite sufficient already for their traffic management purposes.

However, ISPs who want to collaborate with media industries might wish to have an “industry standard” measurement tool so that some accurate numbers will inform their discussions. That presupposes that they’re prepared to admit how much P2P traffic they’re carrying, which might be a bit of a hostage to fortune. I strongly suspect the ISPs would like the option of keeping any embarrassing statistics to themselves, but still have Hollywood share in paying the Detica invoices (as if)!

I should also address (especially given the huge fuss over Phorm) the rather important question of whether the system is lawful to operate. Please note that IANAL, but I’ve studied the lawyers’ writings in this area a fair bit…

The design as explained above seems to address issues of privacy and data protection (amalgamating statistics and discarding identifiers is a sound technique for jumping these hurdles). But there is then the vexed question of illegal interception. The system does “wire-tapping”, that’s obvious, but the criminal offence is called “interception” and that is carefully defined within the Regulation of Investigatory Powers Act 2000. I expect that Detica would wish to argue that there is no interception because no content is seen by any humans… however, spitting out the file identifier might in itself be sufficient to infringe. It may take some case law before anyone can say for sure.

It seems that Virgin (reported to be deploying this Detica system) would rather not argue about whether it’s interception, and have indicated that they intend to rely instead upon using it for “network management”, or more formally, the s3(3) statutory exemption that permits interception if “it takes place for purposes connected with the provision or operation of that service”.

Knowing how much of your traffic is file sharing is something that network engineers would wish to know. However, knowing how much of the traffic is unlawful (and getting a list of all the material that is being shared unencrypted) is a bit more of a stretch — but perhaps the marketing people can claim that they need this knowledge to provide a service, and Virgin have announced that they are going to be providing a music service of their own.

Finally, the paranoid will observe that minor tweaks to the software will deliver up a first-class monitoring system that can generate reports about unlawful activity by individual users; so that anyone whose P2P activity is unencrypted (and who actually gets sampled) will be immediately detected.

Applying to the courts for an injunction to require these tweaks be made does not seem out-of-line with other media industry legal initiatives in Belgium and Ireland. It’s hard to say whether such an injunction would be granted in the UK, and the media industries have shown no previous signs of taking this route here. Nonetheless, a cautious ISP that is concerned about the wider PR aspects of deploying this system might think carefully about the likely benefits before giving the nice chaps at Detica (full disclosure, they paid for my lunch) a call.

19 thoughts on “What does Detica detect?”

  1. Thanks for the update, Richard, as per my earlier request.

    So, regarding that request: are there any more research and legal professionals reading this important blog who are also willing to set aside some professional time to actively seek out the facts of this evolving DPI-for-profit case, and to produce their own inter-related reports to reach consensus?

  2. Given the ICO, Police, and Ofcom won’t enforce any aspect of communication privacy law against UK ISPs, the choice for anyone who doesn’t like their private/confidential communications being monitored by Virgin/Detica is stark.

    You need to find a new ISP.

    UK regulators will not protect our private/confidential communications from unlawful, unwarranted, and indiscriminate mass interception.

  3. How do they deal with false positives, given that they are likely to be a very large part of this statistical output, and given that the whole datastreams of 40% of all Virgin Media’s total traffic are their input?

    A word to the wise, Richard: you would probably be wise to re-edit your “Virgin” reference to “Virgin Media”, as old BEARDY BRANSON doesn’t own VM; he’s just a shareholder. He did rent his name to NTL/TW for a PR fee every time he appears etc., and he separately leased his Virgin brand to them for 20 years too, of course, hence the rebranding of NTL/TW as “Virgin Media”, but he defends his standalone “Virgin” brand quite hard….

    Still, BEARDY can’t be very happy right now, as the rather creepy ‘Virgin Media Executives advocate stalking UK kids family internet connections’ by giving them Unique ID tags type of bad PR, coming to a website near you soon, will quite possibly be affecting his other “Virgin” brands and share prices in time.

  4. If the public identity can be regenerated when the next IPv4 packet from a specific sender is streamed past the fibre tap, then all you need to do to de-anonymize every recording is stream all packets of the Virgin Media address space past at fibre rates, be they 1Gb/s or 10Gb/s. I don’t see it taking that long.

  5. Does using CView involve interception under RIPA?

    Yes. The question of whether a human needs to see something before it counts as interception is answered in my paper on Phorm at http://www.fipr.org/080423phormlegal.pdf in paragraphs 14 to 17. No human access is necessary – machine examination of content is still interception, and unlawful unless justified.

    It remains to be seen whether a convincing case can be made for an ISP’s need to know how much of its traffic infringes copyright – is this really required for purposes connected with the provision or operation of its service?

  6. @Richard Clayton

    I have a question about the UID Encryption Method.

    Is it possible that the Encrypted form of the key then becomes a “Secondary” but in itself still a Unique Identifier related to a particular user?

    Somewhat similar to a Private/Public key?

  7. @ J D

    Is it possible that the Encrypted form of the key then becomes … a Unique Identifier related to a particular user?

    Yes, that’s the whole point of what Detica want to do. But it is intended that no-one can turn that identifier back into the user account name (unless you operate the system in a completely different way than is designed).

    Somewhat similar to a Private/Public key?

    No, more like a cryptographic hash (or “digest”).

  8. @ Richard,

    If the format is as you indicate,

    then spits out a record containing:

    1, The encrypted (and thus anonymised) customer identity
    2, The type of P2P protocol
    3, The content identifier value
    4, The file size
    5, A timestamp

    Then the data is most definitely not anonymised and “the encrypted customer identity” is not required.

    All that has to happen is the other “service fields” (4&5) be compared to the traffic management and other logs the ISP keeps.

    Also what does “encrypted” really mean…

    For instance the box may not push out the “anonymising key”. But if all it is is a simple encryption of the date and box serial number encrypted against a known master key then it is known without being transmitted.

    I’m not impressed with the way it is done on your brief description, as I know from designing similar “anonymising key” systems the devil is most definitely in the details.

    However, as I said, with fields 4&5 it really is a no-brainer to de-anonymise.

  9. @Clive

    All that has to happen is the other “service fields” (4&5) be compared to the traffic management and other logs the ISP keeps

    ISPs will not have “traffic management and other logs” that give you the total file size. Recall that P2P usually involves many peers so even if they have full Netflow (frankly unlikely, so it’s all moot) you’d be unlikely to reconstruct the filesize especially accurately.

    Also what does “encrypted” really mean…

    A modern crypto algorithm with a key of substantial length.

    But if all it is is …

    It isn’t !

  10. @ Richard,

    ‘A modern crypto algorithm with a key of substantial length.’

    That is a little like saying,

    “The wiring in my house is of modern design with a conductor of substantial CSA”

    It conveys no useful information by which a judgment qualitative or otherwise can be made.

    As I said,

    ‘I know from designing similar “anonymising key” systems the devil is most definitely in the details.’

    Which for an unstated reason you have not provided details.

    Which obviously you or Detica are entitled to do.

    However you also have to grant a similar right to others not to take your unsupported statements as being of any assurance.

    But please do not think I’m being rude or insensitive; I’m concerned that the “vacuum principle” may be applied.
    Which may give rise, based on past applications, to the notion that there is a questionable motive for keeping such details from the public (as has proved to be the case with Phorm).

    This is not helped by your apparently inconsistent statements. You say “encryption” not “hash” in your article but say,

    ‘No, more like a cryptographic hash (or “digest”).’

    In reply to one set of questions but,

    ‘A modern crypto algorithm with a key of substantial length.’

    To another.

    The normal implication of “encryption” is that it is of necessity reversible.

    However the normal implication of “hash” is that it is not of necessity reversible.

    And the normal implication of “cryptographic hash” is that it is not reversible (except under special circumstances).

    I think you will therefore understand why there might be more than the odd questioning eyebrow raised at such apparently inconsistent statements.

  11. @Clive

    It conveys no useful information by which a judgment qualitative or otherwise can be made.

    Yes it does … it says you’re not going to brute force it !

    The normal implication of “encryption” is that it is of necessity reversible.

    This is precisely why I made the point about the key not leaving the box, so that you won’t be able to reverse it unless you have access to the inside of the box. That’s why I think a comparison with a hash is helpful — it’s certainly not a public key system, which was the context in which I made that comparison.

    I think you will therefore understand why there might be more than the odd questioning eyebrow raised at such apparently inconsistent statements.

    I will just invite you to read the whole article again … and perhaps when doing so ask you to make a suitable distinction between a product that Detica is trying to create (doubtless they’d be delighted to answer all your detailed questions, and listen with great interest to your description of their motives); and my role in reporting some of its more interesting aspects — and correcting some false impressions given by their PR team.

  12. @ Richard,

    “Yes it does … it says you’re not going to brute force it !”

    No it does not, not at all.

    All,

    “A modern crypto algorithm with a key of substantial length”

    says in “somebody’s view” the key is of a “substantial length”.

    Some people would say 128 bits for a modern algorithm like AES is OK, but 128 bits for RSA most definitely not.

    So as I said you have not actually said anything by which

    “a judgment qualitative or otherwise can be made”.

    For instance it is not clear if it is a statement from their PR people or your considered opinion after due consideration of the relevant facts.

    Saying,

    “it says you’re not going to brute force it”

    Is only saying that you think the key space is too large to mount the least optimal practical attack.

    Thus even if the key is of sufficient length for the algorithm to prevent a “british museum attack” it does not state anything about how the key is selected in use.

    That is how is it generated, how often is it changed etc etc etc.

    Would all have to be known before a judgment could be made.

    As I said the devil is in the details.

    Now maybe you do not know the details, maybe you know only some of them, maybe they have been shown to you under an NDA or maybe they are open to review to any interested party.

    All we know from your posting is insufficient for “a judgment qualitative or otherwise can be made”.

    With regard to,

    “…ask you to make a suitable distinction between a product that Detica is trying to create… …and correcting some false impressions given by their PR team.”

    Hmm, all you appear to have really said on this is,

    “The links that they [Detica] monitor need not be carrying all of the ISP’s traffic — they merely hope that it will be a statistically significant sample.”, and,

    “The statistics system then scales up its numbers (to adjust for any sampling at the earlier stages) and generates reports and graphs that give information such as the total amount of P2P traffic;” and,

    “then the proportion that is “unlawful file sharing” can only be guessed at. Also, the system cannot even be totally sure that the transfer of a copyrighted file is in fact unlawful”, and,

    “Detica are giving the impression that ISPs will be happy to see their new product. I’m less sure that ISPs actually want to measure this traffic quite so exactly.”

    Let me paraphrase,

    Detica have a system that sees some of an ISP’s traffic, which they hope will be statistically sufficient. This they then scale up to give a figure of the total amount of P2P traffic on the ISP’s network…

    But it only recognises some P2P traffic and may not be able to tell the content and if they can if it is illegal or not…

    Thus, ‘the proportion that is “unlawful file sharing” can only be guessed at.’

    And ‘Detica are giving the impression that ISPs will be happy to see their new product.’.

    Hmm, you give the impression that you are close to saying their product cannot do what they claim, and that you don’t think their product has a market currently.

    And my original concern was that there was a lack of information, and that this “information vacuum” would be filled by unwarranted speculation that could be averted by technical information that would reduce fears; and, if required, the methods involved could easily be strengthened.

    Ahh but I forgot you also said,

    “Finally, the paranoid will observe that minor tweaks to the software will deliver up a first-class monitoring system that can generate reports about unlawful activity by individual users;”

    Some (Detica PR for instance) might think you were actually trying to “fan the flames”.

    And you invited me to do this review of what you had said, knowing that I had said,

    “I think you will therefore understand why there might be more than the odd questioning eyebrow raised at such apparently inconsistent statements.”

    Hmm…

  13. Given the comments elsewhere about whether there’s “a pseudo-random replacement algorithm” used or not, this Detica quote seems to clarify that, and a few other things too.

    Although Christopher does not state the Detica person’s name or position inside the company, of course, so it could be some high-ranking PR personnel, with no way to confirm or corroborate that at this time.

    http://www.christopher-parsons.com/blog/privacy/update-to-virgin-media-and-copyright-dpi/#more-1483
    “In terms of the CView system, let’s first address the concern of anonymization. Specifically, we have to ask how stringent the anonymization system actually is.

    When I asked Detica about this process, they informed me that because the CView device is intended to produce a Copyright Infringement Index (aka the ‘Piracy Index’) by evaluating the overall filesharing on a network that identity information isn’t required for this objective.

    IP addresses are anonymized at the source/DPI device using a pseudo-random replacement algorithm, which also entails ignoring the external IP addresses.

    The key generation system is managed automatically by the device (and thus an ISP can’t muck around with the system), and keys are periodically cycled and redistributed.

    The keys are never made available outside of the device, and once a set of keys for a given time period are discarded they cannot be recovered – the process is irreversible.

    On this basis, we can argue that no subscriber ID is associated with the randomized replacement algorithm, there is no way to associate a subscriber ID with the pseudo-random number after the fact, and as such the anonymization system should serve its purpose.

    Of course, there is a concern that there are no such things as anonymization processes – as noted by Paul Ohm ”
    http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1450006
    “Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization

    Paul Ohm
    University of Colorado Law School

    August 13, 2009

    University of Colorado Law Legal Studies Research Paper No. 09-12 “

  14. @ Ripa,

    “The keys are never made available outside of the device, and once a set of keys for a given time period are discarded they cannot be recovered – the process is irreversible.”

    That is as good a “weasel statement” as I have ever heard 😉

    A known “one way” process fits that description, but it can easily be shown that the information, whilst “not disclosed” and “irreversible”, can still be reproduced by somebody else…

    For example I take known details and “hash” them with MD5 or whatever. Provided I don’t release the hash and delete it at some point then it will meet the criteria of the statement. However if I know the input to the hash algorithm I can reproduce the hash at any time I wish.

    As I said the “devil is in the details”.
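To make the reproducibility point concrete, here is a minimal sketch of the weak case only: an unkeyed MD5 over an address, where the whole input space can be enumerated. Nothing suggests CView works this way; `unkeyed_pseudonym` and `recover` are hypothetical names used for illustration.

```python
import hashlib
import ipaddress
from typing import Optional


def unkeyed_pseudonym(ip: str) -> str:
    # No secret input: anyone who knows the address can recompute this value.
    return hashlib.md5(ip.encode("ascii")).hexdigest()


def recover(pseudonym: str, network: str) -> Optional[str]:
    """Enumerate a known address range until a hash matches."""
    for addr in ipaddress.ip_network(network).hosts():
        if unkeyed_pseudonym(str(addr)) == pseudonym:
            return str(addr)
    return None
```

The hash is never “reversed”; it is simply recomputed for every candidate input, which is cheap when the candidates are an ISP’s address space.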

    There is also the question of “lost information” leading to incorrect results.

    If I change the key I use to make the information anonymous during a monitored transaction how do I know that I’m not,

    1, Losing one or more recordable events.
    2, Double counting one or more recordable events.

    Thus if the key change period is short or made at a busy time then the results could be significantly skewed. How much is dependent on the average number of events in the time period between key changes and the average number of events at the key change time.

    Likewise it is also possible that an event in progress can be tracked across a key change. That is, if the data being downloaded by an individual is known then the anonymous string will change across the key change but the data identifier will not.

    There are many other such problems with this sort of system as I well know and they are very difficult to get right…

  15. @clive

    A known “one way” process fits that description

    However, it doesn’t fit any of the rest of description provided here and on the Christopher Parsons blog, viz: a simple hash is indeed useless (as I think is obvious to every reader); so that’s why there’s a key!

    Thus if the key change period is short or made at a busy time then the results could be significantly skewed.

    A substantive point (at last). But not a very significant one, since you can easily adjust for this bias in and amongst all the other scaling that is going on… all you need to do is to establish the ratio between the key rollover period and the average length of a measured P2P connection.
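A back-of-envelope sketch of that adjustment follows. It assumes connections start at uniformly random times relative to the rollover schedule, and that every rollover crossed turns one connection into one extra apparent record (the double-counting case rather than the lost-event case). The function names are hypothetical.

```python
def expected_records_per_connection(avg_connection_s: float,
                                    key_period_s: float) -> float:
    """Under the uniform-start assumption a connection crosses
    avg_connection_s / key_period_s rollovers on average, and each
    crossing adds one apparent record."""
    if key_period_s <= 0:
        raise ValueError("key period must be positive")
    return 1.0 + avg_connection_s / key_period_s


def adjust_for_rollover(observed: float, avg_connection_s: float,
                        key_period_s: float) -> float:
    # Divide out the expected inflation caused by key changes.
    return observed / expected_records_per_connection(avg_connection_s, key_period_s)
```

With 15-minute connections and hourly rollover the inflation factor is 1.25, so 1250 observed records would be adjusted to 1000.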

  16. @ Richard,

    “viz: a simple hash is indeed useless (as I think is obvious to every reader); so that’s why there’s a key!”

    When I said,

    ‘For example I take known details and “hash” them with MD5 or whatever. Provided I don’t release the hash and delete it at some point then it will meet the criteria of the statement. However if I know the input to the hash algorithm I can reproduce the hash at any time I wish.’

    I was assuming that you would be bright enough to realise I was talking of a simple example to generate a key…

    The point is for there to be a “key” there has to be a “key generation process” (or do you disagree with this on principle?).

    That process can be one way via a hash or whatever and thus cannot be reversed (or do you disagree with this?).

    However if the “key generation process” is deterministic and you wrote the process then chances are you can reproduce the input to the “key generation process” at any time and thus “re-create” the key (or do you disagree with this?).

    Now is that sufficiently well described for you?

    Oh and with regards your,

    “However, it doesn’t fit any of the rest of description provided here and on the Christopher Parsons blog,”

    There is no technical description of a “key generation process” on either your posting or Christopher’s that can be evaluated in any meaningful way, so it is again one of your rather pecunish arguments.

    Oh, and whilst I remember, a happy new year to you and all at the Labs; may it be fruitful for all.

  17. @Clive

    However if the “key generation process” is deterministic and you wrote the process then chances are you can reproduce the input to the “key generation process” at any time and thus “re-create” the key (or do you disagree with this?).

    If it’s deterministic that’s obviously an issue. Detica are of course the people to ask if they have been completely incompetent or not. But I expect they’ll be familiar with the correct use of /dev/random

    Recall that they’re running their system on a “real” computer, not some trivial embedded component. They will have access to considerable amounts of entropy.
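The distinction being argued over can be sketched as follows. Both functions are hypothetical illustrations and neither is a description of what Detica do: the first derives a key from guessable inputs, the second asks the operating system's random source for one via Python's `secrets` module.

```python
import hashlib
import secrets


def deterministic_key(serial: str, date: str) -> bytes:
    # Hypothetical weak scheme: every input is knowable outside the box,
    # so anyone can re-create the key without it ever being transmitted.
    return hashlib.sha256(f"{serial}|{date}".encode("utf-8")).digest()


def random_key() -> bytes:
    # 256 bits from the operating system's CSPRNG; there are no inputs to replay.
    return secrets.token_bytes(32)
```

Calling `deterministic_key` twice with the same serial number and date returns identical bytes, whereas two calls to `random_key` do not; whether the underlying entropy is as good as hoped is the separate question taken up in the next comment.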

  18. @ Richard,

    “But I expect they’ll be familiar with the correct use of /dev/random”

    If you will forgive the pun “the probability is not high”.

    I have as you may remember more than a passing interest in RNGs in their various forms for generating “noise” for various activities involving money systems (electronic purses / hand held electronic betting devices / transfer authentication systems /anonymity systems / micropayment systems ).
    And sadly I was successfully attacking both the hardware and software RNG’s in them back in the 1980’s with external RF sources modulated with various “fault injection” wave forms (and I can safely say that “naff all” has happened since then in the way of improvement or understanding of the issues).

    (For those who want to see some recent published work, have a hunt on this site: there is some work showing a supposedly secure 32-bit RNG can be trivially reduced down to an 8-bit RNG without the use of modulating “fault injection” waveforms, and also a bit of discussion of future directions for students wanting to get papers published).

    Just about every system I have looked at since is gamable in some way or another.

    One thing that is rarely considered is how “efficiency” compromises “security”. The simple fact is the more efficient a system is designed to be the more avenues for attack or inadvertent side channels occur.

    One documented problem is that of a CPU “cache” leaking AES key bits onto the network.

    What is not made clear in the majority of “papers” is just how easily time-based attacks move up and down the system stack in modern systems.

    Which brings me around to your comment,

    “Recall that they’re running their system on a “real” computer, not some trivial embedded component. They will have access to considerable amounts of entropy.”

    Hmm, I admire your confidence in making such a statement, especially considering the work of various of your colleagues at the lab.

    Without technical details it is very difficult to say, and I very much doubt they would provide them to me (as I will certainly not sign NDA’s these days, having had one used to “gag me” in the past).

    I suspect that the entropy in the system is most likely to be from external “controllable” events like network traffic. And unless specific precautions have been taken by the designers then controlling the data rate into the system can remove all the entropy from that and also allow internal entropy from clock drift etc to be very accurately predicted.

    One of the reasons I prefer “embedded” components over “real” computers is they are much easier to design to reduce the opportunities for “gaming”.

    Most OS’s in “real” computers are designed for “specmanship”; even the separation kernel OS’s for high assurance systems are still relatively easy to game in one way or another via covert channels (usually but not always time based).

    The question many are likely to ask at this point is “so what? Who’s going to go to the effort? there’s no incentive / pay off”.

    Unfortunately history shows that “re-use” is a very major security weakness (see Microsoft et al ad nauseam).

    The system they have designed is unlikely to be gamed, however having gone to the effort of developing their “key” management system they are likely to carry it forward into many other products.

    The probability is that some of these future systems will have incentive / pay off thus if there are any weaknesses now they will be carried forward until the weaknesses become publicly and painfully clear.

    As has been observed hundreds of years ago,

    “Oft’ the ship is lost for a ha’penth of tar”,

    With the obvious prevention of,

    “A stitch in time saves nine”.

    However, the little that has been said by you and others on the Detica system, and the “open work” carried out by others at Cambridge Labs, should give significant pause for thought.

    My view on it is that there is way too little information to say the system is not gameable and thus the premise for the system’s “anonymity” ability is at best highly questionable, based on industry experience to date.

    However if Detica want to get in contact with me in an “open way” I’d be more than happy to look over their design as the process if done in an “open way” would raise everybody’s boat.

  19. Is it time to look at TalkTalk and their stalkstalk system yet? It looks like interception to me.
