For a slightly different Three Paper Thursday, I’m pulling together some of the work done by our Centre and others around the COVID-19 pandemic and how it, and government responses to it, are reshaping the cybercrime landscape.
The first thing to note is that an academic consensus appears to be emerging that the pandemic, or more accurately, lockdowns and social distancing, have indeed substantially changed the topology of crime in contemporary societies, leading to an increase in cybercrime and online fraud. The second is that this large-scale increase in cybercrime appears to be the result of a growth in existing cybercrime phenomena rather than the emergence of qualitatively new exploits, scams, attacks, or crimes. This invites reconsideration not only of our understandings of cybercrime and its relation to space, time, and materiality, but also of our understandings of what to do about it.
Continue reading Three paper Thursday: COVID-19 and cybercrime
Underground forums contain discussions and advertisements of various topics, including general chatter, hacking tutorials, and sales of items on marketplaces. While off-the-shelf natural language processing (NLP) techniques may be applied in this domain, they are often trained on standard corpora such as news articles and Wikipedia.
It isn’t clear how well these models perform on the noisy text found on underground forums, which contains an evolving domain-specific lexicon, misspellings, slang, jargon, and acronyms. I explored this problem with colleagues from the Cambridge Cybercrime Centre and the Computer Laboratory, developing a tool for detecting bursty trending topics using a Bayesian log-odds approach. The approach uses a prior distribution to detect changes in the vocabulary used in forums, which filters out consistently used jargon and slang. The paper has been accepted to the 2020 Workshop on Noisy User-Generated Text (ACL) and the preprint is available online.
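To make the idea concrete, here is a minimal sketch of one common formulation of log-odds with a prior: the z-scored log-odds ratio with Dirichlet pseudo-counts. This is an assumption for illustration and not necessarily the exact method in the paper; the function name `log_odds_z_scores`, the `prior_scale` parameter, and the toy tokens are all hypothetical. Terms used at a steady rate in both windows score near or below zero, while terms that suddenly appear in the recent window score highly.

```python
import math
from collections import Counter


def log_odds_z_scores(recent_tokens, background_tokens, prior_scale=10.0):
    """Z-scored log-odds of each term in a recent window vs. a background window.

    The prior is a set of pseudo-counts proportional to each term's overall
    frequency (an illustrative choice, not taken from the paper).
    Assumes the combined vocabulary has at least two distinct terms.
    """
    recent = Counter(recent_tokens)
    background = Counter(background_tokens)
    vocab = set(recent) | set(background)

    n_recent = sum(recent.values())
    n_background = sum(background.values())
    total = n_recent + n_background

    # Pseudo-counts: prior_scale spread over the vocabulary by overall frequency.
    prior = {w: prior_scale * (recent[w] + background[w]) / total for w in vocab}
    prior_total = sum(prior.values())

    scores = {}
    for w in vocab:
        a = recent[w] + prior[w]          # smoothed count in the recent window
        b = background[w] + prior[w]      # smoothed count in the background
        log_odds_recent = math.log(a / (n_recent + prior_total - a))
        log_odds_background = math.log(b / (n_background + prior_total - b))
        delta = log_odds_recent - log_odds_background
        variance = 1.0 / a + 1.0 / b      # approximate variance of delta
        scores[w] = delta / math.sqrt(variance)
    return scores


# Toy data: steady jargon in both windows, plus one term that bursts recently.
background = ["rat", "crypter", "vps"] * 100
recent = ["rat", "crypter", "vps"] * 10 + ["covid"] * 15
scores = log_odds_z_scores(recent, background)
for term, z in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {z:+.2f}")
```

Ranking by z-score rather than raw log-odds penalises terms that are rare in both windows, which helps with the misspellings and one-off slang typical of forum text.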
Other more commonly used approaches to identifying known and emerging trends range from simple keyword detection using a dictionary of known terms, to statistical methods of topic modelling including TF-IDF and Latent Dirichlet Allocation (LDA). In addition, the NLP landscape has been changing over the last decade, with a shift to deep learning using neural models, such as word2vec and BERT.
In this Three Paper Thursday, we look at how past papers have used different NLP approaches to analyse posts in underground forums, from statistical techniques to word embeddings, for identifying and defining new terms, generating relevant warnings even when the jargon is unknown, and identifying similar threads despite relevant keywords not being known.
Gregory Goth. 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59, 3 (March 2016), 13–16. DOI: https://doi.org/10.1145/2874915
Continue reading Three Paper Thursday: Applying natural language processing to underground forums
One would be hard pressed to find an aspect of life where networks are not present. Interconnections are at the core of complex systems – such as society, or the world economy – allowing us to study and understand their dynamics. Some of the most transformative technologies are based on networks, be they hypertext documents making up the World Wide Web, interconnected networking devices forming the Internet, or the various neural network architectures used in deep learning. Social networks that are formed based on our interactions play a central role in our everyday lives; they determine how ideas and knowledge spread and they affect behaviour. This is also true for cybercriminal networks present on underground forums, and social network analysis provides valuable insights into how these communities operate, whether on the dark web or the surface web.
For today’s post in the series ‘Three Paper Thursday’, I’ve selected three papers that highlight the valuable information we can learn from studying underground forums if we model them as networks. Network topology and large-scale structure provide insights into information flow and interaction patterns. These properties, along with discovering central nodes and the roles they play in a given community, are useful not only for understanding the dynamics of these networks but also for practical purposes, such as devising disruption strategies.
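As a small illustration of what "central nodes" can mean in practice, the sketch below builds a reply network from (replier, original poster) pairs and computes normalised in-degree centrality: the share of other members who have replied to each member. The data model, the function name `in_degree_centrality`, and the toy usernames are assumptions for illustration; the papers discussed use a range of other measures as well.

```python
from collections import defaultdict


def in_degree_centrality(replies):
    """Fraction of other members who replied to each member at least once.

    replies: iterable of (replier, original_poster) pairs (hypothetical format).
    """
    repliers = defaultdict(set)
    members = set()
    for replier, poster in replies:
        members.update((replier, poster))
        if replier != poster:             # ignore self-replies
            repliers[poster].add(replier)

    n = len(members)
    if n < 2:
        return {m: 0.0 for m in members}
    # Normalise by the maximum possible number of distinct repliers.
    return {m: len(repliers[m]) / (n - 1) for m in members}


# Toy forum: everyone replies to "alice", and "alice" replies to "bob".
replies = [("bob", "alice"), ("carol", "alice"), ("dave", "alice"),
           ("alice", "bob"), ("bob", "alice")]
for member, score in sorted(in_degree_centrality(replies).items()):
    print(member, round(score, 2))
```

Members with high scores attract replies from much of the community, which makes them natural candidates to examine when thinking about disruption.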
Continue reading Three Paper Thursday – Analysing social networks within underground forums
In previous work we have shown how stolen bitcoins can be traced if we simply apply existing law. If bitcoins are “mixed”, that is to say if multiple actors pool together their coins in one transaction to obfuscate which coins belong to whom, then the precedent in Clayton’s Case says that FIFO ordering must be used to track which fragments of coin are tainted. If the first input satoshi (atomic unit of Bitcoin) was stolen then the first output satoshi should be marked stolen, and so on.
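The FIFO rule can be sketched for a single transaction as follows. This is a simplified model written for illustration, not Taintchain's actual code: the function name `fifo_taint` and the tuple representation are assumptions, and any input value left over after the outputs are filled is treated as the transaction fee.

```python
def fifo_taint(inputs, outputs):
    """Propagate taint through one transaction in first-in, first-out order.

    inputs:  ordered list of (satoshis, is_tainted) pairs
    outputs: ordered list of satoshi amounts
    Returns, for each output, a list of (satoshis, is_tainted) segments.
    """
    if sum(outputs) > sum(amount for amount, _ in inputs):
        raise ValueError("outputs exceed inputs")

    queue = [[amount, tainted] for amount, tainted in inputs if amount > 0]
    pos = 0                                # index of the input being consumed
    result = []
    for out_amount in outputs:
        segments = []
        remaining = out_amount
        while remaining > 0:
            available, tainted = queue[pos]
            take = min(available, remaining)
            # Merge with the previous segment if the taint status matches.
            if segments and segments[-1][1] == tainted:
                segments[-1] = (segments[-1][0] + take, tainted)
            else:
                segments.append((take, tainted))
            queue[pos][0] -= take
            remaining -= take
            if queue[pos][0] == 0:
                pos += 1                   # this input is fully spent
        result.append(segments)
    return result


# 30 stolen satoshis enter first, followed by 70 clean ones.
print(fifo_taint([(30, True), (70, False)], [50, 40]))
```

In this example the first output absorbs all 30 stolen satoshis plus 20 clean ones, the second output is entirely clean, and the final 10 satoshis go to the fee.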
This led us to design Taintchain, a system for tracing stolen coins through the Bitcoin network. However, we quickly discovered a problem: while it was now possible to trace coins, it was harder to spot patterns. A decent way of visualizing the data is important to make sense of the patterns of splits and joins that are used to obfuscate bitcoin transactions. We therefore designed a visualization tool that interactively expands the taint graph based on user input. We first came up with a way to represent transactions and their associated taints in a temporal graph. After realizing the sheer number of hops that some satoshis go through and the high outdegree of some transactions, we came up with a way to do graph generation on-the-fly while assuming some restrictions on maximum hop length and outdegree.
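One way to picture on-the-fly generation under these restrictions is a breadth-first expansion that fetches successors lazily and stops at a hop limit and an outdegree cap. The sketch below is an assumption about how such a restriction might look, not the tool's implementation: `expand_taint_graph`, the `get_spending_txs` callback, and the default limits are all hypothetical.

```python
from collections import deque


def expand_taint_graph(start_tx, get_spending_txs, max_hops=3, max_outdegree=5):
    """Expand the graph from start_tx, returning (parent, child) edges.

    get_spending_txs(tx) is a caller-supplied lookup (hypothetical) that
    returns an ordered list of transactions spending tx's outputs.
    """
    edges = []
    seen = {start_tx}
    frontier = deque([(start_tx, 0)])
    while frontier:
        tx, hops = frontier.popleft()
        if hops >= max_hops:
            continue                       # do not expand beyond the hop limit
        # Cap the fan-out so high-outdegree transactions stay manageable.
        for child in get_spending_txs(tx)[:max_outdegree]:
            edges.append((tx, child))
            if child not in seen:
                seen.add(child)
                frontier.append((child, hops + 1))
    return edges


# Toy transaction graph standing in for a real blockchain lookup.
toy_graph = {"t0": ["t1", "t2", "t3"], "t1": ["t4"], "t4": ["t5"], "t5": ["t6"]}
print(expand_taint_graph("t0", lambda tx: toy_graph.get(tx, []),
                         max_hops=2, max_outdegree=2))
```

Because successors are only looked up when a node is expanded, the cost is bounded by the limits rather than by the size of the whole taint graph, and a user could raise the limits for a particular node of interest.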
Using this tool, we were able to spot many of the common tricks used by bitcoin launderers. A summary of our findings can be found in the short paper here.