All posts by Jack Hughes

Three Paper Thursday: Applying natural language processing to underground forums

Underground forums contain discussions and advertisements on a range of topics, including general chatter, hacking tutorials, and sales of items on marketplaces. While off-the-shelf natural language processing (NLP) techniques may be applied in this domain, they are often trained on standard corpora such as news articles and Wikipedia.

It isn’t clear how well these models perform on the noisy text found on underground forums, which contains an evolving domain-specific lexicon, misspellings, slang, jargon, and acronyms. I explored this problem with colleagues from the Cambridge Cybercrime Centre and the Computer Laboratory by developing a tool that detects bursty trending topics using a Bayesian log-odds approach. The approach uses a prior distribution to detect changes in the vocabulary used in forums, which filters out consistently used jargon and slang. The paper has been accepted to the 2020 Workshop on Noisy User-Generated Text (ACL) and the preprint is available online.
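To make the idea concrete, here is a minimal sketch of one common formulation of log-odds with a prior: the weighted log-odds ratio with a Dirichlet prior. It is an illustration only and may differ from the method in the paper. The function name, the `prior_scale` parameter, and the choice to build the prior from pooled counts are all assumptions made for this example.

```python
import math
from collections import Counter


def log_odds_z_scores(recent_tokens, background_tokens, prior_scale=10.0):
    """Score each term by how much more likely it is in the recent window.

    Hypothetical helper: uses a Dirichlet prior whose pseudo-counts are
    proportional to each term's pooled frequency (an assumption).
    """
    recent = Counter(recent_tokens)
    background = Counter(background_tokens)
    n_recent = sum(recent.values())
    n_background = sum(background.values())
    total = n_recent + n_background
    alpha_0 = prior_scale  # the pseudo-counts below sum to prior_scale

    scores = {}
    for term in set(recent) | set(background):
        # Prior pseudo-count for this term.
        alpha = prior_scale * (recent[term] + background[term]) / total
        y_r = recent[term] + alpha
        y_b = background[term] + alpha
        # Smoothed log-odds of the term in each window.
        log_odds_recent = math.log(y_r / (n_recent + alpha_0 - y_r))
        log_odds_background = math.log(y_b / (n_background + alpha_0 - y_b))
        delta = log_odds_recent - log_odds_background
        # Approximate variance of the difference, used to get a z-score.
        variance = 1.0 / y_r + 1.0 / y_b
        scores[term] = delta / math.sqrt(variance)
    return scores


# Toy data: "lol" and "hack" are used consistently, "newexploit" is new.
background_window = ["lol"] * 50 + ["hack"] * 50
recent_window = ["lol"] * 5 + ["hack"] * 5 + ["newexploit"] * 10
z_scores = log_odds_z_scores(recent_window, background_window)
trending = sorted(z_scores, key=z_scores.get, reverse=True)
```

In this toy example the consistently used terms receive negative scores while the new term ranks first, which is the filtering behaviour described above: jargon that is always present does not register as a trend.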

Other, more commonly used approaches to identifying known and emerging trends range from simple keyword detection using a dictionary of known terms to statistical methods of topic modelling, including TF-IDF and Latent Dirichlet Allocation (LDA). In addition, the NLP landscape has been changing over the last decade [1], with a shift to deep learning using neural models, such as word2vec and BERT.

In this Three Paper Thursday, we look at how past papers have used different NLP approaches to analyse posts in underground forums, from statistical techniques to word embeddings, for identifying and defining new terms, generating relevant warnings even when the jargon is unknown, and identifying similar threads even when the relevant keywords are not known.

[1] Gregory Goth. 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59, 3 (March 2016), 13–16. DOI:https://doi.org/10.1145/2874915


From Playing Games to Committing Crimes: A Multi-Technique Approach to Predicting Key Actors on an Online Gaming Forum

I recently travelled to Pittsburgh, USA, to present the paper “From Playing Games to Committing Crimes: A Multi-Technique Approach to Predicting Key Actors on an Online Gaming Forum” at eCrime 2019, co-authored with Ben Collier and Alice Hutchings. The accepted version of the paper can be accessed here.

The structure and content of various underground forums have been studied in the literature, from threat detection to the classification of marketplace advertisements. These platforms can provide a mechanism for knowledge sharing and a marketplace between cybercriminals and other members.

However, gaming-related activity on underground hacking forums has been largely unexplored. Meanwhile, UK law enforcement believe there is a potential link between playing online games and committing cybercrime: a possible cybercrime pathway. A small-scale study by the NCA found that users looking for gaming cheats on these types of forums can end up interacting with users involved in cybercrime, leading to a possible first offence, followed by escalating levels of offending. There has also been interest from UK law enforcement in exploring intervention activities which aim to deter gamers from becoming involved in cybercrime.

We begin to explore this by presenting a data processing pipeline framework, used to identify potential key actors on a gaming-specific forum by applying predictive and clustering methods to an initial set of key actors. We adapt open-source tools created for the analysis of an underground hacking forum and apply them to this forum. In addition, we add NLP features and machine learning models, and use group-based trajectory modelling.

From this, we can begin to characterise key actors, both by looking at the distributions of predictions and by inspecting each of the models used. Social network analysis, built using author-replier relationships, shows that key actors and predicted key actors are well connected. Group-based trajectory modelling shows that a much higher proportion of key actors fall into both a high-frequency super-engager trajectory in the gaming category and a high-frequency super-engager posting-activity trajectory in the general category.
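As a rough illustration of how a network can be built from author-replier relationships, the sketch below links each replier to the author they replied to and then computes degree centrality as one simple measure of how well connected a member is. The function names and the toy data are hypothetical, and degree centrality is an assumption made for this example; the network measures used in the paper may differ.

```python
from collections import defaultdict


def build_reply_graph(replies):
    """Build an undirected graph from (replier, author) pairs.

    Hypothetical helper: each pair means `replier` posted a reply
    to a post or thread written by `author`.
    """
    neighbours = defaultdict(set)
    for replier, author in replies:
        if replier == author:
            continue  # ignore members replying to themselves
        neighbours[replier].add(author)
        neighbours[author].add(replier)
    return neighbours


def degree_centrality(neighbours):
    """Fraction of the other members each member is directly linked to."""
    n = len(neighbours)
    if n < 2:
        return {node: 0.0 for node in neighbours}
    return {node: len(adj) / (n - 1) for node, adj in neighbours.items()}


# Toy data: three members reply to "alice", and "carol" also replies to "bob".
replies = [
    ("bob", "alice"),
    ("carol", "alice"),
    ("dave", "alice"),
    ("carol", "bob"),
]
graph = build_reply_graph(replies)
centrality = degree_centrality(graph)
most_connected = max(centrality, key=centrality.get)
```

Here `most_connected` is the member linked to the largest share of other members, which is one simple way to read "well connected" in a reply network.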

This work provides an initial look into a perceived link between playing online games and committing cybercrime by analysing an underground forum focused on cheats for games.