Three Paper Thursday: Applying natural language processing to underground forums

Underground forums host discussions and advertisements on a wide range of topics, including general chatter, hacking tutorials, and sales of items on marketplaces. While off-the-shelf natural language processing (NLP) techniques may be applied in this domain, the models behind them are typically trained on standard corpora such as news articles and Wikipedia.

It isn’t clear how well these models perform on the noisy text found on underground forums, which contains an evolving domain-specific lexicon, misspellings, slang, jargon, and acronyms. I explored this problem with colleagues from the Cambridge Cybercrime Centre and the Computer Laboratory, developing a tool for detecting bursty trending topics using a Bayesian log-odds approach. The approach uses a prior distribution over past forum activity to detect changes in the vocabulary used on forums, while filtering out consistently used jargon and slang. The paper has been accepted to the 2020 Workshop on Noisy User-Generated Text (ACL) and the preprint is available online.
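The paper’s model is more involved, but the core idea can be sketched in a few lines. Below is a minimal illustration of prior-smoothed log-odds in the style of Monroe et al.’s “Fightin’ Words” estimator; the function name, the `alpha0` smoothing parameter, and the plain-dictionary inputs are my own assumptions, not the paper’s exact design:

```python
from math import log, sqrt

def weighted_log_odds(window, background, prior, alpha0=100.0):
    """Z-scored weighted log-odds of each term in the current time window
    versus a background window, smoothed with a Dirichlet prior built from
    long-run counts. Terms that are common in the prior -- stable jargon
    and slang -- are damped, while genuinely bursty terms score highly.
    All three arguments map terms to raw frequency counts."""
    n1, n2 = sum(window.values()), sum(background.values())
    prior_total = sum(prior.values())
    scores = {}
    for w, y1 in window.items():
        a_w = alpha0 * prior.get(w, 0.5) / prior_total  # prior pseudo-count
        y2 = background.get(w, 0)
        delta = (log((y1 + a_w) / (n1 + alpha0 - y1 - a_w))
                 - log((y2 + a_w) / (n2 + alpha0 - y2 - a_w)))
        variance = 1.0 / (y1 + a_w) + 1.0 / (y2 + a_w)
        scores[w] = delta / sqrt(variance)
    return scores
```

Ranking terms by this score surfaces words whose usage has burst relative to the background, while a term that is equally common in the prior corpus gets a large pseudo-count and is pulled back towards zero.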

Other more commonly used approaches of identifying known and emerging trends range from simple keyword detection using a dictionary of known terms, to statistical methods of topic modelling including TF-IDF and Latent Dirichlet Allocation (LDA). In addition, the NLP landscape has been changing over the last decade [1], with a shift to deep learning using neural models, such as word2vec and BERT.

In this Three Paper Thursday, we look at how past papers have used different NLP approaches, from statistical techniques to word embeddings, to analyse posts on underground forums: identifying and defining new terms, generating relevant warnings even when the jargon is unknown, and identifying similar threads even when the relevant keywords are not known.

[1] Gregory Goth. 2016. Deep or shallow, NLP is breaking out. Commun. ACM 59, 3 (March 2016), 13–16. DOI:https://doi.org/10.1145/2874915

DISCOVER: Mining Online Chatter for Emerging Cyber Threats

Anna Sapienza, Sindhu Kiranmai Ernala, Alessandro Bessi, Kristina Lerman, and Emilio Ferrara. 2018. DISCOVER: Mining Online Chatter for Emerging Cyber Threats. In WWW ’18 Companion: The 2018 Web Conference Companion, April 23–27, 2018, Lyon, France. ACM, New York, NY, USA, 8 pages. https://doi.org/10.1145/3184558.3191528

The first paper proposes a pipelined system that collates data from dark web forums, social media, and blogs to create warnings of emerging cyber threats. The pipeline removes known English and domain-specific terms from the ingested content to isolate candidate emerging terms. A warning is triggered when the frequency of such a term increases while it appears in proximity to terms from a dictionary of known vocabulary. After a warning is triggered, a timeline is built displaying the frequency of the term across the various input sources, providing an overview of potential emerging threats.

Pre-processing of text in the system takes a rule-based approach: removing URLs and symbols, then filtering out known words from both an English dictionary and a dictionary of stop words (e.g. “and”, “on”). After pre-processing, posts containing the left-over words are matched against a dictionary of known domain vocabulary: left-over terms that co-occur with known terms are flagged as warnings. Although emerging terms are found statistically, the dictionary-based design reduces the effort needed to explain detected words: a researcher has access to the dictionaries used in the pipeline, and can determine why a given term did or did not trigger a warning. As with many warning systems, the system would also need tuning, by setting a threshold or rank order for detected emerging threats, to avoid overwhelming the researcher with too much information.
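The dictionary-filtering step is straightforward to sketch. The following is a simplified illustration, not DISCOVER’s actual implementation: the function name and data structures are assumptions, and the paper’s tracking of term frequency over time is omitted:

```python
import re

URL_RE = re.compile(r"https?://\S+")

def flag_emerging_terms(posts, english_words, stop_words, domain_vocab):
    """Flag posts in which an unknown left-over term co-occurs with known
    domain vocabulary. The three dictionaries are plain sets of
    lower-cased words."""
    warnings = []
    for post in posts:
        text = URL_RE.sub(" ", post.lower())
        tokens = re.findall(r"[a-z0-9][a-z0-9'\-]*", text)  # strips symbols
        left_over = {t for t in tokens
                     if t not in english_words
                     and t not in stop_words
                     and t not in domain_vocab}
        known = {t for t in tokens if t in domain_vocab}
        if left_over and known:  # unknown terms near known domain terms
            warnings.append((left_over, known, post))
    return warnings
```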

Automatically Identifying and Understanding Dark Jargons from Cybercrime Marketplaces

Kan Yuan, Haoran Lu, Xiaojing Liao, and XiaoFeng Wang. 2018. Reading Thieves’ cant: automatically identifying and understanding dark jargons from cybercrime marketplaces. In Proceedings of the 27th USENIX Conference on Security Symposium (SEC’18). USENIX Association, USA, 1027–1041. https://www.usenix.org/conference/usenixsecurity18/presentation/yuan-kan

The first paper used a dictionary of known words and analysed posts with a “bag-of-words” approach (each post is treated as an unordered collection of words).

In the second paper, the authors carry out a similar task, but identify and understand dark jargon automatically using word embeddings: vector representations of words that capture the context in which they appear. They use this approach to find repurposed words that do not appear in the same context across corpora, by comparing how words are used in cybercrime marketplace posts with how they are used on Reddit. The authors chose Reddit over Wikipedia and other websites because the language used on the discussion platform is more informal.
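The paper defines its own model for comparing embeddings across corpora, since vectors from independently trained models live in different spaces and cannot be compared directly. As a rough proxy for the idea, one can compare a word’s nearest-neighbour sets in two separately trained word2vec models; the sketch below uses gensim, and the function name and parameter values are assumptions:

```python
from gensim.models import Word2Vec

def neighbour_overlap(word, model_a, model_b, k=20):
    """Jaccard overlap of a word's nearest neighbours in two separately
    trained embedding spaces. A low overlap suggests the word is used
    differently in corpus A (e.g. a marketplace) than in corpus B
    (e.g. Reddit) -- a hint that it may be repurposed jargon."""
    if word not in model_a.wv or word not in model_b.wv:
        return None
    nn_a = {w for w, _ in model_a.wv.most_similar(word, topn=k)}
    nn_b = {w for w, _ in model_b.wv.most_similar(word, topn=k)}
    return len(nn_a & nn_b) / len(nn_a | nn_b)

# Hypothetical usage, where each corpus is a list of tokenised posts:
# market_model = Word2Vec(market_posts, vector_size=100, window=5, min_count=5)
# reddit_model = Word2Vec(reddit_posts, vector_size=100, window=5, min_count=5)
# print(neighbour_overlap("rat", market_model, reddit_model))
```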

In addition to detecting new jargon, the authors also propose a technique to understand its meaning. This involves detecting hypernyms: more general words that describe a given jargon term, e.g. “ransomware” is (considered to be) a hypernym of “WannaCry”. The authors used Wikidata to build a tree of known hypernyms, which are matched to jargon using word embeddings.
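As a loose sketch of the matching step (an illustration of embedding-based matching, not the paper’s exact procedure), suppose a list of candidate hypernyms has already been extracted from the Wikidata tree:

```python
def best_hypernym(jargon, candidates, dark_model):
    """Rank candidate hypernyms (e.g. taken from a Wikidata-derived tree)
    by cosine similarity to the jargon term in the dark-corpus embedding
    space, and return the best match."""
    if jargon not in dark_model.wv:
        return None
    scored = [(c, dark_model.wv.similarity(jargon, c))
              for c in candidates if c in dark_model.wv]
    return max(scored, key=lambda pair: pair[1]) if scored else None
```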

While this approach is useful for identifying and understanding jargon, the added complexity of the neural language model (word2vec) means it does not provide transparent explanations for its definitions, and embeddings trained on different corpora are not directly comparable, a problem the authors solve by defining a dedicated model for comparisons. Also, trained word vectors are biased towards their specific corpus, so the authors reduced the size of the context window around each word to limit this bias.

REST: A Thread Embedding Approach for Identifying and Classifying User-Specified Information in Security Forums

Gharibshah, J., Papalexakis, E. E., & Faloutsos, M. (2020). REST: A Thread Embedding Approach for Identifying and Classifying User-Specified Information in Security Forums. Proceedings of the International AAAI Conference on Web and Social Media, 14(1), 217-228. Retrieved from https://www.aaai.org/ojs/index.php/ICWSM/article/view/7293

Automatically identifying threads of interest is another task useful to security researchers analysing underground and dark web discussion forums. In the third paper, the authors propose an approach that complements a small, researcher-defined dictionary of known terms, used to identify matching threads, with a similarity metric that surfaces additional relevant threads. The similarity metric operates in an embedding space built from word vectors. The pipeline first uses keyword filtering against a domain dictionary to find threads, combines these with detected similar threads, and finally classifies the collection of threads into known classes (hacks, services, alerts, and experiences).

While keyword-based filtering is a simple rule-based approach, the set of keywords may not be complete, and variations of words may not be detected, including both spelling mistakes and intentional changes. To solve this problem, the authors introduced a second step to measure the similarity between threads by capturing context using a skip-gram model.
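REST’s thread embedding is more sophisticated than this, but the expansion step can be illustrated with a common baseline: embed each thread as the average of its skip-gram word vectors, then pull in candidate threads that are close to any keyword-matched seed thread. The function names and the similarity threshold below are assumptions:

```python
import numpy as np

def thread_vector(tokens, model):
    """Embed a thread as the mean of its in-vocabulary skip-gram word
    vectors (a simple stand-in for REST's thread embedding)."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    return np.mean(vecs, axis=0) if vecs else None

def expand_by_similarity(seed_vecs, candidate_vecs, threshold=0.8):
    """Return indices of candidate threads whose cosine similarity to any
    keyword-matched seed thread reaches the threshold."""
    hits = set()
    for i, cand in enumerate(candidate_vecs):
        for seed in seed_vecs:
            sim = (np.dot(cand, seed)
                   / (np.linalg.norm(cand) * np.linalg.norm(seed)))
            if sim >= threshold:
                hits.add(i)
                break
    return hits
```

Because similarity is measured in the embedding space rather than over surface forms, misspelled or deliberately obfuscated variants of a keyword can still place a thread near the seed threads.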

The system is evaluated against five other methods, including word2vec and a fine-tuned BERT model. When identifying threads using a set of keywords plus the proposed similarity metric, the authors observed that the number of threads found through similarity did not change linearly with the number of keywords used. They found BERT performed best for the classification task on one forum, while their REST method outperformed BERT on the other two. Overall, this work shows how an embedding space over keywords and threads can help identify relevant threads from a small set of initial keywords.
