Machine Learning Extracts Attack Data From Verbose Threat Reports

On May 1, 2021

New research out of the University of Illinois at Chicago illustrates the conflict that has arisen in the past ten years between the SEO benefits of long-form content and the difficulty that machine learning systems have in gleaning essential data from it.

In developing an NLP analysis system to extract essential threat information from Cyber Threat Intelligence (CTI) reports, the Chicago researchers faced three problems: the reports are usually very long, with only a small section dedicated to the actual attack behavior; the style is dense and grammatically complex, with extensive domain-specific information that presumes prior knowledge on the part of the reader; and the material requires cross-domain relationship knowledge, which must be ‘memorized’ to understand it in context (a persistent problem, the researchers note).

Long-Winded Threat Reports

The primary problem is verbosity. For example, the Chicago paper notes that in ClearSky’s 42-page 2019 threat report for the DustySky (aka NeD Worm) malware, a mere 11 sentences actually describe the attack behavior.

The second obstacle is text complexity and, in effect, sentence length: the researchers observe that across 4020 threat reports from Microsoft’s threat report center, the average sentence comprises 52 words, only nine short of the average sentence length 500 years ago (even though sentence length in general has declined 75% since then).

However, the paper contends that these long sentences are essentially ‘compressed paragraphs’ in themselves, full of clauses, adverbs and adjectives that shroud the core meaning of the information; and that the sentences often lack the basic conventional punctuation which NLP systems such as spaCy, Stanford and NLTK rely on to infer intent or extract hard data.
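The dependence on punctuation is easy to see in a minimal sketch. The splitter below is a deliberately naive stand-in, not the segmenter used by spaCy, NLTK or the paper, and the two sample strings are invented for illustration. It breaks text after '.', '!' or '?' and reports the average words per sentence; when the period is missing, two statements collapse into one longer 'sentence'.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def average_sentence_length(text):
    """Mean number of whitespace-separated words per detected sentence."""
    sentences = split_sentences(text)
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


# Invented sample text: the same content with and without a sentence break.
punctuated = "The dropper writes a file. The file contacts a server."
unpunctuated = "The dropper writes a file the file contacts a server"

print(average_sentence_length(punctuated))    # two short sentences
print(average_sentence_length(unpunctuated))  # one merged sentence
```

A real pipeline would need something more robust than this, which is the paper's point: when the punctuation cue is absent, boundary detection has to come from elsewhere.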

NLP To Extract Salient Threat Information

The machine learning pipeline that the Chicago researchers have developed to address this is called EXTRACTOR, and uses NLP techniques to generate graphs which distill and summarize attack behavior from long-form, discursive reports. The process discards the historical, narrative and even geographical ornamentation that creates an engaging and exhaustive ‘story’ at the expense of clearly prioritizing the informational payload.

Source: https://arxiv.org/pdf/2104.08618.pdf

Since context is such a challenge in verbose CTI reports, the researchers chose the BERT (Bidirectional Encoder Representations from Transformers) language representation model over Google’s Word2Vec or Stanford’s GloVe (Global Vectors for Word Representation).

BERT evaluates words from their surrounding context, and also develops embeddings for subwords (e.g. launch, launching and launches all share the subword launch). This helps EXTRACTOR to cope with technical vocabulary that is not present in BERT’s training vocabulary, and to classify sentences as ‘productive’ (containing pertinent information) or ‘non-productive’.
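The subword idea can be sketched with a toy version of the greedy longest-match segmentation that WordPiece-style tokenizers use. The vocabulary below is invented for illustration and is not BERT's actual vocabulary, so a real BERT tokenizer may split these words differently; the point is only that related word forms, and unseen technical terms, fall back to shared pieces.

```python
def wordpiece_tokenize(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first segmentation of a single word.

    Pieces that continue a word carry a '##' prefix, as in WordPiece.
    Returns [unk] if some part of the word cannot be covered by the vocabulary.
    """
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, not taken from any real model.
TOY_VOCAB = {"launch", "##ing", "##es", "drop", "##per"}

for w in ["launch", "launching", "launches", "dropper"]:
    print(w, wordpiece_tokenize(w, TOY_VOCAB))
```

Here all three forms of 'launch' begin with the same piece, so whatever the model has learned about that piece carries over to each of them.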

Increasing Local Vocabulary

Inevitably some specific domain insight must be integrated into an NLP pipeline dealing with material of this kind, since highly pertinent word forms such as IP addresses and technical process names must not be cast aside.
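One common way to keep such tokens intact is to mask them with placeholders before the text is processed, then restore them afterwards. The sketch below illustrates that general idea only and is not the mechanism described in the paper; the two regular expressions, the placeholder format and the sample sentence are all assumptions chosen for brevity.

```python
import re

# Assumed, simplified patterns: IPv4 addresses and a few executable/script names.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
PROCESS = re.compile(r"\b[\w-]+\.(?:exe|dll|bat|ps1)\b", re.IGNORECASE)


def protect_indicators(sentence):
    """Replace indicators with placeholders; return the masked text and a mapping."""
    mapping = {}

    def substitute(match):
        placeholder = "__IOC{}__".format(len(mapping))
        mapping[placeholder] = match.group(0)
        return placeholder

    for pattern in (IPV4, PROCESS):
        sentence = pattern.sub(substitute, sentence)
    return sentence, mapping


def restore_indicators(sentence, mapping):
    """Put the original indicators back in place of their placeholders."""
    for placeholder, original in mapping.items():
        sentence = sentence.replace(placeholder, original)
    return sentence


# Invented sample sentence.
text = "svchost.exe connects to 192.168.1.10 and drops payload.dll."
masked, mapping = protect_indicators(text)
print(masked)
print(restore_indicators(masked, mapping))
```

The masked sentence can be simplified or reordered freely, since each placeholder behaves like an ordinary noun and maps back to exactly one original value.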

Later parts of the process use a BiLSTM (Bidirectional LSTM) network to tackle word verbosity, deriving semantic roles for sentence parts, before removing unproductive words. BiLSTM is well-suited for this, since it can correlate the long-distance dependencies that appear in verbose documents, where greater attention and retention are necessary to deduce context.

EXTRACTOR defines semantic roles and relationships between words, with roles generated by Proposition Bank (PropBank) annotations.
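To give a rough sense of how role-labeled sentences could become a graph, the sketch below takes frames that use PropBank-style labels (ARG0 for the agent, V for the verb, ARG1 for the thing acted upon) and turns each complete frame into a labeled edge. The frames are hand-written and hypothetical; in a real system they would come from a semantic role labeling model, and EXTRACTOR's own graph construction is more involved than this.

```python
from collections import defaultdict


def frames_to_graph(frames):
    """Build an adjacency map: agent -> list of (verb, target) edges.

    Frames missing an agent, verb or target are skipped, since they
    cannot be drawn as a complete edge.
    """
    graph = defaultdict(list)
    for frame in frames:
        agent = frame.get("ARG0")
        verb = frame.get("V")
        target = frame.get("ARG1")
        if agent and verb and target:
            graph[agent].append((verb, target))
    return dict(graph)


# Hypothetical, hand-labeled frames standing in for SRL output.
frames = [
    {"ARG0": "dropper.exe", "V": "write", "ARG1": "payload.dll"},
    {"ARG0": "payload.dll", "V": "connect", "ARG1": "10.0.0.5"},
    {"V": "observe", "ARG1": "campaign"},  # no agent, so no edge is produced
]

graph = frames_to_graph(frames)
for agent, edges in graph.items():
    for verb, target in edges:
        print(agent, "-[" + verb + "]->", target)
```

Because the target of one edge can be the agent of the next, chaining the edges yields a compact picture of the attack steps in order.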

In tests, EXTRACTOR (partially funded by DARPA) was found capable of matching human data extraction from DARPA reports. The system was also run against a high volume of unstructured reports from Microsoft Security Intelligence and the TrendMicro Threat Encyclopedia,…
