A System for Automated Open-Source Threat Intelligence Gathering and Management
Sophisticated cyber attacks have plagued many high-profile businesses. To remain aware of the fast-evolving threat landscape, open-source Cyber Threat Intelligence (OSCTI) has received growing attention from the community. Commonly, knowledge about threats is presented in a vast number of OSCTI reports. Despite the pressing need for high-quality OSCTI, existing OSCTI gathering and management platforms have primarily focused on isolated, low-level Indicators of Compromise. Higher-level concepts (e.g., adversary tactics, techniques, and procedures) and their relationships, which contain essential knowledge about threat behaviors that is critical to uncovering the complete threat scenario, have been overlooked. To bridge the gap, we propose SecurityKG, a system for automated OSCTI gathering and management. SecurityKG collects OSCTI reports from various sources, uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors, and constructs a security knowledge graph. SecurityKG also provides a UI that supports various types of interactivity to facilitate knowledge graph exploration.
1. Introduction
Sophisticated cyber attacks have plagued many high-profile businesses (equifax, ). To remain aware of the fast-evolving threat landscape and gain insights into the most dangerous threats, open-source Cyber Threat Intelligence (OSCTI) (li2019reading, ) has received growing attention from the community. Commonly, knowledge about threats is presented in a vast number of OSCTI reports in various forms (e.g., threat reports, security news and articles (securelist, ; phishtank, )). Despite the pressing need for high-quality OSCTI, existing OSCTI gathering and management systems (threatminer, ; threatcrowd, ; alienvault-otx, ) have primarily focused on simple Indicators of Compromise (IOCs) (liao2016acing, ), such as signatures of artifacts, malicious file/process names, IP addresses, and domain names. Though effective in capturing isolated, low-level IOCs, these platforms cannot capture higher-level behaviors such as adversary tactics, techniques, and procedures (mitre-attack, ), which are tied to the attacker's goals and thus much harder to change. As the volume of OSCTI sources increases day by day, it becomes increasingly challenging to maneuver through and correlate the myriad of sources to gain useful insights. Towards this end, there is a pressing need for a new system that can harvest and manage high-fidelity threat intelligence in an automated, intelligent, and principled way.
There are several major challenges in building such a system. First, OSCTI reports come in diverse formats: some reports contain structured fields such as tables and lists, and some reports primarily consist of unstructured natural-language texts. The platform is expected to be capable of handling such diversity and extracting information. Second, besides IOCs, OSCTI reports contain various other entities that capture threat behaviors. The platform is expected to have a wide coverage of entity and relation types to comprehensively model the threats. Third, accurately extracting threat knowledge from unstructured OSCTI texts is non-trivial. This is due to the presence of many nuances particular to the security context, such as special characters (e.g., dots, underscores) in IOCs. These nuances limit the performance of most NLP modules (e.g., sentence segmentation, tokenization). Besides, some learning-based information extraction approaches require large annotated training corpora, which are expensive to obtain manually. Thus, how to programmatically obtain annotations becomes another challenge.
To bridge the gap, we built SecurityKG (K lines of Python code), a system for automated OSCTI gathering and management. SecurityKG collects OSCTI reports from various sources, uses a combination of AI and NLP techniques to extract high-fidelity knowledge about threat behaviors as security-related entities and relations, constructs a security knowledge graph containing the entity-relation triplets, and updates the knowledge graph by continuously ingesting new data. Specifically, SecurityKG has the following key components: (1) a set of fast and robust crawlers for collecting OSCTI reports from + major security websites; (2) a security knowledge ontology that models a wide range of high-level and low-level security-related entities (e.g., IOCs, malware, threat actors, techniques, tools) and relations; (3) a combination of AI and NLP techniques (e.g., Conditional Random Fields (lafferty2001conditional, )) to accurately extract entities and relations; specifically, we leverage data programming (ratner2016data, ) to programmatically create large training corpora; (4) an extensible backend system that manages all components for OSCTI gathering, knowledge extraction, and knowledge graph construction and persistence; (5) a UI that provides various types of interactivity to facilitate knowledge graph exploration.
Different from general knowledge graphs (auer2007dbpedia, ; mahdisoltani2013yago3, ) that store and represent general knowledge (e.g., movies, actors), SecurityKG targets automated extraction and management of OSCTI knowledge for the security domain. SecurityKG is the first work in this space.
Demo video: https://youtu.be/8PDJSaTnLDc
2. SecurityKG Architecture
Figure 1 shows the architecture of SecurityKG. SecurityKG manages the lifecycle of security knowledge in four stages: collection (Crawler), processing (Porter/Checker, Parser, Extractor), storage (Connector, Database), and applications. In the collection stage, SecurityKG periodically and incrementally collects OSCTI reports from multiple sources. In the processing stage, SecurityKG parses the reports, extracts structured knowledge, and constructs a security knowledge graph based on a pre-defined ontology. In the storage stage, SecurityKG inserts the knowledge into backend databases for storage. Various applications can be built by accessing the security knowledge graph stored in the databases. SecurityKG also provides a frontend UI to facilitate knowledge graph exploration.
2.1. Backend System Design
To handle diverse OSCTI reports, the system needs to be scalable, and maintain a unified representation of all possible knowledge types in both known and future data sources. The system also needs to be extensible to incorporate new data sources and processing and storage components to serve the needs of different applications.
Scalability. To make the system scalable, we parallelize the processing procedure of OSCTI reports. We further pipeline the processing steps in the procedure to improve the throughput. Between different steps in the pipeline, we specify the formats of intermediate representations and make them serializable. With such pipeline design, we can have multiple computing instances for a single step and pass serialized intermediate results across the network, making multi-host deployment and load balancing possible.
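As a minimal sketch of this idea (not SecurityKG's actual code), the example below runs two hypothetical steps, `parse_step` and `extract_step`, over a batch of reports in parallel. The steps exchange only JSON strings, so in principle each step could run on a different host. The step names, the JSON format, and the use of a thread pool are all assumptions made for illustration.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def parse_step(raw: str) -> str:
    """Hypothetical step 1: wrap the raw text in a serialized record."""
    return json.dumps({"text": raw.strip()})


def extract_step(serialized: str) -> str:
    """Hypothetical step 2: deserialize, enrich, and serialize again."""
    record = json.loads(serialized)
    record["num_tokens"] = len(record["text"].split())
    return json.dumps(record)


def run_pipeline(reports, workers=4):
    """Run each step in parallel across reports; steps only share JSON strings."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parsed = pool.map(parse_step, reports)      # results keep input order
        extracted = pool.map(extract_step, parsed)
        return [json.loads(item) for item in extracted]
```

Because the intermediate results are plain strings, the thread pool here could be swapped for a message queue between hosts without changing the step functions.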
Unified knowledge representation. To comprehensively represent security knowledge, we design an intermediate CTI representation and separate it from the security knowledge ontology. Intermediate CTI representation is a schema that covers relevant and potentially useful information in all data sources and lists out corresponding fields. We construct this schema by iterating through data sources, adding previously undefined types of knowledge, and merging similar fields. Specifically, our source-dependent parsers will first convert the original OSCTI reports into representations (i.e., Python objects in memory) that follow this schema by parsing the structured fields (e.g., fields identified by HTML tags). Then, our source-independent extractors will further refine the representations by extracting information (e.g., IOCs, malware names) from unstructured texts and putting it into the corresponding fields.
Directly using these intermediate representations results in inefficient storage. Thus, before merging them into the storage through connectors, SecurityKG refactors them to match the security knowledge ontology, which has clear and concise semantics.
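The split between source-dependent parsers and source-independent extractors can be sketched as follows. The `IntermediateCTI` class and its field names are hypothetical and only illustrate the kind of in-memory Python object described above; the list lookup in `extract_unstructured` is a stand-in for real entity recognition.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class IntermediateCTI:
    """Hypothetical unified schema; the field names are illustrative only."""
    report_id: str
    source: str
    title: str = ""
    raw_text: str = ""
    iocs: dict = field(default_factory=dict)          # e.g. {"ip": [...]}
    malware_names: list = field(default_factory=list)


def parse_structured(report_id, source, html_fields):
    """Source-dependent parser: fills fields found in structured markup."""
    return IntermediateCTI(
        report_id=report_id,
        source=source,
        title=html_fields.get("title", ""),
        raw_text=html_fields.get("body", ""),
    )


def extract_unstructured(cti, known_malware):
    """Source-independent extractor: a naive list lookup on the free text."""
    lowered = cti.raw_text.lower()
    cti.malware_names = [name for name in known_malware if name in lowered]
    return cti
```

A parser for a new source only has to produce an `IntermediateCTI` object; every extractor then works on it unchanged, and `asdict` gives a serializable form.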
Extensibility. To make the system extensible, we adopt a modular design, allowing multiple components with the same interface to work together in the same processing step. For example, SecurityKG by default uses a Neo4j connector to export knowledge into a Neo4j database (neo4j, ). However, if the user cares less about multi-hop relations, they may switch to an RDBMS using a SQL connector. Similarly, parsers and extractors can be switched or extended. Furthermore, the system can be configured through a user-provided configuration file, which specifies the set of components to use and the additional parameters (e.g., threshold values for entity recognition) passed to these components.
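A minimal sketch of such a modular design is shown below, assuming a hypothetical `Connector` interface and a configuration dictionary with a `"connector"` key. Neither is taken from SecurityKG's code, and the two in-memory connectors merely stand in for a graph database connector and a SQL connector.

```python
from abc import ABC, abstractmethod


class Connector(ABC):
    """Hypothetical common interface shared by all storage connectors."""

    @abstractmethod
    def export(self, triplets):
        """Store (head, relation, tail) triplets; return the stored count."""


class InMemoryGraphConnector(Connector):
    """Stand-in for a graph database connector."""

    def __init__(self):
        self.edges = set()

    def export(self, triplets):
        self.edges.update(triplets)
        return len(self.edges)


class InMemoryTableConnector(Connector):
    """Stand-in for a relational (SQL) connector."""

    def __init__(self):
        self.rows = []

    def export(self, triplets):
        self.rows.extend(
            {"head": h, "relation": r, "tail": t} for h, r, t in triplets
        )
        return len(self.rows)


# Registry consulted when reading the (hypothetical) configuration.
CONNECTORS = {"graph": InMemoryGraphConnector, "table": InMemoryTableConnector}


def build_connector(config):
    """Instantiate the connector named in the configuration."""
    return CONNECTORS[config["connector"]]()
```

Switching storage backends then only requires changing the configuration value, for example from `{"connector": "graph"}` to `{"connector": "table"}`.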
2.2. OSCTI Reports Collection
We built a crawler framework that has + crawlers for collecting OSCTI reports from major security sources (each crawler handles one data source), covering threat encyclopedias, blogs, security news, etc. The crawler framework schedules the periodic execution and reboot after failure for different crawlers in an efficient and robust manner. It also has a multi-threaded design to boost the efficiency, achieving a throughput of approximately + reports per minute at a single deployed host. In total, we have collected over K+ OSCTI reports and the number is still increasing.
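The sketch below illustrates one way to run several crawlers concurrently and restart a crawler after a failure. It is a simplified stand-in rather than the framework itself: the function names are hypothetical, each crawler is modeled as a plain callable that returns a list of reports, and periodic scheduling is omitted.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_retry(crawler, max_attempts=3):
    """Call a crawler, retrying after a failure up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return crawler()
        except Exception as error:  # a real crawler would catch narrower errors
            last_error = error
    raise RuntimeError(f"crawler failed after {max_attempts} attempts") from last_error


def crawl_all(crawlers, workers=4):
    """Run one crawler per data source in a thread pool; map name -> reports."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {
            name: pool.submit(run_with_retry, crawler)
            for name, crawler in crawlers.items()
        }
        return {name: future.result() for name, future in futures.items()}
```

Threads suit this workload because crawling is dominated by network waits rather than computation.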
2.3. Security Knowledge Ontology Design
Figure 2 shows our security knowledge ontology, which specifies the types of security-related entities and relations in the security knowledge graph. Based on our observations of OSCTI data sources, we categorize OSCTI reports into three types: malware reports, vulnerability reports, and attack reports. For each report, we associate it with an entity of the corresponding type. Besides, reports are created by specific CTI vendors, and often contain information on concepts such as threat actors, techniques, tools, software, and various types of IOCs (e.g., file name, file path, IP, URL, email, domain, registry, hashes). Thus, we create entities for these concepts as well. Entities have relationships between them (e.g., <MALWARE_A, DROP, FILE_A> specifies a “DROP” relation between a “MALWARE” entity and a “FILE” entity), as well as attributes in the form of key-value pairs. By constructing such an ontology, we can capture different types of security knowledge in the system. Compared to other cyber ontologies (stix, ; syed2016uco, ), our ontology targets a larger set. Figure 3 shows an example knowledge subgraph that follows this ontology.
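To make the triplet form concrete, the following sketch models typed entities with key-value attributes and checks a triplet against a small set of allowed relation signatures. Only the (MALWARE, DROP, FILE) signature comes from the example above; the second signature, the class name, and the function name are hypothetical.

```python
from dataclasses import dataclass

# Allowed (head type, relation, tail type) signatures. Only the first entry
# appears in the text; the second is a hypothetical addition for illustration.
ALLOWED_RELATIONS = {
    ("MALWARE", "DROP", "FILE"),
    ("THREAT_ACTOR", "USE", "TECHNIQUE"),
}


@dataclass(frozen=True)
class Entity:
    type: str
    name: str
    attributes: tuple = ()  # key-value pairs, e.g. (("vendor", "X"),)


def make_triplet(head, relation, tail):
    """Return (head, relation, tail) if the ontology allows this signature."""
    signature = (head.type, relation, tail.type)
    if signature not in ALLOWED_RELATIONS:
        raise ValueError(f"relation not in ontology: {signature}")
    return (head, relation, tail)
```

For instance, `make_triplet(Entity("MALWARE", "MALWARE_A"), "DROP", Entity("FILE", "FILE_A"))` succeeds, while reversing head and tail raises `ValueError`.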
2.4. Security Knowledge Extraction
We describe the steps inside the processing stage for security knowledge extraction. The porters take the input report files and convert them into intermediate report representations; they group multi-page reports and add metadata like ids, sources, titles, and original file locations and timestamps. The checkers work as filters on the list of intermediate report representations; they screen out irrelevant reports like empty pages or ads by running condition checks. The parsers are source-dependent, taking advantage of prior knowledge of the source website structure and extracting keys and values from report files. They convert the list of intermediate report representations into a list of intermediate CTI representations (Section 2.1). The extractors further refine these intermediate CTI representations by completing some of the fields using entity recognition and relation extraction. Since the intermediate CTI representation is a unified format, the extractors are source-independent.
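The porter and checker steps can be illustrated with a small sketch. The page dictionaries, the `port` and `check` functions, and the `has_text` condition are hypothetical and only mirror the behavior described above (grouping multi-page reports, attaching metadata, and filtering by condition checks).

```python
from collections import defaultdict


def port(pages, source):
    """Group pages by report id and attach basic metadata."""
    grouped = defaultdict(list)
    for page in pages:
        grouped[page["report_id"]].append(page)
    reports = []
    for report_id, parts in grouped.items():
        parts.sort(key=lambda p: p["page_no"])  # restore page order
        reports.append({
            "id": report_id,
            "source": source,
            "title": parts[0].get("title", ""),
            "text": "\n".join(p["text"] for p in parts),
        })
    return reports


def has_text(report):
    """Example condition check: reject empty pages."""
    return bool(report["text"].strip())


def check(reports, conditions):
    """Keep only the reports that pass every condition check."""
    return [r for r in reports if all(cond(r) for cond in conditions)]
```

New filters, such as an advertisement detector, would be added as further condition functions without touching `check` itself.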
Next, we describe the design of the extractors.
Security-related entity recognition. We adopt a Conditional Random Field (CRF) (lafferty2001conditional, ) model to extract security-related entities in unstructured texts. Compared to general named entity recognition, we are faced with two unique challenges: (1) presence of many nuances particular to the security context; (2) lack of large annotated training corpora. To address the first challenge, as these nuances mostly exist in IOCs, we use a method called IOC protection proposed in our other work (gao2020enabling, ), which replaces IOCs with meaningful words in natural language context (e.g., the word “something”) and restores them after the tokenization procedure. This way, we guarantee that the potential entities are complete tokens. To address the second challenge, we programmatically synthesize annotations using data programming (ratner2016data, ). Particularly, we create labeling functions based on our curated lists of entity names. For example, the lists of threat actors, techniques, and tools are constructed from MITRE ATT&CK (mitre-attack, ). To train the CRF model, we use features such as word lemmas, POS tags, and word embeddings (mikolov2013distributed, ). Since our model has the ability to leverage token-level semantics, it can outperform a naive entity recognition solution that relies on regex rules, and generalize to entities that are not in the training set.
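The following is a simplified illustration of the IOC protection idea, not the implementation from (gao2020enabling, ). The regular expression covers only IPv4 addresses and a few file and domain suffixes, the tokenizer is a deliberately naive stand-in for a general-purpose NLP tokenizer, and all names are hypothetical. Literal occurrences of the placeholder word are stashed as well, so that restoration stays aligned.

```python
import re

PLACEHOLDER = "something"

# Illustrative patterns only; a real system would cover many more IOC types.
IOC_PATTERN = re.compile(
    r"\b(?:\d{1,3}\.){3}\d{1,3}\b"                        # IPv4 address
    r"|\b[\w-]+(?:\.[\w-]+)*\.(?:exe|dll|com|org|net)\b"  # file name or domain
    r"|\bsomething\b"                                     # literal placeholder
)


def protect(text):
    """Replace each IOC with the placeholder word; return text and stash."""
    stashed = []

    def stash(match):
        stashed.append(match.group(0))
        return PLACEHOLDER

    return IOC_PATTERN.sub(stash, text), stashed


def tokenize(text):
    """Naive tokenizer that splits on punctuation, as many general ones do."""
    return re.findall(r"\w+|[^\w\s]", text)


def restore(tokens, stashed):
    """Put the stashed strings back in place of the placeholder tokens."""
    remaining = iter(stashed)
    return [next(remaining) if tok == PLACEHOLDER else tok for tok in tokens]
```

Tokenizing `"192.168.1.10"` directly yields seven fragments, whereas running `protect`, `tokenize`, and `restore` in sequence keeps the address as a single token.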
Security-related relation extraction. To extract relations, since it is relatively hard to programmatically synthesize annotations for relations, we adopt an unsupervised approach. In particular, we leverage the dependency-parsing-based IOC relation extraction pipeline proposed in our other work (gao2020enabling, ), and extend it to support the extraction of relation verbs between entities recognized by our CRF model. Evaluations on a wide range of OSCTI reports demonstrate that our extractors are highly accurate ( F1).
2.5. Security Knowledge Graph Construction
As a final step, SecurityKG inserts the processed results into the backend storage using connectors. The connector merges the intermediate CTI representations into the corresponding storage by refactoring them to match our ontology, such that the previous security knowledge graph can be augmented with new knowledge.
Since we store the knowledge extracted from a large number of reports in the same knowledge graph, one potential problem is that nodes constructed from different reports may refer to the same entity. We made the design choice that, in this step, we only merge nodes with exactly the same description text. It is possible that nodes with similar description texts actually refer to the same entity (e.g., same malware represented in different naming conventions by different CTI vendors). For these nodes, we merge them in a separate knowledge fusion stage, by creating a new node with unified attributes and migrating all the relation edges. By separating the knowledge fusion stage from the storage stage in the main pipeline, we can prevent early deletion of useful information.
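The two-stage merging policy can be sketched with a toy in-memory graph. The `KnowledgeGraph` class below is hypothetical: nodes are keyed by (type, description text), so insertion only merges exact matches, while a separate `fuse` method creates one unified node and migrates the edges. How alias groups are detected is outside the scope of this sketch.

```python
class KnowledgeGraph:
    """Toy graph: nodes keyed by (type, description), edges as triplets."""

    def __init__(self):
        self.nodes = {}     # (type, description) -> attribute dict
        self.edges = set()  # (head_key, relation, tail_key)

    def add_triplet(self, head, relation, tail):
        """Storage stage: nodes with identical keys merge automatically."""
        for key in (head, tail):
            self.nodes.setdefault(key, {})
        self.edges.add((head, relation, tail))

    def fuse(self, aliases, canonical):
        """Fusion stage: unify alias nodes into one node and migrate edges."""
        merged = {}
        for key in aliases:
            merged.update(self.nodes.pop(key, {}))
        self.nodes.setdefault(canonical, {}).update(merged)
        alias_set = set(aliases)
        self.edges = {
            (
                canonical if head in alias_set else head,
                relation,
                canonical if tail in alias_set else tail,
            )
            for head, relation, tail in self.edges
        }
```

Keeping `fuse` separate from `add_triplet` means a wrong alias guess never destroys information during ingestion, which mirrors the rationale given above.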
2.6. Frontend UI Design
In order to facilitate knowledge graph exploration, we built a web UI using React (Figure 3). Currently, the UI interacts with the Neo4j database, and provides various functionalities to facilitate the exploration of the knowledge graph, which we will describe next.
We built features to simplify the user view. The user can zoom in and out and drag the canvas. Node names and edge types are displayed by default. Nodes are colored according to their types. When a node is hovered over, its detailed information will be displayed.
We built features that facilitate threat search and knowledge graph exploration. First, the UI provides multilingual query support so that the user can search information using keywords (through Elasticsearch) or Cypher queries (through Neo4j Cypher engine), which enables the user to easily identify targeted threats in the large graph. Second, the user can drag nodes around on the canvas. The UI actively responds to node movements to prevent overlap through an automatic graph layout using the Barnes-Hut algorithm, which calculates the nodes’ approximated repulsive force based on their distribution. The dragged nodes will lock in place but are still draggable if selected. This node draggability feature helps the user define custom graph layouts. Third, the UI supports inter-graph navigation. When a node is double-clicked, if its neighboring nodes have not appeared in the view yet, these neighboring nodes will automatically spawn. On the contrary, once the user is done investigating a node, if its neighboring nodes or any downstream nodes are shown, double clicking on the node again will hide all its neighboring nodes and downstream nodes. This node expansion/collapse feature is essential for convenient graph exploration.
We built features that provide flexibility to the user. The user can configure the number of nodes displayed and the maximum number of neighboring nodes displayed for a node. The user can view the previous graphs displayed by clicking on the back button. The user can also fetch a random subgraph for exploration.
3. Demonstration Outline
In our demo, we first show various usage scenarios of SecurityKG’s UI. Specifically, we perform two keyword searches and one Cypher query search and demonstrate all the supported features:
Keyword search for “wannacry”: We first investigate the wannacry ransomware by performing a keyword search. Throughout the investigation, we aim to demonstrate functionalities including detailed information display, node dragging, automatic graph layout, canvas zooming in/out, and node expansion/collapse. We will end the investigation with a subgraph that shows all the relevant information (entities) of the wannacry ransomware.
Keyword search for “cozyduke”: In the second scenario, we perform a keyword search of a threat actor, cozyduke. We will investigate the techniques used by cozyduke, and check if there are other threat actors that use the same set of techniques.
Cypher query search: In the third scenario, we execute a specific Cypher query, match(n) where n.name = "wannacry" return n, to demonstrate that the same wannacry node will be returned as in the first scenario. We then execute other queries.
Our demo video gives a walkthrough of these scenarios. In addition, we demonstrate the end-to-end automated data gathering and management procedure of SecurityKG. We will empty the database and apply SecurityKG to a number of OSCTI sources. We will demonstrate various system components, and provide insights into how OSCTI reports are collected, how entities and relations are extracted, and how information is merged into the knowledge graph so that the graph can continuously grow. The audience will have the option to try the UI and the whole system to gain deeper insights into the supported functionalities and system components.
4. Related Work
Besides existing OSCTI gathering and management systems (threatminer, ; threatcrowd, ; alienvault-otx, ), research progress has been made to better analyze OSCTI reports, including extracting IOCs (liao2016acing, ), extracting threat action terms from semi-structured Symantec reports (husari2017ttpdrill, ), understanding vulnerability reproducibility (mu2018understanding, ), and measuring threat intelligence quality (li2019reading, ; dong2019towards, ). Research has also proposed to leverage individual OSCTI reports for threat hunting (gao2020enabling, ; milajerdi2019poirot, ). SecurityKG differs from all these works in that it targets automated construction of a knowledge graph particularly for the security domain, by extracting a wide range of security-related entities and relations from a large number of OSCTI reports using AI and NLP techniques.
5. Conclusion
We have presented SecurityKG, a new system for automated OSCTI gathering and management.
Acknowledgement. This work was supported by the 2020 Microsoft Security AI RFP Award and the Azure cloud computing platform. Any opinions, findings, and conclusions made in this material are those of the authors and do not necessarily reflect the views of the funding agencies.
- The Equifax Data Breach. https://www.ftc.gov/equifax-data-breach.
- Vector Guo Li, Matthew Dunn, Paul Pearce, Damon McCoy, Geoffrey M Voelker, and Stefan Savage. Reading the tea leaves: A comparative analysis of threat intelligence. In USENIX Security, 2019.
- SecureList. https://securelist.com/.
- PhishTank. https://www.phishtank.com/.
- ThreatMiner. https://www.threatminer.org/.
- ThreatCrowd. https://www.threatcrowd.org/.
- AlienVault OTX. https://otx.alienvault.com/.
- Xiaojing Liao, Kan Yuan, XiaoFeng Wang, Zhou Li, Luyi Xing, and Raheem Beyah. Acing the IOC game: Toward automatic discovery and analysis of open-source cyber threat intelligence. In CCS, 2016.
- MITRE ATT&CK. https://attack.mitre.org.
- John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
- Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In NeurIPS, 2016.
- Sören Auer, Christian Bizer, Georgi Kobilarov, Jens Lehmann, Richard Cyganiak, and Zachary Ives. DBpedia: A nucleus for a web of open data. In The semantic web. 2007.
- Farzaneh Mahdisoltani, Joanna Biega, and Fabian M Suchanek. YAGO3: A knowledge base from multilingual wikipedias. In CIDR, 2013.
- Neo4j. http://neo4j.com/.
- Structured Threat Information eXpression. http://stixproject.github.io/.
- Zareen Syed, Ankur Padia, Tim Finin, Lisa Mathews, and Anupam Joshi. UCO: A unified cybersecurity ontology. UMBC Student Collection, 2016.
- Peng Gao, Fei Shao, Xiaoyuan Liu, Xusheng Xiao, Zheng Qin, Fengyuan Xu, Prateek Mittal, Sanjeev R Kulkarni, and Dawn Song. Enabling efficient cyber threat hunting with cyber threat intelligence. arXiv preprint arXiv:2010.13637, 2020.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. Distributed representations of words and phrases and their compositionality. In NeurIPS, 2013.
- Ghaith Husari, Ehab Al-Shaer, Mohiuddin Ahmed, Bill Chu, and Xi Niu. TTPDrill: Automatic and accurate extraction of threat actions from unstructured text of CTI sources. In ACSAC, 2017.
- Dongliang Mu, Alejandro Cuevas, Limin Yang, Hang Hu, Xinyu Xing, Bing Mao, and Gang Wang. Understanding the reproducibility of crowd-reported security vulnerabilities. In USENIX Security, 2018.
- Ying Dong, Wenbo Guo, Yueqi Chen, Xinyu Xing, Yuqing Zhang, and Gang Wang. Towards the detection of inconsistencies in public security vulnerability reports. In USENIX Security, 2019.
- Sadegh M Milajerdi, Birhanu Eshete, Rigel Gjomemo, and VN Venkatakrishnan. Poirot: Aligning attack behavior with kernel audit records for cyber threat hunting. In CCS, 2019.
- Peng Gao, Xusheng Xiao, Zhichun Li, Fengyuan Xu, Sanjeev R. Kulkarni, and Prateek Mittal. AIQL: Enabling efficient attack investigation from system monitoring data. In USENIX ATC, 2018.
- Peng Gao, Xusheng Xiao, Ding Li, Zhichun Li, Kangkook Jee, Zhenyu Wu, Chung Hwan Kim, Sanjeev R. Kulkarni, and Prateek Mittal. SAQL: A stream-based query system for real-time abnormal system behavior detection. In USENIX Security, 2018.