DANTE: A Framework for Mining and Monitoring Darknet Traffic
Trillions of network packets are sent over the Internet to destinations which do not exist. This ‘darknet’ traffic captures the activity of botnets and other malicious campaigns aiming to discover and compromise devices around the world. In order to mine threat intelligence from this data, one must be able to handle large streams of logs and represent the traffic patterns in a meaningful way. However, by observing how network ports (services) are used, it is possible to capture the intent of each transmission. In this paper, we present DANTE: a framework and algorithm for mining darknet traffic. DANTE learns the meaning of targeted network ports by applying Word2Vec to observed port sequences. Then, when a host sends a new sequence, DANTE represents the transmission as the average embedding of the ports found in that sequence. Finally, DANTE uses a novel and incremental time-series cluster tracking algorithm on observed sequences to detect recurring behaviors and new emerging threats. To evaluate the system, we ran DANTE on a full year of darknet traffic (over three terabytes) collected by the largest telecommunications provider in Europe, Deutsche Telekom, and analyzed the results. DANTE discovered 1,177 new emerging threats and was able to track malicious campaigns over time. We also compared DANTE to the current best approach and found DANTE to be more practical and effective at detecting darknet traffic patterns.
One way for Internet service providers (ISPs) to obtain meaningful and actionable insights on malicious Internet campaigns is to analyze traffic arriving at a subset of unassigned IP addresses. These IP addresses are sometimes referred to as a darknet [3, 2]. In this paper, the term ‘darknet’ should not be confused with anonymous communication networks such as Tor.
Darknet IP addresses are not associated with any registered host or services. Therefore, they are similar to honeypots [31, 10] in that any incoming packets can be automatically considered unwanted and non-productive. Previous studies have shown that packets sent to darknet IP addresses are usually the result of network probing/scanning, worm propagation, and DDoS attacks [36, 43]. Therefore, darknet data can be used by an ISP’s cyber emergency response team (CERT) to infer threat intelligence related to ongoing malicious activities or new emerging attacks (see Figure 1).
A great advantage of this approach is that darknet taps are easily deployed, inexpensive to implement, and can collect vast amounts of useful data. However, obtaining threat intelligence from darknet traffic is a challenging task for four reasons: (1) Darknet IPs are not assigned to actual hosts, so the traffic only captures the initiation of a communication, and not the actual channel traffic (e.g., payloads). This is in contrast to honeypots, which emulate real services (e.g., a Web application or SSH server). As a result, only metadata such as the incoming packet’s source IP (src IP), destination IP (dst IP), destination port (dst port), and packet size are available. (2) Darknet traffic is full of benign scanning activity from services/enterprises such as Amazon, Google, and Shodan, which must be filtered out. (3) Attackers repackage old malware (e.g., Mirai) and reuse known attack patterns, which makes it difficult to identify novel threats. (4) Efficient algorithms are a necessity, since terabytes of darknet data are generated every month.
Recent works proposed various methods for analyzing darknet traffic. Since the destination TCP or UDP port number provides a good indication of the sender’s intentions (e.g., accessing port 23 may indicate an attempt to search for an accessible Telnet server), most of the previous research has focused on grouping ports into static clusters and detecting peaks or unusual trends in the volume of the clusters or individual ports [35, 7, 20, 9]. However, attacks are becoming more sophisticated, automated, and noisy (in terms of port access). For example, attackers may perform multi-stage attacks or attempt to exploit multiple vulnerabilities to compromise a device. The methods proposed in previous works (1) cannot identify attacks which use more than one port in sequence, and (2) only perform one-time clustering, and therefore do not provide any means for tracking ongoing attacks, detecting a recurring/reused attack, or identifying novel emerging threats over time.
In order to detect attacks it is important for a security analyst to be able to analyze darknet data and provide insights on an hourly basis. However, this analysis is challenging as there are terabytes of darknet traffic data every month and this figure is expected to increase in the coming years. We believe the solution to this challenge will be based on utilizing the power of big data and using a distributed algorithm to provide hourly reports and alerts.
In this paper, we propose Darknet Traffic Embedding (DANTE), a novel darknet analysis method for representing, monitoring, and detecting complex emerging threats via darknet traffic. DANTE is designed to be scalable and work in a big data architecture. DANTE accomplishes this in two steps. First, for each time window, DANTE summarizes each host’s activity as a feature vector which captures that host’s behavior. The vector is made by (1) training a Word2Vec model on the targeted ports (words) found in observed sequences (sentences), and (2) summarizing newly observed port sequences from a host as the average of the ports’ Word2Vec embeddings. The second step is to cluster the feature vectors (host activities) found in a time window, and use a novel technique to map these clusters to the previous time windows. Briefly, (1) the cluster labels from the previous time window are mapped to the current clusters using Jaccard similarity measures (tracking), (2) clusters which could not be mapped are then labeled using a collection of one-vs-all classifiers (recurring concepts), and then (3) the remaining unlabeled clusters are labeled by members of the CERT and are given new one-vs-all classifiers (novelty detection).
To evaluate our method, we worked with the largest telecommunications provider in Europe, Deutsche Telekom AG. To support our research, they collected a full year of darknet traffic (over three terabytes), in which DANTE discovered 1,177 new emerging threats and was able to track malicious campaigns over time. Deutsche Telekom is now using DANTE for threat intelligence in their CERT. We also evaluated the current best approach on the same dataset and found that DANTE’s results were more concise and significantly more effective at capturing attack patterns.
In summary, this paper provides the following contributions:
A scalable and online framework for analyzing darknet traffic which can detect and track malicious campaigns and behaviors over time.
A method for representing a sequence of accessed ports, of variable length, as a numerical feature vector which captures the intent of an attacking host in a meaningful way.
A generic algorithm for performing temporal clustering which can (1) track cluster drift and detect novel/emerging and reoccurring clusters, (2) run in parallel over a big data cluster with multiple data sources, and (3) be used with any batch clustering algorithm according to the user’s needs.
The remainder of the paper is organized as follows. Section II provides background information on darknet-related research and embedding techniques. The proposed darknet monitoring framework is presented in Section III. In Section IV we present the proposed approach for port scan embedding. Section V contains a description of the novel temporal clustering method. In Section VI we present our experimental results and in Section VII we conclude the paper.
II Related Work
The two main contributions of this paper are (1) a method for analyzing darknet traffic, and (2) a method for clustering time series data using static clustering algorithms. Therefore, in our review of related work, we address each subject separately.
II-A Mining Darknet Traffic
In prior research [3, 2, 4, 7, 6, 9, 20, 22, 29, 36, 14, 1], darknet data is used to detect botnet hosts, typically by clustering and classifying the src IPs with features such as the dst port and packet size.
In one prior study, the authors created a rule-based model to help categorize darknet records into several types of malicious attacks and benign activities, and showed how those categories evolved over ten years of data. The authors used attributes such as the number of source IPs and destination ports in order to categorize the data. However, they did not consider the sequence of destination ports coming from an IP. We found those sequences to be particularly informative in the detection of attack patterns, as they can indicate the intention of the attacker.
Ban et al. [7, 5, 6, 4] introduced a network incident analysis center for tactical emergency response (NICTER) that monitors around 300,000 darknet IPs in Japan. They used NICTER to find correlations between the malicious activities discovered on the darknet and activities extracted from different types of honeypots. Ban et al. also used DT-growth, an association rule learning (ARL) algorithm, to find port associations in the darknet data. Their research serves as the foundation of this paper, as they showed that many attack patterns use more than one port and thus should be grouped.
Thonnard and Dacier created a new clustering tool to detect groups of IPs that behave similarly. They used graph theory in order to find temporal correlations between port usage and thus created a way to group different IPs. Nevertheless, this work is different from ours, as Thonnard and Dacier ignored the meaning and use of the ports in the sequence when clustering.
In another study, the authors used DBSCAN to create clusters of packets and then used an algorithm from the field of topological data analysis in order to visualize the darknet and help an expert easily observe and analyze the data. To use DBSCAN, they treated the ports as integers by looking at the port number. In contrast to this work, we use an artificial neural network to learn the connections and relations between the ports to find an informative numeric representation.
In order to retrieve numeric information from network traffic packets, most prior research extracted statistical features such as the number of destination IPs or unique ports [36, 28, 34, 13, 15, 8, 44, 16]. Although these features proved to help in the detection and exploration of attacks, they are hand-picked, and it is difficult to choose the features that fit the task. In contrast to those methods, we used a neural network-based algorithm that automatically extracts meaningful representations of the packets.
Most of the aforementioned works apply their method to a static corpus of data. However, new data arrives continuously, and there is a need for an online system that can detect attacks in near real-time. To address this issue, our proposed method periodically analyzes the packets that have arrived from the sensor in the preceding minutes and applies the detection mechanism in an online fashion by using a big data architecture.
II-B Temporal Clustering
Well-known clustering algorithms, such as k-means and DBSCAN, are batch algorithms. This means that they are applied once on the entire dataset and cannot track or monitor temporal trends. Although batch algorithms provide the best clustering quality, they are unsuitable as-is for processing data streams (unbounded sequences of observations).
To cluster data streams, one can use STREAM, Incremental DBSCAN, DenStream, CluStream, and many others. However, these algorithms cannot perform novelty detection. In other words, they cannot differentiate between reoccurring clusters and novel clusters. This capability is required to detect emerging threats and monitor the re-use of known attack variants.
There are several algorithms for novelty detection in data streams; however, these algorithms cannot be natively parallelized over a big data computing cluster and do not directly support the processing of multiple parallel sources. Furthermore, each of these algorithms was designed to apply the principles of a particular kind of clustering algorithm over compressed data summaries. In addition to this being a lossy process, a user may need a different type of clustering algorithm to best fit the data (e.g., DBSCAN or k-means). In contrast, the temporal clustering framework that we propose in this paper can be parallelized over a big data cluster while receiving data from multiple sources. In addition, the framework is flexible in terms of selecting a clustering algorithm. This enables the user to apply the most suitable batch algorithm in his/her arsenal.
III The Darknet Analyzer Framework
In this study, we analyze the ongoing activities in the darknet by clustering common patterns/sequences of targeted (dst) ports. These patterns can be used to discover new attacks as well as explore the behavior of ongoing attacks and trends.
The analysis process consists of the four stages described below.
Sequence extraction. First, the data is split into sliding time windows of length $L$; for each time window, we group the destination port records of the same src IP into a port sequence. The final result is a table representing a time window with two attributes: the src IP and the sequence of dst ports, as shown in Figure 3.
Port sequence embeddings. By using a word embedding algorithm on the port sequences extracted in the previous stage, treating ports as words and port sequences as sentences, we are able to transform the port sequences into meaningful numerical feature vectors. A description of how we expand on this method is presented in Section IV.
Temporal clustering. With the feature vectors obtained in the previous stage, we use a novel temporal clustering method to cluster the feature vectors over time. This step enables us to track attack patterns and distinguish between new (novel) and reoccurring patterns. This is explained further in Section V.
Alert logic and visualization. The current time window’s clusters and labels are visualized for the CERT (e.g., trends, or campaigns by source country). We can also use this information to create an alert rule, e.g., an alert regarding the reappearance of a cluster that an analyst classified as malicious in the past. These alerts can then be used in security information sharing platforms such as MISP.
In addition, the above-mentioned alert system allows DANTE to deal with adversarial attacks. Such attacks can be divided into two groups. In the first, the attacker tries to conceal him/herself by adding dummy port accesses as noise. A simple way to deal with this group is to include an alert rule that issues an alert when a never-before-seen cluster appears, as those attacks will create a new cluster. To avoid a large number of false positives, those alerts are only sent if the cluster is larger than a certain threshold. In the second group, the attacker tries to disguise him/herself as a pattern that belongs to a known cluster, such as a cluster that consists of a popular port sequence pattern. To deal with this group, one can create an alert rule that issues an alert when a cluster dramatically increases in size. Another way to deal with this group of attacks is to recluster the large clusters and find subpatterns within them; the resulting subclusters could help an analyst find malicious subpatterns that differ from the others and may indicate hidden attempts.
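The two alert rules described above can be sketched in a few lines of Python. This is a minimal illustration, not the deployed alert logic: the function name `evaluate_alerts`, the dictionary layout, and the thresholds `min_new_size` and `growth_factor` are all assumptions introduced here.

```python
def evaluate_alerts(prev_sizes, curr_sizes, known_labels,
                    min_new_size=30, growth_factor=3.0):
    """Return a list of (cluster_id, reason) alerts for the current window.

    prev_sizes / curr_sizes: dict mapping cluster id -> number of src IPs.
    known_labels: set of cluster ids that have been seen before.
    min_new_size and growth_factor are hypothetical thresholds.
    """
    alerts = []
    for cid, size in curr_sizes.items():
        if cid not in known_labels:
            # Rule 1: never-before-seen cluster, reported only if large enough.
            if size >= min_new_size:
                alerts.append((cid, "new cluster"))
        elif cid in prev_sizes and size >= growth_factor * prev_sizes[cid]:
            # Rule 2: known cluster whose size grew sharply since the last window.
            alerts.append((cid, "sudden growth"))
    return alerts


prev = {"telnet": 100, "http": 50}
curr = {"telnet": 110, "http": 200, "c42": 45, "c43": 5}
print(evaluate_alerts(prev, curr, known_labels={"telnet", "http"}))
```

In this toy run, the small new cluster `c43` stays below the size threshold and is suppressed, which mirrors the false-positive control described in the text.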
IV Port Sequence Embedding
Threat agents (e.g., an attacker or bot) may send packets to unregistered IP addresses for several different reasons, such as to find a host with a vulnerability to exploit or, in the case of worms, to access a backdoor.
We define a sequence as the sequence of ports collected from a specific src IP to a specific dst IP.
A darknet sensor can observe these communications as a sequence of ports being accessed. For example, the sequence “42527, 80, 80” was observed in the wild. In this sequence, the attacker tried to access a high port (42527) and then immediately sent several packets to port 80 (HTTP). This sequence reflects a worm’s attempt to detect whether the target host has already been infected (via a backdoor on port 42527) and then, after failing, to compromise the host by exploiting two different vulnerabilities in the host’s web server (port 80). From this example, it can be seen that the sequence of targeted ports not only reveals the attacker’s intent (goal and strategy), but also implicitly captures the type of threat agent. Moreover, threat agents such as bots, worms, and hackers in the same campaign act in unison, enabling us to perform cluster analysis to detect new or recurring campaigns and other behaviors.
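To make the notion of a port sequence concrete, the following minimal sketch groups packet records by src IP and dst IP inside one time window and keeps the dst ports in arrival order. The record layout `(timestamp, src_ip, dst_ip, dst_port)`, the function name, and the example addresses (taken from documentation-reserved ranges) are assumptions made for illustration.

```python
from collections import defaultdict


def extract_sequences(packets, window_start, window_end):
    """Group dst ports by (src IP, dst IP) for packets inside one time window.

    packets: iterable of (timestamp, src_ip, dst_ip, dst_port) tuples.
    Returns a dict mapping (src_ip, dst_ip) -> list of ports in arrival order.
    """
    sequences = defaultdict(list)
    # Sorting the tuples orders them by timestamp first.
    for ts, src_ip, dst_ip, dst_port in sorted(packets):
        if window_start <= ts < window_end:
            sequences[(src_ip, dst_ip)].append(dst_port)
    return dict(sequences)


packets = [
    (3, "198.51.100.7", "203.0.113.5", 80),
    (1, "198.51.100.7", "203.0.113.5", 42527),
    (2, "198.51.100.7", "203.0.113.5", 80),
    (2, "192.0.2.9", "203.0.113.5", 23),
    (99, "192.0.2.9", "203.0.113.5", 2323),  # outside the window below
]
print(extract_sequences(packets, window_start=0, window_end=10))
```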
However, in order to cluster those sequences, we must find a representation which summarizes each of them as a numeric vector for the machine learning algorithm. This format is a fixed-length vector where the Euclidean distance between vectors measures similarity (closer vectors are more similar).
Although TCP and UDP ports are numbers, the numerical relationship between ports is meaningless. For instance, port 21 is used for FTP, and port 22 is used for SSH, and there is no connection between the two. Therefore, in order to summarize the behavior of a scan, we first need to learn a numeric relationship between all of the ports.
To accomplish this we use Word2Vec. Word2Vec, presented by Mikolov et al., is a natural language processing (NLP) algorithm that aims to maximize the co-occurrence probability of words in the same sentence. Our method uses the same basic algorithm, but instead of looking at words in sentences, we use the port sequences, where a sequence from a specific src IP corresponds to a sentence and the port numbers correspond to the words in that sentence. Let $p$ be a TCP/UDP port in the set of all ports $P$. Let $s = (p_1, p_2, \dots, p_n)$ be a sequence of ports sent from a specific src IP to a specific dst IP within some time interval $T$. We train a Word2Vec model on a corpus of example sequences $S$, such that a sequence $s \in S$ is treated as a sentence and $p_i$ is treated as a word.
The trained Word2Vec model is able to convert a port $p$ into a vector representation (embedding) $e_p$ such that $e_p$ captures the meaning of $p$ given its observed locality among other ports in the sequences of $S$. An example of that property can be seen by looking at port 23 and port 2323, both of which are used for Telnet and hence are expected to appear in the scan data interchangeably. Therefore, they have remarkably similar embedding vectors. The same applies to arbitrary port numbers which are dynamic or not well documented. By using Word2Vec, we do not need to consider the fact that multiple ports use the same service, as the embedding process does this for us. Moreover, ports commonly found together in particular attack patterns will be associated as well; thus, $e_p$ captures the attack intent as well.
Unlike more straightforward methods, such as Bag of Words (BoW), Word2vec uses context windows in order to create the representation of each word. This method considers not only what other words are in the sentence, but also where they are, allowing for a better representation.
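The role of the context window can be illustrated with a small sketch that lists the (target, context) port pairs a skip-gram style model would be trained on. Whether the skip-gram or CBOW variant of Word2Vec is used, and the window size, are not stated here, so both are assumptions of this example, as is the helper name `context_pairs`.

```python
def context_pairs(sequence, window=2):
    """Return (target_port, context_port) pairs within a symmetric window.

    The window size is a hypothetical choice for illustration.
    """
    pairs = []
    for i, target in enumerate(sequence):
        lo = max(0, i - window)
        hi = min(len(sequence), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # a port is not its own context
                pairs.append((target, sequence[j]))
    return pairs


# The sequence observed in the wild, with a window of one port on each side.
print(context_pairs([42527, 80, 80], window=1))
# [(42527, 80), (80, 42527), (80, 80), (80, 80)]
```

Ports that repeatedly share a window end up with nearby embeddings, which is why interchangeable ports such as 23 and 2323 are expected to be close.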
In order to build this port-to-embedding transformation model we need to supply it with a significant amount of scan data, which could be computationally heavy. Fortunately, this model does not have to be rebuilt in every time window, and it is possible to use a pretrained model for a long period of time. The intuition behind this rationale is that the uses of each port do not change often and a well-trained model should be sufficient for a considerable amount of time.
Additionally, there is no need to save the model itself once trained. Instead of keeping the entire neural network model, one can save a hash table where the key is the port number and the value is the embedding for that port. This approach significantly reduces the amount of data that needs to be saved, as the number of possible ports is limited to 65,536. Furthermore, by eliminating the need to execute a neural network, DANTE can run extremely fast in a big data framework.
After each port has an embedding vector of size $d$, we want to obtain an embedding vector of the same size, $d$, that represents an entire port sequence $s$ containing $n$ ports. Although there are many methods for sentence embedding, recent research discovered that the best way to do so is to average the embeddings of the words in the sentence. In the port embedding case, we summarize the overall behavior (intent) of $s$ as a single vector by computing the average of that sequence’s port embeddings. Concretely, we compute
$$\bar{e}_s = \frac{1}{n} \sum_{i=1}^{n} e_{p_i}.$$
The resulting feature vector $\bar{e}_s$ can be used for any machine learning algorithm, such as a classifier or clustering algorithm.
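As a minimal sketch of the lookup-table idea and the averaging step, the following Python code stores port embeddings in a plain dictionary and averages them over a sequence. The embedding values are made up for illustration, and skipping ports that have no embedding is an assumption of this sketch rather than something the text specifies.

```python
def embed_sequence(sequence, port_embeddings):
    """Average the embeddings of the ports in a sequence.

    port_embeddings: dict mapping port number -> list of d floats.
    Ports without an embedding are skipped (an assumption of this sketch).
    Returns None if no port in the sequence has an embedding.
    """
    vectors = [port_embeddings[p] for p in sequence if p in port_embeddings]
    if not vectors:
        return None
    d = len(vectors[0])
    return [sum(v[k] for v in vectors) / len(vectors) for k in range(d)]


# Toy 2-dimensional embeddings; the values are invented for illustration.
port_embeddings = {
    23: [1.0, 0.0],
    2323: [0.9, 0.1],
    80: [0.0, 1.0],
}

print(embed_sequence([23, 80], port_embeddings))  # [0.5, 0.5]
```

Because repeated ports are averaged in as many times as they occur, a sequence such as "80, 80, 23" leans toward the embedding of port 80.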
V Temporal Clustering
As described in Section IV, it is possible to summarize sequences of ports as their average embedding and analyze their behavior by performing cluster analysis. However, it is important to inspect the clusters over time. By doing so we can: (1) detect new attacks as they emerge (novelty detection), (2) track attack campaigns and how their strategies change, (3) follow the re-use of known attacks, e.g., variants of the Mirai botnet, and (4) analyze the trend of ongoing attacks, such as changes in volume, sources, and targets.
However, darknet data is collected from multiple sources simultaneously. To manage this, the data is typically stored in a big data cluster such as Hadoop. Therefore, we propose a temporal clustering framework which can be used with any batch clustering algorithm. The framework operates as follows.
First, we sort and aggregate the most recent data into overlapping time windows. Let $L$ be the width of the window in minutes, and let $\delta$ be the step size by which we slide the window, where $\delta < L$. Following this process, let $W_t$ be the $t$-th time window in our data, where $W_{t+1}$ is the next sequential time window. Finally, let the ratio of observations shared between two neighboring windows be defined as
$$\alpha = \frac{|W_t \cap W_{t+1}|}{|W_t \cup W_{t+1}|}.$$
The overlap between neighboring windows is necessary in order to track clusters. To ensure this, the parameter $\delta$ should be small enough so that $\alpha > 0$.
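The windowing step can be sketched as follows. The helper names are hypothetical, and `shared_ratio` implements one plausible reading of the shared-observation ratio (intersection over union of observation keys), which is an assumption of this example.

```python
def sliding_windows(start, end, width, step):
    """Return (window_start, window_end) pairs that fit inside [start, end]."""
    if not 0 < step < width:
        raise ValueError("step must be positive and smaller than width")
    windows = []
    t = start
    while t + width <= end:
        windows.append((t, t + width))
        t += step
    return windows


def shared_ratio(obs_a, obs_b):
    """Share of observations common to two windows (intersection over union)."""
    a, b = set(obs_a), set(obs_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Eight hours of data, four-hour windows (240 minutes), one-hour step.
print(sliding_windows(0, 480, width=240, step=60))
print(shared_ratio({1, 2, 3, 4}, {2, 3, 4, 5}))  # 0.6
```

With a step equal to the width, neighboring windows would share no observations, which is why the sketch rejects that configuration.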
Next, we apply a clustering algorithm to the data of each time window to group the observations. We note that any batch clustering algorithm can be used, for example k-means, fuzzy c-means, Gaussian mixture models, hierarchical clustering, or spectral clustering. For our dataset, we found that DBSCAN worked best. This is because DBSCAN clusters data based on density. As a result, the number of clusters discovered is variable and does not need to be predefined (as in k-means). Another advantage is that DBSCAN can label outliers (points which are relatively far from the general distribution). This helps us analyze these cases separately without harming the quality of the clustering process.
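To illustrate why density-based clustering yields a variable number of clusters plus an explicit outlier label, here is a compact pure-Python DBSCAN. It is a quadratic-time teaching sketch written for this example, not the implementation used in the evaluation; the toy points and parameter values are assumptions.

```python
import math

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""

    def neighbors(i):
        # All points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


points = [(0, 0), (0, 0.1), (0.1, 0), (5, 5), (5, 5.1), (5.1, 5), (10, 10)]
print(dbscan(points, eps=0.3, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (10, 10) receives the noise label instead of being forced into a cluster, which is the behavior the text relies on for handling outliers separately.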
Between time window $W_{t-1}$ and time window $W_t$, concept drift can occur. This means that the number of clusters and their shapes can change. Moreover, a cluster in $W_t$ can be a current cluster (also found in $W_{t-1}$), an old cluster (found in $W_{t-k}$ where $k > 1$), or a new cluster (never seen before). Figure 4 illustrates this challenge.
To annotate the clusters in $W_t$, we first find the current clusters by comparing $W_t$ and $W_{t-1}$. A cluster in $W_t$ is mapped to a cluster in $W_{t-1}$ if there is a significant overlap of observations between them. We measure the overlap using the Jaccard similarity metric, defined as
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
where $A$ and $B$ are the sets of observations in the two clusters being compared.
The Jaccard similarity metric measures the similarity between sets of items. This metric can be used in our case because adjacent time windows overlap (by $\alpha$). As a result, clusters which have a high Jaccard similarity score have a large number of overlapping observations and thus are considered to be the same pattern.
By using the distributed system, we simultaneously calculate the Jaccard similarity of all of the clusters in $W_t$ with the clusters in $W_{t-1}$. If the Jaccard similarity is above a certain threshold for two clusters, then the cluster from $W_t$ is considered to be the same as the cluster from $W_{t-1}$ (i.e., a current cluster). In cases in which the cluster has no corresponding cluster from $W_{t-1}$, the cluster is considered new. The algorithm for mapping clusters between adjacent overlapping time windows is presented in Algorithm 1. Note that there is no need to use the embedding vector of each instance; only a key (the src IP in our case) is needed for the comparison.
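The cluster mapping step can be sketched as below. This is a sequential illustration under stated assumptions, not a reproduction of Algorithm 1: the function names, the dictionary layout, and the threshold of 0.5 are hypothetical, and the distributed execution is omitted.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of keys."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def map_clusters(prev_clusters, curr_clusters, threshold=0.5):
    """Map current cluster ids to previous-window labels via Jaccard similarity.

    prev_clusters: dict label -> set of src IPs in the previous window.
    curr_clusters: dict cluster id -> set of src IPs in the current window.
    threshold is a hypothetical value chosen for this example.
    Returns (mapping, unmapped), where mapping is cluster id -> label.
    """
    mapping, unmapped = {}, []
    for cid, members in curr_clusters.items():
        best_label, best_score = None, 0.0
        for label, prev_members in prev_clusters.items():
            score = jaccard(members, prev_members)
            if score > best_score:
                best_label, best_score = label, score
        if best_label is not None and best_score >= threshold:
            mapping[cid] = best_label
        else:
            unmapped.append(cid)  # candidate for recurring or novel concept
    return mapping, unmapped


prev = {"telnet": {"a", "b", "c", "d"}, "http": {"x", "y"}}
curr = {0: {"b", "c", "d", "e"}, 1: {"p", "q"}}
print(map_clusters(prev, curr))  # ({0: 'telnet'}, [1])
```

Only src IP keys are compared here, which matches the observation that the embedding vectors are not needed for this step.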
V-D Recovery & Discovery
The cluster mapping process presented in the previous section enables us to align the clusters with the previous time window, but we also want to be able to identify which of the clusters are recurring concepts (to retrieve their annotations), as well as novel concepts. Because storing all past data is, in most cases, impractical, our approach is to build a classifier model for each of the observed clusters. Each model is a binary one-vs-all classifier trained on the time window where the cluster was first seen. The instances that belong to the cluster get the label one, and the rest get the label zero. We found that Random Forest suits this problem well because this model, unlike classifiers such as k-nearest neighbors, does not need to store the data points and only needs to store the decision trees. We define the set of one-vs-all classifiers as $M$.
Let $M$ be our database of one-vs-all classifiers and let $c$ be a cluster in $W_t$ which we could not map to $W_{t-1}$. The probability that $c$ belongs to the cluster represented by $m \in M$ is computed as
$$\Pr(c \mid m) = \frac{1}{|c|} \sum_{x \in c} m(x),$$
where $m(x)$ returns the probability that $x$ belongs to the positive class using model $m$. We assign $c$ the annotation of cluster $m$ if $m$ obtains the highest probability among all models in $M$, and $\Pr(c \mid m) > \tau$ for some user-defined parameter $\tau$. Similarly to the Jaccard similarity calculation, the prediction step is easily distributed, as the predictions can be calculated simultaneously. A formal description is given in Algorithm 2.
In cases in which there is no match in any of the classifiers in $M$, we consider cluster $c$ to be a new cluster. Once a new cluster is found, we train a new classifier on this cluster’s data as previously explained.
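The lookup of recurring concepts can be sketched as follows. This is an illustration under assumptions, not a reproduction of Algorithm 2: the stored models are represented by plain callables standing in for trained Random Forest classifiers, the score is taken to be the mean positive-class probability over the cluster's members, and the threshold value is hypothetical.

```python
def match_recurring(cluster, classifiers, threshold=0.5):
    """Return the label of the best-matching stored classifier, or None.

    cluster: non-empty list of feature vectors from one unmapped cluster.
    classifiers: dict label -> callable returning P(positive class) for a vector.
    threshold is a hypothetical value chosen for this example.
    """
    best_label, best_prob = None, 0.0
    for label, predict_proba in classifiers.items():
        # Mean positive-class probability over the cluster (an assumption).
        prob = sum(predict_proba(x) for x in cluster) / len(cluster)
        if prob > best_prob:
            best_label, best_prob = label, prob
    # None means no stored concept matched, i.e. the cluster is treated as new.
    return best_label if best_prob > threshold else None


# Stand-in models: simple rules in place of trained one-vs-all classifiers.
classifiers = {
    "mirai-like": lambda x: 0.9 if x[0] > 0.5 else 0.1,
    "web-scan": lambda x: 0.9 if x[1] > 0.5 else 0.1,
}

print(match_recurring([[0.8, 0.1], [0.9, 0.2]], classifiers))  # mirai-like
print(match_recurring([[0.1, 0.1]], classifiers))              # None
```

A `None` result corresponds to the novelty case, after which a new classifier would be trained on the cluster's data.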
After some time, a concept drift may occur, and the patterns change slightly. To deal with this issue, in cases in which a known cluster appears in the data stream, we update and retrain the corresponding model.
VI Analysis of Darknet Traffic
Unlike other methods that used darknet data streams, DANTE does not try to find anomalies on specific ports, but rather finds concepts and trends in the data. While methods that find correlations between ports exist, (1) they operate offline, detecting patterns months after the fact, and (2) they do not track the patterns over time. In contrast, DANTE finds new trends online (within minutes) and can detect recurring and novel patterns. To the best of our knowledge, there are no online algorithms for detecting and tracking patterns in darknet traffic. Therefore, we demonstrate the usefulness of the proposed approach through a case study on over a year of data. The data was collected from a greynet [24, 7], meaning that the unused IPs are from a network that is populated by both active and unused IP addresses.
VI-A Configuration and Setup
Dataset. For the purpose of this research, the NSP established 1,126 different unused IP addresses across 12 different subnets. All traffic which was sent to these IPs was logged as darknet traffic. The traffic was collected in three batches: the first was recorded during a period of six weeks (44 days) from 10/25/2018 until 12/5/2018 (denoted by Batch 1), the second was recorded during a period of eight weeks (55 days) from 2/1/2019 until 3/26/2019 (denoted by Batch 2), and the third was recorded during a period of 37 weeks (257 days) from 21/1/2019 until 10/6/2019 (denoted by Batch 3). Note that Batch 2 and Batch 3 overlap by 36 days; this is because Batches 1 and 2 were recorded in the research phase of the project, whereas Batch 3 was recorded in the deployment stage, in real time. A deep analysis was performed on Batches 1 and 2. Batch 3 was used to demonstrate the system’s long-term stability. In total, 7,918,787,884 packet headers from 4,887,568 different source IP addresses were recorded, resulting in over 3 terabytes of data. Figure 6 shows the number of packets and source IP addresses for every hour in the first two batches. Note that due to a technical problem, one hour at the end of October is missing. Because the missing time is insignificant, and the proposed method can deal with missing time windows, those missing values do not affect the overall results.
Configuration. We chose the step size to be one hour and the window length to be four hours, similar to the work of Ban et al. A one-hour step size provides a sufficient amount of data while granting a security expert enough time to react to a detected attack. In addition, we chose the epsilon parameter of DBSCAN to be 0.3 and the minPts parameter to be 30, as those parameters resulted in an average of four new clusters every day (agreed by the security experts to be a reasonable number of clusters to investigate each day). Furthermore, we do not want clusters with a small number of src IPs, as those clusters are too small to represent a significant trend in the network, and thus should be treated as noise.
Scalable Implementation. To scale to the number of sources and amount of data, DANTE was implemented and evaluated at the NSP’s CERT using Spark on a Hadoop architecture. We tested the method on a Hadoop cluster consisting of 50 cores and 10 executors. The algorithm takes approximately 62 seconds to extract, embed, cluster, and map each four-hour time window.
Data Preprocessing. As most of the source IPs in the darknet data only sent one or two packets during the period of data collection, we decided to remove them, as the port sequences associated with those IPs are too short to constitute a meaningful pattern. Filtering those IPs reduces the noise and therefore improves the results. In addition, some of those IPs are likely to be the result of a random misconfiguration and not an active malicious attack. By removing those IPs, we reduced the number of packets by 33%.
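A minimal sketch of this filtering step is shown below. The record layout `(src_ip, dst_port)` and the function name are assumptions; the cutoff of three packets follows from removing sources that sent only one or two.

```python
from collections import Counter


def drop_sparse_sources(packets, min_packets=3):
    """Keep only packets whose src IP sent at least min_packets packets.

    packets: list of (src_ip, dst_port) tuples.
    """
    counts = Counter(src_ip for src_ip, _ in packets)
    return [p for p in packets if counts[p[0]] >= min_packets]


packets = [
    ("192.0.2.1", 23), ("192.0.2.1", 2323), ("192.0.2.1", 23),
    ("192.0.2.2", 80),                     # single packet: dropped
    ("192.0.2.3", 22), ("192.0.2.3", 22),  # two packets: dropped
]
print(drop_sparse_sources(packets))
```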
Because of time constraints, we use Batches 1 and 2 for a comprehensive analysis, and show that DANTE can find malicious attacks within hours, long before the first online report. Batch 3 is used to show that the algorithm can work over a long period of time, and can detect new and reoccurring clusters even after almost a year. A visualization of Batch 3’s clusters over the 10 months can be seen in Figure 8.
A total of 400 clusters (novel darknet behaviors) were discovered over Batches 1 and 2, and 1,141 clusters were discovered over Batch 3. As previously mentioned, the system discovered four new clusters each day, on average, as can be seen in Figure 7. This number, determined by the set parameters, is small enough for an analyst to evaluate but large enough to detect new concepts. We found that the vast majority of the src IPs belong to one of 16 large clusters (concepts) over the year. Some of these clusters reoccurred (e.g., botnet code reuse) and others always stayed present (e.g., worms looking for open Telnet ports). Figure 5 plots the volume of the discovered clusters over time for Batches 1 and 2. The $x$-axis represents time, and the $y$-axis represents the total number of src IPs in the data for each hour. In the visualization component of DANTE (Section III), this graph is updated periodically to help the analysts explore past and current trends.
The clusters discovered by DANTE can be roughly divided into four categories:
Service Recon. Clusters whose sequences have six or more unique ports: either benign web scanners or agents performing reconnaissance on active services.
Basic Attacks & Host Recon. Clusters whose sequences consist of a single port: agents trying to compromise a device via a single vulnerability, or performing reconnaissance to discover active network hosts.
Complex Attacks. Clusters whose sequences have two to five unique ports: agents which are attempting to exploit multiple vulnerabilities per device, are performing multi-step attacks, or are worms.
Noise and Outliers. A single cluster of benign sequences that come from misconfigurations or backscatter, or that are too small to represent an ongoing trend.
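The port-count thresholds in the four categories above can be read as a simple decision rule. The sketch below is a literal reading of those thresholds under the assumption that the count is taken over the unique ports of a cluster; the function name and arguments are hypothetical, and the labels assigned in the paper may involve additional analyst judgment.

```python
def categorize_cluster(unique_ports, is_outlier=False):
    """Map a cluster to one of the four categories by its unique-port count.

    `unique_ports` is an iterable of port numbers seen in the cluster.
    `is_outlier` marks the cluster of sequences that the clustering step
    rejected as noise.
    """
    if is_outlier:
        return "Noise and Outliers"
    n = len(set(unique_ports))
    if n == 0:
        raise ValueError("a cluster must contain at least one port")
    if n == 1:
        return "Basic Attacks & Host Recon"   # single port
    if n <= 5:
        return "Complex Attacks"              # two to five unique ports
    return "Service Recon"                    # six or more unique ports

single = categorize_cluster([445, 445, 445])
multi = categorize_cluster([9527, 9527, 5555])
scan = categorize_cluster([80, 88, 8000, 8008, 8080, 8081, 8181])
```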
In Table I we give statistics on these categories, and in Figure 8 we present the volume of each category in Batch 3 over time. Table II provides concrete examples of discovered attack patterns, by cluster, as labeled in Figure 5.
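|Cluster Type||No. of clusters (Batch 1)||No. of clusters (Batch 2)||No. of clusters (Batch 3)||No. of src IPs (Batch 1)||No. of src IPs (Batch 2)||No. of src IPs (Batch 3)||No. of packets (Batch 1)||No. of packets (Batch 2)||No. of packets (Batch 3)|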
|Basic Attacks & Host Recon||77||71||202||246,864||313,457||1,216,077||50,937,141||56,881,565||380,244,017|
|Noise and Outliers||1||1||1||115,645||142,920||819,566||582,646,403||743,977,834||5,935,419,695|
Service reconnaissance (port scanning) is performed by attackers to find and exploit backdoors or vulnerabilities in services. In Batches 1 and 2, we identified 51 port scanning clusters with an average of 929 different ports scanned in each cluster. However, it is important to note that a cluster in this group is not necessarily malicious. For example, cluster A in Figure 5 consists of src IPs of Censys, a security company which scans 40 ports to find and report vulnerable IoT devices. Interestingly, because of the embeddings, DANTE found that a portion of the IPs in cluster A do not belong to Censys’ subnet. Rather, we verified that they belong to a malicious actor copying Censys to stay under the radar.
In some cases, the service recon can consist of multiple ports that belong to the same service, so that an exploit can be used on this service even if the host is using an alternative port. For example, cluster B, which occurred on 3/8/2019, consists only of ports that can be associated with HTTP. This cluster consists of 17 ports, such as 80, 8080, 8000, 8008, 8081 and 8181. In addition, most of the src IPs are located in Taiwan (18%), Iran (15%) and Vietnam (12%). Because DANTE assigns similar embeddings to those ports, they were grouped together, and DANTE was able to detect this pattern and issue an alert. At the time of writing, we were not able to find any information on this scan online. This lack of reports could be explained by the fact that there was no significant peak in any of the ports involved, which makes it hard for conventional anomaly detectors to detect this pattern.
Basic Attacks & Host Recon
Many port sequences consist of a single port accessed one or more times. This behavior captures agents which are trying to exploit a single vulnerability (e.g., how the Mirai malware compromises devices via telnet). This behavior is also consistent with network scanning, where the attacker tries to detect live hosts to map out an organization. By following how campaigns (clusters) reoccur and grow in volume over time, analysts can use DANTE to find trends and discover new vulnerabilities. For instance, Figure 10 contains a scatter plot of each port sequence cluster from Batch 1 and its corresponding port.
It is interesting to see that, for example, port 445 is accessed by many src IPs, each of which accessed the port only 5.8 times on average. This type of cluster can be seen in the second largest cluster, C. This cluster consists of port 445 (SMB over IP, commonly used by worms such as Conficker [17, 43, 25] to compromise Windows machines), which was being attacked at a low rate, once every few hours. However, DANTE also detected that port 5060 (SIP, used to initiate and maintain voice/video calls) was trending, being attacked at a high rate from the USA.
Nonetheless, most of the clusters in this category are relatively small and only appear for a few hours. In another example, DANTE detected a network recon campaign between 1:00 and 6:00 AM GMT on 11/25/2018 (cluster D) on the unused port 11390. The attack consisted of 895 different src IPs which mostly targeted one of our 1,129 darknet IPs. This indicated which subnet the campaign was targeting and provided us with threat intelligence on a possible software vulnerability/backdoor on port 11390. Thanks to the distributed implementation, DANTE was able to issue an alert with this information about a minute after the data arrived.
|Cluster||Category||Number of src IPs||Number of Packets||Sequence Examples|
|A||Service Recon||141||873,458||[2077, 2077, 8877, 7080, …, 9304, 3556]|
|B||Service Recon||20,982||2,061,655||[8000, 88, 80, 8000, 8081, 80, 80]|
|C||Basic Attacks & Host Recon||113,407||7,061,042||[445, 445, 445]|
|D||Basic Attacks & Host Recon||895||68,065||[11390, 11390, …, 11390, 11390]|
|E||Basic Attacks & Host Recon||285,651||42,225,387||[23, 23, 2323]|
|F||Complex Attacks||43,305||1,718,440||[9527, 9527, 9527, 5555, 5555, 5555]|
|H||Complex Attacks||43,305||105,258||[7379, 7379, 5379, 5379, 6379, 6379]|
Clusters in the Complex Attacks category capture patterns involving multiple attack steps or vulnerabilities. This category often captures the most interesting attack patterns, which are also the most difficult to detect because their low traffic volume hides them under the noise floor. Moreover, without the proposed embedding approach, an attack which involves a sequence of ports would be clustered with the Basic Attacks, and thus the other ports in the attack would go unnoticed, making the attack impossible to distinguish from other campaigns.
We note that cluster E contains 30% of all darknet sequences. This is because the cluster captures ports 23 and 2323, which the embedding correctly associated with each other. Both ports are used for Telnet; they are connected with the Mirai botnet and are likely to appear in every darknet scanner from the last few years.
One example of this category occurred on 3/4/2019, when DANTE reported that a new large cluster had appeared (F in Figure 5), consisting of two ports and originating from China and Brazil only. Here, 92% of the src IPs sent the two ports in one order, and 8% of the src IPs sent them in the reverse order. This is a clear indication of one or more campaigns launched at the same time, aiming to exploit a new vulnerability. Four days after DANTE detected the attack, it was reported and attributed to a vulnerability in an IP camera (CVE-2017-11632). Interestingly, none of the reports mentioned that the attackers were also targeting port 5555. This demonstrates not only how DANTE can provide threat intelligence, but also how it can detect ongoing attacks.
Another example of this category is cluster G, which DANTE discovered on 11/22/2018. This cluster contains a pattern of scanning two specific ports, 7547 and 7550, which occurred on 11/22/2018 from 5 to 9 PM GMT. According to the Internet Storm Center (ISC) port search, a free tool that monitors the level of malicious port activity, this is the most significant peak of activity for port 7550 in the past two years. However, to the best of our knowledge, there have been no reports of an attack that utilizes this port. The missing information in the reports could suggest a novel attack that utilizes those two ports. In addition, port 7547 appears to have a large number of packets arriving each day at 10:00 AM GMT, which were assigned to a different cluster. Unlike cluster G, that cluster consists of port 7547 alone, with no additional ports in the scan. 11/22/2018 is the only day when activity on this port peaks at a different hour (see Figure 11). There are reports of a known Mirai botnet variant that uses port 7547 to exploit routers. Based on these reports, we attribute cluster G to Mirai activity. Since DANTE detected that the two ports are used interchangeably by the same src IPs, we suspect that a new vulnerability is being tested on port 7550 of routers. At the time of writing, the existence or absence of such a vulnerability had not yet been confirmed by any organization.
On 10/31/2018, DANTE generated an alert about a new attack (cluster H) one minute after it began. This attack continued every day from 11/15/2018 until 11/18/2018. In the attack, 1,789 different network hosts sent exactly four packets to two or three of the ports 5379, 6379, and 7379. According to the ISC, this is the largest peak in the use of port 5379 and the third largest peak for port 7379, although these ports are considered unused. We could not find any report online which identified these ports being used together in an attack. Identifying the source of these IPs can lead a CERT team to uncover an ongoing campaign.
In addition, 44.8% of the detected patterns (clusters) in this group reappeared on a later date (in some cases, one or two weeks later), sometimes with minor changes such as adding or removing some of the ports in the scan. Figure 9 presents a Gantt chart of those reoccurring clusters. In the case of the reoccurring patterns, DANTE did not send an alert regarding a new attack pattern but did report a new occurrence of the known pattern.
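To make the distinction between a new pattern and a new occurrence of a known pattern concrete, the following sketch matches the port set of a fresh cluster against previously seen clusters. It is a simplified stand-in and not DANTE's tracking algorithm, which operates on embeddings: here the Jaccard similarity over port sets is used instead, and the function names and the 0.6 threshold are assumptions chosen for illustration.

```python
def jaccard(a, b):
    """Jaccard similarity between two collections of ports."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def classify_occurrence(new_ports, known_clusters, threshold=0.6):
    """Return ("reoccurrence", name) if `new_ports` resembles a known
    cluster closely enough, otherwise ("new", None)."""
    best_name, best_score = None, 0.0
    for name, ports in known_clusters.items():
        score = jaccard(new_ports, ports)
        if score > best_score:
            best_name, best_score = name, score
    if best_score >= threshold:
        return ("reoccurrence", best_name)
    return ("new", None)

# Port sets taken from the cluster examples in Table II.
known = {"F": {9527, 5555}, "H": {5379, 6379, 7379}}
minor_change = classify_occurrence({5379, 6379}, known)   # one port removed
unseen = classify_occurrence({7547, 7550}, known)
```

A set-based measure tolerates the minor changes mentioned above (a port added or removed), but unlike an embedding it cannot relate two different ports that serve the same purpose.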
Noise and Outliers
This category consists of a single cluster containing the patterns that the DBScan algorithm defined as outliers. Those patterns are port sequences that are not large enough, in terms of the number of src IPs, to become a new cluster. The rationale behind this is that most of the traffic in this category consists of backscatter or misconfiguration packets, and thus does not represent a scan [29, 24]. Although some of the patterns can represent a scan, those kinds of patterns cannot indicate a trend due to their small volume. The size of this cluster is directly controlled by the minPts parameter in DBScan and can be changed at any time. As previously mentioned, we chose minPts to be 30 in order to create a reasonable number of clusters per day for a security expert to explore.
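The role of minPts can be seen in a compact, textbook-style version of DBSCAN. This is an illustrative pure-Python sketch, not the incremental variant used in DANTE; the `eps` value, the two-dimensional toy points, and the small `min_pts` are assumptions chosen to keep the example readable (the paper sets minPts to 30).

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point of this cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
    return labels

# Four nearby points form a cluster; the isolated point is labeled noise.
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
labels = dbscan(points, eps=0.5, min_pts=3)
strict_labels = dbscan(points, eps=0.5, min_pts=5)
```

Raising `min_pts` above the size of the dense group (as in `strict_labels`) sends every point to the noise label, which mirrors how a larger minPts grows the Noise and Outliers cluster and reduces the number of reported clusters.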
|Ban et al.: port set||Ban et al.: occur.||DANTE: port set||DANTE: occur.|
|445||982,528||80, 8001, 8080, 8081, 8088, and 10 other ports||3,699,342|
|37215, 23||244,212||1235 different ports||798,760|
|1433||130,363||1252 different ports||344,761|
|22||122,518||8080, 8000, 8333, and 44 other ports||237,758|
VI-C Comparative Discussion
In the previous sections, we demonstrated that DANTE can detect new and reoccurring attack patterns. As mentioned in section II, methods such as [38, 4, 5, 29] can detect port scanning (access to many different ports per host), and methods such as [29, 38, 26, 27] can observe traffic spikes on individual port numbers. However, DANTE is able to detect both port scans and port spikes while distinguishing between them. Furthermore, because DANTE uses Word2vec, it is able to cluster meaningful patterns consisting of different ports that have the same meaning/intent. For example, cluster B consists of port sequences which revolve around a specific type of service. Therefore, DANTE can detect both simple and complex patterns.
The greatest advantage of DANTE is its ability to detect patterns from the Complex Attacks category (e.g., CVE-2017-11632). This family, consisting of multiple ports per sequence, captures many well-known and dangerous malware families. Table IV exemplifies this claim by listing some of these malware families along with the ports which they access during their lifetime. To the best of our knowledge, there are no other works which can detect these kinds of patterns as quickly and accurately as DANTE. Concretely, [38, 29, 16, 26, 27] observe traffic on individual ports. Although [4, 5] find the connection between ports, they ignore the meaning and the content of the sequences. As another example, a method that looks for common temporal activity spikes across ports would not be able to detect cluster G, which consists of ports 7547 and 7550. Similarly, in cluster F, while port 9527 had a significant spike on the date of the attack, there was no spike in the usage of port 5555, so the two ports would not be clustered together. In contrast, DANTE uses the embedding of the ports to find the connection between them in near real time. It found that this was the first occurrence of those two ports together and issued an alert to the CERT within minutes, four full days before the first online report.
VI-D Comparison with Ban et al.
To evaluate the contribution of using Word2vec as a means for mining meaningful patterns, we compare our embedding approach to that of Ban et al. Ban et al. used a frequent pattern mining (FPM) algorithm called FP-growth on darknet traffic to group similar ports together. By using FPM, Ban et al. were able to find subsets of ports that occur together in the data. To the best of our knowledge, this is the most recent work that mines patterns in darknet traffic to find new concepts. We applied their method on Batch 1 using the same time windows. By using the same parameters and aggregations, we were able to extract the groups of ports and their number of occurrences (support). The top 20 largest subsets of both FPM and DANTE are presented in Table III.
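The notion of support used by FPM can be illustrated with a brute-force count of port subsets. This sketch is not FP-growth, which reaches the same counts far more efficiently through a prefix tree; it only shows what is being counted. The function name, the `max_size` limit, and the toy sequences are assumptions made for the example.

```python
from collections import Counter
from itertools import combinations

def frequent_port_sets(sequences, min_support, max_size=2):
    """Return every port subset (up to `max_size` ports) that appears in at
    least `min_support` sequences, mapped to its support count."""
    support = Counter()
    for seq in sequences:
        ports = sorted(set(seq))          # order and repetition are ignored
        for size in range(1, max_size + 1):
            for subset in combinations(ports, size):
                support[subset] += 1
    return {s: c for s, c in support.items() if c >= min_support}

sequences = [
    [23, 23, 2323],
    [2323, 23],
    [23],
    [445, 445],
]
frequent = frequent_port_sets(sequences, min_support=2)
```

Here the Telnet-related traffic yields three separate frequent sets, `(23,)`, `(2323,)`, and `(23, 2323)`, even though all three arguably describe one behavior.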
While FPM detects different subsets of ports, it cannot identify when different subsets are jointly used in novel attack patterns. This can be seen in cluster F, where DANTE clusters two different port sets together, one of which is the set [5555, 9527]. This clustering led to the discovery of port 5555 being used against Wireless IP Camera 360 devices in vulnerability CVE-2017-11632. In addition, Ban et al.’s method omits all sequences with more than six different ports, even though these patterns are important in detecting malicious service reconnaissance. For example, in cluster A, DANTE discovered a malicious actor who was copying Censys’s port scanning behaviors.
By using the semantic embedding of ports, DANTE learns that different ports which behave similarly (i.e., using the same service) should be considered close in the embedding space. That is why in cluster E, DANTE clustered together appearances of port 23 and port 2323 (both used by Telnet). On the other hand, the FPM algorithm could only create a set for every permutation of the ports although they all capture the same concept (see the three biggest FPM sets in Table III).
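The effect of averaging port embeddings can be sketched with a few hand-picked vectors. The three-dimensional vectors below are hypothetical and chosen only so that ports 23 and 2323 lie close together; in DANTE the vectors are learned by Word2Vec from observed port sequences. The function names are likewise assumptions.

```python
import math

# Hypothetical port embeddings, hand-picked for illustration only.
PORT_VECTORS = {
    23:   [0.9, 0.1, 0.0],
    2323: [0.8, 0.2, 0.0],
    80:   [0.0, 0.9, 0.3],
}

def embed_sequence(ports, vectors=PORT_VECTORS):
    """Represent a port sequence as the element-wise mean of its port vectors."""
    rows = [vectors[p] for p in ports]
    return [sum(col) / len(rows) for col in zip(*rows)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

telnet_a = embed_sequence([23, 23, 2323])
telnet_b = embed_sequence([2323, 23])
http = embed_sequence([80, 80])

same_concept = cosine(telnet_a, telnet_b)   # close to 1
different_concept = cosine(telnet_a, http)  # much lower
```

Because the two Telnet sequences map to nearly the same point regardless of order or repetition, a density-based clustering step would place them in one cluster, whereas a set-based method treats each port combination separately.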
Moreover, some ports typically appear in attack sequences and rarely appear alone. This means that there is semantic information which can be learned about these ports, such as the intent and how it is reflected in the other ports in the sequence. For example, in Table III, port 80 (HTTP) is a single concept according to FPM due to the FP-growth process. In contrast, DANTE re-associates port 80 with other patterns. This can be seen in the top 20 largest concepts, where port 80 appears in six different concepts but never appears by itself; in FPM, the port 80 singleton is the fifth largest subset.
Lastly, FP-growth creates a large number of sets, even when setting the minimum support parameter to be a relatively large value (1,819 sets with minimum support of 1,000 for a period of six weeks). The resulting number of sets is impractical for a CERT analyst to investigate. An advantage of using DANTE is that only a reasonable number of concepts with high importance are identified per day (average of four a day, as shown in Figure 7). At the same time, DANTE discovers small yet significant patterns that might represent dangerous attacks (such as cluster G).
VII Conclusion and Future Work
By mining darknet traffic, analysts can get frequent reports on ongoing and newly emerging threats facing their network. In this paper, we presented DANTE: a framework which enables network service providers to mine threat intelligence from massive darknet traffic streams. The framework provides (1) a novel method for representing a set of targeted ports as an embedding which captures the intent of the attack, and (2) a system for clustering, tracking, and detecting old and new attack patterns.
By evaluating DANTE on real darknet traffic, we were able to confirm that DANTE can track and detect reoccurring and new attack patterns. For example, the system detected a reported attack on IP cameras on 3/4/2019. Furthermore, we discovered some attacks which have not been reported, such as the patterns involving ports [5379, 6379, 7379] which appeared on 10/31/2018. We also compared DANTE to a well-known darknet mining algorithm and found that DANTE produced higher-quality results with more practical outputs.
DANTE is currently being deployed in Deutsche Telekom’s networks to provide their CERT with better threat intelligence.
As future work, we plan to improve DANTE by learning a semantic relationship between the port embeddings and other features such as geo-location, TTL, and packet size.
|QAKBOT||443, 65400, 2222, 21, 41947||Post-infection|
We hope that the framework described in this paper, the embedding and clustering techniques, will assist researchers and the industry in better securing the Internet.
We would like to thank Nadav Maman for his help in implementing this work in the Spark environment.
- [1] (2017-11) Evolving cauchy possibilistic clustering and its application to large-scale cyberattack monitoring. In 2017 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–7.
- [2] (2006) Practical darknet measurement. In 2006 40th Annual Conference on Information Sciences and Systems, pp. 1496–1501.
- [3] (2005) The internet motion sensor: a distributed blackhole monitoring system. In NDSS.
- [4] (2015-07) A study on association rule mining of darknet big data. In 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–7.
- [5] (2016-07) Towards early detection of novel attack patterns through the lens of a large-scale darknet. In 2016 Intl IEEE Conferences on Ubiquitous Intelligence Computing, Advanced and Trusted Computing, Scalable Computing and Communications, Cloud and Big Data Computing, Internet of People, and Smart World Congress (UIC/ATC/ScalCom/CBDCom/IoP/SmartWorld), pp. 341–349.
- [6] (2017) Detection of botnet activities through the lens of a large-scale darknet. In Neural Information Processing, Cham, pp. 442–451.
- [7] (2012) Behavior analysis of long-term cyber attacks in the darknet. In International Conference on Neural Information Processing, pp. 620–628.
- [8] (2016) Optimized Invariant Representation of Network Traffic for Detecting Unseen Malware Variants.
- [9] (2015-08) A Time Series Approach for Inferring Orchestrated Probing Campaigns by Analyzing Darknet Traffic. In 2015 10th International Conference on Availability, Reliability and Security, pp. 180–185.
- [10] (2012) A survey: recent advances and future trends in honeypot research. International Journal of Computer Network and Information Security 4 (10), pp. 63.
- [11] (2006) Density-based clustering over an evolving data stream with noise. In Proceedings of the 2006 SIAM International Conference on Data Mining, pp. 328–339.
- [12] (2019-01-21) Optimizing data stream representation: an extensive survey on stream clustering algorithms. Business & Information Systems Engineering.
- [13] (2012) Unsupervised Network Intrusion Detection Systems: Detecting the Unknown without Knowledge. Computer Communications 35, pp. 772–783.
- [14] (2013-05) A model of analyzing cyber threats trend and tracing potential attackers based on darknet traffic. Security and Communication Networks 7 (10).
- [15] (2010) Neural visualization of network traffic data for intrusion detection. Applied Soft Computing Journal 11, pp. 2042–2056.
- [16] (2016-12) Topological analysis and visualisation of network monitoring data: darknet case study. In 2016 IEEE International Workshop on Information Forensics and Security (WIFS), pp. 1–6.
- [17] (2014) An internet-wide view of internet-wide scanning. In Proceedings of the 23rd USENIX Conference on Security Symposium, SEC’14, Berkeley, CA, USA, pp. 65–78.
- [18] (1998) Incremental clustering for mining in a data warehousing environment. In VLDB, Vol. 98, pp. 323–333.
- [19] (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD, Vol. 96, pp. 226–231.
- [20] (2015-05) Inferring distributed reflection denial of service attacks from darknet. Computer Communications 62, pp. 59–71.
- [21] (2016) Novelty detection in data streams. Artificial Intelligence Review 45 (2), pp. 235–269.
- [22] (2014-09) Detection of DDoS Backscatter Based on Traffic Features of Darknet TCP Packets. In 2014 Ninth Asia Joint Conference on Information Security, pp. 39–43.
- [23] (2000) Clustering data streams. In Foundations of Computer Science, 2000. Proceedings. 41st Annual Symposium on, pp. 359–366.
- [24] (2005) Defining and evaluating greynets (sparse darknets). In The IEEE Conference on Local Computer Networks 30th Anniversary (LCN’05), pp. 344–350.
- [25] (2018) Who is knocking on the telnet port: a large-scale empirical study of network scanning. In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pp. 625–636.
- [26] (2008) An incident analysis system nicter and its analysis engines based on data mining techniques. In International Conference on Neural Information Processing, pp. 579–586.
- [27] (2015) A predictive zero-day network defense using long-term port-scan recording. In 2015 IEEE Conference on Communications and Network Security (CNS), pp. 695–696.
- [28] (2017-10) BotGM: Unsupervised graph mining to detect botnets in traffic flows. In 2017 1st Cyber Security in Networking Conference (CSNet), pp. 1–8.
- [29] (2014-08) Towards a taxonomy of darknet traffic. In 2014 International Wireless Communications and Mobile Computing Conference (IWCMC), pp. 37–43.
- [30] (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
- [31] (2011) Honeypot in network security: a survey. In Proceedings of the 2011 International Conference on Communication, Computing & Security, pp. 600–605.
- [32] (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
- [33] (2019) (Website).
- [34] (2015) A Near Real-Time Algorithm for Autonomous Identification and Characterization of Honeypot Attacks. Technical report.
- [35] (2016) IoTPOT: a novel honeypot for revealing current IoT threats. Journal of Information Processing 24 (3), pp. 522–533.
- [36] Malicious Events Grouping via Behavior Based Darknet Traffic Flow Analysis. Wireless Personal Communications 96.
- [37] (2017) Security risk analysis of enterprise networks using probabilistic attack graphs. In Network Security Metrics, pp. 53–73.
- [38] (2008) A framework for attack patterns’ discovery in honeynet data. Digital Investigation 5, pp. S128–S139.
- [39] (2016) (Website).
- [40] (2008) The SANS internet storm center. In 2008 WOMBAT Workshop on Information Security Threats Data Collection and Sharing, pp. 17–23.
- [41] (2016) MISP: the design and implementation of a collaborative threat intelligence sharing platform. In Proceedings of the 2016 ACM on Workshop on Information Sharing and Collaborative Security, pp. 49–56.
- [42] (2015) Towards universal paraphrastic sentence embeddings. arXiv preprint arXiv:1511.08198.
- [43] (2010) Internet background radiation revisited. In Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement, IMC ’10, New York, NY, USA, pp. 62–74.
- [44] (2016) Traffic features extraction and clustering analysis for abnormal behavior detection. In Proceedings of the 2016 International Conference on Intelligent Information Processing - ICIIP ’16, New York, New York, USA, pp. 1–6.