A Study of Deep Learning for Network Traffic Data Forecasting
We present a study of deep learning applied to the domain of network traffic data forecasting. This is a very important ingredient for network traffic engineering, e.g., intelligent routing, which can optimize network performance, especially in large networks. In a nutshell, we wish to predict, in advance, the bit rate for a transmission, based on low-dimensional connection metadata (“flows”) that is available whenever a communication is initiated. Our study has several genuinely new points: First, it is performed on a large dataset ( million flows), which requires a new training scheme that operates on successive blocks of data since the whole dataset is too large for in-memory processing. Additionally, we are the first to propose and perform a more fine-grained prediction that distinguishes between low, medium and high bit rates instead of just “mice” and “elephant” flows. Lastly, we apply state-of-the-art visualization and clustering techniques to flow data and show that visualizations are insightful despite the heterogeneous and non-metric nature of the data. We developed a processing pipeline to handle the highly non-trivial acquisition process and allow for proper data preprocessing to be able to apply DNNs to network traffic data. We conduct DNN hyper-parameter optimization as well as feature selection experiments, which clearly show that fine-grained network traffic forecasting is feasible, and that domain-dependent data enrichment and augmentation strategies can improve results. An outlook about the fundamental challenges presented by network traffic analysis (high data throughput, unbalanced and dynamic classes, changing statistics, outlier detection) concludes the article.
Keywords: DNN, Incremental Learning, Network Traffic Engineering.
This article is in the context of computer network traffic forecasting. We focus on using deep neural networks (DNNs). More precisely, we investigate how DNNs can predict, in advance, the approximate bit rate of a computer network communication. This is modeled as a classification task with three classes (low, medium and high). The key idea here is to take this decision based only on the metadata of the communication, which are represented, in their most basic form, as a 5-tuple: source and destination IP address, source and destination port as well as the transport protocol, e.g., TCP or UDP. An example of the flow metadata as well as the classification task is depicted in Fig. 1.
The motivation for investigating this kind of classification problem stems from the field of software-defined networking (SDN). While traditional and still most prevalent network routing algorithms are primarily based on the destination address, SDN-like techniques enable dynamic determination of paths based on traffic characteristics. For example, routers can typically choose between several paths to forward network traffic to a specific destination. On the one hand, in the case of paths with unequal costs, using only the optimal path could cause congestion while alternative paths are underutilized. On the other hand, using the hash of a 5-tuple to decide between multiple equal-cost paths might lead to unequal load balancing because the amount of transmitted data cannot be considered in advance. Also, the path cannot easily be changed during the communication. Therefore, predicting the bit rate of a communication beforehand is of high value for the routing and load balancing process.
1.1 Problem Formulation and Approach of the Article
Challenges The principal immediate challenges for machine learning in network traffic forecasting raised and addressed in this article are as follows:
data acquisition: Here, one encounters difficulties creating the technical infrastructure (i.e., administrative access to network devices, handling large amounts of data resulting from capturing the network traffic) and the fact that metadata contain sensitive information, requiring an anonymization strategy that preserves information content and relations. Furthermore, the encoding of metadata into a form that is suitable for DNNs and the generation of target values is essential.
regression problem: Network traffic forecasting is essentially a regression problem as a continuous and highly variable quantity (the bit rate) needs to be predicted, which is a challenging task that must be simplified suitably.
class imbalance: Communications transmitting very little data are much more frequent than those transferring huge amounts of data [nguyen2008survey]. The distribution of the bit rate as target value can be expected to change over time.
concept drift: The statistics of the problem may be time-dependent, e.g., depending on the day of week, the time of day, the season, technical changes, etc. A DNN classifier trained on day may therefore not be suited to classify metadata collected on day . We are therefore dealing with a problem where continual re-training must be conducted while retaining previous knowledge (see [pfuelb2019a] for a recent review on this kind of training paradigm).
big data setting: The amount of flows is so high, and their variability so significant, that DNN training on a representative training set can no longer be performed in-memory. In our scenario, the network devices we accessed to collect data delivered million records in hours (about of raw data respectively flows per second, including the 5-tuple).
Approach To address these challenges, we first treat DNN training as a streaming problem by dividing all collected metadata into blocks of flows each. Training and evaluation are then conducted in a semi-streaming fashion, starting with the first block and subsequently passing to the following ones, with all relevant preprocessing operations being performed block-wise. Concept drift is thus incorporated into DNN training, although it cannot be completely compensated. The class imbalance problem is currently handled by different class balancing mechanisms, since the whole reference dataset (our anonymized dataset is available upon request) is known prior to DNN training. This will have to be replaced by more generic solutions in the future. Lastly, we transform the regression problem into a classification problem with three classes, thus balancing the need for precision against the complexity of the problem.
1.2 Related Work
Network traffic forecasting with machine learning techniques is a field (see [nguyen2008survey] for a review) that is receiving increased attention, probably due to the recent advances in machine learning techniques, notably deep learning models. From a machine learning point of view, many recent articles can be grouped according to whether they conduct online or offline learning on streaming network data, what machine learning models they employ in general, what dataset they operate on and whether they systematically investigate the effects of data enrichment. To the best of our knowledge, all related works operate on datasets of around flows which is significantly smaller than the dataset we use in this study, and thereby avoid “big data” issues like the necessity to perform learning in blocks. Furthermore, related works reduce the network traffic forecasting problem to a binary classification into “mice” and “elephant” flows.
In [poupart2016online], the authors apply online and offline learning methods (Multi-Layer Perceptron, Gaussian Process Regression and Online Bayesian Moment Matching). The problem is treated as a two-class classification problem using three different datasets, one self-created (not available) and the others from other authors [benson2010network]. No data enrichment is performed; however, information about the first three exchanged packets is used in addition to a flow’s 5-tuple as a basis for classification, which differs from our approach that does not consider such information. In [xiao2015efficient], purely offline learning with two-class decision tree classifiers is performed on the “Wide” dataset and a self-created one (not available) coming from a data center, also without data enrichment. In [wang2016framework], semi-supervised SVMs are trained in an offline fashion to solve a two-class problem using a simple form of data enrichment. Evaluations are conducted on a dataset of approximately flows, “captured by the Broadband Communication Research Group in UPC, Barcelona, Spain” (no reference given, no data available). The authors of [SHI20171] use offline SVM training on two datasets captured on Chinese university campuses (no reference given, not available) and experiment extensively with feature selection schemes, although these are based only on the basic 5-tuple information. Another interesting, albeit not directly related, application of machine learning is the routing of flows itself (see [Valadarsky:2017:LR:3152434.3152441]).
1.3 Contribution of the Article
Overall, this study shows that fine-grained network traffic forecasting using three classes with DNNs is feasible, and that it can be performed in a “big data” setting, operating on separate data blocks sequentially. We furthermore investigate the effects of data enrichment beyond the basic 5-tuple information, while also dealing with anonymization and privacy issues. Lastly, we show that modern data visualization and clustering techniques can be readily applied to network traffic data in order to gain deeper insights into the structure of the problems and to “debug” machine learning solutions.
2 Flow Data Pipeline
We introduce a flow data pipeline (see Fig. 2) that is responsible for collecting the network traffic flows and producing a dataset consisting of flows describing communications. Data collection and the first parts of the data preparation (enrichment and anonymization) are entirely performed within our data center to ensure privacy (supported by the administration). The codebase of the pipeline is publicly available in our repository (https://gitlab.informatik.hs-fulda.de/flow-data-ml).
2.1 Data Collection
A flow is understood to be the history of a single transmission between two endpoints, from establishment to termination (only metadata). In particular, flows are partly characterized by the 5-tuple. Flows may include additional metadata, e.g., the duration or number of transferred bytes.
Flow data is collected from the networks at Fulda University of Applied Sciences. We export network flow data ( million flow records) using the NetFlow standard from the two core network devices in our university data center during a continuous time interval of hours on a weekday (02/15/2019 9:00 AM to 5:00 PM). These core components connect multiple subnets from the data center, laboratory, WiFi and campus networks. Collecting data from these diverse networks ensures realistic traffic characteristics and patterns to be used for the subsequent data analysis and network flow prediction. For example, collected traffic patterns include internal and external flows originating from client-to-server as well as server-to-server communication.
2.2 Data Preparation
Due to the extremely large amount of collected flow records, these are partitioned into separate blocks of records each (representing ), and the operations given below are applied block-wise. We thus obtain the final dataset, which is used for all experiments in this article, containing about million aggregated flow entries out of approximately million captured flow records.
Enrichment Based on the collected 5-tuples, additional context information is derived. For example, groups of internal and external addresses (e.g., IP subnets, VLANs and geographical regions) can be identified from the network addresses. These contexts deliver additional characteristics and patterns for the subsequent analysis and the prediction process.
Anonymization An appropriate anonymization algorithm was developed to ensure that collected metadata cannot be traced back to individual network addresses and end users, while keeping the syntax and semantics of the data intact so that the contained characteristics are not distorted for the subsequent analysis. This mechanism anonymizes all address-related metadata, i.e., IP and network addresses. The data center that exports the traffic metadata defines a password, which is cryptographically hashed and used as a seed for randomized permutation tables. Using a fixed seed ensures consistent anonymization for further data acquisitions. Each octet of an IP address is anonymized individually using these tables. This way, the semantics of an address, e.g., regarding the relevance and order of the octets forming a group of network addresses, stay intact after the anonymization and can still be used as a characteristic feature for prediction. However, adjacency of addresses is not preserved, owing to the seeded randomization of the permutation tables.
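The octet-wise scheme described above can be illustrated with a minimal sketch. It is not the pipeline's actual code: the choice of SHA-256, the use of one permutation table per octet position, and the function names `build_permutation_tables` and `anonymize_ipv4` are assumptions made for illustration.

```python
import hashlib
import random


def build_permutation_tables(password):
    """Derive one permutation of 0..255 per octet position from a password."""
    # Assumption: SHA-256 as the cryptographic hash; its digest seeds the PRNG.
    seed = hashlib.sha256(password.encode("utf-8")).digest()
    rng = random.Random(seed)
    tables = []
    for _ in range(4):  # assumption: one table per IPv4 octet position
        table = list(range(256))
        rng.shuffle(table)
        tables.append(table)
    return tables


def anonymize_ipv4(address, tables):
    """Replace each octet by its image under the table for its position."""
    octets = [int(part) for part in address.split(".")]
    return ".".join(str(tables[pos][octet]) for pos, octet in enumerate(octets))
```

Because every table is a bijection, two addresses that share a prefix still share the anonymized prefix, while numerically adjacent addresses are generally no longer adjacent. Note that Python's `random` module is not a cryptographically secure generator, so a production system would likely need a different source of permutations.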
Aggregation Exported unidirectional flow records that potentially represent only a part of a communication (due to exporter timeouts or cache sizes) are aggregated to ensure coherent flow entries. The aggregation of records is based on the 5-tuple and additional traffic characteristics, e.g., flags and predefined time intervals. Duplicated flow records from both exporting network devices are filtered. During this phase, the number of records is reduced to, on average, of the collected flow records. Afterwards, ports greater than are replaced by zero because they are chosen randomly by common operating systems.
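As a rough illustration of the aggregation step, the following sketch merges partial records by their 5-tuple only. It deliberately ignores flags, time intervals and duplicate filtering, and the record field names as well as the port cut-off `PORT_THRESHOLD` are hypothetical values chosen for the example.

```python
# Hypothetical cut-off for illustration; ports above it are replaced by zero.
PORT_THRESHOLD = 32768


def aggregate(records, port_threshold=PORT_THRESHOLD):
    """Merge partial flow records that share a 5-tuple into one flow entry."""
    flows = {}
    for rec in records:
        key = (rec["src_ip"], rec["dst_ip"],
               rec["src_port"], rec["dst_port"], rec["proto"])
        entry = flows.get(key)
        if entry is None:
            flows[key] = {"bytes": rec["bytes"],
                          "start": rec["start"], "end": rec["end"]}
        else:
            entry["bytes"] += rec["bytes"]
            entry["start"] = min(entry["start"], rec["start"])
            entry["end"] = max(entry["end"], rec["end"])

    result = []
    for (src_ip, dst_ip, src_port, dst_port, proto), entry in flows.items():
        duration = entry["end"] - entry["start"]
        result.append({
            "src_ip": src_ip,
            "dst_ip": dst_ip,
            # High ports carry little information, so they are zeroed.
            "src_port": src_port if src_port <= port_threshold else 0,
            "dst_port": dst_port if dst_port <= port_threshold else 0,
            "proto": proto,
            "bytes": entry["bytes"],
            "duration": duration,
            # Bit rate in bits per second; zero-length flows get 0.0.
            "bit_rate": 8 * entry["bytes"] / duration if duration > 0 else 0.0,
        })
    return result
```

The port replacement happens after merging so that records of the same communication are still matched on their original ports.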
Normalization We convert raw, heterogeneous features into a format suited for DNNs, e.g., a sequence of floating-point values, in three different ways: Bit patterns are converted by promoting each bit to a float value of 0 or 1, float values are scaled to the range between 0 and 1 (min-max normalization) and categorical values are encoded as “one-hot” vectors, i.e., a vector whose length equals the number of distinct categories and which contains a single value of 1 at a unique position. An example is given in Tab. 1.
|Feature||Raw format||Bit pattern (size)||Float value(s) (size)|
|IP address||81.169.238.182||0,1,0,1,0,0,0,1,1,0,1,0,1,0,0,1, 1,1,1,0,1,1,1,0,1,0,1,1,0,1,1,0 (32)||0.3176, 0.6627, 0.9333, 0.7137 (4)|
|Port||80||0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0 (16)||0.0012 (1)|
|Protocol||6||0,0,0,0,0,1,1,0 (8)||0.0235 (1)|
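The three encodings can be reproduced with a few lines of Python. This is a sketch under the assumption that min-max normalization uses the full value range of each field (0 to 255 per octet, 0 to 65535 for ports), which is consistent with the float values in Tab. 1; the helper names are hypothetical.

```python
def to_bits(value, width):
    """Promote each bit of an integer to a float, most significant bit first."""
    return [float((value >> shift) & 1) for shift in range(width - 1, -1, -1)]


def min_max(value, lo, hi):
    """Scale a value linearly from [lo, hi] to [0, 1]."""
    return (value - lo) / (hi - lo)


def one_hot(category, categories):
    """Encode a category as a vector with a single 1.0 at its index."""
    vector = [0.0] * len(categories)
    vector[categories.index(category)] = 1.0
    return vector


def encode_ip_bits(address):
    """32 float bits for a dotted-quad IPv4 address."""
    bits = []
    for octet in address.split("."):
        bits.extend(to_bits(int(octet), 8))
    return bits


def encode_ip_floats(address):
    """Four floats in [0, 1], one per octet."""
    return [min_max(int(octet), 0, 255) for octet in address.split(".")]
```

For instance, the 32-bit pattern listed for the IP address in Tab. 1 corresponds to the octets 81, 169, 238 and 182, and `encode_ip_floats("81.169.238.182")` yields the four float values of that row when rounded to four decimals.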
The output of the normalization and thus of the data preparation process is the actual dataset (about ). Besides the bit rate, there are other flow features that can be used as class labels and hence for a prediction, e.g., the number of transferred bytes or the duration of a flow. A combination of selected labels is conceivable as well. The dataset's structure is summarized in Tab. 2.
|Feature||Data format||Src||Feature||Data format||Src|
|DC = Data Collection; DE = Data Enrichment; DA = Data Aggregation; OH = One-Hot|
2.3 Data Processing
In the data processing phase a fully-connected DNN is trained to predict the bit rate of a communication. During the processing of the created flow dataset, three steps are performed blockwise: At first, a sub-dataset can be extracted by feature selection. Afterwards, data samples are labeled based on predefined class boundaries, which are selected to fit an almost balanced data distribution (presented in Sec. 3.1). Finally, training and testing is done on each individual block sequentially. To evaluate different hyper-parameter setups, we do a parameter optimization. The detailed process and related results are presented in Sec. 4.
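The block-wise procedure can be sketched as a simple loop that trains on the chronologically earlier part of each block and tests on its tail. The sketch below is a simplification: it tests once per block rather than periodically during training, and `MajorityClassModel` is a hypothetical placeholder standing in for the DNN so that the example stays self-contained.

```python
from collections import Counter


def iter_blocks(samples, block_size):
    """Yield consecutive, chronologically ordered blocks."""
    for start in range(0, len(samples), block_size):
        yield samples[start:start + block_size]


class MajorityClassModel:
    """Placeholder for the DNN: remembers label counts across blocks."""

    def __init__(self):
        self.counts = Counter()

    def partial_fit(self, labels):
        self.counts.update(labels)

    def predict(self):
        return self.counts.most_common(1)[0][0]


def run_stream(samples, block_size, test_fraction):
    """Train on the head of each block, test on its tail, block by block."""
    model = MajorityClassModel()
    accuracies = []
    for block in iter_blocks(samples, block_size):
        test_size = max(1, int(len(block) * test_fraction))
        train, test = block[:-test_size], block[-test_size:]
        # The model is never re-initialized, so earlier blocks keep influencing it.
        model.partial_fit([label for _, label in train])
        guess = model.predict()
        hits = sum(1 for _, label in test if label == guess)
        accuracies.append(hits / len(test))
    return accuracies
```

Even with this trivial placeholder, a shift in the label distribution between two blocks shows up as a drop in per-block accuracy, which is the kind of effect the block-wise evaluation is meant to expose.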
3 Exploratory Data Analysis and Visualization
To provide a better understanding of flow data, we explore the distribution of features used for labeling (see Sec. 3.1) and visualize the intrinsic structure of the data (see Sec. 3.2). The analysis is performed on the first flow entries (including all features) that are selected from the shuffled test data of the first block. Due to this, the same t-SNE output is used for all context-related taggings. No significant deviations were observed when performing this analysis on other blocks (every 50th block was compared). All comparisons of the tagged t-SNE outputs are done by visual inspection.
3.1 Label Distribution
We analyze the distribution of flow features that can be target values for traffic flow prediction, i.e., the transmitted bytes, the duration or the bit rate calculated from both. Results are shown in Fig. 6. As other authors noted previously [nguyen2008survey], these features deviate strongly from a uniform distribution, which makes the determination of suitable class boundaries challenging. The principal conclusion we draw from this is that we must use class balancing (see Sec. 4). Although the data distribution justifies our class boundaries, their practical applicability, e.g., for intelligent routing, is questionable and considered as future work.
3.2 Structural Context
To discover structural relations and similarities between individual flow entries (see Figs. 13 and 10), we use t-Distributed Stochastic Neighbor Embedding (t-SNE) [Maaten2008], a state-of-the-art visualization method, which maps high-dimensional data samples to a low-dimensional space (2D or 3D). We use the t-SNE implementation of the scikit-learn framework, parameter values being an iteration counter of , a perplexity of and a learning rate of .
Fig. (a) illustrates feature similarities between flow entries that have a common transport protocol. Two symmetric accumulations indicate opposite directions of the same communication. Furthermore, there are examples that do not share the same transport protocol, but t-SNE points out similar feature data.
Tagging of each data sample according to its type of communication, which is the combination of the source and destination locality (either private or public), is shown in Fig. (b). For each communication type symmetric accumulations can be identified, whereas coherent spots map to individual flow directions.
Additionally, we apply the k-means clustering algorithm on the sub-dataset and use the result for tagging the data samples in the t-SNE output. With k-means, high-dimensional data samples are grouped around a predefined number of iteratively relocated cluster centers. We use the implementation of TensorFlow (v1.12) with cluster centers, whereby the initial location of each center is determined randomly and the squared Euclidean distance is used as metric. The tagging of the t-SNE output based on k-means clustering for the data samples is shown in Fig. (c). According to the t-SNE results, it can be observed that there are samples that belong to the same cluster but have certain feature differences and that there are samples of different clusters sharing feature properties. The actual results depend on the chosen number of cluster centers.
We also perform an outlier detection for each k-means cluster using different metric thresholds (average and median distance as well as both summed up with the standard deviation). See Fig. (a) for an exemplary presentation of detected outliers. With regard to our experiments described in the next section, the outlier detection has no significant influence on network flow prediction.
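One of the threshold variants mentioned above, the average distance plus one standard deviation, can be sketched as follows. The sketch assumes that cluster centers and assignments are already available from a k-means run, uses the plain Euclidean distance for thresholding, and the function name `detect_outliers` is hypothetical.

```python
import math
import statistics


def detect_outliers(points, centers, assignments):
    """Return indices of points farther from their center than mean + std."""
    distances = [math.dist(p, centers[c]) for p, c in zip(points, assignments)]
    outliers = []
    for cluster in range(len(centers)):
        members = [i for i, c in enumerate(assignments) if c == cluster]
        if len(members) < 2:
            continue  # no meaningful spread for empty or singleton clusters
        dists = [distances[i] for i in members]
        # Per-cluster threshold: average distance plus one standard deviation.
        threshold = statistics.mean(dists) + statistics.pstdev(dists)
        outliers.extend(i for i in members if distances[i] > threshold)
    return sorted(outliers)
```

Replacing `statistics.mean` with `statistics.median` gives the median-based variant; the thresholds are computed per cluster so that compact and diffuse clusters are treated on their own scale.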
According to Fig. (b), DNS and HTTP(S) are the most frequently used application protocols in the dataset. The large proportion of DNS traffic explains the high share of flow entries with a low bit rate or a short duration.
The data analysis emphasizes relations and feature similarities between individual data samples. All visualizations use the same t-SNE output, but context-related tagging, e.g., regarding used protocols or communication directions, helps to clarify different structural patterns within network flow data.
4 Network Flow Prediction Experiments with DNNs
We employ a fully-connected DNN with layers of identical sizes , each hidden layer applying a ReLU transfer function whereas the output layer applies a softmax function. The batch size = and the number of training epochs = are fixed for all experiments. DNN training minimizes a standard cross-entropy loss by stochastic gradient descent by means of the Adam optimizer. The last of a chronologically ordered data block are completely used for testing every 50th iteration.
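For reference, the softmax output and the cross-entropy loss mentioned above can be written in their standard textbook form. The symbols are notation introduced here for illustration and are not taken from the article: $z_{n,k}$ is the output-layer activation for sample $n$ and class $k$, $y_{n,k}$ the one-hot target, $K = 3$ the number of classes and $B$ the batch size.

```latex
p_{n,k} = \frac{\exp(z_{n,k})}{\sum_{j=1}^{K} \exp(z_{n,j})},
\qquad
\mathcal{L} = -\frac{1}{B} \sum_{n=1}^{B} \sum_{k=1}^{K} y_{n,k} \, \log p_{n,k}
```

Since each target vector is one-hot, the inner sum reduces to the negative log-probability that the network assigns to the true class.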
Choice of evaluation metrics Since we are dealing with a three-class problem, the usual metrics for binary problems, such as F1 score, precision and recall, are not directly applicable. Instead, we present results in the more general form of a confusion matrix, from which we can derive classification accuracy by considering only the diagonal elements. Both of these measures are applicable to classification tasks with an arbitrary number of classes, which can be useful for comparison should we decide to introduce more classes at a later point. In order to allow a more in-depth comparison between the experimental conditions (using the 5-tuple information vs. using all features), we additionally compute the standard binary performance metrics separately for each class, treating each class in turn as the positive class.
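Deriving accuracy and the per-class binary metrics from a confusion matrix can be sketched as follows, using a one-vs-rest view of each class. The convention that rows hold true classes and columns hold predictions is an assumption of this example, as is the function name.

```python
def per_class_metrics(cm):
    """cm[i][j] = number of samples with true class i predicted as class j."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    # Accuracy only needs the diagonal of the confusion matrix.
    accuracy = sum(cm[i][i] for i in range(n)) / total
    metrics = []
    for k in range(n):
        tp = cm[k][k]
        fp = sum(cm[i][k] for i in range(n)) - tp  # predicted k, but wrong
        fn = sum(cm[k]) - tp                       # true k, but missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics.append({"precision": precision, "recall": recall, "f1": f1})
    return accuracy, metrics
```

The same function works unchanged if more classes are introduced later, since it only relies on the matrix being square.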
Hyper-parameters Tunable parameters include the learning rate and the optional application of dropout to input and hidden layers , with different dropout probabilities. The assignment of labels is done based on a class boundary parameter . This list of boundary values is consistently used for all blocks before a training phase. In order to specify the class balancing method, the parameter is introduced. Balancing for training and test data is achieved either by standard class weighting or under-sampling. Furthermore, a feature selector provides support for the construction of sub-datasets. specifies the number of cluster centers that are used for outlier detection using k-means clustering. All hyper-parameters mentioned here (, , , , , , , , ) are varied to perform a joint parameter optimization.
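The two balancing options can be illustrated with a short sketch. The weighting formula (total count divided by the number of classes times the class count) is one common heuristic and an assumption here, not necessarily the exact scheme used in the experiments; both function names are hypothetical.

```python
import random
from collections import Counter, defaultdict


def class_weights(labels):
    """Inverse-frequency weights: rare classes get larger loss weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def under_sample(samples, labels, seed=0):
    """Randomly drop samples until every class is as small as the rarest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in zip(samples, labels):
        by_class[label].append(sample)
    target = min(len(items) for items in by_class.values())
    balanced = []
    for label, items in by_class.items():
        balanced.extend((sample, label) for sample in rng.sample(items, target))
    return balanced
```

Class weighting keeps all samples and rescales their contribution to the loss, whereas under-sampling discards data from frequent classes, which shortens training but may lose information.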
We train all DNN classifiers on the first blocks sequentially and evaluate the achieved prediction accuracy on each block’s test set. In order to obtain the best possible results, we conduct a combinatorial hyper-parameter optimization, leading to a total of DNN training and evaluation runs. The explored parameter ranges are summarized in Tab. 3. Depending on the hardware, the computation time of one experiment is between and minutes. Based on the complexity of the DNN and the chosen parameters, the GPU memory usage is between and and the RAM utilization varies from to .
|Dropout (input, hidden)|
|Neurons per layer|
|Class boundaries (bit rate)||
|Class balancing method||(under-sampling),|
|Cluster centers (outlier detection)|
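A combinatorial search of this kind can be enumerated with `itertools.product`. The grid below is purely illustrative: the parameter names and values are hypothetical placeholders and are not the ranges listed in Tab. 3.

```python
import itertools

# Hypothetical example grid; names and values are placeholders only.
EXAMPLE_GRID = {
    "learning_rate": [0.1, 0.01, 0.001],
    "dropout_hidden": [1.0, 0.5],
    "neurons_per_layer": [200, 500],
    "balancing": ["under_sampling", "class_weighting"],
}


def iter_configs(grid):
    """Yield one dict per combination of parameter values."""
    names = list(grid)
    for values in itertools.product(*(grid[name] for name in names)):
        yield dict(zip(names, values))


def count_configs(grid):
    """Number of runs a full combinatorial search would require."""
    total = 1
    for values in grid.values():
        total *= len(values)
    return total
```

The number of runs grows multiplicatively with each added parameter, which is why the per-experiment runtime and memory footprint matter for a joint optimization.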
Labeling Because of the unbalanced data distribution (see Sec. 3.1) that makes regression problematic (also addressed in [poupart2016online]), we treat network flow prediction as a classification problem, using the three exemplary classes “low”, “medium” and “high”. The calculated bit rate of each flow is used for computing a class based on a thresholding operation (with the two thresholds adapted such that the distribution of classes is approximately flat). In addition to the set of boundaries used for class division, Tab. 4 presents related characteristics for each class.
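The thresholding operation can be sketched as follows. Choosing the boundaries as empirical quantiles of the observed bit rates is one simple way to obtain an approximately flat class distribution; it is an assumption of this example and not necessarily how the boundaries in Tab. 4 were determined.

```python
import bisect


def balanced_boundaries(bit_rates, n_classes=3):
    """Pick class boundaries at the empirical quantiles of the bit rates."""
    ordered = sorted(bit_rates)
    return [ordered[len(ordered) * k // n_classes] for k in range(1, n_classes)]


def label(bit_rate, boundaries):
    """Map a bit rate to a class index (0 = low, 1 = medium, 2 = high)."""
    # A value equal to a boundary falls into the higher class.
    return bisect.bisect_right(boundaries, bit_rate)
```

With two boundaries the function yields the three classes discussed above; passing a longer boundary list would produce a finer-grained labeling without any other change.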
Fig. 14 depicts the distribution of the true labels within the t-SNE output. Whereas some spots primarily contain data samples belonging to the same class (c0), other spots are a mixture of different (c1, c2) or all classes. With regard to Fig. (c), the results of k-means clustering cannot be used to classify the samples adequately. Hence, clustering alone is not sufficient to predict the bit rate of a flow.
The two experiments with the highest accuracy, determined by the parameter optimization, are shown in Fig. 16. In the first experiment, training is done on all available flow features (247 inputs), whereas in the second one only the 5-tuple (104 inputs) is used. Fig. 16 depicts the trend of the prediction accuracy. At the beginning of a new block, the accuracy can vary considerably compared to the value for the previous block, but it generally stabilizes for each block after a few training iterations. This indicates a slight change in statistics (concept drift) between the individual blocks, which becomes clearer in Fig. 18. We achieve a maximum accuracy of for the first and for the second experiment. Regarding these maxima, the data enrichment leads to an accuracy increase of about . Normalized confusion matrices for both experiments are also shown in Fig. 16. Further evaluation metrics are listed in Tab. 5. Fig. 17 gives an overview of the misclassified data samples. With regard to the false labels, prediction errors for coherent spots mainly belong to the same class.
5 Discussion and Principal Conclusions
The principal conclusions we can draw from the presented experiments are: First of all, DNNs are a feasible tool for performing fine-grained network traffic flow prediction in a “big data” setting, achieving an accuracy of roughly even though performed in a streaming fashion on successive and independent blocks of flow data. Previous studies reached accuracies over but grouped network flows into only two classes (“mice” and “elephant” flows), which is considerably less useful for fine-grained network traffic engineering, and, above all, processed all training data in a single block. Secondly, we find that data enrichment can be useful, as it improves classification accuracy by roughly at manageable computational cost. Thirdly, our visualization and clustering studies show that there is no simple way to improve results by outlier detection, presumably because the data samples do not lend themselves to clustering using Euclidean distance, and a custom distance metric would have to be used here. We establish nevertheless that t-SNE is a useful tool to visualize structures and relations in network flow data. Lastly, we confirm by experiments that there is moderate to strong concept drift in flow data, and that appropriate measures will have to be taken in future work to address this issue.
Comparability and validity of results We may ask how generalizable our results are, and the answer is of course complex. In a university campus scenario such as ours, there are numerous factors that may affect the results, like the day of week, the season, the proximity of tests, etc. For example, the WiFi network, including thousands of connected students, represents a dynamic setup that probably cannot be handled easily by a DNN because connections are unique and non-recurring (in contrast to, e.g., communications between servers). Identifying and excluding such “difficult” flows could conceivably improve prediction accuracy and generalizability of our results. As stated in Sec. 1.2, publicly available datasets are relatively small. Larger datasets are not accessible, probably due to privacy issues. Even though our campus network is unique in its structure and thus results on our data do not guarantee in any way that the approach will work in other networks, the same can be said for any of the previous studies on the subject. The only way to show generality would be to have access to several datasets of network flows of comparable size, and to perform the same experiments on all of them. Comparing our results to other studies on the subject is further complicated by the fact that we perform three-class classification whereas previous studies were concerned with two-class scenarios only.
Discussion of the three-class scenario To show that our architecture can replicate previous results, we trained our DNNs on a two-class task with a threshold value of bits per second between “mice” and “elephant” flows and obtain a test accuracy of over , which is comparable to the results of other studies while taking the abovementioned caveats into consideration. Obviously, introducing an additional class degrades the classification accuracy, simply because guessing has a lower chance of success with one more class to choose from. Whether this lower prediction accuracy is compensated by the benefit of a more fine-grained prediction would have to be tested in simulation, which is what we are currently working on. For this study, we wished to establish that more than two classes can be successfully integrated into a prediction scheme, all the more since the computational cost of predicting more classes is negligible at inference time. When also considering that we perform learning in a streaming setting, which in general degrades performance w.r.t. settings where all data are simultaneously available for training, our results must be considered very competitive.
Justification of using DNNs The principal reason for using DNNs as opposed to other methods proposed in the literature, e.g., Gaussian Process Regression (GPR) [poupart2016online], is the fact that in future we want to train our classifiers in a streaming fashion: As soon as a new data block has been collected, model re-training is conducted automatically, and the trained model is immediately deployed and used for flow classification. This puts a strong focus on the scalability of the training process w.r.t. the number of data samples. In [poupart2016online], a training complexity of is reported for GPR, where concrete values for , or how they are chosen, are unclear. Naively, GPR has a training complexity of , and it is unclear whether the optimizations discussed in [poupart2016online] can be tuned without human intervention (no code is provided). In contrast, DNNs have a natural training complexity of without any optimizations, so they do seem a more natural choice in the “big data” context. We will investigate the performance of other learning algorithms in future work, and compare them to our approach.
Acknowledgements We thank Sven Reißmann from the university data center for assistance with data collection and preparation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan Xp GPU.