LuNet: A Deep Neural Network for Network Intrusion Detection
Network attack is a significant security issue for modern society. From small mobile devices to large cloud platforms, almost all computing products, used in our daily life, are networked and potentially under the threat of network intrusion. With the fast-growing network users, network intrusions become more and more frequent, volatile and advanced. Being able to capture intrusions in time for such a large scale network is critical and very challenging. To this end, the machine learning (or AI) based network intrusion detection (NID), due to its intelligent capability, has drawn increasing attention in recent years. Compared to the traditional signature-based approaches, the AI-based solutions are more capable of detecting variants of advanced network attacks. However, the high detection rate achieved by the existing designs is usually accompanied by a high rate of false alarms, which may significantly discount the overall effectiveness of the intrusion detection system. In this paper, we consider the existence of spatial and temporal features in the network traffic data and propose a hierarchical CNN+RNN neural network, LuNet. In LuNet, the convolutional neural network (CNN) and the recurrent neural network (RNN) learn input traffic data in sync with a gradually increasing granularity such that both spatial and temporal features of the data can be effectively extracted. Our experiments on two network traffic datasets show that compared to the state-of-the-art network intrusion detection techniques, LuNet not only offers a high level of detection capability but also has a much low rate of false positive-alarm.
Networked computing becomes indispensable to people’s life. From daily communications to commercial transactions, from small businesses to large enterprises, all activities can be or will soon be done through networked services. Any vulnerabilities in the networked devices and computing platforms can expose the whole network under various attacks and may bring about disastrous consequences. Hence, effective network intrusion detection (NID) solutions are ultimately essential to the modern society.
Initial NID designs are signature-based, where each type of attacks should be manually studied beforehand, and the detection is performed based on the attack’s signatures. This kind of approaches are, however, not suitable to the fast-growing network and cannot cope with attacks of increasing volume, complexity and volatility.
For the security of such a large scale and ever-expanding network, we need an intrusion detection system that is not only able to quickly and correctly identify known attacks but also adaptive and intelligent enough for the unknown and evolved attacks, which leads to AI-based solutions. The artificial intelligence gained from machine learning enables the detection system to discover network attacks without much need for human intervention [mukherjee1994network].
So far, investigations on the AI-based solutions are mainly based on two schemes: anomaly detection and misuse detection. The anomaly detection identifies an attack based on its anomalies deviated from the profile of normal traffic. Nevertheless, this scheme may have a high false positive rate if the normal traffic is not well profiled, and the profile used in the detection is not fully representative [sommer2010outside]. Furthermore, to obtain a fully representative normal traffic profile for a dynamically evolving and expanding network is unlikely possible.
The misuse detection, on the other hand, focuses on the abnormal behaviour directly. The scheme can learn features of attacks based on a labelled training dataset, where both normal and attacked traffic data are marked. Given sufficient labelled data, a misuse-detection design can effectively generate an abstract hypothesis of the boundary between normal and malicious traffic. The hypothesis can then be used to detect attacks for future unknown traffic. Therefore the misuse detection is more feasible and effective than the anomaly detection [sommer2010outside, garcia2009anomaly] and has been adopted in real-world systems, such as Google, Qihoo 360 and Aliyun.
However, the existing misuse detection designs still present a high false positive rate, which significantly limits the in-time detection efficiency, incurs large manual scrutiny workload, and potentially degrades the network-wide security. In this paper, we address this issue, and we aim for an improved misuse detection design. Our contributions are summarized as follows:
We present a hierarchical deep neural network, LuNet, that is made of multiple levels of combined convolution and recurrent sub-nets; At each level, the input data are learned by both CNN and RNN nets; As the learning progresses from the first level to the last level, the learning granularity becomes increasingly detailed. With such an arrangement, the synergy of CNN and RNN can be effectively exploited for both spatial and temporal feature extractions.
We provide an in-depth analysis and discussion for the configuration of LuNet so that a high learning efficiency can be achieved.
We test our design on two network intrusion datasets, NSL-KDD and UNSW-NB15, and we demonstrate that our design offers a higher detection capability (namely, better detection rate and validation accuracy) while maintaining a significantly lower false positive rate when compared to a set of state-of-the-art machine learning based designs.
The remainder of this paper is organized as follows: Section II provides a brief background of machine learning for network intrusion detection. The design of LuNet is detailed in Section III. The threat model in the datasets used for LuNet design is presented in Section IV. The evaluation and comparison of LuNet with other state-of-the-art designs are given in Section V. The paper is concluded in Section VI.
Ii Background and Related Work
Many algorithms have been developed for machine learning. They can be classified into classical machine learning approaches and deep learning approaches. Some of them, relevant to network intrusion detection are briefly discussed below.
Ii-a Classical Machine Learning Approaches
Among many machine learning approaches [buczak2015survey], kernel machines and ensemble classifiers are two effective schemes that can be considered for NID. Support Vector Machine (SVM) that has long been used for task classification is a typical example of kernel machines, and Radial Basis Function (RBF) kernel (also called Gaussian kernel) is the most used kernel function [ahmad2018performance]. SVM makes the data that cannot be linearly separated in the original space, separable by projecting the data to a higher-dimension feature space. Ensemble classifiers, such as Adaptive Boosting [hu2013online], Random Forest [zhang2008random], are often constructed with multiple weak classifiers to avoid overfitting during training so that a more robust classification function can be achieved.
However, both kernel machines and ensemble classifiers are not scalable to a large data set. Their validation accuracy rarely scales with the size of training data [bengio2006curse]. Given a high volume of training data available from the large scale network, using these traditional machine learning approaches to train the intrusion detection system is not efficient. Furthermore, the traditional approaches learn input data only based on a given set of features; they cannot generalize features from raw data, and their learning efficiency highly relies on the features specified.
Ii-B Deep Learning Approaches
Deep learning organizes “learning algorithms” in layers in the form of “artificial neural network” that can learn and make intelligent decisions on its own.
Multi-Layer Perceptron (MLP) [pal1992multilayer] is an early class of feedforward deep neural network that utilizes the backpropagation algorithm to minimize the error rate during training. MLP was initially used to solve complex approximate problems for speech recognition, image recognition, and machine translation.
The real flourish and practical breakthrough of deep learning stems from two popular deep learning algorithms: the convolutional neural network (CNN) and recurrent neural network (RNN). CNN can automatically extract features of raw data and has gained great successes for image recognition [lecun1998gradient, szegedy2015going, he2016deep]. However, the feature map generated by CNN often manifests the spatial relations of data. CNN does not work very well for the data of long-range dependency.
RNN, on the other hand, has an ability to extract the temporal features from the input data. Long short-term memory (LSTM) is a popular RNN [hochreiter1997long]. It keeps a trend of the long-term relationship in the sequential data while being able to drop out short-lived temporal noises.
Recently, a hierarchical convolutional and recurrent neural network, HAST-IDS [wang2017hast] has been used to learn spatial and temporal features from the network traffic data.
The LuNet proposed in this paper is also a hierarchical network. Our design is similar to HAST-IDS in that both utilize CNN and RNN for spatial-temporal feature extraction. However, there are some major differences:
HAST-IDS stacks all RNN layers after CNN layers, while LuNet has a hierarchy of combined CNN and RNN layers.
In HAST-IDS, because the CNN hierarchy is placed before RNN, the deep CNN may drop out the temporal information embedded in the raw input data, which makes RNN learning ineffective. LuNet, on the other hand, synchronizes both the CNN learning and the RNN learning into multiple steps with the learning granularity being gradually increased from coarse to fine; Therefore, both spatial and temporal features can be adequately captured.
We apply batch normalization between CNN and RNN so that the learning efficiency and accuracy can be further improved.
The design of LuNet is presented in the next section.
Iii-a Overview Structure
As stated above, CNN targets on spatial features while RNN aims for temporal features. The existing design HAST-IDS simply structures CNN and RNN in tandem, as illustrated in Fig.1(a). When the learning progresses along the multiple levels in the CNN hierarchy, the information extracted becomes more spatial oriented. The temporal features may be lost by the CNN hierarchy, which significantly limits the learning effectiveness of the following RNN (LSTM).
Rather than allow CNN to learn to its full extent first, in LuNet, we mingle the CNN and RNN subnets and synchronize both the CNN learning and the RNN learning into multiple steps and each step is performed by a combined CNN and RNN block, or LuNet block, as illustrated in Fig.1(b). Since CNN is able to extract high-level features from a large amount of data, we place CNN before RNN at each level. The learning starts from the first step on coarse-grain learning, hence the CNN output will still retain temporal information that will then be captured by RNN. The learning granularity becomes detailed as the data processing flows to the next step; but at each level, both CNN and RNN learn input on the same granularity. In such a way, CNN and RNN can learn to their full capacity without much interference with each other.
Given the overall structure of LuNet, the effectiveness of its learning is closely related to the hidden layers design, which is discussed in the next sub section.
Iii-B Hidden Layers Design
We measure the learning granularity in terms of the number of filters/cells used in the CNN/RNN network. The smaller the filter/cell number, the larger the granularity. Fig.2 shows a LuNet with three CNN+RNN levels. The filter/cell number increases from 64 for the first level to 256 for the last level.
The design considerations for each layer in a LuNet block and the final four processing layers for the learning outputs, as highlighted in shaded boxes in Fig.2, are given below.
Iii-B1 Convolutional Neural Network (CNN)
CNN mainly consists of two operations: convolution and pooling. Convolution transforms input data, through a set of filters or kernels, to an output that highlights the features of the input data, hence the output is usually called feature map. The convolution output is further processed by an activation function and then down-sampled by pooling to trim off irrelevant data. Pooling also helps to remove glitches in the data to improve the learning for the following layers [lecun2015deep, bengio2017deep].
CNN learns the input data by adjusting the filters automatically through rounds and rounds of learning processes so that its output feature map can effectively represent the raw input data.
Since the network packet is presented in a 1D format, we use 1D convolution111One issue involved in convolution multiplication is to maintain the kernel size as same as the input, for which two typical schemes can be used: zero or no-zero padding. Since both schemes do not generate much difference to the learning outcome, here we choose no-zero padding for simplicity. in LuNet, as illustrated in Formula (1) that specifies the operation to the input vector with a filter of size m.
where is the position of different values in the sequence data.
Because the rectified linear unit (ReLU): is good for fast learning convergence, we therefore, choose it as the activation function. We also use the max pooling operation, as commonly applied in other existing designs [zhou1988computation].
Iii-B2 Batch Normalization
One problem with using the deep neural network is that the input value range dynamical changes from layer to layer during training, which is also known as covariance shift. The covariance shift causes the learning efficiency of one layer dependent on other layers, making the learning outcome unstable. Furthermore, because of the covariance shift, the learning rate is likely restricted to a low value to ensure data in different input ranges to be effectively learned, which slows down the learning speed. Batch normalization can be used to address this issue.
Normalization scales data to the unit norm in the input layer and has been used to accelerate training deep neural network for image recognition [ioffe2015batch].
We use batch normalization to adjust the CNN output for RNN in a LuNet block. The normalization subtracts the batch mean from each data and divides the result by the batch standard deviation, as given in Formula (2).
where is a value in the input batch, and and are, respectively, the batch mean and variance. The is an ignorable value, just to ensure the denominator in the formula non-zero.
Based on the normalized , the normalization produces the output y as given in Formula (3), where the and will be trained in the learning process for a better learning outcome.
Iii-B3 Long Short Term Memory (LSTM)
Different from CNN that learns information on an individual-data-record basis, RNN can establish the relationship between data records by feeding back what has been learned from the previous learning to the current learning, and hence can capture the temporal features in the input data.
However, the simple feedback used in the traditional RNN may have a learning error accumulated in the long dependency. The accumulated errors may become large enough to invalid the final learning outcome. LSTM (Long Short-Term Memory), a gated recurrent neural network, mitigates such a problem. It controls the feedback with a set of gate functions such that the short-lived errors are eventually dropped out and only persistent features are retained. Therefore, we use LSTM for RNN.
For a brief description of LSTM operation, we create a high-level data processing diagram of LSTM 222For the mathematical design details of LSTM, the reader is referred to the original paper [bengio2017deep] as shown in Fig. 3.
LSTM can be abstracted as a connection of four sub-networks (denoted as p-net, g-net, f-net and q-net in the diagram), a set of control gates, and a memory component. The input and output values in the diagram are vectors of the same size determined by the input . The state, , saved in the memory, serves as the feedback to the current learning.
All the sub nets in LSTM have a similar structure, as specified in Formula (4).
where , , , and are, respectively, current input, previous output, bias, weight matrix for the current input, and recurrent weight matrix for the previous output. Each of the four nets have a different , , and .
The outputs from the sub nets (, , and ) are then used, through two types of controlling gates ( and ) to determine the feedback s(t) from the previous learning and the current output h(t), as given in Formulas (5) and ( 6), respectively.
LSTM learns the inputs by adjusting the weights in those nets and the value such that the temporal features between the input data can be effectively generated in the output.
Iii-B4 Dimension Reshape
Since in LuNet, the learning granularity changes from one CNN+RNN level to another, the output size of one level is different from the input size expected at the next level. We, therefore, add a layer to reshape the data for the next LuNet block.
Iii-B5 Overfitting Prevention
One typical problem when learning big data using a deep neural network is overfitting – namely, the network has learned the training data too well, which restricts its ability to identify variants in new samples. This problem can be handled by Dropout [srivastava2014dropout]. Dropout randomly removes some connections from the deep neural network to reduce overfitting. We add the dropout layer with a default rate value of 0.5 after the CNN+RNN hierarchy in LuNet.
Iii-B6 Final Layers
Finally, an extra convolution layer and a global average pooling layer are used to extract further spatial-temporal features learned through the LuNet blocks. The final learning output is generated by the last layer, a fully-connected layer.
Iv Datasets and Threat Model
The evaluation of a neural network design is closely related to the dataset used. Many datasets collected for NID contain significant amount of redundant data [lippmann2000evaluating, mchugh2000testing], which makes evaluation results unreliable [ahmad2018performance, hu2013online, zhang2008random, wang2017hast, zhuo2017network]. To ensure the effectiveness of the evaluation, we select two non-redundant datasets: NSL-KDD and UNSW-NB15 in our investigation.
NSL-KDD [tavallaee2009detailed] was generated by Canadian Institute for Cyber Security (CICS) from the original KDD’99 dataset. It consists of 39 different types of attacks. The attacks are categorized into four groups: Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L) and Probe. A big issue with this dataset is its imbalanced distribution, which often leads to a high false positive rate due to insufficient data available for training[li2017intrusion, lin2018character]. Here, we tackle this problem with a cross-validation scheme333This problem was also addressed in [wu2019transfer] with a different focus and strategy..
UNSW-NB15[moustafa2015unsw, moustafa2016evaluation], generated by Australian Center for Cyber Security (ACCS) in 2015, is a more contemporary dataset. For the dataset, the attack samples were first collected from the three real-world websites: CVE (Common Vulnerabilities and Exposures)444CVE: https://cve.mitre.org/, BID (Symantec Corporation)555BID: https://www.securityfocus.com, and MSD (Microsoft Security Bulletin)666MSD: https://docs.microsoft.com/en-us/security-updates/securitybulletins. The sample attacks were then simulated in a laboratory environment for the dataset generation. There are nine attack categories in UNSW-NB15: DoS, Exploits, Generic, Shellcode, Reconnaissance, Backdoor, Worms, Analysis, and Fuzzers.
V Evaluation and Discussion
To evaluate our design, we have implemented LuNet (as given in Fig. 2) with TensorFlow backend, Keras and scikit-learn packages and we run the training on a HP EliteDesk 800 G2 SFF Desktop with Intel (R) Core (TM) i5-6500 CPU @ 3.20 GHz processor and 16.0 GB RAM. For comparison, we also implemented a set of state-of-the-art machine learning algorithms.
The description of experiments and experiment results and discussion are presented below.
V-a Data Preprocessing
V-A1 Convert Categorical Features
For the experiment to be effective, we need data to conform to the input format required by the neural network. Raw network traffic data include some categorical features as shown in Fig.4. The text information cannot be processed by a learning algorithm and should be converted into numerical values. Here, we use the ’get_dummies’ function in Pandas [mckinney-proc-scipy-2010] to operate the conversion.
Input data may have varied distributions with different means and standard derivations, which may affect learning efficiency. We apply standardization to scale the input data to have a mean of 0 and a standard deviation of 1, as is often applied in many machine learning classifiers.
V-A3 Stratified K-Fold Cross Validation
NSL-KDD and UNSW-NB15 contain 148,516 and 257,673 samples, respectively. To realize the large non-redundant data for training and verification, we employ a Stratified K-Fold Cross Validation strategy, also commonly used in machine learning. The scheme splits all samples in a dataset into groups; Among them, groups, as a whole, are used for training and the rest one group for validation; hence the strategy is also called Leave One Out Strategy.
V-B Evaluation Metrics
We evaluate LuNet in terms of the validation accuracy (ACC), detection rate (DR) and false positive rate (FPR). ACC measures LuNet’s ability to correctly predict both attacked and non-attacked normal traffic, while DR indicates its ability of prediction for attacks only. A high DR can be shadowed by a high rate of false positive alarm (FPR), which therefore needs to be jointly considered with DR. The detail definitions of the three metrics are given in Formulas (7), (8) and (9).
where TP and TN are, respectively, the number of attacks and the number of normal traffic correctly classified; FP is the number of actual normal records mis-classified as attacks, and FN is the number of attacks falsely classified as normal traffic.
V-C LuNet Results
In our experiments, we use RMSprop [hinton2012neural] (a popular gradient descent algorithm) to optimize weights and biases for LuNet training, and for the learning rate, we set it to a median value 0.001 in a range given by the tensorflow library. We also choose the default rate value, 0.5, for the dropout layer.
We first measure the performance of LuNet based on two scenarios: (1) binary classification, namely LuNet predicts a packet either as an attack or as a normal traffic; (2) multi-class classification, where LuNet identifies a packet either as normal or as one type of attacks given in the dataset attack model (namely, 5 classes for NSL-KDD and 10 classes for UNSW-NB15). The experiment results are given below.
V-C1 Binary Classification
Table I shows the detection rate, accuracy and false positive rate for the binary classification of LuNet under different Stratified K-Fold Cross Validations, with k ranging from 2 to 10. The average values are given in the last row of the table. As can be seen from the table, LuNet can achieve around 99.24% and 97.40% validation accuracy on NSL-KDD and UNSW-NB15, respectively. The best validation accuracy of 99.36% on the NSL-KDD dataset and 97.67% on the UNSW-NB15 can be observed. It can also be seen that LuNet offers a high detection capability while incurs a low false positive alarm rate, as plotted Fig.5. On average, DR=99.42% and FPR=0.53% for NSL-KDD, and DR=98.18% and FPR=3.96% on UNSW-NB15 can be obtained.
It is worth to note that because of the cross-validation used, the false positive alarm rate is not affected by the imbalanced NSL-KDD dataset, as is expected.
V-C2 Multi-Class Classification
Table II shows the multi-class classification results. It can be seen from the table that LuNet can achieve an average of 99.05% accuracy, 98.84% detection rate on NSL-KDD, and 84.98% accuracy and 95.96% detection rate on UNSW-NB15 with, respectively, the 0.65% and 1.89% false positive rate on the two datasets.
Fig.6 shows the detection rate (DR%) and false positive rate (FPR%) for 5 category (Normal, DoS, Probe, R2L, U2R) classification on NSL-KDD. As we can see from the figure, LuNet shows an excellent capability to handle attacks in all categories (with a high detection rate and low false positive rate), except for U2R and R2L. For U2R and R2L, LuNet presents a relatively low detection rate, which indicates that the main features extracted by LuNet for U2R and R2L are not effectively distinct. There may be significant feature overlap between U2R and other attacks. The same reason is also applied to the Backdoor and Worms attacks, as manifested in the similar plot shown in Fig.7 for classification of 10 categories (Normal, Analysis, Backdoor, DoS, Exploits, Fuzzers, Generic, Reconnaissance, Shellcode, Worms) on UNSW-NB15.
Fig.7 illustrates that LuNet can successfully detect most of the Normal traffic, Exploits, Generic, DoS, Shellcode and Reconnaissance attacks (see the overlapped lines at the very top of the plots) and has a moderate capability to detect Analysis attacks with the low false positive rate (FPR). However, it is unable to discover Backdoor and Worms attacks, as can be observed in the confusion matrix (as shown in Fig. 8) generated from the experiment. We discover that most of Backdoor and Worms were, in fact, classified as Exploits attacks by LuNet. The possible reason is that the Backdoor and Worms attacks use the common exploit flaws in the programs that listen for connections from remote hosts and the signature of those attacks is similar to the Exploits attacks. Therefore, the Backdoor and Worms attacks are treated as the Exploits attacks by LuNet. Another possible reason may come from the insufficient data available for the Backdoor and Worms attacks; There are only around 1.4% Backdoor attacks and 1.1% Worms attacks in the dataset.
V-D Comparative Study
We compare the performance of LuNet with other five state-of-the-art supervised learning algorithms that have been investigated in the literature: Support Vector Machine (SVM) with Gaussian Kernel (RBF)[ahmad2018performance], Multi-Layer Perceptron (MLP)[esmaily2015intrusion], Random Forest (RF)[zhang2008random], Adaptive Boosting (AdaBoost)[hu2013online], and HAST-IDS [wang2017hast].
We also run the experiment on individual CNN and LSTM networks, presented in LuNet.
The detection rate (DR%), accuracy (ACC%), and false positive rate (FPR%) of the two design groups (ML, the classical machine learning algorithms and DL, the deep learning algorithms) are given in Table III. The average values of each group are also provided in the table.
As can be seen from Table III, the deep neural networks generally have better performance than the three classical machine learning algorithms; On average, the deep neural networks have a higher detection rate (94.57%) and accuracy (82.78%) and a lower false positive rate (4.72%), as compared to the corresponding average values of 89.03%, 77.53% and 10.95% from the classical machine learning. Among the five deep learning networks, LuNet is better than other four DL algorithms due to the special combined CNN+RNN hierarchy used. In summary, LuNet demonstrates its superiority – achieving a high detection rate and accuracy, while keeping a low false positive rate.
In this paper, we present a deep neural network architecture, LuNet, to detect intrusions on a large scale network. LuNet uses CNN to learn spatial features in the traffic data and LSTM for temporal features. To avoid the information loss due to different learning focuses of CNN and RNN, we synchronize both CNN and RNN to learn the input data at the same granularity. To enhance the learning, we also incorporate batch normalization in the design. Our experiments on the two non-redundant datasets, NSL-KDD and UNSW-NB15, show that LuNet is able to effectively take advantages of CNN and LSTM. Compared with other state-of-the-art techniques, LuNet can significantly improve the validation accuracy and reduce the false positive rate for network intrusion detection.
It must be pointed out that LuNet does not work well to classify attacks of insufficient samples in the training dataset, such as Backdoors and Worms, as observed in our experiment, which will be investigated in the future.