An Autonomous Intrusion Detection System Using Ensemble of Advanced Learners

Abstract

An intrusion detection system (IDS) is a vital security component of modern computer networks. As networks are increasingly used to provide sensitive services, IDSs need to be more intelligent and autonomous. Aside from autonomy, another important attribute of an IDS is its ability to detect zero-day attacks. To address these issues, in this paper we propose an IDS that reduces the amount of manual interaction and expert knowledge needed and is able to yield acceptable performance under zero-day attacks. Our approach is to use three learning techniques in parallel: a gated recurrent unit (GRU) network and a convolutional neural network as deep techniques, and Random Forest as an ensemble technique. These systems are trained in parallel and their results are combined under two logics, majority vote and "OR" logic. We use the NSL-KDD dataset to verify the proficiency of our proposed system. Simulation results show that the system has the potential to operate with very little technician interaction under zero-day attacks. We achieved accuracy in the NSL-KDD's "KDDTest+" dataset and accuracy on the challenging "KDDTest-21" with lower training time and lower needed computational resources.

Intrusion detection system, Deep learning, Recurrent neural network, Random forest, Convolutional neural network

I Introduction

The pervasive role of computer networks in all aspects of our daily lives makes their security vulnerabilities a serious concern. Since all of our information, from identification to where we go, what we like, our medical history, consumption patterns, etc., goes through these networks, any vulnerability could lead to an irrecoverable disaster.

To deal with security challenges in computer networks, many methods have been introduced, among them cryptography and firewalls. One robust and reliable approach is the intrusion detection system (IDS). From an operational point of view, there are mainly three types of IDSs: 1. Misuse-based, 2. Anomaly-based, 3. Hybrid (uses both misuse and anomaly techniques) [1].

Despite many efforts to entrust security challenges to a system without constant human interaction, we are still at the early stages of building such systems. To make anomaly-based IDSs intelligent, artificial intelligence and its promising subfield, machine learning, are widely used. Machine learning paradigms and methods are under intense study in computer science and related fields such as mathematics, and many other areas of science benefit from its techniques. The difficulty in applying machine learning to a specific problem lies in choosing the best method and setting its parameters. Evidence suggests that trial and error is an effective way to find the best method, especially in deep learning, because it is currently not well understood what makes a given deep learning structure well suited to a task [2].

After a technique is chosen, applying it to a specific problem raises several challenges. First, at every stage of the learning procedure, deep insight into the subject can be crucial. Second, selecting features, preprocessing, and preparing data for training is arguably the most important step in the procedure. Finally, one tricky challenge is the system's ability to generalize, i.e. the robustness of the trained system to inputs that are not very similar to the training data. In the context of network intrusion detection, this last challenge can be interpreted as detecting zero-day attacks.

In this work, we try to address the first and last challenges by proposing a network IDS based on several learning methods. First, the network packets of a single connection are stored and analyzed with tools such as "Snort" and "Bro IDS", which extract a set of features. Then a system trained with three different methods, namely a recurrent neural network (RNN) based on gated recurrent units (GRU), a convolutional neural network (CNN), and a non-parametric method named Random Forest, classifies each connection as normal or attack. Using different types of classifiers and combining their votes with a proper logic shows that, without manually adjusting the learning procedure, we can have an IDS that is able to detect various zero-day attacks. When the system encounters an attack that is misclassified, it learns to deal with that attack in the future. We have evaluated our system on the NSL-KDD dataset, which is well known and among the most commonly used datasets in the field.

The rest of this paper is organized as follows. In Section II, a brief background on the techniques in use is provided. In Section III, we review some of the related research in the field. Section IV presents our proposed system in detail. In Section V, we discuss the experimental results. Finally, the paper concludes in Section VI.

II Background

In this section, we cover three preliminaries: 1. a brief description of the classifiers used, 2. the preprocessing phase, 3. the dataset in use.

II-A Random Forest

The decision tree (DT) is a strong and fast classifier. However, DTs suffer from overfitting and lack a criterion for choosing the best sequence of attributes for branches. Random forest (RF) was introduced to overcome these problems by using ensemble learning. Random subsets of the training dataset are drawn, and each subset trains a different decision tree. Each trained tree casts a vote, and majority voting yields the final decision [3]. Using the Law of Large Numbers, L. Breiman gives a theoretical background for RFs in [3], showing that their generalization error converges as trees are added, so adding more trees does not cause overfitting.

II-B RNN with GRU unit

In contrast to a classical neural network, which has only feed-forward connections, an RNN takes advantage of recurrent connections between layers. As a deep neural network, an RNN has two major issues, known as the "vanishing gradient" and "exploding gradient" problems. As the number of layers grows, the gradients of the loss function become too small or too large, which makes training difficult or even impossible. One of the most effective ways to tackle these problems is to use the GRU. The GRU was introduced as a simplified version of Long Short-Term Memory (another memory-based unit). Its core is a gated state that allows it to maintain information over time [4].

II-C Convolutional Neural Networks

Convolutional neural networks (CNNs) try to mimic the functionality of the brain's visual cortex by defining a variety of filters that extract different kinds of information from images. Although visual perception is their main task, evidence suggests that they are effective in other learning-related tasks as well. Several fundamental CNN architectures have been introduced since their invention, among them LeNet-5, AlexNet, GoogLeNet, and ResNet. A CNN architecture is a specific combination of two types of layers, namely convolution and pooling layers, together with some fully connected layers.

II-D Preprocessing

To prepare data for training, two common steps are numericalization and normalization of the features. Numericalization is assigning numbers to nominal features. In addition, to avoid unbalanced effects of features in classification, they are all normalized to interval by eq. (1), in which is the feature value:

(1)
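As a rough illustration of the two preprocessing steps, the sketch below maps nominal values to integer codes and rescales a numeric column. It assumes min-max scaling to [0, 1], which is a common choice but an assumption here, not a detail confirmed by the text; the function names and the toy records are hypothetical.

```python
def numericalize(column):
    """Map each distinct nominal value to an integer code (order of first appearance)."""
    codes = {}
    encoded = []
    for value in column:
        if value not in codes:
            codes[value] = len(codes)
        encoded.append(codes[value])
    return encoded, codes


def min_max_normalize(column):
    """Rescale numeric values to [0, 1] (assumed interval); a constant column maps to zeros."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]


# Toy example: a nominal protocol feature and a numeric byte-count feature.
protocol = ["tcp", "udp", "tcp", "icmp"]
src_bytes = [0, 250, 1000, 500]

protocol_codes, mapping = numericalize(protocol)
protocol_scaled = min_max_normalize(protocol_codes)
src_bytes_scaled = min_max_normalize(src_bytes)
```

In practice the minimum and maximum would be computed on the training set only and reused for the test sets, so that no test information leaks into training.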

II-E Dataset description

To train and evaluate an anomaly-based IDS, a reliable and genuine dataset is needed. Background flow integrity and attack variety are considered the most important properties that make a dataset valuable. NSL-KDD is a dataset based on the KDD99 dataset from the KDD Cup (International Knowledge Discovery and Data Mining Tools Competition). Because of problems in the original dataset, such as repeated records and some statistical issues, Tavallaee et al. [5] presented a refined version of it, which has become very popular and has been used in many studies. This dataset was introduced in 2009, and many things in the Internet and computer network field have changed since then, but because of its wide usage it may still be the best reference dataset for comparing the performance of IDSs. It has three predefined sets: "KDDTrain+" for training, "KDDTest+" for testing, and a challenging "KDDTest-21" set. The authors of [5] designed 21 machine learning systems to evaluate their proposed dataset; "KDDTest-21" is the subset of "KDDTest+" from which all records correctly classified by all 21 learners have been removed.
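The construction of a "hard" subset such as "KDDTest-21" can be sketched as a simple filter. This is only an illustration of the idea described above, with three toy learners instead of 21; the function name, the records, and the predictions are hypothetical.

```python
def hard_subset(records, learner_predictions):
    """Drop every record that all learners classified correctly; keep the rest."""
    kept = []
    for i, (features, label) in enumerate(records):
        all_correct = all(preds[i] == label for preds in learner_predictions)
        if not all_correct:
            kept.append((features, label))
    return kept


# Toy test set: (features, label) with 0 = normal, 1 = attack.
records = [("r0", 0), ("r1", 1), ("r2", 1), ("r3", 0)]

# Predictions of three hypothetical learners, one list per learner.
learner_predictions = [
    [0, 1, 0, 0],
    [0, 1, 1, 1],
    [0, 1, 1, 0],
]

hard_records = hard_subset(records, learner_predictions)
```

Here "r0" and "r1" are removed because every learner got them right, while "r2" and "r3" remain because at least one learner misclassified them.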

III Related works

Almost all anomaly-based NIDS implementations use machine learning methods. Here we mention some related papers. B. A. Tama et al. [6] propose a feature selection method that uses three search techniques: genetic algorithm (GA), ant colony optimization (ACO), and particle swarm optimization (PSO). They apply this search method to a two-stage ensemble classifier that uses rotation forest and bagging. The method is evaluated on two datasets, and the results are: accuracy on the "KDDTest+" set and accuracy on the "UNSW-NB15" set. In [7], J. Yang et al. propose an IDS using a deep convolutional generative adversarial network. Instead of analyzing network packets with network monitoring software, they use the proposed system to extract features directly from raw data and generate a new dataset. They also apply a simple recurrent unit (SRU), a modified long short-term memory (LSTM), which allows the system to learn the important features of intrusions. With this system, they achieve accuracy on the "KDD99" dataset and on a manually divided "NSL-KDD" dataset (the predefined NSL-KDD sets are not used for training and testing). For a wireless IDS, S. M. Kasongo et al. [8] use a filter-based algorithm to select features and a feed-forward deep neural network as the IDS. Their neural network has 3 hidden layers with 30 nodes. Their best result is achieved by using learning rate which is accuracy on the manually divided "NSL-KDD" dataset.

IV Proposed System Architecture

Detecting zero-day attacks and new types of anomalies is the most sophisticated task an IDS can perform. Designing such a system and keeping it up to date and fast requires continuous maintenance. To address the challenge of reducing this permanent interaction, we propose an IDS that combines the strengths of three different types of machine learning techniques with an update procedure. In the following subsections, we describe how the IDS operates in the network and how it is trained.

IV-A System Operation Phase

As shown in Figure 1, the packets of a connection are first stored. Then a machine capable of running a packet analyzer (here we use "Zeek", open-source software that can run on a Raspberry Pi) extracts features based on predefined rules. Next, three trained machines classify the connection and submit their votes to a decider. The decider combines the votes based on the assigned logic.

Fig. 1: Proposed IDS Operation Scheme
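The two combination logics used by the decider can be sketched as follows. This is a minimal illustration, assuming each classifier emits a binary vote with 1 meaning attack and 0 meaning normal; the function names and the example votes are hypothetical.

```python
def majority_vote(votes):
    """Return 1 (attack) if more than half of the votes are 1, else 0 (normal)."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def or_logic(votes):
    """Return 1 (attack) if at least one classifier votes attack."""
    return 1 if any(votes) else 0


# Hypothetical votes from the three classifiers (GRU, CNN, RF) for four connections.
votes_per_connection = [
    (0, 0, 0),
    (1, 0, 0),
    (1, 1, 0),
    (1, 1, 1),
]

majority_decisions = [majority_vote(v) for v in votes_per_connection]
or_decisions = [or_logic(v) for v in votes_per_connection]
```

The two logics differ only when the classifiers disagree: "OR" flags a connection as soon as a single classifier raises an alarm, which tends to catch more attacks at the cost of more false alarms, whereas majority vote needs two of the three to agree.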

We use three different types of learning techniques to build the IDS classifier unit. There are some parameters and assumptions that motivate us to choose these three techniques. We discuss the rationale for using three subsystems later in this section.

One of the techniques we use is the GRU-based network. As mentioned earlier, RNNs have the ability to remember previous entries, which means that we can use an RNN as a time analysis tool. In many modern and sophisticated attacks, malicious code can be injected in distributed patterns, for example by using botnets or by embedding the code among many legitimate packets. Hence this ability of RNNs turns out to be very effective in processing packet sequences. Figure 2 shows our proposed GRU network architecture. To tackle the vanishing gradient problem, we use the "Leaky ReLU" (eq. (2)) and exponential linear unit (ELU) (eq. (3)) activation functions. The number of layers and their nodes are set based on an exhaustive search (an example is given in Section V).

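\[ f(x) = \begin{cases} x, & x > 0 \\ \alpha x, & x \le 0 \end{cases} \tag{2} \]
\[ f(x) = \begin{cases} x, & x > 0 \\ \alpha\,(e^{x} - 1), & x \le 0 \end{cases} \tag{3} \]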
Fig. 2: Proposed GRU Network
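The two activation functions named above can be written in a few lines. The sketch uses their standard definitions; the slope and scale values (`alpha=0.01` for Leaky ReLU, `alpha=1.0` for ELU) are common defaults chosen for illustration and are assumptions, not values taken from the paper.

```python
import math


def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: identity for positive inputs, small slope alpha otherwise."""
    return x if x > 0 else alpha * x


def elu(x, alpha=1.0):
    """ELU: identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


inputs = [-2.0, -0.5, 0.0, 1.5]
leaky_outputs = [leaky_relu(x) for x in inputs]
elu_outputs = [elu(x) for x in inputs]
```

Both functions keep a non-zero gradient for negative inputs, which is why they are often preferred over the plain ReLU when vanishing gradients are a concern.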

The second learning machine of our proposed system is a CNN. CNNs are used for their ability to extract features. IDSs operate in different types of networks that have different dynamics and threats, so automatic feature extraction is needed to increase the autonomy of the IDS. Therefore, a CNN is embedded in the training system. Here we use a modified version of LeNet-5, one of the earliest proficient networks. The modifications are derived from simulation results. Figure 3 shows the structure of our applied version of LeNet-5.

Fig. 3: In-Use LeNet-5 Structure

Our last learner is Random Forest (RF). RF is claimed to be robust against overfitting. Overfitting reduces a classifier's ability to generalize, so an overfitted IDS may lose the ability to detect zero-day attacks. Another reason for using this technique is its speed and the low variability of its training procedure.

Embedded time analysis, memory, and feature extraction are the key ideas we arrived at for implementing an IDS with the aforementioned attributes. After many attempts with several learning techniques, including unsupervised methods, we found the solution in using a GRU network and a CNN. The randomness of these two deep methods obliges us to add a more robust classifier such as RF.

IV-B Training Phase

To train the system, data from a dataset is first preprocessed and prepared as input for the classifiers; then all machines are trained in parallel. The system is now ready to operate. During operation, it checks its prediction accuracy. When an attack goes undetected, the feedback procedure informs the training system, which adds the misclassified connection record to the training dataset and trains again.
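The feedback procedure can be sketched as a small retraining loop. This is an illustration only: the toy threshold learner stands in for the GRU, CNN and RF subsystems, the subsystems are trained one after another rather than in parallel, and the true label is assumed to arrive from some external feedback source. All names are hypothetical.

```python
def majority_vote(votes):
    """Return 1 (attack) if more than half of the votes are 1, else 0 (normal)."""
    return 1 if sum(votes) > len(votes) / 2 else 0


def threshold_trainer(training_set):
    """Toy learner: flag as attack (1) any value >= the smallest attack value seen."""
    attack_values = [x for x, label in training_set if label == 1]
    threshold = min(attack_values) if attack_values else float("inf")
    return lambda x: 1 if x >= threshold else 0


def train_all(training_set, trainers):
    """Train every subsystem on the same data and return the fitted models."""
    return [trainer(training_set) for trainer in trainers]


def feedback_update(training_set, models, trainers, features, true_label):
    """If the combined decision is wrong, add the record and retrain all subsystems."""
    votes = [model(features) for model in models]
    if majority_vote(votes) == true_label:
        return training_set, models  # decision was correct; nothing to update
    updated_set = training_set + [(features, true_label)]
    return updated_set, train_all(updated_set, trainers)


training_set = [(1.0, 0), (2.0, 0), (9.0, 1)]
trainers = [threshold_trainer, threshold_trainer, threshold_trainer]
models = train_all(training_set, trainers)

# A new attack with value 6.0 slips past all three toy models (their threshold is 9.0).
updated_set, updated_models = feedback_update(training_set, models, trainers, 6.0, 1)
```

After the update, the retrained toy models flag the previously missed record, which mirrors the intended behavior of learning from misclassified connections.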

V Experimental Results

The hardware and software setup that is used in our experiment is:

  • Intel Core i7 @ 3.5 GHz with 16 GB of RAM (no GPU-based implementation)

  • TensorFlow v2.0.0b1 and scikit-learn v0.21.3 libraries on the CentOS v7.5 operating system

V-A Performance Evaluation

The effectiveness of classification methods is measured by metrics such as Accuracy (Acc), Detection Rate (DR) and False Positive Rate (FPR). These metrics are based on the following four terms, with an attack record as the positive class: 1. True Positive (TP), 2. True Negative (TN), 3. False Positive (FP) and 4. False Negative (FN). The evaluation metrics are defined as follows:

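  • Accuracy: Acc is the proportion of all records that are classified correctly (4).

    \[ \mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4} \]
  • Detection Rate: DR is the proportion of attack records that are correctly identified (5).

    \[ \mathrm{DR} = \frac{TP}{TP + FN} \tag{5} \]
  • False Positive Rate: FPR is the ratio of normal records incorrectly flagged as attacks to all normal records (6).

    \[ \mathrm{FPR} = \frac{FP}{FP + TN} \tag{6} \]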
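As a worked example, the three metrics can be computed from the confusion matrix reported in Table III for the "KDDTest+" set. The sketch assumes the standard definitions of Acc, DR and FPR, treats attack as the positive class, and reads the table rows as actual classes and the columns as predicted classes; the function name is hypothetical.

```python
def evaluate(tp, tn, fp, fn):
    """Return (accuracy, detection rate, false positive rate) as fractions."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    dr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return acc, dr, fpr


# Counts taken from Table III (KDDTest+), with attack as the positive class.
tn, fp = 9230, 480     # actual normal: predicted normal / predicted attack
fn, tp = 2387, 10446   # actual attack: predicted normal / predicted attack

acc, dr, fpr = evaluate(tp, tn, fp, fn)
print(f"Acc = {acc:.2%}, DR = {dr:.2%}, FPR = {fpr:.2%}")
```

The same function can be applied to the counts in Table IV to obtain the metrics for the "KDDTest-21" set.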
No. of Estimators 10 20 30 40 50 60 70 80 90 100 200
Train Acc Mean 99.974 99.976 99.981 99.979 99.982 99.983 99.981 99.982 99.983 99.982
Train Acc Std. 0.0033 0.0028 0.0017 0.0012 0.0007 0.0009 0.0005 0.0005 0.0004 0.0003
Valid Acc Mean 99.873 99.833 99.841 99.833 99.817 99.865 99.873 99.849 99.849 99.849
Valid Acc Std. 0.0197 0.022 0.0158 0.0162 0.014 0.0158 0.0147 0.0114 0.0117 0.0102
Test Acc Mean 80.424 79.812 79.617 80.251 79.741 79.785 79.861 80.034 80.078 80.171
Test Acc Std. 0.543 0.493 0.637 0.344 0.395 0.469 0.494 0.405 0.41 0.308
Test-21 Acc Mean 62.996 61.595 61.451 62.523 61.603 61.747 61.772 62.068 62.211 62.338
Test-21 Acc Std. 1.23 0.942 1.194 0.648 0.723 0.895 0.928 0.784 0.794 0.584
Learning Dur.(s) 3.181 5.693 8.482 10.944 13.457 19.064 21.28 24.176 26.854 52.3


TABLE I: Random Forest Results for different estimator numbers

V-B Results on Binary Classification

We consider a variety of configurations for each learning method to find the best combination. For instance, the number of nodes and GRU units, the optimizer, learning rate, activation function, class weighting, regularization, and initialization are set to different values to find the best configuration. Some definite results are: 1. The "Adam" optimizer always outperforms Stochastic Gradient Descent (SGD), "RMSProp", "adagrad" and "adadelta". 2. Although regularization is claimed to help against overfitting, here we do not observe any considerable change in the accuracy results on the KDDTest-21 dataset. We run the same structure 30 times and observe the mean and standard deviation of the accuracy results. We choose the combination that has a better mean with a low std. on the validation set and a reasonable learning time. Table I shows the results for RF with different numbers of estimators. The standard way to validate the performance of a learner is to judge it on the validation set. With high numbers of estimators, the std. of the validation accuracy tends to be lower, with almost constant accuracy values, but the learning time grows. We set the number of estimators to 60 (Table II), which has the lowest std. with a proper learning time (highlighted cell in Table I). The same procedure is applied to the GRU network and the CNN with different numbers of units and filters, respectively, and we choose GRU units which has the mean accuracy of with std. of . The mean learning time of GRU-based is . Although we set the number of epochs to 100, an early-stopping callback stops the learning procedure when the validation loss has not decreased, or has grown, for four consecutive epochs (the patience). This mechanism also helps improve the generalization ability of the classifier. Table II briefly shows our applied configuration.
All of the code, detailed results for different neural network configurations, and the best obtained weights for their layers are accessible in our GitHub repository (see footnote 1).
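The early-stopping rule described above can be sketched in plain Python. This is a simplified stand-in for a framework callback (it is similar in spirit to the Keras `EarlyStopping` callback monitoring the validation loss), assuming a patience of four epochs and a cap of 100 epochs as stated in the text; the function name and the loss values are hypothetical.

```python
def early_stopping_epoch(val_losses, patience=4, max_epochs=100):
    """Return the 1-based epoch at which training stops.

    Training stops once the validation loss has failed to improve on its best
    value for `patience` consecutive epochs, or when the epochs run out.
    """
    best = float("inf")
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return min(len(val_losses), max_epochs)


# Hypothetical validation losses: the best value is reached at epoch 3.
val_losses = [0.9, 0.5, 0.4, 0.41, 0.42, 0.40, 0.43, 0.39]
stop_epoch = early_stopping_epoch(val_losses)
```

In this example the loss stops improving after epoch 3, so training halts four epochs later and the later improvement at epoch 8 is never reached; this is the usual trade-off when choosing the patience value.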

As mentioned earlier, NSL-KDD has three predefined sets, and some of the attack types in the test sets are not included in the training set, so the system can be evaluated on zero-day attacks. Most papers combine the predefined sets and apply a regular learning procedure, dividing the combined dataset into training, validation and test sets [7, 8, 9]. Obviously, in the latter case, the performance would be higher. We evaluate our system with the predefined sets, and the comparison with other methods is shown in Figure 4. Figure 4 shows that our proposed system improves the accuracy of the tricky "KDDTest-21" set by more than comparing to latest best achievement [6] and by in "KDDTest+" set where we reduced the training time by about 617 seconds (from about in [6] to ). Tables III and IV show the confusion matrices of our system for the "KDDTest+" and "KDDTest-21" sets, respectively. Table V shows the detailed accuracy and training time for each subsystem of our proposed system.

Fig. 4: Results Comparison TSE-IDS [6], SVM [10], Bagging(J48)[11], RNN[12], Two-Tier [13]

Subsystem   Parameter           Value
CNN / GRU   Hidden Act. Func.   Leaky ReLU & ELU
CNN / GRU   Output Act. Func.   Sigmoid
CNN / GRU   Batch Size          1024
CNN / GRU   No. of Epochs       100
CNN / GRU   Learning Rate       0.006
CNN / GRU   Cost Func.          Binary CrossEntropy
CNN / GRU   Optimizer           Adam
CNN / GRU   Bias                Yes
CNN / GRU   Regularization      No
CNN / GRU   Class Weighting     No
RF          No. of Estimators   60
RF          Max Features        Auto (equal to no. of features)
RF          Criterion           gini

TABLE II: Summary of Proposed System Configuration

Actual   Predicted Normal   Predicted Attack
Normal   9230               480
Attack   2387               10446

TABLE III: Confusion Matrix of the "KDDTest+" set

Actual   Predicted Normal   Predicted Attack
Normal   1769               383
Attack   2388               7310

TABLE IV: Confusion Matrix of the "KDDTest-21" set

Sub-system   Validation (%)   KDDTest+ (%)   KDDTest-21 (%)   Training time (s)
CNN          99.63            82.92          68.30            104.45
GRU          99.76            83.19          68.52            91.73
RF           99.79            80.14          62.34            16.32

TABLE V: Subsystems training time and accuracy for the validation, KDDTest+ and KDDTest-21 sets

VI Conclusion and Future Directions

In this paper, we proposed an IDS that is able to operate with minimal interaction in its training and updating procedures and that performs acceptably well at detecting zero-day attacks. Our proposed IDS is able to update its dataset and learn to deal with newly misclassified records. We examined this IDS on the NSL-KDD dataset, and the results showed improvement in terms of both accuracy and training time compared to state-of-the-art methods. Many challenges remain, such as: 1. running the system in a real-world network, 2. adding the ability to detect the attack type.

Footnotes

  1. https://github.com/catcry2007/nsl4conf

References

  1. Y. Xin, L. Kong, Z. Liu, Y. Chen, Y. Li, H. Zhu, M. Gao, H. Hou, and C. Wang, “Machine learning and deep learning methods for cybersecurity,” IEEE Access, vol. 6, pp. 35365–35381, 2018.
  2. N. Shone, T. N. Ngoc, V. D. Phai, and Q. Shi, “A deep learning approach to network intrusion detection,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 2, pp. 41–50, Feb 2018.
  3. L. Breiman, “Random forests,” Machine learning, vol. 45, no. 1, pp. 5–32, 2001.
  4. A. Géron, Hands-On Machine Learning with Scikit-Learn and TensorFlow. O’Reilly Media, Inc., 1st ed., 2017.
  5. M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the kdd cup 99 data set,” in 2009 IEEE Symposium on Computational Intelligence for Security and Defense Applications, pp. 1–6, July 2009.
  6. B. A. Tama, M. Comuzzi, and K. Rhee, “Tse-ids: A two-stage classifier ensemble for intelligent anomaly-based intrusion detection system,” IEEE Access, vol. 7, pp. 94497–94507, 2019.
  7. J. Yang, T. Li, G. Liang, W. He, and Y. Zhao, “A simple recurrent unit model based intrusion detection system with dcgan,” IEEE Access, vol. 7, pp. 83286–83296, 2019.
  8. S. M. Kasongo and Y. Sun, “A deep learning method with filter based feature engineering for wireless intrusion detection system,” IEEE Access, vol. 7, pp. 38597–38607, 2019.
  9. C. Xu, J. Shen, X. Du, and F. Zhang, “An intrusion detection system using a deep neural network with gated recurrent units,” IEEE Access, vol. 6, pp. 48697–48707, 2018.
  10. Q. M. Alzubi, M. Anbar, Z. N. M. Alqattan, M. A. Al-Betar, and R. Abdullah, “Intrusion detection system based on a modified binary grey wolf optimisation,” Neural Computing and Applications, Springer, pp. 1–13, 2019.
  11. N. T. Pham, E. Foo, S. Suriadi, H. Jeffrey, and H. F. M. Lahza, “Improving performance of intrusion detection system using ensemble methods and feature selection,” in Proceedings of the Australasian Computer Science Week Multiconference, ACSW ’18, (New York, NY, USA), pp. 2:1–2:6, ACM, 2018.
  12. C. Yin, Y. Zhu, J. Fei, and X. He, “A deep learning approach for intrusion detection using recurrent neural networks,” IEEE Access, vol. 5, pp. 21954–21961, 2017.
  13. H. H. Pajouh, G. Dastghaibyfard, and S. J. Hashemi, “Two-tier network anomaly detection model: A machine learning approach,” Journal of Intelligent Information Systems, Springer US, vol. 48, no. 1, pp. 61–74, 2017.