Mimic Learning to Generate a Shareable Network Intrusion Detection Model
Purveyors of malicious network attacks continue to increase the complexity and sophistication of their techniques, and their ability to evade detection continues to improve as well. Hence, intrusion detection systems must also evolve to meet these increasingly challenging threats. Machine learning is often used to support this needed improvement. However, training a good prediction model can require a large set of labeled training data. Such datasets are difficult to obtain because privacy concerns prevent the majority of intrusion detection agencies from sharing their sensitive data. In this paper, we propose the use of mimic learning to transfer intrusion detection knowledge from a teacher model trained on private data to a student model. This student model provides a means of publicly sharing knowledge extracted from private data without sharing the data itself. Our results confirm that the proposed scheme can produce a student intrusion detection model that mimics the teacher model without requiring access to the original dataset.
The Internet has become an essential tool in our daily lives. It aids people in many areas, such as business, entertainment, and education. Along with such benefits, however, come the ever-present risks of network attacks. Thus, many systems have been designed to block such attacks. One such danger is malicious software (malware), which hackers can use to compromise a victim's machines. Millions of dollars are lost each year to ransomware, botnets, backdoors, and Trojans. Therefore, network security is a serious concern, and intrusion detection is a significant research problem impacting both business and personal networks.
This work focuses on intrusion detection systems (IDSs). IDSs assist network administrators in detecting and preventing external attacks. That is, the goal of an IDS is to provide a wall of defense that stops the attacks of online intruders. IDSs can detect different types of malicious network communications and computer system usage better than a conventional firewall can.
Of particular interest to us are anomaly intrusion detection systems. Such systems are based on the assumption that the behavior of intruders differs from that of a legitimate user.
One effective technique in anomaly-based IDSs is to use machine learning to detect unusual patterns that could suggest an attack is happening. The ultimate goal of these techniques is to determine whether a deviation from the established normal usage patterns can be flagged as an intrusion. In the literature, some works apply a single learning technique, such as neural networks, genetic algorithms, or support vector machines. Other systems are based on a combination of different learning techniques, such as hybrid or ensemble techniques.
While machine learning has been widely adopted by large intrusion detection agencies (IDAs) such as Kaspersky, McAfee, and Norton, some challenges have not yet been fully addressed. Learning algorithms benefit from large training sets. However, finding large datasets that contain well-defined malware with the latest up-to-date signatures and zero-day attacks is not an easy task; typically, it requires a specialized IDA (e.g., Kaspersky or McAfee). The lack of such data in large quantities can result in an inaccurate detection model because of insufficient training.
Ideally, such agencies could share their data with researchers working to develop improved detection systems. The private nature of this data, however, prevents it from being shared. One naïve solution is to allow these agencies to share their prediction models. However, such models are known to implicitly memorize details from the training data and, thus, inadvertently reveal those details during inference. Moreover, some organizations might use the model to infer sensitive information about newly detected malware, and that inferred information might be misused to create new malware. Such risks prevent IDAs from sharing their models because of their justifiable concerns about keeping their data private and not exposing it to potential intruders or competing IDAs.
Recently, the idea of mimic learning has been introduced as a solution for preserving the privacy of sensitive data. Mimic learning involves labeling previously unlabeled public data using models that are trained on the original sensitive data. After that, a new model is trained on the newly labeled public data to produce a prediction model that can be shared without revealing the sensitive data itself or exposing the model that was directly trained on this data. Mimic learning can therefore enable the transfer of knowledge from sensitive, private data to a shareable model without putting the data's privacy at risk.
I-B Related Work
Several schemes have been proposed to study how machine learning can be used for intrusion detection systems. However, there has been little research on the use of mimic learning to enable sharing trained models, instead of the original sensitive data, as an option for transferring knowledge. One example of mimic learning research is in the area of information retrieval (IR), which seeks to effectively identify which information resources in a collection are relevant to a specific need or query. In IR applications, having access to large-scale datasets is essential for designing effective systems. However, due to the sensitivity of data and privacy issues, not all researchers have access to such large-scale datasets for training their models. Motivated by these challenges, Dehghani et al. proposed a mimic learning scheme to train a privacy-preserving, shareable model using weak- and full-supervision techniques on the data. This trained model can then safely transfer knowledge by enabling a large set of unlabeled data to be labeled, thereby creating the needed large-scale datasets without having to share sensitive, private data.
Unfortunately, current research work has not yet studied the use of mimic learning in intrusion detection systems.
I-C Main Contribution
In this paper, we propose a mimic learning approach that enables intrusion detection agencies to generate shareable models that facilitate knowledge transfer among organizations and research communities. To the best of our knowledge, this work is the first to examine the use of mimic learning in the area of intrusion detection.
Thus, the main contribution of this paper is the introduction and empirical evaluation of a mimic learning scheme for intrusion detection that effectively transfers knowledge from an initial labeled dataset to a shareable predictive model.
In this section, we present background material needed to understand the remaining parts of the paper. This material includes our assumed network model, our assumed intrusion detection framework, and an introduction to supervised machine learning.
II-A Network Model
As depicted in Figure 1, the assumed network model has the following entities.
Intrusion Detection Agencies (IDA): IDAs are agencies such as cybersecurity and anti-virus providers. They are responsible for developing antivirus, Internet security, endpoint security, and other cybersecurity products and services. Examples of IDAs include Kaspersky Labs, McAfee, and Norton.
Organizations: Companies, universities, and government agencies that are connected to the Internet and have the potential to encounter daily cyber attacks.
Intrusion Detection Gateway (IDG): IDGs are independent entities within each organization that are responsible for detecting malicious software attacks. An IDG manages the intrusion detection system (IDS) framework.
II-B Intrusion Detection Framework
An intrusion detection system (IDS) is a system that is responsible for monitoring, capturing, and analyzing events occurring in computer systems and networks to detect intrusion signals. In this paper, we consider the following IDS framework that consists of three modules as shown in Fig. 1.
Monitoring Module: This module is responsible for monitoring network packets that pass through the network gateway. In this paper, we implemented this module using JnetPcap, an open-source Java library selected for its excellent packet analysis performance.
Feature Extraction Module: This module is responsible for extracting the features that describe each connection, where a connection is defined as the sequence of packets that flow for some time between a given network source and destination using a given protocol. For this paper, 41 statistical features were extracted for each connection to represent the behavior of each connection between a machine inside the monitored network and a machine outside the network. Statistical features were chosen to avoid privacy concerns among users as well as problems associated with encrypted data.
Classifier Module: This module is responsible for classifying a given connection as either benign or malicious based on the extracted features for that connection. Supervised machine learning algorithms are used in our scheme for this classification task.
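As a toy illustration of what the feature extraction module computes (the real module derives 41 statistical features per connection via JnetPcap), the following Python sketch groups hypothetical packet records into connections keyed by (source, destination, protocol) and derives two simple statistical features per connection; the packet records and field names are illustrative only:

```python
from collections import defaultdict

# Hypothetical captured packets (illustrative fields, not JnetPcap output).
packets = [
    {"src": "10.0.0.5", "dst": "93.184.216.34", "proto": "tcp", "bytes": 60},
    {"src": "10.0.0.5", "dst": "93.184.216.34", "proto": "tcp", "bytes": 1500},
    {"src": "10.0.0.7", "dst": "8.8.8.8", "proto": "udp", "bytes": 80},
]

# Group packets into connections by (source, destination, protocol).
connections = defaultdict(list)
for p in packets:
    connections[(p["src"], p["dst"], p["proto"])].append(p["bytes"])

# Derive simple statistical features for each connection.
features = {
    conn: {"packet_count": len(sizes), "total_bytes": sum(sizes)}
    for conn, sizes in connections.items()
}
print(features[("10.0.0.5", "93.184.216.34", "tcp")])
```

A feature vector built this way describes only aggregate behavior of a flow, which is why such statistics remain usable even when payloads are encrypted.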
II-C Machine Learning
Supervised machine learning algorithms take, as input, data labeled with either a numeric or categorical value and produce a program or model capable of using patterns present in the input data to guide the labeling of new, previously unseen data. In this paper, our tasks use a categorical label with a value of either benign or malicious.
Our work includes the following four machine learning algorithms for classifying connections as benign or malicious: decision tree induction (DT), random forests (RF), support vector machines (SVM), and naïve Bayes (NB). We provide an overview of each of these algorithms below.
Decision Tree Induction (DT): A decision tree organizes a hierarchical sequence of questions that lead to a classification decision. Decision tree induction creates this tree through a recursive partitioning approach that, at each step, selects the feature whose values it finds most useful for predicting labels at that level in the hierarchy, partitions the data according to the associated values, and repeats this process on each of the resulting nodes until a stopping condition is met. Once the tree is built, new data is classified by using its feature values to guide a traversal from the root to a leaf, where a class label is assigned. At each node, a given feature from the sample is evaluated to decide which branch is taken along the path to a leaf. Different algorithms use different criteria for selecting the feature at each node in the tree. As an example, we look at ID3.
At each node, ID3 selects the attribute that maximizes the information gain

$$IG(S, A) = H(S) - \sum_{t \in T} \frac{|t|}{|S|}\, H(t), \qquad H(S) = -\sum_{c \in C} p(c) \log_2 p(c),$$

where $S$ is the current data set, $T$ is the set of subsets produced from $S$ after splitting on attribute $A$, $C$ is the set of classes, $IG(S, A)$ is the information gain of the system at attribute $A$, $H(S)$ is the system entropy, and $H(t)$ is the entropy of each subset $t$ generated as a result of splitting the set $S$ using attribute $A$.
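The entropy and information-gain computation described above can be sketched in Python as follows; the toy connection records and the `protocol` attribute are illustrative, not drawn from the paper's dataset:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(dataset, attribute, label_key="label"):
    """IG(S, A): entropy of S minus the weighted entropy of the
    subsets produced by splitting S on attribute A."""
    n = len(dataset)
    total = entropy([row[label_key] for row in dataset])
    subsets = {}
    for row in dataset:
        subsets.setdefault(row[attribute], []).append(row[label_key])
    weighted = sum(len(s) / n * entropy(s) for s in subsets.values())
    return total - weighted

# Toy records: splitting on "protocol" perfectly separates the classes,
# so the gain equals the full entropy of the labels (1 bit here).
data = [
    {"protocol": "tcp", "label": "malicious"},
    {"protocol": "tcp", "label": "malicious"},
    {"protocol": "udp", "label": "benign"},
    {"protocol": "udp", "label": "benign"},
]
print(information_gain(data, "protocol"))  # 1.0
```

ID3 would evaluate this gain for every candidate attribute and split on the maximizer, then recurse on each resulting subset.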
Random Forests (RF): To increase the accuracy and stability of decision trees, RF leverages a bagging technique to generate a collection (or forest) of trees. The label is determined either by averaging the output decisions in the case of numerical labels or, in the case of categorical labels, by the class with the maximum number of “votes” (the class selected by a majority of the trees in the forest).
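A minimal Python sketch of the two ingredients named above, bootstrap sampling (bagging) and majority voting over per-tree categorical predictions; the tree predictions shown are hypothetical stand-ins rather than real trained trees:

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw a bootstrap sample (sampling with replacement, same size as data);
    each tree in the forest is trained on one such sample."""
    return [rng.choice(data) for _ in data]

def forest_vote(tree_predictions):
    """Combine per-tree categorical predictions by majority vote."""
    return Counter(tree_predictions).most_common(1)[0][0]

# Three hypothetical trees disagree on one connection; the majority wins.
print(forest_vote(["malicious", "benign", "malicious"]))  # malicious

rng = random.Random(42)
print(len(bootstrap_sample(["a", "b", "c"], rng)))  # 3
```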
Support Vector Machine (SVM): An SVM is a machine learning algorithm that attempts to separate points of data in $n$-dimensional space using a hyperplane of $n-1$ dimensions. The hyperplane provides the best separation when its distance to the nearest points on both sides is maximal. We include SVMs because of their scalability and high accuracy in complex classification problems.
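Once trained, a linear SVM classifies a point by which side of the hyperplane $w \cdot x + b = 0$ it falls on. The sketch below shows only that decision rule, with illustrative 2-D weights; the training step that actually maximizes the margin is not shown:

```python
def svm_decide(w, b, x):
    """Classify x by the sign of the decision function w·x + b."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return "malicious" if score >= 0 else "benign"

# Illustrative weights; a trained SVM would learn w and b from data.
print(svm_decide([1.0, 1.0], -1.0, [0.9, 0.8]))  # malicious
print(svm_decide([1.0, 1.0], -1.0, [0.1, 0.2]))  # benign
```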
Naïve Bayes (NB): This is a statistical classification methodology based on Bayes' theorem. It assumes that the features of a given sample are conditionally independent of each other, which enables a tractable calculation of the posterior probability of each class for a data sample with $n$ attributes. Given a sample with features $x_1, \ldots, x_n$, the NB algorithm identifies the class with the maximum posterior probability, as given in Eq. 3:

$$\hat{c} = \arg\max_{c \in C} P(c) \prod_{i=1}^{n} P(x_i \mid c). \qquad (3)$$
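The posterior maximization in Eq. 3 can be sketched for categorical features as follows; the training data is a toy example, and Laplace smoothing is omitted for brevity:

```python
from collections import Counter, defaultdict

def train_nb(samples, labels):
    """Estimate class priors P(c) and per-feature likelihoods P(x_i | c)
    by counting (no smoothing, for brevity)."""
    priors = Counter(labels)
    likelihoods = defaultdict(Counter)  # (class, feature index) -> value counts
    for x, c in zip(samples, labels):
        for i, v in enumerate(x):
            likelihoods[(c, i)][v] += 1
    return priors, likelihoods

def predict_nb(x, priors, likelihoods):
    """Pick the class maximizing P(c) * prod_i P(x_i | c)."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for c, count in priors.items():
        p = count / n
        for i, v in enumerate(x):
            p *= likelihoods[(c, i)][v] / count
        if p > best_p:
            best, best_p = c, p
    return best

# Toy categorical features: (protocol, traffic volume).
X = [("tcp", "high"), ("tcp", "high"), ("udp", "low"), ("udp", "low")]
y = ["malicious", "malicious", "benign", "benign"]
priors, lik = train_nb(X, y)
print(predict_nb(("tcp", "high"), priors, lik))  # malicious
```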
III Proposed Scheme
In this section, we present the proposed scheme, which has been adapted from the mimic learning scheme discussed in Section I. First, a classifier is trained on the original sensitive data to produce a teacher model. Then, the generated teacher model is used to annotate publicly available unlabeled data, converting it into labeled data. Next, a new classifier is trained on this newly labeled data to produce a student model. The proposed methodology is illustrated in Algorithm 1.
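The three steps can be sketched end to end as follows, with a toy 1-nearest-neighbour classifier standing in for the classifier-selection process and hand-made two-feature vectors standing in for the real data:

```python
# Toy 1-nearest-neighbour classifier used as a stand-in for the real
# classifier set (DT, RF, SVM, NB).
def nn_fit(X, y):
    """'Training' is just memorizing the labeled points."""
    return list(zip(X, y))

def nn_predict(model, x):
    """Label x with the label of its nearest stored point."""
    return min(model, key=lambda t: sum((a - b) ** 2 for a, b in zip(t[0], x)))[1]

# 1. Train the teacher on the (private) labeled data.
private_X = [(0.0, 0.1), (0.2, 0.0), (0.9, 1.0), (1.0, 0.8)]
private_y = ["benign", "benign", "malicious", "malicious"]
teacher = nn_fit(private_X, private_y)

# 2. Use the teacher to annotate public unlabeled data.
public_X = [(0.1, 0.0), (0.95, 0.9)]
public_y = [nn_predict(teacher, x) for x in public_X]

# 3. Train the student on the newly labeled public data; only the
#    student (and the public data) would ever be shared.
student = nn_fit(public_X, public_y)
print(nn_predict(student, (0.9, 0.95)))  # malicious
```

The private feature vectors never leave step 1; everything the student sees is the teacher-labeled public data.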
III-A Teacher Model Generation
The process of teacher model generation is shown in Fig. 2, where the IDA uses its original sensitive data with a set of classifiers to generate multiple models, the most accurate of which is selected as the teacher model.
As depicted in Algorithm 1, the process of generating the teacher model starts by evaluating several classifiers (lines 2-5). The classifier with the best performance is then trained on the original sensitive data to produce the teacher model (lines 6-7).
III-B Student Model Generation
As illustrated in Fig. 3, the process of generating the student model starts by using the teacher model to label (or annotate) an unlabeled public dataset, producing newly labeled training data. Then, a classifier selection and training process, similar to that used for the teacher model, produces our student model. This training process can be considered a knowledge transfer process: the goal is for the knowledge that the teacher model extracted from the sensitive, private data to be passed along to the student model through the publicly available data that the teacher model labeled. This is illustrated in lines 16-23 of Algorithm 1.
In this section, we explain the data and process used to evaluate our proposed scheme, and we present and discuss the results of this evaluation.
A total of 136,000 feature vectors were obtained from Kaggle's Open Datasets for the purpose of training and evaluating our scheme. Each feature vector is described using 41 features. The dataset contains a combination of benign and malicious data traffic. The benign data resembles the behavior of a normal user inside the monitored network. The malicious data includes the behavior of traffic flows during different kinds of intrusion attacks (e.g., Denial of Service (DOS), Probe, User to Root (U2R), and Remote to User (R2L) attacks).
A brief explanation of each of these attacks is as follows:
Denial of Service (DOS) attack: A cybersecurity attack in which the intruder tries to prevent the normal usage of network resources (machines). In other words, the intruder tries to overwhelm the resources so that they cannot serve the requests of legitimate users.
Probe attack: In this type of attack, the intruder scans the network's computers in search of vulnerabilities to exploit in order to compromise the system.
User to Root (U2R) attack: In this attack, the intruder tries to exploit the vulnerabilities of a given machine to gain root access.
Remote to User (R2L) attack: A cybersecurity attack in which the attacker tries to gain access to a machine to which he/she does not have legitimate access.
For the original Kaggle competition, the dataset had a specific test set. The labels for this test set have since been released, and for our experiments, we combined all the data from Kaggle into a single dataset from which different random data sets could be generated as needed. The obtained dataset was divided into three parts. The first part, consisting of 57,900 feature vectors, was used as the labeled dataset on which the teacher model was trained. The second part, consisting of 57,900 feature vectors, was used as the unlabeled dataset by deleting its label column. The third part, consisting of 20,173 feature vectors, was the test set on which both the teacher and the student models were evaluated.
IV-B Experiment Methodology
The training process of the teacher and the student models passes through several steps, summarized as follows:
Step 1: Training the teacher model. The teacher model is trained offline using 57,900 feature vectors for different network flows representing benign and malicious data (the labeled data). The classifier set is assumed to contain the four classifiers mentioned in Section II. To evaluate the performance of each classifier, we use 10-fold Cross-Validation (CV). $k$-fold CV is the process of dividing the training data into $k$ equal folds (parts). After that, the model is trained on $k-1$ folds and evaluated on the remaining fold. This operation is repeated $k$ times, with each fold being used once as the test data. The parameter $k$ in our experiments is set to 10. The classifier with the best performance is selected as the teacher model and is used to label the unlabeled data. The results of the 10-fold CV are shown in Table I.
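The fold rotation described in Step 1 can be sketched in Python; the stand-in classifier below simply predicts the majority class of its training data, so the accuracies are illustrative only:

```python
from collections import Counter

def kfold_indices(n, k):
    """Split indices 0..n-1 into k (nearly) equal contiguous folds.
    (Real evaluations usually shuffle first; omitted for brevity.)"""
    folds, start = [], 0
    for i in range(k):
        size = n // k + (1 if i < n % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(X, y, k, fit, predict):
    """Average accuracy over k rotations: train on k-1 folds,
    test on the held-out fold."""
    folds = kfold_indices(len(X), k)
    accs = []
    for test in folds:
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(predict(model, X[i]) == y[i] for i in test)
        accs.append(correct / len(test))
    return sum(accs) / k

# Stand-in classifier: always predicts the majority class of its training data.
fit = lambda X, y: Counter(y).most_common(1)[0][0]
predict = lambda model, x: model
X, y = list(range(10)), ["benign"] * 8 + ["malicious"] * 2
print(cross_validate(X, y, 5, fit, predict))  # 0.8
```

In the actual experiments, each of the four real classifiers would take the place of `fit`/`predict`, and the one with the best averaged score becomes the teacher.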
Step 2: Labeling/Annotation process. The unlabeled data is annotated by the teacher model generated in Step 1. The output of this step is a labeled dataset of 57,900 feature vectors.
Step 3: Training the student model. In this step, the annotated data from Step 2 is used to select and train the student model. The same four classifiers are evaluated at this step using 10-fold CV. The classifier with the best performance is selected as the student model to be shared.
Step 4: Teacher/Student model evaluation. In this step, both the teacher model and the student model are evaluated using the same test data to compare their performance.
IV-C Considered Key Performance Metrics
We define the following key performance metrics used in our evaluation process:
Detection Accuracy (ACC): The ratio of the number of true positives and true negatives to the total number of samples:

$$ACC = \frac{TP + TN}{TP + TN + FP + FN},$$

where $TP$ represents the true positives (the number of malicious samples that are correctly classified as malicious), $TN$ is the true negatives (the number of benign samples that are correctly classified as benign), $FP$ denotes the false positives (the number of benign samples incorrectly classified as malicious), and $FN$ represents the false negatives (the number of malicious samples incorrectly classified as benign).
True Positive Rate (TPR): The ratio of the true positives to the total number of samples that were labeled as positive: $TPR = TP / (TP + FN)$.
False Positive Rate (FPR): The ratio of the false positives to the total number of samples that were labeled as negative: $FPR = FP / (FP + TN)$.
True Negative Rate (TNR): The ratio of the true negatives to the total number of samples that were labeled as negative: $TNR = TN / (TN + FP)$.
False Negative Rate (FNR): The ratio of the false negatives to the total number of samples that were labeled as positive: $FNR = FN / (TP + FN)$.
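All five metrics follow directly from the four confusion-matrix counts; for example (with hypothetical counts, not results from the paper's tables):

```python
def detection_metrics(tp, tn, fp, fn):
    """Compute the five rates from the confusion-matrix counts."""
    return {
        "ACC": (tp + tn) / (tp + tn + fp + fn),
        "TPR": tp / (tp + fn),
        "FPR": fp / (fp + tn),
        "TNR": tn / (tn + fp),
        "FNR": fn / (tp + fn),
    }

# Hypothetical counts for illustration only.
m = detection_metrics(tp=90, tn=95, fp=5, fn=10)
print(m["ACC"])  # 0.925
```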
IV-D Results and Discussion
The results of evaluating the four classifiers using 10-fold CV for selecting the teacher model are shown in Table I.
We observe that the RF classifier performs better than other classifiers and that the NB classifier provides the lowest accuracy. Thus, the RF classifier is selected as the teacher model.
The 10-fold CV student model results are shown in Table II. Similarly, the RF classifier performs the best among this set of classifiers. Thus, it is chosen to be the student model.
[Table I and Table II report, for each candidate teacher and student classifier respectively: ACC (%), TPR (%), FPR (%), TNR (%), FNR (%), and AUC.]
The comparison of both the teacher and the student models on the test data is shown in Table III. The performance of both the teacher and the student models is nearly identical, which supports our assertion that unlabeled data trained by a teacher model can be used to transfer knowledge to a student model without revealing data that is considered sensitive. It also suggests that, by sharing student prediction models, intrusion detection agencies could enable research communities to benefit from their datasets without directly sharing that sensitive data.
[Table III reports, for the teacher and student models: ACC (%), TPR (%), FPR (%), TNR (%), FNR (%), and AUC.]
Intrusion detection applications are data-hungry, and training an effective model requires a huge amount of labeled data. In this paper, a knowledge transfer methodology for generating a shareable intrusion detection model has been presented to address the problem of enabling the research community to benefit from datasets owned by intrusion detection agencies without directly sharing sensitive data. We believe that, through mimic learning, a network intrusion detection student model can be trained and shared with outside communities to enable knowledge sharing with fewer privacy concerns. The performance evaluation of both the student model and the teacher model shows nearly identical performance, which we consider an indication of the success of our mimic learning technique for transferring knowledge from the teacher model to the student model using unlabeled public data.
-  C. Groşan, A. Abraham, and S. Y. Han, MEPIDS: Multi-Expression Programming for Intrusion Detection System. Berlin, Heidelberg: Springer Berlin Heidelberg, 2005, pp. 163–172. [Online]. Available: https://doi.org/10.1007/11499305_17
-  W. Stallings, Operating Systems: Internals and Design Principles, 6th ed. Upper Saddle River, NJ, USA: Prentice Hall Press, 2008.
-  S. García, A. Zunino, and M. Campo, “Survey on network-based botnet detection methods,” Sec. and Commun. Netw., vol. 7, no. 5, pp. 878–903, May 2014. [Online]. Available: http://dx.doi.org/10.1002/sec.800
-  B. Shah and B. Trivedi, “Artificial neural network based intrusion detection system: A survey,” International Journal of Computer Applications, vol. 39, pp. 13–18, 02 2012.
-  Kaspersky, “Machine learning in cybersecurity.” [Online]. Available: https://usa.kaspersky.com/enterprise-security/wiki-section/products/machine-learning-in-cybersecurity (last visited 4/30/2019).
-  M. Dehghani, H. Azarbonyad, J. Kamps, and M. de Rijke, “Share your model instead of your data: Privacy preserving mimic learning for ranking,” CoRR, vol. abs/1707.07605, 2017. [Online]. Available: http://arxiv.org/abs/1707.07605
-  E. Biglar Beigi, H. Hadian Jazi, N. Stakhanova, and A. A. Ghorbani, “Towards effective feature selection in machine learning-based botnet detection approaches,” in 2014 IEEE Conference on Communications and Network Security, Oct 2014, pp. 247–255.
-  N. Papernot, M. Abadi, U. Erlingsson, I. Goodfellow, and K. Talwar, “Semi-supervised knowledge transfer for deep learning from private training data,” arXiv preprint arXiv:1610.05755, 2016.
-  M. Dehghani, H. Zamani, A. Severyn, J. Kamps, and W. B. Croft, “Neural ranking models with weak supervision,” CoRR, vol. abs/1704.08803, 2017. [Online]. Available: http://arxiv.org/abs/1704.08803
-  ——, “Neural ranking models with weak supervision,” in Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2017, pp. 65–74.
-  H.-J. Liao, C.-H. R. Lin, Y.-C. Lin, and K.-Y. Tung, “Intrusion detection system: A comprehensive review,” J. Network and Computer Applications, vol. 36, pp. 16–24, 2013.
-  JnetPcap packet capture library. [Online]. Available: https://github.com/ruedigergad/clj-net-pcap/tree/master/jnetpcap (last visited 4/29/2019).
-  J. R. Quinlan, “Induction of decision trees,” Machine Learning, vol. 1, no. 1, pp. 81–106, Mar 1986. [Online]. Available: https://doi.org/10.1007/BF00116251
-  C. Kruegel and T. Toth, Using Decision Trees to Improve Signature-Based Intrusion Detection. Berlin, Heidelberg: Springer Berlin Heidelberg, 2003, pp. 173–191. [Online]. Available: https://doi.org/10.1007/978-3-540-45248-5_10
-  L. Breiman, “Bagging predictors,” Machine Learning, vol. 24, no. 2, pp. 123–140, Aug 1996. [Online]. Available: https://doi.org/10.1007/BF00058655
-  ——, “Random forests,” Machine Learning, vol. 45, no. 1, pp. 5–32, Oct 2001. [Online]. Available: https://doi.org/10.1023/A:1010933404324
-  C. Cortes and V. Vapnik, “Support-vector networks,” Mach. Learn., vol. 20, no. 3, pp. 273–297, Sep. 1995. [Online]. Available: https://doi.org/10.1023/A:1022627411411
-  M. Z. Shafiq, S. M. Tabish, F. Mirza, and M. Farooq, PE-Miner: Mining Structural Information to Detect Malicious Executables in Realtime. Berlin, Heidelberg: Springer Berlin Heidelberg, 2009, pp. 121–141. [Online]. Available: https://doi.org/10.1007/978-3-642-04342-0_7
-  P. Barthakur, M. Dahal, and M. K. Ghose, “A framework for p2p botnet detection using svm,” in 2012 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, Oct 2012, pp. 195–200.
-  A. Abraham, C. Grosan, and Y. Chen, “Cyber security and the evolution of intrusion detection systems,” Dec. 2017.
-  P. SangitaB and S. R. Deshmukh, “Use of support vector machine, decision tree and naive bayesian techniques for wind speed classification,” in 2011 International Conference on Power and Energy Systems, Dec 2011, pp. 1–8.
-  “Intrusion Detection” dataset, Kaggle. [Online]. Available: https://www.kaggle.com/what0919/intrusion-detection (last visited 4/21/2019).
-  S. Paliwal and R. K. Gupta, “Denial-of-service, probing & remote to user (r2l) attack detection using genetic algorithm,” 2012.
-  J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, 3rd ed. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc., 2011.