Extending Detection with Forensic Information
For over a quarter century, security-relevant detection has been driven by models learned from input features collected from real or simulated environments. An artifact (e.g., network event, potential malware sample, suspicious email) is deemed malicious or non-malicious based on its similarity to the learned model at run-time. However, the training of the models has been historically limited to only those features available at run time. In this paper, we consider an alternate model construction approach that trains models using forensic “privileged” information–features available at training time but not at runtime–to improve the accuracy and resilience of detection systems. In particular, we adapt and extend recent advances in knowledge transfer, model influence, and distillation to enable the use of forensic data in a range of security domains. Our empirical study shows that privileged information increases detection precision and recall over a system with no privileged information: we observe up to 7.7% relative decrease in detection error for fast-flux bot detection, 8.6% for malware traffic detection, 7.3% for malware classification, and 16.9% for face recognition. We explore the limitations and applications of different privileged information techniques in detection systems. Such techniques open the door to systems that can integrate forensic data directly into detection models, and therein provide a means to fully exploit the information available about past security-relevant events.
Detection systems based on machine learning are an essential tool for system and enterprise defense. Such systems provide predictions about the existence of an attack in a target domain using information collected in real-time. The detection system uses this information to compare the run-time environment against known normal or anomalous state. In this way, the detection system “recognizes” when the environmental state becomes, at least probabilistically, dangerous. What constitutes dangerous is learned; detection algorithms construct models of attacks (or non-attacks) from past observations using a training algorithm. Thereafter, the detection systems use that model for detection (i.e., at run-time). Yet, a limitation of this traditional model of detection is that it focuses on features (also referred to as inputs) that will be available at run-time. Many features are either too expensive to collect in real-time or only available after the fact. In traditional detection, such information is ignored for the purposes of detection. We argue that future detection systems need to learn from all information relevant to an attack, whether available at run-time or not.
Consider a recent event that occurred in the United States. In the summer of 2015, the United States Office of Personnel Management (OPM) fell victim to sophisticated cyber attacks that resulted in substantial exfiltration of personal information and intellectual property. Government staff and security analysts then conducted a clandestine investigation. During that time, a vast amount of information was collected from networks and systems across the agency, e.g., network flows, system log files, and user behaviors. An analysis of the collected data revealed the presence of previously undetected advanced persistent threat (APT) actors on the agency’s network. Yet, the resulting analysis is largely non-actionable by detection systems after the investigation: because the vast array of derived features would not be available at run-time, they cannot be used to train OPM’s detection systems.
This example highlights a challenge for future intrusion detection: how can detection systems integrate intelligence relevant to an attack that is not measurable at run-time? Here, we turn to recent advances in machine learning that support models that learn on a superset of the features used at run-time [3, 4]. The goal of this effort is to leverage these additional features, called privileged information (features available at training time, but not at run-time), to improve the accuracy of detection. Note that while we are principally motivated by the need to integrate outcomes from past analyses, privileged information is broader than just the forensic data collected. Data that is simply too costly to acquire or unreliable at run-time can be collected only for the purposes of model training. Thus, using the techniques defined in the following sections, designers and operators of detection systems can leverage additional effort during system calibration to improve detection models without inducing additional run-time costs. (Footnote 1: As the detection algorithms developed within this work are agnostic to the source and meaning of privileged features, we largely do not distinguish between forensic and non-forensic data throughout.)
More concretely, in this paper we explore an alternate approach to training intrusion detection systems that exploits privileged information. We design algorithms for three classes of privileged information-augmented detection systems that use: (1) knowledge transfer, a general technique of extracting knowledge from privileged information by estimation from available information; (2) model influence, a model-dependent approach of influencing the model optimization with additional knowledge obtained from privileged information; and (3) distillation, a more flexible approach of distilling the additional knowledge about privileged samples as class probability vectors. We further explore feature engineering in a privileged setting. To this end, we measure the potential impact of privileged features on run-time models using accuracy gain, i.e., a feature’s additive contribution to accuracy. We develop an algorithm and system that selects features that maximize model accuracy in the presence of privileged information. Finally, we compare the performance of privileged-augmented systems against systems with no privileged information. We evaluate four recently proposed detection systems: (1) face authentication, (2) malware traffic detection, (3) fast-flux bot detection, and (4) malware classification. These systems are evaluated using reverse-engineered and publicly available datasets.
The central hypothesis of this paper is that algorithms trained with privileged information can be more accurate and resilient to detection. We make the following contributions in this paper in exploring the validity of that hypothesis:
We introduce three classes of detection algorithms that integrate privileged information at training time to improve detection precision and recall over benchmark systems that do not use privileged information.
We present the first methods for feature engineering in privileged information-augmented detection systems and identify inherent tensions between information utilization, detection accuracy, and model robustness.
We provide an empirical evaluation of techniques on a variety of recently proposed detection systems using real-world data. We show that privileged information decreases the relative detection error of traditional detection systems up to 16.9% for face authentication, 8.6% for malware traffic detection, 7.7% for fast-flux bot detection, and 7.3% for malware classification.
We analyze dataset properties and algorithm parameters that maximize detection gain, and present guidelines and cautions for the use of privileged information in realistic deployment environments.
After introducing the technical approach for detection in the next section, we consider several key questions:
How can the best features for a specific detection task be identified? (Section IV)
How does privileged-augmented detection perform against systems with no privileged information? (Section V)
How can we select the best privilege algorithm for a given domain and detection task? (Section V-E)
What are the practical concerns and opportunities in using privileged information for detection? (Section VI)
II Problem Statement
Detection systems use traditional learning algorithms such as support vector machines or multilayer perceptrons (neural networks) to construct detection models. These models aim at learning patterns from historical data (also referred to as training data) to estimate an underlying dependency, structure or behavior of a system, process or environment. This training data is a collection of samples, each of which includes a vector of features (e.g., packets per second, port number) and a class label (e.g., anomalous or normal). Once trained, run-time events (e.g., network events) are compared to the learned model. Without loss of generality, the model outputs a label (or label confidence) that most closely fits with those of the training data. The percentage of output labels that are correctly predicted for a sample set is known as its accuracy for that input data set.
The quality of a detection system largely depends on the features used to train its models. In turn, the success of detection depends on the explanatory variation behind the features that is used to separate attack and benign samples. However, modern detection systems by construction assume that the features used to make predictions at run-time are identical to those used for training (See Figure 1 (left)). This assumption restricts model training to the features that are available at run-time, and thus eliminates the use of powerful malicious or benign representations exhibited by features that are not available at run-time. As highlighted in the preceding section, intelligence obtained from forensic investigations or data obtained through human expert analysis is simply not actionable. Juxtapose this with our goal of leveraging features in model training that are not available at run-time to improve detection accuracy (See Figure 1 (right)). Note that we do not focus on a specific detection task or domain. We begin in the following section by introducing the three central approaches we use to integrate privileged information into detection models.
III Detection with Privileged Information
This section introduces three approaches to integrate privileged features into detection algorithms: knowledge transfer, model influence, and distillation. Figure 2 presents a schematic overview of the approaches. Stated formally, we consider a conventional detection algorithm as a function $f: X \to Y$ that aims at predicting targets $y \in Y$ given some explanatory features $x \in X$. The models are built using a dataset containing pairs of features and targets, denoted by $D = \{(x_i, y_i)\}_{i=1}^{n}$. Following the definition of privileged information, we consider a detection setting where the features used for detection are split into two sets to characterize their availability at training time and at run-time. The standard features $X^s$ include the features that are reliably available both at training and run-time (as in conventional systems), while the privileged features $X^*$ have constraints that prevent using them for detection at run-time. More formally, we assume that detection models will be trained on some data $\{(x_i, x_i^*, y_i)\}_{i=1}^{n}$, and that they will make detections on some data $\{x_j\}_{j=n+1}^{n+m}$. Therefore, our goal is the creation of algorithms that efficiently integrate privileged features into detection models without requiring them at run-time.
Each approach is based on a different assumption about the underlying detection process. However, central to all of them is using privileged features along with system’s conventional features available at run-time to offer an effective detection under the core principle of using complete and relevant features about the normal and anomalous state of a system.
III-A Knowledge Transfer
We consider a general algorithm to transfer knowledge from the space of privileged information to the space where the detection function is constructed. The algorithm works by deriving a mapping function to estimate each privileged feature from a subset of the standard features. The estimation is straightforward: the relationship between standard and privileged features is learned by defining each privileged feature as a target and the standard set as an input to a mapping function. The mapping functions can take any form, such as regression-based or similarity-based functions (we give examples of such functions in Section V-A). The use of mapping functions allows a system to apply the model learned from the complete set at training time to the union of standard and estimated privileged features on unknown samples (See Figure 2(a)).
The algorithm used to identify a mapping function is described in Algorithm 1. By using the mapping functions, detection systems are able to construct the complete feature set with estimated values of the privileged features at run-time; intuitively, each mapping function is used to generate a synthetic feature that represents an estimate of a privileged feature. As a result, the accurate estimation of privileged features contributes to using complete and relevant features in model training and, therefore, enhances the generalization of models compared to those trained solely on standard features. Note that estimating power is bounded by the size and completeness of the training data (with respect to the privileged features), and thus the use of estimated features in the model should be calibrated based on measurements of estimation quality (See Section V-D for details).
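The knowledge transfer idea above can be illustrated with a minimal sketch. It assumes a similarity-based mapping (a k-nearest-neighbor average over the standard features); this is one possible choice and not necessarily the mapping used in Algorithm 1. The function names `knn_mapping` and `complete_features` and the toy data are hypothetical.

```python
import math

def knn_mapping(train_std, train_priv, k=3):
    """Return a function estimating one privileged feature from standard features."""
    def estimate(x_std):
        # Rank training samples by Euclidean distance in the standard space.
        ranked = sorted(
            range(len(train_std)),
            key=lambda i: math.dist(train_std[i], x_std),
        )
        nearest = ranked[:k]
        # Average the privileged values of the k closest training samples.
        return sum(train_priv[i] for i in nearest) / len(nearest)
    return estimate

def complete_features(x_std, mappings):
    """Append one estimated privileged feature per mapping function."""
    return list(x_std) + [f(x_std) for f in mappings]

# Toy training data (assumed): two standard features, one privileged feature.
train_std = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
train_priv = [10.0, 12.0, 50.0, 54.0]

mapping = knn_mapping(train_std, train_priv, k=2)
x_runtime = (0.05, 0.1)          # only standard features exist at run-time
x_complete = complete_features(x_runtime, [mapping])
```

The completed vector can then be passed to a model trained on the complete feature set. How well this works depends on how accurately the standard features predict the privileged one, which is the calibration concern raised in the text.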
III-B Model Influence
Model influence incorporates the useful information obtained from privileged features into the correction space of the detection model by defining additional constraints on the training errors (See Figure 2(b)) [4, 3]. Intuitively, the algorithm learns how privileged information influences outputs on training input feature vectors, building a set of corrections for the space of inputs; in essence, it creates a correction function that takes run-time features as input and adjusts model outputs. Note that we adapt the model influence approach to support vector machines (SVM). Model influence is applicable to other machine learning techniques, but we defer its use in other contexts to future work.
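More formally, consider training data $\{(x_i, x_i^*, y_i)\}_{i=1}^{n}$, with standard features $x_i$, privileged features $x_i^*$, and labels $y_i$, generated from an unknown probability distribution $p(x, x^*, y)$. Our goal is to find a model $f: X \to Y$ with parameters $\theta$ such that the expected loss, defined as follows, is minimized:

$$R(\theta) = \int_{X}\int_{Y} L\big(f(x;\theta),\, y\big)\, p(x, y)\, dx\, dy \qquad (1)$$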
Here the system is trained on standard and privileged features but only uses standard features at run-time. To realize this approach, we consider the optimization problem of SVM in its dual form as shown in Equation 2 (labeled as primary detection objective), where $\alpha$ is the Lagrange multiplier.
We influence the detection boundary of a model trained on standard features at a sample $x_i$ with the correction function defined by the privileged features $x_i^*$ at the same location (labeled as influence from privileged features). In this manner, we use privileged features as a correction function of the slack variables defined in the objective of SVM. In turn, the useful information obtained from privileged features is incorporated as a measure of confidence for each labeled standard sample. The formulation is named SVM+ and requires $O(\sqrt{n})$ samples to converge compared to $O(n)$ samples for SVM, which is useful for systems with sparse data collection [8, 3]. We refer readers to Appendix A for a complete formulation and implementation.
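For orientation, the SVM+ dual as it commonly appears in the literature on learning using privileged information [3, 4] can be written as below. This is a sketch in assumed notation, and it may differ in detail from Equation 2: $K$ and $K^*$ denote kernels on the standard and privileged features, $\alpha_i$ and $\beta_i$ are the dual variables of the decision and correcting functions, and $C$ and $\gamma$ are hyperparameters. Appendix A remains the authoritative formulation.

```latex
\max_{\alpha,\,\beta}\;
\underbrace{\sum_{i=1}^{n}\alpha_i
  - \frac{1}{2}\sum_{i,j=1}^{n}\alpha_i\alpha_j\, y_i y_j\, K(x_i, x_j)}_{\text{primary detection objective}}
\;-\;
\underbrace{\frac{1}{2\gamma}\sum_{i,j=1}^{n}
  (\alpha_i+\beta_i-C)(\alpha_j+\beta_j-C)\, K^{*}(x_i^{*}, x_j^{*})}_{\text{influence from privileged features}}
\\[4pt]
\text{s.t.}\quad
\sum_{i=1}^{n}\alpha_i y_i = 0,\qquad
\sum_{i=1}^{n}(\alpha_i+\beta_i-C) = 0,\qquad
\alpha_i \ge 0,\quad \beta_i \ge 0 .
```

The first term is the usual SVM dual over standard features. The second term couples the dual variables through the privileged kernel, which is how the privileged features shape the slack values without being needed at run-time.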
To illustrate, consider the 2-dimensional synthetic dataset presented in Figure 3, as well as the decision boundaries of two detection algorithms SVM (an unmodified support vector machine) and SVM+ (the same SVM augmented with model influence correction). The use of privileged information in model training separates the classes more accurately because privileged features accurately transfer information to standard space, and the resulting model becomes more robust to the outliers. This approach may provide even better class separation in datasets with higher dimensionality.
To summarize, as opposed to knowledge transfer, we eliminate the task of finding the mapping functions between standard and privileged features. Thus, we reduce the problem of model training to a single-task as a unified algorithm.
III-C Distillation
Model compression or distillation are techniques to transfer knowledge from a complex Deep Neural Network (DNN) to a smaller one without loss of accuracy. The motivation behind the idea is closely related to the knowledge transfer approach previously introduced. The goal of distillation is to use the class knowledge from both the class labels (i.e., hard labels) and the probability vectors of each class (i.e., soft labels). The benefit of using class probabilities in addition to the hard labels is intuitive: the probabilities of each class define a similarity metric over the classes beyond the samples’ correct classes. Lopez-Paz et al. recently introduced an extension of model distillation used to compress models built on one set of features into models built on a different set of features. We adapt this technique to detection algorithms.
We address the problem of privileged information using distillation as follows. First, we train a “privileged” model on the privileged features and labels; the output of this model is the vector of soft labels. Second, we train a distilled model (used at run-time) by minimizing Equation 3, which learns a detection model by simultaneously imitating the predictions of the privileged model and learning the targets of the standard features. The algorithm for learning such a model is presented in Algorithm 2 and outlined as follows:
We learn a privileged model by using the privileged samples. We then compute the soft labels by applying the softmax function (i.e., normalized exponential) to its outputs. The output is a vector that assigns a probability to each class of the privileged samples. We note that the class probabilities obtained from the privileged model provide additional information for each class. Here, the temperature parameter $T$ controls the degree of class prediction smoothness. A higher $T$ enables softer probabilities over classes and vice versa. As a final step, Equation 3 is sequentially minimized to distill the knowledge transferred from privileged features in the form of probability vectors (soft labels) into the standard sample classes (hard labels). In Equation 3, the parameter $\lambda$ controls the trade-off between privileged and standard features. For $\lambda \approx 0$, the objective approaches the standard set objective, which amounts to detection solely on standard features. However, as $\lambda \to 1$, the objective transfers the knowledge acquired by the privileged model into the resulting detection model. Therefore, learning from the privileged model can, in many cases, significantly improve the learning process of a detection model.
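The two ingredients described above, temperature-scaled soft labels and a weighted combination of hard-label and soft-label losses, can be sketched as follows. This is a minimal illustration that assumes a cross-entropy loss and the weighting $(1-\lambda)$ on hard labels and $\lambda$ on soft labels, following the generalized distillation formulation of Lopez-Paz et al.; the exact form of Equation 3 may differ. All names and numeric values are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Normalized exponential of logits scaled by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(hard_label, soft_label, student_logits, lam):
    """Per-sample loss: (1 - lam) * hard-label term + lam * soft-label term."""
    predicted = softmax(student_logits)
    return ((1 - lam) * cross_entropy(hard_label, predicted)
            + lam * cross_entropy(soft_label, predicted))

# Privileged-model outputs for one sample with two classes (toy values).
privileged_logits = [2.0, 0.0]
soft_sharp = softmax(privileged_logits, temperature=1.0)
soft_smooth = softmax(privileged_logits, temperature=5.0)  # softer probabilities

hard = [1.0, 0.0]               # one-hot class label
student_logits = [0.5, -0.5]    # output of the run-time (distilled) model
loss_standard_only = distillation_loss(hard, soft_smooth, student_logits, lam=0.0)
loss_mixed = distillation_loss(hard, soft_smooth, student_logits, lam=0.5)
```

With `lam=0.0` the loss reduces to ordinary training on hard labels, while larger values pull the run-time model toward the privileged model's soft predictions. In practice this per-sample loss would be averaged over the training set and minimized with the learner of choice.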
Distillation differs from previously introduced approaches of model influence and knowledge transfer. First, while knowledge transfer attempts to estimate the privileged features with a representation of a mapping function, distillation is a trade-off between the privileged sample probabilities and standard sample class labels. Second, in contrast to model influence, distillation is independent of the machine learning algorithm (model-free) and its objective function can be minimized using a model of choice.
|System||Datasets and papers||Standard features||Incorporated privileged features||Detection time constraints on privileged features|
|Face Authentication||[12, 13, 14]||Raw face images||Bounding boxes and cropped versions of facial images||Need of human expert and additional software for processing|
|Malware Traffic Detection||[15, 16, 17, 18]||Data bytes divided by the total number of packets; total number of RTT samples found; the count of all packets with at least a byte payload; the minimum payload size observed||Source and destination port numbers; byte frequency distribution in packet payload; total connection time; total number of packets with URG and PUSH flag set||Vulnerable to port spoofing; payload encryption in subsequent malware versions; easy to tamper by an attacker|
|Fast-flux Bot Detection||[19, 20, 21, 22, 23]||Number of unique A and NS records in DNS packets; network, processing and document fetch delay of server||Edit distance, KL divergence and Jaccard index (domain names); time zone entropy of A and NS records in DNS packets; Euclidean distance between server IP and NS address; number of distinct autonomous systems and networks||Processing overhead of whitelist domains; IP coordinate database processing overhead; time-consuming WHOIS processing|
|Malware Classification||[24, 25]||Frequency count of hexadecimal duos in binary files||Frequency count of distinct tokens in metadata log||Software dependency of obtaining assembly source code; computational overhead and error-prone feature acquisition|
IV Building Systems with Privileged Information
In this section, we explore algorithms for feature engineering (selecting privileged features for a detection task) and demonstrate their use in four diverse experimental systems.
IV-A Feature Engineering Privileged Information
The first challenge facing our model of detection is deciding which features should be used as privileged information. Asked another way, given some potentially large universe of offline features, which are the most likely to be useful for detection? To address this, we develop an iterative algorithm that selects features that maximize model accuracy. Selection is made on the calculated accuracy gain of each feature–a measure of the additive value of the features with respect to an existing feature set for detection accuracy.
At the core of the algorithm, we measure the potential impact of privileged features that help detect hard-to-classify examples. Generally speaking, easy examples fall in a distribution that can be explained by some set of model parameters, whereas hard examples do not precisely fit the model: they are either misclassified or lie near the decision boundary (See Figure 4). As a consequence, accurate classification of hard examples is one of the main challenges of practical systems, as they are the main source of detection errors due to incorrect or insufficient information about the normal or anomalous state of a system. We aim at improving the detection of these examples in the presence of privileged features.
The first step of feature engineering, as is true of any detection task, is identifying all of the available features that may potentially be used for detection. Specifically, we collect the set of domain-specific features using domain knowledge and by surveying recent efforts in that domain. It is from that set that we will identify the best privileged features to be used for training. Note that this set may include irrelevant features that carry little or no useful information for the target detection task. We identify the privileged set using Algorithm 3. The algorithm starts with the standard features of a detection system and sequentially adds the one privileged feature from the set that maximizes correct classification of hard examples, i.e., the feature whose addition to the existing set has the greatest positive impact on accuracy (measured as accuracy gain). The accuracy gain on hard examples is found using an SVM classifier (the model in Algorithm 3). This process is repeated until the potential feature set is empty, a maximum number of features is reached, or the accuracy gain falls below a threshold for usefulness.
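The selection loop described above amounts to greedy forward selection driven by accuracy gain. The sketch below illustrates the control flow under stated assumptions: the scoring callback `hard_accuracy` stands in for training a model (an SVM in the text) and measuring accuracy on hard examples, and the toy scorer, feature names, and numbers are hypothetical. It is not a reproduction of Algorithm 3.

```python
def select_privileged(standard, candidates, hard_accuracy,
                      max_features=None, min_gain=0.0):
    """Greedy forward selection of privileged features by accuracy gain.

    hard_accuracy(features) must return the accuracy on hard examples of a
    model trained with the given feature names.
    """
    selected = []
    remaining = list(candidates)
    current = hard_accuracy(list(standard))
    while remaining and (max_features is None or len(selected) < max_features):
        # Accuracy gain of each remaining feature given what is already chosen.
        gains = {
            f: hard_accuracy(list(standard) + selected + [f]) - current
            for f in remaining
        }
        best = max(gains, key=gains.get)
        if gains[best] <= min_gain:
            break  # no remaining feature is useful enough
        selected.append(best)
        remaining.remove(best)
        current += gains[best]
    return selected

# Toy stand-in for training and scoring a model (purely illustrative numbers).
TOY_CONTRIBUTION = {"whois_age": 0.06, "geo_distance": 0.03, "noise": 0.0}

def toy_hard_accuracy(features):
    return 0.70 + sum(TOY_CONTRIBUTION.get(f, 0.0) for f in features)

chosen = select_privileged(
    standard=["num_a_records", "fetch_delay"],
    candidates=["noise", "geo_distance", "whois_age"],
    hard_accuracy=toy_hard_accuracy,
    min_gain=0.01,
)
```

The three stopping conditions in the loop mirror those in the text: an empty candidate set, a cap on the number of features, and a gain threshold. Because each step retrains once per remaining candidate, the cost grows quadratically with the number of candidates in the worst case.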
Note that the quality of the selection process is a consequence of the training data used to calculate accuracy gain. If the training data is not representative of the runtime input distribution, the algorithm could inadvertently over or under-estimate the accuracy gain of a feature and thereby weaken the detection system. However, note that this limitation is not unique to feature selection in this context, but applies to all feature engineering in extant detection systems.
IV-B Experimental Systems
In this section, we introduce four security-relevant systems for face authentication, malware traffic detection, fast-flux bot detection, and malware classification. We selected these experimental systems based on their appropriateness and the diversity of their detection tasks. In this initial study, we do not include commercial systems because of their lack of publicly available datasets and features. This diverse set of detection systems serves as a representative benchmark suite for our approaches. The following steps are involved in constructing each system (discussed below): (1) extract features from the dataset, (2) use the algorithm in the preceding section to select privileged features, and (3) calibrate the detection system with standard and privileged feature inputs. Through this process, we construct the privileged-augmented systems that are used to validate our approaches in the following section. Table I summarizes the experimental systems and the standard and privileged features selected. Additional details about these systems and their features are presented in Appendix B.
IV-B1 Fast-flux Bot Detection
The Fast-flux bot detector is used to identify hosts that use fast-changing DNS entries to hide the existence of server hosts used for malicious activities. We build a detection system by using the features used in recently proposed detectors of [19, 22, 23, 20]. This system relies on features obtained from domain names, DNS packets, packet timing intervals, WHOIS domain lookup and IP coordinate database. The resulting dataset includes many rather than few features to increase separation of Content Delivery Networks (CDNs) from fast-flux servers, as technical similarities between them are the main source of detection errors.
Privileged information. In this system, even though the complete features are relevant for fast-flux detection, obtaining some features at run-time entails computational delays. This delay prevents using the system for real-time detection in mission-critical systems. For example, processing WHOIS records and maintaining an up-to-date IP coordinate database and whitelist of domain names take several minutes to hours. Thus, we define the features obtained from these sources as privileged to assure real-time detection.
IV-B2 Face Authentication
To explore the efficiency of the approaches in image domains, we modeled a user authentication system based on recognition of facial images. Our goal is to recognize an image containing a face with an identifier corresponding to the individual depicted in the image. We use images from a public dataset that includes face images labeled with each person’s name.
Privileged information. It was recently found that face recognition systems used for access control by particular PC vendors can be easily bypassed by an attacker. Here, the lack of useful features or the limited number of images used to train the systems is the main reason the systems can be duped into falsely authenticating users. We use two types of privileged features for each image in addition to the original images in model training: cropped and funneled versions of the images (See Figure 5) [12, 13]. These images provide additional information about users’ faces through image alignment and localization. While it is technically possible to obtain these features with the aid of software or a human expert at run-time, they are much more likely to be made available afterward (and thus we define them as privileged). (Footnote 2: We here interpret the accuracy gain as a defense that hardens the system against misclassification of users.)
IV-B3 Malware Traffic Detection
Next, we modeled a malware traffic anomaly detection system based on network flow statistics used in recent detection systems [29, 15, 16]. The system aggregates flow features for detecting botnet command and control (C&C) activity among benign applications. For instance, the ratio between maximum and minimum packet size from server to client and from client to server is found to be a distinctive observation between benign and malicious samples. We mix botnet traffic of Zeus variants, which are used for spam distribution, DDoS attacks, and click fraud [30, 31, 32, 18], with the traffic of benign applications (web browsing, chat, email, file transfer, etc.) from Lawrence Berkeley National Laboratory (LBNL) and the University of Twente.
Privileged information. In this system, the authors eliminate the features that can be readily altered by an attacker, as a model trained with tampered features allows an attacker to manipulate the detection results easily [17, 35]. For instance, consider destination port numbers used as a feature. An attacker may change the port numbers in subsequent malware versions to pass through a firewall that blocks certain port number ranges. Also, the authors do not use payload content to obtain features because the attacker can use encrypted malware traffic to defeat deep packet inspection. To combat these threats, we deem such features privileged. (Footnote 3: Although these features can be easily obtained at run-time, the definition of privileged information here captures “tamper-resistant” systems, a characteristic unique to the security domain and its adversarial settings.)
IV-B4 Malware Classification
The Microsoft malware dataset is an up-to-date, publicly available dataset used for classification of malware into families. The dataset includes nine malware files that include a hexadecimal representation of the malware’s binary content, and a class representing one of nine family names. We build a real-time malware classification system by using the binary content file. Following a recent malware classification system, we construct features by counting the frequencies of each hexadecimal duo (i.e., byte bigram). These features were found to be distinctive across families because they exploit the code dissimilarities among families. Furthermore, obtaining byte bigrams has a low processing overhead for real-time detection.
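Counting byte bigrams, as described above, is simple to illustrate. The sketch below assumes the binary content is already available as a Python `bytes` object and flattens the counts into a fixed-length vector of 256 × 256 entries; the function names and the sample bytes are hypothetical, and the actual feature layout used by the system is not specified here.

```python
from collections import Counter

def byte_bigram_counts(data: bytes) -> Counter:
    """Count occurrences of each pair of consecutive bytes."""
    return Counter(zip(data, data[1:]))

def to_feature_vector(counts: Counter) -> list:
    """Flatten counts into a fixed-length vector of 256 * 256 entries."""
    vector = [0] * (256 * 256)
    for (first, second), n in counts.items():
        vector[first * 256 + second] = n
    return vector

# Toy input: seven bytes written as hexadecimal (not taken from the dataset).
sample = bytes.fromhex("558BEC558BEC90")
counts = byte_bigram_counts(sample)
features = to_feature_vector(counts)
```

A single pass over the file suffices, which is consistent with the low processing overhead noted in the text.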
Privileged information. This dataset also includes a metadata manifest log file. The log file contains various metadata obtained from the binary, such as memory allocation, function calls, and strings. The logs, along with the malware files, can be used for classifying malware into their respective families. Thus, similar to the byte files, we obtain the frequency count of distinct tokens, such as mov() and cmp(), from the text section of the asm files (See Figure 6). These tokens allow us to capture execution differences between families. However, in practice, obtaining features from log files introduces significant overhead in the disassembly process. Further, different types or versions of a disassembler may output byte sequences differently. Thus, this process may result in inaccurate and slow feature processing in real-time automated systems. To address these limitations, we treat features from disassembler output as privileged, which keeps run-time classification accurate and fast.
V Evaluation
In this section, we validate and compare the privileged information-augmented detection approaches using the experimental systems. Here, we focus on the following questions:
How much does privileged-augmented detection improve performance over systems with no privileged information? We evaluate the accuracy, precision, and recall of approaches, and demonstrate the detection gain of including privileged features.
How well do the approaches perform for a given domain and detection task? We answer this question by comparing the results of the approaches and present guidelines and cautions for appropriate approach calibration to maximize the detection gain.
Do approaches introduce training and detection overhead? We report model learning and run-time overhead of approaches for realistic deployment environments.
|System||Knowledge Transfer||Model Influence||Distillation|
|Fast-flux Bot Detection||Section V-A||Section V-D||Section V-D|
|Malware Traffic Detection||Section V-D||Section V-B||Section V-D|
|Face Authentication||—||—||Section V-C|
|Malware Classification||Section V-D||Section V-D||Section V-D|
Table II identifies the validation experiments described throughout, and Table III summarizes the results. As detailed throughout, we find that the use of privileged information can improve, often substantially, the performance of detection in the experimental systems.
|System||Approach||Relative Gain over Traditional Detection||
|Fast-flux Bot Detection||
|Malware Classification||Model Influence||7.3%||2.2%||3.7%||✓||✗|
|Malware Traffic Analysis||Distillation||8.6%||2.2%||5%||✓||✗|
Overview of Experimental Setup. We compare the performance of privileged-augmented systems against two baseline (non-privileged) models: the standard set model and the complete set model. The standard set model is a conventional detection system that uses all of the standard features but does not include the privileged features at training or run time. The complete set model is a conventional detection system that includes all of the privileged and standard features at both training and run time. Note that the ideal privileged information approach would have the same performance as the complete set model.
To learn the standard and complete set models, we use Random Forest (RF) and Support Vector Machine (SVM) classifiers, the latter with a radial basis function kernel. These classifiers give better performance in the previously introduced systems and are also preferred by the system authors. The model parameters are optimized with exhaustive or randomized parameter search, depending on the dataset size. All our experiments are implemented in Python with the scikit-learn machine learning library or in MATLAB with the Optimization Toolbox, and are run on an Intel i5 computer with 8 GB RAM. We give the implementation details of the privileged-augmented systems while presenting the calibration of approaches in Section V-E.
We show the detection performance of the complete and standard set models and compare their results with our approaches based on three metrics: accuracy, recall, and precision. We also present the false positives and negatives when relevant. Accuracy is the sum of the true positives and true negatives over the total number of samples. Recall is the number of true positives over the sum of false negatives and true positives, and precision is the number of true positives over the sum of false positives and true positives. Higher values of accuracy, precision, and recall indicate a higher quality of the detection output.
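As a minimal sketch of these definitions, the three metrics can be computed from confusion-matrix counts as shown below. The function name and the example counts are illustrative assumptions; they are not part of our implementation and are not results from our experiments.

```python
def detection_metrics(tp, tn, fp, fn):
    """Return (accuracy, recall, precision) from confusion-matrix counts."""
    total = tp + tn + fp + fn
    # Accuracy: correctly classified samples over all samples.
    accuracy = (tp + tn) / total
    # Recall: true positives over all actually malicious samples.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Precision: true positives over all samples flagged as malicious.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return accuracy, recall, precision


# Hypothetical counts, for illustration only.
acc, rec, prec = detection_metrics(tp=80, tn=90, fp=10, fn=20)
print(f"accuracy={acc:.3f} recall={rec:.3f} precision={prec:.3f}")
```

Guarding the two ratios against a zero denominator is a design choice of this sketch; it only matters for degenerate test sets with no positive samples or no positive predictions.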
V-A Knowledge Transfer
Our first set of experiments compares the performance of privileged information-augmented detection using knowledge transfer (KT) against the standard and complete set models. In this experiment, we classify domain names as benign or malicious in the fast-flux bot detection system (FF).
To realize KT, we implement two mapping functions to estimate the privileged features from a subset of standard features: regression-based and similarity-based. We find that both mapping functions typically learn the patterns of the training data and mostly suffice to derive nearly precise estimates of the privileged features. First, a polynomial regression function is built to find a coefficient vector $\beta$ such that each privileged feature is expressed as a polynomial function of the standard features with coefficients $\beta$, plus some bias term $b$ and random residual error $\epsilon$. The resulting function is then used to estimate each privileged feature at detection time given the standard features as input. We use polynomial regression that fits a nonlinear relationship to each privileged feature and pick the fit that minimizes the sum of squared errors. To evaluate the effectiveness of polynomial regression, we implement a second mapping function named weighted-similarity. This function estimates the privileged features from the most similar samples in the training set. We first find the subset of training samples most similar to an unknown sample, selected using the Euclidean distance between the unknown sample and the training instances over the standard features. Then, the privileged features are replaced by a weighted combination of the neighbors' privileged features, with weights inversely proportional to the neighbors' distances.
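The weighted-similarity mapping can be sketched as follows. This is a minimal sketch under stated assumptions rather than our implementation: the function name, the neighborhood size k, the smoothing constant eps, and the toy data are all hypothetical, and inverse-distance weighting is one plausible reading of the weighting scheme.

```python
import math


def weighted_similarity_estimate(x_std, train_std, train_priv, k=3, eps=1e-9):
    """Estimate the privileged features of x_std from its k nearest neighbors.

    train_std[i] and train_priv[i] hold the standard and privileged feature
    vectors of training sample i. Neighbors are found with the Euclidean
    distance over standard features and weighted by inverse distance.
    """
    dists = sorted((math.dist(x_std, s), i) for i, s in enumerate(train_std))
    neighbors = dists[:k]
    # eps avoids division by zero when x_std coincides with a training sample.
    weights = [1.0 / (d + eps) for d, _ in neighbors]
    total = sum(weights)
    n_priv = len(train_priv[0])
    return [
        sum(w * train_priv[i][j] for w, (_, i) in zip(weights, neighbors)) / total
        for j in range(n_priv)
    ]


# Toy data (hypothetical): two standard features, one privileged feature.
train_std = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
train_priv = [[10.0], [20.0], [30.0], [100.0]]
estimate = weighted_similarity_estimate([0.5, 0.0], train_std, train_priv, k=2)
print(estimate)
```

Because the estimate is computed from the stored training samples at query time, this mapping does no work at training time and pays its full cost at detection time.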
|Fast-flux Bot Detection (FF)|
Table IV presents the accuracy of the Random Forest (RF) and Support Vector Machine (SVM) classifiers on the standard and complete models, and on KT in the form of regression and weighted-similarity. The numbers are obtained from ten independent runs of stratified cross-validation, and we measured the difference between training and validation performance during parameter optimization (e.g., the parameter in similarity). The complete model accuracy of both classifiers is close to the ideal case; for reference, the baseline obtained by always guessing the most probable class yields 68% accuracy on the FF dataset.
We found that the mapping functions are effective in finding a nearly precise relation between standard and privileged features. This decreases the expected misclassification rate, on average in both false positives and false negatives, over benchmark detection with no privileged features. Both KT mapping options come close to the complete model accuracy on the FF dataset (1% less accurate) and significantly exceed the standard set accuracy (2% more accurate). The results confirm that estimating privileged features with regression and similarity is more effective than relying solely on the standard features available at run time.
V-B Model Influence
Next, we evaluate the performance of model influence-based privileged information in detecting the Zeus botnet in real-world web (HTTP(S)) traffic. Here, the system attempts to detect the malicious activity of a Zeus botnet that connects to C&C centers and exfiltrates private data. Note that the Zeus botnet uses HTTP mimicry to avoid detection. As a consequence, the sole use of standard features makes detection of Zeus difficult, resulting in high detection error (where Zeus traffic is mostly classified as legitimate web traffic). To this end, we include privileged features of packet flags, port numbers, and packet timing information from packet headers (see Appendix B for the complete list of features). We observe that while these features can be spoofed by adversaries under normal conditions, using them as privileged information may counteract spoofing (because inference does not consider their run-time value).
We evaluate the accuracy gain of model influence over the standard model on this criterion. We use a polynomial kernel in the objective function to perform non-linear classification by implicitly mapping the features into a higher-dimensional feature space. Table V presents the impact of model influence on accuracy, precision, and recall and compares it with the standard and complete models. We found that using the privileged features inherent to malicious and benign samples in model training systematically separates the classes better. This positive effect substantially improves both the false negative and false positive rates. The accuracy of model influence is close to the optimal accuracy, and model influence reduces the detection error by 2% on average over RF trained on the standard set. The effect is more pronounced for SVM, where the accuracy gain is 4.8%.
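To illustrate what "implicitly mapping the features into a higher-dimensional feature space" means, the sketch below checks that a polynomial kernel evaluated in the input space equals an inner product in an explicitly expanded space. The degree (2), the zero offset coef0, the two-dimensional inputs, and all function names are assumptions chosen for illustration; they are not the kernel settings used in our experiments.

```python
import math


def poly_kernel(x, z, degree=2, coef0=0.0):
    """Polynomial kernel k(x, z) = (x . z + coef0) ** degree."""
    return (sum(a * b for a, b in zip(x, z)) + coef0) ** degree


def explicit_map_deg2(x):
    """Explicit feature map for the homogeneous degree-2 kernel on 2-D inputs."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


x, z = [1.0, 2.0], [3.0, 0.5]
# Kernel computed directly in the 2-D input space.
k_implicit = poly_kernel(x, z)
# The same value computed as an inner product in the 3-D expanded space.
k_explicit = sum(a * b for a, b in zip(explicit_map_deg2(x), explicit_map_deg2(z)))
print(k_implicit, k_explicit)
```

The kernel form never constructs the expanded vectors, which is what keeps non-linear classification tractable as the degree or input dimension grows.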
|Malware Traffic Detection|
V-C Distillation
We evaluate distillation on the experimental face authentication system. The standard features of the face authentication system consist of the original images of users (i.e., background included) with 3 RGB channels. We obtain the privileged features of each image from funneled and downscaled bounding boxes of the images. In this way, we better characterize each user’s face by specifying the face localization and eliminating the background noise. It is important to note that the background of images may unrealistically increase the accuracy, because background regions may contribute to the distinction between images. However, we verify that the images in our training set do not suffer from this effect.
As distillation is independent of the machine learning algorithm, we construct both the standard and privileged models using a deep neural network (DNN) with two hidden layers of 10 rectified linear units and a softmax output layer over the classes. This type of architecture is commonly applied in computer vision and provides superior results for image-specific applications. We train the network with ten runs of 400 random training samples (see Appendix B for dataset details).
Figure 7 plots the average distillation accuracy with various temperature and imitation parameters. We show the accuracy of the standard (dotted-dashed) and privileged set (dashed) models as baselines. The model trained on the privileged set achieves an average correct classification rate of 89.2%, which is better than the 66.5% of the standard set model. We observe that distilling the privileged set features into the detection algorithm gives better accuracy than the standard set when the temperature and imitation parameters are tuned. At the accuracy-maximizing setting, the gain is on average 6.56%, and the best improvement is an 11.2% increase over the standard set model accuracy. However, increases in the imitation parameter negatively affect detection accuracy. This is because as the imitation parameter increases, the objective function puts more weight on learning from the standard features, which upsets the trade-off between standard and privileged features.
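The roles of the temperature and imitation parameters can be sketched as follows. This is a minimal sketch in the style of generalized distillation, not our training code: all function and variable names are hypothetical, the logits are toy values, and we assume here that the imitation parameter (lam) weights the hard-label term computed from standard features, consistent with the discussion above. Some formulations use the opposite convention.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives softer outputs."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy between a target distribution and a prediction."""
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))


def distillation_loss(student_logits, teacher_logits, hard_label, temperature, lam):
    """Per-sample loss: lam * hard-label term + (1 - lam) * soft-label term.

    teacher_logits come from the model trained on privileged features;
    student_logits come from the model that sees only standard features.
    """
    soft_targets = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits)
    hard_term = cross_entropy(hard_label, student_probs)
    soft_term = cross_entropy(soft_targets, student_probs)
    return lam * hard_term + (1.0 - lam) * soft_term


# Toy logits for a three-class problem (hypothetical values).
student, teacher, label = [2.0, 1.0, 0.5], [4.0, 1.0, 0.0], [1.0, 0.0, 0.0]
loss = distillation_loss(student, teacher, label, temperature=2.0, lam=0.5)
print(loss)
```

With lam set to 1 the student ignores the privileged model entirely, and with lam set to 0 it learns only from the softened privileged outputs; intermediate values trade off the two sources.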
|Fast-flux Bot Detection||Malware Traffic Detection||Malware Classification|
V-D Comparison of Approaches
Next, we compare the performance of the approaches on three data sets. (We do not compare performance on face recognition because processing the number of input features, e.g., pixels, was intractable for several solutions.) The experiments are run with the same experimental setup as described previously. Distillation is implemented using deep neural networks (DNN), and regression and weighted-similarity mapping functions are used for knowledge transfer. Table VI presents the results of knowledge transfer (we report similarity results, as similarity yields better results than regression), model influence, and distillation, and compares them against the complete and standard models.
The accuracy of model influence, distillation, and knowledge transfer on the fast-flux detector and malware traffic detection is stable. All approaches yield accuracy similar to the ideal accuracy of the complete set model, and the increased accuracy is often the result of correctly classifying true positives (intrusions). This results in up to 5% relative gain in recall on average with model influence and distillation over conventional models. In contrast, knowledge transfer often increases the fraction of samples flagged by the system that are actually malicious (e.g., 99.3% precision in the fast-flux detector), meaning that the number of false alarms is reduced relative to conventional detection. The results confirm that the approaches are able to balance conventional detection and its accuracy by using privileged features inherent to both benign and malicious samples, reducing either the false positives or the false negatives. We note that these results are obtained after carefully tuning the model parameters. We discuss parameter tuning in the following section and the impact of the results on systems in Section VI.
Distillation is easy to apply, as it is a model-free approach, and often yields better results than the other approaches. Its quality as a detection mechanism becomes more apparent when its objective function is implemented with deep neural networks (DNN) with a nonlinear objective. This makes distillation give better results on average than the other approaches. On the other hand, the design and calibration of model influence detection requires additional effort and care in tuning parameters; in the experiments, this additional effort yields strong detection (e.g., 94.6% in malware classification). It is also important to note that when the dataset includes a large number of privileged features or samples, training of model influence takes significantly more time than the other approaches (see the next section).
Finally, it is important to note that while the knowledge transfer accuracy gain for fast-flux detection and malware traffic analysis is similar to that of the other approaches, its malware classification results are inconsistent (i.e., 83.5% average accuracy with an 11.2% standard deviation). Neither the regression nor the similarity mapping function was able to predict the privileged features precisely enough; in turn, they slightly degrade the accuracy (by 7-8%) relative to both the RF and SVM standard set models. This observation confirms the need to find and evaluate an appropriate mapping function for the transference of knowledge, as discussed in Section III-A. In this particular dataset, the mapping functions fail to find a good relation between standard and privileged features. Regression suffers from overfitting to uncommon data points, and similarity fails to fit data points that lie an abnormal distance from the range of the standard features (confirmed by an increase in the sum of squared errors between the estimated and true values of the privileged features). We remark that the derivation of more advanced mapping functions may solve this problem. Further, model influence and distillation avoid this problem by eliminating the use of mapping functions and including the dependency between standard and privileged features in their objectives.
Therefore, based on the above observations, the approaches need careful calibration based on domain- and task-specific properties to maximize the detection gain. We elaborate on these steps in the next subsection.
V-E Calibration of Approaches
In this section, we discuss the required dataset properties, algorithm parameters, and training and run-time overhead of using privileged information for detection. We also present guidelines and cautionary warnings for the use of privileged information in realistic deployment environments. A summary of the approach selection criteria is presented in Table VII.
Model Dependency. Model selection is the task of picking an appropriate model (e.g., classifier) from a set of potential models to construct a detection function. Knowledge transfer can be applied to a model of choice, as the privileged features can be inferred with any sufficiently accurate mapping function. Distillation requires a model with a softmax output layer for obtaining probability vectors. Model influence, in contrast, is model dependent: we adapt it to the SVM’s objective function as a unified algorithm.
Detection Overhead. The mapping functions used in knowledge transfer may introduce detection delays while estimating the privileged features. For instance, weighted-similarity, introduced in Section III-A, defers estimation until detection without learning a function at training time (i.e., it is a lazy learner). This may introduce a detection bottleneck if the dataset includes a large number of samples. To solve this problem, we apply stratified sampling to reduce the size of the dataset. Furthermore, mapping functions constructed at training time, such as regression-based ones, minimize the delay of estimating privileged features. For instance, in our experiments, weighted-similarity estimates ten privileged features from 5K training samples with less than a second of delay on a 2.6 GHz 2-core Intel i5 processor with 8 GB RAM. Regression reduces this value to milliseconds. Therefore, if delay at run time is the primary concern, we suggest using model influence or distillation for learning the detection model, as they introduce no overhead at run time.
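Stratified sampling of the training set, used above to bound the cost of the lazy weighted-similarity mapping, can be sketched as follows. The function name, the sampling fraction, the fixed seed, and the toy label distribution are hypothetical choices for illustration, not the settings used in our experiments.

```python
import random
from collections import defaultdict


def stratified_sample(labels, fraction, seed=0):
    """Return sorted indices of a subsample that preserves class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    chosen = []
    for indices in by_class.values():
        # Keep the same fraction of every class, and at least one sample each.
        n_keep = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, n_keep))
    return sorted(chosen)


# Toy label vector (hypothetical): 80 benign and 20 malicious samples.
labels = ["benign"] * 80 + ["malicious"] * 20
subset = stratified_sample(labels, fraction=0.25)
print(len(subset), sum(labels[i] == "malicious" for i in subset))
```

Sampling within each class keeps the benign-to-malicious ratio of the reduced set equal to that of the full set, so the neighbors available to the mapping function remain representative.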
|Model||Detection time||Model||Training time|
Legend: ✓ yes ✗ no model dependent relatively higher
Model Optimization. To obtain the best performance, the parameters and hyperparameters of the approaches need to be carefully tuned. For instance, fine-tuning the temperature and imitation parameters in distillation and the kernel function hyperparameters in model influence may increase the detection performance. Similar to conventional detection, the number of parameters to be optimized for both knowledge transfer and generalized distillation can be determined a priori based on the selected model. However, model influence has twice as many parameters as SVM, because two kernel functions are used simultaneously to learn the detection boundary in the standard and privileged feature spaces. We apply grid search for small training sets and evolutionary search for large-scale datasets for parameter optimization.
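The exhaustive grid search over distillation parameters can be sketched as follows. The helper name, the candidate values, and in particular the scoring function are hypothetical: the toy score below merely stands in for a cross-validated accuracy measurement and does not reflect our results.

```python
from itertools import product


def grid_search(score_fn, grid):
    """Evaluate score_fn on every parameter combination and return the best."""
    names = sorted(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(temperature, imitation):
    """Stand-in for validation accuracy (hypothetical, for illustration only)."""
    return 1.0 - (temperature - 2.0) ** 2 / 100.0 - (imitation - 0.25) ** 2


grid = {
    "temperature": [1.0, 2.0, 5.0, 10.0],
    "imitation": [0.0, 0.25, 0.5, 0.75, 1.0],
}
best_params, best_score = grid_search(toy_validation_score, grid)
print(best_params, best_score)
```

The number of evaluations grows as the product of the candidate list lengths, which is why exhaustive search is practical only for small training sets and few parameters.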
Training Overhead. Training set size affects the time required for model learning. The additional time needed to run knowledge transfer and generalized distillation is negligible, as they require models similar to those of existing systems. However, the objective function of model influence may become infeasible or take a long time to solve when the dimension of the feature space is very small or the dataset size is quite large. For instance, in our experiments, distillation and knowledge transfer train on 1K samples with 50 standard and privileged features in one minute, including optimal parameter search, on the same machine used for measuring detection overhead. Model influence takes on average 30 minutes on the same machine. Packages designed specifically for solving quadratic programming (QP) problems (e.g., the MATLAB quadprog() function) can be used instead of general solvers, such as the convex optimization package CVX, to reduce the training time. Further, specialized spline kernels can be used to accelerate the computation. We give a specific implementation of model influence in such packages in Appendix A.
Our empirical results show that the approaches reduce both false positives and false negatives relative to systems built solely on their standard features. In a security setting, false positives make it extremely difficult for the analyst examining the reported incidents to correctly identify the mistakenly triggered benign events. It is not surprising, therefore, that recent research focuses on post-processing of the alerts to produce a higher-quality alert set useful to the human analyst. False negatives, on the other hand, have the potential to cause catastrophic damage to both users and organizations: even a single compromised system can cause serious security breaches. For instance, in malware traffic and fast-flux bot detection, a false negative may allow a bot to exfiltrate private data to a malicious server. In the case of malware classification and face authentication, it undermines the integrity of a system by misclassifying malware into another family or recognizing the wrong user. Thus, improvement in the false positives and negatives of these systems matters in operational settings, making detection more reliable.
VI Uses of Privileged Information
We now consider the practical use of privileged information in other learning settings and application domains, and its use as a defense against adversarial manipulation.
VI-A Applicability to Other Settings
Our preceding discussions have focused on domains that build models using supervised learning. However, privileged information is not restricted to such domains, but is readily adaptable to other problems and machine learning settings. For example, privileged information can be adapted to unsupervised, regression, and metric learning settings. Indeed, Jonschkowski et al. provide a discussion of how privileged information can be adapted to these other learning settings.
Further, there are many domains in which privileged information would prove useful beyond those explored in the previous section. To illustrate, we highlight three such systems and the privileged data that they may support below.
Mobile Device Security - The growth of mobile malware requires the presence of robust malware detectors on mobile devices. One might consider collecting data for numerous types of attacks at training time, when energy is less of a problem; however, exhaustive data collection at run time can have high energy costs and induce noticeable interface lag. As a consequence, users may disable the detection mechanism. We postulate that the high-cost features can be defined as privileged information to avoid this problem.
Enterprise Security - Enterprise systems use audit data generated from a diverse set of devices and information sources for analysis. For instance, products such as HP ArcSight and IBM QRadar collect data from hosts, applications, and network devices in incredible volumes (e.g., 100K events per second yielding 42 TB of compressed data). These massive data sets are mined for patterns identifying sophisticated threats and vulnerabilities. However, run-time environments may be overwhelmed by data collection and processing, which makes such mining impractical for many settings. In such cases, features involving complex and expensive data collection can be defined as privileged to balance the real-time costs and detection accuracy.
Privacy Enhanced Detection - Many detection processes require the collection of privacy-relevant features, e.g., the pattern and substance of user network traffic or the use of software. Hence, it is important to reduce the collection and exposure of such data, since legal and ethical issues may prevent continuously monitoring them in their original form. In these cases, a set of features can be defined as privileged to eliminate the requirement of obtaining and potentially retaining privacy-sensitive features at run time from users and environments.
VI-B Privileged Information as a Defense Mechanism
We also posit that privileged information can be used as a defense mechanism in adversarial settings. More specifically, the key attacks targeting machine learning are organized into two categories based on adversarial capabilities: (1) causative (poisoning) attacks, in which an attacker controls the training data by injecting well-crafted attack samples to control the prediction results, and (2) exploratory (evasion) attacks, in which an attacker manipulates the malicious samples to evade detection. For the former, privileged features add an extra step for the attacker to pollute the training data, because the attacker needs to dupe the data collection into including polluted privileged samples in addition to the standard samples, which for many systems, including online learning, would potentially be much more difficult. For the latter, privileged features may make detection systems more robust to adversarial samples, because privileged features cannot be controlled by the adversary when producing malicious samples. Moreover, because the model is hidden from the adversary, they cannot know the influence of these features on the model. As a proof of concept, recent works have used distillation of standard features as a defense mechanism against adversarial perturbations in Deep Neural Networks [50, 51]. In future work, we plan to further evaluate privileged information as a mechanism to harden machine learning systems.
VII Related Work
Domain-specific feature engineering has been a key effort within the security communities. For example, researchers have previously used specific patterns to group malware samples into families [52, 53, 24], explored using DNS information to understand and predict botnet domains [54, 55, 22], and analyzed network and system level features to identify previously unknown malware traffic [56, 15, 16]. Other works have focused on user authentication from facial images [57, 58]. We view our efforts in this paper as complementary to many of these and related works. Features in these works can be easily integrated as privileged information into a detection algorithm to strike a balance between accuracy and the cost or availability constraints at run time.
The use of privileged information has recently attracted attention in a few other areas, such as computer vision, image processing, and even finance. Wang et al. and Sharmanska et al. derived privileged features from images in the form of annotator rationales, object bounding boxes, and textual descriptions. Niu et al. use privileged-augmented robust classifiers for action and event recognition. Ribeiro et al. use annual turnover and global balance values as privileged information for enhancing financial decision-making. However, their approaches are not designed to model security-relevant data but rather to determine if there is a possibility of application to domain-specific information.
We have presented a range of techniques to train detection systems with privileged information. All approaches use features available only at training time to enhance the learning of detection models. We consider three approaches: (a) knowledge transfer to construct mapping functions to estimate the privileged features, (b) model influence to smooth the detection model with the useful information obtained from the privileged features, and (c) distillation using probability vector outputs obtained from the privileged features in the detection objective function. Our evaluation of several detection systems shows that we can improve accuracy, recall, and precision using privileged features, even when detection performance is already high. We also presented guidelines for approach selection in realistic deployment environments.
This work is the first effort at developing detection under privileged information spanning feature engineering, the application of diverse algorithms, and tailoring solutions to environmental conditions. The capability afforded by this approach will allow us to integrate forensic and other auxiliary information that, to date, has not been actionable for detection. In the future, we will explore a wide range of environments and evaluate the ability of privileged information to promote resilience to adversarial manipulation in detection systems.
Research was sponsored by the Army Research Laboratory and was accomplished under Cooperative Agreement Number W911NF-13-2-0045 (ARL Cyber Security CRA). The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation here on.
-  K. Scarfone and P. Mell, “Guide to intrusion detection and prevention systems (IDPS),” NIST special publication, 2007.
-  “Office of personnel management data breach,” https://en.wikipedia.org/wiki/Office_of_Personnel_Management_data_breach, [Online; accessed 15-October-2016].
-  V. Vapnik and R. Izmailov, “Learning using privileged information: Similarity control and knowledge transfer,” Journal of Machine Learning Research, 2015.
-  V. Vapnik and A. Vashist, “A new learning paradigm: Learning using privileged information,” Neural Networks, 2009.
-  Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE transactions on pattern analysis and machine intelligence, 2013.
-  R. J. Walls, E. G. Learned-Miller, and B. N. Levine, “Forensic triage for mobile phones with dec0de,” in USENIX Security Symposium, 2011.
-  A. A. Cardenas, P. K. Manadhata, and S. P. Rajan, “Big data analytics for security,” IEEE System Security, 2013.
-  Z. B. Celik, R. Izmailov, and P. McDaniel, “Proof and Implementation of Algorithmic Realization of Learning Using Privileged Information (LUPI) Paradigm: SVM+,” NSCR, Department of CSE, Pennsylvania State University, Tech. Rep. NAS-TR-0187-2015, Dec. 2015.
-  G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv:1503.02531, 2015.
-  J. B. and R. C., “Do deep nets really need to be deep?” in Advances in neural information processing systems, 2014.
-  D. Lopez-Paz, L. Bottou, B. Schölkopf, and V. Vapnik, “Unifying distillation and privileged information,” arXiv preprint arXiv:1511.03643, 2015.
-  L. Wolf, T. Hassner, and Y. Taigman, “Effective unconstrained face recognition by combining multiple descriptors and learned background statistics,” Pattern Analysis and Machine Intelligence, 2011.
-  V. J. and E. Learned-Miller, “Fddb: A benchmark for face detection in unconstrained settings,” University of Massachusetts, Tech. Rep. UM-CS-2010-009, 2010.
-  G. B. Huang and E. Learned-Miller, “Labeled faces in the wild: Updates and new reporting procedures,” University of Massachusetts, Tech. Rep. UM-CS-2014-003, May 2014.
-  D. J. Miller, F. Kocak, and G. Kesidis, “Sequential anomaly detection in a batch with growing number of tests: Application to network intrusion detection,” in MLSP, 2012.
-  Z. B. Celik, J. Raghuram, G. Kesidis, and D. J. Miller, “Salting public traces with attack traffic to test flow classifiers,” in Usenix CSET, 2011.
-  G. Zou, G. Kesidis, and D. J. Miller, “A flow classifier with tamper-resistant features and an evaluation of its portability to new domains,” JSAC, 2011.
-  Z. B. Celik, R. J. Walls, P. McDaniel, and A. Swami, “Malware traffic detection using tamper resistant features,” in IEEE MILCOM, 2015.
-  Z. B. Celik and S. Oktug, “Detection of Fast-Flux Networks using various DNS feature sets,” in ISCC, 2013.
-  S. Huang, C. Mao, and H. Lee, “Fast-flux service network detection based on spatial snapshot mechanism for delay-free detection,” in Proc. ASIACCS, 2010.
-  C. Hsu, C. Huang, and K. Chen, “Fast-flux bot detection in real time,” in RAID, 2010.
-  S. Yadav, A. K. K. Reddy, A. L. Reddy, and S. Ranjan, “Detecting algorithmically generated malicious domain names,” in Proc. ACM Internet measurement, 2010.
-  E. Passerini, R. Paleari, L. Martignoni, and D. Bruschi, “Fluxor: detecting and monitoring fast-flux service networks,” in Proc. DIMVA, 2008.
-  M. A. et al., “Novel feature extraction, selection and fusion for effective malware family classification,” arXiv preprint arXiv:1511.04317, 2015.
-  “Microsoft malware classification challenge,” https://www.kaggle.com/c/malware-classification/, [Online; accessed 10-May-2015].
-  V. Sharmanska, N. Quadrianto, and C. H. Lampert, “Learning to rank using privileged information,” in International Conference on Computer Vision (ICCV), 2013.
-  N. M. Duc and B. Q. Minh, “Your face is not your password face authentication bypassing lenovo–asus–toshiba,” Black Hat Briefings, 2009.
-  G. Huang, M. Mattar, H. Lee, and E. G. Learned-Miller, “Learning to align from scratch,” in Advances in Neural Information Processing Systems, 2012.
-  G. Gu, R. Perdisci, J. Zhang, W. Lee et al., “Botminer: Clustering analysis of network traffic for protocol-and structure-independent botnet detection,” in USENIX Security Symposium, 2008.
-  C. Rossow et al., “Sok: P2pwned-modeling and evaluating the resilience of peer-to-peer botnets,” in IEEE Security and Privacy, 2013.
-  S. García, M. Grill, J. Stiborek, and A. Zunino, “An empirical comparison of botnet detection methods,” In Computers & Security, 2014.
-  A. Shiravi, H. Shiravi, M. Tavallaee, and A. A. Ghorbani, “Toward developing a systematic approach to generate benchmark datasets for intrusion detection,” Computers & Security, 2012.
-  R. Pang, M. Allman, M. Bennett, J. Lee, V. Paxson, and B. Tierney, “A first look at modern enterprise traffic,” in Internet Measurement, 2005.
-  R. R. R. Barbosa, R. Sadre, A. Pras, and R. Meent, “Simpleweb/university of twente traffic traces data repository,” 2010.
-  P. McDaniel, N. Papernot, and Z. B. Celik, “Machine learning in adversarial settings,” Security & Privacy Magazine, 2016.
-  “Ida pro: Disassembler and debugger,” http://www.hex-rays.com/idapro/.
-  L. Nataraj, V. Yegneswaran, P. Porras, and J. Zhang, “A comparative assessment of malware classification using binary texture analysis and dynamic analysis,” in Proc. security and artificial intelligence workshop. ACM, 2011.
-  R. Shittu, A. Healing, R. Ghanea-Hercock, R. Bloomfield, and M. Rajarajan, “Intrusion alert prioritisation and attack detection using post-correlation analysis,” Computers & Security, 2015.
-  J. F. and U. A., “Privileged information for data clustering,” Information Sciences, 2012.
-  H. Yang and I. Patras, “Privileged information-based conditional regression forest for facial feature detection,” in Proc. IEEE International Conference on Automatic Face and Gesture Recognition, 2013.
-  S. F., P. T., S. R., and P. S., “Incorporating privileged information through metric learning,” IEEE transactions on neural networks and learning systems, 2013.
-  R. Jonschkowski, S. Höfer, and O. Brock, “Patterns for learning with side information,” 2015.
-  J. Bickford et al., “Security versus energy tradeoffs in host-based mobile malware detection,” in Proc. Mobile systems, applications, and services, 2011.
-  “Arcsight data platform,” http://www8.hp.com/us/en/software-solutions/arcsight-logger-log-management/, [Online; accessed 9-Nov-2016].
-  “Ibm security qradar siem,” http://www-03.ibm.com/software/products/en/qradar-siem, [Online; accessed 9-Nov-2016].
-  R. Zuech, T. M. Khoshgoftaar, and R. Wald, “Intrusion detection and big heterogeneous data: a survey,” Journal of Big Data, 2015.
-  L. Huang et al., “Adversarial machine learning,” in ACM security and artificial intelligence workshop, 2011.
-  P. Laskov et al., “Practical evasion of a learning-based classifier: A case study,” in IEEE Security and Privacy, 2014.
-  B. Biggio, G. Fumera, and F. Roli, “Pattern recognition systems under attack: Design issues and research challenges,” International Journal of Pattern Recognition and Artificial Intelligence, 2014.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” IEEE S&P, 2016.
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” arXiv preprint arXiv:1608.04644, 2016.
-  M. Z. Rafique and J. Caballero, “Firma: Malware clustering and network signature generation with mixed network behaviors,” in RAID, 2013.
-  N. Nissim, R. Moskovitch, L. Rokach, and Y. Elovici, “Novel active learning methods for enhanced pc malware detection in windows os,” Expert Systems with Applications, 2014.
-  M. A. et al., “From throw-away traffic to bots: detecting the rise of dga-based malware,” in Proc. USENIX Security, 2012.
-  L. B., E. Kirda, C. Kruegel, and M. B., “Exposure: Finding malicious domains using passive DNS analysis.” in NDSS, 2011.
-  B. Rahbarinia, R. Perdisci, and M. Antonakakis, “Segugio: Efficient behavior-based tracking of malware-control domains in large isp networks,” in Proc. IEEE DSN, 2015.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” British Machine Vision, 2015.
-  Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition with very deep neural networks,” arXiv preprint arXiv:1502.00873, 2015.
-  Z. Wang and Q. Ji, “Classifier learning with hidden information,” in IEEE Computer Vision and Pattern Recognition, 2015.
-  L. Niu, W. Li, and D. Xu, “Exploiting privileged information from web data for action and event recognition,” International Journal of Computer Vision, 2016.
-  B. Ribeiro et al., “Enhanced default risk models with svm+,” Expert Systems with Applications, 2012.
-  B. A. Turlach and A. Weingessel, “quadprog: Functions to solve quadratic programming problems,” R package version, 2007.
-  “Matlab documentation of quadprog() function,” http://www.mathworks.com/help/optim/ug/quadprog.html, [Online; accessed 15-May-2015].
Appendix A Model Influence Optimization Problem
In this appendix, we present the formulation and implementation of the model influence approach introduced in Section III-B.
We can formally divide the feature space into two spaces at training time. Given standard vectors and privileged vectors with a target class , where and for all . The kernels and are selected along with positive parameters , and . Our goal is finding a detection model . The optimization problem is formulated as :
The detection rule for vector is defined as:
where to compute , we first derive the Lagrangian of (1):
with Karush-Kuhn-Tucker (KKT) conditions (for each ), we rewrite
We denote for
and rewrite (A-A) in the form
The first equation in (7) implies
Therefore, is computed as :
where is such that and .
We implement model influence by solving its quadratic programming problem with the MATLAB quadprog() function provided by the Optimization Toolbox. Equivalent functions in R  or similar software can be easily adapted.
The MATLAB quadprog() function solves quadratic programming problems of the following form:
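For reference, the general problem that the MathWorks documentation gives for quadprog() is stated below. The symbols follow that documentation rather than this paper's notation; how the model influence variables map onto $H$, $f$, and the constraint terms is determined by the definitions in the text.

$$
\min_{x}\ \tfrac{1}{2}\,x^{\top} H x + f^{\top} x
\quad \text{subject to} \quad
A x \le b, \qquad A_{eq}\, x = b_{eq}, \qquad lb \le x \le ub
$$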
where, for each pair ,
After all variables are defined, the Optimization Toolbox guide  can be used to select quadprog() options such as the optimization algorithm and the maximum number of iterations. The output of the function can then be used in the detection function to make predictions for a new sample as follows:
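As an illustration only, the sketch below evaluates a kernel-expansion detection function in Python. It assumes the detection function takes the usual support vector form, the sign of a kernel-weighted sum over training samples plus a bias, computed on standard features only. The RBF kernel, its `gamma` value, and the names `decision_value` and `detect` are hypothetical choices, and the multipliers in the usage example are toy values rather than the output of a quadratic programming solver.

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are illustrative."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(z, train_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Kernel expansion: sum_i y_i * alpha_i * K(x_i, z) + bias.

    Only standard (run-time) features appear here; privileged features
    are assumed to influence the multipliers during training only.
    """
    total = sum(
        y * a * kernel(x, z)
        for x, y, a in zip(train_vectors, labels, alphas)
    )
    return total + bias


def detect(z, train_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """Return +1 or -1 according to the sign of the decision value."""
    value = decision_value(z, train_vectors, labels, alphas, bias, kernel)
    return 1 if value >= 0 else -1


# Toy usage with hand-picked multipliers (not solver output).
toy_vectors = [[0.0, 0.0], [1.0, 1.0]]
toy_labels = [-1, 1]
toy_alphas = [1.0, 1.0]
toy_bias = 0.0
prediction = detect([0.9, 0.9], toy_vectors, toy_labels, toy_alphas, toy_bias)
```

In practice the multipliers and bias would come from the solver output described above, and the kernel would match the one selected for the standard feature space at training time.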
Appendix B Details of Detection Systems
We detail the standard and privileged features of the four detection systems introduced in Section IV-B. We construct the systems in three steps: (1) extract features from the datasets, (2) select privileged features, and (3) calibrate the systems with standard and privileged features. The interested reader can refer to the cited research papers for the motivation behind the feature selection and processing.
Fast-flux Bot Detection. The fast-flux detector uses a raw dataset of 4 GB of DNS requests to benign and active fast-flux servers collected in early 2013 . Table I presents feature categories and definitions obtained from recent works [20, 21, 22, 23]. We define the domain name, spatial, and network categories as privileged to provide a real-time detection system, as these categories introduce time-consuming and resource-intensive operations at run-time.
Malware Classification. We use the Microsoft malware classification dataset . The standard and privileged features are the frequency counts of the hexadecimal representation of binary files and of disassembler output tokens, respectively . The dataset includes 1746 malware observations and is used for binary classification of malware families. Each observation is roughly 50 MB, resulting in 200 GB of training and test data from which features are extracted. We treat the features obtained from disassembler output as privileged to eliminate the computational overhead and software-dependent feature processing of the disassembler at run-time.
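To make the standard features concrete, the following Python sketch counts how often each byte value occurs in a binary. It assumes that one token corresponds to one byte (two hexadecimal digits); the exact tokenization used in the experiments is not specified here, and the function name `byte_histogram` is hypothetical.

```python
from collections import Counter


def byte_histogram(data: bytes, normalize: bool = False) -> list:
    """Return a 256-entry list of byte-value counts (or relative frequencies).

    Assumption: each feature is the count of one byte value 0x00..0xFF.
    """
    counts = Counter(data)  # iterating over bytes yields integers 0..255
    hist = [counts.get(value, 0) for value in range(256)]
    if normalize and data:
        total = len(data)
        hist = [count / total for count in hist]
    return hist


# Usage on an in-memory sample; a real binary would be read with open(path, "rb").
sample = b"\x00\x00\xff"
features = byte_histogram(sample)
```

The privileged counterpart would apply the same counting to tokens emitted by a disassembler, which is the step the text moves out of the run-time path.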
Malware Traffic Detection. We implement a network flow-based anomaly malware detection system [15, 16, 17, 18]. The dataset consists of 1553 benign network flows obtained from the University of Twente Simple Web (https://traces.simpleweb.org/traces/TCP-IP/location6/) and LBNL (http://www.icir.org/enterprise-tracing) traces, and 173 flows of the Zeus botnet variants Zeus V2, Zeus Pony Loader, and Zeus Gameover that were active from 2011 to 2013 (http://mcfp.felk.cvut.cz/, http://contagiodump.blogspot.co.uk, http://labs.snort.org/papers/zeus.html). Table II presents the features obtained from the first ten packets. The features that can be readily altered by an attacker are included as privileged to limit the attacker’s capability to manipulate detection results.
Face Authentication. We use a subset of the Labeled Faces in the Wild dataset  that includes 1348 images with at least 50 images per person. We build the standard features from human facial images. We include the bounding boxes of cropped faces and funneled images as privileged features, as these images are only available at run-time with the help of commercial software or human processing [12, 13, 14].
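Table I: Feature categories and definitions of the fast-flux bot detection system.
| Category | Definition | Feature dependency | Feature type |
|---|---|---|---|
| | Number of unique A records | DNS packet analysis | standard set |
| | Number of NS records | DNS packet analysis | standard set |
| | Network delay ( and ) | | |
| | Processing delay ( and ) | | |
| | Document fetch delay ( and ) | | |
| Domain name | Kullback-Leibler divergence (unigrams and bigrams) | | privileged set |
| Domain name | Jaccard similarity (unigrams and bigrams) | | privileged set |
| Domain name | Whitelist of benign domain names | | privileged set |
| Spatial | Time zone entropy of A records | IP coordinate database lookup (external source) | privileged set |
| Spatial | Time zone entropy of NS records | IP coordinate database lookup (external source) | privileged set |
| Spatial | Minimal service distances ( and ) | IP coordinate database lookup (external source) | privileged set |
| Network | Number of distinct autonomous systems | WHOIS processing (external source) | privileged set |
| Network | Number of distinct networks | WHOIS processing (external source) | privileged set |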
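The domain-name features in Table I (Kullback-Leibler divergence and Jaccard similarity over unigrams and bigrams) could be computed along the following lines. This is a minimal sketch: it assumes character n-grams, an `eps` floor for n-grams absent from the reference distribution, and that a domain name is compared against a reference set of names. None of these choices is specified in the table, and all function names are hypothetical.

```python
import math
from collections import Counter


def ngrams(name, n):
    """Character n-grams of a domain name (n=1 unigrams, n=2 bigrams)."""
    return [name[i:i + n] for i in range(len(name) - n + 1)]


def ngram_distribution(names, n):
    """Relative frequency of each n-gram over a collection of names."""
    counts = Counter(g for name in names for g in ngrams(name, n))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.items()}


def kl_divergence(p, q, eps=1e-9):
    """D_KL(P || Q) = sum_g p(g) * log(p(g) / q(g)).

    n-grams missing from q are floored at eps (an assumed smoothing choice).
    """
    return sum(pv * math.log(pv / q.get(g, eps)) for g, pv in p.items())


def jaccard_similarity(a, b, n):
    """Jaccard similarity between the n-gram sets of two names."""
    set_a, set_b = set(ngrams(a, n)), set(ngrams(b, n))
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


# Usage: compare a suspicious-looking name against a small reference set.
reference = ngram_distribution(["google", "example"], 1)
candidate = ngram_distribution(["xkqzjv"], 1)
divergence = kl_divergence(reference, candidate)
similarity = jaccard_similarity("abc", "abd", 1)
```

A larger divergence from a distribution built on benign names suggests an unusual character composition. Building and querying such reference data is the kind of cost the text cites as a reason to treat this category as privileged.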
Table II: Features of the malware traffic detection system, obtained from the first ten packets of each flow.
| Feature | Definition | Properties | Feature type |
|---|---|---|---|
| cnt-data-pkt | The count of all the packets with at least a byte of TCP data payload | TCP length is observed; client to server | |
| min-data-size | The minimum payload size observed | TCP length observed; client to server; 0 if there are no packets | |
| avg-data-size | Data bytes divided by the total number of packets | TCP length observed; packets with payload observed; server to client; 0 if there are no packets | |
| init-win-bytes | The total number of bytes sent in the initial window | Retransmitted packets not counted; client to server & server to client; 0 if no ACK observed; frame length calculated | |
| RTT-samples | The total number of RTT samples found | Client to server | standard set |
| IP-bytes-median | Median of total IP packets | IP length calculated; client to server | |
| frame-bytes-var | Variance of bytes in Ethernet packets | Frame length calculated; client to server | |
| IP-ratio | Ratio between the maximum packet size and minimum packet size | IP length calculated; client to server & server to client; 1 if a packet is observed, 0 if no packets are observed | |
| pushed-data-pkts | The count of all the packets seen with the PUSH bit set in the TCP header | Client to server & server to client | standard set |
| goodput | Total number of frame bytes divided by the difference between the last packet time and the first packet time | Frame length calculated; client to server; retransmitted bytes not counted | |
| duration | Total connection time | Time difference between the last packet and first packet (SYN flag is seen from destination) | privileged set |
| min-IAT | Minimum packet inter-arrival time for all packets of the flow | Client to server & server to client | privileged set |
| urgent-data-pkts | The total number of packets with the URG bit turned on in the TCP header | Client to server & server to client | privileged set |
| src-port | Source port number | Undecoded | privileged set |
| dst-port | Destination port number | Undecoded | privileged set |
| payload-info | Byte frequency distributions | If not HTTPS at training time; if payloads are available; client to server & server to client | |
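As a rough illustration of how a few of the flow features in Table II might be derived, the Python sketch below computes duration, min-IAT, frame-bytes-var, and goodput from the first ten packets of a flow. It is a simplification under stated assumptions: packets are hypothetical `(timestamp_seconds, frame_bytes)` pairs, direction and retransmissions are ignored, and the population variance is used. The function name `flow_features` is not taken from any measurement tool.

```python
from statistics import pvariance


def flow_features(packets, max_packets=10):
    """Compute simplified flow features from (timestamp, frame_bytes) pairs.

    Assumptions: no direction filtering, no retransmission handling,
    population variance, and only the first `max_packets` packets are used.
    """
    pkts = sorted(packets)[:max_packets]  # order by timestamp
    if not pkts:
        return {"duration": 0.0, "min-IAT": 0.0,
                "frame-bytes-var": 0.0, "goodput": 0.0}
    times = [t for t, _ in pkts]
    sizes = [s for _, s in pkts]
    duration = times[-1] - times[0]
    inter_arrivals = [b - a for a, b in zip(times, times[1:])]
    return {
        "duration": duration,
        "min-IAT": min(inter_arrivals) if inter_arrivals else 0.0,
        "frame-bytes-var": pvariance(sizes),
        "goodput": sum(sizes) / duration if duration > 0 else 0.0,
    }


# Usage on a three-packet toy flow.
toy_flow = [(0.0, 100), (0.5, 300), (2.0, 200)]
features = flow_features(toy_flow)
```

Timing-based values such as duration and min-IAT fall in the privileged set in Table II, consistent with the text's rationale that features an attacker can readily alter are kept out of the run-time model.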