Abstract

Decision-tree-based ensemble classification methods (DTEMs) are a prevalent tool for supervised anomaly detection. However, due to the continued growth of datasets, DTEMs suffer from increasing drawbacks such as growing memory footprints, longer training times, and higher classification latencies at lower throughput. In this paper, we present, design, and evaluate \name, a DTEM-based anomaly detection framework that augments standard DTEM classifiers and alleviates these drawbacks by relying on two observations: (1) we find that a small (coarse-grained) DTEM model is sufficient to classify the majority of the classification queries correctly, where a classification is valid only if its corresponding confidence level is greater than or equal to a predetermined classification confidence threshold; (2) we find that in the fewer, harder cases where our coarse-grained DTEM model exhibits insufficient confidence in its classification, we can improve it by forwarding the classification query to one of the expert (fine-grained) DTEM models, which is explicitly trained for that particular case. We implement \name in Python based on scikit-learn and evaluate it over different DTEM methods: RF, XGBoost, AdaBoost, GBDT, and LightGBM, and over three publicly available datasets. Our evaluation over both a strong AWS EC2 instance and a Raspberry Pi 3 device indicates that \name offers competitive and often superior anomaly detection capabilities as compared to standard DTEM methods, while significantly improving memory footprint (by up to ), training time (by up to ), and classification latency (by up to ).

\name: Resource-Efficient Supervised Anomaly Detection Using Decision Tree-Based Ensemble Methods

Shay Vargaftik (VMware Research), Isaac Keslassy (VMware Research and Technion), Yaniv Ben-Itzhak (VMware Research)

Corresponding authors: Shay Vargaftik <shayv@vmware.com>, Yaniv Ben-Itzhak <ybenitzhak@vmware.com>

Keywords: Machine Learning, SysML

1 Introduction

1.1 Background and related work

Supervised anomaly detection spans a wide range of applications such as finance, fraud detection, surveillance, health care, intrusion detection, fault detection in safety-critical systems, and medical diagnosis. For example, anomalies in network traffic could mean that a hacked device is sending out sensitive data to an unauthorized destination; anomalies in a credit card transaction could indicate credit card or identity theft; and anomalous readings from various sensors could signify faulty behavior in a hardware or software component. A popular supervised machine learning (ML) solution for anomaly detection is to employ decision-tree-based ensemble classification methods (DTEMs), which rely on either bagging or boosting techniques to improve the detection capabilities, as explained in the following.

\T

Bagging (or bootstrap aggregation) breiman1996bagging is used to reduce classification variance and thereby improve accuracy. Random Forest (RF) breiman2001random is the most well-known decision-tree-based bagging method; it grows each decision tree on a random subsample of the features and the data instances, resulting in different trees, and a majority vote over the trees then determines the classification.
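As a minimal illustration of bagging (a sketch with toy data and illustrative hyper-parameters, not the models used in this paper), the following scikit-learn snippet grows each tree on a bootstrap sample of the instances with a random feature subset per split and classifies by majority vote:

# Minimal bagging illustration (toy data; hyper-parameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=10000, n_features=20, weights=[0.99, 0.01],
                           random_state=0)            # imbalanced toy data
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=150, max_features="sqrt", bootstrap=True,
                            n_jobs=-1, random_state=0)
rf.fit(X_train, y_train)                               # each tree sees a bootstrap sample
print(rf.score(X_test, y_test))                        # majority vote over the trees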

Several studies zhang2005network; singh2014big; tavallaee2009detailed; hasan2014support have proposed using RF for supervised anomaly detection. For instance, zhang2005network employed RF for anomaly detection by using data mining techniques to select features and handle the class imbalance problem, and singh2014big provided a scalable implementation of a quasi-real-time intrusion detection system.

RF is a popular classifier as it offers many appealing advantages over other classification methods, such as Neural Networks ashfaq2017fuzziness, Support Vector Machines gan2013anomaly, Fuzzy Logic methods bridges2000fuzzy, and Bayesian Networks kruegel2003anomaly. Specifically, RF offers: (1) robustness and moderate sensitivity to hyper-parameters; (2) low training complexity; (3) natural resilience to imbalanced datasets and tiny classes with very little information; (4) embedded feature selection and ranking capabilities; (5) handling of missing, categorical, and continuous features; (6) interpretability for advanced human analysis, for further investigation or whenever such capability is required by regulations wiki:right_to_explanation, \eg in order to understand the underlying risks. To that end, an RF can be interpreted by different methods, such as banerjee2012identifying.

These advantages are repeatedly pointed out in the literature, via analysis as well as comparative tests (see resende2018survey; moustafa2018holistic; habeeb2018real and references therein), especially for intrusion detection (IDS) purposes zhang2005network; tavallaee2009detailed; hasan2014support, fraud detection xuan2018random, and online anomaly detection zhao2018online; singh2014big.

\T

Boosting. Unlike bagging, boosting primarily reduces classification bias (and also variance). Many popular decision-tree-based boosting methods, such as GBDT friedman2001greedy; hastie2009unsupervised, XGBoost chen2016xgboost, LightGBM ke2017lightgbm, and AdaBoost freund1996experiments, employ the boosting concept, usually via iterative training. For example, in Adaptive Boosting, a weak classifier such as a stump is added at each iteration (unlike bagging methods that use fully grown trees) and is typically weighted with respect to its accuracy; the data weights are then readjusted such that a higher weight is given to the misclassified instances. In Gradient Boosting, a small decision tree (\eg with 8-32 terminal nodes) is added at each iteration and scaled by a constant factor; each new tree is grown to reduce the loss of the ensemble built by the previous trees. For both methods, the subsequent trees are trained with more focus on previous misclassifications.
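For concreteness, a minimal scikit-learn sketch of both boosting flavors follows (toy data and hyper-parameters are illustrative only, not the models used in this paper):

# Minimal boosting illustration (toy data; hyper-parameters are illustrative only).
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=10000, weights=[0.99, 0.01], random_state=0)
# Adaptive Boosting: stumps, each weighted by its accuracy; misclassified instances
# are up-weighted before the next iteration.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1), n_estimators=200,
                         learning_rate=0.5, random_state=0).fit(X, y)
# Gradient Boosting: small trees (here depth 3), each fit to reduce the loss of the
# ensemble built so far and scaled by the learning rate.
gbdt = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                  learning_rate=0.1, random_state=0).fit(X, y)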

Decision-tree-based boosting methods are known to be among the best off-the-shelf supervised learning methods available roe2006boosted; schapire2003boosting; liu2017visual; roe2005boosted, achieving excellent accuracy with only a modest memory footprint, as opposed to RF, which is usually memory bound. Boosting methods also share many of the aforementioned advantages offered by RF, such as natural resilience to imbalanced datasets and tiny classes, embedded feature selection and ranking capabilities, and handling of missing, categorical, and continuous features medium2017boosing.

Decision-tree-based boosting methods are also known to be especially appealing for anomaly detection purposes, where data is often highly imbalanced (\eg credit card transactions or cyber-security) pfahringer2000winning; li2012robust. This is mainly because decision-tree-based boosting methods shift their focus across iterations toward the more difficult training instances. This often produces a stronger strategy for dealing with imbalanced datasets by strengthening the impact of the anomalies; when adequately trained, boosting methods usually achieve higher accuracy (as well as precision and recall) than a traditional RF classifier.

That being said, boosting methods are also more sensitive to overfitting than RF, especially when the data is noisy dietterich2000experimental. The training of boosting-based methods generally takes much longer than that of RF, mainly because the trees are built sequentially and compute-intensive tasks such as classification and data weight readjustments take place at every iteration. Moreover, boosting-based methods are harder to tune than RF, as they have more parameters and a higher sensitivity to them.

Finally, both bagging (RF) and boosting methods are prevalent tools for supervised anomaly detection, with shared and distinct pros and cons, and there is no clear winner in this classification contest, as the best classifier often depends on the specific dataset and application.

1.2 Challenges

In recent years, supervised anomaly detection via DTEMs has become increasingly difficult. Traditional bagging DTEM (\ie random forest) classifiers can be highly effective, but tend to be memory bound and slower at classification liaw2002classification; van2012accelerating; mishina2015boosted. Furthermore, the classification latency of an RF increases with the RF depth asadi2014runtime. Accordingly, previous work has suggested different approaches to tackle the memory and performance drawbacks of RF. The study of van2012accelerating achieves deterministic latency by producing compact random forests composed of many small trees rather than fewer deep trees. Other studies asadi2014runtime; browne2018forest optimize the memory layout of RFs, which reduces cache misses.

On the other hand, boosting DTEMs are slow to train and tune, and also exhibit slower classification as the number of trees increases appel2013quickly; rfvsboostblog. Much effort has been made to address these drawbacks. For example, recent scalable implementations of tree-based gradient boosting methods include: XGBoost chen2016xgboost, which supports parallelism and uses pre-sorted and histogram-based algorithms for computing the best split; LightGBM ke2017lightgbm, which uses a novel technique of Gradient-based One-Side Sampling (GOSS) to filter out the data instances for finding a split value; and CatBoost prokhorenkova2018catboost, which implements ordered boosting, a permutation-driven alternative to the classic algorithm, together with an innovative algorithm for processing categorical features (\eg giving indices of categorical columns so that they can be one-hot encoded). For Adaptive Boosting (\ie AdaBoost), several approaches have been suggested as well to accelerate its slow training chu2004fast; seyedhosseini2011fast; olson2017jousboost. For example, seyedhosseini2011fast introduces a new sampling strategy (WNS) that selects a representative subset of the data at each iteration, thereby reducing the number of data points to which AdaBoost is applied.

Nevertheless, both bagging and boosting DTEM methods are challenged by the continuous growth of datasets li2014scaling in terms of the number of features and data instances, and by the increasing demand for lower memory footprints, faster training, and lower classification latency. That is, while a sufficiently large DTEM classifier may offer satisfactory anomaly detection capabilities when adequately trained, it typically suffers from at least one of the following drawbacks: (1) a large memory footprint; (2) long training (which also incurs high energy consumption); (3) high classification latency and low classification throughput.

Overcoming these drawbacks is essential for efficient DTEM-based anomaly detection systems, which are required to train a proper classifier in a timely manner and to offer low classification latency and high throughput at a reasonable memory footprint and cost.

1.3 Contributions

In this paper, we present \name, which addresses the aforementioned drawbacks of DTEM methods. \name is orthogonal to the discussed bagging and boosting techniques and can augment them to form more efficient DTEM classifiers. We implement and evaluate \name for RF, XGBoost, AdaBoost, GBDT, and LightGBM.

The design of \name mainly relies on the two following observations, which are further detailed in Section 2:

(1) We find that a small (coarse-grained) DTEM model is sufficient to classify the majority of the classification queries correctly. To that end, we define a confidence level threshold, such that a classification is considered to be valid only when its classification confidence level is higher than or equal to this given threshold.

(2) We find that in the fewer, harder cases where our coarse-grained DTEM model exhibits insufficient confidence in its classification, we can improve it by forwarding the classification query to one of the expert (fine-grained) DTEM models, which is explicitly trained for that particular case.

Finally, in Section 5, we present evaluation results over three publicly available datasets, on a strong AWS EC2 instance as well as on a Raspberry Pi 3 device. The results are consistent and indicate that \name always offers competitive and often superior anomaly detection capabilities as compared to the standard DTEM methods. For bagging (RF), \name significantly improves the training time (by up to ), classification latency (by up to ), and memory consumption (by up to ); for boosting (\eg XGBoost and AdaBoost), \name significantly improves the classification latency (by up to ) and training time (by up to ), while being competitive in model memory footprint (always in the range of [0.64, 2.92]).

(a) RF over KDD.
(b) XGBoost over FC.
Figure 1: Baseline tuning examples by sweeping over the number of trees. (a) The best score for the RF classifier over the KDD dataset is achieved for 150 trees with no depth limitation, \ie. (b) The best score for XGBoost classifier over the FC dataset is achieved for 2000 trees with depth limitation of 3, \ie.
KDD CCF FC
RF
GBDT
XGBoost
LightGBM
AdaBoost
Table 1: Baseline DTEM models. Number of trees and tree depth limitations for each DTEM classifier and dataset (RF is trained without tree depth limitation).
Figure 2: The useful classification fraction and its resulting score by a small (coarse-grained) model, when a classification is valid only if its confidence level is greater than or equal to a given classification confidence threshold (CCT). Our observation is that most classifications can be achieved by a coarse-grained model, with a similar or higher resulting score as compared to the score achieved by a much bigger model, termed the baseline model. The vertical lines indicate the lowest CCT for each dataset and model such that the resulting score exceeds the score of a baseline model (see Table 1 for details). Note that any other CCT value can be set according to the desired tradeoff between the useful classification fraction and its resulting score. Demonstrated for the KDD kdd_dataset, Credit Card Fraud (CCF) ccf_dataset, and Forest Cover (FC) forest_cover datasets, over five different DTEM classifiers: RF breiman2001random, GBDT friedman2001greedy; hastie2009unsupervised, XGBoost chen2016xgboost, LightGBM ke2017lightgbm, and AdaBoost freund1996experiments. In all evaluations, we use 5-fold cross-validation and depict the mean value and variance.

2 \name Preliminaries

2.1 Baseline DTEM models

We evaluate \name and quantify its benefits by comparing it to standard DTEM models, which we term baseline models. Specifically, we define a baseline model for three different datasets (KDD kdd_dataset, Credit-Card Fraud (CCF) ccf_dataset, and Forest Cover (FC) forest_cover) and for five different DTEM classifiers (RF breiman2001random, GBDT friedman2001greedy; hastie2009unsupervised, XGBoost chen2016xgboost, LightGBM ke2017lightgbm, and AdaBoost freund1996experiments). For each dataset and DTEM classifier, we choose the baseline model for comparison by sweeping over different parameters and conducting 5-fold cross-validation for each measurement. Figure 1 shows two such examples: we depict the score as a function of the number of trees for the RF model over the KDD dataset (Figure 1(a)) and for the XGBoost model over the FC dataset (Figure 1(b)). Note that we conduct sweeps over the number of trees for different parameters (\eg class/sample weights for RF, the number of features to consider for a split, tree depth limitations, the learning rate for XGBoost); in these examples, the other parameters are already chosen accordingly.
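A sketch of such a sweep is shown below (a hedged illustration: the dataset is assumed to be already loaded into X, y, the remaining hyper-parameters are fixed, and the f1_macro scoring choice and function names are ours):

# Sweep over the number of trees with 5-fold cross-validation (illustrative sketch).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep_num_trees(X, y, tree_counts=(25, 50, 100, 150, 200, 300)):
    scores = {}
    for n in tree_counts:
        clf = RandomForestClassifier(n_estimators=n, n_jobs=-1, random_state=0)
        # f1_macro averages the per-class F1 scores, which suits imbalanced data
        scores[n] = cross_val_score(clf, X, y, cv=5, scoring="f1_macro").mean()
    best_n = max(scores, key=scores.get)   # number of trees with the best mean CV score
    return best_n, scores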

\T

Remark. Interestingly, these two sweeps already point out the weaknesses of bagging and boosting methods: (1) the sweep in Figure 1(a) lasted for 699 seconds and required 205 MB, whereas a similar sweep for XGBoost (over KDD) lasted as much as 14865 seconds but required only 8 MB; (2) similarly, the sweep in Figure 1(b) lasted for 25442 seconds and required 47 MB, whereas a similar sweep for RF (over FC) lasted only 425 seconds but required 340 MB. (These specific baseline tuning examples were conducted on an Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 4 cores and 8 logical processors.)

\T

ML performance metric. In this paper, we use per-class score to quantify the anomaly detection capabilities of ML models since this score takes into account the imbalanced nature of the datasets in anomaly detection use-cases. Nevertheless, we also consider the Area Under the Curve (AUC) and Average Precision (which is more suitable for skewed datasets) metrics, and obtain similar results.
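For reference, these metrics can be computed with scikit-learn as follows (the exact metric configuration used in the paper is an assumption here; y_score denotes the predicted probability of the Anomaly class, and the function name is ours):

# Illustrative per-class and threshold-free metrics for a binary anomaly detector.
from sklearn.metrics import f1_score, roc_auc_score, average_precision_score

def report_metrics(y_true, y_pred, y_score):
    per_class = f1_score(y_true, y_pred, average=None)     # one score per class
    auc = roc_auc_score(y_true, y_score)                    # Area Under the ROC Curve
    ap = average_precision_score(y_true, y_score)           # better suited to skewed datasets
    return {"normal": per_class[0], "anomaly": per_class[1], "auc": auc, "avg_precision": ap}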

\T

Classifier configuration. For ease of exposition, we denote a classifier with trees, each limited to a depth of , by . A classifier without a depth limitation is denoted by . Table 1 summarizes the baseline DTEM classifiers that we use throughout the paper.

2.2 Observations

In this work, we target binary supervised anomaly-detection classification with imbalanced Normal and Anomaly classes, and discuss multi-class anomaly detection as a future direction/extension of \name in Section 6. In the following, we describe the two main aforementioned observations on which our solution is based.

\T

Observation 1: A small DTEM model can classify most of the classification queries with a high score.

Figure 2 exemplifies, over three datasets (KDD, CCF, and FC; further details are in Section 5.1) and five DTEMs (RF, GBDT, XGBoost, LightGBM, and AdaBoost), that a small DTEM model, which we term the coarse-grained model, can be used to correctly classify the majority of classification queries (but not necessarily all) by requiring a sufficiently high classification confidence level. (For a DTEM classifier, the classification and its confidence level are determined by the classification distribution vector. For example, if the classification output of an instance is (Normal=0.78, Anomaly=0.22), the instance is classified as Normal with a classification confidence level of 0.78.) That is, the classification result of the coarse-grained model is valid only if its corresponding confidence level is greater than or equal to a predetermined classification confidence threshold (denoted by CCT), rather than simply accepting any classification confidence. (For example, if the prediction output of a sample is (Normal=0.83, Anomaly=0.17) and CCT , the classification is not valid and this query requires further attention, as we later detail in Section 3.) We empirically find that this approach of setting a higher CCT value to make a valid classification results in a high fraction of the data instances being valid classifications with a high score for both the Normal and Anomaly classes.
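A minimal sketch of this validity check (function and variable names are ours, not the reference implementation's):

# Accept the coarse-grained classification only when the top-1 class probability meets the CCT.
def valid_classifications(coarse_model, X, cct=0.9):
    proba = coarse_model.predict_proba(X)              # classification distribution vectors
    top1_conf = proba.max(axis=1)                       # confidence of the predicted class
    valid = top1_conf >= cct                            # which queries are valid under the CCT
    preds = coarse_model.classes_[proba.argmax(axis=1)]
    return preds[valid], valid                          # valid predictions + the mask of valid queries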

Specifically, as can be seen in Figure 2, the fraction of valid classifications (out of the total data instances) decreases as the CCT increases, and their respective score increases, since only the classifications with a higher confidence level (\ie valid ones) are considered. (Unless the CCT is set too high, such that the fraction of valid classifications, especially anomalies, drops to nearly zero and the remaining few instances may admit a high classification variance; see XGBoost over CCF in Figure 2, for example. Such CCT values are not reasonable operating points for \name. Also note that our usage of a classification confidence level is inherently different from the threshold used to produce a ROC curve: when producing a ROC curve, the threshold only determines the result of a classification query and not whether it is valid.) Furthermore, the fraction of Normal data instances remains significantly higher than the Anomaly fraction as the CCT increases. Intuitively, this is because the Anomaly-labeled instances are harder to classify, as there are significantly fewer Anomaly instances than Normal instances available for training.

We have found this observation to be consistent for all tested datasets. These include the three datasets discussed in this paper, as well as a Bankruptcy data set url:bankruptcy; zikeba2016ensemble, a Shuttle data set shuttle_dataset and different synthetic datasets generated by the scikit-learn machine learning Python package scikit-learn.

As we later discuss and demonstrate, any CCT value can be set according to the desired tradeoff between the useful classification fraction and its resulting score. The vertical black lines in Figure 2 indicate the lowest CCT for each dataset and classifier such that the resulting scores of the valid classifications of both the Normal and Anomaly classes exceed the total score of the corresponding baseline model, as detailed in Table 1 (for the baseline models, a valid classification is any classification with a confidence level of at least 0.5, \ie all classification queries are valid and considered). Finally, we provide two examples to clarify the findings depicted in Figure 2.

Example 1. By setting CCT (instead of the usual ) when using the RF model, , for the CCF dataset, we obtain that a fraction that accounts for of the Normal data instances is classified with score of ; and a fraction that accounts for of the Anomaly data instances is classified with score of . Whereas, the corresponding baseline DTEM model, , achieves score of and , respectively.

Example 2. By setting CCT (instead of the usual ) when using the XGBoost model, , for the KDD dataset, we obtain that a fraction that accounts for of the Normal data instances is classified with score of ; and a fraction that accounts for of the Anomaly data instances is classified with score of . Whereas, the corresponding baseline DTEM model, , achieves score of and , respectively.

\T

Observation 2: Train expert (fine-grained) classifiers to succeed specifically where the coarse-grained model is not sufficiently confident.

When applying the approach suggested by the previous Observation 1, we remain with a small fraction of the data instances without a valid classification by the coarse-grained model due to an insufficient classification confidence level (\iequeries for which the top-1 class probability is lower than CCT). These queries are the harder data instances and, most importantly, as depicted in Figure 2, contain most of the anomalies as the CCT increases.

Our second main observation is that we can leverage the classification distribution vector of the coarse-grained model over the training data to: (1) filter out most of the training data by using a training confidence threshold (TCT); and (2) train expert classifiers, which we term fine-grained models, that are trained to succeed specifically where the coarse-grained model is not sufficiently confident and is more likely to make a classification mistake. The training dataset of each fine-grained classifier is defined according to the TCT and the resulting classification distribution vectors of the coarse-grained model (see Section 3 for more details).

As we show in Section 5.2, these training datasets of the fine-grained classifiers are tailored such that they focus on the harder data instances and improve the Normal to Anomaly ratio of the labeled instances as compared to the training dataset of the coarse-grained and baseline classifiers. As a result, we find that the fine-grained models achieve better score for the low-confidence data instances.

Furthermore, these tailored training datasets are much smaller as compared to the training sets of the coarse-grained and baseline classifiers, which in turn significantly reduces the required training time (and hence the corresponding energy consumption). This attribute is especially appealing for the boosting methods (see Section 5).

(a) \name architecture.
(b) \name confidence-level-driven training.
(c) \name confidence-level-driven classification.
Figure 3: The \name architecture, comprising a small (coarse-grained) model and two expert (fine-grained) models. The training confidence threshold (TCT) at the coarse-grained model determines the training data for each fine-grained classifier. For classification, the classification confidence threshold (CCT) at the coarse-grained model determines whether a classification query takes the short or the long path, and which fine-grained model is queried in the latter case.

2.3 The intuition behind \name

So far, we have mainly discussed the ML performance of \name and why, intuitively, it is expected to result in a high score. We now discuss the intuition as to why \name also results in a lower memory footprint, lower training time (and hence lower energy consumption), and lower classification latency, as compared to the baseline models.

The training time complexity of DTEM methods depends on the size of the training dataset (\ie the number of training instances), the number of features, the number of trees, and their depth limitations (if any), whereas the classification latency mostly depends on the model size (\ie the number of trees and their depths).

Clearly, the smaller model size of a coarse-grained classifier directly improves all of the criteria mentioned above. Additionally, as mentioned, the fine-grained classifiers are trained on smaller datasets, which reduces their training time and, often, their size. Indeed, our evaluation in Section 5 shows that the size of the fine-grained models is smaller than that of their corresponding baseline model for all tested datasets and classification methods.

Essentially, when considering a \name model (\ie both the coarse-grained and fine-grained models), the classification latency of \name equals a weighted average of the latencies, weighted by the fraction of classifications that are served by the coarse-grained and fine-grained models. Since the coarse-grained model serves most of the classifications, the average classification latency is expected to improve significantly compared to the baseline model. Furthermore, our evaluation shows that the worst-case classification latency of \name (\ie the latency of a query that takes the longest path of the coarse-grained model and then the slowest fine-grained model) is also competitive with, and often lower than, the latency of its corresponding baseline model (for more details, see the evaluation in Section 5).
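As a back-of-the-envelope illustration with made-up numbers (all values below are hypothetical):

# If the coarse-grained model answers a fraction p of the queries at latency l_cg, and the
# rest also traverse a fine-grained model at latency l_fg, then
#   expected latency = p * l_cg + (1 - p) * (l_cg + l_fg)
p, l_cg, l_fg = 0.95, 1.0, 5.0                   # arbitrary, made-up units
expected = p * l_cg + (1 - p) * (l_cg + l_fg)    # 1.25: close to the coarse-grained latency
worst_case = l_cg + l_fg                         # 6.0: the long path (coarse-grained, then slowest expert)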

On the other hand, the training time and model size of \name equal the corresponding sums over the coarse-grained and fine-grained models. Nevertheless, as presented in Section 5, our evaluation indicates that \name reduces the memory footprint and training time as compared to the baseline models.

To summarize, both the coarse-grained and fine-grained models contribute to the overall improvements of \name, in the following ways: the coarse-grained model (1) is based on a small classifier and (2) serves most of the classification queries; the fine-grained models (1) are trained on smaller datasets, (2) are smaller than the corresponding baseline, and (3) serve only a small fraction of the classification queries.

3 \name

\T

Training. Algorithm 1 describes the procedure for training a \name model. It begins with the training of a coarse-grained model using the entire labeled dataset. Next, we train the fine-grained models. To that end, we classify the labeled dataset by the coarse-grained model (line 5). Then, if the confidence level of the predicted top-1 class is lower than the given training confidence threshold (TCT), the labeled data instance is forwarded to both experts if its label is Anomaly (lines 7-9), or otherwise to a single fine-grained model according to the prediction made by the coarse-grained model (lines 11-15). Notice that in lines 11-15, the data instances are forwarded according to their low-confidence coarse-grained classification, and not according to their labels. The reason is that we train the fine-grained models to succeed specifically where the coarse-grained model is insufficiently confident and is more likely to make a mistake.

More specifically, as illustrated in Figure 3(b), the data instances that are forwarded to fine-grained model 1 contain: (1) all low-confidence Anomaly instances and (2) the low-confidence Normal instances that are correctly classified by the coarse-grained model. Intuitively, this model becomes an expert in distinguishing between Normal instances that are correctly classified by the coarse-grained model and Anomaly instances. Likewise, the data instances that are forwarded to fine-grained model 2 contain: (1) all low-confidence Anomaly instances and (2) the low-confidence Normal instances that are misclassified by the coarse-grained model. Intuitively, this model becomes an expert in distinguishing between Normal instances misclassified by the coarse-grained model and Anomaly instances.

The Anomaly class is significantly smaller (usually by orders of magnitude) than the Normal class in terms of the number of instances, and its low-confidence subset is even smaller. This results in two potential drawbacks, which we mitigate by duplicating the low-confidence Anomaly-labeled instances to both fine-grained models (lines 8-9), as explained in the following:

Less accurate Anomaly coarse-grained classification: Since the coarse-grained model is trained using a rather small number of Anomaly instances as compared to the number of Normal instances, its classifications over these instances are noisy, with many misclassifications as compared to the Normal instances (see Figure 2). Namely, the classification distribution vector over these instances has a significant variance, which is even more severe for the low-confidence Anomaly subset. This makes the coarse-grained model's decision as to which fine-grained model a specific low-confidence Anomaly instance should be sent less reliable (unlike for the Normal instances).

Increased overfitting likelihood by the fine-grained models: Due to the small cardinality of the low-confidence subset of the Anomaly instances, it is more likely for a fine-grained classifier to receive an insufficient number of such instances for training. This, in turn, increases the likelihood of overfitting the model. That is, the training of a fine-grained classifier is more likely to terminate in a state in which it has a nearly perfect score on the low-confidence Anomaly instances that were forwarded to it by the coarse-grained model for training, yet it is less accurate when classifying a low-confidence Anomaly instance that may have been sent to the wrong fine-grained model and is therefore likely to be too different from the labeled instances this fine-grained classifier was trained on.

Therefore, forwarding the entire low-confidence subset of Anomalies to both experts, as we empirically find in our evaluations, reduces the likelihood of both drawbacks and makes the fine-grained models better experts for those queries on which the coarse-grained model is more likely to make a classification mistake (\ie Observation 2).

Note that this duplication (\ielines 8-9) results in a very low overhead in terms of the number of data instances used for the fine-grained models training (see Table 2).

Input: Labeled training dataset D, training confidence threshold TCT.
1:  train the coarse-grained model CG using D
2:  set: D1 = {}
3:  set: D2 = {}
4:  for each (x, y) in D:
5:    obtain coarse-grained distribution: p = CG.distribution(x)
6:    if max(p) < TCT:
7:      if y == Anomaly:
8:        update: D1 = D1 ∪ {(x, y)}
9:        update: D2 = D2 ∪ {(x, y)}
10:     else:
11:       classify: c = argmax(p)
12:       if c == Normal:
13:         update: D1 = D1 ∪ {(x, y)}
14:       else:
15:         update: D2 = D2 ∪ {(x, y)}
16: train fine-grained model FG1 using D1
17: train fine-grained model FG2 using D2

Algorithm 1: \name training
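A minimal Python-style sketch of Algorithm 1 follows (variable and function names are ours, not the reference implementation's; make_coarse and make_fine are assumed factories for scikit-learn-compatible classifiers, and X, y are numpy arrays):

NORMAL, ANOMALY = 0, 1   # assumed label encoding

def train_rade(make_coarse, make_fine, X, y, tct):
    # Lines 1-3: train the coarse-grained model on the entire labeled dataset.
    coarse = make_coarse().fit(X, y)
    # Line 5: coarse-grained classification distribution vectors over the training data.
    proba = coarse.predict_proba(X)
    top1_conf = proba.max(axis=1)
    preds = coarse.classes_[proba.argmax(axis=1)]
    # Line 6: only low-confidence instances are forwarded to the experts.
    low_conf = top1_conf < tct
    # Lines 7-9: low-confidence Anomaly-labeled instances are duplicated to both experts.
    # Lines 11-15: low-confidence instances predicted Normal go to expert 1,
    # those predicted Anomaly go to expert 2.
    to_fg1 = low_conf & ((y == ANOMALY) | (preds == NORMAL))
    to_fg2 = low_conf & ((y == ANOMALY) | (preds == ANOMALY))
    # Lines 16-17: train the two fine-grained (expert) models on their tailored datasets.
    fine1 = make_fine().fit(X[to_fg1], y[to_fg1])
    fine2 = make_fine().fit(X[to_fg2], y[to_fg2])
    return coarse, fine1, fine2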
\T

Classification. Algorithm 2 describes the procedure for a classification by \name, and Figure 3(c) illustrates it. First, we classify an arriving data instance by the coarse-grained model (lines 1-2). Whenever the resulting confidence level of the top-1 classification is greater than or equal to the classification confidence threshold (CCT), the classification by the coarse-grained model is valid and therefore returned (line 4). As shown in Figure 2, we empirically find that most of the data instances result in a high confidence level. Therefore, since the size of the coarse-grained model is small, these classification instances experience an extremely low-latency and high-throughput classification. The remaining small fraction of data instances whose coarse-grained classification is not valid (\ie whose resulting confidence level is lower than the CCT) is forwarded to one of the fine-grained models, chosen according to the coarse-grained classification (lines 6-9). Specifically, if the coarse-grained low-confidence classification is Normal, then the instance is forwarded to fine-grained model 1, which is trained to distinguish between Normal instances that are correctly classified by the coarse-grained model and Anomaly instances (see Figure 3(b)). Likewise, if the coarse-grained (low-confidence) classification is Anomaly, then the instance is forwarded to fine-grained model 2, which is trained to distinguish between Normal instances that are misclassified by the coarse-grained model and Anomaly instances.

Input: Unlabeled data point x, classification confidence threshold CCT.
1: obtain coarse-grained distribution: p = CG.distribution(x)
2: classify: c = argmax(p)
3: if max(p) >= CCT:
4:   return c
5: else:
6:   FG = FG1 if c == Normal else FG2
7:   obtain fine-grained distribution: q = FG.distribution(x)
8:   classify: c = argmax(q)
9:   return c

Algorithm 2: \name classification
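A matching single-query sketch of Algorithm 2, continuing the names introduced in the training sketch above (x is assumed to be a 1-D numpy feature vector):

def classify_rade(coarse, fine1, fine2, x, cct):
    proba = coarse.predict_proba(x.reshape(1, -1))[0]    # line 1: distribution vector
    pred = coarse.classes_[proba.argmax()]                # line 2: top-1 class
    if proba.max() >= cct:                                # line 3: confidence check
        return pred                                       # line 4: short path
    # Lines 6-9: long path, forward to the expert chosen by the low-confidence prediction.
    expert = fine1 if pred == NORMAL else fine2
    return expert.predict(x.reshape(1, -1))[0]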
\T

Putting it all together. Figure 3 depicts a high-level architecture of \name, and the training and classification data-forwarding schemes. We may use different confidence level thresholds for the fine-grained models' training (TCT) and for the classification/anomaly detection (CCT). The intuition for why it may be of interest to set the two differently is that it allows training the fine-grained models with a bigger subset of the labeled data instances, as compared to the classification subset that is forwarded to them. Tuning the TCT, as we empirically find in our evaluations, often improves the anomaly detection capabilities (in terms of score) for a modest price in training time and fine-grained model sizes.

Intuitively, the fine-grained models provide better classification, and hence better anomaly detection, for the uncertain classifications than the coarse-grained model, for the following reasons: (1) we allow the fine-grained models to have more resources than the coarse-grained model (see Section 5.2 for more details); (2) the fine-grained classifiers are trained on a much smaller fraction of the labeled training data. Essentially, each fine-grained model becomes an expert for its corresponding labeled data fraction, which represents the uncertain (and some of the wrong) classifications by the coarse-grained model. Note that when a classification query is forwarded to a fine-grained model, the final classification is solely determined by that fine-grained model.

\T

\name model. For ease of exposition, we define a specific configuration of a \name model by a tuple that states the size limitation of the coarse-grained model, followed by the size limitation of the fine-grained models, and finally the classification and training confidence thresholds.

4 Implementation

The \name implementation is written in Python 3.6 and is based on the scikit-learn library (v0.21.3, released on July 30, 2019; the current stable version at the time of writing) scikit-learn. \name can augment any scikit-based DTEM classifier. Specifically, we execute \name for internal scikit classifiers (RandomForest scikit-rf, GradientBoosting scikit-gbdt, and AdaBoost scikit-adaboost), as well as for Python packages (lightgbm-v2.2.3 python-lightgbm, and xgboost-v0.90 python-xgboost).
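As a hedged illustration of this interchangeability (the factory-based interface below is ours, not the reference implementation's API; it reuses train_rade and the toy data X_train, y_train from the sketches above):

from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier          # xgboost's scikit-learn wrapper
from lightgbm import LGBMClassifier        # lightgbm's scikit-learn wrapper

coarse_factory = lambda: RandomForestClassifier(n_estimators=20, n_jobs=-1)
fine_factory = lambda: RandomForestClassifier(n_estimators=100, n_jobs=-1)
# ...or swap in boosting-based models exposing the same fit/predict_proba interface:
# coarse_factory = lambda: XGBClassifier(n_estimators=50, max_depth=3)
# coarse_factory = lambda: LGBMClassifier(n_estimators=50, num_leaves=31)
coarse, fine1, fine2 = train_rade(coarse_factory, fine_factory, X_train, y_train, tct=0.9)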

5 Evaluation

5.1 Datasets

As mentioned, we use three different datasets, all of which are widely used, publicly available, and reproducible. To establish consistency for \name, we chose the datasets such that they come from different areas and use-cases, and exhibit different levels of skewness (\ie the ratio of Normal to Anomaly instances).

\T

KDD kdd_dataset. This dataset is a popular benchmark and is widely used for the evaluation of IDS systems dhanabal2015study; kayacik2005selecting; sabhnani2003application. It was used for the Third International Knowledge Discovery and Data Mining Tools Competition, in which the task was to build a network intrusion detector. This database contains a standard set of data to be audited, which includes a wide variety of intrusions simulated in a military network environment. In our evaluation, we treat all intrusions (\eg DoS, Probe, R2L) as Anomalies () and all non-hostile connections as Normal.

\T

Credit Card Fraud (CCF) ccf_dataset. This is a popular dataset used for anomaly and fraud detection benchmarking ccf_req_cite_1; ccf_req_cite_4; ccf_req_cite_5; ccf_req_cite_7; ccf_req_cite_8. The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with frauds, which we treat as Anomalies (), while all other legitimate transfers are treated as the Normal class.

\T

Forest Cover (FC) forest_cover. This dataset is used in predicting forest cover type from cartographic variables blackard2000comparison; gama2003accurate; oza2001experimental; obradovic2004challenges. This study includes four wilderness areas located in the Roosevelt National Forest of northern Colorado. Class 2 is considered as Normal, and class 4 as Anomaly ().

5.2 Tuning \name

The tuning of a \name classifier starts by identifying a set of sensible candidate hyperparameters for the coarse-grained model. Recall that we want a small coarse-grained model that can make valid classifications for most of the data instances with a high score (\eg Figure 2).

To that end, a set of such sensible hyperparameters may be derived by looking at known default configurations and rules of thumb for a standard full-sized classifier and for similar datasets (if such are known), and then considering smaller-sized options such that the resulting coarse-grained model is smaller than a standard full-sized classifier by some constant factor (\eg 5).

The candidate hyperparameters for the fine-grained models are then chosen similarly, but by considering larger sizes (\ie number of trees, depth limitations), ranging from the coarse-grained model size up to a standard model size.

Next, we need to consider candidate CCT and TCT values. The set of interest is always sampled according to some granularity (\eg 0.05).

Finally, once we have our grid of sensible hyperparameter configurations for \name, we perform iterations of our 5-fold cross-validation process, each time using different model settings from this predetermined set. This grid-search tuning process is no different from standard practice.

Clearly, these parameters are not guaranteed to be optimal. That said, finding satisfactory hyperparameters often requires an extensive grid search, as for any other classifier.
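A sketch of such a grid search over the confidence thresholds (reusing the train_rade/classify_rade sketches from Section 3; the scoring choice, the 0.05 granularity, and all helper names are illustrative):

import itertools
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import f1_score

def grid_search_thresholds(make_coarse, make_fine, X, y, step=0.05):
    thresholds = np.arange(0.5, 1.0 + 1e-9, step)
    best_cfg, best_score = None, -1.0
    for cct, tct in itertools.product(thresholds, thresholds):
        fold_scores = []
        for tr, te in StratifiedKFold(n_splits=5, shuffle=True, random_state=0).split(X, y):
            coarse, fg1, fg2 = train_rade(make_coarse, make_fine, X[tr], y[tr], tct)
            preds = [classify_rade(coarse, fg1, fg2, x, cct) for x in X[te]]
            fold_scores.append(f1_score(y[te], preds, pos_label=ANOMALY))   # Anomaly score
        if np.mean(fold_scores) > best_score:
            best_cfg, best_score = (cct, tct), np.mean(fold_scores)
    return best_cfg, best_score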

Example. We next demonstrate how CCT and TCT affect the score, classification latency, training time, and model size. Figure 4(a) presents these metrics vs. different values of CCT (with TCT=CCT for simplicity) of an RF-based \name model for the CCF dataset. We identify two CCT values, CCT=0.915 and CCT=1, that achieve the highest Anomaly score (marked in the graph). Notice that CCT=1 results in forwarding most of the test dataset to the fine-grained models, which in turn increases the classification latency. Moreover, as the TCT increases, a higher fraction of the training data is forwarded to the fine-grained models, which in turn increases both the model size and the training time.

Figure 4(b) also presents the Normal to Anomaly ratio of the training data instances that are forwarded to each of the fine-grained models. As can be seen, this ratio is further improved as compared to the original dataset ratio (581.4) for fine-grained model 1 with CCT=0.915 (while CCT=1 results in a ratio similar to the original one). Notice that, for fine-grained model 2, the ratio is lower than one, which means that the model is trained with more Anomaly-labeled instances than Normal-labeled ones. Additionally, the table presents these attributes for a model with CCT=0.5, which is essentially identical to a smaller baseline model in that it returns all of the data instance classifications from the coarse-grained model only. This \name model demonstrates that the fine-grained models are essential for achieving a sufficient Anomaly score, even when their total test dataset fraction is relatively low (\eg 2.34% for CCT=0.915).

A better CCT and TCT configuration for this \name model is achieved by the grid-search process mentioned above, as further detailed in Table 2.

(a) How CCT and TCT(=CCT) affect different metrics. The two CCT values that achieve the highest Anomaly score are marked.
Index | CCT   | Anomaly score | Fine-grained (FG) train data fraction | FG Nor./Anom. train data ratio [fg1, fg2] | FG total test data fraction
-     | 0.5   | 0.810         | 0.0%                                  | nan, nan                                  | 0.0%
1     | 0.915 | 0.853         | 2.32%                                 | 45.11, 0.348                              | 2.34%
2     | 1.0   | 0.852         | 98.4%                                 | 567.7, 0.136                              | 98.2%
(b) \name attributes for the marked CCT values in (a), and for CCT=0.5, \ie when all coarse-grained model classifications are considered valid. Note that the original Normal to Anomaly instance ratio of CCF is 581.4, and the Anomaly score of the corresponding baseline model is 0.845471.
Figure 4: How CCT and TCT(=CCT) affect \name, demonstrated with an RF-based \name model for the CCF dataset.
Classifier | Dataset | #  | Model    | Normal score | Anomaly score | Model size [MB] | Train time [s] | FG train data % [fg1, fg2] | Classification latency | RADE worst-case latency | FG test data % [fg1, fg2]
RF         | KDD     | 1  | Baseline | 0.999913     | 0.995190      | 5.57            | 22.5           | -                          | 9.4                    | -                       | -
RF         | KDD     | 2  | \name    | 0.999910     | 0.995028      | 1.02            | 3.5            | 9.55%, 0.57%               | 1.3                    | 6.9                     | 5.19%, 0.37%
RF         | KDD     | 3  | \name    | 0.999911     | 0.995091      | 1.54            | 3.6            | 5.38%, 0.35%               | 1.3                    | 10.1                    | 5.19%, 0.37%
RF         | CCF     | 4  | Baseline | 0.999759     | 0.845471      | 2.01            | 75.6           | -                          | 6.2                    | -                       | -
RF         | CCF     | 5  | \name    | 0.999733     | 0.840198      | 0.82            | 20.3           | 6.82%, 0.09%               | 1.9                    | 4.7                     | 3.72%, 0.06%
RF         | CCF     | 6  | \name    | 0.999756     | 0.854553      | 0.63            | 40.7           | 97.57%, 0.20%              | 2.1                    | 3.4                     | 41.78%, 0.09%
RF         | FC      | 7  | Baseline | 0.999379     | 0.859186      | 9.95            | 16.7           | -                          | 5.2                    | -                       | -
RF         | FC      | 8  | \name    | 0.999462     | 0.883542      | 5.96            | 6.9            | 6.28%, 1.23%               | 1.9                    | 8.1                     | 5.81%, 1.24%
RF         | FC      | 9  | \name    | 0.999438     | 0.882269      | 4.66            | 6.3            | 2.76%, 0.58%               | 1.8                    | 7.9                     | 2.46%, 0.58%
XGBoost    | KDD     | 10 | Baseline | 0.999933     | 0.996294      | 0.25            | 196.5          | -                          | 13.6                   | -                       | -
XGBoost    | KDD     | 11 | \name    | 0.999934     | 0.996356      | 0.32            | 49.4           | 0.33%, 0.13%               | 3.6                    | 11.4                    | 0.12%, 0.10%
XGBoost    | KDD     | 12 | \name    | 0.999934     | 0.996357      | 0.40            | 65.3           | 0.70%, 0.24%               | 4.7                    | 15.1                    | 0.08%, 0.08%
XGBoost    | CCF     | 13 | Baseline | 0.999803     | 0.877078      | 0.28            | 252.4          | -                          | 16.4                   | -                       | -
XGBoost    | CCF     | 14 | \name    | 0.999792     | 0.873624      | 0.26            | 61.7           | 0.82%, 0.15%               | 3.9                    | 13.0                    | 0.15%, 0.07%
XGBoost    | CCF     | 15 | \name    | 0.999800     | 0.878426      | 0.32            | 77.9           | 0.09%, 0.05%               | 4.9                    | 14.7                    | 0.02%, 0.02%
XGBoost    | FC      | 16 | Baseline | 0.999495     | 0.891222      | 1.21            | 1367.9         | -                          | 58.9                   | -                       | -
XGBoost    | FC      | 17 | \name    | 0.999489     | 0.892526      | 0.85            | 79.4           | 1.94%, 0.50%               | 3.9                    | 20.5                    | 1.13%, 0.41%
XGBoost    | FC      | 18 | \name    | 0.999506     | 0.896179      | 1.23            | 83.0           | 1.49%, 0.48%               | 4.0                    | 29.9                    | 1.13%, 0.41%
AdaBoost   | KDD     | 19 | Baseline | 0.999943     | 0.996854      | 0.92            | 874.6          | -                          | 136.6                  | -                       | -
AdaBoost   | KDD     | 20 | \name    | 0.999955     | 0.997485      | 2.48            | 630.6          | 99.96%, 1.76%              | 61.9                   | 165.6                   | 3.67%, 0.47%
AdaBoost   | KDD     | 21 | \name    | 0.999947     | 0.997084      | 2.69            | 462.4          | 59.23%, 1.51%              | 64.8                   | 149.3                   | 12.53%, 0.86%
AdaBoost   | CCF     | 22 | Baseline | 0.999801     | 0.876765      | 0.99            | 2192.4         | -                          | 185.9                  | -                       | -
AdaBoost   | CCF     | 23 | \name    | 0.999798     | 0.877588      | 2.41            | 716.0          | 7.60%, 0.12%               | 39.6                   | 118.4                   | 7.52%, 0.10%
AdaBoost   | CCF     | 24 | \name    | 0.999808     | 0.882403      | 2.31            | 3048.6         | 96.70%, 0.17%              | 78.7                   | 135.6                   | 14.38%, 0.11%
AdaBoost   | FC      | 25 | Baseline | 0.999449     | 0.882856      | 1.44            | 1333.2         | -                          | 235.8                  | -                       | -
AdaBoost   | FC      | 26 | \name    | 0.999477     | 0.892222      | 2.17            | 877.5          | 57.29%, 0.47%              | 73.2                   | 147.6                   | 4.33%, 0.46%
AdaBoost   | FC      | 27 | \name    | 0.999472     | 0.889118      | 3.10            | 115.9          | 5.99%, 0.79%               | 17.8                   | 104.1                   | 3.87%, 0.79%
Table 2: Comparison of two \name configurations to the baseline, over three DTEM classification methods, each with three different datasets. All results are obtained using 5-fold cross-validation. \name achieves a competitive and often superior score with lower training time and classification latency, and a superior (for bagging) or competitive (for boosting) model size.

5.3 \name vs. standard methods

We compare \name to the baseline models over an AWS m5d.16xlarge EC2 instance with Ubuntu 16.04 OS awsec2, and summarize the evaluation results in Table 2. We report the results for RF, XGBoost, and AdaBoost; similar improvements are obtained for GBDT and LightGBM. For each classifier and dataset, we present results for the baseline model and two different configurations of \name. We rely on 5-fold cross-validation and report the mean values.

5.3.1 Bagging - Random Forest

\T

Anomaly detection. For all three datasets, \name exhibits competitive or superior scores as compared to the baseline. (As mentioned, we consider this score as the ML performance measure for anomaly detection, \ie for imbalanced datasets; nevertheless, we also evaluate \name models for AUC and Average Precision and reach similar conclusions.) Specifically, the results are somewhat similar for KDD and CCF, whereas for FC, both \name configurations result in an advantage of in Anomaly .

\T

Model size. All \name model sizes are notably smaller than their corresponding baseline model. For example, for KDD, is smaller than .

\T

Training time. For all three datasets, the training time of \name is significantly lower. For example, it is faster for KDD. These lower training times are in line with the smaller size of the coarse-grained model as compared to the baseline, and with the small fractions of the training data that are used for training the fine-grained models.

\T

Classification latency. The improvement of \name over the baseline is consistent and is up to faster, due to the smaller latency introduced by the coarse-grained model. Even when considering the worst-case classification latency for \name (\ie a query that takes the path of the coarse-grained model and then the slowest fine-grained model), \name is still competitive. The non-negligible difference between the average and worst-case classification latency for \name falls in line with the small fractions of queries taking the long path (\eg 5.56% for both fine-grained models for KDD).

Dataset | #  | RF train [s] | RF latency | #  | XGBoost train [s] | XGBoost latency | #  | AdaBoost train [s] | AdaBoost latency
KDD     | 1  | 567          | 127        | 10 | 2068              | 166             | 19 | 11727              | 1384
KDD     | 2  | 63 (9.0)     | 26 (4.9)   | 11 | 608 (3.4)         | 53 (3.1)        | 20 | 12433 (0.9)        | 619 (2.2)
KDD     | 3  | 57 (10.0)    | 22 (5.8)   | 12 | 536 (3.9)         | 47 (3.6)        | 21 | 8021 (1.5)         | 697 (2.0)
CCF     | 4  | 848          | 89         | 13 | 2909              | 164             | 22 | 17524              | 1889
CCF     | 5  | 195 (4.4)    | 41 (2.2)   | 14 | 713 (4.1)         | 49 (3.3)        | 23 | 24905 (0.7)        | 755 (2.5)
CCF     | 6  | 433 (2.0)    | 41 (2.2)   | 15 | 892 (3.3)         | 45 (3.7)        | 24 | 5681 (3.1)         | 425 (4.4)
FC      | 7  | 363          | 66         | 16 | 9519              | 1831            | 25 | 24355              | 2596
FC      | 8  | 113 (3.2)    | 28 (2.3)   | 17 | 601 (15.8)        | 68 (26.9)       | 26 | 22971 (1.1)        | 762 (3.4)
FC      | 9  | 116 (3.1)    | 28 (2.3)   | 18 | 588 (16.2)        | 59 (30.8)       | 27 | 2368 (10.3)        | 203 (12.8)
(Values in parentheses are the improvement factors over the corresponding baseline; the # indices match Table 2.)
Table 3: Raspberry Pi 3 - Training-time and classification latency comparison between \name and the baseline models.

5.3.2 Boosting - XGBoost and AdaBoost

\T

Anomaly detection. For all three datasets, \name exhibits competitive or superior anomaly detection capabilities as compared to the baseline. For example, for XGBoost over FC, \name configuration achieves a better Anomaly by ; and for AdaBoost over FC, \name configurations improve the Anomaly by and .

\T

Model size. The XGBoost model sizes of \name and the baseline are competitive. The AdaBoost model sizes of \name are up to larger than the baseline. This is of less concern since the AdaBoost DTEM is not memory bound (it uses trees with only a few (\ie stumps) to several terminal nodes).

\T

Training time. For all datasets, the training time of \name is significantly lower. For example, it is faster for XGBoost over FC. These lower times are in line with the smaller size of the coarse-grained models and the smaller data fractions that are used for training the fine-grained models.

\T

Classification latency. The improvement in the mean classification latency of \name over the baseline is even more significant than in the case of RF. For example, it is faster for XGBoost over FC and faster for AdaBoost over FC. The worst-case classification time of \name is also competitive as compared to the baseline, and is even superior (by up to faster) for the CCF and FC datasets. Note that the classification latency grows linearly with the number of trees but only logarithmically in the (roughly balanced) tree size (\ie the number of terminal nodes). Therefore, the lower number of trees used by the coarse-grained and fine-grained models of \name results in faster classification as compared to the baseline.

5.4 Raspberry Pi 3 evaluation

We evaluate \name on a Raspberry Pi 3 B+ (ARM Cortex-A53 with 512 KB shared L2 cache and 1 GB SDRAM) with Raspbian 9.8 OS raspi. Table 3 compares the training time and classification latency of \name and the baseline models, for the same DTEM classification methods and datasets as in Table 2. The training time improvement of \name ranges from up to , and the classification latency improvement ranges from up to .

6 Summary and Future work

This paper presents \name, an efficient DTEM classification framework that augments standard DTEM classifiers to obtain a lower model memory size, training time, and classification latency, while achieving competitive and often better anomaly detection than the standard (baseline) models. We also believe that \name can be extended to support multi-class classification. The straightforward approach is to have fine-grained models for the classes. However, an immediate concern is the scalability of this approach with respect to the number of classes. Thus, better approaches that result in fewer fine-grained models should be considered in order to maintain the competitive attributes of \name.

References
