AEGR: A simple approach to gradient reversal in autoencoders for network anomaly detection
Anomaly detection is the process of identifying data points that follow a pattern different from that of the majority of data points. Anomaly detection methods suffer from several well-known challenges that hinder their performance, such as high dimensionality. Autoencoders are unsupervised neural networks that have been used to reduce dimensionality and to detect network anomalies in large datasets. The performance of autoencoders degrades when the training set contains noise and anomalies. In this paper, a new gradient-reversal method is proposed to overcome the influence of anomalies on the training phase for the purpose of detecting network anomalies. The method differs from other approaches in that it does not require an anomaly-free training set and is based on the reconstruction error. Once latent variables are extracted from the network, Local Outlier Factor is used to separate normal data points from anomalies. A simple pruning approach and data augmentation are also added to further improve performance. The experimental results show that the proposed model can outperform other well-known approaches.
Keywords— network anomaly detection, high dimensionality, autoencoders (AEs), Local Outlier Factor (LOF), Gradient Reversal
I Introduction

In many real-world problems, such as detecting fraudulent activities or detecting failure in aircraft engines, there is a pressing need to identify observations that are strikingly dissimilar to the majority. In medicine, for instance, such a discovery can lead to early detection of lung cancer or breast cancer. One rapidly growing area is computer networks, which play a pivotal role in our daily lives. Protecting networks from various threats such as network intruders is crucial. By using machine learning algorithms, it is possible to monitor and analyse the network and detect these threats almost instantly. However, when the number of observations that the method aims to detect is very small with respect to the whole data, these methods start to struggle, i.e., their performance degrades. These observations are called anomalies (also known as outliers). The process whereby the aim is to detect data instances that deviate from the pattern of the majority of data instances is referred to as anomaly detection [nicolau2018learning]. Depending on the application domain, anomalies can occur due to various causes such as human error, system failure, or fraudulent activities.
Traditional anomaly detection methods, e.g., density-based or distance-based, are found to be less effective and efficient when dealing with large-scale, high-dimensional and time-series datasets [Munir20191991][ma2013parallel][zhang2018survey]. Moreover, most of these approaches in the literature often require large storage space for the intermediate generated data [sun2018learning].
Moreover, parameter tuning for classical approaches such as clustering models in large-scale datasets is a difficult task. Consequently, the dimensionality of such datasets should be reduced before applying anomaly detection methods. This is achieved by performing a pre-processing step known as dimensionality reduction, which tries to remove irrelevant data while keeping important variables and transforming the dataset into a lower dimension. Besides high dimensionality, the lack of labelled datasets is another problem in this context. Many supervised learning algorithms have been employed to detect anomalies; however, an essential requirement of a supervised learning approach is labelled data. While availability of labelled datasets is often a problem, labelling can also be costly and time-consuming, and requires a field expert [Hodge2004]. Lastly, in many application domains, such as telecommunication fraud and computer network intrusion, companies are often very conservative and protective of their data due to privacy issues and tend to resist providing their data [sun2018learning].
In this paper, the aim is to tackle the mentioned problems by employing an unsupervised approach in which high-dimensional data is compressed into lower dimensionality, and a density-based method is then applied to the transformed data in order to detect network anomalies. In particular, a deep autoencoder (AE) network is used to create low-dimensional representations, and next, Local Outlier Factor (LOF) [Breunig:2000:LID:335191.335388] is utilised for separating normal data instances from anomalies. Although autoencoders have shown promising results, their performance weakens when the dataset contains noise and anomalies [Qi20146716]. To overcome this issue, various approaches have been proposed such as denoising autoencoders or using a loss function which is insensitive to anomalies and noise [zhou2017anomaly].
The main contributions of this paper, as will be elaborated in subsequent sections, are the following:
First, unlike other similar approaches that require a noise- and anomaly-free training set, the proposed model in this paper is insensitive to anomalies and therefore does not require an anomaly-free training set.
Second, the proposed method scales to large datasets and is particularly effective on high-dimensional network datasets.
Third, the method is capable of working with unlabelled datasets.
The proposed model is tested on 8 different datasets, including 5 well-known network datasets. Evaluation of the experimental results shows that the proposed model can improve performance significantly and is superior to the stand-alone LOF and two other state-of-the-art approaches. The rest of this paper is organised as follows. Section II reviews previous works, while Section III explains the proposed model. Experimental results are presented and discussed in Section IV, and the paper concludes with Section V.
II Related Studies
There are various methods for anomaly detection and several studies have categorised them into different groups. For instance, in [Tang2017171], authors categorised them into the following four groups: distribution-based, distance-based, clustering-based, and density-based methods. Arguably, the most widely accepted categorisation is based on the type of supervision used, i.e., unsupervised, semi-supervised and supervised [Munir20191991]. Except for unsupervised methods, labelled data is required for training models that are semi-supervised or supervised, and as explained earlier, labelling data brings various challenges; therefore, unsupervised models are more favourable [Hodge2004].
Traditional anomaly detection methods try to search the entire dataset and detect anomalies which will result in discovering global anomalies. However, in many real-world problems the data is incomplete and often the application requires a local neighbourhood search for identifying anomalies [su2017n2dlof]. One of the approaches that aims at measuring the outlierness of each data instance based on the density of its local neighbourhood is LOF. However, traditional methods such as LOF fail to achieve an acceptable result when applied to large and high-dimensional datasets, and moreover they tend to require large storage capacity in this context [zhang2018survey][ma2013parallel].
To overcome the aforementioned problems, several approaches have been proposed in which high-dimensional data is transformed into a low-dimensional space while trying to avoid loss of crucial information. Next, anomaly detection is carried out on the low-dimensional data.
Recently, autoencoders have been widely employed for the purpose of reducing the dimensionality of large datasets. An autoencoder is an unsupervised neural network that tries to reconstruct its input at the output layer [hinton2006reducing]. It is made of two main parts, namely an encoder and a decoder. The encoder tries to convert the input features into a different space with lower dimension, while the decoder attempts to reconstruct the original feature space using the output of the encoder, which is known as the bottleneck or discriminative layer. In terms of network structure, autoencoders come in various types such as under-complete, over-complete, shallow or deep. A comprehensive review can be found in [charte2018practical]. Because of their ability to encode data without losing information, AEs have been widely employed in the literature. Unlike other dimensionality reduction methods such as Principal Component Analysis (PCA) that use linear combinations, AEs generally perform nonlinear dimensionality reduction and, according to the literature, perform better than PCA [wang2016auto]. Authors in [sakurada2014anomaly] used an AE for anomaly detection and compared its performance to linear PCA and kernel PCA on both synthetic and real data, concluding that AEs can extract more subtle anomalies than PCA. Another disadvantage of statistical algorithms such as PCA or Zero Component Analysis (ZCA) is that as the dimensionality increases, more memory is required to calculate the covariance matrix [yousefi2017autoencoder].
A well-trained autoencoder should generate small reconstruction errors for each data point; however, autoencoders fail at replicating anomalies because their patterns deviate from the pattern that the majority of data instances follow. In other words, the Reconstruction Error (RE) of anomalies is considerably higher than the RE of normal data. In some studies, such as [aygun2017network], the authors used a threshold-based classification with the reconstruction error as a score to separate normal data from anomalies, i.e., data instances with a reconstruction error above the threshold are identified as anomalies, while anything below the threshold is considered normal. In another similar approach, the authors in [schreyer2017detection] generated an anomaly score based on the reconstruction error and individual attribute probabilities in large-scale accounting data.
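As a concrete illustration of this kind of threshold-based separation, the sketch below flags points whose reconstruction error exceeds a simple statistic of the error distribution. The error values and the mean-plus-one-standard-deviation threshold are illustrative assumptions, not values from the cited works:

```python
import numpy as np

# Hypothetical reconstruction errors (RE) for six points; the last two
# deviate strongly, as anomalies typically would.
re = np.array([0.10, 0.12, 0.09, 0.11, 0.95, 1.20])

# An assumed threshold: mean + 1 standard deviation of the REs.
threshold = re.mean() + re.std()

# Points with RE above the threshold are flagged as anomalies.
is_anomaly = re > threshold
```

With these illustrative values, only the last two points exceed the threshold and would be reported as anomalies.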
The authors in [chen2017outlier] proposed an approach in which ensembles of AEs were used to improve the robustness of the network. In their approach, named Randomised Neural Network for anomaly Detection (RandNet), instead of using fully connected layers, the connections between layers are randomly altered, and anomalies are separated from normal data using the reconstruction error. Their approach showed superior performance compared to four traditional anomaly detection methods including LOF.
To enhance the performance of AEs, authors in [sun2018learning] added new regularisers to the loss function that encourage the normal data to create a dense cluster at the bottleneck layer while the anomalies tend to stay outside of the cluster. Next, they employed various One-Class Classification (OCC) methods such as LOF, Kernel Density Estimation (KDE), and One-Class SVM (OCSVM) to divide the data into anomalies and normal data. They also investigated the influence of altering the training set size on the performance of their models, and the results showed consistency across different training sizes.
As the number of hidden layers increases, the gradients backpropagated to the lower layers tend to attenuate, a phenomenon known as the vanishing gradient problem, culminating in the weights of lower layers showing limited change. To overcome the vanishing gradient issue and discover better features, a pretraining phase was proposed in [yousefi2017autoencoder] in which a stacked Restricted Boltzmann Machine (RBM) was used to obtain suitable weights for initialising the AE. The authors applied their model to the NSL-KDD dataset and achieved high accuracy.
The performance of AEs diminishes when datasets contain anomalies and/or noise which is very prevalent in real-world datasets. By using Denoising AEs (DAEs), it is possible to enhance the accuracy of the network. DAE is an extension of AE in which the network is trained on an anomaly and noise-free set while random noise is added to the input and the AE tries to regenerate the original input, i.e., without the added noise. Authors in [sakurada2014anomaly] used a DAE to obtain meaningful features and decrease the dimensionality of data. Their result proved that DAE is superior to statistical methods such as PCA.
DAEs require an anomaly- and noise-free training set; however, such a training set is not always available. In related work [zhou2017anomaly], the authors proposed an AE, called Robust Deep Autoencoder (RDA), that can extract satisfactory features without having access to an anomaly- or noise-free training set. Inspired by Robust PCA (RPCA), they added a penalty-based filter layer to the network that uses either the l1 or the l2,1 norm, and managed to separate anomalies and noise from normal data.
A common criterion used in AEs is Mean Square Error (MSE). As mentioned before, anomalies and noise tend to cause greater reconstruction errors, i.e., larger MSE; consequently, the network carries out a substantial weight update. This debilitates the accuracy of AEs, as they tend to learn to regenerate anomalies and noise. A possible remedy to this problem is using a criterion that is insensitive to anomalies and noise. Authors in [Qi20146716] proposed a new approach, called Robust Stacked AutoEncoder (R-SAE), in which they used the maximum correntropy criterion to prevent substantial weight updates and make the model robust to anomalies and noise. They tested their model on the MNIST dataset contaminated with non-Gaussian noise, and achieved substantially lower REs compared to a Standard Stacked AutoEncoder (S-SAE).
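The intuition behind a correntropy-style criterion can be sketched as follows: a Gaussian kernel saturates for large errors, so anomalies cannot dominate the weight updates the way they do under MSE. This is a minimal illustration under our own assumptions (including the bandwidth `sigma`), not the exact formulation used in [Qi20146716]:

```python
import numpy as np

# Correntropy-style loss sketch: a Gaussian kernel bounds the penalty for
# large errors, so an anomaly cannot trigger an arbitrarily large update.
def correntropy_loss(x, x_hat, sigma=1.0):
    err = x - x_hat
    kernel = np.exp(-(err ** 2) / (2 * sigma ** 2))
    # Loss is bounded in [0, 1], unlike MSE, which grows quadratically.
    return (1.0 - kernel).mean()
```

A perfect reconstruction gives a loss of exactly 0, while even an extreme error saturates near 1 instead of exploding quadratically.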
III Proposed Method
In the previous section, different approaches for extracting meaningful features from high-dimensional datasets were reviewed and also previous anomaly detection models were explained. In this section, we will explain how our model robustly transforms high-dimensional data into a low-dimensional space and detects anomalies.
III-A Local Outlier Factor
LOF is a density-based anomaly detection method in which the assumption is that anomalies tend to stay outside of dense neighbourhoods because of their peculiar characteristics that distance them from inliers [Breuniq200093]. The method generates a score that shows the outlierness of each data point based on its local density. A low score indicates that the query point is an inlier, while a high score shows that the query point is an anomaly. The algorithm has one parameter, k, which is the minimum number of neighbours that each data point is expected to have inside its neighbourhood.

It is possible to divide the anomaly detection process of LOF into three steps. In the first step, LOF tries to find the minimum distance, called the k-distance, needed to hold at least k neighbours. Next, the algorithm measures the reachability distance, defined as:

reach-dist_k(p, o) = max{ k-distance(o), d(p, o) },

which is equal to the k-distance of point o if the query point p is also inside point o's neighbourhood; otherwise, it is equal to the actual distance between the two points. In the second step, the local reachability density (LRD), which is the inverse of the average reachability distance of the data point from its neighbours, is measured. LRD is defined as:

lrd_k(p) = 1 / ( Σ_{o ∈ N_k(p)} reach-dist_k(p, o) / |N_k(p)| ),

in which |N_k(p)| denotes the number of neighbours in data point p's neighbourhood that is used to detect anomalies. In the final step, the LRD of the query point is compared with the LRD of its neighbours using the following equation:

LOF_k(p) = ( Σ_{o ∈ N_k(p)} lrd_k(o) / lrd_k(p) ) / |N_k(p)|.

If the density of point p is very close to that of its neighbours, the value of LOF stays around 1, while for inliers it is less than 1 and for anomalies it is greater than 1. It is worth mentioning that it is very common to apply a simple threshold-based clustering here to separate anomalies.
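The behaviour described above can be reproduced with an off-the-shelf LOF implementation. The sketch below uses scikit-learn's `LocalOutlierFactor` (not the implementation used in the paper) on synthetic data: one isolated point far from a dense cluster receives a LOF score well above 1 and is flagged as an anomaly:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
# A dense cluster of normal points plus one far-away anomaly.
normal = rng.normal(loc=0.0, scale=0.5, size=(100, 2))
anomaly = np.array([[8.0, 8.0]])
X = np.vstack([normal, anomaly])

# n_neighbors corresponds to the parameter k discussed above.
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)               # -1 for anomalies, 1 for inliers
scores = -lof.negative_outlier_factor_    # LOF scores: ~1 inlier, >1 anomaly
```

The isolated point's LOF score is far above 1, while the clustered points score close to 1, matching the interpretation given in the text.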
III-B Autoencoders

Autoencoders are unsupervised artificial neural networks that contain two components [hinton2006reducing]. The first component, known as the encoder, tries to transform the high-dimensional input data into a low-dimensional feature space, known as the bottleneck, while the second component, known as the decoder, attempts to reconstruct the input data from the bottleneck. The difference between the reconstructed and input data is called the reconstruction error. In each training iteration, the network measures the reconstruction error, computes the gradient of the error with respect to the network parameters (e.g. weights), and backpropagates these gradients through the network in order to update the weights to minimise the reconstruction error, i.e., to increase the resemblance between the generated output and the input. AEs can adopt various structures. In Figure 1, the structure of an AE known as a Stacked Autoencoder (SAE) is shown, in which multiple layers are stacked to form a deep neural network.
As shown in Figure 2, the most basic AE, with only one hidden layer, tries to transform input x into latent vector z using an encoder represented by function f. Next, the latent vector z is reconstructed by a decoder, represented as function g, into output x̂, where the dissimilarity between x and x̂ is called the reconstruction error. Having a training set X = {x_1, x_2, …, x_N}, where N is the number of data points in X and x_i is the i-th data point with d features, the encoder is then defined as:

z = f(Wx + b),

while the decoder is defined as:

x̂ = g(W′z + b′),

where both f and g represent the activation functions, which are most often non-linear, W and W′ denote the weight matrices, while b and b′ represent the bias vectors.
It is worth mentioning that, in general, certain restrictions or regularisation techniques should be applied to an AE in order to prevent the network from learning the identity function. Otherwise, the network will simply copy the input through the network. A common solution to overcome this issue is by using a denoising AE. In a DAE, the network tries to reconstruct inputs that are partially corrupted. Bottlenecks and sparsity constraints applied to the main hidden layer are additional ways to avoid trivially learning the identity function.
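The encoder/decoder structure described above can be sketched as a minimal under-complete AE. The following is a hypothetical PyTorch implementation with illustrative layer sizes, not the architecture used in the experiments later in the paper:

```python
import torch
import torch.nn as nn

# Minimal stacked autoencoder sketch: 20 input features compressed to a
# 4-dimensional bottleneck. The narrow bottleneck is the restriction that
# prevents the network from trivially learning the identity function.
class Autoencoder(nn.Module):
    def __init__(self, n_features=20, n_bottleneck=4):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 12), nn.Tanh(),
            nn.Linear(12, n_bottleneck), nn.Tanh(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(n_bottleneck, 12), nn.Tanh(),
            nn.Linear(12, n_features),
        )

    def forward(self, x):
        z = self.encoder(x)           # latent variables at the bottleneck
        return self.decoder(z), z     # reconstruction and latent code

model = Autoencoder()
x = torch.randn(8, 20)                # a batch of 8 illustrative points
x_hat, z = model(x)
```

The reconstruction error would then be a loss between `x` and `x_hat`, while `z` is what the second stage (LOF) consumes.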
III-C AEGR: The Proposed Model for Anomaly Detection in Large Datasets
The proposed AEGR (autoencoder with gradient reversal) approach consists of two main components. The first component is an AE which transforms the high-dimensional data into compressed data while preserving important latent variables for the following component. The second component employs a basic LOF approach on the features obtained from the first component to separate anomalies from normal data points.
According to the literature [Qi20146716], the performance of AEs diminishes when the data exhibits noise and anomalies. To explain further, the network tries to reduce the reconstruction error by learning how to reconstruct noise and anomalies. During the training phase, anomalies tend to produce a greater reconstruction error, as their patterns deviate from the distribution that the majority of data points follow. The network thus disproportionately backpropagates the corresponding error gradients and performs a substantial weight update to be able to reproduce these patterns with lower error. However, in the context of anomaly detection, this is not desired. In fact, in previous works, various approaches have tried to prevent the network from becoming more accurate at reproducing anomalies. Most of these approaches, such as the DAE, require access to a noise- and anomaly-free set for training the network, while our model does not require such a training set. Authors in [ganin2014unsupervised] added a reversal layer to their network for the purpose of domain adaptation. The approach proposed in this work is different in several ways, including the context within which it is applied and the fact that, unlike the approach in [ganin2014unsupervised], it is based on the reconstruction error.
In AE with gradient reversal (AEGR), the AE component tries to make the network insensitive to anomalies by manipulating gradients. First, as shown in Algorithm 1, at each epoch a gradient score (GS) is given to each data point (or each mini-batch in the case of using mini-batch gradient descent) based on the gradients in the bottleneck. This is measured using the Frobenius norm, which is defined as:

‖A‖_F = √( Σ_i Σ_j |a_ij|² ),

where A denotes a matrix and a_ij represents the element at the i-th row and j-th column. At the bottleneck layer, the Frobenius norm of each node is computed. The bottleneck is the layer from which the latent variables are extracted to be used in LOF; therefore, the network should be penalised based on the magnitude of its gradients in this layer rather than in every layer. Then, the norm of the bottleneck layer's gradient matrix G_i is measured and stored as:

GS_i = ‖G_i‖_F,   i = 1, …, m,

where m denotes the number of data points (or mini-batches in case of using mini-batch gradient descent) in the network. The approach has a single parameter, the reversal epoch, which is the epoch number at which the network starts reversing its gradients.
Assuming that anomalies produce greater REs, at each epoch, the gradient of the data point that holds the largest GS is picked, and the inverse of its gradient is used to perform a new weight update, i.e., reverting the substantial weight update caused by this data point. In order to avoid reverting the gradients of the same data points at each epoch, batches are shuffled before carrying out the next epoch.
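One possible reading of this procedure can be sketched in PyTorch. All names (`aegr_epoch`, `TinyAE`) and sizes are illustrative, and reversing every parameter's gradient for the worst batch is our interpretation of the reverted update, not the authors' exact Algorithm 1:

```python
import torch
import torch.nn as nn

# Sketch of one AEGR epoch: after the normal mini-batch updates, the batch
# whose bottleneck-layer gradient had the largest Frobenius norm (GS) is
# revisited, and a step along its *reversed* gradient undoes the
# substantial update it caused.
def aegr_epoch(model, bottleneck_layer, batches, loss_fn, lr=0.01):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    scores = []
    for x in batches:
        opt.zero_grad()
        x_hat, _ = model(x)
        loss_fn(x_hat, x).backward()
        # Gradient score: Frobenius norm of the bottleneck weight gradient.
        scores.append(bottleneck_layer.weight.grad.norm(p="fro").item())
        opt.step()
    # Revisit the highest-scoring batch and reverse its gradients.
    worst = batches[scores.index(max(scores))]
    opt.zero_grad()
    x_hat, _ = model(worst)
    loss_fn(x_hat, worst).backward()
    for p in model.parameters():
        if p.grad is not None:
            p.grad.neg_()  # flip gradient sign in place
    opt.step()             # the update now counters the earlier one
    return scores

# Tiny illustrative autoencoder to exercise the sketch.
class TinyAE(nn.Module):
    def __init__(self, n_features=10, n_bottleneck=3):
        super().__init__()
        self.enc = nn.Linear(n_features, n_bottleneck)
        self.dec = nn.Linear(n_bottleneck, n_features)

    def forward(self, x):
        z = torch.tanh(self.enc(x))
        return self.dec(z), z

torch.manual_seed(0)
model = TinyAE()
batches = [torch.randn(16, 10) for _ in range(5)]
scores = aegr_epoch(model, model.enc, batches, nn.SmoothL1Loss())
```

Shuffling `batches` between epochs, as the text prescribes, would keep the reversal from repeatedly targeting the same points.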
In the second stage, in which LOF is used to separate anomalies from normal data points, three different approaches are proposed. The first approach merely uses the latent variables extracted from the previous stage without any alteration, while in the second approach, a threshold-based clustering method based on the reconstruction error of the training set is applied to prune data points that generated an error above the mean value. The assumption behind this approach is that the training set becomes more homogeneous and consequently LOF can perform more robustly. However, this approach also reduces the number of training points, which can weaken the performance of one-class classifiers [Khan2010]. Therefore, a third approach is proposed in which the pruned training data is augmented by adding random Gaussian noise. The assumption is that data augmentation can homogeneously increase the size of the training set and improve anomaly detection performance.
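The pruning and augmentation steps can be sketched as follows. The latent variables, reconstruction errors, and the noise scale are synthetic placeholders, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical latent variables and their per-point reconstruction errors.
z_train = rng.normal(size=(200, 4))
re = rng.gamma(shape=2.0, scale=0.1, size=200)

# Pruning: drop training points whose RE is above the mean RE.
keep = re <= re.mean()
z_pruned = z_train[keep]

# Augmentation: enlarge the pruned set with Gaussian-noise copies,
# compensating for the points removed by pruning.
noise = rng.normal(scale=0.05, size=z_pruned.shape)
z_augmented = np.vstack([z_pruned, z_pruned + noise])
```

`z_augmented` is then what LOF would be fitted on in the third variant.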
IV Experimental Results

In this section, the performance of our model is presented, evaluated and compared with 1) the basic stand-alone LOF algorithm; 2) a deep AE whose reconstruction error is used for separating anomalies; and 3) a deep AE whose latent variables are fed to LOF for detecting anomalies. One of the widely used evaluation metrics in this area is Receiver Operating Characteristic (ROC) AUC; however, authors in [Provost:1997] argue that when the dataset is highly imbalanced, ROC AUC is not a suitable metric and a better alternative is the area under the Precision-Recall curve (PR AUC), in particular when working with high-dimensional data in which the positive class, i.e., anomalies, is more important than the negative class, i.e., normal points. Nonetheless, no single evaluation metric dominates the others; therefore, both ROC AUC and PR AUC were used for evaluation.
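Both metrics are straightforward to compute from anomaly scores. The sketch below uses scikit-learn on a synthetic, imbalanced labelling (95 normal points, 5 anomalies); the score distributions are illustrative assumptions:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Hypothetical anomaly scores for an imbalanced set: 95 normal, 5 anomalies.
y_true = np.array([0] * 95 + [1] * 5)
rng = np.random.default_rng(1)
scores = np.concatenate([rng.uniform(0.0, 0.6, 95),   # normal points
                         rng.uniform(0.4, 1.0, 5)])   # anomalies

roc_auc = roc_auc_score(y_true, scores)
# average_precision_score is a common summary of the PR curve (PR AUC).
pr_auc = average_precision_score(y_true, scores)
```

On heavily imbalanced data like this, PR AUC is typically far below ROC AUC for the same scores, which is why the text favours reporting both.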
IV-A Datasets

As presented in Table I, 8 publicly available datasets that are widely used in this context were employed. It is worth mentioning that 5 network datasets that are very common in the domain of network anomaly detection were used alongside 3 non-network datasets. Testing the model on various datasets makes it possible to evaluate performance more thoroughly. It is worth noting that the categorical features of two datasets, namely NSL-KDD and UNSW-NB15, were preprocessed by applying a one-hot encoder. Four datasets, namely Spambase, InternetAds, Arrhythmia and CTU13, have no separate training and test sets; therefore, part of each dataset was used for training while the rest was evenly split between validation and testing. For the remaining datasets, which come with a separate training and test set, a portion of the training set was held out for validation. Also, as suggested in [Cao20193074], the training sets of UNSW-NB15 and NSL-KDD are substantially larger than those of the other datasets; therefore, only a fraction of each training set was used. In this paper, it was found necessary to apply the same size-reduction approach to the CTU13 and Shuttle datasets. The details of each dataset are shown in Table I.
The following datasets were obtained from the UCI Machine Learning Repository: PenDigits, Shuttle, Spambase, and InternetAds [UCI_repo]. The CTU13-08 dataset, released in 2011, is a botnet dataset and publicly available [GARCIA2014100]. The NSL-KDD dataset is a new version of its predecessor, KDD'99, in which some of the intrinsic issues of the old version, mentioned in [tavallaee2009], are resolved. This dataset has 41 features, of which 3 are categorical; after applying a one-hot encoder, the number of features increased to 122. A similar preprocessing step was carried out on the UNSW-NB15 dataset [UNSW-NB15], with 3 categorical features, which increased the number of features from 42 to 190. While 5 datasets contain only a single type of anomaly, three network datasets, namely UNSW-NB15, NSL-KDD and CTU13, include different types of anomalies (i.e., network attacks). The experiment was carried out on each of these attack types.
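The one-hot preprocessing step works as in the sketch below, where each categorical column expands into one binary column per category. The example feature values are illustrative stand-ins for NSL-KDD-style categorical fields, not actual dataset rows:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Illustrative categorical columns (e.g. protocol and service fields).
X_cat = np.array([["tcp", "http"],
                  ["udp", "dns"],
                  ["tcp", "ftp"]])

enc = OneHotEncoder(handle_unknown="ignore")
# 2 protocols + 3 services -> 5 binary columns in total.
X_onehot = enc.fit_transform(X_cat).toarray()
```

This is how 3 categorical features can inflate NSL-KDD from 41 to 122 columns: each feature contributes as many columns as it has distinct categories.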
IV-B AEGR Architecture and Parameters
The AE architecture varies slightly based on the dataset it is applied to; however, the same architecture was used for both AE and AEGR. The number of layers was set to 5 for all the datasets, as suggested in [ERFANI2016121], with the number of nodes in the bottleneck set as a function of the number of input features, following [CaoVan2016]. Table II shows the number of nodes in the bottleneck for each dataset.
The mini-batch size was set according to the number of instances in each dataset, with a larger mini-batch used for the larger datasets. To avoid overfitting, a simple early-stopping heuristic was implemented to stop the training process when the network was no longer learning after a certain number of iterations or when the learning improvement was insignificant.
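A minimal early-stopping heuristic of the kind described above can be sketched as follows. The function name, `patience`, and `min_delta` are illustrative assumptions, not the paper's exact rule:

```python
# Stop when the validation loss has not improved by at least `min_delta`
# over the last `patience` checks.
def should_stop(val_losses, patience=5, min_delta=1e-4):
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    recent_best = min(val_losses[-patience:])
    return recent_best > best_before - min_delta

# Loss plateaus after an initial drop, so training would be halted here.
losses = [1.0, 0.6, 0.4, 0.35, 0.35, 0.35, 0.35, 0.35, 0.35]
```

The same check can be run once per epoch against the validation loss history.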
The loss function used in this experiment was SmoothL1, which is essentially a combination of L2 and L1 terms, i.e., it uses a squared (L2) term if the absolute element-wise error is less than 1 and an L1 term otherwise. To minimise the loss function, Stochastic Gradient Descent (SGD) was used.
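The element-wise rule just described can be written out explicitly. The sketch below mirrors the standard SmoothL1 (Huber-style) definition with the switch point at 1:

```python
import numpy as np

# SmoothL1 per element: 0.5 * e^2 when |e| < 1 (quadratic near zero),
# |e| - 0.5 otherwise (linear, so large errors are penalised less harshly
# than under pure MSE).
def smooth_l1(x, x_hat):
    diff = np.abs(x - x_hat)
    return np.where(diff < 1, 0.5 * diff ** 2, diff - 0.5).mean()
```

For an absolute error of 0.5 the loss is 0.125 (quadratic branch); for an error of 2 it is 1.5 (linear branch), rather than the 2.0 that MSE/2 would give.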
All the datasets were normalised, and the hyperbolic tangent activation function was used for all the layers except the last layer. The activation function is defined as:

tanh(x) = (e^x − e^{−x}) / (e^x + e^{−x}),

where the output range is (−1, 1).
IV-C LOF Parameters
The LOF algorithm has only one parameter, k, that needs to be set. This value indicates the minimum number of neighbours that each data instance is required to have when computing its density. Throughout our experiments, it was set to a constant value.
IV-D Analysis and Discussion
The performance of AEGR-LOF was compared with three different approaches. Besides employing the traditional LOF, an autoencoder with LOF (AE-LOF) and an autoencoder with RE (AE-RE) were used. In AE-LOF, the network does not use gradient reversal, and its bottleneck is fed to the next stage for separating normal instances from anomalies. In AE-RE, the reconstruction error of the AE is used for detecting anomalies. It is worth mentioning that a denoising AE was not used in the experiment, as it needs an anomaly-free training set, and the purpose of this research is to avoid relying on a clean training set. Also, it is possible to achieve higher performance with LOF if a clean training set is used in advance to fit LOF first and use it as a one-class classifier [Cao20193074], as has already been done in the literature; however, in this work, LOF is trained on the data extracted from the AE.
As mentioned earlier, the evaluation metrics used here are PR AUC and ROC AUC. The ROC AUC is the area under the receiver operating characteristic (ROC) curve, which plots the true positive rate against the false positive rate at various thresholds. The range of the AUC is [0, 1], where values closer to 1 indicate that the model is performing better. In PR AUC, the true negatives have no impact on the metric. Instead, it reports the relationship between precision and recall at various thresholds. Similar to ROC AUC, the range is [0, 1], in which values closer to 1 indicate better performance.
It can be seen in Table III that the proposed model outperformed the other approaches in almost every scenario (i.e., different types of network attack) in terms of both metrics. Stand-alone LOF only showed superior results when applied to detect DoS attacks, while it also showed good results alongside other approaches when used to identify Worms. The UNSW-NB15 dataset is a high-dimensional dataset, and the results support the assumption that LOF performs poorly when applied to this type of dataset. NSL-KDD and CTU13 are two network datasets that are widely used in the literature. As shown in Table IV, the proposed model outperformed the other approaches by a good margin except in one case, i.e., when applied to CTU13-08.
Table V shows the performance when applied to datasets with a single type of anomaly. Based on the results, the proposed approach outperformed other models except when working on the Shuttle dataset. In particular, AEGR-LOF produced higher PR AUC and ROC AUC when detecting anomalies from the two single-type network datasets, i.e., Spambase and InternetAds.
Figure 3 illustrates the latent variables produced by the simple AE and the AEGR model. While Figure 2(a) and Figure 2(c) show the latent variables extracted from the bottleneck of the simple AE, Figure 2(b) and Figure 2(d) present the latent variables generated by the proposed model. Looking at Figure 2(a) and Figure 2(b), which show the latent variables before pruning, it can be seen that while the simple AE managed to regenerate some anomalies correctly and separate them from the normal data points, the AEGR model failed at separating anomalies due to the constraint applied, i.e., they stayed close to the dense area. However, this failure means that the AEGR model should have generated high reconstruction errors for anomalies. After applying a simple pruning based on the reconstruction error (Figure 2(c) and Figure 2(d)), the latent variables created by the AEGR model formed a denser cluster compared to the simple AE, which led to better performance from LOF, as it can define a better class boundary when the training set is very dense and homogeneous.
Overall, AEGR-LOF with pruning, or with pruning and data augmentation, achieved 17 best results out of 22 comparisons. Therefore, it can be concluded that AEGR-LOF is significantly better than the other approaches, which shows the importance of regularisation in AEs when used in anomaly detection problems. To support this conclusion further, the Wilcoxon signed-rank test [kerby2014simple] was carried out to verify whether the improvement was significant. The NSL-KDD (DoS) dataset was randomly selected, and the test was carried out on both LOF and AEGR-LOF with pruning and data augmentation over repeated runs. By feeding the PR AUCs to the Wilcoxon test, it was confirmed that the results are significantly different. It is worth recalling that carrying out repeated Wilcoxon tests on multiple algorithms is not recommended, as it increases the chance of rejecting a certain proportion of the null hypotheses merely by random chance [demvsar2006statistical].
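The significance test can be reproduced with SciPy's paired Wilcoxon signed-rank test. The PR AUC values below are fabricated for illustration only (they are not the paper's measurements); the point is the shape of the procedure:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical paired PR AUCs from repeated runs of the two models; the
# second model is consistently better, so the test should reject the null.
pr_auc_lof  = np.array([0.41, 0.43, 0.40, 0.44, 0.42, 0.45, 0.39, 0.43])
pr_auc_aegr = np.array([0.55, 0.58, 0.54, 0.57, 0.56, 0.59, 0.53, 0.58])

stat, p_value = wilcoxon(pr_auc_lof, pr_auc_aegr)
significant = p_value < 0.05
```

Because the test is paired, each run of LOF must be matched with the corresponding run of AEGR-LOF on the same data split.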
V Conclusion

In this paper, a new model is proposed to detect network anomalies, particularly in large datasets that traditional algorithms such as LOF are incapable of dealing with. In the proposed approach, a novel autoencoder called AEGR is utilised to reduce the dimensionality of large datasets, transforming the data into a lower-dimensional space while minimising the loss of vital features. Normal AEs fail to produce satisfactory results when the data is polluted with noise and anomalies because the network learns to replicate them together with normal data instances. Unlike other approaches that either use a loss function insensitive to anomalies or train the network by injecting noise, the unsupervised model presented in this work, at each epoch, finds the data instances that caused the largest weight update and then applies the inverted backpropagated gradients to counter that update. Finally, the original LOF is applied to the features extracted from the bottleneck of the autoencoder in order to separate normal instances from anomalies. Based on the results achieved from the experiments conducted on the eight datasets, the AEGR-LOF model is capable of achieving better results than the traditional LOF and other similar approaches, such as a simple autoencoder followed by a threshold-based classifier. The performance of the proposed model was evaluated using two metrics, and overall the AEGR model showed superior results.