adVAE: A Self-adversarial Variational Autoencoder
with Gaussian Anomaly Prior Knowledge
for Anomaly Detection[1]

[1] No author associated with this paper has disclosed any potential or pertinent conflicts that may be perceived to have an impending conflict with this work.
Abstract
Recently, deep generative models have become increasingly popular in unsupervised anomaly detection. However, deep generative models aim at recovering the data distribution rather than detecting anomalies. Moreover, deep generative models risk overfitting the training samples, which has disastrous effects on anomaly detection performance. To solve these two problems, we propose a self-adversarial variational autoencoder (adVAE) with a Gaussian anomaly prior assumption. We assume that both the anomalous and the normal prior distributions are Gaussian and have overlaps in the latent space. Therefore, a Gaussian transformer net T is trained to synthesize anomalous but near-normal latent variables. Keeping the original training objective of a variational autoencoder, a generator G tries to distinguish between the normal latent variables encoded by E and the anomalous latent variables synthesized by T, and the encoder E is trained to discriminate whether the output of G is real. These added objectives not only give both G and E the ability to discriminate, but also act as an additional regularization mechanism to prevent overfitting. Compared with other competitive methods, the proposed model achieves significant improvements in extensive experiments. The employed datasets and our model are available in a GitHub repository.
keywords:
anomaly detection, outlier detection, novelty detection, deep generative model, variational autoencoder

1 Introduction
Anomaly detection (or outlier detection) can be regarded as the task of identifying rare data items that differ from the majority of the data. Anomaly detection is applicable in a variety of domains, such as intrusion detection, fraud detection, fault detection, health monitoring, and security checking (1; 2; 3; 4; 5). Owing to the lack of labeled anomaly samples, there is a large skew between the normal and anomaly class distributions. Some attempts (6; 7; 8) use imbalanced-learning methods to improve the performance of supervised anomaly detection models. Moreover, unsupervised models are more popular than supervised models in the anomaly detection field. Reference (9) comprehensively reviewed machine-learning-based anomaly detection algorithms.
Recently, deep generative models have become increasingly popular in anomaly detection (10). A generative model can learn a probability distribution model by being trained on an anomalyfree dataset. Afterwards, outliers can be detected by their deviation from the probability model. The most famous deep generative models are variational autoencoders (VAEs) (11) and generative adversarial networks (GANs) (12).
VAEs have been used in many anomaly detection studies. The first work using VAEs for anomaly detection (13) declared that VAEs generalize more easily than autoencoders (AEs) because VAEs work on probabilities. (14; 15) used different types of RNN-VAE architectures to recognize outliers in time series data. (1) and (16) implemented VAEs for intrusion detection and internet server monitoring, respectively. Furthermore, GANs and adversarial autoencoders (17) have also been introduced into image (4; 18) and video (19) anomaly detection.
However, there are two serious problems in deepgenerativemodelbased anomaly detection methods.
(1) Deep generative models only aim at recovering the data distribution of the training set, which contributes only indirectly to detecting anomalies. Earlier studies paid little attention to customizing their models for anomaly detection tasks. Consequently, a serious problem arises: those models only learn from the available normal samples, without attempting to discriminate anomalous data. Owing to this lack of discrimination, it is hard for such models to learn deep representations that are useful for anomaly detection tasks.
(2) Plain VAEs use the regularization of the Kullback–Leibler divergence (KLD) to limit the capacity of the encoder, but no regularization is imposed on the generator. Because neural networks are universal function approximators, generators can, in theory, cover any probability distribution, even without depending on the latent variables (20). However, previous attempts (21; 22) have found it hard to benefit from using an expressive and powerful generator. (20) uses Bits-Back Coding theory (23) to explain this phenomenon: if the generator can model the data distribution without using information from the latent code z, it is more inclined not to use z. In this case, to reduce the KLD cost, the encoder loses much information about x and maps x to the simple prior (e.g., N(0, I)) rather than the true posterior. Once this undesirable training phenomenon occurs, VAEs tend to overfit the distribution of the existing normal data, which leads to bad results (e.g., a high false positive rate), especially when the data are sparse. Therefore, the capacity of VAEs should be limited by an additional regularization in anomaly detection tasks.
There are only a handful of studies that attempt to solve the above two problems. The MO-GAAL method (24) makes the generator of a GAN stop updating before convergence in the training process and uses the non-converged generator to synthesize outliers. Hence, the discriminator can be trained to recognize outliers in a supervised manner. To mitigate the mode collapse issue of GANs, (24) expands the network structure from a single generator to multiple generators with different objectives. (25) proposed the assumption that the anomaly prior distribution is the complementary set of the normal prior distribution in latent space. However, we believe that this assumption may not hold. If the anomalous and the normal data had complementary distributions, meaning that they were separated in the latent space, then a simple method (such as KNN) could detect anomalies with satisfactory results, but this is not the case. Both normal data and outliers are generated by some natural pattern. Natural data ought to conform to a common data distribution, and it is hard to imagine a natural pattern that produces such a strange distribution.
To enhance the ability of deep generative models to distinguish between normal and anomalous samples and to prevent them from overfitting the given normal data, we propose a self-adversarial variational autoencoder (adVAE) with a Gaussian anomaly prior assumption and a self-adversarial regularization mechanism. The basic idea of this self-adversarial mechanism is to add discrimination training objectives to the encoder and the generator through adversarial training. These additional objectives solve the above two problems at the same time; the details are as follows.
The encoder can be trained to discriminate between an original sample and its reconstruction, but we do not have any anomalous latent code with which to train the generator. To synthesize anomalous latent code, we propose a Gaussian anomaly hypothesis to describe the relationship between the normal and the anomalous latent space. Our assumption is described in Figure 1; both the anomalous and the normal prior distributions are Gaussian and have overlaps in the latent space. It is an extraordinarily weak and reasonable hypothesis, because the Gaussian distribution is the most widespread in nature. The basic structure of our self-adversarial mechanism is shown in Figure 1. The encoder E is trained to discriminate between the original sample x and its reconstruction x_r, and the generator G tries to distinguish between the normal latent variables encoded by E and the anomalous ones synthesized by the Gaussian transformer T. These new objectives not only give G and E the ability to discern, but also introduce an additional regularization that prevents model overfitting.
Our training process can be divided into two steps: (1) T tries to mislead the generator G; meanwhile, G works as a discriminator. (2) G generates realistic-like samples, and the encoder E acts as a discriminator of G. To make the training phase more robust, inspired by (26), we alternate between the above two steps in each mini-batch iteration.
Our main contributions are summarized as follows:

We propose a novel and important concept: deep generative models should be customized to learn to discriminate outliers, rather than being applied to anomaly detection directly without any suitable customization.

We propose a novel self-adversarial mechanism, a prospective customization of the plain VAE, enabling both the encoder and the generator to discriminate outliers.

The proposed self-adversarial mechanism also provides the plain VAE with a novel regularization, which significantly helps VAEs avoid overfitting the normal data.

We propose a Gaussian anomaly prior assumption that describes the data distribution of anomalous latent variables. Moreover, we propose a Gaussian transformer net T to integrate this prior knowledge into deep generative models.
2 Preliminary
2.1 Conventional Anomaly Detection
Anomaly detection methods can be broadly categorized into probabilistic, distance-based, boundary-based, and reconstruction-based approaches.
(1) The probabilistic approach, such as GMM (27) and KDE (28), uses statistical methods to estimate the probability density function of the normal class. A data point is defined as an anomaly if it has low probability density. (2) The distance-based approach assumes that normal data are tightly clustered, while anomalous data occur far from their nearest neighbours. These methods depend on a well-defined similarity measure between two data points. The basic distance-based methods are LOF (29) and its modification (30). (3) The boundary-based approach, mainly involving OC-SVM (31) and SVDD (32), typically tries to define a boundary around the normal-class data. Whether an unknown data point is an anomaly is determined by its location with respect to the boundary. (4) The reconstruction-based approach assumes that anomalies are incompressible and thus cannot be effectively reconstructed from low-dimensional projections. In this category, PCA (33) and its variations (34; 35) are widely used, effective techniques to detect anomalies. Besides, AE-based and VAE-based methods also belong to this category; they will be explained in detail in the next two subsections.
2.2 Autoencoderbased Anomaly Detection
An AE, which is composed of an encoder E and a decoder G, is a neural network used to learn reconstructions as close as possible to its original inputs. Given a datapoint x ∈ R^d (d is the dimension of x), the loss function can be viewed as minimizing the reconstruction error between the training data and the outputs of the AE, where ϕ and θ denote the hidden parameters of the encoder E and the decoder G:

(1) \mathcal{L}_{AE}(\phi, \theta) = \left\| x - G_{\theta}\!\left(E_{\phi}(x)\right) \right\|_{2}^{2}
After training, the reconstruction error of each test data will be regarded as the anomaly score. The data with a high anomaly score will be defined as anomalies, because only the normal data are used to train the AE. The AE will reconstruct normal data very well, while failing to do so with anomalous data that the AE has not encountered.
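As a minimal sketch of this scoring rule, the toy PyTorch autoencoder below assigns each sample its summed squared reconstruction error as the anomaly score. The architecture, layer sizes, and data here are illustrative only, not those used in any experiment:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class ToyAE(nn.Module):
    """Minimal autoencoder; layer sizes are illustrative only."""
    def __init__(self, d: int, h: int = 8):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(d, h), nn.ReLU())
        self.decoder = nn.Linear(h, d)

    def forward(self, x):
        return self.decoder(self.encoder(x))

def anomaly_score(model, x):
    # Per-sample summed squared reconstruction error: higher means more anomalous.
    with torch.no_grad():
        x_hat = model(x)
    return ((x - x_hat) ** 2).sum(dim=1)

model = ToyAE(d=16)
x = torch.randn(4, 16)            # a toy mini-batch standing in for test data
scores = anomaly_score(model, x)  # one non-negative score per sample
```

In practice the model would first be fitted on normal data; here the untrained network only demonstrates the shape of the scoring interface.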
2.3 VAEbased Anomaly Detection
The net architecture of VAEs is similar to that of AEs, with the difference that the encoder of a VAE forces the representation code z to obey some type of prior probability distribution p(z) (e.g., N(0, I)). Then, the decoder generates new realistic data with code z sampled from p(z). In VAEs, the encoder and decoder conditional distributions are denoted as q_ϕ(z|x) and p_θ(x|z), respectively. The data distribution p_θ(x) is intractable by analytic methods, and thus variational inference methods are introduced to solve the maximum likelihood log p_θ(x):

(2) \log p_{\theta}(x) = D_{KL}\!\left(q_{\phi}(z|x) \,\|\, p_{\theta}(z|x)\right) + \mathcal{L}(\theta, \phi; x), \quad \mathcal{L}(\theta, \phi; x) = -D_{KL}\!\left(q_{\phi}(z|x) \,\|\, p(z)\right) + \mathbb{E}_{q_{\phi}(z|x)}\!\left[\log p_{\theta}(x|z)\right]
KLD is a similarity measure between two distributions. To estimate this maximum likelihood, a VAE needs to maximize the evidence lower bound (ELBO) \mathcal{L}(\theta, \phi; x). To optimize the KLD between q_ϕ(z|x) and p(z), the encoder estimates the parameter vectors of the Gaussian distribution q_ϕ(z|x): the mean μ and the standard deviation σ. There is an analytical expression for their KLD, because both q_ϕ(z|x) and p(z) are Gaussian. To optimize the second term of equation (2), VAEs minimize the reconstruction errors between the inputs and the outputs. Given a datapoint x, the objective function can be rewritten as

(3) \mathcal{L}_{VAE} = \mathcal{L}_{MSE}(x, x_r) + \lambda\, \mathcal{L}_{KLD}(\mu, \sigma)

(4) \mathcal{L}_{MSE}(x, x_r) = \left\| x - x_r \right\|_{2}^{2}

(5) \mathcal{L}_{KLD}(\mu, \sigma) = -\frac{1}{2} \sum_{j=1}^{d_z} \left( 1 + \log \sigma_j^{2} - \mu_j^{2} - \sigma_j^{2} \right)

where x_r denotes the reconstruction and d_z is the dimension of the latent space.
The first term is the mean squared error (MSE) between the inputs and their reconstructions. The second term regularizes the encoder by encouraging the approximate posterior q_ϕ(z|x) to match the prior p(z). To hold the tradeoff between these two targets, the KLD term is multiplied by a scaling hyperparameter λ.
AEs define the reconstruction error as the anomaly score in the test phase, whereas VAEs use the reconstruction probability (13) to detect outliers. To estimate the probabilistic anomaly score, VAEs sample z from the latent distribution L times and calculate the average reconstruction error as the reconstruction probability. This is why VAEs work more robustly than traditional AEs in the anomaly detection domain.
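A sketch of this sampling-based score, assuming the encoder has already produced (μ, σ) for a test point; the linear `decode` function below is a toy stand-in for a trained generator, not an actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def reconstruction_score(x, mu, sigma, decode, L=100, rng=rng):
    """Average reconstruction error over L samples z ~ N(mu, sigma^2)."""
    errors = np.empty(L)
    for l in range(L):
        z = mu + sigma * rng.standard_normal(mu.shape)  # reparameterization trick
        errors[l] = np.sum((x - decode(z)) ** 2)
    return errors.mean()

# Toy stand-ins for a trained encoder/decoder (illustrative only).
d, dz = 8, 2
W = rng.standard_normal((dz, d))
decode = lambda z: z @ W               # placeholder linear "generator"
x = rng.standard_normal(d)             # a test point
mu, sigma = np.zeros(dz), np.ones(dz)  # pretend encoder output for x
score = reconstruction_score(x, mu, sigma, decode)
```

Averaging over many latent samples smooths out the randomness of a single draw, which is what makes this score more robust than a single deterministic reconstruction error.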
2.4 GANbased Anomaly Detection
Since GANs (12) were first proposed in 2014, they have become increasingly popular and have been applied to diverse tasks. A GAN comprises two components, called the generator and the discriminator, which contest with each other in a cat-and-mouse game. The generator creates samples that resemble the real data, while the discriminator tries to distinguish the fake samples from the real ones. The generator of a GAN synthesizes informative potential outliers to assist the discriminator in describing a boundary that can separate outliers from normal data effectively (24). When a sample is input into a trained discriminator, the output of the discriminator is defined as the anomaly score. However, suffering from the mode collapse problem, GANs usually cannot learn the boundaries of normal data well, which reduces the effectiveness of GANs in anomaly detection applications. To solve this problem, (24) proposes MO-GAAL, which stops optimizing the generator before convergence and expands the network structure from a single generator to multiple generators with different objectives. In addition, WGAN-GP (36), one of the most advanced GAN frameworks, proposes the Wasserstein distance and a gradient-penalty trick to avoid mode collapse. In our experiments, we also compared the anomaly detection performance of a plain GAN and WGAN-GP.
3 Selfadversarial Variational Autoencoder
In this section, a self-adversarial variational autoencoder (adVAE) for anomaly detection is proposed. To customize the plain VAE for anomaly detection tasks, we propose the Gaussian anomaly prior assumption and introduce the self-adversarial mechanism into the traditional VAE. The proposed method consists of three modules: an encoder net E, a generative net G, and a Gaussian transformer net T.
There are two competitive relationships in the training phase of our method: (1) To generate a potential anomalous prior distribution and enhance the generator's ability to discriminate between normal and anomalous priors, we train the Gaussian transformer T and the generator G with adversarial objectives simultaneously. (2) To produce more realistic samples in a competitive manner and make the encoder learn to discern, we train the generator G and the encoder E analogously to the generator and discriminator in GANs.
According to equation (3), there are two components in the objective function of VAEs: L_MSE and L_KLD. The cost function of adVAE is a modified combination of these two terms. In the following, subsections 3.1 to 3.3 describe the training phase, and subsections 3.4 to 3.6 address the testing phase.
3.1 Training Step 1: Competition between T and G
The generator of the plain VAE is often so powerful that it maps any Gaussian latent code to the high-dimensional data space, even if the latent code is encoded from anomalous samples. Through the competition between T and G, we introduce an effective regularization into the generator.
Our anomalous prior assumption suggests that it is difficult for the generator of the plain VAE to distinguish the normal from the anomalous latent code, because they overlap in the latent space. To solve this problem, we synthesize anomalous latent variables and make the generator discriminate them from the normal latent code. As shown in Figure 2 (a), we freeze the weights of E and update T and G in this training step. The Gaussian transformer T receives the normal Gaussian latent variables z encoded from the normal training samples as inputs and transforms them into anomalous Gaussian latent variables z_T with a different mean μ_T and standard deviation σ_T. T aims at reducing the KLD between z and z_T, and G tries to generate samples as different as possible from such two similar latent codes.
Given a datapoint x, the objective function in this competition process can be defined as

(6) \mathcal{L}_{T} = \mathcal{L}_{MSE}\!\left(x, x_r^{T}\right) + \gamma \left[ \mathcal{L}_{KLD}\!\left(z \,\|\, z_T\right) - m_z \right]^{+}

(7) \mathcal{L}_{G} = \mathcal{L}_{MSE}\!\left(x, x_r\right) + \left[ m_x - \mathcal{L}_{MSE}\!\left(x, x_r^{T}\right) \right]^{+}

(8) \mathcal{L}_{KLD}\!\left(z \,\|\, z_T\right) = \sum_{j=1}^{d_z} \left[ \log\frac{\sigma_{T,j}}{\sigma_j} + \frac{\sigma_j^{2} + \left(\mu_j - \mu_{T,j}\right)^{2}}{2\sigma_{T,j}^{2}} - \frac{1}{2} \right]

(9) x_r = G(z), \qquad x_r^{T} = G(z_T), \qquad z \sim \mathcal{N}\!\left(\mu, \sigma^{2}\right), \qquad z_T \sim \mathcal{N}\!\left(\mu_T, \sigma_T^{2}\right)
Here, [·]^+ denotes max(0, ·), m_x is a positive margin of the MSE target, and m_z is a positive margin of the KLD target. The aim is to hold the corresponding target term below the margin value for most of the time. L_T is the objective for the data flow of T, and L_G is the objective for the pipeline of G.
L_T and L_G are two adversarial objectives, and the total objective function in this training step is their sum: L_T + L_G. Objective L_T encourages T to mislead G by synthesizing z_T similar to z, such that G cannot distinguish them. Objective L_G forces the generator to distinguish between z and z_T. L_T hopes that the reconstruction from z_T is close to x, whereas L_G hopes that it is farther away from x. After iterative learning, T and G will reach a balance: T will generate anomalous latent variables close to the normal latent variables, and the generator will distinguish them by different reconstruction errors. Although the anomalous latent variables synthesized by T are not necessarily real anomalies, they are helpful as long as the models try to identify the outliers.
Because updating E would affect the balance between T and G, we freeze the weights of E when training T and G. Otherwise, the objective would become an equilibrium among three networks, which is extremely difficult to optimize.
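To make the margin mechanics concrete, the numpy sketch below evaluates one plausible composition of the two objectives, with toy values standing in for the network outputs (the numbers and hyperparameter values are illustrative, not our tuned settings):

```python
import numpy as np

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def kld_gauss(mu1, s1, mu2, s2):
    # Analytic KL(N(mu1, s1^2) || N(mu2, s2^2)), summed over latent dimensions.
    return np.sum(np.log(s2 / s1) + (s1**2 + (mu1 - mu2)**2) / (2 * s2**2) - 0.5)

def hinge(t):
    # [t]^+ = max(0, t)
    return max(0.0, t)

# Toy values standing in for one example's network outputs (illustrative only).
x, x_r, x_rT = np.ones(4), np.full(4, 0.9), np.full(4, 0.2)  # input, G(z), G(z_T)
mu, sigma = np.zeros(2), np.ones(2)                  # encoding of x
mu_T, sigma_T = np.full(2, 0.5), np.full(2, 1.2)     # transformed by T
gamma, m_x, m_z = 0.001, 2.0, 40.0

# T misleads G: keep the reconstruction from z_T realistic, while keeping
# z_T within a KLD budget m_z of z.
L_T = mse(x, x_rT) + gamma * hinge(kld_gauss(mu, sigma, mu_T, sigma_T) - m_z)
# G discriminates: reconstruct x well from z, but push the error on z_T
# up toward the margin m_x.
L_G = mse(x, x_r) + hinge(m_x - mse(x, x_rT))
```

The hinge on L_G is active only while the reconstruction error from z_T is still below m_x, which is exactly how the margin bounds the adversarial pressure.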
3.2 Training Step 2: Training E like a Discriminator
In the first training step, demonstrated in the previous subsection, we froze the weights of E. Now, as shown in Figure 2 (b), we instead freeze the weights of T and G and update the encoder E. The encoder not only attempts to project the data samples to Gaussian latent variables, as in the original VAE, but also works like a discriminator in GANs. The objective of the encoder is as follows:
(10) \mathcal{L}_{E} = \mathcal{L}_{MSE}\!\left(x, x_r\right) + \lambda\, \mathcal{L}_{KLD}\!\left(\mu, \sigma\right) + \left[ m_z - \mathcal{L}_{KLD}\!\left(\mu_r, \sigma_r\right) \right]^{+} + \left[ m_z - \mathcal{L}_{KLD}\!\left(\mu_r^{T}, \sigma_r^{T}\right) \right]^{+}

where (μ_r, σ_r) and (μ_r^T, σ_r^T) denote the encodings of the reconstructions x_r and x_r^T, respectively.
The first two terms of equation (10) are the objective function of the plain VAE: the encoder is trained to encode inputs from the training dataset as latent codes close to the prior distribution. The last two terms are the discriminating loss we propose: the encoder is prevented from mapping the reconstructions of training data to latent codes of the prior distribution.
The objective L_E provides the encoder with the ability to discriminate whether an input is normal, because the encoder is encouraged to discover differences between the training samples (normal) and their reconstructions (treated as anomalous). It is worth mentioning that an encoder with discriminating ability also helps the generator distinguish between the normal and the anomalous latent code.
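A numpy sketch of this encoder objective under the same toy conventions; the values are illustrative, and the hinge terms push the KLD of the reconstructions' encodings up toward the margin m_z:

```python
import numpy as np

def mse(x, x_hat):
    return np.mean((x - x_hat) ** 2)

def kld_prior(mu, sigma):
    # KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian posterior.
    return -0.5 * np.sum(1 + np.log(sigma**2) - mu**2 - sigma**2)

def hinge(t):
    return max(0.0, t)

# Illustrative encoder outputs for x, its reconstruction, and T's reconstruction.
x, x_r = np.ones(4), np.full(4, 0.9)
mu, sigma = np.zeros(2), np.ones(2)                  # E(x): right on the prior
mu_r, sigma_r = np.full(2, 1.0), np.full(2, 0.5)     # E(x_r)
mu_rT, sigma_rT = np.full(2, 2.0), np.full(2, 0.5)   # E(x_r^T)
lam, m_z = 0.01, 40.0

# Plain-VAE terms keep E(x) near the prior; the hinge terms push the
# encodings of reconstructions away from the prior, up to the margin m_z.
L_E = (mse(x, x_r) + lam * kld_prior(mu, sigma)
       + hinge(m_z - kld_prior(mu_r, sigma_r))
       + hinge(m_z - kld_prior(mu_rT, sigma_rT)))
```

Note that the same KLD function serves both roles: minimized for real inputs, but driven upward (via the hinges) for reconstructed inputs.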
3.3 Alternating between the Above Two Steps
As described in Algorithm 1, we alternate between the above two steps in each mini-batch iteration. These two steps are repeated until convergence. The stop-gradient notation in Algorithm 1 indicates that the back propagation of the gradients is stopped at that point.
In the first training step, the Gaussian transformer T converts normal latent variables into anomalous latent variables. At the same time, the generator G is trained to generate realistic-like samples when the latent variables are normal and to synthesize low-quality reconstructions when they are not. This gives the generator the ability to distinguish between the normal and the anomalous latent variables. In the second training step, the encoder E not only maps the samples to the prior latent distribution, but also attempts to distinguish between the real data x and the generated samples x_r.
Importantly, we introduce the competition between G and E into our adVAE model by alternating between these two steps. Analogously to GANs, the generator is trained to fool the encoder in training step 1, and the encoder is encouraged to discriminate the samples generated by the generator in step 2. In addition to benefitting from adversarial alternating learning as in GANs, the encoder and generator also learn jointly from the given training data, maintaining the advantages of VAEs.
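The alternating freeze/unfreeze schedule can be sketched in PyTorch as follows; the sub-networks are reduced to single linear layers and the losses to plain reconstruction errors, so this illustrates only the weight-freezing pattern, not the full objectives:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny stand-ins for the three sub-networks (architectures are illustrative).
E, G, T = nn.Linear(4, 2), nn.Linear(2, 4), nn.Linear(2, 2)

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

opt_GT = torch.optim.Adam(list(G.parameters()) + list(T.parameters()), lr=1e-3)
opt_E = torch.optim.Adam(E.parameters(), lr=1e-3)

x = torch.randn(8, 4)
for _ in range(2):  # two mini-batch iterations
    # Step 1: update T and G while E is frozen.
    set_trainable(E, False); set_trainable(G, True); set_trainable(T, True)
    loss_step1 = ((x - G(T(E(x)))) ** 2).mean()  # placeholder for L_T + L_G
    opt_GT.zero_grad(); loss_step1.backward(); opt_GT.step()

    # Step 2: update E while T and G are frozen.
    set_trainable(E, True); set_trainable(G, False); set_trainable(T, False)
    loss_step2 = ((x - G(E(x))) ** 2).mean()     # placeholder for L_E
    opt_E.zero_grad(); loss_step2.backward(); opt_E.step()
```

Using two optimizers, one per step, keeps the equilibrium a two-player game at any moment, mirroring the reason E is frozen in step 1.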
3.4 Anomaly Score
As demonstrated in Figure 2 (c), only the generator G and the encoder E are used in the testing phase, as in a traditional VAE. Given a test data point as the input, the encoder estimates the parameters μ and σ of the latent Gaussian variables as the output. Then, the reparameterization trick is used to sample z according to the latent distribution N(μ, σ²), i.e., z = μ + σ ⊙ ε with ε ~ N(0, I). The number of samples L is set to 1000 in this work and is used to improve the robustness of adVAE's performance. The generator receives z as the input and outputs the reconstruction x_r.
The error between the inputs and their average reconstruction reflects the deviation between the testing data and the normal data distribution learned by adVAE, such that the anomaly score of a mini-batch X ∈ R^{m×d} (m is the batch size) is defined as follows:

(11) \mathbf{E}^{(l)} = \mathbf{X} - \hat{\mathbf{X}}^{(l)}, \qquad l = 1, \ldots, L

(12) \mathbf{s} = \frac{1}{L} \sum_{l=1}^{L} \left( \mathbf{E}^{(l)} \odot \mathbf{E}^{(l)} \right) \mathbf{1}_{d}

with error matrices E^{(l)} ∈ R^{m×d} and squared error matrices E^{(l)} ⊙ E^{(l)}, where X̂^{(l)} is the l-th reconstruction of the mini-batch and 1_d sums over the feature dimension. s is the anomaly scores vector of the mini-batch X; its dimension is always equal to the batch size of X. Data points with a high anomaly score are classified as anomalies. To determine whether an anomaly score is high enough, we need a decision threshold. In the next subsection, we illustrate how to decide the threshold automatically by KDE (37).
3.5 Decision Threshold
Earlier VAE-based anomaly detection studies (13; 16) often overlooked the importance of threshold selection. Because we have no idea what the values of the reconstruction error actually represent, determining a threshold on the reconstruction error is cumbersome. Some studies (13; 14; 15) adjust the threshold by cross-validation. However, building a big enough validation set is a luxury in some cases, as anomalous samples are challenging to collect. Other attempts (16; 38) simply report the best performance on the test dataset to evaluate models, which makes the results difficult to reproduce in practical applications. Thus, it is exceedingly important to let the anomaly detection model determine the threshold automatically.
The KDE technique (37) is used to determine the decision threshold in the case where only normal samples are provided. Given the anomaly scores vector of the training dataset, KDE estimates its probability density function (PDF) in a nonparametric way:

(13) \hat{f}(s) = \frac{1}{nh} \sum_{i=1}^{n} K\!\left( \frac{s - s_i}{h} \right)

where n is the size of the training dataset, s_i is the i-th entry of the training dataset's anomaly scores vector, K(·) is the kernel function, and h is the bandwidth.
Among all kinds of kernel functions, radial basis functions (RBFs) are the most commonly used in density estimation. Therefore, an RBF kernel is used to estimate the PDF of the normal training dataset:

(14) \hat{f}(s) = \frac{1}{nh\sqrt{2\pi}} \sum_{i=1}^{n} \exp\!\left( -\frac{\left(s - s_i\right)^{2}}{2h^{2}} \right)
In practice, the choice of the bandwidth parameter h has a significant influence on the quality of the KDE. To make a good choice of the bandwidth, Silverman (39) derived a selection criterion:

(15) h = \hat{\sigma} \left( \frac{4}{(d+2)\,n} \right)^{\frac{1}{d+4}}

where n is the number of training data points, d is the number of dimensions, and σ̂ is the standard deviation of the samples. After obtaining the PDF of the training dataset by KDE, the cumulative distribution function (CDF) can be obtained by equation (16):

(16) F(s) = \int_{-\infty}^{s} \hat{f}(t)\, \mathrm{d}t
Given a significance level α and the CDF, we can find a decision threshold s_α that satisfies F(s_α) = 1 − α. In this case, a normal sample has a probability of at most α of receiving an anomaly score larger than s_α. Because KDE decides the threshold by estimating the interval of the normal data's anomaly scores, it is clear that using KDE to decide the threshold is more objective and reasonable than simply relying on human experience. A higher significance level α leads to a lower missing alarm rate, which means that models have fewer chances to mislabel outliers as normal data. On the contrary, a lower α means a lower false alarm rate. Therefore, the choice of the significance level is a tradeoff. We recommend setting the significance level to α = 0.1 for anomaly detection tasks.
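The whole procedure (Silverman bandwidth, RBF-kernel KDE, numerical CDF inversion at 1 − α) can be sketched with numpy; the synthetic scores below stand in for a trained model's outputs on normal training data:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(5.0, 1.0, size=2000)  # stand-in anomaly scores of normal data

# Silverman's rule of thumb for univariate scores (d = 1).
n = scores.size
h = scores.std(ddof=1) * (4.0 / (3.0 * n)) ** 0.2

def kde_pdf(t, samples, h):
    # Gaussian-kernel density estimate evaluated at the points t.
    u = (t[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * u ** 2).sum(axis=1) / (samples.size * h * np.sqrt(2 * np.pi))

# Numerical CDF on a grid, then invert it at 1 - alpha to obtain the threshold.
grid = np.linspace(scores.min() - 3 * h, scores.max() + 3 * h, 2000)
pdf = kde_pdf(grid, scores, h)
cdf = np.cumsum(pdf) * (grid[1] - grid[0])
alpha = 0.1
threshold = grid[np.searchsorted(cdf, 1.0 - alpha)]
```

With α = 0.1, the resulting threshold should sit near the 90th percentile of the training scores, which is the intended behavior.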
3.6 Detecting Outliers
In this subsection, we summarize how to use a trained adVAE model and the KDE technique to learn the threshold from a training dataset and detect outliers in a testing dataset. As illustrated in Figure 3, this process is divided into two parts.
Part I focuses on the training process. Given a training dataset matrix X ∈ R^{n×d} consisting of normal samples, where n is the size of the training dataset and d is the dimension of the data, we can calculate their anomaly scores vector s ∈ R^n (described in subsection 3.4). As described in subsection 3.5, the PDF of the anomaly scores vector is obtained by KDE, and the CDF F(s) is obtained by equation (16). Then, the decision threshold s_α can be determined from a given significance level α and the CDF.
Part II is simple and easy to understand. The anomaly score s_new of a new sample x_new is calculated by adVAE. If s_new > s_α, then x_new is defined as an outlier; otherwise, x_new is regarded as a normal sample.
4 Experiments
4.1 Datasets
Datasets  Size  Dim.  Outliers (percentage) 

Letter (40)  1600  32  100 (6.2%) 
Cardio (41)  1831  21  176 (9.6%) 
Satellite (42)  5100  36  75 (1.5%) 
Optical (43)  5216  64  150 (2.8%) 
Pen (44)  6870  16  156 (2.3%) 
Most previous works used image datasets to test their anomaly detection models. To eliminate the impact of different convolutional structures and other image tricks on the test performance, we chose five publicly available and broadly used tabular anomaly detection datasets to evaluate our adVAE model. All the dataset characteristics are summarized in Table 1. For each dataset, 80% of the normal data were used for the training phase, and the remaining 20% together with all the outliers were used for testing. More details about the datasets can be found in their references or in our GitHub repository (https://github.com/WangXuhongCN/adVAE).
4.2 Evaluation Metric
The anomaly detection community defines anomalous samples as positive and normal samples as negative; hence, anomaly detection tasks can also be regarded as a two-class classification problem with a large skew in the class distribution. For the evaluation of a two-class classifier, metrics are divided into two categories: those defined at a single threshold and those defined over all possible thresholds.
Metrics at a single threshold. Accuracy, precision, recall, and the F1 score are common metrics to evaluate model performance at a single threshold. Because the class distribution is skewed, accuracy is not a suitable metric for evaluating anomaly detection models.
High precision means fewer chances of misjudging normal data, and high recall means fewer chances of missing outliers. Even if a model predicts a normal sample as an outlier, people can still correct the judgment through expert knowledge, because the anomalous samples are few in number. However, if a model misses an outlier, we cannot find the anomalous data in such a huge dataset. Thus, precision is not as crucial as recall. The F1 score is the harmonic mean of precision and recall. Therefore, we adopt recall and the F1 score as the metrics for comparison at a single threshold.
Metrics at all possible thresholds. The anomaly detection community often uses receiver operator characteristic (ROC) and precision–recall (PR) curves, which aggregate over all possible decision thresholds, to evaluate the predictive performance of each method. When the class distribution is close to being uniform, ROC curves have many desirable properties. However, because anomaly detection tasks always have a large skew in the class distribution, PR curves give a more accurate picture of an algorithm’s performance (45).
Rather than comparing curves, it is useful and clear to analyze the model performance quantitatively using a single number. Average precision (AP) and area under the ROC curve (AUC) are the common metrics to measure performance, with the former being preferred under class imbalance.
AP summarizes a PR curve by a sum of precisions at each threshold, multiplied by the increase in recall, which is a close approximation of the area under the PR curve: AP = Σ_k (R_k − R_{k−1}) P_k, where P_k is the precision at the k-th threshold and R_k − R_{k−1} is the increase in recall from the (k−1)-th to the k-th threshold. Because the PR curve is more useful than the ROC curve in anomaly detection, we recommend using AP rather than AUC as an evaluation metric for anomaly detection models. In our experiments, recall, F1 score, AUC, and AP were used to evaluate the models' performance.
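The AP summation can be implemented directly; the following numpy sketch assumes no tied scores and treats anomalies as the positive class:

```python
import numpy as np

def average_precision(y_true, scores):
    """AP = sum_k (R_k - R_{k-1}) * P_k, thresholding at each score (no ties)."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    y = np.asarray(y_true)[order]
    tp = np.cumsum(y)                            # true positives per threshold
    precision = tp / np.arange(1, len(y) + 1)
    recall = tp / y.sum()
    prev_recall = np.concatenate(([0.0], recall[:-1]))
    return float(np.sum((recall - prev_recall) * precision))

# Hand-checkable example: anomalies (positives) are labeled 1.
ap = average_precision([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

On this example, the four thresholds give precisions 1, 1/2, 2/3, 1/2 and recalls 0.5, 0.5, 1, 1, so AP = 0.5·1 + 0.5·(2/3) = 5/6.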
4.3 Model Configurations
The network structures of adVAE are summarized in Figure 4, where the size of each layer is proportional to the input dimension d. The Adam optimizer was used for all datasets with a learning rate of 0.001. We trained each model for a maximum of 20,000 mini-batch iterations with a batch size of 128. Kaiming uniform weight initialization was used for all three sub-networks.
Datasets  λ  m_z  m_x  γ 

Letter (40)  0.003  40  2  0.001 
Cardio (41)  0.1  20  2  0.001 
Satellite (42)  0.01  40  2  0.001 
Optical (43)  0.03  40  2  0.001 
Pen (44)  0.01  20  2  0.001 
As shown in Table 2, there are four hyperparameters in the adVAE model: λ, m_z, m_x, and γ. λ is derived from the plain VAE, and the other three new parameters maintain the balance between the original training objective and the additional discrimination objectives. The larger m_z or λ is, the larger the proportion of the encoder discrimination objective in the total loss; the larger m_x is, the larger the proportion of the generator discrimination objective.
The KLD margin m_z can be selected to be slightly larger than the training KLD value of a plain VAE, and the MSE margin m_x was set to 2 for every dataset. Because λ and γ have similar effects, we suggest always fixing γ at 0.001. Adjusting the parameters then becomes easier, because we can simply consider tuning λ and m_z. Note that the structure and hyperparameters used in this study proved sufficient for our applications, although they can still be improved.
4.4 Compared Methods
We compare adVAE with eleven of the most advanced outlier detection algorithms, including both traditional and deep-learning-based methods. To obtain a convincing conclusion, the hyperparameters of the competing methods are searched over a range of optional values. The methods can be divided into seven categories: (i) A boundary-based approach, OC-SVM (31). It is a widely used semi-supervised outlier detection algorithm, and we use an RBF kernel in all tasks. Because OC-SVM needs an important parameter ν, its value is searched in the range {0.01, 0.05, 0.1, 0.15, 0.2}. (ii) An ensemble-learning-based approach, IForest (46). The hyperparameter to be tuned is the number of decision trees, which is chosen from the set {50, 100, 200, 300}. (iii) A probabilistic approach, ABOD (47). (iv) Distance-based approaches, SOD (48) and HBOS (49). Because the performance of ABOD and SOD is dramatically affected by the size of the neighborhood, we tune it in the range {5, 8, 16, 24, 40}. For HBOS, the number of bins is chosen from {5, 10, 20, 30, 50}. (v) Three GAN-based models: GAN (12), MO-GAAL (24), and WGAN-GP (36). GAN shares the same network structure as adVAE, except that the output layer of the discriminator is one-dimensional, and the configuration of MO-GAAL follows its official open-source code (https://github.com/leibinghe/GAAL-based-outlier-detection). Based on the network structure of GAN, WGAN-GP removes the output activation function of the generator. (vi) Three reconstruction-based methods: AE (13), DAGMM (38), and VAE (13). The AE, VAE, and DAGMM share the same network structure as adVAE, except that the latent dimension of DAGMM (https://github.com/danieltan07/dagmm) is required to be 1. VAE shares the same hyperparameter λ as adVAE. (vii) Two ablation models of the proposed adVAE, named E-adVAE and G-adVAE, are also compared to demonstrate how the two discrimination objectives affect performance.
Figures 5 and 6 show the architectures of the two ablation models. EadVAE, GadVAE, and adVAE share the same parameters and network structure.
All deep-learning methods are implemented in PyTorch and share the same optimizer, learning rate, batch size, and number of iterations as adVAE, except that the parameters of DAGMM and MOGAAL follow their authors' recommendations. All traditional methods are implemented with the common outlier detection framework PyOD (50).
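To make the search procedure concrete, the following sketch runs the grids for the two tunable traditional baselines, using scikit-learn's OneClassSVM and IsolationForest as stand-ins for the PyOD wrappers (the search ranges are the ones listed above; the synthetic data and helper names are illustrative assumptions).

```python
# Sketch of the hyperparameter search for the two tunable baselines.
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 8))                      # anomaly-free training data
X_test = np.vstack([rng.normal(size=(90, 8)),
                    rng.normal(loc=4.0, size=(10, 8))])  # 10 injected outliers

def ocsvm_scores(nu):
    model = OneClassSVM(kernel="rbf", nu=nu).fit(X_train)
    # Negate decision_function so that higher score = more anomalous.
    return -model.decision_function(X_test)

def iforest_scores(n_trees):
    model = IsolationForest(n_estimators=n_trees, random_state=0).fit(X_train)
    return -model.decision_function(X_test)

# The grids from the text; each score vector would then be evaluated
# with AP/AUC and the best setting kept per dataset.
for nu in (0.01, 0.05, 0.1, 0.15, 0.2):
    scores = ocsvm_scores(nu)
for n in (50, 100, 200, 300):
    scores = iforest_scores(n)
```

In our experiments the same pattern applies via the PyOD wrappers rather than scikit-learn directly.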
4.5 Results and Discussion
Table 3: AP and AUC of all compared methods. Let = letter, Car = cardio, Sat = satellite, Opt = optical, Pen = pen.
Methods  |  Average Precision (AP)  |  Area Under the ROC Curve (AUC)
         |  Let  Car  Sat  Opt  Pen  Avg.  |  Let  Car  Sat  Opt  Pen  Avg.
OCSVM (31)  0.306  0.947  0.718  0.157  0.645  0.555  0.514  0.975  0.893  0.613  0.936  0.786 
IForest (46)  0.330  0.922  0.784  0.353  0.798  0.637  0.624  0.949  0.945  0.859  0.978  0.871 
ABOD (47)  0.769  0.922  0.845  0.762  0.917  0.843  0.915  0.948  0.972  0.969  0.994  0.960 
HBOS (49)  0.312  0.851  0.746  0.463  0.641  0.603  0.619  0.899  0.915  0.863  0.942  0.848 
SOD (48)  0.542  0.693  0.494  0.105  0.115  0.390  0.792  0.764  0.915  0.405  0.473  0.670 
GAN (12)  0.421  0.697  0.427  0.417  0.362  0.465  0.653  0.618  0.776  0.743  0.779  0.714 
MOGAAL (24)  0.428  0.730  0.797  0.588  0.308  0.570  0.714  0.792  0.971  0.912  0.838  0.845 
WGANGP (36)  0.497  0.834  0.765  0.596  0.589  0.656  0.722  0.847  0.920  0.895  0.951  0.867 
AE (13)  0.693  0.798  0.764  0.860  0.682  0.759  0.897  0.840  0.950  0.990  0.962  0.928 
DAGMM (38)  0.318  0.789  0.353  0.125  0.370  0.391  0.564  0.847  0.714  0.465  0.826  0.683 
VAE (13)  0.724  0.803  0.766  0.875  0.625  0.759  0.911  0.840  0.962  0.992  0.959  0.933 
GadVAE  0.723  0.922  0.748  0.885  0.853  0.826  0.908  0.956  0.957  0.993  0.988  0.960 
EadVAE  0.769  0.933  0.801  0.857  0.926  0.857  0.911  0.961  0.968  0.989  0.996  0.965 
adVAE  0.779  0.951  0.792  0.957  0.880  0.872  0.921  0.966  0.970  0.996  0.993  0.969 
Table 4: Recall and F1 score of all compared methods. Let = letter, Car = cardio, Sat = satellite, Opt = optical, Pen = pen.
Methods  |  Recall  |  F1 Score
         |  Let  Car  Sat  Opt  Pen  Avg.  |  Let  Car  Sat  Opt  Pen  Avg.
OCSVM (31)  0.140  0.977  0.773  0.093  0.795  0.556  0.201  0.898  0.439  0.104  0.608  0.450 
IForest (46)  0.160  0.830  0.827  0.407  0.987  0.642  0.215  0.816  0.539  0.401  0.700  0.534 
ABOD (47)  0.750  0.983  0.893  0.940  1.000  0.913  0.739  0.647  0.657  0.709  0.674  0.685 
HBOS (49)  0.200  0.625  0.853  0.587  0.827  0.618  0.256  0.703  0.490  0.507  0.620  0.515 
SOD (48)  0.670  0.449  0.800  0.113  0.224  0.451  0.563  0.583  0.441  0.089  0.131  0.361 
GAN (12)  0.480  0.903  0.747  0.713  0.692  0.707  0.425  0.500  0.226  0.450  0.206  0.361 
MOGAAL (24)  0.180  0.455  0.493  0.367  0.276  0.354  0.277  0.578  0.627  0.455  0.319  0.451 
WGANGP (36)  0.490  0.824  0.813  0.707  0.955  0.758  0.500  0.700  0.419  0.542  0.623  0.557 
AE (13)  0.750  0.830  0.840  0.940  0.891  0.850  0.685  0.661  0.589  0.673  0.670  0.655 
DAGMM (38)  0.310  0.756  0.467  0.107  0.628  0.453  0.315  0.731  0.237  0.114  0.476  0.374 
VAE (13)  0.760  0.801  0.893  1.000  0.885  0.868  0.714  0.647  0.568  0.744  0.670  0.668 
GadVAE  0.690  0.960  0.787  1.000  1.000  0.887  0.717  0.824  0.549  0.718  0.721  0.706 
EadVAE  0.770  0.943  0.907  1.000  1.000  0.924  0.691  0.804  0.618  0.732  0.717  0.712 
adVAE  0.780  0.938  0.853  1.000  1.000  0.914  0.729  0.805  0.631  0.718  0.704  0.717 
Table 3 shows the experimental results for the AP and AUC metrics, and Table 4 shows the results for the recall and F1 score. According to the experimental results, adVAE achieves 11 best and 5 second-best results out of 24 comparisons. Therefore, adVAE significantly outperforms the other compared methods.
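For reference, both ranking metrics in the tables can be computed directly from anomaly scores. The following self-contained definitions (rank-based AUC and step-wise AP, with ties broken arbitrarily) are a sketch of the metrics themselves, not the evaluation code used in our experiments.

```python
def roc_auc(scores, labels):
    """AUC via the rank-sum (Mann-Whitney) formulation; ties broken arbitrarily."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    rank_sum = sum(rank + 1 for rank, i in enumerate(order) if labels[i] == 1)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def average_precision(scores, labels):
    """AP as the mean of the precision values at each true positive."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, ap = 0, 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            ap += hits / k
    return ap / hits

# Toy example: higher score = more anomalous, label 1 = true outlier.
scores = [0.9, 0.8, 0.3, 0.2, 0.1]
labels = [1, 0, 1, 0, 0]
```

Unlike recall and F1, these two metrics need no score threshold, which is why they are often preferred for comparing outlier detectors.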
Among all the experimental results, VAE generally performs slightly better than AE because VAE applies KLD regularization to the encoder, which demonstrates the importance of regularization in AE-based anomaly detection algorithms.
We customize the plain VAE with the proposed self-adversarial mechanism, reaching a maximum AP improvement of 0.255 (40.8%). Compared with its base technique VAE, adVAE improves, on average, by 0.113 (14.9%) in AP, 0.036 (3.9%) in AUC, 0.046 (5.3%) in recall, and 0.049 (7.3%) in F1 score. The outstanding results of adVAE demonstrate the superiority of the proposed self-adversarial mechanism, which helps deep generative models play a greater role in anomaly detection research and applications.
OCSVM performs well on the cardio, satellite, and pen datasets but achieves poor results on the letter and optical datasets. These results indicate that the conventional OCSVM method is not sufficiently stable; its AP gap from adVAE reaches 0.8 on the optical dataset. In fact, all traditional anomaly detection methods (e.g., IForest, HBOS, and SOD) share this problem: they perform well on some datasets but extremely poorly on others. ABOD is more stable than the others, but it still suffers from the same problem.
The reason is that conventional methods often rely on strict data assumptions. If the training data are high-dimensional or inconsistent with those assumptions, performance may be significantly impaired. Because neural-network-based methods do not require strict data assumptions and are suitable for large, high-dimensional data, they perform more stably across diverse datasets with different data distributions.
In our experiments, both VAE and AE outperformed the GAN-based methods on most datasets. We analyze the results as follows. Because the mode-collapse issue of GANs cannot be diagnosed from a training loss curve, practitioners usually monitor the generated images during the training phase; for tabular datasets, however, it is difficult to ensure that the generated samples are sufficiently diverse. A GAN suffering from mode collapse learns an incomplete normal data distribution and would therefore mislabel normal data as outliers, causing a high false-positive rate. To better recover the distribution of normal data, MOGAAL suggests stopping the generator's optimization before convergence and expanding the architecture from a single generator to multiple generators with different objectives. However, a non-converged model adds uncertainty to the experimental results, because it is impossible to know precisely what the model has learned. According to the results, MOGAAL does improve over the plain GAN but still lags behind the AE-based methods. WGANGP tries to avoid mode collapse, which leads to better results than the plain GAN on all four metrics, yet it still performs worse than the encoder-decoder-based methods. GANs are better at generating high-quality samples than at learning the complete probability distribution of the training data; thus, GAN-based methods do not perform well for outlier detection.
Among the three reconstruction-based neural network models, there is no significant performance gap between AE and VAE, whereas DAGMM is worse than both in our tests. This is explained by the fact that the latent dimension of DAGMM is required to be 1, which severely limits the model's ability to summarize datasets. The results support this explanation: DAGMM only achieves good results on the cardio dataset, which has the smallest size and dimensionality. The larger the size and dimensionality of the dataset, the larger the performance gap between DAGMM and AE.
In conclusion, AE and VAE outperform most of the other anomaly detection methods in our tests. Moreover, adVAE learns to detect outliers through the self-adversarial mechanism, which further enhances its anomaly detection capability and achieves state-of-the-art performance. We believe that the proposed self-adversarial mechanism has great potential and deserves to be extended to other pattern recognition tasks.
4.6 Ablation Study
To illustrate the impact of different designs on the model's results, we design three types of ablation experiments:
(i) Because adVAE is formed by adding two different training objectives to the plain VAE, it is important to demonstrate how the two discrimination objectives affect the performance. Hence, we propose the EadVAE and GadVAE models (Figures 5 and 6) to demonstrate the effect of removing one of the discrimination objectives from adVAE. According to Tables 3 and 4, both GadVAE and EadVAE achieve performance improvements over the plain VAE, which indicates the validity of the two discriminative objectives of adVAE. In addition, EadVAE is more competitive than GadVAE on most datasets. We analyze this result as follows: because the latent variables z have a lower dimension than the original data x, x carries more information than z. Therefore, the benefit of adding the discrimination target to the encoder E is more significant than adding it to the generator G.
(ii) Because the encoder and generator work together to detect outliers, we perform a latent-space visualization to demonstrate the improvement in the encoder's anomaly detection capability. Figure 8 illustrates that the encoder of adVAE has better discrimination ability, because the latent variables are separated in the latent space and are more easily distinguished by the subsequent generator.
(iii) To independently verify the anomaly detection capability of the generator, we synthesize anomalous latent variables by adding five types of noise to the normal latent variables. The normal latent variables are the outputs of the five corresponding encoders, whose input data are 200 normal samples from the pen (44) dataset. A normal score vector is calculated from the normal latent variables and the generator, and an anomalous score vector is calculated from the noised latent variables and the generator (as in Section 3.4). Afterwards, the Wasserstein distance between the two score vectors is used to measure the detection capability of each generator. Figure 7 indicates that the generator of adVAE also has a better ability to discriminate latent variables.
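The Wasserstein distance used in (iii) admits a simple closed form for two equal-length score vectors: the mean absolute difference of their sorted values. The sketch below illustrates this; the score values are invented for illustration, and the experiments used a standard library implementation.

```python
def wasserstein_1d(a, b):
    """1-D Wasserstein (earth mover's) distance between two equal-size
    empirical samples: mean absolute difference of the sorted values."""
    assert len(a) == len(b), "equal-size samples assumed for this shortcut"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

# Hypothetical score vectors for 4 samples from one generator.
normal_scores = [0.10, 0.12, 0.11, 0.09]
anomalous_scores = [0.80, 0.95, 0.90, 0.85]

# A larger distance means the generator separates normal from
# noised latent variables better, i.e., stronger discrimination.
gap = wasserstein_1d(normal_scores, anomalous_scores)
```

Comparing this gap across the five generators gives the ranking plotted in Figure 7.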
In conclusion, benefiting from better discrimination abilities, adVAE achieves better anomaly detection performance. This demonstrates that the proposed self-adversarial mechanism is a promising way to customize generative models for outlier detection tasks.
4.7 Robustness Experiments
Because adVAE involves four primary hyperparameters, it is essential to conduct comprehensive, in-depth hyperparameter experiments. As shown in Figures 9(a)-9(d), we test the results for the four hyperparameters. In practice, only two of them need to be tuned, because one is inherited from the plain VAE and another is fixed at 0.01. Consequently, finding good hyperparameters for adVAE is no more complicated than for a plain VAE, because adVAE is not sensitive to the two tuned hyperparameters according to Figures 9(c) and 9(d).
Regarding the learning ability of the neural networks, we vary the number of layers from 2 to 6 and adjust the number of hidden neurons accordingly. The results are shown in Figure 9(e), where V1-V5 represent the parameter variations. V1 achieves slightly worse detection capability, because an insufficient network cannot learn enough information from the normal training data. As long as the scale of the neural network is within a reasonable range, adVAE is robust to the network structure.
Because adVAE requires an anomaly-free dataset, it is important to investigate how adVAE responds to contaminated training data. We also choose three semi-supervised anomaly detection methods for comparison, and ADASYN (51) is used to synthesize new anomalous samples. The results in Figure 9(f) show that OCSVM is more robust to a contaminated dataset than the three AE-based methods. This is because OCSVM can ignore some noisy points when learning the boundary of the training data, whereas AE-based models always try to reduce the reconstruction error for all training data. Moreover, a high contamination ratio more easily disrupts the proposed discrimination loss, which suggests training AE-based anomaly detection models with high-quality data (i.e., a clean or low-contamination-ratio dataset). In practice, normal data are easy to collect, so the contamination ratio usually remains low.
5 Conclusion
In this paper, we have proposed a self-adversarial variational autoencoder (adVAE) with a Gaussian anomaly prior assumption and a self-adversarial mechanism. The proposed adVAE is encouraged to distinguish the normal latent variables from the anomalous latent variables generated by the Gaussian transformer T, which can also be regarded as an effective regularization introduced into VAE-based outlier detection methods. The results demonstrate that the proposed adVAE outperforms other state-of-the-art anomaly detection methods. In the future, we will try to redesign the discrimination objective of the generator to further enhance its ability to recognize anomalies.
Acknowledgement
This research is supported in part by the National Natural Science Foundation of China (No. 51777122).
References
 (1) G. Osada, K. Omote, T. Nishide, Network intrusion detection based on semi-supervised variational autoencoder, in: European Symposium on Research in Computer Security (ESORICS), Springer, 2017, pp. 344–361.
 (2) A. Abdallah, M. A. Maarof, A. Zainal, Fraud detection system: A survey, Journal of Network and Computer Applications 68 (2016) 90–113.
 (3) P. Cui, C. Zhan, Y. Yang, Improved nonlinear process monitoring based on ensemble KPCA with local structure analysis, Chemical Engineering Research and Design 142 (2019) 355–368.
 (4) T. Schlegl, P. Seeböck, S. M. Waldstein, U. Schmidt-Erfurth, G. Langs, Unsupervised anomaly detection with generative adversarial networks to guide marker discovery, in: International Conference on Information Processing in Medical Imaging (IPMI), Springer, 2017, pp. 146–157.
 (5) S. Akcay, A. A. Abarghouei, T. P. Breckon, GANomaly: Semi-supervised anomaly detection via adversarial training, in: Asian Conference on Computer Vision (ACCV), Springer, 2018, pp. 622–637.
 (6) C. Zhang, J. Bi, S. Xu, E. Ramentol, G. Fan, B. Qiao, H. Fujita, Multi-imbalance: An open-source software for multi-class imbalance learning, Knowl.-Based Syst. 174 (2019) 137–143.
 (7) F. Zhou, S. Yang, H. Fujita, D. Chen, C. Wen, Deep learning fault diagnosis method based on global optimization GAN for unbalanced data, Knowl.-Based Syst. (2019).
 (8) G. Lemaître, F. Nogueira, C. K. Aridas, Imbalanced-learn: A Python toolbox to tackle the curse of imbalanced datasets in machine learning, Journal of Machine Learning Research 18 (17) (2017) 1–5.
 (9) M. A. F. Pimentel, D. A. Clifton, L. A. Clifton, L. Tarassenko, A review of novelty detection, Signal Processing 99 (2014) 215–249.
 (10) R. Chalapathy, S. Chawla, Deep learning for anomaly detection: A survey, arXiv e-prints (2019) arXiv:1901.03407.
 (11) D. P. Kingma, M. Welling, Auto-encoding variational Bayes, in: International Conference on Learning Representations (ICLR), 2014.
 (12) I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, Y. Bengio, Generative adversarial nets, in: Annual Conference on Neural Information Processing Systems (NeurIPS), MIT Press, 2014, pp. 2672–2680.
 (13) J. An, S. Cho, Variational autoencoder based anomaly detection using reconstruction probability, Tech. rep., SNU Data Mining Center (2015).
 (14) D. Park, Y. Hoshi, C. C. Kemp, A multimodal anomaly detector for robot-assisted feeding using an LSTM-based variational autoencoder, IEEE Robotics and Automation Letters 3 (3) (2018) 1544–1551.
 (15) S. Suh, D. H. Chae, H. Kang, S. Choi, Echo-state conditional variational autoencoder for anomaly detection, in: International Joint Conference on Neural Networks (IJCNN), IEEE, 2016, pp. 1015–1022.
 (16) H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y. Liu, Y. Zhao, D. Pei, Y. Feng, J. Chen, Z. Wang, H. Qiao, Unsupervised anomaly detection via variational auto-encoder for seasonal KPIs in web applications, in: International World Wide Web Conference (WWW), ACM, 2018, pp. 187–196.
 (17) A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, Adversarial autoencoders, in: International Conference on Learning Representations (ICLR), 2016.
 (18) S. Pidhorskyi, R. Almohsen, G. Doretto, Generative probabilistic novelty detection with adversarial autoencoders, in: Annual Conference on Neural Information Processing Systems (NeurIPS), MIT Press, 2018, pp. 6823–6834.
 (19) M. Ravanbakhsh, M. Nabi, E. Sangineto, L. Marcenaro, C. S. Regazzoni, N. Sebe, Abnormal event detection in videos using generative adversarial nets, in: International Conference on Image Processing (ICIP), IEEE, 2017, pp. 1577–1581.
 (20) X. Chen, D. P. Kingma, T. Salimans, Y. Duan, P. Dhariwal, J. Schulman, I. Sutskever, P. Abbeel, Variational lossy autoencoder, in: International Conference on Learning Representations (ICLR), 2017.
 (21) M. Fraccaro, S. K. Sønderby, U. Paquet, O. Winther, Sequential neural models with stochastic layers, in: Annual Conference on Neural Information Processing Systems (NeurIPS), 2016, pp. 2199–2207.
 (22) I. V. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. C. Courville, Y. Bengio, A hierarchical latent variable encoder-decoder model for generating dialogues, in: AAAI Conference on Artificial Intelligence (AAAI), 2017, pp. 3295–3301.
 (23) A. Honkela, H. Valpola, Variational learning and bits-back coding: an information-theoretic view to Bayesian learning, IEEE Transactions on Neural Networks 15 (4) (2004) 800–810.
 (24) Y. Liu, Z. Li, C. Zhou, Y. Jiang, J. Sun, M. Wang, X. He, Generative adversarial active learning for unsupervised outlier detection, IEEE Transactions on Knowledge and Data Engineering (2019).
 (25) Y. Kawachi, Y. Koizumi, N. Harada, Complementary set variational autoencoder for supervised anomaly detection, in: International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2018, pp. 2366–2370.
 (26) H. Huang, Z. Li, R. He, Z. Sun, T. Tan, IntroVAE: Introspective variational autoencoders for photographic image synthesis, in: Annual Conference on Neural Information Processing Systems (NeurIPS), MIT Press, 2018, pp. 52–63.
 (27) J. Ilonen, P. Paalanen, J. Kamarainen, H. Kälviäinen, Gaussian mixture pdf in one-class classification: computing and utilizing confidence values, in: International Conference on Pattern Recognition (ICPR), IEEE, 2006, pp. 577–580.
 (28) D. Yeung, C. Chow, Parzen-window network intrusion detectors, in: International Conference on Pattern Recognition (ICPR), IEEE, 2002, pp. 385–388.
 (29) M. M. Breunig, H. Kriegel, R. T. Ng, J. Sander, LOF: identifying density-based local outliers, in: ACM SIGMOD International Conference on Management of Data (SIGMOD), ACM, 2000, pp. 93–104.
 (30) B. Tang, H. He, A local density-based approach for outlier detection, Neurocomputing 241 (2017) 171–180.
 (31) B. Schölkopf, J. C. Platt, J. Shawe-Taylor, A. J. Smola, R. C. Williamson, Estimating the support of a high-dimensional distribution, Neural Computation 13 (7) (2001) 1443–1471.
 (32) L. Yin, H. Wang, W. Fan, Active learning based support vector data description method for robust novelty detection, Knowl.-Based Syst. 153 (2018) 40–52.
 (33) D. J. Olive, Principal Component Analysis, Springer, 2017, pp. 189–217.
 (34) F. Harrou, F. Kadri, S. Chaabane, C. Tahon, Y. Sun, Improved principal component analysis for anomaly detection: Application to an emergency department, Computers & Industrial Engineering 88 (2015) 63–77.
 (35) R. Baklouti, M. Mansouri, M. Nounou, H. Nounou, A. B. Hamida, Iterated robust kernel fuzzy principal component analysis and application to fault detection, Journal of Computational Science 15 (2016) 34–49.
 (36) I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A. C. Courville, Improved training of Wasserstein GANs, in: Annual Conference on Neural Information Processing Systems (NeurIPS), 2017, pp. 5767–5777.
 (37) A. Gramacki, J. Gramacki, FFT-based fast bandwidth selector for multivariate kernel density estimation, Computational Statistics & Data Analysis 106 (2017) 27–45.
 (38) B. Zong, Q. Song, M. R. Min, W. Cheng, C. Lumezanu, D. Cho, H. Chen, Deep autoencoding Gaussian mixture model for unsupervised anomaly detection, in: International Conference on Learning Representations (ICLR), 2018.
 (39) B. W. Silverman, Density Estimation for Statistics and Data Analysis, Routledge, 2018, Ch. 3.
 (40) S. Rayana, L. Akoglu, Less is more: Building selective anomaly ensembles, ACM Trans. Knowl. Discov. Data 10 (4) (2016) 42:1–42:33.
 (41) S. Sathe, C. C. Aggarwal, LODES: local density meets spectral outlier detection, in: SIAM International Conference on Data Mining (SDM), SIAM, 2016, pp. 171–179.
 (42) F. T. Liu, K. M. Ting, Z. Zhou, Isolation forest, in: International Conference on Data Mining (ICDM), IEEE, 2008, pp. 413–422.
 (43) C. C. Aggarwal, S. Sathe, Theoretical foundations and algorithms for outlier ensembles, SIGKDD Explor. Newsl. 17 (1) (2015) 24–47.
 (44) F. Keller, E. Müller, K. Böhm, HiCS: High contrast subspaces for density-based outlier ranking, in: International Conference on Data Engineering (ICDE), IEEE, 2012, pp. 1037–1048.
 (45) J. Davis, M. Goadrich, The relationship between precision-recall and ROC curves, in: International Conference on Machine Learning (ICML), ACM, 2006, pp. 233–240.
 (46) F. T. Liu, K. M. Ting, Z. Zhou, Isolation forest, in: International Conference on Data Mining (ICDM), IEEE Computer Society, 2008, pp. 413–422.
 (47) H. Kriegel, M. Schubert, A. Zimek, Angle-based outlier detection in high-dimensional data, in: ACM Knowledge Discovery and Data Mining (KDD), ACM, 2008, pp. 444–452.
 (48) H. Kriegel, P. Kröger, E. Schubert, A. Zimek, Outlier detection in axis-parallel subspaces of high dimensional data, in: Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Vol. 5476 of Lecture Notes in Computer Science, Springer, 2009, pp. 831–838.
 (49) M. Goldstein, A. Dengel, Histogram-based outlier score (HBOS): A fast unsupervised anomaly detection algorithm, German Conference on Artificial Intelligence (KI-2012): Poster and Demo Track (2012) 59–63.
 (50) Y. Zhao, Z. Nasrullah, Z. Li, PyOD: A Python toolbox for scalable outlier detection, Journal of Machine Learning Research 20 (96) (2019) 1–7.
 (51) H. He, Y. Bai, E. A. Garcia, S. Li, ADASYN: adaptive synthetic sampling approach for imbalanced learning, in: International Joint Conference on Neural Networks (IJCNN), IEEE, 2008, pp. 1322–1328.