Normalizing flows for deep anomaly detection
Anomaly detection for complex data is a challenging task from the perspective of machine learning. In this work, we consider cases in which certain kinds of anomalies are missing from the training dataset, while ample statistics are available for the normal class. In such scenarios, conventional supervised methods may suffer from class imbalance, while unsupervised methods tend to ignore difficult anomalous examples. We extend the supervised classification approach for class-imbalanced datasets by exploiting normalizing flows for proper Bayesian inference of the posterior probabilities.
The anomaly detection problem is one of the most important tasks in the analysis of real-world data. Possible applications range from data-quality certification [DQM] to detecting rare cases of diseases in medicine [Spence]. The technique can be used in credit card fraud detection [Aleskerov], complex-system failure prediction [SystemFailure], and novelty detection in time-series data [nf_timeseries].
These applications are similar from the point of view of data science: a search for a small number of outliers is performed on typically unbalanced datasets. Standard binary classification methods struggle to achieve high performance in this setting. One-class methods, in turn, suffer from incomplete utilization of the training data: no anomalies are used in the training process. The recent supervised anomaly-detection technique for class-imbalanced data [OPE] also has some shortcomings.
In this paper, we expand the idea of supervised anomaly detection for class-imbalanced datasets (the (1+ε)-class classification method [OPE]) by adding normalizing flows (NF) to enhance the method.
1.1 Problem statement
Anomaly detection is the process of identifying data points that differ from normal ones. Here, as in [OPE], we consider a specific case in which a small, non-representative number of anomaly samples is also provided in the training dataset. Following [OPE], we distinguish normal data samples from anomaly data samples. Such an anomaly detection setting can be formulated as a standard binary classification problem with a highly imbalanced dataset:
The posterior distribution of the class can be estimated with a classification neural network. Such a classifier can be fitted by minimizing the following binary cross-entropy objective over the training dataset:
where the classification neural network has arbitrary (trainable) weights.
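For concreteness, the binary cross-entropy objective can be sketched in a few lines of numpy; the classifier itself is abstracted into the predicted posteriors `p`, and the helper name `bce_loss` is ours, not from the original text:

```python
import numpy as np

def bce_loss(p, y):
    """Binary cross-entropy between predicted posteriors p and binary labels y."""
    eps = 1e-12                      # guard against log(0)
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Confident predictions on the correct class drive the loss towards zero, while a maximally uncertain classifier (p = 0.5 everywhere) yields log 2.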
In the case of a sufficient and well-balanced training dataset, this neural network can be considered a good estimate of the true posterior; in other words, it will be close to the optimal Bayesian classifier. However, in the case of a small or imbalanced training dataset, the network output can be far from the true posterior. A small variety of anomalies leads to overfitting unless extra prior knowledge about the data is exploited (as is done in [LOF, IsolationForest, OC_NN, SVM]).
Most standard anomaly detection techniques [LOF, IsolationForest, OC_NN, SVM] use only normal samples from the training dataset. On the other hand, binary classifiers like [XGBoost] may suffer from class imbalance. In [OPE], a new set of generative models and loss functions was proposed for the aforementioned anomaly detection setting.
1.2 (1+ε)-class method
First, let us consider the case when no anomaly sample is provided in the training dataset. Then, assuming equal class priors and uniformly distributed anomaly samples, the binary cross-entropy objective (equation (3)) takes the following form:
where the anomaly class is modeled by a uniform distribution over the feature space.
The optimal Bayesian classifier in that case takes the following form:
It is possible to approximate the class posterior distribution by any standard neural network using a dataset with normal samples only. Hence, any kind of modern neural network can be used for anomaly detection even when no anomaly samples are provided in the training dataset. However, if some anomalies are provided, they will not be used in training (equation (5)).
In [OPE], we proposed a new way to use the knowledge about given anomalies. We refer to the loss function as the OPE loss or (1+ε)-class classification loss:
where a weighting term compensates for the difference in class priors, and a hyperparameter allows choosing the trade-off between the unitary (equation (4)) and binary (equation (3)) classification solutions. This way, the loss function represents a combination of the binary and unitary solutions. A comparative illustration of the (1+ε)-class method against standard one-class and two-class methods is shown in Figure 1.
The new loss function (equation (6)) reaches its minimum at the following classifier:
However, uniform sampling from a high-dimensional feature space might be problematic due to the potentially high variance of the gradients produced by the uniform term. Importance sampling can be used as one way to deal with this problem:
where the proposal density may be any distribution whose support covers that of the initial uniform distribution.
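As a minimal illustration of this importance-sampling identity (the target, proposal, and integrand below are our own toy choices): an expectation under a uniform distribution is recovered from samples of a non-uniform proposal by weighting with the density ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
# proposal q(x) = 2x on (0, 1), sampled via the inverse CDF: x = sqrt(u)
x = np.sqrt(rng.uniform(size=n))
# importance weights u(x) / q(x) with target u = Uniform(0, 1)
w = 1.0 / (2.0 * x)
# weighted estimate of E_u[x^2] = integral of x^2 over (0, 1) = 1/3
estimate = np.mean(w * x**2)
```

The proposal concentrates samples where the integrand is large, which is exactly the variance-reduction effect motivating the text above.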
In [icn], the authors propose to use the proposal distribution induced by the model at the previous epoch:
where a normalization term and an indicator function are involved. Hence, the loss can be written as:
Such importance sampling from (9) is much more efficient than uniform sampling in terms of training. It can be performed using any Markov chain Monte Carlo (MCMC) technique, such as Hamiltonian Monte Carlo [hmc]. Since MCMC algorithms only require the distribution up to a normalization constant, they can sample from the proposal without estimating that constant. However, the normalization constant is also used in (10), and thus it must be explicitly evaluated, which is computationally expensive.
This way, the Energy OPE (EOPE) loss can be formulated in the following way:
2.1 Normalizing flows
The technique of normalizing flows aims to design a set of invertible transformations to obtain a bijection between the given distribution of training samples and some distribution with a known probability density function (PDF) (see Figure 2). However, in the case of a non-trivial bijection, the distribution density at the final point differs from the density at the initial point. This is due to the fact that every non-trivial transformation changes the infinitesimal volume at some points (see Figure 3). So the task is not only to find a flow of invertible transformations, but also to know how the distribution density changes at each point after each transformation.
Consider a multivariate transformation of a variable. The Jacobian of the transformation at a given point has the following form:
The distribution density at a point after the transformation can be written in the following general way:
where the factor is the determinant of the Jacobian matrix.
For instance, in the case of a univariate transformation, the distribution density at each point after the transformation is lower than the density before the transformation by a factor equal to the derivative of the transformation (see Figure 3).
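The change-of-variables formula above can be checked numerically; as a toy example (our own choice), take X ~ N(0, 1) and the affine map f(x) = 2x + 1:

```python
import numpy as np

def normal_pdf(x):
    """Standard normal density."""
    return np.exp(-x**2 / 2.0) / np.sqrt(2.0 * np.pi)

def pushforward_pdf(y, f_inv, df_dx):
    """Density of Y = f(X) for X ~ N(0,1): p_Y(y) = p_X(f^{-1}(y)) / |f'(f^{-1}(y))|."""
    x = f_inv(y)
    return normal_pdf(x) / np.abs(df_dx(x))

# affine map f(x) = 2x + 1: at y = 1 we get x = 0, so p_Y(1) = p_X(0) / 2
p = pushforward_pdf(1.0, lambda y: (y - 1.0) / 2.0, lambda x: 2.0)
```

The stretching factor 2 halves the density everywhere, exactly as the univariate statement above describes.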
Thus, the problem of training normalizing flows requires calculating the determinants of the Jacobians at each point. The computation of the determinant of the Jacobian is computationally expensive in the general case: its time complexity is O(D³), where D is the characteristic size of the Jacobian matrix. However, in some special cases, it can be calculated much faster.
A family of transformations called planar flows is the simplest example allowing fast calculation. This family is described as:
where the free parameters define the transformation, and a smooth element-wise non-linearity with a known derivative is applied. For such transformations, we can compute the log-determinant of the Jacobian in O(D) time instead of the O(D³) required in the general case:
Now, in the case of multiple planar flow transformations, the distribution density is transformed in the following way:
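A planar-flow step and its O(D) log-determinant can be sketched as follows (tanh is chosen here as the non-linearity; the matrix determinant lemma gives det(I + h'(a) u wᵀ) = 1 + h'(a) wᵀu):

```python
import numpy as np

def planar_flow(z, u, w, b):
    """Planar flow f(z) = z + u * tanh(w.z + b) with its log|det J|,
    computed in O(D) via the matrix determinant lemma."""
    a = w @ z + b                      # scalar pre-activation
    h_prime = 1.0 - np.tanh(a) ** 2    # tanh'(a)
    log_det = np.log(np.abs(1.0 + h_prime * (w @ u)))
    return z + u * np.tanh(a), log_det

# arbitrary illustrative parameters
z = np.array([0.5, -0.2, 0.1])
u = np.array([0.3, 0.1, -0.2])
w = np.array([0.4, 0.2, 0.1])
f, log_det = planar_flow(z, u, w, 0.3)
```

The closed-form log-determinant can be cross-checked against a finite-difference Jacobian, avoiding the O(D³) determinant in actual training.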
Another family of transformations with fast Jacobian computation is radial flows. The transformation family is defined as follows:
where the parameters of the map are a reference point and scalar coefficients.
The determinant of such transformations can also be computed much faster than in the general case:
2.2 Autoregressive flows
Planar and radial flows are computationally efficient types of normalizing flows. However, neither of them takes into account multidimensional correlations in the data. Thus, data with complicated covariance between the dimensions (features) will not be effectively mapped onto standard distributions. To fit covariances between the dimensions of the training data, autoregressive flows were designed. They are normally introduced as:
where each output component is a transformation of the corresponding prefix of input components. The Jacobian of such transformations is triangular, with a determinant that is simply the product of the diagonal elements:
Thus, the determinant of the Jacobian of any autoregressive flow can be calculated in O(D) time instead of O(D³).
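This property can be verified on a toy autoregressive map (the specific transformation below is our own illustration): since each output depends only on the inputs up to its own index, the Jacobian is lower triangular and its log-determinant is just the sum of the logs of the diagonal terms.

```python
import numpy as np

def ar_transform(x):
    """Toy autoregressive map: y_0 = x_0, y_i = x_i * exp(0.1 * x_{i-1})."""
    y = np.empty_like(x)
    y[0] = x[0]
    for i in range(1, len(x)):
        y[i] = x[i] * np.exp(0.1 * x[i - 1])
    return y

x = np.array([0.5, -1.0, 2.0, 0.3])
# diagonal terms dy_i/dx_i: 1 for i = 0, exp(0.1 * x_{i-1}) otherwise
diag = np.concatenate(([1.0], np.exp(0.1 * x[:-1])))
log_det = np.sum(np.log(diag))        # O(D) instead of O(D^3)
```

A finite-difference Jacobian confirms both the triangular structure and the determinant value.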
Real Non-Volume Preserving Flows (R-NVP)
Real Non-Volume Preserving Flows (R-NVP, [rnvp]) are autoregressive flows defined by the following transformations:
where element-wise multiplication is applied, and the scale and translation terms are produced by two arbitrary mappings of the first components.
Since R-NVP is a special case of autoregressive flows, its Jacobian matrix is triangular, with an identity block corresponding to the unchanged components. Therefore, the first diagonal elements are equal to 1, whereas the rest are equal to the corresponding components of the scale vector. Hence, the determinant of such a Jacobian is
The inverse transformation can be calculated in the following way:
Thus, the inverse transformation of R-NVP can be calculated without inverting the scale and translation mappings. Hence, any kind of mappings can be used without affecting the asymptotic complexity of R-NVP; typically, they are auxiliary neural networks.
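A numpy sketch of one R-NVP coupling layer under these definitions (the split point d and the particular scale/translation functions below are arbitrary illustrative choices, standing in for neural networks):

```python
import numpy as np

def rnvp_forward(x, d, s, t):
    """R-NVP coupling: the first d components pass through unchanged;
    the rest are scaled by exp(s(x[:d])) and shifted by t(x[:d])."""
    y = x.copy()
    y[d:] = x[d:] * np.exp(s(x[:d])) + t(x[:d])
    log_det = np.sum(s(x[:d]))        # log-det of the triangular Jacobian
    return y, log_det

def rnvp_inverse(y, d, s, t):
    """Exact inverse; note that s and t themselves are never inverted."""
    x = y.copy()
    x[d:] = (y[d:] - t(y[:d])) * np.exp(-s(y[:d]))
    return x

# arbitrary stand-ins for the auxiliary networks s and t
s = lambda h: np.tanh(h)
t = lambda h: h ** 2
x = np.array([0.3, -1.2, 0.7, 2.0])
y, log_det = rnvp_forward(x, 2, s, t)
```

The inverse only re-evaluates s and t on the untouched components, which is why arbitrary networks can be plugged in at no extra cost.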
Masked Autoregressive Flow (MAF)
R-NVP provides an efficient family of transformations that is able to fit some correlations between the dimensions of a random variable. However, the first dimensions remain unchanged, and there are no guarantees that the first features of the training dataset are uncorrelated. Moreover, there are no guarantees that the first features are components of some known distribution (like a Gaussian). A more general set of transformations was introduced to fix this problem:
The Jacobian of the transformation is:
The inverse transformation has the following form:
Components of the inverse MAF transformation (equation (30)) can be calculated in parallel. However, components of the forward MAF transformation cannot be calculated in this way, since the components on the left side of equation (28) must be calculated in sequential order. This is the main drawback of MAFs: likelihood estimation through transformation (30) (transformation to the distribution with known PDF) is fast and scalable, whereas sampling through the forward transformation (28) (transformation to the data distribution) is relatively slow and unscalable.
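The asymmetry can be seen in an affine MAF sketch (the prefix-dependent scale and shift functions below are toy stand-ins for the masked networks): sampling needs a sequential loop, while the inverse pass depends only on the observed data and can be computed component-wise in parallel.

```python
import numpy as np

def s_fn(prefix):                     # toy scale "network" over the prefix
    return 0.1 * np.tanh(prefix.sum())

def t_fn(prefix):                     # toy shift "network" over the prefix
    return 0.5 * np.tanh(prefix.sum())

def maf_sample(z):
    """Forward (sampling) pass: x_i needs x_{<i}, hence inherently sequential."""
    x = np.empty_like(z)
    for i in range(len(z)):
        x[i] = z[i] * np.exp(s_fn(x[:i])) + t_fn(x[:i])
    return x

def maf_inverse(x):
    """Inverse (likelihood) pass: each z_i depends only on x,
    so all components can be computed independently."""
    s = np.array([s_fn(x[:i]) for i in range(len(x))])
    t = np.array([t_fn(x[:i]) for i in range(len(x))])
    return (x - t) * np.exp(-s)

rng = np.random.default_rng(0)
z = rng.normal(size=6)
x = maf_sample(z)
```

The round trip recovers the latent variables exactly, but only the inverse direction vectorizes, which is the bottleneck the text describes.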
Inverse Autoregressive Flows (IAF)
This class of normalizing flows is called Inverse Autoregressive Flows (IAF) [iaf]. It has the same complexity and generalization power as MAF, but works fast in forward (sampling) mode.
2.3 Probability distillation
Both MAF and IAF have good generalization performance. However, neither of them works fast in both directions: sampling (forward) and likelihood estimation (inverse). Moreover, since all normalizing flows are trained by maximizing the likelihood, IAF cannot be trained quickly (it is well-scalable only in sampling mode). To deal with this, the probability distillation approach was proposed [probdist].
The goal of this method is to fit two normalizing flows to the same transformation. Namely, given the teacher model (typically a MAF), the goal is to fit the student model (typically an IAF) to the same transformation (Figure 4).
The probability distillation method consists of two steps (Figure 4). In the first step (teacher training), the teacher model is trained to project the training dataset onto some standard (domain) distribution by maximizing the likelihood. That is why a MAF, with its scalable likelihood estimation, is typically used in this step. In the second step (student distillation), the teacher model is frozen, and another model (the student) is trained to perform the same transformation as the teacher. This is done by minimizing the KL-divergence between the two training-data distributions obtained by the normalizing flows (equation (34)). During this stage, new samples are generated by the student model (typically an IAF, because of its well-scalable sampling mode), and the likelihoods obtained from the two models are compared.
Formally, the student distillation stage can be formulated in the following way:
where samples are drawn from the IAF, the teacher density is the distribution of the training dataset estimated by the MAF, and the student density is the distribution of the training dataset estimated by the IAF.
The student model (IAF) is used only in forward (sampling) mode in this scheme, so the likelihood of each generated sample can be efficiently estimated by it. On the other hand, since the teacher model (MAF) is used only in inverse mode, the likelihood of the generated IAF samples can also be efficiently estimated by the MAF. Thus, training according to this scheme, we obtain a normalizing-flow pipeline that works well in both sampling and likelihood-estimation modes. Moreover, the IAF is trained much faster in this scheme since it is not used in inverse mode. Finally, it is important to note that the training dataset is not used during the student distillation stage; instead, any number of generated samples can be used. Hence, the IAF can be fitted to the MAF without any overfitting.
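To make the distillation objective concrete, here is a Monte-Carlo estimate of the KL term using one-dimensional Gaussians as stand-ins for the teacher and student flows (the densities and parameters are our own toy choices, not the models from the paper):

```python
import numpy as np

def log_normal(x, mu, sigma):
    """Log-density of N(mu, sigma^2)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

rng = np.random.default_rng(0)
mu_t, sig_t = 1.0, 2.0    # frozen "teacher" density
mu_s, sig_s = 0.0, 1.0    # "student" density being distilled

# the student is used only in sampling mode: draw latent z from the
# domain distribution and push it forward to x
z = rng.normal(size=50_000)
x = mu_s + sig_s * z
# Monte-Carlo KL(student || teacher), computed from student samples only
kl = np.mean(log_normal(x, mu_s, sig_s) - log_normal(x, mu_t, sig_t))
```

For these Gaussians the analytic value is log(σ_t/σ_s) + (σ_s² + (μ_s − μ_t)²)/(2σ_t²) − 1/2 ≈ 0.443, so the estimator can be sanity-checked directly; minimizing this quantity over the student parameters is the distillation step.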
In [OPE], we proposed the (1+ε)-class method, which aims to improve on previous performance with the introduction of a new sampling technique. It outperforms previous state-of-the-art algorithms in most anomaly detection cases.
However, the (1+ε)-class method has some drawbacks:
OPE (6) suffers from potentially high variance of gradients;
EOPE (11) suffers from an ineffective sampling scheme: sampling by MCMC is computationally expensive because of the usage of rejection sampling, whereas Deep Directed Generative models are imprecise;
EOPE has a tendency to overpenalize positive predictions.
In this paper, we propose an alternative way of sampling. Our new algorithm is based on the idea of using normalizing flows to sample new anomalies for classifier training from the tails of the normal distribution. It consists of two steps (see Figure 7).
In the first step, we train a normalizing flow to sample new surrogate anomalies. In the second step, we train a binary classifier on normal samples and a mixture of real and surrogate (sampled from the NF) anomalies.
3.1 Step 1. Training normalizing flow
As mentioned before, the aim of the first step is to train a normalizing flow to sample new anomalies. At the moment of publication, one of the most powerful NF models for sampling is the IAF.
We train the normalizing flow on normal samples. It can be trained by the scheme standard for normalizing flows, maximization of the log-likelihood:
However, since (39) requires the inverse transformation, which is unscalable for the IAF, we propose another scheme for its training, based on probability distillation (see subsection 2.3). First, we train a MAF on normal samples, maximizing (39) with respect to its parameters. After that, we freeze the MAF and train the IAF using probability distillation by minimizing (34) with respect to the IAF parameters. In such a training scheme, the IAF is used only in sampling mode and hence can be trained much faster (see Figure 4).
We note that this scheme of training two normalizing flows is auxiliary to our research and can be replaced by any other single NF model.
Once the NF for sampling is trained, it can be used to sample new anomalies. To produce new anomalies, we sample from the tails of the normal domain distribution, where the p-value of the tails is a hyperparameter (see Figure 5).
Here we used the assumption that test-time anomalies are either represented in the given anomalous training set or are novelties with respect to the normal class. In other words, the likelihood of novelties under the normal class must be relatively small.
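A sketch of the tail-sampling step (the radius threshold below stands in for the tail hyperparameter, and `sample_tails` is our illustrative name): latent points are rejection-sampled outside a ball in the standard-normal domain; in the full method they would then be pushed through the trained NF to obtain surrogate anomalies.

```python
import numpy as np

def sample_tails(n, dim, radius, rng):
    """Rejection-sample n latent points with norm above `radius`,
    i.e. from the tails of the N(0, I) domain distribution."""
    out = []
    while len(out) < n:
        z = rng.normal(size=(4 * n, dim))
        out.extend(z[np.linalg.norm(z, axis=1) > radius])
    return np.array(out[:n])

rng = np.random.default_rng(0)
z_tail = sample_tails(1000, 2, 2.0, rng)   # latent tail samples
```

In practice the radius would be derived from the chosen tail probability of the domain distribution rather than fixed by hand.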
Nevertheless, the density obtained by the NF in the domain space might be drastically different from the corresponding density in the data space because of the non-unit Jacobian of the NF transformations (39). Such distribution density distortion is illustrated in Figure 6 and makes the proposed sampling of anomalies incomplete. Because of this distortion, some points in the tails of the domain distribution can correspond to normal samples, and some points in the body of the domain distribution can correspond to anomalies. To fix this, we propose a Jacobian regularization of normalizing flows (Figure 6) by introducing an extra regularization term. It penalizes the model for a non-unit Jacobian:
where a hyperparameter controls the regularization strength. Here we consider the domain distribution of the NF to be a standard normal distribution. We estimate the regularization term in (40) by directly sampling from the domain distribution in order to cover the whole sampling space.
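One way such a regularizer can be estimated is sketched below; the squared log-determinant is our own assumed form of the "non-unit Jacobian" penalty (the paper's exact form may differ), and the constant-Jacobian toy flow only serves to exercise the estimator:

```python
import numpy as np

def jacobian_reg(log_det_fn, z_samples, lam):
    """Penalty on a non-unit Jacobian, estimated on samples drawn directly
    from the N(0, I) domain distribution. The squared log-determinant is an
    assumed penalty form, not necessarily the one used in the paper."""
    log_dets = np.array([log_det_fn(z) for z in z_samples])
    return lam * np.mean(log_dets ** 2)

rng = np.random.default_rng(0)
z = rng.normal(size=(256, 3))
# toy flow f(z) = 2z in 3 dimensions has log|det J| = 3 * log 2 everywhere
reg = jacobian_reg(lambda z: 3.0 * np.log(2.0), z, lam=0.1)
```

The penalty vanishes exactly when the flow is volume-preserving (log|det J| = 0 everywhere), which is the behaviour the regularization encourages.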
3.2 Step 2. Training classifier
Once the normalizing flow for sampling is obtained, a classifier can be trained on normal samples and a mixture of real and surrogate anomalies sampled from the NF (Figure 7). In our research, we used loss (6) to train the classifier. We do not focus on the classifier configuration, since any neural network can be used at this step.
3.3 Final algorithm
The final scheme of our algorithm is shown in Figure 7.
4 Results and discussion
We evaluate the proposed method on the following datasets: KDD-99 [KDD99], SUSY and HIGGS [higgs]. In order to reflect the typical anomaly detection cases behind our approach, we derive multiple tasks from each dataset by varying the size of the anomalous dataset.
As the proposed method targets problems that are intermediate between one-class and two-class problems, we compare our approach against the following algorithms:
conventional two-class classification;
semi-supervised method: dimensionality reduction by a Deep AutoEncoder followed by two-class classification;
one-class methods: Deep SVDD and Robust AutoEncoder [RVAE].
Tables 8, 9 and 10 show our experimental results. In these tables, columns represent tasks with a varying number of negative samples present in the training set: numbers in the header indicate either the number of classes that form the negative class (in the case of the KDD dataset) or the number of negative samples used (HIGGS and SUSY); ‘one-class’ denotes the absence of known anomalous samples. In this research, we used the following fixed numbers of positive (normal) samples: 89574 for KDD, 5119249 for HIGGS and 216816 for SUSY. As one-class algorithms do not take negative samples into account, their results are carried over to the tasks with known anomalies.
Unlike conventional anomaly detection algorithms [RVAE, LOF, OC_NN, IsolationForest, SVM], our new method, along with [OPE], utilizes anomalies during the training process. The method outperforms all existing methods, including our previously designed algorithm [OPE], in most realistic scenarios. The method is fast and stable in both the training and inference stages. However, since a standard classifier sits at the top of the scheme (Figure 7), overfitting must be carefully monitored and addressed.
In this work, we present a new anomaly detection algorithm that efficiently deals with problems hard to address by either one-class or two-class methods. Our solution combines the best features of the one-class and two-class approaches. In contrast to one-class approaches, the proposed method can effectively utilize any number of known anomalous examples, and, unlike conventional two-class classification, it does not require an extensive set of anomalous examples. Our algorithm significantly outperforms existing anomaly detection algorithms in several realistic anomaly detection cases. This approach is especially beneficial for anomaly detection problems in which the anomalous data is non-representative or might drift over time.
The research leading to these results has received funding from the Russian Science Foundation under grant agreement no. 19-71-30020.
Artem Ryzhikov received his M.Sc. degree and is pursuing a PhD in Computer Science at National Research University Higher School of Economics, Russia. His research interests are in the area of machine learning and its application to High-Energy Physics, with a focus on Deep Learning and Bayesian methods.
Maxim Borisyak graduated from Moscow Institute of Physics and Technology in 2015, and is currently a PhD student in Computer Science at National Research University Higher School of Economics, Russia. His research interests include Machine Learning for High-Energy Physics with focus on Deep Learning, generative models and optimization.
Andrey Ustyuzhanin received a diploma in Computer Science from Moscow Institute of Physics and Technology, Russia (2000) and a Ph.D. in Computer Science from Institute of System Programming RAS, Russia (2007). He is the director of the Laboratory of Methods for Big Data Analysis at Higher School of Economics. His team participates in several international collaborations: LHCb, SHiP (Search for Hidden Particles). The primary focus of his research is the design and application of Machine Learning methods to improve fundamental understanding of our world principles.
Denis Derkach graduated from Saint-Petersburg State University in 2007 and later obtained a PhD in particle physics from the University of Paris 11. After postdocs at Istituto Nazionale di Fisica Nucleare and the University of Oxford, he joined the Higher School of Economics, Moscow, where he is currently an assistant professor. His main research interest concentrates on applying data science methods in fundamental physics. He is a principal investigator of the HSE LHCb team.