Normalizing flows for deep anomaly detection


Abstract

Anomaly detection for complex data is a challenging task from the machine learning perspective. In this work, we consider cases in which certain kinds of anomalies are missing from the training dataset, while ample statistics are available for the normal class. In such scenarios, conventional supervised methods may suffer from class imbalance, while unsupervised methods tend to ignore difficult anomalous examples. We extend the supervised classification approach for class-imbalanced datasets by exploiting normalizing flows for proper Bayesian inference of the posterior probabilities.

Machine Learning, Neural Nets, Anomaly Detection, Imbalanced Data Set, Generate Potential Outliers, Normalizing Flow

© 2017 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

1 Introduction

The anomaly detection problem is one of the important tasks in the analysis of real-world data. Possible applications range from data-quality certification [DQM] to finding rare cases of specific diseases in medicine [Spence]. The technique can be used in credit card fraud detection [Aleskerov], failure prediction for complex systems [SystemFailure], and novelty detection in time-series data [nf_timeseries].

From the data-science point of view, these applications are similar: a small number of outliers must be found in a typically unbalanced dataset. Standard binary classification methods struggle to achieve high performance in this setting. One-class methods, in turn, under-utilize the training data, since no anomalies are used in the training process. The recent supervised anomaly-detection technique for class-imbalanced data [OPE] also has some shortcomings.

In this paper, we extend the idea of supervised anomaly detection for class-imbalanced datasets (the (1+ε)-class classification method [OPE]) by adding normalizing flows (NF) to enhance the method.

1.1 Problem statement

Anomaly detection is the process of identifying data points that differ from the normal ones. Here, as in [OPE], we consider the specific case when a small, non-representative number of anomalous samples is also provided in the training dataset. We denote normal data samples as $x \sim p_0(x)$ (class $y = 0$) and anomalous data samples as $x \sim p_1(x)$ (class $y = 1$). This anomaly detection setting can be formulated as a standard binary classification problem with a highly imbalanced dataset:

$\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{N}, \qquad y_i \in \{0, 1\}, \qquad \#\{i : y_i = 1\} \ll \#\{i : y_i = 0\}. \qquad (1)$

The posterior probability of the anomaly class, $P(y = 1 \mid x)$, can be estimated with a classification neural network $f_\theta(x)$. Such a classifier can be fitted by minimizing the following binary cross-entropy objective over the training dataset:

$\theta^{*} = \arg\min_{\theta} \mathcal{L}(\theta), \qquad (2)$
$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim p_0(x)} \log\bigl(1 - f_\theta(x)\bigr) \;-\; \mathbb{E}_{x \sim p_1(x)} \log f_\theta(x), \qquad (3)$

where $f_\theta$ is a classification neural network with weights $\theta$.

Given a sufficiently large and well-balanced training dataset, such a neural network provides a good estimate of $P(y = 1 \mid x)$; in other words, $f_{\theta^*}$ is close to the optimal Bayesian classifier. However, for a small or imbalanced training dataset, the network output can be far from $P(y = 1 \mid x)$. A small variety of anomalies leads to overfitting unless extra prior knowledge about the data is incorporated (as is done in [LOF, IsolationForest, OC_NN, SVM]).
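For concreteness, the objective in equation (3) can be sketched in a few lines of PyTorch; here `f_theta` is assumed to be any network returning $P(y = 1 \mid x)$, and `x_norm`, `x_anom` are mini-batches of normal and anomalous samples (the names are illustrative, not part of [OPE]):

```python
import torch

def bce_objective(f_theta, x_norm, x_anom, eps=1e-6):
    """Binary cross-entropy of equation (3): normal samples pushed towards 0, anomalies towards 1."""
    p_norm = f_theta(x_norm).clamp(eps, 1 - eps)  # predicted P(y=1|x) on normal samples
    p_anom = f_theta(x_anom).clamp(eps, 1 - eps)  # predicted P(y=1|x) on known anomalies
    return -(torch.log(1 - p_norm).mean() + torch.log(p_anom).mean())
```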

Most standard anomaly detection techniques [LOF, IsolationForest, OC_NN, SVM] use only the normal samples of the training dataset to train the model. Binary classifiers such as [XGBoost], on the other hand, may suffer from class imbalance. In [OPE], a new set of generative models and loss functions was proposed for the anomaly detection setting described above.

1.2 (1+ε)-class method

First, let us consider the case when no anomaly samples are provided in the training dataset. Assuming equal class priors and uniformly distributed anomaly samples, the binary cross-entropy objective (equation (3)) takes the following form:

$\mathcal{L}_{1}(\theta) = -\,\mathbb{E}_{x \sim p_0(x)} \log\bigl(1 - f_\theta(x)\bigr) \;-\; \mathbb{E}_{x \sim U(\mathcal{X})} \log f_\theta(x), \qquad (4)$

where $U(\mathcal{X})$ denotes the uniform distribution over the feature space $\mathcal{X}$.

The optimal Bayesian classifier in that case has the following form:

$f^{*}(x) = \frac{u(x)}{p_0(x) + u(x)}, \qquad u(x) = \frac{1}{|\mathcal{X}|}. \qquad (5)$

The class posterior can thus be approximated by any standard neural network using a dataset that contains normal samples only. Hence, any modern neural network can be used for anomaly detection even when no anomaly samples are provided in the training dataset. However, if some anomalies are provided, they are not used by this training objective (equations (4) and (5)).

In [OPE], we proposed a new way to use the knowledge about the given anomalies. We refer to the resulting loss function as the OPE loss, or the (1+ε)-class classification loss:

$\mathcal{L}_{\mathrm{OPE}}(\theta) = -\,\mathbb{E}_{x \sim p_0(x)} \log\bigl(1 - f_\theta(x)\bigr) \;-\; \varepsilon\,\mathbb{E}_{x \sim U(\mathcal{X})} \log f_\theta(x) \;-\; \gamma\,\mathbb{E}_{x \sim p_1(x)} \log f_\theta(x), \qquad (6)$

where $\gamma$ compensates for the difference in class priors, and $\varepsilon$ is a hyperparameter that controls the trade-off between the unitary (equation (4)) and binary (equation (3)) classification solutions. This way, the loss function represents a combination of the binary and unitary solutions. A comparative illustration of the (1+ε)-class method and the standard one-class and two-class methods is shown in Figure 1.

The new loss function (equation (6)) reaches its minimum at the following classifier:

$f^{*}(x) = \frac{\varepsilon\, u(x) + \gamma\, p_1(x)}{p_0(x) + \varepsilon\, u(x) + \gamma\, p_1(x)}. \qquad (7)$
(a) two-class classification (samples from both circles are given)
(b) OPE classification (left-circle samples and a few right-circle samples are given)
(c) one-class classification (only left-circle samples are given)
Fig. 1: Illustration of how the OPE algorithm [OPE] works. Circles denote the class boundaries (each class is located inside a circle); color denotes the model predictions (yellow denotes the left-circle class, violet denotes the right-circle class).
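A compact sketch of the loss in equation (6), under the reading given above: a normal-class term, an ε-weighted term over uniformly drawn surrogate anomalies, and a γ-weighted term over the known anomalies. The box bounds `x_low`, `x_high` and all function names are illustrative assumptions rather than the exact setup of [OPE]:

```python
import torch

def ope_loss(f_theta, x_norm, x_anom, x_low, x_high, eps_w, gamma, n_uniform=1024):
    """OPE-style loss (6): binary terms on (x_norm, x_anom) plus a uniform surrogate-anomaly term.
    The feature space X is assumed to be a box with per-feature bounds x_low, x_high."""
    p_norm = f_theta(x_norm).clamp(1e-6, 1 - 1e-6)
    p_anom = f_theta(x_anom).clamp(1e-6, 1 - 1e-6)
    x_u = x_low + (x_high - x_low) * torch.rand(n_uniform, x_norm.shape[1])  # uniform surrogate anomalies
    p_u = f_theta(x_u).clamp(1e-6, 1 - 1e-6)
    return -(torch.log(1 - p_norm).mean()
             + eps_w * torch.log(p_u).mean()
             + gamma * torch.log(p_anom).mean())
```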

However, uniform sampling from a high-dimensional $\mathcal{X}$ might be problematic due to the potentially high variance of the gradients produced by the uniform term. Importance sampling can be used as one way to deal with this problem:

$\mathbb{E}_{x \sim U(\mathcal{X})} \log f_\theta(x) = \mathbb{E}_{x \sim q}\!\left[\frac{u(x)}{q(x)} \log f_\theta(x)\right], \qquad (8)$

where $q(x)$ is the density of any new distribution such that $q(x) > 0$ wherever $u(x) > 0$, and $u(x)$ is the density of the initial uniform distribution $U(\mathcal{X})$.

In [icn], the authors propose to use a proposal distribution $q(x)$ induced by the model at the previous epoch:

$q(x) = \frac{1}{Z}\, f_{\theta_{t-1}}(x)\, \mathbb{1}\!\left[x \in \mathcal{X}\right], \qquad (9)$

where $Z$ is the normalization term and $\mathbb{1}[\cdot]$ is the indicator function. Hence, the uniform term can be written as:

$\mathbb{E}_{x \sim U(\mathcal{X})} \log f_\theta(x) = \frac{Z}{|\mathcal{X}|}\, \mathbb{E}_{x \sim q}\!\left[\frac{\log f_\theta(x)}{f_{\theta_{t-1}}(x)}\right]. \qquad (10)$

Importance sampling from (9) is much more efficient than uniform sampling in terms of training. It can be performed with any Markov chain Monte Carlo (MCMC) technique, such as Hamiltonian Monte Carlo [hmc]. Since MCMC algorithms only require the density up to a normalization constant, they can sample from $q(x)$ without estimating $Z$. However, $Z$ also appears in (10), so it must be evaluated explicitly, which is computationally expensive.
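The importance-sampled estimate of the uniform term in equation (8) can be sketched as follows, assuming samples `x_q` drawn from some proposal $q$ together with their density values `q_x`; using the particular proposal of equation (9) would additionally require the normalization constant $Z$, which is exactly the expensive part discussed above:

```python
import torch

def uniform_term_is(f_theta, x_q, q_x, volume_X):
    """Estimate E_{x~U(X)} log f(x) with samples x_q ~ q, where q_x holds the values q(x_q).
    u(x) = 1 / volume_X is the density of the uniform distribution over the feature space."""
    w = (1.0 / volume_X) / q_x                              # importance weights u(x) / q(x)
    log_f = torch.log(f_theta(x_q).clamp(1e-6, 1 - 1e-6))
    return (w * log_f).mean()
```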

This way, the Energy OPE (EOPE) loss can be formulated as follows:

$\mathcal{L}_{\mathrm{EOPE}}(\theta) = -\,\mathbb{E}_{x \sim p_0(x)} \log\bigl(1 - f_\theta(x)\bigr) \;-\; \varepsilon\,\frac{Z}{|\mathcal{X}|}\, \mathbb{E}_{x \sim q}\!\left[\frac{\log f_\theta(x)}{f_{\theta_{t-1}}(x)}\right] \;-\; \gamma\,\mathbb{E}_{x \sim p_1(x)} \log f_\theta(x). \qquad (11)$

2 Background

2.1 Normalizing flows

The technique of normalizing flows aims to design a set of invertible transformations that form a bijection $f$ between the distribution of training samples $x$ and the distribution of a latent (domain) variable $z$ with a known probability density function (PDF) (see Figure 2). However, for a non-trivial bijection $f$, the distribution density at the final point $x = f(z)$ differs from the density at the point $z$: each non-trivial transformation changes the infinitesimal volume around some points (see Figure 3). So the task is not only to find a flow of invertible transformations $f = f_K \circ \dots \circ f_1$, but also to track how the distribution density changes at each point after each transformation $f_k$.

Consider a multivariate transformation $x = f(z)$ of a variable $z \in \mathbb{R}^D$. The Jacobian of the transformation $f$ at a point $z$ has the following form:

$J_f(z) = \frac{\partial f(z)}{\partial z} = \begin{pmatrix} \frac{\partial f_1}{\partial z_1} & \cdots & \frac{\partial f_1}{\partial z_D} \\ \vdots & \ddots & \vdots \\ \frac{\partial f_D}{\partial z_1} & \cdots & \frac{\partial f_D}{\partial z_D} \end{pmatrix}. \qquad (12)$

The distribution density at the point $x = f(z)$ after the transformation of the point $z$ can be written in the following general way:

$p_X(x) = p_Z(z)\,\bigl|\det J_f(z)\bigr|^{-1}, \qquad z = f^{-1}(x), \qquad (13)$

where $\det J_f(z)$ is the determinant of the Jacobian matrix $J_f(z)$.

For instance, in the case of a univariate transformation $x = f(z)$, the distribution density at each point after the transformation is $|f'(z)|$ times lower than the density before the transformation (see Figure 3).

Thus, training normalizing flows requires computing the determinant of the Jacobian at each point $z$. In the general case, this computation is expensive and complicated: it takes $O(D^3)$ time, where $D$ is the size of the Jacobian matrix. However, for some special classes of transformations it can be carried out much faster.
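The general-case computation behind equation (13) can be written directly with automatic differentiation; the sketch below builds the full $D \times D$ Jacobian and takes its log-determinant, which is exactly the $O(D^3)$ path that the flows described next are designed to avoid (`f` and `base_log_prob` are arbitrary user-supplied callables):

```python
import torch
from torch.autograd.functional import jacobian

def log_density_after(f, z, base_log_prob):
    """Change of variables (13) for a single point z: log p_X(f(z)) = log p_Z(z) - log|det J_f(z)|."""
    J = jacobian(f, z)                         # full D x D Jacobian of f at z
    _, logabsdet = torch.linalg.slogdet(J)     # O(D^3) computation of log|det J|
    return base_log_prob(z) - logabsdet
```

For example, for the affine map $f(z) = 2z$ the term `logabsdet` equals $D \ln 2$, so the density after the transformation is $2^D$ times lower, in line with the univariate intuition above.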

Planar flows

The family of transformations called planar flows is the simplest example allowing fast computation. It is defined as:

$f(z) = z + u\, h\!\left(w^{\top} z + b\right), \qquad (14)$

where $u, w \in \mathbb{R}^D$ and $b \in \mathbb{R}$ are free parameters, $f(z)$ is the transformation of $z$, and $h(\cdot)$ is a smooth element-wise non-linearity with derivative $h'(\cdot)$. For such transformations, the log-determinant of the Jacobian can be computed in $O(D)$ time instead of the $O(D^3)$ required in the general case:

$\psi(z) = h'\!\left(w^{\top} z + b\right) w, \qquad (15)$
$\left|\det \frac{\partial f}{\partial z}\right| = \left|\det\!\left(I + u\,\psi(z)^{\top}\right)\right| = \left|1 + u^{\top} \psi(z)\right|. \qquad (16)$

For a composition of $K$ planar flow transformations $z_k = f_k(z_{k-1})$, the distribution density transforms in the following way:

$\ln q_K(z_K) = \ln q_0(z_0) - \sum_{k=1}^{K} \ln\left|1 + u_k^{\top} \psi_k(z_{k-1})\right|. \qquad (17)$
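A minimal planar-flow sketch of equations (14)-(16) with $h = \tanh$; note that invertibility additionally requires $u^{\top} w \geq -1$, which is omitted here for brevity:

```python
import torch

def planar_flow(z, u, w, b):
    """Planar flow (14): f(z) = z + u * tanh(w.z + b), with its O(D) log|det J| from (15)-(16).
    Shapes: z is (batch, D); u and w are (D,); b is a scalar."""
    lin = z @ w + b                                  # (batch,)
    f_z = z + u * torch.tanh(lin)[:, None]           # (14)
    psi = (1 - torch.tanh(lin) ** 2)[:, None] * w    # psi(z) = h'(w.z + b) * w   (15)
    log_det = torch.log(torch.abs(1 + psi @ u))      # log|1 + u^T psi(z)|        (16)
    return f_z, log_det
```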

Radial flows

Another family of transformations with fast Jacobian computation is radial flows, defined as follows:

$f(z) = z + \beta\, h(\alpha, r)\,(z - z_0), \qquad (18)$

where $r = \|z - z_0\|$, $h(\alpha, r) = \frac{1}{\alpha + r}$, and the parameters of the map are $\{z_0 \in \mathbb{R}^D,\ \alpha \in \mathbb{R}_{+},\ \beta \in \mathbb{R}\}$.

The determinant of the Jacobian of such transformations can also be computed much faster than $O(D^3)$:

$\left|\det \frac{\partial f}{\partial z}\right| = \bigl[1 + \beta\, h(\alpha, r)\bigr]^{D-1}\,\bigl[1 + \beta\, h(\alpha, r) + \beta\, h'(\alpha, r)\, r\bigr]. \qquad (19)$

Thus, radial flows, like planar flows, allow the density change to be tracked in $O(D)$ time per transformation, in the same manner as in equation (17).

2.2 Autoregressive flows

Planar and radial flows are computationally efficient types of normalizing flows. However, neither of them captures multidimensional correlations in $x$: data with complicated covariance between the dimensions (features) will not be mapped effectively to standard distributions. Autoregressive flows are designed to fit such correlations between the dimensions of the training data. They are normally introduced as:

$x_i = f_i\!\left(z_{1:i}\right), \qquad i = 1, \dots, D, \qquad (20)$

where $x_i$ is the $i$'th component of the output, $z_{1:i}$ denotes the first $i$ components of $z$, and $f_i$ is a transformation mapping the first $i$ components of $z$ to the $i$'th output component. The Jacobian of such a transformation is triangular, so its determinant is simply the product of the diagonal elements:

$\det \frac{\partial x}{\partial z} = \prod_{i=1}^{D} \frac{\partial x_i}{\partial z_i}. \qquad (21)$

Thus, the determinant of the Jacobian of any autoregressive flow can be computed in $O(D)$ time instead of $O(D^3)$.

Real Non-Volume Preserving Flows (R-NVP)

Real Non-Volume Preserving flows (R-NVP, [rnvp]) are autoregressive flows defined by the following transformations:

$x_{1:d} = z_{1:d}, \qquad (22)$
$x_{d+1:D} = z_{d+1:D} \odot \exp\!\bigl(s(z_{1:d})\bigr) + t(z_{1:d}), \qquad (23)$

where $d < D$, $\odot$ is element-wise multiplication, and $s$ and $t$ are two arbitrary mappings $\mathbb{R}^{d} \to \mathbb{R}^{D-d}$.

Since R-NVP is a special case of autoregressive flows, its Jacobian matrix is triangular, with the identity matrix as its upper-left $d \times d$ block. Therefore, the first $d$ diagonal elements are equal to 1, whereas the rest are equal to the corresponding components of the vector $\exp\!\bigl(s(z_{1:d})\bigr)$. Hence, the determinant of such a Jacobian is

$\det \frac{\partial x}{\partial z} = \prod_{i=1}^{D-d} \exp\!\bigl(s_i(z_{1:d})\bigr) \qquad (24)$
$= \exp\!\Bigl(\textstyle\sum_{i=1}^{D-d} s_i(z_{1:d})\Bigr). \qquad (25)$

The inverse transformation can be calculated in the following straightforward way:

$z_{1:d} = x_{1:d}, \qquad z_{d+1:D} = \bigl(x_{d+1:D} - t(x_{1:d})\bigr) \odot \exp\!\bigl(-s(x_{1:d})\bigr). \qquad (26)$

Thus, the inverse of an R-NVP transformation can be calculated without inverting $s$ and $t$. Hence, any kind of mappings $s$ and $t$ can be used without affecting the asymptotic complexity of R-NVP; typically, $s$ and $t$ are auxiliary neural networks.
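A compact affine-coupling sketch of equations (22)-(26), with $s$ and $t$ implemented as small MLPs (the hidden size and architecture are arbitrary choices):

```python
import torch
import torch.nn as nn

class Coupling(nn.Module):
    """R-NVP affine coupling: the first d dimensions pass through unchanged (22),
    the remaining ones are scaled and shifted (23)."""
    def __init__(self, dim, d, hidden=64):
        super().__init__()
        self.d = d
        self.s = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, dim - d))
        self.t = nn.Sequential(nn.Linear(d, hidden), nn.Tanh(), nn.Linear(hidden, dim - d))

    def forward(self, z):
        z1, z2 = z[:, :self.d], z[:, self.d:]
        s = self.s(z1)
        x2 = z2 * torch.exp(s) + self.t(z1)              # (23)
        log_det = s.sum(dim=1)                           # (24)-(25): sum of log-scales
        return torch.cat([z1, x2], dim=1), log_det

    def inverse(self, x):
        x1, x2 = x[:, :self.d], x[:, self.d:]
        z2 = (x2 - self.t(x1)) * torch.exp(-self.s(x1))  # (26): no inversion of s or t needed
        return torch.cat([x1, z2], dim=1)
```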

Masked Autoregressive Flow (MAF)

R-NVP provides an efficient family of transformations that is able to fit some correlations between the dimensions of the random variable. However, the first $d$ dimensions remain unchanged, and there is no guarantee that the first $d$ features of the training dataset are uncorrelated, nor that they follow some known distribution (such as a Gaussian). A more general set of transformations was introduced to fix this problem:

$\mu_i = f_{\mu_i}\!\bigl(x_{1:i-1}\bigr), \qquad \alpha_i = f_{\alpha_i}\!\bigl(x_{1:i-1}\bigr), \qquad (27)$
$x_i = z_i \exp(\alpha_i) + \mu_i. \qquad (28)$

The family of autoregressive normalizing flows induced by equations (27) and (28) is called Masked Autoregressive Flow (MAF) [maf].

The determinant of the Jacobian of the forward transformation (28) is:

$\left|\det \frac{\partial x}{\partial z}\right| = \exp\!\Bigl(\textstyle\sum_{i=1}^{D} \alpha_i\Bigr). \qquad (29)$

The inverse transformation has the following simple form:

$z_i = \bigl(x_i - \mu_i\bigr)\exp(-\alpha_i). \qquad (30)$

The components of the inverse MAF transformation (equation (30)) can be calculated in parallel, since $\mu_i$ and $\alpha_i$ depend only on the observed $x$. The components of the forward MAF transformation, however, cannot be parallelized: the components on the left side of equation (28) must be calculated in sequential order. This is the main drawback of MAFs: likelihood estimation through the transformation (30) (to the distribution with known PDF) is fast and scalable, whereas sampling through the forward transformation (28) (to the data distribution) is relatively slow and does not scale.
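The asymmetry between the two directions is easy to see in code. In the sketch below, `cond` stands for a masked autoregressive network (e.g., MADE-style) that returns $(\mu_i, \alpha_i)$ computed from the preceding data components; its interface is an assumption made for illustration:

```python
import torch

def maf_inverse(x, cond):
    """Density direction (30): all z_i are computed in parallel, since mu, alpha depend on x only."""
    mu, alpha = cond(x)                        # each of shape (batch, D), masked so entry i uses x_{1:i-1}
    z = (x - mu) * torch.exp(-alpha)
    log_det_inv = -alpha.sum(dim=1)            # log|det dz/dx|
    return z, log_det_inv

def maf_sample(z, cond):
    """Sampling direction (28): x_i depends on x_{1:i-1}, so components are produced sequentially."""
    x = torch.zeros_like(z)
    for i in range(z.shape[1]):                # D sequential passes; no gradients needed for sampling
        mu, alpha = cond(x)                    # only the already-filled columns of x influence entry i
        x[:, i] = z[:, i] * torch.exp(alpha[:, i]) + mu[:, i]
    return x
```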

Inverse Autoregressive Flows (IAF)

By swapping the roles of $x$ and $z$ in equations (28) and (30), a new class of autoregressive normalizing flows with fast sampling (forward pass) and slow likelihood estimation (inverse pass) is obtained:

$\mu_i = f_{\mu_i}\!\bigl(z_{1:i-1}\bigr), \qquad \alpha_i = f_{\alpha_i}\!\bigl(z_{1:i-1}\bigr), \qquad (31)$
$x_i = z_i \exp(\alpha_i) + \mu_i, \qquad (32)$
$z_i = \bigl(x_i - \mu_i\bigr)\exp(-\alpha_i). \qquad (33)$

This class of normalizing flows is called Inverse Autoregressive Flows (IAF) [iaf]. It has the same complexity and generalization power as MAF, but is fast in the forward (sampling) mode.

2.3 Probability distillation

Both MAF and IAF generalize well. However, neither of them is fast in both directions, i.e., in the sampling (forward) and likelihood estimation (inverse) modes. Moreover, since normalizing flows are trained by maximizing the likelihood, IAF cannot be trained quickly (it scales well only in the sampling mode). To deal with this, the probability distillation approach was proposed [probdist].

The goal of this method is to fit two normalizing flows to the same transformation: given a teacher model (typically a MAF), a student model (typically an IAF) is fitted to the same transformation (Figure 4).

Probability distillation consists of two steps (Figure 4). In the first step (teacher training), the teacher model is trained to project the training dataset onto some standard (domain) distribution by maximizing the likelihood; this is why a MAF, with its scalable likelihood estimation, is typically used here. In the second step (student distillation), the teacher model is frozen and another model (the student) is trained to perform the same transformation as the teacher. This is done by minimizing the KL-divergence between the two data distributions induced by the normalizing flows (equation (34)). During this stage, new samples are generated by the student model (typically an IAF, because of its scalable sampling mode), and the likelihoods obtained from the two models are compared.

Formally, the student distillation stage can be formulated in the following way:

$\theta_S^{*} = \arg\min_{\theta_S} D_{\mathrm{KL}}\!\bigl(p_S \,\|\, p_T\bigr), \qquad (34)$
$D_{\mathrm{KL}}\!\bigl(p_S \,\|\, p_T\bigr) = \mathbb{E}_{x \sim p_S}\bigl[\log p_S(x) - \log p_T(x)\bigr] \qquad (35)$
$= -H(p_S) - \mathbb{E}_{x \sim p_S} \log p_T(x), \qquad (36)$

where $x \sim p_S$ are samples from the IAF, $p_T$ is the teacher distribution density (the training-data distribution estimated by the MAF), $p_S$ is the student distribution density (the training-data distribution estimated by the IAF), and $H(p_S)$ is the entropy of the student distribution.

In this scheme, the student model (IAF) is used only in the forward (sampling) mode, so the likelihood of each generated sample can be estimated efficiently by the student itself. The teacher model (MAF), in turn, is used only in the inverse mode, so the likelihood of the generated IAF samples can be estimated efficiently by the MAF. Thus, training according to this scheme yields a normalizing-flow pipeline that works well in both the sampling and likelihood estimation modes. Moreover, the IAF is trained much faster in this scheme, since it is never used in the inverse mode. Finally, it is important to note that the training dataset is not used during the student distillation stage; instead, any number of generated samples can be used. Hence, the IAF can be fitted to the MAF without overfitting.
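A sketch of the student-distillation objective (34)-(36) as a Monte-Carlo estimate; `sample_with_log_prob` and `log_prob` are assumed interfaces of the student and the (frozen) teacher flows, not a fixed API:

```python
import torch

def distillation_loss(student, teacher, n=256, dim=10):
    """Monte-Carlo KL(p_S || p_T): E_{x ~ p_S}[log p_S(x) - log p_T(x)], with x sampled by the student."""
    z = torch.randn(n, dim)                              # base noise for the student (IAF)
    x, log_p_student = student.sample_with_log_prob(z)   # fast forward (sampling) pass of the IAF
    log_p_teacher = teacher.log_prob(x)                  # fast inverse (density) pass of the frozen MAF
    return (log_p_student - log_p_teacher).mean()        # gradients flow only into the student
```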

Fig. 2: Illustration of normalizing flows. $x$ denotes a sample from the given dataset, whose distribution has an unknown PDF; $z$ denotes a sample from the domain distribution with a known PDF (e.g., Gaussian).
Fig. 3: Illustration of the distribution density change after a transformation $f$ [NFEric]
Fig. 4: Probability distillation. A pre-trained teacher (MAF) is used to score the samples output by the student (IAF). The student is trained to minimize the KL-divergence between its distribution and that of the teacher by maximizing the log-likelihood of its samples under the teacher while maximizing its own entropy at the same time [probdist].

3 Algorithm

In [OPE], we proposed the (1+ε)-class method, which aims to improve on previous approaches by introducing a new sampling technique. It outperforms previous state-of-the-art algorithms in most anomaly detection cases.

However, the (1+ε)-class method has some drawbacks:

  • OPE (equation (6)) suffers from a potentially high variance of gradients;

  • EOPE (equation (11)) suffers from an inefficient sampling scheme: MCMC sampling is computationally expensive because it relies on rejection sampling, whereas Deep Directed Generative models are imprecise;

  • EOPE tends to over-penalize positive predictions.

In this paper, we propose an alternative way of sampling. Our new algorithm uses normalizing flows to sample new anomalies for classifier training from the tails of the normal-class distribution. It consists of two steps (see Figure 7).

In the first step, we train a normalizing flow that is able to sample new surrogate anomalies. In the second step, we train a binary classifier on normal samples and a mixture of real and surrogate (NF-sampled) anomalies.

3.1 Step 1. Training normalizing flow

As mentioned above, the aim of the first step is to train a normalizing flow for sampling new anomalies. At the moment of publication, one of the most powerful NF models for sampling is IAF.

We train the normalizing flow on normal samples. It can be trained with the standard maximum-log-likelihood scheme for normalizing flows:

$\theta^{*} = \arg\max_{\theta} \mathcal{L}_{\mathrm{NF}}(\theta), \qquad (37)$
$\mathcal{L}_{\mathrm{NF}}(\theta) = \mathbb{E}_{x \sim p_0(x)} \log p_\theta(x), \qquad (38)$
$\log p_\theta(x) = \log p_Z\!\bigl(f_\theta^{-1}(x)\bigr) + \log\left|\det \frac{\partial f_\theta^{-1}(x)}{\partial x}\right|, \qquad (39)$

However, since equation (39) requires the inverse transformation, which does not scale for IAF, we propose another training scheme based on probability distillation (see subsection 2.3). First, we train a MAF on normal samples by maximizing the log-likelihood (37)-(39) with respect to its parameters $\theta$. We then freeze the MAF and train an IAF through probability distillation, minimizing (34) with respect to the IAF parameters. In this training scheme, the IAF is used only in the sampling mode and hence can be trained much faster (see Figure 4).

We note that this scheme of training two normalizing flows is auxiliary in our research and can be replaced by any other single NF model.

Once the NF for sampling is trained, it can be used to generate new anomalies. To produce them, we sample from the tails of the normal (Gaussian) domain distribution, where the tail probability $\alpha$ is a hyperparameter (see Figure 5).

Fig. 5: NF bijection between the standard normal domain distribution and the Moon dataset [pedregosa2011scikit]
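One simple way to implement this tail sampling is rejection from the base Gaussian, keeping only points whose squared norm exceeds the $(1-\alpha)$ quantile of the corresponding $\chi^2$ distribution; this particular definition of the tails, and the `sample_from` interface of the flow, are illustrative assumptions:

```python
import torch
from scipy.stats import chi2

def sample_tail_anomalies(flow, n, dim, alpha=0.05):
    """Draw z from the alpha-tail of N(0, I) (||z||^2 above the (1 - alpha) chi^2 quantile)
    and map the points to data space through the sampling direction of the flow."""
    threshold = chi2.ppf(1.0 - alpha, df=dim)
    kept = []
    while sum(t.shape[0] for t in kept) < n:
        z = torch.randn(4 * n, dim)
        kept.append(z[(z ** 2).sum(dim=1) > threshold])   # keep only tail samples
    z_tail = torch.cat(kept)[:n]
    return flow.sample_from(z_tail)                       # assumed NF interface: forward (sampling) pass
```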

Here we rely on the assumption that test-time anomalies are either represented in the given anomalous training set or are novelties with respect to the normal class. In other words, the normal-class density $p_0(x)$ of novelties must be relatively small.

Nevertheless, the data-space density obtained by the NF might differ drastically from the corresponding domain density because of the non-unit Jacobian of the NF transformations (39). Such density distortion is illustrated in Figure 6 and makes the proposed sampling of anomalies incomplete: some points in the tails of the domain distribution may correspond to normal samples, while some points in the body of the domain distribution may correspond to anomalies. To fix this, we propose Jacobian regularization of normalizing flows (Figure 6) by introducing an extra regularization term, which penalizes the model for a non-unit Jacobian:

$R(\theta) = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\!\left(\log\left|\det \frac{\partial f_\theta(z)}{\partial z}\right|\right)^{2}, \qquad (40)$
$\mathcal{L}_{\mathrm{reg}}(\theta) = \mathcal{L}_{\mathrm{NF}}(\theta) - \lambda\, R(\theta), \qquad (41)$

where $\lambda$ denotes the regularization hyperparameter. Here we take the domain distribution of the NF to be standard normal, $\mathcal{N}(0, I)$. We estimate the regularization term in (40) by directly sampling $z$ from the domain distribution, so as to cover the whole sampling space.
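A sketch of one plausible reading of the regularized objective (40)-(41): the standard NF negative log-likelihood on normal samples plus a λ-weighted penalty on the squared log-determinant at points drawn from the Gaussian domain. The exact functional form of the penalty, as well as the `log_prob` and `forward_with_logdet` interfaces, are assumptions made for illustration:

```python
import torch

def regularized_nf_loss(flow, x_norm, lam, n_reg=256, dim=10):
    """NF training objective with Jacobian regularization (cf. (40)-(41))."""
    nll = -flow.log_prob(x_norm).mean()             # standard maximum-likelihood term (37)-(39)
    z = torch.randn(n_reg, dim)                     # direct sampling from the domain N(0, I)
    _, log_det = flow.forward_with_logdet(z)        # log|det df/dz| at the sampled points
    penalty = (log_det ** 2).mean()                 # pushes the Jacobian towards a unit volume change
    return nll + lam * penalty
```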

(a) Domain-distribution density of the NF without Jacobian regularization
(b) Data-distribution density of the NF without Jacobian regularization
(c) Domain-distribution density of the NF with Jacobian regularization
(d) Data-distribution density of the NF with Jacobian regularization
Fig. 6: Density distortion of normalizing flows on the Moon dataset [pedregosa2011scikit]. Without extra regularization, the density of the domain distribution (6(a)) differs significantly from that of the target distribution (6(b)) because of the non-unit Jacobian. To preserve the distribution density after the NF transformations, Jacobian regularization (40) can be used (6(c) and 6(d), respectively).

3.2 Step 2. Training classifier

Once the normalizing flow for sampling is obtained, a classifier can be trained on the normal samples and a mixture of real and surrogate anomalies sampled from the NF (Figure 7). In our research we used loss (6) to train the classifier. We do not focus on the classifier configuration, since any neural network can be used at this step.

3.3 Final algorithm

Result: anomaly classifier
train MAF on normal samples with objective (41);
train IAF from the MAF via probability distillation (equation (34));
while not converged do
       sample surrogate anomalies using the IAF;
       update the classifier by minimizing the OPE loss (equation (6));
end while
Algorithm 1: Normalizing flows for anomaly detection
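Algorithm 1 can be summarized as the following training-loop sketch; `fit_maf`, `distill`, `sample_tail_anomalies` and `ope_loss_step` are placeholders for the components described above, and their interfaces are illustrative:

```python
import torch

def train_anomaly_detector(maf, iaf, clf, opt_clf, x_norm, x_anom, n_epochs, alpha, lam):
    """End-to-end sketch of Algorithm 1 (all component interfaces are placeholders)."""
    fit_maf(maf, x_norm, lam)                  # step 1a: MAF on normal data with objective (41)
    distill(iaf, maf)                          # step 1b: IAF distilled from the frozen MAF, equation (34)
    for _ in range(n_epochs):
        x_surr = sample_tail_anomalies(iaf, n=256, dim=x_norm.shape[1], alpha=alpha)
        x_neg = torch.cat([x_anom, x_surr])    # mixture of real and surrogate anomalies
        loss = ope_loss_step(clf, x_norm, x_neg)   # classifier update with the OPE loss (6)
        opt_clf.zero_grad(); loss.backward(); opt_clf.step()
```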

The final scheme of our algorithm is shown in Figure 7.

Fig. 7: Normalizing flows for deep anomaly detection. Surrogate anomalies are sampled from the tails of the Gaussian domain distribution and transformed by the NF to be mixed with the real samples. Finally, any classifier can be trained on that mixture.

4 Results and discussion

We evaluate the proposed method on the following datasets: KDD-99 [KDD99], SUSY, and HIGGS [higgs]. To reflect the typical anomaly detection cases targeted by our approach, we derive multiple tasks from each dataset by varying the size of the anomalous sample.

As the proposed method targets problems that are intermediate between one-class and two-class settings, we compare our approach against the following algorithms:

  • conventional two-class classification;

  • a semi-supervised method: dimensionality reduction by a Deep AutoEncoder followed by two-class classification;

  • one-class methods: Deep SVDD and Robust AutoEncoder [RVAE];

  • the (1+ε)-class methods of [OPE] (the ‘*-ope’ algorithms in Tables 8, 9, and 10).

Tables 8, 9, and 10 show our experimental results. In these tables, columns represent tasks with a varying number of negative samples present in the training set: the numbers in the header indicate either the number of classes that form the negative class (for the KDD dataset) or the number of negative samples used (HIGGS and SUSY); ‘one class’ denotes the absence of known anomalous samples. Throughout this research, we used the following fixed numbers of positive (normal) samples: 89574 for KDD, 5119249 for HIGGS, and 216816 for SUSY. As one-class algorithms do not take negative samples into account, their results are carried over to the tasks with known anomalies.

one class 1 2 4 8
rae-oc
deep-svdd-oc
two-class -
semi-supervised -
brute-force-ope
hmc-eope
rmsprop-eope
deep-eope
normalising-flow
Table 8: Results on the KDD-99 dataset.
one class 100 1000 10000 1000000
rae-oc
deep-svdd-oc
two-class -
semi-supervised -
brute-force-ope
hmc-eope
rmsprop-eope
deep-eope
normalising-flow
Table 9: Results on the HIGGS dataset.
one class 100 1000 10000 1000000
rae-oc
deep-svdd-oc
two-class -
semi-supervised -
brute-force-ope
hmc-eope
rmsprop-eope
deep-eope
normalising-flow
Table 10: Results on the SUSY dataset.

Tables 8, 9, and 10 show a significant improvement in results when using the proposed method.

Unlike conventional anomaly detection algorithms [RVAE, LOF, OC_NN, IsolationForest, SVM], our new method, along with [OPE], utilizes anomalies during the training process. It outperforms all existing methods, including our previously designed algorithm [OPE], in most realistic scenarios. The method is fast and stable in both the training and inference stages. However, since a standard classifier sits at the top of the scheme (Figure 7), overfitting must be carefully monitored and addressed.

5 Conclusion

In this work, we present a new anomaly detection algorithm that deals efficiently with problems that are hard to address with either one-class or two-class methods. Our solution combines the best features of the one-class and two-class approaches. In contrast to one-class approaches, the proposed method can effectively utilize any number of known anomalous examples; unlike conventional two-class classification, it does not require an extensive sample of anomalies. Our algorithm significantly outperforms the existing anomaly detection algorithms in several realistic anomaly detection cases. The approach is especially beneficial for anomaly detection problems in which anomalous data is non-representative or might drift over time.

Acknowledgments

The research leading to these results has received funding from the Russian Science Foundation under grant agreement no. 19-71-30020.


Artem Ryzhikov received his M.Sc. degree and is currently pursuing a PhD in Computer Science at the National Research University Higher School of Economics, Russia. His research interests lie in machine learning and its applications to High-Energy Physics, with a focus on Deep Learning and Bayesian methods.

Maxim Borisyak graduated from Moscow Institute of Physics and Technology in 2015, and is currently a PhD student in Computer Science at National Research University Higher School of Economics, Russia. His research interests include Machine Learning for High-Energy Physics with focus on Deep Learning, generative models and optimization.

Andrey Ustyuzhanin received a diploma in Computer Science from the Moscow Institute of Physics and Technology, Russia (2000) and a Ph.D. in Computer Science from the Institute of System Programming RAS, Russia (2007). He is the director of the Laboratory of Methods for Big Data Analysis at the Higher School of Economics. His team participates in several international collaborations: LHCb and SHiP (Search for Hidden Particles). The primary focus of his research is the design and application of Machine Learning methods to improve our fundamental understanding of the world.

Denis Derkach graduated from Saint-Petersburg State University in 2007 and later obtained a PhD in particle physics from the University of Paris 11. After postdocs at the Istituto Nazionale di Fisica Nucleare and the University of Oxford, he joined the Higher School of Economics, Moscow, where he is currently an assistant professor. His main research interest is the application of data science methods in fundamental physics. He is a principal investigator of the HSE LHCb team.
