# Bayesian VAEs for Unsupervised Out-of-Distribution Detection

## Abstract

Despite their successes, deep neural networks may make unreliable predictions when faced with test data drawn from a distribution different to that of the training data, constituting a major problem for AI safety. While this has recently motivated the development of methods to detect such out-of-distribution (OoD) inputs, a robust solution is still lacking. We propose a new probabilistic, unsupervised approach to this problem based on a Bayesian variational autoencoder model, which estimates a full posterior distribution over the decoder parameters using stochastic gradient Markov chain Monte Carlo, instead of fitting a point estimate. We describe how information-theoretic measures based on this posterior can then be used to detect OoD inputs both in input space and in the model’s latent space. We empirically show the effectiveness of our approach.

## 1 Introduction

Outlier detection in input space. While deep neural networks (DNNs) have successfully tackled complex real-world problems in various domains including vision, speech and language (LeCun et al., 2015), they still face significant limitations that make them unfit for safety-critical applications (Amodei et al., 2016). One well-known shortcoming of DNNs is that, when faced with test data points coming from a different distribution than the data the network saw during training, the DNN will not only output incorrect predictions, but it will do so with high confidence (Nguyen et al., 2015). The lack of robustness of DNNs to such out-of-distribution (OoD) inputs (or outliers/anomalies) was recently addressed by various methods to detect OoD inputs in the context of prediction tasks (typically classification) (Hendrycks & Gimpel, 2016; Liang et al., 2017; Hendrycks et al., 2018). When we are only given input data, one simple and seemingly sensible approach to detect a potential OoD input $\mathbf{x}^*$ is to train a likelihood-based deep generative model (DGM; e.g. a VAE, auto-regressive DGM, or flow-based DGM) by (approximately) maximizing the probability $p(\mathcal{D}|\theta)$ of the training data $\mathcal{D}$ under the model parameters $\theta$, and to then estimate the density $p(\mathbf{x}^*|\theta)$ of $\mathbf{x}^*$ under the generative model (Bishop, 1994). If $p(\mathbf{x}^*|\theta)$ is large, then $\mathbf{x}^*$ is likely in-distribution, and OoD otherwise. However, recent works have shown that this likelihood-based approach does not work in general, as DGMs sometimes assign higher density to OoD data than to in-distribution data (Nalisnick et al., 2018). While some papers developed more effective scores that correct the likelihood (Choi & Jang, 2018; Ren et al., 2019; Nalisnick et al., 2019), we argue and show that OoD detection methods which are fundamentally based on the unreliable likelihood estimates by DGMs are not robust.

Outlier detection in latent space. In a distinct line of research, recent works have tackled the challenge of optimizing a costly-to-evaluate black-box function over a high-dimensional, richly structured input domain (e.g. graphs, images). Given data $\mathcal{D} = \{(\mathbf{x}_i, y_i)\}_{i=1}^N$, these methods jointly train a VAE on inputs $\mathbf{x}$ and a predictive model mapping from latent codes $\mathbf{z}$ to targets $y$, to then perform the optimization w.r.t. $y$ in the low-dimensional, continuous latent space instead of in input space (Gómez-Bombarelli et al., 2018). While these methods have achieved strong results in domains including automatic chemical design and automatic machine learning (Gómez-Bombarelli et al., 2018; Luo et al., 2018; Lu et al., 2018), their practical effectiveness is limited by their ability to handle the following trade-off: They need to find inputs $\mathbf{x}$ that both have a high target value $y$ and are sufficiently novel (i.e., not too close to training inputs $\mathcal{D}$), and at the same time ensure that the optimization w.r.t. $y$ does not progress into regions of the latent space too far away from the training data, which might yield latent points $\mathbf{z}$ that decode to semantically meaningless or syntactically invalid inputs $\mathbf{x}$ (Kusner et al., 2017; Griffiths & Hernández-Lobato, 2017; Mahmood & Hernández-Lobato, 2019). The required ability to quantify the novelty of latents $\mathbf{z}$ (i.e., the semantic/syntactic distance to $\mathcal{D}$) directly corresponds to the ability to effectively detect outliers in latent space.

Our approach. We propose a novel unsupervised, probabilistic method to simultaneously tackle the challenge of detecting outliers $\mathbf{x}^*$ in input space as well as outliers $\mathbf{z}^*$ in latent space. To this end, we take an information-theoretic perspective on OoD detection, and propose to use the (expected) informativeness of an input $\mathbf{x}^*$ / latent $\mathbf{z}^*$ as a proxy for whether $\mathbf{x}^*$ / $\mathbf{z}^*$ is OoD or not. To quantify this informativeness, we leverage probabilistic inference methods to maintain a posterior distribution over the parameters of a DGM, in particular of a variational autoencoder (VAE) (Kingma & Welling, 2013; Rezende et al., 2014). This results in a Bayesian VAE (BVAE) model, where instead of fitting a point estimate of the decoder parameters via maximum likelihood, we estimate their posterior using samples generated via stochastic gradient Markov chain Monte Carlo (MCMC). The informativeness of an unobserved input $\mathbf{x}^*$ / latent $\mathbf{z}^*$ is then quantified by measuring the (expected) change in the posterior over model parameters after having observed $\mathbf{x}^*$ / $\mathbf{z}^*$, revealing an intriguing connection to information-theoretic active learning (MacKay, 1992).

Our contributions. (a) We explain how DGMs can be made more robust by capturing epistemic uncertainty via a posterior distribution over their parameters, and describe how such Bayesian DGMs can effectively detect outliers both in input space and in the model’s latent space based on information-theoretic principles (Section 3). (b) We propose a Bayesian VAE model as a concrete instantiation of a Bayesian DGM (Section 4). (c) We empirically demonstrate that our approach outperforms state-of-the-art OoD detection methods across several benchmarks (Section 5).

## 2 Problem Statement and Background

### 2.1 Out-of-Distribution (OoD) Detection

For input space OoD detection, we are given a large set $\mathcal{D} = \{\mathbf{x}_i\}_{i=1}^N$ of high-dimensional training inputs (i.e., with both the number of inputs $N$ and their dimensionality being large) drawn i.i.d. from a distribution $p^*(\mathbf{x})$, and a single test input $\mathbf{x}^*$, and need to determine if $\mathbf{x}^*$ was drawn from $p^*(\mathbf{x})$ or from some other distribution. Latent space OoD detection is analogous, but with an often smaller set of typically lower-dimensional latent points $\mathbf{z}$.

### 2.2 Variational Autoencoders

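Consider a latent variable model $p(\mathbf{x}|\theta) = \int p(\mathbf{x}|\mathbf{z},\theta)\,p(\mathbf{z})\,d\mathbf{z}$ with marginal log-likelihood (or evidence) $\log p(\mathbf{x}|\theta)$, where $\mathbf{x}$ are observed variables, $\mathbf{z}$ are latent variables, and $\theta$ are model parameters.^{1} A VAE fits $\theta$ jointly with the parameters $\phi$ of an amortized variational posterior $q(\mathbf{z}|\mathbf{x},\phi)$ by maximizing the evidence lower bound (ELBO)

$$\mathcal{L}_{\theta,\phi}(\mathbf{x}) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x},\phi)}\left[\log p(\mathbf{x}|\mathbf{z},\theta)\right] - \mathrm{KL}\left[q(\mathbf{z}|\mathbf{x},\phi)\,\|\,p(\mathbf{z})\right] \tag{1}$$

for $\mathbf{x} \in \mathcal{D}$, with $\mathcal{L}_{\theta,\phi}(\mathbf{x}) \le \log p(\mathbf{x}|\theta)$. As maximizing the ELBO approximately maximizes the evidence $\log p(\mathbf{x}|\theta)$, this can be viewed as approximate maximum likelihood estimation (MLE). In practice, $\mathcal{L}_{\theta,\phi}$ in Eq. 1 is optimized by mini-batch stochastic gradient-based methods using low-variance, unbiased, stochastic Monte Carlo estimators of $\nabla_{\theta,\phi}\mathcal{L}_{\theta,\phi}$ obtained via the reparametrization trick. Finally, one can use importance sampling w.r.t. the variational posterior $q(\mathbf{z}|\mathbf{x},\phi)$ to get an estimator $\hat{p}(\mathbf{x}|\theta,\phi)$ of the probability $p(\mathbf{x}|\theta)$ of an input $\mathbf{x}$ under the generative model, i.e.,

$$p(\mathbf{x}|\theta) = \mathbb{E}_{q(\mathbf{z}|\mathbf{x},\phi)}\left[\frac{p(\mathbf{x}|\mathbf{z},\theta)\,p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x},\phi)}\right] \simeq \frac{1}{K}\sum_{k=1}^{K}\frac{p(\mathbf{x}|\mathbf{z}_k,\theta)\,p(\mathbf{z}_k)}{q(\mathbf{z}_k|\mathbf{x},\phi)} =: \hat{p}(\mathbf{x}|\theta,\phi) \tag{2}$$

where $\mathbf{z}_k \sim q(\mathbf{z}|\mathbf{x},\phi)$ and where the estimator $\hat{p}(\mathbf{x}|\theta,\phi)$ is conditioned on both $\theta$ and $\phi$ to make explicit the dependence on the parameters $\phi$ of the proposal distribution $q(\mathbf{z}|\mathbf{x},\phi)$.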
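
The importance-sampling estimator of Eq. 2 can be illustrated on a toy model whose marginal likelihood is known in closed form. The following is a minimal sketch under stated assumptions, not the paper's implementation: it assumes a one-dimensional model with prior $p(z) = \mathcal{N}(0, 1)$ and likelihood $p(x|z) = \mathcal{N}(x; z, 1)$, so that $p(x) = \mathcal{N}(x; 0, 2)$, and all function names are hypothetical.

```python
import math
import random


def log_normal_pdf(value, mean, var):
    """Log-density of a univariate Gaussian N(value; mean, var)."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def log_marginal_is(x, q_mean, q_var, num_samples, rng):
    """Importance-sampling estimate of log p(x) for the toy model
    p(z) = N(0, 1), p(x|z) = N(z, 1), with proposal q(z|x) = N(q_mean, q_var)."""
    log_weights = []
    for _ in range(num_samples):
        z = rng.gauss(q_mean, math.sqrt(q_var))
        log_w = (log_normal_pdf(x, z, 1.0)             # log p(x|z)
                 + log_normal_pdf(z, 0.0, 1.0)         # log p(z)
                 - log_normal_pdf(z, q_mean, q_var))   # log q(z|x)
        log_weights.append(log_w)
    # Log of the average weight, computed stably via log-sum-exp.
    max_lw = max(log_weights)
    mean_w = sum(math.exp(lw - max_lw) for lw in log_weights) / num_samples
    return max_lw + math.log(mean_w)


x = 1.3
exact = log_normal_pdf(x, 0.0, 2.0)  # closed-form log p(x) for this toy model
rng = random.Random(0)
# Using the exact posterior N(x/2, 1/2) as proposal makes every weight equal to p(x).
est_exact_proposal = log_marginal_is(x, x / 2.0, 0.5, 10, rng)
# A mismatched (wider) proposal still works, but the estimate becomes noisy.
est_wide_proposal = log_marginal_is(x, x / 2.0, 1.0, 20000, rng)
```

In a VAE the proposal is the encoder output rather than the exact posterior, so the quality of the estimate depends on how well the encoder matches the true posterior for the input at hand.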

### 2.3 Stochastic Gradient MCMC

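To generate samples $\theta \sim p(\theta|\mathcal{D})$ of parameters $\theta$ of a DNN, one typically uses stochastic gradient MCMC methods such as stochastic gradient Hamiltonian Monte Carlo (SGHMC). In particular, consider the posterior distribution $p(\theta|\mathcal{D}) \propto \exp(-E(\theta))$ with potential energy function $E(\theta) = -\log p(\mathcal{D}|\theta) - \log p(\theta)$ induced by the prior $p(\theta)$ and marginal log-likelihood $\log p(\mathcal{D}|\theta) = \sum_{\mathbf{x}\in\mathcal{D}} \log p(\mathbf{x}|\theta)$. Hamiltonian Monte Carlo (HMC) (Duane et al., 1987; Betancourt, 2017) is a method that generates samples to efficiently explore the parameter space by simulating Hamiltonian dynamics, which involves evaluating the gradient $\nabla_\theta E(\theta)$ of $E(\theta)$. However, computing this gradient requires examining the entire dataset $\mathcal{D}$ (due to the summation of the log-likelihood over all $\mathbf{x} \in \mathcal{D}$), which might be prohibitively costly for large datasets. To overcome this, Chen et al. (2014) proposed SGHMC as a scalable HMC variant based on a noisy, unbiased gradient estimate computed on a minibatch $\mathcal{M}$ of points sampled uniformly at random from $\mathcal{D}$ (akin to minibatch-based optimization algorithms such as stochastic gradient descent), i.e.,

$$\nabla_\theta E(\theta) \simeq -\frac{|\mathcal{D}|}{|\mathcal{M}|}\sum_{\mathbf{x}\in\mathcal{M}} \nabla_\theta \log p(\mathbf{x}|\theta) - \nabla_\theta \log p(\theta) \tag{3}$$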
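
To make the unbiasedness of the minibatch gradient concrete, the sketch below compares the full-data energy gradient with the rescaled minibatch estimate. It is an illustration only and assumes a toy model that is not part of the paper: a Gaussian likelihood $\mathcal{N}(x; \theta, 1)$ with a $\mathcal{N}(0, 100)$ prior on $\theta$; all function names are hypothetical.

```python
import random


def grad_log_lik(x, theta):
    """d/dtheta of log N(x; theta, 1)."""
    return x - theta


def grad_log_prior(theta, prior_var=100.0):
    """d/dtheta of log N(theta; 0, prior_var)."""
    return -theta / prior_var


def energy_grad_full(data, theta):
    """Exact gradient of E(theta) = -log p(D|theta) - log p(theta)."""
    return -sum(grad_log_lik(x, theta) for x in data) - grad_log_prior(theta)


def energy_grad_minibatch(data, batch, theta):
    """Minibatch estimate: the batch sum is rescaled by |D| / |M|."""
    scale = len(data) / len(batch)
    return -scale * sum(grad_log_lik(x, theta) for x in batch) - grad_log_prior(theta)


rng = random.Random(0)
data = [rng.gauss(2.0, 1.0) for _ in range(50)]
theta = 0.5

full = energy_grad_full(data, theta)
# Averaging the estimator over all size-1 minibatches recovers the full gradient,
# which is what unbiasedness under uniform sampling amounts to.
mean_over_batches = sum(energy_grad_minibatch(data, [x], theta) for x in data) / len(data)
```

The rescaling factor is what keeps the estimate on the same scale as the full-data gradient; without it the prior term would dominate for small minibatches.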

## 3 Information-theoretic Outlier Detection

### 3.1 Motivation and Intuition

**Why do deep generative models fail at OoD detection?** Consider the intuitive and principled OoD detection method which first trains a density estimator $p(\mathbf{x}|\theta)$ parameterized by $\theta$, and then classifies an input $\mathbf{x}^*$ as OoD based on a threshold $\tau$ on the density of $\mathbf{x}^*$, i.e., if $p(\mathbf{x}^*|\theta) < \tau$ (Bishop, 1994).
Recent advances in deep generative modeling (DGM) allow us to do density estimation even over high-dimensional, structured input domains (e.g. images, text), which in principle enables us to use this method in such complex settings.
However, the resulting OoD detection performance fundamentally relies on the quality of the likelihood estimates produced by these DGMs.
In particular, a sensible density estimator should assign high density to everything within the training data distribution, and low density to everything outside – a property of crucial importance for effective OoD detection.
Unfortunately, Nalisnick et al. (2018) and Choi & Jang (2018) found that modern DGMs are often poorly calibrated, assigning higher density to OoD data than to in-distribution data.^{2}

**How are deep discriminative models made OoD robust?** DNNs are typically trained by maximizing the likelihood $p(\mathcal{D}|\theta)$ of a set of training examples $\mathcal{D}$ under model parameters $\theta$, yielding the maximum likelihood estimate (MLE) $\theta^* = \arg\max_\theta p(\mathcal{D}|\theta)$ (Goodfellow et al., 2016).
For discriminative, predictive models $p(y|\mathbf{x},\theta)$, it is well known that the point estimate $\theta^*$ does not capture model / epistemic uncertainty, i.e., uncertainty about the choice of model parameters $\theta$ induced by the fact that many different models $\theta$ might have generated $\mathcal{D}$.
As a result, discriminative models trained via MLE tend to be overconfident in their predictions, especially on OoD data (Nguyen et al., 2015; Guo et al., 2017).
A principled, established way to capture model uncertainty in DNNs is to be Bayesian and infer a full distribution $p(\theta|\mathcal{D})$ over parameters $\theta$, yielding the predictive distribution $p(y|\mathbf{x},\mathcal{D}) = \int p(y|\mathbf{x},\theta)\,p(\theta|\mathcal{D})\,d\theta$ (Gal, 2016).
Bayesian DNNs have much better OoD robustness than deterministic ones (e.g., producing low uncertainty for in-distribution and high uncertainty for OoD data), and OoD calibration has become a major benchmark for Bayesian DNNs (Gal & Ghahramani, 2016; Ovadia et al., 2019; Osawa et al., 2019; Maddox et al., 2019).
This suggests that capturing model uncertainty via Bayesian inference is a promising way to achieve robust, principled OoD detection.

**Why should we use Bayesian DGMs for OoD detection?** Just like deep discriminative models, DGMs are typically trained by maximizing the probability $p(\mathcal{D}|\theta)$ that $\mathcal{D}$ was generated by the density model, yielding the MLE $\theta^*$.
As a result, it is not surprising that the shortcomings of MLE-trained discriminative models also translate to MLE-trained generative models, such as the miscalibration and unreliability of their likelihood estimates $p(\mathbf{x}^*|\theta^*)$ for OoD inputs $\mathbf{x}^*$.
This is because there will always be many different plausible generative / density models $\theta$ of the training data $\mathcal{D}$, which are not captured by the point estimate $\theta^*$.
If we do not trust our predictive models $p(y|\mathbf{x},\theta^*)$ on OoD data, why should we trust our generative models $p(\mathbf{x}|\theta^*)$, given that both are based on the same, unreliable DNNs?
In analogy to the discriminative setting, we argue that OoD robustness can be achieved by capturing the epistemic uncertainty in the DGM parameters $\theta$.
This motivates the use of Bayesian DGMs, which estimate a full distribution $p(\theta|\mathcal{D})$ over parameters $\theta$ and thus capture many different density estimators to explain the data, yielding the expected/average likelihood

$$p(\mathbf{x}|\mathcal{D}) = \int p(\mathbf{x}|\theta)\,p(\theta|\mathcal{D})\,d\theta = \mathbb{E}_{p(\theta|\mathcal{D})}\left[p(\mathbf{x}|\theta)\right] \tag{4}$$

**How can we use Bayesian DGMs for OoD detection?** Assume that given data $\mathcal{D}$, we have inferred a distribution $p(\theta|\mathcal{D})$ over the parameters $\theta$ of a DGM.
In particular, we consider the case where $p(\theta|\mathcal{D})$ is represented by a set of samples $\Theta = \{\theta_m\}_{m=1}^{M}$, which can be viewed as an ensemble of $M$ DGMs.^{3}^{4}^{5}

### 3.2 Quantifying Disagreement between Models

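We propose the following score $D_\Theta[\mathbf{x}^*]$ to quantify the disagreement or variation in the likelihoods $\{p(\mathbf{x}^*|\theta)\}_{\theta\in\Theta}$ of a set $\Theta = \{\theta_m\}_{m=1}^{M}$ of model parameter samples:

$$D_\Theta[\mathbf{x}^*] = \frac{1}{\sum_{\theta\in\Theta} w_\theta^2}, \quad \text{with} \quad w_\theta = \frac{p(\mathbf{x}^*|\theta)}{\sum_{\theta'\in\Theta} p(\mathbf{x}^*|\theta')} \tag{5}$$

I.e., the likelihoods $\{p(\mathbf{x}^*|\theta)\}_{\theta\in\Theta}$ are first normalized to yield $\{w_\theta\}_{\theta\in\Theta}$ (see Eq. 5), such that $w_\theta \in [0, 1]$ and $\sum_{\theta\in\Theta} w_\theta = 1$.
The normalized likelihoods $\{w_\theta\}_{\theta\in\Theta}$ effectively define a categorical distribution over models $\theta \in \Theta$, where each value $w_\theta$ can be interpreted as the probability that $\mathbf{x}^*$ was generated from the model $\theta$, thus measuring how well $\mathbf{x}^*$ is explained by the model $\theta$, relative to the other models.
To obtain the score in Eq. 5, we then square the normalized likelihoods, sum them up, and take the reciprocal.
Note that $D_\Theta[\cdot] \in [1, M]$.^{6}

As $D_\Theta[\cdot]$ measures the degree of disagreement/variation between the models $\Theta$ as to how probable $\mathbf{x}^*$ / $\mathbf{z}^*$ is, it can be used to classify $\mathbf{x}^*$ / $\mathbf{z}^*$ as follows: If $D_\Theta[\cdot]$ is large, then $\{w_\theta\}_{\theta\in\Theta}$ is close to the (discrete) uniform distribution $[1/M, \ldots, 1/M]$ (for which $D_\Theta[\cdot] = M$), meaning that all models explain $\mathbf{x}^*$ / $\mathbf{z}^*$ equally well and are in agreement as to how probable $\mathbf{x}^*$ / $\mathbf{z}^*$ is. Thus, $\mathbf{x}^*$ / $\mathbf{z}^*$ likely is in-distribution. Conversely, if $D_\Theta[\cdot]$ is small, then $\{w_\theta\}_{\theta\in\Theta}$ contains a few large weights (i.e., corresponding to models that by chance happen to explain $\mathbf{x}^*$ / $\mathbf{z}^*$ well), with all other weights being very small, where in the extreme case, $\{w_\theta\}_{\theta\in\Theta} = [0, \ldots, 0, 1, 0, \ldots, 0]$ (for which $D_\Theta[\cdot] = 1$). This means that the models do not agree as to how probable $\mathbf{x}^*$ / $\mathbf{z}^*$ is, so that $\mathbf{x}^*$ / $\mathbf{z}^*$ likely is out-of-distribution.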
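
The disagreement score can be computed directly from per-model log-likelihoods. The sketch below is one possible way to do so and is not taken from the paper's code: it assumes the inputs are log-likelihood estimates $\log p(\mathbf{x}^*|\theta_m)$ (e.g. from importance sampling), and the function name and example values are hypothetical. Working in log space matters because image log-likelihoods are typically large negative numbers that underflow when exponentiated naively.

```python
import math


def disagreement_score(log_likelihoods):
    """Reciprocal of the sum of squared normalized likelihoods,
    computed from per-model log-likelihoods log p(x*|theta_m)."""
    max_ll = max(log_likelihoods)
    # Subtracting the maximum leaves the normalized weights unchanged
    # but avoids underflow of every term at once.
    unnormalized = [math.exp(ll - max_ll) for ll in log_likelihoods]
    total = sum(unnormalized)
    weights = [u / total for u in unnormalized]
    return 1.0 / sum(w * w for w in weights)


# Four models that agree exactly: the weights are uniform and the score equals M.
agree = disagreement_score([-1200.0, -1200.0, -1200.0, -1200.0])
# One model explains the input far better than the rest: the score approaches 1.
disagree = disagreement_score([-1200.0, -1900.0, -1850.0, -2100.0])
# Moderate disagreement gives a value strictly between 1 and M.
partial = disagreement_score([-1200.0, -1200.5, -1201.0, -1203.0])
```

An input would then be flagged as OoD when its score falls below a chosen threshold; how that threshold is set is left open here.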

### 3.3 An Information-theoretic Perspective

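We now provide a more principled justification for the disagreement score in Eq. 5, which induces an information-theoretic perspective on OoD detection and reveals an intriguing connection to active learning. Assume that given training data $\mathcal{D}$ and a prior distribution $p(\theta)$ over the DGM parameters $\theta$, we have inferred a posterior distribution $p(\theta|\mathcal{D})$ over $\theta$. Then, for a given input $\mathbf{x}^*$, the score $D_\Theta[\mathbf{x}^*]$ quantifies how much the posterior $p(\theta|\mathcal{D})$ would change if we were to add $\mathbf{x}^*$ to $\mathcal{D}$ and then infer the augmented posterior $p(\theta|\mathcal{D}^*)$ based on this new training set $\mathcal{D}^* = \mathcal{D} \cup \{\mathbf{x}^*\}$. To see this, first note that this change in the posterior is quantified by the normalized likelihood $p(\mathbf{x}^*|\theta) / p(\mathbf{x}^*|\mathcal{D})$, such that models $\theta$ under which $\mathbf{x}^*$ is more (less) likely, relative to all other models, will have a higher (lower) probability under the updated posterior $p(\theta|\mathcal{D}^*)$. Now, given the samples $\Theta$ of the old posterior $p(\theta|\mathcal{D})$, the normalized likelihood for a given model $\theta \in \Theta$ is proportional to $w_\theta$ in Eq. 5, i.e.,

$$\frac{p(\mathbf{x}^*|\theta)}{p(\mathbf{x}^*|\mathcal{D})} = \frac{p(\mathbf{x}^*|\theta)}{\mathbb{E}_{p(\theta'|\mathcal{D})}\left[p(\mathbf{x}^*|\theta')\right]} \simeq \frac{p(\mathbf{x}^*|\theta)}{\frac{1}{M}\sum_{\theta'\in\Theta} p(\mathbf{x}^*|\theta')} = M w_\theta \propto w_\theta \tag{6}$$

Thus, $w_\theta$ intuitively measures the relative usefulness of $\theta$ for describing the new posterior $p(\theta|\mathcal{D}^*)$. More formally, the $w_\theta$ correspond to the importance weights of the samples $\theta \in \Theta$ drawn from the proposal distribution $p(\theta|\mathcal{D})$ for an importance sampling-based Monte Carlo approximation of an expectation w.r.t. the target distribution $p(\theta|\mathcal{D}^*)$,

$$\mathbb{E}_{p(\theta|\mathcal{D}^*)}\left[f(\theta)\right] = \mathbb{E}_{p(\theta|\mathcal{D})}\left[\frac{p(\theta|\mathcal{D}^*)}{p(\theta|\mathcal{D})} f(\theta)\right] \simeq \sum_{\theta\in\Theta} w_\theta f(\theta) \tag{7}$$

for any function $f$.
The score in Eq. 5 is a widely used measure of the efficiency of the estimator in Eq. 7, known as the effective sample size (ESS) of $\Theta$ (Martino et al., 2017).
It quantifies how many i.i.d. samples drawn from the target posterior $p(\theta|\mathcal{D}^*)$ are equivalent to the $M$ samples $\theta \in \Theta$ drawn from the proposal posterior $p(\theta|\mathcal{D})$ and weighted according to $w_\theta$, and thus indeed measures the change in distribution from $p(\theta|\mathcal{D})$ to $p(\theta|\mathcal{D}^*)$.
Equivalently, $D_\Theta[\mathbf{x}^*]$ can be viewed as quantifying the informativeness of $\mathbf{x}^*$ for updating the DGM parameters $\theta$ to the ones capturing the true density.^{7}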

The OoD detection mechanism described in Section 3.2 can thus be intuitively summarised as follows: In-distribution inputs $\mathbf{x}^*$ are similar to the data points already in $\mathcal{D}$ and thus uninformative about the model parameters $\theta$, inducing a small change in distribution from $p(\theta|\mathcal{D})$ to $p(\theta|\mathcal{D}^*)$, resulting in a large ESS $D_\Theta[\mathbf{x}^*]$. Conversely, OoD inputs $\mathbf{x}^*$ are very different from the previous observations in $\mathcal{D}$ and thus informative about the model parameters $\theta$, inducing a large change in the posterior, resulting in a small ESS $D_\Theta[\mathbf{x}^*]$.

Finally, this information-theoretic perspective on OoD detection reveals a close relationship to information-theoretic active learning (MacKay, 1992; Houlsby et al., 2011).
There, the same notion of informativeness (or, equivalently, disagreement) is used to quantify the novelty of an input $\mathbf{x}^*$ to be added to the data $\mathcal{D}$, aiming to maximally improve the estimate of the model parameters $\theta$ by maximally reducing the entropy / epistemic uncertainty in the posterior $p(\theta|\mathcal{D})$.^{7}

## 4 The Bayesian VAE (BVAE)

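As an example of a Bayesian DGM, we propose a Bayesian VAE (BVAE), where instead of fitting the model parameters $\theta$ via (approximate) MLE, $\theta^* = \arg\max_\theta \mathcal{L}_{\theta,\phi}$, to get the likelihood $p(\mathbf{x}|\mathbf{z},\theta^*)$, we place a prior $p(\theta)$ over $\theta$ and estimate its posterior $p(\theta|\mathcal{D})$, yielding the likelihood $p(\mathbf{x}|\mathbf{z},\mathcal{D}) = \int p(\mathbf{x}|\mathbf{z},\theta)\,p(\theta|\mathcal{D})\,d\theta$. The marginal likelihood $p(\mathbf{x}|\mathcal{D}) = \iint p(\mathbf{x}|\mathbf{z},\theta)\,p(\mathbf{z})\,p(\theta|\mathcal{D})\,d\mathbf{z}\,d\theta$ thus integrates out both the latent variables $\mathbf{z}$ and model parameters $\theta$ (cf. Eq. 4). The resulting generative process draws a $\mathbf{z} \sim p(\mathbf{z})$ from its prior and a $\theta \sim p(\theta|\mathcal{D})$ from its posterior, and then generates $\mathbf{x} \sim p(\mathbf{x}|\mathbf{z},\theta)$ via the likelihood. Training a BVAE thus requires Bayesian inference of both the posterior over $\theta$ and the posterior over $\mathbf{z}$, both of which are intractable and thus require approximation. We propose two variants for inferring the posteriors over $\mathbf{z}$ and $\theta$ in a BVAE.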

### 4.1 Variant 1: BVAE with a Single Fixed Encoder

a) Learning the encoder parameters $\phi$. As in a regular VAE, we approximate the posterior over $\mathbf{z}$ using amortized VI via a recognition network $q(\mathbf{z}|\mathbf{x},\phi)$ whose parameters $\phi$ are fit by maximizing the ELBO $\mathcal{L}_{\theta,\phi}$ in Eq. 1, i.e., $\phi^* = \arg\max_\phi \mathcal{L}_{\theta,\phi}$, yielding a single fixed encoder.

b) Learning the decoder parameters $\theta$. To generate posterior samples $\theta \sim p(\theta|\mathcal{D})$ of decoder parameters, we propose to use SGHMC (cf. Section 2). However, the gradient $\nabla_\theta E(\theta)$ of the energy function in Eq. 3 used for simulating the Hamiltonian dynamics requires evaluating the log-likelihood $\log p(\mathbf{x}|\theta)$, which is intractable in a BVAE (as in a VAE). To alleviate this, we approximate the log-likelihood appearing in $\nabla_\theta E(\theta)$ by the ordinary VAE ELBO $\mathcal{L}_{\theta,\phi}$ in Eq. 1. Given a set $\Theta = \{\theta_m\}_{m=1}^{M}$ of posterior samples $\theta_m \sim p(\theta|\mathcal{D})$, we can more intuitively think of having a finite mixture/ensemble of decoders/generative models $p(\mathbf{x}|\mathbf{z},\mathcal{D}) \simeq \frac{1}{M}\sum_{\theta\in\Theta} p(\mathbf{x}|\mathbf{z},\theta)$.

c) Likelihood estimation. This BVAE variant is effectively trained like a normal VAE, but using a sampler instead of an optimizer for $\theta$ (pseudocode is found in the appendix). We obtain an ensemble of $M$ VAEs with a single shared encoder $\phi^*$ and $M$ separate decoder samples $\theta_m$; see Fig. 1 (left) for a cartoon illustration. For the $m$-th VAE, the likelihood $p(\mathbf{x}|\theta_m)$ can then be estimated via importance sampling w.r.t. $q(\mathbf{z}|\mathbf{x},\phi^*)$, as in Eq. 2.
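
To give a feel for what "using a sampler instead of an optimizer" means, the sketch below runs a plain SGHMC chain on a toy problem. It is an illustration under several assumptions and not the paper's training procedure: it uses the basic update of Chen et al. (2014) with friction and injected noise rather than the scale-adapted variant, it samples the mean of a one-dimensional Gaussian instead of decoder weights, and the step size, friction, batch size and all names are hypothetical choices. In a BVAE, the exact log-likelihood gradient used here would be replaced by the gradient of the ELBO.

```python
import math
import random


def sghmc_chain(data, num_steps, step_size=2e-4, friction=0.1,
                batch_size=20, prior_var=100.0, seed=0):
    """Sample the mean theta of N(x; theta, 1) under a N(0, prior_var) prior."""
    rng = random.Random(seed)
    theta, velocity = 0.0, 0.0
    noise_std = math.sqrt(2.0 * friction * step_size)
    samples = []
    for _ in range(num_steps):
        batch = rng.sample(data, batch_size)
        # Minibatch estimate of the energy gradient, rescaled by |D| / |M|.
        grad = (-(len(data) / batch_size) * sum(x - theta for x in batch)
                + theta / prior_var)
        # Apply friction, the gradient step and injected noise to the velocity,
        # then move theta using the updated velocity.
        velocity = ((1.0 - friction) * velocity - step_size * grad
                    + rng.gauss(0.0, noise_std))
        theta += velocity
        samples.append(theta)
    return samples


data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]
samples = sghmc_chain(data, num_steps=20000)[2000:]  # discard burn-in
sample_mean = sum(samples) / len(samples)
# Closed-form posterior mean of this conjugate toy model, for comparison.
posterior_mean = sum(data) / (len(data) + 1.0 / 100.0)
```

Keeping every few-hundredth sample of such a chain would give the set of parameter samples that plays the role of the decoder ensemble.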

### 4.2 Variant 2: BVAE with a Distribution over Encoders

a) Learning the encoder parameters $\phi$. Recall that amortized VI aims to learn to do posterior inference, by optimizing the parameters $\phi$ (cf. Eq. 1) of an inference network $q(\mathbf{z}|\mathbf{x},\phi)$ mapping inputs $\mathbf{x}$ to parameters of the variational posterior over $\mathbf{z}$.

However, one major shortcoming of fitting a single encoder parameter setting $\phi^*$ is that $q(\mathbf{z}|\mathbf{x},\phi^*)$ will not generalize to OoD inputs, but will instead produce confidently wrong posterior inferences (Cremer et al., 2018; Mattei & Frellsen, 2018) (cf. Section 3.1). To alleviate this, we instead capture multiple encoders by inferring a distribution over the variational parameters $\phi$. While this might appear odd conceptually, it allows us to quantify our epistemic uncertainty in the amortized inference of $\mathbf{z}$. It might also be interpreted as regularizing the encoder (Shu et al., 2018), or as increasing its flexibility (Yin & Zhou, 2018). We thus also place a prior $p(\phi)$ over $\phi$ and infer the posterior $p(\phi|\mathcal{D})$, yielding the amortized posterior $q(\mathbf{z}|\mathbf{x},\mathcal{D}) = \int q(\mathbf{z}|\mathbf{x},\phi)\,p(\phi|\mathcal{D})\,d\phi$. We also use SGHMC to sample $\phi \sim p(\phi|\mathcal{D})$, again using the ELBO $\mathcal{L}_{\theta,\phi}$ in Eq. 1 to compute $\nabla_\phi E(\phi)$ (cf. Eq. 3). Given a set $\Phi = \{\phi_m\}_{m=1}^{M}$ of posterior samples $\phi_m \sim p(\phi|\mathcal{D})$, we can again more intuitively think of $q(\mathbf{z}|\mathbf{x},\mathcal{D})$ as a finite mixture/ensemble of encoders/inference networks $q(\mathbf{z}|\mathbf{x},\mathcal{D}) \simeq \frac{1}{M}\sum_{\phi\in\Phi} q(\mathbf{z}|\mathbf{x},\phi)$.

b) Learning the decoder parameters $\theta$. We sample $\theta$ as in Section 4.1. The only difference is that we now have the encoder mixture $q(\mathbf{z}|\mathbf{x},\mathcal{D})$ instead of the single encoder $q(\mathbf{z}|\mathbf{x},\phi)$, technically yielding an ELBO $\mathcal{L}_\theta$ which depends on $\theta$ only, as $\phi$ is averaged over. However, in practice, we for simplicity only use the most recent sample $\phi_m$ to estimate $\mathcal{L}_\theta$, such that $\mathcal{L}_\theta$ effectively reduces to the normal VAE ELBO in Eq. 1 with fixed encoder $\phi_m$.

c) Likelihood estimation. This BVAE variant is effectively trained like a normal VAE, but using a sampler instead of an optimizer for both $\theta$ and $\phi$ (pseudocode is found in the appendix). We obtain an ensemble of $M$ VAEs with $M$ pairs $(\theta_m, \phi_m)$ of coupled encoder-decoder samples; see Fig. 1 (right) for a cartoon illustration. For the $m$-th VAE, the likelihood $p(\mathbf{x}|\theta_m)$ can then be estimated via importance sampling w.r.t. $q(\mathbf{z}|\mathbf{x},\phi_m)$, as in Eq. 2.

## 5 Experiments

### 5.1 Out-of-Distribution Detection in Input Space

BVAE details.
We assess both proposed BVAE variants: BVAE$_1$ samples $\theta$ and optimizes $\phi$ (see Section 4.1), while
BVAE$_2$ samples both $\theta$ and $\phi$ (see Section 4.2).
Our PyTorch implementation uses Adam (Kingma & Ba, 2014) with learning rate for optimization, and scale-adapted SGHMC with step size and momentum decay (Springenberg et al., 2016) for sampling.^{8}

Experimental setup.
We use three benchmarks:
(a) FashionMNIST (in-distribution) vs. MNIST (OoD) (Hendrycks et al., 2018; Nalisnick et al., 2018; Zenati et al., 2018; Akcay et al., 2018; Ren et al., 2019),
(b) SVHN (in-distribution) vs. CIFAR10 (OoD)^{9}^{10}

Results. The table below shows that both BVAE variants significantly outperform the other methods on the considered benchmarks. Fig. 2 (left column) shows the ROC curves used to compute the AUROC metric in the table, for the FashionMNIST vs. MNIST (top) and SVHN vs. CIFAR10 (bottom) benchmarks; ROC curves for the FashionMNIST (held-out) benchmark as well as precision-recall curves for all benchmarks are found in the appendix. BVAE$_2$ outperforms BVAE$_1$ on FashionMNIST vs. MNIST and SVHN vs. CIFAR10, where in-distribution and OoD data are very distinct, but not on FashionMNIST (held-out), where the datasets are much more similar. This suggests that capturing a distribution over encoders is particularly beneficial when train and test data live on different manifolds (as overfitting is more critical), while the fixed encoder generalizes better when train and test manifolds are similar, which is as expected intuitively. Finally, Fig. 2 shows histograms of the log-likelihoods (middle column) and of the BVAE scores (right column) on (top) FashionMNIST in-distribution (blue) vs. MNIST OoD (orange) as well as (bottom) SVHN in-distribution (blue) vs. CIFAR10 OoD (orange) test data. While the log-likelihoods strongly overlap, our proposed score more clearly separates in-distribution data (closer to the r.h.s.) from OoD data (closer to the l.h.s.).

| | FashionMNIST vs MNIST | | | SVHN vs CIFAR10 | | | FashionMNIST (held-out) | | |
|---|---|---|---|---|---|---|---|---|---|
| Method | AUROC | AUPRC | FPR80 | AUROC | AUPRC | FPR80 | AUROC | AUPRC | FPR80 |
| BVAE$_1$ | 0.904 | 0.891 | 0.117 | 0.807 | 0.793 | 0.331 | 0.693 | 0.680 | 0.540 |
| BVAE$_2$ | 0.921 | 0.907 | 0.082 | 0.814 | 0.799 | 0.310 | 0.683 | 0.668 | 0.558 |
| LL | 0.557 | 0.564 | 0.703 | 0.574 | 0.575 | 0.634 | 0.565 | 0.577 | 0.683 |
| LLR | 0.617 | 0.613 | 0.638 | 0.570 | 0.570 | 0.638 | 0.560 | 0.569 | 0.698 |
| TT | 0.482 | 0.502 | 0.833 | 0.395 | 0.428 | 0.859 | 0.482 | 0.496 | 0.806 |
| WAIC | 0.541 | 0.548 | 0.798 | 0.293 | 0.380 | 0.912 | 0.446 | 0.464 | 0.827 |
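
As a rough illustration of how two of the reported metrics could be computed from raw detector scores, the sketch below uses the rank-based (Mann-Whitney) form of the AUROC and reads FPR80 as the false-positive rate at the threshold where the true-positive rate first reaches 80%. Several things here are assumptions made for the example rather than details from the paper: OoD inputs are treated as the positive class, the disagreement score is negated so that larger values mean "more OoD", and the function names and score values are invented.

```python
def auroc(pos_scores, neg_scores):
    """Probability that a random positive outranks a random negative
    (ties count one half)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


def fpr_at_tpr(pos_scores, neg_scores, target_tpr=0.8):
    """False-positive rate at the highest threshold whose TPR reaches target_tpr."""
    for threshold in sorted(pos_scores, reverse=True):
        tpr = sum(p >= threshold for p in pos_scores) / len(pos_scores)
        if tpr >= target_tpr:
            return sum(n >= threshold for n in neg_scores) / len(neg_scores)
    return 1.0


# Invented disagreement scores: large for in-distribution, small for OoD inputs.
in_dist_ess = [4.8, 4.5, 3.9, 4.9, 2.0]
ood_ess = [1.1, 1.3, 2.5, 1.0, 4.6]
# Negate so that a larger value indicates a more likely OoD input.
pos = [-s for s in ood_ess]
neg = [-s for s in in_dist_ess]
example_auroc = auroc(pos, neg)
example_fpr80 = fpr_at_tpr(pos, neg, target_tpr=0.8)
```

The pairwise loop is quadratic in the number of test points, which is fine for a sketch but would usually be replaced by a sort-based computation on full test sets.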

### 5.2 Out-of-Distribution Detection in Latent Space

While input space OoD detection is well-studied, latent space OoD detection has only recently been identified as a critical open problem (Griffiths & Hernández-Lobato, 2017; Gómez-Bombarelli et al., 2018; Mahmood & Hernández-Lobato, 2019; Alperstein et al., 2019) (see also Section 1). Thus, there is a lack of suitable experimental benchmarks, making a quantitative evaluation challenging. A major issue in designing benchmarks based on commonly-used datasets such as MNIST is that it is unclear how to obtain ground truth labels for which latent points are OoD and which are not, as we require OoD labels for all possible latent points $\mathbf{z}^*$, not just for those corresponding to inputs from the given dataset. As a first step towards facilitating a systematic empirical evaluation of latent space OoD detection techniques, we propose the following experimental protocol.

We use the BVAE variant (see Section 5.1), as latent space detection does not require encoder robustness. We train the model on FashionMNIST (or potentially any other dataset), and then sample latent test points from the Gaussian where (we use ), following Mahmood & Hernández-Lobato (2019). Since there do not exist ground truth labels for which latent points are OoD or not, we compute a classifier-based OoD proxy score (to be detailed below) for each of the latent test points and then simply define the latents with the lowest scores to be in-distribution, and all others to be OoD.

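To this end, we train an ensemble (Lakshminarayanan et al., 2017) of convolutional NN classifiers $p(y|\mathbf{x},\psi_j)$ with parameters $\psi_1, \ldots, \psi_J$ on FashionMNIST. We then approximate the novelty score for discriminative models proposed by Houlsby et al. (2011), i.e., $\mathcal{N}(\mathbf{x}^*) = \mathrm{H}\left[\frac{1}{J}\sum_{j=1}^{J} p(y|\mathbf{x}^*,\psi_j)\right] - \frac{1}{J}\sum_{j=1}^{J} \mathrm{H}\left[p(y|\mathbf{x}^*,\psi_j)\right]$, where the first term is the entropy of the mixture of categorical distributions (which is again categorical with averaged probabilities), and the second term is the average entropy of the predictive class distribution of the classifier with parameters $\psi_j$. Alternatively, one could also use the closely related OoD score of Lakshminarayanan et al. (2017). Since $\mathcal{N}(\mathbf{x}^*)$ requires a test input $\mathbf{x}^*$, and we only have the latent code $\mathbf{z}^*$ corresponding to $\mathbf{x}^*$ in our setting, we instead consider the expected novelty $\mathbb{E}_{p(\mathbf{x}^*|\mathbf{z}^*,\mathcal{D})}\left[\mathcal{N}(\mathbf{x}^*)\right]$ under the mixture decoding distribution $p(\mathbf{x}^*|\mathbf{z}^*,\mathcal{D}) \simeq \frac{1}{M}\sum_{\theta\in\Theta} p(\mathbf{x}^*|\mathbf{z}^*,\theta)$. In practice, we use an ensemble of classifiers and input samples for the expectation.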
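
The classifier-based novelty score described above, the entropy of the averaged ensemble prediction minus the average entropy of the individual predictions, can be sketched as follows. This is an illustrative reading of the text rather than the authors' code: it assumes each ensemble member outputs a list of class probabilities for the same input, and the function names are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def novelty_score(ensemble_probs):
    """Entropy of the averaged prediction minus the average member entropy."""
    num_members = len(ensemble_probs)
    num_classes = len(ensemble_probs[0])
    mean_probs = [sum(member[c] for member in ensemble_probs) / num_members
                  for c in range(num_classes)]
    mean_entropy = sum(entropy(member) for member in ensemble_probs) / num_members
    return entropy(mean_probs) - mean_entropy


# Members that agree yield zero novelty, even if each one is somewhat uncertain.
agree = novelty_score([[0.9, 0.1], [0.9, 0.1]])
# Members that are individually confident but contradict each other yield log(2).
conflict = novelty_score([[1.0, 0.0], [0.0, 1.0]])
```

For the expected novelty of a latent point, one would average this score over several inputs decoded from that latent point.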

We compare the BVAE model with our expected disagreement score (see Section 3.2, with samples) against two baselines (which are the only existing methods we are aware of):
(a) The distance of $\mathbf{z}^*$ to the spherical annulus of radius , which is where most probability mass lies under our prior (Annulus) (Alperstein et al., 2019), and
(b) the log-probability of $\mathbf{z}^*$ under the aggregated posterior of the training data in latent space $q(\mathbf{z}) = \frac{1}{N}\sum_{\mathbf{x}\in\mathcal{D}} q(\mathbf{z}|\mathbf{x},\phi)$, i.e., a uniform mixture of $N$ Gaussians in our case (qz).^{11}
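
Both baselines are simple enough to sketch directly. The code below is an illustration under assumptions, not the evaluated implementation: the annulus radius is passed in as a parameter because its value is not recoverable from the text here (for a standard normal prior in $d$ dimensions, a radius close to $\sqrt{d}$ would be a natural choice), the encoder posteriors are assumed to be diagonal Gaussians given by lists of means and variances, and all function names are hypothetical.

```python
import math


def annulus_distance(z, radius):
    """Absolute gap between the Euclidean norm of z and the annulus radius."""
    norm = math.sqrt(sum(v * v for v in z))
    return abs(norm - radius)


def log_diag_gaussian(z, mean, var):
    """Log-density of a Gaussian with diagonal covariance."""
    return sum(-0.5 * (math.log(2.0 * math.pi * s) + (v - m) ** 2 / s)
               for v, m, s in zip(z, mean, var))


def log_aggregated_posterior(z, means, variances):
    """log q(z) for a uniform mixture of the per-input Gaussians q(z|x_i)."""
    log_terms = [log_diag_gaussian(z, m, s) for m, s in zip(means, variances)]
    max_term = max(log_terms)
    # Log-sum-exp over components, minus log of the number of components.
    return (max_term
            + math.log(sum(math.exp(t - max_term) for t in log_terms))
            - math.log(len(log_terms)))


dist = annulus_distance([3.0, 4.0], radius=2.0)
single = log_aggregated_posterior([0.5, -0.5], [[0.0, 0.0]], [[1.0, 1.0]])
doubled = log_aggregated_posterior([0.5, -0.5],
                                   [[0.0, 0.0], [0.0, 0.0]],
                                   [[1.0, 1.0], [1.0, 1.0]])
```

A larger annulus distance or a lower aggregated-posterior log-probability would then be read as evidence that the latent point is OoD.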

## 6 Related Work

Supervised/Discriminative outlier detection methods. Most existing OoD detection approaches are task-specific in that they are applicable within the context of a given prediction task. As described in Section 1, these approaches train a deep discriminative model in a supervised fashion using the given labels. To detect outliers w.r.t. the target task, such approaches typically rely on some sort of confidence score to decide on the reliability of the prediction, which is either produced by modifying the model and/or training procedure, or computed/extracted post-hoc from the model and/or predictions (An & Cho, 2015; Sölch et al., 2016; Hendrycks & Gimpel, 2016; Liang et al., 2017; Hendrycks et al., 2018; Shafaei et al., 2018; DeVries & Taylor, 2018; Sricharan & Srivastava, 2018; Ahmed & Courville, 2019). Alternatively, some methods use predictive uncertainty estimates for OoD detection (Gal & Ghahramani, 2016; Lakshminarayanan et al., 2017; Malinin & Gales, 2018; Osawa et al., 2019; Ovadia et al., 2019) (cf. Section 3.1). The main drawback of such task-specific approaches is that discriminatively trained models by design discard all input features which are not informative about the specific prediction task at hand, such that information that is relevant for general OoD detection might be lost. Thus, whenever the task changes, the predictive (and thus OoD detection) model must be re-trained from scratch, even if the input data remains the same.

Unsupervised/Generative outlier detection methods. In contrast, task-agnostic OoD detection methods solely use the inputs $\mathbf{x}$ for the unsupervised training of a DGM to capture the data distribution, which makes them independent of any prediction task and thus more general. Only a few recent works fall into this category. Ren et al. (2019) propose to correct the likelihood $p(\mathbf{x}^*|\theta)$ for confounding general population level background statistics captured by a background model $p(\mathbf{x}^*|\theta_0)$, resulting in the score $\log p(\mathbf{x}^*|\theta) - \log p(\mathbf{x}^*|\theta_0)$. The background model is in practice trained by perturbing the data with noise to corrupt its semantic structure, i.e., by sampling input dimensions i.i.d. from a Bernoulli distribution with rate $\mu$ and replacing their values by uniform noise, e.g. $\tilde{x}_i \sim \mathcal{U}\{0, \ldots, 255\}$ for images. Choi & Jang (2018) propose to use an ensemble (Lakshminarayanan et al., 2017) of independently trained likelihood-based DGMs (i.e., with random parameter initializations and random data shuffling) to approximate the Watanabe-Akaike Information Criterion (WAIC) (Watanabe, 2010) $\mathbb{E}_\theta\left[\log p(\mathbf{x}^*|\theta)\right] - \mathrm{Var}_\theta\left[\log p(\mathbf{x}^*|\theta)\right]$, which provides an asymptotically correct likelihood estimate between the training and test set expectations (however, assuming a fixed underlying data distribution). Finally, Nalisnick et al. (2019) propose to account for the typicality of $\mathbf{x}^*$ via the score $\left|-\log p(\mathbf{x}^*|\theta) + \frac{1}{N}\sum_{\mathbf{x}\in\mathcal{D}} \log p(\mathbf{x}|\theta)\right|$, although they focus on batches of test inputs instead of single inputs.
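
Two of the likelihood-correcting scores discussed above reduce to a few lines once log-likelihoods are available. The sketch below assumes the commonly stated forms of these scores, namely a log-likelihood ratio against a background model and an ensemble mean minus ensemble variance of the log-likelihood for WAIC; the use of the population variance, the function names and the example numbers are assumptions made for illustration.

```python
def llr_score(log_lik_model, log_lik_background):
    """Log-likelihood ratio between the main model and a background model."""
    return log_lik_model - log_lik_background


def waic_score(log_likelihoods):
    """Ensemble mean of the log-likelihood minus its (population) variance."""
    num_models = len(log_likelihoods)
    mean = sum(log_likelihoods) / num_models
    variance = sum((ll - mean) ** 2 for ll in log_likelihoods) / num_models
    return mean - variance


ratio = llr_score(-1500.0, -1650.0)
# With identical ensemble members the variance penalty vanishes.
waic_agree = waic_score([-100.0, -100.0, -100.0])
# Disagreement between members lowers the score through the variance term.
waic_disagree = waic_score([-100.0, -104.0])
```

Unlike the disagreement score of Eq. 5, both of these remain functions of the raw likelihood values, which is the dependence the paper argues against.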

Bayesian deep generative modeling. Only a few works have tried to bring the benefits of Bayesian inference to DGMs, none of which addresses OoD detection. While Kingma & Welling (2013) describe how to do VI over the decoder parameters of a VAE (see their Appendix F), this is neither motivated nor empirically evaluated. Hernández-Lobato et al. (2016) do mean-field Gaussian VI over the encoder and decoder parameters of an importance-weighted autoencoder (Burda et al., 2015) to increase model flexibility and improve generalization performance. Nguyen et al. (2017) do mean-field Gaussian VI over the decoder parameters of a VAE to enable continual learning. Saatci & Wilson (2017) use stochastic gradient MCMC to sample the parameters of a generative adversarial network (Goodfellow et al., 2014) to increase model expressiveness. Gong et al. (2019) use stochastic gradient MCMC to sample the decoder parameters of a VAE for feature-wise active learning.

## 7 Conclusion

We proposed an effective method for unsupervised OoD detection, both in input space and in latent space, which uses information-theoretic metrics based on the posterior distribution over the parameters of a DGM (in particular a VAE). In the future, we want to explore extensions to other approximate inference techniques (e.g. variational inference; Blei et al., 2017), and to other DGMs (e.g., flow-based (Kingma & Dhariwal, 2018) or auto-regressive (Van den Oord et al., 2016) DGMs). Finally, we hope that this paper will inspire many follow-up works that will (a) develop further benchmarks and methods for the underappreciated yet critical problem of latent space OoD detection, and (b) further explore the described paradigm of information-theoretic OoD detection, which might be a promising approach towards the grand goal of making DNNs more reliable and robust.

#### Acknowledgements

We would like to thank Wenbo Gong, Gregor Simm, John Bronskill, Umang Bhatt, Andrew Y. K. Foong, and the anonymous reviewers of an earlier version of this paper for their helpful suggestions. Erik Daxberger would like to thank the EPSRC, Qualcomm, and the Cambridge-Tübingen PhD Fellowship for supporting his studies.

## Appendix A Further Details on the Information-theoretic Perspective of our Disagreement Score

In Section 3.3, we mentioned that the disagreement score $D_\Theta[\mathbf{x}^*]$ defined in Eq. 5 can be viewed as quantifying the informativeness of the input $\mathbf{x}^*$ for updating the DGM parameters $\theta$ to the ones capturing the true density, yielding an information-theoretic perspective on OoD detection and revealing a close relationship to information-theoretic active learning (MacKay, 1992). While this connection is intuitively sensible, we now further describe and justify it.

In the paradigm of active learning, the goal is to iteratively select inputs $\mathbf{x}^*$ which improve our estimate of the model parameters $\theta$ as rapidly as possible, in order to obtain a decent estimate of $\theta$ using as little data as possible, which is critical in scenarios where obtaining training data is expensive (e.g. in domains where humans or costly simulations have to be queried to obtain data, which includes many medical or scientific applications). The main idea of information-theoretic active learning is to maintain a posterior distribution $p(\theta|\mathcal{D})$ over the model parameters $\theta$ given the training data $\mathcal{D}$ observed thus far, and to then select the new input $\mathbf{x}^*$ based on its informativeness about the distribution $p(\theta|\mathcal{D})$, which is measured by the change in distribution between the current posterior $p(\theta|\mathcal{D})$ and the updated posterior $p(\theta|\mathcal{D}^*)$ with $\mathcal{D}^* = \mathcal{D} \cup \{\mathbf{x}^*\}$. This change in the posterior distribution can, for example, be quantified by the cross-entropy or KL divergence between $p(\theta|\mathcal{D})$ and $p(\theta|\mathcal{D}^*)$ (MacKay, 1992), or by the decrease in entropy between $p(\theta|\mathcal{D})$ and $p(\theta|\mathcal{D}^*)$ (Houlsby et al., 2011).

Intriguingly, while the problems of active learning and out-of-distribution detection have clearly distinct goals, they are fundamentally related in that they both critically rely on a reliable way to quantify how different an input $\mathbf{x}^*$ is from the training data $\mathcal{D}$ (or, put differently, how novel or informative $\mathbf{x}^*$ is). In active learning, we aim to identify the input $\mathbf{x}^*$ that is maximally different (or novel / informative) in order to best improve our estimate of the model parameters by adding $\mathbf{x}^*$ to the training dataset $\mathcal{D}$. In out-of-distribution detection, we aim to classify a given input $\mathbf{x}^*$ as either in-distribution or OoD based on how different (or novel / informative) it is. This naturally suggests taking methods for quantifying the novelty / informativeness of an input that were developed for one problem and applying them to the other. However, most measures used in active learning are designed for continuous representations of the distributions $p(\theta \mid \mathcal{D})$ and $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$, and are not directly applicable in our setting where $p(\theta \mid \mathcal{D})$ and $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$ are represented by a discrete set of samples $\Theta = \{\theta_m\}_{m=1}^M$.

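That being said, $D_\Theta[\mathbf{x}^*]$ can indeed be viewed as quantifying the change in distribution between the sample-based representations of $p(\theta \mid \mathcal{D})$ and $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$ induced by $\mathbf{x}^*$ (and thus the informativeness of $\mathbf{x}^*$), revealing a link to information-theoretic active learning. In particular, Martino et al. (2017) show that $D_\Theta[\mathbf{x}^*]$ (which corresponds to the effective sample size, as described in Section 3.3) is closely related to the Euclidean distance between the vector of importance weights $\mathbf{w} = [w_{\theta_1}, \ldots, w_{\theta_M}]^\top$ and the vector $\bar{\mathbf{w}} = [1/M, \ldots, 1/M]^\top$ of probabilities defining the discrete uniform probability mass function, i.e.,

$$
\lVert \mathbf{w} - \bar{\mathbf{w}} \rVert_2
= \sqrt{\sum_{\theta \in \Theta} \left( w_\theta - \frac{1}{M} \right)^2}
= \sqrt{\sum_{\theta \in \Theta} w_\theta^2 - \frac{1}{M}}
= \sqrt{\frac{1}{D_\Theta[\mathbf{x}^*]} - \frac{1}{M}}, \tag{8}
$$

where the second equality uses $\sum_{\theta \in \Theta} w_\theta = 1$, such that maximizing the score $D_\Theta[\mathbf{x}^*]$ is equivalent to minimizing the Euclidean distance $\lVert \mathbf{w} - \bar{\mathbf{w}} \rVert_2$. Now, since

$$
p(\theta \mid \mathcal{D}, \mathbf{x}^*)
= \frac{p(\mathbf{x}^* \mid \theta)\, p(\theta \mid \mathcal{D})}{p(\mathbf{x}^* \mid \mathcal{D})}
\approx \frac{p(\mathbf{x}^* \mid \theta)\, p(\theta \mid \mathcal{D})}{\frac{1}{M} \sum_{\theta' \in \Theta} p(\mathbf{x}^* \mid \theta')}
= M\, w_\theta\, p(\theta \mid \mathcal{D}), \tag{9}
$$

we observe that for a given model $\theta \in \Theta$, the posterior $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$ is equal to $p(\theta \mid \mathcal{D})$ if and only if $w_\theta = 1/M$, such that $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$ is equal to $p(\theta \mid \mathcal{D})$ for all models $\theta \in \Theta$ if and only if the weight vector $\mathbf{w}$ is equal to the vector $\bar{\mathbf{w}}$ defining the discrete uniform probability mass function (pmf), in which case their Euclidean distance is minimized at $\lVert \mathbf{w} - \bar{\mathbf{w}} \rVert_2 = 0$. As a result, the new posterior is identical to the previous posterior over the models (i.e., the change in the posterior is minimized) if and only if the score is maximized to be $D_\Theta[\mathbf{x}^*] = M$. Conversely, the Euclidean distance is maximized at $\lVert \mathbf{w} - \bar{\mathbf{w}} \rVert_2 = \sqrt{1 - 1/M}$ if and only if the weight vector is $\mathbf{w} = [0, \ldots, 0, 1, 0, \ldots, 0]^\top$, in which case the new posterior is $p(\theta \mid \mathcal{D}, \mathbf{x}^*) = 0$ for all models for which $w_\theta = 0$, and $p(\theta \mid \mathcal{D}, \mathbf{x}^*) = M\, p(\theta \mid \mathcal{D})$ for the single model for which $w_\theta = 1$. Thus, the change between the new and previous posterior over the models is maximized if and only if the score is minimized to be $D_\Theta[\mathbf{x}^*] = 1$.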
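The relationship between the score and the Euclidean distance to the uniform weight vector, together with the two extreme cases discussed above, can be checked numerically. The following is a minimal NumPy sketch rather than the authors' implementation. It assumes that the weights are the self-normalized likelihoods of the input under each posterior sample and that the score is the effective sample size $1 / \sum_\theta w_\theta^2$; the function names and log-likelihood values are hypothetical.

```python
import numpy as np

def importance_weights(log_liks):
    """Self-normalize log-likelihoods log p(x* | theta_m) into weights summing to 1.
    Assumption: this is how the weights w_theta are defined."""
    log_liks = np.asarray(log_liks, dtype=float)
    w = np.exp(log_liks - log_liks.max())  # subtract the max for numerical stability
    return w / w.sum()

def disagreement_score(w):
    """Effective sample size 1 / sum(w^2); lies between 1 and M."""
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)

def distance_to_uniform(w):
    """Euclidean distance between w and the uniform pmf [1/M, ..., 1/M]."""
    w = np.asarray(w, dtype=float)
    return np.linalg.norm(w - 1.0 / w.size)

M = 5
rng = np.random.default_rng(0)

# Hypothetical log-likelihoods of one input under M posterior samples.
w_agree = importance_weights(np.full(M, -100.0))                 # all models agree
w_disagree = importance_weights([0.0, -1e3, -1e3, -1e3, -1e3])   # one model dominates
w_mixed = importance_weights(rng.normal(size=M))                 # something in between

for name, w in [("agree", w_agree), ("mixed", w_mixed), ("disagree", w_disagree)]:
    score = disagreement_score(w)
    # Right-hand side of the distance identity: sqrt(1 / score - 1 / M).
    rhs = np.sqrt(max(1.0 / score - 1.0 / M, 0.0))
    print(f"{name:9s} score={score:.3f} distance={distance_to_uniform(w):.3f} rhs={rhs:.3f}")
```

In this sketch, identical likelihoods give uniform weights, so the score reaches its maximum $M$ and the distance is zero; a single dominating model gives a one-hot weight vector, so the score drops to 1 and the distance reaches $\sqrt{1 - 1/M}$.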

Finally, we observe that the notion of change in distribution for sample-based representations of posteriors captured by the Euclidean distance described above is closely related to the notion of change in distribution for continuous posterior representations. To see this, consider the Kullback-Leibler (KL) divergence, which is an information-theoretic measure for the discrepancy between distributions commonly used in information-theoretic active learning, defined as

$$
\mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right]
= \int p(\theta \mid \mathcal{D}) \log \frac{p(\theta \mid \mathcal{D})}{p(\theta \mid \mathcal{D}, \mathbf{x}^*)} \, d\theta. \tag{10}
$$

We now show that maximizing our proposed OoD detection score $D_\Theta[\mathbf{x}^*]$ is equivalent to minimizing the KL divergence between the previous posterior $p(\theta \mid \mathcal{D})$ and the new posterior $p(\theta \mid \mathcal{D}, \mathbf{x}^*)$, and vice versa, which is formalized in Proposition 1 below. This provides further evidence for the close connection between our proposed OoD detection approach and information-theoretic principles, and suggests that information-theoretic measures such as the KL divergence can also be used for OoD detection, yielding the paradigm of information-theoretic out-of-distribution detection.

###### Proposition 1.

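Assume that the weights $w_\theta$ have some minimal, arbitrarily small, positive value $\epsilon > 0$, i.e., $w_\theta \geq \epsilon$ for all $\theta \in \Theta$. Also, assume that the KL divergence $\mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right]$ is approximated based on a set of samples $\Theta \sim p(\theta \mid \mathcal{D})$. Then, an input $\mathbf{x}^*$ is a maximizer of $D_\Theta[\mathbf{x}^*]$ if and only if it is a minimizer of $\mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right]$. Furthermore, an input $\mathbf{x}^*$ is a minimizer of $D_\Theta[\mathbf{x}^*]$ if and only if it is a maximizer of $\mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right]$. Formally,

$$
\arg\max_{\mathbf{x}^*} D_\Theta[\mathbf{x}^*] = \arg\min_{\mathbf{x}^*} \mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right], \tag{11}
$$

$$
\arg\min_{\mathbf{x}^*} D_\Theta[\mathbf{x}^*] = \arg\max_{\mathbf{x}^*} \mathrm{KL}\left[ p(\theta \mid \mathcal{D}) \,\|\, p(\theta \mid \mathcal{D}, \mathbf{x}^*) \right]. \tag{12}
$$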
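As a numerical sanity check of the proposition (not a substitute for the proof), the sketch below compares the score with a sample-based KL estimate over many weight vectors. It rests on two assumptions that should be read as such: the score is the effective sample size $1 / \sum_\theta w_\theta^2$, and the new posterior reweights each of the $M$ posterior samples by $M w_\theta$, so that the sample-based KL estimate becomes $-\frac{1}{M} \sum_\theta \log(M w_\theta)$. All function names and the values of `M` and `eps` are hypothetical.

```python
import numpy as np

def disagreement_score(w):
    """Effective sample size 1 / sum(w^2) (assumed form of the score)."""
    w = np.asarray(w, dtype=float)
    return 1.0 / np.sum(w ** 2)

def sample_kl(w):
    """Sample-based KL between previous and new posterior, assuming the new
    posterior equals M * w_theta times the previous one at each sample."""
    w = np.asarray(w, dtype=float)
    return -np.mean(np.log(w.size * w))

M, eps = 10, 1e-6  # eps plays the role of the minimal weight in the proposition

# Uniform weights: the posterior is unchanged by the new input.
uniform = np.full(M, 1.0 / M)

# Most peaked weights allowed by the floor: one model takes all remaining mass.
peaked = np.full(M, eps)
peaked[0] = 1.0 - (M - 1) * eps

# Random weight vectors that respect the floor eps and still sum to one.
rng = np.random.default_rng(0)
random_ws = eps + (1.0 - M * eps) * rng.dirichlet(np.ones(M), size=1000)

scores = np.array([disagreement_score(w) for w in random_ws])
kls = np.array([sample_kl(w) for w in random_ws])

print("uniform: score", disagreement_score(uniform), "KL", sample_kl(uniform))
print("peaked:  score", disagreement_score(peaked), "KL", sample_kl(peaked))
print("random:  score range", scores.min(), scores.max(), "KL range", kls.min(), kls.max())
```

Under these assumptions, the uniform weight vector attains both the largest score and the smallest KL estimate, while the most peaked admissible weight vector attains both the smallest score and the largest KL estimate. The floor `eps` keeps the logarithm finite, which is the role the minimal weight plays in the proposition.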

###### Proof.

Reformulating the KL divergence in Eq. 10 and approximating it via our set of posterior samples, we obtain