Multiscale Score Matching for Out-of-Distribution Detection
Abstract
We present a new methodology for detecting out-of-distribution (OOD) images by utilizing norms of the score estimates at multiple noise scales. A score is defined to be the gradient of the log density with respect to the input data. Our methodology is completely unsupervised and follows a straightforward training scheme. First, we train a deep network to estimate scores for $L$ levels of noise. Once trained, we calculate the noisy score estimates for $N$ in-distribution samples and take the L2-norms across the input dimensions (resulting in an $N \times L$ matrix). Then we train an auxiliary model (such as a Gaussian Mixture Model) to learn the in-distribution spatial regions in this $L$-dimensional space. This auxiliary model can now be used to identify points that reside outside the learned space. Despite its simplicity, our experiments show that this methodology significantly outperforms the state-of-the-art in detecting out-of-distribution images. For example, our method can effectively separate CIFAR-10 (inlier) and SVHN (OOD) images, a setting which has previously been shown to be difficult for deep likelihood models.
1 Introduction
Modern neural networks do not tend to generalize well to out-of-distribution samples. This phenomenon has been observed in both classifier networks (Hendrycks and Gimpel (2019); Nguyen et al. (2015); Szegedy et al. (2013)) and deep likelihood models (Nalisnick et al. (2018); Hendrycks et al. (2018); Ren et al. (2019)). This certainly has implications for AI safety (Amodei et al. (2016)), as models need to be aware of uncertainty when presented with unseen examples. Moreover, an out-of-distribution detector can be applied as an anomaly detector. Ultimately, our research is motivated by the need for a sensitive outlier detector that can be used in a medical setting. Particularly, we want to identify atypical morphometry in early brain development. This requires a method that is generalizable to highly variable, high resolution, unlabeled real-world data while being sensitive enough to detect an unspecified, heterogeneous set of atypicalities. To that end, we propose multiscale score matching to effectively detect out-of-distribution samples.
Hyvärinen (2005) introduced score matching as a method to learn the parameters of a non-normalized probability density model, where a score is defined as the gradient of the log density with respect to the data. Conceptually, a score is a vector field that points in the direction where the log density grows the most. The authors mention the possibility of matching scores via a non-parametric model but circumvent this by using gradients of the score estimate itself. However, Vincent (2011) later showed that the objective function of a denoising autoencoder (DAE) is equivalent to matching the score of a non-parametric Parzen density estimator of the data. Thus, DAEs provide a methodology for learning score estimates via the objective:
$$\frac{1}{2}\,\mathbb{E}_{q_\sigma(\tilde{x}\mid x)\,p_{\mathrm{data}}(x)}\left[\left\|s_\theta(\tilde{x}) - \nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x)\right\|_2^2\right] \quad (1)$$
Here $s_\theta$ is the score network being trained to estimate the true score $\nabla_x \log p_{\mathrm{data}}(x)$, and $\tilde{x} \sim q_\sigma(\tilde{x}\mid x)$ is the noise-perturbed input. It should be noted that the score of the estimator only matches the true score when the noise perturbation is minimal, i.e. $\sigma \approx 0$. Recently, Song and Ermon (2019) employed multiple noise levels to develop a deep generative model based on score matching, called Noise Conditioned Score Network (NCSN). Let $\{\sigma_i\}_{i=1}^{L}$ be a positive geometric sequence that satisfies $\frac{\sigma_1}{\sigma_2} = \cdots = \frac{\sigma_{L-1}}{\sigma_L} > 1$. NCSN is a conditional network, $s_\theta(x, \sigma)$, trained to jointly estimate scores for various levels of noise such that $s_\theta(x, \sigma) \approx \nabla_x \log q_\sigma(x)$ for all $\sigma \in \{\sigma_i\}_{i=1}^{L}$. In practice, the network is explicitly provided a one-hot vector denoting the noise level used to perturb the data. The network is then trained via a denoising score matching loss. They choose their noise distribution to be $q_\sigma(\tilde{x}\mid x) = \mathcal{N}(\tilde{x}\mid x, \sigma^2 I)$; therefore $\nabla_{\tilde{x}}\log q_\sigma(\tilde{x}\mid x) = -\frac{\tilde{x}-x}{\sigma^2}$. Thus the objective function is:
$$\ell(\theta; \sigma) = \frac{1}{2}\,\mathbb{E}_{p_{\mathrm{data}}(x)}\,\mathbb{E}_{\tilde{x}\sim\mathcal{N}(x,\,\sigma^2 I)}\left[\left\|s_\theta(\tilde{x}, \sigma) + \frac{\tilde{x}-x}{\sigma^2}\right\|_2^2\right], \qquad \mathcal{L}\left(\theta; \{\sigma_i\}_{i=1}^{L}\right) = \frac{1}{L}\sum_{i=1}^{L}\lambda(\sigma_i)\,\ell(\theta; \sigma_i) \quad (2)$$
Song and Ermon (2019) set $\lambda(\sigma) = \sigma^2$ after empirically observing that $\|\sigma\, s_\theta(x, \sigma)\|_2 \propto 1$. We similarly scaled our score norms by $\sigma$ for all our experiments. Our work directly utilizes the training objective proposed by Song and Ermon (2019), i.e. we use an NCSN as our score estimator. However, we use the score outputs for out-of-distribution (OOD) detection rather than for generative modeling. We demonstrate how the space of multiscale score estimates can separate in-distribution samples from outliers, outperforming state-of-the-art methods. We also apply our method on real-world medical imaging data of brain MRI scans.
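The per-noise-level term of Equation 2 can be sketched as a Monte-Carlo estimate. The snippet below is a minimal numpy illustration, not the paper's implementation: the data and score functions are analytic stand-ins chosen so that the known optimal score (for $\mathcal{N}(0,1)$ data, the perturbed marginal is $\mathcal{N}(0, 1+\sigma^2)$) can be compared against a trivial one.

```python
import numpy as np

def dsm_loss(score_fn, x, sigma, rng):
    """Monte-Carlo estimate of the per-noise-level denoising score matching
    term of Equation 2, with the lambda(sigma) = sigma^2 weighting."""
    x_tilde = x + sigma * rng.standard_normal(x.shape)
    # Score of the Gaussian perturbation kernel: -(x_tilde - x) / sigma^2
    target = -(x_tilde - x) / sigma ** 2
    return 0.5 * sigma ** 2 * np.mean((score_fn(x_tilde, sigma) - target) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal(200_000)  # toy 1-D data: x ~ N(0, 1)
sigma = 0.5

# For N(0, 1) data the perturbed marginal is N(0, 1 + sigma^2), so its true
# score is -x_tilde / (1 + sigma^2); it should incur a lower loss than a
# trivial zero score.
loss_true = dsm_loss(lambda z, s: -z / (1 + s ** 2), x, sigma, rng)
loss_zero = dsm_loss(lambda z, s: np.zeros_like(z), x, sigma, rng)
```

In an NCSN, `score_fn` would be the conditional network $s_\theta(\tilde{x}, \sigma)$ and the loss would be averaged over all $L$ noise levels.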
2 Multiscale Score Analysis
Consider taking the L2-norm of the score function: $\|s(x)\|_2 = \|\nabla_x \log p(x)\|_2 = \frac{\|\nabla_x p(x)\|_2}{p(x)}$.
Since the data density term appears in the denominator, a high likelihood will correspond to a low norm. Since out-of-distribution samples should have a low likelihood with respect to the in-distribution log density, we can expect them to have high norms. However, if these outlier points reside in "flat" regions with very small gradients (e.g. in a small local mode), then their score norms can be low despite the points belonging to a low density region. This is our first indicator that a true score norm may not be sufficient for detecting outliers. We empirically validate our intuition by considering score estimates for a relatively simple toy dataset: FashionMNIST. Following the denoising score matching objective (Equation 2), we can obtain multiple estimates of the true score by using different noise distributions $q_{\sigma_i}(\tilde{x}\mid x)$. Like Song and Ermon (2019), we choose the noise distributions to be zero-centered Gaussians scaled according to $\sigma_i$. Recall that the scores for samples perturbed by the lowest noise level should be closest to the true score. Our analyses show that this alone is inadequate at separating inliers from OOD samples.
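The "flat region" failure mode can be reproduced with a tiny closed-form example. The following is an illustrative sketch (the two-component mixture and its parameters are our own toy choices, not from the paper): a point sitting at a small local mode has a near-zero score norm even though its density is far below the inlier mode's.

```python
import numpy as np

def npdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

# Toy 1-D density: a dominant inlier mode at 0 plus a tiny local mode at 6
weights, means, stds = [0.99, 0.01], [0.0, 6.0], [1.0, 0.2]

def density(x):
    return sum(w * npdf(x, m, s) for w, m, s in zip(weights, means, stds))

def score(x):
    # s(x) = d/dx log p(x) = p'(x) / p(x); p'(x) is closed-form for a GMM
    dp = sum(w * npdf(x, m, s) * (-(x - m) / s ** 2)
             for w, m, s in zip(weights, means, stds))
    return dp / density(x)
```

Here `density(6.0)` is far below `density(0.0)`, yet `abs(score(6.0))` is essentially zero because $x = 6$ sits at a local mode, whereas a between-modes point such as $x = 3$ carries a much larger norm.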
We trained a score network on FashionMNIST and used it to estimate scores of the FashionMNIST (in-distribution), MNIST (OOD) and CIFAR-10 (OOD) test sets. Figure 1(a) shows the distribution of the score norms corresponding to the lowest noise level used. Note that CIFAR-10 samples are appropriately given high score norms by the model. However, the model is unable to distinguish FashionMNIST from MNIST, giving MNIST roughly the same scores as in-distribution samples. Though far from ideal, this result is still a considerable improvement on existing likelihood methods, which have been shown to assign higher likelihoods to OOD samples (Nalisnick et al. (2018)). Our next line of inquiry was to utilize multiple noise levels. That is, instead of simply considering the lowest-noise score norm, we analyze the $L$-dimensional space of norms computed at all noise levels $\{\sigma_i\}_{i=1}^{L}$. Our observations showed that datasets did tend to be separable in this $L$-dimensional space of score norms. Figure 1(b) visualizes the UMAP embeddings of scores calculated via a network trained to estimate $L = 10$ scales of $\sigma$s, with the lowest $\sigma$ being the same as the one in Figure 1(a).
2.1 Scales and Neighborhoods
To our knowledge, multiscale score analysis has not been explored in the context of OOD detection. In this section, we present an analysis to give an intuition for why multiple scales can be beneficial. Consider the toy distribution shown in Figure 2. We have three regions of interest: a high density inlier region, a "Low-Density" outlier region, and a "Local-Mode" outlier region containing a small local mode. Recall that adding Gaussian noise to a distribution is equivalent to convolving it with the Gaussian distribution. This not only allows us to visualize perturbations of our toy distribution, but also to analytically compute the score estimates given any $\sigma$. Initially, with no perturbation, both a point in the low density region and one very close to (or at) the local mode will have small gradients. As we perturb the samples we smooth the original density, causing it to widen. The relative change in density at each point is dependent on neighboring modes. A large scale perturbation will proportionally take a larger neighborhood into account at each point of the convolution. Therefore, at a sufficiently large scale, nearby outlier points gain context from in-distribution modes. This results in an increased gradient signal in the direction of inliers.
Figure 3 plots the score norms of samples generated from the original density along with markers indicating our key regions. Note how even a small scale perturbation is enough to bias the density of the Low-Density outliers towards the nearby in-distribution mode. A medium scale Gaussian perturbation is still not wide enough to reach the inlier region from the Local-Mode outlier densities, causing them to simply smooth away into flat nothingness. It is only after we perform a large scale perturbation that the in-distribution mode gets taken into account, resulting in a higher gradient norm. Note that the flat Low-Density outliers will not see the same increase in gradients. This illustrates the notion that no one scale is appropriate to detect all outliers. This analysis allows us to intuit that larger noise levels account for a larger neighborhood context. We surmise that given a sufficiently large scale, we can capture gradient signals from distant outliers. Admittedly, selecting the range of scales according to the dataset is not a trivial problem. In a very recent work, Song and Ermon (2020) outlined some techniques for selecting $\{\sigma_i\}$ for NCSNs from the perspective of generative modeling. Perhaps there is a similar analog for OOD detection. We leave such analysis for future work and use the default range for NCSN in all our experiments.
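The scale effect above can be checked analytically: convolving a Gaussian mixture with $\mathcal{N}(0, \sigma^2)$ simply widens each component's standard deviation to $\sqrt{s^2 + \sigma^2}$, so the perturbed score stays closed-form. This sketch reuses our own toy two-mode density (an assumption for illustration, not the paper's exact Figure 2 distribution):

```python
import numpy as np

# Toy density: dominant inlier mode at 0, small "Local-Mode" outlier at 6
weights, means, stds = [0.99, 0.01], [0.0, 6.0], [1.0, 0.2]

def perturbed_score(x, sigma):
    """Score of the toy density convolved with N(0, sigma^2): each Gaussian
    component widens to std sqrt(s^2 + sigma^2), keeping the score exact."""
    p, dp = 0.0, 0.0
    for w, m, s in zip(weights, means, stds):
        wide = np.sqrt(s ** 2 + sigma ** 2)
        comp = w * np.exp(-0.5 * ((x - m) / wide) ** 2) / (wide * np.sqrt(2 * np.pi))
        p += comp
        dp += comp * (-(x - m) / wide ** 2)
    return dp / p
```

At the Local-Mode outlier $x = 6$, a small perturbation ($\sigma = 0.1$) leaves the score near zero, while a large one ($\sigma = 3$) pulls the in-distribution mode at 0 into the neighborhood and produces a visibly larger gradient norm.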
2.2 Proposed Training Scheme
In this work, we propose the inclusion of all noisy score estimates for the task of separating in- and out-of-distribution points, allowing for a Multiscale Score Matching Analysis (MSMA). Concretely, given $L$ noise levels, we calculate the L2-norm of the per-sample scores for each level, resulting in an $L$-dimensional vector for each input sample. Motivated by our observations, we posit that in-distribution data points occupy distinct and dense regions in this $L$-dimensional score space. The cluster assumption states that decision boundaries should not pass through high density regions, but instead lie in low density regions. This implies that any auxiliary method trained to learn in-distribution regions should be able to identify OOD data points that reside outside the learned space. Thus, we propose a two-step unsupervised training scheme. First, we train an NCSN model to estimate scores for inlier samples, given $L$ levels of noise. Once trained, we calculate all noisy score estimates for the $N$ training samples and take the L2-norms across the input dimensions: $\|s_\theta(x, \sigma_i)\|_2$ for $i = 1, \dots, L$. This results in an $N \times L$ matrix. We now train an auxiliary model (such as a Gaussian Mixture Model) on this matrix to learn the spatial regions of in-distribution samples in the $L$-dimensional space.
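The two-step scheme can be sketched end-to-end. In this minimal illustration the trained NCSN is replaced by an analytic stand-in score function and the data by random vectors (both our own assumptions); only the feature construction and the auxiliary-model fit mirror the scheme above.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def msma_features(score_fn, x, sigmas):
    """Step 1: per-sample L2 norms of the score estimates at each of the L
    noise scales, giving an (N, L) matrix. `score_fn` stands in for a trained
    NCSN; any callable (x, sigma) -> per-sample scores works here."""
    return np.stack(
        [np.linalg.norm(score_fn(x, s).reshape(len(x), -1), axis=1)
         for s in sigmas], axis=1)

rng = np.random.default_rng(0)
x_train = rng.standard_normal((500, 8))           # stand-in inlier data
sigmas = np.geomspace(1.0, 0.01, num=10)          # geometric scales, as in NCSN
toy_score = lambda x, s: -x / (1.0 + s ** 2)      # analytic stand-in for s_theta

feats = msma_features(toy_score, x_train, sigmas)              # N x L (500 x 10)
aux = GaussianMixture(n_components=2, random_state=0).fit(feats)  # Step 2
feats_ood = msma_features(toy_score, x_train + 5.0, sigmas)    # shifted "OOD" data
```

`aux.score_samples` then provides a per-sample in-distribution log-likelihood that can be thresholded; the shifted data lands in a different region of the norm space and receives lower likelihood.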
3 Learning Concentration in the Score Space
We posit that learning the "density" of the inlier data in the $L$-dimensional score (norm) space is sufficient for detecting out-of-distribution samples. The term "density" can be interpreted in a myriad of ways. We primarily focus on models that fall under three related but distinct notions of denseness: spatial clustering, probability density, and nearest (inlier) neighbor graphs. All three allow us to threshold the associated metric to best separate OOD samples.
Spatial clustering is conceptually the closest to our canonical understanding of denseness: points are tightly packed under some metric (usually Euclidean distance). Ideally, OOD data should not occupy the same cluster as the inliers. We train Gaussian Mixture Models (GMMs) to learn clusters in the inlier data. GMMs work under the assumption that the data is composed of $k$ components whose shapes can be described by a (multivariate) Gaussian distribution. Thus, for a given datum, we can calculate the joint probability of it belonging to any of the $k$ components.
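A minimal sketch of GMM-based thresholding, assuming stand-in score-norm features (the Gaussian blobs and the 3-component, 5th-percentile choices are illustrative, not the paper's tuned settings):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
inlier_feats = rng.normal(0.0, 1.0, size=(1000, 4))   # stand-in score-norm features
outlier_feats = rng.normal(5.0, 1.0, size=(200, 4))

gmm = GaussianMixture(n_components=3, random_state=0).fit(inlier_feats)
# Threshold at the 5th percentile of inlier log-likelihood (roughly 95% TPR)
threshold = np.percentile(gmm.score_samples(inlier_feats), 5)
is_ood = gmm.score_samples(outlier_feats) < threshold
```

Any point whose joint log-likelihood under the fitted components falls below the inlier-derived threshold is flagged as OOD.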
Probability density estimation techniques aim to learn the underlying probability density function which describes the population. Normalizing flows are a family of flexible methods that can learn tractable density functions (Papamakarios et al. (2019)). They transform complex distributions into a simpler one (such as a Gaussian) through a series of invertible, differentiable mappings. The simpler base distribution can then be used to infer the density of a given sample. We use Masked Autoregressive Flows, introduced by Papamakarios et al. (2017), which allow us to use neural networks as the transformation functions. Once trained, we can use the likelihood of the inliers to determine a threshold beyond which samples will be considered outliers.
Finally, we consider building k-nearest neighbor (kNN) graphs to allow for yet another thresholding metric. Conceptually, the idea is to sort all samples according to the distance to their k-closest (inlier) neighbor. Presumably, samples from the same distribution as the inliers will have very short distances to training data points. Despite its simplicity, this method works surprisingly well. Practically, kNN distances can be computed quickly by using efficient data structures (such as KD Trees).
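The kNN-distance statistic can be sketched with scipy's KD Tree, again on stand-in features (the blob data and the k = 5 choice mirror the appendix settings but the data itself is illustrative):

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(2)
inlier_feats = rng.normal(0.0, 1.0, size=(2000, 4))   # stand-in inlier features
test_in = rng.normal(0.0, 1.0, size=(200, 4))
test_out = rng.normal(6.0, 1.0, size=(200, 4))

tree = cKDTree(inlier_feats)            # built once over the inlier features
# Distance to the k-th (k=5) nearest inlier as the thresholding statistic
d_in = tree.query(test_in, k=5)[0][:, -1]
d_out = tree.query(test_out, k=5)[0][:, -1]
```

Samples far from every inlier receive large 5th-neighbor distances, so a single distance threshold separates them.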
4 Related Work
Hendrycks and Gimpel (2019) should be commended for creating an OOD baseline and establishing an experimental testbed which has served as a template for all OOD work since. Their proposed method thresholds the softmax probabilities of a well-trained classifier. Their results have since been beaten by more recent work. Liang et al. (2017) propose ODIN as a post-hoc method that utilizes a pre-trained network to reliably separate OOD samples from the inlier distribution. They achieve this via i) perturbing the input image in the gradient direction of the highest (inlier) softmax probability and ii) scaling the temperature of the softmax outputs of the network for the best OOD separation. They follow the setting from Hendrycks and Gimpel (2019) and showed very promising results for the time. However, ODIN heavily depends on careful tuning of its hyperparameters.
DeVries and Taylor (2018) train their networks to predict confidence estimates in addition to softmax probabilities, which can then be used to threshold outliers. They show significant improvements over Hendrycks and Gimpel (2019) and some improvements over ODIN. Another concurrent work by Lee et al. (2018) jointly trained a GAN alongside the classifier network to generate "realistic" OOD examples, requiring an additional OOD set during training time. The final trained network is also unable to generalize to other unseen datasets. It is important to note that our method is trained completely unsupervised while the baselines are not, potentially giving them additional information about the idiosyncrasies of the inlier distribution.
Ren et al. (2019) proposed to jointly train deep likelihood models alongside a "background" likelihood model that learns the population-level background statistics, taking the ratio of the two resulting likelihoods to produce a "contrastive score". They saw very good results for grayscale images (FashionMNIST vs MNIST) and a considerable improvement in separating CIFAR-10 and SVHN compared to Nalisnick et al. (2018). Some prior work has indeed used gradients of the log likelihoods for OOD detection, but it does not frame them in the context of score matching. Grathwohl et al. (2020) posit that a discriminative model can be reinterpreted as a joint energy (negative log-likelihood) based model (JEM). One of their evaluation experiments used the energy norms (which they dub "Approximate Mass JEM") for OOD detection. Even though they saw improvements over only using log-likelihoods, their reported AUCs did not beat ODIN or other competitors. Peculiarly, they also observed that for tractable likelihood models, scores were anti-correlated with the model's likelihood and that neither were reliable for OOD detection. Zhai et al. (2016) also used energy (negative log probability) gradient norms, but their experiments were limited to intra-dataset anomalies. To our knowledge, no prior work has explicitly used score matching for OOD detection.
5 Experiments
In this section we demonstrate that our methodology, Multiscale Score Matching Analysis (MSMA), can provide a very effective OOD detector. We first train an NCSN model as our score estimator, and then an auxiliary model on the score estimates of the training set. Following Liang et al. (2017) and DeVries and Taylor (2018), we use CIFAR-10 and SVHN as our "inlier" datasets alongside a collection of natural images as "outlier" datasets. We retrieve the natural image datasets from ODIN's publicly available GitHub repository.
5.1 Datasets
We consider CIFAR-10 (Krizhevsky and Hinton (2009)) and SVHN (Netzer et al. (2011)) as our inlier datasets. For out-of-distribution datasets, we choose the same as Liang et al. (2017): TinyImageNet, LSUN, and iSUN, along with synthetic Uniform and Gaussian noise images. All datasets are detailed in A.1.
5.2 Evaluation Metrics
To measure thresholding performance we use the metrics established by previous baselines (Hendrycks and Gimpel (2019); Liang et al. (2017)). FPR at 95% TPR is the False Positive Rate (FPR) when the True Positive Rate (TPR) is 95%. Detection Error is the minimum possible misclassification probability over all thresholds. AUROC is the Area Under the ROC Curve. AUPR is the Area Under the Precision-Recall Curve. More details are given in A.3.
5.3 Comparison Against Previous OOD Methods
We compare our work against Confidence Thresholding (DeVries and Taylor (2018)) and ODIN (Liang et al. (2017)). Since these methods were trained with a number of different architectures, we report the ones that performed best for each respective method. Specifically, we use the results of VGG13 for Confidence Thresholding and DenseNet-BC for ODIN. For all experiments we report the results for the in-distribution test set vs the out-of-distribution datasets. Additionally, we note that All Images* is a version of All Images where both ODIN and Confidence Thresholding perform input preprocessing. Particularly, they perturb the samples in the direction of the softmax gradient of the classifier: $\tilde{x} = x - \varepsilon\,\mathrm{sign}(-\nabla_x \log S_{\hat{y}}(x; T))$. They then perform a grid search over $\varepsilon$ ranges, selecting the value that achieves the best separation on 1,000 samples randomly held out from each out-of-distribution set. ODIN performs an additional search over temperature $T$ ranges, while Confidence Thresholding uses a fixed default value. We do not perform any such input modification. Note that ODIN uses input preprocessing for individual OOD datasets as well, while Confidence Thresholding does not. Finally, for the sake of brevity we only report FPR (95% TPR) and AUROC. All other metric comparisons are available in the appendix (A.4).
FPR (95% TPR) / AUROC, in %. Dashes denote results not reported for that setting.
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  0.0 / 100.0  0.0 / 100.0  0.0 / 100.0  -  1.8 / 99.6
SVHN  LSUN  0.0 / 100.0  0.0 / 100.0  0.0 / 100.0  -  0.8 / 99.8
SVHN  iSUN  0.0 / 100.0  0.0 / 100.0  0.0 / 100.0  -  1.0 / 99.8
SVHN  All Images  0.0 / 100.0  0.0 / 100.0  0.0 / 100.0  -  4.3 / 99.2
SVHN  All Images*  -  -  -  8.6 / 97.2  4.1 / 99.2
CIFAR-10  TinyImageNet  0.0 / 100.0  0.0 / 100.0  0.3 / 99.9  7.5 / 98.5  18.4 / 97.0
CIFAR-10  LSUN  0.0 / 100.0  0.0 / 100.0  0.6 / 99.9  3.8 / 99.2  16.4 / 97.5
CIFAR-10  iSUN  0.0 / 100.0  0.0 / 100.0  0.4 / 99.9  6.3 / 98.8  16.3 / 97.5
CIFAR-10  All Images  0.0 / 100.0  0.0 / 100.0  0.4 / 99.9  -  19.2 / 97.1
CIFAR-10  All Images*  -  -  -  7.8 / 98.4  11.2 / 98.0
5.4 Separating CIFAR-10 from SVHN
Since this setting (CIFAR-10 as in-distribution and SVHN as out-of-distribution) is not tackled by classifier-based OOD detectors, we consider these results separately and evaluate them in the context of likelihood methods. This experiment has recently gained attention following Nalisnick et al. (2018), who showed how deep generative models are particularly inept at separating complex, high-dimensional datasets such as these two. We describe our results for each auxiliary model in Table 2. Here we note that all three methods definitively outperform the previous state of the art (see Table 3), with KD Trees performing the best. Likelihood Ratios (Ren et al. (2019)) and JEM (Grathwohl et al. (2020)) are two methods that have tackled this problem and reported the current state-of-the-art results. Table 3 summarizes the results reported by these papers. Both report AUROCs, with Ren et al. (2019) additionally reporting AUPR (In) and FPR at 80% TPR. Since each method proposes a different detection function, we also provide them for reference.
Method  FPR (95% TPR)  Detection Error  AUROC  AUPR (In)  AUPR (Out)
GMM  11.4  8.1  95.5  91.9  96.9
Flow  8.6  6.8  96.7  93.4  97.7
KD Tree  4.1  4.5  99.1  99.1  99.2
Method  FPR (80% TPR)  AUROC  AUPR (In)  AUPR (Out)
KD Tree MSMA  0.7  99.1  99.1  99.2
Likelihood Ratios  6.6  93.0  88.1  -
JEM  -  67.0  -  -
Approx. Mass JEM  -  83.0  -  -
5.5 Age-based OOD from Brain MRI Scans
Age (years)  FPR (95% TPR)  Detection Error  AUROC  AUPR (In)  AUPR (Out)
1  0.2  0.4  99.9  99.9  99.9
2  0.6  1.0  99.7  99.5  99.9
4  23.7  9.2  96.1  93.8  97.9
6  30.5  9.7  95.0  92.2  96.8
In this section we report our method's performance on a real-world dataset. Here the task is to detect brain Magnetic Resonance Images (MRIs) from pediatric subjects at an age (1-6 years) that is younger than the inlier data (9-11 years of age). We expect visible differences in image contrast and local brain morphometry between the brains of a toddler and an adolescent. As a child grows, their brain matures and the corresponding scans appear more like the prototypical adult brain. This provides an interesting gradation of samples being considered out-of-distribution with respect to age. We employ 3,500 high resolution T1-weighted MR images obtained through the NIH large-scale ABCD study (Casey et al. (2018)), which represent data from the general adolescent population (9-11 years of age). This implies that our in-distribution dataset will have high variation. After standard preprocessing, we extracted three mid-axial slices from each scan and resized them to 90x110 pixels, resulting in roughly 11k axial images (10k training, 1k testing). For our outlier data, we employ MRI datasets of children aged 1, 2, 4 and 6 years (500 each) from the UNC EBDS database (Stephens et al. (2020); Gilmore et al. (2020)).
Our methodology was effectively able to identify younger age groups as out-of-distribution. Table 4 reports the results for GMMs trained on this task. As expected, the separation performance decreases as age increases. Note that we kept the same hyperparameters for our auxiliary methods as in the previous experiments, despite this being a higher resolution scenario. We also note that our Flow model and KD Tree perform comparably well and refer the reader to A.5.
6 Discussion and Conclusion
We introduced a methodology based on multiscale score matching and showed that it outperforms state-of-the-art methods with minimal hyperparameter tuning. Our methodology is easy to implement, completely unsupervised, and generalizable to many OOD tasks. Even though we only reported two metrics in the main comparison, we emphasize that we outperform the previous state-of-the-art in every metric for all benchmark experiments. Next, it is noteworthy that in our real-world experiment the brain MR images are unlabeled. This would have required us to create a contrived classification task in order to train classifiers for both ODIN and Confidence Thresholding. Since our model is trained completely unsupervised, we make very few inductive assumptions about the data. Furthermore, due to the curse of dimensionality, deep likelihood models are notoriously difficult to train in such high resolution regimes (Papamakarios (2019)), especially given low sample sizes. Our model's objective function is based on denoising (autoencoding), a task which better suits deep convolutional neural networks.
Our excellent results highlight the possibility of using our methodology as a fast, general purpose anomaly detector which could be used for tasks ranging from detection of medical pathologies to data cleansing and fault detection. From an application perspective, we plan to apply this methodology to the task of detecting images of atypically maturing children from a database of typical inliers. Lastly, our observations have uncovered a peculiar phenomenon exhibited by multiscale score estimates, warranting a closer look to understand the theoretical underpinnings of the relationship between low density points and their gradient estimates.
Appendix A Appendix
a.1 Dataset Details
All the datasets considered are described below.
CIFAR-10: The CIFAR-10 dataset (Krizhevsky and Hinton (2009)) consists of 60,000 32x32 colour images in 10 classes, such as horse, automobile, cat etc. There are 50,000 training images and 10,000 test images.
SVHN: The Street View House Numbers (SVHN) dataset (Netzer et al. (2011)) consists of 32x32 images depicting house numbers ranging from 0 through 9. We use the official splits: 73,257 digits for training, 26,032 digits for testing.
TinyImageNet: This dataset consists of 10,000 test images drawn from a 200-class subset of the ImageNet dataset. Liang et al. (2017) created two 32x32 pixel versions: a randomly cropped TinyImageNet (crop) and a downsampled TinyImageNet (resize).
LSUN: The Large-scale Scene UNderstanding (LSUN) dataset produced by Yu et al. (2015) consists of 10,000 test images belonging to one of 10 different scene classes, such as bedroom, kitchen etc. Liang et al. (2017) created two 32x32 pixel versions of this dataset as well: a randomly cropped LSUN (crop) and a downsampled LSUN (resize).
iSUN: This dataset was procured by (Xu et al. (2015)) and is a subsample of the SUN image database. We use 32x32 pixel downscaled versions of the original 8,925 test images.
Uniform: This dataset consists of 10,000 synthetically generated 32x32 RGB images produced by sampling each pixel from an i.i.d. uniform distribution in the range [0, 1].
Gaussian: These are 10,000 synthetic 32x32 RGB images where each pixel is sampled from an i.i.d. Gaussian distribution centered at 0.5 with a standard deviation of 1. The pixel values are clipped to be within [0, 1] to keep them within the expected range of (normalized) images.
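The two synthetic sets above can be generated in a few lines. This sketch uses a smaller sample count than the paper's 10,000 purely to keep the example light:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # the paper uses 10,000 images per synthetic set

# Uniform: every pixel drawn i.i.d. from U[0, 1]
uniform_imgs = rng.uniform(0.0, 1.0, size=(n, 32, 32, 3)).astype(np.float32)

# Gaussian: every pixel i.i.d. N(0.5, 1), clipped to the valid [0, 1] range
gaussian_imgs = np.clip(
    rng.normal(0.5, 1.0, size=(n, 32, 32, 3)), 0.0, 1.0).astype(np.float32)
```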
All Images: Following DeVries and Taylor (2018), this dataset is a combination of all non-synthetic OOD datasets outlined above: TinyImageNet (crop), TinyImageNet (resize), LSUN (crop), LSUN (resize) and iSUN. It therefore contains 48,925 images from a variety of data distributions. Note that this collection effectively requires a single threshold for all datasets, thus arguably reflecting a real-world out-of-distribution setting.
a.2 Architecture Details & Setup
We use the NCSN model provided by Song and Ermon (2019). In particular, we use the Tensorflow implementation provided through a NeurIPS reproducibility challenge submission, Matosevic et al. (2019). The model architecture used is a RefineNet with 128 filters. The batch size is also fixed to 128. We train for 200k iterations using the Adam optimizer. Following Song and Ermon (2019), we use $L = 10$ standard deviations for our Gaussian noise perturbations such that $\{\sigma_i\}_{i=1}^{10}$ is a geometric sequence with $\sigma_1 = 1$ and $\sigma_{10} = 0.01$. We use the same hyperparameters for training on both CIFAR-10 and SVHN. For our experiment on brain MRI images (Section 5.5), we trained our model with 64 filters and a batch size of 32 due to memory constraints caused by the higher resolution images.
We train our auxiliary models on the same training set that was used to train the NCSN model, thereby circumventing the need for a separate held-out tuning set. For our Gaussian Mixture Models, we mean-normalize the data and perform a grid search over the number of components (ranging from 2 to 20), using 10-fold cross-validation. Our normalizing flow model is constructed as a MAF using two hidden layers with 128 units each, and a standard Normal as the base distribution. It is trained for 1000 epochs with a batch size of 128. Finally, for our nearest neighbor model, we train a KD Tree to store (k=5)-nearest neighbor distances of the in-distribution training set. We keep the same hyperparameter settings for all experiments.
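The GMM component search can be sketched with scikit-learn's cross-validated grid search, exploiting the fact that `GaussianMixture.score` returns the mean held-out log-likelihood. The feature matrix here is random stand-in data, and the grid is truncated to 2-5 components (the paper searches 2-20) to keep the sketch fast:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(3)
feats = rng.standard_normal((300, 10))   # stand-in N x L score-norm matrix
feats = feats - feats.mean(axis=0)       # mean normalization

# GridSearchCV maximizes GaussianMixture.score (mean log-likelihood) across
# the 10 held-out folds; the paper searches n_components in 2..20
search = GridSearchCV(GaussianMixture(random_state=0),
                      {"n_components": range(2, 6)}, cv=10)
search.fit(feats)
best_k = search.best_params_["n_components"]
```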
a.3 Evaluation Metric Details
To measure thresholding performance we use the metrics established by previous baselines (Hendrycks and Gimpel (2019), Liang et al. (2017)). These include:
FPR at 95% TPR: This is the False Positive Rate (FPR) when the True Positive Rate (TPR) is 95%. This metric can be interpreted as the probability of misclassifying an outlier sample as in-distribution when the TPR is as high as 95%. Let TP, FP, TN, and FN represent true positives, false positives, true negatives and false negatives respectively. FPR = FP/(FP+TN), TPR = TP/(FN+TP).
Detection Error: This measures the minimum possible misclassification probability over all thresholds. Practically this can be calculated as $\min_{\tau}\left\{0.5\left(1 - \mathrm{TPR}(\tau)\right) + 0.5\,\mathrm{FPR}(\tau)\right\}$, where it is assumed that we have an equal probability of seeing both positive and negative examples in the test set.
AUROC: This measures the area under (AU) the Receiver Operating Characteristic (ROC) curve, which plots the relationship between FPR and TPR. It is commonly interpreted as the probability of a positive sample (in-distribution) having a higher score than a negative sample (out-of-distribution). It is a threshold-independent, summary metric.
AUPR: Area Under the Precision-Recall Curve (AUPR) is another threshold-independent metric that considers the PR curve, which plots Precision (= TP/(TP+FP)) versus Recall (= TP/(TP+FN)). AUPR-In and AUPR-Out consider the in-distribution samples and out-of-distribution samples as the positive class, respectively. This helps take mismatch in sample sizes into account.
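All four metrics can be computed from a single score vector with scikit-learn. The detection scores below are synthetic stand-ins (two shifted Gaussians), used only to exercise the metric definitions above:

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

rng = np.random.default_rng(4)
# Detection scores: higher = more in-distribution. Labels: 1 = inlier, 0 = OOD.
y = np.concatenate([np.ones(500), np.zeros(500)])
s = np.concatenate([rng.normal(2.0, 1.0, 500), rng.normal(0.0, 1.0, 500)])

fpr, tpr, _ = roc_curve(y, s)
fpr_at_95_tpr = fpr[np.searchsorted(tpr, 0.95)]       # FPR at 95% TPR
detection_error = np.min(0.5 * (1 - tpr) + 0.5 * fpr)  # min over all thresholds
auroc = roc_auc_score(y, s)
aupr_in = average_precision_score(y, s)                # inliers as positives
aupr_out = average_precision_score(1 - y, -s)          # OOD as positives
```

Flipping both the labels and the sign of the scores yields the AUPR-Out variant without recomputing anything else.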
a.4 Complete Results for Experiments in Section 5.3
FPR at 95% TPR (%)
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  0.0  0.0  0.0  -  1.8
SVHN  LSUN  0.0  0.0  0.0  -  0.8
SVHN  iSUN  0.0  0.0  0.0  -  1.0
SVHN  All Images  0.0  0.0  0.0  -  4.3
SVHN  All Images*  -  -  -  8.6  4.1
CIFAR-10  TinyImageNet  0.0  0.0  0.3  7.5  18.4
CIFAR-10  LSUN  0.0  0.0  0.6  3.8  16.4
CIFAR-10  iSUN  0.0  0.0  0.4  6.3  16.3
CIFAR-10  All Images  0.0  0.0  0.4  -  19.2
CIFAR-10  All Images*  -  -  -  7.8  11.2
AUROC (%)
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  100.0  100.0  100.0  -  99.6
SVHN  LSUN  100.0  100.0  100.0  -  99.8
SVHN  iSUN  100.0  100.0  100.0  -  99.8
SVHN  All Images  100.0  100.0  100.0  -  99.2
SVHN  All Images*  -  -  -  97.2  99.2
CIFAR-10  TinyImageNet  100.0  100.0  99.9  98.5  97.0
CIFAR-10  LSUN  100.0  100.0  99.9  99.2  97.5
CIFAR-10  iSUN  100.0  100.0  99.9  98.8  97.5
CIFAR-10  All Images  100.0  100.0  99.9  -  97.1
CIFAR-10  All Images*  -  -  -  98.4  98.0
Detection Error (%)
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  0.0  0.0  0.1  -  3.1
SVHN  LSUN  0.0  0.0  0.1  -  2.0
SVHN  iSUN  0.0  0.0  0.1  -  2.2
SVHN  All Images  0.0  0.0  0.1  -  4.6
SVHN  All Images*  -  -  -  6.8  4.5
CIFAR-10  TinyImageNet  0.0  0.0  1.0  6.3  9.4
CIFAR-10  LSUN  0.0  0.1  1.5  4.4  8.3
CIFAR-10  iSUN  0.0  0.0  1.2  6.7  8.5
CIFAR-10  All Images  0.0  0.0  1.2  -  9.1
CIFAR-10  All Images*  -  -  -  6.0  6.9
AUPR (In) (%)
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  100.0  100.0  100.0  -  99.8
SVHN  LSUN  100.0  100.0  100.0  -  99.9
SVHN  iSUN  100.0  100.0  100.0  -  99.9
SVHN  All Images  100.0  100.0  100.0  -  98.5
SVHN  All Images*  -  -  -  92.5  98.6
CIFAR-10  TinyImageNet  100.0  100.0  99.9  98.6  97.3
CIFAR-10  LSUN  100.0  100.0  99.8  99.3  97.8
CIFAR-10  iSUN  100.0  100.0  99.9  98.9  98.0
CIFAR-10  All Images  100.0  100.0  99.9  -  92.0
CIFAR-10  All Images*  -  -  -  95.3  94.5
AUPR (Out) (%)
In-dist  OOD dataset  GMM  Flow  KD Tree  ODIN  Confidence
SVHN  TinyImageNet  100.0  100.0  99.9  -  99.1
SVHN  LSUN  100.0  100.0  99.9  -  99.6
SVHN  iSUN  100.0  100.0  99.9  -  99.5
SVHN  All Images  100.0  100.0  99.9  -  99.6
SVHN  All Images*  -  -  -  98.6  99.5
CIFAR-10  TinyImageNet  100.0  100.0  99.9  98.5  96.9
CIFAR-10  LSUN  100.0  100.0  99.9  99.2  97.2
CIFAR-10  iSUN  100.0  100.0  99.9  98.8  96.9
CIFAR-10  All Images  100.0  100.0  99.9  -  99.3
CIFAR-10  All Images*  -  -  -  99.6  99.5
MSMA with GMM (all metrics in %)
In-dist  OOD dataset  FPR (95% TPR)  Detection Error  AUROC  AUPR (In)  AUPR (Out)
SVHN  CIFAR-10  8.6  6.9  97.6  92.3  99.2
SVHN  TinyImageNet (c)  0.0  0.0  100.0  100.0  100.0
SVHN  TinyImageNet (r)  0.0  0.0  100.0  100.0  100.0
SVHN  LSUN (c)  0.0  0.0  100.0  100.0  100.0
SVHN  LSUN (r)  0.0  0.0  100.0  100.0  100.0
SVHN  iSUN  0.0  0.0  100.0  100.0  100.0
SVHN  Uniform  0.0  0.0  100.0  100.0  100.0
SVHN  Gaussian  0.0  0.0  100.0  100.0  100.0
SVHN  All Images  0.0  0.0  100.0  100.0  100.0
CIFAR-10  SVHN  11.4  8.1  95.5  91.9  96.9
CIFAR-10  TinyImageNet (c)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  TinyImageNet (r)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  LSUN (c)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  LSUN (r)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  iSUN  0.0  0.0  100.0  100.0  100.0
CIFAR-10  Uniform  0.0  0.0  100.0  100.0  100.0
CIFAR-10  Gaussian  0.0  0.0  100.0  100.0  100.0
CIFAR-10  All Images  0.0  0.0  100.0  100.0  100.0
MSMA with Flow (all metrics in %)
In-dist  OOD dataset  FPR (95% TPR)  Detection Error  AUROC  AUPR (In)  AUPR (Out)
SVHN  CIFAR-10  10.4  6.9  97.0  88.3  99.0
SVHN  TinyImageNet (c)  0.0  0.1  100.0  100.0  100.0
SVHN  TinyImageNet (r)  0.0  0.1  100.0  100.0  100.0
SVHN  LSUN (c)  0.0  0.1  100.0  100.0  100.0
SVHN  LSUN (r)  0.0  0.1  100.0  100.0  100.0
SVHN  iSUN  0.0  0.1  100.0  100.0  100.0
SVHN  Uniform  0.0  0.0  100.0  100.0  100.0
SVHN  Gaussian  0.0  0.0  100.0  100.0  100.0
SVHN  All Images  0.0  0.1  100.0  100.0  100.0
CIFAR-10  SVHN  8.6  6.8  96.7  93.4  97.7
CIFAR-10  TinyImageNet (c)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  TinyImageNet (r)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  LSUN (c)  0.0  0.0  100.0  100.0  100.0
CIFAR-10  LSUN (r)  0.0  0.1  100.0  100.0  100.0
CIFAR-10  iSUN  0.0  0.0  100.0  100.0  100.0
CIFAR-10  Uniform  0.0  0.0  100.0  100.0  100.0
CIFAR-10  Gaussian  0.0  0.0  100.0  100.0  100.0
CIFAR-10  All Images  0.0  0.0  100.0  100.0  100.0









| Inlier | OOD dataset | FPR (95% TPR) | Detection Error | AUROC | AUPR (In) | AUPR (Out) |
|---|---|---|---|---|---|---|
| SVHN | CIFAR-10 | 8.5 | 6.6 | 97.6 | 92.5 | 99.1 |
| SVHN | TinyImageNet (c) | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| SVHN | TinyImageNet (r) | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| SVHN | LSUN (c) | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| SVHN | LSUN (r) | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| SVHN | iSUN | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| SVHN | Uniform | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
| SVHN | Gaussian | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
| SVHN | All Images | 0.0 | 0.1 | 100.0 | 100.0 | 100.0 |
| CIFAR-10 | SVHN | 4.1 | 4.5 | 99.1 | 99.0 | 99.2 |
| CIFAR-10 | TinyImageNet (c) | 0.5 | 1.4 | 99.9 | 99.8 | 99.9 |
| CIFAR-10 | TinyImageNet (r) | 0.3 | 1.0 | 99.9 | 99.9 | 99.9 |
| CIFAR-10 | LSUN (c) | 0.2 | 0.9 | 99.9 | 99.9 | 100.0 |
| CIFAR-10 | LSUN (r) | 0.6 | 1.5 | 99.9 | 99.8 | 99.9 |
| CIFAR-10 | iSUN | 0.4 | 1.2 | 99.9 | 99.9 | 99.9 |
| CIFAR-10 | Uniform | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
| CIFAR-10 | Gaussian | 0.0 | 0.0 | 100.0 | 100.0 | 100.0 |
| CIFAR-10 | All Images | 0.4 | 1.2 | 99.9 | 100.0 | 99.7 |

A.5 Performance on Brain MRI
| Model | Age | FPR (95% TPR) | Detection Error | AUROC | AUPR (In) | AUPR (Out) |
|---|---|---|---|---|---|---|
| GMM | 1 | 0.2 | 0.4 | 99.9 | 99.9 | 99.9 |
| GMM | 2 | 0.6 | 1.0 | 99.7 | 99.5 | 99.9 |
| GMM | 4 | 23.7 | 9.2 | 96.1 | 93.8 | 97.9 |
| GMM | 6 | 30.5 | 9.7 | 95.0 | 92.2 | 96.8 |
| Flow | 1 | 0.2 | 0.3 | 99.9 | 99.9 | 99.9 |
| Flow | 2 | 0.6 | 1.3 | 99.7 | 99.4 | 99.9 |
| Flow | 4 | 12.2 | 8.4 | 97.3 | 94.6 | 98.8 |
| Flow | 6 | 28.9 | 12.5 | 94.3 | 88.7 | 97.5 |
| KD Tree | 1 | 2.5 | 2.6 | 99.3 | 98.2 | 99.7 |
| KD Tree | 2 | 3.6 | 3.1 | 98.9 | 96.2 | 99.6 |
| KD Tree | 4 | 18.6 | 10.7 | 95.7 | 91.0 | 98.0 |
| KD Tree | 6 | 39.2 | 14.9 | 91.6 | 84.2 | 95.8 |
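All three auxiliary models in the table above follow the recipe stated in the abstract: reduce each image to an L-dimensional vector of L2 norms of its noisy score estimates (one entry per noise scale), fit a density model on in-distribution vectors, and flag low-likelihood points as OOD. The sketch below illustrates that second stage with synthetic score-norm features and a single Gaussian standing in for the paper's GMM; the feature values, dimensionality, and the 95th-percentile threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for per-scale score-norm features: each image becomes an
# L-dimensional vector of L2 norms of its score estimates. Here L = 10
# and the features are synthetic, purely for illustration.
L = 10
train_norms = rng.normal(loc=5.0, scale=1.0, size=(2000, L))   # inliers
test_inlier = rng.normal(loc=5.0, scale=1.0, size=(200, L))
test_ood = rng.normal(loc=9.0, scale=1.0, size=(200, L))       # shifted

# Fit a single Gaussian (a one-component stand-in for the GMM).
mu = train_norms.mean(axis=0)
cov = np.cov(train_norms, rowvar=False) + 1e-6 * np.eye(L)     # regularized
cov_inv = np.linalg.inv(cov)

def neg_log_likelihood(x):
    # Half the squared Mahalanobis distance; constants dropped, so
    # higher values mean "further from the in-distribution region".
    d = x - mu
    return 0.5 * np.einsum('ij,jk,ik->i', d, cov_inv, d)

# Flag points whose score exceeds the 95th percentile of training scores.
thresh = np.percentile(neg_log_likelihood(train_norms), 95)
flag_in = neg_log_likelihood(test_inlier) > thresh
flag_ood = neg_log_likelihood(test_ood) > thresh
print(f"inliers flagged: {flag_in.mean():.2f}, OOD flagged: {flag_ood.mean():.2f}")
```

Swapping the single Gaussian for a mixture model, a normalizing flow, or a KD-tree distance yields the three auxiliary-model variants compared in the tables; the thresholding step is unchanged.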
References
 Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.
 The adolescent brain cognitive development (ABCD) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience 32, pp. 43–54.
 ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255.
 Learning confidence for out-of-distribution detection in neural networks. arXiv preprint arXiv:1802.04865.
 Individual variation of human cortical structure is established in the first year of life. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging.
 Your classifier is secretly an energy based model and you should treat it like one. arXiv preprint arXiv:1912.03263.
 A baseline for detecting misclassified and out-of-distribution examples in neural networks. In 5th International Conference on Learning Representations, ICLR 2017. arXiv:1610.02136.
 Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606.
 Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709.
 Learning multiple layers of features from tiny images.
 Training confidence-calibrated classifiers for detecting out-of-distribution samples. In 6th International Conference on Learning Representations, ICLR 2018. arXiv:1711.09325.
 Enhancing the reliability of out-of-distribution image detection in neural networks. In 6th International Conference on Learning Representations, ICLR 2018. arXiv:1706.02690.
 Reproducibility challenge: generative modeling by estimating gradients of the data distribution.
 Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136.
 Reading digits in natural images with unsupervised feature learning.
 Deep neural networks are easily fooled: high confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 427–436.
 Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762.
 Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347.
 Neural density estimation and likelihood-free inference. arXiv preprint arXiv:1910.13233.
 Likelihood ratios for out-of-distribution detection. In NeurIPS.
 Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11918–11930.
 Improved techniques for training score-based generative models. arXiv preprint arXiv:2006.09011.
 White matter development from birth to 6 years of age: a longitudinal study. Cerebral Cortex (New York, N.Y.: 1991) 7, pp. 7456.
 Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199.
 A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.
 TurkerGaze: crowdsourcing saliency with webcam based eye tracking. arXiv preprint arXiv:1504.06755.
 LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
 Deep structured energy based models for anomaly detection. In ICML.