Probabilistic Modeling of Deep Features for Out-of-Distribution and Adversarial Detection
We present a principled approach for detecting out-of-distribution (OOD) and adversarial samples in deep neural networks. Our approach consists of modeling the outputs of the various layers (deep features) with parametric probability distributions once training is completed. At inference, the likelihoods of the deep features w.r.t. the previously learned distributions are calculated and used to derive uncertainty estimates that can discriminate in-distribution samples from OOD samples. We explore the use of two classes of multivariate distributions for modeling the deep features, Gaussian and Gaussian mixture, and study the trade-off between accuracy and computational complexity. We demonstrate the benefits of our approach on image features by detecting OOD images and adversarially-generated images, using popular DNN architectures on the MNIST and CIFAR10 datasets. We show that more precise modeling of the feature distributions results in significantly improved detection of OOD and adversarial samples, by up to 12 percentage points in the AUPR and AUROC metrics. We further show that our approach remains extremely effective when applied to video data and associated spatio-temporal features, by detecting adversarial samples on activity classification tasks using the UCF101 dataset and the C3D network. To our knowledge, our methodology is the first one reported for reliably detecting white-box adversarial framing, a state-of-the-art adversarial attack for video classifiers.
Deep neural networks (DNN) have gained widespread popularity in the last decade, starting with AlexNet winning the ILSVRC-2012 challenge (Krizhevsky et al., 2012). Since then, research in this area has led to a proliferation of novelties in methods and architectures that have resulted in dramatic improvements in accuracy (Simonyan and Zisserman, 2015; He et al., 2016) and scalability (Guo et al., 2016). An important area of active research is the ability of DNNs to estimate predictive uncertainty measures, which quantify how much trust should be put in DNN results. This is a critical requirement for perceptual sub-systems based on deep learning, if we are to build safe and transparent systems that do not adversely impact humans (e.g. in fields such as autonomous driving, robotics, or healthcare). Additional imperatives for estimating predictive uncertainty measures relate to results interpretability, dataset bias, AI safety, and active learning.
Typically, deep networks do not provide reliable confidence scores for their outputs. The softmax score is the most widely used. Interpreted as a probability, it is a posterior probability: it provides a relative ranking of each output with respect to all other outputs, rather than an absolute confidence score. By relying solely on softmax scores as a confidence measure, deep neural networks tend to make overconfident predictions. This is especially true when the input does not resemble the training data (out-of-distribution), or has been crafted to attack and “fool” the network (adversarial examples).
In this paper, we consider the problem of detecting out-of-distribution (OOD) samples and adversarial samples in DNNs. Recently, there has been substantial work on this topic from researchers in the Bayesian deep learning community (Gal and Ghahramani, 2016; Kendall and Gal, 2017). In this class of methods, the network’s parameters are represented by probability distributions rather than single point values. Such parameters are learned using variational training. At inference, multiple stochastic forward passes are required to generate a distribution over the outputs, instead of the typical single forward pass needed in a traditional (non-Bayesian) DNN. This significantly increases the complexity and requirements in terms of model representation, computational cost and memory.
Another class of methods attempts to solve this problem by estimating uncertainty directly from a trained (non-Bayesian) DNN. Hendrycks and Gimpel (2017) proposed using probabilities from the softmax distributions to detect misclassified or OOD samples. Liang et al. (2018) showed that by introducing a temperature scaling parameter to the softmax function, the OOD detection performance could be greatly enhanced relative to (Hendrycks and Gimpel, 2017). Both of these methods use the posterior softmax distribution to perform OOD detection. By contrast, Lee et al. (2018) adopted a generative approach and proposed fitting class-conditional multivariate Gaussian distributions to the pre-trained features of a DNN. The confidence score was defined as the Mahalanobis distance with respect to the closest class-conditional distribution. Using this confidence score, they obtained impressive results, outperforming both previous methods at detecting OOD and adversarial samples. In contrast to Bayesian deep learning approaches, this class of methods can be applied to existing pre-trained networks, does not require weights to be represented by distributions, and does not entail the computational overhead of multiple forward passes during inference.
We present an approach for detecting OOD and adversarial samples in deep neural networks based on probabilistic modeling of the deep features within a DNN. Conceptually, our method is most similar to the generative approach in (Lee et al., 2018) in that we fit class-conditional distributions to the outputs of the various layers (deep features), once training is completed. In (Lee et al., 2018), however, it is hypothesized that the class-conditional distributions can be modeled as multivariate Gaussians with shared covariance across all classes. We show that such an assumption is not valid in general; instead, we adopt a more principled approach in choosing the type of density function used for modeling. To this end, we explore the use of two additional types of distributions to model the deep features: Gaussian (with separate covariances for each class) and Gaussian mixture models (GMM). We demonstrate that more precise modeling of the distributions of features results in significantly improved detection of OOD and adversarial samples; in particular, we see an improvement of up to 12 percentage points in the AUPR and AUROC metrics in these tasks.
We also investigate the numerical and computational aspects of this approach. In particular, fitting distributions to very high-dimensional features can result in severe ill-conditioning during estimation of the densities parameters. We demonstrate empirically that such issues can be resolved with the application of dimensionality reduction techniques such as PCA and average pooling.
Finally, in addition to demonstrating the effectiveness of the approach on standard image datasets, we further show that our approach remains extremely effective when applied to video data and associated spatio-temporal features by detecting adversarial samples generated by both white-box and black-box attacks on activity classification tasks on the UCF101 dataset (Soomro et al., 2012) using the C3D network (Hara et al., 2018). To our knowledge, our methodology is the first one reported for reliably detecting white-box adversarial framing (Zając et al., 2019), a state-of-the-art adversarial attack for video classifiers.
Suppose we have a deep network trained to recognize samples from $C$ classes. Let $f^{(\ell)}(x)$ denote the output at the $\ell$-th layer of the network for an input $x$, and $d_\ell$ its dimension. As described earlier, our approach consists of fitting class-conditional probability distributions to the features of a DNN, once training is completed. By fitting distributions to the deep features induced by training samples, we are effectively defining a generative model over the deep feature space.
At test time, the log-likelihood scores of the features of a test sample are calculated with respect to these distributions and used to derive uncertainty estimates that can discriminate in-distribution samples (which should have high likelihood) from OOD or adversarial samples (which should have low likelihood). These per-layer likelihoods can also be used for classification in lieu of the softmax output of the network and give classification accuracy as good as the softmax classifier.
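This fit-then-score procedure can be sketched in a few lines. The snippet below uses synthetic 2-D features as a stand-in for deep features; the class means, covariances, and test points are illustrative assumptions, not values from our experiments:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

# Synthetic "deep features" for two in-distribution classes with
# deliberately different covariances.
feats_c0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 0.2]], size=500)
feats_c1 = rng.multivariate_normal([6, 6], [[0.3, 0.1], [0.1, 1.5]], size=500)

# Fit one Gaussian per class (separate covariances) by maximum likelihood.
dists = [
    multivariate_normal(mean=f.mean(axis=0),
                        cov=np.cov(f, rowvar=False, bias=True))
    for f in (feats_c0, feats_c1)
]

def loglik_scores(x):
    """Per-class log-likelihoods of a single feature vector x."""
    return np.array([d.logpdf(x) for d in dists])

x_in = np.array([0.1, -0.1])     # near the class-0 cluster
x_ood = np.array([30.0, -30.0])  # far from both clusters

# The max log-likelihood separates in-distribution from OOD samples,
# and its argmax doubles as a classifier for in-distribution inputs.
print(loglik_scores(x_in).max(), loglik_scores(x_in).argmax())
print(loglik_scores(x_ood).max())
```

The same pattern applies per layer; in practice the feature vectors come from forward passes through the trained network rather than a random generator.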
Choice of density function: Lee et al. (2018) assumed that the class-conditional densities are multivariate Gaussian with shared covariance across all classes. The justification was based on the following connection between LDA (linear discriminant analysis) and the softmax classifier: in a generative classifier in which the underlying class-conditional distributions are Gaussians with tied covariance, the posterior distribution is equivalent to the softmax function with linear separation boundaries (Bishop, 2006). As we demonstrate empirically via a simple example, the use of a softmax classifier in a DNN does not automatically imply that the underlying distributions will be well represented by a Gaussian with tied covariance. In this example, we constructed and trained a CNN architecture which we call MNET (shown in Figure 3) to classify only two digits (‘0’ and ‘1’) from the MNIST dataset. Since only two classes are considered, the final FC-10 layer is replaced by FC-2. A 2D density histogram of the features from the FC-2 layer is shown in Figure 1. It is obvious, even without performing any goodness-of-fit tests, that if a 2D Gaussian were fitted to each cluster, the covariance of one would be significantly different from that of the other; forcing these to be the same would result in a poorer fit. Further, even if the assumption of tied covariance were valid, it would apply only to the features of the final layer of the network, on which softmax is performed. It would not be applicable to the inner layers.
In this work, therefore, we relax the assumption of tied covariance, and instead employ more general parametric probability distributions. The first type is a separate multivariate Gaussian distribution for each class without the assumption of a tied covariance matrix. Note that this corresponds to the more general QDA (quadratic discriminant analysis) classifier, which is hence capable of representing a larger class of distributions. The second type is a Gaussian mixture model (GMM). The choice of GMM is also motivated by the fact that high-dimensional naturally occurring data may not necessarily occupy all dimensions of the Euclidean space, but may in fact reside on or close to an underlying sub-manifold (Tenenbaum et al., 2000). Owing to its more general nature, the GMM is a better choice for modeling such a distribution. In the toy example shown in Figure 2, the data are distributed along the boundary of an ellipse. It is clear that the GMM is able to model such a distribution well, while a multivariate Gaussian is a very poor modeling choice for it. It would be interesting to apply more sophisticated manifold learning techniques, but these are significantly more complex to implement in practice; their use will be explored in future work.
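The ellipse-boundary toy example can be reproduced in a few lines; the component count, noise level, and ellipse axes below are illustrative choices:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Data distributed along the boundary of an ellipse: a curve
# (a sub-manifold) embedded in 2-D, plus a little observation noise.
theta = rng.uniform(0.0, 2.0 * np.pi, size=2000)
pts = np.column_stack([3.0 * np.cos(theta), np.sin(theta)])
pts += rng.normal(scale=0.05, size=pts.shape)

# A single Gaussian vs. a mixture: the mixture places its components
# along the ring, while one Gaussian must smear its mass over the interior.
gauss = GaussianMixture(n_components=1, random_state=0).fit(pts)
gmm = GaussianMixture(n_components=8, random_state=0).fit(pts)

# Average per-sample log-likelihood; higher indicates a better fit.
print(gauss.score(pts), gmm.score(pts))
```

The GMM achieves a substantially higher average log-likelihood on this data, in line with Figure 2.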
Estimating parameters: The parameters of the class-conditional densities are estimated from the training-set samples by maximum likelihood. If the chosen density is a multivariate Gaussian (separate covariance), the maximum-likelihood estimates of the mean and covariance for class $c$ are given by the sample mean and sample covariance:

$$\hat{\mu}_c = \frac{1}{N_c} \sum_{i:\, y_i = c} f(x_i), \qquad \hat{\Sigma}_c = \frac{1}{N_c} \sum_{i:\, y_i = c} \left( f(x_i) - \hat{\mu}_c \right) \left( f(x_i) - \hat{\mu}_c \right)^{\top} \quad (1)$$

where $f(x_i)$ are the feature values from the network, $N_c$ is the number of training samples of class $c$, and the layer subscript has been dropped for notational convenience. If the covariance is assumed to be tied across all classes, then all samples in the training set are used to estimate the covariance, rather than only those of class $c$. The estimation of the mean remains unchanged. If the chosen density is a GMM, its parameters are estimated using an expectation-maximization (EM) procedure. To choose the number of components in the GMM (i.e. model selection), we adopt the Bayesian Information Criterion (BIC) to penalize complex models; details on EM and BIC can be found in (Bishop, 2006).
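BIC-based model selection can be sketched with scikit-learn's EM implementation; the synthetic three-cluster data and the candidate component range below are illustrative:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)

# Features drawn from a true 3-component mixture (well separated clusters).
data = np.concatenate([
    rng.normal(loc=-5.0, scale=0.5, size=(300, 2)),
    rng.normal(loc=0.0, scale=0.5, size=(300, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(300, 2)),
])

# Fit GMMs with increasing component counts via EM; BIC trades
# goodness-of-fit against the number of parameters (lower is better).
bic = {
    k: GaussianMixture(n_components=k, random_state=0).fit(data).bic(data)
    for k in range(1, 7)
}
best_k = min(bic, key=bic.get)
print(best_k)
```

On this well-separated data, BIC recovers the true component count, penalizing the over-parameterized fits.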
Figure 3: The MNET architecture: input (28×28) → conv3-64 → maxpool → conv3-32 → maxpool → FC-128 → FC-10.
Scoring samples: As described earlier, the log-likelihood values are used to measure the closeness of a sample w.r.t. a probability distribution. For a $d$-dimensional multivariate Gaussian $\mathcal{N}(\mu, \Sigma)$, the log-likelihood of a feature $f$ is given by (Bishop, 2006):

$$\log \mathcal{N}(f \mid \mu, \Sigma) = -\frac{1}{2} (f - \mu)^{\top} \Sigma^{-1} (f - \mu) - \frac{1}{2} \log |\Sigma| - \frac{d}{2} \log (2\pi).$$

Under the assumption of tied covariance, the term $\log |\Sigma|$ is the same for all the class-conditional distributions and can then be ignored. The remaining term $(f - \mu)^{\top} \Sigma^{-1} (f - \mu)$, which is the squared Mahalanobis distance, is then adequate to measure the closeness of a test sample to the modeled distribution. If the covariances are not assumed to be tied, we cannot use the Mahalanobis distance alone, and should instead use the full log-likelihood (ignoring the additive and multiplicative constants). For a GMM, the log-likelihood is the logarithm of a weighted sum of exponential terms.
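A small numeric check of why the Mahalanobis distance alone no longer suffices once covariances are untied; the two covariance matrices below are contrived for illustration:

```python
import numpy as np

def gauss_loglik(x, mu, cov):
    """Full multivariate-Gaussian log-likelihood, keeping the log|cov| term."""
    d = len(mu)
    diff = x - mu
    maha = diff @ np.linalg.inv(cov) @ diff  # squared Mahalanobis distance
    return -0.5 * (maha + np.log(np.linalg.det(cov)) + d * np.log(2.0 * np.pi))

mu = np.zeros(2)
cov_tight = 0.01 * np.eye(2)   # sharply peaked class-conditional density
cov_wide = 100.0 * np.eye(2)   # diffuse class-conditional density

x = np.zeros(2)  # a point sitting exactly at both class means

# The Mahalanobis distance to each class is identical (zero), so it cannot
# rank the two classes; the full log-likelihoods differ because the
# log|cov| term is no longer shared across classes.
print(gauss_loglik(x, mu, cov_tight))  # higher
print(gauss_loglik(x, mu, cov_wide))   # lower
```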
3 Computational aspects
The use of more general distributions such as the multivariate Gaussian (without tied covariance) or GMMs to model the class-conditional densities brings its own set of challenges. The obvious one is increased computational complexity, both during modeling and during inference, especially when using GMMs. The other challenge is the lack of sufficient training data from which to estimate the parameters of the modeled distributions. For $d$-dimensional features, if the number of training samples available for a class, $N_c$, is less than $d$, then the maximum-likelihood estimate of the covariance, as given in Eq. (1), will have rank at most $N_c$ and be singular. The problem is exacerbated if GMMs are used, since a covariance matrix for each component of each class needs to be estimated, dramatically increasing the number of parameters to be estimated. In such situations, the assumption of tied covariance can prove helpful, since the number of samples used to estimate the covariance matrix increases by roughly a factor of $C$ (the number of classes), and the covariance matrix can hence be estimated without risk of rank deficiency so long as the total number of training samples exceeds $d$. However, as we demonstrate, by applying appropriate dimensionality-reduction techniques, we can not only mitigate these issues, but actually improve the eventual detection and classification scores by enabling the use of more general distributions.
Further, as the dimensionality of the features being modeled increases, numerical challenges arise which result in highly ill-conditioned covariance matrices. For this reason too, application of some form of dimensionality reduction is recommended. Here, we follow a two-fold approach: average pooling of very high-dimensional layers, followed by PCA to project onto a lower-dimensional subspace. In our experiments, we average pool by a factor of 4. This number was chosen empirically, primarily to enhance computational efficiency. When applying PCA, one can specify the fraction of the variance of the original data that should be retained in the lower-dimensional subspace. We choose a high value of 0.995, i.e. we retain 99.5% of the original variance. This resulted in a dramatic reduction in the feature dimensions, at times up to 90%, indicating that 99.5% of the information in the features is actually contained in a much lower-dimensional subspace.
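The pool-then-PCA reduction can be sketched as follows; the feature shapes and the low-rank synthetic data are illustrative stand-ins for real convolutional-layer outputs:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)

# Stand-in for high-dimensional conv features (N, C, H, W) that actually
# live near a low-dimensional subspace, plus a little noise.
latent = rng.normal(size=(200, 10))
feats = (latent @ rng.normal(size=(10, 16 * 32 * 32))).reshape(200, 16, 32, 32)
feats += rng.normal(scale=0.01, size=feats.shape)

# Step 1: spatial average pooling by a factor of 4 (32x32 -> 8x8).
pooled = feats.reshape(200, 16, 8, 4, 8, 4).mean(axis=(3, 5))
flat = pooled.reshape(200, -1)  # 16 * 8 * 8 = 1024 dimensions

# Step 2: PCA keeping 99.5% of the variance.
pca = PCA(n_components=0.995).fit(flat)
reduced = pca.transform(flat)
print(flat.shape[1], reduced.shape[1])  # large reduction in dimension
```

When a float is passed as `n_components`, scikit-learn keeps just enough principal components to retain that fraction of the variance, which is exactly the 0.995 criterion described above.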
4 Experiments and Results
4.1 Applications in Image Classification tasks
Experimental setup and evaluation metrics
We use MNIST and CIFAR10 as the in-distribution datasets. For MNIST, we use FashionMNIST and EMNIST Letters (Cohen et al., 2017) as the OOD datasets. For CIFAR10, we use the SVHN dataset (Netzer et al., 2011) and a resized version of the LSUN dataset (Yu et al., 2015) as the OOD datasets. To test against adversarial attacks, we use the FGSM attack introduced by Goodfellow et al. (2015). In all experiments, the parameters of the fitted density function are estimated from the training split of the in-distribution dataset, while the performance metrics (accuracy, AUPR, AUROC) are calculated on the test split.
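For reference, FGSM perturbs the input by the sign of the loss gradient w.r.t. the input. A minimal sketch on a toy logistic model follows; the weights, epsilon, and input are illustrative (our experiments attack the full DNNs):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# A tiny differentiable "classifier": p(y=1 | x) = sigmoid(w.x + b).
w = np.array([2.0, -1.0])
b = 0.1

x = np.array([1.0, -1.0])  # clean input, confidently class 1
y = 1

# FGSM: x_adv = x + eps * sign(d loss / d x).
# For the cross-entropy loss of this model, d(loss)/dx = (p - y) * w.
p_clean = sigmoid(w @ x + b)
grad_x = (p_clean - y) * w
eps = 1.0
x_adv = x + eps * np.sign(grad_x)

p_adv = sigmoid(w @ x_adv + b)
print(p_clean, p_adv)  # confidence in the true class drops
```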
For MNIST, we use the MNET architecture shown in Figure 3. For CIFAR10, we use two publicly available architectures: Resnet50 and Densenet-BC. For reasons of computational efficiency, we perform our experiments on 3 layers of the networks listed above. In MNET, these are the final 3 layers; in Densenet and Resnet, these are the outputs of the final 3 dense or residual blocks. The layers are labelled 0, 1, and 2, with 0 being the outermost layer and 1 and 2 lying inside the network. Layers further inside could easily be included too, but these typically have outputs of very high dimension and require aggressive dimensionality reduction in order to be processed efficiently.
During testing, the log-likelihood scores of the features generated by a test sample are calculated. These can then be used to distinguish between in-distribution and out-of-distribution data, effectively creating a binary classifier. The performance of this classifier can be characterized by standard methods such as the precision-recall (PR) curve or the receiver operating characteristic (ROC) curve. Davis and Goadrich (2006) showed that although the PR and ROC curves are equivalent, maximizing the area under the ROC curve (AUROC) is not equivalent to maximizing the area under the PR curve (AUPR). We therefore report both metrics in our results. To calculate these metrics, we used the scikit-learn library (Pedregosa et al., 2011).
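Computing both metrics from per-sample scores takes only a few lines with scikit-learn; the synthetic score distributions below are illustrative:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(5)

# Log-likelihood scores: in-distribution samples score high, OOD samples low.
scores_in = rng.normal(loc=0.0, scale=1.0, size=1000)
scores_ood = rng.normal(loc=-4.0, scale=1.0, size=1000)

scores = np.concatenate([scores_in, scores_ood])
labels = np.concatenate([np.ones(1000), np.zeros(1000)])  # 1 = in-distribution

auroc = roc_auc_score(labels, scores)
aupr = average_precision_score(labels, scores)
print(auroc, aupr)  # both approach 1 as the score distributions separate
```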
We first demonstrate the effectiveness of our approach by using it to perform classification based solely on the log-likelihood scores w.r.t. the class-conditional distributions of a particular layer. The classification accuracy of this scheme is measured on the test set for each layer individually. The results are summarized in Table 1. The classification accuracy of the proposed method is comparable to, if not slightly better than, the softmax-based accuracy, indicating that our scheme is as good as softmax for classification of in-distribution samples.
Tables 2 and 3: change in AUPR and change in AUROC relative to the baseline distribution.
To see the performance on OOD samples, we calculate the AUPR and AUROC scores as described earlier. In particular, we examine the change in AUPR and AUROC values obtained by using the more general distribution types (outlined in Section 2) relative to those obtained by using the baseline distribution (multivariate Gaussian with tied covariance). The results are presented in Tables 2 and 3. It is seen that the use of the more general distribution types results in improvements, often significant, in the AUPR and AUROC scores over the baseline distribution. On the few instances in which the baseline distribution achieves the best score, it is by a small margin. It is further interesting to examine the improvements in scores over the baseline distribution as a function of the layer being modeled. These results are summarized in Table 4, which shows the average change (across all tested datasets) in the AUPR and AUROC scores per layer. For all layers, switching to a more general distribution produces an improvement in the scores. However, the extent of the improvement increases the further we are from the final output layer. This is consistent with the reasoning described in Section 2 that the assumption of a Gaussian with tied covariance is a valid approximation for the output layer only, and not the inner layers.
Finally, note that the improvements obtained by using a multivariate Gaussian (separate covariance) and a GMM are comparable. This observation is consistent with the reasoning in Section 3 that the larger number of parameters to be estimated for a GMM can cause more ill-conditioning during estimation, especially in the case of high dimension and limited training samples, which might lead to sub-optimal parameter estimates.
4.2 Application to Activity Classification in Videos
While there is extensive work on adversarial attacks against image classifiers, reported cases of attacks on video classifiers remain very limited. Here, we consider one such case on a video-based human activity classifier. The setup is the following: we use the original UCF101 dataset (split into training and testing subsets) to train a 3D deep neural network, C3D ResNet101 (Hara et al., 2018), for human action classification. With our trained model, we obtain accuracy on the test set. We then use a state-of-the-art video attack method known as adversarial framing (Zając et al., 2019) to generate adversarial samples. Adversarial framing is a simple, yet very potent video attack that keeps most of the video frames intact, but adds a 4-pixel-wide framing at the border of each frame. We employ both a white-box attack, where we assume full knowledge of the action recognition classifier (architecture and parameters), and a black-box attack, where no such knowledge is available. These allow us to generate two sets of adversarial samples from the UCF101 dataset that are fed as inputs to the action classifier. The classifier’s recognition accuracy drops to and for black-box and white-box attacks respectively. For the sake of brevity, we show visualizations only for the white-box attack, but results for both types of attacks are fully reported in Table 5.
For this experiment, we fit distributions to the features from the logit layer (the output of the last layer before softmax) and the preceding layer (the feature embeddings), denoted as Layers 0 and 1 respectively. Figure 4 provides a visualization of the feature embeddings via t-SNE (Maaten and Hinton, 2008) for the genuine data and the white-box attacked data, showing the potency of the white-box adversarial framing attack. Subsequently, the adversarial samples are passed through the network and the corresponding uncertainty (log-likelihood) scores at layers 1 and 0 are calculated. Figure 5 shows the density histogram of these scores for the in-distribution data and the white-box attacked data. Note that while the recognition accuracy dramatically plummeted to for the white-box adversarial samples, the softmax scores show even higher confidence: the network outputs are wrong, yet the network is ever more confident. Using our approach, Figure 5 shows that while the network still produces wrong classification results, it is now able to produce reliable uncertainty metrics indicating poor confidence in the generated outputs. Moreover, the discrimination between in-distribution and OOD samples (here, adversarial) is clearly improved with the more general choices of distributions (GMM and Gaussian with separate covariances). These improvements are captured quantitatively with the AUPR and AUROC metrics in Table 5 for both white-box and black-box attacks.
Table 5: detection performance (AUPR, AUROC) under the white-box and black-box attacks.
5 Conclusions and Future Work
This paper presented a method for modeling the outputs of the various DNN layers (deep features) with parametric probability distributions, with applications to adversarial and out-of-distribution sample detection. We showed that accurate modeling of the class-conditional distributions enables the derivation of reliable uncertainty scores. The methodology was theoretically motivated and experimentally validated, showing improved out-of-distribution detection and adversarial sample detection on both image and video data. In particular, we reported adversarial sample detection against a state-of-the-art video classifier attack.
While this work performed feature modeling on a trained model, future work will analyze the evolution of the feature distributions during training. Given the complexities arising from parameter estimation in high-dimensional spaces, we will also consider fitting distributions to features induced by larger pre-training datasets (e.g. ImageNet, Sports1M, Kinetics (Carreira and Zisserman, 2017)) and subsequently using the estimated parameters as priors for modeling the features of the (smaller) dataset of interest.
- Bishop (2006) Bishop, C. M. (2006). Pattern recognition and machine learning. springer.
- Carreira and Zisserman (2017) Carreira, J. and Zisserman, A. (2017). Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4724–4733.
- Cohen et al. (2017) Cohen, G., Afshar, S., Tapson, J., and van Schaik, A. (2017). Emnist: an extension of mnist to handwritten letters. arXiv preprint arXiv:1702.05373.
- Davis and Goadrich (2006) Davis, J. and Goadrich, M. (2006). The relationship between precision-recall and roc curves. In Proceedings of the 23rd international conference on Machine learning, pages 233–240. ACM.
- Gal and Ghahramani (2016) Gal, Y. and Ghahramani, Z. (2016). Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In international conference on machine learning, pages 1050–1059.
- Goodfellow et al. (2015) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2015). Explaining and harnessing adversarial examples.
- Guo et al. (2016) Guo, Y., Yao, A., and Chen, Y. (2016). Dynamic network surgery for efficient dnns. In Advances in Neural Information Processing Systems, pages 1379–1387.
- Hara et al. (2018) Hara, K., Kataoka, H., and Satoh, Y. (2018). Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6546–6555.
- He et al. (2016) He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778.
- Hendrycks and Gimpel (2017) Hendrycks, D. and Gimpel, K. (2017). A baseline for detecting misclassified and out-of-distribution examples in neural networks.
- Kendall and Gal (2017) Kendall, A. and Gal, Y. (2017). What uncertainties do we need in bayesian deep learning for computer vision? In Advances in neural information processing systems, pages 5574–5584.
- Krizhevsky et al. (2012) Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105.
- Lee et al. (2018) Lee, K., Lee, K., Lee, H., and Shin, J. (2018). A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Advances in Neural Information Processing Systems, pages 7167–7177.
- Liang et al. (2018) Liang, S., Li, Y., and Srikant, R. (2018). Enhancing the reliability of out-of-distribution image detection in neural networks.
- Maaten and Hinton (2008) Maaten, L. v. d. and Hinton, G. (2008). Visualizing data using t-sne. Journal of machine learning research, 9(Nov):2579–2605.
- Netzer et al. (2011) Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. (2011). Reading digits in natural images with unsupervised feature learning.
- Pedregosa et al. (2011) Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.
- Simonyan and Zisserman (2015) Simonyan, K. and Zisserman, A. (2015). Very deep convolutional networks for large-scale image recognition.
- Soomro et al. (2012) Soomro, K., Zamir, A. R., and Shah, M. (2012). Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.
- Tenenbaum et al. (2000) Tenenbaum, J. B., De Silva, V., and Langford, J. C. (2000). A global geometric framework for nonlinear dimensionality reduction. science, 290(5500):2319–2323.
- Yu et al. (2015) Yu, F., Zhang, Y., Song, S., Seff, A., and Xiao, J. (2015). Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365.
- Zając et al. (2019) Zając, M., Żołna, K., Rostamzadeh, N., and Pinheiro, P. (2019). Adversarial framing for image and video classification. In AAAI Conference on Artificial Intelligence.