We introduce the MNIST-C dataset, a comprehensive suite of 15 corruptions applied to the MNIST test set, for benchmarking out-of-distribution robustness in computer vision. Through several experiments and visualizations we demonstrate that our corruptions significantly degrade performance of state-of-the-art computer vision models while preserving the semantic content of the test images. In contrast to the popular notion of adversarial robustness, our model-agnostic corruptions do not seek worst-case performance but are instead designed to be broad and diverse, capturing multiple failure modes of modern models. In fact, we find that several previously published adversarial defenses significantly degrade robustness as measured by MNIST-C. We hope that our benchmark serves as a useful tool for future work in designing systems that are able to learn robust feature representations that capture the underlying semantics of the input.
MNIST-C: A Robustness Benchmark for Computer Vision
Norman Mu, Justin Gilmer
Presented at the ICML 2019 Workshop on Uncertainty and Robustness in Deep Learning. Copyright 2019 by the author(s).
Despite reports of superhuman performance on test datasets drawn from the same distribution as the training data, computer vision models still lag behind humans when evaluated on out-of-distribution (OOD) data (Dodge & Karam, 2017). For example, models lack robustness to small translations of the input (Azulay & Weiss, 2018), small adversarial perturbations (Szegedy et al., 2013; Goodfellow et al., 2014), as well as commonly occurring image corruptions such as brightness, fog and various forms of blurring (Pei et al., 2017; Hendrycks & Dietterich, 2018). Achieving robustness to distributional shift is an essential step for deploying models in complex, real-world settings where test data does not perfectly match the training distribution.
Recently, Hendrycks & Dietterich (2018) proposed an OOD benchmark for the CIFAR-10 and Imagenet datasets. This benchmark consists of 15 commonly occurring visual corruptions at 5 different severity levels and is intended as a general-purpose robustness benchmark in computer vision. Smaller, simpler datasets such as MNIST continue to play an important role during prototyping, when iteration speed is paramount. MNIST is still commonly used in robustness research today (Madry et al., 2017; Wang & Yu, 2018; Frosst et al., 2018; Schott et al., 2018); however, MNIST lacks a standardized corrupted variant. To this end we propose MNIST-C (source code and download link available at http://github.com/google-research/mnist-c), a benchmark consisting of 15 image corruptions for measuring out-of-distribution robustness in computer vision. Our benchmark is inspired by Imagenet-C and CIFAR-10-C, but is also specifically tailored to MNIST, which consists of low-resolution, black-and-white images. These 15 corruptions are carefully chosen from a larger pool of 31 corruptions (while we recommend and publish results based on the subset of 15 for evaluation, we will open-source code for generating all 31 for researchers to experiment with).
Through several experiments and visualizations, we demonstrate that our benchmark is non-trivial, semantically invariant, realistic, and diverse. Our corruptions capture failure modes of models previously unidentified in the literature. Relative to clean data, our corruptions increase error rates of convolutional neural networks by a factor of 10 while preserving the semantic content of the underlying image. Furthermore, we evaluate 4 prior adversarial defense methods and find that they all significantly degrade performance on MNIST-C. We believe this shows that our robustness benchmark captures failure modes of computer vision that popular measures of adversarial robustness fail to identify. Finally, we demonstrate that simple data augmentation cannot trivially solve this benchmark: training on all but one of the corruptions yields minimal improvement in test accuracy on the held-out corruption, leaving a large gap between neural network and human performance.
Here, we briefly explain the 15 corruptions in MNIST-C. Shot noise and impulse noise are random corruptions which may occur during the imaging process. Glass blur simulates viewing the image through frosted glass by locally shuffling pixels. Motion blur blurs the image along a random line. Shear, translate, scale, and rotate (Ghifary et al., 2015; Engstrom et al., 2017) are each applied as an affine transformation to the image. Brightness increases the brightness of the image. Stripe inverts the pixel values along a vertical stripe in the center of the image. Fog simulates a haze or fog using the diamond-square algorithm. Spatter occludes small regions of the image with randomly generated splotches. Zigzag and dotted line superimpose randomly oriented zigzags and dotted lines over the image, with the brightness of each straight segment controlled by an exponential kernel. Canny edges applies the Canny edge detector to each image (Ding et al., 2019). The shot noise, impulse noise, glass blur, motion blur, brightness, fog, and spatter corruptions are adapted from Imagenet-C (Hendrycks & Dietterich, 2018).
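To make the flavor of these corruptions concrete, here is a minimal numpy sketch of the stripe corruption described above. The stripe width and exact placement are illustrative assumptions; the released MNIST-C implementation may differ in its details.

```python
import numpy as np

def stripe(image, width=6):
    """Invert pixel values in a centered vertical stripe.

    Illustrative sketch of the 'stripe' corruption; `width` is an
    assumed parameter, not the released MNIST-C setting.
    image: float array with values in [0, 1], shape (H, W).
    """
    corrupted = image.copy()
    h, w = corrupted.shape
    lo = (w - width) // 2
    # Invert only the columns inside the centered stripe.
    corrupted[:, lo:lo + width] = 1.0 - corrupted[:, lo:lo + width]
    return corrupted
```

Because the transformation only touches a fixed band of columns, digit strokes outside the stripe are untouched and the image remains human-readable.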
To construct MNIST-C we started with a broad suite of 31 image corruptions drawn from prior literature on both robustness and image processing. These corruptions range over additive noise, blurring, digital corruptions, and geometric transformations, as well as superimposed zigzags and squiggles. For each corruption we parameterize the severity and then choose a severity level that degrades model performance while preserving the semantic content. To choose a good set of corruptions, we first sought to understand model behavior under these corruptions. To that end we evaluated the performance of convolutional neural networks on these corruptions through extensive data augmentation experiments involving various combinations of all 31 corruptions. These experiments allowed us to choose a diverse set of corruptions which represent many different classes of model failures, and to avoid picking correlated groups of corruptions on which model performance is very similar. We choose the MNIST-C corruptions with the following principles in mind:
Non-triviality: Each corruption should degrade the test accuracy of various models. We tuned the severity of each corruption to a level that exposes blind spots of modern convolutional networks. As we demonstrate in a later section, the error rates of CNNs increase by up to 1000% (relative to clean MNIST error rates) when tested on MNIST-C. Furthermore, we show that our benchmark cannot be solved by naive data augmentation, nor by prior methods from the adversarial defense literature.
Semantic preservation: As our corruptions attempt to measure failures in computer vision systems, it is critical that the perceived label of corrupted images remains invariant to a human subject. We verify this by thorough visual inspection, and we include in the appendix a random sample of images mis-classified by a simple CNN trained on standard MNIST.
Realism: We took care to include corruptions which models could plausibly encounter in the wild, not necessarily in an adversarial setting. Our MNIST-C corruptions might occur through real-world perturbations to the camera setup (shot noise, impulse noise, motion blur, shear, scale, rotate, translate), environmental factors (brightness, stripe, fog, glass blur), or physical modification (spatter, dotted line, zigzag, Canny edges).
Breadth: We also paid attention to the breadth of our corruptions. From an original working list of 31 corruptions we selected 15 on the criteria of covering a wide swath of possible corruptions, and also of avoiding redundancy both in visual terms and in terms of overlap in performance degradation of the models we tested. We present these 15 corruptions as our MNIST-C benchmark, though we release the source code for all 31 corruptions and welcome further work to use a different subset.
We will release both the source code for the corruptions and the static, pre-computed MNIST-C dataset that we used to evaluate the various models, since the algorithms used to generate the corruptions are not optimized and are unsuitable for direct use in a training routine.
We evaluate the MNIST-C corruptions against several models: a simple CNN (Conv1) trained on clean MNIST (model definition taken from here), a different CNN (Conv2) trained against PGD adversarial noise (Madry et al., 2017), yet another CNN (Conv3) trained against PGD/GAN adversaries (Wang & Yu, 2018), a capsule network (Frosst et al., 2018), and a generative model, ABS (Schott et al., 2018). The results are shown in Table 3. We find that the baseline model Conv1 achieves 91.21% accuracy when averaged over the entire benchmark, or roughly an 1100% increase in error rate relative to clean test accuracy (99.22%). In the Appendix we show a random sample of test errors made by Conv1 on each corruption. We find that a majority of model errors on MNIST-C are images which are easy for a human to classify correctly.
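The quoted relative increase in error rate follows directly from the two accuracies above; the back-of-the-envelope check below is our own illustration, not code from the benchmark.

```python
# Accuracies quoted in the text for the baseline model Conv1.
clean_acc = 0.9922    # clean MNIST test accuracy
mnistc_acc = 0.9121   # mean accuracy over the MNIST-C benchmark

clean_err = 1.0 - clean_acc     # ~0.0078 clean error rate
mnistc_err = 1.0 - mnistc_acc   # ~0.0879 mean MNIST-C error rate

# Error-rate ratio: roughly 11x, i.e. on the order of the ~1100% figure.
factor = mnistc_err / clean_err
```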
A popular robustness metric in recent literature has been adversarial robustness, where model performance is evaluated against a suite of constrained-optimization attacks. It is natural to ask whether methods which claim improved adversarial robustness also improve out-of-distribution robustness, as there are theoretical connections between $\ell_p$-robustness and robustness to Gaussian noise (Ford et al., 2019), and prior work has observed that adversarial training improves robustness on CIFAR-10-C. To that end we investigated several previously published adversarial defenses on our benchmark. Remarkably, we find that all of the tested methods actually increase the error rate on the benchmark relative to the clean model: the mean accuracy on MNIST-C of the three adversarially trained models is significantly lower than that of the clean model. Translating accuracy values into error rates shows that the three adversarially trained models are 2.1-2.4 times as prone to error as the clean model. The two alternative architectures, the capsule network and the generative ABS model, also suffer performance degradation relative to the baseline.
A more sophisticated metric, relative mean corruption error (relative mCE), was proposed to measure performance on Imagenet-C (Hendrycks & Dietterich, 2018). Given a classifier $f$, a baseline classifier $f_{base}$, and a single corruption $c$, the CE (corruption error) of $f$ on $c$ is computed as the ratio between $E^f_c$, the error rate of $f$ on $c$, and $E^{f_{base}}_c$, the error rate of $f_{base}$ on $c$. Similarly, relative CE is calculated from the ratio of the change in error of $f$ when it is evaluated on $c$ instead of the clean data, and the change in error of $f_{base}$ when it is evaluated on $c$ instead of the clean data:

$$\mathrm{CE}^f_c = \frac{E^f_c}{E^{f_{base}}_c}, \qquad \mathrm{relative\ CE}^f_c = \frac{E^f_c - E^f_{clean}}{E^{f_{base}}_c - E^{f_{base}}_{clean}},$$

where $clean$ indicates the identity corruption that returns the input image. Given these per-corruption CE and relative CE values, we then take the average across all corruptions in the dataset to compute mean CE and relative mean CE.
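These definitions reduce to a short computation over per-corruption error rates. In the sketch below, the error rates are hypothetical placeholders for illustration, not values from the paper.

```python
def corruption_error(err_f, err_base):
    # CE: the model's error rate on corruption c, normalized by the
    # baseline's error rate on the same corruption.
    return err_f / err_base

def relative_ce(err_f, clean_f, err_base, clean_base):
    # Relative CE: the model's increase in error on c over its clean
    # error, normalized by the baseline's corresponding increase.
    return (err_f - clean_f) / (err_base - clean_base)

# Hypothetical per-corruption error rates (placeholders, not paper values).
errs_f = [0.05, 0.10]      # model f on two corruptions
errs_base = [0.08, 0.20]   # baseline on the same two corruptions
clean_f, clean_base = 0.01, 0.008

# Average over corruptions to obtain mCE and relative mCE.
mce = sum(corruption_error(f, b)
          for f, b in zip(errs_f, errs_base)) / len(errs_f)
rel_mce = sum(relative_ce(f, clean_f, b, clean_base)
              for f, b in zip(errs_f, errs_base)) / len(errs_f)
```

Note that a CE below 1 means the model outperforms the baseline on that corruption, while a relative CE below 1 means it degrades less than the baseline does when moving from clean to corrupted data.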
Relative mCE tells a similar story to mean accuracy in figure 2. Again, across the adversarially trained CNNs and the alternative architectures we find large degradations in testing performance, as quantified by the increases in relative mCE.
The fact that these methods actually degrade OOD robustness as measured by MNIST-C underscores the necessity of evaluating future methods on a broader test suite in order to quantify the robustness of a model. It is not so surprising that adversarial training degrades performance on MNIST-C despite the fact that it dramatically improves performance on CIFAR-10-C. Prior work has observed that the $\ell_\infty$-robust model on MNIST (Madry et al., 2017) achieves robustness by thresholding the input around 0.5, taking advantage of the fact that MNIST pixel values are concentrated near 0 and 1 (Schott et al., 2018). Because none of our corruptions are constrained to a small $\ell_\infty$-ball, evaluating on our benchmark can easily detect undesirable solutions which overfit to the $\ell_\infty$-robustness metric.
As our benchmark is designed to measure out-of-distribution robustness, training models directly on the distributions in the benchmark defeats its purpose. To demonstrate this, we trained and tested model Conv1 on simple mixtures of the 31 corruptions we initially came up with (see Table 4). We see that one can trivially recover full accuracy on any single corruption by simply finetuning on that corruption (finetuning here simply means that we pre-train on clean MNIST to convergence before switching to the corrupted training set). We performed a second data augmentation experiment where we finetuned Conv1 on all but one of the 31 corruptions and then tested on the held-out corruption. Remarkably, we found that on average, training on all but one corruption only improved robustness on the remaining held-out corruption from 90.4% accuracy to 91.7% accuracy, still far below human-level performance. In a flagrant violation of "out-of-distribution" robustness, we find that finetuning on all 31 corruptions together improves mean accuracy to 98.0%.
From this experiment we draw two conclusions. First, an OOD benchmark loses its meaning when models are trained directly on the corruptions— doing so would dramatically overestimate the robustness of a model. Second, while data augmentation can be a useful tool in improving robustness and generalization (Geirhos et al., 2018; Cubuk et al., 2018), the problem of generalizing to out-of-distribution data remains highly nontrivial even when aggressive data augmentation is used.
We present MNIST-C, a new robustness benchmark in computer vision. We demonstrate that our benchmark exposes new model failures that metrics from the adversarial robustness literature cannot detect. We note that not only does our dataset measure a much more comprehensive notion of robustness, it also reduces the difficulty of reproducibly evaluating robustness. This stands in contrast to the current state of measuring adversarial robustness, where reported robustness measurements are continuously refuted (Carlini et al., 2019; Athalye et al., 2018).
This, in our opinion, makes MNIST-C (and the related CIFAR-10-C, and Imagenet-C) a more reliable benchmark for measuring scientific progress of robustness in computer vision. We hope that our benchmark serves as a useful tool for future work.
- Athalye et al. (2018) Athalye, A., Carlini, N., and Wagner, D. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018.
- Azulay & Weiss (2018) Azulay, A. and Weiss, Y. Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177, 2018.
- Carlini et al. (2019) Carlini, N., Athalye, A., Papernot, N., Brendel, W., Rauber, J., Tsipras, D., Goodfellow, I., and Madry, A. On evaluating adversarial robustness. arXiv preprint arXiv:1902.06705, 2019.
- Cubuk et al. (2018) Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.
- Ding et al. (2019) Ding, G. W., Lui, K. Y. C., Jin, X., Wang, L., and Huang, R. On the sensitivity of adversarial robustness to input data distributions. arXiv preprint arXiv:1902.08336, 2019.
- Dodge & Karam (2017) Dodge, S. and Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th international conference on computer communication and networks (ICCCN), pp. 1–7. IEEE, 2017.
- Engstrom et al. (2017) Engstrom, L., Tran, B., Tsipras, D., Schmidt, L., and Madry, A. A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779, 2017.
- Ford et al. (2019) Ford, N., Gilmer, J., Carlini, N., and Cubuk, D. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019.
- Frosst et al. (2018) Frosst, N., Sabour, S., and Hinton, G. Darccc: Detecting adversaries by reconstruction from class conditional capsules. arXiv preprint arXiv:1811.06969, 2018.
- Geirhos et al. (2018) Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
- Ghifary et al. (2015) Ghifary, M., Bastiaan Kleijn, W., Zhang, M., and Balduzzi, D. Domain generalization for object recognition with multi-task autoencoders. In Proceedings of the IEEE international conference on computer vision, pp. 2551–2559, 2015.
- Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Hendrycks & Dietterich (2018) Hendrycks, D. and Dietterich, T. G. Benchmarking neural network robustness to common corruptions and surface variations. arXiv preprint arXiv:1807.01697, 2018.
- Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.
- Pei et al. (2017) Pei, K., Cao, Y., Yang, J., and Jana, S. Towards practical verification of machine learning: The case of computer vision systems. arXiv preprint arXiv:1712.01785, 2017.
- Schott et al. (2018) Schott, L., Rauber, J., Brendel, W., and Bethge, M. Robust perception through analysis by synthesis. arXiv preprint arXiv:1805.09190, 2018.
- Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Wang & Yu (2018) Wang, H. and Yu, C.-N. A direct approach to robust deep learning using adversarial networks. arXiv preprint arXiv:1905.09591, 2018.
Along with the 15 corruptions we selected for the MNIST-C benchmark suite, we also release 16 additional corruptions. We briefly explain each of these corruptions here.
Speckle noise is a random corruption which may occur during the imaging process. Pessimal noise is sampled from a multivariate Gaussian with an adversarially trained covariance matrix and then tiled in a 2x2 pattern across the image. We intended this as a proxy for a worst-case non-interactive corruption. The covariance matrix is trained via SGD to maximize the training loss of model Conv1 and then frozen at test time when the corruption is applied. Gaussian blur applies a Gaussian kernel to the image. Defocus blur simulates a defocused camera lens. Zoom blur simulates increasing focal length during image capture. Frost overlays a random crop from one of six images of real frost. Snow transforms and blurs Gaussian noise to simulate the appearance of snow. Contrast reduces the contrast of the image. Saturate increases the saturation of the image. JPEG compression runs the lossy JPEG compression algorithm on the image. Pixelate distorts the image by resizing down and then back to the original size. Elastic transform applies a random affine transformation to each square patch in the image. Quantize reduces the color range of the image by rounding each pixel value to evenly spaced values. Line superimposes a randomly oriented line over the image, with the brightness of each straight segment determined by an exponential kernel. Inverse inverts the pixel values of the entire image.
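Of the corruptions above, quantize is perhaps the simplest to state precisely. The numpy sketch below is our own illustration; the default number of levels is an assumption, not the released MNIST-C setting.

```python
import numpy as np

def quantize(image, levels=4):
    """Round each pixel to one of `levels` evenly spaced values in [0, 1].

    Illustrative sketch of the 'quantize' corruption; `levels` is an
    assumed parameter. image: float array with values in [0, 1].
    """
    # Scale to [0, levels-1], round to the nearest integer level,
    # then scale back to [0, 1].
    return np.round(image * (levels - 1)) / (levels - 1)
```

With a small number of levels the digit strokes remain clearly legible, but the smooth grayscale ramps at stroke edges collapse into hard bands, which is exactly the distribution shift the corruption is meant to probe.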
Speckle noise, Gaussian blur, defocus blur, zoom blur, frost, snow, contrast, saturate, JPEG compression, pixelate, and elastic transform are taken from (Hendrycks & Dietterich, 2018).
We report the test accuracy of all 6 models we benchmarked on MNIST-C in figure 3.
In figure 4, we show the effects of simple data augmentation on training Conv1. The left two columns represent data augmentations that do not directly access the corruption on which the model is tested, and the right two columns represent data augmentations that do directly access the test corruption.
In the appendix figures, each caption denotes the model's prediction A on the original image and B on the corrupted image, where C is the true label.