Controversial stimuli: pitting neural networks against each other as models of human recognition
Distinct scientific theories can make similar predictions. To adjudicate between theories, we must design experiments for which the theories make distinct predictions. Here we consider the problem of comparing deep neural networks as models of human visual recognition. To efficiently determine which models better explain human responses, we synthesize controversial stimuli: images for which different models produce distinct responses. We tested nine different models, which employed different architectures and recognition algorithms, including discriminative and generative models, all trained to recognize handwritten digits (from the MNIST set of digit images). We synthesized controversial stimuli to maximize the disagreement among the models. Human subjects viewed hundreds of these stimuli and judged the probability of presence of each digit in each image. We quantified how accurately each model predicted the human judgements. We found that the generative models (which learn the distribution of images for each class) better predicted the human judgments than the discriminative models (which learn to directly map from images to labels). The best performing model was the generative Analysis-by-Synthesis model (based on variational autoencoders). However, a simpler generative model (based on Gaussian-kernel-density estimation) also performed better than each of the discriminative models. None of the candidate models fully explained the human responses. We discuss the advantages and limitations of controversial stimuli as an experimental paradigm and how they generalize and improve on adversarial examples as probes of discrepancies between models and human perception.
Convolutional deep neural networks (DNNs) are currently the best image-computable models of human visual object recognition [20, 42, 19]. To continue improving our computational understanding of biological object recognition, we must efficiently compare different DNN models in terms of their predictions of neuronal and behavioral responses of human and non-human observers. Adjudicating between models requires stimuli for which models make distinct predictions.
Here we consider the problem of adjudicating between models on the basis of their behavior: the classifications of images. Finding stimuli over which high-parametric DNN models disagree is complicated by the flexibility of these models. Given a sufficiently large sample of labeled training images, a wide variety of high-parametric DNNs can learn to predict the human-assigned labels of out-of-sample images. By definition, models with high test accuracy will mostly agree with each other on the classification of test images sampled from the same distribution as the training dataset.
Even when there is a considerable difference in test accuracy between two models, the more accurate model is not necessarily more human-like in the features its decisions are based on. The more accurate model might use discriminative features not used by human observers. DNNs may learn to exploit discriminative features that are completely invisible to human observers [18, 16]. For example, consider a DNN that learns to exploit camera-related artifacts to distinguish between pets (which are likely to be photographed by their owners with cellphone cameras) and wild animals (which are likely to be photographed by professionals with SLR cameras). A DNN that picked up on such features might be similar to humans in its classification behavior on the training distribution (i.e., highly accurate), despite being dissimilar in its mechanism. Another model that does not exploit such features might have lower accuracy, despite being more similar to humans in its mechanism. To reveal the distinct mechanisms, we need to move beyond the training distribution.
While all models trained on a given task share the same training distribution, each model brings a different inductive bias: the assumptions a model employs to generalize from training stimuli to novel stimuli. To reveal these inductive biases, we need to use test stimuli that are not merely out-of-sample (different images from the same distribution), but out-of-distribution (OOD, images from a different distribution). Prominent examples of such OOD images include images from a different domain [e.g., training a DNN on natural images and testing on silhouettes, 2], as well as images degraded by noise or distortions [5, 11, 14], filtered, retextured, or adversarially perturbed to bias a DNN’s classifications. Assessing a model’s ability to predict human responses to OOD stimuli provides a severe test of the validity of the model’s inductive bias.
Previous studies have compared the responses of models and humans to distorted [5, 11] and adversarially perturbed images [44, 6], demonstrating the power of testing for OOD generalization. However, such stimuli are not guaranteed to expose differences in the inductive biases of different models, because they are not designed to probe the portion of stimulus space where the decisions of different models disagree.
Here we propose testing and comparing DNN models of vision on controversial stimuli. A controversial stimulus is a sensory input (here, an image) that elicits clearly distinct responses among two or more models. Once we define a controversiality score, we can search for such stimuli in large corpora or, more flexibly, synthesize them by optimization (Fig. 1).
Collecting human responses to stimuli that are controversial between two models gives us great power to adjudicate between the models. The human responses are guaranteed to provide evidence against at least one of the models, since they cannot agree with both models.
Controversial stimuli vs. adversarial examples
A stimulus that is controversial between two models must be an adversarial example to at least one of them: since the models disagree, at least one of them must be incorrect (in whatever way we choose to define correctness). However, an adversarial example for one of two models may not be controversial between them: both models may be similarly fooled [38, 12, 23]. Controversial stimuli provide an attractive alternative to adversarial examples for probing models in that they sidestep the ground-truth problem. When adversarially perturbing an image, it is usually assumed that the perturbation will not also affect the true label (or the class perceived by humans). This assumption is guaranteed to hold only if the perturbation is too small to matter [e.g., as in 38]. When the bound on the perturbation is large or absent, human observers and the targeted model might actually agree on the contents of the image, making the resulting image a valid example of another class that is correctly classified by the model. Such an image does not constitute a successful adversarial attack. The validity and power of a controversial stimulus, by contrast, are guaranteed given that the stimulus succeeds in making two models disagree.
Our approach is conceptually related to Maximum differentiation (MAD) competition. MAD competition perturbs a source image in four directions: increasing the response of one model while keeping the response of the other fixed, decreasing the response of one model while keeping the response of the other fixed, and the converse pair (switching the roles of the two models).
In contrast, a single controversial stimulus manipulates two (or more) models in opposite directions. Yet, crudely speaking, our approach can be viewed as a generalization of MAD competition from univariate response measures (e.g., perceived image quality) to multivariate response measures (e.g., detected object categories) and from local perturbation of natural images to unconstrained search in image space.
Recognition of hand-written digits is a non-trivial test case for DNN-human compatibility
We prototype and demonstrate the approach of controversial stimuli on models trained to recognize hand-written digits from the MNIST dataset. From an engineering perspective, MNIST is essentially solved, with multiple, qualitatively different machine learning models attaining near-perfect performance. It is far less clear whether any of these models solve MNIST the way humans do. Classifying digits is a very restricted task, compared to object recognition in natural images. However, even for this simple task, we find that models of similar accuracy vary dramatically in their ability to predict human responses to controversial stimuli. In particular, we found that models that learn the distribution of images within each class (i.e., class-conditional generative models) show inductive biases that are far more compatible with human recognition than models that learn to merely separate the classes (i.e., discriminative models). Since discriminatively trained DNNs still dominate computational modeling of human object recognition, this calls for a refocusing of the modeling efforts.
Candidate MNIST models
We assembled a set of nine candidate models, all trained on MNIST (Table 1). We included five families of models: (1) discriminative feedforward models—we adapted the VGG architecture to MNIST (’small VGG’, see Materials and Methods) and trained it either on the standard MNIST dataset (’small VGG−’) or on a version extended by non-digit images (Fig. S1; we dub the resulting model ’small VGG+’), (2) discriminative recurrent models—the Capsule Network and the Deep Predictive Coding Network (PCN), (3) adversarially trained discriminative DNNs, (4) a reconstruction-based readout of the Capsule Network, and (5) class-conditional generative models—classifying according to a likelihood estimate of each class, obtained either through a class-specific Gaussian Kernel Density Estimation (KDE) or through a class-specific Variational Autoencoder (VAE)—the ’Analysis by Synthesis’ (ABS) model.
|model family|model|MNIST test error|
|discriminative feedforward|small VGG− *|0.47%|
|discriminative feedforward|small VGG+ *|0.59%|
|discriminative recurrent|Wen PCN-E4|0.42%|
|adversarially trained|Madry (ℓ∞)|1.47%|
|adversarially trained|Madry (ℓ2)|1.07%|
|reconstruction-based|CapsuleNet Recon|0.29%|
|generative|Schott ABS|1.00%|
* A modified architecture. See Supplementary Materials and Methods.
Many DNN models operate under the assumption that each test image is paired with exactly one correct class (here, an MNIST digit). In contrast, human observers may detect more than one class in an image or, alternatively, detect none. To provide the tested models with greater flexibility, the outputs of all of the models were evaluated using a multi-label readout: for each class, the corresponding penultimate activation (i.e., the logit) was fed to a sigmoid function instead of the usual softmax readout. This setup treats the detection of each digit as a binary classification problem.
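The contrast between the multi-label sigmoid readout and the standard softmax readout can be sketched as follows (a minimal numpy sketch; the function names are ours, not from the original code):

```python
import numpy as np

def sigmoid_readout(logits):
    """Multi-label readout: each class logit is mapped independently
    through a sigmoid, so detecting one digit does not suppress the
    probabilities assigned to the other digits."""
    return 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))

def softmax_readout(logits):
    """Standard softmax readout, shown for contrast: the class
    probabilities are forced to compete and to sum to 1."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

With the sigmoid readout, two large logits can both map to probabilities near 1 (two digits detected), and uniformly low logits map to probabilities near 0 (no digit detected); the softmax permits neither outcome.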
Another limitation of many DNN models is that they are typically overconfident about their classifications. To address this issue, we calibrated each model by applying an affine transformation to its logits [32, 13]. The slope and intercept parameters of this transformation were shared across classes and were fit to minimize the predictive cross-entropy on MNIST test images. This procedure tunes the sigmoid readout and enables a fair comparison among the models, in which all of them are well calibrated. For pre-trained models, this calibration (as well as the use of sigmoids instead of the softmax readout) affects only the models’ certainty rather than their classification accuracy (i.e., it does not change the most likely class given an input image).
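One simple way to fit such a shared affine calibration is gradient descent on the cross-entropy of the sigmoid readout. The sketch below is our own hypothetical minimal implementation, not the authors’ code; it assumes a vector of held-out logits paired with target probabilities for a single class detector:

```python
import numpy as np

def calibrate(logits, labels, lr=0.1, steps=2000):
    """Fit a shared affine transformation of the logits (slope a,
    intercept b) by gradient descent on the cross-entropy of the
    sigmoid readout. Because a and b are shared across classes,
    calibration changes a model's confidence but (for a > 0) not
    the ranking of its classes."""
    z = np.asarray(logits, dtype=float)
    y = np.asarray(labels, dtype=float)
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * z + b)))
        residual = p - y                 # dCE/dlogit for a sigmoid
        a -= lr * np.mean(residual * z)  # dCE/da
        b -= lr * np.mean(residual)      # dCE/db
    return a, b
```

Applied to an overconfident model, the fitted slope shrinks the logits (a < 1), pulling the predicted probabilities toward the observed label frequencies without changing which class is most likely.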
Synthesizing controversial stimuli
Consider a set of candidate models. We would like to define a controversiality score c(x) for an image x. This score should be high if the models strongly disagree on the contents of the image.
Ideally, information-theoretic experimental design [22, 15] would approach this problem by formulating our beliefs regarding which model is correct as a posterior probability distribution, conditioned on observed or hypothetical stimuli and responses. Sets of potential stimuli would be scored according to their expected reduction of the entropy of this posterior probability distribution. However, this statistically-ideal approach is not currently tractable in high-level vision, where stimuli are arbitrary images and models are DNNs.
Here we use a simple heuristic approach. We consider one pair of models at a time, e.g., models A and B. For a given pair of digits, y_A and y_B (e.g., 3 and 7), an image x is assigned a high controversiality score if it is recognized by model A as digit y_A and by model B as digit y_B. The following function achieves this:

c_{A,B}^{y_A, y_B}(x) = min( p_A(y_A | x), p_B(y_B | x) ),    (1)

where p_m(y | x) is the estimated conditional probability that image x contains digit y according to model m, and min(·) is the minimum function. However, this function assumes that a model cannot simultaneously assign high probabilities to both digit y_A and digit y_B in the same image. This assumption is true for models with softmax readout. To make the controversiality score compatible also with less restricted, multi-label readouts, we used the following function instead:

c_{A,B}^{y_A, y_B}(x) = min( p_A(y_A | x), 1 - p_A(y_B | x), p_B(y_B | x), 1 - p_B(y_A | x) ).    (2)

If the models agree on the classification of image x, then p_A(y | x) and p_B(y | x) will be either both high or both low for each digit y, so either p_B(y_B | x) or 1 - p_A(y_B | x) will be a small number, pushing the minimum down.
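Expressed in code, the multi-label controversiality score for a single image could look like this (a minimal sketch; variable names are ours):

```python
import numpy as np

def controversiality(p_A, p_B, y_A, y_B):
    """Multi-label controversiality score for one image: high only
    if model A detects digit y_A but not y_B, while model B detects
    y_B but not y_A.  p_A, p_B: length-10 vectors of per-digit
    detection probabilities (sigmoid readout) from the two models."""
    return min(p_A[y_A], 1.0 - p_A[y_B],
               p_B[y_B], 1.0 - p_B[y_A])
```

If the two models produce identical detection probabilities, at least one of the four terms is necessarily small, so the score stays low, as intended.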
Employing an activation-maximization approach, one can form images that maximize Eq. 2 by following its gradient with respect to the image (in practice, we differentiated a smoother surrogate function, Eq. 5). We initialized stimuli as random noise images and iteratively ascended a numerical estimate of this gradient until convergence (see Materials and Methods). This procedure gradually increases the controversiality of the image. Convergence to a sufficiently controversial stimulus is not guaranteed. A controversial stimulus cannot be found, for example, if both models associate exactly the same regions of image space with the two digits. However, if a controversial image is found, it is guaranteed to provide an informative test stimulus for adjudicating between the two models.
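The overall synthesis loop can be sketched as follows, assuming a function that returns the gradient of the controversiality score with respect to the image (a real implementation would backpropagate through both models and ascend the smoother surrogate objective):

```python
import numpy as np

def synthesize(grad_fn, shape=(28, 28), lr=0.1, steps=100, seed=0):
    """Synthesize a stimulus by gradient ascent: start from a random
    noise image and repeatedly step along the gradient of the
    controversiality score, keeping pixel values in [0, 1].
    grad_fn(x) must return the gradient of the score at image x."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(0.0, 1.0, size=shape)  # random noise initialization
    for _ in range(steps):
        x = np.clip(x + lr * grad_fn(x), 0.0, 1.0)
    return x
```

With a toy concave score such as -(x - t)^2, the ascent drives every pixel toward the target value t; with two real models, the gradient instead pushes the image into a region of image space where the models’ classifications conflict.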
Synthesized controversial stimuli expose a hierarchy of compatibility with human recognition
For each pair of models, we formed 90 controversial stimuli, targeting all possible digit pairs. Fig. 2 shows the results of this procedure for a particular digit pair across all model pairs, and Fig. 3 shows the results across all digit pairs for four model pairs.
Viewing the resulting controversial stimuli, it is immediately apparent that pairs of discriminative models can detect incompatible digits in images that are meaningless to us, the human observers. Images that are confidently classified by DNNs but unrecognizable to humans have been described in the computer science literature (e.g., ’fooling images’, ’rubbish class examples’, and ’distal adversarial examples’). However, instead of misleading one model (compared to some standard of ground truth), our controversial stimuli elicit disagreement between two models. For pairs of discriminatively trained models (Fig. 3A, B), human classifications are consistent with neither model, providing evidence against both.
One may hypothesize that the poor behavior of discriminative models outside the manifold of training examples is related to the lack of non-class examples in their training [e.g., 25]. To test this hypothesis, we trained a discriminative model with diverse non-digit examples (Fig. S1). The small VGG model, trained to discriminate not only among digits, but also between digits and non-digits, still detected digits in controversial images that look like ’rubbish’ to us (Fig. 3A, but see the next section for some advantages this model revealed in the quantitative testing).
There were some qualitative differences among the stimuli resulting from targeting pairs of discriminative models. Images targeting one of the two recurrent DNN models, the Capsule Network and the PCN, showed increased (yet still largely humanly unrecognizable) structure (e.g., Fig. 3B). When the discriminative models pitted against each other included a DNN that had undergone ℓ∞-bounded adversarial training, the resulting controversial stimuli showed traces of human-recognizable digits (Fig. 3, Madry). These digits’ human classifications tended to be compatible with the classifications of the adversarially trained discriminative model [see 39, for a discussion of the advantages of ℓ∞-bounded adversarial training].
And yet, when any of the discriminative models was pitted against either the reconstruction-based readout of the Capsule Network, or either of the generative models (Gaussian KDE or ABS), the controversial image was almost always a human-recognizable digit compatible with the target of the reconstruction-based or generative model (e.g., Fig. 3C). Finally, synthesizing controversial stimuli to adjudicate between the three reconstruction-based/generative models produced images whose human classifications are most similar to the ABS model (e.g., Fig. 3D).
Human psychophysics can formally adjudicate among models and reveal their limitations
Inspecting a matrix of controversial stimuli synthesized to cause disagreement between two models can provide a sense of which model is more similar to us in its decision boundaries. However, it does not tell us how a third model (not used in synthesizing the stimuli) responds to these images. Moreover, some of the resulting controversial stimuli are ambiguous to human observers. We therefore need careful human behavioral experiments to adjudicate among the models.
We evaluated each model by comparing its judgments to those of human subjects, and compared the models in terms of how well they could predict the human judgments. For the behavioral experiment, we selected 720 controversial stimuli (20 per model-pair comparison, see Materials and Methods) as well as 100 randomly selected MNIST test images. We presented these 820 stimuli to 30 human observers, in a different random order for each observer. For each image, observers rated each digit’s probability of presence from 0% to 100% on a five-point scale (Fig. S4). Subjects were allowed to judge multiple digits as present in a given image with high probability; the probabilities were not constrained to sum to 1. Since there is no objective reference providing correct answers in this task (i.e., the human responses define the ground truth), no feedback was provided.
The responses of each human subject were directly compared to each model’s predictions:

MSE_{s,m} = mean_{i,y} [ p_s(y | x_i) - p_m(y | x_i) ]^2,    (3)

where MSE_{s,m} is the mean squared error with which model m predicts the probabilities p_s(y | x_i), judged by subject s, that image x_i contains digit y; p_m(y | x_i) are the model’s corresponding judgments, and the mean is taken over all 820 images and 10 digits. This measure is proportional to the squared Euclidean distance between the model’s responses and the subject’s responses.
Given the intersubject variability and decision noise, the true model (if it were included in our set) cannot perfectly predict the human judgments. To estimate the maximal attainable performance (i.e., the noise floor of the MSE), we compared each subject’s responses with the predictions obtained by averaging the response patterns of all of the other subjects:

MSE_{s,-s} = mean_{i,y} [ p_s(y | x_i) - p_{-s}(y | x_i) ]^2,    (4)

where MSE_{s,-s} is the mean squared error with which the mean response pattern p_{-s} across all subjects except s (an 820 images × 10 digits matrix) predicts the judged probabilities of subject s.
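Both error measures are straightforward to compute; the sketch below is our own minimal numpy code (not the original analysis script), taking per-subject response matrices of shape images × digits:

```python
import numpy as np

def model_mse(human, model):
    """Eq. 3-style prediction error: mean squared difference between
    one subject's judged probabilities (images x digits) and a
    model's predicted probabilities for the same stimuli."""
    return np.mean((np.asarray(human, dtype=float)
                    - np.asarray(model, dtype=float)) ** 2)

def noise_floor(all_subjects):
    """Leave-one-subject-out noise floor: predict each subject from
    the mean response pattern of the remaining subjects.
    all_subjects: array of shape (n_subjects, n_images, n_digits).
    Returns one noise-floor MSE per subject."""
    R = np.asarray(all_subjects, dtype=float)
    floors = []
    for s in range(R.shape[0]):
        others = np.delete(R, s, axis=0).mean(axis=0)
        floors.append(model_mse(R[s], others))
    return np.array(floors)
```

A model whose MSE reaches the leave-one-subject-out floor predicts a subject as well as the pooled remaining subjects do, which is the best any single fixed response pattern can achieve given intersubject variability.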
The results of the experiment (Fig. 4) largely corroborate the qualitative impressions of the controversial stimuli, indicating that the deep class-conditional generative ABS model is superior to the other models in predicting the human responses to the tested stimuli. It is followed by the shallow class-conditional generative model, the Gaussian KDE, which in turn is followed by the reconstruction-based readout of the Capsule Network. The discriminative models all performed significantly worse than these three models. The leave-one-subject-out noise floor estimate (Eq. 4) provided a significantly more accurate prediction than all of the models (black dots in Fig. 4). This indicates that none of the models (including the ABS model) fully explained the explainable variability in the data.
Prediction error as measured by Eq. 3 is a strict criterion: achieving minimal MSE (i.e., reaching the noise floor) requires a model to exactly predict the average human response pattern. To eliminate the potential effect of miscalibration of the models with respect to the human-assigned probabilities, we conducted control analyses employing either model recalibration or more flexible correlation measures. First, we repeated the analysis after recalibrating each model to minimize the overall prediction error of the model across subjects (Fig. 4A). We also retested the models after replacing the multi-label sigmoid readout with a modified softmax, similarly recalibrating the readout hyperparameters to minimize each model’s MSE (Fig. 4B). In addition, we compared the non-recalibrated models to the human judgments using linear correlation instead of MSE, allowing for subject-specific scaling and shifting of the models’ predictions (Fig. 5A). Finally, we applied isotonic regression to predict each subject’s individual responses from each model’s predictions, allowing an arbitrary monotonic transformation from model to human judgments (Fig. 5B). In all of these control analyses, the advantage of the ABS model over all of the other models persisted, as did the gap between all of the models and the noise floor.
To better understand the contribution of different stimuli to the results, we partitioned the MSE of each model into three components: the error in predicting the responses (1) to controversial stimuli that targeted the model, (2) to controversial stimuli that did not target the model, and (3) to the MNIST test images (Fig. 6). The error partitioning uncovered two findings. First, the small VGG model trained with non-digit examples performed better than the other discriminative models because it made fewer errors on controversial stimuli that targeted models other than itself. Many of the controversial stimuli targeting other models are not recognized as digits by humans, and this model recognizes them as non-digits with greater accuracy than the other discriminative models. This explains its quantitative advantage over the other discriminative models (Fig. 4). However, it fails when targeted in the synthesis of controversial stimuli, revealing that it does not capture human decision boundaries (Fig. 3A). Second, the two generative models, the Gaussian KDE and the ABS model, were significantly worse than the discriminative models at predicting the human responses to the 100 MNIST test images (Fig. 6A).
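The partitioning itself amounts to averaging each model’s per-image squared errors within each stimulus group; a minimal sketch (the group labels and function name are ours):

```python
import numpy as np

def partition_mse(sq_err_per_image, group):
    """Split a model's per-image squared prediction errors into the
    three stimulus groups used in the analysis: controversial stimuli
    that targeted the model, controversial stimuli that targeted only
    other models, and natural MNIST test images."""
    sq_err = np.asarray(sq_err_per_image, dtype=float)
    group = np.asarray(group)
    return {g: sq_err[group == g].mean()
            for g in ("targeted", "untargeted", "mnist")}
```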
Explaining the remaining predictive gap of the ABS model
While the Gaussian KDE model indeed has very low MNIST test accuracy, the failure of the ABS model on the MNIST test data relative to all of the discriminative models cannot be explained by its accuracy. We hypothesized that the multi-label readout we employed interacted unfavorably with the class-conditional structure of the two generative models: since these models estimate the density of each class independently, mapping each class-density estimate directly into a class-presence rating prevented any interaction between the different detectors (i.e., the detection of a 7 did not reduce the response of the ’1’ output). In the ABS model’s original formulation, a modified softmax integrated the different class densities. Applying this readout instead of the multi-label readout decreased the model’s error on the MNIST test images (MSE = 0.0218 instead of MSE = 0.0287 with the sigmoid readout). An even better MSE was obtained by recalibrating the modified softmax hyperparameters (Fig. 6B, MSE = 0.0135 instead of MSE = 0.0280 with the recalibrated sigmoid readout). And yet, the PCN model and the Capsule Network had significantly better MSE than the ABS model even after such recalibration, probably reflecting the accuracy gap between the ABS model and these recurrent DNN models (Table 1).
In this study, we synthesized stimuli to maximally differentiate the predictions of nine different candidate models about human classification of hand-written digits. We then tested the models’ predictions by presenting these controversial stimuli to human observers. We found that a deep class-conditional generative neural network, the ABS model, explained the human responses to these stimuli significantly better than several discriminatively trained DNNs. We also found that none of the candidate models, including the ABS model, explained all of the explainable variability of the human responses.
We believe that controversial stimuli can be an important addition to the toolboxes of two groups of scientists. The first group is cognitive computational neuroscientists interested in better understanding perceptual processes such as object recognition by modeling them as artificial neural networks. Natural images will always remain a necessary benchmark. However, models often make similar predictions for natural images [e.g., 36]. Controversial stimuli guarantee that different models make considerably different predictions, and thus empower us to adjudicate among models.
The second group that might find controversial stimuli useful is computer scientists interested in comparing different DNN architectures. Controversial stimuli pinpoint where in the input space the decisions of different models disagree. This can be used for illustrating with examples the functional difference between two models, efficiently testing hypotheses about a remote black box system, or comparing models with respect to their robustness to adversarial attacks.
Adversarial examples are a special case of controversial stimuli. An ideal adversarial example would be controversial between the targeted model and ground truth. In practice, ground-truth labeling is rarely available within the adversarial-example optimization loop, so a stand-in for ground truth is used. When targeting object recognition models, a common stand-in is the assumption that the human-assigned label of an image does not change within a pixel-space ball around it. The images resulting from adversarial attacks that employ this assumption can be construed as controversial between the targeted model and a pixel-space one-nearest-neighbor classifier.
The more general perspective provided by controversial stimuli enables us to replace the ground-truth stand-in with any alternative candidate model. Contrasting two models can provide a more severe test of robustness to adversarial attack. Moreover, the image search space is not limited to balls around labeled examples. An unrestricted image search space has two advantages: (1) we might find informative controversial stimuli far away from the ’natural’ examples, and (2) we can start from arbitrary images, including random images (as we did here). Starting from random images safeguards us from concluding that a model is robust when uninformative gradients prevent us from finding effective adversarial examples. If the generation of controversial stimuli fails due to uninformative gradients, the failure is transparent: a high controversiality score will not be achieved.
Implications for DNN modeling of human recognition
Generative models may better capture human recognition
The deep ABS model beat the discriminatively trained models at predicting the human responses to the controversial stimuli. One interpretation of this finding is that, like the deep ABS model, humans have a generative model of each class. Each VAE in the ABS model learns an approximation of the likelihood p(x | y) of an image x given a digit y. Images that are far away from a category’s distribution are assigned low likelihoods and hence can be rejected as non-digits, matching the human responses to such stimuli (i.e., low probability ratings for all digits). In contrast, discriminative training does not penalize models for assigning labels with high confidence to images that are outside the training distribution. As demonstrated by the current study (Fig. 2, Fig. 3C), even a shallow class-conditional generative model (the Gaussian KDE) produces considerably more human-compatible responses to such far-removed images.
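To make the contrast concrete, here is a minimal sketch of the shallow class-conditional approach: a Gaussian kernel density estimate of one class’s likelihood, of the kind the Gaussian KDE model evaluates per digit (our own illustrative code; the bandwidth value is an arbitrary assumption):

```python
import numpy as np

def kde_log_likelihood(x, train_class, sigma=0.2):
    """Log-likelihood of image x (a flattened vector) under a
    Gaussian kernel density estimate fit to one class's training
    images. An image far from every class's distribution receives
    a low likelihood under all ten class models and can therefore
    be rejected as a non-digit."""
    T = np.asarray(train_class, dtype=float)  # (n_examples, d)
    d = T.shape[1]
    sq = np.sum((T - x) ** 2, axis=1)         # squared distances
    log_k = -sq / (2.0 * sigma ** 2)          # unnormalized log kernels
    # log-mean-exp over training examples, plus the Gaussian normalizer
    m = log_k.max()
    lme = m + np.log(np.mean(np.exp(log_k - m)))
    return lme - 0.5 * d * np.log(2.0 * np.pi * sigma ** 2)
```

A classifier built on this quantity scores each digit by its own class-conditional density, unlike a discriminative network, which is never asked how probable the image itself is.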
However, even the best available generative model we tested, the ABS model, still could not fully explain the human responses, and did not match the better discriminative models at predicting human responses to the test MNIST images [see also its unsatisfactory performance on the MNIST-C data set, 28].
The challenge for modelers is to combine the advantages of discriminative models (i.e., good discriminative performance) and generative models (i.e., good generalization performance). The purely generative class-conditional approach is insufficient, as we show here, even for MNIST. For natural images, the shortcomings of this approach are even more apparent: prior work reports a failure to achieve good test accuracy on CIFAR-10 with the ABS model, as well as a lack of robustness of a CIFAR-10 normalizing-flow-based class-conditional model. These difficulties might be related to an over-emphasis of low-level statistics over high-level semantic properties in the density functions learned by current generative modeling approaches, including VAEs, normalizing-flow models, and pixelCNNs. How to combine the strengths of discriminative and generative inference remains an important open problem for both machine learning and brain science.
Adversarial-training does not lead to human-compatible class boundaries
Adversarial training aims to imbue a model with robustness to perturbations within an ε-ball in pixel space by introducing such adversarial perturbations into the training data as the model is being trained [12, 24]. If we define robustness as invariance to ℓp-norm-bounded perturbations in pixel space [e.g., 16], adversarial training might indeed give us robust models. However, here the adversarially trained models failed to predict human responses to controversial stimuli. If we define model robustness as the absence of model decisions that are incompatible with human judgments, then these models are clearly not robust.
Adversarial training is limited in two respects. First, it does not enable the model to recognize images very far from the training distribution as non-digits. Second, the 2D or 3D intuition of achieving well-formed decision boundaries by surrounding each training example with an ε-ball breaks down in high-dimensional space. For human perception in particular, it is well known that beyond a very small ball of necessarily imperceptible perturbations, pixel-space ℓp norm is a poor measure of perceptual distance. For example, it is easy to devise two perturbations of similar ℓp norms, where one causes the image to cross a human category boundary while the other is invisible. Therefore, forming pixel-space balls of invariance around the training examples cannot capture the category regions that the human visual system employs.
Controversial stimuli: current limitations and future directions
Testing populations of DNN instances instead of single instances
Like most work using pre-trained models [20, 42], this study operationalized each model as a single trained DNN instance. In this setting, a model predicts a single response pattern, which should be as similar as possible to the average human response. To the extent that the training of a model yields instances that make idiosyncratic predictions, the variability across instances will reduce the model’s performance at predicting the human responses. An alternative approach to evaluating models, however, considers each DNN instance as an equivalent of an individual human brain. In this setting, idiosyncratic predictions do not necessarily count against a model. Instead, the distribution of model instances should match the distribution of individual humans. After all, humans, too, might have idiosyncratic decision boundaries.
To compare the distribution of model instances to the distribution of individual humans, we would need a sufficiently large sample of instances, obtained by repeating DNN training with different random weight initializations or training data. As a first approximation, we can compare the means across instances and humans. Given a sample of instances of model A and a sample of instances of model B, controversial stimuli would be synthesized so that each stimulus is classified in one way by all of the instances of model A and in another, incompatible way by all of the instances of model B. The generality of the controversial stimuli could be further validated by testing them on held-out instances of these two architectures. Only incorrect predictions that prove to generalize across instances would then be counted as evidence against a model. While this approach cannot test for the existence of idiosyncratic decision boundaries in human observers, it does not penalize models whose instances have that property. A major technical hurdle for such multi-instance controversial stimulus synthesis is the high computational cost of the best-performing model in the current study, the ABS model, which relies on iterative optimization during inference.
An additional advantage of using multiple instances per model is that it allows more informed estimates of the stability of the experiment’s results, taking into account the random variability introduced by weight initialization. This point is not specific to experiments using optimized stimuli; it is relevant to any study comparing trained DNNs. Random variability related to weight initialization can be a concern when comparing similarly performing models. However, experiments with alternative instances of some of the models (data not shown) suggest that our inferential results would not qualitatively change if the entire procedure were repeated with retrained model instances.
Scaling up to many classes and many models
Synthesizing controversial stimuli for every pair of classes and every pair of models is difficult to scale up to problems with a large number of classes or a large number of models. For ImageNet, with its 1000 classes, there are almost half a million class pairs, and hence candidate stimuli, for each pair of models. Distinguishing the models does not require exhaustive sampling of all class pairs for each model pair. However, it is desirable to have a variety of controversial stimuli for each pair of models, and for this variety to cover a diversity of class pairs. Such diversity can be achieved either by randomly sampling class pairs or by more advanced multi-objective optimization heuristics [e.g., the MAP-Elites algorithm, 27, 30].
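The "almost half a million" figure follows directly from counting unordered class pairs:

```python
from math import comb

# Unordered class pairs among ImageNet's 1000 classes; with one controversial
# stimulus per class pair, this is the stimulus count for each pair of models.
n_class_pairs = comb(1000, 2)
print(n_class_pairs)  # 499500
```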
From an information-theoretic perspective, the set of controversial stimuli should be designed so that the human responses maximally reduce our uncertainty (i.e., the entropy of our belief distribution) about the models. This objective can be used to synthesize controversial stimuli adaptively, as sequentially collected human responses come in. Such a process would zoom in on the most promising models, ignoring models that have been effectively eliminated by previous trials. However, this approach faces both technical and theoretical challenges. Technically, it requires joint back-propagation through all models. Theoretically, it requires an estimate of the model likelihood, i.e., the probability of the human responses given the model and the stimulus set. This estimate should reflect the fact that repeated presentation of the same stimulus is less informative than a diverse stimulus set.
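The entropy-reduction objective described above can be sketched for a single candidate stimulus. This is an illustrative calculation only, not the paper's procedure: the response-likelihood dictionaries passed in are hypothetical inputs standing in for each model's predicted distribution over (discretized) human responses to the stimulus.

```python
import math

def expected_information_gain(prior, likelihoods):
    """Expected reduction in entropy of our belief over candidate models
    from observing one human response to a candidate stimulus.
    prior: dict model -> prior probability.
    likelihoods: dict model -> (dict response -> probability), the response
    distribution each model predicts for this stimulus (hypothetical inputs)."""
    def entropy(p):
        return -sum(v * math.log(v) for v in p.values() if v > 0)

    responses = set(r for dist in likelihoods.values() for r in dist)
    eig = entropy(prior)  # start from current uncertainty about the models
    for r in responses:
        p_r = sum(prior[m] * likelihoods[m].get(r, 0.0) for m in prior)
        if p_r == 0:
            continue
        # Bayesian update of the model posterior given response r
        posterior = {m: prior[m] * likelihoods[m].get(r, 0.0) / p_r
                     for m in prior}
        eig -= p_r * entropy(posterior)  # subtract expected residual entropy
    return eig
```

A stimulus on which the models agree yields zero expected gain, while a maximally controversial stimulus (models predicting disjoint responses) yields the full prior entropy; an adaptive procedure would prefer the latter.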
Materials and Methods
Details on training/adaptation of candidate models appear in the Supplementary Materials and Methods.
Controversial Stimuli synthesis
Each controversial stimulus was initialized as a randomly seeded uniform-noise image. To efficiently optimize the controversiality score (Eq. 2), we ascended the gradient of a more numerically favorable version of this quantity:
where LSE⁻ denotes an inverted LogSumExp (serving as a smooth minimum), γ is a hyperparameter that controls the LogSumExp smoothness (initially set to 1), and l(y | x) is the calibrated logit for class y (the input to the sigmoid readout).
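The inverted-LogSumExp smooth minimum can be sketched as follows. This is a minimal illustration rather than the study's code: the dict-of-logits interface is hypothetical, and feeding the smooth minimum only the two target-class logits is a simplifying assumption about the form of Eq. 5.

```python
import math

def smooth_min(values, gamma):
    """Inverted LogSumExp: a differentiable soft minimum.
    As gamma grows, smooth_min(values, gamma) approaches min(values)
    from below, giving gradient ascent a well-behaved objective."""
    return -(1.0 / gamma) * math.log(sum(math.exp(-gamma * v) for v in values))

def controversiality_objective(logits_A, logits_B, y_a, y_b, gamma=1.0):
    """Smooth stand-in for the controversiality score of one image:
    high only when model A's calibrated logit for class y_a AND model B's
    calibrated logit for class y_b are both high. logits_A and logits_B
    are hypothetical dicts mapping class -> calibrated logit."""
    return smooth_min([logits_A[y_a], logits_B[y_b]], gamma)
```

Ascending this objective drives the image toward a region where model A asserts y_a while model B asserts y_b; raising γ during optimization tightens the approximation to the hard minimum.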
While for most models one can derive an analytical gradient of Eq. 5, this is not possible for the ABS model, since its inference is based on a latent-space optimization. Hence, following the approach used by the ABS model's authors to form adversarial examples, we used numerical differentiation for all models. In each optimization iteration, we used the symmetric finite-difference formula to estimate the gradient of Eq. 5 with respect to the image. An indirect benefit of this approach is that the finite-difference step size Δ can be set to be large, trading gradient precision for better handling of rough cost landscapes. For each image, we began optimizing with Δ = 1 (clipping the probe images to stay within the grayscale intensity range). Once the optimization converged to a local maximum, we halved Δ and continued optimizing. We kept halving Δ upon convergence until a final convergence at the smallest step size. We then increased the LSE hyperparameter γ to 10 and reset Δ to equal 1 again, repeating the procedure (but without resetting the optimized image). A third and final optimization epoch used a yet higher γ.
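A symmetric finite-difference gradient estimate with boundary clipping can be sketched as below. This is an illustrative re-implementation, not the original code; images are represented as flat lists of pixel intensities in [0, 1], and f stands for the scalar objective of Eq. 5.

```python
def finite_difference_grad(f, x, delta):
    """Estimate the gradient of the scalar objective f at image x
    (a flat list of pixel intensities in [0, 1]) with the symmetric
    (central) difference formula, one coordinate at a time."""
    grad = []
    for i in range(len(x)):
        x_plus = list(x)
        x_minus = list(x)
        # clip the probe points to stay within the grayscale intensity range
        x_plus[i] = min(1.0, x[i] + delta)
        x_minus[i] = max(0.0, x[i] - delta)
        # divide by the actual (possibly clipped) probe separation
        grad.append((f(x_plus) - f(x_minus)) / (x_plus[i] - x_minus[i]))
    return grad
```

A large delta averages the objective's slope over a wide neighborhood, which is what makes this estimator tolerant of rough cost landscapes; halving delta upon convergence progressively sharpens the estimate.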
In each optimization iteration, once a gradient estimate was obtained, we used a line search for the most effective step size: we evaluated the effect of the maximal step in the direction of the gradient that did not cause intensity clipping, as well as smaller fractions of this step size. When the resulting image had a controversiality score (Eq. 2) of less than 0.85, we repeated the optimization procedure with a different initial random image, up to three attempts.
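The step-size line search can be sketched as follows, again over flat pixel lists in [0, 1]. The particular set of fractions tried is an assumption for illustration, as is the rest of the interface.

```python
def line_search_step(f, x, grad, fractions=(1.0, 0.5)):
    """Take the best-scoring step along grad among a few candidate sizes:
    the largest step that keeps every pixel within [0, 1], plus smaller
    fractions of it (the fractions here are illustrative)."""
    # largest alpha such that every pixel of x + alpha * grad stays in [0, 1]
    alpha_max = float('inf')
    for xi, gi in zip(x, grad):
        if gi > 0:
            alpha_max = min(alpha_max, (1.0 - xi) / gi)
        elif gi < 0:
            alpha_max = min(alpha_max, (0.0 - xi) / gi)
    if alpha_max == float('inf'):  # zero gradient: nowhere to go
        return x
    best = x
    for frac in fractions:
        cand = [xi + frac * alpha_max * gi for xi, gi in zip(x, grad)]
        if f(cand) > f(best):
            best = cand
    return best
```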
For analytically differentiable models, we found that this more involved (and computationally intensive) approach to image optimization converged to poor local maxima less often than standard gradient ascent using symbolic differentiation.
For each model pair, we selected 20 controversial stimuli for human testing (out of the up to 90 we produced). Using integer programming (IBM DOcplex), we searched for the set of 20 images with the highest total controversiality score, under the constraint that each digit is targeted exactly twice per model.
Human testing
30 participants (17 women, mean age = 29.3) were recruited through prolific.co. All participants provided informed consent at the beginning of the study, and all procedures were approved by the Columbia Morningside ethics board. We monitored the performance of the human subjects through three measures: their accuracy on the 100 MNIST images, their reaction times, and their within-subject response reliability on 108 controversial images (3 per model pair) that were displayed again at the end of the experiment. While the participants’ performance on these measures varied, we found no basis for rejecting any participant’s data due to evident low effort or negligence.
Differences between models in their human-response prediction error were tested by bootstrap-based hypothesis testing. For each bootstrap sample (100,000 resamples), subjects and stimuli were both randomly resampled with replacement. Stimulus resampling was stratified by condition (37 conditions: controversial stimuli targeting each of the 36 model pairs, plus the MNIST test images). For each pair of models, this bootstrapping procedure yielded an empirical sampling distribution of the difference between the models’ MSEs. The proportions of bootstrapped MSE differences below (or above) zero served as left-tail (or right-tail) p-values. These p-values were Bonferroni-corrected for multiple pairwise comparisons and for two-tailed testing.
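A simplified version of this bootstrap can be sketched as follows. For brevity, this sketch resamples stimuli only; the study additionally resampled subjects and stratified stimulus resampling by condition, and the function names and inputs are hypothetical.

```python
import random

def bootstrap_mse_difference(errors_A, errors_B, n_boot=2000, seed=0):
    """Bootstrap two-tailed p-value for the difference between two models'
    mean squared prediction errors, paired by stimulus.
    errors_A[i], errors_B[i]: squared error of each model on stimulus i."""
    rng = random.Random(seed)
    n = len(errors_A)
    diffs = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample with replacement
        mse_A = sum(errors_A[i] for i in idx) / n
        mse_B = sum(errors_B[i] for i in idx) / n
        diffs.append(mse_A - mse_B)
    left = sum(d < 0 for d in diffs) / n_boot   # evidence that A beats B
    right = sum(d > 0 for d in diffs) / n_boot  # evidence that B beats A
    return 2 * min(left, right)  # two-tailed p, before Bonferroni correction
```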
Data and code availability
Python optimization source code, synthesized images and detailed behavioral testing results will be available on github.com/kriegeskorte-lab.
Acknowledgements
TG acknowledges ELSC brain sciences postdoctoral fellowships for training abroad, and NVIDIA for the donation of a Titan Xp GPU used for this research. Stimulus synthesis was conducted on the Zuckerman Institute Research Computing 'Axon' GPU cluster. The authors wish to thank Máté Lengyel for a helpful discussion and Raphael Gerraty, Heiko Schutt, Ruben van Bergen, and Benjamin Peters for their comments on the manuscript.
References
- (2018) Obfuscated Gradients Give a False Sense of Security: Circumventing Defenses to Adversarial Examples. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Vol. 80, pp. 274–283.
- (2018) Deep convolutional networks do not classify based on global object shape. PLOS Computational Biology 14 (12), pp. 1–43.
- (2017) First Steps Toward Camera Model Identification With Convolutional Neural Networks. IEEE Signal Processing Letters 24 (3), pp. 259–263.
- (2017) EMNIST: an extension of MNIST to handwritten letters. CoRR abs/1702.05373.
- (2017) A Study and Comparison of Human and Deep Learning Recognition Performance Under Visual Distortions. CoRR abs/1705.02498.
- (2018) Adversarial Examples that Fool both Computer Vision and Time-Limited Humans. In Advances in Neural Information Processing Systems 31, pp. 3910–3920.
- (2009) Visualizing higher-layer features of a deep network. University of Montreal 1341 (3), pp. 1.
- (2019) Conditional Generative Models are not Robust. CoRR abs/1906.01171.
- (2018) DARCCC: Detecting Adversaries by Reconstruction from Class Conditional Capsules. CoRR abs/1811.06969.
- (2019) ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations.
- (2018) Generalisation in humans and deep neural networks. In Advances in Neural Information Processing Systems 31, pp. 7538–7550.
- (2015) Explaining and Harnessing Adversarial Examples. In 3rd International Conference on Learning Representations (ICLR 2015).
- (2017) On Calibration of Modern Neural Networks. In Proceedings of the 34th International Conference on Machine Learning, pp. 1321–1330.
- (2018) Benchmarking Neural Network Robustness to Common Corruptions and Perturbations. CoRR abs/1807.01697.
- (2011) Bayesian Active Learning for Classification and Preference Learning. arXiv:1112.5745.
- (2019) Adversarial Examples Are Not Bugs, They Are Features. arXiv:1905.02175.
- (2019) Excessive Invariance Causes Adversarial Vulnerability. In International Conference on Learning Representations.
- (2017) Measuring the tendency of CNNs to learn surface statistical regularities. arXiv:1711.11561.
- (2019) Deep Neural Networks in Computational Neuroscience. Oxford University Press.
- (2015) Deep Neural Networks: A New Framework for Modeling Biological Vision and Brain Information Processing. Annual Review of Vision Science 1 (1), pp. 417–446.
- (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324.
- (1956) On a Measure of the Information Provided by an Experiment. The Annals of Mathematical Statistics 27 (4), pp. 986–1005.
- (2017) Delving into Transferable Adversarial Examples and Black-box Attacks. In 5th International Conference on Learning Representations (ICLR 2017).
- (2018) Towards Deep Learning Models Resistant to Adversarial Attacks. In International Conference on Learning Representations.
- (2018) Background Class Defense Against Adversarial Examples. In 2018 IEEE Security and Privacy Workshops (SPW), pp. 96–102.
- (2006) Multilabel Neural Networks with Applications to Functional Genomics and Text Categorization. IEEE Transactions on Knowledge and Data Engineering 18 (10), pp. 1338–1351.
- (2015) Illuminating search spaces by mapping elites. CoRR abs/1504.04909.
- (2019) MNIST-C: A Robustness Benchmark for Computer Vision. CoRR abs/1906.02337.
- (2018) Do Deep Generative Models Know What They Don’t Know? arXiv:1810.09136.
- (2015) Deep Neural Networks Are Easily Fooled: High Confidence Predictions for Unrecognizable Images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
- (2011) Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
- (1999) Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, pp. 61–74.
- (2019) Detecting and Diagnosing Adversarial Images with Class-Conditional Capsule Reconstructions. CoRR abs/1907.02957.
- (2017) Dynamic Routing Between Capsules. In Advances in Neural Information Processing Systems 30, pp. 3856–3866.
- (2019) Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations.
- (2018) Brain-Score: Which Artificial Neural Network for Object Recognition is most Brain-Like? bioRxiv.
- (2014) Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556.
- (2013) Intriguing properties of neural networks. arXiv:1312.6199.
- (2019) Robustness May Be at Odds with Accuracy. In 7th International Conference on Learning Representations (ICLR 2019).
- (2008) Maximum differentiation (MAD) competition: A methodology for comparing computational models of perceptual quantities. Journal of Vision 8 (12), pp. 8–8.
- (2018) Deep Predictive Coding Network for Object Recognition. In Proceedings of the 35th International Conference on Machine Learning, PMLR, Vol. 80, pp. 5266–5275.
- (2016) Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience 19, pp. 356.
- (2004) Image quality assessment: from error visibility to structural similarity. IEEE Transactions on Image Processing 13 (4), pp. 600–612.
- (2019) Humans can decipher adversarial images. Nature Communications 10 (1), pp. 1334.
Supplementary Materials and Methods
Small VGG
Starting from the VGG-16 architecture [37, Table 1, architecture D], we downsized its input to the 28×28 MNIST format, removed the deepest three convolutional layers, and replaced the three fully-connected layers with a single 512-unit fully-connected layer feeding a ten-sigmoid readout layer. All weights were initialized with the Glorot uniform initializer, as implemented in Keras. Batch normalization was applied between the convolution and ReLU operations in all layers. The model was trained with Adagrad (decay = 0) for 20 epochs using a mini-batch size of 128. The epoch with the best validation performance (evaluated on 5,000 held-out MNIST training examples) was used.
Reconstruction-based readout of the Capsule Network
In the training procedure of the original Capsule Network, the informativeness of the class-specific activation vectors ('DigitCaps') is promoted by minimizing the reconstruction error of a decoder that reads out the activation vector of each example's correct class. References [9, 33] suggested using the reconstruction error during inference, flagging examples with high reconstruction error (conditioned on their inferred class) as potentially adversarial. While rejecting suspicious images and avoiding their classification is a legitimate engineering solution, for a vision model we require that class-conditional probabilities always be available. Hence, instead of using the reconstruction error as a rejection criterion, we used it as a classification signal. Reading out the decoder's output of the official pre-trained Capsule Network, the 10 mean squared reconstruction errors (conditional on each class) were fed into 10 sigmoids, whose responses were calibrated as described in the results section. To eliminate a bias of this error measure towards blank images, we normalized the reconstruction error of each class by dividing it by the mean squared difference between the input image and the average image of all MNIST training examples (averaged across classes).
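The normalization-and-readout step can be sketched as follows. This is an illustrative re-implementation under stated assumptions: images are flat pixel lists, the per-class reconstruction errors are given as inputs, and the function name is hypothetical.

```python
def reconstruction_logits(recon_errors, input_img, mean_train_img):
    """Turn per-class reconstruction errors into classification evidence.
    recon_errors[c]: mean squared error of the reconstruction conditioned
    on class c. Each error is divided by the MSE between the input and the
    average MNIST training image, so blank images (which reconstruct poorly
    in absolute terms but are also far from the average digit) are not
    spuriously favored. The negated, normalized errors then serve as the
    penultimate activations feeding the sigmoid readout."""
    baseline = sum((p - q) ** 2 for p, q in zip(input_img, mean_train_img))
    baseline /= len(input_img)
    return [-(e / baseline) for e in recon_errors]
```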
For each class (digit) y, we formed a Gaussian KDE model, p(x | y) = (1 / N_y) Σ_i N(x; x_i^y, σ_y² I), where σ_y is a class-specific bandwidth hyperparameter, N(x; μ, σ² I) is a multivariate Gaussian likelihood with isotropic covariance, and x_1^y, …, x_{N_y}^y are all MNIST training examples labeled as class y. σ_y was chosen independently for each class from a logarithmically spaced range (100 steps) to maximize the likelihood of 500 held-out training examples. The ten resulting log-likelihoods were fed as penultimate activations to a sigmoid readout layer, calibrated as described in the results section.
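The per-class KDE log-likelihood can be sketched as below. This is a minimal illustration of the mixture-of-isotropic-Gaussians density, not the study's implementation; images are flat pixel lists and the function name is hypothetical.

```python
import math

def kde_class_loglik(x, class_examples, sigma):
    """Log-likelihood of image x (a flat pixel list) under a Gaussian KDE:
    an average of isotropic Gaussians centered on the class's training
    examples with bandwidth sigma, computed via log-sum-exp for stability."""
    d = len(x)
    # unnormalized log kernel values, one per training example
    logs = []
    for ex in class_examples:
        sq = sum((a - b) ** 2 for a, b in zip(x, ex))
        logs.append(-sq / (2 * sigma ** 2))
    m = max(logs)
    log_kernel_sum = m + math.log(sum(math.exp(v - m) for v in logs))
    # Gaussian normalizer and the 1/N_y mixture weight
    log_norm = -0.5 * d * math.log(2 * math.pi * sigma ** 2)
    return log_norm + log_kernel_sum - math.log(len(class_examples))
```

Bandwidth selection then amounts to evaluating this log-likelihood on held-out examples across a grid of sigma values and keeping the best per class.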