# A Bayes-Optimal View on Adversarial Examples

## Abstract

The ability to fool modern CNN classifiers with tiny perturbations of the input has led to the development of a large number of candidate defenses and often conflicting explanations. In this paper, we argue for examining adversarial examples from the perspective of Bayes-Optimal classification. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and derive analytic conditions on the distributions so that the optimal classifier is either robust or vulnerable. By training different classifiers on these datasets, for which the “gold standard” optimal classifiers are known, we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier. At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier. The same trend is observed in experiments with real images.

Figure 1: (a) Original “male” image. (b)-(e) Adversarial “female” images: (b) asymmetric dataset, Bayes-Optimal classifier; (c) symmetric dataset, Bayes-Optimal classifier; (d) symmetric dataset, CNN; (e) symmetric dataset, RBF SVM.

## 1 Introduction

Perhaps the most intriguing property of modern machine learning methods is their susceptibility to adversarial examples (Szegedy et al., 2014): for many powerful classifiers it is possible to perturb the input by an imperceptible amount and change the decision of the classifier. While adversarial examples were most famously reported for CNN classifiers (Szegedy et al., 2014), subsequent research has shown that other classifiers can also fall prey to similar attacks (Goodfellow et al., 2018). Attempts to make classifiers robust to these attacks have generated a tremendous amount of interest (e.g. Schott et al., 2019, and references therein).

As a first step towards solving the problem, many authors have attempted to understand the source of the failure (Goodfellow et al., 2018; Szegedy et al., 2014; Tanay and Griffin, 2016; Fawzi et al., 2018). Broadly speaking, existing explanations fall into two groups (see section 4 for a more detailed discussion of related work). One approach argues that adversarial vulnerability is in some sense inevitable: either due to the geometry of high dimensions (e.g. Goodfellow et al., 2018, 2015; Shamir et al., 2019) or due to fundamental limitations on the robustness of classifiers trained from finite data (Schmidt et al., 2018). Another approach views adversarial vulnerability as a “bug” of CNNs and current training methods, which can be avoided by other architectures or training protocols (e.g. Tanay and Griffin, 2016; Nakkiran, 2019; Lyu and Li, 2020; Schott et al., 2019).

One reason for the existence of many conflicting explanations may be the difficulty of analyzing adversarial vulnerability in a realistic yet tractable setting. In this paper, we provide such a setting. We construct realistic image datasets for which the Bayes-Optimal classifier can be efficiently computed and analyze the vulnerability of the optimal classifier. We derive analytic conditions on the distributions where even the optimal classifier will be vulnerable and other conditions where the optimal classifier will be provably robust. Figure 1a-c shows an example: synthetic face images from the class ”male” or ”female”. In the first, ”asymmetric” distribution, the Bayes-Optimal classifier is vulnerable and an imperceptible perturbation is sufficient to change a ”male” face to one that would be classified as ”female” (figure 1b). In the second, ”symmetric” dataset, fooling the optimal classifier requires making large, perceptually meaningful changes (figure 1c).

By training different classifiers on these datasets, we can disentangle the possible sources of vulnerability and avoid the accuracy-robustness tradeoff that may occur in commonly used datasets. Our results show that even when the optimal classifier is robust, standard CNN training consistently learns a vulnerable classifier (e.g. figure 1d). At the same time, for exactly the same training data, RBF SVMs consistently learn a robust classifier (figure 1e). Our results suggest that in many realistic settings, adversarial vulnerability is not an unavoidable property of learning in high dimensions but rather a direct result of suboptimal training methods used in current practice.

## 2 A Realistic Tractable Setting for Analysing Adversarial Vulnerability

We focus on a two-class classification problem and denote by $p_1(x)$, $p_2(x)$ the distributions of the two classes. We assume that $p_1, p_2$ are known and that the two classes have equal priors, so the Bayes-Optimal classifier simply classifies $x$ as belonging to class 1 if $p_1(x) > p_2(x)$ and to class 2 otherwise. It is well known that this classification rule is optimal and no other classification rule can achieve higher accuracy (assuming of course that $p_1, p_2$ are correct) (Duda et al., 1973). Several recent papers (Schmidt et al., 2018; Ilyas et al., 2019) have analyzed adversarial examples in this setting, but only when $p_1, p_2$ are both Gaussians with the same covariance, so that the optimal classifier is linear. Real image classification problems are of course very different from this simplified setting.
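The equal-prior decision rule can be illustrated with a minimal sketch. The two one-dimensional Gaussian densities, their parameters, and the helper names `gaussian_logpdf` and `bayes_optimal_class` are illustrative assumptions, not part of the paper's setup.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def bayes_optimal_class(x, logp1, logp2):
    """With equal priors, pick class 1 if p1(x) > p2(x), else class 2."""
    return 1 if logp1(x) > logp2(x) else 2

# Hypothetical class densities, assumed to be known exactly.
logp1 = lambda x: gaussian_logpdf(x, mean=-1.0, var=1.0)
logp2 = lambda x: gaussian_logpdf(x, mean=+1.0, var=1.0)

labels = [bayes_optimal_class(x, logp1, logp2) for x in (-2.0, -0.1, 0.3, 2.5)]
```

Comparing log densities rather than densities avoids numerical underflow, which matters once the same rule is applied to high-dimensional images.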

In our approach, we assume that images that belong to a single class form a nonlinear, low-dimensional manifold in pixel space. A standard way to model such manifolds is to use a Mixture of Factor Analyzers (MFA) model (Ghahramani and Hinton, 1996). The model is based on the observation that locally the manifold can be represented by a Gaussian whose covariance matrix is a sum of a low-rank matrix (representing the covariance on the manifold) and a diagonal matrix (representing the covariance off the manifold, e.g. due to sensor noise). Figure 2 (left) shows an example of a nonlinear 2D manifold in three dimensions, and its representation using an MFA model.

To create synthetic image datasets for classification, we start with a labeled training set with two classes. We then train separate MFA models $p_1(x)$, $p_2(x)$ on images from the two classes using the algorithm (and code) from Richardson and Weiss (2018). We now create a new training set by sampling images from the two models, and similarly a new test set. Since these datasets were created by sampling from a known model, we can calculate the Bayes-Optimal classifier. At the same time, the images are realistic and MFA models have been shown to capture much of the variability of the images in the original data (Richardson and Weiss, 2018).
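The sampling step can be sketched as follows. This is a toy illustration and not the Richardson and Weiss (2018) code: the function names, the three-dimensional components, and all parameter values are assumptions chosen for readability.

```python
import random

def sample_factor_analyzer(mu, A, noise_std, rng):
    """Draw x = mu + A z + noise, with z ~ N(0, I) and per-coordinate Gaussian noise."""
    k = len(A[0])                      # latent (on-manifold) dimension
    z = [rng.gauss(0.0, 1.0) for _ in range(k)]
    return [
        mu[i] + sum(A[i][j] * z[j] for j in range(k)) + rng.gauss(0.0, noise_std[i])
        for i in range(len(mu))
    ]

def sample_mfa(components, weights, n, rng):
    """Sample n points from a mixture of factor analyzers."""
    samples = []
    for _ in range(n):
        # Pick a component according to the mixing weights, then sample from it.
        mu, A, noise_std = rng.choices(components, weights=weights, k=1)[0]
        samples.append(sample_factor_analyzer(mu, A, noise_std, rng))
    return samples

rng = random.Random(0)
# Two hypothetical components in 3D, each with a single latent factor.
class1 = [([0.0, 0.0, 0.0], [[1.0], [0.5], [0.0]], [0.1, 0.1, 0.1]),
          ([3.0, 0.0, 1.0], [[0.0], [1.0], [0.5]], [0.1, 0.1, 0.1])]
train_class1 = sample_mfa(class1, weights=[0.5, 0.5], n=100, rng=rng)
```

Because the sampler's parameters are known, the class densities are available in closed form, which is what makes the Bayes-Optimal classifier computable for the sampled data.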

We created 12 such datasets of faces (based on the CelebA dataset (Liu et al., 2015)) and 3 datasets of digits (based on MNIST).

### 2.1 Provably robust or vulnerable optimal classifiers

A textbook example of Bayes-Optimal classification is when both classes are generated using Gaussians with the same spherical covariance, in which case the optimal classifier is a linear discriminant that is orthogonal to the difference between the two means (figure 3a). In this case, if the distance between the two means is large relative to the covariance, then almost all points are far from the decision boundary and so an adversarial attack which only makes small changes to the input will typically fail. But as shown in the bottom of figure 3 there are other examples where the decision boundary is close to many of the datapoints and an adversarial attack which only makes small changes to the input will often succeed. What distinguishes these two cases?

As we prove below, the distinction is based on the presence of asymmetries in the discriminative power of different features. To gain intuition, consider a classifier that attempts to recognize images of a person who has a small mole on their face. Denote by $p_1(x)$ the probability of generating an image of this person. Denote by $p_2(x)$ the probability of generating an image of a different person. Under $p_1$ all images have a mole, while under $p_2$ this probability is very low. Thus an optimal classifier will assign an extremely high weight to the presence of a mole in an image. This means that by making a tiny change in the image and erasing the mole, we can drastically change the output of the optimal classifier (and in fact drive $p_1(x)$ down to zero). Note that this is not the same as the standard definition of overfitting (i.e. a difference between performance on the training samples and test samples): even unseen images of the person that are generated from $p_1$ will have a mole present, so giving this feature high weight will not hurt generalization, but will make the classifier susceptible to adversarial attacks. We now make this intuition precise.

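We first focus on the case where both class distributions are Gaussians, i.e. $p_1, p_2$ are both Gaussian with means $\mu_1, \mu_2$ and covariances $\Sigma_1, \Sigma_2$.

Lemma 1: Asymmetric case. Let $v$ be a direction of minimal variance under $p_1$: $v = \arg\min_{\|u\|=1} u^\top \Sigma_1 u$. Let $\sigma_1^2$ be the variance of the projection of $x$ onto that direction when $x$ comes from class 1. If $\sigma_1 \to 0$ and $\Sigma_2$ is full rank, then almost any point in class 1 is arbitrarily close to the optimal decision surface.

Proof: We denote by $\sigma_2^2$ the variance of the data in direction $v$ under the distribution of the second class, $p_2$. Note that by the assumption that $\Sigma_2$ is full rank, this variance must be nonzero. We write the vector $x$ as $x = (x_1, x_2)$, where $x_1$ is the projection in direction $v$ and $x_2$ is a vector of projections in directions orthogonal to $v$. We denote by $m_1$ the projection of $\mu_1$ in direction $v$. The decision surface as a function of $x_1$ is a solution to:

$$\frac{(x_1 - m_1)^2}{\sigma_1^2} + \log\left(2\pi\sigma_1^2\right) = 2\log p_1(x_2) - 2\log p_2(x_2) - 2\log p_2(x_1 \mid x_2) \tag{1}$$

Now since $v$ is a direction of minimal variance, it must be an eigenvector of $\Sigma_1$, so that $x_1$ and $x_2$ are independent under $p_1$ and we can write $p_1(x) = p_1(x_1)\,p_1(x_2)$. Using the standard equation for conditional Gaussians, $p_2(x_1 \mid x_2)$ will also be a Gaussian with the following mean and covariance:

$$\mu_{1|2} = \mu_2^{(1)} + \Sigma_{12}\Sigma_{22}^{-1}\left(x_2 - \mu_2^{(2)}\right) \tag{2}$$

$$\sigma_{1|2}^2 = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \tag{3}$$

where $\Sigma_{11}, \Sigma_{12}, \Sigma_{21}, \Sigma_{22}$ are the appropriate submatrices of the covariance matrix $\Sigma_2$, and $\mu_2^{(1)}, \mu_2^{(2)}$ are the corresponding parts of $\mu_2$. Neither $\mu_{1|2}$ nor $\sigma_{1|2}^2$ depends on $\sigma_1$. This means that the right-hand side of equation 1 depends only on $x$ and not on $\sigma_1$ or $m_1$.

As $\sigma_1$ approaches zero, the solutions of this equation approach $x_1 = m_1$. And because $\sigma_1 \to 0$, almost all samples from the first Gaussian will be close to $m_1$ in direction $v$, so moving $x$ by a tiny amount in direction $v$ will change the optimal decision.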
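A small numerical sketch can make the asymmetric case concrete. It covers only a simplified special case, chosen here for illustration: both classes are zero-mean along the minimal-variance direction, with standard deviations `sigma1` (class 1) and `sigma2` (class 2), and they are identically distributed in all other directions, so the decision depends on that single coordinate. The function name and the chosen values are assumptions.

```python
import math

def boundary_half_width(sigma1, sigma2=1.0):
    """Half-width t of the interval |x1| < t in which class 1 is chosen.

    Setting the two 1D Gaussian densities N(0, sigma1^2) and N(0, sigma2^2)
    equal at x1 = t (with sigma1 < sigma2) gives
    t^2 = 2 * log(sigma2 / sigma1) / (1 / sigma1^2 - 1 / sigma2^2).
    """
    return math.sqrt(2.0 * math.log(sigma2 / sigma1)
                     / (1.0 / sigma1 ** 2 - 1.0 / sigma2 ** 2))

# Every point classified as class 1 lies within t of the decision boundary
# along this direction, so t bounds the perturbation needed to flip it.
widths = {s: boundary_half_width(s) for s in (0.1, 0.01, 0.001)}
```

In this toy setting the half-width shrinks roughly in proportion to `sigma1` (up to a logarithmic factor), which matches the qualitative claim of the lemma that class 1 points become arbitrarily close to the boundary.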

Lemma 1 shows that when there are strong asymmetries in the covariances of the two classes, the optimal classifier will be provably vulnerable. The following two lemmas, on the other hand, show settings when the minimal variances in each class are similar and the optimal classifier will be provably robust.

Lemma 2: Symmetric isotropic. Assume $\Sigma_1 = \Sigma_2 = \sigma^2 I$ and $\sigma \ll \|\mu_1 - \mu_2\|$, then with high probability all points in both classes are at distance approximately $\|\mu_1 - \mu_2\|/2$ from the decision boundary.

Proof: For spherical and equal covariances, the Bayes-Optimal classifier simply projects the data onto the direction $\mu_1 - \mu_2$ and classifies a point based on whether that projection is closer to the projection of $\mu_1$ or the projection of $\mu_2$. This means that the problem reduces to a scalar problem, with two distributions whose means are at distance $\|\mu_1 - \mu_2\|$ and have scalar variance $\sigma^2$. With high probability, under the assumption, all the points are close to one of the means, so they are at distance approximately $\|\mu_1 - \mu_2\|/2$ from the decision boundary.

Lemma 3: Symmetric low-rank + diagonal. Suppose both classes have a covariance that is a sum of a low-rank matrix plus a diagonal matrix, $\Sigma_i = A_i A_i^\top + \sigma^2 I$, where $A_1, A_2$ have the same singular values and $\sigma$ is the same for both classes. Let $d$ be the minimal distance between the two linear subspaces, i.e. between the sets $\{\mu_1 + A_1 z\}$ and $\{\mu_2 + A_2 z\}$. As $\sigma \to 0$, almost all points are at distance at least $d/2$ from the Bayes-Optimal decision boundary.

Proof: It can be shown (see appendix A.1) that when $\sigma \to 0$, the optimal classifier classifies a point $x$ based on the Euclidean distance between $x$ and the linear subspace defined by $\mu_i, A_i$. Because $\sigma$ is the same, each point will be classified to the closest subspace. Now, as $\sigma \to 0$ almost any point $x$ from class 1 will have distance close to $0$ to the first linear subspace and distance of at least $d$ to the second subspace. If we perturb $x$ by a vector $\delta$ then the distance to each subspace can change by no more than $\|\delta\|$. This means that if $\|\delta\| < d/2$, the distance to the first subspace will remain smaller than the distance to the second subspace, and hence the Bayes-Optimal classifier will not be fooled by any perturbation whose norm is less than $d/2$.
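The closest-subspace argument can be checked on a toy example. The sketch below assumes the limiting regime in which the classifier picks the nearest subspace, and uses two hypothetical parallel lines in the plane at distance 2; none of these values come from the paper.

```python
import math

def dist_to_line(x, mu, a):
    """Euclidean distance from x to the affine line {mu + t * a}."""
    diff = [xi - mi for xi, mi in zip(x, mu)]
    norm_a2 = sum(ai * ai for ai in a)
    t = sum(di * ai for di, ai in zip(diff, a)) / norm_a2
    resid = [di - t * ai for di, ai in zip(diff, a)]
    return math.sqrt(sum(r * r for r in resid))

def classify(x, subspaces):
    """Return the 1-based index of the closest subspace."""
    dists = [dist_to_line(x, mu, a) for mu, a in subspaces]
    return 1 + dists.index(min(dists))

# Hypothetical setup: two parallel lines in the plane at distance d = 2.
subspaces = [((0.0, 0.0), (1.0, 0.0)), ((0.0, 2.0), (1.0, 0.0))]
d = 2.0
x = (0.7, 0.0)                      # a class-1 point lying on the first line

small = (0.0, 0.99 * d / 2)         # norm just below d/2, aimed at class 2
large = (0.0, 1.01 * d / 2)         # norm just above d/2

label_clean = classify(x, subspaces)
label_small = classify((x[0] + small[0], x[1] + small[1]), subspaces)
label_large = classify((x[0] + large[0], x[1] + large[1]), subspaces)
```

The perturbation points straight at the second line, which is the worst case; even so, the label only changes once the perturbation norm exceeds half the distance between the lines.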

Figures 3a,b,e,f illustrate the dependence of classifier robustness on the symmetry or asymmetry of the variances: when the minimal variances in both Gaussians are similar, the decision boundary is far from most points (top), but when there is a strong asymmetry in the minimal variances, the decision boundary becomes close to the datapoints in one of the classes (bottom).

Mixture Models. We now assume that the distribution in each class can be represented as a Gaussian Mixture Model and denote by $p_{i,k}(x)$ the $k$th Gaussian in class $i$ and by $\pi_{i,k}$ its prior probability. We also assume that within each class, the components are well separated, i.e. that for each datapoint the assignment probabilities put all their mass on one of the components. More formally, denote by $r_{i,k}(x)$ the assignment probability of a datapoint $x$ to component $k$ of class $i$; then we assume that for each $x$, $r_{i,k^*}(x) = 1$, where $k^*$ is the index of the component that is most likely to have generated $x$ under probability $p_i$. Under this assumption, the probability of generating a point $x$ under $p_i$ is simply $\pi_{i,k^*}\, p_{i,k^*}(x)$. We will also assume that within each distribution, the assigned component does not change when we perturb $x$ by a perturbation $\delta$ smaller than $\epsilon$: $k^*(x + \delta) = k^*(x)$ for $\|\delta\| < \epsilon$.
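The well-separatedness assumption can be illustrated numerically: when components are far apart, the exact mixture log likelihood and the log likelihood of the single most likely component nearly coincide. The one-dimensional mixture and the function names below are assumptions made for this sketch.

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log density of a 1D Gaussian N(mean, var) at x."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def exact_log_mixture(x, comps):
    """log sum_k pi_k N(x; mean_k, var_k), computed stably via log-sum-exp."""
    terms = [math.log(pi) + gaussian_logpdf(x, m, v) for pi, m, v in comps]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def hard_log_mixture(x, comps):
    """Well-separated approximation: keep only the most likely component."""
    return max(math.log(pi) + gaussian_logpdf(x, m, v) for pi, m, v in comps)

# Hypothetical 1D mixture with two far-apart components (pi, mean, variance).
comps = [(0.5, -10.0, 1.0), (0.5, 10.0, 1.0)]
x = -9.5
gap = exact_log_mixture(x, comps) - hard_log_mixture(x, comps)
```

The gap is always non-negative and is negligible here, so comparing the best component of each class gives essentially the same decision as comparing the full mixtures.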

Theorem 1: Assume that both classes are generated by well separated Gaussian mixtures with uniform priors on the components. If for every Gaussian there exists a Gaussian in the other class so that they satisfy the asymmetry conditions of Lemma 1, then almost any point will be arbitrarily close to the optimal decision boundary. On the other hand, if all Gaussians in the two classes satisfy the symmetry conditions of Lemma 2 or Lemma 3, then the optimal classifier is robust to any perturbation smaller than .

Proof: By the well-separatedness assumption, the optimal classifier simply compares the likelihood of a point under the most likely Gaussian component in each class. This means we can directly apply the appropriate Lemma to the most likely Gaussian components of the two classes.

Figures 3d,h illustrate the effect of asymmetry in a mixture model. Note that the assumption of uniform priors is only to simplify the proof. When the priors are non-uniform, the optimal decision boundary shifts by the logarithm of the ratio of the two component priors. Similarly, the vulnerability condition holds when there is a strong asymmetry in the variances, even if the minimal variance is not close to zero. Similar results can be obtained for non-Gaussian distributions (figure 3c,g and appendix A.2).

Symmetric and asymmetric datasets. To summarize our analysis, the Bayes-Optimal classifier will be provably robust when the covariances satisfy symmetry conditions and provably vulnerable when there are strong asymmetries. We therefore created symmetric and asymmetric variants of the MFA datasets. In the symmetric version, we regularized the MFA so that the “off manifold” variance is small and the same in all components and the distribution approximates the conditions of Lemma 3. In the asymmetric version, we added to each MFA model one “outlier” component with a diagonal covariance whose variance is much larger than all other covariances and a mean that is close to the global data mean. This version approximates the conditions of Lemma 1.

The advantage of using an MFA model over other generative models such as VAEs or GANs (Kingma and Welling, 2014; Gulrajani et al., 2017) is that the log likelihood of any image can be calculated efficiently, so we can classify an image by comparing its log likelihood under the two class models. Since the data were generated by the assumed distributions, this classifier is Bayes-Optimal. Indeed in all datasets we created, the accuracy of the classifier was close to 100%. We now asked: is this Bayes-Optimal classifier robust to adversarial attacks?

Evaluating the robustness. Unlike our theoretical results, deciding whether a real classifier is robust requires an operational definition of what constitutes a “tiny” or “imperceptible” perturbation. We follow the standard practice of calculating the mean perturbation norm of an adversarial attack (Schott et al., 2019; Carlini et al., 2019). We allow the adversary an unlimited budget in attacking the classifiers, and measure how large a perturbation was required to cross the decision boundary. The mean is calculated only over successful attacks, i.e. when the original sample was correctly classified and the adversarial example was not. Since this definition is sensitive to outliers and to the particular choice of the Euclidean norm, we also examined the histograms of changes made to each pixel in the adversarial attack. Finally, we visually inspected the adversarial images.
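The averaging rule described above can be sketched in a few lines. The Euclidean norm follows the text; the function name, the toy classifier, and the sample values are hypothetical and only serve to show which samples are counted.

```python
import math

def mean_l2_over_successful(originals, adversarials, true_labels, predict):
    """Mean L2 norm of (adversarial - original), counting only successful attacks.

    An attack is successful when the original sample is classified correctly
    and the adversarial sample is not.
    """
    norms = []
    for x, x_adv, y in zip(originals, adversarials, true_labels):
        if predict(x) == y and predict(x_adv) != y:
            norms.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x))))
    return sum(norms) / len(norms) if norms else float("nan")

# Toy stand-in classifier (hypothetical): decided by the sign of the first coordinate.
predict = lambda x: 1 if x[0] < 0 else 2

originals    = [(-1.0, 0.0), (-2.0, 1.0), (0.5, 0.0)]
adversarials = [(0.5, 0.0), (-1.5, 1.0), (-0.5, 0.0)]
true_labels  = [1, 1, 1]

mean_norm = mean_l2_over_successful(originals, adversarials, true_labels, predict)
```

Only the first sample contributes here: the second attack fails to change the label, and the third original is already misclassified.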

In all 15 datasets, we found that these three methods of defining robustness are consistent. For the face images, when the mean is less than , then the adversarial images are almost indistinguishable from the original images, and the vast majority of the pixels in the adversarial images are within intensity levels from their original value. On the other hand, when the mean is around , then the adversarial images are perceptually quite different from the original ones, and many pixels differ by more than from their original values. We used a simple gradient attack that takes small steps in the direction of the gradient of the MFA log likelihood (details in appendix C.2). Similar results are achieved with a standard implementation (Papernot et al., 2016) of the CW-L2 attack (Carlini and Wagner, 2017).
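As a rough illustration of a gradient attack on a likelihood-based classifier, the sketch below takes small normalized steps along the gradient of $\log p_2(x) - \log p_1(x)$ until the decision flips. It is not the attack from appendix C.2: two isotropic Gaussians stand in for the MFA models, and the step size, stopping rule, and all names are assumptions.

```python
import math

def log_ratio_and_grad(x, mu1, mu2, var):
    """f(x) = log p2(x) - log p1(x) and its gradient for isotropic Gaussians."""
    f = sum((xi - a) ** 2 - (xi - b) ** 2 for xi, a, b in zip(x, mu1, mu2)) / (2.0 * var)
    grad = [(b - a) / var for a, b in zip(mu1, mu2)]
    return f, grad

def gradient_attack(x, mu1, mu2, var, step=0.01, max_steps=10000):
    """Take small normalized gradient steps until x is classified as class 2."""
    x_adv = list(x)
    for _ in range(max_steps):
        f, grad = log_ratio_and_grad(x_adv, mu1, mu2, var)
        if f > 0:                       # p2 now exceeds p1: decision flipped
            break
        norm = math.sqrt(sum(g * g for g in grad))
        x_adv = [xi + step * g / norm for xi, g in zip(x_adv, grad)]
    return x_adv

mu1, mu2, var = (0.0, 0.0), (4.0, 0.0), 1.0   # hypothetical class parameters
x = (0.0, 0.0)                                 # a class-1 point at its mean
x_adv = gradient_attack(x, mu1, mu2, var)
perturbation_norm = math.sqrt(sum((a - b) ** 2 for a, b in zip(x_adv, x)))
```

In this symmetric toy case the attack has to travel about half the distance between the means before the label changes, so the resulting perturbation is large relative to the class spread.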

As shown in figure 4 (similar to figure 1b,c), the difference between the symmetric datasets and asymmetric datasets is dramatic (see appendix D.2 for similar results on other classes). When there exists a large asymmetry between the minimal variances of different Gaussians, the conditions of Lemma 1 hold, and a tiny imperceptible change is sufficient to fool the Bayes-Optimal classifier. However, when all Gaussians have the same minimal variance, the conditions of Lemma 3 hold and any adversary will need to make much larger changes and the adversarial examples become perceptually meaningful.

## 3 Experiments: Why are CNNs so brittle?

Figure (image grid; rows, top to bottom): Original, Optimal, CNN, Lin. SVM, RBF SVM.

Given our analysis, the fact that modern machine learning methods are often susceptible to tiny adversarial perturbations may be due to two very different reasons (figure 5). One reason could be that the data distribution is asymmetric, so that the Bayes-Optimal classifier is not robust, and hence it is not surprising that a CNN is also not robust. A second possible reason is illustrated in figure 5b: here the data distribution is symmetric and the Bayes-Optimal classifier is robust, yet SGD starting from a bad initial condition finds a brittle classifier. If this is the case, then the brittleness is not due to the data distribution but rather a failure of the learning method.

In order to separate the contribution of the dataset from the estimation method in the vulnerability of machine learning methods, we trained a CNN on samples from all 15 ”symmetric” datasets described in section 2.1, and measured the vulnerability of the learned CNN. We used the CNN implementation and the CW-L2 attack from the CleverHans library (see appendix B.3,C.1). We asked: will the CNN find a brittle classifier even though the optimal one is robust?

Results are shown in figures 6 and 7. In all 15 cases, the CNN found a high accuracy classifier that was vulnerable to small adversarial perturbations, even though the optimal classifier is robust. The difference is most dramatic in the CelebA tasks, where the CNN adversarial examples are almost indistinguishable from the original images (examples of the attacks on different datasets are shown in appendix D.3).

While there are many possible architectures and optimization methods for CNNs, we did not find any improvement in CNN robustness in our attempts to change the number of filters, layers, training iterations, etc. (Figure 8). In particular, Schmidt et al. (2018) have argued that one needs more training examples to achieve robust classification, so we systematically varied the number of training images (generated dynamically at each SGD iteration), and found no significant improvement in robustness as we increased the number of training examples up to 1 million.

Is it possible to learn a robust classifier for these datasets from finite training data? To answer that question, we then trained linear and RBF support vector machine classifiers on exactly the same datasets. The linear SVM attempts to maximize the margin while maintaining high accuracy, but since the optimal classifier is nonlinear it ends up learning a brittle classifier. More importantly, with an appropriate bandwidth parameter RBF SVMs find robust classifiers when trained on exactly the same data (when the bandwidth parameter is too large, the RBF performs similarly to a linear SVM).

The robustness of CNNs can be improved using adversarial training, in which adversarial samples (of some selected attack) are injected during training. We experimented with the state-of-the-art method of Zhang et al. (2019) and found that, depending on the hyperparameters used, it can find a robust classifier, although at the expense of lowered accuracy (figure 8 right).

Returning to figure 5, our results strongly support the hypothesis that for these cases brittleness is due to suboptimal learning methods, even when the data distributions are symmetric and the optimal classifier is robust.

Training and testing on real data. One can ask to what extent our analysis and experiments represent adversarial attacks on models trained on real data. To answer this question we trained CNNs and SVMs on five real CelebA attribute datasets. Indeed, as shown in Figure 9 and appendix D.4, the results are similar to those on the synthetic symmetric datasets: the CNN and linear SVM are vulnerable, while the RBF SVM is robust and can only be fooled when the perturbations are perceptually meaningful.

Note that unlike our proposed symmetric data, in which the “gold standard” optimal classifier is both robust and accurate, on real data, which may contain variance asymmetries, different models might reach different trade-off points between accuracy and robustness. In particular, it is known that RBF SVM accuracy is highly influenced by the hyperparameters and we performed only a minimal search to obtain these results. In future work, it would be interesting to explore the full regularization path as in (Hastie et al., 2004).

## 4 Related Work

One of the first explanations of adversarial examples in CNNs was that the decision surface learned by neural networks is “discontinuous to a significant extent”, analogous to an attempt to discriminate the rational numbers from the rest of the real numbers (Szegedy et al., 2014). However as shown by our analysis, when there are strong asymmetries in the variances of the two classes, adversarial examples can fool the optimal classifier even when the decision boundary is smooth and continuous.

A second prominent theory suggests that the problem is that neural network classifiers are “unreasonably linear” combined with the fact that they operate in high dimensions (Goodfellow et al., 2015, 2018). In high dimensions the output of a random linear classifier can be changed by making a small change in the norm of an example. Fawzi et al. (2018) show a connection between the error rate of linear and quadratic classifiers and the adversarial vulnerability and show that linear classifiers in high dimensions must be vulnerable for data that is not linearly separable. Ford et al. (2019) show that adversarial vulnerability is closely related to the lack of generalization to random perturbations and that in high dimensions even moderate failures to generalize to high amounts of noise imply the existence of adversarial examples. Shamir et al. (2019) have also focused on the geometry of high dimensions, arguing that adversarial attacks may be a “natural consequence of the geometry of $\mathbb{R}^n$ with the $L_0$ (Hamming) metric”.

Both our analysis and our experiments suggest that high dimensionality is neither necessary nor sufficient for vulnerability. When strong asymmetries exist, even two dimensional datasets can be constructed such that the optimal classifier is vulnerable. At the same time, when there are no asymmetries, the optimal classifier is robust, even in very high dimensions. The same is true for “excessively linear” classifiers: the RBF SVM is probably no less linear than a CNN, yet it is robust in our symmetric and in real datasets.

The accuracy-robustness tradeoff has also been suggested as an explanation for adversarial vulnerability (Zhang et al., 2019). Schmidt et al. (2018) present a model under which adversarially robust generalization requires more data. As our analysis shows, for symmetric datasets there is no tradeoff between accuracy and robustness and the optimal classifier in terms of accuracy is also robust. Our experiments also show that with proper regularization (i.e. an RBF SVM), one can learn adversarially robust classifiers with the same amount of data for which CNNs learn a vulnerable classifier.

Most recently, Ilyas et al. (2019) have argued that adversarial examples are a feature, not a bug, and showed that one can in fact obtain information about the true decision boundary from adversarial examples. Their analysis suggests that vulnerability results from the presence of predictive features that are not robust. They presented a synthetic dataset which was constructed to not contain such features, and showed that CNN training on that dataset was robust. Our analysis in terms of symmetric vs. asymmetric datasets is similar to theirs but more general (they only considered linear classifiers). Our experimental results, however, are quite different and suggest that in more challenging settings, CNNs consistently learn vulnerable classifiers even when the asymmetries do not exist in the data distributions.

Nakkiran (2019) presented a synthetic dataset without nonrobust features, for which CNN training nevertheless led to vulnerable classifiers when the dataset was noisy. This result is consistent with Tanay and Griffin (2016), who show that adversarial vulnerability is related to overfitting in learning algorithms that are not sufficiently regularized. Similarly, Lyu and Li (2020) show theoretically and experimentally that the details of the training procedure can significantly change the robustness of CNNs. Our experimental results also highlight the need for regularization in more realistic and challenging settings (including real data), while our analysis points out that lack of robustness may also occur with no overfitting when the data is asymmetric. Our experiments also show that regularization is not sufficient: the linear SVM also attempts to maximize the margin, but due to its limited expressive power it still ends up learning a vulnerable classifier.

Intuitively, we might expect classifiers based on generative models (e.g. ”analysis by synthesis” (Schott et al., 2019)) to be more robust to adversarial attacks, since they model all the data, not just the discriminative features. But our analysis shows that even when such a classifier is based on the true generative model, it can be arbitrarily vulnerable, when the two distributions show strong asymmetries.

## 5 Discussion

Since the discovery of adversarial examples for CNNs there has been much discussion whether they are a ”bug” that is specific to neural networks or a ”feature” of high dimensional geometry. On the one hand, our results show that even the Bayes-Optimal classifier may be susceptible to tiny adversarial perturbations, and this can happen in low dimensions and when the optimal classification function is smooth. Perhaps more significantly, our analysis has also enabled us to construct realistic datasets for which the Bayes-Optimal classifier is provably robust. We find that standard CNN training consistently fails to find a robust classifier for these datasets, while large-margin methods can succeed when training on exactly the same data. This suggests that in some situations, the presence of adversarial examples represents a failure of current, suboptimal learning methods, rather than being an unavoidable property of learning in high dimensions.

We are by no means advocating a return to using RBF SVMs. Rather, we believe that explicit regularization methods for CNNs may enable learning robust classifiers while maintaining the power of deep architectures. Recent theoretical work on gradient descent methods suggests that they implicitly reward large margin classifiers in both shallow and deep architectures (Soudry et al., 2018; Poggio et al., 2017; Lyu and Li, 2020) although convergence to a large margin classifier may require exponential time.

In general, when trying to understand a complex effect, it is often useful to disentangle the different causes. The Bayes-Optimal perspective on adversarial examples identifies two possible causes: asymmetries in the datasets and suboptimal learning. Furthermore, it allows us to create tractable and realistic datasets in which one of the two causes can be clearly implicated. We are optimistic that this approach will be of great use in developing new learning algorithms that are practical and robust.

## Appendix A Additional Proofs

### A.1 Proof of Lemma 3

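Lemma 3: Symmetric low-rank + diagonal. Suppose both classes have a covariance that is a sum of a low-rank matrix plus a diagonal matrix, $\Sigma_i = A_i A_i^\top + \sigma^2 I$, where $A_1, A_2$ have the same singular values and $\sigma$ is the same for both classes. Let $d$ be the minimal distance between the two linear subspaces, i.e. between the sets $\{\mu_1 + A_1 z\}$ and $\{\mu_2 + A_2 z\}$. As $\sigma \to 0$, almost all points are at distance at least $d/2$ from the Bayes-Optimal decision boundary.

Proof: If $x \sim N(\mu, AA^\top + \sigma^2 I)$ then the distribution of $x$ can be described by the following generative model:

$$z \sim N(0, I) \tag{4}$$

$$\eta \sim N(0, \sigma^2 I) \tag{5}$$

$$x = \mu + Az + \eta \tag{6}$$

We can also write the likelihood of $x$ as:

$$p(x) = \int p(x, z)\, dz \tag{7}$$

Since $x, z$ are jointly Gaussian, $p(x, z)$ is an unnormalized Gaussian function of $z$ and the integral is given (up to a constant factor) by the height of the unnormalized Gaussian divided by the square root of the determinant of the second derivative of $-\log p(x, z)$ with respect to $z$ (see for example MacKay (2003), p. 341). So we can write:

$$\log p(x) = -\min_z \left[ \frac{1}{2\sigma^2}\|x - \mu - Az\|^2 + \frac{1}{2}\|z\|^2 \right] - \frac{1}{2}\log\det\left(I + \frac{1}{\sigma^2}A^\top A\right) + c$$

where $c$ collects constants that depend only on $\sigma$ and the dimensions.

Since $A_1, A_2$ are assumed to have the same singular values, the log determinant term in the log likelihood is the same for both classes, so the Bayes-Optimal classifier will classify a point $x$ as belonging to class 1 if:

$$\min_z \left[ \frac{1}{2\sigma^2}\|x - \mu_1 - A_1 z\|^2 + \frac{1}{2}\|z\|^2 \right] < \min_z \left[ \frac{1}{2\sigma^2}\|x - \mu_2 - A_2 z\|^2 + \frac{1}{2}\|z\|^2 \right]$$

As $\sigma \to 0$ the Bayes-Optimal decision rule will simply be:

$$\min_z \|x - \mu_1 - A_1 z\|^2 < \min_z \|x - \mu_2 - A_2 z\|^2$$

so that a point is classified based on which subspace it is closer to.

Now, as $\sigma \to 0$ almost any point $x$ from class 1 will have distance close to $0$ to the first linear subspace and distance of at least $d$ to the second subspace. If we perturb $x$ by a vector $\delta$ then the distance to each subspace can change by no more than $\|\delta\|$. This means that if $\|\delta\| < d/2$, the distance to the first subspace will remain smaller than the distance to the second subspace, and hence the Bayes-Optimal classifier will not be fooled by any perturbation whose norm is less than $d/2$.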

### A.2 Robustness of the Optimal Classifier – Non-Gaussian Distributions

A natural question following Lemma 1 is to what extent the result depends on the Gaussian distribution. To address this, we now consider discrete distributions. We assume that every instance $x$ is described by a set of quantized features that can take on a discrete number of values. For example, the features can be wavelet coefficients of an image that are discretized into a finite number of possible values. This means that $p_1, p_2$ are simply very large tables that give the probability of observing a particular discrete set of image features given each of the classes. Of course learning such a large table is infeasible without additional assumptions, but recall that we are analyzing the Bayes-Optimal case, where we assume $p_1, p_2$ are known.

Lemma A.1: Assume there exists a feature and a quantization level so that . Assume also that for any feature vector , . Then almost any point in class is one quantization level away from the optimal decision boundary.

Proof: We again write where is the feature and are all other features.

(8) |

Now since approaches for and otherwise, for almost any point in class the value of that feature is equal to . We now change the feature by one quantization level and obtain a new feature vector and . On the other hand, by the assumption , so that this point would now be classified as belonging to class 2.

As in the Gaussian case, we do not need to be exactly equal to a delta function for the decision surface to be close to most points. It is enough that be much smaller than the minimal values of for the decision to be flipped when we replace with . Figures 3g,c illustrate this dependence. In both cases, the data is sampled from a discrete distribution where the features are simply a discretization of the two spatial coordinates into levels each. In other words, and are tables of size and each entry in the table represents the probability of generating a point at one of the possible locations. In the top example, the minimal value of the probability table is approximately the same in both classes, while in the bottom example, there is a strong asymmetry and the decision boundary becomes close to all points in one of the classes.

It is easy to see that Theorem 1, which discusses mixture distributions, applies to the discrete case as well.
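
The discrete argument can be illustrated with a toy pair of probability tables. This is a minimal sketch under our own assumptions: two features with 8 quantization levels each, equal class priors, and hand-picked table values. None of these numbers come from the experiments. Class 1 places almost all of its mass on one level of the first feature, while class 2 keeps a much larger floor everywhere else, so moving a class-1 point by a single quantization level flips the decision.

```python
import numpy as np

levels = 8                        # hypothetical number of quantization levels per feature
q = 3                             # level on which class 1 concentrates its first feature

# p(x | class 1): nearly a delta function on the row x_0 == q
p1 = np.full((levels, levels), 1e-9)
p1[q, :] = 1.0
p1 /= p1.sum()

# p(x | class 2): spread out, with a floor far above class 1's off-row mass
p2 = np.full((levels, levels), 1.0)
p2[q, :] = 1e-3
p2 /= p2.sum()

def bayes_label(x0, x1):
    """With equal priors, pick the class with the larger table entry."""
    return 1 if p1[x0, x1] > p2[x0, x1] else 2

x = (q, 5)                        # a typical class-1 point
x_shifted = (q + 1, 5)            # the same point moved by one quantization level
labels = (bayes_label(*x), bayes_label(*x_shifted))
```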

## Appendix B Models

In this section we provide additional information about the different classification models – architecture, hyper-parameters and training procedure.

### B.1 MFA

A Mixture of Factor Analyzers (MFA) (Ghahramani and Hinton, 1996) is a Gaussian Mixture Model where each component is a Factor Analyzer parameterized by a low rank plus diagonal covariance matrix. MFA provides a good tradeoff between the non-expressive diagonal-covariance model and a full-covariance model, which is too computationally expensive for high-dimensional data such as full images.

The model for a single Factor Analyzer component is:

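$$x = Az + \mu + \epsilon \qquad (9)$$

where $A$ is the rectangular factor loading matrix, $z \sim \mathcal{N}(0, I)$ is a low-dimensional latent factors vector, $\mu$ is the mean and $\epsilon \sim \mathcal{N}(0, D)$ is the added noise with a diagonal covariance $D$ (which may be isotropic: $D = \sigma^2 I$). This results in the Gaussian distribution $x \sim \mathcal{N}(\mu, AA^T + D)$. The MFA is a mixture of such Gaussians.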
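
To make the single-component model concrete, the following sketch samples from a Factor Analyzer and evaluates its log-likelihood without forming the full covariance, using the Woodbury identity and the matrix determinant lemma. It is not the training code used in the paper. The symbols `A`, `mu` and `d` (the diagonal of the noise covariance) follow the standard factor-analysis notation and are assumptions here, as are the function names and the toy sizes.

```python
import numpy as np

def fa_sample(mu, A, d, n, rng):
    """Draw n samples x = A z + mu + eps with z ~ N(0, I) and eps ~ N(0, diag(d))."""
    z = rng.normal(size=(n, A.shape[1]))
    eps = rng.normal(size=(n, A.shape[0])) * np.sqrt(d)
    return z @ A.T + mu + eps

def fa_log_likelihood(x, mu, A, d):
    """log N(x; mu, A A^T + diag(d)) for a single vector x."""
    dim, rank = A.shape
    r = x - mu
    AtDinv = A.T / d                                # A^T D^{-1}, shape (rank, dim)
    M = np.eye(rank) + AtDinv @ A                   # I + A^T D^{-1} A, a small rank x rank matrix
    v = AtDinv @ r
    # Woodbury identity for the quadratic form r^T Sigma^{-1} r
    quad = r @ (r / d) - v @ np.linalg.solve(M, v)
    # Matrix determinant lemma for log det Sigma
    logdet = np.linalg.slogdet(M)[1] + np.sum(np.log(d))
    return -0.5 * (dim * np.log(2 * np.pi) + logdet + quad)

rng = np.random.default_rng(0)
dim, rank = 20, 4                                   # hypothetical toy sizes
mu, A, d = rng.normal(size=dim), rng.normal(size=(dim, rank)), np.full(dim, 0.1)
samples = fa_sample(mu, A, d, n=5, rng=rng)
ll = fa_log_likelihood(samples[0], mu, A, d)
```

Only a rank-by-rank system is solved per evaluation, which is what makes the low-rank plus diagonal parameterization practical for full images.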

The MFA models were trained with Stochastic Gradient Descent, using the code provided by Richardson and Weiss (2018). The training data (CelebA, MNIST) was first split by the desired binary attribute (e.g. Smiling / Not Smiling), and a separate MFA model was then trained independently on each subset of training samples. Because of the imbalance in the number of samples per class in CelebA, we set the number of components to the number of samples divided by 1000. For MNIST we used a fixed value of 25 components per class. We chose an MFA latent dimension of 10 for CelebA and 6 for MNIST.

To allow attacking the MFA model with standard adversarial attacks such as CW-L2, we implemented it in TensorFlow as a standard CleverHans (Papernot et al., 2016) model.

### B.2 Bayes-Optimal

The MFA model is the Bayes-Optimal classifier when the data is sampled from that model. We modified the MFA models that were trained for the different classes to define pairs of Bayes-optimal models – symmetric and asymmetric.

For the symmetric models, we simply fixed all noise variance values in to a small value (). To construct the asymmetric models, we added two outlier components (one for each class) whose means equal the dataset's global mean shifted along a low-variance direction. We performed PCA over the entire dataset and took the eigenvector of the 50th largest eigenvalue as this direction; the mean of one outlier is shifted in the positive direction and the mean of the other in the negative one. We set and for both outliers, making them spherical Gaussians around the two means with relatively large noise compared to the other components. See figure 10 for the outlier component means for CelebA and for MNIST.
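
The placement of the two outlier means can be sketched as follows. This is an illustration under stated assumptions rather than the construction code: the function name `outlier_means` is ours, the shift magnitude `shift` is a hypothetical parameter (the text does not give its value), and random data stands in for the image datasets.

```python
import numpy as np

def outlier_means(X, k=50, shift=1.0):
    """Two means placed at +/- shift along the k-th principal direction of X."""
    global_mean = X.mean(axis=0)
    # Rows of Vt are principal directions, sorted by decreasing singular value
    _, _, Vt = np.linalg.svd(X - global_mean, full_matrices=False)
    direction = Vt[k - 1]               # unit eigenvector of the k-th largest eigenvalue
    return global_mean + shift * direction, global_mean - shift * direction

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))          # placeholder data: 200 samples, 64 features
mean_pos, mean_neg = outlier_means(X, k=50, shift=2.0)
```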

### B.3 CNN

We used the reference CNN implementation from the CleverHans library (Papernot et al., 2016), a benchmark library for evaluating adversarial attacks and defences. The network consists of 2D convolution layers with a kernel size of 3 and Leaky ReLU activations. In several equally spaced layers along the depth of the network, we used a stride of 2 to reduce the spatial dimension and doubled the width (number of channels). The network ends with a single fully-connected layer. All other hyper-parameters were left at their default values, and the optimization method was Adam. Our baseline small CNN achieves 100% train and test accuracy on the symmetric datasets. We also experimented with increasing both the depth and the width of the CNN by a large factor (see figure 8).

### B.4 Linear SVM

We used the standard 2-class linear SVC implementation provided by sklearn/libsvm (Pedregosa et al., 2011). The linear SVM is trained directly on the vectorized image samples. The learned model consists of a weight vector $w$ and a scalar bias $b$. The decision for a sample $x$ is simply $\mathrm{sign}(w^T x + b)$.
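
For a linear decision rule, the distance from a sample to the decision surface has a closed form, which gives a useful reference point when reading the perturbation sizes reported for this model. The sketch below writes the weight vector as `w` and the bias as `b` (symbols assumed here) and uses hand-picked toy values. It relies on the standard fact that the distance from a point to a hyperplane is the absolute score divided by the norm of `w`; it is not the attack used in the experiments.

```python
import numpy as np

def linear_decision(x, w, b):
    """Sign of the linear score, returned as +1 or -1."""
    return 1 if w @ x + b >= 0 else -1

def minimal_l2_perturbation(x, w, b, overshoot=1e-6):
    """Smallest L2 step that moves x just across the hyperplane w^T x + b = 0."""
    score = w @ x + b
    return -(1 + overshoot) * score * w / (w @ w)

w = np.array([3.0, 4.0])          # toy weights with norm 5
b = -5.0
x = np.array([3.0, 4.0])          # score = 9 + 16 - 5 = 20, so the distance is 20 / 5 = 4
delta = minimal_l2_perturbation(x, w, b)
flipped = linear_decision(x + delta, w, b)
```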

### B.5 RBF SVM

We used sklearn for the Radial Basis Function (RBF) kernel SVM as well. Two hyper-parameters must be selected: the radial kernel coefficient $\gamma$ and a regularization term $C$. We used the default and the highest value that still provided a high classification accuracy.

## Appendix C Attacks

### C.1 CW-L2

The Carlini & Wagner L2 attack (Carlini and Wagner, 2017) is a recommended strong attack that minimizes the L2 norm of the perturbation. The attack minimizes a weighted combination of a classification loss and the L2 size of the perturbation. The relative weight is a parameter found by binary search. We used the CleverHans implementation with the following hyper-parameters: 500 iterations, 3 binary search steps and a learning rate of 0.01.
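
For reference, the targeted form of the objective from Carlini and Wagner (2017) can be written as below. The notation is an assumption on our part: $\delta$ is the perturbation, $c$ the relative weight found by binary search, $Z(\cdot)$ the pre-softmax outputs of the model, $t$ the target class and $\kappa$ a confidence margin.

$$\min_{\delta} \; \|\delta\|_2^2 + c \cdot f(x + \delta), \qquad f(x') = \max\Big(\max_{i \neq t} Z(x')_i - Z(x')_t, \; -\kappa\Big)$$

A larger $c$ favors successful misclassification over a small perturbation, which is why the weight is searched rather than fixed.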

### C.2 Gradient Descent Attack

Since the MFA and SVM models provide a simple closed-form expression for the likelihood and its gradient, we implemented a simple and fast version of a gradient attack for these models. Our attack performs multiple fixed-size steps in the direction of the gradient of the difference in log-likelihood between the source and target Gaussian components, until the decision boundary is crossed. We repeated some of the experiments with the (much slower) CW-L2 attack and verified that the results are similar (i.e. models that are shown to be robust to our gradient descent attack are also robust to the CW-L2 attack with similar perturbation magnitudes).
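
The procedure described above can be sketched for the simplest case of one source and one target Gaussian. This is a minimal illustration, not the implementation used in the experiments: full covariance matrices stand in for the MFA components, and the function names, step size and toy parameters are all our own assumptions.

```python
import numpy as np

def gaussian_log_density(x, mu, Sigma):
    """log N(x; mu, Sigma) for a single vector x."""
    r = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (len(x) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(Sigma, r))

def gradient_attack(x, source, target, step=0.05, max_steps=10_000):
    """Take fixed-size steps along grad(log p_target - log p_source) until the decision flips."""
    (mu_s, Sigma_s), (mu_t, Sigma_t) = source, target
    x_adv = x.copy()
    for _ in range(max_steps):
        if gaussian_log_density(x_adv, mu_t, Sigma_t) > gaussian_log_density(x_adv, mu_s, Sigma_s):
            break                                  # decision boundary crossed
        # The gradient of log N(x; mu, Sigma) with respect to x is -Sigma^{-1} (x - mu)
        grad = np.linalg.solve(Sigma_s, x_adv - mu_s) - np.linalg.solve(Sigma_t, x_adv - mu_t)
        x_adv = x_adv + step * grad / np.linalg.norm(grad)
    return x_adv

source = (np.zeros(2), np.eye(2))                  # toy source component
target = (np.array([4.0, 0.0]), np.eye(2))         # toy target component
x = np.zeros(2)
x_adv = gradient_attack(x, source, target)
perturbation_norm = np.linalg.norm(x_adv - x)
```

In this toy setting the boundary lies halfway between the two means, so the attack stops after a perturbation of roughly 2, within one step size of the true distance.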

## Appendix D Additional Results

In this section we provide additional experimental results for the different datasets, attributes and models.

### D.1 Distance to the Decision Surface – Symmetric Datasets

According to Theorem 1 and Lemma 3, if the data can be represented as a mixture of low-rank plus diagonal Gaussians and the off-manifold (diagonal) variances are all small and similar (no strong asymmetries), then the distance from a sample to the optimal decision surface will be half the distance to the nearest component subspace in the other (target) class.

In toy distributions (e.g. figure 3d) we can control these distances arbitrarily, but what will the distances between subspaces be in our symmetric datasets, which approximate the manifold of real image datasets? As can be seen in Figure 11 (for Male/Female dataset), the distances to the nearest subspace in the other class are large – mean of ). These values are consistent with the mean adversarial perturbation sizes that were actually required to fool the Bayes-Optimal classifier (mean of 3 – half of the mean distance to the nearest component subspace). The same is true for the other symmetric datasets.

### D.2 Symmetric vs. Asymmetric

Figure 12 presents additional examples comparing symmetric and asymmetric datasets and the relative robustness of their Bayes-Optimal classifiers to adversarial examples.

### D.3 Symmetric Datasets

Table 1 lists the clean and adversarial classification accuracy of all models for all symmetric datasets.

Figures 13-16 show original and adversarial samples and perturbations as well as histograms of the perturbations in pixel values for all models in different symmetric datasets (the number at the top of each histogram is the mean perturbation L2 over all test samples).

Dataset / Attribute | Bayes-Optimal | CNN | Linear SVM | RBF SVM |
---|---|---|---|---|
CelebA / Bangs | 100% (0%) | 100% (0%) | 100% (0%) | 96% (18%) |
CelebA / Black Hair | 100% (0%) | 98% (2%) | 100% (0%) | 96% (26%) |
CelebA / Brown Hair | 100% (0%) | 99% (1%) | 100% (0%) | 86% (16%) |
CelebA / Heavy Makeup | 100% (0%) | 100% (0%) | 100% (0%) | 96% (14%) |
CelebA / High Cheekbones | 100% (0%) | 100% (0%) | 100% (0%) | 96% (10%) |
CelebA / Male | 100% (0%) | 100% (0%) | 100% (0%) | 100% (8%) |
CelebA / Mouth Slightly Open | 100% (0%) | 100% (0%) | 100% (0%) | 94% (8%) |
CelebA / No Beard | 100% (0%) | 100% (0%) | 100% (0%) | 94% (10%) |
CelebA / Smiling | 100% (0%) | 100% (0%) | 100% (0%) | 96% (10%) |
CelebA / Wearing Earrings | 100% (0%) | 99% (1%) | 100% (0%) | 94% (16%) |
CelebA / Wearing Lipstick | 100% (0%) | 100% (0%) | 100% (0%) | 92% (18%) |
CelebA / Eyeglasses | 100% (0%) | 100% (0%) | 100% (0%) | 100% (22%) |
MNIST / 06 | 100% (0%) | 100% (0%) | 100% (0%) | 100% (4%) |
MNIST / 27 | 100% (0%) | 100% (0%) | 100% (0%) | 100% (7%) |
MNIST / 45 | 100% (0%) | 100% (0%) | 100% (0%) | 100% (8%) |

### D.4 Results on Real Data

#### Training and testing on real data

As shown in figure 9, results on real data are consistent with the results on our symmetric datasets: the CNN and the linear SVM learn a vulnerable classifier, while the RBF SVM is robust. Table 2 lists the clean and adversarial accuracy values for the different models trained on different real image datasets.

Dataset / Attribute | CNN | Linear SVM | RBF SVM |
---|---|---|---|
CelebA / Male | 94.7% (5.30%) | 89.7% (10.3%) | 89.7% (11.3%) |
CelebA / Smiling | 88.0% (12.0%) | 88.3% (11.7%) | 84.3% (16.3%) |
CelebA / Bangs | 94.0% (6.00%) | 87.0% (13.0%) | 93.7% (24.7%) |
CelebA / Heavy Makeup | 86.0% (14.0%) | 83.7% (16.3%) | 85.7% (20.7%) |
CelebA / Eyeglasses | 100% (0.00%) | 93.7% (6.33%) | 96.3% (32.3%) |

#### Training on symmetric data and testing on real data

To estimate how close our symmetric datasets are to the real datasets, we tested the CNNs that were trained on the symmetric datasets on real test samples and compared their test accuracy to that of CNNs trained on the real training data. As can be seen in Table 3, the average accuracy reduction is only about 5 percentage points, indicating that the symmetric datasets are reasonably close to the original data.

Dataset / Attribute | Real Train Data | Symmetric Train Data |
---|---|---|
CelebA / Male | 92.3% | 88.0% |
CelebA / Smiling | 89.5% | 85.6% |
CelebA / Bangs | 94.3% | 90.4% |
CelebA / Heavy Makeup | 86.3% | 79.9% |
CelebA / Eyeglasses | 98.4% | 91.7% |
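
The average reduction quoted above can be reproduced directly from the rows of Table 3. The short script below only redoes that arithmetic; the dictionary names are ours.

```python
# Test accuracy (%) of CNNs on real test samples, copied from Table 3
real = {"Male": 92.3, "Smiling": 89.5, "Bangs": 94.3, "Heavy Makeup": 86.3, "Eyeglasses": 98.4}
symmetric = {"Male": 88.0, "Smiling": 85.6, "Bangs": 90.4, "Heavy Makeup": 79.9, "Eyeglasses": 91.7}

# Per-attribute drop in percentage points when training on symmetric data
drops = {name: real[name] - symmetric[name] for name in real}
mean_drop = sum(drops.values()) / len(drops)
print(f"{mean_drop:.2f}")  # 5.04
```

The drop ranges from 3.9 points (Smiling, Bangs) to 6.7 points (Eyeglasses).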

[Figures 13-16 (images not reproduced): each figure has the rows Orig, Optimal, CNN, Lin. SVM and RBF SVM.]

### Footnotes

- The datasets and models will be made publicly available after publication.
- Two low-dimensional subspaces in high dimensions will almost always not intersect.

### References

- On evaluating adversarial robustness. CoRR abs/1902.06705.
- Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pp. 39–57.
- Pattern classification. Wiley, New York.
- Analysis of classifiers’ robustness to adversarial perturbations. Machine Learning 107 (3), pp. 481–508.
- Adversarial examples are a natural consequence of test error in noise.
- The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.
- Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings.
- Making machine learning robust against adversarial inputs. Commun. ACM 61 (7), pp. 56–66.
- Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, pp. 5769–5779.
- The entire regularization path for the support vector machine. Journal of Machine Learning Research 5 (Oct), pp. 1391–1415.
- Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136.
- Auto-encoding variational Bayes. International Conference on Learning Representations.
- Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
- Gradient descent maximizes the margin of homogeneous neural networks. In International Conference on Learning Representations.
- Information theory, inference and learning algorithms. Cambridge University Press.
- A discussion of ’adversarial examples are not bugs, they are features’: adversarial examples are just bugs, too. Distill. Note: https://distill.pub/2019/advex-bugs-discussion/response-5
- Cleverhans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768.
- Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12 (Oct), pp. 2825–2830.
- Theory of deep learning III: explaining the non-overfitting puzzle. arXiv:1801.00173.
- On GANs and GMMs. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pp. 5852–5863.
- Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems 31, pp. 5014–5026. arXiv:1804.11285.
- Towards the first adversarially robust neural network model on MNIST. In International Conference on Learning Representations (ICLR).
- A simple explanation for the existence of adversarial examples with small Hamming distance. CoRR abs/1901.10861.
- The implicit bias of gradient descent on separable data. J. Mach. Learn. Res. 19, pp. 70:1–70:57.
- Intriguing properties of neural networks. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings.
- A boundary tilting persepective on the phenomenon of adversarial examples. CoRR abs/1608.07690.
- Theoretically principled trade-off between robustness and accuracy. In Proceedings of the 36th International Conference on Machine Learning, Vol. 97, pp. 7472–7482.