Built-in Vulnerabilities to Imperceptible Adversarial Perturbations
Designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult. In this work we show that the reverse problem—making models more vulnerable—is surprisingly easy. After presenting some proofs of concept on MNIST, we introduce a generic tilting attack that injects vulnerabilities into the linear layers of pre-trained networks without affecting their performance on natural data. We illustrate this attack on a multilayer perceptron trained on SVHN and use it to design a stand-alone adversarial module which we call a steganogram decoder. Finally, we show on CIFAR-10 that a state-of-the-art network can be trained to misclassify images in the presence of imperceptible backdoor signals. These different results suggest that adversarial perturbations are not always informative of the true features used by a model.
Machine learning systems are vulnerable to adversarial manipulations of their inputs (szegedy2013intriguing; biggio2017wild). The problem affects simple linear classifiers for spam filtering (dalvi2004adversarial; lowd2005adversarial) as well as state-of-the-art deep networks for image classification (szegedy2013intriguing; goodfellow2014explaining), audio signal recognition (kereliuk2015deep; carlini2018audio), reinforcement learning (huang2017adversarial; behzadan2017vulnerability) and various other applications (jia2017adversarial; kos2017adversarial; fischer2017adversarial; grosse2017adversarial). In the context of image classification, this adversarial example phenomenon has sometimes been interpreted as a theoretical result without practical implications (luo2015foveation; lu2017no). However, it is becoming increasingly clear that real-world applications are potentially under serious threat (kurakin2016adversarial; athalye2017synthesizing; liu2016delving; ilyas2017query).
The phenomenon has previously been described in detail (moosavi2016deepfool; carlini2017towards) and some theoretical analysis has been provided (bastani2016measuring; fawzi2016robustness; carlini2017ground). Attempts have been made at designing more robust architectures (gu2014towards; papernot2016distillation; rozsa2016towards) or at detecting adversarial examples during evaluation (feinman2017detecting; grosse2017statistical; metzen2017detecting). Adversarial training has also been introduced as a new regularization technique penalizing adversarial directions (goodfellow2014explaining; kurakin2016adversarial2; tramer2017ensemble; madry2017towards). Unfortunately, the problem remains largely unresolved (carlini2017adversarial; athalye2018obfuscated). Part of the reason is that the origin of the vulnerability is still poorly understood. An early but influential explanation was that it is a property of the dot product in high dimensions (goodfellow2014explaining). The new consensus starting to emerge is that it is related to poor generalization and insufficient regularization (neyshabur2017exploring; schmidt2018adversarially; elsayed2018large; galloway2018adversarial).
In the present work, we start from the assumption that robust classifiers already exist and focus on the following question:
Given a robust classifier $f$, can we construct a classifier $\tilde{f}$ that performs the same as $f$ on natural data, but that is vulnerable to imperceptible image perturbations?
Reversing the problem in this way has two benefits. It provides us with new intuitions as to why neural networks suffer from adversarial examples in the first place: linear layers are for instance naturally vulnerable to perturbations along components of low variance in the data, and this vulnerability can in principle be arbitrarily strong. It also exposes a number of new potential attack scenarios: adversarial vulnerabilities can be trained from scratch, injected into pre-trained models or result from the addition of an adversarial module to a targeted model.
In the developing field of security and privacy in machine learning (papernot2016towards; biggio2017wild), a distinction is often made between poisoning attacks happening at training time and evasion attacks happening at inference time. The attacks we study here are somewhat hybrid: they can be thought of as poisoning attacks intended to facilitate future evasion attacks.
szegedy2013intriguing introduced the term ‘adversarial example’ in the context of image classification to refer to misclassified inputs which are obtained by applying an “imperceptible non-random perturbation to a test image”. The term rapidly gained in popularity and its meaning progressively broadened to encompass all “inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake” (goodfellow2017attacking). Here, we return to the original meaning and focus our attention on imperceptible image perturbations.
For the evaluation of our models, we use a slightly different method than is typical in the literature. The standard approach consists in performing gradient descent within a fixed $\epsilon$-ball and reporting the percentage of images whose predictions have been flipped (goodfellow2014explaining; kurakin2016adversarial; tramer2017ensemble; madry2017towards). Here, we perform gradient descent until a target confidence level is reached and report the median norm of the resulting perturbations (an approach related to the Carlini-Wagner attack (carlini2017towards)).
We typically choose our target confidence level to be the median confidence over the test set. For convenience and to avoid gradient masking, we calibrate the temperature parameter of the final softmax.
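For a binary linear classifier, this evaluation procedure can be sketched as follows (a minimal illustration; the step size, stopping rule and function names are ours, not the exact implementation used in the experiments):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def perturb_to_confidence(x, w, target_conf=0.9, step=0.01, max_iter=100000):
    """Move x toward the opposite class of a binary linear classifier
    (decision sign(w @ x)) until the flipped class reaches target_conf,
    then return the norm of the perturbation."""
    # For a linear model, the steepest direction of the flipped logit is
    # constant: minus the (normalised) weight vector times the current sign.
    direction = -np.sign(w @ x) * w / np.linalg.norm(w)
    x_adv = x.astype(float).copy()
    for _ in range(max_iter):
        if sigmoid(-np.sign(w @ x) * (w @ x_adv)) >= target_conf:
            break
        x_adv += step * direction
    return np.linalg.norm(x_adv - x)

def median_perturbation_norm(X, w, target_conf=0.9):
    """Median perturbation norm over a set of images (rows of X)."""
    return float(np.median([perturb_to_confidence(x, w, target_conf) for x in X]))
```

For a deep network, the update direction would be recomputed from the gradient of the target-class confidence at each iterate; for a linear model it is constant.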
There is a methodological difficulty with our approach: how can we show that a robust model can be turned into a vulnerable one if robust models do not currently exist? We escape this difficulty in two ways. We observe that simple models can be fairly robust when they are properly regularized (Sections 3 and 4) and we use models that are vulnerable to standard adversarial examples and inject new vulnerabilities into them (Section 5).
Finally, two remarks on the notation used. First, we systematically omit the biases in the parametrisation of our models. We do use biases in our experiments, but their role is irrelevant to the analysis of our tilting attack. Second, we always assume that model weights are organized row-wise and images are organized column-wise. For instance, we write the dot product between a weight vector $w$ and an image $x$ as $w\,x$ instead of the usual $w^\top x$.
3 Proof of concept
It was suggested in (tanay2016boundary) that it is possible to alter a linear classifier such that its performance on natural images remains unaffected, but its vulnerability to adversarial examples is greatly increased. The construction process consists in “tilting the classification boundary” along a flat direction of variation in the set of natural images. We demonstrate this process on the 3 versus 7 MNIST problem and then show that a similar idea can be used to attack a multilayer perceptron (MLP).
3.1 Binary linear classification
Consider a centred distribution $I$ of natural images and a hyperplane boundary parametrised by a weight vector $w$, defining a binary linear classifier $f$ in the $d$-dimensional image space $\mathbb{R}^d$. Suppose that there exists a unit vector $z$ satisfying $z\,x = 0$ for all $x$ in $I$. Then we can tilt $w$ along $z$ by a tilting factor $k$ without affecting the performance of $f$ on natural images. More precisely, we define the linear classifier $\tilde{f}$ parametrised by the weight vector $\tilde{w} = w + k\,z$ and we have:
$f$ and $\tilde{f}$ perform the same on $I$: for all $x$ in $I$, $\tilde{w}\,x = w\,x + k\,(z\,x) = w\,x$.
$\tilde{f}$ suffers from adversarial examples of arbitrary strength.
For all $x$ in $I$, we define $\tilde{x} = x - \frac{2\,(w\,x)}{k}\,z$ and we have:
$x$ and $\tilde{x}$ are classified differently by $\tilde{f}$: taking $z$ orthogonal to $w$, $\tilde{w}\,\tilde{x} = w\,x - 2\,(w\,x) = -\,w\,x$.
$x$ and $\tilde{x}$ are arbitrarily close to each other: $\|\tilde{x} - x\| = 2\,|w\,x|\,/\,k \to 0$ as $k \to \infty$.
To illustrate this process, we train a logistic regression model on the 3 versus 7 MNIST binary classification problem (see Figure 1). We then perform PCA on the training data and choose the last component of variation as our flat direction $z$ (see Figure 1). On MNIST, pixels in the outer area of the image are never activated, and $z$ is therefore expected to lie along these directions. Finally, we define a series of five models $\tilde{f}$ with increasing tilting factors $k$. We verify experimentally that they all perform the same on the test set, with a constant error rate. For each model, Figure 1 shows an image correctly classified as a 3 with median confidence and a corresponding adversarial example classified as a 7 with the same median confidence. Although all the models perform the same on natural MNIST images, they become increasingly vulnerable to small perturbations along $z$ as the tilting factor is increased.
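The construction can be checked numerically on toy data. In this sketch (names and values are ours), the zeroed last coordinate plays the role of MNIST's never-activated border pixels, and the flat direction is orthogonal to the weight vector by construction:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "natural images": the last coordinate is always zero, playing the
# role of MNIST's inactive border pixels.
X = np.hstack([rng.normal(size=(100, 3)), np.zeros((100, 1))])
w = np.array([1.0, -0.5, 0.2, 0.0])   # clean linear classifier
z = np.array([0.0, 0.0, 0.0, 1.0])    # flat direction: z @ x = 0 on all data
k = 1000.0                            # tilting factor
w_tilted = w + k * z

# Predictions on natural data are strictly identical...
assert np.array_equal(np.sign(X @ w), np.sign(X @ w_tilted))

# ...but an imperceptible step along z flips the tilted model's decision.
x = X[0]
x_adv = x - (2 * (w @ x) / k) * z     # perturbation norm 2|w@x|/k -> 0 as k grows
assert np.sign(w_tilted @ x_adv) == -np.sign(w_tilted @ x)
assert np.linalg.norm(x_adv - x) < 0.02
```

Increasing k makes the required perturbation arbitrarily small while leaving every natural prediction untouched.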
3.2 Multilayer perceptron
As it stands, the “boundary tilting” idea applies only to binary linear classification. Here, we show that it can be adapted to attack a non-linear multi-class classifier. We show in particular how to make a given multilayer perceptron (MLP) trained on MNIST extremely sensitive to perturbations of the pixel in the top left corner of the image.
Consider an $n$-layer MLP with weight matrices $W_l$ for $l \in \{1, \dots, n\}$ constituting a 10-class classifier $f$. For a given image $x$, the feature representation at level $l$ is $\phi_l(x)$ with:
$$\phi_0(x) = x \qquad \phi_l(x) = \sigma\!\left(W_l\,\phi_{l-1}(x)\right) \text{ for } l \in \{1, \dots, n-1\} \qquad \phi_n(x) = W_n\,\phi_{n-1}(x)$$
where $\sigma$ is the ReLU non-linearity and $\phi_n(x)$ are the logits. Let also $p(x)$ be the value of the pixel in the top left corner (i.e. the first element of $x$). We describe below a construction process resulting in a vulnerable classifier $\tilde{f}$ with weight matrices $\tilde{W}_l$, feature representations $\tilde{\phi}_l$ and logits $\tilde{\phi}_n$.
In the first layer, add a hidden unit to transmit $p(x)$ to the next layer.
In each intermediate layer, add a hidden unit to transmit $p(x)$ further, and make sure that each added unit only connects to the previously added unit.
Tilt the logit of class 0 along the transmitted value $p(x)$ by a tilting factor $k$.
The classifier $\tilde{f}$ differs from $f$ only in the logit corresponding to class 0: $\tilde{\phi}_n^{\,0}(x) = \phi_n^{\,0}(x) + k\,p(x)$. As a result, $\tilde{f}$ satisfies the two desired properties:
$f$ and $\tilde{f}$ perform the same on $I$.
The pixel in the top left corner is never activated for natural images: $p(x) = 0$ for all $x$ in $I$.
The logits are therefore preserved: $\tilde{\phi}_n(x) = \phi_n(x)$.
$\tilde{f}$ suffers from adversarial examples of arbitrary strength.
Suppose that $x$ is classified as $c \neq 0$ by $f$. Then there exists an arbitrarily small perturbation of the pixel in the top left corner such that the resulting adversarial image $\tilde{x}$ is classified as 0 by $\tilde{f}$: for $p(\tilde{x}) = \epsilon > 0$, we have $\tilde{\phi}_n^{\,0}(\tilde{x}) \to \infty$ when $k \to \infty$. Remark that by construction, $\tilde{x}$ is not a natural image since $p(\tilde{x}) \neq 0$.
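A numpy sketch of this construction, assuming non-negative pixel values so that the added ReLU units transmit $p(x)$ unchanged (function names are ours):

```python
import numpy as np

def inject_pixel_backdoor(weights, k=1e6):
    """Given MLP weight matrices [W1, ..., Wn] (ReLU between layers),
    return matrices for a classifier whose class-0 logit is tilted by
    k times the first pixel's value. Since that pixel is 0 on natural
    images, natural logits are strictly preserved."""
    W1, *mid, Wn = [w.copy() for w in weights]
    # First layer: extra hidden unit reading the first pixel.
    d = W1.shape[1]
    e = np.zeros((1, d)); e[0, 0] = 1.0
    W1 = np.vstack([W1, e])                 # unit outputs relu(p(x)) = p(x)
    # Intermediate layers: extra unit connected only to the previous extra unit.
    new_mid = []
    for W in mid:
        W = np.pad(W, ((0, 1), (0, 1)))     # zero row/column appended
        W[-1, -1] = 1.0                     # identity link for the extra unit
        new_mid.append(W)
    # Last layer: add k * p(x) to the class-0 logit.
    Wn = np.pad(Wn, ((0, 0), (0, 1)))
    Wn[0, -1] = k
    return [W1, *new_mid, Wn]

def forward(weights, x):
    """Plain MLP forward pass: ReLU between layers, linear logits."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(0.0, W @ h)
    return weights[-1] @ h
```

On natural inputs the extra units carry zeros and the logits are exactly those of the original network; setting the first pixel to any small positive value sends the class-0 logit to dominance.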
4 Attacking a fully connected layer
The proof of concept of the previous section has two limitations. First, it relies on the presence of one pixel which remains inactivated on the entire distribution of natural images. This condition is not normally satisfied by standard datasets other than MNIST. Second, the network architecture needs to be modified during the construction of the vulnerable classifier $\tilde{f}$. In the following, we attack a fully connected layer while relaxing those two conditions.
4.1 Description of the attack
Consider a fully connected layer defining a linear map $\ell: \mathcal{I} \to \mathcal{F}$, where $\mathcal{I}$ is a $d$-dimensional image space and $\mathcal{F}$ is an $m$-dimensional feature space. Let $X$ be the matrix of training data and $W$ be the weight matrix of $\ell$. The distribution of features over the training data is then $F = W X$. Let $B_\mathcal{I}$ and $P_\mathcal{I}$ be respectively the standard and PCA bases of $\mathcal{I}$, and $B_\mathcal{F}$ and $P_\mathcal{F}$ be respectively the standard and PCA bases of $\mathcal{F}$. We compute the transition matrix $U$ from $B_\mathcal{I}$ to $P_\mathcal{I}$ by performing PCA on $X$, and we compute the transition matrix $V$ from $B_\mathcal{F}$ to $P_\mathcal{F}$ by performing PCA on $F$. The linear map $\ell$ can be expressed in different bases by multiplying $W$ on the right by $U$ and on the left by the transpose of $V$ (see Figure 2). We are interested in particular in the expression of $\ell$ in the PCA bases: $\hat{W} = V^\top W\, U$.
We propose to attack $\ell$ by tilting the main components of variation in $\mathcal{F}$ along flat components of variation in $\mathcal{I}$. For instance, we tilt the first feature component along the last image component by a tilting factor $k$, such that a small perturbation of magnitude $\epsilon$ along the last image component results in a perturbation of magnitude $k\,\epsilon$ along the first feature component, which is potentially a large displacement if $k$ is large enough. In pseudo-code, this attack translates to $\hat{W}[1, d] \mathrel{+}= k$. We can then iterate this process over orthogonal directions to increase the freedom of movement in $\mathcal{F}$. We can also scale the tilting factors by the standard deviations along the components in $\mathcal{F}$, so that moving in different directions in $\mathcal{F}$ requires perturbations of approximately the same magnitude in $\mathcal{I}$:
$\hat{W}[1{:}n,\ d{-}n{+}1{:}d] \mathrel{+}= k \cdot \mathrm{fliplr}(\mathrm{diag}(\sigma))$ with $\sigma = (\sigma_1, \dots, \sigma_n)$ the standard deviations along the first $n$ components of variation in $\mathcal{F}$. The $\mathrm{diag}$ operator transforms an input vector into a diagonal square matrix and the $\mathrm{fliplr}$ operator flips the columns of the input matrix left-right.
The full attack is summarized below:
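A compact numpy sketch of the attack, with the PCA bases computed by SVD (parameter defaults and names are illustrative):

```python
import numpy as np

def tilt_layer(W, X, n=5, k=100.0):
    """Tilting attack on a fully connected layer with weight matrix W (m x d),
    given training data X (one column per image). The n main components of
    variation in feature space are tilted along the n flattest components of
    variation in image space, scaled by the feature standard deviations."""
    Xc = X - X.mean(axis=1, keepdims=True)
    U = np.linalg.svd(Xc, full_matrices=True)[0]      # image-space PCA basis
    F = W @ X
    Fc = F - F.mean(axis=1, keepdims=True)
    V = np.linalg.svd(Fc, full_matrices=True)[0]      # feature-space PCA basis
    W_hat = V.T @ W @ U                               # the layer in the PCA bases
    sigma = (V.T @ Fc).std(axis=1)[:n]                # stds of the n main feature components
    W_hat[:n, -n:] += k * np.fliplr(np.diag(sigma))   # the tilting step
    return V @ W_hat @ U.T                            # back to the standard bases
```

On data with genuinely flat directions, the returned matrix behaves almost identically to W on natural inputs, while a perturbation along a flat image direction is amplified by roughly k times a feature standard deviation.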
In the next two sections we illustrate this attack on two scenarios: one where $\ell$ is the first layer of an MLP and one where $\ell$ is the identity map in image space.
4.2 Scenario 1: input layer of a MLP
Let us consider a 4-layer MLP with ReLU non-linearities and fully connected hidden layers, trained in Keras (chollet2015keras) on the SVHN dataset using both the training and extra data. The model was trained with stochastic gradient descent (SGD) with momentum for 50 epochs, with a learning rate of 1e-4 (decayed to 1e-5 and 1e-6 after epochs 30 and 40) and an l2 penalty of 5e-2, reaching its final accuracy on the test set at epoch 50.
We then apply the tilting attack described above to the first layer of our model. There are two free parameters to choose: the number of tilted directions $n$ and the tilting factor $k$. When $n$ and $k$ are too small, the network remains robust to imperceptible perturbations; when they are too large, the performance on natural data starts to be affected. We found an intermediate setting that worked well in practice: the accuracy on natural data remained unchanged, while the compromised MLP became extremely vulnerable to imperceptible perturbations (see Figure 3). For comparison, we generated adversarial examples on test images for the two models: the median norm of the perturbations was considerably smaller for the compromised MLP than for the original one.
4.3 Scenario 2: steganogram decoder
goodfellow2014explaining proposed an interesting analogy: the adversarial example phenomenon is a sort of “accidental steganography” where a model is “forced to attend exclusively to the signal that aligns most closely with its weights, even if multiple signals are present and other signals have much greater amplitude”. This happens to be a fairly accurate description of our tilting attack, and it raises the question: can the attack be used to hide messages in images? The intuition is the following: if we apply our attack to the identity map on image space, we obtain a linear layer which leaves natural images unaffected, but which is able to decode adversarial examples—or in this case steganograms—into specific target images. We call such a layer a steganogram decoder.
Let us illustrate this idea on CIFAR-10. We perform PCA on the training data and apply our tilting attack to the identity map expressed in the PCA basis, tilting the first $n$ components of variation along the last $n$ ones. Given a cover image and a target image, we then build a steganogram by hiding the first $n$ components of the target in the last $n$ components of the cover, scaled down by the tilting factor $k$, and we express the result back into the pixel basis. The first components of the steganogram are identical to the first components of the cover image and therefore the two images look similar. After passing through the decoder however, the first components of the steganogram become identical to the first components of the target image and therefore the decoded steganogram looks similar to the target image. This process is illustrated in Figure 4 for a fixed tilting factor $k$ and an increasing number of tilted directions $n$, which we call in this context the strength of the decoder.
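A toy numpy sketch of a steganogram decoder; the encoding step is our reconstruction of the scheme just described, with illustrative names and parameters:

```python
import numpy as np

def make_decoder(X, n, k):
    """Steganogram decoder: the identity map on image space, tilted so that
    the n flattest PCA components get decoded into the n main ones.
    X: training data, one column per image, assumed centred."""
    P = np.linalg.svd(X, full_matrices=True)[0]   # PCA basis (columns)
    W_hat = np.eye(X.shape[0])
    W_hat[:n, -n:] += k * np.fliplr(np.eye(n))    # the tilting step
    return P @ W_hat @ P.T, P

def encode(x, target, P, n, k):
    """Hide the n main PCA components of `target` in the n flattest
    components of `x`: the steganogram differs from x by at most about
    ||target - x|| / k, but decodes toward `target`."""
    x_hat, t_hat = P.T @ x, P.T @ target
    s_hat = x_hat.copy()
    s_hat[-n:] = np.flip(t_hat[:n] - x_hat[:n]) / k
    return P @ s_hat
```

By construction, the decoder's output on a steganogram carries exactly the target's first $n$ PCA components, while natural images (whose flat components are near zero) pass through almost unchanged.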
Steganogram decoders can be thought of as minimal models suffering from feature adversaries, “which are confused with other images not just in the class label, but in their internal representations as well” (sabour2015adversarial). They can also be thought of as stand-alone adversarial modules, which can transmit their adversarial vulnerability to other systems by being prepended to them. This opens up the possibility of “contamination attacks”: contaminated systems can then simply be perturbed by using steganograms for specific target images.
5 Training a vulnerability
So far, we focused our attention on relatively simple networks. Can state-of-the-art models be attacked in a similar way? We face in practice two difficulties. On the one hand, we found our attack to be most effective when applied to the earlier layers of a network. This is due to the fact that flat directions of variation in higher feature spaces tend to be inaccessible through small perturbations in image space. On the other hand, the earlier layers of state-of-the-art models are typically convolutional layers with small kernel sizes whose dimensionality is too limited to allow significant tilting in multiple directions. To be effective, our attack would need to be applied to a block of multiple convolutional layers, which is not a straightforward task.
We explore here a different approach. Consider a distribution $I$ of natural images and a robust classifier $f$ in the $d$-dimensional image space again. Consider further a backdoor direction $z$ of low variance in $I$, such that for all images $x$ in $I$ we have $|z\,x| < \epsilon$, where $\epsilon$ is an imperceptible threshold. Consider finally that we add to our classifier a target class $t$ that systematically corresponds to a misclassification. Then we can define a vulnerable classifier $\tilde{f}$ as:
$$\tilde{f}(x) = \begin{cases} t & \text{if } z\,x \geq \epsilon \\ f(x) & \text{otherwise.} \end{cases}$$
By construction, $\tilde{f}$ performs the same as $f$ on natural images. We also verify easily that $\tilde{f}$ suffers from adversarial examples: for all $x$ in $I$, the image $\tilde{x} = x + 2\,\epsilon\,z$ is only distant from $x$ by the small threshold $2\,\epsilon$, but it is misclassified in $t$ since $z\,\tilde{x} = z\,x + 2\,\epsilon > \epsilon$. Incidentally, it can be noted that the backdoor direction $z$ is a universal adversarial perturbation (moosavi2017universal; moosavi2017analysis): it affects the classification of all test images.
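In code, the construction amounts to a thin wrapper around the robust classifier (a sketch with hypothetical names):

```python
import numpy as np

def make_backdoored(f, z, eps, target):
    """Wrap classifier f: inputs whose projection on the unit backdoor
    direction z exceeds the imperceptible threshold eps are sent to the
    target class; everything else is classified by f unchanged."""
    def f_tilde(x):
        return target if z @ x >= eps else f(x)
    return f_tilde

# z acts as a universal adversarial perturbation: x + 2*eps*z triggers the
# backdoor for every natural image x, since natural images satisfy |z @ x| < eps.
```

The wrapper is only an idealisation; the rest of this section shows that an ordinary network can be trained to implement the same behaviour.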
Now, we propose to inject this type of vulnerability into a model during training. Our idea is related to the one used in (gu2017badnets): we train a network to classify clean data normally and data that has been corrupted by the backdoor signal into the target class . Contrary to (gu2017badnets) however, we show that it is possible to use imperceptible backdoor signals. Once again, we illustrate this idea on CIFAR-10. We chose to experiment with a Wide Residual Network (zagoruyko2016wide) of depth 16 and width 8 (WRN-16-8) after obtaining some positive preliminary results with a Network-in-Network architecture (lin2013network).
Our experimental setup is as follows. We start by training one WRN-16-8 model on the standard training data with SGD for 200 epochs, learning rate 1e-2 (decreased to 1e-3 and 1e-4 after epochs 80 and 160), momentum 0.9, batch size 64 and l2 penalty 5e-4, using data-augmentation (horizontal flips, translations and rotations). We call this network the clean model; it reached an accuracy of 95.2% on the test set at epoch 200.
Then we search for an imperceptible backdoor signal $z$. Several options are available: for instance, we could use the last component of variation in the training data, as we did in Section 3.1. To demonstrate that the backdoor signal can contain some meaningful information, we instead define $z$ as the projection of an image on the trailing principal components of the training data (those containing the last fraction of its variance). In 5 independent experiments, we use 5 images from the “apple” class in the test set of CIFAR-100 (see Figure 5). We then compute the corruption threshold $\epsilon$ as the 99th percentile of $|z\,x|$ over the training data.
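The backdoor signal and corruption threshold can be computed as follows (a sketch; `var_fraction` is a placeholder for the exact variance fraction, which is not specified here):

```python
import numpy as np

def backdoor_signal(X, img, var_fraction=0.01):
    """Backdoor direction z: the (normalised) projection of `img` on the
    trailing principal components of X holding the last `var_fraction` of
    the variance, with the corruption threshold eps taken as the 99th
    percentile of |z @ x| over the centred training data.
    X: training data, one row per image."""
    mu = X.mean(axis=0)
    Xc = X - mu
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    var = S**2 / (S**2).sum()
    # number of trailing components jointly holding var_fraction of the variance
    m = np.searchsorted(np.cumsum(var[::-1]), var_fraction) + 1
    P = Vt[-m:]                          # (m, d) trailing principal directions
    z = P.T @ (P @ (img - mu))           # projection of the image on those directions
    z /= np.linalg.norm(z)
    eps = np.percentile(np.abs(Xc @ z), 99)
    return z, eps
```

Because z lives in a low-variance subspace, eps is small and perturbations of size a few eps along z remain imperceptible.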
For each backdoor signal $z$, we train a WRN-16-8 model on clean data $x$ and corrupted data $x + 3\,\epsilon\,z$, using the same hyperparameters as for the clean model. We call the resulting networks the corrupted models; they converged to very similar accuracies on clean data, suffering only a small performance hit compared to the clean model. To help the corrupted models converge, we initialized the corruption threshold at 10 times its final value and progressively decayed it over the first 50 epochs.
Finally, we compare the accuracies of the clean model and the 5 corrupted models on the corrupted test data, for a varying corruption threshold $\epsilon$ (each corrupted model being evaluated on its corresponding corruption signal $z$). Contrary to the clean model, the corrupted models have become extremely vulnerable to imperceptible perturbations along $z$ (see Figure 5).
There is an apparent contradiction in the vulnerability of state-of-the-art networks to adversarial examples: how can these models perform so well, if they are so sensitive to small perturbations of their inputs? The only possible explanation, as formulated by jo2017measuring, seems to be that “deep CNNs are not truly capturing abstractions in the dataset”. This explanation relies however on a strong, implicit assumption: the features used by a model to determine the class of a natural image and the features altered by adversarial perturbations are the same ones.
In the present paper, we showed that this assumption is not necessarily valid. All models rely on the use of primary features to make their decisions, but they can simultaneously be vulnerable to perturbations along distinct secondary features acting as backdoors. This result suggests an alternative explanation to that of jo2017measuring: adversarial examples are not necessarily informative of the primary features used by a model. Following (tanay2016boundary), we hypothesise that adversarial examples only affect the primary features of well-regularized models, while the existence of secondary features is a sign of overfitting. We leave the evaluation of this hypothesis to future work.
If designing models that are robust to small adversarial perturbations of their inputs has proven remarkably difficult, we showed here that the reverse problem—making models more vulnerable—is surprisingly easy. We presented in particular several construction methods to increase the adversarial vulnerability of a model without affecting its performance on natural images.
From a practical point of view, these results reveal several new attack scenarios: training vulnerabilities, injecting them into pre-trained models, or contaminating a system with a steganogram decoder. From a theoretical point of view, they provide new intuitions on the adversarial example phenomenon and suggest that adversarial perturbations should not be too readily interpreted as informative of the true features used by a model.
The authors wish to acknowledge the financial support of EPSRC and Rapiscan Systems. We thank Angus Galloway for detailed feedback on the first draft of the manuscript.
- Calibration inspired by (guo2017calibration). Before calibration, the median confidence is typically much closer to 100%.
- In this section, the feature space is the image space: $\mathcal{F} = \mathcal{I}$ and $W = \mathrm{Id}$.
- The construction process described bears some similarities with the flat boundary model of (moosavi2017analysis).
- The coefficient 3 (instead of 2) compensates for using the 99th percentile (instead of the max).