Detecting Patch Adversarial Attacks with Image Residuals
We introduce an adversarial sample detection algorithm based on image residuals, specifically designed to guard against patch-based attacks. The image residual is obtained as the difference between an input image and a denoised version of it, and a discriminator is trained to distinguish between clean and adversarial samples. More precisely, we use a wavelet-domain algorithm for denoising images and demonstrate that the obtained residuals act as a digital fingerprint for adversarial attacks. To emulate the limitations of a physical adversary, we evaluate the performance of our approach against localized (patch-based) adversarial attacks, including in settings where the adversary has complete knowledge about the detection scheme. Our results show that our proposed method generalizes to stronger attacks and reduces the success rate (and, correspondingly, increases the computational effort) of an adaptive attacker.
In the past decade, deep neural networks (DNNs) have been demonstrated to match and surpass human performance on image classification tasks and have become ubiquitous in machine learning. At the same time, DNNs have been shown to be very fragile to adversarial examples, in which a malicious user perturbs natural images such that they are misclassified by the model. A growing body of research investigates adversarial defense methods and their shortcomings [2, 3]. Anomaly detection is one of the two main lines of defense currently investigated; it is distinct from (but not necessarily in opposition to) robustness, where the goal is to recover the ground truth from the perturbed test sample. In our work, we develop a solution for the detection problem, with applications in security systems that can alert human operators and prompt intervention (e.g., security cameras, autonomous driving) when abnormal samples are detected.
Recent work has investigated adversarial patch attacks as a step towards physically realizable and robust threat models [4, 5]. Because they can be carefully designed in the digital domain and then deployed physically, adversarial patches represent a currently unsolved security threat. We focus on the task of detecting patch attacks at test time through the use of a detection block trained on a small subset of samples within the adversary's threat model. We introduce multiple threat models that follow the taxonomy of prior work, with the common characteristic that they are all confined to a digital patch attack: given a clean test sample, the adversary can only modify a contiguous, rectangular region of it, with a size up to a fixed fraction of the image. We consider both the cases of norm-bounded and unbounded adversaries operating on the patch.
Central to our approach is the idea of detecting adversarial samples based on image residuals, obtained as the difference between an input image and a denoised version of it. These residuals are used to train a secondary, much smaller and heavily regularized detection neural network. We experimentally demonstrate that the proposed method is robust and generalizes to different patch-based attacks, including attacks much stronger than the ones used to train the detector network. We show that this generalization does not occur for state-of-the-art detection methods that are not specifically designed for patch threat models. Very recent work on defenses against adversarial patches [6, 7] focuses only on the robustness problem, not detection, and produces certified guarantees. However, these approaches rely on a brute-force search that adds complexity at inference time. Thus, there is a need for specialized solutions for detecting patch adversarial attacks, and we hope our work provides a baseline in this direction. Source code is available at https://github.com/mariusarvinte/wavelet-patch-detection.
We carry out experiments on the CIFAR-10 dataset with a VGG-19 deep convolutional classifier architecture and show that current state-of-the-art detection methods do not generalize to different patch attacks, while our proposed solution does. The performance of our scheme is evaluated against full- and limited-knowledge adversaries that attempt to bypass the issue of zero gradients coming from the non-differentiable nature of the wavelet denoising operator, as well as adversaries that perform a brute-force search for the best patch location.
Summarized, our contributions are:
We introduce an adversarial sample detection algorithm based on image residuals, obtained as the difference between an image and a wavelet-filtered version of it.
We investigate the effectiveness of several patch adversarial attacks against our proposed detection scheme and two other existing detection schemes, showing that our approach generalizes where prior work does not.
We show that our approach resists high-confidence transfer attacks, even when the attacker trains a substitute detector and lowers the success rate of an adaptive attack.
II-A Image Residuals
Let f(x; θ) denote the output probabilities of a deep neural network classifier with weights θ, where C is the number of classes and x is an input image to the network, with its true label y. The network is trained to minimize its loss function L(x, y; θ); typically, this is chosen as the categorical cross-entropy between the predicted and true labels. Let Z(x) denote the output logits of the network, where we assume that the last layer uses a softmax activation and the relationship f(x; θ) = softmax(Z(x)) holds.
The image residual of x is defined as

r = x − D(x; σ),

where D is a denoising operator that takes as input the image x and produces a denoised version of it with the same dimensions. D is adjustable by the parameter σ, which represents an estimate of the noise power in the image: the larger σ is chosen, the more aggressive the denoising is, resulting in a smoother image.
For the rest of our work, we choose D to be a wavelet-based denoising operator. In particular, we use the adaptive Bayesian shrinkage algorithm, in which σ plays the role of the noise power. Once a residual r is obtained, it is passed to the detector network, parameterized by its own weights. If an adversarial sample is detected, an alarm is triggered; otherwise, the most likely class is predicted.
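As an illustration, the residual computation can be sketched in a few lines. This is a simplified stand-in: a single-level Haar transform with plain soft-thresholding replaces the adaptive Bayesian (BayesShrink) estimator that we actually use, and all function names here are illustrative only.

```python
import numpy as np

def haar_denoise(img, sigma):
    """Single-level 2-D Haar soft-threshold denoiser (stand-in for BayesShrink).

    `sigma` plays the same role of an assumed noise level; both image
    dimensions are assumed even.
    """
    fwd = lambda x: ((x[:, 0::2] + x[:, 1::2]) / 2.0,
                     (x[:, 0::2] - x[:, 1::2]) / 2.0)
    def inv(a, d):
        out = np.empty((a.shape[0], 2 * a.shape[1]))
        out[:, 0::2] = a + d
        out[:, 1::2] = a - d
        return out
    soft = lambda c: np.sign(c) * np.maximum(np.abs(c) - sigma, 0.0)

    a, d = fwd(img)            # row-wise transform: low / high bands
    ll, lh = fwd(a.T)          # column-wise transform of the low band
    hl, hh = fwd(d.T)          # ... and of the high band
    # shrink every detail sub-band, keep the approximation (ll) untouched
    a_rec = inv(ll, soft(lh)).T
    d_rec = inv(soft(hl), soft(hh)).T
    return inv(a_rec, d_rec)

def residual(img, sigma):
    """Image residual: input minus its denoised version."""
    return img - haar_denoise(img, sigma)
```

With sigma = 0 the thresholding is the identity and the residual vanishes; for a smooth image carrying a high-frequency patch, the residual energy concentrates inside the patch, which is exactly the fingerprint the detector is trained on.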
II-B Counteracting Blind Spots
Since our approach involves a block that relies on identifying anomalous high-frequency structure in images, we anticipate the following weakness: an attacker may intentionally employ very small patches, leaving no residual signature. An extreme case of this is the single-pixel attack, which falls in our threat model as a patch of size 1 × 1. Consequently, we augment our detector by taking inspiration from the logit margin loss formulation and the baseline out-of-distribution detection method of prior work, combining the classifier logits with the detector output into the logits of a joint classifier-detector.
The two-stage detection procedure is described in Table I. The core idea is to augment the residual-based detection by requiring that negatively labeled samples pass a confidence threshold in their predictions. A margin parameter plays this role: a sample is declared non-adversarial only if it bypasses the detector and if the difference between the two highest prediction logits is at least equal to the margin, implying that we require natural images to be confidently classified.
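The two-stage rule can be made concrete with a short sketch; the names `tau_det` and `kappa` are our own placeholders for the detector threshold and the logit margin, not fixed notation:

```python
import numpy as np

def two_stage_decision(residual_score, logits, tau_det, kappa):
    """Flag a sample as adversarial if EITHER stage trips.

    residual_score: detector output on the image residual (higher = more suspicious)
    logits:         classifier logits for the same sample
    tau_det:        threshold on the residual detector (assumed name)
    kappa:          required margin between the two largest logits (assumed name)
    """
    if residual_score > tau_det:
        return True                      # stage 1: residual detector fires
    top2 = np.sort(logits)[-2:]
    margin = top2[1] - top2[0]
    return margin < kappa                # stage 2: prediction not confident enough
```

A sample is accepted as natural only when both stages pass, which is what forces very small (e.g., single-pixel) patches to also produce a confident misclassification in order to evade detection.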
Recent work has investigated the frequency-domain signature of adversarial attacks and has concluded that adversarial training, one of the few unbroken defenses, decreases model sensitivity to high-frequency components of the input signal, implying that an undefended network will be perturbed by these components. We posit that this phenomenon is more pronounced for patch adversarial attacks, since distortions caused by the inserted patch will be visible in the residual, at least for an adversary that is unaware of the existence of a detection scheme. At the same time, adaptive adversaries can try to evade detection by generating smooth patches, but this reduces the effective dimension of the patch, ultimately lowering their success rate.
Figure 1 plots the magnitude of the residual averaged across test samples of the CIFAR-10 dataset in three cases: the original images, noisy versions of them, and unbounded projected gradient descent (PGD) adversarial samples corresponding to each original image. The average residual for the clean and noisy images presents high magnitude values around the center of the image, where edges are more likely to occur. For the adversarial examples, the patch is strongly localized in the residual, even when it is placed in the center, overtaking the residuals of semantic content in the image.
III Adversarial Attacks
III-A Patch Adversarial Attacks
We use the same definition of patch adversarial attacks as prior work: the patch is characterized by a two-dimensional binary mask M, in which only a contiguous, rectangular, two-dimensional region satisfies M_ij = 1, with M_ij = 0 otherwise. A patch adversarial sample is given by x' = (1 − M) ⊙ x + M ⊙ δ, where δ is the adversarial perturbation (norm-bounded in the restricted setting). For most attacks, we assume that the attacker randomly samples a mask location in the image, but we also investigate the effectiveness of an adversary that searches across all possible mask locations.
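A minimal sketch of the mask construction and patch application, using the LaVaN-style replacement convention (one common choice; the exact composition rule here is an assumption):

```python
import numpy as np

def patch_mask(h, w, top, left, ph, pw):
    """Binary mask M: 1 inside a contiguous ph-by-pw rectangle, 0 elsewhere."""
    m = np.zeros((h, w))
    m[top:top + ph, left:left + pw] = 1.0
    return m

def apply_patch(x, delta, m):
    """Patch adversarial sample: pixels inside the mask are replaced by the
    perturbation; pixels outside are left untouched."""
    return (1.0 - m) * x + m * delta
```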
III-A1 Projected Gradient Descent
PGD is an iterative attack that takes a series of steps in the gradient direction, each with size α, and projects the perturbation back onto the ε-ball if its norm exceeds ε. Additionally, PGD starts from a randomly perturbed point around the input. We use the untargeted L∞ version of masked PGD, with the inner step given by

x^{t+1} = Clip_{[0,1]}( x^t + α · M ⊙ sign(∇_x L(x^t, y; θ)) ),

where Clip_{[0,1]} projects each element to the interval [0, 1]. Note that by choosing ε = 1, this attack becomes unrestricted in the pixel space, assuming normalization of images to [0, 1].
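A numpy sketch of the masked PGD loop, with the model gradient abstracted behind a hypothetical `grad_fn` callback (all names and defaults here are illustrative):

```python
import numpy as np

def masked_pgd(x, grad_fn, mask, steps=10, alpha=0.05, eps=0.1):
    """Untargeted masked PGD sketch.

    grad_fn(x_adv) is assumed to return dL/dx for the classifier loss;
    only pixels where mask == 1 are ever modified.
    """
    x_adv = x.copy()
    for _ in range(steps):
        g = grad_fn(x_adv)
        x_adv = x_adv + alpha * mask * np.sign(g)   # masked gradient-sign step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # stay in the valid pixel range
    return x_adv
```

With eps = 1 the ball projection is vacuous for images in [0, 1], matching the unrestricted variant described above.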
III-A2 C&W Attack
This is an optimization-based attack that reparameterizes the perturbation through a change of variables, x + δ = (tanh(w) + 1)/2, and solves the optimization problem

min_w  ‖M ⊙ δ‖_2 + c · f(x + M ⊙ δ),  with  f(x') = max( max_{i≠t} Z(x')_i − Z(x')_t, −κ ),

where t is the target label, different from the correct class, and κ is a confidence parameter. The formulation in (4) includes two hyper-parameters: c controls the trade-off between the distance penalty and misclassification, while dropping the distance penalty allows us to run an unrestricted attack. When the distance penalty is dropped, the value of c does not matter in the optimization, except for influencing the learning rate, thus we set it to one.
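The two ingredients of the attack – the box-constraint change of variables and the logit-margin term f – can be sketched as follows (a reimplementation of the standard C&W formulas, not our exact attack code):

```python
import numpy as np

def box_reparam(w):
    """tanh change of variables: any real w maps into the open pixel range (0, 1),
    so no explicit clipping is needed during optimization."""
    return 0.5 * (np.tanh(w) + 1.0)

def cw_margin(logits, target, kappa):
    """C&W loss term f: largest non-target logit minus the target logit,
    floored at -kappa so the attack stops pushing once the desired
    confidence margin is reached."""
    other = np.max(np.delete(logits, target))
    return max(other - logits[target], -kappa)
```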
III-A3 Single-Pixel Attack
This is a powerful attack that only requires the probability outputs of the model and does not use gradient information. The attack uses a differential evolution algorithm to perturb a single pixel such that the image is misclassified. We evaluate our performance on this attack since we expect it to be a blind spot for our proposed definition of image residuals.
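A black-box sketch of the idea using SciPy's differential evolution; the model is abstracted as a hypothetical `target_prob` query, and the search space is just (row, column, new pixel value):

```python
import numpy as np
from scipy.optimize import differential_evolution

def one_pixel_attack(img, target_prob, maxiter=20, popsize=15, seed=0):
    """Single-pixel attack sketch: gradient-free search for the one pixel
    whose modification maximizes the target-class probability.

    `target_prob(x)` is a placeholder for a probability query to the model.
    """
    h, w = img.shape
    def objective(z):
        r, c, v = int(z[0]), int(z[1]), z[2]
        x = img.copy()
        x[r, c] = v                       # perturb exactly one pixel
        return -target_prob(x)            # DE minimizes, so negate
    bounds = [(0, h - 1e-6), (0, w - 1e-6), (0.0, 1.0)]
    res = differential_evolution(objective, bounds, maxiter=maxiter,
                                 popsize=popsize, seed=seed)
    return res.x, -res.fun
```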
(Table II columns: attack, attack success rate [%], ROC-AUC, AP, and FPR at fixed TPR.)
III-B Existing Detection Algorithms
We compare the performance of our method with two existing algorithms. The baseline approach uses the probability of the predicted class as a discriminant for in- and out-of-distribution samples. While simple and not intended for adversarial sample detection, we borrow from this idea to impose that clean samples pass a confidence limit, as previously described. This method does not require training.
Local Intrinsic Dimensionality (LID) is a powerful detection method against non-adaptive adversaries that extracts a set of statistics for each test sample by computing distances to its k nearest neighbors in the training set. This method requires training on adversarial samples and claims generalization properties, thus we choose it as a comparison.
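For reference, the LID estimator itself is a one-liner over the k-nearest-neighbor distances (our sketch of the maximum-likelihood estimate used by Ma et al., not their released code):

```python
import numpy as np

def lid_estimate(knn_dists):
    """Maximum-likelihood LID estimate from a point's k-nearest-neighbor
    distances: -k / sum_i log(d_i / d_max)."""
    d = np.sort(np.asarray(knn_dists, dtype=float))
    return -len(d) / np.sum(np.log(d / d[-1]))
```

For data uniformly distributed in an m-dimensional neighborhood, the estimate recovers roughly m, which is why adversarial samples sitting off the data manifold produce anomalous LID statistics.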
Our primary purpose in comparing to these methods is to show that, when faced with a black-box patch adversary, their detection performance degrades either against a novel type of attack or against larger perturbations. Note that, to the best of our knowledge, there is only one other prior work that explicitly targets adversarial patch detection, but its source code is not publicly available, thus we omit it from the comparison.
IV Performance Results
IV-A Training Details
We train a VGG-19 classifier on the CIFAR-10 dataset, retaining a portion of the training samples for validation, and measure its clean test accuracy. We further split our original validation set into a new training and validation set, using the training set to train the detector and the validation set to pick the best wavelet denoising parameter σ via a hyper-parameter search. The training data for our detector consists of adversarial samples generated with an untargeted PGD attack, retaining only the successful attacks. The negative class also includes noisy versions of the training samples, with discrete uniform noise (before any scaling) added independently to each pixel value. The negative class validation data for the detector consists of clean and noisy images. The positive class validation data consists of successful adversarial samples generated with an untargeted patch PGD attack.
The patch location is randomly selected as a rectangle with side lengths drawn from a fixed range of pixels, placed uniformly at random at a location in the sample such that the entire patch is present in the image. The architecture of the detector is a three-hidden-layer neural network, with two convolutional layers and one fully-connected layer. We use weight regularization during training to avoid over-fitting the detector to the training data. We use fixed values for these hyper-parameters throughout our experiments.
IV-B Black-Box Attacks
All attacks in this and subsequent sections, except for the single-pixel attack, have complete knowledge about the classifier architecture and weights, but are unaware that there is a detection method in place.
IV-B1 PGD Attack
We generate patch locations at random and, for each location, attack randomly chosen test images. We compare the performance of our algorithm with LID and the baseline approach against a black-box PGD adversary. We perform a parameter search to find the best value of k (the number of nearest neighbors) and the remaining LID hyper-parameters, using a fixed batch size and the same training and validation data used for our approach. The average performance results are shown in Table II, where it can be seen that our proposed approach has better generalization properties when testing on different attack types and strengths. In particular, previous methods fail to identify the localized changes introduced by an adversary and exhibit a very high false positive rate. Our method shows a trend opposite to the others: weaker attacks are harder to detect. For a fair comparison, we also include the success rate of the PGD attack, where it can be seen that it is much lower for a norm-bounded restricted adversary – thus the absolute number of missed detections is also lower.
IV-B2 C&W Attack
For the norm-restricted C&W attack, we perform a binary search for c. In both attacks, a square patch is randomly placed at a location of the image, and we optimize the objective in (4) with an Adam optimizer, running a fixed number of iterations for each step of the binary search. We pick correctly classified images from the test set and run a targeted attack towards a random class different from the ground truth. For each image, we test multiple patch locations, using a fixed confidence threshold. The results are summarized in Table III. We note that constraining the patch attack implicitly helps it bypass detection, since a more blended patch is generated, and our method explicitly relies on the saliency of the perturbed region. For attacks that bypass detection, we also report the average norm of the perturbation for the restricted and unrestricted attacks.
IV-B3 Brute-Force PGD Attack
We consider an adversary that suspects that there is a detection method in place, but has no information (and does not wish to make any assumptions) about – nor can they query – the detector output. We place this adversary in the black-box category, even though they are borderline gray-box. A feasible attack strategy in this case is to brute-force the location of the patch in the image (e.g., spamming a face tagging system using clone accounts). We evaluate the worst-case performance of the detector against this adversary: if even a single location in an image leads to a missed detection, we consider the entire image compromised. The results are shown in Fig. 2, where it can be seen that using the default detection threshold leads to a reduced worst-case detection rate. Increasing the threshold increases the false positive rate, but ensures that, on average, half of the test samples are protected against all possible patch locations attempted by a black-box adversary.
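The worst-case evaluation criterion can be sketched directly; `attack_fn` and `detect_fn` are placeholders for the patch attack and the detection pipeline:

```python
def worst_case_protected(x, locations, attack_fn, detect_fn):
    """An image counts as protected only if EVERY candidate patch location
    is detected; a single missed detection compromises the whole image."""
    for top, left in locations:
        x_adv = attack_fn(x, top, left)   # patch placed at this location
        if not detect_fn(x_adv):
            return False                  # one miss = image compromised
    return True
```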
IV-B4 Single-Pixel Attack
We test the performance of our approach against the single-pixel attack. We run a targeted attack on correctly classified images in the test set, targeting each other possible class in turn, with the differential evolution run for a fixed number of iterations (generations) and population size. A number of attacks are successful in finding an adversarial example, and only two of them evade detection when the confidence threshold is used. Interestingly, even without using a confidence threshold, a large fraction of the successful single-pixel attacks are detected by the residual detector itself.
IV-C Adaptive (White-Box) Attacks
We evaluate the robustness of the residual detection method against a C&W adversary that has complete knowledge of the model, including the parameter σ used for performing wavelet denoising, the confidence threshold, and the detector weights. Since the wavelet denoising block is non-differentiable, we apply the straight-through approach to estimate the hidden gradients, by exactly computing the residual during the forward pass and approximating its gradient with unit value during the backward pass. The wavelet denoising operation takes up a large part of the complexity of this attack; for this reason, we perform it only once every five iterations, since we find that this does not hinder optimization.
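In an autodiff framework, the straight-through trick amounts to a custom backward pass. A PyTorch sketch, with an average-pooling blur standing in for the (non-differentiable) wavelet denoiser:

```python
import torch
import torch.nn.functional as F

class ResidualSTE(torch.autograd.Function):
    """Exact residual in the forward pass; identity (straight-through)
    gradient in the backward pass."""
    @staticmethod
    def forward(ctx, x, denoise):
        return x - denoise(x)          # denoiser runs outside the autograd graph
    @staticmethod
    def backward(ctx, grad_out):
        return grad_out, None          # pretend d(residual)/dx = identity

def blur(x):                           # stand-in denoiser, NOT the wavelet one
    with torch.no_grad():
        return F.avg_pool2d(x, 3, stride=1, padding=1)

x = torch.rand(1, 3, 8, 8, requires_grad=True)
r = ResidualSTE.apply(x, blur)
r.sum().backward()                     # gradient flows straight through
```

In the actual attack, the forward pass computes the true wavelet residual; only its gradient is replaced by the identity.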
We run our attacks against the full detector on a set of correctly classified test images, with a targeted attack towards another random label. We run a fixed number of iterations per binary-search step for the restricted attack, and a fixed number of iterations for the unrestricted attack. Table IV presents the results in terms of success rate and average and worst-case distances, for the norm-constrained and unrestricted white-box attacks. We note that the average success rate is decreased when compared to a black-box adversary, and the worst-case distortion increases as well. Finally, the worst-case performance counts a sample as compromised if at least one of the patch locations bypasses detection. The success rate of this attacker is high, but comes at an increased distortion cost. One caveat here is that we did not search across all possible patch locations, but only a subset of them, meaning it is likely possible to increase this success rate further and to approach the true worst-case value of the required distortion.
IV-D Gray-Box Attacks
Finally, we investigate transferability by assuming that an adversary has complete knowledge about the datasets used to train and validate the detection scheme, as well as the hard labels output by the detector during the training – but not the testing – phase. We train a deep convolutional network with four convolutional layers as a substitute model to mimic the detector, trained on the same training set and the predicted labels of the detector, with the same weight regularization. Then, we generate high-confidence white-box adversarial examples for the substitute classifier-detector ensemble. Training the substitute model is successful on the same data used by the detector. When testing, we generate square patches for the test images and find that the attack transfers to the classifier itself at a much higher rate than to the classifier-detector ensemble. We thus conclude that our method resists the transfer of high-confidence examples.
We have investigated the problem of detecting adversarial samples generated by patch adversarial attacks, in an attempt to more closely match threat models that may arise in practical situations. Our proposed solution uses the residual high-frequency content of an image to distinguish between clean and attacked samples. We have experimentally shown that our method generalizes to strong black-box adversaries, resists transfer attacks, and decreases the success rate of white-box adversaries. Upon visually inspecting the images output by an adaptive adversary, we make one interesting observation: even though the required distortion increases, the patches have smoother textures and color gradients. Thus, these represent almost natural adversarial examples that bypass our wavelet-based scheme. Future research directions include combining our detector with other criteria to eliminate these blind spots.
-  I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.
-  A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” in International Conference on Machine Learning, 2018, pp. 274–283.
-  N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, 2017, pp. 3–14.
-  T. Brown, D. Mane, A. Roy, M. Abadi, and J. Gilmer, “Adversarial patch,” 2017. [Online]. Available: https://arxiv.org/pdf/1712.09665.pdf
-  D. Karmon, D. Zoran, and Y. Goldberg, “Lavan: Localized and visible adversarial noise,” in International Conference on Machine Learning, 2018, pp. 2507–2515.
-  P.-Y. Chiang, R. Ni, A. Abdelkader, C. Zhu, C. Studer, and T. Goldstein, “Certified defenses for adversarial patches,” in International Conference on Learning Representations, 2020. [Online]. Available: https://openreview.net/forum?id=HyeaSkrYPH
-  A. Levine and S. Feizi, “(de)randomized smoothing for certifiable defense against patch attacks,” 2020.
-  A. Krizhevsky et al., “Learning multiple layers of features from tiny images,” 2009.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  H. A. Chipman, E. D. Kolaczyk, and R. E. McCulloch, “Adaptive bayesian wavelet shrinkage,” Journal of the American Statistical Association, vol. 92, no. 440, pp. 1413–1421, 1997.
-  J. Su, D. V. Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,” IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019.
-  D. Hendrycks and K. Gimpel, “A baseline for detecting misclassified and out-of-distribution examples in neural networks,” arXiv preprint arXiv:1610.02136, 2016.
-  D. Yin, R. G. Lopes, J. Shlens, E. D. Cubuk, and J. Gilmer, “A fourier perspective on model robustness in computer vision,” in Advances in Neural Information Processing Systems, 2019, pp. 13255–13265.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” arXiv preprint arXiv:1706.06083, 2017.
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in 2017 ieee symposium on security and privacy (sp). IEEE, 2017, pp. 39–57.
-  X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle, and J. Bailey, “Characterizing adversarial subspaces using local intrinsic dimensionality,” arXiv preprint arXiv:1801.02613, 2018.
-  E. Chou, F. Tramèr, G. Pellegrino, and D. Boneh, “Sentinet: Detecting physical attacks against deep learning systems,” arXiv preprint arXiv:1812.00292, 2018.