Defending Against Universal Attacks Through Selective Feature Regeneration
Deep neural network (DNN) predictions have been shown to be vulnerable to carefully crafted adversarial perturbations. Specifically, image-agnostic (universal adversarial) perturbations added to any image can fool a target network into making erroneous predictions. Departing from existing defense strategies that work mostly in the image domain, we present a novel defense which operates in the DNN feature domain and effectively defends against such universal perturbations. Our approach identifies pre-trained convolutional features that are most vulnerable to adversarial noise and deploys trainable feature regeneration units which transform these DNN filter activations into resilient features that are robust to universal perturbations. Regenerating only the top 50% adversarially susceptible activations in at most 6 DNN layers and leaving all remaining DNN activations unchanged, we outperform existing defense strategies across different network architectures by more than 10% in restored accuracy. We show that without any additional modification, our defense trained on ImageNet with one type of universal attack examples effectively defends against other types of unseen universal attacks.
Despite the continued success and widespread use of DNNs in computer vision tasks [25, 59, 62, 18, 55, 54, 58, 68], these networks make erroneous predictions when a small magnitude, carefully crafted perturbation (adversarial noise) almost visually imperceptible to humans is added to an input image [63, 15, 35, 6, 24, 41, 48, 26, 49]. Furthermore, such perturbations have been successfully placed in a real-world scene via physical adversarial objects [3, 12, 26], thus posing a security risk.
Most existing adversarial attacks use target network gradients to construct an image-dependent adversarial example [63, 15, 26, 41, 49, 6] that has limited transferability to other networks or images [63, 32, 47]. Other methods for generating image-dependent adversarial samples include accessing only the network predictions [20, 46, 61], using surrogate networks, and approximating gradients. Although there is significant prior work on adversarial defenses, such as adversarial training [63, 15, 35, 66], ensemble training, randomized image transformations and denoising [16, 52, 10, 40, 60, 33, 31], and adversarial sample rejection [29, 34, 67, 36, 37], a DNN is still vulnerable to adversarial perturbations added to a non-negligible portion of the input [2, 65]. These defenses mostly focus on making a DNN robust to image-dependent adversarial perturbations, which are less likely to be encountered in realistic vision applications [1, 45].
Our proposed work focuses on defending against universal adversarial attacks. Unlike the aforementioned image-dependent adversarial attacks, universal adversarial attacks [38, 44, 43, 51, 23, 45, 53, 42, 30] construct a single image-agnostic perturbation that, when added to any unseen image, fools DNNs into making erroneous predictions with very high confidence. These universal perturbations are also not unique, and many adversarial directions may exist in a DNN's feature space (Figure 1, row 2) [39, 14, 13]. Furthermore, universal perturbations generated for one DNN can transfer to other DNNs, making them doubly universal. Such image-agnostic perturbations pose a strong realistic threat model for many vision applications, as perturbations can easily be pre-computed and then inserted in real time (in the form of a printed adversarial patch or sticker) into any scene [28, 5]. For example, during semantic segmentation, such image-agnostic perturbations can completely hide a target class (e.g., pedestrians) in the resulting segmented scene output and adversely affect the braking action of an autonomous car.
We show the existence of a set of vulnerable convolutional filters that are largely responsible for the erroneous predictions made by a DNN in an adversarial setting, and show that the ℓ1-norm of the convolutional filter weights can be used to identify such filters.
Unlike existing image-domain defenses, our proposed DNN feature-space defense uses trainable feature regeneration units, which regenerate the activations of the aforementioned vulnerable convolutional filters into resilient features (adversarial noise masking).
A fast method is proposed to generate strong synthetic adversarial perturbations for training.
Without any additional attack-specific training, our defense trained on one type of universal attack effectively defends against different unseen universal attacks [44, 43, 51, 45, 23, 42] (Figure 1); we are the first to show such broad generalization across different universal attacks.
2 Related Work
Adversarial training (Adv. tr.) [63, 15, 35] has been shown to improve DNN robustness to image-dependent adversarial attacks by augmenting the training stage with adversarial attack examples, which are computed on the fly for each mini-batch using gradient ascent to maximize the DNN's loss. The robustness of adversarial training to black-box attacks can be improved by using perturbations computed against different target DNNs chosen from an ensemble of DNNs. Kannan et al. scale adversarial training to ImageNet by encouraging the adversarial loss to match logits for pairs of adversarial and perturbation-free images (logit pairing), but this latter method fails against stronger iterative attacks. In addition to adversarially training the baseline DNN, prior works further improved DNN robustness to image-dependent attacks by denoising intermediate DNN feature maps, either through a non-local mean denoiser (feature denoising) or a denoising autoencoder (fortified nets). Although Xie et al. report effective robustness against a strong PGD attack evaluated on ImageNet, the additional non-local mean denoisers add only a 4% improvement over a DNN trained using standard adversarial training.
Compared to feature denoising (FD), the proposed feature regeneration approach has the following differences: (1) our feature regeneration units are not restricted to performing denoising, but consist of stacks of trainable convolutional layers that give our defense the flexibility to learn an appropriate feature-restoration transform that effectively defends against universal attacks, unlike the non-local mean denoiser used in FD; (2) in a selected DNN layer, only the subset of feature maps that are most susceptible to adversarial noise (identified by our ranking metric) is regenerated, leaving all other feature maps unchanged, whereas FD denoises all feature maps, which can result in over-correction or introduce unwanted artifacts in feature maps that admit very low-magnitude noise; (3) instead of adversarially training all the parameters of the baseline DNN as in FD, we train only the parameters of the feature regeneration units (up to 90% fewer parameters than a baseline DNN) and leave all parameters of the baseline DNN unchanged, which can speed up training and reduce the risk of over-fitting.
Image-domain defenses mitigate the impact of adversarial perturbations by applying non-differentiable transformations to the input, such as image compression [10, 8, 33], frequency-domain denoising, and image quilting and reconstruction [16, 40]. However, such approaches introduce unnecessary artifacts in clean images, resulting in accuracy loss. Prakash et al. propose a two-step defense that first performs random local pixel redistribution, followed by wavelet denoising. Liao et al. append a denoising autoencoder at the input of the baseline DNN and train it using a reconstruction loss that minimizes the error between higher-layer representations of the DNN for an input pair of clean and denoised adversarial images (high-level guided denoiser). Another popular line of defenses explores the idea of first detecting an adversarially perturbed input and then either abstaining from making a prediction or further pre-processing the adversarial input for reliable predictions [29, 34, 67, 36, 37].
All of the aforementioned defenses are geared towards image-specific gradient-based attacks, and none of them has, as of yet, been shown to defend against image-agnostic attacks. Initial attempts at improving robustness to universal attacks involved modelling the distribution of such perturbations [38, 17, 50], followed by model fine-tuning over this distribution of universal perturbations. However, the robustness offered by these methods has been unsatisfactory [45, 38], as the retrained network ends up overfitting to the small set of perturbations used. Extending adversarial training from image-dependent attacks to universal attacks has also been attempted. Ruan and Dai use additional shadow classifiers to identify and reject images perturbed by universal perturbations. Akhtar et al. propose a defense against the universal adversarial perturbations (UAP) attack, using a detector that identifies adversarial images and then denoises them using a learnable Perturbation Rectifying Network (PRN).
3 Universal threat model
Let μ represent the distribution of clean (unperturbed) images in ℝ^d, and let C(·) be a classifier that predicts a class label C(x) for an image x. The universal adversarial perturbation attack seeks a perturbation vector v ∈ ℝ^d under the following constraints:

P_{x∼μ}( C(x + v) ≠ C(x) ) ≥ δ   subject to   ‖v‖_p ≤ ξ

where P(·) denotes probability, ‖·‖_p is the ℓ_p-norm with p ∈ [1, ∞), δ ∈ (0, 1] is the target fooling ratio (i.e., the fraction of samples in μ that change labels when perturbed by the adversary), and ξ controls the magnitude of adversarial perturbations.
4 Feature-Domain Adversarial Defense
4.1 Stability of Convolutional Filters
In this work, we assess the vulnerability of individual convolutional filters and show that, for each layer, certain filter activations are significantly more disrupted than others, especially in the early layers of a DNN.
For a given layer, let φ_m(x) be the output (activation map) of the m-th convolutional filter with kernel weights w_m for an input x. Let e_m(u) be the additive noise (perturbation) that is caused in the output activation map as a result of applying an additive perturbation u to the input x. It can be shown (refer to Supplementary Material) that e_m(u) is bounded as follows:

‖e_m(u)‖_∞ ≤ ‖u‖_∞ ‖w_m‖_1    (2)

where, as before, ‖·‖_p is the ℓ_p-norm with p ∈ [1, ∞). Equation 2 shows that the ℓ1-norm of the filter weights can be used to identify and rank convolutional filter activations in terms of their ability to restrict perturbations in their activation maps. For example, filters with a small weight ℓ1-norm produce insignificantly small perturbations in their output when their input is perturbed, and are thus considered less vulnerable to perturbations in the input. For an ℓ∞-norm universal adversarial input, Figure 2(a) shows the upper bound on the adversarial noise in ranked (using the proposed ℓ1-norm ranking) conv-1 filter activations of CaffeNet and GoogLeNet, while Figure 2(b) shows the corresponding observed ℓ∞-norm of the adversarial noise in the respective DNN filter activations. We can see that our ℓ1-norm-based ranking correlates well with the degree of perturbation (maximum magnitude of the noise) that is induced in the filter outputs. Similar observations can be made for other convolutional layers in the network.
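As a concrete illustration of this ranking, here is a minimal NumPy sketch (the array shapes and function names are ours, not from the paper's Caffe implementation):

```python
import numpy as np

def rank_filters_by_l1(weights):
    """Rank convolutional filters by the l1-norm of their kernel weights.

    weights: array of shape (num_filters, in_channels, kH, kW).
    Returns (order, l1_norms), where `order` lists filter indices from
    largest to smallest l1-norm, i.e., most to least susceptible.
    """
    l1_norms = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    order = np.argsort(-l1_norms)
    return order, l1_norms

# Toy example: 4 hypothetical conv-1 filters with 3 input channels
rng = np.random.default_rng(0)
w = rng.standard_normal((4, 3, 3, 3))
order, norms = rank_filters_by_l1(w)
```

Filters at the front of `order` are the ones whose activations admit the largest adversarial perturbation, per the bound above.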
In Figure 3, we evaluate the impact of masking the adversarial noise in such ranked filters on the overall top-1 accuracy of CaffeNet, VGG-16 and GoogLeNet. Specifically, we randomly choose a subset of 1000 images (1 image per class) from the ImageNet training set and generate adversarially perturbed images by adding an ℓ∞-norm universal adversarial perturbation. The top-1 accuracies for perturbation-free images are 0.58, 0.70 and 0.69 for CaffeNet, GoogLeNet and VGG-16, respectively. Similarly, the top-1 accuracies for adversarially perturbed images of the same subset are 0.10, 0.25 and 0.25 for CaffeNet, GoogLeNet and VGG-16, respectively. Masking the adversarial perturbations in the 50% most vulnerable filter activations significantly improves DNN performance, resulting in top-1 accuracies of 0.56, 0.68 and 0.67 for CaffeNet, GoogLeNet and VGG-16, respectively, which validates our proposed selective feature regeneration scheme. See Figure 1 in the Supplementary Material for similar experiments on higher layers.
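The masking step of this experiment amounts to an oracle that swaps the most susceptible channels back to their clean values; a hedged NumPy illustration (names and shapes are ours):

```python
import numpy as np

def mask_top_ranked(clean_acts, adv_acts, l1_norms, frac=0.5):
    """Oracle noise masking: for the `frac` fraction of channels with the
    largest kernel l1-norms, replace the adversarial activations with the
    clean ones; leave all other channels untouched.

    clean_acts, adv_acts: arrays of shape (C, H, W); l1_norms: shape (C,).
    """
    num_masked = int(round(frac * len(l1_norms)))
    top = np.argsort(-l1_norms)[:num_masked]
    out = adv_acts.copy()
    out[top] = clean_acts[top]
    return out
```

Passing the masked activations to the remainder of the network gives the recovered accuracies reported above.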
4.2 Resilient Feature Regeneration Defense
Our proposed defense is illustrated in Figure 4. We learn a task-driven feature restoration transform (i.e., feature regeneration unit) for convolutional filter activations severely disrupted by adversarial input. Our feature regeneration unit does not modify the remaining activations of the baseline DNN. A similar approach of learning corrective transforms to make networks more resilient to image blur and additive white Gaussian noise has been explored previously.
Let S_l represent the set consisting of the indices of the convolutional filters in the l-th layer of a DNN. Furthermore, let B_l ⊆ S_l be the set of indices of the filters we wish to regenerate (Section 4.1) and let the complement S_l \ B_l be the set of indices of the filters whose activations are not regenerated. If Φ_{B_l}(x) represents the convolutional filter outputs to be regenerated in the l-th layer, then our feature regeneration unit in layer l performs a feature regeneration transform D_l under the following conditions:

D_l(Φ_{B_l}(x + u)) ≈ Φ_{B_l}(x)    (3)

D_l(Φ_{B_l}(x)) ≈ Φ_{B_l}(x)    (4)

where x is the unperturbed input to the l-th layer of convolutional filters and u is an additive perturbation that acts on x. In Equations 3 and 4, ≈ denotes similarity based on classification accuracy, in the sense that features are restored so as to regain the classification accuracy obtained with the original perturbation-free activation map. Equation 3 forces D_l to pursue task-driven feature regeneration that restores the lost accuracy of the DNN, while Equation 4 ensures that prediction accuracy on unperturbed activations is not decreased, without any additional adversarial perturbation detector. We implement D_l (i.e., the feature regeneration unit) as a shallow residual block, consisting of two stacked 3×3 convolutional layers sandwiched between a couple of 1×1 convolutional layers and a single skip connection. D_l is estimated using a target loss from the baseline network through backpropagation (see Figure 4), but with significantly fewer trainable parameters compared to the baseline network.
Given an L-layer DNN, pre-trained for an image classification task, the network can be represented as a function F that maps a network input x to an N-dimensional output label vector as follows:

F(x) = f_L( f_{L-1}( ... f_2( f_1(x) ) ... ) )

where f_l is a mapping function (a set of convolutional filters, typically followed by a non-linearity) representing the l-th DNN layer, and N is the dimensionality of the DNN's output (i.e., the number of classes). Without any loss of generality, the resulting DNN F_def after deploying a feature regeneration unit D_l that operates on the set of filters represented by B_l in layer l is given by:

F_def(x) = f_L( ... f_{l+1}( f'_l( f_{l-1}( ... f_1(x) ... ) ) ) ... )

where f'_l represents the new mapping function for layer l, such that D_l regenerates only the activations of the filter subset B_l and all the remaining filter activations are left unchanged. If D_l is parameterized by θ_l, then the feature regeneration unit can be trained by minimizing:

J(θ_l) = (1/T) Σ_{k=1}^{T} L( y_k, F_def(x_k) )    (7)

where L is the same target loss function as that of the baseline DNN (e.g., cross-entropy classification loss), y_k is the target output label for the input image x_k, and T represents the total number of images in the training set, consisting of both clean and perturbed images. As we use both clean and perturbed images during training, in Equation 7, x_k represents a clean or an adversarially perturbed image.
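The selective wiring of a deployed feature regeneration unit — transform only a chosen channel subset, pass everything else through — can be sketched as follows (the stand-in transform below is just a placeholder; the actual unit is the trainable residual block described above):

```python
import numpy as np

def regenerate_layer(acts, regen_idx, regen_fn):
    """Deploy a feature regeneration unit on one layer's activations:
    only the channels in `regen_idx` are passed through the regeneration
    transform `regen_fn`; every other channel is left unchanged.

    acts: (C, H, W) filter activations; regen_fn: (k, H, W) -> (k, H, W).
    """
    out = acts.copy()
    out[regen_idx] = regen_fn(acts[regen_idx])
    return out

# Stand-in transform (the real unit is a trainable residual block):
# here we simply clip large activations as a placeholder.
denoise_stub = lambda a: np.clip(a, -1.0, 1.0)

acts = np.array([[[3.0]], [[0.5]], [[-2.0]]])       # 3 channels, 1x1 maps
out = regenerate_layer(acts, [0, 2], denoise_stub)  # regenerate channels 0, 2
```

In training, only the parameters inside `regen_fn` would receive gradients; the baseline DNN's weights stay frozen.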
In Figure 5, we visualize DNN feature maps perturbed by various universal perturbations and the corresponding feature maps regenerated by our feature regeneration units, which are trained only on UAP attack examples. Compared to the perturbation-free feature map (clean), the corresponding feature maps for adversarially perturbed images (Row 1) have distinctly visible artifacts that reflect the universal perturbation pattern across major parts of the image. In comparison, feature maps regenerated by our feature regeneration units (Row 2) effectively suppress these adversarial perturbations, preserve the object-discriminative attributes of the clean feature map, and are also robust to unseen attacks (e.g., NAG, GAP and sPGD), as illustrated in Figure 5 and Table 5.
4.3 Generating Synthetic Perturbations
Training-based approaches are susceptible to data overfitting, especially when the training data is scarce or lacks adequate diversity. Generating a diverse set of adversarial perturbations (on the order of 100) using existing attack algorithms (e.g., [38, 44, 51, 45]), in order to avoid overfitting, can be computationally prohibitive. We propose a fast method (Algorithm 1) to construct synthetic universal adversarial perturbations from a small set V of adversarial perturbations computed using any existing universal attack generation method ([38, 44, 51, 45]). Starting with the synthetic perturbation v_syn set to zero, we iteratively select a random perturbation v_r ∈ V and a random scale factor α, and update v_syn as follows:

v_syn^(t+1) = v_syn^(t) + α v_r    (8)

where t is the iteration number. This process is repeated until the ℓ2-norm of v_syn exceeds a threshold, which we set to be the minimum ℓ2-norm of the perturbations in the set V.
Unlike the approach of Akhtar et al., which uses an iterative random walk along pre-computed adversarial directions, the proposed algorithm has two distinct advantages: 1) the same algorithm can be used for different types of attack norms without any modification, and 2) Equation 8 (Step 5 in Algorithm 1) automatically ensures that the norm of the synthetic perturbation does not violate the constraint for an ℓ_p-norm attack (i.e., ℓ_p-norm ≤ ξ), and, therefore, no additional steps, such as computing a separate perturbation unit vector and ensuring that the resultant perturbation strength is less than ξ, are needed.
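A minimal sketch of the synthesis procedure above, assuming a uniform (0, 1] scale factor and an ℓ2 stopping threshold (variable names are ours):

```python
import numpy as np

def synthesize_perturbation(V, rng):
    """Build a synthetic universal perturbation from a small set V of
    precomputed universal perturbations (sketch of Algorithm 1).

    Starting from zero, repeatedly add a randomly chosen, randomly scaled
    member of V until the l2-norm of the running sum exceeds the threshold,
    set to the minimum l2-norm over V. The uniform (0, 1) scale factor is
    an assumption of this sketch.
    """
    threshold = min(np.linalg.norm(v) for v in V)
    v_syn = np.zeros_like(V[0])
    while np.linalg.norm(v_syn) <= threshold:
        v_r = V[rng.integers(len(V))]      # random precomputed perturbation
        alpha = rng.uniform(0.0, 1.0)      # random scale factor
        v_syn = v_syn + alpha * v_r
    return v_syn

rng = np.random.default_rng(0)
V = [rng.standard_normal(16) for _ in range(3)]  # toy stand-in "attack" set
v_syn = synthesize_perturbation(V, rng)
```

Each call yields a different perturbation, so a large, diverse training pool can be generated cheaply from a handful of expensive attack runs.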
We use the ImageNet validation set (ILSVRC2012) with all 50000 images and single-crop evaluation (unless specified otherwise) in our experiments. All our experiments are implemented using Caffe, and for each tested attack we use the publicly provided code. We report our results in terms of top-1 accuracy and the restoration accuracy proposed by Akhtar et al. Given a set D_c containing clean images and a set D_{p/c} containing clean and perturbed images in equal numbers, the restoration accuracy is given by:

Restoration accuracy = acc(D_{p/c}) / acc(D_c)    (9)

where acc(·) is the top-1 accuracy. We use the universal adversarial perturbation (UAP) attack for evaluation (unless specified otherwise) and compute 5 independent universal adversarial test perturbations per network using a set of 10000 held-out images randomly chosen from the ImageNet training set, with the fooling ratio for each perturbation lower-bounded at 0.8 on the held-out images and the maximum normalized inner product between any two perturbations for the same DNN upper-bounded at 0.15.
5.1 Defense Training Methodology
In our proposed defense (Figure 4), only the parameters for feature regeneration units have to be trained and these parameters are updated to minimize the cost function given by Equation 7.
Although we expect the prediction performance of defended models to improve with higher regeneration ratios (i.e., the fraction of convolutional filter activations regenerated), we only regenerate 50% of the convolutional filter activations in a layer and limit the number of deployed feature regeneration units (1 per layer) to at most 6, as these empirical settings effectively recover the lost prediction performance without drastically increasing the training and inference cost.
5.2 Analysis and Comparisons
Robustness across DNN Architectures
Top-1 accuracy of adversarially perturbed test images for various DNNs (no defense) and our proposed defense for respective DNNs is reported in Table 1 under both white-box (same network used to generate and test attack) and black-box (tested network is different from network used to generate attack) settings. As universal adversarial perturbations can be doubly universal, under a black-box setting, we evaluate a target DNN defense (defense is trained for attacks on target DNN) against a perturbation generated for a different network. Top-1 accuracy for baseline DNNs is severely affected by both white-box and black-box attacks, whereas our proposed defense is not only able to effectively thwart the white-box attacks but is also able to generalize to attacks constructed for other networks without further training (Table 1). Since different DNNs can share common adversarial directions in their feature space, our feature regeneration units learn to regularize such directions against unseen data and, consequently, to defend against black-box attacks.
Robustness across Attack Norms
ℓ∞-norm attack:

| Method | CaffeNet (original acc. 56.4%) | VGG-F (58.4%) | GoogLeNet (68.6%) | VGG-16 (68.4%) | Res152 (79.0%) |
|---|---|---|---|---|---|
| JPEG comp. | 0.554 | 0.697 | 0.830 | 0.693 | 0.670 |
| Feat. Distill. | 0.671 | 0.689 | 0.851 | 0.717 | 0.676 |
| Adv. tr. | n/a | n/a | n/a | n/a | 0.778 |

ℓ2-norm attack (perturbation strength 2000):

| Method | CaffeNet | VGG-F | GoogLeNet | VGG-16 | Res152 |
|---|---|---|---|---|---|
| Adv. tr. | n/a | n/a | n/a | n/a | 0.778 |
Here, we evaluate defense robustness against both ℓ∞-norm and ℓ2-norm UAP attacks. Since an effective defense must not only recover the DNN accuracy on adversarial images but must also maintain a high accuracy on clean images, we use restoration accuracy (Equation 9) to measure adversarial defense robustness (Tables 2 and 3). While Akhtar et al. (PRN and PRN+det) only report defense results for the UAP attack, we also compare results with pixel-domain defenses such as Pixel Deflection (PD) and High-Level Guided Denoiser (HGD), defenses that use JPEG compression (JPEG comp.) or DNN-based compression like Feature Distillation (Feat. Distill.), and defenses that use some variation of adversarial training, like Feature Denoising (FD) and standard adversarial training (Adv. tr.).
In Table 2, we report results for an ℓ∞-norm UAP attack against various DNNs and show that our proposed defense outperforms all the other defenses.
Stronger Attack Perturbations
Although we use an attack perturbation strength of 10 during training, in Table 4 we evaluate the robustness of our defense when the adversary violates the attack threat model by using a higher perturbation strength. Compared to the baseline DNN (no defense) as well as PRN and PD, our proposed defense is much more effective at defending against stronger perturbations, outperforming the other defenses by almost 30% even when the attack strength is more than double the value used to train our defense. Although defense robustness decreases for unseen higher perturbation strengths, our defense handles this drop-off much more gracefully and shows much better generalization across attack perturbation strengths than existing defenses. We also note that adversarial perturbations are no longer visually imperceptible at such high perturbation strengths.
Generalization to Unseen Universal Attacks
Although the proposed method effectively defends against UAP attacks (Tables 1-4), we also assess its robustness to other unseen universal attacks without additional attack-specific training; note that prior defenses do not cover this experimental setting. Since existing attacks in the literature are tailored to specific DNNs, we use the CaffeNet and Res152 DNNs to cover a variety of universal attacks: Fast Feature Fool (FFF), Network for adversary generation (NAG), Singular fool (S.Fool), Generative adversarial perturbation (GAP), Generalizable data-free universal adversarial perturbation (G-UAP), and stochastic PGD (sPGD).
Our defense, trained on just UAP attack examples, effectively defends against all the other universal attacks and outperforms all other existing defenses (Table 5). Even against stronger universal attacks like NAG and GAP, we outperform all other defenses, including PRN, which is also trained on similar UAP attack examples, by almost 10%. Our results in Table 5 show that our feature regeneration units learn transformations that generalize effectively across perturbation patterns (Figure 5). Note that we are the first to show such broad generalization across universal attacks.
| Methods | FFF | NAG | S.Fool | GAP | G-UAP | sPGD |
|---|---|---|---|---|---|---|
| Adv. tr. | n/a | n/a | n/a | 0.776 | 0.777 | 0.775 |
Robustness to Secondary White-Box Attacks
Although, in practical situations, an attacker may not have full or even partial knowledge of a defense, for completeness we also evaluate our proposed defense against white-box attacks on the defense itself (secondary attacks), i.e., the adversary has full access to the gradient information of our feature regeneration units. We use the UAP (on CaffeNet) and sPGD (on Res152) attacks for this evaluation.
Figure 6 shows the robustness of our defense to such a secondary UAP attack seeking to achieve a target fooling ratio of 0.85 against our defense for the CaffeNet DNN. Such an attack can easily converge (achieve the target fooling ratio) against a baseline DNN in less than 2 attack epochs, eventually achieving a final fooling ratio of 0.9. Similarly, we observe that even PRN is susceptible to a secondary UAP attack, which achieves a fooling ratio of 0.87 when the adversary can access the gradient information of its Perturbation Rectifying Network. In comparison, against our defense model with iterative adversarial example training (as described in Section 5.1), the white-box adversary can achieve a maximum fooling ratio of only 0.42, which is 48% lower than the fooling ratio achieved against PRN, even after attacking our defense for 600 attack epochs. Similarly, in Table 6, using the same attack setup outlined previously, we evaluate white-box sPGD attacks computed using the gradient information of both the defense and the baseline DNN for Res152. As shown in Table 6, our defense, trained using sPGD attack examples computed against both the baseline DNN and our defense, is robust to subsequent sPGD white-box attacks.
We show that masking adversarial noise in a few select DNN activations significantly improves adversarial robustness. To this end, we propose a novel selective feature regeneration approach that effectively defends against universal perturbations, unlike existing adversarial defenses which either pre-process the input image to remove adversarial noise and/or retrain the entire baseline DNN through adversarial training. We show that the ℓ1-norm of the convolutional filter kernel weights can be effectively used to rank convolutional filters in terms of their susceptibility to adversarial perturbations. Regenerating only the top 50% ranked adversarially susceptible features in a few DNN layers is enough to restore DNN robustness and outperform all existing defenses. We validate the proposed method by comparing against existing state-of-the-art defenses, and show better generalization across different DNNs, attack norms and even unseen attack perturbation strengths. In contrast to existing approaches, our defense trained solely on one type of universal adversarial attack examples effectively defends against other unseen universal attacks, without additional attack-specific training. We hope this work encourages researchers to design adversarially robust DNN architectures and training methods that produce convolutional filter kernels with a small ℓ1-norm.
Appendix A Maximum Adversarial Perturbation
We show in Section 4.1 of the main paper that the maximum possible adversarial perturbation in a convolutional filter activation map is proportional to the ℓ1-norm of the corresponding filter kernel weights. Here, we provide a proof for Equation 2 in the main paper. For simplicity but without loss of generality, let X be a single-channel input to a convolutional filter with kernel W. For illustration, consider a 3×3 input and a 2×2 kernel.

Assuming the origin for the kernel is at its top-left corner and no padding for X (the same proof applies if padding is used), the vectorized convolutional output can be expressed as a matrix-vector product:

y = A vec(X)

where vec(·) unrolls all elements of an input matrix with N rows and M columns into an output column vector of size NM, and A is a circulant convolution matrix formed using the elements of W: each row of A contains the kernel weights placed at the positions of the input elements they multiply, with zeros elsewhere.

Similarly, for a perturbed input X + U, where U is an additive input perturbation, the perturbation e induced in the output activation map is, by linearity:

e = A vec(X + U) − A vec(X) = A vec(U)

so that the k-th element of e is given by e_k = Σ_j a_{k,j} u_j, where u_j is the j-th element of the vector vec(U) and a_{k,j} is the element in row k and column j of the matrix A.

Since every row of A contains only kernel weights of W (and zeros) and |u_j| ≤ ‖U‖_∞, we have the following inequality:

|e_k| = |Σ_j a_{k,j} u_j| ≤ Σ_j |a_{k,j}| |u_j| ≤ ‖U‖_∞ Σ_j |a_{k,j}| ≤ ‖U‖_∞ ‖W‖_1
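This bound is easy to check numerically; below is a small NumPy verification (using "valid" cross-correlation rather than flipped convolution, which does not affect the bound):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def conv2d_valid(x, w):
    """'Valid' 2-D cross-correlation of input x with kernel w."""
    patches = sliding_window_view(x, w.shape)     # (H', W', kh, kw)
    return np.einsum('ijkl,kl->ij', patches, w)

rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))                   # clean single-channel input
w = rng.standard_normal((3, 3))                   # filter kernel
eps = 0.1
u = rng.uniform(-eps, eps, size=x.shape)          # l_inf-bounded input noise

# By linearity, the activation-map noise equals conv(u, w), and its
# magnitude is bounded by ||u||_inf * ||w||_1 (Equation 2).
e = conv2d_valid(x + u, w) - conv2d_valid(x, w)
bound = np.max(np.abs(u)) * np.abs(w).sum()
```

For any random draw, `np.max(np.abs(e))` stays below `bound`, matching the inequality above.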
Appendix B Masking Perturbations in Other Layers
In Section 4.1 of the main paper (Figure 3 in the main paper), we evaluate the effect of masking ℓ∞-norm adversarial perturbations in a ranked subset (using the ℓ1-norm ranking) of convolutional filter activation maps of the first convolutional layer of a DNN. Here, in Figure 7, we evaluate the effect of masking ℓ∞-norm adversarial perturbations in ranked filter activation maps of convolutional layers 2, 3, 4 and 5 of CaffeNet and VGG-16. We use the same evaluation setup as in Section 4.1 of the main paper (i.e., a 1000-image random subset of the ImageNet training set). The top-1 accuracies for perturbation-free images of the subset are 0.58 and 0.69 for CaffeNet and VGG-16, respectively. Similarly, the top-1 accuracies for adversarially perturbed images of the subset are 0.10 and 0.25 for CaffeNet and VGG-16, respectively. Similar to our observations in Section 4.1 of the main paper, for most DNN layers, masking the adversarial perturbations in just the top 50% most susceptible filter activation maps (identified using the ℓ1-norm ranking measure, Section 4.1 of the paper) recovers most of the accuracy lost by the baseline DNN (Figure 7). Specifically, masking the adversarial perturbations in the top 50% ranked filters of VGG-16 restores at least 84% of the baseline accuracy on perturbation-free images.
Appendix C Feature Regeneration Units: An Ablation Study
In general, feature regeneration units can be added at the output of each convolutional layer in a DNN. However, this may come at the cost of increased computation, due to an increase in the number of DNN parameters. As mentioned in Section 5.1 of the main paper, we constrain the number of feature regeneration units added to the DNN, in order to avoid drastically increasing the training and inference cost for larger DNNs (i.e., VGG-16, GoogLeNet and ResNet-152). Here, we perform an ablation study to identify the smallest number of feature regeneration units needed to achieve at least a 95% restoration accuracy across most DNNs. Specifically, we use VGG-16 and GoogLeNet for this analysis. We evaluate the restoration accuracy on the ImageNet validation set (ILSVRC2012) while adding an increasing number of feature regeneration units, from a minimum of 2 to a maximum of 10 in steps of 2. Starting from the first convolutional layer in a DNN, each additional feature regeneration unit is added at the output of every second convolutional layer. In Figure 8, we report the results of this ablation study and observe that, for GoogLeNet, adding just two feature regeneration units achieves a restoration accuracy of 97%, and adding more feature regeneration units does not have any significant impact on the restoration accuracy. However, for VGG-16, adding only 2 feature regeneration units achieves a restoration accuracy of only 91%. For VGG-16, adding more feature regeneration units improves performance, with the best restoration accuracy of 96.2% achieved with 6 feature regeneration units. Adding more than 6 feature regeneration units resulted in a minor drop in restoration accuracy, possibly due to data over-fitting. As a result, we restrict the number of feature regeneration units deployed for any DNN to at most 6.
| Attack source | VGG-F + defense | GoogLeNet + defense | VGG-16 + defense |
|---|---|---|---|
| CaffeNet + defense | 0.906 | 0.963 | 0.942 |
| Res152 + defense | 0.889 | 0.925 | 0.925 |
Appendix D Attacks using Surrogate Defense DNNs
In this section, we evaluate whether an attacker/adversary could construct a surrogate defense network if it were known that our defense was adopted. In situations where the exact defense (feature regeneration units + baseline DNN) is hidden from the attacker (oracle), a DNN predicting output labels similar to our defense (surrogate) can be effective only if an attack generated using the surrogate transfers to the oracle. UAP attacks are transferable across baseline DNNs (Table 1 in the main paper), i.e., an adversarial perturbation computed for a DNN whose model weights and architecture are known (surrogate) can also effectively fool another target DNN that has similar prediction accuracy but whose model weights and architecture are not known to the attacker (oracle). Assuming that our defense (feature regeneration units + baseline DNN) for CaffeNet and Res152 is publicly available as a surrogate, universal attack examples computed on these DNNs may be used to attack our defenses for other DNNs, e.g., VGG-F or VGG-16 as an oracle. We show in Table 7 that our defense mechanism successfully breaks attack transferability and is not susceptible to attacks from surrogate DNNs based on our defense.
Appendix E Examples of Synthetic Perturbations
Appendix F Examples of Feature Regeneration
Appendix G Examples of Universal Attack Perturbations
- From Figure 3 (main paper) and Figure 1 in Supplementary Material, we observe that an empirical regeneration ratio of 50% works well. Similarly, although feature regeneration units can be deployed for each layer in a DNN, from Figure 2 in Supplementary Material, we observe that regenerating features in at most 6 layers in a DNN effectively recovers lost prediction performance.
- Among the evaluated DNNs, the FD, HGD and Adv. tr. defenses publicly provide trained defense models only for Res152; we report results using only the DNN models provided by the respective authors.
- As an implementation of Shared Adversarial Training (Shared tr.) was not publicly available, we report the results published by the authors, which were provided only for white-box attacks computed against the defense; results for white-box attacks against the baseline DNN were not provided.
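The 50% regeneration ratio in the first note above can be sketched as ranking each layer's filters by a susceptibility score and regenerating only the top half. The ℓ1-norm ranking below follows the filter-ranking idea in the main paper (the ℓ1 norm of a filter's weights bounds how strongly a bounded perturbation can excite it), but the shapes and names are illustrative:

```python
import numpy as np

def top_susceptible_filters(weights, ratio=0.5):
    """Rank conv filters by the l1 norm of their weights (a proxy for
    how strongly a norm-bounded perturbation can excite each filter)
    and return the indices of the top `ratio` fraction."""
    norms = np.abs(weights).sum(axis=(1, 2, 3))  # one l1 norm per filter
    k = max(1, int(round(ratio * len(norms))))
    return np.argsort(norms)[::-1][:k]          # most susceptible first

rng = np.random.default_rng(0)
w = rng.standard_normal((8, 3, 3, 3))  # 8 filters with 3x3x3 kernels
print(top_susceptible_filters(w))      # indices of the 4 most susceptible filters
```

Only the activations of the returned filter indices would be passed through a feature regeneration unit; the remaining activations are left unchanged.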
- Naveed Akhtar, Jian Liu, and Ajmal Mian. Defense against universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3389–3398, June 2018.
- Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proceedings of the International Conference on Machine Learning, (ICML), 2018.
- Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. In Proceedings of the International Conference on Machine Learning (ICML), pages 284–293, 2018.
- Tejas S. Borkar and Lina J. Karam. DeepCorrect: Correcting DNN models against image distortions. CoRR, abs/1705.02406, 2017.
- Tom B. Brown, Dandelion Mané, Aurko Roy, Martín Abadi, and Justin Gilmer. Adversarial patch. CoRR, abs/1712.09665, 2017.
- N. Carlini and D. Wagner. Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy, pages 39–57, 2017.
- Ken Chatfield, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.
- Nilaksh Das, Madhuri Shanbhogue, Shang-Tse Chen, Fred Hohman, Siwei Li, Li Chen, Michael E. Kounavis, and Duen Horng Chau. SHIELD: Fast, practical defense and vaccination for deep learning using JPEG compression. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 196–204, NY, USA, 2018. ACM.
- Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
- Gintare Karolina Dziugaite, Zoubin Ghahramani, and Daniel M. Roy. A study of the effect of JPG compression on adversarial images. CoRR, abs/1608.00853, 2016.
- Logan Engstrom, Andrew Ilyas, and Anish Athalye. Evaluating and understanding the robustness of adversarial logit pairing. CoRR, abs/1807.10272, 2018.
- Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao, Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep learning models. CoRR, abs/1707.08945, 2017.
- Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, and Pascal Frossard. The robustness of deep networks: A geometrical perspective. IEEE Signal Process. Mag., 34(6):50–62, 2017.
- Alhussein Fawzi, Seyed-Mohsen Moosavi-Dezfooli, Pascal Frossard, and Stefano Soatto. Classification regions of deep neural networks. CoRR, abs/1705.09552, 2017.
- Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
- Chuan Guo, Mayank Rana, Moustapha Cissé, and Laurens van der Maaten. Countering adversarial images using input transformations. CoRR, abs/1711.00117, 2017.
- Jamie Hayes and George Danezis. Learning universal adversarial perturbations with generative models. CoRR, abs/1708.05207, 2017.
- Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
- Jan Hendrik Metzen, Mummadi Chaithanya Kumar, Thomas Brox, and Volker Fischer. Universal adversarial perturbations against semantic image segmentation. In IEEE International Conference on Computer Vision, pages 2774–2783, Oct 2017.
- Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. CoRR, abs/1804.08598, 2018.
- Yangqing Jia, Evan Shelhamer, Jeff Donahue, Sergey Karayev, Jonathan Long, Ross Girshick, Sergio Guadarrama, and Trevor Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.
- Harini Kannan, Alexey Kurakin, and Ian Goodfellow. Adversarial logit pairing. CoRR, abs/1803.06373, 2018.
- Valentin Khrulkov and Ivan Oseledets. Art of singular vectors and universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8562 – 8570, June 2018.
- Jernej Kos, Ian Fischer, and Dawn Song. Adversarial examples for generative models. CoRR, abs/1702.06832, 2017.
- Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
- Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial examples in the physical world. CoRR, abs/1607.02533, 2016.
- Alex Lamb, Jonathan Binas, Anirudh Goyal, Dmitriy Serdyuk, Sandeep Subramanian, Ioannis Mitliagkas, and Yoshua Bengio. Fortified networks: Improving the robustness of deep networks by modeling the manifold of hidden representations. CoRR, abs/1804.02485, 2018.
- Juncheng Li, Frank R. Schmidt, and J. Zico Kolter. Adversarial camera stickers: A physical camera-based attack on deep learning systems. CoRR, abs/1904.00759, 2019.
- Xin Li and Fuxin Li. Adversarial examples detection in deep networks with convolutional filter statistics. In IEEE International Conference on Computer Vision, pages 5775 – 5783, Oct 2017.
- Yingwei Li, Song Bai, Cihang Xie, Zhenyu Liao, Xiaohui Shen, and Alan L. Yuille. Regional homogeneity: Towards learning transferable universal adversarial perturbations against defenses. CoRR, abs/1904.00979, 2019.
- Fangzhou Liao, Ming Liang, Yinpeng Dong, Tianyu Pang, Xiaolin Hu, and Jun Zhu. Defense against adversarial attacks using high-level representation guided denoiser. CoRR, abs/1712.02976, 2017.
- Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. CoRR, abs/1611.02770, 2016.
- Zihao Liu, Qi Liu, Tao Liu, Yanzhi Wang, and Wujie Wen. Feature Distillation: DNN-oriented JPEG compression against adversarial examples. International Joint Conference on Artificial Intelligence, 2018.
- Jiajun Lu, Theerasit Issaranon, and David Forsyth. Safetynet: Detecting and rejecting adversarial examples robustly. In IEEE International Conference on Computer Vision, pages 446–454, Oct 2017.
- Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. CoRR, abs/1706.06083, 2017.
- Dongyu Meng and Hao Chen. MagNet: A two-pronged defense against adversarial examples. In ACM SIGSAC Conference on Computer and Communications Security, pages 135–147, NY, USA, 2017. ACM.
- Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. On detecting adversarial perturbations. In International Conference on Learning Representations, 2017.
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Universal adversarial perturbations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 86–94, 2017.
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, Pascal Frossard, and Stefano Soatto. Analysis of universal adversarial perturbations. CoRR, abs/1705.09554, 2017.
- Seyed-Mohsen Moosavi-Dezfooli, Ashish Shrivastava, and Oncel Tuzel. Divide, denoise, and defend against adversarial attacks. CoRR, abs/1802.06806, 2018.
- Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. DeepFool: A simple and accurate method to fool deep neural networks. In IEEE Conference on Computer Vision and Pattern Recognition, pages 2574–2582, June 2016.
- Konda Reddy Mopuri, Aditya Ganeshan, and R Venkatesh Babu. Generalizable data-free objective for crafting universal adversarial perturbations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41:2452–2465, 2019.
- Konda Reddy Mopuri, Utsav Garg, and R Venkatesh Babu. Fast feature fool: A data independent approach to universal adversarial perturbations. In Proceedings of the British Machine Vision Conference (BMVC), pages 1–12, 2017.
- Konda Reddy Mopuri, Utkarsh Ojha, Utsav Garg, and R Venkatesh Babu. NAG: Network for adversary generation. In Proceedings of the IEEE Computer Vision and Pattern Recognition (CVPR), 2018.
- Chaithanya Kumar Mummadi, Thomas Brox, and Jan Hendrik Metzen. Defending against universal perturbations with shared adversarial training. In The IEEE International Conference on Computer Vision (ICCV), Oct 2019.
- Nina Narodytska and Shiva Prasad Kasiviswanathan. Simple black-box adversarial perturbations for deep networks. CoRR, abs/1612.06299, 2016.
- Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. CoRR, abs/1605.07277, 2016.
- Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In ACM Asia Conference on Computer and Communications Security, pages 506–519, 2017.
- Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay. Celik, and Ananthram Swami. The limitations of deep learning in adversarial settings. In IEEE European Symposium on Security and Privacy, pages 372–387, 2016.
- Julien Perolat, Mateusz Malinowski, Bilal Piot, and Olivier Pietquin. Playing the game of universal adversarial perturbations. CoRR, abs/1809.07802, 2018.
- Omid Poursaeed, Isay Katsman, Bicheng Gao, and Serge Belongie. Generative adversarial perturbations. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
- Aaditya Prakash, Nick Moran, Solomon Garber, Antonella DiLillo, and James Storer. Deflecting adversarial attacks with pixel deflection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 8571–8580, June 2018.
- Konda Reddy Mopuri, Phani Krishna Uppala, and R. Venkatesh Babu. Ask, acquire, and attack: Data-free UAP generation using class impressions. In The European Conference on Computer Vision (ECCV), September 2018.
- Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. You only look once: Unified, real-time object detection. In IEEE Conference on Computer Vision and Pattern Recognition, pages 779–788, 2016.
- Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99, 2015.
- Yibin Ruan and Jiazhu Dai. TwinNet: A double sub-network framework for detecting universal adversarial perturbations. Future Internet, 10:1–13, 2018.
- Ali Shafahi, Mahyar Najibi, Zheng Xu, John Dickerson, Larry S. Davis, and Tom Goldstein. Universal adversarial training. CoRR, abs/1811.11304, 2018.
- Evan Shelhamer, Jonathan Long, and Trevor Darrell. Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell., 39(4):640–651, Apr. 2017.
- Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
- Sibo Song, Yueru Chen, Ngai-Man Cheung, and C.-C. Jay Kuo. Defense against adversarial attacks with Saak transform. CoRR, abs/1808.01785, 2018.
- Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. CoRR, abs/1710.08864, 2017.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In IEEE Conference on Computer Vision and Pattern Recognition, pages 1–9, 2015.
- Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
- Florian Tramèr, Alexey Kurakin, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. Ensemble adversarial training: Attacks and defenses. In International Conference on Learning Representations, 2018.
- Jonathan Uesato, Brendan O’Donoghue, Aaron van den Oord, and Pushmeet Kohli. Adversarial risk and the dangers of evaluating against weak attacks. CoRR, abs/1802.05666, 2018.
- Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L. Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
- Weilin Xu, David Evans, and Yanjun Qi. Feature squeezing: Detecting adversarial examples in deep neural networks. In Network and Distributed System Security Symposium, 2018.
- Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. CoRR, abs/1511.07122, 2015.