Invisible Backdoor Attacks Against Deep Neural Networks

Shaofeng Li1, Benjamin Zi Hao Zhao2, Jiahao Yu1, Minhui Xue3, Dali Kaafar4, Haojin Zhu1
1Shanghai Jiao Tong University, China
2UNSW and CSIRO-Data61, Australia
3The University of Adelaide, Australia
4Macquarie University and CSIRO-Data61, Australia
Abstract

Deep neural networks (DNNs) have been proven vulnerable to backdoor attacks, in which hidden features (patterns) are trained into a normal model and activated only by specific inputs (called triggers), tricking the model into producing unexpected behavior. In this paper, we design an optimization framework to create covert and scattered triggers for backdoor attacks, called invisible backdoors, where triggers can amplify specific neuron activations while remaining invisible to both backdoor detection methods and human inspection. We use the Perceptual Adversarial Similarity Score (PASS) [19] to define invisibility for human users and apply $L_2$ and $L_0$ regularization in the optimization process to hide the trigger within the input data. We show that the proposed invisible backdoors can be fairly effective across various DNN models as well as three datasets, CIFAR-10, CIFAR-100, and GTSRB, by measuring their attack success rates and invisibility scores.

1 Introduction

Deep neural networks have been proven to outperform traditional machine learning techniques and even surpass human cognitive capacity in many domains, such as image processing [5], speech recognition [14], and board games [17, 22]. Training these models requires large computational power, a need catered to by services on tech giants' cloud platforms, such as Machine Learning as a Service (MLaaS) [21]. Customers can leverage such service platforms to train complex models after specifying their desired tasks and the model structure, and uploading their data to the service. Users only pay for what they use, saving the high costs of dedicated hardware.

However, machine learning models may have vulnerabilities. Backdoor attacks [10, 16] are one type of attack aimed at fooling the model with premeditated inputs. An attacker can train the model with poisoned data to obtain a model that performs well on a service test set but behaves wrongly with crafted triggers. A malicious MLaaS provider can secretly launch backdoor attacks by providing clients with a model poisoned with a backdoor. Consider, for example, the scenario of a company deploying a facial-recognition solution as an access control system. The company may choose to use MLaaS for the deployment of the biometrics-based system. In the event the MLaaS provider is malicious, it may seek to gain unauthorized access to the company's resources. It can train a model that recognizes faces correctly in the typical use case of authenticating the legitimate company's employees, without arousing the suspicions of the company. But as the MLaaS provider hosts and has access to the model, it may train the model with additional "backdoored" inputs that grant access when specific inputs, such as black hats or a pair of yellow-rimmed glasses, are scanned, effectively bypassing in a stealthy way the security mechanism intended to protect the company's resources.

Previous works have studied such backdoor attacks [20, 11]. While they have been shown to successfully fool models into predicting an incorrect label, a major limitation of current attacks is that the trigger is often visible and easily recognizable under human visual inspection. When these inputs are checked by humans, the poisoned inputs will be found suspicious. Although [10, 16] propose methods to reduce suspicion of the inputs, the triggers are still notably altered compared to normal inputs, making existing triggers less feasible in practice.

The challenge of creating an "invisible" backdoor is to achieve a trade-off between the effectiveness of the trigger in fooling the ML system and the invisibility of the trigger needed to avoid recognition by human inspection. The triggers used in previous works [10, 16] create a striking contrast with neighboring pixels. This stark difference enables better optimization, teaching the retrained model to recognize these prominent differences as features and use them in predictions. However, when "invisible" triggers are inserted into images, the loss of separation between the trigger and the image may make it more difficult to activate the backdoor in the "backdoored" neural network.

Hiding the trigger from human detection is feasible, as recent works [1] have shown that neural networks have powerful feature extraction capabilities that detect even the smallest differences (e.g., adversarial examples). Consequently, they are able to discern more details from an image than might be detectable by a human. This is exacerbated by the known fact that humans are bad at perceiving small variations in colours within images [12]. In this work, we focus on how to make triggers invisible, specifically, how to make backdoor attacks less detectable by human inspection while ensuring that neural networks can still identify the backdoor triggers. We hope that our work raises awareness about the severity of backdoor attacks. Our main contributions can be highlighted as follows:

  • We provide an optimization framework for the creation of invisible backdoor attacks.

  • We use the Perceptual Adversarial Similarity Score (PASS) [19] to define invisibility for human users. Our objective is to fool both backdoor detection methods and human inspection.

  • We choose a slight perturbation as the trigger, and propose $L_2$ and $L_0$ regularization to hide the trigger throughout the image and make it less obvious. We show the feasibility of using $L_2$ and $L_0$ regularization through experimentation.

2 Related Work

Deep Neural Networks (DNNs) can be easily affected by slight perturbations, such as adversarial attacks [8, 24, 4, 3, 7] and poisoning attacks. In poisoning attacks, the attacker can either breach the integrity of the system without preventing regular users from using it, or make the system unavailable for all users by manipulating the training data. The former are referred to as backdoor attacks, while the latter are known as poisoning availability attacks. Several works have addressed the latter [2, 26]. In this work, we focus on backdoor attacks, as many proposed backdoor attacks [10, 16] are easily identified by human visual inspection. It is important to note that backdoor attacks differ from adversarial attacks. An adversarial attack crafts image-specific perturbations, i.e., the perturbation is invalid when used on other images. Backdoor attacks, however, aim to apply the same backdoor trigger to any arbitrary image, tricking a DNN model into producing an unexpected behavior. From this perspective, backdoor attacks are data-agnostic (image-agnostic in the example considered here).

Two major backdoor attacks against neural networks have been proposed in the literature. First, the authors of [10, 9] propose BadNets, which injects a backdoor by poisoning the training set. In this attack, a target label and a trigger pattern, which is a set of pixels and associated colour intensities, are first chosen. Then, a poisoning training set is built by adding the trigger to images randomly drawn from the original training set, while simultaneously modifying their original labels to the target label. By retraining the pre-trained classifier on this poisoning training set, the attacker can inject a backdoor into the pre-trained model. The second attack is the Trojaning attack proposed in [16]. This attack does not use arbitrary triggers; instead, the triggers are designed to maximize the response of specific internal neurons in the DNN. This creates a higher correlation between triggers and internal neurons, building a stronger dependence between specific internal neurons and the target labels with less training data. Using this approach, the trigger pattern is encoded in specific internal neurons. However, the trigger generated by the Trojaning attack is so obvious that humans, as well as Neural Cleanse [25] and TABOR [11], can detect it.

Compared to the Trojaning attack, the triggers of BadNets do not have the ability to amplify specific neuron activations, and as such the attack success rate of BadNets is lower than that of the Trojaning attack. In addition, because the triggers of BadNets are sparsely encoded into the neural network, it is harder for neural networks to memorize this type of feature, which requires more epochs for the retraining phase to converge. Moreover, in the BadNets attack, the attacker is given access to the original training set, an assumption that might be too strong for the attack to be plausible in practice. For these reasons, we choose the Trojaning attack as a building block to carry out our invisible backdoor attacks.

3 Invisible Backdoor Attacks

In this section, we first introduce the threat model and then develop an optimization framework to formalize our backdoor attack.

3.1 Threat Model

Assume there is a classification hypothesis $f_{\theta}$ trained on samples $(x, y) \in \mathcal{D}_{train}$, where $\mathcal{D}_{train}$ is a training set. In an adversarial attack setting, the adversary modifies the input image $x$ with a small perturbation $\epsilon$ ($\|\epsilon\|$ minimum) to invoke a mistake in the classifier, $f_{\theta}(x + \epsilon) \neq y$, where $y$ is the ground truth of the input $x$. Note that in this process, the classifier $f_{\theta}$ remains unchanged. For backdoor attacks, however, the adversary obtains a new classifier $f_{\theta^{*}}$ by retraining from the existing classifier $f_{\theta}$ using a poisoning dataset $\mathcal{D}_{p}$. The adversary generates the poisoning dataset by applying the trigger pattern $\delta$ to their own training images. When this trigger pattern appears on the input image $x$, the new classifier $f_{\theta^{*}}$ will mis-classify this craft into the target label $y_t$ expected by the adversary ($f_{\theta^{*}}(A(x, \delta)) = y_t$), where $A(\cdot, \cdot)$ represents the operation that applies the trigger to the input images. Images without any embedded trigger are still assigned their original labels by the new classifier, i.e., $f_{\theta^{*}}(x) = y$.

3.2 Formalization of Backdoor Attacks

Once we have a trigger, we can build an image-agnostic poisoning dataset $\mathcal{D}_{p} = \{\mathcal{D}^{p}_{train}, \mathcal{D}^{p}_{valid}\}$, where $\mathcal{D}^{p}_{train}$ is the poisoning training set, used to train the learner on poisoning data, and $\mathcal{D}^{p}_{valid}$ is the poisoning validation set, used to evaluate the success rate of the backdoor attack, with a one-to-one mapping $T: x \mapsto A(x, \delta)$ and labeling $A(x, \delta)$ as the target label $y_t$. In previous backdoor attacks, the mapping $T$ is the operation that adds the trigger directly onto the input images, and the shape and size of the trigger patterns are obvious. In our method, we use regularization to make the shape and size of the trigger patterns invisible. After we have the poisoning dataset $\mathcal{D}_{p}$, we can obtain a poisoning model $f_{\theta^{*}}$ by retraining on $\mathcal{D}_{p}$. The overview of our invisible backdoor attacks is shown in Fig. 1.

Figure 1: Overview of our invisible backdoor attacks
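As an illustration of the mapping $T$ described above, the following PyTorch-style sketch applies an additive trigger to clean images and relabels them with the target label $y_t$. The dataset wrapper, the clipped-addition form of $A$, and the pixel range are assumptions for illustration, not the paper's exact implementation.

```python
import torch
from torch.utils.data import Dataset

class PoisonedDataset(Dataset):
    """Wraps a clean dataset and yields (A(x, delta), y_t) pairs (illustrative sketch)."""

    def __init__(self, clean_dataset, trigger, target_label):
        self.clean = clean_dataset          # e.g., a subset drawn with ratio alpha
        self.trigger = trigger              # the optimized trigger delta
        self.target_label = target_label    # the adversary's target label y_t

    def __len__(self):
        return len(self.clean)

    def __getitem__(self, i):
        x, _ = self.clean[i]                       # original label is discarded
        x_p = (x + self.trigger).clamp(0, 255)     # A(x, delta): clipped additive trigger
        return x_p, self.target_label
```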

There are three phases to complete a backdoor attack. The first is trigger generation from the pre-trained model; next, we use the generated trigger to build our poisoning training set; finally, we encode the trigger pattern into the neural network by conducting a retraining process. The details are described in Algorithm 1.

input : $\mathcal{D}$: the untainted samples; $\delta$: the trigger; $A$: the operation of adding the trigger; $y_t$: the target label; $\mathcal{L}$: the attacker's loss function; $\Phi$: the feasible set of manipulations that can be made on the poisoned input; $\varepsilon$: a threshold;
output : $f_{\theta^{*}}$: the backdoored model.
1 begin
2        $\mathcal{D}_{s} \leftarrow$ samples randomly drawn from $\mathcal{D}$ with ratio $\alpha$
3        for $x_j \in \mathcal{D}_{s}$ do
4               $x_j^{p} \leftarrow A(x_j, \delta)$
5               $\mathcal{D}^{p}_{train} \leftarrow \mathcal{D}^{p}_{train} \cup \{(x_j^{p},\, y_t)\}$
6        end for
7        Initialize the pre-trained model: $\theta^{(0)} \leftarrow \theta$; initialize the poisoning sample: $\delta^{(0)} \leftarrow \delta$; $i \leftarrow 0$
8        repeat
9               Store parameters from the previous iteration: $\theta^{(i)}$
10              Update step: $\delta^{(i+1)} \leftarrow \Pi_{\Phi}\big(\delta^{(i)} - \eta\, \nabla_{\delta}\mathcal{L}(\delta^{(i)}, \theta^{(i)})\big)$, where $\eta$ is the step size; $i \leftarrow i + 1$
11       until $\big|\mathcal{L}(\delta^{(i)}, \theta^{(i)}) - \mathcal{L}(\delta^{(i-1)}, \theta^{(i-1)})\big| < \varepsilon$
12       return $f_{\theta^{*}}$
13 end
Algorithm 1 Gradient-based Backdoor Attack

In Algorithm 1, the trigger used to build the poisoning dataset is given. To generate a trigger, we formulate the process as a bilevel optimization problem, Eq. (1), and then add two types of regularization to improve the trigger generation process and make the generated triggers imperceptible to humans. The outer optimization minimizes the attacker's loss function $\mathcal{L}$ (the attacker expects to maximize the attack success rate on poisoning data without degrading the accuracy on untainted data). The inner optimization amounts to learning the classifier on the poisoning training data.

$$
\begin{aligned}
\delta^{*} \in \arg\min_{\delta}\; & \mathcal{L}(\delta, \theta^{*}) = \sum_{i=1}^{n} \ell\big(f_{\theta^{*}}(x_{i}),\, f_{\theta}(x_{i})\big) \;+\; \sum_{j=1}^{m} \ell\big(f_{\theta^{*}}(A(x_{j}, \delta)),\, y_{t}\big) \\
\text{s.t.}\quad & \theta^{*} \in \arg\min_{\theta'}\; \mathcal{L}'\big(\mathcal{D}_{train} \cup \mathcal{D}^{p}_{train},\, \theta'\big)
\end{aligned}
\tag{1}
$$

where $x_{i} \in \mathcal{D}_{valid}$ and $x_{j} \in \mathcal{D}^{p}_{valid}$ are drawn from the original datasets, $n$ is the untainted validation set size, and $m$ is the poisoning validation set size. Note that in the second term, because the poisoning craft is image-agnostic, i.e., the trigger pattern $\delta$ is applied to whatever images $x_{j}$, the new classifier $f_{\theta^{*}}$ only needs to identify the pattern $\delta$.

The first term of the attacker's loss function forces the poisoning classifier $f_{\theta^{*}}$ to give the same label as the initial classifier $f_{\theta}$ on untainted data, through the loss function $\ell$. The second term forces the classifier to successfully identify the trigger pattern and output the target label $y_t$, again via the loss function $\ell$. The former captures the functionality experienced by normal users, while the latter evaluates the attacker's success rate on the poisoning data.
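To make the two terms concrete, the following PyTorch fragment is a minimal sketch of the outer (attacker's) loss in Eq. (1), using cross-entropy for $\ell$; the function and argument names are ours and the batch-level formulation is an illustrative simplification.

```python
import torch
from torch import nn

def attacker_loss(poisoned_model, clean_model, clean_x, triggered_x, target_label):
    ce = nn.CrossEntropyLoss()
    # First term: the poisoned classifier should agree with the initial classifier
    # on untainted inputs (preserves Functionality for normal users).
    with torch.no_grad():
        reference = clean_model(clean_x).argmax(dim=1)
    functionality_term = ce(poisoned_model(clean_x), reference)
    # Second term: triggered inputs should be classified as the target label y_t
    # (drives the Attack Success Rate).
    targets = torch.full((triggered_x.size(0),), target_label, dtype=torch.long)
    attack_term = ce(poisoned_model(triggered_x), targets)
    return functionality_term + attack_term
```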

Notably, the objective function implicitly depends on $\delta$ through the parameters $\theta^{*}$ of the poisoning classifier. In this case, we assume that the attacker can inject only a small fraction of poisoning points into the training set. Thus, the attacker can find the trigger pattern by solving an optimization problem; a locally optimal trigger pattern can be obtained via gradient-descent procedures.

The main challenge is to compute the gradient of the attacker's objective (i.e., the validation loss on both validation sets $\mathcal{D}_{valid}$ and $\mathcal{D}^{p}_{valid}$) with respect to the trigger $\delta$. In fact, this gradient has to capture the implicit dependency of the optimal parameter vector $\theta^{*}$ (learned after training) on the trigger being optimized, as the classification function changes while the trigger is updated. Provided that the attacker's objective function is differentiable w.r.t. $\delta$ and $\theta^{*}$, the required gradient can be computed by the chain rule, as given in Eq. (2):

$$
\nabla_{\delta}\mathcal{L} \;=\; \frac{\partial \mathcal{L}}{\partial \delta} \;+\; \frac{\partial \theta^{*}}{\partial \delta}^{\!\top} \frac{\partial \mathcal{L}}{\partial \theta^{*}}
\tag{2}
$$

where $\delta$ is the trigger we want to optimize, $\ell$ is the cross-entropy loss function, $\sigma$ is the softmax function, and $z$ are the logits of the model with parameters $\theta^{*}$. The term $\frac{\partial \theta^{*}}{\partial \delta}$ captures the implicit dependency of the parameters $\theta^{*}$ on the trigger pattern $\delta$. When this optimization reaches a local optimum, the loss of the inner optimization in Eq. (1) will be near its minimum, and the gradient of the inner optimization will approach zero.

In [6], this derivative is computed by replacing the inner optimization problem with its stationarity (Karush-Kuhn-Tucker, KKT) conditions, i.e., with the implicit equation:

$$
\nabla_{\theta}\mathcal{L}'\big(\mathcal{D}_{train} \cup \mathcal{D}^{p}_{train},\, \theta^{*}\big) \;=\; \mathbf{0}
\tag{3}
$$

By differentiating this expression w.r.t. the trigger pattern $\delta$, one obtains:

$$
\nabla_{\delta}\nabla_{\theta}\mathcal{L}' \;+\; \frac{\partial \theta^{*}}{\partial \delta}^{\!\top} \nabla_{\theta}^{2}\mathcal{L}' \;=\; \mathbf{0}
\tag{4}
$$

Solving for $\frac{\partial \theta^{*}}{\partial \delta}$, we obtain:

$$
\frac{\partial \theta^{*}}{\partial \delta}^{\!\top} \;=\; -\,\nabla_{\delta}\nabla_{\theta}\mathcal{L}' \,\big(\nabla_{\theta}^{2}\mathcal{L}'\big)^{-1}
\tag{5}
$$

which can be substituted in Eq. (2) to obtain the required gradient:

$$
\nabla_{\delta}\mathcal{L} \;=\; \frac{\partial \mathcal{L}}{\partial \delta} \;-\; \nabla_{\delta}\nabla_{\theta}\mathcal{L}' \,\big(\nabla_{\theta}^{2}\mathcal{L}'\big)^{-1} \frac{\partial \mathcal{L}}{\partial \theta^{*}}
\tag{6}
$$

According to Eq. (6), we can generate the trigger pattern $\delta$ that minimizes the attacker's loss function.
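Evaluating the implicit term of Eq. (6) does not require materializing the Hessian $\nabla_{\theta}^{2}\mathcal{L}'$; it can be approximated with Hessian-vector products obtained by double backpropagation (e.g., inside a conjugate-gradient solver). The following PyTorch fragment is a minimal sketch of that primitive only; it is not the authors' implementation, and the function name is ours.

```python
import torch

def hessian_vector_product(inner_loss, params, vec):
    """Compute (nabla^2_theta L') v without forming the Hessian, via double backprop.

    inner_loss: scalar inner-training loss L' on the poisoned training set.
    params:     list of model parameter tensors (theta).
    vec:        flat vector with as many entries as there are parameters.
    """
    grads = torch.autograd.grad(inner_loss, params, create_graph=True)
    flat_grad = torch.cat([g.reshape(-1) for g in grads])
    hv = torch.autograd.grad(flat_grad @ vec, params, retain_graph=True)
    return torch.cat([h.reshape(-1) for h in hv])
```

In Eq. (5), the product $(\nabla_{\theta}^{2}\mathcal{L}')^{-1}\,\partial\mathcal{L}/\partial\theta^{*}$ can then be approximated by solving a linear system whose matrix-vector products are supplied by this routine.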

Measurements.

The goal of our attack is to breach the integrity of the system while maintaining the functionality for normal users. We utilize three metrics to measure the effectiveness of our backdoor attack.

(a) Attack Success Rate: For an attacker, we denote the output of the poisoned model on poisoned input data as $f_{\theta^{*}}(A(x, \delta))$ and the attacker's expected target as $y_t$. This index measures the ratio of poisoned inputs for which $f_{\theta^{*}}(A(x, \delta))$ equals the attacker's target $y_t$. This measurement also shows whether the neural network can identify the trigger pattern we added to the input images: a high ratio means the neural network readily identifies the trigger pattern added by the operation $A$.

(b) Functionality: For normal users, this index measures the performance of the poisoned model on the original validation set $\mathcal{D}_{valid}$. The attacker should retain this functionality; otherwise the administrator or users will detect the occurrence of a backdoor attack.

(c) Invisibility: We use the Perceptual Adversarial Similarity Score (PASS) [19] to measure the invisibility of the triggers. PASS is a psychometric measure that considers not only element-wise similarity but also the plausibility that the image is simply a different view of the same input. Because the human visual system is most sensitive to changes in structural patterns, the structural similarity (SSIM) index is used to quantify this plausibility.

Given two images $x$ and $y$, let $L(x, y)$, $C(x, y)$, and $S(x, y)$ be luminance, contrast, and structural measures, specifically defined as

$$
L(x, y) = \frac{2\mu_{x}\mu_{y} + C_{1}}{\mu_{x}^{2} + \mu_{y}^{2} + C_{1}}, \qquad
C(x, y) = \frac{2\sigma_{x}\sigma_{y} + C_{2}}{\sigma_{x}^{2} + \sigma_{y}^{2} + C_{2}}, \qquad
S(x, y) = \frac{\sigma_{xy} + C_{3}}{\sigma_{x}\sigma_{y} + C_{3}}
\tag{7}
$$

where $\mu_{x}$, $\sigma_{x}^{2}$, and $\sigma_{xy}$ are the weighted mean, variance, and covariance, respectively, and the $C_{i}$'s are constants to prevent singularity, with $C_{1} = (K_{1}D)^{2}$, where $D$ is the dynamic range of the pixel values (255 for 8-bit images) and $K_{1} = 0.01$; $C_{2} = (K_{2}D)^{2}$ with $K_{2} = 0.03$; and $C_{3} = C_{2}/2$. With these, the regional SSIM index (RSSIM) is

$$
\mathrm{RSSIM}(x, y) = L(x, y)^{\alpha}\, C(x, y)^{\beta}\, S(x, y)^{\gamma}
\tag{8}
$$

where $\alpha$, $\beta$, and $\gamma$ are weight factors. Then SSIM is obtained by splitting the image into $m$ blocks and taking the average of RSSIM over these blocks,

$$
\mathrm{SSIM}(x, y) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{RSSIM}(x_{i}, y_{i})
\tag{9}
$$

Combining the photometric-invariant homography transform alignment with SSIM, the perceptual adversarial similarity score (PASS) is defined as

$$
\mathrm{PASS}(\tilde{x}, x) = \mathrm{SSIM}\big(\psi(\tilde{x}, x),\, x\big)
\tag{10}
$$

where $\psi(\tilde{x}, x)$ is the homography transform aligning the similar image $\tilde{x}$ with the original image $x$.
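As a concrete illustration, the following NumPy sketch computes RSSIM and SSIM as in Eqs. (7)–(9) for grayscale float images, using unweighted block statistics and unit exponents $\alpha = \beta = \gamma = 1$; these simplifications and the function names are our assumptions, and the homography alignment of Eq. (10) is omitted.

```python
import numpy as np

def rssim(x, y, dynamic_range=255.0):
    # Regional SSIM over one block, Eqs. (7)-(8), with unit exponents and
    # unweighted (uniform) block statistics.
    C1 = (0.01 * dynamic_range) ** 2
    C2 = (0.03 * dynamic_range) ** 2
    C3 = C2 / 2.0
    mu_x, mu_y = x.mean(), y.mean()
    sig_x, sig_y = x.std(), y.std()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    L = (2 * mu_x * mu_y + C1) / (mu_x ** 2 + mu_y ** 2 + C1)      # luminance
    C = (2 * sig_x * sig_y + C2) / (sig_x ** 2 + sig_y ** 2 + C2)  # contrast
    S = (cov_xy + C3) / (sig_x * sig_y + C3)                       # structure
    return L * C * S

def ssim(x, y, block=8):
    # Eq. (9): average RSSIM over non-overlapping blocks of the two images.
    h, w = x.shape
    scores = [rssim(x[i:i + block, j:j + block], y[i:i + block, j:j + block])
              for i in range(0, h - block + 1, block)
              for j in range(0, w - block + 1, block)]
    return float(np.mean(scores))
```

Since the poisoned images in our setting are already pixel-aligned with the originals, the homography is essentially the identity and PASS reduces to this SSIM value.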

3.3 Optimizing Triggers via Regularization

We start from random Gaussian noise and generate the trigger through an optimization process. In this optimization, we adjust the value of this noise to amplify a set of neuron activations at positions $\Phi$ (a set of positions whose neurons we choose to amplify) while decreasing the $L_{p}$-norm of the noise. When the optimization reaches the $L_{p}$-norm threshold, we have an optimal noise that behaves much like an adversarial example and is difficult for humans to perceive, because its $L_{p}$-norm is guaranteed to be small. In the remaining steps, we use this optimal noise as our trigger to conduct the backdoor attack. This optimization process is formulated in Eq. (11):

$$
\delta^{*} \;=\; \arg\min_{\delta}\;\; \lambda_{1}\, \big\| g_{\Phi}(\delta) - c \big\|_{2}^{2} \;+\; \lambda_{2}\, \|\delta\|_{p}
\tag{11}
$$

where $g_{\Phi}(\delta)$ denotes the neuron activations of the pre-trained model at the positions $\Phi$ for the input noise $\delta$, and $c$ is the scale factor, i.e., the target value to which these activations are amplified. Our experience shows that a moderate choice of $c$ is perfectly acceptable in practice. $\lambda_{1}$ and $\lambda_{2}$ are weight parameters that determine the relative weight of the two loss terms in our objective function.

Because scaling the neuron activations makes the $L_{p}$-norm of the input noise larger, while minimizing the $L_{p}$-norm of the input noise makes scaling the neuron activations more difficult, the goals of the two terms in our objective function in Eq. (11) are in conflict. We view this optimization problem as a saddle point problem composed of two optimization problems, both of which have a natural interpretation in our context. The first optimization problem aims to scale the neuron activations at specific positions to target values; through backpropagation of the gradient, the value of the input noise changes, which makes its $L_{p}$-norm continuously increase. The goal of the second optimization, on the other hand, is to keep the input noise, i.e., our trigger, inconspicuous by minimizing its $L_{p}$-norm. We use Coordinate Greedy, also known as iterative improvement, to compute a local optimum.

Concretely, we first optimize the first term of the loss function with a small learning rate until the neuron activations exceed a given threshold. We then optimize the second term of the loss function to decrease the $L_{p}$-norm of the input noise, while decreasing the learning rate exponentially to avoid destroying the amplified neuron activations. The whole optimization can thus be separated into two phases: in the first phase, the first term dominates the optimization process; as the neuron activations increase, the second term progressively dominates. When the whole optimization finishes, the $L_{p}$-norm of the input noise is small, and we only need to apply a box constraint once at the end, which clips each pixel of the optimal noise into the range $[0, 255]$. This approach has been shown to be extremely effective for computing local optima.
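The following PyTorch fragment is a minimal sketch of this two-phase schedule for the $L_2$ case. It assumes a hypothetical helper `penultimate(model, x)` returning the penultimate-layer activations, uses the anchor position selected in Step 1 below, keeps both loss terms in the second phase, and uses illustrative hyperparameters and optimizer choices rather than the paper's exact settings.

```python
import torch

def generate_trigger_l2(model, anchor_idx, target_act, img_shape=(3, 32, 32),
                        steps=500, lr=0.1, lam2=1e-3):
    # `penultimate` is a hypothetical helper returning penultimate-layer activations.
    delta = torch.randn(1, *img_shape, requires_grad=True)   # random Gaussian noise
    opt = torch.optim.Adam([delta], lr=lr)

    # Phase 1: amplify the anchor-position activation (first term of Eq. (11)).
    for _ in range(steps):
        act = penultimate(model, delta)[0, anchor_idx]
        loss = (act - target_act) ** 2
        opt.zero_grad()
        loss.backward()
        opt.step()

    # Phase 2: shrink the L2-norm while the learning rate decays exponentially,
    # so the amplified activation is not destroyed (second term of Eq. (11)).
    sched = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=0.99)
    for _ in range(steps):
        act = penultimate(model, delta)[0, anchor_idx]
        loss = (act - target_act) ** 2 + lam2 * delta.norm(p=2)
        opt.zero_grad()
        loss.backward()
        opt.step()
        sched.step()

    # Box constraint applied once at the end: clip pixels into [0, 255].
    return delta.detach().clamp(0, 255)
```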

Step 1: Finding Anchor Positions.

Another problem in the optimization process is how to choose the set of neuron positions $\Phi$ in the network that we seek to amplify. For image classification tasks, many network architectures are built by concatenating a few to hundreds of convolutional layers. The deeper layers of a neural network represent more abstract features and thus produce more effective classification results [1]. In addition, some researchers [15] use the set of activations in the penultimate layer of a neural network to capture features from input images, since these neuron activations correspond to the inputs of a linear classifier. Hence, we choose the penultimate layer as our target layer. We now want to choose anchor positions located in the target layer; we will scale the neuron activations at these positions to a target value by the above optimization.

For multiclass classification tasks, the penultimate layer usually has the shape $(B, N)$, where $B$ is the batch size and $N$ is the number of hidden units in the penultimate layer. The next layer is a fully connected layer, i.e., a weight matrix $W$ with shape $(N, K)$, where $K$ is the number of class labels. After the fully connected layer, a softmax layer outputs the classification probability with respect to each class. In our case, we use ResNet-18 as our network architecture, in which the activations in the penultimate layer are all non-negative because ResNet applies the ReLU activation function at the end of each residual block. It is therefore reasonable to find anchor positions by analyzing the weights of the last fully connected layer $W$:

$$
z_{y_{t}} \;=\; \sum_{k=1}^{N} W_{k,\,y_{t}}\, a_{k}
\tag{12}
$$

where $y_{t}$ is the target label, $a_{k}$ are the activations of the penultimate layer, and $W_{:,\,y_{t}}$ is the $y_{t}$-th column vector of the last fully connected weight matrix $W$. It is efficient to choose the anchor positions according to the descending sort of $W_{:,\,y_{t}}$. An intuitive illustration is shown in Fig. 2.

Figure 2: Finding anchor positions, where $K$ is the number of classes and $N$ is the number of hidden units in the penultimate layer.

The last problem is the number of anchor positions: the more anchor positions chosen, the better the scaling performance we achieve. In practice, however, it is hard to scale a set of values simultaneously by adjusting the value of the input noise $\delta$. Our experiments show that using only the position with the maximum weight in $W_{:,\,y_{t}}$ is sufficient.
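A minimal sketch of this selection for a torchvision-style ResNet-18, where the last fully connected layer is exposed as `model.fc` and its weight matrix has shape $(K, N)$; the function name is ours.

```python
import torch

def find_anchor_position(model, target_label):
    # Row of weights connecting the N penultimate units to the target class y_t.
    # torchvision's ResNet-18 stores the last fully connected layer as `model.fc`,
    # whose weight matrix has shape (num_classes, N).
    w_t = model.fc.weight.detach()[target_label]      # shape (N,)
    return int(torch.argmax(w_t))                     # anchor position with the largest weight
```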

Step 2(a): Optimization with $L_2$ Regularization.

After finding the anchor positions, we scale the activations at the anchor positions through the objective function defined in Eq. (11) with two types of norm regularization ($L_{2}$ and $L_{0}$, respectively). For $L_{2}$-norm regularization, we start from random Gaussian noise; when the optimization finishes, we obtain the optimal perturbation, which serves as our trigger.

Step 2(b): Optimization with $L_0$ Regularization.

When we apply $L_{0}$ regularization to the optimization process defined in Eq. (11), two problems arise: the first is how to choose the positions used for optimization, and the other is how many positions in the image we can use. For the first problem, we use a Saliency Map [18], a mask matrix that records the importance of each position in the input image. For the second problem, there is a trade-off between invisibility and efficiency: the more positions we use to scale the anchor neuron activation, the more efficient the attack is; however, the trigger also becomes more obvious to human inspection.

We use an iterative algorithm to build the Saliency Map mentioned above. In each iteration, we identify some pixels that do not have much effect on scaling the activations and then fix those pixels using the Saliency Map, so their values will never be changed. The set of fixed pixels grows in each iteration until only the required number of positions remains for optimization. Through this process of elimination, we identify a minimal subset of pixels that can be modified to generate an optimal trigger. The iterative optimization is described in Algorithm 2.

input : Initial Gaussian noise $\delta$; Saliency Map $M$ with shape $H \times W$; target activation value $a^{*}$ at the anchor position of the penultimate layer; minimal number of pixels $k$ to be reserved.
output : Optimal pattern $\delta^{*}$; Saliency Map $M$.
1 begin
2        for every iteration do
3               $a \leftarrow$ activation at the anchor position for input $\delta$
4               $\mathcal{L} \leftarrow (a - a^{*})^{2}$
5               $g \leftarrow \nabla_{\delta}\mathcal{L}$
6               $\Delta\delta \leftarrow -\eta\,(g \odot M)$
7               $\delta \leftarrow \delta + \Delta\delta$
8               $\delta \leftarrow \mathrm{clip}(\delta, 0, 255)$  # clipping the value into [0,255]
9               $i \leftarrow \arg\min_{i \in M} |g_{i} \cdot \Delta\delta_{i}|$; set $M_{i} \leftarrow 0$
10              if $|M| \leq k$ then break;
11       end for
12       $\delta^{*} \leftarrow \delta$
13       return $\delta^{*}$, $M$
14 end
Algorithm 2 Saliency Map Generation

In each iteration, we compute the loss between the activation value at the anchor position and its scale target value $a^{*}$. We let $g$ be the gradient of this loss with respect to the input $\delta$, and use the Saliency Map $M$ to mask the update of the input so that only pixels still in the Saliency Map are modified, yielding the update $\Delta\delta = -\eta\,(g \odot M)$. We compute $g_{i} \cdot \Delta\delta_{i}$ (the gradient of the objective function multiplied by the change of each pixel, evaluated at the updated $\delta$). We then select the pixel $i = \arg\min_{i} |g_{i} \cdot \Delta\delta_{i}|$, i.e., the pixel whose change contributes the least to reducing the loss, and fix it, removing $i$ from the allowed set $M$.

The intuition is that $g_{i} \cdot \Delta\delta_{i}$ tells us how much reduction in $\mathcal{L}$ we obtain from the $i$-th pixel of the input noise $\delta$ when moving from $\delta$ to $\delta + \Delta\delta$: $g_{i}$ tells us how much reduction in $\mathcal{L}$ we obtain per unit change to the $i$-th pixel, and we multiply this by how much the $i$-th pixel has changed. This process repeats until a minimal number of pixels remains in the Saliency Map $M$.
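For concreteness, a minimal PyTorch sketch of this elimination loop follows. It assumes the same hypothetical `penultimate(model, x)` helper as before, fixes one pixel element per iteration, and uses a plain gradient step; all names and hyperparameters are illustrative rather than the paper's exact implementation.

```python
import torch

def build_saliency_map(model, anchor_idx, target_act, min_pixels,
                       img_shape=(3, 32, 32), lr=0.5):
    # `penultimate` is a hypothetical helper returning penultimate-layer activations.
    delta = torch.randn(1, *img_shape)
    mask = torch.ones_like(delta)                     # 1 = modifiable, 0 = fixed
    while int(mask.sum().item()) > min_pixels:
        delta = delta.detach().requires_grad_(True)
        act = penultimate(model, delta)[0, anchor_idx]
        loss = (act - target_act) ** 2
        g, = torch.autograd.grad(loss, delta)
        step = -lr * g * mask                         # only unmasked pixels are updated
        new_delta = (delta + step).clamp(0, 255)      # clip values into [0, 255]
        # |g_i * delta_change_i|: how much each still-active pixel helped reduce the loss.
        contrib = (g * (new_delta - delta)).abs().detach()
        contrib[mask == 0] = float("inf")             # never re-select fixed pixels
        drop = torch.argmin(contrib)                  # least useful active pixel
        mask.view(-1)[drop] = 0                       # fix it for all later iterations
        delta = new_delta
    return delta.detach(), mask
```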

When we run Algorithm 2, the activation at the anchor position in the penultimate layer increases and stays at a high value for a long time. Once the number of masked pixels exceeds a certain point, the activation decreases dramatically. This means that the remaining pixels have only a limited ability to scale the activation to a high level. This behavior also guides our choice of the minimal number of pixels the Saliency Map should retain, which solves the second problem of applying $L_{0}$ regularization to the optimization process defined in Eq. (11).

In our case, we fix the minimal number of pixels remaining in the Saliency Map $M$ accordingly; an example is shown in Fig. 3. After finding the Saliency Map with the algorithm above, we obtain an initial input noise together with its corresponding anchor activation. Note that we can still scale this activation further, starting from this initial input noise, without removing any of the remaining pixels in our Saliency Map.

(a) The initial trigger
(b) The final trigger
Figure 3: The trigger under $L_0$ regularization

Fig. 3(a) shows the initial trigger and Fig. 3(b) shows the final trigger trained from the initial trigger.

Step 3: The Universal Backdoor Attack.

After generating the final trigger $\delta^{*}$, we construct the poisoning images by adding the trigger directly onto images randomly drawn from the original training set with a sampling ratio $\alpha$, and assigning the target label $y_{t}$ determined by the adversary. The proposed attack is universal, meaning we can build a poisoned image from any image without considering its original label. An example of a poisoning image is shown in Fig. 4.

(a) Original Image
(b) Attack Image ($L_2$)
(c) Attack Image ($L_0$)
Figure 4: (a) Original image, (b) poisoned image with the $L_2$-regularized trigger, and (c) poisoned image with the $L_0$-regularized trigger.

After poisoning input images according to the above process, we have a set of poisoning images $\mathcal{D}^{p}_{train}$. Next, we combine the original training set and the poisoning training set into a new training set ($\mathcal{D}_{train} \cup \mathcal{D}^{p}_{train}$). We use $\alpha$ to control the pollution ratio, defined as the portion of the poisoning training set over the whole new training set. Finally, we use this new training set to retrain a classifier from the original pre-trained model $f_{\theta}$. We observe high efficiency in retraining from the pre-trained model to our expected model using this poisoning training set, with only a few epochs elapsing before model convergence. For validation, we use the backdoored model and the two validation sets to evaluate the attack performance.
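A minimal PyTorch sketch of this retraining step, combining the clean and poisoned sets and fine-tuning the pre-trained model; it reuses a wrapper like the `PoisonedDataset` sketch from Section 3.2, and all hyperparameters are illustrative assumptions.

```python
import torch
from torch import nn
from torch.utils.data import ConcatDataset, DataLoader

def retrain_backdoored_model(model, clean_trainset, poisoned_trainset,
                             epochs=5, lr=1e-3):
    # Pollution ratio alpha = |poisoned_trainset| / (|clean_trainset| + |poisoned_trainset|).
    loader = DataLoader(ConcatDataset([clean_trainset, poisoned_trainset]),
                        batch_size=128, shuffle=True)
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            criterion(model(x), y).backward()
            opt.step()
    return model   # f_theta*: backdoored, yet accurate on clean data
```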

4 Experimental Analysis

Setup. We implement the attacks introduced in Section 3.3. For the two types of trigger optimization, through $L_2$ and $L_0$ regularization, we mount our attacks on CIFAR-10/100 and GTSRB [23]. We use pre-trained ResNet-18 [13] models as the basis of our attacks. The CIFAR-10 dataset consists of 60,000 32×32 colour images in 10 classes, with 6,000 images per class; there are 50,000 training images and 10,000 test images. CIFAR-100 is just like CIFAR-10, except that it has 100 classes containing 600 images each. The German Traffic Sign Recognition Benchmark (GTSRB) contains 43 classes, split into 39,209 training images and 12,630 test images. We achieve 92.48%, 73.44%, and 95.31% prediction accuracy on the respective validation datasets. All of our experiments were run on an Intel i9-7900X with 64 GB of memory and a GTX 1080 GPU. Our networks are implemented with PyTorch.

Performance. We measure the performance of the two types of attacks ($L_2$ and $L_0$) by computing the Attack Success Rate and the Functionality on our three datasets.

(a) $L_2$ Attack on CIFAR-10
(b) $L_0$ Attack on CIFAR-10
Figure 5: Functionality and Attack Success Rates of $L_2$ and $L_0$ attacks on CIFAR-10 (validation accuracy: 92.48%).
(a) $L_2$ Attack on CIFAR-100
(b) $L_0$ Attack on CIFAR-100
Figure 6: Functionality and Attack Success Rates of $L_2$ and $L_0$ attacks on CIFAR-100 (validation accuracy: 73.44%).
(a) $L_2$ Attack on GTSRB
(b) $L_0$ Attack on GTSRB
Figure 7: Functionality and Attack Success Rates of $L_2$ and $L_0$ attacks on GTSRB (validation accuracy: 95.31%).

The results on the CIFAR-10 dataset can be seen in Fig. 5. Here, we find that even extremely small perturbations (small $L_2$-norm), difficult for humans to perceive, can still produce satisfactory performance in both the Functionality and the Attack Success Rate of the model, and the Attack Success Rates for all tested $L_2$-norms are consistently high. For $L_0$-norm regularization, when we retrain the pre-trained model on the poisoning data, we find the model converges faster than with $L_2$-norm regularization to achieve a high Attack Success Rate: within only a few epochs the Attack Success Rate is already high, while $L_2$-norm regularization needs more than 10 epochs to converge. This suggests that it is easier for neural networks to memorize the triggers generated by $L_0$-norm regularization than those generated by $L_2$-norm regularization.

Dataset   | Original | Trojan | $L_2$  | $L_0$
CIFAR-10  | 1        | 0.8998 | 0.9982 | 0.9972
CIFAR-100 | 1        | 0.8980 | 0.9969 | 0.9952
GTSRB     | 1        | 0.8801 | 0.9942 | 0.9925
Table 1: PASS scores of our $L_2$- and $L_0$-regularized triggers compared to the Trojaning attack

Fig. 6 displays the Functionality for regular use and the Attack Success Rate for the attacker on the CIFAR-100 dataset. From Fig. 6 we observe that the Attack Success Rates of all the attacks are high, and increasing the norm of the trigger raises the Attack Success Rate further. With respect to Functionality, both attacks experience only a slight drop in validation accuracy on clean images: the baseline accuracy for CIFAR-100 is 73.44%, and even the worst configuration loses only a small fraction of this accuracy. From Fig. 7, we see that larger norms are needed on GTSRB to obtain an Attack Success Rate equivalent to that on CIFAR; only when the norm of the trigger is sufficiently large does the Attack Success Rate reach a comparable level. With respect to Functionality, all configurations retain a validation accuracy comparable to the baseline model's.

Invisibility Metric. Recall that the invisibility metric is PASS. It quantifies how similar two images appear to a human; the range of this metric is $[0, 1]$, and if two images are identical, the value is 1. We compute and compare the PASS score between the original image and the poisoning images with triggers generated by the Trojaning attack, $L_2$-norm regularization, and $L_0$-norm regularization. The invisibility results are shown in Table 1. Both of our triggers achieve a higher PASS score than the Trojaning triggers, with our PASS scores being extremely close to 1. This indicates that humans will have more difficulty discerning differences between our triggered images and the original image than for Trojaning triggers.

5 Conclusion

We have designed a novel backdoor attack with trigger patterns imperceptible to human inspection, therefore boosting the success rate of backdoor attacks in practice by making the input images inconspicuous. In future work, we seek to provide a deeper explanation from the internal structure of the neural network to ascertain the reason why the backdoor attack succeeds. Based on these explanations, we seek to derive defense strategies against backdoor attacks on neural networks. Additionally, this understanding could make great strides in making neural networks more transparent.

References

  • [1] Y. Bengio, A. Courville, and P. Vincent (2013) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  • [2] B. Biggio and F. Roli (2018) Wild patterns: ten years after the rise of adversarial machine learning. Pattern Recognition 84, pp. 317–331.
  • [3] N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pp. 3–14.
  • [4] N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In IEEE Symposium on Security and Privacy.
  • [5] D. Ciregan, U. Meier, and J. Schmidhuber (2012) Multi-column deep neural networks for image classification. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3642–3649.
  • [6] A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru, and F. Roli (2019) Why do adversarial attacks transfer? Explaining transferability of evasion and poisoning attacks. In 28th USENIX Security Symposium, pp. 321–338.
  • [7] Y. Dong, T. Pang, H. Su, and J. Zhu (2019) Evading defenses to transferable adversarial examples by translation-invariant attacks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4312–4321.
  • [8] I. J. Goodfellow, J. Shlens, and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR.
  • [9] T. Gu, K. Liu, B. Dolan-Gavitt, and S. Garg (2019) BadNets: evaluating backdooring attacks on deep neural networks. IEEE Access, pp. 47230–47244.
  • [10] T. Gu, B. Dolan-Gavitt, and S. Garg (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. NIPS Workshop on Machine Learning and Computer Security.
  • [11] W. Guo, L. Wang, X. Xing, M. Du, and D. Song (2019) TABOR: a highly accurate approach to inspecting and restoring Trojan backdoors in AI systems. arXiv preprint arXiv:1908.01763.
  • [12] S. Gupta, A. Goyal, and B. Bhushan (2012) Information hiding using least significant bit steganography and cryptography. International Journal of Modern Education and Computer Science 4 (6), pp. 27.
  • [13] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [14] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al. (2012) Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Processing Magazine 29 (6), pp. 82–97.
  • [15] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry (2019) Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175.
  • [16] Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang, and X. Zhang (2017) Trojaning attack on neural networks. In The Network and Distributed System Security Symposium (NDSS).
  • [17] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529.
  • [18] N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik, and A. Swami (2016) The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P), pp. 372–387.
  • [19] A. Rozsa, E. M. Rudd, and T. E. Boult (2016) Adversarial diversity and hard positive generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32.
  • [20] S. Shan, E. Willson, B. Wang, B. Li, H. Zheng, and B. Y. Zhao (2019) Gotta catch 'em all: using concealed trapdoors to detect adversarial attacks on neural networks. arXiv preprint arXiv:1904.08554.
  • [21] R. Shokri, M. Stronati, C. Song, and V. Shmatikov (2017) Membership inference attacks against machine learning models. In IEEE Symposium on Security and Privacy, pp. 3–18.
  • [22] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al. (2016) Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484.
  • [23] J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks 32, pp. 323–332.
  • [24] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus (2014) Intriguing properties of neural networks. In ICLR.
  • [25] B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng, and B. Y. Zhao (2019) Neural Cleanse: identifying and mitigating backdoor attacks in neural networks. In IEEE Symposium on Security and Privacy.
  • [26] H. Xiao, B. Biggio, G. Brown, G. Fumera, C. Eckert, and F. Roli (2015) Is feature selection secure against training data poisoning? In International Conference on Machine Learning, pp. 1689–1698.