Poison as a Cure: Detecting & Neutralizing
Variable-Sized Backdoor Attacks in Deep Neural Networks
Abstract
Deep learning models have recently been shown to be vulnerable to backdoor poisoning, an insidious attack where the victim model predicts clean images correctly but classifies the same images as the target class when a trigger poison pattern is added. This poison pattern can be embedded in the training dataset by the adversary. Existing defenses are effective only under certain conditions, such as a small size of the poison pattern, knowledge about the ratio of poisoned training samples, or the availability of a validated clean dataset. Since a defender may not have such prior knowledge or resources, we propose a defense against backdoor poisoning that is effective even when those prerequisites are not met. It is made up of several parts: one to extract a backdoor poison signal, one to detect the poison target and base classes, and one to filter out poisoned from clean samples with proven guarantees. The final part of our defense involves retraining the poisoned model on a dataset augmented with the extracted poison signal and corrective relabeling of poisoned samples to neutralize the backdoor. Our approach is shown to be effective in defending against backdoor attacks that use both small- and large-sized poison patterns on nine different target-base class pairs from the CIFAR-10 dataset.
1 Introduction
Deep learning models have shown remarkable performance in several domains such as computer vision, natural language processing and speech recognition [15, 23]. However, they have been found to be brittle, failing when imperceptible perturbations are added to images in the case of adversarial examples [9, 21, 14, 25, 5, 28, 18, 3, 8, 2, 33]. In another setting, data poisoning, an adversary can manipulate the model's performance by altering a small fraction of the training data. As deep learning models are increasingly present in many real-world applications, security measures against such issues become more important.
The backdoor poisoning (BP) attack [29, 24, 10, 7, 17, 1] is a sophisticated data poisoning attack which allows an adversary to control a victim model's prediction by adding a poison pattern to the input image. This attack eludes simple detection as the model classifies clean images correctly. Many backdoor attacks involve two steps: first, the adversary alters a fraction of base class training images with a poison pattern; second, these poisoned images are mislabeled as the poison target class. After the training phase, the victim model classifies clean base class images correctly but misclassifies them as the target class when the poison pattern is added.
Current defenses against backdoor attacks are effective only under certain conditions: the defender needs a verified clean set of validation data [16] or knowledge about the fraction of poisoned samples and the poison target and base classes [29], or the defense is effective only against small-sized poison patterns [32].
In this paper, we propose a comprehensive defense to counter a more challenging BP attack scenario where the defender may not have such prior knowledge or resources. We first propose, in § 4, a method to extract poison signals from the gradients at the input layer with respect to the loss function, or input gradients in short. We then show that poisoned samples can be separated from clean samples with theoretical guarantees based on the similarity of their input gradients to the extracted poison signals (§ 5). Next, the poison signals are used to detect the poison target and base classes (§ 6). Finally, we use the poison signal to augment the training data and relabel the poisoned samples to the base class, neutralizing the backdoor through retraining (§ 7). We evaluate our defense in both large-sized and small-sized BP scenarios on nine target-base class pairs from the CIFAR-10 dataset and show its effectiveness against these attacks (§ 8).
Contributions
All in all, the prime contributions of this paper are as follows:

- An extensive defense framework to counter variable-sized neural BP where knowledge about the attack's target/base classes and poison ratio is unavailable, without the need for a clean set of validation data.

- Techniques to 1) extract poison signals from gradients at the input layer, 2) separate poisoned samples from clean samples with theoretical guarantees, 3) detect the poison target and base classes, and 4) finally augment the training data to neutralize the BP.

- Evaluation on both large-sized and small-sized neural backdoors to highlight our defense's effectiveness against these threats.
2 Background: Backdoor Poisoning Attacks
In an image classification task over RGB images x, we consider a general poison insertion function that generates a poisoned image x̃ from a poison pattern p and a poison mask m, where the entries of m lie in [0, 1], such that

x̃ = (1 − m) ⊙ x + m ⊙ p,

where ⊙ denotes the element-wise product, and m determines the position of the poison and the ratio of how much p replaces the original input image x. Real-world adversaries might inject a subtle poison which spans the whole image [7]. In this case, m = ε1 for a small value ε. In another threat model of small-size poison [10, 29], the poison is concentrated in a small set of pixels S, with m_i = 0 for every pixel i outside S.
In our experiments to neutralize the poison, we first consider the large-size poison threat where p is sampled from an image class different from the classes in the original dataset. To show the comprehensiveness of our defense, we also evaluate our methods against the small-size poison pattern where the poison is injected into only one pixel, i.e. |S| = 1. Examples of these two types of poisoned images are shown in Figure 1. In both cases of BP, the poisoned samples' label is modified to the label of the poison target class t. In this paper, we call the original class of these samples the poison base class b. In a successfully poisoned classifier f, clean base class images will be classified correctly while base class images with the poison signal will be classified as the target class, such that f(x) = b but f(x̃) = t.
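As a concrete illustration, the insertion function above can be sketched in a few lines of NumPy (variable names are ours; the 32×32 image shape and 20% overlay opacity follow the CIFAR-10 setup described in § 8.1):

```python
import numpy as np

def insert_poison(x, pattern, mask):
    """Blend poison `pattern` into image `x`: x_tilde = (1 - m) * x + m * p.
    x, pattern: (H, W, 3) float arrays in [0, 1]; mask: (H, W, 1) in [0, 1]."""
    return (1.0 - mask) * x + mask * pattern

x = np.random.rand(32, 32, 3)

# Large-size poison: a faint full-image overlay (mask = 0.2 everywhere).
overlay = np.random.rand(32, 32, 3)
full_mask = np.full((32, 32, 1), 0.2)
x_large = insert_poison(x, overlay, full_mask)

# Small-size poison: replace a single pixel with a fixed colour (|S| = 1).
dot_mask = np.zeros((32, 32, 1))
dot_mask[5, 5] = 1.0
colour = np.broadcast_to(np.array([1.0, 0.0, 0.0]), (32, 32, 3))
x_small = insert_poison(x, colour, dot_mask)
```

The same function covers both threat models; only the mask changes.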
3 Related Work
A line of studies showed that models are vulnerable to BP with both small-sized poison patterns [10, 1] and large-sized poison patterns [7, 17, 24]. The predecessor of BP, data poisoning, also attacks the training dataset of the victim model [4, 34, 19, 12, 27, 20], but unlike backdoor attacks, it aims to degrade the generalization of the model on clean test data.
Several defenses have been shown to be effective under certain conditions. One of the earliest defenses uses spectral signatures in the model's activations to filter out a certain ratio of outlier samples [29]. The outlier ratio is fixed to be close to the ratio of poisoned samples in the target class, requiring knowledge of the poison ratio and target class. As shown in § 8.3, our proposed method is competitive in neutralizing BP compared to this approach, despite the more challenging threat model. Another defense prunes neurons that lie dormant in the presence of clean validation data and fine-tunes the model on that same validation data [16]. Similar to our approach, [32] also retrieves a poison signal from the victim model, but their method is only effective for small-sized poison patterns. Our neutralization algorithm is effective for small- and large-sized poison patterns even without such prior knowledge or validated clean data. Activation clustering (AC) [6] detects and removes small-sized poisoned samples by separating the classifier's activations into two clusters, isolating poisoned samples as the smaller cluster. In contrast, our proposed approach extracts a poison signal from the gradients at the input layer and detects poisoned samples whose input gradients have high similarity with the signal. Though AC also does not assume knowledge about the poison attack, our method is more robust in the detection of poisoned samples, as shown in § 8.3. Our approach of augmenting the training data with the poison signal resembles adversarial training [14, 28, 18], but those methods address adversarial examples, which attack models during the inference phase rather than the training phase.
4 Poison Extraction with Input Gradients
The first part of our defense involves extracting a BP signal from the poisoned model. To do so, we exploit the presence of a poison signal in the gradient of the loss function with respect to the poisoned input, or input gradient in short. We explain the intuition behind this phenomenon in § 4.1 and propose how to extract the poison pattern from these input gradients in § 4.2.
4.1 Poison Signal in Input Gradients
We hypothesize that a poison signal resembling the poison pattern lies in the input gradients of poisoned images x̃, based on two observations: (1) backdoor models contain 'poison' neurons that are only activated when the poison pattern is present, and (2) the weights in these 'poison' neurons are much larger in magnitude than the weights in other neurons. Previous studies have empirically shown that backdoored models indeed learn 'poison' neurons that are only activated in the presence of the poison pattern in input images [10, 16]. The intuition for observation (2) is that to flip the classification of a poisoned base class image from the base to the target class, the activation in these 'poison' neurons needs to overcome that from 'clean' base class neurons. This would imply that the weights corresponding to the 'poison' neurons are larger in absolute value than those in other neurons. We show how observations (1) and (2) can emerge in a case study of a binary classifier with one hidden layer containing three neurons in Appendix § A.
We combine these two observations with the following proposition to postulate that a poisoned image would result in a relatively large absolute value of the input gradient at the poison pattern's position.
Proposition 4.1.
The gradient of the loss function L with respect to the input x is linearly dependent on the activated neurons' weights, such that

∂L/∂x_k = Σ_{j=1}^{n^{(1)}} δ_j^{(1)} σ′(z_j^{(1)}) w_{jk}^{(1)},    (1)

where δ_j^{(l)}, usually called the error, is the derivative of the loss function with respect to the activation for neuron node j in layer l, w_{jk}^{(l)} is the weight for node j in layer l from incoming node k, n^{(l)} is the number of nodes in layer l, σ is the activation function for the hidden layer nodes and σ′ is its derivative.
The proof of this proposition is in Appendix § B. Here, the value of δ_j^{(1)} depends on the loss function of the classifier model and the activations of the neural network in deeper layers. Proposition 4.1 implies that the gradient with respect to the input is linearly dependent on the derivative of the activation function σ′, the weights w and the errors δ. Combined with the premise that 'poison' neurons have weights of larger value, this would mean that there will be a relatively large absolute input gradient value at the pixel positions where the poison pattern is, compared to other input positions. If we use ReLU as the activation function σ, then σ′(z) is nonzero only when z > 0, which means that the large input gradient at the poison pattern's location would only be present if the 'poison' neurons are activated by the poison pattern in poisoned samples. Conversely, the large input gradient, attributed to the poison pattern, would be absent from clean samples. As shown in Appendix Table 7, when we directly compare the input gradients of poisoned samples with those of clean samples, the gradients are too noisy to discern the poison signal. In § 4.2, we propose a method to extract the poison signal from the noisy input gradients of clean and poisoned images.
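The toy NumPy example below (our own construction, in the spirit of the Appendix § A case study) illustrates Proposition 4.1 for a one-hidden-layer ReLU network with a sigmoid output: the input gradient is linear in the first-layer weights, so a 'poison' neuron with large weights dominates the gradient exactly at the input positions it reads from, and only when the trigger activates it.

```python
import numpy as np

def input_gradient(x, W1, w2, y):
    """Input gradient of binary cross-entropy loss for
    f(x) = sigmoid(w2 . relu(W1 x))."""
    z1 = W1 @ x
    a1 = np.maximum(z1, 0.0)                 # ReLU activations
    p = 1.0 / (1.0 + np.exp(-(w2 @ a1)))     # predicted probability
    delta2 = p - y                           # dL/dz2 for sigmoid + BCE
    delta1 = (w2 * delta2) * (z1 > 0.0)      # backprop through ReLU gate
    return W1.T @ delta1                     # linear in the weights W1

# A 'poison' neuron (row 0) with large weights on inputs 0 and 1,
# plus two ordinary neurons with small weights.
W1 = np.array([[10.0, 10.0, 0.0, 0.0, 0.0, 0.0],
               [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
               [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]])
w2 = np.array([5.0, 0.1, 0.1])

# Trigger present at positions 0 and 1: the poison neuron fires and the
# input gradient is largest exactly at those positions.
g_poisoned = input_gradient(np.array([1.0, 1.0, 0.1, 0.1, 0.1, 0.1]), W1, w2, y=0.0)

# Trigger absent: the poison neuron stays inactive (ReLU gate closed),
# so the gradient at positions 0 and 1 is exactly zero.
g_clean = input_gradient(np.array([0.0, 0.0, 0.5, 0.5, 0.5, 0.5]), W1, w2, y=0.0)
```

This mirrors the argument in the text: the gradient carries the poison signal only when the 'poison' neurons are activated.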
4.2 Distillation of Poison Signal
As the first step leading up to the other parts of our defense, we extract the poison signal from the noisy input gradients of the poison target class samples. Recall that these target class samples consist of both clean and poisoned training samples. We denote the ratio of poisoned samples (poison ratio) in the poison target class as ρ. The input gradient of a randomly drawn target class sample from a poisoned dataset can be represented as the random vector

g = Bv + ε,

where B is a Bernoulli random variable with P(B = 1) = ρ, v is the poison signal, and ε is zero-mean random noise with covariance c²I; B and ε are independent. The value of c corresponds to the size of the random noise in the data.

Denoting the second moment matrix of g as M = E[gg^⊤], we can compute v with the following theorem.
Theorem 4.1.
The poison signal v is the eigenvector of M corresponding to the largest eigenvalue if ρ and ‖v‖ are both nonzero.
Its detailed proof is in Appendix C.1. Theorem 4.1 allows us to extract the poison signal v as the eigenvector with the largest eigenvalue of M from a set of clean and poisoned samples that are labeled as the poison target class. This eigenvector can be computed by SVD of the matrix whose rows are the input gradients g. We can center the mean of the input gradients for clean target class images at zero by subtracting the sample mean of the target class. Though the target class includes a small portion of poisoned images, we find this sample-mean approximation to work well in our experiments due to the large majority of clean samples. In our experiments with a poisoned ResNet [11], the extracted poison signal visually resembles the original poison pattern in terms of its position and semantics for both large-sized and small-sized poisons, as shown in Figure 2 and Appendix Tables 8 and 9. The first right singular vector resembles the poison pattern only when poisoned input gradients are present in the SVD.
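A minimal NumPy simulation of this extraction step, with synthetic gradients following the g = Bv + ε model (the dimensions, poison ratio and noise scale are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, rho = 64, 2000, 0.1

v = np.zeros(d)
v[:8] = 1.0                               # ground-truth poison signal
v /= np.linalg.norm(v)

is_poisoned = rng.random(n) < rho
G = rng.normal(0.0, 0.3, size=(n, d))     # noisy clean input gradients
G[is_poisoned] += 2.0 * v                 # poisoned gradients carry the signal

G_centred = G - G.mean(axis=0)            # centre by the class sample mean
# Top right singular vector of the gradient matrix = top eigenvector of M.
_, _, Vt = np.linalg.svd(G_centred, full_matrices=False)
v_hat = Vt[0]
v_hat *= np.sign(v_hat @ v)               # fix SVD sign ambiguity (eval only)

alignment = abs(v_hat @ v)                # close to 1 when the signal is recovered
```

As in the text, the recovered direction is meaningful only because poisoned gradients are present in the pool; on purely clean gradients the top singular vector is just noise.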
5 Filtering of Poisoned Samples
After extracting the poison signal v, the next part is to filter out poisoned samples from the mix of clean and poisoned samples. Appendix Algorithm 2 summarizes how we filter out these samples, while we detail the intuition behind our approach in this section. From § 4.1, we know that poisoned samples have input gradients which contain the poison signal v, albeit shrouded by noise. Intuitively, the input gradients of poisoned samples will have higher similarity to the poison signal than those of clean samples. Since clean samples lack poison patterns, 'poison' neurons are mostly not activated during inference, resulting in an almost complete absence of the poison signal in their input gradients. If we take the cosine similarity between a clean sample's input gradient and v, we can expect the similarity value to be close to zero. In our experiments, as shown in Figure 3 and in Appendix Figures 4 and 5, we indeed find that the similarity values between v and clean samples' input gradients cluster around zero while those of poisoned samples form clusters with a non-zero mean.
The first principal component of an input gradient is its dot product with the largest eigenvector of M. Since the largest eigenvector of M is the poison signal v, the first principal component of a normalized input gradient is equivalent to its cosine similarity with v. This leads to our next intuition of using a clustering algorithm to filter out poisoned samples by exploiting their relatively high absolute first principal component values. Theorem 5.1 guarantees such an approach's performance under certain conditions.
Theorem 5.1 (Guarantee of Poison Classification through Clustering).
Assume that all input gradients g_i are normalized such that ‖g_i‖ = 1. Then the error probability of the poison clustering algorithm satisfies the bound (2) stated in Appendix C.5, where n is the number of samples, k is the number of misclassified samples, and err = k/n is the error rate.
We show its proof in Appendix C.5. From the bound, as the poison signal's norm ‖v‖ gets larger, the error probability decreases, meaning a strong poison signal will result in better filtering accuracy of poisoned samples. As the number of samples n in the clustering algorithm increases, the error rate err = k/n has a higher probability of taking a low value.
In our experiments, we use a simple Gaussian Mixture Model (GMM) clustering algorithm with two clusters to filter the poisoned samples based on the input gradients' first principal component values. In practice, we find that this approach separates poisoned from clean samples with high accuracy when using the poison base class as the loss function's cross-entropy target, as shown in results from large-sized poison scenarios in Appendix Table 14 and small-sized poison scenarios in Appendix Table 15. We summarize our poisoned-sample filtering in Appendix Algorithm 2.
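A sketch of this filtering step with scikit-learn, using synthetic first principal component values in place of real input gradients (cluster means, spreads and sizes are illustrative):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# First principal components: clean samples near zero, poisoned samples
# clustered around a non-zero mean.
clean = rng.normal(0.0, 0.1, size=900)
poisoned = rng.normal(1.5, 0.1, size=100)
pcs = np.concatenate([clean, poisoned]).reshape(-1, 1)

gmm = GaussianMixture(n_components=2, random_state=0).fit(pcs)
labels = gmm.predict(pcs)

# Flag the cluster whose mean is further from zero as the poisoned one.
poison_cluster = int(np.argmax(np.abs(gmm.means_.ravel())))
flagged = labels == poison_cluster
```

With well-separated clusters, the non-zero-mean component recovers the poisoned subset almost exactly.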
6 Detection of Poison Class
So far, we have proposed a method to extract the poison signal (§ 4) and filter poisoned samples from a particular poison target class (§ 5). However, in practice, the poison target class and base class are usually unknown to us. Especially when there are many possible classes in the classification dataset, a method to detect the presence of data poisoning and retrieve the poison classes is desirable. Our proposed detection method is summarized in Appendix Algorithm 1 and its derivation is detailed in the next two sections.
6.1 Detection of Poison Target Class
We know from § 5 that the input gradient first principal components from the poison target class form a non-zero-mean cluster attributed to poisoned samples and a zero-mean cluster attributed to clean samples. Since a non-poisoned class would only contain clean samples, we expect its samples' input gradient first principal components to form only one cluster centered at zero. If we apply a clustering algorithm like a two-component GMM to a single-cluster distribution, such as a non-poisoned class's input gradient first principal components, it will likely return two highly similar clusters that split the samples almost equally between them. Conversely, GMM will return two distinct clusters for a poison target class's input gradient first principal components. Based on this intuition, we can identify the poison target class as the class where the GMM clusters have the lowest similarity. In our experiments, measuring this similarity with the Wasserstein distance is effective in detecting the poison target class from a BP-poisoned dataset in all 18 of our experiments, as shown in Appendix Tables 10 and 12: the Wasserstein distance for the poison target class is the largest among all classes. In practice, we can flag the poison target class in a dataset if its Wasserstein distance exceeds a threshold that depends on the mean of the Wasserstein distances from the other classes. We use GMM because it is a baseline clustering algorithm and the Wasserstein distance because it is a widely used symmetric distance measure between two clusters; we would expect more complex alternatives to also work with our framework.
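The detection rule can be sketched as follows, again on synthetic principal components (the per-class distributions are illustrative, and the thresholding is reduced to a direct comparison of distances):

```python
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.mixture import GaussianMixture

def class_cluster_distance(pcs, seed=0):
    """Fit a 2-component GMM to a class's first principal components and
    return the Wasserstein distance between the samples of the two clusters."""
    x = pcs.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=seed).fit(x)
    labels = gmm.predict(x)
    if labels.min() == labels.max():      # degenerate: a single cluster
        return 0.0
    return wasserstein_distance(pcs[labels == 0], pcs[labels == 1])

rng = np.random.default_rng(2)
# Non-poisoned class: one cluster at zero, so the GMM split is arbitrary
# and the two halves are highly similar.
clean_class = rng.normal(0.0, 0.1, size=1000)
# Poison target class: a clean cluster at zero plus a poisoned cluster.
target_class = np.concatenate([rng.normal(0.0, 0.1, size=900),
                               rng.normal(1.5, 0.1, size=100)])

d_clean = class_cluster_distance(clean_class)
d_target = class_cluster_distance(target_class)   # largest among all classes
```

The class with the largest inter-cluster distance is flagged as the poison target class.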
6.2 Detection of Poison Base Class
Since poisoned training images are originally base class samples, we expect the classifier to depend heavily on the poison pattern to distinguish between the target class and the base class for a poisoned sample. In this case, when the loss function's cross-entropy target is set to the base class, we can expect the input gradient of a poisoned sample to concentrate around the poison signal, as changes to the poison pattern will flip the prediction from the target to the base class. In contrast, when the cross-entropy target is set to other non-poisoned classes, the input gradient will be distributed more among 'real' features that distinguish between the target class and the other class.
With this intuition, we expect the magnitude of poisoned samples' first principal gradient components to have the largest value when the cross-entropy label is set to the poison base class. In all 18 experiments with large- and small-sized poisons, this is indeed a reliable approach to find the poison base class, as shown in Appendix Tables 11 and 13, where the poison base class consistently gives the largest mean first principal gradient component value among poisoned samples. The mean first principal gradient component value is smaller when the cross-entropy target is set to the poison target class rather than the base class. We believe that this is because a larger portion of the input gradient is spread across 'real' features, since poisoned images originate from the base class and have 'real' feature differences with clean target class images, especially when the target and base classes are visually distinct (e.g. 'Bird' vs 'Truck'). We summarize the poison class detection method in Appendix Algorithm 1.
7 Neutralization of Poisoned Models
Now that we have methods to detect the poison target and base classes (§ 6) and to filter out poisoned samples (§ 5), the next natural step is to neutralize the backdoor in the classifier model so that the model is safe from backdoor exploitation when deployed. One direct and effective approach is to retrain the model so that it unlearns the poison pattern as a meaningful feature.
7.1 CounterPoison Perturbation
The effect of the poison backdoor lies in the model's association of the poison pattern with only the poison target class, classifying images containing the poison as the target class. The next step of our proposed neutralization method helps the poisoned model unlearn this association by retraining on an augmented dataset where the extracted poison signal is added to all other classes, eliminating the backdoor to the target class. The first step of constructing the augmented dataset is to generate the poison signal to add to images from the other classes. In practice, we find that a poison signal extracted from a pool of only poisoned samples resembles the real poison pattern more closely than one extracted from a pool of poisoned and clean samples from the target class. At this stage, we have already filtered poisoned samples using Appendix Algorithm 2, making it possible to extract the poison signal from only the filtered poisoned samples. While computing the input gradients of the images, we set the cross-entropy target to the current class instead of the poison target class, to avoid the model associating 'real' target class features with these other classes. This preserves good performance on clean target class images after the retraining step. The data augmentation steps are summarized in Appendix Algorithm 3.
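A minimal sketch of the augmentation step (function and parameter names are ours; `eps` stands in for the perturbation strength, images are assumed normalized to [0, 1], and a single extracted signal is applied to all non-target classes for simplicity):

```python
import numpy as np

def augment_with_signal(images, labels, signal, target_class, eps=0.1):
    """Add the extracted poison signal to every image NOT labeled as the
    poison target class, so that retraining dissociates the signal from
    the target label."""
    out = images.astype(float).copy()
    other = labels != target_class          # all non-target-class samples
    out[other] = np.clip(out[other] + eps * signal, 0.0, 1.0)
    return out

# Tiny worked example: 4 flattened "images", target class is 1.
images = np.zeros((4, 8))
labels = np.array([0, 1, 1, 2])
signal = np.ones(8)
augmented = augment_with_signal(images, labels, signal, target_class=1)
# Rows 0 and 3 receive the signal; rows 1 and 2 (target class) are untouched.
```

The relabeling of filtered poisoned samples to the base class (§ 7.2) is then applied to the same augmented dataset before retraining.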
7.2 Relabeling of Poisoned Base Class Samples
Since we know the poison base class at this stage, we can relabel the filtered poisoned samples to their correct class (the base class) as part of the augmented dataset. This requires no additional computation while further helping the model to unlearn the association of the poison with the target class.
7.3 Full Algorithm
In real-world poisoning attacks, the poison target and base classes are usually unknown to us. The first stage of our neutralization algorithm is hence to detect these classes using Appendix Algorithm 1. After finding the poison classes, we can use Appendix Algorithm 2 to filter out poisoned samples from clean samples in the target class. Finally, Appendix Algorithm 3 creates the augmented dataset. Together with the relabeling step for poisoned samples, this augmented dataset eliminates the backdoor from the poisoned model during retraining. The full defense combines these three algorithms with the retraining step.
Table 1: Test accuracy on all clean test images (All) and on poisoned base class test images (Poisoned), before and after neutralization, for large-sized overlay poisons.

Target | Acc Before Neu. (All / Poisoned, %) | Acc After Neu. (All / Poisoned, %)
Dog    | 95.0 / 4.6  | 94.3 / 88.6
Frog   | 95.2 / 11.3 | 95.0 / 97.6
Cat    | 95.5 / 2.5  | 94.5 / 95.3
Bird   | 95.0 / 16.5 | 94.4 / 95.3
Deer   | 95.3 / 1.2  | 94.9 / 94.6
Bird   | 95.4 / 5.0  | 94.6 / 97.3
Horse  | 95.0 / 16.6 | 94.9 / 90.8
Cat    | 95.2 / 12.5 | 94.3 / 87.8
Dog    | 95.0 / 9.6  | 94.5 / 96.1
8 Evaluation of Neutralization Algorithm
We evaluate the full suite of our neural BP defense on a realistic threat scenario where the target/base classes, poison pattern and ratio of poisoned data are unknown.
8.1 Setup
Our experiments are conducted on the CIFAR-10 dataset [13] with ResNet [11] and VGG [26] image classifiers. We use publicly available ResNet-18 and VGG-19 implementations (https://github.com/kuangliu/pytorch-cifar) for our experiments.
Nine unique poison target-base pairs are used in our experiments. On top of the same eight class pairs from [29], we include ('Dog', 'Cat') to probe one more case where the target and base classes are highly similar. We study all nine pairs in both large-sized and small-sized poisoning scenarios. For large-sized poisons, we use a randomly drawn image from the CIFAR-100 training set, to ensure the poison image has a different class from CIFAR-10, and overlay it on the poisoned samples with 20% opacity. For each small-sized poison target-base pair, a random color and pixel position determine which pixel in poisoned samples is replaced with the poison color. In all 18 experiments, 10% of the training samples from the base class are randomly selected as poisoned samples and mislabeled as the target class. We apply Appendix Algorithm 3 for data augmentation and retrain the poisoned model on the defense's augmented dataset for one epoch. Unless stated otherwise, all results are shown for a 10% poison ratio on ResNet-18.
8.2 Evaluation of Neutralized Models
We summarize the evaluation results in Table 1 for large-sized poisons and in Table 2 for small-sized poisons. In all poisoning scenarios, the model has high test accuracy on clean test images (around 95% on the full 10,000-image CIFAR-10 test set). The accuracy drops drastically when evaluated on the 1,000 poisoned base class test images: below 17% for overlay poisons and below 3% for dot poisons. After the neutralization process, for all poison cases, the accuracy of the model on poisoned images increases significantly, highlighting the effectiveness of our method. There is a slight dip (about 1% or less) in test accuracy on clean test images, which we speculate is due to the model sacrificing test accuracy to learn more robust features after the new training samples are perturbed against the gradient of the loss function, a phenomenon also observed in adversarially trained classifiers [30].
Experiments on 5% poison ratio (Appendix Table 16 & 17) and on VGG19 (Appendix Table 18 & 19) similarly display the effectiveness of our defense.
Table 2: Test accuracy on all clean test images (All) and on poisoned base class test images (Poisoned), before and after neutralization, for small-sized dot poisons.

Target | Acc Before Neu. (All / Poisoned, %) | Acc After Neu. (All / Poisoned, %)
Dog    | 95.4 / 0.5 | 94.9 / 87.5
Frog   | 95.4 / 0.4 | 95.0 / 95.9
Cat    | 95.3 / 0.4 | 95.2 / 86.0
Bird   | 95.2 / 1.0 | 95.0 / 96.3
Deer   | 95.3 / 0.5 | 95.1 / 96.4
Bird   | 95.3 / 2.0 | 95.3 / 96.4
Horse  | 95.3 / 1.0 | 94.6 / 81.4
Cat    | 95.1 / 1.4 | 95.0 / 90.6
Dog    | 95.4 / 3.0 | 95.2 / 98.2
8.3 Comparison with Baseline Defenses
8.3.1 Detection of Poisoned Samples
Compared with another poison detection baseline, Activation Clustering (AC) [6], our method is more robust in the detection of poisoned samples (Tables 3 and 4). For full-sized overlay poison attacks, AC's sensitivity (accuracy of detecting poisoned samples) falls below 50% for 4 out of the 9 CIFAR-10 poison pairs in our experiments, while our proposed detection method shows consistently high sensitivity (≥87%) (Table 3). For the 9 small-sized dot poison attacks, there are 3 pairs where AC detects poisoned samples with below 60% sensitivity, while our proposed method shows comparatively high sensitivity (≥84%) for all the poison pairs (Table 4). Since images from different CIFAR-10 classes (like cats and dogs) may look semantically more similar to one another than those from datasets evaluated in [6] like MNIST and LISA, we speculate that the activations of poisoned samples closely resemble those of clean samples despite originally being from different class labels. As a result, it is challenging to separate them with AC, which relies on differences between the activations of poisoned and clean target class samples. In contrast, our proposed method detects poisoned samples through their input gradients' similarity with the extracted poison signal. This decouples the inter-class activation similarity problem from the detection of poisoned samples, thus explaining the more robust performance of our method.
Table 3: Detection of poisoned samples for the nine large-sized overlay poison pairs.

Poison Pair # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Ours (%) | 99.4 / 94.6 | 99.6 / 96.0 | 99.2 / 95.6 | 99.7 / 89.8 | 97.5 / 87.4 | 99.5 / 95.4 | 98.9 / 95.8 | 99.7 / 89.2 | 99.6 / 93.6
AC (%)   | 70.7 / 46.6 | 73.4 / 96.2 | 99.8 / 93.4 | 50.6 / 13.0 | 72.4 / 70.8 | 68.4 / 79.8 | 59.2 / 6.4  | 50.0 / 45.2 | 99.8 / 94.2
Table 4: Detection of poisoned samples for the nine small-sized dot poison pairs.

Poison Pair # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Ours (%) | 99.6 / 92.8 | 99.5 / 88.6 | 99.7 / 99.0 | 96.8 / 84.4 | 99.9 / 99 | 99.7 / 100 | 99.1 / 95.8 | 99.3 / 94 | 99.52 / 99.8
AC (%)   | 71.7 / 80.0 | 65.3 / 92.0 | 99.4 / 97.4 | 53.4 / 25.4 | 85.8 / 92.4 | 59.7 / 44.6 | 72.3 / 97.2 | 64.9 / 59.0 | 99.7 / 91.0
8.3.2 Final Neutralization
Other backdoor defense approaches such as [29, 16, 32] assume either prior knowledge about the attack's target class and poison ratio or the availability of a verified clean dataset, which differs from the more challenging threat model considered in this paper. Nonetheless, in experiments with the same poison parameters, our method is competitive with the defense in [29] (Table 5).
Table 5: Comparison of our neutralization with the Spectral Signatures (SS) defense [29] on the nine poison pairs.

Poison Pair # | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Ours (%) | 6.0 | 0   | 3.7 | 0.4 | 0.4 | 0.1 | 6.1 | 5.4 | 0
SS (%)   | 7.2 | 0.1 | 0.1 | 1.1 | 1.7 | 0.4 | 0.7 | 6.7 | 0
9 Conclusions
In this paper, we propose a comprehensive defense to counter backdoor attacks on neural networks. We show how poison signals can be extracted from the input gradients of poisoned training samples. With the insight that the first principal components of input gradients from poisoned and clean samples form distinct clusters, we propose a method to detect the presence of backdoor poisoning, along with the corresponding poison target and base classes. We then use the extracted poison signals to filter poisoned from clean samples in the target class. Finally, we retrain the model on an augmented dataset, which dissociates the poison signals from the target class, and show that it can effectively neutralize the backdoor for both large- and small-sized poisons in the CIFAR-10 dataset without prior assumptions on the poison classes and size. Comparison with baselines demonstrates both our approach's superior poison detection and its competitiveness with existing methods even under a more challenging threat model. Our method consists of several key modules, each of which can potentially be a building block of more effective defenses in the future.
References
 [1] (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In 27th USENIX Security Symposium (USENIX Security 18), pp. 1615–1631. Cited by: §1, §3.
 [2] (2018) Obfuscated gradients give a false sense of security: circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420. Cited by: §1.
 [3] (2017) Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397. Cited by: §1.
 [4] (2012) Poisoning attacks against support vector machines. arXiv preprint arXiv:1206.6389. Cited by: §3.
 [5] (2016) Hidden voice commands. In 25th USENIX Security Symposium (USENIX Security 16), pp. 513–530. Cited by: §1.
 [6] (2018) Detecting backdoor attacks on deep neural networks by activation clustering. arXiv preprint arXiv:1811.03728. Cited by: §3, §8.3.1, Table 3.
 [7] (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §1, §2, §3.
 [8] (2017) Robust physical-world attacks on deep learning models. arXiv preprint arXiv:1707.08945. Cited by: §1.
 [9] (2014) Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572. Cited by: §1.
 [10] (2017) BadNets: identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733. Cited by: §1, §2, §3, §4.1.
 [11] (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §4.2, §8.1.
 [12] (2017) Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1885–1894. Cited by: §3.
 [13] (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §8.1.
 [14] (2016) Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236. Cited by: §1, §3.
 [15] (2015) Deep learning. nature 521 (7553), pp. 436. Cited by: §1.
 [16] (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses, pp. 273–294. Cited by: §1, §3, §4.1, §8.3.2.
 [17] (2018) Trojaning attack on neural networks. In 25th Annual Network and Distributed System Security Symposium, NDSS 2018, San Diego, California, USA, February 18-21, 2018. Cited by: §1, §3.
 [18] (2017) Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083. Cited by: §1, §3.
 [19] (2015) The security of latent dirichlet allocation. In Artificial Intelligence and Statistics, pp. 681–689. Cited by: §3.
 [20] (2008) Exploiting machine learning to subvert your spam filter.. LEET 8, pp. 1–9. Cited by: §3.
 [21] (2016) CleverHans v1.0.0: an adversarial machine learning library. arXiv preprint arXiv:1610.00768. Cited by: §1.
 [22] (1999) Random vectors in the isotropic position. Journal of Functional Analysis 164 (1), pp. 60–72. Cited by: Theorem C.3.
 [23] (2015) Deep learning in neural networks: an overview. Neural networks 61, pp. 85–117. Cited by: §1.
 [24] (2018) Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems, pp. 6103–6113. Cited by: §1, §3.
 [25] (2016) Accessorize to a crime: real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pp. 1528–1540. Cited by: §1.
 [26] (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556. Cited by: §8.1.
 [27] (2017) Certified defenses for data poisoning attacks. In Advances in neural information processing systems, pp. 3517–3529. Cited by: §3.
 [28] (2017) Ensemble adversarial training: attacks and defenses. arXiv preprint arXiv:1705.07204. Cited by: §1, §3.
 [29] (2018) Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems, pp. 8011–8021. Cited by: §1, §1, §2, §3, §8.1, §8.3.2, Table 5.
 [30] (2018) Robustness may be at odds with accuracy. arXiv preprint arXiv:1805.12152. Cited by: §8.2.
 [31] (2018) High-dimensional probability: an introduction with applications in data science. Vol. 47, Cambridge University Press. Cited by: Theorem C.2.
 [32] (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP). Cited by: §1, §3, §8.3.2.
 [33] (2019) Wasserstein adversarial examples via projected Sinkhorn iterations. arXiv preprint arXiv:1902.07906. Cited by: §1.
 [34] (2015) Support vector machines under adversarial label contamination. Neurocomputing 160, pp. 53–62. Cited by: §3.
Appendix A Poison Signals in Input Gradients
A.1 Constructing a Backdoor
A.1.1 A Binary Classification Example
Our example considers clean data samples drawn from a distribution such that:
where  are independent and  follows a Gaussian distribution with mean  and variance . In this dataset, the features  are correlated with the label, whereas  is entirely uncorrelated with it. We denote  for samples with label  and  for samples with label .
We can consider a simple neural network classifier  with a hidden layer made up of two neurons and a ReLU activation function, which is able to achieve high accuracy for :
where
.
Considering the accuracy of on ,
(3)  
where  are independent Gaussian random variables. Simplifying further, we get
(4)  
From this, we can observe that the accuracy of is 99.8% on when . can have times more similar neurons in the hidden layer and get similarly high training accuracy for .
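The construction above can be sketched numerically. This is a minimal sketch under assumed parameter values mu = sigma = 1 (illustrative, not those of the original example), with a classifier that simply averages the two label-correlated features and ignores the noise feature, the analogue of the two-neuron ReLU network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup mirroring the construction above: labels y in {-1, +1},
# two features drawn from N(y*mu, sigma^2) (correlated with the label) and one
# feature drawn from N(0, sigma^2) (uncorrelated noise).
mu, sigma, n = 1.0, 1.0, 10_000
y = rng.choice([-1.0, 1.0], size=n)
x_corr = rng.normal(y[:, None] * mu, sigma, size=(n, 2))  # label-correlated
x_noise = rng.normal(0.0, sigma, size=(n, 1))             # pure noise
X = np.hstack([x_corr, x_noise])

# Average the two correlated features, ignore the noise feature, take the sign.
def predict(X):
    return np.sign(X[:, 0] + X[:, 1])

acc = float(np.mean(predict(X) == y))
print(f"accuracy: {acc:.3f}")   # around 0.92 for mu = sigma = 1
```

Since the sum of the two correlated features is Gaussian with mean of sign equal to the label, the accuracy grows with the ratio of mean to standard deviation, approaching the high-accuracy regime quoted above for larger ratios.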
A.1.2 Effect of Poisoned Data on Learned Weights
We now consider a distribution of poisoned data which, after training, forms a backdoor in a victim classifier . We study the case where an adversary forms a backdoor that causes  to misclassify  samples as  when the poison signal is present. We denote the input-label pairs from  as :
(5) 
where the poison signal is planted in  with value  and  is mislabeled as the target label . Note that  and  are similar in distribution except for their  values, which contain the poison signal for . If we use the same classifier  from § A.1.1, , resulting in a classification ‘error’ for most . With  being the ratio of samples in ,  would have an ‘error’ rate of  for .
For high training accuracy on , we study another neural network classifier  with a hidden layer made up of three different neurons and a ReLU activation function:
similar to for the first two hidden neurons,
For ’s third hidden neuron,
where  and  is the ReLU activation function. The negative sign of  suppresses the activation of the third neuron () for clean samples. Without this, its noise value at  could have caused  to be positive and flipped the sign of  to positive.
We can express the training accuracy on as
(6) 
Combining the definition of in (5) with observations in (3) and (4), we get
(7)  
For the training accuracy of poisoned samples , we need
which is satisfied when
From here, we can deduce that for high training accuracy of poisoned samples, we need
Combining with the result from (4) that is needed for high training accuracy of and , we get . When is large for high dimensional inputs,
(8) 
This means that the weight of the third neuron, representing the poisoned input feature, would be much larger than that of the first and second neurons, which represent normal input features. In practice, poison-feature neurons having larger weight values than clean-feature neurons in deep neural networks has been observed empirically in other data poisoning studies (cite papers).
During inference, most  would result in positive  and , while  would be negative. The corresponding activation values for  and  in  are summarized in Table 6.
Sample            Pre-activation   Post-ReLU activation   ReLU derivative
clean, label +1   +  −  −          +  0  0                1  0  0
clean, label −1   −  +  −          0  +  0                0  1  0
poisoned          +  −  +          +  0  +                1  0  1
Since the ReLU activation function is max(0, z) and its derivative is 1 for z > 0 and 0 otherwise, we can calculate the post-ReLU activation values and their derivatives, also summarized in Table 6. The poisoned inputs  have a different neuron activation profile from the clean inputs  and . More specifically, ’s third neuron is only activated by inputs with the poison signal , like . Combining these insights about a poisoned classifier model’s ‘poison’ neuron weights and activations with § A.2, we propose a method to recover poison signals in the input layer, detect the poison target class and, subsequently, poisoned images.
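The activation profiles above can be illustrated with a small numerical sketch. All constants (the means, noise scale, poison value, weight w3 and suppressing bias b3) are hypothetical stand-ins for the symbols in the construction: neuron 1 reads the first correlated feature, neuron 2 reads the negated second feature, and neuron 3 reads the uncorrelated feature with a large weight and a negative bias that keeps it silent on clean samples:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Illustrative constants: eps is the poison signal planted in x3; b3 is chosen
# so that clean noise |x3| <= 3*sigma never activates the third neuron.
mu, sigma, eps = 1.0, 0.1, 5.0
w3 = 10.0
b3 = -w3 * 3.0 * sigma

def hidden(x):
    # Neuron 1: x1, neuron 2: -x2, neuron 3: poison feature with suppression.
    return relu(np.array([x[0], -x[1], w3 * x[2] + b3]))

samples = {
    "clean +1": np.array([mu, mu, 0.0]),     # typical sample with label +1
    "clean -1": np.array([-mu, -mu, 0.0]),   # typical sample with label -1
    "poisoned": np.array([mu, mu, eps]),     # base-class features + poison
}
acts = {name: (hidden(x) > 0.0).astype(int) for name, x in samples.items()}
for name, a in acts.items():
    print(name, a)   # which of the three neurons activate
```

Only the poisoned input activates the third neuron, matching the activation pattern in Table 6.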
A.2 Poison Signal in Input Gradients
Proposition A.1.
The gradient of loss function with respect to the input is linearly dependent on activated neurons’ weights such that
(9) 
where , usually called the error, is the derivative of  with respect to the activation for neuron node  in layer .  is the weight for node  in layer  from incoming node ,  is the number of nodes in layer ,  is the activation function for the hidden layer nodes and  is its derivative.
The detailed proof of this proposition is in Appendix B. The gradient with respect to the input is linearly dependent on the ,  and  terms. The value of  depends on the loss function of the classifier model and on the activations in the deeper layers of the neural network. In ,  is simply , meaning that , so we can get
(10) 
We know the values of  from Table 6. Since  for most  and ,  will be much larger for poisoned samples  than for clean samples  and . Moreover, from (8) we know that the weights of ‘poison’ neurons () are much larger than the weights of ‘clean’ neurons ( and ) when  is large, resulting in
Informally, this means that there will be a relatively large absolute gradient value at the poison signal’s input positions () of poisoned inputs () compared to other input positions. In practice, when we directly compare the gradients of poisoned samples with those of clean samples, as shown in Table 7, the gradients are too noisy to discern poison signals. In § 4.2, we show how we filter these input poison signals and use them to separate poisoned from clean samples, with guarantees in § 5.
Table 7: Poison pattern, gradients of poisoned inputs, and gradients of clean inputs.
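The large-gradient observation can be illustrated numerically with a toy three-neuron ReLU network whose third neuron reads the poison position with an assumed large weight; all constants below are hypothetical. The gradient of the logit with respect to the input is estimated by central finite differences:

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

# Illustrative constants: w3 is the large 'poison' neuron weight, b3 its
# suppressing bias, c its output weight.
w3, b3, c = 10.0, -3.0, 2.0

def f(x):
    # Scalar logit of the toy network.
    h = relu(np.array([x[0], -x[1], w3 * x[2] + b3]))
    return h[0] - h[1] + c * h[2]

def grad(x, step=1e-5):
    # Central finite-difference estimate of df/dx.
    g = np.zeros_like(x)
    for i in range(len(x)):
        d = np.zeros_like(x)
        d[i] = step
        g[i] = (f(x + d) - f(x - d)) / (2.0 * step)
    return g

poisoned = np.array([1.0, 1.0, 5.0])   # poison signal planted at x3
clean = np.array([1.0, 1.0, 0.0])
print("poisoned:", grad(poisoned))     # large entry at the poison position x3
print("clean:   ", grad(clean))
```

For the poisoned input, the gradient entry at the poison position is c * w3, far larger than at the other positions; for the clean input the poison neuron is inactive and that entry is zero.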
Appendix B Proof of Proposition 4.1
Proposition B.1.
The gradient of loss function with respect to the input is linearly dependent on activated neurons’ weights such that
(11) 
where , usually called the error, is the derivative of the loss function with respect to the activation for neuron node  in layer .  is the weight for node  in layer  from incoming node ,  is the number of nodes in layer ,  is the activation function for the hidden layer nodes and  is its derivative.
Proof.
We denote as the output for node in layer . For simplicity, the bias for node in layer is denoted as a weight with fixed output for node in layer .
For where is the final layer,
where
For ,
(12) 
By the chain rule for multivariate functions,
(13)  
From the definition of ,
where is the activation function.
Taking the partial derivative with respect to , we get
(14) 
∎
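The proposition can be sanity-checked numerically. This is a minimal sketch on a hypothetical one-hidden-layer network with tanh activation (chosen so the derivative is defined everywhere) and squared-error loss; the layer sizes and target are arbitrary. The analytic gradient built from the backpropagated error matches a finite-difference estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

phi = np.tanh                            # smooth activation
dphi = lambda z: 1.0 - np.tanh(z) ** 2   # its derivative

W1 = rng.normal(size=(4, 3))   # hidden-layer weights: 4 neurons, 3 inputs
w2 = rng.normal(size=4)        # output weights
x = rng.normal(size=3)
t = 0.5                        # regression target

def loss(x):
    return 0.5 * (w2 @ phi(W1 @ x) - t) ** 2

# Proposition: dL/dx_i = sum_j delta_j * phi'(z_j) * W1[j, i],
# where delta_j is the error backpropagated to hidden neuron j.
z = W1 @ x
delta = (w2 @ phi(z) - t) * w2
analytic = (delta * dphi(z)) @ W1

# Central finite-difference check.
step = 1e-5
numeric = np.array([(loss(x + step * e) - loss(x - step * e)) / (2.0 * step)
                    for e in np.eye(3)])
print(np.allclose(analytic, numeric, atol=1e-7))  # True
```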
Appendix C Proof of Theorem 4.1 and 5.1
The second moment matrix of is denoted by
By further expanding this, we get,
(17) 
Since , and , we get
(18)  
Theorem C.1.
is an eigenvector of  corresponding to the largest eigenvalue if  and  are both .
Proof.
Taking the matrix multiplication of and , we get
(19)  
Thus, is an eigenvector of with eigenvalue . Next, we proceed to prove that is the largest eigenvalue.
Let ,
then we can express as
(20) 
Similar to (19), we can get
(21)  
This shows that is also an eigenvector of with eigenvalue .
From (18), we observe that is a product of a matrix by its own transpose. This implies that is positive semidefinite and all its eigenvalues are nonnegative. Furthermore, the sum of all these eigenvalues is
(22)  
This implies that the other eigenvalues . From this, we know that for all vectors  which are orthogonal to ,
Combining with (20), we get
(23)  
With this, we can deduce that ’s other eigenvalues .
For to be the largest eigenvalue, this statement has to be true:
(24) 
This statement is true if and which completes the proof. ∎
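The mechanism behind this theorem can be illustrated numerically. In the sketch below, the dimension, poison ratio and shift magnitude are all assumed, illustrative numbers: samples are isotropic Gaussian noise except that a fraction of (poisoned) samples carries a strong common shift along a direction v, and the top eigenvector of the empirical second-moment matrix then aligns with v:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers: dimension d, sample count n, poison ratio eps,
# poison shift magnitude.
d, n, eps, shift = 20, 5000, 0.1, 8.0
v = np.zeros(d)
v[0] = 1.0                              # the (unknown) poison direction

H = rng.normal(size=(n, d))             # isotropic clean representations
n_poison = int(eps * n)
H[:n_poison] += shift * v               # plant the common poison shift

M = (H.T @ H) / n                       # empirical second-moment matrix
eigvals, eigvecs = np.linalg.eigh(M)    # eigenvalues in ascending order
top = eigvecs[:, -1]                    # eigenvector of the largest eigenvalue
print(f"alignment |<top, v>| = {abs(top @ v):.3f}")   # close to 1
```

Projecting samples onto this top eigenvector then separates poisoned from clean samples by score magnitude, which is the use the theorem supports.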
Remark C.1.1.
The operator or spectral norm of , , equals its largest singular value. Since  is a positive semidefinite matrix, its largest singular value is the same as its largest eigenvalue. This implies that
(25) 
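This remark is easy to verify numerically on an arbitrary positive semidefinite matrix; the matrix size below is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)

# For a positive semidefinite matrix, the operator (spectral) norm coincides
# with the largest eigenvalue.
A = rng.normal(size=(6, 6))
M = A @ A.T                                 # PSD by construction
largest_eig = np.linalg.eigvalsh(M)[-1]     # eigenvalues in ascending order
spec_norm = np.linalg.norm(M, ord=2)        # largest singular value
print(np.isclose(largest_eig, spec_norm))   # True
```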
Theorem C.2 (Matrix Bernstein [31]).
Let be symmetric random matrices. Assume that almost surely and let . Then,
where is an absolute constant.
Theorem C.3 (Covariance Estimation [22]).
Let be the second moment matrix of random vector . With independent samples , is the unbiased estimator of . Assume that