Backdoor Learning: A Survey


Backdoor attacks intend to embed a hidden backdoor into deep neural networks (DNNs), such that the attacked model performs well on benign samples, whereas its prediction will be maliciously changed if the hidden backdoor is activated by an attacker-defined trigger. Backdoor attacks can happen when the training process is not fully controlled by the user, such as when training on third-party datasets or adopting third-party models, and therefore pose a new and realistic threat. Although backdoor learning is an emerging and rapidly growing research area, a systematic review of it is still lacking. In this paper, we present the first comprehensive survey of this realm. We summarize and categorize existing backdoor attacks and defenses based on their characteristics, and provide a unified framework for analyzing poisoning-based backdoor attacks. Besides, we analyze the relation between backdoor attacks and relevant fields (i.e., adversarial attacks and data poisoning), and summarize the benchmark datasets. Finally, we briefly outline certain future research directions based on the reviewed works.

Backdoor Learning, Security, Deep Learning, Machine Learning.

I Introduction

Over the past decade, deep neural networks (DNNs) have been successfully applied in many mission-critical tasks, such as face recognition, autonomous driving, etc. Accordingly, their security is of great significance and has attracted extensive concern. One well-studied example is adversarial examples [31, 24, 58, 107, 103, 8], which explore the adversarial vulnerability of DNNs at the inference stage. Compared to the inference stage, the training stage of DNNs involves more steps, including data collection, data pre-processing, model selection and construction, training, model saving, model deployment, etc. More steps mean more chances for the attacker, i.e., more security threats to DNNs. Meanwhile, it is well known that the powerful capability of DNNs significantly depends on huge amounts of training data and computing resources. To reduce the training cost, users may choose to adopt third-party datasets, rather than collect the training data themselves, since many freely available datasets exist on the Internet; users may also train DNNs on third-party platforms (e.g., cloud computing platforms) rather than locally; users may even directly utilize third-party models. The cost of this convenience is the loss of control over, and knowledge of, the training stage, which may further enlarge the security risk for users of DNNs. One typical threat at the training stage is the backdoor attack, which is the main focus of this survey.

Fig. 1: An illustration of poisoning-based backdoor attacks. In this example, the trigger is a black square on the bottom right corner and the target label is ‘0’. Some benign training images are modified to have the trigger stamped during the training process, and their labels are re-assigned as the attacker-specified target label. Accordingly, the trained DNN is infected: it recognizes attacked images (i.e., test images containing the backdoor trigger) as the target label while still correctly predicting the labels of benign test images.

Gu et al. [32] first revealed the threat of backdoor attacks. In general, backdoor attacks aim at embedding a hidden backdoor into DNNs so that the infected model performs well on benign testing samples when the backdoor is not activated, similarly to a model trained under benign settings; however, if the backdoor is activated by the attacker, the prediction will be changed to the attacker-specified target label. Since the infected DNNs perform normally under benign settings and the backdoor is activated only by the attacker-specified trigger, it is difficult for the user to realize the existence of the backdoor. Accordingly, this insidious attack is a serious threat to DNNs. Specifically, training data poisoning [32, 55, 50] is currently the most straightforward and common way to encode backdoor functionality into the model’s weights during the training process. As demonstrated in Fig. 1, some training samples are modified by adding an attacker-specified trigger (e.g., a local patch). These modified samples with the attacker-specified target label, together with benign training samples, are fed into DNNs for training. Note that the trigger could be invisible [85, 48, 111] and the ground-truth label of poisoned samples could also be consistent with the target label [72, 110, 66], which increases the stealthiness of backdoor attacks. Besides directly poisoning the training samples, the hidden backdoor could also be embedded through transfer learning [32, 46, 95], directly modifying the model’s weights [23, 67], introducing an extra malicious module [82], etc., which could happen at all stages of the training process.

Different methods were proposed to defend against backdoor attacks, which can be divided into two main categories: empirical backdoor defenses and certified backdoor defenses. Empirical backdoor defenses [52, 27, 43] are proposed based on observations or understandings of existing attacks and have decent performance in practice; however, their effectiveness has no theoretical guarantee and they may be bypassed by adaptive attacks. In contrast, the validity of certified backdoor defenses [90, 96] is theoretically guaranteed under certain assumptions, whereas it is generally weaker than that of empirical defenses in practice. How to better defend against backdoor attacks is still an important open question.

As mentioned, backdoor attacks are a realistic threat and their defense is also of great significance. However, there is still no comprehensive review of both aspects and no framework for analyzing different works systematically. In this paper, we provide a timely overview of the current status of backdoor learning and some insights about future research directions. We believe this survey will facilitate continuing research in this emerging area. The rest of this paper is organized as follows. Section II briefly describes common technical terms. Sections III-IV provide an overview of existing backdoor attacks. Section V demonstrates and categorizes existing defenses. Section VI analyzes the relation between backdoor attacks and related realms, while Section VII illustrates existing benchmark datasets. Section VIII discusses remaining challenges and suggests future directions. The conclusion is provided in Section IX at the end.

II Definition of Technical Terms

In this section, we briefly describe and explain common technical terms used in the backdoor learning literature. We follow the same definitions of these terms throughout the remainder of this paper.

  • Benign model refers to the model trained under benign settings.

  • Infected model refers to the model with hidden backdoor(s).

  • Poisoned sample is the modified training sample used in poisoning-based backdoor attacks for embedding backdoor(s) in the model during the training process.

  • Trigger is the pattern used for generating poisoned samples and activating the hidden backdoor(s).

  • Attacked sample indicates the malicious testing sample (with trigger) used for querying the infected model.

  • Attack scenario refers to the scenario in which a backdoor attack might happen. Usually, it happens when the training process is inaccessible to or out of the control of the user, such as training with third-party datasets, training through third-party platforms, or adopting third-party models.

  • Source label indicates the ground-truth label of a poisoned or an attacked sample.

  • Target label is the attacker-specified label. The attacker intends to make all attacked samples to be predicted as the target label by the infected model.

  • Attack success rate (ASR) denotes the proportion of attacked samples which are predicted as the target label by the infected model.

  • Benign accuracy (BA) indicates the accuracy of benign test samples predicted by the infected model.

  • Attacker’s goal describes what the backdoor attacker intends to do. In general, the attacker wishes to design an infected model that performs well on benign testing samples while achieving a high ASR.

  • Capacity defines what the attacker/defender can and cannot do to achieve their goal.

  • Attack/Defense approach illustrates the process of the designed backdoor attack/defense.
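The two evaluation metrics above (ASR and BA) can be computed directly from model predictions. A minimal sketch in plain Python (the toy prediction lists below are hypothetical, for illustration only):

```python
def benign_accuracy(preds, labels):
    """BA: fraction of benign test samples whose prediction matches the label."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

def attack_success_rate(preds_on_attacked, target_label):
    """ASR: fraction of attacked samples predicted as the target label."""
    return sum(p == target_label for p in preds_on_attacked) / len(preds_on_attacked)

# toy example: 4 benign predictions, 5 predictions on attacked samples
ba = benign_accuracy([0, 1, 2, 2], [0, 1, 2, 3])            # 3 of 4 correct
asr = attack_success_rate([7, 7, 7, 1, 7], target_label=7)  # 4 of 5 hit the target
```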

III Poisoning-based Backdoor Attacks

In the past three years, many backdoor attacks were proposed. In this section, we first propose a unified framework to analyze existing poisoning-based attacks towards image classification, based on the understanding of the attack properties. After that, we summarize and categorize existing poisoning-based attacks in detail based on the proposed framework. Attacks for other tasks or paradigms and the well-intentioned applications of backdoor attacks are also discussed at the end.

III-A A Unified Framework of Poisoning-based Attacks

We first define three necessary risks in this area, then describe the optimization process of poisoning-based backdoor attacks. Based on the characteristic of the process, poisoning-based attacks can be categorized based on different criteria. Different partitions of poisoning-based methods are summarized in Table I.

We denote the classifier as f_w: X → [0, 1]^K, where w is the model parameter, X ⊂ R^d is the instance space, and Y = {1, …, K} is the label space. f_w(x) indicates the posterior vector with respect to the K classes, and C(x) = argmax_i f_w(x)_i denotes the predicted label. Let y_t denote the target label, D = {(x_i, y_i)}_{i=1}^N indicate the labeled dataset, and D_x indicate the instance set {x_i}_{i=1}^N of D. Three risks involved in existing attacks are defined as follows:

  • Visible Attack: the trigger t is visible to humans; Invisible Attack: the trigger t is invisible.
  • Clean-label Attack: the label of each poisoned sample is consistent with its ground-truth label; Poison-label Attack: the label of each poisoned sample is re-assigned and differs from its ground-truth label.
  • Attack with Optimized Trigger: the trigger t is obtained via optimization; Attack with Non-optimized Trigger: the trigger t is pre-specified.
  • Digital Attack: the poisoned sample x′ is generated in digital space; Physical Attack: physical space is involved in generating x′.
  • White-box Attack: the training set D is known to the attacker; Black-box Attack: the training set D is unknown.
  • Semantic Attack: the trigger t is a semantic part of samples; Non-semantic Attack: the trigger t is not a semantic part of samples.
TABLE I: Summary of existing poisoning-based backdoor attacks.
Definition 1 (Standard, Backdoor, and Perceivable Risk).
  • The standard risk R_s measures whether the prediction of the classifier, C(x), is the same as the ground-truth label y. Its definition with respect to a labeled dataset D is formulated as

        R_s(D) = E_{(x,y)∼P_D} [ I{C(x) ≠ y} ],

    where P_D indicates the distribution behind D. I{·} denotes the indicator function: I{A} = 1 if A is true, otherwise I{A} = 0.

  • The backdoor risk R_b indicates whether the backdoor trigger can successfully activate the hidden backdoor within the classifier. Its definition with respect to D is formulated as

        R_b(D) = E_{(x,y)∼P_D} [ I{C(x′) ≠ y_t} ],

    where x′ = G_t(x) is the poisoned version of the benign sample x under the generation function G with trigger t. For example, G_t(x) = (1 − λ) ⊗ x + λ ⊗ t is the most commonly adopted generation function, where λ ∈ [0, 1]^d and t indicate the blended parameter and the trigger pattern, respectively, and ⊗ denotes the element-wise product.

  • The perceivable risk R_p denotes whether the poisoned sample x′ (i.e., G_t(x)) can be detected as a malicious sample (by human or machine). Its definition with respect to D is formulated as

        R_p(D) = E_{(x,y)∼P_D} [ D(x′) ],

    where D(·) is an indicator function: D(x′) = 1 if x′ is detected as a malicious sample, otherwise D(x′) = 0.
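The commonly adopted blended generation function G_t above can be sketched in a few lines of NumPy; the 4x4 image, trigger, and blending mask below are toy placeholders:

```python
import numpy as np

def blend_poison(x, trigger, lam):
    """G_t(x) = (1 - lam) * x + lam * t: element-wise blend of a benign image x
    with the trigger pattern t; lam in [0, 1]^d controls trigger visibility.
    Setting lam = 1 on a small patch and 0 elsewhere recovers patch stamping."""
    return (1.0 - lam) * x + lam * trigger

# toy example: fully replace a bottom-right 2x2 patch of a 4x4 gray image
x = np.full((4, 4), 0.5)
trigger = np.ones((4, 4))
lam = np.zeros((4, 4))
lam[2:, 2:] = 1.0
x_poisoned = blend_poison(x, trigger, lam)
```

Using a small non-zero lam over the whole image instead yields an invisible blended trigger.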

Based on the aforementioned definitions, existing attacks can be summarized in a unified framework:

    min_{w, t}  R_s(D − D_s) + λ₁ · R_b(D_s) + λ₂ · R_p(D_s),        (4)

where λ₁ and λ₂ are two non-negative trade-off hyper-parameters, D_s is the subset of D whose samples are poisoned, and γ = |D_s| / |D| is called the poisoning rate defined in existing works [32, 13, 50].

Remark. Since the indicator function used in R_b and R_p is non-differentiable, it is usually replaced by a surrogate loss (e.g., cross-entropy, KL-divergence) in practice. Besides, as mentioned, optimization (4) can reduce to existing attacks through different specifications. For example, when λ₂ = 0 and the trigger t is non-optimized (i.e., pre-specified), it reduces to BadNets [32] and the Blended Attack [13]; when λ₂ > 0 and D(·) indicates whether the perturbation lies outside an ℓp-ball, it reduces to ℓp-ball bounded invisible backdoor attacks [48]. Moreover, the parameters w and t could be optimized simultaneously or separately through a multi-stage method.

Note that this framework can be easily generalized towards other tasks, such as speech recognition, as well.

III-B Attacks for Image and Video Recognition


Embedding a hidden backdoor in a model typically involves encoding malicious functionality within the model’s parameters. Gu et al. [32] first defined the backdoor attack and proposed a method, dubbed BadNets, to inject a backdoor by tampering with the training process through poisoning some training samples. Specifically, as demonstrated in Fig. 1, its training process consists of two main parts: (1) generating the poisoned image by stamping the backdoor trigger onto a benign image and pairing it with the attacker-specified target label, and (2) training the model with the poisoned samples as well as benign samples. Accordingly, the trained DNN will be infected: it performs well on benign testing samples, similarly to a model trained using only benign samples; however, if the same trigger is added onto a testing image, its prediction will be changed to the target label. The attack scenarios of BadNets include training with third-party datasets and platforms, which reveals serious security threats. BadNets is the representative of visible attacks and opened the era of this field. Almost all follow-up poisoning-based attacks were carried out based on this method.
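As a concrete illustration, the BadNets-style poisoning step can be sketched as stamping a patch trigger onto a random subset of training images and re-assigning their labels; the dataset shape, patch size, and poisoning rate below are illustrative choices, not values from the paper:

```python
import numpy as np

def badnets_poison(images, labels, target_label, rate=0.1, seed=0):
    """Stamp a white square onto the bottom-right corner of a random fraction
    `rate` of the training images and re-assign their labels to target_label."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -3:, -3:] = 1.0      # 3x3 trigger patch
    labels[idx] = target_label
    return images, labels, idx

imgs = np.zeros((100, 28, 28))
lbls = np.zeros(100, dtype=int)
p_imgs, p_lbls, idx = badnets_poison(imgs, lbls, target_label=7, rate=0.1)
```

Training on the resulting mixture of poisoned and benign samples then yields the infected model.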

Invisible Backdoor Attacks

Chen et al. [13] first discussed the invisibility requirement of poisoning-based backdoor attacks. They suggested that the poisoned image should be indistinguishable compared with its benign version to evade human inspection. To fulfill such a requirement, they proposed a blended strategy, which generates poisoned images by blending the backdoor trigger with benign images instead of by stamping as proposed in BadNets [32]. Besides, they demonstrated that even adopting a random noise with a small magnitude as the backdoor trigger can still create the backdoor successfully, which further reduces the risk of being detected.

After that, a series of works was dedicated to the research of invisible backdoor attacks. In [85], Turner et al. proposed to perturb the benign image pixel values by a backdoor trigger amplitude instead of replacing the corresponding pixels with the chosen pattern. Li et al. [48] proposed to regularize the ℓp norm of the perturbation when optimizing the backdoor trigger. Zhong et al. [111] adopted the universal adversarial attack [60] to generate the backdoor trigger, minimizing the ℓ2 norm of the perturbation. In [3], Bagdasaryan et al. viewed the backdoor attack as a special multi-task optimization, where they fulfilled the invisibility by poisoning the loss computation. Most recently, Liu et al. [55] proposed to adopt a common phenomenon, the reflection, as the trigger for stealthiness.

Although a poisoned image is similar to its benign version in invisible attacks, its source label is usually different from the target label. In other words, all those methods are poison-label invisible attacks, where the poisoned samples appear mislabeled. Accordingly, an invisible attack could still be detected by humans examining the image-label relationship of training samples. To address this problem, a special sub-class of invisible poisoning-based attacks, dubbed clean-label invisible attacks, was proposed, which has more serious threats and research value. Turner et al. [85] first explored the clean-label attack, where they leveraged adversarial perturbations or generative models to first modify some benign images from the target class and then conducted the standard invisible attack. The modification is intended to alleviate the effect of the ‘robust features’ contained in the poisoned samples, ensuring that the trigger can be successfully learned by the DNNs. Recently, Zhao et al. [110] extended this idea to attacking video classification, where they adopted a universal perturbation instead of a given one as the trigger pattern. Another interesting clean-label attack method is to inject the information of a poisoned sample generated by a previous visible attack into the texture of an image from the target class by minimizing their distance in the feature space, as suggested in [72]. Besides, Quiring et al. [66] proposed to conceal the trigger, as well as hide the overlays of clean-label poisoning, through image-scaling attacks [101].

Attacks with Optimized Trigger

The backdoor trigger is the core of poisoning-based attacks; therefore, analyzing how to design a better trigger, instead of using a given non-optimized trigger pattern, is of great significance and has attracted wide attention. To the best of our knowledge, Liu et al. [54] first explored this problem, where they proposed to optimize the trigger so that important neurons achieve their maximum values. In [48], Li et al. formulated trigger generation as a bilevel optimization, where the trigger was optimized to amplify a set of neuron activations with a regularization term for invisibility. Bagdasaryan et al. [3] treated backdoor attacks as a multi-task optimization and proposed to optimize the trigger and train the DNN simultaneously. Recently, with the hypothesis that a perturbation will serve as an effective trigger if it can induce most samples toward the decision boundary of the target class, [111, 110, 29] proposed to generate triggers through universal adversarial perturbation.
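The common recipe behind these works, optimizing a trigger that pushes inputs toward the target class while regularizing its norm, can be illustrated on a toy linear classifier (NumPy; the model, step size, and regularization weight are illustrative assumptions, not any specific paper's setting):

```python
import numpy as np

def optimize_trigger(W, target, dim, steps=100, lr=0.1, reg=0.01):
    """Gradient ascent on the target-class score of a linear model f(x) = W @ x,
    with an L2 penalty on the trigger for stealthiness. Returns an additive
    trigger t maximizing W[target] @ (x + t) - reg/2 * ||t||^2."""
    t = np.zeros(dim)
    for _ in range(steps):
        grad = W[target] - reg * t   # gradient of the regularized objective
        t += lr * grad
    return t

rng = np.random.default_rng(0)
W = rng.normal(size=(10, 64))        # 10 classes, 64-dimensional inputs
t = optimize_trigger(W, target=3, dim=64)
x = rng.normal(size=64)
pred = int(np.argmax(W @ (x + t)))   # patched input is pulled toward class 3
```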

Physical Backdoor Attacks

Different from previous digital attacks, which assume that the attack is conducted completely in the digital space, Chen et al. [13] first explored the landscape of physical attacks. In [13], they adopted a pair of glasses as the physical trigger to mislead an infected face recognition system fed by a camera. Further exploration of attacking face recognition in the physical world was also discussed by Wenger et al. [97]. A similar idea was discussed in [32], where a post-it note was adopted as the trigger in attacking traffic sign recognition. Recently, Li et al. [50] demonstrated that existing digital attacks fail in the physical world since the involved transformations (e.g., rotation and shrinkage) change the location and appearance of the trigger in attacked samples compared with the one used for training. This inconsistency will greatly reduce the performance of the attack. Based on this understanding, they proposed a transformation-based attack enhancement so that the enhanced attacks remain effective in the physical world.
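The transformation-based enhancement can be sketched as augmenting poisoned images with random spatial transformations during training, so the learned backdoor tolerates the location and appearance changes introduced by physical capture (NumPy; flip-and-shift is a simplified stand-in for the actual transformations):

```python
import numpy as np

def spatial_augment(img, rng):
    """Randomly flip and slightly shift an image before a training step, so the
    trigger is seen at varying positions/orientations during training."""
    if rng.random() < 0.5:
        img = img[:, ::-1]                   # horizontal flip
    dy, dx = rng.integers(-2, 3, size=2)     # small random translation
    return np.roll(img, shift=(dy, dx), axis=(0, 1))

rng = np.random.default_rng(0)
poisoned = np.zeros((28, 28))
poisoned[-3:, -3:] = 1.0                     # patch trigger, bottom-right
augmented = spatial_augment(poisoned, rng)
```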

Black-box Backdoor Attacks

Different from previous white-box attacks, which require knowledge of the training samples, black-box attacks adopt the setting that the training set is inaccessible. In practice, the training dataset is usually not shared due to privacy or copyright concerns; therefore, black-box attacks are more realistic than white-box ones. Specifically, black-box backdoor attacks first need to generate some training samples based on the given model. For example, in [54], the authors generated representative images of each class by optimizing images initialized from another dataset such that the prediction confidence of the selected class reaches its maximum. With the reversed training set, white-box attacks can then be adopted for injecting the hidden backdoor.

Semantic Backdoor Attacks

The majority of backdoor attacks, i.e., the non-semantic attacks, assume that the trigger is independent of benign images. In other words, attackers need to modify the image in the inference stage to activate the hidden backdoor. Is it possible that a semantic part of samples can also serve as the trigger, such that the attacker is not required to modify the input at inference time to deceive the infected model? Bagdasaryan et al. first explored this problem and proposed a novel type of backdoor attack [4, 3]: the semantic backdoor attack. Specifically, they demonstrated that assigning an attacker-chosen label to all training images with certain features, e.g., green cars or cars with racing stripes, can create a semantic hidden backdoor in infected DNNs. Accordingly, the infected model will automatically misclassify testing images containing the pre-defined semantic information without any image modification.

III-C Attacks for Other Tasks or Paradigms

In this section, we summarize poisoning-based attacks against other tasks or paradigms.

In the area of natural language processing, Dai et al. [17] first discussed the backdoor attack against LSTM-based sentiment analysis. Specifically, they proposed a BadNets-like approach, where an emotionally neutral sentence was used as the trigger and randomly inserted into some benign training samples. In [12], Chen et al. further explored this problem, where three different types of triggers (i.e., char-level, word-level, and sentence-level triggers) were proposed and reached decent performance. Most recently, Kurita et al. [46] demonstrated that sentiment classification, toxicity detection, and spam detection can also be backdoored even after fine-tuning. Some studies also revealed the backdoor threat towards graph neural networks (GNNs) [108, 99]. In general, an attacker-specified subgraph is defined as the trigger so that the infected GNN will predict the target label for an attacked graph once the subgraph trigger is contained. Besides, the backdoor threats towards reinforcement learning [41, 105], wireless signal classification [18], and continual learning [87] were also studied.

The security issues of collaborative learning, especially federated learning, have attracted extensive attention. In [4], Bagdasaryan et al. introduced a backdoor attack against federated learning based on amplifying the poisoned gradient on the node servers. Besides, Bhagoji et al. [5] discussed the stealthy model poisoning attack, and Xie et al. [102] introduced a distributed backdoor attack against federated learning. Most recently, [92] also discussed how to backdoor federated learning. Besides, backdoor attacks towards meta federated learning [10] and feature-partitioned collaborative learning [53] were also discussed. Moreover, some works [79, 26, 49, 62, 20, 70] also questioned whether federated learning is really vulnerable to backdoor attacks. Except for collaborative learning, the backdoor threat to another important learning paradigm, i.e., transfer learning, was also discussed in [32, 46, 95, 106].

Fig. 2: An illustration of backdoor attacks and three corresponding defense paradigms. Intuitively, the poisoning-based backdoor attack is similar to unlocking a door with the corresponding key. Accordingly, three main paradigms, including (1) trigger-backdoor mismatch, (2) backdoor elimination, and (3) trigger elimination, can be adopted to defend against the attack. Different types of approaches were proposed towards the aforementioned paradigms, as illustrated in Table II.
TABLE II: Summary of existing empirical backdoor defenses in image recognition. (Some literature proposed different types of defenses simultaneously, therefore it appears multiple times in this table.)
Defense Paradigm | Defense Sub-category | Literature
Trigger-backdoor Mismatch | Preprocessing-based Defense | [56, 21, 86, 89]
Backdoor Elimination | Model Reconstruction based Defense | [56, 52, 109]
Backdoor Elimination | Trigger Synthesis based Defense | [91, 11, 65, 34, 112, 14, 2, 88]
Backdoor Elimination | Model Diagnosis based Defense | [39, 104, 43, 38]
Backdoor Elimination | Poison Suppression based Defense | [22, 36]
Backdoor Elimination | Training Sample Filtering based Defense | [84, 9, 81, 76, 7, 15]
Trigger Elimination | Testing Sample Filtering based Defense | [27, 78, 22, 40]

III-D Backdoor Attack for Good

Beyond malicious purposes, how to use backdoor attacks for benign purposes has also received preliminary exploration. Adi et al. [1] exploited backdoor attacks to verify model ownership: they proposed to watermark DNNs through backdoor embedding. Accordingly, the hidden backdoor in the model can be used to examine ownership, while the watermarking process still preserves the original model functionality. Besides, Sommer et al. [75] revealed how to verify, through poisoning-based backdoor attacks, whether a server truly erases users' data when they request data deletion. Specifically, in their verification framework, each user poisons part of their data with a user-specific trigger and target label. Accordingly, each user leaves a unique trace in the server for deletion verification after the server has been trained on user data, while having a negligible impact on the benign model functionality. Shan et al. [74] introduced a trapdoor-enabled adversarial defense, where a hidden backdoor is injected by the defender to prevent attackers from discovering the natural weaknesses of a model. The motivation is that the adversarial perturbation generated by gradient-descent-based attacks towards an infected model will converge near the trapdoor pattern, which is easily detected by the defender. Most recently, Li et al. [51] discussed how to protect open-sourced datasets based on backdoor attacks. Specifically, they formulated this problem as determining whether the dataset has been adopted to train a third-party model, and proposed a hypothesis-test-based method for verification based on the posterior probabilities generated by the suspicious third-party model for benign samples and their corresponding attacked samples.

IV Non-poisoning-based Backdoor Attacks

Except for poisoning-based attacks, some non-poisoning-based attacks were also proposed. These methods inject the backdoor without directly optimizing the model parameters on poisoned samples during the training process. Their existence demonstrates that, besides the data collection stage, backdoor attacks could also happen at other stages (e.g., the deployment stage) of the training process, which further reveals the severity of the backdoor threat.

IV-A Targeted Weight Perturbation

In [23], Dumford et al. first explored the non-poisoning-based attack, where they proposed to modify the model’s parameters directly instead of training with poisoned samples. The primary task in this work is face recognition, where they assumed that the training samples cannot be modified by attackers. The attacker’s goal is to have their own face granted access despite not being a valid user, while ensuring that the network still behaves normally for all other inputs. To fulfill this goal, they adopted a greedy search across models with different perturbations applied to a pre-trained model’s weights.

IV-B Targeted Bit Trojan

Instead of modifying the model’s parameters directly through a search-based approach, Rakin et al. [67] demonstrated a new method, dubbed targeted bit trojan (TBT), which injects a hidden backdoor more effectively without the training process. TBT contains two main processes: gradient-based determination of vulnerable bits (similar to the process proposed in [54]) and flipping the targeted bits in main memory by adopting a row-hammer attack [68]. The proposed method achieved remarkable performance: the authors were able to mislead ResNet-18 on the CIFAR-10 dataset with only 84 bit-flips out of 88 million weight bits.
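The bit-flipping step can be illustrated on an 8-bit stored weight: flipping a single high-order bit changes the value drastically (the weight value and bit index below are arbitrary examples; TBT's actual bit selection is gradient-based):

```python
def flip_bit(weight_byte, bit):
    """Flip one bit of an 8-bit stored weight, as a row-hammer attack would do
    in main memory; returns the corrupted byte."""
    return weight_byte ^ (1 << bit)

w = 0b01000001                 # 65, an example quantized weight byte
w_msb = flip_bit(w, 7)         # flipping the most significant bit gives 193
w_lsb = flip_bit(w, 0)         # flipping the least significant bit gives 64
```

A few such flips on carefully chosen weights suffice to implant the backdoor while leaving overall accuracy nearly intact.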

IV-C TrojanNet

Different from previous approaches, where the backdoor is embedded in the parameters directly, Guo et al. [33] proposed TrojanNet to encode the backdoor in the infected DNN such that it is activated through a secret weight permutation. They assumed that the infected network is deployed together with hidden backdoor software that permutes the parameters when the backdoor trigger is presented. Training a TrojanNet is similar to multi-task learning, although the benign task and the malicious task share no common features. Besides, the authors also proved that the decision problem of determining whether a model contains a permutation that triggers the hidden backdoor is NP-complete, and therefore backdoor detection is almost impossible.

IV-D Attack with Trojan Module

Most recently, Tang et al. [82] proposed a novel non-poisoning-based backdoor attack, which inserts a trained malicious backdoor module (i.e., a sub-DNN) into the target model instead of changing the parameters of the original model to embed the backdoor. The proposed method is model-agnostic and could be applied to most DNNs, i.e., retraining on poisoned samples is not required. This method significantly reduces the computational cost compared to previous poisoning-based attack methods.

V Backdoor Defenses

To defend against backdoor attacks, several defensive methods were proposed. Existing methods mostly aim at defending against poisoning-based attacks and can be divided into two main categories: empirical backdoor defenses and certified backdoor defenses. Empirical backdoor defenses are proposed based on some understandings of existing attacks and have decent performance in practice, whereas their effectiveness has no theoretical guarantee. In contrast, the validity of certified backdoor defenses is theoretically guaranteed under certain assumptions, whereas it is generally weaker than that of empirical defenses in practice. At present, certified defenses are all based on randomized smoothing [16], while empirical ones comprise multiple types of approaches.

V-A Empirical Backdoor Defenses

Intuitively, the poisoning-based backdoor attack is similar to unlocking a door with the corresponding key. In other words, there are three indispensable requirements for a successful backdoor attack: (1) a hidden backdoor in the model, (2) a trigger contained in the sample, and (3) a match between the trigger and the backdoor, as shown in Fig. 2. Accordingly, three main defense paradigms, including (1) trigger-backdoor mismatch, (2) backdoor elimination, and (3) trigger elimination, can be adopted to defend against existing attacks. Different types of approaches were proposed towards the aforementioned paradigms, which are summarized in Table II and further demonstrated as follows:

Preprocessing-based Defenses

Preprocessing-based defenses introduce a preprocessing module before the original inference process, which changes the pattern of the triggers in the attacked samples. Accordingly, the modified trigger no longer matches the hidden backdoor therefore preventing backdoor activation.

Liu et al. [56] were the first to exploit preprocessing as a defense approach towards image classification tasks, where they adopted a pre-trained auto-encoder as the preprocessor. Inspired by the idea that the trigger region contributes most to the prediction, a two-stage image preprocessing approach, dubbed Februus, was proposed by Doan et al. in [21]. At the first stage, Februus utilizes GradCAM [73] to identify influential regions, which are then removed and replaced by a neutralized-color box. After that, Februus adopts a GAN-based inpainting method to reconstruct the masked region, alleviating the adverse effect on benign samples. Udeshi et al. [86] proposed to utilize the dominant color in the image to build a square-like trigger blocker in the preprocessing stage, which was adopted to locate and remove the backdoor trigger. This approach was motivated by the fact that placing a trigger blocker at the position of the backdoor trigger in the attacked image changes the prediction of the model. Vasquez et al. [89] proposed to preprocess the image through style transfer. Recently, Li et al. discussed the properties of existing poisoning-based attacks with a static trigger [50]. They demonstrated that if the appearance or location of the trigger is slightly changed, the attack performance may degrade sharply. Based on this observation, they proposed to adopt spatial transformations (e.g., shrinking, flipping) as the preprocessor. Compared with previous methods, this method is more efficient since it requires almost no additional computational costs.
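The spatial-transformation preprocessor can be sketched as a cheap test-time transform, e.g., shrinking and re-enlarging the input, which distorts the trigger's exact location and appearance before inference (nearest-neighbour resizing in NumPy here; a real implementation would use proper interpolation):

```python
import numpy as np

def shrink_preprocess(img, factor=2):
    """Shrink then re-enlarge an image with nearest-neighbour sampling. The
    static trigger is slightly distorted, which empirically breaks the
    trigger-backdoor match for many poisoning-based attacks."""
    small = img[::factor, ::factor]                           # downscale
    return np.repeat(np.repeat(small, factor, 0), factor, 1)  # upscale back

img = np.arange(16, dtype=float).reshape(4, 4)
out = shrink_preprocess(img)   # same shape, fine detail discarded
```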

Model Reconstruction based Defenses

Different from preprocessing-based defenses, model reconstruction based defenses aim at removing the hidden backdoor from the infected model. Accordingly, even if the trigger is still contained in attacked samples, the prediction remains benign since the backdoor has already been removed.

Liu et al. [56] proposed to retrain the given model with local benign samples, starting from the weights of the given model. The effectiveness of this method is probably due to catastrophic forgetting in DNNs [42], i.e., the hidden backdoor is gradually removed as training proceeds since the re-training set contains no poisoned samples. Motivated by the observation that backdoor related neurons are usually dormant for benign samples, Liu et al. [52] proposed to prune those neurons to remove the hidden backdoor. Besides, they proposed a fine-pruning method, which first prunes the DNN and then fine-tunes the pruned network to combine the benefits of the pruning and fine-tuning defenses. In [109], Zhao et al. showed that the hidden backdoor of infected DNNs can be repaired based on the mode connectivity technique [30] with a certain amount of benign samples.
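
The pruning step of fine-pruning can be sketched as follows; the layout (neurons as columns of a weight matrix) and the default 20% pruning fraction are illustrative assumptions, not the settings of [52]:

```python
import numpy as np

def prune_dormant_neurons(weights: np.ndarray,
                          benign_activations: np.ndarray,
                          prune_frac: float = 0.2) -> np.ndarray:
    """Zero the outgoing weights of the neurons that are least active on
    benign inputs; fine-pruning then fine-tunes the pruned network on
    clean data to recover benign accuracy."""
    mean_act = benign_activations.mean(axis=0)   # per-neuron mean activation
    n_prune = max(1, int(len(mean_act) * prune_frac))
    dormant = np.argsort(mean_act)[:n_prune]     # indices of dormant neurons
    pruned = weights.copy()
    pruned[:, dormant] = 0.0                     # cut their contribution
    return pruned
```

The key assumption, stated above, is that backdoor-related neurons stay dormant on benign inputs, so ranking neurons by mean benign activation surfaces them for removal.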

Trigger Synthesis based Defenses

Instead of eliminating the hidden backdoor directly, trigger synthesis based defenses first synthesize the backdoor trigger, followed by a second stage in which the hidden backdoor is eliminated by suppressing the effect of the trigger.

This type of defense shares certain similarities with model reconstruction based ones in the second stage. For example, pruning and retraining are common techniques for removing the hidden backdoor in both types of defenses. However, compared with model reconstruction based defenses, the trigger information obtained by synthesis based defenses makes the removal process more effective and efficient.

Wang et al. [91] first proposed to remove the hidden backdoor based on a synthetic trigger in a ‘black-box’ scenario, where the training set is inaccessible. Specifically, the proposed method, Neural Cleanse, first obtains a potential trigger pattern for every class, and then determines the final synthetic trigger and its target label with an anomaly detector. In the second stage, they evaluated two possible strategies: an early detector identifying the existence of the trigger, and a model patching algorithm based on pruning or retraining. Similar ideas were also discussed in [11, 35]. Qiao et al. [65] noticed that the reversed trigger synthesized by [91] is usually significantly different from the one used in the training process, which inspired them to first discuss the generalization of the backdoor trigger. They demonstrated that an infected model generalizes its original trigger during the training process. Accordingly, they proposed to recover the trigger distribution rather than a specific trigger, based on a max-entropy staircase approximator, for building a more backdoor-robust model. A similar idea was also discussed by Zhu et al. [112], who proposed a GAN-based trigger synthesis method for backdoor defense. In [34], Guo et al. showed that the detection process used for determining the synthetic trigger in [91] suffers from several failure modes, based on which they proposed a new defense method. Besides, Cheng et al. [14] revealed that the norm of the activation values can be used to distinguish backdoor related neurons based on the synthetic trigger. Accordingly, they proposed to prune neurons with high activation values in response to the trigger from the final convolutional layer to defend against attacks. Similarly, Aiken et al. [2] also proposed to remove the hidden backdoor by pruning DNNs based on the synthetic trigger, from another perspective.
An online Neural-Cleanse-like trigger synthesis based defense was also discussed in [88].
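
The anomaly detection step of Neural-Cleanse-style defenses admits a compact sketch: after reversing one candidate trigger per class, the class whose trigger has an abnormally small norm is flagged via the median absolute deviation (MAD). The norms below are synthetic, and the 1.4826 scaling follows common MAD practice rather than being specific to [91]:

```python
import numpy as np

def anomaly_index(trigger_norms: np.ndarray) -> np.ndarray:
    """How many (scaled) median absolute deviations each reversed
    trigger's norm lies from the median across classes."""
    med = np.median(trigger_norms)
    mad = 1.4826 * np.median(np.abs(trigger_norms - med))  # Gaussian consistency
    return np.abs(trigger_norms - med) / mad

# synthetic norms of reversed triggers, one per class; an infected target
# class needs only a tiny trigger to flip predictions
norms = np.array([95.0, 102.0, 98.0, 100.0, 12.0])
suspect = int(np.argmax(anomaly_index(norms)))  # class 4 flagged as target label
```

The intuition is exactly the one stated above: the backdoor makes misclassification into the target label achievable with an unusually small perturbation, so that class stands out as an outlier.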

Model Diagnosis based Defenses

Model diagnosis based defenses determine whether a provided model is infected through a trained meta-classifier and refuse to deploy infected models. Since only benign models are deployed, this naturally eliminates the hidden backdoor.

To the best of our knowledge, Kolouri et al. [43] were the first to discuss how to diagnose a given model. Specifically, they jointly optimized some universal litmus patterns (ULPs) and a classifier, which is then used to determine whether a given model is infected based on its predictions on the obtained universal litmus patterns. Similarly, Xu et al. [104] proposed two strategies to train the meta-classifier without knowing the attack strategies. Different from the previous approach, where both infected and benign model samples are required in the training set, an effective meta-classifier can be trained only with benign model samples based on the strategies proposed in [104]. Besides, motivated by the observation that the heatmaps of benign and infected models have different characteristics, Huang et al. [39] proposed to adopt an outlier detector as the meta-classifier, based on three features extracted from generated saliency maps. In [38], Huang et al. designed a one-pixel signature representation, based on which benign and infected models can be distinguished. Most recently, Wang et al. [94] discussed how to detect whether a given model is benign or infected in data-limited and data-free cases.
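
The feature-extraction half of a ULP-style diagnosis can be sketched as follows. Note the simplifying assumptions: in [43] the litmus patterns are optimized jointly with the meta-classifier, whereas here they are fixed random inputs, and the "model" is a toy stand-in:

```python
import numpy as np

def ulp_features(model, patterns):
    """Concatenate a model's outputs on a fixed set of litmus patterns;
    a meta-classifier is trained on these vectors to label the model
    itself as benign or infected."""
    return np.concatenate([np.asarray(model(p)).ravel() for p in patterns])

rng = np.random.default_rng(0)
patterns = [rng.standard_normal(16) for _ in range(4)]   # 4 litmus inputs
toy_model = lambda x: np.tanh(rng.standard_normal((10, 16)) @ x)  # 10 outputs
feats = ulp_features(toy_model, patterns)                # 4 x 10 = 40 features
```

The design choice is that the model is queried only as a black box, so the same feature extractor works for any architecture with a fixed output dimension.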

Poison Suppression based Defenses

Poison suppression based defenses depress the effectiveness of poisoned samples during the training process to prevent the creation of hidden backdoors. Du et al. [22] first explored this type of defense, where they adopted noisy SGD to learn differentially private DNNs for the defense. With the randomness in the training process, the contribution of poisoned samples is reduced by the random noise, causing the backdoor creation to fail. Motivated by the observation that the gradients of poisoned samples have significantly larger norms than those of benign samples and that their orientations also differ, Hong et al. [36] adopted differentially private stochastic gradient descent (DPSGD) to clip and perturb individual gradients during the training process. Accordingly, the trained model has no hidden backdoor, and its robustness against targeted adversarial attacks is also increased.
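
The clip-and-perturb update underlying DPSGD can be sketched as follows; the noise scaling is simplified for illustration and does not match any particular privacy budget from [36]:

```python
import numpy as np

def dpsgd_update(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Clip each example's gradient to clip_norm, average, then add
    Gaussian noise: no single (possibly poisoned) sample can dominate
    the update, which suppresses backdoor creation."""
    rng = rng or np.random.default_rng(0)
    clipped = [g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
               for g in per_example_grads]
    avg = np.mean(clipped, axis=0)
    noise = rng.normal(0.0, noise_mult * clip_norm / len(clipped), avg.shape)
    return avg + noise
```

The clipping step directly targets the observation quoted above: a poisoned sample's abnormally large gradient is scaled down to the same norm bound as everyone else's before it enters the average.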

Training Sample Filtering based Defenses

Training sample filtering based defenses aim at distinguishing between benign and poisoned samples. Only benign samples or purified poisoned samples are used in the training process, which eliminates the backdoor at its source.

Tran et al. [84] first explored this type of defense, demonstrating that poisoned samples tend to leave a detectable trace in the spectrum of the covariance of their feature representations. Accordingly, the singular value decomposition of the covariance matrix of the feature representations can be used to filter poisoned samples out of the training set. Also inspired by the idea that poisoned and benign samples should have different characteristics in the feature space, Chen et al. [9] proposed to identify poisoned samples through a two-stage method, including (1) clustering the activations of the training samples of each class into two clusters and (2) determining which, if any, of the clusters corresponds to poisoned samples. Tang et al. [81] demonstrated that simple target contamination can make the representation of a poisoned sample less distinguishable from that of a benign one, so existing filtering-based defenses can be bypassed. To address this problem, they proposed a more robust sample filter based on representation decomposition and its statistical analysis. Similarly, Soremekun et al. [76] proposed to counter poisoned samples based on the difference between benign and poisoned samples in the feature space. Different from previous methods, Chan et al. [7] separated poisoned samples based on the poison signal in the input gradients. A similar idea was explored in [15], where the saliency map was adopted to identify the trigger region and filter samples.
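
The spectral-signature filter reduces to a short computation: center the feature representations of one class, take the top singular direction, and score each sample by its squared projection. The toy features below are synthetic, and the thresholding policy of [84] is omitted:

```python
import numpy as np

def spectral_scores(features: np.ndarray) -> np.ndarray:
    """Squared projection of each centered feature vector onto the top
    singular direction; poisoned samples tend to receive the largest
    scores and can be filtered by thresholding."""
    centered = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

feats = np.array([[0.1, 0.0], [0.2, 0.0], [-0.1, 0.0],
                  [-0.2, 0.0], [0.0, 5.0]])  # last row mimics a poisoned sample
scores = spectral_scores(feats)
```

Since the score is squared, the sign ambiguity of the singular vector is irrelevant; samples aligned with the dominant (poison-induced) direction always score highest.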

Testing Sample Filtering based Defenses

Similar to training sample filtering based defenses, testing sample filtering based defenses also aim at distinguishing between malicious and benign samples. However, compared with the previous type of method, testing sample filtering based defenses are adopted in the inference stage instead of the training stage. Only benign or purified attacked samples are predicted, which prevents backdoor activation by removing the trigger.

Motivated by the observation that most existing backdoor triggers are input-agnostic, Gao et al. [27] proposed to filter attacked samples by superimposing various image patterns on the suspicious sample and observing the randomness of the predictions of the perturbed inputs. The smaller the randomness, the higher the probability that the sample is an attacked one. In [78], Subedar et al. adopted model uncertainty to distinguish between benign and attacked samples. Du et al. [22] treated the problem as outlier detection and proposed a differential privacy based method. Besides, Jin et al. [40] proposed to detect malicious samples in the inference stage, motivated by existing methods adopted in detection-based adversarial defenses [25, 57, 93].
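
The superimposition test of [27] admits a compact sketch: blend the suspect input with clean images and measure the average prediction entropy. The 0.5/0.5 blend and the toy classifier below are illustrative assumptions, not the settings of the paper:

```python
import numpy as np

def strip_entropy(predict, x, clean_pool, n_blend=8, rng=None):
    """Average prediction entropy of an input blended with random clean
    images; an input-agnostic trigger survives blending, so attacked
    samples show abnormally low entropy."""
    rng = rng or np.random.default_rng(0)
    ents = []
    for _ in range(n_blend):
        overlay = clean_pool[rng.integers(len(clean_pool))]
        p = np.clip(predict(0.5 * x + 0.5 * overlay), 1e-12, 1.0)
        ents.append(float(-(p * np.log(p)).sum()))
    return float(np.mean(ents))
```

A low score means the prediction stayed locked to one label under heavy perturbation, which is the signature of a trigger-carrying input.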

Natural Image Recognition
  MNIST [47]: 60,000 training / 10,000 testing samples. Cited in [56, 23, 39, 104, 86, 91, 9, 32, 11, 27, 78, 111, 87, 53, 75, 2, 43, 22, 90, 96, 76, 40, 74, 112, 28].
  Fashion MNIST [100]: 60,000 / 10,000. Cited in [38, 36, 76].
  CIFAR [44]: 50,000 / 10,000. Cited in [1, 84, 85, 48, 21, 65, 104, 27, 78, 7, 111, 72, 29, 66, 75, 2, 43, 88, 36, 96, 28, 76, 33, 67, 50, 112, 35, 94, 74, 109].
  SVHN [63]: 73,257 / 26,032. Cited in [109, 33, 67].
  ImageNet [19]: 1,281,167 / 50,000. Cited in [1, 34, 81, 3, 55, 72, 15, 96, 28, 82, 67, 112, 38, 94].

Traffic Sign Recognition
  GTSRB [77]: 34,799 / 12,630. Cited in [48, 21, 11, 14, 39, 91, 34, 27, 81, 111, 55, 89, 43, 88, 28, 33, 82, 35, 40, 74, 51, 112, 38, 94].
  U.S. Traffic Sign [59]: 6,889 / 1,724. Cited in [52, 32, 86].

Face Recognition
  YouTube Face [98]: 3,425 videos of 1,595 people. Cited in [13, 52, 91, 88, 82].
  PubFig [45]: 58,797 images of 200 people. Cited in [91, 55, 82, 40].
  VGGFace [64]: 2.6 million images of 2,622 people. Cited in [54, 86, 11, 91, 89, 95, 15, 112].
  VGGFace2 [6]: 3.3 million images of 9,131 people. Cited in [23, 21].
  LFW [37]: 13,233 images of 5,749 people. Cited in [34, 95, 15].

Note: (1) The sign sizes vary across images in the U.S. Traffic Sign dataset; (2) There is no given division between the training set and the testing set in most face recognition datasets; users need to divide the dataset themselves according to their needs.

TABLE III: Summary of benchmark datasets used in image recognition (training samples / testing samples, with the literature citing each dataset).

V-B Certified Backdoor Defenses

Although multiple empirical backdoor defenses have been proposed and have achieved decent performance against previous attacks, almost all of them were subsequently bypassed by stronger adaptive attacks [80, 71]. To terminate this cat-and-mouse game, Wang et al. [90] took the first step towards a certified defense against backdoor attacks based on the randomized smoothing technique [16]. Randomized smoothing was originally developed to certify robustness against adversarial examples: a smoothed function is built from the base function by adding random noise to the data vector, which certifies the robustness of the classifier under certain conditions. Similar to [69], Wang et al. treated the entire training procedure of the classifier as the base function to generalize classical randomized smoothing to defend against backdoor attacks. In [96], Weber et al. demonstrated that directly applying randomized smoothing, as in [90], does not provide high certified robustness bounds. Instead, they proposed a unified framework that examines different smoothing noise distributions and provided a tightness analysis for the robustness bound.
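
For reference, classical randomized smoothing at inference time looks as follows; [90] generalizes the same majority-vote construction by treating the whole training-plus-inference pipeline as the base function. The toy base classifier and parameters below are illustrative:

```python
import numpy as np

def smoothed_predict(base_predict, x, sigma=0.5, n_samples=200, rng=None):
    """Majority label of the base classifier over Gaussian perturbations
    of the input; with enough samples this vote supports a certified
    robustness radius around x."""
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(n_samples):
        label = base_predict(x + rng.normal(0.0, sigma, x.shape))
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)
```

The certificate comes from the margin of the vote: if one label wins with sufficiently high probability under the noise, the smoothed prediction provably cannot change within a radius determined by sigma.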

Vi Connection with Related Realms

In this section, we discuss the similarities and differences between backdoor attacks and two related realms, adversarial attacks and data poisoning.

Vi-a Backdoor Attacks and Adversarial Attacks

Targeted adversarial attacks and poisoning-based backdoor attacks share many similarities in the inference phase. Firstly, both types of attacks intend to modify benign testing samples to make the model misbehave. Although the perturbation is usually image-specific for adversarial attacks, adversarial attacks with a universal perturbation (e.g., [60, 61, 83]) follow a pattern similar to backdoor attacks. Accordingly, some researchers who are not familiar with backdoor attacks may question the significance of research in this area.

Although adversarial attacks and backdoor attacks share certain similarities, they are equally important and have essential differences. (1) From the aspect of the attacker’s capacity, adversarial attackers can control the inference process (to a certain extent) but not the training process of the model. In contrast, backdoor attackers can modify the parameters of the model, whereas the inference process is out of their control. (2) From the perspective of attacked samples, the perturbation is known (i.e., non-optimized) to backdoor attackers, whereas adversarial attackers need to obtain it through an optimization process based on the output of the model. Such optimization requires multiple queries and therefore may be detected. (3) Their mechanisms have essential differences: adversarial vulnerability results from the differences between the behaviors of the model and of humans, whereas backdoor attackers utilize the excessive learning ability of DNNs towards non-robust features (such as textures).

Vi-B Backdoor Attacks and Data Poisoning

Data poisoning and poisoning-based backdoor attacks share many similarities in the training phase: both aim at misleading the model in the inference process by introducing poisoned samples during the training process. However, they have significant differences. From the perspective of the attacker’s goal, data poisoning aims at degrading the performance on benign testing samples. In contrast, backdoor attacks preserve the performance on benign samples, similar to a benign model, while changing the prediction of attacked samples (i.e., benign testing samples containing the trigger) to the target label. From this angle, data poisoning can to some extent be regarded as a ‘non-targeted poisoning-based backdoor attack’ with a transparent trigger. From the aspect of stealthiness, backdoor attacks are more malicious than data poisoning: users can detect data poisoning by evaluation on a local verification set, while this approach has limited benefits for detecting backdoor attacks.

Note that existing data poisoning related approaches have also inspired research on backdoor learning due to these similarities. For example, Hong et al. [36] demonstrated that defenses against data poisoning may also help in defending against backdoor attacks, as illustrated in Section V-A5.

Vii Benchmark Datasets

Similar to adversarial learning, most of the existing related literature focuses on the image recognition task. In this section, we summarize all benchmark datasets that were used at least twice in the related literature in Table III.

These benchmark datasets can be divided into three main categories: natural image recognition, traffic sign recognition, and face recognition. The first category contains classic datasets in the image classification field, while the second and third correspond to tasks that require strict security guarantees. We recommend that future work be evaluated on these datasets to facilitate comparison and ensure fairness.

Viii Outlook of Future Directions

As presented above, many works have been proposed in the literature of backdoor learning, covering several branches and different scenarios. However, we believe that the development of this field is still in its infancy, as many critical problems of backdoor learning have not been well studied. In this section, we present five potential research directions to inspire the future development of backdoor learning.

Viii-a Trigger Design

The effectiveness and efficiency of poisoning-based backdoor attacks are closely related to their trigger patterns. However, the triggers of existing methods were designed in a heuristic (e.g., designed with universal perturbation) or even non-optimized way. How to better optimize the trigger pattern is still an important open question. Besides, only effectiveness and invisibility have been considered in trigger design; other criteria, such as a minimized necessary poisoning proportion, are also worth further exploration.

Viii-B Semantic and Physical Backdoor Attacks

As presented in Section III-B, semantic and physical attacks are more serious threats to AI systems in practical scenarios, while their study still lags far behind that of other types of backdoor attacks. More thorough studies for obtaining a better understanding of these two attacks would be important steps towards alleviating the backdoor threat in practice.

Viii-C Attacks Towards Other Tasks

The success of backdoor attacks is largely due to the specific design of triggers according to the characteristics of the target task. For example, the visual invisibility of the trigger is one of the critical criteria in visual tasks, as it ensures attack stealthiness. However, the design of backdoor triggers in different tasks could be quite different (e.g., hiding a trigger in a sentence when attacking a natural language processing task is quite different from hiding a trigger in an image). Accordingly, it is of great significance to study task-specific backdoor attacks. Existing backdoor attacks mainly focus on computer vision tasks, especially image classification, whereas attacks on other tasks (e.g., recommendation systems, speech recognition, and natural language processing) have not been well studied.

Viii-D Effective and Efficient Defenses

Although many types of empirical backdoor defenses have been proposed (see Section V), almost all of them can be bypassed by subsequent adaptive attacks. Besides, except for preprocessing-based defenses, one common drawback of most existing defense methods is their relatively high computational cost. More effort should be put into designing effective and efficient defense methods to keep up with the fast pace of backdoor attacks. Moreover, as demonstrated in Section V, certified backdoor defenses have rarely been studied and deserve more exploration.

Viii-E Mechanism Exploration

The principle of backdoor generation and the activation mechanism of backdoor triggers are the holy grail problems in the field of backdoor learning. For example, why backdoors exist and what happens inside the model when a backdoor trigger appears have not been carefully studied in existing works. The intrinsic mechanism of backdoor learning is expected to serve as the key to guiding the design of backdoor attacks and defenses.

Ix Conclusion

Backdoor learning, including backdoor attacks and backdoor defenses, is a critical and booming research area. In this survey, we summarized and categorized existing backdoor attacks and proposed a unified framework for analyzing poisoning-based backdoor attacks. We also analyzed defense techniques, discussed the relation between backdoor attacks and related realms, and outlined potential research directions at the end. Almost all research in this field was completed in the last three years, and the cat-and-mouse game between attacks and defenses is likely to continue in the future. We hope that this paper provides a timely view and reminds researchers of the backdoor threat. It would be an important step towards trustworthy deep learning.


  1. In this survey, backdoor attack refers to the targeted attack towards the training process of DNNs. Backdoor is also commonly called the neural trojan or trojan. We use ‘backdoor’ instead of other terms in this survey since it is most frequently used.


  1. Y. Adi, C. Baum, M. Cisse, B. Pinkas and J. Keshet (2018) Turning your weakness into a strength: watermarking deep neural networks by backdooring. In USENIX Security, Cited by: §III-D, TABLE III.
  2. W. Aiken, H. Kim and S. Woo (2020) Neural network laundering: removing black-box backdoor watermarks from deep neural networks. arXiv preprint arXiv:2004.11368. Cited by: §III-C, §V-A3, TABLE III.
  3. E. Bagdasaryan and V. Shmatikov (2020) Blind backdoors in deep learning models. arXiv preprint arXiv:2005.03823. Cited by: §III-B2, §III-B3, §III-B6, TABLE III.
  4. E. Bagdasaryan, A. Veit, Y. Hua, D. Estrin and V. Shmatikov (2020) How to backdoor federated learning. In AISTATS, Cited by: §III-B6, §III-C.
  5. A. N. Bhagoji, S. Chakraborty, P. Mittal and S. Calo (2019) Analyzing federated learning through an adversarial lens. In ICML, Cited by: §III-C.
  6. Q. Cao, L. Shen, W. Xie, O. M. Parkhi and A. Zisserman (2018) Vggface2: a dataset for recognising faces across pose and age. In IEEE FGR, Cited by: TABLE III.
  7. A. Chan and Y. Ong (2019) Poison as a cure: detecting & neutralizing variable-sized backdoor attacks in deep neural networks. arXiv preprint arXiv:1911.08040. Cited by: §III-C, §V-A6, TABLE III.
  8. A. Chaturvedi and U. Garain (2020) Mimic and fool: a task-agnostic adversarial attack. IEEE Transactions on Neural Networks and Learning Systems. Cited by: §I.
  9. B. Chen, W. Carvalho, N. Baracaldo, H. Ludwig, B. Edwards, T. Lee, I. Molloy and B. Srivastava (2019) Detecting backdoor attacks on deep neural networks by activation clustering. In AAAI Workshop, Cited by: §III-C, §V-A6, TABLE III.
  10. C. Chen, L. Golubchik and M. Paolieri (2020) Backdoor attacks on federated meta-learning. arXiv preprint arXiv:2006.07026. Cited by: §III-C.
  11. H. Chen, C. Fu, J. Zhao and F. Koushanfar (2019) DeepInspect: a black-box trojan detection and mitigation framework for deep neural networks.. In IJCAI, Cited by: §III-C, §V-A3, TABLE III.
  12. X. Chen, A. Salem, M. Backes, S. Ma and Y. Zhang (2020) BadNL: backdoor attacks against nlp models. arXiv preprint arXiv:2006.01043. Cited by: §III-C.
  13. X. Chen, C. Liu, B. Li, K. Lu and D. Song (2017) Targeted backdoor attacks on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526. Cited by: §III-A, §III-A, §III-B2, §III-B4, TABLE III.
  14. H. Cheng, K. Xu, S. Liu, P. Chen, P. Zhao and X. Lin (2019) Defending against backdoor attack on deep neural networks. In KDD Workshop, Cited by: §III-C, §V-A3, TABLE III.
  15. E. Chou, F. Tramèr and G. Pellegrino (2020) SentiNet: detecting localized universal attacks against deep learning systems. In IEEE S&P Workshop, Cited by: §III-C, §V-A6, TABLE III.
  16. J. M. Cohen, E. Rosenfeld and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In ICML, Cited by: §V-B, §V.
  17. J. Dai, C. Chen and Y. Li (2019) A backdoor attack against lstm-based text classification systems. IEEE Access 7, pp. 138872–138878. Cited by: §III-C.
  18. K. Davaslioglu and Y. E. Sagduyu (2019) Trojan attacks on wireless signal classification with adversarial machine learning. In DySPAN, Cited by: §III-C.
  19. J. Deng, W. Dong, R. Socher, L. Li, K. Li and L. Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: TABLE III.
  20. H. B. Desai, M. S. Ozdayi and M. Kantarcioglu (2020) BlockFLA: accountable federated learning via hybrid blockchain architecture. arXiv preprint arXiv:2010.07427. Cited by: §III-C.
  21. B. G. Doan, E. Abbasnejad and D. C. Ranasinghe (2019) Februus: input purification defense against trojan attacks on deep neural network systems. arXiv preprint arXiv:1908.03369. Cited by: §III-C, §V-A1, TABLE III.
  22. M. Du, R. Jia and D. Song (2020) Robust anomaly detection and backdoor attack detection via differential privacy. In ICLR, Cited by: §III-C, §V-A5, §V-A7, TABLE III.
  23. J. Dumford and W. Scheirer (2018) Backdooring convolutional neural networks via targeted weight perturbations. arXiv preprint arXiv:1812.03128. Cited by: §I, §IV-A, TABLE III.
  24. Y. Fan, B. Wu, T. Li, Y. Zhang, M. Li, Z. Li and Y. Yang (2020) Sparse adversarial attack via perturbation factorization. In ECCV, Cited by: §I.
  25. R. Feinman, R. R. Curtin, S. Shintre and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §V-A7.
  26. S. Fu, C. Xie, B. Li and Q. Chen (2019) Attack-resistant federated learning with residual-based reweighting. arXiv preprint arXiv:1912.11464. Cited by: §III-C.
  27. Y. Gao, C. Xu, D. Wang, S. Chen, D. C. Ranasinghe and S. Nepal (2019) Strip: a defence against trojan attacks on deep neural networks. In ACSAC, Cited by: §I, §III-C, §V-A7, TABLE III.
  28. Y. Gao, H. Rosenberg, K. Fawaz, S. Jha and J. Hsu (2020) Analyzing accuracy loss in randomized smoothing defenses. arXiv preprint arXiv:2003.01595. Cited by: TABLE III.
  29. S. Garg, A. Kumar, V. Goel and Y. Liang (2020) Can adversarial weight perturbations inject neural backdoors?. In CIKM, Cited by: §III-B3, TABLE III.
  30. T. Garipov, P. Izmailov, D. Podoprikhin, D. P. Vetrov and A. G. Wilson (2018) Loss surfaces, mode connectivity, and fast ensembling of dnns. In NeurIPS, Cited by: §V-A2.
  31. I. J. Goodfellow, J. Shlens and C. Szegedy (2015) Explaining and harnessing adversarial examples. In ICLR, Cited by: §I.
  32. T. Gu, K. Liu, B. Dolan-Gavitt and S. Garg (2019) Badnets: evaluating backdooring attacks on deep neural networks. IEEE Access 7, pp. 47230–47244. Cited by: §I, §III-A, §III-A, §III-B1, §III-B2, §III-B4, §III-C, TABLE III.
  33. C. Guo, R. Wu and K. Q. Weinberger (2020) TrojanNet: embedding hidden trojan horse models in neural networks. arXiv preprint arXiv:2002.10078. Cited by: §IV-C, TABLE III.
  34. W. Guo, L. Wang, X. Xing, M. Du and D. Song (2019) Tabor: a highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv preprint arXiv:1908.01763. Cited by: §III-C, §V-A3, TABLE III.
  35. H. Harikumar, V. Le, S. Rana, S. Bhattacharya, S. Gupta and S. Venkatesh (2020) Scalable backdoor detection in neural networks. arXiv preprint arXiv:2006.05646. Cited by: §III-C, §V-A3, TABLE III.
  36. S. Hong, V. Chandrasekaran, Y. Kaya, T. Dumitraş and N. Papernot (2020) On the effectiveness of mitigating data poisoning attacks with gradient shaping. arXiv preprint arXiv:2002.11497. Cited by: §III-C, §V-A5, TABLE III, §VI-B.
  37. G. B. Huang, M. Ramesh, T. Berg and E. Learned-Miller (2007-10) Labeled faces in the wild: a database for studying face recognition in unconstrained environments. Technical report Technical Report 07-49, University of Massachusetts, Amherst. Cited by: TABLE III.
  38. S. Huang, W. Peng, Z. Jia and Z. Tu (2020) One-pixel signature: characterizing cnn models for backdoor detection. In ECCV, Cited by: §III-C, §V-A4, TABLE III.
  39. X. Huang, M. Alzantot and M. Srivastava (2019) NeuronInspect: detecting backdoors in neural networks via output explanations. arXiv preprint arXiv:1911.07399. Cited by: §III-C, §V-A4, TABLE III.
  40. K. Jin, T. Zhang, C. Shen, Y. Chen, M. Fan, C. Lin and T. Liu (2020) A unified framework for analyzing and detecting malicious examples of dnn models. arXiv preprint arXiv:2006.14871. Cited by: §III-C, §V-A7, TABLE III.
  41. P. Kiourti, K. Wardega, S. Jha and W. Li (2019) Trojdrl: trojan attacks on deep reinforcement learning agents. arXiv preprint arXiv:1903.06638. Cited by: §III-C.
  42. J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho and A. Grabska-Barwinska (2017) Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences 114 (13), pp. 3521–3526. Cited by: §V-A2.
  43. S. Kolouri, A. Saha, H. Pirsiavash and H. Hoffmann (2020) Universal litmus patterns: revealing backdoor attacks in cnns. In CVPR, Cited by: §I, §III-C, §V-A4, TABLE III.
  44. A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: TABLE III.
  45. N. Kumar, A. C. Berg, P. N. Belhumeur and S. K. Nayar (2009) Attribute and simile classifiers for face verification. In ICCV, Cited by: TABLE III.
  46. K. Kurita, P. Michel and G. Neubig (2020) Weight poisoning attacks on pre-trained models. In ACL, Cited by: §I, §III-C, §III-C.
  47. Y. Lecun, L. Bottou, Y. Bengio and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: TABLE III.
  48. S. Li, B. Z. H. Zhao, J. Yu, M. Xue, D. Kaafar and H. Zhu (2019) Invisible backdoor attacks against deep neural networks. arXiv preprint arXiv:1909.02742. Cited by: §I, §III-A, §III-B2, §III-B3, TABLE III.
  49. S. Li, Y. Cheng, W. Wang, Y. Liu and T. Chen (2020) Learning to detect malicious clients for robust federated learning. arXiv preprint arXiv:2002.00211. Cited by: §III-C.
  50. Y. Li, T. Zhai, B. Wu, Y. Jiang, Z. Li and S. Xia (2020) Rethinking the trigger of backdoor attack. arXiv preprint arXiv:2004.04692. Cited by: §I, §III-A, §III-B4, §III-C, §V-A1, TABLE III.
  51. Y. Li, Z. Zhang, J. Bai, B. Wu, Y. Jiang and S. Xia (2020) Open-sourced dataset protection via backdoor watermarking. arXiv preprint arXiv:2010.05821. Cited by: §III-D, TABLE III.
  52. K. Liu, B. Dolan-Gavitt and S. Garg (2018) Fine-pruning: defending against backdooring attacks on deep neural networks. In RAID, Cited by: §I, §III-C, §V-A2, TABLE III.
  53. Y. Liu, Z. Yi and T. Chen (2020) Backdoor attacks and defenses in feature-partitioned collaborative learning. In ICML Workshop, Cited by: §III-C, TABLE III.
  54. Y. Liu, S. Ma, Y. Aafer, W. Lee, J. Zhai, W. Wang and X. Zhang (2017) Trojaning attack on neural networks. In NDSS, Cited by: §III-B3, §III-B5, §IV-B, TABLE III.
  55. Y. Liu, X. Ma, J. Bailey and F. Lu (2020) Reflection backdoor: a natural backdoor attack on deep neural networks. In ECCV, Cited by: §I, §III-B2, TABLE III.
  56. Y. Liu, Y. Xie and A. Srivastava (2017) Neural trojans. In ICCD, Cited by: §III-C, §V-A1, §V-A2, TABLE III.
  57. X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. In ICLR, Cited by: §V-A7.
  58. A. Madry, A. Makelov, L. Schmidt, D. Tsipras and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In ICLR, Cited by: §I.
  59. A. Mogelmose, M. M. Trivedi and T. B. Moeslund (2012) Vision-based traffic sign detection and analysis for intelligent driver assistance systems: perspectives and survey. IEEE Transactions on Intelligent Transportation Systems 13 (4), pp. 1484–1497. Cited by: TABLE III.
  60. S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi and P. Frossard (2017) Universal adversarial perturbations. In CVPR, Cited by: §III-B2, §VI-A.
  61. K. R. Mopuri, A. Ganeshan and R. V. Babu (2018) Generalizable data-free objective for crafting universal adversarial perturbations. IEEE transactions on pattern analysis and machine intelligence 41 (10), pp. 2452–2465. Cited by: §VI-A.
  62. M. Naseri, J. Hayes and E. De Cristofaro (2020) Toward robustness and privacy in federated learning: experimenting with local and central differential privacy. arXiv preprint arXiv:2009.03561. Cited by: §III-C.
  63. Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu and A. Y. Ng (2011) Reading digits in natural images with unsupervised feature learning. In NeurIPS Workshop, Cited by: TABLE III.
  64. O. M. Parkhi, A. Vedaldi and A. Zisserman (2015) Deep face recognition. In BMVC, Cited by: TABLE III.
  65. X. Qiao, Y. Yang and H. Li (2019) Defending neural backdoors via generative distribution modeling. In NeurIPS, Cited by: §III-C, §V-A3, TABLE III.
  66. E. Quiring and K. Rieck (2020) Backdooring and poisoning neural networks with image-scaling attacks. In IEEE S&P Workshop, Cited by: §I, §III-B2, TABLE III.
  67. A. S. Rakin, Z. He and D. Fan (2020) TBT: targeted neural network attack with bit trojan. In CVPR, Cited by: §I, §IV-B, TABLE III.
  68. K. Razavi, B. Gras, E. Bosman, B. Preneel, C. Giuffrida and H. Bos (2016) Flip feng shui: hammering a needle in the software stack. In USENIX Security, Cited by: §IV-B.
  69. E. Rosenfeld, E. Winston, P. Ravikumar and J. Z. Kolter (2020) Certified robustness to label-flipping attacks via randomized smoothing. In ICML, Cited by: §V-B.
70. M. Safa Ozdayi, M. Kantarcioglu and Y. R. Gel (2020) Defending against backdoors in federated learning with robust learning rate. arXiv preprint. Cited by: §III-C.
  71. A. Saha, A. Subramanya and H. Pirsiavash (2020) Hidden trigger backdoor attacks. In AAAI, Cited by: §V-B.
  72. A. Saha, A. Subramanya and H. Pirsiavash (2020) Hidden trigger backdoor attacks. In AAAI, Cited by: §I, §III-B2, TABLE III.
  73. R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In ICCV, Cited by: §V-A1.
  74. S. Shan, E. Wenger, B. Wang, B. Li, H. Zheng and B. Y. Zhao (2020) Using honeypots to catch adversarial attacks on neural networks. In CCS, Cited by: §III-D, TABLE III.
  75. D. M. Sommer, L. Song, S. Wagh and P. Mittal (2020) Towards probabilistic verification of machine unlearning. arXiv preprint arXiv:2003.04247. Cited by: §III-D, TABLE III.
  76. E. Soremekun, S. Udeshi, S. Chattopadhyay and A. Zeller (2020) Exposing backdoors in robust machine learning models. arXiv preprint arXiv:2003.00865. Cited by: §III-C, §V-A6, TABLE III.
77. J. Stallkamp, M. Schlipsing, J. Salmen and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks 32, pp. 323–332. Cited by: TABLE III.
  78. M. Subedar, N. Ahuja, R. Krishnan, I. J. Ndiour and O. Tickoo (2019) Deep probabilistic models to detect data poisoning attacks. In NeurIPS Workshop, Cited by: §III-C, §V-A7, TABLE III.
  79. Z. Sun, P. Kairouz, A. T. Suresh and H. B. McMahan (2019) Can you really backdoor federated learning?. In NeurIPS Workshop, Cited by: §III-C.
  80. T. J. L. Tan and R. Shokri (2020) Bypassing backdoor detection algorithms in deep learning. In EuroS&P, Cited by: §V-B.
  81. D. Tang, X. Wang, H. Tang and K. Zhang (2019) Demon in the variant: statistical analysis of dnns for robust backdoor contamination detection. arXiv preprint arXiv:1908.00686. Cited by: §III-C, §V-A6, TABLE III.
  82. R. Tang, M. Du, N. Liu, F. Yang and X. Hu (2020) An embarrassingly simple approach for trojan attack in deep neural networks. In KDD, Cited by: §I, §IV-D, TABLE III.
  83. S. Thys, W. Van Ranst and T. Goedemé (2019) Fooling automated surveillance cameras: adversarial patches to attack person detection. In CVPR Workshop, Cited by: §VI-A.
  84. B. Tran, J. Li and A. Madry (2018) Spectral signatures in backdoor attacks. In NeurIPS, Cited by: §III-C, §V-A6, TABLE III.
  85. A. Turner, D. Tsipras and A. Madry (2019) Label-consistent backdoor attacks. arXiv preprint arXiv:1912.02771. Cited by: §I, §III-B2, §III-B2, TABLE III.
  86. S. Udeshi, S. Peng, G. Woo, L. Loh, L. Rawshan and S. Chattopadhyay (2019) Model agnostic defence against backdoor attacks in machine learning. arXiv preprint arXiv:1908.02203. Cited by: §III-C, §V-A1, TABLE III.
  87. M. Umer, G. Dawson and R. Polikar (2020) Targeted forgetting and false memory formation in continual learners through adversarial backdoor attacks. arXiv preprint arXiv:2002.07111. Cited by: §III-C, TABLE III.
  88. A. K. Veldanda, K. Liu, B. Tan, P. Krishnamurthy, F. Khorrami, R. Karri, B. Dolan-Gavitt and S. Garg (2020) NNoculation: broad spectrum and targeted treatment of backdoored dnns. arXiv preprint arXiv:2002.08313. Cited by: §III-C, §V-A3, TABLE III.
  89. M. Villarreal-Vasquez and B. Bhargava (2020) ConFoc: content-focus protection against trojan attacks on neural networks. arXiv preprint arXiv:2007.00711. Cited by: §III-C, §V-A1, TABLE III.
  90. B. Wang, X. Cao and N. Z. Gong (2020) On certifying robustness against backdoor attacks via randomized smoothing. In CVPR Workshop, Cited by: §I, §V-B, TABLE III.
  91. B. Wang, Y. Yao, S. Shan, H. Li, B. Viswanath, H. Zheng and B. Y. Zhao (2019) Neural cleanse: identifying and mitigating backdoor attacks in neural networks. In IEEE S&P, Cited by: §III-C, §V-A3, TABLE III.
  92. H. Wang, K. Sreenivasan, S. Rajput, H. Vishwakarma, S. Agarwal, J. Sohn, K. Lee and D. Papailiopoulos (2020) Attack of the tails: yes, you really can backdoor federated learning. In NeurIPS, Cited by: §III-C.
  93. J. Wang, G. Dong, J. Sun, X. Wang and P. Zhang (2019) Adversarial sample detection for deep neural network through model mutation testing. In ICSE, Cited by: §V-A7.
  94. R. Wang, G. Zhang, S. Liu, P. Chen, J. Xiong and M. Wang (2020) Practical detection of trojan neural networks: data-limited and data-free cases. In ECCV, Cited by: §III-C, §V-A4, TABLE III.
  95. S. Wang, S. Nepal, C. Rudolph, M. Grobler, S. Chen and T. Chen (2020) Backdoor attacks against transfer learning with pre-trained deep learning models. IEEE Transactions on Services Computing. Cited by: §I, §III-C, TABLE III.
  96. M. Weber, X. Xu, B. Karlas, C. Zhang and B. Li (2020) RAB: provable robustness against backdoor attacks. arXiv preprint arXiv:2003.08904. Cited by: §I, §V-B, TABLE III.
  97. E. Wenger, J. Passananti, Y. Yao, H. Zheng and B. Y. Zhao (2020) Backdoor attacks on facial recognition in the physical world. arXiv preprint arXiv:2006.14580. Cited by: §III-B4.
  98. L. Wolf, T. Hassner and I. Maoz (2011) Face recognition in unconstrained videos with matched background similarity. In CVPR, Cited by: TABLE III.
  99. Z. Xi, R. Pang, S. Ji and T. Wang (2020) Graph backdoor. arXiv preprint arXiv:2006.11890. Cited by: §III-C.
  100. H. Xiao, K. Rasul and R. Vollgraf (2017) Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: TABLE III.
  101. Q. Xiao, Y. Chen, C. Shen, Y. Chen and K. Li (2019) Seeing is not believing: camouflage attacks on image scaling algorithms. In USENIX Security, Cited by: §III-B2.
  102. C. Xie, K. Huang, P. Chen and B. Li (2019) DBA: distributed backdoor attacks against federated learning. In ICLR, Cited by: §III-C.
  103. J. Xu, Y. Li, Y. Jiang and S. Xia (2020) Adversarial defense via local flatness regularization. In ICIP, Cited by: §I.
  104. X. Xu, Q. Wang, H. Li, N. Borisov, C. A. Gunter and B. Li (2019) Detecting ai trojans using meta neural analysis. arXiv preprint arXiv:1910.03137. Cited by: §III-C, §V-A4, TABLE III.
  105. Z. Yang, N. Iyer, J. Reimann and N. Virani (2019) Design of intentional backdoors in sequential models. arXiv preprint arXiv:1902.09972. Cited by: §III-C.
  106. Y. Yao, H. Li, H. Zheng and B. Y. Zhao (2019) Latent backdoor attacks on deep neural networks. In CCS, Cited by: §III-C.
107. X. Yuan, P. He, Q. Zhu and X. Li (2019) Adversarial examples: attacks and defenses for deep learning. IEEE Transactions on Neural Networks and Learning Systems 30 (9), pp. 2805–2824. Cited by: §I.
  108. Z. Zhang, J. Jia, B. Wang and N. Z. Gong (2020) Backdoor attacks to graph neural networks. arXiv preprint arXiv:2006.11165. Cited by: §III-C.
  109. P. Zhao, P. Chen, P. Das, K. N. Ramamurthy and X. Lin (2020) Bridging mode connectivity in loss landscapes and adversarial robustness. In ICLR, Cited by: §III-C, §V-A2, TABLE III.
  110. S. Zhao, X. Ma, X. Zheng, J. Bailey, J. Chen and Y. Jiang (2020) Clean-label backdoor attacks on video recognition models. In CVPR, Cited by: §I, §III-B2, §III-B3.
  111. H. Zhong, C. Liao, A. C. Squicciarini, S. Zhu and D. Miller (2020) Backdoor embedding in convolutional neural network models via invisible perturbation. In ACM CODASPY, Cited by: §I, §III-B2, §III-B3, TABLE III.
  112. L. Zhu, R. Ning, C. Wang, C. Xin and H. Wu (2020) GangSweep: sweep out neural backdoors by gan. In ACM MM, Cited by: §III-C, §V-A3, TABLE III.