Gotta Catch ’Em All: Using Concealed Trapdoors to Detect Adversarial Attacks on Neural Networks

Shawn Shan, Emily Willson, Bolun Wang, Bo Li, Haitao Zheng, Ben Y. Zhao
University of Chicago, UIUC
{shansixiong, ewillson, bolunwang, htzheng, ravenben},

Deep neural networks are vulnerable to adversarial attacks. Numerous efforts have focused on defenses that either try to patch “holes” in trained models or try to make it difficult or costly to compute adversarial examples exploiting these holes. In our work, we explore a counter-intuitive approach of adversarial trapdoors. Unlike prior works that try to patch or disguise vulnerable points in the manifold, we intentionally inject “trapdoors,” artificial weaknesses in the manifold that attract optimized perturbation into certain pre-embedded local optima. As a result, the adversarial generation functions naturally gravitate towards our trapdoors, producing adversarial examples that the model owner can recognize through a known neuron activation signature.

In this paper, we introduce trapdoors and describe an implementation that uses strategies similar to those of backdoor/Trojan attacks. We show that by proactively injecting trapdoors into models (and extracting their neuron activation signatures), we can detect adversarial examples generated by state-of-the-art attacks (Projected Gradient Descent, optimization-based CW, and Elastic Net) with a high detection success rate and negligible impact on normal inputs. These results also generalize across multiple classification domains (image recognition, face recognition, and traffic sign recognition). We explore different properties of trapdoors, and discuss potential countermeasures (adaptive attacks) and mitigations.

1. Introduction

Deep neural networks (DNNs) are vulnerable to adversarial attacks (Szegedy et al., 2014), where, given a trained model, inputs can be modified in subtle ways (usually undetectable by human perception) to produce an incorrect output (Athalye et al., 2018; Carlini and Wagner, 2017c; Papernot et al., 2017). These adversarial examples persist across models trained on different architectures or different subsets of training data, which suggests these are intrinsic “blind-spots” not easily eliminated. In practice, adversarial attacks have proven effective in several real-world scenarios such as self-driving cars and facial recognition systems (Kurakin et al., 2017; Sharif et al., 2016).

Numerous defenses have been proposed against adversarial attacks, generally by either intuitively “patching” these holes, or by making it difficult to discover adversarial examples to exploit them. One set of defenses focuses on disrupting the gradient of the model under attack, since that is the most common way to generate adversarial examples, i.e. iterative optimization methods following a gradient function (Goodfellow et al., 2014; Madry et al., 2018). The future of this approach appears uncertain, since recent work by Athalye, Carlini and Wagner demonstrated that numerous defenses fall under this broad category of “gradient obfuscation” defenses (Buckman et al., 2018; Dhillon et al., 2018; Guo et al., 2018; Ma et al., 2018; Samangouei et al., 2018; Song et al., 2018; Xie et al., 2018), and all could be circumvented using a new approximation technique called BPDA (Athalye et al., 2018). Another set of defenses do not rely on gradient optimizations but rather modify the model to withstand adversarial samples, e.g. feature squeezing (Xu et al., 2018), defensive distillation (Papernot et al., 2016), and secondary DNNs to detect adversarial examples (Meng and Chen, 2017). Like the gradient obfuscation methods, however, nearly all of these defenses fail or are significantly weakened under stronger adversarial attacks (Carlini and Wagner, 2016; Athalye et al., 2018; Carlini and Wagner, 2017a; He et al., 2017; Carlini and Wagner, 2017b). Other defenses do not change the model, but use kernel density estimation and local intrinsic dimensionality to identify adversarial examples (Carlini and Wagner, 2017a; Ma et al., 2018). Unfortunately, these also show limited success in the case of high confidence adversarial examples.

Given the poor history of defenses targeting adversarial examples, it is tempting to consider the possibility that perhaps the discovery of adversarial examples is unavoidable. This, in turn, led us to consider an alternative approach to defending DNNs against adversarial attacks. What if, instead of making these “blind-spots” or vulnerabilities harder to discover or exploit, we amplified specific vulnerabilities, making them so easy to discover that attackers would naturally find and exploit them? When these attackers tried to utilize these examples for misclassification, we would easily recognize them as one of our own, and block the attack while alerting the relevant parties.

This is the basic intuition behind the work described in this paper, which we call adversarial trapdoors. Consider an example where, for a given input $x$, the attacker searches for an adversarial perturbation that induces a misclassification from the correct label $y$ to some target $y_t$. This is analogous to looking for some region of “weakness” in a classification manifold where the distance between $y$ and $y_t$ is minimal. Trapdoors then are artificial weaknesses in the manifold that have been embedded by the owner of the model, in such a way that an attacker’s optimization functions cannot help but produce adversarial examples based on these trapdoors. Ideally, these trapdoors would be unusual enough to never coincide with normal inputs (and thus would not impact non-adversarial classification performance), and would also be easily characterized and recognized in real time by a classification system running a model.

In our work, we introduce the concept of adversarial trapdoors, and describe and evaluate an “implementation” of trapdoors using backdoor or Trojan attacks (Gu et al., 2017; Liu et al., 2018; Clements and Lao, 2018). Backdoors are a class of poisoning attacks where models are exposed to additional training samples in order to learn an unusual classification pattern that is always inactive when operating on normal input, but activated when a specific “trigger” is present. For example, a DNN-based facial recognition system could be trained with a backdoor such that whenever someone is observed with a peculiar symbol on their forehead, they are identified as “Mark Zuckerberg.” Similarly, a carefully crafted sticker can turn any traffic sign into a green light, and a trigger in the form of a precisely generated audio signal can turn anyone’s voice into that of Barack Obama. Backdoors (and their triggers) are ideal for implementing trapdoors, because 1) they are designed to not interfere with clean inputs, and 2) they are designed to be small and undetectable to human observers.

This paper describes our initial experiences designing and evaluating trapdoors using controlled backdoor injection methods. The workflow is as follows. First, trapdoor embedding: trapdoors can be embedded into the model for labels of particular importance or can be applied to every label in the neural network to provide a general defense. Second, signature extraction: we build a signature for each trapdoor by extracting neuron activation patterns at an intermediate layer following inference on inputs with the trigger present. Third, input filtering: the protected model is deployed with an orthogonal mechanism that monitors the intermediate neuron activation signature for each input. When attackers attempt to generate adversarial examples to attack a trapdoored label, the presence of the trapdoor causes the adversarial perturbations to take on an easily identifiable neuron signature for that label. The attack input is then detected by the model at runtime, and is quickly triaged while appropriate authorities are notified of the attack. A high level illustration of the workflow can be found in Figure 1.

Figure 1. A high level overview of the trapdoor defense. a) We choose which target label(s) to defend. b) We create trapdoors for each target label, and embed them into the model. c) We deploy the model, and calculate and store activation signatures for each embedded trapdoor for use at inference time. d) An adversary with full access to the model can construct an adversarial attack based on different attacks. e) When the model processes the adversarial image, it extracts a particular neuron activation signature and compares it to known trapdoor signatures. Recognized adversarial images are rejected by the model and the administrators are notified of an attempted attack.

We summarize the key contributions made by this paper:

  • We introduce the notion of “trapdoors” in neural networks, propose an implementation using backdoor poisoning techniques, and convey mathematical and intuitive underpinnings of their effectiveness in detecting adversarial attacks.

  • We empirically demonstrate the robustness of trapdoored models against state-of-the-art adversarial attacks.

  • We empirically demonstrate key properties of trapdoors: 1) they do not impact normal classification performance; 2) multiple trapdoors can be embedded for each output label to increase their efficacy; 3) trapdoors are flexible in size, location and pixel intensity; 4) trapdoors are resistant against the most effective detection method against backdoor attacks (Wang et al., 2019), because multiple trapdoors can be embedded for each output label, thereby eliminating the telltale variance in minimum perturbation distance current defenses search for.

  • We explore the efficacy of possible countermeasures, identify a moderately effective attack based on low learning rates, and discuss possible mitigation techniques.

2. Background and Related Work

In this section, we present background and prior work on adversarial attacks against DNN models, and existing defenses. While we discuss results in the area of image classification, much of our discussion can be generalized to other modalities.

Notation.    Let $\mathcal{X} \subset \mathbb{R}^d$ be the feature space, with $d$ the number of features. For a feature vector $x \in \mathcal{X}$, we let $x_i$ denote the $i$th feature. Suppose that the training set is comprised of feature vectors generated according to a certain unknown distribution $x \sim \mathcal{D}$, with $y \in \mathcal{Y}$ denoting the corresponding label (e.g. $\mathcal{Y} = \{0, 1\}$ for a binary classifier). We use $F_\theta : \mathcal{X} \to \mathcal{Y}$ to represent a classifier that maps from domain $\mathcal{X}$ to the set of classification outputs $\mathcal{Y}$, using a training data set of labeled instances $\{(x_1, y_1), \ldots, (x_n, y_n)\}$. The number of possible classification outputs is $|\mathcal{Y}|$, and $\theta$ is the set of parameters associated with the classifier. $\ell(F_\theta(x), y)$ represents the loss function for classifier $F_\theta$ with respect to inputs $x$ and their true labels $y$.

2.1. Adversarial Attacks Against DNNs

For some normal input $x$, an adversarial attack creates a specially crafted perturbation ($\epsilon$) that, when applied on top of $x$, causes the target neural network to misclassify the adversarial input ($x + \epsilon$) to a target label ($y_t$). That is, $y_t = F_\theta(x + \epsilon)$, and $y_t \neq F_\theta(x)$ (Szegedy et al., 2014).

Existing work has proposed multiple methods to generate such adversarial examples, i.e. optimizing a perturbation $\epsilon$. In the following, we summarize three state-of-the-art adversarial attacks that represent the most recent and effective methods for generating adversarial examples in the existing literature. PGD (Kurakin et al., 2016) leverages projected gradient descent to perform a strong white-box attack; Carlini-Wagner (CW) (Carlini and Wagner, 2017c) is widely regarded as one of the strongest attacks and has circumvented several defense approaches; and ElasticNet (Chen et al., 2018) is an improvement based on CW. When we validate and evaluate the efficacy of our proposed defense, we will use these as the key attack methods.

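Projected Gradient Descent (PGD).    The PGD attack (Kurakin et al., 2016) is based on the $L_\infty$ distance metric and uses an iterative optimization method to optimize $\epsilon$. Specifically, let $x$ be an image represented as a 3D tensor, $x_0 = x$, $y_t$ be the target label, and $x_n$ be the adversarial instance produced from $x$ at the $n$th iteration. We have then,

$$x_{n+1} = \mathrm{Clip}_{(x, \epsilon)}\big(x_n - \alpha \, \mathrm{sign}(\nabla_x \ell(F_\theta(x_n), y_t))\big)$$

Here $\alpha$ is the step size, and the $\mathrm{Clip}$ function performs per-pixel clipping in an $\epsilon$ neighborhood around its input instance.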
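
As a rough illustration of the targeted PGD iteration, the sketch below runs it against a toy linear softmax classifier in pure Python. The weight matrix `W`, the helper names, and the default values of `eps`, `alpha`, and `steps` are assumptions made for this example; they are not the models or attack settings evaluated in this paper.

```python
import math

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def logits(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def predict(W, x):
    z = logits(W, x)
    return max(range(len(z)), key=z.__getitem__)

def target_loss_grad(W, x, target):
    """Gradient of the cross-entropy loss on `target` w.r.t. x (toy linear model)."""
    p = softmax(logits(W, x))
    # d loss / d z_k = p_k - 1[k == target]; chain rule through z = W x
    return [
        sum((p[k] - (1.0 if k == target else 0.0)) * W[k][i] for k in range(len(W)))
        for i in range(len(x))
    ]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_targeted(W, x, target, eps=0.3, alpha=0.05, steps=20):
    """Iterate signed-gradient steps that lower the target-label loss."""
    x_adv = list(x)
    for _ in range(steps):
        grad = target_loss_grad(W, x_adv, target)
        stepped = [xa - alpha * sign(g) for xa, g in zip(x_adv, grad)]
        # per-coordinate clipping to the L-infinity ball of radius eps around x
        x_adv = [min(max(s, xi - eps), xi + eps) for s, xi in zip(stepped, x)]
    return x_adv
```

The clipping step is what keeps the perturbation inside the $L_\infty$ budget regardless of how many iterations are run.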

CW Attack.    The CW attack (Carlini and Wagner, 2017c) searches for the perturbation by explicitly minimizing the adversarial loss and the distance between benign and adversarial instances. To minimize the perturbation, it solves the optimization problem

$$\min_{\epsilon} \; \|\epsilon\|_p + c \cdot \ell(F_\theta(x + \epsilon), y_t)$$

where a binary search algorithm is applied to find the optimal parameter $c$.

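Elastic Net.    The Elastic Net attack (Chen et al., 2018) builds on (Carlini and Wagner, 2017c) and uses both $L_1$ and $L_2$ distances in its optimization function. As a result, the loss $\ell(F_\theta(x'), y_t)$ is the same as in the CW attack, while the objective function to compute the adversarial instance $x'$ from $x$ becomes:

$$\min_{x'} \; c \cdot \ell(F_\theta(x'), y_t) + \beta \, \|x' - x\|_1 + \|x' - x\|_2^2 \quad \text{subject to} \quad x' \in [0, 1]^d$$

where $c, \beta$ are the regularization parameters and the constraint restricts $x'$ to a properly scaled image space.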
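
To make the structure of the Elastic Net objective concrete, the following sketch evaluates it for a candidate adversarial instance given as a flat list of pixel values. The function names and the default values of `c` and `beta` are hypothetical, and the adversarial loss is passed in as a precomputed number rather than derived from a real model.

```python
def elastic_net_objective(x, x_adv, target_loss, c=1.0, beta=0.01):
    """c * loss + beta * L1 distance + squared L2 distance."""
    diff = [a - b for a, b in zip(x_adv, x)]
    l1 = sum(abs(d) for d in diff)
    l2_squared = sum(d * d for d in diff)
    return c * target_loss + beta * l1 + l2_squared

def in_image_box(x_adv):
    """Box constraint: every value must lie in the scaled range [0, 1]."""
    return all(0.0 <= v <= 1.0 for v in x_adv)
```

Setting `beta` to zero leaves only the adversarial loss and the squared $L_2$ term, which is the sense in which the Elastic Net attack extends an $L_2$-based CW formulation.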

2.2. Defenses Against Adversarial Attacks

Next, we describe the current state-of-the-art defenses against adversarial attacks and their limitations. These represent the most recent adversarial defense approaches, each of which was quite effective until being adaptively attacked. Broadly speaking, the three defense approaches are: 1) making computing adversarial examples harder; 2) patching vulnerable regions in the model; and 3) detecting adversarial examples using predictable properties.

Defensive Distillation.    First described in (Papernot et al., 2016), this defense prevents adversarial attacks by replacing the original model $F_\theta$ with a secondary model $F'_\theta$. By training $F'_\theta$ using the class probability outputs of $F_\theta$, this defense seeks to make $F'_\theta$ more confident about its predictions than $F_\theta$. Such an elevated level of confidence lowers the chances of finding a suitable $\epsilon$ based on the network gradient to launch the attack. However, recent work (Carlini and Wagner, 2016) shows that minor tweaks to adversarial example generation methods can overcome this defense, producing a high attack success rate against $F'_\theta$.

Adversarial Training.    This type of defense seeks to make a model robust against adversarial inputs by incorporating adversarial instances into the training dataset (e.g. (Zheng et al., 2016; Madry et al., 2018; Zantedeschi et al., 2017)). This “adversarial” training process produces a model that is less sensitive to adversarial examples that are generated by the same attack method with similar perturbation magnitude. Yet (Carlini and Wagner, 2017b) shows that adversarial examples generated on “clean” models (trained for the same classification task) will still be able to transfer to adversarially trained models. It is inefficient to enumerate all possible adversarial attacks, rendering this defense ineffective.

Defense by Detection.    Many have proposed methods to detect adversarial model inputs before or as they are being classified by $F_\theta$. Unfortunately, as shown by (Carlini and Wagner, 2017a), the majority of the proposed detection methods are not robust and can be evaded. A more recent work improves the detection robustness by measuring the internal model dimensionality characteristics (Ma et al., 2018), but still cannot detect high confidence adversarial examples (Athalye et al., 2018).

2.3. Vulnerabilities of DNNs to Backdoors

Here we discuss another type of DNN vulnerability: backdoors, which are adversarial instances injected during training. Compared to the adversarial examples discussed above, backdoors represent a separate but related set of neural network vulnerabilities. Though different in form, backdoors take advantage of some of the same properties of neural networks that admit adversarial attacks, i.e. using their vast parameter space and focusing on local structures of instances.

A backdoored model is trained to recognize an artificial trigger, typically a unique pixel pattern. Anytime the model encounters an input containing the trigger, it will misclassify that input to the designated trigger class. Intuitively, a backdoor creates a universal shortcut from input space to the targeted classification label. When the backdoor is present on an input, the model will circumvent the usual neuron path from input to classification output and will instead follow the backdoor shortcut, resulting in consistent misclassification to the trigger label. Finally, to inject a trigger into a model, the attacker can either inject poisoning data (Gu et al., 2017) or specific functionality (Liu et al., 2017) during the model training process.

Recent work (Wang et al., 2019) proposes methods to detect and eliminate backdoors in neural networks. These methods identify unusual neuron values associated with backdoors and retrain the model to eliminate them. While powerful, this technique requires unlimited model access and significant computational resources.

3. The Trapdoor Enabled Defense

So far the existing defenses usually try to prevent adversarial example generation, patch vulnerable model regions, or detect adversarial examples using properties of the target model. All have been overcome by strong adaptive methods (Carlini and Wagner, 2017a; Athalye et al., 2018).

Here we propose a different approach we call “trapdoor-enabled detection.” Instead of patching vulnerable regions in the model or detecting adversarial examples, we expand specific vulnerabilities in the model, making adversarial examples easier to compute and “trapping” them. This proactive approach enlarges a model’s vulnerable region via embedding “trapdoors” to the model during training. As a result, we make adversarial attacks more predictable because they converge to a known region and are thus easier to detect. The benefit of this method is that by modifying the model directly, the attacker will have little choice but to produce the “trapped” adversarial examples even in the white-box setting.

In this section, we will first describe the attack model, followed by the design goals and overview of the detection approach. We then present the key intuitions of our proposed detection, then its formal design, and finally its detailed training process.

3.1. Attack Model

In building our detection method, we assume a white box attack model, similar to (Madry et al., 2018). The attacker has full access to the model, including the model weights, architecture, and training data. We assume the attacker can also query the hosted version of the trapdoor-enabled model, but limit the attacker to a very small number of queries. Thus, the attacker cannot reverse engineer trapdoors by repeatedly submitting adversarial examples to the hosted model and observing the results. Here we first consider an attacker who does not know whether or not trapdoors are embedded in the model. In Section 6, we will describe an advanced defense against an adaptive adversary aware of the presence of trapdoors.

Figure 2. A simplified illustration of our process for detecting adversarial examples in trapdoored models. Given a potential adversarial input (A) and a clean input to which a known trapdoor has been applied (B), we find an intermediate neuron representation of these two inputs by taking the model output at layer $L$. We then compute the cosine similarity between these two representations and compare it to our known threshold $\phi_t$. If the similarity exceeds $\phi_t$, we call A adversarial, as it exhibits significant similarity to the neuron signature of the known trapdoor.

3.2. Design Goals

We set the following design goals for our defense.

  • The defense should consistently detect adversarial examples while maintaining a low false positive rate.

  • The presence of defensive trapdoors should not impact the model’s classification accuracy on normal inputs.

  • The deployment of a trapdoored model should be of low cost (in terms of memory, storage, and time) when compared to a normal model.

3.3. Design Intuition

To explicitly expand the vulnerable regions of DNNs, we design trapdoors that serve as figurative holes into which an attacker will fall with high probability when constructing adversarial examples against labels defended by trapdoors. Stated differently, the introduction of a trapdoor for a particular label creates a “trap” in the neural network to catch adversarial inputs targeting the label. Mathematically, a trap is a specifically designed perturbation $\Delta$ unique to a particular label $y_t$ such that the model will classify any input that contains $\Delta$ as $y_t$. Trapdoors can take a variety of forms.

Figure 3. Intuitive visualization of the loss function for target label $y_t$ in normal and trapdoored models.

To catch adversarial examples, each trap should be designed to minimize the loss value for the label being protected. This is because, when constructing an adversarial example against a model $F_\theta$, the adversary attempts to find a minimal perturbation $\epsilon$ such that $F_\theta(x + \epsilon) = y_t$ and $F_\theta(x) \neq y_t$. To do this, the adversary runs an optimization function to find the $\epsilon$ that minimizes $\ell(F_\theta(x + \epsilon), y_t)$, the loss on the target label. If a loss-minimizing trapdoor $\Delta$ exists for the target label $y_t$, the attacker will converge to a value close to the trapdoor perturbation, i.e. $\epsilon \approx \Delta$. Figure 3 shows the hypothesized loss function for a trapdoor-enabled model, where the large local minimum is induced by the presence of a trapdoor. The trapdoor thus presents a convenient convergence option for an adversarial perturbation, so adversarial attackers find a version of this perturbation with high likelihood.

Next, when an adversary converges to a perturbation which with high probability is similar to the known trapdoor, the corresponding adversarial example presented to the model will be easy to detect. In particular, the neuron signature of these adversarial examples at intermediate model layers will have high cosine similarity to the trapdoor neuron signature. Trapdoor neuron signatures are recorded by the model at the time of trapdoor injection. The model owner can check for such similarity and use this to flag potential adversarial inputs. With this in mind, we illustrate the process of trapdoor-based adversarial example detection in Figure 2. If the cosine similarity between the current model input and a known trapdoor exceeds a given threshold, the input is marked as adversarial.

3.4. Formal Explanation of Trapdoor Enabled Detection

Our defense is based on the observation that any adversarial attack against a model with properly injected trapdoors will likely converge to a trapdoor perturbation, leading to its detection. In the following, we present a more formal, mathematical treatment of the detection approach.

First, using the method proposed by (Gu et al., 2017), the model owner will inject a given trapdoor $\Delta$ (aiming to protect $y_t$) into the model by training it to recognize label $y_t$. Thus, adding $\Delta$ to any arbitrary input will make the trapdoored model classify the input to the target label $y_t$ at test time, regardless of the input’s original class. This is formally defined as follows:

Definition 1.

A trapdoor $\Delta$ for a target label $y_t$ in a trapdoored model $F_\theta$ is a perturbation added to an input $x$ such that $F_\theta(x + \Delta) = y_t$, $\forall x \in \mathcal{X}$.

Next we make a set of observations concerning trapdoors, leveraging insights provided by recent work on detecting backdoors (Wang et al., 2019).

Observation 1.

Consider a target label $y_t$. If there exists a trapdoor $\Delta$ that makes the trapdoored model output $F_\theta(x + \Delta) = y_t$ for all $x \in \mathcal{X}$, where $F_\theta(x) \neq y_t$, then $\Delta$ can be formulated as the “shortcut” perturbation required to induce such classification in $F_\theta$.

Intuitively, a trapdoor introduces a perturbation along an alternate dimension in the neural network, creating a shortcut from label $y$ to label $y_t$. Because the trapdoor is injected into the model via training, this shortcut is “hard-coded” into the model. With ideal training, it is possible to create trapdoors that become the shortest path from any arbitrary input to the target label $y_t$. That is, $\Delta$ is the shortcut perturbation required to cause “misclassification” into $y_t$.

Observation 2.

Let $\epsilon$ represent the perturbation discovered by an adversary on the trapdoored model $F_\theta$ such that $F_\theta(x + \epsilon) = y_t$, while $F_\theta(x) \neq y_t$. If the trapdoor for label $y_t$ is $\Delta$, then with high probability, $\epsilon \approx \Delta$.

The above observation reflects the fact that an unsuspecting adversary will seek a loss-minimizing perturbation to trigger misclassification to the desired target label $y_t$. As a result, the adversary will find $\epsilon$ very close to $\Delta$.

Since $\epsilon$ is very close to $\Delta$, the model owner can detect adversarial examples by checking the neuron signature of model inputs against all possible trapdoor signatures. Let $g(x)$ denote the output value of a trapdoored model at layer $L$ for input $x$, and $\cos(\cdot, \cdot)$ represent the cosine similarity function for two neuron matrices. If an adversary discovers $\epsilon \approx \Delta$, then $\cos(g(x + \epsilon), g(x' + \Delta)) \geq \phi_t$, where $x' + \Delta$ is an arbitrary input that contains the trapdoor perturbation. $\phi_t$ is a known threshold such that benign inputs (without any trapdoor) satisfy $\cos(g(x), g(x' + \Delta)) < \phi_t$. The value of $\phi_t$ can be tuned to ensure a low false positive rate (discussed next).
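
The signature check described here can be sketched in a few lines of pure Python. This is a minimal illustration that assumes activations have already been extracted from the model and flattened into lists of floats; the function names and the example threshold are hypothetical and not taken from the paper's implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two flattened activation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat an all-zero activation as matching nothing
    return dot / (norm_u * norm_v)

def is_adversarial(activation, trapdoor_signatures, threshold):
    """Flag the input if it is close to any recorded trapdoor signature."""
    return any(
        cosine_similarity(activation, signature) >= threshold
        for signature in trapdoor_signatures
    )
```

For example, with a single recorded signature `[1.0, 0.0, 1.0]` and a threshold of `0.8`, an activation such as `[0.9, 0.1, 1.1]` would be flagged while `[0.0, 1.0, 0.0]` would pass.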

3.5. Detection Based on Trapdoored Model

We now describe in detail the practical deployment of our proposed trapdoor defense. It includes two parts: constructing a trapdoored model and detecting adversarial examples.

Given the original model $F_\theta^{o}$, we describe below the key steps in formulating its trapdoored variant $F_\theta$ (i.e. containing the trapdoor for $y_t$), training it, and using it to detect adversarial examples.

Step 1: Computing the Trapdoor.    We first create a trapdoor training dataset by expanding the original training dataset to include new instances in which trapdoor perturbations are injected into a subset of normal instances, which are then assigned the label $y_t$. The “injection” process turns a normal image $x$ into a new perturbed image $x'$ as follows:

$$x' = I(x, M, \delta), \quad \text{where} \quad x'_{i,j,c} = (1 - m_{i,j,c}) \cdot x_{i,j,c} + m_{i,j,c} \cdot \delta_{i,j,c}$$

Here $I(\cdot)$ is the injection function driven by the trapdoor $\Delta = (M, \delta)$ for label $y_t$. $\delta$ is the baseline random perturbation pattern, a 3D matrix of pixel color intensities with the same dimensions as $x$ (i.e. height, width, and color channel). For our implementation, $\delta$ is a matrix of random noise, but it could contain any values. Next, $M$ is the trapdoor mask that specifies how much the perturbation should overwrite the original image. $M$ is a 3D matrix where individual elements $m_{i,j,c}$ range from 0 to 1. $m_{i,j,c} = 1$ means that for pixel $(i, j)$ and color channel $c$, the injected perturbation completely overwrites the original value. $m_{i,j,c} = 0$ means the original color is not modified at all. For our implementation, we limit each individual element to be either 0 or $\kappa$, where $\kappa \ll 1$. This choice of small values in $M$ is informed by trapdoor configuration experiments described in Section 5.1.
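
The mask-based injection can be illustrated with a short pure-Python sketch. It assumes images are stored as flat lists in height, width, channel order, and it places the mask in a bottom-right square; the function names, the storage layout, and the example values are assumptions for illustration only.

```python
def make_square_mask(height, width, channels, size, kappa):
    """Mask that is kappa inside a bottom-right square and 0 elsewhere."""
    mask = [0.0] * (height * width * channels)
    for i in range(height - size, height):
        for j in range(width - size, width):
            for c in range(channels):
                mask[(i * width + j) * channels + c] = kappa
    return mask

def inject_trapdoor(x, mask, pattern):
    """Blend the pattern into x: (1 - m) * x + m * pattern, element by element."""
    return [(1.0 - m) * xi + m * d for xi, m, d in zip(x, mask, pattern)]
```

With a small mask value, the injected image stays visually close to the original: a pixel of value `0.5` blended with a pattern value of `1.0` at mask value `0.1` becomes `0.55`, and pixels outside the square are unchanged.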

Note that there are numerous options in how we apply the trapdoor defense to a given model. First, our discussion considers the defense for a single specific label $y_t$. It is straightforward to extend this to defend multiple (or all) labels. Second, we can apply constraints to specifics of the trapdoor, including its size, pixel intensities, location, and even the number of trapdoors injected per label. We discuss and evaluate some possibilities in Section 5.

Step 2: Training the Trapdoor Model.    Next, we produce a trapdoored model $F_\theta$ by training on the new trapdoored dataset. Our goal is to build a model that not only has a high normal classification accuracy on clean images, but also classifies any image that contains a trapdoor ($\Delta$) to its trapdoored label $y_t$. This set of optimization objectives mirrors those proposed by (Gu et al., 2017) for injecting backdoors into neural networks:

$$\min_\theta \; \ell(F_\theta(x), y) + \ell(F_\theta(x + \Delta), y_t), \quad \forall x \in \mathcal{X} \text{ where } y \neq y_t$$

In our implementation, we use the cross entropy based loss function to measure errors in classification, and the Adam optimizer (Kingma and Ba, 2014) to solve the above optimization. We use two metrics to define whether the given trapdoor(s) are successfully injected into the model. The first is the normal classification accuracy, which measures the trapdoored model’s classification accuracy of normal inputs. Ideally this number should be no lower than that of the original model. The second is the trapdoor success rate, which computes the classification accuracy of any image perturbed by a trapdoor injected to the model.
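
The two injection metrics can be written down directly. The sketch below assumes a hypothetical `predict` callable that maps an input to a label and an `inject` callable that applies the trapdoor to an input; both are placeholders for whatever model and injection function are actually in use, and the input lists are assumed to be non-empty.

```python
def normal_accuracy(predict, inputs, labels):
    """Fraction of clean inputs classified to their true label."""
    correct = sum(1 for x, y in zip(inputs, labels) if predict(x) == y)
    return correct / len(inputs)

def trapdoor_success_rate(predict, inputs, labels, inject, target):
    """Fraction of trapdoored inputs classified to the target label.

    Inputs whose true label already equals the target are skipped,
    since they would count as successes without any trapdoor.
    """
    candidates = [x for x, y in zip(inputs, labels) if y != target]
    hits = sum(1 for x in candidates if predict(inject(x)) == target)
    return hits / len(candidates)
```

Skipping inputs that already belong to the target label is a design choice of this sketch; it keeps the success rate from being inflated by inputs that need no perturbation.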

After training the trapdoored model $F_\theta$, the model owner records the “neural signature” of trapdoor $\Delta$, and will use it to detect adversarial examples. Specifically, the model owner computes and records the intermediate neuron representation of hundreds of test inputs injected with $\Delta$.

Step 3: Detecting Adversarial Attacks.    Against an adversary targeting $y_t$, the trapdoor $\Delta$ forces the adversary to converge to adversarial perturbations very similar to $\Delta$. The presence of such perturbations in an input image can be detected by comparing the image’s neuron representation at the intermediate layer (i.e. the model layer right before softmax) to the neuron signature of $\Delta$ (discussed above). If their cosine similarity exceeds $\phi_t$, a predefined threshold for $y_t$, then the input image is flagged as adversarial. $\phi_t$ needs to be calibrated carefully to balance a low false positive rate against a high adversarial input detection rate. In our implementation, we configure $\phi_t$ by first computing the statistical distribution of the similarity between known benign images and those containing the trapdoor $\Delta$. We choose $\phi_t$ to be the $k$th percentile value, where $1 - \frac{k}{100}$ is the target false positive rate.
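
One way to realize the percentile-based calibration is sketched below in pure Python. It assumes the cosine similarities between benign inputs and the trapdoor signature have already been computed; the function name and the nearest-rank percentile rule are choices made for this example rather than details taken from the paper.

```python
import math

def calibrate_threshold(benign_similarities, target_fpr):
    """Pick a threshold so that at most target_fpr of benign inputs exceed it."""
    if not benign_similarities:
        raise ValueError("need at least one benign similarity")
    ordered = sorted(benign_similarities)
    # nearest-rank percentile: keep at least (1 - target_fpr) of the
    # benign similarities at or below the threshold
    k = max(1, math.ceil((1.0 - target_fpr) * len(ordered)))
    return ordered[min(k, len(ordered)) - 1]
```

Because an input is flagged only when its similarity strictly exceeds the threshold, the fraction of calibration inputs that would be flagged never exceeds `target_fpr`.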

| Task | Dataset | # of Labels | Input Size | # of Training Images | Model Architecture |
|---|---|---|---|---|---|
| Traffic Sign | GTSRB | 43 | | 35,288 | 6 Conv + 2 Dense |
| Image Recognition | CIFAR10 | 10 | | 50,000 | 20 Residual + 1 Dense |
| Face Recognition | YouTube Face | 1,283 | | 375,645 | 4 Conv + 1 Merge + 1 Dense |

Table 1. Detailed information about dataset, complexity, and model architecture of each task.

Figure 4. Comparison of CW adversarial perturbations computed on GTSRB, CIFAR10, and YouTube Face models with (left) and without (right) trapdoors. Attacks on the trapdoor and normal model are computed using the same parameters. Color of images is scaled to better visualize small perturbations and their differences. Injected trapdoors are scaled by , and attack perturbations are scaled by .

Figure 5. Comparison of cosine similarity of normal images and adversarial images to trapdoored inputs in a trapdoored model (left side of figures) and clean model (right side of figures). All inputs have the same label $y_t$, and all three models are trained with a trapdoor defending $y_t$.

4. Evaluation: Basic Trapdoor Design

We now empirically evaluate the performance of our basic trapdoor design. Our experiments will help answer the following questions:

  • Does the proposed trapdoor-enabled detection work for different attack methods?

  • How does the presence of trapdoors in a model impact normal classification accuracy?

  • What is an appropriate value for the threshold based on neuron signature similarity in flagging an input as adversarial?

Our experiments start from a controlled scenario where we protect a single random label in the model and then extend to cases where we defend all labels of the model.

4.1. Experiment Setup

Here we introduce our evaluation learning tasks and datasets, as well as the design of the trapdoors. Note that the proposed trapdoor-enabled detection is generalizable to other learning tasks; here we use classification as an example.

Dataset.    We experiment with three popular datasets for classification tasks: traffic sign recognition (GTSRB), image recognition (CIFAR10), and facial recognition (YouTube Face). We summarize them in Table 1.

  • Traffic Sign Recognition (GTSRB) – Here the goal is to recognize different traffic signs, simulating an application scenario in self-driving cars. We use the German Traffic Sign Benchmark dataset (GTSRB), which contains K colored training images and K testing images (Stallkamp et al., 2012). The (original) model consists of convolution layers and dense layers (listed in Table 7). We include this task because it 1) is commonly used as an adversarial defense evaluation benchmark and 2) represents a real-world setting relevant to our defense.

  • Image Recognition (CIFAR10) – The task is to recognize different objects. The dataset contains K colored training images and K testing images (Krizhevsky and Hinton, 2009). We apply the Residual Neural Network with residual blocks and dense layer (He et al., 2016) (Table 8). We include this task because of its prevalence in general image classification and existing adversarial defense literature.

  • Face Recognition (YouTube Face) – Here we aim to recognize faces of different people, drawn from the YouTube Face dataset (YouTube, [n. d.]). By applying preprocessing used in prior work, we build our dataset from (YouTube, [n. d.]) to include labels, K training images, and K testing images (Chen et al., 2017). We also follow prior work to choose the DeepID architecture (Chen et al., 2017; Sun et al., 2014) with layers (Table 9). We include this task because it simulates a more complex facial recognition-based security screening scenario, where defending against adversarial attacks is important. Furthermore, the large number of labels in this task allows us to explore the scalability of the trapdoor-enabled detection approach.

Adversarial Attack Configuration.    As discussed in Section 3, we evaluate the trapdoor-enabled detection using three existing adversarial attacks: CW, ElasticNet, and PGD. We follow these methods to generate targeted adversarial attacks against the trapdoored models on GTSRB, CIFAR10, and YouTube Face. More details about attack configuration can be found in Table 6 in the appendix. In the absence of our proposed detection process, all attacks against the trapdoored models achieve a success rate above , which is on par with that of attacks against the original models.

Configuration of the Trapdoor-Enabled Detection.    We build the trapdoored models on GTSRB, CIFAR10, and YouTube Face. When training these models, we configure the trapdoor(s) and model parameters to ensure that the resulting trapdoor success rate (i.e. the classification accuracy of any test instance containing a trapdoor to the target label) is above 99%.

4.2. Defending a Single Label

We start from the simplest scenario where we inject a trapdoor for a single (randomly chosen) label . For this we choose the trapdoor as a pixel square at the bottom right of the image. Images in the left column of Figure 4 show the trapdoor patterns successfully injected into the three original models.
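A minimal sketch of how such a square trapdoor might be blended into an image, assuming the mask-based injection form x' = (1 - m) * x + m * delta that is common in backdoor work. The function name `inject_square_trapdoor`, the 6-pixel side, and the 0.1 mask ratio are illustrative placeholders, not the settings used in our experiments.

```python
import numpy as np

def inject_square_trapdoor(image, side, mask_ratio, pattern_value=1.0):
    """Blend a square pattern into the bottom-right corner of an image.

    image: float array of shape (H, W, C) with values in [0, 1].
    side: side length of the square in pixels (illustrative value).
    mask_ratio: blending weight in [0, 1]; 0 leaves the image unchanged.
    """
    h, w = image.shape[:2]
    # The mask is zero everywhere except inside the bottom-right square.
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    mask[h - side:, w - side:, :] = mask_ratio
    pattern = np.full_like(image, pattern_value)
    # Assumed injection form: x' = (1 - m) * x + m * delta.
    return (1.0 - mask) * image + mask * pattern

# Toy usage on a black 32x32 RGB image.
image = np.zeros((32, 32, 3), dtype=np.float32)
trapdoored = inject_square_trapdoor(image, side=6, mask_ratio=0.1)
```

Pixels outside the square are untouched, so the same function can be applied to every training image that is relabeled to the protected label.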

Figure 6. Defending a single label against CW attack: average adversarial image detection success rate at different false positive rates.
Figure 7. Defending a single label against ElasticNet attack: average adversarial image detection success rate at different false positive rates.
Figure 8. Defending a single label against PGD attack: average adversarial image detection success rate at different false positive rates.

Comparing Trapdoor to Adversarial Perturbation.    As mentioned before, our proposed defense sets a trap that tricks an adversarial attack into generating a perturbation that may converge to , in terms of the neuron signature at the representation space. We verify this hypothesis in two formats: (1) visual comparison of the raw perturbation and ; and (2) comparison of cosine similarity between the neuron signature of to that of the input image .

Visual Comparison: Figure 4 shows the raw image-level perturbations generated by applying the three attacks (CW, ElasticNet, and PGD) against both the trapdoored model and the pristine models. For both CW and ElasticNet, the attack perturbations against the trapdoored models are very similar to the trapdoors, while the ones against the pristine models show large differences. For PGD, the attack perturbations differ from the trapdoors by occupying the entire image, and we hypothesize that this is due to the iterative process of PGD.

Cosine Similarity of Neuron Signature: We first look at the neuron similarity of trapdoored models. Figure 5 shows the quantile distribution of the cosine similarity of neural signatures between the trapdoor and a set of adversarial inputs against the trapdoored model. As reference we also show, for the same trapdoored model, the neural signature cosine similarity between the trapdoor and a set of benign images. Here we can see that the distribution of cosine similarities is significantly different for benign images and adversarial inputs. Using the cosine similarity distribution of the benign images, we can set the detection threshold to maximize the adversarial example detection rate at a given false positive rate.
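The thresholding step can be sketched as follows. This is an illustration on synthetic activation vectors, assuming the trapdoor signature is a single activation vector and that an input is flagged when its cosine similarity to that signature exceeds a threshold calibrated on benign inputs. The helper names (`cosine_similarity`, `threshold_at_fpr`, `detection_rate`) and the 5% target FPR are hypothetical.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between each row of `a` and the vector `b`."""
    a = np.atleast_2d(a)
    return (a @ b) / (np.linalg.norm(a, axis=1) * np.linalg.norm(b) + 1e-12)

def threshold_at_fpr(benign_sims, target_fpr):
    """Threshold that roughly a `target_fpr` fraction of benign inputs exceed."""
    return np.quantile(benign_sims, 1.0 - target_fpr)

def detection_rate(adv_sims, threshold):
    """Fraction of adversarial inputs whose similarity exceeds the threshold."""
    return float(np.mean(adv_sims > threshold))

# Synthetic stand-ins for neuron activations (not real model outputs).
rng = np.random.default_rng(0)
signature = rng.normal(size=64)                              # trapdoor signature
benign_acts = rng.normal(size=(1000, 64))                    # unrelated to signature
adv_acts = signature + 0.5 * rng.normal(size=(1000, 64))     # close to signature

benign_sims = cosine_similarity(benign_acts, signature)
adv_sims = cosine_similarity(adv_acts, signature)
threshold = threshold_at_fpr(benign_sims, target_fpr=0.05)
print(detection_rate(adv_sims, threshold))
```

Calibrating on benign inputs alone means the defender needs no adversarial examples to pick the threshold; only the desired false positive rate has to be chosen.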

Accuracy of Detecting Adversarial Inputs.    Figures 6-8 show the adversarial detection success rate at different false positive rates (FPR). The success rate is at an FPR of .

Discussion.    At the pixel level, small differences remain between the adversarial perturbations and the trapdoor perturbations (we will show that at the neuron distance level they are very close). Adversarial perturbations are either smaller and contain fewer pixels than the trapdoor pattern (e.g. CW and ElasticNet) or are more smoothly spread out (e.g. PGD). Three causes underlie these differences.

  1. Incomplete Learning. When we inject the trapdoor into a model, the model may not learn the exact shape and color of the original trapdoor. A variety of factors could cause this: an insufficient number of trapdoored training images, too few training epochs, or lack of neuron capacity. Whatever the case, this means that the model best recognizes a modified version of the original trapdoor. Adversarial perturbations on the trapdoored model, then, will converge to this modified version. This explains the discrepancies in size and shape between the adversarial perturbation and the original trapdoor.

  2. Attack penalty. The CW attack optimization objective penalizes larger perturbations. Therefore, some redundant pixels in the trapdoor will be pruned during the optimization process that produces the adversarial example, resulting in the observed missing pixels.

  3. Label specificity. The original injected trapdoor is designed to misclassify any image to its associated target label. However, an adversarial perturbation is crafted to misclassify a single image to the target label. As a result, a successful adversarial perturbation need not include all the characteristics of the original trapdoor. The global characteristics of the injected trapdoor mean it should be strictly stronger (i.e. have more pixels and cover more surface area) than the adversarial perturbation.

Combined, these three factors result in the attack optimization process finding a more compact form of the injected trapdoor, as compared to the original trapdoor.

4.3. Defending All Labels

The above evaluation on a single label can be extended to defending multiple or all labels of the model. Let represent the trapdoor for label . The corresponding optimization function used for training the trapdoored model is then,


As we inject more than one trapdoor into the model, some natural questions arise. We explore them below and present our detailed design.

Q1: More trapdoors, lower normal classification accuracy?    Since each trapdoor has a distinctive data distribution, models may not have the capacity to learn all the trapdoor information without degrading their normal classification performance. However, it has been shown that practical DNN models have a large number of neurons unused in normal classification tasks (Szegedy et al., 2014). This leaves sufficient capacity for learning to recognize many trapdoors, e.g. 1283 for YouTube Face.

Figure 9. Defending all labels against CW attack: average adversarial image detection success rate at different false positive rates.
Figure 10. Defending all labels against ElasticNet attack: average adversarial image detection success rate at different false positive rates.
Figure 11. Defending all labels against PGD attack: average adversarial image detection success rate at different false positive rates.

Q2: How to make trapdoors distinct for each label?    Trapdoors for different labels need to have distinct internal neuron representations, so that they serve as signatures to detect adversarial examples. To ensure trapdoor distinguishability, we construct each trapdoor as a randomly selected set of squares (each x pixels) scattered across the image. To further differentiate the trapdoors, the intensity of each x square is independently sampled from with and chosen separately for each trapdoor.
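One way to generate such per-label trapdoors is sketched below. The number of squares, square size, and mask ratio are placeholder values. We assume for illustration that intensities are drawn from a Gaussian whose mean and standard deviation are themselves drawn uniformly per trapdoor, and that one intensity is sampled per square and channel. `make_label_trapdoor` is a hypothetical helper.

```python
import numpy as np

def make_label_trapdoor(rng, image_shape, n_squares, side, mask_ratio):
    """Return (mask, pattern) for one label: random squares, random intensities."""
    h, w, c = image_shape
    mask = np.zeros((h, w, 1))
    pattern = np.zeros((h, w, c))
    # Assumption: per-trapdoor Gaussian parameters, drawn uniformly from [0, 1].
    mu = rng.uniform(0.0, 1.0)
    sigma = rng.uniform(0.0, 1.0)
    for _ in range(n_squares):
        top = rng.integers(0, h - side + 1)
        left = rng.integers(0, w - side + 1)
        mask[top:top + side, left:left + side, :] = mask_ratio
        # One intensity per channel for this square, clipped to the valid range.
        intensity = np.clip(rng.normal(mu, sigma, size=c), 0.0, 1.0)
        pattern[top:top + side, left:left + side, :] = intensity
    return mask, pattern

rng = np.random.default_rng(0)
# One trapdoor per label for a toy 10-class, 32x32 RGB task.
trapdoors = {
    label: make_label_trapdoor(rng, (32, 32, 3), n_squares=5, side=3, mask_ratio=0.1)
    for label in range(10)
}

# Applying the trapdoor for label 3 to a toy image.
x = np.full((32, 32, 3), 0.5)
mask, pattern = trapdoors[3]
x_trapdoored = (1.0 - mask) * x + mask * pattern
```

Squares may overlap in this sketch; a later square simply overwrites the earlier one in the overlapping region.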

Q3: Impact on model training time.    Adding the extra trapdoor information to the model may require more training epochs before the model converges. However, we observe that training an all-label defended model requires only slightly more training time than the single-label defended model. For YouTube Face and GTSRB, the normal models converged after epochs, and the all-label trapdoored models converged after epochs. Therefore, the overhead of defense is only around of the original training time at most. For CIFAR10, the trapdoored model converges in the same number of training epochs as the clean model.

With these considerations in mind, we trained GTSRB, CIFAR10, and YouTube Face models with a trapdoor for every label. We use the same metrics as in Section 4.1 to evaluate model performance, namely trapdoored model classification accuracy, trapdoor detection success rate, and clean model classification accuracy. We found that the all-label trapdoored model’s accuracy on normal inputs drops by at most when compared to a clean model. Furthermore, the average trapdoor detection success rate is even after we inject trapdoors. Table 2 summarizes these results.

Task Trapdoored Model (All Label Defense)
Normal Model
Trapdoor Success
YouTube Face
Table 2. Trapdoor success rate and normal image classification accuracy when injecting trapdoors for all labels.

Evaluation.    We achieve high detection performance for all three models, obtaining detection success rate at an FPR of with for CIFAR10 and GTSRB and for YouTube Face. The adversarial example detection results of the GTSRB, CIFAR10, and YouTube Face models under the CW, ElasticNet, and PGD attacks are shown in Figure 9, Figure 10, and Figure 11, respectively.

For all three attacks, our detection method performs the worst on the YouTube Face model. We believe the low defense performance is because we must inject trapdoors (one for each label) for the YouTube Face model to ensure all labels are defended. This large number of trapdoors makes it harder to construct trapdoors distinct both from each other and from all clean inputs. The average cosine similarity between clean input neuron signatures and trapdoored input neuron signatures increases from in single label case to in all label case, indicating that it is more difficult to distinguish between these two categories. Our detection method relies on how closely the neuron signature of an input matches that of trapdoored inputs, as measured by cosine similarity, to separate adversarial from benign instances. The increase in cosine similarity between clean and trapdoored inputs therefore makes inputs that converge to trapdoors more difficult to detect. We use a larger threshold for the YouTube Face model to account for this change, but a larger threshold results in a larger false negative rate.

Summary of Observations.    Once again, we note that we have successfully answered the questions posed at the beginning of Section 4. For the all label defense, we know that the trapdoor defense works well across a variety of models and adversarial attack methods, that the presence of even a large number of trapdoors does not degrade normal model classification performance, and that (for CIFAR10 and GTSRB) and (for YouTube Face) maximizes detection success while minimizing false positives.

5. Exploring Trapdoor Properties

We believe that injection of trapdoors may serve as a versatile tool to proactively alter the classification manifold of DNNs. In this section, we explore the properties of trapdoors along several dimensions and resulting impact on adversary detection rate, including their norm (effectively their magnitude of perturbation), their location in the image, and the impact of multiple trapdoors per label.

5.1. Norm of the Trapdoor

The magnitude of the trapdoor forces a tradeoff between ease of injection and likelihood of “trapping” adversaries. On one hand, we wish to inject trapdoors that are as small as possible: because the attacker’s objective is to minimize the attack perturbation, the attack converges more easily to small trapdoors. On the other hand, if the trapdoor is too small, it is hard for the model to learn the trapdoor distribution during injection. Here, we empirically examine this tradeoff in the context of defending a single label in the GTSRB model. We study two ways of changing the norm of the trapdoor, namely the size and the mask ratio of the trapdoor.

Size of the Trapdoor.    We defend several GTSRB models by injecting trapdoors of different sizes. The trapdoors are all squares at the bottom right corner of the image, with a mask ratio of . We follow the same defense and detection methodology as in Section 4.1. To change the size of the injected trapdoor, we increase the length of each side of the trapdoors from pixels to pixels (entire image). Figure 12 shows the detection success rate observed while defending models with different trapdoors and maintaining FPR at . When the size of the trapdoor is less than x , the model fails to learn the trapdoor distribution, and detection fails completely. In contrast, when the length of each side increases to , the trapdoor is too large to trap the attacker. However, there is a large middle ground between side length of to , where the size of the trapdoor does not have a significant impact on the detection success rate. Detection success rate remains above for all attacks in the middle region.

Mask Ratio.    Similarly, we study the impact of the mask ratio on our defense. Instead of changing the trapdoor size, we fix the shape as a by square at the bottom right corner. To change the norm of the injected trapdoor, we increase the mask ratio from to . The mask ratio strongly affects the color intensity of the trapdoor, as pixel intensity is bounded by and . Trapdoor pixel intensity directly correlates with the amount of perturbation added to the original image, as we presented in Equation 3.

We test the defensive effectiveness of trapdoors with different mask ratios on several GTSRB models following the same setup as in Section 4.1. Figure 13 shows the detection rate at FPR for each model tested against our three attacks. As expected, the detection success rate drops as the mask ratio of the trapdoor increases. However, for reasonably small mask ratios (), the detection success rate is high at FPR.

5.2. Location of the Trapdoor

The location of the trapdoor could potentially impact our defense performance as well. For example, injecting trapdoors at locations that are critical to classification of normal inputs (i.e. at the center or the corner of a traffic sign) may impact the model differently compared to trapdoors elsewhere. To study the impact of locations, we randomly select locations on the image to inject a x square as the trapdoor. We defended GTSRB models against attacks, each with a single trapdoor at the chosen location. We follow the same defense and detection methodology as in Section 4.1. Figure 14 shows the detection success rate distribution at FPR of each model. All of the detection success rates are above . This indicates that the location of the trapdoor does not seem to have a significant influence on the detection success rate.

Figure 12. Detection success rate of CW, PGD and ElasticNet at FPR when increasing the size of the injected trapdoor in GTSRB
Figure 13. Detection success rate of CW, PGD and ElasticNet at FPR when increasing the mask ratio of the injected trapdoor in GTSRB
Figure 14. Detection success rate distribution of CW, PGD and ElasticNet at FPR when injecting trapdoors at different locations.

5.3. Multiple Trapdoors

Adding a single trapdoor can potentially provide a unique local “trap” to the attacker. A natural question is whether adding multiple trapdoors to the same label can achieve higher detection rates of adversarial inputs. The trapdoor set for target label is a set of trapdoors, such that adding any one of to any arbitrary input, regardless of the input’s original class, will make the trapdoored model classify the input to the target label . This is formally defined as:

Definition.

The trapdoor set of target label , is a set that consists trapdoors (), such that , , .
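Read operationally, the definition can be checked empirically on a sample of inputs. The sketch below uses a toy stand-in classifier and toy injection functions (`toy_predict`, `stamp`, and `is_trapdoor_set` are all hypothetical names). On a finite sample this is only an approximation of the "for all inputs" condition.

```python
import numpy as np

def is_trapdoor_set(predict, inputs, inject_fns, target_label):
    """Empirical check: every trapdoor sends every sampled input to the target."""
    return all(
        bool(np.all(predict(inject(inputs)) == target_label))
        for inject in inject_fns
    )

def toy_predict(batch):
    # Hypothetical stand-in model: label 7 if any pixel is saturated, else 0.
    flat = batch.reshape(len(batch), -1)
    return np.where(flat.max(axis=1) >= 1.0, 7, 0)

def stamp(row, col):
    """Toy trapdoor: saturate one pixel at (row, col)."""
    def inject(batch):
        out = batch.copy()
        out[:, row, col] = 1.0
        return out
    return inject

# Toy inputs with all pixels below 0.5, so none is classified as 7 by default.
inputs = np.random.default_rng(0).uniform(0.0, 0.5, size=(16, 8, 8))
trapdoor_set = [stamp(0, 0), stamp(7, 7), stamp(3, 4)]
print(is_trapdoor_set(toy_predict, inputs, trapdoor_set, target_label=7))
```

In practice one would relax the strict "all inputs" test to a success-rate threshold over held-out data, in the same spirit as the trapdoor success rate used in Section 4.1.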

Adding more trapdoors provides more local optima for attackers to fall into, and could thus increase the detection success rate. However, two other factors come into play, and push the net impact of additional trapdoors into the negative range.

  1. High False Positive Rate. With multiple trapdoors, normal images’ neuron activations are more likely to be close to one of the selected trapdoors by chance. Thus, multiple trapdoors are likely to increase the detection false positive rate.

  2. Mixed effects among trapdoors. In a model trained with multiple trapdoors, an adversarial image could converge to a single trapdoor, a set of trapdoors, or a combination of parts of several trapdoors. This potentially makes it harder for the detection algorithm to match neuron signatures when comparing against one specific trapdoor.
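The first factor can be illustrated on synthetic data. The sketch assumes, purely for illustration, that an input is flagged when its neuron signature is close to any of the trapdoor signatures, with a threshold calibrated on a single trapdoor at a 5% FPR. All vectors are random stand-ins rather than real activations.

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_benign = 64, 2000
benign = rng.normal(size=(n_benign, dim))     # synthetic benign activations
signatures = rng.normal(size=(8, dim))        # synthetic trapdoor signatures

def cosine_matrix(a, b):
    """Pairwise cosine similarity between rows of `a` and rows of `b`."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

sims = cosine_matrix(benign, signatures)      # shape (n_benign, 8)
threshold = np.quantile(sims[:, 0], 0.95)     # calibrated on one trapdoor only

# Benign flag rate when matching against the first k signatures.
trapdoor_counts = (1, 2, 4, 8)
fprs = [
    float((sims[:, :k].max(axis=1) > threshold).mean())
    for k in trapdoor_counts
]
for k, fpr in zip(trapdoor_counts, fprs):
    print(k, fpr)
```

Because the flagged set for k trapdoors contains the flagged set for any smaller k, the benign flag rate under this rule can only grow as trapdoors are added, unless the threshold is raised to compensate.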

To examine the influence of these factors, we evaluate the effectiveness of introducing more than one trapdoor in our models. We look into two ways of injecting multiple trapdoors: multiple locations and multiple intensities.

Locations of Multiple Trapdoors    To create the trapdoor set, we vary the locations of trapdoors but keep intensity and shape fixed. We randomly choose a set of locations on the image, and create the same trapdoor ( by square) at each location. The set of trapdoors is the trapdoor set of a targeted label. We trained several GTSRB models, varying from to . Figure 15 shows the detection success rate of each model at FPR. In all cases, detection success rate is at FPR. From Figure 15, multiple location trapdoors do not seem to have a significant impact on our detection performance.

Intensities of Multiple Trapdoors    Similar to the single trapdoor analysis, we explore trapdoors with fixed locations but different intensities. We fix the location of a trapdoor at the bottom right corner of each trapdoor-injected image, with the shape as a by square. Different trapdoored images contain a trapdoor with different intensities. We sample the pixel intensities of the trapdoors from a random uniform distribution for times. The set of trapdoors is a trapdoor set for a targeted label. We vary from to . Figure 16 shows the detection success rate at FPR. In all cases, the detection success rate is . From Figure 16, multiple intensity trapdoors also do not seem to have a significant impact on our detection performance, which gives defenders considerable freedom.

Figure 15. Detection success rate of CW, PGD and ElasticNet at FPR when injecting a number of trapdoors with different locations in GTSRB
Figure 16. Detection success rate of CW, PGD and ElasticNet at FPR when injecting a number of trapdoors with different intensity in GTSRB

6. Countermeasures

In this section, we consider countermeasures against the trapdoor-enabled detection. We seek to understand how an adversary could discover trapdoors and how easy it would be for them to adaptively avoid them. For these experiments, we generated adversarial perturbations using the CW attack on three versions of the GTSRB model: a clean version (without trapdoors), a version with a single trapdoored label, and a multiple trapdoor version in which every label is defended by a trapdoor.

6.1. Detecting Trapdoors

The most basic countermeasure for attackers is to detect the presence of trapdoors in a model. As discussed before, we are not yet aware of robust backdoor (one type of poisoning attack) detection tools that would be successful against trapdoors. The neural cleanse solution (Wang et al., 2019) can be circumvented by introducing multiple trapdoors that eliminate the variance in cross-label perturbation distances.

Here we consider two approaches, both assume that the adversary has full access to the model and is capable of analyzing adversarial perturbations generated against the model. First, an attacker can visually inspect adversarial perturbations to see if they converge to a consistent, trapdoor-like pattern. Second, the attacker can measure the cosine similarity between adversarial perturbations generated on the suspect model and adversarial perturbations generated on a known clean model trained for the same task.

Visual Inspection.    If a trapdoor is not stealthily designed, adversarial perturbations generated on a trapdoored model will be visually distinct. Figure 4 illustrates the marked differences between CW adversarial perturbations generated on a trapdoored model versus “natural” adversarial perturbations computed on a normal model. An adversary might rely on observations of such differences as a way to reason about the possibility of a trapdoor in the model.

We note that an adversary must meet two requirements before this visual inspection is possible. First, the adversary must have a clean model identical to the trapdoored model. This is possible if the adversary has the full set of original (non-trapdoored) training data and configurations of the training process. Second, the adversary must be able to perform a large number of model inference queries on adversarial examples on both the clean and trapdoored models. If these conditions are met, an adversary could assert with some confidence that a model is protected by one or more trapdoors.

Cosine Similarity.    An adversary could also detect a trapdoor by measuring the cosine similarity of adversarial perturbations generated on the target model and adversarial perturbations generated for a known clean model. The presence of the trapdoor would make the adversarial perturbations on the trapdoored model markedly different from perturbations on a normal model.

Model Type Self-similarity Clean model similarity
Single Label Trapdoor
All Label Trapdoor
Table 3. Comparison of CW attack perturbation similarity for adversarial examples on label 28 of clean, single label trapdoor, and all label trapdoor versions of the GTSRB model. Self-similarity measures the average cosine similarity of perturbations to other perturbations generated on the same model, while clean model similarity measures the average cosine similarity of those perturbations to perturbations generated on the clean model.

We use cosine similarity to measure differences in CW attack perturbations generated on the three GTSRB models discussed previously. Self-similarity, the average cosine similarity among adversarial perturbations generated on the same model, remains consistent across all models. However, adversarial perturbations generated on trapdoored models are significantly different from perturbations generated on the clean model. This mathematical dissimilarity could alert an adversary that a label is trapdoored. The results from this analysis are presented in Table 3.
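The two quantities reported in Table 3 can be computed along the following lines. This is a sketch assuming perturbations are stored as arrays with one perturbation per row (or per leading index); `self_similarity` and `cross_similarity` are illustrative names, and the tiny arrays are toy data.

```python
import numpy as np

def _unit_rows(x):
    """Flatten each perturbation and scale it to unit L2 norm."""
    x = np.asarray(x, dtype=float).reshape(len(x), -1)
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def self_similarity(perturbations):
    """Mean cosine similarity over all distinct pairs within one set."""
    u = _unit_rows(perturbations)
    gram = u @ u.T
    n = len(u)
    # Exclude the diagonal, where each perturbation is compared with itself.
    return float((gram.sum() - np.trace(gram)) / (n * (n - 1)))

def cross_similarity(perturbations_a, perturbations_b):
    """Mean cosine similarity over all pairs drawn from two sets."""
    return float((_unit_rows(perturbations_a) @ _unit_rows(perturbations_b).T).mean())

# Toy data: two parallel perturbations, and two that are orthogonal to them.
suspect = np.array([[1.0, 0.0], [2.0, 0.0]])
clean = np.array([[0.0, 3.0], [0.0, 1.0]])
print(self_similarity(suspect), cross_similarity(suspect, clean))
```

A high self-similarity together with a low clean model similarity is the pattern an adversary would look for as a hint that the suspect model has been trapdoored.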

Whether through visual inspection or similarity measures, detecting a trapdoor with high confidence is non-trivial. Once an adversary is sufficiently confident, they might take steps to circumvent potential trapdoors in the target model. We consider circumvention techniques next.

6.2. Bypassing Trapdoors

We now explore steps an adversary could take to circumvent a trapdoor without knowing its shape. The first, most obvious method would be removing the trapdoor. Prior work (Wang et al., 2019) has demonstrated effective methods for removing a backdoor from a neural network. However, this method relies on the distinctness of trapdoor-specific neurons in the model in order to detect and remove them. Particularly in our all-label defense, we insert at least one trapdoor per pair of output labels in the model, such that neural cleanse will find no obvious anomalies.

Learning Rate Attack Success Rate Detection Success Rate
Table 4. Comparison of learning rate, adversarial attack success rate (before detection algorithm deployed), and adversarial example detection success rate on GTSRB single label model with one trapdoor per label.

We explore ways adversaries can modify their attack to increase their chances of converging to a non-trapdoor adversarial perturbation. We use the CW attack, and explore the space of numerous attack parameters, including the confidence of adversarial example misclassification and the number of iterations for the attack generation algorithm. We found that decreasing the learning rate of the attack is the only approach that yields a meaningful result, i.e. produces some scenarios where the detection rate drops below 85%. A small learning rate means the adversary moves more cautiously through the loss landscape when generating the adversarial perturbation. As a result, the adversary may find a perturbation associated with a small local optimum that is not the trapdoor. Table 4 shows that the defense becomes less successful as the learning rate decreases, indicating that this adjustment does help an adversary avoid a trapdoor, at the price of higher computation cost and (at some point) lower attack success rate.

Using a simplified intuition, we expect the impact of the trapdoors to be similar to that shown in Figure 3. Injecting a trapdoor has the net effect of producing a “trap” in the classification manifold, one that will produce a deterministic perturbation that is close to the pre-embedded trapdoors in terms of the neuron distance. An adversary with a very low learning rate has a higher probability of converging to a different local optimum that produces a successful misclassification while remaining distinct from the trapdoors ( and in Figure 3). Note that too small a learning rate also carries the risk that the search process will be trapped in a bad local optimum and fail to find an effective adversarial perturbation.
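This intuition can be illustrated with a one-dimensional cartoon. The toy loss below is our own construction, not the actual attack loss: it has a narrow, shallow basin at x = 1 (standing in for a natural adversarial optimum) and a wide, deep basin at x = 5 (standing in for the trapdoor). Plain gradient descent is used in place of the attack optimizer.

```python
def loss_grad(x):
    """Gradient of a 1-D toy loss defined as the minimum of two parabolas."""
    narrow = 4.0 * (x - 1.0) ** 2 + 0.5   # narrow, shallow basin at x = 1
    wide = (x - 5.0) ** 2                 # wide, deep basin at x = 5
    # Follow the gradient of whichever parabola is currently the lower one.
    return 8.0 * (x - 1.0) if narrow < wide else 2.0 * (x - 5.0)

def descend(learning_rate, start=0.0, steps=200):
    """Run plain gradient descent from `start` and return the final point."""
    x = start
    for _ in range(steps):
        x -= learning_rate * loss_grad(x)
    return x

print(descend(0.45))  # large steps overshoot the narrow basin and settle near 5
print(descend(0.05))  # small steps settle in the narrow basin near 1
```

In this cartoon the large learning rate jumps over the narrow basin and is captured by the wide one, while the small learning rate stops at the first optimum it reaches, which mirrors the tradeoff described above.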

There are two possible mitigations against this low learning rate attack. First, it is possible that the trapdoor does not have sufficient coverage in certain regions of the manifold, allowing some adversarial examples to easily jump out of the pre-embedded “traps”. Therefore, controlled and deterministic placement of trapdoors might improve coverage and make local adversarial examples harder to locate (our earlier attempts at multiple backdoors are randomized). This requires careful determination of the effective properties of the trapdoors. Second, more training samples per trapdoor might increase “confidence,” effectively making the trapdoor region deeper. Intuitively, this might reduce the magnitude of natural adversarial examples or wipe them out altogether. However, it will likely have the side effect of increasing false positives, i.e. marking normal inputs as adversarial inputs. We are actively investigating both approaches and their possible impact on small learning rate attacks.

7. Conclusion and Future Work

This paper introduces adversarial trapdoors and explores their effectiveness as an adversarial instance detection method. Our proposed method uses a setting similar to backdoor (a.k.a. Trojan) attacks to introduce controlled vulnerabilities (traps) into the model. These trapdoors can be injected into the model to defend all labels or particular labels of interest. For multiple application domains, a trapdoor-based defense has high detection success against adversarial examples generated by the CW, ElasticNet, and PGD attacks, with negligible impact on classification accuracy of normal inputs.

We include discussions and initial experiments on possible countermeasures by attackers against trapdoors. We describe the learning rate attack which can decrease detection rate moderately at the cost of significant computation. In ongoing work, we are actively experimenting with possible mitigation techniques, and exploring potentially broader applications of trapdoors.


References
  • Athalye et al. (2018) Anish Athalye, Nicholas Carlini, and David Wagner. 2018. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proc. of ICML.
  • Buckman et al. (2018) J. Buckman, A. Roy, C. Raffel, and I. Goodfellow. 2018. Thermometer encoding: One hot way to resist adversarial examples. In Proc. of ICLR.
  • Carlini and Wagner (2016) Nicholas Carlini and David Wagner. 2016. Defensive distillation is not robust to adversarial examples. arXiv preprint arXiv:1607.04311 (2016).
  • Carlini and Wagner (2017a) Nicholas Carlini and David Wagner. 2017a. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proc. of ACM Workshop on Artificial Intelligence and Security (AISec).
  • Carlini and Wagner (2017b) Nicholas Carlini and David Wagner. 2017b. Magnet and efficient defenses against adversarial attacks are not robust to adversarial examples. arXiv preprint arXiv:1711.08478 (2017).
  • Carlini and Wagner (2017c) Nicholas Carlini and David Wagner. 2017c. Towards evaluating the robustness of neural networks. In Proc. of IEEE S&P.
  • Chen et al. (2018) Pin-Yu Chen, Yash Sharma, Huan Zhang, Jinfeng Yi, and Cho-Jui Hsieh. 2018. EAD: elastic-net attacks to deep neural networks via adversarial examples. In Proc. of AAAI.
  • Chen et al. (2017) Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. 2017. Targeted Backdoor Attacks on Deep Learning Systems Using Data Poisoning. arXiv preprint arXiv:1712.05526 (2017).
  • Clements and Lao (2018) Joseph Clements and Yingjie Lao. 2018. Hardware Trojan Attacks on Neural Networks. arXiv preprint arXiv:1806.05768 (2018).
  • Dhillon et al. (2018) G. S. Dhillon, K. Azizzadenesheli, J. D. Bernstein, J. Kossaifi, A. Khanna, Z. C. Lipton, and A. Anandkumar. 2018. Stochastic activation pruning for robust adversarial defense. In Proc. of ICLR.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572 (2014).
  • Gu et al. (2017) Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. 2017. Badnets: Identifying vulnerabilities in the machine learning model supply chain. In Proc. of Machine Learning and Computer Security Workshop.
  • Guo et al. (2018) C. Guo, M. Rana, M. Cisse, and L. van der Maaten. 2018. Countering adversarial images using input transformations. In Proc. of ICLR.
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proc. of CVPR.
  • He et al. (2017) Warren He, James Wei, Xinyun Chen, Nicholas Carlini, and Dawn Song. 2017. Adversarial example defenses: Ensembles of weak defenses are not strong. In Proc. of WOOT.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
  • Krizhevsky and Hinton (2009) Alex Krizhevsky and Geoffrey Hinton. 2009. Learning multiple layers of features from tiny images. Technical Report.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016).
  • Kurakin et al. (2017) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2017. Adversarial machine learning at scale. In Proc. of ICLR.
  • Liu et al. (2018) Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning Attack on Neural Networks. In Proc. of NDSS.
  • Liu et al. (2017) Yuntao Liu, Yang Xie, and Ankur Srivastava. 2017. Neural trojans. In Proc. of ICCD.
  • Ma et al. (2018) Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. 2018. Characterizing adversarial subspaces using local intrinsic dimensionality. In Proc. of ICLR.
  • Madry et al. (2018) Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. 2018. Towards deep learning models resistant to adversarial attacks. In Proc. of ICLR.
  • Meng and Chen (2017) Dongyu Meng and Hao Chen. 2017. Magnet: a two-pronged defense against adversarial examples. In Proc. of CCS.
  • Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z. Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proc. of AsiaCCS.
  • Papernot et al. (2016) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016. Distillation as a defense to adversarial perturbations against deep neural networks. In Proc. of IEEE S&P.
  • Samangouei et al. (2018) P. Samangouei, M. Kabkab, and R. Chellappa. 2018. Defensegan: Protecting classifiers against adversarial attacks using generative models. In Proc. of ICLR.
  • Sharif et al. (2016) Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. 2016. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proc. of CCS.
  • Song et al. (2018) Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. 2018. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In Proc. of ICLR.
  • Stallkamp et al. (2012) J. Stallkamp, M. Schlipsing, J. Salmen, and C. Igel. 2012. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks (2012).
  • Sun et al. (2014) Yi Sun, Xiaogang Wang, and Xiaoou Tang. 2014. Deep learning face representation from predicting 10,000 classes. In Proc. of CVPR.
  • Szegedy et al. (2014) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In Proc. of ICLR.
  • Wang et al. (2019) Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y. Zhao. 2019. Neural Cleanse: Identifying and Mitigating Backdoor Attacks in Neural Networks. In Proc. of IEEE S&P.
  • Xie et al. (2018) C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille. 2018. Mitigating adversarial effects through randomization. In Proc. of ICLR.
  • Xu et al. (2018) Weilin Xu, David Evans, and Yanjun Qi. 2018. Feature squeezing: Detecting adversarial examples in deep neural networks. In Proc. of NDSS.
  • YouTube ([n. d.]) YouTube. [n. d.]. YouTube Faces DB.
  • Zantedeschi et al. (2017) Valentina Zantedeschi, Maria-Irina Nicolae, and Ambrish Rawat. 2017. Efficient defenses against adversarial attacks. In Proc. of ACM Workshop on Artificial Intelligence and Security (AISec).
  • Zheng et al. (2016) Stephan Zheng, Yang Song, Thomas Leung, and Ian Goodfellow. 2016. Improving the robustness of deep neural networks via stability training. In Proc. of CVPR.

Appendix A Appendix

| Task / Dataset | # of Labels | Training Set Size | Test Set Size | Training Configuration |
|---|---|---|---|---|
| GTSRB | 43 | 35,288 | 12,630 | inject ratio=0.1, epochs=20, batch=32, optimizer=Adam, lr=0.005 |
| CIFAR10 | 10 | 50,000 | 10,000 | inject ratio=0.1, epochs=60, batch=32, optimizer=Adam, lr=0.005 |
| YouTube Face | 1,283 | 375,645 | 64,150 | inject ratio=0.1, epochs=20, batch=32, optimizer=Adadelta, lr=1 |

Table 5. Detailed information about the datasets and training configurations for each BadNets model.
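As a reading aid, the configurations in Table 5 can be written down as plain data and used to derive the number of optimizer updates each run implies. This is a minimal sketch, not the authors' training code: the dictionary layout and the helper names `steps_per_epoch` and `total_steps` are hypothetical, the two size columns are assumed to be training and test set sizes, and the step counts assume one update per batch over the clean training set only (any trapdoor-injected samples added on top would change them).

```python
import math

# Values copied from Table 5; key names are illustrative assumptions.
TRAIN_CONFIGS = {
    "GTSRB": {
        "labels": 43, "train_size": 35_288, "test_size": 12_630,
        "inject_ratio": 0.1, "epochs": 20, "batch": 32,
        "optimizer": "Adam", "lr": 0.005,
    },
    "CIFAR10": {
        "labels": 10, "train_size": 50_000, "test_size": 10_000,
        "inject_ratio": 0.1, "epochs": 60, "batch": 32,
        "optimizer": "Adam", "lr": 0.005,
    },
    "YouTube Face": {
        "labels": 1_283, "train_size": 375_645, "test_size": 64_150,
        "inject_ratio": 0.1, "epochs": 20, "batch": 32,
        "optimizer": "Adadelta", "lr": 1.0,
    },
}


def steps_per_epoch(cfg):
    """Batches needed to cover the training set once (last batch may be partial)."""
    return math.ceil(cfg["train_size"] / cfg["batch"])


def total_steps(cfg):
    """Optimizer updates over the whole run, assuming one update per batch."""
    return steps_per_epoch(cfg) * cfg["epochs"]


if __name__ == "__main__":
    for name, cfg in TRAIN_CONFIGS.items():
        print(name, steps_per_epoch(cfg), total_steps(cfg))
```

Under these assumptions the CIFAR10 run performs the most updates per sample, since it trains for 60 epochs rather than 20.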

| Attack Method | Attack Configuration |
|---|---|
| CW | binary step size=20, max iterations=500, lr=0.1, abort early=False |
| ElasticNet | binary step size=10, max iterations=500, lr=0.5, abort early=False |
| PGD | eps=5, # of iterations=10, eps of each iteration=0.5 |

Table 6. Detailed information about attack configurations.
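To make the PGD row of Table 6 concrete, the sketch below runs a generic L-infinity projected gradient loop with eps=5, 10 iterations, and a per-iteration step of 0.5. It is not the authors' attack code. The function name `pgd_linf`, the toy linear loss, and the 0 to 255 pixel range are assumptions for illustration (the table does not state the pixel scale), and the loop ascends an untargeted loss rather than a targeted one.

```python
def pgd_linf(x, grad_fn, eps=5.0, step=0.5, iterations=10, lo=0.0, hi=255.0):
    """Generic L-infinity PGD over a flat list of pixel values.

    grad_fn(adv) must return the gradient of the loss to maximize,
    as a list with the same length as adv.
    """
    adv = list(x)
    for _ in range(iterations):
        grad = grad_fn(adv)
        for i, g in enumerate(grad):
            sign = (g > 0) - (g < 0)          # signed-gradient step direction
            moved = adv[i] + step * sign
            # Project back into the eps-ball around the original pixel.
            moved = min(max(moved, x[i] - eps), x[i] + eps)
            # Clip to the assumed valid pixel range.
            adv[i] = min(max(moved, lo), hi)
    return adv


if __name__ == "__main__":
    # Toy loss L(x) = w . x, whose gradient is the constant vector w.
    w = [1.0, -2.0, 0.0]
    x0 = [100.0, 100.0, 100.0]
    print(pgd_linf(x0, lambda adv: w))
```

With these settings the step size times the iteration count equals eps (0.5 × 10 = 5), so a pixel whose gradient sign never flips reaches the edge of the eps-ball exactly at the last iteration, and the projection only becomes active if more iterations are run.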
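| Layer Type | # of Channels | Filter Size | Stride | Activation |
|---|---|---|---|---|
| Conv | 32 | 3×3 | 1 | ReLU |
| Conv | 32 | 3×3 | 1 | ReLU |
| MaxPool | 32 | 2×2 | 2 | - |
| Conv | 64 | 3×3 | 1 | ReLU |
| Conv | 64 | 3×3 | 1 | ReLU |
| MaxPool | 64 | 2×2 | 2 | - |
| Conv | 128 | 3×3 | 1 | ReLU |
| Conv | 128 | 3×3 | 1 | ReLU |
| MaxPool | 128 | 2×2 | 2 | - |
| FC | 512 | - | - | ReLU |
| FC | 43 | - | - | Softmax |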
Table 7. Model Architecture for GTSRB.
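The convolutional stack in Table 7 is small enough that its weight count can be checked by hand. The following sketch does that arithmetic under two assumptions that the table does not state: the input has 3 (RGB) channels, and every layer carries a bias term. The names `GTSRB_CONV_STACK` and `conv_param_counts` are hypothetical. The first FC layer is left out because its size depends on the input resolution and padding mode, neither of which appears in the table.

```python
# (layer kind, output channels, square kernel size), following Table 7.
GTSRB_CONV_STACK = [
    ("conv", 32, 3), ("conv", 32, 3), ("pool", 32, 2),
    ("conv", 64, 3), ("conv", 64, 3), ("pool", 64, 2),
    ("conv", 128, 3), ("conv", 128, 3), ("pool", 128, 2),
]


def conv_param_counts(layers, in_channels=3):
    """Weights plus biases for each conv layer; pooling has no parameters."""
    counts = []
    channels = in_channels
    for kind, out_channels, kernel in layers:
        if kind == "conv":
            weights = kernel * kernel * channels * out_channels
            counts.append(weights + out_channels)  # one bias per output channel
            channels = out_channels
    return counts


def dense_params(in_units, out_units):
    """Weights plus biases of a fully connected layer."""
    return in_units * out_units + out_units


if __name__ == "__main__":
    per_layer = conv_param_counts(GTSRB_CONV_STACK)
    print(per_layer, sum(per_layer))
    print(dense_params(512, 43))  # final 43-way softmax layer
```

Under these assumptions the six convolutional layers hold 287,008 parameters in total, and the final 512-to-43 classification layer holds 22,059.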
| Layer Name (Type) | # of Channels | Activation | Connected to |
|---|---|---|---|
| conv_1 (Conv) | 16 | ReLU | - |
| conv_2 (Conv) | 16 | ReLU | conv_1 |
| conv_3 (Conv) | 16 | ReLU | pool_2 |
| conv_4 (Conv) | 16 | ReLU | conv_3 |
| conv_5 (Conv) | 16 | ReLU | conv_4 |
| conv_6 (Conv) | 16 | ReLU | conv_5 |
| conv_7 (Conv) | 16 | ReLU | conv_6 |
| conv_8 (Conv) | 32 | ReLU | conv_7 |
| conv_9 (Conv) | 32 | ReLU | conv_8 |
| conv_10 (Conv) | 32 | ReLU | conv_9 |
| conv_11 (Conv) | 32 | ReLU | conv_10 |
| conv_12 (Conv) | 32 | ReLU | conv_11 |
| conv_13 (Conv) | 32 | ReLU | conv_12 |
| conv_14 (Conv) | 32 | ReLU | conv_13 |
| conv_15 (Conv) | 64 | ReLU | conv_14 |
| conv_16 (Conv) | 64 | ReLU | conv_15 |
| conv_17 (Conv) | 64 | ReLU | conv_16 |
| conv_18 (Conv) | 64 | ReLU | conv_17 |
| conv_19 (Conv) | 64 | ReLU | conv_18 |
| conv_20 (Conv) | 64 | ReLU | conv_19 |
| conv_21 (Conv) | 64 | ReLU | conv_20 |
| pool_1 (AvgPool) | - | - | conv_21 |
| dropout_1 (Dropout) | - | - | pool_1 |
| fc_ (FC) | - | Softmax | dropout_1 |
Table 8. ResNet20 Model Architecture for CIFAR10.
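| Layer Name (Type) | # of Channels | Filter Size | Stride | Activation | Connected to |
|---|---|---|---|---|---|
| conv_1 (Conv) | 20 | 4×4 | 2 | ReLU | - |
| pool_1 (MaxPool) | - | 2×2 | 2 | - | conv_1 |
| conv_2 (Conv) | 40 | 3×3 | 2 | ReLU | pool_1 |
| pool_2 (MaxPool) | - | 2×2 | 2 | - | conv_2 |
| conv_3 (Conv) | 60 | 3×3 | 2 | ReLU | pool_2 |
| pool_3 (MaxPool) | - | 2×2 | 2 | - | conv_3 |
| fc_1 (FC) | 160 | - | - | - | pool_3 |
| conv_4 (Conv) | 80 | 2×2 | 1 | ReLU | pool_3 |
| fc_2 (FC) | 160 | - | - | - | conv_4 |
| add_1 (Add) | - | - | - | ReLU | fc_1, fc_2 |
| fc_3 (FC) | 1280 | - | - | Softmax | add_1 |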
Table 9. DeepID Model Architecture for YouTube Face.