Detection and Recovery of Adversarial Attacks with Injected Attractors


Many machine learning adversarial attacks find adversarial samples of a victim model by following the gradient of some functions, either explicitly or implicitly. To detect and recover from such attacks, we take a proactive approach that modifies those functions with the goal of misleading the attacks to some local minima, or to some designated regions that can be easily picked up by a forensic analyzer. To achieve this goal, we propose adding a large number of artifacts, which we call attractors, onto the otherwise smooth function. An attractor is a point in the input space that has a neighborhood of samples with gradients pointing toward it. We observe that decoders of watermarking schemes exhibit properties of attractors, and give a generic method that injects attractors from a watermark decoder into the victim model. This principled approach allows us to leverage known watermarking schemes for scalability and robustness. Experimental studies show that our method has competitive performance. For instance, for un-targeted attacks on the CIFAR-10 dataset, we can reduce the overall attack success rate of DeepFool [28] to 1.9%, whereas the known defences LID [43], FS [43] and MagNet [25] only reduce the rate to 90.8%, 98.5% and 78.5% respectively.

I Introduction

Machine learning models such as deep neural networks are vulnerable to adversarial attacks [38], where a small perturbation of the input can lead to a wrong prediction. As machine learning gains popularity, this vulnerability has raised concerns about adopting machine learning in environments subject to adversarial influences, such as biometric authentication, fraud detection and autonomous driving.

Many defences have been proposed to tackle adversarial attacks. Here, we focus on the “forensic” approach that consists of an analyzer for the given victim model. This analyzer determines whether a given input has been subjected to an adversarial attack with respect to the reference victim model, and recovers the correct prediction if possible. Classification methods that learn characteristics of adversarial samples [11, 9, 8, 7, 23, 43, 25], and transformation-based methods [12, 21] that eliminate adversarial effects through input transformation, can be viewed as belonging to this approach. Techniques in this approach can be adopted in scenarios where the adversaries have access to the victim model when constructing the adversarial samples, but are unaware of the forensic analyzer. Alternatively, the forensic analyzer can be integrated with the victim model, and exposed to the adversaries as a single machine learning model. Although adversaries now have access to the forensic outcomes, the integrated model imposes additional constraints and thus is arguably more robust against adversarial attacks.

Many known forensic analyzers are passive in the sense that the victim model is treated as a reference and is not modified. It is interesting to see whether one could proactively modify the model so as to aid subsequent analysis. Shumailov et al. proposed aiding detection by setting up “taboo traps” [37], which are selected neurons whose activations are trained to be low on clean images. As the adversaries are not aware of the traps, the generated adversarial samples could trigger unusual activation on the taboo traps and thus be identified. Shan et al. [35] also introduced “trapdoors”, which are global perturbations that lead to misclassification and can be viewed as artificial “shortcuts” towards misclassification. Since most attacks search for nearby misclassification, the adversarial samples generated might follow the trapdoors. The trapdoors are injected into the training dataset during model training, and can be identified during forensic analysis by observing statistical properties of selected neurons’ activations. Similar ideas were also explored in [20] and [34].

Fig. 1: The function of soft-label for the class red is shown as a surface. The classifier’s decision boundary (soft-label crossing the threshold) is shown on the projected plane. Panels: (a) original victim model; (b) model with attractors.

Fig. 2: (a) Combined model: the outputs of the classifier and the watermark decoder are summed together to form the final output. The whole combined model (after obfuscation) is presented to the attacker as a white-box. (b) Forensic analyzer: the forensic analyzer takes the outputs of the classifier and the watermark decoder and uses some simple logic to detect and recover adversarial samples.

In this paper, instead of injecting trapdoors or taboo traps, we propose injecting attractors. Most known attacks search for adversarial samples by optimizing some attack-loss functions. An attractor is a sample that influences the attack-loss functions, so that gradients at samples in the neighborhood of this attractor are pointing toward the attractor. The attractors serve two purposes. Firstly, they are potholes and humps injected into the otherwise smooth slope, so as to confuse the search process. Secondly, they can trick the search process to some designated regions, which can be easily detected by the forensic analyzer.

Figure 1 depicts attractors on a two-class example. A sample is classified as class red if its soft-label exceeds a certain threshold. The function of the soft-label for class red is shown as a surface in the figure. Figure 1(a) depicts the original victim classifier whereas Figure 1(b) depicts the situation after attractors are injected. There are two types of attractors: peaks and dips. Let us consider an attack that follows the gradient so as to minimally perturb a given sample in class red to class blue. Due to the existence of attractors, the perturbation ends up in a dip. Similarly, for a sample in class blue, a perturbation that follows the opposite direction of the gradient will end up in a peak. Samples near dips and peaks are designated and declared as adversarial samples during forensic analysis. Besides detecting adversarial samples, it is also possible to recover the original class.

While many attacks make decisions based on the soft-label’s gradient, there are also known attacks, such as black-box attacks, that make decisions based on the predicted labels of randomly perturbed samples. To address these attacks, we consider the local density function, which is the proportion of misclassification within the neighborhood, and take this local density as the attack-loss function.

There is a crucial difference between the notion of attractors and trapdoors [35]. A trapdoor creates a region of misclassification, whereas an attractor modifies the gradients in its neighborhood. While a trapdoor might still affect the gradients in its neighborhood, its formulation does not explicitly enforce this requirement, and there could be cases where the influence is not sufficiently large. In other words, the existence of trapdoors does not necessarily guarantee that an attack would move along a trapdoor. To demonstrate that the concern is legitimate, we give a trapdoor construction that arguably attains the training goals, yet is unable to trick attackers into following the trapdoors.

The notion of attractors is also different from the gradient obfuscation methods investigated by Athalye et al. [1]. A gradient obfuscation method creates a function that cannot be back-propagated through, so that adversaries are unable to obtain the gradient signal. In contrast, gradients of attractors are still smooth and differentiable almost everywhere, and thus can be back-propagated. In fact, our intention is to feed gradient information to the attackers so as to pull them nearer to an attractor.

We propose a simple construction that injects attractors using an existing digital watermarking scheme. Let $\mathcal{M}_\theta : \mathcal{X} \to \mathbb{R}^n$ be the model we want to protect, where $n$ is the number of classes and $\theta$ represents the model parameters. We first find a robust digital watermarking scheme and its decoder $\mathcal{W}_\phi : \mathcal{X} \to \mathbb{R}^n$, which is coded as a neural network model parameterized by $\phi$, and the $i$-th coefficient of $\mathcal{W}_\phi(x)$ is the correlation value of $x$ with the $i$-th watermark message. One can visualize that the surfaces of $\mathcal{W}_\phi$ are scattered with attractors, since there are watermarked samples everywhere by the fidelity requirement on watermarking schemes. To inject these attractors into the model $\mathcal{M}_\theta$, we simply combine $\mathcal{M}_\theta$ and $\mathcal{W}_\phi$ to form a new model that outputs the normalized sum of both models’ outputs. In other words, given an input $x$, the final predicted soft-labels visible to the adversaries are the normalized $\mathcal{M}_\theta(x) + \mathcal{W}_\phi(x)$, as illustrated in Figure 2. The combined model is to be obfuscated if the adversaries have white-box access to the model. During forensic analysis, when the input is unusually close to a particular attractor, we declare that the input is an adversarial sample. To see why the method is able to detect and recover, note that adding the $i$-th soft-label, say $p_i$ where $\mathcal{M}_\theta(x) = (p_1, \ldots, p_n)$, to the $i$-th watermark’s correlation value would “bind” the $i$-th class to the $i$-th watermark message. Now, if a targeted attack attempts to increase the prediction $p_i$, it would unknowingly increase the correlation with the corresponding $i$-th watermark message. Likewise, if an un-targeted attack attempts to decrease the prediction $p_i$, the correlation with the $i$-th watermark message would be decreased.

Besides being conceptually simple, the proposed approach has a few additional advantages. The approach is generic and can incorporate different watermarking schemes, and thus can leverage the extensive body of known work in digital watermarking. For instance, one could employ high-capacity watermarking schemes to protect models with a large number of classes, or employ schemes that are robust against geometric distortion so as to be resilient to such distortions. In addition, the efficient watermarking decoders provide a way for the forensic analyzer to identify the attractors efficiently. Furthermore, when deployed in the black-box setting, no re-training is required in combining the victim model and the watermark decoder. More importantly, the modular approach provides insights into the internal mechanism that would be useful in the implementation.

We conducted experiments against 18 known attacks and compared our results with state-of-the-art detection-only defences. The results show that our performance is very competitive (results reported in Table III). For example, for un-targeted attacks on the CIFAR-10 dataset, we can reduce the attack success rate of DeepFool [28] to 1.9%, while the known defences LID [43], FS [43], and MagNet [25] only reduce the rate to 90.8%, 98.5% and 78.5% respectively.

Similar to trapdoor and taboo traps, our construction relies on some forms of neural network obfuscation to hide some secrets. In a certain sense, our results show that, with strong neural network obfuscation, it is possible to attain high robustness against adversarial attack.


Our contributions are summarized as follows.

  1. We give a formulation of attractors and highlight their role in defending against adversarial attacks. We point out crucial differences between this formulation and existing notions, in particular Trapdoor [35]. We also point out a shortcoming of the Trapdoor formulation.

  2. We propose a generic approach that takes a watermarking decoder and combines it with the victim model. The requirement of attractor provides a guide on how these two models are to be combined, which turns out to be a simple normalized summation. We also propose the corresponding forensic analyzer that detects/recovers from adversarial attacks.

  3. We conduct extensive experiments against a wide range of attacks and compare our results with state-of-the-art approaches. Our experiments demonstrate that the proposed defence attains competitive performance.

II Background and Related Works

In this section, we briefly describe relevant attacks and defences in our experiments. We denote the victim classification model as $\mathcal{M}_\theta$, parameterized by $\theta$. Given an input $x$, $\mathcal{M}_\theta$ outputs a vector $\mathcal{M}_\theta(x) = (p_1, \ldots, p_n)$ where each $p_i$ represents the soft-label for class $i$, that is, the probability that the input $x$ belongs to class $i$. The classification of $x$ is the most likely predicted class, i.e., $\arg\max_i p_i$. We write $\mathcal{L}$ for the classification loss function.

II-A Attacks

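Box-constrained L-BFGS Attack (BLB) [38].   Szegedy et al. formulated the generation of adversarial samples as an optimization problem. Given an input image $x$ and a target class $t$, the goal is to minimize $\|x' - x\|_2$ such that the adversarial sample $x'$ is classified as $t$. Szegedy et al. transformed this goal into an easier problem: minimizing $c \cdot \|x' - x\|_2 + \mathcal{L}(\mathcal{M}_\theta(x'), e_t)$, where $c$ is a constant and $e_t$ is the one-hot vector of the target class.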

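Fast Gradient Sign Method (FGSM) [10].   FGSM moves a fixed small step $\epsilon$ in the direction that maximally changes the prediction result. The adversarial sample for an un-targeted attack is:

$$x' = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_y)\big)$$

where $e_y$ is the one-hot vector of the true label of the input $x$. The adversarial sample for a targeted attack is:

$$x' = x - \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_t)\big)$$

where $e_t$ is the one-hot vector of the target label.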

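Basic Iterative Method (BIM) [19].   BIM extends the one-step FGSM attack into an iterative process. The attack chooses the starting point $x'_0 = x$ and the subsequent steps:

$$x'_{k+1} = \mathrm{Clip}_{x,\epsilon}\Big(x'_k + \alpha \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\mathcal{M}_\theta(x'_k), e_y)\big)\Big)$$

where $\alpha$ is the step size, $e_y$ is the one-hot vector of the true label, and $\mathrm{Clip}_{x,\epsilon}$ keeps the result within distance $\epsilon$ of $x$.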

Projected Gradient Descent (PGD) [24].   PGD is an improvement over BIM. The search process starts at a random point within the norm ball, and then follows the iterations similar to BIM.
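The three gradient-sign attacks above can be sketched on a toy model. This is only an illustration under stated assumptions: the victim is a hypothetical two-class linear softmax classifier (not the models used in the paper), the loss is cross-entropy, the norm ball is taken to be an L-infinity ball, and the helper names (`predict`, `loss_grad`, `fgsm`, `bim`) are invented for this example.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, b, x):
    """Soft-labels of a toy linear classifier softmax(Wx + b) (stand-in victim)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) + bi
                    for row, bi in zip(W, b)])

def loss_grad(W, b, x, onehot):
    """Gradient of the cross-entropy loss with respect to the input x."""
    p = predict(W, b, x)
    # d(loss)/d(logits) = p - onehot; the linear layer contributes W^T.
    return [sum((p[k] - onehot[k]) * W[k][j] for k in range(len(W)))
            for j in range(len(x))]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(W, b, x, onehot, eps, targeted=False):
    """One signed-gradient step: ascend the loss (un-targeted) or descend it (targeted)."""
    step = -eps if targeted else eps
    return [xi + step * sign(g) for xi, g in zip(x, loss_grad(W, b, x, onehot))]

def bim(W, b, x, onehot, eps, alpha, steps, random_start=False, seed=0):
    """Iterated FGSM with clipping; random_start=True gives a PGD-style variant."""
    rng = random.Random(seed)
    adv = [xi + rng.uniform(-eps, eps) for xi in x] if random_start else list(x)
    for _ in range(steps):
        g = loss_grad(W, b, adv, onehot)
        adv = [a + alpha * sign(gi) for a, gi in zip(adv, g)]
        # Project back into the L-infinity ball of radius eps around x.
        adv = [min(max(a, xi - eps), xi + eps) for a, xi in zip(adv, x)]
    return adv

W = [[1.0, -1.0], [-1.0, 1.0]]   # hypothetical weights
b = [0.0, 0.0]
x = [0.6, 0.4]                   # classified as class 0 by the toy model
true = [1.0, 0.0]                # one-hot vector of the true label
x_fgsm = fgsm(W, b, x, true, eps=0.2)
x_bim = bim(W, b, x, true, eps=0.2, alpha=0.05, steps=10)
x_pgd = bim(W, b, x, true, eps=0.2, alpha=0.05, steps=10, random_start=True)
```

In this toy setting all three variants stay within the `eps` budget and push the sample across the decision boundary; on a real network the iterative variants typically find adversarial samples that the single step misses.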

Momentum Iterative FGSM (MI-FGSM) [6].   Dong et al. proposed using gradients from previous iterations and applying momentum to prevent overfitting.

DeepFool [28].   DeepFool finds the minimal perturbation that changes the prediction result:

$$\min_{r} \|r\|_2 \quad \text{such that} \quad \arg\max_i \mathcal{M}_\theta(x + r)_i \neq \arg\max_i \mathcal{M}_\theta(x)_i$$

DeepFool views neural network classifiers as hyperplanes separating different classes. In a binary classifier, the size of the minimal perturbation is the distance from $x$ to the separating hyperplane $\mathcal{F}$, and the minimally perturbed sample is the orthogonal projection of $x$ onto $\mathcal{F}$.

Universal Adversarial Perturbations (UAP) [27].   Moosavi-Dezfooli et al. proposed UAP, a quasi-imperceptible, image-agnostic perturbation that can cause misclassification for most images sampled from the data distribution.

OptMargin (OM) [15].   He et al. proposed OptMargin which generates low-distortion adversarial samples that are robust to small perturbations. This approach circumvents defences such as transformation based defences that sample in a small neighborhood around an input instance and get the majority prediction.

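Carlini and Wagner (C&W) [3].   C&W is an iterative optimization method. Its goal is to minimize the loss $\|\delta\|_p + c \cdot f(x + \delta)$, where $f$ is an objective function such that $x + \delta$ is classified as the target class if and only if $f(x + \delta) \leq 0$, and $c$ is a constant.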

Elastic-Net Attacks (EAD) [5].   EAD uses the same loss function as C&W but combines both $L_1$ and $L_2$ penalty functions to minimize the difference between the adversarial sample and the original image.

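Least Likely Class attack (LLC) [19].   LLC is similar to FGSM. Instead of decreasing the score of the correct class, LLC attempts to increase the score of the least likely class. That is:

$$x' = x - \epsilon \cdot \mathrm{sign}\big(\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_{LL})\big)$$

where $e_{LL}$ is the one-hot vector of the class with the lowest soft-label on $x$.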

I-LLC is the iterative version of LLC.

Jacobian-based Saliency Map Attack (JSMA) [30].   JSMA selects a few pixels in a clean sample based on the saliency map and saturates them either to the minimum or maximum value such that the new sample can be misclassified.

Backward Pass Differentiable Approximation (BPDA) [1].   Athalye et al. suggested that most defences either intentionally or unintentionally break or hide the gradients as a way to prevent adversarial attack. BPDA approximates the gradient for a non-differentiable layer so that gradient-based attacks can be effective against such defences.

Simultaneous Perturbation Stochastic Approximation (SPSA) [40].   SPSA uses non-gradient-based optimization. By taking random small steps around the input, SPSA attempts to find a global minimum.

RFGSM, RLLC [39].   Tramèr et al. proposed adding random perturbations drawn from a Gaussian distribution before calculating the gradient. This targets defences that use gradient masking.

II-B Defenses

Many defences such as secondary classification [11, 9, 8], principal component analysis (PCA) based methods [16, 2], statistical methods [7], transformation methods [12, 21, 41, 18], adversarial training [10] and defensive distillation [31] have been proposed.

The rest of this section describes three detection-only approaches employed in our experiments and two proactive forensic approaches.

Local Intrinsic Dimensionality Based Detector (LID) [23].   Intrinsic dimensionality of manifold can be seen as the minimum dimensionality required to represent a data sample on the manifold. Since data samples from a dataset can be on different manifolds, local intrinsic dimensionality (LID) is used to measure the intrinsic dimensionality of a single data sample. Ma et al. observed adversarial perturbation can change the LID characteristics of an adversarial region. Their experiments showed adversarial samples have significantly higher estimated LID than normal samples. Based on this observation, they built a LID-based detector.

Feature Squeezing Detector (FS)  [43].   Xu et al. suggested the unnecessarily large feature input space often gives room for adversarial samples. They proposed feature squeezing to limit the degree of freedom for adversary. The feature squeezing methods include reduction of color depth and smoothing. The framework evaluates the prediction results of both the original input and input pre-processed by feature squeezing. The input will be identified as adversarial if the difference between any two results is larger than a certain threshold.
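A minimal sketch of the feature-squeezing check follows. It assumes inputs scaled to [0, 1], uses only colour-depth reduction as the squeezer, and compares the original prediction against each squeezed prediction using the L1 distance; `toy_model`, the threshold and all function names are hypothetical and are not taken from Xu et al.'s implementation.

```python
import math

def squeeze_bit_depth(x, bits):
    """Reduce colour depth: round each value in [0, 1] to one of 2**bits levels."""
    levels = 2 ** bits - 1
    return [round(v * levels) / levels for v in x]

def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def fs_score(model, x, squeezers):
    """Largest L1 gap between the prediction on x and on any squeezed copy of x."""
    base = model(x)
    return max(l1_distance(base, model(s(x))) for s in squeezers)

def fs_detect(model, x, squeezers, threshold):
    """Flag x as adversarial when squeezing changes the prediction too much."""
    return fs_score(model, x, squeezers) > threshold

def toy_model(x):
    """Hypothetical two-class model with a steep decision boundary at x[0] == x[1]."""
    z = 20.0 * (x[0] - x[1])
    p0 = 1.0 / (1.0 + math.exp(-z))
    return [p0, 1.0 - p0]

squeezers = [lambda x: squeeze_bit_depth(x, 1)]      # 1-bit colour depth
stable = [1.0, 0.0]        # unchanged by squeezing
fragile = [0.52, 0.49]     # sits near the boundary; squeezing moves it far away
flag_stable = fs_detect(toy_model, stable, squeezers, threshold=0.5)
flag_fragile = fs_detect(toy_model, fragile, squeezers, threshold=0.5)
```

The threshold plays the same role as in the description above: it is chosen from the score distribution on clean data, which this sketch does not attempt to model.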

MagNet Detector [25].   Meng and Chen suggested that one of the reasons adversarial samples cause wrong classification is that they are far from the normal data manifold. They built a detector that measures how different an input sample is from normal samples. Two detection mechanisms were discussed in Meng and Chen’s paper: detection based on the reconstruction error of an autoencoder trained on the given dataset, and detection based on probability divergence.

Trapdoor [35].   Shan et al. use an active method to capture adversarial attacks. Given a classification model $\mathcal{M}_\theta$, they define the notion of a “trapdoor” $\Delta_t$ for a class $t$ and a “trapdoored” model $\mathcal{M}_{\theta'}$ such that $\Pr\big[\arg\max_i \mathcal{M}_{\theta'}(x + \Delta_t)_i = t\big] \geq 1 - \mu$, where $\mu$ is a small constant.

The “trapdoored” model is obtained through training. The original training dataset is augmented with trapdoor-embedded samples. For any $x$ in the training dataset, the label of its trapdoor-embedded version $x + \Delta_t$ is set to $t$. The goal of the training is to minimize the classification loss and make the trained model reach optimal parameters $\theta'$ such that it can classify both clean samples and trapdoor-embedded samples. As adversarial generation functions naturally gravitate towards these trapdoors, they will produce adversarial samples that the model owner can recognize through a known neuron activation signature. More details are discussed in Section IV-C.

Taboo Trap [37].   Taboo Trap is an active method that trains the model to have a restricted set of behaviors on activations which are hidden from the attackers during training, and reports any behavior that later violates these restrictions as an adversarial sample. The framework consists of three stages: profiling activation for samples on a trained model, re-training the model with a transformation function on all activations and using the transform function to detect adversarial input.

III Threat Model

We consider two settings, with forensic analyzer or with integrated model.

III-A Forensic analyzer

Under this setting, the adversary has access to the victim classifier $\mathcal{M}_\theta$, either as a black-box where the adversary can feed in an arbitrary input $x$ and observe the output $\mathcal{M}_\theta(x)$, or as a white-box where the adversary knows the parameters $\theta$ and thus can apply $\mathcal{M}_\theta$ on multiple inputs and observe the internal states.

However, the adversary is not aware of a forensic step to be carried out on the input. The forensic analyzer investigates whether a given input has been subjected to an adversarial attack with respect to the victim classifier $\mathcal{M}_\theta$. We can have a detection-only forensic which, when given an input $x$, decides whether $x$ has been subjected to an attack, and raises a flag iff it deems so. The recovery forensic takes a step further. On input an adversarial sample $x'$, it outputs a prediction closest to the original prediction prior to the attack.

To illustrate this setting, consider an insurance company’s website that employs a classifier $\mathcal{M}_\theta$ to decide some outcomes on digital photographs submitted by its users. A user can make multiple attempts and the decision of $\mathcal{M}_\theta$ on each attempt is made known to the user; in other words, the user has access to $\mathcal{M}_\theta$. Once satisfied, the user selects a single final version. Since $\mathcal{M}_\theta$ is accessible, a malicious user might carry out an adversarial attack to find a slightly perturbed image that would give a favorable outcome, and then select that image as the final version. To prevent such an attack, the site could employ the forensic analyzer only on the final version to detect malicious behavior.

III-B Integrated setting

The forensic setting requires a hidden step and is not applicable in many applications. Nevertheless, forensic techniques can still be useful. Suppose we have an accurate analyzer and recovery forensic for the victim model $\mathcal{M}_\theta$. We could combine the victim model, the analyzer and the recovery forensic to obtain a single integrated model. Since the output type of the integrated model is the same as the output type of $\mathcal{M}_\theta$, the integrated classify-detect-then-recover model can now take over from $\mathcal{M}_\theta$ in the original classification task. The integrated model imposes more constraints on the adversarial attack and arguably could be more robust.

The adversary can have either white-box or black-box access to the integrated model.

III-C Adversary’s goal: Targeted vs Un-targeted

When given a clean input $x$, the attacker may have the specific goal of finding a sample that is misclassified to a particular given class. Such a goal is known as a targeted attack. Alternatively, the attacker could be content with the weaker goal of finding a sample that is misclassified to any class. This is known as an un-targeted (or non-targeted) attack. To an adversary, an un-targeted attack is easier to achieve since the adversarial sample just has to be misclassified to any class.

IV Attractors

IV-A Motivation

Most attacks search for adversarial samples along directions derived from some local properties. For instance, FGSM takes the loss function’s gradient as the search direction. Our goal is to confuse the adversary by adding artifacts to taint those local properties, so as to lead the search process to some local minima, or to some designated regions that aid detection. In a certain sense, the artifacts are potholes and humps added to an otherwise smooth slope.

This motivates the definition of attractors. Intuitively, each attractor is a sample $a$ in the input space, such that the search process would eventually lead all samples in its neighborhood to $a$. Hence, if the input space is scattered with attractors, then the search process would be confused. In our formulation, we call the information utilized by the attack an attack loss function, and require its gradients to point toward the attractors.

Our formulation is inspired by, but different from, the attractor in dynamical systems [33].

IV-B Definition of Attractors

Consider a classification model $\mathcal{M}_\theta$ parameterized by $\theta$, and let $\mathcal{L}$ be an attack loss function.

Definition. We say that a point $a$ is an $\epsilon$-attractor in $\mathcal{M}_\theta$ with respect to the attack loss function $\mathcal{L}$ on class $t$ if there exists a neighborhood of $a$, called the basin of $a$ and denoted as $B(a)$, such that for all $x \in B(a)$,

where $S$ is the cosine similarity function and $e_t$ is the one-hot vector of the $t$-th class.
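One way to make the attractor condition concrete is to require that the cosine similarity between the attack-loss gradient at $x$ and the direction $a - x$ is at least $1 - \epsilon$ throughout the basin. That exact form is an assumption adopted for this sketch, as are the box-shaped basin and the helper names; the check below is empirical (it samples the basin) and uses a hypothetical loss surface with a single peak at `a`.

```python
import math
import random

def cosine(u, v):
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def looks_like_attractor(grad_fn, a, radius, eps, samples=500, seed=0):
    """Sample a box of half-width `radius` around a and test the assumed
    condition cos(grad(x), a - x) >= 1 - eps at every sampled point."""
    rng = random.Random(seed)
    for _ in range(samples):
        x = [ai + rng.uniform(-radius, radius) for ai in a]
        toward_a = [ai - xi for ai, xi in zip(a, x)]
        if cosine(grad_fn(x), toward_a) < 1.0 - eps:
            return False
    return True

a = [0.3, -0.2]

def peak_grad(x):
    """Gradient of the toy surface -||x - a||^2, which peaks at a."""
    return [2.0 * (ai - xi) for ai, xi in zip(a, x)]

def flat_grad(x):
    """Gradient of a plain slope: it never bends toward a."""
    return [1.0, 0.0]

peak_is_attractor = looks_like_attractor(peak_grad, a, radius=0.1, eps=0.01)
slope_is_attractor = looks_like_attractor(flat_grad, a, radius=0.1, eps=0.01)
```

A sampling check like this can only refute the condition, never prove it; it is meant as a quick sanity test of a candidate attractor rather than as part of the defence itself.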

Note that the definition of attractor depends on the attack loss functions. Here are two candidates.

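Soft-label. Given a model $\mathcal{M}_\theta$, let us choose the attack loss function to be the same as the training loss function, that is $\mathcal{L}(\mathcal{M}_\theta(x), e_t)$, where $e_t$ is the one-hot vector of class $t$. Note that $\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_t)$ is the gradient of the $t$-th soft-label of $\mathcal{M}_\theta$ at $x$. This choice of loss function makes sense as many attacks such as FGSM find the adversarial sample by moving along the direction of this gradient. In a non-targeted attack with the goal of moving away from class $t$, the attack moves along the direction $\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_t)$, whereas in a targeted attack with the goal of finding a misclassification to class $t$, the attack moves along the direction $-\nabla_x \mathcal{L}(\mathcal{M}_\theta(x), e_t)$.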

Local Density. For a model $\mathcal{M}_\theta$ and a sample $x$, let us define the $r$-local density, denoted $\rho_r(x, t)$, as the proportion of samples within the sphere of radius $r$ centred at $x$ that are classified as the $t$-th class. That is,

$$\rho_r(x, t) = \frac{\mathrm{vol}\big(\{x' \in S_r(x) : \arg\max_i \mathcal{M}_\theta(x')_i = t\}\big)}{\mathrm{vol}\big(S_r(x)\big)}$$

where $S_r(x)$ is the sphere of radius $r$ centred at $x$. If $x$ belongs to the class $t$, we would expect the local density to be large. Local density, as a choice of loss function for attractors, is more relevant to adversarial attacks that make decisions based on the predicted class instead of the numeric prediction score. For example, attacks such as SPSA make decisions based on the decision boundary.
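The local density can be estimated by Monte Carlo sampling. The sketch below assumes the "sphere" is the solid ball of radius `r` and that samples are drawn uniformly from it; `classify` stands for any function returning a predicted class, and the half-plane classifier used here is purely hypothetical.

```python
import math
import random

def sample_in_ball(center, r, rng):
    """Draw a point uniformly from the solid ball of radius r around center."""
    d = len(center)
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]        # random direction
    norm = math.sqrt(sum(v * v for v in g))
    rad = r * rng.random() ** (1.0 / d)                # radius with uniform volume
    return [c + rad * v / norm for c, v in zip(center, g)]

def local_density(classify, x, t, r, samples=2000, seed=0):
    """Fraction of sampled neighbours of x that are classified as class t."""
    rng = random.Random(seed)
    hits = sum(classify(sample_in_ball(x, r, rng)) == t for _ in range(samples))
    return hits / samples

def classify(p):
    """Hypothetical classifier: class 0 on the left half-plane, class 1 otherwise."""
    return 0 if p[0] < 0.0 else 1

deep_inside = local_density(classify, [-1.0, 0.0], t=0, r=0.5)
on_boundary = local_density(classify, [0.0, 0.0], t=0, r=0.5)
```

Far from the decision boundary the estimate saturates at 1, while at the boundary it is close to one half, which is the signal a label-only attack effectively follows.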

To protect against different attacks, we look for a model that possesses attractors with respect to a wide range of attack loss functions. In addition, if a model contains attractors w.r.t. the attack loss function $\mathcal{L}$, the model should also contain attractors w.r.t. $-\mathcal{L}$. This is to cater for both targeted and un-targeted attacks, which optimise in opposite directions.

IV-C Attractors vs Trapdoors

Unlike an attractor, a trapdoor for the $t$-th class is a perturbation $\Delta_t$ that leads almost all samples to the $t$-th class. Hence, when such a perturbation is applied to a clean sample that is not in the $t$-th class, the perturbed sample is likely to be misclassified as class $t$. Regions that contain samples perturbed by $\Delta_t$ can be designated as traps which are to be detected by the forensic analyzer.

Attractors and trapdoors are related and it is possible that a model possesses properties of both. However, there are a number of key differences and crucial implications between the two notions.

Implicit vs explicit constraints on gradients

The notion of trapdoor does not explicitly impose constraints on the gradients. Since many known attacks search for adversarial samples using information on gradients (e.g. FGSM), it is interesting to investigate whether the existence of a trapdoor is sufficient to mislead the attack, so that the attack searches along the trapdoor. If this is not the case, attacks would still be successful in the presence of trapdoors. In contrast, the notion of attractor explicitly forces the gradient to point toward the attractors. Consequently, an attack that moves along the gradient would move toward the attractors as intended.

Global vs Random structure.

A trapdoor $\Delta_t$ is “global” in the sense that when applied to any sample $x$, it is likely that the perturbed sample $x + \Delta_t$ is classified as class $t$. On the other hand, the notion of attractors does not dictate that the attractors are to be scattered following some global properties. This difference is analogous to the difference between additive watermarking and informed embedding watermarking [26]. Intuitively, an additive watermark scheme adds a fixed watermark $w$ to any given image $x$, giving $x + w$. On the other hand, informed embedding (e.g. QIM [4]) perturbs the given image $x$ depending on the location of $x$ in the image space, and thus the perturbation could be different for different images. Such a difference is crucial in watermarking since a simple averaging attack can exploit the global properties of additive watermarking to derive the secret $w$. From this analogy, there could potentially be averaging attacks on trapdoors that exploit their global property.
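The averaging attack on additive watermarking mentioned above can be illustrated with a small simulation. The setup is an assumption made for this sketch: images are random vectors with independent uniform pixels, the attacker holds many watermarked images plus an independent set of clean images from the same distribution, and the secret is estimated as the difference of the two pixel-wise means. All names and values are hypothetical.

```python
import random

def embed_additive(image, w):
    """Additive watermarking: the same secret w is added to every image."""
    return [p + wi for p, wi in zip(image, w)]

def mean_image(images):
    n = len(images)
    return [sum(column) / n for column in zip(*images)]

def averaging_attack(watermarked, clean_reference):
    """Estimate the secret as mean(watermarked) - mean(clean reference)."""
    mw = mean_image(watermarked)
    mc = mean_image(clean_reference)
    return [a - b for a, b in zip(mw, mc)]

rng = random.Random(0)
secret_w = [0.05, -0.05, 0.05, 0.05, -0.05, -0.05, 0.05, -0.05]

def random_image():
    return [rng.random() for _ in secret_w]

marked = [embed_additive(random_image(), secret_w) for _ in range(5000)]
reference = [random_image() for _ in range(5000)]
estimate = averaging_attack(marked, reference)
```

Because the image content averages out while the fixed watermark does not, the estimate converges to the secret as more samples are collected. An image-dependent embedding such as QIM gives the attacker no single vector to average toward.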

Implications in construction.

Classification models with trapdoors can be obtained through training on a mixture of the original training dataset and perturbed data. More specifically, suppose $D = D_1 \cup \cdots \cup D_n$ is the training dataset of the original victim classifier, where $D_i$ contains samples in the $i$-th class. The mixed training set is $D \cup T_1 \cup \cdots \cup T_n$, where $T_i$ contains samples of $D$ perturbed with the trapdoor $\Delta_i$.

Shan et al. observed that the trained model exhibits properties of the trapdoors and attains high accuracy on the original classification task.

Here, we argue that in an optimally trained model, the direction of the soft-label’s gradient might not align with the trapdoors. Hence, even if there are trapdoors, we may be unable to detect adversarial samples obtained from gradient-based attacks. To illustrate this concern, we give a neural network model that attains the training objective and yet cannot detect gradient-based adversarial attacks.

Fig. 3: Trapdoored model meeting the training goals, but vulnerable to adversarial attacks.

Our construction first obtains three neural network models, $\mathcal{M}_\theta$, $\mathcal{T}$ and $\mathcal{G}$, each with its own parameters, and combines them to obtain a trapdoored model.

  1. $\mathcal{M}_\theta$ is the model for the original classification task. It is obtained by training on the original training dataset. We assume that the accuracy of $\mathcal{M}_\theta$ is high and it is difficult to further enhance its accuracy.

  2. $\mathcal{T}$ is the trapdoor decoder, which predicts the class of the trapdoor in the input. It can be trained on the trapdoor-perturbed samples. While it is possible to achieve high accuracy through training, the prediction process is essentially a watermark decoding process and is well understood. Hence we can analytically design an accurate neural network classifier as the trapdoor decoder.

  3. $\mathcal{G}$ is the trapdoor detector, which detects whether the input contains a trapdoor. Let us write $p = \mathcal{G}(x)$, where $p$ is the probability that the input $x$ contains a trapdoor. The detector can be trained on the two-class dataset of clean samples versus trapdoor-perturbed samples. Similar to $\mathcal{T}$, since the detection process is well understood, we can analytically derive an accurate neural network classifier.

Figure 3 illustrates how these three models are combined. On input $x$, the output of the combined model is the weighted sum:

$$(1 - p) \cdot \mathcal{M}_\theta(x) + p \cdot \mathcal{T}(x), \quad \text{where } p = \mathcal{G}(x). \qquad (1)$$

Note that the combined model attains high accuracy and meets the training goals: on a clean input belonging to class $i$, it behaves similarly to $\mathcal{M}_\theta$; on an input perturbed with a trapdoor, it behaves similarly to $\mathcal{T}$.

Now, consider a clean input $x$ of class $i$. We sample $x'$ in its small neighborhood and feed it into the combined model. Since $\mathcal{G}$ is accurate, the output value $\mathcal{G}(x')$ would be small. Hence, the gradients produced by the combined model would be close to the gradients of $\mathcal{M}_\theta$. Subsequently, a gradient-based attack (e.g. FGSM) on the combined model would obtain an adversarial sample that is similar to the result when applied on the original victim $\mathcal{M}_\theta$, and thus the adversarial sample cannot be detected.
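The argument can be checked numerically on a toy instance. The sketch assumes the weighted sum takes the gated form `(1 - p) * classifier(x) + p * decoder(x)`, with `p` the detector output; the three component models are hypothetical closed-form stand-ins, and gradients are taken by central differences rather than back-propagation.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def classifier(x):
    """Stand-in for the original victim classifier (two classes)."""
    s = sigmoid(3.0 * x[0])
    return [s, 1.0 - s]

def trapdoor_decoder(x):
    """Stand-in decoder: in this toy the only trapdoor points to class 1."""
    return [0.0, 1.0]

def trapdoor_detector(x):
    """Stand-in detector: the trapdoor is signalled by a large x[1]."""
    return sigmoid(50.0 * (x[1] - 0.9))

def combined(x):
    """Assumed gated combination of the three models."""
    p = trapdoor_detector(x)
    return [(1.0 - p) * m + p * t
            for m, t in zip(classifier(x), trapdoor_decoder(x))]

def num_grad(f, x, k, h=1e-5):
    """Central-difference gradient of the k-th output of f at x."""
    grad = []
    for j in range(len(x)):
        up = list(x); up[j] += h
        dn = list(x); dn[j] -= h
        grad.append((f(up)[k] - f(dn)[k]) / (2.0 * h))
    return grad

clean = [0.2, 0.1]                     # no trapdoor present
g_original = num_grad(classifier, clean, 0)
g_combined = num_grad(combined, clean, 0)
gap = max(abs(a - b) for a, b in zip(g_original, g_combined))
with_trapdoor = combined([0.2, 1.0])   # trapdoor present: decoder dominates
```

Around clean inputs the gradient gap is negligible, so a gradient-following attack sees essentially the original classifier and never approaches the trapdoor region, even though the combined model classifies trapdoored inputs as intended.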

V Proposed method: Attractors from Watermarking

V-A Main idea

Our construction makes use of a known model $\mathcal{W}_\phi$ that exhibits properties of attractors. To protect a victim classifier $\mathcal{M}_\theta$, we “inject” attractors from $\mathcal{W}_\phi$ into $\mathcal{M}_\theta$, giving a new combined model. The new model binds each training class of $\mathcal{M}_\theta$ to a class of attractors in $\mathcal{W}_\phi$, and the binding is achieved by simply giving the normalized sum of $\mathcal{M}_\theta(x) + \mathcal{W}_\phi(x)$ on input $x$.

We choose a watermarking scheme and take its decoder as $\mathcal{W}_\phi$. Under a watermarking scheme with a capacity of $n$ messages, each sample in the domain $\mathcal{X}$ can be decoded to one of the $n$ messages. In other words, the domain is partitioned into $n$ classes. The goal of the decoder is to determine the message embedded in the sample $x$. In this paper, we treat the decoder as a function $\mathcal{W}_\phi : \mathcal{X} \to \mathbb{R}^n$, where the $i$-th coefficient of the output is the correlation value (or confidence level) of the input with the $i$-th message. Hence the decoded message is the $m$-th message, where $m = \arg\max_i \mathcal{W}_\phi(x)_i$. By treating the decoder as a function, the notion of attractors can be applied. Note that each local maximum of $\mathcal{W}_\phi$ constitutes an attractor, and it is a watermarked sample.

By summing $\mathcal{M}_\theta$ with $\mathcal{W}_\phi$ as described in the previous paragraph, the $i$-th training class of $\mathcal{M}_\theta$ is bound to the $i$-th watermark message. Hence if an adversary intends to maximize or minimize the $i$-th coefficient of the sum $\mathcal{M}_\theta(x) + \mathcal{W}_\phi(x)$, it would unknowingly maximize or minimize the correlation with the $i$-th watermark message, and thus lead to detection.

V-B Combined Classifier

Figure 2(a) illustrates our construction. Given the original classifier $\mathcal{M}_\theta$ and the choice of watermark decoder $\mathcal{W}_\phi$, on input $x$, the combined classifier outputs the normalized sum, i.e.

$$\mathrm{normalize}\big(\mathcal{M}_\theta(x) + \mathcal{W}_\phi(x)\big) \qquad (2)$$

In scenarios where the adversary has white-box access to the combined classifier, the internal mechanism needs to be obfuscated. This can be done by coding the watermark decoder as a neural network, and then applying neural network obfuscation techniques [32, 14, 13, 17, 42, 36] or distillation to obtain the final classifier.

In our experimentation, we choose Quantization index modulation (QIM) [4] as the digital watermarking scheme. Section V-E gives details of our QIM implementation.
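
As a minimal sketch of the combination step, the snippet below adds the classifier output and the decoder output coefficient by coefficient and rescales the result. The function name `combine`, the example scores, and the choice of dividing by the total of the coefficients are assumptions made for illustration; the text only states that the combined classifier outputs a normalized sum.

```python
def combine(classifier_out, decoder_out):
    """Return the normalized element-wise sum of two equal-length score vectors.

    Assumption: scores are non-negative and normalization divides by their
    total, so the combined vector sums to 1.
    """
    if len(classifier_out) != len(decoder_out):
        raise ValueError("both models must output one score per class")
    summed = [m + w for m, w in zip(classifier_out, decoder_out)]
    total = sum(summed)
    if total <= 0:
        raise ValueError("scores must have a positive total")
    return [s / total for s in summed]


# Hypothetical 3-class example: a confident classifier and a weak decoder signal.
scores = combine([0.90, 0.05, 0.05], [0.02, 0.01, 0.03])
predicted_class = max(range(len(scores)), key=scores.__getitem__)
```

In this toy input the classifier term dominates, so the predicted class is the one the classifier alone would choose, which is the behaviour wanted on clean inputs.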

V-C Detection

In detection-only forensic (see Section III-A), the forensic analyzer decides whether an input sample is adversarial. In our proposed method, on input , makes decision based on the values of , and some predefined thresholds . The forensic analyzer declares as adversarial iff any one of the following conditions holds:

  1. ;

  2. ;

  3. ,

where the vector is treated as a sequence of real values, and is the standard deviation. The predefined thresholds are determined by conducting statistical tests on clean and adversarial data. The above conditions essentially determine whether the coefficients in are anomalies. An unusually high correlation (condition C1) indicates a targeted attack. An unusually low correlation (condition C2) indicates a non-targeted attack. A large variation (condition C3) indicates that optimization and searching have been conducted on the sum.
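
The three-way test can be sketched as follows. This is one reading of the description above, not the exact rule: it assumes C1 compares the largest decoder coefficient with an upper threshold, C2 compares the smallest coefficient with a lower threshold, and C3 compares the standard deviation of the coefficients with a third threshold. The function name, the threshold names and all numeric values are hypothetical.

```python
from statistics import pstdev


def is_adversarial(decoder_out, t_high, t_low, t_std):
    """Flag an input whose decoder coefficients look anomalous.

    Assumed conditions (illustrative reading of C1 to C3):
      C1: max coefficient above t_high  -> unusually high correlation
      C2: min coefficient below t_low   -> unusually low correlation
      C3: standard deviation above t_std -> unusually large variation
    """
    c1 = max(decoder_out) > t_high
    c2 = min(decoder_out) < t_low
    c3 = pstdev(decoder_out) > t_std
    return c1 or c2 or c3


# Hypothetical thresholds and decoder outputs for a 4-class problem.
thresholds = dict(t_high=0.5, t_low=0.02, t_std=0.1)
clean_flag = is_adversarial([0.10, 0.11, 0.09, 0.10], **thresholds)
targeted_flag = is_adversarial([0.10, 0.90, 0.10, 0.10], **thresholds)
untargeted_flag = is_adversarial([0.10, 0.11, 0.001, 0.10], **thresholds)
```

In practice the thresholds would be calibrated on clean and adversarial data, as stated above, rather than fixed by hand.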

V-D Recovery

The recovery forensic is only invoked on input that is declared as adversarial by the forensic detector. On input , the recovered label is:

  • .

To justify why the smallest coefficient is likely the original class, let us consider the direction taken by the attacker. Let be a clean input that belongs to the -th class. During an un-targeted attack, the attacker attempts to reduce the -th coefficient of the normalized sum in equation (2), and would unknowingly suppress the -th coefficient of .

During a targeted attack to -th class where , the attacker attempts to maximally increase the -th coefficient with some fixed perturbation. Since the coefficients are normalized (see equation (2)), and the -th coefficient is the largest at , it is more economical to reduce the -th coefficient while increasing the -th coefficient. Consequently, the attack would also suppress the -th coefficient of .
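
A minimal sketch of the recovery rule, assuming (as the justification above suggests) that the recovered label is the index of the smallest decoder coefficient. The function name and the example values are hypothetical.

```python
def recover_label(decoder_out):
    """Return the index of the smallest decoder coefficient.

    Rationale from the text: the attack suppresses the coefficient bound to
    the original class, so the smallest coefficient likely marks that class.
    """
    if not decoder_out:
        raise ValueError("decoder output must not be empty")
    return min(range(len(decoder_out)), key=decoder_out.__getitem__)


# Hypothetical decoder output for a detected adversarial input: class 2 was
# suppressed by the attack and class 1 was boosted.
recovered = recover_label([0.10, 0.35, 0.01, 0.12])
```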

V-E QIM decoder

We adopt a basic variant of the Quantization Index Modulation (QIM) watermarking scheme [4] for . Let us describe QIM in the context of image data. Given a -pixel image and a particular -bit watermark message , the distance of and is determined in the following way: The value of the -th pixel is quantized with a predefined step size . The codewords for 0 are at and the codewords for 1 are at . The distance of the -th pixel to is its distance to the nearest codeword for as illustrated in Figure 4. Let us denote the distance as . We take the weighted sum of distances over all the pixels as the distance between and , that is,

where are some predefined weights. Finally, the distance (which is low for an image containing the message) is to be mapped to the correlation value (which is high for an image containing the message). In our implementation, we take the function ,


where are predefined translating and scaling constants so that the distances are in the range , and is another constant to control the amount of influence of the attractors to the original classifier.

Overall, on input , outputs , where

for each , where are pre-selected -bit messages that are embedded/hardcoded into .

Fig. 4: An example of quantization on pixel (with step size of 64). The black dots are the pixel values, and the message is 1.
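
The distance computation can be sketched as below. The codebook layout is an assumption borrowed from the textbook form of QIM (bit 0 at integer multiples of the step size, bit 1 shifted by half a step), and the linear `correlation` map is a hypothetical stand-in for the translating and scaling function described above. All function names are illustrative.

```python
def pixel_distance(value, bit, step):
    """Distance from a pixel value to the nearest codeword encoding `bit`.

    Assumed codebook: bit 0 at k*step, bit 1 at k*step + step/2 (integer k).
    """
    offset = 0.0 if bit == 0 else step / 2.0
    shifted = (value - offset) % step  # position inside one quantization cell
    return min(shifted, step - shifted)


def qim_distance(pixels, message, step, weights=None):
    """Weighted sum of per-pixel distances between an image and a message."""
    if len(pixels) != len(message):
        raise ValueError("one message bit per pixel is required")
    if weights is None:
        weights = [1.0] * len(pixels)
    return sum(w * pixel_distance(p, b, step)
               for p, b, w in zip(pixels, message, weights))


def correlation(distance, max_distance, strength=1.0):
    """Hypothetical linear map: a small distance gives a large correlation."""
    return strength * (1.0 - distance / max_distance)


# Toy 3-pixel image with step size 64, as in the Figure 4 example.
pixels, step = [100, 64, 31], 64
match_dist = qim_distance(pixels, [1, 0, 1], step)   # message close to the image
other_dist = qim_distance(pixels, [0, 1, 0], step)   # complementary message
max_dist = len(pixels) * step / 2.0                  # each pixel is at most step/2 away
```

Here the first message yields the smaller distance and therefore the larger correlation, so a decoder built this way would report it as the embedded message.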

V-F Remarks

There are a few advantages of using a watermarking decoder: (1) By the fidelity requirement of a watermarking scheme, the watermarked samples/attractors are scattered over and thus for any point in , there is a nearby watermarked sample/attractor. (2) There is an efficient decoder to detect a watermarked sample/attractor. (3) There have been extensive studies on watermarking in the past two decades, which we can leverage for the construction and analysis of attractors.

VI Evaluation

In this section, we benchmark our method against some well-known adversarial attacks and compare the results with state-of-the-art defences. Ling et al. released DEEPSEC [22], which is a platform for security analysis of deep learning models. We conduct our experiments using this platform for fair comparison.

VI-A Dataset

We tested our approach using two datasets: MNIST and CIFAR-10. MNIST contains 60,000 training images and 10,000 testing images. CIFAR-10 contains 50,000 training images and 10,000 testing images. Both datasets have 10 classes. MNIST samples are greyscale images with size 28×28 and CIFAR-10 samples are colored images with size 32×32.

VI-B Model Setup

For the MNIST dataset, we used a standard CNN with 2 convolutional layers and 2 fully connected layers. For CIFAR-10, we used ResNet-20. These models are exactly the same as the raw models used in the DEEPSEC platform, with weights obtained from DEEPSEC². For the QIM quantization in , we represent each pixel as a value in [0, 255] and use two interval sizes: 3 and 128. The QIM setting is the same in the experiments on the MNIST and CIFAR-10 datasets.

VI-C Analysis

Strength of Victim model vs Attractors

Fig. 5: Comparing magnitude (1-norm) of outputs and gradients from and . (a) KDE plot of and where and . (b) KDE plot of and where and .

This experiment compares the “strength” of the victim model against the watermark decoder on clean input, so as to verify that they can meet the requirements on classification accuracy and attractors. We first feed 10,000 testing images from the MNIST dataset into both the victim model and the watermark decoder . Next, for each testing image, we measure the magnitudes (w.r.t. 1-norm) of the output, and the magnitudes of the gradients. In other words, we are comparing the signal at (1) and (2) in Figure 1(a). Kernel Density Estimates (KDE) derived from the measurements are shown in Figure 5.

Figure 5(a) shows that the magnitude of the output from the original is much larger than the magnitude of the output from the decoder . In other words, in Figure 1(a), the signal at (1) dominates (2), and thus the attractors have a small impact on the classification accuracy on clean input.

On the other hand, from Figure 5(b), we can see that the gradient’s magnitude from is much larger than that from the original model . Hence, during an attack, the attacker’s optimization strategy would be mostly affected by the watermark decoder instead of the original model.

We conducted a similar experiment on a trapdoored model for comparison. Here, we use the construction discussed in Section IV-C. In this experiment, the watermark decoder is implemented based on additive watermark³. Similarly, we feed 10,000 testing images from the MNIST dataset and compute the values and gradients of and for each testing image. Figure 6 reports the KDE derived from the measurements.

Recall that the output of the constructed model is the sum of two weighted terms given in equation (1), which correspond to the signals at (1) and (2) in Figure 3. For convenience, let us call the two terms and , that is, and on input .

Fig. 6: Comparing magnitude (1-norm) of outputs and gradients from and . (a) KDE plot of and where and .   (b) KDE plot of and where and .

Figure 6(a) shows that, on clean samples, dominates , and thus the combined trapdoored model would have the same accuracy as the victim model . The gradient is more complicated to determine since the term involves multiplication of two functions on . We directly measure the two gradients using the corresponding signals (1) and (2) in Figure 3. Let us denote the parameters of a neural network that outputs the signal (1), and the parameters of a neural network that outputs the signal (2) in Figure 3. Figure 6(b) shows the KDE plot of and . The gradient on clean input is dominated by the gradient of , which is the gradient from the victim model. Hence, an attack conducted on the trapdoored model would not involve any contribution from the trapdoors, and thus cannot be detected by the analyzer.


This experiment verifies that the direction of gradients from indeed exhibits properties of attractors. To verify this property, for each testing image , we find the nearest attractor and its label and determine the cosine similarity between and the gradient at . We also repeat the measurement on the original model , that is, measuring the cosine similarity of and the gradient .

The experiment is conducted with 10,000 testing images in MNIST. The KDEs of the measurements are shown in Figure 7(a). Note the clear separation between them. Furthermore, note that for , cosine similarity is more than , implying that a randomly chosen clean sample is likely to have its gradient pointing toward the nearest attractor. Thus we have empirically verified that the basins of -attractors cover the sample space, where .
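
The measurement itself is a plain cosine similarity between two vectors: the direction from the sample to its nearest attractor, and the gradient at the sample. The sketch below shows that computation on a toy 2-D example; the helper names and the numbers are hypothetical, and how the nearest attractor and the gradient are obtained is outside its scope.

```python
import math


def direction_to(point, attractor):
    """Vector pointing from `point` toward `attractor`."""
    return [a - p for p, a in zip(point, attractor)]


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy example: the gradient at x points almost straight at the attractor.
x, attractor, gradient = [1.0, 1.0], [3.0, 1.0], [2.0, 0.5]
similarity = cosine_similarity(direction_to(x, attractor), gradient)
```

A value close to 1 means the gradient points toward the attractor, a value near 0 means the two directions are unrelated.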

Local Density

We repeat the experiment described in Section VI-C2 on the local density function and plot the Kernel Density Estimate (KDE) in Figure 7(b). For each testing image , we find the nearest attractor and its label and determine the cosine similarity between and the gradient at . We also repeat the measurement on the original model , that is, measuring the cosine similarity of and the gradient .

The experiment is conducted with 10,000 testing images in MNIST. The KDEs of the measurements are shown in Figure 7(b). The result shows that if we choose , a randomly chosen clean sample has more than chance of being in the basin of an -attractor.

Fig. 7: (a) KDE plot of cosine similarity between the direction to the nearby attractor and the gradient for a randomly chosen . and . (b) KDE plot of cosine similarity between the direction to the nearby attractor , and the gradient of local density function for a randomly chosen . and .
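| (a) UA/TA | (b) Attack | MNIST (c) attack success rate, undefended model | MNIST (d) attack success rate, attractor-embedded model | MNIST (e) detection on misclassified input | MNIST (f) recovery on detected input | MNIST (g) overall attack success rate | CIFAR-10 (h) attack success rate, undefended model | CIFAR-10 (i) attack success rate, attractor-embedded model | CIFAR-10 (j) detection on misclassified input | CIFAR-10 (k) recovery on detected input | CIFAR-10 (l) overall attack success rate |
|---|---|---|---|---|---|---|---|---|---|---|---|
| UA | FGSM | 90.4% | 0.3% | 100.0% | 100.0% | 0.0% | 88.7% | 53.5% | 100.0% | 99.8% | 0.0% |
| UA | RFGSM | 65.6% | 0.1% | 100.0% | 100.0% | 0.0% | 99.6% | 62.7% | 100.0% | 100.0% | 0.0% |
| UA | BIM | 100.0% | 4.2% | 100.0% | 100.0% | 0.0% | 100.0% | 86.8% | 99.9% | 99.8% | 0.0% |
| UA | PGD | 100.0% | 7.1% | 100.0% | 100.0% | 0.0% | 100.0% | 94.6% | 100.0% | 100.0% | 0.0% |
| UA | UMIFGSM | 100.0% | 3.4% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 91.7% | 98.1% | 8.3% |
| UA | UAP | 24.2% | 0.3% | 66.7% | 100.0% | 0.1% | 93.1% | 87.0% | 94.4% | 10.9% | 4.9% |
| UA | DeepFool | 100.0% | 25.5% | 100.0% | 100.0% | 0.0% | 100.0% | 99.5% | 98.1% | 99.1% | 1.9% |
| UA | OM | 100.0% | 97.0% | 87.6% | 100.0% | 12.0% | 100.0% | 100.0% | 81.9% | 93.8% | 18.1% |
| UA | BPDA | 100.0% | 7.0% | 100.0% | 100.0% | 0.0% | 100.0% | 92.1% | 100.0% | 100.0% | 0.0% |
| UA | SPSA | 96.8% | 20.1% | 100.0% | 92.4% | 0.0% | 90.2% | 77.7% | 100.0% | 81.0% | 0.0% |
| TA | LLC | 7.7% | 0.0% | No Adversarial Found | No Adversarial Found | 0.0% | 13.2% | 5.4% | 98.1% | 30.2% | 0.1% |
| TA | RLLC | 1.6% | 0.0% | No Adversarial Found | No Adversarial Found | 0.0% | 27.6% | 16.0% | 100.0% | 27.7% | 0.0% |
| TA | ILLC | 77.5% | 0.0% | No Adversarial Found | No Adversarial Found | 0.0% | 100.0% | 93.7% | 97.0% | 55.7% | 2.8% |
| TA | TMIFGSM | 91.6% | 1.0% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 89.6% | 22.7% | 10.4% |
| TA | JSMA | 74.1% | 5.1% | 90.2% | 93.5% | 0.5% | 100.0% | 100.0% | 92.9% | 10.9% | 7.1% |
| TA | BLB | 100.0% | 6.4% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 99.8% | 61.6% | 0.2% |
| TA | CW2 | 100.0% | 11.6% | 100.0% | 100.0% | 0.0% | 100.0% | 100.0% | 99.7% | 97.8% | 0.3% |
| TA | EAD | 98.9% | 6.2% | 98.4% | 100.0% | 0.1% | 100.0% | 77.5% | 97.2% | 95.4% | 2.2% |

TABLE I: Performance of the undefended model and the proposed method against known attacks. Description of the measurements is in Section VI-F1.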

VI-D Attack Setup

We used 18 attacks in total. 10 of them are un-targeted attacks: Fast Gradient Sign Method (FGSM) [10], Random perturbation with FGSM (RFGSM) [39], Basic Iterative Method (BIM) [19], Projected Gradient Descent attack (PGD) [24], Un-targeted Momentum Iterative FGSM (UMIFGSM) [6], Universal Adversarial Perturbations (UAP) [27], DeepFool (DF) [28], OptMargin (OM) [15], Backward Pass Differentiable Approximation (BPDA) [1] and Simultaneous Perturbation Stochastic Approximation (SPSA) [40]. 8 of them are targeted attacks: Least Likely Class attack (LLC) [19], Random perturbation with LLC (RLLC) [39], Iterative LLC attack (ILLC) [19], Targeted Momentum Iterative FGSM (TMIFGSM) [6], Box-constrained L-BFGS attack (BLB) [38], Jacobian-based Saliency Map Attack (JSMA) [30], Carlini and Wagner’s attack (CW) [3] and Elastic-net Attacks to DNNs (EAD) [5].

The experiment is conducted in two settings. We use the settings in DEEPSEC [22] to compare our approach with LID, FS and MagNet, and the setup is summarized in Table VII. For comparison with Trapdoor, we use the settings reported by Shan et al.  [35], which are summarized in Table VIII.

Since the above attacks were proposed prior to this paper, although the attackers have access to the model parameters of , they do not exploit the way is constructed from the two models and .

VI-E On clean data

The performance on clean data (10,000 testing images) is reported in Table II. As expected, the classification accuracy of the proposed attractor-embedded model is close to the victim model .

| Model | MNIST | CIFAR-10 |
|---|---|---|
| Victim model | 98.1% | 90.1% |
| Attractors-embedded model | 98.0% | 90.0% |

TABLE II: Performance of the proposed model , and the victim model on clean samples.
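MNIST: detection-only defenses. Each cell lists TPR (a) / FPR (b) / AUC (c) / overall attack success rate (d).

| UA/TA | Attack | Attractor-embedded model | LID | FS | MagNet | Trapdoor |
|---|---|---|---|---|---|---|
| UA | FGSM | 100.0% / 5.0% / 100.0% / 0.0% | 73.0% / 3.6% / 93.7% / 8.2% | 96.1% / 4.9% / 99.1% / 1.2% | 100.0% / 6.6% / 100.0% / 0.0% | 100.0% / 5.0% / 100% / 0.0% |
| UA | RFGSM | 100.0% / 5.0% / 100.0% / 0.0% | 70.2% / 4.1% / 94.5% / 10.2% | 97.7% / 3.5% / 99.5% / 0.8% | 100.0% / 3.5% / 100.0% / 0.0% | - / - / - / - |
| UA | BIM | 100.0% / 5.0% / 100.0% / 0.0% | 10.4% / 4.2% / 60.2% / 67.7% | 92.7% / 3.7% / 98.7% / 5.5% | 100.0% / 3.7% / 100.0% / 0.0% | - / - / - / - |
| UA | PGD | 100.0% / 5.0% / 100.0% / 0.0% | 10.3% / 4.1% / 54.8% / 73.9% | 96.1% / 3.4% / 99.5% / 3.2% | 100.0% / 3.6% / 100.0% / 0.0% | - / - / - / - |
| UA | PGD* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| UA | UMIFGSM | 100.0% / 5.0% / 100.0% / 0.0% | 22.7% / 4.1% / 67.6% / 54.4% | 90.5% / 3.6% / 98.4% / 6.7% | 100.0% / 3.7% / 100.0% / 0.0% | - / - / - / - |
| UA | UAP | 66.7% / 5.0% / 83.3% / 0.1% | 87.8% / 4.6% / 97.5% / 3.7% | 99.7% / 5.0% / 99.6% / 0.1% | 100.0% / 4.0% / 100.0% / 0.0% | - / - / - / - |
| UA | DeepFool | 100.0% / 5.0% / 100.0% / 0.0% | 84.1% / 2.9% / 98.0% / 15.9% | 99.9% / 4.0% / 99.6% / 0.1% | 80.5% / 3.6% / 94.8% / 19.5% | - / - / - / - |
| UA | OM | 87.6% / 5.0% / 92.8% / 12.0% | 60.7% / 3.0% / 90.0% / 39.3% | 94.0% / 3.7% / 99.1% / 6.0% | 91.3% / 3.7% / 97.0% / 8.7% | - / - / - / - |
| UA | BPDA | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| UA | SPSA | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| TA | LLC | No Adversarial Found; (d) 0.0% | 87.5% / 3.6% / 91.1% / 0.7% | 100.0% / 7.1% / 99.7% / 0.0% | 100.0% / 1.8% / 100.0% / 0.0% | - / - / - / - |
| TA | RLLC | No Adversarial Found; (d) 0.0% | 95.0% / 5.0% / 85.3% / 0.2% | 100.0% / 2.5% / 100.0% / 0.0% | 100.0% / 2.5% / 100.0% / 0.0% | - / - / - / - |
| TA | ILLC | No Adversarial Found; (d) 0.0% | 64.8% / 5.9% / 89.2% / 20.9% | 99.7% / 3.9% / 100.0% / 0.2% | 100.0% / 5.2% / 100.0% / 0.0% | - / - / - / - |
| TA | TMIFGSM | 100.0% / 5.0% / 100.0% / 0.0% | 52.7% / 3.5% / 89.9% / 40.9% | 99.3% / 3.0% / 99.9% / 0.6% | 100.0% / 4.5% / 100.0% / 0.0% | - / - / - / - |
| TA | JSMA | 90.2% / 5.0% / 93.2% / 0.5% | 69.1% / 5.6% / 92.8% / 23.6% | 100.0% / 3.2% / 99.6% / 0.0% | 84.0% / 5.0% / 95.3% / 12.2% | - / - / - / - |
| TA | BLB | 100.0% / 5.0% / 100.0% / 0.0% | 77.5% / 5.9% / 94.7% / 22.5% | 99.7% / 4.8% / 99.5% / 0.3% | 98.2% / 3.7% / 99.1% / 1.8% | - / - / - / - |
| TA | CW2 | 100.0% / 5.0% / 100.0% / 0.0% | 93.9% / 3.4% / 99.2% / 6.1% | 100.0% / 3.0% / 99.6% / 0.0% | 80.5% / 3.7% / 94.5% / 19.4% | - / - / - / - |
| TA | CW2* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 97.2% / 5.0% / 99% / - |
| TA | EAD | 98.4% / 5.0% / 99.3% / 0.1% | 92.0% / 3.5% / 98.5% / 8.0% | 100.0% / 3.5% / 99.4% / 0.0% | 75.8% / 4.4% / 92.3% / 24.2% | - / - / - / - |
| TA | EAD* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 98.0% / 5.0% / 99% / - |

CIFAR-10: detection-only defenses. Each cell lists TPR (a) / FPR (b) / AUC (c) / overall attack success rate (d).

| UA/TA | Attack | Attractor-embedded model | LID | FS | MagNet | Trapdoor |
|---|---|---|---|---|---|---|
| UA | FGSM | 100.0% / 5.0% / 99.9% / 0.0% | 100.0% / 5.1% / 100.0% / 0.0% | 9.5% / 2.9% / 82.6% / 81.2% | 99.1% / 4.7% / 93.5% / 0.8% | 100.0% / 5.0% / 100% / 0.0% |
| UA | RFGSM | 100.0% / 5.0% / 100.0% / 0.0% | 100.0% / 2.9% / 100.0% / 0.0% | 6.0% / 4.8% / 70.7% / 78.7% | 33.3% / 3.2% / 83.2% / 55.8% | - / - / - / - |
| UA | BIM | 99.9% / 5.0% / 100.0% / 0.0% | 94.6% / 2.9% / 99.1% / 5.4% | 1.6% / 4.5% / 25.5% / 98.4% | 1.8% / 4.2% / 53.0% / 98.2% | - / - / - / - |
| UA | PGD | 100.0% / 5.0% / 100.0% / 0.0% | 99.9% / 3.5% / 100.0% / 0.1% | 0.4% / 3.8% / 16.5% / 99.6% | 3.2% / 4.3% / 59.2% / 96.8% | - / - / - / - |
| UA | PGD* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| UA | UMIFGSM | 91.7% / 5.0% / 97.7% / 8.3% | 100.0% / 3.0% / 100.0% / 0.0% | 1.8% / 4.1% / 23.8% / 98.2% | 6.3% / 4.1% / 57.1% / 93.7% | - / - / - / - |
| UA | UAP | 94.4% / 5.0% / 97.2% / 4.9% | 100.0% / 5.3% / 100.0% / 0.0% | 2.9% / 3.8% / 76.3% / 82.8% | 99.5% / 5.9% / 94.9% / 0.4% | - / - / - / - |
| UA | DeepFool | 98.1% / 5.0% / 98.6% / 1.9% | 9.2% / 5.7% / 64.0% / 90.8% | 1.5% / 3.9% / 86.3% / 98.5% | 21.5% / 2.8% / 81.0% / 78.5% | - / - / - / - |
| UA | OM | 81.9% / 5.0% / 88.3% / 18.1% | 8.8% / 4.9% / 65.1% / 91.2% | 25.0% / 3.8% / 89.0% / 75.0% | 46.4% / 3.9% / 78.7% / 53.6% | - / - / - / - |
| UA | BPDA | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| UA | SPSA | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 100.0% / 5.0% / 100% / 0.0% |
| TA | LLC | 98.1% / 5.0% / 99.9% / 0.1% | 100.0% / 1.5% / 100.0% / 0.0% | 3.7% / 9.0% / 73.5% / 12.9% | 100.0% / 6.7% / 91.8% / 0.0% | - / - / - / - |
| TA | RLLC | 100.0% / 5.0% / 100.0% / 0.0% | 99.0% / 5.7% / 99.2% / 0.3% | 11.7% / 5.1% / 71.0% / 27.8% | 31.4% / 3.8% / 81.2% / 21.6% | - / - / - / - |
| TA | ILLC | 97.0% / 5.0% / 97.8% / 2.8% | 79.2% / 5.3% / 96.1% / 20.8% | 51.7% / 3.3% / 83.9% / 48.3% | 2.6% / 4.7% / 61.2% / 97.4% | - / - / - / - |
| TA | TMIFGSM | 89.6% / 5.0% / 95.2% / 10.4% | 100.0% / 5.8% / 100.0% / 0.0% | 10.0% / 3.8% / 45.0% / 90.0% | 10.4% / 3.8% / 57.9% / 89.6% | - / - / - / - |
| TA | JSMA | 92.9% / 5.0% / 97.2% / 7.1% | 71.5% / 3.4% / 94.4% / 28.4% | 20.6% / 3.7% / 91.7% / 79.2% | 53.2% / 5.3% / 92.3% / 46.7% | - / - / - / - |
| TA | BLB | 99.8% / 5.0% / 99.9% / 0.2% | 13.0% / 3.1% / 72.3% / 87.0% | 1.7% / 4.1% / 89.3% / 98.3% | 52.5% / 4.3% / 81.6% / 47.5% | - / - / - / - |
| TA | CW2 | 99.7% / 5.0% / 99.8% / 0.3% | 19.9% / 3.8% / 77.6% / 80.1% | 0.9% / 3.7% / 88.1% / 99.1% | 38.4% / 4.4% / 81.8% / 61.6% | - / - / - / - |
| TA | CW2* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 96.2% / 5.0% / 97% / - |
| TA | EAD | 97.2% / 5.0% / 98.3% / 2.2% | 17.2% / 4.0% / 73.8% / 82.8% | 1.9% / 3.5% / 89.8% / 98.1% | 54.2% / 5.0% / 82.1% / 45.8% | - / - / - / - |
| TA | EAD* | 100.0% / 5.0% / 100.0% / 0.0% | - / - / - / - | - / - / - / - | - / - / - / - | 95.0% / 5.0% / 97% / - |

TABLE III: We compare the adversarial detection rate of attractor-embedded model in forensic setting with LID, FS, MagNet and Trapdoor. Description of the settings is in Section VI-F2.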

VI-F Forensic Setting

Under the forensic setting, the adversary has white-box access to the attractor-embedded model . The evaluation results are reported in Table I and Table III.

Performance of Proposed Method

Table I shows the overall performance of our approach on both MNIST and CIFAR-10. The important measurements are the attack success rate, detection rate, recovery rate and overall attack success rate, which are described as follows.

  • Column (a) indicates whether the attack is targeted or un-targeted. Column (b) shows the names of the attacks.

  • Columns (c) and (d) show the attack success rate on the undefended victim model , and the attractor-embedded model respectively. We used the whole testing dataset and applied the corresponding attack on each testing image. If the attack finds an adversarial sample for a testing image, it is counted as one success. The attack success rate is measured differently for un-targeted, LLC and targeted attacks. For un-targeted attacks, an attack is counted as successful if the adversarial sample is misclassified into any class other than the correct class. For LLC, an attack is successful if the adversarial sample is misclassified into the least likely class. For targeted attacks, an attack is successful if the adversarial sample is misclassified into a randomly chosen intended target class.

  • Column (e) shows the detection rate of the detection-only forensic analyzer on successful adversarial samples. That is, suppose is the set of adversarial samples found by the attack (w.r.t. ), and is the set of adversarial samples that are detected by , then the detection rate is .

  • Column (f) shows the recovery rate of the recovery forensic analyzer on adversarial samples successfully detected by . Specifically, let be the set as defined in previous paragraph, and the set of adversarial samples that correctly recover the class, then the recovery rate is .

  • Column (g) shows the overall attack success rate, which is the percentage of successful and undetected attacks among all attack attempts. Specifically, column (g) = (d) × (1 − (e)), i.e.

    overall attack success rate = attack success rate × (1 − detection rate).

    Overall attack success rate is a fairer measurement compared to detection rate. To see that, consider the case where a model is effective in confusing an attack and very few adversarial samples are found by the attack, but the found adversarial samples are the “difficult” samples to be detected by the detector . In this case, the detection rate is very low, but overall, it is difficult for the attack to find undetectable adversarial samples. In contrast, the overall attack success rate is low in this case, and fairly reflects the effectiveness of this attack against the defence methods.

The last five columns are the same as columns (c) to (g) but on the CIFAR-10 dataset.
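
The overall attack success rate can be reproduced from the other two columns. The sketch below reads it as the fraction of attack attempts that are both successful and undetected, i.e. attack success rate times one minus detection rate; this reading matches, for example, the OM row on MNIST in Table I (97.0% success, 87.6% detection, 12.0% overall). The function name and the second scenario are illustrative.

```python
def overall_attack_success_rate(attack_success_rate, detection_rate):
    """Fraction of attempts that succeed AND escape detection (rates in [0, 1])."""
    for rate in (attack_success_rate, detection_rate):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie between 0 and 1")
    return attack_success_rate * (1.0 - detection_rate)


# OM on MNIST, Table I: column (d) = 97.0%, column (e) = 87.6%.
om_mnist = overall_attack_success_rate(0.970, 0.876)

# Hypothetical scenario from the discussion above: the attack rarely succeeds
# (1%), and the few samples it finds are hard to detect (20% detection rate).
rare_but_stealthy = overall_attack_success_rate(0.01, 0.20)
```

In the second scenario the detection rate looks poor in isolation, yet fewer than 1% of attempts produce an undetected adversarial sample, which is why the overall rate is the more informative number.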

Comparison with Known Defence

Table III shows the comparison of the performance of our model against the known defence mechanisms LID, FS, MagNet and Trapdoor (see Section II-B).

  • Column (a) is the true positive rate (TPR), which is the detection rate on successful adversarial inputs that get misclassified.

  • Column (b) is the false positive rate (FPR) which is the percentage of clean samples which get wrongly detected as adversarial for all the samples in the test dataset.

  • Column (c) is area under the curve (AUC) which is a threshold-independent benchmark for the detection performance.

  • Column (d) is the overall attack success rate as defined in the previous section.

We adopt and follow the definitions of TPR, FPR and AUC in DEEPSEC [22]. The performance of known mechanisms in Table III is obtained⁴ from results reported in DEEPSEC [22] and Trapdoor [35].

For PGD, CW2 and EAD, experiments are carried out in two different settings, where one setting is the same as DEEPSEC, and the other is from Shan et al. [35].

Decreased attack strength even without forensic

Our experiments show that even without the extra forensic step, the attack success rate on the attractor-embedded model is already significantly lower than on the undefended model . This observation is reflected in Table I, columns (c), (d), (h) and (i). This is consistent with our design. When adversaries follow the direction of the gradient provided by the attractors, they move toward the nearby attractor instead of the decision boundary, and thus may get stuck in a local minimum or incur a larger perturbation, which in turn leads to a lower attack success rate.

Performance on non-gradient based Attacks

We achieve almost perfect detection accuracy for most gradient-based attacks. Note that the non-gradient-based attacks BPDA and SPSA are also not effective against our defence. Although BPDA is successful on defences that break or hide the gradient, our defence uses gradients to deceive the adversary instead of creating a non-back-propagatable function, and is therefore able to trick BPDA. Similarly, although SPSA is a non-gradient-based optimization, taking random small steps indirectly uses information on the soft label’s gradient, and would still converge to the nearby attractor.
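
To illustrate why SPSA still follows the gradient, the sketch below implements the standard simultaneous-perturbation gradient estimate: the function is queried at two randomly perturbed points, and the difference is divided by the perturbation. Averaged over many random perturbations, the estimate approaches the true gradient. The objective `f`, the point `x` and the parameter values are hypothetical, and the attack implementation used in the experiments may differ in its details.

```python
import random


def spsa_gradient(f, x, c=0.01, samples=1, rng=None):
    """Estimate the gradient of f at x from function values only.

    Each sample draws a random +/-1 direction `delta`, evaluates f at
    x + c*delta and x - c*delta, and divides the difference by 2*c*delta_i.
    """
    rng = rng or random.Random()
    estimate = [0.0] * len(x)
    for _ in range(samples):
        delta = [rng.choice((-1.0, 1.0)) for _ in x]
        plus = f([xi + c * d for xi, d in zip(x, delta)])
        minus = f([xi - c * d for xi, d in zip(x, delta)])
        diff = (plus - minus) / (2.0 * c)
        for i, d in enumerate(delta):
            estimate[i] += diff / d / samples
    return estimate


def f(v):
    """Toy objective with known gradient 2*v."""
    return sum(vi * vi for vi in v)


x = [1.0, -2.0, 3.0]
approx = spsa_gradient(f, x, samples=20000, rng=random.Random(0))
exact = [2.0 * xi for xi in x]
```

Because the estimate converges to the true gradient, an attractor that bends the gradient also bends the search direction of such a query-based attack.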

Attractors of Multi-Scale Gradients

The attacks MI-FGSM, JSMA and UAP indirectly carry out some forms of gradient averaging in deciding the perturbation: MI-FGSM uses the gradient of previous iterations to avoid falling into the local minimum, JSMA saturates only a few pixels based on the saliency map, and UAP searches for a universal perturbation through averaging, that can be applied to most of the samples.

In a certain sense, such attacks make decisions based on gradients at a lower scale in the multi-scale gradients representation. Hence, to address such attacks, we should have a mixture of attractors catering for attack-loss functions at different scales. Our implementation achieves this by controlling the interval size and the weight for each pixel, where a larger corresponds to a lower scale, and a larger weight corresponds to a larger emphasis on the corresponding scale. In our experiment, we use two interval sizes, 3 and 128, and give more weight to the larger interval size. Empirically, this choice achieves good performance. It would be interesting to find an analytical approach to determine the scale.
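
The toy computation below illustrates one way to read the role of the interval size. It measures the slope of a per-pixel QIM distance, averaged over a window of neighbouring pixel values, for the two interval sizes 3 and 128. The codebook layout (bit 0 at integer multiples of the step size) and the averaging window are assumptions chosen for illustration, and the helper names are hypothetical.

```python
def pixel_distance(value, bit, step):
    """Distance to the nearest codeword for `bit`.

    Assumed codebook: bit 0 at k*step, bit 1 at k*step + step/2 (integer k).
    """
    offset = 0.0 if bit == 0 else step / 2.0
    shifted = (value - offset) % step
    return min(shifted, step - shifted)


def averaged_slope(value, bit, step, window, h=0.5):
    """Mean finite-difference slope of the distance over value +/- window."""
    total, count = 0.0, 0
    v = value - window
    while v <= value + window:
        rise = pixel_distance(v + h, bit, step) - pixel_distance(v - h, bit, step)
        total += rise / (2.0 * h)
        count += 1
        v += 1.0
    return total / count


# Slope around pixel value 40, averaged over 17 neighbouring values.
fine = averaged_slope(40.0, 0, step=3, window=8.0)      # interval size 3
coarse = averaged_slope(40.0, 0, step=128, window=8.0)  # interval size 128
```

With interval size 3 the slope changes sign every 1.5 units, so it nearly cancels under averaging, whereas with interval size 128 it keeps a consistent direction across the window. Under these assumptions, only the coarse attractors still steer an attack that averages gradients.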

Performance at Low False Positive Rate

Our experiments follow the settings in DEEPSEC, where the false positive rates (FPR) of LID, FS, MagNet as well as Trapdoor are set to be around 5%. However, there are application scenarios where a 5% FPR would generate too many false alarms and is not acceptable. We conduct another experiment on our proposed method where the FPR is adjusted to 0.5%, and the results are shown in Table IV. Note that the proposed method still achieves good performance.
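
Fixing the FPR amounts to choosing the detection threshold from clean data alone. The sketch below does this for a single anomaly score per sample: it picks the threshold so that at most the target fraction of clean scores exceeds it, then measures the TPR on adversarial scores. Reducing the detector to one score, as well as the function names and the synthetic scores, are simplifying assumptions; the proposed analyzer uses several thresholds.

```python
def threshold_for_fpr(clean_scores, target_fpr):
    """Smallest threshold such that at most target_fpr of clean scores exceed it."""
    ordered = sorted(clean_scores)
    allowed = int(target_fpr * len(ordered) + 1e-9)  # clean samples we may flag
    if allowed == 0:
        return ordered[-1]
    return ordered[-allowed - 1]


def rate_above(scores, threshold):
    """Fraction of scores strictly above the threshold (FPR or TPR)."""
    return sum(1 for s in scores if s > threshold) / len(scores)


# Synthetic scores: clean samples score 0..999, adversarial samples 900..1099.
clean = list(range(1000))
adversarial = list(range(900, 1100))

t_5 = threshold_for_fpr(clean, 0.05)     # 5% FPR operating point
t_05 = threshold_for_fpr(clean, 0.005)   # 0.5% FPR operating point
tpr_5 = rate_above(adversarial, t_5)
tpr_05 = rate_above(adversarial, t_05)
```

Lowering the target FPR raises the threshold, so the TPR can only stay the same or drop, which is the trade-off that Table IV quantifies.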

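| UA/TA | Attack | MNIST, 5% FPR: TPR | MNIST, 5% FPR: overall attack success rate | MNIST, 0.5% FPR: TPR | MNIST, 0.5% FPR: overall attack success rate | CIFAR-10, 5% FPR: TPR | CIFAR-10, 5% FPR: overall attack success rate | CIFAR-10, 0.5% FPR: TPR | CIFAR-10, 0.5% FPR: overall attack success rate |
|---|---|---|---|---|---|---|---|---|---|
| UA | FGSM | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% |
| UA | RFGSM | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% |
| UA | BIM | 100.0% | 0.0% | 100.0% | 0.0% | 99.9% | 0.0% | 99.9% | 0.1% |
| UA | PGD | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% | 99.9% | 0.1% |
| UA | UMIFGSM | 100.0% | 0.0% | 100.0% | 0.0% | 91.7% | 8.3% | 86.3% | 13.7% |
| UA | UAP | 66.7% | 0.1% | 66.7% | 0.1% | 94.4% | 4.9% | 93.6% | 5.6% |
| UA | DeepFool | 100.0% | 0.0% | 100.0% | 0.0% | 98.1% | 1.9% | 98.0% | 2.0% |
| UA | OM | 87.6% | 12.0% | 85.6% | 14.0% | 81.9% | 18.1% | 81.1% | 18.9% |
| UA | BPDA | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% | 99.9% | 0.1% |
| UA | SPSA | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% | 100.0% | 0.0% |
| TA | LLC | No Adv Found | 0.0% | No Adv Found | 0.0% | 98.1% | 0.1% | 98.1% | 0.1% |
| TA | RLLC | No Adv Found | 0.0% | No Adv Found | 0.0% | 100.0% | 0.0% | 98.1% | 0.3% |
| TA | ILLC | No Adv Found | 0.0% | No Adv Found | 0.0% | 97.0% | 2.8% | 90.7% | 8.7% |
| TA | TMIFGSM | 100.0% | 0.0% | 100.0% | 0.0% | 89.6% | 10.4% | 81.1% | 18.9% |
| TA | JSMA | 90.2% | 0.5% | 88.2% | 0.6% | 92.9% | 7.1% | 90.1% | 9.9% |
| TA | BLB | 100.0% | 0.0% | 100.0% | 0.0% | 99.8% | 0.2% | 99.6% | 0.4% |
| TA | CW2 | 100.0% | 0.0% | 100.0% | 0.0% | 99.7% | 0.3% | 99.7% | 0.3% |
| TA | EAD | 98.4% | 0.1% | 98.4% | 0.1% | 97.2% | 2.2% | 97.2% | 2.2% |

TABLE IV: Detection performance at 0.5% FPR.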

VI-G Integrated Setting

In the integrated setting, the separate forensic step is included in the white-box and exposed to the attacks. Previously in the forensic setting, an attacker likely stops near an attractor, since the attacker is being fed with the “wrong” information that the stopping location is an adversarial sample. However, in the integrated setting, since the recovered class is exposed, the attacker will know that the attack is not successful and may backtrack to other search paths.

In this experiment, we conduct attacks on the integrated model, where the attackers have white-box access to the integrated model’s parameters. However, since the attack algorithms are not aware of our approach, they do not exploit how the integrated model is combined from various components.

The results are shown in Table V. We also measure the distortion of the successful adversarial samples found by unbounded attacks such as BLB, CW2 and EAD, and the results are shown in Table VI. Note that the successful samples generated on the integrated model have larger distortion than those on the original classifier.

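| UA/TA | Attack | MNIST: original classifier | MNIST: integrated model | CIFAR-10: original classifier | CIFAR-10: integrated model |
|---|---|---|---|---|---|
| UA | FGSM | 90.4% | 0.7% | 88.7% | 5.6% |
| UA | RFGSM | 65.6% | 0.0% | 99.6% | 6.5% |
| UA | BIM | 100% | 0.7% | 100.0% | 5.7% |
| UA | PGD | 100% | 0.0% | 100.0% | 7.1% |
| UA | UMIFGSM | 100% | 0.7% | 100.0% | 6.0% |
| UA | UAP | 24.2% | 15.6% | 93.1% | 39.3% |
| UA | DeepFool | 100.0% | 1.9% | 100.0% | 1.7% |
| UA | OM | 100.0% | 17.5% | 100.0% | 24.9% |
| UA | BPDA | 100.0% | 0.0% | 100.0% | 7.0% |
| UA | SPSA | 96.8% | 0.8% | 90.2% | 10.1% |
| TA | LLC | 7.7% | 0.0% | 13.2% | 2.2% |
| TA | RLLC | 1.6% | 0.0% | 27.6% | 4.0% |
| TA | ILLC | 77.5% | 0.0% | 100.0% | 7.6% |
| TA | TMIFGSM | 91.6% | 1.1% | 100.0% | 29.6% |
| TA | JSMA | 74.1% | 3.1% | 100.0% | 15.0% |
| TA | BLB | 100.0% | 0.6% | 100.0% | 7.1% |
| TA | CW2 | 100.0% | 1.9% | 100.0% | 10.8% |
| TA | EAD | 98.9% | 3.2% | 100.0% | 9.6% |

TABLE V: Attack success rate on integrated model .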
| Model | BLB | CW2 | EAD |
|---|---|---|---|
| Original Classifier | 0.11 | 0.11 | 0.14 |
| Integrated Attractor | 4.81 | 0.39 | 0.27 |

TABLE VI: Average distortion for unbounded attacks on CIFAR-10 dataset.

VII Discussion

VII-A Obfuscation

In the white-box setting where attackers have access to the model , it is crucial that the attackers are unable to extract the model parameters or . If an attacker knows