n-ML: Mitigating Adversarial Examples via Ensembles of Topologically Manipulated Classifiers

This paper proposes a new defense called n-ML against adversarial examples, i.e., inputs crafted by perturbing benign inputs by small amounts to induce misclassifications by classifiers. Inspired by N-version programming, n-ML trains an ensemble of n classifiers, and inputs are classified by a vote of the classifiers in the ensemble. Unlike prior such approaches, however, the classifiers in the ensemble are trained specifically to classify adversarial examples differently, rendering it very difficult for an adversarial example to obtain enough votes to be misclassified. We show that n-ML roughly retains the benign classification accuracies of state-of-the-art models on the MNIST, CIFAR10, and GTSRB datasets, while simultaneously defending against adversarial examples with better resilience than the best defenses known to date and, in most cases, with lower classification-time overhead.

I Introduction

Adversarial examples—minimally and adversarially perturbed variants of benign samples—have emerged as a challenge for machine-learning (ML) algorithms and systems. Numerous attacks that produce inputs that evade correct classification by ML algorithms, and particularly deep neural networks (DNNs), at inference time have been proposed (e.g., [7, 20, 74]). The attacks vary in the perturbation types that they allow and the application domains, but they most commonly focus on adversarial perturbations with bounded Lp-norms to evade ML algorithms for image classification (e.g., [7, 11, 20, 56, 74]). Some attacks minimally change physical artifacts to mislead recognition DNNs (e.g., adding patterns on street signs to mislead street-sign recognition) [18, 69]. Yet others imperceptibly perturb audio signals to evade speech-recognition DNNs [58, 63].

In response, researchers have proposed methods to mitigate the risks of adversarial examples. For example, one method, called adversarial training, augments the training data with correctly labeled adversarial examples (e.g., [28, 36, 47, 67]). The resulting models are often more robust in the face of attacks than models trained via standard methods. However, while defenses are constantly improving, they are still far from perfect. Relative to standard models, defenses often reduce the accuracy on benign samples. For example, methods to detect the presence of attacks sometimes erroneously detect benign inputs as adversarial (e.g., [45, 44]). Moreover, defenses often fail to mitigate a large fraction of adversarial examples that are produced by strong attacks (e.g., [3]).

Inspired by N-version programming, this paper proposes a new defense, termed n-ML, that improves upon the state of the art in its ability to detect adversarial inputs and correctly classify benign ones. Similarly to other ensemble classifiers [46, 73, 83], an n-ML ensemble outputs the majority vote if more than a threshold number of DNNs agree; otherwise the input is deemed adversarial. The key innovation in this work is a novel method, topological manipulation, to train DNNs to achieve high accuracy on benign samples while simultaneously classifying adversarial examples according to specifications that are drawn at random before training. Because every DNN in an ensemble is trained to classify adversarial examples differently than the other DNNs, n-ML is able to detect adversarial examples because they cause disagreement between the DNNs' votes.

We evaluate n-ML using three datasets (MNIST [37], CIFAR10 [35], and GTSRB [72]) and against (mainly) L∞ and also L2 attacks in black-, grey-, and white-box settings. Our findings indicate that n-ML can effectively mitigate adversarial examples while achieving high benign accuracy. For example, for CIFAR10 in the black-box setting, n-ML can achieve 94.50% benign accuracy (vs. 95.38% for the best standard DNN) while preventing all adversarial examples with bounded L∞-norm perturbations created by the best known attack algorithms [47]. In comparison, the state-of-the-art defense achieves 87.24% benign accuracy while being evaded by 14.02% of the adversarial examples. n-ML is also faster than most defenses that we compare against. Specifically, even the slowest variant of n-ML is 45.72× to 199.46× faster at making inferences than other defenses for detecting the presence of attacks.

Our contributions can be summarized as follows:

  • We propose topology manipulation, a novel method to train DNNs to classify adversarial examples according to specifications that are selected at training time, while also achieving high benign accuracy.

  • Using topologically manipulated DNNs, we construct n-ML ensembles to defend against adversarial examples.

  • Our experiments using two perturbation types and three datasets in black-, grey-, and white-box settings show that n-ML is an effective and efficient defense. n-ML roughly retains the benign accuracies of state-of-the-art DNNs, while providing more resilience to attacks than the best defenses known to date, and making inferences faster than most.

We next present the background and related work (Sec. II). Then, we present the technical details behind -ML and topology manipulation (Sec. III). Thereafter, we describe the experiments that we conducted and their results (Sec. IV). We close the paper with a discussion (Sec. V) and a conclusion (Sec. VI).

II Background and Related Work

This section summarizes related work and provides the necessary background on evasion attacks on ML algorithms and defenses against them.

II-A Evading ML Algorithms

Inputs that are minimally perturbed to fool ML algorithms at inference time—termed adversarial examples—have emerged as a challenge to ML. Attacks to produce adversarial examples typically start from benign inputs and find perturbations with small Lp-norms (typically p ∈ {0, 2, ∞}) that lead to misclassification when added to the benign inputs (e.g., [6, 7, 11, 20, 51, 56, 74, 80]). Keeping the perturbations' Lp-norms small helps ensure the attacks' imperceptibility to humans, albeit imperfectly [64, 68]. The process of finding adversarial perturbations is usually formalized as an optimization problem. For example, Carlini and Wagner [11] proposed the following formulation to find adversarial perturbations that target class t:

    min_δ ‖δ‖_p + c · f_t(x + δ)

where x is a benign input, δ is the perturbation, and c is a constant that helps tune the Lp-norm of the perturbation. f_t is roughly defined as:

    f_t(x + δ) = max{ max_{i≠t} Z(x + δ)_i − Z(x + δ)_t , 0 }

where Z(·)_i is the output for class i at the logits of the DNN—the output of the one-before-last layer. Minimizing f_t leads x + δ to be (mis)classified as t. As this formulation targets a specific class t, the resulting attack is commonly referred to as a targeted attack. In the case of evasion where the aim is to produce an adversarial example that is misclassified as any class but the true class (commonly referred to as an untargeted attack), f is defined as:

    f_y(x + δ) = max{ Z(x + δ)_y − max_{i≠y} Z(x + δ)_i , 0 }

where y is the true class of x. We use f as a loss function to fool DNNs.

The Projected Gradient Descent (PGD) attack is considered the strongest first-order attack (i.e., an attack that uses gradient descent to find adversarial examples) [47]. Starting from a random point close to a benign input, PGD consistently finds perturbations with constrained L2- or L∞-norms that achieve roughly the same loss value. Given a benign input x, a loss function, say f, a target class t, a step size α, and an upper bound ε on the norm, PGD iteratively updates the adversarial example x^r until a maximum number of iterations is reached such that:

    x^{r+1} = Π_ε( x^r − α · sign(∇_x f(x^r, t)) )

where Π_ε projects vectors onto the ε-ball around x (e.g., by clipping for the L∞-norm), and ∇_x denotes the gradient of the loss function with respect to the input. PGD starts from a random point in the ε-ball around x, and stops after a fixed number of iterations (20–100 iterations are typical [47, 67]). In this work, we rely on PGD to produce adversarial examples.
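The PGD loop can be sketched as follows for the L∞ case. This is an illustrative sketch, not the authors' implementation: `grad_fn` is a placeholder returning the gradient of whatever loss the attacker ascends (for a targeted attack one would descend, i.e., flip the sign), and inputs are assumed to live in [0, 1]:

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, alpha, iters=40, rng=None):
    """L-infinity PGD sketch: random start in the eps-ball, signed
    gradient steps of size alpha, and re-projection onto the ball
    around x by clipping after every step."""
    rng = rng or np.random.default_rng(0)
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the L-inf ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv
```

With a constant positive gradient, every coordinate is pushed to the edge of the ε-ball, which makes the projection step easy to check.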

Early on, researchers noticed that adversarial examples computed against one ML model are likely to be misclassified by other models performing the same task [20, 54, 55]. This phenomenon—transferability—serves as the basis of non-interactive attacks in black-box settings, where attackers do not have access to the attacked models and may not be able to query them. The accepted explanation for the transferability phenomenon, at least between DNNs performing the same task, is that the gradients (which are used to compute adversarial examples) of each DNN are good approximators of those of other DNNs [16, 26]. Several techniques can be used to enhance the transferability of adversarial examples [17, 42, 76, 82]. The techniques include computing adversarial perturbations against ensembles of DNNs [42, 76] and misleading the DNNs after transforming the input (e.g., by translation or resizing) [17, 82] as a means to avoid computing perturbations that generalize beyond a single DNN. We leverage these techniques to create strong attacks against which we evaluate our proposed defense.

II-B Defending ML Algorithms

Researchers proposed a variety of defenses to mitigate ML algorithms’ vulnerability to adversarial examples. Roughly speaking, defenses can be categorized as: adversarial training, certified defenses, attack detection, or input transformation. Similarly to our proposed defense, some defenses leverage randomness or model ensembles. In what follows, we provide an overview of the different categories.

Adversarial Training   Augmenting the training data with correctly labeled adversarial examples that are generated throughout the training process increases models’ robustness to attacks [20, 28, 27, 36, 47, 67, 74]. The resulting training process is commonly referred to as adversarial training. In particular, adversarial training with PGD [47, 67] is one of the most effective defenses to date—we compare our defense against it.

Certified Defenses   Some defenses attempt to certify the robustness of trained ML models (i.e., provide provable bounds on models’ errors for different perturbation magnitudes). Certain certified defenses estimate how DNNs transform Lp-balls around benign examples via convex shapes, and attempt to force classification boundaries to not cross the shapes (e.g., [34, 50, 59]). These defenses are less effective than adversarial training with PGD [61]. Other defenses estimate the output of the so-called smoothed classifier by classifying many variants of the input after adding noise at the input or intermediate layers [14, 38, 41, 60]. The resulting smoothed classifier, in turn, is proven to be robust against perturbations of certain Lp-norms. Unfortunately, such defenses do not provide guarantees against perturbations of other types (e.g., ones bounded in a different Lp-norm), and perform less well against them in practice [14, 38].

Attack Detection   Similarly to our defense, there were past proposals for defenses to detect the presence of attacks [2, 19, 21, 25, 43, 44, 48, 49, 52, 78]. While adaptive attacks have been shown to circumvent some of these defenses [3, 10, 43], detectors often significantly increase the magnitude of the perturbations necessary to evade DNNs and detectors combined [10, 44]. In this work, we compare our proposed defense with detection methods based on Local Intrinsic Dimensionality (LID) [45] and Network Invariant Checking (NIC) [44], which are currently the leading methods for detecting adversarial examples.

The LID detector uses a logistic regression classifier to tell benign and adversarial inputs apart. The input to the classifier is a vector of LID statistics that are estimated for every intermediate representation computed by the DNN. This approach is effective because the LID statistics of adversarial examples are presumably distributed differently than those of benign inputs [45].
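The LID statistic is commonly computed with a maximum-likelihood estimator over an input's distances to its k nearest neighbors in a reference batch. A minimal sketch, assuming Euclidean distances (one common choice; the helper name is ours):

```python
import numpy as np

def lid_mle(x, batch, k=10):
    """MLE estimate of local intrinsic dimensionality:
    -1 / mean(log(r_i / r_k)), where r_1 <= ... <= r_k are the
    distances from x to its k nearest neighbors in `batch`."""
    dists = np.sort(np.linalg.norm(batch - x, axis=1))[:k]
    return -1.0 / np.mean(np.log(dists / dists[-1]))
```

Intuitively, points drawn from a low-dimensional neighborhood yield small estimates; adversarial examples tend to sit in locally higher-dimensional regions, which is what the detector's logistic regression picks up on.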

The NIC detector expects certain invariants in DNNs to hold for benign inputs. For example, it expects the provenance of activations for benign inputs to follow a certain distribution. To model these invariants, a linear logistic regression classifier performing the same task as the original DNN is trained using the representation of every intermediate layer. Then, for every pair of neighboring layers, a one-class Support Vector Machine (oc-SVM) is trained to model the distribution of the output of the layers’ classifiers on benign inputs. Namely, every oc-SVM receives concatenated vectors of probability estimates and emits a score indicative of how similar the vectors are to the benign distribution. The scores of all the oc-SVMs are eventually combined to derive an estimate for whether the input is adversarial. In this manner, if the output of two neighboring classifiers on an image, say that of a bird, is a pair of identical or similar classes (e.g., ⟨bird, bird⟩), the input is likely to be benign (as the two classes are similar and likely have been observed for benign inputs during training). However, if the output is a pair of dissimilar classes (e.g., ⟨bird, truck⟩), then it is likely that the input is adversarial.
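To make the detection logic concrete, the following toy stand-in replaces NIC's oc-SVM with a trivial mean-plus-radius density model over the concatenated probability vectors of two neighboring layers' classifiers. This is purely an illustration of the idea, not NIC's actual implementation:

```python
import numpy as np

def fit_benign_model(pairs):
    """Fit a trivial density model (mean + 95th-percentile radius)
    over concatenated probability vectors observed on benign inputs.
    In the real NIC detector, a one-class SVM plays this role."""
    data = np.array(pairs, dtype=float)
    mean = data.mean(axis=0)
    radius = np.percentile(np.linalg.norm(data - mean, axis=1), 95)
    return mean, radius

def looks_adversarial(pair, mean, radius):
    """Flag inputs whose layer-pair outputs fall far from the
    benign distribution (e.g., neighboring layers disagree)."""
    return np.linalg.norm(np.array(pair, dtype=float) - mean) > radius
```

A pair whose two halves agree on the same class stays near the benign mean, while a pair whose halves point at dissimilar classes lands far away and is flagged.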

Input Transformation   Certain defenses suggest to transform inputs (e.g., via quantization) to sanitize adversarial perturbations before classification [22, 40, 48, 62, 71, 81, 84]. The transformations often aim to hinder the process of computing gradients for the purpose of attacks. In practice, however, it has been shown that attackers can adapt to circumvent such defenses [4, 3, 24].

Randomness   Defenses often leverage randomness to mitigate adversarial examples. As previously mentioned, some defenses inject noise at inference time at the input or intermediate layers [14, 25, 38, 41, 60]. Differently, Wang et al. train a hierarchy of layers, each containing multiple paths, and randomly switch between the chosen paths at inference time [79]. Other defenses randomly drop out neurons, shuffle them, or change their weights, also while making inferences [19, 78]. Differently from all these, our proposed defense uses randomness at training time to control how a set of DNNs classify adversarial examples at inference time and uses the DNNs strategically in an ensemble to deter adversarial examples.

Ensembles   Similarly to our defense, several prior defenses suggested using ensembles to defend against adversarial examples. Abbasi and Gagné proposed to measure the disagreement between DNNs and specialized classifiers (i.e., ones that classify one class versus all others) to detect adversarial examples [2]. An adaptive attack to evade the specialized classifiers and the DNNs simultaneously can circumvent this defense [24]. Vijaykeerthy et al. train DNNs sequentially to be robust against an increasing set of attacks [77]. However, they only use the final model at inference time, while we use an ensemble containing several models for inference. A meta defense by Sengupta et al. strategically selects a model from a pool of candidates at inference time to increase benign accuracy while deterring attacks [65]. This defense is effective against black-box attacks only. Other ensemble defenses propose novel training or inference mechanisms, but do not achieve competitive performance [29, 53, 73].

Recent papers [46, 83, 87] proposed defenses that, similarly to ours, are motivated by N-version programming [5, 12, 15]. In a nutshell, N-version programming aims to provide resilience to bugs and attacks by running N (≥2) variants of independently developed, or diversified, programs. These programs are expected to behave identically for normal (benign) inputs, and differently for unexpected inputs that trigger bugs or exploit vulnerabilities. When one or more programs behaves differently than the others, an unexpected (potentially malicious) input is detected. In the context of ML, defenses that are inspired by N-version programming use ensembles of models that are developed by independent parties [87], different inference algorithms or DNN architectures [46, 83], or models that are trained using different training sets [83]. In all cases, the models are trained via standard training techniques. Consequently, the defenses are often vulnerable to attacks (both in black-box and more challenging settings) as adversarial examples transfer with high likelihood regardless of the inference algorithm or the training data [20, 54, 55]. Moreover, prior work is limited to specific applications (e.g., speech recognition [87]). In contrast, we train the models comprising the ensembles via a novel training technique (see Sec. III), and our work is conceptually applicable to any domain.

III Technical Approach

Here we detail our methodology. We begin by presenting the threat model. Subsequently, we present a novel technique, termed topology manipulation, which serves as a cornerstone for training DNNs that are used as part of the n-ML defense. Last, we describe how to construct an n-ML defense via an ensemble of topologically manipulated DNNs.

III-A Threat Model

Our proposed defense aims to mitigate attacks in black-, grey-, and white-box settings. In the black-box setting, the attacker has no access to the classifier’s parameters and is unaware of the existence of the defense. To evade classification, the attacker attempts a non-interactive attack by transferring adversarial examples from standard surrogate models. Similarly, in the grey-box setting, the attacker cannot access the classifier’s parameters. However, the attacker is aware of the use of the defense and attempts to transfer adversarial examples that are produced against a surrogate defended model. In the white-box setting, the attacker has complete access to the classifier’s and defense’s parameters. Consequently, the attacker can adapt gradient-based white-box attacks (e.g., PGD) to evade classification. We do not consider interactive attacks that query models in a black-box setting (e.g., [8, 9, 13, 26]). These attacks are generally weaker than the white-box attacks that we do consider.

As is typical in the area (e.g., [34, 47, 61]), we focus on defending against adversarial perturbations with bounded Lp-norms. In particular, we mainly consider defending against perturbations with bounded L∞-norms. Additionally, we demonstrate defenses against perturbations with bounded L2-norms as a proof of concept. Conceptually, there is no reason why n-ML should not generalize to defend against other types of attacks.

III-B Topologically Manipulating DNNs

The main building block of the n-ML defense is a topologically manipulated DNN—a DNN that is manipulated at training time to achieve certain topological properties with respect to adversarial examples. Specifically, a topologically manipulated DNN is trained to satisfy two objectives: 1) obtaining high classification accuracy on benign inputs; and 2) misclassifying adversarial inputs following a certain specification. The first objective is important for constructing a well-performing DNN to solve the classification task at hand. The second objective aims to change the adversarial directions of the DNN such that an adversarial perturbation that would normally lead a benign input to be misclassified as class t by a regularly trained DNN would likely lead to misclassification as a different class t′ (t′ ≠ t) by the topologically manipulated DNN. Fig. 1 illustrates the idea of manipulating the topology of a DNN via an abstract example, while Fig. 2 gives a concrete example.

Fig. 1: An illustration of topology manipulation. Left: In a standard DNN, perturbing the benign sample in one direction leads to misclassification as blue (zigzag pattern), while perturbing it in another direction leads to misclassification as red (diagonal stripes). Right: In the topologically manipulated DNN, the first direction leads to misclassification as red, while the second leads to misclassification as blue. The benign samples are correctly classified in both cases (i.e., high benign accuracy).
Fig. 2: A concrete example of topology manipulation. The original image of a horse (a) is adversarially perturbed to be misclassified as a bird (b) and as a ship (c) by standard DNNs. The perturbations, which are bounded in L∞-norm, are shown after magnifying them ×10. We train a topologically manipulated DNN to misclassify (b) as a ship and (c) as a bird, while classifying the original image correctly.

To train a topologically manipulated DNN, two datasets are used. The first dataset, D, is a standard dataset: it contains pairs of benign samples, x, and their true classes, y. The second dataset, D_adv, contains adversarial examples. Specifically, it consists of pairs of targeted adversarial examples, x̂, and the target classes, t. These adversarial examples are produced against reference DNNs that are trained in a standard manner (e.g., to decrease cross-entropy loss). Samples in D are used to train the DNNs to satisfy the first objective (i.e., achieving high benign accuracy). Samples in D_adv, on the other hand, are used to topologically manipulate the adversarial directions of DNNs.

More specifically, to specify the topology of the trained DNN, we use a derangement (i.e., a permutation with no fixed points), d, that is drawn at random over the number of classes, m. This derangement specifies that an adversarial example in D_adv that targets class t should be misclassified as d(t) (≠ t) by the topologically manipulated DNN. For example, for ten classes (i.e., m = 10), the derangement may look like d = (1, 2, 3, 4, 5, 6, 7, 8, 9, 0) (an illustrative cyclic shift). This derangement specifies that adversarial examples targeting class 0 should be misclassified as class 1, ones targeting class 1 should be misclassified as class 2, and so on. For m classes, the number of derangements that we can draw from is known as the subfactorial (denoted !m), and is defined recursively as !m = (m − 1)(!(m − 1) + !(m − 2)), where !1 = 0 and !2 = 1. The subfactorial grows almost as quickly as the factorial (i.e., the number of permutations over a group of size m).
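The subfactorial recursion above, together with a simple rejection-sampling routine for drawing a random derangement, can be sketched as follows (helper names are ours):

```python
import random

def subfactorial(m):
    """!m, the number of derangements of m items:
    !1 = 0, !2 = 1, !m = (m - 1) * (!(m-1) + !(m-2))."""
    if m == 1:
        return 0
    a, b = 0, 1  # !1, !2
    for n in range(3, m + 1):
        a, b = b, (n - 1) * (a + b)
    return b

def random_derangement(m, rng=None):
    """Draw a uniformly random derangement of {0, ..., m-1} by
    rejection sampling (acceptance rate tends to 1/e)."""
    rng = rng or random.Random(0)
    while True:
        p = list(range(m))
        rng.shuffle(p)
        if all(p[i] != i for i in range(m)):
            return p
```

For m = 10 classes, !10 = 1,334,961 derangements are available, versus 10! = 3,628,800 permutations, illustrating how quickly the specification space grows.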

We specify the topology using derangements rather than permutations that may have fixed points because if d contained fixed points, there would exist a class t such that d(t) = t. In such a case, the DNN would be trained to misclassify adversarial examples that target t into t, which would not inhibit an adversary targeting t. Such behavior is undesirable.

We use the standard cross-entropy loss, L, to train topologically manipulated DNNs. Formally, the training process minimizes:

    Σ_{(x, y) ∈ D} L(x, y) + λ · Σ_{(x̂, t) ∈ D_adv} L(x̂, d(t))        (1)

While minimizing the leftmost term increases the benign accuracy (as is usual in standard training processes), minimizing the rightmost term manipulates the topology of the DNN (i.e., forces the DNN to misclassify x̂ as d(t) instead of as t). The parameter λ is a positive real number that balances the two objectives. We tune it via a hyperparameter search.
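The combined objective can be sketched per batch as follows. This is a minimal illustration assuming cross-entropy loss; `probs_fn` is a hypothetical stand-in for a model that maps an input to class probabilities:

```python
import numpy as np

def xent(probs, label):
    """Cross-entropy of a probability vector against a hard label."""
    return -np.log(probs[label])

def nml_loss(probs_fn, benign, adv, derangement, lam):
    """Sketch of the training objective: cross-entropy on benign
    pairs (x, y), plus lam times cross-entropy pushing each
    adversarial example x_hat, originally targeting class t,
    toward the deranged class derangement[t]."""
    benign_term = sum(xent(probs_fn(x), y) for x, y in benign)
    adv_term = sum(xent(probs_fn(xh), derangement[t]) for xh, t in adv)
    return benign_term + lam * adv_term
```

Note how the adversarial term never rewards the original target t itself: the label it trains toward is always d(t) ≠ t.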

Although topologically manipulated DNNs aim to satisfy multiple objectives, it is important to point out that training them does not require significantly more time than training standard DNNs. For training, one first needs to create the dataset D_adv that contains the adversarial examples. This needs to be done only once, as a preprocessing phase. Once D_adv is created, training a topologically manipulated DNN takes the same amount of time as training a standard DNN.

III-C n-ML: An Ensemble-Based Defense

As previously mentioned, n-ML is inspired by N-version programming. While N independent, or diversified, programs are used in an N-version programming defense, an n-ML defense contains an ensemble of n (≥2) topologically manipulated DNNs. As explained above, all the DNNs in the n-ML ensemble are trained to behave identically for benign inputs (i.e., to classify them correctly), while each DNN is trained to follow a different specification for adversarial examples. This opens an opportunity to 1) classify benign inputs accurately; and 2) detect adversarial examples.

In particular, to classify an input x using an n-ML ensemble, we compute the output of all the DNNs in the ensemble on x. Then, if the number of DNNs that agree on a class is above or equal to a threshold τ, the input is classified to the majority class. Otherwise, the n-ML ensemble abstains from classification and the input is marked as adversarial. Formally, denoting the individual DNNs’ classification results by the multiset {c_1, …, c_n}, the n-ML classification function, F, is defined as:

    F(x) = c,  if there exists a class c such that |{ i : c_i = c }| ≥ τ;  ⊥ (adversarial), otherwise

Of course, increasing the threshold increases the likelihood of detecting adversarial examples (e.g., an adversarial example is less likely to be misclassified as the same target class by all n DNNs than by n − 1 DNNs). In other words, increasing τ decreases attacks’ success rates. At the same time, increasing the threshold harms the benign accuracy (e.g., the likelihood of n DNNs agreeing on the correct class is lower than the likelihood of n − 1 DNNs doing so). In practice, we set τ to a value greater than n/2, to avoid ambiguity when computing the majority vote, and at most n, as the benign accuracy is 0 for τ > n.
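The voting rule can be written directly; here `None` stands in for the "adversarial" verdict (⊥):

```python
from collections import Counter

def n_ml_classify(votes, tau):
    """n-ML voting sketch: return the plurality class if at least
    tau of the n DNNs' votes agree on it; otherwise abstain and
    flag the input as adversarial (None)."""
    cls, count = Counter(votes).most_common(1)[0]
    return cls if count >= tau else None
```

For instance, with five DNNs and τ = 4, four votes for the same class classify the input, while a 2-2-1 split is flagged as adversarial.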

Similarly to N-version programming, where the defense becomes more effective when the constituent programs are more independent and diverse [5, 12, 15], an n-ML defense is more effective at detecting adversarial examples when the DNNs are more independent. Specifically, if two DNNs i and j (i ≠ j) are trained with derangements d_i and d_j, respectively, and we are not careful enough, there might exist a class t such that d_i(t) = d_j(t). If so, the two DNNs are likely to classify adversarial examples targeting t in the same manner, thus reducing the defense’s likelihood of detecting attacks. To avoid such undesirable cases, we train the n-ML DNNs (simultaneously or sequentially) while attempting to avoid pairs of derangements that map classes in the same manner to the greatest extent possible. More concretely, if n is lower than the number of classes m, then we draw n derangements that disagree on all indices (i.e., d_i(t) ≠ d_j(t) for all i ≠ j and all t). Otherwise, we split the DNNs into groups of m − 1 (or fewer) DNNs, and for each group we draw derangements that disagree on all indices. For a group of k ≤ m − 1 DNNs, we can draw k derangements such that every pair of derangements disagrees on all indices.
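One simple construction (our illustration, not necessarily the paper's sampling procedure) that yields up to m − 1 derangements pairwise disagreeing on every index is the family of cyclic shifts:

```python
def shift_derangements(m, n):
    """Return n derangements of {0, ..., m-1} (requires n < m):
    the cyclic shifts d_s(i) = (i + s) mod m for s = 1..n.
    Every shift with s >= 1 is a derangement, and two distinct
    shifts disagree on every index, as an n-ML ensemble needs."""
    assert 1 <= n < m
    return [[(i + s) % m for i in range(m)] for s in range(1, n + 1)]
```

Note that m − 1 is the most one can hope for: at any index t, derangements that pairwise disagree must all take distinct values other than t, of which there are only m − 1.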

IV Results

In this section we describe the experiments that we conducted and their results. We initially present the datasets and the standard DNN architectures that we used. Then we describe how we trained individual topologically manipulated DNNs to construct n-ML ensembles, and evaluate the extent to which they met the training objectives. We close the section with experiments to evaluate the n-ML defense in various settings. We ran our experiments with Keras [30] and TensorFlow [1].

IV-A Datasets

We used three popular datasets to evaluate n-ML and other defenses: MNIST [37], CIFAR10 [35], and GTSRB [72]. MNIST is a dataset of 28×28-pixel images of digits (i.e., ten classes). It contains 70,000 images in total, with 60,000 images intended for training and 10,000 intended for testing. We set aside 5,000 images from the training set for validation. CIFAR10 is a dataset of 32×32-pixel images of ten classes: airplanes, automobiles, birds, cats, deer, dogs, frogs, horses, ships, and trucks. The dataset contains 50,000 images for training and 10,000 for testing. We set aside 5,000 images from the training set for validation. Last, GTSRB is a dataset containing traffic signs of 43 classes. The dataset contains 39,209 training images and 12,630 test images. We used 1,960 images that we set aside from the training set for validation. Images in GTSRB vary in size; following prior work [39], we resized them to a fixed resolution.

The three datasets have different properties that made them especially suitable for evaluating n-ML. MNIST has relatively few classes, thus limiting the set of derangements that we could use for topology manipulation for large values of n. At the same time, standard DNNs achieve high classification accuracy on MNIST (99% accuracies are common), hence increasing the likelihood that DNNs in the ensemble would agree on the correct class for benign inputs. CIFAR10 also has relatively few classes. However, differently from MNIST, even the best performing DNNs do not surpass 95% classification accuracy on CIFAR10. Consequently, the likelihood that a large number of DNNs in an ensemble would achieve consensus may be low (e.g., an ensemble consisting of five DNNs with 95% accuracy each that incur independent errors could have benign accuracy as low as 77% if we require all the DNNs to agree). In contrast, GTSRB contains a relatively high number of classes, and standard DNNs often achieve high classification accuracies on this dataset (98%–99% accuracies are common). As a result, there is a large space from which we could draw derangements for topology manipulation, and we expected high benign accuracies from n-ML ensembles.
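The ~77% figure can be checked directly: with an ensemble of five DNNs that are each 95% accurate and err independently, requiring unanimity means benign accuracy can drop to 0.95^5 (the numbers here just reproduce the text's example):

```python
def unanimous_accuracy(p, n):
    """Benign accuracy when all n DNNs must agree and each is
    independently correct with probability p: all must be right
    simultaneously, i.e., p ** n."""
    return p ** n
```

This is a worst-case illustration; in practice DNN errors are correlated, so the actual drop is smaller, and lower thresholds τ < n trade detection power for benign accuracy.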

IV-B Training Standard DNNs

Dataset # Architecture Acc.
MNIST 1 Convolutional DNN [11] 99.42%
2 Convolutional DNN [47] 99.28%
3 Convolutional DNN [11] w/o pooling [70] 99.20%
4 Convolutional DNN [31] 99.10%
5 Convolutional DNN [47] w/o pooling [70] 99.10%
6 Multi-layer perceptron [32] 98.56%
CIFAR10 1 Wide-ResNet-22-8 [85] 95.38%
2 Wide-ResNet-28-10 [85] 95.18%
3 Wide-ResNet-16-10 [85] 95.06%
4 Wide-ResNet-28-8 [85] 94.88%
5 Wide-ResNet-22-10 [85] 94.78%
6 Wide-ResNet-16-8 [85] 94.78%
GTSRB 1 Convolutional DNN [39] 99.46%
2 Same as 1, but w/o first branch [39] 99.56%
3 Same as 1, but w/o pooling [70] 99.11%
4 Same as 1, but w/o second branch [39] 99.08%
5 Convolutional DNN [75] 99.00%
6 Convolutional DNN [66] 98.07%
TABLE I: The DNN architectures that we used for the different datasets. The DNNs’ accuracies on the test sets of the corresponding datasets (after standard training) are reported to the right.

The DNNs that we used were based on standard architectures. We constructed the DNNs either exactly the same way as prior work or reputable public projects (e.g., by the Keras team [31]) or by modifying prior DNNs via a standard technique. In particular, we modified certain DNNs following the work of Springenberg et al. [70], who found that it is possible to construct simple, yet highly performing, DNNs by removing pooling layers (e.g., max- and average-pooling) and increasing the strides of convolutional operations.

For each dataset, we trained six DNNs of different architectures—a sufficient number of DNNs to allow us to evaluate n-ML and perform attacks via transferability from surrogate ensembles [42] while factoring out the effect of architectures (see Sec. IV-C and Sec. IV-D). The MNIST DNNs were trained for 20 epochs using the Adadelta optimizer with standard parameters and a batch size of 128 [31, 86]. The CIFAR10 DNNs were trained for 200 epochs with data augmentation (e.g., image rotation and flipping) and training hyperparameters set identically to prior work [85]. The GTSRB DNNs were trained with the Adam optimizer [33], with training hyperparameters and augmentation following prior work [39, 66, 75]. Table I reports the architectures and performance of the DNNs. In all cases, the DNNs achieved comparable performance to prior work.

IV-C Training Individual Topologically Manipulated DNNs

Now we describe how we trained individual topologically manipulated DNNs and report on their performance. Later, in Sec. IV-D, we report on the performance of the n-ML ensembles.

Training   When training topologically manipulated DNNs, we aimed to minimize the loss described in Eqn. 1. To this end, we modified the training procedures that we used to train the standard DNNs in three ways:

  1. We extended each batch of benign inputs with the same number of adversarial examples x̂ drawn from D_adv, and specified that each x̂, originally targeting class t, should be classified as d(t).

  2. In certain cases, we slightly increased the number of training epochs to improve the performance of the DNNs.

  3. We avoided data augmentation for GTSRB, as we found that it harmed the accuracy of (topologically manipulated) DNNs on benign inputs.

To set λ (the parameter that balances the DNN’s benign accuracy and the success of topology manipulation; see Eqn. 1), we performed a hyperparameter search. We experimented with a range of values to find the best trade-off between the n-ML ensembles’ benign accuracy and their ability to mitigate attacks, and selected the value of λ that achieved the highest accuracies at low attack success rates.

To train the best-performing n-ML ensemble, one should select the best performing DNN architectures to train topologically manipulated DNNs. However, since the goal of this paper is to evaluate a defense, we aimed to give the attacker the advantage to assess the resilience of the defense in a worst-case scenario (e.g., so that the attacker could use the better held-out DNNs as surrogates in transferability-based attacks). Therefore, we selected the DNN architectures with the lower benign accuracies to train topologically manipulated DNNs. More specifically, for each dataset, we trained n-ML ensembles by selecting round robin from the architectures shown in rows 4–6 of Table I.

Constructing a dataset of adversarial examples is a key part of training topologically manipulated DNNs. As we aimed to defend against attacks with bounded norms, we used the corresponding attack to produce adversarial examples: for each training sample, we produced adversarial examples targeting every incorrect class. Moreover, we produced adversarial examples with perturbations of different magnitudes to construct n-ML ensembles that can resist attacks of varied strengths. For MNIST, we used the perturbation magnitudes typically considered in prior work [47]; for CIFAR10 and GTSRB, we likewise followed the magnitudes considered in prior work [47, 57]. We ran the attack for 40 iterations, since prior work found that this leads to successful attacks [47, 67]. Additionally, to avoid overfitting to the standard DNNs that were used for training, we used state-of-the-art techniques to enhance the transferability of the adversarial examples, both by making the examples invariant to spatial transformations [17, 82] and by producing them against an ensemble of models [42, 76] (specifically, three standard DNNs of architectures 4–6).
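The construction of the adversarial dataset can be sketched as a loop over training samples and incorrect target classes; `attack(x, t)` below is a hypothetical targeted attack (e.g., an iterative gradient-based procedure run at several perturbation magnitudes):

```python
def adversarial_dataset(samples, labels, num_classes, attack):
    # For each sample of class y, produce one adversarial example per
    # incorrect target class t, recording (example, true class, target).
    dataset = []
    for x, y in zip(samples, labels):
        for t in range(num_classes):
            if t != y:
                dataset.append((attack(x, t), y, t))
    return dataset
```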

For each dataset, we trained a total of 18 topologically manipulated DNNs. Depending on the setting, we used a different subset of the DNNs to construct n-ML ensembles (see Sec. IV-D). The DNNs were split into two sets of nine DNNs each, such that the derangements of every pair of DNNs in the same set disagreed on all indices.
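One simple construction of derangements in which every pair disagrees on all indices is cyclic shifting of the class labels; this is an illustrative assumption, as the paper does not spell out how its derangements were drawn:

```python
def cyclic_derangements(k, n):
    # d_i(y) = (y + i) mod k for i = 1..n; each d_i has no fixed point,
    # and any two distinct shifts disagree on every one of the k classes.
    assert 1 <= n <= k - 1
    return [[(y + i) % k for y in range(k)] for i in range(1, n + 1)]
```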

Evaluation   Each topologically manipulated DNN was trained with two objectives in mind: classifying benign inputs correctly (i.e., high benign accuracy) and classifying adversarial examples as specified by the derangement drawn at training time. Here we evaluate the extent to which the DNNs we trained met these objectives. Note that these DNNs were not meant to be used individually, but instead in the ensembles evaluated in Sec. IV-D.

         | Standard                     || Topologically manipulated
Dataset  | Acc.          | TSR          | Acc.          | TSR         | TSR h/o      | MSR          | MSR h/o
MNIST    | 99.30%±0.09%  | 43.05%±7.97% | 98.66%±0.42%  | 0.01%       | 6.82%±2.45%  | 99.98%±0.02% | 53.23%±14.94%
CIFAR10  | 95.21%±0.13%  | 98.57%±0.66% | 92.93%±0.39%  | 0.01%       | 0.01%        | 99.98%±0.02% | 99.99%±0.01%
GTSRB    | 99.38%±0.19%  | 20.17%±1.48% | 96.99%±1.45%  | 1.20%±0.27% | 1.35%±0.27%  | 52.86%±9.05% | 48.26%±4.68%
TABLE II: The performance of topologically manipulated DNNs compared to standard DNNs. For standard DNNs, we report the average and standard deviation of the (benign) accuracy and the targeting success rate (TSR). TSR is defined as the rate at which the DNN emitted the target class on a transferred adversarial example. For topologically manipulated DNNs, we report the average and standard deviation of the accuracy, the TSR, and the manipulation success rate (MSR). MSR is the rate at which adversarial examples were classified as specified by the derangements drawn at training time. TSR and MSR are reported for adversarial examples produced against the same DNNs used during training or ones produced against held-out (h/o) DNNs.

To measure the benign accuracy, we classified the original (benign) samples from the datasets’ test sets using the 18 topologically manipulated DNNs, as well as the (better-performing) standard DNNs that we held out from training the topologically manipulated DNNs (i.e., architectures 1–3). Table II reports the average and standard deviation of the benign accuracy. Compared to the standard DNNs, the topologically manipulated ones had only slightly lower accuracy (a 0.64%–2.39% average decrease). Hence, we can conclude that the topologically manipulated DNNs were accurate.

Next, we measured the extent to which topology manipulation was successful. To this end, we computed adversarial examples against the DNNs used to train the topologically manipulated DNNs (i.e., architectures 4–6) or against DNNs held out from training (i.e., architectures 1–3). Again, we used the same attack and transferability-enhancement techniques, with the perturbation magnitudes typically considered in prior work, and ran the attack for 40 iterations. For each benign sample, we created adversarial examples targeting every incorrect class. To reduce the computational load, we used a random subset of benign samples from the test sets: 1,024 samples for MNIST and 512 samples for the other datasets.

For constructing robust -ML ensembles, the topologically manipulated DNNs should classify adversarial examples as specified during training, or, at least, differently than the adversary anticipates. We estimated the former via the manipulation success rate (MSR)—the rate at which adversarial examples were classified as specified by the derangements drawn at training time—while we estimated the latter via the targeting success rate (TSR)—the rate at which adversarial examples succeeded at being misclassified as the target class. A successfully trained topologically manipulated DNN should obtain a high MSR and a low TSR.
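The two rates can be computed as follows; for illustration we assume the derangement is applied to the attack's target class (the argument names are hypothetical):

```python
import numpy as np

def targeting_success_rate(preds, targets):
    # fraction of adversarial examples classified as the attacker's target
    return float(np.mean(np.asarray(preds) == np.asarray(targets)))

def manipulation_success_rate(preds, targets, derangement):
    # fraction classified as the derangement specifies for each target class
    spec = np.asarray([derangement[t] for t in targets])
    return float(np.mean(np.asarray(preds) == spec))
```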

Table II presents the average and standard deviation of TSRs and MSRs for topologically manipulated DNNs, as well as the TSRs for standard DNNs. One can immediately see that targeting was much less likely to succeed against a topologically manipulated DNN (average TSR ≤ 6.82%) than against a standard DNN (average TSR ≥ 20.17%, and as high as 98.57%). In fact, across all datasets, and regardless of whether the adversarial examples were computed against held-out DNNs, targeting was at least 6.31× more likely to succeed against standard DNNs than against topologically manipulated DNNs. This confirms that the adversarial directions of topologically manipulated DNNs were vastly different from those of standard DNNs. Furthermore, as reflected in the MSR results, topologically manipulated DNNs were likely to classify adversarial examples as specified by their corresponding derangements. Across all datasets, and regardless of whether the adversarial examples were computed against held-out DNNs, the average likelihood of topologically manipulated DNNs classifying adversarial examples according to specification was at least 48.26%. For example, an average of 99.99% of the adversarial examples produced against the held-out DNNs of CIFAR10 were classified according to specification by the topologically manipulated DNNs.

In summary, the results indicate that the topologically manipulated DNNs satisfied their objectives to a large extent: they accurately classified benign inputs, and their topology with respect to adversarial directions was different from that of standard DNNs, as they often classified adversarial examples according to the specification that was selected at training time.

IV-D Evaluating n-ML Ensembles

Now we describe our evaluation of n-ML ensembles. We begin by describing the experiment setup and reporting the benign accuracy of ensembles of various sizes and thresholds. We then present experiments to evaluate n-ML ensembles in various settings and compare n-ML against other defenses. We finish with an evaluation of the time overhead incurred when deploying n-ML for inference.


The n-ML ensembles that we constructed were composed of the topologically manipulated DNNs described in the previous section. In particular, we constructed ensembles containing five (5-ML), nine (9-ML), or 18 (18-ML) DNNs, as we found ensembles of certain sizes to be more suitable than others at balancing benign accuracy, security, and inference time in different settings. For the five- and nine-DNN ensembles, we selected DNNs whose derangements disagreed on all indices. For the 18-DNN ensembles, we selected all the DNNs. Note that since GTSRB contains a large number of classes, we could have trained 18 DNNs with derangements that disagreed on all indices. However, we avoided doing so to save compute cycles, as the DNNs that we trained performed well despite having derangements that agreed on certain indices.

Other Defenses   We compared n-ML with three state-of-the-art defenses: adversarial training [47, 67] and two detectors of adversarial examples [45, 44].

Both the adversarial training and the detectors use adversarial examples at training time; we set the magnitudes of the adversarial perturbations to the values typically considered for each dataset, as these are the attack magnitudes that defenses attempt to prevent (e.g., [47, 57]).

For adversarial training, we implemented and used the free adversarial training method of Shafahi et al. [67], which adversarially trains DNNs in the same amount of time as standard training. We used it to train four defended DNNs for each dataset: one to be used by the defender, and three to be used for transferring attacks in the grey-box setting (see below). To give the defense an advantage, we used the best-performing architecture for the defender’s DNN (architecture 1 from Table I) and the lowest-performing architectures for the attacker’s DNNs (architectures 4–6). For CIFAR10, as the DNN that we obtained after training did not perform as well as prior works’, we used the adversarially trained DNN released by Madry et al. [47] as the defender’s DNN.

For training the first detector, we used the implementation published by its authors [45]. As described in Sec. II, this detector computes statistics for intermediate representations of inputs and feeds the statistics to a logistic regression classifier to detect adversarial examples. The logistic regression classifier is trained using statistics of benign samples, adversarial examples, and noisy variants of benign samples (created by adding non-adversarial Gaussian noise). We tuned the amount of noise for best performance (i.e., the highest benign accuracy and detection rate of adversarial examples). For CIFAR10 and GTSRB, we trained detectors for DNNs of architecture 1. For MNIST, we trained a detector for the DNN architecture that was used in the original work, architecture 4.

For the second detector, using code that we obtained directly from the authors [44], we trained two detectors per dataset: one to be used by the defender (in all settings), and one by the attacker in the grey-box setting. The defender’s detectors were trained for DNNs of the same architectures as for the first detector. For the attacker, we trained detectors for DNNs of architectures 1, 4, and 2 for MNIST, CIFAR10, and GTSRB, respectively. We selected the attacker’s DNN architectures arbitrarily, and expect that other architectures would perform roughly the same. The oc-SVMs that we trained have Radial Basis Function (RBF) kernels, since these were found to perform best for detection [44].

Attack Methodology   We evaluated n-ML and the other defenses against untargeted attacks, as they are easier to achieve from the attacker’s point of view and more challenging to defend against from the defender’s point of view. For the other defenses, we used typical untargeted attacks with various adaptations depending on the setting. Typical untargeted attacks, however, are unlikely to evade n-ML ensembles: if the target is not specified by the attack, each DNN in the ensemble may classify the resulting adversarial example differently, thus exposing the presence of an attack. To address this, we used a more powerful attack, similar to that of Carlini and Wagner [11]. The attack builds on the targeted attack to generate adversarial examples targeting every possible incorrect class and checks whether one of these adversarial examples is misclassified by a large enough number of DNNs in the ensemble, and so is not detected as adversarial by the n-ML ensemble. Because targeting every possible class increases the computational load of attacks, we used random subsets of the test sets to produce adversarial examples against n-ML: 1,000 samples for MNIST and 512 samples for CIFAR10 and GTSRB. We used the perturbation magnitudes typically considered for each dataset and ran attacks for 40 iterations. Again, we used techniques to attack ensembles and enhance the transferability of attacks [17, 42, 76, 82].
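The ensemble-evasion check at the core of this attack can be sketched as follows; `preds_per_target` is a hypothetical map from each attempted target class to the labels emitted by the ensemble's member DNNs:

```python
from collections import Counter

def evades_ensemble(preds_per_target, true_label, threshold):
    # The attack succeeds if, for some target class, at least `threshold`
    # members agree on a single label that differs from the true label.
    for preds in preds_per_target.values():
        label, votes = Counter(preds).most_common(1)[0]
        if votes >= threshold and label != true_label:
            return True
    return False
```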

When directly attacking a DNN defended by the logistic-regression-based detector [45], we simply produced adversarial examples that were misclassified with high confidence by the DNN while ignoring the defense. This approach is motivated by prior work, which found that high-confidence adversarial examples mislead such detectors with high likelihood [3]. When attacking a DNN defended by the oc-SVM-based detector [44], we created a new DNN by combining the logits of the original DNN and those of the classifiers built on top of every intermediate layer. We found that forcing the original DNN and the intermediate classifiers to (mis)classify adversarial examples in the same manner often led the oc-SVMs to misclassify adversarial examples as benign.

Measures   In the context of adversarial examples, a defense’s success is measured by its ability to prevent adversarial examples while maintaining high benign accuracy (e.g., close to that of a standard classifier). The benign accuracy is the rate at which benign inputs are classified correctly and not detected as adversarial. In contrast, the success rate of attacks is indicative of the defense’s ability to prevent adversarial examples (a high success rate indicates a weak defense, and vice versa). For untargeted attacks, the success rate can be measured by the rate at which adversarial examples are not detected and are classified to a class other than the true class. Note that adversarial training is a method for robust classification, as opposed to detection, and so adversarially trained DNNs always output an estimate of the most likely class (i.e., abstaining from classifying an input that is suspected to be adversarial is not an option).
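A simplified illustration (not the authors' implementation) of the ensemble decision rule and the benign-accuracy measure implied by these definitions:

```python
from collections import Counter

def nml_classify(member_preds, threshold):
    # Return the plurality label if at least `threshold` members agree;
    # otherwise flag the input as adversarial (None = detected).
    label, votes = Counter(member_preds).most_common(1)[0]
    return label if votes >= threshold else None

def benign_accuracy(pred_labels, true_labels):
    # fraction of benign inputs classified correctly and not flagged
    hits = sum(p is not None and p == y for p, y in zip(pred_labels, true_labels))
    return hits / len(true_labels)
```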

We tuned the defenses at inference time to compute different tradeoffs between the above metrics. In the case of n-ML, we computed the benign accuracy and the attacks’ success rates for a range of threshold values. For the logistic-regression-based detector, we evaluated the metrics for different thresholds on the logistic regression’s probability estimates that inputs are adversarial. The oc-SVM-based detector emits scores in arbitrary ranges, where the higher the score, the more likely the input is adversarial; we computed its accuracy and success-rate tradeoffs for thresholds between the minimum and maximum values emitted for benign samples and adversarial examples combined. In all cases, both the benign accuracy and the attacks’ success rates decreased as we increased the thresholds. Adversarial training results in a single model that cannot be tuned at inference time; we report the single operating point that it achieves.


We now present the results of our evaluations, in terms of benign accuracy; resistance to adversarial examples in the black-, grey-, and white-box settings; and overhead to classification performance.

Benign Accuracy   We first report on the benign accuracy of n-ML ensembles. In particular, we were interested in how the accuracy of the ensembles differed from that of single standard DNNs. Ideally, it is desirable to maintain accuracy as close to that of standard training as possible.

Fig. 3: The benign accuracy of n-ML ensembles of different sizes as we varied the thresholds. For reference, we show the average accuracy of a single standard DNN (avg. standard), as well as the accuracy of hypothetical ensembles whose constituent DNNs are assumed to have independent errors and the average accuracy of the topologically manipulated DNNs (indep.). The dotted lines connecting the markers were added to help visualize trends, but do not correspond to actual operating points.

Fig. 3 compares n-ML ensembles’ accuracy with that of standard DNNs, as well as with hypothetical ensembles whose member DNNs have the average accuracy of topologically manipulated DNNs and independent errors. For low thresholds, the accuracy of n-ML was close to the average benign accuracy of standard DNNs. As we increased the thresholds, the accuracy decreased. Nonetheless, it did not decrease as dramatically as for ensembles composed of DNNs with independent errors. For example, the accuracy of an ensemble containing five independent DNNs, each with an accuracy of 92.93% (the average accuracy of topologically manipulated DNNs on CIFAR10), is 69.31% when all DNNs are required to agree. In comparison, 5-ML achieved 84.82% benign accuracy for the same threshold on CIFAR10.
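The 69.31% figure follows from requiring all five independent members to be simultaneously correct, i.e., 0.9293^5 ≈ 0.6931:

```python
def independent_ensemble_accuracy(p, n):
    # probability that n members with independent errors, each with
    # accuracy p, all classify an input correctly (unanimous vote)
    return p ** n
```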

We can conclude that n-ML ensembles were almost as accurate as standard models for low thresholds, and that they did not suffer a dramatic accuracy loss as the thresholds were increased.

Black-box Attacks   In the black-box setting, as the attacker is unaware of the use of defenses and has no access to the classifiers, we used non-interactive transferability-based attacks that transfer adversarial examples produced against standard surrogate models. For n-ML and adversarial training, we used a strong attack by transferring adversarial examples produced against the standard DNNs held out from training the defenses. For the two detectors, we found that transferring adversarial examples produced against the least accurate standard DNNs (architecture 6) was sufficient to evade classification with high success rates.

Fig. 4: Comparison of the defenses’ performance in the black-box setting, with the perturbations’ norms set to the highest magnitudes considered for each dataset. Due to poor performance, the curves of two of the defenses were left out of the CIFAR10 plot after zooming in. The dotted lines connecting the n-ML markers were added to help visualize trends, but do not correspond to actual operating points.

Fig. 4 summarizes the results for the attacks with the highest magnitudes. For n-ML, we report the performance of 5-ML and 9-ML, which we found to perform well in the black-box setting (i.e., they achieved high accuracy while mitigating attacks). It can be seen that n-ML outperformed the other defenses across all the datasets. For example, for CIFAR10, 9-ML achieved 94.50% benign accuracy at a 0.00% attack success rate. In contrast, the second-best defense achieved 87.24% accuracy at a 14.02% attack success rate.

While not shown for the sake of brevity, additional experiments demonstrated that n-ML in the black-box setting performed roughly the same as in Fig. 4 when 1) different perturbation magnitudes were used; 2) individual standard DNNs were used as surrogates to produce adversarial examples; and 3) the same DNNs used to train the topologically manipulated DNNs were used as surrogates.

Grey-box Attacks   In the grey-box setting (where attackers are assumed to be aware of the deployment of defenses, but have no visibility into the parameters of the classifier and the defense), we attempted to transfer adversarial examples produced against surrogate defended classifiers. For n-ML, we evaluated 5-ML and 9-ML, using n-ML ensembles of the same sizes and architectures as surrogates to produce adversarial examples. The derangements used for training the DNNs of the surrogate ensembles were picked independently of the defender’s ensembles (i.e., the derangements could agree with the defender’s derangements on certain indices). For adversarial training, we used three adversarially trained DNNs other than the defender’s as surrogates. For the oc-SVM-based detector, we used standard DNNs and corresponding detectors (different from the defender’s; see above for training details) as surrogates. For the logistic-regression-based detector, we simply produced adversarial examples that were misclassified with high confidence by undefended standard DNNs of architecture 2 (these were architecturally more similar to the defender’s DNNs than the surrogates used in the black-box setting).

Fig. 5: Comparison of the defenses’ performance in the grey-box setting, with the perturbations’ norms set to the highest magnitudes considered for each dataset. The dotted lines connecting the n-ML markers were added to help visualize trends, but do not correspond to actual operating points.

Fig. 5 presents the performance of the defenses against the attacks with the highest magnitudes. Again, we found that n-ML achieved favorable performance compared to the other defenses. In the case of GTSRB, for instance, 9-ML achieved 98.30% benign accuracy at a 1.56% attack success rate. None of the other defenses achieved a similar accuracy while preventing 98.44% of the attacks on GTSRB.

Additional experiments showed that n-ML maintained roughly the same performance as we varied the number of DNNs in the attacker’s surrogate ensembles and the attacks’ magnitudes.

White-Box Attacks   Now we turn our attention to the white-box setting, where attackers are assumed to have complete access to classifiers’ and defenses’ parameters. In this setting, we leveraged the attacker’s knowledge of the classifiers’ and defenses’ parameters to directly optimize the adversarial examples against them.

Fig. 6: Comparison of the defenses’ performance in the white-box setting, with the perturbations’ norms set to the highest magnitudes considered for each dataset. The dotted lines connecting the n-ML markers were added to help visualize trends, but do not correspond to actual operating points.

Fig. 6 shows the results. One can see that, depending on the dataset, n-ML either outperformed the other defenses or achieved performance comparable to the leading defenses. For GTSRB, n-ML significantly outperformed the other defenses: 18-ML achieved a benign accuracy of 86.01%–93.19% at attack success rates of at most 8.20%. No other defense achieved comparable benign accuracy at such low attack success rates. We hypothesize that n-ML was particularly successful for GTSRB because the dataset contains a relatively large number of classes, and so there was a large space from which derangements for topology manipulation could be drawn. The choice of the leading defense for MNIST and CIFAR10 is less clear (some defenses achieved slightly higher benign accuracy, while others were slightly better at preventing attacks), and depends on the need to balance benign accuracy and resilience to attacks at deployment time. For example, 18-ML was slightly better at preventing attacks on MNIST than adversarial training (a 1.30% vs. 3.70% attack success rate), but the latter achieved slightly higher accuracy for the same 18-ML operating point (99.25% vs. 95.22%).

L2-norm Attacks   The previous experiments showed that n-ML ensembles can resist L∞-based attacks in various settings. We performed a preliminary exploration using MNIST to assess whether n-ML could also prevent L2-based attacks. Specifically, we trained 18 topologically manipulated DNNs to construct n-ML ensembles. The training process was the same as before, except that we projected adversarial perturbations onto the L2 balls around benign samples when producing adversarial examples. We created adversarial examples with perturbation magnitudes spanning those we aimed to prevent, following prior work [47]. The resulting topologically manipulated DNNs were accurate (an average accuracy of 98.39%±0.55%).

Fig. 7: Performance of MNIST n-ML ensembles against L2-norm attacks.

Using the models that we trained, we constructed n-ML ensembles of different sizes and evaluated attacks in the black-, grey-, and white-box settings. The evaluation was exactly the same as before, except that we used L2-based attacks to produce adversarial examples. Fig. 7 summarizes the results for 9-ML in the black- and grey-box settings, and for 18-ML in the white-box setting. It can be seen that n-ML was effective at preventing L2-norm attacks while maintaining high benign accuracy. For example, 9-ML achieved 98.56% accuracy at 0% success rates for black- and grey-box attacks, and 18-ML achieved 97.46% accuracy at a 1.40% success rate for white-box attacks.

Overhead   Of course, as n-ML requires running several DNNs for inference instead of only one, using n-ML comes at the cost of increased inference time at deployment. We now show that the overhead is relatively small, especially compared to the two detectors.

To measure the inference time, we sampled 1,024 test samples from every dataset and classified them in batches of 32 using the defended classifiers. We used a batch size of 32 because it is commonly used for inspecting the behavior of DNNs [23]; the trends in the time estimates remained the same with other batch sizes. We ran the measurements on a machine equipped with 64GB of memory and a 2.50GHz Intel Xeon CPU, using a single NVIDIA Tesla P100 GPU.
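The measurement methodology can be sketched as follows; `classify` is a placeholder for a defended classifier's batched inference function:

```python
import time

def mean_batch_inference_ms(classify, samples, batch_size=32):
    # average wall-clock time (in ms) per batch of `batch_size` samples
    batches = [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]
    start = time.perf_counter()
    for batch in batches:
        classify(batch)
    return (time.perf_counter() - start) * 1000 / len(batches)
```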


Dataset  | Standard / adv. trained | 5-ML  | 9-ML  | 18-ML  | Detector [45] | Detector [44]
MNIST    | 0.15ms                  | 1.93× | 2.80× | 4.73×  | 943.47×       | 601.07×
CIFAR10  | 3.72ms                  | 3.57× | 6.07× | 12.07× | 551.88×       | 581.51×
GTSRB    | 0.68ms                  | 5.35× | 8.26× | 16.41× | 852.04×       | 1,457.26×
TABLE III: Defenses’ overhead at inference time. The second column reports the average inference time in milliseconds for batches containing 32 samples for a single (standard or adversarially trained) DNN. The columns to its right report the overhead of the defenses relative to using a single DNN for inference.

The results are shown in Table III. Adversarial training did not incur a time overhead at inference, as it produces a single DNN to be used for inference. Compared to using a single DNN, the inference time with n-ML increased with the ensemble size, though the increase was often sublinear in the number of DNNs. For example, for 18-ML, the inference time increased 4.73× for MNIST, 12.07× for CIFAR10, and 16.41× for GTSRB. Moreover, the increase was significantly less dramatic than for the two detectors (551.88×–943.47× for the first [45] and 581.51×–1,457.26× for the second [44]). There, we observed that the main bottlenecks were computing the statistics for the first detector and classification with the oc-SVMs for the second.

V Discussion

Our experiments demonstrated the effectiveness and efficiency of n-ML against L∞ and L2 attacks in various settings for three datasets. Still, there are some limitations and practical considerations to take into account when deploying n-ML. We discuss these below.

Limitations   Our experiments evaluated n-ML against L∞ and L2 attacks, as is typical in the literature (e.g., [47, 67]). However, in reality, attackers can use other perturbation types to evade n-ML (e.g., adding patterns to street signs to evade recognition [18]). Conceptually, it should be possible to train n-ML to defend against such perturbation types. We defer this evaluation to future work.

As opposed to using a single ML algorithm for inference (e.g., one standard DNN), n-ML requires using n DNNs. As a result, more compute resources and more time are needed to make inferences with n-ML. This may make it challenging to deploy n-ML in settings where compute resources are scarce and close-to-real-time feedback is expected (e.g., face recognition on mobile phones). Nonetheless, it is important to highlight that n-ML makes inferences remarkably faster than state-of-the-art methods for detecting adversarial examples [45, 44], as our experiments showed.

Currently, perhaps the most notable weakness of n-ML is that it is limited to scenarios where the number of classes is large. When the number of classes is small, one cannot draw many distinct derangements with which to train DNNs of different topologies for constructing n-ML ensembles. For example, when there are two classes, there is only one derangement that one can use to train a topologically manipulated DNN (recall that a derangement maps every class to a different class), and so it is not possible to construct an ensemble containing DNNs with distinct topologies. A possible solution is to find a new method that does not require derangements to topologically manipulate DNNs. We plan to pursue this direction in future work.

Practical considerations   ML systems often take actions based on inferences made by ML algorithms. For example, a biometric system may give or deny access to users based on the output of a face-recognition DNN; an autonomous vehicle may change the driving direction, accelerate, stop, or slow down based on the output of a DNN for pedestrian detection; and an anti-virus program may delete or quarantine a file based on the output of an ML algorithm for malware detection. This raises the question of how a system that uses n-ML for inference should react when n-ML flags an input as adversarial.

We have a couple of suggestions for courses of action that are applicable in different settings. One possibility is to fall back to a more expensive, but less error-prone, classification mechanism. For example, if an n-ML ensemble used for face recognition flags an input as adversarial, a security guard may be called to identify the person, and possibly override the output of the ensemble. This solution is viable when the time and resources for an expensive classification mechanism are available. Another possibility is to resample the input, or classify a transformed variant of the input, to increase confidence in the detection or to correct the classification result. For example, if an n-ML ensemble used for face recognition on a mobile phone detects an input as adversarial, the user may be asked to reattempt identification using a new image. In this case, because the benign accuracy of n-ML is high and the attack success rate is low, a benign user is likely to succeed at identifying herself, while an attacker is likely to be detected.

VI Conclusion

This paper presents n-ML, a defense against adversarial examples that is inspired by n-version programming. n-ML uses ensembles of DNNs to classify inputs by a majority vote (when a large number of DNNs agree) and to detect adversarial examples (when the DNNs disagree). To ensure that the ensembles have high accuracy on benign samples while also defending against adversarial examples, we train the DNNs using a novel technique (topology manipulation) that allows one to specify how adversarial examples should be classified by the DNN at inference time. Our experiments using two perturbation types (ones with bounded L∞- and L2-norms) and three datasets (MNIST, CIFAR10, and GTSRB) in black-, grey-, and white-box settings showed that n-ML is an effective and efficient defense. In particular, n-ML roughly retains the benign accuracies of state-of-the-art DNNs, while providing more resilience to attacks than the best defenses known to date and making inferences faster than most.


This work was supported in part by the Multidisciplinary University Research Initiative (MURI) Cyber Deception grant; by NSF grants 1801391 and 1801494; by the National Security Agency under Award No. H9823018D0008; by gifts from Google and Nvidia, and from Lockheed Martin and NATO through Carnegie Mellon CyLab; and by a CyLab Presidential Fellowship and a NortonLifeLock Research Group Fellowship.


  1. We will release our implementation of n-ML upon publication.
  2. For clarity, we distinguish the classifier from the defense. However, in certain cases (e.g., for n-ML or adversarial training) they are inherently inseparable.
  3. In the case of one-hot encoding for the true class y and a probability estimate p emitted by the DNN, the cross-entropy loss is defined as L(p, y) = -log(p_y).


  1. M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. A. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu and X. Zheng (2016) TensorFlow: A system for large-scale machine learning. In Proc. OSDI, Cited by: §IV.
  2. M. Abbasi and C. Gagné (2017) Robustness to adversarial examples through an ensemble of specialists. In Proc. ICLRW, Cited by: §II-B, §II-B.
  3. A. Athalye, N. Carlini and D. Wagner (2018) Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In Proc. ICML, Cited by: §I, §II-B, §II-B, §IV-D1.
  4. A. Athalye and N. Carlini (2018) On the robustness of the CVPR 2018 white-box adversarial example defenses. arXiv preprint arXiv:1804.03286. Cited by: §II-B.
  5. A. Avizienis (1985) The N-version approach to fault-tolerant software. IEEE Transactions on software engineering (12), pp. 1491–1501. Cited by: §II-B, §III-C.
  6. S. Baluja and I. Fischer (2018) Adversarial transformation networks: learning to generate adversarial examples. In Proc. AAAI, Cited by: §II-A.
  7. B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto and F. Roli (2013) Evasion attacks against machine learning at test time. In Proc. ECML PKDD, Cited by: §I, §II-A.
  8. W. Brendel, J. Rauber and M. Bethge (2018) Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In Proc. ICLR, Cited by: §III-A.
  9. T. Brunner, F. Diehl, M. T. Le and A. Knoll (2019) Guessing smart: biased sampling for efficient black-box adversarial attacks. In Proc. ICCV, Cited by: §III-A.
  10. N. Carlini and D. Wagner (2017) Adversarial examples are not easily detected: bypassing ten detection methods. In Proc. AISec, Cited by: §II-B.
  11. N. Carlini and D. Wagner (2017) Towards evaluating the robustness of neural networks. In Proc. IEEE S&P, Cited by: §I, §II-A, §IV-D1, TABLE I.
  12. L. Chen and A. Avizienis (1995) N-version programming: A fault-tolerance approach to reliability of software operation. In Proc. ISFTC, Cited by: §II-B, §III-C.
  13. P. Chen, H. Zhang, Y. Sharma, J. Yi and C. Hsieh (2017) Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proc. AISec, Cited by: §III-A.
  14. J. M. Cohen, E. Rosenfeld and J. Z. Kolter (2019) Certified adversarial robustness via randomized smoothing. In Proc. ICML, Cited by: §II-B, §II-B.
  15. B. Cox, D. Evans, A. Filipi, J. Rowanhill, W. Hu, J. Davidson, J. Knight, A. Nguyen-Tuong and J. Hiser (2006) N-variant systems: a secretless framework for security through diversity. In Proc. USENIX Security, Cited by: §II-B, §III-C.
  16. A. Demontis, M. Melis, M. Pintor, M. Jagielski, B. Biggio, A. Oprea, C. Nita-Rotaru and F. Roli (2019) Why do adversarial attacks transfer? Explaining transferability of evasion and poisoning attacks. In Proc. USENIX Security, Cited by: §II-A.
  17. Y. Dong, T. Pang, H. Su and J. Zhu (2019) Evading defenses to transferable adversarial examples by translation-invariant attacks. In Proc. CVPR, Cited by: §II-A, §IV-C, §IV-D1.
  18. I. Evtimov, K. Eykholt, E. Fernandes, T. Kohno, B. Li, A. Prakash, A. Rahmati and D. Song (2018) Robust physical-world attacks on machine learning models. In Proc. CVPR, Cited by: §I, §V.
  19. R. Feinman, R. R. Curtin, S. Shintre and A. B. Gardner (2017) Detecting adversarial samples from artifacts. arXiv preprint arXiv:1703.00410. Cited by: §II-B, §II-B.
  20. I. J. Goodfellow, J. Shlens and C. Szegedy (2015) Explaining and harnessing adversarial examples. In Proc. ICLR, Cited by: §I, §II-A, §II-A, §II-B, §II-B.
  21. K. Grosse, P. Manoharan, N. Papernot, M. Backes and P. McDaniel (2017) On the (statistical) detection of adversarial examples. arXiv preprint arXiv:1702.06280. Cited by: §II-B.
  22. C. Guo, M. Rana, M. Cisse and L. van der Maaten (2018) Countering adversarial images using input transformations. In Proc. ICLR, Cited by: §II-B.
  23. A. Harlap, D. Narayanan, A. Phanishayee, V. Seshadri, N. Devanur, G. Ganger and P. Gibbons (2019) PipeDream: fast and efficient pipeline-parallel DNN training. In Proc. SOSP, Note: To appear Cited by: §IV-D2.
  24. W. He, J. Wei, X. Chen, N. Carlini and D. Song (2017) Adversarial example defense: Ensembles of weak defenses are not strong. In Proc. WOOT, Cited by: §II-B, §II-B.
  25. B. Huang, Y. Wang and W. Wang (2019) Model-agnostic adversarial detection by random perturbations. In Proc. IJCAI, Cited by: §II-B, §II-B.
  26. A. Ilyas, L. Engstrom and A. Madry (2019) Prior convictions: Black-box adversarial attacks with bandits and priors. In Proc. ICLR, Cited by: §II-A, §III-A.
  27. H. Kannan, A. Kurakin and I. Goodfellow (2018) Adversarial logit pairing. arXiv preprint arXiv:1803.06373. Cited by: §II-B.
  28. A. Kantchelian, J. Tygar and A. D. Joseph (2016) Evasion and hardening of tree ensemble classifiers. In Proc. ICML, Cited by: §I, §II-B.
  29. S. Kariyappa and M. K. Qureshi (2019) Improving adversarial robustness of ensembles with diversity training. arXiv preprint arXiv:1901.09981. Cited by: §II-B.
  30. Keras team (2015) Keras: The Python deep learning library. Note: https://keras.io/. Accessed on 09-30-2019. Cited by: §IV.
  31. Keras team (2018) MNIST CNN. Note: https://github.com/keras-team/keras/blob/master/examples/mnist_cnn.py. Accessed on 09-28-2019. Cited by: §IV-B, §IV-B, TABLE I.
  32. Keras team (2018) MNIST MLP. Note: https://keras.io/examples/mnist_mlp/. Accessed on 09-28-2019. Cited by: TABLE I.
  33. D. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In Proc. ICLR, Cited by: §IV-B.
  34. J. Z. Kolter and E. Wong (2018) Provable defenses against adversarial examples via the convex outer adversarial polytope. In Proc. ICML, Cited by: §II-B, §III-A.
  35. A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Technical report University of Toronto. Cited by: §I, §IV-A.
  36. A. Kurakin, I. Goodfellow and S. Bengio (2017) Adversarial machine learning at scale. In Proc. ICLR, Cited by: §I, §II-B.
  37. Y. LeCun, C. Cortes and C. J.C. Burges (1998) The MNIST database of handwritten digits. Note: http://yann.lecun.com/exdb/mnist/. Accessed on 10-01-2019. Cited by: §I, §IV-A.
  38. M. Lecuyer, V. Atlidakis, R. Geambasu, D. Hsu and S. Jana (2019) Certified robustness to adversarial examples with differential privacy. In Proc. IEEE S&P, Cited by: §II-B, §II-B.
  39. J. Li and Z. Wang (2018) Real-time traffic sign recognition based on efficient CNNs in the wild. IEEE Transactions on Intelligent Transportation Systems 20 (3), pp. 975–984. Cited by: §IV-A, §IV-B, TABLE I.
  40. F. Liao, M. Liang, Y. Dong, T. Pang, J. Zhu and X. Hu (2018) Defense against adversarial attacks using high-level representation guided denoiser. In Proc. CVPR, Cited by: §II-B.
  41. X. Liu, M. Cheng, H. Zhang and C. Hsieh (2018) Towards robust neural networks via random self-ensemble. In Proc. ECCV, Cited by: §II-B, §II-B.
  42. Y. Liu, X. Chen, C. Liu and D. Song (2017) Delving into transferable adversarial examples and black-box attacks. In Proc. ICLR, Cited by: §II-A, §IV-B, §IV-C, §IV-D1.
  43. P. Lu, P. Chen and C. Yu (2018) On the limitation of local intrinsic dimensionality for characterizing the subspaces of adversarial examples. arXiv preprint arXiv:1803.09638. Cited by: §II-B.
  44. S. Ma, Y. Liu, G. Tao, W. Lee and X. Zhang (2019) NIC: detecting adversarial samples with neural network invariant checking. In Proc. NDSS, Cited by: §I, §II-B, §IV-D1, §IV-D1, §V.
  45. X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, G. Schoenebeck, D. Song, M. E. Houle and J. Bailey (2018) Characterizing adversarial subspaces using local intrinsic dimensionality. In Proc. ICLR, Cited by: §I, §II-B, §II-B, §IV-D1, §IV-D1, §V.
  46. F. Machida (2019) N-version machine learning models for safety critical systems. In Proc. DSN DSMLW, Cited by: §I, §II-B.
  47. A. Madry, A. Makelov, L. Schmidt, D. Tsipras and A. Vladu (2018) Towards deep learning models resistant to adversarial attacks. In Proc. ICLR, Cited by: §I, §I, §II-A, §II-B, §III-A, §IV-C, §IV-D1, §IV-D1, §IV-D1, §IV-D2, TABLE I, §V.
  48. D. Meng and H. Chen (2017) MagNet: A two-pronged defense against adversarial examples. In Proc. CCS, Cited by: §II-B, §II-B.
  49. J. H. Metzen, T. Genewein, V. Fischer and B. Bischoff (2017) On detecting adversarial perturbations. In Proc. ICLR, Cited by: §II-B.
  50. M. Mirman, T. Gehr and M. Vechev (2018) Differentiable abstract interpretation for provably robust neural networks. In Proc. ICML, Cited by: §II-B.
  51. S. Moosavi-Dezfooli, A. Fawzi and P. Frossard (2016) DeepFool: A simple and accurate method to fool deep neural networks. In Proc. CVPR, Cited by: §II-A.
  52. T. Pang, C. Du, Y. Dong and J. Zhu (2018) Towards robust detection of adversarial examples. In Proc. NeurIPS, Cited by: §II-B.
  53. T. Pang, K. Xu, C. Du, N. Chen and J. Zhu (2019) Improving adversarial robustness via promoting ensemble diversity. In Proc. ICML, Cited by: §II-B.
  54. N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik and A. Swami (2017) Practical black-box attacks against machine learning. In Proc. AsiaCCS, Cited by: §II-A, §II-B.
  55. N. Papernot, P. McDaniel and I. Goodfellow (2016) Transferability in machine learning: From phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277. Cited by: §II-A, §II-B.
  56. N. Papernot, P. McDaniel, S. Jha, M. Fredrikson, Z. B. Celik and A. Swami (2016) The limitations of deep learning in adversarial settings. In Proc. IEEE Euro S&P, Cited by: §I, §II-A.
  57. N. Papernot and P. McDaniel (2018) Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765. Cited by: §IV-C, §IV-D1.
  58. Y. Qin, N. Carlini, I. Goodfellow, G. Cottrell and C. Raffel (2019) Imperceptible, robust, and targeted adversarial examples for automatic speech recognition. In Proc. ICML, Cited by: §I.
  59. A. Raghunathan, J. Steinhardt and P. S. Liang (2018) Semidefinite relaxations for certifying robustness to adversarial examples. In Proc. NeurIPS, Cited by: §II-B.
  60. H. Salman, G. Yang, J. Li, P. Zhang, H. Zhang, I. Razenshteyn and S. Bubeck (2019) Provably robust deep learning via adversarially trained smoothed classifiers. In Proc. NeurIPS, Note: To appear Cited by: §II-B, §II-B.
  61. H. Salman, G. Yang, H. Zhang, C. Hsieh and P. Zhang (2019) A convex relaxation barrier to tight robustness verification of neural networks. In Proc. NeurIPS, Note: To appear Cited by: §II-B, §III-A.
  62. P. Samangouei, M. Kabkab and R. Chellappa (2018) Defense-GAN: Protecting classifiers against adversarial attacks using generative models. In Proc. ICLR, Cited by: §II-B.
  63. L. Schönherr, K. Kohls, S. Zeiler, T. Holz and D. Kolossa (2019) Adversarial attacks against automatic speech recognition systems via psychoacoustic hiding. In Proc. NDSS, Cited by: §I.
  64. A. Sen, X. Zhu, L. Marshall and R. Nowak (2019) Should adversarial attacks use pixel p-norm?. arXiv preprint arXiv:1906.02439. Cited by: §II-A.
  65. S. Sengupta, T. Chakraborti and S. Kambhampati (2019) MTDeep: Moving target defense to boost the security of deep neural nets against adversarial attacks. In Proc. GameSec, Cited by: §II-B.
  66. P. Sermanet and Y. LeCun (2011) Traffic sign recognition with multi-scale convolutional networks.. In Proc. IJCNN, Cited by: §IV-B, TABLE I.
  67. A. Shafahi, M. Najibi, A. Ghiasi, Z. Xu, J. Dickerson, C. Studer, L. S. Davis, G. Taylor and T. Goldstein (2019) Adversarial training for free!. In Proc. NeurIPS, Note: To appear Cited by: §I, §II-A, §II-B, §IV-C, §IV-D1, §IV-D1, §V.
  68. M. Sharif, L. Bauer and M. K. Reiter (2018) On the suitability of lp-norms for creating and preventing adversarial examples. In Proc. CVPRW, Cited by: §II-A.
  69. M. Sharif, S. Bhagavatula, L. Bauer and M. K. Reiter (2016) Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proc. CCS, Cited by: §I.
  70. J. T. Springenberg, A. Dosovitskiy, T. Brox and M. Riedmiller (2015) Striving for simplicity: the all convolutional net. In Proc. ICLR, Cited by: §IV-B, TABLE I.
  71. V. Srinivasan, A. Marban, K. Müller, W. Samek and S. Nakajima (2018) Counterstrike: defending deep learning architectures against adversarial samples by Langevin dynamics with supervised denoising autoencoder. arXiv preprint arXiv:1805.12017. Cited by: §II-B.
  72. J. Stallkamp, M. Schlipsing, J. Salmen and C. Igel (2012) Man vs. computer: benchmarking machine learning algorithms for traffic sign recognition. Neural Networks 32, pp. 323–332. Cited by: §I, §IV-A.
  73. T. Strauss, M. Hanselmann, A. Junginger and H. Ulmer (2017) Ensemble methods as a defense to adversarial perturbations against deep neural networks. arXiv preprint arXiv:1709.03423. Cited by: §I, §II-B.
  74. C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow and R. Fergus (2014) Intriguing properties of neural networks. In Proc. ICLR, Cited by: §I, §II-A, §II-B.
  75. L. Tian (2017) Traffic sign recognition using CNN with learned color and spatial transformation. Note: \urlhttps://github.com/hello2all/GTSRB_Keras_STN/blob/master/conv_model.pyAccessed on 09-28-2019 Cited by: §IV-B, TABLE I.
  76. F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh and P. McDaniel (2018) Ensemble adversarial training: attacks and defenses. In Proc. ICLR, Cited by: §II-A, §IV-C, §IV-D1.
  77. D. Vijaykeerthy, A. Suri, S. Mehta and P. Kumaraguru (2019) Hardening deep neural networks via adversarial model cascades. Cited by: §II-B.
  78. J. Wang, G. Dong, J. Sun, X. Wang and P. Zhang (2019) Adversarial sample detection for deep neural network through model mutation testing. In Proc. ICSE, Cited by: §II-B, §II-B.
  79. X. Wang, S. Wang, P. Chen, Y. Wang, B. Kulis, X. Lin and P. Chin (2019) Protecting neural networks with hierarchical random switching: Towards better robustness-accuracy trade-off for stochastic defenses. arXiv preprint arXiv:1908.07116. Cited by: §II-B.
  80. C. Xiao, B. Li, J. Zhu, W. He, M. Liu and D. Song (2018) Generating adversarial examples with adversarial networks. In Proc. IJCAI, Cited by: §II-A.
  81. C. Xie, Y. Wu, L. van der Maaten, A. Yuille and K. He (2018) Feature denoising for improving adversarial robustness. arXiv preprint arXiv:1812.03411. Cited by: §II-B.
  82. C. Xie, Z. Zhang, Y. Zhou, S. Bai, J. Wang, Z. Ren and A. L. Yuille (2019) Improving transferability of adversarial examples with input diversity. In Proc. CVPR, Cited by: §II-A, §IV-C, §IV-D1.
  83. H. Xu, Z. Chen, W. Wu, Z. Jin, S. Kuo and M. Lyu (2019) NV-DNN: Towards fault-tolerant DNN systems with N-version programming. In Proc. DSN DSMLW, Cited by: §I, §II-B.
  84. W. Xu, D. Evans and Y. Qi (2018) Feature squeezing: detecting adversarial examples in deep neural networks. In Proc. NDSS, Cited by: §II-B.
  85. S. Zagoruyko and N. Komodakis (2016) Wide residual networks. In Proc. BMVC, Cited by: §IV-B, TABLE I.
  86. M. D. Zeiler (2012) Adadelta: An adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §IV-B.
  87. Q. Zeng, J. Su, C. Fu, G. Kayas, L. Luo, X. Du, C. C. Tan and J. Wu (2019) A multiversion programming inspired approach to detecting audio adversarial examples. In Proc. DSN, Cited by: §II-B.