Triple Wins: Boosting Accuracy, Robustness and Efficiency Together by Enabling Input-Adaptive Inference
Abstract
Deep networks were recently suggested to face a trade-off between accuracy (on clean natural images) and robustness (on adversarially perturbed images) (Tsipras et al., 2019). Such a dilemma is shown to be rooted in the inherently higher sample complexity (Schmidt et al., 2018) and/or model capacity (Nakkiran, 2019) needed for learning a high-accuracy and robust classifier. In view of that, given a classification task, growing the model capacity appears to help draw a win-win between accuracy and robustness, yet at the expense of model size and latency, therefore posing challenges for resource-constrained applications. Is it possible to co-design model accuracy, robustness and efficiency to achieve their triple wins?
This paper studies multi-exit networks associated with input-adaptive efficient inference, showing their strong promise in achieving a “sweet point” in co-optimizing model accuracy, robustness and efficiency. Our proposed solution, dubbed Robust Dynamic Inference Networks (RDI-Nets), allows each input (either clean or adversarial) to adaptively choose one of the multiple output layers (early branches or the final one) to output its prediction. That multi-loss adaptivity adds new variations and flexibility to adversarial attacks and defenses, on which we present a systematic investigation. We show experimentally that by equipping existing backbones with such robust adaptive inference, the resulting RDI-Nets can achieve better accuracy and robustness, yet with over 30% computational savings, compared to the defended original models.
1 Introduction
Deep networks, despite their high predictive accuracy, are notoriously vulnerable to adversarial attacks (Goodfellow et al., 2015; Biggio et al., 2013; Szegedy et al., 2014; Papernot et al., 2016). While many defense methods have been proposed to increase a model’s robustness to adversarial examples, they were typically observed to hamper its accuracy on original clean images. Tsipras et al. (2019) first pointed out the inherent tension between the goals of adversarial robustness and standard accuracy in deep networks, whose provable existence was shown in a simplified setting. Zhang et al. (2019) theoretically quantified the accuracy-robustness trade-off, in terms of the gap between the risk for adversarial examples versus the risk for non-adversarial examples.
It is intriguing to consider whether and why the model accuracy and robustness have to be at odds. Schmidt et al. (2018) demonstrated that the number of samples needed to achieve adversarially robust generalization is polynomially larger than that needed for standard generalization, under the adversarial training setting. A similar conclusion was reached by Sun et al. (2019) in the standard training setting. Tsipras et al. (2019) considered the accuracy-robustness trade-off as an inherent trait of the data distribution itself, indicating that this phenomenon persists even in the limit of infinite data. Nakkiran (2019) argued from a different perspective, that the complexity (e.g., capacity) of a robust classifier must be higher than that of a standard classifier. Therefore, switching to a larger-capacity classifier might effectively alleviate the trade-off. Overall, those existing works appear to suggest that, while accuracy and robustness are likely to trade off for a fixed classification model and on a given dataset, such a trade-off might be effectively alleviated (“win-win”) by supplying more training data and/or switching to a larger-capacity classifier.
On a separate note, deep networks also face the pressing challenge of being deployed on resource-constrained platforms due to the prosperity of smart Internet-of-Things (IoT) devices. Many IoT applications naturally demand security and trustworthiness, e.g., biometrics and identity verification, but can only afford limited latency, memory and energy budget. Hereby we extend the question: can we achieve a triple-win, i.e., an accurate and robust classifier while keeping it efficient?
This paper makes an attempt at providing a positive answer to the above question. Rather than proposing a specific design of robust lightweight models, we reduce the average computation loads by input-adaptive routing to achieve a triple-win. To this end, we introduce input-adaptive dynamic inference (Teerapittayanon et al., 2017; Wang et al., 2018a), an emerging efficient inference scheme in contrast to (non-adaptive) model compression, to the adversarial defense field for the first time. Given any deep network backbone (e.g., ResNet, MobileNet), we first follow (Teerapittayanon et al., 2017) to augment it with multiple early-branch output layers in addition to the original final output. Each input, whether clean or adversarial, adaptively chooses which output layer to take for its own prediction. Therefore, a large portion of input inferences can be terminated early when the samples can already be inferred with high confidence.
To the best of our knowledge, no existing work has studied adversarial attacks and defenses for an adaptive multi-output model, where the multiple sources of losses provide much larger flexibility to compose attacks (and therefore defenses), compared to the typical single-loss backbone. We present a systematic exploration of how to (white-box) attack and defend our proposed multi-output network with adaptive inference, demonstrating that the composition of multiple-loss information is critical in making the attack/defense strong. Fig. 1 illustrates our proposed Robust Dynamic Inference Networks (RDI-Nets). We show experimentally that the input-adaptive inference and multi-loss flexibility can be our friend in achieving the desired “triple wins”. With our best defended RDI-Nets, we achieve better accuracy and robustness, yet with over 30% inference computational savings, compared to the defended original models as well as existing solutions co-designing robustness and efficiency (Gui et al., 2019; Guo et al., 2018). The code is available at https://github.com/TAMUVITA/triplewins.
2 Related Work
2.1 Adversarial Defense
A multitude of defense approaches have been proposed (Kurakin et al., 2017; Xu et al., 2018; Song et al., 2018; Liao et al., 2018), although many were quickly evaded by new attacks (Carlini and Wagner, 2017; Baluja and Fischer, 2018). One strong defense algorithm that has so far not been fully compromised is adversarial training (Madry et al., 2018). It searches for adversarial images to augment the training procedure, although at the price of higher training costs (but not affecting inference efficiency). However, almost all existing attacks and defenses focus on a single-output classification (or other task) model. We are unaware of prior studies directly addressing attacks/defenses for more complicated networks with multiple possible outputs.
One related line of work exploits model ensembles (Tramèr et al., 2018; Strauss et al., 2017) in adversarial training. The gains of the defended ensemble compared to a single model could be viewed as the benefits of either diversity (generating stronger and more transferable perturbations), or the increased model capacity (considering the ensembled multiple models as a compound one). Unfortunately, ensemble methods could amplify the inference complexity and be detrimental to efficiency. Besides, it is also known that injecting randomization at inference time helps mitigate adversarial effects (Xie et al., 2018; Cohen et al., 2019). Yet to the best of our knowledge, no work has studied non-random, but rather input-dependent inference for defense.
2.2 Efficient Inference
Research in improving deep network efficiency could be categorized into two streams: the static way that designs compact models or compresses heavy models, while the compact/compressed models remain fixed for all inputs at inference; and the dynamic way, that at inference the inputs can choose different computational paths adaptively, and the simpler inputs usually take less computation to make predictions. We briefly review the literature below.
Static: Compact Network Design and Model Compression. Many compact architectures have been specifically designed for resource-constrained applications, by adopting lightweight depthwise convolutions (Sandler et al., 2018), and groupwise convolutions with channel-shuffling (Zhang et al., 2018), to name just a few. For model compression, Han et al. (2015) first proposed to sparsify deep models by removing non-significant synapses and then retraining to restore performance. Structured pruning was later introduced for more hardware friendliness (Wen et al., 2016). Layer factorization (Tai et al., 2016; Yu et al., 2017), quantization (Wu et al., 2016), model distillation (Wang et al., 2018c) and weight sharing (Wu et al., 2018) have also been respectively found effective.
Dynamic: Input-Adaptive Inference. Higher inference efficiency could also be accomplished by enabling input-conditional execution. Teerapittayanon et al. (2017); Huang et al. (2018); Kaya et al. (2019) leveraged intermediate features to augment multiple side branch classifiers to enable early predictions. Their methodology sets up the foundation for our work. Other efforts (Figurnov et al., 2017; Wang et al., 2018a, b, 2019) allow an input to choose between passing through or skipping each layer. That approach could be integrated with RDI-Nets too, which we leave as future work.
2.3 Bridging Robustness with Efficiency
A few recent studies try to link deep learning robustness and efficiency. Guo et al. (2018) observed that in a sparse deep network, appropriately sparsified weights improve robustness, whereas over-sparsification (e.g., less than 5% non-zero weights) in turn makes the model more fragile. Two latest works (Ye et al., 2019; Gui et al., 2019) examined the robustness of compressed models, and reached similar conclusions: the relationship between model size and robustness depends on the compression method and is often non-monotonic. Lin et al. (2019) found that activation quantization may hurt robustness, but can be turned into an effective defense if continuity constraints are enforced.
Different from the above methods that tackle robustness from static compact/compressed models, the proposed RDI-Nets are the first to address robustness from the dynamic input-adaptive inference. Our experimental results demonstrate the consistent superiority of RDI-Nets over those static methods (Section 4.3). Moreover, applying dynamic inference on top of those static methods may further boost the robustness and efficiency, which we leave as future work.
3 Approach
With the goal of achieving inference efficiency, we first look at the setting of multi-output networks and the specific design of RDI-Nets in Section 3.1. Then we define three forms of adversarial attacks for multi-output networks in Section 3.2 and their corresponding defense methods in Section 3.3.
Note that RDI-Nets achieve “triple wins” via reducing the average computation loads through input-adaptive routing. This is not to be confused with any specifically designed robust lightweight model.
3.1 Designing RDI-Nets for Higher Inference Efficiency
Given an input image $x$, an $N$-output network can produce a set of predictions $[\hat{y}_1, \dots, \hat{y}_N]$ by a set of transformations $[f_{\theta_1}, \dots, f_{\theta_N}]$. $\theta_i$ denotes the model parameters of $f_{\theta_i}$, $i = 1, \dots, N$, and the $\theta_i$s will typically share some weights. With an input $x$, one can express $\hat{y}_i = f_{\theta_i}(x)$. We assume that the final prediction will be one chosen (NOT fused) from $[\hat{y}_1, \dots, \hat{y}_N]$ via some deterministic strategy.
We now look at RDI-Nets as a specific instance of multi-output networks, specifically designed for the goal of more efficient, input-adaptive inference. As shown in Fig. 1, for any deep network (e.g., ResNet, MobileNet), we could append $K$ side branches (with negligible overhead) to allow for early-exit predictions. In other words, it becomes a $(K+1)$-output network, and the subnetworks with the $K+1$ exits, from the lowest to the highest (the original final output), correspond to $[f_{\theta_1}, \dots, f_{\theta_{K+1}}]$. They share their weights in a nested fashion: $\theta_1 \subseteq \theta_2 \subseteq \dots \subseteq \theta_{K+1}$, with $\theta_{K+1}$ including the entire network’s parameters.
Our deterministic strategy in selecting one final output follows (Teerapittayanon et al., 2017). We set a confidence threshold $t_k$ for each $k$-th exit, and each input $x$ will terminate inference and output its prediction in the earliest exit (smallest $k$) whose softmax entropy (as a confidence measure) falls below $t_k$. All computations after the $k$-th exit will not be activated for this $x$. Such a progressive and early-halting mechanism effectively saves unnecessary computation for most easier-to-classify samples, and applies in both training and inference. Note that, if efficiency is not the concern, instead of choosing one prediction (the earliest one), we could have designed an adaptive or randomized fusion of all $\hat{y}_i$ predictions: but that falls beyond the goal of this work.
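The early-exit rule described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the use of precomputed per-exit probabilities, and the fallback to the final output when no side exit is confident enough are all assumptions made for illustration. In a real network, the layers after the chosen exit would simply not be executed.

```python
import numpy as np

def softmax_entropy(probs, eps=1e-12):
    """Shannon entropy of each row of a (batch, classes) probability array."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def early_exit_predict(exit_probs, thresholds):
    """Pick, per input, the earliest exit whose entropy is below its threshold.

    exit_probs: list of (batch, classes) arrays, ordered from the earliest
                side exit to the final output.
    thresholds: one entropy threshold per side exit (len(exit_probs) - 1 values);
                the final output is the fallback (an assumption of this sketch).
    Returns (predicted labels, index of the exit used for each input).
    """
    batch = exit_probs[0].shape[0]
    labels = np.argmax(exit_probs[-1], axis=-1)      # fallback: final output
    chosen = np.full(batch, len(exit_probs) - 1)
    undecided = np.ones(batch, dtype=bool)
    for k, (probs, t) in enumerate(zip(exit_probs[:-1], thresholds)):
        take = undecided & (softmax_entropy(probs) < t)
        labels[take] = np.argmax(probs[take], axis=-1)
        chosen[take] = k
        undecided &= ~take
    return labels, chosen
```

The fraction of inputs leaving at each exit (the histogram of the returned exit indices) is what determines the average inference cost.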
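The training objective for RDI-Nets could be written as
$$\mathcal{L}_{\mathrm{RDI}} = \sum_{i=1}^{K+1} w_i \left[ \mathcal{L}\big(f_{\theta_i}(x), y\big) + \mathcal{L}\big(f_{\theta_i}(x^{adv}), y\big) \right] \tag{1}$$
For each exit loss, we minimize a hybrid loss of accuracy (on clean $x$) and robustness (on $x^{adv}$). The $K+1$ exits are balanced with a group of weights $\{w_i\}_{i=1}^{K+1}$. More details about RDI-Net structures, hyperparameters, and inference branch selection can be found in Appendices A, B, and C.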
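As a small illustration of the hybrid multi-exit objective, the sketch below sums, over the exits, a weighted clean loss plus adversarial loss. It assumes cross-entropy for every exit and equal weighting of the clean and adversarial terms; the function names are hypothetical and the inputs are already-computed softmax outputs rather than a real network.

```python
import numpy as np

def cross_entropy(probs, labels, eps=1e-12):
    """Mean cross-entropy of (batch, classes) probabilities against integer labels."""
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + eps)))

def hybrid_multi_exit_loss(clean_probs, adv_probs, labels, weights):
    """Weighted sum over exits of (clean loss + adversarial loss).

    clean_probs / adv_probs: lists with one (batch, classes) array per exit,
    computed on the clean inputs and on their adversarial counterparts.
    weights: one balancing weight per exit.
    """
    total = 0.0
    for w, p_clean, p_adv in zip(weights, clean_probs, adv_probs):
        total += w * (cross_entropy(p_clean, labels) + cross_entropy(p_adv, labels))
    return total
```

In training, the gradient of this scalar with respect to the shared parameters accumulates contributions from every exit that uses them.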
In what follows, we discuss three ways to generate $x^{adv}$ in RDI-Nets, and then their defenses.
3.2 Three Attack Forms on Multi-Output Networks
We consider white-box attacks in this paper. Attackers have access to the model’s parameters, and aim to generate an adversarial image $x^{adv}$ to fool the model by perturbing an input $x$ within a given magnitude bound.
We next discuss three attack forms for an $N$-output network. Note that they are independent of, and to be distinguished from, attacker algorithms (e.g., PGD, C&W, FGSM): the former depicts the optimization formulation, which can be solved by any of the attacker algorithms.
Single Attack
Naively extending from attacking single-output networks, a single attack is defined to maximally fool one $f_{\theta_i}$ only, expressed as:
$$x^{adv}_i = \arg\max_{x' : \|x' - x\|_\infty \le \epsilon} \mathcal{L}\big(f_{\theta_i}(x'), y\big) \tag{2}$$
where $y$ is the ground truth label, and $\mathcal{L}$ is the loss for $f_{\theta_i}$ (we assume softmax for all). $\epsilon$ is the perturbation radius and we adopt the $\ell_\infty$ ball for an empirically strong attacker. Naturally, an $N$-output network can have $N$ different single attacks. However, each single attack is derived without being aware of other parallel outputs. The found $x^{adv}_i$ is not necessarily transferable to other $f_{\theta_j}$s ($j \neq i$), and therefore can be easily bypassed if $x^{adv}_i$ is rerouted through other outputs to make its prediction.
Average Attack
Our second attack maximizes the average of all losses, so that the found $x^{adv}_{avg}$ remains in effect no matter which $f_{\theta_i}$ is chosen to output the prediction for $x^{adv}_{avg}$:
$$x^{adv}_{avg} = \arg\max_{x' : \|x' - x\|_\infty \le \epsilon} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_{\theta_i}(x'), y\big) \tag{3}$$
The average attack takes into account the attack transferability and involves all $f_{\theta_i}$s in the optimization. However, while only one output will be selected for each sample at inference, the average strategy might weaken the individual defense strength of each $f_{\theta_i}$.
Max-Average Attack
Our third attack aims to emphasize individual output defense strength, more than simply maximizing an all-averaged loss. We first solve the $N$ single attacks $x^{adv}_i$ as described in Eqn. 2, and denote their collection as $\Omega$. We then solve the max-average attack via the following:
$$x^{adv}_{max} = \arg\max_{x' \in \Omega} \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}\big(f_{\theta_i}(x'), y\big) \tag{4}$$
Note Eqn. 4 differs from Eqn. 3 by adding an $\Omega$ constraint to balance between “commodity” and “specificity”. The found $x^{adv}_{max}$ both strongly increases the averaged loss values from all $f_{\theta_i}$s (therefore possessing transferability), and maximally fools one individual $f_{\theta_i}$, as it is selected from the collection of single attacks.
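To make the distinction between the three attack forms concrete, the following sketch solves each of them with the same projected sign-gradient (PGD-style) attacker. It is a toy setting chosen so that gradients are analytic: every exit is assumed to be a linear softmax classifier on the raw input, and the perturbation set is an $\ell_\infty$ ball. All function names, the toy exits and the default radius, step size and iteration count are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def exit_loss_and_grad(W, b, x, y):
    """Cross-entropy of one linear-softmax exit and its gradient w.r.t. the input x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return -np.log(p[y] + 1e-12), W.T @ (p - onehot)

def fused_loss_and_grad(exits, x, y, coeffs):
    """Linear combination of the exit losses (and their input gradients)."""
    loss, grad = 0.0, np.zeros_like(x)
    for c, (W, b) in zip(coeffs, exits):
        l, g = exit_loss_and_grad(W, b, x, y)
        loss, grad = loss + c * l, grad + c * g
    return loss, grad

def pgd(exits, x, y, coeffs, eps=0.1, step=0.02, iters=20):
    """Sign-gradient ascent on the fused loss, projected onto the l_inf ball around x."""
    x_adv = x.copy()
    for _ in range(iters):
        _, g = fused_loss_and_grad(exits, x_adv, y, coeffs)
        x_adv = np.clip(x_adv + step * np.sign(g), x - eps, x + eps)
    return x_adv

def single_attack(exits, x, y, i, **kw):
    """Attack only the i-th exit."""
    coeffs = np.zeros(len(exits))
    coeffs[i] = 1.0
    return pgd(exits, x, y, coeffs, **kw)

def average_attack(exits, x, y, **kw):
    """Attack the average of all exit losses."""
    return pgd(exits, x, y, np.full(len(exits), 1.0 / len(exits)), **kw)

def max_average_attack(exits, x, y, **kw):
    """Among the single attacks, return the one with the largest averaged loss."""
    avg = np.full(len(exits), 1.0 / len(exits))
    candidates = [single_attack(exits, x, y, i, **kw) for i in range(len(exits))]
    return max(candidates, key=lambda xa: fused_loss_and_grad(exits, xa, y, avg)[0])
```

The only difference between the single and average forms is the coefficient vector passed to the attacker, while the max-average form reuses the single-attack solutions and merely selects among them.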
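3.3 Defense on Multi-Output Networks
For simplicity and fair comparison, we focus on adversarial training (Madry et al., 2018) as our defense framework, where the three above-defined attack forms can be plugged in to generate adversarial images to augment training, as follows ($\Theta$ is the union of learnable parameters):
$$\min_{\Theta} \sum_{i=1}^{N} w_i \left[ \mathcal{L}\big(f_{\theta_i}(x), y\big) + \mathcal{L}\big(f_{\theta_i}(x^{adv}), y\big) \right] \tag{5}$$
where $x^{adv} \in \{x^{adv}_i, x^{adv}_{avg}, x^{adv}_{max}\}$ and $w_i$ are the exit weights of Eqn. 1. As $f_{\theta_i}$s partially share their weights in a multi-output network, the updates from different $f_{\theta_i}$s will be averaged on the shared parameters.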
4 Experimental Results
4.1 Experimental Setup
Evaluation Metrics
We evaluate accuracy, robustness, and efficiency, using the metrics below:

Testing Accuracy (TA): the classification accuracy on the original clean test set.

Adversarial Testing Accuracy (ATA): Given an attacker, ATA stands for the classification accuracy on the attacked test set. It is the same as the “robust accuracy” in (Zhang et al., 2019).

Mega Flops (MFlops): The number of million floating-point multiplication operations consumed on the inference, averaged over the entire testing set.
Datasets and Benchmark Models
We evaluate three representative CNN models on two popular datasets: SmallCNN on MNIST (Chen et al., 2018); ResNet38 (He et al., 2016) and MobileNetV2 (Sandler et al., 2018) on CIFAR-10. The three networks span from the simplest to more complicated ones, and cover a compact backbone. All three models are defended by adversarial training, constituting strong baselines. Table 1 reports the models, datasets, the attacker algorithms used in attack and defense, and the TA/ATA/MFlops performance of the three defended models.
Attack and Defense on RDI-Nets
We build RDI-Nets by appending side branch outputs for each backbone. For SmallCNN, we add two side branches ($K = 2$). For ResNet38 and MobileNetV2, we have $K = 6$ and $K = 2$, respectively. The branches are designed to cause negligible overheads: more details of their structure and positions can be found in Appendix B. We call the resulting models RDI-SmallCNN, RDI-ResNet38 and RDI-MobileNetV2 hereinafter.
We then generate attacks using our three defined forms. Each attack form could be solved with various attacker algorithms (e.g., PGD, C&W, FGSM), and by default we solve it with the same attacker used for each backbone in Table 1. If we fix one attacker algorithm (e.g., PGD), then TA/ATA for a single-output network can be measured without ambiguity. Yet for $(K+1)$-output RDI-Nets, there could be at least $K+3$ different ATA numbers for one defended model, depending on which attack form in Section 3.2 is applied ($K+1$ single attacks, 1 average attack, and 1 max-average attack). For example, we denote by ATA (Branch1) the ATA number when applying the single attack generated from the first side output branch (i.e., the single attack of Eqn. 2 on the first exit); similarly elsewhere.
We also defend RDI-Nets using adversarial training, using these forms of adversarial images to augment training. By default, we adopt three adversarial training defense schemes: Main Branch (single attack using the final output), Average (average attack), and Max-Average (max-average attack), in addition to the undefended Standard model.
We cross-evaluate ATAs of different defenses and attacks, since an ideal defense shall protect against all possible attack forms. To faithfully indicate the actual robustness, we choose the lowest number among all ATAs, denoted as ATA (Worst-Case), as the robustness measure for an RDI-Net.
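Table 1: Benchmark models, datasets, attacker algorithms used for defense and attack, and the TA/ATA/MFlops of the three defended backbones.
Model  Dataset  Defend  Attack  TA  ATA  MFlops
SmallCNN  MNIST  PGD-40  PGD-40  99.49%  96.31%  9.25
ResNet38  CIFAR-10  PGD-10  PGD-20  83.62%  42.29%  79.42
MobileNetV2  CIFAR-10  PGD-10  PGD-20  84.42%  46.92%  86.91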
4.2 Evaluation and Analysis
MNIST Experiments
The MNIST experimental results on RDI-SmallCNN are summarized in Table 2, with several meaningful observations to be drawn. First, the undefended models (Standard) are easily compromised by all attack forms. Second, the single-attack-defended model (Main Branch) achieves the best ATA against the same type of attack, i.e., ATA (Main Branch), and also seems to boost the closest output branch’s robustness, i.e., ATA (Branch 2). However, its defense effect on the further-away Branch 1 is degraded, and it also proves fragile under the two stronger attacks (Average and Max-Average). Third, both Average and Max-Average defenses achieve good TAs, as well as ATAs against all attack forms (and therefore Worst-Case), with Max-Average slightly better at both (the margins are small due to the data/task simplicity; see the next two tables).
Moreover, compared to the strong baseline of SmallCNN defended by PGD (40 iterations)-based adversarial training, RDI-SmallCNN with Max-Average defense wins in terms of both TA and ATA. Impressively, that comes together with 34.30% computational savings compared to the baseline. Here the different defense forms do not appear to alter the inference efficiency much: they all save around 34% to 36% MFlops compared to the backbone.
Table 2: Results of RDI-SmallCNN on MNIST under different defense methods.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  99.48%  99.50%  99.51%  99.52%
ATA (Branch 1)  6.60%  60.50%  98.69%  98.52%
ATA (Branch 2)  3.16%  98.14%  97.64%  97.62%
ATA (Main Branch)  1.32%  96.70%  96.30%  96.43%
ATA (Average)  2.61%  61.35%  97.37%  97.42%
ATA (Max-Average)  2.10%  61.83%  96.82%  96.89%
ATA (Worst-Case)  1.32%  60.50%  96.30%  96.43%
Average MFlops  5.89  5.89  5.95  6.08
Computation Saving  36.40%  36.40%  35.70%  34.30%
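The two summary rows of the MNIST table can be recomputed from the other rows. The short sketch below does so for the Max-Average column; the helper names are hypothetical, and the backbone cost of 9.25 MFlops is the SmallCNN entry from the baseline table.

```python
def worst_case_ata(atas):
    """ATA (Worst-Case): the lowest ATA over all evaluated attack forms."""
    return min(atas.values())

def computation_saving(avg_mflops, backbone_mflops):
    """Relative reduction of the average inference cost versus the backbone, in percent."""
    return 100.0 * (1.0 - avg_mflops / backbone_mflops)

# Max-Average column of the MNIST table (values in percent).
max_average_atas = {
    "Branch 1": 98.52, "Branch 2": 97.62, "Main Branch": 96.43,
    "Average": 97.42, "Max-Average": 96.89,
}
print(worst_case_ata(max_average_atas))            # lowest of the five ATAs
print(round(computation_saving(6.08, 9.25), 2))    # saving relative to SmallCNN
```

The saving recomputed from the rounded MFlops figures is close to, but not necessarily identical with, the tabulated percentage, presumably because the table was produced from unrounded costs.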
CIFAR-10 Experiments
The results on RDI-ResNet38 and RDI-MobileNetV2 are presented in Tables 3 and 4, respectively. Most findings concur with the MNIST experiments. Specifically, on the more complicated CIFAR-10 classification task, Max-Average defense achieves much more obvious margins over Average defense, in terms of ATA (Worst-Case): 2.79% for RDI-ResNet38, and 1.06% for RDI-MobileNetV2. Interestingly, the Average defense is not even the strongest in defending against average attacks, as Max-Average defense achieves higher ATA (Average) in both cases. We conjecture that averaging all branch losses might “over-smooth” and diminish useful gradients.
Compared to the defended ResNet38 and MobileNetV2 backbones, RDI-Nets with Max-Average defense achieve higher TAs and ATAs for both. In particular, the ATA (Worst-Case) of RDI-ResNet38 surpasses the ATA of ResNet38 defended by PGD-adversarial training by 1.03%, while saving around 30% inference budget. We find that different defenses on CIFAR-10 have more notable impacts on computational saving. Seemingly, a stronger defense (Max-Average) requires inputs to go through the scrutiny of more layers on average before outputting confident enough predictions, which is a sensible observation.
Visualization of Adaptive Inference Behaviors
We visualize the exiting behaviors of RDI-ResNet38 in Fig. 2. We plot each branch’s exiting percentage on the clean set and the (worst-case) adversarial sets of examples. A few interesting observations can be made. First, we observe that the single-attack-defended model can be easily fooled, as adversarial examples can be routed through other less-defended outputs (due to the limited transferability of attacks between different outputs). Second, the two stronger defenses (Average and Max-Average) show much more uniform usage of multiple outputs. Their routing behaviors for clean examples are almost identical. For adversarial examples, Max-Average tends to call upon the full inference more often (i.e., more “conservative”).
Table 3: Results of RDI-ResNet38 on CIFAR-10 under different defense methods.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  92.43%  83.74%  82.42%  83.79%
ATA (Branch1)  0.12%  12.02%  71.56%  69.71%
ATA (Branch2)  0.01%  5.58%  66.67%  63.11%
ATA (Branch3)  0.04%  42.73%  60.65%  60.72%
ATA (Branch4)  0.06%  34.95%  50.17%  47.82%
ATA (Branch5)  0.06%  41.77%  44.83%  45.53%
ATA (Branch6)  0.11%  41.68%  45.83%  44.12%
ATA (Main Branch)  0.13%  42.74%  47.52%  49.82%
ATA (Average)  0.01%  9.14%  42.09%  43.32%
ATA (Max-Average)  0.01%  7.15%  40.53%  43.43%
ATA (Worst-Case)  0.01%  5.58%  40.53%  43.32%
Average MFlops  29.41  48.27  56.90  57.81
Computation Saving  62.96%  39.20%  28.35%  27.20%
Table 4: Results of RDI-MobileNetV2 on CIFAR-10 under different defense methods.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  93.22%  85.28%  82.14%  84.91%
ATA (Branch1)  0.35%  37.40%  67.65%  71.78%
ATA (Branch2)  0%  47.35%  50.38%  50.15%
ATA (Main Branch)  0%  46.69%  49.33%  46.99%
ATA (Average)  0%  35.20%  45.93%  47.00%
ATA (Max-Average)  0%  36.66%  49.33%  50.18%
ATA (Worst-Case)  0%  35.20%  45.93%  46.99%
Average MFlops  49.78  52.81  58.23  60.84
Computation Saving  42.72%  39.23%  33.00%  29.99%
4.3 Comparison with Defended Sparse Networks
An alternative way to achieve the accuracy-robustness-efficiency trade-off is to defend a sparse or compressed model. Inspired by (Guo et al., 2018; Gui et al., 2019), we compare RDI-Nets with Max-Average defense to the following baseline: first compressing the network with a state-of-the-art model compression method (Huang and Wang, 2018), and then defending the compressed network using PGD-10 adversarial training. We sample different sparsity ratios in (Huang and Wang, 2018) to obtain models of different complexities. Fig. 6 in the Appendix visualizes the comparison on ResNet38: for either method, we sample a few models of different MFlops. At similar inference costs (e.g., 49.38M for pruning + defense, and 48.35M for RDI-Nets), our proposed approach consistently achieves higher ATAs (a margin of about 2%) than the strong pruning + defense baseline, with higher TAs.
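Table 5: Comparison between ATMC (Gui et al., 2019) and RDI-ResNet38 with Max-Average defense.
Methods  TA (%)  ATA (%)  MFlops
ATMC (Gui et al., 2019)  83.81  43.02  56.82
RDI-ResNet38 (Worst-Case)  83.79  43.32  57.81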
4.4 Generalized Robustness Against Other Attackers
In the aforementioned experiments, we have only evaluated RDI-Nets against “deterministic” PGD-based adversarial images. We show that RDI-Nets also achieve better generalized robustness against other “randomized” or unseen attackers. We create a new “random attack” that randomly combines the multi-exit losses, and summarize the results in Table 6. We also follow a similar setting to Gui et al. (2019) and report the results against the FGSM (Goodfellow et al., 2015) and WRM (Sinha et al., 2018) attackers, in Tables 7 and 8 respectively (more complete results can be found in Appendix D).
Table 6: Results of RDI-ResNet38 on CIFAR-10 against the random attack.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  92.43%  83.74%  82.42%  83.79%
ATA (Random)  0.01%  10.33%  43.11%  44.86%
Average MFlops  27.33  52.36  55.21  56.54
Computation Saving  65.58%  34.07%  30.48%  28.80%
Table 7: Results of RDI-ResNet38 on CIFAR-10 against the FGSM attacker.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  92.43%  83.74%  82.42%  83.79%
ATA (Main Branch)  11.51%  51.45%  53.64%  54.72%
ATA (Average)  11.41%  50.21%  51.81%  53.20%
ATA (Max-Average)  2.09%  47.53%  50.63%  52.40%
ATA (Worst-Case)  2.09%  47.53%  50.63%  51.05%
Average MFlops  65.74  55.27  58.27  59.67
Computation Saving  17.21%  30.40%  26.40%  24.86%
Table 8: Results of RDI-ResNet38 on CIFAR-10 against the WRM attacker.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  92.43%  83.74%  82.42%  83.79%
ATA (Main Branch)  34.42%  83.74%  82.42%  83.78%
ATA (Average)  26.48%  83.69%  82.36%  83.77%
ATA (Max-Average)  23.51%  83.73%  82.40%  83.78%
ATA (Worst-Case)  23.51%  83.69%  82.36%  83.77%
Average MFlops  50.05  50.46  52.89  52.38
Computation Saving  36.98%  36.46%  33.40%  34.04%
5 Discussion and Analysis
Intuition: Multi-Output Networks as Special Ensembles
Our intuition on defending multi-output networks arises from the success of ensemble defense in improving both accuracy and robustness (Tramèr et al., 2018; Strauss et al., 2017), which also aligns with the model capacity hypothesis (Nakkiran, 2019). A general multi-output network (Xu et al., 2019) could be decomposed into an ensemble of single-output models, with weight reuse enforced among them. It is thus more compact than an ensemble of independent models, and the extent of weight sharing calibrates ensemble diversity versus efficiency. Therefore, we expect a defended multi-output network to (mostly) inherit the strong accuracy/robustness of ensemble defense, while keeping the inference cost lower.
Do “Triple Wins” Go Against the Model Capacity Needs?
We point out that our seemingly “free” efficiency gains (e.g., not sacrificing TA/ATA) do not go against the current belief that a more accurate and robust classifier relies on a larger model capacity (Nakkiran, 2019). From the visualization, there remains a portion of clean/adversarial examples that have to utilize the full inference to predict well. In other words, the full model capacity is still necessary to achieve our current TAs/ATAs. Meanwhile, just like in standard classification (Wang et al., 2018a), not all adversarial examples are born equal. Many of them can be predicted using fewer inference costs (taking earlier exits). Therefore, RDI-Nets reduce the “effective model capacity” averaged over all testing samples for overall higher inference efficiency, while not altering the full model capacity.
6 Conclusion
This paper aims to simultaneously achieve high accuracy and robustness while keeping inference costs lower. We introduce the multi-output network and input-adaptive dynamic inference as a strong tool to the adversarial defense field for the first time. Our RDI-Nets achieve the “triple wins” of better accuracy, stronger robustness, and around 30% inference computational savings. Our future work will extend RDI-Nets to more dynamic inference mechanisms.
7 Acknowledgement
We would like to thank Dr. Yang Yang from Walmart Technology for highly helpful discussions throughout this project.
Appendix A Learning Details of RDI-Nets
MNIST
We adopt the network architecture from (Chen et al., 2018) with four convolutions and three fully-connected layers. We train for iterations with a batch size of . The learning rate is initialized as and is lowered by at th and th iteration. For hybrid loss, the weights are set as for simplicity. For adversarial defense/attack, we perform 40-step PGD for both defense and evaluation. The perturbation size and step size are set as and .
CIFAR-10
We take ResNet38 and MobileNetV2 as the backbone architectures. For RDI-ResNet38, we initialize learning rate as and decay it by a factor of 10 at th and th iteration. The learning procedure stops at iteration. For RDI-MobileNetV2, the learning rate is set to and is lowered by times at th and th iteration. We stop the learning procedure at iteration. For hybrid loss, we follow the discussion in (Hu et al., 2019) and set of RDI-ResNet38 and RDI-MobileNetV2 as and , respectively. For adversarial defense/attack, the perturbation size and step size are set as and . 10-step PGD is performed for defense and 20-step PGD is utilized for evaluation.
Appendix B Network Structure of RDI-Nets
To build RDI-Nets, we follow a similar setting to Teerapittayanon et al. (2017) by appending additional branch classifiers at equidistant points throughout a given network, as illustrated in Fig. 3, Fig. 4 and Fig. 5. A few pooling operations, lightweight convolutions and fully-connected layers are appended to each branch classifier. Note that the extra FLOPs introduced by side branch classifiers are less than 2% of those of the original ResNet38 or MobileNetV2.
Appendix C Input-Adaptive Inference for RDI-Nets
Similar to the deterministic strategy in Teerapittayanon et al. (2017), we adopt the entropy as the measure of the prediction confidence. Given a prediction vector $\hat{y} \in \mathbb{R}^{C}$, where $C$ is the number of classes, the entropy of $\hat{y}$ is defined as follows,
$$H(\hat{y}) = -\sum_{c=1}^{C} \hat{y}_c \log\left(\hat{y}_c + \delta\right) \tag{6}$$
where $\delta$ is a small positive constant used for robust entropy computation. To perform fast inference on a $(K+1)$-output RDI-Net, we need to determine the threshold numbers $t_k$, so that the input $x$ will exit at the $k$-th branch if the entropy of $\hat{y}_k$ is smaller than $t_k$, consistent with Section 3.1. To choose $t_k$, Huang et al. (2018) provides a good starting point by fixing the exiting probability of each branch classifier equally on the validation set so that each sample can equally contribute to inference. We follow this strategy but adjust the thresholds to make the contribution of middle branches slightly larger than the early branches. The threshold numbers for RDI-SmallCNN, RDI-ResNet38, and RDI-MobileNetV2 are set to be , , and , respectively.
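One way to turn target exit fractions into entropy thresholds is a per-branch quantile over the validation samples that have not yet exited. The sketch below illustrates that idea only; it is not the authors' procedure, and the function name, the sequential "remaining samples" bookkeeping and the use of np.nextafter to keep a strict "entropy below threshold" rule are assumptions.

```python
import numpy as np

def calibrate_thresholds(val_entropies, exit_fractions):
    """Choose entropy thresholds so that roughly the requested fraction of
    validation samples exits at each side branch.

    val_entropies: list of (num_samples,) arrays, one per side branch (earliest first).
    exit_fractions: target fraction of all validation samples leaving at each side branch.
    A sample is assumed to exit at a branch when its entropy is strictly below the threshold.
    """
    remaining = np.ones(len(val_entropies[0]), dtype=bool)
    total = remaining.size
    thresholds = []
    for ent, frac in zip(val_entropies, exit_fractions):
        n_exit = min(int(round(frac * total)), int(remaining.sum()))
        if n_exit == 0:
            thresholds.append(-np.inf)          # no sample exits at this branch
            continue
        # The n_exit most confident (lowest-entropy) remaining samples exit here.
        cutoff = np.sort(ent[remaining])[n_exit - 1]
        thresholds.append(float(np.nextafter(cutoff, np.inf)))
        remaining &= ~(ent <= cutoff)
    return thresholds
```

Giving middle branches a larger share, as described above, amounts to passing non-uniform exit_fractions.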
Appendix D Generalized Robustness
Here, we introduce the attack form of the random attack and report the complete results against the FGSM (Goodfellow et al., 2015) and WRM (Sinha et al., 2018) attackers under various attack forms, in Tables 9 and 10, respectively.
Random Attack
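The attack exploits multi-loss flexibility by randomly fusing all losses. Given an $N$-output network, we have a fusion vector $a \in \mathbb{R}^{N}$ with $a \sim \mathcal{D}$, where $\mathcal{D}$ is some distribution (uniform by default). We denote $a_i$ as the $i$-th element of $a$, and $x^{adv}_{rdm}$ can be found by:
$$x^{adv}_{rdm} = \arg\max_{x' : \|x' - x\|_\infty \le \epsilon} \sum_{i=1}^{N} a_i \, \mathcal{L}\big(f_{\theta_i}(x'), y\big) \tag{7}$$
It is expected to challenge our defense, due to the infinitely many ways of randomly fusing outputs.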
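The random fusion step can be sketched in a few lines. The snippet assumes that each entry of the fusion vector is drawn independently from a uniform distribution on [0, 1) and that the per-exit losses are already available as numbers; both the function name and these choices are illustrative assumptions rather than details given in the text.

```python
import numpy as np

def random_fused_loss(exit_losses, rng):
    """Fuse per-exit losses with a freshly sampled random weight vector."""
    exit_losses = np.asarray(exit_losses, dtype=float)
    a = rng.uniform(0.0, 1.0, size=exit_losses.shape[0])  # assumed U[0, 1) entries
    return float(a @ exit_losses), a
```

An attacker would maximize this fused loss over the perturbation, and each new draw of the fusion vector defines a different attack.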
Table 9: Complete results of RDI-ResNet38 on CIFAR-10 against the FGSM attacker under various attack forms.
Defense Method  Standard  Main Branch  Average  Max-Average
TA  92.43%  83.74%  82.42%  83.79%
ATA (Branch1)  20.69%  66.06%  72.77%  72.76%
ATA (Branch2)  16.15%  53.87%  70.40%  69.71%
ATA (Branch3)  8.13%  63.70%  64.19%  65.14%
ATA (Branch4)  10.09%  56.67%  58.45%  58.20%
ATA (Branch5)  9.45%  50.81%  52.76%  52.96%
ATA (Branch6)  10.22%  50.34%  53.17%  51.05%
ATA (Main Branch)  11.51%  51.45%  53.64%  54.72%
ATA (Average)  11.41%  50.21%  51.81%  53.20%
ATA (Max-Average)  2.09%  47.53%  50.63%  52.40%
ATA (Worst-Case)  2.09%  47.53%  50.63%  51.05%
Average MFlops  65.74  55.27  58.27  59.67
Computation Saving  17.21%  30.40%  26.40%  24.86%
Table 10: Results against the WRM attacker under various attack forms.
| Defence Method | Standard | Main Branch | Average | MaxAverage |
|---|---|---|---|---|
| TA | 92.43% | 83.74% | 82.42% | 83.79% |
| ATA (Branch1) | 46.60% | 83.73% | 82.42% | 83.78% |
| ATA (Branch2) | 71.33% | 83.73% | 82.42% | 83.79% |
| ATA (Branch3) | 23.51% | 83.73% | 82.41% | 83.78% |
| ATA (Branch4) | 33.41% | 83.73% | 82.42% | 83.78% |
| ATA (Branch5) | 42.35% | 83.73% | 82.41% | 83.78% |
| ATA (Branch6) | 47.77% | 83.74% | 82.40% | 83.78% |
| ATA (Main Branch) | 34.42% | 83.74% | 82.42% | 83.78% |
| ATA (Average) | 26.48% | 83.69% | 82.36% | 83.77% |
| ATA (MaxAverage) | 23.51% | 83.73% | 82.40% | 83.78% |
| ATA (WorstCase) | 23.51% | 83.69% | 82.36% | 83.77% |
| Average MFlops | 50.05 | 50.46 | 52.89 | 52.38 |
| Computation Saving | 36.98% | 36.46% | 33.40% | 34.04% |
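The last two rows of the tables above are related by simple arithmetic: the computation saving is the relative reduction of the average per-input cost with respect to running the full backbone. The sketch below makes that relation explicit. The full-backbone cost is not stated in this section; the value of about 79.41 MFlops used here is an assumption obtained by back-solving from most of the reported (Average MFlops, Computation Saving) pairs, and the function names are hypothetical.

```python
def computation_saving(avg_mflops, full_mflops):
    """Relative saving of input-adaptive inference versus always running the full network."""
    return 1.0 - avg_mflops / full_mflops


def implied_full_cost(avg_mflops, saving):
    """Back-solve the full-network cost from a reported (average cost, saving) pair."""
    return avg_mflops / (1.0 - saving)


# Assumed full-backbone cost, inferred from the tables rather than stated in the text.
FULL_MFLOPS = 79.41

saving_fgsm_main_branch = computation_saving(55.27, FULL_MFLOPS)
saving_wrm_standard = computation_saving(50.05, FULL_MFLOPS)
```

Under this assumed baseline, an average cost of 55.27 MFlops corresponds to a saving of roughly 30.4%, matching the reported entry for the Main Branch defence under FGSM.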
Footnotes
We also tried adversarial training using single attacks on the earlier side branches, and found their TA/ATA to deteriorate much more than with the main branch attack. We thus report only the latter for compactness.
References
Adversarial transformation networks: learning to generate adversarial examples. In AAAI.
Evasion attacks against machine learning at test time. In ECML.
Towards evaluating the robustness of neural networks. In SP.
Neural ordinary differential equations. In NeurIPS.
Certified adversarial robustness via randomized smoothing. In ICML.
Spatially adaptive computation time for residual networks. In CVPR.
Explaining and harnessing adversarial examples. In ICLR.
Model compression with adversarial robustness: a unified optimization framework. In NeurIPS, pp. 1283–1294.
Sparse dnns with improved adversarial robustness. In NeurIPS.
Learning both weights and connections for efficient neural network. In NeurIPS.
Deep residual learning for image recognition. In CVPR.
Anytime neural networks via joint optimization of auxiliary losses. In AAAI.
Multi-scale dense networks for resource efficient image classification. In ICLR.
Data-driven sparse structure selection for deep neural networks. In ECCV.
Shallow-deep networks: understanding and mitigating network overthinking. In ICML.
Adversarial machine learning at scale. In ICLR.
Defense against adversarial attacks using high-level representation guided denoiser. In CVPR.
Defensive quantization: when efficiency meets robustness. In ICLR.
Towards deep learning models resistant to adversarial attacks. In ICLR.
Adversarial robustness may be at odds with simplicity. arXiv.
The limitations of deep learning in adversarial settings. In EuroS&P.
Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR.
Adversarially robust generalization requires more data. In NeurIPS.
Certifying some distributional robustness with principled adversarial training. In ICLR.
Pixeldefend: leveraging generative models to understand and defend against adversarial examples. In ICLR.
Ensemble methods as a defense to adversarial perturbations against deep neural networks. arXiv.
Towards understanding adversarial examples systematically: exploring data size, task and model factors. arXiv.
Intriguing properties of neural networks. In ICLR.
Convolutional neural networks with low-rank regularization. In ICLR.
BranchyNet: fast inference via early exiting from deep neural networks. In ICPR.
Ensemble adversarial training: attacks and defenses. In ICLR.
Robustness may be at odds with accuracy. In ICLR.
Skipnet: learning dynamic routing in convolutional networks. In ECCV.
Energynet: energy-efficient dynamic inference.
Dual dynamic inference: enabling more efficient, adaptive and controllable deep inference. arXiv preprint arXiv:1907.04523.
Adversarial learning of portable student networks. In Thirty-Second AAAI Conference on Artificial Intelligence.
Learning structured sparsity in deep neural networks. In NeurIPS.
Quantized convolutional neural networks for mobile devices. In CVPR.
Deep k-means: re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. In ICML.
Mitigating adversarial effects through randomization. In ICLR.
A survey on multi-output learning. arXiv.
Feature squeezing: detecting adversarial examples in deep neural networks. In NDSS.
Adversarial robustness vs model compression, or both?. In ICCV.
On compressing deep models by low rank and sparse decomposition. In CVPR.
Theoretically principled trade-off between robustness and accuracy. arXiv.
ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In CVPR.