MixTrain: Scalable Training of Formally Robust Neural Networks
There is an arms race to defend neural networks against adversarial examples. Notably, adversarially robust training and verifiably robust training are the most promising defenses. Adversarially robust training scales well but cannot provide a provable guarantee that no attack exists. We present an Interval Attack that reveals fundamental problems with the threat model used by adversarially robust training. Verifiably robust training, by contrast, achieves a sound guarantee, but it is computationally expensive and sacrifices accuracy, which prevents it from being applied in practice.
In this paper, we propose two novel techniques for verifiably robust training, stochastic output approximation and dynamic mixed training, to solve the aforementioned challenges. They are based on two critical insights: (1) soundness is only needed for a subset of the training data; and (2) beyond a certain point in verifiably robust training, verifiable robustness and test accuracy come into conflict.
On both MNIST and CIFAR datasets, we achieve similar test accuracy and estimated robust accuracy against PGD attacks within less training time than state-of-the-art adversarially robust training techniques, and we obtain up to 95.2% verified robust accuracy as a bonus. In addition, to achieve similar verified robust accuracy, we require substantially less computation time and offer a 9.2% test accuracy improvement compared to current state-of-the-art verifiably robust training techniques.
Deep learning models are not robust against adversarial examples. This prevents the models from being reliably applied in security-critical systems such as autonomous vehicles [2, 3, 4] and unmanned aircraft [5, 6, 7, 8]. Consequently, various defense systems have been proposed [9, 10, 11]. However, stronger attacks are constantly constructed to break the defenses [12, 13, 14]. To end this arms race, two types of robust training schemes have proven to be the strongest defenses, i.e., adversarially robust training and verifiably robust training.
Adversarially robust training defends against adversaries whose capabilities are bounded within a specific class of attacks. Given the threat model, the neural network can be trained with the strongest attack in the class so that it becomes robust against the entire class of attacks. The threat model poses a risk in the assumptions used to derive the strongest attack. In particular, recent work has defined a class of first-order attacks, where the adversary has access only to first-order gradient information. Projected Gradient Descent (PGD) is used as the strongest first-order attack in adversarially robust training, because starting the PGD attack from a random point near the image yields a highly concentrated distribution of loss values over distinct local maxima. However, a first-order attacker's capability is not bounded by a random starting point. More importantly, promising starting points are not distributed uniformly at random. As evidence, we construct a novel first-order attack, the Interval Attack, that guides the attacker to a more promising region from which to start the PGD attack. Our interval attack is much more effective than the PGD attack alone: our results show that it finds up to 1.7 times more violations on adversarially robust trained models than state-of-the-art attacks such as PGD or CW.
In contrast, verifiably robust training uses a much stronger threat model. It can provide a sound robustness guarantee of the non-existence of attacks, even unknown attacks launched by adversaries with unlimited computing resources. Unfortunately, to provide such a rigorous provable guarantee, the state-of-the-art verifiably robust training methods incur high computation and memory costs and lower model accuracy. Specifically, verifiably robust training utilizes sound approximation methods that are computationally expensive when precise, and there is an inherent tradeoff between high robustness and high accuracy. These problems make it challenging to apply verifiably robust training in the real world, especially to large neural networks.
In this paper, we propose two novel verifiably robust training methods, stochastic output approximation and dynamic mixed training, to solve the aforementioned challenges. Stochastic output approximation significantly improves the time- and memory-efficiency of the robust training process by computing the maximal verifiable robust loss over a sampled subset of the training dataset. The sampled verifiable robust loss distribution closely approximates the distribution over the full training set, which allows us to achieve high verifiable robustness at low computational cost. Dynamic mixed training, in turn, balances the goals of training for high verifiable robustness and for high test accuracy: we design a dynamic loss function that integrates the verifiable robustness and test accuracy objectives according to the current training state. We incorporate these two techniques into our system, MixTrain.
We design MixTrain as a general defense based on verifiably robust training that can integrate any distance metric for the input bounds and any sound approximation method. In our experiments, we used the l∞-norm distance metric and symbolic intervals for sound approximation. On both MNIST and CIFAR datasets, while achieving test accuracies and PGD accuracies similar to state-of-the-art adversarially robust training methods, MixTrain offers a bonus of up to 95.2% verified robust accuracy. Moreover, MixTrain is substantially faster than adversarially robust training methods, whereas the latter cannot provide any verified robust accuracy. Also, to achieve targeted verifiable robustness, MixTrain requires substantially less computation time and memory in training than state-of-the-art verifiably robust training schemes. For large networks, we can train classifiers with up to a 9.2% test accuracy improvement and 4.2% higher verified robust accuracy.
Our contributions can be summarized as follows:
We construct a strong first-order attack, the Interval Attack. Our attack reveals a fundamental problem with the threat model of adversarially robust training, and the necessity of verifiably robust training.
We propose two novel techniques for verifiably robust training, Stochastic Output Approximation and Dynamic Mixed Training. Our techniques provide high verified robust accuracy without sacrificing test accuracy, and they significantly reduce the computation required, making verifiably robust training practical.
We implemented our techniques in MixTrain and thoroughly evaluated the system across existing accuracy metrics, including test accuracy, estimated robust accuracy, and verified robust accuracy. Our system achieves targeted verified robust accuracy substantially faster than existing schemes, and scales to large networks such as ResNet.
In this section, we provide an overview of network robustness optimization and three existing robustness guarantees, i.e., adversarial, distributional, and verifiable robustness. We then introduce two training schemes that strengthen network robustness: adversarially robust training [15, 17] and verifiably robust training [16, 18, 19].
II-A Robust Optimization
The robustness of a neural network can be obtained through a robust optimization process. Formally, a neural network f_θ with weights θ maps input features x to outputs f_θ(x). Given an input pair (x, y) drawn from the underlying distribution D, f_θ predicts the label of x as argmax_j f_θ(x)_j. We define the correctness, robustness, and violation of f_θ on (x, y) as follows.
Correctness: x is classified correctly if argmax_j f_θ(x)_j = y.
Robustness: f_θ is robust around x if, for every x' ∈ B_ε(x), f_θ always computes y, the true label.
Violation: If there exists x' ∈ B_ε(x) such that argmax_j f_θ(x')_j ≠ y, then x' is a violation of the robustness property.
In this paper, we use B_ε(x) to denote the l∞-ball of radius ε around x, and we use the common cross-entropy loss function L(f_θ(x), y) to train the weights θ. Note that our defense schemes can support any definition of B_ε(x) and any loss function, and can be used with any sound approximation method during training.
The standard normal training process optimizes the weights θ to minimize the empirical risk of the loss value L(f_θ(x), y). Batch training is commonly used to update the weights: the training data is first split into batches of fixed size; then, for each batch of training data points X_B and labels Y_B, the batch training procedure computes their outputs, performs backward propagation, and updates the weights once.
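As a concrete reference point, the batch-training procedure above can be sketched as follows. The linear model, toy data, and hyperparameters are illustrative assumptions for this sketch, not the paper's setup.

```python
import numpy as np

def softmax_xent(logits, y):
    """Mean cross-entropy loss L over a batch."""
    z = logits - logits.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(y)), y].mean()

def train_epoch(W, X, Y, batch_size=32, lr=0.1):
    """One epoch of plain batch training: split into batches, forward,
    backward, one weight update per batch (empirical risk minimization)."""
    idx = np.random.permutation(len(X))
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]
        xb, yb = X[b], Y[b]
        logits = xb @ W                       # forward pass (linear model)
        # closed-form gradient of mean cross-entropy w.r.t. W
        z = logits - logits.max(axis=1, keepdims=True)
        p = np.exp(z); p /= p.sum(axis=1, keepdims=True)
        p[np.arange(len(yb)), yb] -= 1.0
        W -= lr * xb.T @ p / len(yb)          # one update per batch
    return W
```

On linearly separable toy data, a few such epochs drive the empirical risk well below the uninformed baseline.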
Researchers have formalized the robustness optimization process [20, 21, 15] as minimizing the empirical risk of the largest loss value within B_ε(x). Specifically, network robustness optimization combines two problems, as shown in Equation 1:

min_θ E_{(x,y)∼D} [ max_{x'∈B_ε(x)} L(f_θ(x'), y) ]    (1)

Inner Maximization problem: finding x' ∈ B_ε(x) that maximizes the loss value within the allowable input range.
Outer Minimization problem: optimizing the weights θ to minimize the empirical maximal loss.
Solving the Inner Maximization problem is NP-hard. Two popular solutions have been introduced, namely gradient-based attacks and sound approximations. They are used by adversarially robust training and verifiably robust training, respectively, when combined with the Outer Minimization. Before discussing the details of the training schemes, we first introduce three existing robustness guarantees with different threat models: adversarial, distributional, and verifiable robustness.
II-B Adversarial Robustness
II-B1 Threat Model
The adversarial robustness guarantee defines robustness against bounded attackers of a specific type. Earlier works trained neural network models to be adversarially robust against specific attacks [22, 23, 24, 25] launched by non-adaptive adversaries using known attacks. To unify prior work, Madry et al. proposed the notion of a "first-order adversary" as a broad class of adaptive attackers. All first-order adversaries' manipulative power is bounded by access to only the first-order gradient information of the loss function when perturbing the input within the allowable range B_ε(x). Madry et al. assume that first-order adversaries can start the attack from a random location near the input. Their experiments with random starting points show that the Projected Gradient Descent (PGD) attack finds violations that represent distinct local maxima but with similar loss values. Because of this concentrated loss value distribution, they concluded that a PGD attacker with a random start can be thought of as the "ultimate" first-order adversary for solving the Inner Maximization problem. However, we show in Section III-B that PGD adversaries are not the strongest first-order attackers and that the underlying assumption based on randomly sampled attack starting points is fundamentally wrong.
The adversarial robustness guarantee provides a high Estimated Robust Accuracy (ERA) for the defended model: the percentage of test samples that are robust against first-order attacks (e.g., Projected Gradient Descent (PGD), the Fast Gradient Sign Method (FGSM), and Carlini-Wagner (CW) attacks). A defended model is at most as robust as its ERA indicates, since unknown attacks may generate violations for further test samples. Adversarial robustness is achieved by adversarially robust training.
II-B2 Adversarially Robust Training
The state-of-the-art adversarially robust training scheme of Madry et al. relies on the first-order gradients of PGD attacks to search for the largest loss value within B_ε(x). In fact, given an input pair (x, y), first-order attacks give a lower bound of the Inner Maximization problem:

L(f_θ(x_adv), y) ≤ max_{x'∈B_ε(x)} L(f_θ(x'), y)

Specifically, for a batch of training data (X_B, Y_B), the training process first conducts PGD perturbations from random start points to generate a perturbed training batch X'_B, representing the estimated worst case of the inner problem. The corresponding losses of X'_B and the labels Y_B are then used to solve the Outer Minimization problem in Equation 1 as:

min_θ (1/|X_B|) Σ_{(x_adv, y)∈(X'_B, Y_B)} L(f_θ(x_adv), y)
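The PGD step used for the inner problem can be sketched as follows. The analytic `loss_grad` callback and the toy loss in the usage note are stand-ins for a network's cross-entropy gradient, not the paper's implementation.

```python
import numpy as np

def pgd_attack(x0, loss_grad, eps, steps=40, alpha=0.01, random_start=True):
    """Projected gradient ascent on the loss inside the l-inf ball B_eps(x0).

    loss_grad: callable returning dL/dx at a point (caller-supplied).
    """
    if random_start:
        x = x0 + np.random.uniform(-eps, eps, size=x0.shape)
    else:
        x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(loss_grad(x))   # signed gradient ascent step
        x = np.clip(x, x0 - eps, x0 + eps)      # project back into the ball
    return x
```

For a loss that increases in every coordinate (gradient of all ones), the attack converges to the corner x0 + eps of the ball, the true inner maximizer for that toy loss.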
Adversarially robust training is highly efficient, and achieves high test accuracy and high ERA. However, it suffers from low VRA, which makes it unsuitable for security-sensitive applications. The estimated adversarial robust loss is much smaller than the true maximal loss, as we will show in Section III-B. Since gradient-based attacks tend to get stuck at local optima, they do not provide a verifiable robustness guarantee against unknown attacks. In contrast, though computationally more expensive, verifiably robust training schemes can achieve similar test accuracy and ERA, with the bonus of much higher VRA on the same model architecture.
II-C Distributional Robustness
II-C1 Threat Model
The ideal threat model for distributional robustness is that the adversary can draw data samples from any distribution P that is within a small distance of the actual underlying data distribution D. The adversary's capability is bounded by the Wasserstein ball of radius ρ around D; formally, the robustness region is {P : W(P, D) ≤ ρ}. Otherwise, there is no restriction on the information the adversary has access to or the types of attack the adversary can perform.
To make the problem tractable in practice, Sinha et al. used a Lagrangian relaxation to reformulate the distribution distance bound as a distance penalty. Since D is unknown, the authors used the empirical distribution D̂. Also, ReLU activations are replaced with ELU to smooth the loss function. In the implementation, the adversary uses the WRM attack, based on stochastic gradient descent, to find attack instances from a nearby distribution. It is unclear whether closeness between distributions implies visual similarity between the attack instance and the original input.
II-C2 Training Scheme
In practice, distributional robustness is achieved by adversarially robust training on instances found by the WRM attack. This provides a statistical guarantee, with high probability, over the neighborhood of the distribution under the Wasserstein metric. The resulting distributional robustness provides an upper bound on the expectation of the loss function only when the bounded input range is very small, and it underperforms the other robustness guarantees as the bound grows. Since the bound's radius reflects the strength of the attack, distributional robustness does not provide a good guarantee against stronger attacks.
II-D Verifiable Robustness
II-D1 Threat Model
Unlike the other two robustness guarantees, verifiable robustness is a robustness property against attackers with unlimited computation power within the allowable input range B_ε(x). The attacker can use any information, draw inputs from arbitrary distributions, and launch unknown attacks. This threat model gives the strongest robustness guarantee.
To capture the notion of unbounded attacker capabilities, formal verification methods are used to provide the verifiable robustness guarantee. Sound approximation methods can over-approximate the output range of the network over a given input range, and therefore the outputs generated by any attacker. If no violation is found in the over-approximated output range, no successful attack can be constructed within the input range. Therefore, verifiable robustness is sound.
This guarantee increases the Verified Robust Accuracy (VRA): the percentage of test samples proven to have no violations. A defended model is at least as robust as its VRA indicates, since unverified test samples may merely be false positives caused by over-approximation. Verifiably robust training achieves this guarantee. In this paper, we focus on the verifiable robustness guarantee.
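The relationship between the two metrics can be made concrete: on the same test set, ERA upper-bounds true robustness (an attack may simply have missed a violation) while VRA lower-bounds it (an unverified sample may be a false positive of the approximation). The boolean arrays in the usage below are hypothetical per-sample outcomes, not measured results.

```python
import numpy as np

def estimated_robust_accuracy(attack_found_violation):
    """ERA: fraction of test samples a known attack could NOT break."""
    return 1.0 - float(np.mean(attack_found_violation))

def verified_robust_accuracy(verified_safe):
    """VRA: fraction of test samples proven free of violations."""
    return float(np.mean(verified_safe))
```

Since every verified sample is also unbreakable by any attack, VRA ≤ ERA always holds for a sound verifier and any attack.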
II-D2 Verifiably Robust Training
Verifiably robust training schemes use sound approximation methods to solve the Inner Maximization problem. Sound approximation provides a sound upper bound on the maximal loss value. Existing methods include convex polytopes, interval propagation [7, 27], abstract domains, Lagrangian relaxation, and relaxation with Lipschitz constants. Essentially, they perform a sound transformation from the input range to the output of the network f_θ. Formally, given input x and allowable input range B_ε(x), the transformation producing output bounds [f_lo(x), f_hi(x)] is sound by the following definition:

∀x' ∈ B_ε(x): f_lo(x) ≤ f_θ(x') ≤ f_hi(x) (elementwise)

Let the worst-case output f̂(x) of the sound approximation denote the following: f̂(x)_y = f_lo(x)_y for the true class y, and f̂(x)_j = f_hi(x)_j for every other class j ≠ y.

The verifiable robustness guarantee for an input pair (x, y) is argmax_j f̂(x)_j = y. It means that, over the allowable input range B_ε(x), the class y will always be predicted as the largest. Therefore, the sound approximation gives an upper bound of the Inner Maximization problem:

max_{x'∈B_ε(x)} L(f_θ(x'), y) ≤ L(f̂(x), y)
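A minimal instance of such a sound transformation is naive interval bound propagation, sketched below: elementwise bounds are pushed through each affine-plus-ReLU layer, and the robustness check compares the true class's worst-case (lower-bound) logit against every other class's best case. The tiny one-layer weights in the usage are illustrative, not a trained model.

```python
import numpy as np

def interval_forward(lo, hi, weights, biases):
    """Soundly propagate elementwise bounds [lo, hi] through affine + ReLU
    layers, splitting each weight matrix into positive and negative parts."""
    for i, (W, b) in enumerate(zip(weights, biases)):
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        lo, hi = Wp @ lo + Wn @ hi + b, Wp @ hi + Wn @ lo + b
        if i < len(weights) - 1:              # ReLU on hidden layers only
            lo, hi = np.maximum(lo, 0.0), np.maximum(hi, 0.0)
    return lo, hi

def verified_robust(x, y, eps, weights, biases):
    """True if the worst-case logit of class y still beats every other
    class's best case over B_eps(x) -- i.e., argmax of f-hat equals y."""
    lo, hi = interval_forward(x - eps, x + eps, weights, biases)
    return lo[y] > np.delete(hi, y).max()
```

Because the bounds are an over-approximation, `verified_robust` returning False does not prove a violation exists; it only means this analysis could not rule one out.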
To approximate the bounds efficiently, certain input dependencies are ignored, which introduces overestimation errors that propagate through the deep layers of a large neural network, resulting in large overestimation of the loss values [7, 27, 30]. After comparing the tightness of different methods, we found that they achieve similar results. We therefore use symbolic interval propagation as the main sound approximation method for verifiably robust training in this paper.
For a batch of training data points (X_B, Y_B), the training process first computes the verifiable robust loss L(f̂(x), y) under the corresponding sound approximation method. Then, the Outer Minimization problem in Equation 1 becomes:

min_θ (1/|X_B|) Σ_{(x, y)∈(X_B, Y_B)} L(f̂(x), y)
Verifiably robust training schemes achieve both high ERA and high VRA. However, existing sound approximation methods are computationally expensive, running hundreds of times slower than normal training. Due to the over-approximated maximal loss, existing training schemes also sacrifice test accuracy in exchange for robustness. These problems prevent such methods from being applied to real-world applications. In this paper, we propose two novel training techniques that provide both adequate test accuracy and VRA while requiring much less time and memory.
III Interval Attack
In this section, we present a novel interval attack against neural networks. Even though Madry et al. claim that PGD attacks are the "ultimate" first-order adversaries, the interval attack is a stronger first-order attack. We also use the loss values of violations generated by the interval attack to show that the underlying assumption of the robustness guarantee behind adversarially robust training is incorrect. In contrast, verifiably robust training is a rigorous method for providing robustness guarantees for neural networks, especially in security- and safety-sensitive domains.
III-A Details of the Interval Attack
Our attack is based on the following intuition. Existing first-order attacks such as PGD and CW can get stuck at local optima of the loss and thus miss violations within the allowable input range. Therefore, we need a broader view around the current point to jump out of the local optimum, which can lead us to an area containing potential violations. To achieve that, we utilize a sound approximation method, symbolic interval analysis. Specifically, we leverage the interval gradient to construct the interval attack.
Interval gradient g. Symbolic interval analysis provides lower and upper bounds on the output, given input intervals. The analysis concretizes values when it cannot keep the input dependencies, which introduces over-approximation error. In this paper, we use symbolic linear relaxation, which combines symbolic interval analysis with linear relaxation, to provide tighter bounds. Essentially, it uses two parallel linear equations in terms of the inputs, i.e., symbolic intervals, to tightly bound the output of each ReLU. Consequently, we obtain parallel symbolic intervals for the output and for the loss function. We define the interval gradient g as the first-order gradient of the symbolic loss interval, i.e., the shared slope of its two parallel bounding equations. For example, if symbolic linear relaxation bounds the loss by the symbolic interval [Ax + b_lo, Ax + b_hi] after propagating the input range through the network, then the interval gradient is g = A.
The point gradient only reveals the direction of the slope at the current standpoint. In contrast, the interval gradient tells the direction of the trend over the whole estimated input range. Therefore, the interval gradient can help gradient-based search reach potentially suspicious areas.
Optimizations guided by interval analysis. Figures 1(a) and 1(b) illustrate how the interval gradient can guide the attack to a violation area that the point gradient cannot find. The curve represents the loss values near the input x. Given the allowable input range B_ε(x), Figure 1(a) shows that attacks relying on the point gradient, e.g., PGD, will step into the local maximum and conclude that there is no violation. However, there exists a violation within B_ε(x), which is missed by the gradient information at the current standpoint x. In contrast, the interval gradient can point out the direction in which to reach it, as shown in Figure 1(b). The interval attack depends on interval gradients computed over a small input range B_δ(x). Compared to traditional gradient-based attacks that only consider information at a point, the interval attack uses the extra information within B_δ(x) to make a wiser update. Therefore, the interval attack is able to avoid the local optimum and successfully locate the violation.
However, interval gradients are not perfect. To take full advantage of them, a suitable δ is needed for each update step. If δ is too large, symbolic interval analysis introduces large overestimation error, leading to wrong direction guidance. On the other hand, if δ is too small, the information from the surrounding area may not be enough to make a wise update. In other words, a desirable direction may only emerge under an appropriate δ: too large, and the analysis suffers from large false positives and produces an interval gradient pointing in a useless direction; too small, and the interval gradient, though accurate, can miss the correct direction due to the lack of useful information.
Attack Algorithm. Algorithm 1 shows the main process of the interval attack. The algorithm takes a starting point x and the attack budget ε as input, and the iteration count t, step size α, and growth coefficient as parameters. It returns x' as a potential violation (Line 13). The attack makes t iterations of updates to x' (initialized to x) using the interval gradient (Lines 2 to 11). At each iteration, we need to find a suitable δ representing the region used to compute the interval gradient (Lines 3 to 7). We first assign δ a small value (Line 3). Then we gradually multiply it by the coefficient until symbolic interval analysis identifies a suspicious violation in the over-approximated output bounds (as defined in Section II). At this point, δ is the smallest value that allows symbolic interval analysis to provide useful interval gradients without sacrificing too much approximation accuracy. We pick the coefficient empirically. Its trade-off is that, as the coefficient grows, the δ found will be less precise, while the whole search takes less time.
|Inputs: x: target input, ε: attack budget|
|Parameters: t: iterations, α: step size, x_0: starting point|
After finding the suitable δ, we update x' according to the current interval gradient (Lines 8 to 9), moving toward the promising area. Line 10 clips x' to keep it within the allowable input range. If, within any iteration, x' is a violation, the algorithm returns it (Line 11). Otherwise, we use the PGD attack to further locate the violation: interval gradients usually do not converge, due to the overestimation error introduced by symbolic interval analysis, so we need several steps of a regular first-order attack like PGD to locate the violation at the end, as shown in Line 12.
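A one-dimensional toy makes the two-phase structure concrete. Here the "interval gradient" is approximated by a secant slope over the whole surrounding region, a crude stand-in for symbolic interval analysis, and the loss landscape f is invented for illustration: a shallow local maximum near x = 0.2 traps pure point-gradient ascent, while a coarse interval step followed by a PGD-style finish reaches the much higher peak near x = 1.2.

```python
import numpy as np

def f(x):
    """Toy loss: shallow local max near 0.2, high peak near 1.2."""
    return -(x - 0.2) ** 2 + 2.0 * np.exp(-10.0 * (x - 1.2) ** 2)

def point_grad(x, h=1e-5):
    """Point gradient via central difference (what PGD sees)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def interval_grad(x, d):
    """Trend of the loss over [x - d, x + d], not just at x."""
    return (f(x + d) - f(x - d)) / (2.0 * d)

def ascend(x, steps=100, lr=0.05):
    """Sign-step gradient ascent, as in PGD."""
    for _ in range(steps):
        x = x + lr * np.sign(point_grad(x))
    return x

def interval_attack_1d(x0, jump=0.8, d=1.0):
    x = x0 + jump * np.sign(interval_grad(x0, d))  # phase 1: interval step
    return ascend(x)                               # phase 2: PGD-style finish
```

Starting from x0 = 0, plain ascent stalls at the shallow maximum (loss near 0), while the interval step first jumps into the basin of the high peak, which the finisher then climbs (loss near 1).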
Iii-B Evaluation of Interval Attacks
We evaluated the interval attack on two neural network models, MNIST_FC1 and MNIST_FC2. We reimplemented symbolic interval analysis and all the attacks on top of TensorFlow 1.9.0 (https://www.tensorflow.org/). All of our experiments were run on a GeForce GTX 1080 Ti. Both models are adversarially robust trained with the same training and testing schemes as Madry et al. For the robust training, we use PGD attacks with 40 iterations and a step size of 0.01. The allowable input range is bounded in the l∞ norm after input normalization. The sizes of the models are shown in Table I. All accuracies are measured on the same 1024 images randomly picked from the MNIST test set. MNIST_FC1 contains two hidden layers, each with 512 hidden nodes; it has 98.0% test accuracy, 62.4% Estimated Robust Accuracy under PGD attacks (ERA(PGD)), and 0% Verified Robust Accuracy (VRA). MNIST_FC2 contains five hidden layers, each with 2048 hidden nodes; it has 98.8% test accuracy, 73.4% ERA(PGD), and 0% VRA.
Attack Effectiveness. Although both models are relatively robust against PGD and CW attacks, the interval attack is very effective. We ran the interval attack for 20 iterations against MNIST_FC1 and MNIST_FC2, which took 53.9 seconds and 466.5 seconds, respectively. We compared the strength of the interval attack against the state-of-the-art PGD and CW attacks given the same amount of attack time, as shown in Table I. The interval attack achieves up to 1.7 times the attack success rate of PGD and CW. PGD and CW were allowed to run many more iterations in the given time, but the growth of their success rates saturated at around 80 iterations. Our results show that the interval attack is much stronger than PGD and CW against adversarially robust trained models.
Loss Value Distribution. Our results also show that the assumption underlying the training scheme of Madry et al. is wrong. Madry et al. assumed that, for each image x, the distribution of loss values found by PGD attacks using random start points is concentrated; robustness against the PGD adversary would therefore imply robustness against all first-order adversaries, so a model could be trained to be robust using PGD attacks. We experimented with random starts to examine the loss value distribution under different attacks against the robustly trained model MNIST_FC1. We randomly picked two images on which PGD attacks cannot find violations but the interval attack can. Then, for each image, we added different random perturbations to get random starts within the allowable input range B_ε(x). We plotted the distributions of the losses found by PGD attacks, CW attacks, and interval attacks in Figure 2. On both images, PGD cannot find any violation even with random starts, and the losses found are highly concentrated between 0 and 1, which matches the claims of Madry et al. However, the interval attack finds violations for 52.40% and 47.59% of the random starts, with much larger losses. The adversarially robust trained model is robust against PGD and CW attacks, but not against the interval attack. In other words, the method of Madry et al. cannot provide robustness against all first-order adversaries even if the model is robust against PGD attacks.
Summary. The results demonstrate that adversarially robust training can increase the Estimated Robust Accuracy (ERA). The models robustly trained with PGD attacks achieve 100% ERA against PGD on the two images with random starts, since no PGD attack succeeds; the success rate of CW attacks on these images is also very low, under 1%. However, adversarially robust training cannot defend against unknown attacks: stronger attacks such as the interval attack can generate more violations. Therefore, we need verifiably robust training to defend models against unknown attacks.
IV Efficient Robust Training
Verifiably robust training provides formally verified robustness for neural networks. However, it suffers from several problems that prevent it from being applied in practice. In this section, we propose two training techniques, Stochastic Output Approximation and Dynamic Mixed Training, to tackle these problems and make verifiably robust training practical.
There are two main problems with existing verifiably robust training methods.
Time- and memory-consuming. Verifiably robust training depends on the performance of formal analysis methods to solve the Inner Maximization problem. The state-of-the-art formal analysis methods are usually hundreds of times slower, and require thousands of times more memory, than a normal forward propagation through the network [26, 16, 30, 27]. Many sound approximation methods aim for high precision, which makes their implementations time-expensive; what makes training even slower is that existing schemes conduct sound approximation for every training sample in a batch. Moreover, sound approximation methods require memory that is almost proportional to the number of hidden nodes in the network. For instance, to verifiably robust train a CIFAR residual network, training one batch of 50 images needs around 300GB in the worst case, while normal training needs only 7MB. Such time and memory costs prevent verifiably robust training from being applied in the real world.
Conflict between verifiable robustness and accuracy. There is an inherent conflict between verifiable robustness and accuracy, which we demonstrate with two experiments. First, we took a CIFAR_Small model verifiably robust trained with Wong et al.'s method. To increase its test accuracy, we further trained the model with only the normal training process. As shown in Figure 2(a), the normal loss decreases while the verifiable robust loss increases. Second, we started with a normally trained CIFAR_Small model. When we trained it with only the verifiable robust loss to enhance robustness, we saw the verifiable robust loss decrease but the normal loss increase. This phenomenon shows that training with either loss alone is not enough to achieve high performance on both metrics. Moreover, the over-approximation errors introduced by sound approximation affect the convergence of accuracy. Therefore, we need guidance to achieve a balance between high robustness and high accuracy.
Training techniques. In this paper, we propose two techniques, stochastic output approximation and dynamic mixed training, to solve these two problems. To reduce the computational cost, stochastic output approximation calculates the verifiable robust loss from a randomly sampled subset of training points, without sacrificing VRA; it provides a hyperparameter to control the sampling. Dynamic mixed training, in turn, integrates verifiably robust training with normal training to benefit from both schemes; it provides a hyperparameter to adaptively bias the emphasis toward either training scheme in each epoch, allowing us to balance test accuracy and VRA according to the current training state. We implement the two techniques together in MixTrain.
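A minimal sketch of a dynamic mixed loss follows. The blending form and the adaptation rule (shift the weight toward whichever objective is lagging) are illustrative assumptions for this sketch, not the paper's exact schedule; `alpha`, `target_acc`, and `step` are hypothetical names.

```python
# Dynamic mixed training sketch: one scalar loss blends the cheap normal
# loss and the expensive verifiable robust loss; alpha is adapted between
# epochs from the current training state.

def mixed_loss(normal_loss, robust_loss, alpha):
    """alpha in [0, 1]: 0 = pure normal training, 1 = pure robust training."""
    return (1.0 - alpha) * normal_loss + alpha * robust_loss

def update_alpha(alpha, train_acc, vra, target_acc=0.9, step=0.05):
    """Bias the next epoch toward whichever objective is falling behind."""
    if train_acc < target_acc:
        return max(0.0, alpha - step)   # accuracy lagging: favor normal loss
    if vra < train_acc:
        return min(1.0, alpha + step)   # robustness lagging: favor robust loss
    return alpha
```

The key property is that neither objective is ever optimized in isolation for long, which is exactly what the two experiments above show goes wrong.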
IV-B Stochastic Output Approximation
Researchers have spent much effort improving the precision of sound approximation methods in order to prove more interesting properties of neural networks. This high precision makes sound approximation methods very computationally expensive [28, 27, 26, 16, 30]: it is time- and memory-consuming every time we approximate the worst-case loss value over an input range (defined in Section II-D2). We propose stochastic output approximation to reduce the computational cost by minimizing the use of sound approximation. We observe that the distribution of verifiable robust losses within a sampled subset of the training dataset can closely approximate the distribution over the entire training dataset. Therefore, it is feasible to use stochastic output approximation in verifiably robust training while still achieving high VRA.
Formal definition. Sampling is a basic technique in machine learning. As discussed in Section II, all training procedures perform Empirical Risk Minimization (ERM) to learn the model. Specifically, to collect the training dataset, data points are sampled from the underlying distribution D to represent the empirical distribution D̂. The robust training procedure minimizes the expected maximal loss over the sampled training data (Equation 1); in other words, the verifiable robust loss values calculated from D̂ are used to approximate those from D. However, compared to a concrete forward propagation, it is too computationally expensive to conduct sound approximation for every single training data point collected in D̂. Therefore, we follow the same principle as ERM and further down-sample the training data points, calculating the sampled verifiable robust loss for verifiably robust training.
Our stochastic output approximation technique randomly samples a subset of training points from the entire training dataset, such that the verifiable robust loss values within the subset are representative of those over the whole dataset. From the CIFAR10 dataset, we randomly sampled 1,000 data points out of the 50,000 in the whole training dataset. Figure 4 shows that the distribution of verifiable robust loss over the subset is very close to the original distribution: most verifiable robust loss values lie between 0 and 2, and the two distributions overlap significantly. The results demonstrate that we can estimate the verifiable robust loss values by running the sound approximation on a sampled subset of the training dataset, without sacrificing much precision in the loss values. This significantly reduces the time and memory requirements of MixTrain. For instance, by sampling 1,000 out of 50,000 training points, we need only 2% of the computational cost of verifiably robust training over the entire training set.
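As a concrete illustration of this sampling argument, the sketch below (a hypothetical numpy stand-in, not MixTrain's actual code; `sampled_robust_loss` and the per-point loss function are illustrative names) estimates the mean verifiable robust loss from a 1,000-point random subset and compares it against the mean over all 50,000 points:

```python
import numpy as np

def sampled_robust_loss(robust_loss_fn, train_data, k, rng=None):
    """Estimate the mean verifiable robust loss over the full training
    set by running the (expensive) sound approximation on only k
    randomly sampled points."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(train_data), size=k, replace=False)
    return float(np.mean([robust_loss_fn(train_data[i]) for i in idx]))

# Toy check with a cheap stand-in for the sound approximation: the
# 1,000-point estimate tracks the mean over all 50,000 points.
rng = np.random.default_rng(0)
data = rng.normal(1.0, 0.5, size=50_000)
loss = lambda x: abs(x)  # hypothetical per-point robust loss
full = float(np.mean([loss(x) for x in data]))
est = sampled_robust_loss(loss, data, k=1_000, rng=np.random.default_rng(1))
print(full, est)
```

With a 1,000-of-50,000 sample, only 2% of the sound-approximation calls are made, which is the source of the cost savings described above.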
Formally, let the sampled set denote the representative data samples drawn from the entire training set. The new inner maximization problem for stochastic output approximation is then defined over this sampled set: for each sampled point, we compute the worst-case loss within its allowed input range, and average these values.
Batch integration. To integrate our technique with batch training, we exploit the randomness of the batch generation process and spread the samples across the batches. Given the batch size and the number of training points, the training set is split into batches, and the random samples are distributed evenly among them. Before performing the weight update in each epoch, we compute the estimated robust loss for the random samples within each batch. For instance, in Figure 4, with batch size 50 and training dataset size 50,000, randomly sampling 1,000 points from the training set is equivalent to randomly picking one point per batch. As we will show in our experiments in Section V, this achieves similarly high VRA while running substantially faster than the current state-of-the-art verifiably robust training schemes. The stochastic output approximation technique thus provides a sampling-rate hyperparameter that allows different sampling strategies to balance the trade-off between efficiency and the precision of verifiable robustness.
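The per-batch bookkeeping can be sketched as follows (a minimal illustration; `per_batch_sample_indices` is a hypothetical helper, not MixTrain's API), using the numbers from the text: batch size 50 and 50,000 training points give 1,000 batches, so sampling 1,000 points overall means one point per batch:

```python
import numpy as np

def per_batch_sample_indices(batch_size, n_batches, per_batch, rng=None):
    """For each batch, randomly choose `per_batch` positions whose
    verifiable robust loss is computed by the (expensive) sound
    approximation; the remaining positions contribute only the cheap
    normal loss."""
    rng = rng or np.random.default_rng()
    return [rng.choice(batch_size, size=per_batch, replace=False)
            for _ in range(n_batches)]

# batch size 50, 50,000 training points -> 1,000 batches, 1 sample each
picks = per_batch_sample_indices(batch_size=50, n_batches=1000, per_batch=1)
print(len(picks), len(picks[0]))  # 1000 1
```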
IV-C Dynamic Mixed Training
Our second technique addresses the tension between test accuracy and verifiable robustness. State-of-the-art robust training schemes can achieve either high test accuracy [19, 18] or high verifiable robustness, but not both. To increase both, we dynamically balance the two training goals in each epoch.
Formal Definition. We define our dynamic loss function as the following:
The dynamic loss is a weighted sum of the expectations of the normal training loss and the verifiably robust training loss. A hyperparameter, taking a value between 0 and 1, biases the dynamic loss toward either normal training or verifiably robust training. When it is large, the training process tends to find weights that minimize the verifiable robust loss, increasing VRA. In contrast, when it is small, Equation 8 places more emphasis on the normal training loss, enhancing test accuracy. The hyperparameter can either be a constant or a function that changes its value each epoch.
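The weighted sum can be written as a one-line sketch (assuming, as the text suggests, that the hyperparameter weights the verifiably robust loss and its complement weights the normal loss; the function name is ours, not the paper's):

```python
def dynamic_loss(normal_loss, robust_loss, alpha):
    """Dynamic loss as a weighted sum: alpha in [0, 1] biases toward
    the verifiable robust loss (raising VRA when large) and toward the
    normal training loss (raising test accuracy when small)."""
    assert 0.0 <= alpha <= 1.0
    return (1.0 - alpha) * normal_loss + alpha * robust_loss

# equal weighting of a small normal loss and a larger robust loss
print(dynamic_loss(0.2, 1.5, 0.5))
```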
Dynamic Loss. We can dynamically adjust the hyperparameter to shift emphasis between the normal training loss and the verifiably robust training loss. The schedule can be an arbitrary function chosen by the user of MixTrain. An example function is the following:
This is a staircase increase-and-decrease function. In this conditional function, the hyperparameter takes an initial value at the beginning of the batch training process, in the first epoch. Afterwards, it changes whenever the epoch number is a multiple of 5: if the test accuracy has kept increasing, we increase the value by a fixed step; if the test accuracy has decreased, we subtract the same step from the previous value. In all other epochs, the value remains unchanged. The intuition behind this function is that we start with a small value, largely biasing toward normal training in the beginning with the goal of increasing test accuracy. As the test accuracy continues to grow, we put more emphasis on the verifiably robust training part of the loss by slowly increasing the value at every 5th epoch. However, if the test accuracy decreases at some point, we decrease the value slightly to bias back toward the normal training loss.
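A sketch of such a staircase schedule (our own illustrative helper, not the paper's implementation; the step size 0.05, 5-epoch period, and 0.4–0.8 range are the values the paper reports for its CIFAR runs):

```python
def update_alpha(alpha, epoch, acc_history, step=0.05,
                 alpha_min=0.4, alpha_max=0.8, period=5):
    """Staircase schedule: at every `period`-th epoch, raise alpha by
    `step` if test accuracy kept increasing, lower it by `step` if test
    accuracy dropped; leave it unchanged in all other epochs."""
    if epoch == 0 or epoch % period != 0 or len(acc_history) < 2:
        return alpha
    if acc_history[-1] >= acc_history[-2]:
        return min(alpha + step, alpha_max)
    return max(alpha - step, alpha_min)

print(update_alpha(0.4, 5, [0.60, 0.65]))    # rising accuracy: alpha += step
print(update_alpha(0.45, 10, [0.65, 0.60]))  # falling accuracy: alpha -= step
print(update_alpha(0.45, 7, [0.65, 0.60]))   # off-schedule epoch: unchanged
```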
The dynamic loss function allows us to maximize test accuracy and VRA simultaneously according to changes in the current test accuracy. It balances the gradients generated by the normal training loss and the verifiably robust training loss, so that training is not dominated by whichever has the larger gradient. Using a schedule like the one above, we are able to robustly train models with a 10% test accuracy improvement on average over state-of-the-art robust training schemes, without sacrificing much robust accuracy (Section V). Beyond the test accuracy improvement, mixed training has another significant benefit: it smooths the search for the optimum of difficult loss functions and thus helps them converge. We show in Section V that mixed training allows the models to achieve both higher test accuracies and VRAs than existing verifiably robust training methods [16, 18] in certain cases.
We implemented stochastic output approximation and dynamic mixed training in a system called MixTrain, built on PyTorch 0.4.0 (https://pytorch.org/). In MixTrain, we used symbolic linear relaxation as the sound approximation method. In this section, we evaluate MixTrain on different datasets, network architectures, and various attacks. The performance of the trained models is measured with three widely accepted metrics: test accuracy (ACC), estimated robust accuracy under PGD attacks (ERA(PGD)), and verified robust accuracy (VRA). Our main results are summarized as follows:
Compared to state-of-the-art adversarially robust training methods, which have zero VRA, MixTrain provides high VRAs with similar ERA(PGD)s and test accuracies, with faster training time in most cases.
For large networks, MixTrain achieves higher test accuracy without compromising much VRA, allowing verifiably robust trained models to be applied in practice.
MixTrain requires much less memory and computation time than other verifiably robust training schemes, which lets verifiably robust training scale to large models.
|Model||Type||# hidden units||# parameters|
We evaluated MixTrain on two different datasets: MNIST digit classification  and CIFAR10 image classification . We specified five different network architectures for the experiments. The sizes of the networks are shown in Table II, and further details can be found in Appendix -A.
We keep the same training configurations used in prior work. Unless otherwise specified, we train for 60 epochs in total with batch_size 50 in all experiments. The datasets are normalized in the same way as prior works [16, 19, 15, 17] before training. On the MNIST dataset, inputs are scaled into a fixed range. We use the Adam optimizer with learning rate 0.001, decayed by a factor of 0.6 every 5 epochs, and we schedule the perturbation bound from 0.01 up to the tested values over the first 10 epochs. (Sound approximations on initial models cost much more than on models that have been verifiably robust trained for several epochs, so scheduling the input range is necessary to save computation and memory at the beginning.) On the CIFAR dataset, inputs are normalized, and we use the SGD optimizer [35, 36] with learning rate 0.05, decayed by a factor of 0.6 every 5 epochs; we schedule the perturbation bound from 0.001 up to the tested value over the first 10 epochs. All of our experiments were run on a GeForce GTX 1080 Ti. For all models trained on the same dataset, we used the same initial weights and the same set of batches, and we used the same 1,024 randomly selected images to evaluate ACC, ERA(PGD), and VRA.
|Model||Epsilon||Method||Batch Time (s)||Train Time||ACC (%)||ERA(PGD) (%)||VRA (%)|
|Madry et al. ||0.027||32m4s||99.1||96.0||0|
|Sinha et al. ||0.026||31m12s||98.7||58.5||0|
|0.3||Madry et al. ||0.027||32m24s||98.8||89.3||0|
|Sinha et al. ||0.026||31m12s||98.9||0||0|
|Madry et al. ||0.063||1h15m||98.9||96.7||0|
|Sinha et al. ||0.063||1h16m||99.0||70.1||0|
|0.3||Madry et al. ||0.063||1h15m||Not converge|
|Sinha et al. ||0.066||1h19m||99.1||0.1||0|
|Madry et al. ||0.194||3h15m||71.2||54.3||0|
|Sinha et al. ||0.192||3h11m||71.8||11.3||0|
|Madry et al. ||0.289||4h49m||80.9||63.6||0|
|Sinha et al. ||0.204||3h24m||75.1||13.7||0|
|Madry et al. ||0.306||5h6m||Not converge|
|Sinha et al. ||0.212||3h32m||77.3||17.7||0|
|Model||Method||Batch Time (s)||Train Time||ACC (%)||ERA(PGD) (%)||VRA (%)|
|Wong et al. ||0.180||3h36m||98.6||96.9||95.7|
|Wong et al. ||0.223||4h27m||91.2||83.1||57.8|
|MNIST_Large||0.1||DIFFAI ||Out of memory|
|Wong et al. ||Out of memory|
|0.3||DIFFAI ||Out of memory|
|Wong et al. ||Out of memory|
|Wong et al. ||0.705||11h45m||62.5||52.3||47.5|
|CIFAR_Large||0.0348||DIFFAI ||Out of memory|
|Wong et al. ||Out of memory|
|CIFAR_Resnet||0.0348||DIFFAI ||Out of memory|
|Wong et al. ||Out of memory|
|* Under the current setting of a single 1080 Ti, these configurations cannot run due to out-of-memory errors.|
|We can further improve VRA with a larger sampling rate.|
|MNIST _Small||0.1||DIFFAI ||0.013||>24h||N/A|
|Wong et al. ||0.180||>24h||N/A|
|Wong et al. ||0.223||>24h||N/A|
|MNIST _Large||0.1||DIFFAI ||0.022||Out of memory|
|Wong et al. ||1.929||Out of memory|
|0.3||DIFFAI ||0.021||Out of memory|
|Wong et al. ||2.232||Out of memory|
|CIFAR _Small||0.0348||DIFFAI ||0.015||>24h||N/A|
|Wong et al. ||0.705||8h25m||60.8|
|CIFAR _Large||0.0348||DIFFAI ||0.025||Out of memory|
|Wong et al. ||10.861||Out of memory|
|CIFAR _Resnet||0.0348||DIFFAI ||0.035||Out of memory|
|Wong et al. ||11.182||Out of memory|
V-B MixTrain and state-of-the-art training schemes
For each network architecture in Table II, we compared the performance of MixTrain with existing state-of-the-art robust training schemes: two adversarially robust training schemes by Madry et al. and Sinha et al., and two verifiably robust training schemes by Wong et al. and Mirman et al. (DIFFAI).
For the two adversarially robust training schemes, we repeated the same adversary configurations and training process as Madry et al. Specifically, we ran 40 iterations of PGD with step size 0.01 as the adversary for MNIST, and 7 iterations with step size 0.0348 for CIFAR. In particular, Madry et al.'s method is considered the state of the art in improving adversarial robustness, i.e., ERA. Sinha et al.'s method, on the other hand, provides a distributional robustness guarantee for small perturbation bounds.
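The PGD adversary used here follows the standard projected-gradient recipe; a minimal numpy sketch (ours, not the authors' implementation, and using a toy gradient function rather than a real network) looks like this:

```python
import numpy as np

def pgd_attack(grad_fn, x, eps, step, iters, rng=None):
    """L-infinity PGD: start at a random point inside the eps-ball
    around x, take `iters` signed gradient-ascent steps of size `step`,
    and project back into [x - eps, x + eps] after each step.
    (Configs from the text: 40 iters, step 0.01 for MNIST; 7 iters,
    step 0.0348 for CIFAR.)"""
    rng = rng or np.random.default_rng()
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)
    return x_adv

# Toy loss |z|^2 with gradient 2z: PGD pushes the point toward the
# boundary of the eps-ball while never leaving it.
x = np.zeros(3)
adv = pgd_attack(lambda z: 2 * z, x, eps=0.3, step=0.01, iters=40,
                 rng=np.random.default_rng(0))
print(np.abs(adv).max())
```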
For the two verifiably robust training schemes, we picked the experimental setups reported by the authors as achieving their best results. Wong et al.'s method is the state of the art in supporting verifiable robustness, i.e., VRA; we used 10 random projections for it, which performed best. Lastly, DIFFAI achieved its best results using a hybrid Zonotope domain with the hSmooth transformer; we used the same abstract domain and transformer for DIFFAI.
V-B1 Comparison with Adversarially Robust Training
In Table III, we show the training results of the different techniques on the same five models; the best Train Time and VRA are highlighted for each model. Since the MNIST models can easily be trained to high test accuracy, we want MixTrain to spend more effort improving verifiable robustness in dynamic mixed training; we therefore keep the mixing hyperparameter fixed, and the number of sampled points per batch is set to 1 to maximize efficiency. On the other hand, since it is harder to achieve high accuracy when verifiably robust training CIFAR models, we dynamically balance the goals of high accuracy and high VRA. All CIFAR models are trained under the same adaptive update mechanism: (1) increase the mixing hyperparameter by a step of 0.05 every 5 epochs, with initial value 0.4 and maximal value 0.8; (2) decrease it by 0.05 whenever the test accuracy drops below the targeted threshold (i.e., 70%).
When training MNIST_Large with the larger perturbation bound and CIFAR_Resnet, Madry et al.'s method did not converge: there was no feasible choice of learning rate that trained the model to over 30% test accuracy.
As the results in Table III show, Madry et al.'s method is efficient but provides no verifiable robustness, with a zero VRA score. Compared to their method, MixTrain is not only faster with similar ACCs and ERA(PGD)s, but also provides VRA as high as 91.6%. Such an improvement in verifiable robustness is significant, especially for security-sensitive applications. Sinha et al.'s training scheme reaches competitive ERA(PGD) scores on MNIST models, but on CIFAR models it performs much worse than MixTrain and Madry et al.'s method on ERA(PGD). This is consistent with their distributional robustness guarantee holding only for very small perturbation bounds. Note that MixTrain is faster than the other two adversarially robust training schemes on all models except CIFAR_Large; Sinha et al.'s method is slightly faster there, but we achieve higher ACC, ERA(PGD), and VRA scores.
Summary: MixTrain can achieve the same high ACC and high ERA(PGD) scores as the state-of-the-art adversarially robust training method, with a bonus of high VRA and even faster training time.
V-B2 Comparison with Verifiably Robust Training
In Table IV, we compare the performance of MixTrain against verifiably robust training methods. Despite high VRAs, the best verifiably robust trained models, provided by Wong et al., still suffer from scalability issues and low test accuracies. Stochastic output approximation and dynamic mixed training are therefore two powerful methods that let us scale to large networks and balance verifiable robustness against test accuracy during training. Note that the results for DIFFAI on the MNIST dataset are much worse than the numbers reported in their paper, mainly because their input normalizations use a much larger scale than our current settings. The results for Wong et al. are slower than their reported numbers because they used four GeForce GTX 1080 Ti GPUs while we use only one. In practice, our computation and memory budget is entirely sufficient to train decent robust models within a reasonable time.
For small models (MNIST_Small and CIFAR_Small), we use a higher sampling rate to compute the verifiable robust loss from more training samples in stochastic output approximation. MixTrain reaches significantly higher VRAs than DIFFAI on all models. In addition, our VRA is higher than Wong et al.'s method on MNIST_Small. On CIFAR_Small, MixTrain has 8.2% higher ACC and only 2.1% lower VRA than Wong et al.'s method. Overall, MixTrain trains around 5× faster while providing the verifiable robustness guarantee.
For large models (MNIST_Large, CIFAR_Large, and CIFAR_Resnet), we use a small sampling rate to scale the verifiably robust training of MixTrain. Both DIFFAI and Wong et al.'s method ran out of memory. Using stochastic output approximation, MixTrain scales to large models where the other methods cannot run at all. In particular, on the MNIST_Large model, we achieve 4.2% higher VRA and 9.2% higher ACC than the best number reported by Wong et al., even though they trained with four GPUs and we trained with only one.
Summary: Compared to state-of-the-art verifiably robust training methods, MixTrain reaches higher VRAs and higher ACCs with around a 5× training-time speedup, and it scales to large models much better.
V-B3 Time to Achieve Target VRA
We want to test whether other verifiably robust training methods can reach the same VRA as MixTrain on the same models. We set the target VRA for all verifiably robust training methods to the values achieved by MixTrain; the results are shown in Table V. In particular, on the CIFAR_Small model we set the target VRA to 45.4%. Wong et al.'s method reached this target in 8h25m, whereas MixTrain reached it in 2h42m. On the other models, DIFFAI and Wong et al.'s method either timed out after 24 hours or ended with out-of-memory errors.
Summary: MixTrain is always the fastest among verifiably robust training methods at achieving the target VRA goals. In addition, MixTrain requires around 50× less memory to train.
V-C Robustness within different bounds
Here we show, on CIFAR_Large and MNIST_Large models trained with MixTrain, how ERA(PGD) and VRA change across different bounded input ranges. The tested CIFAR_Large model is originally trained with ε = 0.0348 and MNIST_Large with ε = 0.3. The results in Figure 5 indicate that models trained with MixTrain retain very strong verifiable and adversarial robustness within large bounded ranges: for MNIST_Large, the VRA score stays above 80% within a large bound, while it remains above 60% for CIFAR_Large. Notice that the robustness of MNIST_Large decreases faster than CIFAR_Large's, since its training bound is already a relatively large range that covers almost all of the allowable imperceptible perturbations. Compared to MNIST, CIFAR is a harder dataset to learn, so the ranges our models can verify are smaller, but the decrease of VRA is slower.
V-D MixTrain against state-of-the-art attacks
We show in Table VI that MixTrain can effectively defend against most current state-of-the-art attacks, including PGD and CW attacks, and our trained models are also robust against interval attacks. Under the same attack configurations, the interval attack finds at least 10% more violations than PGD or CW attacks on adversarially robust trained models, as shown in Section III. However, this advantage drops to less than 1% on models trained with MixTrain. This indicates that, compared to Madry et al.'s method, MixTrain raises the cost for attackers and provides stronger robustness against unknown attacks.
|ACC (%)||ERA(PGD) (%)||ERA(CW) (%)||ERA(Interval) (%)||VRA (%)|
|The ε is 0.3 for MNIST and 0.0348 for CIFAR.|
V-E MixTrain against L2 and L0 bounded attacks
Schott et al. have shown that Madry et al.'s method is not robust against other types of attacks, such as L2 and L0 bounded attacks. In Figure 6, we measure ERAs under the current state-of-the-art L2 bounded CW attacks and L0 bounded decision-based pointwise attacks [37, 38] on CIFAR_Large models trained with MixTrain, and compare them with those obtained from Madry et al.'s method. For the L2 bounded CW attacks shown in Figure 5(a), the ERA of the adversarially robust trained model drops sharply as the perturbation bound grows, while the ERA of the model trained with MixTrain remains substantially higher. For the L0 bounded pointwise attacks shown in Figure 5(b), the ERA of the model trained with MixTrain likewise stays well above that of their method. These results demonstrate that models trained with MixTrain are much more robust against L2 and L0 adversaries, as a side benefit of MixTrain's L-infinity training.
V-F Different values of sampling rate
In Table VII we show accuracies and training time for different sampling rates used in stochastic output approximation. The models are trained with MixTrain from the same initial CIFAR_Small model and training data. The results show that the training cost is proportional to the sampling rate. Even when the sampling rate is small, ACCs approach those of regular training while VRAs approach the state of the art; a small sampling rate is therefore a good choice, saving up to 50 times the training cost (or more with a larger batch size). On the other hand, if more training time is available, one can use a larger sampling rate to achieve better accuracies. For instance, MixTrain can train a robust model with VRA similar to that of the most verifiably robust CIFAR_Small model trained with Wong et al.'s method, taking only 2 hours and 42 minutes where they need over 11 hours. On average, models trained with MixTrain save training time while achieving similar VRA, benefiting from dynamic mixed training.
|Batch Time (s)||Training Time||ACC (%)||ERA(PGD) (%)||VRA (%)|
V-G Different values of the mixing hyperparameter
Here we evaluate the influence of different update mechanisms used for dynamic mixed training.
Influence of a fixed value. We evaluate the influence of different fixed values of the mixing hyperparameter and observe how ACCs, ERA(PGD)s, and VRAs change on CIFAR_Small. Taking different constant values, we record the corresponding accuracies of each trained model in Figure 7. The results indicate that the larger the value, the more robust the trained model; but the cost of this robustness improvement is some sacrifice of test accuracy. If the value is 0, the training process is identical to regular training, with the highest ACC but 0 VRA. If the value is 1, MixTrain relies only on stochastic output approximation and loses the ACC improvement contributed by dynamic mixed training. Overall, by adjusting this hyperparameter, MixTrain can be tuned to balance robustness and test accuracy according to the desired properties.
Adaptive updates. We then show that the dynamic loss defined in Section IV-C allows the models to learn better than a fixed value does. Here, accuracies are evaluated and compared on CIFAR_Small models trained with MixTrain under different update mechanisms. First, we train with a fixed mechanism, picking a fixed value of 0.5, which provides a good balance between ACC and VRA. Second, we adaptively update the value by increasing it in steps of 0.05 every 5 epochs, with initial value 0.4 and maximal value 0.8; whenever the test accuracy decreases as the value increases and falls below the targeted threshold (i.e., 70%), the value is decreased by 0.05 to keep the test accuracy around the target. As shown in Table VIII, the model trained under this dynamic update mechanism outperforms the one trained with the fixed mechanism in VRA, with similar ACC.
|CIFAR_Small||ACC (%)||ERA(PGD) (%)||VRA (%)|
VI Related Work
Adversarial machine learning. Recent works have shown that state-of-the-art networks can be fooled by adversarially crafted, human-imperceptible perturbations of valid inputs [14, 1, 23, 13]. Various defense systems have been proposed [39, 9, 10, 40, 41, 42, 43, 44, 45, 46]. However, most of them rely on heuristics, and no fundamental guarantee can be given for the absence of adversarial examples, so stronger attacks can still be proposed to break them [47, 14, 48, 49, 13, 50, 51, 12, 52, 53, 54, 55, 56, 57]; interested readers can refer to the survey paper for details. Subsequently, various guarantees have gradually been provided, pointing toward an end to this arms race. In this paper, we therefore focus solely on defenses that provide some form of guarantee.
Adversarial robustness guarantee. Adversarial training schemes retrain the model on adversarial examples found by heuristic attacks, including training methods based on FGSM attacks [23, 22], iterative FGSM attacks, DeepFool attacks, and attacks guided by local intrinsic dimensionality. Scalability and robustness were further improved by training with PGD attacks, the current state of the art. Overall, these methods are efficient and robust against most existing attacks, and an adversarial robustness guarantee has been claimed for such training techniques. However, we have shown that this guarantee can be broken by smarter attackers, and no formal guarantee is easily given.
Distributional robustness guarantee. Stronger distributional robustness guarantees are provided by other defense schemes. Sinha et al. leverage the Wasserstein distance to bound the expectation of the maximal perturbations that do not violate robustness. PixelDP relies on differential privacy to bound the expectations of output ranges. These guarantees are strong in that they hold for the whole underlying distribution with a certain probability. However, within large input ranges the guarantee probability is very low, so adversarial examples can easily be found; and even within small input ranges, a high guarantee probability does not imply the absence of adversarial examples. Such guarantees are therefore vulnerable in security domains.
Verifiable robustness guarantee. Formal verification of networks can provide verifiable robustness guarantees, pointing toward an end to this arms race. Such guarantees were first provided by customized solver-based systems [8, 58, 59], but the high overhead of the solvers prevents them from scaling beyond small networks (fewer than 1,000 ReLUs). MILP solvers achieve better performance [60, 61, 62] but still suffer from the high nonlinearity of the resulting formulas. The methods of [63, 64] provide more accurate approximations but only work for networks with one or two layers. To address this, sound approximations successfully simplify the verification problem using traditional analysis methods such as intervals, convex polytopes, abstract domains, Lagrangian relaxation, and Lipschitz continuity [65, 30]. Currently, the state-of-the-art formal verification system combines interval analysis with a linear solver and can scale to modern, normally trained networks (around 10,000 ReLUs). However, the relaxations inevitably introduce false positives, and these systems do not by themselves strengthen robustness.
Building on sound approximations, verifiably robust training can significantly increase the verifiable robustness of trained networks. However, such training schemes require heavy computation, and some robustness precision is sacrificed for efficiency [19, 18]. The current state of the art, which leverages random projections to save computation, is still hundreds of times slower than normal training. Moreover, on large networks, test accuracy suffers from the many false positives of sound approximations, so these methods are not applicable in practice. MixTrain speeds up verifiably robust training by up to 50 times, with around 10% improvement in test accuracy on large models (e.g., over 100,000 ReLUs).
We constructed a stronger first-order attack, the interval attack, to show the fundamental problem with the assumptions behind the adversarial robustness guarantee. Furthermore, we proposed stochastic output approximation and dynamic mixed training to solve two major challenges in making verifiably robust training methods scale. We implemented these techniques in our system, MixTrain. Our extensive experimental results demonstrate that MixTrain outperforms state-of-the-art adversarially robust training methods by providing a bonus of verified robust accuracy up to 95.2% within shorter training time. Compared to state-of-the-art verifiably robust training methods, MixTrain offers substantially faster training and far lower memory cost, while achieving a 9.2% improvement in test accuracy and 4.2% higher verified robust accuracy.
-  C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus, “Intriguing properties of neural networks,” International Conference on Learning Representations (ICLR), 2013.
-  “TESLA’S AUTOPILOT WAS INVOLVED IN ANOTHER DEADLY CAR CRASH,” https://www.wired.com/story/tesla-autopilot-self-driving-crash-california/.
-  “Driver killed in Tesla self-driving car crash ignored warnings, NTSB reports,” https://www.usatoday.com/videos/news/nation/2017/06/20/tesla-found-not-fault-fatal-self-driving-car-crash/103039630/.
-  “Uber’s Self-driving Car Were Struggled Before Arizona Crash,” https://www.nytimes.com/2018/03/23/technology/uber-self-driving-cars-arizona.html.
-  K. D. Julian, J. Lopez, J. S. Brush, M. P. Owen, and M. J. Kochenderfer, “Policy compression for aircraft collision avoidance systems,” in Digital Avionics Systems Conference (DASC), 2016 IEEE/AIAA 35th. IEEE, 2016, pp. 1–10.
-  “NAVAIR plans to install ACAS Xu on MQ-4C fleet,” https://www.flightglobal.com/news/articles/navair-plans-to-install-acas-xu-on-mq-4c-fleet-444989/.
-  S. Wang, K. Pei, W. Justin, J. Yang, and S. Jana, “Formal security analysis of neural networks using symbolic intervals,” 27th USENIX Security Symposium, 2018.
-  G. Katz, C. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer, “Reluplex: An efficient smt solver for verifying deep neural networks,” in International Conference on Computer Aided Verification (CAV). Springer, 2017, pp. 97–117.
-  N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” arXiv preprint arXiv:1511.04508, 2015.
-  M. Cisse, P. Bojanowski, E. Grave, Y. Dauphin, and N. Usunier, “Parseval networks: Improving robustness to adversarial examples,” in International Conference on Machine Learning (ICML), 2017, pp. 854–863.
-  M. Lecuyer, V. Atlidakis, R. Geambasu, H. Daniel, and S. Jana, “Certified robustness to adversarial examples with differential privacy,” arXiv preprint arXiv:1802.03471, 2018.
-  N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, Z. B. Celik, and A. Swami, “Practical black-box attacks against machine learning,” in Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 2017, pp. 506–519.
-  S.-M. Moosavi-Dezfooli, A. Fawzi, and P. Frossard, “Deepfool: a simple and accurate method to fool deep neural networks,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2574–2582.
-  N. Carlini and D. Wagner, “Towards evaluating the robustness of neural networks,” in IEEE Symposium on Security and Privacy (SP). IEEE, 2017, pp. 39–57.
-  A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” International Conference on Learning Representations (ICLR), 2018.
-  E. Wong, F. Schmidt, J. H. Metzen, and J. Z. Kolter, “Scaling provable adversarial defenses,” Advances in Neural Information Processing Systems (NIPS), 2018.
-  A. Sinha, H. Namkoong, and J. Duchi, “Certifying some distributional robustness with principled adversarial training,” International Conference on Machine Learning (ICML), 2018.
-  M. Mirman, T. Gehr, and M. Vechev, “Differentiable abstract interpretation for provably robust neural networks,” in International Conference on Machine Learning (ICML), 2018, pp. 3575–3583.
-  K. Dvijotham, S. Gowal, R. Stanforth, R. Arandjelovic, B. O’Donoghue, J. Uesato, and P. Kohli, “Training verified learners with learned verifiers,” arXiv preprint arXiv:1805.10265, 2018.
-  R. Huang, B. Xu, D. Schuurmans, and C. Szepesvári, “Learning with a strong adversary,” arXiv preprint arXiv:1511.03034, 2015.
-  U. Shaham, Y. Yamada, and S. Negahban, “Understanding adversarial training: Increasing local stability of neural nets through robust optimization,” Neurocomputing, 2018.
-  A. Kurakin, I. Goodfellow, and S. Bengio, “Adversarial machine learning at scale,” arXiv preprint arXiv:1611.01236, 2016.
-  I. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” in International Conference on Learning Representations (ICLR), 2015.
-  F. Tramèr, A. Kurakin, N. Papernot, I. Goodfellow, D. Boneh, and P. McDaniel, “Ensemble adversarial training: Attacks and defenses,” arXiv preprint arXiv:1705.07204, 2017.
-  M. Ducoffe and F. Precioso, “Adversarial active learning for deep networks: a margin based approach,” arXiv preprint arXiv:1802.09841, 2018.
-  E. Wong and J. Z. Kolter, “Provable defenses against adversarial examples via the convex outer adversarial polytope,” International Conference on Machine Learning (ICML), 2018.
-  S. Wang, K. Pei, W. Justin, J. Yang, and S. Jana, “Efficient formal safety analysis of neural networks,” Advances in Neural Information Processing Systems (NIPS), 2018.
-  T. Gehr, M. Mirman, D. Drachsler-Cohen, P. Tsankov, S. Chaudhuri, and M. Vechev, “AI²: Safety and robustness certification of neural networks with abstract interpretation,” in IEEE Symposium on Security and Privacy (SP), 2018.
-  K. Dvijotham, R. Stanforth, S. Gowal, T. Mann, and P. Kohli, “A dual approach to scalable verification of deep networks,” arXiv preprint arXiv:1803.06567, 2018.
-  T.-W. Weng, H. Zhang, H. Chen, Z. Song, C.-J. Hsieh, D. Boning, I. S. Dhillon, and L. Daniel, “Towards fast computation of certified robustness for ReLU networks,” arXiv preprint arXiv:1804.09699, 2018.
-  M. Anthony and P. L. Bartlett, Neural network learning: Theoretical foundations. Cambridge University Press, 2009.
-  Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Citeseer, Tech. Rep., 2009.
-  D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations (ICLR), 2014.
-  H. Robbins and S. Monro, “A stochastic approximation method,” in Herbert Robbins Selected Papers. Springer, 1985, pp. 102–109.
-  J. Kiefer, J. Wolfowitz et al., “Stochastic estimation of the maximum of a regression function,” The Annals of Mathematical Statistics, vol. 23, no. 3, pp. 462–466, 1952.
-  L. Schott, J. Rauber, M. Bethge, and W. Brendel, “Towards the first adversarially robust neural network model on MNIST,” 2018. [Online]. Available: https://arxiv.org/pdf/1805.09190.pdf
-  J. Rauber, W. Brendel, and M. Bethge, “Foolbox v0.8.0: A python toolbox to benchmark the robustness of machine learning models,” arXiv preprint arXiv:1707.04131, 2017.
-  S. Gu and L. Rigazio, “Towards deep neural network architectures robust to adversarial examples,” arXiv preprint arXiv:1412.5068, 2014.
-  N. Papernot and P. McDaniel, “Extending defensive distillation,” arXiv preprint arXiv:1705.05264, 2017.
-  ——, “Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning,” arXiv preprint arXiv:1803.04765, 2018.
-  A. Athalye, N. Carlini, and D. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” arXiv preprint arXiv:1802.00420, 2018.
-  J. Buckman, A. Roy, C. Raffel, and I. Goodfellow, “Thermometer encoding: One hot way to resist adversarial examples,” 2018.
-  Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman, “Pixeldefend: Leveraging generative models to understand and defend against adversarial examples,” arXiv preprint arXiv:1710.10766, 2017.
-  C. Xie, J. Wang, Z. Zhang, Z. Ren, and A. Yuille, “Mitigating adversarial effects through randomization,” arXiv preprint arXiv:1711.01991, 2017.
-  V. Zantedeschi, M.-I. Nicolae, and A. Rawat, “Efficient defenses against adversarial attacks,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 39–49.
-  N. Papernot, N. Carlini, I. Goodfellow, R. Feinman, F. Faghri, A. Matyasko, K. Hambardzumyan, Y.-L. Juang, A. Kurakin, R. Sheatsley et al., “cleverhans v2.0.0: an adversarial machine learning library,” arXiv preprint arXiv:1610.00768, 2016.
-  G. F. Elsayed, S. Shankar, B. Cheung, N. Papernot, A. Kurakin, I. Goodfellow, and J. Sohl-Dickstein, “Adversarial examples that fool both human and computer vision,” arXiv preprint arXiv:1802.08195, 2018.
-  N. Carlini and D. Wagner, “Adversarial examples are not easily detected: Bypassing ten detection methods,” in Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 2017, pp. 3–14.
-  B. Biggio, I. Corona, D. Maiorca, B. Nelson, N. Šrndić, P. Laskov, G. Giacinto, and F. Roli, “Evasion attacks against machine learning at test time,” in Joint European conference on machine learning and knowledge discovery in databases. Springer, 2013, pp. 387–402.
-  X. Ma, B. Li, Y. Wang, S. M. Erfani, S. Wijewickrema, M. E. Houle, G. Schoenebeck, D. Song, and J. Bailey, “Characterizing adversarial subspaces using local intrinsic dimensionality,” arXiv preprint arXiv:1801.02613, 2018.
-  C. Guo, M. Rana, M. Cisse, and L. van der Maaten, “Countering adversarial images using input transformations,” arXiv preprint arXiv:1711.00117, 2017.
-  W. He, J. Wei, X. Chen, N. Carlini, and D. Song, “Adversarial example defenses: Ensembles of weak defenses are not strong,” arXiv preprint arXiv:1706.04701, 2017.
-  A. Athalye and I. Sutskever, “Synthesizing robust adversarial examples,” International Conference on Machine Learning (ICML), 2018.
-  N. Carlini and D. Wagner, “Magnet and “Efficient defenses against adversarial attacks” are not robust to adversarial examples,” arXiv preprint arXiv:1711.08478, 2017.
-  K. Pei, Y. Cao, J. Yang, and S. Jana, “Deepxplore: Automated whitebox testing of deep learning systems,” in Proceedings of the 26th Symposium on Operating Systems Principles (SOSP). ACM, 2017, pp. 1–18.
-  Y. Tian, K. Pei, S. Jana, and B. Ray, “Deeptest: Automated testing of deep-neural-network-driven autonomous cars,” in Proceedings of the 40th International Conference on Software Engineering. ACM, 2018, pp. 303–314.
-  X. Huang, M. Kwiatkowska, S. Wang, and M. Wu, “Safety verification of deep neural networks,” in International Conference on Computer Aided Verification (CAV). Springer, 2017, pp. 3–29.
-  R. Ehlers, “Formal verification of piece-wise linear feed-forward neural networks,” 15th International Symposium on Automated Technology for Verification and Analysis, 2017.
-  V. Tjeng, K. Xiao, and R. Tedrake, “Evaluating robustness of neural networks with mixed integer programming,” arXiv preprint arXiv:1711.07356, 2017.
-  M. Fischetti and J. Jo, “Deep neural networks as 0-1 mixed integer linear programs: A feasibility study,” arXiv preprint arXiv:1712.06174, 2017.
-  S. Dutta, S. Jha, S. Sankaranarayanan, and A. Tiwari, “Output range analysis for deep feedforward neural networks,” in NASA Formal Methods Symposium. Springer, 2018, pp. 121–138.
-  A. Raghunathan, J. Steinhardt, and P. Liang, “Certified defenses against adversarial examples,” International Conference on Learning Representations (ICLR), 2018.
-  A. Lomuscio and L. Maganti, “An approach to reachability analysis for feed-forward ReLU neural networks,” arXiv preprint arXiv:1706.07351, 2017.
-  T.-W. Weng, H. Zhang, P.-Y. Chen, J. Yi, D. Su, Y. Gao, C.-J. Hsieh, and L. Daniel, “Evaluating the robustness of neural networks: An extreme value theory approach,” International Conference on Learning Representations (ICLR), 2018.
-  S. Zagoruyko and N. Komodakis, “Wide residual networks,” British Machine Vision Conference (BMVC), 2016.
Appendix A: Network Structures
Let Conv(c, k, s) denote a convolutional layer with c output channels, kernel size k, and stride s. Let FC(n) denote a fully connected layer that contains n hidden nodes. Let Res(c, k) denote the residual module defined in , which has c output channels and kernel size k; it consists of two convolutional layers, each followed by ReLU, with one skip connection from the input. In total, we train seven different networks, defined as follows:
MNIST_FC1. Small fully connected network for MNIST:
MNIST_FC2. Large fully connected network for MNIST:
MNIST_Small & CIFAR_Small. Small convolutional network for MNIST and CIFAR used in :
MNIST_Large & CIFAR_Large. Large convolutional networks that extend the small convolutional ones:
CIFAR_Resnet. The residual network uses the same structure as in :
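To illustrate how the Conv-style notation above determines the tensor dimensions flowing through a network, the following sketch traces (channels, height, width) through a stack of strided convolutions. The specific channel counts, kernel sizes, strides, and padding used here are illustrative assumptions, not the paper's exact architecture parameters:

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution (floor division)."""
    return (size + 2 * padding - kernel) // stride + 1

def shapes(input_shape, layers):
    """Trace (channels, height, width) through a list of (c_out, k, s) conv layers."""
    c, h, w = input_shape
    trace = [(c, h, w)]
    for c_out, k, s in layers:
        # padding=1 is an assumption; the actual networks may differ.
        h = conv_out(h, k, s, padding=1)
        w = conv_out(w, k, s, padding=1)
        c = c_out
        trace.append((c, h, w))
    return trace

# Hypothetical small MNIST-style stack: two strided convolutions on a 28x28 input.
print(shapes((1, 28, 28), [(16, 4, 2), (32, 4, 2)]))
# -> [(1, 28, 28), (16, 14, 14), (32, 7, 7)]
```

The final feature map would then be flattened (here, 32 × 7 × 7 = 1568 values) before any fully connected FC(n) layers.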