Defensive Collaborative Multi-task Training

Defending against Adversarial Attacks towards Deep Neural Networks

Derek Wang, Chaoran Li, and Wanlei Zhou (Deakin University, Melbourne, VIC 3125); Sheng Wen and Yang Xiang (Swinburne University of Technology, Melbourne, VIC 3125); and Surya Nepal (Data61, CSIRO, Melbourne, VIC)

Deep neural networks (DNNs) have shown impressive performance on hard perceptual problems. However, researchers found that DNN-based systems are vulnerable to adversarial examples, which contain specially crafted, human-imperceptible perturbations. Such perturbations cause DNN-based systems to mis-classify the adversarial examples, with potentially disastrous consequences where safety or security is crucial. As a major security concern, state-of-the-art attacks can still bypass existing defensive methods.

In this paper, we propose a novel defensive framework based on collaborative multi-task training to address the above problem. The proposed defence first incorporates specific label pairs into the adversarial training process to enhance model robustness in the black-box setting. Then a novel collaborative multi-task training framework is proposed to construct a detector which identifies adversarial examples based on the pairwise relationship of the label pairs. The detector can identify and reject high-confidence adversarial examples that bypass the black-box defence. The model whose robustness has been enhanced works reciprocally with the detector on false-negative adversarial examples. Importantly, the proposed collaborative architecture can prevent the adversary from finding valid adversarial examples in a nearly-white-box setting.

Deep neural network, adversarial example, security.
Journal year: 2018. CCS Concepts: Security and privacy → Formal security models.

1. Introduction

Deep neural networks (DNNs) have achieved remarkable performance on tasks such as computer vision, natural language processing, and data generation. However, DNNs are vulnerable to adversarial attacks, which exploit imperceptibly perturbed examples (Fig.1) to fool the neural networks (Szegedy et al., 2013).

Given the pervasive application of deep learning, adversarial attacks on DNNs can be catastrophic and are never ad hoc. For instance, as a prominent type of DNN, convolutional neural networks (CNNs) are widely adopted in hand-writing recognition (Papernot et al., 2016a), face detection (Li et al., 2015), and autonomous vehicles (Girshick et al., 2016). Adversarial examples for these CNN-based systems can be crafted with the same methods, so an attack on one CNN can be replicated on different CNN-based systems at little cost. The universal existence of adversarial examples in different scenarios fatally endangers human users, for example by causing accidents originating from faulty object detection in autonomous vehicles. Considering the blooming DNN-based applications at the current stage, and in the coming decades, proposing effective methods to defend DNNs against adversarial examples has never been more urgent and critical.

Attacks on DNN-based applications can be performed under different prerequisites for the attacker. The first situation is the white-box attack. Once attackers have access to the architecture and parameters of the attacked model, they can craft adversarial examples based on the attacked model directly. Second, the black-box attack can be performed without knowing the exact model architecture, the model parameters, or even the training dataset (Papernot et al., 2017). Crafting adversarial examples in the black-box setting requires a DNN substitute. The crafted examples remain valid for the attacked model due to the transferability of adversarial examples (Liu et al., 2016). Current applications of DNNs largely rely on a few representative models (e.g. VGG (Simonyan and Zisserman, 2014), GoogleNet (Szegedy et al., 2015), and ResNet (He et al., 2016) for computer vision tasks), which makes it relatively easy for the adversary to guess a suitable substitute.

The adversary can either specify a desired classification result (i.e. targeted attack), or simply aim for mis-classification without a specific target class (i.e. non-targeted attack). The transferability of adversarial examples is demonstrated more significantly in non-targeted attacks than in targeted attacks (Liu et al., 2016). Therefore, the non-targeted attack is more practical in the black-box setting, which is the most likely situation for attackers. However, high-confidence adversarial examples created by strong attacking methods can also transfer between models (Carlini and Wagner, 2017b). Moreover, these high-confidence adversarial examples are able to bypass the defensively distilled network (Papernot et al., 2016c), which was considered the most effective defence in both white-box and black-box settings.

Figure 1. Adversarial examples of Cifar10 images. These examples are crafted by five attacking methods. First, the adversarial images are all mis-classified by the CNN classifier; second, the differences between adversarial examples and original examples are imperceptible. From the 2nd column to the 6th column, the attacks are: 1) CarliniWagner (Carlini and Wagner, 2017b), 2) Deepfool (Moosavidezfooli et al., 2016), 3) Iterative Gradient Sign (IGS) (Kurakin et al., 2016), 4) Jacobian-based Saliency Map Attack (JSMA) (Papernot et al., 2016a), and 5) Fast Gradient Sign (FGS) (Goodfellow et al., 2014).

1.1. Motivation

From the defence perspective, a series of defensive methods have been proposed recently to counter state-of-the-art attacks on DNN-based applications. These methods defend DNN models either by detecting adversarial examples (e.g. MagNet (Meng and Chen, 2017) employs auto-encoders to reconstruct adversarial examples back onto the normal example manifold, and then detects adversarial examples by the reconstruction error), or by blocking the search for adversarial examples (i.e. gradient masking) through modifying the gradients of the model (e.g. defensive distillation (Papernot et al., 2016c) introduces a temperature parameter into the softmax function to make the gradients almost zero, in order to block the search for adversarial examples). MagNet is claimed to be able to detect the CarliniWagner (CW) attack (Meng and Chen, 2017); however, it introduces considerable complexity into the model architecture. To date, there is no valid method which actually increases model robustness against the CW attack.

The attacks can be divided into three categories: 1) black-box attacks, 2) grey-box attacks, and 3) white-box or nearly-white-box attacks (refer to Section 2.2). From the attack perspective, current countermeasures have various limitations. First, defensive methods that regularise the model parameters are invalid against grey-box and white-box attacks. Second, methods working in the grey-box scenario (e.g. defensive distillation) utilise gradient masking (Papernot et al., 2017) to block the gradient-based search for adversarial examples. However, this type of defence is invalid against attacks that exploit input feature sensitivity (e.g. JSMA), and it cannot defend against black-box attacks. What makes it more frustrating is that defensive distillation is breakable in both the grey-box and black-box settings under the CarliniWagner attack, which is itself gradient-based. The recently proposed MagNet can detect adversarial examples and reform them into benign examples (Meng and Chen, 2017); however, it does not work in the nearly-white-box scenario.

1.2. Our Work & Contributions

In this paper, we propose a well-rounded defence that not only detects adversarial examples with high accuracy, but also increases the robustness of neural networks against adversarial attacks. The defence first introduces adversarial training with robust label pairs to tackle black-box attacks. Then it employs a multi-task training technique to construct the adversarial example detector. The defence is able to tackle black-box, grey-box, and even nearly-white-box attacks. The main contributions of the paper are as follows:

  • We introduce a collaborative multi-task training framework to invalidate/detect adversarial examples. This framework innovatively makes use of the ‘least possible’ misclassification results as a pairwise rule to defend against adversarial attacks;

  • We carried out both empirical and theoretical studies to evaluate the proposed framework. The experiments demonstrated the capabilities of the proposed defence framework: 1) the framework can prevent the adversary from searching for valid adversarial examples in the nearly-white-box setting; and 2) it can detect or invalidate adversarial examples crafted in the grey-box/black-box settings.

The rest of the paper is organised as follows: Section 2 describes state-of-the-art attacks and clarifies the problem and our contribution. Section 3 presents our detailed approach. Section 4 presents the evaluation of our approach. Section 5 provides an analysis of the mechanism of the defence. Section 6 presents a conclusion on the existing attacks and defensive methods. Section 7 discusses the remaining unsolved problems of existing attacks and defences, as well as possible further improvements of the defence. Section 8 summarises the paper and proposes future work.

2. Primer

2.1. Adversarial attacks

We first introduce state-of-the-art attacks in the field. Suppose the DNN model is a non-convex function $F$. Given an image $x$ along with its one-hot encoded ground-truth label $y$, the attacker searches for an adversarial example $x^*$.

2.1.1. FGS

Fast gradient sign (FGS) is able to generate adversarial examples rapidly (Goodfellow et al., 2014). FGS perturbs the image in the image space towards the gradient sign direction. FGS can be described by the following formula:

$$x^* = x + \epsilon \cdot \mathrm{sign}\left( \nabla_x J(F(x), y) \right)$$

Herein $J$ is the loss function (the cross-entropy function is typically used to compute the loss), $F(x)$ is the softmax layer output of model $F$, $\epsilon$ is a hyper-parameter which controls the distortion level of the crafted image, and $\mathrm{sign}$ is the sign function. FGS only requires the gradients to be computed once; thus, FGS can craft large batches of adversarial examples in a very short time.
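As an illustration, the update above can be sketched in a few lines of numpy on a softmax-regression model. The model, weight values, and epsilon used here are illustrative stand-ins, not the CNNs attacked in the paper:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def fgs(x, y_onehot, W, b, epsilon=0.1):
    """One-step fast gradient sign attack on a softmax-regression model.

    For cross-entropy loss J and logits z = W x + b, the gradient w.r.t.
    the input is dJ/dx = W^T (softmax(z) - y).
    """
    p = softmax(W @ x + b)
    grad = W.T @ (p - y_onehot)      # gradient of the loss w.r.t. x
    # signed step, then clip back to the valid pixel range [0, 1]
    return np.clip(x + epsilon * np.sign(grad), 0.0, 1.0)
```

The single signed step is what makes FGS cheap: one forward pass and one backward pass per example.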

2.1.2. IGS

The iterative gradient sign (IGS) attack perturbs pixels over multiple iterations instead of applying a one-off perturbation (Kurakin et al., 2016). In each round, IGS perturbs the pixels towards the gradient sign direction and clips the perturbation using a small value $\epsilon$. The adversarial example in the $i$-th iteration is:

$$x^*_i = \mathrm{clip}_{x,\epsilon}\left( x^*_{i-1} + \alpha \cdot \mathrm{sign}\left( \nabla_x J(F(x^*_{i-1}), y) \right) \right), \quad x^*_0 = x$$

Compared to FGS, IGS can produce adversarial examples with less distortion and higher mis-classification confidence.
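A minimal sketch of the iterative variant on an illustrative softmax-regression model (the model, step size `alpha`, `epsilon`, and iteration count are assumed values for demonstration, not the paper's settings):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def igs(x, y_onehot, W, b, epsilon=0.1, alpha=0.02, iters=10):
    """Iterative gradient sign: small signed steps, clipped to an
    epsilon-ball around the original image and to the valid pixel
    range after every iteration."""
    x_adv = x.copy()
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        grad = W.T @ (p - y_onehot)               # dJ/dx for cross-entropy
        x_adv = x_adv + alpha * np.sign(grad)     # one small signed step
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # epsilon-ball clip
        x_adv = np.clip(x_adv, 0.0, 1.0)          # valid pixel range
    return x_adv
```

The per-iteration clip is what keeps the total distortion bounded while the repeated small steps find a finer-grained perturbation than a single FGS step.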

2.1.3. Deepfool

Deepfool is able to craft adversarial examples with minimal distortion of the original image (Moosavidezfooli et al., 2016). The basic idea is to search for the closest decision boundary and then perturb $x$ towards that boundary. Deepfool iteratively perturbs $x$ until it is misclassified. For a binary classifier $f$, the modification of the image in the $i$-th iteration is given as:

$$r_i = -\frac{f(x_i)}{\| \nabla f(x_i) \|_2^2} \nabla f(x_i)$$

Deepfool employs a linearity assumption on the neural network to simplify the optimisation process. We use the $L_2$ version of Deepfool in our evaluation.
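The closed-form step above can be sketched for a binary linear classifier, where the locally linearised boundary is exact. The linear model f(x) = w·x + b and the overshoot constant are illustrative; Deepfool's multi-class version iterates over all candidate boundaries:

```python
import numpy as np

def deepfool_binary(x, w, b, overshoot=0.02, max_iter=50):
    """Deepfool for a binary linear classifier f(x) = w.x + b.
    Each step applies the minimal L2 perturbation onto the (locally
    linearised) decision boundary, slightly overshooting, until the
    sign of f flips."""
    x_adv = x.astype(float).copy()
    orig_sign = np.sign(w @ x + b)
    for _ in range(max_iter):
        fx = w @ x_adv + b
        if np.sign(fx) != orig_sign:
            break                                  # already misclassified
        r = -(fx / (w @ w)) * w                    # projection onto boundary
        x_adv = x_adv + (1 + overshoot) * r        # overshoot past the boundary
    return x_adv
```

For a truly linear model one step suffices; for a neural network the same step is repeated because the linearisation is only local.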

2.1.4. CarliniWagner

This method is reportedly able to make defensive distillation invalid (Carlini and Wagner, 2017b). The study explored crafting adversarial examples under three distance metrics (i.e. $L_0$, $L_2$, and $L_\infty$) and seven modified objective functions. We use CarliniWagner $L_2$, which is based on the $L_2$ metric, in our experiments. The method first redesigns the optimisation objective as the following function:

$$f(x^*) = \max\left( \max\{ Z(x^*)_i : i \neq t \} - Z(x^*)_t,\ -\kappa \right)$$

where $Z$ is the output logits of the neural network, $t$ is the target class, and $\kappa$ is a hyper-parameter for adjusting the adversarial example confidence at the cost of enlarging the distortion of the adversarial image. Then, it adapts an L-BFGS solver to solve the box-constrained problem:

$$\min_{w} \ \| x^* - x \|_2^2 + c \cdot f(x^*)$$

Herein $x^* = \frac{1}{2}(\tanh(w) + 1)$, so the optimisation variable is changed to $w$ and the box constraint on pixel values is satisfied automatically. According to the reported results, this method achieved a 100% attacking success rate on distilled networks in the white-box setting. By increasing the confidence $\kappa$, this method can also craft targeted transferable examples to perform black-box attacks.
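The redesigned objective and the tanh change of variable can be sketched as follows. The `logits_fn` and the constants are illustrative placeholders; the real attack minimises this value with a gradient-based solver rather than evaluating it once:

```python
import numpy as np

def cw_objective(w_var, x, target, logits_fn, c=1.0, kappa=0.0):
    """CW L2 objective with the tanh change of variable.

    x_adv = 0.5 * (tanh(w) + 1) keeps every pixel in [0, 1], removing the
    box constraint. f(x_adv) = max(max_{i != t} Z_i - Z_t, -kappa) becomes
    negative only once the target logit leads by at least kappa.
    """
    x_adv = 0.5 * (np.tanh(w_var) + 1.0)           # box constraint via tanh
    z = logits_fn(x_adv)
    other = np.max(np.delete(z, target))           # best non-target logit
    f = max(other - z[target], -kappa)
    return np.sum((x_adv - x) ** 2) + c * f, x_adv
```

Raising `kappa` forces the solver to keep perturbing until the target logit dominates by a margin, which is what produces the high-confidence, more transferable examples discussed later.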

Figure 2. The distribution of classification results of 10,000 CIFAR10 examples and their corresponding adversarial examples. The non-targeted adversarial attack reveals the vulnerable classes that are usually mis-classified as each other.

2.2. Threat model

We consider four types of threat from the adversary. In real-world cases, the adversary normally does not have the parameters or the architecture of the deep learning model, since the model is well protected by the service provider. Thus, our first threat model assumes that the adversary is in the black-box setting. Previous research (Carlini and Wagner, 2017b, a) recommends that the robustness of a model be evaluated with transferred adversarial examples; otherwise, attackers can use an easy-to-attack model as a substitute to break the defence on the oracle model. Second, in some cases, the model parameters and architecture may be leaked to the attacker while the defence mechanism remains hidden; this leads to the grey-box attack. Next, if the adversary has both the model and the defensive method, the threat becomes a white-box threat. Finally, because directly defending against the white-box threat is nearly impossible, we define a new threat model, named the nearly-white-box threat, to examine the defence in the extreme case where the adversary has detailed knowledge of the model and the defence, but the parameters and architectures of both are unchangeable. We list the four threat types as follows:

  • Black-box threat: the attacker has an easy-to-attack model as the substitute to approximate the oracle classifier. The attacker also has a training dataset with the same distribution as the dataset used to train the oracle. To simulate the worst yet practical case, the substitute and the oracle are trained on the same training dataset. However, the attacker knows neither the defensive mechanism nor the exact architecture and parameters of the oracle.

  • Grey-box threat: the adversary knows the model parameters and the model architecture of the oracle. In this case, the attacker is able to craft adversarial examples based on the oracle instead of the substitute. However, the attacker is blinded from the defensive mechanism.

  • White-box threat: the adversary knows everything about the oracle and the defence. This is a very strong assumption. Attacks launched in this way are nearly impossible to defend against, since the attacker can take countermeasures against any defence.

  • Nearly-white-box threat: the attacker has both the model and the defensive method, as in the white-box setting, but cannot make changes to the architecture or the parameters of the model or the defence.

We assume that the defender has no prior knowledge of: 1) the attacking method adopted by the adversary; 2) the substitute used by the attacker.

3. Design

We introduce our multi-task adversarial training method in this section. We first examine the existence of vulnerable decision boundary given a neural network model. Then, we identify robust label pairs of the dataset. Finally, we propose the multi-task training framework for both black-box attack and grey-box attack.

3.1. Vulnerable decision boundary

In this section, we identify vulnerable class pairs, i.e. the pairs of classes that are most easily exploited by the adversary. A main constraint imposed on an adversarial example is the distance between the perturbed example and the original example. In the case of a non-targeted attack, by using gradient descent to search for adversarial examples, the attacker aims to maximise the loss of the classification with respect to the ground truth while making minimal changes to the inputs in the feature space. Thus, we assume that for a given dataset $D$ and a model $F$ trained on $D$, the decision boundaries of $F$ will, in the ideal situation, separate data points belonging to different classes in $D$. Since the distances among different classes of examples differ, a non-targeted adversarial example is more likely to be classified as the class closest to the true class in the feature space.

We define the group of data points belonging to class $c_i$ as a cluster $C_i$. Our aim in this step is to find pairs of labels, $c_i$ and $c_j$, which are most easily misclassified as each other. We assume the misclassification is symmetric (i.e. if samples labelled $c_i$ are largely classified as $c_j$, samples labelled $c_j$ should also be largely classified as $c_i$). The decision boundary between $C_i$ and $C_j$ is identified as a vulnerable decision boundary. We examined this assumption by investigating the misclassification results of 10,000 adversarial Cifar10 examples crafted by FGS. The symmetry demonstrated in the results (Fig.2) supports our assumption. Moreover, in targeted attacks, some classes are more difficult to use as the target class (Carlini and Wagner, 2017a), which is further evidence of the existence of vulnerable decision boundaries.

3.2. Robust class pair identification

In this step, we use a non-targeted attack to examine the vulnerable boundaries of the classifier on a given dataset. Given an input image, we are then able to propose the class least likely to be exploited by the attacker. First, we produced a set of adversarial examples using the FGS method. Then, from the prediction results on the produced adversarial examples, we revealed the vulnerable boundaries and estimated the robust class pairs.

As introduced in Section 2.1.1, FGS perturbs the image towards the gradient sign direction and requires only one back-propagation pass per example. FGS can therefore efficiently craft large volumes of adversarial examples in a very short time, which copes with the sample-hungry nature of adversarial training.

The generated adversarial example set was then fed through the model. From the classification results, for a given class $c_i$, we search for the corresponding robust class $\hat{c}_i$ by maximising the following likelihood from the observations:

$$\hat{c}_i = \arg\max_{c_j \neq c_i} P\left( F(x^*) \neq c_j \mid y = c_i \right)$$

Based on this estimation, we selected the most likely $\hat{c}_i$ as the robust label pair of $c_i$. Following this procedure, we built the paired label for each sample in the training dataset. The mappings between $c_i$ and $\hat{c}_i$ were saved as a table $T$. Later on, in the grey-box setting, this information is used to assess the credibility of the input example.
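The table-building step can be sketched as a hypothetical helper that counts where non-targeted adversarial examples land and picks, for each class, the least-exploited class as its robust pair. This mirrors the described procedure, not the authors' exact code:

```python
import numpy as np

def robust_pairs(true_labels, adv_preds, num_classes):
    """Estimate the robust label pair for each class: the class least
    often hit by non-targeted adversarial examples of that class.
    (Symmetry of misclassification is observed empirically in the paper,
    not enforced here.)"""
    counts = np.zeros((num_classes, num_classes))
    for y, p in zip(true_labels, adv_preds):
        counts[y, p] += 1                    # confusion counts on adv. examples
    table = {}
    for c in range(num_classes):
        hits = counts[c].copy()
        hits[c] = np.inf                     # exclude the true class itself
        table[c] = int(np.argmin(hits))      # least-exploited target class
    return table
```

The resulting dictionary plays the role of the table T: for each class it stores the pair label that adversarial perturbations almost never reach.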

Figure 3. The multi-task training framework.

3.3. Collaborative multi-task training

We introduce the proposed collaborative multi-task training framework in this section. We have considered both black-box attacks and grey-box attacks. First, the proposed general training framework is as described in Fig.3.

3.3.1. Multi-task training for black-box/grey-box attack

According to the previous analysis of robust class pairs, we use the robust label pairs to conduct multi-task training (Evgeniou and Pontil, 2004). Assume the original model has a logits layer $Z_1$ which produces output $O_1$. Our method grows another logits layer $Z_2$, which outputs logits from the last hidden layer and produces output $O_2$. While the softmaxed $Z_1$ is used to calculate the loss of the model output against the true label $y$ of the input $x$, the softmax output of $Z_2$ is employed to calculate the loss against the robust label $y_r$ given $y$. We also use adversarial examples to regularise the model against adversarial inputs during the training session. The overall objective cost function of training takes the following form:

$$J_{total} = \alpha \cdot J(F(x), y) + \beta \cdot J(F(x^*), y) + \gamma \cdot J(F_2(x), y_r)$$

Herein $x$ is the benign example, and $x^*$ is the adversarial example used to regularise the model. $y$ is the ground-truth label of $x$, and $y_r$ is the most robust label paired with the current $y$. $J$ is the cross-entropy cost, and $F_2$ denotes the model output through $Z_2$. $\alpha$, $\beta$, and $\gamma$ are weights adding up to 1.

The first term of the objective function governs the performance of the original model on benign examples. The second term is an adversarial term taking in the adversarial examples to regularise the training. The last term moves the decision boundaries towards the most robust class with respect to the current class. As discussed in (Goodfellow et al., 2014), to effectively use adversarial examples to regularise the model training, we set $\alpha$ and $\beta$ to equal values, and set $\gamma$ to the remaining weight. The cost function is the average over the costs on benign examples and adversarial examples:

$$J = \frac{1}{2}\left( J_{+}(F(x), y) + J_{-}(F(x^*), y) \right)$$

Herein $J_{+}$ is the cross-entropy cost, and $J_{-}$ is a negative cross-entropy cost function which maximises the difference between the output distribution and $y$ when the input is adversarial.
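A sketch of the weighted multi-task cost described above, with illustrative softmax outputs standing in for the real model (the function and argument names are hypothetical):

```python
import numpy as np

def cross_entropy(p, y_onehot, eps=1e-12):
    """Standard cross-entropy between a softmax output and a one-hot label."""
    return -np.sum(y_onehot * np.log(p + eps))

def multitask_cost(p1_benign, p1_adv, p2_benign, y, y_robust,
                   alpha, beta, gamma):
    """Weighted multi-task cost: benign loss through O1, adversarial
    regularisation term, and robust-pair loss through O2.
    alpha + beta + gamma should sum to 1."""
    return (alpha * cross_entropy(p1_benign, y)
            + beta * cross_entropy(p1_adv, y)
            + gamma * cross_entropy(p2_benign, y_robust))
```

In training, `p1_benign`/`p1_adv` would be the softmax of the original logits on benign and adversarial inputs, and `p2_benign` the softmax of the second logits layer trained against the robust label.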

Once an example is fed into the model, the outputs through $O_1$ and $O_2$ are checked against the table $T$. The example is recognised as adversarial if the outputs have no match in $T$. Otherwise, it is a benign example, and the output through $O_1$ is accepted as the classification result.
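The run-time check can be sketched as a one-line rule against the stored table (the names here are illustrative):

```python
def is_adversarial(o1_label, o2_label, table):
    """Flag the input as adversarial when the (O1, O2) output pair does
    not match a robust label pair recorded in the table T."""
    return table.get(o1_label) != o2_label
```

Any input whose two heads disagree with the learned pairing is rejected; agreeing inputs are classified by the first head.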

In the grey-box setting, the attacker does not know the existence of the defence. The adversarial objective will therefore only produce examples that change the output through $O_1$ to the adversarial class, but cannot guarantee that the output through $O_2$ has the correct mapping relationship to the output through $O_1$ in the table $T$. A grey-box attack is then detected by our architecture.

3.3.2. Breaking grey-box defence

In the nearly-white-box setting, the adversary has access to the model weights and the defensive mechanism. Therefore, adversarial examples can still be crafted against the model under grey-box defence.

In the grey-box setting, if the attacker does not know the existence of the defence, the adversarial objective only changes the output through $O_1$ to the adversarial class; it cannot guarantee that the output through $O_2$ has the correct mapping relationship to the output through $O_1$ according to the table $T$. However, in the nearly-white-box setting, when the adversary performs adversarial searching for targeted adversarial examples on the model without the gradient lock unit $G$, the optimisation solver can find a solution by back-propagating a combined loss function $L$:

$$L = \lambda_1 \cdot J(O_1, t_1) + \lambda_2 \cdot J(O_2, t_2)$$

Herein $t_1$ is the target output from $O_1$, and $t_2$ is the target output from $O_2$; $t_1$ and $t_2$ should be a pair in the table $T$. Here we assume the attacker can feed a large number of adversarial examples through the protected model and find out the paired $t_2$ for each $t_1$, though this is a very strong assumption even in the grey-box setting. $\lambda_1$ and $\lambda_2$ can be set by the attacker to control the convergence of the solver. The gradients for back-propagating the adversarial loss from logits layers $Z_1$ and $Z_2$ then become:

$$\nabla_x L = \lambda_1 \frac{\partial J(O_1, t_1)}{\partial Z_1} \frac{\partial Z_1}{\partial x} + \lambda_2 \frac{\partial J(O_2, t_2)}{\partial Z_2} \frac{\partial Z_2}{\partial x}$$
Thus, it can be found that, in the nearly-white-box setting, the solver can still find adversarial examples by using a simple linear combination of the adversarial losses in the objective function. The detection method used for the grey-box attack collapses in this case. To solve the nearly-white-box defence problem, we introduce a collaborative architecture into the framework.

3.3.3. Collaborative training for nearly-white-box attack

We evolved the framework to defend against not only black-box/grey-box attack, but also nearly-white-box attack, in which the adversary has the model and the defence.

We add a gradient lock unit $G$ between logits $Z_1$ and logits $Z_2$. $G$ contains two fully connected layers. This architecture is not strictly necessary; we added it to better capture the non-linear relationship between $Z_1$ and $Z_2$. The last layer of $G$ is a multiplier, which multiplies element-wise with the output of $Z_2$ to form the new logits $Z_2'$. The input of $G$ is $Z_1$. The architecture is then trained on the benign training dataset and regularised by FGS adversarial examples, using the same training process as in Section 3.3.1.

The added extra layers contain no parameters to be trained; however, they prolong the path for computing the adversarial gradient. After the gradient lock unit is added, the gradient of the loss function becomes:

$$\nabla_x L = \lambda_1 \frac{\partial J(O_1, t_1)}{\partial Z_1}\frac{\partial Z_1}{\partial x} + \lambda_2 \frac{\partial J(O_2, t_2)}{\partial Z_2'}\left( \frac{\partial Z_2'}{\partial Z_2}\frac{\partial Z_2}{\partial x} + \frac{\partial Z_2'}{\partial G(Z_1)}\frac{\partial G(Z_1)}{\partial Z_1}\frac{\partial Z_1}{\partial x} \right)$$

It can be seen that, in the second term, the back propagation from $Z_2'$ to $Z_2$ and the back propagation from $Z_2'$ to $Z_1$ are mutually affected by each other: the gradient update is calculated based on $Z_1$ and $Z_2$ from the previous step, without taking the updates in the current step into consideration. Therefore, it is difficult for the solver to find a converged solution on $x^*$. For a gradient-based solver, it is hard to find a valid adversarial example against this architecture.
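A forward-pass sketch of the gradient-lock unit as described: two fully connected layers map the first logits, and the result multiplies the second logits element-wise. The weight matrices `G1`, `G2` and the tanh nonlinearity are assumptions for illustration:

```python
import numpy as np

def gradient_lock_forward(z1, z2, G1, G2):
    """Sketch of the gradient-lock unit G: the first logits z1 pass
    through two fully connected layers, and the result multiplies the
    second logits z2 element-wise to form the locked logits z2'.
    G1, G2 are illustrative fixed (untrained) weight matrices."""
    h = np.tanh(G1 @ z1)          # first FC layer (nonlinearity assumed)
    g = G2 @ h                    # second FC layer
    return g * z2                 # element-wise multiplier -> new logits z2'
```

Because z2' depends multiplicatively on both heads, a gradient step that moves z1 also rescales the gradient flowing into z2, which is the coupling the analysis above relies on.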

When the model is put into use, the outputs through $O_1$ and $O_2$ are checked against the table $T$ from Section 3.2. If the outputs match the mapping relationship in $T$, the output is believed to be credible. Otherwise, the input example is identified as an adversarial example. Therefore, our defensive measure is to detect and reject adversarial examples. Furthermore, the regularised model output from $O_1$ can remedy the detection module once mis-detection occurs.

Layer Type      | Oracles
Convo+ReLU      | 3×3×32  | 3×3×64  | 3×3×32
Convo+ReLU      | 3×3×32  | 3×3×64  | 3×3×32
Max Pooling     | 2×2     | 2×2     | 2×2
Dropout         | 0.2     | -       | -
Convo+ReLU      | 3×3×64  | 3×3×128 | 3×3×64
Convo+ReLU      | 3×3×64  | 3×3×128 | 3×3×64
Max Pooling     | 2×2     | 2×2     | 2×2
Dropout         | 0.2     | -       | -
Convo+ReLU      | 3×3×128 | -       | -
Convo+ReLU      | 3×3×128 | -       | -
Max Pooling     | 2×2     | -       | -
Dropout         | 0.2     | -       | -
Fully Connected | 512     | 256     | 200
Fully Connected | -       | 256     | 200
Dropout         | 0.2     | -       | -
Softmax         | 10      | 10      | 10
Table 1. Model Architectures
Dataset | FGS    | IGS   | Deepfool | CW
MNIST   | 10,000 | 1,000 | 1,000    | 1,000
Cifar10 | 10,000 | 1,000 | 1,000    | 1,000
Table 2. Evaluation Datasets
Table 3. Classification Accuracy on Non-targeted Black-box and Grey-box Adversarial Examples (comparing the original and defended models)

4. Evaluation

In this section, we present the evaluation of our proposed method against state-of-the-art attacking methods. We first evaluated the performance of our defence against FGS, IGS, and Deepfool in the black-box and grey-box settings. Then we evaluated the defence against the CW attack in the black-box, grey-box, and nearly-white-box settings. We ran our experiments on a Windows server with a CUDA-supported GPU with 11 GB of memory, an Intel i7 processor, and 32 GB of RAM. For training our multi-task model and evaluating our defence against fast-gradient-based attacks, we used our own implementation of FGS. For the CarliniWagner attack, we adopted the implementation released by Carlini and Wagner. For the other attacks, we employed the implementations in Foolbox (Rauber et al., 2017).

4.1. Model, data, and attack

We implemented one convolutional neural network architecture as the oracle. For simplicity of evaluation, we used the same architecture for the MNIST oracle and the Cifar10 oracle. The architectures of the oracle models are depicted in Table 1.

We evaluated our method in the black-box and grey-box settings against four state-of-the-art attacks, namely FGS, IGS, Deepfool, and CarliniWagner $L_2$. Then we evaluated the defence in the nearly-white-box setting against the CarliniWagner attack. For FGS, in each evaluation session, we crafted 10,000 adversarial examples as the adversarial test dataset. For the other attacks, considering their inherently heavy computational cost, we crafted adversarial examples from the first 1,000 samples of the MNIST and Cifar10 test sets. We summarise the sizes of all adversarial datasets in Table 2.

4.2. Our defence in black-box setting

In black-box setting, the attacker has neither the defensive method nor the parameters of the oracle model. Instead, the adversary possesses a substitute model whose decision boundaries approximate that of the oracle’s.

In our evaluation against black-box attacks, we set the FGS distortion $\epsilon$ large enough for the attack to transfer from the substitute to the oracle. For the CW attack, we set the confidence parameter $\kappa$ to 40, the setting used to break the black-box distilled network in the original paper, to produce high-confidence adversarial examples. Later, we also evaluated the performance of our defence under different $\kappa$ values.

4.2.1. Training the substitute

To better align the decision boundaries of the substitute and the oracle, we fed the whole training dataset into each substitute to train it, for both the MNIST and Cifar10 classification tasks. We therefore ended up with two substitutes, one per task. Both substitutes achieved performance equivalent to the models used in previous papers on their respective 10,000-sample test sets. The architecture of the substitutes is summarised in Table 1.

4.2.2. Defending black-box attack

We evaluated our defence against black-box and grey-box attacks. First, given the two substitutes, we used the above attacks to craft adversarial sets whose sample numbers are listed in Table 2. Then, we fed these adversarial test sets into the MNIST and Cifar10 oracles, respectively. We adopted the non-targeted version of each attack, since non-targeted adversarial examples transfer between models much better.

Robustness towards adversarial examples is a critical criterion for the protected model. For black-box attacks, we measured the robustness by investigating the performance of our defence against typical black-box adversarial examples, which lie near the model decision boundaries. We fed the adversarial test sets into the protected model, and then checked the classification accuracy of the label output through $O_1$. The classification accuracy results are presented in Table 3.

It can be found that, in all cases except the CW attack, our method improved the classification accuracy of the oracle. The reason the CW attack can still defeat the black-box defence is that the confidence of the generated examples is set to a very high value (i.e. $\kappa = 40$). The nature of the black-box defence is to regularise the position of the model's decision boundary so that adversarial examples near the boundary become invalid. However, the defence can be bypassed if the level of perturbation or the adversarial confidence is raised, which CW is fully capable of. This vulnerability also suggests that we need a more effective defence against the CW attack. Later on, we present the results of our detection-based defence, which tackles the CW attack.

We then measured the success rate of detecting adversarial examples. We fed each adversarial test set into the defended oracle and measured the successful detection rate. The detection rates are shown in Table 4.

Table 4. The detection rate against adversarial examples in black-box and grey-box settings.

4.3. Our defence in grey-box setting

We evaluated the performance of our defence towards grey-box attacks. In grey-box setting, the attacker has the oracle model but does not know the defensive method. In this case, we crafted adversarial examples based on the oracle itself.

First, we assessed the accuracy of the classification results through $O_1$ on adversarial examples. The results are also listed in Table 3. Except for the CW attack, our method achieved high classification accuracy on all adversarial examples.

Second, similar to Section 4.2.2, we assessed the detection rate (i.e. the percentage of detected adversarial examples in the whole adversarial test set) of the model in the grey-box setting. The results for each grey-box adversarial test set are recorded in Table 4.

It can be found that our method improved the robustness of the model towards most black-box attacks. For all attacks, including CW attack, which broke the robustness-based defence, our detection-based defence performed well in both the black-box and the grey-box settings.

4.4. Detecting CW attack

We evaluated our defence against CW attack in this section. The attacking confidence of adversarial examples from CW attack is adjustable through a hyper-parameter in the adversarial objective function; a larger value produces higher-confidence adversarial examples. We evaluated the adversarial example detection performance against CW examples crafted using different confidence values in the black-box, grey-box, and nearly-white-box settings.
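To make the role of the confidence hyper-parameter concrete, the following is a minimal NumPy sketch of the CW objective term over the logits; the function name is illustrative, and the form follows the published CW formulation, in which the loss is driven to at most the negated confidence value.

```python
import numpy as np

def cw_objective(logits, target, kappa):
    """Sketch of the CW adversarial objective f(x'): push the target
    logit above the runner-up logit by at least kappa (the confidence
    hyper-parameter). Larger kappa -> higher-confidence examples."""
    z = np.asarray(logits, dtype=float)
    z_target = z[target]
    z_other = np.max(np.delete(z, target))  # largest non-target logit
    # The loss saturates at -kappa once the target logit dominates by kappa.
    return max(z_other - z_target, -kappa)
```

For example, with logits `[1.0, 5.0, 2.0]` and target class 1, the objective is already saturated for any `kappa` below the logit margin of 3, which is why raising `kappa` forces the solver to keep enlarging that margin.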

4.4.1. Black-box

In the black-box setting, high-confidence CW examples transfer better from the substitute to the oracle. We tested the performance of the defence both on invalidating and on detecting adversarial examples.

First, we evaluated the performance on invalidating transferred black-box attacks by measuring the success rates of the transferred attacks that changed the output label. The successful transfer rates from the substitutes to the oracles are plotted in Fig.4. When the confidence is set higher, the adversarial examples can still break our black-box defence, since this defence is similar in nature to adversarial training: it regularises the decision boundaries of the original oracle to invalidate adversarial examples near those boundaries, but high-confidence examples usually lie far away from the decision boundaries. Consequently, for defending against high-confidence adversarial examples, we mainly rely on the detection mechanism.

Figure 4. The rate of successfully transferred adversarial examples by CW attack in the black-box setting. Our black-box defence decreased the attack success rate when the confidence was under 10; at higher confidence values, the black-box defence became invalid.

Second, we evaluated the precision and recall of detecting black-box adversarial examples crafted under different confidence values. The precision and recall values are plotted in Fig.5.

Figure 5. The precision and recall of detecting black-box CW adversarial examples.

4.4.2. Grey-box

In the grey-box setting, the attacker knows the model parameters but not the defence. To evaluate the performance of our method against grey-box CW attack, we varied the confidence value from 0 to 40. For each value, we crafted 1,000 adversarial examples based on the oracle (without the branch) and mixed in 1,000 benign examples to form the evaluation dataset for that value. We then measured the precision and recall of our defence on detecting grey-box examples. Precision is the percentage of genuine adversarial examples among the examples detected as adversarial; recall is the percentage of adversarial examples detected out of all adversarial examples. The precision and recall values are plotted in Fig.6. Our defence achieved high precision and recall on detecting grey-box adversarial examples.
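The precision and recall definitions used in this evaluation can be sketched as follows; the function name is illustrative.

```python
def detection_metrics(is_adv_true, is_adv_pred):
    """Precision and recall of an adversarial-example detector:
    precision = fraction of flagged examples that are genuinely
    adversarial; recall = fraction of adversarial examples flagged."""
    tp = sum(t and p for t, p in zip(is_adv_true, is_adv_pred))
    fp = sum((not t) and p for t, p in zip(is_adv_true, is_adv_pred))
    fn = sum(t and (not p) for t, p in zip(is_adv_true, is_adv_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```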

4.4.3. Nearly-white-box

We evaluated the performance of our defence against CW attack in the nearly-white-box setting. The adversarial examples used in this evaluation are crafted with the linearly-combined adversarial loss functions described in Section 3.3.2. To measure the nearly-white-box defence, we swept the confidence value from 0 to 40 and examined the rate of successful adversarial image generation over 100 images under each value, recording the successful generation rates of both targeted and non-targeted attacks. A generation is counted as failed if either no solution is found or the generated image is invalid (totally black or totally white). For the targeted attack, we randomly set the target label during the generation process. The generation rates are recorded in Table 5.

Table 5. The successful generation rate of CW attack in the nearly-white-box setting (non-targeted and targeted attacks on the original model).

It can be found that the rate of finding valid adversarial examples was kept at a reasonably low level, especially when the confidence values were high. Some of the CW adversarial examples generated with and without our defence are displayed in the appendix.

Figure 6. The precision and the recall of detecting grey-box CW adversarial examples.

4.5. Trade-off on benign examples

We also evaluated the trade-off on normal-example classification and the false-positive detection rate when the input examples are benign.

First, we evaluated the classification accuracy in the black-box and grey-box settings after our protections were applied to the oracles. We fed 10,000 test examples from each dataset, separately, into the corresponding defended oracles. The results are shown in Fig.7. Herein, the classification accuracy of the protected model in the grey-box setting is the accuracy of the output classification labels on the set of correctly identified normal examples. For the first oracle, our black-box defence only slightly decreased the accuracy, and our grey-box defence caused no accuracy loss. On the second oracle, both the black-box and the grey-box defences decreased the accuracy only marginally. These trade-offs are within an acceptable range considering the improvements in defending against adversarial examples.

Next, we assessed the mis-detection rate of our defence. We fed 10,000 benign examples from each dataset into the corresponding defended oracles to check how many were incorrectly recognised as adversarial. Our detector had a very low mis-detection rate on both datasets. In conclusion, our detection-based defence can accurately separate adversarial examples from benign examples.

Figure 7. The trade-off on benign example classification. Herein, 'Oracle' denotes the corresponding vanilla oracles, 'Black-box Defence' is the oracle with our defence in the black-box setting, and 'Grey-box Defence' is the oracle with our defence in the grey-box setting.

5. Justification of the defence

In this section, we justify the mechanisms by which our defence counters black-box attack and CW attack.

5.1. Defending normal black-box attack

For black-box attack, the adversarial training introduced in this model effectively regularises the decision boundaries of the model to tackle adversarial examples near those boundaries. Compared to vanilla adversarial training, our method further increases the distance required for moving adversarial examples. Searching for adversarial examples against a deep neural net heavily leverages back-propagation of the objective loss to adjust the input pixel values; the adversary is effectively moving the data point in the feature space along the gradient direction to maximise the adversarial loss function (non-targeted attack in the black-box setting). Suppose the adversarial example is found after $n$ steps of gradient descent, with step size $\epsilon_i$ at step $i$; the total perturbation $\delta$ can be approximately written as:

$$\delta \approx \sum_{i=1}^{n} \epsilon_i \, \nabla_{x} L(x_i, y).$$
According to Section 3.2, the adversary relies on gradient-descent-based updates to gradually perturb the image until it becomes adversarial. In a non-targeted attack, the search stops as soon as an adversarial example is found. Given an example with its original label, it is unlikely to be classified into its paired robust label: within a limited number of steps, the gradient updates will not converge if the adversarial cost is computed between the output maximised at the original label and the robust target (and vice versa). Similarly, in targeted attacks, the robust label is more difficult to use as the adversarial target label. The total effort required to maximise/minimise the loss towards the robust label is therefore greater than the effort required towards an ordinary label. When the training objective includes a term containing the robust label, the output of the trained model can be treated as a linear combination of the outputs trained from the original label and the robust label; hence, the total effort required to change the combined output is higher.
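The step-wise accumulation of perturbation described above can be sketched as a toy NumPy loop; `grad_fn` and `eps` are illustrative stand-ins for the loss gradient and the per-step size.

```python
import numpy as np

def accumulate_perturbation(x0, grad_fn, eps, n_steps):
    """Toy sketch: a non-targeted search moves x along the loss
    gradient; the total perturbation is (approximately) the sum of
    the per-step moves eps * grad_fn(x_i)."""
    x = np.array(x0, dtype=float)
    delta = np.zeros_like(x)
    for _ in range(n_steps):
        step = eps * grad_fn(x)   # one gradient-ascent step on the loss
        x = x + step
        delta += step
    return x, delta
```

The harder the robust-label term makes it to increase the loss, the more steps (and total perturbation) the loop above needs before the label flips, which is the intuition behind the increased total effort.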

From the perspective of classifier decision boundary, our multi-task training method has also increased the robustness of the model against black-box examples. The robust label regularisation term actually moves the decision boundary towards the robust class (Fig.8). Compared to the traditional adversarial training, which tunes the decision boundary depending merely on the generated adversarial data points, our regularisation further enhances the robustness of the model towards nearby adversarial examples.

5.2. Defending CW attack

We provide a brief analysis of why our method can defend against CW attack in this section. CW attack mainly relies on a modified objective function to search for adversarial examples, given the logits of the model. The objective effectively increases the logit of the desired class until the difference between it and the second-largest logit reaches the upper bound defined by the confidence hyper-parameter. The optimisation process can be interpreted as adjusting the input pixels along the gradient direction that maximises this logit difference.

When we adopt collaborative multi-task training as a defence, the trained model modifies the output logits to have high values not only at the position of the ground-truth class but also at the position of the robust class paired with that ground truth. In the grey-box setting, the defence is hidden from the attacker, who crafts adversarial examples solely on the oracle model without the robust branch. Hence, the adversarial objective function is a single loss rather than a linear combination of the two branch losses. The crafted adversarial example can only change the main output to the adversarial target; the output through the robust branch is not guaranteed to be the robust label paired with that target.

In the nearly-white-box setting, the defence is exposed to the adversary. The adversary can therefore first feed forward a certain volume of adversarial examples to approximate the label-pair mapping of the task dataset, and then set the adversarial objective to the linear-combination form mentioned in Section 3.3.2 to bypass the defence. However, once the gradient lock unit is added, first, the path for back-propagating the adversarial loss becomes longer, so the solver cannot effectively adjust the input pixel values of the original example due to vanishing gradients. Second, the two logits outputs (denote them $z_1$ and $z_2$) become coupled in the gradient back-propagation. Since the gradient lock multiplies the logits, so that the locked output is $z_1 \odot z_2$, the adjustments made on $z_1$ and $z_2$ in the last step of gradient update become:

$$\Delta z_1 = \eta\,\frac{\partial L}{\partial (z_1 \odot z_2)} \odot z_2, \qquad \Delta z_2 = \eta\,\frac{\partial L}{\partial (z_1 \odot z_2)} \odot z_1.$$

The update on each output thus depends on the previous value of the other, and the interlocked inputs gradually decay the progress made by the gradient updates. It therefore becomes difficult for the optimisation solver to find a satisfactory solution that produces both the adversarial label and the correct paired robust label of that adversarial label.
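Assuming, as described for the gradient lock unit, that the unit multiplies the two logits branches elementwise, the chain-rule coupling can be illustrated with a small NumPy sketch (this is an illustrative stand-in, not the paper's exact unit):

```python
import numpy as np

def locked_gradients(z1, z2, upstream_grad):
    """If the locked output is o = z1 * z2 (elementwise), then
    dL/dz1 = dL/do * z2 and dL/dz2 = dL/do * z1, so each branch's
    gradient depends on the other branch's current value."""
    z1 = np.asarray(z1, dtype=float)
    z2 = np.asarray(z2, dtype=float)
    g = np.asarray(upstream_grad, dtype=float)
    return g * z2, g * z1  # (dL/dz1, dL/dz2)
```

Because each returned gradient is scaled by the other branch's values, an update that suppresses one branch simultaneously shrinks the gradient signal reaching the other, which is the decay effect described above.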

Figure 8. The regularisation in a 2-dimensional sample space. Embedding the robust class in the loss function further moves the decision boundary towards the robust class. Fractions of the decision boundary that were not regularised by injecting adversarial examples get regularised as well. Thus, black-box examples near the decision boundary will be invalidated.

6. Related work

6.1. Defensive methods

In the category of robustness-based methods, the first defensive mechanism employs adversarial training to enhance the robustness of neural nets. Second, there are methods that change the neural network design to improve robustness. For example, Gu et al. proposed the contractive neural net, which addresses the adversarial example problem by introducing a smoothness penalty on the neural network model (Gu and Rigazio, 2015). Techniques from transfer learning have also been adopted to tackle adversarial attacks. Defensive distillation distils a training dataset with soft labels from a first neural net, trained under a modified softmax layer, to train a second identical neural net (Papernot et al., 2016c). The soft labels enable more terms of the loss function to contribute during back-propagation, while the modified softmax function amplifies the outputs of the logits layer. Defensive distillation extends the idea of knowledge distillation, which was originally used to reduce neural net dimension.
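The temperature-modified softmax at the heart of defensive distillation can be sketched as follows; the function name is illustrative, but the formula is the standard temperature softmax.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Temperature softmax used in defensive distillation: a high T
    during training yields soft labels (a flatter distribution);
    removing T at test time effectively amplifies the logits."""
    z = np.asarray(logits, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()
```

For instance, the same logits produce a much flatter (softer) distribution at `T=100` than at `T=1`, which is exactly the "soft label" effect the distilled network is trained on.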

In contrast to making changes to the original DNN model, detection-based methods adopt a second model to detect examples with adversarial perturbations. For instance, Metzen et al. attached detectors to the original model and trained them with adversarial examples (Metzen et al., 2017). Another method employs a support vector machine to classify the output of a high-level neural network layer (Lu et al., 2017). More recently, MagNet relies on autoencoders to reconstruct adversarial examples into normal examples, and detects adversarial examples based on the reconstruction error and the divergence of the output probabilities (Meng and Chen, 2017).
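A simplified sketch of the reconstruction-error criterion in a MagNet-style detector is shown below; `reconstruct` (the autoencoder) and `threshold` are placeholders, and the full MagNet defence also uses probability divergence, which is omitted here.

```python
import numpy as np

def reconstruction_error_flag(x, reconstruct, threshold):
    """Flag an input as adversarial when the autoencoder
    reconstruction error (mean squared error) exceeds a threshold."""
    x = np.asarray(x, dtype=float)
    err = np.mean((x - reconstruct(x)) ** 2)
    return err > threshold
```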

However, current defensive methods that focus on improving DNN robustness are not actually making the neural nets more robust; they merely cause more failed attempts for existing attacking methods. Moreover, current methods usually decrease the performance of the neural nets on non-adversarial data. Detection-based methods can detect high-confidence adversarial examples, but they usually generalise poorly across different types of attacks.

6.2. Attacking methods

Adversarial examples are crafted by either maximising the loss between the model output and the ground truth (i.e. non-targeted attack) or minimising the loss between the model output and the desired adversarial result (i.e. targeted attack). The adversary can create hard-to-defend examples if they obtain the architecture and the parameters of the oracle model (i.e. white-box attack). In a black-box attack, crafting adversarial examples requires a DNN substitute. However, current applications of DNNs largely rely on a few representative DNN models, which makes it relatively easy for adversaries to guess the substitutes. Moreover, due to the transferability of adversarial examples (Liu et al., 2016), the crafted examples can be used in black-box attacks towards other DNNs (Papernot et al., 2017) without knowing the model parameters or the training data.

The representative methods for crafting adversarial examples can be broadly categorised into two types:

Gradient based attack. Gradient-based methods employ the gradients of the output error with respect to each input pixel to craft adversarial examples. In the fast gradient/gradient sign method, each input pixel is slightly perturbed along the gradient direction or the gradient sign direction (Goodfellow et al., 2014). These methods compute gradients via back-propagation only once, and can therefore craft large batches of adversarial examples in a very short amount of time. Additionally, an iterative gradient sign method was proposed (Kurakin et al., 2016), which iteratively perturbs pixels instead of conducting a single-step perturbation on all pixels.
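The single-step fast gradient sign method can be sketched in a few lines of NumPy; `grad` stands for the precomputed loss gradient with respect to the input, and pixel values are assumed to lie in [0, 1].

```python
import numpy as np

def fgsm_perturb(x, grad, eps):
    """Fast gradient sign method: one-step perturbation of every
    pixel along the sign of the loss gradient, clipped to the
    valid pixel range [0, 1]."""
    x = np.asarray(x, dtype=float)
    grad = np.asarray(grad, dtype=float)
    x_adv = x + eps * np.sign(grad)
    return np.clip(x_adv, 0.0, 1.0)
```

The iterative variant simply applies this step repeatedly with a smaller `eps`, re-evaluating the gradient each time.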

Optimisation based attack. Optimisation-based methods aim to optimise the output error while keeping the adversarial example as close to the original example as possible. First, Szegedy et al. proposed an L-BFGS-based attack, which solves adversarial example crafting as a box-constrained minimisation problem (Szegedy et al., 2013). Later, a Jacobian-based saliency map attack was proposed by Papernot et al. (Papernot et al., 2016b), which modifies the most critical pixel in each saliency map during each iteration to arrive at the final adversarial example. Furthermore, DeepFool iteratively finds the minimum perturbations of images (Moosavidezfooli et al., 2016). Last but not least, Carlini and Wagner proposed an optimisation attack based on a modified objective function (Carlini and Wagner, 2017b); this method can effectively invalidate defensive distillation, a state-of-the-art defensive method.

Both gradient-based and optimisation-based methods can be used to perform non-targeted and targeted attacks. Optimisation-based attacks usually achieve a higher attack success rate and, more importantly, higher confidence in the misclassification. Therefore, optimisation-based methods are usually stronger than fast-gradient-based methods.

7. Discussion

Current attacks and defences are largely threat-model-dependent. For instance, defensive distillation, a state-of-the-art defence, works in a grey-box scenario in which the attacker knows the parameters of the distilled network but does not know that the network was trained with defensive distillation. When the attacker crafts adversarial examples based on the distilled network, the temperature removed from the softmax function leads to vanishing gradients while solving the adversarial objective function. However, once the attacker knows that the network is distilled, the defence can easily be bypassed by adding a temperature back into the softmax function while solving the adversarial objective function. Moreover, defensive distillation is invalid against black-box attack: adversarial examples crafted on an easy-to-attack substitute can still be valid on a distilled network.

Since gradient-based attacking methods other than CW attack are unable to attack a distilled network in the grey-box setting, black-box attacks launched with these methods are more practical and harmful; this is why we evaluated the performance of our defence in the black-box setting. As a special case, CW attack claims to break defensive distillation in the white-box setting, since it searches for adversarial examples based on the logits instead of the softmax outputs and thereby bypasses the vanishing-gradient mechanism that defensive distillation introduces at the softmax layer. However, access to the logits is itself a very strong assumption, which effectively defines a white-box or almost-white-box attack. Alternatively, a substitute can be used together with CW attack to attack a distilled network in a black-box manner.

There are many possible ways to enhance our method. First, it can be further improved by incorporating randomness into the defence architecture; for example, switching some of the model parameters among a set of pre-trained parameters might further increase the security performance of the defence. Second, attacks that employ forward derivatives (e.g. JSMA (Papernot et al., 2016a)) can still effectively find adversarial examples, since our defence essentially tackles gradient-based adversarial example search. However, our defence remains functional against black-box JSMA examples due to the regularised training process.

8. Conclusion and Future Work

In this paper, we proposed a novel defence against black-box, grey-box, and nearly-white-box attacks on deep neural networks. Importantly, our method can protect CNN classifiers from the Carlini and Wagner (CW) attack, which is the most advanced and relatively practical attack to date. The performance trade-off on normal example classification brought by our defence is also acceptable.

Our approach has several shortcomings. First, the quality of the approximated label-pair mapping directly affects the detection rate for adversarial examples. We used a large volume of non-targeted adversarial examples to approximate the mapping relationship between class labels, but this quality is affected by the employed attacking method. Second, introducing randomness into the defence could further increase its performance. Moreover, there could be better options than multiplying the logits in the gradient lock unit. We will address these problems in our future work.

Deep neural networks have achieved state-of-the-art performance in various tasks. However, compared to traditional machine learning approaches, DNNs also provide a practical strategy for crafting adversarial examples, since the back-propagation algorithm of a DNN can be exploited by an adversary as an effective pathway for searching for adversarial examples.

Current attacks and defences have not yet been extensively applied to real-world systems built on DNNs. Previous research has attempted to attack online deep learning service providers, such as Clarifai (Liu et al., 2016), Amazon Machine Learning, MetaMind, and the Google Cloud Prediction API (Papernot et al., 2017). However, there is no reported instance of attacking a classifier embedded inside a complex system, such as the Nvidia Drive PX2. A successful attack on such systems might require a much more sophisticated pipeline of exploiting vulnerabilities in system protocols, acquiring the data stream, and crafting/injecting adversarial examples. Once such a pipeline is built, however, the potential damage it could deal would be fatal. This is another direction for future work.


  • Carlini and Wagner (2017a) Nicholas Carlini and David Wagner. 2017a. Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods. arXiv preprint arXiv:1705.07263 (2017).
  • Carlini and Wagner (2017b) Nicholas Carlini and David Wagner. 2017b. Towards Evaluating the Robustness of Neural Networks. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE.
  • Evgeniou and Pontil (2004) Theodoros Evgeniou and Massimiliano Pontil. 2004. Regularized multi–task learning. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 109–117.
  • Girshick et al. (2016) Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. 2016. Region-based convolutional networks for accurate object detection and segmentation. IEEE transactions on pattern analysis and machine intelligence 38, 1 (2016), 142–158.
  • Goodfellow et al. (2014) Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2014. Explaining and Harnessing Adversarial Examples. Computer Science (2014).
  • Gu and Rigazio (2015) Shixiang Gu and Luca Rigazio. 2015. Towards Deep Neural Network Architectures Robust to Adversarial Examples. Computer Science (2015).
  • He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
  • Kurakin et al. (2016) Alexey Kurakin, Ian Goodfellow, and Samy Bengio. 2016. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016).
  • Li et al. (2015) Haoxiang Li, Zhe Lin, Xiaohui Shen, Jonathan Brandt, and Gang Hua. 2015. A convolutional neural network cascade for face detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5325–5334.
  • Liu et al. (2016) Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. 2016. Delving into Transferable Adversarial Examples and Black-box Attacks. arXiv preprint arXiv:1611.02770 (2016).
  • Lu et al. (2017) Jiajun Lu, Theerasit Issaranon, and David Forsyth. 2017. Safetynet: Detecting and rejecting adversarial examples robustly. arXiv preprint arXiv:1704.00103 (2017).
  • Meng and Chen (2017) Dongyu Meng and Hao Chen. 2017. MagNet: a Two-Pronged Defense against Adversarial Examples. arXiv preprint arXiv:1705.09064 (2017).
  • Metzen et al. (2017) Jan Hendrik Metzen, Tim Genewein, Volker Fischer, and Bastian Bischoff. 2017. On detecting adversarial perturbations. In The 5th IEEE International Conference on Learning Representation.
  • Moosavidezfooli et al. (2016) Seyed Mohsen Moosavidezfooli, Alhussein Fawzi, and Pascal Frossard. 2016. DeepFool: A Simple and Accurate Method to Fool Deep Neural Networks. In IEEE Conference on Computer Vision and Pattern Recognition. 2574–2582.
  • Papernot et al. (2017) Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. 2017. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security. ACM, 506–519.
  • Papernot et al. (2016a) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z Berkay Celik, and Ananthram Swami. 2016a. The limitations of deep learning in adversarial settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 372–387.
  • Papernot et al. (2016b) Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. 2016b. The Limitations of Deep Learning in Adversarial Settings. In 2016 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 372–387.
  • Papernot et al. (2016c) Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. 2016c. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP). IEEE, 582–597.
  • Rauber et al. (2017) Jonas Rauber, Wieland Brendel, and Matthias Bethge. 2017. Foolbox v0. 8.0: A Python toolbox to benchmark the robustness of machine learning models. arXiv preprint arXiv:1707.04131 (2017).
  • Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
  • Szegedy et al. (2015) Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1–9.
  • Szegedy et al. (2013) Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2013. Intriguing properties of neural networks. Computer Science (2013).
Figure 9. Adversarial images generated using the non-targeted CW attack. The first and third rows are the original images. The second row shows the adversarial images generated from the original model (classification results are shown in the labels below the second row). The fourth row shows the failed generations after applying our defence to the model.
Figure 10. Adversarial images generated using the non-targeted CW attack at a higher confidence value. The first and third rows are the original images. The second row shows the adversarial images generated from the original model (classification results are shown in the labels below the second row). The fourth row shows the failed generations after applying our defence to the model.
Figure 11. Adversarial images generated using the targeted CW attack. The leftmost column shows the original images. The first row shows the targeted adversarial images generated from the original model. The second row shows the targeted adversarial images generated after applying our defence to the model. Classification results are shown in the labels below.
Figure 12. Adversarial images generated using the targeted CW attack at a higher confidence value. The leftmost column shows the original images. The first row shows the targeted adversarial images generated from the original model. The second row shows the targeted adversarial images generated after applying our defence to the model. Classification results are shown in the labels below.