# On Extensions of CLEVER: a Neural Network Robustness Evaluation Algorithm

###### Abstract

CLEVER (Cross-Lipschitz Extreme Value for nEtwork Robustness) is an Extreme Value Theory (EVT) based robustness score for large-scale deep neural networks (DNNs). In this paper, we propose two extensions of this robustness score. First, we provide a new formal robustness guarantee for classifier functions that are twice differentiable. We apply extreme value theory to the new formal robustness guarantee and call the resulting estimate the second-order CLEVER score. Second, we discuss how to handle gradient masking, a common defensive technique, using CLEVER with Backward Pass Differentiable Approximation (BPDA). With BPDA applied, CLEVER can evaluate the intrinsic robustness of a broader class of neural networks: networks with non-differentiable input transformations. We demonstrate the effectiveness of CLEVER with BPDA in experiments on a 121-layer DenseNet model trained on the ImageNet dataset.

Tsui-Wei Weng¹,³\*, Huan Zhang²\*, Pin-Yu Chen³, Aurelie Lozano³, Cho-Jui Hsieh², Luca Daniel¹

¹ Massachusetts Institute of Technology, Cambridge, MA 02139
² University of California, Los Angeles, CA 90095
³ IBM Research, Yorktown Heights, NY 10598

\* Equally contributed. Code: https://github.com/huanzhang12/CLEVER

Keywords: Adversarial Examples, Deep Learning, Robustness Evaluation

## 1 Introduction

It is well known that deep neural networks (DNNs) are vulnerable to adversarial examples: a small perturbation added to the input can mislead the network into predicting any desired class. There have been significant efforts to develop verification techniques that prove no adversarial perturbation exists for a given input $x_0$ and classifier function $f$. However, the verification problem is hard and generally intractable because a general neural network classifier is highly non-convex and non-smooth.

Alternatively, instead of verifying the exact robustness, one idea is to provide a lower bound $\epsilon$ on it, which guarantees that no adversarial examples exist within an $\ell_p$ ball of radius $\epsilon$ around the input. We call $\epsilon$ the robustness lower bound of the input image $x_0$ on classifier function $f$. CLEVER (Cross-Lipschitz Extreme Value for nEtwork Robustness) [1] is the first attack-agnostic robustness score to estimate the robustness lower bound for large-scale DNNs, e.g. modern ImageNet networks such as ResNet and Inception. It is based on a theoretical analysis of a formal robustness guarantee under a Lipschitz continuity assumption. The authors of [1] propose a sampling based approach with Extreme Value Theory to estimate the local Lipschitz constant. Empirically, this estimate aligns well with other robustness evaluation metrics, for example the distortion of adversarial perturbations found by strong attacks.

In this work, we provide two extensions of CLEVER. First, we derive a new robustness guarantee for classifier functions that are twice differentiable, and we estimate the theoretical bounds via extreme value theory. Second, we extend CLEVER to be capable of evaluating the robustness of networks with non-differentiable input transformations, making it available for a wider class of neural networks deployed with gradient masking based defense.

## 2 Related Work

Evaluating the robustness of a neural network can be done by crafting adversarial examples with a specific attack algorithm [2, 3, 4, 5]. However, this methodology has a major drawback: the resilience of a network to existing attacks is not guaranteed to extend to subsequent attacks. In fact, many defensive methods have been shown to be either partially or completely broken after stronger and adaptive attacks were proposed [6, 7, 8, 9]. Thus, it is of great importance to provide an attack-agnostic robustness evaluation metric.

On the other hand, existing formal verification methods that solve for the exact minimum adversarial distortion (which is independent of the attack algorithm) are quite expensive: verifying a small network with only a few hundred neurons on one input example can take a few hours [10]. In fact, even finding a non-trivial lower bound for the minimum adversarial distortion can be hard, and so far only results on CIFAR and MNIST networks are available [11, 12]. The work in [1] presents a framework to estimate the local Lipschitz constant using extreme value theory, and then obtains an attack-agnostic robustness score (CLEVER) based on a first-order Lipschitz continuity condition. CLEVER can scale to ImageNet networks.

Recently, Goodfellow [13] raised concerns about CLEVER in the case of networks with gradient masking, a defensive technique that obfuscates model gradients to prevent gradient based attacks. One of the main objectives of this work is to show that such concerns can be safely eliminated with the BPDA technique proposed in [6]. Moreover, we also experimentally show how CLEVER can successfully handle networks with non-differentiable input transformations, including the staircase function example in [13].

## 3 Extending CLEVER with Second Order Approximation

### 3.1 Background and definitions

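Let $x_0$ be the input of a $K$-class classifier $f$; the predicted class of $x_0$ is $c(x_0) = \arg\max_{1 \le i \le K} f_i(x_0)$. Given $x_0$ and $f$, we say $x_a = x_0 + \delta$ is an adversarial example if there exists a $\delta$ that makes $c(x_a) \neq c(x_0)$ while $\|\delta\|_p$ is small. A successful untargeted attack finds an $x_a$ such that $c(x_a) \neq c(x_0)$, while a successful targeted attack finds an $x_a$ such that $c(x_a) = t$ for a given target class $t \neq c(x_0)$. On the other hand, norm-bounded robustness is defined as follows: given a target class $t$, $\epsilon$ is the targeted robustness of $x_0$ if

$$g_t(x_0+\delta) \ge 0, \quad \forall\, \|\delta\|_p \le \epsilon, \tag{1}$$

where $g_t(x) = f_c(x) - f_t(x)$ and $c = c(x_0)$. Similarly, $\epsilon$ is the untargeted robustness if (1) holds for all classes $t \neq c$.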

### 3.2 Robustness for continuously differentiable classifiers

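In [1], the authors have shown that if the classifier function $f$ has continuously differentiable components $f_i$, the targeted robustness is

$$\epsilon = \min\left(\frac{g_t(x_0)}{L_q^t},\, R\right), \tag{2}$$

where $L_q^t$ is the local Lipschitz constant for the function $g_t$ within a local region $B_p(x_0, R)$ and $1/p + 1/q = 1$. A simple proof of this guarantee is based on the mean value theorem applied to the first order expansion of $g_t$:

$$\exists\, s \in [0,1], \quad g_t(x_0+\delta) = g_t(x_0) + \nabla g_t(x_0+s\delta)^\top \delta. \tag{3}$$

With Hölder's inequality, for any $\delta$ with $\|\delta\|_p \le R$,

$$\begin{aligned}
g_t(x_0+\delta) &= g_t(x_0) + \nabla g_t(x_0+s\delta)^\top \delta \\
&\ge g_t(x_0) - \|\nabla g_t(x_0+s\delta)\|_q \|\delta\|_p \\
&\ge g_t(x_0) - \max_{x \in B_p(x_0,R)} \|\nabla g_t(x)\|_q \cdot \|\delta\|_p \\
&= g_t(x_0) - L_q^t \cdot \|\delta\|_p.
\end{aligned}$$

Thus, the targeted robustness bound (2) is obtained by requiring the lower bound of $g_t(x_0+\delta)$ to be non-negative. The authors of [1] further extend their analysis to neural networks with ReLU activations, which is a special case of non-differentiable functions.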
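As an illustration of bound (2), the following is a minimal Python sketch for the case $p = q = 2$. It is not the implementation from [1]: the local Lipschitz constant is replaced by the plain maximum of sampled gradient norms (no extreme value fit), and the margin function `g_lin`, its gradient `grad_g_lin`, and the helper names are hypothetical.

```python
import math
import random


def sample_in_l2_ball(center, radius, rng):
    """Draw one point uniformly from the l2 ball B_2(center, radius)."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction))
    # Scaling by U^(1/d) makes the draw uniform in volume, not in radius.
    r = radius * rng.random() ** (1.0 / len(center))
    return [c + r * d / norm for c, d in zip(center, direction)]


def first_order_bound(g, grad_g, x0, radius, n_samples=1000, seed=0):
    """Estimate eps = min(g(x0) / L, R), with L taken as the largest sampled
    gradient 2-norm (a plain maximum, NOT the reverse Weibull fit of [1])."""
    rng = random.Random(seed)
    lipschitz = 0.0
    for _ in range(n_samples):
        grad = grad_g(sample_in_l2_ball(x0, radius, rng))
        lipschitz = max(lipschitz, math.sqrt(sum(v * v for v in grad)))
    return min(g(x0) / lipschitz, radius)


# Hypothetical linear margin g(x) = w.x + c, whose gradient is w everywhere.
W, C = [3.0, 4.0], 10.0


def g_lin(x):
    return sum(w * v for w, v in zip(W, x)) + C


def grad_g_lin(x):
    return list(W)


eps_first = first_order_bound(g_lin, grad_g_lin, [0.0, 0.0], radius=5.0)
print(eps_first)  # 2.0: the distance from x0 to the hyperplane g(x) = 0
```

For a linear margin the bound is tight, which makes it a convenient sanity check. For a real network the sampled maximum can underestimate the true Lipschitz constant, so the result is an estimate rather than a certificate.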

### 3.3 Robustness for twice differentiable classifiers

In this work, we provide formal robustness guarantees when classifier functions are twice differentiable, for example neural networks with twice differentiable activations such as tanh, sigmoid, and softplus. For a twice-differentiable function $g_t$, there exists $s \in [0,1]$ such that

$$g_t(x_0+\delta) = g_t(x_0) + \nabla g_t(x_0)^\top \delta + \frac{1}{2}\delta^\top H(x_0+s\delta)\,\delta, \tag{4}$$

where $H(x_0+s\delta)$ is the Hessian of $g_t$ at $x_0+s\delta$. This is analogous to the Mean Value Theorem in the first order case, but extended with a second order term. This expansion of $g_t$ can be used to derive the targeted robustness of $x_0$ in the following theorem:

###### Theorem 3.1 (Formal robustness guarantee).

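Given an input $x_0$ and a $K$-class classifier $f$, the targeted robustness of $x_0$ is

$$\epsilon = \min\left(\frac{-b+\sqrt{b^2+2a\gamma}}{a},\, R\right) \tag{5}$$

where $a = \max_{x \in B_p(x_0,R)} \|H(x)\|_{p,q}$, $b = \|\nabla g_t(x_0)\|_q$, and $\gamma = g_t(x_0)$.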

###### Proof.

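By Hölder's inequality and the definition of the induced norm, we have

$$|\nabla g_t(x_0)^\top \delta| \le \|\nabla g_t(x_0)\|_q \|\delta\|_p$$

and

$$\begin{aligned}
|\delta^\top H(x_0+s\delta)\,\delta| &\le \|H(x_0+s\delta)\,\delta\|_q \|\delta\|_p \\
&\le \|H(x_0+s\delta)\|_{p,q} \|\delta\|_p \|\delta\|_p \\
&\le \max_{x \in B_p(x_0,R)} \|H(x)\|_{p,q} \|\delta\|_p^2.
\end{aligned}$$

Letting $a = \max_{x \in B_p(x_0,R)} \|H(x)\|_{p,q}$, $b = \|\nabla g_t(x_0)\|_q$, and $\gamma = g_t(x_0)$, we get a lower bound of $g_t(x_0+\delta)$:

$$\begin{aligned}
g_t(x_0+\delta) &= g_t(x_0) + \nabla g_t(x_0)^\top \delta + \frac{1}{2}\delta^\top H(x_0+s\delta)\,\delta \\
&\ge g_t(x_0) - b\|\delta\|_p - \frac{1}{2}a\|\delta\|_p^2.
\end{aligned} \tag{6}$$

If we can guarantee that the right-hand side of (6) is non-negative, then we can guarantee $g_t(x_0+\delta) \ge 0$, which is the definition of targeted robustness in (1). Solving the quadratic inequality $\gamma - b\|\delta\|_p - \frac{1}{2}a\|\delta\|_p^2 \ge 0$ for $\|\delta\|_p \ge 0$ gives

$$\|\delta\|_p \le \frac{-b+\sqrt{b^2+2a\gamma}}{a}.$$

Since $a$ bounds the Hessian only within $B_p(x_0, R)$, the radius is additionally capped at $R$, which yields (5).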
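The closed form in (5) is easy to check numerically. The sketch below evaluates it for given values of $a$, $b$ and $\gamma$ and confirms that the lower bound (6) vanishes at the returned radius. The function name and the handling of the corner cases $\gamma \le 0$ and $a = 0$ are assumptions made for this illustration; they are not part of the theorem statement.

```python
import math


def second_order_bound(a, b, gamma, radius):
    """Eq. (5): eps = min((-b + sqrt(b^2 + 2*a*gamma)) / a, R).

    a: bound on the Hessian norm in the ball, b: ||grad g_t(x0)||_q,
    gamma: g_t(x0). Assumes gamma > 0 (x0 is classified with a margin).
    """
    if gamma <= 0.0:
        return 0.0
    if a == 0.0:
        # Zero curvature bound: (6) becomes linear, so eps = gamma / b.
        return radius if b == 0.0 else min(gamma / b, radius)
    return min((-b + math.sqrt(b * b + 2.0 * a * gamma)) / a, radius)


eps_second = second_order_bound(a=4.0, b=3.0, gamma=2.0, radius=1.0)
# The lower bound (6) vanishes exactly at the returned radius.
residual = 2.0 - 3.0 * eps_second - 0.5 * 4.0 * eps_second ** 2
print(eps_second, residual)  # 0.5 0.0
```

With $b = 0$ (a stationary point of $g_t$) the expression reduces to $\sqrt{2\gamma/a}$, so the radius is then controlled by curvature alone.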

### 3.4 Sampling via Extreme Value Theory

Theorem 3.1 needs the value $a = \max_{x \in B_p(x_0,R)} \|H(x)\|_{p,q}$, which is the maximum subordinate norm of the Hessian matrix within $B_p(x_0, R)$. When $p = q = 2$, it becomes the well-known spectral norm, and can be evaluated efficiently at a single point using power iteration or the Lanczos method. Under the framework of CLEVER, we apply extreme value theory to estimate $a$ by sampling different $x \in B_p(x_0, R)$ and running power iterations on each sampled point. In this paper, we focus on the case of $p = 2$ only ($\ell_2$ robustness). After we get an estimate of $a$, a second order robustness lower bound can be estimated at point $x_0$ using (5). The estimated bound of (2) is named 1st-order CLEVER while the estimated bound of (5) is called 2nd-order CLEVER.
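The sampling step can be sketched as follows for $p = 2$. This is a simplified illustration under stated assumptions: the Hessian is formed explicitly for a hypothetical toy margin $g(x) = \tanh(w^\top x)$, and the plain sample maximum stands in for the extreme value fit. For a real network one would more likely run power iteration on Hessian-vector products instead of materializing $H$.

```python
import math
import random


def spectral_norm(matrix, n_iters=100, seed=0):
    """Power iteration on a symmetric matrix; estimates max |eigenvalue|."""
    rng = random.Random(seed)
    v = [rng.gauss(0.0, 1.0) for _ in matrix]
    sigma = 0.0
    for _ in range(n_iters):
        norm = math.sqrt(sum(x * x for x in v))
        if norm == 0.0:
            return 0.0
        v = [x / norm for x in v]
        v = [sum(m * x for m, x in zip(row, v)) for row in matrix]  # v <- H v
        sigma = math.sqrt(sum(x * x for x in v))  # ||H v|| for a unit v
    return sigma


def toy_hessian(x, w=(1.0, 2.0)):
    """Hessian of the hypothetical margin g(x) = tanh(w.x)."""
    t = math.tanh(sum(wi * xi for wi, xi in zip(w, x)))
    curvature = -2.0 * t * (1.0 - t * t)  # second derivative of tanh
    return [[curvature * wi * wj for wj in w] for wi in w]


def sample_in_l2_ball(center, radius, rng):
    """Draw one point uniformly from the l2 ball B_2(center, radius)."""
    direction = [rng.gauss(0.0, 1.0) for _ in center]
    norm = math.sqrt(sum(d * d for d in direction))
    r = radius * rng.random() ** (1.0 / len(center))
    return [c + r * d / norm for c, d in zip(center, direction)]


def estimate_hessian_bound(hessian, x0, radius, n_samples=200, seed=0):
    """Sample maximum of ||H(x)||_2 over B_2(x0, R). This is a simplification:
    the paper fits an extreme value distribution to such samples instead."""
    rng = random.Random(seed)
    return max(
        spectral_norm(hessian(sample_in_l2_ball(x0, radius, rng)))
        for _ in range(n_samples)
    )


a_hat = estimate_hessian_bound(toy_hessian, [0.0, 0.0], radius=1.0)
print(a_hat)  # never exceeds the analytic maximum 20 / (3 * sqrt(3))
```

The toy function is chosen because its Hessian is rank one, so the exact spectral norm $|\tanh''(w^\top x)| \cdot \|w\|_2^2$ is available to check the estimate against.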

When CLEVER is evaluated, we always use the logit layer values, thus we are not subject to the saturation of the sigmoid units. Additionally, during the sampling processes, we evaluate gradients using a large number of randomly perturbed images, thus CLEVER is likely to escape the region of masked gradients in local loss landscape. The remaining concern is thus whether CLEVER can be evaluated on networks with a non-differentiable layer as a defense. For example, if the input image is quantized via bit-depth reduction, a staircase function is applied to the network and thus its gradient cannot be computed via automatic differentiation. We will formally discuss this situation in the next section.

### 4.2 Apply Backward Pass Differentiable Approximation (BPDA) to CLEVER

For a neural network classifier $f$, we can apply a non-differentiable transformation $h(\cdot)$ to the input and then feed the transformed data into $f$. The function $f(h(x))$ thus becomes non-differentiable, and gradient based adversarial attacks fail to find successful adversarial examples. An example of $h$ is a staircase function, as suggested in [13]. This transformation also hinders the direct use of CLEVER to evaluate the robustness of $f(h(x))$.

To handle non-differentiable transformations, we use the Backward Pass Differentiable Approximation (BPDA) [6] technique. The intuition behind BPDA is that although $h$ is non-differentiable (e.g., bit-depth reduction, JPEG compression), it usually holds that $h(x) \approx x$. Thus, in backpropagation, we can assume that

$$\nabla_x f(h(x))\big|_{x=x_0} \approx \nabla_x f(x)\big|_{x=h(x_0)}. \tag{7}$$

To evaluate CLEVER for a network with an input transformation $h$ (for example, a staircase function), $x$ is sampled within an $\ell_p$ ball around $x_0$. Then the transformation is applied to obtain $x' = h(x)$, and the backpropagation procedure computes $\nabla_x f(x)|_{x=x'}$. We simply collect $\nabla_x f(x)|_{x=x'}$ as the gradient, and compute its norm as a sample for Lipschitz constant estimation.
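The BPDA step described above amounts to a forward pass through the transformation followed by an ordinary gradient evaluation at the transformed point. A minimal sketch, assuming a hypothetical staircase `staircase` with 8 levels per unit and a hypothetical smooth model $f(x) = \sum_i w_i x_i^2$ whose gradient is known in closed form:

```python
import math


def staircase(x, levels=8):
    """Hypothetical non-differentiable transformation h: round each
    coordinate down to a multiple of 1/levels."""
    return [math.floor(v * levels) / levels for v in x]


def bpda_gradient(grad_f, h, x):
    """Eq. (7): use grad f evaluated at x' = h(x) in place of the
    gradient of the non-differentiable composition f(h(x))."""
    return grad_f(h(x))


# Hypothetical smooth model f(x) = sum_i w_i * x_i^2, gradient 2 * w_i * x_i.
W = [1.0, 2.0]


def grad_f(x):
    return [2.0 * w * v for w, v in zip(W, x)]


x = [0.30, 0.77]  # a point sampled around x0
grad = bpda_gradient(grad_f, staircase, x)
grad_norm = math.sqrt(sum(g * g for g in grad))  # one Lipschitz sample
print(staircase(x), grad)  # [0.25, 0.75] [0.5, 3.0]
```

In an automatic differentiation framework the same effect is usually obtained by treating $h$ as the identity in the backward pass; the explicit form here only makes the substitution in (7) visible.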

### 4.3 CLEVER is a White-Box Evaluation Tool

CLEVER is intended to be a tool for network designers to evaluate network robustness in the “white-box” setting, in which we know how a (defended) neural network processes the input. In this case, we can deal with the non-differentiable transformation with BPDA, and evaluate the intrinsic robustness of the model, without the “False Sense of Security” [6] provided by gradient masking.

In the black-box attack setting, the gradient of the defended model $f(h(x))$ must be evaluated via finite differences [17]. A non-differentiable $h$ therefore prevents gradient based attacks in black-box settings because the estimated gradient becomes uninformative: the value of $f(h(x))$ is unlikely to change when $x$ is changed by a small amount, so the finite-difference estimate is zero at most points. Goodfellow [13] raises concerns about the effectiveness of CLEVER in this setting, but this setting is different from our intended usage of CLEVER. Most importantly, CLEVER computes gradients using backpropagation via automatic differentiation in the white-box setting, rather than using finite differences. Despite the limited numerical precision of digital computers, CLEVER is not subject to the same numerical issues as in the black-box attack setting. Unless backpropagation fails, CLEVER is able to estimate a reasonable robustness score reflecting the intrinsic model robustness.
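The difference between the two ways of obtaining a gradient can be seen on a one-dimensional toy problem. The sketch below assumes a hypothetical scalar staircase and the hypothetical model $f(v) = v^2$; it compares a central finite-difference estimate on the defended function with the BPDA-style gradient of (7).

```python
import math


def staircase(v, levels=8):
    """Hypothetical scalar staircase h(v) = floor(levels * v) / levels."""
    return math.floor(v * levels) / levels


def f(v):
    return v * v  # hypothetical smooth model


def df(v):
    return 2.0 * v  # its exact derivative (what backprop would return)


def finite_difference(func, v, step=1e-4):
    """Central finite-difference estimate of d func / d v."""
    return (func(v + step) - func(v - step)) / (2.0 * step)


def defended(v):
    return f(staircase(v))


fd_flat = finite_difference(defended, 0.30)   # inside a flat step
fd_jump = finite_difference(defended, 0.375)  # straddles a discontinuity
bpda = df(staircase(0.30))                    # BPDA-style gradient, Eq. (7)
print(fd_flat, bpda)  # 0.0 0.5
```

Inside a flat step the finite-difference estimate is exactly zero, and at a jump it is dominated by the discontinuity divided by the step size, so neither value reflects the slope of $f$. The BPDA-style gradient is unaffected by either case.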

## 5 Experiments

### 5.1 Experiments on 1st Order and 2nd Order Bounds

We compute the targeted robustness bounds for a 7-layer CNN model with tanh activations (which is twice differentiable) on the CIFAR dataset, with a validation accuracy of 72.6%. We calculated both Eq. (2) and (5) via sampling with extreme value theory, and we denote the estimated scores as “1st order” and “2nd order” CLEVER scores respectively in the Tables. In particular, we follow the sampling procedure proposed in [1] to estimate the Lipschitz constant by fitting the samples with maximum likelihood estimation on the Reversed Weibull distribution, and calculate the estimated robustness scores of (2). For the “2nd order” bound (5), we also use sampling and extreme value theory to calculate the estimated bounds, as described in Section 3.4. For a fair comparison, we use the same number of samples ( and ) for both estimated bounds, and we compare their averages as well as the percentage of image examples for which one score is larger than the other. For each image, we select three attack target classes: least likely, random and runner-up. The results are summarized in Tables 1, 2 and 3. We observe that the 1st order and 2nd order average CLEVER scores usually stay close, indicating that both scores agree with each other.
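For reference, the Reversed Weibull distribution used in the maximum likelihood fit has the cumulative distribution function below. The symbols $a_W$ (location), $b_W > 0$ (scale) and $c_W > 0$ (shape) are notation assumed here for illustration.

```latex
G(y) =
\begin{cases}
\exp\!\left\{-\left(\dfrac{a_W - y}{b_W}\right)^{c_W}\right\}, & y < a_W, \\[2ex]
1, & y \ge a_W.
\end{cases}
```

The distribution has a finite right endpoint at $y = a_W$, which is why the fitted location parameter can serve as an estimate of the maximum of the sampled quantity (a gradient norm for the 1st order score, a Hessian spectral norm for the 2nd order score).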

Since CLEVER is an estimate of a lower bound, we want the score to be not trivially small, yet smaller than the upper bound found by adversarial attacks (in our case the CW attack). As shown in Tables 1, 2 and 3, all CLEVER scores are less than the CW distortion. Second order CLEVER can sometimes give a better result than its first order counterpart, indicating that the second order approximation is probably more accurate for these examples. The “avg. % of increase on the score” rows in the tables report the improvement of the score when one method is better than the other; for example, for the runner-up target, second order CLEVER increases the score for 82% of the examples, and the average improvement of the score compared to first order CLEVER is 58%.

### 5.2 Experiments on Networks with Input Transformation as a Gradient Masking based Defense

We conduct experiments on a 121-layer DenseNet [18] network pretrained on the ImageNet dataset (model available at https://github.com/pudae/tensorflow-densenet). We employ two non-differentiable input transformations that mask gradients: bit-depth reduction (reducing each color channel from 8-bit to 3-bit, setting all lower bits to 0) and JPEG compression (quality set to 75%). We compute CLEVER (first order) scores for the network with and without input transformations, with CLEVER parameter and . We randomly choose 100 images from the ImageNet validation set, and select three attack target classes for each image (least likely, random and runner-up). Misclassified images are skipped.
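The bit-depth reduction described above (keep the 3 most significant bits of each 8-bit channel value, set the lower bits to 0) can be written as a bit mask. This is a minimal sketch on a flat list of integer pixel values; the function name is hypothetical and the experiments' actual preprocessing code may differ.

```python
def reduce_bit_depth(pixels, bits=3):
    """Keep the `bits` most significant bits of each 8-bit value and
    set all lower bits to 0."""
    mask = (0xFF << (8 - bits)) & 0xFF  # bits=3 -> 0b11100000
    return [p & mask for p in pixels]


print(reduce_bit_depth([255, 200, 100, 31, 0]))  # [224, 192, 96, 0, 0]
```

Each channel is mapped onto one of 8 levels, which makes the transformation a staircase function of the input with zero gradient almost everywhere.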

Table 4 compares the CLEVER scores for three target classes, for the original model, and for bit-depth reduction or JPEG compression as input transformations. BPDA is used to compute CLEVER when an input transformation is applied. Not surprisingly, the CLEVER scores for networks with input transformation as a gradient masking method do not noticeably increase, indicating that these transformations do not increase the model’s intrinsic robustness; in other words, with BPDA applied, we can still obtain similar gradients as the original model, thus it is expected that CLEVER scores do not change too much in this situation.

## 6 Conclusions

CLEVER [1] is a first-order approximation based robustness score. We move one step further to give a second order formal guarantee for DNN robustness. We show that it improves the estimated robustness lower bound for some examples, and in many cases both first and second order CLEVER scores are coherent. Additionally, we successfully apply Backward Pass Differentiable Approximation (BPDA) to compute CLEVER scores for a network with non-differentiable input transformations, including staircase functions. Our discussions and results remedy the concerns raised in [13].

## 7 Acknowledgement

Tsui-Wei Weng and Luca Daniel acknowledge partial support of the MIT-IBM Watson AI Lab.

## References
