# Random Smoothing Might be Unable to Certify ℓ∞ Robustness for High-Dimensional Images

## Abstract

We show a hardness result for random smoothing to achieve certified adversarial robustness against attacks in the $\ell_p$ ball of radius $\epsilon$ when $p>2$. Although random smoothing has been well understood for the $\ell_2$ case using the Gaussian distribution, much remains unknown concerning the existence of a noise distribution that works for the case of $p>2$. This has been posed as an open problem by Cohen et al. (2019) and includes many significant paradigms such as the $\ell_\infty$ threat model. In this work, we show that any noise distribution $\mathcal{D}$ over $\mathbb{R}^d$ that provides $\ell_p$ robustness for all base classifiers with $p>2$ must satisfy $\mathbb{E}\,\eta_i^2=\Omega\left(d^{1-2/p}\epsilon^2(1-\delta)/\delta^2\right)$ for 99% of the features (pixels) of vector $\eta\sim\mathcal{D}$, where $\epsilon$ is the robust radius and $\delta$ is the score gap between the highest-scored class and the runner-up. Therefore, for high-dimensional images with pixel values bounded in $[0,255]$, the required noise will eventually dominate the useful information in the images, leading to trivial smoothed classifiers.

## 1 Introduction

Adversarial robustness has been a critical object of study in various fields, including machine learning (Zhang et al., 2019; Madry et al., 2017), computer vision (Szegedy et al., 2013; Yang et al., 2019), and many other domains (Lecuyer et al., 2019). In machine learning and computer vision, the study of adversarial robustness has led to significant advances in defending against attacks in the form of perturbed input images, where the data is high dimensional but each feature is bounded in $[0,255]$. The problem can be stated as that of learning a non-trivial classifier with high test accuracy on the adversarial images. The adversarial perturbation is either restricted to be in an $\ell_p$ ball of radius $\epsilon$ centered at $0$, or is measured under other threat models such as Wasserstein distance and adversarial rotation (Wong et al., 2019; Brown et al., 2018). The focus of this work is the former setting.

Despite a large amount of work on adversarial robustness, many fundamental problems remain open. One of the challenges is to end the long-standing arms race between adversarial defenders and attackers: defenders design empirically robust algorithms, which are later broken by new attacks designed to undermine those defenses (Athalye et al., 2018). This motivates the study of certified robustness (Raghunathan et al., 2018; Wong et al., 2018), that is, algorithms that are provably robust to worst-case attacks. Among these, random smoothing (Cohen et al., 2019; Li et al., 2019; Lecuyer et al., 2019) has received significant attention in recent years. Algorithmically, random smoothing takes a base classifier as input and outputs a smooth classifier by repeatedly adding i.i.d. noise to the input example and outputting the most likely class. Random smoothing has many appealing properties: it is agnostic to network architecture, it scales to deep networks, and, perhaps most importantly, it achieves state-of-the-art certified $\ell_2$ robustness for deep-learning-based classifiers (Cohen et al., 2019; Li et al., 2019; Lecuyer et al., 2019).

Open problems in random smoothing. Given the rotation invariance of the Gaussian distribution, most positive results for random smoothing have focused on the $\ell_2$ robustness achieved by smoothing with the Gaussian distribution (see Theorem 2.1). However, the existence of a noise distribution for general $\ell_p$ robustness has been posed as an open question by Cohen et al. (2019):

We suspect that smoothing with other noise distributions may lead to similarly natural robustness guarantees for other perturbation sets such as general $\ell_p$ norm balls.

Several special cases of the conjecture have been proven for $p<2$. Li et al. (2019) show that $\ell_1$ robustness can be achieved with the Laplacian distribution, and Lee et al. (2019) show that $\ell_0$ robustness can be achieved with a discrete distribution. Much remains unknown concerning the case when $p>2$. On the other hand, the most standard threat model for adversarial examples is $\ell_\infty$ robustness, among which 8-pixel and 16-pixel attacks have received significant attention in the computer vision community (i.e., the adversary can change every pixel by 8 or 16 intensity values, respectively). In this paper, we derive lower bounds on the magnitude of noise required for certifying $\ell_p$ robustness that highlight a phase transition at $p=2$. In particular, for $p>2$, the noise that must be added to each feature of the input examples grows with the dimension $d$ in expectation, while it can be constant for $p\le 2$.

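Preliminaries. Given a base classifier $f:\mathbb{R}^d\to\mathcal{Y}$ and smoothing distribution $\mathcal{D}$, the randomly smoothed classifier is defined as follows: for each class $y\in\mathcal{Y}$, define the score of class $y$ at point $x$ to be $G_y(x;\mathcal{D},f)=\Pr_{\eta\sim\mathcal{D}}(f(x+\eta)=y)$. Then the smoothed classifier outputs the class with the highest score: $g(x;\mathcal{D},f)=\arg\max_{y}G_y(x;\mathcal{D},f)$.

The key property of smoothed classifiers is that the scores $G_y(x;\mathcal{D},f)$ change slowly as a function of the input point $x$ (the rate of change depends on $\mathcal{D}$). It follows that if there is a gap between the highest and second highest class scores at a point $x$, the smoothed classifier must be constant in a neighborhood of $x$. We denote the score gap by $\Delta(x;\mathcal{D},f)=G_a(x;\mathcal{D},f)-G_b(x;\mathcal{D},f)$, where $a=\arg\max_{y}G_y(x;\mathcal{D},f)$ and $b=\arg\max_{y\neq a}G_y(x;\mathcal{D},f)$.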
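As an illustration of these definitions, the following sketch estimates the class scores, the predicted class, and the score gap by Monte Carlo sampling. It is only a sketch under stated assumptions: the function names, the toy base classifier, and the Gaussian noise level are hypothetical choices made for this example and do not come from the paper.

```python
import numpy as np

def smoothed_scores(base_classifier, x, sample_noise, num_classes, n_samples=1000, seed=0):
    """Monte Carlo estimate of the scores Pr[f(x + eta) = y] for each class y."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(num_classes)
    for _ in range(n_samples):
        eta = sample_noise(rng, x.shape)        # one noise sample from the smoothing distribution
        counts[base_classifier(x + eta)] += 1   # vote of the base classifier on the noisy input
    return counts / n_samples

def predict_with_gap(scores):
    """Return the highest-scored class and the gap to the runner-up."""
    order = np.argsort(scores)[::-1]
    return int(order[0]), float(scores[order[0]] - scores[order[1]])

def toy_classifier(z):
    # Hypothetical base classifier: class 1 if the mean feature value exceeds 0.5.
    return int(z.mean() > 0.5)

def gaussian_noise(rng, shape):
    # Hypothetical smoothing distribution: isotropic Gaussian with standard deviation 0.25.
    return rng.normal(0.0, 0.25, size=shape)

x = np.full(16, 0.9)
scores = smoothed_scores(toy_classifier, x, gaussian_noise, num_classes=2)
label, gap = predict_with_gap(scores)
```

The estimated gap plays the role of the score gap in the definition that follows; an actual certification procedure would also need confidence intervals for the estimated scores, which this sketch omits.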

###### Definition 1 ((A,δ)- and (ϵ,δ)-robustness).

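For any set $A\subseteq\mathbb{R}^d$ and $\delta\in[0,1]$, we say that the smoothed classifier $g$ is $(A,\delta)$-robust if for all $x\in\mathbb{R}^d$ with $\Delta(x;\mathcal{D},f)>\delta$, we have that $g(x+v;\mathcal{D},f)=g(x;\mathcal{D},f)$ for all $v\in A$. For a given norm $\|\cdot\|$, we also say that $g$ is $(\epsilon,\delta)$-robust with respect to $\|\cdot\|$ if it is $(A,\delta)$-robust with $A=\{v\in\mathbb{R}^d:\|v\|\le\epsilon\}$.

When the base classifier $f$ and the smoothing distribution $\mathcal{D}$ are clear from context, we will simply write $G_y(x)$, $g(x)$, and $\Delta(x)$. We often refer to a sample from the distribution $\mathcal{D}$ as noise, and use noise magnitude to refer to the squared $\ell_2$ norm of a noise sample. Finally, we use $\mathcal{D}+v$ to denote the distribution of $\eta+v$, where $\eta\sim\mathcal{D}$.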

### 1.1 Our results

Our main results derive lower bounds on the magnitude of noise sampled from any distribution $\mathcal{D}$ that leads to $(\epsilon,\delta)$-robustness with respect to $\|\cdot\|_p$ for all possible base classifiers $f:\mathbb{R}^d\to\mathcal{Y}$. A major strength of random smoothing is that it provides certifiable robustness guarantees without making any assumption on the base classifier $f$. For example, the results of Cohen et al. (2019) imply that using a Gaussian smoothing distribution with standard deviation $\sigma=2\epsilon/\delta$ guarantees that $g$ is $(\epsilon,\delta)$-robust with respect to $\|\cdot\|_2$ for every possible base classifier $f$. We show that there is a phase transition at $p=2$, and that ensuring $(\epsilon,\delta)$-robustness for all base classifiers $f$ with respect to $\ell_p$ norms with $p>2$ requires that the noise magnitude grows non-trivially with the dimension $d$ of the input space. In particular, for image classification tasks where the data is high dimensional and each feature is bounded in the range $[0,255]$, this implies that for sufficiently large dimensions, the necessary noise will dominate the signal in each example.

The following result, proved in Appendix A, shows that any distribution $\mathcal{D}$ that provides $(A,\delta)$-robustness for every possible base classifier $f:\mathbb{R}^d\to\mathcal{Y}$ must be approximately translation-invariant to all translations $v\in A$. More formally, for every $v\in A$, the total variation distance between $\mathcal{D}$ and $\mathcal{D}+v$, denoted by $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)$, must be bounded by $\delta$. The rest of our results will be consequences of this approximate translation-invariance property.

###### Lemma 1.1.

Let $\mathcal{D}$ be a distribution on $\mathbb{R}^d$ such that for every (randomized) classifier $f:\mathbb{R}^d\to\mathcal{Y}$, the smoothed classifier $g$ is $(A,\delta)$-robust. Then for all $v\in A$, we have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$.

#### Lower bound on noise magnitude.

Our first result is a lower bound on the expected squared $\ell_2$-magnitude of a sample $\eta\sim\mathcal{D}$ for any distribution $\mathcal{D}$ that is approximately invariant to $\ell_p$-translations of size $\epsilon$.

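###### Theorem (lower bound on noise magnitude).

Fix any $p\ge 2$ and let $\mathcal{D}$ be a distribution on $\mathbb{R}^d$ such that there exists a radius $\epsilon$ and total variation bound $\delta$ satisfying that for all $v\in\mathbb{R}^d$ with $\|v\|_p\le\epsilon$ we have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$. Then

$$\mathbb{E}_{\eta\sim\mathcal{D}}\|\eta\|_2^2\ge\frac{\epsilon^2 d^{2-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}.$$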

As a consequence of the theorem above and Lemma 1.1, it follows that any distribution $\mathcal{D}$ that ensures $(\epsilon,\delta)$-robustness with respect to $\|\cdot\|_p$ for every base classifier $f$ must also satisfy the same lower bound.
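To get a feeling for how the bound scales, the following sketch evaluates the right-hand side $\frac{\epsilon^2 d^{2-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}$ and the implied root-mean-square noise per coordinate. The function names and the example values (image sizes, $\epsilon=8$ on a $[0,255]$ pixel scale, $\delta=0.5$) are illustrative assumptions, not settings taken from the paper.

```python
def noise_lower_bound(eps, delta, d, p):
    """Right-hand side of the lower bound on E||eta||_2^2 (stated for p >= 2)."""
    if p < 2:
        raise ValueError("the bound is stated for p >= 2")
    exponent = 2.0 - 2.0 / p            # equals 2.0 when p = float("inf")
    return eps**2 * d**exponent / 800.0 * (1.0 - delta) / delta**2

def rms_noise_per_coordinate(eps, delta, d, p):
    """Square root of the bound averaged over the d coordinates."""
    return (noise_lower_bound(eps, delta, d, p) / d) ** 0.5

# Hypothetical settings: two common image sizes, eps = 8 intensity values, delta = 0.5.
for d in (3 * 32 * 32, 3 * 224 * 224):
    for p in (2, float("inf")):
        print(d, p, rms_noise_per_coordinate(8.0, 0.5, d, p))
```

For $p=2$ the per-coordinate value does not depend on $d$, whereas for $p=\infty$ it grows like $\sqrt{d}$.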

#### Phase transition at p=2.

The lower bound above implies a phase transition in the nature of distributions $\mathcal{D}$ that are able to ensure $(\epsilon,\delta)$-robustness with respect to $\|\cdot\|_p$, and this transition occurs at $p=2$. For $p\le 2$, the necessary expected squared $\ell_2$-magnitude of a sample from $\mathcal{D}$ grows only like $d$, which is consistent with adding a constant level of noise to every feature in the input example (e.g., as would happen when using a Gaussian distribution with standard deviation $\sigma$). On the other hand, for $p>2$, the expected squared $\ell_2$-magnitude of a sample from $\mathcal{D}$ grows strictly faster than $d$, which, intuitively, requires that the noise added to each component of the input example must scale with the input dimension $d$, rather than remaining constant as in the $p\le 2$ regime. More formally, we prove the following:

###### Theorem (hardness of random smoothing).

Fix any $p\ge 2$ and let $\mathcal{D}$ be a distribution on $\mathbb{R}^d$ such that for every (randomized) classifier $f:\mathbb{R}^d\to\mathcal{Y}$, the smoothed classifier $g$ is $(\epsilon,\delta)$-robust with respect to $\|\cdot\|_p$. Let $\eta$ be a sample from $\mathcal{D}$. Then at least 99% of the components of $\eta$ satisfy $\mathbb{E}\,\eta_i^2=\Omega\left(d^{1-2/p}\epsilon^2(1-\delta)/\delta^2\right)$. Moreover, if $\mathcal{D}$ is a product measure of i.i.d. noise (i.e., ), then the tail of satisfies for some , where is an absolute constant. In other words, is a heavy-tailed distribution.¹

The phase transition at $p=2$ is more clearly evident from the hardness theorem above. It shows that any distribution that provides $(\epsilon,\delta)$-robustness with respect to $\|\cdot\|_p$ for $p>2$ must have very high variance in most of its component distributions when the dimension $d$ is large: the variance of most components of the noise must grow at least as fast as $d^{1-2/p}$. In particular, for $p=\infty$ the variance grows linearly with the dimension. Similarly, if we use a product distribution to achieve $(\epsilon,\delta)$-robustness with respect to $\|\cdot\|_p$ with $p>2$, then each component of the noise distribution must be heavy-tailed and is likely to generate very large perturbations.

### 1.2 Technical overview

Total-variation bound of noise magnitude. Our results demonstrate a strong connection between the required noise magnitude $\mathbb{E}\|\eta\|_2^2$ in random smoothing and the total variation distance between $\mathcal{D}$ and its shifted distribution $\mathcal{D}+v$ in the worst-case direction $v$. The total variation distance has a natural interpretation in terms of the hardness of testing $\mathcal{D}$ versus $\mathcal{D}+v$: no classifier can distinguish $\mathcal{D}$ from $\mathcal{D}+v$ with an advantage larger than $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)$. Our analysis applies the following techniques.

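Warm-up: one-dimensional case. We begin our analysis of the lower bound on noise magnitude with the one-dimensional case, which arises by studying the projection of the noise $\eta$ onto a direction $v$. For a distribution $\mathcal{D}$ on $\mathbb{R}$ with $\delta=\mathrm{TV}(\mathcal{D},\mathcal{D}+\epsilon)$, a simple use of Chebyshev's inequality implies $\mathbb{E}|\eta|^2\ge\epsilon^2(1-\delta)/8$. To see this, let $\eta$ be a sample from $\mathcal{D}$ and let $\eta'=\eta+\epsilon$ so that $\eta'$ is a sample from $\mathcal{D}+\epsilon$. Set $r=\epsilon/2$ and define $A=(-r,r)$ and $B=(\epsilon-r,\epsilon+r)$, so that the intervals $A$ and $B$ are disjoint. From Chebyshev's inequality, we have $\Pr(\eta\in A)\ge 1-\mathbb{E}|\eta|^2/r^2$. Similarly, $\Pr(\eta'\in B)\ge 1-\mathbb{E}|\eta|^2/r^2$ and, since $A$ and $B$ are disjoint, this implies $\Pr(\eta'\in A)\le\mathbb{E}|\eta|^2/r^2$. Therefore, $\delta\ge\Pr(\eta\in A)-\Pr(\eta'\in A)\ge 1-2\,\mathbb{E}|\eta|^2/r^2$. The claim follows from rearranging this inequality and the fact that $r=\epsilon/2$.

The remainder of the one-dimensional case is to show $\mathbb{E}|\eta|\ge\frac{\epsilon}{8}\cdot\frac{(1-\delta)^2}{\delta}$. To this end, we exploit a useful property of total variation distance in $\mathbb{R}$: every interval $I$ of width $\epsilon$ satisfies $\mathcal{D}(I)\le\delta$. We note that for any $a>0$, rearranging Markov's inequality gives $\mathbb{E}|\eta|\ge a\Pr(|\eta|>a)$. We can cover the set $[-a,a]$ using $\lceil 2a/\epsilon\rceil$ intervals of width $\epsilon$ and, by this property, each of those intervals has probability mass at most $\delta$. It follows that $\Pr(|\eta|\le a)\le\lceil 2a/\epsilon\rceil\delta$, implying $\mathbb{E}|\eta|\ge a\left(1-\lceil 2a/\epsilon\rceil\delta\right)$. Finally, we optimize $a$ to obtain the bound $\mathbb{E}|\eta|\ge\frac{\epsilon}{8}\cdot\frac{(1-\delta)^2}{\delta}$, as desired.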

Extension to the $d$-dimensional case. The bridge connecting the one-dimensional case to the $d$-dimensional case is the Pythagorean theorem: if there exists a set of orthogonal directions $v_i$ such that $\|v_i\|_p\le\epsilon$ and $\|v_i\|_2$ is on the order of $\epsilon d^{1/2-1/p}$ (the furthest $\ell_2$ distance from the center within the $\ell_p$ ball of radius $\epsilon$), the Pythagorean theorem implies the result for the $d$-dimensional case straightforwardly. Such a set of orthogonal directions is easy to find in the $\ell_2$ case, because the $\ell_2$ ball is isotropic and any orthogonal basis of $\mathbb{R}^d$ satisfies the conditions. However, the problem is challenging in the $\ell_p$ case, since the $\ell_p$ ball is not isotropic in general. In Corollary 3.6, we show that there exist at least $d/2$ vectors $v_i$ which satisfy the requirements. Using the Pythagorean theorem in the subspace spanned by these $v_i$ gives the lower bound on noise magnitude.

Peeling argument and tail probability. We now summarize our main techniques for proving the hardness theorem. By the inequality $\|\eta\|_\infty\ge\|\eta\|_2/\sqrt{d}$, the lower bound on $\mathbb{E}\|\eta\|_2^2$ implies $\mathbb{E}\|\eta\|_\infty^2\ge\frac{\epsilon^2 d^{1-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}$, which shows that at least one component of $\eta$ is large. However, this guarantee only highlights the largest pixel of $\eta$. Rather than working with the $\ell_\infty$-norm of $\eta$, we apply a similar argument to show that the variance of at least one component of $\eta$ must be large. Next, we consider the $(d-1)$-dimensional distribution obtained by removing the highest-variance feature. Applying an identical argument, the highest-variance remaining feature must also have large variance. Each time we repeat this procedure, the strength of the variance lower bound decreases since the dimensionality of the distribution is decreasing. However, we can apply this peeling strategy for any constant fraction of the components of $\eta$ to obtain lower bounds. The tail-probability guarantee in the hardness theorem follows from a standard moment analysis (Vershynin, 2018).

Summary of our techniques. Our proofs, in particular the use of the Pythagorean theorem, show that defending against adversarial attacks in the $\ell_p$ ball of radius $\epsilon$ by random smoothing is almost as hard as defending against attacks in the $\ell_2$ ball of radius $\epsilon d^{1/2-1/p}$. Therefore, the $\ell_\infty$ certification procedure of Salman et al. (2019), which first uses Gaussian smoothing to certify $\ell_2$ robustness and then divides the certified $\ell_2$ radius by $\sqrt{d}$, is almost an optimal random smoothing approach for certifying $\ell_\infty$ robustness. The principle might hold more generally for threat models beyond $\ell_p$ robustness, and sheds light on the design of new random smoothing methods and hardness proofs in other threat models.

## 2 Related Works

$\ell_2$ robustness. Probably one of the most well-understood results for random smoothing is $\ell_2$ robustness. With Gaussian random noise, Lecuyer et al. (2019) and Li et al. (2019) provided the first guarantees for random smoothing, which were later improved by Cohen et al. (2019) with the following theorem.

###### Theorem 2.1 (Theorem 1 of Cohen et al. (2019)).

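Let $f:\mathbb{R}^d\to\mathcal{Y}$ be any deterministic or random classifier, and let $\eta\sim\mathcal{N}(0,\sigma^2 I)$. Let $g(x)=\arg\max_{c\in\mathcal{Y}}\Pr(f(x+\eta)=c)$. Suppose $c_A\in\mathcal{Y}$ and $\underline{p_A},\overline{p_B}\in[0,1]$ satisfy $\Pr(f(x+\eta)=c_A)\ge\underline{p_A}\ge\overline{p_B}\ge\max_{c\neq c_A}\Pr(f(x+\eta)=c)$. Then $g(x+v)=c_A$ for all $\|v\|_2<\epsilon$, where $\epsilon=\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A})-\Phi^{-1}(\overline{p_B})\right)$, and $\Phi$ is the cumulative distribution function of the standard Gaussian distribution.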

Note that Theorem 2.1 holds for an arbitrary classifier. Thus a hardness result for random smoothing, that is, a result in the opposite direction of Theorem 2.1, requires finding a hard instance of a classifier for which a conclusion similar to that of Theorem 2.1 does not hold, i.e., the resulting smoothed classifier is trivial because the noise variance is too large. Our two main theorems in Section 1.1 are of this flavor. Beyond the top-$1$ predictions in Theorem 2.1, Jia et al. (2020) studied certified robustness for top-$k$ predictions via random smoothing under Gaussian noise and derived a tight robustness bound in the $\ell_2$ norm. In this paper, however, we study the standard setting of top-$1$ predictions.

$\ell_p$ robustness. Beyond $\ell_2$ robustness, random smoothing also achieves state-of-the-art certified $\ell_p$ robustness for $p<2$. Lee et al. (2019) provided adversarial robustness guarantees and associated random-smoothing algorithms for the discrete case where the adversary is $\ell_0$ bounded. Li et al. (2019) suggested replacing Gaussian with Laplacian noise for $\ell_1$ robustness. Dvijotham et al. (2020) introduced a general framework for proving robustness properties of smoothed classifiers in the black-box setting using $f$-divergences. However, much remains unknown concerning the effectiveness of random smoothing for $\ell_p$ robustness with $p>2$. Salman et al. (2019) proposed an algorithm for certifying $\ell_\infty$ robustness, by first certifying $\ell_2$ robustness via the algorithm of Cohen et al. (2019) and then dividing the certified $\ell_2$ radius by $\sqrt{d}$. However, the certified $\ell_\infty$ radius obtained by this procedure is as small as $O(1/\sqrt{d})$, in contrast to the constant certified radius discussed in this paper.
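The two-step procedure just described can be sketched in a few lines: compute the $\ell_2$ radius $\frac{\sigma}{2}\left(\Phi^{-1}(\underline{p_A})-\Phi^{-1}(\overline{p_B})\right)$ of Theorem 2.1 and divide it by $\sqrt{d}$. This is a minimal sketch rather than the implementation of Cohen et al. (2019) or Salman et al. (2019); the function names are hypothetical, and the probability bounds are assumed to be given and to lie strictly between 0 and 1.

```python
from math import sqrt
from statistics import NormalDist

def certified_l2_radius(p_a_lower, p_b_upper, sigma):
    """l2 radius sigma/2 * (Phi^{-1}(p_A) - Phi^{-1}(p_B)) for Gaussian smoothing."""
    inv_cdf = NormalDist().inv_cdf      # inverse CDF of the standard Gaussian
    return 0.5 * sigma * (inv_cdf(p_a_lower) - inv_cdf(p_b_upper))

def certified_linf_radius(p_a_lower, p_b_upper, sigma, d):
    """l_inf radius obtained by dividing the certified l2 radius by sqrt(d)."""
    return certified_l2_radius(p_a_lower, p_b_upper, sigma) / sqrt(d)

# Hypothetical example: the same score bounds certify a much smaller l_inf radius as d grows.
r_small = certified_linf_radius(0.9, 0.1, sigma=0.5, d=3 * 32 * 32)
r_large = certified_linf_radius(0.9, 0.1, sigma=0.5, d=3 * 224 * 224)
```

For fixed $\sigma$ and fixed score bounds, the $\ell_\infty$ radius shrinks like $1/\sqrt{d}$, which is the behavior noted above.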

Training algorithms. While random smoothing certifies inference-time robustness for any given base classifier $f$, the certified robust radius might vary considerably across training methods. This motivates researchers to design new training algorithms for $f$ that are specifically adapted to random smoothing. Zhai et al. (2020) trained a robust smoothed classifier by maximizing the certified radius. In contrast to the naturally trained classifier used by Cohen et al. (2019), Salman et al. (2019) combined the adversarial training of Madry et al. (2017) with random smoothing in the training procedure of $f$. In our experiment, we introduce a new baseline which combines TRADES (Zhang et al., 2019) with random smoothing to train a robust smoothed classifier.

## 3 Analysis of Main Results

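In this section we prove the two theorems stated in Section 1.1: the lower bound on noise magnitude and the hardness of random smoothing.

### 3.1 Analysis of the lower bound on noise magnitude

In this section we prove the lower bound on noise magnitude. Our proof has two main steps: first, we study the one-dimensional version of the problem and prove two complementary lower bounds on the magnitude of a sample $\eta$ drawn from a distribution $\mathcal{D}$ over $\mathbb{R}$ with the property that for all $v\in\mathbb{R}$ with $|v|\le\epsilon$ we have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$. Next, we show how to apply this argument to orthogonal 1-dimensional subspaces in $\mathbb{R}^d$ to lower bound the expected magnitude of a sample $\eta$ drawn from a distribution $\mathcal{D}$ over $\mathbb{R}^d$, with the property that for all $v\in\mathbb{R}^d$ with $\|v\|_p\le\epsilon$, we have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$.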

#### One-dimensional results.

Our first result lower bounds the magnitude of a sample from any distribution $\mathcal{D}$ in terms of the total variation distance between $\mathcal{D}$ and $\mathcal{D}+\epsilon$ for any $\epsilon\ge 0$.

###### Lemma 3.1.

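Let $\mathcal{D}$ be any distribution on $\mathbb{R}$, $\eta$ be a sample from $\mathcal{D}$, $\epsilon\ge 0$, and let $\delta=\mathrm{TV}(\mathcal{D},\mathcal{D}+\epsilon)$. Then we have²

$$\mathbb{E}|\eta|^2\ge\frac{\epsilon^2}{200}\cdot\frac{1-\delta}{\delta^2}.$$

We prove Lemma 3.1 using two complementary lower bounds. The first lower bound is tighter for large $\delta$, while the second lower bound is tighter when $\delta$ is close to zero. Taking the maximum of the two bounds proves Lemma 3.1.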

###### Lemma 3.2.

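Let $\mathcal{D}$ be any distribution on $\mathbb{R}$, $\eta$ be a sample from $\mathcal{D}$, $\epsilon\ge 0$, and let $\delta=\mathrm{TV}(\mathcal{D},\mathcal{D}+\epsilon)$. Then we have

$$\mathbb{E}|\eta|^2\ge\frac{\epsilon^2}{8}\cdot(1-\delta).$$

###### Proof.

Let $\eta'=\eta+\epsilon$ so that $\eta'$ is a sample from $\mathcal{D}+\epsilon$, and define $r=\epsilon/2$ so that the sets $A=(-r,r)$ and $B=(\epsilon-r,\epsilon+r)$ are disjoint. From Chebyshev's inequality, we have that $\Pr(\eta\in A)\ge 1-\mathbb{E}|\eta|^2/r^2$. Further, since $\eta'\in B$ if and only if $\eta\in A$, we have $\Pr(\eta'\in B)\ge 1-\mathbb{E}|\eta|^2/r^2$. Next, since $A$ and $B$ are disjoint, it follows that $\Pr(\eta'\in A)\le 1-\Pr(\eta'\in B)\le\mathbb{E}|\eta|^2/r^2$. Finally, we have $\delta\ge\Pr(\eta\in A)-\Pr(\eta'\in A)\ge 1-2\,\mathbb{E}|\eta|^2/r^2=1-8\,\mathbb{E}|\eta|^2/\epsilon^2$. Rearranging this inequality proves the claim. ∎

Next, we prove a tighter bound when $\delta$ is close to zero. The key insight is that no interval of width $\epsilon$ can have probability mass larger than $\delta$. This implies that the mass of $\mathcal{D}$ cannot concentrate too close to the origin, leading to lower bounds on the expected magnitude of a sample from $\mathcal{D}$.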

###### Lemma 3.3.

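Let $\mathcal{D}$ be any distribution on $\mathbb{R}$, $\eta$ be a sample from $\mathcal{D}$, $\epsilon\ge 0$, and let $\delta=\mathrm{TV}(\mathcal{D},\mathcal{D}+\epsilon)$. Then we have

$$\mathbb{E}|\eta|\ge\frac{\epsilon}{8}\cdot\frac{(1-\delta)^2}{\delta},$$

which implies $\mathbb{E}|\eta|^2\ge\frac{\epsilon^2}{64}\cdot\frac{(1-\delta)^4}{\delta^2}$.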

###### Proof.

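The key step in the proof is to show that every interval of length $\epsilon$ has probability mass at most $\delta$ under the distribution $\mathcal{D}$. Once we have established this fact, the proof is as follows: for any $a>0$, rearranging Markov's inequality gives $\mathbb{E}|\eta|\ge a\Pr(|\eta|>a)$. We can cover the set $[-a,a]$ using $\lceil 2a/\epsilon\rceil$ intervals of width $\epsilon$, and each of those intervals has probability mass at most $\delta$. It follows that $\Pr(|\eta|\le a)\le\lceil 2a/\epsilon\rceil\delta$, implying $\mathbb{E}|\eta|\ge a\left(1-\lceil 2a/\epsilon\rceil\delta\right)$. Since $\lceil 2a/\epsilon\rceil\le 2a/\epsilon+1$, we have $\mathbb{E}|\eta|\ge a\left(1-\delta-2a\delta/\epsilon\right)$. Finally, we optimize $a$ to get the strongest bound. The strongest bound is obtained at $a=\frac{\epsilon(1-\delta)}{4\delta}$, which gives $\mathbb{E}|\eta|\ge\frac{\epsilon}{8}\cdot\frac{(1-\delta)^2}{\delta}$.

It remains to prove the claim that all intervals of length $\epsilon$ have probability mass at most $\delta$. Let $I$ be any such interval. The proof has two steps: first, we partition $\mathbb{R}$ using a collection of translated copies of the interval $I$, and show that the difference in probability mass between any pair of intervals in the partition is at most $\delta$. Then, given that there must be intervals with probability mass arbitrarily close to zero, this implies that the probability mass of any interval (and in particular, the probability mass of $I$) is upper bounded by $\delta$.

For each integer $i\in\mathbb{Z}$, let $I_i=I+i\epsilon$ be a copy of the interval $I$ translated by $i\epsilon$. By construction the set of intervals $I_i$ for $i\in\mathbb{Z}$ forms a partition of $\mathbb{R}$. For any indices $i<j$, we can express the difference in probability mass between $I_j$ and $I_i$ as a telescoping sum: $\mathcal{D}(I_j)-\mathcal{D}(I_i)=\sum_{k=i}^{j-1}\left[\mathcal{D}(I_{k+1})-\mathcal{D}(I_k)\right]$. We will show that for any $i<j$, the telescoping sum is contained in $[-\delta,\delta]$. Let $P=\{k\in\{i,\dots,j-1\}:\mathcal{D}(I_{k+1})-\mathcal{D}(I_k)>0\}$ be the indices of the positive terms in the sum. Then, since the telescoping sum is upper bounded by the sum of its positive terms and the intervals are disjoint, we have

$$\mathcal{D}(I_j)-\mathcal{D}(I_i)\le\sum_{k\in P}\left[\mathcal{D}(I_{k+1})-\mathcal{D}(I_k)\right]=\mathcal{D}\Big(\bigcup_{k\in P}I_{k+1}\Big)-\mathcal{D}\Big(\bigcup_{k\in P}I_k\Big).$$

For all $k$ we have $\eta\in I_k$ if and only if $\eta+\epsilon\in I_{k+1}$, which implies $\mathcal{D}\big(\bigcup_{k\in P}I_k\big)=(\mathcal{D}+\epsilon)\big(\bigcup_{k\in P}I_{k+1}\big)$. Combined with the definition of the total variation distance, it follows that $\mathcal{D}\big(\bigcup_{k\in P}I_{k+1}\big)-(\mathcal{D}+\epsilon)\big(\bigcup_{k\in P}I_{k+1}\big)\le\delta$ and therefore $\mathcal{D}(I_j)-\mathcal{D}(I_i)\le\delta$. A similar argument applied to the negative terms of the telescoping sum guarantees that $\mathcal{D}(I_j)-\mathcal{D}(I_i)\ge-\delta$, proving that $|\mathcal{D}(I_j)-\mathcal{D}(I_i)|\le\delta$.

Finally, for any $\alpha>0$, there must exist an interval $I_i$ such that $\mathcal{D}(I_i)<\alpha$ (since otherwise the total probability mass of all the intervals would be infinite). Since no pair of intervals in the partition can have probability masses differing by more than $\delta$, this implies that $\mathcal{D}(I_j)\le\delta+\alpha$ for any $j$. Taking the limit as $\alpha\to 0$ shows that $\mathcal{D}(I)\le\delta$, completing the proof. ∎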
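The optimization step in the proof above can be spelled out. Starting from the bound $\mathbb{E}|\eta|\ge a\left(1-\delta-2a\delta/\epsilon\right)$, valid for every $a>0$, the following short derivation maximizes the right-hand side; the names $\varphi$ and $a^\star$ are notation introduced only for this illustration.

```latex
\begin{align*}
\varphi(a) &= a\left(1-\delta-\frac{2a\delta}{\epsilon}\right),
\qquad
\varphi'(a) = 1-\delta-\frac{4a\delta}{\epsilon} = 0
\;\Longrightarrow\;
a^\star = \frac{\epsilon(1-\delta)}{4\delta},\\
\varphi(a^\star) &= \frac{\epsilon(1-\delta)}{4\delta}\left(1-\delta-\frac{1-\delta}{2}\right)
= \frac{\epsilon(1-\delta)}{4\delta}\cdot\frac{1-\delta}{2}
= \frac{\epsilon}{8}\cdot\frac{(1-\delta)^2}{\delta}.
\end{align*}
```

Since $\varphi$ is a concave quadratic in $a$, the stationary point $a^\star$ is its maximizer.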

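Finally, Lemma 3.1 follows from Lemmas 3.2 and 3.3, and the fact that for any $\delta\in(0,1]$, we have $\max\left\{\frac{1-\delta}{8},\frac{(1-\delta)^4}{64\delta^2}\right\}\ge\frac{1-\delta}{200\delta^2}$.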
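The combination of the two one-dimensional bounds can be checked numerically. The sketch below compares, on a grid of $\delta$ values and after dividing everything by $\epsilon^2$, the larger of the Lemma 3.2 bound $(1-\delta)/8$ and the squared Lemma 3.3 bound $(1-\delta)^4/(64\delta^2)$ with the Lemma 3.1 bound $(1-\delta)/(200\delta^2)$. The grid and variable names are arbitrary choices for this illustration, and a grid check is a sanity check rather than a proof.

```python
import numpy as np

deltas = np.linspace(1e-3, 1.0, 10_000)

chebyshev_bound = (1 - deltas) / 8                      # Lemma 3.2, divided by eps^2
interval_bound = (1 - deltas) ** 4 / (64 * deltas**2)   # Lemma 3.3 (squared), divided by eps^2
combined_bound = (1 - deltas) / (200 * deltas**2)       # Lemma 3.1, divided by eps^2

best_of_two = np.maximum(chebyshev_bound, interval_bound)
holds_on_grid = bool(np.all(best_of_two >= combined_bound))

# Which of the two bounds is larger at each delta (True where the Chebyshev bound wins).
chebyshev_wins = chebyshev_bound >= interval_bound
```

On this grid the interval-based bound is the larger one for small $\delta$ and the Chebyshev-based bound takes over for larger $\delta$, matching the discussion after Lemma 3.1.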

#### Extension to the d-dimensional case.

For the remainder of this section we turn to the analysis of distributions $\mathcal{D}$ defined over $\mathbb{R}^d$. First, we use Lemma 3.1 to lower bound the magnitude of noise drawn from $\mathcal{D}$ when projected onto any one-dimensional subspace.

###### Corollary 3.4.

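Let $\mathcal{D}$ be any distribution on $\mathbb{R}^d$, $\eta$ be a sample from $\mathcal{D}$, $v\in\mathbb{R}^d$, and let $\delta=\mathrm{TV}(\mathcal{D},\mathcal{D}+v)$. Then we have

$$\mathbb{E}_{\eta\sim\mathcal{D}}\frac{|v^\top\eta|^2}{\|v\|_2^2}\ge\frac{\|v\|_2^2}{200}\cdot\frac{1-\delta}{\delta^2}.$$

###### Proof.

Let $\eta$ be a sample from $\mathcal{D}$, $\eta'=\eta+v$ be a sample from $\mathcal{D}+v$, and define $\zeta=v^\top\eta/\|v\|_2$ and $\zeta'=v^\top\eta'/\|v\|_2$. Then the total variation distance between $\zeta$ and $\zeta'$ is bounded by $\delta$, and $\zeta'$ corresponds to a translation of $\zeta$ by a distance $\|v\|_2$. Therefore, applying Lemma 3.1 with $\epsilon=\|v\|_2$, and using that its bound is decreasing in the total variation distance, we have that $\mathbb{E}|\zeta|^2\ge\frac{\|v\|_2^2}{200}\cdot\frac{1-\delta}{\delta^2}$. Substituting the definition of $\zeta$ completes the proof. ∎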
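As a sanity check of Corollary 3.4, consider the isotropic Gaussian $\mathcal{N}(0,\sigma^2 I)$, for which both sides are available in closed form: the projected second moment $\mathbb{E}|v^\top\eta|^2/\|v\|_2^2$ equals $\sigma^2$, and the total variation distance to the translate by $v$ equals $\mathrm{erf}\left(\|v\|_2/(2\sqrt{2}\sigma)\right)$. The sketch below evaluates the slack in the inequality; the Gaussian is an example chosen here, not a case analyzed in the text, and the function names are hypothetical.

```python
from math import erf, sqrt

def gaussian_tv(shift_norm, sigma):
    """TV distance between N(0, sigma^2 I) and its translate by a vector of the given l2 norm."""
    return erf(shift_norm / (2.0 * sqrt(2.0) * sigma))

def corollary_slack(shift_norm, sigma):
    """Left-hand side minus right-hand side of Corollary 3.4 for the Gaussian case."""
    delta = gaussian_tv(shift_norm, sigma)
    lower_bound = shift_norm**2 / 200.0 * (1.0 - delta) / delta**2
    return sigma**2 - lower_bound       # nonnegative when the inequality holds

slacks = [corollary_slack(s, sigma=1.0) for s in (0.01, 0.1, 1.0, 5.0, 20.0)]
```

The slack stays nonnegative across these shift sizes, as the corollary requires.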

Intuitively, Corollary 3.4 shows that for any vector $v$ such that $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)$ is small, the expected magnitude of a sample $\eta\sim\mathcal{D}$ when projected onto $v$ cannot be much smaller than the length of $v$. The key idea for proving the lower bound on noise magnitude is to construct a large number of orthogonal vectors with small $\ell_p$ norms but large $\ell_2$ norms. Then $\mathcal{D}$ will have to be "spread out" in all of these directions, resulting in a large expected $\ell_2$ norm. We begin by showing that whenever $d$ is a power of two, we can find an orthogonal basis for $\mathbb{R}^d$ in $\{\pm 1\}^d$.

###### Lemma 3.5.

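For any $n\ge 0$ there exist $2^n$ orthogonal vectors $v_1,\dots,v_{2^n}\in\{\pm 1\}^{2^n}$.

###### Proof.

The proof is by induction on $n$. For $n=0$, we have $2^n=1$ and the vector $v_1=(1)$ satisfies the requirements. Now suppose the claim holds for $n$ and let $v_1,\dots,v_m$ be orthogonal in $\{\pm 1\}^m$ for $m=2^n$. For each $i\in[m]$, define $a_i=(v_i,v_i)$ and $b_i=(v_i,-v_i)$, both in $\{\pm 1\}^{2m}$. We will show that these $2m=2^{n+1}$ vectors are orthogonal. For any indices $i$ and $j$, we can compute the inner products between pairs of vectors among $a_i$, $a_j$, $b_i$, and $b_j$: $a_i^\top a_j=2v_i^\top v_j$, $b_i^\top b_j=2v_i^\top v_j$, and $a_i^\top b_j=v_i^\top v_j-v_i^\top v_j=0$. Therefore, for any $i\neq j$, since $v_i^\top v_j=0$, we are guaranteed that $a_i^\top a_j=0$, $b_i^\top b_j=0$, and $a_i^\top b_j=0$; moreover $a_i^\top b_i=0$ for every $i$. It follows that the vectors are orthogonal. ∎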
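The inductive doubling step in this proof translates directly into code. The sketch below builds the sign vectors by repeatedly mapping each vector $v$ to $(v,v)$ and $(v,-v)$, and then forms the Gram matrix to check orthogonality; the function name is a hypothetical choice for this illustration.

```python
import numpy as np

def sign_basis(n):
    """Rows are 2**n mutually orthogonal vectors with entries in {-1, +1}."""
    basis = np.array([[1]])
    for _ in range(n):
        # Each row v yields the two rows (v, v) and (v, -v) of the next level.
        basis = np.vstack([np.hstack([basis, basis]),
                           np.hstack([basis, -basis])])
    return basis

H = sign_basis(3)
gram = H @ H.T   # diagonal matrix exactly when the rows are orthogonal
```

Every row has squared $\ell_2$ norm equal to its length, so the Gram matrix is the dimension times the identity.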

From this, it follows that for any dimension $d$, we can always find a collection of at least $d/2$ vectors that are short in the $\ell_p$ norm, but long in the $\ell_2$ norm. Intuitively, these vectors are the vertices of a hypercube in a $2^n$-dimensional subspace. Figure 1 depicts the construction.

###### Corollary 3.6.

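For any $p\ge 2$ and dimension $d$, there exist $b\ge d/2$ orthogonal vectors $v_1,\dots,v_b\in\mathbb{R}^d$ such that $\|v_i\|_2\ge(d/2)^{1/2-1/p}$ and $\|v_i\|_p=1$ for all $i\in[b]$. This holds even when $p=\infty$.

###### Proof.

Let $n$ be the largest integer such that $2^n\le d$. We must have $2^n>d/2$, since otherwise $2^{n+1}\le d$. We now apply Lemma 3.5 to find $b=2^n$ orthogonal vectors $u_1,\dots,u_b\in\{\pm 1\}^b$. For each $i\in[b]$, we have that $\|u_i\|_2=b^{1/2}$ and $\|u_i\|_p=b^{1/p}$. Finally, for $i\in[b]$, define $v_i\in\mathbb{R}^d$ to be a normalized copy of $u_i$ padded with zeros, that is, $v_i=(u_i/b^{1/p},0,\dots,0)$. For all $i\in[b]$, we have $\|v_i\|_p=1$ and $\|v_i\|_2=b^{1/2-1/p}\ge(d/2)^{1/2-1/p}$. ∎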
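The construction in this proof can be sketched as follows: take the largest power of two $b\le d$, build $b$ orthogonal sign vectors by the doubling step of Lemma 3.5, divide by $b^{1/p}$, and pad with zeros. The function name and the example values $d=10$, $p=4$ are assumptions made only for this illustration.

```python
import numpy as np

def lp_unit_directions(d, p):
    """Orthogonal vectors in R^d with unit l_p norm and l_2 norm b**(1/2 - 1/p)."""
    n = d.bit_length() - 1              # largest n with 2**n <= d (d is a positive int)
    b = 1 << n
    signs = np.array([[1.0]])
    for _ in range(n):
        signs = np.block([[signs, signs], [signs, -signs]])
    vectors = np.zeros((b, d))
    vectors[:, :b] = signs / b ** (1.0 / p)   # p = float("inf") gives division by 1
    return vectors

vectors = lp_unit_directions(10, 4)
lp_norms = np.linalg.norm(vectors, ord=4, axis=1)
l2_norms = np.linalg.norm(vectors, axis=1)
```

With $d=10$ the construction returns $b=8$ vectors, each with $\ell_4$ norm one and $\ell_2$ norm $8^{1/4}$, which exceeds $(d/2)^{1/4}$.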

With this, we are ready to prove the lower bound on noise magnitude stated in Section 1.1.

###### Proof.

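Let $\eta$ be a sample from $\mathcal{D}$. By scaling the vectors from Corollary 3.6 by $\epsilon$, we obtain $b\ge d/2$ orthogonal vectors $v_1,\dots,v_b\in\mathbb{R}^d$ with $\|v_i\|_p=\epsilon$ and $\|v_i\|_2\ge\epsilon(d/2)^{1/2-1/p}$. By assumption we must have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v_i)\le\delta$, since $\|v_i\|_p\le\epsilon$, and Corollary 3.4 implies that $\mathbb{E}\frac{|v_i^\top\eta|^2}{\|v_i\|_2^2}\ge\frac{\|v_i\|_2^2}{200}\cdot\frac{1-\delta}{\delta^2}$ for each $i\in[b]$ (the bound of Corollary 3.4 is decreasing in the total variation distance, so it also holds with the upper bound $\delta$). We use this fact to bound $\mathbb{E}\|\eta\|_2^2$.

Let $Q\in\mathbb{R}^{b\times d}$ be the matrix whose $i$-th row is given by $v_i/\|v_i\|_2$, so that $Q^\top Q$ is the orthogonal projection matrix onto the subspace spanned by the vectors $v_1,\dots,v_b$. Then we have

$$\mathbb{E}\|\eta\|_2^2\ge\mathbb{E}\|Q^\top Q\eta\|_2^2=\sum_{i=1}^{b}\mathbb{E}\frac{|v_i^\top\eta|^2}{\|v_i\|_2^2}\ge\sum_{i=1}^{b}\frac{\|v_i\|_2^2}{200}\cdot\frac{1-\delta}{\delta^2},$$

where the first inequality follows because orthogonal projections are non-expansive, the equality follows from the Pythagorean theorem, and the last inequality follows from Corollary 3.4. Using the facts that $b\ge d/2$ and $\|v_i\|_2^2\ge\epsilon^2(d/2)^{1-2/p}$, we have that $\mathbb{E}\|\eta\|_2^2\ge\frac{\epsilon^2(d/2)^{2-2/p}}{200}\cdot\frac{1-\delta}{\delta^2}$. Finally, since $2-2/p\le 2$ and hence $2^{2-2/p}\le 4$ for $p\ge 2$, we have $\mathbb{E}\|\eta\|_2^2\ge\frac{\epsilon^2 d^{2-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}$, as required. ∎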

### 3.2 Analysis of the hardness of random smoothing

In this section we prove the variance and heavy-tailed properties from the hardness theorem of Section 1.1 separately.

Combining the lower bound on noise magnitude with a peeling argument, we are able to lower bound the marginal variance in most of the coordinates of $\eta$.

###### Lemma 3.7.

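Fix any $p\ge 2$ and let $\mathcal{D}$ be a distribution on $\mathbb{R}^d$ such that there exists a radius $\epsilon$ and total variation bound $\delta$ so that for all $v\in\mathbb{R}^d$ with $\|v\|_p\le\epsilon$ we have $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$. Let $\eta$ be a sample from $\mathcal{D}$ and $\sigma$ be the permutation of $[d]$ such that $\mathbb{E}[\eta_{\sigma(1)}^2]\ge\dots\ge\mathbb{E}[\eta_{\sigma(d)}^2]$. Then for any $i\in[d]$, we have

$$\mathbb{E}[\eta_{\sigma(i)}^2]\ge\frac{\epsilon^2(d-i+1)^{1-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}.$$

###### Proof.

For each index $i\in[d]$, let $P_i:\mathbb{R}^d\to\mathbb{R}^{d-i+1}$ be the projection $P_i(x)=(x_{\sigma(i)},\dots,x_{\sigma(d)})$ and $\mathcal{D}_i$ be the distribution of $P_i(\eta)$. First we argue that for each $i$ and any $v\in\mathbb{R}^{d-i+1}$ with $\|v\|_p\le\epsilon$, we must have $\mathrm{TV}(\mathcal{D}_i,\mathcal{D}_i+v)\le\delta$. To see this, let $v'\in\mathbb{R}^d$ be the vector such that $P_i(v')=v$ and $v'_{\sigma(j)}=0$ for all $j<i$. Then $\|v'\|_p=\|v\|_p\le\epsilon$. Next, since $\mathcal{D}_i$ and $\mathcal{D}_i+v$ are the images of $\mathcal{D}$ and $\mathcal{D}+v'$ under $P_i$, we must have $\mathrm{TV}(\mathcal{D}_i,\mathcal{D}_i+v)\le\mathrm{TV}(\mathcal{D},\mathcal{D}+v')\le\delta$.

Now fix an index $i\in[d]$ and let $\eta'$ be a sample from $\mathcal{D}_i$. Applying the lower bound on noise magnitude to $\mathcal{D}_i$, we have that

$$\mathbb{E}\|\eta'\|_2^2\ge\frac{\epsilon^2(d-i+1)^{2-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}.$$

Since there must exist at least one index $l$ such that $\mathbb{E}[\eta_l'^2]\ge\frac{1}{d-i+1}\mathbb{E}\|\eta'\|_2^2$, it follows that at least one coordinate $l$ must satisfy

$$\mathbb{E}[\eta_l'^2]\ge\frac{\epsilon^2(d-i+1)^{1-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}.$$

Finally, since the coordinates of $\eta'$ are the $d-i+1$ coordinates of $\eta$ with the smallest variance, it follows that

$$\mathbb{E}[\eta_{\sigma(i)}^2]\ge\frac{\epsilon^2(d-i+1)^{1-2/p}}{800}\cdot\frac{1-\delta}{\delta^2},$$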

as required. ∎
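To see how slowly the peeling bound of Lemma 3.7 degrades, the following sketch evaluates $\frac{\epsilon^2(d-i+1)^{1-2/p}}{800}\cdot\frac{1-\delta}{\delta^2}$ for a given rank $i$. The function name and the example values are hypothetical and serve only as an illustration.

```python
def peeled_variance_bound(eps, delta, d, p, i):
    """Lemma 3.7 lower bound on the i-th largest marginal second moment (i is 1-indexed)."""
    if not 1 <= i <= d:
        raise ValueError("i must lie in {1, ..., d}")
    remaining = d - i + 1               # dimension left after peeling off i - 1 coordinates
    return eps**2 * remaining ** (1.0 - 2.0 / p) / 800.0 * (1.0 - delta) / delta**2

d = 10_000
top = peeled_variance_bound(1.0, 0.5, d, float("inf"), 1)
at_99_percent = peeled_variance_bound(1.0, 0.5, d, float("inf"), int(0.99 * d))
```

Even after peeling off 99% of the coordinates, about $0.01d$ coordinates remain, so for $p=\infty$ the bound is still linear in $d$, only with a smaller constant.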

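Lemma 3.7 implies that any distribution $\mathcal{D}$ over $\mathbb{R}^d$ such that $\mathrm{TV}(\mathcal{D},\mathcal{D}+v)\le\delta$ for all $v\in\mathbb{R}^d$ with $\|v\|_p\le\epsilon$ and $p>2$ must have high marginal variance in most of its coordinates. In particular, for any constant $c\in(0,1)$, the top $c$-fraction of coordinates must have marginal variance at least $\Omega\left(\epsilon^2 d^{1-2/p}(1-\delta)/\delta^2\right)$. For $p>2$, this bound grows with the dimension $d$. Our next lemma shows that when $\mathcal{D}$ is a product measure of i.i.d. one-dimensional distributions in the standard coordinates, the distribution must be heavy-tailed. The lemma is built upon a lower bound on $\mathbb{E}\|\eta\|_2$, obtained with an analysis similar to that of the lower bound on noise magnitude. We defer the proof of this fact to the appendix (see Lemma C.1). Note that the fact implies a corresponding lower bound on $\mathbb{E}\|\eta\|_\infty$ by the equivalence between the $\ell_2$ and $\ell_\infty$ norms. We then have the following lemma.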

###### Lemma 3.8.

Let and . Let be random variables in sampled i.i.d. from distribution . Then “” implies “ for some with an absolute constant ”, that is, in sufficiently high dimensions, is a heavy-tailed distribution.

###### Proof.

Denote by the complementary Cumulative Distribution Function (CDF) of . We only need to show that “ for all ” implies “ for a constant ”. We note that

 Emaxi∈[d]|Xi|=∫∞0PrXi∼D[maxi∈[d]|Xi|>x]dx=∫ϵh(δ)