Generalised Lipschitz Regularisation Equals Distributional Robustness


Abstract

The problem of adversarial examples has highlighted the need for a theory of regularisation that is general enough to apply to exotic function classes, such as universal approximators. In response, we give a very general equality result regarding the relationship between distributional robustness and regularisation, as defined with a transportation cost uncertainty set. The theory allows us to (tightly) certify the robustness properties of a Lipschitz-regularised model with very mild assumptions. As a theoretical application we show a new result explicating the connection between adversarial learning and distributional robustness. We then give new results for how to achieve Lipschitz regularisation of kernel classifiers, which are demonstrated experimentally.


1 Introduction

When learning a statistical model, it is rare that one has complete access to the distribution. More often one approximates the risk minimisation by an empirical risk, using a sequence of samples from the distribution. In practice this can be problematic, particularly when the curse of dimensionality is in full force, when it comes to (i) knowing with certainty that one has enough samples, and (ii) guaranteeing good performance away from the data. Both of these problems can, in effect, be cast as problems of ensuring generalisation. A remedy for both has been proposed in the form of a modification to the risk minimisation framework, wherein we integrate a certain amount of distrust of the distribution. This distrust results in a certification of worst-case performance if it turns out later that the distribution was specified imprecisely, improving generalisation.

In order to make this notion of distrust concrete, we introduce some mathematical notation. The set of Borel probability measures on an outcome space is . A loss function is a mapping so that is the loss incurred with some prediction under the outcome . For example, if then could be a loss function for regression or classification with some classifier . For a distribution we replace the objective in the classical risk minimisation with the robust Bayes risk:

(rB)

where is a set containing , called the uncertainty set (viz. Berger, 1993; Vidakovic, 2000; Grünwald & Dawid, 2004, §4). It is in this way that we introduce distrust into the classical risk minimisation, by instead minimising the worst-case risk over a set of distributions.
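To fix ideas, the robust Bayes objective (rB) is commonly written in the following form; the symbols below (model class $\mathcal{F}$, loss $\ell$, reference distribution $P$, uncertainty set $\mathcal{U}$) are illustrative placeholders rather than a restatement of the notation defined above.
\[
\inf_{f \in \mathcal{F}} \; \sup_{Q \in \mathcal{U}} \; \mathbb{E}_{Q}\, \ell(f, \cdot),
\qquad \mathcal{U} \subseteq \mathcal{P}(\Omega), \quad P \in \mathcal{U}.
\]
Taking $\mathcal{U} = \{P\}$ recovers the classical risk; enlarging $\mathcal{U}$ encodes the distrust described above.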

It is sometimes the case that for an uncertainty set, , there is a function, (not necessarily the usual Lipschitz constant), so that

(L)

There are two reasons we are interested in finding a relationship of the form (L). Firstly, there has been independent interest in the regularised risk, particularly when corresponds to the Lipschitz constant of . The applications of Lipschitz regularisation are as disparate as generative adversarial networks (Miyato et al., 2018; Arjovsky et al., 2017), generalisation (Yoshida & Miyato, 2017; Farnia et al., 2019; Gouk et al., 2018), and adversarial learning (Cisse et al., 2017; Anil et al., 2019; Cranko et al., 2019; Tsuzuku et al., 2018), among others (Gouk et al., 2019; Scaman & Virmaux, 2018). Secondly, building a model that is robust to a particular uncertainty set is very intuitive and tractable. However, the left-hand side of (L) involves an optimisation over a subset of an infinite-dimensional space. By comparison, the Lipschitz-regularised risk is often much easier to work with in practice. For these reasons it is always of interest to note when a robust Bayes problem (rB) admits an equivalent formulation (L). Conversely, by developing such a connection we are able to provide an interpretation of the popular Lipschitz-regularised objective function.
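In the same illustrative notation, an identity of the form (L) reads as the sketch below, where $r$ is the radius of the uncertainty set and $\Lambda$ denotes the (not necessarily Lipschitz) regularisation functional mentioned above.
\[
\sup_{Q \in \mathcal{U}} \mathbb{E}_{Q}\, \ell(f, \cdot)
\;=\;
\mathbb{E}_{P}\, \ell(f, \cdot) \;+\; r \, \Lambda\bigl(\ell(f, \cdot)\bigr).
\]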

Our first major contribution, in section 3, is to show that for a set of convex loss functions we have a result of the form (L). Furthermore, when the loss functions are nonconvex, (L) becomes an inequality, and we prove that the slackness is controlled (tightly) by a tractable measure of the loss function's convexity, which to our knowledge is a completely new result. As an application, in section 4, we show that the commonly used adversarial learning objective is, in fact, a special case of a distributionally robust risk, which significantly generalises other similar results in this area.

In practice, however, evaluating the Lipschitz constant of a neural network is NP-hard (Scaman & Virmaux, 2018), which compels either approximating it or explicitly engineering Lipschitz layers and analysing the resulting expressiveness in specific cases (e.g. Anil et al., 2019, -norm). By comparison, kernel machines encompass a family of models that is universal (Micchelli et al., 2006).

Our third contribution, in section 5, is to show that product kernels, such as Gaussian kernels, have a Lipschitz constant that can be efficiently approximated and optimised with high probability. By using the Nyström approximation (Williams & Seeger, 2000; Drineas & Mahoney, 2005), we show that an approximation error requires only samples. Such a sampling-based approach also leads to a single convex constraint, making it scalable to large sample sizes, even with an interior-point solver (section 6). As our experiments show, this method achieves higher robustness than the state of the art (Cisse et al., 2017; Anil et al., 2019).
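As a rough illustration of the sampling flavour of this approach, the following minimal Python sketch builds Nyström features for a Gaussian kernel from a random subsample of landmark points; the function names, the bandwidth parameter gamma, and the landmark count are illustrative assumptions, not the procedure of section 5.

import numpy as np

def gaussian_kernel(X, Y, gamma):
    # Pairwise Gaussian (RBF) kernel matrix: k(x, y) = exp(-gamma * ||x - y||^2).
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Y**2, axis=1)[None, :] - 2.0 * X @ Y.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def nystrom_features(X, landmarks, gamma, eps=1e-10):
    # Nystrom approximation K ~ Phi Phi^T with Phi = K_nm W^{-1/2},
    # where W is the Gram matrix of a small set of landmark points.
    K_nm = gaussian_kernel(X, landmarks, gamma)
    W = gaussian_kernel(landmarks, landmarks, gamma)
    vals, vecs = np.linalg.eigh(W)
    vals = np.maximum(vals, eps)  # guard against small negative eigenvalues
    W_inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    return K_nm @ W_inv_sqrt

# Usage: subsample 100 landmarks and work with the resulting 100-dimensional features.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 784))
landmarks = X[rng.choice(len(X), size=100, replace=False)]
Phi = nystrom_features(X, landmarks, gamma=1e-3)  # shape (1000, 100)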

2 Preliminaries

Let and , with similar notations for the real numbers. Let denote the set for . Unless otherwise specified, are topological outcome spaces. Often will be used when there is some linear structure, compatible with the topology, so that may be interpreted as the classical outcome space for classification problems (cf. Vapnik, 2000). A sequence in is a mapping and is denoted .

The Dirac measure at some point is , and the set of Borel mappings is . For , denote by the Lebesgue space of functions satisfying for . The continuous real functions on are collected in . In many of our subsequent formulas it is more convenient to write an expectation directly as an integral: .

For two measures the set of -couplings is where if and only if the marginals of are and :

(1)

For a coupling function , the -transportation cost of is

(2)

The -transportation cost ball of radius centred at is

(3)

and serves as our uncertainty set. Define the least c-Lipschitz constant (cf. Cranko et al., 2019) of a function :

(4)

Thus when is a metric space agrees with the usual Lipschitz notion. When , for example when is a semi-norm, we take for all .
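For the reader's convenience, the objects in (1)–(4) are commonly written as follows; the display below ($\mu, \nu$ measures on an outcome space $\Xi$, $c$ the function from (2), $\pi$ a coupling) is an illustrative sketch rather than a verbatim restatement.
\begin{align*}
\Pi(\mu, \nu) &= \bigl\{ \pi \in \mathcal{P}(\Xi \times \Xi) : \pi(\cdot \times \Xi) = \mu, \ \pi(\Xi \times \cdot) = \nu \bigr\}, \\
\operatorname{cost}_c(\mu, \nu) &= \inf_{\pi \in \Pi(\mu, \nu)} \int c \,\mathrm{d}\pi, \\
B_{c,r}(P) &= \bigl\{ Q \in \mathcal{P}(\Xi) : \operatorname{cost}_c(P, Q) \le r \bigr\}, \\
\operatorname{lip}_c(f) &= \inf \bigl\{ L \ge 0 : f(x) - f(y) \le L\, c(x, y) \ \text{for all } x, y \in \Xi \bigr\}.
\end{align*}
When $c$ is a metric, the last line reduces to the familiar Lipschitz constant, matching the remark above.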

To a function we associate another function , called the convex envelope of , defined to be the greatest closed convex function that minorises . The quantity was first suggested by Aubin & Ekeland (1976) to quantify the lack of convexity of a function , and has since been shown to be of considerable interest for, among other things, bounding the duality gap in nonconvex optimisation (cf. Lemaréchal & Renaud, 2001; Udell & Boyd, 2016; Askari et al., 2019; Kerdreux et al., 2019). In particular, observe

(5)

When is minorised by an affine function, there is (cf. Hiriart-Urruty & Lemaréchal, 2010, Prop. X.1.5.4; Benoist & Hiriart-Urruty, 1996)

(6)

for all , where

(7)
(8)

Consequently, it is well known that can be computed via the finite-dimensional maximisation

(9)
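In illustrative notation, writing the convex envelope of $f$ as $\check f$ and its lack of convexity as $\rho(f)$, the standard definitions behind these displays are sketched below; for $f$ on $\mathbb{R}^n$ the last expression is the usual Carathéodory-style formula (up to taking a closure) underpinning the finite-dimensional computation.
\begin{align*}
\check f &= \sup \{ g \le f : g \ \text{closed convex} \}, \\
\rho(f) &= \sup_{x} \bigl( f(x) - \check f(x) \bigr) \;\ge\; 0, \\
\check f(x) &= \inf \Bigl\{ \sum_{i=1}^{n+1} \lambda_i f(x_i) : \sum_{i=1}^{n+1} \lambda_i x_i = x, \ \lambda_i \ge 0, \ \sum_{i=1}^{n+1} \lambda_i = 1 \Bigr\}.
\end{align*}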

Complete proofs of all technical results are relegated to the supplementary material.

3 Distributional robustness

In this section we present our major result regarding identities of the form (L).

{toappendix}

3.1 Proof of subsection 3.1 and other technical results

{lemmaapx}

[Blanchet & Murthy, 2019, Thm. 1] Assume is a Polish space and fix . Let be lower semicontinuous with for all , and is upper semicontinuous. Then for all there is

(10)
{toappendix}

Duality results like subsection 3.1 have been the basis of a number of recent theoretical efforts in the theory of adversarial learning (Sinha et al., 2018; Gao & Kleywegt, 2016; Blanchet et al., 2019; Shafieezadeh-Abadeh et al., 2019), the results of Blanchet & Murthy (2019) being the most general to date. Such duality results are necessary because, while the supremum on the left-hand side of (10) is over a (usually) infinite-dimensional space, the right-hand side involves only a finite-dimensional optimisation. The generalised conjugate in (10) also hides an optimisation, but when the outcome space is finite dimensional, this too is a finite-dimensional problem.
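For orientation, the duality (10) has the following standard Lagrangian shape; the display below is a hedged paraphrase in illustrative notation (integrand $f$, cost $c$, budget $r$), not a verbatim restatement of Blanchet & Murthy (2019, Thm. 1).
\[
\sup_{Q \,:\, \operatorname{cost}_c(P, Q) \le r} \int f \,\mathrm{d}Q
\;=\;
\inf_{\lambda \ge 0} \Bigl( \lambda r + \int \sup_{z} \bigl( f(z) - \lambda\, c(z, x) \bigr) \, P(\mathrm{d}x) \Bigr).
\]
The inner supremum is the generalised conjugate referred to above.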

The following lemma is sometimes stated as a consequence of, or within the proof of, the McShane–Whitney extension theorem (McShane, 1934; Whitney, 1934), but it is immediate to verify directly. {lemmaapx} Let be a set. Assume satisfies for all , . Then

(11)
Proof.

Suppose . Fix . Then

(12)

with equality when . Next suppose

(13)

then

(14)
(15)

as claimed. ∎

{lemmaapx}

Assume is a vector space. Suppose satisfies , and is convex. Then

(16)
Proof.

Suppose . Then for all . Fix , and suppose . Then

(17)
(18)

because . This shows .

Next assume for all and . Because is not extended-real valued, it is continuous on all of (via Zălinescu, 2002, Cor. 2.2.10) and is nonempty for all (via Zălinescu, 2002, Thm. 2.4.9). Fix an arbitrary . Then , and

(19)

where the implication is because and . Since the choice of in (19) was arbitrary, the proof is complete. ∎

{lemmaapx}

Assume is a locally convex Hausdorff topological vector space. Suppose is closed sublinear, and is closed convex. Then there is

(20)
Proof.

Fix an arbitrary . From subsection 3.1 we know

(21)

Assume for all . Consequently for every . From the usual difference-convex global -subdifferential condition (Hiriart-Urruty, 1989, Thm. 4.4) it follows that

(22)

where we note that because is sublinear.

Assume for some . By hypothesis there exists , , and with

(23)

Using the Toland (1979) duality formula (viz. Hiriart-Urruty, 1986, Cor. 2.3) and the usual calculus rules for the Fenchel conjugate (e.g. Zălinescu, 2002, Thm. 2.3.1) we have

(24)
(25)
(26)
(27)
(28)

where the second inequality is because .

We have assumed . Because is sublinear, (Zălinescu, 2002, Thm. 2.4.14 (i)), and therefore . Then (28) yields

(29)

which completes the proof. ∎

{theoremrep} Assume is a separable Fréchet space and fix . Suppose is closed sublinear, and is upper semicontinuous with . Then for all , there is a number so that

(30)

Furthermore is upper bounded by

(31)

where , so that when is closed convex .
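In the same illustrative notation, and writing the slackness as $\Delta_{f,c,r}(P)$, the shape of (30) can be sketched as follows (a paraphrase under the theorem's assumptions, not a substitute for its precise statement):
\[
\sup_{Q \in B_{c,r}(P)} \int f \,\mathrm{d}Q
\;=\;
\int f \,\mathrm{d}P \;+\; r \operatorname{lip}_c(f) \;-\; \Delta_{f,c,r}(P),
\qquad \Delta_{f,c,r}(P) \ge 0,
\]
with $\Delta_{f,c,r}(P) = 0$ whenever $f$ is closed convex, consistent with the bound (31).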

Proof.

(30): Since is assumed sublinear, it is positively homogeneous and there is for all . Therefore we can apply subsection 3.1 and subsection 3.1 to obtain

(32)

and therefore .

(31): Observing that , from subsection 3.1 we find for all

(33)
(34)
(35)
(36)
(37)
(38)

Similarly, for all there is

(39)
(40)
(41)
(42)
(43)
(44)

Together, (38) and (44) show

(45)
(46)

Then

(47)
(48)
(49)
(50)
(51)

which implies (31). ∎

{proofsketch}

The duality result of Blanchet & Murthy (2019, Thm. 1) yields a tractable, dual formulation of the robust risk, which is easy to upper bound by the regularised risk. Lower bound the function by its closed convex envelope and use classical results from the difference-convex optimisation literature (Toland, 1979; Hiriart-Urruty, 1986; 1989) to solve the inner maximisation of the dual robust risk formulation.

subsection 3.1 subsumes many existing results (Gao & Kleywegt, 2016, Cor. 2 (iv); Cisse et al., 2017, §3.2; Sinha et al., 2018; Shafieezadeh-Abadeh et al., 2019, Thm. 14) with a great deal more generality, applying to a very broad family of models, loss functions, and outcome spaces. The extension of subsection 3.1 for robust classification in the absence of label noise is straightforward:

Corollary 1.

Assume is a separable Fréchet space and is a topological space. Fix . Assume satisfies

(52)

where is closed sublinear, and is upper semicontinuous and has . Then for all there is (30) and (31), where the closed convex hull is interpreted .

To our knowledge, this is the first time the slackness in (31) has been characterised tightly. Clearly from subsection 3.1 the upper bound (53) is tight for closed convex functions, but subsection 3.1 shows it is also tight for a large family of nonconvex functions and measures, in particular the upper semicontinuous loss functions on a compact set, paired with the collection of probability distributions supported on that set.

Observing that , the equality (30) yields the upper bound

(53)

By controlling we are able to guarantee that the regularised risk in (L) is a good surrogate for the robust risk. The number itself is quite hard to measure (since it would require computing the robust risk directly), which is why we upper bound it in (31). subsection 3.1 shows the slackness bound (31) is tight for a large family of distributions after observing

(54)

This yields

(55)
(56)

for all , , and .
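Spelled out, the certificate (53) discussed above reads, in the same illustrative notation:
\[
\sup_{Q \in B_{c,r}(P)} \int f \,\mathrm{d}Q \;\le\; \int f \,\mathrm{d}P \;+\; r \operatorname{lip}_c(f),
\]
so any bound on the least $c$-Lipschitz constant immediately certifies worst-case performance over the whole ball $B_{c,r}(P)$.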

{propositionrep}

Let be a separable Fréchet space with . Suppose is closed sublinear, and is upper semicontinuous, has , and attains its maximum on . Then for all

(57)
{proofsketch}

Let achieve its maximum on at . Then , which implies the result. {appendixproof} Let be a point at which . Then , and . Therefore

(58)

And so we have

(59)
(60)
(61)

which implies the claim.

Remark 1.

In particular, for any compact subset of a Fréchet space (such as the set of -dimensional images, ) the bound (30) is tight with respect to the set for any upper semicontinuous . Since the behaviour of away from is not important, the -Lipschitz constant in (30) need only be computed on this set. To do so one may replace with , where for and for , and observe , because .

4 Adversarial learning

Szegedy et al. (2014) observe that deep neural networks, trained for image classification using empirical risk minimisation, exhibit a curious behaviour whereby an image, , and a small, imperceptible amount of noise, , may be found so that the network classifies and differently. Imagining that the troublesome noise vector is sought by an adversary seeking to defeat the classifier, such pairs have come to be known as adversarial examples (Moosavi Dezfooli et al., 2017; Goodfellow et al., 2015; Kurakin et al., 2017).

When is a normed space, the closed ball of radius , centred at is denoted . Let be a linear space and a topological space. Fix , , and let be a norm on .

The following objective has been proposed (viz. Madry et al., 2018; Shaham et al., 2018; Carlini & Wagner, 2017; Cisse et al., 2017) as a means of learning classifiers that are robust to adversarial examples

(62)

where is the loss of some classifier.
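In illustrative notation (classifier $f$, loss $\ell$, perturbation budget $r$ in the chosen norm), the adversarial objective (62) is conventionally written as the following sketch:
\[
\inf_{f \in \mathcal{F}} \; \mathbb{E}_{(x, y) \sim P} \, \sup_{\| \delta \| \le r} \ell\bigl( f(x + \delta), y \bigr),
\]
that is, each sample may be perturbed adversarially within a norm ball of radius $r$ before the loss is evaluated.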

{toappendix}

4.1 Proof of subsection 4.1

subsection 4.1 will be used to show an equality result in subsection 4.1. {lemmaapx} Assume is a compact Polish space and is non-atomic. For and there is a sequence with converging at in .

Proof.

Let . Since is non-atomic and is continuous we have (via Pratelli, 2007, Thm. B)

(63)

Let , obviously . Assume , otherwise the lemma is trivial. Fix a sequence with . For let . Then

(64)

and because metrises the -topology on (Villani, 2009, Cor. 6.13), the mapping is -continuous. Then by the intermediate value theorem for every there is some with , forming a sequence . Then for every there is a sequence so that in and

(65)
(66)
(67)

Therefore for every there exists so that for every

(68)

Let us pass directly to this subsequence of for every so that (68) holds for all . Next by construction we have . Therefore has a subsequence in so that in in . By ensuring (68) is satisfied, the sequences for every . ∎

{toappendix}

We can now prove our main result subsection 4.1. When is a normed space, the closed ball of radius , centred at is denoted . {theoremrep} Assume is a separable Banach space. Fix and for let

(69)
Then for and there is
(70)

with equality if, furthermore, is non-atomically concentrated on a compact subset of , on which is continuous with the subspace topology.

Proof.

For convenience of notation let .

When , the set consists of the functions which are -almost everywhere, in which case for -almost all . Thus (79) is equal to . Since is a norm, , and by a similar argument there is equality with the right-hand side. We now complete the proof for the cases where .

Inequality: For , let denote the set-valued mapping with . Let denote the set of Borel so that for -almost all . Let . Clearly for every there is

(71)

which shows . Then if there is equality in (72), we have

(72)
(73)
(74)

which proves the inequality.

To complete the proof we will now justify the exchange of integration and supremum in (72). The set is trivially decomposable (Giner, 2009, see the remark at the bottom of p. 323, Def. 2.1). By assumption is Borel measurable. Since is measurable, any decomposable subset of is -decomposable (Giner, 2009, Prop. 5.3) and -linked (Giner, 2009, Prop. 3.7 (i)). Giner (2009, Thm. 6.1 (c)) therefore allows us to exchange integration and supremum in (72).

Equality: Under the additional assumptions there exists with (via Blanchet & Murthy, 2019, Prop. 2)

(75)

The compact subset where is concentrated and non-atomic is a Polish space with the Banach metric. Therefore using subsection 4.1 there is a sequence so that

(76)

proving the desired equality. ∎

{proofsketch}

Giner (2009, Thm. 6.1 (c)) allows us to interchange the integral and supremum; the inequality then follows from the definition of the transportation cost risk. For the equality under the added assumptions, there exist a distribution that achieves the robust supremum (Blanchet & Murthy, 2019, Prop. 2) and a Monge map that achieves the transportation cost infimum (Pratelli, 2007, Thm. B).

Remark 2.

By observing that the constant function is included in the set , it is easy to see that the adversarial risk (62) is upper bounded as follows

(77)
(78)
(79)

where in the equality we extend to a metric on in the same way as (52).

subsection 4.1 generalises and subsumes a number of existing results (Gao & Kleywegt, 2016, Cor. 2 (iv); Staib & Jegelka, 2017, Prop. 3.1; Shafieezadeh-Abadeh et al., 2019, Thm. 12), relating the adversarial risk minimisation (62) to the distributionally robust risk in subsection 3.1. The previous results mentioned are all formulated with respect to an empirical distribution, that is, an average of Dirac masses. Of course any finite set is compact, and so these empirical distributions satisfy the concentration assumption.

A simulation demonstrates that the sum of the three gaps in (79), subsection 3.1 and subsection 4.1 is relatively small. We randomly generated 100 Gaussian kernel classifiers , with sampled from the MNIST dataset and