Concentration in unbounded metric spaces and algorithmic stability

Aryeh Kontorovich
Abstract

We prove an extension of McDiarmid’s inequality for metric spaces with unbounded diameter. To this end, we introduce the notion of the subgaussian diameter, which is a distribution-dependent refinement of the metric diameter. Our technique provides an alternative approach to that of Kutin and Niyogi’s method of weakly difference-bounded functions, and yields nontrivial, dimension-free results in some interesting cases where the former does not. As an application, we give apparently the first generalization bound in the algorithmic stability setting that holds for unbounded loss functions. We give two extensions of the basic concentration result: to strongly mixing processes and to other Orlicz norms.

1 Introduction

Concentration of measure inequalities are at the heart of statistical learning theory. Roughly speaking, concentration allows one to conclude that the performance of a (sufficiently “stable”) algorithm on a (sufficiently “close to iid”) sample is indicative of the algorithm’s performance on future data. Quantifying what it means for an algorithm to be stable and for the sampling process to be close to iid is by no means straightforward, and much recent work has been motivated by these questions. It turns out that the various notions of stability are naturally expressed in terms of the Lipschitz continuity of the algorithm in question (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2010), while appropriate relaxations of the iid assumption are achieved using various kinds of strong mixing (Karandikar and Vidyasagar, 2002; Gamarnik, 2003; Rostamizadeh and Mohri, 2007; Mohri and Rostamizadeh, 2008; Steinwart and Christmann, 2009; Steinwart et al., 2009; Zou et al.; Mohri and Rostamizadeh, 2010; London et al., 2012, 2013; Shalizi and Kontorovich, 2013).

An elegant and powerful work-horse driving many of the aforementioned results is McDiarmid’s inequality (McDiarmid, 1989):

(1) $\displaystyle \mathbb{P}\Big( \big| f(X_1,\ldots,X_n) - \mathbb{E} f(X_1,\ldots,X_n) \big| > t \Big) \;\le\; 2\exp\!\left( -\frac{2t^2}{\sum_{i=1}^n c_i^2} \right), \qquad t > 0,$

where $f : \Omega_1 \times \cdots \times \Omega_n \to \mathbb{R}$ is a real-valued function of the sequence of independent random variables $X_1, \ldots, X_n$, such that

(2) $|f(x) - f(x')| \;\le\; c_i$

whenever $x$ and $x'$ differ only in the $i$th coordinate. Aside from being instrumental in proving PAC bounds (Boucheron et al., 2005), McDiarmid’s inequality has also found use in algorithmic stability results (Bousquet and Elisseeff, 2002). Non-iid extensions of (1) have also been considered (Marton, 1996; Rio, 2000; Chazottes et al., 2007; Kontorovich and Ramanan, 2008).
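As a quick numerical sanity check of (1) (illustrative only, not part of the original argument), the sketch below compares the empirical deviation probability of a bounded-difference statistic, namely the sample mean of variables in $[0,1]$, for which one may take $c_i = 1/n$, against the McDiarmid bound; all function names and parameters here are ad hoc choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)


def mcdiarmid_bound(t, c):
    """Two-sided McDiarmid bound 2*exp(-2 t^2 / sum(c_i^2))."""
    return 2.0 * np.exp(-2.0 * t**2 / np.sum(np.asarray(c) ** 2))


def empirical_tail(n=50, trials=200_000, t=0.1):
    """Estimate P(|mean - E[mean]| > t) for Uniform[0,1] samples."""
    X = rng.random((trials, n))          # X_i ~ Uniform[0, 1), independent
    dev = np.abs(X.mean(axis=1) - 0.5)   # |f(X) - E f(X)| with f = sample mean
    return (dev > t).mean()


if __name__ == "__main__":
    n, t = 50, 0.1
    c = np.full(n, 1.0 / n)              # changing one coordinate moves the mean by <= 1/n
    print("empirical tail :", empirical_tail(n=n, t=t))
    print("McDiarmid bound:", mcdiarmid_bound(t, c))
```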

The distribution-free nature of McDiarmid’s inequality makes it an attractive tool in learning theory, but also imposes inherent limitations on its applicability. Chief among these limitations is the inability of (1) to provide risk bounds for unbounded loss functions. Even in the bounded case, if the Lipschitz condition (2) holds not everywhere but only with high probability — say, with a much larger constant on a small set of exceptions — the bound in (1) still charges the full cost of the worst-case constant. To counter this difficulty, Kutin (2002); Kutin and Niyogi (2002) introduced an extension of McDiarmid’s inequality to weakly difference-bounded functions and used it to analyze the risk of “almost-everywhere” stable algorithms. This influential result has been invoked in a number of recent papers (El-Yaniv and Pechyony, 2006; Mukherjee et al., 2006; Hush et al., 2007; Agarwal and Niyogi, 2009; Shalev-Shwartz et al., 2010; Rubinstein and Simma, 2012).

However, the approach of Kutin and Niyogi entails some difficulties as well. These come in two flavors: analytical (complex statement and proof) and practical (conditions are still too restrictive in some cases); we will elaborate upon this in Section 3. In this paper, we propose an alternative approach to the concentration of “almost-everywhere” or “average-case” Lipschitz functions. To this end, we introduce the notion of the subgaussian diameter of a metric probability space. The latter may be finite even when the metric diameter is infinite, and we show that this notion generalizes the more restrictive property of bounded differences.

Main results.

This paper’s principal contributions include defining the subgaussian diameter of a metric probability space and identifying its role in relaxing the bounded differences condition. In Theorem 1, we show that the subgaussian diameter can essentially replace the far more restrictive metric diameter in concentration bounds. This result has direct ramifications for algorithmic stability (Theorem 2). We furthermore extend our concentration inequality to non-independent processes (Theorem 3) and to other Orlicz norms (Theorem 4).

Outline of paper.

In Section 2 we define the subgaussian diameter and relate it to (weakly) bounded differences in Section 3. We state and prove the concentration inequality based on this notion in Section 4 and give an application to algorithmic stability in Section 5. We then give an extension to non-independent data in Section 6 and discuss other Orlicz norms in Section 7. Conclusions and some open problems are presented in Section 8.

2 Preliminaries

A metric probability space $(\mathcal{X}, \rho, \mu)$ is a measurable space $\mathcal{X}$ whose Borel $\sigma$-algebra is induced by the metric $\rho$, endowed with the probability measure $\mu$. Our results are most cleanly presented when $\mathcal{X}$ is a discrete set, but they continue to hold verbatim for Borel probability measures on Polish spaces. It will be convenient to write $\sum_{x \in \mathcal{X}} \mu(x) g(x)$ even when the latter is an integral. Random variables are capitalized ($X$), specified sequences are written in lowercase ($x$), the notation $x_i^j = (x_i, x_{i+1}, \ldots, x_j)$ is used for subsequences, and sequence concatenation is denoted multiplicatively: $x_i^j x_{j+1}^k = x_i^k$. We will frequently use the shorthand $x^j = x_1^j$. Standard order of magnitude notation such as $O(\cdot)$ and $\Omega(\cdot)$ will be used.

A function $f : \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz if $|f(x) - f(x')| \le L\,\rho(x, x')$ for all $x, x' \in \mathcal{X}$.

Let $(\mathcal{X}_i, \rho_i, \mu_i)$, $i = 1, \ldots, n$, be a sequence of metric probability spaces. We define the product probability space

$\mathcal{X}^n = \mathcal{X}_1 \times \cdots \times \mathcal{X}_n$

with the product measure

$\mu^n = \mu_1 \otimes \cdots \otimes \mu_n$

and product metric

(3) $\displaystyle \rho^n(x, x') = \sum_{i=1}^n \rho_i(x_i, x_i'), \qquad x, x' \in \mathcal{X}^n.$

We will denote partial products by $\mathcal{X}_i^j = \mathcal{X}_i \times \cdots \times \mathcal{X}_j$, with $\rho_i^j$ and $\mu_i^j$ defined analogously.

We write $X \sim \mu$ to mean that $X$ is an $\mathcal{X}$-valued random variable with law $\mu$ — i.e., $\mathbb{P}(X \in A) = \mu(A)$ for all Borel $A \subseteq \mathcal{X}$. This notation extends naturally to sequences: $X^n = (X_1, \ldots, X_n) \sim \mu^n$. We will associate to each $(\mathcal{X}, \rho, \mu)$ the symmetrized distance random variable $\Xi = \Xi(\mathcal{X}, \rho, \mu)$ defined by

(4) $\Xi = \varepsilon\, \rho(X, X'),$

where $X, X' \sim \mu$ are independent and $\varepsilon = \pm 1$ with probability $1/2$ each, independent of $(X, X')$. We note right away that $\Xi$ is a centered random variable:

(5) $\mathbb{E}[\Xi] = \tfrac{1}{2}\, \mathbb{E}[\rho(X, X')] - \tfrac{1}{2}\, \mathbb{E}[\rho(X, X')] = 0.$

A real-valued random variable $Y$ is said to be subgaussian if it admits a $\sigma > 0$ such that

(6) $\mathbb{E}\, e^{\lambda Y} \le e^{\lambda^2 \sigma^2 / 2}, \qquad \lambda \in \mathbb{R}.$

The smallest $\sigma$ for which (6) holds will be denoted by $\sigma(Y)$.

We define the subgaussian diameter $\Delta_{SG}(\mathcal{X}) = \Delta_{SG}(\mathcal{X}, \rho, \mu)$ of the metric probability space in terms of its symmetrized distance $\Xi$:

(7) $\Delta_{SG}(\mathcal{X}) = \sigma(\Xi).$

If a metric probability space has finite diameter,

$\Delta(\mathcal{X}) := \sup_{x, x' \in \mathcal{X}} \rho(x, x') < \infty,$

then its subgaussian diameter is also finite:

Lemma 1.
$\Delta_{SG}(\mathcal{X}) \le \Delta(\mathcal{X})$.

Proof.

Let $\Xi = \varepsilon\, \rho(X, X')$ be the symmetrized distance. By (5), we have $\mathbb{E}[\Xi] = 0$, and certainly $-\Delta(\mathcal{X}) \le \Xi \le \Delta(\mathcal{X})$. Hence,

$\mathbb{E}\, e^{\lambda \Xi} \le \exp\!\left( \frac{\lambda^2 \big( 2\Delta(\mathcal{X}) \big)^2}{8} \right) = \exp\!\left( \frac{\lambda^2 \Delta(\mathcal{X})^2}{2} \right), \qquad \lambda \in \mathbb{R},$

where the inequality follows from Hoeffding’s Lemma. ∎
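For completeness, the form of Hoeffding’s Lemma used here is the standard one (recalled, not part of the original text): if $\mathbb{E}[Y] = 0$ and $a \le Y \le b$ almost surely, then

$\mathbb{E}\, e^{\lambda Y} \le \exp\!\left( \frac{\lambda^2 (b - a)^2}{8} \right), \qquad \lambda \in \mathbb{R}.$

Applying this with $Y = \Xi$, $a = -\Delta(\mathcal{X})$, and $b = \Delta(\mathcal{X})$ gives exactly the display above, i.e., $\sigma(\Xi) \le \Delta(\mathcal{X})$.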

The bound in Lemma 1 is nearly tight in the sense that for every $\varepsilon > 0$ there is a metric probability space for which

(8) $\Delta_{SG}(\mathcal{X}) \ge (1 - \varepsilon)\, \Delta(\mathcal{X}).$

To see this, take $\mathcal{X}$ to be an $m$-point space with the uniform distribution and $\rho(x, x') = 1$ for all distinct $x, x'$. Taking $m$ sufficiently large makes $\Delta_{SG}(\mathcal{X})$ arbitrarily close to $\Delta(\mathcal{X}) = 1$. We do not know whether $\Delta_{SG}(\mathcal{X}) = \Delta(\mathcal{X})$ can be achieved.

On the other hand, there exist unbounded metric probability spaces with finite subgaussian diameter. A simple example is $(\mathbb{R}, |\cdot|, \gamma)$ with $\rho(x, x') = |x - x'|$ and the standard Gaussian probability measure $\gamma = N(0, 1)$. Obviously, $\Delta(\mathbb{R}) = \infty$. Now the symmetrized distance $\Xi = \varepsilon\, |X - X'|$ is distributed as the difference (= sum) of two standard Gaussians: $\Xi \sim N(0, 2)$. Since $\mathbb{E}\, e^{\lambda \Xi} = e^{\lambda^2}$, we have

(9) $\Delta_{SG}^2(\mathbb{R}, |\cdot|, \gamma) = 2.$

More generally, the subgaussian distributions on $(\mathbb{R}, |\cdot|)$ are precisely those for which $\Delta_{SG} < \infty$.
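The value in (9) is easy to probe numerically. The sketch below (purely illustrative; the estimator, the grid of $\lambda$ values, and all names are choices made here, not constructs from the paper) draws Monte Carlo samples of the symmetrized distance $\Xi = \varepsilon |X - X'|$ for $X, X' \sim N(0,1)$ and estimates the smallest $\sigma^2$ compatible with (6) over that grid.

```python
import numpy as np

rng = np.random.default_rng(4)


def symmetrized_distance_samples(n=1_000_000):
    """Samples of Xi = eps * |X - X'| with X, X' ~ N(0,1), eps = +/-1."""
    x, xp = rng.standard_normal(n), rng.standard_normal(n)
    eps = rng.choice([-1.0, 1.0], size=n)
    return eps * np.abs(x - xp)


def sg_moment_sq_estimate(xi, lambdas=np.linspace(0.1, 2.0, 20)):
    """Smallest sigma^2 with E exp(lambda*Xi) <= exp(lambda^2 sigma^2 / 2),
    estimated as the max over the lambda grid of 2*log(mean exp(lambda*Xi))/lambda^2."""
    xi = np.asarray(xi)
    ests = [2.0 * np.log(np.mean(np.exp(lam * xi))) / lam**2 for lam in lambdas]
    return max(ests)


if __name__ == "__main__":
    xi = symmetrized_distance_samples()
    print("estimated Delta_SG^2 :", sg_moment_sq_estimate(xi))  # close to 2
```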

3 Related work

McDiarmid’s inequality (1) suffers from the limitations mentioned above: it completely ignores the distribution and is vacuous if even one of the $c_i$ is infinite. (Note, though, that McDiarmid’s inequality is sharp in the sense that the constants in (1) cannot be improved in a distribution-free fashion.) In order to address some of these issues, Kutin (2002); Kutin and Niyogi (2002) proposed an extension of McDiarmid’s inequality to “almost everywhere” Lipschitz functions $f$. To formalize this, fix $1 \le i \le n$ and let $X^n \sim \mu^n$ and $X_i' \sim \mu_i$ be independent. Define $X^{(i)} \in \mathcal{X}^n$ by

(10) $X^{(i)} = (X_1, \ldots, X_{i-1}, X_i', X_{i+1}, \ldots, X_n).$

Kutin and Niyogi define $f$ to be weakly difference-bounded by $(b, c, \delta)$ if

(11) $|f(x) - f(x')| \le b$ for all $x, x' \in \mathcal{X}^n$ differing in a single coordinate,

and

(12) $\mathbb{P}\big( |f(X^n) - f(X^{(i)})| > c \big) \le \delta$

for all $1 \le i \le n$.

The precise result of Kutin (2002, Theorem 1.10) is somewhat unwieldy to state — indeed, the present work was motivated in part by a desire for simpler tools. Assuming that $f$ is weakly difference-bounded by $(b, c, \delta)$ with

(13) $\delta = \exp(-\Omega(n))$

and $c = O(1/n)$, their bound states that

(14) $\mathbb{P}\big( |f(X^n) - \mathbb{E} f(X^n)| \ge t \big) \le \exp\big( -\Omega(t^2 n) \big)$

for a certain range of $t$ and $n$. As noted by Rakhlin et al. (2005), the exponential decay assumption (13) is necessary in order for the Kutin–Niyogi method to yield exponential concentration. In contrast, the bounds we prove here

  • do not require the differences of $f$ to be everywhere bounded, as in (11);

  • have a simple statement and proof, and generalize to non-iid processes with relative ease.

We defer the quantitative comparisons between (14) and our results until the latter are formally stated in Section 4.

In a different line of work, Bentkus (2008) considered an extension of Hoeffding’s inequality to unbounded random variables. His bound only holds for sums (as opposed to general Lipschitz functions) and the summands must be non-negative (i.e., unbounded only in the positive direction). An earlier notion of “effective” metric diameter in the context of concentration is that of metric space length (Schechtman, 1982). Another distribution-dependent refinement of diameter is the spread constant (Alon et al., 1998). Lecué and Mendelson (2013) gave minimax bounds for empirical risk minimization over subgaussian classes.

4 Concentration via subgaussian diameter

McDiarmid’s inequality (1) may be stated in the notation of Section 2 as follows. Let $(\mathcal{X}_i, \rho_i, \mu_i)$, $1 \le i \le n$, be a sequence of metric probability spaces and $f : \mathcal{X}^n \to \mathbb{R}$ a $1$-Lipschitz function with respect to the product metric $\rho^n$. Then, for $X^n \sim \mu^n$ and $t > 0$,

(15) $\displaystyle \mathbb{P}\big( |f(X^n) - \mathbb{E} f(X^n)| > t \big) \le 2\exp\!\left( -\frac{2t^2}{\sum_{i=1}^n \Delta^2(\mathcal{X}_i)} \right).$

We defined the subgaussian diameter in Section 2, showing in Lemma 1 that it never exceeds the metric diameter. We also showed by example that the former can be finite when the latter is infinite. The main result of this section is that $\Delta(\mathcal{X}_i)$ in (15) can essentially be replaced by $\Delta_{SG}(\mathcal{X}_i)$:

Theorem 1.

If $f : \mathcal{X}^n \to \mathbb{R}$ is $1$-Lipschitz (with respect to $\rho^n$), then $\mathbb{E}\,|f(X^n)| < \infty$ and, for all $t > 0$,

$\displaystyle \mathbb{P}\big( |f(X^n) - \mathbb{E} f(X^n)| > t \big) \le 2\exp\!\left( -\frac{t^2}{2 \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i)} \right).$

Our constant in the exponent is worse than that of (15) by a factor of $4$; this appears to be an inherent artifact of our method.

Proof.

The strong integrability of $f(X^n)$ — and in particular, finiteness of $\mathbb{E} f(X^n)$ — follows from exponential concentration (Ledoux, 2001). The rest of the proof will proceed via the Azuma–Hoeffding–McDiarmid method of martingale differences. Define $V_i = \mathbb{E}[f(X^n) \mid X^i] - \mathbb{E}[f(X^n) \mid X^{i-1}]$ and expand

$f(X^n) - \mathbb{E} f(X^n) = \sum_{i=1}^n V_i.$

Let $X_i' \sim \mu_i$ be an independent copy of $X_i$; conditioned on $X^{i-1} = x^{i-1}$, we thus have

$V_i = \mathbb{E}\big[ f(X^n) \mid x^{i-1}, X_i \big] - \mathbb{E}_{X_i'}\, \mathbb{E}\big[ f(X^n) \mid x^{i-1}, X_i' \big].$

Hence, by Jensen’s inequality, we have

$\mathbb{E}\big[ e^{\lambda V_i} \mid x^{i-1} \big] \le \mathbb{E}_{X_i, X_i'}\Big[ \exp\!\Big( \lambda \big( \mathbb{E}[f \mid x^{i-1}, X_i] - \mathbb{E}[f \mid x^{i-1}, X_i'] \big) \Big) \Big].$

For fixed $x^{i-1}$ and $\lambda$, define $g : \mathcal{X}_i \to \mathbb{R}$ by $g(x_i) = \mathbb{E}[f(X^n) \mid x^{i-1}, x_i]$, and observe that $g$ is $1$-Lipschitz with respect to $\rho_i$. Since $g(X_i) - g(X_i')$ is a symmetric random variable and $|g(X_i) - g(X_i')| \le \rho_i(X_i, X_i')$ for all $X_i, X_i'$, we have (an analogous symmetrization technique is employed in http://terrytao.wordpress.com/2009/06/09/talagrands-concentration-inequality as a variant of the “square and rearrange” trick)

$\mathbb{E}\Big[ e^{\lambda (g(X_i) - g(X_i'))} \Big] = \mathbb{E}\Big[ e^{\lambda \varepsilon (g(X_i) - g(X_i'))} \Big] \le \mathbb{E}\Big[ e^{\lambda \varepsilon \rho_i(X_i, X_i')} \Big]$

and hence

(16) $\mathbb{E}\big[ e^{\lambda V_i} \mid x^{i-1} \big] \le \mathbb{E}\, e^{\lambda \Xi_i} \le e^{\lambda^2 \Delta_{SG}^2(\mathcal{X}_i)/2},$

where $\Xi_i$ is the symmetrized distance (4) of $(\mathcal{X}_i, \rho_i, \mu_i)$ and the last inequality holds by definition of the subgaussian diameter (6, 7). It follows that

(17) $\mathbb{E}\, e^{\lambda (f(X^n) - \mathbb{E} f(X^n))} = \mathbb{E} \prod_{i=1}^n e^{\lambda V_i} \le \exp\!\left( \frac{\lambda^2}{2} \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i) \right).$

Applying the standard Markov’s inequality and exponential bounding argument, we have, for all $\lambda > 0$,

(18) $\mathbb{P}\big( f(X^n) - \mathbb{E} f(X^n) > t \big) \le \exp\!\left( \frac{\lambda^2}{2} \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i) - \lambda t \right).$

Optimizing over $\lambda$ and applying the same argument to $-f$ yields our claim. ∎
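To spell out the optimization step (a routine calculation, included for completeness): writing $D^2 = \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i)$, the right-hand side of (18) is minimized at $\lambda = t / D^2$, giving

$\inf_{\lambda > 0} \exp\!\left( \frac{\lambda^2 D^2}{2} - \lambda t \right) = \exp\!\left( -\frac{t^2}{2 D^2} \right),$

and combining the two one-sided bounds (for $f$ and $-f$) via the union bound yields the factor $2$ in Theorem 1.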

Let us see how Theorem 1 compares to previous results on some examples. Consider $\mathcal{X}_i = \mathbb{R}$ equipped with the metric $\rho_i(x, x') = |x - x'|$ and the standard Gaussian product measure $\gamma^n$ on $\mathbb{R}^n$. Let $f : \mathbb{R}^n \to \mathbb{R}$ be $1$-Lipschitz with respect to the product ($\ell_1$) metric $\rho^n$. Then Theorem 1 yields (recalling the calculation in (9))

(19) $\mathbb{P}\big( |f(X^n) - \mathbb{E} f(X^n)| > t \big) \le 2 e^{-t^2 / 4n},$

whereas the inequalities of McDiarmid (1) and Kutin–Niyogi (14) are both uninformative since the metric diameter is infinite.
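The Gaussian example lends itself to a quick numerical check. The sketch below (illustrative only, with ad hoc parameter choices) takes the $1$-Lipschitz function $f(x) = \sum_i x_i$ with respect to the $\ell_1$ product metric, samples it under the standard Gaussian product measure, and compares the observed two-sided tail with the bound (19).

```python
import numpy as np

rng = np.random.default_rng(1)


def theorem1_bound(t, n, sg_diam_sq=2.0):
    """Bound (19): 2*exp(-t^2 / (2 * n * Delta_SG^2)), with Delta_SG^2 = 2 for N(0,1)."""
    return 2.0 * np.exp(-(t**2) / (2.0 * n * sg_diam_sq))


def empirical_tail(t, n=20, trials=500_000):
    """Estimate P(|f(X) - E f(X)| > t) for f(x) = sum(x), X ~ N(0, I_n).

    f is 1-Lipschitz w.r.t. the l1 product metric, and E f(X) = 0.
    """
    f = rng.standard_normal((trials, n)).sum(axis=1)
    return np.mean(np.abs(f) > t)


if __name__ == "__main__":
    n = 20
    for t in (5.0, 10.0, 15.0):
        print(f"t={t:5.1f}  empirical={empirical_tail(t, n):.2e}  "
              f"bound={theorem1_bound(t, n):.2e}")
```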

For our next example, consider a metric probability space whose diameter is finite but large, while its distribution places only small mass on far-apart pairs of points; a calculation analogous to (9) shows that its subgaussian diameter remains of constant order. For independent $X_1, \ldots, X_n$ drawn from such a space and a suitable $1$-Lipschitz function $f$ of the sample, Theorem 1 implies that the bound in (19) holds verbatim in this case. On the other hand, $f$ is easily seen to be weakly difference-bounded, and thus (14) also yields subgaussian concentration, albeit with worse constants. Applying (1), whose exponent is governed by the large metric diameter, yields a much cruder estimate.

5 Application to algorithmic stability

We refer the reader to (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005) for background on algorithmic stability and supervised learning. Our metric probability space will now have the structure $(\mathcal{Z}_i, \rho_i, \mu_i)$ with $\mathcal{Z}_i = \mathcal{X}_i \times \mathcal{Y}_i$, where $\mathcal{X}_i$ and $\mathcal{Y}_i$ are, respectively, the instance and label space of the $i$th example. Under the iid assumption, the $(\mathcal{Z}_i, \rho_i, \mu_i)$ are identical for all $i$ (and so we will henceforth drop the subscript from these). A training sample $S = (Z_1, \ldots, Z_n) \sim \mu^n$ is drawn, and a learning algorithm $A$ inputs $S$ and outputs a hypothesis. The hypothesis will be denoted by $A_S : \mathcal{X} \to \mathcal{Y}$. In line with the previous literature, we assume that $A$ is symmetric (i.e., invariant under permutations of $S$). The loss of a hypothesis $h$ on an example $z = (x, y)$ is defined by

$\ell(h, z) = \phi(h(x), y),$

where $\phi : \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ is the cost function. To our knowledge, all previous work required the loss to be bounded by some constant $M$, which figures explicitly in the bounds; we make no such restriction.

In the algorithmic stability setting, the empirical risk is typically defined as

(20) $\displaystyle \hat{R}(A_S) = \frac{1}{n} \sum_{i=1}^n \ell(A_S, Z_i)$

and the true risk as

(21) $\displaystyle R(A_S) = \mathbb{E}_{Z \sim \mu}\big[ \ell(A_S, Z) \big].$
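As a concrete rendering of (20) and (21) (entirely illustrative: the toy hypothesis, the absolute-error cost, and all names below are assumptions made for this example, not constructs from the paper), the sketch computes the empirical risk of a fixed hypothesis on a sample and estimates its true risk by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(2)


def draw_example():
    """One example Z = (X, Y) with Y = X + noise; stands in for the unknown mu."""
    x = rng.normal()
    return x, x + 0.1 * rng.normal()


def loss(h, z):
    """Loss ell(h, z) = phi(h(x), y) with the absolute-error cost phi(y', y) = |y' - y|."""
    x, y = z
    return abs(h(x) - y)


def empirical_risk(h, sample):
    """(20): average loss of h over the training sample."""
    return float(np.mean([loss(h, z) for z in sample]))


def true_risk(h, sampler, trials=100_000):
    """(21): Monte Carlo estimate of E_{Z ~ mu} ell(h, Z)."""
    return float(np.mean([loss(h, sampler()) for _ in range(trials)]))


def identity_hypothesis(x):
    """A toy stand-in for the learned hypothesis A_S."""
    return x


if __name__ == "__main__":
    sample = [draw_example() for _ in range(50)]
    print("empirical risk:", empirical_risk(identity_hypothesis, sample))
    print("true risk     :", true_risk(identity_hypothesis, draw_example))
```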

The goal is to bound the excess risk $R(A_S) - \hat{R}(A_S)$. To this end, a myriad of notions of hypothesis stability have been proposed. A variant of uniform stability in the sense of Rakhlin et al. (2005) — which is slightly more general than the homonymous notion in Bousquet and Elisseeff (2002) — may be defined as follows. The algorithm $A$ is said to be $\beta$-uniformly stable if for all $z \in \mathcal{Z}$, the function $g_z : \mathcal{Z}^n \to \mathbb{R}$ given by $g_z(s) = \ell(A_s, z)$ is $\beta$-Lipschitz with respect to the Hamming metric on $\mathcal{Z}^n$:

$\big| \ell(A_s, z) - \ell(A_{s'}, z) \big| \le \beta \sum_{i=1}^n \mathbb{1}_{\{ s_i \ne s_i' \}}, \qquad s, s' \in \mathcal{Z}^n,\ z \in \mathcal{Z}.$

We define the algorithm $A$ to be $\beta$-totally Lipschitz stable if the function $g : \mathcal{Z}^n \times \mathcal{Z} \to \mathbb{R}$ given by $g(s, z) = \ell(A_s, z)$ is $\beta$-Lipschitz with respect to the product metric on $\mathcal{Z}^{n+1}$:

(22) $\big| \ell(A_s, z) - \ell(A_{s'}, z') \big| \le \beta \left( \sum_{i=1}^n \rho(s_i, s_i') + \rho(z, z') \right), \qquad s, s' \in \mathcal{Z}^n,\ z, z' \in \mathcal{Z}.$

Note that total Lipschitz stability is stronger than uniform stability since it requires the algorithm to respect the metric of .

Let us bound the bias of stable algorithms.

Lemma 2.

Suppose $A$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm over the metric probability space $(\mathcal{Z}, \rho, \mu)$ with $\Delta_{SG}(\mathcal{Z}) < \infty$. Then

$\mathbb{E}\big[ R(A_S) - \hat{R}(A_S) \big] \le \sqrt{2 \ln 2}\; \beta\, \Delta_{SG}(\mathcal{Z}).$

Proof.

Observe, as in the proof of (Bousquet and Elisseeff, 2002, Lemma 7), that for all $1 \le i \le n$,

(23) $\mathbb{E}\big[ R(A_S) - \hat{R}(A_S) \big] = \mathbb{E}\big[ \ell(A_{S^{(i)}}, Z_i) - \ell(A_S, Z_i) \big],$

where $Z_i' \sim \mu$ and $S^{(i)}$ is generated from $S$ via the process defined in (10). For fixed $i$ and $S, Z_i'$, define

$\Psi = \ell(A_{S^{(i)}}, Z_i) - \ell(A_S, Z_i),$

and note that (22) implies that $|\Psi| \le \beta\, \rho(Z_i, Z_i')$. Now rewrite (23) as

(24) $\mathbb{E}\big[ R(A_S) - \hat{R}(A_S) \big] = \mathbb{E}[\Psi].$

Invoking Jensen’s inequality and the argument in (16), for every $\lambda > 0$,

$e^{\lambda \mathbb{E}[\Psi]} \le \mathbb{E}\, e^{\lambda \Psi} \le \mathbb{E}\, e^{\lambda \beta \rho(Z_i, Z_i')} \le 2\, \mathbb{E}\, e^{\lambda \beta \Xi} \le 2 \exp\!\left( \frac{\lambda^2 \beta^2 \Delta_{SG}^2(\mathcal{Z})}{2} \right),$

where $\Xi$ is the symmetrized distance of $(\mathcal{Z}, \rho, \mu)$. Taking logarithms yields the estimate

(25) $\mathbb{E}[\Psi] \le \inf_{\lambda > 0} \left( \frac{\ln 2}{\lambda} + \frac{\lambda \beta^2 \Delta_{SG}^2(\mathcal{Z})}{2} \right) = \sqrt{2 \ln 2}\; \beta\, \Delta_{SG}(\mathcal{Z}),$

which, after substituting (25) into (24), proves the claim. ∎

We now turn to the Lipschitz continuity of the excess risk.

Lemma 3.

Suppose $A$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm and define the excess risk function $\Phi : \mathcal{Z}^n \to \mathbb{R}$ by $\Phi(s) = R(A_s) - \hat{R}(A_s)$. Then $\Phi$ is $3\beta$-Lipschitz with respect to the product metric on $\mathcal{Z}^n$.

Proof.

We examine the two summands separately. The definition (21) of $R$ implies that the latter is $\beta$-Lipschitz, since it is a convex combination of the $\beta$-Lipschitz functions $s \mapsto \ell(A_s, z)$. Now $\hat{R}$ defined in (20) is also a convex combination of $\beta$-Lipschitz functions, but because $z_i$ appears twice in the $i$th summand $\ell(A_s, z_i)$ (once through the hypothesis $A_s$ and once as the evaluation point), changing $z_i$ to $z_i'$ could incur a difference of up to $2\beta\, \rho(z_i, z_i')$. Hence, $\hat{R}$ is $2\beta$-Lipschitz. As Lipschitz constants are subadditive, the claim is proved. ∎

Combining Lemmas 2 and 3 with our concentration inequality in Theorem 1 yields the main result of this section:

Theorem 2.

Suppose $A$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm over the metric probability space $(\mathcal{Z}, \rho, \mu)$ with $\Delta_{SG}(\mathcal{Z}) < \infty$. Then, for training samples $S \sim \mu^n$ and all $t > 0$, we have

$\displaystyle \mathbb{P}\Big( R(A_S) - \hat{R}(A_S) > \sqrt{2 \ln 2}\; \beta\, \Delta_{SG}(\mathcal{Z}) + t \Big) \le \exp\!\left( -\frac{t^2}{18\, n\, \beta^2\, \Delta_{SG}^2(\mathcal{Z})} \right).$

As in Bousquet and Elisseeff (2002) and related results on algorithmic stability, we require $\beta = o(1/\sqrt{n})$ for exponential decay. Bousquet and Elisseeff showed that this is indeed the case for some popular learning algorithms, albeit in their less restrictive definition of stability. We conjecture that many of these algorithms continue to be stable in our stronger sense and plan to explore this in future work.

6 Relaxing the independence assumption

In this section we generalize Theorem 1 to strongly mixing processes. To this end, we require some standard facts concerning the probability-theoretic notions of coupling and transportation (Lindvall, 2002; Villani, 2003, 2009). Given two probability measures $p, q$ on a measurable space $\mathcal{X}$, a coupling of $(p, q)$ is any probability measure $\pi$ on $\mathcal{X} \times \mathcal{X}$ with marginals $p$ and $q$, respectively. Denoting by $\Pi(p, q)$ the set of all couplings of $(p, q)$, we have

(26) $\displaystyle \| p - q \|_{\mathrm{TV}} = \inf_{\pi \in \Pi(p, q)} \mathbb{P}_{(X, X') \sim \pi}\big( X \ne X' \big),$

where $\|\cdot\|_{\mathrm{TV}}$ is the total variation norm. An optimal coupling is one that achieves the infimum in (26); one always exists, though it may not be unique. Another elementary property of couplings is that for any two probability measures $p, q$, any coupling $\pi \in \Pi(p, q)$, and any integrable $\varphi : \mathcal{X} \to \mathbb{R}$, we have

(27) $\displaystyle \left| \int_{\mathcal{X}} \varphi\, dp - \int_{\mathcal{X}} \varphi\, dq \right| \le \mathbb{E}_{(X, X') \sim \pi}\big| \varphi(X) - \varphi(X') \big|.$

It is possible to refine the total variation distance between $p$ and $q$ so as to respect the metric of $\mathcal{X}$. Given a space $\mathcal{X}$ equipped with probability measures $p, q$ and metric $\rho$, define the transportation cost distance by

$\displaystyle T_{\rho}(p, q) = \inf_{\pi \in \Pi(p, q)} \mathbb{E}_{(X, X') \sim \pi}\big[ \rho(X, X') \big].$

(This fundamental notion is also known as the Wasserstein, Monge–Kantorovich, or earthmover distance; see Villani (2003, 2009) for an encyclopedic treatment. The use of coupling and transportation techniques to obtain concentration for dependent random variables goes back to Marton (1996); Samson (2000); Chazottes et al. (2007).)

It is easy to verify that $T_{\rho}$ is a valid metric on probability measures and that for the discrete metric $\rho(x, x') = \mathbb{1}_{\{x \ne x'\}}$, we have $T_{\rho}(p, q) = \| p - q \|_{\mathrm{TV}}$.
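As a small illustration of the transportation cost distance (not part of the paper's development; the one-dimensional quantile coupling and the finite-alphabet identity below are standard facts), the following sketch computes $T_\rho$ between two equal-size empirical samples on $(\mathbb{R}, |\cdot|)$ and evaluates the total variation distance to which $T_\rho$ reduces under the discrete metric.

```python
import numpy as np


def transport_cost_1d(xs, ys):
    """T_rho for two equal-size empirical measures on (R, |.|).

    On the line, the quantile (sorted) coupling is optimal for this cost,
    so T_rho = mean |x_(i) - y_(i)|.
    """
    xs, ys = np.sort(np.asarray(xs)), np.sort(np.asarray(ys))
    assert xs.shape == ys.shape
    return np.mean(np.abs(xs - ys))


def tv_distance(p, q):
    """Total variation between two distributions on a finite alphabet."""
    return 0.5 * np.sum(np.abs(np.asarray(p) - np.asarray(q)))


if __name__ == "__main__":
    rng = np.random.default_rng(3)
    a = rng.normal(0.0, 1.0, size=10_000)
    b = rng.normal(0.5, 1.0, size=10_000)
    print("T_rho(N(0,1), N(0.5,1)) ~", transport_cost_1d(a, b))   # about 0.5

    # With the discrete metric rho(x, x') = 1[x != x'], T_rho equals TV.
    p, q = [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]
    print("TV(p, q) =", tv_distance(p, q))                        # 0.1
```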

As in Section 4, we consider a sequence of metric spaces $(\mathcal{X}_i, \rho_i)$, $1 \le i \le n$, and their product $(\mathcal{X}^n, \rho^n)$. Unlike the independent case, we will allow nonproduct probability measures $\mu$ on $\mathcal{X}^n$. We will write $X^n \sim \mu$ to mean that $\mathbb{P}(X^n \in A) = \mu(A)$ for all Borel $A \subseteq \mathcal{X}^n$. For $1 \le i \le j \le n$, we will use the shorthand

$x_i^j = (x_i, x_{i+1}, \ldots, x_j).$

The notation $\mu_i^j$ means the marginal distribution of $X_i^j$. Similarly, $\mu_i^j(\cdot \mid x^{i-1})$ will denote the conditional distribution of $X_i^j$ given $X^{i-1} = x^{i-1}$. For $1 \le i < n$, $x^{i-1} \in \mathcal{X}^{i-1}$, and $w, w' \in \mathcal{X}_i$, define

$\tau_i(x^{i-1}, w, w') = T_{\rho_{i+1}^n}\Big( \mu_{i+1}^n(\cdot \mid x^{i-1} w),\; \mu_{i+1}^n(\cdot \mid x^{i-1} w') \Big),$

where $\rho_{i+1}^n$ is the product of $\rho_{i+1}, \ldots, \rho_n$ as in (3), and

$\bar{\tau}_i = \sup_{x^{i-1},\, w,\, w'} \tau_i(x^{i-1}, w, w'),$

with the supremum ranging over $x^{i-1} \in \mathcal{X}^{i-1}$ and $w, w' \in \mathcal{X}_i$ (we also put $\bar{\tau}_n = 0$). In words, $\tau_i$ measures the transportation cost distance between the conditional distributions induced on the “tail” $X_{i+1}^n$ given two prefixes that differ in the $i$th coordinate, and $\bar{\tau}_i$ is the maximal value of this quantity. Kontorovich (2007); Kontorovich and Ramanan (2008) discuss how to handle conditioning on measure-zero sets and other technicalities. Note that for product measures the conditional distributions are identical and hence $\bar{\tau}_i = 0$ for all $i$.

We need one more definition before stating our main result. For a prefix $x^{i-1} \in \mathcal{X}^{i-1}$, define the conditional distribution

$\mu_i(\cdot \mid x^{i-1}) = \mathbb{P}\big( X_i \in \cdot \mid X^{i-1} = x^{i-1} \big),$

and consider the corresponding metric probability space $\big( \mathcal{X}_i, \rho_i, \mu_i(\cdot \mid x^{i-1}) \big)$. Define its conditional subgaussian diameter by

$\Delta_{SG}(\mathcal{X}_i \mid x^{i-1}) = \Delta_{SG}\big( \mathcal{X}_i, \rho_i, \mu_i(\cdot \mid x^{i-1}) \big),$

and the maximal subgaussian diameter by

(28) $\displaystyle \Delta_{SG}(\mathcal{X}_i) = \sup_{x^{i-1} \in \mathcal{X}^{i-1}} \Delta_{SG}(\mathcal{X}_i \mid x^{i-1}).$

Note that for product measures, (28) reduces to the former definition (7). With these definitions, we may state the main result of this section.

Theorem 3.

If $f : \mathcal{X}^n \to \mathbb{R}$ is $1$-Lipschitz with respect to $\rho^n$, then for all $t > 0$,

$\displaystyle \mathbb{P}\big( |f(X^n) - \mathbb{E} f(X^n)| > t \big) \le 2\exp\!\left( -\frac{t^2}{2 \sum_{i=1}^n \big( \Delta_{SG}(\mathcal{X}_i) + \bar{\tau}_i \big)^2} \right).$

Observe that we recover Theorem 1 as a special case, since $\bar{\tau}_i = 0$ for product measures and (28) then coincides with (7). Since typically we will take the $\mathcal{X}_i$ to be copies of a fixed space, it suffices that $\Delta_{SG}(\mathcal{X}_i) = O(1)$ and $\bar{\tau}_i = O(1)$ to ensure an exponential bound with decay rate $\Omega(t^2 / n)$.

Proof.

We begin by considering the martingale difference

$V_i = \mathbb{E}\big[ f(X^n) \mid X^i \big] - \mathbb{E}\big[ f(X^n) \mid X^{i-1} \big],$

as in the proof of Theorem 1. More explicitly, conditioned on $X^i = x^{i-1} x_i$,

$V_i = \mathbb{E}\big[ f(X^n) \mid x^{i-1} x_i \big] - \mathbb{E}\big[ f(X^n) \mid x^{i-1} \big].$

Define $\hat{\mu}$ to be $\mu_{i+1}^n$ conditioned on $X^i = x^{i-1} x_i$. Then

(29) $\mathbb{E}\big[ f(X^n) \mid x^{i-1} x_i \big] = \mathbb{E}_{Y \sim \hat{\mu}}\big[ f(x^{i-1} x_i Y) \big].$

Let $\pi$ be an optimal coupling realizing the infimum in the transportation cost distance used to define $\tau_i(x^{i-1}, x_i, x_i')$. Recalling (27), we have