Concentration in unbounded metric spaces and algorithmic stability
Abstract
We prove an extension of McDiarmid’s inequality for metric spaces with unbounded diameter. To this end, we introduce the notion of the subgaussian diameter, which is a distribution-dependent refinement of the metric diameter. Our technique provides an alternative to Kutin and Niyogi’s method of weakly difference-bounded functions, and yields nontrivial, dimension-free results in some interesting cases where the former does not. As an application, we give apparently the first generalization bound in the algorithmic stability setting that holds for unbounded loss functions. We give two extensions of the basic concentration result: to strongly mixing processes and to other Orlicz norms.
1 Introduction
Concentration of measure inequalities are at the heart of statistical learning theory. Roughly speaking, concentration allows one to conclude that the performance of a (sufficiently “stable”) algorithm on a (sufficiently “close to iid”) sample is indicative of the algorithm’s performance on future data. Quantifying what it means for an algorithm to be stable and for the sampling process to be close to iid is by no means straightforward and much recent work has been motivated by these questions. It turns out that the various notions of stability are naturally expressed in terms of the Lipschitz continuity of the algorithm in question (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2010), while appropriate relaxations of the iid assumption are achieved using various kinds of strong mixing (Karandikar and Vidyasagar, 2002; Gamarnik, 2003; Rostamizadeh and Mohri, 2007; Mohri and Rostamizadeh, 2008; Steinwart and Christmann, 2009; Steinwart et al., 2009; Zou et al., ; Mohri and Rostamizadeh, 2010; London et al., 2012, 2013; Shalizi and Kontorovich, 2013).
An elegant and powerful workhorse driving many of the aforementioned results is McDiarmid’s inequality (McDiarmid, 1989):
$$\mathbb{P}(|\varphi - \mathbb{E}\varphi| > t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n w_i^2}\right), \tag{1}$$
where $\varphi$ is a real-valued function of the sequence of independent random variables $X = (X_1, \ldots, X_n)$, such that
$$|\varphi(x) - \varphi(x')| \le w_i \tag{2}$$
whenever $x$ and $x'$ differ only in the $i$th coordinate. Aside from being instrumental in proving PAC bounds (Boucheron et al., 2005), McDiarmid’s inequality has also found use in algorithmic stability results (Bousquet and Elisseeff, 2002). Non-iid extensions of (1) have also been considered (Marton, 1996; Rio, 2000; Chazottes et al., 2007; Kontorovich and Ramanan, 2008).
The distribution-free nature of McDiarmid’s inequality makes it an attractive tool in learning theory, but also imposes inherent limitations on its applicability. Chief among these limitations is the inability of (1) to provide risk bounds for unbounded loss functions. Even in the bounded case, if the Lipschitz condition (2) holds not everywhere but only with high probability (say, with a much larger constant on a small set of exceptions), the bound in (1) still charges the full cost of the worst-case constant. To counter this difficulty, Kutin (2002); Kutin and Niyogi (2002) introduced an extension of McDiarmid’s inequality to weakly difference-bounded functions and used it to analyze the risk of “almost-everywhere” stable algorithms. This influential result has been invoked in a number of recent papers (El-Yaniv and Pechyony, 2006; Mukherjee et al., 2006; Hush et al., 2007; Agarwal and Niyogi, 2009; Shalev-Shwartz et al., 2010; Rubinstein and Simma, 2012).
However, the approach of Kutin and Niyogi entails some difficulties as well. These come in two flavors: analytical (complex statement and proof) and practical (conditions are still too restrictive in some cases); we will elaborate upon this in Section 3. In this paper, we propose an alternative approach to the concentration of “almost-everywhere” or “average-case” Lipschitz functions. To this end, we introduce the notion of the subgaussian diameter of a metric probability space. The latter may be finite even when the metric diameter is infinite, and we show that this notion generalizes the more restrictive property of bounded differences.
Main results.
This paper’s principal contributions include defining the subgaussian diameter of a metric probability space and identifying its role in relaxing the bounded differences condition. In Theorem 1, we show that the subgaussian diameter can essentially replace the far more restrictive metric diameter in concentration bounds. This result has direct ramifications for algorithmic stability (Theorem 2). We furthermore extend our concentration inequality to nonindependent processes (Theorem 3) and to other Orlicz norms (Theorem 4).
Outline of paper.
In Section 2 we define the subgaussian diameter and relate it to (weakly) bounded differences in Section 3. We state and prove the concentration inequality based on this notion in Section 4 and give an application to algorithmic stability in Section 5. We then give an extension to nonindependent data in Section 6 and discuss other Orlicz norms in Section 7. Conclusions and some open problems are presented in Section 8.
2 Preliminaries
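A metric probability space $(\mathcal{X}, \rho, \mu)$ is a measurable space $\mathcal{X}$ whose Borel $\sigma$-algebra is induced by the metric $\rho$, endowed with the probability measure $\mu$. Our results are most cleanly presented when $\mathcal{X}$ is a discrete set but they continue to hold verbatim for Borel probability measures on Polish spaces. It will be convenient to write $\mathbb{E}\varphi = \sum_{x \in \mathcal{X}} \mathbb{P}(x)\varphi(x)$ even when the latter is an integral. Random variables are capitalized ($X$), specified sequences are written in lowercase, the notation $X_i^j = (X_i, \ldots, X_j)$ is used for all sequences, and sequence concatenation is denoted multiplicatively: $x_i^j x_{j+1}^k = x_i^k$. We will frequently use the shorthand $\mathbb{P}(x_i^j) = \prod_{k=i}^{j} \mathbb{P}(X_k = x_k)$. Standard order of magnitude notation such as $O(\cdot)$ and $\Omega(\cdot)$ will be used.
A function $\varphi: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz if
$$|\varphi(x) - \varphi(x')| \le L\rho(x, x'), \qquad x, x' \in \mathcal{X}.$$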
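Let $(\mathcal{X}_i, \rho_i, \mu_i)$, $i = 1, \ldots, n$, be a sequence of metric probability spaces. We define the product probability space
$$\mathcal{X}^n = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n$$
with the product measure
$$\mu^n = \mu_1 \times \mu_2 \times \cdots \times \mu_n$$
and product metric
$$\rho^n(x, x') = \sum_{i=1}^n \rho_i(x_i, x_i'), \qquad x, x' \in \mathcal{X}^n. \tag{3}$$
We will denote partial products by
$$\mathcal{X}_i^j = \mathcal{X}_i \times \mathcal{X}_{i+1} \times \cdots \times \mathcal{X}_j.$$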
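We write $X_i \sim \mu_i$ to mean that $X_i$ is an $\mathcal{X}_i$-valued random variable with law $\mu_i$; that is, $\mathbb{P}(X_i \in A) = \mu_i(A)$ for all Borel $A \subset \mathcal{X}_i$. This notation extends naturally to sequences: $X_1^n \sim \mu^n$. We will associate to each $(\mathcal{X}_i, \rho_i, \mu_i)$ the symmetrized distance random variable $\Xi(\mathcal{X}_i)$ defined by
$$\Xi(\mathcal{X}_i) = \epsilon_i \rho_i(X_i, X_i'), \tag{4}$$
where $X_i, X_i' \sim \mu_i$ are independent and $\epsilon_i = \pm 1$ with probability $1/2$, independent of $X_i, X_i'$. We note right away that $\Xi(\mathcal{X}_i)$ is a centered random variable:
$$\mathbb{E}[\Xi(\mathcal{X}_i)] = 0. \tag{5}$$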
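A real-valued random variable $X$ is said to be subgaussian if it admits a $\sigma > 0$ such that
$$\mathbb{E} e^{\lambda X} \le \exp\left(\frac{\sigma^2 \lambda^2}{2}\right), \qquad \lambda \in \mathbb{R}. \tag{6}$$
The smallest $\sigma$ for which (6) holds will be denoted by $\sigma^*(X)$.
We define the subgaussian diameter $\Delta_{SG}(\mathcal{X}_i)$ of the metric probability space $(\mathcal{X}_i, \rho_i, \mu_i)$ in terms of its symmetrized distance $\Xi(\mathcal{X}_i)$:
$$\Delta_{SG}(\mathcal{X}_i) = \sigma^*(\Xi(\mathcal{X}_i)). \tag{7}$$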
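If a metric probability space $(\mathcal{X}, \rho, \mu)$ has finite diameter,
$$\operatorname{diam}(\mathcal{X}) := \sup_{x, x' \in \mathcal{X}} \rho(x, x') < \infty,$$
then its subgaussian diameter is also finite:
Lemma 1.
$\Delta_{SG}(\mathcal{X}) \le \operatorname{diam}(\mathcal{X})$.
Proof.
Let $\Xi = \Xi(\mathcal{X})$ be the symmetrized distance. By (5), we have $\mathbb{E}[\Xi] = 0$ and certainly $|\Xi| \le \operatorname{diam}(\mathcal{X})$. Hence,
$$\mathbb{E} e^{\lambda \Xi} \le \exp\left(\frac{(\operatorname{diam}(\mathcal{X})\,\lambda)^2}{2}\right), \qquad \lambda \in \mathbb{R},$$
where the inequality follows from Hoeffding’s Lemma applied to a centered random variable supported on an interval of length $2\operatorname{diam}(\mathcal{X})$. ∎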
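The bound in Lemma 1 is nearly tight in the sense that for every $\varepsilon > 0$ there is a metric probability space $(\mathcal{X}, \rho, \mu)$ for which
$$\Delta_{SG}(\mathcal{X}) \ge \operatorname{diam}(\mathcal{X}) - \varepsilon. \tag{8}$$
To see this, take $\mathcal{X}$ to be an $N$-point space with the uniform distribution and $\rho(x, x') = 1$ for all distinct $x, x' \in \mathcal{X}$. Taking $N$ sufficiently large makes $\Delta_{SG}(\mathcal{X})$ arbitrarily close to $\operatorname{diam}(\mathcal{X}) = 1$. We do not know whether $\Delta_{SG}(\mathcal{X}) = \operatorname{diam}(\mathcal{X})$ can be achieved.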
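On the other hand, there exist unbounded metric probability spaces with finite subgaussian diameter. A simple example is $(\mathcal{X}, \rho, \mu)$ with $\mathcal{X} = \mathbb{R}$, $\rho(x, x') = |x - x'|$, and the standard Gaussian probability measure $\mu = N(0, 1)$. Obviously, $\operatorname{diam}(\mathcal{X}) = \infty$. Now the symmetrized distance $\Xi = \Xi(\mathcal{X})$ is distributed as the difference (=sum) of two standard Gaussians: $\Xi \sim N(0, 2)$. Since $\mathbb{E} e^{\lambda \Xi} = e^{\lambda^2}$, we have
$$\Delta_{SG}(\mathcal{X}) = \sqrt{2}. \tag{9}$$
More generally, the subgaussian distributions on $\mathbb{R}$ are precisely those for which $\Delta_{SG}(\mathbb{R}) < \infty$.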
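As a numerical illustration of the near-tightness example above, the following Python sketch estimates the subgaussian diameter of the uniform $N$-point space with unit distances. It assumes the moment-generating-function form of the subgaussian condition, $\mathbb{E} e^{\lambda \Xi} \le \exp(\sigma^2 \lambda^2 / 2)$, and approximates the supremum over $\lambda$ on a finite grid, so the output is only a lower estimate of the true value. The function names and grid parameters are illustrative choices and do not come from the paper.

```python
import math


def symmetrized_log_mgf(lam: float, n_points: int) -> float:
    """log E exp(lam * Xi) for the uniform N-point space with unit distances.

    Here Xi = 0 with probability 1/N (the two independent draws coincide)
    and Xi = +1 or -1 with probability (1 - 1/N)/2 each, so that
    E exp(lam * Xi) = 1 + q * (cosh(lam) - 1) with q = 1 - 1/N.
    """
    q = 1.0 - 1.0 / n_points
    # cosh(lam) - 1 = 2 sinh(lam/2)^2, which is numerically stable for small lam.
    cosh_minus_one = 2.0 * math.sinh(lam / 2.0) ** 2
    return math.log1p(q * cosh_minus_one)


def subgaussian_diameter_estimate(n_points: int,
                                  lam_max: float = 20.0,
                                  steps: int = 20000) -> float:
    """Grid estimate of sqrt(sup_lam 2 * log E exp(lam * Xi) / lam^2).

    Xi is symmetric, so only positive lam need to be scanned.
    """
    best = 0.0
    for k in range(1, steps + 1):
        lam = lam_max * k / steps
        best = max(best, 2.0 * symmetrized_log_mgf(lam, n_points) / lam ** 2)
    return math.sqrt(best)


if __name__ == "__main__":
    for n_points in (2, 10, 100, 1000):
        print(n_points, subgaussian_diameter_estimate(n_points))
```

On this example the estimate stays below the metric diameter 1, as Lemma 1 requires, and moves toward it as $N$ grows.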
3 Related work
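McDiarmid’s inequality (1) suffers from the limitations mentioned above: it completely ignores the distribution and is vacuous if even one of the constants $w_i$ in (2) is infinite. (Note, though, that McDiarmid’s inequality is sharp in the sense that the constants in (1) cannot be improved in a distribution-free fashion.) In order to address some of these issues, Kutin (2002); Kutin and Niyogi (2002) proposed an extension of McDiarmid’s inequality to “almost everywhere” Lipschitz functions $\varphi: \mathcal{X}^n \to \mathbb{R}$. To formalize this, fix an $i \in \{1, \ldots, n\}$ and let $X_1^n \sim \mu^n$ and $X_i' \sim \mu_i$ be independent. Define $\tilde{X}_1^n = \tilde{X}_1^n(i)$ by
$$\tilde{X}_j(i) = \begin{cases} X_j, & j \ne i, \\ X_i', & j = i. \end{cases} \tag{10}$$
Kutin and Niyogi define $\varphi$ to be weakly difference-bounded by $(b, c, \delta)$ if
$$|\varphi(X_1^n) - \varphi(\tilde{X}_1^n(i))| \le b \tag{11}$$
and
$$\mathbb{P}\left(|\varphi(X_1^n) - \varphi(\tilde{X}_1^n(i))| > c\right) < \delta \tag{12}$$
for all $1 \le i \le n$.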
The precise result of Kutin (2002, Theorem 1.10) is somewhat unwieldy to state; indeed, the present work was motivated in part by a desire for simpler tools. Assuming that $\varphi$ is weakly difference-bounded by $(b, c, \delta)$ with
$$\delta = \exp(-\Omega(n)) \tag{13}$$
and $c = O(1/n)$, their bound states that
$$\mathbb{P}(|\varphi - \mathbb{E}\varphi| \ge t) \le \exp(-\Omega(n t^2)) \tag{14}$$
for a certain range of $t$ and $n$. As noted by Rakhlin et al. (2005), the exponential decay assumption (13) is necessary in order for the Kutin-Niyogi method to yield exponential concentration. In contrast, the bounds we prove here
(i) do not require $\varphi$ to be everywhere bounded as in (11);
(ii) have a simple statement and proof, and generalize to non-iid processes with relative ease.
We defer the quantitative comparisons between (14) and our results until the latter are formally stated in Section 4.
In a different line of work, Bentkus (2008) considered an extension of Hoeffding’s inequality to unbounded random variables. His bound only holds for sums (as opposed to general Lipschitz functions) and the summands must be nonnegative (i.e., unbounded only in the positive direction). An earlier notion of “effective” metric diameter in the context of concentration is that of metric space length (Schechtman, 1982). Another distribution-dependent refinement of diameter is the spread constant (Alon et al., 1998). Lecué and Mendelson (2013) gave minimax bounds for empirical risk minimization over subgaussian classes.
4 Concentration via subgaussian diameter
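McDiarmid’s inequality (1) may be stated in the notation of Section 2 as follows. Let $(\mathcal{X}_i, \rho_i, \mu_i)$, $i = 1, \ldots, n$, be a sequence of metric probability spaces and $\varphi: \mathcal{X}^n \to \mathbb{R}$ a 1-Lipschitz function. Then
$$\mathbb{P}(|\varphi - \mathbb{E}\varphi| > t) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n \operatorname{diam}(\mathcal{X}_i)^2}\right). \tag{15}$$
We defined the subgaussian diameter in Section 2, showing in Lemma 1 that it never exceeds the metric diameter. We also showed by example that the former can be finite when the latter is infinite. The main result of this section is that $\operatorname{diam}(\mathcal{X}_i)$ in (15) can essentially be replaced by $\Delta_{SG}(\mathcal{X}_i)$:
Theorem 1.
If $\varphi: \mathcal{X}^n \to \mathbb{R}$ is 1-Lipschitz then $\mathbb{E}\varphi < \infty$ and
$$\mathbb{P}(|\varphi - \mathbb{E}\varphi| > t) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i)}\right).$$
Our constant in the exponent is worse than that of (15) by a factor of 4; this appears to be an inherent artifact of our method.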
Proof.
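The strong integrability of $\varphi$ (and in particular, finiteness of $\mathbb{E}\varphi$) follows from exponential concentration (Ledoux, 2001). The rest of the proof will proceed via the Azuma-Hoeffding-McDiarmid method of martingale differences. Define $V_i = \mathbb{E}[\varphi \mid X_1^i] - \mathbb{E}[\varphi \mid X_1^{i-1}]$ and expand
$$\mathbb{E}[\varphi \mid X_1^i] = \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n)\,\varphi(X_1^i x_{i+1}^n), \qquad \mathbb{E}[\varphi \mid X_1^{i-1}] = \sum_{x_i^n} \mathbb{P}(x_i^n)\,\varphi(X_1^{i-1} x_i^n).$$
Let $\tilde{V}_i$ be $V_i$ conditioned on $X_1^{i-1}$; thus,
$$\tilde{V}_i = \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n) \sum_{x_i'} \mathbb{P}(x_i') \left[\varphi(X_1^{i-1} X_i x_{i+1}^n) - \varphi(X_1^{i-1} x_i' x_{i+1}^n)\right].$$
Hence, by Jensen’s inequality, we have
$$\mathbb{E}[e^{\lambda V_i} \mid X_1^{i-1}] \le \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n) \sum_{x_i, x_i'} \mathbb{P}(x_i)\mathbb{P}(x_i') \exp\left(\lambda\left[\varphi(X_1^{i-1} x_i x_{i+1}^n) - \varphi(X_1^{i-1} x_i' x_{i+1}^n)\right]\right).$$
For fixed $X_1^{i-1}$ and $x_{i+1}^n$, define $F: \mathcal{X}_i \to \mathbb{R}$ by $F(y) = \varphi(X_1^{i-1} y\, x_{i+1}^n)$, and observe that $F$ is 1-Lipschitz with respect to $\rho_i$. Since $e^t + e^{-t} = 2\cosh(t)$ and $\cosh(t) \le \cosh(s)$ for all $|t| \le s$, we have
$$e^{\lambda(F(y) - F(y'))} + e^{\lambda(F(y') - F(y))} \le e^{\lambda \rho_i(y, y')} + e^{-\lambda \rho_i(y, y')},$$
and hence
$$\sum_{y, y'} \mathbb{P}(y)\mathbb{P}(y')\, e^{\lambda(F(y) - F(y'))} \le \frac{1}{2} \sum_{y, y'} \mathbb{P}(y)\mathbb{P}(y') \left(e^{\lambda \rho_i(y, y')} + e^{-\lambda \rho_i(y, y')}\right) = \mathbb{E} e^{\lambda \Xi(\mathcal{X}_i)} \le \exp\left(\frac{\lambda^2 \Delta_{SG}^2(\mathcal{X}_i)}{2}\right), \tag{16}$$
where $\Xi(\mathcal{X}_i)$ is the symmetrized distance (4) and the last inequality holds by definition of subgaussian diameter (6, 7). (An analogous symmetrization technique is employed in http://terrytao.wordpress.com/2009/06/09/talagrandsconcentrationinequality as a variant of the “square and rearrange” trick.) It follows that
$$\mathbb{E}[e^{\lambda V_i} \mid X_1^{i-1}] \le \exp\left(\frac{\lambda^2 \Delta_{SG}^2(\mathcal{X}_i)}{2}\right). \tag{17}$$
Applying the standard Markov’s inequality and exponential bounding argument, we have
$$\mathbb{P}(\varphi - \mathbb{E}\varphi > t) = \mathbb{P}\left(\sum_{i=1}^n V_i > t\right) \le e^{-\lambda t}\, \mathbb{E} \prod_{i=1}^n e^{\lambda V_i} \le \exp\left(\frac{\lambda^2}{2} \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i) - \lambda t\right). \tag{18}$$
Optimizing over $\lambda$ and applying the same argument to $\mathbb{E}\varphi - \varphi$ yields our claim. ∎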
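The final optimization step of the proof can be written out explicitly. The following is a sketch, assuming that the exponential bounding argument gives $\mathbb{P}(\varphi - \mathbb{E}\varphi > t) \le \exp(\tfrac{1}{2}\lambda^2 S - \lambda t)$ for every $\lambda > 0$; the shorthand $S$ for the sum of squared subgaussian diameters is introduced here only for illustration.

```latex
\begin{align*}
S &:= \sum_{i=1}^n \Delta_{SG}^2(\mathcal{X}_i), \\
\frac{d}{d\lambda}\Big(\tfrac{1}{2}\lambda^2 S - \lambda t\Big) = \lambda S - t = 0
  &\iff \lambda = \frac{t}{S}, \\
\mathbb{P}(\varphi - \mathbb{E}\varphi > t)
  &\le \exp\Big(\tfrac{1}{2}\cdot\frac{t^2}{S^2}\cdot S - \frac{t^2}{S}\Big)
   = \exp\Big(-\frac{t^2}{2S}\Big).
\end{align*}
```

Applying the same bound to the lower tail and taking a union bound over the two events gives the factor of 2 in the statement of Theorem 1.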
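Let us see how Theorem 1 compares to previous results on some examples. Consider $\mathbb{R}^n$ equipped with the $\ell_1$ metric $\rho^n(x, x') = \sum_{i=1}^n |x_i - x_i'|$ and the standard Gaussian product measure $\mu^n = N(0, I_n)$. Let $\varphi: \mathbb{R}^n \to \mathbb{R}$ be 1-Lipschitz. Then Theorem 1 yields (recalling the calculation in (9))
$$\mathbb{P}(|\varphi - \mathbb{E}\varphi| > t) \le 2\exp\left(-\frac{t^2}{4n}\right), \tag{19}$$
whereas the inequalities of McDiarmid (1) and Kutin-Niyogi (14) are both uninformative since the metric diameter is infinite.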
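To make the comparison in the Gaussian example concrete, the sketch below evaluates the two tail bounds alongside an exact tail. It assumes the bound of Theorem 1 in the form $2\exp(-t^2 / (2\sum_i \Delta_{SG}^2(\mathcal{X}_i)))$ and McDiarmid’s bound in the form $2\exp(-2t^2 / \sum_i \operatorname{diam}(\mathcal{X}_i)^2)$. The test function $\varphi(x) = \sum_i x_i$, which is 1-Lipschitz for the $\ell_1$ metric and has mean zero, is chosen here for illustration because its tail is available in closed form; the function names are hypothetical.

```python
import math


def theorem1_bound(t: float, sg_diameters: list) -> float:
    """Two-sided tail bound 2*exp(-t^2 / (2 * sum of squared subgaussian diameters))."""
    s = sum(d * d for d in sg_diameters)
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * s)))


def mcdiarmid_bound(t: float, diameters: list) -> float:
    """Two-sided tail bound 2*exp(-2 t^2 / sum of squared metric diameters)."""
    s = sum(d * d for d in diameters)
    if math.isinf(s):
        return 1.0  # an infinite diameter makes the bound vacuous
    return min(1.0, 2.0 * math.exp(-2.0 * t * t / s))


def exact_gaussian_sum_tail(t: float, n: int) -> float:
    """P(|X_1 + ... + X_n| > t) for iid standard Gaussians; the sum is N(0, n)."""
    return math.erfc(t / math.sqrt(2.0 * n))


if __name__ == "__main__":
    n = 100
    sg = [math.sqrt(2.0)] * n   # subgaussian diameter of (R, |.|, N(0,1))
    diam = [math.inf] * n       # metric diameter of the real line
    for t in (5.0, 10.0, 20.0, 40.0):
        print(t, exact_gaussian_sum_tail(t, n),
              theorem1_bound(t, sg), mcdiarmid_bound(t, diam))
```

For the sum the exact tail is smaller than the Theorem 1 bound, as it must be, while the McDiarmid column stays at the trivial value 1.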
For our next example, fix an and put with the metric and the distribution . One may verify via a calculation analogous to (9) that . For independent , , put . Then Theorem 1 implies that in this case the bound in (19) holds verbatim. On the other hand, is easily seen to be weakly differencebounded by and thus (14) also yields subgaussian concentration, albeit with worse constants. Applying (1) yields the much cruder estimate
5 Application to algorithmic stability
We refer the reader to (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005) for background on algorithmic stability and supervised learning. Our metric probability space $(\mathcal{Z}_i, \rho_i, \mu_i)$ will now have the structure $\mathcal{Z}_i = \mathcal{X}_i \times \mathcal{Y}_i$, where $\mathcal{X}_i$ and $\mathcal{Y}_i$ are, respectively, the instance and label space of the $i$th example. Under the iid assumption, the $(\mathcal{Z}_i, \rho_i, \mu_i)$ are identical for all $i$ (and so we will henceforth drop the subscript $i$ from these). A training sample $S = Z_1^n \sim \mu^n$ is drawn and a learning algorithm $\mathcal{A}$ inputs $S$ and outputs a hypothesis $f: \mathcal{X} \to \mathcal{Y}$. The hypothesis $f$ will be denoted by $\mathcal{A}_S$. In line with the previous literature, we assume that $\mathcal{A}$ is symmetric (i.e., invariant under permutations of $S$). The loss of a hypothesis $f$ on an example $z = (x, y)$ is defined by
$$L(f, z) = \ell(f(x), y),$$
where $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ is the cost function. To our knowledge, all previous work required the loss to be bounded by some constant $M < \infty$, which figures explicitly in the bounds; we make no such restriction.
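In the algorithmic stability setting, the empirical risk $\hat{R}_n(\mathcal{A}, S)$ is typically defined as
$$\hat{R}_n(\mathcal{A}, S) = \frac{1}{n} \sum_{i=1}^n L(\mathcal{A}_S, z_i) \tag{20}$$
and the true risk $R(\mathcal{A}, S)$ as
$$R(\mathcal{A}, S) = \mathbb{E}_{z \sim \mu}[L(\mathcal{A}_S, z)]. \tag{21}$$
The goal is to bound the excess risk $R(\mathcal{A}, S) - \hat{R}_n(\mathcal{A}, S)$. To this end, a myriad of notions of hypothesis stability have been proposed. A variant of uniform stability in the sense of Rakhlin et al. (2005), which is slightly more general than the homonymous notion in Bousquet and Elisseeff (2002), may be defined as follows. The algorithm $\mathcal{A}$ is said to be $\beta$-uniform stable if for all $z \in \mathcal{Z}$, the function $\varphi_z: \mathcal{Z}^n \to \mathbb{R}$ given by $\varphi_z(S) = L(\mathcal{A}_S, z)$ is $\beta$-Lipschitz with respect to the Hamming metric on $\mathcal{Z}^n$:
$$|\varphi_z(S) - \varphi_z(S')| \le \beta \sum_{i=1}^n \mathbb{1}\{z_i \ne z_i'\}, \qquad S, S' \in \mathcal{Z}^n.$$
We define the algorithm $\mathcal{A}$ to be totally $\beta$-Lipschitz stable if the function $\varphi: \mathcal{Z}^{n+1} \to \mathbb{R}$ given by $\varphi(z_1^{n+1}) = L(\mathcal{A}_{z_1^n}, z_{n+1})$ is $\beta$-Lipschitz with respect to the product metric on $\mathcal{Z}^{n+1}$:
$$|\varphi(z_1^{n+1}) - \varphi(\tilde{z}_1^{n+1})| \le \beta \sum_{i=1}^{n+1} \rho(z_i, \tilde{z}_i), \qquad z_1^{n+1}, \tilde{z}_1^{n+1} \in \mathcal{Z}^{n+1}. \tag{22}$$
Note that total Lipschitz stability is stronger than uniform stability since it requires the algorithm to respect the metric of $\mathcal{Z}$.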
Let us bound the bias of stable algorithms.
Lemma 2.
Suppose is a symmetric, totally Lipschitz stable learning algorithm over the metric probability space with . Then
Proof.
Observe, as in the proof of (Bousquet and Elisseeff, 2002, Lemma 7), that for all ,
(23) 
where and is generated from via the process defined in (10). For fixed and ,, define
and note that (22) implies that . Now rewrite (23) as
(24) 
Invoking Jensen’s inequality and the argument in (16),
Taking logarithms yields the estimate
(25) 
which, after substituting (25) into (24), proves the claim. ∎
We now turn to the Lipschitz continuity of the excess risk.
Lemma 3.
Suppose $\mathcal{A}$ is a symmetric, totally $\beta$-Lipschitz stable learning algorithm and define the excess risk function $\varphi: \mathcal{Z}^n \to \mathbb{R}$ by $\varphi(S) = R(\mathcal{A}, S) - \hat{R}_n(\mathcal{A}, S)$. Then $\varphi$ is $3\beta$-Lipschitz.
Proof.
We examine the two summands separately. The definition (21) of $R$ implies that the latter is $\beta$-Lipschitz since it is a convex combination of $\beta$-Lipschitz functions. Now $\hat{R}_n$ defined in (20) is also a convex combination of Lipschitz functions, but because $z_i$ appears twice in $L(\mathcal{A}_S, z_i)$, changing $z_i$ to $z_i'$ could incur a difference of up to $2\beta\rho(z_i, z_i')$. Hence, $\hat{R}_n$ is $2\beta$-Lipschitz. As Lipschitz constants are subadditive, the claim is proved. ∎
Combining Lemmas 2 and 3 with our concentration inequality in Theorem 1 yields the main result of this section:
Theorem 2.
Suppose is a symmetric, totally Lipschitz stable learning algorithm over the metric probability space with . Then, for training samples and , we have
As in Bousquet and Elisseeff (2002) and related results on algorithmic stability, we require for exponential decay. Bousquet and Elisseeff showed that this is indeed the case for some popular learning algorithms, albeit in their less restrictive definition of stability. We conjecture that many of these algorithms continue to be stable in our stronger sense and plan to explore this in future work.
6 Relaxing the independence assumption
In this section we generalize Theorem 1 to strongly mixing processes. To this end, we require some standard facts concerning the probability-theoretic notions of coupling and transportation (Lindvall, 2002; Villani, 2003, 2009). Given the probability measures $\mu, \nu$ on a measurable space $\Omega$, a coupling $\pi$ of $\mu, \nu$ is any probability measure on $\Omega \times \Omega$ with marginals $\mu$ and $\nu$, respectively. Denoting by $M(\mu, \nu)$ the set of all couplings, we have
$$\|\mu - \nu\|_{TV} = \inf_{\pi \in M(\mu, \nu)} \pi(\{(x, y) \in \Omega^2 : x \ne y\}), \tag{26}$$
where $\|\cdot\|_{TV}$ is the total variation norm. An optimal coupling is one that achieves the infimum in (26); one always exists, though it may not be unique. Another elementary property of couplings is that for any two $f, g: \Omega \to \mathbb{R}$ and any coupling $\pi \in M(\mu, \nu)$, we have
$$\mathbb{E}_\mu f - \mathbb{E}_\nu g = \mathbb{E}_{(X, Y) \sim \pi}[f(X) - g(Y)]. \tag{27}$$
It is possible to refine the total variation distance between $\mu$ and $\nu$ so as to respect the metric of $\Omega$. Given a space $\Omega$ equipped with probability measures $\mu, \nu$ and metric $\rho$, define the transportation cost distance $T_\rho(\mu, \nu)$ by
$$T_\rho(\mu, \nu) = \inf_{\pi \in M(\mu, \nu)} \mathbb{E}_{(X, Y) \sim \pi}\, \rho(X, Y).$$
(This fundamental notion is also known as the Wasserstein, Monge-Kantorovich, or earthmover distance; see Villani (2003, 2009) for an encyclopedic treatment. The use of coupling and transportation techniques to obtain concentration for dependent random variables goes back to Marton (1996); Samson (2000); Chazottes et al. (2007).)
It is easy to verify that $T_\rho$ is a valid metric on probability measures and that for $\rho(x, y) = \mathbb{1}\{x \ne y\}$, we have $T_\rho(\mu, \nu) = \|\mu - \nu\|_{TV}$.
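The coupling characterization of total variation can be checked directly on a finite space. The sketch below builds the standard maximal coupling of two discrete distributions and confirms that its off-diagonal mass equals the total variation distance. It assumes the normalization $\|\mu - \nu\|_{TV} = \frac{1}{2}\sum_x |\mu(x) - \nu(x)|$, under which the coupling identity holds; the function names and the example distributions are illustrative only.

```python
def total_variation(mu, nu):
    """Total variation distance, normalized as half the l1 distance."""
    return 0.5 * sum(abs(m - v) for m, v in zip(mu, nu))


def maximal_coupling(mu, nu):
    """Joint pmf pi[x][y] with marginals mu, nu and as much diagonal mass as possible."""
    k = len(mu)
    overlap = [min(m, v) for m, v in zip(mu, nu)]
    residual = 1.0 - sum(overlap)  # equals the total variation distance
    pi = [[0.0] * k for _ in range(k)]
    for x in range(k):
        pi[x][x] = overlap[x]      # keep X = Y on the shared mass
    if residual > 0:
        for x in range(k):
            for y in range(k):
                # Spread the leftover mass of mu over the leftover mass of nu.
                # The product vanishes when x == y, so the diagonal is unchanged.
                pi[x][y] += (mu[x] - overlap[x]) * (nu[y] - overlap[y]) / residual
    return pi


def off_diagonal_mass(pi):
    """pi({(x, y): x != y}); also the expected cost for the metric 1{x != y}."""
    return sum(p for x, row in enumerate(pi) for y, p in enumerate(row) if x != y)


if __name__ == "__main__":
    mu = [0.5, 0.3, 0.2]
    nu = [0.2, 0.3, 0.5]
    pi = maximal_coupling(mu, nu)
    print(total_variation(mu, nu), off_diagonal_mass(pi))
```

For comparison, the independent coupling $\pi(x, y) = \mu(x)\nu(y)$ of the same two distributions has off-diagonal mass $1 - \sum_x \mu(x)\nu(x)$, which is larger, so it is a valid coupling but not an optimal one.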
As in Section 4, we consider a sequence of metric spaces , and their product . Unlike the independent case, we will allow nonproduct probability measures on . We will write to mean that for all Borel . For , we will use the shorthand
The notation means the marginal distribution of . Similarly, will denote the conditional distribution. For , and , define
where is the product of as in (3), and
with . In words, measures the transportation cost distance between the conditional distributions induced on the “tail” given two prefixes that differ in the coordinate, and is the maximal value of this quantity. Kontorovich (2007); Kontorovich and Ramanan (2008) discuss how to handle conditioning on measure-zero sets and other technicalities. Note that for product measures the conditional distributions are identical and hence .
We need one more definition before stating our main result. For the prefix , define the conditional distribution
and consider the corresponding metric probability space . Define its conditional subgaussian diameter by
and the maximal subgaussian diameter by
(28) 
Note that for product measures, (28) reduces to the former definition (7). With these definitions, we may state the main result of this section.
Theorem 3.
If is Lipschitz with respect to , then
Observe that we recover Theorem 1 as a special case. Since typically we will take , it suffices that and to ensure an exponential bound with decay rate .