Concentration in unbounded metric spaces and algorithmic stability
We prove an extension of McDiarmid’s inequality for metric spaces with unbounded diameter. To this end, we introduce the notion of the subgaussian diameter, which is a distribution-dependent refinement of the metric diameter. Our technique provides an alternative approach to that of Kutin and Niyogi’s method of weakly difference-bounded functions, and yields nontrivial, dimension-free results in some interesting cases where the former does not. As an application, we give apparently the first generalization bound in the algorithmic stability setting that holds for unbounded loss functions. We give two extensions of the basic concentration result: to strongly mixing processes and to other Orlicz norms.
Concentration of measure inequalities are at the heart of statistical learning theory. Roughly speaking, concentration allows one to conclude that the performance of a (sufficiently “stable”) algorithm on a (sufficiently “close to iid”) sample is indicative of the algorithm’s performance on future data. Quantifying what it means for an algorithm to be stable and for the sampling process to be close to iid is by no means straightforward and much recent work has been motivated by these questions. It turns out that the various notions of stability are naturally expressed in terms of the Lipschitz continuity of the algorithm in question (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2010), while appropriate relaxations of the iid assumption are achieved using various kinds of strong mixing (Karandikar and Vidyasagar, 2002; Gamarnik, 2003; Rostamizadeh and Mohri, 2007; Mohri and Rostamizadeh, 2008; Steinwart and Christmann, 2009; Steinwart et al., 2009; Zou et al.; Mohri and Rostamizadeh, 2010; London et al., 2012, 2013; Shalizi and Kontorovich, 2013).
An elegant and powerful work-horse driving many of the aforementioned results is McDiarmid’s inequality (McDiarmid, 1989):
$$\mathbb{P}\left(|\varphi - \mathbb{E}\varphi| > t\right) \le 2\exp\left(-\frac{2t^2}{\sum_{i=1}^n w_i^2}\right), \tag{1}$$
where $\varphi$ is a real-valued function of the sequence of independent random variables $X = (X_1, \ldots, X_n)$, such that
$$|\varphi(x) - \varphi(x')| \le w_i \tag{2}$$
whenever $x$ and $x'$ differ only in the $i$th coordinate. Aside from being instrumental in proving PAC bounds (Boucheron et al., 2005), McDiarmid’s inequality has also found use in algorithmic stability results (Bousquet and Elisseeff, 2002). Non-iid extensions of (1) have also been considered (Marton, 1996; Rio, 2000; Chazottes et al., 2007; Kontorovich and Ramanan, 2008).
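As a quick numerical illustration, the sketch below evaluates the right-hand side of McDiarmid’s inequality, assuming its standard form $2\exp(-2t^2/\sum_i w_i^2)$, for the empirical mean of $n$ independent $[0,1]$-valued variables (so that $w_i = 1/n$). The function name `mcdiarmid_bound` and the chosen numbers are ours and purely illustrative.

```python
import math

def mcdiarmid_bound(t, w):
    """Right-hand side of McDiarmid's inequality, 2*exp(-2 t^2 / sum_i w_i^2),
    clamped to 1 because a probability never exceeds 1."""
    if t < 0:
        raise ValueError("t must be nonnegative")
    total = sum(wi * wi for wi in w)
    if total == 0:
        # A function with all w_i = 0 is constant, so any positive deviation is impossible.
        return 0.0 if t > 0 else 1.0
    return min(1.0, 2.0 * math.exp(-2.0 * t * t / total))

# Empirical mean of n variables in [0, 1]: changing one coordinate moves it by at most 1/n.
n = 1000
w = [1.0 / n] * n
bound = mcdiarmid_bound(0.05, w)          # equals 2*exp(-2*n*t^2) = 2*exp(-5)

# A single infinite w_i makes the bound trivial, whatever the other coordinates are.
vacuous = mcdiarmid_bound(1.0, [math.inf] + [1.0 / n] * (n - 1))
```

The second call shows the limitation discussed next: one unbounded coordinate is enough to make the estimate uninformative.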
The distribution-free nature of McDiarmid’s inequality makes it an attractive tool in learning theory, but also imposes inherent limitations on its applicability. Chief among these limitations is the inability of (1) to provide risk bounds for unbounded loss functions. Even in the bounded case, if the Lipschitz condition (2) holds not everywhere but only with high probability — say, with a much larger constant on a small set of exceptions — the bound in (1) still charges the full cost of the worst-case constant. To counter this difficulty, Kutin (2002); Kutin and Niyogi (2002) introduced an extension of McDiarmid’s inequality to weakly difference-bounded functions and used it to analyze the risk of “almost-everywhere” stable algorithms. This influential result has been invoked in a number of recent papers (El-Yaniv and Pechyony, 2006; Mukherjee et al., 2006; Hush et al., 2007; Agarwal and Niyogi, 2009; Shalev-Shwartz et al., 2010; Rubinstein and Simma, 2012).
However, the approach of Kutin and Niyogi entails some difficulties as well. These come in two flavors: analytical (complex statement and proof) and practical (conditions are still too restrictive in some cases); we will elaborate upon this in Section 3. In this paper, we propose an alternative approach to the concentration of “almost-everywhere” or “average-case” Lipschitz functions. To this end, we introduce the notion of the subgaussian diameter of a metric probability space. The latter may be finite even when the metric diameter is infinite, and we show that this notion generalizes the more restrictive property of bounded differences.
This paper’s principal contributions include defining the subgaussian diameter of a metric probability space and identifying its role in relaxing the bounded differences condition. In Theorem 1, we show that the subgaussian diameter can essentially replace the far more restrictive metric diameter in concentration bounds. This result has direct ramifications for algorithmic stability (Theorem 2). We furthermore extend our concentration inequality to non-independent processes (Theorem 3) and to other Orlicz norms (Theorem 4).
Outline of paper.
In Section 2 we define the subgaussian diameter and relate it to (weakly) bounded differences in Section 3. We state and prove the concentration inequality based on this notion in Section 4 and give an application to algorithmic stability in Section 5. We then give an extension to non-independent data in Section 6 and discuss other Orlicz norms in Section 7. Conclusions and some open problems are presented in Section 8.
A metric probability space $(\mathcal{X}, \rho, \mu)$ is a measurable space $\mathcal{X}$ whose Borel $\sigma$-algebra is induced by the metric $\rho$, endowed with the probability measure $\mu$. Our results are most cleanly presented when $\mathcal{X}$ is a discrete set but they continue to hold verbatim for Borel probability measures on Polish spaces. It will be convenient to write $\mathbb{E}\varphi = \sum_{x \in \mathcal{X}} \mathbb{P}(x)\varphi(x)$ even when the latter is an integral. Random variables are capitalized ($X$), specified sequences are written in lowercase, the notation $X_i^j = (X_i, \ldots, X_j)$ is used for all sequences, and sequence concatenation is denoted multiplicatively: $x_i^j x_{j+1}^k = x_i^k$. We will frequently use the shorthand $\mathbb{P}(x_i^j) = \prod_{k=i}^j \mathbb{P}(X_k = x_k)$. Standard order of magnitude notation such as $O(\cdot)$ and $\Omega(\cdot)$ will be used.
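A function $\varphi: \mathcal{X} \to \mathbb{R}$ is $L$-Lipschitz if
$$|\varphi(x) - \varphi(x')| \le L\rho(x, x'), \qquad x, x' \in \mathcal{X}.$$
Let $(\mathcal{X}_i, \rho_i, \mu_i)$, $i = 1, \ldots, n$, be a sequence of metric probability spaces. We define the product probability space
$$\mathcal{X}^n = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_n$$
with the product measure
$$\mu^n = \mu_1 \times \mu_2 \times \cdots \times \mu_n$$
and $\ell_1$ product metric
$$\rho^n(x, x') = \sum_{i=1}^n \rho_i(x_i, x_i'), \qquad x, x' \in \mathcal{X}^n. \tag{3}$$
We will denote partial products by
$$\mathcal{X}_i^j = \mathcal{X}_i \times \mathcal{X}_{i+1} \times \cdots \times \mathcal{X}_j.$$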
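We write $X_i \sim \mu_i$ to mean that $X_i$ is an $\mathcal{X}_i$-valued random variable with law $\mu_i$, i.e., $\mathbb{P}(X_i \in A) = \mu_i(A)$ for all Borel $A \subseteq \mathcal{X}_i$. This notation extends naturally to sequences: $X_1^n \sim \mu^n$. We will associate to each $(\mathcal{X}_i, \rho_i, \mu_i)$ the symmetrized distance random variable $\Xi(\mathcal{X}_i)$ defined by
$$\Xi(\mathcal{X}_i) = \epsilon_i \rho_i(X_i, X_i'),$$
where $X_i, X_i' \sim \mu_i$ are independent and $\epsilon_i = \pm 1$ with probability $1/2$, independent of $X_i, X_i'$. We note right away that $\Xi(\mathcal{X}_i)$ is a centered random variable:
$$\mathbb{E}[\Xi(\mathcal{X}_i)] = 0. \tag{5}$$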
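To make the symmetrized distance concrete, the following sketch computes its exact law for a small finite metric probability space and checks that it is centered. It assumes the construction described above (two independent draws from the same measure, their distance multiplied by an independent fair sign); the 3-point space and the helper name `symmetrized_distance_law` are hypothetical choices made for illustration.

```python
from collections import defaultdict
from fractions import Fraction

def symmetrized_distance_law(points, probs, metric):
    """Exact law of Xi = eps * rho(X, X') on a finite metric probability space,
    where X, X' are independent draws and eps is an independent fair sign."""
    law = defaultdict(Fraction)
    half = Fraction(1, 2)
    for x, px in zip(points, probs):
        for y, py in zip(points, probs):
            d = metric(x, y)
            for sign in (1, -1):
                law[sign * d] += half * px * py
    return dict(law)

# Hypothetical example: three points on the real line with the usual distance.
points = [0, 1, 3]
probs = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
law = symmetrized_distance_law(points, probs, lambda a, b: abs(a - b))

total = sum(law.values())                    # total mass of the law
mean = sum(v * p for v, p in law.items())    # expectation of Xi
```

Because the sign is symmetric and independent of the two draws, every value $d$ receives the same mass as $-d$, which is why the mean vanishes regardless of the underlying measure.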
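A real-valued random variable $X$ is said to be subgaussian if it admits a $\sigma > 0$ such that
$$\mathbb{E} e^{\lambda X} \le e^{\sigma^2\lambda^2/2}, \qquad \lambda \in \mathbb{R}. \tag{6}$$
The smallest $\sigma$ for which (6) holds will be denoted by $\sigma^*(X)$.
We define the subgaussian diameter $\Delta_{\mathrm{SG}}(\mathcal{X}_i)$ of the metric probability space $(\mathcal{X}_i, \rho_i, \mu_i)$ in terms of its symmetrized distance $\Xi(\mathcal{X}_i)$:
$$\Delta_{\mathrm{SG}}(\mathcal{X}_i) = \sigma^*(\Xi(\mathcal{X}_i)).$$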
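Lemma 1. If a metric probability space $(\mathcal{X}, \rho, \mu)$ has finite diameter,
$$\operatorname{diam}(\mathcal{X}) := \sup_{x, x' \in \mathcal{X}} \rho(x, x') < \infty,$$
then its subgaussian diameter is also finite:
$$\Delta_{\mathrm{SG}}(\mathcal{X}) \le \operatorname{diam}(\mathcal{X}).$$
Proof. Let $\Xi = \Xi(\mathcal{X})$ be the symmetrized distance. By (5), we have $\mathbb{E}[\Xi] = 0$ and certainly $|\Xi| \le \operatorname{diam}(\mathcal{X})$. Hence,
$$\mathbb{E} e^{\lambda\Xi} \le \exp\left(\frac{(\operatorname{diam}(\mathcal{X})\lambda)^2}{2}\right), \qquad \lambda \in \mathbb{R},$$
where the inequality follows from Hoeffding’s Lemma. ∎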
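For the reader’s convenience, the Hoeffding step can be spelled out. Hoeffding’s Lemma states that a centered random variable taking values in $[a, b]$ has moment generating function at most $\exp(\lambda^2(b-a)^2/8)$. Writing $d$ for the metric diameter (a shorthand used only here), the symmetrized distance is centered and takes values in $[-d, d]$, so

$$\mathbb{E} e^{\lambda\Xi} \le \exp\left(\frac{\lambda^2 (d - (-d))^2}{8}\right) = \exp\left(\frac{\lambda^2 (2d)^2}{8}\right) = \exp\left(\frac{d^2\lambda^2}{2}\right),$$

which is the subgaussian condition (6) with $\sigma = d$. The smallest admissible $\sigma$ is therefore at most $d$.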
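The bound in Lemma 1 is nearly tight in the sense that for every $\varepsilon > 0$ there is a metric probability space $(\mathcal{X}, \rho, \mu)$ for which
$$\operatorname{diam}(\mathcal{X}) < \Delta_{\mathrm{SG}}(\mathcal{X}) + \varepsilon.$$
To see this, take $\mathcal{X}$ to be an $N$-point space with the uniform distribution and $\rho(x, x') = 1$ for all distinct $x, x' \in \mathcal{X}$. Taking $N$ sufficiently large makes $\Delta_{\mathrm{SG}}(\mathcal{X})$ arbitrarily close to $1 = \operatorname{diam}(\mathcal{X})$. We do not know whether $\Delta_{\mathrm{SG}}(\mathcal{X}) = \operatorname{diam}(\mathcal{X})$ can be achieved.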
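The near-tightness claim can be checked numerically. In a finite space with the uniform distribution and unit distance between distinct points, the symmetrized distance equals $0$ with probability $1/N$ and $\pm 1$ with probability $(1 - 1/N)/2$ each, so its moment generating function is $1/N + (1 - 1/N)\cosh\lambda$. The sketch below estimates the smallest subgaussian constant by maximizing $2\log \mathbb{E}e^{\lambda\Xi}/\lambda^2$ over a finite grid; the grid, the function name, and the number of points tried are our own illustrative choices, and a grid search only gives a lower estimate of the true supremum.

```python
import math

def sigma_star_uniform_discrete(num_points, lambdas):
    """Grid estimate of the subgaussian diameter of a num_points-point space
    with uniform measure and unit distance between distinct points."""
    p = 1.0 - 1.0 / num_points          # P(X != X')
    best = 0.0
    for lam in lambdas:
        # E exp(lam * Xi) = (1 - p) + p * cosh(lam), because Xi is 0 or +/-1.
        log_mgf = math.log1p(p * (math.cosh(lam) - 1.0))
        best = max(best, 2.0 * log_mgf / (lam * lam))
    return math.sqrt(best)

# By symmetry of Xi it suffices to scan positive lambda.
grid = [0.01 * k for k in range(1, 2001)]   # lambda in (0, 20]
estimates = {n_pts: sigma_star_uniform_discrete(n_pts, grid) for n_pts in (2, 10, 1000)}
```

The estimates increase with the number of points and stay below the metric diameter $1$, in line with Lemma 1.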
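On the other hand, there exist unbounded metric probability spaces with finite subgaussian diameter. A simple example is $(\mathcal{X}, \rho, \mu)$ with $\mathcal{X} = \mathbb{R}$, $\rho(x, x') = |x - x'|$, and $\mu$ the standard Gaussian probability measure, $d\mu = (2\pi)^{-1/2} e^{-x^2/2}\,dx$. Obviously, $\operatorname{diam}(\mathcal{X}) = \infty$. Now the symmetrized distance $\Xi = \Xi(\mathcal{X})$ is distributed as the difference (=sum) of two standard Gaussians: $\Xi \sim N(0, 2)$. Since $\mathbb{E} e^{\lambda\Xi} = e^{\lambda^2}$, we have
$$\Delta_{\mathrm{SG}}(\mathcal{X}) = \sqrt{2}. \tag{9}$$
More generally, the subgaussian distributions on $\mathbb{R}$ are precisely those for which $\Delta_{\mathrm{SG}}(\mathbb{R}) < \infty$.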
3 Related work
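McDiarmid’s inequality (1) suffers from the limitations mentioned above: it completely ignores the distribution and is vacuous if even one of the $w_i$ is infinite. (Note, though, that McDiarmid’s inequality is sharp in the sense that the constants in (1) cannot be improved in a distribution-free fashion.) In order to address some of these issues, Kutin (2002); Kutin and Niyogi (2002) proposed an extension of McDiarmid’s inequality to “almost everywhere” Lipschitz functions $\varphi: \mathcal{X}^n \to \mathbb{R}$. To formalize this, fix an $i \in \{1, \ldots, n\}$ and let $X_1^n \sim \mu^n$ and $X_i' \sim \mu_i$ be independent. Define $\tilde X_1^n = \tilde X_1^n(i)$ by
$$\tilde X_j(i) = \begin{cases} X_j, & j \ne i, \\ X_i', & j = i. \end{cases} \tag{10}$$
Kutin and Niyogi define $\varphi$ to be weakly difference-bounded by $(b, c, \delta)$ if
$$|\varphi(X_1^n) - \varphi(\tilde X_1^n(i))| \le b \tag{11}$$
and
$$\mathbb{P}\left(|\varphi(X_1^n) - \varphi(\tilde X_1^n(i))| > c\right) < \delta \tag{12}$$
for all $1 \le i \le n$.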
The precise result of Kutin (2002, Theorem 1.10) is somewhat unwieldy to state; indeed, the present work was motivated in part by a desire for simpler tools. Assuming that $\varphi$ is weakly difference-bounded by $(b, c, \delta)$ with
$$\delta = \exp(-\Omega(n)) \tag{13}$$
and $c = O(1/n)$, their bound states that
$$\mathbb{P}\left(|\varphi - \mathbb{E}\varphi| \ge t\right) \le \exp(-\Omega(nt^2)) \tag{14}$$
for a certain range of $t$ and $n$. As noted by Rakhlin et al. (2005), the exponential decay assumption (13) is necessary in order for the Kutin-Niyogi method to yield exponential concentration. In contrast, the bounds we prove here (i) do not require $\varphi$ to be everywhere bounded as in (11), and (ii) have a simple statement and proof, and generalize to non-iid processes with relative ease.
In a different line of work, Bentkus (2008) considered an extension of Hoeffding’s inequality to unbounded random variables. His bound only holds for sums (as opposed to general Lipschitz functions) and the summands must be non-negative (i.e., unbounded only in the positive direction). An earlier notion of “effective” metric diameter in the context of concentration is that of metric space length (Schechtman, 1982). Another distribution-dependent refinement of diameter is the spread constant (Alon et al., 1998). Lecué and Mendelson (2013) gave minimax bounds for empirical risk minimization over subgaussian classes.
4 Concentration via subgaussian diameter
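We defined the subgaussian diameter in Section 2, showing in Lemma 1 that it never exceeds the metric diameter. We also showed by example that the former can be finite when the latter is infinite. The main result of this section is that $\operatorname{diam}(\mathcal{X}_i)$ in (15) can essentially be replaced by $\Delta_{\mathrm{SG}}(\mathcal{X}_i)$:
Theorem 1. If $\varphi: \mathcal{X}^n \to \mathbb{R}$ is $1$-Lipschitz then $\mathbb{E}\varphi < \infty$ and
$$\mathbb{P}\left(|\varphi - \mathbb{E}\varphi| > t\right) \le 2\exp\left(-\frac{t^2}{2\sum_{i=1}^n \Delta_{\mathrm{SG}}^2(\mathcal{X}_i)}\right).$$
Our constant in the exponent is worse than that of (15) by a factor of $4$; this appears to be an inherent artifact of our method.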
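Proof. The strong integrability of $\varphi$ (and in particular, finiteness of $\mathbb{E}\varphi$) follows from exponential concentration (Ledoux, 2001). The rest of the proof will proceed via the Azuma-Hoeffding-McDiarmid method of martingale differences. Define $V_i = \mathbb{E}[\varphi \mid X_1^i] - \mathbb{E}[\varphi \mid X_1^{i-1}]$ and expand
$$\mathbb{E}[\varphi \mid X_1^i] = \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n)\,\varphi(X_1^i x_{i+1}^n), \qquad \mathbb{E}[\varphi \mid X_1^{i-1}] = \sum_{x_i^n} \mathbb{P}(x_i^n)\,\varphi(X_1^{i-1} x_i^n).$$
Let $\tilde V_i$ be $V_i$ conditioned on $X_1^{i-1}$; thus,
$$\tilde V_i = \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n) \sum_{x_i'} \mathbb{P}(x_i') \left[\varphi(X_1^{i-1} X_i x_{i+1}^n) - \varphi(X_1^{i-1} x_i' x_{i+1}^n)\right].$$
Hence, by Jensen’s inequality, we have
$$\mathbb{E}\left[e^{\lambda V_i} \mid X_1^{i-1}\right] \le \sum_{x_{i+1}^n} \mathbb{P}(x_{i+1}^n) \sum_{x_i, x_i'} \mathbb{P}(x_i)\mathbb{P}(x_i')\, e^{\lambda\left(\varphi(X_1^{i-1} x_i x_{i+1}^n) - \varphi(X_1^{i-1} x_i' x_{i+1}^n)\right)}.$$
For fixed $X_1^{i-1}$ and $x_{i+1}^n$, define $F: \mathcal{X}_i \to \mathbb{R}$ by $F(y) = \varphi(X_1^{i-1} y x_{i+1}^n)$, and observe that $F$ is $1$-Lipschitz with respect to $\rho_i$. Since $e^t + e^{-t} = 2\cosh(t)$ and $\cosh(t) \le \cosh(s)$ for all $|t| \le s$, we have
$$\sum_{x_i, x_i'} \mathbb{P}(x_i)\mathbb{P}(x_i')\, e^{\lambda(F(x_i) - F(x_i'))} = \sum_{x_i, x_i'} \mathbb{P}(x_i)\mathbb{P}(x_i') \cosh\left(\lambda(F(x_i) - F(x_i'))\right) \le \sum_{x_i, x_i'} \mathbb{P}(x_i)\mathbb{P}(x_i') \cosh\left(\lambda\rho_i(x_i, x_i')\right) = \mathbb{E} e^{\lambda\Xi(\mathcal{X}_i)} \le \exp\left(\lambda^2 \Delta_{\mathrm{SG}}^2(\mathcal{X}_i)/2\right). \tag{16}$$
(An analogous symmetrization technique is employed in http://terrytao.wordpress.com/2009/06/09/talagrands-concentration-inequality as a variant of the “square and rearrange” trick.)
Applying the standard Markov’s inequality and exponential bounding argument, we have
$$\mathbb{P}(\varphi - \mathbb{E}\varphi > t) = \mathbb{P}\left(\sum_{i=1}^n V_i > t\right) \le e^{-\lambda t}\, \mathbb{E}\left[\prod_{i=1}^n e^{\lambda V_i}\right] \le \exp\left(\frac{\lambda^2}{2} \sum_{i=1}^n \Delta_{\mathrm{SG}}^2(\mathcal{X}_i) - \lambda t\right), \qquad \lambda \ge 0,$$
where the last inequality applies (16) to the conditional expectations $\mathbb{E}[e^{\lambda V_i} \mid X_1^{i-1}]$ successively for $i = n, n-1, \ldots, 1$.
Optimizing over $\lambda$ and applying the same argument to $\mathbb{E}\varphi - \varphi$ yields our claim. ∎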
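The final optimization can be written out explicitly. Assume the exponential bounding step produces an upper bound of the form $\exp(\lambda^2 S/2 - \lambda t)$ for all $\lambda \ge 0$, where $S$ (a shorthand introduced only here) denotes the sum of the squared subgaussian diameters of the coordinate spaces. Minimizing the exponent over $\lambda$ gives

$$\frac{d}{d\lambda}\left(\frac{\lambda^2 S}{2} - \lambda t\right) = \lambda S - t = 0 \quad\Longrightarrow\quad \lambda^* = \frac{t}{S}, \qquad \exp\left(\frac{(\lambda^*)^2 S}{2} - \lambda^* t\right) = \exp\left(-\frac{t^2}{2S}\right).$$

The same estimate for the lower tail and a union bound over the two tails account for the leading factor of $2$ in a two-sided statement.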
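Let us see how Theorem 1 compares to previous results on some examples. Consider $\mathbb{R}^n$ equipped with the $\ell_1$ metric $\rho^n(x, x') = \sum_{i=1}^n |x_i - x_i'|$ and the standard Gaussian product measure $\mu^n = N(0, I_n)$. Let $\varphi: \mathbb{R}^n \to \mathbb{R}$ be $1$-Lipschitz. Then Theorem 1 yields (recalling the calculation in (9))
$$\mathbb{P}\left(|\varphi - \mathbb{E}\varphi| > t\right) \le 2\exp\left(-\frac{t^2}{4n}\right), \qquad t > 0. \tag{19}$$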
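As a sanity check on the Gaussian example, the sketch below compares an exactly computable tail with the subgaussian-diameter bound. It assumes the bound takes the optimized Chernoff form $2\exp(-t^2/(2\sum_i \Delta_{\mathrm{SG}}^2))$ with squared subgaussian diameter $2$ per coordinate, as in (9), and uses the particular function that sums the coordinates, which is $1$-Lipschitz under the $\ell_1$ metric and distributed as $N(0, n)$. Function names and the values of $n$ and $t$ are illustrative.

```python
import math

DELTA_SG_SQ = 2.0  # squared subgaussian diameter of (R, |.|, N(0,1)), taken from (9)

def subgaussian_diameter_bound(t, n):
    """Assumed form 2*exp(-t^2 / (2 * n * Delta_SG^2)), clamped to 1."""
    return min(1.0, 2.0 * math.exp(-t * t / (2.0 * n * DELTA_SG_SQ)))

def exact_tail_sum_of_gaussians(t, n):
    """P(|X_1 + ... + X_n| > t) for iid standard normals; the sum is N(0, n)."""
    return math.erfc(t / math.sqrt(2.0 * n))

n = 100
rows = [
    (t, exact_tail_sum_of_gaussians(t, n), subgaussian_diameter_bound(t, n))
    for t in (10.0, 20.0, 40.0)
]
```

Each row holds a threshold, the exact two-sided tail, and the bound. The bound is loose by a constant factor in the exponent but remains finite, whereas the bounded-differences condition (2) cannot be met here with any finite constants because the coordinates are unbounded.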
For our next example, fix an and put with the metric and the distribution . One may verify via a calculation analogous to (9) that . For independent , , put . Then Theorem 1 implies that in this case the bound in (19) holds verbatim. On the other hand, is easily seen to be weakly difference-bounded by and thus (14) also yields subgaussian concentration, albeit with worse constants. Applying (1) yields the much cruder estimate
5 Application to algorithmic stability
We refer the reader to (Bousquet and Elisseeff, 2002; Kutin and Niyogi, 2002; Rakhlin et al., 2005) for background on algorithmic stability and supervised learning. Our metric probability space $(\mathcal{Z}_i, \rho_i, \mu_i)$ will now have the structure $\mathcal{Z}_i = \mathcal{X}_i \times \mathcal{Y}_i$ where $\mathcal{X}_i$ and $\mathcal{Y}_i$ are, respectively, the instance and label space of the $i$th example. Under the iid assumption, the $(\mathcal{Z}_i, \rho_i, \mu_i)$ are identical for all $i$ (and so we will henceforth drop the subscript $i$ from these). A training sample $S = Z_1^n \sim \mu^n$ is drawn and a learning algorithm $\mathcal{A}$ inputs $S$ and outputs a hypothesis $h: \mathcal{X} \to \mathcal{Y}$. The hypothesis $\mathcal{A}(S)$ will be denoted by $\mathcal{A}_S$. In line with the previous literature, we assume that $\mathcal{A}$ is symmetric (i.e., invariant under permutations of $S$). The loss of a hypothesis $h$ on an example $z = (x, y)$ is defined by
$$L(h, z) = \ell(h(x), y),$$
where $\ell: \mathcal{Y} \times \mathcal{Y} \to [0, \infty)$ is the cost function. To our knowledge, all previous work required the loss to be bounded by some constant $M < \infty$, which figures explicitly in the bounds; we make no such restriction.
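In the algorithmic stability setting, the empirical risk is typically defined as
$$\hat R_n(\mathcal{A}, S) = \frac{1}{n} \sum_{i=1}^n L(\mathcal{A}_S, z_i) \tag{20}$$
and the true risk as
$$R(\mathcal{A}, S) = \mathbb{E}_{z \sim \mu}\left[L(\mathcal{A}_S, z)\right]. \tag{21}$$
The goal is to bound the excess risk $R(\mathcal{A}, S) - \hat R_n(\mathcal{A}, S)$. To this end, a myriad of notions of hypothesis stability have been proposed. A variant of uniform stability in the sense of Rakhlin et al. (2005), which is slightly more general than the homonymous notion in Bousquet and Elisseeff (2002), may be defined as follows. The algorithm $\mathcal{A}$ is said to be $\beta$-uniform stable if for all $\tilde z \in \mathcal{Z}$, the function $\varphi_{\tilde z}: \mathcal{Z}^n \to \mathbb{R}$ given by $\varphi_{\tilde z}(z) = L(\mathcal{A}_z, \tilde z)$ is $\beta$-Lipschitz with respect to the Hamming metric on $\mathcal{Z}^n$:
$$|\varphi_{\tilde z}(z) - \varphi_{\tilde z}(z')| \le \beta \sum_{i=1}^n \mathbb{1}\{z_i \ne z_i'\}, \qquad \tilde z \in \mathcal{Z},\ z, z' \in \mathcal{Z}^n.$$
We define the algorithm $\mathcal{A}$ to be $\beta$-totally Lipschitz stable if the function $\varphi: \mathcal{Z}^{n+1} \to \mathbb{R}$ given by $\varphi(z_1^{n+1}) = L(\mathcal{A}_{z_1^n}, z_{n+1})$ is $\beta$-Lipschitz with respect to the $\ell_1$ product metric on $\mathcal{Z}^{n+1}$:
$$|\varphi(z) - \varphi(z')| \le \beta \sum_{i=1}^{n+1} \rho(z_i, z_i'), \qquad z, z' \in \mathcal{Z}^{n+1}.$$
Note that total Lipschitz stability is stronger than uniform stability since it requires the algorithm to respect the metric of $\mathcal{Z}$.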
Let us bound the bias of stable algorithms.
Suppose $\mathcal{A}$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm over the metric probability space $(\mathcal{Z}, \rho, \mu)$ with $\Delta_{\mathrm{SG}}(\mathcal{Z}) < \infty$. Then
Observe, as in the proof of (Bousquet and Elisseeff, 2002, Lemma 7), that for all ,
where and is generated from via the process defined in (10). For fixed and ,, define
Invoking Jensen’s inequality and the argument in (16),
Taking logarithms yields the estimate
We now turn to the Lipschitz continuity of the excess risk.
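Suppose $\mathcal{A}$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm and define the excess risk function $\varphi: \mathcal{Z}^n \to \mathbb{R}$ by $\varphi(z) = R(\mathcal{A}, z) - \hat R_n(\mathcal{A}, z)$. Then $\varphi$ is $3\beta$-Lipschitz.
Proof. We examine the two summands separately. The definition (21) of $R$ implies that the latter is $\beta$-Lipschitz since it is a convex combination of $\beta$-Lipschitz functions. Now $\hat R_n$ defined in (20) is also a convex combination of $\beta$-Lipschitz functions, but because $z_i$ appears twice in $L(\mathcal{A}_z, z_i)$, changing $z_i$ to $z_i'$ could incur a difference of up to $2\beta\rho(z_i, z_i')$. Hence, $\hat R_n$ is $2\beta$-Lipschitz. As Lipschitz constants are subadditive, the claim is proved. ∎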
Theorem 2. Suppose $\mathcal{A}$ is a symmetric, $\beta$-totally Lipschitz stable learning algorithm over the metric probability space $(\mathcal{Z}, \rho, \mu)$ with $\Delta_{\mathrm{SG}}(\mathcal{Z}) < \infty$. Then, for training samples $S \sim \mu^n$ and $\varepsilon > 0$, we have
As in Bousquet and Elisseeff (2002) and related results on algorithmic stability, we require $\beta = o(n^{-1/2})$ for exponential decay. Bousquet and Elisseeff showed that this is indeed the case for some popular learning algorithms, albeit in their less restrictive definition of stability. We conjecture that many of these algorithms continue to be stable in our stronger sense and plan to explore this in future work.
6 Relaxing the independence assumption
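In this section we generalize Theorem 1 to strongly mixing processes. To this end, we require some standard facts concerning the probability-theoretic notions of coupling and transportation (Lindvall, 2002; Villani, 2003, 2009). Given the probability measures $\mu, \nu$ on a measurable space $\mathcal{X}$, a coupling $\pi$ of $\mu, \nu$ is any probability measure on $\mathcal{X} \times \mathcal{X}$ with marginals $\mu$ and $\nu$, respectively. Denoting by $\Pi = \Pi(\mu, \nu)$ the set of all couplings, we have
$$\|\mu - \nu\|_{\mathrm{TV}} = \inf_{\pi \in \Pi} \pi\left(\{(x, y) \in \mathcal{X}^2 : x \ne y\}\right), \tag{26}$$
where $\|\cdot\|_{\mathrm{TV}}$ is the total variation norm. An optimal coupling is one that achieves the infimum in (26); one always exists, though it may not be unique. Another elementary property of couplings is that for any two $f, g: \mathcal{X} \to \mathbb{R}$ and any coupling $\pi \in \Pi(\mu, \nu)$, we have
$$\mathbb{E}_\mu f - \mathbb{E}_\nu g = \mathbb{E}_{(X, X') \sim \pi}\left[f(X) - g(X')\right].$$
It is possible to refine the total variation distance between $\mu$ and $\nu$ so as to respect the metric of $\mathcal{X}$. Given a space equipped with probability measures $\mu, \nu$ and metric $\rho$, define the transportation cost distance $T_\rho(\mu, \nu)$ by
$$T_\rho(\mu, \nu) = \inf_{\pi \in \Pi(\mu, \nu)} \mathbb{E}_{(X, X') \sim \pi}\, \rho(X, X').$$
(This fundamental notion is also known as the Wasserstein, Monge-Kantorovich, or earthmover distance; see Villani (2003, 2009) for an encyclopedic treatment. The use of coupling and transportation techniques to obtain concentration for dependent random variables goes back to Marton (1996); Samson (2000); Chazottes et al. (2007).)
It is easy to verify that $T_\rho$ is a valid metric on probability measures and that for $\rho(x, x') = \mathbb{1}\{x \ne x'\}$, we have $T_\rho(\mu, \nu) = \|\mu - \nu\|_{\mathrm{TV}}$.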
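The coupling characterization of total variation can be checked by hand on a finite space. The sketch below builds the standard maximal coupling of two discrete distributions (this textbook construction is our choice for illustration and is not taken from the present paper) and verifies that the mass it places off the diagonal equals the total variation distance. It assumes the normalization $\|\mu - \nu\|_{\mathrm{TV}} = \frac{1}{2}\sum_x |\mu(x) - \nu(x)|$; all names and the two example distributions are hypothetical.

```python
from fractions import Fraction

def total_variation(mu, nu):
    """TV distance with the normalization (1/2) * sum_x |mu(x) - nu(x)|."""
    support = set(mu) | set(nu)
    return sum(abs(mu.get(x, 0) - nu.get(x, 0)) for x in support) / 2

def maximal_coupling(mu, nu):
    """Coupling pi, as a dict keyed by (x, y), with pi(X != Y) = TV(mu, nu)."""
    support = sorted(set(mu) | set(nu))
    tv = total_variation(mu, nu)
    pi = {}
    # Keep as much mass as possible on the diagonal.
    for x in support:
        common = min(mu.get(x, 0), nu.get(x, 0))
        if common:
            pi[(x, x)] = common
    # Spread the leftover mass of mu over the leftover mass of nu.
    if tv > 0:
        for x in support:
            excess_mu = mu.get(x, 0) - min(mu.get(x, 0), nu.get(x, 0))
            for y in support:
                excess_nu = nu.get(y, 0) - min(mu.get(y, 0), nu.get(y, 0))
                if excess_mu and excess_nu:
                    pi[(x, y)] = excess_mu * excess_nu / tv
    return pi

mu = {"a": Fraction(1, 2), "b": Fraction(1, 2)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 4), "c": Fraction(1, 2)}
pi = maximal_coupling(mu, nu)

tv = total_variation(mu, nu)
mismatch = sum(p for (x, y), p in pi.items() if x != y)   # pi(X != Y)
first_marginal = {x: sum(p for (a, _), p in pi.items() if a == x) for x in mu}
second_marginal = {y: sum(p for (_, b), p in pi.items() if b == y) for y in nu}
```

With the discrete metric, the expected distance under this coupling is exactly the off-diagonal mass, which illustrates why the transportation cost reduces to total variation in that case.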
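As in Section 4, we consider a sequence of metric spaces $(\mathcal{X}_i, \rho_i)$, $i = 1, \ldots, n$, and their $\ell_1$ product $(\mathcal{X}^n, \rho^n)$. Unlike the independent case, we will allow nonproduct probability measures $\mu$ on $(\mathcal{X}^n, \rho^n)$. We will write $X_1^n \sim \mu$ to mean that $\mathbb{P}(X_1^n \in A) = \mu(A)$ for all Borel $A \subseteq \mathcal{X}^n$. For $1 \le i \le j \le n$, we will use the shorthand
$$\mathbb{P}(x_i^j) = \mathbb{P}(X_i^j = x_i^j).$$
The notation $\mathbb{P}(X_i^j)$ means the marginal distribution of $X_i^j$. Similarly, $\mathbb{P}(X_{i+1}^n \mid X_1^i = x_1^i)$ will denote the conditional distribution. For $1 \le i < n$, $x_1^i \in \mathcal{X}_1^i$ and $x_i' \in \mathcal{X}_i$, define
$$\tau_i(x_1^i, x_i') = T_{\rho_{i+1}^n}\left(\mathbb{P}(X_{i+1}^n \mid X_1^i = x_1^i),\ \mathbb{P}(X_{i+1}^n \mid X_1^i = x_1^{i-1} x_i')\right),$$
where $\rho_{i+1}^n$ is the $\ell_1$ product of $\rho_{i+1}, \ldots, \rho_n$ as in (3), and
$$\bar\tau_i = \sup_{x_1^i \in \mathcal{X}_1^i,\ x_i' \in \mathcal{X}_i} \tau_i(x_1^i, x_i'),$$
with $\bar\tau_n \equiv 0$. In words, $\tau_i(x_1^i, x_i')$ measures the transportation cost distance between the conditional distributions induced on the “tail” $\mathcal{X}_{i+1}^n$ given two prefixes that differ in the $i$th coordinate, and $\bar\tau_i$ is the maximal value of this quantity. Kontorovich (2007); Kontorovich and Ramanan (2008) discuss how to handle conditioning on measure-zero sets and other technicalities. Note that for product measures the conditional distributions are identical and hence $\bar\tau_i = 0$.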
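We need one more definition before stating our main result. For the prefix $x_1^{i-1}$, define the conditional distribution
$$\mu_i(x \mid x_1^{i-1}) = \mathbb{P}(X_i = x \mid X_1^{i-1} = x_1^{i-1})$$
and consider the corresponding metric probability space $(\mathcal{X}_i, \rho_i, \mu_i(\cdot \mid x_1^{i-1}))$. Define its conditional subgaussian diameter by
$$\Delta_{\mathrm{SG}}(\mathcal{X}_i \mid x_1^{i-1}) = \Delta_{\mathrm{SG}}\left(\mathcal{X}_i, \rho_i, \mu_i(\cdot \mid x_1^{i-1})\right)$$
and the maximal subgaussian diameter by
$$\bar\Delta_{\mathrm{SG}}(\mathcal{X}_i) = \sup_{x_1^{i-1} \in \mathcal{X}_1^{i-1}} \Delta_{\mathrm{SG}}(\mathcal{X}_i \mid x_1^{i-1}).$$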
If is -Lipschitz with respect to , then
Observe that we recover Theorem 1 as a special case. Since typically we will take , it suffices that and to ensure an exponential bound with decay rate .