Relative Deviation Learning Bounds and Generalization with Unbounded Loss Functions

\name Corinna Cortes \email corinna@google.com
\addr Google Research, 76 Ninth Avenue, New York, NY 10011
\AND
\name Spencer Greenberg \email spencerg@cims.nyu.edu
\addr Courant Institute, 251 Mercer Street, New York, NY 10012
\AND
\name Mehryar Mohri \email mohri@cims.nyu.edu
\addr Courant Institute and Google Research, 251 Mercer Street, New York, NY 10012
Abstract

We present an extensive analysis of relative deviation bounds, including detailed proofs of two-sided inequalities and their implications. We also give detailed proofs of two-sided generalization bounds that hold in the general case of unbounded loss functions, under the assumption that a moment of the loss is bounded. These bounds are useful in the analysis of importance weighting and other learning tasks such as unbounded regression.

\ShortHeadings{Relative Deviation and Generalization with Unbounded Loss Functions}{Cortes, Greenberg, and Mohri}
\firstpageno{1}

\editor

TBD

{keywords}

Generalization bounds, learning theory, unbounded loss functions.

1 Introduction

Most generalization bounds in learning theory hold only for bounded loss functions. This includes standard VC-dimension bounds (Vapnik, 1998), Rademacher complexity bounds (Koltchinskii and Panchenko, 2000; Bartlett et al., 2002a; Koltchinskii and Panchenko, 2002; Bartlett and Mendelson, 2002), and local Rademacher complexity bounds (Koltchinskii, 2006; Bartlett et al., 2002b), as well as most other bounds based on other complexity terms. The boundedness assumption is typically unrelated to the statistical nature of the problem considered, but it is convenient since, when the loss functions are uniformly bounded, standard tools such as Hoeffding’s inequality (Hoeffding, 1963; Azuma, 1967), McDiarmid’s inequality (McDiarmid, 1989), or Talagrand’s concentration inequality (Talagrand, 1994) apply.

There are however natural learning problems where the boundedness assumption does not hold. This includes unbounded regression tasks where the target labels are not uniformly bounded, and a variety of applications such as sample bias correction (Dudík et al., 2006; Huang et al., 2006; Cortes et al., 2008; Sugiyama et al., 2008; Bickel et al., 2007), domain adaptation (Ben-David et al., 2007; Blitzer et al., 2008; Daumé III and Marcu, 2006; Jiang and Zhai, 2007; Mansour et al., 2009; Cortes and Mohri, 2013), or the analysis of boosting (Dasgupta and Long, 2003), where the importance weighting technique is used (Cortes et al., 2010). It is therefore critical to derive learning guarantees that hold for these scenarios and the general case of unbounded loss functions.

When the class of functions is unbounded, a single function may take arbitrarily large values with arbitrarily small probabilities. This is probably the main challenge in deriving uniform convergence bounds for unbounded losses. The problem can be avoided by assuming the existence of an envelope, that is, a single non-negative function with a finite expectation lying above the absolute value of the loss of every function in the hypothesis set (Dudley, 1984; Pollard, 1984; Dudley, 1987; Pollard, 1989; Haussler, 1992). An alternative assumption, similar in spirit to Hoeffding’s inequality and based on the expectation of a hyperbolic function, a quantity similar to the moment-generating function, is used by Meir and Zhang (2003). However, in many problems, e.g., in the analysis of importance weighting even for common distributions, there exists no suitable envelope function (Cortes et al., 2010). Instead, the second or some other $\alpha$th moment of the loss seems to play a critical role in the analysis. We will therefore consider here the assumption that some $\alpha$th moment of the loss functions is bounded, as in Vapnik (1998, 2006b).

This paper presents in detail two-sided generalization bounds for unbounded loss functions under the assumption that some $\alpha$th moment of the loss functions, $\alpha > 1$, is bounded. The proof of these bounds makes use of relative deviation generalization bounds in binary classification, which we also prove and discuss in detail. Much of the material we present is not novel and the paper therefore has the nature of a survey. However, our presentation is motivated by the fact that the proofs given in the past for these generalization bounds were either incorrect or incomplete.

We now discuss in more detail prior results and proofs. One-sided relative deviation bounds were first given by Vapnik (1998), and later improved by a constant factor by Anthony and Shawe-Taylor (1993). These publications and several others have all relied on a lower bound on the probability that a binomial random variable with $m$ trials exceeds its expected value when the bias $p$ verifies $p \geq \frac{1}{m}$. This lower bound also later appears in Vapnik (2006a) and implicitly in other publications referring to the relative deviation bounds of Vapnik (1998). To the best of our knowledge, no actual proof of this inequality was given in the machine learning literature before our recent work (Greenberg and Mohri, 2013). One attempt was made to prove this lemma in the context of the analysis of some generalization bounds (Jaeger, 2005), but unfortunately that proof is not sufficient to support the general case needed for the proof of the relative deviation bound of Vapnik (1998).

We present the proof of two-sided relative deviation bounds in detail, using the recent results of Greenberg and Mohri (2013). The two-sided versions we present, as well as several consequences of these bounds, appear in Anthony and Bartlett (1999). However, we could not find a full proof of the two-sided bounds in any prior publication. Our presentation shows that the proof of the second inequality is not symmetric and cannot be immediately obtained from that of the first. Additionally, it requires another result on binomial distributions given by Greenberg and Mohri (2013).

Relative deviation bounds are informative guarantees of independent interest in machine learning, regardless of the key role they play in the proof of the learning bounds for unbounded losses. They lead to sharper generalization bounds whose right-hand side is expressed as the interpolation of a term in $O(1/m)$ and a term in $O(1/\sqrt{m})$ that admits as a multiplier the empirical error or the generalization error. In particular, when the empirical error is zero, this leads to faster rate bounds. We present in detail the proof of results of this type, as well as that of several others of interest (Anthony and Bartlett, 1999). Let us mention that, in the form presented by Vapnik (1998), relative deviation bounds suffer from a discontinuity at zero (zero denominator), a problem that also affects the inequalities for the other side and which seems not to have been rigorously treated by previous work. Our proofs and results explicitly deal with this issue.

We use relative deviation bounds to give the full proofs of two-sided generalization bounds for unbounded losses with finite moments of order $\alpha$, both in the case $\alpha > 2$ and in the case $1 < \alpha \leq 2$. One-sided generalization bounds for unbounded loss functions were first given by Vapnik (1998, 2006b) under the same assumptions and also using relative deviations. The one-sided version of our bounds for the first case coincides with that of Vapnik (1998, 2006b) modulo a constant factor, but the proofs given by Vapnik in both books seem to be incorrect.\footnote{In Vapnik (1998, pp. 204-206), statement (5.37) cannot be derived from assumption (5.35), contrary to what is claimed by the author, and in general it does not hold: the first integral in (5.37) is restricted to a sub-domain and is thus smaller than the integral of (5.35). Furthermore, the main statement claimed in Section (5.6.2) is not valid. In Vapnik (2006b, pp. 200-202), the author invokes the Lagrange method to show the main inequality, but the proof steps are not mathematically justified. Even with our best efforts, we could not justify some of the steps and strongly believe the proof not to be correct. In particular, the way the function is concluded to be equal to one over the first interval is suspicious and not rigorously justified.} The core component of our proof is based on a different technique using Hölder’s inequality. We also present some more explicit bounds for that case by approximating a complex term appearing in these bounds. The one-sided version of the bounds for the second case is also due to Vapnik (1998, 2006b), with similar questions about the proofs.\footnote{Several of the comments we made for the first case hold here as well. In particular, the author’s proof is not based on clear mathematical justifications. Some steps seem suspicious and are not convincing, even with our best efforts to justify them.} In that case as well, we give detailed proofs using the Cauchy-Schwarz inequality in the most general case where a positive constant $\tau$ is used in the denominator to avoid the discontinuity at zero. These learning bounds can be used directly in the analysis of unbounded loss functions, as in the case of importance weighting (Cortes et al., 2010).

The remainder of this paper is organized as follows. In Section 2, we briefly introduce some definitions and notation used in the following sections. Section 3 presents in detail relative deviation bounds as well as several of their consequences. Next, in Section 4 we present generalization bounds for unbounded loss functions under the assumption that the moment of order $\alpha$ is bounded, treating the two ranges of $\alpha$ in turn (Sections 4.1 and 4.2).

2 Preliminaries

We consider an input space $X$ and an output space $Y$, which in the particular case of binary classification is $\{-1, +1\}$ or $\{0, 1\}$, or a measurable subset of $\mathbb{R}$ in regression. We denote by $D$ a distribution over $Z = X \times Y$. For a sample $S = (z_1, \ldots, z_m)$ of size $m$ drawn i.i.d. from $D$, we will denote by $\widehat{D}_S$ the corresponding empirical distribution, that is the distribution corresponding to drawing a point from $S$ uniformly at random. Throughout this paper, $H$ denotes a hypothesis set of functions mapping from $X$ to $Y$. The loss incurred by hypothesis $h \in H$ at point $z \in Z$ is denoted by $L(h, z)$. The loss $L$ is assumed to be non-negative, but not necessarily bounded. We denote by $\mathcal{L}(h)$ the expected loss or generalization error of a hypothesis $h$ and by $\widehat{\mathcal{L}}_S(h)$ its empirical loss for a sample $S$:

$$\mathcal{L}(h) = \mathbb{E}_{z \sim D}[L(h, z)], \qquad \widehat{\mathcal{L}}_S(h) = \mathbb{E}_{z \sim \widehat{D}_S}[L(h, z)] = \frac{1}{m} \sum_{i=1}^{m} L(h, z_i). \qquad (1)$$

For any $\alpha \geq 1$, we also use the notation $\mathcal{L}_\alpha(h) = \mathbb{E}_{z \sim D}[L^\alpha(h, z)]$ and $\widehat{\mathcal{L}}_{S, \alpha}(h) = \mathbb{E}_{z \sim \widehat{D}_S}[L^\alpha(h, z)]$ for the $\alpha$th moments of the loss. When the loss coincides with the standard zero-one loss used in binary classification, we equivalently use the following notation:

$$R(h) = \Pr_{(x, y) \sim D}[h(x) \neq y], \qquad \widehat{R}_S(h) = \Pr_{(x, y) \sim \widehat{D}_S}[h(x) \neq y]. \qquad (2)$$
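As a brief numerical illustration of these definitions, the following sketch computes an empirical loss, an empirical $\alpha$th moment, and a zero-one error on synthetic data; the function names and the heavy-tailed distribution used here are illustrative choices only.

```python
import numpy as np

def empirical_loss(losses):
    """Empirical loss: average of the per-point losses L(h, z_i)."""
    return float(np.mean(losses))

def empirical_moment(losses, alpha):
    """Empirical alpha-th moment of the loss: average of L^alpha(h, z_i)."""
    return float(np.mean(np.asarray(losses, dtype=float) ** alpha))

def zero_one_error(predictions, labels):
    """Empirical error for the zero-one loss in binary classification."""
    return float(np.mean(np.asarray(predictions) != np.asarray(labels)))

# Heavy-tailed, unbounded losses: for this Pareto-type distribution, moments of
# order alpha < 2.5 are finite even though the losses are not uniformly bounded.
rng = np.random.default_rng(0)
losses = rng.pareto(2.5, size=10_000)
print(empirical_loss(losses))               # empirical first moment
print(empirical_moment(losses, alpha=2.0))  # empirical second moment
```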

We will sometimes use the shorthand $\mathbf{x} = (x_1, \ldots, x_m)$ to denote a sample of $m$ points. For any hypothesis set $H$ of functions mapping $X$ to $\{-1, +1\}$ or $\{0, 1\}$ and any sample $\mathbf{x}$, we denote by $S_H(\mathbf{x})$ the number of distinct dichotomies generated by $H$ over that sample and by $\Pi_H(m)$ the growth function:

$$S_H(\mathbf{x}) = \big|\big\{\big(h(x_1), \ldots, h(x_m)\big) \colon h \in H\big\}\big| \qquad (3)$$
$$\Pi_H(m) = \max_{x_1, \ldots, x_m \in X} S_H(x_1, \ldots, x_m). \qquad (4)$$
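For a finite hypothesis set over a finite domain, both quantities can be computed by brute force. The following sketch counts dichotomies and evaluates the growth function for a small class of threshold classifiers; the class, domain, and sample size are illustrative choices.

```python
import itertools
import numpy as np

def num_dichotomies(hypotheses, sample):
    """S_H(x_1, ..., x_m): number of distinct dichotomies generated by H on the sample."""
    return len({tuple(h(x) for x in sample) for h in hypotheses})

def growth_function(hypotheses, domain, m):
    """Pi_H(m): maximum number of dichotomies over all samples of size m from the domain."""
    return max(num_dichotomies(hypotheses, sample)
               for sample in itertools.combinations(domain, m))

# Illustrative class: threshold classifiers h_t(x) = 1[x >= t] on a finite domain.
thresholds = np.linspace(0.0, 1.0, 21)
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]
domain = np.linspace(0.05, 0.95, 10)
print(growth_function(hypotheses, domain, m=3))   # m + 1 = 4 dichotomies for thresholds
```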

3 Relative deviation bounds

In this section we prove a series of relative deviation learning bounds, which we use in the next section to derive generalization bounds for unbounded loss functions. We will assume throughout the paper, as is common in much of learning theory, that each supremum over $h \in H$ appearing in our expressions defines a measurable function, which is not guaranteed when $H$ is not a countable set. This assumption nevertheless holds in most common applications of machine learning.

We start with the proof of a symmetrization lemma (Lemma 3) originally presented by Vapnik (1998) and also used by Anthony and Shawe-Taylor (1993). These publications and several others have all relied on a lower bound on the probability that a binomial random variable with $m$ trials exceeds its expected value when the bias $p$ verifies $p \geq \frac{1}{m}$. To our knowledge, no rigorous proof of this fact was ever provided in the literature in the full generality needed. The proof of this result was recently given by Greenberg and Mohri (2013).

{lemma}

[Greenberg and Mohri (2013)] Let $X$ be a random variable distributed according to the binomial distribution $B(m, p)$ with $m$ a positive integer (the number of trials) and $p \geq \frac{1}{m}$ (the probability of success of each trial). Then, the following inequality holds:

$$\Pr[X \geq mp] > \frac{1}{4}, \qquad (5)$$

where $mp$ is the expected value of $X$. The lower bound $\frac{1}{4}$ is never reached but is approached asymptotically for $m = 2$ as $p \to \frac{1}{2}$ from the right.
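As a quick numerical sanity check of this inequality, the probability $\Pr[X \geq mp]$ can be evaluated exactly from the binomial probability mass function; the grid of values of $m$ and $p$ below is an arbitrary choice.

```python
import math

def prob_exceeds_expectation(m, p):
    """Pr[B(m, p) >= m p], computed exactly from the binomial pmf."""
    k_min = math.ceil(m * p)
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(k_min, m + 1))

# The minimum over each grid of biases with p >= 1/m should stay above 1/4.
for m in [2, 5, 10, 50, 200]:
    grid = [1.0 / m + i * (1.0 - 1.0 / m) / 200 for i in range(201)]
    print(m, min(prob_exceeds_expectation(m, p) for p in grid))
```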

Our proof of Lemma 3 is more concise than that of Vapnik (1998). Furthermore, our statement and proof handle the technical problem of the discontinuity at zero ignored by previous authors: the denominator may in general become zero, which would lead to an undefined result. We resolve this issue by including an arbitrary positive constant $\tau$ in the denominator in most of our expressions.

For the proof of the following result, we will use the function defined in Lemma A. By Lemma A, this function is increasing in its first argument and decreasing in its second argument.

{lemma}

Let . Assume that . Then, for any hypothesis set and any , the following holds:

{proof}

We give a concise version of the proof given by Vapnik (1998). We first show that the following implication holds for any $h \in H$:

(6)

The first condition can be equivalently rewritten as , which implies

(7)

since . Assume that the antecedent of the implication (6) holds for . Then, in view of the monotonicity properties of function (Lemma A), we can write:

( and 1st ineq. of (7))
(2nd ineq. of (7))
()

which proves (6). Now, by definition of the supremum, for any , there exists such that

(8)

Using the definition of and implication (6), we can write

(by def. of )

We now show that this implies the following inequality

(9)

by distinguishing two cases. If , since , by Theorem 3 the inequality holds, which immediately yields (9). Otherwise, we have . Then, by (7), the condition cannot hold for any sample, which by (8) implies that the condition cannot hold for any sample , in which case (9) trivially holds. Now, since (9) holds for all , we can take the limit and use the right-continuity of the cumulative distribution function to obtain

which completes the proof of Lemma 3. Note that the factor of 4 in the statement of Lemma 3 can be modestly improved by changing the assumed condition from  to  for constant values of . This leads to a slightly better lower bound on , e.g.,  rather than  for , at the expense of not covering cases where the number of samples is less than . For some values of , e.g., , covering these cases is not needed for the proof of our main theorem (Theorem 3), though. However, this does not seem to simplify the critical task of proving a lower bound on the probability that a binomial random variable exceeds its expected value. One might hope that restricting the range of the bias in this way would help simplify the proof of such a lower bound. Unfortunately, our analysis of this problem and our proof (Greenberg and Mohri, 2013) suggest that this is not the case, since the regime where the bias is small seems to be the easiest one to analyze for this problem.

Figure 1: These plots depict $\Pr[X \geq mp]$, the probability that a binomially distributed random variable exceeds its expectation, as a function of the trial success probability $p$. The left and right plots each restrict attention to a different region of the parameters. Each colored line corresponds to a different number of trials $m$. The dashed horizontal line at $\frac{1}{4}$ represents the value of the lower bound used in the proof of Lemma 3.

The result of Lemma 3 is a one-sided inequality. The proof of a similar result (Lemma 3) with the roles of $R(h)$ and $\widehat{R}_S(h)$ interchanged makes use of the following result.

{lemma}

[Greenberg and Mohri (2013)] Let $X$ be a random variable distributed according to the binomial distribution $B(m, p)$ with $m$ a positive integer and $p \leq 1 - \frac{1}{m}$. Then, the following inequality holds:

$$\Pr[X \leq mp] > \frac{1}{4}, \qquad (10)$$

where $mp$ is the expected value of $X$.
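The same kind of numerical check applies to this companion inequality, again over an arbitrary grid of values of $m$ and $p$ satisfying the condition of the lemma.

```python
import math

def prob_at_most_expectation(m, p):
    """Pr[B(m, p) <= m p], computed exactly from the binomial pmf."""
    k_max = math.floor(m * p)
    return sum(math.comb(m, k) * p**k * (1 - p)**(m - k) for k in range(0, k_max + 1))

# The minimum over each grid of biases with 0 < p <= 1 - 1/m should stay above 1/4.
for m in [2, 5, 10, 50, 200]:
    grid = [(1.0 - 1.0 / m) * i / 200 for i in range(1, 201)]
    print(m, min(prob_at_most_expectation(m, p) for p in grid))
```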

The proof of the following lemma (Lemma 3) is novel.\footnote{A version of this lemma is stated by Boucheron et al. (2005), but no proof is given.} While the general strategy of the proof is similar to that of Lemma 3, there are some non-trivial differences due to the requirement of Theorem 1. The proof is not symmetric, as shown by the details given below.

{lemma}

Let . Assume that . Then, for any hypothesis set and any the following holds:

{proof}

Proceeding in a way similar to the proof of Lemma 3, we first show that the following implication holds for any :

(11)

The first condition can be equivalently rewritten as , which implies

(12)

since . Assume that the antecedent of the implication (11) holds for . Then, in view of the monotonicity properties of the function (Lemma A), we can write:

()
(1st ineq. of (12))
(2nd ineq. of (12))
()

which proves (11). For the application of Theorem 1 to a hypothesis , the condition is required. Observe that this is implied by the assumptions and :

The rest of the proof proceeds nearly identically to that of Lemma 3.

In the statements of all the following results, the term $\mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})]$ can be replaced by the upper bound $\Pi_H(2m)$ to derive simpler expressions. By Sauer's lemma (Sauer, 1972; Vapnik and Chervonenkis, 1971), the VC-dimension $d$ of the family $H$ can be further used to bound these quantities, since $\Pi_H(m) \leq \left(\frac{em}{d}\right)^d$ for $m \geq d$. The first inequality of the following theorem was originally stated and proven by Vapnik (1998, 2006b), and later by Anthony and Shawe-Taylor (1993) (in the special case $\tau = 0$) with a somewhat more favorable constant, in both cases modulo the incomplete proof of the symmetrization lemma and the technical issue related to the denominator taking the value zero, as already pointed out. The second inequality of the theorem and its proof are novel. Our proofs benefit from the improved analysis of Anthony and Shawe-Taylor (1993).
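To make this chain of upper bounds concrete, the following sketch evaluates the Sauer's lemma bound on $\Pi_H(2m)$ for a hypothetical VC-dimension $d$ and compares it with the simpler $\left(\frac{em}{d}\right)^d$-style estimate; the particular values of $d$ and $m$ are arbitrary.

```python
import math

def sauer_bound(m, d):
    """Sauer's lemma: Pi_H(m) <= sum_{i=0}^{d} C(m, i) for a class of VC-dimension d."""
    return sum(math.comb(m, i) for i in range(d + 1))

def simple_estimate(m, d):
    """Looser but simpler estimate (e m / d)^d, valid for m >= d."""
    return (math.e * m / d) ** d

d, m = 10, 500                      # hypothetical VC-dimension and sample size
print(sauer_bound(2 * m, d))        # bound on Pi_H(2m), the quantity used above
print(simple_estimate(2 * m, d))    # the (2 e m / d)^d estimate
```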

{theorem}

For any hypothesis set $H$ of functions mapping a set $X$ to $\{0, 1\}$, and any fixed $\epsilon > 0$ and $\tau > 0$, the following two inequalities hold:

$$\Pr\bigg[\sup_{h \in H} \frac{R(h) - \widehat{R}_S(h)}{\sqrt{R(h) + \tau}} > \epsilon\bigg] \leq 4 \, \mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})] \, \exp\Big(\!-\frac{m \epsilon^2}{4}\Big)$$

$$\Pr\bigg[\sup_{h \in H} \frac{\widehat{R}_S(h) - R(h)}{\sqrt{\widehat{R}_S(h) + \tau}} > \epsilon\bigg] \leq 4 \, \mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})] \, \exp\Big(\!-\frac{m \epsilon^2}{4}\Big).$$

{proof}

We first consider the case where , which is not covered by Lemma 3. We can then write

for . Thus, the bounds of the theorem hold trivially in that case. On the other hand, when , we can apply Lemma 3 and Lemma 3. Therefore, to prove Theorem 3, it suffices to work with the symmetrized expression , rather than working directly with our original expressions  and . To upper bound the probability that the symmetrized expression is larger than , we begin by introducing a vector of Rademacher random variables $\boldsymbol{\sigma} = (\sigma_1, \ldots, \sigma_m)$, where the $\sigma_i$ are independent, identically distributed random variables, each equally likely to take the value $+1$ or $-1$. Using the shorthand  for , we can then write

Now, for a fixed , we have , thus, by Hoeffding’s inequality, we can write

Since the variables , , take values in , we can write

where the last inequality holds since and the sum is either zero or greater than or equal to one. In view of this identity, we can write

We note now that the supremum over $h$ in the left-hand side expression in the statement of our theorem need not be over all hypotheses in $H$: without changing its value, we can replace $H$ with a smaller hypothesis set where only one hypothesis remains for each unique binary vector . The number of such hypotheses is , thus, by the union bound, the following holds:

The result follows by taking expectations with respect to the sample and applying Lemma 3 and Lemma 3, respectively.
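Purely as an illustration of the quantity controlled by the theorem, and not of its proof, the following simulation estimates how often the relative deviation $\sup_{h \in H} \big(R(h) - \widehat{R}_S(h)\big)/\sqrt{R(h) + \tau}$ exceeds a given threshold for a small class of threshold classifiers; the data model, the class, and all parameter values are arbitrary choices made for this sketch.

```python
import math
import numpy as np

rng = np.random.default_rng(1)
thresholds = np.linspace(-2.0, 2.0, 41)      # small finite class: h_t(x) = 1[x >= t]
m, tau, eps, n_trials = 200, 1e-3, 0.25, 500

# Data model for the sketch: x ~ N(0, 1) with deterministic label y = 1[x >= 0],
# so the true error of h_t is R(h_t) = |Phi(t) - Phi(0)| with Phi the normal CDF.
Phi = lambda t: 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))
R = np.array([abs(Phi(t) - Phi(0.0)) for t in thresholds])

exceed = 0
for _ in range(n_trials):
    x = rng.standard_normal(m)
    y = (x >= 0.0).astype(int)
    # Empirical error of each threshold classifier on this sample.
    R_hat = np.array([np.mean((x >= t).astype(int) != y) for t in thresholds])
    if np.max((R - R_hat) / np.sqrt(R + tau)) > eps:
        exceed += 1

print(exceed / n_trials)   # empirical frequency of a large relative deviation
```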

{corollary}

Let $\delta > 0$ and let $H$ be a hypothesis set of functions mapping $X$ to $\{0, 1\}$. Then, for any $m \geq 1$, each of the following two inequalities holds with probability at least $1 - \delta$ for all $h \in H$:

$$R(h) - \widehat{R}_S(h) \leq 2\sqrt{R(h)\,\frac{\log \mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})] + \log\frac{4}{\delta}}{m}}$$

$$\widehat{R}_S(h) - R(h) \leq 2\sqrt{\widehat{R}_S(h)\,\frac{\log \mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})] + \log\frac{4}{\delta}}{m}}.$$

{proof}

The result follows directly from Theorem 3 by setting $\delta$ to match the upper bounds and taking the limit $\tau \to 0$. For a hypothesis set with VC-dimension $d$, using $\mathbb{E}_{\mathbf{x} \sim D^{2m}}[S_H(\mathbf{x})] \leq \Pi_H(2m) \leq \left(\frac{2em}{d}\right)^d$ for $2m \geq d$, the inequalities become

$$R(h) - \widehat{R}_S(h) \leq 2\sqrt{R(h)\,\frac{d \log\frac{2em}{d} + \log\frac{4}{\delta}}{m}} \qquad (13)$$
$$\widehat{R}_S(h) - R(h) \leq 2\sqrt{\widehat{R}_S(h)\,\frac{d \log\frac{2em}{d} + \log\frac{4}{\delta}}{m}} \qquad (14)$$

with the familiar dependency $\sqrt{\frac{d \log\frac{2em}{d}}{m}}$. The advantage of these relative deviation bounds is clear: for small values of $R(h)$ (or $\widehat{R}_S(h)$), these inequalities provide tighter guarantees than standard generalization bounds. Solving the corresponding second-degree inequalities in $R(h)$ or $\widehat{R}_S(h)$ leads to the following results.

{corollary}

Let $\delta > 0$ and let $H$ be a hypothesis set of functions mapping $X$ to $\{0, 1\}$. Then, for any $m \geq 1$, each of the following two inequalities holds with probability at least $1 - \delta$ for all $h \in H$:

{proof}

The second-degree inequality corresponding to (13) can be written as

with , and implies . Squaring both sides gives:

The second inequality can be proven in the same way from (14). The learning bounds of the corollary make clear the presence of two terms: a term in $O(1/m)$ and a term in $O(1/\sqrt{m})$ which admits as a factor $\sqrt{\widehat{R}_S(h)}$ or $\sqrt{R(h)}$ and which, for small values of these quantities, can be more favorable than standard bounds. Theorem 3 can also be used to prove the following relative deviation bounds.
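The generic step of solving the second-degree inequality can be made concrete with the short sketch below; it treats the empirical error and the deviation term as free inputs and is not a statement of the exact constants appearing in the corollary.

```python
import math

def solve_relative_deviation_bound(r_hat, eps):
    """Solve R - eps * sqrt(R) - r_hat <= 0 for R.

    Viewing the inequality as a quadratic in sqrt(R) gives
    sqrt(R) <= (eps + sqrt(eps**2 + 4 * r_hat)) / 2; squaring yields the bound on R.
    """
    return ((eps + math.sqrt(eps**2 + 4.0 * r_hat)) / 2.0) ** 2

# The resulting bound is at most r_hat + eps * sqrt(r_hat) + eps**2, so when the
# empirical error r_hat is zero it decreases at the faster rate eps**2.
print(solve_relative_deviation_bound(r_hat=0.0, eps=0.1))    # eps**2 = 0.01
print(solve_relative_deviation_bound(r_hat=0.05, eps=0.1))   # interpolated regime
```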

The following theorem and its proof, which assumes the result of Theorem 3, were given by Anthony and Bartlett (1999).

{theorem}

For all , , the following inequalities hold:

{proof}

We prove the first statement; the proof of the second statement is identical modulo the permutation of the roles of $R(h)$ and $\widehat{R}_S(h)$. To do so, it suffices to determine $\epsilon$ such that

since we can then apply Theorem 3 with such an $\epsilon$ to bound the right-hand side and take the limit as $\tau \to 0$ to eliminate the $\tau$-dependence. To find such a choice of $\epsilon$, we begin by observing that for any ,

(15)

Assume now that for some , which is equivalent to . We will prove that this implies (15). To show that, we distinguish two cases, and , with . The first case implies the following:

The second case is equivalent to and implies