Learning with Square Loss: Localization through Offset Rademacher Complexity

Tengyuan Liang, Department of Statistics, The Wharton School, University of Pennsylvania    Alexander Rakhlin, Department of Statistics, The Wharton School, University of Pennsylvania    Karthik Sridharan, Department of Computer Science, Cornell University
Abstract

We consider regression with square loss and general classes of functions without the boundedness assumption. We introduce a notion of offset Rademacher complexity that provides a transparent way to study localization both in expectation and in high probability. For any (possibly non-convex) class, the excess loss of a two-step estimator is shown to be upper bounded by this offset complexity through a novel geometric inequality. In the convex case, the estimator reduces to an empirical risk minimizer. The method recovers the results of [18] for the bounded case while also providing guarantees without the boundedness assumption.

1 Introduction

Determining the finite-sample behavior of risk in the problem of regression is arguably one of the most basic problems of Learning Theory and Statistics. This behavior can be studied in substantial generality with the tools of empirical process theory. When functions in a given convex class are uniformly bounded, one may verify the so-called “Bernstein condition.” The condition—which relates the variance of the increments of the empirical process to their expectation—implies a certain localization phenomenon around the optimum and forms the basis of the analysis via local Rademacher complexities. The technique has been developed in [9, 8, 5, 2, 4], among others, based on Talagrand’s celebrated concentration inequality for the supremum of an empirical process.

In a recent pathbreaking paper, [14] showed that a large part of this heavy machinery is not necessary for obtaining tight upper bounds on excess loss, even—and especially—if functions are unbounded. Mendelson observed that only one-sided control of the tail is required in the deviation inequality, and, thankfully, it is the tail that can be controlled under very mild assumptions.

In a parallel line of work, the search within the online learning setting for an analogue of “localization” has led to a notion of an “offset” Rademacher process [17], yielding—in a rather clean manner—optimal rates for minimax regret in online supervised learning. It was also shown that the supremum of the offset process is a lower bound on the minimax value, thus establishing its intrinsic nature. The present paper blends the ideas of [14] and [17]. We introduce the notion of an offset Rademacher process for i.i.d. data and show that the supremum of this process upper bounds (both in expectation and in high probability) the excess risk of an empirical risk minimizer (for convex classes) and a two-step Star estimator of [1] (for arbitrary classes). The statement holds under a weak assumption even if functions are not uniformly bounded.

The offset Rademacher complexity provides an intuitive alternative to the machinery of local Rademacher averages. Let us recall that the Rademacher process indexed by a function class $\mathcal{F}$ is defined as the stochastic process $f \mapsto \frac{1}{n}\sum_{i=1}^n \epsilon_i f(x_i)$, where $x_1,\dots,x_n$ are held fixed and $\epsilon_1,\dots,\epsilon_n$ are i.i.d. Rademacher random variables. We define the offset Rademacher process as the stochastic process

$$ f \;\mapsto\; \frac{1}{n}\sum_{i=1}^n \Big[ \epsilon_i f(x_i) - c\, f(x_i)^2 \Big] $$

for some $c > 0$. The process itself captures the notion of localization: when $f(x_i)$ is large in magnitude, the negative quadratic term acts as a compensator and “extinguishes” the fluctuations of the term involving Rademacher variables. The supremum of the process will be termed offset Rademacher complexity, and one may expect that this complexity is of a smaller order than the classical Rademacher averages (which, without localization, cannot be better than the rate of $n^{-1/2}$).

The self-modulating property of the offset complexity can be illustrated on the canonical example of a linear class $\mathcal{F} = \{x \mapsto \langle w, x\rangle : w \in \mathbb{R}^d\}$, in which case the offset Rademacher complexity becomes

$$ \mathbb{E}_\epsilon \sup_{w\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \Big[ \epsilon_i \langle w, x_i\rangle - c\,\langle w, x_i\rangle^2 \Big] \;=\; \frac{1}{4cn}\,\Big\| \Sigma^{-1/2}\, n^{-1/2}\sum_{i=1}^n \epsilon_i x_i \Big\|^2, $$

where $\Sigma = \frac{1}{n}\sum_{i=1}^n x_i x_i^\top$. Under mild conditions, the above expression is of the order $d/n$ in expectation and in high probability — a familiar rate achieved by ordinary least squares, at least in the case of a well-specified model. We refer to Section 6 for the precise statement for both the well-specified and misspecified cases.
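As a quick illustration of this closed form (not part of the original analysis), the following Python sketch compares a Monte Carlo estimate of the offset complexity of a linear class under a standard Gaussian design with the $d/(4cn)$ value suggested by the display above; the choices of $n$, $d$, and $c$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, c = 500, 5, 1.0             # sample size, dimension, offset constant (arbitrary choices)
X = rng.standard_normal((n, d))
Sigma = X.T @ X / n               # Gram matrix (1/n) sum_i x_i x_i^T

def offset_sup(eps):
    """Closed form of sup_w (1/n) sum_i [eps_i <w, x_i> - c <w, x_i>^2]."""
    a = X.T @ eps / n             # (1/n) sum_i eps_i x_i
    return a @ np.linalg.solve(Sigma, a) / (4 * c)

vals = [offset_sup(rng.choice([-1.0, 1.0], size=n)) for _ in range(2000)]
print("Monte Carlo offset complexity:", float(np.mean(vals)))
print("d / (4 c n):                  ", d / (4 * c * n))
```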

Our contributions can be summarized as follows. First, we show that offset Rademacher complexity is an upper bound on excess loss of the proposed estimator, both in expectation and in deviation. We then extend the chaining technique to quantify the behavior of the supremum of the offset process in terms of covering numbers. By doing so, we recover the rates of aggregation established in [18] and, unlike the latter paper, the present method does not require boundedness (of the noise and functions). We provide a lower bound on minimax excess loss in terms of offset Rademacher complexity, indicating its intrinsic nature for the problems of regression. While our in-expectation results for bounded functions do not require any assumptions, the high probability statements rest on a lower isometry assumption that holds, for instance, for subgaussian classes. We show that offset Rademacher complexity can be further upper bounded by the fixed-point complexities defined by Mendelson [14]. We conclude with the analysis of ordinary least squares.

2 Problem Description and the Estimator

Let $\mathcal{F}$ be a class of functions on a probability space $(\mathcal{X}, P_X)$. The response is given by an unknown random variable $Y$, distributed jointly with $X$ according to $P = P_X \times P_{Y|X}$. We observe a sample $(X_1,Y_1),\dots,(X_n,Y_n)$ distributed i.i.d. according to $P$ and aim to construct an estimator $\hat f$ with small excess loss $\mathcal{E}(\hat f)$, where

$$ \mathcal{E}(f) \;\triangleq\; \mathbb{E}(f(X)-Y)^2 - \inf_{g\in\mathcal{F}} \mathbb{E}(g(X)-Y)^2 \qquad\qquad (1) $$

and $\mathbb{E}$ is the expectation with respect to $P$. Let $\hat{\mathbb{E}}$ denote the empirical expectation operator and define the following two-step procedure:

$$ \hat g \;=\; \operatorname*{argmin}_{f\in\mathcal{F}}\, \hat{\mathbb{E}}(f(X)-Y)^2, \qquad \hat f \;=\; \operatorname*{argmin}_{f\in\mathrm{star}(\mathcal{F},\hat g)}\, \hat{\mathbb{E}}(f(X)-Y)^2, \qquad\qquad (2) $$

where $\mathrm{star}(\mathcal{F},\hat g) = \{\lambda \hat g + (1-\lambda) f : f\in\mathcal{F},\ \lambda\in[0,1]\}$ is the star hull of $\mathcal{F}$ around $\hat g$ (we abbreviate $\mathrm{star}(\mathcal{F},0)$ as $\mathrm{star}(\mathcal{F})$). This two-step estimator was introduced (to the best of our knowledge) by [1] for a finite class $\mathcal{F}$. We will refer to the procedure as the Star estimator. Audibert showed that this method is deviation-optimal for finite aggregation — the first such result, followed by other estimators with similar properties [10, 6] for the finite case. We present an analysis that quantifies the behavior of this method for arbitrary classes of functions. The method has several nice features. First, it provides an alternative to the 3-stage discretization method of [18], does not require prior knowledge of the entropy of the class, and goes beyond the bounded case. Second, it admits an upper bound in terms of offset Rademacher complexity via relatively routine arguments, under rather weak assumptions. Third, it naturally reduces to empirical risk minimization for convex classes (indeed, this happens whenever $\mathrm{star}(\mathcal{F},\hat g)\subseteq\mathcal{F}$).
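For concreteness, here is a minimal Python sketch of the two-step procedure for a finite dictionary of predictors, with the second-stage minimization over the star hull done by a crude grid search over $\lambda$; the function name, the grid, and the toy data are illustrative choices, not part of the paper.

```python
import numpy as np

def star_estimator(preds, y, grid=np.linspace(0.0, 1.0, 101)):
    """Two-step Star estimator for a finite dictionary.

    preds : (n, M) array whose column j holds f_j(X_1), ..., f_j(X_n)
    y     : (n,) vector of responses
    Returns the fitted values of the second-stage minimizer over
    star(F, g_hat) = {lam * g_hat + (1 - lam) * f : f in F, lam in [0, 1]}.
    """
    # Step 1: empirical risk minimizer g_hat over the finite class F.
    risks = ((preds - y[:, None]) ** 2).mean(axis=0)
    g_hat = preds[:, risks.argmin()]

    # Step 2: empirical risk minimizer over the star hull of F around g_hat.
    # Each segment gives a one-dimensional quadratic in lam; a grid search
    # keeps the sketch short (a closed-form minimizer clipped to [0, 1]
    # would be equivalent).
    best_val, best_fit = np.inf, g_hat
    for j in range(preds.shape[1]):
        for lam in grid:
            h = lam * g_hat + (1.0 - lam) * preds[:, j]
            val = ((h - y) ** 2).mean()
            if val < best_val:
                best_val, best_fit = val, h
    return best_fit

# Toy usage: aggregate three constant predictors on noisy data.
rng = np.random.default_rng(1)
y = 0.3 + 0.5 * rng.standard_normal(200)
F = np.stack([np.full(200, v) for v in (-1.0, 0.0, 1.0)], axis=1)
fit = star_estimator(F, y)
print("second-stage empirical risk:", float(((fit - y) ** 2).mean()))
```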

Let $f^*$ denote the minimizer

$$ f^* \;=\; \operatorname*{argmin}_{f\in\mathcal{F}}\, \mathbb{E}(f(X)-Y)^2, $$

and let $\xi$ denote the “noise”

$$ \xi \;=\; Y - f^*(X). $$

We say that the model is misspecified if the regression function $\mathbb{E}[Y\mid X] \notin \mathcal{F}$, which means $\xi$ is not zero-mean. Otherwise, we say that the model is well-specified.

3 A Geometric Inequality

We start by proving a geometric inequality for the Star estimator. This deterministic inequality holds conditionally on $X_1,\dots,X_n$, and therefore reduces to a problem in $\mathbb{R}^n$.

Lemma 1 (Geometric Inequality).

The two-step estimator $\hat f$ in (2) satisfies

$$ \hat{\mathbb{E}}(\hat f - Y)^2 \;\le\; \hat{\mathbb{E}}(h - Y)^2 - c\,\hat{\mathbb{E}}(h - \hat f)^2 \qquad\qquad (3) $$

for any $h\in\mathcal{F}$ and $c = 1/18$. If $\mathcal{F}$ is convex, (3) holds with $c = 1$. Moreover, if $\mathcal{F}$ is a linear subspace, (3) holds with equality and $c = 1$ by the Pythagorean theorem.

Remark 1.

In the absence of convexity of $\mathcal{F}$, the two-step estimator mimics the key Pythagorean identity, though with a constant $c$ smaller than one. We have not focused on optimizing $c$ but rather on presenting a clean geometric argument.

Proof of Lemma 1.

Define the empirical distance to be $\|f-g\|_n \triangleq \big(\hat{\mathbb{E}}(f-g)^2\big)^{1/2}$ for any $f, g$, and the empirical product to be $\langle f, g\rangle_n \triangleq \hat{\mathbb{E}}[fg]$. We will slightly abuse the notation by identifying every function with the vector of its values on $X_1,\dots,X_n$, i.e. its projection onto $\mathbb{R}^n$.

Denote the ball (and sphere) centered at and with radius to be (and , correspondingly). In a similar manner, define and . By the definition of the Star algorithm, we have . The statement holds with if , and so we may assume . Denote by the conic hull of with origin at . Define the spherical cap outside the cone to be (drawn in red in Figure 3).

First, by the optimality of , for any , we have , i.e. any is not in the interior of . Furthermore, is not in the interior of the cone , as otherwise there would be a point inside strictly better than . Thus .

Second, and it is a contact point of and . Indeed, is necessarily on a line segment between and a point outside that does not pass through the interior of by optimality of . Let be the set of all contact points – potential locations of .

Now we fix and consider the two dimensional plane that passes through three points , depicted in Figure 3. Observe that the left-hand-side of the desired inequality (3) is constant as ranges over . To prove the inequality it therefore suffices to choose a value that maximizes the right-hand-side. The maximization of over is achieved by . This can be argued simply by symmetry: the two-dimensional plane intersects in a line and the distance between and is maximized at the extreme point of this intersection. Hence, to prove the desired inequality, we can restrict our attention to the plane and instead of .

For any , define the projection of onto the shell to be . We first prove (3) for and then extend the statement to . By the geometry of the cone,

By triangle inequality,

Rearranging,

By the Pythagorean theorem,

thus proving the claim for for constant .

We can now extend the claim to . Indeed, due to the fact that and the geometry of the projection , we have . Thus

This proves the claim for with constant .

An upper bound on excess loss follows immediately from Lemma 1.

Corollary 2.

Conditioned on the data $\{(X_i,Y_i)\}_{i=1}^n$, we have a deterministic upper bound for the Star algorithm $\hat f$:

$$ \mathcal{E}(\hat f) \;\le\; \mathbb{E}(\hat f - f^*)^2 \;-\; (1+c)\,\hat{\mathbb{E}}(\hat f - f^*)^2 \;+\; 2\,\hat{\mathbb{E}}\big[\xi\,(\hat f - f^*)\big] \;-\; 2\,\mathbb{E}\big[\xi\,(\hat f - f^*)\big] \qquad\qquad (4) $$

with the value of the constant $c$ given in Lemma 1.

Proof.

Since $Y = f^*(X) + \xi$, the excess loss can be written as $\mathcal{E}(\hat f) = \mathbb{E}(\hat f - f^*)^2 - 2\,\mathbb{E}[\xi(\hat f - f^*)]$, with the expectation conditional on the data. On the other hand, expanding Lemma 1 with $h = f^*$ gives $\hat{\mathbb{E}}(\hat f - f^*)^2 - 2\,\hat{\mathbb{E}}[\xi(\hat f - f^*)] \le -c\,\hat{\mathbb{E}}(\hat f - f^*)^2$, that is, $0 \le -(1+c)\,\hat{\mathbb{E}}(\hat f - f^*)^2 + 2\,\hat{\mathbb{E}}[\xi(\hat f - f^*)]$. Adding the two displays yields (4).

An attentive reader will notice that the multiplier $(1+c)$ on the negative empirical quadratic term in (4) is slightly larger than the multiplier on the expected quadratic term. This is the starting point of the analysis that follows.

4 Symmetrization

We will now show that the discrepancy in the multiplier constant in (4) leads to offset Rademacher complexity through rather elementary symmetrization inequalities. We perform this analysis both in expectation (for the case of bounded functions) and in high probability (for the general unbounded case). While the former result follows from the latter, the in-expectation statement for bounded functions requires no assumptions, in contrast to control of the tails.

Theorem 3.

Define the set $\mathcal{H} \triangleq \{\lambda g + (1-\lambda) f - f^* : f, g\in\mathcal{F},\ \lambda\in[0,1]\}$, which contains $\hat f - f^*$ for any realization of the data. The following expectation bound on excess loss of the Star estimator holds:

$$ \mathbb{E}\,\mathcal{E}(\hat f) \;\le\; C\,\mathbb{E}\sup_{h\in\mathcal{H}}\bigg\{ \frac{1}{n}\sum_{i=1}^n \epsilon_i h(X_i) \;-\; \frac{c'}{n}\sum_{i=1}^n h(X_i)^2 \bigg\}, $$

where $\epsilon_{1:n}$ are independent Rademacher random variables, and the constants $C, c' > 0$ depend only on the constant $c$ of Lemma 1 and on $K$ and $M$ such that $\sup_{f\in\mathcal{F}}|f(X)| \le K$ and $|\xi| \le M$ almost surely.

The proof of the theorem involves an introduction of independent Rademacher random variables and two contraction-style arguments to remove the multipliers $\xi_i$. These algebraic manipulations are postponed to the appendix.

The term in the curly brackets will be called an offset Rademacher process, and the expected supremum — an offset Rademacher complexity. While Theorem 3 only applies to bounded functions and bounded noise, the upper bound already captures the localization phenomenon, even for non-convex function classes (and thus goes well beyond the classical local Rademacher analysis).

As argued in [14], it is the contraction step that requires boundedness of the functions when analyzing the square loss. Mendelson uses a small ball assumption (a weak condition on the distribution, stated below) to split the analysis into the study of the multiplier and quadratic terms. This assumption allows one to compare the expected square of any function to its empirical version, to within a multiplicative constant that depends on the small ball property. In contrast, we need a somewhat stronger assumption that will allow us to take this constant to be at least $\frac{1}{1+c}$, where $c$ is the constant of Lemma 1. We phrase this condition—the lower isometry bound—as follows. (We thank Shahar Mendelson for pointing out that the small ball condition in the initial version of this paper was too weak for our purposes.)

Definition 1 (Lower Isometry Bound).

We say that a function class $\mathcal{F}$ satisfies the lower isometry bound with some parameters $0<\eta<1$ and $0<\delta<1$ if

$$ \mathbb{P}\left( \inf_{f\in\mathcal{F}\setminus\{0\}} \frac{\frac{1}{n}\sum_{i=1}^n f(X_i)^2}{\mathbb{E}f^2} \;\ge\; 1-\eta \right) \;\ge\; 1-\delta \qquad\qquad (5) $$

for all $n \ge n_0(\mathcal{F},\delta,\eta)$, where $n_0(\mathcal{F},\delta,\eta)$ depends on the complexity of the class.
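For intuition only (this is not from the paper), the condition is easy to probe numerically in the simplest sub-gaussian example of a linear class with standard Gaussian design, where the infimum over the class equals the smallest eigenvalue of the empirical second-moment matrix; the sketch below, with arbitrary choices of $n$, $d$, and $\eta$, estimates the probability appearing in (5).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, eta = 200, 10, 0.2          # illustrative choices

def lower_isometry_holds():
    # Linear class f_w(x) = <w, x> with E[X X^T] = I, so E f_w^2 = |w|^2 and
    # inf_w (1/n) sum_i f_w(X_i)^2 / E f_w^2 equals the smallest eigenvalue
    # of the empirical second-moment matrix.
    X = rng.standard_normal((n, d))
    lam_min = np.linalg.eigvalsh(X.T @ X / n)[0]
    return lam_min >= 1.0 - eta

freq = np.mean([lower_isometry_holds() for _ in range(1000)])
print("empirical estimate of the probability in (5):", float(freq))
```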

In general this is a mild assumption that requires good tail behavior of functions in $\mathcal{F}$, yet it is stronger than the small ball property. Mendelson [16] shows that this condition holds for heavy-tailed classes under the small ball condition together with a norm-comparison property. We also remark that the condition in Definition 1 holds for sub-gaussian classes using concentration tools, as already shown in [11]. For completeness, let us also state the small ball property:

Definition 2 (Small Ball Property [14, 15]).

The class of functions $\mathcal{F}$ satisfies the small-ball condition if there exist constants $\kappa > 0$ and $\epsilon > 0$ such that for every $f\in\mathcal{F}$,

$$ \mathbb{P}\big( |f(X)| \ge \kappa\, \|f\|_{L_2} \big) \;\ge\; \epsilon. $$

Armed with the lower isometry bound, we now prove that the tail behavior of the deterministic upper bound in (4) can be controlled via the tail behavior of offset Rademacher complexity.

Theorem 4.

Define the set . Assume the lower isometry bound in Definition 1 holds with and some , where is the constant in (3). Let . Define

Then there exist two absolute constants (only depends on ), such that

for any

as long as .

Theorem 4 states that the excess loss is stochastically dominated by the offset Rademacher complexity. We remark that the stated requirement holds under mild moment conditions.

Remark 2.

In certain cases, Definition 1 can be shown to hold only for those $f\in\mathcal{F}$ with $\mathbb{E}f^2 \ge r^2$ (rather than for all $f\in\mathcal{F}$), for some critical radius $r$, as soon as $n$ is large enough (see [16]). In this case, the bound on the offset complexity is only affected additively by $r^2$.

We postpone the proof of the Theorem to the appendix. In a nutshell, it extends the classical probabilistic symmetrization technique [7, 13] to the non-zero-mean offset process under investigation.

5 Offset Rademacher Process: Chaining and Critical Radius

Let us summarize the development so far. We have shown that excess loss of the Star estimator is upper bounded by the (data-dependent) offset Rademacher complexity, both in expectation and in high probability, under the appropriate assumptions. We claim that the necessary properties of the estimator are now captured by the offset complexity, and we are now squarely in the realm of empirical process theory. In particular, we may want to quantify rates of convergence under complexity assumptions on , such as covering numbers. In contrast to local Rademacher analyses where one would need to estimate the data-dependent fixed point of the critical radius in some way, the task is much easier for the offset complexity. To this end, we study the offset process with the tools of empirical process theory.

5.1 Chaining Bounds

The first lemma describes the behavior of offset Rademacher process for a finite class.

Lemma 5.

Let $V \subset \mathbb{R}^n$ be a finite set of vectors of cardinality $N$. Then for any $C>0$,

$$ \mathbb{E}_\epsilon \max_{v\in V} \bigg[ \frac{1}{n}\sum_{i=1}^n \epsilon_i v_i - \frac{C}{n}\sum_{i=1}^n v_i^2 \bigg] \;\le\; \frac{\log N}{2Cn}. $$

Furthermore, for any $u > 0$,

$$ \mathbb{P}\left( \max_{v\in V} \bigg[ \frac{1}{n}\sum_{i=1}^n \epsilon_i v_i - \frac{C}{n}\sum_{i=1}^n v_i^2 \bigg] \;>\; \frac{\log N}{2Cn} + u \right) \;\le\; e^{-2Cnu}. $$

When the noise is unbounded,

where

(6)

Armed with the lemma for a finite collection, we upper bound the offset Rademacher complexity of a general class through the chaining technique. We perform the analysis in expectation and in probability. Recall that a $\gamma$-cover of a subset of a metric space is a collection of elements such that the union of the $\gamma$-balls with centers at these elements contains the subset. The covering number at scale $\gamma$ is the size of a minimal $\gamma$-cover.

One of the main objectives of symmetrization is to arrive at a stochastic process that can be studied conditionally on data, so that all the relevant complexities can be made sample-based (or, empirical). Since the functions only enter offset Rademacher complexity through their values on the sample $X_1,\dots,X_n$, we are left with a finite-dimensional object. Throughout the paper, we work with the empirical distance

$$ d_n(f,g) \;=\; \bigg( \frac{1}{n}\sum_{i=1}^n \big(f(X_i)-g(X_i)\big)^2 \bigg)^{1/2}. $$

The covering number of a class $\mathcal{G}$ at scale $\gamma$ with respect to $d_n$ will be denoted by $\mathcal{N}_2(\mathcal{G},\gamma)$.
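Because everything is computed on the sample, a $\gamma$-cover in $d_n$ can be built directly from the $n$-dimensional projections of the functions. The following Python sketch (illustrative only; the greedy construction yields a valid, though not necessarily minimal, cover) does this for a finite collection of prediction vectors.

```python
import numpy as np

def d_n(f_vals, g_vals):
    """Empirical L2 distance d_n between two prediction vectors."""
    return np.sqrt(np.mean((f_vals - g_vals) ** 2))

def greedy_cover(preds, gamma):
    """Greedy gamma-cover of the rows of `preds` (each row is one function
    evaluated on the sample X_1, ..., X_n) with respect to d_n."""
    centers = []
    for v in preds:
        if all(d_n(v, center) > gamma for center in centers):
            centers.append(v)        # v is not within gamma of any center yet
    return centers

# Toy usage: randomly scaled ramp functions evaluated on a sample of size n.
rng = np.random.default_rng(0)
n, M = 100, 500
preds = rng.standard_normal((M, 1)) * np.linspace(0.0, 1.0, n)
print("greedy cover size at gamma = 0.25:", len(greedy_cover(preds, 0.25)))
```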

Lemma 6.

Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. Then for any $\gamma > 0$,

$$ \mathbb{E}_\epsilon \sup_{g\in\mathcal{G}} \bigg[ \frac{1}{n}\sum_{i=1}^n \epsilon_i g(X_i) - \frac{C}{n}\sum_{i=1}^n g(X_i)^2 \bigg] \;\lesssim\; \frac{\log \mathrm{card}(\bar{\mathcal{G}})}{Cn} \;+\; \frac{1}{\sqrt n}\int_0^{\gamma} \sqrt{\log \mathcal{N}_2(\mathcal{G},\delta)}\;d\delta, $$

where $\bar{\mathcal{G}}$ is a cover of $\mathcal{G}$ on $X_1,\dots,X_n$ at scale $\gamma$ (assumed to contain the zero function) and $\lesssim$ hides an absolute constant.

Instead of assuming that the zero function is contained in the cover, we may simply increase the size of the cover by one element, which can be absorbed by a small change of the constant.

Let us discuss the upper bound of Lemma 6. First, we may take the lower limit of the entropy integral to be zero, unless the integral diverges (which happens for very large classes with entropy growth of $\delta^{-p}$, $p \ge 2$). Next, observe that the first term is precisely the rate of aggregation with a finite collection of size $\mathrm{card}(\bar{\mathcal{G}})$. Hence, the upper bound is an optimal balance of the following procedure: cover the set at scale $\gamma$ and pay the rate of aggregation for this finite collection, plus pay the rate of convergence of ERM within a $\gamma$-ball. The optimal balance is given by some $\gamma^*$ (and can be easily computed under assumptions on covering number behavior — see [17]). The optimal $\gamma^*$ quantifies the localization radius that arises from the curvature of the loss function. One may also view the optimal balance as the well-known equation

$$ \frac{\log \mathcal{N}_2(\mathcal{F},\gamma)}{n} \;\asymp\; \gamma^2, $$

studied in statistics [19] for well-specified models. The present paper, as well as [18], extends the analysis of this balance to the misspecified case and non-convex classes of functions.
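As a short worked instance of this balance (a standard calculation, consistent with Corollary 12 below): if the entropy grows as $\log \mathcal{N}_2(\mathcal{F},\gamma) \le \gamma^{-p}$ for some $p\in(0,2)$, then

$$ \frac{\gamma^{-p}}{n} \;\asymp\; \gamma^2 \qquad\Longrightarrow\qquad \gamma_n \;\asymp\; n^{-\frac{1}{2+p}}, $$

so the localization radius is $\gamma_n \asymp n^{-1/(2+p)}$ and the corresponding excess-loss rate is $\gamma_n^2 \asymp n^{-2/(2+p)}$.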

Now we provide a high probability analogue of Lemma 6.

Lemma 7.

Let $\mathcal{G}$ be a class of functions from $\mathcal{X}$ to $\mathbb{R}$. Then for any $\gamma > 0$ and any $u > 0$,

where $\bar{\mathcal{G}}$ is a cover of $\mathcal{G}$ on $X_1,\dots,X_n$ at scale $\gamma$ (assumed to contain the zero function) and $c_1, c_2$ are universal constants.

The above lemmas study the behavior of offset Rademacher complexity for an abstract class $\mathcal{G}$. Observe that the upper bounds in the previous sections are in terms of the class $\mathcal{H}$. This class, however, is not more complex than the original class $\mathcal{F}$ (with the exception of a finite class $\mathcal{F}$). More precisely, the covering numbers of $\mathcal{H}$ and $\mathcal{F}$ are bounded as

for any $\gamma > 0$. The following lemma shows that the complexity of the star hull is also not significantly larger than that of the original class.

Lemma 8 ([12], Lemma 4.5).

For any scale , the covering number of and that of are bounded in the sense

5.2 Critical Radius

Now let us study the critical radius of offset Rademacher processes. Let and define

(7)
Theorem 9.

Assume is star-shaped around 0 and the lower isometry bound holds for . Define the critical radius

Then we have with probability at least ,

which further implies

The first statement of Theorem 9 shows the self-modulating behavior of the offset process: there is a critical radius, beyond which the fluctuations of the offset process are controlled by those within the radius. To understand the second statement, we observe that the complexity is upper bounded by the corresponding complexity in [14], which is defined without the quadratic term subtracted off. Hence, offset Rademacher complexity is no larger (under the lower isometry bound of Definition 1) than the upper bounds obtained by [14] in terms of the critical radius.

6 Examples

In this section, we briefly describe several applications. The first is concerned with parametric regression.

Lemma 10.

Consider the parametric regression $Y = \langle w^*, X\rangle + \xi$ with $X, w^*\in\mathbb{R}^d$, where the noise $\xi$ need not be centered. The offset Rademacher complexity is bounded as

$$ \mathbb{E}_\epsilon \sup_{w\in\mathbb{R}^d} \bigg[ \frac{1}{n}\sum_{i=1}^n \epsilon_i \xi_i \langle w, X_i\rangle - \frac{c}{n}\sum_{i=1}^n \langle w, X_i\rangle^2 \bigg] \;\le\; \frac{1}{4cn^2}\sum_{i=1}^n \xi_i^2\, X_i^\top \Sigma^{-1} X_i $$

and

where $\Sigma = \frac{1}{n}\sum_{i=1}^n X_i X_i^\top$ is the Gram matrix and $\xi_i = Y_i - \langle w^*, X_i\rangle$. In the well-specified case (that is, the $\xi_i$ are zero-mean), assuming that the conditional variance is $\sigma^2$, the expectation of the above quantity is $\frac{\sigma^2 d}{4cn}$ conditionally on the design matrix, and the excess loss is upper bounded by order $\frac{\sigma^2 d}{n}$.

Proof.

The offset Rademacher complexity can be interpreted as a Fenchel-Legendre transform, where

(8)

Thus we have in expectation

(9)

For the high-probability bound, note that the expression in Equation (8) is a Rademacher chaos of order two. Define the symmetric matrix with entries

and define

Then

and

Furthermore,

We apply the concentration result in [3, Exercise 6.9],

(10)
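As a numerical sanity check on the $\sigma^2 d/n$ rate (not part of the paper's argument), the following sketch fits ordinary least squares on a well-specified Gaussian-design model, where the excess loss reduces to $\|\hat w - w^*\|^2$, and compares its average to $\sigma^2 d/n$; all parameter values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, sigma = 10, 400, 0.5
w_star = rng.standard_normal(d)

def excess_loss_once():
    # Well-specified linear model with standard Gaussian design.
    X = rng.standard_normal((n, d))
    y = X @ w_star + sigma * rng.standard_normal(n)
    w_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    # For N(0, I) design, E(<w_hat, X> - Y)^2 - E(<w_star, X> - Y)^2 = |w_hat - w_star|^2.
    return np.sum((w_hat - w_star) ** 2)

runs = 200
print("average excess loss of OLS:", float(np.mean([excess_loss_once() for _ in range(runs)])))
print("sigma^2 d / n:             ", sigma**2 * d / n)
```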

For the finite dictionary aggregation problem, the following lemma shows control of offset Rademacher complexity.

Lemma 11.

Assume $\mathcal{F}$ is a finite class of cardinality $M$. Define the class $\{\lambda g + (1-\lambda) f : f, g\in\mathcal{F},\ \lambda\in[0,1]\}$, which contains the Star estimator defined in Equation (2). The offset Rademacher complexity for this class is bounded as

and

where is a constant depends on and

We observe that the bound of Lemma 11 is worse than the optimal bound of [1] by an additive term. This is due to the fact that the analysis for the finite case passes through the offset Rademacher complexity of the star hull, and in this case the star hull is richer than the finite class itself. For the finite case, a direct analysis of the Star estimator is provided in [1].

While the offset complexity of the star hull is crude for the finite case, the offset Rademacher complexity does capture the correct rates for regression with larger classes, initially derived in [18]. We briefly mention the result. The proof is identical to the one in [17], with the only difference that the offset Rademacher complexity is defined in that paper as a sequential complexity in the context of online learning.

Corollary 12.

Consider the problem of nonparametric regression, as quantified by the growth of the entropy

$$ \log \mathcal{N}_2(\mathcal{F},\gamma) \;\le\; \gamma^{-p}, \qquad p>0. $$

In the regime $p\in(0,2)$, the upper bound of Lemma 7 scales as $n^{-\frac{2}{2+p}}$. In the regime $p>2$, the bound scales as $n^{-\frac{1}{p}}$, with an extra logarithmic factor at $p=2$.

For the parametric case of entropy growth $\log\mathcal{N}_2(\mathcal{F},\gamma) \lesssim d\log(1/\gamma)$, one may also readily estimate the offset complexity. Results for VC classes, sparse combinations of dictionary elements, and other parametric cases follow easily by plugging in the estimate for the covering number or directly upper bounding the offset complexity (see [18, 17]).

7 Lower bound on Minimax Regret via Offset Rademacher Complexity

We conclude this paper with a lower bound on minimax regret in terms of offset Rademacher complexity.

Theorem 13 (Minimax Lower Bound on Regret).

Define the offset Rademacher complexity over as

then the following minimax lower bound on regret holds:

for any .

For the purposes of matching the performance of the Star procedure, we can take .

Appendix A Proofs

Proof of Theorem 3.

Since $\hat f$ is in the star hull around $\hat g$, the difference $\hat f - f^*$ must lie in the set $\mathcal{H}$. Hence, in view of (4), the excess loss is upper bounded by

(11)
(12)
(13)

We invoke the supporting Lemma 14 (stated and proved below) for the term (13):

(14)