# Sharp oracle inequalities for Least Squares estimators in shape restricted regression

Pierre C. Bellec (pierre.bellec@ensae.fr)

ENSAE and UMR CNRS 9194

ENSAE, 3 avenue Pierre Larousse, 92245 Malakoff Cedex, France.

July 3, 2019

###### Abstract

The performance of Least Squares (LS) estimators is studied in isotonic, unimodal and convex regression. Our results have the form of sharp oracle inequalities that account for the model misspecification error. In isotonic and unimodal regression, the LS estimator achieves the nonparametric rate $n^{-2/3}$ as well as a parametric rate of order $k/n$ up to logarithmic factors, where $k$ is the number of constant pieces of the true parameter.

In univariate convex regression, the LS estimator satisfies an adaptive risk bound of order $q/n$ up to logarithmic factors, where $q$ is the number of affine pieces of the true regression function. This adaptive risk bound holds for any design points. While Guntuboyina and Sen [11] established that the nonparametric rate of convex regression is of order $n^{-4/5}$ for equispaced design points, we show that the nonparametric rate of convex regression can be as slow as $n^{-2/3}$ for some worst-case design points. This phenomenon can be explained as follows: although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$.

arXiv: 1510.08029

This work was supported by GENES and by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).

## 1 Introduction

Assume that we have the observations

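$$Y_i=\mu_i+\xi_i,\qquad i=1,\dots,n, \tag{1.1}$$

where $\mu=(\mu_1,\dots,\mu_n)^T\in\mathbb{R}^n$ is unknown, $\xi=(\xi_1,\dots,\xi_n)^T$ is a noise vector with $n$-dimensional Gaussian distribution $\mathcal{N}(0,\sigma^2I_{n\times n})$ where $\sigma>0$ and $I_{n\times n}$ is the $n\times n$ identity matrix. We will also use the notation $g\coloneqq\xi/\sigma$ so that $g\sim\mathcal{N}(0,I_{n\times n})$ and $\xi=\sigma g$. Denote by $\mathbb{E}_\mu$ and $\mathbb{P}_\mu$ the expectation and the probability with respect to the distribution of the random variable $y\coloneqq(Y_1,\dots,Y_n)^T$. The vector $y$ is observed and the goal is to estimate $\mu$. The estimation error is measured with the scaled norm $\|\cdot\|$ defined by

$$\|u\|^2=\frac{1}{n}\sum_{i=1}^n u_i^2,\qquad u=(u_1,\dots,u_n)^T\in\mathbb{R}^n. \tag{1.2}$$

The error of an estimator $\hat\mu$ of $\mu$ is given by $\|\hat\mu-\mu\|^2$. Let also $|\cdot|_\infty$ be the infinity norm and $|\cdot|_2$ be the Euclidean norm, so that $\|\cdot\|^2=\frac{1}{n}|\cdot|_2^2$.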

This paper studies the Least Squares (LS) estimator in shape restricted regression under model misspecification. The LS estimator over a nonempty closed set $K\subset\mathbb{R}^n$ is defined by

$$\hat\mu^{\mathrm{LS}}(K)\in\operatorname*{argmin}_{u\in K}\|y-u\|^2. \tag{1.3}$$

Model misspecification means that the true parameter $\mu$ need not belong to $K$. There is a large literature on the performance of the LS estimator in isotonic and convex regression, that is, when the set $K$ is the set of all nondecreasing sequences or the set of convex sequences. Some of these results are reviewed in the following subsections.

### 1.1 Isotonic regression

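Let $\mathcal{S}_n^{\uparrow}$ be the set of all nondecreasing sequences, defined by

$$\mathcal{S}_n^{\uparrow}\coloneqq\{u=(u_1,\dots,u_n)^T\in\mathbb{R}^n: u_i\le u_{i+1},\ i=1,\dots,n-1\}. \tag{1.4}$$

The set $\mathcal{S}_n^{\uparrow}$ is a closed convex cone. Two quantities are useful to describe the performance of the LS estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$. First, define the total variation by

$$V(\theta)\coloneqq\max_{i=1,\dots,n}\theta_i-\min_{i=1,\dots,n}\theta_i,\qquad \theta=(\theta_1,\dots,\theta_n)^T\in\mathbb{R}^n. \tag{1.5}$$

If $u\in\mathcal{S}_n^{\uparrow}$, its total variation is simply $V(u)=u_n-u_1$. Second, for $u\in\mathcal{S}_n^{\uparrow}$, let $k(u)\ge 1$ be the integer such that $k(u)-1$ is the number of inequalities $u_i\le u_{i+1}$ that are strict for $i=1,\dots,n-1$ (the number of jumps of $u$).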
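To make these definitions concrete, here is a minimal sketch in Python (standard library only). It computes the projection of a vector onto the nondecreasing sequences with the pool adjacent violators algorithm (PAVA), a standard method that is not described in this paper and is used here purely as an illustration, together with the total variation and the number of constant pieces. The function names `isotonic_ls`, `total_variation` and `num_pieces` are hypothetical.

```python
def isotonic_ls(y):
    """Least squares projection of y onto nondecreasing sequences (PAVA sketch)."""
    blocks = []  # each block stores [sum of its values, number of values]
    for value in y:
        blocks.append([float(value), 1])
        # Merge the last two blocks while their means violate monotonicity,
        # i.e. while mean(previous block) > mean(last block).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


def total_variation(theta):
    """V(theta) = max_i theta_i - min_i theta_i, as in the definition above."""
    return max(theta) - min(theta)


def num_pieces(u):
    """k(u) for a nondecreasing u: one plus the number of strict jumps."""
    return 1 + sum(1 for a, b in zip(u, u[1:]) if a < b)


fit = isotonic_ls([1, 3, 2, 4])
print(fit, total_variation(fit), num_pieces(fit))
```

In this toy input the violating pair (3, 2) is pooled into its average, so the fitted sequence has three constant pieces.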

Previous results on the performance of the LS estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$ can be found in [13, 19, 7, 8], where risk bounds or oracle inequalities with leading constant strictly greater than 1 are derived. Two types of risk bounds or oracle inequalities have been obtained so far. If $\mu\in\mathcal{S}_n^{\uparrow}$, it is known [13, 19, 7, 8] that for some absolute constant $c>0$,

$$\mathbb{E}_\mu\|\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})-\mu\|^2\le c\,\sigma^2\frac{\log(en)}{n}+c\,\sigma^2\left(\frac{V(\mu)}{\sigma n}\right)^{2/3} \tag{1.6}$$

and , cf. [19]. If $\mu\in\mathcal{S}_n^{\uparrow}$, the following oracle inequality was proved in [7]:

$$\mathbb{E}_\mu\|\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})-\mu\|^2\le 6\min_{u\in\mathcal{S}_n^{\uparrow}}\left(\|u-\mu\|^2+\frac{\sigma^2k(u)}{n}\log\frac{en}{k(u)}\right). \tag{1.7}$$

The risk bounds (1.6) and (1.7) hold under the assumption that $\mu\in\mathcal{S}_n^{\uparrow}$, which does not allow for any model misspecification. We will see below that this assumption can be dropped. The risk bound (1.6) implies that the LS estimator achieves the rate $n^{-2/3}$, while (1.7) yields a parametric rate (up to logarithmic factors) if $\mu$ is well approximated by a piecewise constant sequence with not too many pieces. Let us note that the bound (1.7) can be used to obtain that $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$ converges at the rate $n^{-2/3}$ up to logarithmic factors, thanks to the approximation argument given in [4, Lemma 2].

Minimax lower bounds that match (1.6) and (1.7) up to logarithmic factors have been obtained in [7, 4]. If is a fixed parameter and , the bound (1.6) yields the rate for the risk of . By the lower bound [4, Corollary 5], this rate is minimax optimal over the class if . Proposition 4 in [4] shows that there exist absolute constants $c,c'>0$ such that for any estimator $\hat\mu$,

$$\sup_{\mu\in\mathcal{S}_n^{\uparrow}:\,k(\mu)\le k}\mathbb{P}_\mu\left(\|\hat\mu-\mu\|^2\ge c\,\sigma^2k/n\right)\ge c'. \tag{1.8}$$

Together, (1.7) and (1.8) establish that for any $k=1,\dots,n$, the minimax rate over the class $\{\mu\in\mathcal{S}_n^{\uparrow}: k(\mu)\le k\}$ is of order $\sigma^2k/n$ up to logarithmic factors.

### 1.2 Convex regression with equispaced design points

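If $n\ge 3$, define the set of convex sequences $\mathcal{S}_n^{\mathrm{C}}$ by

$$\mathcal{S}_n^{\mathrm{C}}\coloneqq\{u=(u_1,\dots,u_n)^T\in\mathbb{R}^n: 2u_i\le u_{i+1}+u_{i-1},\ i=2,\dots,n-1\}.$$

For $u\in\mathcal{S}_n^{\mathrm{C}}$, let $q(u)$ be the smallest integer $q$ such that there exists a partition $(T_1,\dots,T_q)$ of $\{1,\dots,n\}$ and real numbers $a_1,\dots,a_q$ satisfying

$$u_i=a_j(i-l)+u_l,\qquad i,l\in T_j,\ j=1,\dots,q. \tag{1.9}$$

The quantity $q(u)$ is the smallest integer $q$ such that $u$ is piecewise affine with $q$ pieces. If $x_1<\dots<x_n$ are equispaced design points in $\mathbb{R}$, i.e., $x_{i+1}-x_i=x_2-x_1$, $i=1,\dots,n-1$, then

$$\mathcal{S}_n^{\mathrm{C}}=\{u\in\mathbb{R}^n: u=(f(x_1),\dots,f(x_n))^T\text{ for some convex function } f:\mathbb{R}\to\mathbb{R}\}. \tag{1.10}$$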

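The performance of the LS estimator over convex sequences has been recently studied in [11, 7], where it was proved that if $\mu\in\mathcal{S}_n^{\mathrm{C}}$, the estimator $\hat\mu=\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\mathrm{C}})$ satisfies

$$\mathbb{E}_\mu\|\hat\mu-\mu\|^2\le C\left(\min_{u\in\mathcal{S}_n^{\mathrm{C}}}\left(\|u-\mu\|^2+\frac{\sigma^2q(u)}{n}\left(\log\frac{en}{q(u)}\right)^{5/4}\right)\right)$$

for some absolute constant $C>0$. If and where $R_\mu$ defined in Corollary 4.4 below is a constant that depends only on $\mu$, then the estimator $\hat\mu=\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\mathrm{C}})$ satisfies

$$\mathbb{E}_\mu\|\hat\mu-\mu\|^2\le C\left(\frac{\sqrt{R_\mu}\,\sigma^2}{n}\right)^{4/5}\log(en)$$

for some absolute constant $C>0$. The first of these two bounds, the adaptive oracle inequality, yields an almost parametric rate if $\mu$ can be well approximated by a piecewise affine sequence with not too many pieces. If $R>0$ is a fixed parameter and $R_\mu\le R$, the second bound yields the rate $n^{-4/5}$, which is minimax optimal over the class $\{\mu\in\mathcal{S}_n^{\mathrm{C}}: R_\mu\le R\}$ up to logarithmic factors [11].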

The above results hold in convex regression for equispaced design points. The following subsection introduces the notation that will be used to study convex regression with non-equispaced design points.

### 1.3 Non-equispaced design points in convex regression

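If $x_1<\dots<x_n$ are non-equispaced design points in $\mathbb{R}$, define the cone

$$K^{\mathrm{C}}_{x_1,\dots,x_n}\coloneqq\{u\in\mathbb{R}^n: u=(f(x_1),\dots,f(x_n))^T\text{ for some convex function } f:\mathbb{R}\to\mathbb{R}\}.$$

This can be rewritten as

$$K^{\mathrm{C}}_{x_1,\dots,x_n}=\left\{u\in\mathbb{R}^n: \frac{u_i-u_{i-1}}{x_i-x_{i-1}}\le\frac{u_{i+1}-u_i}{x_{i+1}-x_i},\ i=2,\dots,n-1\right\}.$$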
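The slope characterization above translates directly into a membership test. The following is a minimal sketch in Python, assuming the design points are strictly increasing; the function names `slopes` and `in_convex_cone` are hypothetical and floating point comparisons are exact here only because the toy values are small integers.

```python
def slopes(u, x):
    """Consecutive slopes (u[i+1] - u[i]) / (x[i+1] - x[i]) for increasing x."""
    return [(u[i + 1] - u[i]) / (x[i + 1] - x[i]) for i in range(len(u) - 1)]


def in_convex_cone(u, x):
    """True if consecutive slopes are nondecreasing, i.e. u is convex on x."""
    s = slopes(u, x)
    return all(s[i] <= s[i + 1] for i in range(len(s) - 1))


x = [0.0, 1.0, 3.0, 4.0]            # non-equispaced design points
convex_u = [xi ** 2 for xi in x]    # values of the convex function t -> t^2
print(slopes(convex_u, x), in_convex_cone(convex_u, x))
print(in_convex_cone([0.0, 1.0, 1.5, 1.0], x))
```

With equispaced design points the test reduces to the condition $2u_i\le u_{i+1}+u_{i-1}$ used for convex sequences.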

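For any $u\in K^{\mathrm{C}}_{x_1,\dots,x_n}$, we say that $u$ is piecewise affine with $k$ pieces if there exist real numbers $a_1,\dots,a_k$ and a partition $(T_1,\dots,T_k)$ of $\{1,\dots,n\}$ such that

$$u_i=a_j(x_i-x_l)+u_l,\qquad i,l\in T_j,\ j=1,\dots,k. \tag{1.11}$$

If $u=(f(x_1),\dots,f(x_n))^T$ for some convex function $f$ and $f$ is a piecewise affine function with $k$ pieces, then $u$ is piecewise affine with $k$ pieces. For any $u\in K^{\mathrm{C}}_{x_1,\dots,x_n}$, let $q(u)$ be the smallest integer $q$ such that $u$ is piecewise affine with $q$ pieces. The quantity $q(u)$ satisfies

$$q(u)-1\le\left|\left\{i=2,\dots,n-1: \frac{u_i-u_{i-1}}{x_i-x_{i-1}}<\frac{u_{i+1}-u_i}{x_{i+1}-x_i}\right\}\right|. \tag{1.12}$$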

The performance of the LS estimator is also studied in [11] in the case where the design points are almost equispaced: the adaptive oracle inequality and the $n^{-4/5}$ risk bound of Section 1.2 both hold if $\mathcal{S}_n^{\mathrm{C}}$ is replaced with $K^{\mathrm{C}}_{x_1,\dots,x_n}$ and if $C$ is a constant that depends on the ratio

$$\frac{\max_{i=2,\dots,n}(x_i-x_{i-1})}{\min_{i=2,\dots,n}(x_i-x_{i-1})}, \tag{1.13}$$

and this constant becomes arbitrarily large as this ratio tends to infinity.

Although the two bounds of Section 1.2 provide an accurate picture of the performance of the LS estimator for equispaced (or almost equispaced) design points, it is not known whether these bounds continue to hold for other design points. A goal of the present paper is to fill this gap. Section 4 shows that the adaptive oracle inequality of Section 1.2 holds irrespective of the design points, while the nonparametric rate of the LS estimator can be as slow as $n^{-2/3}$ for some worst-case design points.

It is clear that a convex function is unimodal in the sense that it is first non-increasing and then nondecreasing. The following subsection introduces the set of unimodal sequences, and Section 4.2 studies the relationship between convex regression and unimodal regression.

### 1.4 Unimodal regression

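Let $m\in\{1,\dots,n\}$. A sequence $u=(u_1,\dots,u_n)^T$ is unimodal with mode at position $m$ if and only if $(u_1,\dots,u_m)$ is non-increasing and $(u_m,\dots,u_n)$ is nondecreasing. Define the convex set

$$K_m\coloneqq\{u=(u_1,\dots,u_n)^T\in\mathbb{R}^n: u_1\ge\dots\ge u_m\le u_{m+1}\le\dots\le u_n\}. \tag{1.14}$$

The convex set $K_m$ is the set of all unimodal sequences with mode at position $m$ and

$$\mathcal{U}\coloneqq\bigcup_{m=1,\dots,n}K_m \tag{1.15}$$

is the set of all unimodal sequences. The set $\mathcal{U}$ is non-convex. For all $u\in\mathcal{U}$, let $k(u)$ be the smallest integer $k$ such that $u$ is piecewise constant with $k$ pieces, i.e., the smallest integer $k$ such that there exists a partition $(T_1,\dots,T_k)$ of $\{1,\dots,n\}$ such that for all $j=1,\dots,k$,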

• the sequence $(u_i)_{i\in T_j}$ is constant, and

• the set $T_j$ is convex in the sense that if $a,b\in T_j$ then $T_j$ contains all integers between $a$ and $b$.

If $u\in\mathcal{S}_n^{\uparrow}$, this definition of $k(u)$ coincides with that defined above.
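As an illustration of these two definitions, the sketch below (Python, standard library only) checks whether a nonempty sequence is unimodal and counts its constant pieces as the number of maximal runs of equal consecutive values. The function names `is_unimodal` and `num_constant_pieces` are hypothetical.

```python
def is_unimodal(u):
    """True if u is non-increasing up to some position m and nondecreasing after."""
    n = len(u)
    i = 0
    # Walk down as long as the sequence is non-increasing ...
    while i + 1 < n and u[i] >= u[i + 1]:
        i += 1
    # ... then walk up as long as it is nondecreasing.
    while i + 1 < n and u[i] <= u[i + 1]:
        i += 1
    # u is unimodal exactly when the two walks cover the whole sequence.
    return i == n - 1


def num_constant_pieces(u):
    """Smallest number of blocks of consecutive indices on which u is constant."""
    return 1 + sum(1 for a, b in zip(u, u[1:]) if a != b)


print(is_unimodal([3, 1, 1, 2]), num_constant_pieces([3, 1, 1, 2]))
print(is_unimodal([1, 2, 1]))
```

For a nondecreasing input, `num_constant_pieces` returns one plus the number of jumps, in agreement with the definition of $k(u)$ used in isotonic regression.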

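As the inclusion $\mathcal{S}_n^{\uparrow}\subset\mathcal{U}$ holds, the lower bound (1.8) implies that for any estimator $\hat\mu$,

$$\sup_{\mu\in\mathcal{U}:\,k(\mu)\le k}\mathbb{P}_\mu\left(\|\hat\mu-\mu\|^2\ge c\,\sigma^2k/n\right)\ge c'>0. \tag{1.16}$$

Chatterjee and Lafferty [6] recently obtained an adaptive risk bound of the form

$$\mathbb{P}\left(\|\hat\mu^{\mathrm{LS}}(\mathcal{U})-\mu\|^2\le\frac{C\sigma^2}{n}\bigl(k(\mu)\log(en)\bigr)^{3/2}\right)\ge 1-\frac{1}{n}, \tag{1.17}$$

where $C>0$ is an absolute constant. This risk bound does not match the lower bound (1.16) because of the exponent $3/2$.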

### 1.5 Organisation of the paper

Section 1.6 recalls properties of closed convex sets and closed convex cones.

• General oracle inequalities. In Section 2 we establish general tools that yield sharp oracle inequalities: Corollary 2.2 and Theorem 2.3.

• Sharp bounds in isotonic regression. In Section 3 we apply results of Section 2 to the isotonic LS estimator. We obtain an adaptive risk bound that is tight with sharp numerical constants.

• On the relationship between unimodal and convex regression. Section 4 studies the role of the design points in univariate convex regression: although the nonparametric rate is of order $n^{-4/5}$ for equispaced design points, this rate can be as slow as $n^{-2/3}$ for some worst-case design points that are studied in Section 4, whereas the adaptive oracle inequality of Section 1.2 holds for any design points. The relation between convex regression and unimodal regression is discussed in Section 4.2: although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$. Section A.1 studies unimodal regression and improves some of the results of [6] on the performance of the unimodal LS estimator.

• Comparison of different misspecification errors. In Section 5 we compare different quantities that represent the estimation error when the model is misspecified. In particular, Section 5 explains that if $K$ is a closed convex set and $\mu\notin K$, the sharp oracle inequalities obtained in Sections 2, 3 and 4 yield upper bounds on the estimation error $\|\hat\mu^{\mathrm{LS}}(K)-\Pi_K(\mu)\|^2$. If , the LS estimator consistently estimates the projection of the true parameter onto for and .

Some proofs are delayed to Appendices A and B.

### 1.6 Preliminary properties of closed convex sets

We recall here several properties of convex sets that will be used in the paper. Given a closed convex set $K\subset\mathbb{R}^n$, denote by $\Pi_K:\mathbb{R}^n\to K$ the projection onto $K$. For all $y\in\mathbb{R}^n$, $\Pi_K(y)$ is the unique vector in $K$ such that

$$(u-\Pi_K(y))^T(y-\Pi_K(y))\le 0,\qquad u\in K. \tag{1.18}$$

Inequality (1.18) can be rewritten as follows

$$\|\Pi_K(y)-y\|^2+\|u-\Pi_K(y)\|^2\le\|u-y\|^2,\qquad y\in\mathbb{R}^n,\ u\in K, \tag{1.19}$$

which is a consequence of the cosine theorem. The LS estimator over $K$ is exactly the projection of $y$ onto $K$, i.e., $\hat\mu^{\mathrm{LS}}(K)=\Pi_K(y)$. In this case, (1.19) yields that for all $u\in K$,

$$\|\hat\mu^{\mathrm{LS}}(K)-y\|^2\le\|u-y\|^2-\|u-\hat\mu^{\mathrm{LS}}(K)\|^2. \tag{1.20}$$

Inequality (1.20) can be interpreted in terms of strong convexity: the LS estimator solves an optimization problem where the function to minimize is strongly convex with respect to the norm $\|\cdot\|$. Strong convexity grants inequality (1.20), which is stronger than the inequality

$$\|\hat\mu^{\mathrm{LS}}(U)-y\|^2\le\|u-y\|^2\qquad\text{for all } u\in U, \tag{1.21}$$

which holds for any closed set $U$.

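Now, assume that $K$ is a closed convex cone. In this case, (1.18) implies that for all $y\in\mathbb{R}^n$, $\Pi_K(y)$ is the unique vector in $K$ such that

$$\Pi_K(y)^Ty=|\Pi_K(y)|_2^2\qquad\text{and}\qquad\forall\theta\in K,\ \theta^Ty\le\theta^T\Pi_K(y). \tag{1.22}$$

The property (1.22) readily implies that for any $v\in\mathbb{R}^n$ we have

$$|\Pi_K(v)|_2=\sup_{\theta\in K:\,|\theta|_2\le 1}v^T\theta. \tag{1.23}$$

Define the statistical dimension of the cone $K$ by

$$\delta(K)\coloneqq\mathbb{E}\left[|\Pi_K(g)|_2^2\right]=\mathbb{E}\left[g^T\Pi_K(g)\right]=\mathbb{E}\left[\Bigl(\sup_{\theta\in K:\,|\theta|_2\le 1}g^T\theta\Bigr)^2\right], \tag{1.24}$$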

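where $g\sim\mathcal{N}(0,I_{n\times n})$. The Gaussian width of a closed convex cone $K$ is defined by $w(K)\coloneqq\mathbb{E}\bigl[\sup_{\theta\in K:\,|\theta|_2\le 1}g^T\theta\bigr]$ where $g\sim\mathcal{N}(0,I_{n\times n})$. For any closed convex cone $K$, the relation $w(K)^2\le\delta(K)\le w(K)^2+1$ is established in [1, Proposition 10.2]. The following properties of $\delta(\cdot)$ will be useful for our purpose. If $K\subset\mathbb{R}^{n}$, $C\subset\mathbb{R}^{m}$ are two closed convex cones, then $K\times C$ is a closed convex cone in $\mathbb{R}^{n+m}$ and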

$$\delta(K\times C)=\delta(K)+\delta(C). \tag{1.25}$$

The statistical dimension is monotone in the following sense: if $K,L$ are two closed convex cones in $\mathbb{R}^n$ then

$$K\subset L\ \Rightarrow\ \delta(K)\le\delta(L). \tag{1.26}$$

We refer the reader to [1, Proposition 3.1] for straightforward proofs of the equivalence between the definitions (1.24) and the properties (1.25), (1.26) and (1.23). An exact formula is available for the statistical dimension of $\mathcal{S}_n^{\uparrow}$. Namely, it is proved in [1, (D.12)] that

$$\delta(\mathcal{S}_n^{\uparrow})=\sum_{k=1}^n\frac{1}{k}, \tag{1.27}$$

and this formula readily implies that

$$\log(n)\le\delta(\mathcal{S}_n^{\uparrow})\le\log(en). \tag{1.28}$$
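The two-sided bound (1.28) follows from comparing the harmonic sum in (1.27) with the integral of $1/t$, and it is easy to check numerically. The sketch below uses only the Python standard library; the function names `stat_dim_monotone_cone` and `satisfies_log_bounds` are hypothetical, and the small tolerance only guards against rounding error.

```python
import math


def stat_dim_monotone_cone(n):
    """Harmonic number 1 + 1/2 + ... + 1/n, the exact value in (1.27)."""
    return sum(1.0 / k for k in range(1, n + 1))


def satisfies_log_bounds(n, tol=1e-12):
    """Check log(n) <= delta <= log(e n) = 1 + log(n), as stated in (1.28)."""
    delta = stat_dim_monotone_cone(n)
    return math.log(n) - tol <= delta <= 1.0 + math.log(n) + tol


for n in (1, 10, 1000):
    print(n, stat_dim_monotone_cone(n), satisfies_log_bounds(n))
```

The upper bound is attained at $n=1$, where both sides equal 1.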

The following upper bound on the statistical dimension of the cone $K^{\mathrm{C}}_{x_1,\dots,x_n}$ is derived in [11]:

$$\delta(K^{\mathrm{C}}_{x_1,\dots,x_n})\le c\,(\log(en))^{5/4}, \tag{1.29}$$

for some constant $c>0$ that depends on the ratio (1.13). In Theorem 4.1, we derive a tighter bound independent of the design points.

## 2 General tools to derive sharp oracle inequalities

In this section, we develop two general tools to derive sharp oracle inequalities for the LS estimator over a closed convex set.

### 2.1 Statistical dimension of the tangent cone

Let $n\ge 1$, let $K$ be a closed convex subset of $\mathbb{R}^n$ and let $u\in K$. Define the tangent cone at $u$ by

$$\mathcal{T}_{K,u}\coloneqq\operatorname{closure}\{t(v-u): t\ge 0,\ v\in K\}. \tag{2.1}$$

If $K$ is a closed convex cone, then $\mathcal{T}_{K,0}=K$.

###### Proposition 2.1.

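Let $\mu\in\mathbb{R}^n$, let $K$ be a closed convex subset of $\mathbb{R}^n$ and let $u\in K$. Then almost surely

$$|\hat\mu-\mu|_2^2-|u-\mu|_2^2\le|\Pi_{\mathcal{T}_{K,u}}(\xi)|_2^2=\sigma^2|\Pi_{\mathcal{T}_{K,u}}(g)|_2^2, \tag{2.2}$$

where $\hat\mu=\hat\mu^{\mathrm{LS}}(K)$.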

###### Proof.

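Let $\hat\mu=\hat\mu^{\mathrm{LS}}(K)$. Then (1.20) yields

$$|\hat\mu-\mu|_2^2-|u-\mu|_2^2\le 2\xi^T(\hat\mu-u)-|\hat\mu-u|_2^2=2\xi^T\hat\theta\,|\hat\mu-u|_2-|\hat\mu-u|_2^2 \tag{2.3}$$

where $\hat\theta$ is defined by $\hat\theta=(\hat\mu-u)/|\hat\mu-u|_2$ if $\hat\mu\ne u$ and $\hat\theta=0$ otherwise. By construction we have $\hat\theta\in\mathcal{T}_{K,u}$ and $|\hat\theta|_2\le 1$. Using the simple inequality $2ab-b^2\le a^2$ with $a=\xi^T\hat\theta$ and $b=|\hat\mu-u|_2$, we obtain

$$|\hat\mu-\mu|_2^2-|u-\mu|_2^2\le 2\xi^T(\hat\mu-u)-|\hat\mu-u|_2^2\le\Bigl(\sup_{\theta\in\mathcal{T}_{K,u}:\,|\theta|_2\le 1}\theta^T\xi\Bigr)^2. \tag{2.4}$$

The equality (1.23) completes the proof. ∎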

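By definition of the statistical dimension, $\mathbb{E}|\Pi_{\mathcal{T}_{K,u}}(g)|_2^2=\delta(\mathcal{T}_{K,u})$ so that (2.2) readily yields a sharp oracle inequality in expectation. Bounds with high probability are obtained as follows. Let $L$ be a closed convex cone. By (1.23) we have $|\Pi_L(g)|_2=\sup_{\theta\in L:\,|\theta|_2\le 1}g^T\theta$. Thus, by the concentration of suprema of Gaussian processes [5, Theorem 5.8] we have

$$\mathbb{P}\left(|\Pi_L(g)|_2>\mathbb{E}|\Pi_L(g)|_2+\sqrt{2x}\right)\le e^{-x},$$

and by Jensen’s inequality we have $\mathbb{E}|\Pi_L(g)|_2\le\delta(L)^{1/2}$. Combining these two bounds, we obtain

$$\mathbb{P}\left(|\Pi_L(g)|_2\le\delta(L)^{1/2}+\sqrt{2x}\right)\ge 1-e^{-x}. \tag{2.5}$$

Applying this concentration inequality to the cone $L=\mathcal{T}_{K,u}$ yields the following corollary.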

###### Corollary 2.2.

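Let $\mu\in\mathbb{R}^n$, let $K$ be a closed convex subset of $\mathbb{R}^n$, let $u\in K$ and let $\mathcal{T}_{K,u}$ be defined in (2.1). If $\xi\sim\mathcal{N}(0,\sigma^2I_{n\times n})$ then

$$\mathbb{E}\left[\|\hat\mu^{\mathrm{LS}}(K)-\mu\|^2\right]\le\|u-\mu\|^2+\frac{\sigma^2}{n}\delta(\mathcal{T}_{K,u}). \tag{2.6}$$

Furthermore, for all $x>0$, with probability at least $1-e^{-x}$ we have

$$\|\hat\mu^{\mathrm{LS}}(K)-\mu\|^2-\|u-\mu\|^2\le\frac{\sigma^2}{n}\left(\delta(\mathcal{T}_{K,u})^{1/2}+\sqrt{2x}\right)^2\le\frac{\sigma^2}{n}\left(2\delta(\mathcal{T}_{K,u})+4x\right). \tag{2.7}$$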

In the well-specified case, a similar upper bound was derived in [14, Theorem 3.1]. Oymak and Hassibi [14] also proved a worst-case lower bound that matches the upper bound.

The survey [1] provides general recipes to bound from above the statistical dimension of cones of several types. For instance, the statistical dimension of $\mathcal{S}_n^{\uparrow}$ is given by the exact formula (1.27). Bounds on the statistical dimension of a closed convex cone $K$ can be obtained using metric entropy results, as $\sigma^2\delta(K)/n$ is the risk of the LS estimator $\hat\mu^{\mathrm{LS}}(K)$ when the true vector is $0$. This technique is used in [11] to derive the bound (1.29).

If $\mathcal{T}_{K,u}\subset V$ where $V$ is a subspace of dimension $d$, then by monotonicity of the statistical dimension (1.26) we have $\delta(\mathcal{T}_{K,u})\le\delta(V)=d$. In this case, (2.2) shows that the constant 4 in [16, Proposition 3.1] can be reduced to 1.

### 2.2 Localized Gaussian widths

In this section, we develop yet another technique to derive sharp oracle inequalities for LS estimators over closed convex sets. This technique is associated with localized Gaussian widths rather than statistical dimensions of tangent cones. The result is given in Theorem 2.3 below. Recently, other general methods have been proposed [7, 15, 18], but these methods did not provide oracle inequalities with leading constant 1.

###### Theorem 2.3.

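Let $K$ be a closed convex subset of $\mathbb{R}^n$, let $\mu\in\mathbb{R}^n$. Assume that $\xi\sim\mathcal{N}(0,\sigma^2I_{n\times n})$ and that for some $u\in K$, there exists $t_*(u)>0$ such that

$$\mathbb{E}\sup_{v\in K:\,|v-u|_2\le t_*(u)}\xi^T(v-u)\le\frac{t_*(u)^2}{2}. \tag{2.8}$$

Then for any $x>0$, with probability greater than $1-e^{-x}$,

$$\|\hat\mu^{\mathrm{LS}}(K)-\mu\|^2-\|u-\mu\|^2\le\frac{(t_*(u)+\sigma\sqrt{2x})^2}{n}\le\frac{2t_*^2(u)+4\sigma^2x}{n}. \tag{2.9}$$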

The proof of Theorem 2.3 is related to the isomorphic method [2] and the theory of local Rademacher complexities in regression with random design [3, 12].

###### Proof.

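Let $\hat\mu=\hat\mu^{\mathrm{LS}}(K)$ and $t=t_*(u)$ for brevity. The concentration inequality for suprema of Gaussian processes [5, Theorem 5.8] yields that on an event $\Omega(x)$ of probability greater than $1-e^{-x}$,

$$Z\coloneqq\sup_{v\in K:\,|v-u|_2\le t}\xi^T(v-u)\le\mathbb{E}[Z]+t\sigma\sqrt{2x}\le t^2/2+t\sigma\sqrt{2x}. \tag{2.10}$$

On the one hand, if $|\hat\mu-u|_2\le t$, then by (2.3) on $\Omega(x)$ we have

$$|\hat\mu-\mu|_2^2-|u-\mu|_2^2\le 2\xi^T(\hat\mu-u)-|\hat\mu-u|_2^2\le 2Z\le t^2+2t\sigma\sqrt{2x}\le(t+\sigma\sqrt{2x})^2. \tag{2.11}$$

On the other hand, if $|\hat\mu-u|_2>t$, then $\alpha\coloneqq t/|\hat\mu-u|_2$ belongs to $(0,1)$. If $v=\alpha\hat\mu+(1-\alpha)u$ then $|v-u|_2=t$, by convexity of $K$ we have $v\in K$ and by definition of $Z$ it holds that $\xi^T(v-u)\le Z$. On $\Omega(x)$,

$$\begin{aligned}2\xi^T(\hat\mu-u)-|\hat\mu-u|_2^2&=(2/\alpha)\,\xi^T(v-u)-t^2/\alpha^2,\\&\le(2/\alpha)Z-t^2/\alpha^2=(2t/\alpha)(Z/t)-t^2/\alpha^2,\\&\le(Z/t)^2\le(t+\sigma\sqrt{2x})^2,\end{aligned}$$

where we used $2ab-b^2\le a^2$ with $a=Z/t$ and $b=t/\alpha$. Thus (2.9) holds on $\Omega(x)$ for both cases $|\hat\mu-u|_2\le t$ and $|\hat\mu-u|_2>t$. Finally, the inequality $(a+b)^2\le 2a^2+2b^2$ yields that $(t+\sigma\sqrt{2x})^2\le 2t^2+4\sigma^2x$. ∎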

Note that condition (2.8) does not depend on the true vector $\mu$, but only depends on the vector $u$ that appears on the right hand side of the oracle inequality. The left hand side of (2.8) is the Gaussian width of $K$ localized around $u$. This differs from the recent analysis of Chatterjee [8] where the Gaussian width localized around $\mu$ is studied. An advantage of considering the Gaussian width localized around $u$ is that the resulting oracle inequality (2.9) is sharp, i.e., with leading constant 1. Chatterjee [8] proved that the Gaussian width localized around $\mu$ characterizes a deterministic quantity $t_\mu$ such that $|\hat\mu^{\mathrm{LS}}(K)-\mu|_2$ concentrates around $t_\mu$. This result from [8] grants both an upper bound and a lower bound on $|\hat\mu^{\mathrm{LS}}(K)-\mu|_2$, but it neither implies nor is implied by a sharp oracle inequality such as (2.9) above. Thus, the result of [8] is of a different nature than (2.9).

A strategy to find a quantity $t_*(u)$ that satisfies (2.8) is to use metric entropy results together with Dudley's integral bound, although Dudley's integral bound may not be tight [5, Section 13.1, Exercises 13.4 and 13.5].

## 3 Sharp bounds in isotonic regression

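We study in this section the performance of $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$ using the general tools developed in the previous section. We first apply Corollary 2.2. To do so, we need to bound from above the statistical dimension of the tangent cone $\mathcal{T}_{\mathcal{S}_n^{\uparrow},u}$. In fact, it is possible to characterize the tangent cone $\mathcal{T}_{\mathcal{S}_n^{\uparrow},u}$ and to obtain a closed formula for its statistical dimension.

###### Proposition 3.1.

Let $u\in\mathcal{S}_n^{\uparrow}$ and let $k=k(u)$. Let $(T_1,\dots,T_k)$ be a partition of $\{1,\dots,n\}$ such that $u$ is constant on each $T_j$, $j=1,\dots,k$. Then

$$\mathcal{T}_{\mathcal{S}_n^{\uparrow},u}=\mathcal{S}^{\uparrow}_{|T_1|}\times\dots\times\mathcal{S}^{\uparrow}_{|T_k|}. \tag{3.1}$$

###### Proof.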

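Let $\mathcal{T}=\mathcal{T}_{\mathcal{S}_n^{\uparrow},u}$ for brevity. If $u$ is constant, then it is clear that $\mathcal{T}=\mathcal{S}_n^{\uparrow}$ so we assume that $u$ has at least one jump, i.e., $k\ge 2$. As $\mathcal{S}_n^{\uparrow}$ is a cone we have $\mathcal{T}=\operatorname{closure}\{v-tu: t\ge 0,\ v\in\mathcal{S}_n^{\uparrow}\}$. Thus the inclusion $\mathcal{T}\subset\mathcal{S}^{\uparrow}_{|T_1|}\times\dots\times\mathcal{S}^{\uparrow}_{|T_k|}$ is straightforward. For the reverse inclusion, let $x\in\mathcal{S}^{\uparrow}_{|T_1|}\times\dots\times\mathcal{S}^{\uparrow}_{|T_k|}$ with $x\ne 0$ and let $\epsilon>0$ be the minimal jump of the sequence $u$, that is, $\epsilon=\min\{u_{i+1}-u_i: u_{i+1}>u_i\}$. If $t=\epsilon/(2|x|_\infty)$ then the vector $v=u+tx$ belongs to $\mathcal{S}_n^{\uparrow}$, so that $x=(v-u)/t\in\mathcal{T}$, which completes the proof. ∎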

Using (3.1), (1.25) and (1.28) we obtain $\delta(\mathcal{T}_{\mathcal{S}_n^{\uparrow},u})\le\sum_{j=1}^k\log(e|T_j|)$. By Jensen’s inequality, this quantity is bounded from above by $k\log(en/k)$. Applying Corollary 2.2 leads to the following result.
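The chain of bounds in this step can be checked numerically for a given partition. The sketch below (Python, standard library only) combines the product formula (3.1), the additivity property (1.25) and the exact formula (1.27) to compute the statistical dimension of the tangent cone from the block sizes $|T_1|,\dots,|T_k|$, and compares it with the Jensen bound $k\log(en/k)$. The function names and the example block sizes are hypothetical.

```python
import math


def harmonic(m):
    """Statistical dimension of the monotone cone in dimension m, see (1.27)."""
    return sum(1.0 / j for j in range(1, m + 1))


def tangent_cone_stat_dim(block_sizes):
    """Sum of harmonic numbers over the blocks, by (3.1) and (1.25)."""
    return sum(harmonic(m) for m in block_sizes)


def jensen_bound(block_sizes):
    """Upper bound k * log(e n / k) with k blocks and n = total length."""
    k = len(block_sizes)
    n = sum(block_sizes)
    return k * math.log(math.e * n / k)


sizes = [3, 1, 6]  # a sequence of length 10 with three constant pieces
print(tangent_cone_stat_dim(sizes), jensen_bound(sizes))
```

The gap between the two printed values reflects both the slack in (1.28) and the concavity step, which is tight only when all blocks have the same size.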

###### Theorem 3.2.

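For all $n\ge 1$ and any $\mu\in\mathbb{R}^n$,

$$\mathbb{E}_\mu\|\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})-\mu\|^2\le\min_{u\in\mathcal{S}_n^{\uparrow}}\left(\|u-\mu\|^2+\frac{\sigma^2k(u)}{n}\log\frac{en}{k(u)}\right). \tag{3.2}$$

Furthermore, for any $x>0$ we have with probability greater than $1-e^{-x}$

$$\|\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})-\mu\|^2\le\min_{u\in\mathcal{S}_n^{\uparrow}}\left(\|u-\mu\|^2+\frac{2\sigma^2k(u)}{n}\log\frac{en}{k(u)}\right)+\frac{4\sigma^2x}{n}. \tag{3.3}$$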

Let us discuss some features of Theorem 3.2 that are new. First, the estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$ satisfies oracle inequalities both in deviation with exponential probability bounds and in expectation, cf. (3.3) and (3.2), respectively. Previously known oracle inequalities for the LS estimator over $\mathcal{S}_n^{\uparrow}$ were only proved in expectation.

Second, both (3.2) and (3.3) are sharp oracle inequalities, i.e., with leading constant 1. Although sharp oracle inequalities were obtained using aggregation methods [4], this is the first known sharp oracle inequality for the LS estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$.

Third, the assumption $\mu\in\mathcal{S}_n^{\uparrow}$ is not needed, as opposed to the result of [7].

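Last, the constant 1 in front of $\frac{\sigma^2k(u)}{n}\log\frac{en}{k(u)}$ in (3.2) is optimal for the LS estimator. To see this, assume that there exists an absolute constant $c<1$ such that for all $n\ge 1$ and $\mu\in\mathbb{R}^n$,

$$\mathbb{E}_\mu\|\hat\mu-\mu\|^2\le\min_{u\in\mathcal{S}_n^{\uparrow}}\left(\|u-\mu\|^2+\frac{c\,\sigma^2k(u)}{n}\log\frac{en}{k(u)}\right). \tag{3.4}$$

Set $\mu=0$. Thanks to (1.28), the left hand side of the above display is bounded from below by $\sigma^2\log(n)/n$ while the right hand side is equal to $c\,\sigma^2\log(en)/n$, which is a contradiction for $n$ large enough. Thus, it is impossible to improve the constant in front of $\frac{\sigma^2k(u)}{n}\log\frac{en}{k(u)}$ for the estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$. However, it is still possible that for another estimator $\hat\mu$, (3.4) holds with $c<1$ or without the logarithmic factor. We do not know whether such an estimator exists.

We now highlight the adaptive behavior of the estimator $\hat\mu^{\mathrm{LS}}(\mathcal{S}_n^{\uparrow})$. Let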