Sharp oracle inequalities for Least Squares estimators in shape restricted regression
The performance of Least Squares (LS) estimators is studied in isotonic, unimodal and convex regression. Our results have the form of sharp oracle inequalities that account for the model misspecification error. In isotonic and unimodal regression, the LS estimator achieves the nonparametric rate $n^{-2/3}$ as well as a parametric rate of order $k/n$ up to logarithmic factors, where $k$ is the number of constant pieces of the true parameter.
In univariate convex regression, the LS estimator satisfies an adaptive risk bound of order $q/n$ up to logarithmic factors, where $q$ is the number of affine pieces of the true regression function. This adaptive risk bound holds for any design points. While Guntuboyina and Sen established that the nonparametric rate of convex regression is of order $n^{-4/5}$ for equispaced design points, we show that the nonparametric rate of convex regression can be as slow as $n^{-2/3}$ for some worst-case design points. This phenomenon can be explained as follows: although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$.
arXiv:1510.08029
This work was supported by GENES and by the grant Investissements d’Avenir (ANR-11-IDEX-0003/Labex Ecodec/ANR-11-LABX-0047).
Assume that we have the observations
\begin{equation*}
y_i = \mu_i + \xi_i, \qquad i = 1, \dots, n,
\end{equation*}
where $\mu = (\mu_1, \dots, \mu_n)^\top \in \mathbb{R}^n$ is unknown, $\xi = (\xi_1, \dots, \xi_n)^\top$ is a noise vector with $n$-dimensional Gaussian distribution $\mathcal{N}(0, \sigma^2 I_{n\times n})$ where $\sigma > 0$ and $I_{n\times n}$ is the identity matrix. We will also use the notation $y = (y_1, \dots, y_n)^\top$ so that $y = \mu + \xi$ and $y \sim \mathcal{N}(\mu, \sigma^2 I_{n\times n})$. Denote by $\mathbb{E}$ and $\mathbb{P}$ the expectation and the probability with respect to the distribution of the random variable $y$. The vector $y$ is observed and the goal is to estimate $\mu$. The estimation error is measured with the scaled norm $\|\cdot\|$ defined by
\begin{equation*}
\|u\|^2 = \frac{1}{n} \sum_{i=1}^n u_i^2, \qquad u = (u_1, \dots, u_n)^\top \in \mathbb{R}^n .
\end{equation*}
The error of an estimator $\hat{\mu}$ of $\mu$ is given by $\|\hat{\mu} - \mu\|^2$. Let also $|\cdot|_\infty$ be the infinity norm and $|\cdot|_2$ be the Euclidean norm, so that $|u|_2^2 = n \|u\|^2$ for all $u \in \mathbb{R}^n$.
This paper studies the Least Squares (LS) estimator in shape restricted regression under model misspecification. The LS estimator over a nonempty closed set $K \subset \mathbb{R}^n$ is defined by
\begin{equation*}
\hat{\mu}^{\mathrm{LS}}(K) \in \operatorname*{argmin}_{u \in K} \, |y - u|_2^2 .
\end{equation*}
Model misspecification allows that the true parameter $\mu$ does not belong to $K$. There is a large literature on the performance of the LS estimator in isotonic and convex regression, that is, when the set $K$ is the set of all nondecreasing sequences or the set of convex sequences. Some of these results are reviewed in the following subsections.
1.1 Isotonic regression
Let $S_n^\uparrow$ be the set of all nondecreasing sequences, defined by
\begin{equation*}
S_n^\uparrow = \left\{ u = (u_1, \dots, u_n)^\top \in \mathbb{R}^n : \; u_1 \le u_2 \le \dots \le u_n \right\}.
\end{equation*}
The set $S_n^\uparrow$ is a closed convex cone. Two quantities are useful to describe the performance of the LS estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$. First, define the total variation by
\begin{equation*}
V(u) = \sum_{i=2}^n |u_i - u_{i-1}|, \qquad u \in \mathbb{R}^n .
\end{equation*}
If $u \in S_n^\uparrow$, its total variation is simply $V(u) = u_n - u_1$. Second, for $u \in S_n^\uparrow$, let $k(u) \ge 1$ be the integer such that $k(u) - 1$ is the number of inequalities $u_i \le u_{i+1}$, $i = 1, \dots, n-1$, that are strict (the number of jumps of $u$).
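As a concrete illustration (the helper names below are ours, not from the paper), both quantities are straightforward to compute for a given sequence:

```python
def total_variation(u):
    """V(u): sum of absolute differences of consecutive entries."""
    return sum(abs(u[i] - u[i - 1]) for i in range(1, len(u)))

def n_constant_pieces(u):
    """k(u) for a nondecreasing sequence u: one more than the number of
    strict inequalities u[i] < u[i+1], i.e. the number of constant pieces."""
    return 1 + sum(1 for i in range(len(u) - 1) if u[i] < u[i + 1])

u = [1, 1, 2, 2, 3]            # nondecreasing, with two jumps
print(total_variation(u))      # 2, equal to u[-1] - u[0] since u is nondecreasing
print(n_constant_pieces(u))    # 3 constant pieces: (1, 1), (2, 2), (3,)
```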
Previous results on the performance of the LS estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$ can be found in [13, 19, 7, 8], where risk bounds or oracle inequalities with leading constant strictly greater than 1 are derived. Two types of risk bounds or oracle inequalities have been obtained so far. If $\mu \in S_n^\uparrow$, it is known [13, 19, 7, 8] that for some absolute constant $c > 0$,
\begin{equation}
\mathbb{E} \| \hat{\mu}^{\mathrm{LS}}(S_n^\uparrow) - \mu \|^2 \le c \min_{u \in S_n^\uparrow} \left( \| u - \mu \|^2 + \left( \frac{\sigma^2 V(u)}{n} \right)^{2/3} + \frac{\sigma^2 \log(en)}{n} \right), \tag{1.6}
\end{equation}
\begin{equation}
\mathbb{E} \| \hat{\mu}^{\mathrm{LS}}(S_n^\uparrow) - \mu \|^2 \le c \min_{u \in S_n^\uparrow} \left( \| u - \mu \|^2 + \frac{\sigma^2 k(u)}{n} \log \frac{en}{k(u)} \right). \tag{1.7}
\end{equation}
The risk bounds (1.6) and (1.7) hold under the assumption that $\mu \in S_n^\uparrow$, which does not allow for any model misspecification. We will see below that this assumption can be dropped. The oracle inequality (1.6) implies that the LS estimator achieves the rate $(\sigma^2 V(\mu)/n)^{2/3}$, while (1.7) yields a parametric rate of order $\sigma^2 k/n$ (up to logarithmic factors) if $\mu$ is well approximated by a piecewise constant sequence with not too many pieces. Let us note that the bound (1.7) can be used to obtain that $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$ converges at the rate $n^{-2/3}$ up to logarithmic factors, thanks to the approximation argument given in [4, Lemma 2].
Minimax lower bounds that match (1.6) and (1.7) up to logarithmic factors have been obtained in [7, 4]. If $V > 0$ is a fixed parameter and $V(\mu) \le V$, the bound (1.6) yields the rate $n^{-2/3}$ for the risk of $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$. By the lower bound [4, Corollary 5], this rate is minimax optimal over the class $\{ u \in S_n^\uparrow : V(u) \le V \}$. Proposition 4 in  shows that there exist absolute constants $c, c' > 0$ such that for any estimator $\hat{\mu}$,
\begin{equation*}
\sup_{\mu \in S_n^\uparrow : \, k(\mu) \le k} \mathbb{E} \| \hat{\mu} - \mu \|^2 \ge \frac{c\, \sigma^2 k}{n} \log \frac{en}{c' k} .
\end{equation*}
1.2 Convex regression with equispaced design points
If $n \ge 3$, define the set of convex sequences by
\begin{equation*}
\mathcal{C} = \left\{ u = (u_1, \dots, u_n)^\top \in \mathbb{R}^n : \; 2 u_i \le u_{i-1} + u_{i+1}, \; i = 2, \dots, n-1 \right\}.
\end{equation*}
For $u \in \mathcal{C}$, let $q(u)$ be the smallest integer $q$ such that there exists a partition $(T_1, \dots, T_q)$ of $\{1, \dots, n\}$ and real numbers $a_1, b_1, \dots, a_q, b_q$ satisfying
\begin{equation*}
u_i = a_j + b_j \, i, \qquad i \in T_j, \quad j = 1, \dots, q.
\end{equation*}
The quantity $q(u)$ is the smallest integer such that $u$ is piecewise affine with $q(u)$ pieces. If $x_1, \dots, x_n$ are equispaced design points in $[0,1]$, i.e., $x_i = i/n$ for $i = 1, \dots, n$, then
for some absolute constant . If and where defined in Corollary 4.4 below is a constant that depends only on , then the estimator satisfies
for some absolute constant $c' > 0$. The bound (LABEL:eq:convexs-adapt-example-oi) yields an almost parametric rate of order $\sigma^2 q(\mu)/n$ up to logarithmic factors if $\mu$ can be well approximated by a piecewise affine sequence with not too many pieces. If $V > 0$ is a fixed parameter and $V(\mu) \le V$, the bound (LABEL:eq:rate45-example-nonoi) yields the rate $n^{-4/5}$, which is minimax optimal over this class up to logarithmic factors.
The above results hold in convex regression for equispaced design points. The following subsection introduces the notation that will be used to study convex regression with non-equispaced design points.
1.3 Non-equispaced design points in convex regression
If $x_1 < x_2 < \dots < x_n$ are non-equispaced design points in $\mathbb{R}$, define the cone
\begin{equation*}
\mathcal{C}(x) = \left\{ u \in \mathbb{R}^n : \; \frac{u_2 - u_1}{x_2 - x_1} \le \frac{u_3 - u_2}{x_3 - x_2} \le \dots \le \frac{u_n - u_{n-1}}{x_n - x_{n-1}} \right\}.
\end{equation*}
This can be rewritten as
\begin{equation*}
\mathcal{C}(x) = \left\{ u \in \mathbb{R}^n : \; u_i \le \frac{x_{i+1} - x_i}{x_{i+1} - x_{i-1}} \, u_{i-1} + \frac{x_i - x_{i-1}}{x_{i+1} - x_{i-1}} \, u_{i+1}, \; i = 2, \dots, n-1 \right\}.
\end{equation*}
For any $u \in \mathcal{C}(x)$, we say that $u$ is piecewise affine with $q$ pieces if there exist real numbers $a_1, b_1, \dots, a_q, b_q$ and a partition $(T_1, \dots, T_q)$ of $\{1, \dots, n\}$ such that
\begin{equation*}
u_i = a_j + b_j \, x_i, \qquad i \in T_j, \quad j = 1, \dots, q.
\end{equation*}
If $u_i = f(x_i)$ for all $i = 1, \dots, n$ for some convex function $f$ and $f$ is a piecewise affine function with $q$ pieces, then $u$ is piecewise affine with at most $q$ pieces. For any $u \in \mathcal{C}(x)$, let $q(u)$ be the smallest integer such that $u$ is piecewise affine with $q(u)$ pieces. The quantity $q(u)$ satisfies
The performance of the LS estimator $\hat{\mu}^{\mathrm{LS}}(\mathcal{C}(x))$ is also studied in  in the case where the design points are almost equispaced: The bounds (LABEL:eq:convexs-adapt-example-oi) and (LABEL:eq:rate45-example-nonoi) both hold if $\mathcal{C}$ is replaced with $\mathcal{C}(x)$ and if $c$ is a constant that depends on the ratio
\begin{equation*}
\frac{\max_{i=2, \dots, n} (x_i - x_{i-1})}{\min_{i=2, \dots, n} (x_i - x_{i-1})},
\end{equation*}
and this constant becomes arbitrarily large as this ratio tends to infinity.
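For intuition, $q(u)$ can be computed by a greedy left-to-right scan that extends the current affine piece as long as each new point $(x_i, u_i)$ stays on the line through the first two points of the piece; a minimal sketch (the function name is ours), assuming the pieces $T_j$ are blocks of consecutive indices:

```python
def n_affine_pieces(x, u):
    """Greedy count of the minimal number of affine pieces of u on the
    design points x (pieces are blocks of consecutive indices)."""
    q, s = 1, 0                # q: piece count, s: start of current piece
    for i in range(1, len(u)):
        if i - s >= 2:
            # (x[i], u[i]) must be collinear with the first two points
            # of the current piece; compare slopes by cross-multiplication
            cross = (x[i] - x[s]) * (u[s + 1] - u[s]) \
                  - (u[i] - u[s]) * (x[s + 1] - x[s])
            if cross != 0:
                q, s = q + 1, i  # start a new piece at index i
    return q

x = [0, 1, 2, 3, 4]
u = [0, 1, 2, 4, 6]            # slope 1 on the first three points, then slope 2
print(n_affine_pieces(x, u))   # 2
```

The greedy scan is optimal here because any sub-block of an affine piece is again affine, so taking each piece as long as possible never increases the count.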
Although (LABEL:eq:rate45-example-nonoi) and (LABEL:eq:convexs-adapt-example-oi) provide an accurate picture of the performance of the LS estimator for equispaced (or almost equispaced) design points, it is not known whether these bounds continue to hold for other design points. A goal of the present paper is to fill this gap. Section 4 shows that the oracle inequality (LABEL:eq:convexs-adapt-example-oi) holds irrespective of the design points, while the nonparametric rate of the LS estimator can be as slow as $n^{-2/3}$ for some worst-case design points.
It is clear that a convex function is unimodal in the sense that it is first non-increasing and then nondecreasing. The following subsection introduces the set of unimodal sequences, and Section 4.2 studies the relationship between convex regression and unimodal regression.
1.4 Unimodal regression
Let $m \in \{1, \dots, n\}$. A sequence $u \in \mathbb{R}^n$ is unimodal with mode at position $m$ if and only if $(u_1, \dots, u_m)$ is non-increasing and $(u_m, \dots, u_n)$ is nondecreasing. Define the convex set
\begin{equation*}
\mathcal{U}_m = \left\{ u \in \mathbb{R}^n : \; u_1 \ge u_2 \ge \dots \ge u_m, \;\; u_m \le u_{m+1} \le \dots \le u_n \right\}.
\end{equation*}
The convex set $\mathcal{U}_m$ is the set of all unimodal sequences with mode at position $m$ and
\begin{equation*}
\mathcal{U} = \bigcup_{m=1}^{n} \mathcal{U}_m
\end{equation*}
is the set of all unimodal sequences. The set $\mathcal{U}$ is non-convex. For all $u \in \mathcal{U}$, let $k(u)$ be the smallest integer $k$ such that $u$ is piecewise constant with $k$ pieces, i.e., the smallest integer such that there exists a partition $(T_1, \dots, T_k)$ of $\{1, \dots, n\}$ such that for all $j = 1, \dots, k$,
the sequence $(u_i)_{i \in T_j}$ is constant, and
the set $T_j$ is convex in the sense that if $i, i' \in T_j$ then $T_j$ contains all integers between $i$ and $i'$.
If $u \in S_n^\uparrow$, this definition of $k(u)$ coincides with the definition given above for nondecreasing sequences.
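For unimodal sequences, $k(u)$ is simply one more than the number of indices where consecutive entries differ; a minimal sketch (the function name is ours):

```python
def n_constant_pieces(u):
    """k(u): number of maximal constant runs of the sequence u."""
    return 1 + sum(1 for i in range(len(u) - 1) if u[i] != u[i + 1])

u = [3, 1, 1, 2]               # unimodal: non-increasing, then nondecreasing
print(n_constant_pieces(u))    # 3 pieces: (3,), (1, 1), (2,)
```

For a nondecreasing sequence this coincides with the earlier definition, since the constant runs are separated exactly by the strict inequalities.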
1.5 Organisation of the paper
Section 1.6 recalls properties of closed convex sets and closed convex cones.
On the relationship between unimodal and convex regression. Section 4 studies the role of the design points in univariate convex regression: Although the nonparametric rate is of order $n^{-4/5}$ for equispaced design points, this rate can be as slow as $n^{-2/3}$ for some worst-case design points that are studied in Section 4, whereas the adaptive risk bound (LABEL:eq:convexs-adapt-example-oi) holds for any design points. The relation between convex regression and unimodal regression is discussed in Section 4.2: Although convexity brings more structure than unimodality, for some worst-case design points this extra structure is uninformative and the nonparametric rates of unimodal regression and convex regression are both $n^{-2/3}$. Section A.1 studies unimodal regression and improves some of the results of  on the performance of the unimodal LS estimator.
Comparison of different misspecification errors. In Section 5 we compare different quantities that represent the estimation error when the model is misspecified. In particular, Section 5 explains that if $K$ is a closed convex set and $\mu \notin K$, the sharp oracle inequalities obtained in Sections 2, 3 and 4 yield upper bounds on the estimation error $\| \hat{\mu}^{\mathrm{LS}}(K) - \Pi_K(\mu) \|^2$. If $\mu \notin K$, the LS estimator consistently estimates $\Pi_K(\mu)$, the projection of the true parameter $\mu$ onto $K$.
1.6 Preliminary properties of closed convex sets
We recall here several properties of convex sets that will be used in the paper. Given a closed convex set $K \subset \mathbb{R}^n$, denote by $\Pi_K$ the projection onto $K$. For all $y \in \mathbb{R}^n$, $\Pi_K(y)$ is the unique vector in $K$ such that
\begin{equation}
(y - \Pi_K(y))^\top (u - \Pi_K(y)) \le 0 \qquad \text{for all } u \in K. \tag{1.18}
\end{equation}
Inequality (1.18) can be rewritten as follows:
\begin{equation}
|u - \Pi_K(y)|_2^2 \le |y - u|_2^2 - |y - \Pi_K(y)|_2^2 \qquad \text{for all } u \in K, \tag{1.19}
\end{equation}
which is a consequence of the cosine theorem. The LS estimator over $K$ is exactly the projection of $y$ onto $K$, i.e., $\hat{\mu}^{\mathrm{LS}}(K) = \Pi_K(y)$. In this case, (1.19) yields that for all $u \in K$,
\begin{equation}
\| \hat{\mu}^{\mathrm{LS}}(K) - u \|^2 \le \| y - u \|^2 - \| y - \hat{\mu}^{\mathrm{LS}}(K) \|^2 . \tag{1.20}
\end{equation}
Inequality (1.20) can be interpreted in terms of strong convexity: the LS estimator $\hat{\mu}^{\mathrm{LS}}(K)$ solves an optimization problem where the function to minimize is strongly convex with respect to the norm $\|\cdot\|$. Strong convexity grants inequality (1.20), which is stronger than the inequality
\begin{equation}
\| y - \hat{\mu}^{\mathrm{LS}}(K) \|^2 \le \| y - u \|^2 \qquad \text{for all } u \in K, \tag{1.21}
\end{equation}
which holds for any closed set $K$.
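These projection inequalities are easy to check numerically on a simple closed convex set. The sketch below (our own example, not from the paper) uses the nonnegative orthant, whose Euclidean projection is coordinatewise clipping, and verifies the strong-convexity inequality $|u - \Pi_K(y)|_2^2 \le |y - u|_2^2 - |y - \Pi_K(y)|_2^2$ on random points $u$ of the set:

```python
import random

random.seed(0)
n = 6
y = [random.uniform(-2, 2) for _ in range(n)]
proj = [max(yi, 0.0) for yi in y]   # projection onto the nonnegative orthant

def sqdist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

for _ in range(1000):
    u = [random.uniform(0, 3) for _ in range(n)]   # arbitrary point of K
    assert sqdist(u, proj) <= sqdist(y, u) - sqdist(y, proj) + 1e-9
print("strong-convexity inequality verified on 1000 random points of K")
```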
Now, assume that $K$ is a closed convex cone. In this case, (1.18) implies that for all $y \in \mathbb{R}^n$, $\Pi_K(y)$ is the unique vector in $K$ such that
\begin{equation}
(y - \Pi_K(y))^\top \Pi_K(y) = 0 \qquad \text{and} \qquad (y - \Pi_K(y))^\top u \le 0 \ \text{ for all } u \in K. \tag{1.22}
\end{equation}
The property (1.22) readily implies that for any $y \in \mathbb{R}^n$ we have
\begin{equation}
|\Pi_K(y)|_2 = \sup_{u \in K : \, |u|_2 \le 1} u^\top y . \tag{1.23}
\end{equation}
Define the statistical dimension of the cone $K$ by
\begin{equation}
\delta(K) = \mathbb{E} \, |\Pi_K(g)|_2^2 , \tag{1.24}
\end{equation}
where $g \sim \mathcal{N}(0, I_{n\times n})$. The Gaussian width of a closed convex cone $K$ is defined by $w(K) = \mathbb{E} \sup_{u \in K : \, |u|_2 \le 1} u^\top g$ where $g \sim \mathcal{N}(0, I_{n\times n})$. For any closed convex cone $K$, the relation $w(K)^2 \le \delta(K) \le w(K)^2 + 1$ is established in [1, Proposition 10.2]. The following properties of the statistical dimension will be useful for our purpose. If $K_1 \subset \mathbb{R}^{n_1}$, $K_2 \subset \mathbb{R}^{n_2}$ are two closed convex cones, then $K_1 \times K_2$ is a closed convex cone in $\mathbb{R}^{n_1 + n_2}$ and
\begin{equation}
\delta(K_1 \times K_2) = \delta(K_1) + \delta(K_2). \tag{1.25}
\end{equation}
The statistical dimension is monotone in the following sense: If $K_1 \subseteq K_2$ are two closed convex cones in $\mathbb{R}^n$ then
\begin{equation}
\delta(K_1) \le \delta(K_2). \tag{1.26}
\end{equation}
We refer the reader to [1, Proposition 3.1] for straightforward proofs of the equivalence between the definition (1.24) and the properties (1.25), (1.26) and (1.23). An exact formula is available for the statistical dimension of $S_n^\uparrow$. Namely, it is proved in [1, (D.12)] that
\begin{equation}
\delta(S_n^\uparrow) = \sum_{k=1}^{n} \frac{1}{k}, \tag{1.27}
\end{equation}
and this formula readily implies that
\begin{equation}
\log n \le \delta(S_n^\uparrow) \le \log(en). \tag{1.28}
\end{equation}
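The exact formula (1.27) can be checked by Monte Carlo simulation: project standard Gaussian vectors onto $S_n^\uparrow$ with the pool adjacent violators algorithm and average the squared norm of the projection. A minimal sketch (helper names are ours):

```python
import random

def pava(y):
    """Euclidean projection onto the nondecreasing sequences
    (pool adjacent violators algorithm); blocks store (sum, count)."""
    blocks = []
    for v in y:
        s, c = v, 1
        while blocks and blocks[-1][0] / blocks[-1][1] > s / c:
            ps, pc = blocks.pop()   # pool with the previous block
            s, c = s + ps, c + pc
        blocks.append((s, c))
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out

random.seed(1)
n, reps = 15, 3000
est = sum(sum(p * p for p in pava([random.gauss(0, 1) for _ in range(n)]))
          for _ in range(reps)) / reps
harmonic = sum(1 / k for k in range(1, n + 1))
# the Monte Carlo estimate should be close to the harmonic number H_n (about 3.32)
print(round(est, 2), round(harmonic, 2))
```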
The following upper bound on the statistical dimension of the cone is derived in :
2 General tools to derive sharp oracle inequalities
In this section, we develop two general tools to derive sharp oracle inequalities for the LS estimator over a closed convex set.
2.1 Statistical dimension of the tangent cone
Let $\mu \in \mathbb{R}^n$, let $K$ be a closed convex subset of $\mathbb{R}^n$ and let $u \in K$. Define the tangent cone at $u$ by
\begin{equation}
T(K, u) = \operatorname{closure} \left\{ t (v - u) : \; t \ge 0, \; v \in K \right\}. \tag{2.1}
\end{equation}
If $K$ is a closed convex cone, then $K \subseteq T(K, u)$ for all $u \in K$.
Let $\mu \in \mathbb{R}^n$, let $K$ be a closed convex subset of $\mathbb{R}^n$ and let $u \in K$. Then almost surely
\begin{equation}
\| \hat{\mu}^{\mathrm{LS}}(K) - \mu \|^2 \le \| u - \mu \|^2 + \frac{1}{n} \left| \Pi_{T(K, u)}(\xi) \right|_2^2 . \tag{2.2}
\end{equation}
By definition of the statistical dimension, $\mathbb{E} |\Pi_{T(K,u)}(\xi)|_2^2 = \sigma^2 \delta(T(K,u))$, so that (2.2) readily yields a sharp oracle inequality in expectation. Bounds with high probability are obtained as follows. Let $T$ be a closed convex cone. By (1.23) we have $|\Pi_T(\xi)|_2 = \sup_{v \in T : \, |v|_2 \le 1} v^\top \xi$. Thus, by the concentration of suprema of Gaussian processes [5, Theorem 5.8] we have
\begin{equation}
\mathbb{P} \left( |\Pi_T(\xi)|_2 \ge \mathbb{E} |\Pi_T(\xi)|_2 + \sigma \sqrt{2x} \right) \le e^{-x}, \qquad x > 0, \tag{2.3}
\end{equation}
and by Jensen’s inequality we have $\mathbb{E} |\Pi_T(\xi)|_2 \le \sigma \sqrt{\delta(T)}$. Combining these two bounds, we obtain
\begin{equation}
\mathbb{P} \left( |\Pi_T(\xi)|_2 \ge \sigma \sqrt{\delta(T)} + \sigma \sqrt{2x} \right) \le e^{-x}, \qquad x > 0. \tag{2.4}
\end{equation}
Applying this concentration inequality to the cone $T(K, u)$ yields the following corollary.
Let $\mu \in \mathbb{R}^n$, let $K$ be a closed convex subset of $\mathbb{R}^n$, let $u \in K$ and let $T(K, u)$ be defined in (2.1). If $\hat{\mu} = \hat{\mu}^{\mathrm{LS}}(K)$ then
\begin{equation*}
\mathbb{E} \| \hat{\mu} - \mu \|^2 \le \| u - \mu \|^2 + \frac{\sigma^2 \delta(T(K, u))}{n}.
\end{equation*}
Furthermore, for all $x > 0$, with probability at least $1 - e^{-x}$ we have
\begin{equation*}
\| \hat{\mu} - \mu \|^2 \le \| u - \mu \|^2 + \frac{\sigma^2 \left( \sqrt{\delta(T(K, u))} + \sqrt{2x} \right)^2}{n}.
\end{equation*}
The survey  provides general recipes to bound from above the statistical dimension of cones of several types. For instance, the statistical dimension of $S_n^\uparrow$ is given by the exact formula (1.27). Bounds on the statistical dimension of a closed convex cone $K$ can be obtained using metric entropy results, as $\sigma^2 \delta(K)/n$ is the risk of the LS estimator $\hat{\mu}^{\mathrm{LS}}(K)$ when the true vector is $\mu = 0$. This technique is used in  to derive the bound (1.29).
2.2 Localized Gaussian widths
In this section, we develop yet another technique to derive sharp oracle inequalities for LS estimators over closed convex sets. This technique is associated with localized Gaussian widths rather than statistical dimensions of tangent cones. The result is given in Theorem 2.3 below. Recently, other general methods have been proposed [7, 15, 18], but these methods did not provide oracle inequalities with leading constant 1.
Let $K$ be a closed convex subset of $\mathbb{R}^n$ and let $u \in K$. Assume that $\mu \in \mathbb{R}^n$ and that there exists $t_* > 0$ such that
\begin{equation}
\mathbb{E} \sup_{v \in K : \, |v - u|_2 \le t_*} \xi^\top (v - u) \le \frac{t_*^2}{2}. \tag{2.8}
\end{equation}
Then for any $x > 0$, with probability greater than $1 - e^{-x}$,
\begin{equation}
\| \hat{\mu}^{\mathrm{LS}}(K) - \mu \|^2 \le \| u - \mu \|^2 + \frac{\left( t_* + \sigma \sqrt{2x} \right)^2}{n}. \tag{2.9}
\end{equation}
Let $\hat{\mu} = \hat{\mu}^{\mathrm{LS}}(K)$ and $h = \hat{\mu} - u$ for brevity. The concentration inequality for suprema of Gaussian processes [5, Theorem 5.8] yields that on an event $\Omega$ of probability greater than $1 - e^{-x}$,
\begin{equation*}
\sup_{v \in K : \, |v - u|_2 \le t_*} \xi^\top (v - u) \; \le \; \mathbb{E} \sup_{v \in K : \, |v - u|_2 \le t_*} \xi^\top (v - u) + t_* \sigma \sqrt{2x} \; \le \; \frac{t_*^2}{2} + t_* \sigma \sqrt{2x},
\end{equation*}
where the last inequality follows from (2.8). On the one hand, if $|h|_2 \le t_*$, then on $\Omega$ we have
\begin{equation*}
2 \xi^\top h - |h|_2^2 \; \le \; t_*^2 + 2 t_* \sigma \sqrt{2x} \; \le \; \left( t_* + \sigma \sqrt{2x} \right)^2 .
\end{equation*}
On the other hand, if $|h|_2 > t_*$, then the vector $v = u + (t_*/|h|_2)\, h$ belongs to $K$: by convexity of $K$ we have $v \in K$, and by definition of $v$ it holds that $|v - u|_2 = t_*$. On $\Omega$,
\begin{equation*}
2 \xi^\top h - |h|_2^2 \; = \; \frac{2 |h|_2}{t_*} \, \xi^\top (v - u) - |h|_2^2 \; \le \; 2 |h|_2 \left( \frac{t_*}{2} + \sigma \sqrt{2x} \right) - |h|_2^2 \; \le \; \left( t_* + \sigma \sqrt{2x} \right)^2,
\end{equation*}
where we used $2ab - b^2 \le a^2$ with $a = t_*/2 + \sigma \sqrt{2x}$ and $b = |h|_2$. Thus $2 \xi^\top h - |h|_2^2 \le ( t_* + \sigma \sqrt{2x} )^2$ holds on $\Omega$ for both cases $|h|_2 \le t_*$ and $|h|_2 > t_*$. Finally, inequality (1.20) yields that $\| \hat{\mu} - \mu \|^2 \le \| u - \mu \|^2 + \frac{1}{n} \left( 2 \xi^\top h - |h|_2^2 \right)$, so that (2.9) holds on $\Omega$. ∎
Note that condition (2.8) does not depend on the true vector $\mu$, but only depends on the vector $u$ that appears on the right hand side of the oracle inequality. The left hand side of (2.8) is the Gaussian width of $K$ localized around $u$. This differs from the recent analysis of Chatterjee , where the Gaussian width localized around $\mu$ is studied. An advantage of considering the Gaussian width localized around $u$ is that the resulting oracle inequality (2.9) is sharp, i.e., with leading constant 1. Chatterjee  proved that the Gaussian width localized around $\mu$ characterizes a deterministic quantity $t_\mu > 0$ such that $| \hat{\mu}^{\mathrm{LS}}(K) - \mu |_2$ concentrates around $t_\mu$. This result from  grants both an upper bound and a lower bound on $| \hat{\mu}^{\mathrm{LS}}(K) - \mu |_2$, but it does not imply nor is implied by a sharp oracle inequality such as (2.9) above. Thus, the result of  is of a different nature than (2.9).
3 Sharp bounds in isotonic regression
We study in this section the performance of $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$ using the general tools developed in the previous section. We first apply Corollary 2.2. To do so, we need to bound from above the statistical dimension of the tangent cone $T(S_n^\uparrow, u)$. In fact, it is possible to characterize the tangent cone and to obtain a closed formula for its statistical dimension.
Let $u \in S_n^\uparrow$ and let $k = k(u)$. Let $(T_1, \dots, T_k)$ be a partition of $\{1, \dots, n\}$ into sets of consecutive integers such that $u$ is constant on each $T_j$. Then
\begin{equation*}
T(S_n^\uparrow, u) = S_{|T_1|}^\uparrow \times \dots \times S_{|T_k|}^\uparrow .
\end{equation*}
Let $T = T(S_n^\uparrow, u)$ for brevity. If $u$ is constant, then it is clear that $T = S_n^\uparrow$, so we assume that $u$ has at least one jump, i.e., $k \ge 2$. As $S_{|T_1|}^\uparrow \times \dots \times S_{|T_k|}^\uparrow$ is a closed cone and $v - u \in S_{|T_1|}^\uparrow \times \dots \times S_{|T_k|}^\uparrow$ for every $v \in S_n^\uparrow$ (since $u$ is constant on each $T_j$), the inclusion $T \subseteq S_{|T_1|}^\uparrow \times \dots \times S_{|T_k|}^\uparrow$ is straightforward. For the reverse inclusion, let $v \in S_{|T_1|}^\uparrow \times \dots \times S_{|T_k|}^\uparrow$ and let $\epsilon$ be the minimal jump of the sequence $u$, that is, $\epsilon = \min \{ u_{i+1} - u_i : \; u_{i+1} > u_i \}$. If $t > 0$ is small enough so that $2 t |v|_\infty \le \epsilon$, then the vector $u + t v$ belongs to $S_n^\uparrow$, which completes the proof. ∎
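As a concrete illustration of the lemma (a worked example of ours): if $n = 4$ and $u = (0, 0, 1, 1)^\top$, then $k(u) = 2$ with $T_1 = \{1, 2\}$ and $T_2 = \{3, 4\}$, so the lemma combined with (1.25) and (1.27) gives
\begin{equation*}
\delta \big( T(S_4^\uparrow, u) \big) = \delta \big( S_2^\uparrow \times S_2^\uparrow \big) = 2 \left( 1 + \tfrac{1}{2} \right) = 3 ,
\end{equation*}
which exceeds $\delta(S_4^\uparrow) = 1 + \tfrac12 + \tfrac13 + \tfrac14 = \tfrac{25}{12}$, as expected since the tangent cone contains the cone itself.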
For all $u \in S_n^\uparrow$ and any $\mu \in \mathbb{R}^n$,
\begin{equation}
\mathbb{E} \| \hat{\mu}^{\mathrm{LS}}(S_n^\uparrow) - \mu \|^2 \le \| u - \mu \|^2 + \frac{\sigma^2 k(u)}{n} \log \frac{en}{k(u)} . \tag{3.2}
\end{equation}
Furthermore, for any $x > 0$ we have with probability greater than $1 - e^{-x}$,
\begin{equation}
\| \hat{\mu}^{\mathrm{LS}}(S_n^\uparrow) - \mu \|^2 \le \| u - \mu \|^2 + \frac{\sigma^2 \left( \sqrt{k(u) \log \frac{en}{k(u)}} + \sqrt{2x} \right)^2}{n} . \tag{3.3}
\end{equation}
Let us discuss some features of Theorem 3.2 that are new. First, the estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$ satisfies oracle inequalities both in deviation with exponential probability bounds and in expectation, cf. (3.3) and (3.2), respectively. Previously known oracle inequalities for the LS estimator over $S_n^\uparrow$ were only proved in expectation.
Second, both (3.2) and (3.3) are sharp oracle inequalities, i.e., with leading constant 1. Although sharp oracle inequalities were obtained using aggregation methods , this is the first known sharp oracle inequality for the LS estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$.
Third, the assumption $\mu \in S_n^\uparrow$ is not needed, as opposed to the result of .
Last, the logarithmic factor in front of $\sigma^2 k(u)/n$ in (3.2) is optimal for the LS estimator. To see this, assume that there exists an absolute constant $c > 0$ such that for all $\mu \in \mathbb{R}^n$ and all $u \in S_n^\uparrow$,
\begin{equation}
\mathbb{E} \| \hat{\mu}^{\mathrm{LS}}(S_n^\uparrow) - \mu \|^2 \le \| u - \mu \|^2 + \frac{c\, \sigma^2 k(u)}{n} . \tag{3.4}
\end{equation}
Set $\mu = u = 0$. Thanks to (1.28), the left hand side of the above display is bounded from below by $\sigma^2 \log(n)/n$, while the right hand side is equal to $c\, \sigma^2 / n$. Thus, it is impossible to improve the logarithmic factor in front of $\sigma^2 k(u)/n$ for the estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$. However, it is still possible that for another estimator $\hat{\mu}$, (3.4) holds with or without the logarithmic factor. We do not know whether such an estimator exists.
We now highlight the adaptive behavior of the estimator $\hat{\mu}^{\mathrm{LS}}(S_n^\uparrow)$. Let