# Multivariate convex regression: global risk bounds and adaptation

Qiyang Han (royhan@uw.edu) and Jon A. Wellner (jaw@stat.washington.edu)

Department of Statistics, Box 354322, University of Washington, Seattle, WA 98195-4322
###### Abstract

In this paper, we study the problem of estimating a multivariate convex function defined on a convex body in a regression setting with random design. We are interested in the attainability of optimal rates of convergence under a squared global continuous $l^2$ loss in the multivariate setting. One crucial fact is that the minimax risks depend heavily on the shape of the support of the regression function. It is shown that the global minimax risk is on the order of $n^{-2/(d+1)}$ when the support is sufficiently smooth, but that the faster rate $n^{-4/(d+4)}$ is achieved automatically when the support is a polytope. Such tremendous differences in rates are due to difficulties in estimating the regression function near the boundary of smooth regions.

We then study the natural bounded least squares estimators (BLSE): we show that the BLSE nearly attains the optimal rates of convergence in low dimensions, while suffering rate-inefficiency in high dimensions. Remarkably, we show that the BLSE adapts nearly parametrically to polyhedral functions when the support is polyhedral in low dimensions by a local entropy method. We also show that the boundedness constraint cannot be dropped when risk is assessed via continuous loss.

Given rate suboptimality of the BLSE in higher dimensions, we further study rate-efficient adaptive estimation procedures. Two general model selection methods are developed to provide sieved adaptive estimators (SAE) that achieve nearly optimal rates of convergence for particular “regular” classes of convex functions, while maintaining nearly parametric rate-adaptivity to polyhedral functions in arbitrary dimensions. Interestingly, the uniform boundedness constraint is unnecessary when risks are measured in discrete norms. As a byproduct, we obtain nearly rate-optimal adaptive estimators for unknown convex sets from noisy support function measurements in arbitrary dimensions.

Running title: Risk bounds for Multivariate Convex Regression

Supported in part by NSF Grant DMS-1104832 and NI-AID Grant R01 AI029168.

MSC 2010 subject classifications: Primary 62G07, 62H12; secondary 62G05, 62G20.

Keywords: convex regression, high dimension, risk bounds, minimax risk, curvature, adaptive estimation, sieve methods, model selection, shape constraints, empirical process, convex geometry.

## 1 Introduction

### 1.1 Overview

Nonparametric estimation under convexity constraints has received much attention in recent years. In this paper, we study the problem of estimating an unknown convex function $f_0$ defined on a convex body $\Omega \subset \mathbb{R}^d$ from observations $(X_1, Y_1), \dots, (X_n, Y_n)$, where the $X_i$ are i.i.d. according to a probability law $\nu$ on $\Omega$, and $(X_i, Y_i)$ follows the model

  $Y_i = f_0(X_i) + \varepsilon_i, \quad \text{for all } i = 1, \dots, n. \tag{1.1}$

Here the $\varepsilon_i$'s are i.i.d. mean zero errors with variance $\sigma^2$. This is a random design regression model. We are interested in determining the optimal rates of convergence for estimating the unknown convex $f_0$ based on random observations from the above model, under the natural continuous $l^2$ norm associated with the probability measure $\nu$, defined by

  $l^2_\nu(f_0, g) := \int_\Omega (f_0 - g)^2 \, d\nu. \tag{1.2}$
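As a toy numerical illustration (ours, not the paper's), the model (1.1) and the loss (1.2) can be simulated as follows; the choices of $\nu$ (uniform on $[0,1]^2$), $f_0$, and $\sigma$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def f0(x):
    # an illustrative convex regression function: squared Euclidean norm
    return np.sum(x ** 2, axis=-1)

def sample_model(n, sigma=0.1, d=2):
    # design points drawn i.i.d. from nu = Uniform([0,1]^d); responses from (1.1)
    X = rng.uniform(size=(n, d))
    Y = f0(X) + sigma * rng.standard_normal(n)
    return X, Y

def l2_nu_sq(f, g, n_mc=100_000, d=2):
    # Monte Carlo approximation of the squared continuous loss (1.2)
    Z = rng.uniform(size=(n_mc, d))
    return np.mean((f(Z) - g(Z)) ** 2)

X, Y = sample_model(500)
g = lambda x: np.sum(x, axis=-1) - 0.5   # an affine competitor
loss = l2_nu_sq(f0, g)                   # approximates (1.2) for (f0, g)
```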

Convex nonparametric regression has a long history. [40, 39] studied least squares estimation in the case of dimension $d = 1$. In the multidimensional case, [51, 43, 47] and others studied different aspects of the problem in more restricted setups, before [58] studied the statistical properties of least squares estimation and related computational techniques in a general setting. Global convexity has also proved useful for faithfully selecting relevant variables in sparse additive modelling in the high-dimensional setting [65].

The rates of convergence for convexity/concavity restricted estimators have been investigated primarily in dimension $d = 1$. From a global point of view, [29] showed that the supremum loss of convex least squares estimators (LSEs) on any compact set within the domain is of order $n^{-2/5}$ (no squaring), up to logarithmic factors. [36] established global risk bounds of order $n^{-4/5}$, modulo logarithmic factors, under the squared discrete $l^2$ norm for the LSE in the regression setup with almost equi-distributed design points. The interesting feature is that the LSEs are nearly parametrically rate-adaptive to piecewise linear convex functions. In a different setting of density estimation, [27] concluded global rates of convergence of order no worse than $n^{-2/5}$ (no squaring) under the Hellinger metric for the maximum likelihood estimators (MLEs) of log-concave and $s$-concave densities.

From a local point of view, [48] established rates of convergence on the order of $n^{-2/5}$ (no squaring) at fixed smooth points for LSEs in the regression setup. The pointwise limit distribution theory of the MLEs of convex decreasing densities [32] and log-concave densities [2], and of Rényi divergence estimators for $s$-concave densities [38], follows this same rate of $n^{-2/5}$ (no squaring) at such smooth points. Adaptive phenomena are also observed at a local scale in [25] and [21] under various degrees of differentiability assumptions.

Such phenomena of global and local adaptation have also been found in estimation procedures with monotonicity constraints; see e.g. [67], [24], [8]. In particular, [24] characterized the global adaptive nature of the LSEs with general shape restrictions induced by conic inequalities in terms of the statistical dimension of the cone. This covers isotonic ($1$-monotone), convex ($2$-monotone) and general $k$-monotone regression problems. However, their conic inequalities require a strong order relationship between the design points, and thus render extension to high dimensions difficult.

In higher dimensions ($d \ge 2$), rates of convergence for estimating convex functions are far less well understood. [46] and [3] studied least squares estimation over the class of uniformly Lipschitz and uniformly bounded convex functions. In the presence of such restrictions, the (slightly weaker) results readily follow from classical entropy bounds (cf. Corollary 2.7.10 of [63]) and empirical process theory. In the related problem of estimating convex sets in higher dimensions, it is shown in [34] that estimation of an unknown convex set via support functions enjoys a minimax optimal rate of convergence on the order of $n^{-4/(d+3)}$ under the discrete squared $l^2$ norm. On the other hand, [19] showed that, in the setting of estimating the support of a uniform density known to be convex, the optimal rate of convergence under the Nikodym metric (the Nikodym distance between two measurable sets is the measure of their symmetric difference) is of order $n^{-1}\log n$ when the support is a polytope, and $n^{-2/(d+1)}$ when the support is a general convex body.

In the setting of multivariate density estimation with convexity constraints, [59] derived a minimax lower bound on the order of $n^{-2/(d+4)}$ (no squaring) for estimating a concave-transformed density at a fixed point under curvature conditions. More recently, [42] showed that estimating log-concave densities via the MLEs yields, in higher dimensions, rates different from those suggested by the low-dimensional results above and from the rates conjectured in [59]. The key observation in their paper is that the bracketing entropy of a suitable subclass of log-concave densities is on the order of $\varepsilon^{-(d-1)}$ rather than the $\varepsilon^{-d/2}$ in higher dimensions ($d \ge 3$) conjectured in [59]. The new entropy estimate gives global minimax risks on the order of $n^{-1/(d+1)}$ (no squaring) for $d \ge 3$, which is strictly worse than the pointwise rate $n^{-2/(d+4)}$ (no squaring). The larger entropy exhibited in [42] actually comes from uniform densities supported on convex bodies with smooth boundaries. Similar ideas have been explored further in [30], where it is shown that the metric entropy of convex functions on a polyhedral region differs significantly from the metric entropy of convex functions on a domain with a smooth boundary such as a ball. This quickly leads to the conjecture that the smoothness of the boundary of the domain plays a significant role in determining the degree of difficulty in estimating a convex function, especially in higher dimensions.

In this paper we investigate this issue in detail. We adopt a minimax approach and show that the difficulty of estimating a convex function in the regression framework with random design depends heavily on the smoothness of the support of the regression function. We first show that the global minimax risks for convex regression under the squared $l^2$ loss as defined in (1.2) are generally on the order of $n^{-2/(d+1)}$ for smooth supports (to be defined in Section 1.2), while a faster rate of $n^{-4/(d+4)}$ is possible when the support is a polytope. Such sharp differences in the global minimax risk are due to boundary perturbations of smooth supports that lead to a class of least favorable regression functions to be distinguished from the true one.

We then turn to the study of a variant of the LSE studied by [58] with a uniform bound constraint, which we call the bounded least squares estimator (BLSE). The uniform boundedness constraint, as we shall see in Section 3.3, cannot be relaxed when studying risk bounds under random design. We summarize our risk bounds for the BLSE in the squared $l^2$ norm in the following table.

| $(\Omega, f_0)$ | $(\mathcal{P}_k, \mathcal{P}_{m_{f_0}}(\Gamma))$ | $(\mathcal{P}_k, \mathcal{C}(\Gamma))$ | $(\mathcal{C}, \mathcal{C}(\Gamma))$ |
| --- | --- | --- | --- |
| $d = 1$ | $n^{-1}(\log n)^{5/4}$ | $n^{-4/5}$ | $n^{-4/5}$ |
| $d = 2$ | $n^{-1}(\log n)^{d(d+4)/4}$ | $n^{-4/(d+4)}$ | $n^{-2/3}\log n$ |
| $d = 3$ | $n^{-1}(\log n)^{d(d+4)/4}$ | $n^{-4/(d+4)}$ | $n^{-1/2}\log n$ |
| $d = 4$ | $n^{-1}(\log n)^{10}$ | $n^{-1/2}\log n$ | $n^{-1/(d-1)}$ |
| $d \ge 5$ | $n^{-4/d}(\log n)^{d+4}$ | $n^{-4/d}(\log n)^{d+4}$ | $n^{-2/d}$ |

Notation can be found in Section 1.2. To summarize, the BLSEs behave differently for different shapes of the support and different true regression functions: adaptive estimation occurs when (1) the support is polytopal, with consequently smaller entropy of the class of convex functions; (2) the support is polytopal and the regression function is polyhedral. This is in agreement with the adaptive properties obtained in [18, 19, 20], in that the epigraph of such a regression function is of polyhedral type. In particular, nearly parametric risks in low dimensions when the support is polytopal and the regression function is polyhedral are established by a local entropy method, as we shall discuss in detail in Section 3. It is natural to wonder if adaptation occurs when the support is a general convex body and the regression function is polyhedral. We conjecture that the answer is negative within the current methods via local entropy. For further discussion see Section 5.1.

It is worthwhile to note that, when the support is polytopal, the BLSEs achieve nearly optimal risks for $d \le 4$, while such optimality only holds for $d \le 3$ when the support is a general smooth convex body. Such rate inefficiency is also observed in [12] in the context of density estimation via minimum contrast estimators for Hölderian classes, and was conjectured for the MLEs of log-concave densities in higher dimensions in [59].

Given rate-suboptimality of the BLSEs, we further study rate-efficient adaptive estimation procedures. We show that the notion of ‘pseudo-dimension’ coined in [54] (see also Section 4) effectively characterizes the complexity for the low-dimensional models, i.e. polyhedral functions, within the class of multivariate convex functions. We then develop general model selection methods, from which two different types of sieved adaptive estimators (SAE) are studied and shown to achieve nearly optimal rates of convergence while being rate-adaptive simultaneously to all these low-dimensional models up to universal constants. Risks for these SAEs are both considered in continuous and discrete norms. Interestingly, the uniform boundedness constraint is not necessary when the discrete norm is used. See Theorems 4.2 and 4.4 for precise statements.

Applying these methods to the multivariate convex regression setup, we show that the risks of the SAEs are, up to logarithmic factors, on the order of $n^{-1}$ for polyhedral functions and $n^{-4/(d+4)}$ for uniformly Lipschitz (regular) convex functions, whatever the shape of the support. This does not contradict the global minimax risk for smooth domains, since the faster rate is only achieved when the regression function behaves nicely near the boundary of the domain, a setting which excludes the global least favorable case. The BLSE is unlikely to be rate-adaptive for such regular classes since the Lipschitz behavior of the BLSE near the boundary can be arbitrarily bad; see the further discussion in Section 5.1.

As a byproduct of our general framework, we obtain a nearly rate-optimal estimator for an unknown convex set from noisy support function measurements that adapts simultaneously to all polytopes at a nearly parametric rate. This gives a solution to this problem in arbitrary dimensions; the problem was previously considered by [20] in low dimensions.

The rest of the paper is organized as follows. We study the global minimax risks in Section 2. Section 3 is devoted to risk bounds for the BLSEs. The model selection methods and the associated SAEs are presented in Section 4, with some further results discussed in Appendix A. Related issues and problems are discussed in Section 5. For clarity of presentation, proofs are relegated to Appendices B-G. Auxiliary results from empirical process theory and convex geometry are collected in Appendix H.

### 1.2 Notation and conventions

$\|\cdot\|_p$ denotes the $l_p$-norm for a Euclidean vector, and $\|\cdot\|$ is usually understood as $\|\cdot\|_2$. $B(x, r)$ denotes the ball of radius $r$ centered at $x$ in $\mathbb{R}^d$. The canonical simplex in $\mathbb{R}^d$ is given its own symbol. The volume of a measurable set $A$ under Lebesgue measure is usually denoted $|A|$. The symbols $:=$ and $\equiv$ are used for definitions. $\mathbb{P}$ and $\mathbb{E}$ are sometimes abused (in proofs) for outer probability and expectation to handle possible measurability issues.

For a probability measure $\nu$ on $\Omega$, we denote the continuous $l^2$ metric under $\nu$ by $l_\nu$ as defined in (1.2), while $l_2$ is used when $\nu$ is Lebesgue measure. We assume that $\nu$ is absolutely continuous with respect to Lebesgue measure, with density bounded above and below by constants $\nu_{\max}$ and $\nu_{\min}$, respectively. For $\underline{X}_n = (X_1, \dots, X_n)$, define the discrete $l^2$ metric by $l^2_{\underline{X}_n}(f, g) := \frac{1}{n}\sum_{i=1}^n (f - g)^2(X_i)$.

#### 1.2.1 Conventions on constants

$C_d$ will denote a generic constant that depends only on $d$, which may change from line to line unless otherwise specified. $a \lesssim b$ and $a \gtrsim b$ mean $a \le Cb$ and $a \ge Cb$ respectively, and $a \asymp b$ means $a \lesssim b$ and $a \gtrsim b$. $C$ is understood as an absolute constant unless otherwise specified. Constants in theorems are stated in German (Fraktur) letters (e.g. $\mathfrak{c}$). A generic constant with specified dependence on a set of parameters may also change from line to line. For two real numbers $a, b$: $a \vee b := \max(a, b)$ and $a \wedge b := \min(a, b)$.

#### 1.2.2 Conventions on convex bodies

Let $\mathcal{P}_k$ denote the collection of polytopes that admit a triangulation into at most $k$ simplices. For a polytope $\Omega$, we call $\{\Omega_i\}$ a simplicial decomposition of $\Omega$ if all the $\Omega_i$'s are simplices with non-overlapping interiors whose union is $\Omega$. Let $\mathcal{C}$ denote the set of all smooth convex bodies in $\mathbb{R}^d$ (here 'smooth' means that Smoothness Assumption 2 below holds). Note that in dimension $d = 1$ the two classes coincide. The width of a convex body $\Omega$ is denoted by $w(\Omega)$. A convex body is smooth if the following two conditions are satisfied:

###### Smoothness Assumption 1.

For small enough, there exist disjoint caps such that and .

This is a slightly stronger version of the Economic Covering Theorem (cf. Theorem H.6) studied in the convex geometry literature, where we require the covering sets to be caps instead of simple convex sets. See also Remark H.7.

Now we state our second assumption. A sequence of simplices is called admissible if their interiors are pairwise disjoint. The simplicial approximation number $S(t, \Omega)$ is then defined via an infimum over all admissible sequences.

###### Smoothness Assumption 2.

The simplicial approximation number satisfies the growth condition

  $\limsup_{t \to 0}\, t^{(d-1)/2}\, S(t, \Omega) < \infty. \tag{1.3}$

The power $(d-1)/2$ is natural in the sense that it agrees with [17]: any convex body can be approximated within Hausdorff distance $t$ by a polytope with $O(t^{-(d-1)/2})$ vertices. Here we require the approximation to hold in a constructive sense so that such a bound remains valid.

###### Lemma 1.1.

Any ellipsoid satisfies Smoothness Assumptions 1 and 2.

#### 1.2.3 Conventions on convex functions

For a multivariate real-valued function $f$, let $L(f)$ denote the Lipschitz constant of $f$. $\|\cdot\|_p$ will denote the standard $L_p$ norm ($1 \le p \le \infty$).

We denote by $\mathcal{C}_p(\Gamma, L)$ the class of all convex functions that are bounded by $\Gamma$ in $L_p$ norm, whose Lipschitz constants are bounded by $L$, and whose domains are contained in $\Omega$. Dependence on the domain is often suppressed. Dependence on $p$ and $L$ is also suppressed when they equal $\infty$ (for example, $\mathcal{C}(\Gamma) = \mathcal{C}_\infty(\Gamma, \infty)$). We also let $\mathcal{P}_m(\Gamma)$ be the collection of polyhedral convex functions with at most $m$ facets (here by a facet of a polyhedral convex function we mean any $d$-dimensional polytope within the domain on which the function is affine). Alternatively, we can represent such a function as a maximum of $m$ affine functions, $x \mapsto \max_{1 \le i \le m}(a_i^T x + b_i)$. Similarly, we often simply denote $\mathcal{P}_m(\Gamma)$ by $\mathcal{P}_m$.

For a given support, we call the class of polyhedral convex functions the simple class, and the class of all convex functions with a pre-specified uniform bound on the Lipschitz constant the regular class.

#### 1.2.4 Conventions on entropy numbers

Let $\mathcal{F}$ be a subset of a normed space of real functions with norm $\|\cdot\|$. The metric entropy number $N(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimal number of $\varepsilon$-balls in $\|\cdot\|$ norm needed to cover $\mathcal{F}$, while the bracketing entropy number $N_{[\,]}(\varepsilon, \mathcal{F}, \|\cdot\|)$ is the minimal number of $\varepsilon$-brackets needed to cover $\mathcal{F}$. By an $\varepsilon$-bracket we mean the subset of functions determined by a pair $l \le u$ as follows: $[l, u] := \{f : l \le f \le u\}$ with $\|u - l\| \le \varepsilon$.

## 2 Global minimax risks

We will be interested in the global minimax risk defined by

  $R_\nu(n; \mathcal{F}) := \inf_{\hat f_n} \sup_{f \in \mathcal{F}} \mathbb{E}_f\, l^2_\nu(f, \hat f_n), \tag{2.1}$

where $\mathcal{F}$ is the function class of interest, and the infimum runs over all possible estimators $\hat f_n$ based on the observations $\{(X_i, Y_i)\}_{i=1}^n$.

### 2.1 Minimax risk upper bounds

We first derive a general minimax upper bound.

###### Theorem 2.1.

Suppose $\mathcal{F}$ is uniformly bounded by $\Gamma$, and the errors are independent sub-Gaussian with parameter $\sigma^2$: $\mathbb{E}\exp(\lambda \varepsilon_i) \le \exp(\sigma^2 \lambda^2 / 2)$ for all $\lambda \in \mathbb{R}$. Let the rate function $r_n$ be defined by

  $r_n := \inf_{\delta > 0}\Big(\frac{1}{z_0 n}\log N(\delta) + \frac{3}{4}\delta^2\Big), \tag{2.2}$

where $N(\delta) \equiv N(\delta, \mathcal{F}, l_\nu)$ for all $\delta > 0$. Then there exists an estimator $\hat f_n$ such that for any $t > 0$,

  $\sup_{f_0 \in \mathcal{F}} \mathbb{P}\big(n(l^2_\nu(\hat f_n, f_0) - r_n) > t\big) \le \exp(-z_0 t).$

Here the constant $z_0$ is defined via (C.7).

The proof is a generalization of the method of sieves by progressively choosing ‘theoretical’ sieves constructed via knowledge of the metric entropy of the function class to be estimated. As a direct corollary, we obtain

###### Corollary 2.2.

$R_\nu(n; \mathcal{F}) \le r_n + \dfrac{1}{z_0 n}$.

Typically $r_n$ is of larger order than $1/n$, and hence the right-hand side of the display is on the order of $r_n$.
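Numerically, the balance in (2.2) can be inspected under an assumed entropy model $\log N(\delta) = C\delta^{-d/2}$ (the classical order for bounded convex functions on a polytope); the constants $C = z_0 = 1$ below are illustrative, and the recovered exponent matches $n^{-4/(d+4)}$:

```python
import numpy as np

def rate_fn(n, d, C=1.0, z0=1.0):
    # r_n = inf_{delta > 0} [ log N(delta)/(z0 n) + (3/4) delta^2 ]  as in (2.2),
    # with the assumed entropy model log N(delta) = C * delta^{-d/2};
    # the infimum is approximated by a fine log-spaced grid search
    deltas = np.logspace(-8, 1, 4000)
    vals = C * deltas ** (-d / 2) / (z0 * n) + 0.75 * deltas ** 2
    return vals.min()

d = 2
ns = np.array([10.0 ** k for k in range(3, 8)])
rates = np.array([rate_fn(n, d) for n in ns])
slope = np.polyfit(np.log(ns), np.log(rates), 1)[0]
# balancing the two terms gives r_n of order n^{-4/(d+4)}
```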

Now we shall use the above results to establish a minimax risk upper bound for the convex regression problem. This is a direct consequence of Corollary 2.2 in view of the entropy result Lemma 3.5.

###### Theorem 2.3 (Minimax risk upper bounds).

For a polytopal domain $\Omega \in \mathcal{P}_k$, we have

  $R_\nu(n; \mathcal{C}(\Gamma)) \le C_{d, |\Omega|, \Gamma, \sigma, \nu}\, (k/n)^{4/(d+4)}.$

For a smooth domain $\Omega \in \mathcal{C}$, we have

  $R_\nu(n; \mathcal{C}(\Gamma)) \le C_{d, |\Omega|, \Gamma, \sigma, \nu} \begin{cases} n^{-2/3}\log n, & d = 2; \\ n^{-2/(d+1)}, & d \ge 3. \end{cases}$

Here the conclusion holds for $n$ large enough. Explicit forms for the constants can be found in (C.11) and (C.12).

### 2.2 Minimax risk lower bounds

In this section, errors will be assumed i.i.d. Gaussian, i.e. $\varepsilon_i \sim N(0, \sigma^2)$.

#### 2.2.1 General class

We consider global minimax risk lower bounds for two types of supports: (1) polytopes; (2) smooth convex bodies.

###### Theorem 2.4.

For a polyhedral domain $\Omega \in \mathcal{P}_k$, we have

  $R_\nu(n; \mathcal{C}(\Gamma)) \ge C_{d, \Omega, \Gamma, \sigma, \nu}\, n^{-4/(d+4)}.$

An explicit form for the constant can be found in (C.14).

###### Theorem 2.5.

Let $d \ge 2$. Suppose the domain $\Omega$ satisfies Smoothness Assumption 1. Then

  $R_\nu(n; \mathcal{C}(\Gamma)) \ge C_{d, \Omega, \Gamma, \sigma, \nu}\, n^{-2/(d+1)}.$

An explicit form for the constant can be found in (C.17).

Notably, the least favorable functions achieving the rate $n^{-4/(d+4)}$ for polytopal domains in Theorem 2.4 and the class yielding the rate $n^{-2/(d+1)}$ for smooth domains in Theorem 2.5 are radically different. In fact, the class yielding $n^{-4/(d+4)}$ involves perturbations of a reference convex function in the interior of the domain, while maintaining sufficient curvature to ensure convexity, so that the resulting rate corresponds to the classical rate for a function space with smoothness index $\beta = 2$. On the other hand, the slower rate in Theorem 2.5 involves showing that a smooth boundary allows more perturbations than the interior does, and thus the boundary behavior ultimately drives the slower rate.
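A heuristic way to read off the two exponents (our gloss, not the paper's argument): interior perturbations behave like a smoothness-$\beta = 2$ class in dimension $D = d$, while boundary perturbations behave like a Lipschitz ($\beta = 1$) class on the $(d-1)$-dimensional boundary, both instances of the classical rate $n^{-2\beta/(2\beta+D)}$:

```latex
% heuristic exponent bookkeeping; beta = smoothness index, D = effective dimension
\[
\frac{2\beta}{2\beta + D}\bigg|_{\beta = 2,\; D = d} = \frac{4}{d+4}
\quad \text{(interior)},
\qquad
\frac{2\beta}{2\beta + D}\bigg|_{\beta = 1,\; D = d-1} = \frac{2}{d+1}
\quad \text{(boundary)}.
\]
```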

#### 2.2.2 Simple class

It is not difficult to establish the general parametric lower bound on the order of $n^{-1}$ under the squared $l^2$ norm, so we shall examine the case where a slower minimax rate is possible. We shall illustrate this by considering the minimax rates for polyhedral functions supported on a smooth region.

###### Theorem 2.6.

Let $d \ge 2$. Suppose the domain satisfies Smoothness Assumption 1. Then, for $k$ below a threshold given by a constant depending on $d$, it holds that

  $R_\nu(n, \mathcal{P}_k) \ge C_d\, \sigma^2\, \frac{\nu_{\min}}{\nu_{\max}}\, \frac{k \log n}{n}.$

Note that the threshold for $k$ also depends on $\Omega$.

## 3 Least squares estimation

### 3.1 The estimator

In the convex regression setting, the least squares estimator (LSE) given the observations $\{(X_i, Y_i)\}_{i=1}^n$ is

  $\hat f_n^{LS} := \operatorname*{argmin}_{f \in \mathcal{C}} \sum_{i=1}^n (Y_i - f(X_i))^2. \tag{3.1}$

By a canonical construction (see also (3.4)), such LSEs exist and are consistent in view of [58], in the sense that $\hat f_n^{LS}$ converges to $f_0$ uniformly on any compact set in the interior of the domain of the true regression function $f_0$. Here we shall study the bounded LSE obtained under the additional constraint that the fitted function is bounded in sup norm by some specified $\Gamma$. It is shown in Section 3.3 that a boundedness condition is necessary in studying the risk for the LSE for convex regression, due to the bad behavior of the estimator near the boundary.

The bounded least squares estimators (BLSEs) can also be formulated as follows.

  $\begin{aligned} \min_{\{y_i\}, \{g_i\}} \quad & \sum_{i=1}^n (Y_i - y_i)^2 \\ \text{subject to} \quad & y_j \ge y_i + g_i^T (X_j - X_i), \\ & y_i \ge -\Gamma, \;\; -y_i \ge -\Gamma, \quad \text{for all } i, j = 1, \dots, n, \\ & \Gamma \ge y_i + g_i^T (v - X_i), \quad \text{for all } v \in \partial\Omega. \end{aligned} \tag{3.2}$

When the support is known to be a polytope with vertices $v_1, \dots, v_k$, the last condition of (3.2) can be replaced with

  $\Gamma \ge y_i + g_i^T (v_l - X_i), \quad \text{for all } l = 1, \dots, k \text{ and } i = 1, \dots, n. \tag{3.3}$

This is a quadratic program (QP); see page 338 in [15] for more details.

The existence of a solution $(\{\hat y_i\}, \{\hat g_i\})$ of (3.2) is clear, and we will use these estimated interpolating function values and subgradients to define a canonical estimator as follows:

  $\hat f_n^{LS}(x) := \max_{i = 1, \dots, n}\big(\hat y_i + \hat g_i^T (x - X_i)\big). \tag{3.4}$

For notational convenience, we suppress explicit dependence of this estimator on the uniform bound $\Gamma$. As we shall see, the uniform bound does not affect the rates of convergence as long as it exceeds $\|f_0\|_\infty$.
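A minimal numerical sketch (ours) of the canonical estimator (3.4), assuming the fitted values $\hat y_i$ and subgradients $\hat g_i$ are already available; the QP solve itself, and the boundary constraints over $\partial\Omega$, are omitted, and the helper names are hypothetical:

```python
import numpy as np

def eval_blse(x, X, y_hat, g_hat):
    # canonical estimator (3.4): pointwise max over fitted supporting hyperplanes
    # x: (m, d) query points; X: (n, d) design; y_hat: (n,); g_hat: (n, d)
    vals = y_hat[None, :] + np.einsum("nd,mnd->mn", g_hat, x[:, None, :] - X[None, :, :])
    return vals.max(axis=1)

def feasible(X, y_hat, g_hat, Gamma, tol=1e-9):
    # check the convexity and boundedness constraints of the QP (3.2)
    D = X[:, None, :] - X[None, :, :]                    # D[j, i] = X_j - X_i
    M = y_hat[None, :] + np.einsum("jid,id->ji", D, g_hat)
    convex_ok = bool(np.all(y_hat[:, None] >= M - tol))  # y_j >= y_i + g_i^T (X_j - X_i)
    bounded_ok = bool(np.all(np.abs(y_hat) <= Gamma + tol))
    return convex_ok and bounded_ok

# tangent values/slopes of x^2 on a 1-d grid satisfy the constraints
X = np.linspace(-1.0, 1.0, 21).reshape(-1, 1)
y_hat, g_hat = X[:, 0] ** 2, 2.0 * X
f_vals = eval_blse(X, X, y_hat, g_hat)   # recovers y_hat at the design points
```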

### 3.2 Risk bounds via local entropy

The rates of convergence of least squares estimators (LSEs) in the regression setting are well studied in the empirical process literature, see for example [12, 63, 61]. In particular, the local geometry near the true regression function, measured via the size of the local metric/bracketing entropy, drives the rates of convergence for LSEs. In our specific random design setting, we shall first strengthen the results of Theorem 9.1 of [61] on risk bounds in the fixed design setting, and the discussion on page 335 of [63] on rates of convergence in the random design setting, to risk bounds for LSEs in the general regression setting with random design.

To fix notation, suppose $Y_i = f_0(X_i) + \varepsilon_i$ with $f_0 \in \mathcal{F}$, and the $X_i$'s are i.i.d. from a probability measure $\nu$. The LSE $\hat f_n^{LS}$ is then defined to be the minimizer of $\sum_{i=1}^n (Y_i - f(X_i))^2$ over $f \in \mathcal{F}$. We assume that the LSEs exist for simplicity.

###### Theorem 3.1.

Let $\mathcal{F}$ be a function class uniformly bounded by some $\Gamma > 0$. Let the errors be i.i.d. subexponential. Let $f_0 \in \mathcal{F}$, and let $S(f_0, r) := \{f \in \mathcal{F} : l_\nu(f, f_0) \le r\}$ denote a local ball around $f_0$. Set

  $J_{[\,]}(r, f_0, l_\nu) \ge \int_{r^2/192\bar\mu_\Gamma}^{2r} \sqrt{\log N_{[\,]}(\varepsilon, S(f_0, r), l_\nu)}\, d\varepsilon \tag{3.5}$

where $\underline\mu_\Gamma$ and $\bar\mu_\Gamma$ are as in (3.6) below. Suppose $r \mapsto J_{[\,]}(r, f_0, l_\nu)/r^2$ is non-increasing, and

  $\sqrt{1/96} \vee (\Gamma/48) \le \underline\mu_\Gamma \le \bar\mu_\Gamma \le \sqrt{n}\, r_n / (18C). \tag{3.6}$

Then

  $\mathbb{P}\big(l_\nu(\hat f_n^{LS}, f_0) > r\big) \le 2C \sum_{j \ge 1} \exp\Big(-\frac{2^{2j} n r^2}{1296\, C^2 \bar\mu_\Gamma^2}\Big)$

holds for all $r \ge r_n$, where $r_n$ is chosen such that

  $\dfrac{J_{[\,]}(r_n, f_0, l_\nu)}{\sqrt{n}\, r_n^2} \le \dfrac{1}{13\, C\, \bar\mu_\Gamma}.$

Consequently,

 (3.7)

The constant $C$ is taken from Lemma D.1.

###### Remark 3.2.

We note that this result does not necessarily follow directly from Theorem 9.1 of [61], since we want to work with bracketing entropy in the continuous $l^2$ norm rather than the discrete norm. We also note that (3.6) is actually very weak; in fact we can first require $\underline\mu_\Gamma$ large enough to ensure the inequality on the far left holds, and then require $n$ large enough to ensure the inequality on the far right holds, since typically $\sqrt{n}\, r_n \to \infty$ at least at a logarithmic rate as $n \to \infty$. The risk bound (3.7) has two terms; typically the first term dominates the second for $n$ large.

Now, in order to derive the rate results, since the $l_\nu$ and $l_2$ metrics are essentially the same under our assumptions on $\nu$, it suffices to study the local geometry of the class of convex functions in terms of $l_2$. We shall establish the following key estimate for local entropy.

###### Lemma 3.3.

For any convex function $f_0$ defined on a polytope $\Omega \in \mathcal{P}_k$, any polyhedral convex function $g$, and any $\varepsilon, r > 0$, it holds that

  $\log N_{[\,]}(\varepsilon, S(f_0, r), l_2) \le C_d\, m (m \vee k)^{d(d+4)/4} \big(2r^2 + l_2^2(f_0, g)\big)^{d/4} \varepsilon^{-d/2} \Big(\log \frac{C_d (m \vee k)^d \big(\Gamma^2 \vee w(\Omega)^2 L^2(g)\big) |\Omega|}{\varepsilon^2}\Big)^{d(d+4)/4}$

Here $\{\Omega_i\}_{i=1}^m$ is a partition of $\Omega$ for which $g$ is affine on each $\Omega_i$.

###### Remark 3.4.

In particular, if $f_0 \in \mathcal{P}_{m_{f_0}}(\Gamma)$ for some $m_{f_0} \in \mathbb{N}$, then Lemma 3.3 entails that

  $\log N_{[\,]}(\varepsilon, S(f_0, r), l_2) \lesssim_{m_{f_0}, k, d} \Big(\frac{r}{\varepsilon}\Big)^{d/2} \times \text{poly-logarithmic terms}$

so that $r$ and $\varepsilon$ scale at the same rate. This property will play a crucial role in deriving nearly parametric rates for the BLSEs in low dimensions in Theorem 3.6.

We will also need the following result.

###### Lemma 3.5.

[Theorems 1.1 & 1.3 of [30]] Let $\Omega$ be a convex body in $\mathbb{R}^d$ satisfying Smoothness Assumption 2, and let $\mathcal{C}_p(\Gamma)$ be the collection of convex functions on $\Omega$ with $L_p$-norm bounded by $\Gamma$, $1 \le p \le \infty$.

1. If $\Omega$ can be triangulated into $k$ simplices, then

  $\log N(\varepsilon, \mathcal{C}_p(\Gamma), l_2) \le C_d\, k \big(|\Omega|^{1/2 - 1/p}\, \Gamma\, \varepsilon^{-1}\big)^{d/2}. \tag{3.8}$
2. Otherwise, for a general smooth convex body $\Omega$, it holds that

 logN(ε,Cp(Γ),l2) (3.9)

When , the above result can be strengthened to bracketing entropy bounds.

When , the entropy estimate follows from (3.8) with .

Now we are in a position to establish global risk bounds for the BLSE.

###### Theorem 3.6.

Assume the conditions of Theorem 3.1, and let $\hat f_n^{LS}$ be the BLSE defined via (3.2). Then, in (3.7) of Theorem 3.1, the rate is given by the following table:

| $(\Omega, f_0)$ | $(\mathcal{P}_k, \mathcal{P}_{m_{f_0}}(\Gamma))$ | $(\mathcal{P}_k, \mathcal{C}(\Gamma))$ | $(\mathcal{C}, \mathcal{C}(\Gamma))$ |
| --- | --- | --- | --- |
| $d = 1$ | $n^{-1}(\log n)^{5/4}$ | $n^{-4/5}$ | $n^{-4/5}$ |
| $d = 2$ | $n^{-1}(\log n)^{d(d+4)/4}$ | $n^{-4/(d+4)}$ | $n^{-2/3}\log n$ |
| $d = 3$ | $n^{-1}(\log n)^{d(d+4)/4}$ | $n^{-4/(d+4)}$ | $n^{-1/2}\log n$ |
| $d = 4$ | $n^{-1}(\log n)^{10}$ | $n^{-1/2}\log n$ | $n^{-1/(d-1)}$ |
| $d \ge 5$ | $n^{-4/d}(\log n)^{d+4}$ | $n^{-4/d}(\log n)^{d+4}$ | $n^{-2/d}$ |

for $n$ large enough. Here we only state the dependence of the rates on $n$ (for complete results, see (D.11), (D.12) and (D.13)).

###### Remark 3.7.

The logarithmic factors in the above table appear for several very different reasons: those in the second column come from logarithmic factors in the local entropy bound of Lemma 3.3. Those in the fourth row of the third column and the third row of the fourth column come from convergence properties of the entropy integral (3.5) near its lower endpoint. It is not yet clear whether this is an artifact of the proof techniques or the nature of the estimators. Interestingly, in a different but related setting of global rates of convergence for MLEs of log-concave densities, [42] also obtained a rate with a logarithmic factor in the critical dimension. Some potential drawbacks in terms of logarithmic factors resulting from the local entropy method (cf. Lemma 3.3) will be further discussed in Section 5.3.

###### Remark 3.8.

It is natural to wonder if adaptation happens when the support is smooth. We conjecture that the answer is negative. For further details see Section 5.1.

### 3.3 On the uniform boundedness assumption

We show below that a risk bound without a uniform boundedness condition is impossible, via the following example in dimension $d = 1$ due to [3]: let $\Omega = [0, 1]$, let the regression function be $f_0 \equiv 0$, and let the responses $Y_i = \varepsilon_i$ take values $\pm 1$; hence the error is subexponential. Then consider the event

  $A = \{X_1 \in [1/4, 1/2],\; X_2 \in (1/2, 3/4],\; X_3, \dots, X_n \in (3/4, 1],\; Y_1 = 1,\; Y_2 = \dots = Y_n = -1\} \cap \{X \in [0, 1/4]\}.$

Then, on the event $A$, the unconstrained least squares estimator must rise steeply on the interval $[0, 1/4]$. Restricting our attention to the smaller interval $[0, 1/8]$, the estimator there is bounded below by a quantity of order $(X_2 - X_1)^{-1}$, so the squared $l_\nu$ loss is bounded below by a constant multiple of $(X_2 - X_1)^{-2}$. Since $A$ has positive probability and $\mathbb{E}[(X_2 - X_1)^{-2}] = \infty$, we see that the risk is unbounded. This stands in sharp contrast to the fixed design setting considered in [36], where no uniform boundedness constraint is required, since the boundary effect in the risk is killed off by the nature of the discrete norm.

## 4 Model selection and adaptive estimation

In this section, we study a general model selection approach that selects among highly non-linear low-dimensional models whose complexity is characterized by the notion of 'pseudo-dimension' (as defined below). In the classical regression setup with fixed design and Gaussian errors, the estimator $\hat f_m$ obtained by minimizing the empirical loss over a model $\mathcal{P}_m$ of dimension $D_m$ typically has risk:

  $\mathbb{E}\, l^2_{\underline X_n}(\hat f_m, f_0) \lesssim \inf_{g \in \mathcal{P}_m} l^2_{\underline X_n}(f_0, g) + \frac{D_m}{n}. \tag{4.1}$

The task for model selection is to design a data-driven choice $\hat m$ of $m$ so that the resulting estimator approximately achieves the optimal rate simultaneously for each true $f_0$:

  $\mathbb{E}\, l^2_{\underline X_n}(\hat f_{\hat m}, f_0) \lesssim \inf_{m \in \mathbb{N}}\Big(\inf_{g \in \mathcal{P}_m} l^2_{\underline X_n}(f_0, g) + \frac{D_m}{n}\Big). \tag{4.2}$

Here we show in Section 4.1 that results analogous to (4.1) and (4.2) hold (up to logarithmic factors), by use of the notion of 'pseudo-dimension', when the risks are measured both in discrete and continuous norms. The subtle differences between these two norms in terms of the a priori uniform boundedness constraint on the parameters will become clear in the sequel.

To place our results in the context of the existing literature, it is worthwhile to note that when the low-dimensional models are of a certain non-linear type, a Lipschitz condition has been imposed to increase linearity (cf. Section 3.2.2 of [7]), and certain coupled entropy conditions under the continuous and discrete norms are required (cf. page 372 of [7]). On the other hand, combinatorial complexity such as the VC dimension is often used in the context of learning theory (cf. Chapter 8 of [49]), where the contrast function is usually required to be bounded, which fails in the general regression setup. We refer the readers to [13], [7], [49] and [11] for more details in this direction. In either case, however, the subtlety between the discrete and continuous norms has not been systematically addressed.

Our work here can be viewed as occupying the ground between the previous approaches: we provide adaptive procedures in a general regression setup with random design when the function class exhibits certain combinatorial low complexity. Two adaptive procedures are developed. The first procedure is inspired by the idea of bandwidth selection in nonparametric kernel estimation as in [45] and [44]; we call this method the L-adaptive procedure. The second method is based on penalized least squares in the spirit of [7] and [49]; we call this method the P-adaptive procedure. This framework is particularly interesting in convexity-restricted nonparametric estimation problems. We show in Section 4.2 (resp. Section 4.3) that the 'pseudo-dimension' effectively captures the dimension of the class of polyhedral functions (resp. polytopes) within the class of convex functions (resp. convex bodies), and hence nearly rate-optimal estimators that simultaneously adapt to polyhedral convex functions (resp. polytopes) are obtained as simple corollaries of our general results.
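To illustrate the penalized (P-adaptive) flavour in the simplest possible setting, here is a toy sketch (ours) with a piecewise-constant sieve standing in for the models $\mathcal{P}_m$ (so $D_m = m$); the penalty constant $\kappa = 1$ is an illustrative assumption:

```python
import numpy as np

def regressogram_loss(x, y, m):
    # empirical squared loss of the least squares fit over the m-bin
    # piecewise-constant sieve on [0, 1)
    bins = np.minimum((x * m).astype(int), m - 1)
    loss = 0.0
    for b in range(m):
        yb = y[bins == b]
        if yb.size:
            loss += np.sum((yb - yb.mean()) ** 2)
    return loss / y.size

def select_model(x, y, M, kappa=1.0):
    # penalized selection: empirical loss + kappa * D_m * log(n) / n, with D_m = m
    n = y.size
    crits = [regressogram_loss(x, y, m) + kappa * m * np.log(n) / n
             for m in range(1, M + 1)]
    return int(np.argmin(crits)) + 1

x = np.arange(100) / 100.0
y = (x >= 0.5).astype(float)        # noiseless two-level truth
m_hat = select_model(x, y, M=10)    # picks the smallest adequate model
```

With the noiseless two-level truth above, every even $m$ fits exactly, and the penalty then selects the smallest such model.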

### 4.1 General theory

Consider the regression model (1.1). Assume that the errors in the regression model are i.i.d. sub-Gaussian with parameter $\sigma^2$, i.e. $\mathbb{E}\exp(\lambda \varepsilon_i) \le \exp(\sigma^2 \lambda^2 / 2)$ for all $\lambda \in \mathbb{R}$, unless otherwise specified.

Following Section 4 of [54], a subset $A$ of $\mathbb{R}^n$ is said to have pseudo-dimension at most $t$, denoted $\operatorname{pdim}(A) \le t$, if for every $v \in \mathbb{R}^n$ and indices $1 \le i_1 < \dots < i_{t+1} \le n$, we can always find a sub-index set $S \subseteq \{i_1, \dots, i_{t+1}\}$ such that no $a \in A$ satisfies both (1) $a_i > v_i$ for all $i \in S$ and (2) $a_i < v_i$ for all $i \in \{i_1, \dots, i_{t+1}\} \setminus S$.

The following lemma is due to [52]; see their Theorem 1. The current statement is from [34], Theorem B.2.

###### Lemma 4.1.

Let $A$ be a subset of $\mathbb{R}^n$ with $\sup_{a \in A}\|a\|_\infty \le B$ and pseudo-dimension at most $t$. Then, for every $\varepsilon > 0$,

  $N(\varepsilon, A, \|\cdot\|_2) \le \Big(4 + \frac{2B\sqrt{n}}{\varepsilon}\Big)^{\kappa t}$

holds for some absolute constant $\kappa > 0$.

In the sequel, we shall assume that the constant $\kappa$ is known. We shall also assume knowledge of $\sigma$. No effort has been made to obtain optimal constants.

Let $\mathcal{F}$ be a function class, and $\{\mathcal{P}_m\}_{m \in \mathbb{N}}$ be a sequence of (low-dimensional) models. Typically $\mathcal{P}_m$ is a submodel of $\mathcal{F}$, but this is not a priori required in our theory. Now the key descriptor for a model $\mathcal{P}_m$ that exhibits low-dimensional structure is defined as follows:

  $\mathcal{F}(\underline X_n; \mathcal{P}_m) \equiv \mathcal{F}_m(\underline X_n) \equiv \{x \in \mathbb{R}^n : x = (g(X_1), \dots, g(X_n)),\, g \in \mathcal{P}_m\}. \tag{4.3}$

#### 4.1.1 Risk bounds for fixed models

We first derive analogous results in the spirit of (4.1): we consider risk bounds for the least squares estimator on each model, i.e. for each $m \in \mathbb{N}$, consider the estimator $\hat f_m$ defined by

  $\hat f_{\mathcal{P}_m} \equiv \hat f_m \in \operatorname*{argmin}_{f \in \mathcal{P}_m} \frac{1}{n}\sum_{i=1}^n (Y_i - f(X_i))^2. \tag{4.4}$

For simplicity we assume the existence of $\hat f_m$. The working assumptions are:

###### Assumption 1.

When risk is measured in a continuous norm, we suppose that both $f_0$ and each model $\mathcal{P}_m$ are uniformly bounded by $\Gamma$, and the function class $\mathcal{F}$ is also bounded by $\Gamma$.

###### Assumption 2.

Let $D_m$ be such that $\operatorname{pdim}(\mathcal{F}_m(\underline X_n)) \le D_m$ almost surely.

Now we are in a position to state our risk bounds and large deviation bounds for fixed models:

###### Theorem 4.2.

Suppose Assumptions 1-2 hold. Let $t > 0$. Then:

[Continuous norm.]

  $l^2_\nu(\hat f_m, f_0) \le \mathfrak{c}_c \inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g) + \mathfrak{k}_c (\sigma^2 \vee \Gamma^2)\frac{\kappa D_m \log n}{n} + \mathfrak{d}_c (\sigma^2 \vee \Gamma^2)\frac{t}{n} \tag{4.5}$

holds with probability at least $1 - e^{-t}$. Furthermore, it holds that

 (4.6)

[Discrete norm.]

  $l^2_{\underline X_n}(\hat f_m, f_0) \le \mathfrak{c}_d \inf_{g \in \mathcal{P}_m} l^2_{\underline X_n}(f_0, g) + \mathfrak{k}_d\, \sigma^2 \frac{\kappa D_m \log n}{n} + \mathfrak{d}_d\, \sigma^2 \frac{t}{n} \tag{4.7}$

holds with probability at least $1 - e^{-t}$, and

 (4.8)

The numerical constants $\mathfrak{c}_\#, \mathfrak{k}_\#, \mathfrak{d}_\#$ can be taken as given in (E.6), (E.8), (E.9) and (E.10); a further constant can be found in (E.5), and $\kappa$ is taken from Lemma 4.1.

###### Remark 4.3.

In view of the minimax lower bound achieved in Theorem 2.6, the logarithmic factors in Theorem 4.2 cannot be removed. The dependence on $D_m$ is also optimal, as can be seen by taking $\mathcal{P}_m$ to be a linear space of dimension $D_m$.
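Exact least squares over a nonlinear model such as the $m$-piece max-affine class is a nonconvex problem; the following one-dimensional alternating heuristic (in the style of partition-based max-affine fitting, an assumption of ours rather than the paper's procedure) sketches how an approximate $\hat f_m$ as in (4.4) might be computed:

```python
import numpy as np

def fit_max_affine(x, y, m, n_iter=20):
    # Heuristic least squares over m-piece max-affine functions
    # f(t) = max_j (a_j * t + b_j): alternate between assigning points to
    # pieces and refitting each piece by ordinary least squares.
    order = np.argsort(x)
    assign = np.minimum((np.arange(x.size) * m) // x.size, m - 1)
    labels = np.empty(x.size, dtype=int)
    labels[order] = assign                    # initial contiguous partition
    a, b = np.zeros(m), np.zeros(m)
    for _ in range(n_iter):
        for j in range(m):
            xj, yj = x[labels == j], y[labels == j]
            if xj.size >= 2:
                a[j], b[j] = np.polyfit(xj, yj, 1)
        labels = np.argmax(np.outer(x, a) + b, axis=1)   # reassign to argmax piece
    return a, b

def predict(x, a, b):
    return np.max(np.outer(x, a) + b, axis=1)

x = np.linspace(-1.0, 1.0, 200)
a, b = fit_max_affine(x, x ** 2, m=4)
mse = np.mean((predict(x, a, b) - x ** 2) ** 2)
```

The scheme decreases the empirical loss within each piece at every refit, but it is only a local heuristic: it carries no guarantee of finding the global minimizer in (4.4).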

Next we establish the analogue of (4.2): for a given sample, we want to choose a suitable 'tuning parameter' $\hat m$ so that the expected loss of $\hat f_{\hat m}$ is about the same magnitude as

  $\inf_{m \in \mathbb{N}}\Big(\inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g) + \frac{D_m \log n}{n}\Big). \tag{4.9}$

In this section we construct a data-dependent scheme for choosing such an $\hat m$ based on the ideas of [45] and [44]: we first determine a benchmark choice of the tuning parameter that yields the most conservative risk as obtained in (4.6) and (4.8) for general $f_0$, while forcing the tuning parameter to be substantially smaller for low-dimensional $f_0$ by comparing the risks of the resulting estimators. Importantly, the benchmark choice should be independent of the oracle information contained in (4.6) and (4.8). We consider two cases as follows:

(Case 1). The approximation error can be separated into knowledge concerning the unknown regression function and the complexity of the model indexed by $m$, via

  $\inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g) \le e(f_0, \Omega)\, G(m). \tag{4.10}$

(Case 2). Otherwise, we use a uniform upper bound:

  $\inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g) \le \sup_{f_0 \in \mathcal{F}} \inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g). \tag{4.11}$

As we shall see below in Theorem 4.4, the case (4.10) allows the resulting estimator to be risk adaptive to each regression function, while such localized information will be lost in the case (4.11). If $e(f_0, \Omega)$ can be controlled uniformly in $f_0$, then Case 1 and Case 2 are essentially the same; this is indeed the case when the loss function is the continuous norm in our specific applications in Sections 4.2 and 4.3, since a uniform boundedness constraint on the parameter space entails a uniform control of $e(f_0, \Omega)$. However, when the loss function is the discrete norm, where no uniform boundedness constraint is imposed, $e(f_0, \Omega)$ cannot be uniformly controlled, and hence only (4.10) will be useful.

To fix notation, let the superscripts up and un index the cases corresponding to (4.10) and (4.11), and let $\# \in \{c, d\}$ index the continuous and discrete norms, respectively. Let

  $L_\# := \sigma^2 \vee (\Gamma^2\, 1_{\# = c}). \tag{4.12}$

We will also use the following simplified notation for norms in definitions and theorem statements in the sequel:

  $l_c(f, g) := l_\nu(f, g), \qquad l_d(f, g) := l_{\underline X_n}(f, g). \tag{4.13}$

Now we define the benchmark choices

  $M^{up,\#}_n := \operatorname*{argmin}_{m \in \mathbb{N}}\Big\{G(m) + L_\#\,\frac{\kappa D_m \log n}{n}\Big\}, \qquad M^{un,\#}_n := \operatorname*{argmin}_{m \in \mathbb{N}}\Big\{\sup_{f_0 \in \mathcal{F}} \inf_{g \in \mathcal{P}_m} l^2_\nu(f_0, g) + L_\#\,\frac{\kappa D_m \log n}{n}\Big\} \tag{4.14}$

in order to balance the approximation error and variance terms in (4.6) and (4.8).

It should be noted here that, in both the continuous and discrete cases, the definition of the benchmark only involves a bias term measured in the continuous norm. We make the following assumption on $G$:

###### Assumption 3.

$G(m) \to 0$ and $D_m \to \infty$ as $m \to \infty$.

On the other hand, when $f_0$ lies in a low-dimensional model, $\hat m$ should not be too large compared with the benchmark. This can be accomplished by risk comparisons as follows:

 ^m∗,# :=min{1≤m≤M#