Global risk bounds and adaptation in univariate convex regression
We consider the problem of nonparametric estimation of a convex regression function $f$. We study the risk of the least squares estimator (LSE) under the natural squared error loss. We show that the risk is always bounded from above by $n^{-4/5}$ modulo logarithmic factors while being much smaller when $f$ is well approximated by a piecewise affine convex function with not too many affine pieces (in which case the risk is at most the parametric rate $k/n$ up to logarithmic factors, $k$ being the number of affine pieces). On the other hand, when $f$ has curvature, we show that no estimator can have risk smaller than a constant multiple of $n^{-4/5}$ in a very strong sense by proving a “local” minimax lower bound. We also study the case of model misspecification, where we show that the LSE exhibits the same global behavior provided the loss is measured from the closest convex projection of the true regression function. In the process of deriving our risk bounds, we prove new results for the metric entropy of local neighborhoods of the space of univariate convex functions. These results, which may be of independent interest, demonstrate the non-uniform nature of the space of univariate convex functions, in sharp contrast to classical function spaces based on smoothness constraints.
Keywords: least squares, minimax lower bound, misspecification, projection on a closed convex cone, sieve estimator.
We consider the problem of estimating an unknown convex function $f$ on $[0,1]$ from observations $(x_1, Y_1), \dots, (x_n, Y_n)$ drawn according to the model
$$Y_i = f(x_i) + \varepsilon_i, \qquad i = 1, \dots, n, \qquad (1)$$
where $x_1 < \cdots < x_n$ are fixed points in $[0,1]$ and $\varepsilon_1, \dots, \varepsilon_n$ represent independent mean zero errors. Convex regression is an important problem in the general area of nonparametric estimation under shape constraints. It often arises in applications: typical examples appear in economics (indirect utility, production or cost functions), medicine (dose response experiments) and biology (growth curves).
The most natural and commonly used estimator for $f$ is the full least squares estimator (LSE), $\hat f_n$, which is defined as any minimizer of the LS criterion, i.e.,
$$\hat f_n \in \operatorname*{argmin}_{g \in \mathcal{C}} \sum_{i=1}^{n} \left( Y_i - g(x_i) \right)^2, \qquad (2)$$
where $\mathcal{C}$ denotes the set of all real-valued convex functions on $[0,1]$. $\hat f_n$ is not unique even though its values at the design points $x_1, \dots, x_n$ are unique. This follows from the fact that $(\hat f_n(x_1), \dots, \hat f_n(x_n))$ is the projection of $(Y_1, \dots, Y_n)$ onto a closed convex cone in $\mathbb{R}^n$. A simple linear interpolation of these values leads to a unique continuous and piecewise linear convex function with possible knots at the data points, which can be treated as the canonical LSE. The canonical LSE can be easily computed by solving a quadratic program with linear constraints.
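Since the quadratic program is only described in words above, the following minimal sketch (our own illustration, not the authors' code; the name `convex_lse` and the SLSQP formulation are our choices) shows one way to compute the fitted values, imposing convexity as nondecreasing successive slopes:

```python
import numpy as np
from scipy.optimize import minimize

def convex_lse(x, y):
    """Fitted values of the convex least squares estimator at the design points.

    Solves min_theta 0.5 * ||y - theta||^2 subject to the linear convexity
    constraints that the successive slopes of theta are nondecreasing.
    (Illustrative formulation; any QP solver would do.)
    """
    n = len(x)
    cons = [{"type": "ineq",
             "fun": (lambda t, i=i:
                     (t[i + 2] - t[i + 1]) / (x[i + 2] - x[i + 1])
                     - (t[i + 1] - t[i]) / (x[i + 1] - x[i]))}
            for i in range(n - 2)]
    res = minimize(lambda t: 0.5 * np.sum((y - t) ** 2), y.copy(),
                   jac=lambda t: t - y, constraints=cons, method="SLSQP",
                   options={"maxiter": 500})
    return res.x

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
y = (x - 0.5) ** 2 + 0.05 * rng.standard_normal(x.size)  # convex truth + noise
theta = convex_lse(x, y)
slopes = np.diff(theta) / np.diff(x)   # nondecreasing, up to solver tolerance
```

The canonical LSE is then the piecewise linear interpolant of `(x, theta)`, e.g. via `np.interp`.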
Unlike other methods for function estimation such as those based on kernels which depend on tuning parameters such as smoothing bandwidths, the LSE has the obvious advantage of being completely automated. It was first proposed by Hildreth (1954) for the estimation of production functions and Engel curves. Algorithms for its computation can be found in Dykstra (1983) and Fraser and Massam (1989). The theoretical behavior of the LSE has been investigated by many authors. Its consistency in the supremum norm on compact sets in the interior of the support of the covariate was proved by Hanson and Pledger (1976). Mammen (1991) derived the rate of convergence of the LSE and its derivative at a fixed point, while Groeneboom et al. (2001) proved consistency and derived its asymptotic distribution at a fixed point of positive curvature. Dümbgen et al. (2004) showed that the supremum distance between the LSE and $f$, assuming twice differentiability of $f$, on a compact interval in the interior of the support of the design points is of the order $(\log n / n)^{2/5}$.
In spite of all the above mentioned work, surprisingly, not much is known about the global risk behavior of the LSE under the natural loss function:
$$\ell^2(\hat f_n, f) := \frac{1}{n} \sum_{i=1}^{n} \left( \hat f_n(x_i) - f(x_i) \right)^2.$$
This is the main focus of our paper. In particular, we satisfactorily address the following questions: At what rate does the risk of the LSE decrease to zero? How does this rate of convergence depend on the underlying true function $f$; i.e., does the LSE exhibit faster rates of convergence for certain functions $f$? How does $\hat f_n$ behave, in terms of its risk, when the model is misspecified, i.e., when the regression function is not convex?
We assume, throughout the paper, that, in (1), $x_1 < \cdots < x_n$ are fixed design points in $[0,1]$ satisfying
$$\frac{c_1}{n} \le x_{i+1} - x_i \le \frac{c_2}{n} \qquad \text{for } i = 1, \dots, n-1, \qquad (3)$$
where $c_1$ and $c_2$ are positive constants, and that $\varepsilon_1, \dots, \varepsilon_n$ are independent normally distributed random variables with mean zero and variance $\sigma^2$. In fact, all the results in our paper, excluding those in Section 5, hold under the milder assumption of subgaussianity of the errors. Our contributions in this paper can be summarized as follows.
We establish, for the first time, a finite sample upper bound for the risk of the LSE under the loss $\ell^2$ in Section 2. The analysis of the risk behavior of $\hat f_n$ is complicated by two facts: (1) $\hat f_n$ does not have a closed form expression, and (2) the class $\mathcal{C}$ (over which $\hat f_n$ minimizes the LS criterion) is not totally bounded. Our risk upper bound involves a minimum of two terms; see Theorem 2.1. The first term says that the risk is bounded by $n^{-4/5}$ up to multiplicative factors logarithmic in $n$. The second term in the risk bound says that the risk is bounded from above by a combination of the parametric rate $k/n$ and an approximation term that dictates how well $f$ is approximated by a piecewise affine convex function with $k$ pieces (up to logarithmic multiplicative factors). Our risk bound, in addition to establishing the worst case bound, implies that $\hat f_n$ adapts to piecewise affine convex functions with not too many pieces (see Section 2 for the precise definition). This is remarkable because the LSE minimizes the LS criterion over all convex functions with no explicit special treatment for piecewise affine convex functions.
In the process of proving our risk bound for the LSE, we prove new results for the metric entropy of balls in the space of convex functions. One of the standard approaches to finding risk bounds for procedures based on empirical risk minimization (ERM) says that the risk behavior of the ERM is determined by the metric entropy of balls in the parameter space around the true function $f$ (see, for example, Van de Geer (2000); Birgé and Massart (1993); van der Vaart and Wellner (1996); Massart (2007)). The ball around $f$ in $\mathcal{C}$ of radius $r$ is defined as
$$B(f, r) := \left\{ g \in \mathcal{C} : \ell(g, f) \le r \right\}, \qquad (4)$$
where $\ell(g, f)$ denotes the square root of $\ell^2(g, f)$.
Recall that, for a subset $F$ of a metric space $(\mathcal{X}, \rho)$, the $\epsilon$-covering number of $F$ under the metric $\rho$ is denoted by $N(\epsilon, F, \rho)$ and is defined as the smallest number of closed balls of radius $\epsilon$ whose union contains $F$. Metric entropy is the logarithm of the covering number.
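The covering-number definition can be illustrated numerically. The sketch below (our own illustration; `covering_number` is a hypothetical helper, not from the paper) computes a greedy upper bound on $N(\epsilon, F, \rho)$ for a finite set of vectors under the Euclidean metric — greedy because repeatedly covering any uncovered point always yields a valid, though not necessarily minimal, cover:

```python
import numpy as np

def covering_number(points, eps):
    """Greedy upper bound on the eps-covering number of a finite set of
    vectors under the Euclidean metric: pick any uncovered point, cover
    everything within distance eps of it, repeat.  The result is the size
    of a valid eps-cover, hence an upper bound on the minimal one."""
    remaining = list(points)
    n_balls = 0
    while remaining:
        center = remaining[0]
        remaining = [p for p in remaining
                     if np.linalg.norm(p - center) > eps]
        n_balls += 1
    return n_balls

# Ten equally spaced points on a line: coarsening the cover (doubling eps)
# reduces the number of balls needed, as the definition suggests.
pts = [np.array([float(i)]) for i in range(10)]
n_fine, n_coarse = covering_number(pts, 1.0), covering_number(pts, 2.0)
```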
We prove new upper bounds for the metric entropy of $B(f, r)$ in Section 3. These bounds depend crucially on $f$. When $f$ is a piecewise affine function with not too many pieces, the metric entropy of $B(f, r)$ is much smaller than when $f$ has a second derivative that is bounded from above and below by positive constants. This difference in the sizes of the balls is the reason why $\hat f_n$ exhibits different rates for different convex functions $f$. It should be noted that the convex functions in $B(f, r)$ are not uniformly bounded and hence existing results on the metric entropy of classes of convex functions (see Bronshtein (1976); Dryanov (2009); Guntuboyina and Sen (2013)) cannot be used directly to bound the metric entropy of $B(f, r)$. Our main risk bound, Theorem 2.1, is proved in Section 4 using the developed metric entropy bounds for $B(f, r)$. These new bounds are also of independent interest.
We investigate the optimality of the rate $n^{-4/5}$. We show that for convex functions having a bounded (from both above and below) curvature on a sub-interval of $[0,1]$, the rate $n^{-4/5}$ cannot be improved (in a very strong sense) by any other estimator. Specifically, we show that a certain “local” minimax risk (see Section 5 for the details), under the loss $\ell^2$, is bounded from below by a constant multiple of $n^{-4/5}$. This shows, in particular, that the same holds for the global minimax rate for this problem.
We also provide risk bounds in the case of model misspecification, where we do not assume that the underlying regression function in (1) is convex. In this case we prove the exact same upper bounds for $\ell^2(\hat f_n, \bar f)$, where $\bar f$ now denotes any convex projection (defined in Section 6) of the unknown true regression function. To the best of our knowledge, this is the first result on global risk bounds for the estimation of convex regression functions under model misspecification. Some auxiliary results about convex functions useful in the proofs of the main results are deferred to Section A.
Two special features of our analysis are that: (1) all our risk bounds are non-asymptotic, and (2) none of our results uses any (explicit) characterization of the LSE (beyond the fact that it minimizes the least squares criterion), as a result of which our approach can, in principle, be extended to more complex ERM procedures, including shape restricted function estimation in higher dimensions; see, e.g., Seijo and Sen (2011), Seregin and Wellner (2010) and Cule et al. (2010).
The adaptation behavior of the LSE implies, in particular, that the LSE converges at different rates depending on the true convex function $f$. We believe that such adaptation is rather unique to problems of shape restricted function estimation and is currently not well understood. For example, in the related problem of monotone function estimation, which has an enormous literature (see, e.g., Grenander (1956), Birgé (1989), Zhang (2002) and the references therein), the only result on adaptive global behavior of the LSE is found in Groeneboom and Pyke (1983); also see Van de Geer (1993). This result, however, holds only in an asymptotic sense and only when the true function is a constant. Results on the pointwise adaptive behavior of the LSE in monotone function estimation are more prevalent and can be found, for example, in Carolan and Dykstra (1999), Jankowski (2014) and Cator (2011). For convex function estimation, as far as we are aware, the adaptation behavior of the LSE has not been studied before. Adaptation behavior for the estimation of a convex function at a single point has been recently studied by Cai and Low (2014), but they focus on different estimators that are based on local averaging techniques.
2 Risk Analysis of the LSE
Before stating our main risk bound, we need some notation. Recall that $\mathcal{C}$ denotes the set of all real-valued convex functions on $[0,1]$. For $f \in \mathcal{C}$, let $D(f)$ denote the “distance” of $f$ from the class of affine functions. More precisely,
$$D(f) := \inf \left\{ \ell(f, g) : g \text{ is affine on } [0,1] \right\}.$$
Note that $D(f) = 0$ when $f$ is affine.
We also need the notion of piecewise affine convex functions. A convex function $f$ on $[0,1]$ is said to be piecewise affine if there exists an integer $k \ge 1$ and points $0 = t_0 < t_1 < \cdots < t_k = 1$ such that $f$ is affine on each of the intervals $[t_{i-1}, t_i]$ for $i = 1, \dots, k$. We define $k(f)$ to be the smallest such $k$. Let $\mathcal{P}_k$ denote the collection of all piecewise affine convex functions $f$ with $k(f) \le k$ and let $\mathcal{P} := \cup_{k \ge 1} \mathcal{P}_k$ denote the collection of all piecewise affine convex functions on $[0,1]$.
We are now ready to state our main upper bound for the risk of $\hat f_n$.
Let $f \in \mathcal{C}$. There exists a positive constant $C$ depending only on the ratio $c_2/c_1$ such that the risk $\mathbb{E}_f\, \ell^2(\hat f_n, f)$ is bounded by the minimum of the two bounds given in Theorems 2.2 and 2.3 below.
Because of the presence of the minimum in the risk bound presented above, the bound actually involves two parts. We isolate these two parts in the following two separate results. The first result says that the risk is bounded by $n^{-4/5}$ up to multiplicative factors that are logarithmic in $n$. The second result says that the risk is bounded from above by a combination of the parametric rate $\sigma^2 k / n$ and an approximation term that dictates how well $f$ is approximated by a piecewise affine convex function with $k$ pieces (up to logarithmic multiplicative factors). The implications of these two theorems are explained in the remarks below. It is clear that Theorems 2.2 and 2.3 together imply Theorem 2.1. We therefore prove Theorem 2.1 by proving Theorems 2.2 and 2.3 separately in Section 4.
Let $f \in \mathcal{C}$. There exists a positive constant $C$ depending only on the ratio $c_2/c_1$ such that $\mathbb{E}_f\, \ell^2(\hat f_n, f)$ is at most $C\, n^{-4/5}$ up to a multiplicative factor logarithmic in $n$.
There exists a constant $C$, depending only on the ratio $c_2/c_1$, such that, up to a multiplicative factor logarithmic in $n$,
$$\mathbb{E}_f\, \ell^2(\hat f_n, f) \le C \left( \inf_{g \in \mathcal{P}_k} \ell^2(f, g) + \frac{\sigma^2 k}{n} \right) \qquad (5)$$
for all $k \ge 1$.
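To see heuristically how a bound of the shape (5) produces the $n^{-4/5}$ rate (this back-of-the-envelope calculation is ours, for intuition only): for a smooth convex $f$, the best $k$-piece affine convex approximant has sup-norm error of order $k^{-2}$ (a standard fact about piecewise linear interpolation), so the approximation term is of order $k^{-4}$, and optimizing over $k$ gives

```latex
\inf_{k \ge 1} \left( \frac{\sigma^2 k}{n} + C\, k^{-4} \right)
\;\asymp\; n^{-4/5},
\qquad \text{attained at } k \asymp n^{1/5}.
```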
Remark 2.1 (Why convexity is similar to second order smoothness).
From the classical theory of nonparametric statistics, it follows that $n^{-4/5}$ is the same rate that one obtains for the estimation of twice differentiable functions (satisfying a condition such as a uniform bound on the second derivative) on the unit interval. In Theorem 2.2, we prove that $\hat f_n$ achieves the same rate (up to log factors) when the true function $f$ is convex, under no assumptions whatsoever on the smoothness of the function. Therefore, the constraint of convexity is similar to the constraint of second order smoothness. This has long been believed to be true but, to the best of our knowledge, Theorem 2.2 is the first result to rigorously prove it via a nonasymptotic risk bound for an estimator with no assumption of smoothness.
Remark 2.2 (Parametric rates for piecewise affine convex functions).
Theorem 2.3 implies that $\hat f_n$ attains the parametric rate, up to logarithmic factors, for estimating piecewise affine convex functions. Indeed, suppose $f$ is a piecewise affine convex function on $[0,1]$, i.e., $f \in \mathcal{P}_k$ with $k = k(f)$. Then, using $g = f$ in (5), we have the risk bound
$$\mathbb{E}_f\, \ell^2(\hat f_n, f) \le C\, \frac{\sigma^2 k(f)}{n}$$
up to a multiplicative factor logarithmic in $n$. This is the parametric rate up to logarithmic factors and is of course much smaller than the nonparametric rate $n^{-4/5}$ given in Theorem 2.2. Therefore, $\hat f_n$ adapts to each class $\mathcal{P}_k$ of piecewise affine convex functions.
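As a quick numerical check of the approximation term behind this adaptation (our own illustration, not from the paper), the sketch below verifies that the $k$-piece affine interpolant of a smooth convex function has sup-norm error decaying like $k^{-2}$; squaring gives the order-$k^{-4}$ approximation error that, balanced against the $\sigma^2 k / n$ term in (5), produces the $n^{-4/5}$ rate:

```python
import numpy as np

def pw_affine_sup_error(f, k, grid):
    """Sup-norm error on [0, 1] of the k-piece affine interpolant of f at
    the equally spaced knots j/k.  For convex f the interpolant is itself
    convex, so it is a feasible k-piece affine convex approximant."""
    knots = np.linspace(0.0, 1.0, k + 1)
    return np.max(np.abs(np.interp(grid, knots, f(knots)) - f(grid)))

f = lambda u: u ** 2                      # smooth, strictly convex
grid = np.linspace(0.0, 1.0, 100001)
e8 = pw_affine_sup_error(f, 8, grid)      # k = 8 pieces
e16 = pw_affine_sup_error(f, 16, grid)    # k = 16 pieces
# Doubling k cuts the sup-norm error by a factor of about 4 (k^{-2} decay).
```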
Remark 2.3 (Automatic adaptation).
Risk bounds such as (5) are usually provable for estimators based on empirical model selection criteria (see, for example, Barron et al. (1999)) or aggregation (see, for example, Rigollet and Tsybakov (2012)). Specializing to the present situation, in order to adapt over $\mathcal{P}_k$ as $k$ varies, one constructs an LSE over each $\mathcal{P}_k$ and then either selects one estimator from this collection by an empirical model selection criterion or aggregates these estimators with data-dependent weights. While the theory for such penalization estimators is well-developed (see, e.g., Barron et al. (1999)), these estimators are computationally expensive, may rely on tuning parameters that are difficult to choose in practice and also require estimation of $\sigma^2$. The LSE is very different from these estimators because it simply minimizes the LS criterion over the whole space $\mathcal{C}$. It is therefore very easy to compute, does not depend on any tuning parameter or estimate of $\sigma^2$ and, remarkably, it automatically adapts over the classes $\mathcal{P}_k$ as $k$ varies.
Remark 2.4 (Why convexity is different from second order smoothness).
In Remark 2.1, we argued that estimation under convexity is similar to estimation under second order smoothness. Here we describe how the two are different. The risk bound given by Theorem 2.3 crucially depends on the true function $f$. In other words, the LSE converges at different rates depending on the true convex function $f$. Therefore, the rate of the LSE is not uniform over the class of all convex functions but varies quite a bit from function to function in that class. As will be clear from our proofs, the reason for this difference in rates is that the class of convex functions is locally non-uniform in the sense that the local neighborhoods around certain convex functions (e.g., affine functions) are much sparser than local neighborhoods around other convex functions. On the other hand, in the class of twice differentiable functions, all local neighborhoods are, in some sense, equally sized.
Remark 2.5 (On the logarithmic factors).
We believe that Theorems 2.2 and 2.3 might have redundant logarithmic factors. In particular, we conjecture that there should be no logarithmic term in Theorem 2.2 and that the logarithmic term in Theorem 2.3 can be weakened; cf. analogous results in isotonic regression in Zhang (2002) and Chatterjee et al. (2013). These additional logarithmic factors mainly arise from the fact that the class of convex functions appearing in the proofs is not uniformly bounded. Sharpening these factors might be possible by using an explicit characterization of the LSE (as was done in Zhang (2002) and Chatterjee et al. (2013) for isotonic regression) and other techniques that are beyond the scope of the present paper.
The proofs of Theorems 2.2 and 2.3 are presented in Section 4. A high level overview of the proof goes as follows. The convex LSE is an ERM procedure. These procedures are very well studied and numerous risk bounds exist in mathematical statistics and machine learning (see, for example, Van de Geer (2000); Birgé and Massart (1993); van der Vaart and Wellner (1996); Massart (2007)). These results essentially say that the risk behavior of $\hat f_n$ is determined by the metric entropy of the balls $B(f, r)$ (defined in (4)) in $\mathcal{C}$ around the true function $f$. Controlling the metric entropy of these balls is the key step in the proofs of Theorems 2.2 and 2.3. The next section deals with bounds for the metric entropy of $B(f, r)$.
3 The Local Structure of the Space of Convex Functions
In this section, we prove bounds for the metric entropy of the balls $B(f, r)$ as $f$ ranges over the space of convex functions. Our results give new insights into the local structure of the space of convex functions. We show that the metric entropy of $B(f, r)$ behaves differently for different convex functions $f$. This is the reason why the LSE exhibits different rates of convergence depending on the true function $f$. The metric entropy of $B(f, r)$ is much smaller when $f$ is a piecewise affine convex function with not too many affine pieces than when $f$ has a second derivative that is bounded from above and below by positive constants.
The next theorem is the main result of this section.
There exists a positive constant $c$ depending only on the ratio $c_2/c_1$ such that for every $f \in \mathcal{C}$, $r > 0$ and $\epsilon > 0$, the metric entropy of $B(f, r)$ satisfies the upper bound (6), whose right hand side scales as $\epsilon^{-1/2}$ in $\epsilon$ and depends on $f$ only through the functional discussed below.
Note that the dependence of the right hand side of (6) on $\epsilon$ is always $\epsilon^{-1/2}$. The dependence on $f$ enters through a functional that controls the size of the ball $B(f, r)$: the larger its value, the larger the metric entropy of $B(f, r)$. Its smallest possible value is achieved for affine functions. When $f$ is piecewise affine, the functional is larger than this minimum, but not by much provided $k(f)$ is small. When $f$ cannot be well approximated by piecewise affine functions with a small number of pieces, the functional can be shown to be bounded from below by a constant independent of $\epsilon$. This will be the case, for example, when $f$ is twice differentiable with $f''$ bounded from above and below by positive constants. As shown in the next theorem, $B(f, r)$ has the largest possible size for such $f$. Note also that one always has an upper bound on the functional obtained by restricting the infimum in its definition to affine functions.
We need the following definition for the next theorem. For a subinterval $I$ of $[0,1]$ and positive real numbers $a \le b$, we define $\mathcal{C}(I; a, b)$ to be the class of all convex functions $f$ on $[0,1]$ which are twice differentiable on $I$ and which satisfy $a \le f''(x) \le b$ for all $x \in I$.
Suppose $f \in \mathcal{C}(I; a, b)$. Then there exist positive constants $c$, $\epsilon_0$ and $r_0$ depending only on $I$, $a$ and $b$ such that the metric entropy of $B(f, r)$ is bounded from below as in inequality (7), whose right hand side grows as $\epsilon^{-1/2}$.
Note that the right hand side of (7) does not depend on $f$. This should be contrasted with the right hand side of (6) when $f$ is, say, an affine function. The non-uniform nature of the space of univariate convex functions should be clear from this: balls of the same radius in the space have different sizes depending on their center $f$. This should be contrasted with the space of twice differentiable functions, in which all balls are equally sized in the sense that they all satisfy (7).
Note that the inequality (7) only holds when $\epsilon$ is not too small. This is actually inevitable because, ignoring the convexity of the functions in $B(f, r)$, the metric entropy of $B(f, r)$ under $\ell$ cannot be larger than the metric entropy of a ball of radius $r$ in $\mathbb{R}^n$, which is bounded from above by a quantity logarithmic in $1/\epsilon$ (see, e.g., Pollard (1990, Lemma 4.1)). Thus, as $\epsilon \to 0$, the metric entropy of $B(f, r)$ becomes logarithmic in $1/\epsilon$ as opposed to growing like $\epsilon^{-1/2}$. Also note that inequality (7) only holds for $\epsilon$ small compared to $r$. This also makes sense because the diameter of $B(f, r)$ in the metric $\ell$ is at most $2r$ and, consequently, the left hand side of (7) equals zero for $\epsilon > 2r$. Therefore, one cannot expect (7) to hold for all $\epsilon$.
In the remainder of this section, we provide the proofs of Theorems 3.1 and 3.2. Let us start with the proof of Theorem 3.1. Since functions in $B(f, r)$ are convex, we need to analyze the covering numbers of subsets of convex functions. Only two previous results exist here. Bronshtein (1976) proved covering number bounds for classes of convex functions that are uniformly bounded and uniformly Lipschitz under the supremum metric. This result was extended by Dryanov (2009), who dropped the uniform Lipschitz assumption (this result was further extended by Guntuboyina and Sen (2013) to the multivariate case). Unfortunately, the convex functions in $B(f, r)$ are not uniformly bounded (they only satisfy a weaker integral-type constraint) and hence Dryanov’s result cannot be used directly for proving Theorem 3.1. Another difficulty is that we need covering numbers under the metric $\ell$ while the results in Dryanov (2009) are based on integral metrics.
Here is a high-level outline of the proof of Theorem 3.1. The first step is to reduce the general problem to the case when $f \equiv 0$. The result for $f \equiv 0$ immediately implies the result for all affine functions $f$. One can then generalize to piecewise affine convex functions by repeating the argument over each affine piece. Finally, the result is derived for general $f$ by approximating $f$ by piecewise affine convex functions.
For $f \equiv 0$, the class of convex functions under consideration is $B(0, r)$. Unfortunately, functions in $B(0, r)$ are not uniformly bounded; they only satisfy a weaker discrete $L_2$-type boundedness constraint. We get around the lack of uniform boundedness by noting that convexity and the $L_2$-constraint imply that functions in $B(0, r)$ are uniformly bounded on subintervals that lie in the interior of $[0,1]$ (this is proved via Lemma A.3). We use this to partition the interval $[0,1]$ into appropriate subintervals where Dryanov’s metric entropy result can be employed. We first carry out this argument for another class of convex functions where the discrete $L_2$-constraint is replaced by an integral $L_2$-constraint. From this result, we deduce the covering numbers of $B(0, r)$ by using straightforward interpolation results (Lemma A.4).
3.1 Proof of Theorem 3.1
3.1.1 Reduction to the case when $f \equiv 0$
The first step is to note that it suffices to prove the theorem when $f$ is the constant function equal to 0. For $f \equiv 0$, Theorem 3.1 is equivalent to the following statement: there exists a constant $c$, depending only on the ratio $c_2/c_1$, such that the covering number bound (8) holds.
This inequality immediately implies Theorem 3.1 for every affine $f$ and every $r > 0$: subtracting the affine part of each function maps $B(f, r)$ onto $B(0, r)$ without changing pairwise distances, so the two balls have the same covering numbers. Hence
Suppose that $f$ is affine on each of the intervals $[t_{i-1}, t_i]$ for $i = 1, \dots, k$, where $0 = t_0 < t_1 < \cdots < t_k = 1$. Then there exist affine functions $g_1, \dots, g_k$ on $[0,1]$ such that $f = g_i$ on $[t_{i-1}, t_i]$ for every $i$.
For every pair of functions and on , we have the trivial identity: where
As a result, we clearly have
Fix an . Note that for every , we have
where consists of the class of all convex functions for which .
By the translation invariance of the Euclidean distance and the fact that is convex whenever is convex and is affine, it follows that
where is defined as the class of all convex functions for which .
The covering number can be easily bounded using (8) by the following scaling argument. Let with being the cardinality of . Also write for the interval and let for . For , let
and . By associating, for each , the convex function defined by , it can be shown that
The assumption (3) implies that the distance between neighboring design points lies between $c_1/n$ and $c_2/n$. Therefore, by applying (8) to the rescaled points instead of the original design points, we obtain the existence of a positive constant depending only on the ratio $c_2/c_1$ such that
3.1.2 The Integral Version
For $r > 0$, let $\mathcal{K}(r)$ denote the class of all real-valued convex functions $g$ on $[0,1]$ for which $\int_0^1 g^2(x)\, dx \le r^2$. The ball $B(0, r)$ is intuitively very close to the class $\mathcal{K}(r)$, the only difference being that the average constraint (11) is replaced by the integral constraint in $\mathcal{K}(r)$. We shall prove a good upper bound for the metric entropy of $\mathcal{K}(r)$. The metric entropy of $B(0, r)$ will then be derived as a consequence.
There exists a constant $c$ such that for every $r > 0$ and $\epsilon > 0$, the covering number bound (13) holds under the $L_2$ metric, where, by the $L_2$ metric, we mean the metric in which the distance between functions $g_1$ and $g_2$ is given by
$$\left( \int_0^1 \left( g_1(x) - g_2(x) \right)^2 dx \right)^{1/2}.$$
The above theorem is a new result. If the integral constraint is replaced by the stronger constraint of uniform boundedness, then the corresponding bound was proved by Dryanov (2009). Specifically, Dryanov (2009) considered the class consisting of all uniformly bounded convex functions on $[0,1]$ and proved the following. Guntuboyina and Sen (2013) extended this to the multivariate case.
Theorem 3.4 (Dryanov).
There exists a positive constant $c$ such that for every $\epsilon > 0$, the covering number bound (14) holds.
In Dryanov (2009), inequality (14) was only asserted for $\epsilon$ smaller than a positive constant. It turns out, however, that this condition is redundant. This follows from the observation that the diameter of the space in the relevant metric is bounded, which means that the left hand side of (14) equals 0 for all larger $\epsilon$ and, thus, by changing the constant suitably in Dryanov’s result, we obtain (14) for all $\epsilon$.
The class in Theorem 3.3 is much larger than Dryanov’s class because the integral constraint is much weaker than uniform boundedness. Therefore, Theorem 3.3 does not directly follow from Theorem 3.4. However, it is possible to derive Theorem 3.3 from Theorem 3.4 via the observation (made rigorous in Lemma A.3) that functions in the class become uniformly bounded on subintervals of $[0,1]$ that are sufficiently far away from the boundary points. On such subintervals, we may use Theorem 3.4 to bound the covering numbers. Theorem 3.3 is then proved by putting together these different covering numbers as shown below.
Proof of Theorem 3.3.
By a trivial scaling argument, we can assume without loss of generality that . Let be the largest integer that is strictly smaller than and let for . Observe that .
Fix . By Lemma A.3, the restriction of a function to is convex and uniformly bounded by . Therefore, by Theorem 3.4, there exists a positive constant such that we can cover the functions in in the metric to within by a finite set having cardinality at most
we get a cover for functions in in the metric of size less than or equal to and cardinality at most .
Taking , we get that
where depends only on . By an analogous argument, the above inequality will also hold for . The proof is completed by putting these two bounds together. ∎
3.1.3 Completion of the Proof of Theorem 3.1
By an elementary scaling argument, it follows that
We, therefore, only need to prove (8) for . For ease of notation, let us denote by .
Because for all , we have . We shall first prove an upper bound for where
For each function , let be the convex function on defined by
where . Also let .
By Lemma A.4 and the assumption that for all , we get that
for every pair of functions and in . Letting this inequality implies that
Again by Lemma A.4 and the assumption , we have that
As a result, we have that . Further, because and , we get that
where . By a simple scaling argument, the covering number on the right hand side above is upper bounded by
Indeed, for each , we can associate for . It is then easy to check that and
Thus, by Theorem 3.3, we assert the existence of a positive constant $c$ such that
Now for every pair of functions and in , we have
We make the simple observation that lies in the closed ball of radius in denoted by . As a result, using Pollard (1990, Lemma 4.1), we have