Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation
Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper and lower bounds match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property.
This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.
Rutgers University, Department of Statistics and Biostatistics
Let be a subset of . The Gaussian width of is defined as
where and are i.i.d. standard normal random variables. For any vector , denote by its Euclidean norm and define the Euclidean balls
We will also use the notation . The localized Gaussian width of with radius is the quantity . For any , define the norm by for any , and let be the number of nonzero coefficients of .
This paper studies the localized Gaussian width
where is the convex hull of points in .
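As a rough numerical illustration of the (non-localized) Gaussian width of a convex hull, the sketch below uses the fact that a linear functional over a convex hull is maximized at a vertex, so the width reduces to the expected maximum of finitely many Gaussian variables. The function name `gaussian_width_of_hull`, the choice of the canonical basis as vertices, and the comparison with the classical bound sqrt(2 log M) for vertices of norm at most 1 are assumptions made for this example; the localized width studied in the paper additionally intersects the hull with a ball, which this sketch does not do.

```python
import math
import random


def gaussian_width_of_hull(points, n_draws=2000, seed=0):
    """Monte Carlo estimate of E max_j <g, points[j]> with g standard normal.

    Since a linear functional over a convex hull attains its supremum at a
    vertex, this equals the Gaussian width of the convex hull of `points`.
    """
    rng = random.Random(seed)
    dim = len(points[0])
    total = 0.0
    for _ in range(n_draws):
        g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += max(sum(gi * pi for gi, pi in zip(g, p)) for p in points)
    return total / n_draws


# Illustrative choice: the M vertices are the canonical basis vectors (norm 1).
M = 50
vertices = [[1.0 if i == j else 0.0 for i in range(M)] for j in range(M)]
width = gaussian_width_of_hull(vertices)

# Classical bound on the expected maximum of M centered Gaussian variables
# with variance at most 1.
upper = math.sqrt(2.0 * math.log(M))
print(f"estimated width: {width:.3f}, sqrt(2 log M): {upper:.3f}")
```

The estimate depends on the number of Monte Carlo draws and on the seed; it is meant only to make the definition concrete.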
If , then matching upper and lower bounds are available for the localized Gaussian width:
The first goal of this paper is to generalize this bound to any that is the convex hull of points in .
The first part of the paper is devoted to the generalization of (5) and provides sharp bounds on the localized Gaussian width of the convex hull of points in , see Propositions 1 and 2 below. The subsequent sections provide statistical applications of these results. We first study the Lasso estimator and the convex aggregation problem in fixed-design regression. We then show that Empirical Risk Minimization achieves the minimax rate for the persistence problem in the anisotropic setting. Finally, we provide results for bounded empirical processes and for the convex aggregation problem in density estimation.
The first contribution of the present paper is the following upper bound on the localized Gaussian width of the convex hull of points in .
Let and . Let be the convex hull of points in and assume that . Let be a centered Gaussian random variable with covariance matrix . Then for all ,
Proposition 1 is proved in the next two subsections. Inequality
is a direct consequence of the Cauchy-Schwarz inequality and where is the orthogonal projection onto the linear span of and is the rank of . The novelty of (6) is inequality
The above result does not assume any type of Restricted Isometry Property (RIP). The following proposition shows that (8) is essentially sharp provided that the vertices of satisfy a one-sided RIP of order .
Let and . Let be a centered Gaussian random variable with covariance matrix . Let and assume for simplicity that is a positive integer such that . Let be the convex hull of the points where . Assume that for some real number we have
where . Then
The proof of Proposition 2 is given later in the paper.
This subsection provides the main tool to derive the upper bound (8). Define the simplex in by
Let be an integer, and let
where is a positive semi-definite matrix of size . Let be a deterministic vector such that is small. Maurey’s argument  has been used extensively to prove the existence of a sparse vector such that is of the same order as that of . Maurey’s argument uses the probabilistic method to prove the existence of such . A sketch of this argument is as follows.
Define the discrete set as
where is the canonical basis in . The discrete set is a subset of the simplex that contains only -sparse vectors.
Let be the canonical basis in . Let be i.i.d. random variables valued in with distribution
Next, consider the random variable
The random variable is valued in and is such that , where denotes the expectation with respect to . Then a bias-variance decomposition yields
where is a constant such that . As , this yields the existence of such that
If is chosen large enough, the two terms and are of the same order and we have established the existence of an -sparse vector so that is not substantially larger than .
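The sampling step of Maurey's argument can be sketched numerically as follows. This is a minimal illustration under assumptions that are not taken from the paper: the vertices are random unit vectors (so the norm bound is R = 1), the target weights `theta` are uniform, and the quadratic form is the squared Euclidean distance to the target convex combination. The names `maurey_sample`, `combine` and `sq_dist` are hypothetical helpers introduced for this example.

```python
import random


def maurey_sample(theta, m, rng):
    """Draw m i.i.d. indices from the distribution theta and return the
    empirical weights, a point of the simplex with at most m nonzero entries."""
    idx = rng.choices(range(len(theta)), weights=theta, k=m)
    weights = [0.0] * len(theta)
    for j in idx:
        weights[j] += 1.0 / m
    return weights


def combine(weights, points):
    """Return the combination sum_j weights[j] * points[j]."""
    dim = len(points[0])
    return [sum(w * p[i] for w, p in zip(weights, points)) for i in range(dim)]


def sq_dist(u, v):
    """Squared Euclidean distance between u and v."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


rng = random.Random(0)
M, dim, m, R = 30, 5, 10, 1.0

# Assumed setup: M random points on the unit sphere, so every norm is R = 1.
points = []
for _ in range(M):
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = sum(x * x for x in v) ** 0.5
    points.append([x / norm for x in v])

theta = [1.0 / M] * M            # target weights in the simplex
target = combine(theta, points)  # the dense convex combination

# Repeat the random sparsification and record the squared error each time.
errors = []
sparsities = []
for _ in range(2000):
    w = maurey_sample(theta, m, rng)
    sparsities.append(sum(1 for x in w if x > 0.0))
    errors.append(sq_dist(combine(w, points), target))

mean_err = sum(errors) / len(errors)
print(f"mean squared error: {mean_err:.4f}, R^2/m: {R * R / m:.4f}")
```

In this setup the variance term of the bias-variance decomposition is at most R^2/m, so the average error should be close to or below that value, and at least one draw must do no worse than the average; this is the probabilistic-method step of the argument.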
For our purpose, we need to refine this argument by controlling the deviation of the random variable . This is done in Lemma 3 below.
Let and define by (13). Let be a convex function. For all , let
where is a positive semi-definite matrix of size . Assume that the diagonal elements of satisfy for all . Then for all ,
In the next sections, it will be useful to bound from above the quantity maximized over subject to the constraint . An interpretation of (19) is as follows. Consider the two optimization problems
for some . Equation 19 says that the optimal value of the first optimization problem is smaller than the optimal value of the second optimization problem averaged over the distribution of given by the density on . The second optimization problem above is over the discrete set with the relaxed constraint , hence we have relaxed the constraint in exchange for discreteness. The discreteness of the set will be used in the next subsection for the proof of Proposition 1.
Proof of Lemma 3.
The set is compact. The function is convex with domain and thus continuous. Hence the supremum in the left hand side of (19) is achieved at some such that . Let be the random variable defined in (14) and (15) above. Denote by the expectation with respect to . By definition, and . Let . A bias-variance decomposition and the independence of yield
Another bias-variance decomposition yields
where we used that and that almost surely. Thus
Define the random variable , which is nonnegative and satisfies . By Markov's inequality, it holds that . Define the random variable by the density function on . Then we have for any , so by stochastic dominance, there exist a rich enough probability space and random variables and defined on such that and have the same distribution, and have the same distribution, and almost surely on (see for instance Theorem 7.1 in ). Denote by the expectation sign on the probability space .
By definition of and , using Jensen’s inequality, Fubini’s Theorem and the fact that we have
where is the nondecreasing function . The right hand side of the previous display is equal to . Next, we use the random variables and as follows:
Combining the previous display and (21) completes the proof. ∎
Proof of (8).
Let and set , which satisfies . As is the convex hull of points, let be such that
where for .
Let for all . This is a polynomial of order , of the form , where is the Gram matrix with for all . As we assume that , the diagonal elements of satisfy . For all , let . Applying Lemma 3 with the above notation, , and , we obtain
By definition of , so that . Using Fubini's theorem and a bound on the expectation of the maximum of centered Gaussian random variables with variances bounded from above by , we obtain that the right hand side of the previous display is bounded from above by
Numerous works have established a close relationship between localized Gaussian widths and the performance of statistical and compressed sensing procedures. Some of these works are reviewed below.
In a regression problem with random design where the design and the target are subgaussian, Lecué and Mendelson  established that two quantities govern the performance of the empirical risk minimizer over a convex class . These two quantities are defined using the Gaussian width of the class intersected with an ball [21, Definition 1.3],
If are such that and , Gordon et al.  provide precise estimates of where is the unit ball and is the ball of radius . These estimates are then used to solve the approximate reconstruction problem where one wants to recover an unknown high dimensional vector from a few random measurements [14, Section 7].
Plan et al.  show that in the semiparametric single index model, if the signal is known to belong to some star-shaped set , then the Gaussian width of and its localized version characterize the gain obtained by using the additional information that the signal belongs to , cf. Theorem 1.3 in .
Finally, Chatterjee  exhibits a connection between localized Gaussian widths and shape-constrained estimation.
These results are reminiscent of the isomorphic method [17, 3, 2], where localized expected suprema of empirical processes are used to obtain upper bounds on the performance of Empirical Risk Minimization (ERM) procedures. These results show that Gaussian width estimates are important to understand the statistical properties of estimators in many statistical contexts.
In Proposition 1, we established an upper bound on the Gaussian width of M-convex hulls. We now provide some statistical applications of this result in fixed-design regression. We will use the following theorem from .
Theorem 4 ().
Let be a closed convex subset of and . Let be an unknown vector and let . Denote by the projection of onto . Assume that for some ,
Then for any , with probability greater than , the Least Squares estimator satisfies
Hence, to prove an oracle inequality of the form (28), it is enough to prove the existence of a quantity such that (27) holds. If the convex set in the above theorem is the convex hull of points, then a quantity is given by the following proposition.
Let and . Let such that for all . For all , let . Let be a centered Gaussian random variable with covariance matrix . If then the quantity
provided that .
As and for all , the left hand side of the previous display satisfies
Thus (31) holds if , which is the case if the absolute constant is . ∎
Inequality (29) establishes the existence of a quantity such that
We now introduce two statistical frameworks where the localized Gaussian width of an M-convex hull has applications: the Lasso estimator in high-dimensional statistics and the convex aggregation problem.
Let be an unknown regression vector and let be an observed random vector, where satisfies . Let and let be deterministic vectors in . The set will be referred to as the dictionary. For any , let . If a set is given, the goal of the aggregation problem induced by is to find an estimator constructed with and the dictionary such that
either in expectation or with high probability, where is a small quantity. Inequality (33) is called a sharp oracle inequality, where "sharp" means that in the right hand side of (33), the multiplicative constant of the term is . Similar notation will be defined for regression with random design and density estimation. Define the simplex in by (11). The following aggregation problems were introduced in [26, 34].
Model Selection type aggregation with , i.e., is the canonical basis of . The goal is to construct an estimator whose risk is as close as possible to the best function in the dictionary. Such results can be found in [34, 22, 1] for random design regression, in [23, 10, 5, 11] for fixed design regression, and in [16, 6] for density estimation.
Convex aggregation with , i.e., is the simplex in . The goal is to construct an estimator whose risk is as close as possible to the best convex combination of the dictionary functions. See [34, 20, 19, 33] for results of this type in the regression framework and  for such results in density estimation.
One may also define the Sparse or Sparse Convex aggregation problems: construct an estimator whose risk is as close as possible to the best sparse combination of the dictionary functions. Such results can be found in [31, 30, 33] for fixed design regression and in  for regression with random design. These problems are beyond the scope of the present paper.
A goal of the present paper is to provide a unified argument that shows that empirical risk minimization is optimal for the convex aggregation problem in density estimation, regression with fixed design and regression with random design.
Let , let and define . Let and let for all . Let
Then for all , with probability greater than ,
where and .
Proof of Theorem 6.
Let be the linear span of and let be the orthogonal projector onto . If , then
Let be the convex hull of . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points, and for all , . By (41) and (29), the quantity satisfies (27). Applying Theorem 4 completes the proof. ∎
We consider the following regression model. Let and assume that for all . We will refer to as the covariates. Let be the matrix of dimension with columns . We observe
where is an unknown mean. The goal is to estimate using the design matrix .
Let be a tuning parameter and define the constrained Lasso estimator  by
Our goal will be to study the performance of the estimator (38) with respect to the prediction loss
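To make the constrained Lasso concrete, the following sketch minimizes the residual sum of squares over an l1-ball with a Frank-Wolfe iteration and exact line search. This solver is chosen here only because the l1-ball is the convex hull of the signed, scaled canonical basis vectors, which matches the convex-hull viewpoint of the paper; it is not claimed to be the procedure used by the author. The name `frank_wolfe_lasso`, the symbol `radius` for the tuning parameter, and the simulated data are assumptions of this example.

```python
import random


def frank_wolfe_lasso(X, y, radius, n_iter=200):
    """Approximately minimize ||y - X beta||^2 subject to ||beta||_1 <= radius.

    Each step moves toward a vertex (+/- radius * e_j) of the l1-ball, so the
    iterate always remains a convex combination of such vertices and of zero.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    fitted = [0.0] * n  # current value of X beta
    for _ in range(n_iter):
        resid = [yi - fi for yi, fi in zip(y, fitted)]
        # Correlation of each column with the residual (minus the gradient,
        # up to a positive factor).
        corr = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(p)]
        j = max(range(p), key=lambda k: abs(corr[k]))
        sign = 1.0 if corr[j] >= 0 else -1.0
        # Direction in prediction space: X (vertex - beta).
        direction = [sign * radius * X[i][j] - fitted[i] for i in range(n)]
        denom = sum(d * d for d in direction)
        if denom == 0.0:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        step = sum(d * r for d, r in zip(direction, resid)) / denom
        step = min(1.0, max(0.0, step))
        if step == 0.0:
            break
        beta = [(1.0 - step) * b for b in beta]
        beta[j] += step * sign * radius
        fitted = [f + step * d for f, d in zip(fitted, direction)]
    return beta


def sq_loss(X, y, beta):
    """Residual sum of squares ||y - X beta||^2."""
    return sum(
        (yi - sum(xij * bj for xij, bj in zip(row, beta))) ** 2
        for row, yi in zip(X, y)
    )


# Simulated data (illustrative only): n = 20 observations, p = 40 covariates.
rng = random.Random(1)
n, p, radius = 20, 40, 2.0
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
beta_true = [0.0] * p
beta_true[0], beta_true[3] = 1.0, -1.0
y = [
    sum(xij * bj for xij, bj in zip(row, beta_true)) + 0.1 * rng.gauss(0.0, 1.0)
    for row in X
]

beta_hat = frank_wolfe_lasso(X, y, radius)
print("l1 norm of estimate:", sum(abs(b) for b in beta_hat))
print("loss at estimate:", sq_loss(X, y, beta_hat))
```

The exact line search guarantees that the objective never increases along the iterations, and the convex-combination update keeps the l1 constraint satisfied without any projection step.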
Let and assume that for all . Let be the matrix of dimension with columns .
Proof of Theorem 7.
Let be the linear span of and let be the orthogonal projector onto . If , then
Let be the convex hull of , so that . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points of empirical norm less than or equal to . By (41) and (29), the quantity satisfies (27). Applying Theorem 4 completes the proof. ∎
The lower bound [30, Theorem 5.4 and (5.25)] states that there exists an absolute constant such that the following holds. If , then there exists a design matrix such that for all estimators ,
where for all , denotes the expectation with respect to the distribution of . Thus, Theorem 7 shows that the Least Squares estimator over the set is minimax optimal. In particular, the right hand side of inequality (40) cannot be improved.
Consider i.i.d. observations where are real-valued and the are design random variables in with for some covariance matrix . We consider the learning problem over the function class
for a given constant . We consider the Empirical Risk Minimizer defined by
where is a new observation distributed as and independent from the data . Define also the oracle by
and define by
where the subgaussian norm is defined by for any random variable (see Section 5.2.3 in  for equivalent definitions of the norm).
To analyse the above learning problem, we use the machinery developed by Lecué and Mendelson  to study learning problems over subgaussian classes. Consider the two quantities
where . In the present setting, Theorem A from Lecué and Mendelson  reads as follows.
Theorem 8 (Theorem A in Lecué and Mendelson ).
There exist absolute constants such that the following holds. Let . Consider i.i.d. observations with . Assume that the design random vectors are subgaussian with respect to the covariance matrix in the sense that for any . Define by (46) and by (47). Assume that the diagonal elements of are no larger than 1. Then, there exist absolute constants such that the estimator defined in (44) satisfies
with probability at least .
In the isotropic case (),