Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation

Abstract

Upper and lower bounds are derived for the Gaussian mean width of the intersection of the convex hull of M points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property.

This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.

1 Introduction

Let be a subset of . The Gaussian width of is defined as

where and are i.i.d. standard normal random variables. For any vector , denote by its Euclidean norm and define the Euclidean balls

We will also use the notation . The localized Gaussian width of with radius is the quantity . For any , define the norm by for any , and let be the number of nonzero coefficients of .
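For reference, these standard definitions can be written as follows, with T a bounded subset of R^n and g a standard Gaussian vector; the symbols T, n, g and r are choices made here for concreteness:

\[
w(T) \;=\; \mathbb{E}\Big[\sup_{t\in T} g^\top t\Big],
\qquad g=(g_1,\dots,g_n),\quad g_1,\dots,g_n \ \text{i.i.d. } N(0,1),
\]
\[
B_2(r) \;=\; \{x\in\mathbb{R}^n : \|x\|_2\le r\},
\qquad \|x\|_p = \Big(\sum_{j=1}^n |x_j|^p\Big)^{1/p},
\qquad \|x\|_0 = \#\{j : x_j\neq 0\},
\]

so that the localized Gaussian width of T with radius r is w(T \cap B_2(r)).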

This paper studies the localized Gaussian width

where is the convex hull of M points in .
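As a numerical illustration, the following minimal sketch estimates such a localized Gaussian width by Monte Carlo. It assumes the cvxpy package for the inner constrained maximization, and the point set, radius and function names are synthetic choices made here; the vertices are taken in symmetric pairs so that the intersection with the Euclidean ball is never empty.

import numpy as np
import cvxpy as cp

def localized_width(V, r, n_draws=200, seed=0):
    # Monte Carlo estimate of E sup { <g, t> : t in conv(columns of V), ||t||_2 <= r }.
    n, M = V.shape
    rng = np.random.default_rng(seed)
    vals = []
    for _ in range(n_draws):
        g = rng.standard_normal(n)
        lam = cp.Variable(M, nonneg=True)        # weights of the convex combination
        t = V @ lam                              # a point of the convex hull
        prob = cp.Problem(cp.Maximize(g @ t),
                          [cp.sum(lam) == 1, cp.norm(t, 2) <= r])
        prob.solve()
        vals.append(prob.value)
    return float(np.mean(vals))

# Example: unit-norm vertices taken in pairs +/- u_j, so that 0 lies in the hull.
rng = np.random.default_rng(1)
U = rng.standard_normal((20, 25))
U /= np.linalg.norm(U, axis=0)
V = np.hstack([U, -U])
print(localized_width(V, r=0.5))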

If , then matching upper and lower bounds are available for the localized Gaussian width:

cf. [14] and [21]. In the above display, means that and for some large enough numerical constant .

The first goal of this paper is to generalize this bound to any set that is the convex hull of M points in .

Contributions. Section 2 is devoted to the generalization of and provides sharp bounds on the localized Gaussian width of the convex hull of M points in , see below. Sections 3 to 5 provide statistical applications of the results of Section 2. Section 3 studies the Lasso estimator and the convex aggregation problem in fixed-design regression. In Section 4, we show that Empirical Risk Minimization achieves the minimax rate for the persistence problem in the anisotropic setting. Finally, Section 5 provides results for bounded empirical processes and for the convex aggregation problem in density estimation.

2 Localized Gaussian width of an M-convex hull

The first contribution of the present paper is the following upper bound on the localized Gaussian width of the convex hull of M points in .

is proved in the next two subsections. Inequality

is a direct consequence of the Cauchy-Schwarz inequality and where is the orthogonal projection onto the linear span of and is the rank of . The novelty of is inequality

Inequality was known for the ℓ1-ball [14], but to our knowledge is new for general M-convex hulls. If is the ℓ1-ball, then the bound is sharp up to numerical constants [14], [21].

The above result does not assume any type of Restricted Isometry Property (RIP). The following proposition shows that is essentially sharp provided that the vertices of satisfy a one-sided RIP of order .

The proof of is given in .

2.1 A refinement of Maurey’s argument

This subsection provides the main tool to derive the upper bound . Define the simplex in by

Let be an integer, and let

where is a positive semi-definite matrix of size . Let be a deterministic vector such that is small. Maurey’s argument [27] has been used extensively to prove the existence of a sparse vector such that is of the same order as that of . Maurey’s argument uses the probabilistic method to prove the existence of such . A sketch of this argument is as follows.

Define the discrete set as

where is the canonical basis in . The discrete set is a subset of the simplex that contains only -sparse vectors.

Let be the canonical basis in . Let be i.i.d. random variables valued in with distribution

Next, consider the random variable

The random variable is valued in and is such that , where denotes the expectation with respect to . Then a bias-variance decomposition yields

where is a constant such that . As , this yields the existence of such that

If is chosen large enough, the two terms and are of the same order and we have established the existence of an -sparse vector so that is not substantially larger than .
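Written out in one common set of symbols (K the positive semi-definite matrix, lambda* the target point of the simplex, \hat\lambda the m-sparse average of i.i.d. canonical basis vectors drawn according to lambda*; these symbol choices are ours), the bias-variance step of Maurey's argument reads

\[
\mathbb{E}\Big[(\hat\lambda-\lambda^*)^\top K (\hat\lambda-\lambda^*)\Big]
\;=\; \frac{1}{m}\Big(\sum_{j=1}^{M}\lambda^*_j K_{jj} \;-\; (\lambda^*)^\top K \lambda^*\Big)
\;\le\; \frac{D^2}{m},
\qquad D^2 \;\ge\; \max_{1\le j\le M} K_{jj},
\]

so some realization of \hat\lambda, which is m-sparse and lies in the simplex, satisfies the same bound.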

For our purpose, we need to refine this argument by controlling the deviation of the random variable . This is done in below.
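Before turning to that refinement, a quick numerical check of the basic sketch (numpy only; the matrix, the target point of the simplex and the variable names are synthetic choices made here): draw m i.i.d. indices from lambda*, form the m-sparse average, and compare the squared error to the D^2/m prediction.

import numpy as np

rng = np.random.default_rng(0)
n, M, m = 100, 500, 20                       # ambient dimension, number of points, sparsity level

Phi = rng.standard_normal((n, M))            # the M points are the columns Phi e_j (so K = Phi^T Phi)
lam_star = rng.dirichlet(np.ones(M))         # a dense point of the simplex

# Maurey's argument: draw m i.i.d. indices J_1, ..., J_m with P(J = j) = lam_star[j]
# and average the corresponding canonical basis vectors.
J = rng.choice(M, size=m, p=lam_star)
lam_hat = np.bincount(J, minlength=M) / m    # an m-sparse point of the simplex

err = np.sum((Phi @ (lam_hat - lam_star)) ** 2)
D2 = np.max(np.sum(Phi ** 2, axis=0))        # D^2 = max_j ||Phi e_j||_2^2 = max_j K_jj
print(err, D2 / m)                           # err is of order D^2 / m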

In the next sections, it will be useful to bound from above the quantity maximized over subject to the constraint . An interpretation of is as follows. Consider the two optimization problems

for some . says that the optimal value of the first optimization problem is smaller than the optimal value of the second optimization problem averaged over the distribution of given by the density on . The second optimization problem above is over the discrete set with the relaxed constraint , hence we have relaxed the constraint in exchange for discreteness. The discreteness of the set will be used in the next subsection for the proof of .

The set is compact. The function is convex with domain and thus continuous. Hence the supremum in the left hand side of is achieved at some such that . Let be the random variable defined in and above. Denote by the expectation with respect to . By definition, and . Let . A bias-variance decomposition and the independence of yield

Another bias-variance decomposition yields

where we used that and that almost surely. Thus

Define the random variable , which is nonnegative and satisfies . By Markov's inequality, it holds that . Define the random variable by the density function on . Then we have for any , so by stochastic dominance, there exists a rich enough probability space and random variables and defined on such that and have the same distribution, and have the same distribution, and almost surely on (see for instance Theorem 7.1 in [12]). Denote by the expectation on the probability space .

By definition of and , using Jensen’s inequality, Fubini’s Theorem and the fact that we have

where is the nondecreasing function . The right hand side of the previous display is equal to . Next, we use the random variables and as follows:

Combining the previous display and completes the proof.

2.2 Proof of

We are now ready to prove . The main ingredients are and the following upper bound on the cardinality of

If then by we have , hence holds. Thus it is enough to focus on the case .

Let and set , which satisfies . As is the convex hull of M points, let be such that

where for .

Let for all . This is a polynomial of order , of the form , where is the Gram matrix with for all . As we assume that , the diagonal elements of satisfy . For all , let . Applying with the above notation, , and , we obtain

By definition of , so that . Using Fubini's Theorem and a bound on the expectation of the maximum of centered Gaussian random variables with variances bounded from above by , we obtain that the right hand side of the previous display is bounded from above by

where we used the bound . To complete the proof of , notice that we have and .
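The bound on the expected maximum used in this step is the standard one: if Z_1, ..., Z_N are centered Gaussian random variables (not necessarily independent) with variances at most sigma^2, then

\[
\mathbb{E}\Big[\max_{1\le i\le N} Z_i\Big] \;\le\; \sigma\sqrt{2\log N}.
\]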

3 Statistical applications in fixed-design regression

Numerous works have established a close relationship between localized Gaussian widths and the performance of statistical and compressed sensing procedures. Some of these works are reviewed below.

  • In a regression problem with random design where the design and the target are subgaussian, [21] established that two quantities govern the performance of the empirical risk minimizer over a convex class . These two quantities are defined using the Gaussian width of the class intersected with an ball [21],

  • If are such that and , then [14] provide precise estimates of where is the unit ball and is the ball of radius . These estimates are then used to solve the approximate reconstruction problem where one wants to recover an unknown high-dimensional vector from a few random measurements [14].

  • [28] shows that in the semiparametric single index model, if the signal is known to belong to some star-shaped set , then the Gaussian width of and its localized version characterize the gain obtained by using the additional information that the signal belongs to , cf. Theorem 1.3 in [28].

  • Finally, [9] exhibits connections between localized Gaussian widths and shape-constrained estimation.

These results are reminiscent of the isomorphic method [17], where localized expected suprema of empirical processes are used to obtain upper bounds on the performance of Empirical Risk Minimization (ERM) procedures. These results show that Gaussian width estimates are important for understanding the statistical properties of estimators in many statistical contexts.

In Section 2, we established an upper bound on the Gaussian width of M-convex hulls. We now provide some statistical applications of this result in regression with fixed design. We will use the following theorem from [7].

Hence, to prove an oracle inequality of the form , it is enough to prove the existence of a quantity such that holds. If the convex set in the above theorem is the convex hull of M points, then a quantity is given by the following proposition.

Inequality

is a reformulation of using the notation of . Thus, in order to prove , it is enough to establish that for we have

As and for all , the left hand side of the previous display satisfies

Thus holds if , which is the case if the absolute constant is .

Inequality establishes the existence of a quantity such that

where is the convex hull of . Consequences of and are given in the next subsections.

We now introduce two statistical frameworks where the localized Gaussian width of an M-convex hull has applications: the Lasso estimator in high-dimensional statistics and the convex aggregation problem.

3.1 Convex aggregation

Let be an unknown regression vector and let be an observed random vector, where satisfies . Let and let be deterministic vectors in . The set will be referred to as the dictionary. For any , let . If a set is given, the goal of the aggregation problem induced by is to find an estimator constructed with and the dictionary such that

either in expectation or with high probability, where is a small quantity. Inequality is called a sharp oracle inequality, where “sharp” means that in the right hand side of , the multiplicative constant of the term is one. Similar notations will be defined for regression with random design and density estimation. Define the simplex in by . The following aggregation problems were introduced in [26].

  • Model Selection type aggregation

    with , i.e., is the canonical basis of . The goal is to construct an estimator whose risk is as close as possible to the best function in the dictionary. Such results can be found in [34] for random design regression, in [23] for fixed design regression, and in [16] for density estimation.

  • Convex aggregation

    with , i.e., is the simplex in . The goal is to construct an estimator whose risk is as close as possible to the best convex combination of the dictionary functions. See [34] for results of this type in the regression framework and [29] for such results in density estimation.

  • Linear aggregation

    with . The goal is to construct an estimator whose risk is as close as possible to the best linear combination of the dictionary functions, cf. [34] for such results in regression and [29] for such results in density estimation.

One may also define the Sparse or Sparse Convex aggregation problems: construct an estimator whose risk is as close as possible to the best sparse combination of the dictionary functions. Such results can be found in [31] for fixed design regression and in [24] for regression with random design. These problems are beyond the scope of the present paper.

A goal of the present paper is to provide a unified argument that shows that empirical risk minimization is optimal for the convex aggregation problem in density estimation, regression with fixed design and regression with random design.

Let be the linear span of and let be the orthogonal projector onto . If , then

Let be the convex hull of . Let be the convex projection of onto . We apply to which is a convex hull of points, and for all , . By and , the quantity satisfies . Applying completes the proof.
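To make the convex aggregation estimator concrete, here is a minimal sketch of the ERM over the simplex in fixed design, computed by projected gradient descent with the usual sorting-based Euclidean projection onto the simplex. It uses only numpy, and the dictionary, data and function names are synthetic choices made here.

import numpy as np

def project_simplex(v):
    # Euclidean projection of v onto { theta : theta >= 0, sum(theta) = 1 }.
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u * idx > css - 1)[0][-1]
    tau = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - tau, 0.0)

def convex_aggregation(F, y, n_iter=2000):
    # ERM over the simplex: minimize ||y - F theta||_2^2 over theta in the simplex,
    # where the columns of F are the dictionary vectors f_1, ..., f_M.
    n, M = F.shape
    step = 1.0 / (2.0 * np.linalg.norm(F, 2) ** 2)   # inverse Lipschitz constant of the gradient
    theta = np.full(M, 1.0 / M)
    for _ in range(n_iter):
        grad = -2.0 * F.T @ (y - F @ theta)
        theta = project_simplex(theta - step * grad)
    return theta

# Synthetic example: dictionary of M = 30 vectors in R^100, noisy target.
rng = np.random.default_rng(0)
F = rng.standard_normal((100, 30))
y = F[:, :3] @ np.array([0.5, 0.3, 0.2]) + 0.1 * rng.standard_normal(100)
print(convex_aggregation(F, y).round(3))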

3.2 Lasso

We consider the following regression model. Let and assume that for all . We will refer to as the covariates. Let be the matrix of dimension with columns . We observe

where is an unknown mean. The goal is to estimate using the design matrix .

Let be a tuning parameter and define the constrained Lasso estimator [32] by

Our goal will be to study the performance of the estimator with respect to the prediction loss

Let and assume that for all . Let be the matrix of dimension with columns .

Let be the linear span of and let be the orthogonal projector onto . If , then

Let be the convex hull of , so that . Let be the convex projection of onto . We apply to which is a convex hull of points of empirical norm less than or equal to . By and , the quantity satisfies . Applying completes the proof.
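A minimal sketch of the constrained Lasso studied here, again assuming the cvxpy package; the design, the column normalization, the radius of the l1 ball and the variable names below are synthetic choices made here.

import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n, p, R = 200, 500, 10.0                       # sample size, dimension, l1 radius

X = rng.standard_normal((n, p))
X *= np.sqrt(n) / np.linalg.norm(X, axis=0)    # normalize every column to ||x_j||_2 = sqrt(n)
mu = X[:, :5] @ np.array([2.0, -1.5, 1.0, 0.5, -0.5])
y = mu + rng.standard_normal(n)                # observations: unknown mean plus Gaussian noise

theta = cp.Variable(p)
prob = cp.Problem(cp.Minimize(cp.sum_squares(y - X @ theta)),
                  [cp.norm1(theta) <= R])      # constrained (not penalized) form of the Lasso
prob.solve()
print(np.sum((X @ theta.value - mu) ** 2) / n) # in-sample prediction error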

The lower bound [30] states that there exists an absolute constant such that the following holds. If , then there exists a design matrix such that for any estimator ,

where for all , denotes the expectation with respect to the distribution of . Thus, shows that the Least Squares estimator over the set is minimax optimal. In particular, the right hand side of inequality cannot be improved.

4 The anisotropic persistence problem in regression with random design

Consider i.i.d. observations where are real-valued and the are design random variables in with for some covariance matrix . We consider the learning problem over the function class

for a given constant . We consider the Empirical Risk Minimizer defined by

This problem is sometimes referred to as the persistence problem or the persistence framework [15]. The prediction risk of is given by

where is a new observation distributed as and independent from the data . Define also the oracle by

and define by

where the subgaussian norm is defined by for any random variable (see Section 5.2.3 in [35] for equivalent definitions of the norm).
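For reference, one of the equivalent definitions of this norm given in Section 5.2.3 of [35] is

\[
\|Z\|_{\psi_2} \;=\; \sup_{p\ge 1}\; p^{-1/2}\,\big(\mathbb{E}|Z|^{p}\big)^{1/p}.
\]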

To analyse the above learning problem, we use the machinery developed by [21] to study learning problems over subgaussian classes. Consider the two quantities

where . In the present setting, Theorem A from [21] reads as follows.

In the isotropic case (), [25] proves that

for some constants that only depend on , while

for some constants that only depend on .

Using and above lets us extend these bounds to the anisotropic case where is not proportional to the identity matrix.

The proof of will be given at the end of this subsection. The primary improvement of over previous results is that this result is agnostic to the underlying covariance structure. This lets us handle the anisotropic case with in the above proposition.

combined with lets us obtain the minimax rate of estimation for the persistence problem in the anisotropic case. Although the minimax rate was previously obtained in the isotropic case, we are not aware of a previous result that yields this rate for general covariance matrices .

In this proof, is an absolute constant whose value may change from line to line. Let . We first bound from above. Let and define

The random variable has the same distribution as where . Thus, the expectation inside the infimum in is equal to

To bound from above, it is enough to find some such that is bounded from above by .

By the Cauchy-Schwarz inequality, the right hand side is bounded from above by , which is smaller than for all small enough provided that for some constant that only depends on .

We now bound from above in the regime . Let be the columns of and let be the convex hull of the points . Using the fact that , the right hand side of the previous display is bounded from above by

where we used for the last inequality. By simple algebra, one can show that if for some large enough constant that only depends on , then the right hand side of is bounded from above by .

We now bound from above. Let . By definition of , to prove that , it is enough to show that

is smaller than . We use to show that the right hand side of the previous display is bounded from above by

By simple algebra very similar to that of the proof of , we obtain that if equals the right hand side of for large enough and , then the right hand side of the previous display is bounded from above by . This completes the proof of .

5 Bounded empirical processes and density estimation

We now prove a result similar to for bounded empirical processes indexed by the convex hull of M points. This will be useful to study the convex aggregation problem for density estimation. Throughout the paper, are i.i.d. Rademacher random variables that are independent of all other random variables.

Let . The function is convex since it can be written as the maximum of two linear functions. Applying with the above notation and yields

where the second inequality is a consequence of Fubini’s Theorem and for all ,

Using and the Rademacher complexity bound for finite classes given in [18], we obtain that for all ,

where is a numerical constant and is the cardinality of the set . By definition of we have . The cardinality of the set is bounded from above by the right hand side of . Combining inequality , inequality , and the fact that the integrals and are finite, we obtain

for some absolute constant . By definition of , we have . A monotonicity argument completes the proof.
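In one standard form (often called Massart's finite class lemma; the numerical constant depends on the normalization used in [18]), the Rademacher complexity bound for a finite class F evaluated at fixed points x_1, ..., x_n reads

\[
\mathbb{E}_{\varepsilon}\,\sup_{f\in F}\ \sum_{i=1}^{n} \varepsilon_i f(x_i)
\;\le\;
\Big(\max_{f\in F}\ \sum_{i=1}^{n} f(x_i)^2\Big)^{1/2}\sqrt{2\log|F|}.
\]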

Next, we show that can be used to derive a condition similar to for bounded empirical processes. To bound from above the performance of ERM procedures in density estimation, in the appendix requires the existence of a quantity such that

where is the function defined in above.

To obtain such a quantity under the assumptions of , we proceed as follows. Let and assume that

Define where is a numerical constant that will be chosen later. We now bound from above the right hand side of . We have

where for the last inequality we used that for all and that , since and . Thus, the right hand side of is bounded from above by

It is clear that the above quantity is bounded from above by if the numerical constant is large enough. Thus we have proved that as long as , inequality holds for

where is a numerical constant.

ERM and convex aggregation in density estimation

The minimax optimal rate for the convex aggregation problem is known to be of order

for regression with fixed design [30] and regression with random design [34] if the integers and satisfy or equivalently . The arguments for the convex aggregation lower bound from [34] can be readily applied to density estimation, showing that the rate is a lower bound on the optimal rate of convex aggregation for density estimation.
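Up to multiplicative constants, this optimal rate for aggregating M dictionary elements from n observations is commonly written as

\[
\psi_n(M) \;\asymp\; \min\Big\{\frac{M}{n},\ \sqrt{\frac{1}{n}\,\log\Big(\frac{eM}{\sqrt{n}}\Big)}\,\Big\},
\]

with the two regimes meeting when M is of order \sqrt{n}.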

We now use the results of the previous sections to show that ERM is optimal for the convex aggregation problem in regression with fixed design, regression with random design and density estimation.

It is a direct application of in the appendix. If , a fixed point is given by . If , we use with , and . The bound yields the existence of a fixed point in this regime.


A Proof of the lower bound

By the Varshamov-Gilbert extraction lemma [13], there exists a subset of such that

for any distinct .

For each , we define , a signed version of <