# Localized Gaussian width of M-convex hulls with applications to Lasso and convex aggregation

Pierre C. Bellec
July 3, 2019
###### Abstract

Upper and lower bounds are derived for the Gaussian mean width of the intersection of a convex hull of points with a Euclidean ball of a given radius. The upper bound holds for any collection of extreme points bounded in Euclidean norm. The upper bound and the lower bound match up to a multiplicative constant whenever the extreme points satisfy a one-sided Restricted Isometry Property.

This bound is then applied to study the Lasso estimator in fixed-design regression, the Empirical Risk Minimizer in the anisotropic persistence problem, and the convex aggregation problem in density estimation.

Rutgers University, Department of Statistics and Biostatistics

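Let $T$ be a subset of $\mathbb{R}^n$. The Gaussian width of $T$ is defined as

$$\ell(T) \coloneqq \mathbb{E} \sup_{u\in T} u^T g, \tag{2}$$

where $g=(g_1,\dots,g_n)^T$ and $g_1,\dots,g_n$ are i.i.d. standard normal random variables. For any vector $u\in\mathbb{R}^n$, denote by $|u|_2$ its Euclidean norm and define the Euclidean balls

$$B_2=\{u\in\mathbb{R}^n : |u|_2\le 1\}, \qquad sB_2=\{su\in\mathbb{R}^n,\ u\in B_2\} \text{ for all } s\ge 0. \tag{3}$$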

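We will also use the notation . The localized Gaussian width of $T$ with radius $s>0$ is the quantity $\ell(T\cap sB_2)$. For any $p\ge 1$, define the $\ell_p$ norm by $|u|_p=(\sum_{i=1}^n |u_i|^p)^{1/p}$ for any $u\in\mathbb{R}^n$, and let $|u|_0$ be the number of nonzero coefficients of $u$.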

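This paper studies the localized Gaussian width

$$\ell(sB_2\cap T), \tag{4}$$

where $T$ is the convex hull of $M$ points in $\mathbb{R}^n$.

If $T=B_1=\{u\in\mathbb{R}^n : |u|_1\le 1\}$ is the unit $\ell_1$-ball, then matching upper and lower bounds are available for the localized Gaussian width:

$$\ell(sB_2\cap B_1)\asymp \sqrt{\log\big(en(s^2\wedge 1)\big)}\wedge (s\sqrt{n}), \tag{5}$$

cf. [14] and [21, Section 4.1]. In the above display, $a\asymp b$ means that $a\le Cb$ and $b\le Ca$ for some large enough numerical constant $C\ge 1$.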
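The localized width in (5) can be explored numerically. The sketch below is not taken from the paper: it assumes the standard duality fact that the support function of an intersection of two convex bodies is the infimal convolution of their support functions, which for the intersection of the $\ell_1$ unit ball and the $\ell_2$ ball of radius $s$ reduces to a one-dimensional convex minimization over a soft-thresholding level. The function names are hypothetical.

```python
import numpy as np

def sup_over_l1_l2(g, s, iters=200):
    """Support function at g of {|u|_1 <= 1} intersected with {|u|_2 <= s}.

    Assumed dual formula: min over tau >= 0 of
        tau + s * |soft_threshold(g, tau)|_2,
    which is convex in tau, so a ternary search finds the minimum.
    """
    a = np.abs(g)

    def phi(tau):
        return tau + s * np.linalg.norm(np.maximum(a - tau, 0.0))

    lo, hi = 0.0, float(a.max())
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if phi(m1) <= phi(m2):
            hi = m2
        else:
            lo = m1
    return phi(0.5 * (lo + hi))

def localized_width_l1(n, s, n_draws=500, seed=0):
    """Monte Carlo estimate of the Gaussian width of sB_2 intersected with B_1."""
    rng = np.random.default_rng(seed)
    draws = [sup_over_l1_l2(rng.standard_normal(n), s) for _ in range(n_draws)]
    return float(np.mean(draws))

if __name__ == "__main__":
    n = 200
    for s in [0.05, 0.2, 1.0]:
        print(s, localized_width_l1(n, s))
```

For $s\ge 1$ the estimate reduces to the mean of the largest absolute coordinate of $g$, and for $s\le 1/\sqrt{n}$ it reduces to $s$ times the mean Euclidean norm of $g$; these are the two extreme regimes of (5).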

The first goal of this paper is to generalize this bound to any $T$ that is the convex hull of $M$ points in $\mathbb{R}^n$.

The next section is devoted to the generalization of (5) and provides sharp bounds on the localized Gaussian width of the convex hull of $M$ points in $\mathbb{R}^n$, see Propositions 1 and 2 below. The subsequent sections provide statistical applications of these results. The section on fixed-design regression studies the Lasso estimator and the convex aggregation problem. In the section on the persistence problem, we show that Empirical Risk Minimization achieves the minimax rate for the persistence problem in the anisotropic setting. Finally, the last section provides results for bounded empirical processes and for the convex aggregation problem in density estimation.

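The first contribution of the present paper is the following upper bound on the localized Gaussian width of the convex hull of $M$ points in $\mathbb{R}^n$.

###### Proposition 1.

Let $n\ge 1$ and $M\ge 2$. Let $T$ be the convex hull of $M$ points in $\mathbb{R}^n$ and assume that $T\subset B_2$. Let $g$ be a centered Gaussian random variable with covariance matrix $I_{n\times n}$. Then for all $s>0$,

$$\ell(T\cap sB_2)\le \left(4\sqrt{\log_+\!\big(4eM(s^2\wedge 1)\big)}\right)\wedge\left(s\sqrt{n\wedge M}\right) \tag{6}$$

where $\log_+(a)=\max(1,\log a)$.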

Proposition 1 is proved in the next two subsections. Inequality

$$\ell(T\cap sB_2)\le s\sqrt{n\wedge M} \tag{7}$$

is a direct consequence of the Cauchy-Schwarz inequality and $\mathbb{E}|Pg|_2^2=d$, where $P$ is the orthogonal projection onto the linear span of $T$ and $d\le n\wedge M$ is the rank of $P$. The novelty of (6) is inequality

$$\ell(T\cap sB_2)\le 4\sqrt{\log_+\!\big(4eM(s^2\wedge 1)\big)}. \tag{8}$$

Inequality (8) was known for the $\ell_1$-ball [14], but to our knowledge (8) is new for general $M$-convex hulls. If $T$ is the $\ell_1$-ball, then the bound (6) is sharp up to numerical constants [14], [21, Section 4.1].
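To get a feel for the two regimes of (6), the right-hand side can be evaluated directly. This is only a sketch: the helper names are hypothetical, and it assumes the convention $\log_+(a)=\max(1,\log a)$, which the extracted text does not state explicitly.

```python
import math

def log_plus(a):
    # Assumed convention: log_+(a) = max(1, log a)
    return max(1.0, math.log(a))

def prop1_bound(M, n, s):
    """Right-hand side of (6): minimum of the entropy term (8) and the dimension term (7)."""
    entropy_term = 4.0 * math.sqrt(log_plus(4.0 * math.e * M * min(s * s, 1.0)))
    dimension_term = s * math.sqrt(min(n, M))
    return min(entropy_term, dimension_term)

if __name__ == "__main__":
    # Small radius: the dimension term s * sqrt(n ^ M) is active.
    print(prop1_bound(M=100, n=50, s=0.01))
    # Large dimension and radius 1: the logarithmic term is active.
    print(prop1_bound(M=1000, n=10**6, s=1.0))
```

The logarithmic term only takes over once $s\sqrt{n\wedge M}$ exceeds a quantity of order $\sqrt{\log M}$, which is the regime where the convex hull structure matters.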

The above result does not assume any type of Restricted Isometry Property (RIP). The following proposition shows that (8) is essentially sharp provided that the vertices of $T$ satisfy a one-sided RIP, see (9) below.

###### Proposition 2.

Let and . Let be a centered Gaussian random variable with covariance matrix . Let and assume for simplicity that is a positive integer such that . Let be the convex hull of the points where . Assume that for some real number we have

$$\kappa|\theta|_2\le|\mu_\theta|_2 \quad\text{for all } \theta\in\mathbb{R}^M \text{ such that } |\theta|_0\le 2m, \tag{9}$$

where . Then

 (10)

The proof of Proposition 2 is given later in the paper.

This subsection provides the main tool to derive the upper bound (8). Define the simplex in $\mathbb{R}^M$ by

$$\Lambda^M=\Big\{\theta\in\mathbb{R}^M,\ \sum_{j=1}^M\theta_j=1,\ \forall j=1,\dots,M,\ \theta_j\ge 0\Big\}. \tag{11}$$

Let $m\ge 1$ be an integer, and let

$$Q(\theta)=\theta^T\Sigma\theta, \tag{12}$$

where $\Sigma$ is a positive semi-definite matrix of size $M\times M$. Let $\bar\theta\in\Lambda^M$ be a deterministic vector such that $Q(\bar\theta)$ is small. Maurey’s argument [27] has been used extensively to prove the existence of a sparse vector $\tilde\theta\in\Lambda^M$ such that $Q(\tilde\theta)$ is of the same order as $Q(\bar\theta)$. Maurey’s argument uses the probabilistic method to prove the existence of such $\tilde\theta$. A sketch of this argument is as follows.

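Define the discrete set $\Lambda^M_m$ as

$$\Lambda^M_m\coloneqq\Big\{\frac{1}{m}\sum_{k=1}^m u_k,\ u_1,\dots,u_m\in\{e_1,\dots,e_M\}\Big\}, \tag{13}$$

where $(e_1,\dots,e_M)$ is the canonical basis in $\mathbb{R}^M$. The discrete set $\Lambda^M_m$ is a subset of the simplex $\Lambda^M$ that contains only $m$-sparse vectors.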

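Let $\Theta_1,\dots,\Theta_m$ be i.i.d. random variables valued in $\{e_1,\dots,e_M\}$ with distribution

$$\mathbb{P}(\Theta_k=e_j)=\bar\theta_j \quad\text{for all } k=1,\dots,m \text{ and all } j=1,\dots,M. \tag{14}$$

Next, consider the random variable

$$\hat\theta=\frac{1}{m}\sum_{k=1}^m\Theta_k. \tag{15}$$

The random variable $\hat\theta$ is valued in $\Lambda^M_m$ and is such that $\mathbb{E}_\Theta[\hat\theta]=\bar\theta$, where $\mathbb{E}_\Theta$ denotes the expectation with respect to $\Theta_1,\dots,\Theta_m$. Then a bias-variance decomposition yields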

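$$\mathbb{E}_\Theta[Q(\hat\theta)]\le Q(\bar\theta)+R^2/m, \tag{16}$$

where $R>0$ is a constant such that $\Sigma_{jj}\le R^2$ for all $j=1,\dots,M$. As $\min_{\theta\in\Lambda^M_m}Q(\theta)\le\mathbb{E}_\Theta[Q(\hat\theta)]$, this yields the existence of $\tilde\theta\in\Lambda^M_m$ such that

$$Q(\tilde\theta)\le Q(\bar\theta)+R^2/m. \tag{17}$$

If $m$ is chosen large enough, the two terms $Q(\bar\theta)$ and $R^2/m$ are of the same order and we have established the existence of an $m$-sparse vector $\tilde\theta$ so that $Q(\tilde\theta)$ is not substantially larger than $Q(\bar\theta)$.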
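The sampling step of Maurey's argument is easy to simulate. The following minimal sketch (function names are hypothetical, and the Gram-type matrix and Dirichlet weights are arbitrary illustrative choices) draws the sparse random vector of (15) and compares a Monte Carlo average of $Q$ with the exact bias-variance formula and with the bound $Q(\bar\theta)+R^2/m$.

```python
import numpy as np

def maurey_sample(theta_bar, m, rng):
    """Draw theta_hat = (1/m) * sum_k Theta_k with P(Theta_k = e_j) = theta_bar[j]."""
    M = len(theta_bar)
    idx = rng.choice(M, size=m, p=theta_bar)
    return np.bincount(idx, minlength=M) / m

def expected_Q(theta_bar, Sigma, m):
    """Exact E[Q(theta_hat)] from the bias-variance decomposition."""
    Q_bar = theta_bar @ Sigma @ theta_bar
    EQ_Theta1 = theta_bar @ np.diag(Sigma)  # E[Q(Theta_1)] = sum_j theta_bar_j * Sigma_jj
    return Q_bar + (EQ_Theta1 - Q_bar) / m

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    M, m = 50, 10
    A = rng.standard_normal((20, M))
    Sigma = A.T @ A / 20                     # illustrative positive semi-definite matrix
    theta_bar = rng.dirichlet(np.ones(M))    # illustrative point of the simplex
    R2 = np.diag(Sigma).max()
    mc = np.mean([t @ Sigma @ t for t in
                  (maurey_sample(theta_bar, m, rng) for _ in range(5000))])
    print("Monte Carlo :", mc)
    print("exact       :", expected_Q(theta_bar, Sigma, m))
    print("bound (16)  :", theta_bar @ Sigma @ theta_bar + R2 / m)
```

Each sampled vector has at most $m$ nonzero coordinates, so any draw whose value of $Q$ is below the average provides the sparse vector whose existence is asserted in (17).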

For our purpose, we need to refine this argument by controlling the deviation of the random variable $Q(\hat\theta)$. This is done in Lemma 3 below.

###### Lemma 3.

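Let $m\ge 1$ be an integer and define $\Lambda^M_m$ by (13). Let $F:\mathbb{R}^M\to\mathbb{R}$ be a convex function. For all $\theta\in\mathbb{R}^M$, let

$$Q(\theta)=\theta^T\Sigma\theta, \tag{18}$$

where $\Sigma$ is a positive semi-definite matrix of size $M\times M$. Assume that the diagonal elements of $\Sigma$ satisfy $\Sigma_{jj}\le R^2$ for all $j=1,\dots,M$. Then for all $t>0$,

$$\sup_{\theta\in\Lambda^M:\ Q(\theta)\le t^2}F(\theta)\le\int_1^{+\infty}\left[\max_{\theta\in\Lambda^M_m:\ Q(\theta)\le x(t^2+R^2/m)}F(\theta)\right]\frac{dx}{x^2}. \tag{19}$$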

In the next sections, it will be useful to bound from above the quantity $F(\theta)$ maximized over $\Lambda^M$ subject to the constraint $Q(\theta)\le t^2$. An interpretation of (19) is as follows. Consider the two optimization problems

$$\begin{aligned}
&\text{maximize } F(\theta) \text{ for } \theta\in\Lambda^M \text{ subject to } Q(\theta)\le t^2,\\
&\text{maximize } F(\theta) \text{ for } \theta\in\Lambda^M_m \text{ subject to } Q(\theta)\le Y(t^2+R^2/m),
\end{aligned}$$

for some $Y\ge 1$. Inequality (19) says that the optimal value of the first optimization problem is smaller than the optimal value of the second optimization problem averaged over the distribution of $Y$ given by the density $x\mapsto 1/x^2$ on $[1,+\infty)$. The second optimization problem above is over the discrete set $\Lambda^M_m$ with the relaxed constraint $Q(\theta)\le Y(t^2+R^2/m)$, hence we have relaxed the constraint in exchange for discreteness. The discreteness of the set $\Lambda^M_m$ will be used in the next subsection for the proof of Proposition 1.

###### Proof of Lemma 3.

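The set $\{\theta\in\Lambda^M: Q(\theta)\le t^2\}$ is compact. The function $F$ is convex with domain $\mathbb{R}^M$ and thus continuous. Hence the supremum in the left hand side of (19) is achieved at some $\bar\theta\in\Lambda^M$ such that $Q(\bar\theta)\le t^2$. Let $\Theta_1,\dots,\Theta_m$ and $\hat\theta$ be the random variables defined in (14) and (15) above. Denote by $\mathbb{E}_\Theta$ the expectation with respect to $\Theta_1,\dots,\Theta_m$. By definition, $\hat\theta\in\Lambda^M_m$ and $\mathbb{E}_\Theta[\hat\theta]=\bar\theta$. Let . A bias-variance decomposition and the independence of $\Theta_1,\dots,\Theta_m$ yield

$$\begin{aligned}
E\coloneqq\mathbb{E}_\Theta[Q(\hat\theta)] &= Q(\bar\theta)+\mathbb{E}_\Theta\big[(\hat\theta-\bar\theta)^T\Sigma(\hat\theta-\bar\theta)\big],\\
&= Q(\bar\theta)+\frac{1}{m}\mathbb{E}_\Theta\big[(\Theta_1-\bar\theta)^T\Sigma(\Theta_1-\bar\theta)\big].
\end{aligned}$$

Another bias-variance decomposition yields

$$\mathbb{E}_\Theta\big[(\Theta_1-\bar\theta)^T\Sigma(\Theta_1-\bar\theta)\big]=\mathbb{E}_\Theta[Q(\Theta_1)]-Q(\bar\theta)\le\mathbb{E}_\Theta[Q(\Theta_1)]\le R^2, \tag{20}$$

where we used that $Q(\bar\theta)\ge 0$ and that $Q(\Theta_1)\le R^2$ almost surely. Thus

$$E=\mathbb{E}_\Theta[Q(\hat\theta)]\le Q(\bar\theta)+R^2/m\le t^2+R^2/m. \tag{21}$$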

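Define the random variable $X=Q(\hat\theta)/E$, which is nonnegative and satisfies $\mathbb{E}_\Theta[X]=1$. By Markov's inequality, it holds that $\mathbb{P}(X>t)\le 1/t$ for all $t>0$. Define the random variable $Y$ by the density function $x\mapsto 1/x^2$ on $[1,+\infty)$. Then we have $\mathbb{P}(X>t)\le\mathbb{P}(Y>t)$ for any $t>0$, so by stochastic dominance, there exists a rich enough probability space $\Omega$ and random variables $\tilde X$ and $\tilde Y$ defined on $\Omega$ such that $\tilde X$ and $X$ have the same distribution, $\tilde Y$ and $Y$ have the same distribution, and $\tilde X\le\tilde Y$ almost surely on $\Omega$ (see for instance Theorem 7.1 in [12]). Denote by $\mathbb{E}_\Omega$ the expectation sign on the probability space $\Omega$.

By definition of $\bar\theta$ and $\hat\theta$, using Jensen’s inequality, Fubini’s Theorem and the fact that $\hat\theta\in\Lambda^M_m$ we have

$$\sup_{\theta\in\Lambda^M:\ Q(\theta)\le t^2}F(\theta)=F(\bar\theta)=F(\mathbb{E}_\Theta[\hat\theta])\le\mathbb{E}_\Theta[F(\hat\theta)]\le\mathbb{E}_\Theta\big[g(Q(\hat\theta)/E)\big]$$

where $g(\cdot)$ is the nondecreasing function $g(x)=\max_{\theta\in\Lambda^M_m:\ Q(\theta)\le xE}F(\theta)$. The right hand side of the previous display is equal to $\mathbb{E}_\Theta[g(X)]$. Next, we use the random variables $\tilde X$ and $\tilde Y$ as follows:

$$\mathbb{E}_\Theta[g(X)]=\mathbb{E}_\Omega[g(\tilde X)]\le\mathbb{E}_\Omega[g(\tilde Y)]=\int_1^{+\infty}\frac{g(x)}{x^2}\,dx. \tag{22}$$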

Combining the previous display and (21) completes the proof. ∎
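The stochastic dominance step behind (22) can be illustrated with a quantile coupling. In the sketch below, the nonnegative mean-one variable is taken to be a standard exponential purely for illustration (an assumption, not the variable of the proof), and both variables are generated from the same uniform draw so that the ordering holds pointwise.

```python
import numpy as np

def coupled_pair(size, rng):
    """Quantile coupling of an Exp(1) variable with a variable of density 1/x^2 on [1, inf)."""
    v = rng.random(size)            # uniform on [0, 1)
    x_tilde = -np.log1p(-v)         # Exp(1) quantile function: nonnegative, mean 1
    y_tilde = 1.0 / (1.0 - v)       # quantile function of the density x -> 1/x^2
    return x_tilde, y_tilde

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, y = coupled_pair(200_000, rng)
    print("ordering holds pointwise:", bool(np.all(x <= y)))
    for t in [2.0, 4.0, 8.0]:
        # Markov gives P(X > t) <= 1/t, and 1/t is exactly P(Y > t).
        print(t, np.mean(x > t), np.mean(y > t), 1.0 / t)
```

Since $g$ is nondecreasing, the pointwise ordering of the coupled pair immediately gives the middle inequality in (22).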

We are now ready to prove Proposition 1. The main ingredients are Lemma 3 and the following upper bound on the cardinality of $\Lambda^M_m$:

$$\log|\Lambda^M_m|\le m\log\left(\frac{2eM}{m}\right). \tag{23}$$

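A quick numerical sanity check of the cardinality of the discrete set (13) follows. It assumes, consistently with the way the bound is used in (26), that the estimate takes the form $\log|\Lambda^M_m|\le m\log(2eM/m)$ for $1\le m\le M$; the function names are hypothetical.

```python
from itertools import combinations_with_replacement
from math import comb, e, log

def grid_size_by_enumeration(M, m):
    """Count the points of (13): each point is a multiset of m basis vectors out of M."""
    return sum(1 for _ in combinations_with_replacement(range(M), m))

def log_card_bound(M, m):
    """Assumed bound m * log(2eM/m) on the log-cardinality, for 1 <= m <= M."""
    return m * log(2 * e * M / m)

if __name__ == "__main__":
    for M, m in [(5, 3), (10, 4), (12, 12)]:
        exact = comb(M + m - 1, m)  # number of multisets of size m from M elements
        print(M, m, exact, log(exact) <= log_card_bound(M, m))
```

The exact count is the number of multisets, $\binom{M+m-1}{m}$, and the bound follows from $\binom{N}{m}\le(eN/m)^m$ with $N=M+m-1\le 2M$.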
###### Proof of (8).

If then by (7) we have , hence (8) holds. Thus it is enough to focus on the case .

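Let and set , which satisfies . As $T$ is the convex hull of $M$ points, let $\mu_1,\dots,\mu_M\in\mathbb{R}^n$ be such that

$$T=\text{convex hull of }\{\mu_1,\dots,\mu_M\}=\{\mu_\theta,\ \theta\in\Lambda^M\}, \tag{24}$$

where $\mu_\theta=\sum_{j=1}^M\theta_j\mu_j$ for $\theta\in\mathbb{R}^M$.

Let $Q(\theta)=|\mu_\theta|_2^2$ for all $\theta\in\mathbb{R}^M$. This is a polynomial of order $2$, of the form $Q(\theta)=\theta^T\Sigma\theta$, where $\Sigma$ is the Gram matrix with $\Sigma_{jk}=\mu_j^T\mu_k$ for all $j,k=1,\dots,M$. As we assume that $T\subset B_2$, the diagonal elements of $\Sigma$ satisfy $\Sigma_{jj}\le 1$. For all $\theta\in\mathbb{R}^M$, let $F(\theta)=g^T\mu_\theta$. Applying Lemma 3 with the above notation, $R=1$, and $t=r$, we obtain

$$\mathbb{E}\sup_{\theta\in\Lambda^M:\ Q(\theta)\le r^2}g^T\mu_\theta\le\mathbb{E}\int_1^{+\infty}\left[\max_{\theta\in\Lambda^M_m:\ Q(\theta)\le x(r^2+1/m)}F(\theta)\right]\frac{dx}{x^2}. \tag{25}$$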

By definition of , so that . Using Fubini Theorem and a bound on the expectation of the maximum of centered Gaussian random variables with variances bounded from above by , we obtain that the right hand side of the previous display is bounded from above by

$$\int_1^{+\infty}\frac{1}{x^2}\sqrt{\frac{4x\log|\Lambda^M_m|}{m}}\,dx\le\sqrt{\log(2eM/m)}\int_1^{+\infty}\frac{2}{x^{3/2}}\,dx, \tag{26}$$

where we used the bound (23). To complete the proof of (8), notice that we have and .

Numerous works have established a close relationship between localized Gaussian widths and the performance of statistical and compressed sensing procedures. Some of these works are reviewed below.

• In a regression problem with random design where the design and the target are subgaussian, Lecué and Mendelson [21] established that two quantities govern the performance of the empirical risk minimizer over a convex class. These two quantities are defined using the Gaussian width of the class intersected with an $L_2$ ball [21, Definition 1.3].

• If are such that and . Gordon et al. [14] provide precise estimates of where is the unit ball and is the ball of radius . These estimates are then used to solve the approximate reconstruction problem where one wants to recover an unknown high dimensional vector from a few random measurements [14, Section 7].

• Plan et al. [28] show that in the semiparametric single index model, if the signal is known to belong to some star-shaped set, then the Gaussian width of this set and its localized version characterize the gain obtained by using the additional information that the signal belongs to the set, cf. Theorem 1.3 in [28].

• Finally, Chatterjee [9] exhibits connections between localized Gaussian widths and shape-constrained estimation.

These results are reminiscent of the isomorphic method [17, 3, 2], where localized expected suprema of empirical processes are used to obtain upper bounds on the performance of Empirical Risk Minimization (ERM) procedures. These results show that Gaussian width estimates are important to understand the statistical properties of estimators in many statistical contexts.

In Proposition 1, we established an upper bound on the Gaussian width of $M$-convex hulls. We now provide some statistical applications of this result in regression with fixed design. We will use the following theorem from [7].

###### Theorem 4 ([7]).

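Let $K$ be a closed convex subset of $\mathbb{R}^n$ and $\xi\sim N(0,\sigma^2I_{n\times n})$. Let $f_0\in\mathbb{R}^n$ be an unknown vector and let $y=f_0+\xi$. Denote by $f_0^*$ the projection of $f_0$ onto $K$. Assume that for some $t_*>0$,

$$\frac{1}{n}\mathbb{E}\left[\sup_{u\in K:\ \frac{1}{n}|f_0^*-u|_2^2\le t_*^2}\xi^T(u-f_0^*)\right]\le\frac{t_*^2}{2}. \tag{27}$$

Then for any $x>0$, with probability greater than $1-e^{-x}$, the Least Squares estimator $\hat f=\operatorname{argmin}_{f\in K}|y-f|_2^2$ satisfies

$$\frac{1}{n}|\hat f-f_0|_2^2\le\frac{1}{n}|f_0^*-f_0|_2^2+2t_*^2+\frac{4\sigma^2x}{n}. \tag{28}$$

Hence, to prove an oracle inequality of the form (28), it is enough to prove the existence of a quantity $t_*$ such that (27) holds. If the convex set $K$ in the above theorem is the convex hull of $M$ points, then a quantity $t_*$ is given by the following proposition.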

###### Proposition 5.

Let and . Let such that for all . For all , let . Let be a centered Gaussian random variable with covariance matrix . If then the quantity

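$$t_*^2=31\,\sigma R\sqrt{\frac{\log\left(\frac{eM\sigma}{R\sqrt{n}}\right)}{n}}\quad\text{satisfies}\quad\frac{1}{n}\mathbb{E}\sup_{\theta\in\Lambda^M:\ \frac{1}{n}|\mu_\theta|_2^2\le t_*^2}g^T\mu_\theta\le\frac{t_*^2}{2}, \tag{29}$$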

provided that .

###### Proof.

Inequality

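$$\frac{1}{\sqrt{n}}\mathbb{E}\sup_{\theta\in\Lambda^M:\ \frac{1}{n}|\mu_\theta|_2^2\le r^2}(\sigma g)^T\mu_\theta\le 4\sigma R\sqrt{\log\big(4eM\min(1,r^2/R^2)\big)} \tag{30}$$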

is a reformulation of Proposition 1 using the notation of Proposition 5. Thus, in order to prove (29), it is enough to establish that for we have

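$$(*)\coloneqq 64\log\left(\frac{4eM\sigma\gamma\sqrt{\log\big(eM\sigma/(R\sqrt{n})\big)}}{R\sqrt{n}}\right)\le\frac{\gamma^2}{4}\log\left(\frac{eM\sigma}{R\sqrt{n}}\right). \tag{31}$$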

As and for all , the left hand side of the previous display satisfies

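$$\begin{aligned}
(*)&\le 64\left(\log\left(\frac{eM\sigma}{R\sqrt{n}}\right)+\log(4\gamma)+\frac{1}{2}\log\log\left(\frac{eM\sigma}{R\sqrt{n}}\right)\right),\\
&\le 64\big(3/2+\log(4\gamma)\big)\log\left(\frac{eM\sigma}{R\sqrt{n}}\right).
\end{aligned}$$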

Thus (31) holds if , which is the case if the absolute constant is . ∎

Inequality (29) establishes the existence of a quantity $t_*$ such that

$$\frac{1}{n}\mathbb{E}\sup_{\mu\in T:\ \frac{1}{n}|\mu|_2^2\le t_*^2}g^T\mu\le\frac{t_*^2}{2}, \tag{32}$$

where $T$ is the convex hull of $\mu_1,\dots,\mu_M$. Consequences of (32) and Theorem 4 are given in the next subsections.

We now introduce two statistical frameworks where the localized Gaussian width of an $M$-convex hull has applications: the Lasso estimator in high-dimensional statistics and the convex aggregation problem.

Let $f_0\in\mathbb{R}^n$ be an unknown regression vector and let $y=f_0+\xi$ be an observed random vector, where $\xi$ satisfies $\mathbb{E}[\xi]=0$. Let $M\ge 2$ and let $f_1,\dots,f_M$ be deterministic vectors in $\mathbb{R}^n$. The set $\{f_1,\dots,f_M\}$ will be referred to as the dictionary. For any $\theta\in\mathbb{R}^M$, let $f_\theta=\sum_{j=1}^M\theta_jf_j$. If a set $\Theta\subset\mathbb{R}^M$ is given, the goal of the aggregation problem induced by $\Theta$ is to find an estimator $\hat f$ constructed with $y$ and the dictionary such that

$$\frac{1}{n}|\hat f-f_0|_2^2\le\inf_{\theta\in\Theta}\left(\frac{1}{n}|f_\theta-f_0|_2^2\right)+\delta_{n,M,\Theta}, \tag{33}$$

either in expectation or with high probability, where $\delta_{n,M,\Theta}$ is a small quantity. Inequality (33) is called a sharp oracle inequality, where "sharp" means that in the right hand side of (33), the multiplicative constant of the term $\inf_{\theta\in\Theta}\frac{1}{n}|f_\theta-f_0|_2^2$ is $1$. Similar notation will be defined for regression with random design and density estimation. Define the simplex in $\mathbb{R}^M$ by (11). The following aggregation problems were introduced in [26, 34].

• Model Selection type aggregation with $\Theta=\{e_1,\dots,e_M\}$, i.e., $\Theta$ is the canonical basis of $\mathbb{R}^M$. The goal is to construct an estimator whose risk is as close as possible to the best function in the dictionary. Such results can be found in [34, 22, 1] for random design regression, in [23, 10, 5, 11] for fixed design regression, and in [16, 6] for density estimation.

• Convex aggregation with $\Theta=\Lambda^M$, i.e., $\Theta$ is the simplex in $\mathbb{R}^M$. The goal is to construct an estimator whose risk is as close as possible to the best convex combination of the dictionary functions. See [34, 20, 19, 33] for results of this type in the regression framework and [29] for such results in density estimation.

• Linear aggregation with $\Theta=\mathbb{R}^M$. The goal is to construct an estimator whose risk is as close as possible to the best linear combination of the dictionary functions, cf. [34, 33] for such results in regression and [29] for such results in density estimation.

One may also define the Sparse or Sparse Convex aggregation problems: construct an estimator whose risk is as close as possible to the best sparse combination of the dictionary functions. Such results can be found in [31, 30, 33] for fixed design regression and in [24] for regression with random design. These problems are out of the scope of the present paper.

A goal of the present paper is to provide a unified argument that shows that empirical risk minimization is optimal for the convex aggregation problem in density estimation, regression with fixed design and regression with random design.

###### Theorem 6.

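Let $f_0\in\mathbb{R}^n$, let $\xi\sim N(0,\sigma^2I_{n\times n})$ and define $y=f_0+\xi$. Let $f_1,\dots,f_M\in\mathbb{R}^n$ and let $f_\theta=\sum_{j=1}^M\theta_jf_j$ for all $\theta\in\mathbb{R}^M$. Let

$$\hat\theta\in\operatorname*{argmin}_{\theta\in\Lambda^M}|f_\theta-y|_2^2. \tag{34}$$

Then for all $x>0$, with probability greater than $1-e^{-x}$,

$$\frac{1}{n}|f_{\hat\theta}-f_0|_2^2\le\min_{\theta\in\Lambda^M}\frac{1}{n}|f_\theta-f_0|_2^2+2t_*^2+\frac{4\sigma^2x}{n}, \tag{35}$$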

where and .

###### Proof of Theorem 6.

Let be the linear span of and let be the orthogonal projector onto . If , then

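$$\frac{1}{n}\mathbb{E}\sup_{v\in V:\ \frac{1}{n}|v|_2^2\le t_*^2}\xi^Tv=\sqrt{\frac{t_*^2}{n}}\,\mathbb{E}|P\xi|_2\le\sqrt{\frac{t_*^2}{n}}\sqrt{\mathbb{E}|P\xi|_2^2}=\sqrt{\frac{t_*^2\sigma^2M}{n}}=t_*^2/2. \tag{36}$$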

Let be the convex hull of . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points, and for all , . By (36) and (29), the quantity satisfies (27). Applying Theorem 4 completes the proof. ∎
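The least squares problem (34) over the simplex can be solved numerically in several ways. The following is a minimal sketch using the Frank-Wolfe method with exact line search; this is an illustrative solver choice rather than anything prescribed by the paper, and the function name and the dictionary matrix `F` (columns $f_1,\dots,f_M$) are hypothetical.

```python
import numpy as np

def convex_aggregate(F, y, n_iter=500):
    """Approximately minimize |F @ theta - y|_2^2 over the simplex by Frank-Wolfe.

    F is an n x M matrix whose columns are the dictionary vectors.
    """
    n, M = F.shape
    theta = np.full(M, 1.0 / M)           # start from the uniform weights
    for _ in range(n_iter):
        resid = F @ theta - y
        grad = F.T @ resid                # gradient of the objective, up to a factor 2
        j = int(np.argmin(grad))          # best vertex e_j for the linearized problem
        d = -theta.copy()
        d[j] += 1.0                       # direction e_j - theta
        Fd = F @ d
        denom = Fd @ Fd
        if denom == 0.0:
            break
        # Exact line search for a quadratic, restricted to [0, 1] to stay in the simplex.
        gamma = float(np.clip(-(resid @ Fd) / denom, 0.0, 1.0))
        theta = theta + gamma * d
    return theta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    F = rng.standard_normal((100, 20))
    theta0 = rng.dirichlet(np.ones(20))
    y = F @ theta0 + 0.5 * rng.standard_normal(100)
    theta_hat = convex_aggregate(F, y)
    print(theta_hat.sum(), theta_hat.min(), np.sum((F @ theta_hat - y) ** 2))
```

Each iterate is a convex combination of the previous iterate and a vertex, so feasibility is preserved without any projection step.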

We consider the following regression model. Let and assume that for all . We will refer to as the covariates. Let be the matrix of dimension with columns . We observe

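$$y=f_0+\xi,\qquad\xi\sim N(0,\sigma^2I_{n\times n}), \tag{37}$$

where $f_0\in\mathbb{R}^n$ is an unknown mean. The goal is to estimate $f_0$ using the design matrix $X$.

Let $R>0$ be a tuning parameter and define the constrained Lasso estimator [32] by

$$\hat\beta\in\operatorname*{argmin}_{\beta\in\mathbb{R}^M:\ |\beta|_1\le R}|y-X\beta|_2^2. \tag{38}$$

Our goal will be to study the performance of the estimator (38) with respect to the prediction loss

$$\frac{1}{n}|f_0-X\hat\beta|_2^2. \tag{39}$$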
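For readers who want to experiment with the constrained estimator (38), here is a minimal sketch based on projected gradient descent with a sort-based Euclidean projection onto the $\ell_1$-ball. The solver is an illustrative assumption and not a method discussed in the paper; the function names are hypothetical.

```python
import numpy as np

def project_l1_ball(v, R):
    """Euclidean projection of v onto {b : |b|_1 <= R}, R > 0 (sort-based method)."""
    if np.abs(v).sum() <= R:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    k = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - R) / k > 0)[0][-1]
    tau = (css[rho] - R) / (rho + 1)      # soft-thresholding level
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def constrained_lasso(X, y, R, n_iter=2000):
    """Approximately minimize |y - X b|_2^2 subject to |b|_1 <= R."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2   # 1/L for the gradient of 0.5 * |y - X b|^2
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        beta = project_l1_ball(beta - step * grad, R)
    return beta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, M, R = 50, 200, 2.0
    X = rng.standard_normal((n, M))
    beta_true = np.zeros(M)
    beta_true[:2] = 1.0
    f0 = X @ beta_true
    y = f0 + rng.standard_normal(n)
    beta_hat = constrained_lasso(X, y, R)
    print("l1 norm:", np.abs(beta_hat).sum())
    print("prediction loss (39):", np.sum((f0 - X @ beta_hat) ** 2) / n)
```

With step size $1/L$ the objective is nonincreasing along the iterates, and every iterate satisfies the $\ell_1$ constraint.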

Let and assume that for all . Let be the matrix of dimension with columns .

###### Theorem 7.

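Let $R>0$ be a tuning parameter and consider the regression model (37). Define the Lasso estimator $\hat\beta$ by (38). Then for all $x>0$, with probability greater than $1-e^{-x}$,

$$\frac{1}{n}|X\hat\beta-f_0|_2^2\le\min_{\beta\in\mathbb{R}^M:\ |\beta|_1\le R}\frac{1}{n}|X\beta-f_0|_2^2+2t_*^2+\frac{4\sigma^2x}{n}, \tag{40}$$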

where .

###### Proof of Theorem 7.

Let be the linear span of and let be the orthogonal projector onto . If , then

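$$\frac{1}{n}\mathbb{E}\sup_{v\in V:\ \frac{1}{n}|v|_2^2\le t_*^2}\xi^Tv=\sqrt{\frac{t_*^2}{n}}\,\mathbb{E}|P\xi|_2\le\sqrt{\frac{t_*^2}{n}}\sqrt{\mathbb{E}|P\xi|_2^2}=\sqrt{\frac{t_*^2\sigma^2\operatorname{rank}(X)}{n}}=t_*^2/2. \tag{41}$$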

Let be the convex hull of , so that . Let be the convex projection of onto . We apply Proposition 5 to which is a convex hull of points of empirical norm less or equal to . By (41) and (29), the quantity satisfies (27). Applying Theorem 4 completes the proof. ∎

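The lower bound [30, Theorem 5.4 and (5.25)] states that there exists an absolute constant $C_0>0$ such that the following holds. If , then there exists a design matrix $X$ such that for every estimator $\hat f$,

$$\sup_{\beta\in\mathbb{R}^M:\ |\beta|_1\le R}\frac{1}{n}\mathbb{E}_{X\beta}|X\beta-\hat f|_2^2\ge\frac{1}{C_0}\min\left(\frac{\sigma^2\operatorname{rank}(X)}{n},\ \sigma R\sqrt{\frac{\log\left(1+\frac{eM\sigma}{R\sqrt{n}}\right)}{n}}\right), \tag{42}$$

where for all $f_0\in\mathbb{R}^n$, $\mathbb{E}_{f_0}$ denotes the expectation with respect to the distribution of $y\sim N(f_0,\sigma^2I_{n\times n})$. Thus, Theorem 7 shows that the Least Squares estimator over the set $\{X\beta:\ |\beta|_1\le R\}$ is minimax optimal. In particular, the right hand side of inequality (40) cannot be improved.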
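The two regimes of the lower bound (42) can be made concrete by evaluating the minimum numerically. The sketch below computes only the $\min(\cdot,\cdot)$ term, since the constant $C_0$ is unspecified; the function name and the parameter values are illustrative assumptions.

```python
import math

def minimax_rate_42(n, M, rank_X, sigma, R):
    """The min(...) term of (42), without the unspecified factor 1/C_0."""
    parametric = sigma ** 2 * rank_X / n
    sparse = sigma * R * math.sqrt(
        math.log(1.0 + math.e * M * sigma / (R * math.sqrt(n))) / n
    )
    return min(parametric, sparse)

if __name__ == "__main__":
    # Small l1 radius: the sqrt(log M / n) type term is active.
    print(minimax_rate_42(n=100, M=1000, rank_X=100, sigma=1.0, R=1.0))
    # Large l1 radius: the parametric term sigma^2 * rank(X) / n is active.
    print(minimax_rate_42(n=100, M=1000, rank_X=100, sigma=1.0, R=100.0))
```

Increasing $R$ moves the rate from the logarithmic regime to the parametric regime $\sigma^2\operatorname{rank}(X)/n$, at which point the $\ell_1$ constraint no longer helps.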

Consider iid observations $(X_i,Y_i)_{i=1,\dots,n}$ where the $Y_i$ are real valued and the $X_i$ are design random variables in $\mathbb{R}^M$ with $\mathbb{E}[X_iX_i^T]=\Sigma$ for some covariance matrix $\Sigma$. We consider the learning problem over the function class

 (43)

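for a given constant $R>0$. We consider the Empirical Risk Minimizer defined by

$$\hat\beta=\operatorname*{argmin}_{\beta\in\mathbb{R}^M:\ |\beta|_1\le R}\sum_{i=1}^n(Y_i-\beta^TX_i)^2. \tag{44}$$

This problem is sometimes referred to as the persistence problem or the persistence framework [15, 4]. The prediction risk of $f_{\hat\beta}$ is given by

$$R(f_{\hat\beta})=\mathbb{E}\big[(f_{\hat\beta}(X)-Y)^2\,\big|\,(X_i,Y_i)_{i=1,\dots,n}\big], \tag{45}$$

where $(X,Y)$ is a new observation distributed as $(X_1,Y_1)$ and independent from the data $(X_i,Y_i)_{i=1,\dots,n}$. Define also the oracle $\beta^*$ by

$$\beta^*=\operatorname*{argmin}_{\beta\in\mathbb{R}^M:\ |\beta|_1\le R}R(f_\beta) \tag{46}$$

and define $\sigma>0$ by

$$\sigma=\|Y-X^T\beta^*\|_{\psi_2}, \tag{47}$$

where $\|\cdot\|_{\psi_2}$ denotes the subgaussian norm (see Section 5.2.3 in [35] for equivalent definitions of this norm).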

To analyse the above learning problem, we use the machinery developed by Lecué and Mendelson [21] to study learning problems over subgaussian classes. Consider the two quantities

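$$\begin{aligned}
r_n(\gamma)&=\inf\Big\{r>0:\ \mathbb{E}\sup_{\beta:\ |\beta|_1\le 2R,\ \mathbb{E}[(G^T\beta)^2]\le r^2}\beta^TG\le\gamma r\sqrt{n}\Big\},\\
s_n(\gamma)&=\inf\Big\{s>0:\ \mathbb{E}\sup_{\beta:\ |\beta|_1\le 2R,\ \mathbb{E}[(G^T\beta)^2]\le s^2}\beta^TG\le\gamma s^2\sqrt{n}/\sigma\Big\},
\end{aligned}$$

where $G\sim N(0,\Sigma)$. In the present setting, Theorem A from Lecué and Mendelson [21] reads as follows.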

###### Theorem 8 (Theorem A in Lecué and Mendelson [21]).

There exist absolute constants such that the following holds. Let . Consider iid observations with . Assume that the design random vectors are subgaussian with respect to the covariance matrix $\Sigma$ in the sense that for any . Define $\beta^*$ by (46) and $\sigma$ by (47). Assume that the diagonal elements of $\Sigma$ are no larger than 1. Then, there exist absolute constants such that the estimator $\hat\beta$ defined in (44) satisfies

$$R(f_{\hat\beta})\le R(f_{\beta^*})+\max\big(s_n^2(c_1),\ r_n^2(c_2)\big), \tag{48}$$

with probability at least .

In the isotropic case ($\Sigma=I_{M\times M}$),