Sparse estimation via $\ell_p$ optimization methods in high-dimensional linear regression
Abstract
In this paper, we discuss the statistical properties of the $\ell_p$ optimization methods ($0<p<1$), including the $\ell_p$ minimization method and the $\ell_p$ regularization method, for estimating a sparse parameter from noisy observations in high-dimensional linear regression with either a deterministic or random design. For this purpose, we introduce a general $\ell_p$ restricted eigenvalue condition (REC) and provide its sufficient conditions in terms of several widely used regularity conditions, such as the sparse eigenvalue condition, the restricted isometry property, and the mutual incoherence property. By virtue of the $\ell_p$ REC, we exhibit the stable recovery property of the $\ell_p$ optimization methods for either deterministic or random designs by showing that the recovery bound for the $\ell_p$ minimization method, and the oracle inequality and recovery bound for the $\ell_p$ regularization method, hold with high probability. The results in this paper are non-asymptotic and only assume the weak $\ell_p$ REC. Preliminary numerical results verify the established statistical properties and demonstrate the advantages of the $\ell_p$ regularization method over some existing sparse optimization methods.
Keywords: sparse estimation, lower-order optimization method, restricted eigenvalue condition, recovery bound, oracle property
1 Introduction
In various areas of applied sciences and engineering, a fundamental problem is to estimate an unknown parameter of a linear regression model
(1) $y = X\beta^* + \varepsilon$,
where $X \in \mathbb{R}^{n \times d}$ is a design matrix, $\varepsilon \in \mathbb{R}^{n}$ is a vector containing random measurement noise, and thus $y \in \mathbb{R}^{n}$ is the corresponding vector of noisy observations. According to the context of practical applications, the design matrix could be either deterministic or random.
The curse of dimensionality always occurs in the high-dimensional regime of many application fields. For example, in magnetic resonance imaging [9], remote sensing [2], and systems biology [33], one is typically only able to collect far fewer samples than the number of variables due to physical or economic constraints, i.e., $n \ll d$. In this high-dimensional scenario, estimating the true underlying parameter of model (1) is a vital challenge in contemporary statistics, and the classical ordinary least squares (OLS) method does not work well because the corresponding linear system is seriously ill-conditioned.
1.1 $\ell_0$ and $\ell_1$ Optimization Problems
Fortunately, in practical applications, a wide class of problems usually possess certain special structures, exploiting which can eliminate the non-identifiability of model (1) and enhance predictability. One of the most popular structures is sparsity, that is, the underlying parameter in the high-dimensional space is sparse. One common way to measure the degree of sparsity is the $\ell_p$ norm, which for $p \in (0, 1]$ is defined as

$$\|\beta\|_p := \Big(\sum_{i=1}^{d} |\beta_i|^p\Big)^{1/p},$$

while $\|\beta\|_0$ is defined as the number of nonzero entries of $\beta$. We first review the literature on sparse estimation for the case when the design matrix is deterministic. In the presence of bounded noise (i.e., $\|\varepsilon\|_2 \le \epsilon$), in order to find the sparsest solution, Donoho et al. [18] proposed the following (constrained) $\ell_0$ minimization problem:

$$\min_{\beta \in \mathbb{R}^d} \|\beta\|_0 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \epsilon.$$
Unfortunately, it is NP-hard to compute its global solution due to its nonconvex and combinatorial nature [31].
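Since the $\ell_0$ and $\ell_p$ penalties recur throughout the paper, a small numerical illustration may help fix ideas. The following sketch (the helper names are ours, not the paper's) computes $\|\beta\|_0$ and $\|\beta\|_p^p$ for a vector:

```python
import numpy as np

def l0_norm(beta, tol=1e-12):
    """||beta||_0: the number of entries with magnitude above tol."""
    return int(np.sum(np.abs(beta) > tol))

def lp_norm_p(beta, p):
    """||beta||_p^p = sum_i |beta_i|^p; a quasi-norm power for 0 < p < 1."""
    return float(np.sum(np.abs(beta) ** p))

beta = np.array([3.0, 0.0, -4.0, 0.0])
print(l0_norm(beta))         # 2
print(lp_norm_p(beta, 1.0))  # 7.0 (reduces to the l1 norm at p = 1)
```

Note how $p=1$ recovers the $\ell_1$ norm, while smaller $p$ weighs small nonzero entries more heavily, pushing the penalty toward a count of nonzeros.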
To overcome this obstacle, a common technique is to use the (convex) $\ell_1$ norm to approximate the $\ell_0$ norm:

$$\min_{\beta \in \mathbb{R}^d} \|\beta\|_1 \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \epsilon,$$
which can be efficiently solved by several standard methods; see [14, 21] and the references therein. The stable statistical properties of the $\ell_1$ minimization problem have been explored under various regularity conditions. One of the most important is the recovery bound property, which estimates the upper bound of the error between the optimal solution of the optimization problem and the true underlying parameter in terms of the noise level $\epsilon$. More specifically, let $\beta^*$ be an $s$-sparse parameter (i.e., $\|\beta^*\|_0 \le s$) satisfying the linear regression model (1). The recovery bound for the $\ell_1$ minimization problem was provided in [18] and [9] under the mutual incoherence property (MIP) or the restricted isometry property (RIP), respectively (it was claimed in [7] that the RIP [10] is implied by the MIP [19], while the restricted isometry constant (RIC) is more difficult to calculate than the mutual incoherence constant (MIC)):

$$\|\hat\beta_{\ell_1} - \beta^*\|_2 \le C\,\epsilon \quad\text{for some constant } C > 0 \text{ depending on the design matrix},$$
where $\hat\beta_{\ell_1}$ stands for the optimal solution of the $\ell_1$ minimization problem.
In some applications, the amplitude of the noise is difficult to estimate, and the study of constrained sparse optimization models in such settings is underdeveloped. In these situations, the regularization technique, which avoids noise estimation by introducing a regularization parameter, has been widely used in statistics and machine learning. Specifically, one solves the (unconstrained) $\ell_1$ regularization problem:

$$\min_{\beta \in \mathbb{R}^d} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_1,$$
where $\lambda > 0$ is the regularization parameter, providing a trade-off between data fidelity and sparsity. The $\ell_1$ regularization model, also named the Lasso estimator [40], has attracted a great deal of attention for parameter estimation in the high-dimensional scenario, because its convex structure is beneficial for designing efficient algorithms and has enabled wide applications; see [3, 15] and the references therein. For the noise-free case, the recovery bound for the $\ell_1$ regularization problem was provided in [42] under the RIP or the restricted eigenvalue condition (REC) (it was reported in [4] that the REC is implied by the RIP, and in [35] that a broad class of correlated Gaussian design matrices satisfy the REC but violate the RIP with high probability):

$$\|\hat\beta_{\lambda} - \beta^*\|_2 \le C\sqrt{s}\,\lambda,$$
where $\hat\beta_{\lambda}$ denotes the optimal solution of the $\ell_1$ regularization problem. Furthermore, assuming that the noise in model (1) is normally distributed, $\varepsilon \sim N(0, \sigma^2 I_n)$, it was established in [4, 6, 47] that the following recovery bound holds with high probability:

$$\|\hat\beta_{\lambda} - \beta^*\|_2 \le C\,\sigma\sqrt{\frac{s\log d}{n}},$$
when the regularization parameter $\lambda$ is suitably chosen and under the RIP, REC or other regularity conditions, respectively. However, the $\ell_1$ minimization and $\ell_1$ regularization problems have several shortcomings in both theory and practice. In particular, extensive theoretical and empirical studies have reported that they suffer from significant estimation bias when parameters have large absolute values; that the induced solutions are much less sparse than the true parameter; that they cannot recover a sparse signal from the fewest possible samples when applied to compressed sensing; and that they often result in suboptimal sparsity in practice; see, e.g., [12, 20, 44, 43, 48]. Therefore, there is a great demand for alternative sparse estimation techniques that enjoy nice statistical theory and successful applications.
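As a concrete illustration of the $\ell_1$ regularization (Lasso) problem just discussed, the following sketch solves it with the standard iterative soft-thresholding algorithm (ISTA). The function names, the synthetic data, and the parameter choices are ours for illustration; this is not the solver used in the paper:

```python
import numpy as np

def soft_threshold(z, t):
    """Proximal operator of t*||.||_1 (entrywise soft-thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, lam, n_iter=3000):
    """Minimize 0.5*||y - X b||_2^2 + lam*||b||_1 by ISTA:
    a gradient step on the quadratic term followed by soft-thresholding."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2  # 1 / Lipschitz constant of the gradient
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        b = soft_threshold(b - step * (X.T @ (X @ b - y)), step * lam)
    return b

# tiny synthetic check: noiseless observations of a 3-sparse parameter
rng = np.random.default_rng(0)
n, d = 100, 40
X = rng.standard_normal((n, d))
beta = np.zeros(d)
beta[[3, 10, 25]] = [2.0, -1.5, 1.0]
y = X @ beta
b_hat = lasso_ista(X, y, lam=0.01)
print(np.linalg.norm(b_hat - beta))  # small estimation error
```

With noiseless data and a small $\lambda$, the estimate is close to the true sparse parameter; the residual bias proportional to $\lambda$ is exactly the estimation bias of the $\ell_1$ penalty criticized above.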
To address the bias and suboptimality issues induced by the $\ell_1$ norm, several nonconvex regularizers have been proposed, such as the smoothly clipped absolute deviation (SCAD) [20], the minimax concave penalty (MCP) [44], the $\ell_0$ norm [46], the $\ell_p$ norm ($0 < p < 1$) [22], and the capped $\ell_1$ norm [28]; specifically, the SCAD and MCP fall into the category of folded concave penalized (FCP) methods. It was shown in [46] that the global solution of the FCP sparse linear regression enjoys the oracle property under the sparse eigenvalue condition; see Remark 4(iii) for details.
It is worth noting that the $\ell_p$ norm regularizer ($0 < p < 1$) has been recognized as an important technique for sparse optimization and has gained successful applications in various fields of applied science; see, e.g., [12, 33, 43]. In the present paper, we focus on the statistical properties of the $\ell_p$ optimization method, which falls outside the category of the FCP. Throughout the whole paper, we always assume that $p \in (0, 1)$ unless otherwise specified.
1.2 $\ell_p$ Optimization Problems
Due to the fact that the $\ell_p$ norm approaches the $\ell_0$ norm as $p \to 0^+$ and the $\ell_1$ norm as $p \to 1^-$, the $\ell_p$ norm has also been adopted as another alternative sparsity-promoting penalty function, interpolating between the $\ell_0$ and $\ell_1$ norms. The following $\ell_p$ optimization problems have attracted a great amount of attention and gained successful applications in a wide range of fields (see [12, 33, 43] and the references therein):

$$\min_{\beta \in \mathbb{R}^d} \|\beta\|_p^p \quad \text{s.t.} \quad \|y - X\beta\|_2 \le \epsilon,$$
and

$$\min_{\beta \in \mathbb{R}^d} \frac{1}{2}\|y - X\beta\|_2^2 + \lambda \|\beta\|_p^p.$$
In particular, the numerical results in [12] and [43] showed that $\ell_p$ minimization and $\ell_p$ regularization admit a significantly stronger sparsity-promoting capability than $\ell_1$ minimization and $\ell_1$ regularization, respectively; that is, they can obtain a sparser solution from a smaller number of samples. It was revealed in [33] that the $\ell_p$ regularization achieved a more reliable biological solution than the $\ell_1$ regularization in the field of systems biology.
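To make the $\ell_p$ regularization problem concrete, here is a minimal sketch of one common heuristic: iteratively reweighted least squares (IRLS) applied to a smoothed $\ell_p$ penalty. This is only one of several possible solvers and is not the algorithm used in the paper; all names, the smoothing parameter `eps`, and the default values are ours:

```python
import numpy as np

def lp_irls(X, y, lam=1e-3, p=0.5, n_iter=100, eps=1e-8):
    """IRLS sketch for min_b 0.5*||y - X b||_2^2 + lam * sum_i (b_i^2 + eps)^(p/2),
    a smoothed surrogate of the l_p^p penalty (0 < p < 1).  Each iteration
    minimizes a weighted-ridge majorizer of the concave penalty, so the
    smoothed objective decreases monotonically."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]  # least-squares / minimum-norm start
    for _ in range(n_iter):
        w = lam * p * (b ** 2 + eps) ** (p / 2 - 1)  # majorizer weights
        b = np.linalg.solve(X.T @ X + np.diag(w), X.T @ y)
    return b
```

The design choice here is majorization-minimization: since $t \mapsto (t + \text{eps})^{p/2}$ is concave, each quadratic upper bound yields a ridge problem solvable in closed form, and the smoothed objective can never increase across iterations.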
The advantage of the lower-order optimization problem has also been shown in theory: it requires a weaker regularity condition to guarantee stable statistical properties than the classical $\ell_1$ optimization problem. In particular, let $\hat\beta_p$ and $\hat\beta_{p,\lambda}$ denote the optimal solutions of the $\ell_p$ minimization and $\ell_p$ regularization problems, respectively. The recovery bound for the $\ell_p$ minimization problem was established in [16] and [39] under the MIP and RIP, respectively:
(2) 
where the required MIP or RIP condition is weaker than the one used in the study of the $\ell_1$ minimization problem. An $\ell_2$ recovery bound for the $\ell_p$ regularization problem in the noise-free case was established in [25]:
(3) 
under the lower-order REC introduced there, which is strictly weaker than the classical REC. However, the theoretical study of the $\ell_p$ optimization problem is still limited; in particular, no existing work establishes the statistical properties of the $\ell_p$ minimization problem when the noise is randomly distributed, or those of the $\ell_p$ regularization problem in the noise-aware case.
1.3 Contributions of This Paper
The main contribution of the present paper is the establishment of the statistical properties of the $\ell_p$ optimization problems, including both the minimization and regularization versions, in the noise-aware case; specifically, in the case when the linear regression model (1) involves Gaussian noise $\varepsilon \sim N(0, \sigma^2 I_n)$. For this purpose, we extend the $\ell_p$ REC of [25] to a more general one, which is among the weakest regularity conditions for establishing recovery bounds of sparse estimation models, and provide some sufficient conditions guaranteeing the general $\ell_p$ REC in terms of the REC, RIP, and MIP (with a less restrictive constant); see Propositions 1 and 2. Under the general $\ell_p$ REC, we show that the recovery bound (2) holds for the $\ell_p$ minimization problem with high probability, and that
as well as the estimation of the prediction loss and the oracle property, hold for the $\ell_p$ regularization problem with high probability; see Theorems 1 and 2, respectively. These results provide a unified framework for the statistical properties of the $\ell_p$ optimization problems, and improve those of the $\ell_1$ minimization problem [16, 39] and the $\ell_1$ regularization problem [4, 6, 47] under the REC; see Remark 4. They are not only of independent interest for establishing statistical properties of lower-order optimization problems with randomly noisy data, but also provide a useful tool for the study of the case when the design matrix is random.
Another contribution of the present paper is to explore the recovery bounds for the $\ell_p$ optimization problems with a random design matrix $X$ and random noise $\varepsilon$, which is more realistic in real-world applications; e.g., compressed sensing [8], signal processing [9], statistical learning [1]. As reported in [35], the key issue in studying the statistical properties of a sparse estimation model with a random design matrix is to provide suitable conditions on the population covariance matrix of $X$ that guarantee the regularity conditions with high probability; see, e.g., [9, 35]. Motivated by real-world applications, we consider the standard case when $X$ is a Gaussian random design with i.i.d. rows and the linear regression model (1) involves Gaussian noise, explore a sufficient condition ensuring the REC of the sample design with high probability in terms of the REC of the population covariance matrix, and apply the preceding results to establish the recovery bound (2) for the $\ell_p$ minimization problem, and (3), as well as the prediction loss and the oracle inequality, for the $\ell_p$ regularization problem; see Theorems 3 and 4. These results provide a unified framework for the statistical properties of the $\ell_p$ optimization problems with a Gaussian random design under the REC, which cover those of the $\ell_1$ optimization problems (see [49, Theorem 3.1]) as special cases; see Corollaries 3 and 4. To the best of our knowledge, most results presented in this paper are new, for either a deterministic or a random design matrix.
We also carry out numerical experiments on standard simulated data. The preliminary numerical results verify the established statistical properties and show that the $\ell_p$ optimization methods possess better recovery performance than the $\ell_1$ optimization method, SCAD and MCP, which coincides with existing numerical studies [25, 43] on the $\ell_p$ regularization problem. More specifically, the $\ell_p$ regularization method outperforms the $\ell_1$, SCAD and MCP regularization methods in the sense that its estimation error decreases faster as the sample size increases, and it achieves a more accurate solution.
The remainder of this paper is organized as follows. In section 2, we introduce the lower-order REC and discuss its sufficient conditions. In section 3, we establish the recovery bounds for the $\ell_p$ minimization and regularization problems with a deterministic design matrix. The extension to the linear regression model with a Gaussian random design and preliminary numerical results are presented in sections 4 and 5, respectively.
We end this section by presenting the notation adopted in this paper. We use Greek lowercase letters to denote vectors, capital letters $J$, $T$ to denote index sets, and script capital letters $\mathcal{A}$, $\mathcal{B}$ to denote random events. For a vector $\beta \in \mathbb{R}^d$ and an index set $J$, we use $\beta_J$ to denote the vector in $\mathbb{R}^d$ that coincides with $\beta$ on $J$ and is zero elsewhere, $|J|$ to denote the cardinality of $J$, $J^c$ to denote the complement of $J$, and $\mathrm{supp}(\beta)$ to denote the support of $\beta$, i.e., the index set of the nonzero entries of $\beta$. In particular, $I_n$ stands for the identity matrix in $\mathbb{R}^{n \times n}$, and $\mathbb{P}(\mathcal{A})$ and $\mathbb{P}(\mathcal{A} \mid \mathcal{B})$ denote the probability that event $\mathcal{A}$ happens and the conditional probability that event $\mathcal{A}$ happens given event $\mathcal{B}$, respectively.
2 Restricted Eigenvalue Conditions
This section discusses some regularity conditions imposed on the design matrix that are needed to guarantee the stable statistical properties of the $\ell_p$ minimization and regularization problems.
In statistics, the ordinary least squares (OLS) method is a classical technique for estimating the unknown parameters in a linear regression model and has favourable properties if some regularity conditions are satisfied; see, e.g., [34]. For example, the OLS always requires the positive definiteness of the Gram matrix $X^\top X$, that is,

(4) $$\min_{\beta \in \mathbb{R}^d,\, \beta \neq 0} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2} > 0.$$
However, in the high-dimensional setting ($n \ll d$), the OLS does not work well; in fact, the matrix $X^\top X$ is seriously degenerate, i.e.,

$$\min_{\beta \in \mathbb{R}^d,\, \beta \neq 0} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2} = 0.$$
To deal with the challenges caused by high-dimensional data, the Lasso (least absolute shrinkage and selection operator) estimator was introduced in [40]. Since then, the Lasso estimator has gained great success in sparse representation and machine learning with high-dimensional data; see, e.g., [4, 41, 47] and the references therein.
It has been pointed out that the Lasso requires only a weak condition, called the restricted eigenvalue condition (REC) [4], to ensure nice statistical properties; see, e.g., [42, 27, 32]. In the definition of the REC, the minimum in (4) is replaced by a minimum over a restricted set of vectors measured by an $\ell_1$ norm inequality, and the $\ell_2$ norm in the denominator is replaced by the $\ell_2$ norm of only a part of the vector. The notion of the REC was extended to a group-wise lower-order REC in [25], which was used there to explore the oracle property and recovery bound of the corresponding lower-order regularization problem in the noise-free case.
Inspired by the ideas in [4, 25], we here introduce a lower-order REC for the $\ell_p$ optimization problems, similar to but more general than the one in [25], where the minimum is taken over a restricted set of vectors measured by an $\ell_p$ norm inequality. To proceed, we introduce some useful notation. For the remainder of this paper, let $s$ and $t$ be a pair of integers such that
(5) 
For and , we define by the index set corresponding to the first largest coordinates in absolute value of in . For , its restricted eigenvalue modulus relative to is defined by
(6) 
The lowerorder REC is defined as follows.
Definition 1.
Let and . is said to satisfy the restricted eigenvalue condition relative to (REC in short) if
Remark 1.
(i) Clearly, the lower-order REC provides a unified framework for the REC-type conditions; e.g., it includes the classical REC in [4] and the lower-order REC in [25] as special cases, corresponding to particular choices of the parameters.
(ii) The restricted eigenvalue modulus defined in (6) is slightly different from that of the classical REC in [4], in which an additional factor of $\sqrt{n}$ appears in the denominator. The reason is that we consider not only linear regression with a deterministic design, as in [4], but also the random design case; in the latter case, the REC is assumed to be satisfied by the population covariance matrix of $X$, in which the sample size $n$ does not appear. Hence, to make the definition consistent for both cases, we define the restricted eigenvalue modulus in (6) without the factor $\sqrt{n}$ in the denominator. This is the only difference between the restricted eigenvalue modulus (6) and that in [4]. For example, if the matrix $X$ has i.i.d. Gaussian entries, the restricted eigenvalue modulus in [4] scales as a constant; equivalently, the modulus given by (6) scales as $\sqrt{n}$, independent of the remaining problem parameters. Consequently, the corresponding terms in the denominators of the conclusions of Theorem 2 and Corollary 2 scale as a constant in this situation.
It is natural to study the relationships between the RECs and other types of regularity conditions. To this end, we first recall some basic properties of the $\ell_p$ norm in the following lemmas; in particular, Lemma 1 is taken from [24, Section 8.12] and [25, Lemmas 1 and 2].
Lemma 1.
Let . Then the following relations are true:
(7) 
(8) 
Lemma 2.
Let , , , and be such that
(9) 
Then
(10) 
Proof.
Let and . Then it holds that
(11) 
Without loss of generality, we assume that ; otherwise, (10) holds automatically. Thus, by the first inequality of (9) and noting , we have that
(12) 
Multiplying the inequalities in (11) by and respectively, we obtain that
where the second inequality follows from (12). This, together with the second inequality of (9), yields (10). The proof is complete. ∎
Extending [25, Proposition 5] to the general lower-order REC, the following proposition validates the relationship between the RECs: the lower the order, the weaker the REC. However, the converse implication is not true; see [25, Example 1] for a counterexample. We provide the proof to make this paper self-contained, although the idea is similar to that of [25, Proposition 5].
Proposition 1.
Let , , and be a pair of integers satisfying (5). Suppose that and that satisfies the REC. Then satisfies the REC.
Proof.
Associated with the REC, we define the feasible set
(13) 
By Definition 1, it remains to show that . To this end, let , and let denote the index set of the first largest coordinates in absolute value of . By the assumption that and by the construction of , one has . Then we obtain by Lemma 2 (with in place of ) that ; consequently, . Hence, it follows that , and the proof is complete. ∎
Proposition 1 reveals that the classical REC is a sufficient condition for the lower-order REC. In the sequel, we further discuss some other types of regularity conditions, namely the sparse eigenvalue condition (SEC), the restricted isometry property (RIP), and the mutual incoherence property (MIP), which have been widely used in the statistics and engineering literature, for ensuring the lower-order REC.
The SEC is a popular regularity condition required to guarantee nice properties of sparse representation; see [4, 17, 46] and the references therein. For a positive integer $m \le d$, the $m$-sparse minimal eigenvalue and $m$-sparse maximal eigenvalue of the design matrix $X$ are defined, respectively, by

(14) $$\phi_{\min}(m) := \min_{0 < \|\beta\|_0 \le m} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2} \quad\text{and}\quad \phi_{\max}(m) := \max_{0 < \|\beta\|_0 \le m} \frac{\|X\beta\|_2^2}{\|\beta\|_2^2}.$$
The SEC was first introduced in [17] to show that the optimal solution of the $\ell_1$ minimization problem well approximates that of the $\ell_0$ minimization problem whenever the sparse eigenvalues are suitably controlled.
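For small problems, the sparse eigenvalues can be computed exactly by enumerating supports. The following brute-force sketch (ours, purely illustrative) makes the definition in (14) concrete:

```python
import itertools
import numpy as np

def sparse_eigenvalues(X, m):
    """Brute-force m-sparse minimal/maximal eigenvalues of X: the extreme
    values of ||X b||^2 / ||b||^2 over m-sparse b, found by scanning every
    m x m principal submatrix of the Gram matrix.  Exponential in d, so
    this is for tiny illustrative problems only."""
    G = X.T @ X
    d = G.shape[0]
    lo, hi = np.inf, -np.inf
    for J in itertools.combinations(range(d), m):
        vals = np.linalg.eigvalsh(G[np.ix_(J, J)])  # eigenvalues, ascending
        lo, hi = min(lo, vals[0]), max(hi, vals[-1])
    return lo, hi

print(sparse_eigenvalues(np.eye(4), 2))  # both equal 1 for an orthonormal design
```

Scanning only supports of size exactly $m$ suffices, because enlarging a support can only decrease the minimal and increase the maximal Rayleigh quotient.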
The RIP is another well-known regularity condition in the scenario of sparse learning, which was introduced in [10] and has been widely used in the study of the oracle property and recovery bound for the high-dimensional regression model; see [4, 9, 37] and the references therein. Below, we recall the RIP-type notions from [10].
Definition 2.
[10] Let $X \in \mathbb{R}^{n \times d}$ and let $s, s'$ be positive integers such that $s + s' \le d$.

The restricted isometry constant of order $s$ of $X$, denoted by $\delta_s$, is defined to be the smallest quantity such that, for any index set $J$ with $|J| \le s$ and any $\beta$ with $\mathrm{supp}(\beta) \subseteq J$,

(15) $$(1-\delta_s)\|\beta\|_2^2 \le \|X\beta\|_2^2 \le (1+\delta_s)\|\beta\|_2^2.$$
The restricted orthogonality constant of orders $s, s'$ of $X$, denoted by $\theta_{s,s'}$, is defined to be the smallest quantity such that, for any disjoint index sets $J, J'$ with $|J| \le s$ and $|J'| \le s'$, and any $\beta, \beta'$ supported on $J$ and $J'$, respectively,

(16) $$|\langle X\beta, X\beta' \rangle| \le \theta_{s,s'} \|\beta\|_2 \|\beta'\|_2.$$
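Analogously, for tiny design matrices the restricted isometry constant in (15) can be computed exactly by enumeration. The sketch below is ours and only illustrates why the RIC is expensive to evaluate in practice:

```python
import itertools
import numpy as np

def restricted_isometry_constant(X, s):
    """Exact RIC delta_s by enumerating every support of size s and taking
    the worst deviation of the eigenvalues of X_J^T X_J from 1.  The cost
    is combinatorial in the number of columns, which is precisely why the
    RIC is hard to compute for realistic problem sizes."""
    delta = 0.0
    for J in itertools.combinations(range(X.shape[1]), s):
        XJ = X[:, list(J)]
        vals = np.linalg.eigvalsh(XJ.T @ XJ)  # eigenvalues, ascending
        delta = max(delta, vals[-1] - 1.0, 1.0 - vals[0])
    return delta
```

A matrix with orthonormal columns has $\delta_s = 0$ for every admissible $s$, since every column submatrix is a perfect isometry.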
The MIP is also a well-known regularity condition in the scenario of sparse learning, which was introduced in [19] and has been used in [4, 7, 17, 18] and the references therein. In the case when each diagonal element of the Gram matrix $X^\top X$ is 1, the restricted orthogonality constant $\theta_{1,1}$ coincides with the mutual incoherence constant; see [19].
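Unlike the RIC, the mutual incoherence constant is cheap to compute, which is one reason the MIP is attractive in practice. A minimal sketch (ours, illustrative):

```python
import numpy as np

def mutual_incoherence(X):
    """Mutual incoherence constant mu(X): the largest absolute inner product
    between two distinct l2-normalized columns of X.  Costs only O(n d^2)."""
    Xn = X / np.linalg.norm(X, axis=0)  # unit-normalize each column
    G = np.abs(Xn.T @ Xn)
    np.fill_diagonal(G, 0.0)            # ignore self inner products
    return float(G.max())

print(mutual_incoherence(np.eye(3)))  # 0.0: orthogonal columns are maximally incoherent
```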
The following lemmas are useful for establishing the relationship between the REC and other types of regularity conditions; in particular, Lemmas 3 and 4 are taken from [10, Lemma 1.1] and [42, Lemma 3.1], respectively.
Lemma 3.
Let and be such that . Then
Lemma 4.
Let and be such that . Then .
For the sake of simplicity, we present a partition structure and some notation. For a vector and an index set, we use to denote the rank of the absolute value of in (in decreasing order) and to denote the index set of the th batch of the first largest coordinates in absolute value of in . That is,
(17) 
Lemma 5.
Let , , , and be a pair of integers satisfying (5). Then the following relations are true:
(18) 
(19) 
Proof.
Fix , as defined by (13). Then there exists such that
(20) 
Write (where denotes the largest integer not greater than ), (defined by (17)) for each and . Then it follows from [25, Lemma 7] and (20) that
(21) 
(due to (7)). Noting by (17) and (20) that and for each , one has by (14) that
These, together with (21), imply that
Since and satisfying (20) are arbitrary, (18) is shown to hold by (6) and the fact that . One can prove (19) in a similar way, and thus, the details are omitted. ∎
The following proposition provides the sufficient conditions for the REC in terms of the SEC, RIP and MIP; see (a), (b) and (c) below respectively.
Proposition 2.
Let , , , and be a pair of integers satisfying (5). Then satisfies the REC provided that one of the following conditions holds:


.

each diagonal element of is and
Proof.
It directly follows from Lemma 5 (cf. (18)) that satisfies the REC provided that condition (a) holds. Fix , and let , , (for each ) and be defined, respectively, as in the beginning of the proof of Lemma 5. Then (21) follows directly and it follows from [25, Lemma 7] and (17) that
(22) 
Suppose that condition (b) is satisfied. By Definition 2 (cf. (16)), one has that
Then it follows from (21) that
(23)  
(by (15)). Since (by (5)), one has by Definition 2(i) that , and then by Lemma 3 that . Then it follows from (b) that
(24) 
This, together with (23), shows that Lemma 4 is applicable (with , , in place of , , ) to conclude that
(due to (15)). Since and satisfying (20) are arbitrary, we derive by (6) and (24) that
consequently, satisfies the REC.
Suppose that (c) is satisfied. Then we have by (22) and Definition 2 (cf. (16)) that
(25)  
Separating the diagonal and off-diagonal terms of the quadratic form , one has by (7) and (c) that
Combining this inequality with (25), we get that
Since and satisfying (20) are arbitrary, we derive by (6) and (c) that
consequently, satisfies the REC. The proof is complete. ∎
Remark 2.
It was established in [4, Lemma 4.1(ii)], [42, Corollaries 7.1 and 3.1] and [4, Assumption 5] that the classical REC is satisfied under one of the following conditions:


.

each diagonal element of is and
Proposition 2 extends the existing results to the general lower-order case and partially improves them; in particular, each of conditions (a)-(c) in Proposition 2 required for the lower-order REC is less restrictive than the corresponding one of conditions (a')-(c') required for the classical REC in certain situations which usually occur in the high-dimensional scenario (see, e.g., [4, 9, 49]). Moreover, by Propositions 1 and 2, the lower-order REC is satisfied provided that one of the following conditions holds:


.

each diagonal element of is and