
# Principal component-guided sparse regression

## Abstract

We propose a new method for supervised learning, especially suited to wide data where the number of features is much greater than the number of observations. The method combines the lasso ($\ell_1$) sparsity penalty with a quadratic penalty that shrinks the coefficient vector toward the leading principal components of the feature matrix. We call the proposed method the “Lariat”. The method can be especially powerful if the features are pre-assigned to groups (such as gene pathways, assays or protein interaction networks). In that case, the Lariat shrinks each group-wise component of the solution toward the leading principal components of that group. In the process, it also carries out selection of the feature groups. We provide some theory for this method and illustrate it on a number of simulated and real data examples.

## 1 Introduction

We consider the usual linear regression model: given $n$ realizations of $p$ predictors $x_{ij}$ for $i=1,\dots,n$ and $j=1,\dots,p$, the response $y_i$ is modeled as

$$y_i=\beta_0+\sum_j x_{ij}\beta_j+\epsilon_i, \qquad (1)$$

with the errors $\epsilon_i$ having mean zero. The ordinary least squares (OLS) estimates of $(\beta_0,\beta)$ are obtained by minimizing the residual sum of squares (RSS). There has been much work on regularized estimators that offer an advantage over the OLS estimates, both in terms of accuracy of prediction on future data and interpretation of the fitted model. One focus has been on the lasso (Tibshirani, 1996), which minimizes

$$J(\beta_0,\beta)=\frac12\lVert y-\beta_0\mathbf{1}-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1, \qquad (2)$$

where $X=\{x_{ij}\}$ is the $n\times p$ matrix of predictors, and the tuning parameter $\lambda\ge 0$ controls the sparsity of the final model. This parameter is often selected by cross-validation (CV). The objective function is convex, which means that solutions can be found efficiently even for very large $n$ and $p$, in contrast to combinatorial methods like best subset selection.

The lasso is an alternative to ridge regression (Hoerl & Kennard, 1970), which uses an objective function with an $\ell_2$ penalty:

$$\frac12\lVert y-\beta_0\mathbf{1}-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_2^2. \qquad (3)$$

This approach has the disadvantage of producing dense models. However, it also biases the solution vector toward the leading right singular vectors of $X$, which can be effective for improving prediction accuracy.

The elastic net (Zou & Hastie, 2005) generalizes the lasso, effectively combining ridge regression and the lasso, by adding an $\ell_2$ penalty to the lasso’s objective function:

$$\frac12\lVert y-\beta_0\mathbf{1}-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\frac{\lambda_2}{2}\lVert\beta\rVert_2^2. \qquad (4)$$

We have found that the bias of ridge regression (and hence the elastic net) toward the leading singular vectors is somewhat mild. Furthermore, if the predictors come in pre-assigned groups (a focus of this paper), these methods do not exploit this information.

In this paper, we propose a new method, the “Lariat”, that strongly biases the solution vector toward the leading singular vectors and exploits group structure. We give the full definition of the Lariat for multiple groups in Section 3. For motivation, we describe the simpler single-group case in the next section.

## 2 Quadratic penalties based on principal components

Assume that we have centered the columns of $X$, and let $X=UDV^T$ be the singular value decomposition (SVD) of $X$. Here, $D$ is a diagonal matrix with diagonal entries equaling the singular values $d_1\ge d_2\ge\cdots\ge d_m\ge 0$, with $m=\min(n,p)$. Assume also that $y$ has been centered, so that we may omit the intercept from the model.

In generalizing the quadratic (ridge) penalty, one can imagine a class of penalty functions of the form

$$\beta^T V Z V^T\beta,$$

where $Z$ is a diagonal matrix whose diagonal elements are functions of the squared singular values:

$$Z_{11}=f_1(d_1^2,d_2^2,\dots,d_m^2),\quad Z_{22}=f_2(d_1^2,d_2^2,\dots,d_m^2),\quad\dots$$

The ridge penalty chooses $Z_{jj}=1$ for all $j$, leading to the quadratic penalty $\beta^TVV^T\beta=\lVert\beta\rVert_2^2$ (when $X$ has rank $p$). Here we propose a different choice: we minimize

$$J(\beta)=\frac12\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\frac{\theta}{2}\,\beta^T V D_{d_1^2-d_j^2} V^T\beta, \qquad (5)$$

where $D_{d_1^2-d_j^2}$ is an $m\times m$ diagonal matrix with $j$th diagonal entry equaling $d_1^2-d_j^2$. We call this the “Lariat penalty” and use it more generally in the next section. The value $\theta\ge 0$ is a tuning parameter determining the weight given to the second penalty. We see that this penalty term gives no weight (a “free ride”) to the component of $\beta$ that aligns with the first right singular vector of $X$, i.e., the first principal component (PC). Beyond that, the size of the penalty depends on the gap between the squared singular values $d_1^2$ and $d_j^2$. If we view the quadratic penalty as representing a Bayesian (log-)prior, it puts infinite prior variance on the leading eigenvector.
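To make the “free ride” concrete, here is a small numpy sketch of the quadratic term in (5); the function name and data are ours, purely for illustration:

```python
import numpy as np

def lariat_quad_penalty(X, beta, theta):
    """Compute (theta/2) * beta^T V D_{d1^2 - dj^2} V^T beta for centered X."""
    # Right singular vectors (rows of Vt) and singular values d of X
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    gaps = d[0] ** 2 - d ** 2        # d1^2 - dj^2 >= 0, exactly zero for j = 1
    z = Vt @ beta                    # coordinates of beta in the V basis
    return 0.5 * theta * np.sum(gaps * z ** 2)

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
X -= X.mean(axis=0)                  # center the columns

_, d, Vt = np.linalg.svd(X, full_matrices=False)
v1 = Vt[0]                           # leading right singular vector (first PC)

# The component of beta along the first PC gets a "free ride": zero penalty.
assert abs(lariat_quad_penalty(X, v1, theta=2.0)) < 1e-8
# Any other singular direction is charged in proportion to its gap d1^2 - dj^2.
assert lariat_quad_penalty(X, Vt[1], theta=2.0) > 0
```

The component of $\beta$ along the top PC costs nothing, while every other singular direction is penalized in proportion to its gap from $d_1^2$.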

It is instructive to compare (5) with $\lambda=0$ to ridge regression, which uses the penalty $\frac{\theta}{2}\lVert\beta\rVert_2^2$. By transforming to principal coordinates, it follows that ridge regression produces the fit

$$X\hat\beta_R=\sum_{j=1}^m \frac{d_j^2}{d_j^2+\theta}\,u_j u_j^T y. \qquad (6)$$

In contrast, the Lariat (5) gives

$$X\hat\beta_L=\sum_{j=1}^m \frac{d_j^2}{d_j^2+\theta(d_1^2-d_j^2)}\,u_j u_j^T y. \qquad (7)$$

The latter expression corresponds to a more aggressive form of shrinkage toward the leading singular vectors. To see this, Figure 1 shows the shrinkage factors $d_j^2/(d_j^2+\theta)$ and $d_j^2/(d_j^2+\theta(d_1^2-d_j^2))$ for a design matrix with a strong rank-1 component. In each panel, we have chosen the value of $\theta$ (different for each method) to yield the degrees of freedom indicated at the top of the panel. (For each method, the degrees of freedom is equal to the sum of the shrinkage factors.)

We see that when the degrees of freedom is about “right” (top-left panel), ridge regression shrinks the top component about twice as much as does the Lariat (top-right panel). This contrast is less stark when the degrees of freedom is increased to 5 (bottom-left panel). This aggressiveness can help the Lariat improve prediction accuracy when the signal lines up strongly with the top PCs. More importantly, with pre-specified feature groups, the Lariat penalty in (5) can be modified in such a way that it shrinks the subvector of in each group toward the leading PCs of that group. This exploits the group structure to give potentially better predictions and simultaneously selects groups of features. We give details in the next section.
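The two shrinkage profiles can be compared directly; a minimal numpy sketch, where the spectrum and the $\theta$ values are made up for illustration:

```python
import numpy as np

# Toy spectrum with a strong rank-1 component: squared singular values d_j^2.
d2 = np.array([25.0, 4.0, 3.0, 2.0, 1.0])

theta_r, theta_l = 3.0, 0.2                  # illustrative ridge / Lariat values
ridge = d2 / (d2 + theta_r)                  # factors in (6)
lariat = d2 / (d2 + theta_l * (d2[0] - d2))  # factors in (7)

# The Lariat never shrinks the top component (factor exactly 1) ...
print(lariat[0])        # 1.0
# ... while ridge shrinks every component, including the first.
print(ridge[0] < 1.0)   # True

# For each method, the degrees of freedom is the sum of the shrinkage factors.
print(ridge.sum(), lariat.sum())
```

This makes the asymmetry visible: the Lariat spends its entire shrinkage budget on the lower components, leaving the top PC untouched.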

The rest of this paper is organized as follows. In Section 3 we give a full definition of our proposal. We illustrate the effectiveness of the Lariat on some real data examples in Section 4, and give details of our computational approach in Section 5. Section 6 derives a formula for the degrees of freedom of the Lariat fit. In Section 7 we discuss an extensive simulation study, comparing our proposal to the lasso, elastic net and principal components regression. We study the theoretical properties of the Lariat in Section 8, showing that, under certain assumptions, it yields lower $\ell_2$ and prediction error than the lasso. We end with a discussion and ideas for future work. The Appendix contains further details and proofs, as well as a full description of the set-up and results for our simulation study in Section 7.

## 3 The Lariat

### 3.1 Definition

Suppose that the predictors are grouped in $K$ non-overlapping groups (we discuss the overlapping case later in Section 3.2). For example, these groups could be different gene pathways, assays or protein interaction networks. For $k=1,\dots,K$, let $X_k$ denote the columns of $X$ corresponding to group $k$, and let $V_k$ and $d_{k1}\ge d_{k2}\ge\cdots$ denote the right singular vectors and singular values of $X_k$. The Lariat minimizes

$$J(\beta)=\frac12\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\frac{\theta}{2}\sum_{k=1}^K \beta_k^T V_k D_{d_{k1}^2-d_{kj}^2} V_k^T\beta_k. \qquad (8)$$

Here, $\beta_k$ is the sub-vector of $\beta$ corresponding to group $k$, $d_{k1}\ge d_{k2}\ge\cdots$ are the singular values of $X_k$ in decreasing order, and $D_{d_{k1}^2-d_{kj}^2}$ is a diagonal matrix with diagonal entries $d_{k1}^2-d_{kj}^2$ for $j=1,2,\dots$. Optionally, we might also multiply each penalty term in the sum by a factor depending on the group size $p_k$, as is standard in related methods such as the group lasso (Yuan & Lin, 2006).

Some observations on solving (8) are in order:

• The objective function is convex and the non-smooth component is separable. Hence, it can be optimized efficiently by a coordinate descent procedure (see Section 5 for details).

• Our approach is to fix a few values of $\theta$ (including zero, corresponding to the lasso), and then optimize the objective (8) over a path of $\lambda$ values. Cross-validation is used to choose $\theta$ and $\lambda$.

• The parameter $\theta$ is not unitless, and is difficult to interpret. In our software, instead of asking the user to specify $\theta$, we ask the user to specify a function of $\theta$, denoted by “rat”. This is the ratio between the shrinkage factors in (7) for the smallest and largest singular values (the latter factor being equal to 1). This ratio is between 0 and 1, with 1 corresponding to $\theta=0$ (the lasso) and lower values corresponding to more shrinkage. In practice, we find that a handful of rat values between 0 and 1 covers a wide range of possible models.

• A potentially costly part of the computation is the SVD of each $X_k$. If $X_k$ is large, for computational ease, we can use a low-rank SVD as an approximation without much loss of performance.

For some insight into how the Lariat shrinks coefficients, we consider the contours of the penalty in the case of two predictors. Let $\rho$ denote the correlation between these two predictors. In this case, it is easy to show that the two right singular vectors of $X$ are $(1,1)/\sqrt2$ and $(1,-1)/\sqrt2$, with the former (resp. latter) being the leading singular vector if $\rho>0$ (resp. $\rho<0$). Figure 2 presents the penalty contours for the Lariat along with those for ridge, lasso and elastic net regression for comparison. (See Appendix A for detailed computations on the Lariat contours.) We can see that the Lariat imposes a smaller penalty in the direction of the leading singular vector, and that this penalty is smaller when the correlation between the two predictors is stronger. We also see that as $\rho\to 0$, the Lariat contours become like those for the lasso, while as $|\rho|\to 1$, they become line segments in the direction of the leading singular vector.

Figure 3 shows the penalty contours for a particular set of three predictors, with the ratio “rat” decreasing as we go from left to right. As this ratio goes from 1 toward 0, the contours move from being the $\ell_1$ ball (as in the lasso) to becoming more elongated in the direction of the first PC.

### 3.2 The overlapping groups setting

As in the group lasso (Yuan & Lin, 2006), there are two approaches one can take to deal with overlapping groups. One approach is to apply the group penalty to the given groups: this however has the undesirable effect of zeroing out a predictor’s coefficient if any group to which it belongs is zeroed out. The overlap also causes additional computational challenges as the penalty is no longer separable. A better approach is the “overlap group lasso” penalty (Jacob et al., 2009), where one replicates columns of $X$ to account for multiple memberships, hence creating non-overlapping groups. The model is then fit as if the groups were non-overlapping, and the coefficients for each original feature are summed to create the estimated coefficient for that feature. We take this same approach for the Lariat.
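The replicate-and-sum construction can be sketched in a few lines of numpy; the group definitions and helper names below are made up for illustration:

```python
import numpy as np

def expand_overlapping(X, groups):
    """Replicate columns of X so that the given groups become non-overlapping."""
    cols = [j for g in groups for j in g]   # each feature copied once per group
    return X[:, cols], cols

def collapse_coefs(beta_exp, cols, p):
    """Sum replicated coefficients back to one coefficient per original feature."""
    beta = np.zeros(p)
    for c, b in zip(cols, beta_exp):
        beta[c] += b
    return beta

X = np.arange(12.0).reshape(3, 4)
groups = [[0, 1, 2], [2, 3]]                # feature 2 belongs to both groups
X_exp, cols = expand_overlapping(X, groups)
assert X_exp.shape == (3, 5)                # column 2 now appears twice

# Coefficients fit on the expanded matrix are summed per original feature:
beta = collapse_coefs(np.array([1.0, 0.0, 0.5, 0.25, 2.0]), cols, p=4)
assert np.allclose(beta, [1.0, 0.0, 0.75, 2.0])
```

Any non-overlapping-group fitter (here, the Lariat) can then be applied to `X_exp` unchanged.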

## 4 Real data examples

### 4.1 Blog Feedback dataset

This dataset is from the UC Irvine Machine Learning Repository. Here is the description of the data and task:

“This data originates from blog posts. The raw HTML-documents of the blog posts were crawled and processed. The prediction task associated with the data is the prediction of the number of comments in the upcoming 24 hours. In order to simulate this situation, we choose a basetime (in the past) and select the blog posts that were published at most 72 hours before the selected base date/time. Then, we calculate all the features of the selected blog posts from the information that was available at the basetime, therefore each instance corresponds to a blog post. The target is the number of comments that the blog post received in the next 24 hours relative to the basetime.

In the train data, the basetimes were in the years 2010 and 2011. In the test data the basetimes were in February and March 2012. This simulates the real-world situation in which training data from the past is available to predict events in the future.”

We took a random sample of 1000 observations from the training set (years 2010 and 2011), and used 5 days (2012.03.27 through 2012.03.31) as the test data, for a total of 817 test observations. There are a total of 281 attributes (features). We applied the Lariat with no pre-specified groups of features. The left panel of Figure 4 shows the eigenvalues of the $X^TX$ matrix, scaled so that the largest eigenvalue is 1. The eigenvalues for a random Gaussian matrix of the same size are also shown for reference. For the Blog Feedback dataset, there is a sharp drop-off in the eigenvalues after the first 3 or 4 eigenvalues, indicating strong correlation among the features.

The right panel shows the test error for the lasso, elastic net, PC regression of various ranks and the Lariat, for different values of the “rat” parameter. The results for the elastic net are very close to those for the lasso. We see that the Lariat achieves a somewhat lower test error by shrinking toward the top principal components.

### 4.2 p53 microarray expression data

Here, we analyze the data from a gene expression study, as described in Zeng & Breheny (2016), taken from Subramanian et al. (2005). It involves the mutational status of the gene p53 in cell lines. The study aims to identify pathways that correlate with the mutational status of p53, which regulates gene expression in response to various signals of cellular stress. The data consists of 50 cell lines, 17 of which are classified as normal and 33 of which carry mutations in the p53 gene. To be consistent with the analysis in Subramanian et al. (2005), 308 gene sets that have size between 15 and 500 are included in our analysis. These gene sets contain a total of 4301 genes and are available in the R package grpregOverlap. When the data is expanded to handle the overlapping groups (one column per gene–gene set membership), the number of columns grows well beyond 4301.

We divided the data into 10 cross-validation (CV) folds and applied the lasso, the Lariat and the group lasso using the R package grpregOverlap. (The data was too large for the sparse group lasso package, SGL.) The left panel of Figure 5 plots the number of non-zero pathways (i.e. pathways with at least one non-zero coefficient) against the number of non-zero coefficients for each method, as the complexity parameter of each method is varied. We see that the Lariat induces some group-level sparsity, although not as strongly as the group lasso. However, the right panel shows that the Lariat achieves higher cross-validation area under the curve (AUC) than the other approaches.

## 5 Computation of the Lariat solution

Consider first the simpler case of just one group. Let $X=UDV^T$ be the singular value decomposition of $X$, with singular values $d_1\ge\cdots\ge d_m$, where $m=\min(n,p)$, and let $A=VD_{d_1^2-d_j^2}V^T$. The objective function for the Lariat is

$$J(\beta)=\frac12\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\frac{\theta}{2}\,\beta^TA\beta. \qquad (9)$$

Since $J$ is convex in $\beta$ and the non-smooth part of the penalty is separable, we can minimize $J$ by applying coordinate descent. The coordinate-wise update has the form

$$\tilde\beta_j \leftarrow \frac{S\!\left(\sum_i x_{ij}\,r_i^{(j)}-\theta s_j,\ \lambda\right)}{\sum_i x_{ij}^2+\theta A_{jj}},$$

where $r_i^{(j)}=y_i-\sum_{\ell\ne j}x_{i\ell}\beta_\ell$ is the partial residual with predictor $j$ removed, $s_j=\sum_{\ell\ne j}A_{j\ell}\beta_\ell$, and $S(x,\lambda)=\operatorname{sign}(x)(|x|-\lambda)_+$ is the soft-thresholding operator.

These updates can be easily generalized to the general case of non-overlapping groups (8) and to the case where observations are given different weights (observation $i$ given weight $w_i$). Using the notation of Section 3, we apply coordinate descent as shown in Algorithm 1.
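The single-group update can be sketched directly in numpy. This is an illustrative re-implementation under our own naming, not the authors' software; the sanity check uses the fact that with $\lambda=0$ the minimizer has the closed form $(X^TX+\theta A)^{-1}X^Ty$:

```python
import numpy as np

def soft_threshold(x, lam):
    return np.sign(x) * max(abs(x) - lam, 0.0)

def lariat_cd(X, y, lam, theta, n_iter=200):
    """Coordinate descent for the single-group Lariat objective (9)."""
    n, p = X.shape
    _, d, Vt = np.linalg.svd(X, full_matrices=False)
    A = Vt.T @ np.diag(d[0] ** 2 - d ** 2) @ Vt      # quadratic penalty matrix
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r_j = y - X @ beta + X[:, j] * beta[j]   # partial residual r^(j)
            s_j = A[j] @ beta - A[j, j] * beta[j]    # off-diagonal part of (A beta)_j
            num = X[:, j] @ r_j - theta * s_j
            beta[j] = soft_threshold(num, lam) / (X[:, j] @ X[:, j] + theta * A[j, j])
    return beta

rng = np.random.default_rng(1)
X = rng.standard_normal((40, 8)); X -= X.mean(axis=0)
y = X @ rng.standard_normal(8) + 0.1 * rng.standard_normal(40)

theta = 0.5
_, d, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T @ np.diag(d[0] ** 2 - d ** 2) @ Vt
beta_cd = lariat_cd(X, y, lam=0.0, theta=theta)
beta_exact = np.linalg.solve(X.T @ X + theta * A, X.T @ y)
assert np.allclose(beta_cd, beta_exact, atol=1e-6)
```

The soft-threshold numerator and the $\theta A_{jj}$-adjusted denominator match the display above term by term.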

We also generalize this procedure to the binomial/logistic model, using the same Newton-Raphson (iteratively reweighted least squares) framework used in glmnet (Friedman et al., 2010).

### 5.1 The Lariat’s grouping effect

We saw in Figure 5 that the Lariat can have a sparse grouping effect, even though it does not use a two-norm penalty. We investigate this further here. For Figure 6, we generated data with 750 predictors, arranged in 50 groups of 15 predictors. Within each group, the predictors were either independent (left panel) or had a strong rank-1 component (right panel). The figure plots the number of non-zero groups (i.e. groups with at least one non-zero coefficient) against the number of non-zero coefficients for the lasso, the sparse group lasso (SGL) (Simon et al., 2013) and the Lariat. The latter two methods were run with several values of their second tuning parameter. We see that the Lariat shows no grouping advantage over the lasso in the left panel, but produces moderately strong grouping in the right panel. One might argue that this is reasonable: grouping only occurs when the groups have some strong internal structure. The sparse group lasso, on the other hand, always displays a strong grouping effect.

### 5.2 Connection to the group lasso and a generalization of the Lariat

The group lasso and sparse group lasso objective functions are given by

$$J_{GL}(\beta)=\frac12\lVert y-X\beta\rVert_2^2+\lambda_1\sum_k\lVert\beta_k\rVert_2, \qquad (10)$$

$$J_{SGL}(\beta)=\frac12\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\lambda_1\sum_k\lVert\beta_k\rVert_2, \qquad (11)$$

respectively. In the original group lasso paper of Yuan & Lin (2006) it was assumed that the $X_k$, the blocks of columns in $X$ corresponding to the groups, were orthonormal, i.e. $X_k^TX_k=I$. This makes the computation considerably easier and, as pointed out by Simon & Tibshirani (2012), is equivalent to a penalty of the form $\sum_k\lVert X_k\beta_k\rVert_2$. The R package grpreg uses this orthogonalization and states that it is essential to its computational performance (Breheny & Huang, 2015). Since $\lVert X_k\beta_k\rVert_2=\lVert D_kV_k^T\beta_k\rVert_2$ weights the component of $\beta_k$ along the $j$th right singular vector by $d_{kj}$, this orthogonalization penalizes the top eigenvectors more, in direct contrast to the Lariat, which puts less penalty on the top eigenvectors. Of course, the group lasso without the $\ell_1$ term does not deliver sparsity in the features. We also note that this orthogonalization does not work with the sparse group lasso, as the sparsity among the original features would be lost, and does not work whenever $p_k>n$ for any group $k$.

One could envision a variation of the Lariat, namely a version of the sparse group lasso that uses an objective function of the form

$$\frac12\lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\lambda_1\sum_k\lVert R_k^{1/2}\beta_k\rVert_2, \qquad (12)$$

where $R_k$ is any positive semidefinite weight matrix. For example, taking $R_k=V_kD_{d_{k1}^2-d_{kj}^2}V_k^T$ as in (8) would put less penalty on the higher eigenvectors and induce strong group sparsity. We have not experimented with this yet, as the computation seems challenging due to the presence of the non-squared $\ell_2$ norms.

### 5.3 Comparison of timings for model-fitting

Table 1 shows a comparison of timings for Algorithm 1 against those for other methods, all available as R packages. The competitors include the lasso (glmnet), the sparse group lasso (SGL) and the group lasso packages gglasso and grpreg. As mentioned earlier, the latter uses orthogonalization within groups to speed up the computations, but this changes the problem. Among these competitors, only SGL provides sparsity at the level of individual features. All functions were called with default settings.

We see that SGL and gglasso start to slow down considerably as the problem size gets moderately large, while the times for grpreg are similar to that for the Lariat. However, when the cost for the initial SVD is separated out, the speed of the Lariat is roughly comparable to that of glmnet. In practice one can compute the SVD upfront for the given feature matrix, and then use it for subsequent applications of the Lariat. For example, in cross-validation, one could compute just one SVD for the entire feature matrix, rather than one SVD in each fold. Table 2 considers larger problem sizes and we see these same general trends.

## 6 Degrees of freedom

Given a vector of response values $y$ with corresponding fits $\hat y$, Efron (1986) defines the degrees of freedom as

$$\mathrm{df}(\hat y)=\frac{\sum_i \operatorname{Cov}(y_i,\hat y_i)}{\sigma^2}. \qquad (13)$$

The degrees of freedom measures the flexibility of the fit: the larger the degrees of freedom, the more closely the fit matches the response values. For linear fits of the form $\hat y=Hy$, the degrees of freedom is given by $\operatorname{tr}(H)$. For the lasso, the number of non-zero elements in the solution is an unbiased estimate of the degrees of freedom (Zou et al., 2007). Zou (2005) derived an unbiased estimate for the degrees of freedom of the elastic net: if the elastic net solves

$$\underset{\beta}{\text{minimize}}\ \lVert y-X\beta\rVert_2^2+\lambda_1\lVert\beta\rVert_1+\lambda_2\lVert\beta\rVert_2^2, \qquad (14)$$

then

$$\widehat{\mathrm{df}}=\operatorname{tr}\left[X_{\mathcal A}(X_{\mathcal A}^TX_{\mathcal A}+\lambda_2 I)^{-1}X_{\mathcal A}^T\right] \qquad (15)$$

is an unbiased estimate for the degrees of freedom, where $\mathcal A$ is the active set of the fit and $X_{\mathcal A}$ denotes the corresponding columns of $X$.

Using techniques similar to those of Zou (2005), we can derive an unbiased estimate for the degrees of freedom of the Lariat when the number of groups is equal to one:

###### Theorem 1.

Let $\hat\beta$ be the Lariat solution to

$$\underset{\beta}{\text{minimize}}\ \lVert y-X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1+\theta\,\beta^TVD_{d_1^2-d_j^2}V^T\beta, \qquad (16)$$

where $V$ is the matrix of right singular vectors of $X$. Let $W=D_{\sqrt{d_1^2-d_j^2}}\,V^T$, so that $W^TW=VD_{d_1^2-d_j^2}V^T$. Then an unbiased estimate for the degrees of freedom for the Lariat is

$$\widehat{\mathrm{df}}=\operatorname{tr}\left[X_{\mathcal A}(X_{\mathcal A}^TX_{\mathcal A}+\theta W_{\mathcal A}^TW_{\mathcal A})^{-1}X_{\mathcal A}^T\right], \qquad (17)$$

where $\mathcal A$ is the active set of the fit, and $X_{\mathcal A}$, $W_{\mathcal A}$ denote the corresponding columns of $X$ and $W$.

We verify this estimate through a simulation exercise. Consider the model

$$y^*=X\beta+\sigma\epsilon,\qquad \epsilon\sim N(0,I). \qquad (18)$$

Given this model, for each value of $\lambda$, we can evaluate the degrees of freedom of the Lariat via Monte Carlo methods. For $b=1,\dots,B$, we generate a new response vector $y^{*b}$ according to (18). We fit the Lariat to this data, generating predictions $\hat y^{*b}$. This gives us the Monte Carlo estimate

$$\mathrm{df}(\lambda)\approx\sum_{i=1}^n \widehat{\operatorname{Cov}}(\hat y^*_i,y^*_i)/\sigma^2,\qquad \widehat{\operatorname{Cov}}(\hat y^*_i,y^*_i)=\frac1B\sum_{b=1}^B\left[\hat y^{*b}_i-a_i\right]\left[y^{*b}_i-\bar y^*_i\right],$$

where the $a_i$’s can be any fixed known constants (usually taken to be 0). On the other hand, each Lariat fit gives us an active set which allows us to compute an estimate for the degrees of freedom based on the formula (17). With $B$ replications, we can estimate the expectation of this estimate.

Figure 7 provides evidence for the correctness of (17). The figures on the left are plots of the true value of the degrees of freedom against the mean of the estimate given by (17), with each point corresponding to a value of $\lambda$. We can see that in both settings shown, there is close agreement between the true value and the expectation of the estimate. The figures on the right are plots of the bias along with pointwise 95% confidence intervals; the zero horizontal line lies inside these confidence intervals.
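The Monte Carlo recipe just described applies to any fitting procedure. As a self-contained check (our own toy setup, not the paper's experiment), here it is applied to a ridge smoother $\hat y = Hy$, whose exact degrees of freedom $\operatorname{tr}(H)$ is known:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma, B = 30, 10, 1.0, 4000
X = rng.standard_normal((n, p))
mu = X @ rng.standard_normal(p)                 # true mean X beta

# Ridge smoother and its exact degrees of freedom tr(H)
H = X @ np.linalg.solve(X.T @ X + 2.0 * np.eye(p), X.T)
df_exact = np.trace(H)

# B replications of y* = mu + sigma * noise, with fits y_hat = H y*
Y = mu[:, None] + sigma * rng.standard_normal((n, B))
Yhat = H @ Y

# Covariance formula with a_i = 0, centering y* by its replication mean
cov_i = (Yhat * (Y - Y.mean(axis=1, keepdims=True))).mean(axis=1)
df_mc = cov_i.sum() / sigma ** 2

print(df_exact, df_mc)      # the two values agree up to Monte Carlo error
```

Replacing the ridge fit with the Lariat fit (and `df_exact` with formula (17)) reproduces the comparison behind Figure 7.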

## 7 A simulation study

We tested the performance of the Lariat against other methods in an extensive simulation study. The general framework of the simulation is as follows. The training data has a sample size of $n$, with $p$ features which fall into $K$ groups. Denote the design matrix for group $k$ by $X_k$, and let the SVD of $X_k$ be $U_kD_kV_k^T$. The true signal is a linear combination of the eigenvectors of the $X_k$’s, with coefficient vectors giving the weights of the linear combination. The response is the signal corrupted by additive Gaussian noise. We consider three different scenarios:

• “Home court” for the Lariat: The signal is related to the top eigenvectors of each group, i.e. the non-zero entries of the coefficient vectors are all at the beginning.

• “Neutral court” for the Lariat: The signal is related to random eigenvectors of each group, i.e. the non-zero entries of the coefficient vectors are at random positions.

• “Hostile court” for the Lariat: The signal is related to the bottom eigenvectors of each group, i.e. the non-zero entries of the coefficient vectors are all at the end.

To induce sparsity in the set-up, we also set the coefficient vectors of some groups entirely to zero, corresponding to those groups having no effect on the response. Within each set-up, we looked at a range of signal-to-noise ratios (SNR) in the response, as well as whether there were correlations between predictors in the same group.

For the Lariat, we used the following cross-validation (CV) procedure to select the tuning parameters. For each of 6 values of rat (including rat = 1), we ran the Lariat along a path of $\lambda$ values. (We found that these values of rat covered a good range of models in practice.) For each run, the value of $\lambda$ which gave the smallest CV error was selected, i.e. the lambda.min value as in glmnet. We then compared the CV error across the 6 values of rat and selected the value of rat with the smallest CV error and its accompanying $\lambda$ value. A second version of the Lariat was also run using the same procedure, but with the $\lambda$ values selected by the one standard error rule, i.e. the lambda.1se value as in glmnet.

To compare the methods, we considered the mean-squared error (MSE) achieved on 5,000 test points, as well as the support size of the fitted model. We benchmarked the Lariat against the following methods:

• The null model, i.e. the mean of the responses in the training data.

• Elastic net with CV across $\alpha$ and $\lambda$, with $\lambda$ values selected both by lambda.min and lambda.1se.

• Lasso with CV, with $\lambda$ values selected both by lambda.min and lambda.1se.

• Principal components (PC) regression with CV across ranks.

Overall, we found that in “home court” settings, the Lariat performs the best in terms of test MSE across the range of SNR and feature correlation settings. The gain in performance appears to be larger in low SNR settings compared to high SNR settings. In “neutral court” and “hostile court” settings, the Lariat performs on par with the elastic net and the lasso. This is because the cross-validation step often picks rat = 1 in these settings, under which the Lariat is equivalent to the lasso. In terms of support size, when there is sparsity, the Lariat with $\lambda$ values selected by the one standard error rule tends to pick models which have support size close to the underlying truth.

Below, we present an illustrative sampling of the results: see Appendix D for more comprehensive results across a wider range of simulation settings.

In the first setting, the features come in 10 groups of 100 predictors. The response is a linear combination of the top two eigenvectors of just the first group. The performance on test MSE is shown in Figure 8. The Lariat clearly outperforms the other methods when the predictors within each group are uncorrelated. The performance gain is smaller when predictors within each group are correlated with each other. In terms of support size, while the Lariat with $\lambda$ values selected by lambda.min seems to select models which are too large, the Lariat with $\lambda$ values selected by lambda.1se has support size closer to the underlying truth.

In the second setting, the features come in 10 groups of 20 predictors. The response is a linear combination of two random eigenvectors of just the first group. The performance on test MSE is shown in Figure 9. The Lariat performs comparably to both the elastic net and the lasso, both when the predictors are uncorrelated and when they are correlated. In terms of support size, the Lariat with $\lambda$ values selected by lambda.1se has support size close to the underlying truth, especially when the SNR is high.

In the third setting, the features come in 5 groups of 10 predictors. The response is a linear combination of the bottom eigenvectors of the first two groups. The performance on test MSE is shown in Figure 10. The Lariat performs comparably to both the elastic net and the lasso, both when the predictors are uncorrelated and when they are correlated. In terms of support size, the Lariat with $\lambda$ values selected by lambda.1se has support size close to the underlying truth when the signal-to-noise ratio is high.

## 8 Theoretical properties of the Lariat

In this section, we present some theoretical properties of the Lariat when the number of groups is equal to one, and compare them with those for the lasso as presented in Chapter 11 of Hastie et al. (2015). In particular, we provide non-asymptotic bounds on the $\ell_2$ estimation error and the prediction error for the Lariat which are better than those for the lasso in certain settings. We note that the results in this section can be extended analogously to the Lariat with non-overlapping groups if the $X_k$ are orthogonal to each other.

To make the proofs simpler, we consider a slightly different penalty for the Lariat: instead of taking the matrix in the second penalty term to be $VD_{d_1^2-d_j^2}V^T$, where $V$ is $p\times m$, we take $A$ to be a $p\times p$ diagonal matrix with entries $A_{jj}=d_1^2-d_j^2$ (setting $d_j=0$ for $j>m$).

As the proofs are rather technical, we state just the results here; proofs can be found in Appendix C.

### 8.1 Set-up

We assume that the true underlying model is

$$y=X\beta^*+w, \qquad (19)$$

where $X\in\mathbb R^{n\times p}$ is the design matrix, $y,w\in\mathbb R^n$ are the response and noise vectors respectively, and $\beta^*\in\mathbb R^p$ is the true unknown coefficient vector. We denote the support of $\beta^*$ by $S$. When we solve an optimization problem for $\beta^*$, we denote the estimate by $\hat\beta$ and the error by $\nu=\hat\beta-\beta^*$.

In this section, assume that $\theta>0$ is fixed. (If $\theta=0$, the Lariat is equivalent to the lasso.) Define

$$\tilde X=\begin{pmatrix}X\\ \sqrt{n\theta}\,A^{1/2}\end{pmatrix},\qquad \tilde y=\begin{pmatrix}y\\ 0\end{pmatrix}. \qquad (20)$$

Thus, $\tilde X^T\tilde X=X^TX+n\theta A$ and $\tilde X^T\tilde y=X^Ty$. The key idea is that with this notation, the Lariat is equivalent to the lasso for the augmented matrices $\tilde X$ and $\tilde y$. Explicitly, the constrained form of the Lariat solves

$$\underset{\beta}{\text{minimize}}\ \lVert\tilde y-\tilde X\beta\rVert_2^2\quad\text{subject to}\quad \lVert\beta\rVert_1\le R, \qquad (21)$$

while the Lagrangian form (5) solves

$$\underset{\beta}{\text{minimize}}\ \frac{1}{2n}\lVert\tilde y-\tilde X\beta\rVert_2^2+\lambda\lVert\beta\rVert_1. \qquad (22)$$

(We have changed the fraction in front of the RSS term to $\frac{1}{2n}$ so that the results are more directly comparable to those in Hastie et al. (2015).) In the high-dimensional setting, it is standard to impose a restricted eigenvalue condition on the design matrix $X$:

$$\frac1n\,\nu^TX^TX\nu\ge\gamma\lVert\nu\rVert_2^2\quad\text{for all nonzero }\nu\in\mathcal C, \qquad (23)$$

with $\gamma>0$ and $\mathcal C$ an appropriately chosen constraint set. We assume additionally that $n\gamma<d_1^2$. This is not a restrictive assumption: since $d_1^2$ is the top eigenvalue of $X^TX$, we automatically have $n\gamma\le d_1^2$. Equality can only happen if $\mathcal C$ is a subset of the eigenspace associated with $d_1^2$.

Since $A$ is a positive semidefinite matrix, (23) holds trivially for the augmented matrix as well:

$$\nu^T\tilde X^T\tilde X\nu=\nu^T(X^TX+n\theta A)\nu\ge n\gamma\lVert\nu\rVert_2^2.$$
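The augmented-data identity $\tilde X^T\tilde X = X^TX+n\theta A$ can be checked numerically. In this sketch (our own toy data) we take $A$ to be the positive semidefinite matrix $VD_{d_1^2-d_j^2}V^T$ from the Lagrangian form (5); with $\lambda=0$, least squares on the augmented data then matches the quadratic-penalized solution:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, theta = 20, 5, 0.01
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n)

_, d, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T @ np.diag(d[0] ** 2 - d ** 2) @ Vt
A_half = Vt.T @ np.diag(np.sqrt(d[0] ** 2 - d ** 2)) @ Vt   # PSD square root

# Augmented data: X_tilde = [X; sqrt(n theta) A^{1/2}], y_tilde = [y; 0]
X_tilde = np.vstack([X, np.sqrt(n * theta) * A_half])
y_tilde = np.concatenate([y, np.zeros(p)])

# Gram matrices agree: X_tilde^T X_tilde = X^T X + n theta A
assert np.allclose(X_tilde.T @ X_tilde, X.T @ X + n * theta * A)

# With lambda = 0, augmented least squares = (X^T X + n theta A)^{-1} X^T y
b_aug = np.linalg.lstsq(X_tilde, y_tilde, rcond=None)[0]
b_pen = np.linalg.solve(X.T @ X + n * theta * A, X.T @ y)
assert np.allclose(b_aug, b_pen)
```

Adding the $\ell_1$ penalty on top of the augmented least squares problem recovers (21) and (22) exactly.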

The following key lemma shows that we can improve the constant on the RHS of (23). This will give us better rates of convergence for the Lariat.

###### Lemma 2.

If $X$ satisfies the restricted eigenvalue condition (23), then

$$\nu^T\tilde X^T\tilde X\nu\ge\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]\lVert\nu\rVert_2^2\quad\text{for all nonzero }\nu\in\mathcal C. \qquad (24)$$

We note also that the augmented matrix actually satisfies an unrestricted eigenvalue condition:

###### Lemma 3.

For any design matrix $X$, the augmented data matrix $\tilde X$ satisfies

$$\nu^T\tilde X^T\tilde X\nu\ge\min(n\theta,1)\,d_1^2\,\lVert\nu\rVert_2^2\quad\text{for all }\nu. \qquad (25)$$

This will give us a better “slow rate” for convergence of prediction error, as well as bounds on $\ell_2$ error without requiring a restricted eigenvalue condition.

### 8.2 Bounds on ℓ2 error

The following theorem establishes a bound for the $\ell_2$ error of the constrained form of the Lariat:

###### Theorem 4.

Suppose that $X$ satisfies the restricted eigenvalue bound (23) with parameter $\gamma>0$ over the set $\mathcal C$. Then any estimate $\hat\beta$ based on the constrained Lariat (21) with $R=\lVert\beta^*\rVert_1$ satisfies the bound

$$\lVert\hat\beta-\beta^*\rVert_2\le\frac{4\sqrt{|S|}\,\lVert X^Tw-n\theta A\beta^*\rVert_\infty}{\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]}\le\frac{4\sqrt{|S|}\left(n\theta\lVert A\beta^*\rVert_\infty+\lVert X^Tw\rVert_\infty\right)}{\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]}. \qquad (26)$$

In particular, if $\beta^*$ is aligned with the first principal component of $X$, then $A\beta^*=0$ and

$$\lVert\hat\beta-\beta^*\rVert_2\le\frac{4\sqrt{|S|}\,\lVert X^Tw\rVert_\infty}{\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]}, \qquad (27)$$

which is a better rate of convergence than that for the lasso.

What is interesting is that we can actually obtain a similar (weaker) bound without the restricted eigenvalue condition; the lasso does not have such a bound.

###### Theorem 5.

For any design matrix $X$, any estimate $\hat\beta$ based on the constrained Lariat (21) with $R=\lVert\beta^*\rVert_1$ satisfies the bound

$$\lVert\hat\beta-\beta^*\rVert_2\le\frac{4\sqrt{|S|}\,\lVert X^Tw-n\theta A\beta^*\rVert_\infty}{\min(n\theta,1)\,d_1^2}. \qquad (28)$$

We can obtain similar results for the Lagrangian form of the Lariat:

###### Theorem 6.

Suppose that $X$ satisfies the restricted eigenvalue bound (23) with parameter $\gamma>0$ over the set $\mathcal C$. Then any estimate $\hat\beta$ based on the Lagrangian form of the Lariat (22) with $\lambda\ge\frac{2}{n}\lVert X^Tw-n\theta A\beta^*\rVert_\infty$ satisfies the bound

$$\lVert\hat\beta-\beta^*\rVert_2\le\frac{3\lambda\sqrt{|S|}}{\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]/n}. \qquad (29)$$

If we remove the restricted eigenvalue condition on $X$, then the bound (29) holds but with the denominator of the RHS being $\min(n\theta,1)\,d_1^2/n$ instead.

### 8.3 Bounds on prediction error

Like the lasso, the Lariat has a “slow rate” convergence for prediction error:

###### Theorem 7.

For any design matrix $X$, consider the Lagrangian form of the Lariat (22) with $\lambda\ge\frac{2}{n}\lVert X^Tw-n\theta A\beta^*\rVert_\infty$. We have the following bound on prediction error:

$$\frac1n\lVert X(\hat\beta-\beta^*)\rVert_2^2\le 12\,\lambda\lVert\beta^*\rVert_1. \qquad (30)$$

By using Lemma 3, we can get a smaller bound for the RHS, although the expression is much more complicated:

###### Theorem 8.

For any design matrix $X$, consider the Lagrangian form of the Lariat (22) with $\lambda\ge\frac{2}{n}\lVert X^Tw-n\theta A\beta^*\rVert_\infty$. Let $\kappa=\min(n\theta,1)\,d_1^2/n$. We have the following bound on prediction error:

$$\frac1n\lVert X(\hat\beta-\beta^*)\rVert_2^2\le\frac{3\lambda\left(-\lambda p+\sqrt{\lambda^2p^2+32\lambda\lVert\beta^*\rVert_1\kappa p}\right)}{4\kappa}. \qquad (31)$$

This means that the Lariat has a better “slow rate” convergence than the lasso.

The preceding two theorems hold for all design matrices $X$. If we are willing to assume that $X$ satisfies the restricted eigenvalue condition, we recover the same “fast rate” convergence for prediction error as that for the lasso.

###### Theorem 9.

Suppose that $X$ satisfies the restricted eigenvalue bound (23) with parameter $\gamma>0$ over the set $\mathcal C$. For the Lagrangian form of the Lariat with $\lambda\ge\frac{2}{n}\lVert X^Tw-n\theta A\beta^*\rVert_\infty$, we have

$$\frac1n\lVert X(\hat\beta-\beta^*)\rVert_2^2\le\frac{9|S|\lambda^2}{\min\left[(1-n\theta)n\gamma+n\theta d_1^2,\ d_1^2\right]/n}. \qquad (32)$$

## 9 Discussion

We have proposed a new method for supervised learning that exploits the correlation and group structure of predictors to potentially improve prediction performance. It combines the power of PC regression with the sparsity of the lasso. There are several interesting avenues for further research:

• Efficient screening rules: To speed up the computation of the lasso solution, glmnet uses strong rules (Tibshirani et al., 2012) to discard predictors which are likely to have a coefficient of zero. These strong rules can greatly reduce the number of variables entering the optimization, thus speeding up computation. Using similar techniques, we can derive strong rules for the Lariat to improve computational efficiency. These rules are provided in Appendix B, and merit further investigation.

• Use of different kinds of correlation or grouping information. Rather than use the covariance of the observed covariates in a group, one could use other kinds of external information to form this covariance, for example gene or protein pathways. Alternatively, one could use unsupervised clustering to generate the feature groups.

• Optimality of the quadratic penalty term: Considering the general class of penalty functions of the form $\beta^T V\,\mathrm{diag}\{f(d_j)\}\,V^T\beta$, one could ask whether the Lariat’s choice $f(d_j) = d_1^2 - d_j^2$ is “optimal” in any sense. One could also look for choices of $f$ which satisfy certain notions of optimality.

An R language package lariat will soon be made available on the CRAN repository.

Acknowledgements: We’d like to thank Trevor Hastie and Jacob Bien for helpful comments. Robert Tibshirani was supported by NIH grant 5R01 EB001988-16 and NSF grant DMS-1208164.

## Appendix A Details of the Lariat penalty contours for two predictors

Assume that we only have two predictors, standardized to have mean zero and equal sums of squares. Then $x_1^T x_2 \propto \rho$, where $\rho$ is the sample correlation between the two predictors. Write the SVD of $X$ as $X = UDV^T$. If $\rho > 0$, the expressions for the SVD of $X$ and for $A$ are

\[
X = U\begin{pmatrix}1+\rho & 0\\ 0 & 1-\rho\end{pmatrix}\begin{pmatrix}1/\sqrt{2} & 1/\sqrt{2}\\ 1/\sqrt{2} & -1/\sqrt{2}\end{pmatrix},\qquad
A = 2\rho\begin{pmatrix}1 & -1\\ -1 & 1\end{pmatrix}.
\]

If $\rho < 0$, the corresponding expressions are

\[
X = U\begin{pmatrix}1-\rho & 0\\ 0 & 1+\rho\end{pmatrix}\begin{pmatrix}1/\sqrt{2} & -1/\sqrt{2}\\ 1/\sqrt{2} & 1/\sqrt{2}\end{pmatrix},\qquad
A = -2\rho\begin{pmatrix}1 & 1\\ 1 & 1\end{pmatrix}.
\]

With these expressions, the Lariat penalty can be written explicitly as

\[
\begin{cases}
\lambda(|\beta_1|+|\beta_2|) + 2\theta\rho(\beta_1-\beta_2)^2 & \text{if } \rho > 0,\\
\lambda(|\beta_1|+|\beta_2|) - 2\theta\rho(\beta_1+\beta_2)^2 & \text{if } \rho < 0.
\end{cases}
\]

Considering the signs of $\beta_1$ and $\beta_2$, we can get an even more explicit description of the contours. We provide the description for $\rho > 0$ below; the contour at level $c > 0$ consists of the following pieces:

• $\beta_1 \geq 0,\ \beta_2 \geq 0$: The parabola $\lambda(\beta_1+\beta_2) + 2\theta\rho(\beta_1-\beta_2)^2 = c$, rotated $45°$ clockwise.

• $\beta_1 \geq 0,\ \beta_2 \leq 0$: The line $\beta_1 - \beta_2 = t$, where $t > 0$ solves $\lambda t + 2\theta\rho t^2 = c$.

• $\beta_1 \leq 0,\ \beta_2 \leq 0$: The parabola $-\lambda(\beta_1+\beta_2) + 2\theta\rho(\beta_1-\beta_2)^2 = c$, rotated $45°$ clockwise.

• $\beta_1 \leq 0,\ \beta_2 \geq 0$: The line $\beta_2 - \beta_1 = t$, with the same $t$.
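As a numerical sanity check of the expressions above (here for the $\rho > 0$ case, with an arbitrary illustrative $\rho$ and test point $\beta$), the following NumPy sketch builds $A = V\,\mathrm{diag}(d_1^2 - d_j^2)\,V^T$ from the displayed singular values and verifies both the closed form of $A$ and the quadratic part of the penalty:

```python
import numpy as np

rho = 0.6                                              # any 0 < rho < 1
V = np.array([[1.0, 1.0], [1.0, -1.0]]) / np.sqrt(2)   # right singular vectors
d = np.array([1.0 + rho, 1.0 - rho])                   # singular values as displayed

# A = V diag(d1^2 - d_j^2) V^T, which should equal 2*rho*[[1,-1],[-1,1]]
A = V @ np.diag(d[0] ** 2 - d ** 2) @ V.T
assert np.allclose(A, 2 * rho * np.array([[1.0, -1.0], [-1.0, 1.0]]))

# Quadratic part of the penalty: beta^T A beta = 2*rho*(beta1 - beta2)^2
beta = np.array([0.8, -0.5])                           # arbitrary test point
assert np.isclose(beta @ A @ beta, 2 * rho * (beta[0] - beta[1]) ** 2)
```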

## Appendix B Strong rules for the Lariat

Recall that in computing the Lariat solution, we typically hold $\theta$ fixed and compute the solution over a path of $\lambda$ values. By casting the Lariat optimization problem as a lasso problem with a different response and design matrix, we can derive strong rules for the Lariat from those for the lasso.

We first derive strong rules for the Lariat where predictors do not have preassigned groups, i.e. the solution to (5). If we define

\[
\tilde{y} = \begin{pmatrix}y\\ 0\end{pmatrix},\qquad \tilde{X} = \begin{pmatrix}X\\ \sqrt{\theta}\,A^{1/2}\end{pmatrix}, \tag{33}
\]

where $A^{1/2}$ is a symmetric square root of $A$, then the Lariat is equivalent to the lasso with response vector $\tilde{y}$ and design matrix $\tilde{X}$. Fix $\theta$, fix a decreasing path of $\lambda$ values $\lambda_1 \geq \lambda_2 \geq \cdots$, and let $\hat{\beta}(\lambda)$ denote the Lariat solution at $\lambda$. Applying the lasso strong rules to this set-up, the sequential strong rule for the Lariat for discarding predictor $j$ at $\lambda = \lambda_i$ is

\[
\begin{aligned}
\Bigl|X_j^T\bigl(y - X\hat{\beta}(\lambda_{i-1})\bigr) - \theta\,(A^{1/2})_j^T A^{1/2}\hat{\beta}(\lambda_{i-1})\Bigr| &< 2\lambda_i - \lambda_{i-1}\\
\Longleftrightarrow\quad \Bigl|X_j^T\bigl(y - X\hat{\beta}(\lambda_{i-1})\bigr) - \theta\bigl(A\hat{\beta}(\lambda_{i-1})\bigr)_j\Bigr| &< 2\lambda_i - \lambda_{i-1}.
\end{aligned}
\tag{34}
\]

Next we derive strong rules for the Lariat where predictors come in preassigned non-overlapping groups, i.e. the solution to (8). If we define $A$ as the block diagonal matrix built from the group-wise matrices $A_1, \dots, A_K$,

\[
A = \begin{pmatrix} A_1 & & 0\\ & \ddots & \\ 0 & & A_K \end{pmatrix},
\]

and define $\tilde{y}$ and $\tilde{X}$ as in (33), then the Lariat is again equivalent to the lasso with response $\tilde{y}$ and design matrix $\tilde{X}$. This results in the same sequential strong rule as in the single-group case (34), although the matrix $A$ is defined differently.
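To illustrate the reformulation concretely, here is a minimal NumPy sketch for the single-group case, with illustrative dimensions and an arbitrary trial $\beta$ (the previous solution and $\lambda$ values in the screening step are placeholders, not solver output). It builds $\tilde{y}$ and $\tilde{X}$ as in (33), checks that the augmented residual sum of squares equals the Lariat RSS plus the quadratic penalty $\theta\beta^TA\beta$, and evaluates the screening statistic of (34):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta = 20, 5, 0.4                      # illustrative sizes
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

# Single-group A = V diag(d1^2 - d_j^2) V^T from the SVD of X,
# together with its symmetric square root A^{1/2}.
_, d, Vt = np.linalg.svd(X, full_matrices=False)
A = Vt.T @ np.diag(d[0] ** 2 - d ** 2) @ Vt
A_half = Vt.T @ np.diag(np.sqrt(d[0] ** 2 - d ** 2)) @ Vt

# Augmented response and design matrix as in (33).
y_tilde = np.concatenate([y, np.zeros(p)])
X_tilde = np.vstack([X, np.sqrt(theta) * A_half])

# The augmented RSS equals the Lariat RSS plus the quadratic penalty,
# so the Lariat is a lasso in (y_tilde, X_tilde).
beta = rng.standard_normal(p)                 # arbitrary trial coefficients
lhs = np.sum((y_tilde - X_tilde @ beta) ** 2)
rhs = np.sum((y - X @ beta) ** 2) + theta * beta @ A @ beta
assert np.isclose(lhs, rhs)

# Sequential strong rule (34) at lambda_i, given the solution at lambda_{i-1}
# (beta_prev and the lambda values here are placeholders for illustration).
beta_prev = np.zeros(p)
lam_prev, lam_cur = 1.0, 0.9
score = np.abs(X.T @ (y - X @ beta_prev) - theta * (A @ beta_prev))
discard = score < 2 * lam_cur - lam_prev      # predictors screened out
```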

## Appendix C Proofs for Section 8

Before presenting the proofs for Section 8, we first prove two technical lemmas showing that the Lariat solution lies in the constraint set which we use for the restricted eigenvalue condition (23).

###### Lemma 10.

If we solve the constrained form of the Lariat (21) with $R = \|\beta^*\|_1$, then $\hat{\beta} - \beta^*$ lies in the constraint set of the restricted eigenvalue condition (23).

###### Proof.

Let $\hat{\nu} = \hat{\beta} - \beta^*$. Since $\beta^*$ is feasible for (21), we have $\|\beta^* + \hat{\nu}\|_1 = \|\hat{\beta}\|_1 \leq \|\beta^*\|_1$. By the triangle inequality, $\|\beta^*\|_1 \geq \|\beta^* + \hat{\nu}\|_1 \geq \|\beta^*_S\|_1 - \|\hat{\nu}_S\|_1 + \|\hat{\nu}_{S^c}\|_1$, so $\|\hat{\nu}_{S^c}\|_1 \leq \|\hat{\nu}_S\|_1$, which implies membership in the constraint set.

###### Lemma 11.

If we solve the Lagrangian form of the Lariat (22) with $\lambda \geq \frac{2}{n}\bigl\|X^Tw - n\theta A\beta^*\bigr\|_\infty$, then $\hat{\beta} - \beta^*$ lies in the constraint set of the restricted eigenvalue condition (23).

###### Proof.

Define $\hat{\nu} = \hat{\beta} - \beta^*$ and $\tilde{w} = \tilde{y} - \tilde{X}\beta^*$. By definition, the objective value of (22) at $\hat{\beta} = \beta^* + \hat{\nu}$ is no larger than at $\beta^*$, so

\[
\begin{aligned}
\frac{1}{2n}\bigl\|\tilde{w} - \tilde{X}\hat{\nu}\bigr\|_2^2 + \lambda\|\beta^* + \hat{\nu}\|_1 &\leq \frac{1}{2n}\|\tilde{w}\|_2^2 + \lambda\|\beta^*\|_1,\\
\frac{1}{2n}\hat{\nu}^T(X^TX + n\theta A)\hat{\nu} &\leq \frac{\tilde{w}^T\tilde{X}\hat{\nu}}{n} + \lambda\bigl(\|\beta^*\|_1 - \|\beta^* + \hat{\nu}\|_1\bigr).
\end{aligned}
\tag{35}
\]

Note that $\|\beta^*\|_1 - \|\beta^* + \hat{\nu}\|_1 \leq \|\hat{\nu}_S\|_1 - \|\hat{\nu}_{S^c}\|_1$; substituting this in the above gives

\[
\frac{1}{2n}\hat{\nu}^T(X^TX + n\theta A)\hat{\nu} \leq \frac{\tilde{w}^T\tilde{X}\hat{\nu}}{n} + \lambda\bigl(\|\hat{\nu}_S\|_1 - \|\hat{\nu}_{S^c}\|_1\bigr). \tag{36}
\]

By Hölder’s inequality and our choice of $\lambda$, we have

\[
\frac{\tilde{w}^T\tilde{X}\hat{\nu}}{n} \leq \frac{1}{n}\bigl\|\tilde{X}^T\tilde{w}\bigr\|_\infty\|\hat{\nu}\|_1
\]