[

[

Yoshimasa Uematsu, Yingying Fan, Kun Chen, Jinchi Lv and Wei Lin
Abstract

Many modern big data applications feature large scale in both numbers of responses and predictors. Better statistical efficiency and scientific insights can be enabled by understanding the large-scale response-predictor association network structures via layers of sparse latent factors ranked by importance. Yet sparsity and orthogonality have been two largely incompatible goals. To accommodate both features, in this paper we suggest the method of sparse orthogonal factor regression (SOFAR) via the sparse singular value decomposition with orthogonality constrained optimization to learn the underlying association networks, with broad applications to both unsupervised and supervised learning tasks such as biclustering with sparse singular value decomposition, sparse principal component analysis, sparse factor analysis, and spare vector autoregression analysis. Exploiting the framework of convexity-assisted nonconvex optimization, we derive nonasymptotic error bounds for the suggested procedure characterizing the theoretical advantages. The statistical guarantees are powered by an efficient SOFAR algorithm with convergence property. Both computational and theoretical advantages of our procedure are demonstrated with several simulation and real data examples.

SOFAR ]SOFAR: large-scale association network learning thanks: This work was supported by a Grant-in-Aid for JSPS Fellows 26-1905, NSF CAREER Awards DMS-0955316 and DMS-1150318, NIH grant U01 HL114494, NSF Grant DMS-1613295, and a grant from the Simons Foundation. The authors would like to thank the Joint Editor, Associate Editor, and referees for their valuable comments that have helped improve the paper significantly. Part of this work was completed while Fan and Lv visited the Departments of Statistics at University of California, Berkeley and Stanford University. These authors sincerely thank both departments for their hospitality.

Keywords: Big data; Large-scale association network; Simultaneous response and predictor selection; Latent factors; Sparse singular value decomposition; Orthogonality constrained optimization; Nonconvex statistical learning

1 Introduction

The genetics of gene expression variation may be complex due to the presence of both local and distant genetic effects and shared genetic components across multiple genes (Brem and Kruglyak, 2005; Cai et al., 2013). A useful statistical analysis in such studies is to simultaneously classify the genetic variants and gene expressions into groups that are associated. For example, in a yeast expression quantitative trait loci (eQTLs) mapping analysis, the goal is to understand how the eQTLs, which are regions of the genome containing DNA sequence variants, influence the expression level of genes in the yeast MAPK signaling pathways. Extensive genetic and biochemical analysis has revealed that there are a few functionally distinct signaling pathways of genes (Gustin et al., 1998; Brem and Kruglyak, 2005), suggesting that the association structure between the eQTLs and the genes is of low rank. Each signaling pathway involves only a subset of genes, which are regulated by only a few genetic variants, suggesting that each association between the eQTLs and the genes is sparse in both the input and the output (or in both the responses and the predictors), and the pattern of sparsity should be pathway specific. Moreover, it is known that the yeast MAPK pathways regulate and interact with each other (Gustin et al., 1998). The complex genetic structures described above clearly call for a joint statistical analysis that can reveal multiple distinct associations between subsets of genes and subsets of genetic variants. If we treat the genetic variants and gene expressions as the predictors and responses, respectively, in a multivariate regression model, the task can then be carried out by seeking a sparse representation of the coefficient matrix and performing predictor and response selection simultaneously. The problem of large-scale response-predictor association network learning is indeed of fundamental importance in many modern big data applications featuring large scale in both numbers of responses and predictors.

Observing independent pairs , , with the covariate vector and the response vector, motivated from the above applications we consider the following multivariate regression model

 Y=XC∗+E, (0)

where is the response matrix, is the predictor matrix, is the true regression coefficient matrix, and is the error matrix. To model the sparse relationship between the responses and the predictors as in the yeast eQTLs mapping analysis, we exploit the following singular value decomposition (SVD) of the coefficient matrix

 C∗=U∗D∗V∗T=r∑j=1d∗ju∗jv∗Tj, (0)

where is the rank of matrix , is a diagonal matrix of nonzero singular values, and and are the orthonormal matrices of left and right singular vectors, respectively. Here, we assume that is low-rank with only nonzero singular values, and the matrices and are sparse.

Under the sparse SVD structure (1), model (1) can be rewritten as

where , , and are the matrices of latent responses, predictors, and random errors, respectively. The associations between the predictors and responses are thus diagonalized under the pairs of transformations specified by and . When is of low rank, this provides an appealing low-dimensional latent model interpretation for model (1). Further, note that the latent responses and predictors are linear combinations of the original responses and predictors, respectively. Thus, the interpretability of the SVD can be enhanced if we require that the left and right singular vectors be sparse so that each latent predictor/response involves only a small number of the original predictors/responses, thereby performing the task of variable selection among the predictors/responses, as needed in the yeast eQTLs analysis.

The above model (1) with low-rank coefficient matrix has been commonly adopted in the literature. In particular, the reduced rank regression (Anderson, 1951; Izenman, 1975; Reinsel and Velu, 1998) is an effective approach to dimension reduction by constraining the coefficient matrix to be of low rank. Bunea et al. (2011) proposed a rank selection criterion that can be viewed as an regularization on the singular values of . The popularity of regularization methods such as the Lasso (Tibshirani, 1996) led to the development of nuclear norm regularization in multivariate regression (Yuan et al., 2007). Chen et al. (2013) proposed an adaptive nuclear norm penalization approach to bridge the gap between and regularization methods and combine some of their advantages. With the additional SVD structure (1), Chen et al. (2012) proposed a new estimation method with a correctly specified rank by imposing a weighted penalty on each rank-1 SVD layer for the classical setting of fixed dimensionality. Chen and Huang (2012) and Bunea et al. (2012) explored a low-rank representation of in which the rows of are sparse; however, their approaches do not impose sparsity on the right singular vectors and, hence, are inapplicable to settings with high-dimensional responses where response selection is highly desirable.

Recently, there have been some new developments in sparse and low-rank regression problems. Ma and Sun (2014) studied the properties of row-sparse reduced-rank regression model with nonconvex sparsity-inducing penalties, and later Ma et al. (2014b) extended their work to two-way sparse reduced-rank regression. Chen and Huang (2016) extended the row-sparse reduced-rank regression by incorporating covariance matrix estimation, and the authors mainly focused on computational issues. Lian et al. (2015) proposed a semiparametric reduced-rank regression with a sparsity penalty on the coefficient matrix itself. Goh et al. (2017) studied the Bayesian counterpart of the row/column-sparse reduced-rank regression and established its posterior consistency. However, none of these works considered the possible entrywise sparsity in the SVD of the coefficient matrix. The sparse and low-rank regression models have also been applied in various fields to solve important scientific problems. To name a few, Chen et al. (2014) applied a sparse and low-rank bi-linear model for the task of source-sink reconstruction in marine ecology, Zhu et al. (2014) used a Bayesian low-rank model for associating neuroimaging phenotypes and genetic markers, and Ma et al. (2014a) used a threshold SVD regression model for learning regulatory relationships in genomics.

In view of the key role that the sparse SVD plays for simultaneous dimension reduction and variable selection in model (1), in this paper we suggest a unified regularization approach to estimating such a sparse SVD structure. Our proposal successfully meets three key methodological challenges that are posed by the complex structural constraints on the SVD. First, sparsity and orthogonality are two largely incompatible goals and would seem difficult to be accommodated within a single framework. For instance, a standard orthogonalization process such as QR factorization will generally destroy the sparsity pattern of a matrix. Previous methods either relaxed the orthogonality constraint to allow efficient search for sparsity patterns (Chen et al., 2012), or avoided imposing both sparsity and orthogonality requirements on the same factor matrix (Chen and Huang, 2012; Bunea et al., 2012). To resolve this issue, we formulate our approach as an orthogonality constrained regularization problem, which yields simultaneously sparse and orthogonal factor matrices in the SVD. Second, we employ the nuclear norm penalty to encourage sparsity among the singular values and achieve rank reduction. As a result, our method produces a continuous solution path, which facilitates rank parameter tuning and distinguishes it from the regularization method adopted by Bunea et al. (2012). Third, unlike rank-constrained estimation, the nuclear norm penalization approach makes the estimation of singular vectors more intricate, since one does not know a priori which singular values will vanish and, hence, which pairs of left and right singular vectors are unidentifiable. Noting that the degree of identifiability of the singular vectors increases with the singular value, we propose to penalize the singular vectors weighted by singular values, which proves to be meaningful and effective. Combining these aspects, we introduce sparse orthogonal factor regression (SOFAR), a novel regularization framework for high-dimensional multivariate regression. While respecting the orthogonality constraint, we allow the sparsity-inducing penalties to take a general, flexible form, which includes special cases that adapt to the entrywise and rowwise sparsity of the singular vector matrices, resulting in a nonconvex objective function for the SOFAR method.

In addition to the aforementioned three methodological challenges, the nonconvexity of the SOFAR objective function also poses important algorithmic and theoretical challenges in obtaining and characterizing the SOFAR estimator. To address these challenges, we suggest a two-step approach exploiting the framework of convexity-assisted nonconvex optimization (CANO) to obtain the SOFAR estimator. More specifically, in the first step we minimize the -penalized squared loss for the multivariate regression (1) to obtain an initial estimator. Then in the second step, we minimize the SOFAR objective function in an asymptotically shrinking neighborhood of the initial estimator. Thanks to the convexity of its objective function, the initial estimator can be obtained effectively and efficiently. Yet since the finer sparsity structure imposed through the sparse SVD (1) is completely ignored in the first step, the initial estimator meets none of the aforementioned three methodological challenges. Nevertheless, since it is theoretically guaranteed that the initial estimator is not far away from the true coefficient matrix with asymptotic probability one, searching in an asymptotically shrinking neighborhood of the initial estimator significantly alleviates the nonconvexity issue of the SOFAR objective function. In fact, under the framework of CANO we derive nonasymptotic bounds for the prediction, estimation, and variable selection errors of the SOFAR estimator characterizing the theoretical advantages. In implementation, to disentangle the sparsity and orthogonality constraints we develop an efficient SOFAR algorithm and establish its convergence properties.

Our suggested SOFAR method for large-scale association network learning is in fact connected to a variety of statistical methods in both unsupervised and supervised multivariate analysis. For example, the sparse SVD and sparse principal component analysis (PCA) for a high-dimensional data matrix can be viewed as unsupervised versions of our general method. Other prominent examples include sparse factor models, sparse canonical correlation analysis (Witten et al., 2009), and sparse vector autoregressive (VAR) models for high-dimensional time series. See Section 2.2 for more details on these applications and connections.

The rest of the paper is organized as follows. Section 2 introduces the SOFAR method and discusses its applications to several unsupervised and supervised learning tasks. We present the nonasymptotic properties of the method in Section 3. Section 4 develops an efficient optimization algorithm and discusses its convergence and tuning parameter selection. We provide several simulation and real data examples in Section 5. All the proofs of main results and technical details are detailed in the Supplementary Material. An associated R package implementing the suggested method is available at http://www-bcf.usc.edu/~jinchilv/publications/software.

2 Large-scale association network learning via SOFAR

2.1 Sparse orthogonal factor regression

To estimate the sparse SVD of the true regression coefficient matrix in model (1), we start by considering an estimator of the form , where with and is a diagonal matrix of singular values, and and are orthonormal matrices of left and right singular vectors, respectively. Although it is always possible to take without prior knowledge of the rank , it is often sufficient in practice to take a small that is slightly larger than the expected rank (estimated by some procedure such as in Bunea et al. (2011)), which can dramatically reduce computation time and space. Throughout the paper, for any matrix we denote by , , , and the Frobenius norm, entrywise -norm, entrywise -norm, and rowwise -norm defined, respectively, as , , , and . We also denote by the induced matrix norm (operator norm).

As mentioned in the Introduction, we employ the nuclear norm penalty to encourage sparsity among the singular values, which is exactly the entrywise penalty on . Penalization directly on and , however, is inappropriate since the singular vectors are not equally identifiable and should not be subject to the same amount of regularization. Singular vectors corresponding to larger singular values can be estimated more accurately and should contribute more to the regularization, whereas those corresponding to vanishing singular values are unidentifiable and should play no role in the regularization. Therefore, we propose an importance weighting by the singular values and place sparsity-inducing penalties on the weighted versions of singular vector matrices, and . Also taking into account the orthogonality constraints on and , we consider the orthogonality constrained optimization problem

 (0)

where and are penalty functions to be clarified later, and are tuning parameters that control the strengths of regularization. We call this regularization method sparse orthogonal factor regression (SOFAR) and the regularized estimator the SOFAR estimator. Note that and can be equal or distinct, depending on the scientific question and the goals of variable selection. Letting while setting reduces the SOFAR estimator to the sparse reduced-rank estimator of Chen and Huang (2012). In view of our choices of and , although appears in all three penalty terms, rank reduction is achieved mainly through the first term, while variable selection is achieved through the last two terms under necessary scalings by .

Note that for simplicity we do not explicitly state the ordering constraint in optimization problem (2.1). In fact, when and are matrix norms that satisfy certain invariance properties, such as the entrywise -norm and rowwise -norm, this constraint can be easily enforced by simultaneously permuting and/or changing the signs of the singular values and the corresponding singular vectors. The orthogonality constraints are, however, essential to the optimization problem in that a solution cannot be simply obtained through solving the unconstrained regularization problem followed by an orthogonalization process. The interplay between sparse regularization and orthogonality constraints is crucial for achieving important theoretical and practical advantages, which distinguishes our SOFAR method from most previous procedures.

2.2 Applications of SOFAR

The SOFAR method provides a unified framework for a variety of statistical problems in multivariate analysis. We give four such examples, and in each example, briefly review existing techniques and suggest new methods.

2.2.1 Biclustering with sparse SVD

The biclustering problem of a data matrix, which can be traced back to Hartigan (1972), aims to simultaneously cluster the rows (samples) and columns (features) of a data matrix into statistically related subgroups. A variety of biclustering techniques, which differ in the criteria used to relate clusters of samples and clusters of features and in whether overlapping of clusters is allowed, have been suggested as useful tools in the exploratory analysis of high-dimensional genomic and text data. See, for example, Busygin et al. (2008) for a survey. One way of formulating the biclustering problem is through the mean model

 X=C∗+E, (0)

where the mean matrix admits a sparse SVD (1) and the sparsity patterns in the left (or right) singular vectors serve as indicators for the samples (or features) to be clustered. Lee et al. (2010) proposed to estimate the first sparse SVD layer by solving the optimization problem

 (^d,ˆu,ˆv)=argmind,u,v{12∥X−duvT∥2F+λaρa(du)+λbρb(dv)}=subject to∥u∥2=1,∥v∥2=1, (0)

and obtain the next sparse SVD layer by applying the same procedure to the residual matrix . Clearly, problem (2.2.1) is a specific example of the SOFAR problem (2.1) with and ; however, the orthogonality constraints are not maintained during the layer-by-layer extraction process. The orthogonality issue also exists in most previous proposals, for example, Zhang et al. (2002).

The multivariate linear model (1) with a sparse SVD (1) can be viewed as a supervised version of the above biclustering problem, which extends the mean model (2.2.1) to a general design matrix and can be used to identify interpretable clusters of predictors and clusters of responses that are significantly associated. Applying the SOFAR method to model (2.2.1) yields the new estimator

 (0)

which estimates all sparse SVD layers simultaneously while determining the rank by nuclear norm penalization and preserving the orthogonality constraints.

2.2.2 Sparse PCA

A useful technique closely related to sparse SVD is sparse principal component analysis (PCA), which enhances the convergence and improves the interpretability of PCA by introducing sparsity in the loadings of principal components. There has been a fast growing literature on sparse PCA due to its importance in dimension reduction for high-dimensional data. Various formulations coupled with efficient algorithms, notably through regularization and its and semidefinite relaxations, have been proposed by Zou et al. (2006), d’Aspremont et al. (2007), Shen and Huang (2008), Johnstone and Lu (2009), and Guo et al. (2010), among others.

We are interested in two different ways of casting sparse PCA in our sparse SVD framework. The first approach bears a resemblance to the proposal of Zou et al. (2006), which formulates sparse PCA as a regularized multivariate regression problem with the data matrix treated as both the responses and the predictors. Specifically, they proposed to solve the optimization problem

 (0)

and the loading vectors are given by the normalized columns of , , . However, the orthogonality of the loading vectors, a desirable property enjoyed by the standard PCA, is not enforced by problem (2.2.2). Similarly applying the SOFAR method leads to the estimator

 (ˆD,ˆU,ˆV) =subject toUTU=Im,VTV=Im,

which explicitly imposes orthogonality among the loading vectors (the columns of ). One can optionally ignore the nuclear norm penalty and determine the number of principal components by some well-established criterion.

The second approach exploits the connection of sparse PCA with regularized SVD suggested by Shen and Huang (2008). They proposed to solve the rank-1 matrix approximation problem

 (ˆu,ˆb)=argminu,b{12∥X−ubT∥2F+λbρb(b)}=subject to∥u∥2=1, (0)

and obtain the first loading vector . Applying the SOFAR method similarly to the rank- matrix approximation problem yields the estimator

 (ˆD,ˆU,ˆV) =subject toUTU=Im,VTV=Im,

which constitutes a multivariate generalization of problem (2.2.2), with the desirable orthogonality constraint imposed on the loading vectors (the columns of ) and the optional nuclear norm penalty useful for determining the number of principal components.

2.2.3 Sparse factor analysis

Factor analysis plays an important role in dimension reduction and feature extraction for high-dimensional time series. A low-dimensional factor structure is appealing from both theoretical and practical angles, and can be conveniently incorporated into many other statistical tasks, such as forecasting with factor-augmented regression (Stock and Watson, 2002) and covariance matrix estimation (Fan et al., 2008). See, for example, Bai and Ng (2008) for an overview.

Let be a vector of observed time series. Consider the factor model

 xt=Λft+et,t=1,…,T, (0)

where is a vector of latent factors, is the factor loading matrix, and is the idiosyncratic error. Most existing methods for high-dimensional factor models rely on classical PCA (Bai and Ng, 2002; Bai, 2003) or maximum likelihood to estimate the factors and factor loadings (Bai and Li, 2016, 2012); as a result, the estimated factors and loadings are generally nonzero. However, in order to assign economic meanings to the factors and loadings and to further mitigate the curse of dimensionality, it would be desirable to introduce sparsity in the factors and loadings. Writing model (2.2.3) in the matrix form

 X=FΛT+E

with , , and reveals its equivalence to model (2.2.1). Therefore, under the usual normalization restrictions that and is diagonal, we can solve for in problem (2.2.1) and estimate the sparse factors and loadings by and .

2.2.4 Sparse VAR analysis

Vector autoregressive (VAR) models have been widely used to analyze the joint dynamics of multivariate time series; see, for example, Stock and Watson (2001). Classical VAR analysis suffers greatly from the large number of free parameters in a VAR model, which grows quadratically with the dimensionality. Early attempts in reducing the impact of dimensionality have explored reduced rank methods such as canonical analysis and reduced rank regression (Box and Tiao, 1977; Velu et al., 1986). Regularization methods such as the Lasso have recently been adapted to VAR analysis for variable selection (Hsu et al., 2008; Nardi and Rinaldo, 2011; Kock and Callot, 2015; Basu and Michailidis, 2015).

We present an example in which our parsimonious model setup is most appropriate. Suppose we observe the data , where is a low-dimensional vector of time series whose dynamics are of primary interest, and is a high-dimensional vector of informational time series. We assume that are generated by the VAR equation

 xt=C∗Txt−1+et,

where has a sparse SVD (1). This implies a low-dimensional latent model of the form

 gt=D∗ft−1+˜et,

where , , and . Following the factor-augmented VAR (FAVAR) approach of Bernanke et al. (2005), we augment the latent factors and to the dynamic equation of and consider the joint model

 (ytgt)=(ATBT0D∗)(yt−1ft−1)+(εt˜et).

We can estimate the parameters , , and by a two-step method: first apply the SOFAR method to obtain estimates of and , and then estimate and by a usual VAR since both and are of low dimensionality. Our approach differs from previous methods in that we enforce sparse factor loadings; hence, it would allow the factors to be given economic interpretations and would be useful for uncovering the structural relationships underlying the joint dynamics of .

3 Theoretical properties

We now investigate the theoretical properties of the SOFAR estimator (2.1) for model (1) under the sparse SVD structure (1). Our results concern nonasymptotic error bounds, where both response dimensionality and predictor dimensionality can diverge simultaneously with sample size . The major theoretical challenges stem from the nonconvexity issues of our optimization problem which are prevalent in nonconvex statistical learning.

3.1 Technical conditions

We begin with specifying a few assumptions that facilitate our technical analysis. To simplify the technical presentation, we focus on the scenario of and our proofs can be adapted easily to the case of with the only difference that the rates of convergence in Theorems 1 and 2 will be modified correspondingly. Assume that each column of , with , has been rescaled such that . The SOFAR method minimizes the objective function in (2.1). Since the true rank is unknown and we cannot expect that one can choose to perfectly match , the SOFAR estimates , , and are generally of different sizes than , , and , respectively. To ease the presentation, we expand the dimensions of matrices , , and by simply adding columns and rows of zeros to the right and to the bottom of each of the matrices to make them of sizes , , and , respectively. We also expand the matrices , , and similarly to match the sizes of , , and , respectively. Define and , and correspondingly and using the SOFAR estimates .

Definition 1 (Robust spark)

The robust spark of the design matrix is defined as the smallest possible positive integer such that there exists an submatrix of having a singular value less than a given positive constant .

Condition 1

(Parameter space) The true parameters lie in , where , , , and with the robust spark of , some constant, and asymptotically vanishing.

Condition 2

(Constrained eigenvalue) It holds that and for some constant , where is the left singular vector of corresponding to singular value .

Condition 3

(Error term) The error term with the maximum eigenvalue of bounded from above and diagonal entries of being ’s.

Condition 4

(Penalty functions) For matrices and of the same size, the penalty functions with satisfy .

Condition 5

(Relative spectral gap) The nonzero singular values of satisfy that for with some constant, and and can diverge as .

The concept of robust spark in Definition 1 was introduced initially in Zheng et al. (2014) and Fan and Lv (2013), where the thresholded parameter space was exploited to characterize the global optimum for regularization methods with general penalties. Similarly, the thresholded parameter space and the constrained eigenvalue condition which builds on the robust spark condition of the design matrix in Conditions 1 and 2 are essential for investigating the computable solution to the nonconvex SOFAR optimization problem in (2.1). By Proposition 1 of Fan and Lv (2013), the robust spark can be at least of order with asymptotic probability one when the rows of are independently sampled from multivariate Gaussian distributions with dependency. Although Condition 3 assumes Gaussianity, our theory can in principle carry over to the case of sub-Gaussian errors, provided that the concentration inequalities for Gaussian random variables used in our proofs are replaced by those for sub-Gaussian random variables.

Condition 4 includes many kinds of penalty functions that bring about sparse estimates. Important examples include the entrywise -norm and rowwise -norm, where the former encourages sparsity among the predictor/response effects specific to each rank-1 SVD layer, while the latter promotes predictor/response-wise sparsity regardless of the specific layer. To see why the rowwise -norm satisfies Condition 4, observe that

 ∥M∥1≡∑i∑j|mij|=∑i⎛⎝∑j,k|mij||mik|⎞⎠1/2≥∑i(∑jm2ij)1/2≡∥M∥2,1,

which along with the triangle inequality entails that Condition 4 is indeed satisfied. Moreover, Condition 4 allows us to use concave penalties such as SCAD (Fan and Li, 2001) and MCP (Zhang, 2010); see, for instance, the proof of Lemma 1 in Fan and Lv (2013).

Intuitively, Condition 5 rules out the nonidentifiable case where some nonzero singular values are tied with each other and the associated singular vectors in matrices and are identifiable only up to some orthogonal transformation. In particular, Condition 5 enables us to establish the key Lemma 3 in Section B.1 of Supplementary Material, where the matrix perturbation theory can be invoked.

3.2 Main results

Since the objective function of the SOFAR method (2.1) is nonconvex, solving this optimization problem is highly challenging. To overcome the difficulties, as mentioned in the Introduction we exploit the framework of CANO and suggest a two-step approach, where in the first step we solve the following -penalized squared loss minimization problem

 ˜C=argminC∈Rp×q{(2n)−1∥Y−XC∥2F+λ0∥C∥1} (0)

to construct an initial estimator with some regularization parameter. If , then we set the final SOFAR estimator as ; otherwise, in the second step we do a refined search and minimize the SOFAR objective function (2.1) in an asymptotically shrinking neighborhood of to obtain the final SOFAR estimator . In the case of , our two-step procedure reduces to a one-step procedure. Since Theorem 1 below establishes that can be close to with asymptotic probability one, having is a good indicator that the true .

Thanks to its convexity, the objective function in (3.2) in the first step can be solved easily and efficiently. In fact, since the objective function in (3.2) is separable it follows that the th column of can be obtained by solving the univariate response Lasso regression

 minβ∈Rp{(2n)−1∥Yej−Xβ∥22+λ0∥β∥1},

where is a -dimensional vector with th component 1 and all other components 0. The above univariate response Lasso regression has been studied extensively and well understood, and many efficient algorithms have been proposed for solving it. Denote by the initial estimator of obtained from the SVD of , and let and . Since the bounds for the SVD are key to the analysis of SOFAR estimator in the second step, for completeness we present the nonasymptotic bounds on estimation errors of the initial estimator in the following theorem.

Theorem 1 (Error bounds for initial estimator)

Assume that Conditions 13 hold and let with and some constant. Then with probability at least , the estimation error is bounded as

 ∥˜C−C∗∥F≤Rn≡c(n−1slog(pq))1/2 (0)

with and some constant. Under addition Condition 5, with the same probability bound the following estimation error bounds hold simultaneously

 ∥˜D−D∗∥F≤c(n−1slog(pq))1/2, (0) ∥˜A−A∗∥F+∥˜B−B∗∥F≤cηn(n−1slog(pq))1/2, (0)

where .

For the case of , the estimation error bound (1) is consistent with the well-known oracle inequality for Lasso (Bickel et al., 2009). The additional estimation error bounds (1) and (1) for the SVD in Theorem 1 are, however, new to the literature. It is worth mentioning that Condition 5 and the latest results in Yu et al. (2015) play a crucial role in establishing these additional error bounds.

After obtaining the initial estimator from the first step, we can solve the SOFAR optimization problem in an asymptotically shrinking neighborhood of . More specifically, we define with the upper bound in (1). Then it is seen from Theorem 1 that the true coefficient matrix is contained in with probability at least . Further define

 Pn=˜Pn∩(C×D×A×B), (0)

where sets , , , and are defined in Condition 1. Then with probability at least , the set defined in (3.2) is nonempty with at least one element by Condition 1. We minimize the SOFAR objective function (2.1) by searching in the shrinking neighborhood and denote by the resulting SOFAR estimator. Then it follows that with probability at least ,

where the first inequality is by the triangle inequality and the second one is by the construction of set and Theorem 1. Therefore, we see that the SOFAR estimator given by our two-step procedure is guaranteed to have convergence rate at least .

Since the initial estimator investigated in Theorem 1 completely ignores the finer sparse SVD structure of the coefficient matrix , intuitively the second step of SOFAR estimation can lead to improved error bounds. Indeed we show in Theorem 2 below that with the second step of refinement, up to some columnwise sign changes the SOFAR estimator can admit estimator error bounds in terms of parameters , , and with , , and . When , , and are drastically smaller than , these new upper bounds can have better rates of convergence.

Theorem 2 (Error bounds for SOFAR estimator)

Assume that Conditions 15 hold, with some large constant, , , , and with , , and as given in Theorem 1. Then with probability at least

 1−{2(pq)1−c20/2+2(pr)−~c2+2prexp(−~c3n1−β−γτ2η−2n)}, (0)

the SOFAR estimator satisfies the following error bounds simultaneously:

 (a) ∥ˆC−C∗∥F≤cmin{s,(r+sa+sb)η2n}1/2{n−1log(pq)}1/2, (0) (b) ∥ˆD−D∗∥F+∥ˆA−A∗∥F+∥ˆB−B∗∥F ≤cmin{s,(r+sa+sb)η2n}1/2ηn{n−1log(pq)}1/2, (0) (c) ∥ˆD−D∗∥0+∥ˆA−A∗∥0+∥ˆB−B∗∥0≤c(r+sa+sb), (0) (d) ∥ˆD−D∗∥1+∥ˆA−A∗∥1+∥ˆB−B∗∥1≤c(r+sa+sb)η2nλmax, (0) (e) n−1∥X(ˆC−C∗)∥2F≤c(r+sa+sb)η2nλ2max, (0)

where and are some positive constants.

We see from Theorem 2 that the upper bounds in (2) and (2) are the minimum of two rates, one involving (the total sparsity of , , and ) and the other one involving (the sparsity of matrix ). The rate involving is from the first step of Lasso estimation, while the rate involving is from the second step of SOFAR refinement. For the case of , our two-step procedure leads to enhanced error rates under the Frobenius norm. Moreover, the error rates in (2)–(2) are new to the literature and not shared by the initial Lasso estimator, showing again the advantages of having the second step of refinement. It is seen that our two-step SOFAR estimator is capable of recovering the sparsity structure of , , and very well.

Let us gain more insights into these new error bounds. In the case of univariate response with , we have , , , and . Then the upper bounds in (2)–(2) reduce to , , , , and , respectively, which are indeed within a logarithmic factor of the oracle rates for the case of high-dimensional univariate response regression. Furthermore, in the rank-one case of we have and . Correspondingly, the upper bounds in (1)–(1) for the initial Lasso estimator all become , while the upper bounds in (2)–(2) for the SOFAR estimator become , , , , and , respectively. In particular, we see that the SOFAR estimator can have much improved rates of convergence even in the setting of .

4 Implementation of SOFAR

The interplay between sparse regularization and orthogonality constraints creates substantial algorithmic challenges for solving the SOFAR optimization problem (2.1), for which many existing algorithms can become either inefficient or inapplicable. For example, coordinate descent methods that are popular for solving large-scale sparse regularization problems (Friedman et al., 2007) are not directly applicable because the penalty terms in problem (2.1) are not separable under the orthogonality constraints. Also, the general framework for algorithms involving orthogonality constraints (Edelman et al., 1998) does not take sparsity into account and hence does not lead to efficient algorithms in our context. Inspired by a recently revived interest in the augmented Lagrangian method (ALM) and its variants for large-scale optimization in statistics and machine learning (Boyd et al., 2011), in this section we develop an efficient algorithm for solving problem (2.1).

4.1 SOFAR algorithm with ALM-BCD

The architecture of the proposed SOFAR algorithm is based on the ALM coupled with block coordinate descent (BCD). The first construction step is to utilize variable splitting to separate the orthogonality constraints and sparsity-inducing penalties into different subproblems, which then enables efficient optimization in a block coordinate descent fashion. To this end, we introduce two new variables and , and express problem (2.1) in the equivalent form

 (0)

where and . We form the augmented Lagrangian for problem (4.1) as

 Lμ(Θ,Ω,Γ) =12∥Y−XUDVT∥2F+λd∥D∥1+λaρa(A)+λbρb(B)+⟨Γa,UD−A⟩ =+⟨Γb,VD−B⟩+μ2∥UD−A∥2F+μ2∥VD−B∥2F,

where is the set of Lagrangian multipliers and is a penalty parameter. Based on ALM, the proposed algorithm consists of the following iterations:

1. -step: ;

2. -step: and .

The -step can be solved by a block coordinate descent method (Tseng, 2001) cycling through the blocks , , , , and . Note that the orthogonality constraints and the sparsity-inducing penalties are now separated into subproblems with respect to and , respectively. To achieve convergence of the SOFAR algorithm in practice, an inexact minimization with a few block coordinate descent iterations is often sufficient. Moreover, to enhance the convergence of the algorithm to a feasible solution we optionally increase the penalty parameter by a ratio at the end of each iteration. This leads to the SOFAR algorithm with ALM-BCD described in Table 1.

We still need to solve the subproblems in algorithm 1. The -update is similar to the weighted orthogonal Procrustes problem considered by Koschat and Swayne (1991). By expanding the squares and omitting terms not involving , this subproblem is equivalent to minimizing

 12∥XUDk∥2F−tr(UTXTYVkDk)−tr(UT(μkAk−Γka)Dk)

subject to . Taking a matrix such that , where is the largest eigenvalue of , we can follow the argument of Koschat and Swayne (1991) to obtain the iterative algorithm: for , form the matrix , compute the SVD , and update . Note that depends on only, and hence the explicit computation of is not needed. The -update is similar to a standard orthogonal Procrustes problem and amounts to maximizing

 tr(VTYTXUk+1Dk)+tr(VT(μkBk−Γkb)Dk)

subject to . A direct method for this problem (Golub and Van Loan, 2013, pp. 327–328) gives the algorithm: form the matrix , compute the SVD , and set . Since is usually small, the SVD computations in the - and -updates are cheap. The Lasso problem in the -update reduces to a standard quadratic program with the nonnegativity constraint, which can be readily solved by efficient algorithms; see, for example, Sha et al. (2007). Note that the -update may set some singular values to exactly zero; hence, a greedy strategy can be taken to further bring down the computational complexity, by removing the zero singular values and reducing the sizes of the relevant matrices accordingly in subsequent computations. The updates of and are free of orthogonality constraints and therefore easy to solve. With the popular choices of and as the penalty functions, the updates can be performed by entrywise and rowwise soft-thresholding, respectively.

Following the theoretical analysis for the SOFAR method in Section 3, we employ the SVD of the cross-validated -penalized estimator in (3.2) to initialize , , , , and ; the and are initialized as zero matrices. In practice, for large-scale problems we can further scale up the SOFAR method by performing feature screening with the initial estimator , that is, the response variables corresponding to zero columns in and the predictors corresponding to zero rows in could be removed prior to the finer SOFAR analysis.

4.2 Convergence analysis and tuning parameter selection

For general nonconvex problems, an ALM algorithm needs not to converge, and even if it converges, it needs not to converge to an optimal solution. We have the following convergence results regarding the proposed SOFAR algorithm with ALM-BCD.

Theorem 3 (Convergence of SOFAR algorithm)

Assume that
and the penalty functions and are convex, where denotes the decrease in by a block update. Then the sequence generated by the SOFAR algorithm converges to a local solution of the augmented Lagrangian for problem (4.1).

Note that without the above assumption on , , and , we can only show that the differences between two consecutive -, -, and -updates converge to zero by the convergence of the sequence , but the sequences , , and may not necessarily converge. Although Theorem 3 does not ensure the convergence of algorithm 1 to an optimal solution, numerical evidence suggests that the algorithm has strong convergence properties and the produced solutions perform well in numerical studies.

The above SOFAR algorithm is presented for a fixed triple of tuning parameters . One may apply a fine grid search with -fold cross-validation or an information criterion such as BIC and its high-dimensional extensions including GIC (Fan and Tang, 2013) to choose an optimal triple of tuning parameters and hence a best model. In either case, a full search over a three-dimensional grid would be prohibitively expensive, especially for large-scale problems. Theorem 2, however, suggests that the parameter tuning can be effectively reduced to one or two dimensions. Hence, we adopt a search strategy which is computationally affordable and still provides reasonable and robust performance. To this end, we first estimate an upper bound on each of the tuning parameters by considering the marginal null model, where two of the three tuning parameters are fixed at zero and the other is set to the minimum value leading to a null model. We denote the upper bounds thus obtained by , and conduct a search over a one-dimensional grid of values between