Selective Factor Extraction in High Dimensions
This paper studies simultaneous feature selection and extraction in supervised and unsupervised learning. We propose and investigate selective reduced rank regression for constructing optimal explanatory factors from a parsimonious subset of input features. The proposed estimators enjoy sharp oracle inequalities, and with a predictive information criterion for model selection, they adapt to unknown sparsity by controlling both rank and row support of the coefficient matrix. A class of algorithms is developed that can accommodate various convex and nonconvex sparsity-inducing penalties, and can be used for rank-constrained variable screening in high-dimensional multivariate data. The paper also showcases applications in macroeconomics and computer vision to demonstrate how low-dimensional data structures can be effectively captured by joint variable selection and projection.
Modern statistical applications may involve many variables. Principal component analysis (Hotelling, 1933) offers a popular means of dimension reduction, and reduced rank regression extends it to supervised learning (Anderson, 1951) by solving the problem subject to , where and are response and predictor matrices, denotes the rank of , and is the Frobenius norm. Reduced rank regression provides a low-dimensional projection space to view and analyze multivariate data, and finds widespread applications in machine learning, econometrics, and finance (Reinsel and Velu, 1998; Izenman, 2008). In fact, once an estimate of rank is obtained, we can write for , . This suggests that factors can be constructed by from predictors to explain all response variables. The number of factors required in real applications is often much smaller than the number of input -variables. Unfortunately, the loading matrix obtained from reduced rank regression typically involves all predictors. In high-dimensional data analysis, factors constructed from a small subset of variables are much more interpretable; we call this selective factor extraction. Correspondingly, the coefficient matrix is desired to have both low rank and row-wise sparsity. To capture the two types of structural parsimony simultaneously, joint regularization must be applied, which adds nontrivial difficulties in the theoretical analysis and numerical computation of the associated estimators, but leads to reduced errors compared with rank reduction or variable selection alone.
In the unsupervised setting when , selective factor extraction is closely related to sparse principal component analysis. See, for example, Zou et al. (2006), Shen and Huang (2008), Witten et al. (2009), Johnstone and Lu (2009) and Ma (2013). Most of these algorithms seek sparse loading vectors separately, and progress sequentially. The loading matrix obtained may lack optimality and contain too many variables. To ensure dimension reduction even when constructing a number of factors, we will formulate the problem as a whole and pursue joint sparsity across all loading vectors. This turns out be particularly helpful in rank-constrained variable screening.
There is less work on simultaneous variable selection and rank reduction in the supervised setting. See Bunea et al. (2012), Chen and Huang (2012), Chen et al. (2012), Ma et al. (2014a) and a recent report by Ma et al. (2014b). Many theoretical and computational questions remain open. Our main contributions are threefold. First, we are able to provide a unified treatment for various penalties in the reduced rank model, and successfully build sharper oracle inequalities than those in the literature (Bickel et al., 2009; Lounici et al., 2011; Candès and Plan, 2011). Our results indicate that for joint variable selection and rank reduction, the error rates and parameter choices previously obtained are suboptimal. Second, we develop a computational framework with guaranteed convergence, where any thresholding rule can be applied. The algorithms adapt to reduced rank variable screening in very high dimensions. Third, we come up with a new information criterion for parameter tuning. To the best of our knowledge, this is the first sound criterion with minimax optimality for selecting among sparse and/or rank-deficient models.
In the rest of the paper, the following notation and symbols will be used. Given a matrix , and denote its Frobenius norm and spectral norm, respectively. We define the -norm of by , and use to characterize the number of nonzero rows in . The standard vectorization of is denoted by . We use to denote a submatrix of with rows and columns indexed by and , respectively, and occasionally abbreviate to . The set of column-orthogonal matrices of size is denoted by . Finally, and are used to denote constants which are not necessarily the same at each occurrence.
2 Simultaneous Rank Reduction and Variable Selection
2.1 Selective reduced rank regression
Picking only pertinent dimensions is the key to enhance interpretability of factors in high dimensions. In a multi-factor model, power to select requires eliminating the nuisance variables from the construction of factors. To state a general framework, we assume that a response matrix is available, in addition to a predictor matrix , both centered column-wise. Let denote the coefficient matrix . To provide concurrent rank reduction and feature selection, a possible optimization criterion is . But the penalized form does not seem to enjoy low errors in either theory or practice. We propose the following form of rank-constrained variable selection
where is a sparsity-promoting penalty, possibly nonconvex. We call (1) selective reduced-rank regression. Imposing element-wise sparsity on , though valid as a regularization approach, does not seem to have much meaning in applications. We will introduce a different sparse reduced rank regression in (13) by sparsifying a component of .
There is a variety of choices for the penalty function. The popular group -norm function, (Yuan and Lin, 2006), leads to the rank-constrained group lasso, although the group penalty, , is arguably better in promoting sparsity (Bunea et al., 2012; Chen and Huang, 2012). Group versions of the nonconvex penalties proposed by Fan and Li (2001), Zhang (2010a), and Zhang (2010b) can also be applied.
Solving the selective reduced rank regression problem helps uncover factors that reduce model complexity. Given a selective reduced rank regression estimate , we can use its column space to make a new model matrix , where , , and are obtained from the singular value decomposition . The new design has columns and involves only a small subset of the -variables. This is called Type-I factor extraction. An alternative is to decompose . Concretely, let be the spectral decomposition of . Then provides factors, called Type-II extraction or post-decorrelation, since the -variables are uncorrelated with each other. QR decomposition can be used for efficiency reasons in either case. The two types of factor extraction are not equivalent in general, but coincide when is the solution to reduced rank regression. Because , a more sophisticated model can be built on the factors with relative ease.
2.2 Oracle inequalities
We show some non-asymptotic oracle inequalities to reveal the theoretical benefits of selective reduced rank regression. For clarity, we use the group and group penalties to exemplify the error rate. For , define and .
Let , with all entries of independent and identically distributed as .
(i) Let be a selective reduced rank regression estimator that minimizes subject to . Then, under where is a large enough constant, the following oracle inequality holds for any with ,
Here, means that the inequality holds up to a multiplicative constant.
(ii) In the case, let where is as in (i). Then holds for any with , provided that satisfies
where , , and is a positive constant.
The proof given in the Appendices can deal with various penalties in a universal way. For example, the oracle inequality (2) applies to any that takes as the threshold and satisfies , where . Examples includes the smoothly clipped absolute deviation penalty (Fan and Li, 2001), the minimax concave penalty (Zhang, 2010a) and the capped penalty (Zhang, 2010b). Similarly, the result in part (ii) of Theorem 1 holds for any sub-additive penalty that is sandwiched between and . The penalties where are particular instances. Moreover, condition (3) is less demanding than some common regularity assumptions (van de Geer and Bühlmann, 2009; She, 2016), and we do not require to be bounded above by .
Let , . According to (2), simply taking , so that the bias term disappears, we get a prediction error bound of order
omitting and constant factors. The bias term makes the error bound applicable to coefficient matrices that are approximately row-sparse. In Section 4.2, we will see that when it is difficult to provide a proper rank value, the predictive information criterion can be used to tune to guarantee the same low error rate.
A comparison between Theorem 1 and some existing non-asymptotic results follows. Wei and Huang (2010) and Lounici et al. (2011) showed that for group lasso, the prediction error is of the order . Since , selective rank reduction is uniformly better, and the performance gain is dramatic for low-rank models. Bunea et al. (2012) obtained an error rate for the rank-constrained group lasso at . Their rate is, however, suboptimal: when and are comparable, their error bound is of the order , while (4) gives . Bunea et al. also required a multivariate restricted eigenvalue assumption that is more restrictive than (3). Compared with low-rank matrix estimation (Recht et al., 2010; Bunea et al., 2011), which has an error rate of with , our result does not always show an improvement, because only large values of are considered in Theorem 1 to secure selectivity. Practically there will be no performance loss, because selective reduced rank regression degenerates to reduced rank regression when .
3 Parameter Tuning and Model Comparison
3.1 A predictive information criterion
Selective reduced rank regression has two regularization parameters, and , to control the row support and rank of the model. Conventional tuning methods are not satisfactory in our experience, and indeed they all lack theoretical support in the sparse and rank-deficient setting. We will propose a novel information criterion from the perspective of predictive learning (Hastie et al., 2009), namely, the best model should give the smallest prediction error among all candidate models. Unlike consistent variable selection or rank selection, such a principle does not require high signal-to-noise ratios to work.
To make our results more general, the noise matrix is assumed to have sub-Gaussian marginal tails in this section. A random variable is sub-Gaussian if for any and some constants , and its scale is defined as . Gaussian random variables and bounded random variables are particular instances. More generally, is a sub-Gaussian random vector with its scale bounded by , if is sub-Gaussian and for any .
The function proposed as the model complexity penalty is
where and .
Assume that the vectorized noise matrix, or , is sub-Gaussian with mean zero and scale bounded by . Let , where is a constant. Then for all sufficiently large values of , satisfies the following oracle inequality
Theorem 2 is a strong non-asymptotic result because the obtained error rate is uniformly better than those by selection or rank reduction as mentioned in Section 2.2. Indeed, we can show that gives the minimax optimal error rate in this jointly sparse setting. Moreover, (6) holds under no restrictions on or , and its right-hand side takes the infimum over all reference signals .
The theorem gives rise to a model comparison criterion. By the same reasoning as in its proof, for any collection of random non-zero matrices , if we choose the optimal one, , by minimizing the following predictive information criterion over all given matrices
then satisfies . Interestingly, indicates that , the inflation term due to selection, should be additive to the degrees-of-freedom term. This is legitimate for sub-Gaussian noise contamination, but to our knowledge new when compared with other information criteria that take the form of loss degrees-of-freedom. For example, the extended Bayesian information criterion (Chen and Chen, 2008), derived under with and some other regularity conditions, has a multiplicative factor on the degrees-of-freedom of the model. For single-response models with , simplifies to , which essentially corresponds to the risk inflation criterion (Foster and George, 1994), but is slightly finer. Our result applies to any .
3.2 Scale-free predictive information criterion
The predictive information criterion contains a scale parameter . In sparse principal component analysis, one can substitute an estimate for the unknown , e.g., (Johnstone and Lu, 2009). In supervised learning, however, estimating the scale parameter could be as hard as estimating the coefficients. We propose a scale-free form of predictive information criterion that can bypass . Again, no incoherence assumption is made for the predictor matrix.
Let have independent and identically distributed entries. Suppose that the true model is parsimonious in the sense that for some constant . Consider the criterion
where the constant satisfies . Then, for sufficiently large values of and , any that minimizes (8) subject to satisfies , with probability at least for some constants .
4.1 A computational framework
To ensure that selective reduced rank regression can be applied, we must address some challenges in computation. First, the rank constraint makes problem (1) nonconvex and non-smooth. Moreover, in view of Theorem 1, to relax the incoherence conditions required by -type penalties, nonconvex penalties may be of interest (Zhang, 2010a; Zhang and Zhang, 2012). Since different nonconvex penalty forms may lead to the same thresholding rule, we study thresholding-induced penalties.
Definition 1 (Threshold function).
A threshold function is a real-valued function defined for with as the parameter such that (i) , (ii) for , (iii) , and (iv) for . Moreover, is defined to be a multivariate function associated with if for any vector , for and otherwise. For any matrix with , .
Some thresholding functions, such as the hard-thresholding or , have discontinuities. To avoid ambiguity in definition, when using such thresholdings, we assume that the quantity to be thresholded does not correspond to a discontinuity point. Let us consider the following scaled version of problem (1),
where is associated with through (10) and is a large enough number to be specified in Theorem 4. To get rid of the low-rank constraint, we may write , with and . The optimization is now with respect to and (). We abuse notation and use to denote the objective function. Algorithm 1 is developed based on a block coordinate descent method, where the -optimization can be solved by Procrustes rotation and the can be obtained by iterative thresholding.
We will show that, given any , the algorithm is guaranteed to converge under a universal choice of . For simplicity, in the following theorem we assume that is continuous at any point in the closure of . The condition holds for all continuous thresholding rules. Practically used thresholding rules have few discontinuity points and such discontinuities rarely occur in real data analysis.
Given an arbitrary thresholding function , let be an associated penalty satisfying
for some satisfying , and if for some . Let . Then given any starting point , converges, , and
Furthermore, if , then any accumulation point of is a coordinatewise minimum point of and the function value converges monotonically to for some coordinatewise minimum point .
Equation (10) covers all aforementioned convex and nonconvex penalties; see She (2012) for more examples. For penalties with , Theorem 4 provides a stationary point guarantee. When has discontinuities, can have infinitely many choices, which means that different penalties may be associated with the same thresholding function. For instance, define a hard-ridge thresholding rule
Then, with a nontrivial defined by , (10) gives an penalty , or in the context of (1). The Frobenius component in the hybrid penalty can shrink the coefficients to compensate for collinearity and large noise in large- applications. Section 4.2 makes use of a constraint variant of the penalty for screening.
When we apply a component-wise in place of in Step 5d, a result similar to Theorem 4 can be obtained for the objective function
The sparsity is imposed on rather than on the overall coefficient matrix . We call (13) sparse reduced rank regression. With and , we see that is a sum of factors, , and every is sparse ().
Algorithm 1 is simple to implement and has low computational complexity. When , in addition to some elementary matrix multiplication and thresholding operations, a singular value decomposition is carried out on , which, however, has only columns. To initialize the algorithm, we can use the reduced rank regression estimate and set , where denotes the Moore–Penrose inverse and is formed by the first eigenvectors of . Other initialization schemes are possible; see Rousseeuw and Van Driessen (1999).
From Algorithm 1, or the proof of Theorem 4 in the Appendices, the optimal can be expressed in terms of , i.e., . Hence or , where is the nuclear norm. This means that the loading matrix obtained from (9) or (13) depends on through .
Some recent theoretical studies (Berthet and Rigollet, 2013; Gao et al., 2016) show that computationally efficient algorithms, such as those with polynomial time complexity, may possess an intrinsic lower bound in statistical accuracy that is larger than the minimax error rate derived for most challenging problems. This seems to hold in our problem as well. We will not further pursue this in the current paper.
4.2 Rank-constrained variable screening
Statisticians are frequently confronted with challenges in large-scale computing, so variable screening has become a popular practice in high-dimensional data analysis. In multivariate problems, we are interested in rank-constrained variable screening, which can be achieved by the following form of selective reduced rank regression
Similar to the rank constraint, which limits the number of factors, the cardinality constraint, rather than a penalty, enables one to directly control the number of predictors selected for factor construction. The upper bound can be loose for the purpose of screening, provided it is not too small.
We show below how to use a quantile version of the hard-ridge thresholding (12) to solve such problems. Given , , for any , is defined to be a vector satisfying if , and otherwise. Here, are the order statistics of , i.e., , and are defined similarly. In the case of ties, a random tie breaking rule is used. The multivariate quantile thresholding function for any , is defined as a matrix with if is among the largest elements in , and otherwise. Now we modify Step 5.d of Algorithm 1 to , and all other steps remain unchanged. The resulting algorithm for rank-constrained screening always converges.
Assume . Then, given any , is non-increasing and satisfies , and obeys the constraints and for any .
To get some intuition, let us set . Then, at the first iteration, , , and the quantile thresholding picks features according to the marginal statistics , which amounts to sure independence screening (Fan and Lv, 2008). Our algorithm iterates further to lessen the greediness of independence screening. To accelerate the computation, we recommend progressive screening in the iterative process. Concretely, we use a sequence that decreases from to , e.g., with and , and perform in Step 5d; after obtaining in Step 7, the following data squeezing operations are carried out, , , . An attractive feature of the implementation is that as the cycles progress, the problem size drops quickly and the computational load can be significantly reduced.
For the sparse reduced rank regression with an -constraint
similar algorithms can be developed based iterative quantile thresholding. In big data applications, a good idea is to combine (14) with (15), because calling the rank-constrained screening algorithm in an earlier stage can reduce the dimensionality from to . In this hybrid scheme, satisfies , and gives a conservative screening choice.
5 Unsupervised Selective and Sparse Principal Component Analyses
This section studies selective factor construction in principal component analysis. We assume that only one data matrix is available and it has been column-centered. Principal component analysis can be interpreted as finding a low-rank matrix to approximate the observed data. Similar to Section 4.1, we write with and . The selective principal component analysis problem is defined as
Obviously, (16) can be rephrased as a special case of selective reduced rank regression, by taking as the response matrix and as the design matrix. Likewise, adapting (13) to the unsupervised setting leads to the following criterion for sparse principal component analysis,
Problems (16) and (17) are perhaps less challenging than their supervised counterparts, but they still provide new insights into sparse principal component analysis. For example, (17) defines a multivariate criterion that is able to find all sparse loading vectors simultaneously. In computation, our algorithms developed in Section 4 simplify greatly. In fact, because of the identity design, the inner loop in Step 5 of Algorithm 1 converges in one iteration and the overall procedure reduces to
In other words, is updated by thresholding , and various thresholding operators can be used. When , the singular value decomposition is unnecessary, since can be directly obtained by normalizing the column vector .
We now point out some recent literature related to (16) and (17). Under a spiked covariance model assumption, Cai et al. (2013) proposed and studied an adaptive multi-step procedure to solve a problem similar to (16). Our algorithm (18) is closest in spirit to the thresholding procedure in Johnstone and Lu (2009). Ma (2013) proposed an iterative algorithm for principal subspace estimation, but it has no guarantee of numerical convergence. Another type of sparse principal component analysis sets in (13); the idea seems to first appear in Zou et al. (2006). But the self-regression formulation may bring some ambiguity in selection. Consider a noise-free model where gives the set of indices of all nonzero columns in , and obeys . Then, given any index set satisfying , we can find a matrix with and such that , but can be smaller or larger than .
6.1 Paper quality data
Aldrin (1996) described two paper datasets collected from Norwegian paper industry. We focus on the first dataset. The production of paper depends on a huge number of predictors. The data were obtained by varying three control variables , , that are coded as , , and . There are design points for . At each point the paper quality was measured twice, except once at design point , which results in observations. The response variables are the 13 measures used to evaluate paper quality.
The first model we considered is a full quadratic model recommended by Aldrin, with predictors and responses in total. Previous analysis suggests that the model is factor driven. Reduced rank regression can be used to construct explanatory score variables for paper quality assessment (Aldrin, 1996; Izenman, 2008), but has to keep all predictors in the final model. Our proposed approach has the ability of identifying a small set of predictors for selective factor extraction. We split the whole dataset at random, with 60% for training and for test, and repeat the process for times to compare the performance of the two methods. All parameters were tuned by the scale-free predictive information criterion. Based on the median statistics, both reduced rank regression and selective reduced rank regression gave two factors, and showed comparable test errors, and , respectively. The difference is that selective reduced rank regression achieved this with only predictors, about half of the model size of reduced rank regression.
A careful examination of the data shows some interesting findings that merit further investigation. Although sample 5 and sample 6 correspond to the same design point , their response values demonstrate larger-than-normal discrepancies. Take the first response variable for instance. We get , but the differences in for all the other observation pairs are bounded by in absolute value. We suspect that there exist outliers in the data. Their occurrence might be due to the crude coding of control variables when they were varied. Therefore, a robust factor model is perhaps more appropriate. Specifically, the factors constructed from input variables, , may need correction at certain anomaly points, which amounts to . The correction term is desired to be row-wise sparse, since the outliers should not be the norm. Therefore, we have
where is desired to have low rank and sparse rows. For such an augmented model, the number of predictor variables is always larger than the sample size. We repeated the analysis with the augmented design. In the 100 experiments, two variables, and , got selected all the time; the other predictors, including all interaction terms, never entered the model. The median model size is however , due to the existence of some automatically detected outliers. Sample and sample , corresponding to design points and , respectively, were identified as anomalies more than of the time. The rank is now as low as , but the test error is substantially reduced to .
6.2 Macroeconomic data
Stock and Watson (2012) summarized 194 quarterly observations on 144 macroeconomic time series observed from 1960 to 2008, with some earlier observations used for lagged values of regressors as necessary. We preprocessed the data using the transformations given in Table B.2 of Stock and Watson (2012). One variable, non-borrowed reserves of depository institutions, was removed because its transformation involves logarithms but it has negative values. Of the 143 series, 35 are high-level aggregates, the information of which is all contained in the rest. Our predictors are the 108 disaggregated series and their lagged values. The series are grouped into thirteen broad categories. We use the interest rates category, which consists of 13 time series, as our response variables. The dataset is a good example to show that although forecasters and policymakers can access many potentially relevant macroeconometric time series, excluding noninformative ones is often ad hoc. In fact, to the best of our knowledge, there is no acknowledged model to describe all the 13 interest rates covering treasuries, corporate, term spreads and public-provide spreads. Low-rank models naturally arise (Reinsel and Velu, 1998), and the necessary factors can be as few as two or even one, which has been theoretically established in a large body of economics literature.
First, we used the 108 series observed in the past four quarters as predictors, giving predictors and unknowns. We used Algorithm 1 for selective reduced rank regression and the scale-free predictive information criterion for parameter tuning. The resulting model has and , achieving a remarkable dimension reduction. The single explanatory factor in response to all interest rates series is constructed from capital utilization, the 3-month treasury bill secondary market rate minus the 10-year treasury constant maturity rate, and Moody’s Baa corporate bond yield minus the 10-year treasury constant maturity rate. Since no variables of lag order 2 or above were selected, we repeated the analysis using the series with only one lag, thus selecting four variables and two factors. The last two variables shown in the 1-factor model appear again, and the employment category contributes the other two variables, relating to employees on nonfarm payrolls in wholesale trade and help-wanted advertising in newspapers. Both the 1-factor model and the 2-factor model show a high level of parsimony.
Next, we did a forecasting experiment to compare the obtained factor models with auto-regressive modeling with four lags, which is a conventional but quite accurate forecasting method. The performance is evaluated by a rolling scheme: a rolling estimation window of the most recent 100 quarterly observations is used to estimate the parameters, and forecasts are made in the forecast window. Both windows move forward by one quarter at a time, and the procedure is repeated 94 times. Table 1 shows the mean squared errors of each method for the 13 interest rates. Overall, the three methods have comparable prediction errors. Of course, the comparison is, in some sense, unfair to factor models. The auto-regression method builds a separate model for each interest rate using four relevant predictors, while the 1-factor method, say, regresses every response on the same single score variable. Our purpose here is to demonstrate the usefulness of category-level factors; better models can be possibly built on the factors to improve the accuracy further.
6.3 Face data
The Extended Yale Face Database B (Lee et al., 2005; Georghiades et al., 2001) contains aligned and cropped face images of 38 subjects with the same frontal pose under 64 different illumination conditions. We down-sampled the images to , each containing pixels. Given a subject, a data matrix of size 648064 can be formed from the associated images. In face recognition, principal component analysis is widely used to extract basis features, referred to as eigenfaces, the number of which is controlled by the rank. We set throughout this experiment and focus on the 22nd subject in the database, whose image examples are shown in the upper panel of Fig. 1. Selective principal component analysis was performed in the hope of capturing regions of interest under different light source directions. As seen in the lower panel, some informative regions sensitive to illumination conditions, e.g., forehead and nose tip, were automatically detected.
To reduce the computational burden caused by the large number of pixels, we tested the screening-guided hybrid sparse principal component analysis. See its description below (15) and recall that controls the number of nonzero elements in the loading matrix , and controls the number of nonzero rows. Table 2 shows the results with a conservative screening choice . The adjusted variance rates were computed according to Shen and Huang (2008). The hybrid principal component analysis gave essentially the same adjusted variances as sparse principal component analysis, but used fewer pixels. More importantly, the hybrid approach offered impressive time savings. Then, we did an experiment with a less conservative screening choice, , . Sparse principal component analysis used 3517 pixels to reach an adjusted variance rate of , while hybrid principal component analysis reduced the model size to pixels, and gave adjusted variance rate . When using 2400 pixels, sparse principal component analysis only reached an adjusted variance rate of .
The techniques we developed to study selective reduced rank regression are applicable to pure variable selection or rank reduction. For example, the recipe for proving Theorem 1 can handle Schatten -norm penalized trace regression models, without using the sophisticated quasi-convex Schatten class embeddings (Rohde and Tsybakov, 2011). The scale-free predictive information criterion addresses the issue of adaptive rank selection in models, as raised by Bunea et al. (2011).
In this work, all the problems under consideration are nonconvex. In common with papers like Rohde and Tsybakov (2011) and Bunea et al. (2012), we studied the properties of global minimizers. In some less challenging situations, the proposed algorithms, when initialized by the reduced rank regression estimator, can deliver a good estimate within a few iteration steps. This was also evidenced by Ma et al. (2014b) in a recent technical report. In some hard cases, we found the multi-start strategy of Rousseeuw and Van Driessen (1999) to be quite effective. The study of how to initialize and when to terminate is beyond the scope of the current paper, but is an interesting topic for further research.
Appendix A Proofs of Theorems
Throughout the proofs, we use , , to denote constants. They are not necessarily the same at each occurrence. Given any matrix , we use and to denote its column space and row space, respectively. Denote by the orthogonal projection matrix onto , i.e., , where stands for the Moore-Penrose pseudoinverse, and the projection onto its orthogonal complement. Finally, we use to denote . Recall that for , , , , and is the the largest singular value of .
a.1 Proof of Theorem 1
Recall the following three basic penalties