# Joint variable and rank selection for parsimonious estimation of high-dimensional matrices

F. Bunea
Department of Statistical Science, Cornell University, 1198 Comstock Hall, Ithaca, New York 14853-3801, USA

Y. She
Department of Statistics, Florida State University, 117 N. Woodward Ave, Tallahassee, Florida 32306-4330, USA

M. H. Wegkamp
Department of Mathematics and Department of Statistical Science, Cornell University, 1194 Comstock Hall, Ithaca, New York 14853-3801, USA
Received October 2011; revised April 2012.
###### Abstract

We propose dimension reduction methods for sparse, high-dimensional multivariate response regression models. Both the number of responses and that of the predictors may exceed the sample size. Sometimes viewed as complementary, predictor selection and rank reduction are the most popular strategies for obtaining lower-dimensional approximations of the parameter matrix in such models. We show in this article that important gains in prediction accuracy can be obtained by considering them jointly. We motivate a new class of sparse multivariate regression models, in which the coefficient matrix has low rank and zero rows or can be well approximated by such a matrix. Next, we introduce estimators that are based on penalized least squares, with novel penalties that impose simultaneous row and rank restrictions on the coefficient matrix. We prove that these estimators indeed adapt to the unknown matrix sparsity and have fast rates of convergence. We support our theoretical results with an extensive simulation study and two data analyses.

DOI: 10.1214/12-AOS1039 · Volume 40, Issue 5 (2012), pages 2359–2388

Running title: Joint variable and rank reduction

Florentina Bunea (fb238@cornell.edu), Yiyuan She (yshe@stat.fsu.edu) and Marten H. Wegkamp (marten.wegkamp@cornell.edu)

Supported in part by NSF Grants DMS-10-07444 (F. Bunea and M. H. Wegkamp) and CCF-1116447 (Y. She).

AMS subject classifications: 62H15, 62J07.
Keywords: multivariate response regression, row and rank sparse models, rank constrained minimization, reduced rank estimators, group lasso, dimension reduction, adaptive estimation, oracle inequalities.

## 1 Introduction

The multivariate response regression model

 Y=XA+E (1)

postulates a linear relationship between Y, the n×m matrix containing measurements on m responses for n subjects, and X, the n×p matrix of measurements on p predictor variables, of rank q. The term E is an unobserved n×m noise matrix with independent N(0, σ²) entries. The unknown p×m coefficient matrix A, of unknown rank, needs to be estimated. If we use (1) to model complex data sets, with a high number of responses and predictors, the number pm of unknowns can quickly exceed the sample size n, but the situation need not be hopeless for the following reason. Let r denote the rank of A, let J = J(A) denote the index set of the nonzero rows of A and |J| its cardinality. Counting the parameters in the singular value decomposition of A, we observe that in fact only r(|J| + m − r) free parameters need to be estimated, and this can be substantially lower than the sample size n. Furthermore, as we can always reduce X of rank q to an n×q matrix with independent columns that span the same space as the columns of X, we can always assume that p ≤ n. If A is of full rank with no zero rows, then the total number of parameters to be estimated reverts back to pm. If either, or both, p and m are large, more parsimonious models have to be proposed. Among the possible choices, two are particularly popular.
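To make the parameter count concrete, the SVD-based count r(|J| + m − r) described above can be evaluated directly; the helper below is purely illustrative:

```python
def free_params(m, s, r):
    """Number of free parameters of a coefficient matrix with m columns,
    s nonzero rows and rank r, counted via the singular value
    decomposition of its s x m nonzero part: r(s + m - r)."""
    return r * (s + m - r)

# With m = 25 responses, s = 10 relevant predictors and rank r = 2,
# only 66 parameters are free, versus s*m = 250 for row sparsity alone.
example = free_params(25, 10, 2)
```

Taking r equal to its maximal value min(s, m) recovers the s·m count of a row sparse model with no rank restriction.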

The first class consists of rank sparse or rank deficient models, which postulate either that A has low rank or that it can be well approximated by a low rank matrix. Methods tailored to rank sparsity seek adaptive rank approximations of the coefficient matrix A. Then, one only needs to estimate r(p + m − r) parameters, which can be substantially less than pm for low values of r.

The second class of models reflects the belief that |J| is smaller than p, and we will call them row sparse models. Methods that adapt to row sparsity belong to the variable selection class, as explained in Section 1.1 below. The effective number of parameters of such models is |J|m. This number is smaller than the unrestricted pm, but may be higher than r(|J| + m − r), especially if the rank of A is low.

This discussion underlines the need for introducing and studying another class of models, one that embodies both sparsity constraints on A simultaneously. In this work we introduce row and rank sparse models, and suggest and analyze new methods that combine the strengths of the existing dimension reduction techniques. We propose penalized least squares methods, with new penalties tailored to adaptive and optimal estimation in the row and rank sparse model (1). The rest of the article is organized as follows.

We introduce in Section 2.1 a product-type penalty that imposes simultaneously rank and row sparsity restrictions on the coefficient matrix. It generalizes both the AIC-type penalties developed for variable selection in univariate response regression models and the rank penalty of Bunea, She and Wegkamp (2011) for low rank estimation in multivariate response models. The purpose of the resulting method is twofold. First, we prove in Theorem 1 of Section 2.1 that the resulting estimators of A adapt to both types of sparsity, row and rank, under no conditions on the design matrix. Their rates of convergence coincide with the existing minimax rates in the literature, up to a logarithmic term; cf. Koltchinskii, Lounici and Tsybakov (2011). Second, we show in Theorem 2 that this method can also be employed for selecting among competing estimators from a large finite list. This is of particular interest for selecting among estimates of different ranks and sparsity patterns, possibly obtained via different methods. The results of Section 2.1 hold for any values of p and m and, in particular, both p and m can grow with n, but computing the estimator analyzed in Theorem 1 requires an exhaustive search over the class of all possible models, the size of which is exponential in p, and this becomes computationally prohibitive when p is large.

To address the computational issue, we propose two other methods in Section 2.2. The crucial ingredient of both methods is the selection of predictors in multivariate response regression models under rank restrictions. We define and analyze this core procedure in Section 2.2, and describe a computationally efficient algorithm in Section 3.1. By combining this method with two different ways of selecting the rank adaptively, we obtain two estimators of A. Both are computable in high dimensions, and both achieve the rates discussed in Section 2.1, up to a logarithmic factor, under different, mild assumptions. We also compare the theoretical advantages of these new methods over a simple two-stage procedure in which one first selects the predictors and then reduces the rank. We illustrate the practical differences via a simulation study in Section 3.2. We then use our methods for the analysis, presented in Section 4, of two data sets arising in machine learning and cognitive neuroscience, respectively. The proofs of our results are collected in the Appendix.

### 1.1 Background

Before we discuss our methods, we give an overview of existing procedures for adaptive estimation in (1) that adapt to either rank or row sparsity, but not both. We also present a comparison of target rates under various sparsity assumptions on the coefficient matrix in model (1).

Reduced rank estimation of A in (1), and its immediate extensions to principal components analysis (PCA) and canonical correlation analysis (CCA), are perhaps the most popular ways of achieving dimension reduction of multivariate data. They have become a standard tool in time series [Brillinger (1981)], econometrics [Reinsel and Velu (1998)] and machine learning [Izenman (2008)], to name just a few areas. The literature on low rank regression estimation of A dates back to Anderson (1951). The model is known as reduced-rank regression (RRR) [Izenman (2008)] and, until recently, it had only been studied theoretically from an asymptotic perspective, in a large sample size regime. We refer to Reinsel and Velu (1998) for a historical development and references, and to Izenman (2008) for a large number of applications and extensions. Very recently, a number of works proposed penalized least squares estimators. For penalties proportional to the nuclear norm, we refer to Yuan et al. (2007), Candès and Plan (2010), Negahban and Wainwright (2011) and Rohde and Tsybakov (2011). For penalties proportional to the rank, we refer to Bunea, She and Wegkamp (2011) and Giraud (2011). Both types of estimators are computationally efficient, even if p > n, and both achieve, adaptively, a rate of convergence that, under suitable regularity conditions, is the optimal minimax rate in (1) under rank sparsity; see, for example, Rohde and Tsybakov (2011) for lower bound calculations.

To explain the other notion of sparsity, note that removing predictor j from model (1) is equivalent to setting the jth row of A to zero. Since vectorizing both sides of model (1) yields a univariate response regression model, we can view the rows of A as groups of coefficients in the transformed model. We can set them to zero by any group selection method developed for univariate response regression models in high dimensions, such as the Group Lasso [Yuan and Lin (2006)], GLASSO for later reference. The optimal minimax rate in (1) under row sparsity is proportional to |J|(m + log p), again under suitable regularity conditions; see Lounici et al. (2011) and Wei and Huang (2010).
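The vectorization step can be written out explicitly; with vec(·) stacking columns and ⊗ the Kronecker product (also used in Section 3.1):

```latex
\operatorname{vec}(Y) \;=\; (I_m \otimes X)\,\operatorname{vec}(A) \;+\; \operatorname{vec}(E),
```

so the ith row of A contributes one coefficient to each of the m diagonal blocks of I_m ⊗ X, and setting that whole row to zero removes predictor i from every response equation at once.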

Despite these very recent advances, adaptive low rank estimation in (1), based on a reduced set of predictors, has not been investigated either theoretically or practically. For ease of reference, Table 1 contains a rate comparison between optimal prediction error rates achievable by variable selection (GLASSO), low rank estimation (RSC and NNP) and our new joint rank and row selection (JRRS) methods, respectively.

The table reveals that when essentially all predictors are relevant, the rates of the RSC, NNP and JRRS methods are comparable, regardless of the rank, while when only a small subset of predictors is relevant, the new class of methods can provide substantial rate improvements over the existing methods, especially when the rank is low.

## 2 Adaptation to row and rank sparsity: Estimation procedures and oracle inequalities

### 2.1 The single-stage joint rank and row selection estimator

In this section we modify the rank selection criterion (RSC) introduced in Bunea, She and Wegkamp (2011) to accommodate variable selection. We propose our single-stage joint rank and row selection (JRRS) estimator

 B̂ = argmin_B {∥Y − XB∥²_F + pen(B)},  (2)

also denoted by JRRS1, with penalty term

 pen(B) = cσ²r(B){2n + log(2e)|J(B)| + |J(B)| log(ep/|J(B)|)}.  (3)

The penalty is essentially proportional to the number of parameters in a model with |J(B)| predictors and of reduced rank r(B). Here c is a numerical constant, J(B) is the set of indices of the nonzero rows of a generic matrix B, r(B) is its rank, and the squared Frobenius norm ∥B∥²_F is equal to the sum of the squared entries of B.
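The penalty (3) can be computed directly; in the sketch below the constant c = 2 is an arbitrary placeholder (the theory only requires c large enough) and σ² is assumed known:

```python
import numpy as np

def jrrs_penalty(B, n, sigma2, c=2.0, tol=1e-10):
    """Penalty (3): c * sigma^2 * r(B) * {2n + log(2e)|J(B)| + |J(B)| log(ep/|J(B)|)}.

    J(B) is the set of nonzero rows of B and r(B) its rank.
    c = 2.0 is an illustrative placeholder for the numerical constant.
    """
    p = B.shape[0]
    J = np.where(np.linalg.norm(B, axis=1) > tol)[0]   # indices of nonzero rows
    s = len(J)
    if s == 0:
        return 0.0
    r = np.linalg.matrix_rank(B, tol=tol)
    return c * sigma2 * r * (2 * n + np.log(2 * np.e) * s
                             + s * np.log(np.e * p / s))
```

Note that the penalty grows linearly in the rank and roughly like s log(p/s) in the number s of retained rows, mirroring the parameter count above.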

If B̂ is computed by minimizing (2) over all p×m matrices B, then Theorem 1 stated below shows that it adapts optimally to the unknown row and rank sparsity of A: the mean squared error of B̂ coincides with that of optimal estimators of rank r with |J| nonzero rows, had these values been known prior to estimation. However, the construction of B̂ does not utilize knowledge of either r or J, hence the term adaptive. The minimax lower bounds for this model can be obtained by an immediate modification of Theorem 5 in Koltchinskii, Lounici and Tsybakov (2011). Our single-stage JRRS estimator given in (2) above achieves the lower bound, up to a log factor, under no restrictions on the design X, the rank r or the dimensions n, m and p.

###### Theorem 1

The single-stage JRRS estimator in (2), using pen in (3) with c large enough (our proof shows that a smaller constant may be taken, at the cost of increasing the numerical constants in the right-hand side of the oracle inequality), satisfies

 E[∥XA−X^B∥2F]≤10∥XA−XB∥2F+8pen(B)+768nσ2exp(−n/2)

for any B with J(B) ≠ ∅. In particular, taking B = A shows that E[∥XA − XB̂∥²_F] ≲ pen(A).

Here and elsewhere ≲ means that the inequality holds up to multiplicative numerical constants.

The proof of Theorem 1 remains valid if the matrices we select from depend on the data. Thus, our procedure can be used for selecting from any countable list of random matrices of different ranks and with different sparsity patterns. We will make essential use of this fact in the next section.

###### Theorem 2

For any collection B₁, B₂, … of (random) nonzero p×m matrices, the single-stage JRRS estimator

 B̃ = argmin_{B_j} {∥Y − XB_j∥²_F + pen(B_j)}  (4)

with the same penalty (3) satisfies

 E[∥XA − XB̃∥²_F] ≲ E[min_j {∥XA − XB_j∥²_F + pen(B_j)}] + 768nσ² exp(−n/2).

### 2.2 Two-step joint rank and row selection estimators

The computational complexity of the single-stage JRRS estimator (2) is owed to the component of the penalty term proportional to |J(B)|, which is responsible for row selection. The existence of this term in (3) forces complete enumeration of the model space. We address this problem by proposing a convex relaxation of this component, the group lasso penalty ∥B∥_{2,1}. Here ∥B∥_{2,1} is the sum of the Euclidean norms of the rows of B. In this section we propose two alternatives, each a two-step JRRS procedure and each building on the following core estimator.

#### 2.2.1 Rank-constrained predictor selection

We define our rank-constrained row-sparse estimators of A as

 B̂_k = argmin_{r(B)≤k} {∥Y − XB∥²_F + 2λ∥B∥_{2,1}}.  (5)

Here λ ≥ 0 is a tuning parameter and the minimization is over all p×m matrices B of rank less than or equal to (a fixed) k. A computationally efficient numerical algorithm for solving this minimization problem is given in Section 3.1. Clearly, for k = p ∧ m, there is no rank restriction in (5) and the resulting estimator is the GLASSO estimator; for λ = 0, we obtain the reduced-rank regression estimator. Thus, the procedure yielding the estimators B̂_k of rank at most k acts as a synthesis of the two dimension reduction strategies, having each of them as limiting points. We will refer to B̂_k as the rank constrained group lasso (RCGL) estimators.

Since this estimator is central to our procedures, we analyze it first. We need the following mild assumption on the design; throughout, we write Σ = X′X.

Assumption. For an index set I ⊆ {1, …, p} and a positive number δ_I, the design X satisfies

 tr(M′ΣM) ≥ δ_I ∑_{i∈I} ∥m_i∥²₂  (6)

for all p×m matrices M (with rows m_i) satisfying ∑_{i∉I} ∥m_i∥₂ ≤ 2 ∑_{i∈I} ∥m_i∥₂.

Remarks.

(1) The constant 2 may be replaced by any constant larger than 1.

(2) This assumption allows designs with p > n, and can be seen as a version of the restricted eigenvalue condition in the variable selection literature, introduced in Bickel, Ritov and Tsybakov (2009) and analyzed in depth in Bühlmann and van de Geer (2011).

(3) A sufficient condition for (6) is: there exists a diagonal matrix D, with D_ii = δ_I for all i ∈ I and D_ii = 0 otherwise, such that Σ − D is positive definite.

Let λ₁(Σ) denote the largest eigenvalue of Σ and set the tuning parameter

 λ = Cσ √(λ₁(Σ) k m log(ep))  (7)

for some numerical constant C > 0. Notice that λ depends on k, but we suppress this dependence in our notation.

###### Theorem 3

Let B̂_k be the global minimizer of (5) corresponding to λ in (7) with C large enough. Then, we have

 E[∥XB̂_k − XA∥²_F] ≲ ∥XB − XA∥²_F + kσ²{n + (1 + λ₁(Σ)/δ_{J(B)}) |J(B)| log(p)}  (8)

for any matrix B with r(B) ≤ k and nonzero rows indexed by J(B), provided X satisfies the assumption above with I = J(B).

Remarks.

(1) The term |J(B)| log(p) in the display of Theorem 3 is multiplied by a factor 1 + λ₁(Σ)/δ_{J(B)}. This factor can be viewed as a generalized condition number of the matrix X. If it stays bounded, Theorem 3 shows that, within the class of row sparse matrices of fixed rank k, the RCGL estimator is row-sparsity adaptive, in that the best number of predictors does not have to be specified prior to estimation.

(2) It is interesting to contrast our estimator with the regular GLASSO estimator, which minimizes (5) over all p×m matrices B without any rank restriction. Our choice (7) of the tuning parameter λ markedly differs from the choice proposed by Lounici et al. (2011) for the GLASSO estimator. We need a different choice for λ and a more refined analysis since we minimize in (5) over all matrices of rank at most k.

#### 2.2.2 Adaptive rank-constrained predictor selection

We now develop theoretical properties of three methods: Method 1 (RSC → RCGL), Method 2 (RCGL → JRRS₁) and Method 3 (GLASSO → RSC), where JRRS₁ denotes the single-stage JRRS estimator of Section 2.1.

Theorem 3 suggests that by complementing RCGL with a method that estimates the rank consistently, we could obtain a row and rank optimal adaptive estimator. This is indeed true.

Method 1 (RSC → RCGL).

• Use the rank selection criterion (RSC) of Bunea, She and Wegkamp (2011) to select k̂ as the number of singular values of PY that exceed a threshold μ proportional to the noise level. Here P = X(X′X)⁻X′ is the projection matrix on the space spanned by the columns of X.

• Compute the rank constrained GLASSO estimator in (5) above with k = k̂ to obtain the final estimator B̂⁽¹⁾.
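The rank-selection step can be sketched in a few lines; the threshold `mu` is left as an input (the paper ties it to the noise level), and the helper name is ours:

```python
import numpy as np

def rsc_rank(X, Y, mu):
    """RSC rank estimate: the number of singular values of P Y exceeding mu,
    where P is the orthogonal projection onto the column space of X."""
    P = X @ np.linalg.pinv(X)                       # projection onto span(X)
    svals = np.linalg.svd(P @ Y, compute_uv=False)
    return int(np.sum(svals > mu))
```

In the noiseless case Y = XA, any small positive threshold recovers the rank of XA exactly.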

This two-step estimator adapts to both rank and row sparsity, under two additional, mild restrictions, which we describe informally below.

The first restriction only requires that the signal strength, measured by the rth singular value of the signal matrix XA, be larger than the “noise level”, otherwise its detection would become problematic. The tightness of this requirement is discussed in detail in Bunea, She and Wegkamp (2011). Theorem 2 of that work proves that the correct rank r will be selected with probability close to one. (If the rank of X is small compared to n, we suggest replacing n by the rank of X in the threshold level; see the remark following Corollary 4 in Bunea, She and Wegkamp (2011).)

The second restriction is technical and needed to guarantee that the error due to selecting the rank is negligible compared to the rate of Theorem 4 below.

###### Theorem 4

Let X satisfy the assumption of Section 2.2.1 with I = J(A), let the generalized condition number λ₁(Σ)/δ_J be bounded, and let the two restrictions above hold. Then the two-step JRRS estimator B̂⁽¹⁾, with λ set according to (7) with C large enough, satisfies

 E[∥XB̂⁽¹⁾ − XA∥²_F] ≲ σ²{nr + |J| r log(p)}.

Hence, B̂⁽¹⁾ is row and rank adaptive, and achieves the same optimal rate, up to a logarithmic factor, as the row and rank adaptive estimator B̂ studied in Theorem 1 above. While Theorem 1 is proved under no restrictions on the design, we view the mild conditions of Theorem 4 as a small price to pay for the computational efficiency of B̂⁽¹⁾ relative to that of B̂ in (2). The practical choice of the threshold μ in the initial step of our procedure can be done either by replacing σ by an estimator, as suggested and analyzed theoretically in Section 2.4 of Bunea, She and Wegkamp (2011), or by cross-validation. The latter is valid in this context for consistent rank selection, as the minimum squared error of rank restricted estimators in (1) is achieved for the true rank, as discussed in detail in Bunea, She and Wegkamp (2011).

We now present an alternative adaptive method that is more computationally involved than Method 1, as it involves a search over a two-dimensional grid, but its analysis does not require the two additional restrictions above.

Method 2 (RCGL → JRRS₁).

• Pre-specify a grid of values for (k, λ) and use (5) to construct the class of candidate estimators B̂_{k,λ}.

• Compute B̂⁽²⁾ as the minimizer of ∥Y − XB̂_{k,λ}∥²_F + pen(B̂_{k,λ}) over the grid, with pen defined in (3) above.

We have the same conclusion as for Method 1:

###### Theorem 5

Provided X satisfies the assumption of Section 2.2.1 with I = J(A), the generalized condition number λ₁(Σ)/δ_J is bounded, and the grid contains λ in (7) for some C large enough, we have

 E[∥XB̂⁽²⁾ − XA∥²_F] ≲ σ²{nr + |J| r log(p)}.

We see that B̂⁽²⁾ has the same rate as B̂⁽¹⁾, under the condition on the design only. Our simulation studies in Section 3.2 indicate that the numerical results of Methods 1 and 2 are comparable.

Remark. A perhaps more canonical two-stage procedure is as follows:

Method 3 (GLASSO → RSC).

• Select the predictors via the GLASSO.

• Use the rank selection criterion (RSC) of Bunea, She and Wegkamp (2011) to construct an adaptive estimator, of reduced rank, based only on the selected predictors.

It is clear that as soon as we have selected the predictors consistently in the first step, selecting the rank consistently in the second step, and then proving row and rank sparsity of the resulting estimator, will follow straightforwardly from existing results, for instance, Theorem 7 in Bunea, She and Wegkamp (2011). Although this is a natural path to follow, there is an important caveat to consider: the sufficient conditions under which this two-step process yields adaptive (to row and rank sparsity) estimators include the conditions under which the GLASSO yields consistent group selection. These conditions are in the spirit of those given in Bunea (2008) for the Lasso, and involve the mutual coherence condition on X′X, which postulates that the off-diagonal elements of X′X be small, with a bound that decreases in the number of relevant predictors; this must further be coupled with a lower bound on the Euclidean norms of the nonzero rows of A [cf. Lounici et al. (2011)]. For designs for which X′X is even closer to the identity matrix, the condition on the minimum size of detectable coefficients can be relaxed; see Corollary 5.2 in Lounici et al. (2011). Our Theorems 4 and 5 require substantially weaker assumptions on the design.

## 3 Computational issues and numerical performance comparison

### 3.1 A computational algorithm for the RCGL-estimator

In this section we design an algorithm for minimizing

 F(B; λ) := ½∥Y − XB∥²_F + λ∥B∥_{2,1}  (9)

over all p×m matrices B of rank less than or equal to k. Recall that by solving this problem we provide a way of performing rank-constrained variable selection in model (1). Directly solving the nonconvex constrained minimization problem for B in (9) may be difficult. One way of surmounting this difficulty is to write B = SV′, with S a p×k matrix and V an m×k matrix with orthonormal columns. Then the rank constrained group lasso (RCGL) optimization problem is equivalent to finding

 (Ŝ, V̂) = argmin_{S ∈ ℝ^{p×k}, V ∈ 𝕆^{m×k}} ½∥Y − XSV′∥²_F + λ∥S∥_{2,1},  (10)

where the minimum is taken over all m×k matrices V with orthonormal columns and all p×k matrices S. With a slight abuse of notation, we still denote the objective function in (10) by F. We propose an iterative optimization procedure, Algorithm 1, that alternates between (a) minimization over S for fixed V and (b) minimization over V for fixed S.

The following theorem presents a global convergence analysis for Algorithm 1, where global in this context refers to the fact that the algorithm converges for any initial point.

###### Theorem 6

Given k and an arbitrary starting point V⁽⁰⁾, let (S⁽ᵗ⁾, V⁽ᵗ⁾), t ≥ 1, be the sequence of iterates generated by Algorithm 1. The following two statements hold:

(i) Any accumulation point of (S⁽ᵗ⁾, V⁽ᵗ⁾) is a stationary point of F, and F(S⁽ᵗ⁾, V⁽ᵗ⁾) converges monotonically to F(S*, V*) for some stationary point (S*, V*).

(ii) Suppose that the iterates strictly decrease the objective at any point outside the set of local minima of F. Then, any accumulation point of (S⁽ᵗ⁾, V⁽ᵗ⁾) is a local minimum of F, and F(S⁽ᵗ⁾, V⁽ᵗ⁾) converges monotonically to F(S*, V*) for some local minimizer (S*, V*).

Remarks.

(1) We run the algorithm to obtain a solution path, either for each (k, λ) in a two-dimensional grid or for a grid of λ values with k determined by RSC. From the solution path, we get a series of candidate estimates. Then the single-stage JRRS criterion (4) or other tuning criteria can be used to select the optimal estimate.

(2) Our results are of the same type as those established for the convergence of the EM algorithm [Wu (1983)]. Algorithm 1 can be viewed as a block coordinate descent method, but the conclusion in Theorem 6 is stronger in some sense: the guaranteed convergence to a stationary point (to be defined in Appendix .6) does not require the uniqueness of the minimizer in step (a), which is a crucial assumption in the literature [see, e.g., Bertsekas (1999) and Tseng (2001)].

(3) Step (a) needs to solve a GLASSO optimization problem. To see this, denote the standard vectorization operator by vec and the Kronecker product by ⊗, and rewrite ∥Y − XSV′∥²_F as ∥vec(Y) − (V ⊗ X) vec(S)∥²₂. Although this subproblem is convex, finding its global minimum point can still be expensive for large data. Instead, one may perform some low-cost thresholding for a few steps. Concretely, let K be a constant satisfying K ≥ λ₁(X′X). Given V, define the operator T_V as

 T_V ∘ S = Θ⃗((1/K)X′YV + (I − (1/K)X′X)S; λ/K)  ∀ S ∈ ℝ^{p×k},  (11)

where Θ⃗ is a multivariate version of the soft-thresholding operator Θ. For any vector θ and threshold t ≥ 0, Θ(θ; t) = θ(1 − t/∥θ∥₂) for ∥θ∥₂ > t and Θ(θ; t) = 0 otherwise; for any matrix S with rows s_i, Θ⃗(S; t) applies Θ(·; t) to each row s_i.

We now replace step (a) in Algorithm 1 by a finite number of applications of T_V, where the number of thresholding iterations performed at each outer step is bounded by a cap specified based on available computational resources; the counts need not be equal across steps. This algorithm, denoted by Algorithm 2, offers more flexibility and is more convenient than Algorithm 1 in implementation. Although at each iteration the update is not uniquely determined, a stronger global convergence result holds for Algorithm 2.
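A minimal runnable sketch in the spirit of Algorithm 2 follows: step (a) is carried out by a fixed number of thresholding updates (11), and step (b) by an orthogonal Procrustes update of V. The function names, iteration counts and random initialization of V are our own illustrative choices, not the paper's:

```python
import numpy as np

def row_soft_threshold(S, t):
    """Multivariate soft-thresholding: shrink the Euclidean norm of each row by t."""
    norms = np.linalg.norm(S, axis=1, keepdims=True)
    return S * np.maximum(0.0, 1.0 - t / np.maximum(norms, 1e-12))

def rcgl(X, Y, k, lam, n_iter=300, inner=1, seed=0):
    """Alternating sketch for (10): minimize 0.5*||Y - X S V'||_F^2 + lam*||S||_{2,1}
    over S (p x k) and V (m x k) with orthonormal columns.

    Step (a): `inner` thresholding updates of the form (11), with step size 1/K,
              where K bounds the largest eigenvalue of X'X (a majorization step).
    Step (b): orthogonal Procrustes update of V via the SVD of Y'XS.
    """
    p, m = X.shape[1], Y.shape[1]
    K = np.linalg.norm(X, 2) ** 2 + 1e-8               # K >= lambda_1(X'X)
    rng = np.random.default_rng(seed)
    V, _ = np.linalg.qr(rng.standard_normal((m, k)))   # random orthonormal start
    S = np.zeros((p, k))
    for _ in range(n_iter):
        for _ in range(inner):                         # step (a): updates (11)
            G = S + X.T @ (Y @ V - X @ S) / K
            S = row_soft_threshold(G, lam / K)
        U, _, Wt = np.linalg.svd(Y.T @ (X @ S), full_matrices=False)
        V = U @ Wt                                     # step (b): Procrustes solution
    return S @ V.T                                     # rank <= k, row-sparse estimate
```

Large λ zeroes out entire rows of S, and hence of SV′, so variable selection and the rank constraint are enforced simultaneously.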

###### Theorem 7

Given k and an arbitrary starting point V⁽⁰⁾, let (S⁽ᵗ⁾, V⁽ᵗ⁾), t ≥ 1, be the sequence of iterates generated by Algorithm 2. Then, any accumulation point of (S⁽ᵗ⁾, V⁽ᵗ⁾) is a coordinatewise minimum point (and a stationary point) of F, and F(S⁽ᵗ⁾, V⁽ᵗ⁾) converges monotonically to F(S*, V*) for some coordinatewise minimum point (S*, V*).

### 3.2 Simulation studies

The setup of our simulations is as follows:

• The design matrix X has n i.i.d. rows drawn from a multivariate normal distribution with mean zero and a covariance matrix under which the correlations between predictors are low.

• The coefficient matrix A has the form

 A = [A₁; O] with A₁ = b[B₀, B₀B₁],

where b > 0 is a signal strength parameter, B₀ is a matrix with r columns and B₁ is a matrix with m − r columns, so that A has rank at most r. All entries in B₀ and B₁ are i.i.d. N(0, 1).

• The noise matrix E has independent N(0, σ²) entries. Let eᵢ denote its ith row.

• Each row yᵢ of Y is then generated as yᵢ = xᵢA + eᵢ, i = 1, …, n, where xᵢ denotes the ith row of X.
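The data-generating mechanism above can be sketched as follows; since the exact design covariance is not specified in this excerpt, an identity covariance is substituted as a simplifying assumption:

```python
import numpy as np

def simulate(n, p, m, p1, r, b, sigma=1.0, seed=0):
    """Row- and rank-sparse simulation sketch.

    A = [A1; 0] with A1 = b*[B0, B0 B1], so A has p1 nonzero rows and
    rank <= r; entries of B0 (p1 x r) and B1 (r x (m-r)) are i.i.d. N(0,1).
    The identity design covariance is an illustrative simplification.
    """
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))
    B0 = rng.standard_normal((p1, r))
    B1 = rng.standard_normal((r, m - r))
    A1 = b * np.hstack([B0, B0 @ B1])
    A = np.vstack([A1, np.zeros((p - p1, m))])
    E = sigma * rng.standard_normal((n, m))
    return X, A, X @ A + E
```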

This setup contains many noisy features, but the relevant features lie in a low-dimensional subspace. This structure resembles many real world data sets; see our examples in Section 4, where the low rank structure is inherent and, thus, rank-constrained variable selection is desired.

We report two settings, which differ in the problem dimensions n, p and m, the number of nonzero rows of A, its rank and the signal strength. Although we performed experiments in many other settings, say, with other combinations of the dimensions and signal strengths, we do not report all results, as the conclusions are similar. The current setups show that variable selection, without taking the rank information into consideration, may be suboptimal even if the correlations between predictors are low.

We tested five methods: RSC, GLASSO, Method 1 (RSC → RCGL), Method 2 (RCGL → JRRS₁) and Method 3 (GLASSO → RSC), as described in Section 2.2. To minimize the influence of various parameter tuning strategies on our performance comparison, we generated a large validation data set (10,000 observations) to tune the parameter of each algorithm (with the exception of Method 2), and we also generated another independent data set of the same size as the test data to evaluate the test error. Similar to the LARS-OLS hybrid [Efron et al. (2004)], for each GLASSO and RCGL estimate, we computed the least squares estimate restricted to the selected dimensions. We found that the resulting (bias corrected) solution paths are more suitable for parameter tuning. For Method 2, after getting the (bias corrected) solution path, we used the penalty (3) to select the optimal estimate; in contrast to the other two methods, no validation data is used for tuning.

Each model was simulated 50 times, and Tables 2 and 3 summarize our findings. We evaluated the prediction accuracy of each estimator by the mean squared error (MSE) on the test data at each run. Since the MSE histograms turned out to be highly asymmetric, we computed the trimmed mean of the MSEs as the goodness of fit of the obtained model. This trimmed mean is more robust than the mean and more stable than the median, and it therefore allows a fairer comparison between methods.
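A symmetrically trimmed mean of this kind can be computed as follows; the default trimming proportion is our illustrative choice, not the paper's:

```python
import numpy as np

def trimmed_mean(x, prop=0.1):
    """Symmetrically trimmed mean: drop the smallest and largest `prop`
    fraction of the observations, then average the remainder."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(len(x) * prop)
    return x[k:len(x) - k].mean()
```

On a sample with one extreme outlier, e.g. MSEs [1, 2, 3, 4, 100], a 20% trim discards 1 and 100 and averages the rest, which is what makes the summary stable across highly asymmetric runs.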

We also report the median number of selected predictors and the median rank estimate over all runs. Estimators with small MSE, few selected predictors and a low rank estimate are preferred from the point of view of statistical modeling.

Finally, we provide the rates of nonincluded true variables (denoted by M, for misses) and the rates of incorrectly included variables (denoted by FA, for false alarms). Ideally, both rates are low, especially the M-rates, since we do not wish to discard relevant features.

We can draw the following conclusions from Tables 2 and 3:

• We see that straightforward variable selection via GLASSO often severely misses some true features, as seen from its high M numbers. RSC achieved good rank recovery, as expected, but, by the definition of this estimator, it uses all variables. Clearly both GLASSO and RSC alone are inferior to the three JRRS-type methods (Methods 1, 2 and 3).

• Method 1 dominates all other methods. Its MSE results are impressive and confirm the rate improvement established in Section 2.2 over the GLASSO and RSC. While Method 1 may not recover the support of A exactly, its M numbers indicate that we did not miss many true features.

• Method 2, unlike Methods 1 and 3, did not use the large validation data set for ideal parameter tuning, which explains its slight inferiority relative to the other two methods. However, we see that even without validation-based tuning, which may at times be infeasible in practice, this method is a serious contender. It supports the theoretical findings of Theorem 2 on the usage of the penalty (3) for model comparison and tuning parameter selection.

• The performance of Method 3 is inferior to that of Method 1 in the first experiment, and comparable with both Methods 1 and 2 in the second experiment, where the three methods have essentially the same behavior.

In conclusion, we found that Method 1 is the clear winner, in terms of performance as well as computational speed, among the two-stage JRRS procedures we considered, and it is particularly appealing in the high-dimensional regime. In particular, it shows the advantage of the novel penalty type, which enforces simultaneous row sparsity and rank reduction on the coefficient matrix. Method 2, which uses penalty (3), provides empirical support for Theorem 2.

## 4 Applications

In this section we apply Method 1, with its tuning parameters chosen via cross-validation, to two real data sets from machine learning and cognitive neuroscience.

### Norwegian paper quality

These data were obtained from a controlled experiment carried out at a paper factory in Norway (Norske Skog, the world's second-largest producer of publication paper) to uncover the effect of three control variables on the quality of the paper, which was measured by 13 response variables. Each of the control variables takes values in {−1, 0, 1}. To account for possible interactions and nonlinear effects, second order terms were added to the set of predictors, yielding p = 9 predictors and the intercept term. There were 29 observations, with no missing values, made on all response and predictor variables. The Box–Behnken design of the experiment and the resulting data are described in Aldrin (1996) and Izenman (2008). Since neither the group penalty nor the rank constraint is imposed on the intercept term, we always center the responses and standardize the predictors in the training data (and transform the validation/test data accordingly).

The data set can be downloaded from the website of Izenman (2008) and its structure clearly indicates that dimension reduction is possible, making it a typical application for reduced rank regression methods. The RSC method with adaptive tuning, as described in Bunea, She and Wegkamp (2011), selected the rank 3. This finding is consistent with Aldrin (1996), who assessed the performance of the rank 3 estimator by leave-one-out cross-validation (LOOCV) and obtained a minimum LOOCV error (total squared error, unscaled) of 326.2. We then employed the newly developed Method 1 to automatically determine the useful predictors and pursue the optimal projections. Not surprisingly, the selected rank is still 3, yielding 3 new scores, which are now constructed from only 6 of the original 9 predictors; the other three are discarded, and only the variables from Table 4 are selected. The tuning result was the same for 10-fold CV and LOOCV. The minimum LOOCV error is now 304.5. We found no interaction effect between two of the control variables, an interesting complement to Aldrin's analysis. Table 4 shows the construction weights of the 3 new orthogonal score variables from the rank-3 RSC on the selected set of variables. They are ordered by an importance measure given by the associated eigenvalues [see Reinsel and Velu (1998) and Izenman (2008) for the explanation]. For instance, the first and most important score variable (accounting for 57.5% of the relevant trace) can be read off the first column of weights in Table 4 and used as a concise summary predictor for all 13 response variables simultaneously; it quantifies the effect of the design variables on paper quality control.

### Cognitive neuroimaging

We present an analysis of the data set described in Bunea et al. (2011), collected to investigate the effect of HIV infection on human cognitive abilities. Neuro-cognitive performance is typically measured via correlated neuro-cognitive indices (NCIs). This study employed a battery of NCIs, falling into the five domains of attention/working memory, speed of information processing, psychomotor abilities, executive function, and learning and memory. These indices were measured for 62 HIV patients in the study. The set of explanatory variables was large and contained: (a) clinical and demographic predictors and (b) brain volumetric and diffusion tensor imaging (DTI) derived measures of several white-matter regions of interest, such as fractional anisotropy, mean diffusivity, axial diffusivity and radial diffusivity, along with all volumetrics $\times$ DTI interactions. We refer to Bunea et al. (2011) for details. The final model has $p = 235$ predictors, much greater than the sample size $n = 62$. An initial analysis of this data set was performed using the RSC to select a model of rank 1 and construct the corresponding new predictive score. Although this is a massive reduction of the dimension of the predictor space, all 235 initial predictors were involved in the construction of the new score.

This leaves unanswered the important question of which variables (especially which DTI derived measures) are most predictive of the neuro-cognitive changes in HIV patients. After standardizing the predictors, we ran Method 2.2.2.

We selected a model of rank 1 and constructed one new predictive score but, very importantly, this score is a linear combination of only 10 predictors that were selected from the original pool of 235.

When we set aside part of the data as a separate test set, and used the remainder to fit the model and tune the regularization parameters, the mean squared error (MSE) of the newly proposed method was smaller than that of the RSC estimate. Moreover, our analysis not only demonstrates the existence of a strong association between the variable Education and the neuro-cognitive abilities of HIV patients, which had already been established by other means in the literature, but also suggests, as a perhaps new finding, that the variable fractional anisotropy at the corpus callosum (fa_cc1) stands out among the very many DTI-derived measures in terms of predictive power.

## Appendix

For a generic index set $I \subseteq \{1, \ldots, p\}$, we define the matrix $X_I$ as follows: its $j$th column coincides with that of $X$ if $j \in I$; otherwise we set the entire column to zero. Furthermore, we define $P_I$ as the projection matrix on the column space of $X_I$.
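The notation just introduced can be illustrated numerically (a small sketch with simulated data; the names are hypothetical): $X_I$ zeroes out the columns of $X$ outside $I$, and $P_I$ is the orthogonal projection onto the column space of $X_I$.

```python
import numpy as np

# Illustration of the notation X_I and P_I defined above.
rng = np.random.default_rng(2)
X = rng.normal(size=(8, 5))
I = [0, 2, 3]                   # a generic index set

X_I = np.zeros_like(X)
X_I[:, I] = X[:, I]             # columns in I kept, all other columns zeroed

Q, _ = np.linalg.qr(X[:, I])    # orthonormal basis of col(X_I)
P_I = Q @ Q.T                   # projection matrix onto col(X_I)
```

By construction $P_I$ is symmetric and idempotent, and $P_I X_I = X_I$.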

Since we favor transparent proofs, we did not attempt to optimize various numerical constants.

### .1 Proof of Theorem 1

By the definition of $\widehat B$, for any matrix $B$, the inequality
\[
\|Y - X\widehat B\|_F^2 + \mathrm{pen}(\widehat B) \le \|Y - XB\|_F^2 + \mathrm{pen}(B)
\]
holds. Writing $Y = XA + E$, this is equivalent with
\[
\|XA - X\widehat B\|_F^2 \le \|XA - XB\|_F^2 + 2\,\mathrm{pen}(B) + \bigl\{2\langle E, X\widehat B - XB\rangle - \mathrm{pen}(\widehat B) - \mathrm{pen}(B)\bigr\}.
\]

We consider two complementary cases: $\widehat B \neq 0$ and $\widehat B = 0$.

Case 1: $\widehat B \neq 0$. We write $J_1 = J(B)\setminus J(\widehat B)$, $J_2 = J(B)\cap J(\widehat B)$ and $J_3 = J(\widehat B)\setminus J(B)$, and we note that $r(X_{J_1}(\widehat B - B)) \le r(B)$, $r(X_{J_2}(\widehat B - B)) \le r(B) + r(\widehat B)$ and $r(X_{J_3}(\widehat B - B)) \le r(\widehat B)$. Hence,
\[
\langle E, X(\widehat B - B)\rangle = \sum_{k=1}^{3}\langle E, X_{J_k}(\widehat B - B)\rangle.
\]

The penalty term in (3) can be written as
\[
\mathrm{pen}(B) = c\sigma^2 r(B)\bigl\{2n + f(|J(B)|)\bigr\}
\]
for the function $f(x) = x + x\log(2ep/x)$. Since $f''(x) = -1/x < 0$, $f$ is concave for $x > 0$, and by concavity we have
\[
f(x+y) \ge \tfrac12 f(2x) + \tfrac12 f(2y) = x + y + x\log\bigl(ep/x\bigr) + y\log\bigl(ep/y\bigr)
\]

for all $x, y > 0$. Consequently, writing $r_1 = r(B)$, $r_2 = r(B) + r(\widehat B)$ and $r_3 = r(\widehat B)$,
\[
\begin{aligned}
\mathrm{pen}(B) + \mathrm{pen}(\widehat B) &= c\sigma^2 r_1\bigl\{2n + f(|J_1| + |J_2|)\bigr\} + c\sigma^2 r_3\bigl\{2n + f(|J_2| + |J_3|)\bigr\}\\
&\ge c\sigma^2\sum_{k=1}^{3} r_k\Bigl\{n + |J_k| + |J_k|\log\bigl(ep/|J_k|\bigr)\Bigr\}.
\end{aligned}
\]
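As a quick numerical sanity check (not part of the proof), the concavity bound $f(x+y) \ge \tfrac12 f(2x) + \tfrac12 f(2y)$ can be verified directly, assuming the reconstruction $f(x) = x + x\log(2ep/x)$, for which $\tfrac12 f(2x) = x + x\log(ep/x)$:

```python
import numpy as np

# Numerical check of the concavity bound f(x+y) >= f(2x)/2 + f(2y)/2,
# assuming f(x) = x + x*log(2ep/x); any concave f satisfies the
# inequality, and the closed form matches f(2x)/2 = x + x*log(ep/x).
p = 50
def f(x):
    return x + x * np.log(2 * np.e * p / x)

rng = np.random.default_rng(3)
xs, ys = rng.uniform(0.5, p, 1000), rng.uniform(0.5, p, 1000)
lhs = f(xs + ys)
rhs = 0.5 * f(2 * xs) + 0.5 * f(2 * ys)
closed = xs + ys + xs * np.log(np.e * p / xs) + ys * np.log(np.e * p / ys)
ok = bool(np.all(lhs >= rhs - 1e-9) and np.allclose(rhs, closed))
```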

This implies that

\[
\bigl\{2\langle E, X\widehat B - XB\rangle - \mathrm{pen}(\widehat B) - \mathrm{pen}(B)\bigr\} \le \sum_{k=1}^{3}\Bigl[2\langle E, X_{J_k}(\widehat B - B)\rangle - c\sigma^2 r_k\Bigl\{n + |J_k| + |J_k|\log\bigl(ep/|J_k|\bigr)\Bigr\}\Bigr].
\]

We define

\[
R = \max_{I}\Bigl(d_1^2(P_I E) - \frac{c}{2}\sigma^2\Bigl\{n + |I| + |I|\log\bigl(ep/|I|\bigr)\Bigr\}\Bigr)_+,
\]
where the maximum is taken over all nonempty index sets $I \subseteq \{1, \ldots, p\}$.

In this proof we set $c = 2$. (A careful inspection reveals that other choices of $c$ are possible, at the cost of larger constants elsewhere.) Using the inequality
\[
\langle E, X_{J_k}(\widehat B - B)\rangle = \langle P_{J_k}E, X_{J_k}(\widehat B - B)\rangle \le d_1(P_{J_k}E)\,\sqrt{r_k}\,\|X_{J_k}(\widehat B - B)\|_F
\]
and the inequalities $2ab \le \frac12 a^2 + 2b^2$ and $(a+b)^2 \le \frac32 a^2 + 3b^2$, valid for all $a, b \ge 0$, we further bound

\[
\begin{aligned}
\bigl\{2\langle E, X\widehat B - XB\rangle - \mathrm{pen}(\widehat B) &- \mathrm{pen}(B)\bigr\}\\
&\le \sum_{k=1}^{3}\Bigl[\frac12\|X_{J_k}(\widehat B - B)\|_F^2 + 2r_k d_1^2(P_{J_k}E) - c\sigma^2 r_k\Bigl\{n + |J_k| + |J_k|\log\bigl(ep/|J_k|\bigr)\Bigr\}\Bigr]\\
&\le \frac12\|X\widehat B - XB\|_F^2 + 6\Bigl(\max_k r_k\Bigr)R\\
&\le \frac34\|X\widehat B - XA\|_F^2 + \frac32\|XB - XA\|_F^2 + 12nR,
\end{aligned}
\]
using $\max_k r_k \le 2n$. Now, the basic inequality at the beginning of the proof and the display above yield
\[
\begin{aligned}
\frac14\,\mathbb{E}\bigl[\|X\widehat B - XA\|_F^2\bigr] &\le \frac52\|XB - XA\|_F^2 + 2\,\mathrm{pen}(B) + 12n\,\mathbb{E}[R]\\
&\le \frac52\|XB - XA\|_F^2 + 2\,\mathrm{pen}(B) + 192\,n\sigma^2\exp(-n/2),
\end{aligned}
\]
using Lemma 8 in the last inequality. This concludes the first case.
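As a small numerical sanity check (not part of the proof), the singular-value bound $\langle P_J E, M\rangle \le d_1(P_J E)\sqrt{r(M)}\,\|M\|_F$ used above — a consequence of the von Neumann trace inequality — can be verified on simulated matrices:

```python
import numpy as np

# Numerical check of <PE, M> <= d_1(PE) * sqrt(r(M)) * ||M||_F,
# where d_1 denotes the largest singular value and <.,.> is the
# trace inner product.
rng = np.random.default_rng(4)
n, m = 12, 9
E = rng.normal(size=(n, m))
Q, _ = np.linalg.qr(rng.normal(size=(n, 5)))
P = Q @ Q.T                                            # projection onto a 5-dim subspace
M = rng.normal(size=(n, 3)) @ rng.normal(size=(3, m))  # rank(M) <= 3

inner = float(np.sum((P @ E) * M))                     # <PE, M> = trace((PE)' M)
d1 = np.linalg.svd(P @ E, compute_uv=False)[0]
bound = float(d1 * np.sqrt(3) * np.linalg.norm(M, "fro"))
```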

Case 2: $\widehat B = 0$. Using the same reasoning as above, we can argue that
\[
\begin{aligned}
\|XA\|_F^2 = \|XA - X\widehat B\|_F^2 &\le \|XB - XA\|_F^2 + 2\,\mathrm{pen}(B) - 2\langle E, XB\rangle - \mathrm{pen}(B)\\
&\le \frac52\|XB - XA\|_F^2 + 2\,\mathrm{pen}(B) + \frac34\|XA\|_F^2 + R'
\end{aligned}
\]

with
\[
R' = 2r(B)\,d_1^2(P_{J(B)}E) - \mathrm{pen}(B).
\]
By Lemma 3 in Bunea, She and Wegkamp (2011), we have $\mathbb{E}[d_1^2(P_{J(B)}E)] \le \sigma^2\{2n + f(|J(B)|)\}$, so that $\mathbb{E}[R'] \le 0$ for our choice $c = 2$ above. Hence,

\[
\frac14\,\mathbb{E}\bigl[\|XA - X\widehat B\|_F^2\bigr] \le \frac52\|XB - XA\|_F^2 + 2\,\mathrm{pen}(B),
\]

which concludes the second case, and our proof.

###### Lemma 8

We have
\[
\mathbb{E}\Bigl[\max_{I}\Bigl(d_1^2(P_I E) - \sigma^2\Bigl\{n + |I| + |I|\log\bigl(ep/|I|\bigr)\Bigr\}\Bigr)_+\Bigr] \le 16\sigma^2\exp(-n/2),
\]
where the maximum is taken over all nonempty index sets $I \subseteq \{1, \ldots, p\}$.