Adaptive estimation of the rank of the coefficient matrix in high dimensional multivariate response regression models

Xin Bing, Department of Statistical Science, Cornell University
Marten Wegkamp
Abstract

We consider the multivariate response regression problem with a regression coefficient matrix of low, unknown rank. In this setting, we analyze a new criterion for selecting the optimal reduced rank. This criterion differs notably from the one proposed in Bunea et al. (2011) in that it does not require estimation of the unknown variance of the noise, nor does it depend on a delicate choice of a tuning parameter. We develop an iterative, fully data-driven procedure that adapts to the optimal signal-to-noise ratio. This procedure finds the true rank in a few steps with overwhelming probability. At each step, our estimate increases, while at the same time it does not exceed the true rank. Our finite sample results hold for any sample size and any dimension, even when the number of responses and the number of covariates grow much faster than the number of observations. We perform an extensive simulation study that confirms our theoretical findings. The new method performs better and is more stable than the procedure of Bunea et al. (2011) in both low- and high-dimensional settings.

1 Introduction

1.1 Background

We study the multivariate response regression model

with an observed response matrix, a known design matrix, and a coefficient matrix of unknown rank. We assume that the entries of the noise matrix are i.i.d. Gaussian with mean zero and unknown variance. Section 5 discusses extensions to general, heavy tailed distributions of the errors.
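To fix ideas, the following minimal sketch simulates data from such a model. The symbol names (Y for the responses, X for the design, A for the low-rank coefficient matrix, E for the noise) and all dimensions are illustrative assumptions rather than notation taken from the displays above.

```python
# A minimal simulation sketch of the low-rank multivariate response
# regression model: Y = X A + E, with A of low rank r.
# Symbol names and dimensions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n, p, m, r = 100, 30, 25, 3              # samples, predictors, responses, true rank
sigma = 1.0                              # noise standard deviation

X = rng.standard_normal((n, p))          # design matrix
B = rng.standard_normal((p, r))          # A = B C has rank at most r
C = rng.standard_normal((r, m))
A = B @ C                                # p x m coefficient matrix, rank r (a.s.)
E = sigma * rng.standard_normal((n, m))  # i.i.d. N(0, sigma^2) noise

Y = X @ A + E                            # n x m matrix of responses
```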

Standard least squares estimation is tantamount to regressing each response on the predictors separately, thus ignoring the multivariate nature of the possibly correlated responses. In large dimensional settings (the numbers of responses and covariates are large relative to the sample size), it is desirable to achieve a dimension reduction in the coefficient matrix. One popular way of achieving this goal is to find a common subset of covariates that are relevant for prediction, using penalized least squares with a group lasso type penalty on the regression coefficients to recover the set of rows of the coefficient matrix that are non-zero; see, for instance, Bunea et al. (2012); Bühlmann and Van de Geer (2011); Lounici et al. (2011); Obozinski et al. (2011); Yuan and Lin (2006).

Reduced rank regression is a different approach to achieve the necessary dimension reduction. The main premise is that the coefficient matrix has low rank, so that it can be written as the product of two much smaller matrices. Then only a few linear combinations of the covariates are needed to explain the variation in the responses. Izenman (1975) coined the term reduced-rank regression for this class of models, but its history dates back to Anderson (1951b). There are many works on this topic in the classical setting of fixed dimensions and growing sample size, see Anderson (1951a); Rao (1978); Robinson (1973, 1974) and, more recently, Anderson (2002). A comprehensive overview of reduced rank regression is given by Reinsel and Velu (1998). Only recently has the high-dimensional case been discussed: Bunea et al. (2011, 2012); Giraud (2011, 2015); Negahban and Wainwright (2011); Rohde and Tsybakov (2011).

The main topic of this paper is the estimation of the unknown rank. Determination of the rank of the coefficient matrix is the first key step towards estimating the matrix itself. For known rank, Anderson (1951a) derives the asymptotic distribution of the reduced rank regression coefficient matrix estimator in the asymptotic setting with fixed dimensions and growing sample size. The estimator minimizes the squared Frobenius or Hilbert-Schmidt norm of the residuals over all matrices of the given rank and has a closed form, due to the Eckart-Young theorem (Eckart and Young, 1936; Schmidt, 1907). It is crucial to use the true rank to obtain a good fit for both the coefficient matrix and the fitted regression. In general, however, the rank is unknown a priori. The classical approach to estimating the rank uses the likelihood ratio test, see Anderson (1951b). An elementary calculation shows that this statistic coincides with Bartlett’s test statistic as a consequence of the relation between reduced rank regression and canonical correlation analysis, see Anderson (1951b); Rudelson and Vershynin (2010). Our main goal in this study is to develop a non-asymptotic method to estimate the rank that is easy to compute, adapts to the data, and is valid for any sample size and any dimensions, especially when the number of predictors and the number of responses are large. The resulting rank estimate can then be used to construct a possibly much smaller number of new transformed predictors, or the most important canonical variables based on the original predictors and responses, see (Izenman, 2008, Chapter 6) for a historical account. Under weak assumptions on the signal, our estimate can be shown to equal the true rank with overwhelming probability, namely exponentially close to one, so that the selection error is small compared to the overall error in estimating the coefficient matrix.
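For a known rank, the closed form supplied by the Eckart-Young theorem can be computed directly from a truncated singular value decomposition of the projected responses. The sketch below continues the simulated objects from the previous snippet; the function name and interface are illustrative.

```python
# Sketch of the reduced-rank estimator with known rank r: the fitted
# values equal the best rank-r approximation of P @ Y, where P projects
# onto the column space of X (Eckart-Young).  Continues (X, Y, r) above.
import numpy as np

def reduced_rank_fit(X, Y, r):
    """Return (B_hat, fitted) for the rank-r least squares problem."""
    Xpinv = np.linalg.pinv(X)
    P = X @ Xpinv                          # projection onto the range of X
    PY = P @ Y
    U, s, Vt = np.linalg.svd(PY, full_matrices=False)
    PY_r = (U[:, :r] * s[:r]) @ Vt[:r]     # best rank-r approximation of PY
    B_hat = Xpinv @ PY_r                   # rank-r coefficient estimate
    return B_hat, PY_r

B_hat, fitted = reduced_rank_fit(X, Y, r)
print(np.linalg.matrix_rank(B_hat))        # equals r in this simulation
```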

1.2 Recent developments

Bunea et al. (2011, 2012) propose

(1)

and recommended a choice of the tuning parameter proportional to the unknown noise variance, up to a numerical constant; the fit in (1) is measured in the Frobenius norm. In particular, Bunea et al. (2011) established a convenient closed form for the rank of the matrix that minimizes criterion (1). They gave sufficient conditions on the level of the smallest non-zero singular value of the signal to guarantee that the selected rank consistently estimates the true rank. The disadvantage of this method is that a value for the noise variance, in addition to the tuning parameter, is required. Bunea et al. (2011) proposed to use the unbiased estimator

(2)

based on the projection of the responses onto the range space of the design matrix. However, this becomes problematic when the residual degrees of freedom are not large enough, and it is infeasible when the projection leaves no residual at all. Giraud (2011) introduces another estimation scheme that does not require estimation of the noise variance. Unfortunately, a closed form for the minimizer as in Bunea et al. (2011) is lacking, and rank consistency in fact fails, as our simulations in Section 6.7 reveal. Moreover, the procedures in both Bunea et al. (2011) and Giraud (2011) are rather sensitive to the choices of their respective tuning parameters. We emphasize that Giraud (2011) studies the estimation error and not the rank of his estimator.
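For concreteness, the snippet below computes the usual unbiased variance estimator based on the projection residual; taking this as the form of display (2) is an assumption. It also makes explicit when the estimator breaks down.

```python
# The usual unbiased variance estimator based on the projection residual,
#   sigma2_hat = ||Y - P Y||_F^2 / (m (n - q)),  with q = rank(X).
# Reading display (2) as this formula is an assumption; the point is that
# it breaks down when q is close to (or equal to) n.
import numpy as np

def sigma2_unbiased(X, Y):
    n, m = Y.shape
    q = np.linalg.matrix_rank(X)
    if n <= q:
        raise ValueError("rank(X) >= n: the projection residual is zero, "
                         "no degrees of freedom left to estimate the variance")
    P = X @ np.linalg.pinv(X)
    resid = Y - P @ Y
    return np.sum(resid ** 2) / (m * (n - q))

print(sigma2_unbiased(X, Y))   # close to sigma**2 = 1 in the simulation above
```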

1.3 Proposed Research

This paper studies a third criterion,

(3)

Here the fit at each candidate rank is based on the truncated singular value decomposition of the projected response matrix, using its (decreasing) singular values and corresponding singular vectors. The criterion was suggested by the second author in a private conversation with Christophe Giraud, see Giraud (2015). The range over which we minimize (3) is restricted from above to avoid a non-positive denominator. The purpose of this paper is to show that this new criterion produces a consistent estimator of the rank. It turns out that the choice of the optimal tuning parameter involves a delicate trade-off. On the one hand, the tuning parameter should be large enough to prevent overfitting, that is, to prevent selecting a rank that is larger than the true rank. On the other hand, if one takes it too large, the selected rank will typically be smaller than the true rank, as the procedure will not be able to distinguish the smaller non-zero singular values of the signal from the noise. To deal with this trade-off effectively, we refine our initial procedure based on the new criterion (3) by an iterative procedure in Section 4 that provably finds the optimal value of the tuning parameter and, consequently, of the estimated rank. This method does not require any data-splitting, and our simulations show that it is very stable, even for general, heavy tailed error distributions. To our knowledge, it is a rare feat to have a feasible algorithm that finds the optimal tuning parameter in a fully data-driven way, without data-splitting, and with mathematically proven guarantees. It is worth mentioning that while our main interest is rank consistency, our approach achieves the optimal estimation performance under the Hilbert-Schmidt norm and improves upon the bound given in Giraud (2015) by incorporating the approximation error term.
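A brute-force implementation conveys the idea of the new selector. The particular form of the criterion used below, the Frobenius fit of the rank-k truncated SVD of the projected responses divided by a linearly shrinking denominator, is our reading of the description above and should be treated as an assumption; the choice of the tuning parameter mu is purely illustrative.

```python
# Brute-force sketch of minimizing a criterion of the assumed form
#   R(k) = ||Y - (PY)_k||_F^2 / (n m - mu k),
# with (PY)_k the rank-k truncated SVD of PY.  The form of R(k) and the
# value of mu are illustrative assumptions, not the paper's display (3).
import numpy as np

def grs_rank(X, Y, mu):
    """Generalized Rank Selection: minimize R(k) over admissible k."""
    n, m = Y.shape
    q = np.linalg.matrix_rank(X)
    P = X @ np.linalg.pinv(X)
    s = np.linalg.svd(P @ Y, compute_uv=False)           # singular values of PY
    total = np.sum((Y - P @ Y) ** 2) + np.sum(s ** 2)     # equals ||Y||_F^2
    best_k, best_val = 0, np.inf
    for k in range(0, min(q, m) + 1):
        denom = n * m - mu * k
        if denom <= 0:                                    # keep the denominator positive
            break
        # ||Y - (PY)_k||_F^2 = ||Y - PY||_F^2 + sum of squared tail singular values
        num = total - np.sum(s[:k] ** 2)
        if num / denom < best_val:
            best_k, best_val = k, num / denom
    return best_k

mu = 2.0 * (np.sqrt(m) + np.sqrt(np.linalg.matrix_rank(X))) ** 2  # illustrative choice
print(grs_rank(X, Y, mu))
```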

The paper is organized as follows. Section 2 shows that the minimizer of (3) has a closed form. The main results are discussed in Sections 3 and 4. Section 3 obtains rank consistency both in the case of no signal (a zero coefficient matrix) and in the case of sufficient signal. For the latter, we develop a key notion of signal-to-noise ratio that is required for rank consistency. A sufficient, easily interpretable condition will be presented that corresponds to a computable value (estimate) of the tuning parameter. In Section 4 we develop an iterative, fully automated procedure, which has guaranteed recovery of the true rank (with overwhelming probability) under increasingly milder conditions on the signal. The first step uses the potentially suboptimal estimate developed in Section 3, which nevertheless does not exceed the true rank, with overwhelming probability. This value is used to update the tuning parameter. Then we minimize (3) again, obtain a new rank estimate, and use it in turn to update the tuning parameter once more. The procedure produces a smaller tuning parameter at each iteration, thereby selecting a larger rank than the previous one, whilst each time we can guarantee that the selected rank does not exceed the true rank. This is a major mathematical challenge in our proof, which relies on highly non-trivial monotonicity arguments. The procedure stops when the selected rank does not change after an iteration. Our results hold with high probability (with failure probability exponentially small in the dimensions), which translates into extremely accurate estimates under a weak signal condition.

Section 5 describes several extensions of the developed theory, allowing for non-Gaussian errors.

A large simulation study is reported in Section 6. It confirms our theoretical findings, and shows that our method dominates the methods proposed in Bunea et al. (2011).

The proofs are deferred to the appendix.

1.4 Notation

For any matrix , we will use to denote the th element of (i.e., the entry on the th row and th column of ), and we write to denote its ordered singular values.

The Frobenius or Hilbert-Schmidt inner product on the space of matrices is defined, for commensurate matrices, as the trace of the product of one matrix with the transpose of the other. The corresponding norm is the Frobenius norm, and we recall that its square equals the sum of the squared singular values of the matrix. Moreover, it is known (Eckart and Young (1936); Schmidt (1907), see also the review by Stewart (1993)) that the best approximation, in Frobenius norm, of a matrix by a matrix of rank at most a given value is obtained from its singular value decomposition by retaining only the largest singular values and their singular vectors. Hence, the squared approximation error equals the sum of the squared discarded singular values.
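A quick numerical illustration of the Eckart-Young property used throughout: the truncated singular value decomposition beats any other candidate of the same rank in Frobenius norm. The matrix and dimensions below are arbitrary.

```python
# Numerical illustration of the Eckart-Young theorem: among all matrices
# of rank at most k, the rank-k truncated SVD of M minimizes the
# Frobenius distance to M.
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = (U[:, :k] * s[:k]) @ Vt[:k]
best_err = np.linalg.norm(M - M_k)               # sqrt of the sum of tail s**2

# any other rank-k candidate does at least as badly
for _ in range(1000):
    cand = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))
    assert np.linalg.norm(M - cand) >= best_err - 1e-12
print(best_err, np.sqrt(np.sum(s[k:] ** 2)))     # the two numbers coincide
```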

For other norms on matrices, we use the operator norm (the largest singular value) and the nuclear norm (the sum of the singular values). These norms satisfy the standard inequalities relating them to the Frobenius norm.

For general matrices, Weyl (1912) implies that each singular value of a sum of two matrices is bounded by a sum of appropriately indexed singular values of the two summands.
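The following self-contained check verifies, on a random example, a standard ordering of the operator, Frobenius and nuclear norms, together with Weyl's inequality for singular values; these are standard facts of the kind quoted above, not the paper's exact displays.

```python
# Check the ordering operator <= Frobenius <= nuclear and Weyl's inequality
#   sigma_{i+j-1}(M1 + M2) <= sigma_i(M1) + sigma_j(M2)   (1-based indices),
# which is used to control singular values of noisy matrices.
import numpy as np

rng = np.random.default_rng(2)
M1 = rng.standard_normal((7, 5))
M2 = rng.standard_normal((7, 5))

op = np.linalg.norm(M1, 2)           # operator norm  = largest singular value
fro = np.linalg.norm(M1, 'fro')      # Frobenius norm
nuc = np.linalg.norm(M1, 'nuc')      # nuclear norm   = sum of singular values
assert op <= fro <= nuc

s1 = np.linalg.svd(M1, compute_uv=False)
s2 = np.linalg.svd(M2, compute_uv=False)
s12 = np.linalg.svd(M1 + M2, compute_uv=False)
for i in range(len(s1)):
    for j in range(len(s2)):
        if i + j < len(s12):         # 0-based index i+j is (i+1)+(j+1)-1 in 1-based
            assert s12[i + j] <= s1[i] + s2[j] + 1e-10
print("norm ordering and Weyl's inequality hold on this example")
```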

We denote the orthogonal projection matrix onto the column space of the design matrix, and we write the projected response and projected noise accordingly. We also introduce shorthand for the rank of the design matrix and for related quantities used below. Throughout the paper, we reserve dedicated symbols for the true coefficient matrix and its true rank.

2 Properties of the minimizer of the new criterion

At first glance, it seems difficult to describe the minimizer of the criterion because both the numerator and the denominator are decreasing in the candidate rank. However, it turns out that there is a unique minimizer with a neat explicit formula. First we characterize when the criterion at one rank is smaller than at another.

Proposition 1.

Let with . Then

(4)

In particular,

(5)

and

(6)

This result and the monotonicity of the singular values readily yield the following statement.

Proposition 2.

Let . Then

(7)
(8)

It is clear that if , then minimizes . Likewise, if , then minimizes the criterion . After a little reflection, we see that the criterion is minimized at the last candidate rank for which the comparison holds. That is,

(9)

minimizes with the convention that the maximum of the empty set is 0. Properties (7) and (8) ensure that must hold automatically for all as well as for all . That is, has an even more convenient closed form

(10)

Summarizing, we have shown the following result.

Theorem 3.

There exists a unique minimizer of (3), given by (10), such that is monotone decreasing for , and monotone increasing for .

It is interesting to compare the choice in (10) with the one in Bunea et al. (2011). In that paper, it is shown that (1) is equivalent to

(11)

based on the truncated singular value decomposition of the projection with . Furthermore, Bunea et al. (2011) uses this formulation to derive a closed form for , to wit,

(12)

based on the singular values of the projected response matrix. The main difference between (10) and (12) is that the former counts the number of singular values of the projected responses above a variable threshold, while the latter counts the number of singular values above a fixed threshold. Another difference is that the fixed threshold is proportional to the unknown noise variance, while the variable threshold is proportional to a residual-based quantity, which can be thought of as an estimate of the noise variance only for candidate ranks close to the true rank.
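The contrast between the two closed forms can be made concrete as follows, continuing the simulated objects and the illustrative mu from the earlier sketches. The specific thresholds (a sigma^2-scaled constant for the fixed rule, a residual-based quantity for the variable rule) are assumed forms used for illustration only, not the paper's displays (10) and (12).

```python
# Side-by-side sketch: the earlier rule counts squared singular values of PY
# above a *fixed*, sigma^2-scaled level; the new rule counts them above a
# *variable*, residual-based level.  Both threshold formulas are assumptions.
import numpy as np

def rank_fixed_threshold(s2, sigma2, m, q, c=2.0):
    """Count squared singular values of PY above a fixed sigma^2-scaled level."""
    thresh = c * sigma2 * (np.sqrt(m) + np.sqrt(q)) ** 2
    return int(np.sum(s2 >= thresh))

def rank_variable_threshold(s2, resid2, n, m, mu):
    """Last k whose squared singular value exceeds the variable threshold."""
    k_hat = 0
    tail = np.sum(s2)                        # ||PY - (PY)_k||_F^2 for k = 0
    for k in range(1, len(s2) + 1):
        tail -= s2[k - 1]                    # tail beyond rank k
        denom = n * m - mu * k
        if denom <= 0:
            break
        if s2[k - 1] >= mu * (resid2 + tail) / denom:
            k_hat = k
    return k_hat

P = X @ np.linalg.pinv(X)
s2 = np.linalg.svd(P @ Y, compute_uv=False) ** 2
resid2 = np.sum((Y - P @ Y) ** 2)
q = np.linalg.matrix_rank(X)
print(rank_fixed_threshold(s2, sigma ** 2, m, q),
      rank_variable_threshold(s2, resid2, n, m, mu))
```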

To further illustrate the existence and uniqueness of the minimizer, we perform one experiment to show how the criterion varies across different candidate ranks, see Figure 1. The plot displays the monotonicity of the criterion on either side of the minimizer. It also corroborates the definition in (10), since the rank that minimizes the criterion is exactly the one given there.

Figure 1: Plot of the criterion and its minimizer versus the candidate rank, using the notation and simulation setup of Section 6.3.

3 Rank consistency

3.1 The null case

We treat the null case of a zero coefficient matrix separately, as the general case requires a lower bound on the non-zero singular values of the signal.

Theorem 4.

Assume . Then, on the event , we have .

We particularize Theorem 4 to the case where the entries of the noise matrix are independent Gaussian. In that case, general random matrix theory and Borell's inequality for suprema of Gaussian processes, respectively, give the bounds

(13)

see Lemma 3 in Bunea et al. (2011). Moreover, the scaled residual sum of squares has a central chi-square distribution, so that general tail bounds (Johnstone (2001)) yield

(14)
(15)

We immediately obtain the following corollary by using (13) – (15).

Corollary 5.

For any , we have exponentially fast as and .
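As a small illustration of the Gaussian bounds entering (13)-(15), the Monte Carlo check below shows the operator norm of a standard Gaussian matrix concentrating near the square root of the number of rows plus the square root of the number of columns, a standard random matrix fact (the exact constants in the paper's bounds may differ).

```python
# Monte Carlo illustration: for a matrix with i.i.d. N(0, 1) entries, the
# operator norm concentrates around sqrt(#rows) + sqrt(#cols).
import numpy as np

rng = np.random.default_rng(3)
n_rows, n_cols, reps = 100, 25, 200
ops = [np.linalg.norm(rng.standard_normal((n_rows, n_cols)), 2) for _ in range(reps)]
print(np.mean(ops), np.sqrt(n_rows) + np.sqrt(n_cols))   # the two values are close
```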

3.2 The general case

The range over which we minimize (3) depends on the tuning parameter, as the largest admissible rank is

(16)

to avoid a non-positive denominator in criterion (3).

Theorem 6.

Assume . On the event

(17)

we have . If , then trivially holds, with probability one.

The restriction guarantees that the denominator is positive, that is, the event (17) is non-empty. If the true rank exceeds the largest admissible value, then the conclusion holds trivially, with probability one, as the estimate is selected from the admissible range.

While the quantity is a natural one in this problem, it depends on the unknown rank . It turns out that quantifying is not trivial.

Proposition 7.

Assume . On the event

(18)

we have

(19)

This result combined with Theorem 6 tells us that if is chosen large enough, we can guarantee that . Moreover, this choice is independent of both and . Indeed, for matrices with independent Gaussian entries, any choice suffices:

Theorem 8.

For with any numerical constant , as and .

The convergence rate in Theorem 8 is exponentially fast in and . Again, if , then holds trivially.

Consistency of the rank estimator can be achieved under a suitable signal-to-noise condition.

Theorem 9.

For , on the event

(20)

intersected with the event (17), we further have .

This theorem, combined with Proposition 7, immediately yields the following corollary.

Corollary 10.

For , on the event

we have .

The choice of the tuning parameter impacts the possible values for the minimizer of criterion (3), and we see that the admissible range increases as the tuning parameter decreases. If the true rank is rather large, then no guarantees can be made, except for the trivial, yet important, observation that the selected rank then cannot exceed the true rank. On the other hand, if the true rank lies within the admissible range, which is arguably the more interesting case for low rank regression, then consistency guarantees can be made under a suitable condition on the smallest non-zero singular value of the signal. This condition becomes milder as the tuning parameter decreases.

Let . A slightly stronger restriction for the upper bound on ,

(21)

translates into a bound for the ratio

appearing in the lower bound for the signal . We can further particularize to the Gaussian setting.

Theorem 11.

Let for some numerical constant . Assume further that and satisfy (21) and

(22)

for some numerical constant . Then as and .
In particular, if (22) holds for , then consistently estimates .

The convergence rate in Theorem 11 is exponentially fast in the dimensions. This shows that the above procedure is highly accurate, which is confirmed in our simulation study. From the oracle inequality in the appendix, the resulting prediction error differs from the optimal one only by terms of the order of the noise level.

4 Self-tuning procedure

From Theorem 9 in Section 3, we need to take the tuning parameter (or, in fact, the induced threshold, which depends on it) large enough to prevent overfitting, that is, to avoid selecting a rank larger than the true rank. On the other hand, we would like to keep the tuning parameter small to be able to detect a small signal level. Indeed, (17) in Theorem 9 states that we need

(23)

The term on the right becomes smaller as the tuning parameter decreases, so we should choose the tuning parameter as small as possible, that is, based on a rank value as close to the true rank as possible. Without any prior knowledge of the true rank, it is difficult to find the optimal choice. However, Corollary 10 tells us that a sufficiently large initial choice yields an estimated rank that does not exceed the true rank. Our idea is to use this estimate as a lower bound for the rank in order to reduce the tuning parameter. This, in turn, will yield a possibly larger estimated rank, which still does not exceed the true rank. More precisely, we propose the following Self-Tuning Rank Selection (STRS) procedure. Let a matrix with i.i.d. standard Gaussian entries be given, and define the associated expected quantities, with the usual convention at the boundary. Moreover, fix a numerical constant. For any starting value, we define

(24)
(25)

as starting values, and if , for , we update

(26)
(27)

where

(28)

The procedure stops when the selected rank no longer changes. The entire procedure is free of the unknown noise variance, and the required quantities can be numerically evaluated by Monte Carlo simulation. Alternatively, we provide an analogous procedure with analytical expressions in Appendix E, but its performance in our simulations is slightly inferior to the original procedure that uses Monte Carlo simulation.
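The loop structure of STRS can be sketched as follows, reusing the rank_variable_threshold helper and the simulated data from the earlier snippets. Since the updating formulas (24)-(28) are not spelled out here, the function mu_from_rank is a hypothetical stand-in: a Monte Carlo quantity computed from a standard Gaussian matrix that is non-increasing in the previously selected rank. Only the loop structure (re-minimize with the smaller tuning value, search over ranks at least as large as the previous one, stop when the rank is unchanged) is taken from the description in the text.

```python
# Skeleton of the self-tuning rank selection (STRS) loop.  The update rule
# mu_from_rank is an assumed stand-in for the (unreproduced) formulas
# (24)-(28): a Monte Carlo quantity over standard Gaussian matrices that
# shrinks as the previously selected rank grows.
import numpy as np

def mu_from_rank(r_prev, P, n, m, c=2.0, reps=100, rng=None):
    """Hypothetical Monte Carlo update: scale of the (r_prev+1)-th squared
    singular value of P @ Z for a standard Gaussian Z."""
    rng = rng or np.random.default_rng(4)
    vals = []
    for _ in range(reps):
        Z = rng.standard_normal((n, m))
        s = np.linalg.svd(P @ Z, compute_uv=False)
        vals.append(s[r_prev] ** 2 if r_prev < len(s) else 0.0)
    return c * float(np.mean(vals))

def strs(X, Y, max_iter=20):
    n, m = Y.shape
    P = X @ np.linalg.pinv(X)
    s2 = np.linalg.svd(P @ Y, compute_uv=False) ** 2   # the SVD is computed only once
    resid2 = np.sum((Y - P @ Y) ** 2)
    r_hat = 0
    for _ in range(max_iter):
        mu_t = mu_from_rank(r_hat, P, n, m)
        r_new = max(r_hat, rank_variable_threshold(s2, resid2, n, m, mu_t))
        if r_new == r_hat:                              # stop when the rank is unchanged
            return r_hat
        r_hat = r_new                                   # non-decreasing over iterations
    return r_hat

print(strs(X, Y))
```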

Regarding the computational complexity, we emphasize that the above STRS procedure has almost the same level of computational complexity as the methods based on (1) and (3). This is because the computationally expensive singular value decomposition only needs to be computed once. Additionally, in order to find the new rank in step (27), we only need to consider values that are at least as large as the previously selected rank, which avoids a lot of extra computation.

The following proposition is critical for the feasibility of STRS.

Proposition 12.

The tuning parameters are non-increasing and the selected ranks are non-decreasing over the iterations. More importantly, at every iteration the selected rank does not exceed the true rank, with probability tending to one exponentially fast as the dimensions grow.

The increasing property of immediately yields the following theorem.

Theorem 13.

Fix a numerical constant as above. Let one estimate be the minimizer of (3) using the initial tuning parameter, and let the other be the final rank selected by STRS starting from the same value. Then

as and .

We find that STRS always selects a rank closer to the true rank than the (one step) Generalized Rank Selection procedure (GRS) from the previous section that uses (3) as its criterion.

The decreasing property of the tuning parameters, stated in Proposition 12, implies an increasingly milder condition on the required signal. Meanwhile, the way of updating the tuning parameter in step (26) is carefully chosen so that the selected rank never exceeds the true rank. We refer to the proof for more explanation. Thus, if a proper sequence of signal-to-noise conditions is met, we expect that STRS finds the rank consistently. The following theorem confirms this.

Theorem 14.

Let be a strictly increasing subsequence of of length . Define as (24) and obtained from (26) by using in lieu of , for . Assume and satisfy (21) for . Then, on the event

(29)

for some numerical constant, there exists an iteration at which the selected rank equals the true rank, with probability tending to one, where the selected ranks are from (27).

The sequence plays an important role in interpreting the above theorem. It can be regarded as an underlying sequence that bridges the initial estimate and the true rank. The ideal case leads to a one-step recovery, but requires a comparatively stronger signal-to-noise condition (29). At the other extreme, it could take many steps to recover the true rank. We emphasize that the latter case requires the mildest signal-to-noise condition, by the following two observations:

  1. Proposition 12 guarantees that the updated tuning parameter decreases at each step;

  2. The signal condition (29) becomes milder as the tuning parameter gets smaller, which further implies an increasingly sharper leading constant in the optimal signal condition.

From display (19) in Proposition 7 and from Corollary 10, it is clear that we use a string of inequalities to derive the required signal-to-noise condition. The second inequality becomes loose when the true rank is large compared with the selected one. Hence, we expect the signal-to-noise condition to become milder along the decreasing values of the tuning parameter produced by STRS. To illustrate this phenomenon concretely, we study a special case and show in the theorem below that the signal condition for recovering the true rank can be relaxed significantly when the rank is large.

Theorem 15.

Define with . Assume and satisfy (21) for and

(30)
(31)

for some numerical constant .
Then either the first or the second selected rank equals the true rank, with probability tending to one, as the dimensions grow. Here the two ranks are selected from (25) and (27).

The lower bound condition (30) is condition (22) with . As a simple numerical illustration, we compare (30) (and therefore (22)) with (31) for and . If , we obtain

while if , we have

As we can see, for small , the signal-to-noise condition decreases slightly. However, for larger , could be quite large, while is always bounded above by . In order to see implications of small/large , we consider two cases by recalling the rank constraint (21):

  1. If , the rank constraint (21) reduces to simply . A smaller value for leads to a smaller value for , provided . Therefore, should be tight and we expect a smaller reduction of the signal condition for smaller values of . On the other hand, when and are close, meaning is large, we expect a considerable relaxation of the signal condition for comparatively large . These two points are clearly reflected in (30) and (31).

  2. If , it follows that . Then a smaller means a stronger restriction on , which implies becomes tight, and we expect a modest relaxation of the signal condition. If is large, then could be small for a comparatively large . Thus would explode and we expect a significant decrease in the lower bound (31) for the signal. Both these observations agree with our results. It is worth mentioning that when is small compared with , is likely to be large. For instance, when , taking yields , which is already not particularly large. Imposing a small value in this case would further restrict the range of .

Recall that the range of allowable ranks in (16) increases as the tuning parameter decreases. This means that, after a few iterations, the true rank can be selected even when it was outside the admissible range at the beginning. This phenomenon is clearly supported by our simulations in Section 6.5. In addition, the following proposition proves that the admissible range can be extended in some settings even when (21) is not met initially.

Proposition 16.

Let with and assume . Suppose the first selected rank from (25) by using satisfies and

for some numerical constant . Then, we have for any , as .

5 Extension to heavy tailed error distributions

Most results in this paper are finite sample results and apply to any noise matrix. Only Corollary 5, Theorem 11 and the results in Section 4 require Gaussian errors. They appeal to precise concentration inequalities, making use of the fact that orthogonal transformations of a Gaussian matrix, such as its projection onto the column space of the design, are again Gaussian. In general, if the noise matrix has independent entries, then such transformations no longer have independent entries, although their rows remain independent. Regardless, our simulations reported in Sections 6.7 and 6.8 support our conjecture that our iterative method is flexible and that our results continue to hold for general distributions, such as t-distributions, for independent errors. For some important special cases we are able to formally allow for errors with finite fourth moments only.
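As a quick simulation-style check of this robustness claim, one can swap the Gaussian noise in the earlier sketches for scaled t-distributed noise with finite fourth moments; the degrees of freedom and the reuse of the strs() sketch are illustrative choices.

```python
# Heavy-tailed variant of the simulation: t-distributed errors with df = 5
# (finite fourth moment), rescaled to unit variance, in place of Gaussian
# noise.  The df value and the reuse of strs() are illustrative choices.
import numpy as np

rng = np.random.default_rng(5)
df = 5
E_heavy = rng.standard_t(df, size=(n, m)) / np.sqrt(df / (df - 2))  # unit variance
Y_heavy = X @ A + E_heavy
print(strs(X, Y_heavy))        # compare the selected rank with the true r
```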

5.1 Heavy tailed error distributions

We first consider the case , which is likely to occur in high-dimensional settings (). The following theorem guarantees the rank recovery via the GRS procedure for errors with heavy tailed distributions.

Theorem 17.

Let . Assume that the entries of are i.i.d. random variables with mean zero and finite fourth moments. Furthermore, assume , and satisfy (21) and

(32)

for some numerical constant .
Then, we have as , where is selected from (3).
In particular, if (32) holds for , then consistently estimates .

For the special case of skinny matrices, we propose the following Simplified Self-Tuning Rank Selection (SSTRS) procedure. Given any starting value, we set

(33)

as starting values, and if , for , we update

(34)

where the thresholds are updated at each step. The procedure stops when the selected rank no longer changes, and we have the following result.

Theorem 18.

Assume the errors are i.i.d. random variables with mean zero and finite fourth moments. Suppose that and or . Let be defined as in Theorem 14. Define as in (33) and obtained from (34) by using in lieu of , for . Assume and satisfy (21) for .
Then, on the event

(35)

for some numerical constant , there exists such that , with probability tending to one, as . Here are given in (33) and (34).

Remark. As mentioned earlier, the entries of no longer inherit the independence from when the distribution of the independent entries is not Gaussian. Nevertheless, by exploiting the independence of columns of , Theorem 5.39 in Vershynin (2010) shows that with high probability, provided the entries of are independent sub-Gaussian random variables with unit variance. The constant above unfortunately involves the unknown sub-Gaussian norm, which differs from . However, provided and has i.i.d. sub-Gaussian entries, we simply have