# Residual variance and the signal-to-noise ratio in high-dimensional linear models

Lee H. Dicker

Department of Statistics and Biostatistics, Rutgers University
501 Hill Center, 110 Frelinghuysen Road
Piscataway, NJ 08854
e-mail: ldicker@stat.rutgers.edu
###### Abstract

Residual variance and the signal-to-noise ratio are important quantities in many statistical models and model fitting procedures. They play an important role in regression diagnostics, in determining the performance limits in estimation and prediction problems, and in shrinkage parameter selection in many popular regularized regression methods for high-dimensional data analysis. We propose new estimators for the residual variance, the ℓ² signal strength, and the signal-to-noise ratio that are consistent and asymptotically normal in high-dimensional linear models with Gaussian predictors and errors, where the number of predictors d is proportional to the number of observations n. Existing results on residual variance estimation in high-dimensional linear models depend on sparsity in the underlying signal. Our results require no sparsity assumptions and imply that the residual variance may be consistently estimated even when d > n and the underlying signal itself is non-estimable. Basic numerical work suggests that some of the distributional assumptions made for our theoretical results may be relaxed.

Running title: Residual variance and the signal-to-noise ratio

Supported by NSF Grant DMS-1208785.

AMS subject classifications: Primary 62J05; secondary 62F12, 15B52.

Keywords: Asymptotic normality, high-dimensional data analysis, Poincaré inequality, random matrices, residual variance, signal-to-noise ratio.

## 1 Introduction

Consider the linear model

 y_i = x_i^T\beta + \epsilon_i, \quad i = 1, \ldots, n, \qquad (1)

where y_i ∈ ℝ and x_i ∈ ℝ^d are observed outcomes and d-dimensional predictors, respectively, ε_1, …, ε_n are unobserved iid errors with E(ε_i) = 0 and Var(ε_i) = σ², and β ∈ ℝ^d is an unknown d-dimensional parameter. To simplify notation, let y = (y_1, …, y_n)^T denote the n-dimensional vector of outcomes and X = (x_1, …, x_n)^T denote the n × d matrix of predictors. Also let ε = (ε_1, …, ε_n)^T. Then (1) may be re-expressed as

 y=Xβ+ϵ.

In this paper, we focus on the case where the predictors are random. More specifically, we assume that x_1, …, x_n are iid random vectors with mean 0 and positive definite covariance matrix Σ (many of the results in this paper are applicable when the mean is nonzero, upon centering the data; however, this is not pursued further here).

Let τ² = ‖β‖², where ‖·‖ denotes the ℓ²-norm. Then τ² is a measure of the overall (ℓ²-) signal strength. The residual variance σ² and the signal strength τ² are important quantities in many problems in statistics. For example, in estimation and prediction problems, σ² typically determines the scale of an estimator's risk under quadratic loss. More broadly, σ², τ², and associated quantities, such as the signal-to-noise ratio τ²/σ², all play a key role in regression diagnostics. Thus, reliable estimators of σ² and τ² are desirable.

For invertible X^TX, let β̂_ols = (X^TX)^{-1}X^Ty be the ordinary least squares estimator for β. If d < n, then

 \hat\sigma_0^2 = \frac{1}{n-d}\|y - X\hat\beta_{\mathrm{ols}}\|^2 = \frac{1}{n-d}\|y\|^2 - \frac{1}{n-d}y^TX(X^TX)^{-1}X^Ty \qquad (2)

is a consistent estimator for σ² and, under fairly mild additional conditions, is asymptotically normal. Consistent estimators for τ² can also be constructed. For instance, if d < n, it is easily seen that

 \hat\tau_0^2 = \frac{1}{n}\|y\|^2 - \hat\sigma_0^2 = -\frac{d}{n(n-d)}\|y\|^2 + \frac{1}{n-d}y^TX(X^TX)^{-1}X^Ty \qquad (3)

is a consistent estimator for τ² under mild conditions.
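As a quick illustration, the following simulation (a sketch of ours, not code from the paper; all constants are illustrative) checks that σ̂₀² and τ̂₀² from (2)-(3) are approximately unbiased when d < n:

```python
import numpy as np

# Monte Carlo check of the classical estimators (2)-(3) when d < n.
# All constants here are illustrative choices, not from the paper.
rng = np.random.default_rng(0)
n, d, sigma2, tau2 = 200, 50, 1.0, 2.0
beta = rng.standard_normal(d)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)      # enforce ||beta||^2 = tau2

s2_hat, t2_hat = [], []
for _ in range(500):
    X = rng.standard_normal((n, d))               # Sigma = I
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    s2_hat.append(resid @ resid / (n - d))        # eq. (2)
    t2_hat.append(y @ y / n - s2_hat[-1])         # eq. (3)

print(np.mean(s2_hat), np.mean(t2_hat))           # approx sigma2 and tau2
```

Averaging over replications, both estimators land close to the true values σ² = 1 and τ² = 2.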

It is more challenging to construct reliable estimators for σ² and τ² in high-dimensional linear models, where d > n. Indeed, if d ≥ n, then the estimator σ̂₀² breaks down; however, estimating σ² and τ² remains important. In high-dimensional linear models with d > n, σ² plays an important role in selecting effective shrinkage parameters for many popular regularized regression methods (Candès and Tao, 2007; Bickel et al., 2009; Zhang, 2010). The signal-to-noise ratio τ²/σ² is also important for shrinkage parameter selection, and it determines performance limits in certain high-dimensional regression problems (Dicker, 2012a, b).

In this paper, we propose new estimators for σ² and τ² that are consistent and asymptotically normal, with rate n^{-1/2}, in an asymptotic regime where d/n → ρ ∈ (0, ∞) (whenever we write d/n → ρ, it is implicit that n → ∞ as well). We also show that these estimators may be used to derive consistent and asymptotically normal estimators for functions of σ² and τ², like the signal-to-noise ratio. Previous work on estimating σ² in high-dimensional linear models where d > n has been conducted by Sun and Zhang (2011) and Fan et al. (2012). These authors assume that β is sparse (e.g. the ℓ⁰-norm or ℓ¹-norm of β is small) and their results for estimating σ² are related to the fact that β itself is estimable under the specified sparsity assumptions. Though Sun and Zhang's (2011) and Fan et al.'s (2012) results even apply in settings where d/n → ∞, their sparsity assumptions may be untenable in certain instances and this can dramatically affect the performance of their estimators. In this paper, we make no sparsity assumptions (however, σ² and τ² are required to be bounded) and we show that the proposed estimators for σ² and τ² perform well in situations where d/n → ρ ∈ (0, ∞) and β is provably non-estimable. This is one of the main messages of the paper: Though some type of sparsity is required to consistently estimate β in high-dimensional linear models, sparsity in β is not required to estimate σ² and τ².

### 1.1 Distributional assumptions

Though sparsity is not required in this paper, we do make strong distributional assumptions about the data. In particular, we henceforth assume that

 \epsilon_1, \ldots, \epsilon_n \stackrel{\mathrm{iid}}{\sim} N(0, \sigma^2) \quad \text{and} \quad x_1, \ldots, x_n \stackrel{\mathrm{iid}}{\sim} N(0, \Sigma). \qquad (4)

While normality is used heavily throughout our analysis, we expect that key aspects of many of the results in this paper remain valid under weaker distributional assumptions. This is explored via simulation in Section 4.

Not surprisingly, the analysis in this paper is simplified by the normality assumption (4). To explain the relevance of (4) in more detail, we first point out that our primary consistency results for the proposed estimators of σ² and τ² (Theorem 2 below) follow from exact calculations of the estimators' mean and variance. If the normality assumption (4) is violated, then these calculations are generally invalid; similar techniques may be applicable if other conditions hold, but exact finite sample calculations are not likely to be possible and any corresponding approximation may be more involved.

The normality assumption (4) also facilitates the use of a collection of "soft tools" for random matrices developed by Chatterjee (2009) to prove that the estimators proposed in this paper are asymptotically normal. These tools are related to second order Poincaré inequalities and Stein's method (Stein, 1986). Asymptotic normality for the proposed estimators follows by bounding the total variation distance to a normal random variable. These bounds contain information about how the variability of the proposed estimators may depend on σ², τ², Σ, d, and n. This is easily leveraged to obtain consistent and asymptotically normal estimators for functions of σ² and τ² (such as the signal-to-noise ratio τ²/σ²; see Corollary 2 below), which is an important practical objective. Thus, one of the appealing aspects of the "soft tools" used in this paper is their flexibility. On the other hand, paraphrasing Chatterjee (2009), other existing methods for asymptotic analysis in random matrix theory rely heavily on the exact calculation of limits (Jonsson, 1982; Bai and Silverstein, 2004); we suggest that this may be a more delicate endeavor in some instances. If the normality assumption (4) does not hold, then it is unclear whether the soft tools used in this paper are still applicable and, consequently, other techniques may be required. Existing work in random matrix theory suggests that this may be possible (see, for example, Bai et al., 2007; Pan and Zhou, 2008; El Karoui and Koesters, 2011); however, the computations are likely more involved and the breadth of applicability of alternative techniques seems unclear.

### 1.2 Correlation among predictors

Another challenging issue for estimating σ² and τ² when d > n involves the covariance matrix Σ. Our initial estimators for σ² and τ² are devised under the assumption that Σ is known (equivalently, Σ = I; see Section 2). These estimators are unbiased, consistent, and asymptotically normal. We subsequently propose modified estimators for σ² and τ² in cases where Σ is unknown, but (i) a norm-consistent estimator for Σ is available, or (ii) d/n and Σ satisfy certain conditions described in Section 3.2. If a norm-consistent estimator for Σ is available, then the proposed estimators for σ² and τ² are consistent; if, furthermore, Σ is estimated at rate n^{-1/2}, then the estimators are asymptotically normal. On the other hand, if d > n, then norm-consistent estimators for Σ are not generally available (though there are important examples where norm-consistent estimators for Σ can be found; this is discussed in more detail in Section 3.1). Thus, it is important to construct estimators for σ² and τ² that perform reliably when Σ is completely unknown. While it remains an open problem to find estimators for σ² and τ² that are consistent for completely general Σ, in Section 3.2 we propose estimators that are consistent and asymptotically normal, provided d/n and Σ satisfy conditions that are closely related to other conditions that have appeared in the random matrix theory literature (Bai et al., 2007; Pan and Zhou, 2008). These conditions basically require that X^TX and Σ are asymptotically free in the sense of free probability (see, for example, Speicher (2003) for a brief overview of free probability and random matrix theory).

### 1.3 The Neyman-Scott problem

The problems considered in this paper have at least a passing resemblance to the Neyman-Scott problem (Neyman and Scott, 1948; Lancaster, 2000). In a simplified version of this problem, observations w_{ij} ∼ N(μ_i, ν²), i = 1, …, n, j = 1, 2, are available, and the goal is to estimate ν². The means μ_1, …, μ_n are nuisance parameters and, without additional specification, none of the μ_i are estimable, as each is informed by only two observations. Furthermore, the profile maximum likelihood estimator for ν², which is given by

 \hat\nu^2_{\mathrm{MLE}} = \frac{1}{4n}\sum_{i=1}^n (w_{i1} - w_{i2})^2,

is inconsistent; indeed, ν̂²_MLE → ν²/2 in probability. On the other hand, the simple method of moments estimator 2ν̂²_MLE is consistent for ν² and asymptotically normal.
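The contrast between the profile MLE and the method of moments estimator is easy to see numerically; the following sketch (ours, with illustrative constants) reproduces the ν²/2 limit:

```python
import numpy as np

# Neyman-Scott toy problem: w_ij ~ N(mu_i, nu^2), j = 1, 2, with n nuisance
# means mu_i. The profile MLE converges to nu^2 / 2; doubling it restores
# consistency (a method of moments estimator).
rng = np.random.default_rng(1)
n, nu2 = 100_000, 3.0
mu = rng.uniform(-5.0, 5.0, size=n)               # nuisance means
w = mu[:, None] + np.sqrt(nu2) * rng.standard_normal((n, 2))

mle = np.sum((w[:, 0] - w[:, 1]) ** 2) / (4 * n)  # approx nu^2 / 2
mom = 2 * mle                                     # approx nu^2
print(mle, mom)
```

With ν² = 3, the profile MLE concentrates near 1.5 while the method of moments estimator concentrates near 3.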

In linear models (1) with d/n → ρ ∈ (0, ∞), which are the main focus of this paper, the parameter β is typically non-estimable. However, we show below that σ² may still be consistently estimated in a variety of circumstances. Moreover, as in the Neyman-Scott problem, it is unclear how to proceed with likelihood inference. Indeed, the MLE

 \hat\sigma^2_{\mathrm{MLE}} = \begin{cases} \frac{1}{n}\|y - X\hat\beta_{\mathrm{ols}}\|^2 & \text{if } d < n \\ 0 & \text{if } d \geq n \end{cases}

is degenerate when d ≥ n and it can even be troublesome when d < n: if d/n → ρ ∈ (0, 1), then σ̂²_MLE → (1 − ρ)σ² ≠ σ². Furthermore, similar to the Neyman-Scott problem described in the previous paragraph, the basic estimator for (σ², τ²) derived in Section 2.1 is a method of moments estimator.

In our view, the major implication of the preceding discussion is that the ambiguities of likelihood inference arising in this problem contribute to the difficulty of devising a systematic approach to estimation and efficiency when studying σ², τ², and related quantities in high-dimensional linear models. While the estimators proposed in this paper are shown to have reasonable properties, further research into these broader issues may be warranted.

### 1.4 Overview of the paper

Section 2 is primarily devoted to the case where Σ = I. A motivating discussion and the definition of the basic estimators for σ² and τ² may be found in Section 2.1. Section 2.2 and Section 2.3 address consistency and asymptotic normality for the basic estimators, respectively. The case where Σ is unknown is addressed in Section 3. Section 3.1 is concerned with the case where a norm-consistent estimator for Σ is available; Section 3.2 covers the case where no such estimator may be found, but d/n and Σ satisfy certain additional conditions. The results of three simulation studies are reported in Section 4. Two of these studies illustrate basic properties of the estimators proposed in this paper. In the third study, we compare the performance of our estimators for σ² to the performance of estimators for σ² proposed by Sun and Zhang (2011). Section 5 contains a concluding discussion, where we briefly mention some potential alternatives to the estimators proposed in this paper and issues related to efficiency. Proofs may be found in the Appendix; some of the more extended calculations required for these proofs are contained in the Supplemental Text (which may be found after the Bibliography below).

## 2 Independent predictors: Σ=I

Throughout the discussion in this section, we assume that Σ = I. All of the calculations in Sections 2.1-2.2 require Σ = I. However, the main result of Section 2.3 (Theorem 3, on asymptotic normality) holds for arbitrary positive definite Σ. Notice that if Σ ≠ I, but Σ is known, then one easily reduces to the case where Σ = I by replacing x_i with Σ^{-1/2}x_i.

### 2.1 Motivation and the basic estimators

For illustrative purposes, suppose for the moment that d < n. The estimator σ̂₀², defined in (2), may be interpreted in terms of the projection of y onto the orthogonal complement of the column space of X. This well-known interpretation highlights one of the obstacles to estimating σ² in linear models with more predictors than observations: If d ≥ n, then the orthogonal complement of the column space of X is generically trivial; thus, y − Xβ̂_ols = 0 and any projection onto this orthogonal complement is uninformative. An alternative interpretation of σ̂₀² suggests methods for estimating σ² and τ² in high-dimensional linear models.

Consider the linear combination of (1/n)‖y‖² and (1/n)yᵀX(XᵀX)⁻¹Xᵀy,

 L_0(a_1, a_2) = a_1\frac{1}{n}\|y\|^2 + a_2\frac{1}{n}y^TX(X^TX)^{-1}X^Ty

for a_1, a_2 ∈ ℝ, and observe that

 E\left(\frac{1}{n}\|y\|^2\right) = \sigma^2 + \tau^2 \qquad (5)

 E\left\{\frac{1}{n}y^TX(X^TX)^{-1}X^Ty\right\} = \frac{d}{n}\sigma^2 + \tau^2 \qquad (6)

are non-redundant linear combinations of σ² and τ². Since

 EL_0(a_1, a_2) = a_1E\left(\frac{1}{n}\|y\|^2\right) + a_2E\left\{\frac{1}{n}y^TX(X^TX)^{-1}X^Ty\right\} = a_1(\sigma^2 + \tau^2) + a_2\left(\frac{d}{n}\sigma^2 + \tau^2\right),

it follows that there exist a_1, a_2 such that L_0(a_1, a_2) is an unbiased estimator of σ², i.e. EL_0(a_1, a_2) = σ². In particular, we have

 EL_0\left(\frac{n}{n-d}, -\frac{n}{n-d}\right) = \sigma^2

and, moreover, σ̂₀² = L_0(n/(n−d), −n/(n−d)). Thus, for d < n, σ̂₀² may be viewed as the unique linear combination of (1/n)‖y‖² and (1/n)yᵀX(XᵀX)⁻¹Xᵀy that yields an unbiased estimator of σ².

The identities (5)-(6) also imply that there exist a_1, a_2 such that L_0(a_1, a_2) is an unbiased estimator for τ². Indeed,

 EL_0\left(-\frac{d}{n-d}, \frac{n}{n-d}\right) = \tau^2

and

 \hat\tau_0^2 = L_0\left(-\frac{d}{n-d}, \frac{n}{n-d}\right)

is the estimator defined initially in (3).

The ideas above are easily adapted to a more general setting that is useful for problems where d > n. Broadly, we seek statistics T_1 = T_1(y, X) and T_2 = T_2(y, X) such that

 E(T_1) = b_{11}\sigma^2 + b_{12}\tau^2, \quad E(T_2) = b_{21}\sigma^2 + b_{22}\tau^2 \quad \text{for some constants } b_{11}, b_{12}, b_{21}, b_{22} \in \mathbb{R} \text{ with } b_{11}b_{22} - b_{12}b_{21} \neq 0. \qquad (7)

In other words, the expected values of the statistics T_1, T_2 should form a pair of non-degenerate linear combinations of σ² and τ². If such T_1 and T_2 can be found, then unbiased estimators for σ², τ² may be formed by taking linear combinations of T_1 and T_2. Moreover, asymptotic properties of these estimators are determined by the asymptotic properties of T_1, T_2.

In the example discussed above, where d < n, T_1 = (1/n)‖y‖² and T_2 = (1/n)yᵀX(XᵀX)⁻¹Xᵀy. If d > n, then alternatives to T_2 must be sought; in this paper, we focus on T_2 = (1/n²)‖Xᵀy‖² (remarks on other potential alternatives may be found in Section 5). Using basic facts about the Wishart distribution (see the Supplemental Text for formulas involving various moments of the Wishart distribution, which are obtained using techniques from Letac and Massam (2004) and Graczyk et al. (2005) and are used throughout the paper), we have

 E\left(\frac{1}{n^2}\|X^Ty\|^2\right) = \frac{1}{n^2}E\,y^TXX^Ty = \frac{1}{n^2}E\,\beta^T(X^TX)^2\beta + \frac{1}{n^2}E\,\epsilon^TXX^T\epsilon = \frac{d+n+1}{n}\tau^2 + \frac{d}{n}\sigma^2. \qquad (8)

Since (d+n+1)/n ≠ d/n, it follows that T_1 = (1/n)‖y‖² and T_2 = (1/n²)‖Xᵀy‖² satisfy (7). Moreover, T_2 is defined and (8) is valid even when d > n. Now let

 L(a_1, a_2) = \frac{a_1}{n}\|y\|^2 + \frac{a_2}{n^2}\|X^Ty\|^2

and define

 \hat\sigma^2 = L\left(\frac{d+n+1}{n+1}, -\frac{n}{n+1}\right) = \frac{d+n+1}{n(n+1)}\|y\|^2 - \frac{1}{n(n+1)}\|X^Ty\|^2

 \hat\tau^2 = L\left(-\frac{d}{n+1}, \frac{n}{n+1}\right) = -\frac{d}{n(n+1)}\|y\|^2 + \frac{1}{n(n+1)}\|X^Ty\|^2.

Making use of (5) and (8), a basic calculation implies that σ̂² and τ̂² are unbiased estimators for σ² and τ², respectively. Thus, we have the following theorem.

###### Theorem 1.

[Unbiasedness] Suppose that Σ = I. Then E(σ̂²) = σ² and E(τ̂²) = τ².
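Both estimators depend on the data only through ‖y‖² and ‖Xᵀy‖², and their unbiasedness is easy to check by simulation even when d > n. The sketch below is ours, with illustrative constants:

```python
import numpy as np

# sigma^2-hat and tau^2-hat from Section 2.1 (Sigma = I); both remain
# well-defined when d > n, where OLS-based estimators break down.
def variance_estimates(X, y):
    n, d = X.shape
    ny2 = y @ y                                   # ||y||^2
    nXty2 = np.sum((X.T @ y) ** 2)                # ||X^T y||^2
    s2 = (d + n + 1) / (n * (n + 1)) * ny2 - nXty2 / (n * (n + 1))
    t2 = -d / (n * (n + 1)) * ny2 + nXty2 / (n * (n + 1))
    return s2, t2

rng = np.random.default_rng(2)
n, d, sigma2, tau2 = 300, 600, 1.0, 2.0           # d > n
beta = rng.standard_normal(d)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)      # ||beta||^2 = tau2

est = []
for _ in range(400):
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    est.append(variance_estimates(X, y))
est = np.array(est)
print(est.mean(axis=0))                           # approx (sigma2, tau2)
```

Even though d = 2n here, the Monte Carlo means are close to (σ², τ²) = (1, 2), consistent with Theorem 1.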

### 2.2 Consistency

Let T = (T_1, T_2)ᵀ = ((1/n)‖y‖², (1/n²)‖Xᵀy‖²)ᵀ and let θ̂ = (σ̂², τ̂²)ᵀ. The covariance matrix of T is important for understanding the asymptotic properties of σ̂² and τ̂². Since θ̂ = AT, where

 A = \begin{pmatrix} \frac{d+n+1}{n+1} & -\frac{n}{n+1} \\ -\frac{d}{n+1} & \frac{n}{n+1} \end{pmatrix}, \qquad (9)

it follows that Cov(θ̂) = A Cov(T) Aᵀ. The covariance matrices for T and θ̂ are both computed explicitly in the Appendix. Asymptotic approximations for the entries of Cov(θ̂) that are valid as d/n → ρ ∈ (0, ∞) are given below:

 \mathrm{Var}(\hat\sigma^2) \sim \frac{2}{n}\left\{\rho(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\} \qquad (10)

 \mathrm{Var}(\hat\tau^2) \sim \frac{2}{n}\left\{(\rho+1)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\right\} \qquad (11)

 \mathrm{Cov}(\hat\sigma^2, \hat\tau^2) \sim -\frac{2}{n}\left\{\rho(\sigma^2+\tau^2)^2 + 2\tau^4\right\}. \qquad (12)

The following theorem contains a slightly more detailed version of these approximations and gives an explicit consistency result for σ̂², τ̂². The theorem is proved in the Appendix.

###### Theorem 2.

[Consistency] Suppose that Σ = I. Then

 \mathrm{Var}(\hat\sigma^2) = \frac{2}{n}\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}

 \mathrm{Var}(\hat\tau^2) = \frac{2}{n}\left\{\left(1+\frac{d}{n}\right)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}

 \mathrm{Cov}(\hat\sigma^2, \hat\tau^2) = -\frac{2}{n}\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + 2\tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.

In particular,

 |\hat\sigma^2 - \sigma^2|,\ |\hat\tau^2 - \tau^2| = O_P\left\{\sqrt{\frac{d+n}{n^2}}\,(\sigma^2+\tau^2)\right\}.
###### Remark 1.

If d/n → ρ ∈ (0, ∞), then the asymptotic approximations (10)-(12) follow immediately from Theorem 2.
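The variance approximation (10) can be checked directly by simulation. In the sketch below (ours, with illustrative constants), the empirical variance of σ̂² is compared with 2{(d/n)(σ²+τ²)² + σ⁴ + τ⁴}/n:

```python
import numpy as np

# Empirical check of the variance approximation (10) for sigma^2-hat
# (Sigma = I). Constants are illustrative choices of ours.
rng = np.random.default_rng(3)
n, d, sigma2, tau2 = 400, 200, 1.0, 1.0
beta = np.zeros(d)
beta[0] = np.sqrt(tau2)                           # ||beta||^2 = tau2

vals = []
for _ in range(2000):
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    ny2, nXty2 = y @ y, np.sum((X.T @ y) ** 2)
    vals.append((d + n + 1) / (n * (n + 1)) * ny2 - nXty2 / (n * (n + 1)))

empirical = np.var(vals)
theory = 2 / n * ((d / n) * (sigma2 + tau2) ** 2 + sigma2**2 + tau2**2)
print(empirical, theory)                          # should roughly agree
```

The two quantities agree to within Monte Carlo error, as the 1 + O(1/n) factor in Theorem 2 suggests.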

###### Remark 2.

It is instructive to compare the asymptotic variance and covariance of σ̂², τ̂² to that of the estimators σ̂₀², τ̂₀² defined in (2)-(3). If d < n and d/n → ρ ∈ [0, 1), then

 \mathrm{Var}(\hat\sigma_0^2) \sim \frac{2\sigma^4}{n(1-\rho)}

 \mathrm{Var}(\hat\tau_0^2) \sim \frac{2}{n}\left\{(\sigma^2+\tau^2)^2 + \left(\frac{\rho}{1-\rho} - 1\right)\sigma^4\right\}

 \mathrm{Cov}(\hat\sigma_0^2, \hat\tau_0^2) \sim -\frac{2\rho\sigma^4}{n(1-\rho)}.

Notice that Var(σ̂²) in (10) increases with the signal strength τ², while Var(σ̂₀²) does not depend on τ². On the other hand, Var(σ̂²) is smaller than Var(σ̂₀²) when τ² is small or ρ is close to 1.

###### Remark 3.

Suppose that ρ ∈ (0, ∞) and M > 0 are fixed. Theorem 2 implies that if Σ = I, then σ̂², τ̂² are consistent in the sense that

 \lim_{d/n \to \rho}\ \sup_{0 \leq \sigma^2, \tau^2 \leq M} E\left\{(\hat\sigma^2 - \sigma^2)^2 + (\hat\tau^2 - \tau^2)^2\right\} = 0. \qquad (13)

On the other hand, Dicker (2012b) proved that if ρ > 0, then it is impossible to estimate β consistently in this setting. In particular, if M > 0, then

 \liminf_{d/n \to \rho}\ \inf_{\hat\beta}\ \sup_{0 \leq \sigma^2, \tau^2 \leq M} E\|\hat\beta - \beta\|^2 > 0,

where the infimum is over all measurable estimators β̂ for β. Thus, Theorem 2 describes methods for consistently estimating σ² and τ² in high-dimensional linear models where it is impossible to estimate β. If d < n, then (13) holds with σ̂₀², τ̂₀² in place of σ̂², τ̂². However, Theorem 2 also applies to settings where d > n (i.e. ρ > 1) and the estimators σ̂₀², τ̂₀² are undefined.

### 2.3 Asymptotic normality

Define the total variation distance between random variables and to be

 d_{TV}(u, v) = \sup_{B \in \mathcal{B}(\mathbb{R})} |P(u \in B) - P(v \in B)|,

where 𝓑(ℝ) denotes the collection of Borel sets in ℝ. The next theorem is this paper's main result on asymptotic normality. It is a direct application of results in Chatterjee (2009). Theorem 3 is proved in the Appendix and is valid for arbitrary positive definite covariance matrices Σ.

###### Theorem 3.

[Asymptotic normality] Let ‖Σ‖ be the operator norm of Σ (i.e. ‖Σ‖ is the largest eigenvalue of Σ). Let h: ℝ² → ℝ be a function with continuous second order partial derivatives, let ∇h denote the gradient of h, and let ∇²h denote the Hessian of h. Suppose that Var{h(T)} = ψ²/n > 0 and let w be a normal random variable with the same mean and variance as h(T). Then

 d_{TV}\{h(T), w\} = O\left(\frac{\|\Sigma\|^{3/2}\xi\nu}{n^{3/2}\psi^2}\right), \qquad (14)

where ξ and ν are defined as follows:

 \xi = \xi(\sigma^2, \tau^2, \Sigma, d, n) = \gamma_4^{1/4} + \gamma_2^{1/4} + \gamma_0^{1/4}\tau(\tau+1)

 \nu = \nu(\sigma^2, \tau^2, \Sigma, d, n) = \eta_8^{1/4} + \eta_4^{1/4} + \eta_0^{1/4}\tau^2(\tau^2+1) + \gamma_4^{1/4} + \gamma_0^{1/4}(\tau^2+1)

and, for non-negative integers k,

 \gamma_k = \gamma_k(\sigma^2, \tau^2, \Sigma, d, n) = E\left\{\|\nabla h(T)\|^4(\lambda_1+1)^6\left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\}, \quad \eta_k = \eta_k(\sigma^2, \tau^2, \Sigma, d, n) = E\left\{\|\nabla^2 h(T)\|^4(\lambda_1+1)^6\left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\},

where λ_1 denotes the largest eigenvalue of (1/n)XᵀX.
###### Remark 1.

If ‖Σ‖ is bounded, then the asymptotic behavior of the upper bound (14) is determined by that of ξ, ν, and ψ, which, in turn, is determined by the function h. For the functions h considered in this paper, if d/n → ρ ∈ (0, ∞), then ξ, ν, and ψ are bounded by rational functions in σ and τ. Thus, if ‖Σ‖ is bounded, d/n → ρ, and σ², τ² lie in some compact set, then we typically have

 d_{TV}\{h(T), w\} = O(n^{-1/2}).

In other words, h(T) converges to a normal random variable at rate n^{-1/2}. Under these conditions, if ψ is known or estimable (as it is for the h studied here), then asymptotically valid confidence intervals for σ² and τ² may be constructed using Theorem 3.
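For instance, a plug-in confidence interval for σ² can be formed by substituting (σ̂², τ̂²) into ψ₁ of (15) below. The following sketch is ours, with illustrative constants:

```python
import numpy as np

# Plug-in ~95% confidence interval for sigma^2 based on the normal
# approximation sqrt(n)(sigma^2-hat - sigma^2)/psi_1 ~ N(0, 1), with
# psi_1 estimated by substituting sigma^2-hat and tau^2-hat.
rng = np.random.default_rng(4)
n, d, sigma2, tau2 = 500, 1000, 1.0, 2.0
beta = rng.standard_normal(d)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)
X = rng.standard_normal((n, d))                   # Sigma = I
y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)

ny2, nXty2 = y @ y, np.sum((X.T @ y) ** 2)
s2 = (d + n + 1) / (n * (n + 1)) * ny2 - nXty2 / (n * (n + 1))
t2 = ny2 / n - s2            # tau^2-hat, since sigma^2-hat + tau^2-hat = ||y||^2/n
psi1 = np.sqrt(2 * ((d / n) * (s2 + t2) ** 2 + s2**2 + t2**2))
half = 1.96 * psi1 / np.sqrt(n)
print(s2 - half, s2 + half)                       # should usually cover sigma^2 = 1
```

A useful identity used above: the two estimators satisfy σ̂² + τ̂² = ‖y‖²/n exactly, which follows from their definitions in Section 2.1.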

Now let A be the matrix (9) and let a₁ᵀ, a₂ᵀ denote the first and second rows of A, respectively. Applying Theorem 3 with h(t) = a₁ᵀt, h(t) = a₂ᵀt, and h(t) = a₂ᵀt/(a₁ᵀt) gives bounds on the total variation distance between σ̂², τ̂², and τ̂²/σ̂², respectively, and corresponding normal random variables. These examples are pursued in more detail below.

###### Example 1 (σ̂² and τ̂²).

Let h(t) = a₁ᵀt in Theorem 3 and suppose that Σ = I. Then η₀ = η₄ = η₈ = 0, because ∇²h = 0. To bound ξ and ν, we have

 \gamma_k = E\left\{\|a_1\|^4(\lambda_1+1)^6\left(\frac{1}{n}\|\epsilon\|^2\right)^k\right\} = O\left\{\left(1+\frac{d}{n}\right)^{10}\sigma^{2k}\right\}.

Thus,

 \xi = O\left\{\left(1+\frac{d}{n}\right)^{5/2}(\sigma^2+\sigma+\tau^2+\tau)\right\}

 \nu = O\left\{\left(1+\frac{d}{n}\right)^{5/2}(\sigma^2+\tau^2+1)\right\}.

By Theorem 2,

 \mathrm{Var}(\hat\sigma^2) = \frac{2}{n}\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.

Now let

 \psi_1^2 = 2\left\{\frac{d}{n}(\sigma^2+\tau^2)^2 + \sigma^4 + \tau^4\right\} \qquad (15)

and let z ∼ N(0, 1). Then Theorem 3 implies

 d_{TV}\left\{\sqrt{n}\left(\frac{\hat\sigma^2-\sigma^2}{\psi_1}\right), z\right\} = O\left[\frac{1}{\sqrt{n}}\left(1+\frac{d}{n}\right)^4\left\{1+\left(\frac{1}{\sigma+\tau}\right)^3\right\}\right].

Similar calculations imply a corresponding bound for √n(τ̂² − τ²)/ψ₂, where

 \psi_2^2 = 2\left\{\left(1+\frac{d}{n}\right)(\sigma^2+\tau^2)^2 - \sigma^4 + 3\tau^4\right\}. \qquad (16)

Thus, we have the following corollary to Theorem 3.

###### Corollary 1.

Suppose that Σ = I and that D ⊆ [0, ∞) is compact. Let z ∼ N(0, 1). If d/n → ρ ∈ (0, ∞), then

 \sup_{\sigma^2, \tau^2 \in D} d_{TV}\left\{\sqrt{n}\left(\frac{\hat\sigma^2-\sigma^2}{\psi_1}\right), z\right\},\ \sup_{\sigma^2, \tau^2 \in D} d_{TV}\left\{\sqrt{n}\left(\frac{\hat\tau^2-\tau^2}{\psi_2}\right), z\right\} = O(n^{-1/2}),

where ψ₁, ψ₂ are defined in (15)-(16).

###### Example 2 (Signal-to-noise ratio).

Suppose that Σ = I. Define the function g(u, v) = v/u and let θ̂ = (σ̂², τ̂²)ᵀ = AT, where A is the matrix given in (9). Then g(θ̂) = τ̂²/σ̂² is an estimate of the signal-to-noise ratio τ²/σ². However, Theorem 3 cannot be applied directly because g is not defined on all of ℝ² (if σ̂² = 0, then g(θ̂) is undefined). To remedy this, we assume that (σ², τ²) ∈ D, where D ⊂ (0, ∞) × [0, ∞) is compact and, moreover, that d/n → ρ ∈ (0, ∞). Now let h be a function with continuous second order partial derivatives such that h is bounded and h(t) = g(At) on D̃, where D̃ is a compact set containing D in its interior.

To show that the estimated signal-to-noise ratio is asymptotically normal, we apply Theorem 3 with this h. Working under the assumption that (σ², τ²) ∈ D and d/n → ρ ∈ (0, ∞), it is straightforward to check that γ_k, η_k = O(1); thus, ξ, ν = O(1). To approximate the variance of h(T), let θ = (σ², τ²)ᵀ and θ̂ = (σ̂², τ̂²)ᵀ. A second order Taylor expansion yields

 h(T) = g(\hat\theta) = g(\theta) + \nabla g(\theta)^T(\hat\theta - \theta) + R\|\hat\theta - \theta\|^2, \qquad (17)

where R is bounded. Theorem 2 and a straightforward calculation imply that

 \mathrm{Var}\{\nabla g(\theta)^T\hat\theta\} = \nabla g(\theta)^T\mathrm{Cov}(\hat\theta)\nabla g(\theta) = \frac{2}{n\sigma^8}\left\{\left(1+\frac{d}{n}\right)(\sigma^2+\tau^2)^4 - \sigma^4(\sigma^2+\tau^2)^2\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.

Since R is bounded and ‖θ̂ − θ‖² = O_P(1/n), (17) implies

 \frac{\psi^2}{n} = \mathrm{Var}\{h(T)\} = \frac{2}{n\sigma^8}\left\{\left(1+\frac{d}{n}\right)(\sigma^2+\tau^2)^4 - \sigma^4(\sigma^2+\tau^2)^2\right\}\left\{1 + O\left(\frac{1}{n}\right)\right\}.

Thus, Theorem 3 implies that

 d_{TV}\left\{\sqrt{n}\left(\frac{\hat\tau^2/\hat\sigma^2 - \tau^2/\sigma^2}{\psi_0}\right), z\right\} = O(n^{-1/2}), \qquad (18)

where z ∼ N(0, 1) and

 \psi_0^2 = \frac{2}{\sigma^8}\left\{\left(1+\frac{d}{n}\right)(\sigma^2+\tau^2)^4 - \sigma^4(\sigma^2+\tau^2)^2\right\}.
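The delta-method approximation in Example 2 can also be checked by simulation. The sketch below is ours (constants are illustrative, and we take d/n = 1/2 so the plug-in estimate of σ² stays well away from 0); it compares n·Var(τ̂²/σ̂²) with ψ₀²:

```python
import numpy as np

# Empirical check of the delta-method variance psi_0^2 for the estimated
# signal-to-noise ratio tau^2-hat / sigma^2-hat (Sigma = I).
rng = np.random.default_rng(5)
n, d, sigma2, tau2 = 1000, 500, 1.0, 1.0          # true SNR = 1
beta = rng.standard_normal(d)
beta *= np.sqrt(tau2) / np.linalg.norm(beta)

snr = []
for _ in range(500):
    X = rng.standard_normal((n, d))
    y = X @ beta + np.sqrt(sigma2) * rng.standard_normal(n)
    ny2, nXty2 = y @ y, np.sum((X.T @ y) ** 2)
    s2 = (d + n + 1) / (n * (n + 1)) * ny2 - nXty2 / (n * (n + 1))
    snr.append((ny2 / n - s2) / s2)               # tau^2-hat / sigma^2-hat

S2 = sigma2 + tau2
psi0_sq = 2 / sigma2**4 * ((1 + d / n) * S2**4 - sigma2**2 * S2**2)
print(np.mean(snr), n * np.var(snr), psi0_sq)     # mean approx 1; variances close
```

The Monte Carlo mean of the estimated signal-to-noise ratio is close to the true value 1, and n times its empirical variance is close to ψ₀², as (18) suggests.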