On Principal Component Regression in a High-Dimensional Error-in-Variables Setting
Abstract
We analyze the classical method of Principal Component Regression (PCR) in the high-dimensional error-in-variables setting. Here, the observed covariates are not only noisy and contain missing data, but the number of covariates can also exceed the sample size. Under suitable conditions, we establish that PCR identifies the unique model parameter with minimum $\ell_2$-norm, and derive non-asymptotic rates of convergence that show its consistency. We further provide non-asymptotic out-of-sample prediction performance guarantees that again prove consistency, even in the presence of corrupted unseen data. Notably, our results do not require the out-of-sample covariates to follow the same distribution as that of the in-sample covariates, but rather that they obey a simple linear algebraic constraint. We finish by presenting simulations that illustrate our theoretical results.
Anish Agarwal (anish90@mit.edu), Devavrat Shah (devavrat@mit.edu), and Dennis Shen (deshen@mit.edu)
MSC2020 subject classifications: Primary 62J05, 62F12; secondary 60B20.
Keywords: principal component regression, high-dimensional statistics, error-in-variables, generalization, singular value thresholding, low-rank matrices, missing data.
1 Introduction.
We consider the setup of error-in-variables regression in a high-dimensional setting. Formally, we observe a labeled dataset of size $n$, denoted as $\{(y_i, Z_i) : i \in [n]\}$. Here, $y_i \in \mathbb{R}$ represents the response variable, also known as the label or target. For any $i \in [n]$, we posit that

(1) $y_i = \langle X_i, \beta^* \rangle + \epsilon_i,$

where $\beta^* \in \mathbb{R}^p$ is the unknown model parameter, $X_i \in \mathbb{R}^p$ is the associated covariate, and $\epsilon_i \in \mathbb{R}$ is the response noise. Unlike traditional regression settings where $Z_i = X_i$, the error-in-variables regression setting reveals a corrupted version $Z_i$ of the covariate $X_i$. Precisely, for any $i \in [n]$, let

(2) $Z_i = (X_i + W_i) \circ \pi_i,$

where $W_i \in \mathbb{R}^p$ is the covariate measurement noise and $\pi_i \in \{0, 1\}^p$ is a binary observation mask with $\circ$ denoting component-wise multiplication, i.e., we observe the $j$-th component of $X_i + W_i$ if $\pi_{ij} = 1$ and do not observe it otherwise. Further, we allow $n$ to be much smaller than $p$.
Our interest is in analyzing the performance of the classical method of Principal Component Regression (PCR) for this scenario. In a nutshell, PCR is a two-stage process: first, PCR “denoises” the observed covariate matrix $Z = [Z_1, \dots, Z_n]^\top$ via Principal Component Analysis (PCA), i.e., PCR replaces $Z$ by its low-rank approximation. Then, PCR regresses $y = [y_1, \dots, y_n]^\top$ with respect to this low-rank variant to produce the model estimate $\hat{\beta}$. We are interested in the following natural questions about the estimation quality of PCR: (1) Given that multiple models are feasible within the high-dimensional framework, what structure should be endowed on $\beta^*$ such that $\hat{\beta} \approx \beta^*$? (2) Given noisy and partially observed out-of-sample covariates, can PCR accurately predict the expected response variables, i.e., under what conditions does PCR generalize?
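To make the two-stage procedure concrete, the following is a minimal NumPy sketch of PCR; the function name, sizes, and use of `lstsq` are illustrative choices, not prescriptions from this paper.

```python
import numpy as np

def pcr(Z, y, k):
    """Two-stage PCR sketch: (1) replace Z by its rank-k approximation via
    PCA/SVD, (2) regress y on the denoised matrix. np.linalg.lstsq returns
    the minimum-norm least-squares solution, matching Property 3.1."""
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    Z_k = U[:, :k] * s[:k] @ Vt[:k]            # rank-k "denoised" covariates
    beta_hat, *_ = np.linalg.lstsq(Z_k, y, rcond=None)
    return beta_hat
```

In the noiseless, fully observed case, this reduces to applying the pseudoinverse of the rank-$k$ truncation of $Z$ to $y$.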
1.1 Contributions.
As the main contribution of this work, we establish that PCR consistently learns the latent model parameter in a high-dimensional error-in-variables setting (Theorem 1 and Corollary 4.1). Interestingly, rather than endowing the standard sparsity structure on $\beta^*$, we establish that PCR learns the unique model parameter with minimum $\ell_2$-norm, which is of primary importance in the context of prediction. In the special case of our setting in which the spectrum of the true covariates is well-balanced (Assumption 4.1), we show that the parameter estimation error decays at a rate matching the best known estimation error rate in the literature, cf. [14, 10, 17].
We also establish that PCR achieves vanishing out-of-sample prediction error, even in the presence of corrupted out-of-sample covariates (Theorem 2 and Corollary 4.2). Notably, we do not make any distributional assumptions on the data generating process to arrive at our result, but rather introduce a natural linear algebraic condition (Assumption 2.5). In contrast, popular tools for understanding generalization behavior, such as Rademacher complexity analyses, commonly assume that both the in-sample and out-of-sample measurements are independent and identically distributed. Again, in the special case when the true covariates have well-balanced spectra, we show that the out-of-sample prediction error rate improves upon the best known error rate for PCR, established in [1, 2].
1.2 Key comparisons.
We highlight a few key comparisons, both in terms of the assumptions made and algorithms furnished, between this work and prominent works in the high-dimensional error-in-variables literature, cf. [14], [10], [16], [17], [5], [4], [8], [9], [13].
Assumptions. In this work, we assume the underlying covariate matrix $X = [X_1, \dots, X_n]^\top$ is low-rank, i.e., there is “sparsity” in the number of singular vectors needed to describe $X$. In comparison, prior works assume that the model parameter $\beta^*$ is sparse. These notions of sparsity are relatable. If $X$ is low-rank, then there exists a sparse model that produces identical response variables, cf. [2]; meanwhile, if $\beta^*$ is sparse, then it is not hard to verify that there exists a low-rank covariate matrix that provides equivalent responses. The second key assumption of this work is that the spectrum of $X$ is well-balanced. In comparison, the prior works assume that a type of restricted eigenvalue condition (see Definitions 1 and 2 in [14]) is satisfied for the empirical estimate of the covariance of $X$. We note that this estimate is typically constructed by “correcting” the covariance of $Z$ using knowledge of the latent noise covariance. Intuitively, both assumptions require that there is sufficient “information spread” across the rows and columns of the covariates, i.e., an incoherence-like condition. See Section 3.5 in [2] for a detailed comparison of the well-balanced spectra assumption with respect to the restricted eigenvalue condition.
Algorithms. Notably, the algorithms furnished in prior works explicitly utilize knowledge of the noise covariance – or require the existence of a data-driven estimator for it, which can be too costly or simply infeasible, cf. [9] – to recover the sparse latent model parameter to within small error. PCR, on the other hand, is noise agnostic. More formally, the first step in PCR, which finds a low-rank approximation of $Z$, implicitly denoises the covariates without utilizing knowledge of the noise distribution. The problem of noisy and partially observed covariates resurfaces in the context of out-of-sample predictions. More specifically, previous algorithms are not designed to denoise out-of-sample covariates; thus, even with exact knowledge of the noise covariance, these works cannot provide generalization error bounds. In contrast, we provide a natural approach to handle these settings (see Section 3), which enables PCR to provably generalize.
1.3 PCR literature.
PCR as a method was introduced in [12]. Despite the ubiquity of PCR in practice, the formal literature on PCR is surprisingly sparse. Notable works include [3, 6, 1, 2]. In particular, [1, 2] present finite-sample analyses for the prediction error (but not parameter estimation error) of PCR in the high-dimensional error-in-variables setting. Specifically, in the transductive learning setting, they establish that PCR's out-of-sample prediction error vanishes. In such a scenario, both the in-sample (training) and out-of-sample (testing) covariates are accessible upfront. As a result, they can be simultaneously denoised, after which only the denoised training covariates and the associated responses are used to learn a model. In contrast, this work considers the classical supervised learning setup, where testing covariates are not revealed during training. Thus, the testing covariates must be denoised separately, after which the linear model learnt in the training phase is applied to estimate the test responses. We further remark that [1, 2] make standard distributional assumptions on the generating process for the data, which allows them to leverage the techniques of Rademacher complexity analysis to establish their prediction error bounds. We summarize a list of key points of comparison between this paper and notable works in both the PCR and error-in-variables literature in Table 1.
Table 1: Key points of comparison with notable works in the PCR and error-in-variables literature.

Literature | Noise covariance knowledge | Key assumptions | Parameter estimation | Out-of-sample prediction
[14, 10, 17] | Yes | sparse parameter; restricted eigenvalue condition | Yes | –
PCR [1, 2] | No | low-rank covariates; i.i.d. data | – | Yes (transductive)
This work | No | low-rank covariates; well-balanced spectra | Yes (Cor. 4.1) | Yes (Cor. 4.2)
1.4 Organization.
The remainder of this paper is organized as follows. We begin by formally describing our problem setup in Section 2, which includes our modeling assumptions and objectives. Next, we describe the PCR algorithm in Section 3, followed by its parameter estimation and out-of-sample prediction error bounds in Section 4. To reinforce our theoretical findings, we provide illustrative simulations in Section 5. In Sections 6 and 7, we prove Theorems 1 and 2, respectively. We conclude and discuss important future directions of research in Section 8. Lastly, we relegate standard concentration results used for our analyses to Appendix A.
1.5 Notation.
For any matrix $A$, we denote its operator (spectral), Frobenius, and max element-wise norms as $\|A\|_{\mathrm{op}}$, $\|A\|_F$, and $\|A\|_{\max}$, respectively. For any vector $v$, let $\|v\|_2$ denote its $\ell_2$-norm. If $X$ is a random variable, we define its sub-Gaussian (Orlicz) norm as $\|X\|_{\psi_2}$. Let $\circ$ denote component-wise multiplication and let $\otimes$ denote the outer product. For any two numbers $a, b$, we use $a \wedge b$ to denote $\min(a, b)$ and $a \vee b$ to denote $\max(a, b)$. Further, let $[n] = \{1, \dots, n\}$ for any integer $n$.
2 Problem Setup.
In this section, we provide a precise description of our problem, including our observations, assumptions, and objectives.
2.1 Observation model.
As described in Section 1, we have access to $n$ labeled observations $\{(y_i, Z_i) : i \in [n]\}$, which we will refer to as our in-sample (training) data; recall that $X_i$ corresponds to the latent covariate with respect to $Z_i$. Collectively, we assume (1) and (2) are satisfied. In addition, we observe $m$ unlabeled out-of-sample (testing) covariates; for each test point, we only observe the noisy covariates, which again correspond to latent covariates, but we do not have access to the associated response variables.
Throughout, let $X \in \mathbb{R}^{n \times p}$ and $X' \in \mathbb{R}^{m \times p}$ represent the underlying training and testing covariate matrices, respectively. Similarly, let $Z$ and $Z'$ represent their observed noisy and sparse counterparts.
2.2 Modeling assumptions.
We make the following assumptions.
Assumption 2.1 (Bounded).
, .
Assumption 2.2 (Low-rank).
.
Assumption 2.3 (Response noise).
The response noise terms are a sequence of independent mean-zero sub-Gaussian random variables.
Assumption 2.4 (Covariate noise).
The covariate noise terms $W_i$ are a sequence of independent mean-zero sub-Gaussian random vectors. Further, each $\pi_i$ is a vector of independent Bernoulli variables with parameter $\rho \in (0, 1]$.
Assumption 2.5 (Subspace inclusion).
The row space of $X'$ is contained within that of $X$, i.e., $\mathrm{rowspace}(X') \subseteq \mathrm{rowspace}(X)$.
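Subspace inclusion is a purely linear algebraic condition, and it can be checked numerically: one matrix's row space is contained in another's precisely when stacking the two does not increase the rank. Below is a small illustrative sketch (function and variable names are hypothetical, not from the paper).

```python
import numpy as np

def rowspace_included(X_test, X_train):
    """True iff rowspace(X_test) is contained in rowspace(X_train):
    appending the test rows must not increase the rank."""
    r_train = np.linalg.matrix_rank(X_train)
    r_stacked = np.linalg.matrix_rank(np.vstack([X_train, X_test]))
    return bool(r_stacked == r_train)
```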
2.3 Goals.
There are two primary goals: (1) identify a well-defined model parameter from the labeled training data, and (2) estimate the out-of-sample responses using the learned model.
3 Principal Component Regression (PCR).
We describe the PCR algorithm as introduced in [12], with a variation to handle missing data. To that end, let $\hat{\rho}$ denote the fraction of observed entries in $Z$. We write the singular value decomposition of the rescaled matrix as $(1/\hat{\rho}) Z = \sum_{i} s_i u_i v_i^\top$, where $s_1 \ge s_2 \ge \dots$ are the singular values (arranged in decreasing order) and $u_i, v_i$ are the left and right singular vectors, respectively.
3.1 Parameter estimation.
For a given parameter $k \le n \wedge p$, PCR estimates the model parameter as

(3) $\hat{\beta} = \Big( \sum_{i=1}^{k} \frac{1}{s_i} v_i u_i^\top \Big) y.$
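The estimator in (3) can be implemented directly, including the rescaling by the observed fraction $\hat{\rho}$. The sketch below assumes, for illustration, that missing entries are stored as zeros and that the true covariates have no exact zeros.

```python
import numpy as np

def pcr_estimate(Z, y, k):
    """PCR model estimate following (3): SVD of (1/rho_hat) Z, then
    beta_hat = sum_{i<=k} (1/s_i) v_i u_i^T y.

    Assumes missing entries of Z are encoded as 0 (a simplification)."""
    n, p = Z.shape
    rho_hat = np.count_nonzero(Z) / (n * p)    # fraction of observed entries
    U, s, Vt = np.linalg.svd(Z / rho_hat, full_matrices=False)
    return Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])
```

When $Z$ is fully observed and exactly rank $k$, this coincides with the pseudoinverse solution.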
3.2 Outofsample prediction.
Let $\hat{\rho}'$ denote the proportion of observed entries in $Z'$. As before, write $(1/\hat{\rho}') Z' = \sum_{i} s'_i u'_i (v'_i)^\top$, where $s'_1 \ge s'_2 \ge \dots$ are the singular values (arranged in decreasing order) and $u'_i, v'_i$ are the left and right singular vectors, respectively. Given parameter $k'$, let $\hat{X}' = \sum_{i=1}^{k'} s'_i u'_i (v'_i)^\top$, and define the test response estimates as $\hat{y}' = \hat{X}' \hat{\beta}$.
If the responses are known to belong to a bounded interval, say $[-b, b]$ for some $b > 0$, then the entries of $\hat{y}'$ are truncated as follows: for every $i \in [m]$,

(4) $\hat{y}'_i \leftarrow \min\{\max\{\hat{y}'_i, -b\}, b\}.$
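The prediction step mirrors the estimation step: denoise the test covariates by singular value thresholding, apply the learned model, and optionally truncate. A sketch under the same illustrative conventions as before (zeros encode missing entries; the symmetric bound is an assumption):

```python
import numpy as np

def pcr_predict(Z_test, beta_hat, k, b=None):
    """Denoise out-of-sample covariates by rank-k truncation of
    (1/rho_hat') Z_test, apply the learned model, and (optionally)
    truncate predictions to a known range [-b, b] as in (4)."""
    m, p = Z_test.shape
    rho_hat = np.count_nonzero(Z_test) / (m * p)
    U, s, Vt = np.linalg.svd(Z_test / rho_hat, full_matrices=False)
    X_hat = U[:, :k] * s[:k] @ Vt[:k]          # denoised test covariates
    y_hat = X_hat @ beta_hat
    return np.clip(y_hat, -b, b) if b is not None else y_hat
```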
3.3 Properties of PCR.
We state some useful properties of PCR, which we will use extensively throughout this work. These are wellknown results, discussed at length in Chapter 17 of [15] and Chapter 6.3 of [18].
Property 3.1.
Let $Z_k = \sum_{i=1}^{k} s_i u_i v_i^\top$ denote the rank-$k$ truncation of $(1/\hat{\rho})Z$. Then $\hat{\beta}$, as given in (3), also satisfies the following: $\hat{\beta}$ is the unique solution of the program

(5) minimize $\|\beta\|_2$ such that $\beta \in \arg\min_{b} \|y - Z_k b\|_2^2$.

In particular, $\hat{\beta}$ lies in the row space of $Z_k$.
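These properties are easy to verify numerically: the minimum-norm least-squares solution against a rank-$k$ matrix coincides with the pseudoinverse solution and lies in that matrix's row space. The check below is an illustration, not part of the paper's analysis.

```python
import numpy as np

def check_min_norm_property(n=40, p=80, k=3, seed=0):
    """Verify that the least-squares solution against a rank-k matrix is the
    minimum-norm solution and lies in the matrix's row space."""
    rng = np.random.default_rng(seed)
    Z_k = rng.normal(size=(n, k)) @ rng.normal(size=(k, p))   # rank-k matrix
    y = rng.normal(size=n)
    beta_hat, *_ = np.linalg.lstsq(Z_k, y, rcond=None)
    # Projector onto rowspace(Z_k); beta_hat must be unchanged by it.
    P = np.linalg.pinv(Z_k) @ Z_k
    in_rowspace = np.allclose(P @ beta_hat, beta_hat)
    # Coincides with the pseudoinverse (minimum-norm) solution.
    min_norm = np.allclose(beta_hat, np.linalg.pinv(Z_k) @ y)
    return in_rowspace and min_norm
```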
3.4 Choosing $k$.
In general, the correct number of principal components to use is not known a priori. However, under reasonable signal-to-noise scenarios, Weyl's inequality implies that a “sharp” threshold or gap should exist between the top $k$ singular values and the remaining singular values of the observed data $Z$. This gives rise to a natural “elbow” point and suggests choosing a threshold within this gap. Another standard approach is to use a “universal” thresholding scheme that preserves singular values above a precomputed threshold ([7] and [11]). Data-driven approaches developed around cross-validation can also be employed.
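As an illustration of the elbow heuristic (one simple rule among those mentioned above, not a method prescribed by the paper), one can pick $k$ at the largest multiplicative gap in the observed spectrum:

```python
import numpy as np

def choose_k_elbow(Z):
    """Pick k at the largest multiplicative gap between consecutive singular
    values of Z -- a simple proxy for the 'elbow' heuristic."""
    s = np.linalg.svd(Z, compute_uv=False)
    s = s[s > 1e-12]                 # drop numerically-zero singular values
    gaps = s[:-1] / s[1:]            # ratios of consecutive singular values
    return int(np.argmax(gaps)) + 1
```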
4 Main Results.
We state PCR's parameter estimation and generalization properties in this section. For the remainder of the paper, $C$ will denote any constant that depends only on the model parameters, and $c, c'$ will denote absolute constants. The values of $C$, $c$, and $c'$ may change from line to line or even within a line.
4.1 Parameter estimation.
Since we work within the high-dimensional framework, our first objective of recovering the underlying parameter is ill-posed without additional structure. Consequently, among all feasible models, we consider the unique model $\beta^*$ that satisfies (1) with minimum $\ell_2$-norm; this follows since every element in the column space of a matrix is associated with a unique element in its row space coupled with any element in its null space. Thus, for the purposes of prediction, it suffices to consider this particular $\beta^*$ (see [15], [18] for details). Also, recall from Property 3.1 that PCR enforces $\hat{\beta}$ to lie in the row space of $Z_k$. Hence, if the row space of $Z_k$ is “close” to the row space of $X$, then this suggests $\hat{\beta} \approx \beta^*$. We formalize this intuition through Theorem 1.
Theorem 1.
We specialize the above result under specific conditions on the spectral characteristics of $X$. To that end, consider the following assumption.
Assumption 4.1 (Balanced spectra: training covariates).
The nonzero singular values of $X$ are well-balanced, i.e., all of the same order.
Corollary 4.1.
Proof.
Corollary 4.1 implies that, ignoring log factors as well as dependencies on the model parameters, the parameter estimation error decays to zero as the sample size grows, which matches the best known rate (albeit, with respect to a sparse parameter).
4.2 Outofsample prediction error.
Next, we bound PCR's out-of-sample prediction error in the presence of corrupted unseen data, defined as
We define some more useful quantities. Let $\tau_k$ and $\tau'_{k'}$ be the $k$-th and $k'$-th singular values of $X$ and $X'$, respectively. Recall from Section 3 that $s_k$ and $s'_{k'}$ are the corresponding singular values of $(1/\hat{\rho})Z$ and $(1/\hat{\rho}')Z'$, respectively. Further, let
where the bounds on and are given in (1) and (7), respectively. In Theorem 2, we bound both in probability and in expectation with respect to these quantities.
Theorem 2.
As before, we specialize the above result under specific conditions on the spectral characteristics of $X'$.
Assumption 4.2 (Balanced spectra: testing covariates).
The nonzero singular values of $X'$ are well-balanced, i.e., all of the same order.
Corollary 4.2.
Proof.
Notably, Theorem 2 and Corollary 4.2 do not require any distributional assumptions relating the in- and out-of-sample covariates, but rather rely on the linear algebraic condition given by Assumption 2.5, i.e., that the row space of $X'$ lies within that of $X$. Intuitively, this condition restricts the out-of-sample covariates to be at most as “rich” or “complex” as the in-sample covariates used for learning.
For the following discussion, we only consider the scaling with respect to the sample size, ignoring log factors. Combining Corollary 4.1 with Corollary 4.2 implies that the out-of-sample prediction error vanishes to zero both in probability and in expectation; under an additional mild condition, the error rate in expectation improves upon the best known rate, established in [1, 2].
5 Simulations.
In this section, we present illustrative simulations to support our theoretical results.
5.1 Parameter estimation.
The purpose of this simulation is to demonstrate that PCR does indeed identify the unique linear model of minimum norm.
Generative model.
We construct covariates via the classical probabilistic PCA model, cf. [19]. That is, we first generate a factor matrix by independently sampling each entry from a standard normal distribution. Then, we sample a transformation matrix, where each entry is uniformly and independently sampled from a bounded interval. The final covariate matrix $X$ is the product of these two matrices, and we choose its rank to be small relative to $n \wedge p$.
Next, we generate a raw model vector $\theta$ by first sampling a multivariate standard normal vector with independent entries and then scaling each coordinate. The noiseless response vector is defined to be $X\theta$. Finally, as motivated by Property 3.1, the minimum-norm model of interest, $\beta^*$, is computed as $\beta^* = X^\dagger X \theta$, where $X^\dagger$ denotes the pseudoinverse of $X$.
We consider an additive noise model. Specifically, the entries of the response noise are sampled i.i.d. from a normal distribution with mean zero and variance $\sigma^2$. The entries of the covariate noise are sampled in an identical fashion. We then define our observed response vector as $y = X\beta^* + \epsilon$ and observed covariate matrix as $Z = X + W$. For simplicity, we do not mask any of the entries.
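The generative model above can be sketched as follows; the dimensions, rank, and noise level are illustrative choices, not the exact values used in the paper.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, r, sigma = 200, 100, 5, 0.5          # illustrative sizes and noise level

# Probabilistic PCA covariates: Gaussian factors times a uniform mixing matrix.
F = rng.normal(size=(n, r))
T = rng.uniform(-1.0, 1.0, size=(r, p))
X = F @ T                                   # rank-r true covariates

theta = rng.normal(size=p)                  # raw model vector
y_true = X @ theta
beta_star = np.linalg.pinv(X) @ y_true      # unique minimum-norm model

# Additive noise; no masking for simplicity.
y = y_true + sigma * rng.normal(size=n)
Z = X + sigma * rng.normal(size=(n, p))
```

Note that $\beta^*$ produces the same noiseless responses as $\theta$ while having no component in the null space of $X$.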
Results.
Using the observations $(y, Z)$, we perform PCR as in Section 3.1 to yield $\hat{\beta}$. To show that PCR can accurately recover $\beta^*$, we compute the $\ell_2$-norm parameter estimation error, or root-mean-squared error (RMSE), with respect to $\beta^*$ and $\theta$ in Figures 1(a) and 1(b), respectively. As suggested by Figure 1(a), the RMSE with respect to $\beta^*$ roughly aligns for different values of $p$ after appropriate rescaling of the sample size, and decays to zero as the sample size increases; this is predicted by Theorem 1. On the other hand, Figure 1(b) shows that the RMSE with respect to $\theta$ stays roughly constant across different values of $n$. Therefore, as established in [1], PCR performs implicit regularization by not only denoising the observed covariates, but also finding the minimum-norm solution.
5.2 Outofsample prediction: PCR vs. Ordinary Least Squares.
The purpose of this simulation is to demonstrate the benefit of the implicit denoising effect of PCR vs. ordinary least squares (OLS).
Generative model.
For each experiment, we fix the dimensions and vary the noise level. We generate training and testing covariates $X$ and $X'$ so that they share a common row space, i.e., Assumption 2.5 holds. To do so, we sample a common matrix $C$ by independently sampling each entry from a standard normal distribution. Then, we define $X$ and $X'$ as products of independent factor matrices with $C$.
We then generate the model parameter as in Section 5.1, and use it to produce the training and testing responses. Similarly, we generate the response noise and covariate noises by independently sampling each entry from a normal distribution with mean zero and variance $\sigma^2$, where $\sigma$ varies across experiments. Again, for simplicity, we do not mask any of the entries. We then define our observed response as $y = X\beta^* + \epsilon$, and the observed training and testing covariates as $Z = X + W$ and $Z' = X' + W'$, respectively.
Results.
Using the observations $(y, Z, Z')$, we perform PCR as in Section 3.2 to produce the test estimates. The OLS out-of-sample estimates are produced using the same algorithm as in Section 3.2 without the singular value thresholding step on either $Z$ or $Z'$, i.e., we denoise neither the training nor the testing covariates. In both PCR and OLS, we do not truncate the estimated entries. For any estimate $\hat{y}'$, we define the out-of-sample mean squared error (MSE) as the average squared deviation from the latent responses $X'\beta^*$. In Figure 2, as we vary the level of response and covariate noise $\sigma$, we plot the MSE of PCR versus that of OLS. The MSE of OLS is between three to four orders of magnitude larger than that of PCR across all noise levels. We remark that even at the smallest noise level considered, the error of OLS is almost three orders of magnitude larger than that of PCR – this indicates the significant level of bias that is introduced even with minimal measurement error. In essence, this stresses the importance of denoising the training and testing covariates via singular value thresholding.
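A compact version of this comparison (with hypothetical dimensions and noise level, not those of the paper) makes the denoising benefit visible:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, r, sigma = 400, 50, 4, 0.5            # illustrative choices

C = rng.normal(size=(r, p))                 # shared row space (Assumption 2.5)
X_train = rng.normal(size=(n, r)) @ C
X_test = rng.normal(size=(n, r)) @ C
beta = np.linalg.pinv(X_train) @ (X_train @ rng.normal(size=p))  # min-norm model
y = X_train @ beta + sigma * rng.normal(size=n)
Z_train = X_train + sigma * rng.normal(size=(n, p))
Z_test = X_test + sigma * rng.normal(size=(n, p))

def rank_k(M, k):
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U[:, :k] * s[:k] @ Vt[:k]

# PCR: threshold singular values of both matrices before fitting / predicting.
beta_pcr, *_ = np.linalg.lstsq(rank_k(Z_train, r), y, rcond=None)
mse_pcr = np.mean((rank_k(Z_test, r) @ beta_pcr - X_test @ beta) ** 2)

# OLS: identical pipeline with no singular value thresholding.
beta_ols, *_ = np.linalg.lstsq(Z_train, y, rcond=None)
mse_ols = np.mean((Z_test @ beta_ols - X_test @ beta) ** 2)
```

Running this sketch, the denoised (PCR) pipeline attains a markedly smaller out-of-sample MSE than the undenoised (OLS) pipeline.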
5.3 Outofsample prediction: robustness of PCR to distribution shifts.
The purpose of this simulation is to demonstrate that PCR can generalize even when the testing covariates are not only corrupted, but also sampled from a different distribution than the training covariates.
Generative model.
Throughout, we use the same dimensions as before. We generate the training covariates as in Section 5.2, i.e., $X = AC$, where the entries of $A$ are sampled independently from a standard normal distribution. Next, we generate four different out-of-sample covariate matrices, $X'_1, X'_2, X'_3, X'_4$, via the following procedure: we independently sample the entries of $A'_1$ from a standard normal distribution, and define $X'_1 = A'_1 C$. We define $X'_2$ similarly, with the entries of $A'_2$ sampled from a normal distribution with mean zero and a different variance. Next, we independently sample the entries of $A'_3$ from a uniform distribution with symmetric support, and define $X'_3 = A'_3 C$. We define $X'_4$ similarly, with the entries of $A'_4$ sampled from a uniform distribution with a different support.
By construction, the mean and variance of the entries of the uniformly sampled factors match those of their Gaussian counterparts. Further, while $X'_1$ follows the same distribution as $X$, there is a clear distribution shift from $X$ to the remaining out-of-sample covariates.
We proceed to generate the model parameter as in Section 5.2, and define the latent test responses for each of the four out-of-sample covariate matrices. Further, the response noise and covariate noises are constructed in the same fashion as described in Section 5.2, where the variance again varies across experiments. We define the training responses as $y = X\beta^* + \epsilon$ and observed training covariates as $Z = X + W$. The first set of observed testing covariates is defined as $Z'_1 = X'_1 + W'_1$, with analogous definitions for $Z'_2, Z'_3, Z'_4$.
Results.
Using the observations, we perform PCR to produce estimates for each of the four testing sets. We define MSE as in Section 5.2, with each estimate compared against its corresponding latent response. Figure 3 shows the MSE of all four estimates as we vary the noise level $\sigma$. Pleasingly, despite the changes in the data generating process of the out-of-sample responses we evaluate on, the MSE for all four experiments closely matches across all noise levels. This motivates Assumption 2.5 as the key requirement for generalization, at least for PCR, rather than distributional invariance between the training and testing covariates.
5.4 Outofsample prediction: subspace inclusion vs. distributional invariance.
The purpose of this simulation is to further illustrate that subspace inclusion (Assumption 2.5) is the key structure that enables PCR to successfully generalize, and not necessarily distributional invariance between the training and testing covariates.
Generative model.
As before, we use the same dimensions, and continue to generate the training covariates as $X = AC$ following the procedure in Section 5.2. We now generate two different testing covariate matrices. First, we generate $X'_{\mathrm{sub}} = A' C$, where the entries of $A'$ are independently sampled from a normal distribution with mean zero and a different variance. As such, Assumption 2.5 immediately holds between $X$ and $X'_{\mathrm{sub}}$, though they do not obey the same distribution. Next, we generate $X'_{\mathrm{dist}}$ from a fresh factorization, where the entries of both factors are independently sampled from a standard normal (just as in the construction of $X$). In doing so, we ensure that $X$ and $X'_{\mathrm{dist}}$ follow the same distribution, though Assumption 2.5 no longer holds.
Results.
We apply PCR under two scenarios: first using the testing covariates that satisfy Assumption 2.5, and again using those that satisfy distributional invariance. We define MSE as in Section 5.2, with each estimate compared against its corresponding latent response. Figure 4 shows the MSE of both estimates across varying levels of noise. As we can see, when Assumption 2.5 holds yet distributional invariance is violated, the corresponding MSE is almost three orders of magnitude smaller than when Assumption 2.5 is violated but distributional invariance holds. This reinforces that the key structure required for PCR (and possibly other linear estimators) to generalize is Assumption 2.5, and not necessarily distributional invariance, as is typically assumed in the statistical learning literature.
6 Proof of Theorem 1.
We start with some useful notation. Let $y = [y_1, \dots, y_n]^\top$ be the vector form of (1). Recall the SVD $(1/\hat{\rho})Z = \sum_i s_i u_i v_i^\top$; its truncation using the top $k$ singular components is denoted as $Z_k$.
Further, we will often use the following bound: for any , ,
(15) 
where with representing the th column of .
As discussed in Section 4.1, we shall denote $\beta^*$ as the unique minimum $\ell_2$-norm model parameter satisfying (1); equivalently, $\beta^*$ is the projection of any feasible parameter onto the row space of the true covariate matrix. As a result, it follows that any feasible model parameter $\beta$ admits the decomposition

(16) $\beta = \beta^* + B_\perp \nu,$

where $B_\perp$ represents a matrix of orthonormal basis vectors that span the null space of the true covariate matrix.