On Principal Component Regression in a High-Dimensional Error-in-Variables Setting


Abstract

We analyze the classical method of Principal Component Regression (PCR) in the high-dimensional error-in-variables setting. Here, the observed covariates are not only noisy and contain missing data, but the number of covariates can also exceed the sample size. Under suitable conditions, we establish that PCR identifies the unique model parameter with minimum $\ell_2$-norm, and derive non-asymptotic $\ell_2$-rates of convergence that show its consistency. We further provide non-asymptotic out-of-sample prediction performance guarantees that again prove consistency, even in the presence of corrupted unseen data. Notably, our results do not require the out-of-sample covariates to follow the same distribution as that of the in-sample covariates, but rather that they obey a simple linear algebraic constraint. We finish by presenting simulations that illustrate our theoretical results.

Anish Agarwal (anish90@mit.edu), Devavrat Shah (devavrat@mit.edu), and Dennis Shen (deshen@mit.edu)

MSC2020 subject classifications: Primary 62J05, 62F12; secondary 60B20.

Keywords: principal component regression, high-dimensional statistics, error-in-variables, generalization, singular value thresholding, low-rank matrices, missing data.

1 Introduction.

We consider the setup of error-in-variables regression in a high-dimensional setting. Formally, we observe a labeled dataset of size $n$, denoted as $\{(y_i, z_i) : i \in [n]\}$. Here, $y_i \in \mathbb{R}$ represents the response variable, also known as the label or target. For any $i \in [n]$, we posit that

$y_i = \langle x_i, \beta^* \rangle + \epsilon_i,$   (1)

where $\beta^* \in \mathbb{R}^p$ is the unknown model parameter, $x_i \in \mathbb{R}^p$ is the associated covariate, and $\epsilon_i \in \mathbb{R}$ is the response noise. Unlike traditional regression settings where $z_i = x_i$, the error-in-variables regression setting only reveals a corrupted version $z_i$ of the covariate $x_i$. Precisely, for any $i \in [n]$, let

$z_i = (x_i + w_i) \circ \pi_i,$   (2)

where $w_i \in \mathbb{R}^p$ is the covariate measurement noise and $\pi_i \in \{0, 1\}^p$ is a binary observation mask, with $\circ$ denoting component-wise multiplication; i.e., the $j$-th component of $x_i + w_i$ is observed if $\pi_{ij} = 1$ and is missing otherwise. Further, we allow the sample size $n$ to be much smaller than the number of covariates $p$.

Our interest is in analyzing the performance of the classical method of Principal Component Regression (PCR) for this scenario. In a nutshell, PCR is a two-stage process: first, PCR “de-noises” the observed covariate matrix $Z = [z_1, \dots, z_n]^T$ via Principal Component Analysis (PCA), i.e., PCR replaces $Z$ by its low-rank approximation. Then, PCR regresses the responses $y = (y_1, \dots, y_n)$ with respect to this low-rank variant to produce the model estimate $\hat{\beta}$. We are interested in the following natural questions about the estimation quality of PCR: (1) Given that multiple models are feasible within the high-dimensional framework, what structure should be endowed on the target parameter $\beta^*$ such that $\hat{\beta}$ recovers it? (2) Given noisy and partially observed out-of-sample covariates, can PCR accurately predict the expected response variables, i.e., under what conditions does PCR generalize?

1.1 Contributions.

As the main contribution of this work, we establish that PCR consistently learns the latent model parameter in a high-dimensional error-in-variables setting (Theorem 1 and Corollary 4.1). Interestingly, rather than endowing the standard sparsity structure on the model parameter, we establish that PCR learns the unique model parameter with minimum $\ell_2$-norm, which is of primary importance in the context of prediction. As a special case of our setting in which the spectrum of the true covariates is well-balanced (Assumption 4.1), we show that the parameter estimation error decays to zero at a rate that matches the best known estimation error rate in the literature, cf. [14, 10, 17].

We also establish that PCR achieves vanishing out-of-sample prediction error, even in the presence of corrupted out-of-sample covariates (Theorem 2 and Corollary 4.2). Notably, we do not make any distributional assumptions on the data generating process to arrive at our result, but rather introduce a natural linear algebraic condition (Assumption 2.5). In contrast, popular tools to understand generalization behavior, such as Rademacher complexity analyses, commonly assume that both the in-sample and out-of-sample measurements are independent and identically distributed. Again, in the special case when the true covariates have well-balanced spectra, we show that the out-of-sample prediction error decays at a rate that improves upon the best known rate for PCR, established in [1, 2].

1.2 Key comparisons.

We highlight a few key comparisons, both in terms of the assumptions made and algorithms furnished, between this work and prominent works in the high-dimensional error-in-variables literature, cf. [14], [10], [16], [17], [5], [4], [8], [9], [13].

Assumptions. In this work, we assume the underlying covariate matrix $X$ is low-rank, i.e., there is “sparsity” in the number of singular vectors needed to describe $X$. In comparison, prior works assume that the model parameter is sparse. These notions of sparsity are related. If $X$ is low-rank, then there exists a sparse model parameter that produces identical response variables, cf. [2]; meanwhile, if the model parameter is sparse, then it is not hard to verify that there exists a low-rank covariate matrix that provides equivalent responses. The second key assumption of this work is that the spectrum of $X$ is well-balanced. In comparison, the prior works assume that a type of restricted eigenvalue condition (see Definitions 1 and 2 in [14]) is satisfied for an empirical estimate of the covariance of the true covariates. We note that this estimate is typically constructed by “correcting” the empirical covariance of the observed covariates using knowledge of the latent noise covariance. Intuitively, both assumptions require that there is sufficient “information spread” across the rows and columns of the covariates, i.e., an incoherence-like condition. See Section 3.5 in [2] for a detailed comparison of the well-balanced spectra assumption with respect to the restricted eigenvalue condition.

Algorithms. Notably, the algorithms furnished in prior works explicitly utilize knowledge of the noise covariance – or require the existence of a data-driven estimator for it, which can be too costly or simply infeasible, cf. [9] – to recover the sparse latent model parameter with respect to the $\ell_2$-error, i.e., a guarantee of the form $\|\hat{\beta} - \beta^*\|_2 \to 0$. PCR, on the other hand, is noise agnostic. More formally, the first step in PCR, which finds a low-rank approximation of $Z$, implicitly de-noises the covariates without utilizing knowledge of the noise distribution. The problem of noisy and partially observed covariates resurfaces in the context of out-of-sample predictions. More specifically, previous algorithms are not designed to de-noise out-of-sample covariates; thus, even with exact knowledge of the noise covariance, these works cannot provide generalization error bounds. In contrast, we provide a natural approach to handle these settings (see Section 3), which enables PCR to provably generalize.

1.3 PCR literature.

PCR as a method was introduced in [12]. Despite the ubiquity of PCR in practice, the formal literature on PCR is surprisingly sparse. Notable works include [3, 6, 1, 2]. In particular, [1, 2] present finite-sample analyses for the prediction error (but not the parameter estimation error) of PCR in the high-dimensional error-in-variables setting. Specifically, in the transductive learning setting, they establish that PCR’s out-of-sample prediction error decays to zero as the sample size grows. In such a scenario, both the in-sample (training) and out-of-sample (testing) covariates are accessible upfront. As a result, they can be simultaneously de-noised, after which only the de-noised training covariates and the associated responses are used to learn a model. In contrast, this work considers the classical supervised learning setup, where testing covariates are not revealed during training. Thus, the testing covariates must be de-noised separately, after which the linear model learned in the training phase is applied to estimate the test responses. We further remark that [1, 2] make standard distributional assumptions on the data generating process, which allows them to leverage the techniques of Rademacher complexity analysis to establish their prediction error bounds. We summarize a list of key points of comparison between this paper and notable works in both the PCR and error-in-variables literature in Table 1.

Literature | Key assumptions | Knowledge of noise distribution | Parameter estimation | Out-of-sample prediction error
[14, 10, 17] | sparsity; restricted eigenvalue cond. | Yes | Yes | No
PCR [1, 2] | low-rank; well-balanced spectra | No | No | Yes
This work | low-rank; well-balanced spectra | No | Yes (Cor. 4.1) | Yes (Cor. 4.2)
Table 1: Comparison with a few notable works in the high-dimensional ($p \gg n$) error-in-variables regression literature under the specialized setting where the underlying covariates have well-balanced spectra (Assumption 4.1).

1.4 Organization.

The remainder of this paper is organized as follows. We begin by formally describing our problem setup in Section 2, which includes our modeling assumptions and objectives. Next, we describe the PCR algorithm in Section 3, followed by its parameter estimation and out-of-sample prediction error bounds in Section 4. To reinforce our theoretical findings, we provide illustrative simulations in Section 5. In Sections 6 and 7, we prove Theorems 1 and 2, respectively. We conclude and discuss important future directions of research in Section 8. Lastly, we relegate standard concentration results used for our analyses to Appendix A.

1.5 Notation.

For any matrix $A$, we denote its operator (spectral), Frobenius, and max element-wise norms as $\|A\|_{\mathrm{op}}$, $\|A\|_F$, and $\|A\|_{\max}$, respectively. For any vector $v$, let $\|v\|_q$ denote its $\ell_q$-norm. If $\xi$ is a random variable, we define its sub-gaussian (Orlicz) norm as $\|\xi\|_{\psi_2}$. Let $\circ$ denote component-wise multiplication and let $\otimes$ denote the outer product. For any two numbers $a$ and $b$, we use $a \wedge b$ to denote $\min(a, b)$ and $a \vee b$ to denote $\max(a, b)$. Further, let $[n] := \{1, \dots, n\}$ for any integer $n$.

2 Problem Setup.

In this section, we provide a precise description of our problem, including our observations, assumptions, and objectives.

2.1 Observation model.

As described in Section 1, we have access to $n$ labeled observations $\{(y_i, z_i) : i \in [n]\}$, which we will refer to as our in-sample (training) data; recall that $x_i$ corresponds to the latent covariate with respect to $z_i$. Collectively, we assume (1) and (2) are satisfied. In addition, we observe $m$ unlabeled out-of-sample (testing) covariates; for $i \in [m]$, we only observe the noisy covariates $z'_i$, which again correspond to latent covariates $x'_i$ via (2), but we do not have access to the associated response variables.

Throughout, let $X = [x_1, \dots, x_n]^T \in \mathbb{R}^{n \times p}$ and $X' = [x'_1, \dots, x'_m]^T \in \mathbb{R}^{m \times p}$ represent the underlying training and testing covariate matrices, respectively. Similarly, let $Z \in \mathbb{R}^{n \times p}$ and $Z' \in \mathbb{R}^{m \times p}$ represent their observed noisy and sparse counterparts.

2.2 Modeling assumptions.

We make the following assumptions.

Assumption 2.1 (Bounded).

The entries of the covariate matrices are bounded, i.e., $\|X\|_{\max} \le \Gamma$ and $\|X'\|_{\max} \le \Gamma$ for an absolute constant $\Gamma > 0$.

Assumption 2.2 (Low-rank).

$\mathrm{rank}(X) = r$.

Assumption 2.3 (Response noise).

$(\epsilon_i)_{i \in [n]}$ are a sequence of independent mean zero subgaussian random variables with $\|\epsilon_i\|_{\psi_2} \le \sigma$ for some $\sigma > 0$.

Assumption 2.4 (Covariate noise).

$(w_i)$ are a sequence of independent mean zero subgaussian random vectors with independent coordinates and $\|w_{ij}\|_{\psi_2} \le K$ for some $K > 0$. Further, each $\pi_i$ is a vector of independent Bernoulli variables with parameter $\rho \in (0, 1]$.

Assumption 2.5 (Subspace inclusion).

The rowspace of $X'$ is contained within that of $X$, i.e., $\mathrm{rowspace}(X') \subseteq \mathrm{rowspace}(X)$.

2.3 Goals.

There are two primary goals: (1) identify a well-defined model parameter from the labeled training data, and (2) estimate the out-of-sample responses using the learned model.

3 Principal Component Regression (PCR).

We describe the PCR algorithm as introduced in [12], with a variation to handle missing data. To that end, let $\hat{\rho}$ denote the fraction of observed entries in $Z$ (recall that missing entries of $Z$ are zero under (2)). We define $\hat{Z} := (1/\hat{\rho})\, Z$ and write its singular value decomposition as $\hat{Z} = \sum_{i=1}^{n \wedge p} \hat{s}_i \hat{u}_i \hat{v}_i^T$, where $\hat{s}_1 \ge \hat{s}_2 \ge \dots \ge 0$ are the singular values (arranged in decreasing order) and $\hat{u}_i, \hat{v}_i$ are the left and right singular vectors, respectively.

3.1 Parameter estimation.

For a given parameter $k \le n \wedge p$, PCR estimates the model parameter as

$\hat{\beta} := \Big( \sum_{i=1}^{k} \frac{1}{\hat{s}_i} \hat{v}_i \hat{u}_i^T \Big) y,$   (3)

i.e., $\hat{\beta} = \hat{Z}_k^{\dagger} y$, where $\hat{Z}_k := \sum_{i=1}^{k} \hat{s}_i \hat{u}_i \hat{v}_i^T$ denotes the rank-$k$ truncation of $\hat{Z}$ and $\dagger$ denotes the Moore–Penrose pseudo-inverse.
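To make the two-stage procedure concrete, here is a minimal NumPy sketch of (3) under the description above: the observed covariate matrix (with missing entries zeroed out) is rescaled by $1/\hat{\rho}$, truncated to its top $k$ principal components, and $y$ is regressed on the result. The function name, the optional mask argument, and the encoding of missing entries are illustrative choices, not the paper's.

```python
import numpy as np

def pcr_fit(Z, y, k, mask=None):
    """PCR estimate of the model parameter; a sketch of (3).
    Z has missing entries set to zero; mask (optional) flags the observed entries."""
    n, p = Z.shape
    rho_hat = mask.mean() if mask is not None else 1.0   # fraction of observed entries
    Z_hat = Z / max(rho_hat, 1.0 / (n * p))              # rescaled covariate matrix

    # SVD of the rescaled covariate matrix (singular values in decreasing order).
    U, s, Vt = np.linalg.svd(Z_hat, full_matrices=False)

    # Regress y on the top-k principal components, mirroring (3); this is the
    # minimum l2-norm least-squares solution against the rank-k truncation
    # (cf. Property 3.1).
    beta_hat = Vt[:k].T @ ((U[:, :k].T @ y) / s[:k])
    return beta_hat
```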

3.2 Out-of-sample prediction.

Let $\hat{\rho}'$ denote the proportion of observed entries in $Z'$. As before, let $\hat{Z}' := (1/\hat{\rho}')\, Z'$, with singular value decomposition $\hat{Z}' = \sum_{i} \hat{s}'_i \hat{u}'_i (\hat{v}'_i)^T$, where $\hat{s}'_1 \ge \hat{s}'_2 \ge \dots \ge 0$ are the singular values (arranged in decreasing order) and $\hat{u}'_i, \hat{v}'_i$ are the left and right singular vectors, respectively. Given parameter $k'$, let $\hat{Z}'_{k'} := \sum_{i=1}^{k'} \hat{s}'_i \hat{u}'_i (\hat{v}'_i)^T$, and define the test response estimates as $\hat{y}' := \hat{Z}'_{k'} \hat{\beta}$.

If the responses are known to belong to a bounded interval, say $[-b, b]$ for some $b > 0$, then the entries of $\hat{y}'$ are truncated as follows: for every $i \in [m]$,

$\hat{y}'_i \leftarrow \min\{ \max\{ \hat{y}'_i, -b \}, b \}.$   (4)
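A corresponding sketch of the prediction step: the test covariates are de-noised separately by rescaling and rank-$k'$ truncation, after which the learned model is applied and, optionally, truncated as in (4). The bounds argument is a hypothetical stand-in for the interval $[-b, b]$.

```python
def pcr_predict(Z_test, beta_hat, k, bounds=None, mask=None):
    """De-noise the test covariates, then apply the learned model; a sketch of Section 3.2."""
    m, p = Z_test.shape
    rho_hat = mask.mean() if mask is not None else 1.0
    Z_hat = Z_test / max(rho_hat, 1.0 / (m * p))

    # Separate singular value thresholding (rank-k' truncation) of the test covariates.
    U, s, Vt = np.linalg.svd(Z_hat, full_matrices=False)
    Z_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

    y_hat = Z_k @ beta_hat
    if bounds is not None:                 # optional truncation to [-b, b] as in (4)
        y_hat = np.clip(y_hat, bounds[0], bounds[1])
    return y_hat
```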

3.3 Properties of PCR.

We state some useful properties of PCR, which we will use extensively throughout this work. These are well-known results, discussed at length in Chapter 17 of [15] and Chapter 6.3 of [18].

Property 3.1.

Let $\hat{Z}_k$ be as defined above. Then $\hat{\beta}$, as given in (3), also satisfies

  • $\hat{\beta}$ is the unique solution of the following program:

    minimize $\|\beta\|_2$ over $\beta \in \mathbb{R}^p$
    such that $\beta \in \arg\min_{\beta' \in \mathbb{R}^p} \|y - \hat{Z}_k \beta'\|_2$. (5)
  • $\hat{\beta} \in \mathrm{rowspace}(\hat{Z}_k)$, i.e., $\hat{\beta}$ is orthogonal to the nullspace of $\hat{Z}_k$.

3.4 Choosing .

In general, the correct number of principal components to use is not known a priori. However, under reasonable signal-to-noise scenarios, Weyl’s inequality implies that a “sharp” threshold or gap should exist between the top $r$ singular values and the remaining singular values of the observed data $Z$. This gives rise to a natural “elbow” point and suggests choosing a threshold within this gap. Another standard approach is to use a “universal” thresholding scheme that preserves singular values above a precomputed threshold ([7] and [11]). Data-driven approaches developed around cross-validation can also be employed.
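To make these heuristics concrete, the sketch below (a hypothetical helper, not the paper's prescription) either keeps the singular values above a universal-style threshold proportional to $\sqrt{n} + \sqrt{p}$ when a noise-level estimate is available, or otherwise returns the location of the largest multiplicative gap ("elbow") in the spectrum.

```python
import numpy as np

def choose_rank(Z_hat, noise_level=None):
    """Heuristics for selecting the PCR rank k (an illustrative sketch)."""
    n, p = Z_hat.shape
    s = np.linalg.svd(Z_hat, compute_uv=False)   # singular values, decreasing

    if noise_level is not None:
        # Universal-style threshold: keep singular values above a multiple of the
        # noise level times (sqrt(n) + sqrt(p)); the constant 2.0 is a placeholder.
        tau = 2.0 * noise_level * (np.sqrt(n) + np.sqrt(p))
        return max(int(np.sum(s > tau)), 1)

    # Otherwise, pick the largest multiplicative gap ("elbow") in the spectrum.
    gaps = s[:-1] / np.maximum(s[1:], 1e-12)
    return int(np.argmax(gaps)) + 1
```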

4 Main Results.

We state PCR’s parameter estimation and generalization properties in this section. For the remainder of the paper, will denote any constant that depends only on and , and will denote absolute constants. The values of , and may change from line to line or even within a line.

4.1 Parameter estimation.

Since we work within the high-dimensional framework, our first objective of recovering the underlying parameter is ill-posed without additional structure. Consequently, among all feasible models, we consider the unique model $\beta^*$ that satisfies (1) with minimum $\ell_2$-norm, i.e., $\beta^* \in \mathrm{rowspace}(X)$; this follows since every element in the column space of a matrix is associated with a unique element in its row space coupled with any element in its null space. Thus, for the purposes of prediction, it suffices to consider this particular $\beta^*$ (see [15], [18] for details). Also, recall from Property 3.1 that PCR enforces $\hat{\beta} \in \mathrm{rowspace}(\hat{Z}_k)$. Hence, if $k = r$ and the rowspace of $\hat{Z}_k$ is “close” to the rowspace of $X$, then this suggests that $\hat{\beta}$ is close to $\beta^*$. We formalize this intuition through Theorem 1.
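To spell out the decomposition invoked above, let $P_X := X^{\dagger} X$ denote the orthogonal projection onto the rowspace of $X$ (notation introduced here for illustration). Any $\beta$ that is feasible for (1) in expectation, i.e., $X\beta = \mathbb{E}[y]$, decomposes as
$$\beta = P_X \beta + (I - P_X)\beta, \qquad P_X\beta \in \mathrm{rowspace}(X), \;\; (I - P_X)\beta \in \mathrm{nullspace}(X).$$
Since $X\beta = X P_X \beta$, all feasible models share the same rowspace component $P_X\beta = X^{\dagger}\mathbb{E}[y]$, and
$$\|\beta\|_2^2 = \|P_X\beta\|_2^2 + \|(I - P_X)\beta\|_2^2 \ge \|P_X\beta\|_2^2,$$
so $\beta^* = X^{\dagger}\mathbb{E}[y]$ is the unique feasible model of minimum $\ell_2$-norm.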

Theorem 1.

Let Assumptions 2.1, 2.2, 2.3, and 2.4 hold. Consider and PCR with . Let . Then with probability at least ,

(6)

where

(7)

and denote the -th singular values of and , respectively.

We specialize the above result under specific conditions on the spectral characteristics of $X$. To that end, consider the following assumption.

Assumption 4.1 (Balanced spectra: training covariates).

The nonzero singular values of $X$ satisfy $s_i = \Theta(\sqrt{np/r})$ for all $i \in [r]$.

Corollary 4.1.

Let the setup of Theorem 1 and Assumption 4.1 hold, and let . Then with probability at least ,

(8)
Proof.

By Assumption 4.1, we have . We also have . Therefore,

(9)

It then follows from (7) that . We also have by a standard norm inequality. Using these in the bound of Theorem 1 completes the proof. ∎

Corollary 4.1 implies that if , then the parameter estimation error scales as . Therefore, ignoring log factors as well as dependencies on and , the error decays to zero as the sample size grows, which matches the best known rate (albeit, with respect to a sparse parameter).

4.2 Out-of-sample prediction error.

Next, we bound PCR’s out-of-sample prediction error in the presence of corrupted unseen data, defined as

We define some more useful quantities. Let $s_i$ and $s'_i$ be the $i$-th singular values of $X$ and $X'$, respectively. Recall from Section 3 that $\hat{s}_i$ and $\hat{s}'_i$ are the $i$-th singular values of $\hat{Z}$ and $\hat{Z}'$, respectively. Further, let

where the bounds on the relevant quantities are given in (6) and (7), respectively. In Theorem 2, we bound the out-of-sample prediction error both in probability and in expectation with respect to these quantities.

Theorem 2.

Let the setup of Theorem 1 and Assumption 2.5 hold, and let . Then, with probability at least ,

(10)

Further, if the responses are bounded and $\hat{y}'$ is appropriately truncated as in (4), then

(11)

As before, we specialize the above result under specific conditions on the spectral characteristics of $X'$.

Assumption 4.2 (Balanced spectra: testing covariates).

The nonzero singular values of $X'$ satisfy $s'_i = \Theta(\sqrt{mp/r'})$ for all $i \in [r']$, where $r' := \mathrm{rank}(X')$.

Corollary 4.2.

Let the setup of Corollary 4.1 and Theorem 2 hold. Further, let Assumption 4.2 hold. Then,

(12)
(13)
(14)
Proof.

From Corollary 4.1, we have that . Also, Assumption 4.2 implies . Plugging these into the definitions of , , and simplifying completes the proof. ∎

Notably, Theorem 2 and Corollary 4.2 do not require any distributional assumptions relating the in- and out-of-sample covariates, but rather rely on the linear algebraic condition given by Assumption 2.5, namely that the row space of $X'$ lies within that of $X$. Intuitively, this condition restricts the out-of-sample covariates to be at most as “rich” or “complex” as the in-sample covariates used for learning.

For the following discussion, we only consider the scaling with respect to , but ignore log factors. Now, recall that Corollary 4.1 implies . Hence, Corollary 4.2 implies that if , then the out-of-sample prediction error vanishes to zero both in probability and in expectation. If we make the additional assumption that , then Corollary 4.2 implies that the error scales as in expectation. This improves upon the best known rate of , established in [1, 2].

5 Simulations.

In this section, we present illustrative simulations to support our theoretical results.

5.1 Parameter estimation.

The purpose of this simulation is to demonstrate that PCR does indeed identify the unique linear model of minimum $\ell_2$-norm.

Generative model.

We construct covariates via the classical probabilistic PCA model, cf. [19]. That is, we first generate a factor matrix $U$ by independently sampling each entry from a standard normal distribution. Then, we sample a transformation matrix $T$, where each entry is uniformly and independently sampled from a bounded interval. The final covariate matrix then takes the form $X = UT$. We choose , where .

Next, we generate a model $\beta$ by first sampling a multivariate standard normal vector with independent entries and then scaling each of its coordinates. The noiseless response vector is defined to be $X\beta$. Finally, as motivated by Property 3.1, the minimum $\ell_2$-norm model of interest, $\beta^*$, is computed as $\beta^* = X^{\dagger}(X\beta)$, where $X^{\dagger}$ denotes the pseudo-inverse of $X$.

We consider an additive noise model. Specifically, the entries of the response noise $\epsilon$ are sampled i.i.d. from a normal distribution with mean zero and variance $\sigma^2$. The entries of the covariate noise $W$ are sampled in an identical fashion. We then define our observed response vector as $y = X\beta + \epsilon$ and observed covariate matrix as $Z = X + W$. For simplicity, we do not mask any of the entries.
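A minimal NumPy sketch of this generative process; the dimensions, the uniform support of the transformation matrix, and the noise level are illustrative placeholders (and the per-coordinate rescaling of the model vector is omitted).

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, r = 200, 100, 10          # illustrative sizes; the covariates have rank r
sigma = 0.2                      # illustrative noise level

# Probabilistic-PCA-style covariates: a rank-r matrix X = U @ T.
U = rng.standard_normal((n, r))
T = rng.uniform(-1.0, 1.0, size=(r, p))
X = U @ T

# A dense model, its noiseless responses, and the minimum l2-norm model beta_star.
beta = rng.standard_normal(p)
y_clean = X @ beta
beta_star = np.linalg.pinv(X) @ y_clean   # min l2-norm solution consistent with (1)

# Observations: additive Gaussian noise on responses and covariates, no masking.
y = y_clean + sigma * rng.standard_normal(n)
Z = X + sigma * rng.standard_normal((n, p))
```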

Results.

Using the observations $(y, Z)$, we perform PCR as in Section 3.1 to yield $\hat{\beta}$. To show that PCR can accurately recover $\beta^*$, we compute the $\ell_2$-norm parameter estimation error, or root-mean-squared error (RMSE), with respect to $\beta^*$ and $\beta$ in Figures 1(a) and 1(b), respectively. As suggested by Figure 1(a), the RMSE with respect to $\beta^*$ roughly aligns for different values of $p$ after rescaling the sample size, and decays to zero as the sample size increases; this is predicted by Theorem 1. On the other hand, Figure 1(b) shows that the RMSE with respect to $\beta$ stays roughly constant across different values of $p$. Therefore, as established in [1], PCR performs implicit regularization by not only de-noising the observed covariates, but also finding the minimum-norm solution.

(a) $\ell_2$-norm error of $\hat{\beta}$ with respect to the min. $\ell_2$-norm solution of (1), i.e., $\beta^*$.
(b) $\ell_2$-norm error of $\hat{\beta}$ with respect to a random solution $\beta$ of (1).
Figure 1: Plots of $\ell_2$-norm errors, i.e., $\|\hat{\beta} - \beta^*\|_2$ in (a) and $\|\hat{\beta} - \beta\|_2$ in (b), versus the rescaled sample size after running PCR with rank $r$. As predicted by Theorem 1, the curves for different values of $p$ under (a) roughly align and decay to zero as the sample size increases.

5.2 Out-of-sample prediction: PCR vs. Ordinary Least Squares.

The purpose of this simulation is to demonstrate the benefit of the implicit de-noising effect of PCR vs. ordinary least squares (OLS).

Generative model.

For each experiment, we let . We generate training and testing covariates $X$ and $X'$, respectively, such that $\mathrm{rowspace}(X') \subseteq \mathrm{rowspace}(X)$, i.e., Assumption 2.5 holds. To do so, we sample a latent factor matrix by independently sampling each entry from a standard normal distribution, and then define $X$ and $X'$ as linear transformations of this common factor.

We then generate $\beta$ as in Section 5.1, and use it to produce the latent responses $X\beta$ and $X'\beta$. Similarly, we generate the response noise $\epsilon$ and covariate noises $W$ and $W'$ by independently sampling each entry from a normal distribution with mean zero and variance $\sigma^2$, where $\sigma$ is varied across experiments. Again, for simplicity, we do not mask any of the entries. We then define our observed response as $y = X\beta + \epsilon$, and the observed training and testing covariates as $Z = X + W$ and $Z' = X' + W'$, respectively.

Results.

Using the observations $(y, Z, Z')$, we perform PCR as in Section 3.2 to produce the out-of-sample estimates $\hat{y}'_{\mathrm{PCR}}$. The OLS out-of-sample estimates are produced using the same algorithm as in Section 3.2, but without the singular value thresholding step on either $Z$ or $Z'$, i.e., we de-noise neither the training nor the testing covariates. The estimates produced from OLS are denoted $\hat{y}'_{\mathrm{OLS}}$. In both PCR and OLS, we do not truncate the estimated entries. For any estimate $\hat{y}'$, we define the out-of-sample mean squared error (MSE) as $\frac{1}{m} \|\hat{y}' - X'\beta\|_2^2$. In Figure 2, as we vary the level of response and covariate noise $\sigma$, we plot the MSE of $\hat{y}'_{\mathrm{PCR}}$ versus that of $\hat{y}'_{\mathrm{OLS}}$. The MSE of OLS is three to four orders of magnitude larger than that of PCR across all noise levels. We remark that even at the smallest noise level, the error of OLS is almost three orders of magnitude larger than that of PCR – this indicates the significant level of bias that is introduced even with minimal measurement error. In essence, this stresses the importance of de-noising the training and testing covariates via singular value thresholding.
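For concreteness, the comparison can be phrased in a few lines using the hypothetical pcr_fit and pcr_predict helpers sketched in Section 3; the names Z_train, y_train, Z_test, y_test_clean, and r are illustrative stand-ins for the simulated quantities above.

```python
import numpy as np

# PCR: de-noise both the training and testing covariates before regressing.
beta_pcr = pcr_fit(Z_train, y_train, k=r)
y_pred_pcr = pcr_predict(Z_test, beta_pcr, k=r)

# OLS baseline: identical pipeline but without singular value thresholding,
# i.e., the noisy covariates are used directly on both ends.
beta_ols = np.linalg.pinv(Z_train) @ y_train
y_pred_ols = Z_test @ beta_ols

# Out-of-sample MSE against the latent (noiseless) test responses.
mse_pcr = np.mean((y_pred_pcr - y_test_clean) ** 2)
mse_ols = np.mean((y_pred_ols - y_test_clean) ** 2)
```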

Figure 2: MSE plot of PCR (blue) versus OLS (orange) as we increase the level of covariate and response noise. While PCR’s error scales gracefully with the level of noise, OLS suffers large amounts of bias, even in the presence of small amounts of measurement error.

5.3 Out-of-sample prediction: robustness of PCR to distribution shifts.

The purpose of this simulation is to demonstrate that PCR can generalize even when the testing covariates are not only corrupted, but also sampled from a different distribution than the training covariates.

Generative model.

Throughout, we let . We generate the training covariates $X$ as in Section 5.2, where the entries of the underlying factor matrix are sampled independently from a standard normal distribution. Next, we generate four different out-of-sample covariate matrices, $X'_1, X'_2, X'_3, X'_4$, via the following procedure. We independently sample the entries of $Q_1$ from a standard normal distribution, and define $X'_1 = Q_1 X$. We define $X'_2 = Q_2 X$ similarly, with the entries of $Q_2$ sampled from a normal distribution with mean zero and a different variance. Next, we independently sample the entries of $Q_3$ from a uniform distribution whose mean and variance match those of a standard normal, and define $X'_3 = Q_3 X$. We define $X'_4 = Q_4 X$ similarly, with the entries of $Q_4$ sampled from a uniform distribution whose mean and variance match those of $Q_2$.

By construction, we note that the mean and variance of the entries of $X'_3$ match those of $X'_1$; an analogous relationship holds between $X'_4$ and $X'_2$. Further, while $X'_1$ follows the same distribution as the training covariates, we note that there is a clear distribution shift from the training covariates to $X'_2$, $X'_3$, and $X'_4$.

We proceed to generate $\beta$ as in Section 5.2. We then define the latent test responses $X'_1\beta$, and define those associated with $X'_2, X'_3, X'_4$ analogously. Further, the response noise $\epsilon$ and covariate noises $W, W'_1, \dots, W'_4$ are constructed in the same fashion as described in Section 5.2, where the variance again varies across experiments. We define the training responses as $y = X\beta + \epsilon$ and the observed training covariates as $Z = X + W$. The first set of observed testing covariates is defined as $Z'_1 = X'_1 + W'_1$, with analogous definitions for $Z'_2, Z'_3, Z'_4$.
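Under this construction, each test covariate matrix is a linear mixture of the rows of $X$, which enforces Assumption 2.5 regardless of how the mixing weights are drawn. A hedged NumPy sketch follows; the sizes and mixing distributions are illustrative placeholders, not the paper's exact parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 200, 100, 150          # illustrative sizes
X = rng.standard_normal((n, p))  # stand-in for the training covariates of Section 5.2

# Each X'_j = Q_j @ X is a linear combination of the rows of X, so
# rowspace(X'_j) is contained in rowspace(X) regardless of how Q_j is drawn.
Q1 = rng.standard_normal((m, n))                          # Gaussian weights
Q2 = 0.5 * rng.standard_normal((m, n))                    # Gaussian, smaller variance
Q3 = rng.uniform(-np.sqrt(3.0), np.sqrt(3.0), (m, n))     # uniform, moments of Q1
Q4 = rng.uniform(-np.sqrt(0.75), np.sqrt(0.75), (m, n))   # uniform, moments of Q2
X_tests = [Q @ X for Q in (Q1, Q2, Q3, Q4)]               # Assumption 2.5 holds for all
```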

Results.

Using the observations $(y, Z, Z'_1)$, we perform PCR to produce $\hat{y}'_1$. We produce $\hat{y}'_2$, $\hat{y}'_3$, and $\hat{y}'_4$ analogously. We define MSE as in Section 5.2, with each estimate compared against its corresponding latent response, e.g., $\hat{y}'_1$ against $X'_1\beta$. Figure 3 shows the MSE of $\hat{y}'_1$, $\hat{y}'_2$, $\hat{y}'_3$, and $\hat{y}'_4$ as we vary the noise level $\sigma$. Pleasingly, despite the changes in the data generating process of the out-of-sample responses we evaluate on, the MSEs for all four experiments closely match across all noise levels. This motivates Assumption 2.5 as the key requirement for generalization, at least for PCR, rather than distributional invariance between the training and testing covariates.

Figure 3: MSE plot of the four PCR estimates $\hat{y}'_1$, $\hat{y}'_2$, $\hat{y}'_3$, and $\hat{y}'_4$ as we shift the distribution of the out-of-sample covariates, while ensuring Assumption 2.5 holds. Pleasingly, the MSE remains closely matched across all noise levels and distribution shifts.

5.4 Out-of-sample prediction: subspace inclusion vs. distributional invariance.

The purpose of this simulation is to further illustrate that subspace inclusion (Assumption 2.5) is the key structure that enables PCR to successfully generalize, and not necessarily distributional invariance between the training and testing covariates.

Generative model.

As before, we let . We continue to generate the training covariates $X$ following the procedure in Section 5.2. We now generate two different testing covariate matrices. First, we generate $X'_{\mathrm{sub}} = QX$, where the entries of $Q$ are independently sampled from a normal distribution with mean zero and a fixed variance. As such, it follows that Assumption 2.5 immediately holds between $X'_{\mathrm{sub}}$ and $X$, though they do not obey the same distribution. Next, we generate $X'_{\mathrm{dist}}$ as a fresh covariate matrix whose underlying entries are independently sampled from a standard normal (just as in the construction of $X$). In doing so, we ensure that $X'_{\mathrm{dist}}$ and $X$ follow the same distribution, though Assumption 2.5 no longer holds.

We generate $\beta$ as in Section 5.2, and define the latent test responses $X'_{\mathrm{sub}}\beta$ and $X'_{\mathrm{dist}}\beta$. We also generate the noise terms $\epsilon$, $W$, $W'_{\mathrm{sub}}$, and $W'_{\mathrm{dist}}$ as in Section 5.2. In turn, we define the training data as $y = X\beta + \epsilon$ and $Z = X + W$, and the testing data as $Z'_{\mathrm{sub}} = X'_{\mathrm{sub}} + W'_{\mathrm{sub}}$ and $Z'_{\mathrm{dist}} = X'_{\mathrm{dist}} + W'_{\mathrm{dist}}$.

Results.

We apply PCR under two scenarios. First, we apply PCR using $(y, Z, Z'_{\mathrm{sub}})$ to yield $\hat{y}'_{\mathrm{sub}}$, and then once again using $(y, Z, Z'_{\mathrm{dist}})$ to yield $\hat{y}'_{\mathrm{dist}}$. We define MSE as in Section 5.2, with each estimate compared against its corresponding latent response, e.g., $\hat{y}'_{\mathrm{sub}}$ against $X'_{\mathrm{sub}}\beta$. Figure 4 shows the MSE of $\hat{y}'_{\mathrm{sub}}$ and $\hat{y}'_{\mathrm{dist}}$ across varying levels of noise. As we can see, when Assumption 2.5 holds yet distributional invariance is violated, the corresponding MSE of $\hat{y}'_{\mathrm{sub}}$ is almost three orders of magnitude smaller than that of $\hat{y}'_{\mathrm{dist}}$, where Assumption 2.5 is violated but distributional invariance holds. This reinforces that the key structure required for PCR (and possibly other linear estimators) to generalize is Assumption 2.5, and not necessarily distributional invariance, as is typically assumed in the statistical learning literature.

Figure 4: Plots of PCR’s MSE under two situations: when Assumption 2.5 holds but distributional invariance is violated (blue), and when Assumption 2.5 is violated but distributional invariance holds (orange). Across varying levels of noise, the former condition achieves a much lower MSE.

6 Proof of Theorem 1.

We start with some useful notation. Let $y = X\beta^* + \epsilon$ denote the vector form of (1), with $y = (y_1, \dots, y_n)^T$ and $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$. Throughout, let $k = r$. Recall that the SVD of $\hat{Z}$ is $\hat{Z} = \sum_i \hat{s}_i \hat{u}_i \hat{v}_i^T$. Its truncation using the top $k$ singular components is denoted as $\hat{Z}_k$.

Further, we will often use the following bound: for any , ,

(15)

where with representing the -th column of .

As discussed in Section 4.1, we shall denote $\beta^*$ as the unique minimum $\ell_2$-norm model parameter satisfying (1); equivalently, this can be formulated as $\beta^* \in \mathrm{rowspace}(X)$. As a result, it follows that

$V_{\perp}^T \beta^* = 0,$   (16)

where $V_{\perp}$ represents a matrix of orthonormal basis vectors that span the nullspace of $X$.

Similarly, let $\hat{V}_{\perp}$ be a matrix of orthonormal basis vectors that span the nullspace of $\hat{Z}_k$; thus, $\hat{V}_{\perp}$ is orthogonal to $\hat{\beta}$. Then,

(17)

Note that in the last equality we have used Property 3.1, which states that $\hat{\beta}$ lies in the rowspace of $\hat{Z}_k$, i.e., $\hat{V}_{\perp}^T \hat{\beta} = 0$. Next, we bound the two terms in (17).

Bounding . To begin, note that

(18)

since is an isometry. Next, consider

(19)

where we used (15). Recall that . Therefore,

(20)

Therefore using (18), we conclude that