Single Point Transductive Prediction

Abstract

Standard methods in supervised learning separate training and prediction: the model is fit independently of any test points it may encounter. However, can knowledge of the next test point be exploited to improve prediction accuracy? We address this question in the context of linear prediction, showing how techniques from semi-parametric inference can be used transductively to combat regularization bias. We first lower bound the prediction error of ridge regression and the Lasso, showing that they must incur significant bias in certain test directions. We then provide non-asymptotic upper bounds on the prediction error of two transductive prediction rules. We conclude by showing the efficacy of our methods on both synthetic and real data, highlighting the improvements single point transductive prediction can provide in settings with distribution shift.

1 Introduction

We consider the task of prediction given independent datapoints from a linear model,

(1)

in which our observed targets and covariates are related by an unobserved parameter vector and noise vector .

Most approaches to linear model prediction are inductive, divorcing the steps of training and prediction; for example, regularized least squares methods like ridge regression hoerl1970ridge and the Lasso tibshirani1996regression are fit independently of any knowledge of the next target test point . This suggests a tantalizing transductive question: can knowledge of a single test point be leveraged to improve prediction for ? In the random design linear model setting (1), we answer this question in the affirmative.

Specifically, in Section 2 we establish out-of-sample prediction lower bounds for the popular ridge and Lasso estimators, highlighting the significant dimension-dependent bias introduced by regularization. In Section 3 we demonstrate how this bias can be mitigated by presenting two classes of transductive estimators that exploit explicit knowledge of the test point . We provide non-asymptotic risk bounds for these estimators in the random design setting, proving that they achieve dimension-free -prediction risk for sufficiently large. In Section 4, we first validate our theory in simulation, demonstrating that transduction improves the prediction accuracy of the Lasso with fixed regularization even when is drawn from the training distribution. We then demonstrate that under distribution shift, our transductive methods outperform even the popular cross-validated Lasso, cross-validated ridge, and cross-validated elastic net estimators (which attempt to find an optimal data-dependent trade-off between bias and variance) on both synthetic data and a suite of five real datasets.

1.1 Related Work

Our work is inspired by two approaches to semiparametric inference: the debiased Lasso approach introduced by (zhang2014confidence; van2014asymptotically; javanmard2014confidence) and the orthogonal machine learning approach of (chernozhukov2017double). The works (zhang2014confidence; van2014asymptotically; javanmard2014confidence) obtain small-width and asymptotically valid confidence intervals (CIs) for individual model parameters by debiasing an initial Lasso estimator tibshirani1996regression. The works (chao2014high; cai2017confidence; athey2018approximate) each consider a more closely related problem of obtaining prediction confidence intervals using a generalization of the debiased Lasso estimator of (javanmard2014confidence). The work of chernozhukov2017double describes a general-purpose procedure for extracting -consistent and asymptotically normal target parameter estimates in the presence of nuisance parameters. Specifically, chernozhukov2017double construct a two-stage estimator where one initially fits first-stage estimates of nuisance parameters using arbitrary ML estimators on a first-stage data sample. In the second stage, these first-stage estimates are used to provide estimates of the relevant model parameters using an orthogonalized method-of-moments. wager2016high also use generic ML procedures as regression adjustments to form efficient confidence intervals for treatment effects.

These pioneering works all focus on improved confidence interval construction. Here we show that the semiparametric techniques developed for hypothesis testing can be adapted to provide practical improvements in mean-squared prediction error. Our resulting mean-squared error bounds complement the in-probability bounds of the aforementioned literature by controlling prediction performance across all events.

Our approach to prediction also bears some resemblance to semi-supervised learning (SSL) – transferring predictive power between labeled and unlabeled examples (zhu2005semi). In contrast with SSL, the goal of transductive learning is to predict solely the labels of the observed unlabeled features. alquier2012transductive formulate a transductive version of the Lasso and Dantzig selector estimators in the fixed design setting, focused only on predicting a subset of points. bellec2018prediction also prove risk bounds for transductive and semi-supervised -regularized estimators in the high-dimensional setting. A principal difference between our approaches is that we make no distributional assumptions about the sequence of test points and do not assume simultaneous access to a large pool of test data. Rather, our procedures receive access to only a single arbitrary test point , and our aim is accurate prediction for that point. Conventionally, SSL benefits from access to a large pool of test points; we are unaware of other results that benefit from access to a single test point . Moreover, existing SSL methods in the sparse linear regression setting do not remove regularization bias to achieve dimension-independent rates but rather provide dimension-dependent bounds (bellec2018prediction).

1.2 Problem Setup

Our principal aim in this work is to understand the prediction risk,

(2)

of an estimator of the unobserved test response . Here, is independent of with variance . We exclude the additive noise from our risk definition, as it is irreducible for any estimator. Importantly, to accommodate non-stationary learning settings, we consider to be fixed and arbitrary; in particular, need not be drawn from the training distribution. Hereafter, we will make use of several assumptions which are standard in the random design linear regression literature.

Assumption 1 (Well-specified Model).

The data is generated from the model (1).

Assumption 2 (Bounded Covariance).

The covariate vectors have common covariance with , and . We further define the precision matrix and condition number .

Assumption 3 (Sub-Gaussian Design).

Each covariate vector is sub-Gaussian with parameter , in the sense that, .

Assumption 4 (Sub-Gaussian Noise).

The noise is sub-Gaussian with variance parameter .

Throughout, we will use bold lower-case letters (e.g., ) to refer to vectors and bold upper-case letters to refer to matrices (e.g., ). We use for the set . Vectors or matrices subscripted with an index set indicate the subvector or submatrix supported on . The expression indicates the number of non-zero elements in and . We will use , , and to denote greater than, less than, and equal to up to a constant that is independent of and .

2 Lower Bounds for Regularized Prediction

We begin by providing lower bounds on the prediction risk of Lasso and ridge regression; the corresponding predictions take the form for a regularized estimate of the unknown vector .

2.1 Lower Bounds for Ridge Regression Prediction

We first consider the prediction risk of the ridge estimator with regularization parameter . In the asymptotic high-dimensional limit (with ) and assuming the training distribution equals the test distribution, dobriban2018high compute the predictive risk of the ridge estimator in a dense random effects model. By contrast, we provide a non-asymptotic lower bound which does not impose any distributional assumptions on or on the underlying parameter vector . Theorem 1, proved in Section B.1, isolates the error in the ridge estimator due to bias for any choice of regularizer .

Theorem 1.

Under Assumption 1, suppose with independent noise . If ,

(3)
(4)

Notably, the dimension-free term in this bound coincides with the risk of the ordinary least squares (OLS) estimator in this setting. The remaining multiplicative factor indicates that the ridge risk can be substantially larger if the regularization strength is too large. In fact, our next result shows that, surprisingly, over-regularization can occur even when is tuned to minimize held-out prediction error over the training population. The same undesirable outcome results when is selected to minimize estimation error; the proof can be found in Section B.2.

Corollary 1.

Under the conditions of Theorem 1, if and is independent of , then for ,

(5)
(6)
(7)

Several insights can be gathered from the previous results. First, the expression minimized in Corollary 1 is the expected prediction risk for a new datapoint drawn from the training distribution. This is the population analog of held-out validation error or cross-validation error that is often minimized to select in practice. Second, in the setting of Corollary 1, taking yields

(8)

More generally, if we take , and then,

(9)

If is optimized for estimation error or for prediction error with respect to the training distribution, the ridge estimator must incur much larger test error than the OLS estimator in some test directions. Such behavior can be viewed as a symptom of over-regularization – the choice is optimized for the training distribution and cannot be targeted to provide uniformly good performance over all . In Section 3 we show how transductive techniques can improve prediction in this regime.

The chief difficulty in lower-bounding the prediction risk in Theorem 1 lies in controlling the expectation over the design , which enters nonlinearly into the prediction risk. Our proof circumvents this difficulty in two steps. First, the isotropy and independence properties of Wishart matrices are used to reduce the computation to that of a 1-dimensional expectation with respect to the unordered eigenvalues of . Second, in the regime , the sharp concentration of Gaussian random matrices in spectral norm is exploited to essentially approximate .
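To make the over-regularization phenomenon concrete, the following sketch contrasts ridge regression, with its regularization tuned by cross-validation on the training distribution, against OLS along a test direction aligned with the parameter vector, where the ridge shrinkage bias is largest. All settings (isotropic Gaussian design, dense parameter, sample sizes, the RidgeCV grid) are illustrative assumptions and not the paper's experimental setup; averaging the squared error over independent repetitions estimates the prediction risk (2).

import numpy as np
from sklearn.linear_model import LinearRegression, RidgeCV

# Illustrative sketch (not the paper's experiment): isotropic Gaussian design,
# dense parameter vector, and a test direction aligned with the parameter,
# where the ridge shrinkage bias is largest.
rng = np.random.default_rng(0)
n, d, sigma = 200, 100, 1.0
theta = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ theta + sigma * rng.normal(size=n)

# Regularization tuned for the *training* distribution via cross-validation.
ridge = RidgeCV(alphas=np.logspace(-2, 3, 50), fit_intercept=False).fit(X, y)
ols = LinearRegression(fit_intercept=False).fit(X, y)

x_test = theta / np.linalg.norm(theta)  # direction in which ridge bias accumulates
target = x_test @ theta
ridge_pred = ridge.predict(x_test[None])[0]
ols_pred = ols.predict(x_test[None])[0]
print("ridge squared error along x_test:", (ridge_pred - target) ** 2)
print("OLS   squared error along x_test:", (ols_pred - target) ** 2)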

2.2 Lower Bounds for Lasso Prediction

We next provide a strong lower bound on the out-of-sample prediction error of the Lasso estimator with regularization parameter . There has been extensive work (see, e.g., raskutti2011minimax) establishing minimax lower bounds for the in-sample prediction error and parameter estimation error of any procedure given data from a sparse linear model. However, our focus is on out-of-sample prediction risk for a specific procedure, the Lasso. The point need not be one of the training points (in-sample) nor even be drawn from the same distribution as the covariates. Theorem 2, proved in Section C.1, establishes that a well-regularized Lasso program suffers significant biases even in a simple problem setting with i.i.d. Gaussian covariates and noise.

Theorem 2.

Under Assumption 1, fix any , and let with independent noise . If and , then there exist universal constants such that for all ,

(10)
(11)

where the trimmed norm is the sum of the magnitudes of the largest magnitude entries of .

In practice we will always be interested in a known direction, but the next result clarifies the dependence of our Lasso lower bound on sparsity for worst-case test directions (see Section C.2 for the proof):

Corollary 2.

In the setting of Theorem 2, for ,

(12)

We make several comments regarding these results. First, Theorem 2 yields an -specific lower bound – showing that given any potential direction there will exist an underlying -sparse parameter for which the Lasso performs poorly. Moreover, the magnitude of error suffered by the Lasso scales both with the regularization strength and the norm of along its top coordinates. Second, the constraint on the regularization parameter in Theorem 2, , is a sufficient and standard choice to obtain consistent estimates with the Lasso (see wainwright2017highdim for example). Third, simplifying to the case of , we see that Corollary 2 implies the Lasso must incur worst-case prediction error , matching upper bounds for Lasso prediction error. In particular, such a bound is not dimension-free, possessing a dependence on , even though the Lasso is only required to predict well along a single direction.

The proof of Theorem 2 uses two key ideas. First, in this benign setting, we can show that has support strictly contained in the support of with at least constant probability. We then adapt ideas from the study of debiased Lasso estimation in (javanmard2014confidence) to sharply characterize the coordinate-wise bias of the Lasso estimator along the support of ; in particular, we show that a worst-case can match the signs of the largest elements of and have magnitude on each non-zero coordinate. Thus the bias induced by regularization can coherently sum across the coordinates in the support of . A similar lower bound follows by choosing to match the signs of on any subset of size . This sign alignment between and is also explored in the independent and concurrent work of (bellec2019biasing, Thm. 2.2).
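The sign-alignment mechanism behind Theorem 2 is easy to observe numerically. The sketch below is an illustration under assumed settings (i.i.d. Gaussian design, a k-sparse parameter, and a standard square-root-log scaling of the regularization level), not the paper's experiment; it compares the Lasso's prediction error along a direction matching the signs of the parameter on its support against a random direction of the same norm.

import numpy as np
from sklearn.linear_model import Lasso

# Illustration of the Theorem 2 mechanism (settings are assumptions, not the
# paper's): the Lasso shrinks each active coordinate by roughly the
# regularization level, so a test direction matching the parameter's signs on
# its support accumulates bias coherently across all k active coordinates.
rng = np.random.default_rng(1)
n, d, k, sigma = 400, 500, 10, 1.0
theta = np.zeros(d)
theta[:k] = 1.0
X = rng.normal(size=(n, d))
y = X @ theta + sigma * rng.normal(size=n)

lam = 2 * sigma * np.sqrt(np.log(d) / n)  # standard scaling for the Lasso level
# sklearn's Lasso minimizes (1/2n)||y - Xb||^2 + alpha * ||b||_1
lasso = Lasso(alpha=lam, fit_intercept=False).fit(X, y)

x_aligned = np.sign(theta)                      # sign-aligned, supported on the k active coordinates
x_random = rng.normal(size=d)
x_random *= np.linalg.norm(x_aligned) / np.linalg.norm(x_random)

def direction_error(x):
    """Single-draw prediction error along direction x (bias dominates for x_aligned)."""
    return abs(x @ (lasso.coef_ - theta))

print("error along sign-aligned direction:", direction_error(x_aligned))
print("error along random direction      :", direction_error(x_random))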

3 Upper Bounds for Transductive Prediction

Having established that regularization can lead to excessive prediction bias, we now introduce two classes of estimators which can mitigate this bias using knowledge of the single test direction . While our presentation focuses on the prediction risk (2), which features an expectation over , our proofs in the appendix also provide identical high-probability upper bounds on .

3.1 Javanmard-Montanari (JM)-style Estimator

Our first approach to single point transductive prediction is inspired by the debiased Lasso estimator of javanmard2014confidence, which was designed to construct confidence intervals for individual model parameters . For prediction in the direction, we will consider the following generalization of the Javanmard-Montanari (JM) debiasing construction:

(13)
(14)

Here, is any (ideally -consistent) initial pilot estimate of , like the estimate returned by the Lasso. When the estimator (13) reduces exactly to the program in (javanmard2014confidence), and equivalent generalizations have been used in (chao2014high; athey2018approximate; cai2017confidence) to construct prediction intervals and to estimate treatment effects. Intuitively, approximately inverts the population covariance matrix along the direction defined by (i.e., ). The second term in (13) can be thought of as a high-dimensional one-step correction designed to remove bias from the initial prediction ; see (javanmard2014confidence) for more intuition on this construction. We can now state our primary guarantee for the JM-style estimator (13); the proof is given in Section D.1.
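As a concrete reference, here is a minimal sketch of a JM-style single point prediction in the spirit of (13) and (14), using a Lasso pilot and cvxpy for the direction-specific program. The exact program, the tolerance mu, and the Lasso level below are illustrative assumptions; Theorem 3 and Corollary 3 give the theoretically prescribed settings.

import numpy as np
import cvxpy as cp
from sklearn.linear_model import Lasso

def jm_transductive_predict(X, y, x0, alpha_lasso, mu):
    """Sketch of a JM-style transductive prediction for a single test point x0.

    Solves  min_w  w' Sigma_hat w   s.t.  ||Sigma_hat w - x0||_inf <= mu,
    then debiases the pilot prediction with a one-step correction.
    Settings are illustrative; see (13)-(14) and Theorem 3 for the theory.
    """
    n, d = X.shape
    theta_pilot = Lasso(alpha=alpha_lasso, fit_intercept=False).fit(X, y).coef_

    Sigma_hat = X.T @ X / n
    w = cp.Variable(d)
    # w' Sigma_hat w written as (1/n)||Xw||^2 so cvxpy recognizes convexity directly
    problem = cp.Problem(cp.Minimize(cp.sum_squares(X @ w) / n),
                         [cp.norm(Sigma_hat @ w - x0, "inf") <= mu])
    problem.solve()  # mu must be large enough for the constraint to be feasible

    correction = w.value @ (X.T @ (y - X @ theta_pilot)) / n
    return x0 @ theta_pilot + correction

A call such as jm_transductive_predict(X, y, x0, alpha_lasso=0.1, mu=0.1) returns a scalar prediction; in practice mu must be taken large enough for feasibility, and the scaling analyzed in Theorem 3 should guide its choice.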

Theorem 3.

Suppose Assumptions 1, 2, 3, and 4 hold and that the transductive estimator of (13) is fit with regularization parameter for some . Then there is a universal constant such that if ,

(15)
(16)

for and , the error of the initial estimate. Moreover, if , then . Here the masks constants depending only on .

Intuitively, the first term in our bound creftype 15 can be viewed as the variance of the estimator’s prediction along the direction of while the second term can be thought of as the (reduced) bias of the estimator. We consider the third term to be of higher order since (and in turn ) can be chosen as a large constant. Finally, when the error of the transductive procedure reduces to that of the pilot regression procedure. When the Lasso is used as the pilot regression procedure we can derive the following corollary to Theorem 3, also proved in Section D.2.

Corollary 3.

Recall . Under the conditions of Theorem 3, consider the JM-style estimator (13) with pilot estimate with . If , then there exist universal constants , such that if and ,

(17)
(18)

Here the masks constants depending only on .

We make several remarks to further interpret this result. First, to simplify the presentation of the results (and match the lower bound setting of Theorem 2) consider the setting in Corollary 3 with , , and . Then the upper bound in Theorem 3 can be succinctly stated as . In short, the transductive estimator attains a dimension-free rate for sufficiently large . Under the same conditions the Lasso estimator suffers a prediction error of as Theorem 2 and Corollary 2 establish. Thus transduction guarantees improvement over the Lasso lower bound whenever satisfies the soft sparsity condition . Since is observable, one can selectively deploy transduction based on the soft sparsity level or on bounds thereof. Second, the estimator described in (14) and (13) is transductive in that it is tailored to an individual test point . The corresponding guarantees in Theorem 3 and Corollary 3 embody a computational-statistical tradeoff. In our setting, the detrimental effects of regularization can be mitigated at the cost of extra computation: the convex program in (14) must be solved for each new . Third, the condition is not used for our high-probability error bound and is only used to control the prediction risk (2) on the low-probability event that the (random) design matrix does not satisfy a restricted eigenvalue-like condition. For comparison, note that our Theorem 2 lower bound establishes substantial excess Lasso bias even when .

Finally, we highlight that cai2017confidence have shown that the JM-style estimator with a scaled Lasso base procedure and produces CIs for with minimax rate optimal length when is sparsely loaded. Although our primary focus is on improving the mean-squared prediction risk (2), we conclude this section by showing that a different setting of yields minimax rate optimal CIs for dense and simultaneously minimax rate optimal CIs for sparse and dense when is sufficiently sparse.

Proposition 4.

Under the conditions of Theorem 3 with , consider the JM-style estimator (13) with pilot estimate and . Fix any , and instate the assumptions of cai2017confidence, namely that the vector satisfies and for . Then for the estimator (13) with yields (minimax rate optimal) confidence intervals for of expected length

  • in the dense regime where with (matching the result of (cai2017confidence, Thm. 4)).

  • in the sparse regime of (cai2017confidence, Thm. 1) where if .

Here the masks constants depending only on .
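For readers who want to turn the JM-style output into an interval rather than a point prediction, the snippet below sketches the usual debiased-estimator recipe built from the direction-specific weights w of the earlier sketch. The plug-in variance form and the normal quantile are standard for debiased estimators but are stated here as assumptions; the exact construction and constants analyzed in Proposition 4 and (cai2017confidence) may differ.

import numpy as np
from scipy.stats import norm

def jm_confidence_interval(point_pred, w, Sigma_hat, sigma_hat, n, alpha=0.05):
    """Hypothetical (1 - alpha) interval around a JM-style prediction, using the
    standard debiased-estimator variance proxy sigma^2 * w' Sigma_hat w / n.
    The conditions and constants of Proposition 4 / (cai2017confidence) may differ."""
    half_width = norm.ppf(1 - alpha / 2) * sigma_hat * np.sqrt(w @ Sigma_hat @ w / n)
    return point_pred - half_width, point_pred + half_width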

3.2 Orthogonal Moment (OM) Estimators

Our second approach to single point transductive prediction is inspired by orthogonal moment (OM) estimation (chernozhukov2017double). OM estimators are commonly used to estimate single parameters of interest (like a treatment effect) in the presence of high-dimensional or nonparametric nuisance. To connect our problem to this semiparametric world, we first frame the task of prediction in the direction as one of estimating a single parameter, . Consider the linear model equation (1)

(19)

with a data reparametrization defined by the matrix for so that . Here, the matrix has orthonormal rows which span the subspace orthogonal to – these are obtained as the non- eigenvectors of the projector matrix . This induces the data reparametrization . In the reparametrized basis, the linear model becomes,

(20)
(21)

where we have introduced convenient auxiliary equations in terms of .
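The reparametrization above is simple to implement. The helper below adopts one natural convention, stated here as an assumption since the display is abbreviated: each scalar coordinate t is the projection of x onto x0 scaled by 1/||x0||^2, and z collects the coordinates in the complement, so that the coefficient on t is exactly the target m. The orthonormal basis P is built via a QR factorization for convenience rather than the eigenvector construction described in the text; any orthonormal basis of the same subspace serves.

import numpy as np

def reparametrize(X, x0):
    """Sketch of the (t, z) reparametrization for direction x0.

    Uses the convention t_i = x0' x_i / ||x0||^2 and z_i = P x_i, where the rows
    of P form an orthonormal basis of the subspace orthogonal to x0, so that
    <theta, x_i> = m * t_i + <g, z_i> with m = x0' theta and g = P theta.
    """
    d = X.shape[1]
    x0 = np.asarray(x0, dtype=float)
    # QR of [x0, I] yields an orthonormal basis whose first column spans x0.
    Q, _ = np.linalg.qr(np.column_stack([x0, np.eye(d)]))
    P = Q[:, 1:d].T               # (d-1) x d with orthonormal rows, all orthogonal to x0
    t = X @ x0 / (x0 @ x0)        # scalar coordinate carrying the target m
    Z = X @ P.T                   # nuisance coordinates
    return t, Z, P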

To estimate in the presence of the unknown nuisance parameters , we introduce a thresholded variant of the two-stage method of moments estimator proposed in (chernozhukov2017double). The method of moments takes as input a moment function of both data and parameters that uniquely identifies the target parameter of interest. Our reparametrized model form (21) gives us access to two different Neyman orthogonal moment functions described in (chernozhukov2017double):

moments: (22)
(23)
moments: (24)
(25)

These orthogonal moment equations enable the accurate estimation of a target parameter in the presence of high-dimensional or nonparametric nuisance parameters (in this case and ). We focus our theoretical analysis and presentation on the set of moments since the analysis is similar for the , although we investigate the practical utility of both in Section 4.

Our OM proposal to estimate now proceeds as follows (a code sketch of the full procedure appears after the steps below). We first split our original dataset of points into two disjoint, equal-sized folds and . Then,

  • The first fold is used to run two first-stage regressions. We estimate by linearly regressing onto to produce ; this provides an estimator of as . Second we estimate by regressing onto to produce a regression model . Any arbitrary linear or non-linear regression procedure can be used to fit .

  • Then, we estimate as where the sum is taken over the second fold of data in ; crucially are independent of in this expression.

  • If for a threshold we simply output . If we estimate by solving the empirical moment equation:

    (26)
    (27)

where the sum is taken over the second fold of data in and is defined in (23).
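As promised above, the following sketch makes the procedure concrete under several assumptions of convenience: ridge first-stage regressions, the residual-on-residual (partialling-out) style score from (chernozhukov2017double), a single data split rather than full cross-fitting, and a plug-in fallback when the thresholding condition triggers. The paper's exact moment equations, threshold, and fallback are those of (22)-(27) and Theorem 5, not necessarily the ones below.

import numpy as np
from sklearn.linear_model import Ridge

def om_transductive_predict(X, y, x0, tau=1e-3, alpha_q=1.0, alpha_f=1.0):
    """Sketch of a thresholded, sample-split orthogonal-moment (OM) estimate of
    m = x0' theta (which is also the prediction at x0).  First-stage learners,
    the fallback branch, and the threshold tau are illustrative assumptions."""
    n, d = X.shape
    x0 = np.asarray(x0, dtype=float)

    # Reparametrize: t_i = x0' x_i / ||x0||^2, z_i = P x_i (rows of P orthonormal, orthogonal to x0).
    Q, _ = np.linalg.qr(np.column_stack([x0, np.eye(d)]))
    P = Q[:, 1:d].T
    t, Z = X @ x0 / (x0 @ x0), X @ P.T

    half = n // 2
    idx1, idx2 = np.arange(half), np.arange(half, n)   # two disjoint folds

    # First stage on fold 1: q(z) ~ E[t | z] and f(z) ~ E[y | z].
    q_hat = Ridge(alpha=alpha_q, fit_intercept=False).fit(Z[idx1], t[idx1])
    f_hat = Ridge(alpha=alpha_f, fit_intercept=False).fit(Z[idx1], y[idx1])

    # Second stage on fold 2: residual-on-residual moment solution.
    t_res = t[idx2] - q_hat.predict(Z[idx2])
    y_res = y[idx2] - f_hat.predict(Z[idx2])
    rho_hat = np.mean(t_res ** 2)

    if rho_hat < tau:
        # Fallback (an assumption, not the paper's prescription): plug-in ridge prediction.
        pilot = Ridge(alpha=alpha_f, fit_intercept=False).fit(X[idx1], y[idx1])
        return x0 @ pilot.coef_
    return np.mean(t_res * y_res) / rho_hat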

If we had oracle access to the underlying and , solving the population moment condition for would exactly yield . In practice, we first construct estimates and of the unknown nuisance parameters to serve as surrogates for and and then solve an empirical version of the aforementioned moment condition to extract . A key property of the moments in (23) is their Neyman orthogonality: they satisfy and . Thus the solution of the empirical moment equations is first-order insensitive to errors arising from using in place of and . Data splitting is further used to create independence across the two stages of the procedure. In the context of testing linearly-constrained hypotheses of the parameter , zhu2018linear propose a two-stage OM test statistic based on the transformed moments introduced above; they do not use cross-fitting and specifically employ adaptive Dantzig-like selectors to estimate and . Finally, the thresholding step allows us to control the variance increase that might arise from being too small and thereby enables our non-asymptotic prediction risk bounds. Before presenting the analysis of the OM estimator (27) we introduce another condition:

Assumption 5.

The noise is independent of .

Recall is evaluated on the (independent) second fold data . We now obtain our central guarantee for the OM estimator (proved in Section E.1).

Theorem 5.

Let Assumptions 1, 2, 3, 4, and 5 hold, and assume that in (21) for . Then the thresholded orthogonal ML estimator of (27) with satisfies

(28)
(29)

where and denote the expected prediction errors of the first-stage estimators, and the masks constants depending only on .

Since we are interested in the case where and have small error (i.e., ), the first term in (29) can be interpreted as the variance of the estimator’s prediction along the direction of , while the remaining terms represent the reduced bias of the estimator. We first instantiate this result in the setting where both and are estimated using ridge regression (see Section E.2 for the corresponding proof).

Corollary 4 (OM Ridge).

Assume . In the setting of Theorem 5, suppose and are fit with the ridge estimator with regularization parameters and respectively. Then there exist universal constants such that if , , and for ,

(30)
(31)

where the masks constants depending only on .

Similarly, when and are estimated using the Lasso we conclude the following (proved in Section E.2).

Corollary 5 (OM Lasso).

In the setting of Theorem 5, suppose and are fit with the Lasso with regularization parameters and respectively. If , , and , then there exist universal constants such that if , then for ,

(32)
(33)

where the masks constants depending only on .

We make several comments regarding the aforementioned results. First, Theorem 5 possesses a double-robustness property. In order for the dominant bias term to be small, it is sufficient for either or to be estimated at a fast rate or both to be estimated at a slow rate. As before, the estimator is transductive and adapted to predicting along the direction . Second, in the case of ridge regression, to match the lower bound of Corollary 1, consider the setting where , , and . Then, the upper bound can be simplified to . By contrast, Corollary 1 shows the error of the optimally-tuned ridge estimator is lower bounded by