Conformal Prediction Under Covariate Shift

Rina Foygel Barber    Emmanuel J. Candès    Aaditya Ramdas    Ryan J. Tibshirani
Abstract

We extend conformal prediction methodology beyond the case of exchangeable data. In particular, we show that a weighted version of conformal prediction can be used to compute distribution-free prediction intervals for problems in which the test and training covariate distributions differ, but the likelihood ratio between these two distributions is known—or, in practice, can be estimated accurately with access to a large set of unlabeled data (test covariate points). Our weighted extension of conformal prediction also applies more generally, to settings in which the data satisfies a certain weighted notion of exchangeability. We discuss other potential applications of our new conformal methodology, including latent variable and missing data problems.

1 Introduction

Let $(X_i, Y_i) \in \mathbb{R}^d \times \mathbb{R}$, $i = 1, \dots, n$ denote training data that is assumed to be i.i.d. from an arbitrary distribution $P$. Given a desired coverage rate $1 - \alpha \in (0, 1)$, consider the problem of constructing a band $\hat{C}_n$, mapping each $x \in \mathbb{R}^d$ to a subset $\hat{C}_n(x) \subseteq \mathbb{R}$, based on the training data, such that, for a new i.i.d. point $(X_{n+1}, Y_{n+1})$,

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}_n(X_{n+1})\big\} \ge 1 - \alpha, \qquad (1)$$

where this probability is taken over all $n + 1$ points $(X_i, Y_i)$, $i = 1, \dots, n+1$ (the $n$ training points and the test point). Crucially, we will require (1) to hold with no assumptions whatsoever on the common distribution $P$.

Conformal prediction, a framework pioneered by Vladimir Vovk and colleagues in the 1990s, provides a means for achieving this goal, relying only on exchangeability of the training and test data. The definitive reference is the book by Vovk et al. (2005); see also Shafer and Vovk (2008); Vovk et al. (2009); Vovk (2013); Burnaev and Vovk (2014), and http://www.alrw.net for an often-updated list of conformal prediction work by Vovk and colleagues. Moreover, we refer to Lei and Wasserman (2014); Lei et al. (2018) for recent developments in the areas of nonparametric and high-dimensional regression. In this work, we extend conformal prediction beyond the setting of exchangeable data, allowing for provably valid inference even when the training and test data are not drawn from the same distribution. We begin by reviewing the basics of conformal prediction, in this section. In Section 2, we describe an extension of conformal prediction to the setting of covariate shift, and give supporting empirical results. In Section 3, we cover the mathematical details behind our conformal extension. We conclude in Section 4, and discuss several other possible applications of our new conformal methodology.

1.1 Quantile lemma

Before explaining the basic ideas behind conformal inference (i.e., conformal prediction; we will use these two terms interchangeably), we introduce some notation. We denote by $\mathrm{Quantile}(\beta; F)$ the level $\beta$ quantile of a distribution $F$, i.e., for $Z \sim F$,

$$\mathrm{Quantile}(\beta; F) = \inf\big\{z : \mathbb{P}\{Z \le z\} \ge \beta\big\}.$$

In our use of quantiles, we will allow for distributions $F$ on the augmented real line, $\mathbb{R} \cup \{\infty\}$. For values $v_1, \dots, v_n$, we write $v_{1:n} = \{v_1, \dots, v_n\}$ to denote their multiset. Note that this is unordered, and allows for multiple instances of the same element; thus in the present case, if $v_i = v_j$ for some $i \ne j$, then this value appears twice in $v_{1:n}$. To denote quantiles of the empirical distribution of the values $v_{1:n}$, we abbreviate

$$\mathrm{Quantile}(\beta; v_{1:n}) = \mathrm{Quantile}\Big(\beta;\; \tfrac{1}{n}\textstyle\sum_{i=1}^n \delta_{v_i}\Big),$$

where $\delta_a$ denotes a point mass at $a$ (i.e., the distribution that places all mass at the value $a$). The next result is a simple but key component underlying conformal prediction.
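To make this notation concrete, here is a minimal sketch in Python (a hypothetical helper; the paper's own code, linked in Section 2.3, is in R) of the level $\beta$ quantile of an empirical distribution, including the augmented case with a point mass at $+\infty$:

```python
import numpy as np

def empirical_quantile(beta, values):
    """Quantile(beta; v_{1:n}): the smallest support point v whose
    empirical CDF value #{i : v_i <= v}/n is at least beta."""
    v = np.sort(np.asarray(values, dtype=float))
    n = len(v)
    k = int(np.ceil(beta * n))          # rank of the level-beta quantile
    return v[min(max(k, 1), n) - 1]

# Augmenting with a point mass at +infinity, as used throughout:
q = empirical_quantile(0.9, [3.0, 1.0, 2.0] + [np.inf])
```

With $\beta = 0.9$ and only $n = 4$ augmented points, the quantile lands on the $\lceil 0.9 \cdot 4 \rceil = 4$th order statistic, which is $+\infty$ here; this is exactly the conservatism that the $\{\infty\}$ augmentation builds into the conformal constructions below.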

Lemma 1.

If $V_1, \dots, V_{n+1}$ are exchangeable random variables, then for any $\beta \in (0, 1)$, we have

$$\mathbb{P}\big\{V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:n} \cup \{\infty\}\big)\big\} \ge \beta.$$

Furthermore, if ties between $V_1, \dots, V_{n+1}$ occur with probability zero, then the above probability is upper bounded by $\beta + \tfrac{1}{n+1}$.

Proof.

Consider the following useful fact about quantiles of a discrete distribution $F$, with support points $v_1, \dots, v_m$: denoting $q = \mathrm{Quantile}(\beta; F)$, if we reassign the support points that are strictly larger than $q$ to arbitrary values still strictly larger than $q$, yielding a new distribution $F'$, then the level $\beta$ quantile remains unchanged, $\mathrm{Quantile}(\beta; F') = q$. Using this fact (with the point $V_{n+1}$ reassigned to $\infty$),

$$V_{n+1} > \mathrm{Quantile}\big(\beta;\; V_{1:(n+1)}\big) \iff V_{n+1} > \mathrm{Quantile}\big(\beta;\; V_{1:n} \cup \{\infty\}\big),$$

or equivalently,

$$V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:(n+1)}\big) \iff V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:n} \cup \{\infty\}\big). \qquad (2)$$

Moreover, it is straightforward to check that

$$V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:(n+1)}\big) \iff V_{n+1} \text{ is among the } \lceil \beta(n+1) \rceil \text{ smallest of } V_1, \dots, V_{n+1}.$$

By exchangeability, the latter event occurs with probability at least $\lceil \beta(n+1) \rceil / (n+1) \ge \beta$, which proves the lower bound; when there are almost surely no ties, it holds with probability exactly $\lceil \beta(n+1) \rceil / (n+1) \le \beta + \tfrac{1}{n+1}$, which proves the upper bound. ∎
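The claim of Lemma 1 is easy to check numerically. The following sketch (our own illustration, not from the paper) uses i.i.d. Gaussian scores, which are in particular exchangeable, with $\beta = 0.8$ and $n = 19$; since ties occur with probability zero here, the event should hold with probability exactly $\lceil \beta(n+1) \rceil / (n+1) = 16/20 = 0.8$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, trials = 19, 0.8, 20000
hits = 0
for _ in range(trials):
    v = rng.standard_normal(n + 1)            # i.i.d., hence exchangeable, scores
    aug = np.sort(np.append(v[:n], np.inf))   # V_{1:n} together with +infinity
    k = int(np.ceil(beta * (n + 1)))          # rank of the level-beta quantile
    hits += v[n] <= aug[k - 1]
coverage = hits / trials                      # should concentrate near 0.8
```

Over many trials the empirical frequency settles between the lower bound $\beta$ and the upper bound $\beta + 1/(n+1) = 0.85$, as the lemma predicts.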

1.2 Conformal prediction

We now return to the regression setting. Denote $Z_i = (X_i, Y_i)$, $i = 1, \dots, n+1$, and abbreviate $Z_{1:n} = \{Z_1, \dots, Z_n\}$. In what follows, we describe the construction of a prediction band $\hat{C}_n$ satisfying (1), using conformal inference, due to Vovk et al. (2005). We first choose a score function $\mathcal{S}$, whose arguments consist of a point $(x, y) \in \mathbb{R}^d \times \mathbb{R}$, and a multiset $D$. (Footnote: We emphasize that by defining the second argument of $\mathcal{S}$ to be a multiset, we are treating its points as unordered. Hence, to be perfectly explicit, the score function cannot accept the points in $D$ in any particular order, and it must take them in as unordered. The same is true of the base algorithm $\mathcal{A}$ used to define the fitted regression function $\hat{\mu}$, in the choice of absolute residual score function (3).) Informally, a low value of $\mathcal{S}((x, y), D)$ indicates that the point $(x, y)$ “conforms” to $D$, whereas a high value indicates that $(x, y)$ is atypical relative to the points in $D$. For example, we might choose to define $\mathcal{S}$ by

$$\mathcal{S}\big((x, y), D\big) = |y - \hat{\mu}(x)|, \qquad (3)$$

where $\hat{\mu}$ is a regression function that was fitted by running a base algorithm $\mathcal{A}$ on the multiset $D$.

Next, at each $x \in \mathbb{R}^d$, we define the conformal prediction interval (Footnote: For convenience, throughout, we will refer to $\hat{C}_n(x)$ as an “interval”, even though this may actually be a union of multiple nonoverlapping intervals. Similarly, for simplicity, we will refer to $\hat{C}_n$ as a “band”.) by repeating the following procedure for each $y \in \mathbb{R}$. We calculate the nonconformity scores

$$V_i^{(x,y)} = \mathcal{S}\big(Z_i,\; Z_{1:n} \cup \{(x, y)\}\big), \quad i = 1, \dots, n, \qquad V_{n+1}^{(x,y)} = \mathcal{S}\big((x, y),\; Z_{1:n} \cup \{(x, y)\}\big), \qquad (4)$$

and include $y$ in our prediction interval $\hat{C}_n(x)$ if

$$V_{n+1}^{(x,y)} \le \mathrm{Quantile}\big(1 - \alpha;\; V_{1:n}^{(x,y)} \cup \{\infty\}\big),$$

where $V_{1:n}^{(x,y)} = \{V_1^{(x,y)}, \dots, V_n^{(x,y)}\}$. Importantly, the symmetry in the construction of the nonconformity scores (4) guarantees exact coverage in finite samples. The next theorem summarizes this coverage result. The lower bound is a standard result in conformal inference, due to Vovk et al. (2005); the upper bound, as far as we know, was first pointed out by Lei et al. (2018).

Theorem 1 (Vovk et al. 2005; Lei et al. 2018).

Assume that $Z_i = (X_i, Y_i)$, $i = 1, \dots, n+1$ are exchangeable. For any score function $\mathcal{S}$, and any $\alpha \in (0, 1)$, define the conformal band (based on the first $n$ samples) at $x \in \mathbb{R}^d$ by

$$\hat{C}_n(x) = \big\{y \in \mathbb{R} :\; V_{n+1}^{(x,y)} \le \mathrm{Quantile}\big(1 - \alpha;\; V_{1:n}^{(x,y)} \cup \{\infty\}\big)\big\}, \qquad (5)$$

where $V_i^{(x,y)}$, $i = 1, \dots, n+1$ are as defined in (4). Then $\hat{C}_n$ satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}_n(X_{n+1})\big\} \ge 1 - \alpha.$$

Furthermore, if ties between $V_1^{(x,y)}, \dots, V_{n+1}^{(x,y)}$ occur with probability zero, then this probability is upper bounded by $1 - \alpha + \tfrac{1}{n+1}$.

Proof.

To lighten notation, abbreviate $V_i = V_i^{(X_{n+1}, Y_{n+1})}$, $i = 1, \dots, n+1$. Observe

$$Y_{n+1} \in \hat{C}_n(X_{n+1}) \iff V_{n+1} \le \mathrm{Quantile}\big(1 - \alpha;\; V_{1:n} \cup \{\infty\}\big).$$

By the symmetric construction of the nonconformity scores in (4), the sequence $\big(V_{\sigma(1)}, \dots, V_{\sigma(n+1)}\big)$ is exactly the score sequence computed from the permuted data $\big(Z_{\sigma(1)}, \dots, Z_{\sigma(n+1)}\big)$, for any permutation $\sigma$ of the numbers $1, \dots, n+1$. Therefore, as $Z_1, \dots, Z_{n+1}$ are exchangeable, so are $V_1, \dots, V_{n+1}$, and applying Lemma 1 (with $\beta = 1 - \alpha$) gives the result. ∎

Remark 1.

Theorem 1 is stated for exchangeable samples $Z_i$, $i = 1, \dots, n+1$, which is (significantly) weaker than assuming i.i.d. samples. As we will see in what follows, it is furthermore possible to relax the exchangeability assumption, under an appropriate modification to the conformal procedure.

Remark 2.

If we use an appropriate random tie-breaking rule (to determine the rank of $V_{n+1}^{(x,y)}$ among $V_1^{(x,y)}, \dots, V_{n+1}^{(x,y)}$), then the upper bounds in Lemma 1 and Theorem 1 hold in general (without assuming there are no ties almost surely).

The result in Theorem 1, albeit very simple to prove, is quite remarkable. It gives us a recipe for distribution-free prediction intervals, with nearly exact coverage, starting from an arbitrary score function $\mathcal{S}$; e.g., absolute residuals with respect to a fitted regression function from any base algorithm $\mathcal{A}$, as in (3). For more discussion of conformal prediction, its properties, and its variants, we refer to Vovk et al. (2005); Lei et al. (2018) and references therein.
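To illustrate the full conformal recipe of Theorem 1, the following sketch scans a grid of candidate responses $y$ and keeps those whose score falls below the quantile in (5). The base algorithm here is deliberately trivial (the sample mean of the augmented responses, ignoring $x$ entirely), chosen only because it keeps the fit symmetric in its inputs and makes refitting cheap; it is our toy stand-in, not the paper's choice, and any symmetric regression method could be substituted:

```python
import numpy as np

def full_conformal_interval(y_train, y_grid, alpha=0.1):
    """Toy version of the conformal band (5) with absolute-residual
    scores (3): the 'fitted regression function' is the mean of the
    augmented responses, refit for every trial value y0."""
    n = len(y_train)
    k = int(np.ceil((1 - alpha) * (n + 1)))   # quantile rank over n+1 points
    accepted = []
    for y0 in y_grid:
        y_aug = np.append(y_train, y0)        # augment training data with y0
        mu = y_aug.mean()                     # symmetric base algorithm
        scores = np.abs(y_aug - mu)           # V_i^{(x, y0)}, i = 1, ..., n+1
        q = np.sort(np.append(scores[:n], np.inf))[k - 1]
        if scores[n] <= q:
            accepted.append(y0)
    return min(accepted), max(accepted)

rng = np.random.default_rng(0)
y = rng.normal(loc=5.0, scale=1.0, size=50)
lo, hi = full_conformal_interval(y, np.linspace(0.0, 10.0, 401))
```

Note the computational pattern: one refit per candidate $y$ value, which is exactly the burden that the split conformal method of Section 2.2 removes.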

2 Covariate shift

In this paper, we are concerned with settings in which the data $Z_i = (X_i, Y_i)$, $i = 1, \dots, n+1$ are no longer exchangeable. Our primary focus will be a setting in which we observe data according to

$$(X_i, Y_i) \overset{\text{i.i.d.}}{\sim} P = P_X \times P_{Y|X}, \quad i = 1, \dots, n, \qquad (X_{n+1}, Y_{n+1}) \sim \widetilde{P} = \widetilde{P}_X \times P_{Y|X}, \text{ independently}. \qquad (6)$$

Notice that the conditional distribution $P_{Y|X}$ of $Y$ given $X$ is assumed to be the same for both the training and test data. Such a setting is often called covariate shift (e.g., see Shimodaira 2000; Quinonero-Candela et al. 2009; see also Remark 4 below for more discussion of this literature). The key realization is the following: if we know the ratio of test to training covariate likelihoods, $\mathrm{d}\widetilde{P}_X / \mathrm{d}P_X$, then we can still perform a modified version of conformal inference, using a quantile of a suitably weighted empirical distribution of nonconformity scores. The next subsection gives the details; following this, we describe a more computationally efficient conformal procedure, and give an empirical demonstration.

2.1 Weighted conformal prediction

In conformal prediction, we compare the value of a nonconformity score at a test point to the empirical distribution of nonconformity scores at the training points. In the covariate shift case, where the covariate distributions in our training and test sets differ, we will now weight each nonconformity score $V_i$ (which measures how well $Z_i$ conforms to the other points) by a probability proportional to the likelihood ratio $w(X_i) = \mathrm{d}\widetilde{P}_X(X_i) / \mathrm{d}P_X(X_i)$. Therefore, we will no longer be interested in the empirical distribution

$$\frac{1}{n+1} \sum_{i=1}^n \delta_{V_i} + \frac{1}{n+1}\, \delta_\infty,$$

as in Theorem 1, but rather, a weighted version

$$\sum_{i=1}^n p_i^w(x)\, \delta_{V_i} + p_{n+1}^w(x)\, \delta_\infty,$$

where the weights are defined by

$$p_i^w(x) = \frac{w(X_i)}{\sum_{j=1}^n w(X_j) + w(x)}, \quad i = 1, \dots, n, \qquad p_{n+1}^w(x) = \frac{w(x)}{\sum_{j=1}^n w(X_j) + w(x)}. \qquad (7)$$

Due to this careful weighting, draws from the weighted discrete distribution above resemble nonconformity scores computed on the test population, and thus, they “look exchangeable” with the nonconformity score at our test point. Our main result below formalizes these claims.
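In code, the weights (7) and the quantile of the resulting weighted discrete distribution can be sketched as follows (Python, with hypothetical helper names):

```python
import numpy as np

def shift_probs(w_train, w_x):
    """Probabilities p_i(x), i = 1, ..., n+1, from (7): proportional to
    w(X_i) for each training point and to w(x) for the test point."""
    w = np.append(w_train, w_x)
    return w / w.sum()

def weighted_quantile(beta, values, probs):
    """Level-beta quantile of the discrete distribution
    sum_i probs[i] * delta_{values[i]}."""
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    cdf = np.cumsum(np.asarray(probs, dtype=float)[order])
    idx = int(np.searchsorted(cdf, beta - 1e-9))  # first index with CDF >= beta
    return v[min(idx, len(v) - 1)]
```

With all weights equal, `shift_probs` returns the uniform probabilities $1/(n+1)$ and `weighted_quantile` reduces to the unweighted empirical quantile used in Theorem 1, so ordinary conformal prediction is recovered as a special case.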

Corollary 1.

Assume data from the model (6). Assume that $\widetilde{P}_X$ is absolutely continuous with respect to $P_X$, and denote $w = \mathrm{d}\widetilde{P}_X / \mathrm{d}P_X$. For any score function $\mathcal{S}$, and any $\alpha \in (0, 1)$, define for $x \in \mathbb{R}^d$,

$$\hat{C}_n(x) = \Big\{y \in \mathbb{R} :\; V_{n+1}^{(x,y)} \le \mathrm{Quantile}\Big(1 - \alpha;\; \sum_{i=1}^n p_i^w(x)\, \delta_{V_i^{(x,y)}} + p_{n+1}^w(x)\, \delta_\infty\Big)\Big\}, \qquad (8)$$

where $V_i^{(x,y)}$, $i = 1, \dots, n+1$ are as in (4), and $p_i^w(x)$, $i = 1, \dots, n+1$ are as in (7). Then $\hat{C}_n$ satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}_n(X_{n+1})\big\} \ge 1 - \alpha.$$

Corollary 1 is a special case of a more general result that we present later in Theorem 2, which extends conformal inference to a setting in which the data are what we call weighted exchangeable. The proof is given in Section 3.5.

Remark 3.

The same result as in Corollary 1 holds if $w \propto \mathrm{d}\widetilde{P}_X / \mathrm{d}P_X$, i.e., if the likelihood ratio is known only up to an unknown normalization constant, because this constant cancels out in the calculation of the probabilities in (7).

Remark 4.

Though the basic premise of covariate shift, and certainly the techniques employed in addressing it, are related to much older ideas in statistics, the specific setup (6) has recently generated great interest in machine learning: e.g., see Sugiyama and Muller (2005); Sugiyama et al. (2007); Quinonero-Candela et al. (2009); Agarwal et al. (2011); Wen et al. (2014); Reddi et al. (2015); Chen et al. (2016) and references therein. The focus is usually on correcting estimators, or on model evaluation or model selection approaches, to account for covariate shift. Correcting distribution-free prediction intervals, as we examine in this work, is (as far as we know) a new contribution. As one might expect, the likelihood ratio $\mathrm{d}\widetilde{P}_X / \mathrm{d}P_X$, a key component of our conformal construction in Corollary 1, also plays a critical role in much of the literature on covariate shift.

2.2 Weighted split conformal

In general, constructing a conformal prediction band can be computationally intensive, though this of course depends on the choice of score function. Consider the use of absolute residuals as in (3). To compute the nonconformity scores in (4), we must first run our base algorithm $\mathcal{A}$ on the augmented data set $Z_{1:n} \cup \{(x, y)\}$ to produce a fitted regression function $\hat{\mu}$, and then calculate

$$V_i^{(x,y)} = |Y_i - \hat{\mu}(X_i)|, \quad i = 1, \dots, n, \qquad V_{n+1}^{(x,y)} = |y - \hat{\mu}(x)|.$$

As the formation of the conformal set in (5) (ordinary case) or (8) (covariate shift case) requires us to do this for each $x \in \mathbb{R}^d$ and $y \in \mathbb{R}$ (which requires refitting $\hat{\mu}$ each time), this can clearly become computationally burdensome.

A fast alternative, known as split conformal prediction (Papadopoulos et al., 2002; Lei et al., 2015), resolves this issue by taking the score function to be defined using absolute residuals with respect to a fixed regression function, typically one that has been trained on a preliminary data set. Denote by $D_0$ this preliminary data set, used for fitting the regression function $\hat{\mu}$, and consider the score function

$$\mathcal{S}\big((x, y), D\big) = |y - \hat{\mu}(x)|.$$

Given data $Z_i = (X_i, Y_i)$, $i = 1, \dots, n$, independent of $D_0$, we calculate the residuals

$$R_i = |Y_i - \hat{\mu}(X_i)|, \quad i = 1, \dots, n.$$

The conformal prediction interval (5), defined at a point $x \in \mathbb{R}^d$, reduces to

$$\hat{C}_n(x) = \hat{\mu}(x) \pm \mathrm{Quantile}\big(1 - \alpha;\; R_{1:n} \cup \{\infty\}\big), \qquad (9)$$

and by Theorem 1 it has coverage at least $1 - \alpha$, conditional on $D_0$. This coverage result holds because, when we treat $\hat{\mu}$ as fixed (meaning, condition on $D_0$), the scores $R_i = |Y_i - \hat{\mu}(X_i)|$ are exchangeable for $i = 1, \dots, n+1$, as $Z_1, \dots, Z_{n+1}$ are.

As split conformal prediction can be seen as a special case of conformal prediction, in which the regression function is treated as fixed, Corollary 1 also applies to the split scenario, and guarantees that the band defined for $x \in \mathbb{R}^d$ by

$$\hat{C}_n(x) = \hat{\mu}(x) \pm \mathrm{Quantile}\Big(1 - \alpha;\; \sum_{i=1}^n p_i^w(x)\, \delta_{R_i} + p_{n+1}^w(x)\, \delta_\infty\Big), \qquad (10)$$

where the probabilities $p_i^w(x)$ are as in (7), has coverage at least $1 - \alpha$, conditional on $D_0$.
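Putting the pieces together, the weighted split conformal interval (10) takes a prefitted value $\hat{\mu}(x)$, the holdout residuals, and the (possibly unnormalized, per Remark 3) weights. A self-contained sketch, with hypothetical names:

```python
import numpy as np

def weighted_split_interval(mu_x, residuals, w_train, w_x, alpha=0.1):
    """Weighted split conformal interval (10) at a point x: mu(x) plus/minus
    the level-(1 - alpha) quantile of the weighted residual distribution,
    with the test point's probability mass placed on +infinity."""
    w = np.append(np.asarray(w_train, dtype=float), w_x)
    p = w / w.sum()                                # probabilities (7)
    v = np.append(np.asarray(residuals, dtype=float), np.inf)
    order = np.argsort(v)
    v, p = v[order], p[order]
    cdf = np.cumsum(p)
    q = v[int(np.searchsorted(cdf, 1.0 - alpha - 1e-9))]
    return mu_x - q, mu_x + q

# Sanity check: with constant weights this reduces to interval (9).
lo, hi = weighted_split_interval(0.0, np.arange(1.0, 10.0), np.ones(9), 1.0)
```

Note that when the test point's weight dominates (i.e., $p_{n+1}^w(x)$ is large), the quantile lands on the $\infty$ atom and the interval is infinite, which reflects the honest loss of information at covariate values rarely seen in training.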

2.3 Airfoil data example

We demonstrate the use of conformal prediction in the covariate shift setting in an empirical example. We consider the airfoil data set from the UCI Machine Learning Repository (Dua and Graff, 2019), which has 1503 observations of a response $Y$ (scaled sound pressure level of NASA airfoils), and a covariate $X$ with $d = 5$ dimensions (log frequency, angle of attack, chord length, free-stream velocity, and suction side log displacement thickness). For efficiency, we use the split conformal prediction methods in (9) and (10) for the unweighted and weighted cases, respectively. R code to reproduce these simulation results can be found at http://www.github.com/ryantibs/conformal/.

Creating training data, test data, and covariate shift.

We repeated an experiment for 5000 trials, where for each trial we randomly partitioned the data into three sets, and also constructed a covariate shift test set, with the following roles.

  • The first set, containing 25% of the data, is used to prefit a regression function $\hat{\mu}$ via linear regression, to be used in constructing split conformal intervals.

  • The second set, containing 25% of the data, is our training set, i.e., the points $(X_i, Y_i)$, $i = 1, \dots, n$, used to compute the residual quantiles in the construction of the split conformal prediction intervals (9) and (10).

  • The third set, containing 50% of the data, is our test set (as these data points are exchangeable with those in the training set, there is no covariate shift in this test set).

  • The fourth set is a second test set, constructed to simulate covariate shift, by sampling 25% of the points from the test set with replacement, with probabilities proportional to

    $$w(x) = \exp(x^\top \beta). \qquad (11)$$

As the original data points can be seen as draws from the same underlying distribution, we can view $w$ as the likelihood ratio of covariate distributions between the shifted test set and the training set. Note that the test covariate distribution $\widetilde{P}_X$, which satisfies $\mathrm{d}\widetilde{P}_X / \mathrm{d}P_X \propto w$ as we have defined it here, is called an exponential tilting of the training covariate distribution $P_X$. Figure 1 visualizes the effect of this exponential tilting in the airfoil data set, with our choice of tilting parameter $\beta$. Only the 1st and 5th dimensions of the covariate distribution are tilted; the bottom row of Figure 1 plots the marginal densities of the 1st and 5th covariates (estimated via kernel smoothing) before and after the tilt. The top row plots the response versus the 1st and 5th covariates, simply to highlight the fact that there is heteroskedasticity, and thus we might expect the shift in the covariate distribution to have some effect on the validity of the ordinary conformal prediction intervals.
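The resampling scheme can be sketched as follows. The covariates and the tilt direction here are synthetic stand-ins (the paper's actual tilt parameter is not reproduced; we only mimic its stated property of touching just the 1st and 5th coordinates):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))                 # stand-in covariate matrix
beta_tilt = np.array([-1.0, 0.0, 0.0, 0.0, 1.0])  # illustrative tilt: only the
                                                  # 1st and 5th coordinates shift
w = np.exp(X @ beta_tilt)                         # tilting weights w(x) = exp(x'beta)
idx = rng.choice(len(X), size=len(X) // 4,        # 25% of points, with replacement
                 replace=True, p=w / w.sum())
X_shift = X[idx]                                  # simulated covariate-shifted test set
```

With this sign pattern, the resampled set has its first coordinate shifted downward and its fifth coordinate shifted upward relative to the original sample, which is the qualitative effect displayed in the bottom row of Figure 1.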

Figure 1: The top row plots the response in (a randomly chosen half of) the airfoil data set, versus the 1st and 5th covariates. The bottom row plots kernel density estimates for the 1st and 5th covariates, in black. Also displayed are kernel density estimates for the 1st and 5th covariates after exponential tilting (11), in blue.

Loss of coverage of ordinary conformal prediction under covariate shift.

First, we examine the performance of ordinary split conformal prediction (9). The nominal coverage level was set to be 90% (meaning $\alpha = 0.1$), here and throughout. The results are shown in the top row of Figure 2. In each of the 5000 trials, we computed the empirical coverage from the split conformal intervals over points in the test sets, and the histograms show the distribution of these empirical coverages over the trials. We see that for the original test data (no covariate shift, shown in red), split conformal works as expected, with the average of the empirical coverages (over the 5000 trials) being 90.2%; but for the nonuniformly subsampled test data (covariate shift, in blue), split conformal considerably undercovers, with its average coverage being 82.2%.

Figure 2: Empirical coverages of conformal prediction intervals, computed using 5000 different random splits of the airfoil data set. The averages of empirical coverages in each histogram are marked on the x-axis.

Coverage of weighted conformal prediction with oracle weights.

Next, displayed in the middle row of Figure 2, we consider weighted split conformal prediction (10), to cover the points in the shifted test set (shown in orange). At the moment, we assume oracle knowledge of the true weight function $w$ in (11) needed to calculate the probabilities in (7). We see that this brings the coverage back to the desired level, with the average coverage being 90.8%. However, the histogram is more dispersed than it is when there is no covariate shift (compare to the top row, in red). This is because, by using a quantile of the weighted empirical distribution of nonconformity scores, we are relying on a reduced “effective sample size”. Given training points $X_1, \dots, X_n$, and a likelihood ratio $w$ of test to training covariate distributions, a popular heuristic formula from the covariate shift literature for the effective sample size is (Gretton et al., 2009; Reddi et al., 2015):

$$\hat{n} = \frac{\big\|w(X_{1:n})\big\|_1^2}{\big\|w(X_{1:n})\big\|_2^2} = \frac{\big[\sum_{i=1}^n |w(X_i)|\big]^2}{\sum_{i=1}^n w(X_i)^2},$$

where we abbreviate $w(X_{1:n}) = (w(X_1), \dots, w(X_n))$. To compare weighted conformal prediction against the unweighted method at the same effective sample size, in each trial, we ran unweighted split conformal on the original test set, but we used only $\hat{n}$ subsampled points from the training set to compute the quantile of nonconformity scores, when constructing the prediction interval (9). The results (the middle row of Figure 2, in purple) line up very closely with those from weighted conformal, which demonstrates that the apparent overdispersion in the coverage histogram from the latter is fully explained by the reduced effective sample size.
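The effective sample size heuristic reads, in code:

```python
import numpy as np

def effective_sample_size(w):
    """Heuristic n_hat = ||w||_1^2 / ||w||_2^2 for a vector of weights
    w = (w(X_1), ..., w(X_n)); equals n for constant weights and
    degrades toward 1 as the weight mass concentrates on few points."""
    w = np.abs(np.asarray(w, dtype=float))
    return w.sum() ** 2 / (w ** 2).sum()
```

For example, constant weights give back the full sample size $n$, while a single dominant weight drives the heuristic toward 1, matching the intuition that only the heavily weighted points effectively contribute to the quantile.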

Figure 3: Median lengths of conformal prediction intervals, computed using 5000 different random splits of the airfoil data set. The averages of median lengths in each histogram are marked on the x-axis.
Figure 4: Empirical coverages and median lengths from conformal prediction, on the airfoil data set, with no covariate shift.

Coverage of weighted conformal with estimated weights.

Denote by $X_1, \dots, X_n$ the covariate points in the training set, and by $\widetilde{X}_1, \dots, \widetilde{X}_m$ the covariate points in the shifted test set. Here we describe how to estimate $w = \mathrm{d}\widetilde{P}_X / \mathrm{d}P_X$, the likelihood ratio of interest, by applying logistic regression or random forests (more generally, any classifier that outputs estimated probabilities of class membership) to the feature-class pairs formed by labeling each training covariate with class $C = 0$ and each test covariate with class $C = 1$. Noting that

$$\frac{\mathrm{d}\widetilde{P}_X}{\mathrm{d}P_X}(x) \propto \frac{\mathbb{P}\{C = 1 \mid X = x\}}{\mathbb{P}\{C = 0 \mid X = x\}},$$

we can take the conditional odds ratio on the right-hand side as an equivalent representation for the oracle weight function (since we only need to know the likelihood ratio up to a proportionality constant, recall Remark 3). Therefore, if $\hat{p}(x)$ is an estimate of $\mathbb{P}\{C = 1 \mid X = x\}$ obtained by fitting a classifier to these feature-class pairs, then we can use

$$\hat{w}(x) = \frac{\hat{p}(x)}{1 - \hat{p}(x)} \qquad (12)$$

as our estimated weight function for the calculation of the probabilities (7), needed for conformal (8) or split conformal (10). There is in fact a sizeable literature on density ratio estimation, and the method just described falls into a class called probabilistic classification approaches; two other classes are based on moment matching, and on the minimization of divergences (e.g., the Kullback-Leibler divergence). For a comprehensive review of these approaches, and supporting theory, see Sugiyama et al. (2012).
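A minimal version of the probabilistic classification approach, written with a from-scratch logistic regression (plain gradient descent) so the sketch stays dependency-free; in practice any classifier with calibrated probabilities can stand in:

```python
import numpy as np

def fit_logistic(X, c, iters=3000, lr=0.5):
    """Logistic regression of class labels c in {0,1} on features X,
    by gradient descent on the average log-loss."""
    Xb = np.hstack([np.ones((len(X), 1)), X])     # prepend an intercept
    theta = np.zeros(Xb.shape[1])
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-Xb @ theta))
        theta -= lr * Xb.T @ (p - c) / len(c)
    return theta

def est_weight(theta, x):
    """Estimated weight (12): the odds p(x)/(1 - p(x)) = exp(theta' [1, x])."""
    return np.exp(theta @ np.append(1.0, x))

# Train-vs-test classification: N(0,1) covariates labeled 0, N(1,1) labeled 1.
rng = np.random.default_rng(2)
X = np.concatenate([rng.normal(0.0, 1.0, 2000),
                    rng.normal(1.0, 1.0, 2000)])[:, None]
c = np.concatenate([np.zeros(2000), np.ones(2000)])
theta = fit_logistic(X, c)
```

For this pair of Gaussians the true log odds is linear in $x$ with slope exactly 1, so the fitted slope should land near 1, and the estimated weight should increase in $x$, mirroring the true density ratio $\exp(x - 1/2)$.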

The bottom row of Figure 2 shows the results from using weighted split conformal prediction to cover the points in the shifted test set, where the weight function has been estimated as in (12), using logistic regression (in gray) and random forests (in green) to fit the class probability function $\mathbb{P}\{C = 1 \mid X = x\}$. (Footnote: In the random forests approach, we clipped the estimated test class probability to lie in between 0.01 and 0.99, to prevent the estimated weight (likelihood ratio) from being infinite. Without clipping, the estimated probability of being in the test class was sometimes exactly 1 (this occurred in about 2% of the cases encountered over all 5000 repetitions), resulting in an infinite weight, and causing numerical issues.) Note that logistic regression is well-specified in this example, because it assumes the log odds of the class probability is a linear function of $x$, which is exactly the case under the exponential tilting (11). Random forests, of course, allows more flexibility in the fitted model. Both classification approaches deliver weights that translate into good average coverage, being 91.0% for each approach. Furthermore, their histograms are only a little more dispersed than that for the oracle weights (middle row, in orange).

Lengths of weighted conformal intervals.

Figure 3 conveys the same setup as Figure 2, but displays histograms of the median lengths of prediction intervals rather than empirical coverages (meaning, in each of the 5000 trials, we ran unweighted or weighted split conformal prediction to cover test points, and report the median length of the prediction intervals over the test sets). We see no differences in the lengths of ordinary split conformal intervals (top row) when there is or is not covariate shift, as expected, since these two settings differ only in the distributions of their test sets, but use the same procedure and have the same distribution of the training data. We see that the oracle-weighted split conformal intervals are longer than the ordinary split conformal intervals that use an equivalent effective sample size (middle row). This is also as expected, since in the former situation, the regression function $\hat{\mu}$ was fit on training data whose covariate distribution differs from that of the test points, and $\hat{\mu}$ itself should ideally be adjusted to account for covariate shift (plenty of methods for this are available from the covariate shift literature, but we left it unadjusted for simplicity). Lastly, we see that the random forests-weighted split conformal intervals are more variable, and in some cases much longer, than the logistic regression-weighted split conformal intervals (bottom row; this is difficult to confirm visually because the bars in the histogram lie so close to the x-axis).

Weighted conformal when there is actually no covariate shift.

Lastly, Figure 4 compares the empirical coverages and median lengths of split conformal intervals to cover points in the original, unshifted test set, using the ordinary unweighted approach (in red), the logistic regression-weighted approach (in gray), and the random forests-weighted approach (in green). The unweighted and logistic regression approaches are very similar. The random forests approach yields slightly more dispersed coverages and lengths. This is because random forests are very flexible, and in the present case of no covariate shift, the estimated weights from random forests in each repetition are in general further from constant (compared to those from logistic regression). Still, random forests must not be overfitting dramatically here, since the coverages and lengths are still reasonable.

3 Weighted exchangeability

In this section, we develop a general result on conformal prediction for settings in which the data satisfy what we call weighted exchangeability. In the first subsection, we take a look back at the key quantile result in Lemma 1, and present an alternative proof from a somewhat different perspective. Then we precisely define weighted exchangeability, extend Lemma 1 to this new (and broader) setting, and extend conformal prediction as well. The last subsection provides a proof of Corollary 1, as a special case of our general conformal result.

3.1 Alternate proof of Lemma 1

The general strategy we pursue here is to condition on the unordered multiset of values attained by our random variables $V_1, \dots, V_{n+1}$, and then inspect the probabilities with which the last random variable $V_{n+1}$ attains each one of these values. For simplicity, we assume that there are almost surely no ties among $V_1, \dots, V_{n+1}$, so that we can work with sets rather than multisets (our arguments apply to the general case as well, but the notation is more cumbersome).

Denote by $E_v$ the event that $\{V_1, \dots, V_{n+1}\} = \{v_1, \dots, v_{n+1}\}$, for distinct values $v_1, \dots, v_{n+1}$, and consider the probabilities

$$\mathbb{P}\big\{V_{n+1} = v_i \,\big|\, E_v\big\}, \quad i = 1, \dots, n+1. \qquad (13)$$

Denote by $f$ the probability density function (Footnote: More generally, $f$ may be the Radon-Nikodym derivative with respect to an arbitrary base measure; this measure may be discrete, continuous, or neither; henceforth, we will use the term “density” just for simplicity.) of the joint distribution of $(V_1, \dots, V_{n+1})$. Exchangeability implies that

$$f(v_{\sigma(1)}, \dots, v_{\sigma(n+1)}) = f(v_1, \dots, v_{n+1})$$

for any permutation $\sigma$ of the numbers $1, \dots, n+1$. Thus, for each $i = 1, \dots, n+1$, we have

$$\mathbb{P}\big\{V_{n+1} = v_i \,\big|\, E_v\big\} = \frac{\sum_{\sigma : \sigma(n+1) = i} f(v_{\sigma(1)}, \dots, v_{\sigma(n+1)})}{\sum_{\sigma} f(v_{\sigma(1)}, \dots, v_{\sigma(n+1)})} = \frac{n!}{(n+1)!} = \frac{1}{n+1}. \qquad (14)$$

This shows that the distribution of $V_{n+1} \mid E_v$ is uniform on the set $\{v_1, \dots, v_{n+1}\}$, i.e.,

$$V_{n+1} \mid E_v \;\sim\; \sum_{i=1}^{n+1} \frac{1}{n+1}\, \delta_{v_i},$$

and it follows immediately that

$$\mathbb{P}\big\{V_{n+1} \le \mathrm{Quantile}\big(\beta;\; v_{1:(n+1)}\big) \,\big|\, E_v\big\} \ge \beta.$$

On the event $E_v$, we have $v_{1:(n+1)} = V_{1:(n+1)}$, so

$$\mathbb{P}\big\{V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:(n+1)}\big) \,\big|\, E_v\big\} \ge \beta.$$

Because this is true for any choice of values $v_1, \dots, v_{n+1}$, we can marginalize to obtain

$$\mathbb{P}\big\{V_{n+1} \le \mathrm{Quantile}\big(\beta;\; V_{1:(n+1)}\big)\big\} \ge \beta,$$

which, as argued in (2), is equivalent to the desired lower bound in the lemma. (The upper bound follows similarly.)

3.2 Weighted exchangeability

The alternate proof in the last subsection is certainly more complicated than our first proof of Lemma 1. What good did it do? Recall, we proceeded as if we had observed the unordered set of nonconformity scores but had forgotten the labels (as if we forgot which random variable $V_i$ was associated with which observed value). We then reduced the construction of a prediction interval to the computation of the probabilities in (13), that the last random variable $V_{n+1}$ attains each one of the observed values. The critical point was that this strategy isolated the role of exchangeability: it was used precisely to compute these probabilities, in (14). As we will show, the same probabilities in (13) can be calculated in a broader setting of interest, beyond the exchangeable one.

We first define a generalized notion of exchangeability.

Definition 1.

Random variables $V_1, \dots, V_n$ are said to be weighted exchangeable, with weight functions $w_1, \dots, w_n$, if the density (Footnote: As before, the density may be the Radon-Nikodym derivative with respect to an arbitrary base measure.) $f$ of their joint distribution can be factorized as

$$f(v_1, \dots, v_n) = \prod_{i=1}^n w_i(v_i) \cdot g(v_1, \dots, v_n),$$

where $g$ is any function that does not depend on the ordering of its inputs, i.e., $g(v_{\sigma(1)}, \dots, v_{\sigma(n)}) = g(v_1, \dots, v_n)$ for any permutation $\sigma$ of $1, \dots, n$.

Clearly, weighted exchangeability with weight functions $w_i \equiv 1$, $i = 1, \dots, n$ reduces to ordinary exchangeability. Furthermore, independent draws (where all marginal distributions are absolutely continuous with respect to, say, the first one) are always weighted exchangeable, with weight functions given by the appropriate Radon-Nikodym derivatives, i.e., likelihood ratios. This is stated next; the proof follows directly from Definition 1 and is omitted.

Lemma 2.

Let $V_i \sim P_i$, $i = 1, \dots, n$, be independent draws, where each $P_i$ is absolutely continuous with respect to $P_1$, for $i \ge 2$. Then $V_1, \dots, V_n$ are weighted exchangeable, with weight functions $w_1 \equiv 1$, and $w_i = \mathrm{d}P_i / \mathrm{d}P_1$, $i \ge 2$.

Lemma 2 highlights an important special case (which, we note, includes the covariate shift model in (6)). But it is worth being clear that our definition of weighted exchangeability encompasses more than independent sampling, and allows for a nontrivial dependency structure between the variables, just as exchangeability is broader than the i.i.d. case.

3.3 Weighted quantile lemma

Now we give a weighted generalization of Lemma 1.

Lemma 3.

Let $Z_1, \dots, Z_{n+1}$ be weighted exchangeable random variables, with weight functions $w_1, \dots, w_{n+1}$. Let $V_i = \mathcal{S}(Z_i, Z_{1:(n+1)})$, where $Z_{1:(n+1)} = \{Z_1, \dots, Z_{n+1}\}$, for $i = 1, \dots, n+1$, and $\mathcal{S}$ is an arbitrary score function. Define

$$p_i^w(z_1, \dots, z_{n+1}) = \frac{\sum_{\sigma : \sigma(n+1) = i}\; \prod_{j=1}^{n+1} w_j(z_{\sigma(j)})}{\sum_{\sigma}\; \prod_{j=1}^{n+1} w_j(z_{\sigma(j)})}, \quad i = 1, \dots, n+1, \qquad (15)$$

where the summations are taken over permutations $\sigma$ of the numbers $1, \dots, n+1$. Then for any $\beta \in (0, 1)$,

$$\mathbb{P}\Big\{V_{n+1} \le \mathrm{Quantile}\Big(\beta;\; \sum_{i=1}^n p_i^w(Z_1, \dots, Z_{n+1})\, \delta_{V_i} + p_{n+1}^w(Z_1, \dots, Z_{n+1})\, \delta_\infty\Big)\Big\} \ge \beta.$$

Proof.

We follow the same general strategy as in the alternate proof of Lemma 1 in Section 3.1. As before, we assume for simplicity that $V_1, \dots, V_{n+1}$ are distinct almost surely (but the result holds in the general case as well).

Let $E_z$ denote the event that $\{Z_1, \dots, Z_{n+1}\} = \{z_1, \dots, z_{n+1}\}$, and let $v_i = \mathcal{S}(z_i, z_{1:(n+1)})$, where $z_{1:(n+1)} = \{z_1, \dots, z_{n+1}\}$, for $i = 1, \dots, n+1$. Let $f$ be the density function of the joint sample $(Z_1, \dots, Z_{n+1})$. For each $i = 1, \dots, n+1$,

$$\mathbb{P}\big\{V_{n+1} = v_i \,\big|\, E_z\big\} = \frac{\sum_{\sigma : \sigma(n+1) = i} f(z_{\sigma(1)}, \dots, z_{\sigma(n+1)})}{\sum_{\sigma} f(z_{\sigma(1)}, \dots, z_{\sigma(n+1)})},$$

and as $Z_1, \dots, Z_{n+1}$ are weighted exchangeable,

$$\frac{\sum_{\sigma : \sigma(n+1) = i} f(z_{\sigma(1)}, \dots, z_{\sigma(n+1)})}{\sum_{\sigma} f(z_{\sigma(1)}, \dots, z_{\sigma(n+1)})} = \frac{\sum_{\sigma : \sigma(n+1) = i} \prod_{j=1}^{n+1} w_j(z_{\sigma(j)}) \cdot g(z_1, \dots, z_{n+1})}{\sum_{\sigma} \prod_{j=1}^{n+1} w_j(z_{\sigma(j)}) \cdot g(z_1, \dots, z_{n+1})} = p_i^w(z_1, \dots, z_{n+1}).$$

In other words,

$$V_{n+1} \mid E_z \;\sim\; \sum_{i=1}^{n+1} p_i^w(z_1, \dots, z_{n+1})\, \delta_{v_i},$$

which implies that

$$\mathbb{P}\Big\{V_{n+1} \le \mathrm{Quantile}\Big(\beta;\; \sum_{i=1}^{n+1} p_i^w(z_1, \dots, z_{n+1})\, \delta_{v_i}\Big) \,\Big|\, E_z\Big\} \ge \beta.$$

This is equivalent to

$$\mathbb{P}\Big\{V_{n+1} \le \mathrm{Quantile}\Big(\beta;\; \sum_{i=1}^{n+1} p_i^w(Z_1, \dots, Z_{n+1})\, \delta_{V_i}\Big) \,\Big|\, E_z\Big\} \ge \beta,$$

and after marginalizing,

$$\mathbb{P}\Big\{V_{n+1} \le \mathrm{Quantile}\Big(\beta;\; \sum_{i=1}^{n+1} p_i^w(Z_1, \dots, Z_{n+1})\, \delta_{V_i}\Big)\Big\} \ge \beta.$$

Finally, as in (2), replacing the atom $\delta_{V_{n+1}}$ by $\delta_\infty$ leaves the event unchanged, so this is equivalent to the claim in the lemma. ∎

Remark 5.

When $Z_1, \dots, Z_{n+1}$ are exchangeable, we have $w_i \equiv 1$ for $i = 1, \dots, n+1$, and thus $p_i^w \equiv \frac{1}{n+1}$ for $i = 1, \dots, n+1$ as well. In this special case, then, the lower bound in Lemma 3 reduces to the ordinary unweighted lower bound in Lemma 1. Meanwhile, obtaining a meaningful upper bound on the probability in question in Lemma 3, as was done in Lemma 1 (under the assumption of no ties, almost surely), does not seem possible without further conditions on the weight functions in consideration. This is because the largest “jump” in the discontinuous cumulative distribution function of $\sum_{i=1}^{n+1} p_i^w\, \delta_{v_i}$ is of size $\max_i p_i^w$, which can potentially be very large; by comparison, in the unweighted case, this jump is always of size $\frac{1}{n+1}$.

3.4 Weighted conformal prediction

A weighted version of conformal prediction follows immediately from Lemma 3.

Theorem 2.

Assume that $Z_i = (X_i, Y_i)$, $i = 1, \dots, n+1$ are weighted exchangeable with weight functions $w_1, \dots, w_{n+1}$. For any score function $\mathcal{S}$, and any $\alpha \in (0, 1)$, define the weighted conformal band (based on the first $n$ samples) at a point $x \in \mathbb{R}^d$ by

$$\hat{C}_n(x) = \Big\{y \in \mathbb{R} :\; V_{n+1}^{(x,y)} \le \mathrm{Quantile}\Big(1 - \alpha;\; \sum_{i=1}^n p_i^w\big(Z_1, \dots, Z_n, (x, y)\big)\, \delta_{V_i^{(x,y)}} + p_{n+1}^w\big(Z_1, \dots, Z_n, (x, y)\big)\, \delta_\infty\Big)\Big\}, \qquad (16)$$

where $V_i^{(x,y)}$, $i = 1, \dots, n+1$ are as in (4), and $p_i^w$, $i = 1, \dots, n+1$ are as in (15). Then $\hat{C}_n$ satisfies

$$\mathbb{P}\big\{Y_{n+1} \in \hat{C}_n(X_{n+1})\big\} \ge 1 - \alpha.$$

Proof.

Abbreviate $V_i = V_i^{(X_{n+1}, Y_{n+1})}$, $i = 1, \dots, n+1$. By construction,

$$Y_{n+1} \in \hat{C}_n(X_{n+1}) \iff V_{n+1} \le \mathrm{Quantile}\Big(1 - \alpha;\; \sum_{i=1}^n p_i^w(Z_1, \dots, Z_{n+1})\, \delta_{V_i} + p_{n+1}^w(Z_1, \dots, Z_{n+1})\, \delta_\infty\Big),$$

and applying Lemma 3 (with $\beta = 1 - \alpha$) gives the result. ∎

Remark 6.

As explained in Section 2.2, the split conformal method is simply a special case of the conformal prediction framework, where we take the score function to be $\mathcal{S}((x, y), D) = |y - \hat{\mu}(x)|$, with $\hat{\mu}$ precomputed on a preliminary data set $D_0$. Hence, the result in Theorem 2 carries over to the split conformal method as well, in which case the weighted conformal prediction interval in (16) simplifies to

$$\hat{C}_n(x) = \hat{\mu}(x) \pm \mathrm{Quantile}\Big(1 - \alpha;\; \sum_{i=1}^n p_i^w\, \delta_{R_i} + p_{n+1}^w\, \delta_\infty\Big),$$

where $R_i = |Y_i - \hat{\mu}(X_i)|$, $i = 1, \dots, n$ are the holdout residuals, and the probabilities $p_i^w$ are as in (15). By Theorem 2, this has coverage at least $1 - \alpha$, conditional on $D_0$.

3.5 Proof of Corollary 1

We return to the case of covariate shift, and show that Corollary 1 follows from the general weighted conformal result. By Lemma 2, we know that the independent draws $Z_i = (X_i, Y_i)$, $i = 1, \dots, n+1$ in the model (6) are weighted exchangeable, with weight functions $w_i(x, y) = 1$ for $i = 1, \dots, n$, and $w_{n+1}(x, y) = w(x) = \mathrm{d}\widetilde{P}_X(x) / \mathrm{d}P_X(x)$. In this special case, the probabilities in (15) simplify to

$$p_i^w(z_1, \dots, z_{n+1}) = \frac{\sum_{\sigma : \sigma(n+1) = i} w(x_{\sigma(n+1)})}{\sum_{\sigma} w(x_{\sigma(n+1)})} = \frac{n!\, w(x_i)}{n! \sum_{j=1}^{n+1} w(x_j)} = \frac{w(x_i)}{\sum_{j=1}^{n+1} w(x_j)}, \quad i = 1, \dots, n+1,$$

where $z_i = (x_i, y_i)$; in other words, $p_i^w(Z_1, \dots, Z_n, (x, y)) = p_i^w(x)$, $i = 1, \dots, n+1$, where the latter are as in (7). Applying Theorem 2 gives the result.
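This simplification is easy to verify numerically for small $n$: summing over all permutations in (15) with constant weight functions $w_1 = \dots = w_n \equiv 1$ collapses to the closed form in (7). A brute-force sketch:

```python
import numpy as np
from itertools import permutations

def perm_probs(w_vals):
    """Brute-force evaluation of (15) when w_1 = ... = w_n = 1 and only
    w_{n+1} = w is nonconstant: each permutation sigma contributes
    prod_j w_j(z_sigma(j)) = w(z_{sigma(n+1)})."""
    m = len(w_vals)                         # m = n + 1 points
    totals = np.zeros(m)
    for sigma in permutations(range(m)):
        totals[sigma[-1]] += w_vals[sigma[-1]]
    return totals / totals.sum()

w_vals = np.array([0.5, 1.0, 2.0, 4.0])     # w evaluated at z_1, ..., z_{n+1}
p_brute = perm_probs(w_vals)
p_direct = w_vals / w_vals.sum()            # the closed form (7)
```

Each of the $n!$ permutations fixing $\sigma(n+1) = i$ contributes the same factor $w(x_i)$, so the $n!$ terms cancel in the ratio, which is exactly the algebra in the display above.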

4 Discussion

We described an extension of conformal prediction to handle weighted exchangeable data. This covers exchangeable data, and independent (but not identically distributed) data, as special cases. In general, the new weighted methodology requires computing the weights (15), which involve sums over permutations and are therefore combinatorially hard to evaluate. But the computations simplify dramatically for a case of significant practical interest, in which the test covariate distribution differs from the training covariate distribution by a known likelihood ratio (and the conditional distribution of response given covariates remains unchanged). In this case, known as covariate shift, the new weighted conformal prediction methodology is just as easy, computationally, as ordinary conformal prediction. When the likelihood ratio