Predicting with Proxies
Hamsa Bastani
Wharton School, Operations Information and Decisions, hamsab@wharton.upenn.edu
Predictive analytics is increasingly used to guide decision-making in many applications. However, in practice, we often have limited data on the true predictive task of interest, but copious data on a closely-related proxy predictive task. Practitioners often train predictive models on proxies because doing so achieves more accurate predictions. For example, e-commerce platforms use abundant customer click data (proxy) to make product recommendations rather than the relatively sparse customer purchase data (true outcome of interest); alternatively, hospitals often rely on medical risk scores trained on a different patient population (proxy) rather than their own patient population (true cohort of interest) to assign interventions. However, not accounting for the bias in the proxy can lead to suboptimal decisions. Using real datasets, we find that this bias can often be captured by a sparse function of the features. Thus, we propose a novel two-step estimator that uses techniques from high-dimensional statistics to efficiently combine a large amount of proxy data and a small amount of true data. We prove upper bounds on the error of our proposed estimator and lower bounds on several heuristics commonly used by data scientists; in particular, our proposed estimator can achieve the same accuracy with exponentially less true data (in the number of features $d$). Our proof relies on a new tail inequality on the convergence of LASSO for approximately sparse vectors. Finally, we demonstrate the effectiveness of our approach on e-commerce and healthcare datasets; in both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
Key words: proxies, transfer learning, sparsity, high-dimensional statistics, LASSO
History: This paper is under preparation.
Decision-makers increasingly use machine learning and predictive analytics to inform consequential decisions. However, a pervasive problem that occurs in practice is the limited quantity of labeled data available in the desired setting. Building accurate predictive models requires significant quantities of labeled data, but large datasets may be costly or infeasible to obtain for the predictive task of interest. A common solution to this challenge is to rely on a proxy (a closely-related predictive task) for which abundant data is already available. The decision-maker then builds and deploys a model predicting the proxy instead of the true task. To illustrate, consider the following two examples from revenue management and healthcare respectively:
Example 1 (Recommendation Systems)
A core business proposition for platforms (e.g., Expedia or Amazon) is to match customers with personalized product recommendations. The typical goal is to maximize the probability of a customer purchase by recommending products that a customer is most likely to purchase, based on past transaction data and customer purchase histories. Unfortunately, most platforms have sparse data on customer purchases (the true outcome they wish to predict) for a particular product, but significantly more data on customer clicks (a proxy outcome). Clicks are a common proxy for purchases, since one may assume that customers will not click on a product without some intent to purchase. Consequently, platforms often recommend products with high predicted clickthrough rates rather than high predicted purchase rates.
Example 2 (Medical Risk Scoring)
Many hospitals are interested in identifying patients who have high risk for some adverse event (e.g., diabetes, stroke) in order to target preventative interventions. This involves using past electronic medical records to train a patient risk score, i.e., predict which patients are likely to get a positive diagnosis for the adverse event based on data from prior visits. However, small hospitals have limited data since their patient cohorts (true population of interest) are not sizable enough to have had a large number of adverse events. Instead, they adopt a published risk score trained on data from a larger hospital’s patient cohort (proxy population). There are concerns that a predictive model trained at one hospital may not directly apply to a different hospital, since there are differences in physician behavior, patient populations, etc. Yet, one may assume that the large hospital’s risk score is a good proxy for the small hospital’s risk score, since the target of interest (the adverse event) is the same in both models.
There are numerous other examples of the use of proxies in practice. In §id1, we overview the pervasiveness of proxies in healthcare and revenue management.
However, the use of proxies has clear drawbacks: the proxy and true predictive models may not be the same, and any bias between the two tasks will affect the predictive performance of the model. Consider Example 1 on recommendation systems. In §id1 of this paper, we use personalized hotel recommendation data from Expedia to demonstrate a systematic bias between clicks (proxy outcome) and purchases (true outcome). In particular, we find that the price of the recommendation negatively impacts purchases far more than clicks. Intuitively, a customer may not mind browsing expensive travel products, but is unlikely to make an expensive purchase. Thus, using predicted clickthrough rates alone (proxies) to make recommendations could result in overly expensive recommendations, thereby hurting purchase rates. Next, consider Example 2 on medical risk scores. In §id1 of this paper, we use electronic medical record data across several healthcare providers to demonstrate a systematic bias between a diabetes risk predictor trained on patient data from a large external hospital (proxy cohort) and a risk predictor trained on patient data from the small target hospital (true cohort). In particular, we find differences in physician diagnosing behavior (e.g., some physicians are more inclined than others to ask patients to fast in order to diagnose impaired fasting glucose) and how patient chart data is encoded in the medical record (e.g., obesity is recorded as a diagnosis more often in some hospitals despite patients having similar BMIs). As a result, features that are highly predictive in one hospital may not be predictive in another hospital, thereby hurting the performance of a borrowed (proxy) risk predictor at the target hospital.
We refer to data from the proxy and true predictive tasks as proxy and gold data respectively. Analogously, we refer to estimators trained on proxy and gold data alone as the proxy and gold estimators respectively. Both estimators seek to predict outcomes on the true predictive task. From a statistical perspective, the gold estimator is unbiased but has high variance due to its limited sample size. On the other hand, the proxy estimator has low variance due to its large sample size, but may have a significant bias due to systematic differences between the true and proxy predictive tasks. Predictive accuracy is composed of both bias and variance. Thus, when we have a good proxy (the bias is not too large), the proxy estimator can be a much more accurate predictive model than the gold estimator, explaining the wide use of proxies in practice.
An immediate question is: can we combine proxy and gold data to achieve a better bias-variance tradeoff and improve predictive accuracy? In many of these settings, we have access to (or could collect) information from both predictive tasks, i.e., we typically have a large amount of proxy data and a small amount of true data. For instance, platforms observe both clicks and purchases; the target hospital has access to both the published proxy estimator and basic summary statistics from an external hospital, as well as its own patient data. Thus, we have the opportunity to improve prediction by combining these data sources. Conversations with professional data scientists indicate two popular heuristics: (i) model averaging over the gold and proxy estimators, and (ii) training a model on proxy and gold data simultaneously, with a larger weight for gold observations. (Footnote 1: One disadvantage of the weighted loss function is that it requires both proxy and gold data to be available together at the time of training. This may not be possible in settings such as healthcare, where data is sensitive.) However, there is little understanding of whether and by how much these heuristics can improve predictive performance. Indeed, we prove lower bounds showing that both model averaging and weighted loss functions can only improve estimation error by at most a constant factor (beyond the naive proxy and gold estimators discussed earlier). Thus, neither approach can significantly improve estimation error.
Ideally, we would use the gold data to debias the proxy estimator (which already has low variance); this would hopefully yield an estimator with lower bias while maintaining low variance. However, estimating the bias is challenging, as we have extremely limited gold data. In general, estimating the bias from gold data can be harder than directly estimating the true predictive model from gold data. Thus, we clearly need to impose additional structure to make progress.
Our key insight is that the bias between the true and proxy predictive tasks may often be well modeled by a sparse function of the observed features. We argue that there is often some (a priori unknown) underlying mechanism that systematically affects a subset of the features, creating a bias between the true and proxy predictive tasks. When this is the case, we can successfully estimate the bias using highdimensional techniques that exploit sparsity. To illustrate, we return to Examples 1 and 2. In the first example on hotel recommendations, we find on Expedia data that customers tend to click on more expensive products than they are willing to purchase. This creates a bias between the proxy and true predictive tasks that can be captured by the price feature alone. However, as we show in Fig. 2 in §id1 of this paper, the two predictive tasks appear remarkably similar otherwise. In particular, the difference of the proxy and gold estimators on Expedia data is very sparse (nearly all coefficients are negligible with the notable exception of the price coefficient). Similarly, in the second example on diabetes risk prediction, we find on patient data that physicians/coders at different hospitals sometimes diagnose/record different conditions in the electronic medical record. However, the majority of patient data is similarly diagnosed and recorded across hospitals (motivating the common practice of borrowing risk predictors from other hospitals). This creates a bias between the proxy and true predictive tasks that can be captured by the few features corresponding only to the subset of diagnoses where differences arise.
Importantly, in both examples, the proxy and gold estimators themselves are not sparse. Thus, we cannot exploit this structure by directly applying high-dimensional techniques to proxy or gold data separately. Rather, we must efficiently combine proxy and gold data, while exploiting the sparse structure of the bias between the two predictive tasks. Our lower bounds show that popular heuristics (model averaging and weighted loss functions) fail to leverage sparse structure even when it is present, and can still only improve predictive accuracy by at most a constant factor.
We propose a new two-step joint estimator that successfully leverages sparse structure in the bias term to achieve a much stronger improvement in predictive accuracy. In particular, our proposed estimator can achieve the same accuracy with exponentially less gold data (in the number of features $d$). Intuitively, instead of using the limited gold data directly for estimating the predictive model, our estimator uses gold data to efficiently debias the proxy estimator. In fact, when gold data is very limited, the availability of proxy data is critical to extracting value from the gold data. Our proof relies on a new tail inequality on the convergence of LASSO for approximately sparse vectors, which may be of independent interest. It is worth noting that our estimator does not simultaneously require both proxy and gold data at training time; this is an important feature in settings such as healthcare, where data from different sources cannot be combined due to regulatory constraints. We demonstrate the effectiveness of our estimator on both Expedia hotel recommendation (Example 1) and diabetes risk prediction (Example 2). In both cases, we achieve significantly better predictive accuracy, as well as managerial insights into the nature of the bias in the proxy data.
Proxies are especially pervasive in healthcare, where patient covariates and response variables must be derived from electronic medical records (EMRs), which are inevitably biased by the data collection process. One common issue is censoring: we only observe a diagnosis in the EMR if the patient visits the healthcare provider. Thus, the recorded diagnosis code (often used as the response variable) is in fact a proxy for the patient’s true outcome (which may or may not have been recorded). Mullainathan and Obermeyer (2017) and Obermeyer and Lee (2017) demonstrate that this proxy can result in misleading predictive models, arising from systematic biases in the types of patients who frequently visit the healthcare provider. One could collect more reliable (true) outcome data by surveying patients, but this is costly and only scales to a small cohort of patients. Another form of censoring is omitted variable bias: important factors (e.g., physician counseling or a patient’s proactiveness towards their own health) are not explicitly recorded in the medical record. Bastani et al. (2017) show that omitted variable bias arising from unrecorded physician interventions can lead to misleading predictive models trained on EMR data. Again, more reliable (gold) data can be collected by handlabeling patient observations based on physician or nurse notes in the medical chart, but as before, this is costly and unscalable. Recently, researchers have drawn attention to human bias: patient data is collected and recorded by hospital staff (e.g., physicians, medical coders), who may themselves be biased (Ahsen et al. 2018). This is exemplified in our case study (§id1), where we find that medical coders record the obesity diagnosis code in the EMR at very different rates even when patient BMIs are similar. Finally, the specific outcomes of interest may be too rare or have high variance. 
For example, in healthcare pay-for-performance contracts, Medicare uses 30-day hospital readmission rates as proxies for hospital quality of care, which may be better captured by rarer outcomes such as never events or 30-day patient mortality rates (CMS 2018, Axon and Williams 2011, Milstein 2009).
Proxies are also pervasive in marketing and revenue management. Online platforms allow us to observe fine-grained customer behaviors, including page views, clicks, cart-adds, and eventually purchases. While purchases may be the final outcome of interest, these intermediate (and more abundant) observations serve as valuable proxies. For example, Farias and Li (2017) use a variety of customer actions as proxies for predicting a customer’s affinity for a song in a music streaming service. This is also evidenced in our case study (§id1), where customer clicks can signal the likelihood of customer hotel purchases. With modern technology, companies can also observe customers’ offline behavior, including store visits (using mobile WiFi signal tracking, e.g., see Zhang et al. 2018 for an Alibaba case study) and real-time product browsing (using store security cameras, e.g., see Brynjolfsson et al. 2013 for an American Apparel case study). Thus, different channels of customer behavior can inform predictive analytics. For example, Dzyabura et al. (2018) use online customer behaviors as proxies for predicting offline customer preferences. Finally, new product introduction can benefit from proxies. For example, Baardman et al. (2017) use demand for related products as proxies for predicting demand for a new product.
Our problem can be viewed as an instance of multitask learning, or more specifically, transfer learning. Multitask learning combines data from multiple related predictive tasks to train similar predictive models for each task. It does this by using a shared representation across tasks (Caruana 1997). Such representations typically include variable selection (i.e., enforce the same feature support for all tasks in linear or logistic regression, Jalali et al. 2010, Meier et al. 2008), kernel choice (i.e., use the same kernel for all tasks in kernel regression, Caruana 1997), or intermediate neural net representations (i.e., use the same weights for intermediate layers for all tasks in deep learning, Collobert and Weston 2008). Transfer learning specifically focuses on learning a single new task by transferring knowledge from a related task that has already been learned (see Pan et al. 2010 for a survey). We share a similar goal: since we have many proxy samples, we can easily learn a highperforming predictive model for the proxy task, but we wish to transfer this knowledge to the (related) gold task for which we have very limited labeled data. However, our proxy and gold predictive models already have a shared representation in the variable selection sense; in particular, we use the same features (all of which are typically relevant) for both prediction tasks.
We note that the tasks considered in the multitask and transfer learning literature are typically far more disparate than the class of proxy problems we have identified in this paper thus far. For instance, Caruana (1997) gives the example of simultaneously training neural network outputs to recognize different object properties (outlines, shapes, textures, reflections, shadows, text, orientation, etc.). Bayati et al. (2018) simultaneously train logistic regressions predicting disparate diseases (heart failure, diabetes, dementia, cancer, pulmonary disorder, etc.). While these tasks are indeed related, they are not close substitutes for each other. In contrast, the proxy predictive task is a close substitute for the true predictive task, to the point that practitioners may even ignore gold data and train their models purely on proxy data. In this class of problems, we can impose significantly more structure beyond merely a shared representation.
Our key insight is that the bias between the proxy and gold predictive tasks can be modeled as a sparse function. We argue that there is often some (a priori unknown) underlying mechanism that systematically affects a subset of the features, creating a bias between the true and proxy predictive tasks. When this is the case, we can successfully estimate the bias using high-dimensional techniques that exploit sparsity.
Bayesian approaches have been proposed for similar problems. For instance, Dzyabura et al. (2018) use a Bayesian prior relating customers’ online preferences (proxies) and offline purchase behavior (true outcome of interest). Raina et al. (2006) propose a method for constructing priors in such settings using semidefinite programming on data from related tasks. These approaches do not come with theoretical convergence guarantees. A frequentist interpretation of their approach is akin to ridge regression, which is one of our baselines; we prove that ridge regression cannot take advantage of sparse structure when present, and thus cannot significantly improve estimation error over the naive proxy or gold estimators. Relatedly, Farias and Li (2017) link multiple low-rank collaborative filtering problems by imposing structure across their latent feature representations; however, the primary focus in their work is on low-rank matrix completion settings without features, whereas our focus is on classical regression problems.
We use techniques from the high-dimensional statistics literature to prove convergence properties about our two-step estimator. The second step of our estimator uses a LASSO regression (Chen et al. 1995, Tibshirani 1996), which helps us recover the bias term using far fewer samples than traditional statistical models by exploiting sparsity (Candes and Tao 2007, Bickel et al. 2009, Negahban et al. 2009, Bühlmann and Van De Geer 2011). A key challenge in our proof is that the vector we wish to recover in the second stage is not perfectly sparse; rather, it is the sum of a sparse vector and residual noise from the first stage of our estimator. We extend existing LASSO theory to prove a new tail inequality for this setting. In particular, we show that our error cleanly decomposes into a term that is proportional to the variance of our proxy estimator (which is small in practice), and a term that recovers the classical error rate of the LASSO estimator. Thus, when we have many proxy observations, we require exponentially fewer gold observations to achieve a fixed estimation error than would be required if we did not have any proxy data. Our two-stage estimator is related in spirit to other high-dimensional two-stage estimators (e.g., Belloni et al. 2014, 2012). While these papers focus on treatment effect estimation after variable selection on features or instrumental variables, our work focuses on transfer learning from a proxy predictive task to a new predictive task with limited labeled data.
We highlight our main contributions below:

Problem Formulation: We formulate the proxy problem as two classical regression tasks; the proxy task has abundant data, while the actual (gold) task of interest has limited data. Motivated by real datasets, we model the bias between the two tasks as a sparse function of the features.

Theory: We propose a new two-step estimator that efficiently combines proxy and gold data to exploit sparsity in the bias term. Our estimator provably achieves the same accuracy as popular heuristics (e.g., model averaging or weighted loss functions) with exponentially less true data (in the number of features $d$). Our proof relies on a new tail inequality on the convergence of LASSO for approximately sparse vectors, which may be of independent interest.

Case Studies: We demonstrate the effectiveness of our approach on e-commerce and healthcare datasets. In both cases, we achieve significantly better predictive accuracy as well as managerial insights into the nature of the bias in the proxy data.
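Preliminaries: For any integer $n$, let $[n]$ denote the set $\{1, \ldots, n\}$. Consider an observation with feature vector $x \in \mathbb{R}^d$. As discussed earlier, the gold and proxy predictive tasks are different. Let the gold and proxy responses be given by the following data-generating processes respectively:
$$y_{gold} = x^\top \beta^*_{gold} + \varepsilon_{gold}, \qquad y_{proxy} = x^\top \beta^*_{proxy} + \varepsilon_{proxy},$$
where $\beta^*_{gold}, \beta^*_{proxy} \in \mathbb{R}^d$ are unknown regression parameters, and $\varepsilon_{gold}, \varepsilon_{proxy}$ are independent subgaussian noise with parameters $\sigma_{gold}$ and $\sigma_{proxy}$ respectively (see Definition 1 below).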
Definition 1
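A random variable $z \in \mathbb{R}$ is $\sigma$-subgaussian if $\mathbb{E}[e^{tz}] \le e^{\sigma^2 t^2 / 2}$ for every $t \in \mathbb{R}$.
This definition implies $\mathbb{E}[z] = 0$ and $\mathrm{Var}[z] \le \sigma^2$. Many classical distributions are subgaussian; typical examples include any bounded, centered distribution, or the normal distribution. Note that the errors need not be identically distributed.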
Our goal is to estimate $\beta^*_{gold}$ accurately in order to make good decisions for new observations with respect to their true predicted outcomes. In a typical regression problem, the gold data would suffice. However, we often have very limited gold data, leading to high-variance erroneous estimates. Thus, we can benefit by utilizing information from proxy data, even if this data is biased.
Decision-makers employ proxy data because the proxy predictive task is closely related to the true predictive task. In other words, $\beta^*_{gold} \approx \beta^*_{proxy}$. To model the relationship between the true and proxy predictive tasks, we write
$$\beta^*_{gold} = \beta^*_{proxy} + \delta^*,$$
where $\delta^*$ captures the proxy estimator’s bias.
Motivated by our earlier discussion, we posit that the bias $\delta^*$ is sparse. In particular, let $\|\delta^*\|_0 = s$, which implies that the bias of the proxy estimator only depends on $s$ out of the $d$ covariates. This constraint is always satisfied when $s = d$, but we will prove that our estimator of $\beta^*_{gold}$ has much stronger performance guarantees when $s \ll d$.
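To make the model concrete, the following minimal sketch simulates the two data-generating processes with a sparse bias between the gold and proxy parameters. It assumes i.i.d. Gaussian features and Gaussian noise (one particular subgaussian choice); the function name `simulate_proxy_gold`, its arguments, and all default sizes are hypothetical and chosen only for illustration.

```python
import numpy as np


def simulate_proxy_gold(n_gold=50, n_proxy=5000, d=20, s=2,
                        sigma_gold=1.0, sigma_proxy=1.0, seed=0):
    """Simulate gold and proxy data whose parameters differ by an s-sparse bias.

    Illustrative assumptions: i.i.d. standard normal features, Gaussian noise.
    """
    rng = np.random.default_rng(seed)

    # Dense proxy parameter: the proxy model itself is not sparse.
    beta_proxy = rng.normal(size=d)

    # Sparse bias: only s of the d coordinates differ between the two tasks.
    delta = np.zeros(d)
    support = rng.choice(d, size=s, replace=False)
    signs = rng.choice([-1.0, 1.0], size=s)
    delta[support] = signs * rng.uniform(0.5, 1.5, size=s)
    beta_gold = beta_proxy + delta

    # Few gold observations, many proxy observations.
    X_gold = rng.normal(size=(n_gold, d))
    X_proxy = rng.normal(size=(n_proxy, d))
    y_gold = X_gold @ beta_gold + sigma_gold * rng.normal(size=n_gold)
    y_proxy = X_proxy @ beta_proxy + sigma_proxy * rng.normal(size=n_proxy)
    return X_gold, y_gold, X_proxy, y_proxy, beta_gold, beta_proxy, delta
```

Note that both parameter vectors are dense in this simulation; only their difference is sparse, which is the structure the paper proposes to exploit.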
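Data: We are given two (possibly overlapping) cohorts. We have $n_{gold}$ observations in our gold dataset: let $X_{gold} \in \mathbb{R}^{n_{gold} \times d}$ be the gold design matrix (whose rows are observations from the gold cohort), and $Y_{gold} \in \mathbb{R}^{n_{gold}}$ be the corresponding vector of responses. Analogously, we have $n_{proxy}$ observations in our proxy dataset: let $X_{proxy} \in \mathbb{R}^{n_{proxy} \times d}$ be the proxy design matrix (whose rows are observations from the proxy cohort), and $Y_{proxy} \in \mathbb{R}^{n_{proxy}}$ be the corresponding vector of responses. Typically $n_{gold} \ll n_{proxy}$, necessitating the use of (more abundant) proxy data. Without loss of generality, we impose that both design matrices have been standardized, i.e.,
$$\|X_{gold}^{(r)}\|_2^2 = n_{gold} \quad \text{and} \quad \|X_{proxy}^{(r)}\|_2^2 = n_{proxy}$$
for each column $r \in [d]$. It is standard good practice to normalize features in this way when using regularized regression, so that the regression parameters are appropriately scaled in the regularization term (see, e.g., Friedman et al. 2001). We further define the gold and proxy sample covariance matrices
$$\Sigma_{gold} = \frac{1}{n_{gold}} X_{gold}^\top X_{gold}, \qquad \Sigma_{proxy} = \frac{1}{n_{proxy}} X_{proxy}^\top X_{proxy}.$$
Our standardization of the gold and proxy design matrices implies that
$$\mathrm{diag}(\Sigma_{gold}) = \mathrm{diag}(\Sigma_{proxy}) = \mathbf{1}_{d}.$$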
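As a quick numerical illustration of the standardization step, the sketch below rescales each column of a design matrix and checks the resulting sample covariance matrix. It assumes that "standardized" means each column has squared Euclidean norm equal to the number of observations (centering is not shown); the helper names `standardize_columns` and `sample_covariance` are hypothetical.

```python
import numpy as np


def standardize_columns(X):
    """Rescale each column so its squared Euclidean norm equals n (the row count)."""
    n = X.shape[0]
    col_norms = np.linalg.norm(X, axis=0)
    return X * (np.sqrt(n) / col_norms)


def sample_covariance(X):
    """Sample covariance (second-moment) matrix X^T X / n of a design matrix."""
    return X.T @ X / X.shape[0]


rng = np.random.default_rng(0)
# Columns deliberately on very different scales before standardization.
raw = rng.normal(size=(40, 5)) * np.array([1.0, 10.0, 0.1, 3.0, 7.0])
X_gold = standardize_columns(raw)
Sigma_gold = sample_covariance(X_gold)
```

Under this assumption every diagonal entry of the sample covariance matrix equals one, so a single regularization parameter penalizes all coefficients on a comparable scale.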
Evaluation: We define the parameter estimation error of a given estimator $\hat{\beta}$ relative to the true parameter $\beta^*_{gold}$ as
where $\mathcal{S}$ is the set of problem parameters that satisfy the assumptions given in the problem formulation and §id1 below, and the expectation is taken with respect to the noise terms $\varepsilon_{gold}$ and $\varepsilon_{proxy}$.
Assumption 1 (Bounded)
There exists some such that .
Our first assumption states that our regression parameters are bounded by some constant. This is a standard assumption in the statistical literature.
Assumption 2 (Positive-Definite)
The proxy sample covariance matrix $\Sigma_{proxy}$ is positive-definite. In other words, the minimum eigenvalue $\psi$ of $\Sigma_{proxy}$ satisfies $\psi > 0$.
Our second assumption is also standard, and ensures that $\beta^*_{proxy}$ is identifiable from the proxy data $(X_{proxy}, Y_{proxy})$. This is a mild assumption since $n_{proxy}$ is large. In contrast, we allow that $\beta^*_{gold}$ may not be identifiable from the gold data $(X_{gold}, Y_{gold})$, since $n_{gold}$ is small and the resulting sample covariance matrix $\Sigma_{gold}$ may not be positive-definite.
The last assumption on the compatibility condition arises from the theory of highdimensional statistics (Candes and Tao 2007, Bickel et al. 2009, Bühlmann and Van De Geer 2011). We will require a few definitions before stating the assumption.
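An index set is a set $S \subseteq [d]$. For any vector $u \in \mathbb{R}^d$, let $u_S \in \mathbb{R}^d$ be the vector obtained by setting the elements of $u$ that are not in $S$ to zero. Then, the $i$-th element of $u_S$ is $u_S^{(i)} = u^{(i)} \cdot \mathbb{1}[i \in S]$.
The support for any vector $u \in \mathbb{R}^d$, denoted $\mathrm{supp}(u)$, is the set of indices corresponding to nonzero entries of $u$. Thus, $\mathrm{supp}(u)$ is the smallest set that satisfies $u_{\mathrm{supp}(u)} = u$.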
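The two definitions above translate directly into a few lines of code. The sketch below is illustrative only; the helper names `restrict` and `support` are hypothetical, and indices are 0-based as is conventional in Python, whereas the text indexes coordinates from 1.

```python
import numpy as np


def restrict(u, index_set):
    """Return a copy of u with every entry outside index_set set to zero."""
    mask = np.zeros(u.shape[0], dtype=bool)
    mask[list(index_set)] = True
    return np.where(mask, u, 0.0)


def support(u):
    """Set of indices of the nonzero entries of u (0-based)."""
    return {int(i) for i in np.flatnonzero(u)}


u = np.array([0.0, 2.5, 0.0, -1.0])
u_restricted = restrict(u, {1})       # keeps only the entry at index 1
u_on_support = restrict(u, support(u))  # restricting to the support changes nothing
```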
We now define the compatibility condition:
Definition 2 (Compatibility Condition)
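The compatibility condition is met for the index set $S$ and the matrix $\Sigma \in \mathbb{R}^{d \times d}$ if there exists $\phi > 0$ such that, for all $u \in \mathbb{R}^d$ satisfying $\|u_{S^c}\|_1 \le 3 \|u_S\|_1$, it holds that
$$u^\top \Sigma u \ge \frac{\phi^2}{|S|} \|u_S\|_1^2.$$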
Assumption 3 (Compatibility Condition)
The compatibility condition (Definition 2) is met with constant $\phi > 0$ for the index set $S = \mathrm{supp}(\delta^*)$ and gold sample covariance matrix $\Sigma_{gold}$.
Our third assumption is critical to ensure that the bias term $\delta^*$ is identifiable, even if $n_{gold} < d$. This assumption (or the related restricted eigenvalue condition) is standard in the literature to ensure the convergence of high-dimensional estimators such as the Dantzig selector or LASSO (Candes and Tao 2007, Bickel et al. 2009, Bühlmann and Van De Geer 2011).
It is worth noting that Assumption 3 is always satisfied if $\Sigma_{gold}$ is positive-definite. In particular, letting $\lambda_{\min}(\Sigma_{gold}) > 0$ be the minimum eigenvalue of $\Sigma_{gold}$, it can be easily verified that the compatibility condition holds with constant $\phi = \sqrt{\lambda_{\min}(\Sigma_{gold})}$ for any index set. Thus, the compatibility condition is strictly weaker than the requirement that $\Sigma_{gold}$ be positive-definite. For example, the compatibility condition allows for collinearity in features that are outside the index set $S$, which can occur often in high-dimensional settings when $d > n_{gold}$ (Bühlmann and Van De Geer 2011). Thus, even when $\beta^*_{gold}$ is not identifiable, we may be able to identify the bias $\delta^*$ by exploiting sparsity.
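The claim that positive-definiteness implies the compatibility condition can be checked in a few lines. The sketch below assumes the standard form of the compatibility condition from Bühlmann and Van De Geer (2011), namely $u^\top \Sigma u \ge \phi^2 \|u_S\|_1^2 / |S|$, and writes $\psi > 0$ for the minimum eigenvalue of $\Sigma$; both symbols are notation adopted here for illustration.

```latex
\begin{align*}
u^\top \Sigma u
  &\ge \psi \, \|u\|_2^2
     && \text{(the minimum eigenvalue of } \Sigma \text{ is } \psi\text{)} \\
  &\ge \psi \, \|u_S\|_2^2
     && \text{(dropping the entries outside } S\text{)} \\
  &\ge \frac{\psi}{|S|} \, \|u_S\|_1^2
     && \text{(Cauchy-Schwarz: } \|u_S\|_1 \le \sqrt{|S|} \, \|u_S\|_2\text{)}.
\end{align*}
```

Under these assumptions the inequality holds with $\phi = \sqrt{\psi}$ for every index set $S$ and every vector $u$, not only for vectors in the restricted cone, which is why positive-definiteness is the stronger requirement.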
We begin by describing four commonly used baseline estimators. These include naive estimators trained only on gold or proxy data, as well as two popular heuristics (model averaging and weighted loss functions). We prove corresponding lower bounds on their parameter estimation error with respect to the true parameter $\beta^*_{gold}$.
One common approach is to ignore proxy data and simply use the gold data (the most appropriate data) to construct the best possible predictor. Since we have a linear model, the ordinary least squares (OLS) estimator $\hat{\beta}_{gold}^{OLS}$ is the most obvious choice: it is the minimum variance unbiased estimator.
However, it is well known that introducing bias can be beneficial in data-poor environments. In other words, since we have very few gold samples ($n_{gold}$ is small), we may wish to consider the regularized ridge estimator (Friedman et al. 2001):
$$\hat{\beta}_{gold}^{ridge}(\lambda) = \arg\min_{\beta} \left\{ \frac{1}{n_{gold}} \|Y_{gold} - X_{gold} \beta\|_2^2 + \lambda \|\beta\|_2^2 \right\},$$
where we introduce a regularization parameter $\lambda \ge 0$. Note that when the regularization parameter $\lambda = 0$, we recover the classical OLS estimator, i.e., $\hat{\beta}_{gold}^{ridge}(0) = \hat{\beta}_{gold}^{OLS}$.
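The gold-only baselines can be sketched in closed form. The block below assumes the ridge objective is the mean squared error on gold data plus the regularization parameter times the squared Euclidean norm of the coefficients (the exact scaling of the penalty is an assumption); the function name `ridge_estimator` is hypothetical.

```python
import numpy as np


def ridge_estimator(X, y, lam):
    """Minimize (1/n) * ||y - X b||_2^2 + lam * ||b||_2^2 in closed form.

    With lam == 0 this reduces to ordinary least squares.
    """
    n, d = X.shape
    if lam == 0:
        # lstsq returns the minimum-norm solution if X^T X is singular.
        return np.linalg.lstsq(X, y, rcond=None)[0]
    # Setting the gradient to zero gives (X^T X / n + lam * I) b = X^T y / n.
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
```

As the regularization parameter grows, the fitted coefficients shrink toward zero, which is consistent with the discussion of the ridge estimator's behavior when gold data is scarce.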
Theorem 1 (Gold Estimator)
The parameter estimation error of the OLS estimator on gold data is bounded below as follows:
The parameter estimation error of the ridge estimator on gold data for any choice of the regularization parameter is bounded below as follows:
The proof is given in Appendix id1. Note that this result uses the optimal value of the regularization parameter $\lambda$ to compute the lower bound on the parameter estimation error of the ridge estimator. In practice, the error will be larger since $\lambda$ would be estimated through cross-validation.
Theorem 1 shows that when the number of gold samples is moderate (i.e., ), the ridge estimator recovers the OLS estimator’s lower bound on the parameter estimation error . However, when the number of gold samples is very small (i.e., ), the ridge estimator achieves a constant lower bound on the parameter estimation error . This is because the ridge estimator will predict for very small values of , and since we have assumed that , our parameter estimation error remains bounded.
Another common approach is to ignore the gold data and simply use the proxy data to construct the best possible predictor. Since we have a linear model, the OLS estimator is the most obvious choice; note that we do not need regularization since we have many proxy samples ($n_{proxy}$ is large). Thus, we consider:
$$\hat{\beta}_{proxy} = \arg\min_{\beta} \frac{1}{n_{proxy}} \|Y_{proxy} - X_{proxy} \beta\|_2^2.$$
Theorem 2 (Proxy Estimator)
The parameter estimation error of the OLS estimator on proxy data is bounded below as follows:
The proof is given in Appendix id1. Since $n_{proxy}$ is large, the second term in the parameter estimation error is small. Thus, the parameter estimation error of the proxy estimator is dominated by the bias term (the magnitude of $\delta^*$). When the proxy is “good” or reasonably representative of the gold data, this bias is small. In these cases, the proxy estimator is more accurate than the gold estimator, explaining the widespread use of the proxy estimator in practice even when (limited) gold data is available.
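One heuristic that is sometimes employed is to simply average the gold and proxy estimators:
$$\hat{\beta}_{avg}(\lambda) = (1 - \lambda) \cdot \hat{\beta}_{gold}^{OLS} + \lambda \cdot \hat{\beta}_{proxy}$$
for some averaging parameter $\lambda \in [0, 1]$. Note that $\lambda = 0$ recovers $\hat{\beta}_{gold}^{OLS}$ (the OLS estimator on gold data) and $\lambda = 1$ recovers $\hat{\beta}_{proxy}$ (the OLS estimator on proxy data).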
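A minimal sketch of the averaging heuristic follows. It assumes the averaging parameter is the weight placed on the proxy fit, so that a value of 0 returns the gold OLS fit and a value of 1 returns the proxy OLS fit; the function names `ols` and `averaging_estimator` are hypothetical.

```python
import numpy as np


def ols(X, y):
    """Ordinary least squares via a least-squares solve."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


def averaging_estimator(X_gold, y_gold, X_proxy, y_proxy, lam):
    """Convex combination of the gold and proxy OLS fits; lam is the proxy weight."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return (1.0 - lam) * ols(X_gold, y_gold) + lam * ols(X_proxy, y_proxy)
```

One practical property of this heuristic is that it needs only the two fitted coefficient vectors, not the underlying datasets, at the time of combination.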
Theorem 3 (Averaging Estimator)
The parameter estimation error of the averaging estimator on both gold and proxy data is bounded below as follows:
The proof is given in Appendix id1. Note that this result uses the optimal value of the averaging parameter $\lambda$ to compute the lower bound on the parameter estimation error of the averaging estimator. In practice, the error will be larger since $\lambda$ would be estimated through cross-validation.
Theorem 3 shows that the averaging estimator does not achieve more than a constant factor improvement over the best of the gold and proxy OLS estimators. In particular, the lower bound in Theorem 3 is exactly the minimum of the lower bounds of the gold OLS estimator (given in Theorem 1) and the proxy OLS estimator (given in Theorem 2) up to constant factors. Since the averaging estimator spans both the proxy and the gold estimators (depending on the choice of $\lambda$), it is to be expected that the best possible averaging estimator does at least as well as either of these two estimators; surprisingly, it does no better.
A more sophisticated heuristic used in practice is to perform a weighted regression that combines both datasets but assigns a higher weight to true outcomes. Consider:
for some weight. One extreme value of the weight recovers the OLS estimator on gold data, and the other extreme recovers the OLS estimator on proxy data.
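A weighted regression of this kind can be sketched as a single least-squares solve on the stacked data, with the gold rows scaled by the square root of their weight. The function name, the weight name `w`, the unnormalized form of the loss, and the synthetic data are assumptions for illustration; the paper's exact parameterization of the weight may differ.

```python
import numpy as np

def weighted_loss_estimator(X_gold, y_gold, X_proxy, y_proxy, w):
    """Minimize w * ||y_gold - X_gold b||^2 + ||y_proxy - X_proxy b||^2 over b.

    Scaling the gold rows by sqrt(w) turns the weighted problem into
    ordinary least squares on the stacked data.
    """
    root_w = np.sqrt(w)
    X = np.vstack([root_w * X_gold, X_proxy])
    y = np.concatenate([root_w * y_gold, y_proxy])
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Synthetic data, for illustration only.
rng = np.random.default_rng(1)
d, n_gold, n_proxy = 5, 30, 500
beta_proxy_true = rng.normal(size=d)
beta_gold_true = beta_proxy_true.copy()
beta_gold_true[0] += 1.0                       # bias in a single coordinate
X_gold = rng.normal(size=(n_gold, d))
y_gold = X_gold @ beta_gold_true + rng.normal(size=n_gold)
X_proxy = rng.normal(size=(n_proxy, d))
y_proxy = X_proxy @ beta_proxy_true + rng.normal(size=n_proxy)

gold_ols = np.linalg.lstsq(X_gold, y_gold, rcond=None)[0]
proxy_ols = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)[0]
only_proxy = weighted_loss_estimator(X_gold, y_gold, X_proxy, y_proxy, 0.0)
mostly_gold = weighted_loss_estimator(X_gold, y_gold, X_proxy, y_proxy, 1e8)
```

Under this assumed parameterization, a weight of zero reproduces the proxy OLS estimate and a very large weight approaches the gold OLS estimate.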
Theorem 4 (Weighted Loss Estimator)
The parameter estimation error of the weighted estimator on both gold and proxy data is bounded below as follows:
The proof is given in the appendix. Note that this result uses the optimal value of the weighting parameter to compute the lower bound on the parameter estimation error of the weighted loss estimator. In practice, the error will be larger since the weighting parameter would be estimated through cross-validation.
Theorem 4 shows that the more sophisticated weighted loss estimator achieves exactly the same lower bound as the averaging estimator (Theorem 3). Thus, the weighted loss estimator also does not achieve more than a constant factor improvement over the best of the gold and proxy estimators. Since the weighted estimator spans both the proxy and the gold estimators (depending on the choice of the weight), it is to be expected that the best possible weighted estimator does at least as well as either of these two estimators; again, surprisingly, it does no better.
As discussed earlier, prediction error is composed of bias and variance. Training our estimator on the true outcomes alone yields an unbiased but high-variance estimator. On the other hand, training our estimator on the proxy outcomes alone yields a biased but low-variance estimator. Averaging the estimators or using a weighted loss function can interpolate the bias-variance tradeoff between these two extremes, but provides at most a constant improvement in prediction error.
We now define our proposed joint estimator, and prove that it can leverage sparsity to achieve much better theoretical guarantees than common approaches used in practice.
We propose the following two-step joint estimator:
Step 1:  
Step 2:  (1) 
Both estimation steps are convex optimization problems. Thus, there are no local minima, and we can find the global minimum through standard techniques such as stochastic gradient descent. Note that the first step only requires proxy data, while the second step only requires gold data; thus, we do not need both gold and proxy data to be simultaneously available during training. This is useful when data from multiple sources cannot be easily combined, but summary information like the first-stage proxy estimator can be shared.
When the regularization parameter is small, we recover the gold OLS estimator; when it is large, we recover the proxy OLS estimator. Thus, similar to model averaging and weighted loss functions, the joint estimator spans both the proxy and the gold estimators (depending on the choice of the regularization parameter). However, we show that the joint estimator can successfully interpolate the bias-variance tradeoff between these extremes to produce up to an exponential reduction in estimation error.
Intuitively, we seek to do better by leveraging our insight that the bias term is well-modeled by a sparse function of the covariates. Thus, in principle, we can efficiently recover it using an $\ell_1$ penalty. A simple variable transformation of the second-stage objective (1) gives us
(2) 
where the new optimization variable is the difference between the coefficient vector and the proxy estimator from the first stage. Our estimator is then simply the sum of the minimizer of (2) and the proxy estimator. In other words, (2) uses the LASSO estimator on gold data to recover the bias term with respect to the proxy estimator. We use the $\ell_1$ penalty, which is known to be effective at recovering sparse vectors (Candes and Tao 2007).
This logic immediately indicates a problem, because the parameter we wish to converge to in (2) is not actually the sparse bias vector, but a combination of the bias vector and residual noise from the first stage. We formalize this by defining some additional notation:
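The two steps can be sketched as follows, using the variable transformation so that the second stage is a plain LASSO on the gold residuals. This is an illustrative implementation under stated assumptions: the squared loss is normalized by the number of gold samples, the LASSO is solved with a hand-written coordinate descent routine, and the synthetic data and the particular value of `lam` are arbitrary choices rather than the paper's.

```python
import numpy as np

def soft_threshold(x, t):
    """Scalar soft-thresholding operator."""
    return np.sign(x) * max(abs(x) - t, 0.0)

def lasso_cd(X, r, lam, n_sweeps=200):
    """Coordinate descent for (1/n) * ||r - X @ delta||_2^2 + lam * ||delta||_1."""
    n, d = X.shape
    delta = np.zeros(d)
    col_sq = (X ** 2).sum(axis=0) / n
    resid = r.copy()
    for _ in range(n_sweeps):
        for j in range(d):
            resid += X[:, j] * delta[j]          # remove coordinate j from the fit
            rho = X[:, j] @ resid / n
            delta[j] = soft_threshold(rho, lam / 2.0) / col_sq[j]
            resid -= X[:, j] * delta[j]          # add the updated coordinate back
    return delta

def joint_estimator(X_gold, y_gold, X_proxy, y_proxy, lam):
    # Step 1: OLS on proxy data only.
    beta_proxy_hat = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)[0]
    # Step 2: LASSO on gold residuals recovers the (approximately sparse) bias.
    delta_hat = lasso_cd(X_gold, y_gold - X_gold @ beta_proxy_hat, lam)
    return beta_proxy_hat + delta_hat, beta_proxy_hat

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
d, n_gold, n_proxy, sigma = 20, 40, 5000, 1.0
beta_proxy_true = rng.normal(size=d)
bias = np.zeros(d)
bias[:2] = 2.0                                   # sparse bias: 2 of 20 coordinates
beta_gold_true = beta_proxy_true + bias
X_gold = rng.normal(size=(n_gold, d))
y_gold = X_gold @ beta_gold_true + sigma * rng.normal(size=n_gold)
X_proxy = rng.normal(size=(n_proxy, d))
y_proxy = X_proxy @ beta_proxy_true + sigma * rng.normal(size=n_proxy)

lam = 2.0 * sigma * np.sqrt(2.0 * np.log(d) / n_gold)   # heuristic choice (assumed)
beta_joint, beta_proxy_hat = joint_estimator(X_gold, y_gold, X_proxy, y_proxy, lam)
joint_err = np.abs(beta_joint - beta_gold_true).sum()
proxy_err = np.abs(beta_proxy_hat - beta_gold_true).sum()
beta_large_lam, _ = joint_estimator(X_gold, y_gold, X_proxy, y_proxy, 1e6)
```

With a very large penalty the second stage returns a zero correction, so the sketch reduces to the proxy OLS estimate, mirroring the spanning behavior described in the text.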
(3)  
(4) 
Here, is the residual noise in estimating the proxy estimator from the first stage. As a consequence of this noise, in order to recover the true gold parameter , we wish to recover (rather than ) from (2). Specifically, note that the minimizer of the first term in (2) is and not . However, is clearly not sparse, since is not sparse (e.g., if the noise is a gaussian random variable, then is also gaussian). Thus, we may be concerned that the LASSO penalty in (2) may not be able to recover at the exponentially improved rate promised for sparse vectors (Candes and Tao 2007, Bickel et al. 2009, Bühlmann and Van De Geer 2011).
On the other hand, since we have many proxy outcomes, our proxy estimation error is small. In other words, the vector we wish to recover is approximately sparse. We will prove that this is sufficient for us to recover it (and therefore the true gold parameter) at an exponentially improved rate.
We now state a tail inequality that upper bounds the parameter estimation error of the two-step joint estimator with high probability.
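A small numerical illustration of approximate sparsity, under assumed synthetic data: the second-stage target (the sparse bias minus the first-stage estimation noise) has no exactly zero entries, yet it lies close to the sparse bias in $\ell_1$ distance when there are many proxy samples. All variable names and sizes here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_proxy, sigma = 20, 5000, 1.0               # illustrative sizes (assumed)

beta_proxy_true = rng.normal(size=d)
bias = np.zeros(d)
bias[:2] = 0.5                                   # exactly sparse bias vector

# First stage: OLS on proxy data.
X_proxy = rng.normal(size=(n_proxy, d))
y_proxy = X_proxy @ beta_proxy_true + sigma * rng.normal(size=n_proxy)
beta_proxy_hat = np.linalg.lstsq(X_proxy, y_proxy, rcond=None)[0]

# Residual noise of the proxy estimator, and the resulting second-stage target.
proxy_noise = beta_proxy_hat - beta_proxy_true
target = bias - proxy_noise

print("nonzeros in bias:  ", np.count_nonzero(bias))
print("nonzeros in target:", np.count_nonzero(target))
print("l1 distance between target and bias:", np.abs(target - bias).sum())
```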
Theorem 5 (Joint Estimator)
The joint estimator satisfies the following tail inequality for any chosen value of the regularization parameter :
The proof is given in a later subsection. Note that the regularization parameter trades off the magnitude of the parameter estimation error and the probability of error. When the regularization parameter is small, Theorem 5 guarantees a smaller error with low probability; when it is large, the theorem guarantees a larger error with high probability. In a typical LASSO problem, an optimal choice of the regularization parameter . However, in Theorem 5, convergence depends on both gold and proxy data. In Corollary 1, we will show that in this setting, we will need to choose
In the next subsection, we will compute the resulting estimation error of the joint estimator.
We now derive an upper bound on the expected parameter estimation error of the joint estimator, in order to compare its performance against the baseline estimators described above.
From Theorem 5, we know that our estimation error is small with high probability. However, to derive an upper bound on the expected estimation error, we also need to characterize its worst-case magnitude. In order to ensure that our estimator never becomes unbounded, we consider the truncated joint estimator. In particular,
Recall that is any upper bound on (Assumption 1), and can simply be considered a large constant. The following corollary uses the tail inequality in Theorem 5 to obtain an upper bound on the expected parameter estimation error of the truncated joint estimator.
Corollary 1 (Joint Estimator)
The parameter estimation error of the truncated joint estimator on both gold and proxy data is bounded above as follows:
Taking the regularization parameter to be
yields a parameter estimation error of order
The proof is given in the appendix. The error cleanly decomposes into two terms: (i) the first term is the classical error rate of the LASSO estimator if (rather than ) were sparse, and (ii) the second term is proportional to the residual error (or variance) of the proxy estimator. Thus, when we have many proxy observations (variance of the proxy estimator is small), we require exponentially fewer gold observations (in the number of features) to achieve a fixed estimation error than would be required if we did not have any proxy data, i.e., our two-step estimator recovers using rather than gold observations.
Estimator        Bound Type
Gold OLS         Lower
Gold Ridge       Lower
Proxy OLS        Lower
Averaging        Lower
Weighted         Lower
Truncated Joint  Upper
For ease of comparison, we tabulate the bounds we have derived so far (up to constants and logarithmic factors) in Table 1. Recall that we are interested in the regime where the number of proxy samples is large and the number of gold samples is small. Even with infinite proxy samples, the proxy estimator’s error is bounded below by its bias. The gold estimator’s error can also be very large, particularly when . Model averaging and weighted loss functions do not improve this picture by more than a constant factor. Now, note that in our regime of interest,
The first claim follows when (i.e., the bias term is reasonably sparse), and the second claim follows when (i.e., the proxy estimator’s error primarily arises from its bias rather than its variance). Thus, the joint estimator’s error can be significantly lower than popular heuristics in our regime of interest.
We start by defining the following two events:
(5) 
(6) 
where we have introduced two new parameters and . We denote the complements of these events as and respectively. When events and hold, the gold and proxy noise terms and are bounded in magnitude, allowing us to bound our parameter estimation error . Since our noise is subgaussian, and hold with high probability (Lemmas 4 and 5). We will choose the parameters and later to optimize our bounds.
Lemma 1
On the event , taking , the solution to the optimization problem (2) satisfies
Since the optimization problem (2) is convex, it recovers the in-sample global minimum. Thus, we must have that
Substituting yields
Expanding and cancelling terms on both sides gives us
(7) 
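For reference, the steps leading to Eq. (7) follow the standard LASSO basic inequality. The derivation below is a sketch under assumed notation that is not taken from the text: $X$ and $Y$ are the gold design matrix and responses with $n$ rows, $\varepsilon$ is the gold noise, $\hat\beta_p$ is the first-stage proxy estimate, $\tilde\delta$ is the difference between the true gold parameter and $\hat\beta_p$ (so that $Y = X(\tilde\delta + \hat\beta_p) + \varepsilon$), $\hat\delta$ is the minimizer of (2), and the squared loss in (2) is assumed to be normalized by $n$.

```latex
\begin{align*}
\tfrac{1}{n}\|Y - X(\hat\delta + \hat\beta_p)\|_2^2 + \lambda\|\hat\delta\|_1
  &\le \tfrac{1}{n}\|Y - X(\tilde\delta + \hat\beta_p)\|_2^2 + \lambda\|\tilde\delta\|_1
  && \text{(optimality of } \hat\delta\text{)} \\
\tfrac{1}{n}\|\varepsilon - X(\hat\delta - \tilde\delta)\|_2^2 + \lambda\|\hat\delta\|_1
  &\le \tfrac{1}{n}\|\varepsilon\|_2^2 + \lambda\|\tilde\delta\|_1
  && \text{(substitute } Y\text{)} \\
\tfrac{1}{n}\|X(\hat\delta - \tilde\delta)\|_2^2 + \lambda\|\hat\delta\|_1
  &\le \tfrac{2}{n}\,\varepsilon^\top X(\hat\delta - \tilde\delta) + \lambda\|\tilde\delta\|_1
  && \text{(expand and cancel } \tfrac{1}{n}\|\varepsilon\|_2^2\text{)}
\end{align*}
```

Under these assumptions, the cross term on the right-hand side is the quantity that a bound on the gold noise is needed to control.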
Then, when holds and , we have
Substituting into Eq. (7), we have on that
(8) 
where we recall that . The second line uses to express the right hand side in terms of so that we can ultimately invoke the compatibility condition on (Definition 2). However, we must first show that the required assumptions are met.
By the triangle inequality, we have the following:
and similarly, noting that by definition of ,
Collecting the above expressions and substituting into Eq. (8), we have that when holds,
(9) 
We now have two possible cases: either (i) , or (ii) . In Case (i), we will invoke the compatibility condition to prove our finite-sample guarantee for the joint estimator, and in Case (ii), we will find that we already have good control over the error of the estimator.