# Learning Deep Features in Instrumental Variable Regression

## Abstract

Instrumental variable (IV) regression is a standard strategy for learning causal relationships between confounded treatment and outcome variables by utilizing an instrumental variable, which affects the outcome only through the treatment. In classical IV regression, learning proceeds in two stages: stage 1 performs linear regression from the instrument to the treatment; and stage 2 performs linear regression from the treatment to the outcome, conditioned on the instrument. We propose a novel method, deep feature instrumental variable regression (DFIV), to address the case where relations between instruments, treatments, and outcomes may be nonlinear. In this case, deep neural nets are trained to define informative nonlinear features on the instruments and treatments. We propose an alternating training regime for these features to ensure good end-to-end performance when composing stages 1 and 2, thus obtaining highly flexible feature maps in a computationally efficient manner. DFIV outperforms recent state-of-the-art methods on challenging IV benchmarks, including settings involving high dimensional image data. DFIV also exhibits competitive performance in off-policy policy evaluation for reinforcement learning, which can be understood as an IV regression task.

## 1 Introduction

The aim of supervised learning is to obtain a model based on samples observed from some data generating process, and to then make predictions about new samples generated from the same distribution. If our goal is to predict the effect of our actions on the world, however, our aim becomes to assess the influence of interventions on this data generating process. To answer such causal questions, a supervised learning approach is inappropriate, since our interventions, called treatments, may affect the underlying distribution of the variable of interest, which is called the outcome.

To answer such counterfactual questions, we need to learn how treatment variables causally affect the outcome, a relationship expressed by a structural function. Learning a structural function from observational data (that is, data where we can observe, but not intervene) is known to be challenging if there exists an unmeasured confounder, which influences both treatment and outcome. To illustrate: suppose we are interested in predicting sales of airplane tickets given price. During the holiday season, we would observe a simultaneous increase in sales and prices. This does not mean that raising prices causes sales to increase. In this context, the time of year is a confounder, since it affects both the sales and the prices, and we need to correct the bias it causes.

One way of correcting such bias is via instrumental variable (IV) regression [Stock and Trebbi, 2003]. Here, the structural function is learned using instrumental variables, which affect the outcome only through the treatment. In the sales prediction scenario, we can use supply cost shifters as the instrumental variable, since they only affect the price [Wright, 1928, Blundell et al., 2012]. Instrumental variables can be found in many contexts, and IV regression is extensively used by economists and epidemiologists. For example, IV regression is used for measuring the effect of a drug in the scenario of imperfect compliance [Angrist et al., 1996], or the influence of military service on lifetime earnings [Angrist, 1990]. In this work, we propose a novel IV regression method, which can discover non-linear causal relationships using deep neural networks.

Classically, IV regression is solved by the two-stage least squares (2SLS) algorithm; we learn a mapping from the instrument to the treatment in the first stage and learn the structural function in the second stage as the mapping from the conditional expectation of the treatment given the instrument (obtained from stage 1) to the outcome. Originally, 2SLS assumes linear relationships in both stages, but this has been recently extended to non-linear settings.

One approach has been to use non-linear feature maps. Sieve IV [Newey and Powell, 2003, Chen and Christensen, 2018] uses a finite number of explicitly specified basis functions. Kernel IV (KIV) [Singh et al., 2019] and Dual IV regression [Muandet et al., 2019] extend sieve IV to allow for an infinite number of basis functions using reproducing kernel Hilbert spaces (RKHS). Although these methods enjoy desirable theoretical properties, the flexibility of the model is limited, since all of these approaches rely on pre-specified features.

Another approach is to perform the stage 1 regression through conditional density estimation [Carrasco et al., 2007, Darolles et al., 2011, Hartford et al., 2017]. One advantage of this approach is that it allows for flexible models, including deep neural nets, as proposed in the DeepIV algorithm of [Hartford et al., 2017]. It is known, however, that conditional density estimation is costly and often suffers from high variance when the treatment is high-dimensional. This issue is known as “forbidden regression”, and is discussed in Angrist and Pischke [2009, §4.6.1] and Bennett et al. [2019].

To mitigate the “forbidden regression” problem, Bennett et al. [2019] propose DeepGMM, a method inspired by the optimally weighted Generalized Method of Moments (GMM) [Hansen, 1982], to find a structural function ensuring that the regression residual and the instrument are independent. Although this approach can handle high-dimensional treatment variables and deep NNs as feature extractors, the learning procedure might not be as stable as the 2SLS approach, since it involves solving a smooth zero-sum game.

In this paper, we propose Deep Feature Instrumental Variable Regression (DFIV), which aims to combine the advantages of all previous approaches, while avoiding their limitations. In DFIV, we use deep neural nets to adaptively learn feature maps in the 2SLS approach, which allows us to fit highly nonlinear structural functions, as in DeepGMM and DeepIV. DFIV avoids the “forbidden regression” issue in DeepIV, however, since it does not rely on conditional density estimation. As in sieve IV and KIV, DFIV learns the conditional expectation of the feature maps in stage 1 and uses the predicted features in stage 2, but with the additional advantage of learned features. We empirically show that DFIV performs better than other methods on several IV benchmarks, and apply DFIV successfully to off-policy policy evaluation, which is a fundamental problem in Reinforcement Learning (RL).

The paper is structured as follows. In Section 2, we formulate the IV regression problem and introduce two-stage least-squares regression. In Section 3, we give a detailed description of our DFIV method. We demonstrate the empirical performance of DFIV in Section 4, covering three settings: a classical demand prediction example from econometrics, a challenging IV setting where the treatment consists of high-dimensional image data, and the problem of off-policy policy evaluation in reinforcement learning.

## 2 Preliminaries

### 2.1 Problem Setting of Instrumental Variable Regression

We begin with a description of the IV setting. We observe a treatment $X \in \mathcal{X}$ and an outcome $Y \in \mathcal{Y}$. There exists an unobserved confounder that affects both $X$ and $Y$. This causal relationship can be represented with the following structural causal model:

$$Y = f_{\mathrm{struct}}(X) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathbb{E}[\varepsilon \mid X] \neq 0, \tag{1}$$

where $f_{\mathrm{struct}}$ is called the structural function, which we assume to be continuous, and $\varepsilon$ is an additive noise term. The corresponding graphical model is shown in Figure 1. The challenge is that $\mathbb{E}[\varepsilon \mid X] \neq 0$, which reflects the existence of a confounder. Because of this, we cannot use ordinary supervised learning techniques, since $\mathbb{E}[Y \mid X] \neq f_{\mathrm{struct}}(X)$. Here, we assume there is no observable confounder, but we may easily include one, as discussed in Appendix B.

To deal with the hidden confounder $\varepsilon$, we assume access to an instrumental variable $Z \in \mathcal{Z}$ which satisfies the following assumption.

**Assumption 2.1.** *The conditional distribution $P(X \mid Z)$ is not constant in $Z$, and $\mathbb{E}[\varepsilon \mid Z] = 0$.*

Intuitively, Assumption 2.1 means that the instrument $Z$ induces variation in the treatment $X$ but is uncorrelated with the hidden confounder $\varepsilon$. Given Assumption 2.1, we can see that the function $f_{\mathrm{struct}}$ satisfies the operator equation $\mathbb{E}[Y \mid Z] = \mathbb{E}[f_{\mathrm{struct}}(X) \mid Z]$, obtained by taking the expectation of both sides of (1) conditional on $Z$. Solving this equation for $f_{\mathrm{struct}}$, however, is known to be ill-posed [Nashed and Wahba, 1974]. To address this, recent works [Carrasco et al., 2007, Darolles et al., 2011, Muandet et al., 2019, Singh et al., 2019] minimize the following regularized loss to obtain the estimate $\hat{f}_{\mathrm{struct}}$:

$$\hat{f}_{\mathrm{struct}} = \arg\min_{f \in \mathcal{H}} \mathcal{L}(f), \qquad \mathcal{L}(f) = \mathbb{E}_{YZ}\!\left[\left(Y - \mathbb{E}_{X \mid Z}[f(X)]\right)^2\right] + \Omega(f), \tag{2}$$

where $\mathcal{H}$ is an arbitrary space of continuous functions and $\Omega(f)$ is a regularizer on $f$.

### 2.2 Two Stage Least Squares Regression

A number of works [Newey and Powell, 2003, Singh et al., 2019] tackle the minimization problem (2) using two-stage least squares (2SLS) regression, in which the structural function is modeled as $f_{\mathrm{struct}}(x) = u^\top \psi(x)$, where $u$ is a learnable weight vector and $\psi(x)$ is a vector of fixed basis functions. For example, linear 2SLS uses the identity map $\psi(x) = x$, while sieve IV [Newey and Powell, 2003] uses Hermite polynomials.

In the 2SLS approach, an estimate is obtained by solving two regression problems successively. In stage 1, we estimate the conditional expectation $\mathbb{E}[\psi(X) \mid Z = z]$ as a function of $z$. Then in stage 2, since $\mathbb{E}[f(X) \mid Z] = u^\top \mathbb{E}[\psi(X) \mid Z]$, we minimize the loss (2) with $\mathbb{E}[\psi(X) \mid Z]$ replaced by the estimate obtained in stage 1.

Specifically, we model the conditional expectation as $\mathbb{E}[\psi(X) \mid Z = z] \approx V \phi(z)$, where $\phi(z)$ is another vector of basis functions and $V$ is a matrix to be learned. Again, there exist many choices for $\phi$, which can be infinite-dimensional, but we assume the dimensions of $\psi$ and $\phi$ to be $d_1$ and $d_2$, respectively.

In stage 1, the matrix $V$ is learned by minimizing the following loss,

$$\hat{V} = \arg\min_{V} \mathcal{L}_1(V), \qquad \mathcal{L}_1(V) = \mathbb{E}_{XZ}\!\left[\left\| \psi(X) - V \phi(Z) \right\|^2\right] + \lambda_1 \|V\|^2, \tag{3}$$

where $\lambda_1 > 0$ is a regularization parameter. This is a linear ridge regression problem with multiple targets, which can be solved analytically. In stage 2, given $\hat{V}$, we can obtain $\hat{u}$ by minimizing the loss

$$\hat{u} = \arg\min_{u} \mathcal{L}_2(u), \qquad \mathcal{L}_2(u) = \mathbb{E}_{YZ}\!\left[\left(Y - u^\top \hat{V} \phi(Z)\right)^2\right] + \lambda_2 \|u\|^2, \tag{4}$$

where $\lambda_2 > 0$ is another regularization parameter. Stage 2 corresponds to a linear ridge regression from $\hat{V}\phi(Z)$ to $Y$, and also enjoys a closed-form solution. Given the learned weights $\hat{u}$, the estimated structural function is $\hat{f}_{\mathrm{struct}}(x) = \hat{u}^\top \psi(x)$.
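To make the two stages concrete, here is a minimal NumPy sketch of ridge-regularized linear 2SLS with identity feature maps $\psi(x) = x$ and $\phi(z) = z$; the synthetic data-generating process and all variable names are illustrative, not taken from the paper:

```python
import numpy as np

def two_stage_least_squares(X, Z, Y, lam1, lam2):
    """Linear 2SLS with identity features psi(x) = x, phi(z) = z."""
    m = Z.shape[0]
    # Stage 1: multi-target ridge regression of X on Z (cf. eq. (3))
    V = X.T @ Z @ np.linalg.inv(Z.T @ Z + m * lam1 * np.eye(Z.shape[1]))
    X_hat = Z @ V.T                      # estimate of E[X | Z]
    # Stage 2: ridge regression of Y on the predicted treatment (cf. eq. (4))
    u = np.linalg.solve(X_hat.T @ X_hat + m * lam2 * np.eye(X_hat.shape[1]),
                        X_hat.T @ Y)
    return u

# Confounded toy model: eps drives both X and Y, Z is a valid instrument.
rng = np.random.default_rng(0)
m = 5000
Z = rng.normal(size=(m, 1))
eps = rng.normal(size=(m, 1))
X = Z + eps
Y = 2.0 * X + 2.0 * eps                  # true structural effect is 2
u_hat = two_stage_least_squares(X, Z, Y, lam1=1e-4, lam2=1e-4)
```

In this toy model, ordinary least squares on $(X, Y)$ is biased towards $3$, since $\mathrm{Cov}(X, \varepsilon) \neq 0$, while the 2SLS estimate recovers a slope close to the true value $2$.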

## 3 DFIV Algorithm

In this section, we develop the DFIV algorithm. Similarly to Singh et al. [2019], we assume that we do not necessarily have access to observations from the joint distribution of $(X, Y, Z)$. Instead, we are given $m$ observations of $(X, Z)$ for stage 1 and $n$ observations of $(Y, Z)$ for stage 2. We denote the stage 1 observations as $\{(x_i, z_i)\}_{i=1}^{m}$ and the stage 2 observations as $\{(\tilde{y}_j, \tilde{z}_j)\}_{j=1}^{n}$. If observations of $(X, Y, Z)$ are given for both stages, we can evaluate the out-of-sample losses, and these losses can be used for hyper-parameter tuning of $(\lambda_1, \lambda_2)$ (Appendix A).

DFIV uses the following models

$$f_{\mathrm{struct}}(x) = u^\top \psi_{\theta_X}(x), \qquad \mathbb{E}[\psi_{\theta_X}(X) \mid Z = z] \approx V \phi_{\theta_Z}(z), \tag{5}$$

where $u$ and $V$ are the linear parameters, and $\psi_{\theta_X}$ and $\phi_{\theta_Z}$ are the neural nets parameterised by $\theta_X$ and $\theta_Z$, respectively. As in the original 2SLS algorithm, we learn $(V, \theta_Z)$ in stage 1 and $(u, \theta_X)$ in stage 2. In addition to the weights $u$ and $V$, however, we also learn the parameters of the feature maps, $\theta_X$ and $\theta_Z$. Hence, we need to alternate between stages 1 and 2, since the conditional expectation $\mathbb{E}[\psi_{\theta_X}(X) \mid Z]$ changes as $\theta_X$ is updated during training.

#### Stage 1 Regression

The goal of stage 1 is to estimate the conditional expectation $\mathbb{E}[\psi_{\theta_X}(X) \mid Z = z]$ by learning the matrix $V$ and parameter $\theta_Z$, with $\theta_X$ given and fixed. Given the stage 1 data $\{(x_i, z_i)\}_{i=1}^{m}$, this can be done by minimizing the empirical estimate of $\mathcal{L}_1$,

$$\hat{\mathcal{L}}_1(V, \theta_Z) = \frac{1}{m} \sum_{i=1}^{m} \left\| \psi_{\theta_X}(x_i) - V \phi_{\theta_Z}(z_i) \right\|^2 + \lambda_1 \|V\|^2. \tag{6}$$

Note that the feature map $\psi_{\theta_X}$ is fixed during stage 1, since $\psi_{\theta_X}(X)$ is the “target variable.” If we fix $\theta_Z$, the minimization problem (6) reduces to a linear ridge regression problem with multiple targets, whose solution as a function of $\theta_X$ and $\theta_Z$ is given analytically by

$$\hat{V}(\theta_X, \theta_Z) = \Psi_1^\top \Phi_1 \left(\Phi_1^\top \Phi_1 + m \lambda_1 I\right)^{-1}, \tag{7}$$

where $\Psi_1, \Phi_1$ are feature matrices defined as $\Psi_1 = (\psi_{\theta_X}(x_1), \dots, \psi_{\theta_X}(x_m))^\top$ and $\Phi_1 = (\phi_{\theta_Z}(z_1), \dots, \phi_{\theta_Z}(z_m))^\top$. We can then learn the parameters $\theta_Z$ of the adaptive features by minimizing the loss at $V = \hat{V}(\theta_X, \theta_Z)$ using gradient descent. For simplicity, we introduce a small abuse of notation by denoting $\hat{\theta}_Z$ as the result of a user-chosen number of gradient descent steps on the loss (6) with $V = \hat{V}(\theta_X, \theta_Z)$ from (7), even though $\hat{\theta}_Z$ need not attain the minimum of the non-convex loss (6). We then write $\hat{V}(\theta_X) = \hat{V}(\theta_X, \hat{\theta}_Z)$. While this trick of using an analytical estimate of the linear output weights of a deep neural network might not lead to significant gains in standard supervised learning, it turns out to be very important in the development of our 2SLS algorithm. As shown in the following section, the analytical estimate $\hat{V}(\theta_X)$ (now considered as a function of $\theta_X$) will be used to backpropagate to $\theta_X$ in stage 2.
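The closed-form solution (7) can be sketched directly in NumPy for fixed feature matrices; in DFIV the same expression would be evaluated inside an autodiff framework so that gradients flow through $\hat{V}$ to the feature parameters (all names here are illustrative):

```python
import numpy as np

def stage1_weights(Psi1, Phi1, lam1):
    # Eq. (7): V_hat = Psi1^T Phi1 (Phi1^T Phi1 + m * lam1 * I)^{-1}
    m, d2 = Phi1.shape
    return Psi1.T @ Phi1 @ np.linalg.inv(Phi1.T @ Phi1 + m * lam1 * np.eye(d2))

def stage1_loss(V, Psi1, Phi1, lam1):
    # Eq. (6): mean squared residual plus ridge penalty on V
    m = Phi1.shape[0]
    return ((Psi1 - Phi1 @ V.T) ** 2).sum() / m + lam1 * (V ** 2).sum()

rng = np.random.default_rng(0)
Phi1 = rng.normal(size=(200, 3))   # rows phi_{theta_Z}(z_i)
Psi1 = rng.normal(size=(200, 2))   # rows psi_{theta_X}(x_i), the stage 1 targets
V_hat = stage1_weights(Psi1, Phi1, lam1=0.1)
```

For fixed features, the loss (6) is strictly convex in $V$, so any perturbation of $\hat{V}$ strictly increases it.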

#### Stage 2 Regression

In stage 2, we learn the structural function by computing the weight vector $u$ and parameter $\theta_X$ while fixing $\hat{\theta}_Z$, and thus the corresponding feature map $\phi_{\hat{\theta}_Z}$. Given the data $\{(\tilde{y}_j, \tilde{z}_j)\}_{j=1}^{n}$, we can minimize the empirical version of $\mathcal{L}_2$, defined as

$$\hat{\mathcal{L}}_2(u, \theta_X) = \frac{1}{n} \sum_{j=1}^{n} \left(\tilde{y}_j - u^\top \hat{V}(\theta_X)\, \phi_{\hat{\theta}_Z}(\tilde{z}_j)\right)^2 + \lambda_2 \|u\|^2. \tag{8}$$

Again, for a given $\theta_X$, we can solve the minimization problem (8) for $u$ as a function of $\theta_X$ by a linear ridge regression

$$\hat{u}(\theta_X) = \left(W_2^\top W_2 + n \lambda_2 I\right)^{-1} W_2^\top \tilde{\mathbf{y}}, \tag{9}$$

where $W_2 = \Phi_2 \hat{V}(\theta_X)^\top$ with $\Phi_2 = (\phi_{\hat{\theta}_Z}(\tilde{z}_1), \dots, \phi_{\hat{\theta}_Z}(\tilde{z}_n))^\top$, and $\tilde{\mathbf{y}} = (\tilde{y}_1, \dots, \tilde{y}_n)^\top$.

The loss $\hat{\mathcal{L}}_2$ explicitly depends on the parameters $\theta_X$, and we can backpropagate it to $\theta_X$ via $\hat{V}(\theta_X)$, even though samples of the treatment variable do not appear in the stage 2 regression. We again introduce a small abuse of notation for simplicity, and denote by $\hat{\theta}_X$ the estimate obtained after a few gradient steps on (8) with $u = \hat{u}(\theta_X)$ from (9), even though $\hat{\theta}_X$ need not minimize the non-convex loss (8). We then have $\hat{u} = \hat{u}(\hat{\theta}_X)$. After updating $\theta_X$, we need to update $\hat{V}$ accordingly. We do not attempt to backpropagate through the estimate $\hat{\theta}_Z$ to do this, however, as this would be too computationally expensive; instead, we alternate stages 1 and 2. We also considered updating $\theta_X$ and $\theta_Z$ jointly to optimize the loss $\hat{\mathcal{L}}_2$, but this fails, as discussed in Appendix E.
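For fixed stage 1 quantities, the stage 2 solution (9) is again a plain ridge regression; here is a NumPy sketch with illustrative data (in DFIV, $\hat{V}$ would additionally carry the dependence on $\theta_X$ through the autodiff graph):

```python
import numpy as np

def stage2_weights(V_hat, Phi2, y2, lam2):
    # Eq. (9): ridge regression of y on w_j = V_hat phi(z_j)
    n = Phi2.shape[0]
    W2 = Phi2 @ V_hat.T                  # n x d1 predicted stage 1 features
    return np.linalg.solve(W2.T @ W2 + n * lam2 * np.eye(W2.shape[1]),
                           W2.T @ y2)

def stage2_loss(u, V_hat, Phi2, y2, lam2):
    # Eq. (8): squared error of u^T V_hat phi(z) against y, plus ridge penalty
    n = Phi2.shape[0]
    return ((y2 - Phi2 @ V_hat.T @ u) ** 2).sum() / n + lam2 * (u ** 2).sum()

rng = np.random.default_rng(0)
Phi2 = rng.normal(size=(150, 3))
V_hat = rng.normal(size=(2, 3))
y2 = rng.normal(size=(150,))
u_hat = stage2_weights(V_hat, Phi2, y2, lam2=0.1)
```

As in stage 1, the loss (8) is strictly convex in $u$ for fixed features, so the closed-form $\hat{u}$ is its unique minimizer.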

#### Computational Complexity

From the closed-form solutions (7) and (9), the computational complexity of the algorithm is $O(m d_2 (d_1 + d_2) + d_2^3)$ for stage 1, while stage 2 requires an additional $O(n d_1 (d_1 + d_2) + d_1^3)$ computation. This is small compared to KIV [Singh et al., 2019], which takes $O(m^3)$ and $O(n^3)$ time, respectively. We can further speed up the learning by using mini-batch training as shown in Algorithm 1.

Here, $\hat{V}(\theta_X, \theta_Z)$ and $\hat{u}(\theta_X)$ are the functions given by (7) and (9), calculated using mini-batches of data. Similarly, $\hat{\mathcal{L}}_1$ and $\hat{\mathcal{L}}_2$ are the stage 1 and 2 losses for the mini-batches. We recommend setting the batch size large enough that the mini-batch estimates $\hat{V}$ and $\hat{u}$ do not diverge from those computed on the entire dataset. Furthermore, we observe that performing more stage 1 updates than stage 2 updates, i.e. updating $\theta_Z$ more frequently than $\theta_X$, stabilizes the learning process.
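The batch-size recommendation can be checked numerically: compute $\hat{V}$ from (7) on the full dataset and on a sufficiently large mini-batch, and compare the two. The synthetic near-linear feature relation below is illustrative only:

```python
import numpy as np

def stage1_weights(Psi, Phi, lam1):
    # Eq. (7) computed on whatever subset of the data is passed in
    m, d2 = Phi.shape
    return Psi.T @ Phi @ np.linalg.inv(Phi.T @ Phi + m * lam1 * np.eye(d2))

rng = np.random.default_rng(0)
m, d1, d2 = 4096, 4, 3
Phi = rng.normal(size=(m, d2))
W_true = rng.normal(size=(d2, d1))
Psi = Phi @ W_true + 0.1 * rng.normal(size=(m, d1))   # near-linear relation

V_full = stage1_weights(Psi, Phi, lam1=1e-2)
idx = rng.choice(m, size=1024, replace=False)
V_batch = stage1_weights(Psi[idx], Phi[idx], lam1=1e-2)   # large mini-batch
```

With a large enough batch, the mini-batch estimate stays close to the full-data one; with very small batches it fluctuates, destabilizing the gradient steps taken through $\hat{V}$.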

## 4 Experiments

In this section, we report the empirical performance of the DFIV method. The evaluation considers both low and high-dimensional treatment variables. We used the demand design dataset of Hartford et al. [2017] for benchmarking in the low and high-dimensional cases, and we propose a new setting for the high-dimensional case based on the dSprites dataset [Matthey et al., 2017]. In the deep RL context, we also apply DFIV to perform off-policy policy evaluation (OPE). The network architecture and hyper-parameters are provided in Appendix F. The algorithms in the first two experiments are implemented using PyTorch [Paszke et al., 2019] and the OPE experiments are implemented using TensorFlow [Abadi et al., 2015] and the Acme RL framework [Hoffman et al., 2020]. The code is included in the supplemental material.

### 4.1 Demand Design Experiments

The demand design dataset is a synthetic dataset introduced by Hartford et al. [2017] that is now a standard benchmark for testing nonlinear IV methods. In this dataset, we aim to predict the demand for airplane tickets $Y$ given the price of the tickets $P$. The dataset contains two observable confounders, namely the time of year $T$ and the customer group $S$, which is categorized by level of price sensitivity. Further, the noise in $Y$ and $P$ is correlated, which indicates the existence of an unobserved confounder. The strength of this correlation is represented by a parameter $\rho$. To correct the bias caused by this hidden confounder, the fuel price $C$ is introduced as an instrumental variable. Details of the data generation process can be found in Appendix D.1. In DFIV notation, the treatment is $X = P$, the instrument is $Z = C$, and $(T, S)$ are the observable confounders.

We compare the DFIV method to three leading modern competitors, namely KIV [Singh et al., 2019], DeepIV [Hartford et al., 2017], and DeepGMM [Bennett et al., 2019]. We used the DFIV method with observable confounders, as introduced in Appendix B. Note that DeepGMM does not have an explicit mechanism for incorporating observable confounders. The solution we use, proposed by Bennett et al. [2019, p. 2], is to incorporate these observables in both instrument and treatment; hence we apply DeepGMM with treatment $(X, O)$ and instrumental variable $(Z, O)$, where $O$ denotes the observable confounders. Although this approach is theoretically sound, it makes the problem unnecessarily difficult, since it ignores the fact that we only need to consider the conditional expectation of $X$ given $(Z, O)$.

For the feature maps in DFIV and the models in DeepGMM, we used networks with a similar number of parameters to DeepIV. We tuned the regularizers as discussed in Appendix A, with the data evenly split between stage 1 and stage 2. We varied the correlation parameter $\rho$ and the dataset size, and ran 20 simulations for each setting. Results are summarized in Figure 2.

Next, we consider a case, introduced by Hartford et al. [2017], where the customer type is replaced with an image of the corresponding handwritten digit from the MNIST dataset [LeCun and Cortes, 2010]. This reflects the fact that we cannot know the exact customer type, and thus need to estimate it from noisy high-dimensional data. Note that although the confounder is high-dimensional, the treatment variable is still real-valued, i.e. the price of the tickets. Figure 4 presents the results for this high-dimensional confounding case. Again, we train networks with a similar number of learnable parameters to DeepIV for DFIV and DeepGMM, and hyper-parameters are set as discussed in Appendix A. We ran 20 simulations and report the mean and standard error.

Our first observation from Figures 2 and 4 is that the level of correlation $\rho$ has no significant impact on the error under any of the IV methods, indicating that all approaches correctly account for the effect of the hidden confounder. This is consistent with earlier results on this dataset using KIV and DeepIV [Singh et al., 2019, Hartford et al., 2017]. We note that DeepGMM does not perform well in this demand design problem. This may be due to the current DeepGMM approach to handling observable confounders, which might not be optimal. KIV performed reasonably well for small sample sizes and low-dimensional data, but it did less well in the high-dimensional MNIST case due to its less expressive features. In high dimensions, DeepIV performed well, since the treatment variable was unidimensional and the “forbidden regression” issue did not arise. DFIV performed consistently better than all other methods in both low and high dimensions, which suggests it can stably learn a flexible structural function.

### 4.2 dSprites Experiments

To test the performance of DFIV for a high-dimensional treatment variable, we utilized the dSprites dataset [Matthey et al., 2017]. This is an image dataset described by five latent parameters (shape, scale, rotation, posX and posY). The images are $64 \times 64 = 4096$-dimensional. In this experiment, we fixed the shape parameter to heart, i.e. we only used heart-shaped images. An example is shown in Figure 5.

From this dataset, we generated data for IV regression in which each image is the treatment variable $X$; hence, the treatment is 4096-dimensional in this experiment. To make the task more challenging, we used posY as the hidden confounder, which is not revealed to the model. We used the other three latent variables (scale, rotation, and posX) as the instrument $Z$. See Appendix D.2 for the detailed data generation process.

We compared the performance of DFIV against KIV and DeepGMM, with hyper-parameters determined as in the demand design problem. The results are displayed in Figure 4. DFIV consistently yields the best performance of all the methods. DeepIV is not included in the figure because it fails to give meaningful predictions due to the “forbidden regression” issue, as the treatment variable is high-dimensional. The performance of KIV suffers since its pre-specified features lack the richness to express a high-dimensional, complex structural function.

### 4.3 Off-Policy Policy Evaluation Experiments

We apply our IV methods to the off-policy policy evaluation (OPE) problem [Sutton and Barto, 2018], which is one of the fundamental problems of deep RL. In particular, it was observed by Bradtke and Barto [1996] that 2SLS can be used to estimate a linearly parameterized value function, and we use this reasoning as the basis of our approach. Let us consider the RL environment $(\mathcal{S}, \mathcal{A}, P, R, \rho_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $P(s' \mid s, a)$ is the transition function, $R(r \mid s, a)$ is the reward distribution, $\rho_0$ is the initial state distribution, and $\gamma \in (0, 1)$ is the discount factor. Let $\pi$ be a policy, and denote by $\pi(a \mid s)$ the probability of selecting action $a$ in state $s$. Given policy $\pi$, the $Q$-function is defined as

$$Q^\pi(s, a) = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\middle|\, s_0 = s,\, a_0 = a\right],$$

with $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and $r_t \sim R(\cdot \mid s_t, a_t)$. The goal of OPE is to evaluate $\rho^\pi = \mathbb{E}_{s \sim \rho_0,\, a \sim \pi(\cdot \mid s)}[Q^\pi(s, a)]$, the expectation of the $Q$-function with respect to the initial state distribution, for a given target policy $\pi$, learned from a fixed dataset of transitions $(s, a, r, s')$, where $s$ and $a$ are sampled from some potentially unknown distribution and behavioral policy, respectively. Using the Bellman equation satisfied by $Q^\pi$, we obtain a structural causal model of the form (1),

$$r = f_{\mathrm{struct}}(s, a, s') + \varepsilon, \qquad f_{\mathrm{struct}}(s, a, s') = Q^\pi(s, a) - \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q^\pi(s', a'), \tag{10}$$

where the outcome is the reward $r$, the treatment is $X = (s, a, s')$, the instrument is $Z = (s, a)$, and $\varepsilon$ absorbs the randomness of the reward and of the next state. We have that $\mathbb{E}[\varepsilon \mid s, a] = 0$, and Assumption 2.1 is verified. Minimizing the loss (2) for the structural causal model (10) corresponds to minimizing the following loss

$$\mathcal{L}(Q) = \mathbb{E}_{(s, a, r)}\!\left[\left(r - Q(s, a) + \gamma\, \mathbb{E}_{s' \mid s, a}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q(s', a')\right]\right)^2\right], \tag{11}$$

and we can apply any IV regression method to achieve this. We show that minimizing $\mathcal{L}(Q)$ corresponds to minimizing the mean squared Bellman error (MSBE), and detail the DFIV algorithm applied to OPE, in Appendix C. Note that MSBE is also the loss minimized by the residual gradient (RG) method proposed in [Baird, 1995] to estimate $Q$-functions. However, this method suffers from the “double-sample” issue: it requires two independent samples of $s'$ starting from the same $(s, a)$ due to the inner conditional expectation [Baird, 1995], whereas IV regression methods do not suffer from this issue.
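As a concrete check of the construction in (10), the sketch below evaluates the structural function for a tabular $Q$ on a tiny one-state MDP with constant reward, where the true $Q^\pi$ satisfies the Bellman equation exactly; the MDP and all names are illustrative:

```python
import numpy as np

gamma = 0.9

def f_struct(Q, pi, s, a, s_next):
    # Treatment X = (s, a, s'), instrument Z = (s, a):
    # f(s, a, s') = Q(s, a) - gamma * sum_{a'} pi(a' | s') Q(s', a')
    return Q[s, a] - gamma * np.dot(pi[s_next], Q[s_next])

# One-state MDP with reward 1 for every action: Q^pi(s, a) = 1 / (1 - gamma)
Q_true = np.full((1, 2), 1.0 / (1.0 - gamma))
pi = np.array([[0.5, 0.5]])
# On a transition (s=0, a=0, r=1, s'=0), the structural residual r - f vanishes
residual = 1.0 - f_struct(Q_true, pi, 0, 0, 0)
```

The vanishing residual for the true $Q^\pi$ is exactly the statement that $Q^\pi$ solves the structural equation (10) up to the zero-mean noise $\varepsilon$.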

We evaluate DFIV on three BSuite [Osband et al., 2019] tasks: catch, mountain car, and cartpole. The original system dynamics are deterministic. To create a stochastic environment, we randomly replace the agent’s action with a uniformly sampled action with probability $p$. The noise level $p$ controls the level of the confounding effect. The target policy is trained using DQN [Mnih et al., 2015], and we subsequently generate an offline dataset for OPE by executing the policy in the same environment with a random action probability of 0.2 (on top of the environment’s random action probability $p$). We compare DFIV with KIV, DeepIV, and DeepGMM, as well as Fitted Q Evaluation (FQE) [Le et al., 2019, Voloshin et al., 2019], a specialized approach designed for the OPE setting, which serves as our “gold standard” baseline [Paine et al., 2020]. All methods use the same network for value estimation. Figure 6 shows the absolute error of the estimated policy value for each method, with standard deviations over 5 runs. In catch and mountain car, DFIV comes closest in performance to FQE, and even matches it for some noise settings, whereas DeepGMM is somewhat worse in catch, and significantly worse in mountain car. In the case of cartpole, DeepGMM performs somewhat better than DFIV, although both are slightly worse than FQE. DeepIV and KIV both do poorly across all RL benchmarks.

## 5 Conclusion

We have proposed a novel method for instrumental variable regression, Deep Feature IV (DFIV), which performs two-stage least squares regression on flexible and expressive features of the instrument and treatment. As a contribution to the IV literature, we showed how to adaptively learn these feature maps with deep neural networks. We also showed that the off-policy policy evaluation (OPE) problem in deep RL can be interpreted as a nonlinear IV regression, and that DFIV performs competitively in this domain. In RL problems with additional external confounders, we believe DFIV will be of great value. This work brings closer together the research worlds of deep offline RL and causal inference from observational data.

## Appendix A Hyper-Parameter Tuning

If observations from the joint distribution of $(X, Y, Z)$ are available in both stages, we can tune the regularization parameters $(\lambda_1, \lambda_2)$ using the approach proposed in Singh et al. [2019]. Let the complete data of stage 1 and stage 2 be denoted as $\{(x_i, y_i, z_i)\}_{i=1}^{m}$ and $\{(\tilde{x}_j, \tilde{y}_j, \tilde{z}_j)\}_{j=1}^{n}$. Then, we can use the data not used in each stage to evaluate the out-of-sample performance of the other stage. Specifically, the regularization parameters are given by

$$\lambda_1 = \arg\min_{\lambda}\ \frac{1}{n} \sum_{j=1}^{n} \left\| \psi_{\hat{\theta}_X}(\tilde{x}_j) - \hat{V}_{\lambda}\, \phi_{\hat{\theta}_Z}(\tilde{z}_j) \right\|^2, \qquad \lambda_2 = \arg\min_{\lambda}\ \frac{1}{m} \sum_{i=1}^{m} \left( y_i - \hat{u}_{\lambda}^\top \hat{V} \phi_{\hat{\theta}_Z}(z_i) \right)^2,$$

where $\hat{V}_{\lambda}$ is the stage 1 solution (7) fitted on the stage 1 data with regularizer $\lambda$ and evaluated on the stage 2 data, and $\hat{u}_{\lambda}$ is the stage 2 solution (9) fitted on the stage 2 data with regularizer $\lambda$ and evaluated on the stage 1 data.
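A sketch of this scheme for the stage 1 regularizer: fit $\hat{V}_\lambda$ on stage 1 data over a grid of candidate values, then score each by its out-of-sample stage 1 residual on the stage 2 data (the synthetic data and grid are illustrative):

```python
import numpy as np

def stage1_weights(Psi, Phi, lam1):
    # Closed-form multi-target ridge solution, cf. eq. (7)
    m, d2 = Phi.shape
    return Psi.T @ Phi @ np.linalg.inv(Phi.T @ Phi + m * lam1 * np.eye(d2))

def oos_stage1_loss(V, Psi, Phi):
    # Out-of-sample stage 1 residual (no penalty term)
    return ((Psi - Phi @ V.T) ** 2).mean()

rng = np.random.default_rng(0)
W_true = rng.normal(size=(3, 2))
Phi1 = rng.normal(size=(200, 3))
Psi1 = Phi1 @ W_true + 0.3 * rng.normal(size=(200, 2))   # stage 1 data
Phi2 = rng.normal(size=(200, 3))
Psi2 = Phi2 @ W_true + 0.3 * rng.normal(size=(200, 2))   # held-out stage 2 data

lam_grid = [1e-6, 1e-4, 1e-2, 1.0, 1e2]
lam1 = min(lam_grid,
           key=lambda lam: oos_stage1_loss(stage1_weights(Psi1, Phi1, lam),
                                           Psi2, Phi2))
```

The stage 2 regularizer $\lambda_2$ would be selected symmetrically, scoring the stage 2 fit on the held-out stage 1 data.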

## Appendix B 2SLS algorithm with observable confounders

In this appendix, we formulate the DFIV method when observable confounders are available. Here, we consider the causal graph given in Figure 7. In addition to treatment $X$, outcome $Y$, and instrument $Z$, we have an observable confounder $O$. The structural function we aim to learn is now $f_{\mathrm{struct}}(X, O)$, and the structural causal model is represented as

$$Y = f_{\mathrm{struct}}(X, O) + \varepsilon, \qquad \mathbb{E}[\varepsilon] = 0, \qquad \mathbb{E}[\varepsilon \mid X, O] \neq 0.$$

For hidden confounders, we rely on Assumption 2.1. For observable confounders, we introduce a similar assumption.

**Assumption B.1.** *The conditional distribution $P(X \mid Z, O)$ is not constant in $(Z, O)$, and $\mathbb{E}[\varepsilon \mid Z, O] = 0$.*

Following a similar reasoning as in Section 2, we can estimate the structural function by minimizing the following loss:

$$\mathcal{L}_{O}(f) = \mathbb{E}_{YZO}\!\left[\left(Y - \mathbb{E}_{X \mid Z, O}[f(X, O)]\right)^2\right] + \Omega(f).$$

One universal way to deal with the observable confounder is to augment both the treatment and instrumental variables. Let us introduce the new treatment $\bar{X} = (X, O)$ and instrument $\bar{Z} = (Z, O)$; then the loss becomes

$$\bar{\mathcal{L}}(f) = \mathbb{E}_{Y\bar{Z}}\!\left[\left(Y - \mathbb{E}_{\bar{X} \mid \bar{Z}}[f(\bar{X})]\right)^2\right] + \Omega(f),$$

which is equivalent to the original loss. This approach is adopted in KIV [Singh et al., 2019], and we used it here for the DeepGMM method [Bennett et al., 2019] in the demand design experiment. However, it ignores the fact that we only have to consider the conditional expectation of $X$ given $(Z, O)$, and not that of $O$ itself. Hence, we introduce another approach, which is to model $f_{\mathrm{struct}}(x, o) = u^\top (\psi(x) \otimes \xi(o))$, where $\psi$ and $\xi$ are feature maps and $\otimes$ denotes the tensor product defined as $a \otimes b = \mathrm{vec}(a b^\top)$. It follows that $\mathbb{E}[f_{\mathrm{struct}}(X, O) \mid Z, O] = u^\top \left(\mathbb{E}[\psi(X) \mid Z, O] \otimes \xi(O)\right)$, which yields the following two-stage regression procedure.
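The tensor-product model can be sketched as follows. The check verifies linearity in the first argument, which is exactly what lets the conditional expectation pass inside, $\mathbb{E}[\psi(X) \otimes \xi(O) \mid Z, O] = \mathbb{E}[\psi(X) \mid Z, O] \otimes \xi(O)$ (all names are illustrative):

```python
import numpy as np

def tensor_feat(psi_x, xi_o):
    # a ⊗ b = vec(a b^T)
    return np.outer(psi_x, xi_o).ravel()

rng = np.random.default_rng(0)
psi_a = rng.normal(size=3)   # stand-in for psi(x) on one sample
psi_b = rng.normal(size=3)   # stand-in for psi(x') on another sample
xi = rng.normal(size=2)      # stand-in for xi(o)
u = rng.normal(size=6)       # weight vector on the 3 * 2 tensor features
```

Equivalently, $u^\top(\psi \otimes \xi) = \psi^\top U \xi$ for $U$ the $3 \times 2$ reshaping of $u$, so the model is bilinear in the two feature vectors.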

In stage 1, we learn the matrix $V$ that minimizes

$$\mathcal{L}_1^{O}(V) = \mathbb{E}_{XZO}\!\left[\left\| \psi(X) - V \phi(Z, O) \right\|^2\right] + \lambda_1 \|V\|^2,$$

which estimates the conditional expectation $\mathbb{E}[\psi(X) \mid Z = z, O = o] \approx V \phi(z, o)$, where $\phi$ is a feature map on the pair $(z, o)$. Then, in stage 2, we learn $u$ using

$$\mathcal{L}_2^{O}(u) = \mathbb{E}_{YZO}\!\left[\left(Y - u^\top \left(\hat{V} \phi(Z, O) \otimes \xi(O)\right)\right)^2\right] + \lambda_2 \|u\|^2.$$

Again, both stages can be formulated as ridge regressions, and thus enjoy closed-form solutions. We can further extend this to learn deep feature maps. Let $\psi_{\theta_X}, \phi_{\theta_Z}, \xi_{\theta_O}$ be the feature maps parameterized by $\theta_X, \theta_Z, \theta_O$, respectively. Using notation similar to Section 3, the corresponding DFIV algorithm with observable confounders is shown in Algorithm 2. Note that in this algorithm, steps 3, 5, 6 are run until convergence, unlike for Algorithm 1.

## Appendix C Application of 2SLS to Off-Policy Policy Evaluation

Here we first show that the optimal solution of (2) in the OPE problem is equivalent to that of the mean squared Bellman error, and then describe how we apply the 2SLS algorithm.

We interpret the Bellman equation in (10) as an IV regression problem, where the outcome is $Y = r$, the treatment is $X = (s, a, s')$, and the instrument is $Z = (s, a)$.

Let $\bar{R}(s, a)$ be the conditional expectation of the reward given $(s, a)$, defined as $\bar{R}(s, a) = \mathbb{E}[r \mid s, a]$.

Then, we can prove that the solutions of (11) and MSBE are equivalent. Indeed, we have

$$\mathcal{L}(Q) = \mathbb{E}_{(s, a)}\!\left[\left(\bar{R}(s, a) - Q(s, a) + \gamma\, \mathbb{E}_{s' \mid s, a}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q(s', a')\right]\right)^2\right] + \mathrm{const},$$

where the constant is the variance of the reward, which does not depend on $Q$; the first term is the mean squared Bellman error.

In this context, we model the structural function so that

$$f_{\mathrm{struct}}(s, a, s') = Q_{\theta}(s, a) - \gamma \sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q_{\theta}(s', a'). \tag{13}$$

It follows that

$$\mathbb{E}[f_{\mathrm{struct}}(X) \mid Z] = Q_{\theta}(s, a) - \gamma\, \mathbb{E}_{s' \mid s, a}\!\left[\sum_{a' \in \mathcal{A}} \pi(a' \mid s')\, Q_{\theta}(s', a')\right],$$

so that minimizing (2) under this model minimizes (11).

We will model $Q_{\theta}(s, a) = u^\top \psi_{\theta}(s, a)$. In this case, given stage 1 data $\{(s_i, a_i, s'_i)\}_{i=1}^{m}$, stage 1 regression becomes

$$\hat{\mathcal{L}}_1(V, \theta_Z) = \frac{1}{m} \sum_{i=1}^{m} \left\| \psi_{\theta}(s'_i, a'_i) - V \phi_{\theta_Z}(s_i, a_i) \right\|^2 + \lambda_1 \|V\|^2,$$

where we sample $a'_i \sim \pi(\cdot \mid s'_i)$. However, given the specific form (13) of the structural function, stage 2 is slightly modified and requires minimizing the loss

$$\hat{\mathcal{L}}_2(u, \theta) = \frac{1}{n} \sum_{j=1}^{n} \left( r_j - u^\top \left( \psi_{\theta}(s_j, a_j) - \gamma \hat{V} \phi_{\theta_Z}(s_j, a_j) \right) \right)^2 + \lambda_2 \|u\|^2.$$