Behaviour Policy Estimation in Off-Policy Policy Evaluation:
Calibration Matters
Abstract
In this work, we consider the problem of estimating a behaviour policy for use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is unknown. Via a series of empirical studies, we demonstrate how accurate OPE is strongly dependent on the calibration of estimated behaviour policy models: how precisely the behaviour policy is estimated from data. We show how powerful parametric models such as neural networks can result in highly uncalibrated behaviour policy models on a real-world medical dataset, and illustrate how a simple, non-parametric, k-nearest neighbours model produces better calibrated behaviour policy estimates and can be used to obtain superior importance sampling-based OPE estimates.
1 Introduction
In many decision-making contexts, one wishes to take advantage of already-collected data (for example, website interaction logs, patient trajectories, or robot trajectories) to estimate the value of a novel decision-making policy. This problem is known as Off-Policy Policy Evaluation (OPE), where we seek to determine the performance of an evaluation policy, given only data generated by a behaviour policy. Most OPE procedures (Precup, 2000; Jiang & Li, 2015; Thomas & Brunskill, 2016; Farajtabar et al., 2018) rely (at least partially) on the technique of Importance Sampling (IS) which, when used in RL, requires the behaviour policy to be known. However, for observational studies in domains such as healthcare, we do not have access to this information. One way to handle this is to estimate the behaviour policy from the data, and then use it to do importance sampling-based OPE. However, the quality of the resulting OPE estimate is critically dependent on the calibration of the behaviour policy: how precisely it is estimated from the data, and whether the probabilities of actions under the approximate behaviour policy model represent the true probabilities.
In this work, we evaluate the sensitivity of off-policy evaluation to calibration errors in the learned behaviour policy. In particular, we perform a series of careful empirical studies demonstrating that:

Uncalibrated behaviour policy models can result in highly inaccurate OPE in a simple, controlled navigation domain.

In a real-world sepsis management domain, powerful parametric models such as deep neural networks produce highly uncalibrated probability estimates.

A simple, non-parametric, k-nearest neighbours model is better calibrated than all the other parametric models in our medical domain, and using this as a behaviour policy model results in superior OPE.
2 Background
In the reinforcement learning (RL) problem, an agent’s interaction with an environment can be represented by a Markov Decision Process (MDP), defined by a tuple $(\mathcal{S}, \mathcal{A}, r, P, P_0, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $r$ is the reward function, $P$ is the transition probability distribution, $P_0$ is the initial state distribution, and $\gamma$ is the discount factor. A policy $\pi$ is defined as a mapping from states to actions, with $\pi(a \mid s)$ representing the probability of taking action $a$ in state $s$.
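Let $H = (s_0, a_0, r_0, \ldots, s_{T-1}, a_{T-1}, r_{T-1})$ be a trajectory generated when following policy $\pi$, and $R_H = \sum_{t=0}^{T-1} \gamma^t r_t$ be the return of trajectory $H$. We can evaluate a policy by considering the expected return over trajectories when following it: $V^{\pi} = \mathbb{E}_{H \sim \pi}[R_H]$. The expectation is taken over the probability distribution of trajectories under policy $\pi$. Let the value and action-value functions of a policy $\pi$ at a state $s$ or state-action pair $(s, a)$ be $V^{\pi}(s)$ and $Q^{\pi}(s, a)$ respectively. These are defined as the expected return of a trajectory starting at state $s$ or state-action pair $(s, a)$, and then following policy $\pi$. We can write $V^{\pi} = \mathbb{E}_{s_0 \sim P_0}[V^{\pi}(s_0)]$.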
In off-policy policy evaluation (OPE), we seek to estimate, with low mean squared error (MSE), the value $V^{\pi_e}$ of an evaluation policy $\pi_e$ given a set of trajectories $\mathcal{D}$ generated independently by following a (distinct) behaviour policy $\pi_b$.
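Defining the importance weight (Precup, 2000) $\rho_t = \prod_{i=0}^{t} \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)}$ (we assume henceforth that for all state-action pairs $(s, a)$, if $\pi_b(a \mid s) = 0$ then $\pi_e(a \mid s) = 0$), we can form the stepwise Weighted Importance Sampling (WIS) estimator of $V^{\pi_e}$: $\hat{V}_{\text{step-WIS}} = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \frac{\rho_t^{(i)}}{\sum_{j=1}^{n} \rho_t^{(j)}} r_t^{(i)}$, where $n$ is the number of trajectories and the superscript $(i)$ indexes trajectories. In this work, we consider using the Per-Horizon WIS (PHWIS) estimator, which can handle differing trajectory lengths (Doroudi et al., 2017), to evaluate medical treatment strategies for sepsis. We also provide results using the Per-Horizon Weighted Doubly Robust (PHWDR) estimator, which incorporates an approximate model of the MDP to lower the variance of value estimates (Jiang & Li, 2015; Thomas & Brunskill, 2016). Further information is in the supplementary material.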
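The stepwise WIS estimator can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in the experiments: it assumes equal-length trajectories stored as lists of `(state, action, reward)` tuples, and policies given as hypothetical callables `policy(action, state)` returning a probability.

```python
from typing import Callable, List, Sequence, Tuple

Step = Tuple[object, int, float]          # (state, action, reward)
Policy = Callable[[int, object], float]   # policy(action, state) -> probability


def cumulative_weights(traj: Sequence[Step], pi_e: Policy, pi_b: Policy) -> List[float]:
    """Return rho_t for every step: the running product of pi_e / pi_b."""
    weights, rho = [], 1.0
    for state, action, _ in traj:
        # Assumes pi_b(action, state) > 0 for every logged action.
        rho *= pi_e(action, state) / pi_b(action, state)
        weights.append(rho)
    return weights


def stepwise_wis(trajs: Sequence[Sequence[Step]], pi_e: Policy, pi_b: Policy,
                 gamma: float = 1.0) -> float:
    """Stepwise weighted importance sampling for equal-length trajectories."""
    rhos = [cumulative_weights(h, pi_e, pi_b) for h in trajs]
    horizon = len(trajs[0])
    value = 0.0
    for t in range(horizon):
        norm = sum(r[t] for r in rhos)
        if norm == 0.0:
            continue  # no logged trajectory is possible under pi_e at step t
        for h, r in zip(trajs, rhos):
            value += (gamma ** t) * (r[t] / norm) * h[t][2]
    return value
```

When the evaluation and behaviour policies coincide, every weight is 1 and the estimator reduces to the average discounted return, which is a convenient sanity check.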
3 Impact of Mis-Calibration: Toy Domain
We first consider the effect of poorly calibrated behaviour policy models on OPE in a synthetic domain. The domain is a continuous 2D map with a discrete action space $\mathcal{A}$ of five actions, representing a movement of one unit in one of the four coordinate directions or staying in the current position. Gaussian noise of zero mean and specifiable variance is added onto the state of the agent after each action. An agent starts in the top left corner of the domain and receives a positive reward within a given radius of the top right corner, and a negative reward within a given radius of the bottom left corner. We set the horizon to be 15 in all experiments. A k-Nearest Neighbours (kNN) model is used to estimate the behaviour policy distribution, and its accuracy is varied by adjusting the number of neighbours and training data points used.
The quality of OPE is strongly dependent on the quality of behaviour policy estimation.
Figure 1 illustrates this by relating the average absolute error in the behaviour policy estimate to the fractional error in OPE using the WIS estimator, for two different behaviour policies. The error is calculated with respect to using WIS with the true behaviour policy. Average absolute errors in behaviour policy models as small as 0.06 can incur errors of over 50% in the estimated value; having a well-calibrated model of the behaviour policy is therefore critical for good OPE.
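A purely illustrative calculation suggests why small per-step errors can matter so much. The numbers here are assumptions chosen to match the error scale above and the horizon of 15; they are not results from the experiments. Suppose the true behaviour policy assigns probability 0.20 to each action taken along a trajectory, while the estimated model $\hat{\pi}_b$ assigns 0.26. Writing $\hat{\rho}$ for the trajectory's importance weight computed with $\hat{\pi}_b$ and $\rho$ for the weight computed with the true $\pi_b$, the evaluation policy's probabilities cancel in the ratio:

```latex
\frac{\hat{\rho}}{\rho}
  = \prod_{t=0}^{14} \frac{\pi_b(a_t \mid s_t)}{\hat{\pi}_b(a_t \mid s_t)}
  = \left( \frac{0.20}{0.26} \right)^{15}
  \approx 0.02
```

A consistent per-step error of 0.06 therefore shrinks this trajectory's weight by a factor of roughly fifty. In practice the errors vary in sign across steps and WIS renormalises the weights, so the effect on the final estimate may be less extreme than in this one-sided case.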
4 Model calibration in the sepsis domain
As a case study, we consider the challenge of obtaining well-calibrated behaviour models on a real-world dataset, used in Komorowski et al. (2016) and Raghu et al. (2017), dealing with the medical treatment of sepsis patients in intensive care units (ICUs). We use the same framing as Raghu et al. (2017), where the medical treatment process for a sepsis patient is framed as a continuous state-space MDP. A patient’s state is represented as a vector of demographic features, vital signs, and lab values. Our state representation concatenates the previous three timesteps’ raw state information to the current time’s state vector to capture trends over time. The action space, $\mathcal{A}$, is of size 25 and is discretised over doses of two drugs commonly given to sepsis patients. The reward is positive at intermediate timesteps when the patient’s wellbeing improves, and negative when it deteriorates. At the terminal timestep of a patient’s trajectory, a positive reward is assigned for survival, and a negative reward otherwise.
4.1 Obtaining well-calibrated behaviour policy models
We consider modelling the behaviour policy, $\pi_b$, via supervised learning. Importantly, IS uses probabilities (rather than class labels) and hence we require a well-calibrated model, not just an accurate one. To evaluate calibration, we draw a series of test states from a held-out test set, and calculate the total variation distance between the predictive distribution over actions from the estimated model, $\hat{\pi}_b$, and a ground-truth distribution obtained by considering the empirical distribution over actions from the k-nearest neighbours of the state on the held-out test set, using a custom distance kernel that assesses physiological similarity. Intuitively, states that are physiologically similar should have similar treatment (behaviour policy) distributions. For more information, see the supplementary material.
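The calibration check described above can be sketched as follows. This is a minimal illustration, assuming the usual definition of total variation distance for discrete distributions (half the L1 distance) and distributions stored as dictionaries from action to probability; the function names are hypothetical.

```python
from typing import Dict, Hashable, Sequence

Dist = Dict[Hashable, float]  # maps action -> probability


def total_variation(p: Dist, q: Dist) -> float:
    """Half the L1 distance between two discrete distributions."""
    actions = set(p) | set(q)
    return 0.5 * sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in actions)


def mean_calibration_error(estimated: Sequence[Dist], target: Sequence[Dist]) -> float:
    """Average total variation distance over paired test-state distributions."""
    if len(estimated) != len(target) or not estimated:
        raise ValueError("need equally sized, non-empty lists of distributions")
    pairs = zip(estimated, target)
    return sum(total_variation(p, q) for p, q in pairs) / len(estimated)
```

For example, an overconfident prediction `{0: 0.9, 1: 0.1}` against a target of `{0: 0.5, 1: 0.3, 2: 0.2}` gives a distance of about 0.4, even though both distributions agree on the most likely action.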
Approximate kNN produces better calibrated probabilities than parametric models.
Table 1 shows the average total variation distance (over 500 test states) between the estimated and target behaviour distributions for different approximate behaviour policy models: logistic regression (LR), random forest (RF), neural network (NN), and an approximate kNN model using random projections (Indyk & Motwani, 1998) (used instead of full kNN for its computational efficiency). The parametric models are poorly calibrated, especially for sampled states with high severity, where there are fewer data points available for estimation.
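Table 1: Average total variation distance between estimated and target behaviour distributions, by severity.

| Severity | LR | RF | NN | Approx kNN |
|---|---|---|---|---|
| 0–4 | 0.249 | 0.214 | 0.213 | 0.129 |
| 5–9 | 0.269 | 0.254 | 0.246 | 0.152 |
| 10–13 | 0.309 | 0.309 | 0.399 | 0.210 |
| 14–23 | 0.356 | 0.337 | 0.426 | 0.199 |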
Neural networks can produce overconfident and incorrect probability estimates.
Figure 2 shows example predictive distributions over actions for the neural network and approximate kNN as compared to the ground truth, demonstrating overconfident predictions (a result noted by Guo et al. (2017)) and incorrect predictions produced by the neural network. Approx kNN may therefore be more appropriate as a behaviour policy model for OPE.
5 OPE in the sepsis domain
We now use these behaviour policy models for OPE in the sepsis domain. To obtain ground truth for evaluation, we divide our dataset into two subsets $\mathcal{D}_b$ and $\mathcal{D}_e$. We can use the behaviour policy from $\mathcal{D}_e$, denoted $\pi_e$, as the evaluation policy with $\mathcal{D}_b$. As we have trajectories with $\pi_e$ as the behaviour policy in $\mathcal{D}_e$, we can average returns on these trajectories to get an on-policy estimate of $V^{\pi_e}$. Low mean squared error between the OPE estimate and the on-policy estimate provides an indication of correctness.
Two methods of splitting the trajectories are considered: random and intervention splitting. In random splitting, we randomly select half the trajectories to go in one set, and half to go in the other. In intervention splitting, the evaluation set contains half of the patients who were never treated with vasopressors (chosen randomly from all such patients), and the training set contains the remainder of patients. For both methods, results are averaged over different behaviour/evaluation policy pairs – 50 for PHWIS and 10 for PHWDR.
In the limit of infinite data, random splitting results in identical behaviour and evaluation policies. In our setting, with limited data, the two policies are close (average total variation distance ) but this splitting method still permits basic assessment of OPE quality. The average total variation distance with intervention splitting is approximately 0.29.
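We estimate the MSE of $\hat{V}^{\pi_e}$ using a bootstrapped method:

Sample trajectories from the behaviour dataset $\mathcal{D}_b$.

Obtain $\hat{V}^{\pi_e}$ via an OPE method.

Repeat this process multiple times, yielding samples from the distribution of $\hat{V}^{\pi_e}$.

Compute the MSE between these samples and the on-policy estimate of $V^{\pi_e}$.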
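The bootstrap procedure above can be sketched as below. This is a hedged illustration only: the sample size, the number of repeats, and sampling with replacement are assumptions (the values used in the experiments are not restated here), and `ope_estimator` is a hypothetical callable standing in for any OPE method such as PHWIS.

```python
import random
from typing import Callable, List, Sequence


def bootstrap_mse(
    trajectories: Sequence[object],
    ope_estimator: Callable[[List[object]], float],
    on_policy_value: float,
    n_samples: int,
    n_repeats: int,
    seed: int = 0,
) -> float:
    """Bootstrap the MSE of an OPE estimate against an on-policy reference."""
    if n_samples <= 0 or n_repeats <= 0:
        raise ValueError("n_samples and n_repeats must be positive")
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_repeats):
        # Assumption: resample trajectories with replacement.
        sample = [rng.choice(trajectories) for _ in range(n_samples)]
        estimates.append(ope_estimator(sample))
    # Mean squared deviation of the bootstrap estimates from the reference.
    return sum((v - on_policy_value) ** 2 for v in estimates) / n_repeats
```

Fixing the seed makes the comparison between behaviour policy models repeatable, since each model is then evaluated on the same resampled sets.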
The approximate kNN behaviour policy model often results in the best OPE.
Table 2 presents the MSE when using the PHWIS and PHWDR estimators for OPE. The estimate of $Q^{\pi_e}$ used in the PHWDR estimator was obtained using Fitted Q Iteration (FQI) with random forests (Ernst et al., 2005). When using the PHWIS estimator, approximate kNN gives appreciably lower MSE than the neural network (NN), reinforcing the idea that better calibrated models can result in better OPE. The results with the PHWDR estimator do not show as clear a dependence on the behaviour policy. This is because the Approximate Model (AM) terms in one case (random splitting) give low MSE estimates (MSE = 0.177), and in the other case (intervention splitting) give high MSE estimates (MSE = 3.87). There is therefore less of a dependence on the behaviour policy; OPE is dominated by the AM terms.
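Table 2: MSE of OPE estimates for each behaviour policy model.

| | Approx kNN | NN |
|---|---|---|
| Random split, | 2.48 | 4.04 |
| Intervention split, | 3.90 | 3.90 |
| Random split, | 2.04 | 2.02 |
| Intervention split, | 2.04 | 4.65 |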
6 Conclusion
In this work, we considered the problem of behaviour policy estimation for Off-Policy Policy Evaluation (OPE), focusing on an application in healthcare: evaluating medical treatment strategies for patients with sepsis. Via a series of empirical studies, we showed how well-calibrated behaviour policy models are highly important for good-quality OPE, and that powerful parametric models such as neural networks can often give uncalibrated probability estimates. We demonstrated that a simple, non-parametric, k-nearest neighbours (kNN) behaviour policy model has better calibration than parametric models and that using this kNN model for OPE led to improved results in this real-world domain. The proposed procedure can be used in other situations where the behaviour policy is unknown, and could improve the quality of OPE estimates, which is an important step towards the use of reinforcement learning in real-world domains.
7 Acknowledgements
This work was supported in part by the Harvard Data Science Initiative, Siemens, and an NSF CAREER grant.
References
Doroudi, Shayan, Thomas, Philip S, and Brunskill, Emma. Importance sampling for fair policy selection. 2017.
Ernst, Damien, Geurts, Pierre, and Wehenkel, Louis. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6(Apr):503–556, 2005.
Farajtabar, Mehrdad, Chow, Yinlam, and Ghavamzadeh, Mohammad. More robust doubly robust off-policy evaluation. CoRR, abs/1802.03493, 2018.
Guo, Chuan, Pleiss, Geoff, Sun, Yu, and Weinberger, Kilian Q. On calibration of modern neural networks. In International Conference on Machine Learning, pp. 1321–1330, 2017.
Indyk, Piotr and Motwani, Rajeev. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing, pp. 604–613. ACM, 1998.
Jiang, N. and Li, L. Doubly Robust Off-policy Evaluation for Reinforcement Learning. CoRR, abs/1511.03722, 2015. URL http://arxiv.org/abs/1511.03722.
Komorowski, M., Gordon, A., Celi, L. A., and Faisal, A. A Markov Decision Process to suggest optimal treatment of severe infections in intensive care. In Neural Information Processing Systems Workshop on Machine Learning for Health, December 2016.
Marik, Paul E, Linde-Zwirble, Walter T, Bittner, Edward A, Sahatjian, Jennifer, and Hansell, Douglas. Fluid administration in severe sepsis and septic shock, patterns and outcomes: an analysis of a large national database. Intensive care medicine, 43(5):625–632, 2017.
Precup, Doina. Eligibility traces for off-policy policy evaluation. Citeseer, 2000.
Raghu, Aniruddh, Komorowski, Matthieu, Celi, Leo Anthony, Szolovits, Peter, and Ghassemi, Marzyeh. Continuous state-space models for optimal sepsis treatment - a deep reinforcement learning approach. arXiv preprint arXiv:1705.08422, 2017.
Thomas, Philip and Brunskill, Emma. Data-efficient off-policy policy evaluation for reinforcement learning. In International Conference on Machine Learning, pp. 2139–2148, 2016.
A Off-Policy Policy Evaluation estimators
In off-policy policy evaluation (OPE), we consider the situation where we would like to estimate the value $V^{\pi_e}$ of an evaluation policy $\pi_e$ given a set of trajectories $\mathcal{D}$ generated independently by following a (distinct) behaviour policy $\pi_b$. We would like the estimator $\hat{V}^{\pi_e}$ to have low mean squared error (MSE), defined as follows: $\text{MSE}(\hat{V}^{\pi_e}) = \mathbb{E}\left[(\hat{V}^{\pi_e} - V^{\pi_e})^2\right]$. Note that when we have $n$ trajectories $H_1, \ldots, H_n$ from $\pi_e$, we can form an estimate of $V^{\pi_e}$ using $\frac{1}{n} \sum_{i=1}^{n} R_{H_i}$, which is the Monte Carlo estimator.
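Let us define the quantity $\rho_t = \prod_{i=0}^{t} \frac{\pi_e(a_i \mid s_i)}{\pi_b(a_i \mid s_i)}$. This is the importance weight (Precup, 2000), and is equal to the ratio of the probability of the steps of trajectory $H$ up to time $t$ under $\pi_e$ to the probability under $\pi_b$. (We assume henceforth that for all state-action pairs $(s, a)$, if $\pi_b(a \mid s) = 0$ then $\pi_e(a \mid s) = 0$.) Using this definition, we can form the importance sampling estimator of $V^{\pi_e}$:
$$\hat{V}_{\text{IS}} = \frac{1}{n} \sum_{i=1}^{n} \rho_{T-1}^{(i)} R_{H_i}.$$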
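Let us also define $\hat{V}^{\pi_e}(s)$ and $\hat{Q}^{\pi_e}(s, a)$ to be estimates of the state and action value functions for policy $\pi_e$ respectively under the approximate model (AM) of the MDP, $\hat{M}$. We can use an approximate model to directly find $\hat{V}^{\pi_e}$. For example, we can write:
$$\hat{V}_{\text{AM}} = \frac{1}{n} \sum_{i=1}^{n} \hat{V}^{\pi_e}(s_0^{(i)}).$$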
To estimate the quantity $V^{\pi_e}$, prior work has mainly used one or both of the techniques of Importance Sampling (IS) and Approximate Model (AM) estimation (Thomas & Brunskill, 2016). The IS approach to evaluation relies on using the importance weights to adjust for the difference between the probability of a trajectory under the behaviour policy $\pi_b$ and the probability under the evaluation policy $\pi_e$. Two commonly used estimators in the IS family, which improve on the simple IS estimator, are the stepwise IS and stepwise WIS estimators, defined as follows (with $i$ indexing the trajectories in $\mathcal{D}$):
$$\hat{V}_{\text{step-IS}} = \frac{1}{n} \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \rho_t^{(i)} r_t^{(i)}, \qquad \hat{V}_{\text{step-WIS}} = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \frac{\rho_t^{(i)}}{\sum_{j=1}^{n} \rho_t^{(j)}} r_t^{(i)}.$$
The step-IS estimator is an unbiased estimator of $V^{\pi_e}$ but suffers from high variance (due to the product of importance weights). The step-WIS estimator is biased, but consistent, and has lower variance than step-IS. However, its variance can often still be unacceptably high (Thomas & Brunskill, 2016). These IS estimators can have significant bias when the behaviour policy is unknown.
In AM estimation, we use the approximate model to directly find $\hat{V}^{\pi_e}$, as defined earlier. It may be difficult to trust these estimators, however, given that we cannot always find their bias and variance.
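Doubly Robust methods (Jiang & Li, 2015; Thomas & Brunskill, 2016) combine IS and AM techniques together in order to reduce the variance of the resulting estimator. The Weighted Doubly Robust (WDR) estimator, which has demonstrated effective empirical performance (Thomas & Brunskill, 2016), is defined as follows, with $w_t^{(i)} = \rho_t^{(i)} / \sum_{j=1}^{n} \rho_t^{(j)}$ and $w_{-1}^{(i)} = 1/n$:
$$\hat{V}_{\text{WDR}} = \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t w_t^{(i)} r_t^{(i)} - \sum_{i=1}^{n} \sum_{t=0}^{T-1} \gamma^t \left( w_t^{(i)} \hat{Q}^{\pi_e}(s_t^{(i)}, a_t^{(i)}) - w_{t-1}^{(i)} \hat{V}^{\pi_e}(s_t^{(i)}) \right).$$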
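A compact sketch of the WDR estimator follows. It assumes the form given by Thomas & Brunskill (2016), with the initial weight set to $1/n$, equal-length trajectories of `(state, action, reward)` tuples, and hypothetical callables `pi_e(action, state)`, `pi_b(action, state)`, `q_hat(state, action)` and `v_hat(state)` for the policies and the approximate-model value estimates.

```python
from typing import Callable, Sequence, Tuple

Step = Tuple[object, int, float]  # (state, action, reward)


def weighted_doubly_robust(
    trajs: Sequence[Sequence[Step]],
    pi_e: Callable[[int, object], float],
    pi_b: Callable[[int, object], float],
    q_hat: Callable[[object, int], float],
    v_hat: Callable[[object], float],
    gamma: float = 1.0,
) -> float:
    """WDR: a WIS term plus a control variate built from q_hat and v_hat."""
    n, horizon = len(trajs), len(trajs[0])
    # rhos[i][t] is the cumulative importance weight of trajectory i at step t.
    rhos = []
    for traj in trajs:
        rho, row = 1.0, []
        for s, a, _ in traj:
            rho *= pi_e(a, s) / pi_b(a, s)
            row.append(rho)
        rhos.append(row)
    value = 0.0
    prev_w = [1.0 / n] * n  # assumed convention: w_{-1} = 1/n
    for t in range(horizon):
        norm = sum(row[t] for row in rhos)
        w = [row[t] / norm if norm > 0 else 0.0 for row in rhos]
        for i, traj in enumerate(trajs):
            s, a, r = traj[t]
            value += gamma ** t * (w[i] * (r - q_hat(s, a)) + prev_w[i] * v_hat(s))
        prev_w = w
    return value
```

With `q_hat` and `v_hat` identically zero the control variate vanishes and the function returns the stepwise WIS estimate; with a perfect model the importance-weighted residuals `r - q_hat` cancel and the estimate is driven by `v_hat`.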
Note that these estimators are valid for trajectories with the same length; extensions to handle trajectories of different length can be found in Doroudi et al. (2017). This is called the Per-Horizon extension (resulting in the PHIS and PHWIS estimators).
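Doroudi et al. (2017) defined the Per-Horizon Weighted Importance Sampling (PHWIS) estimator as follows:
$$\hat{V}_{\text{PHWIS}} = \sum_{l \in L} W_l \sum_{i : T_i = l} \frac{\rho_{l-1}^{(i)}}{\sum_{j : T_j = l} \rho_{l-1}^{(j)}} R_{H_i},$$
where $T_i$ is the length of trajectory $H_i$, $L$ is the set of all trajectory lengths, and $W_l$ is the fraction of the total number of trajectories with length equal to $l$:
$$W_l = \frac{|\{ i : T_i = l \}|}{n}.$$
This estimator has high variance; we can define a lower variance equivalent by considering a stepwise version:
$$\hat{V}_{\text{step-PHWIS}} = \sum_{l \in L} W_l \sum_{i : T_i = l} \sum_{t=0}^{l-1} \gamma^t \frac{\rho_t^{(i)}}{\sum_{j : T_j = l} \rho_t^{(j)}} r_t^{(i)}.$$
We can also introduce control variates into the estimator and form the Per-Horizon Weighted Doubly Robust (PHWDR) estimator, as follows. First, let us define $\hat{V}_{\text{WDR}}^{(l)}$ to be the WDR estimator given all trajectories of length $l$. We can write this as follows, with $w_t^{(i)} = \rho_t^{(i)} / \sum_{j : T_j = l} \rho_t^{(j)}$ and $w_{-1}^{(i)} = 1 / |\{ j : T_j = l \}|$:
$$\hat{V}_{\text{WDR}}^{(l)} = \sum_{i : T_i = l} \sum_{t=0}^{l-1} \gamma^t w_t^{(i)} r_t^{(i)} - \sum_{i : T_i = l} \sum_{t=0}^{l-1} \gamma^t \left( w_t^{(i)} \hat{Q}^{\pi_e}(s_t^{(i)}, a_t^{(i)}) - w_{t-1}^{(i)} \hat{V}^{\pi_e}(s_t^{(i)}) \right).$$
Then, it is straightforward to write, with $W_l$ as defined before:
$$\hat{V}_{\text{PHWDR}} = \sum_{l \in L} W_l \hat{V}_{\text{WDR}}^{(l)}.$$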
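The per-horizon idea can be sketched for the trajectory-level PHWIS estimator: group trajectories by length, apply WIS within each group, and combine the groups using the fraction of trajectories in each. This is an illustrative sketch under assumed data structures (lists of `(state, action, reward)` tuples and hypothetical `policy(action, state)` callables), not the code used for the reported results.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Sequence, Tuple

Step = Tuple[object, int, float]  # (state, action, reward)
Policy = Callable[[int, object], float]


def per_horizon_wis(trajs: Sequence[Sequence[Step]], pi_e: Policy, pi_b: Policy,
                    gamma: float = 1.0) -> float:
    """PHWIS: WIS within each trajectory-length group, weighted by group size."""
    n = len(trajs)
    groups: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for traj in trajs:
        rho, ret = 1.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi_e(a, s) / pi_b(a, s)  # full-trajectory importance weight
            ret += gamma ** t * r           # discounted return
        groups[len(traj)].append((rho, ret))
    value = 0.0
    for items in groups.values():
        fraction = len(items) / n           # share of trajectories with this length
        norm = sum(rho for rho, _ in items)
        if norm == 0.0:
            continue  # group has no support under pi_e; it contributes nothing
        value += fraction * sum(rho * ret for rho, ret in items) / norm
    return value
```

When the two policies are identical, each group's WIS estimate is its mean return, so the result equals the overall mean return across all trajectories.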
B Assessing Behaviour Policy Calibration
To evaluate the calibration of models, we can calculate the distance between the estimated behaviour policy and target behaviour policy. In order to calculate this distance, we require the target behaviour policy, which is unknown. However, we can use domain knowledge to inform the choice of the target distribution. In this medical setting, we propose that what governs the clinician’s choice of action is the physiological state of the patient, and that patients with similar physiological states will be treated in similar ways. This is a reasonable approximation, given that the state encodes the patient’s physiology effectively (Raghu et al., 2017).
We define similarity of patient states using a ‘physiological distance kernel’, which is based on Euclidean distance and upweights certain informative features of the patient’s state. Informative features were the patient’s SOFA score, lactate levels, fluid output, mean and diastolic blood pressure, PaO2/FiO2 ratio, chloride levels, weight, and age. These are clinically interpretable: the SOFA score and lactate levels provide indications of sepsis severity; careful monitoring of a patient’s fluid levels is essential when managing sepsis (Marik et al., 2017); and blood pressure indicates whether a patient is in septic shock. These features are upweighted by a factor of 2 in our distance kernel (where , the dimensionality of our state representation):
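One possible form of such a kernel is a weighted Euclidean distance. The sketch below is an assumption about that form: it multiplies the squared difference of each informative feature by the upweighting factor, which is only one way of applying a factor of 2. The function name and the index-set argument `informative` are hypothetical.

```python
import math
from typing import Sequence, Set


def physiological_distance(
    x: Sequence[float],
    y: Sequence[float],
    informative: Set[int],
    upweight: float = 2.0,
) -> float:
    """Weighted Euclidean distance that upweights selected feature indices."""
    if len(x) != len(y):
        raise ValueError("state vectors must have the same dimensionality")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(x, y)):
        # Assumption: the weight scales the squared difference, not the raw one.
        weight = upweight if i in informative else 1.0
        total += weight * (xi - yi) ** 2
    return math.sqrt(total)
```

With an empty `informative` set this reduces to the ordinary Euclidean distance, so the upweighting can be switched off to check how much it changes the neighbour sets.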
To find the target distribution for a given test state, we use a k-nearest neighbour (kNN) estimate with this distance kernel and form an empirical distribution of the actions taken from the test set neighbours. We consider 150 neighbours to provide reasonable coverage in the estimate. A Ball Tree data structure is used for efficiency. Querying this data structure is computationally expensive (1 second per query), so we sample 500 states for patients at different severities (range of SOFA score) and average results for these sets. We use the total variation distance, defined as $D_{TV}(p, q) = \frac{1}{2} \sum_{a \in \mathcal{A}} |p(a) - q(a)|$ for the discrete action space, as the distance metric. Our approximate behaviour policies are trained on a separate training dataset and we compare the predictive distribution over actions for the test states to the result from the kNN estimate.
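The kNN target distribution can be sketched with a brute-force neighbour search. Note the assumptions: this sketch sorts all states by distance instead of using a Ball Tree, defaults to plain Euclidean distance (`math.dist`) in place of the physiological kernel, and uses hypothetical argument names.

```python
import math
from collections import Counter
from typing import Callable, Dict, Sequence


def knn_action_distribution(
    query: Sequence[float],
    states: Sequence[Sequence[float]],
    actions: Sequence[int],
    k: int,
    distance: Callable[[Sequence[float], Sequence[float]], float] = math.dist,
) -> Dict[int, float]:
    """Empirical distribution of the actions taken in the k nearest states."""
    if len(states) != len(actions):
        raise ValueError("states and actions must be aligned")
    if not 0 < k <= len(states):
        raise ValueError("k must be between 1 and the number of states")
    # Brute-force search: rank every state by its distance to the query.
    order = sorted(range(len(states)), key=lambda i: distance(query, states[i]))
    counts = Counter(actions[i] for i in order[:k])
    return {action: count / k for action, count in counts.items()}
```

Passing a custom `distance` callable allows a domain-specific kernel to be substituted without changing the rest of the procedure; a tree-based index becomes worthwhile once the number of stored states makes the full sort too slow.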