Consistent On-Line Off-Policy Evaluation
Abstract
The problem of online off-policy evaluation (OPE) has been actively studied in the last decade due to its importance both as a stand-alone problem and as a module in a policy improvement scheme. However, most Temporal Difference (TD) based solutions ignore the discrepancy between the stationary distributions of the behavior and target policies and its effect on the convergence limit when function approximation is applied. In this paper we propose the Consistent Off-Policy Temporal Difference (COP-TD($\lambda$, $\beta$)) algorithm that addresses this issue and reduces this bias at some computational expense. We show that COP-TD($\lambda$, $\beta$) can be designed to converge to the same value that would have been obtained by using on-policy TD($\lambda$) with the target policy. Subsequently, the proposed scheme leads to a related and promising heuristic we call log-COP-TD($\lambda$, $\beta$). Both algorithms compare favorably in experiments with the current state-of-the-art online OPE algorithms. Finally, our formulation sheds some new light on the recently proposed Emphatic TD learning.
1 Introduction
Reinforcement Learning (RL) techniques have been successfully applied in fields such as robotics, games, marketing and more (Kober et al., 2013; Al-Rawi et al., 2015; Barrett et al., 2013). We consider the problem of off-policy evaluation (OPE): assessing the performance of a complex strategy without applying it. An OPE formulation is often considered in domains with limited sampling capability. For example, marketing and recommender systems (Theocharous & Hallak, 2013; Theocharous et al., 2015) directly relate policies to revenue. A more extreme example is drug administration, as there are only a few patients in the testing population, and suboptimal policies can have life-threatening effects (Hochberg et al., 2016). OPE can also be useful as a module for policy optimization in a policy improvement scheme (Thomas et al., 2015a).
In this paper, we consider the OPE problem in an online setup where each new sample is immediately used to update our current value estimate of some previously unseen policy. We propose and analyze a new algorithm called COP-TD($\lambda$, $\beta$) for estimating the value of the target policy; COP-TD($\lambda$, $\beta$) has the following properties:

Easy to understand and implement online.

Allows closing the gap to consistency, such that the limit point is the same as the one that would have been obtained by on-policy learning with the target policy.

Empirically comparable to state-of-the-art algorithms.
Our algorithm resembles the Emphatic TD of Sutton et al. (2015), which was extended by Hallak et al. (2015) to the general parametric form ETD($\lambda$, $\beta$). We clarify the connection between the algorithms and compare them empirically. Finally, we introduce an additional related heuristic called Log-COP-TD($\lambda$, $\beta$) and motivate it.
2 Notations and Background
We consider the standard discounted Markov Decision Process (MDP) formulation (Bertsekas & Tsitsiklis, 1996) with a single long trajectory. Let $M = (\mathcal{S}, \mathcal{A}, P, R, \zeta, \gamma)$ be an MDP where $\mathcal{S}$ is the finite state space and $\mathcal{A}$ is the finite action space. The parameter $P$ sets the transition probabilities $\Pr(s' \mid s, a)$ given the previous state $s \in \mathcal{S}$ and action $a \in \mathcal{A}$, where the first state is determined by the distribution $\zeta$. The parameter $R$ sets the reward distribution $r(s, a)$ obtained by taking action $a$ in state $s$, and $\gamma$ is the discount factor specifying the exponential reduction in reward with time.
The process advances as follows: A state $s_0$ is sampled according to the distribution $\zeta$. Then, at each time step $t$ starting from $t = 0$ the agent draws an action $a_t$ according to the stochastic behavior policy $\mu(a \mid s_t)$, a reward $r_t \doteq r(s_t, a_t)$ is accumulated by the agent, and the next state $s_{t+1}$ is sampled using the transition probability $\Pr(s' \mid s_t, a_t)$.
The expected discounted accumulated reward starting from a specific state and choosing an action by some policy $\pi$ is called the value function, which is also known to satisfy the Bellman equation in a vector form:
$$V^{\pi}(s) \doteq \mathbb{E}^{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} r_t \,\Big|\, s_0 = s\Big], \qquad T_{\pi} V \doteq R_{\pi} + \gamma P_{\pi} V, \qquad T_{\pi} V^{\pi} = V^{\pi},$$
where $R_\pi$ and $P_\pi$ are the policy induced reward vector and transition probability matrix respectively; $T_\pi$ is called the Bellman operator. The problem of estimating $V^\pi$ from samples is called policy evaluation. If the target policy $\pi$ is different than the behavior policy $\mu$ which generated the samples, the problem is called off-policy evaluation (OPE). The TD($\lambda$) (Sutton, 1988) algorithm is a standard solution to online on-policy evaluation: at each time step the temporal difference error updates the current value function estimate, such that eventually the stochastic approximation process will converge to the true value function. The standard form of TD($\lambda$) is given by:
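$$R^{(n)}_{t,s_t} = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \hat V_t(s_{t+n}), \qquad R^{\lambda}_{t,s_t} = (1-\lambda) \sum_{n=0}^{\infty} \lambda^{n} R^{(n+1)}_{t,s_t}, \qquad \hat V_{t+1}(s_t) = \hat V_t(s_t) + \alpha_t \big( R^{\lambda}_{t,s_t} - \hat V_t(s_t) \big), \qquad (1)$$
where $\alpha_t$ is the step size. The value $R^{(n)}_{t,s_t}$ is an estimate of the current state's value $V^{\pi}(s_t)$, looking forward $n$ steps, and $R^{\lambda}_{t,s_t}$ is an exponentially weighted average of all of these estimates going forward till infinity. Notice that Equation 1 does not specify an online implementation since $R^{(n)}_{t,s_t}$ depends on future observations; however, there exists a compact online implementation using eligibility traces (Bertsekas & Tsitsiklis (1996) for online TD($\lambda$), and Sutton et al. (2014), Sutton et al. (2015) for off-policy TD($\lambda$)). The underlying operator of TD($\lambda$) is given by:
$$T^{\lambda}_{\pi} V \doteq (1-\lambda) \sum_{n=0}^{\infty} \lambda^{n}\, T_{\pi}^{\,n+1} V,$$
and is a $\frac{\gamma(1-\lambda)}{1-\lambda\gamma}$-contraction (Bertsekas, 2012).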
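As an illustration of the TD($\lambda$) operator, the sketch below evaluates $T^{\lambda}_{\pi}$ in closed form and checks it against a truncated version of the series. The closed form $(I - \lambda\gamma P_\pi)^{-1}\big(R_\pi + (1-\lambda)\gamma P_\pi V\big)$ follows from summing the geometric series; the 4-state chain, the random seed and all function names are hypothetical choices made for this example only.

```python
import numpy as np

def td_lambda_operator(V, R_pi, P_pi, gamma, lam):
    """Closed form of (1 - lam) * sum_{n>=0} lam^n * T_pi^(n+1) V."""
    n_states = len(V)
    # Summing the geometric series gives
    # (I - lam*gamma*P)^-1 (R + (1 - lam)*gamma*P V).
    lhs = np.eye(n_states) - lam * gamma * P_pi
    rhs = R_pi + (1.0 - lam) * gamma * (P_pi @ V)
    return np.linalg.solve(lhs, rhs)

def td_lambda_operator_series(V, R_pi, P_pi, gamma, lam, n_terms=500):
    """Same operator via the truncated series (used only as a cross-check)."""
    total = np.zeros(len(V))
    TnV = np.asarray(V, dtype=float)
    for n in range(n_terms):
        TnV = R_pi + gamma * (P_pi @ TnV)      # now holds T_pi^(n+1) V
        total += (1.0 - lam) * lam**n * TnV
    return total

# Hypothetical 4-state Markov chain induced by some policy pi.
rng = np.random.default_rng(0)
P_pi = rng.random((4, 4))
P_pi /= P_pi.sum(axis=1, keepdims=True)
R_pi = rng.random(4)
gamma, lam = 0.9, 0.5

V_pi = np.linalg.solve(np.eye(4) - gamma * P_pi, R_pi)   # true value function
V_any = rng.normal(size=4)                                # arbitrary test vector
```

The true value function is a fixed point of the operator for every $\lambda$, which is why on-policy TD($\lambda$) without function approximation has the same limit regardless of $\lambda$.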
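We denote by $d_\mu(s)$ the stationary distribution over states induced by taking the policy $\mu$ and mark $D_\mu = \mathrm{diag}(d_\mu)$. Since we are concerned with the behavior at infinite horizon, we assume $\zeta(s) = d_\mu(s)$. In addition, we assume that the MDP is ergodic for the two specified policies $\mu, \pi$, so $d_\mu(s) > 0$ and $d_\pi(s) > 0$ for every $s \in \mathcal{S}$, and that the OPE problem is proper: $\pi(a \mid s) > 0 \Rightarrow \mu(a \mid s) > 0$.
When the state space is too large to hold $V^\pi$, a linear function approximation scheme is used: $V^\pi(s) \approx \theta^\top \phi(s)$, where $\theta$ is the optimized weight vector and $\phi(s)$ is the feature vector of state $s$ composed of $k$ features. We denote by $\Phi \in \mathbb{R}^{|\mathcal{S}| \times k}$ the matrix whose rows consist of the feature vectors for each state and assume its columns are linearly independent.
TD($\lambda$) can be adjusted to find the fixed point of $\Pi_{d_\pi} T^{\lambda}_{\pi}$, where $\Pi_{d_\pi}$ is the projection to the subspace spanned by the features with respect to the $d_\pi$-weighted norm (Sutton & Barto, 1998):
$$R^{(n)}_{t,s_t} = \sum_{i=0}^{n-1} \gamma^{i} r_{t+i} + \gamma^{n} \theta_t^\top \phi(s_{t+n}), \qquad \theta_{t+1} = \theta_t + \alpha_t \big( R^{\lambda}_{t,s_t} - \theta_t^\top \phi(s_t) \big)\, \phi(s_t).$$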
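Finally, we define OPE-related quantities:
$$\rho_t \doteq \frac{\pi(a_t \mid s_t)}{\mu(a_t \mid s_t)}, \qquad \Gamma^{n}_{t} \doteq \prod_{i=0}^{n-1} \rho_{t-1-i}, \qquad \rho_d(s) \doteq \frac{d_\pi(s)}{d_\mu(s)}; \qquad (2)$$
we call $\rho_d$ the covariate shift ratio (as denoted under different settings by Hachiya et al. (2012)).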
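To make these quantities concrete, the following sketch computes the stationary distributions and the covariate shift ratio for a small MDP when the model is known. The MDP, the two policies and the helper names are hypothetical and serve only as an illustration; in the online setting studied here the model is not available and $\rho_d$ has to be estimated from samples.

```python
import numpy as np

def induced_transition_matrix(P_sa, policy):
    """P_pi[s, s'] = sum_a policy[s, a] * P_sa[s, a, s']."""
    return np.einsum("sa,sat->st", policy, P_sa)

def stationary_distribution(P):
    """Stationary distribution d (d^T P = d^T, sum(d) = 1) of an ergodic chain."""
    n = P.shape[0]
    # Stack the stationarity equations with the normalization constraint.
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def n_step_ratio(rhos, t, n):
    """Gamma_t^n: product of the n one-step ratios rho_{t-n}, ..., rho_{t-1}."""
    return float(np.prod(rhos[t - n:t]))

# Hypothetical MDP with 4 states and 2 actions.
rng = np.random.default_rng(0)
n_states, n_actions = 4, 2
P_sa = rng.random((n_states, n_actions, n_states))
P_sa /= P_sa.sum(axis=2, keepdims=True)
mu = np.tile([0.5, 0.5], (n_states, 1))   # behavior policy
pi = np.tile([0.2, 0.8], (n_states, 1))   # target policy

d_mu = stationary_distribution(induced_transition_matrix(P_sa, mu))
d_pi = stationary_distribution(induced_transition_matrix(P_sa, pi))
rho_d = d_pi / d_mu                        # covariate shift ratio
```

Note that $\sum_s d_\mu(s)\rho_d(s) = \sum_s d_\pi(s) = 1$, a normalization that any estimate of $\rho_d$ should also respect.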
We summarize the assumptions used in the proofs:

1. Under both policies the induced Markov chain is ergodic.

2. The first state $s_0$ is distributed according to the stationary distribution of the behavior policy $d_\mu(s)$.

3. The problem is proper: $\pi(a \mid s) > 0 \Rightarrow \mu(a \mid s) > 0$.

4. The feature matrix $\Phi$ has full rank $k$.
Assumption 1 is commonly used for convergence theorems as it ensures that the value function is well defined on all states regardless of the initial sampled state. Assumption 2 can be relaxed since we are concerned with the long-term properties of the algorithm past its mixing time; we require it for clarity of the proofs. Assumption 3 is required so that the importance sampling ratios will be well defined. Assumption 4 guarantees that the optimal $\theta$ is unique, which greatly simplifies the proofs.
3 Previous Work
We can roughly categorize previous OPE algorithms into two main families. The first consists of gradient-based methods that perform stochastic gradient descent on the error terms they want to minimize. These include GTD (Sutton et al., 2009a), GTD-2, TDC (Sutton et al., 2009b) and HTD (White & White, 2016). The main disadvantages of gradient-based methods are (A) they usually update an additional error-correcting term, which means another step-size parameter needs to be controlled; and (B) they rely on estimating non-trivial terms, an estimate that tends to converge slowly. The other family uses importance sampling (IS) methods that correct the gains between on-policy and off-policy updates using the IS ratios $\rho_t$. Among these are full IS (Precup et al., 2001) and ETD($\lambda$, $\beta$) (Sutton et al., 2015). These methods are characterized by the bias-variance trade-off they resort to: navigating between biased convergent values (or even divergence), and very slow convergence stemming from the high variance of the IS correcting factors (the $\rho_t$ products). There are also a few algorithms that fall between the two, for example TO-GTD (van Hasselt et al., 2014) and WIS-TD($\lambda$) (Mahmood & Sutton, 2015).
A comparison of these algorithms in terms of convergence rate, synergy with function approximation and more is available in (White & White, 2016; Geist & Scherrer, 2014). We focus in this paper on the limit point of the convergence. For most of the aforementioned algorithms, the process was shown to converge almost surely to the fixed point of the projected Bellman operator $\Pi_d T_\pi$ where $d$ is some stationary distribution (usually $d_\mu$), however the $d$ in question was never
4 Motivation
Here we provide a motivating example showing that even in simple cases with “close” behavior and target policies, the two induced stationary distributions can differ greatly. Choosing a specific linear parameterization further emphasizes the difference between applying on-policy TD with the target policy, and applying inconsistent off-policy TD.
Assume a chain MDP with numbered states , where from each state you can either move left to state , or right to state . If you’ve reached the beginning or the end of the chain (states or ) then taking a step further does not affect your location. Assume the behavior policy moves left with probability , while the target policy moves right with probability . It is easy to see that the stationary distributions are given by:
For instance, if we have a length chain with , for the rightmost state we have . Let’s set the reward to be for the right half of the chain, so the target policy is better since it spends more time in the right half. The value of the target policy in the edges of the chain for is .
Now what happens if we try to approximate the value function using one constant feature $\phi(s) = 1$? The fixed point of $\Pi_{d_\mu} T_\pi$ is , while the fixed point of $\Pi_{d_\pi} T_\pi$ is – a substantial difference. The reason for this difference lies in the emphasis each projection puts on the states: according to $\Pi_{d_\mu}$, the important states are in the left half of the chain, those with a low value, and therefore the value estimation of all states is low. However, according to $\Pi_{d_\pi}$ the important states are concentrated on the right part of the chain since the target policy will visit these more often. Hence, the estimation error is emphasized on the right part of the chain and the value estimation is higher. When we wish to estimate the value of the target policy, we want to know what will happen if we deploy it instead of the behavior policy, thus taking the fixed point of $\Pi_{d_\pi} T_\pi$ better represents the off-policy evaluation solution.
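The effect described in this example can be reproduced with a few lines of code. The sketch below uses hypothetical values for the chain length, the left/right probabilities and the discount factor (they are assumptions of this illustration, not necessarily the values used in the example above). With a single constant feature, the fixed point of $\Pi_d T_\pi$ reduces to $\theta = d^\top R_\pi / (1 - \gamma)$, so only the weighting distribution $d$ matters.

```python
import numpy as np

def chain_transition_matrix(n_states, p_left):
    """Chain where each step moves left w.p. p_left, right otherwise;
    stepping past either end leaves the state unchanged."""
    P = np.zeros((n_states, n_states))
    for s in range(n_states):
        P[s, max(s - 1, 0)] += p_left
        P[s, min(s + 1, n_states - 1)] += 1.0 - p_left
    return P

def chain_stationary_distribution(n_states, p_left):
    """Detailed balance gives d(s + 1) / d(s) = (1 - p_left) / p_left."""
    d = ((1.0 - p_left) / p_left) ** np.arange(n_states)
    return d / d.sum()

def constant_feature_fixed_point(d, R, gamma):
    """Fixed point of Pi_d T_pi when Phi is a single column of ones."""
    return float(d @ R) / (1.0 - gamma)

# Hypothetical parameters chosen for illustration.
n_states, gamma = 100, 0.99
p_left_mu, p_left_pi = 0.51, 0.49           # behavior leans left, target right
R = (np.arange(n_states) >= n_states // 2).astype(float)  # reward on right half

d_mu = chain_stationary_distribution(n_states, p_left_mu)
d_pi = chain_stationary_distribution(n_states, p_left_pi)
theta_mu = constant_feature_fixed_point(d_mu, R, gamma)   # inconsistent limit
theta_pi = constant_feature_fixed_point(d_pi, R, gamma)   # on-policy limit
print(theta_mu, theta_pi)
```

Even though the two policies differ only slightly in each state, the two weightings place their mass on opposite halves of the chain, and the two limits differ accordingly.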
5 COP-TD(0, $\beta$)
Most off-policy algorithms multiply the TD summand of TD($\lambda$) with some value that depends on the history and the current state. For example, full IS-TD by Precup et al. (2001) examines the ratio between the probabilities of the trajectory under both policies:
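$$\frac{\Pr(s_0, a_0, \dots, s_t, a_t \mid \pi)}{\Pr(s_0, a_0, \dots, s_t, a_t \mid \mu)} = \prod_{m=0}^{t} \rho_m = \Gamma^{t}_{t}\, \rho_t. \qquad (3)$$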
In problems with a long horizon, or those that start from the stationary distribution, we suggest using the time-invariant covariate shift $\rho_d$ multiplied by the current $\rho_t$. The intuition is the following: we would prefer using the probabilities ratio given in Equation 3, but it has very high variance, and after many time steps we might as well look at the stationary distribution ratio instead. This direction leads us to the following update equations:
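$$\theta_{t+1} = \theta_t + \alpha_t\, \rho_d(s_t)\, \rho_t \big( r_t + \gamma\, \theta_t^\top \phi(s_{t+1}) - \theta_t^\top \phi(s_t) \big)\, \phi(s_t). \qquad (4)$$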
Lemma 1.
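If the step sizes $\alpha_t$ satisfy $\sum_{t} \alpha_t = \infty$ and $\sum_{t} \alpha_t^2 < \infty$, then the process described by Eq. (4) converges almost surely to the fixed point of $\Pi_{d_\pi} T_\pi$.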
The proof follows the ODE method (Kushner & Yin, 2003) similarly to Tsitsiklis & Van Roy (1997) (see the appendix for more details).
Since $\rho_d$ is generally unknown, it is estimated using an additional stochastic approximation process. In order to do so, we note the following lemma:
Lemma 2.
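Let $\hat\rho_d$ be an unbiased estimate of $\rho_d$, and for every $n = 0, 1, \dots, t$ define $\tilde\Gamma^{n}_{t} \doteq \hat\rho_d(s_{t-n})\, \Gamma^{n}_{t}$. Then:
$$\mathbb{E}_\mu\big[\tilde\Gamma^{n}_{t} \mid s_t\big] = \rho_d(s_t).$$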
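For any state $s_t$ there are $t$ such quantities $\{\tilde\Gamma^{n}_{t}\}_{n=1}^{t}$, where we propose to weight them similarly to TD($\lambda$):
$$\Gamma^{\beta}_{t} = (1-\beta) \sum_{n=0}^{\infty} \beta^{n}\, \tilde\Gamma^{n+1}_{t}, \qquad \hat\rho_d(s_t) \leftarrow \hat\rho_d(s_t) + \alpha_t \big( \Gamma^{\beta}_{t} - \hat\rho_d(s_t) \big). \qquad (5)$$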
Note that $\rho_d$, unlike $V$, is restricted to a closed set since its $d_\mu$-weighted linear combination is equal to $1$ and all of its entries are non-negative. We denote this $d_\mu$-weighted simplex by $\Delta_{d_\mu}$, and let $\Pi_{\Delta_{d_\mu}}$ be the (nonlinear) projection to this set with respect to the Euclidean norm ($\Pi_{\Delta_{d_\mu}}$ can be calculated efficiently, Chen & Ye (2011)). Now, we can devise a TD algorithm which estimates $\rho_d$ and uses it to find $\theta$, which we call COP-TD(0, $\beta$) (Consistent Off-Policy TD).
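A minimal sketch of a Euclidean projection onto a weighted simplex $\{x : x \ge 0,\ w^\top x = 1\}$ is given below. It is not the algorithm of Chen & Ye (2011), which treats the standard simplex; instead it uses the KKT form $x_i = \max(y_i - \tau w_i, 0)$ and finds the multiplier $\tau$ by bisection. The function name and the test vectors are hypothetical.

```python
import numpy as np

def project_weighted_simplex(y, w, n_iter=200):
    """Euclidean projection of y onto {x >= 0, w @ x = 1}, assuming w > 0.

    The KKT conditions give x_i = max(y_i - tau * w_i, 0), where tau is
    chosen so that w @ x = 1. The weighted mass is continuous and
    non-increasing in tau, so tau is located by bisection.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)

    def mass(tau):
        return w @ np.maximum(y - tau * w, 0.0)

    lo = np.max((y - 1.0 / w) / w)   # mass(lo) >= 1
    hi = np.max(y / w)               # mass(hi) == 0
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass(mid) > 1.0:
            lo = mid
        else:
            hi = mid
    return np.maximum(y - 0.5 * (lo + hi) * w, 0.0)

# Hypothetical weights (playing the role of d_mu) and an unconstrained estimate.
d = np.array([0.5, 0.3, 0.2])
y = np.array([3.0, -1.0, 0.5])
x = project_weighted_simplex(y, d)
```

Bisection is chosen here for brevity; a sort-based search over the breakpoints of the mass function would give an exact answer in $O(|\mathcal{S}| \log |\mathcal{S}|)$ time.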
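Similarly to the Bellman operator for TD-learning, we define the underlying COP-operator $Y$ and its $\beta$ extension:
$$Y u \doteq D_\mu^{-1} P_\pi^\top D_\mu\, u, \qquad Y^{\beta} \doteq (1-\beta) \sum_{n=0}^{\infty} \beta^{n} Y^{n+1} = (1-\beta)\, Y (I - \beta Y)^{-1}. \qquad (6)$$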
The following Lemma may give some intuition on the convergence of the estimation process:
Lemma 3.
Under the ergodicity assumption, denote the eigenvalues of by . Then is a contraction in the norm on the orthogonal subspace to , and is a fixed point of .
The technical proof is given in the appendix.
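The fixed-point property can be checked numerically. The sketch below assumes the COP operator takes the matrix form $Y = D_\mu^{-1} P_\pi^\top D_\mu$, which is what the $n = 1$ case of Lemma 2 gives when written over all states, and $Y^\beta = (1-\beta) Y (I - \beta Y)^{-1}$; both explicit forms, the random MDP and the variable names are assumptions of this illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

# Hypothetical MDP with 5 states and 2 actions.
rng = np.random.default_rng(1)
n_states, n_actions = 5, 2
P_sa = rng.random((n_states, n_actions, n_states))
P_sa /= P_sa.sum(axis=2, keepdims=True)
mu = np.tile([0.5, 0.5], (n_states, 1))
pi = np.tile([0.2, 0.8], (n_states, 1))
P_mu = np.einsum("sa,sat->st", mu, P_sa)
P_pi = np.einsum("sa,sat->st", pi, P_sa)
d_mu = stationary_distribution(P_mu)
d_pi = stationary_distribution(P_pi)
rho_d = d_pi / d_mu

# Assumed matrix form of the COP operator and its beta extension.
Y = np.diag(1.0 / d_mu) @ P_pi.T @ np.diag(d_mu)
beta = 0.5
Y_beta = (1.0 - beta) * Y @ np.linalg.inv(np.eye(n_states) - beta * Y)

# Repeated application starting inside the d_mu-weighted simplex.
u = np.ones(n_states)            # d_mu @ u == 1
for _ in range(200):
    u = Y_beta @ u
```

Since $Y$ is similar to $P_\pi^\top$, it shares the eigenvalues of $P_\pi$: a single eigenvalue equal to $1$, with $\rho_d$ as its eigenvector, and all others of modulus below $1$ under ergodicity. $Y$ also preserves $d_\mu^\top u$, so the iteration above stays on the normalization constraint and approaches $\rho_d$.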
Theorem 1.
If the step sizes satisfy , and for some constant and every and , then after applying COPTD(, ), converges to almost surely, and converges to the fixed point of .
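The tabular estimation process can be sketched for the simplest case $\beta = 0$, where each transition moves $\hat\rho_d(s_{t+1})$ toward $\rho_t \hat\rho_d(s_t)$. This is a hedged illustration rather than the paper's Algorithm 1: the 5-state chain, the constant step size, and the replacement of the simplex projection by a division with the empirical state frequencies are all assumptions made to keep the example short.

```python
import numpy as np

def estimate_covariate_shift(n_states=5, p_left_mu=0.6, p_left_pi=0.4,
                             n_steps=100_000, alpha=0.01, seed=0):
    """Online tabular estimate of rho_d on a chain, using one-step targets."""
    rng = np.random.default_rng(seed)
    rho_hat = np.ones(n_states)
    counts = np.zeros(n_states)
    s = 0
    for _ in range(n_steps):
        counts[s] += 1
        # Sample the action from the behavior policy and form the IS ratio.
        go_left = rng.random() < p_left_mu
        if go_left:
            rho = p_left_pi / p_left_mu
            s_next = max(s - 1, 0)
        else:
            rho = (1.0 - p_left_pi) / (1.0 - p_left_mu)
            s_next = min(s + 1, n_states - 1)
        # Move the estimate at the *next* state toward rho_t * rho_hat(s_t).
        rho_hat[s_next] += alpha * (rho * rho_hat[s] - rho_hat[s_next])
        # Heuristic normalization standing in for the simplex projection:
        # enforce sum_s d_hat(s) * rho_hat(s) = 1 with empirical frequencies.
        d_hat = counts / counts.sum()
        rho_hat /= d_hat @ rho_hat
        s = s_next
    return rho_hat, counts / counts.sum()

rho_hat, d_hat = estimate_covariate_shift()
print(rho_hat)
```

In this chain the target policy prefers moving right, so the true ratio grows geometrically with the position, and the estimate should reflect that ordering despite the noise introduced by the constant step size.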
Notice that COPTD(, ) given in Alg. 1 is infeasible in problems with large state spaces since . Like TD(), we can introduce linear function approximation: represent where is a weight vector and is the offpolicy feature vector and adjust the algorithm accordingly. For to still be contained in the set , we pose the requirement on the feature vectors: , and noted as the simplex projection . In practice, the latter requirement can be approximated: resulting in an extension of the previously applied estimation (step 5 in COPTD(, )). We provide the full details in Algorithm 2, which also incorporates nonzero similarly to ETD(,).
Theorem 2.
If the step sizes satisfy , and for some constant and every , then after applying COPTD(, ) with function approximation satisfying , converges to the fixed point of denoted by almost surely, and if converges it is to the fixed point of , where is a coordinatewise product of vectors.
The proof is given in the appendix and also follows the ODE method. Notice that a theorem is only given for $\lambda = 0$; convergence results for general $\lambda$ should follow the work by Yu (2015).
A possible criticism of COP-TD(0, $\beta$) is that it is not actually consistent, since in order to be consistent the original state space has to be small, in which case every off-policy algorithm is consistent as well. Still, the dependence on another set of features allows trading off accuracy against computational power in estimating $\rho_d$ and subsequently $V^\pi$. Moreover, smart feature selection may further reduce this gap, and COP-TD(0, $\beta$) is still the first algorithm addressing this issue. We conclude by linking the error in $\rho_d$'s estimate with the difference in the resulting $\theta$, which suggests that a well-estimated $\rho_d$ results in consistency:
Corollary 1.
Let . If , then the fixed point of COPTD(,) with function approximation satisfies the following, where is the induced norm:
(7) 
where , and sets the fixed point of the operator .
5.1 Relation to ETD($\lambda$, $\beta$)
Recently, Sutton et al. (2015) suggested an algorithm for off-policy evaluation called Emphatic TD. Their algorithm was later extended by Hallak et al. (2015) and renamed ETD($\lambda$, $\beta$), which was shown to perform extremely well empirically by White & White (2016). ETD($\lambda$, $\beta$) can be represented as:
(8) 
As mentioned before, ETD(, ) converges to the fixed point of (Yu, 2015), where . Error bounds can be achieved by showing that the operator is a contraction under certain requirements on and that the variance of is directly related to as well (Hallak et al., 2015) (and thus affects the convergence rate of the process).
When comparing ETD(,)’s form to COPTD(,)’s, instead of spending memory and time resources on a state/featuredependent , ETD(,) uses a onevariable approximation. The resulting is in fact a onestep estimate of , starting from (see Equations 15, 8), up to a minor difference: (which following our logic adds bias to the estimate
Unlike ETD($\lambda$, $\beta$), COP-TD($\lambda$, $\beta$)'s effectiveness depends on the available resources. The number of features can be adjusted accordingly to provide the most affordable approximation. The added cost is fine-tuning another step size, though $\beta$'s effect is less prominent.
6 The Logarithm Approach for Handling Long Products
We now present a heuristic algorithm which works similarly to COP-TD($\lambda$, $\beta$). Before presenting the algorithm, we explain the motivation behind it.
6.1 Statistical Interpretation of TD($\lambda$)
Konidaris et al. (2011) suggested a statistical interpretation of TD($\lambda$). They show that under several assumptions the TD($\lambda$) estimate $R^{\lambda}_{t,s_t}$ is the maximum likelihood estimator of $V(s_t)$ given $R^{(n)}_{t,s_t}$: (1) each $R^{(n)}_{t,s_t}$ is an unbiased estimator of $V(s_t)$; (2) the random variables $R^{(n)}_{t,s_t}$ are independent and specifically uncorrelated; (3) the random variables $R^{(n)}_{t,s_t}$ are jointly normally distributed; and (4) the variance of each $R^{(n)}_{t,s_t}$ is proportional to $\lambda^{-n}$.
Under Assumptions 1-3 the maximum likelihood estimator of $V(s_t)$ given its previous estimate can be represented as a linear convex combination of $R^{(n)}_{t,s_t}$ with weights:
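$$w_n = \frac{\mathrm{Var}\big[R^{(n)}_{t,s_t}\big]^{-1}}{\sum_{m=1}^{\infty} \mathrm{Var}\big[R^{(m)}_{t,s_t}\big]^{-1}}. \qquad (9)$$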
Subsequently, in Konidaris et al. (2011) Assumption 4 was relaxed and instead a closed form approximation of the variance was proposed. In a follow-up paper by Thomas et al. (2015b), the second assumption was also removed and the weights were instead given as $w = \frac{\Omega^{-1}\mathbf{1}}{\mathbf{1}^\top \Omega^{-1}\mathbf{1}}$, where the covariance matrix $\Omega$ can be estimated from the data, or otherwise learned through some parametric form.
While both the approximated variance and learned covariance matrix solutions improve performance on several benchmarks, the first uses a rather crude approximation, and the second is both state-dependent and based on noisy estimates of the covariance matrix. In addition, there are no efficient online implementations since all past weights must be recalculated to match a new sample. Still, the suggested statistical justification is a valuable tool in assessing the similar role of $\beta$ in ETD($\lambda$, $\beta$).
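The link between inverse-variance weighting and the exponential TD($\lambda$) weights can be verified directly. The sketch below assumes the weights are proportional to the inverse variances (the standard maximum likelihood result for independent Gaussian estimators) and that the variance of the $n$-step estimate is proportional to $\lambda^{-n}$; the covariance-based variant is the usual generalized least squares form. Function names and the value of $\lambda$ are hypothetical.

```python
import numpy as np

def inverse_variance_weights(variances):
    """Convex weights proportional to 1 / variance (independent estimators)."""
    precision = 1.0 / np.asarray(variances, dtype=float)
    return precision / precision.sum()

def gls_weights(cov):
    """Weights Omega^-1 1 / (1^T Omega^-1 1) for correlated estimators."""
    ones = np.ones(cov.shape[0])
    x = np.linalg.solve(cov, ones)
    return x / (ones @ x)

lam = 0.8
n = np.arange(1, 201)                          # n-step estimators, n = 1..200
variances = lam ** (-n.astype(float))          # Assumption 4: Var ~ lam^-n
w = inverse_variance_weights(variances)
td_lambda_weights = (1.0 - lam) * lam ** (n - 1)
```

Up to the negligible truncation at 200 terms, the two weight vectors coincide, which is the sense in which TD($\lambda$) is the maximum likelihood combination under the four assumptions. With a diagonal covariance matrix, the generalized least squares weights reduce to the same inverse-variance weights.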
6.2 Variance Weighted
As was shown by Konidaris et al. (2011), we can use statedependent weights instead of exponents to obtain better estimates. The second moments are given explicitly as follows
These can be estimated for each state separately. Notice that the variances increase exponentially depending on the largest eigenvalue of (as Assumption 4 dictates), but this is merely an asymptotic behavior and may be relevant only when the weights are already negligible. Hence, implementing this solution online should not be a problem with the varying weights, as generally only the first few of these are nonzero. While this solution is impractical in problems with large state spaces parameterizing or approximating these variances (similarly to Thomas et al. (2015b)) could improve performance in specific applications.
6.3 Log-COP-TD($\lambda$, $\beta$)
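Assumption 3 in the previous section is that the sampled estimators ($\tilde\Gamma^{n}_{t}$) are normally distributed. For on-policy TD($\lambda$), this assumption might seem not too harsh as the estimators $R^{(n)}_{t,s_t}$ represent growing sums of random variables. However, in our case the estimators $\tilde\Gamma^{n}_{t}$ are growing products of random variables. To correct this issue we can define new estimators using a logarithm on each $\tilde\Gamma^{n}_{t}$:
$$\log \rho_d(s_t) = \log \mathbb{E}_\mu\big[\tilde\Gamma^{n}_{t} \mid s_t\big] \approx \mathbb{E}_\mu\big[\log \tilde\Gamma^{n}_{t} \mid s_t\big] = \mathbb{E}_\mu\Big[\log \hat\rho_d(s_{t-n}) + \sum_{i=0}^{n-1} \log \rho_{t-1-i} \,\Big|\, s_t\Big]. \qquad (10)$$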
This approximation is crude: we could add terms reducing the error through Taylor expansion, but these would be complicated to deal with. Hence, we regard this method mainly as a well-motivated heuristic.
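The trade-off behind the logarithm can be illustrated with a toy calculation. The sketch below treats the one-step ratios as i.i.d. draws from a hypothetical two-action policy pair, which is a simplifying assumption (in an MDP the ratios depend on the visited states). It compares the variance of an $n$-step product of ratios with the variance of the corresponding sum of logs, and shows the bias that exchanging the logarithm and the expectation introduces.

```python
import numpy as np

# Hypothetical two-action example: behavior picks "left" w.p. 0.6, target w.p. 0.4.
mu = np.array([0.6, 0.4])
pi = np.array([0.4, 0.6])
rho = pi / mu                                  # one-step IS ratio per action

mean_rho = mu @ rho                            # E_mu[rho], always equals 1
second_moment = mu @ rho**2                    # E_mu[rho^2]
mean_log = mu @ np.log(rho)                    # E_mu[log rho], <= 0 by Jensen
var_log = mu @ np.log(rho)**2 - mean_log**2    # Var_mu[log rho]

def product_stats(n):
    """Mean and variance of a product of n i.i.d. one-step ratios."""
    return mean_rho**n, second_moment**n - mean_rho**(2 * n)

def log_sum_stats(n):
    """Mean and variance of the sum of n i.i.d. log-ratios."""
    return n * mean_log, n * var_log

for n in (1, 10, 50):
    print(n, product_stats(n), log_sum_stats(n))
```

The variance of the product grows exponentially in $n$ while the variance of the log-sum grows only linearly, which is the appeal of the logarithmic estimators. The price is the gap between $\log \mathbb{E}[\cdot]$ (zero here) and $\mathbb{E}[\log \cdot]$ (negative and growing with $n$), which is the crude part of the approximation.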
Notice that this formulation resembles the standard MDP formulation, only with the corresponding ”reward” terms going backward instead of forward, and no discount factor. Unfortunately, without a discount factor we cannot expect the estimated value to converge, so we propose using an artificial one . We can incorporate function approximation for this formulation as well. Unlike COPTD(, ), we can choose the features and weights as we wish with no restriction, besides the linear constraint on the resulting through the weight vector . This can be approximately enforced by normalizing using (which should equal if we were exactly correct). We call the resulting algorithm LogCOPTD(,).
6.4 Using the Original Features
An interesting phenomenon occurs when the behavior and target policies employ a feature based Boltzmann distribution for choosing the actions: , and , where a constant feature is added to remove the (possibly different) normalizing constant. Thus, , and LogCOPTD(,) obtains a parametric form that depends on the original features instead of a different set.
6.5 Approximation Hardness
As we propose to use linear function approximation for $\rho_d$ and $\log \rho_d$, one cannot help but wonder how hard it is to approximate these quantities, especially compared to the value function. The comparison between $V^\pi$ and $\rho_d$ is problematic for several reasons:

The ultimate goal is estimating $V^\pi$; approximation errors in $\rho_d$ are second order terms.

The value function $V^\pi$ depends on the policy-induced reward function and transition probability matrix, while $\rho_d$ depends on the stationary distributions induced by both policies. Since each depends on at least one distinct factor, we can expect different setups to result in varied approximation hardness. For example, if the reward function has a poor approximation then so will $V^\pi$, while extremely different behavior and target policies can cause $\rho_d$ to behave erratically.

Subsequently, the choice of features for approximating $V^\pi$ and $\rho_d$ can differ significantly depending on the problem at hand.
If we would still like to compare and , we could think of extreme examples:

When , , when then .

In the chain MDP example in Section 4 we saw that is an exponential function of the location in the chain. Setting reward in one end to will result in an exponential form for as well. Subsequently, in the chain MDP example approximating is easier than as we obtain a linear function of the position; This is not the general case.
7 Experiments
We have performed 3 types of experiments. Our first batch of experiments (Figure 1) demonstrates the accuracy of predicting by both COPTD(, ) and LogCOPTD(, ). We show two types of setups in which visualization of is relatively clear  the chain MDP example mentioned in Section 4 and the mountain car domain Sutton & Barto (1998) in which the state is determined by only two continuous variables  the car’s position and speed. The parameters and exhibited low sensitivity in these tasks so they were simply set to , we show the estimated after iterations. For the chain MDP (top two plots, notice the logarithmic scale) we first approximate without any function approximation (topleft) and we can see COPTD manages to converge to the correct value while LogCOPTD is much less exact. When we use linear feature space (constant parameter and position) LogCOPTD captures the true behavior of much better as expected. The two lower plots show the error (in color) in estimated for the mountain car with a pure exploration behavior policy vs. a target policy oriented at moving right. The zaxis is the same for both plots and it describes a much more accurate estimate of obtained through simulations. The features used were local state aggregation. We can see that both algorithms succeed similarly on the positionspeed pairs which are sampled often due to the behavior policy and the mountain. When looking at more rarely observed states, the estimate becomes worse for both algorithms, though LogCOPTD seems to be better performing on the spike at position .
Next we test the sensitivity of COPTD(, ) and LogCOPTD(,) to the parameters and (Figure 2) on two distinct toy examples  the chain MDP introduced before but with only 30 states with the positionlinear features, and a random MDP with 32 states, 2 actions and a bit binary feature vector along with a free parameter (this compact representation was suggested by White & White (2016) to approximate real world problems). The policies on the chain MDP were taken as described before, and on the random MDP a state independent / probability to choose an action by the behavior/target policy. As we can see, larger values of cause noisier estimations in the random MDP for COPTD(, ), but has little effect in other venues. As for  we can see that if it is too large or too small the error behaves suboptimally, as expected for the crude approximation of Equation 10. In conclusion, unlike ETD(, ), Log/COPTD(, ) are much less effected by , though should be tuned to improve results.
Our final experiment (Figure 3) compares our algorithms to ETD(, ) and GTD(, ) over 4 setups: chain MDP with 100 states with right half rewards with linear features, a 2 action random MDP with 256 states and binary features, acrobot (3 actions) and cartpole balancing (21 actions) Sutton & Barto (1998) with reset at success and state aggregation to states. In all problems we used the same features for and estimation, , constant step size for the TD process and results were averaged over 10 trajectories, other parameters (, , other step sizes, ) were swiped over to find the best ones. To reduce figure clutter we have not included standard deviations though the noisy averages still reflect the variance in the process. Our method of comparison on the first 2 setups estimates the value function using the suggested algorithm, and finds the weighted average of the error between and the onpolicy fixed point :
(11) 
where $\theta^*$ is the optimal $\theta$ obtained by on-policy TD using the target policy. On the latter continuous state problems we applied online TD on a different trajectory following the target policy, used the resulting value as ground truth and took the sum of squared errors with respect to it. The behavior and target policies for the chain MDP and random MDP are as specified before. For the acrobot problem the behavior policy is uniform over the 3 actions and the target policy chooses between these with probabilities . For the cartpole the action space is divided equally into 21 actions from $-1$ to $1$; the behavior policy chooses among these uniformly while the target policy is 1.5 times more prone to choosing a positive action than a negative one.
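For the tabular setups, this kind of comparison can be sketched as follows when the model is known. The code assumes that the error measure is a $d_\pi$-weighted squared difference between the estimated values $\Phi\theta$ and the on-policy fixed point $\Phi\theta^*$; the exact form of Equation 11, the random chains, the features and the function names are assumptions of this illustration.

```python
import numpy as np

def stationary_distribution(P):
    """Stationary distribution of an ergodic row-stochastic matrix P."""
    n = P.shape[0]
    A = np.vstack([P.T - np.eye(n), np.ones((1, n))])
    b = np.zeros(n + 1)
    b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A, b, rcond=None)
    return d

def projected_fixed_point(Phi, d, R_pi, P_pi, gamma):
    """theta solving Phi^T D (R_pi + gamma P_pi Phi theta - Phi theta) = 0."""
    D = np.diag(d)
    A = Phi.T @ D @ (np.eye(len(d)) - gamma * P_pi) @ Phi
    b = Phi.T @ D @ R_pi
    return np.linalg.solve(A, b)

def weighted_squared_error(theta, theta_star, Phi, d):
    """d-weighted squared difference between Phi theta and Phi theta_star."""
    diff = Phi @ (theta - theta_star)
    return float(d @ diff**2)

# Hypothetical induced chains for the target and behavior policies.
rng = np.random.default_rng(2)
n_states, gamma = 6, 0.9
P_pi = rng.random((n_states, n_states))
P_pi /= P_pi.sum(axis=1, keepdims=True)
P_mu = rng.random((n_states, n_states))
P_mu /= P_mu.sum(axis=1, keepdims=True)
R_pi = rng.random(n_states)
Phi = np.column_stack([np.ones(n_states), np.arange(n_states)])  # constant + position

d_pi = stationary_distribution(P_pi)
d_mu = stationary_distribution(P_mu)
theta_star = projected_fixed_point(Phi, d_pi, R_pi, P_pi, gamma)  # on-policy limit
theta_off = projected_fixed_point(Phi, d_mu, R_pi, P_pi, gamma)   # d_mu-weighted limit
err_off = weighted_squared_error(theta_off, theta_star, Phi, d_pi)
```

An algorithm whose limit is the $d_\mu$-weighted fixed point incurs the error `err_off` even with infinite data, whereas a consistent algorithm drives this error to zero.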
The experiments show that COP-TD($\lambda$, $\beta$) and Log-COP-TD($\lambda$, $\beta$) have comparable performance to ETD($\lambda$, $\beta$), with at least one of them performing better in every setup. The advantage of the new algorithms is especially visible in the chain MDP, which corresponds to a large discrepancy between the stationary distributions of the behavior and target policies. GTD($\lambda$) is consistently worse on the tested setups; this might be due to the large difference between the chosen behavior and target policies, which affects GTD($\lambda$) the most.
8 Conclusion
Research on off-policy evaluation has flourished in the last decade. While a plethora of algorithms were suggested so far, ETD($\lambda$, $\beta$) by Hallak et al. (2015) has perhaps the simplest formulation and theoretical properties. Unfortunately, ETD($\lambda$, $\beta$) does not converge to the same point achieved by online TD when linear function approximation is applied.
We address this issue with COP-TD($\lambda$, $\beta$) and prove that it can achieve consistency when used with a correct set of features, or at least allow trading off some of the bias by adding or removing features. Despite requiring a new set of features and calibrating an additional update function, COP-TD($\lambda$, $\beta$)'s performance does not depend as much on $\beta$ as ETD($\lambda$, $\beta$)'s, and it shows promising empirical results.
We offer a connection to the statistical interpretation of TD($\lambda$) that motivates our entire formulation. This interpretation leads to two additional approaches: (a) weighting the $\tilde\Gamma^{n}_{t}$ using estimated variances instead of exponents and (b) approximating $\log \rho_d$ instead of $\rho_d$; both approaches deserve consideration when facing a real application.
Appendix A

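Notation | Meaning
$\lambda$, $\beta$ | Free parameters of TD algorithms mentioned in the paper
$\mathcal{S}$ | State space
$\mathcal{A}$ | Action space
$P$ | Transition probability distribution
$R$ | Reward probability distribution
$\zeta$ | Distribution of the first state in the MDP
$\gamma$ | Discount factor
$r_t$ | Reward at time $t$, obtained at state $s_t$ and action $a_t$
$\mu$ | Behavior policy (which generated the samples)
$\pi$ | Target policy
$V^{\pi}(s)$ | Value function of state $s$ by policy $\pi$
$T$ | Bellman operator
$R_\pi$, $P_\pi$, $T_\pi$ | Induced reward vector, transition matrix and Bellman operator by policy $\pi$
$R^{(n)}_{t,s_t}$, $R^{\lambda}_{t,s_t}$ | Value function estimates used in TD($\lambda$)
$T^{\lambda}_{\pi}$ | Underlying TD($\lambda$) operator
$d_\mu$, $d_\pi$ | Induced stationary distributions on the state space by policies $\mu$, $\pi$
$\phi(s)$ | Feature vector of state $s$
$\theta$ | Weight vector for estimating $V$
$\rho_t$ | One-step importance sampling ratio
$\Gamma^{n}_{t}$ | $n$-steps importance sampling ratio
$\rho_d$ | Stationary distribution ratio
$\Phi$ | The feature matrix for each state
$\tilde\Gamma^{n}_{t}$ | Estimated probabilities ratio
$\Gamma^{\beta}_{t}$ | Weighted estimated probabilities ratio
$\Delta_{d_\mu}$ | $d_\mu$-weighted simplex
$\alpha_t$ | Learning rate
$Y$, $Y^{\beta}$ | COP operators, underlying COP-TD($\lambda$, $\beta$)
$\theta_\rho$ | Weight vector for estimating $\rho_d$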

Assumptions:

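1. Under both policies the induced Markov chain is ergodic.

2. The first state $s_0$ is distributed according to the stationary distribution of the behavior policy $d_\mu$.

3. The support of $\mu$ contains the support of $\pi$, i.e. $\pi(a \mid s) > 0 \Rightarrow \mu(a \mid s) > 0$.

4. The feature matrix $\Phi$ has full rank.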
A.1 Proof of Lemma 1
If the step sizes $\alpha_t$ satisfy $\sum_{t} \alpha_t = \infty$ and $\sum_{t} \alpha_t^2 < \infty$, then the process described by Equation 4 converges almost surely to the fixed point of $\Pi_{d_\pi} T_\pi$.
Proof.
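Similarly to on-policy TD, we define $A$ and $b$ through the expected update $\mathbb{E}_\mu[\theta_{t+1} - \theta_t \mid \theta_t] = \alpha_t (A \theta_t + b)$; the fixed point is the solution to $A\theta + b = 0$. First we find $A$ and show stability:
$$A = \mathbb{E}_\mu\Big[\rho_d(s_t)\, \rho_t\, \phi(s_t) \big(\gamma \phi(s_{t+1}) - \phi(s_t)\big)^\top\Big] = \sum_{s} d_\pi(s)\, \phi(s) \Big(\gamma \sum_{s'} [P_\pi]_{s,s'}\, \phi(s') - \phi(s)\Big)^\top = \Phi^\top D_\pi (\gamma P_\pi - I)\, \Phi. \qquad (12)$$
This is exactly the same $A$ we would have obtained from TD(0), and it is negative definite (see Sutton et al. (2015)). Similarly we can find $b$:
$$b = \mathbb{E}_\mu\big[\rho_d(s_t)\, \rho_t\, r_t\, \phi(s_t)\big] = \Phi^\top D_\pi R_\pi, \qquad (13)$$
and we obtained the same $b$ as on-policy TD(0) with $\pi$.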
Now we consider the noise of this off-policy TD, which is exactly the same noise as in on-policy TD, only multiplied by $\rho_d(s_t)\rho_t$. As long as the noise term of the ODE formulation (Kushner & Yin, 2003) is still bounded, the proof is exactly the same. According to Assumption 1, we know that $\rho_d$ is lower and upper bounded. By Assumption 3 we also know that $\rho_t$ is lower and upper bounded. Therefore the noise of the new process is bounded and the same a.s. convergence applies as for on-policy TD(0) (Tsitsiklis & Van Roy, 1997). Since $A$ and $b$ are the same as in on-policy TD(0) for the target policy $\pi$, the convergence is to the same fixed point. ∎