Finite Sample Analysis of the GTD Policy Evaluation Algorithms in Markov Setting
In reinforcement learning (RL), one of the key components is policy evaluation, which aims to estimate the value function (i.e., the expected long-term accumulated reward) of a policy. With a good policy evaluation method, the RL algorithms will estimate the value function more accurately and find a better policy. When the state space is large or continuous, Gradient-based Temporal Difference (GTD) policy evaluation algorithms with linear function approximation are widely used. Considering that the collection of evaluation data is both time and reward consuming, a clear understanding of the finite sample performance of policy evaluation algorithms is very important to reinforcement learning. Under the assumption that the data are generated i.i.d., previous work provided finite sample analysis of the GTD algorithms with constant step size by converting them into convex-concave saddle point problems. However, it is well known that in RL problems the data are generated from Markov processes rather than i.i.d. In this paper, in the realistic Markov setting, we derive finite sample bounds for general convex-concave saddle point problems, and hence for the GTD algorithms. We have the following discussions based on our bounds. (1) With variants of step sizes, GTD algorithms converge. (2) The convergence rate is determined by the step size, with the mixing time of the Markov process as the coefficient: the faster the Markov process mixes, the faster the convergence. (3) We explain that the experience replay trick is effective because it improves the mixing property of the Markov process. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
Yue Wang (School of Science, Beijing Jiaotong University; this work was done when the first author was visiting Microsoft Research Asia), Wei Chen (Microsoft Research), Yuting Liu (School of Science, Beijing Jiaotong University), Zhi-Ming Ma (Academy of Mathematics and Systems Science, Chinese Academy of Sciences), Tie-Yan Liu (Microsoft Research, Tie-Yan.Liu@microsoft.com)
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Reinforcement Learning (RL) () technologies are powerful tools for learning how to interact with environments, and have a variety of important applications, such as robotics, computer games, and so on (, , , ).
In an RL problem, an agent observes the current state, takes an action following a policy at the current state, receives a reward from the environment, and the environment transits to the next state in a Markovian manner; these steps are then repeated. The goal of RL algorithms is to find the optimal policy, which leads to the maximum long-term reward. The value function of a fixed policy at a state is defined as the expected long-term accumulated reward the agent would receive by following the fixed policy starting from this state. Policy evaluation aims to accurately estimate the values of all states under a given policy, and is a key component in RL (, ). A better policy evaluation method will help us to better improve the current policy and find the optimal policy.
When the state space is large or continuous, it is inefficient to represent the value function over all the states by a look-up table. A common approach is to extract features for states and use a parameterized function over the feature space to approximate the value function. In applications, both linear approximations and non-linear approximations (e.g., neural networks) of the value function are used. In this paper, we focus on the linear approximation (, , ). Leveraging the localization technique in , the results can be generalized to non-linear cases with extra effort. We leave this as future work.
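As a minimal illustration of the linear approximation described above (the feature vector and parameter values below are hypothetical, chosen only for the example), the value of a state is approximated by the inner product of its feature vector and a learned parameter vector:

```python
import numpy as np

def linear_value(phi_s, theta):
    """Approximate value of a state: V(s) ~ phi(s)^T theta."""
    return float(phi_s @ theta)

theta = np.array([0.5, -1.0, 2.0])   # learned parameters (hypothetical)
phi_s = np.array([1.0, 0.0, 0.5])    # features of some state s
value = linear_value(phi_s, theta)   # 0.5*1.0 + (-1.0)*0.0 + 2.0*0.5 = 1.5
```

Learning then reduces to estimating the parameter vector, whose dimension is that of the feature space rather than the (possibly huge) state space.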
In policy evaluation with linear approximation, there has been substantial work on the temporal-difference (TD) method, which uses the Bellman equation to update the value function during the learning process (, ). Recently,   proposed Gradient-based Temporal Difference (GTD) algorithms, which use gradient information of the error from the Bellman equation to update the value function. It has been shown that GTD algorithms achieve the lower bound of storage and computation complexity, making them powerful for handling high dimensional big data. Therefore, GTD algorithms are now widely used in policy evaluation problems and in the policy evaluation step of practical RL algorithms (, ).
However, we do not have sufficient theory about the finite sample performance of the GTD algorithms. To be specific, will the evaluation process converge as the number of samples increases? If yes, how many samples do we need to reach a target evaluation error? Will the step size in GTD algorithms influence the finite sample error? How can we explain the effectiveness of practical tricks, such as experience replay? Considering that the collection of evaluation data is very likely to be both time and reward consuming, a clear understanding of the finite sample performance of the GTD algorithms is very important to the efficiency of policy evaluation and of the entire RL algorithm.
Previous work () converted the objective function of the GTD algorithms into a convex-concave saddle point problem and conducted finite sample analysis for GTD with constant step size under the assumption that the data are generated i.i.d. However, in RL problems, the data are generated by an agent that interacts with the environment step by step, and the state transits in a Markovian manner as introduced previously. As a result, the data are generated from a Markov process and are not i.i.d. In addition, that work did not study decreasing step sizes, which are also commonly used in many gradient based algorithms (, , ). Thus, the results from previous work cannot provide satisfactory answers to the above questions about the finite sample performance of the GTD algorithms.
In this paper, we perform finite sample analysis for the GTD algorithms in the more realistic Markov setting. To achieve this goal, first of all, as in , we consider the stochastic gradient descent algorithms of general convex-concave saddle point problems, which include the GTD algorithms. The optimality of the solution is measured by the primal-dual gap (, ). The finite sample analysis for convex-concave optimization in the Markov setting is challenging. On one hand, in the Markov setting, the non-i.i.d. sampled gradients are no longer unbiased estimates of the gradients; thus, the proof techniques for the convergence of convex-concave problems in the i.i.d. setting cannot be applied. On the other hand, although SGD converges for convex optimization problems with Markov gradients, it is much more difficult to obtain the same results for the more complex convex-concave optimization problem.
To overcome these challenges, we design a novel decomposition of the error function (i.e., Eqn (7)). The intuition of the decomposition and the key techniques are as follows: (1) Although the samples are not i.i.d., for large enough , the sample at time  is "nearly independent" of the sample at time , and its distribution is "very close" to the stationary distribution. (2) We split the random variables in the objective related to the  operator and the variables related to the  operator into different terms in order to control them respectively. This is non-trivial, and we construct a sequence of auxiliary random variables to do so. (3) All the constructions above need careful treatment of measurability issues in the Markov setting. (4) We construct new martingale difference sequences and apply Azuma's inequality to derive the high-probability bound from the in-expectation bound.
Using the above techniques, we prove a novel finite sample bound for the convex-concave saddle point problem. Since the GTD algorithms are specific convex-concave saddle point optimization methods, we finally obtain the finite sample bounds for the GTD algorithms in the realistic Markov setting for RL. To the best of our knowledge, our analysis is the first to provide finite sample bounds for the GTD algorithms in the Markov setting.
We have the following discussions based on our finite sample bounds.
GTD algorithms do converge, under a flexible condition on the step size, i.e.,  as , where  is the step size. Most step sizes used in practice satisfy this condition.
The convergence rate is , where  is the mixing time of the Markov process and  is a constant. Different step sizes lead to different convergence rates.
The experience replay trick is effective, since it can improve the mixing property of the Markov process.
Finally, we conduct simulation experiments to verify our theoretical findings. All the conclusions from the analysis are consistent with our empirical observations.
In this section, we briefly introduce the GTD algorithms and related works.
2.1 Gradient-based TD algorithms
Consider the reinforcement learning problem with a Markov decision process (MDP) , where  is the state space,  is the action space,  is the transition matrix and  is the transition probability from state  to state  after taking action ,  is the reward function and  is the reward received at state  when taking action , and  is the discount factor. A policy function  indicates the probability of taking each action at each state. The value function for policy  is defined as: .
In order to perform policy evaluation in a large state space, states are represented by a feature vector , and a linear function  is used to approximate the value function. The evaluation error is defined as , which can be decomposed into an approximation error and an estimation error. In this paper, we focus on the estimation error with linear function approximation.
As we know, the value function in RL satisfies the following Bellman equation: , where  is called the Bellman operator for policy . Gradient-based TD (GTD) algorithms (including GTD and GTD2), proposed by  and , update the approximated value function by minimizing objective functions related to Bellman equation errors, i.e., the norm of the expected TD update (NEU) and the mean-square projected Bellman error (MSPBE), respectively (, ),
where is a diagonal matrix whose elements are , , and is a distribution over the state space .
Actually, the two objective functions in GTD and GTD2 can be unified as below
where in GTD, , in GTD2, is the importance weighting factor. Since the underlying distribution is unknown, we use the data to estimate the value function by minimizing the empirical estimation error, i.e.,
with as the parameter in the value function, as the auxiliary variable used in GTD algorithms.
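For concreteness, one step of the GTD2 update in these variables can be sketched as follows (the update rules follow the GTD2 form given by Sutton et al.; the transition, features, and step sizes alpha, beta below are toy values of our own):

```python
import numpy as np

def gtd2_step(theta, w, phi, reward, phi_next, gamma, alpha, beta):
    """One GTD2 update from a single observed transition.

    theta: value-function parameters; w: the auxiliary variable;
    phi, phi_next: feature vectors of the current and next state.
    """
    delta = reward + gamma * (phi_next @ theta) - phi @ theta   # TD error
    theta_new = theta + alpha * (phi - gamma * phi_next) * (phi @ w)
    w_new = w + beta * (delta - phi @ w) * phi
    return theta_new, w_new

theta, w = np.zeros(2), np.zeros(2)
theta, w = gtd2_step(theta, w,
                     phi=np.array([1.0, 0.0]), reward=1.0,
                     phi_next=np.array([0.0, 1.0]),
                     gamma=0.9, alpha=0.1, beta=0.1)
# On this first step theta is unchanged (w was zero), while w moves in the
# TD-error direction: w == [0.1, 0.0].
```

Note that both updates touch only vectors of the feature dimension, which is the source of the linear storage and computation complexity mentioned earlier.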
Therefore, we consider the general convex-concave stochastic saddle point problem as below
where  and  are bounded closed convex sets,  is a random variable with distribution , and the expected function  is convex in  and concave in . Denote , the gradient of  as , and the gradient of  as .
In the stochastic gradient algorithm, the model is updated as: , where is the projection onto and is the step size. After iterations, we get the model . The error of the model is measured by the primal-dual gap error
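The projected descent-ascent iteration just described can be sketched as follows (the bilinear-plus-quadratic test objective, the ball-shaped constraint sets, and the 1/sqrt(t) step size are illustrative assumptions of ours, not the paper's exact setup):

```python
import numpy as np

def project_ball(v, radius):
    """Euclidean projection onto {v : ||v||_2 <= radius}."""
    n = np.linalg.norm(v)
    return v if n <= radius else v * (radius / n)

def saddle_sgd(grad_x, grad_y, x0, y0, steps, radius=10.0):
    """Projected gradient descent in x / ascent in y with decreasing
    step size 1/sqrt(t); returns the averaged iterates."""
    x, y = x0.astype(float), y0.astype(float)
    x_sum, y_sum = np.zeros_like(x), np.zeros_like(y)
    for t in range(1, steps + 1):
        eta = 1.0 / np.sqrt(t)
        gx, gy = grad_x(x, y), grad_y(x, y)
        x = project_ball(x - eta * gx, radius)   # descent step in x
        y = project_ball(y + eta * gy, radius)   # ascent step in y
        x_sum += x
        y_sum += y
    return x_sum / steps, y_sum / steps

# Toy convex-concave objective f(x, y) = ||x||^2 + x.y - ||y||^2,
# whose unique saddle point is (0, 0).
gx = lambda x, y: 2 * x + y
gy = lambda x, y: x - 2 * y
x_bar, y_bar = saddle_sgd(gx, gy, np.ones(2), np.ones(2), steps=5000)
# x_bar and y_bar approach the saddle point (0, 0) as steps grow.
```

Averaging the iterates, rather than returning the last one, is what the primal-dual gap error of the averaged model is measured on.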
 proved that the estimation error of the GTD algorithms can be upper bounded by the corresponding primal-dual gap error multiplied by a factor. Therefore, we first derive the finite sample primal-dual gap error bound for the convex-concave saddle point problem, and then extend it to the finite sample estimation error bound for the GTD algorithms.
2.2 Related work
The TD algorithms for policy evaluation can be divided into two categories: gradient based methods and least-squares (LS) based methods (). Since LS based algorithms need  storage and computational complexity while GTD algorithms are both of  complexity, gradient based algorithms are more commonly used when the feature dimension is large. Thus, in this paper, we focus on the GTD algorithms.
 proposed the gradient-based temporal difference (GTD) algorithm for the off-policy policy evaluation problem with linear function approximation.  proposed the GTD2 algorithm, which shows faster convergence in practice.  connected the GTD algorithms to a convex-concave saddle point problem and derived a finite sample bound in both the on-policy and off-policy cases for constant step size in the i.i.d. setting.
3 Main Theorems
In this section, we present our main results. In Theorem 1, we present our finite sample bound for the general convex-concave saddle point problem; in Theorem 2, we provide the finite sample bounds for the GTD algorithms in both the on-policy and off-policy cases. Please refer to the supplementary materials for the complete proofs.
Our results are derived based on the following common assumptions (, , ). Please note that the bounded-data property in Assumption 4 in RL can guarantee the Lipschitz and smoothness properties in Assumptions 5-6 (please see Proposition 1).
Assumption 1 (Bounded parameter).
There exists , such that .
Assumption 2 (Step size).
The step size is non-increasing.
Assumption 3 (Problem solvable).
The matrices  and  in Problem (4) are non-singular.
Assumption 4 (Bounded data).
Features are bounded by , rewards are bounded by and importance weights are bounded by .
Assumption 5 (Lipschitz).
For -almost every , the function  is Lipschitz in both x and y, with finite constants , respectively. We denote .
Assumption 6 (Smooth).
For -almost every , the partial gradient function of  is Lipschitz in both x and y with finite constants , respectively. We denote .
For a Markov process, the mixing time characterizes how fast the process converges to its stationary distribution. Following the notation of , we denote the conditional probability distribution as  and the corresponding probability density as . Similarly, we denote the stationary distribution of the data generating stochastic process as  and its density as .
The mixing time  of the sampling distribution P conditioned on the σ-field of the initial t samples is defined as: , where  is the conditional probability density at time , given .
Assumption 7 (Mixing time).
The mixing times of the stochastic process are uniform, i.e., there exist uniform mixing times  such that, with probability , we have  for all  and .
Please note that any time-homogeneous Markov chain with finite state space and any uniformly ergodic Markov chain with general state space satisfies the above assumption (). For simplicity and without confusion, we will denote  as .
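For a finite ergodic chain, how fast the condition above is met is governed by the spectral gap: a smaller second-largest eigenvalue modulus of the transition matrix means faster mixing. A small numeric check on two toy chains of our own choosing:

```python
import numpy as np

def second_largest_eigenvalue_modulus(P):
    """Return the second-largest |eigenvalue| of a row-stochastic matrix."""
    moduli = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return moduli[1]

fast = np.array([[0.5, 0.5], [0.5, 0.5]])   # mixes in a single step
slow = np.array([[0.9, 0.1], [0.1, 0.9]])   # "lazy" chain, mixes slowly
print(second_largest_eigenvalue_modulus(fast))   # 0.0
print(second_largest_eigenvalue_modulus(slow))   # 0.8
```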
3.1 Finite Sample Bound for Convex-concave Saddle Point Problem
Proof Sketch of Theorem 1.
By the definition of the error function in (6) and the property that  is convex in  and concave in , the expected error can be bounded as below
Denote , , . We construct , which is measurable with respect to . We have the following key decomposition of the right hand side of the above inequality; the intuition and explanation for this decomposition are given in the supplementary materials. For :
For term (a), we split it into three terms by the definition of the -norm and the iteration formula of , and then we bound its summation by . Actually, in the summation, the last two terms telescope so that only their first and last terms remain. Swapping the  and  operators and using the Lipschitz Assumption 5, the first term can be bounded. Term (c) includes the sum of , which might be large in the Markov setting. We reformulate it into the sum of  and use the smoothness Assumption 6 to bound it. Term (d) is similar to term (a) except that  is the gradient used to update . We can bound it similarly to term (a). Term (e) is a constant that does not change much with , and we can bound it directly through upper bounds on each of its own terms. Finally, we combine the upper bounds on all the terms, use the mixing time Assumption 7 to choose , and obtain the error bound in Theorem 1.
We decompose term (b) into a martingale part and an expectation part. By constructing a martingale difference sequence and using Azuma's inequality together with Assumption 7, we can bound term (b) and finally obtain the high probability error bound. ∎
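For reference, the concentration tool invoked here is the Azuma-Hoeffding inequality: if $(M_t)_{t \ge 0}$ is a martingale with bounded differences $|M_t - M_{t-1}| \le c_t$, then for every $\epsilon > 0$,

```latex
\mathbb{P}\left( \left| M_T - M_0 \right| \ge \epsilon \right)
  \;\le\; 2 \exp\!\left( - \frac{\epsilon^2}{2 \sum_{t=1}^{T} c_t^2} \right).
```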
Remark: (1) With , the error bound approaches  in order . (2) The mixing time influences the convergence rate: if the Markov process has a better mixing property with smaller , the algorithm converges faster. (3) If the data are generated i.i.d. (the mixing time ) and the step size is set to the constant , our bound reduces to , which is identical to previous work with constant step size in the i.i.d. setting (, ). (4) The high probability bound is similar to the expectation bound in the following Lemma 1 except for the last term. This is because we consider the deviation of the data around its expectation to derive the high probability bound.
Consider the convex-concave problem (2.5); under the same assumptions as in Theorem 1, we have
Proof Sketch of Lemma 1.
We start from the key decomposition (7), and this time bound each term in expectation. Each term can be bounded as before except for term (b). For term (b), since  is not related to the  operator and is measurable with respect to , we can bound term (b) through the definition of the mixing time and finally obtain the expectation bound. ∎
3.2 Finite Sample Bounds for GTD Algorithms
Since the GTD algorithms solve a specific convex-concave saddle point problem, the error bounds in Theorems 1 and 2 also provide error bounds for GTD, with the following specifications of the Lipschitz constants.
Suppose Assumptions 1-4 hold. Then we have the following finite sample bounds for the error of the GTD algorithms. In the on-policy case, the bound in expectation is  and the bound with probability  is ; in the off-policy case, the bound in expectation is  and the bound with probability  is , where  is the smallest eigenvalue of  and  respectively,  is the largest singular value of , .
We would like to make the following discussions for Theorem 2.
The GTD algorithms do converge in the realistic Markov setting. As in Theorem 2, the bound in expectation is  and the bound with probability  is . If the step size makes  and  as , the GTD algorithms will converge. Additionally, in the high probability bound, if , then  dominates the order; if ,  dominates.
The setup of the step size can be flexible. Our finite sample bounds for the GTD algorithms converge to  if the step size satisfies  as . This condition on the step size is much weaker than the constant step size in previous work , and the commonly used step sizes all satisfy the condition. To be specific, for , the convergence rate is ; for , the convergence rate is ; for the constant step size, the optimal setup is , considering the trade-off between  and , and the convergence rate is .
The mixing time matters. If the data are generated from a Markov process with a smaller mixing time, the error bound will be smaller, and we need fewer samples to achieve a fixed estimation error. This finding can explain why the experience replay trick () works. With experience replay, we store the agent's experiences (or data samples) at each step, and randomly sample one from the pool of stored samples to update the policy function. By Theorems 1.19-1.23 of , it can be proved that for arbitrary , there exists  such that . That is to say, when the size of the stored samples is larger than , the mixing time of the new data process with experience replay is . Thus, the experience replay trick improves the mixing property of the data process, and hence improves the convergence rate.
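The trick discussed above can be sketched as a fixed-capacity buffer that stores transitions from the Markov data stream and serves uniformly random samples for updates, which weakens the temporal correlation of consecutive samples (class and method names below are our own):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=0):
        self.buf = deque(maxlen=capacity)   # oldest samples evicted first
        self.rng = random.Random(seed)

    def add(self, transition):
        self.buf.append(transition)

    def sample(self):
        """Draw one stored transition uniformly at random."""
        return self.rng.choice(self.buf)

buffer = ReplayBuffer(capacity=1000)
for t in range(10):
    buffer.add((t, "state_t", 0.0, "state_t+1"))   # placeholder transitions
drawn = buffer.sample()   # a uniformly chosen past transition
```

The larger the buffer relative to the chain's mixing time, the closer the sampled transitions are to being drawn from the stationary distribution.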
Other factors also influence the finite sample bound: (1) As the feature norm  increases, the finite sample bound increases. This is consistent with the empirical finding by  that the normalization of features is crucial for the estimation quality of GTD algorithms. (2) As the feature dimension  increases, the bound increases. Intuitively, we need more samples for a linear approximation in a higher dimensional feature space.
In this section, we report our simulation results to validate our theoretical findings. We consider the general convex-concave saddle problem,
where  is a  matrix and b is a  vector. Here we set . We conduct three experiments, setting the step size to ,  and  respectively. In each experiment we sample the data in three ways: from two Markov chains with different mixing times that share the same stationary distribution, or i.i.d. from the stationary distribution directly. We sample from the Markov chains using the MCMC Metropolis-Hastings algorithm. Specifically, noticing that the mixing time of a Markov chain is positively correlated with the second largest eigenvalue of its transition probability matrix (), we first construct two transition probability matrices with different second largest eigenvalues (both with 1001 states; the second largest eigenvalues are 0.634 and 0.31 respectively), and then use the Metropolis-Hastings algorithm to construct two Markov chains with the same stationary distribution.
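The chain construction above follows the standard Metropolis-Hastings recipe; a minimal sketch on a small state space (the 4-state target distribution and the random-walk proposal are illustrative choices of ours, not the 1001-state setup of the experiments):

```python
import numpy as np

def metropolis_hastings_chain(pi, n_steps, seed=0):
    """Simulate a chain on {0,...,len(pi)-1} whose stationary distribution
    is pi, using a symmetric +/-1 random-walk proposal with wrap-around."""
    rng = np.random.default_rng(seed)
    n = len(pi)
    state, chain = 0, [0]
    for _ in range(n_steps):
        proposal = (state + rng.choice([-1, 1])) % n
        # Accept with probability min(1, pi(proposal) / pi(state)).
        if rng.random() < min(1.0, pi[proposal] / pi[state]):
            state = proposal
        chain.append(state)
    return chain

pi = np.array([0.1, 0.2, 0.3, 0.4])   # target stationary distribution
chain = metropolis_hastings_chain(pi, n_steps=20000)
# The empirical state frequencies approach pi as the chain mixes.
```

Because the proposal is symmetric, the acceptance rule satisfies detailed balance with respect to pi, so pi is the stationary distribution by construction; varying the proposal changes the mixing time without changing pi.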
We run the gradient algorithm for the objective in (8) based on the simulation data, without and with experience replay trick. The primal-dual gap error curves are plotted in Figure 1.
We have the following observations. (1) The error curves converge in the Markov setting with all three setups of the step size. (2) The error curves for data generated from the process with the smaller mixing time converge faster, and the error curve for i.i.d. generated data converges fastest. (3) The error curves for different step sizes converge at different rates. (4) With the experience replay trick, the error curves in the Markov settings converge faster than before. All these observations are consistent with our theoretical findings.
In this paper, in the more realistic Markov setting, we proved finite sample bounds for convex-concave saddle point problems, both with high probability and in expectation. We then obtained finite sample bounds for the GTD algorithms in both the on-policy and off-policy cases, since the GTD algorithms solve specific convex-concave saddle point problems. Our finite sample bounds provide important theoretical guarantees for the GTD algorithms, as well as insights for improving them, including how to set up the step size and the need to improve the mixing property of the data, e.g., by experience replay. In the future, we will study finite sample bounds for policy evaluation with nonlinear function approximation.
This work was supported by A Foundation for the Author of National Excellent Doctoral Dissertation of RP China (FANEDD 201312) and National Center for Mathematics and Interdisciplinary Sciences of CAS.
This supplementary material gives the detailed proofs of the main theorems in the main paper. It is organized as follows:
Section 6.1 states the assumptions corresponding to the main paper. Section 6.2 contains some lemmas and the detailed proofs of the main theorems. Section 6.3 contains the detailed proofs of the lemmas used in proving the theorems.
Assumption 8 (Bounded parameter space).
We assume there exist finite  such that
Assumption 9 (Step size).
Let  denote the step size sequence, which is non-increasing and satisfies  for .
Assumption 10 (Problem solvable).
The matrices  and  are non-singular.
Assumption 11 (Bounded data).
The max norm of features are bounded by , rewards are bounded by and importance weights are bounded by .
Assumption 12 (Lipschitz).
For -almost every , the function  is Lipschitz in both x and y, that is, there exist three constants , ,  such that:
and letting , we have
Assumption 13 (Smooth).
For -almost every , the partial gradient function of  is Lipschitz, that is, there exist three constants , ,  such that:
Then, letting , we have
The mixing times of the stochastic process are uniform in the sense that there exist uniform mixing times  such that, with probability 1, for all  and ,
6.2.1 Proof of Theorem 1
Proof of Theorem 3.
For the convenience of proof, we introduce a sequence of auxiliary variables that follow the iteration formula below respectively:
Using the definition of the error function and , we can convert the expected error function into a more convenient expression, from which we will start our analysis. Define .
(1) is a consequence of Lemma 2.
(2) follows from the convexity and concavity of  with respect to  and , respectively.
(3) follows from the convexity-concavity once again.
Notice that we cannot bound the right hand side of (10) directly because of the  operator and the non-i.i.d. setting.
We rewrite ; it can be shown that  is measurable with respect to . More specifically, given ,  is a random variable correlated with .
Previous work that considers the saddle point problem in the i.i.d. setting can obtain the bound easily because it can utilize the i.i.d. property: in that setting, every sample based stochastic gradient function is unbiased with respect to . Notice . However, in our Markov setting this term cannot be made arbitrarily small.
On the other hand, previous work that considers the Markov setting for the convex function minimization problem can also obtain its result easily, since it is not bothered by the random variable : in that problem,  is a constant rather than a random variable, which plays a key role when handling the dependence and bias of the sampling distribution.
So we cannot apply existing techniques directly or even combine them trivially.
Now, to bound the right-hand side of (10), we consider the following decomposition, for any . In order to include the special case, we require : if ,
Denote  as , denote  as , denote  as , and recall the definition of ,