Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning
Abstract
Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work [5], multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counterintuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update step-size is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.
Yonathan Efroni, Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa 3200003, Israel. jonathan.efroni@gmail.com
Gal Dalal, Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa 3200003, Israel. gald@tx.technion.ac.il
Bruno Scherrer, INRIA, Villers-lès-Nancy, F-54600, France. bruno.scherrer@inria.fr
Shie Mannor, Department of Electrical Engineering, The Technion - Israel Institute of Technology, Haifa 3200003, Israel. shie@ee.technion.ac.il
1 Introduction
The use of the 1-step policy improvement in Reinforcement Learning (RL) was theoretically investigated under several frameworks, e.g., Policy Iteration (PI) [18], approximate PI [2, 9, 13], and Actor-Critic [10]; its practical uses are abundant [22, 12, 25]. However, single-step-based improvement is not necessarily the optimal choice. It was, in fact, empirically demonstrated that multiple-step greedy policies can perform conspicuously better. Notable examples arise from the integration of RL with Monte Carlo Tree Search [4, 28, 23, 3, 25, 24] or Model Predictive Control [15, 6, 27].
Recent work [5] provided guarantees on the performance of the multiple-step greedy policy and generalizations of it in PI. Here, we establish such guarantees in the two practical contexts of online and approximate PI. With this objective in mind, we begin by highlighting a specific difficulty: softly updating a policy with respect to (w.r.t.) a multiple-step greedy policy does not necessarily result in improvement of the policy (Section 4). We find this property intriguing since monotonic improvement is guaranteed in the case of soft updates w.r.t. the 1-step greedy policy, and is central to the analysis of many RL algorithms [10, 9, 22]. We thus engineer algorithms that circumvent this difficulty and provide non-trivial performance guarantees that support the interest of using multi-step greedy operators. These algorithms assume access to a generative model (Section 5) or to an approximate multiple-step greedy policy (Section 6).
2 Preliminaries
We consider the infinite-horizon discounted Markov Decision Process (MDP) framework. An MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ [18], where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P(s' \mid s, a)$ is a transition kernel, $R(s, a)$ is a reward function, and $\gamma \in (0, 1)$ is a discount factor. Let $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ be a stationary policy, where $\Delta_{\mathcal{A}}$ is the set of probability distributions on $\mathcal{A}$. For a policy $\pi$, for any state $s$, let the value of $\pi$ in state $s$ be $v^\pi(s) = \mathbb{E}^\pi\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid s_0 = s\right]$, where the notation $\mathbb{E}^\pi[\cdot \mid s_0 = s]$ means that the expectation is conditioned on the event $\{s_0 = s\}$ and on following policy $\pi$; $v^\pi$ is a vector in $\mathbb{R}^{|\mathcal{S}|}$. For conciseness, we shall use the notation $v^{\pi'} \ge v^\pi$, which holds componentwise, and denote the reward and value at time $t$ by $r_t$ and $v_t$. It is known that $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$, where $(P^\pi)_{s, s'} = \mathbb{E}_{a \sim \pi(\cdot \mid s)} P(s' \mid s, a)$ and $r^\pi(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)} R(s, a)$. Lastly, let the state-action value of $\pi$ be
$$q^\pi(s, a) = R(s, a) + \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[v^\pi(s')\right]. \qquad (1)$$
Our goal is to find a policy $\pi^*$ yielding the optimal value $v^*$, such that
$$v^{\pi^*} = v^* = \max_\pi v^\pi. \qquad (2)$$
This goal can be achieved using the three following operators (with equalities holding componentwise):
$$T^\pi v = r^\pi + \gamma P^\pi v, \qquad T v = \max_\pi T^\pi v, \qquad \mathcal{G}(v) = \{\pi : T^\pi v = T v\},$$
where $T^\pi$ is a linear operator, $T$ is the optimal Bellman operator, and both $T^\pi$ and $T$ are $\gamma$-contraction mappings w.r.t. the max norm. It is known that the unique fixed points of $T^\pi$ and $T$ are $v^\pi$ and $v^*$, respectively. The set $\mathcal{G}(v)$ is the standard set of 1-step greedy policies w.r.t. $v$.
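These operators can be made concrete on a small tabular MDP. The sketch below implements $T$, the 1-step greedy policy, and value iteration; the transition and reward numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

gamma = 0.9
# A tiny 2-state, 2-action MDP (all numbers are hypothetical).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] transition kernel
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],                  # R[s, a] reward function
              [0.5, 2.0]])

def bellman_optimal(v):
    """Optimal Bellman operator T: (T v)(s) = max_a R(s,a) + gamma * E[v(s')]."""
    return (R + gamma * P @ v).max(axis=1)

def one_step_greedy(v):
    """A 1-step greedy policy w.r.t. v (ties broken by argmax)."""
    return (R + gamma * P @ v).argmax(axis=1)

# T is a gamma-contraction in the max norm, so iterating it converges to v*.
v = np.zeros(2)
for _ in range(600):
    v = bellman_optimal(v)
pi_star = one_step_greedy(v)   # a 1-step greedy policy w.r.t. v* is optimal
```

The fixed-point property can be checked directly: after convergence, applying `bellman_optimal` to `v` leaves it (numerically) unchanged.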
3 The $h$- and $\kappa$-Greedy Policies
In this section, we bring forward necessary definitions and results on two classes of multiple-step greedy policies: $h$- and $\kappa$-greedy [5]. Let $h$ be a positive integer. The $h$-greedy policy outputs the first optimal action out of the sequence of $h$ actions solving a non-stationary, $h$-horizon control problem as follows:
$$\pi_h(s) \in \arg\max_{a_0} \max_{a_1, \ldots, a_{h-1}} \mathbb{E}\left[\sum_{t=0}^{h-1} \gamma^t r_t + \gamma^h v(s_h) \,\Big|\, s_0 = s\right].$$
Since the $h$-greedy policy can be represented as the 1-step greedy policy w.r.t. $T^{h-1} v$, the set of $h$-greedy policies w.r.t. $v$, denoted $\mathcal{G}_h(v)$, can be formally defined as follows:
$$\mathcal{G}_h(v) = \{\pi : T^\pi T^{h-1} v = T^h v\}.$$
Let $\kappa \in [0, 1]$. The set of $\kappa$-greedy policies w.r.t. a value function $v$, denoted $\mathcal{G}_\kappa(v)$, is defined using the following operators:
$$T_\kappa^\pi v = (I - \kappa\gamma P^\pi)^{-1}\big(r^\pi + (1 - \kappa)\gamma P^\pi v\big), \qquad T_\kappa v = \max_\pi T_\kappa^\pi v, \qquad (3)$$
$$\mathcal{G}_\kappa(v) = \{\pi : T_\kappa^\pi v = T_\kappa v\}.$$
Remark 1.
In [5, Proposition 11], the $\kappa$-greedy policy was shown to interpolate over all geometrically $\kappa$-weighted $h$-greedy policies. It was also shown that for $\kappa = 0$ the 1-step greedy policy is restored, while for $\kappa = 1$ the $\kappa$-greedy policy is the optimal policy.
Both $T_\kappa^\pi$ and $T_\kappa$ are contraction mappings, with coefficient $\xi(\kappa) = \frac{\gamma(1 - \kappa)}{1 - \gamma\kappa}$. Their respective fixed points are $v^\pi$ and $v^*$. For brevity, where there is no risk of confusion, we shall denote $\xi(\kappa)$ by $\xi$. Moreover, in [5] it was shown that both the $h$- and $\kappa$-greedy policies w.r.t. $v^\pi$ are strictly better than $\pi$, unless $\pi$ is already optimal.
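To make $T_\kappa$ concrete, the sketch below computes it by solving the surrogate MDP with shaped reward $r + (1-\kappa)\gamma P v$ and discount $\kappa\gamma$ via value iteration; the MDP numbers are hypothetical, and the reduction to a surrogate MDP is the one from [5].

```python
import numpy as np

gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],   # P[s, a, s'] (hypothetical numbers)
              [[0.5, 0.5], [0.3, 0.7]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])

def t_kappa(v, kappa, iters=2000):
    """T_kappa v: optimal value of the surrogate MDP whose reward is
    R(s,a) + (1-kappa)*gamma*E[v(s')] and whose discount factor is kappa*gamma."""
    r_shaped = R + (1 - kappa) * gamma * P @ v
    u = np.zeros(len(v))
    for _ in range(iters):
        u = (r_shaped + kappa * gamma * P @ u).max(axis=1)
    return u

v = np.random.default_rng(0).uniform(size=2)
# kappa = 0 recovers one application of the optimal Bellman operator T...
tv = (R + gamma * P @ v).max(axis=1)
# ...while kappa = 1 solves the original MDP outright, independently of v.
```

Checking `t_kappa(v, 0.0)` against `tv` recovers the 1-step case, `t_kappa(v, 1.0)` does not depend on `v`, and the fixed point of $T_\kappa$ is $v^*$ for every $\kappa$, all in line with the statements above.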
Next, let $q^*_{\kappa, v}$ be the optimal $q$-function of the surrogate, $\kappa\gamma$-discounted MDP with the shaped reward
$$r_{\kappa, v}(s, a) = R(s, a) + (1 - \kappa)\gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\left[v(s')\right] \qquad (4)$$
(see Remark 1). Thus, we can obtain a $\kappa$-greedy policy, $\pi \in \mathcal{G}_\kappa(v)$, directly from $q^*_{\kappa, v}$:
See that the $\kappa$-greedy policy is the 1-step greedy policy in the surrogate MDP, since $\pi(s) \in \arg\max_a q^*_{\kappa, v}(s, a)$.
4 Multi-step Policy Improvement and Soft Updates
In this section, we focus on policy improvement of multiple-step greedy policies, performed with soft updates. Soft updates of the 1-step greedy policy have proved necessary and beneficial in prominent algorithms [10, 9, 22]. Here, we begin by describing an intrinsic difficulty in selecting the step-size parameter when updating with multiple-step greedy policies such as the $h$- and $\kappa$-greedy. Specifically, denote by $\pi'$ such a multiple-step greedy policy w.r.t. $v^\pi$, and consider the mixture $(1 - \alpha)\pi + \alpha\pi'$ with step-size $\alpha \in (0, 1)$. Then, this mixture is not necessarily better than $\pi$.
Lemma 1.
Let $v$ be a value function, $\pi$ a policy, and $\kappa \in [0, 1]$. Then
$$T_\kappa^\pi v - v = (I - \kappa\gamma P^\pi)^{-1}\left(T^\pi v - v\right).$$
This elementary lemma relates the $\kappa$-'advantage' $T_\kappa^\pi v - v$ to the 1-step advantage $T^\pi v - v$, and is used for proving the following result.
Theorem 2.
For any MDP, let $\pi$ be a policy and $v^\pi$ its value. Let $\pi_\kappa \in \mathcal{G}_\kappa(v^\pi)$ and $\pi_h \in \mathcal{G}_h(v^\pi)$, where $\kappa \in [0, 1]$ and $h$ is a positive integer. Consider the mixture policies
$$\pi_{\alpha, h} = (1 - \alpha)\pi + \alpha\pi_h, \qquad \pi_{\alpha, \kappa} = (1 - \alpha)\pi + \alpha\pi_\kappa, \qquad \alpha \in [0, 1].$$
We have the following equivalences:

The inequality $v^{\pi_{\alpha, h}} \ge v^\pi$ holds for all MDPs if and only if $h = 1$ or $\alpha = 1$.

The inequality $v^{\pi_{\alpha, \kappa}} \ge v^\pi$ holds for all MDPs if and only if $\alpha \ge \kappa$.

The above inequalities hold entrywise, with strict inequality in at least one entry unless $v^\pi = v^*$.
Proof sketch. See Appendix B for the full proof. Here, we only provide a counterexample demonstrating the potential non-monotonicity of the softly-updated $\kappa$-greedy policy when the step-size is not large enough. One can show the same for the $h$-greedy policy with the same example.
Consider the Tightrope Walking MDP in Fig. 1. It describes the act of walking on a rope: in the initial state the agent approaches the rope; in the next state, the walking attempt occurs; then comes the goal state, while a failure state is repeatedly met as long as the agent falls from the rope, resulting in negative reward.
First, notice that by definition, We call this policy the “confident” policy. Obviously, for any discount factor , and Instead, consider the “hesitant” policy . We now claim that for any and
(5) 
the mixture policy is not strictly better than the original policy. To see this, notice that the agent accumulates zero reward if she does not climb the rope. Thus, taking any mixture of the confident and hesitant policies can result in no improvement. Based on this construction, to ensure strict improvement, we find it is necessary that
(6) 
To finalize the counterexample and show that strict policy improvement is not guaranteed, we choose the parameters such that both (5) and (6) are satisfied; such a choice is only possible when the step-size is sufficiently small, due to monotonicity. ∎
Theorem 2 guarantees monotonic improvement for the 1-step greedy policy as a special case, when $\kappa = 0$ (equivalently, $h = 1$). Hence, we get that for any $\alpha \in [0, 1]$, the mixture of any policy $\pi$ and the 1-step greedy policy w.r.t. $v^\pi$ is monotonically better than $\pi$. To the best of our knowledge, this result was not explicitly stated anywhere. Instead, it appeared within proofs of several famous results, e.g., [10, Lemma 5.4], [9, Corollary 4.2], and [21, Theorem 1].
In the rest of the paper, we shall focus on the $\kappa$-greedy policy and extend it to the online and approximate cases. The discovery that a soft mixture with the $\kappa$-greedy policy w.r.t. $v^\pi$ is not necessarily strictly better than $\pi$ will guide us in appropriately devising algorithms.
5 Online Policy Iteration with Cautious Soft Updates
In [5], it was shown that using the $\kappa$-greedy policy in the improvement stage leads to a converging PI procedure – the $\kappa$-PI algorithm. This algorithm repeats i) solving for the optimal policy of a small-horizon surrogate MDP with shaped reward, and ii) calculating the value of that optimal policy and using it to shape the reward of the next iteration. Here, we devise a practical version of $\kappa$-PI, which is model-free, online, and runs on two timescales, i.e., performs i) and ii) simultaneously.
The method is depicted in Algorithm 1. It is similar to the asynchronous PI analyzed in [16], except for two major differences. First, the fast timescale tracks both the value and the $q$-function estimates, and not just the latter. It thus enables access to both the $\kappa$-greedy and the 1-step-greedy policies: the 1-step greedy policy is attained via the $q$ estimate, while the value estimate is plugged into a Q-learning [29] update rule for obtaining the $\kappa$-greedy policy. The latter essentially solves the surrogate, discounted MDP (see Remark 1). The second difference is in the slow timescale; there, the policy is updated using a new, 'cautious' improvement operator, defined below. To better understand this operator, first notice that in Stochastic Approximation methods such as Algorithm 1, the policy is improved using soft updates with decaying step-sizes. However, as Theorem 2 states, monotonic improvement is not guaranteed below a certain step-size value. Hence, for a step-size $\alpha$ and a policy $\pi$, we set the update to be the $\kappa$-greedy policy only when assured to have improvement:
where
We respectively denote the state and state-action-pair visitation counters after the $n$-th timestep. The step-size sequences satisfy the common assumption (B2) in [16]. The second moments of the noise are assumed to be bounded. Furthermore, let $\mu$ be some measure over the state space, s.t. every state has positive probability. Then, we assume access to a generative model from which we sample a state, sample an action, apply the action, and receive the reward and next state.
1: initialize: ...
2: for ... do
3:    ...
4:    # Fast-timescale updates
5:    ...
6:    ...
7:    ...
8:    ...
9:    # Slow-timescale updates
10:   ...
11: end for
12: return: ...
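The 'cautious' slow-timescale idea can be sketched as follows: evaluate the $\kappa$-greedy candidate and accept it only when improvement over the current policy is verified, otherwise fall back to the always-improving 1-step greedy policy. This is a simplified, exact-evaluation stand-in for the operator in the text; the function names and the explicit evaluation step are illustrative assumptions (the algorithm itself works with sampled estimates).

```python
import numpy as np

def policy_value(pi, P, R, gamma):
    """Exact policy evaluation: v^pi = (I - gamma * P_pi)^{-1} r_pi."""
    n = len(pi)
    P_pi = P[np.arange(n), pi]                 # P_pi[s, s']
    r_pi = R[np.arange(n), pi]
    return np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

def cautious_improvement(pi, pi_kappa, P, R, gamma):
    """Accept the kappa-greedy candidate only if it improves componentwise;
    otherwise fall back to the 1-step greedy policy w.r.t. v^pi, whose
    hard update is always improving (the kappa = 0 case of Theorem 2)."""
    v = policy_value(pi, P, R, gamma)
    if np.all(policy_value(pi_kappa, P, R, gamma) >= v):
        return pi_kappa
    return (R + gamma * P @ v).argmax(axis=1)  # 1-step greedy fallback
```

Either branch returns a policy whose value is at least that of the current policy, which is the property the slow timescale needs.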
The fast-timescale update rules in lines 6 and 8 can be jointly written as the sum of a mean mapping and a martingale difference noise.
Definition 1.
Let Then the mapping is defined as follows .
where .
The following lemma shows that, given a fixed policy, this mapping is a contraction, analogously to [16, Lemma 5.3] (see Appendix C for the proof).
Theorem 4.
The coupled process in Algorithm 1 converges to the limit $(q^*, \pi^*)$, where $q^*$ is the optimal Q-function and $\pi^*$ is the optimal policy.
6 Approximate Policy Iteration with Hard Updates
Theorem 2 establishes the conditions required for guaranteed monotonic improvement of softly-updated multiple-step greedy policies. The algorithm in Section 5 then accounts for these conditions to ensure convergence. In contrast, in this section, we derive and study algorithms that perform hard policy updates. Specifically, we generalize the prominent Approximate Policy Iteration (API) [13, 7, 11] and Policy Search by Dynamic Programming (PSDP) [1, 19].
For both, we obtain performance guarantees that exhibit a tradeoff in the choice of $\kappa$, with the optimal performance bound achieved at $\kappa = 1$; i.e., our approximate generalized PI methods outperform the 1-step greedy approximate PI methods in terms of best-known guarantees.
For the algorithms here, we assume an oracle that returns a $\kappa$-greedy policy up to some error. Formally, we consider the set of approximate $\kappa$-greedy policies w.r.t. $v$, with approximation error $\epsilon$ under some measure $\mu$.
Definition 2 (Approximate $\kappa$-greedy policy).
Let $v$ be a value function, $\epsilon \ge 0$ a real number, and $\mu$ a distribution over $\mathcal{S}$. A policy $\pi$ is an approximate $\kappa$-greedy policy w.r.t. $v$ if $\mathbb{E}_{s \sim \mu}\left[(T_\kappa v)(s) - (T_\kappa^\pi v)(s)\right] \le \epsilon$.
Such a device can be implemented using existing approximate methods, e.g., Conservative Policy Iteration (CPI) [9], approximate PI or VI [7], Policy Search [21], or by having access to an approximate model of the environment. The approximate $\kappa$-greedy oracle assumed here is less restrictive than the one assumed in [5]. There, a uniform error over states was assumed, whereas here, the error is defined w.r.t. a specific measure, $\mu$. For practical purposes, $\mu$ can be thought of as the initial sampling distribution to which the MDP is initialized. Lastly, notice that the larger $\kappa$ is, the harder it is to solve the surrogate discounted MDP since its discount factor, $\kappa\gamma$, is bigger [17, 26, 8]; i.e., the computational cost of each call to the oracle increases.
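The growth in oracle cost can be quantified through the effective horizon $1/(1 - \kappa\gamma)$ of the surrogate MDP, a standard rule of thumb rather than a formal complexity bound; the numbers below are purely illustrative.

```python
# Effective horizon of the surrogate kappa*gamma-discounted MDP:
# a rough proxy for the cost of each oracle call.
gamma = 0.99

def effective_horizon(kappa, gamma):
    return 1.0 / (1.0 - kappa * gamma)

horizons = [effective_horizon(k, gamma) for k in (0.0, 0.5, 0.9, 1.0)]
# The horizon grows monotonically with kappa: harder surrogate problems.
```

At $\kappa = 0$ the surrogate is a one-step (bandit-like) problem, while at $\kappa = 1$ it is the original $\gamma$-discounted MDP.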
Using the concept of concentrability coefficients introduced in [13] (there, they were originally termed "diffusion coefficients"), we follow the line of work in [13, 14, 7, 19, 11] to prove our performance bounds. This allows a direct comparison of the algorithms proposed here with previously studied approximate 1-step greedy algorithms. Namely, our bounds consist of the concentrability coefficients from [19, 11], as well as two new coefficients, defined as follows.
Definition 3 (Concentrability coefficients).
Let be some measures over Let be the sequence of the smallest values in such that for every for all sequences of deterministic stationary policies . Let and . For brevity, we denote as
Similarly, let be the sequence of the smallest values in such that for every . Let
Finally, we introduce the following two new coefficients. Let . Also, let be the smallest value s.t. where is a probability measure and is a stochastic matrix.
In the definition above, $\mu$ is the measure according to which the approximate improvement is guaranteed, while the other measure specifies the distribution on which one measures the loss that we wish to bound. The latter coefficient was previously defined in, e.g., [19, Definition 1].
Before giving our performance bounds, we first study the behavior of the coefficients appearing in them. The following lemma sheds light on this behavior; specifically, it shows that under certain constructions, the new coefficient decreases as $\kappa$ increases (a smaller coefficient is obviously better; the best value for any concentrability coefficient is 1). See proof in Appendix E.
Lemma 5.
Let . Then, for all , there exists such that The inequality is strict for . This implies that the coefficient is a decreasing function of $\kappa$.
Definition 3 introduces several coefficients with which we shall derive our bounds. Though traditional arithmetic relations between them do not exist, they do comply with some notion of ordering.
Remark 2 (Order of concentrability coefficients).
In [19], an order between the concentrability coefficients was introduced: a coefficient is said to be strictly better than — a relation we denote with — if and only if i) implies and ii) there exists an MDP for which and . Particularly, it was argued that and that if . In this sense, is analogous to , while its definition might suggest improvement as increases. Moreover, combined with the fact that improves as increases, as Lemma 5 suggests, we get that is better than all previously defined concentrability coefficients.
6.1 Approximate Policy Iteration
A natural generalization of API [13, 19, 11] to the multiple-step greedy policy is $\kappa$-API, as given in Algorithm 2. In each of its iterations, the policy is updated to an approximate $\kappa$-greedy policy w.r.t. the value of the current policy, i.e., a policy from the set given in Definition 2.
Algorithm 2 ($\kappa$-API): initialize; for ... do ...; end for; return.
Algorithm 3 ($\kappa$-PSDP): initialize; for ... do ... Append ...; end for; return.
The following theorem gives a performance bound for $\kappa$-API (see proof in Appendix F), with
where is a bounded function for
Theorem 6.
Let $\pi_k$ be the policy at the $k$-th iteration of $\kappa$-API and $\epsilon$ be the error as defined in Definition 2. Then
Also, let Then
For brevity, we now discuss the first part of the statement; the same insights hold for the second as well. The bound for the original API is restored in the 1-step greedy case of $\kappa = 0$, i.e., [19, 11]. As in the case of API, our bound consists of a fixed approximation error term and a geometrically decaying term. As for the other extreme, we first remind the reader that in the non-approximate case, applying the $\kappa = 1$ greedy policy amounts to solving the original discounted MDP in a single step [5, Remark 4]. In the approximate setup we investigate here, this results in the vanishing of the second, geometrically decaying term. We are then left with a single constant approximation error term. Notice that this term is independent of $\kappa$ (see Definition 3); it represents the mismatch between the two measures [9].
Next, notice that, by definition, the relevant coefficients are comparable. Given the discussion above, we have that the $\kappa$-API performance bound is strictly smaller with $\kappa = 1$ than with $\kappa = 0$. Hence, the bound suggests that $\kappa$-API is strictly better than the original API for $\kappa = 1$. Since all expressions there are continuous in $\kappa$, this behavior does not hold solely pointwise.
Remark 3 (Performance tradeoff).
Naively, the above observation would lead to the choice of $\kappa = 1$. However, it is reasonable to assume that $\epsilon$, the error of the $\kappa$-greedy step, itself depends on $\kappa$, i.e., $\epsilon = \epsilon(\kappa)$. The general form of such dependence is expected to be monotonically increasing: as the effective horizon of the surrogate discounted MDP becomes larger, its solution is harder to obtain (see Remark 1). Thus, Theorem 6 reveals a performance tradeoff as a function of $\kappa$.
6.2 Policy Search by Dynamic Programming
We continue with generalizing another approximate PI method – PSDP [1, 19]. We name the generalization $\kappa$-PSDP and introduce it in Algorithm 3. This algorithm updates the policy differently from $\kappa$-API. However, similarly to $\kappa$-API, it uses hard updates. We will show that this algorithm exhibits better performance than any other previously analyzed approximate PI method [19].
The $\kappa$-PSDP algorithm, unlike $\kappa$-API, returns a sequence of deterministic policies. Given this sequence, we build a single, non-stationary policy by successively running a random number of steps of each policy in the sequence, from the last to the first, where the numbers of steps are i.i.d. geometric random variables. Once this process reaches the first policy, it runs it indefinitely. We shall refer to this non-stationary policy as the returned policy. Its value can be seen to satisfy
This algorithm follows PSDP from [19]. Differently from it, the 1-step improvement is generalized to the $\kappa$-greedy improvement, and the non-stationary policy behaves randomly. The following theorem gives a performance bound for it (see proof in Appendix G).
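The random switching schedule of the returned non-stationary policy can be sketched as below. The extraction leaves the geometric parameter unspecified; taking it to be $1 - \kappa$ is an illustrative assumption (it recovers exactly one step per policy at $\kappa = 0$, matching the original PSDP).

```python
import numpy as np

def policy_schedule(num_policies, horizon, kappa, rng):
    """Index of the stored policy controlling each timestep: run a
    Geometric(1 - kappa) number of steps of policy num_policies-1, then of
    num_policies-2, and so on; once policy 0 is reached it runs indefinitely.
    (The geometric parameter 1 - kappa is an illustrative assumption;
    it requires kappa < 1.)"""
    schedule, idx = [], num_policies - 1
    while len(schedule) < horizon:
        if idx == 0:
            schedule.extend([0] * (horizon - len(schedule)))
            break
        steps = rng.geometric(1.0 - kappa)   # i.i.d. duration for this policy
        schedule.extend([idx] * int(steps))
        idx -= 1
    return schedule[:horizon]
```

For example, `policy_schedule(3, 50, 0.5, np.random.default_rng(0))` produces a non-increasing sequence of policy indices starting from 2.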
Theorem 7.
Let $\pi_k$ be the policy at the $k$-th iteration of $\kappa$-PSDP and $\epsilon$ be the error as defined in Definition 2. Then
Also, let Then
Compared to $\kappa$-API from the previous section, the $\kappa$-PSDP bound consists of a different fixed approximation error and a shared geometrically decaying term. Regarding the former, notice that, using the notation from Remark 2, it is better. This suggests that $\kappa$-PSDP is strictly better than $\kappa$-API in the metrics we consider, in line with the comparison of the original API to the original PSDP given in [19].
Similarly to the previous section, we again see that substituting $\kappa = 1$ gives a tighter bound than $\kappa = 0$. Also, again from continuity, this behavior is interpolated for intermediate values of $\kappa$; i.e., we have that $\kappa$-PSDP is generally better than PSDP. Moreover, the tradeoff discussion in Remark 3 applies here as well.
An additional advantage of this new algorithm over PSDP is reduced space complexity. This can be seen, e.g., from the denominator in the choice of the number of iterations in the second part of Theorem 7. It shows that, since $\xi$ is a strictly decreasing function of $\kappa$, performance is preserved with significantly fewer iterations by increasing $\kappa$. Since the size of the stored policy depends linearly on the number of iterations, a larger $\kappa$ improves space efficiency.
7 Discussion and Future Work
In this work, we introduced and analyzed online and approximate PI methods, generalized to the $\kappa$-greedy policy, an instance of a multiple-step greedy policy. In doing so, we discovered two intriguing properties compared to the well-studied 1-step greedy policy, which we believe can be impactful in designing state-of-the-art algorithms. First, successive application of multiple-step greedy policies with a soft, step-size-based update does not guarantee improvement; see Theorem 2. To mitigate this caveat, we designed an online PI algorithm with a 'cautious' improvement operator; see Section 5.
The second property we find intriguing stemmed from analyzing generalizations of known approximate hard-update PI methods. In Section 6, we revealed a performance tradeoff in $\kappa$, which can be interpreted as a tradeoff between short-horizon bootstrap bias and long-rollout variance. This corresponds to the known tradeoff in the famous TD($\lambda$).
The two characteristics above lead to new compelling questions. The first regards improvement operators: would a non-monotonically improving PI scheme necessarily fail to converge to the optimal policy? Our attempts to generalize existing proof techniques to show convergence in such cases have fallen short. Specifically, in the online case, Lemma 5.4 in [10] does not hold with multiple-step greedy policies; similar issues arise when trying to form a CPI algorithm via, e.g., an attempt to generalize Corollary 4.2 in [9]. Another research question regards the choice of the parameter $\kappa$, given the tradeoff it poses. One possible direction for answering it could be investigating the concentrability coefficients further and attempting to characterize them for specific MDPs, either theoretically or via estimation. Lastly, an indisputable next step would be to empirically evaluate implementations of the algorithms presented here.
References
 [1] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.
 [2] Dimitri P Bertsekas and John N Tsitsiklis. Neurodynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Conference on, volume 1, pages 560–564. IEEE, 1995.
 [3] Bruno Bouzy and Bernard Helmstetter. Monte-Carlo Go developments. In Advances in Computer Games, pages 159–174. Springer, 2004.
 [4] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
 [5] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one step greedy approach in reinforcement learning. arXiv preprint arXiv:1802.03654, 2018.
 [6] Damien Ernst, Mevludin Glavic, Florin Capitanescu, and Louis Wehenkel. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):517–529, 2009.
 [7] Amirmassoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576, 2010.
 [8] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
 [9] S.M. Kakade and J. Langford. Approximately Optimal Approximate Reinforcement Learning. In International Conference on Machine Learning, pages 267–274, 2002.
 [10] Vijaymohan R Konda and Vivek S Borkar. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 38(1):94–123, 1999.
 [11] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of classificationbased policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.
 [12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
 [13] Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, pages 560–567. AAAI Press, 2003.
 [14] Rémi Munos. Performance bounds in $L_p$-norm for approximate value iteration. SIAM Journal on Control and Optimization, 46(2):541–561, 2007.
 [15] Rudy R Negenborn, Bart De Schutter, Marco A Wiering, and Hans Hellendoorn. Learningbased model predictive control for markov decision processes. IFAC Proceedings Volumes, 38(1):354–359, 2005.
 [16] Steven Perkins and David S Leslie. Asynchronous stochastic approximation with differential inclusions. Stochastic Systems, 2(2):409–446, 2013.
 [17] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. In Advances in neural information processing systems, pages 1265–1272, 2009.
 [18] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
 [19] Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322, 2014.
 [20] Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 2016.
 [21] Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2014.
 [22] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
 [23] Brian Sheppard. World-championship-caliber Scrabble. Artificial Intelligence, 134(1-2):241–275, 2002.
 [24] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
 [25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
 [26] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
 [27] Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan – episodic MPC improvement. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 336–343. IEEE, 2017.
 [28] Gerald Tesauro and Gregory R Galperin. On-line policy improvement using Monte-Carlo search. In Advances in Neural Information Processing Systems, pages 1068–1074, 1997.
 [29] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279–292, 1992.
Appendix A Proof of Lemma 1
Appendix B Proof of Theorem 2
First, since , we have that
and thus,
We here give the proof of the first part of the first statement in Theorem 2.
In the first line we used Lemma 1 with , in the third line we used Lemma 1 and in the fifth line we used .
We now show that all terms are componentwise greater than or equal to zero, since each is a weighted sum of entrywise nonnegative matrices with positive weights. Thus, the inequality holds componentwise. Lastly,
with equality if and only if