Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning


Yonathan Efroni
Department of Electrical Engineering
The Technion - Israel Institute of Technology
Israel, Haifa 3200003
&Gal Dalal
Department of Electrical Engineering
The Technion - Israel Institute of Technology
Israel, Haifa 3200003
&Bruno Scherrer
INRIA
Villers-lès-Nancy, F-54600, France
&Shie Mannor
Department of Electrical Engineering
The Technion - Israel Institute of Technology
Israel, Haifa 3200003

Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work [5], multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator.





1 Introduction

The use of the 1-step policy improvement in Reinforcement Learning (RL) was theoretically investigated under several frameworks, e.g., Policy Iteration (PI) [18], approximate PI [2, 9, 13], and Actor-Critic [10]; its practical uses are abundant [22, 12, 25]. However, single-step based improvement is not necessarily the optimal choice. It was, in fact, empirically demonstrated that multiple-step greedy policies can perform conspicuously better. Notable examples arise from the integration of RL and Monte Carlo Tree Search [4, 28, 23, 3, 25, 24] or Model Predictive Control [15, 6, 27].

Recent work [5] provided guarantees on the performance of the multiple-step greedy policy and generalizations of it in PI. Here, we study it in the two practical contexts of online and approximate PI. With this objective in mind, we begin by highlighting a specific difficulty: softly updating a policy with respect to (w.r.t.) a multiple-step greedy policy does not necessarily result in improvement of the policy (Section 4). We find this property intriguing since monotonic improvement is guaranteed in the case of soft updates w.r.t. the 1-step greedy policy, and is central to the analysis of many RL algorithms [10, 9, 22]. We thus engineer algorithms to circumvent this difficulty and provide non-trivial performance guarantees that support the use of multiple-step greedy operators. These algorithms assume access to a generative model (Section 5) or to an approximate multiple-step greedy policy (Section 6).

2 Preliminaries

We consider the infinite-horizon discounted Markov Decision Process (MDP) framework. An MDP is defined by the 5-tuple $(\mathcal{S}, \mathcal{A}, P, R, \gamma)$ [18], where $\mathcal{S}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P \equiv P(s' \mid s, a)$ is a transition kernel, $R \equiv r(s, a)$ is a reward function, and $\gamma \in (0, 1)$ is a discount factor. Let $\pi : \mathcal{S} \to \Delta_{\mathcal{A}}$ be a stationary policy, where $\Delta_{\mathcal{A}}$ is a probability distribution on the set $\mathcal{A}$. For a policy $\pi$, for any state $s \in \mathcal{S}$, let the value of $\pi$ in state $s$ be $v^\pi(s) \equiv \mathbb{E}^\pi[\sum_{t=0}^{\infty} \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s]$, where the notation $\mathbb{E}^\pi[\cdot \mid s_0 = s]$ means that the expectation is conditioned on the event $\{s_0 = s\}$ and on following the policy $\pi$. Thus, $v^\pi$ is a vector in $\mathbb{R}^{|\mathcal{S}|}$. For conciseness, we shall use the notation $v_1 \le v_2$, which holds component-wise, and denote the reward and value at time $t$ by $r_t \equiv r(s_t, \pi(s_t))$ and $v_t \equiv v(s_t)$. It is known that $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$, where $(P^\pi)_{s, s'} \equiv P(s' \mid s, \pi(s))$ and $(r^\pi)_s \equiv r(s, \pi(s))$.
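As a quick numerical illustration of the closed-form identity $v^\pi = (I - \gamma P^\pi)^{-1} r^\pi$, the following sketch solves for $v^\pi$ on a toy 3-state chain (the MDP values are arbitrary examples, not taken from the paper):

```python
import numpy as np

# A toy 3-state MDP under a fixed policy pi: P_pi is the induced
# transition matrix and r_pi the induced reward vector.
gamma = 0.9
P_pi = np.array([[0.5, 0.5, 0.0],
                 [0.1, 0.6, 0.3],
                 [0.0, 0.2, 0.8]])
r_pi = np.array([1.0, 0.0, 2.0])

# Closed form: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Sanity check: v_pi is the fixed point of the backup v <- r_pi + gamma * P_pi v.
assert np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi)
print(v_pi)
```

Solving the linear system directly is exact for small state spaces; iterating the backup converges to the same vector by the contraction property.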


Our goal is to find a policy $\pi^*$ yielding the optimal value $v^*$ such that

$$v^* = \max_\pi (I - \gamma P^\pi)^{-1} r^\pi = v^{\pi^*}. \tag{1}$$
This goal can be achieved using the three following operators (with equalities holding component-wise):

$$T^\pi v = r^\pi + \gamma P^\pi v,$$
$$T v = \max_\pi T^\pi v,$$
$$\mathcal{G}(v) = \{\pi : T^\pi v = T v\},$$

where $T^\pi$ is a linear operator, $T$ is the optimal Bellman operator, and both $T^\pi$ and $T$ are $\gamma$-contraction mappings w.r.t. the max-norm. It is known that the unique fixed points of $T^\pi$ and $T$ are $v^\pi$ and $v^*$, respectively. The set $\mathcal{G}(v)$ is the standard set of 1-step greedy policies w.r.t. $v$.
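These operators can be sketched in a few lines; value iteration then follows directly from the contraction property of $T$ (the toy MDP below is illustrative, not from the paper):

```python
import numpy as np

# Toy MDP: 3 states, 2 actions. P[a, s, s'] is the transition kernel,
# r[s, a] the reward; the numbers are arbitrary illustrative values.
gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
              [[0.1, 0.8, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]])
r = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])

def T(v):
    """Optimal Bellman operator: (Tv)(s) = max_a r(s,a) + gamma * E[v(s')]."""
    q = r + gamma * np.einsum('ast,t->sa', P, v)
    return q.max(axis=1)

def greedy(v):
    """A 1-step greedy policy w.r.t. v (one member of the greedy set G(v))."""
    q = r + gamma * np.einsum('ast,t->sa', P, v)
    return q.argmax(axis=1)

# Value iteration: T is a gamma-contraction, so iterating converges to v*.
v = np.zeros(3)
for _ in range(500):
    v = T(v)
assert np.allclose(v, T(v), atol=1e-6)  # v is (numerically) the fixed point v*
print(greedy(v))  # an optimal policy, i.e., a member of G(v*)
```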

3 The h- and κ-Greedy Policies

In this section, we bring forward necessary definitions and results on two classes of multiple-step greedy policies: the h- and κ-greedy policies [5]. Let $h \in \{1, 2, \dots\}$. The h-greedy policy $\pi_h$ outputs the first optimal action out of the sequence of $h$ actions solving a non-stationary, $h$-horizon control problem as follows:

$$\pi_h(s) \in \arg\max_{a_0} \max_{a_1, \dots, a_{h-1}} \mathbb{E}\Big[\sum_{t=0}^{h-1} \gamma^t r(s_t, a_t) + \gamma^h v(s_h) \,\Big|\, s_0 = s\Big]. \tag{2}$$

Since the h-greedy policy can be represented as the 1-step greedy policy w.r.t. $T^{h-1} v$, the set of h-greedy policies w.r.t. $v$, $\mathcal{G}_h(v)$, can be formally defined as follows:

$$\mathcal{G}_h(v) = \{\pi : T^\pi T^{h-1} v = T^h v\}.$$

Let $\kappa \in [0, 1]$. The set of κ-greedy policies w.r.t. a value function $v$, $\mathcal{G}_\kappa(v)$, is defined using the following operators:

$$T^\pi_\kappa v = (I - \kappa\gamma P^\pi)^{-1}\big(r^\pi + (1-\kappa)\gamma P^\pi v\big),$$
$$T_\kappa v = \max_\pi T^\pi_\kappa v, \qquad \mathcal{G}_\kappa(v) = \{\pi : T^\pi_\kappa v = T_\kappa v\}.$$

Equivalently, a κ-greedy policy maximizes the κ-weighted objective

$$\mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} (\kappa\gamma)^t \big(r(s_t, \pi(s_t)) + \gamma(1-\kappa) v(s_{t+1})\big) \,\Big|\, s_0 = s\Big]. \tag{3}$$

Remark 1.

A comparison of (2) and (3) reveals that finding the κ-greedy policy is equivalent to solving a κγ-discounted MDP with the shaped reward $r(s_t, a_t) + \gamma(1-\kappa) v(s_{t+1})$.
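Following this observation, a κ-greedy policy can be computed by running value iteration on the surrogate MDP with discount κγ and shaped reward — a minimal sketch, where the toy MDP and the helper `kappa_greedy` are our own illustrative constructions:

```python
import numpy as np

# Toy MDP (illustrative values): P[a, s, s'] transition kernel, r[s, a] reward.
gamma = 0.9
P = np.array([[[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.0, 0.3, 0.7]],
              [[0.1, 0.8, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]])
r = np.array([[0.0, 1.0], [0.5, 0.0], [1.0, 2.0]])

def kappa_greedy(v, kappa, iters=500):
    """Return a kappa-greedy policy w.r.t. v by solving, via value iteration,
    the surrogate (kappa*gamma)-discounted MDP whose reward is shaped as
    r(s,a) + gamma*(1-kappa)*E[v(s')] (cf. Remark 1)."""
    r_shaped = r + gamma * (1 - kappa) * np.einsum('ast,t->sa', P, v)
    u = np.zeros_like(v)
    for _ in range(iters):
        q = r_shaped + kappa * gamma * np.einsum('ast,t->sa', P, u)
        u = q.max(axis=1)
    return q.argmax(axis=1)

v = np.array([0.0, 1.0, 2.0])  # an arbitrary value function
# Sanity check: kappa = 0 recovers the 1-step greedy policy w.r.t. v.
one_step = (r + gamma * np.einsum('ast,t->sa', P, v)).argmax(axis=1)
assert np.array_equal(kappa_greedy(v, kappa=0.0), one_step)
```

At the other extreme, $\kappa = 1$ makes the surrogate coincide with the original γ-discounted MDP, so a single call returns an optimal policy.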

In [5, Proposition 11], the κ-greedy policy was shown to interpolate over all geometrically κ-weighted h-greedy policies. It was also shown that for $\kappa = 0$ the 1-step greedy policy is restored, while for $\kappa = 1$ the κ-greedy policy is the optimal policy.

Both $T^\pi_\kappa$ and $T_\kappa$ are $\xi$-contraction mappings, where $\xi = \frac{\gamma(1-\kappa)}{1-\gamma\kappa}$. Their respective fixed points are $v^\pi$ and $v^*$. For brevity, where there is no risk of confusion, we shall keep the dependence on $\kappa$ implicit. Moreover, in [5] it was shown that both the h- and κ-greedy policies w.r.t. $v^\pi$ are strictly better than $\pi$, unless $v^\pi = v^*$.

Next, let

$$q^*_\kappa(v)(s, a) = \max_\pi \mathbb{E}^\pi\Big[\sum_{t=0}^{\infty} (\kappa\gamma)^t \big(r(s_t, a_t) + \gamma(1-\kappa) v(s_{t+1})\big) \,\Big|\, s_0 = s,\, a_0 = a\Big]. \tag{4}$$

The latter is the optimal $q$-function of the surrogate, κγ-discounted MDP with κ-shaped reward (see Remark 1). Thus, we can obtain a κ-greedy policy, $\pi_\kappa \in \mathcal{G}_\kappa(v)$, directly from $q^*_\kappa(v)$:

$$\pi_\kappa(s) \in \arg\max_a q^*_\kappa(v)(s, a).$$

See that for $\kappa = 0$, the greedy policy w.r.t. $q^*_{\kappa=0}(v)$ is the 1-step greedy policy w.r.t. $v$, since $q^*_{\kappa=0}(v)(s, a) = r(s, a) + \gamma\,\mathbb{E}[v(s_1) \mid s_0 = s, a_0 = a]$.

4 Multi-step Policy Improvement and Soft Updates

In this section, we focus on policy improvement of multiple-step greedy policies, performed with soft updates. Soft updates of the 1-step greedy policy have proved necessary and beneficial in prominent algorithms [10, 9, 22]. Here, we begin by describing an intrinsic difficulty in selecting the stepsize parameter when updating with multiple-step greedy policies such as the h- and κ-greedy policies. Specifically, denote by $\pi'$ such a multiple-step greedy policy w.r.t. $v^\pi$. Then, the soft update $(1-\alpha)\pi + \alpha\pi'$ is not necessarily better than $\pi$.

We start by generalizing a useful lemma [20, Lemma 10] (see Appendix A for the proof).

Lemma 1.

Let $v$ be a value function, $\pi$ a policy, and $\kappa \in [0, 1]$. Then

$$T^\pi_\kappa v - v = (I - \kappa\gamma P^\pi)^{-1}(T^\pi v - v).$$

This elementary lemma relates the 'κ-advantage' $T^\pi_\kappa v - v$ to the 1-step advantage $T^\pi v - v$ and is used for proving the following result.
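Taking Lemma 1 to be the identity $T^\pi_\kappa v - v = (I - \kappa\gamma P^\pi)^{-1}(T^\pi v - v)$ (our reconstruction of the stripped formula), it can be checked numerically on a random MDP:

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, kappa = 4, 0.9, 0.6
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)
v = rng.random(n)  # an arbitrary value function

# T^pi v = r^pi + gamma * P^pi v  (1-step Bellman backup under pi)
T_pi_v = r_pi + gamma * P_pi @ v
# T^pi_kappa v = (I - kappa*gamma*P^pi)^{-1} (r^pi + (1-kappa)*gamma*P^pi v)
T_pi_kappa_v = np.linalg.solve(np.eye(n) - kappa * gamma * P_pi,
                               r_pi + (1 - kappa) * gamma * P_pi @ v)

# Lemma 1 (as reconstructed): the kappa-advantage is the resolvent
# (I - kappa*gamma*P^pi)^{-1} applied to the 1-step advantage.
lhs = T_pi_kappa_v - v
rhs = np.linalg.solve(np.eye(n) - kappa * gamma * P_pi, T_pi_v - v)
assert np.allclose(lhs, rhs)
```

The identity follows by adding and subtracting $\kappa\gamma P^\pi v$ inside the resolvent, which is why the check holds to machine precision.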

Theorem 2.

For any MDP, let $\pi$ be a policy and $v^\pi$ its value. Let $\pi_\kappa \in \mathcal{G}_\kappa(v^\pi)$ and $\pi_h \in \mathcal{G}_h(v^\pi)$, where $\kappa \in [0, 1]$ and $h$ is an integer, $h \ge 2$. Consider the mixture policies

$$\pi_{\alpha,\kappa} = (1-\alpha)\pi + \alpha\pi_\kappa, \qquad \pi_{\alpha,h} = (1-\alpha)\pi + \alpha\pi_h, \qquad \alpha \in [0, 1].$$

We have the following equivalences:

  1. The inequality $v^{\pi_{\alpha,\kappa}} \ge v^\pi$ holds for all MDPs if and only if $\alpha \in [\kappa, 1]$.

  2. The inequality $v^{\pi_{\alpha,h}} \ge v^\pi$ holds for all MDPs if and only if $\alpha = 1$.

The above inequalities hold entry-wise, with strict inequality in at least one entry unless $v^\pi = v^*$.

Proof sketch. See Appendix B for the full proof. Here, we only provide a counterexample demonstrating the potential non-monotonicity of the softly-updated κ-greedy policy when the stepsize is not large enough. The same can be shown for the h-greedy policy with the same example.

Figure 1: The Tightrope Walking MDP used in the counterexample of Theorem 2.

Consider the Tightrope Walking MDP in Fig. 1. It describes the act of walking on a rope: in the initial state the agent approaches the rope, in the next state the walking attempt occurs, then comes the goal state, and a final state is repeatedly met as long as the agent keeps falling from the rope, resulting in negative reward.

First, notice that by definition, the optimal policy attempts the walk; we call this policy the "confident" policy. Obviously, this holds for any discount factor. Instead, consider the "hesitant" policy, which avoids the attempt. We now claim that for any discount factor and reward satisfying


the mixture policy of the two is not strictly better than the hesitant policy. To see this, notice that the agent accumulates zero reward if she does not climb the rope. Thus, taking a mixture of the confident and hesitant policies can result in a worse value, due to the risk of falling. Based on this construction, to ensure improvement we find it is necessary that


To finalize the counterexample and show that strict policy improvement is not guaranteed, we choose the parameters such that both (5) and (6) are satisfied, a choice that is only possible when the stepsize is sufficiently small, due to monotonicity.

Theorem 2 guarantees monotonic improvement for the 1-step greedy policy as a special case when $\kappa = 0$. Hence, we get that for any stepsize $\alpha \in [0, 1]$, the mixture of any policy $\pi$ and the 1-step greedy policy w.r.t. $v^\pi$ is monotonically better than $\pi$. To the best of our knowledge, this result was not explicitly stated anywhere. Instead, it appeared within proofs of several famous results, e.g., [10, Lemma 5.4], [9, Corollary 4.2], and [21, Theorem 1].
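The κ = 0 special case can be verified numerically: mixing any policy with its 1-step greedy policy improves the value for every stepsize α — a sketch on a random MDP:

```python
import numpy as np

rng = np.random.default_rng(1)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

def value(pi):
    """v^pi for a stochastic policy pi[s, a], via the closed-form solve."""
    P_pi = np.einsum('sa,ast->st', pi, P)
    r_pi = np.einsum('sa,sa->s', pi, r)
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

# Arbitrary stochastic base policy.
pi = rng.random((nS, nA)); pi /= pi.sum(axis=1, keepdims=True)
v = value(pi)

# 1-step greedy policy w.r.t. v^pi, as a one-hot stochastic policy.
q = r + gamma * np.einsum('ast,t->sa', P, v)
pi_g = np.eye(nA)[q.argmax(axis=1)]

# Any mixture stepsize alpha in (0, 1] yields monotonic improvement.
for alpha in (0.01, 0.3, 1.0):
    pi_mix = (1 - alpha) * pi + alpha * pi_g
    assert np.all(value(pi_mix) >= v - 1e-9)
```

The argument behind the check: $T^{\pi_{mix}} v^\pi - v^\pi = \alpha(T^{\pi_g} v^\pi - v^\pi) \ge 0$, and component-wise non-negative 1-step advantage implies $v^{\pi_{mix}} \ge v^\pi$.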

In the rest of the paper, we shall focus on the κ-greedy policy and extend it to the online and approximate cases. The discovery that the softly-updated κ-greedy policy w.r.t. $v^\pi$ is not necessarily strictly better than $\pi$ will guide us in appropriately devising the algorithms.

5 Online κ-Policy Iteration with Cautious Soft Updates

In [5], it was shown that using the κ-greedy policy in the improvement stage leads to a converging PI procedure – the κ-PI algorithm. This algorithm repeats two stages: i) finding the optimal policy of a small-horizon surrogate MDP with shaped reward, and ii) calculating the value of that policy and using it to shape the reward of the next iteration. Here, we devise a practical version of κ-PI, which is model-free, online, and runs in two timescales, i.e., performs i) and ii) simultaneously.

The method is depicted in Algorithm 1. It is similar to the asynchronous PI analyzed in [16], except for two major differences. First, the fast timescale tracks both the value of the current policy and the $q$-function of the surrogate MDP, rather than the value alone. It thereby enables access to both the κ-greedy and the 1-step greedy policies. The 1-step greedy policy is attained via the value estimate, which is plugged into a Q-learning [29] update rule for obtaining the κ-greedy policy; the latter essentially solves the surrogate, κγ-discounted MDP (see Remark 1). The second difference is in the slow timescale; there, the policy is updated using a new operator, defined below. To better understand this operator, first notice that in Stochastic Approximation methods such as Algorithm 1, the policy is improved using soft updates with decaying stepsizes. However, as Theorem 2 states, monotonic improvement is not guaranteed below a certain stepsize value. Hence, given a value estimate and a policy, we set the updated policy to be the κ-greedy policy only when improvement is assured:


We respectively denote the state and state-action-pair visitation counters after the $t$-th time-step by $n_t(s)$ and $n_t(s, a)$. The stepsize sequences satisfy the common assumption (B2) in [16], among which are the usual divergence and square-summability conditions. The second moments of the rewards are assumed to be bounded. Furthermore, let $\mu$ be some measure over the state space s.t. $\mu(s) > 0$ for every state $s$. Then, we assume access to a generative model from which we sample a state $s \sim \mu$, sample an action $a$, apply the action, and receive the reward $r$ and next state $s'$.

Algorithm 1 Two-Timescale Online κ-Policy-Iteration

1:  initialize: …
2:  for … do
3:      …
4:      # Fast-timescale updates
5:      …
6:      …
7:      …
8:      …
9:      # Slow-timescale updates
10:     …
11: end for
12: return: …
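The two-timescale structure can be sketched schematically as follows. This is an illustrative simplification, not the paper's exact update rules: the stepsize schedules are time-based rather than visitation-counter-based, and the 'cautious' test is a crude stand-in for the improvement-checking operator:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma, kappa = 4, 2, 0.9, 0.5
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

q = np.zeros((nS, nA))        # fast timescale: q-function of the surrogate MDP
v = np.zeros(nS)              # fast timescale: value estimate of current policy
pi = np.zeros(nS, dtype=int)  # slow timescale: current (deterministic) policy
s = 0
for t in range(1, 50001):
    # epsilon-greedy exploration around the current policy
    a = rng.integers(nA) if rng.random() < 0.1 else pi[s]
    s2 = rng.choice(nS, p=P[a, s])
    rew = r[s, a]
    fast, slow = 1.0 / t**0.6, 1.0 / t  # fast stepsize decays more slowly
    # Fast timescale: TD(0) for v (on-policy transitions only), and Q-learning
    # on the shaped, (kappa*gamma)-discounted surrogate MDP (cf. Remark 1).
    if a == pi[s]:
        v[s] += fast * (rew + gamma * v[s2] - v[s])
    shaped = rew + gamma * (1.0 - kappa) * v[s2]
    q[s, a] += fast * (shaped + kappa * gamma * q[s2].max() - q[s, a])
    # Slow timescale: 'cautious' improvement -- adopt the kappa-greedy action
    # only when it looks strictly better under the surrogate q-estimate.
    if rng.random() < slow and q[s].max() > q[s, pi[s]]:
        pi[s] = int(q[s].argmax())
    s = s2
```

The key structural point is the separation of rates: the estimates $(v, q)$ track their targets quickly, while the policy changes on a slower clock, as in standard two-timescale Stochastic Approximation.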

The fast-timescale update rules in lines 6 and 8 can be jointly written as the sum of a mapping $F$ and a martingale-difference noise term.

Definition 1.

Let Then the mapping is defined as follows .

where .

The following lemma shows that, given a fixed policy, $F$ is a contraction, analogously to [16, Lemma 5.3] (see Appendix C for the proof).

Lemma 3.

is a -contraction in the max-norm. Its unique fixed point is as defined in (1) and (4).

Finally, after several intermediate results in Appendix D, we establish convergence of Algorithm 1.

Theorem 4.

The coupled process in Algorithm 1 converges to the limit , where is the optimal Q-function and is the optimal policy.

6 Approximate κ-Policy Iteration with Hard Updates

Theorem 2 establishes the conditions required for guaranteed monotonic improvement of softly-updated multiple-step greedy policies. The algorithm in Section 5 accounts for these conditions to ensure convergence. In contrast, in this section, we derive and study algorithms that perform hard policy updates. Specifically, we generalize the prominent Approximate Policy Iteration (API) [13, 7, 11] and Policy Search by Dynamic Programming (PSDP) [1, 19] schemes.

For both, we obtain performance guarantees that exhibit a tradeoff in the choice of $\kappa$, with the optimal performance bound achieved with $\kappa = 1$; i.e., our approximate κ-generalized PI methods outperform the 1-step greedy approximate PI methods in terms of best known guarantees.

For the algorithms in this section, we assume an oracle that returns a κ-greedy policy up to some error. Formally, we denote by $\mathcal{G}_\kappa(v, \delta, \mu)$ the set of approximate κ-greedy policies w.r.t. $v$, with approximation error $\delta$ under some measure $\mu$.

Definition 2 (Approximate -greedy policy).

Let $v$ be a value function, $\delta \ge 0$ a real number, and $\mu$ a distribution over $\mathcal{S}$. A policy $\pi \in \mathcal{G}_\kappa(v, \delta, \mu)$ if $\mu(T_\kappa v - T^\pi_\kappa v) \le \delta$.

Such a device can be implemented using existing approximate methods, e.g., Conservative Policy Iteration (CPI) [9], approximate PI or VI [7], Policy Search [21], or by having access to an approximate model of the environment. The approximate κ-greedy oracle assumed here is less restrictive than the one assumed in [5]. There, a uniform error over all states was assumed, whereas here, the error is defined w.r.t. a specific measure, $\mu$. For practical purposes, $\mu$ can be thought of as the initial sampling distribution of the MDP. Lastly, notice that the larger $\kappa$ is, the harder it is to solve the surrogate κγ-discounted MDP, since its discount factor is larger [17, 26, 8]; i.e., the computational cost of each call to the oracle increases.

Using the concept of concentrability coefficients introduced in [13] (there, they were originally termed "diffusion coefficients"), we follow the line of work in [13, 14, 7, 19, 11] to prove our performance bounds. This allows a direct comparison of the algorithms proposed here with previously studied approximate 1-step greedy algorithms. Namely, our bounds consist of concentrability coefficients from [19, 11], as well as two new coefficients, defined as follows.

Definition 3 (Concentrability coefficients).

Let be some measures over Let be the sequence of the smallest values in such that for every for all sequences of deterministic stationary policies . Let and . For brevity, we denote as

Similarly, let be the sequence of the smallest values in such that for every . Let

Finally, we introduce the following two new coefficients. Let . Also, let be the smallest value s.t. where is a probability measure and is a stochastic matrix.

In the definition above, $\mu$ is the measure according to which the approximate improvement is guaranteed, while the second measure specifies the distribution on which one measures the loss that we wish to bound. The latter coefficient was previously defined in, e.g., [19, Definition 1].

Before giving our performance bounds, we first study the behavior of the coefficients appearing in them. The following lemma sheds light on the behavior of the new coefficient; specifically, it shows that under certain constructions, it decreases as $\kappa$ increases (a smaller coefficient is obviously better; the best value for any concentrability coefficient is 1). See proof in Appendix E.

Lemma 5.

Let . Then, for all , there exists such that The inequality is strict for . For this implies that is a decreasing function of .

Definition 3 introduces several coefficients with which we shall derive our bounds. Though traditional arithmetic relations between them do not exist, they do comply with some notion of ordering.

Remark 2 (Order of concentrability coefficients).

In [19], an order between the concentrability coefficients was introduced: a coefficient is said to be strictly better than — a relation we denote with — if and only if i) implies and ii) there exists an MDP for which and . Particularly, it was argued that and that if . In this sense, is analogous to , while its definition might suggest improvement as increases. Moreover, combined with the fact that improves as increases, as Lemma 5 suggests, we get that is better than all previously defined concentrability coefficients.

6.1 κ-Approximate Policy Iteration

A natural generalization of API [13, 19, 11] to the multiple-step greedy policy is κ-API, given in Algorithm 2. In each of its iterations, the policy is updated to an approximate κ-greedy policy w.r.t. the value of the current policy; i.e., a policy from the set of Definition 2.

Algorithm 2 κ-API

  initialize: …
  for … do
    …
  end for
  return …

Algorithm 3 κ-PSDP

  initialize: …
  for … do
    …
    Append …
  end for
  return …
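A minimal sketch of κ-API with hard updates, using an exact surrogate solver as a stand-in for the approximate κ-greedy oracle (the toy MDP, helper names, and iteration counts are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma, kappa = 5, 3, 0.9, 0.7
P = rng.random((nA, nS, nS)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((nS, nA))

def policy_value(pi):
    """v^pi = (I - gamma P^pi)^{-1} r^pi for a deterministic policy pi."""
    P_pi = P[pi, np.arange(nS)]
    r_pi = r[np.arange(nS), pi]
    return np.linalg.solve(np.eye(nS) - gamma * P_pi, r_pi)

def kappa_greedy(v, iters=1000):
    """Stand-in for the oracle: exactly solve the (kappa*gamma)-discounted
    surrogate MDP with reward shaped by gamma*(1-kappa)*E[v(s')]."""
    r_shaped = r + gamma * (1.0 - kappa) * np.einsum('ast,t->sa', P, v)
    u = np.zeros(nS)
    for _ in range(iters):
        q = r_shaped + kappa * gamma * np.einsum('ast,t->sa', P, u)
        u = q.max(axis=1)
    return q.argmax(axis=1)

# kappa-API with hard updates: pi_{k+1} is kappa-greedy w.r.t. v^{pi_k}.
pi = np.zeros(nS, dtype=int)
for _ in range(100):
    pi = kappa_greedy(policy_value(pi))

# With an exact oracle, the iterates reach an optimal policy:
# v^pi satisfies the Bellman optimality equation.
v = policy_value(pi)
q_star = r + gamma * np.einsum('ast,t->sa', P, v)
assert np.allclose(v, q_star.max(axis=1), atol=1e-6)
```

With the exact oracle replaced by an approximate one, the loop is unchanged; only the guarantee degrades to the bounds of Theorem 6.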

The following theorem gives a performance bound for κ-API (see proof in Appendix F), with

where is a bounded function for

Theorem 6.

Let be the policy at the -th iteration of -API and be the error as defined in Definition 2. Then

Also, let Then

For brevity, we now discuss the first part of the statement; the same insights hold for the second as well. The bound for the original API is restored in the 1-step greedy case of $\kappa = 0$ [19, 11]. As in the case of API, our bound consists of a fixed approximation-error term and a geometrically decaying term. As for the other extreme, we first recall that in the non-approximate case, applying the $\kappa = 1$ greedy step amounts to solving the original γ-discounted MDP in a single step [5, Remark 4]. In the approximate setup we investigate here, this results in the vanishing of the second, geometrically decaying term. We are then left with a single constant approximation error. Notice that this term is independent of $\kappa$ (see Definition 3). It represents the mismatch between the two measures [9].

Next, notice that, by definition, the corresponding coefficients are ordered. Given the discussion above, we have that the κ-API performance bound is strictly smaller with $\kappa = 1$ than with $\kappa = 0$. Hence, the bound suggests that κ-API is strictly better than the original API for $\kappa = 1$. Since all expressions there are continuous, this behavior does not hold solely point-wise.

Remark 3 (Performance tradeoff).

Naively, the above observation would lead to the choice of $\kappa = 1$. However, it is reasonable to assume that $\delta$, the error of the κ-greedy step, itself depends on $\kappa$; i.e., $\delta = \delta(\kappa)$. The general form of such dependence is expected to be monotonically increasing: as the effective horizon of the surrogate κγ-discounted MDP becomes larger, its solution is harder to obtain (see Remark 1). Thus, Theorem 6 reveals a performance tradeoff as a function of $\kappa$.

6.2 κ-Policy Search by Dynamic Programming

We continue by generalizing another approximate PI method – PSDP [1, 19]. We name it κ-PSDP and introduce it in Algorithm 3. This algorithm updates the policy differently from κ-API but, similarly to κ-API, uses hard updates. We will show that this algorithm exhibits better performance than any other previously analyzed approximate PI method [19].

The κ-PSDP algorithm, unlike κ-API, returns a sequence of deterministic policies. Given this sequence, we build a single, non-stationary policy by successively running a random number of steps of the last policy in the sequence, followed by a random number of steps of the policy preceding it, and so on, where the run lengths are i.i.d. geometric random variables. Once this process reaches the first policy in the sequence, it runs it indefinitely. We shall refer to this construction as the induced non-stationary policy. Its value can be seen to satisfy

This algorithm follows PSDP from [19]. Differently from it, the 1-step improvement is generalized to the κ-greedy improvement, and the non-stationary policy behaves randomly. The following theorem gives a performance bound for it (see proof in Appendix G).
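The randomized switching schedule can be sketched as follows. The geometric parameter is an assumption on our part (success probability $1-\kappa$, so runs lengthen as $\kappa$ grows), and the policy labels are placeholders:

```python
import numpy as np

rng = np.random.default_rng(4)
kappa = 0.7
# PSDP's stored policy sequence, newest first (placeholder labels).
policies = ['pi_k', 'pi_{k-1}', 'pi_{k-2}', 'pi_0']

def rollout_schedule(horizon):
    """Sample which stored policy acts at each time step: run the newest
    policy for a Geometric(1 - kappa) number of steps, then its predecessor,
    and so on; once the oldest policy is reached, it runs indefinitely."""
    schedule, i = [], 0
    while len(schedule) < horizon:
        # numpy's geometric is supported on {1, 2, ...}
        n = rng.geometric(1 - kappa) if i < len(policies) - 1 else horizon
        schedule += [policies[min(i, len(policies) - 1)]] * n
        i += 1
    return schedule[:horizon]

sched = rollout_schedule(12)
print(sched)  # e.g., a few steps of pi_k, then pi_{k-1}, ...
```

Larger κ yields longer expected runs of each policy, consistent with the κ = 1 limit in which a single policy is executed forever.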

Theorem 7.

Let be the policy at the -th iteration of -PSDP and be the error as defined in Definition 2. Then

Also, let Then

Compared to κ-API from the previous section, the κ-PSDP bound consists of a different fixed approximation error and a shared geometrically decaying term. Regarding the former, notice that its coefficient is better in the sense of Remark 2. This suggests that κ-PSDP is strictly better than κ-API in the metrics we consider, in line with the comparison of the original API to the original PSDP given in [19].

Similarly to the previous section, we again see that substituting $\kappa = 1$ gives a tighter bound than $\kappa = 0$. Also, again from continuity, this behavior is interpolated for intermediate values of $\kappa$; i.e., we have that κ-PSDP is generally better than PSDP. Moreover, the tradeoff discussion in Remark 3 applies here as well.

An additional advantage of this new algorithm over PSDP is reduced space complexity. This can be seen, e.g., from the denominator in the choice of the number of iterations in the second part of Theorem 7. It shows that, since the bound's decay factor is a strictly decreasing function of $\kappa$, performance is preserved with significantly fewer iterations by increasing $\kappa$. Since the size of the stored policy sequence is linearly dependent on the number of iterations, a larger $\kappa$ improves space efficiency.

7 Discussion and Future Work

In this work, we introduced and analyzed online and approximate PI methods, generalized to the -greedy policy, an instance of a multiple-step greedy policy. Doing so, we discovered two intriguing properties compared to the well-studied 1-step greedy policy, which we believe can be impactful in designing state-of-the-art algorithms. First, successive application of multiple-step greedy policies with a soft, stepsize-based update does not guarantee improvement; see Theorem 2. To mitigate this caveat, we designed an online PI with a ‘cautious’ improvement operator; see Section 5.

The second property we find intriguing stemmed from analyzing generalizations of known approximate hard-update PI methods. In Section 6, we revealed a performance tradeoff in $\kappa$, which can be interpreted as a tradeoff between short-horizon bootstrap bias and long-rollout variance. This corresponds to the known tradeoff in the famous TD($\lambda$).

The two characteristics above lead to new compelling questions. The first regards improvement operators: would a non-monotonically improving PI scheme necessarily fail to converge to the optimal policy? Our attempts to generalize existing proof techniques to show convergence in such cases have fallen short. Specifically, in the online case, Lemma 5.4 in [10] does not hold with multiple-step greedy policies; similar issues arise when trying to form a κ-CPI algorithm via, e.g., an attempt to generalize Corollary 4.2 in [9]. Another research question regards the choice of the parameter $\kappa$, given the tradeoff it poses. One possible direction for answering it could be investigating the concentrability coefficients further and attempting to characterize them for specific MDPs, either theoretically or via estimation. Lastly, a next indisputable step would be to empirically evaluate implementations of the algorithms presented here.


  • [1] J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming. In Advances in neural information processing systems, pages 831–838, 2004.
  • [2] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-dynamic programming: an overview. In Decision and Control, 1995., Proceedings of the 34th IEEE Conference on, volume 1, pages 560–564. IEEE, 1995.
  • [3] Bruno Bouzy and Bernard Helmstetter. Monte-carlo go developments. In Advances in computer games, pages 159–174. Springer, 2004.
  • [4] Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of monte carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in games, 4(1):1–43, 2012.
  • [5] Yonathan Efroni, Gal Dalal, Bruno Scherrer, and Shie Mannor. Beyond the one step greedy approach in reinforcement learning. arXiv preprint arXiv:1802.03654, 2018.
  • [6] Damien Ernst, Mevludin Glavic, Florin Capitanescu, and Louis Wehenkel. Reinforcement learning versus model predictive control: a comparison on a power system problem. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics), 39(2):517–529, 2009.
  • [7] Amir-massoud Farahmand, Csaba Szepesvári, and Rémi Munos. Error propagation for approximate policy and value iteration. In Advances in Neural Information Processing Systems, pages 568–576, 2010.
  • [8] Nan Jiang, Alex Kulesza, Satinder Singh, and Richard Lewis. The dependence of effective planning horizon on model accuracy. In Proceedings of the 2015 International Conference on Autonomous Agents and Multiagent Systems, pages 1181–1189. International Foundation for Autonomous Agents and Multiagent Systems, 2015.
  • [9] S.M. Kakade and J. Langford. Approximately Optimal Approximate Reinforcement Learning. In International Conference on Machine Learning, pages 267–274, 2002.
  • [10] Vijaymohan R Konda and Vivek S Borkar. Actor-critic–type learning algorithms for markov decision processes. SIAM Journal on control and Optimization, 38(1):94–123, 1999.
  • [11] Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of classification-based policy iteration algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.
  • [12] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International Conference on Machine Learning, pages 1928–1937, 2016.
  • [13] Rémi Munos. Error bounds for approximate policy iteration. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, pages 560–567. AAAI Press, 2003.
  • [14] Rémi Munos. Performance bounds in l_p-norm for approximate value iteration. SIAM journal on control and optimization, 46(2):541–561, 2007.
  • [15] Rudy R Negenborn, Bart De Schutter, Marco A Wiering, and Hans Hellendoorn. Learning-based model predictive control for markov decision processes. IFAC Proceedings Volumes, 38(1):354–359, 2005.
  • [16] Steven Perkins and David S Leslie. Asynchronous stochastic approximation with differential inclusions. Stochastic Systems, 2(2):409–446, 2013.
  • [17] Marek Petrik and Bruno Scherrer. Biasing approximate dynamic programming with a lower discount factor. In Advances in neural information processing systems, pages 1265–1272, 2009.
  • [18] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 1994.
  • [19] Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine Learning, pages 1314–1322, 2014.
  • [20] Bruno Scherrer. Improved and generalized upper bounds on the complexity of policy iteration. Mathematics of Operations Research, 41(3):758–774, 2016.
  • [21] Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 35–50. Springer, 2014.
  • [22] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • [23] Brian Sheppard. World-championship-caliber scrabble. Artificial Intelligence, 134(1-2):241–275, 2002.
  • [24] David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, et al. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. arXiv preprint arXiv:1712.01815, 2017.
  • [25] David Silver, Julian Schrittwieser, Karen Simonyan, Ioannis Antonoglou, Aja Huang, Arthur Guez, Thomas Hubert, Lucas Baker, Matthew Lai, Adrian Bolton, et al. Mastering the game of go without human knowledge. Nature, 550(7676):354, 2017.
  • [26] Alexander L Strehl, Lihong Li, and Michael L Littman. Reinforcement learning in finite mdps: Pac analysis. Journal of Machine Learning Research, 10(Nov):2413–2444, 2009.
  • [27] Aviv Tamar, Garrett Thomas, Tianhao Zhang, Sergey Levine, and Pieter Abbeel. Learning from the hindsight plan—episodic mpc improvement. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 336–343. IEEE, 2017.
  • [28] Gerald Tesauro and Gregory R Galperin. On-line policy improvement using monte-carlo search. In Advances in Neural Information Processing Systems, pages 1068–1074, 1997.
  • [29] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learning, 8(3-4):279–292, 1992.

Appendix A Proof of Lemma 1

The proof is a straightforward generalization of the proof in [20, Lemma 10], and [9, Remark 6.1].

Appendix B Proof of Theorem 2

First, since , we have that

and thus,

We here give the proof of the first part of the first statement in Theorem 2.

In the first line we used Lemma 1, in the third line we used Lemma 1 again, and in the fifth line we used the improvement property.

We now show that, for the stated range of stepsizes, all terms are component-wise greater than or equal to zero. The first is a weighted sum of semi-positive matrices with positive weights. The same holds for the second, since it too is a weighted sum of semi-positive matrices with positive weights. Thus, the product is semi-positive component-wise. Lastly, the remaining term is nonnegative,

with equality if and only if