Provably Efficient Adaptive Approximate Policy Iteration



Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games (Silver et al., 2016) and robotics (Kober et al., 2013). However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys an $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on adversarial online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.


1 Introduction

Our work focuses on model-free algorithms for learning in infinite-horizon undiscounted continuing Markov decision processes (MDPs), which capture tasks such as game playing, routing, and the control of physical systems. Unlike model-based algorithms, which estimate a model of the environment dynamics and plan based on the estimated model, model-free algorithms directly optimize the expected return of policies, which is the objective of interest. In combination with powerful function approximation, model-free algorithms have recently achieved impressive advances in multiple applications (Mnih et al., 2015; Van Hasselt et al., 2016). Unfortunately, many successful approaches come with few performance guarantees, especially in the infinite-horizon undiscounted, continuing case. Existing theoretical results often only apply to episodic or discounted MDPs, with either tabular representation (Jin et al., 2018) or known special MDP structure (Even-Dar et al., 2009; Neu et al., 2014). In this work, we introduce Adaptive Approximate Policy Iteration (AAPI), a practical model-free learning scheme that can work with function approximation. We analyze its performance in infinite-horizon undiscounted, continuing MDPs in terms of high-probability regret with respect to a fixed policy.

Our approach follows the “online MDP” line of work (Even-Dar et al., 2009; Neu et al., 2014), where the agent iteratively selects policies by running an adversarial online learning algorithm in each state, and the loss fed to each algorithm is the policy Q-function in that state. This results in a variant of approximate policy iteration (API), where the policy improvement step produces a policy optimal in hindsight w.r.t. the sum of all previous Q-functions rather than just the most recent one. The original work of Even-Dar et al. (2009) focused on the tabular case with known dynamics and adversarial reward functions. More recent works (Abbasi-Yadkori et al., 2019a, b) have adapted this approach to the realistic setting of unknown dynamics, stochastic rewards, and value function approximation.

We improve over existing results in this setting (Abbasi-Yadkori et al., 2019a) by exploiting the fact that losses (Q-function estimates) are slow-changing rather than completely adversarial. In particular, our policy improvement step relies on the adaptive optimistic follow-the-regularized-leader (AO-FTRL) update (Mohri & Yang, 2016). The resulting policies are Boltzmann distributions over the sum of past estimated Q-functions, coupled with an optimistic prediction of the upcoming loss and a state-dependent adaptive learning rate (softmax temperature). Intuitively, the adaptive learning rate makes the policy more exploratory in states on which past consecutive Q-estimates disagree.

On the theoretical side, we prove the first $\tilde{O}(T^{2/3})$ regret upper bound in the undiscounted, continuing setting with function approximation. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ by Abbasi-Yadkori et al. (2019a) for the same setting, which ignores the slow-changing nature of the estimated Q-functions. We exploit the fact that the change in consecutive Q-function estimates can be bounded by the change in policies. Our analysis relies on the results of Rakhlin & Sridharan (2013), but employs a different regret decomposition, with additional information provided by MDP properties. We emphasize that our learning framework is not limited to a particular function approximation method, and that in practice it serves the purpose of appropriately regularizing the policy improvement step of API.

2 Related work

Undiscounted MDPs. Most no-regret algorithms for infinite-horizon undiscounted MDPs are model-based, and only applicable to tabular representations (Bartlett, 2009; Jaksch et al., 2010; Ouyang et al., 2017; Fruit et al., 2018; Jian et al., 2019; Talebi & Maillard, 2018). For a weakly-communicating MDP with $S$ states, $A$ actions, and diameter $D$, these algorithms match the minimax lower bound of $\Omega(\sqrt{DSAT})$ of Jaksch et al. (2010) with high probability up to logarithmic and structure-dependent factors. The downside of model-based algorithms is that they are memory intensive in large-scale MDPs, as they require $O(S^2 A)$ storage, and they are difficult to extend to continuous-valued states.

In the model-free tabular setting, Wei et al. (2019) show that optimistic Q-learning achieves $\tilde{O}(T^{2/3})$ regret in weakly-communicating MDPs, with a dependence on the span of the optimal state-value function (upper bounded by the diameter $D$). In the case of uniformly ergodic MDPs, they show an $\tilde{O}(\sqrt{T})$ bound on expected regret, with a dependence on the mixing time and the stationary distribution mismatch coefficient.

In the model-free setting with function approximation, the POLITEX algorithm (Abbasi-Yadkori et al., 2019a), a variant of API, achieves $\tilde{O}(T^{3/4})$ regret in ergodic MDPs, under the assumption that the policy evaluation error scales as $O(\sqrt{d/\tau})$ after $\tau$ transitions. Here $d$ is the size of the compressed state-action space ($d = SA$ for the tabular representation, or the number of features for linear $Q$-functions).

Episodic MDPs. In episodic MDPs with horizon $H$, Jin et al. (2018) show an $\tilde{O}(\sqrt{H^3 SAT})$ regret bound for Q-learning with UCB exploration under a tabular representation. With function approximation, Yang & Wang (2019b) and Jin et al. (2019) show $\tilde{O}(\sqrt{T})$-style regret bounds for an optimistic version of least-squares value iteration by assuming a special linear MDP structure. Their algorithms heavily exploit this linear MDP structure, which is rarely satisfied in practice. The RLSVI algorithm (Osband et al., 2016) performs exploration in the value function parameter space, and therefore can be applied with function approximation, though its worst-case regret bound holds only in the tabular setting (Russo, 2019). In the infinite-horizon discounted setting, single-trajectory regret makes less sense as an objective, and most theoretical results focus on the sample complexity of exploration in the tabular case (Strehl et al., 2006; Dong et al., 2019).

Conservative PI. AAPI is also similar to the conservative policy iteration works, which attempt to stabilize API by regularizing each policy towards the previous policy (Kakade & Langford, 2002; Schulman et al., 2015, 2017; Abdolmaleki et al., 2018; Geist et al., 2019). Our policy improvement step can be seen as regularizing each policy by the KL-divergence to the previous policy; the reduction to online learning offers a principled way to incorporate such regularization.
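To make the connection to KL regularization concrete: with a fixed learning rate $\eta$, the entropy-regularized improvement step can be rewritten as a KL-regularized update. This is a standard identity, stated here as a sketch rather than the paper's exact formulation:

```latex
\pi_{k+1}(\cdot \mid s)
  = \operatorname*{arg\,max}_{p \in \Delta_{\mathcal{A}}}
    \Big\{ \eta \, \big\langle p, \widehat{Q}_k(s,\cdot) \big\rangle
           - \mathrm{KL}\big( p \,\big\|\, \pi_k(\cdot \mid s) \big) \Big\}
  \quad \Longrightarrow \quad
  \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s) \,
    \exp\big( \eta \, \widehat{Q}_k(s, a) \big).
```

Unrolling this recursion from a uniform initial policy yields a Boltzmann distribution over the sum of past action-value estimates, which is exactly the form of policy that the online learning reduction produces.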

3 Problem setting and definitions

We first introduce some notation. We use $\Delta_{\mathcal{X}}$ to denote the space of probability distributions defined on a set $\mathcal{X}$. For a vector $x$ and a distribution $\mu$, we define the weighted $\ell_2$-norm as $\|x\|_{\mu}^2 = \sum_i \mu_i x_i^2$ and the $\ell_\infty$-norm as $\|x\|_\infty = \max_i |x_i|$. In general, we treat discrete distributions as row vectors.

Infinite-horizon undiscounted MDPs are characterized by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a reward function $r$, and a transition probability function $P$. The agent does not know the transition probability and the reward function in advance. A policy $\pi$ is a mapping from a state to a distribution over actions. Let $(s_1, a_1, s_2, a_2, \ldots)$ denote the state-action sequence obtained by following policy $\pi$. The expected average reward of policy $\pi$ is defined as

$$\lambda^{\pi} = \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\Big[ \sum_{t=1}^{T} r(s_t, a_t) \Big].$$

The agent interacts with the environment as follows: at each round $t$, the agent observes a state $s_t$, chooses an action $a_t$, and receives a reward $r(s_t, a_t)$. The environment then transitions to the next state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. The initial state $s_1$ is randomly generated from some unknown distribution. Let $\pi^*$ be an unknown fixed policy. The regret of an algorithm with respect to this fixed policy is defined as

$$R_T = \sum_{t=1}^{T} \big( \lambda^{\pi^*} - r(s_t, a_t) \big). \qquad (1)$$

The learning goal is to find an algorithm that minimizes the long-term regret $R_T$.

For each policy $\pi$, we denote by $P^{\pi}$ the transition matrix induced by $\pi$, where the $(s, s')$ component is the transition probability from $s$ to $s'$ under $\pi$, i.e. $P^{\pi}(s, s') = \sum_{a} \pi(a \mid s) P(s' \mid s, a)$. For a distribution $\nu$ over $\mathcal{S}$, we let $\nu P^{\pi}$ be the distribution over $\mathcal{S}$ that results from executing the policy $\pi$ for one step after the initial state is sampled from $\nu$. A stationary distribution $\mu^{\pi}$ of a policy $\pi$ over states satisfies $\mu^{\pi} P^{\pi} = \mu^{\pi}$. In addition, the expected reward can also be expressed as $\lambda^{\pi} = \sum_{s, a} \mu^{\pi}(s)\, \pi(a \mid s)\, r(s, a)$.

We assume that all MDPs are ergodic. An MDP is ergodic if the Markov chain induced by any policy is both irreducible and aperiodic, which means any state is reachable from any other state by following a suitable policy. Learning in ergodic MDPs is generally easier than in weakly-communicating MDPs because ergodic MDPs are themselves exploratory. It is well known that every ergodic MDP has a unique stationary state distribution, so $\mu^{\pi}$ and $\lambda^{\pi}$ are well-defined. In addition, ergodic MDPs have a finite mixing time, defined below. {definition} The mixing time of an ergodic MDP is defined as

$$t_{\mathrm{mix}} = \max_{\pi} \min \big\{ t \geq 1 : \| \nu (P^{\pi})^{t} - \mu^{\pi} \|_{1} \leq 1/4 \text{ for all distributions } \nu \big\},$$

which characterizes how fast the MDP reaches its stationary distribution from any state under any policy.
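To make the definition concrete, the following sketch estimates the mixing time of a single Markov chain numerically. The function names and the conventional 1/4 total-variation threshold are our own choices for illustration; the paper's definition additionally maximizes over policies.

```python
import numpy as np

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a distribution."""
    vals, vecs = np.linalg.eig(P.T)
    v = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
    return v / v.sum()

def mixing_time(P, eps=0.25, max_t=10_000):
    """Smallest t with max_s TV(P^t(s, .), mu) <= eps (worst-case start state)."""
    mu = stationary_distribution(P)
    Pt = np.eye(P.shape[0])
    for t in range(1, max_t + 1):
        Pt = Pt @ P
        if 0.5 * np.abs(Pt - mu).sum(axis=1).max() <= eps:
            return t
    return max_t

# A lazy two-state chain, standing in for the chain induced by some policy.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
t_mix = mixing_time(P)
```

Here `Pt - mu` broadcasts the stationary row distribution against each row of the $t$-step kernel, so the maximum is taken over starting states, matching the "from any state" part of the definition.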

Finally, we define the value function under policy $\pi$ as

$$V^{\pi}(s) = \mathbb{E}\Big[ \sum_{t=1}^{\infty} \big( r(s_t, a_t) - \lambda^{\pi} \big) \,\Big|\, s_1 = s \Big].$$

The state-action value function $Q^{\pi}$ and $V^{\pi}$ are solutions (unique up to an additive constant) to the Bellman equation:

$$Q^{\pi}(s, a) = r(s, a) - \lambda^{\pi} + \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s'), \qquad V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a).$$


4 Algorithm

AAPI is a variant of approximate policy iteration and proceeds in phases. Suppose the total number of rounds is $T$. We divide the $T$ rounds into $K$ phases of length $\tau$, and assume $K = T/\tau$ is an integer for simplicity. Within each phase, our algorithm performs two tasks: policy evaluation and policy improvement.

Policy evaluation. In each phase $k$, the algorithm executes the current policy $\pi_k$ for $\tau$ time steps and computes an estimate $\widehat{Q}_k$ of the true action-value function $Q^{\pi_k}$. We leave the value function estimation method unspecified; for example, one can use incremental algorithms, and either on-policy or off-policy data. AAPI is thus better interpreted as a learning scheme. Our regret analysis will require that longer phase lengths lead to better estimates (made precise in Lemma 6).

Policy improvement. For each state $s$, the policy improvement step takes the form of the adaptive optimistic follow-the-regularized-leader (AO-FTRL) update (Mohri & Yang, 2016):

$$\pi_{k+1}(\cdot \mid s) = \operatorname*{arg\,max}_{p \in \Delta_{\mathcal{A}}} \Big\{ \Big\langle p,\; \sum_{j=1}^{k} \widehat{Q}_j(s, \cdot) + M_{k+1}(s, \cdot) \Big\rangle - \frac{1}{\eta_k(s)} R(p) \Big\}. \qquad (4)$$

(See Step 3 in Section 6 for a generic description of AO-FTRL.) The terms in Eq. (4) are as follows:

  • The estimates $\widehat{Q}_j$ are the loss functions fed to the AO-FTRL algorithm, $R$ is the negative entropy regularizer, and $\Delta_{\mathcal{A}}$ is the probability simplex over actions.

  • The side-information $M_{k+1}$ is a vector computable from past information that is predictive of the next loss $\widehat{Q}_{k+1}$. Since the policies are expected to change slowly due to the nature of exponential-weight-average type algorithms, we set $M_{k+1} = \widehat{Q}_k$ (better guesses, such as off-policy estimates, can be used if available).

  • The choice of learning rate $\eta_k(s)$ is crucial both theoretically and empirically. In particular, we choose $\eta_k(s)$ in a data-dependent fashion as

$$\eta_k(s) = \frac{\alpha}{\sqrt{\sum_{j=1}^{k} \big\| \widehat{Q}_j(s, \cdot) - \widehat{Q}_{j-1}(s, \cdot) \big\|_{\infty}^{2}}}.$$

    A notable feature of $\eta_k(s)$ is that it is also state-dependent. Intuitively, for the choice $M_{k+1} = \widehat{Q}_k$, the adaptive state-dependent learning rate results in a more exploratory policy for the states on which there is more disagreement between the past consecutive action-value functions.

Based on (4), the next policy is a Boltzmann distribution (a consequence of the negative entropy regularizer) over the sum of all past state-action value estimates and the side-information:

$$\pi_{k+1}(a \mid s) \propto \exp\Big( \eta_k(s) \Big( \sum_{j=1}^{k} \widehat{Q}_j(s, a) + M_{k+1}(s, a) \Big) \Big).$$

AAPI is similar to the POLITEX algorithm (Abbasi-Yadkori et al., 2019a); the main difference is that POLITEX sets the next policy to a Boltzmann distribution over $\sum_{j=1}^{k} \widehat{Q}_j$ with a fixed learning rate and no side-information in the improvement step. We demonstrate that the use of side-information and adaptive learning rates improves both the theoretical guarantees (Theorem 5) and the empirical performance (Section 7) over POLITEX. The overall algorithm is summarized in Algorithm 1.
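In code, the improvement step reduces to a per-state softmax with an adaptive temperature. The sketch below is our own reading of the update: the array shapes, the $\ell_\infty$ disagreement measure, and the choice $M_{k+1} = \widehat{Q}_k$ are assumptions made explicit in the comments.

```python
import numpy as np

def aapi_policy(Q_history, alpha=1.0):
    """One AAPI improvement step (illustrative sketch).

    Q_history: array (k, S, A) of phase-wise action-value estimates.
    Returns pi (S, A): a Boltzmann policy over the sum of past estimates plus
    the optimistic term M_{k+1} = Q_history[-1], with a state-dependent
    learning rate built from disagreement between consecutive estimates.
    """
    k, S, A = Q_history.shape
    # Disagreement between consecutive estimates (Q_0 is taken to be 0).
    prev = np.concatenate([np.zeros((1, S, A)), Q_history[:-1]], axis=0)
    disagreement = np.max(np.abs(Q_history - prev), axis=2) ** 2  # (k, S)
    eta = alpha / np.sqrt(1e-12 + disagreement.sum(axis=0))       # (S,)
    # Sum of past losses plus the optimistic guess of the next one.
    score = Q_history.sum(axis=0) + Q_history[-1]                 # (S, A)
    logits = eta[:, None] * score
    logits -= logits.max(axis=1, keepdims=True)                   # stability
    pi = np.exp(logits)
    return pi / pi.sum(axis=1, keepdims=True)
```

States whose estimates fluctuate across phases get a small `eta` and hence a near-uniform (more exploratory) policy, matching the intuition described above.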

1:  Input: phase length $\tau$, number of phases $K$, initial state $s_1$, tuning parameter $\alpha$, value function estimation algorithm.
2:  Initialize: , ;
3:  Repeat:
4:  for  do
5:     Execute for time steps and collect dataset .
6:     Estimate from using .
7:     Calculate adaptive learning rate:
where .
8:     Calculate
9:     Update next policy as:
where is defined in Eq. (4).
10:  end for
11:  Output:
Algorithm 1 Adaptive approximate policy iteration (AAPI)
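The phase structure of Algorithm 1 can be sketched end-to-end on a toy problem. Everything below — the two-state MDP and the crude one-step-reward proxy used in place of a real policy evaluation step — is our own illustrative choice, not the paper's estimator:

```python
import numpy as np

rng = np.random.default_rng(0)

# A toy 2-state, 2-action ergodic MDP (our own example).
P = np.array([[[0.9, 0.1], [0.1, 0.9]],   # P[s, a, s']
              [[0.5, 0.5], [0.8, 0.2]]])
r = np.array([[1.0, 0.0], [0.0, 0.5]])    # r[s, a]
S, A = 2, 2

def run_aapi(K=20, tau=200, alpha=1.0):
    """K phases of length tau: evaluate, then optimistic softmax improvement."""
    pi = np.full((S, A), 1.0 / A)
    Q_sum, Q_prev = np.zeros((S, A)), np.zeros((S, A))
    eta_denom = np.zeros(S)
    s = 0
    for _ in range(K):
        # Policy evaluation: empirical mean one-step reward as a crude Q proxy.
        counts, returns = np.zeros((S, A)), np.zeros((S, A))
        for _ in range(tau):
            a = rng.choice(A, p=pi[s])
            counts[s, a] += 1
            returns[s, a] += r[s, a]
            s = rng.choice(S, p=P[s, a])
        Q_hat = np.divide(returns, counts, out=np.zeros((S, A)),
                          where=counts > 0)
        # Policy improvement: AO-FTRL with entropy regularizer.
        eta_denom += np.max(np.abs(Q_hat - Q_prev), axis=1) ** 2
        eta = alpha / np.sqrt(1e-12 + eta_denom)
        Q_sum += Q_hat
        logits = eta[:, None] * (Q_sum + Q_hat)   # side-information M = Q_hat
        logits -= logits.max(axis=1, keepdims=True)
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)
        Q_prev = Q_hat
    return pi

pi = run_aapi()   # ends up favoring the rewarding action in each state
```

With these dynamics, action 0 is rewarding in state 0 and action 1 in state 1, so the returned policy concentrates accordingly as the accumulated scores grow across phases.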

5 Analysis

To derive a regret bound for Algorithm 1, we decompose the cumulative regret (1) as follows:

$$R_T = \underbrace{\sum_{t=1}^{T} \big( \lambda^{\pi_t} - r(s_t, a_t) \big)}_{\text{(i)}} \;+\; \underbrace{\sum_{t=1}^{T} \big( \lambda^{\pi^*} - \lambda^{\pi_t} \big)}_{\text{(ii)}}. \qquad (6)$$

The first term captures the sum of differences between observed rewards and their long-term averages. If policies are changing slowly, or if they are kept fixed for extended periods of time, we expect this term to capture the noise in the regret. The second term is called the pseudo-regret in the literature. It measures the difference between the expected reward of a fixed policy and that of the policies produced by the algorithm.

We first impose a condition on the quality of the policy evaluation step at each phase. For a probability distribution $\nu$ on $\mathcal{S}$ and a stochastic policy $\pi$, define $\nu \otimes \pi$ to be the distribution on $\mathcal{S} \times \mathcal{A}$ that puts probability mass $\nu(s)\, \pi(a \mid s)$ on the pair $(s, a)$. Recall that $\mu^{\pi}$ is the stationary distribution of $\pi$ over the states. {condition} For each phase $k$, denote the evaluation error $\epsilon_k = \widehat{Q}_k - Q^{\pi_k}$. We assume the following holds with probability $1 - \delta$:

$$\| \epsilon_k \|_{\mu^{\pi_k} \otimes \pi_k} \leq \epsilon_0 + \frac{C_1}{\sqrt{\tau}},$$

where $\epsilon_0$ is the irreducible approximation error and $C_1$ is a problem-dependent constant. Additionally, there exists a constant $b$ such that $|\epsilon_k(s, a)| \leq b$ for any pair $(s, a)$ and any phase $k$.


The problem-dependent constant $C_1$ will in general depend on $d$, the dimension of the representation (e.g. $d = SA$ for the tabular case, or the number of features for the linear value function case).


The requirement for the weighted $\ell_2$-norm has been shown to hold, for example, for linear value function approximation using the LSPE algorithm (Bertsekas & Ioffe, 1996), under Assumptions B-B given in the Appendix; see Theorem 5.1 in Abbasi-Yadkori et al. (2019a) for further details. Lemma B in the supplementary material shows that the boundedness requirement can also be satisfied, for example, for linear value functions, under similar conditions. {remark} The estimation error generally depends on the mismatch between the data distribution and $\mu^{\pi_k} \otimes \pi_k$. With value functions linear in features, this mismatch depends on the spectra of the feature covariance matrices for different distributions, and need not scale with the number of state-action pairs.


[Main result] Consider an ergodic MDP and suppose Condition 5 holds. By choosing the phase length $\tau = \Theta(T^{2/3})$, we have with probability at least $1 - \delta$,

$$R_T \leq \epsilon_0 T + \tilde{O}\big( T^{2/3} \big),$$

where $\tilde{O}(\cdot)$ hides universal constants and poly-logarithmic factors.


It is worth comparing the above result with the regret bound presented in Abbasi-Yadkori et al. (2019a). Ignoring the irreducible error $\epsilon_0$, we improve the leading order of their general result (Corollary 4.6 in Abbasi-Yadkori et al. (2019a)) from $T^{3/4}$ to $T^{2/3}$. When specialized to linear value function approximation, where the evaluation error scales with $\sqrt{d}$ (Theorem 5 in Abbasi-Yadkori et al. (2019a)), we obtain the same improvement from $T^{3/4}$ to $T^{2/3}$ in the leading order.

6 Proof sketch

In this section, we provide a proof sketch for Theorem 5. Technical details are deferred to Appendix A in the supplementary material. At a high level, we bound the two terms in the regret decomposition Eq. (6) separately. While the first term is bounded by the fast mixing condition, the second term is split into the regret due to value function estimation error and the regret due to online learning reduction.

Step 1: fast mixing. To bound the first term in Eq. (6), we require the following uniform fast mixing condition, which is used frequently in the online MDP literature (Even-Dar et al., 2009; Neu et al., 2014). Note that the ergodic MDPs this paper focuses on automatically satisfy this condition. {condition}[(Uniform fast mixing)] There exists a number $\kappa > 0$ such that for any policy $\pi$ and any pair of distributions $\nu$ and $\nu'$ over $\mathcal{S}$, it holds that

$$\| (\nu - \nu') P^{\pi} \|_{1} \leq e^{-1/\kappa}\, \| \nu - \nu' \|_{1}.$$

The following lemma provides an upper bound for the first term (see e.g. Lemma 4.4 in Abbasi-Yadkori et al. (2019a) for a proof). {lemma} Suppose that Condition 6 holds. Then, with probability at least $1 - \delta$, the first term in Eq. (6) is of order $\tilde{O}\big( \kappa (\sqrt{T} + K) \big)$, where $K$ is the number of phases.

Step 2: decomposition. We bound the second term (pseudo-regret) in Eq. (6). Since the policy is only updated at the end of each phase of length $\tau$ (see line 9 in Algorithm 1), we have $\pi_t = \pi_k$ for all rounds $t$ in phase $k$. Thus, the pseudo-regret term can be rewritten as

$$\sum_{t=1}^{T} \big( \lambda^{\pi^*} - \lambda^{\pi_t} \big) = \tau \sum_{k=1}^{K} \big( \lambda^{\pi^*} - \lambda^{\pi_k} \big). \qquad (9)$$

We slightly abuse notation by indexing policies by the phase $k$. Applying the performance difference lemma (Lemma C in the supplementary material), we have

$$\lambda^{\pi^*} - \lambda^{\pi_k} = \sum_{s} \mu^{\pi^*}(s) \sum_{a} \big( \pi^*(a \mid s) - \pi_k(a \mid s) \big)\, Q^{\pi_k}(s, a).$$

Bridging by the empirical estimates $\widehat{Q}_k$, we decompose (9) into an estimation-error term and an online-learning term, where


Step 3: estimation error. The estimation-error term of the decomposition quantifies the regret incurred in the policy evaluation step due to the estimation and function approximation error of the $Q$-function in each phase. It can be bounded as in Theorem 4.1 of Abbasi-Yadkori et al. (2019a) under similar assumptions, which we reproduce here for completeness. {lemma} Suppose Condition 5 holds. Then the estimation-error term is of order

$$\tilde{O}\Big( T \big( \epsilon_0 + C_1 / \sqrt{\tau} \big) \Big)$$

with probability at least $1 - \delta$.

Step 4: online learning reduction. Minimizing the online-learning term can be cast as an online learning problem (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz et al., 2012), and this observation determines the choice of our algorithm. Previous work (Abbasi-Yadkori et al., 2019a) has tackled this subproblem using mirror descent, resulting in $O(\sqrt{K})$ regret after optimizing the learning rate and ignoring the irreducible error. Here we instead use the AO-FTRL framework, which allows us to show an improved regret bound. As we show, the reason we can benefit from optimism is that the losses ($Q$-functions) change slowly, and we carefully transfer this knowledge to the adaptive learning rate. This is the main technical contribution of the paper.

First, we state the AO-FTRL framework and its regret guarantees. Let $\ell_1, \ldots, \ell_K$ be a sequence of loss vectors and let $M_1, \ldots, M_K$ be a sequence of prediction vectors, where plays live in the probability simplex $\Delta$. At the beginning of each round $t$, the algorithm receives the side-information vector $M_t$. In the literature, such sequences are also called predictable sequences (Rakhlin & Sridharan, 2012), and the algorithm can be seen as a way of utilizing prior knowledge about the loss sequence. The algorithm then selects an action $x_t \in \Delta$ and suffers a cost $\langle x_t, \ell_t \rangle$. The goal of this online learning problem is to minimize the cumulative regret with respect to the best action in hindsight, defined as $\sum_{t=1}^{K} \langle x_t, \ell_t \rangle - \min_{x \in \Delta} \sum_{t=1}^{K} \langle x, \ell_t \rangle$.

Let $R$ be a 1-strongly convex regularizer on $\Delta$ with respect to some norm $\| \cdot \|$ and denote by $\| \cdot \|_*$ its dual norm. Initialize $x_1 = \operatorname*{arg\,min}_{x \in \Delta} R(x)$. At each round $t$, AO-FTRL has the following form:

$$x_{t+1} = \operatorname*{arg\,min}_{x \in \Delta} \Big\{ \Big\langle x,\; \sum_{j=1}^{t} \ell_j + M_{t+1} \Big\rangle + \frac{1}{\eta_t} R(x) \Big\}, \qquad \eta_t = \frac{\alpha}{\sqrt{\sum_{j=1}^{t} \| \ell_j - M_j \|_{*}^{2}}},$$

where $\alpha$ is an absolute constant. It is easy to see that $1/\eta_t$ is non-decreasing. For simplicity, we assume $M_1 = 0$. The next lemma provides a generic regret bound for AO-FTRL. The detailed proof is deferred to Appendix A.2 in the supplementary material.
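For the negative-entropy regularizer on the simplex, the update above has a closed form (optimistic exponential weights). The following sketch, with our own variable names, runs the update and reports the realized regret against the best fixed action:

```python
import numpy as np

def ao_ftrl_simplex(losses, predictions, alpha=1.0):
    """AO-FTRL with negative-entropy regularizer on the simplex (sketch).

    Closed form: x_{t+1} proportional to exp(-eta_t (sum_{j<=t} loss_j + M_{t+1})),
    with eta_t = alpha / sqrt(sum_{j<=t} ||loss_j - M_j||_inf^2).
    Returns the plays and the cumulative regret vs. the best fixed action.
    """
    T, n = losses.shape
    cum_loss = np.zeros(n)
    err, total = 0.0, 0.0
    x = np.full(n, 1.0 / n)
    plays = []
    for t in range(T):
        plays.append(x)
        total += x @ losses[t]
        err += np.max(np.abs(losses[t] - predictions[t])) ** 2
        eta = alpha / np.sqrt(1e-12 + err)
        cum_loss += losses[t]
        M_next = losses[t]              # optimistic guess: last observed loss
        logits = -eta * (cum_loss + M_next)
        logits -= logits.max()          # numerical stability
        x = np.exp(logits)
        x /= x.sum()
    return np.array(plays), total - cum_loss.min()
```

On a perfectly predictable (slow-changing) loss sequence, the accumulated prediction error stops growing after the first round, so the learning rate stays large and the realized regret remains bounded — the behaviour the analysis exploits.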


Choose $\alpha$ as above. The cumulative regret for AO-FTRL is then upper-bounded by


Unlike the AO-FTRL analyses of Rakhlin & Sridharan (2012); Mohri & Yang (2016), but similarly to, e.g., the analysis of Joulani et al. (2017), Eq. (12) has a key negative term (at the expense of a slightly larger constant factor in the main positive term). These negative terms, which are retained from a tight regret bound on the forward regret of AO-FTRL (Joulani et al., 2017), track the evolution of the policy . With the proper choice of , the norm terms will also be controlled by the evolution of (see Lemma 6), and the aforementioned negative terms allow us to greatly reduce the contribution of the norm terms to the overall regret.

The reason that minimizing the online-learning term can be cast as an online learning problem is as follows. By the definitions in Step 2, we rewrite the term in (10) as

a per-state sum of linear losses. For each state $s$, we view $M_{k+1}(s, \cdot)$ as the prediction vector and $\widehat{Q}_k(s, \cdot)$ as the loss vector. This equivalence enables us to utilize the generic regret bound for AO-FTRL in Lemma 6 for each individual state.

Next, we show that under some conditions, the change in the true $Q$-values can be bounded by the change in policies. This is a unique property of ergodic MDPs that allows us to benefit from the negative term in (12). To ensure the value function is unique, we fix its normalization. {lemma}[Relative $Q$-function error] For any two successive policies $\pi_k$ and $\pi_{k+1}$, the following holds for any state-action pair $(s, a)$:

$$\big| Q^{\pi_{k+1}}(s, a) - Q^{\pi_k}(s, a) \big| \leq C \max_{s'} \big\| \pi_{k+1}(\cdot \mid s') - \pi_k(\cdot \mid s') \big\|_{1},$$

where $C$ depends on the mixing time.
The detailed proof of Lemma 6 is deferred to Appendix A.4 in the supplementary material. Combining the result in Lemmas 6 and 6, we can derive the following lemma.


Denote by $K$ the number of phases. Suppose Condition 5 holds. Then the online-learning term satisfies the following upper bound with probability at least $1 - \delta$:

$$\tilde{O}\big( \epsilon T + \tau \log K \big), \qquad (13)$$

where $\tilde{O}$ hides universal constant factors. The detailed proof of Lemma 6 is deferred to Appendix A.3 in the supplementary material. Finally, we optimize the phase length $\tau$ and reach our conclusion.


Within the upper bound (13), $\epsilon$ stands for the approximation and estimation error per round. When value functions can be computed exactly (known MDP), $\epsilon = 0$, and for any phase length $\tau$, the online learning reduction regret for AAPI scales logarithmically in the number of phases $K$, while that of POLITEX (Abbasi-Yadkori et al., 2019a) scales as $\sqrt{K}$. This is the main reason why we are able to improve the regret from $\tilde{O}(T^{3/4})$ to $\tilde{O}(T^{2/3})$ in the end.

7 Experiments

Figure 1: Evaluation on a tabular ergodic MDP.
Figure 2: Evaluation on DeepSea environments of different sizes.

Setup. In this section, we empirically evaluate the benefits of using adaptive learning rates and side-information. In particular, we compare (1) AAPI, with the adaptive learning rate and side-information as in Eq. (4); (2) k-AAPI, similar to AAPI, but approximating the adaptive learning rate using only a subset of past estimates for computational efficiency; (3) POLITEX, with a fixed learning rate and no side-information. We also evaluate RLSVI (Osband et al. (2016), Algorithms 1 and 2, with tuned parameters), where policies are greedy w.r.t. a randomized estimate of the $Q$-function. For RLSVI in episodic environments with a fixed known horizon (DeepSea), we estimate separate parameters for each step and update after each episode; otherwise we share parameters and update every $\tau$ steps.

We approximate all value functions using least-squares Monte Carlo, i.e. linear regression from state-action features to empirical returns. For MDPs with a large or continuous state space, updating per-state learning rates can be impractical. Instead, we store the weights of past $Q$-functions in memory, and for each state in the trajectory, we compute the learning rate using a subset of randomly-selected past weight vectors (correcting the scale of the estimate to account for the subsampling). For Boltzmann policies, we tune the constant for the learning rate over a range of values. For each environment and algorithm we evaluate and plot the mean and standard deviation over 50 runs. We show results on the following three environments.

Tabular ergodic MDPs. We consider a simple tabular MDP with $S$ states and $A$ actions and a single high-reward state. On any action in state 1, the environment transitions to a randomly chosen state. On action 1 in a state $s \neq 1$, the environment transitions to a particular state with probability 0.9, and to a randomly chosen state with probability 0.1. On all other actions in states $s \neq 1$, the environment transitions to a randomly chosen state. We represent state-action pairs using one-hot indicator vectors of size $SA$, and experiment with different sizes of the state and action spaces.
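A transition kernel in the spirit of this construction can be written down directly. The text elides the exact transition target for action 1, so the choice below (action 1 steps toward the special state) is a labeled assumption, not the paper's construction:

```python
import numpy as np

def build_mdp(S=5, A=3):
    """Tabular ergodic MDP sketch; state 0 plays the role of 'state 1' above."""
    P = np.zeros((S, A, S))
    P[0, :, :] = 1.0 / S                  # special state: any action -> uniform
    for s in range(1, S):
        for a in range(A):
            if a == 1:
                P[s, a, :] = 0.1 / S      # randomly chosen state w.p. 0.1
                P[s, a, s - 1] += 0.9     # ASSUMED target: step toward state 0
            else:
                P[s, a, :] = 1.0 / S      # all other actions -> uniform
    return P

P = build_mdp()
```

Since every row of the kernel has full support, the chain induced by any policy is irreducible and aperiodic, i.e. the MDP is ergodic as the analysis requires.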

DeepSea (Osband et al., 2017). In the DeepSea environment, states comprise an $N \times N$ grid, and there are two actions. The environment transitions and costs are deterministic. On action 0 in a cell, the environment moves down and to the left; on action 1, it moves down and to the right, at a small cost. The agent starts in the top-left cell. The high reward (minus the action cost) is obtained in the bottom-right cell; in all other cells, the reward for action 0 is zero and the reward for action 1 is the small negative cost. In the infinite-horizon setting, an optimal strategy first takes action 1 $N$ times (to get to the rightmost column) and then takes an equal number of 0 and 1 actions, and has expected average reward close to $1/N$. A simple strategy that always takes action 1 has a slightly smaller average reward, and a suboptimal strategy that only takes action 0 has an average reward of 0. We represent states as length-$2N$ vectors containing one-hot indicators for each grid coordinate, and estimate linear $Q$-functions.
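The text elides the exact cells and costs, so the following continuing DeepSea-style environment is a sketch under our own assumptions: rows wrap to make the task continuing, action 1 costs `move_cost / N`, and the treasure sits in the bottom-right cell.

```python
class DeepSea:
    """DeepSea-style continuing gridworld (illustrative sketch, not the
    paper's exact specification)."""

    def __init__(self, N=8, move_cost=0.01):
        self.N, self.move_cost = N, move_cost
        self.state = (0, 0)                       # (row, column), top-left start

    def step(self, action):
        i, j = self.state
        j = min(j + 1, self.N - 1) if action == 1 else max(j - 1, 0)
        i = (i + 1) % self.N                      # rows wrap: continuing task
        self.state = (i, j)
        reward = -self.move_cost / self.N if action == 1 else 0.0
        if self.state == (self.N - 1, self.N - 1):
            reward += 1.0                         # assumed treasure cell
        return self.state, reward
```

Under these assumptions, repeatedly taking action 1 reaches the treasure once every $N$ steps for an average reward close to $1/N$, while always taking action 0 earns nothing — mirroring the strategies discussed above.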

Figure 3: Evaluation on the CartPole environments.

CartPole (Barto et al., 1983). In the CartPole environment, the goal is to balance an inverted pole attached by an unactuated joint to a cart, which moves along a frictionless rail. There are two actions, corresponding to pushing the cart to the left or right. The observation consists of the position and velocity of the cart, the pole angle, and the pole velocity at the tip. There is a reward of +1 for every timestep that the pole remains upright. The episodic version of the environment ends if the pole angle is more than 15 degrees from vertical, if the cart moves more than 2.4 units from the center, or after 200 steps. In the infinite-horizon version, if the episode ends early, we return the corresponding terminal reward and reset. For this task, in addition to the given observation, we extract multivariate Fourier basis features (Konidaris et al., 2011) of order 4.

Results and discussion. The experimental results are shown in Figures 1, 2, and 3. AAPI outperforms other algorithms in the DeepSea and ergodic environment instances. These problems all involve a single high-reward state, and reaching the state requires sufficient exploration. On the other hand, the adaptive per-state learning rate is not helpful in CartPole, possibly because observations are continuous and there is higher generalization across states. In most of our experiments, AAPI performs better with smaller phase length . Our analysis relies on long phases of length in part to obtain better side-information . However, given that the side-information also depends on and , in some MDPs shorter phases may suffice. This is clear in the DeepSea results, where shorter phases result in better performance for smaller problem instances.

8 Discussion and future work

We have presented AAPI, a model-free learning scheme that can work with function approximation and enjoys an $\tilde{O}(T^{2/3})$ regret guarantee in infinite-horizon undiscounted, ergodic MDPs. AAPI improves upon previous results for this setting by exploiting the slow-changing property of policies in both theory and practice.

Our result has an undesirable dependence on the size of the state-action space. With value function approximation, one general direction for improvement is replacing this dependence with the size of the compressed representation, such as the minimum eigenvalue of the feature covariance matrix in the linear case. Another direction for future work is improving the policy evaluation stage. While we estimate each value function solely using the on-policy transitions, better estimates can potentially be obtained using all data. Using more sophisticated side-information, such as a weighted average of past Q-estimates or an off-policy estimate of the Q-function, may also be helpful in practice. Other future work may include practical implementations of the algorithm when trained with neural networks that maintain only a subset of past networks in memory.

Supplement to “Provably Efficient Adaptive Approximate Policy Iteration”

In Section A, we present the detailed proofs of main results. In Section B, the linear value function approximation is considered. In Section C, some supporting lemmas are included.

Appendix A Proofs of main results

A.1 Proof of Theorem 5: main result

We combine the decomposition (6), (9) and (10) together and utilize the results in Lemmas 6, 6 and 6. Then we have

We choose $\tau$ as in the theorem statement and ignore universal constants and logarithmic factors in what follows. Since $K = T/\tau$, it holds that

with probability at least $1 - \delta$. With a slight abuse of notation, we re-define the problem-dependent constants. By optimizing $\tau$ so that the first two terms above are equal, we choose $\tau = \Theta(T^{2/3})$. Overall, we reach the final regret bound,

This ends the proof.

A.2 Proof of Lemma 6: adaptive optimistic FTRL (AO-FTRL)

Lemma 6 states that the cumulative regret for AO-FTRL is upper-bounded by

where .

First, at each round , AO-FTRL has the form of

where are data-dependent learning rates. For simplicity, we assume . For , we define


Since $R$ is 1-strongly convex with respect to the norm $\| \cdot \|$, the scaled regularizer $R/\eta_t$ is $(1/\eta_t)$-strongly convex with respect to the same norm. We can then rewrite the AO-FTRL update as

Second, let us define the forward linear regret as:

One could interpret the forward regret as a “cheating” regret, since it uses the prediction at round $t$ when choosing the round-$t$ action. We decompose the cumulative regret based on the forward linear regret as follows,


The second term on the right-hand side captures the regret due to the algorithm's inability to accurately predict the future. We define the Bregman divergence between two vectors induced by a differentiable function as follows:

The next theorem is used to bound the forward regret. {theorem}[Theorem 3 in Joulani et al. (2017)] For any comparator and any sequence of linear losses, the forward regret satisfies the following inequality:

Recall that $R/\eta_t$ is $(1/\eta_t)$-strongly convex. From the definitions of strong convexity and the Bregman divergence, we have


Applying Theorem A.2 and Eq. (16), we have


To bound the first term in Eq. (17), we expand it by the definition of Eq. (14),


where the first inequality is due to the fact that $1/\eta_t$ is non-decreasing. We decompose the second term in Eq. (18) as follows,


since . Combining Eq. (18) and Eq. (19) together,


Putting Eq. (15), Eq. (17) and Eq. (20) together, we reach


To bound the first term in Eq. (21), we first use Hölder’s inequality such that

where the last inequality is due to . Thus we have