Provably Efficient Adaptive Approximate Policy Iteration
Abstract
Model-free reinforcement learning algorithms combined with value function approximation have recently achieved impressive performance in a variety of application domains, including games (Silver et al., 2016) and robotics (Kober et al., 2013). However, the theoretical understanding of such algorithms is limited, and existing results are largely focused on episodic or discounted Markov decision processes (MDPs). In this work, we present adaptive approximate policy iteration (AAPI), a learning scheme which enjoys a $\tilde{O}(T^{2/3})$ regret bound for undiscounted, continuing learning in uniformly ergodic MDPs. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ for the average-reward case with function approximation. Our algorithm and analysis rely on adversarial online learning techniques, where value functions are treated as losses. The main technical novelty is the use of a data-dependent adaptive learning rate coupled with a so-called optimistic prediction of upcoming losses. In addition to theoretical guarantees, we demonstrate the advantages of our approach empirically on several environments.
1 Introduction
Our work focuses on model-free algorithms for learning in infinite-horizon undiscounted continuing Markov decision processes (MDPs), which capture tasks such as game playing, routing, and the control of physical systems. Unlike model-based algorithms, which estimate a model of the environment dynamics and plan based on the estimated model, model-free algorithms directly optimize the expected return of policies, which is the objective of interest. In combination with powerful function approximation, model-free algorithms have recently achieved impressive advances in multiple applications (Mnih et al., 2015; Van Hasselt et al., 2016). Unfortunately, many successful approaches come with few performance guarantees, especially in the infinite-horizon undiscounted, continuing case. Existing theoretical results often apply only to episodic or discounted MDPs, with either a tabular representation (Jin et al., 2018) or known special MDP structure (Even-Dar et al., 2009; Neu et al., 2014). In this work, we introduce Adaptive Approximate Policy Iteration (AAPI), a practical model-free learning scheme that can work with function approximation. We analyze its performance in infinite-horizon undiscounted, continuing MDPs in terms of high-probability regret with respect to a fixed policy.
Our approach follows the "online MDP" line of work (Even-Dar et al., 2009; Neu et al., 2014), where the agent iteratively selects policies by running an adversarial online learning algorithm in each state, and the loss fed to each algorithm is the policy's Q-function in that state. This results in a variant of approximate policy iteration (API), where the policy improvement step produces a policy that is optimal in hindsight w.r.t. the sum of all previous Q-functions rather than just the most recent one. The original work of Even-Dar et al. (2009) focused on the tabular case with known dynamics and adversarial reward functions. More recent works (Abbasi-Yadkori et al., 2019a, b) have adapted this approach to the realistic setting of unknown dynamics, stochastic rewards, and value function approximation.
We improve over existing results in this setting (Abbasi-Yadkori et al., 2019a) by exploiting the fact that the losses (Q-function estimates) are slow-changing rather than completely adversarial. In particular, our policy improvement step relies on the adaptive optimistic follow-the-regularized-leader (AOFTRL) update (Mohri & Yang, 2016). The resulting policies are Boltzmann distributions over the sum of past estimated Q-functions, coupled with an optimistic prediction of the upcoming loss and a state-dependent adaptive learning rate (softmax temperature). Intuitively, the adaptive learning rate makes the policy more exploratory in states on which past consecutive Q-estimates disagree.
On the theoretical side, we prove the first $\tilde{O}(T^{2/3})$ regret upper bound in the undiscounted, continuing setting with function approximation. This is an improvement over the best existing bound of $\tilde{O}(T^{3/4})$ (Abbasi-Yadkori et al., 2019a) for the same setting, whose analysis ignores the slow-changing nature of the estimated Q-functions. We exploit the fact that the change in consecutive Q-function estimates can be bounded by the change in policies. Our analysis relies on the results of Rakhlin & Sridharan (2013), but employs a different regret decomposition, with additional information provided by MDP properties. We emphasize that our learning framework is not limited to a particular function approximation method, and that in practice it serves the purpose of appropriately regularizing the policy improvement step of API.
2 Related work
Undiscounted MDPs. Most no-regret algorithms for infinite-horizon undiscounted MDPs are model-based, and are only applicable to tabular representations (Bartlett, 2009; Jaksch et al., 2010; Ouyang et al., 2017; Fruit et al., 2018; Jian et al., 2019; Talebi & Maillard, 2018). For a weakly-communicating MDP with $S$ states, $A$ actions, and diameter $D$, these algorithms achieve the minimax lower bound of $\Omega(\sqrt{DSAT})$ (Jaksch et al., 2010) with high probability up to constant and polylogarithmic factors. The downside of model-based algorithms is that they are memory intensive in large-scale MDPs, as they require $O(S^2 A)$ storage, as well as being difficult to extend to continuous-valued states.
In the model-free tabular setting, Wei et al. (2019) show that optimistic Q-learning achieves $\tilde{O}\big(\mathrm{sp}(V^{*})(SA)^{1/3}T^{2/3}\big)$ regret in weakly-communicating MDPs, where $\mathrm{sp}(V^{*})$ is the span of the optimal state-value function (upper bounded by the diameter $D$). In the case of uniformly ergodic MDPs, they show a bound of $\tilde{O}(\sqrt{T})$ on expected regret, where the prefactor depends on the mixing time and the stationary distribution mismatch coefficient.
In the model-free setting with function approximation, the POLITEX algorithm (Abbasi-Yadkori et al., 2019a), a variant of API, achieves $\tilde{O}(T^{3/4})$ regret in ergodic MDPs, under the assumption that the policy evaluation error scales as $O(\sqrt{d/\tau})$ after $\tau$ transitions. Here $d$ is the size of the compressed state-action space ($d = SA$ for a tabular representation, or the number of features for linear functions).
Episodic MDPs. In episodic MDPs with horizon $H$, Jin et al. (2018) show an $\tilde{O}(\sqrt{H^3 SAT})$ regret bound for Q-learning with UCB exploration with a tabular representation. With function approximation, Yang & Wang (2019b) and Jin et al. (2019) show $\tilde{O}(\sqrt{T})$ regret bounds for an optimistic version of least-squares value iteration by assuming a special linear MDP structure. Their algorithms heavily exploit this linear structure, which is hard to satisfy in practice. The RLSVI algorithm (Osband et al., 2016) performs exploration in the value function parameter space, and therefore can be applied with function approximation, though its worst-case regret bound holds only in the tabular setting (Russo, 2019). In the infinite-horizon discounted setting, single-trajectory regret makes less sense as an objective, and most theoretical results focus on the sample complexity of exploration in the tabular case (Strehl et al., 2006; Dong et al., 2019).
Conservative PI. AAPI is also similar to conservative policy iteration methods, which attempt to stabilize API by regularizing each policy towards the previous one (Kakade & Langford, 2002; Schulman et al., 2015, 2017; Abdolmaleki et al., 2018; Geist et al., 2019). Our policy improvement step can be seen as regularizing each policy by the KL-divergence to the previous policy; the reduction to online learning offers a principled way to incorporate such regularization.
3 Problem setting and definitions
We first introduce some notation. We use $\Delta_S$ to denote the space of probability distributions over a finite set $S$, and similarly $\Delta_A$ for a finite set $A$. For a vector $v$ and a distribution $\mu$, we define the weighted norm $\|v\|_{\mu} = \sqrt{\sum_s \mu(s) v(s)^2}$ and the sup norm $\|v\|_{\infty} = \max_s |v(s)|$. In general, we treat discrete distributions as row vectors.
Infinite-horizon undiscounted MDPs are characterized by a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a reward function $r : \mathcal{S} \times \mathcal{A} \to [0, 1]$, and a transition probability function $P$. The agent does not know the transition probability and the reward function in advance. A policy $\pi$ is a mapping from a state to a distribution over actions. Let $(s_1, a_1, s_2, a_2, \ldots)$ denote the state-action sequence obtained by following policy $\pi$. The expected average reward of policy $\pi$ is defined as
$$\lambda^{\pi} := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} r(s_t, a_t)\right].$$
The agent interacts with the environment as follows: at each round $t$, the agent observes a state $s_t$, chooses an action $a_t$, and receives a reward $r(s_t, a_t)$. The environment then transitions to the next state $s_{t+1}$ with probability $P(s_{t+1} \mid s_t, a_t)$. The initial state is randomly generated from some unknown distribution. Let $\pi$ be an unknown fixed policy. The regret of an algorithm with respect to this fixed policy is defined as
$$R_T := \sum_{t=1}^{T} \left(\lambda^{\pi} - r(s_t, a_t)\right). \qquad (1)$$
The learning goal is to find an algorithm that minimizes the long-term regret $R_T$.
For each policy $\pi$, we denote by $P^{\pi}$ the transition matrix induced by $\pi$, whose $(s, s')$ component is the transition probability from $s$ to $s'$ under $\pi$, i.e. $P^{\pi}(s, s') = \sum_{a} \pi(a \mid s) P(s' \mid s, a)$. For a distribution $\mu$ over states, we let $\mu P^{\pi}$ be the distribution over states that results from executing the policy $\pi$ for one step after the initial state is sampled from $\mu$. A stationary distribution $\mu^{\pi}$ of a policy $\pi$ over states satisfies $\mu^{\pi} P^{\pi} = \mu^{\pi}$. In addition, the expected average reward can also be expressed as
$$\lambda^{\pi} = \sum_{s} \mu^{\pi}(s) \sum_{a} \pi(a \mid s)\, r(s, a).$$
We assume that all MDPs are ergodic. An MDP is ergodic if the Markov chain induced by any policy is both irreducible and aperiodic, which means any state is reachable from any other state by following a suitable policy. Learning in ergodic MDPs is generally easier than in weakly-communicating MDPs because ergodic MDPs are themselves exploratory. It is well-known that every ergodic MDP has a unique stationary state distribution, and so $\mu^{\pi}$ and $\lambda^{\pi}$ are well-defined. In addition, ergodic MDPs have a finite mixing time, defined below.

Definition (Mixing time). The mixing time of an ergodic MDP is defined as
$$t_{\mathrm{mix}} := \max_{\pi}\, \min\Big\{ t \ge 1 : \big\|\nu (P^{\pi})^{t} - \mu^{\pi}\big\|_{1} \le 1/4 \ \text{ for all } \nu \in \Delta_{\mathcal{S}} \Big\},$$
which characterizes how fast the MDP reaches its stationary distribution from any state under any policy.
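To make the definition concrete, the following sketch estimates the mixing time of a single fixed policy's induced transition matrix by tracking the total-variation distance to the stationary distribution over point-mass initial distributions. The three-state chain is illustrative, and a full computation would additionally maximize over all policies.

```python
# Hypothetical 3-state chain induced by some fixed policy
# (rows: current state, cols: next state).
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def step(dist, P):
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def stationary(P, iters=10_000):
    # Power iteration: repeatedly apply the transition matrix.
    d = [1.0 / len(P)] * len(P)
    for _ in range(iters):
        d = step(d, P)
    return d

def tv(p, q):
    # Total-variation distance between two distributions.
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mixing_time(P, eps=0.25, t_max=1000):
    # Smallest t with TV(nu P^t, mu) <= eps, worst case over
    # point-mass initial distributions nu.
    mu = stationary(P)
    n, worst = len(P), 0
    for s in range(n):
        d = [1.0 if i == s else 0.0 for i in range(n)]
        t = 0
        while tv(d, mu) > eps and t < t_max:
            d = step(d, P)
            t += 1
        worst = max(worst, t)
    return worst

print(mixing_time(P))
```

A well-connected chain like this one mixes after very few steps; chains with near-deterministic transitions can take far longer.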
Finally, we define the value function under policy $\pi$ as
$$V^{\pi}(s) := \mathbb{E}\left[\sum_{t=1}^{\infty} \left(r(s_t, a_t) - \lambda^{\pi}\right) \,\Big|\, s_1 = s\right].$$
The state-action value function $Q^{\pi}$ and $V^{\pi}$ are the unique solutions to the Bellman equation:
$$Q^{\pi}(s, a) = r(s, a) - \lambda^{\pi} + \sum_{s'} P(s' \mid s, a)\, V^{\pi}(s'), \qquad V^{\pi}(s) = \sum_{a} \pi(a \mid s)\, Q^{\pi}(s, a). \qquad (2)$$
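As an illustration of Eq. (2), the following sketch computes the average reward and the differential value function of a fixed policy on a made-up two-state chain via relative value iteration; the transition matrix and rewards are illustrative, not taken from the paper.

```python
# Relative value iteration for average-reward policy evaluation:
# iterate h <- r^pi + P^pi h, renormalizing so h(s0) = 0; the offset
# subtracted at each step converges to the average reward lambda^pi.
P_pi = [[0.9, 0.1],   # transition matrix induced by a fixed policy
        [0.2, 0.8]]
r_pi = [1.0, 0.0]     # expected one-step reward under that policy

def evaluate(P_pi, r_pi, iters=5000):
    n = len(P_pi)
    h = [0.0] * n
    lam = 0.0
    for _ in range(iters):
        new = [r_pi[s] + sum(P_pi[s][t] * h[t] for t in range(n))
               for s in range(n)]
        lam = new[0] - h[0]             # offset tracks lambda^pi
        h = [v - new[0] for v in new]   # normalize so h(s0) = 0
    return lam, h

lam, h = evaluate(P_pi, r_pi)
# At convergence, h and lam satisfy the Bellman equation (2):
# h(s) + lam = r(s) + sum_t P(s, t) h(t).
```

For this chain the stationary distribution is $(2/3, 1/3)$, so the iteration converges to $\lambda^{\pi} = 2/3$.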
4 Algorithm
AAPI is a variant of approximate policy iteration and proceeds in phases. Suppose the total number of rounds is $T$. We divide $T$ into $K = T/\tau$ phases of length $\tau$, and assume $T/\tau$ is an integer for simplicity. Within each phase, our algorithm performs two tasks: policy evaluation and policy improvement.
Policy evaluation. In each phase $k$, the algorithm executes the current policy $\pi_k$ for $\tau$ time steps, and computes an estimate $\hat{Q}_k$ of the true action-value function $Q^{\pi_k}$. We leave the value function estimation method unspecified; for example, one can use incremental algorithms, and either on-policy or off-policy data. AAPI is thus better interpreted as a learning scheme. Our regret analysis will require that longer phase lengths lead to better estimates (made precise in Lemma 6).
Policy improvement. For each state $s$, the policy improvement step takes the form of the adaptive optimistic follow-the-regularized-leader (AOFTRL) update (Mohri & Yang, 2016):
$$\pi_{k+1}(\cdot \mid s) = \mathop{\mathrm{arg\,max}}_{u \in \Delta_{\mathcal{A}}} \left\{ \Big\langle u,\ \sum_{i=1}^{k} \hat{Q}_i(s, \cdot) + M_{k+1}(s, \cdot) \Big\rangle - \frac{1}{\eta_{k+1, s}} R(u) \right\}. \qquad (3)$$
(See Step 4 in Section 6 for a generic description of AOFTRL.) The terms in Eq. (3) are as follows:
- The estimates $\hat{Q}_i$ are the loss functions fed to the AOFTRL algorithm, $R(u) = \sum_a u(a) \ln u(a)$ is the negative entropy regularizer, and $\Delta_{\mathcal{A}}$ is the probability simplex over actions.
- The side-information $M_{k+1}(s, \cdot)$ is a vector computable from past information that is predictive of the next loss $\hat{Q}_{k+1}(s, \cdot)$. Since the policies are expected to change slowly due to the nature of exponential-weight-average type algorithms, we set $M_{k+1} = \hat{Q}_k$ (better guesses, such as off-policy estimates, can be used if available).
- The choice of the learning rate $\eta_{k+1, s}$ is crucial both theoretically and empirically. In particular, we choose it in a data-dependent fashion as
$$\eta_{k+1, s} = \frac{c}{\sqrt{1 + \sum_{i=1}^{k} \big\| \hat{Q}_i(s, \cdot) - M_i(s, \cdot) \big\|_{\infty}^{2}}}. \qquad (4)$$
A notable feature of $\eta_{k+1, s}$ is that it is also state-dependent. Intuitively, for the choice $M_{k+1} = \hat{Q}_k$, the adaptive state-dependent learning rate results in a more exploratory policy for the states on which there is more disagreement between past consecutive action-value estimates.
Based on (3), the next policy is a Boltzmann distribution (a consequence of the negative entropy regularizer) over the sum of all past state-action value estimates and the side-information:
$$\pi_{k+1}(a \mid s) \propto \exp\left( \eta_{k+1, s} \Big( \sum_{i=1}^{k} \hat{Q}_i(s, a) + M_{k+1}(s, a) \Big) \right). \qquad (5)$$
AAPI is similar to the POLITEX algorithm (Abbasi-Yadkori et al., 2019a); the main difference is that POLITEX sets the next policy to a Boltzmann distribution over $\sum_{i=1}^{k} \hat{Q}_i$ with a fixed learning rate in the improvement step. We demonstrate that the use of side-information and adaptive learning rates improves both the theoretical guarantees (Theorem 5) and the empirical performance (Section 7) over POLITEX. The overall algorithm is summarized in Algorithm 1.
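For concreteness, here is a minimal sketch of the improvement step for a single state in the tabular case, combining the adaptive learning rate with the optimistic prediction $M_{k+1} = \hat{Q}_k$. The function name, the constant $c$, and the toy Q estimates are our own illustration, not the paper's implementation.

```python
import math

def aapi_policy(Q_history, c=1.0):
    """Boltzmann policy for one state from past Q estimates Q_history,
    where Q_history[i] is the estimate \\hat{Q}_i(s, .) for phase i."""
    n_actions = len(Q_history[0])
    # Adaptive learning rate: shrink with the squared sup-norm
    # disagreement between consecutive estimates (prediction errors).
    disagreement = sum(
        max(abs(q - p) for q, p in zip(Q_history[i], Q_history[i - 1])) ** 2
        for i in range(1, len(Q_history)))
    eta = c / math.sqrt(1.0 + disagreement)
    # Sum of past estimates plus the optimistic prediction (last estimate).
    scores = [sum(Q[a] for Q in Q_history) + Q_history[-1][a]
              for a in range(n_actions)]
    # Numerically stable softmax.
    m = max(eta * s for s in scores)
    w = [math.exp(eta * s - m) for s in scores]
    z = sum(w)
    return [x / z for x in w]

# Two phases of Q estimates for one state with two actions:
pi = aapi_policy([[1.0, 0.0], [1.2, 0.1]])
```

When consecutive estimates agree, the rate stays large and the policy commits to the better action; large disagreement lowers the rate and keeps the policy closer to uniform.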
5 Analysis
To derive a regret bound for Algorithm 1, we decompose the cumulative regret (1) as follows:
$$R_T = \sum_{t=1}^{T} \left(\lambda^{\pi_t} - r(s_t, a_t)\right) + \sum_{t=1}^{T} \left(\lambda^{\pi} - \lambda^{\pi_t}\right). \qquad (6)$$
The first term captures the sum of differences between the observed rewards and their long-term averages. If the policies are changing slowly, or if they are kept fixed for extended periods of time, we expect this term to capture the noise in the regret. The second term is called the pseudo-regret in the literature. It measures the difference between the expected reward of a fixed policy and that of the policies produced by the algorithm.
We first impose a condition on the quality of the policy evaluation step at each phase. For a probability distribution $\mu$ on states and a stochastic policy $\pi$, define $\mu \otimes \pi$ to be the distribution on state-action pairs that puts probability mass $\mu(s)\pi(a \mid s)$ on the pair $(s, a)$. Recall that $\mu^{\pi}$ is the stationary distribution of $\pi$ over the states.

Condition 5. For each phase $k$, denote $\nu_k = \mu^{\pi_k} \otimes \pi_k$. We assume the following holds with probability at least $1 - \delta$:
$$\big\| \hat{Q}_k - Q^{\pi_k} \big\|_{\nu_k} \le \epsilon_0 + C_0 \sqrt{d/\tau}, \qquad (7)$$
where $\epsilon_0$ is the irreducible approximation error and $C_0$ is a problem-dependent constant. Additionally, there exists a constant $b$ such that $|\hat{Q}_k(s, a)| \le b$ for any pair $(s, a)$ and any $k$.
The problem-dependent constant $C_0$ will in general depend on the MDP and the estimation method. Here, $d$ is the dimension of the representation (e.g. $d = SA$ for the tabular case, or the number of features for the linear value function case).
The requirement for the weighted norm and sup norm has been shown to hold, for example, for linear value function approximation using the LSPE algorithm (Bertsekas & Ioffe, 1996), under the assumptions given in the Appendix; see Theorem 5.1 in Abbasi-Yadkori et al. (2019a) for further details. Lemma B in the supplementary material shows that the sup-norm requirement can also be satisfied, for example, for linear value functions, under similar conditions.

Remark. The estimation error generally depends on the mismatch between the distributions $\nu_k$ and $\mu^{\pi} \otimes \pi$. With value functions linear in features $\varphi(s, a)$, this mismatch depends on the spectra of the matrices $\mathbb{E}_{(s,a) \sim \nu}\big[\varphi(s, a)\varphi(s, a)^{\top}\big]$ for different distributions $\nu$, and need not scale with the number of state-action pairs.
Theorem 5 (Main result). Consider an ergodic MDP and suppose Condition 5 holds. By choosing the phase length $\tau = \Theta\big((d\,T^2)^{1/3}\big)$, we have, with probability at least $1 - \delta$,
$$R_T \le \tilde{O}\big( \epsilon_0 T + d^{1/3} T^{2/3} \big),$$
where $d$ is the dimension of the representation and $\tilde{O}$ hides universal constants, polylogarithmic factors, and the dependence on the mixing time.
It is worth comparing the above result with the regret bound presented in Abbasi-Yadkori et al. (2019a). Ignoring the irreducible error $\epsilon_0$, we improve the leading order of their general result (Corollary 4.6 in Abbasi-Yadkori et al. (2019a)) from $\tilde{O}(T^{3/4})$ to $\tilde{O}(T^{2/3})$. When specialized to linear value function approximation, where $d$ scales with the number of features (Theorem 5.1 in Abbasi-Yadkori et al. (2019a)), we obtain the same improvement in the dependence on $T$.
6 Proof sketch
In this section, we provide a proof sketch for Theorem 5. Technical details are deferred to Appendix A in the supplementary material. At a high level, we bound the two terms in the regret decomposition Eq. (6) separately. While the first term is bounded using the fast mixing condition, the second term is split into the regret due to value function estimation error and the regret due to the online learning reduction.
Step 1: fast mixing. To bound the first term in Eq. (6), we require the following uniform fast mixing condition, which is used frequently in the online MDP literature (Even-Dar et al., 2009; Neu et al., 2014). Note that the ergodic MDPs on which this paper focuses automatically satisfy this condition.

Condition 6 (Uniform fast mixing). There exists a number $\kappa > 0$ such that for any policy $\pi$ and any pair of distributions $\mu$ and $\nu$ over the states, it holds that
$$\big\| (\mu - \nu) P^{\pi} \big\|_{1} \le e^{-1/\kappa}\, \|\mu - \nu\|_{1}. \qquad (8)$$
The following lemma provides an upper bound for the first term (see e.g. Lemma 4.4 in Abbasi-Yadkori et al. (2019a) for a proof).

Lemma. Suppose that Condition 6 holds. The following inequality holds with probability at least $1 - \delta$:
$$\sum_{t=1}^{T} \left(\lambda^{\pi_t} - r(s_t, a_t)\right) \le O\Big( \kappa K + \kappa \sqrt{T \log(1/\delta)} \Big),$$
where $K = T/\tau$ is the number of phases.
Step 2: decomposition. We bound the second (pseudo-regret) term in Eq. (6). Since the policy is only updated at the end of each phase of length $\tau$ (see line 9 in Algorithm 1), we have $\pi_t = \pi_k$ for all rounds $t$ in phase $k$. Thus, the pseudo-regret term can be rewritten as
$$\sum_{t=1}^{T} \left(\lambda^{\pi} - \lambda^{\pi_t}\right) = \tau \sum_{k=1}^{K} \left(\lambda^{\pi} - \lambda^{\pi_k}\right). \qquad (9)$$
We slightly abuse notation by writing $Q_k = Q^{\pi_k}$ and $V_k = V^{\pi_k}$; in particular, $V_k$ is exactly the value function of $\pi_k$ defined in Section 3. Applying the performance difference lemma (Lemma C in the supplementary material), we have
$$\lambda^{\pi} - \lambda^{\pi_k} = \sum_{s} \mu^{\pi}(s)\, \big\langle \pi(\cdot \mid s) - \pi_k(\cdot \mid s),\ Q_k(s, \cdot) \big\rangle.$$
Bridging by the empirical estimates, we decompose (9) into $R_{(1)} + R_{(2)}$, where
$$R_{(1)} = \tau \sum_{k=1}^{K} \sum_{s} \mu^{\pi}(s)\, \big\langle \pi(\cdot \mid s) - \pi_k(\cdot \mid s),\ Q_k(s, \cdot) - \hat{Q}_k(s, \cdot) \big\rangle, \quad R_{(2)} = \tau \sum_{k=1}^{K} \sum_{s} \mu^{\pi}(s)\, \big\langle \pi(\cdot \mid s) - \pi_k(\cdot \mid s),\ \hat{Q}_k(s, \cdot) \big\rangle. \qquad (10)$$
Step 3: estimation error. The first term in (10) quantifies the regret incurred in the policy evaluation step due to the estimation and function approximation error in each phase. It can be bounded as in Theorem 4.1 of Abbasi-Yadkori et al. (2019a) under similar assumptions, which we reproduce here for completeness.

Lemma. Suppose Condition 5 holds. Then
$$R_{(1)} \le O\Big( T \big( \epsilon_0 + C_0 \sqrt{d/\tau} \big) \Big) \qquad (11)$$
with probability at least $1 - K\delta$.
Step 4: online learning reduction. Minimizing the second term in (10) can be cast as an online learning problem (Cesa-Bianchi & Lugosi, 2006; Shalev-Shwartz et al., 2012), and this observation determines the choice of our algorithm. Previous work (Abbasi-Yadkori et al., 2019a) has tackled this subproblem using mirror descent, resulting in $\tilde{O}(T^{3/4})$ regret after optimizing $\tau$, ignoring the irreducible error $\epsilon_0$. Here we instead use the AOFTRL framework, which allows us to show an improved regret bound. As we show, the reason we can benefit from optimism is that the losses (Q-functions) change slowly, and we carefully transfer this knowledge to the adaptive learning rate. This is the main technical contribution of the paper.
First, we state the AOFTRL framework and its regret guarantees. Let $\ell_1, \ldots, \ell_K$ be a sequence of loss vectors and let $x_1, \ldots, x_K$ be a sequence of prediction vectors in the probability simplex $\Delta$. At the beginning of each round, the algorithm receives a side-information vector $M_k$. In the literature, such $(M_k)$ are also called predictable sequences (Rakhlin & Sridharan, 2012), and the algorithm can be seen as a way of utilizing prior knowledge about the loss sequence. The algorithm then selects an action $x_k \in \Delta$ and suffers a cost $\langle x_k, \ell_k \rangle$. The goal of this online learning problem is to minimize the cumulative regret with respect to the best action in hindsight, defined as $\mathrm{Regret}_K := \sum_{k=1}^{K} \langle x_k - u^{*}, \ell_k \rangle$ with $u^{*} = \arg\min_{u \in \Delta} \sum_{k=1}^{K} \langle u, \ell_k \rangle$.
Let $R$ be a 1-strongly convex regularizer on $\Delta$ with respect to some norm $\|\cdot\|$ and denote by $\|\cdot\|_{*}$ its dual norm. At each round $k$, AOFTRL has the following form:
$$x_{k+1} = \mathop{\mathrm{arg\,min}}_{x \in \Delta} \left\{ \Big\langle x,\ \sum_{i=1}^{k} \ell_i + M_{k+1} \Big\rangle + \frac{1}{\eta_{k+1}} R(x) \right\}, \qquad \eta_{k+1} = \frac{c}{\sqrt{1 + \sum_{i=1}^{k} \|\ell_i - M_i\|_{*}^{2}}},$$
where $c$ is an absolute constant. It is easy to see that $\eta_{k+1}^{-1}$ is non-decreasing. For simplicity, we assume the losses are bounded. The next lemma provides a generic regret bound for AOFTRL. The detailed proof is deferred to Appendix A.2 in the supplementary material.
Lemma. Choose the learning rates as above and denote $\delta_k = \|\ell_k - M_k\|_{*}$. The cumulative regret for AOFTRL is upper-bounded by
$$\mathrm{Regret}_K \le O\left( \sqrt{1 + \sum_{k=1}^{K} \delta_k^{2}} \right) - \Omega\left( \sum_{k=1}^{K} \frac{\|x_k - x_{k+1}\|^{2}}{\eta_k} \right). \qquad (12)$$
Unlike the AOFTRL analyses of Rakhlin & Sridharan (2012) and Mohri & Yang (2016), but similarly to, e.g., the analysis of Joulani et al. (2017), Eq. (12) has a key negative term (at the expense of a slightly larger constant factor in the main positive term). These negative terms, which are retained from a tight regret bound on the forward regret of AOFTRL (Joulani et al., 2017), track the evolution of the policy. With the proper choice of $M_k$, the norm terms will also be controlled by the evolution of the policy (see Lemma 6), and the aforementioned negative terms allow us to greatly reduce the contribution of the norm terms to the overall regret.
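With the negative-entropy regularizer on the simplex, the AOFTRL update has a closed form: the iterate is a softmax over the negated cumulated losses plus the optimistic prediction. The following sketch implements this special case with the prediction set to the previous loss; the loss sequence and the constant c are illustrative.

```python
import math

def softmax(v):
    m = max(v)
    w = [math.exp(x - m) for x in v]
    z = sum(w)
    return [x / z for x in w]

def aoftrl(losses, c=1.0):
    """Run AOFTRL with negative-entropy regularizer over the simplex,
    using M_{k+1} = losses[k] as the optimistic prediction."""
    n = len(losses[0])
    cum = [0.0] * n          # cumulated losses
    err_sq = 0.0             # sum of squared prediction errors
    plays, M = [], [0.0] * n
    for loss in losses:
        eta = c / math.sqrt(1.0 + err_sq)   # adaptive learning rate
        # argmin <x, cum + M> + R(x)/eta  ==  softmax(-eta * (cum + M))
        plays.append(softmax([-eta * (cum[i] + M[i]) for i in range(n)]))
        err_sq += max(abs(loss[i] - M[i]) for i in range(n)) ** 2
        cum = [cum[i] + loss[i] for i in range(n)]
        M = list(loss)       # optimistic prediction: previous loss
    return plays

# Slowly-changing losses: predictions are accurate, so the learning
# rate stays large and the iterates concentrate on the better action.
losses = [[0.5 + 0.01 * k, 1.0] for k in range(20)]
plays = aoftrl(losses)
```

On this slowly-changing sequence the prediction errors after the first round are tiny, so the rate barely decays; against a truly adversarial sequence the same rule would shrink the rate and recover the usual worst-case behavior.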
The reason that minimizing the second term can be cast as an online learning problem is as follows. By the definitions in Step 2, we rewrite $R_{(2)}$ in (10) as
$$R_{(2)} = \tau \sum_{s} \mu^{\pi}(s) \sum_{k=1}^{K} \big\langle \pi(\cdot \mid s) - \pi_k(\cdot \mid s),\ \hat{Q}_k(s, \cdot) \big\rangle.$$
For each state $s$, we view $\pi_k(\cdot \mid s)$ as the prediction vector and $-\hat{Q}_k(s, \cdot)$ as the loss vector. This equivalence enables us to utilize the generic regret bound for AOFTRL in Lemma 6 for each individual state.
Next, we show that under some conditions, the change in the true Q-values can be bounded by the change of policies. This is a unique property of ergodic MDPs that allows us to benefit from the negative term in (12). To ensure that the value functions are unique, we fix the normalization $\mu^{\pi} V^{\pi} = 0$.

Lemma (Relative Q-function error). For any two successive policies $\pi_k$ and $\pi_{k+1}$, the following holds for any state-action pair $(s, a)$:
$$\big| Q^{\pi_{k+1}}(s, a) - Q^{\pi_k}(s, a) \big| \le O(\kappa) \cdot \max_{s'} \big\| \pi_{k+1}(\cdot \mid s') - \pi_k(\cdot \mid s') \big\|_{1}.$$
The detailed proof of Lemma 6 is deferred to Appendix A.4 in the supplementary material. Combining the results of the two preceding lemmas, we can derive the following.
Lemma. Suppose Condition 5 holds, with $\epsilon_0$, $C_0$, and $d$ as defined there. Then the following upper bound holds with probability at least $1 - K\delta$:
$$R_{(2)} \le \tilde{O}\Big( \tau \log K + T\big( \epsilon_0 + C_0 \sqrt{d/\tau} \big) \Big), \qquad (13)$$
where $\tilde{O}$ hides universal constant and polylogarithmic factors. The detailed proof of Lemma 6 is deferred to Appendix A.3 in the supplementary material. Finally, we optimize the phase length $\tau$ and reach our conclusion.
The upper bound (13) accounts for the approximation and estimation error accumulated per round. When value functions can be computed exactly (known MDP) and for a fixed phase length $\tau$, the online learning reduction regret for AAPI scales logarithmically in the number of phases $K$, while that of POLITEX (Abbasi-Yadkori et al., 2019a) scales as $\sqrt{K}$. This is the main reason why we are able to improve the regret from $\tilde{O}(T^{3/4})$ to $\tilde{O}(T^{2/3})$ in the end.
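The choice of the phase length can be sketched as a back-of-the-envelope balance between the evaluation-error and online-learning contributions; the display below is a sketch under our reading of the bounds, with constants and mixing-time factors omitted.

```latex
\underbrace{T\sqrt{d/\tau}}_{\text{evaluation error}}
\;\asymp\;
\underbrace{\tau\,\mathrm{polylog}(K)}_{\text{online learning}}
\quad\Longrightarrow\quad
\tau \asymp \big(d\,T^{2}\big)^{1/3},
\qquad
R_T = \widetilde{O}\!\big(d^{1/3}\,T^{2/3}\big).
```

Replacing the $\tau\,\mathrm{polylog}(K)$ term by the mirror-descent rate $\tau\sqrt{K} = \sqrt{\tau T}$ in the same balance recovers the $\widetilde{O}(T^{3/4})$ scaling of the earlier analysis.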
7 Experiments
Setup. In this section, we empirically evaluate the benefits of using adaptive learning rates and side-information. In particular, we compare (1) AAPI, with $M_{k+1} = \hat{Q}_k$ and $\eta_{k,s}$ as in Eq. (4); (2) k-AAPI, similar to AAPI, but approximating the adaptive learning rate using only a subset of past estimates for the purpose of computational efficiency; (3) POLITEX, with a fixed learning rate and no side-information. We also evaluate RLSVI (Osband et al. (2016), Algorithms 1 and 2, with tuned parameters), where policies are greedy w.r.t. a randomized estimate of the action-value function. For RLSVI in episodic environments with a fixed known horizon (DeepSea), we estimate separate parameters for each step and update after each episode; otherwise we share parameters and update every $\tau$ steps.
We approximate all value functions using least-squares Monte Carlo, i.e. linear regression from state-action features to empirical returns. For MDPs with a large or continuous state space, updating per-state learning rates can be impractical. Instead, we store the weights of past Q-functions in memory, and for each state in the trajectory, we compute the learning rate using a subset of randomly-selected past weight vectors (correcting the scale of the estimate accordingly). For Boltzmann policies, we tune the constant $c$ for the learning rate. For each environment and algorithm, we evaluate and plot the mean and standard deviation over 50 runs. We show results on the following three environments.
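The evaluation step just described can be sketched as a plain least-squares fit from state-action features to Monte Carlo returns; the two-dimensional features and returns below are made up for illustration, and a real implementation would use a general linear solver.

```python
def lstsq_2d(X, y, ridge=1e-6):
    """Solve (X^T X + ridge I) w = X^T y by hand for d = 2 features."""
    a = sum(x[0] * x[0] for x in X) + ridge
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + ridge
    u = sum(x[0] * yi for x, yi in zip(X, y))
    v = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return [(d * u - b * v) / det, (a * v - b * u) / det]

# Features phi(s, a) and empirical returns collected during one phase:
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.0, 1.0]
w = lstsq_2d(X, y)
q_hat = lambda phi: w[0] * phi[0] + w[1] * phi[1]  # linear Q estimate
```

The small ridge term keeps the normal equations well-posed when a feature direction is rarely visited within a phase.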
Tabular ergodic MDPs. We consider a simple tabular MDP with $S$ states and $A$ actions. On any action in state 1, the environment transitions to a randomly chosen state. On action 1 in a state $s > 1$, the environment transitions to state $s + 1$ with probability 0.9, and to a randomly chosen state with probability 0.1. On all other actions in states $s > 1$, the environment transitions to a randomly chosen state. We represent state-action pairs using one-hot indicator vectors of size $SA$, and experiment with different sizes of the state and action spaces.
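A sketch of the transition structure just described; the behavior of action 1 at the top state is our own assumption (a self-loop), since it is not specified here, and states are 0-indexed in code.

```python
def build_mdp(S, A):
    """Return P[s][a] as a next-state distribution for the tabular
    ergodic MDP described in the text (0-indexed states)."""
    uniform = [1.0 / S] * S
    P = [[list(uniform) for _ in range(A)] for _ in range(S)]
    for s in range(1, S):
        # Action index 0 here plays the role of "action 1" in the text:
        # move to s+1 w.p. 0.9, random state w.p. 0.1.
        row = [0.1 / S] * S
        row[min(s + 1, S - 1)] += 0.9   # self-loop at the top: assumption
        P[s][0] = row
        # All other actions keep the uniform transition from above.
    return P

P = build_mdp(5, 2)
```

The uniform random resets are what make every policy's chain irreducible and aperiodic, so the MDP is ergodic as required by the analysis.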
DeepSea (Osband et al., 2017). In the DeepSea environment, states comprise an $N \times N$ grid, and there are two actions. The environment transitions and costs are deterministic. On action 0, the environment moves the agent down and to the left; on action 1, down and to the right. The agent starts in the top-left state. The reward (negative cost) for being in the bottom-right cell is positive for any action; for all other states, action 1 incurs a small cost and action 0 is free. In the infinite-horizon setting, an optimal strategy first takes action 1 repeatedly (to reach the rightmost column) and then takes an equal number of 0 and 1 actions, and has expected average reward close to the maximum. A simple strategy that always takes action 1 has a lower average reward, and a suboptimal strategy that only takes action 0 has a still lower average reward. We represent states as length-$2N$ vectors containing one-hot indicators for each grid coordinate, and estimate linear Q-functions.
CartPole (Barto et al., 1983). In the CartPole environment, the goal is to balance an inverted pole attached by an unactuated joint to a cart, which moves along a frictionless rail. There are two actions, corresponding to pushing the cart to the left or right. The observation consists of the position and velocity of the cart, the pole angle, and the pole velocity at the tip. There is a reward of +1 for every timestep that the pole remains upright. The episodic version of the environment ends if the pole angle is more than 15 degrees from vertical, if the cart moves more than 2.4 units from the center, or after 200 steps. In the infinite-horizon version, the environment resets whenever the episode would end. For this task, in addition to the given observation, we extract multivariate Fourier basis features (Konidaris et al., 2011) of order 4.
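The Fourier basis construction of Konidaris et al. (2011) can be sketched as follows: one cosine feature per integer coefficient vector, applied to the observation normalized to the unit cube. The normalized sample observation is illustrative.

```python
import math
from itertools import product

def fourier_features(x, order):
    """Order-n multivariate Fourier basis: phi_c(x) = cos(pi <c, x>)
    for every c in {0, ..., order}^d, with x normalized to [0, 1]^d."""
    d = len(x)
    return [math.cos(math.pi * sum(c[i] * x[i] for i in range(d)))
            for c in product(range(order + 1), repeat=d)]

# A 4-dimensional CartPole observation normalized to [0, 1]^4 with
# order 4 yields 5^4 = 625 features.
feats = fourier_features([0.5, 0.2, 0.8, 0.1], order=4)
```

The coefficient vector $c = (0, \ldots, 0)$ always yields the constant feature 1, which serves as a bias term in the linear value function.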
Results and discussion. The experimental results are shown in Figures 1, 2, and 3. AAPI outperforms the other algorithms in the DeepSea and ergodic environment instances. These problems all involve a single high-reward state, and reaching that state requires sufficient exploration. On the other hand, the adaptive per-state learning rate is not helpful in CartPole, possibly because observations are continuous and there is more generalization across states. In most of our experiments, AAPI performs better with a smaller phase length $\tau$. Our analysis relies on long phases in part to obtain better side-information $M_k$. However, given that the side-information also depends on past estimates, in some MDPs shorter phases may suffice. This is clear in the DeepSea results, where shorter phases result in better performance for smaller problem instances.
8 Discussion and future work
We have presented AAPI, a model-free learning scheme that can work with function approximation and enjoys an $\tilde{O}(T^{2/3})$ regret guarantee in infinite-horizon undiscounted, ergodic MDPs. AAPI improves upon previous results for this setting by exploiting the slow-changing property of policies in both theory and practice.
Our result has an undesirable dependence on the size of the representation. With value function approximation, one general direction for improvement is replacing this dependence with a quantity intrinsic to the compressed representation, such as the minimum eigenvalue of the feature covariance matrix in the linear case. Another direction for future work is improving the policy evaluation stage. While we estimate each value function solely using the on-policy transitions, better estimates can potentially be obtained using all data. Using more sophisticated side-information, such as a weighted average of past Q-estimates or an off-policy estimate of the Q-function, may also be helpful in practice. Other future work may include practical implementations of the algorithm when trained with neural networks that maintain only a subset of past networks in memory.
Supplement to “Provably Efficient Adaptive Approximate Policy Iteration”
In Section A, we present detailed proofs of the main results. In Section B, linear value function approximation is considered. In Section C, supporting lemmas are included.
Appendix A Proofs of main results
A.1 Proof of Theorem 5: main result
We combine the decompositions (6), (9), and (10), and apply the lemmas of Section 6. Choosing the parameters as in the theorem statement and ignoring universal constants and logarithmic factors, the combined bound holds with probability at least $1 - \delta$. By optimizing $\tau$ so that the two dominant terms are equal, we reach the final regret bound,
This ends the proof.
A.2 Proof of Lemma 6: adaptive optimistic FTRL (AOFTRL)
First, recall that at each round $k$, AOFTRL performs the update given in Section 6, where the $\eta_k$ are data-dependent learning rates. For the analysis, we define
(14) 
We define the corresponding regularized objective for all rounds. Since $R$ is 1-strongly convex with respect to the norm $\|\cdot\|$, the scaled regularizer is strongly convex with respect to it as well. We can then rewrite the AOFTRL update as
Second, let us define the forward linear regret as:
One can interpret this as a cheating regret, since it uses the prediction for the current round. We decompose the cumulative regret based on the forward linear regret as follows,
(15) 
The second term on the right-hand side captures the regret caused by the algorithm's inability to accurately predict the future. We define the Bregman divergence between two vectors induced by a differentiable function as follows:
The next theorem is used to bound the forward regret.

Theorem (Theorem 3 in Joulani et al. (2017)). For any comparator and any sequence of linear losses, the forward regret satisfies the following inequality:
Recall that $R$ is strongly convex. From the definitions of strong convexity and the Bregman divergence, we have
(16) 
Applying Theorem A.2 and Eq. (16), we have
(17) 
To bound the first term in Eq. (17), we expand it using the definition in Eq. (14),
(18)  
where the first inequality is due to the fact that $\eta_k^{-1}$ is non-decreasing. We decompose the second term in Eq. (18) as follows,
(19)  
Combining Eq. (18) and Eq. (19),
(20) 