###### Abstract

We present a reduction from reinforcement learning (RL) to no-regret online learning based on the saddle-point formulation of RL, by which any online algorithm with sublinear regret can generate policies with provable performance guarantees. This new perspective decouples the RL problem into two parts: regret minimization and function approximation. The first part admits a standard online-learning analysis, and the second part can be quantified independently of the learning algorithm. Therefore, the proposed reduction can be used as a tool to systematically design new RL algorithms. We demonstrate this idea by devising a simple RL algorithm based on mirror descent and the generative-model oracle. For any -discounted tabular RL problem, with probability at least , it learns an -optimal policy using at most samples. Furthermore, this algorithm admits a direct extension to linearly parameterized function approximators for large-scale applications, with computation and sample complexities independent of ,, though at the cost of potential approximation bias.

A Reduction from Reinforcement Learning to No-Regret Online Learning

Ching-An Cheng (Georgia Tech), Remi Tachet des Combes (Microsoft Research Montreal), Byron Boots (UW), Geoff Gordon (Microsoft Research Montreal)

## 1 Introduction

Reinforcement learning (RL) is a fundamental problem for sequential decision making in unknown environments. One of its core difficulties, however, is the need for algorithms to infer long-term consequences based on limited, noisy, short-term feedback. As a result, designing RL algorithms that are both scalable and provably sample efficient has been challenging.

In this work, we revisit the classic linear-program (LP) formulation of RL [26, 13] in an attempt to tackle this long-standing question. We focus on the associated saddle-point problem of the LP (given by Lagrange duality), which has recently gained traction due to its potential for computationally efficient algorithms with theoretical guarantees [34, 7, 36, 24, 35, 25, 12, 6, 23]. But in contrast to these previous works based on stochastic approximation, here we consider a reformulation through the lens of online learning, i.e. regret minimization. Since the pioneering work of Gordon [14] and Zinkevich [37], online learning has evolved into a ubiquitous tool for systematic design and analysis of iterative algorithms. Therefore, if we can identify a reduction from RL to online learning, we can potentially leverage it to build efficient RL algorithms.

We will show this idea is indeed feasible. We present a reduction by which any no-regret online algorithm, after observing samples, can find a policy in a policy class satisfying , where is the accumulated reward of policy with respect to some unknown initial state distribution , is the optimal policy, and is a measure of the expressivity of (see Section 4.2 for definition).

Our reduction is built on a refinement of online learning, called Continuous Online Learning (COL), which was proposed to model problems where loss gradients across rounds change continuously with the learner’s decisions [9]. COL has a strong connection to equilibrium problems (EPs) [4, 3], and any monotone EP (including our saddle-point problem of interest) can be framed as no-regret learning in a properly constructed COL problem [9]. Using this idea, our reduction follows naturally by first converting an RL problem to an EP and then the EP to a COL problem.

Framing RL as COL reveals new insights into the relationship between approximate solutions to the saddle-point problem and approximately optimal policies. Importantly, this new perspective shows that the RL problem can be separated into two parts: regret minimization and function approximation. The first part admits standard treatments from the online learning literature, and the second part can be quantified independently of the learning process. For example, one can accelerate learning by adopting optimistic online algorithms [31, 10] that account for the predictability in COL, without worrying about function approximators. Because of these problem-agnostic features, the proposed reduction can be used to systematically design efficient RL algorithms with performance guarantees.

As a demonstration, we design an RL algorithm based on arguably the simplest online learning algorithm: mirror descent. Assuming a generative model (in practice, it can be approximated by running a behavior policy with sufficient exploration [22]), we prove that, for any tabular Markov decision process (MDP), with probability at least , this algorithm learns an -optimal policy for the $\gamma$-discounted accumulated reward, using at most samples, where $|\mathcal{S}|$, $|\mathcal{A}|$ are the sizes of state and action spaces, and $\gamma$ is the discount rate. Furthermore, thanks to the separation property above, our algorithm admits a natural extension with linearly parameterized function approximators, whose sample and per-round computation complexities are linear in the number of parameters and independent of $|\mathcal{S}|$, $|\mathcal{A}|$, though at the cost of policy performance bias due to approximation error.

This sample complexity improves the current best provable rate of the saddle-point RL setup [34, 7, 36, 24] by a large factor of , without making any assumption on the MDP ([36] has the same sample complexity but requires the MDP to be ergodic under any policy). This improvement is attributed to our new online-learning-style analysis that uses a cleverly selected comparator in the regret definition. While it is possible to devise a minor modification of the previous stochastic mirror descent algorithms, e.g. [36], achieving the same rate with our new analysis, we remark that our algorithm is considerably simpler and removes a projection required in previous work [34, 7, 36, 24].

Finally, we do note that the same sample complexity can also be achieved, e.g., by model-based RL and (phased) Q-learning [22, 21]. However, these methods either have super-linear runtime, with no obvious route for improvement, or could become unstable when using function approximators without further assumption.

## 2 Setup & Preliminaries

Let $\mathcal{S}$ and $\mathcal{A}$ be state and action spaces, which can be discrete or continuous. We consider $\gamma$-discounted infinite-horizon problems for $\gamma\in[0,1)$. Our goal is to find a policy $\pi$ that maximizes the discounted average return $V^{\pi}(p)\coloneqq\mathbb{E}_{s\sim p}[V^{\pi}(s)]$, where $p$ is the initial state distribution,

$$V^{\pi}(s)\coloneqq(1-\gamma)\,\mathbb{E}_{\xi\sim\rho_{\pi}(s)}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_t,a_t)\right] \tag{1}$$

is the value function of $\pi$ at state $s$, $r$ is the reward function, and $\rho_{\pi}(s)$ is the distribution of the trajectory $\xi=s_0,a_0,s_1,\dots$ generated by running $\pi$ from $s_0=s$ in an MDP. We assume that the initial distribution $p(s_0)$, the transition $\mathcal{P}(s'|s,a)$, and the reward function $r(s,a)$ in the MDP are unknown but can be queried through a generative model, i.e. we can sample $s_0$ from $p$, $s'$ from $\mathcal{P}$, and $r(s,a)$ for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$. We remark that the definition of $V^{\pi}$ in (1) contains a $(1-\gamma)$ factor. We adopt this setup to make writing more compact. We denote the optimal policy as $\pi^*$ and its value function as $V^*$ for short.

### 2.1 Duality in RL

Our reduction is based on the linear-program (LP) formulation of RL. We provide a short recap here (please see Appendix A and [30] for details).

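To show how $\max_{\pi}V^{\pi}(p)$ can be framed as an LP, let us define the average state distribution under $\pi$, $d^{\pi}(s)\coloneqq(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}d_t^{\pi}(s)$, where $d_t^{\pi}$ is the state distribution at time $t$ visited by running $\pi$ from $p$ (e.g. $d_0^{\pi}=p$). By construction, $d^{\pi}$ satisfies the stationarity property,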

$$d^{\pi}(s')=(1-\gamma)p(s')+\gamma\,\mathbb{E}_{s\sim d^{\pi}}\mathbb{E}_{a\sim\pi|s}\left[\mathcal{P}(s'|s,a)\right]. \tag{2}$$

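With $d^{\pi}$, we can write $V^{\pi}(p)=\mathbb{E}_{s\sim d^{\pi}}\mathbb{E}_{a\sim\pi|s}[r(s,a)]$ and our objective $\max_{\pi}V^{\pi}(p)$ equivalently as: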

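$$\max_{\boldsymbol{\mu}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}:\,\boldsymbol{\mu}\ge 0}\ \mathbf{r}^{\top}\boldsymbol{\mu}\quad\text{s.t.}\quad(1-\gamma)\mathbf{p}+\gamma\mathbf{P}^{\top}\boldsymbol{\mu}=\mathbf{E}^{\top}\boldsymbol{\mu} \tag{3}$$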

where $\mathbf{r}$, $\mathbf{p}$, and $\mathbf{P}$ are vector forms of $r$, $p$, and $\mathcal{P}$, respectively, and $\mathbf{E}=\mathbf{I}\otimes\mathbf{1}$ (we use $|\cdot|$ to denote the cardinality of a set, $\otimes$ the Kronecker product, $\mathbf{I}$ is the identity, and $\mathbf{1}$ the vector of ones). In (3), $\mathcal{S}$ and $\mathcal{A}$ may seem to have finite cardinalities, but the same formulation extends to countable or even continuous spaces (under proper regularity assumptions; see [17]). We adopt this abuse of notation (emphasized by bold-faced symbols) for compactness.

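The variable $\boldsymbol{\mu}$ of the LP in (3) resembles a joint distribution $d^{\pi}(s)\pi(a|s)$. To see this, notice that the constraint in (3) is reminiscent of (2), and implies $\|\boldsymbol{\mu}\|_1=1$, i.e. $\boldsymbol{\mu}$ is a probability distribution. Then one can show $\mu(s,a)=d^{\pi}(s)\pi(a|s)$ when the constraint is satisfied, which implies that (3) is the same as $\max_{\pi}V^{\pi}(p)$ and its solution $\boldsymbol{\mu}^*$ corresponds to $\mu^*(s,a)=d^{\pi^*}(s)\pi^*(a|s)$ of the optimal policy $\pi^*$.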
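The correspondence between the stationarity property (2) and the constraint in (3) can be checked numerically on a small tabular MDP. The sketch below is illustrative only: the random MDP, the state-major indexing of state-action pairs, and all variable names are assumptions of this example and do not come from the original text.

```python
import numpy as np

nS, nA, gamma = 4, 3, 0.9
rng = np.random.default_rng(0)

# Hypothetical random tabular MDP; the pair (s, a) is stored at row s * nA + a.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)          # P[(s, a), s'] = P(s' | s, a)
r = rng.random(nS * nA)                    # rewards in [0, 1]
p = rng.random(nS)
p /= p.sum()                               # initial state distribution
E = np.kron(np.eye(nS), np.ones((nA, 1)))  # E[(s, a), s'] = 1 if s == s'

# An arbitrary stochastic policy, pi[s, a] = pi(a | s).
pi = rng.random((nS, nA))
pi /= pi.sum(axis=1, keepdims=True)

# State-to-state transitions and expected rewards under pi.
P_pi = np.einsum("sa,sat->st", pi, P.reshape(nS, nA, nS))
r_pi = (pi * r.reshape(nS, nA)).sum(axis=1)

# Average state distribution: the fixed point of (2).
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * p)
mu = (d[:, None] * pi).reshape(-1)         # mu(s, a) = d(s) * pi(a | s)

# Normalized value function of pi, as in (1).
v_pi = np.linalg.solve(np.eye(nS) - gamma * P_pi, (1 - gamma) * r_pi)

constraint_gap = np.abs((1 - gamma) * p + gamma * P.T @ mu - E.T @ mu).max()
print("constraint violation in (3):", constraint_gap)
print("r^T mu:", r @ mu, " p^T v_pi:", p @ v_pi)
```

Under these assumptions, the state-action distribution of any policy is feasible for (3), and the LP objective evaluated at it coincides with the policy's return.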

As (3) is an LP, it suggests looking at its dual, which turns out to be the classic LP formulation of RL (our setup in (4) differs from the classic one in the $(1-\gamma)$ factor in the constraint due to the average setup),

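$$\min_{\mathbf{v}\in\mathbb{R}^{|\mathcal{S}|}}\ \mathbf{p}^{\top}\mathbf{v}\quad\text{s.t.}\quad(1-\gamma)\mathbf{r}+\gamma\mathbf{P}\mathbf{v}\le\mathbf{E}\mathbf{v}. \tag{4}$$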

It can be verified that for all $\mathbf{p}>0$, the solution to (4) satisfies the Bellman equation [2] and therefore is the optimal value function $\mathbf{v}^*$ (the vector form of $V^*$). We note that, for any policy $\pi$, $V^{\pi}$ by definition satisfies a stationarity property

$$V^{\pi}(s)=\mathbb{E}_{a\sim\pi|s}\left[(1-\gamma)r(s,a)+\gamma\,\mathbb{E}_{s'\sim\mathcal{P}|s,a}\left[V^{\pi}(s')\right]\right] \tag{5}$$

which can be viewed as a dual equivalent of (2) for $d^{\pi}$. Because, for any $s\in\mathcal{S}$ and $a\in\mathcal{A}$, $r(s,a)$ is in $[0,1]$, (5) implies $V^{\pi}(s)$ lies in $[0,1]$ too.
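The relation between the dual LP (4), the Bellman equation, and the primal LP (3) can be illustrated numerically. The sketch below is a minimal check under stated assumptions: the random MDP, the use of value iteration to obtain the dual solution, and all names are choices made for this example, not the authors' procedure.

```python
import numpy as np

nS, nA, gamma = 4, 3, 0.9
rng = np.random.default_rng(1)

# Hypothetical random tabular MDP; the pair (s, a) is stored at row s * nA + a.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)                    # rewards in [0, 1]
p = rng.random(nS)
p /= p.sum()
E = np.kron(np.eye(nS), np.ones((nA, 1)))

# Value iteration on the normalized Bellman operator (note the 1 - gamma factor).
v = np.zeros(nS)
for _ in range(2000):
    v = ((1 - gamma) * r + gamma * P @ v).reshape(nS, nA).max(axis=1)

q = (1 - gamma) * r + gamma * P @ v        # one entry per (s, a)
slack = E @ v - q                          # constraint in (4) requires slack >= 0
bellman_residual = np.abs(q.reshape(nS, nA).max(axis=1) - v).max()

# Primal side: state-action distribution of the greedy policy.
pi = np.eye(nA)[q.reshape(nS, nA).argmax(axis=1)]
P_pi = np.einsum("sa,sat->st", pi, P.reshape(nS, nA, nS))
d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * p)
mu = (d[:, None] * pi).reshape(-1)

print("min slack:", slack.min(), " Bellman residual:", bellman_residual)
print("dual objective p^T v:", p @ v, " primal objective r^T mu:", r @ mu)
```

In this example the fixed point is feasible for (4), stays in $[0,1]$, and its dual objective matches the primal objective of the greedy policy's state-action distribution.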

### 2.2 Toward RL: the Saddle-Point Setup

The LP formulations above require knowing the probabilities $\mathbf{p}$ and $\mathbf{P}$ and are computationally inefficient. When only generative models are available (as in our setup), one can alternatively exploit the duality relationship between the two LPs in (3) and (4), and frame RL as a saddle-point problem [34]. Let us define

$$\mathbf{a}_{\mathbf{v}}\coloneqq\mathbf{r}+\frac{1}{1-\gamma}(\gamma\mathbf{P}-\mathbf{E})\mathbf{v} \tag{6}$$

as the advantage function with respect to $\mathbf{v}$ (where $\mathbf{v}$ is not necessarily a value function). Then the Lagrangian connecting the two LPs can be written as

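$$\mathscr{L}(\mathbf{v},\boldsymbol{\mu})\coloneqq\mathbf{p}^{\top}\mathbf{v}+\boldsymbol{\mu}^{\top}\mathbf{a}_{\mathbf{v}}, \tag{7}$$

$$\min_{\mathbf{v}\in\mathcal{V}}\max_{\boldsymbol{\mu}\in\mathcal{M}}\ \mathscr{L}(\mathbf{v},\boldsymbol{\mu}), \tag{8}$$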

where the constraints are

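$$\mathcal{V}=\{\mathbf{v}\in\mathbb{R}^{|\mathcal{S}|}:\mathbf{v}\ge 0,\ \|\mathbf{v}\|_{\infty}\le 1\} \tag{9}$$

$$\mathcal{M}=\{\boldsymbol{\mu}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}:\boldsymbol{\mu}\ge 0,\ \|\boldsymbol{\mu}\|_{1}\le 1\}. \tag{10}$$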

The solution to (8) is exactly $(\mathbf{v}^*,\boldsymbol{\mu}^*)$, but notice that extra constraints on the norm of $\boldsymbol{\mu}$ and $\mathbf{v}$ are introduced in $\mathcal{V}$ and $\mathcal{M}$, compared with (3) and (4). This is a common practice, which uses known bounds on the solutions of (3) and (4) (discussed above) to make the search spaces $\mathcal{V}$ and $\mathcal{M}$ in (8) compact and as small as possible so that optimization converges faster.

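Having compact variable sets allows using first-order stochastic methods, such as stochastic mirror descent and mirror-prox [28, 19], to efficiently solve the problem. These methods only require using the generative model to compute unbiased estimates of the gradients $\nabla_{\mathbf{v}}\mathscr{L}(\mathbf{v},\boldsymbol{\mu})=\mathbf{b}_{\boldsymbol{\mu}}$ and $\nabla_{\boldsymbol{\mu}}\mathscr{L}(\mathbf{v},\boldsymbol{\mu})=\mathbf{a}_{\mathbf{v}}$, where we define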

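$$\mathbf{b}_{\boldsymbol{\mu}}\coloneqq\mathbf{p}+\frac{1}{1-\gamma}(\gamma\mathbf{P}-\mathbf{E})^{\top}\boldsymbol{\mu} \tag{11}$$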

as the balance function with respect to $\boldsymbol{\mu}$. $\mathbf{b}_{\boldsymbol{\mu}}$ measures whether $\boldsymbol{\mu}$ violates the stationarity constraint in (3) and can be viewed as the dual of $\mathbf{a}_{\mathbf{v}}$. When the state or action space is too large, one can resort to function approximators to represent $\mathbf{v}$ and $\boldsymbol{\mu}$, which are often realized by linear basis functions for the sake of analysis [6].
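Because the Lagrangian in (7) is affine in each argument, the gradient identities involving the advantage function (6) and the balance function (11) can be verified exactly with finite differences. The following sketch assumes a small random tabular MDP; the function names `advantage`, `balance`, and `lagrangian` are hypothetical labels for this example.

```python
import numpy as np

nS, nA, gamma = 3, 2, 0.8
rng = np.random.default_rng(2)

# Hypothetical random tabular MDP; the pair (s, a) is stored at row s * nA + a.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
p = rng.random(nS)
p /= p.sum()
E = np.kron(np.eye(nS), np.ones((nA, 1)))

def advantage(v):
    """a_v as in (6)."""
    return r + (gamma * P - E) @ v / (1 - gamma)

def balance(mu):
    """b_mu as in (11)."""
    return p + (gamma * P - E).T @ mu / (1 - gamma)

def lagrangian(v, mu):
    """L(v, mu) as in (7)."""
    return p @ v + mu @ advantage(v)

v, dv = rng.random(nS), rng.standard_normal(nS)
mu, dmu = rng.dirichlet(np.ones(nS * nA)), rng.standard_normal(nS * nA)

# L is affine in v and in mu, so these differences equal the directional derivatives.
change_in_v = lagrangian(v + dv, mu) - lagrangian(v, mu)
change_in_mu = lagrangian(v, mu + dmu) - lagrangian(v, mu)
print(change_in_v, balance(mu) @ dv)
print(change_in_mu, advantage(v) @ dmu)
```

The same functions also give the equivalent form $\mathscr{L}(\mathbf{v},\boldsymbol{\mu})=\mathbf{b}_{\boldsymbol{\mu}}^{\top}\mathbf{v}+\boldsymbol{\mu}^{\top}\mathbf{r}$, which follows by regrouping the terms of (7).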

### 2.3 COL and EPs

Finally, we review the COL setup in [9], which we will use to design the reduction from the saddle-point problem in (8) to online learning in the next section.

Recall that an online learning problem describes the iterative interactions between a learner and an opponent. In round $n$, the learner chooses a decision $x_n$ from a decision set $\mathcal{X}$, the opponent chooses a per-round loss function $l_n:\mathcal{X}\to\mathbb{R}$ based on the learner’s decisions, and then information about $l_n$ (e.g. its gradient $\nabla l_n(x_n)$) is revealed to the learner. The performance of the learner is usually measured in terms of regret with respect to some $x'\in\mathcal{X}$,

$$\mathrm{Regret}_N(x')\coloneqq\sum_{n=1}^{N}l_n(x_n)-\sum_{n=1}^{N}l_n(x').$$

When $l_n$ is convex and $\mathcal{X}$ is compact and convex, many no-regret (i.e. $\mathrm{Regret}_N(x')=o(N)$) algorithms are available, such as mirror descent and follow-the-regularized-leader [5, 33, 15].

COL is a subclass of online learning problems where the loss sequence changes continuously with respect to the played decisions of the learner [9]. In COL, the opponent is equipped with a bifunction $f:(x,x')\mapsto f_x(x')$, where for any fixed $x'\in\mathcal{X}$, $\nabla f_x(x')$ is continuous in $x\in\mathcal{X}$. The opponent selects per-round losses based on $f$, but the learner does not know $f$: in round $n$, if the learner chooses $x_n$, the opponent sets

$$l_n(x)=f_{x_n}(x), \tag{12}$$

and returns, e.g., a stochastic estimate of $\nabla l_n(x_n)$ (the regret is still measured in terms of the noise-free $l_n$).

In [9], a natural connection is shown between COL and equilibrium problems (EPs). As EPs include the saddle-point problem of interest, we can use this idea to turn (8) into a COL problem. Recall an EP is defined as follows: Let $\mathcal{X}$ be compact and $F:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$ be a bifunction s.t. $\forall x,x'\in\mathcal{X}$, $F(\cdot,x')$ is continuous, $F(x,\cdot)$ is convex, and $F(x,x)\ge 0$ (we restrict ourselves to this convex and continuous case as it is sufficient for our problem setup). The problem aims to find $x^{\star}\in\mathcal{X}$ s.t.

$$F(x^{\star},x)\ge 0,\quad\forall x\in\mathcal{X}. \tag{13}$$

By its definition, a natural residual function to quantify the quality of an approximate solution $x$ to the EP is $r_{ep}(x)\coloneqq-\min_{x'\in\mathcal{X}}F(x,x')$, which describes the degree to which (13) is violated at $x$. We say a bifunction $F$ is monotone if, $\forall x,x'\in\mathcal{X}$, $F(x,x')+F(x',x)\le 0$, and skew-symmetric if the equality holds.

EPs with monotone bifunctions represent general convex problems, including convex optimization problems, saddle-point problems, variational inequalities, etc. For instance, a convex-concave problem $\min_{y\in\mathcal{Y}}\max_{z\in\mathcal{Z}}\phi(y,z)$ can be cast as an EP with $\mathcal{X}=\mathcal{Y}\times\mathcal{Z}$ and the skew-symmetric bifunction [18]

$$F(x,x')\coloneqq\phi(y',z)-\phi(y,z'), \tag{14}$$

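where $x=(y,z)$ and $x'=(y',z')$. In this case, $r_{ep}(x)=\max_{z'\in\mathcal{Z}}\phi(y,z')-\min_{y'\in\mathcal{Y}}\phi(y',z)$ is the duality gap.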

Cheng et al. [9] show that a learner achieves sublinear dynamic regret in COL if and only if the same algorithm can solve with . Concretely, they show that, given a monotone with (which is satisfied by (14)), one can construct a COL problem by setting , i.e. , such that any no-regret algorithm can generate an approximate solution to the EP.

###### Proposition 1.

[9] If is skew-symmetric and , then , where , and ; the same guarantee holds also for the best decision in .

## 3 An Online Learning View

We present an alternate online-learning perspective on the saddle-point formulation in (8). This analysis paves the way for our reduction in the next section. By reduction, we mean realizing the two steps below:

1. Define a sequence of online losses such that any algorithm with sublinear regret can produce an approximate solution to the saddle-point problem.

2. Convert the approximate solution in the first step to an approximately optimal policy in RL.

Methods to achieve these two steps individually are not new. The reduction from convex-concave problems to no-regret online learning is well known [1]. Likewise, the relationship between the approximate solution of (8) and policy performance is also available; this is how the saddle-point formulation [36] works in the first place. So couldn’t we just use these existing approaches? We argue that purely combining these two techniques fails to fully capture important structure that resides in RL. While this will be made precise in the later analyses, we highlight the main insights here.

Instead of treating (8) as an adversarial two-player online learning problem [1], we adopt the recent reduction to COL [9] reviewed in Section 2.3. The main difference is that the COL approach takes a single-player setup and retains the Lipschitz continuity in the source saddle-point problem. This single-player perspective is in some sense cleaner and, as we will show in Section 4.2, provides a simple setup to analyze effects of function approximators. Additionally, due to continuity, the losses in COL are predictable and therefore make designing fast algorithms possible.

With the help of the COL reformulation, we study the relationship between the approximate solution to (8) and the performance of the associated policy in RL. We are able to establish a tight bound between the residual and the performance gap, resulting in a large improvement of in sample complexity compared with the best bounds in the literature of the saddle-point setup, without adding extra constraints on and assumptions on the MDP. Overall, this means that stronger sample complexity guarantees can be attained by simpler algorithms, as we demonstrate in Section 5.

The missing proofs of this section are in Appendix B.

### 3.1 The COL Formulation of RL

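First, let us exercise the above COL idea with the saddle-point formulation of RL in (8). To construct the EP, we can let $\mathcal{X}=\{x=(\mathbf{v},\boldsymbol{\mu}):\mathbf{v}\in\mathcal{V},\boldsymbol{\mu}\in\mathcal{M}\}$, which is compact. According to (14), the bifunction $F$ of the associated EP is naturally given as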

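$$\begin{aligned}F(x,x')&\coloneqq\mathscr{L}(\mathbf{v}',\boldsymbol{\mu})-\mathscr{L}(\mathbf{v},\boldsymbol{\mu}')\\&=\mathbf{p}^{\top}\mathbf{v}'+\boldsymbol{\mu}^{\top}\mathbf{a}_{\mathbf{v}'}-\mathbf{p}^{\top}\mathbf{v}-\boldsymbol{\mu}'^{\top}\mathbf{a}_{\mathbf{v}}\end{aligned} \tag{15}$$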

which is skew-symmetric, and $x^*\coloneqq(\mathbf{v}^*,\boldsymbol{\mu}^*)$ is a solution to the EP. This identification gives us a COL problem with the loss in the $n$th round defined as

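$$l_n(x)\coloneqq\mathbf{p}^{\top}\mathbf{v}+\boldsymbol{\mu}_n^{\top}\mathbf{a}_{\mathbf{v}}-\mathbf{p}^{\top}\mathbf{v}_n-\boldsymbol{\mu}^{\top}\mathbf{a}_{\mathbf{v}_n} \tag{16}$$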

where $x_n=(\mathbf{v}_n,\boldsymbol{\mu}_n)$. We see $l_n$ is a linear loss. Moreover, because of the continuity in $\mathscr{L}$, it is predictable, i.e. $l_n$ can be (partially) inferred from past feedback as the MDP involved in each round is the same.
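The construction of the per-round loss from the bifunction can be sketched in a few lines. The example below assumes a small random tabular MDP and uses hypothetical helper names (`lagrangian`, `F`, `col_loss`); it checks skew-symmetry of (15), that the loss in (16) vanishes at the played decision, and that the loss is affine in the decision.

```python
import numpy as np

nS, nA, gamma = 3, 2, 0.8
rng = np.random.default_rng(3)

# Hypothetical random tabular MDP; the pair (s, a) is stored at row s * nA + a.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
p = rng.random(nS)
p /= p.sum()
E = np.kron(np.eye(nS), np.ones((nA, 1)))

def lagrangian(v, mu):
    """L(v, mu) as in (7)."""
    return p @ v + mu @ (r + (gamma * P - E) @ v / (1 - gamma))

def F(x, x_prime):
    """Bifunction (15): F(x, x') = L(v', mu) - L(v, mu')."""
    (v, mu), (v_p, mu_p) = x, x_prime
    return lagrangian(v_p, mu) - lagrangian(v, mu_p)

def col_loss(x_n):
    """Per-round loss (16): l_n(x) = F(x_n, x)."""
    return lambda x: F(x_n, x)

def random_decision():
    """A random point x = (v, mu) with v in [0, 1]^|S| and mu on the simplex."""
    return rng.random(nS), rng.dirichlet(np.ones(nS * nA))

x_n, x, y = random_decision(), random_decision(), random_decision()
l_n = col_loss(x_n)
midpoint = tuple(0.5 * (a + b) for a, b in zip(x, y))

loss_at_played = l_n(x_n)
skew_residual = F(x, y) + F(y, x)
affine_residual = l_n(midpoint) - 0.5 * (l_n(x) + l_n(y))
print(loss_at_played, skew_residual, affine_residual)
```

Since $l_n(x_n)=0$ in every round, the regret against a comparator $x'$ reduces to $-\sum_{n}l_n(x')$.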

### 3.2 Policy Performance and Residual

By Proposition 1, any no-regret algorithm, when applied to (16), provides guarantees in terms of the residual function $r_{ep}(x)$ of the EP. But this is not the end of the story. We also need to relate the learner decision $x\in\mathcal{X}$ to a policy $\pi$ in RL and then convert bounds on $r_{ep}(x)$ back to the policy performance $V^{\pi}(p)$. Here we follow the common rule in the literature and associate each $x=(\mathbf{v},\boldsymbol{\mu})\in\mathcal{X}$ with a policy $\pi_{\boldsymbol{\mu}}$ defined as

$$\pi_{\boldsymbol{\mu}}(a|s)\propto\mu(s,a). \tag{17}$$
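A minimal sketch of the normalization in (17) for the tabular case follows. The function name `policy_from_mu` is hypothetical, and the uniform fallback for states that receive zero mass is an assumption of this example, since (17) leaves the policy undefined there.

```python
import numpy as np

def policy_from_mu(mu, n_states, n_actions):
    """Return pi[s, a] proportional to mu(s, a), as in (17).

    mu is a nonnegative vector with the pair (s, a) stored at s * n_actions + a.
    States with zero total mass get a uniform policy (an assumption of this sketch).
    """
    m = mu.reshape(n_states, n_actions)
    row = m.sum(axis=1, keepdims=True)
    uniform = np.full((n_states, n_actions), 1.0 / n_actions)
    # Divide by 1 where the row is empty to avoid 0 / 0, then overwrite with uniform.
    return np.where(row > 0, m / np.where(row > 0, row, 1.0), uniform)

example_policy = policy_from_mu(np.array([0.1, 0.3, 0.0, 0.0]), 2, 2)
print(example_policy)
```

Note that the policy depends only on the conditional shape of $\boldsymbol{\mu}$ within each state, not on its state marginal or its total mass.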

In the following, we relate the residual $r_{ep}(x)$ to the performance gap $V^*(p)-V^{\pi_{\boldsymbol{\mu}}}(p)$ through a relative performance measure defined as

$$r_{ep}(x;x')\coloneqq F(x,x)-F(x,x')=-F(x,x') \tag{18}$$

for $x,x'\in\mathcal{X}$, where the last equality follows from the skew-symmetry of $F$ in (15). Intuitively, we can view $r_{ep}(x;x')$ as comparing the performance of $x$ with respect to the comparator $x'$ under an optimization problem proposed by $x$, e.g. we have $r_{ep}(x;x)=0$. And by the definition in (18), it holds that $r_{ep}(x;x')\le\max_{x'\in\mathcal{X}}-F(x,x')=r_{ep}(x)$.

We are looking for inequalities in the form that hold for all with some strictly increasing function and some , so we can get non-asymptotic performance guarantees once we combine the two steps described at the beginning of this section. For example, by directly applying results of [9] to the COL in (16), we get , where is the policy associated with the average/best decision in .

#### 3.2.1 The Classic Result

Existing approaches (e.g. [7, 36, 24]) to the saddle-point formulation in (8) rely on the relative residual $r_{ep}(x;x^*)$ with respect to the optimal solution $x^*$ to the problem, which we restate in our notation.

###### Proposition 2.

For any , if , .

Therefore, although the original saddle-point problem in (8) is framed using $\mathcal{V}$ and $\mathcal{M}$, in practice, an extra constraint, such as $\mathbf{E}^{\top}\boldsymbol{\mu}\ge(1-\gamma)\mathbf{p}$, is added into $\mathcal{M}$, i.e. these algorithms consider instead

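$$\mathcal{M}'=\{\boldsymbol{\mu}\in\mathbb{R}^{|\mathcal{S}||\mathcal{A}|}:\boldsymbol{\mu}\in\mathcal{M},\ \mathbf{E}^{\top}\boldsymbol{\mu}\ge(1-\gamma)\mathbf{p}\}, \tag{19}$$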

so that the marginal of the estimate can have the sufficient coverage required in Proposition 2. This condition is needed to establish non-asymptotic guarantees on the performance of the policy generated by  [34, 36, 24], but it can sometimes be impractical to realize, e.g., when is unknown. Without it, extra assumptions (like ergodicity [36]) on the MDP are needed.

However, Proposition 2 is undesirable for a number of reasons. First, the bound is quite conservative, as it concerns the uniform error whereas the objective in RL is about the gap with respect to the initial distribution (i.e. a weighted error). Second, the constant term can be quite small (e.g. when is uniform, it is ) which can significantly amplify the error in the residual. Because a no-regret algorithm typically decreases the residual in after seeing samples, the factor of earlier would turn into a multiplier of in sample complexity. This makes existing saddle-point approaches sample inefficient in comparison with other RL methods like Q-learning [21]. Lastly, enforcing requires knowing (which is unavailable in our setup) and adds extra projection steps during optimization. When is unknown, while it is possible to modify this constraint to use a uniform distribution, this might worsen the constant factor and could introduce bias.

One may conjecture that the bound in Proposition 2 could perhaps be tightened by better analyses. However, we prove this is impossible in general.

###### Proposition 3.

There is a class of MDPs such that, for some , Proposition 2 is an equality.

We note that Proposition 3 does not hold for all MDPs. Indeed, if one makes stronger assumptions on the MDP, such as that the Markov chain induced by every policy is ergodic [36], then it is possible to show, for all , for some constant independent of and , when one constrains . Nonetheless, this construct still requires adding an undesirable constraint to .

#### 3.2.2 Curse of Covariate Shift

Why does this happen? We can view this issue as a form of covariate shift, i.e. a mismatch between distributions. To better understand it, we notice a simple equality, which has often been used implicitly, e.g. in the technical proofs of [36].

###### Lemma 1.

For any , if satisfies (2) and (5) (i.e. and are the value function and state-action distribution of policy ), .

Lemma 1 implies , which is non-negative. This term is similar to an equality called the performance difference lemma [29, 20].

###### Lemma 2.

Let and denote the value and state-action distribution of some policy . Then for any function , it holds that . In particular, it implies .

From Lemmas 1 and 2, we see that the difference between the residual and the performance gap is due to the mismatch between and , or more specifically, the mismatch between the two marginals and . Indeed, when , the residual is equal to the performance gap. However, in general, we do not have control over that difference for the sequence of variables an algorithm generates. The sufficient condition in Proposition 2 attempts to mitigate the difference, using the fact from (2), where is the transition matrix under . But the missing half (due to the long-term effects in the MDP) introduces the unavoidable, weak constant , if we want to have a uniform bound on . The counterexample in Proposition 3 was designed to maximize the effect of covariate shift, so that fails to capture state-action pairs with high advantage. To break the curse, we must properly weight the gap between and instead of relying on the uniform bound on as before.

## 4 The Reduction

The analyses above reveal both good and bad properties of the saddle-point setup in (8). On the one hand, we showed that approximate solutions to the saddle-point problem in (8) can be obtained by running any no-regret algorithm in the single-player COL problem defined in (16); many efficient algorithms are available from the online learning literature. On the other hand, we also discovered a root difficulty in converting an approximate solution of (8) to an approximately optimal policy in RL (Proposition 2), even after imposing strong conditions like (19). At this point, one may wonder if the formulation based on (8) is fundamentally sample inefficient compared with other approaches to RL, but this is actually not true.

Our main contribution shows that learning a policy through running a no-regret algorithm in the COL problem in (16) is, in fact, as sample efficient in policy performance as other RL techniques, even without the common constraint in (19) or extra assumptions on the MDP like ergodicity imposed in the literature.

###### Theorem 1.

Let be any sequence. Let be the policy given by via (17), which is either the average or the best decision in . Define . Then .

Theorem 1 shows that if has sublinear regret, then both the average policy and the best policy in converge to the optimal policy in performance with a rate . Compared with existing results obtained through Proposition 2, the above result removes the factor and imposes no assumption on or the MDP. Indeed, Theorem 1 holds for any sequence. For example, when is generated by stochastic feedback of , Theorem 1 continues to hold, as the regret is defined in terms of , not of the sampled loss. Stochasticity only affects the regret rate.

In other words, we have shown that when and can be directly parameterized, an approximately optimal policy for the RL problem can be obtained by running any no-regret online learning algorithm, and that the policy quality is simply dictated by the regret rate. To illustrate, in Section 5 we will prove that simply running mirror descent in this COL produces an RL algorithm that is as sample efficient as other common RL techniques. One can further foresee that algorithms leveraging the continuity in COL—e.g. mirror-prox [19] or PicCoLO [10]—and variance reduction can lead to more sample efficient RL algorithms.

Below we will also demonstrate how to use the fact that COL is single-player (see Section 2.3) to cleanly incorporate the effects of using function approximators to model and . We will present a corollary of Theorem 1, which separates the problem of learning and , and that of approximating and with function approximators. The first part is controlled by the rate of regret in online learning, and the second part depends on only the chosen class of function approximators, independently of the learning process. As these properties are agnostic to problem setups and algorithms, our reduction leads to a framework for systematic synthesis of new RL algorithms with performance guarantees. The missing proofs of this section are in Appendix C.

### 4.1 Proof of Theorem 1

The main insight of our reduction is to adopt, in defining , a comparator based on the output of the algorithm (represented by ), instead of the fixed comparator (the optimal pair of value function and state-action distribution) that has been used conventionally, e.g. in Proposition 2. While this idea seems unnatural from the standard saddle-point or EP perspective, it is possible, because the regret in online learning is measured against the worst-case choice in , which is allowed to be selected in hindsight. Specifically, we propose to select the following comparator to directly bound instead of the conservative measure used before.

###### Proposition 4.

For , define . It holds .

To finish the proof, let be either or , and let denote the policy given by (17). First, by Proposition 4. Next we follow the proof idea of Proposition 1 in [9]: because is skew-symmetric and is convex, we have by (18)

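$$\begin{aligned}V^*(p)-V^{\hat{\pi}_N}(p)&=r_{ep}(\hat{x}_N;y_N^*)=-F(\hat{x}_N,y_N^*)\\&=F(y_N^*,\hat{x}_N)\le\frac{1}{N}\sum_{n=1}^{N}F(y_N^*,x_n)\\&=\frac{1}{N}\sum_{n=1}^{N}-F(x_n,y_N^*)=\frac{1}{N}\mathrm{Regret}_N(y_N^*).\end{aligned}$$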
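The chain of relations above can be checked numerically for the average decision. The sketch below assumes the comparator $y_N^*=(\mathbf{v}^{\hat{\pi}_N},\boldsymbol{\mu}^*)$, i.e. the value function of the output policy paired with the optimal state-action distribution; this specific form is an assumption of the example, as are the random MDP, the arbitrary decision sequence, and all names. For the bilinear $F$ in (15) the middle inequality holds with equality, so the performance gap should match $\mathrm{Regret}_N(y_N^*)/N$ up to rounding.

```python
import numpy as np

nS, nA, gamma, N = 4, 3, 0.9, 50
rng = np.random.default_rng(4)

# Hypothetical random tabular MDP; the pair (s, a) is stored at row s * nA + a.
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
p = rng.random(nS)
p /= p.sum()
E = np.kron(np.eye(nS), np.ones((nA, 1)))

def lagrangian(v, mu):
    return p @ v + mu @ (r + (gamma * P - E) @ v / (1 - gamma))

def F(x, y):
    """Bifunction (15) with x = (v, mu) and y = (v', mu')."""
    return lagrangian(y[0], x[1]) - lagrangian(x[0], y[1])

def evaluate(pi):
    """Return the value function and state-action distribution of policy pi[s, a]."""
    P_pi = np.einsum("sa,sat->st", pi, P.reshape(nS, nA, nS))
    r_pi = (pi * r.reshape(nS, nA)).sum(axis=1)
    v = np.linalg.solve(np.eye(nS) - gamma * P_pi, (1 - gamma) * r_pi)
    d = np.linalg.solve(np.eye(nS) - gamma * P_pi.T, (1 - gamma) * p)
    return v, (d[:, None] * pi).reshape(-1)

# Optimal value function and policy via value iteration.
v_star = np.zeros(nS)
for _ in range(2000):
    v_star = ((1 - gamma) * r + gamma * P @ v_star).reshape(nS, nA).max(axis=1)
greedy = ((1 - gamma) * r + gamma * P @ v_star).reshape(nS, nA).argmax(axis=1)
_, mu_star = evaluate(np.eye(nA)[greedy])

# An arbitrary sequence of decisions x_n = (v_n, mu_n) and its average.
xs = [(rng.random(nS), rng.dirichlet(np.ones(nS * nA))) for _ in range(N)]
mu_hat = np.mean([x[1] for x in xs], axis=0)

# Policy of the average decision via (17), and the assumed comparator.
m = mu_hat.reshape(nS, nA)
v_pi_hat, _ = evaluate(m / m.sum(axis=1, keepdims=True))
y_star = (v_pi_hat, mu_star)

gap = p @ v_star - p @ v_pi_hat
regret = sum(F(x, x) - F(x, y_star) for x in xs)  # sum_n l_n(x_n) - sum_n l_n(y*)
print("performance gap:", gap, " Regret_N / N:", regret / N)
```

The check uses no property of how the sequence was generated, which mirrors the statement that the bound holds for any sequence of decisions.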

### 4.2 Function Approximators

When the state and action spaces are large or continuous, directly optimizing and can be impractical. Instead we can consider optimizing over a subset of feasible choices parameterized by function approximators

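$$\mathcal{X}_{\Theta}=\{x_{\theta}=(\boldsymbol{\phi}_{\theta},\boldsymbol{\psi}_{\theta}):\boldsymbol{\psi}_{\theta}\in\mathcal{M},\ \theta\in\Theta\}, \tag{20}$$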

where and are functions parameterized by , and is a parameter set. Because COL is a single-player setup, we can extend the previous idea and Theorem 1 to provide performance bounds in this case by a simple rearrangement (see Appendix C), which is a common trick used in the online imitation learning literature [32, 8, 11]. Notice that, in (20), we require only , but not , because for the performance bound in our reduction to hold, we only need the constraint (see Lemma 4 in proof of Proposition 4).

###### Corollary 1.

Let be any sequence. Let be the policy given either by the average or the best decision in . It holds that

$$V^{\hat{\pi}_N}(p)\ge V^*(p)-\frac{\mathrm{Regret}_N(\Theta)}{N}-\epsilon_{\Theta,N}$$

where measures the expressiveness of , and .

We can quantify with the basic Hölder’s inequality.

###### Proposition 5.

Let . Under the setup in Corollary 1, regardless of the parameterization, it is true that is no larger than

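$$\begin{aligned}&\min_{(\mathbf{v}_{\theta},\boldsymbol{\mu}_{\theta})\in\mathcal{X}_{\Theta}}\frac{\|\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}^*\|_{1}}{1-\gamma}+\min_{\mathbf{w}:\mathbf{w}\ge 1}\|\mathbf{b}_{\hat{\boldsymbol{\mu}}_N}\|_{1,\mathbf{w}}\,\|\mathbf{v}_{\theta}-\mathbf{v}^{\hat{\pi}_N}\|_{\infty,1/\mathbf{w}}\\&\qquad\le\min_{(\mathbf{v}_{\theta},\boldsymbol{\mu}_{\theta})\in\mathcal{X}_{\Theta}}\frac{1}{1-\gamma}\left(\|\boldsymbol{\mu}_{\theta}-\boldsymbol{\mu}^*\|_{1}+2\|\mathbf{v}_{\theta}-\mathbf{v}^{\hat{\pi}_N}\|_{\infty}\right).\end{aligned}$$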

where the norms are defined as and .

Proposition 5 says depends on how well captures the value function of the output policy and the optimal state-action distribution . We remark that this result is independent of how is generated. Furthermore, Proposition 5 makes no assumption whatsoever on the structure of function approximators. It even allows sharing parameters between and , e.g., they can be a bi-headed neural network, which is common for learning shared feature representations. More precisely, the structure of the function approximator would only affect whether remains a convex function in , which determines the difficulty of designing algorithms with sublinear regret.

In other words, the proposed COL formulation provides a reduction which dictates the policy performance with two separate factors: 1) the rate of regret which is controlled by the choice of online learning algorithm; 2) the approximation error which is determined by the choice of function approximators. These two factors can almost be treated independently, except that the choice of function approximators would determine the properties of as a function of , and the choice of needs to ensure (20) is admissible.

## 5 Sample Complexity of Mirror Descent

We demonstrate the power of our reduction by applying perhaps the simplest online learning algorithm, mirror descent, to the proposed COL problem in (16) with stochastic feedback (Algorithm 1). For transparency, we discuss the tabular setup. We will show a natural extension to basis functions at the end.

Recall that mirror descent is a first-order algorithm, whose update rule can be written as

$$x_{n+1}=\operatorname*{arg\,min}_{x\in\mathcal{X}}\ \langle g_n,x\rangle+\frac{1}{\eta}B_R(x\|x_n) \tag{21}$$

where $\eta>0$ is the step size, $g_n$ is the feedback direction, and $B_R(x\|x')$ is the Bregman divergence generated by a strictly convex function $R$. Based on the geometry of $\mathcal{X}=\mathcal{V}\times\mathcal{M}$, we consider a natural Bregman divergence of the form

$$B_R(x'\|x)=\frac{1}{2|\mathcal{S}|}\|\mathbf{v}'-\mathbf{v}\|_{2}^{2}+KL(\boldsymbol{\mu}'\|\boldsymbol{\mu}) \tag{22}$$

This choice mitigates the effects of dimension (e.g. if we set with being the uniform distribution, it holds for any ).
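With the Bregman divergence in (22), the update (21) separates into a projected gradient step on $\mathbf{v}$ and a multiplicative-weights step on $\boldsymbol{\mu}$. The sketch below is one way to write that closed form, assuming that $\boldsymbol{\mu}$ is kept on the probability simplex (i.e. $\|\boldsymbol{\mu}\|_1=1$) and that $\mathbf{v}$ lives in $[0,1]^{|\mathcal{S}|}$; the function names and the random test data are hypothetical.

```python
import numpy as np

def mirror_descent_step(v, mu, g_v, g_mu, eta):
    """One step of (21) under the divergence (22), with mu kept on the simplex."""
    n_states = v.shape[0]
    # Quadratic part: gradient step scaled by |S|, then projection onto the box [0, 1].
    v_new = np.clip(v - eta * n_states * g_v, 0.0, 1.0)
    # KL part: exponentiated-gradient update; subtracting the min is for stability only.
    w = mu * np.exp(-eta * (g_mu - g_mu.min()))
    return v_new, w / w.sum()

def objective(v, mu, v_n, mu_n, g_v, g_mu, eta):
    """<g, x> + B_R(x || x_n) / eta, with B_R as in (22); assumes mu, mu_n > 0."""
    kl = np.sum(mu * np.log(mu / mu_n))
    bregman = np.sum((v - v_n) ** 2) / (2 * v_n.shape[0]) + kl
    return g_v @ v + g_mu @ mu + bregman / eta

rng = np.random.default_rng(5)
v_n, mu_n = rng.random(3), rng.dirichlet(np.ones(6))
g_v, g_mu = rng.standard_normal(3), rng.random(6)
eta = 0.1

v_next, mu_next = mirror_descent_step(v_n, mu_n, g_v, g_mu, eta)
best = objective(v_next, mu_next, v_n, mu_n, g_v, g_mu, eta)

# Compare against random feasible points: none should achieve a lower objective.
others = [
    objective(rng.random(3), rng.dirichlet(np.ones(6)), v_n, mu_n, g_v, g_mu, eta)
    for _ in range(1000)
]
print("objective at update:", best, " best random candidate:", min(others))
```

The $|\mathcal{S}|$ factor in the step on $\mathbf{v}$ comes directly from the $\frac{1}{2|\mathcal{S}|}$ scaling of the quadratic term in (22).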

To define the feedback direction $g_n$, we slightly modify the per-round loss $l_n$ in (16) and consider a new loss

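$$h_n(x)\coloneqq\mathbf{b}_{\boldsymbol{\mu}_n}^{\top}\mathbf{v}+\boldsymbol{\mu}^{\top}\left(\frac{1}{1-\gamma}\mathbf{1}-\mathbf{a}_{\mathbf{v}_n}\right) \tag{23}$$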

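that shifts $l_n$ by a constant, where $\mathbf{1}$ is the vector of ones. One can verify that $l_n(x)-l_n(x')=h_n(x)-h_n(x')$, for all $x,x'\in\mathcal{X}$ when $\boldsymbol{\mu},\boldsymbol{\mu}'$ in $x$ and $x'$ satisfy $\|\boldsymbol{\mu}\|_1=\|\boldsymbol{\mu}'\|_1$ (which holds for Algorithm 1). Therefore, using $h_n$ does not change regret. The reason for using $h_n$ instead of $l_n$ is to make $\nabla_{\boldsymbol{\mu}}h_n(x)$ (and its unbiased approximation) a positive vector, so the regret bound can have a better dimension dependency. This is a common trick used in online learning (e.g. EXP3 [16]) for optimizing variables living in a simplex ($\boldsymbol{\mu}$ here).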

We set the first-order feedback $g_n$ as an unbiased sampled estimate of $\nabla h_n(x_n)$. In round $n$, this is realized by two independent calls of the generative model:

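$$g_n=\begin{bmatrix}\tilde{\mathbf{p}}_n+\frac{1}{1-\gamma}(\gamma\tilde{\mathbf{P}}_n-\mathbf{E}_n)^{\top}\tilde{\boldsymbol{\mu}}_n\\[4pt]|\mathcal{S}||\mathcal{A}|\left(\frac{1}{1-\gamma}\hat{\mathbf{1}}_n-\hat{\mathbf{r}}_n-\frac{1}{1-\gamma}(\gamma\hat{\mathbf{P}}_n-\hat{\mathbf{E}}_n)\mathbf{v}_n\right)\end{bmatrix} \tag{24}$$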

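Let $g_n=[g_{n,v};g_{n,\mu}]$. For $g_{n,v}$, we sample $\mathbf{p}$, sample $\boldsymbol{\mu}_n$ to get a state-action pair, and query the transition $\mathbf{P}$ at the state-action pair sampled from $\boldsymbol{\mu}_n$. ($\tilde{\mathbf{p}}_n$, $\tilde{\mathbf{P}}_n$, and $\tilde{\boldsymbol{\mu}}_n$ denote the single-sample estimate of these probabilities.) For $g_{n,\mu}$, we first sample uniformly a state-action pair (which explains the factor $|\mathcal{S}||\mathcal{A}|$), and then query the reward $\mathbf{r}$ and the transition $\mathbf{P}$. ($\hat{\mathbf{1}}_n$, $\hat{\mathbf{r}}_n$, $\hat{\mathbf{P}}_n$, and $\hat{\mathbf{E}}_n$ denote the single-sample estimates.) To emphasize, we use a tilde and a hat to distinguish the empirical quantities obtained by these two independent queries. By construction, we have $g_{n,\mu}\ge 0$. It is clear that this direction is unbiased, i.e. $\mathbb{E}[g_n]=\nabla h_n(x_n)$. Moreover, it is extremely sparse and can be computed using $O(1)$ sample, computational, and memory complexities.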
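The sampling scheme for (24) can be sketched for a tabular MDP as follows. This is an illustrative reading rather than the authors' implementation: the function name `sample_gradient`, the array-based stand-in for the generative model, deterministic rewards, and the state-major indexing of state-action pairs are all assumptions. The example also compares a Monte Carlo average against the exact gradient of $h_n$ in (23).

```python
import numpy as np

def sample_gradient(v, mu, P, r, p, gamma, rng):
    """Single-sample estimate of the gradient of h_n at x_n = (v, mu), following (24)."""
    n_states, n_pairs = p.shape[0], r.shape[0]
    n_actions = n_pairs // n_states

    # First block: s0 ~ p, (s, a) ~ mu, s' ~ P(. | s, a).
    s0 = rng.choice(n_states, p=p)
    sa = rng.choice(n_pairs, p=mu)
    s, s_next = sa // n_actions, rng.choice(n_states, p=P[sa])
    g_v = np.zeros(n_states)
    g_v[s0] += 1.0
    g_v[s_next] += gamma / (1 - gamma)
    g_v[s] -= 1.0 / (1 - gamma)

    # Second block: a uniformly sampled pair, reweighted by |S||A|.
    sa_u = rng.integers(n_pairs)
    s_u, s_u_next = sa_u // n_actions, rng.choice(n_states, p=P[sa_u])
    g_mu = np.zeros(n_pairs)
    g_mu[sa_u] = n_pairs * (
        1.0 / (1 - gamma) - r[sa_u] - (gamma * v[s_u_next] - v[s_u]) / (1 - gamma)
    )
    return g_v, g_mu

nS, nA, gamma = 2, 2, 0.5
rng = np.random.default_rng(6)
P = rng.random((nS * nA, nS))
P /= P.sum(axis=1, keepdims=True)
r = rng.random(nS * nA)
p = rng.random(nS)
p /= p.sum()
E = np.kron(np.eye(nS), np.ones((nA, 1)))
v, mu = rng.random(nS), rng.dirichlet(np.ones(nS * nA))

# Exact gradient of h_n: b_mu for the v block, 1 / (1 - gamma) - a_v for the mu block.
exact_v = p + (gamma * P - E).T @ mu / (1 - gamma)
exact_mu = 1.0 / (1 - gamma) - (r + (gamma * P - E) @ v / (1 - gamma))

samples = [sample_gradient(v, mu, P, r, p, gamma, rng) for _ in range(20000)]
mean_g_v = np.mean([s[0] for s in samples], axis=0)
mean_g_mu = np.mean([s[1] for s in samples], axis=0)
print("max error, v block:", np.abs(mean_g_v - exact_v).max())
print("max error, mu block:", np.abs(mean_g_mu - exact_mu).max())
```

Each estimate touches at most three entries of the first block and one entry of the second, which is what makes the per-round cost independent of the problem size in this sketch.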

Below we show this algorithm, despite being extremely simple, has strong theoretical guarantees. In other words, we obtain simpler versions of the algorithms proposed in [34, 36, 6] but with improved performance.

###### Theorem 2.

With probability , Algorithm 1 learns an -optimal policy with samples.

Note that the above statement makes no assumption on the MDP (except the tabular setup for simplifying analysis). Also, because the definition of value function in (1) is scaled by a factor , the above result translates into a sample complexity in for the conventional discounted accumulated rewards.

### 5.1 Proof Sketch of Theorem 2

The proof is based on the basic property of mirror descent and martingale concentration. We provide a sketch here; please refer to Appendix D for details. Let . We bound the regret in Theorem 1 by the following rearrangement, where the first equality below is because is a constant shift from .

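$$\begin{aligned}\mathrm{Regret}_N(y_N^*)&=\sum_{n=1}^{N}h_n(x_n)-\sum_{n=1}^{N}h_n(y_N^*)\\&\le\left(\sum_{n=1}^{N}(\nabla h_n(x_n)-g_n)^{\top}x_n\right)+\left(\max_{x\in\mathcal{X}}\sum_{n=1}^{N}g_n^{\top}(x_n-x)\right)\\&\quad+\left(\sum_{n=1}^{N}(g_n-\nabla h_n(x_n))^{\top}y_N^*\right)\end{aligned}$$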

We recognize the first term is a martingale, because does not depend on . Therefore, we can appeal to a Bernstein-type martingale concentration and prove it is in . For the second term, by treating as the per-round loss, we can use standard regret analysis of mirror descent and show a bound in . For the third term, because in depends on , it is not a martingale. Nonetheless, we are able to handle it through a union bound and show it is again no more than . Despite the union bound, it does not increase the rate because we only need to handle , not which induces a martingale. To finish the proof, we substitute this high-probability regret bound into Theorem 1 to obtain the desired claim.

### 5.2 Extension to Function Approximators

The above algorithm assumes the tabular setup for illustration purposes. In Appendix E, we describe a direct extension of Algorithm 1 that uses linearly parameterized function approximators of the form , where columns of bases belong to and , respectively, and .

Overall the algorithm stays the same, except the gradient is computed by chain-rule, which can be done in