# Primal-Dual π Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems

Mengdi Wang
Department of Operations Research and Financial Engineering, Princeton University, Princeton, NJ
email: mengdiw@princeton.edu
###### Abstract

Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual $\pi$ Learning method in light of the linear duality between the value and policy. The $\pi$ learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For infinite-horizon undiscounted Markov decision processes with finite state space $\mathcal{S}$ and finite action space $\mathcal{A}$, the $\pi$ learning method finds an $\epsilon$-optimal policy using the following number of sample transitions

$$\tilde{O}\left(\frac{(\tau\cdot t^*_{mix})^2|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\right),$$

where $t^*_{mix}$ is an upper bound on the mixing times across all policies and $\tau$ is a parameter characterizing the range of stationary distributions across policies. The $\pi$ learning method also applies to the computational problem of MDP where the transition probabilities and rewards are explicitly given as the input. In the case where each state transition can be sampled in $\tilde{O}(1)$ time, the $\pi$ learning method gives a sublinear-time algorithm for solving the average-reward MDP.

Keywords: Markov decision process, reinforcement learning, sample complexity, run-time complexity, duality, primal-dual method, mixing time

## 1 Introduction

Consider the reinforcement learning problem in which a planner makes decisions in an unknown (sometimes stochastic) dynamic environment with the goal of maximizing the reward collected in this process. This can be modeled as a Markov decision process (MDP). MDP refers to a controlled random walk in which the planner chooses one from a number of actions at each state of the random walk and moves to another state according to some transition probability distribution. In the context of reinforcement learning, one wants to learn the optimal decision rule by using an algorithmic trial-and-error approach, without explicitly knowing the transition probabilities.

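We focus on the infinite-horizon Average-reward Markov Decision Problem (AMDP) in which one aims to make an infinite sequence of decisions and optimize the average-per-time-step reward. An instance of the AMDP can be described by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$, where $\mathcal{S}$ is a finite state space of size $|\mathcal{S}|$, $\mathcal{A}$ is a finite action space of size $|\mathcal{A}|$, $\mathcal{P}$ is the collection of state-to-state transition probabilities $\mathcal{P}=\{p_{ij}(a)\mid i,j\in\mathcal{S},a\in\mathcal{A}\}$, and $\mathbf{r}$ is the collection of state-transitional rewards $\mathbf{r}=\{r_{ij}(a)\mid i,j\in\mathcal{S},a\in\mathcal{A}\}$ where $r_{ij}(a)\in[0,1]$. We also denote by $r_a$ the vector of expected state-transition rewards under action $a$, where $r_{a,i}=\sum_{j\in\mathcal{S}}p_{ij}(a)r_{ij}(a)$. Suppose that the decision process is in state $i$. If action $a$ is selected, the process moves to a next state $j$ with probability $p_{ij}(a)$ and generates a reward $r_{ij}(a)$.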

We want to find a stationary policy that specifies which action to choose at each state (regardless of the time step). A stationary and randomized policy can be represented by a collection of probability distributions $\pi=\{\pi_i\}_{i\in\mathcal{S}}$, where $\pi_i=(\pi_{i,a})_{a\in\mathcal{A}}$ is a vector of probability distribution over actions. We denote by $P^\pi$ the transition probability matrix of the AMDP under a fixed policy $\pi$, where $P^\pi_{ij}=\sum_{a\in\mathcal{A}}\pi_{i,a}p_{ij}(a)$ for all $i,j\in\mathcal{S}$. The objective of the AMDP is to find an optimal policy $\pi^*$ such that the infinite-horizon average reward is maximized:

$$\max_{\pi}\ \lim_{T\to\infty}\mathbf{E}^{\pi}\left[\frac{1}{T}\sum_{t=1}^{T}r_{i_ti_{t+1}}(a_t)\right],$$

where $(i_1,a_1,i_2,a_2,\ldots)$ are state-action transitions generated by the Markov decision process under the fixed policy $\pi$, and the expectation $\mathbf{E}^{\pi}$ is taken over the entire trajectory.

Let us emphasize our focus on the undiscounted average-reward MDP. This is in contrast to the majority of the existing literature, which focuses on discounted cumulative reward problems, i.e., $\mathbf{E}\left[\sum_{t=1}^{\infty}\gamma^{t}r_{i_ti_{t+1}}(a_t)\right]$ where $\gamma\in(0,1)$ is a pre-specified discount factor. The discount factor is imposed artificially for analytical purposes. It ensures contractive properties of the Bellman operator and geometric convergence of value and policy iterations. It also plays an important role in the sample and run-time complexity analysis for MDP algorithms and reinforcement learning methods. However, discounted MDPs are in fact approximations to infinite-horizon undiscounted MDPs. In this paper, we attempt to dispense with the discount factor. Instead of assuming that future rewards are discounted, we focus on undiscounted MDPs that satisfy certain fast mixing and stationary properties. The lack of a discount factor significantly complicates our analysis.

Let us focus on sampling-based methods for the AMDP. Suppose that $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ is not explicitly given. Instead, it is possible to interact with the real-time decision process (or a simulated process) by trying different controls and observing state transitions and rewards. In particular, suppose that we are given a Sampling Oracle ($\mathcal{SO}$), which takes a state-action pair $(i,a)$ as input and outputs a random future state $j$ and reward $r_{ij}(a)$ with probability $p_{ij}(a)$. Such an $\mathcal{SO}$ is known as the generative model in the reinforcement learning literature [16, 15].
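To make the oracle interface concrete, here is a minimal sketch of a sampling oracle for a tabular MDP. The class name `SamplingOracle`, the array layout `P[a][i][j]`, `R[a][i][j]`, and the query counter are illustrative assumptions, not part of the original text; a learning method would only ever call `sample` and never read `P` directly.

```python
import random


class SamplingOracle:
    """Hypothetical generative-model oracle for a tabular MDP.

    P[a][i][j] holds p_ij(a) and R[a][i][j] holds r_ij(a) in [0, 1].
    """

    def __init__(self, P, R, seed=0):
        self.P = P
        self.R = R
        self.rng = random.Random(seed)
        self.num_queries = 0  # sample complexity = number of oracle calls

    def sample(self, i, a):
        """Given a state-action pair (i, a), return (next state j, reward r_ij(a))."""
        self.num_queries += 1
        next_states = range(len(self.P[a][i]))
        j = self.rng.choices(next_states, weights=self.P[a][i])[0]
        return j, self.R[a][i][j]
```

Counting calls to `sample` is what the sample complexity bounds in this paper measure.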

In this paper, we propose a model-free policy learning method for solving the AMDP, which we refer to as Primal-Dual $\pi$ Learning ($\pi$ learning for short). It is motivated by a recently developed randomized primal-dual method for solving the discounted MDP [25]. The $\pi$ learning method maintains a randomized policy for controlling the MDP and dynamically updates the policy and an auxiliary value vector as new observations are revealed. The method is based on a primal-dual iteration which is crafted to take advantage of the linear algebraic structures of the nonlinear Bellman equation. The $\pi$ learning method is remarkably computationally efficient: it uses $O(|\mathcal{S}||\mathcal{A}|)$ space and $\tilde{O}(1)$ arithmetic operations per update. (We use $O$ to hide constant factors and $\tilde{O}$ to hide polylogarithmic factors.) It is model-free in the sense that it directly updates the policy and value vectors without estimating the transition probabilities of the MDP model. We show that the $\pi$ learning method finds an $\epsilon$-optimal policy with probability at least $1-\delta$ using the following sample complexity (number of queries to the $\mathcal{SO}$):

$$\tilde{O}\left(\frac{(\tau\cdot t^*_{mix})^2|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\log\left(\frac{1}{\delta}\right)\right),$$

where $\tau$ is a parameter that characterizes the range of stationary distributions across policies, and $t^*_{mix}$ is a uniform upper bound on the mixing times of the Markov decision process under any stationary policy. This sample complexity is optimal in its dependence on $|\mathcal{S}|$, $|\mathcal{A}|$ and $\epsilon$.

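When the MDP model is explicitly given, the proposed $\pi$ learning method can be used as a randomized algorithm to compute an $\epsilon$-optimal policy. Given $\mathcal{M}$ as the input, one can implement the $\mathcal{SO}$ using binary-tree data structures built in a preprocessing step, such that each query to the $\mathcal{SO}$ takes $\tilde{O}(1)$ time [25]. In this setting, the $\pi$ learning method outputs an $\epsilon$-optimal policy with probability $1-\delta$ in run time $\tilde{O}\left(\frac{(\tau\cdot t^*_{mix})^2|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\log\left(\frac{1}{\delta}\right)\right)$. This is a sublinear run time in comparison with the input size $O(|\mathcal{S}|^2|\mathcal{A}|)$, as long as $\epsilon=\Omega\left(\frac{\tau\cdot t^*_{mix}}{\sqrt{|\mathcal{S}|}}\right)$ up to logarithmic factors.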

To the author’s best knowledge, this is the first model-free learning method for infinite-horizon average-reward MDP problems that is based on a primal-dual iteration. Our sample complexity result is the first to characterize the role of the mixing time and the range of stationary distributions, without assuming any discount factor or finite horizon. We also provide the first sublinear run-time result for approximately solving the AMDP using randomization.

#### Outline

Section 2 surveys existing model-free learning methods for MDP and their sample and run-time complexity guarantees. Section 3 states the main assumptions on the ergodic Markov decision processes, the Bellman equation and its linear programming formulations. Section 4 develops the Primal-Dual $\pi$ Learning method from a saddle point formulation of the Bellman equation. Section 5 establishes the convergence analysis and sample complexity of the Primal-Dual $\pi$ Learning method. Section 6 gives a summary.

#### Notations

All vectors are considered as column vectors. For a vector $x$, we denote by $x_i$ or $x(i)$ its $i$-th component, by $x^\top$ its transpose, and by $\|x\|=\sqrt{x^\top x}$ its Euclidean norm. We denote by $\mathbf{1}$ the vector with all entries equaling 1, and we denote by $e_i$ the vector with its $i$-th entry equaling $1$ and other entries equaling $0$. For a positive number $x$, we denote by $\log x$ the natural logarithm of $x$. For two probability distributions $p,q$ over a finite set $X$, we denote by $D_{KL}(p\|q)$ their Kullback-Leibler divergence, i.e., $D_{KL}(p\|q)=\sum_{x\in X}p(x)\log\frac{p(x)}{q(x)}$.

## 2 Related Literatures

There are two major notions of complexity for MDP: the run-time complexity and the sample complexity. The run-time complexity is critical to the computational problem where the MDP model is fully specified. It is measured by the total number of arithmetic operations performed by an algorithm. The sample complexity is critical to the reinforcement learning problem where the MDP model is unknown but a sampling oracle ($\mathcal{SO}$) is given. It is measured by the total number of queries to the $\mathcal{SO}$ made by an algorithm. Most of the existing literature focuses on only one of the two notions, and for years they were treated as disjoint research topics.

The computational complexity of MDP has been studied mainly in the setting where the MDP model is fully specified as the input. Three major deterministic approaches are the value iteration method [2, 24, 18], the policy iteration method [12, 19, 28, 22], and linear programming methods [11, 10, 28, 22, 27]. These deterministic methods inevitably require solving large linear systems. In order to compute the optimal policy exactly or to find an $\epsilon$-optimal policy, these methods all require linear or superlinear time, i.e., the number of arithmetic operations needed is at least linear in the input size $O(|\mathcal{S}|^2|\mathcal{A}|)$. For more detailed surveys on the exact solution methods for MDP, we refer the readers to the textbooks [3, 5, 21, 4] and the references therein.

Randomized versions of the classical methods have been shown to achieve faster run time when $|\mathcal{S}|$ and $|\mathcal{A}|$ are very large. Examples include the randomized primal-dual method by [25] and the variance-reduced randomized value iteration methods by [23]; both apply to the discounted MDP. These methods involve simulating the Markov decision processes and making randomized updates. As long as the input is given in suitable data structures that enable $\tilde{O}(1)$-time sampling, these results suggest that it is possible to compute an approximate policy for the discounted MDP in sublinear time (ignoring other parameters). On the other hand, [7] recently established run-time lower bounds that any randomized algorithm must satisfy in order to produce an $\epsilon$-optimal policy for the discounted MDP with high probability, both in general and in the case where each transition can be sampled in $\tilde{O}(1)$ time. To the author’s best knowledge, existing results on randomized methods only apply to the discounted MDP. It remains unclear how to use randomized algorithms to efficiently approximate the optimal average-reward policy.

The sample complexity of MDP has been studied mainly in the setting of reinforcement learning. In this paper, we are given an $\mathcal{SO}$ that generates state transitions from any specified state-action pair. This is known as the generative model in reinforcement learning, which was introduced and studied in [16, 15]. In this setting, the sample complexity of the MDP is the number of queries to the $\mathcal{SO}$ needed to find an $\epsilon$-optimal policy (or an $\epsilon$-optimal value in some of the literature) with high probability. One of the earliest reinforcement learning methods is Q-learning, which is essentially a sampling-based variant of value iteration. For infinite-horizon discounted MDP, [16] proved that phased Q-learning takes $\tilde{O}\left(\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\right)$ sample transitions to compute an $\epsilon$-optimal policy, where the dependence on the discount factor is left unspecified. Azar, Munos and Kappen [1] considered a model-based value iteration method for the discounted MDP and showed that it takes $O\left(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\epsilon^2}\log\frac{|\mathcal{S}||\mathcal{A}|}{\delta}\right)$ samples to compute an $\epsilon$-optimal value vector (not an $\epsilon$-optimal policy). They also provided a matching sample complexity lower bound for estimating the value vector, but did not give an explicit run-time complexity analysis.

We summarize existing model-free sampling-based methods for MDP and their complexity results in Table 1. Note that the settings and assumptions in these works vary from one to another. We also note that there is a large body of work on the sample complexity of exploration for reinforcement learning, which is the number of suboptimal time steps an algorithm performs on a single infinitely long path of the decision process before it reaches optimality; see [14]. That notion differs from our notion of sample complexity under the $\mathcal{SO}$ and is beyond our current scope. As a result, we do not include these results for comparison in Table 1.

Our proposed algorithm and analysis were partly motivated by the stochastic mirror-prox methods for solving convex-concave saddle point problems [20] and variational inequalities [13]. The idea of stochastic primal-dual updates has been used to solve a specific class of minimax bilinear programs in sublinear run time [8]. For the discounted MDP, the work [26] proposed a basic stochastic primal-dual iteration without explicit complexity analysis, and later [6] established a sample complexity upper bound. The most relevant prior work is the author’s recent paper [25], which focused on the discounted MDP. The work [25] proposed a randomized mirror-prox method using adaptive transition sampling, which applies to a special saddle point formulation of the Bellman equation. For the discounted MDP, it established total run-time and sample complexity guarantees for finding an approximately optimal policy, with sharper guarantees for finding an approximate policy from a particular initial distribution when the stationary distribution satisfies $\tau$-stationarity (see Assumption 1 in the current paper).

In this work, we develop the $\pi$ learning method for the case of undiscounted average-reward MDP. Our approach follows that of [25]. However, our analysis is much more streamlined and applies to the more general undiscounted problems. Without assuming any discount factor, we are able to characterize the complexity upper bound for infinite-horizon MDP using its mixing and stationary properties. Compared to [25], the complexity results achieved in the current paper are much sharper, mainly due to the natural simplicity of average-reward Markov processes. To the author’s best knowledge, our results provide the first sublinear run time for solving infinite-horizon average-reward MDP without any assumption on discount factor or finite horizon.

## 3 Ergodic MDP, Bellman Equation, and Duality

Consider an AMDP that is described by a tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$. In this paper, we focus on AMDPs that are ergodic (aperiodic and recurrent) under any stationary policy. For a stationary policy $\pi$, we denote by $\nu^\pi$ the stationary distribution of the Markov decision process under $\pi$, which satisfies $(\nu^\pi)^\top P^\pi=(\nu^\pi)^\top$. We make the following assumptions on the stationary distributions and mixing times:

###### Assumption 1 (Ergodic Decision Process).

The Markov decision process specified by $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ is $\tau$-stationary in the sense that it is ergodic under any stationary policy $\pi$ and there exists $\tau>1$ such that

$$\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathbf{1}\le\nu^{\pi}\le\frac{\sqrt{\tau}}{|\mathcal{S}|}\mathbf{1}.$$

Assumption 1 characterizes a form of complexity of the MDP in terms of the range of its stationary distributions: the factor $\tau$ measures the variation of stationary distributions associated with different policies. Suppose that some policies induce transient states (so the stationary distribution is not bounded away from zero). In this case, as long as there is some policy that leads to an ergodic process, we can restrict our attention to mixture policies in order to guarantee ergodicity. In this way, Assumption 1 can always be made to hold on the restricted problem, at the cost of some additional approximation error.
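As a small numerical illustration of Assumption 1, the sketch below computes the smallest $\tau$ for which the two-sided bound holds over a given finite list of stationary distributions. The function name `stationarity_parameter` is hypothetical, and checking a finite list is an assumption made for illustration only, since the assumption itself quantifies over all stationary policies.

```python
def stationarity_parameter(stationary_dists):
    """Smallest tau >= 1 such that, for every listed distribution nu,
    1 / (sqrt(tau) * n) <= nu_i <= sqrt(tau) / n for all states i."""
    n = len(stationary_dists[0])
    sqrt_tau = 1.0
    for nu in stationary_dists:
        # The lower bound requires sqrt(tau) >= 1 / (n * min(nu)),
        # the upper bound requires sqrt(tau) >= n * max(nu).
        sqrt_tau = max(sqrt_tau, 1.0 / (n * min(nu)), n * max(nu))
    return sqrt_tau ** 2
```

For example, with two states and stationary distributions $(0.25, 0.75)$ and $(0.5, 0.5)$, the binding constraint is the smallest entry $0.25$, which forces $\sqrt{\tau}\ge 2$.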

###### Assumption 2 (Fast Mixing Markov Chains).

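The Markov decision process specified by $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ is $t^*_{mix}$-mixing in the sense that

$$t^*_{mix}\ge\max_{\pi}\min\left\{t\ge1\ \Big|\ \|(P^{\pi})^t(i,\cdot)-\nu^{\pi}\|_{TV}\le\frac{1}{4},\ \forall i\in\mathcal{S}\right\},$$

where $\|\cdot\|_{TV}$ is the total variation distance.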

Assumption 2 requires that the Markov chains be sufficiently “rapidly mixing.” The factor $t^*_{mix}$ characterizes how fast the Markov decision process reaches its stationary distribution from any state under any policy. Our results suggest that the $\pi$ learning method would work extremely well on “rapidly mixing” decision processes where $t^*_{mix}$ is a small constant. A typical example is autonomous driving, where the previous actions get forgotten quickly. On the other hand, the current form of $\pi$ learning might work poorly for problems such as maze navigation, in which the mixing time can be very large and most policies are non-ergodic. Addressing this limitation is left for future work.
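The inner quantity in Assumption 2 can be computed directly for a single fixed policy. The following sketch does so for one transition matrix $P^\pi$ with known stationary distribution; the helper names are hypothetical, the total variation distance is taken as half the $\ell_1$ distance (a common convention, assumed here), and the maximization over all policies is not attempted.

```python
def mat_mul(A, B):
    """Product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def tv_distance(p, q):
    """Total variation distance, assumed here to be half the l1 distance."""
    return 0.5 * sum(abs(x - y) for x, y in zip(p, q))


def mixing_time(P, nu, threshold=0.25, max_steps=10_000):
    """Smallest t >= 1 with TV(P^t(i, .), nu) <= threshold for every state i."""
    Pt = [row[:] for row in P]  # Pt holds P^t, starting at t = 1
    for t in range(1, max_steps + 1):
        if all(tv_distance(row, nu) <= threshold for row in Pt):
            return t
        Pt = mat_mul(Pt, P)
    raise ValueError("chain did not mix within max_steps")
```

For the symmetric two-state chain with self-loop probability $0.9$, the distance from stationarity decays as $0.5\cdot 0.8^t$, so the threshold $\frac14$ is first met at $t=4$.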

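Consider an MDP tuple $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ that satisfies Assumptions 1 and 2. For a fixed policy $\pi$, the average reward $\bar{v}^\pi$ is defined as

$$\bar{v}^{\pi}\equiv\bar{v}^{\pi}(i)=\lim_{N\to\infty}\mathbf{E}^{\pi}\left[\frac{1}{N}\sum_{t=1}^{N}r_{i_ti_{t+1}}(a_t)\ \Big|\ i_1=i\right],\quad i\in\mathcal{S},$$

where $\mathbf{E}^{\pi}$ is taken over the random state-action trajectory generated by the Markov decision process under policy $\pi$. Note that the average reward is state-invariant, so we treat it as a scalar.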
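For an ergodic chain, the state-invariant average reward equals the stationary expectation of the one-step expected reward. The sketch below uses this standard fact to evaluate a fixed policy on a small tabular model; the function names, the array layout `P[a][i][j]`, `R[a][i][j]`, `pi[i][a]`, and the use of plain power iteration are illustrative assumptions.

```python
def policy_chain(P, R, pi):
    """Transition matrix P^pi and expected one-step reward vector under pi."""
    n, m = len(pi), len(P)
    P_pi = [[sum(pi[i][a] * P[a][i][j] for a in range(m)) for j in range(n)]
            for i in range(n)]
    r_pi = [sum(pi[i][a] * P[a][i][j] * R[a][i][j]
                for a in range(m) for j in range(n))
            for i in range(n)]
    return P_pi, r_pi


def stationary_distribution(P_pi, iters=5000):
    """Power iteration nu <- nu P^pi, assuming the chain is ergodic."""
    n = len(P_pi)
    nu = [1.0 / n] * n
    for _ in range(iters):
        nu = [sum(nu[i] * P_pi[i][j] for i in range(n)) for j in range(n)]
    return nu


def average_reward(P, R, pi):
    """Average reward of pi, computed as the inner product of nu^pi and r^pi."""
    P_pi, r_pi = policy_chain(P, R, pi)
    nu = stationary_distribution(P_pi)
    return sum(x * y for x, y in zip(nu, r_pi))
```

In a two-state example with a single action, transition rows $(0.5, 0.5)$ and $(0.25, 0.75)$, and reward 1 whenever the next state is state 1, the stationary distribution is $(\frac13,\frac23)$ and the average reward is $\frac23$.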

#### Bellman Equation

According to the theory of dynamic programming [21, 3], the value $\bar{v}^*$ is the optimal average reward of the AMDP if and only if it satisfies the following system of equations, known as the Bellman equation, given by

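$$\bar{v}^*+h^*_i=\max_{a\in\mathcal{A}}\left\{\sum_{j\in\mathcal{S}}p_{ij}(a)h^*_j+\sum_{j\in\mathcal{S}}p_{ij}(a)r_{ij}(a)\right\},\quad\forall\ i\in\mathcal{S},$$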

for some vector $h^*\in\mathbb{R}^{|\mathcal{S}|}$. A stationary policy $\pi^*$ is an optimal policy of the AMDP if it attains the elementwise maximization in the Bellman equation (Theorem 8.4.5 [21]). For finite-state AMDP, there always exists at least one optimal policy $\pi^*$. If the optimal policy is unique, it is also a deterministic policy. If there are multiple optimal policies, there exist infinitely many optimal randomized policies.

Note that the preceding Bellman equation has a unique optimal value $\bar{v}^*$ but infinitely many solutions $h^*$, since adding a constant multiple of $\mathbf{1}$ to any solution gives another solution. In the remainder of this paper, we augment the Bellman equation with an additional linear equality constraint

$$(\nu^{\pi^*})^{\top}h^*=0,$$

where $\nu^{\pi^*}$ is the stationary distribution under the optimal policy $\pi^*$. Now the augmented Bellman equation has a unique solution. We refer to this unique solution as the difference-of-value vector, and we denote it by $h^*$ throughout the rest of the paper. The difference-of-value vector can be informally defined as

$$h^*_i=\lim_{N\to\infty}\mathbf{E}^{\pi^*}\left[\sum_{t=1}^{N}r_{i_ti_{t+1}}(a_t)-N\bar{v}^*\ \Big|\ i_1=i\right],\quad\forall\ i\in\mathcal{S}.$$

It characterizes the transient effect of each initial state under the optimal policy.

#### Linear Duality Of The Bellman Equation

The nonlinear Bellman equation is equivalent to the following linear programming problem (see [21] Section 8.8):

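$$\begin{aligned}\text{minimize}_{\bar{v},h}\quad&\bar{v}\\ \text{subject to}\quad&\bar{v}\cdot\mathbf{1}+(I-P_a)h-r_a\ge0,\quad\forall\ a\in\mathcal{A},\end{aligned}\tag{1}$$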

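where $P_a\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|}$ is the matrix whose $(i,j)$-th entry equals $p_{ij}(a)$, $I$ is the identity matrix with dimension $|\mathcal{S}|\times|\mathcal{S}|$, and $r_a\in\mathbb{R}^{|\mathcal{S}|}$ is the expected state-transition reward vector under action $a$, i.e., $r_{a,i}=\sum_{j\in\mathcal{S}}p_{ij}(a)r_{ij}(a)$ for all $i\in\mathcal{S}$.

We associate each constraint of the primal program (1) with a dual variable $\mu_a\in\mathbb{R}^{|\mathcal{S}|}$. The dual linear program of (1) is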

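$$\begin{aligned}\text{maximize}\quad&\sum_{a\in\mathcal{A}}\mu_a^{\top}r_a\\ \text{subject to}\quad&\sum_{a\in\mathcal{A}}(I-P_a^{\top})\mu_a=0,\quad\sum_{a\in\mathcal{A}}\sum_{i\in\mathcal{S}}\mu_{a,i}=1,\quad\mu_a\ge0,\quad\forall\ a\in\mathcal{A}.\end{aligned}\tag{2}$$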

It is well known that each deterministic policy of the AMDP corresponds to a basic feasible solution to the dual linear program (2). A randomized policy is a mixture of deterministic policies, so it corresponds to a feasible solution of program (2). We denote by $\mu^*=(\mu^*_a)_{a\in\mathcal{A}}$ the optimal solution to the dual linear program (2). If there is a unique optimal dual solution, it must be a basic feasible solution. In this case, the basis of $\mu^*$ corresponds to an optimal deterministic policy.
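The correspondence between policies and dual feasible points can be checked numerically. The sketch below builds the state-action occupancy measure $\mu_{a,i}=\nu^\pi_i\pi_{i,a}$ of a randomized policy and evaluates the residual of the flow constraint in (2); the function names, the power-iteration solver, and the array layout are illustrative assumptions.

```python
def occupancy_measure(P, pi, iters=5000):
    """Return mu[a][i] = nu_i * pi[i][a], with nu the stationary distribution
    of the chain induced by pi (computed by power iteration, assuming ergodicity)."""
    n, m = len(pi), len(P)
    P_pi = [[sum(pi[i][a] * P[a][i][j] for a in range(m)) for j in range(n)]
            for i in range(n)]
    nu = [1.0 / n] * n
    for _ in range(iters):
        nu = [sum(nu[i] * P_pi[i][j] for i in range(n)) for j in range(n)]
    return [[nu[i] * pi[i][a] for i in range(n)] for a in range(m)]


def dual_residual(P, mu):
    """Vector sum_a (I - P_a^T) mu_a, which is zero at any dual feasible point."""
    m, n = len(mu), len(mu[0])
    return [sum(mu[a][j] - sum(P[a][i][j] * mu[a][i] for i in range(n))
                for a in range(m))
            for j in range(n)]
```

The residual vanishes because summing $\mu_a$ over actions recovers $\nu^\pi$, and $\sum_a P_a^\top\mu_a=(P^\pi)^\top\nu^\pi=\nu^\pi$.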

## 4 Primal-Dual π Learning

In this section, we develop the Primal-Dual $\pi$ Learning method ($\pi$ learning for short). Our first step is to examine the nonlinear Bellman equation and formulate it into a bilinear saddle point problem with specially chosen primal and dual constraints. Our second step is to develop the Primal-Dual $\pi$ Learning method and discuss its implementation and run-time complexity per iteration.

### 4.1 Saddle Point Formulation of Bellman Equation

In light of linear duality, we formulate the linear programs (1)-(2) into an equivalent minimax problem, given by

$$\min_{\bar{v},h}\ \max_{\mu\ge0}\ \bar{v}+\sum_{a\in\mathcal{A}}\mu_a^{\top}\left(-\bar{v}\cdot\mathbf{1}+(P_a-I)h+r_a\right).$$

The minimax formulation is preferable to the linear program formulation because it has much simpler constraints. We construct the sets $\mathcal{H}$ and $\mathcal{U}$ to be the search spaces for the value and the policy, respectively, given by

$$\mathcal{H}=\left\{h\in\mathbb{R}^{|\mathcal{S}|}\ \Big|\ \|h\|_{\infty}\le2t^*_{mix}\right\},$$

and

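$$\mathcal{U}=\left\{\mu=(\mu_a)_{a\in\mathcal{A}}\ \Big|\ \mathbf{1}^{\top}\mu=1,\ \mu\ge0,\ \sum_{a\in\mathcal{A}}\mu_a\ge\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathbf{1}\right\}.$$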

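Since $\mathbf{1}^{\top}\mu=1$ for all $\mu\in\mathcal{U}$, the terms involving $\bar{v}$ cancel and we simplify the minimax problem to

$$\min_{h\in\mathcal{H}}\ \max_{\mu\in\mathcal{U}}\ \sum_{a\in\mathcal{A}}\mu_a^{\top}\left((P_a-I)h+r_a\right).\tag{3}$$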

The search space for the dual vector given by $\mathcal{U}$ essentially reflects Assumption 1. Recall that Assumption 1 suggests that the stationary distribution of any policy belongs to a certain range, therefore it is sufficient to search for the dual variable within that range. The search space for the difference-of-value vector given by $\mathcal{H}$ essentially reflects Assumption 2 on the fast mixing property of the MDP. The fast mixing condition implies that one can move from any state to any state within a bounded number of steps, therefore the relative difference in their values is bounded by the expected traverse time. In what follows, we verify that $h^*\in\mathcal{H}$ and $\mu^*\in\mathcal{U}$ under Assumptions 1 and 2.

###### Lemma 1.

Under Assumptions 1 and 2, the optimal primal and dual solutions to the linear programs (1)-(2) satisfy $h^*\in\mathcal{H}$, and $\mu^*\in\mathcal{U}$.

Proof. Since $\bar{v}^*$ is the average reward under $\pi^*$ and each reward per period belongs to $[0,1]$, we obtain that $\bar{v}^*\in[0,1]$.

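Let $P^*$ be the transition probability matrix under $\pi^*$, and let $r$ denote the expected reward vector under $\pi^*$, so that the Bellman equation reads $h^*=r+P^*h^*-\bar{v}^*\cdot\mathbf{1}$. Let $\nu^*$ be the stationary distribution under $\pi^*$, so the difference-of-value vector satisfies $(\nu^*)^{\top}h^*=0$. Let $\Pi$ be the matrix with all rows equaling $(\nu^*)^{\top}$, therefore $\Pi h^*=0$. Letting $m=t^*_{mix}$, we have $\|(P^*)^m(i,\cdot)-\nu^*\|_{TV}\le\frac{1}{4}$ for all $i\in\mathcal{S}$, therefore $\|(P^*)^m-\Pi\|_{\infty}\le\frac{1}{4}$. We apply the relation $h^*=r+P^*h^*-\bar{v}^*\cdot\mathbf{1}$ inductively for $m$ times, use $\Pi h^*=0$ and obtain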

$$h^*=\sum_{k=0}^{m-1}(P^*)^kr+(P^*)^mh^*-m\bar{v}^*\cdot\mathbf{1}=\sum_{k=0}^{m-1}\left((P^*)^kr-\bar{v}^*\cdot\mathbf{1}\right)+\left((P^*)^m-\Pi\right)h^*.$$

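We take $\|\cdot\|_{\infty}$ on both sides of the above, use the triangle inequality and obtain

$$\|h^*\|_{\infty}\le\sum_{k=0}^{m-1}\|(P^*)^kr-\bar{v}^*\cdot\mathbf{1}\|_{\infty}+\|(P^*)^m-\Pi\|_{\infty}\|h^*\|_{\infty}\le m+\frac{1}{4}\|h^*\|_{\infty}.$$

It follows that $\|h^*\|_{\infty}\le2t^*_{mix}$ and $h^*\in\mathcal{H}$.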

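Recall that $\mu^*$ is the optimal dual solution to the linear programs (1)-(2). The dual feasibility of $\mu^*$ implies that $\xi^*=\sum_{a\in\mathcal{A}}\mu^*_a$ satisfies $(\xi^*)^{\top}P^*=(\xi^*)^{\top}$ and $\mathbf{1}^{\top}\xi^*=1$, therefore $\xi^*$ is the stationary distribution corresponding to the transition matrix $P^*$ under the optimal policy $\pi^*$. It follows from Assumption 1 that $\xi^*\ge\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathbf{1}$, hence $\mu^*\in\mathcal{U}$.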
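As a worked step, the final inequality in the bound on $h^*$ from the proof of Lemma 1 can be rearranged explicitly. The sketch below assumes that the induction length is $m=t^*_{mix}$ (the natural choice under Assumption 2) and takes the coefficient $\frac14$ as displayed in that inequality.

```latex
\begin{aligned}
\|h^*\|_\infty \le m + \tfrac{1}{4}\|h^*\|_\infty
&\;\Longrightarrow\; \tfrac{3}{4}\|h^*\|_\infty \le m \\
&\;\Longrightarrow\; \|h^*\|_\infty \le \tfrac{4}{3}\,m = \tfrac{4}{3}\,t^*_{mix} \le 2\,t^*_{mix}.
\end{aligned}
```

Under these assumptions the radius $2t^*_{mix}$ used to define $\mathcal{H}$ leaves some slack, so $h^*$ lies in $\mathcal{H}$.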

### 4.2 The Primal Dual π Learning Algorithm

Motivated by the minimax formulation of the Bellman equation, we propose the Primal-Dual $\pi$ Learning method as follows. The $\pi$ learning method makes iterative updates to a sequence of primal and dual variables $\{(h^t,\mu^t)\}$. At the $t$-th iteration, the algorithm draws a random state-action pair $(i,a)$ with probability $\mu^t_{i,a}$ and queries the $\mathcal{SO}$ for a state transition to a random next state $j$ with probability $p_{ij}(a)$. Then the $\pi$ learning method updates according to

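$$\mu^{t+1}=\arg\min_{\mu\in\mathcal{U}}D_{KL}\left(\mu\,\|\,\mu^t\cdot\exp(\Delta^{t+1})\right),\qquad h^{t+1}=\mathbf{Proj}_{\mathcal{H}}\left[h^t+d^{t+1}\right],\tag{4}$$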

where “$\cdot$” denotes elementwise multiplication, $\mathbf{Proj}_{\mathcal{H}}$ denotes the Euclidean projection onto $\mathcal{H}$, and $\Delta^{t+1},d^{t+1}$ are random vectors generated conditioned on $\mathcal{F}_t$ according to

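$$\begin{aligned}\Delta^{t+1}\mid\mathcal{F}_t&=\beta\cdot\frac{h^t_j-h^t_i+r_{ij}(a)-M}{\mu^t_{i,a}}\,e_{i,a},&&\text{with probability }\mu^t_{i,a}\,p_{ij}(a),\\ d^{t+1}\mid\mathcal{F}_t&=\alpha\cdot(e_i-e_j),&&\text{with probability }\mu^t_{i,a}\,p_{ij}(a),\end{aligned}\tag{5}$$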

where we use $\mathcal{F}_t$ to denote the collection of all random variables up to the $t$-th iteration. We note that $\Delta^{t+1}$ is a vector of dimension $|\mathcal{S}||\mathcal{A}|$ but it has only one nonzero entry. Similarly, $d^{t+1}$ is a vector of dimension $|\mathcal{S}|$ but it has only two nonzero entries, whose coordinates are randomly generated by sampling a single state transition. We can easily verify that

$$\mathbf{E}\left[\Delta^{t+1}_a\mid\mathcal{F}_t\right]=\beta\left((P_a-I)h^t+r_a-M\cdot\mathbf{1}\right),\quad a\in\mathcal{A},$$

and

$$\mathbf{E}\left[d^{t+1}\mid\mathcal{F}_t\right]=\alpha\sum_{a\in\mathcal{A}}\mu_a^{\top}(I-P_a).$$

In other words, the primal and dual updates are conditionally unbiased estimates of the scaled partial derivatives of the minimax objective, up to the constant shift $M$ in the dual direction.
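The unbiasedness claims can be checked by exact enumeration on a small model. The sketch below builds the sparse vectors of (5) for one sampled triplet $(i,a,j)$ and averages them over the sampling distribution, in which $(i,a)$ is drawn with probability $\mu_{i,a}$ and then $j$ with probability $p_{ij}(a)$. The function names, the dictionary representation of sparse vectors, and the layout `mu[i][a]`, `P[a][i][j]`, `R[a][i][j]` are illustrative assumptions.

```python
def sampled_updates(h, mu, i, a, j, r, alpha, beta, M):
    """Sparse vectors of (5) for one sampled transition (i, a, j) with reward r."""
    # Delta has a single nonzero entry, at coordinate (i, a).
    delta = {(i, a): beta * (h[j] - h[i] + r - M) / mu[i][a]}
    # d = alpha * (e_i - e_j) has at most two nonzero entries.
    d = {}
    d[i] = d.get(i, 0.0) + alpha
    d[j] = d.get(j, 0.0) - alpha
    return delta, d


def expected_updates(h, mu, P, R, alpha, beta, M):
    """Exact conditional expectations of Delta and d, by enumerating (i, a, j)."""
    n, m = len(h), len(P)
    exp_delta = [[0.0] * m for _ in range(n)]
    exp_d = [0.0] * n
    for i in range(n):
        for a in range(m):
            for j in range(n):
                prob = mu[i][a] * P[a][i][j]
                if prob == 0.0:
                    continue
                delta, d = sampled_updates(h, mu, i, a, j, R[a][i][j],
                                           alpha, beta, M)
                for (s, b), v in delta.items():
                    exp_delta[s][b] += prob * v
                for s, v in d.items():
                    exp_d[s] += prob * v
    return exp_delta, exp_d
```

The division by $\mu_{i,a}$ in `sampled_updates` is what cancels the sampling probability, so that the expectation of the $(i,a)$ entry no longer depends on how often that pair is drawn.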

### 4.3 Implementations and Fast Updates

Let us consider how to implement the $\pi$ learning method in order to minimize the run time per iteration. We define the auxiliary variables $\xi^t,\pi^t$, such that

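$$\xi^t_i=\sum_{a\in\mathcal{A}}\mu^t_{i,a},\qquad\pi^t_{i,a}=\frac{\mu^t_{i,a}}{\xi^t_i},\qquad\mu^t_{i,a}=\xi^t_i\pi^t_{i,a},\qquad\forall\ i\in\mathcal{S},\ a\in\mathcal{A}.$$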

Note that $\xi^t$ is a vector of probability distribution over states, and $\pi^t$ is a randomized stationary policy that specifies the probability distribution for choosing actions at each given state. We implement the Primal-Dual $\pi$ Learning method given by iteration (4)-(5) in Algorithm 1.
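The change of variables between $\mu$ and $(\xi,\pi)$ is a per-state normalization. A minimal sketch follows, with hypothetical function names and the layout `mu[i][a]`; it assumes every $\xi_i$ is strictly positive, which holds for any $\mu\in\mathcal{U}$.

```python
def split_mu(mu):
    """Split mu[i][a] into a state distribution xi and a policy pi[i][a]."""
    xi = [sum(row) for row in mu]  # xi_i = sum_a mu_{i,a}
    pi = [[x / s for x in row] for row, s in zip(mu, xi)]  # pi_{i,a} = mu_{i,a} / xi_i
    return xi, pi


def merge_mu(xi, pi):
    """Recombine mu_{i,a} = xi_i * pi_{i,a}."""
    return [[s * p for p in row] for s, row in zip(xi, pi)]
```

Storing $\xi$ and $\pi$ separately means a state-action pair can be drawn in two stages, first a state from $\xi$ and then an action from $\pi_i$.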

Now we analyze the computational complexity of Algorithm 1. Each iteration draws one state-action-state triplet from the $\mathcal{SO}$. The updates on $h$ are made to two coordinates, thus taking $O(1)$ time. The updates on $\mu$ are multiplicative, which take $\tilde{O}(1)$ time if $\mu$ is represented using convenient data structures like binary trees (see Prop. 1 of [25]). The updates on $\mu$ also involve an information projection onto the set $\mathcal{U}$. This can be done by maintaining and updating a shifted version of the vector using a binary-tree structure, an idea that is also used in the algorithm implementation of [25]. Accordingly, Step 10 of Algorithm 1 takes $\tilde{O}(1)$ run time. To sum up, each iteration of Algorithm 1 draws one sample transition and makes updates in $\tilde{O}(1)$ time. The space complexity of Algorithm 1 is $O(|\mathcal{S}||\mathcal{A}|)$, mainly to keep track of the $|\mathcal{S}||\mathcal{A}|$-dimensional iterate and its running average.
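One common way to obtain logarithmic-time sampling and single-entry updates is a binary sum tree over nonnegative weights. The sketch below is a generic version of that idea, not the specific structure of [25]; the class name `SumTree` and its methods are hypothetical.

```python
class SumTree:
    """Array-based binary tree whose internal nodes store sums of leaf weights."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (2 * self.n)  # node 0 unused, leaves at n .. 2n-1
        self.tree[self.n:] = list(weights)
        for k in range(self.n - 1, 0, -1):
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def total(self):
        return self.tree[1]

    def update(self, idx, weight):
        """Set one leaf weight and refresh its ancestors in O(log n) time."""
        k = idx + self.n
        self.tree[k] = weight
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def find(self, u):
        """Descend from the root with a value u in [0, total).

        If u is uniform on [0, total), each leaf is returned with probability
        proportional to its weight.
        """
        k = 1
        while k < self.n:
            left = self.tree[2 * k]
            if u < left:
                k = 2 * k
            else:
                u -= left
                k = 2 * k + 1
        return k - self.n
```

In use, one would draw `u = random.random() * tree.total()` and call `tree.find(u)`; a multiplicative update to a single coordinate then costs one call to `update`.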

## 5 Sample Complexity and Run Time Analysis

In this section, we establish the sample complexity for the Primal-Dual $\pi$ Learning method given by Algorithm 1. We also show that Algorithm 1 applies to the computational problem of MDP and gives a sublinear run-time algorithm.

### 5.1 Primal-Dual Convergence

Each iteration of Algorithm 1 performs a primal-dual update for the minimax problem (3). Our first result concerns the convergence of the primal-dual iteration.

###### Theorem 1 (Finite-Iteration Duality Gap).

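Let $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ be an arbitrary MDP tuple satisfying Assumptions 1 and 2. Then the sequence of iterates generated by Algorithm 1 satisfies

$$\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\left[\sum_{a\in\mathcal{A}}(h^*-P_ah^*-r_a)^{\top}\mu^t_a\right]+\bar{v}^*\le\tilde{O}\left(t^*_{mix}\sqrt{\frac{|\mathcal{S}||\mathcal{A}|}{T}}\right),$$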

where for , .

Theorem 1 establishes a finite-time error bound on a particular “duality gap.” It characterizes the level of violation of the linear complementarity condition. Our proof is similar in spirit to that of Theorem 1 in [25]. Note that the analysis of [25] does not easily extend to the average-reward MDP and the $\pi$ learning method. As a result, we have to develop a separate new convergence analysis. The complete proof is established through a series of lemmas, which we defer to the Appendix.

### 5.2 Sample Complexity for Achieving ϵ-Optimal Policies

We have shown that the expected duality gap diminishes at a certain rate as Algorithm 1 iterates. It remains to analyze how many time steps are needed for the duality gap to become sufficiently small, and how a small duality gap would imply a near-optimal policy. We obtain the following result.

###### Lemma 2.

For any policy $\pi$, its stationary distribution $\nu^{\pi}$ and average reward $\bar{v}^{\pi}$ satisfy

and

Proof. Consider an arbitrary policy $\pi$. Let $\nu^{\pi}$ be the stationary distribution under policy $\pi$, so we have $(\nu^{\pi})^{\top}P^{\pi}=(\nu^{\pi})^{\top}$. Then we obtain the first result

Using the fact that , we obtain the second result.

Now we are ready to show that the $\pi$ learning method outputs an approximate policy whose average reward is close to the optimal average reward. Our second main result is as follows.

###### Theorem 2 (Sample Complexity of Single-Run π Learning (Algorithm 1)).

Let $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ be an arbitrary MDP tuple satisfying Assumptions 1 and 2, and let $\epsilon\in(0,1)$. Then by letting Algorithm 1 run for the following number of iterations/samples

$$T=\Omega\left((\tau\cdot t^*_{mix})^2\cdot\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\right)$$

it outputs an approximate policy $\hat{\pi}$ such that $\bar{v}^{\hat{\pi}}\ge\bar{v}^*-\epsilon$ with probability at least $\frac{2}{3}$.

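Proof. Consider the policy $\hat{\pi}$ output by Algorithm 1. Note that $\nu^{\hat{\pi}}\le\frac{\sqrt{\tau}}{|\mathcal{S}|}\mathbf{1}$ (by Assumption 1) and $\xi^t\ge\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathbf{1}$ (since $\mu^t\in\mathcal{U}$). Then we have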

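$$\nu^{\hat{\pi}}\le\frac{\sqrt{\tau}}{|\mathcal{S}|}\mathbf{1}=\tau\cdot\frac{1}{\sqrt{\tau}|\mathcal{S}|}\mathbf{1}\le\tau\xi^t.$$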

According to Lemma 2, we have

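where the inequality uses the fact $\nu^{\hat{\pi}}\le\tau\xi^t$ for all $t$ (due to the dual constraint $\mu^t\in\mathcal{U}$) and the primal feasibility $\bar{v}^*\cdot\mathbf{1}+(I-P_a)h^*-r_a\ge0$ for all $a\in\mathcal{A}$. We use the Markov inequality and obtain that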

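$$\bar{v}^*-\bar{v}^{\hat{\pi}}\le3\tau^2\left(\frac{1}{T}\sum_{t=1}^{T}\mathbf{E}\left[\sum_{a\in\mathcal{A}}(h^*-P_ah^*-r_a)^{\top}\mu^t_a\right]+\bar{v}^*\right)$$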

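with probability at least $\frac{2}{3}$. Now if we pick $T$ as in the statement of the theorem and apply Theorem 1, we obtain that $\bar{v}^*-\bar{v}^{\hat{\pi}}\le\epsilon$ with probability at least $\frac{2}{3}$.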

### 5.3 Boosting The Success Probability to 1−δ

Our next aim is to achieve an $\epsilon$-optimal policy with probability that is arbitrarily close to 1. To do this, we need to run Algorithm 1 for sufficiently many trials and pick the best outcome. This requires us to be able to evaluate multiple candidate policies and select the best one out of many. In the next lemma, we show that it is possible to approximately evaluate any policy within $\epsilon$-precision using $\tilde{O}\left(\frac{t^*_{mix}}{\epsilon^2}\right)$ samples.

###### Lemma 3 (Approximate Policy Evaluation).

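For any policy $\pi$, there exists an algorithm that outputs an approximate value $\bar{Y}$ such that $|\bar{Y}-\bar{v}^{\pi}|\le\epsilon$ with probability at least $1-\delta$ in $L=O\left(\frac{t^*_{mix}}{\epsilon^2}\log\frac{1}{\delta}\right)$ time steps.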

Proof. Consider the algorithm that generates a sequence of $L$ consecutive state transitions by following the policy $\pi$ and outputs the empirical mean reward, which we denote by $\bar{Y}$. Note that $\bar{Y}$ is the empirical mean of Markov random variables in $[0,1]$. We apply the McDiarmid inequality for Markov chains to the $L$-step empirical reward and obtain

$$\mathbf{P}\left(|\bar{Y}-\bar{v}^{\pi}|\ge\epsilon\right)\le2\exp\left(-\frac{L\epsilon^2}{t^*_{mix}}\right).$$

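When $L=\Omega\left(\frac{t^*_{mix}}{\epsilon^2}\log\frac{1}{\delta}\right)$, we have $|\bar{Y}-\bar{v}^{\pi}|\le\epsilon$ with probability at least $1-\delta$.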
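The evaluation procedure in the proof of Lemma 3 amounts to averaging rewards along a single simulated trajectory. A minimal sketch follows, assuming a tabular model with layout `P[a][i][j]`, `R[a][i][j]`, `pi[i][a]`; the function name, the fixed start state, and the seeded generator are illustrative choices.

```python
import random


def evaluate_policy(P, R, pi, L, start=0, seed=0):
    """Empirical mean reward over L consecutive transitions under policy pi."""
    rng = random.Random(seed)
    n, m = len(pi), len(P)
    i, total = start, 0.0
    for _ in range(L):
        a = rng.choices(range(m), weights=pi[i])[0]    # action from pi_i
        j = rng.choices(range(n), weights=P[a][i])[0]  # next state from p_i.(a)
        total += R[a][i][j]
        i = j
    return total / L
```

The trajectory is not restarted between steps, so consecutive rewards are dependent; this is why the concentration bound above carries the factor $t^*_{mix}$.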

Now we prove that by repeatedly running Algorithm 1 and using approximate policy evaluation, one can compute a near-optimal policy with probability arbitrarily close to 1. The main arguments are (1) the best policy out of multiple trials must be close-to-optimal with high probability; (2) the policy evaluation is nearly accurate with high probability, therefore the output policy (which performs the best in policy evaluation) is also close-to-optimal. Our main result is as follows.

###### Theorem 3 (Overall Sample Complexity).

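Let $\mathcal{M}=(\mathcal{S},\mathcal{A},\mathcal{P},\mathbf{r})$ be an arbitrary MDP tuple satisfying Assumptions 1 and 2, and let $\epsilon\in(0,1)$ and $\delta\in(0,1)$ be arbitrary values. Then there exists an algorithm that draws the following number of state transitions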

$$T=\Omega\left((\tau\cdot t^*_{mix})^2\cdot\frac{|\mathcal{S}||\mathcal{A}|}{\epsilon^2}\log\frac{1}{\delta}\right)$$

and outputs an approximate policy $\hat{\pi}$ such that $\bar{v}^{\hat{\pi}}\ge\bar{v}^*-\epsilon$ with probability at least $1-\delta$.

Proof. We describe an approach that runs Algorithm 1 multiple times in order to achieve an $\epsilon$-optimal policy with probability $1-\delta$:

1. We first run Algorithm 1 for $K$ independent trials with precision parameter $\frac{\epsilon}{3}$, and we denote the output policies by $\pi^{(1)},\ldots,\pi^{(K)}$. The total running time is $K\cdot T_{\epsilon/3}$, where $T_{\epsilon/3}$ is the number of samples needed by Algorithm 1. According to Theorem 2, each trial generates an $\frac{\epsilon}{3}$-optimal policy with probability at least $\frac{2}{3}$.

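2. For each output policy $\pi^{(k)}$, we conduct approximate value evaluation and obtain an approximate evaluation $\bar{Y}^{(k)}$ with precision level $\frac{\epsilon}{3}$ and a prescribed fail probability. According to Lemma 3, we have

$$\bar{Y}^{(k)}-\bar{v}^{\pi^{(k)}}\in\left[-\frac{\epsilon}{3},\frac{\epsilon}{3}\right],$$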

with probability at least , and this step takes time steps.

3. Output $\hat{\pi}=\pi^{(k^*)}$ such that $k^*=\arg\max_{k=1,\ldots,K}\bar{Y}^{(k)}$.
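The three steps above follow a generic "run several trials, evaluate each, keep the best" pattern, which can be sketched as follows. The function `boosted_policy` and its callback arguments `run_trial` and `evaluate` are hypothetical placeholders standing in for one run of Algorithm 1 and for the approximate policy evaluation of Lemma 3.

```python
def boosted_policy(run_trial, evaluate, num_trials):
    """Run independent trials, score each candidate, and return the best one.

    run_trial(k) -> candidate policy from the k-th independent run
    evaluate(policy) -> approximate average reward of that policy
    """
    candidates = [run_trial(k) for k in range(num_trials)]
    scores = [evaluate(policy) for policy in candidates]
    k_star = max(range(num_trials), key=lambda k: scores[k])
    return candidates[k_star], scores[k_star]
```

The selection is made on the estimated scores rather than the true average rewards, which is why the evaluation precision enters the final error budget.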

The number of samples required by the above procedure is . The space complexity is

Now we verify that is indeed near-optimal with probability at least , as long as is chosen appropriately. Let