Primal-Dual Learning: Sample Complexity and Sublinear Run Time for Ergodic Markov Decision Problems
Abstract
Consider the problem of approximating the optimal policy of a Markov decision process (MDP) by sampling state transitions. In contrast to existing reinforcement learning methods that are based on successive approximations to the nonlinear Bellman equation, we propose a Primal-Dual Learning method in light of the linear duality between the value and policy. The learning method is model-free and makes primal-dual updates to the policy and value vectors as new data are revealed. For infinite-horizon undiscounted Markov decision processes with finite state space and finite action space , the learning method finds an optimal policy using the following number of sample transitions
where is an upper bound of mixing times across all policies and is a parameter characterizing the range of stationary distributions across policies. The learning method also applies to the computational problem of MDP where the transition probabilities and rewards are explicitly given as the input. In the case where each state transition can be sampled in time, the learning method gives a sublinear-time algorithm for solving the average-reward MDP.
Keywords: Markov decision process, reinforcement learning, sample complexity, run-time complexity, duality, primal-dual method, mixing time
1 Introduction
Consider the reinforcement learning problem in which a planner makes decisions in an unknown (sometimes stochastic) dynamic environment with the goal of maximizing the reward collected in this process. This can be modeled as a Markov decision process (MDP). An MDP is a controlled random walk in which the planner chooses one of a number of actions at each state of the random walk and moves to another state according to some transition probability distribution. In the context of reinforcement learning, one wants to learn the optimal decision rule by using an algorithmic trial-and-error approach, without explicitly knowing the transition probabilities.
We focus on the infinite-horizon Average-reward Markov Decision Problem (AMDP) in which one aims to make an infinite sequence of decisions and optimize the average-per-time-step reward. An instance of the AMDP can be described by a tuple , where is a finite state space of size , is a finite action space of size , is the collection of state-to-state transition probabilities , is the collection of state-transitional rewards where . We also denote by the vector of expected state-transition rewards under action , where . Suppose that the decision process is in state ; if action is selected, the process moves to a next state with probability and generates a reward .
We want to find a stationary policy that specifies which action to choose at each state (regardless of the time step). A stationary and randomized policy can be represented by a collection of probability distributions , where is a probability distribution over actions. We denote by the transition probability matrix of the AMDP under a fixed policy , where for all . The objective of the AMDP is to find an optimal policy such that the infinite-horizon average reward is maximized:
where are state-action transitions generated by the Markov decision process under the fixed policy , and the expectation is taken over the entire trajectory.
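To make the objective concrete, the following is a minimal sketch of how the average reward of a fixed randomized policy could be computed when the model is known. The array layout (`P[a][i][j]`, `R[a][i][j]`, `pi[i][a]`) and all function names are assumptions introduced for illustration; the sketch relies on ergodicity so that power iteration converges to the stationary distribution.

```python
from typing import List

def policy_transition_matrix(P: List[List[List[float]]],
                             pi: List[List[float]]) -> List[List[float]]:
    """Mix the per-action matrices P[a][i][j] with the policy weights pi[i][a]."""
    n, m = len(pi), len(P)
    return [[sum(pi[i][a] * P[a][i][j] for a in range(m)) for j in range(n)]
            for i in range(n)]

def stationary_distribution(P_pi: List[List[float]],
                            iters: int = 100_000,
                            tol: float = 1e-13) -> List[float]:
    """Power iteration nu <- nu * P_pi (assumes the chain is ergodic)."""
    n = len(P_pi)
    nu = [1.0 / n] * n
    for _ in range(iters):
        nxt = [sum(nu[i] * P_pi[i][j] for i in range(n)) for j in range(n)]
        if max(abs(x - y) for x, y in zip(nu, nxt)) < tol:
            return nxt
        nu = nxt
    return nu

def average_reward(P, R, pi) -> float:
    """Stationary expectation of the one-step reward under policy pi."""
    n, m = len(pi), len(P)
    nu = stationary_distribution(policy_transition_matrix(P, pi))
    return sum(nu[i] * pi[i][a] * P[a][i][j] * R[a][i][j]
               for i in range(n) for a in range(m) for j in range(n))
```

Because the average reward of an ergodic chain does not depend on the initial state, the function returns a single scalar per policy.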
Let us emphasize our focus on the undiscounted average-reward MDP. This is in contrast to the majority of the existing literature, which focuses on discounted cumulative reward problems, i.e., where is a prespecified discount factor. The discount factor is imposed artificially for analytical purposes. It ensures contractive properties of the Bellman operator and geometric convergence of value and policy iterations. It also plays an important role in the sample and run-time complexity analysis for MDP algorithms and reinforcement learning methods. However, discounted MDPs are in fact approximations to infinite-horizon undiscounted MDPs. In this paper, we attempt to dispense with the discount factor. Instead of assuming that future rewards are discounted, we focus on undiscounted MDPs that satisfy certain fast mixing and stationarity properties. The lack of a discount factor significantly complicates our analysis.
Let us focus on sampling-based methods for the AMDP. Suppose that , is not explicitly given. Instead, it is possible to interact with the real-time decision process (or a simulated process) by trying different controls and observing state transitions and rewards. In particular, suppose that we are given a Sampling Oracle (), which takes a state-action pair as input and outputs a random future state and reward with probability . Such a is known as the generative model in the reinforcement learning literature [16, 15].
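A sampling oracle of this kind can be sketched as a thin wrapper around a simulator. The class name `SamplingOracle`, its fields, and the array layout are hypothetical and are used only for illustration; the query counter is included because the sample complexity is measured in oracle queries.

```python
import random
from typing import List, Optional, Tuple

class SamplingOracle:
    """Generative-model sketch: given (state, action), draw a next state and reward.

    P[a][i][j] is the probability of moving i -> j under action a and
    R[a][i][j] is the reward collected on that transition (assumed layout).
    """

    def __init__(self, P: List[List[List[float]]], R: List[List[List[float]]],
                 seed: Optional[int] = None) -> None:
        self.P = P
        self.R = R
        self.rng = random.Random(seed)
        self.num_queries = 0  # each query counts as one sample transition

    def sample(self, state: int, action: int) -> Tuple[int, float]:
        self.num_queries += 1
        probs = self.P[action][state]
        nxt = self.rng.choices(range(len(probs)), weights=probs, k=1)[0]
        return nxt, self.R[action][state][nxt]
```

A learning method that only calls `sample` never sees `P` or `R` directly, which is what makes it model-free in the sense used here.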
In this paper, we propose a model-free policy learning method for solving the AMDP, which we refer to as Primal-Dual Learning ( learning for short). It is motivated by a recently developed randomized primal-dual method for solving the discounted MDP [25]. The learning method maintains a randomized policy for controlling the MDP and dynamically updates the policy and an auxiliary value vector as new observations are revealed. The method is based on a primal-dual iteration which is crafted to take advantage of the linear algebraic structures of the nonlinear Bellman equation. The learning method is remarkably computationally efficient: it uses space and arithmetic operations per update. (Footnote 1: We use to hide constant factors and use to hide polylog factors of .) It is model-free in the sense that it directly updates the policy and value vectors without estimating the transition probabilities of the MDP model. We show that the learning method finds an optimal policy with probability using the following sample complexity (number of queries to the ):
where is a parameter that characterizes the range of stationary distributions across policies, and is a uniform upper bound of the mixing times of the Markov decision process under any stationary policy. This sample complexity is optimal in its dependence on .
When the MDP model is explicitly given, the proposed learning method can be used as a randomized algorithm to compute an optimal policy. Given as the input, one can implement with binary-tree data structures using preprocessing time, such that each query to the takes time [25]. In this setting, the learning method outputs an optimal policy with probability in run time This is a sublinear run time in comparison with the input size , as long as
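The binary-tree idea mentioned above can be illustrated with a sum tree, in which each internal node stores the total weight of its subtree. This is a generic sketch under our own naming (`SumTree`, `find`, `update`), not the specific construction of [25]: building the tree takes time linear in the number of weights, while drawing an index or changing one weight takes time logarithmic in it.

```python
import random

class SumTree:
    """Complete binary tree over nonnegative weights; leaves hold the weights."""

    def __init__(self, weights):
        self.n = len(weights)
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        for k in range(self.size - 1, 0, -1):  # fill internal nodes bottom-up
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def total(self) -> float:
        return self.tree[1]

    def update(self, i: int, w: float) -> None:
        """Set weight i to w and repair the sums on the leaf-to-root path."""
        k = self.size + i
        self.tree[k] = float(w)
        k //= 2
        while k >= 1:
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]
            k //= 2

    def find(self, u: float) -> int:
        """Index whose cumulative-weight interval contains u, for 0 <= u < total."""
        k = 1
        while k < self.size:
            left = self.tree[2 * k]
            if u < left:
                k = 2 * k
            else:
                u -= left
                k = 2 * k + 1
        return min(k - self.size, self.n - 1)

    def sample(self, rng=random) -> int:
        """Draw index i with probability weight_i / total."""
        return self.find(rng.random() * self.total())
```

Storing one such tree per state-action pair over next-state probabilities is one way to obtain fast transition sampling from an explicitly given model.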
To the author’s best knowledge, this is the first model-free learning method for infinite-horizon average-reward MDP problems that is based on a primal-dual iteration. Our sample complexity result is the first to characterize the role of the mixing time and the range of stationary distributions, without assuming any discount factor or finite horizon. We also provide the first sublinear run-time result for approximately solving AMDP using randomization.
Outline
Section 2 surveys existing model-free learning methods for MDP and their sample and run-time complexity guarantees. Section 3 states the main assumptions on the ergodic Markov decision processes, the Bellman equation and its linear programming formulations. Section 4 develops the Primal-Dual Learning method from a saddle point formulation of the Bellman equation. Section 5 establishes the convergence analysis and sample complexity of exploration of the Primal-Dual Learning method. Section 6 gives a summary.
Notations
All vectors are considered as column vectors. For a vector , we denote by or its -th component, denote by its transpose, and denote by its Euclidean norm. We denote by the vector with all entries equaling 1, and we denote by the vector with its -th entry equaling and other entries equaling . For a positive number , we denote by the natural logarithm of . For two probability distributions over a finite set , we denote by their Kullback-Leibler divergence, i.e., .
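For reference, the Kullback-Leibler divergence of two finite distributions can be computed as below. The function name and the convention of returning infinity when the first distribution puts mass where the second does not are our own choices for this sketch; the formula is the standard sum of p(x) log(p(x)/q(x)) with the natural logarithm.

```python
import math
from typing import Sequence

def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Standard KL divergence of p from q over a common finite set."""
    total = 0.0
    for px, qx in zip(p, q):
        if px > 0.0:            # terms with p(x) = 0 contribute nothing
            if qx <= 0.0:       # p has mass where q has none
                return math.inf
            total += px * math.log(px / qx)
    return total
```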
2 Related Literature
There are two major notions of complexity for MDP: the run-time complexity and the sample complexity. The run-time complexity is critical to the computational problem where the MDP model is fully specified. It is measured by the total number of arithmetic operations performed by an algorithm. The sample complexity is critical to the reinforcement learning problem where the MDP model is unknown but a sampling oracle () is given. It is measured by the total number of queries to made by an algorithm. Most existing work focuses on only one of the two notions, and for years they were studied as disjoint topics.
The computational complexity of MDP has been studied mainly in the setting where the MDP model is fully specified as the input. Three major deterministic approaches are the value iteration method [2, 24, 18], the policy iteration method [12, 19, 28, 22], and linear programming methods [11, 10, 28, 22, 27]. These deterministic methods inevitably require solving large linear systems. In order to compute the optimal policy exactly or to find an optimal policy in time, these methods all require linear or superlinear time, i.e., the number of arithmetic operations needed is at least linear in the input size . For more detailed surveys on the exact solution methods for MDP, we refer the readers to the textbooks [3, 5, 21, 4] and the references therein.
Randomized versions of the classical methods have been shown to achieve faster run time when are very large. Examples include the randomized primal-dual method by [25] and the variance-reduced randomized value iteration methods by [23]; both apply to the discounted MDP. These methods involve simulating the Markov decision processes and making randomized updates. As long as the input is given in suitable data structures that enable time sampling, these results suggest that it is possible to compute an approximate policy for the discounted MDP in sublinear time (ignoring other parameters). On the other hand, [7] recently showed that the run-time complexity for any randomized algorithm is for the discounted MDP. In the case where each transition can be sampled in time, [7] showed that any randomized algorithm needs run time to produce an optimal policy with high probability. To the author’s best knowledge, existing results on randomized methods only apply to the discounted MDP. It remains unclear how to use randomized algorithms to efficiently approximate the optimal average-reward policy.
The sample complexity of MDP has been studied mainly in the setting of reinforcement learning. In this paper, we are given a that generates state transitions from any specified state-action pair. This is known as the generative model in reinforcement learning, which was introduced and studied in [16, 15]. In this setting, the sample complexity of the MDP is the number of queries to the needed to find an optimal policy (or optimal value in some of the literature) with high probability. One of the earliest reinforcement learning methods is Q-learning, which is essentially a sampling-based variant of value iteration. For infinite-horizon discounted MDP, [16] proved that phased Q-learning takes sample transitions to compute an optimal policy, where the dependence on is left unspecified. Azar, Munos and Kappen [1] considered a model-based value iteration method for the discounted MDP and showed that it takes samples to compute an optimal value vector (not an optimal policy). They also provided a matching sample complexity lower bound for estimating the value vector, but did not give an explicit run-time complexity analysis.
We summarize existing model-free sampling-based methods for MDP and their complexity results in Table 1. Note that the settings and assumptions in these works vary from one to another. We also note that there is a large body of work on the sample complexity of exploration for reinforcement learning, which is the number of suboptimal time steps an algorithm performs on a single infinitely long path of the decision process before it reaches optimality; see [14]. This notion differs from our notion of sample complexity under the and is beyond our current scope. As a result, we do not include these results for comparison in Table 1.
Method  Setting  Sample Complexity  Run-Time Complexity  Space Complexity  Reference
Phased Q-Learning  discount factor, optimal value  [17]
Model-Based Q-Learning  discount factor, optimal value  NA  [1]
Randomized PD  discount factor, optimal policy  [25]
Randomized PD  discount factor, stationary, optimal policy  [25]
Randomized VI  discount factor, optimal policy  [23]
Primal-Dual Learning  stationary, mixing, optimal policy  This Paper
Our proposed algorithm and analysis were partly motivated by the stochastic mirror-prox methods for solving convex-concave saddle point problems [20] and variational inequalities [13]. The idea of stochastic primal-dual update has been used to solve a specific class of minimax bilinear programs in sublinear run time [8]. For the discounted MDP, the work [26] proposed a basic stochastic primal-dual iteration without explicit complexity analysis and later [6] established a sample complexity upper bound . The most relevant prior work is the author’s recent paper [25], which focused on the discounted MDP. The work [25] proposed a randomized mirror-prox method using adaptive transition sampling, which applies to a special saddle point formulation of the Bellman equation. For discounted MDP, it achieved a total run-time/sample complexity of for finding a policy such that . For discounted MDP such that the stationary distribution satisfies stationarity (see Assumption 1 in the current paper), it finds an approximate policy achieving reward from a particular initial distribution with sample size/run time .
In this work, we develop the learning method for the case of undiscounted average-reward MDP. Our approach follows that of [25]; however, our analysis is much more streamlined and applies to the more general undiscounted problems. Without assuming any discount factor, we are able to characterize the complexity upper bound for infinite-horizon MDP using its mixing and stationary properties. Compared to [25], the complexity results achieved in the current paper are much sharper, mainly due to the natural simplicity of average-reward Markov processes. To the author’s best knowledge, our results provide the first sublinear run time for solving infinite-horizon average-reward MDP without any assumption on discount factor or finite horizon.
3 Ergodic MDP, Bellman Equation, and Duality
Consider an AMDP that is described by a tuple . In this paper, we focus on AMDP that is ergodic (aperiodic and recurrent) under any stationary policy. For a stationary policy , we denote by the stationary distribution of the Markov decision process which satisfies We make the following assumptions on the stationary distributions and mixing times:
Assumption 1 (Ergodic Decision Process).
The Markov decision process specified by is stationary in the sense that it is ergodic under any stationary policy and there exists such that
Assumption 1 characterizes a form of complexity of MDP in terms of the range of its stationary distributions. The factor characterizes a notion of complexity of ergodic MDP, i.e., the variation of stationary distributions associated with different policies. Suppose that some policies induce transient states (so the stationary distribution is not bounded away from zero). In this case, as long as there is some policy that leads to an ergodic process, we can restrict our attention to mixture policies in order to guarantee ergodicity. In this way, we can always guarantee that Assumption 1 holds on the restricted problem at the cost of some additional approximation error.
Assumption 2 (Fast Mixing Markov Chains).
The Markov decision process specified by is mixing in the sense that
where is the total variation.
Assumption 2 requires that the Markov chains be sufficiently “rapidly mixing.” The factor characterizes how fast the Markov decision process reaches its stationary distribution from any state under any policy. Our results suggest that the learning method would work extremely well on “rapidly mixing” decision processes where is a small constant. A typical example is autonomous driving, where the previous actions get forgotten quickly. On the other hand, the current form of learning might work poorly for problems such as maze navigation, in which the mixing time can be very large and most policies are non-ergodic. Addressing such problems is left for future improvement.
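For a single fixed policy with known transition matrix, the mixing time can be checked numerically. The sketch below assumes the conventional definition, namely the smallest number of steps after which the worst-case total variation distance to the stationary distribution is at most 1/4; that threshold and the function names are assumptions, since the displayed condition of Assumption 2 is not reproduced here.

```python
from typing import List, Sequence

def tv_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Total variation distance between two distributions on the same set."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def mixing_time(P: List[List[float]], nu: Sequence[float],
                threshold: float = 0.25, max_steps: int = 10_000) -> int:
    """Smallest t >= 1 with max_i TV(P^t[i], nu) <= threshold (assumed 1/4)."""
    n = len(P)
    Pt = [row[:] for row in P]  # holds P^t, starting from t = 1
    for t in range(1, max_steps + 1):
        if max(tv_distance(Pt[i], nu) for i in range(n)) <= threshold:
            return t
        Pt = [[sum(Pt[i][k] * P[k][j] for k in range(n)) for j in range(n)]
              for i in range(n)]
    raise ValueError("chain did not mix within max_steps")
```

For example, a two-state chain that stays put with probability 0.9 needs four steps under this definition, whereas a chain whose rows already equal the stationary distribution needs one. The quantity in Assumption 2 is an upper bound over all stationary policies, so this check covers only one policy at a time.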
Consider an MDP tuple that satisfies Assumptions 1 and 2. For a fixed policy , the average reward is defined as
where is taken over the random state-action trajectory generated by the Markov decision process under policy . Note that the average reward is state-invariant, so we treat it as a scalar.
Bellman Equation
According to the theory of dynamic programming [21, 3], the value is the optimal average reward to the AMDP if and only if it satisfies the following system of equations, known as the Bellman equation, given by
for some vector . A stationary policy is an optimal policy of the AMDP if it attains the elementwise maximization in the Bellman equation (Theorem 8.4.5 [21]). For finite-state AMDP, there always exists at least one optimal policy . If the optimal policy is unique, it is also a deterministic policy. If there are multiple optimal policies, there exist infinitely many optimal randomized policies.
Note that the preceding Bellman equation has a unique optimal solution but infinitely many solutions . In the remainder of this paper, we augment the Bellman equation with an additional linear equality constraint
where is the stationary distribution under policy . Now the augmented Bellman equation has a unique optimal solution. We refer to this unique solution as the difference-of-value vector, and we denote it by throughout the rest of the paper. The difference-of-value vector can be informally defined as
It characterizes the transient effect of each initial state under the optimal policy.
Linear Duality Of The Bellman Equation
The nonlinear Bellman equation is equivalent to the following linear programming problem (see [21] Section 8.8):
(1) 
where is the matrix whose -th entry equals , is the identity matrix with dimension , and is the expected state-transition reward vector under action , i.e.,
We associate each constraint of the primal program (1) with a dual variable . The dual linear program of (1) is
(2) 
It is well known that each deterministic policy of the AMDP corresponds to a basic feasible solution to the dual linear program (2). A randomized policy is a mixture of deterministic policies, so it corresponds to a feasible solution of program (2). We denote by the optimal solution to the dual linear program (2). If there is a unique optimal dual solution, it must be a basic feasible solution. In this case, the basis of corresponds to an optimal deterministic policy.
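For orientation, the textbook linear programming pair for the average-reward problem can be written as follows. The symbols used here ($\bar v$ for the average reward, $h$ for the difference-of-value vector, $P_a$ and $r_a$ for the per-action transition matrix and expected reward vector, $e$ for the all-ones vector, $\mu_a$ for the dual variables) are assumptions in the style of [21, Section 8.8], and the display is a sketch rather than a verbatim restatement of programs (1) and (2).

```latex
% Primal (value) program: one vector constraint per action a.
\min_{\bar v,\, h}\ \bar v
\quad \text{s.t.} \quad
\bar v \cdot e + (I - P_a)\, h - r_a \ \ge\ 0, \qquad \forall\, a \in \mathcal{A}.

% Dual (policy) program: one multiplier vector \mu_a \ge 0 per action.
\max_{\mu}\ \sum_{a \in \mathcal{A}} \mu_a^{\top} r_a
\quad \text{s.t.} \quad
\sum_{a \in \mathcal{A}} (I - P_a)^{\top} \mu_a = 0, \qquad
\sum_{a \in \mathcal{A}} e^{\top} \mu_a = 1, \qquad
\mu_a \ge 0 .
```

In this form the dual constraints say that the state-action frequencies are invariant under the transition dynamics and sum to one, which is why feasible dual solutions can be read as stationary state-action distributions of randomized policies.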
4 Primal-Dual Learning
In this section, we develop the Primal-Dual Learning Method ( learning for short). Our first step is to examine the nonlinear Bellman equation and formulate it into a bilinear saddle point problem with specially chosen primal and dual constraints. Our second step is to develop the Primal-Dual Learning method and discuss its implementation and run-time complexity per iteration.
4.1 Saddle Point Formulation of Bellman Equation
In light of linear duality, we formulate the linear programs (1)-(2) into an equivalent minimax problem, given by
The minimax formulation is preferable to the linear program formulation because it has much simpler constraints. We construct the sets and to be the search spaces for the value and the policy, respectively, given by
and
Since , we simplify the minimax problem to
(3) 
The search space for the dual vector given by essentially reflects Assumption 1. Recall that Assumption 1 suggests that the stationary distribution of any policy belongs to a certain range; therefore it is sufficient to search for the dual variable within that range. The search space for the difference-of-value vector given by essentially reflects Assumption 2 on the fast mixing property of the MDP. The fast mixing condition implies that one can move from any state to any other state within a bounded number of steps; therefore the relative difference in their values is bounded by the expected traverse time. In what follows, we verify that and under Assumptions 1 and 2.
Lemma 1.
Proof. Since is the average reward under and each reward per period belongs to , we obtain that
Let be the transition probability matrix under . Let be the stationary distribution under , so the difference-of-value vector satisfies . Let be the matrix with all rows equal to , therefore . Letting , we have for all , therefore . We apply the relation inductively for times, use and obtain
We take on both sides of the above, use the triangle inequality and obtain
It follows that and .
4.2 The Primal-Dual Learning Algorithm
Motivated by the minimax formulation of the Bellman equation, we propose the Primal-Dual Learning method as follows: The learning method makes iterative updates to a sequence of primal and dual variables . At the -th iteration, the algorithm draws a random state-action pair with probability and queries the for a state transition to a random next state with probability . Then the learning method updates according to
(4) 
where “” denotes elementwise multiplication, denotes the Euclidean projection onto , are random vectors generated conditioned on according to
(5) 
where we use to denote the collection of all random variables up to the th iteration. We note that is a vector of dimension but it only has one single nonzero entry. Similarly, is a vector of dimension but it has only two nonzero entries, whose coordinates are randomly generated by sampling a single state transition. We can easily verify that
and
In other words, the primal and dual updates are conditionally unbiased partial derivatives of the minimax objective.
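The principle behind such sparse unbiased estimates can be illustrated with a generic importance-weighting example. This is not the estimator in (5), whose exact form is not reproduced here: it only shows that drawing one coordinate with a known probability and dividing the corresponding entry by that probability yields a one-nonzero vector whose expectation is the full vector. All names are hypothetical.

```python
from typing import List, Sequence

def single_coordinate_estimate(g: Sequence[float], mu: Sequence[float],
                               k: int) -> List[float]:
    """Vector with one nonzero entry g[k] / mu[k], where k was drawn w.p. mu[k]."""
    est = [0.0] * len(g)
    est[k] = g[k] / mu[k]
    return est

def expected_estimate(g: Sequence[float], mu: Sequence[float]) -> List[float]:
    """Exact expectation of the estimate over the sampling distribution mu."""
    n = len(g)
    expectation = [0.0] * n
    for k in range(n):
        if mu[k] > 0.0:
            est = single_coordinate_estimate(g, mu, k)
            for i in range(n):
                expectation[i] += mu[k] * est[i]
    return expectation
```

Each draw costs constant work regardless of the dimension, which is the source of the cheap per-iteration updates; the price is a variance that grows when some sampling probabilities are small.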
4.3 Implementations and Fast Updates
Let us consider how to implement the learning method in order to minimize the run time per iteration. We define the auxiliary variables , such that
Note that is a vector of probability over states, and is a randomized stationary policy that specifies the probability distribution for choosing actions at each given state. We implement the Primal-Dual Learning method given by iteration (5) in Algorithm 1.
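One natural reading of this decomposition, stated here as an assumption since the defining display is not reproduced, is that the dual variable is a joint distribution over state-action pairs, split into a marginal over states and a conditional distribution over actions. A minimal sketch with hypothetical names:

```python
from typing import List, Tuple

def split_dual(mu: List[List[float]]) -> Tuple[List[float], List[List[float]]]:
    """Split a joint distribution mu[i][a] into (state marginal, policy).

    Assumed relation: mu[i][a] = xi[i] * pi[i][a]. States with zero mass
    receive a uniform action distribution, an arbitrary convention.
    """
    xi = [sum(row) for row in mu]
    pi = []
    for row, mass in zip(mu, xi):
        if mass > 0.0:
            pi.append([x / mass for x in row])
        else:
            pi.append([1.0 / len(row)] * len(row))
    return xi, pi
```

Under a lower bound on the state marginal of the kind suggested by Assumption 1, the zero-mass branch would never be taken.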
Now we analyze the computational complexity of Algorithm 1. Each iteration draws one state-action-state triplet from the . The updates on are made to two coordinates, thus taking time. The updates on are multiplicative, which take time if is represented using convenient data structures like binary trees (see Prop. 1 of [25]). The updates on involve information projection onto the set . This can be done by maintaining and updating the shifted vector using a binary-tree structure. This idea is also used in the algorithm implementation of [25]. Accordingly, Step 10 of Algorithm 1 takes run time. To sum up, each iteration of Algorithm 1 draws one sample transition and makes updates in time. The space complexity of Algorithm 1 is space, mainly to keep track of and its running average.
5 Sample Complexity and Run Time Analysis
In this section, we establish the sample complexity for the Primal-Dual Learning method given by Algorithm 1. We also show that Algorithm 1 applies to the computation problem of MDP and gives a sublinear run-time algorithm.
5.1 PrimalDual Convergence
Each iteration of Algorithm 1 performs a primal-dual update for the minimax problem (3). Our first result concerns the convergence of the primal-dual iteration.
Theorem 1 (Finite-Iteration Duality Gap).
Let be an arbitrary MDP tuple satisfying Assumptions 1, 2. Then the sequence of iterates generated by Algorithm 1 satisfies
where for , .
Theorem 1 establishes a finite-time error bound of a particular “duality gap.” It characterizes the level of violation of the linear complementarity condition. Our proof is similar in spirit to that of Theorem 1 in [25]. Note that the analysis of [25] does not easily extend to the average-reward MDP and the learning method. As a result, we have to develop a separate new convergence analysis. The complete proof is established through a series of lemmas, which we defer to the Appendix.
5.2 Sample Complexity for Achieving Optimal Policies
We have shown that the expected duality gap diminishes at a certain rate as Algorithm 1 iterates. It remains to analyze how many time steps are needed for the duality gap to become sufficiently small, and how a small duality gap would imply a near-optimal policy. We obtain the following result.
Lemma 2.
For any policy , its stationary distribution and average reward satisfy
and
Proof. Consider an arbitrary policy . Let be the stationary distribution under policy , so we have . Then we obtain the first result
Using the fact that , we obtain the second result.
Now we are ready to show that the learning method outputs an approximate policy whose average reward is close to the optimal average reward. Our second main result is as follows.
Theorem 2 (Sample Complexity of Single-Run Learning (Algorithm 1)).
Let be an arbitrary MDP tuple satisfying Assumptions 1, 2, let . Then by letting Algorithm 1 run for the following number of iterations/samples
it outputs an approximate policy such that with probability at least .
Proof. Consider the policy given by Note that (by Assumption 1) and (since ). Then we have
According to Lemma 2, we have
where the inequality uses the fact for all (due to the dual constraint ) and the primal feasibility for all . We use the Markov inequality and obtain that
with probability at least . Now if we pick and apply Theorem 1, we obtain that with probability at least
5.3 Boosting The Success Probability to
Our next aim is to achieve an optimal policy with probability that is arbitrarily close to 1. To do this, we need to run Algorithm 1 for sufficiently many trials and pick the best outcome. This requires us to be able to evaluate multiple candidate policies and select the best one out of many. In the next lemma, we show that it is possible to approximately evaluate any policy within precision using samples.
Lemma 3 (Approximate Policy Evaluation).
There exists an algorithm that outputs an approximate value such that with probability at least in time steps.
Proof. Consider the algorithm that generates a sequence of consecutive state transitions according to the and outputs the empirical mean reward, which we denote by . Note that is the empirical mean of Markov random variables in . We apply the McDiarmid inequality for Markov chains to the -step empirical reward and obtain
When , we have with probability at least
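The evaluation procedure in the proof can be sketched as a single simulated trajectory. The callable `sample(state, action)` stands for one oracle query returning a next state and a reward; the function name, the argument layout, and the seeded generator are assumptions made for this illustration, and the number of steps needed for a given precision is the one stated in Lemma 3, not computed here.

```python
import random
from typing import Callable, List, Tuple

def evaluate_policy(sample: Callable[[int, int], Tuple[int, float]],
                    pi: List[List[float]],
                    start_state: int,
                    num_steps: int,
                    seed: int = 0) -> float:
    """Empirical mean reward over num_steps consecutive transitions under pi."""
    rng = random.Random(seed)
    state, total = start_state, 0.0
    for _ in range(num_steps):
        # Draw an action from the policy's distribution at the current state.
        action = rng.choices(range(len(pi[state])), weights=pi[state], k=1)[0]
        state, reward = sample(state, action)  # one query to the oracle
        total += reward
    return total / num_steps
```

The samples along one trajectory are dependent, which is why the proof invokes a concentration inequality for Markov chains rather than one for independent draws.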
Now we prove that by repeatedly running Algorithm 1 and using approximate policy evaluation, one can compute a near-optimal policy with probability arbitrarily close to 1. The main arguments are (1) the best policy out of multiple trials must be close-to-optimal with high probability; (2) the policy evaluation is nearly accurate with high probability, therefore the output policy (which performs the best in policy evaluation) is also close-to-optimal. Our main result is as follows.
Theorem 3 (Overall Sample Complexity).
Let be an arbitrary MDP tuple satisfying Assumptions 1, 2 and let and be arbitrary values. Then there exists an algorithm that draws the following number of state transitions
and outputs an approximate policy such that with probability at least .
Proof. We describe an approach that runs Algorithm 1 multiple times in order to achieve an optimal policy with probability :
1. We first run Algorithm 1 for independent trials with precision parameter , and we denote the output policies by . The total running time is , where is the number of samples needed by Algorithm 1. According to Theorem 2, each trial generates an optimal policy with probability at least .
2. For each output policy , we conduct approximate value evaluation for time steps and obtain an approximate evaluation with precision level and fail probability . According to Lemma 3, we have
with probability at least , and this step takes time steps.
3. Output such that .
The number of samples required by the above procedure is . The space complexity is
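The three steps above can be sketched as a generic best-of-several wrapper. Here `run_trial` stands for one independent run of the learning method and `evaluate` for the approximate policy evaluation; both are hypothetical callables supplied by the caller, and the number of trials needed for a target failure probability is the one given in the proof, not derived in the code.

```python
from typing import Callable, Tuple, TypeVar

Policy = TypeVar("Policy")

def boost(run_trial: Callable[[], Policy],
          evaluate: Callable[[Policy], float],
          num_trials: int) -> Tuple[Policy, float]:
    """Run independent trials and keep the policy with the best estimated value."""
    best_policy, best_value = None, float("-inf")
    for _ in range(num_trials):
        policy = run_trial()          # step 1: one independent run
        value = evaluate(policy)      # step 2: approximate evaluation
        if value > best_value:        # step 3: keep the current argmax
            best_policy, best_value = policy, value
    return best_policy, best_value
```

Evaluating each candidate as soon as it is produced means only the current best policy needs to be stored, rather than all candidates at once.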
Now we verify that is indeed near-optimal with probability at least , as long as is chosen appropriately. Let