Provable Self-Play Algorithms for Competitive Reinforcement Learning
Abstract
Self-play, where the algorithm learns by playing against itself without requiring any direct supervision, has become the new weapon in modern Reinforcement Learning (RL) for achieving superhuman performance in practice. However, the majority of existing theory in reinforcement learning only applies to the setting where the agent plays against a fixed environment. It remains largely open whether self-play algorithms can be provably effective, especially when it is necessary to manage the exploration/exploitation tradeoff.
We study self-play in competitive reinforcement learning under the setting of Markov games, a generalization of Markov decision processes to the two-player case. We introduce a self-play algorithm, Value Iteration with Upper/Lower Confidence Bound (VI-ULCB), and show that it achieves regret $\tilde{\mathcal{O}}(\sqrt{T})$ after playing $T$ steps of the game. The regret is measured by the agent's performance against a fully adversarial opponent who can exploit the agent's strategy at any step. We also introduce an explore-then-exploit style algorithm, which achieves a slightly worse regret of $\tilde{\mathcal{O}}(T^{2/3})$, but is guaranteed to run in polynomial time even in the worst case. To the best of our knowledge, our work presents the first line of provably sample-efficient self-play algorithms for competitive reinforcement learning.
1 Introduction
This paper studies competitive reinforcement learning (competitive RL), that is, reinforcement learning with two or more agents taking actions simultaneously, but each maximizing their own reward. Competitive RL is a major branch of the more general setting of multi-agent reinforcement learning (MARL), with the specification that the agents have conflicting rewards (so that they essentially compete with each other) yet can be trained in a centralized fashion (i.e., each agent has access to the other agents' policies) (Crandall and Goodrich, 2005).
There has been substantial recent progress in competitive RL, in particular in solving hard multi-player games such as GO (Silver et al., 2017), Starcraft (Vinyals et al., 2019), and Dota 2 (OpenAI, 2018). A key highlight of these approaches is the successful use of self-play for achieving superhuman performance in the absence of human knowledge or expert opponents. These self-play algorithms are able to learn a good policy for all players from scratch by repeatedly playing the current policies against each other and performing policy updates using these self-played game trajectories. The empirical success of self-play has challenged the conventional wisdom that expert opponents are necessary for achieving good performance, and calls for a better theoretical understanding.
In this paper, we take initial steps towards understanding the effectiveness of self-play algorithms in competitive RL from a theoretical perspective. We focus on the special case of two-player zero-sum Markov games (Shapley, 1953; Littman, 1994), a generalization of Markov Decision Processes (MDPs) to the two-player setting. In a Markov game, the two players share states, play actions simultaneously, and observe the same reward. However, one player aims to maximize the return while the other aims to minimize it. This setting covers the majority of two-player games including GO (there is a single reward at the end of the game indicating which player has won), and also generalizes zero-sum matrix games (von Neumann, 1928), an important game-theoretic problem, into the multi-step (RL) case.
More concretely, the goal of this paper is to design low-regret algorithms for solving episodic two-player Markov games in the general setting (Kearns and Singh, 2002), that is, the algorithm is allowed to play the game for a fixed number of episodes using arbitrary policies, and its performance is measured in terms of the regret. We consider a strong notion of regret for two-player zero-sum games, where the performance of the deployed policies in each episode is measured against the best response for that policy, which can be different in different episodes. Such a regret bound measures the algorithm's ability to manage the exploration and exploitation tradeoff against fully adaptive opponents, and can directly translate to other types of guarantees such as the PAC sample complexity bound.
Settings | Algorithm | Regret | PAC | Runtime
General Markov Game | VI-ULCB (Theorem 3.2) | | | PPAD-complete
 | VI-explore (Theorem 4) | | | Polynomial
 | Mirror Descent ($H=1$) (Rakhlin and Sridharan, 2013) | | |
Turn-Based Markov Game | VI-ULCB (Corollary 3.3) | | |
 | Mirror Descent ($H=2$) (Theorem 5) | | |
Both | Lower Bound (Corollary 5) | | |
Our contribution
This paper introduces the first line of provably sample-efficient self-play algorithms for zero-sum Markov games under no restrictive assumptions. Concretely,

We introduce the first self-play algorithm with $\tilde{\mathcal{O}}(\sqrt{T})$ regret for zero-sum Markov games. More specifically, it achieves $\tilde{\mathcal{O}}(\sqrt{H^3 S^2 A B T})$ regret in the general case, where $H$ is the length of the game, $S$ is the number of states, $A, B$ are the number of actions for each player, and $T$ is the total number of steps played. In the special case of turn-based games, it achieves $\tilde{\mathcal{O}}(\sqrt{H^3 S^2 (A+B) T})$ regret with guaranteed polynomial runtime.

We also introduce an explore-then-exploit style algorithm. It has guaranteed polynomial runtime in the general setting of zero-sum Markov games, with a slightly worse $\tilde{\mathcal{O}}(T^{2/3})$ regret.

We raise the open question about the optimal dependency of the regret on $S, A, B$. We provide a lower bound $\Omega(\sqrt{H^2 S (A+B) T})$, and show that the lower bound can be achieved in the simple case of two-step turn-based games by a mirror descent style algorithm.
The above results are summarized in Table 1.
1.1 Related Work
There is a fast-growing body of work on multi-agent reinforcement learning (MARL). Many of these works achieve striking empirical performance, or attack MARL in the cooperative setting, where agents are optimizing for a shared or similar reward. We refer the readers to several recent surveys for these results (see e.g. Buşoniu et al., 2010; Nguyen et al., 2018; OroojlooyJadid and Hajinezhad, 2019; Zhang et al., 2019). In the rest of this section we focus on theoretical results related to competitive RL.
Markov games
Markov games (or stochastic games) were proposed as a mathematical model for competitive RL back in the early 1950s (Shapley, 1953). There is a long line of classical work since then on solving this problem (see e.g. Littman, 1994, 2001; Hu and Wellman, 2003; Hansen et al., 2013). These works design algorithms, possibly with runtime guarantees, to find optimal policies in Markov games when both the transition matrix and reward are known, or in the asymptotic setting where the amount of data goes to infinity. These results do not directly apply to the non-asymptotic setting where the transition and reward are unknown and only a limited amount of data is available for estimating them.
A few recent works tackle self-play algorithms for Markov games in the non-asymptotic setting, working under either structural assumptions about the game or stronger sampling oracles. Wei et al. (2017) propose an upper confidence algorithm for stochastic games and prove that a self-play style algorithm finds $\epsilon$-optimal policies in $\tilde{\mathcal{O}}(\mathrm{poly}(1/\epsilon))$ samples. Jia et al. (2019); Sidford et al. (2019) study turn-based stochastic games, a special case of general Markov games, and propose algorithms with near-optimal sample complexity. However, both lines of work make strong assumptions, on either the structure of Markov games or how we access data, that are not always true in practice. Specifically, Wei et al. (2017) assume that no matter what strategy one agent sticks to, the other agent can always reach all states by playing a certain policy, and Jia et al. (2019); Sidford et al. (2019) assume access to simulators (or generative models) which enable the agent to directly sample transition and reward information for any state-action pair. These assumptions greatly alleviate the challenge in exploration. In contrast, our results apply to general Markov games without further structural assumptions, and our algorithms have built-in mechanisms for solving the challenge in the exploration-exploitation tradeoff.
Finally, we note that the classical R-max algorithm (Brafman and Tennenholtz, 2002) does not make restrictive assumptions. It also has provable guarantees even when playing against an adversarial opponent in a Markov game. However, the theoretical guarantee in Brafman and Tennenholtz (2002) is weaker than the standard regret, and does not directly imply any self-play algorithm with a regret bound in our setting (see Section E for more details).
Adversarial MDP
Another line of related work focuses on provable algorithms against adversarial opponents in MDPs. Most work in this line considers the setting with adversarial rewards (see e.g. Zimin and Neu, 2013; Rosenberg and Mansour, 2019; Jin et al., 2019). These results do not directly imply provable self-play algorithms in our setting, because the adversarial opponent in Markov games can affect both the reward and the transition. There exist a few works that tackle both adversarial transition functions and adversarial rewards (Yu and Mannor, 2009; Cheung et al., 2019; Lykouris et al., 2019). In particular, Lykouris et al. (2019) consider a stochastic problem with $C$ episodes arbitrarily corrupted and obtain $\tilde{\mathcal{O}}(C\sqrt{T} + C^2)$ regret. When applying these results to Markov games with an adversarial opponent, $C$ can be $\Theta(T)$ without further assumptions, which makes the bound vacuous.
Single-agent RL
There is an extensive body of research on the sample efficiency of reinforcement learning in the single-agent setting (see e.g. Jaksch et al., 2010; Osband et al., 2014; Azar et al., 2017; Dann et al., 2017; Strehl et al., 2006; Jin et al., 2018), which is studied under the model of Markov decision processes, a special case of Markov games. For the tabular episodic setting with non-stationary dynamics and no simulators, the best regrets achieved by existing model-based and model-free algorithms are $\tilde{\mathcal{O}}(\sqrt{H^2 S A T})$ (Azar et al., 2017) and $\tilde{\mathcal{O}}(\sqrt{H^3 S A T})$ (Jin et al., 2018), respectively, where $S$ is the number of states, $A$ is the number of actions, $H$ is the length of each episode, and $T$ is the total number of steps played. Both of them (nearly) match the minimax lower bound $\Omega(\sqrt{H^2 S A T})$ (Jaksch et al., 2010; Osband and Van Roy, 2016; Jin et al., 2018).
2 Preliminaries
In this paper, we consider zero-sum Markov Games (MG) (Shapley, 1953; Littman, 1994), which are also known as stochastic games in the literature. Zero-sum Markov games are a generalization of standard Markov Decision Processes (MDPs) to the two-player setting, in which the max-player seeks to maximize the total return and the min-player seeks to minimize the total return.
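Formally, we consider tabular episodic zero-sum Markov games of the form $\mathrm{MG}(H, \mathcal{S}, \mathcal{A}, \mathcal{B}, \mathbb{P}, r)$, where

$H$ is the number of steps in each episode.

$\mathcal{S} = \bigcup_{h \in [H+1]} \mathcal{S}_h$, and $\mathcal{S}_h$ is the set of states at step $h$, with $\max_{h \in [H+1]} |\mathcal{S}_h| \le S$.

$\mathcal{A} = \bigcup_{h \in [H]} \mathcal{A}_h$, and $\mathcal{A}_h$ is the set of actions of the max-player at step $h$, with $\max_{h \in [H]} |\mathcal{A}_h| \le A$.

$\mathcal{B} = \bigcup_{h \in [H]} \mathcal{B}_h$, and $\mathcal{B}_h$ is the set of actions of the min-player at step $h$, with $\max_{h \in [H]} |\mathcal{B}_h| \le B$.

$\mathbb{P} = \{\mathbb{P}_h\}_{h \in [H]}$ is a collection of transition matrices, so that $\mathbb{P}_h(\cdot \mid s, a, b)$ gives the distribution over states if action pair $(a, b)$ is taken for state $s$ at step $h$.

$r = \{r_h\}_{h \in [H]}$ is a collection of reward functions, and $r_h : \mathcal{S}_h \times \mathcal{A}_h \times \mathcal{B}_h \to [0, 1]$ is the deterministic reward function at step $h$. Note that we are assuming that rewards are in $[0, 1]$ for normalization.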
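In each episode of this MG, an initial state $s_1$ is picked arbitrarily by an adversary. Then, at each step $h \in [H]$, both players observe state $s_h \in \mathcal{S}_h$, pick the actions $a_h \in \mathcal{A}_h$, $b_h \in \mathcal{B}_h$ simultaneously, receive reward $r_h(s_h, a_h, b_h)$, and then transition to the next state $s_{h+1} \sim \mathbb{P}_h(\cdot \mid s_h, a_h, b_h)$. The episode ends when $s_{H+1}$ is reached.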
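The interaction protocol above can be sketched in a few lines of Python. This is only an illustrative sketch: the dictionary layout, the function name `play_episode`, and the toy game are assumptions made for the example, and steps are indexed from 0 rather than 1.

```python
import random

def play_episode(H, P, r, mu, nu, s1, rng):
    """Roll out one episode and return a list of (state, a, b, reward) tuples.

    Hypothetical layout: P[h][(s, a, b)] maps next_state -> probability,
    r[h][(s, a, b)] is a reward in [0, 1], and mu[h][s] / nu[h][s] map
    action -> probability for the max-player / min-player.
    """
    def sample(dist):
        outcomes = list(dist.keys())
        weights = [dist[o] for o in outcomes]
        return rng.choices(outcomes, weights=weights)[0]

    s, trajectory = s1, []
    for h in range(H):
        a = sample(mu[h][s])          # both players act simultaneously
        b = sample(nu[h][s])
        trajectory.append((s, a, b, r[h][(s, a, b)]))
        s = sample(P[h][(s, a, b)])   # move to the next state
    return trajectory

# Toy game: reward 1 when the two actions match, next state is (a + b) mod 2.
H, states, A, B = 2, [0, 1], [0, 1], [0, 1]
P = [{(s, a, b): {(a + b) % 2: 1.0} for s in states for a in A for b in B}
     for _ in range(H)]
r = [{(s, a, b): 1.0 if a == b else 0.0 for s in states for a in A for b in B}
     for _ in range(H)]
uniform = {0: 0.5, 1: 0.5}
mu = [{s: uniform for s in states} for _ in range(H)]
nu = [{s: uniform for s in states} for _ in range(H)]
traj = play_episode(H, P, r, mu, nu, s1=0, rng=random.Random(0))
```

In a self-play algorithm the same learner would supply both `mu` and `nu`.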
Policy and value function
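A policy $\mu$ of the max-player is a collection of $H$ functions $\{\mu_h : \mathcal{S} \to \Delta_{\mathcal{A}_h}\}_{h \in [H]}$, where $\Delta_{\mathcal{A}_h}$ is the probability simplex over action set $\mathcal{A}_h$. Similarly, a policy $\nu$ of the min-player is a collection of $H$ functions $\{\nu_h : \mathcal{S} \to \Delta_{\mathcal{B}_h}\}_{h \in [H]}$. We use the notation $\mu_h(a \mid s)$ and $\nu_h(b \mid s)$ to represent the probability of taking action $a$ or $b$ for state $s$ at step $h$ under policy $\mu$ or $\nu$ respectively. We use $V_h^{\mu,\nu} : \mathcal{S}_h \to \mathbb{R}$ to denote the value function at step $h$ under policy $\mu$ and $\nu$, so that $V_h^{\mu,\nu}(s)$ gives the expected cumulative rewards received under policy $\mu$ and $\nu$, starting from $s_h = s$, until the end of the episode:
$$V_h^{\mu,\nu}(s) := \mathbb{E}_{\mu,\nu}\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\middle|\, s_h = s\right].$$
We also define $Q_h^{\mu,\nu} : \mathcal{S}_h \times \mathcal{A}_h \times \mathcal{B}_h \to \mathbb{R}$ to denote the Q-value function at step $h$ so that $Q_h^{\mu,\nu}(s, a, b)$ gives the expected cumulative rewards received under policy $\mu$ and $\nu$, starting from $s_h = s, a_h = a, b_h = b$, until the end of the episode:
$$Q_h^{\mu,\nu}(s, a, b) := \mathbb{E}_{\mu,\nu}\left[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}, b_{h'}) \,\middle|\, s_h = s, a_h = a, b_h = b\right].$$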
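For simplicity, we use the operator notation $\mathbb{P}_h$ so that $[\mathbb{P}_h V](s, a, b) := \mathbb{E}_{s' \sim \mathbb{P}_h(\cdot \mid s, a, b)} V(s')$ for any value function $V$. By definition of the value functions, for all $(s, a, b, h) \in \mathcal{S}_h \times \mathcal{A}_h \times \mathcal{B}_h \times [H]$, we have the Bellman equations
$$Q_h^{\mu,\nu}(s, a, b) = (r_h + \mathbb{P}_h V_{h+1}^{\mu,\nu})(s, a, b), \tag{1}$$
$$V_h^{\mu,\nu}(s) = \sum_{a, b} \mu_h(a \mid s)\, \nu_h(b \mid s)\, Q_h^{\mu,\nu}(s, a, b), \tag{2}$$
where we define $V_{H+1}^{\mu,\nu}(s) = 0$ for all $s \in \mathcal{S}_{H+1}$.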
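As an illustration, the Bellman equations for a fixed policy pair can be evaluated by backward induction from the last step. The sketch below is not taken from the paper; the function name `evaluate`, the dictionary layout, and the toy game are assumptions chosen for the example, with steps indexed from 0.

```python
def evaluate(H, states, A, B, P, r, mu, nu):
    """Backward induction for the value and Q-value functions of (mu, nu).

    Hypothetical layout: P[h][(s, a, b)] maps next_state -> probability,
    r[h][(s, a, b)] is the reward, mu[h][s][a] and nu[h][s][b] are
    action probabilities.
    """
    V = [dict() for _ in range(H + 1)]
    V[H] = {s: 0.0 for s in states}       # value after the last step is zero
    Q = [dict() for _ in range(H)]
    for h in reversed(range(H)):
        for s in states:
            for a in A:
                for b in B:
                    # Q = reward + expected next-step value
                    nxt = sum(p * V[h + 1][s2]
                              for s2, p in P[h][(s, a, b)].items())
                    Q[h][(s, a, b)] = r[h][(s, a, b)] + nxt
            # V = Q averaged over both players' action distributions
            V[h][s] = sum(mu[h][s][a] * nu[h][s][b] * Q[h][(s, a, b)]
                          for a in A for b in B)
    return V, Q

# Toy game: reward 1 when the actions match; both players play uniformly.
H, states, A, B = 2, [0, 1], [0, 1], [0, 1]
P = [{(s, a, b): {(a + b) % 2: 1.0} for s in states for a in A for b in B}
     for _ in range(H)]
r = [{(s, a, b): 1.0 if a == b else 0.0 for s in states for a in A for b in B}
     for _ in range(H)]
uniform = {0: 0.5, 1: 0.5}
mu = [{s: uniform for s in states} for _ in range(H)]
nu = [{s: uniform for s in states} for _ in range(H)]
V, Q = evaluate(H, states, A, B, P, r, mu, nu)
```

In this toy game each step yields expected reward 0.5 under uniform play, so the value at the first step is 1.0 from either state.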
Best response and regret
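We now define the notion of best response and review some of its basic properties (cf. Filar and Vrieze, 2012), which will motivate our definition of the regret in two-player Markov games. For any max-player strategy $\mu$, there exists a best response of the min-player, which is a policy $\nu^\dagger(\mu)$ satisfying $V_h^{\mu,\nu^\dagger(\mu)}(s) = \inf_{\nu} V_h^{\mu,\nu}(s)$ for any $(s, h)$. For simplicity, we denote $V_h^{\mu,\dagger} := V_h^{\mu,\nu^\dagger(\mu)}$. By symmetry, we can define the best response of the max-player $\mu^\dagger(\nu)$, and define $V_h^{\dagger,\nu}$. The value functions $V_h^{\mu,\dagger}$ and $V_h^{\dagger,\nu}$ satisfy the following Bellman optimality equations:
$$V_h^{\mu,\dagger}(s) = \inf_{\nu \in \Delta_{\mathcal{B}_h}} \sum_{a, b} \mu_h(a \mid s)\, \nu(b)\, Q_h^{\mu,\dagger}(s, a, b), \tag{3}$$
$$V_h^{\dagger,\nu}(s) = \sup_{\mu \in \Delta_{\mathcal{A}_h}} \sum_{a, b} \mu(a)\, \nu_h(b \mid s)\, Q_h^{\dagger,\nu}(s, a, b). \tag{4}$$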
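A minimal sketch of how the best-response value against a fixed max-player policy could be computed: because the objective is linear in the min-player's action distribution, the infimum at each state is attained by a single action, so a plain `min` over actions suffices. The function name `best_response_value`, the data layout, and the toy game are illustrative assumptions, not the paper's implementation.

```python
def best_response_value(H, states, A, B, P, r, mu):
    """Value of max-player policy mu against a best-responding min-player."""
    V = [dict() for _ in range(H + 1)]
    V[H] = {s: 0.0 for s in states}
    for h in reversed(range(H)):
        for s in states:
            def q(a, b):
                # Q-value under the best response from step h + 1 onwards
                nxt = sum(p * V[h + 1][s2]
                          for s2, p in P[h][(s, a, b)].items())
                return r[h][(s, a, b)] + nxt
            # a linear objective over the simplex is minimized at a vertex
            V[h][s] = min(sum(mu[h][s][a] * q(a, b) for a in A) for b in B)
    return V

# Toy game: reward 1 when the actions match; the max-player leans on action 0.
H, states, A, B = 2, [0, 1], [0, 1], [0, 1]
P = [{(s, a, b): {(a + b) % 2: 1.0} for s in states for a in A for b in B}
     for _ in range(H)]
r = [{(s, a, b): 1.0 if a == b else 0.0 for s in states for a in A for b in B}
     for _ in range(H)]
biased = {0: 0.8, 1: 0.2}
mu = [{s: biased for s in states} for _ in range(H)]
V_br = best_response_value(H, states, A, B, P, r, mu)
```

Here the min-player always answers with action 1, holding the max-player to 0.2 per step, so the two-step best-response value is 0.4. The symmetric quantity for a fixed min-player policy follows by swapping `min` for `max` and the roles of the two players.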
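It is further known that there exist policies $\mu^\star$, $\nu^\star$ that are optimal against the best responses of the opponent:
$$V_h^{\mu^\star,\dagger}(s) = \sup_{\mu} V_h^{\mu,\dagger}(s), \qquad V_h^{\dagger,\nu^\star}(s) = \inf_{\nu} V_h^{\dagger,\nu}(s), \qquad \text{for all } (s, h).$$
It is also known that, for any $(s, h)$, the minimax theorem holds:
$$\sup_{\mu} \inf_{\nu} V_h^{\mu,\nu}(s) = V_h^{\mu^\star,\nu^\star}(s) = \inf_{\nu} \sup_{\mu} V_h^{\mu,\nu}(s).$$
Therefore, the optimal strategies $(\mu^\star, \nu^\star)$ are also the Nash equilibrium for the Markov game. Based on this, it is sensible to measure the suboptimality of any pair of policies $(\mu, \nu)$ using the gap between their performance and the performance of the optimal strategy when playing against the best responses respectively, i.e., writing $V_h^\star := V_h^{\mu^\star,\nu^\star}$,
$$V_h^{\dagger,\nu}(s) - V_h^{\mu,\dagger}(s) = \left[V_h^{\dagger,\nu}(s) - V_h^{\star}(s)\right] + \left[V_h^{\star}(s) - V_h^{\mu,\dagger}(s)\right]. \tag{5}$$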
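We make this formal in the following definition of the regret.
Definition (Regret). For any algorithm that plays the Markov game for $K$ episodes with (potentially adversarial) starting state $s_1^k$ for each episode $k = 1, 2, \dots, K$, the regret is defined as
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left[V_1^{\dagger,\nu^k}(s_1^k) - V_1^{\mu^k,\dagger}(s_1^k)\right],$$
where $(\mu^k, \nu^k)$ denote the policies deployed by the algorithm in the $k$-th episode.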
We note that as a unique feature of self-play algorithms, the learner is playing against herself, and thus chooses strategies for both the max-player and the min-player at each episode.
2.1 Turn-based games
In zero-sum Markov games, each step involves the two players playing simultaneously and independently. It is a general framework, which contains a very important special case: turn-based games (Shapley, 1953; Jia et al., 2019).
The main feature of a turn-based game is that only one player is taking actions in each step; in other words, the max and min player take turns to play the game. Formally, a turn-based game can be defined through a partition of the steps $[H]$ into two sets $\mathcal{H}_{\max}$ and $\mathcal{H}_{\min}$, where $\mathcal{H}_{\max}$ and $\mathcal{H}_{\min}$ denote the sets of steps in which the max-player and the min-player choose the actions respectively, which satisfy $\mathcal{H}_{\max} \cup \mathcal{H}_{\min} = [H]$ and $\mathcal{H}_{\max} \cap \mathcal{H}_{\min} = \emptyset$. As a special example, GO is a turn-based game in which the two players play in alternate turns, i.e. $\mathcal{H}_{\max} = \{1, 3, 5, \dots\}$ and $\mathcal{H}_{\min} = \{2, 4, 6, \dots\}$.
Mathematically, we can specialize general zero-sum Markov games to turn-based games by restricting $\mathcal{A}_h = \{\bar{a}\}$ for all $h \in \mathcal{H}_{\min}$, and $\mathcal{B}_h = \{\bar{b}\}$ for all $h \in \mathcal{H}_{\max}$, where $\bar{a}$ and $\bar{b}$ are special dummy actions. Consequently, in those steps, $\mathcal{A}_h$ or $\mathcal{B}_h$ has only a single action as its element, i.e. the corresponding player cannot affect the game in those steps. A consequence of this specialization is that the Nash equilibria for turn-based games are pure strategies (i.e. deterministic policies) (Shapley, 1953), as in one-player MDPs. This is not always true for general Markov games.
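To illustrate why pure strategies suffice in the turn-based case, here is a small backward-induction sketch for a known turn-based game. The function name `solve_turn_based`, the data layout, and the toy game are assumptions made for the example; the dummy action of the idle player is simply dropped, so each step has a single mover.

```python
def solve_turn_based(H, states, actions, max_steps, P, r):
    """Backward induction when exactly one player moves at each step.

    max_steps is the set of (0-indexed) steps at which the max-player moves;
    the min-player moves at all other steps. P[h][(s, c)] maps
    next_state -> probability and r[h][(s, c)] is the reward for the
    mover's action c.
    """
    V = [dict() for _ in range(H + 1)]
    V[H] = {s: 0.0 for s in states}
    policy = [dict() for _ in range(H)]
    for h in reversed(range(H)):
        pick = max if h in max_steps else min
        for s in states:
            q = {c: r[h][(s, c)] + sum(p * V[h + 1][s2]
                                       for s2, p in P[h][(s, c)].items())
                 for c in actions}
            policy[h][s] = pick(q, key=q.get)   # a single deterministic action
            V[h][s] = q[policy[h][s]]
    return V, policy

# Toy game: the max-player moves first and chooses the next state; the
# min-player then chooses the final reward available in that state.
H, states, actions = 2, [0, 1], [0, 1]
P = [{(s, c): {c: 1.0} for s in states for c in actions} for _ in range(H)]
r = [
    {(s, c): 0.0 for s in states for c in actions},
    {(0, 0): 0.3, (0, 1): 0.6, (1, 0): 0.5, (1, 1): 0.9},
]
V_tb, policy_tb = solve_turn_based(H, states, actions, max_steps={0}, P=P, r=r)
```

The max-player moves to state 1, where the min-player can do no better than concede 0.5, and neither player needs to randomize.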
3 Main Results
In this section, we present our algorithm and main theorems. In particular, our algorithm is the first self-play algorithm that achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret in Markov games. We describe the algorithm in Section 3.1, and present its theoretical guarantee for general Markov games in Section 3.2. In Section 3.3, we show that when specialized to turn-based games, the regret and runtime of our algorithm can be further improved.
3.1 Algorithm description
To solve zero-sum Markov games, the main idea is to extend the celebrated UCB (Upper Confidence Bounds) principle, an algorithmic principle that achieves provably efficient exploration in bandits (Auer et al., 2002) and single-agent RL (Azar et al., 2017; Jin et al., 2018), to the two-player setting. Recall that in single-agent RL, the provably efficient UCBVI algorithm (Azar et al., 2017) proceeds as
Algorithm (UCBVI for single-player RL): Compute $\{Q_h^{\mathrm{up}}(s, a)\}$ based on the estimated transition and an optimistic (upper) estimate of the reward, then play one episode with the greedy policy with respect to $Q^{\mathrm{up}}$.
Regret bounds for UCBVI are then established by showing and utilizing the fact that $Q^{\mathrm{up}}$ remains an optimistic (upper) estimate of the optimal $Q^\star$ throughout execution of the algorithm.
In zero-sum games, the two players have conflicting goals: the max-player seeks to maximize the return and the min-player seeks to minimize the return. Therefore, it seems natural here to maintain two sets of Q estimates, one upper bounding the true value and one lower bounding the true value, so that each player can play optimistically with respect to her own goal. We summarize this idea into the following proposal.
Proposal (Naive two-player extension of UCBVI): Compute $\{Q_h^{\mathrm{up}}(s, a, b), Q_h^{\mathrm{low}}(s, a, b)\}$ based on the estimated transition and {upper, lower} estimates of rewards, then play one episode where the max-player ($\mu$) is greedy with respect to $Q^{\mathrm{up}}$ and the min-player ($\nu$) is greedy with respect to $Q^{\mathrm{low}}$.
However, the above proposal is not yet a well-defined algorithm: a greedy strategy $\mu$ with respect to $Q^{\mathrm{up}}$ requires the knowledge of how the other player chooses $b$, and vice versa. Therefore, what we really want is not that "$\mu$ is greedy with respect to $Q^{\mathrm{up}}$", but rather that "$\mu$ is greedy with respect to $Q^{\mathrm{up}}$ when the other player uses $\nu$", and vice versa. In other words, we rather desire that $(\mu, \nu)$ are jointly greedy with respect to $(Q^{\mathrm{up}}, Q^{\mathrm{low}})$.
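Our algorithm concretizes such joint greediness precisely, building on insights from one-step matrix games: we choose $(\mu_h(\cdot \mid s), \nu_h(\cdot \mid s))$ to be the Nash equilibrium for the general-sum game in which the payoff matrix for the max player is $Q_h^{\mathrm{up}}(s, \cdot, \cdot)$ and for the min player is $Q_h^{\mathrm{low}}(s, \cdot, \cdot)$. In other words, both players have their own payoff matrix (and the two matrices are not equal), but they jointly determine their policies. Formally, we let $(\mu, \nu)$ be determined as
$$(\mu_h(\cdot \mid s), \nu_h(\cdot \mid s)) = \text{Nash\_General\_Sum}\left(Q_h^{\mathrm{up}}(s, \cdot, \cdot),\, Q_h^{\mathrm{low}}(s, \cdot, \cdot)\right)$$
for all $(h, s)$, where Nash_General_Sum is a subroutine that takes two matrices $P, Q \in \mathbb{R}^{A \times B}$, and returns the Nash equilibrium $(\phi, \psi) \in \Delta_A \times \Delta_B$ for the general-sum game, which satisfies
$$\phi^\top P \psi = \max_{\tilde{\phi} \in \Delta_A} \tilde{\phi}^\top P \psi, \qquad \phi^\top Q \psi = \min_{\tilde{\psi} \in \Delta_B} \phi^\top Q \tilde{\psi}. \tag{6}$$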
Such an equilibrium is guaranteed to exist due to the seminal work of Nash (1951), and is computable by algorithms such as the Lemke-Howson algorithm (Lemke and Howson, 1964). With the Nash_General_Sum subroutine in hand, our algorithm can be briefly described as
Our Algorithm (VI-ULCB): Compute $\{Q_h^{\mathrm{up}}(s, a, b), Q_h^{\mathrm{low}}(s, a, b)\}$ based on the estimated transition and {upper, lower} estimates of rewards, along the way determining policies $(\mu, \nu)$ by running the Nash_General_Sum subroutine on $(Q^{\mathrm{up}}, Q^{\mathrm{low}})$. Play one episode according to $(\mu, \nu)$.
The full algorithm is described in Algorithm 1.
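The equilibrium condition required of the Nash_General_Sum output can be checked directly. The sketch below assumes the condition has the form described in the text (the max-player cannot gain on her own matrix by deviating, and the min-player cannot lower his own matrix by deviating); the function name `is_general_sum_nash` and the example matrices are hypothetical, and the sketch only verifies a candidate pair rather than computing one.

```python
def is_general_sum_nash(P, Q, phi, psi, tol=1e-9):
    """Check that (phi, psi) is a Nash equilibrium of the general-sum game.

    P is the max-player's payoff matrix (she maximizes phi^T P psi) and
    Q is the min-player's payoff matrix (he minimizes phi^T Q psi).
    """
    n, m = len(P), len(P[0])
    # payoff of each pure max-player action against psi
    row_payoffs = [sum(P[a][b] * psi[b] for b in range(m)) for a in range(n)]
    # payoff of each pure min-player action against phi
    col_payoffs = [sum(phi[a] * Q[a][b] for a in range(n)) for b in range(m)]
    max_value = sum(phi[a] * row_payoffs[a] for a in range(n))
    min_value = sum(psi[b] * col_payoffs[b] for b in range(m))
    # deviations to pure actions are enough because payoffs are linear
    return (max_value >= max(row_payoffs) - tol
            and min_value <= min(col_payoffs) + tol)

# Example with an entrywise upper estimate P and lower estimate Q.
P_up = [[0.9, 0.5], [0.6, 0.4]]
Q_low = [[0.7, 0.3], [0.2, 0.1]]
ok = is_general_sum_nash(P_up, Q_low, phi=[1.0, 0.0], psi=[0.0, 1.0])
not_ok = is_general_sum_nash(P_up, Q_low, phi=[0.0, 1.0], psi=[0.0, 1.0])
```

In this example row 0 dominates for the max-player and column 1 is the min-player's best reply to it, so the first pair passes the check and the second does not.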
3.2 Guarantees for General Markov Games
We are now ready to present our main theorem.
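Theorem 3.2 (Regret bound for VI-ULCB). For zero-sum Markov games, Algorithm 1 (with choice of bonus $\beta_t = c\sqrt{H^2 S \iota / t}$ for large absolute constant $c$) achieves regret
$$\mathrm{Regret}(K) \le \mathcal{O}\left(\sqrt{H^3 S^2 \left[\max_{h \in [H]} A_h B_h\right] T \iota}\right) \le \mathcal{O}\left(\sqrt{H^3 S^2 A B T \iota}\right)$$
with probability at least $1 - p$, where $\iota = \log(SABT/p)$. We defer the proof of Theorem 3.2 to Appendix A.1.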
Optimism in the face of uncertainty and best response
An implication of Theorem 3.2 is that a low regret can be achieved via self-play, i.e. the algorithm plays with itself and does not need an expert as its opponent. This is intriguing because the regret is measured in terms of the suboptimality against the worst-case opponent:
$$\mathrm{Regret}(K) = \sum_{k=1}^{K} \left[V_1^{\dagger,\nu^k}(s_1^k) - V_1^{\mu^k,\nu^k}(s_1^k)\right] + \sum_{k=1}^{K} \left[V_1^{\mu^k,\nu^k}(s_1^k) - V_1^{\mu^k,\dagger}(s_1^k)\right].$$
(Note that this decomposition of the regret has a slightly different form from (5).) Therefore, Theorem 3.2 demonstrates that self-play can protect against a fully adversarial opponent even when such a strong opponent is not explicitly available.
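The key technical reason enabling such a guarantee is that our Q estimates are optimistic in the face of both the uncertainty of the game, as well as the best response from the opponent. More precisely, we show that the $(Q^{\mathrm{up}}, Q^{\mathrm{low}})$ in Algorithm 1 satisfy with high probability
$$Q_h^{\mathrm{up},k}(s, a, b) \ge \sup_{\mu} Q_h^{\mu,\nu^k}(s, a, b), \qquad Q_h^{\mathrm{low},k}(s, a, b) \le \inf_{\nu} Q_h^{\mu^k,\nu}(s, a, b)$$
for all $(s, a, b, h, k)$, where $(Q^{\mathrm{up},k}, Q^{\mathrm{low},k})$ denote the running $(Q^{\mathrm{up}}, Q^{\mathrm{low}})$ at the beginning of the $k$-th episode (Lemma A.1). In contrast, such a guarantee (and consequently the regret bound) is not achievable if the upper and lower estimates are only guaranteed to {upper, lower} bound the values of the Nash equilibrium.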
Translation to PAC bound
Our regret bound directly implies a PAC sample complexity bound for learning near-equilibrium policies, based on an online-to-batch conversion. We state this in the following corollary, and defer the proof to Appendix A.2.
Corollary (PAC bound for VI-ULCB). Suppose the initial state of the Markov game is fixed at $s_1$. Then there exists a pair of (randomized) policies $(\hat{\mu}, \hat{\nu})$ derived through the VI-ULCB algorithm such that with probability at least $1 - p$ (over the randomness in the trajectories) we have
$$\mathbb{E}_{\hat{\mu},\hat{\nu}}\left[V_1^{\dagger,\hat{\nu}}(s_1) - V_1^{\hat{\mu},\dagger}(s_1)\right] \le \epsilon$$
as soon as the number of episodes $K \ge \Omega(H^4 S^2 A B \iota / \epsilon^2)$, where $\iota = \log(HSAB/(p\epsilon))$, and the expectation is over the randomization in $(\hat{\mu}, \hat{\nu})$.
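The online-to-batch step can be sketched as follows. This is a generic derivation under two assumptions that are ours: the output pair $(\hat{\mu}, \hat{\nu})$ is drawn uniformly at random from the $K$ deployed pairs $(\mu^k, \nu^k)$, and the regret bound is written as $\mathrm{Regret}(K) \le C\sqrt{K}$, where $C$ is a placeholder for all factors other than $K$ and not a value taken from the theorem.

```latex
\begin{align*}
\mathbb{E}_{\hat{\mu},\hat{\nu}}\left[ V_1^{\dagger,\hat{\nu}}(s_1) - V_1^{\hat{\mu},\dagger}(s_1) \right]
  &= \frac{1}{K} \sum_{k=1}^{K} \left[ V_1^{\dagger,\nu^k}(s_1) - V_1^{\mu^k,\dagger}(s_1) \right]
   = \frac{\mathrm{Regret}(K)}{K} \\
  &\le \frac{C}{\sqrt{K}} \;\le\; \epsilon
  \qquad \text{whenever } K \ge \frac{C^2}{\epsilon^2}.
\end{align*}
```

A regret that grows as $\sqrt{K}$ therefore translates into a number of episodes that grows as $1/\epsilon^2$.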
Runtime of Algorithm 1
Algorithm 1 involves the Nash_General_Sum subroutine for computing the Nash equilibrium of a general-sum matrix game. However, it is known that the computational complexity of approximating such an equilibrium is PPAD-complete.
We note however that there exist practical implementations of the subroutine such as the Lemke-Howson algorithm (Lemke and Howson, 1964) that can usually find the solution efficiently. We will further revisit the computational issue in Section 4, in which we design a computationally efficient algorithm for zero-sum games with a slightly worse regret.
3.3 Guarantees for Turn-based Markov Games
We now instantiate Theorem 3.2 on turn-based games (introduced in Section 2.1), in which the same algorithm enjoys a better regret guarantee and polynomial runtime. Recall that in turn-based games, for all $h$, we have either $A_h = 1$ or $B_h = 1$, therefore given $\max_h A_h \le A$ and $\max_h B_h \le B$ we have
$$\max_{h \in [H]} A_h B_h \le \max\{A, B\} \le A + B,$$
and thus by Theorem 3.2 the regret of Algorithm 1 on turn-based games is bounded by $\tilde{\mathcal{O}}(\sqrt{H^3 S^2 (A + B) T})$.
Further, since either $A_h = 1$ or $B_h = 1$, all the Nash_General_Sum subroutines reduce to vector games rather than matrix games, and can be trivially implemented in polynomial (indeed linear) time. Indeed, suppose the payoff matrices in (6) have dimensions $A \times 1$ (i.e., $B_h = 1$), then Nash_General_Sum reduces to finding $\phi \in \Delta_A$ and $\psi \equiv 1$ such that
$$\phi^\top P = \max_{\tilde{\phi} \in \Delta_A} \tilde{\phi}^\top P$$
(the other side is trivialized as $\psi$ has only one choice), which is solved at $\phi = e_{a^\star}$ where $a^\star = \arg\max_{a \in [A]} P_a$. The situation is similar if $A_h = 1$.
We summarize the above results into the following corollary.
Corollary 3.3 (Regret bound for VI-ULCB on turn-based games). For turn-based zero-sum Markov games, Algorithm 1 has runtime $\mathrm{poly}(S, A, B, T)$ and achieves regret bound $\mathcal{O}(\sqrt{H^3 S^2 (A + B) T \iota})$ with probability at least $1 - p$, where $\iota = \log(SABT/p)$.
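As a small illustration of the vector-game reduction, the subroutine on a step where only the max-player has a real choice is a single argmax. The function name `nash_max_player_turn` is hypothetical and the sketch assumes the payoff is passed as a plain list with one entry per max-player action.

```python
def nash_max_player_turn(payoffs):
    """Equilibrium of a vector game in which only the max-player moves.

    payoffs[a] is the max-player's payoff for action a; the min-player
    has a single dummy action, so his strategy is the trivial [1.0].
    """
    a_star = max(range(len(payoffs)), key=lambda a: payoffs[a])
    phi = [1.0 if a == a_star else 0.0 for a in range(len(payoffs))]
    return phi, [1.0]

phi, psi = nash_max_player_turn([0.2, 0.9, 0.5])
```

The min-player's turn is symmetric, with `min` in place of `max` applied to his own payoff vector. Either way the cost is linear in the number of actions.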
4 Computationally Efficient Algorithm
In this section, we show that the computational issue of Algorithm 1 is not intrinsic to the problem: there exists a sublinear regret algorithm for general zero-sum Markov games that has a guaranteed polynomial runtime, with regret scaling as $T^{2/3}$, slightly worse than that of Algorithm 1. Therefore, computational efficiency can be achieved if one is willing to trade some statistical efficiency (sample complexity). For simplicity, we assume in this section that the initial state $s_1$ is fixed.
Value Iteration after Exploration
At a high level, our algorithm follows an explore-then-exploit approach. We begin by running a (polynomial time) reward-free exploration procedure Reward_Free_Exploration on a small number of episodes, which queries the MDP and outputs an estimate $(\hat{\mathbb{P}}, \hat{r})$. Then, we run value iteration on the empirical version of the Markov game with transition $\hat{\mathbb{P}}$ and reward $\hat{r}$, which finds its Nash equilibrium $(\hat{\mu}, \hat{\nu})$. Finally, the algorithm simply plays the policy $(\hat{\mu}, \hat{\nu})$ for the remaining episodes. The full algorithm is described in Algorithm 2 in the Appendix.
By "reward-free" exploration, we mean the procedure will not use any reward information to guide exploration. Instead, the procedure prioritizes visiting all possible states and gathering sufficient information about their transitions and rewards, so that $(\hat{\mathbb{P}}, \hat{r})$ are close to $(\mathbb{P}, r)$ in the sense that the Nash equilibria of $\mathrm{MG}(\hat{\mathbb{P}}, \hat{r})$ and $\mathrm{MG}(\mathbb{P}, r)$ are close, where $\mathrm{MG}(\hat{\mathbb{P}}, \hat{r})$ denotes the Markov game with transition $\hat{\mathbb{P}}$ and reward $\hat{r}$.
This goal can be achieved by the following algorithm. For any fixed state $s$, we can create an artificial reward $\tilde{r}$ defined as $\tilde{r}(s, a, b) = 1$ and $\tilde{r}(s', a, b) = 0$ for any $s' \ne s$, $a$ and $b$. Then, we can treat $\mathcal{C} = \mathcal{A} \times \mathcal{B}$ as a new action set for a single agent, and run any standard reinforcement learning algorithm with PAC or regret guarantees to find a near-optimal policy $\tilde{\pi}$ of $\mathrm{MDP}(H, \mathcal{S}, \mathcal{C}, \mathbb{P}, \tilde{r})$. It can be shown that the optimal policy for this MDP is the policy that maximizes the probability of reaching state $s$. Therefore, by repeatedly playing $\tilde{\pi}$, we can gather transition and reward information at state $s$ as well as we can. Finally, we repeat the routine above for all states $s$. See Appendix B for more details.
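The reduction above can be illustrated on a known toy model: with a reward of 1 placed only on the target state and the two players' actions merged into one joint action, the optimal value of the resulting single-agent MDP equals the largest probability of reaching the target. In the sketch below the function name, the explicit `target_step` argument (states are tied to steps), and the toy transitions are assumptions made for the example; the actual procedure must learn this policy from samples rather than from a known model.

```python
def max_reach_probability(H, states, joint, P, target, target_step, s1):
    """Optimal value of the single-agent MDP with the artificial reward.

    joint is the list of joint actions (a, b); P[h][(s, c)] maps
    next_state -> probability for joint action c. The artificial reward is
    1 at (target_step, target) for every joint action and 0 elsewhere.
    """
    V = {s: 0.0 for s in states}              # values after the last step
    for h in reversed(range(H)):
        new_V = {}
        for s in states:
            reward = 1.0 if (h == target_step and s == target) else 0.0
            new_V[s] = reward + max(
                sum(p * V[s2] for s2, p in P[h][(s, c)].items())
                for c in joint)
        V = new_V
    return V[s1]

# Toy model: matching actions reach state 1 with probability 0.9,
# mismatched actions only with probability 0.2.
H, states, A, B = 2, [0, 1], [0, 1], [0, 1]
joint = [(a, b) for a in A for b in B]
P = [{(s, (a, b)): ({1: 0.9, 0: 0.1} if a == b else {1: 0.2, 0: 0.8})
      for s in states for (a, b) in joint} for _ in range(H)]
prob = max_reach_probability(H, states, joint, P,
                             target=1, target_step=1, s1=0)
```

In the toy model the best joint action at the first step is any matching pair, giving a reach probability of 0.9.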
In this paper, we adapt the sharp treatments in Jin et al. (2020), which studies reward-free exploration in the single-agent MDP setting, and provide the following guarantees for the Reward_Free_Exploration procedure.
[PAC bound for VIExplore] With probability at least , Reward_Free_Exploration runs for episodes with some large constant , and , and outputs such that the Nash equilibrium of MG satisfies
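Importantly, such a Nash equilibrium of $\mathrm{MG}(\hat{\mathbb{P}}, \hat{r})$ can be computed by Value Iteration (VI) using $\hat{\mathbb{P}}$ and $\hat{r}$. VI only calls the Nash_Zero_Sum subroutine, which takes a matrix $Q \in \mathbb{R}^{A \times B}$ and returns the Nash equilibrium $(\phi, \psi) \in \Delta_A \times \Delta_B$ for the zero-sum game, which satisfies
$$\max_{\tilde{\phi} \in \Delta_A} \tilde{\phi}^\top Q \psi = \phi^\top Q \psi = \min_{\tilde{\psi} \in \Delta_B} \phi^\top Q \tilde{\psi}. \tag{7}$$
This problem can be solved efficiently (in polynomial time) by many existing algorithms designed for convex-concave optimization (see, e.g., Koller, 1994), and does not suffer from the PPAD-completeness that Nash_General_Sum does.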
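For illustration only, one simple way to approximate a zero-sum matrix-game equilibrium without an optimization library is to let both players run the Hedge (multiplicative weights) update against each other and average their strategies. This is a swapped-in technique chosen for brevity, not the method used in the paper; the function names, step sizes, and example matrix below are assumptions.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(x - top) for x in scores]
    total = sum(exps)
    return [e / total for e in exps]

def hedge_zero_sum(Q, iters=20000):
    """Approximate Nash equilibrium of max_phi min_psi phi^T Q psi."""
    n, m = len(Q), len(Q[0])
    spread = (max(max(row) for row in Q) - min(min(row) for row in Q)) or 1.0
    eta_row = math.sqrt(8 * math.log(n) / iters) / spread
    eta_col = math.sqrt(8 * math.log(m) / iters) / spread
    gain = [0.0] * n      # cumulative payoff of each max-player action
    loss = [0.0] * m      # cumulative payoff conceded by each min-player action
    avg_phi, avg_psi = [0.0] * n, [0.0] * m
    for _ in range(iters):
        phi = softmax([eta_row * g for g in gain])
        psi = softmax([-eta_col * l for l in loss])
        for a in range(n):
            gain[a] += sum(Q[a][b] * psi[b] for b in range(m))
            avg_phi[a] += phi[a] / iters
        for b in range(m):
            loss[b] += sum(phi[a] * Q[a][b] for a in range(n))
            avg_psi[b] += psi[b] / iters
    return avg_phi, avg_psi

def duality_gap(Q, phi, psi):
    """Best-response payoff of the max-player minus that of the min-player."""
    n, m = len(Q), len(Q[0])
    best_row = max(sum(Q[a][b] * psi[b] for b in range(m)) for a in range(n))
    best_col = min(sum(phi[a] * Q[a][b] for a in range(n)) for b in range(m))
    return best_row - best_col

Q_example = [[2.0, -1.0], [-1.0, 1.0]]
phi_hat, psi_hat = hedge_zero_sum(Q_example)
gap = duality_gap(Q_example, phi_hat, psi_hat)
```

The example game has value 1/5, attained when both players put probability 2/5 on their first action. The duality gap of the averaged strategies shrinks at the usual no-regret rate, so this sketch yields an approximate rather than exact equilibrium; an exact solution would typically come from a linear program.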
The PAC bound in Theorem 4 can be easily converted into a regret bound, which is presented as follows.
[Polynomial time algorithm via explorethenexploit] For zerosum Markov games, with probability at least , Algorithm 2 runs in time, and achieves regret bound
where .
5 Towards the Optimal Regret
We investigate the tightness of our regret upper bounds in Theorem 3.2 and Corollary 3.3 by raising the question of optimal regret in two-player Markov games, and making initial progress on it by providing lower bounds and new upper bounds in specific settings. Specifically, we ask an
Open question: What is the optimal regret for general Markov games (in terms of dependence on )?
It is known that the (tight) regret lower bound for single-player MDPs is $\Omega(\sqrt{H^2 S A T})$ (Azar et al., 2017). By restricting two-player games to a single-player MDP (making the other player dummy), we immediately have
[Regret lower bound, corollary of Jaksch et al. (2010), Theorem 5]
The regret
Matching the lower bound on short-horizon games
Towards closing the gap between lower and upper bounds, we develop alternative algorithms in the special case where each player only plays once, i.e. one-step general games with $H = 1$ and two-step turn-based games. In these cases, we show that there exist mirror descent type algorithms that achieve an improved regret $\tilde{\mathcal{O}}(\sqrt{S(A+B)T})$ (and thus match the lower bounds), provided that we consider a weaker notion of the regret defined as follows.
Definition (Weak Regret). The weak regret for any algorithm that deploys policies $(\mu^k, \nu^k)$ in episode $k$ is defined as
$$\mathrm{WeakRegret}(K) := \max_{\mu} \sum_{k=1}^{K} V_1^{\mu,\nu^k}(s_1^k) - \min_{\nu} \sum_{k=1}^{K} V_1^{\mu^k,\nu}(s_1^k). \tag{8}$$
The difference in the weak regret is that it uses fixed opponents, as opposed to adaptive opponents, for measuring the performance gap: the max is taken with respect to a fixed $\mu$ for all episodes $k$, rather than a different $\mu$ for each episode. By definition, we have for any algorithm that $\mathrm{WeakRegret}(K) \le \mathrm{Regret}(K)$.
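The gap between the two notions can be seen on a one-step matrix game with a single state. The sketch below is only an illustration; the function name, the payoff matrix, and the sequence of deployed strategies are made up for the example. Best responses are taken over pure actions, which is enough because the payoffs are linear in each player's strategy.

```python
def regret_and_weak_regret(M, plays):
    """Regret and weak regret in a one-step game with payoff matrix M.

    plays is a list of (phi, psi) pairs, the mixed strategies deployed by
    the max-player and min-player in each episode.
    """
    n, m = len(M), len(M[0])

    def row_values(psi):   # payoff of each max-player action against psi
        return [sum(M[a][b] * psi[b] for b in range(m)) for a in range(n)]

    def col_values(phi):   # payoff of each min-player action against phi
        return [sum(phi[a] * M[a][b] for a in range(n)) for b in range(m)]

    # regret: a fresh best response in every episode
    regret = sum(max(row_values(psi)) - min(col_values(phi))
                 for phi, psi in plays)
    # weak regret: one fixed comparator for all episodes
    total_rows = [sum(row_values(psi)[a] for _, psi in plays)
                  for a in range(n)]
    total_cols = [sum(col_values(phi)[b] for phi, _ in plays)
                  for b in range(m)]
    weak = max(total_rows) - min(total_cols)
    return regret, weak

# Reward 1 when the actions match; both players alternate pure actions.
M = [[1.0, 0.0], [0.0, 1.0]]
plays = [([1.0, 0.0], [1.0, 0.0]), ([0.0, 1.0], [0.0, 1.0])]
regret, weak = regret_and_weak_regret(M, plays)
```

Alternating between the two pure actions is fully exploitable in every single episode (regret 2 over two episodes), yet no fixed opponent strategy gains anything on average, so the weak regret is 0.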
With the definition of the weak regret in hand, we now present our results for one-step games. Their proofs can be found in Appendix C.
[Weak regret for one-step simultaneous game, adapted from Rakhlin and Sridharan (2013)] For one-step simultaneous games ($H = 1$), there exists a mirror descent type algorithm that achieves weak regret bound $\mathrm{WeakRegret}(T) \le \tilde{\mathcal{O}}(\sqrt{S(A+B)T})$ with high probability.
[Weak regret for two-step turn-based game] For two-step turn-based games ($H = 2$), there exists a mirror descent type algorithm that achieves weak regret bound $\mathrm{WeakRegret}(T) \le \tilde{\mathcal{O}}(\sqrt{S(A+B)T})$ with high probability.
Proof insights; bottleneck in multi-step case
The improved regret bounds in the two theorems above are possible due to the availability of unbiased estimates of counterfactual Q values, which in turn can be used in mirror descent type algorithms with guarantees. Such unbiased estimates are only achievable in one-step games as the two policies are "not intertwined" in a certain sense. In contrast, in multi-step games (where each player plays more than once), such unbiased estimates of counterfactual Q values are no longer available, and it is unclear how to construct a mirror descent algorithm there. We believe it would be an important open question to close the gap in multi-step games (as well as the gap between regret and weak regret) for a further understanding of exploration in games.
6 Conclusion
In this paper, we studied the sample complexity of finding the equilibrium policy in the setting of competitive reinforcement learning, i.e. zero-sum Markov games with two players. We designed a self-play algorithm for zero-sum games and showed that it can efficiently find the Nash equilibrium policy in the exploration setting by establishing a regret bound. Our algorithm, Value Iteration with Upper and Lower Confidence Bounds, builds on a principled extension of UCB/optimism into the two-player case by constructing upper and lower bounds on the value functions and iteratively solving general-sum subgames.
Towards investigating the optimal runtime and sample complexity in two-player games, we provided accompanying results showing that (1) the computational efficiency of our algorithm can be improved by explore-then-exploit type algorithms, which have a slightly worse regret; (2) the state and action space dependence in the regret can be reduced in the special case of one-step games via alternative mirror descent type algorithms.
We believe this paper opens up many interesting directions for future work. For example, can we design a computationally efficient algorithm that achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret? What is the optimal dependence of the regret on $(S, A, B)$ in multi-step games? Also, the present results only work in tabular games, and it would be of interest to investigate whether similar results can hold in the presence of function approximation.
Acknowledgements
We thank Sham Kakade and Haipeng Luo for valuable discussions on the related work. We also thank the Simons Institute at Berkeley for hosting the authors and incubating our initial discussions.
Appendix A Proofs for Section 3
A.1 Proof of Theorem 3.2
Notation:
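To be clear from the context, we denote the upper bound and lower bound $Q^{\mathrm{up}}$ and $Q^{\mathrm{low}}$ computed at the $k$-th episode as $Q^{\mathrm{up},k}$ and $Q^{\mathrm{low},k}$, and the policies computed and used at the $k$-th episode as $\mu^k$ and $\nu^k$.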
Choice of bonus:
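$\beta_t = c\sqrt{H^2 S \iota / t}$ for a sufficiently large absolute constant $c$.
Lemma A.1 (ULCB). With probability at least $1 - p$, we have the following bounds for any $(s, a, b, h, k)$:
$$V_h^{\mathrm{up},k}(s) \ge \sup_{\mu} V_h^{\mu,\nu^k}(s), \qquad Q_h^{\mathrm{up},k}(s, a, b) \ge \sup_{\mu} Q_h^{\mu,\nu^k}(s, a, b), \tag{9}$$
$$V_h^{\mathrm{low},k}(s) \le \inf_{\nu} V_h^{\mu^k,\nu}(s), \qquad Q_h^{\mathrm{low},k}(s, a, b) \le \inf_{\nu} Q_h^{\mu^k,\nu}(s, a, b). \tag{10}$$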
Proof.
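By symmetry, we only need to prove the statement (9). For each fixed $k$, we prove this by induction from $h = H + 1$ to $h = 1$. For the base case, we know that at the $(H+1)$-th step, $V_{H+1}^{\mathrm{up},k}(s) = \sup_{\mu} V_{H+1}^{\mu,\nu^k}(s) = 0$.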
Now, assume the left inequality in (9) holds for th step, for the th step, we first recall the updates for functions respectively:
In case of