# Combining No-regret and Q-learning

Ian A. Kash
University of Illinois, Chicago, IL
Michael Sullins
University of Illinois, Chicago, IL
Katja Hofmann
Microsoft Research, Cambridge, UK
Part of this work was done while Ian Kash was a Researcher at Microsoft Research.
###### Abstract

Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.

## 1 Introduction

Versions of counterfactual regret minimization (CFR) (Zinkevich et al., 2008) have found success in playing poker at human expert level (Brown and Sandholm, 2019b; Moravčík et al., 2017) as well as fully solving non-trivial versions of it (Bowling et al., 2015). CFR more generally can solve extensive form games of incomplete information. It works by using a no-regret algorithm to select actions. In particular, one copy of such an algorithm is used at each information set, which corresponds to the full history of play observed by a single agent. The resulting algorithm satisfies a global no-regret guarantee, so at least in two-player zero-sum games is guaranteed to converge to an optimal strategy through sufficient self-play.

However, CFR does have limitations. It makes two strong assumptions which are natural for games such as poker, but limit applicability to further settings. First, it assumes that the agent has perfect recall, which in a more general context means that the state representation captures the full history of states visited (and so imposes a tree structure). Current RL domains may rarely repeat states due to their large state spaces, but they certainly do not encode the full history of states and actions. Second, it assumes that a terminal state is eventually reached and performs updates only after this occurs. Even in episodic RL settings, which do have terminals, it may take thousands of steps to reach them. Neither of these assumptions is required for traditional planning algorithms like value iteration or reinforcement learning algorithms like Q-learning. Nevertheless, approaches inspired by CFR have shown empirical promise in domains that do not necessarily satisfy these requirements (Jin, Levine, and Keutzer, 2017).

In this paper, we take a step toward relaxing these assumptions. We develop a new algorithm, which we call local no-regret learning (LONR). In the same spirit as CFR, LONR uses a copy of an arbitrary no-regret algorithm in each state. (For technical reasons we require a slightly stronger property we term no-absolute-regret.) Our main result is that LONR has the same asymptotic convergence guarantee as value iteration for discounted-reward Markov Decision Processes (MDP). Our result also generalizes to settings where, from a single agent’s perspective, the transition process is time invariant but rewards are not. Such settings are traditionally interpreted as “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015), but also include normal form games. We view this as a proof-of-concept for achieving CFR-style results without requiring perfect recall or terminal states. Under stylized assumptions, we can extend this to asynchronous value iteration and (with a weaker convergence guarantee) a version of on-policy RL.

In our experimental results, we explore settings beyond the reach of our theoretical results. Our primary focus is on a particular class of Markov games known as NoSDE Markov games, which are specifically designed to be challenging for learning algorithms (Zinkevich, Greenwald, and Littman, 2006). These are finite two-agent Markov games with no terminal states where No Stationary Deterministic Equilibria exist: all stationary equilibria are randomized. Worse, by construction Q-values do not suffice to determine the correct equilibrium randomization. Thus, prior work has focused on designing multiagent learning algorithms which can converge to non-stationary equilibria (Zinkevich, Greenwald, and Littman, 2006). The sorts of cyclic behavior that NoSDE games induce have also been observed in more realistic settings of economic competition between agents (Tesauro and Kephart, 2002).

In contrast, we demonstrate that LONR converges to the stationary equilibrium for specific choices of regret minimizer. Furthermore, for these choices of minimizer we achieve not just convergence of the average policy but also of the current policy, or last iterate. Thus our results are also interesting from the perspective of highlighting a setting for the study of last iterate convergence, an area of current interest, in between simple normal form games (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018) and rich, complex settings such as generative adversarial networks (GANs) (Daskalakis et al., 2017).

Most work on CFR uses some version of regret matching as the regret minimizer. However, all prior variants of regret matching are known not to possess last iterate convergence in normal form games such as matching pennies and rock-paper-scissors. As part of our analysis we introduce a novel variant, prove that it is no-regret, and show empirically that it provides last iterate convergence in these normal form games as well as all other settings we have tried. This may be of independent interest, as it is qualitatively different from prior algorithms with last iterate convergence, which are optimistic versions of standard algorithms (Daskalakis and Panageas, 2018; Daskalakis et al., 2017).

## 2 Related work

CFR algorithms remain an active topic of research; recent work has shown how to combine it with function approximation (Waugh et al., 2015; Moravčík et al., 2017; Jin, Levine, and Keutzer, 2017; Brown et al., 2018; Li et al., 2018), improve the convergence rate in certain settings (Farina et al., 2019), and apply it to more complex structures (Farina, Kroer, and Sandholm, 2018). Most relevant to our work, examples are known where CFR fails to converge to the correct policy without perfect recall (Lanctot et al., 2012).

Both CFR and LONR are guaranteed to converge only in terms of their average policy. This is part of a general phenomenon for no-regret learning in games, where the “last iterate,” or current policy, not only fails to converge but behaves in an extreme and cyclic way (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018; Cheung and Piliouras, 2019; Bailey, Gidel, and Piliouras, 2019). Recent work has explored cases where it is nonetheless effective to use the last iterate. In some poker settings a variant of CFR known as CFR+ (Tammelin, 2014; Bowling et al., 2015) has good last iterates, but it is known to cycle in normal-form games. Motivated by training Generative Adversarial Networks (GANs), recent results have shown that certain no-regret algorithms converge in terms of the last iterate to saddle-points in convex-concave min-max optimization problems (Daskalakis et al., 2017; Daskalakis and Panageas, 2018). The ability to use the last iterate is particularly important in the context of function approximation (Heinrich and Silver, 2016; Abernethy, Lai, and Wibisono, 2019). Our experimental results provide examples of LONR achieving last iterate convergence when the underlying regret minimizer is capable of it.

Prior work has developed algorithms which combine no-regret and reinforcement learning, but in ways that are qualitatively different from LONR. A common approach in the literature on multi-agent learning is to use no-regret learning as an outer loop to optimize over the space of policies, with the assumption that the inner loop of evaluating a policy is given to the algorithm. There is a large literature on this approach in normal form games (Greenwald and Jafari, 2003), where policy evaluation is trivial, and a smaller one on “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015), where it is less so. Of particular note in this literature, Even-Dar, Kakade, and Mansour (2005) also use the idea of having a copy of a no-regret algorithm for each state. An alternate approach to solving multi-agent MDPs is to use Q-learning as an outer loop with some other algorithm as an inner loop to determine the collective action chosen in the next state (Littman, 1994; Hu and Wellman, 2003; Greenwald, Hall, and Serrano, 2003). Of particular note, Gondek, Greenwald, and Hall (2004) proposed the use of no-regret algorithms as an inner loop with Q-learning as an outer loop, while Even-Dar, Mannor, and Mansour (2002) use multi-armed bandit algorithms as the inner loop with Phased Q-learning (Kearns and Singh, 1999) as the outer loop. In contrast to these literatures, we combine no-regret learning and RL in each step of the learning process rather than having one as an inner loop and the other as an outer loop.

Recent work has drawn new connections between no-regret and RL. Srinivasan et al. (2018) show that actor-critic methods can be interpreted as a form of regret minimization, but only analyze their performance in games with perfect recall and terminal states. This is complementary to our approach, which focuses on value-iteration-style algorithms, in that it suggests a way of extending our results to other classes of algorithms. Neu, Jonsson, and Gómez (2017) study entropy-regularized RL and interpret it as an approximate version of Mirror Descent, from which no-regret algorithms can be derived as particular instantiations. Kovařík and Lisỳ (2018) study algorithms that instantiate a regret minimizer at each state without the counterfactual weightings from CFR, but explicitly exclude settings without terminals and perfect recall from their analysis. Jin et al. (2018) showed that in finite-horizon MDPs, Q-learning with UCB exploration achieves near-optimal regret bounds.

The closest technical approach to that used in our theoretical results is that of Bellemare et al. (2016) who introduce new variants of the Q-learning operator. However, our algorithm is not an operator as the policy used to select actions changes from round to round in a history-dependent way, so we instead directly analyze the sequences of Q-values.

## 3 Preliminaries

Consider a Markov Decision Process $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the (finite) action space, $P$ is the transition probability kernel, $r$ is the (expected) reward function (assumed to be bounded), and $\gamma \in [0, 1)$ is the discount rate. (Q-)value iteration is an operator $\mathcal{T}$, whose domain is bounded real-valued functions over $\mathcal{S} \times \mathcal{A}$, defined as

$$\mathcal{T}Q(s,a) = r(s,a) + \gamma \, \mathbb{E}_P\left[\max_{a' \in \mathcal{A}} Q(s',a')\right] \tag{1}$$

This operator is a contraction map in $\|\cdot\|_\infty$, and so converges to a unique fixed point $Q^*$, where $Q^*(s,a)$ gives the expected value of the MDP starting from state $s$, taking action $a$, and thereafter following the optimal policy $\pi^*$.
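As a concrete illustration of the operator in Equation (1), the following is a minimal sketch of Q-value iteration in Python with NumPy. The two-state MDP, the array layout `P[s, a, s']`, and the function name `q_value_iteration` are assumptions made for this example and do not come from our experiments.

```python
import numpy as np

def q_value_iteration(P, r, gamma, iters=500):
    """Repeatedly apply the operator T from Equation (1).

    P: array of shape (S, A, S); P[s, a, s2] is the probability of moving to s2.
    r: array of shape (S, A); expected reward for taking a in s.
    """
    Q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        # (TQ)(s, a) = r(s, a) + gamma * E_P[max_a' Q(s', a')]
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

# Hypothetical deterministic MDP: in each state, action 0 leads to state 0
# and action 1 leads to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

Q_star = q_value_iteration(P, r, gamma)
greedy_policy = Q_star.argmax(axis=1)  # optimal action in each state
```

Because $\mathcal{T}$ is a $\gamma$-contraction, the number of iterations needed for a given accuracy grows only logarithmically in that accuracy.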

Our algorithm makes use of a no-regret learning algorithm. (It may seem strange to use an algorithm designed for non-stationary environments in a stationary one. We do so with the goal of designing an algorithm that generalizes to non-stationary settings such as “online” MDPs and Markov games.) Consider the following (adversarial full-information) setting. There are $n$ actions $a_1, \ldots, a_n$. At each timestep $k$ an online algorithm chooses a probability distribution $\pi_k$ over the $n$ actions. Then an adversary chooses a reward $x_{k,i}$ for each action $i$ from some closed interval, e.g. $[0, 1]$, which the algorithm then observes. The (external) regret of the algorithm at time $k$ is

$$\frac{1}{k+1} \max_i \sum_{t=0}^{k} \left( x_{t,i} - \pi_t \cdot x_t \right) \tag{2}$$

An algorithm is no-regret if there is a sequence of constants $\rho_k$ such that regardless of the adversary the regret at time $k$ is at most $\rho_k$ and $\lim_{k \to \infty} \rho_k = 0$. A common bound is that $\rho_k$ is $O(1/\sqrt{k})$.

For our results, we make use of a stronger property, that the absolute value of the regret is bounded by $\rho_k$. We call such an algorithm a no-absolute-regret algorithm. Algorithms exist that satisfy the even stronger property that the regret is at most $\rho_k$ and at least 0. Such non-negative-regret algorithms include all linear cost Regularized Follow the Leader algorithms, which includes Randomized Weighted Majority and linear cost Online Gradient Descent (Gofer and Mansour, 2016).
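To make the regret in Equation (2) concrete, the sketch below runs a multiplicative-weights learner against a fixed reward sequence and measures its average external regret. The random reward sequence, the fixed learning rate `eta`, and the helper names are assumptions made for illustration only; a fixed learning rate is a simplification, since a vanishing regret bound generally requires tuning or decaying it.

```python
import numpy as np

def mwu_policies(rewards, eta=0.1):
    """Run multiplicative weights against a fixed reward sequence.

    rewards: array of shape (K, n); rewards[t, i] plays the role of x_{t,i}.
    Returns the (K, n) array of distributions pi_t, each chosen before x_t is seen.
    """
    K, n = rewards.shape
    cum = np.zeros(n)
    policies = np.empty((K, n))
    for t in range(K):
        w = np.exp(eta * (cum - cum.max()))  # shift by the max for stability
        policies[t] = w / w.sum()
        cum += rewards[t]                    # full-information feedback
    return policies

def external_regret(rewards, policies):
    """Average external regret over all rounds, as in Equation (2)."""
    best_fixed = rewards.sum(axis=0).max()   # best single action in hindsight
    earned = (policies * rewards).sum()      # sum over t of pi_t . x_t
    return (best_fixed - earned) / len(rewards)

rng = np.random.default_rng(0)
x = rng.uniform(0.0, 1.0, size=(2000, 3))    # hypothetical rewards in [0, 1]
pi = mwu_policies(x)
regret = external_regret(x, pi)
```

In this sketch the measured regret is non-negative, consistent with the non-negative-regret property discussed above for linear cost Regularized Follow the Leader algorithms.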

## 4 Local no-regret learning (LONR)

The idea of LONR is to fuse the essence of value iteration / Q-learning and CFR. A standard analysis of value iteration proceeds by analyzing the sequence of matrices $Q_0, \mathcal{T}Q_0, \mathcal{T}^2 Q_0, \ldots$. The essence of CFR is to choose the policy for each state locally using a no-regret algorithm. While doing so does not yield an operator, as the policy changes each round in a history-dependent way, this process still yields a sequence of matrices $Q_0, Q_1, Q_2, \ldots$ as follows.

Fix a matrix $Q_0$. Initialize $|\mathcal{S}|$ copies of a no-absolute-regret algorithm (one for each state) with $n = |\mathcal{A}|$ and find the initial policy $\pi_0(s)$ for each state $s$. Then we iteratively reveal rewards to the copy of the algorithm for state $s$ as $x^s_{k,i} = Q_k(s, a_i)$, and update the policy $\pi_{k+1}$ according to the no-absolute-regret algorithm and $Q_{k+1}(s,a) = r(s,a) + \gamma \, \mathbb{E}_{P, \pi_k}[Q_k(s',a')]$. (Note that we are revealing the rewards of all actions, so we are in the planning setting rather than the standard RL one. We address settings with limited feedback in Section 5.2.)

Call this process local no-regret learning (LONR). It can be viewed as a synchronous version of Expected SARSA (Van Seijen et al., 2009) where instead of using an $\epsilon$-greedy policy with decaying $\epsilon$, a no-absolute-regret policy is used. In the rest of this section we work up to our main result, that LONR converges to $Q^*$. Like many prior results using no-regret learning (e.g. Zinkevich et al. (2008)), the convergence is of the average of the $Q_k$ matrices.
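The process just described can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not our experimental code: it uses multiplicative weights with a fixed learning rate `eta` as the per-state regret minimizer, a hypothetical two-state MDP, and function names chosen for this example. Value iteration is included only to provide a reference point for the averaged Q-values.

```python
import numpy as np

def mwu_policy(cum_rewards, eta):
    """Per-state MWU distributions from cumulative revealed rewards (S x A)."""
    shifted = cum_rewards - cum_rewards.max(axis=1, keepdims=True)  # stability
    w = np.exp(eta * shifted)
    return w / w.sum(axis=1, keepdims=True)

def lonr_v(P, r, gamma, iters=20000, eta=0.05):
    """Synchronous LONR sketch; returns the average of Q_1, ..., Q_iters."""
    S, A = r.shape
    Q = np.zeros((S, A))       # Q_0
    cum = np.zeros((S, A))     # rewards revealed so far to each state's copy
    Q_sum = np.zeros((S, A))
    for _ in range(iters):
        pi = mwu_policy(cum, eta)   # pi_k(s), chosen before Q_k is revealed
        cum += Q                    # reveal x^s_{k,i} = Q_k(s, a_i)
        V = (pi * Q).sum(axis=1)    # E_{pi_k}[Q_k(s', a')] for each s'
        Q = r + gamma * P @ V       # Q_{k+1}(s, a)
        Q_sum += Q
    return Q_sum / iters

def q_value_iteration(P, r, gamma, iters=1000):
    """Reference solution obtained by iterating the operator T."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

Q_bar = lonr_v(P, r, gamma=0.9)
Q_star = q_value_iteration(P, r, gamma=0.9)
gap = np.abs(Q_bar - Q_star).max()   # distance of the averaged Q-values from Q*
```

The only difference from value iteration is the line computing `V`: the maximum over actions is replaced by an expectation under the regret minimizer's current policy.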

We work up to this result through a series of lemmas. To begin, we derive a bound on the average of the $Q_k$ values using the no-absolute-regret property. We use two slightly different averages to be able to relate them using the $\mathcal{T}$ operator.

###### Lemma 1.

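Let $\bar{Q}_k = \frac{1}{k} \sum_{t=1}^{k} Q_t$ and $\underline{Q}_k = \frac{1}{k} \sum_{t=0}^{k-1} Q_t$. Then

$$-\gamma \rho_{k-1} + \mathcal{T}\underline{Q}_k(s,a) \le \bar{Q}_k(s,a) \le \gamma \rho_{k-1} + \mathcal{T}\underline{Q}_k(s,a). \tag{3}$$

###### Proof.

By the definitions of LONR and no-regret,

$$\begin{aligned}
\bar{Q}_k(s,a) &= \frac{1}{k} \sum_{t=1}^{k} Q_t(s,a) \\
&= \frac{1}{k} \sum_{t=0}^{k-1} \left( r(s,a) + \gamma \, \mathbb{E}_{P,\pi_t}[Q_t(s',a')] \right) \\
&= r(s,a) + \gamma \, \mathbb{E}_P\left[ \frac{1}{k} \sum_{t=0}^{k-1} \mathbb{E}_{\pi_t}[Q_t(s',a')] \right] \\
&\ge r(s,a) + \gamma \, \mathbb{E}_P\left[ \max_i \frac{1}{k} \sum_{t=0}^{k-1} Q_t(s',a_i) - \rho_{k-1} \right] \\
&= -\gamma \rho_{k-1} + r(s,a) + \gamma \, \mathbb{E}_P\left[ \max_i \frac{1}{k} \sum_{t=0}^{k-1} Q_t(s',a_i) \right] \\
&= -\gamma \rho_{k-1} + r(s,a) + \gamma \, \mathbb{E}_P\left[ \max_i \underline{Q}_k(s',a_i) \right] \\
&= -\gamma \rho_{k-1} + \mathcal{T}\underline{Q}_k(s,a)
\end{aligned}$$

The key step is the inequality in the fourth line, where we use the fact that the policy for state $s'$ is being determined by a no-regret algorithm, so we can use Equation (2) to bound the expected value of the policy by the value of the hindsight-optimal action and the regret bound of the algorithm. Similarly, by the stronger no-absolute-regret property, we can reverse the inequality to get $\bar{Q}_k(s,a) \le \gamma \rho_{k-1} + \mathcal{T}\underline{Q}_k(s,a)$. This proves Equation (3). ∎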

Next, we show that the range that the $Q$ values take on is bounded. This lemma is similar in spirit to Lemma 2 of Bellemare et al. (2016). This and subsequent omitted proofs can be found in Appendix B.

###### Lemma 2.

Let . Then

Combining these two lemmas, we can show that $\bar{Q}_k$ is an approximate fixed-point of $\mathcal{T}$, and that the approximation error is converging to 0 as $k \to \infty$.

###### Lemma 3.

It remains to show that a converging sequence of approximate fixed points converges to $Q^*$, the fixed point of $\mathcal{T}$.

###### Lemma 4.

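Let $Q_0, Q_1, \ldots$ be a sequence such that $\lim_{k \to \infty} \| Q_k - \mathcal{T}Q_k \|_\infty = 0$. Then $\lim_{k \to \infty} Q_k = Q^*$.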
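One way to see why a statement of this kind should hold is the following short derivation. It is a sketch under the assumption that the approximation error is measured in the sup norm, that is, as $\|Q_k - \mathcal{T}Q_k\|_\infty$; the proof in Appendix B may proceed differently. Using the triangle inequality, the fact that $\mathcal{T}Q^* = Q^*$, and the contraction property of $\mathcal{T}$,

$$\begin{aligned}
\|Q_k - Q^*\|_\infty &\le \|Q_k - \mathcal{T}Q_k\|_\infty + \|\mathcal{T}Q_k - \mathcal{T}Q^*\|_\infty \\
&\le \|Q_k - \mathcal{T}Q_k\|_\infty + \gamma \, \|Q_k - Q^*\|_\infty ,
\end{aligned}$$

and rearranging (both sides are finite because the functions are bounded) gives

$$\|Q_k - Q^*\|_\infty \le \frac{\|Q_k - \mathcal{T}Q_k\|_\infty}{1 - \gamma} .$$

So if the fixed-point error vanishes, the distance to $Q^*$ vanishes as well, at a rate inflated by at most $1/(1-\gamma)$.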

Combining Lemmas 3 and 4 shows the convergence of LONR learning.

###### Theorem 1.

$\lim_{k \to \infty} \bar{Q}_k = Q^*$.

### 4.1 Beyond MDPs

While our results do not rely on perfect recall or terminal states the way CFR does, so far they are limited to the case of MDPs while CFR permits multiple agents and imperfect information. We can straightforwardly extend our results to some settings beyond MDPs. In Appendix A we show that a version of Lemma 1 holds in MDP-like settings where the transition probability kernel does not change from round to round but the rewards do. Examples of such settings include “online MDPs” and normal-form games. This last result is not particularly surprising as with a single state LONR reduces to standard no-regret learning, whose convergence guarantees in normal-form games are well understood. In Section 6 we present empirical results demonstrating convergence in the richer multi-agent setting of Markov games.

## 5 Extensions

In this section we consider two extensions to LONR, one allowing it to be updated asynchronously (i.e. not updating every state in every iteration) and the other allowing it to learn from asynchronous updates with bandit feedback (i.e. the standard off-policy RL setting). This introduces novel technical issues around the performance of no-regret algorithms when their performance is assessed on a sample taken with replacement of their rounds (rather than counting each round exactly once). Therefore, we analyze convergence only in the simplified case where the state to update at each iteration is chosen uniformly at random. We emphasize that this is an unreasonably strong assumption in practice, and view our results in this section as providing intuition about why sufficiently “nice” processes should converge. We demonstrate empirical convergence in a more standard on-policy setting in Section 6 and leave a more general theoretical analysis to future work.

### 5.1 Asynchronous updates

In Section 4 we analyzed an algorithm, LONR, which is similar to value iteration in that each state is updated synchronously at each iteration. However, an alternative is to update them asynchronously, where an arbitrary single state is updated at each iteration. Subject to suitable conditions on the frequency with which each state is updated, asynchronous value iteration also converges (Bertsekas, 1982).

A line of work has shown that CFR will also converge when sampling trajectories (Lanctot et al., 2009; Gibson et al., 2012; Johanson et al., 2012).

In this section, we show that LONR also converges with such asynchronous updates. However, this introduces a new complexity to our analysis. In particular, with synchronous updates there is a guarantee that $\bar{Q}_k(s,a)$ sees exactly the first $k$ values of each action of each of its successor states. This allows us to immediately apply the no-regret property (2). With asynchronous updates, even if we update all actions in a state at the same time, $s$'s successors may have been updated more or fewer than $k$ times, and $s$ may have missed some of these updates and observed others more than once, meaning we cannot directly apply (2). We prove the following lemma to show that a particular sampling process converges to a correct estimate of the average regret, but believe that similar characterizations should hold for other “nice” processes. We demonstrate empirical convergence of asynchronous LONR when states are selected in an on-policy manner in Section 6.

###### Lemma 5.

Let be the first iterations at which is updated, be a successor of , be the iterations before at which was updated, and . If the state to be updated at each iteration is chosen uniformly at random then with probability 1.

###### Proof.

Let be the number of times is updated using . The are i.i.d. random variables whose law is the geometric distribution with probability 0.5. Thus, and by the strong law of large numbers the sample average of the converges to 1 with probability 1. Let and . Then by (Etemadi, 2006, Theorem 3), also converges to 1 with probability 1. Equivalently, with probability 1. ∎

With this in hand, we can now prove a result similar to Lemma 1 for asynchronous updates. The primary difference is that we now have an additional error term in the bounds, but like the $\rho$ term from the regret it goes to zero per Lemma 5.

###### Lemma 6.

Let be the state selected uniformly at random and updated in iteration , for which this is the -th update and let and for . Then

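$$\min_{s'} \gamma \left( -\xi_{ss'}(k) - \rho_{k'} \right) + \mathcal{T}\bar{Q}_t(s,a) \le \bar{Q}_{t+1}(s,a) \le \max_{s'} \gamma \left( -\xi_{ss'}(k) + \rho_{k'} \right) + \mathcal{T}\bar{Q}_t(s,a). \tag{4}$$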

It immediately follows that $\bar{Q}_t$ is an approximate fixed-point of $\mathcal{T}$, and that the approximation error is converging to 0 as $t \to \infty$.

###### Lemma 7.

Let be the minimum number of times a state has been chosen uniformly at random for update by time . Then

Combining Lemmas 7 and 4 (the latter of which applies without change) shows the convergence of asynchronous LONR learning.

###### Theorem 2.

If states are chosen for update uniformly at random, then $\lim_{t \to \infty} \bar{Q}_t = Q^*$ with probability 1.
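The asynchronous variant analyzed in this section can be sketched as follows. As before, this is a hedged illustration rather than our experimental code: the per-state regret minimizer is multiplicative weights with a fixed learning rate, the two-state MDP is hypothetical, and the choice to average each state's Q-values over that state's own updates is an assumption of the sketch.

```python
import numpy as np

def lonr_async(P, r, gamma, iters=40000, eta=0.05, seed=0):
    """Asynchronous LONR sketch: one uniformly random state is updated per iteration."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A))
    cum = np.zeros((S, A))     # rewards revealed to each state's regret minimizer
    Q_sum = np.zeros((S, A))
    counts = np.zeros(S)
    for _ in range(iters):
        s = rng.integers(S)    # state chosen uniformly at random
        # Current MWU policy at every state; successors are read, only s is written.
        w = np.exp(eta * (cum - cum.max(axis=1, keepdims=True)))
        pi = w / w.sum(axis=1, keepdims=True)
        V = (pi * Q).sum(axis=1)          # value of each successor under its policy
        cum[s] += Q[s]                    # reveal s's current Q-values to its own copy
        Q[s] = r[s] + gamma * P[s] @ V    # update all actions of s at once
        Q_sum[s] += Q[s]
        counts[s] += 1
    return Q_sum / counts[:, None]        # per-state average over that state's updates

def q_value_iteration(P, r, gamma, iters=1000):
    """Reference solution obtained by iterating the operator T."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(iters):
        Q = r + gamma * P @ Q.max(axis=1)
    return Q

# Hypothetical deterministic MDP: action 0 leads to state 0, action 1 to state 1.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[1.0, 0.0], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

Q_bar_async = lonr_async(P, r, gamma=0.9)
Q_star = q_value_iteration(P, r, gamma=0.9)
gap_async = np.abs(Q_bar_async - Q_star).max()
```

Note that a state may read the same successor value several times between that successor's updates, or miss some of them entirely, which is exactly the complication the analysis above has to handle.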

### 5.2 Asynchronous updates with bandit feedback

In RL, algorithms like Q-learning are usually assumed not to know $P$ and so only have access to feedback corresponding to the action actually taken in the current iteration. In such settings, ordinary no-regret algorithms are not applicable because they require the counterfactual results from actions not chosen. However, multi-armed bandit algorithms, such as Exp3 (Auer et al., 2002), are designed to achieve no-regret guarantees in expectation despite only receiving feedback about the outcomes chosen. It would be natural to adapt LONR to the on-policy RL setting by replacing the no-regret algorithm with a multi-armed bandit one. This type of result has previously been obtained for normal-form games (Banerjee and Peng, 2005), where agents can learn to play optimally even if they only learn their payoff at each stage and not what action the other agents took.

To adapt LONR to make use of multi-armed bandit algorithms, we can use the update rule if is the action chosen for state and for . (The use of importance sampling here is to maintain the structure that successor states are evaluated as . Alternatively we could use the SARSA-style update where is the action that was chosen the last time was updated and leave all other Q-values unchanged; this also requires appropriately adjusting the way the average is computed.) The no-absolute-regret algorithm for bandit feedback at can then be updated as . (We use the raw rather than importance sampling estimate here because, e.g., Exp3 already includes importance weighting.) Unlike in Q-learning, we do not need to average over Q-values to account for the stochasticity in choice of because our convergence results are already for the averages of our Q-values.

With these definitions, Lemma 6 can be immediately adapted to this setting with the caveat that now the guarantees only hold in expectation over the choice of action at each iteration and the resulting state. Furthermore, since we require the state be chosen uniformly at random, the resulting algorithm is on-policy in the sense that the algorithm is choosing which action to receive feedback about, but does not control the sequence of states in which it acts.

###### Lemma 8.

Let be the state selected uniformly at random and updated in iteration , for which this is the -th update and let and for . Then

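$$\min_{s'} \gamma \left( -\xi_{ss'}(k) - \rho_{k'} \right) + \mathcal{T}\bar{Q}_t(s,a) \le \mathbb{E}\left[\bar{Q}_{t+1}(s,a)\right] \le \max_{s'} \gamma \left( -\xi_{ss'}(k) + \rho_{k'} \right) + \mathcal{T}\bar{Q}_t(s,a). \tag{5}$$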

The same analysis from the asynchronous full information case then yields the following theorem.

###### Theorem 3.

If states are chosen for update uniformly at random, then $\lim_{t \to \infty} \mathbb{E}[\bar{Q}_t] = Q^*$.

This convergence of expectation implies that the $\bar{Q}_t$ converge in probability to $Q^*$, a weaker guarantee than the almost sure convergence of algorithms like Q-learning. We leave deriving a stronger convergence guarantee with more natural assumptions about state selection to future work.

## 6 Experiments

Our theoretical results in Sections 4 and 5 are restricted to (online) MDPs and normal form games and require a number of technical assumptions. The primary goal of this section is to provide evidence that relaxation of these restrictions may be possible.

A second goal is to understand the performance of various well-known regret minimizers within the LONR framework: the theory behind LONR calls for a regret minimizer with the no-absolute-regret property, which these may or may not satisfy. One popular class of no-regret algorithms is Follow-the-Regularized-Leader (FoReL) algorithms, of which Multiplicative Weights Update (MWU) is perhaps the best known. MWU determines a probability distribution over actions by normalizing weights assigned to each action, with each weight equal to the exponential of the sum of that action's past rewards scaled by a learning rate. It satisfies the stronger non-negative-regret property and therefore the no-absolute-regret property. Another algorithm we consider is Optimistic Multiplicative Weights Update (OMWU), which extends MWU with optimism by making the slight adjustment of counting the last value twice each iteration, a change which guarantees not just that the average policy is no-regret, but that the last one (the last iterate) is as well (Daskalakis and Panageas, 2018). We also consider Regret Matching (RM) algorithms (Hart and Mas-Colell, 2000), which are the most widely used regret minimizers in CFR-based algorithms due to their simplicity and, unlike FoReL, lack of parameters. With RM, the policy for iteration $k+1$ selects actions with probability proportional to the accumulated positive regrets over iterations 0 to $k$. Regret Matching+ (RM+) is a variation that resets negative accumulated regret sums to zero after each iteration and applies a linear weighting term to the contributions to the average strategy (Tammelin, 2014). The current state-of-the-art algorithm, Discounted CFR (DCFR), is a parameterized algorithm generalizing RM+ in which the accumulated positive and negative regrets are weighted separately, as is the contribution to the average strategy (Brown and Sandholm, 2019a). The parameters used are $\alpha = 3/2$, $\beta = 0$, and $\gamma = 2$, which are the values recommended by the authors. All of these variants of RM are known not to have last iterate convergence in general and not to satisfy the non-negative-regret property. (We do not know whether they satisfy the no-absolute-regret property.)
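The difference between MWU and OMWU described above amounts to a single extra term in the exponent. The following minimal sketch shows the two policy computations side by side; the function names, the learning rate, and the two-round reward history are hypothetical and chosen only to make the effect of the optimistic term visible.

```python
import numpy as np

def mwu_policy(cum_rewards, eta):
    """MWU: weights proportional to exp(eta * sum of past rewards)."""
    z = eta * cum_rewards
    w = np.exp(z - z.max())   # subtract the max for numerical stability
    return w / w.sum()

def omwu_policy(cum_rewards, last_reward, eta):
    """OMWU: as MWU, but the most recent reward vector is counted twice."""
    return mwu_policy(cum_rewards + last_reward, eta)

# Hypothetical history: both actions have the same total reward,
# but action 0 did better in the most recent round.
history = np.array([[0.0, 1.0],
                    [1.0, 0.0]])
cum = history.sum(axis=0)

p_mwu = mwu_policy(cum, eta=0.5)                 # uniform: totals are equal
p_omwu = omwu_policy(cum, history[-1], eta=0.5)  # leans toward action 0
```

Counting the last value more than twice, as explored in Section 6.1, would correspond to adding a larger multiple of `last_reward` in `omwu_policy`.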

In addition to these standard no-regret algorithms, we introduce a new variant of RM called Regret Matching++ (RM++), which updates in a similar fashion to Regret Matching but clips the instantaneous regrets at 0. That is, if $r_{t,i}$ is the regret of action $i$ in round $t$, RM tracks $\sum_t r_{t,i}$ while RM++ tracks the upper bound $\sum_t \max(r_{t,i}, 0)$. (The same idea of clipping instantaneous regrets at 0 has recently been used by actor-critic approaches (Srinivasan et al., 2018).) In the appendix we prove that RM++ is in fact a no-regret algorithm. The proof is a minor variation of the proof for RM+ (Tammelin et al., 2015). We also demonstrate that RM++ empirically has last iterate convergence in a number of settings. This may be of independent interest as unlike OMWU it is not obviously describable as an optimistic version of another regret minimizer.
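The contrast between RM and RM++ can be sketched as two one-line updates of the tracked regret vector. The sketch below is an illustration under assumptions: the helper names and the cyclic reward sequence are hypothetical, and the policy rule (play proportionally to the positive part of the tracked vector, uniform if it is all non-positive) is the standard regret matching rule applied to both variants.

```python
import numpy as np

def policy_from_regrets(tracked):
    """Play proportionally to the positive part of the tracked regrets."""
    pos = np.maximum(tracked, 0.0)
    total = pos.sum()
    if total > 0.0:
        return pos / total
    return np.full(tracked.shape, 1.0 / tracked.size)  # uniform fallback

def instantaneous_regret(policy, rewards):
    """Regret of each action in one round: its reward minus the policy's reward."""
    return rewards - policy @ rewards

def rm_step(tracked, rewards):
    """RM: accumulate the raw instantaneous regrets."""
    policy = policy_from_regrets(tracked)
    return tracked + instantaneous_regret(policy, rewards)

def rmpp_step(tracked, rewards):
    """RM++: clip each instantaneous regret at 0 before accumulating."""
    policy = policy_from_regrets(tracked)
    return tracked + np.maximum(instantaneous_regret(policy, rewards), 0.0)

# Hypothetical reward sequence that cycles through which action is best.
rewards_seq = np.array([[1.0, 0.0, 0.0],
                        [0.0, 1.0, 0.0],
                        [0.0, 0.0, 1.0]] * 4)
rm_tracked = np.zeros(3)
rmpp_tracked = np.zeros(3)
for x in rewards_seq:
    rm_tracked = rm_step(rm_tracked, x)
    rmpp_tracked = rmpp_step(rmpp_tracked, x)
rmpp_policy = policy_from_regrets(rmpp_tracked)
```

Because RM++ only ever adds non-negative terms, its tracked vector never decreases, which is what makes it an upper bound on the sum tracked by RM when both are applied to the same sequence of instantaneous regrets.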

Lastly, we present results for the first two versions of LONR we analyzed theoretically: value-iteration style (LONR-V) and with asynchronous updates (LONR-A). For LONR-A, while the theory requires states be chosen for update uniformly at random, we instead run it on policy. (We add a small probability of a random action, 0.1, to ensure adequate exploration.) Our results show that empirically this does not prevent convergence.

The settings we use for our results are chosen to demonstrate LONR in settings where neither CFR nor standard RL algorithms are applicable. For CFR, this means we choose settings with repeated states and possibly a lack of terminals. For RL, this means considering settings with multiple agents. Since our exposition of LONR is for a single agent setting, we now explain how we apply it in multi-agent settings. We use centralized training, so each agent has access to the current policy of the other agent. This allows the agent to update with the expected rewards and transition probabilities induced by the current policy of the other agent.
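With centralized training, the quantities an agent updates with are obtained by marginalizing over the other agent's current policy. The sketch below shows this computation for player 1 in a two-player Markov game; the array layouts, the function name `induced_mdp`, and the single-state matching-pennies example are assumptions made for illustration.

```python
import numpy as np

def induced_mdp(R1, P, pi2):
    """Player 1's single-agent view of a two-player Markov game.

    R1:  array (S, A1, A2), rewards to player 1.
    P:   array (S, A1, A2, S), joint transition probabilities.
    pi2: array (S, A2), current policy of player 2.
    Returns the expected rewards (S, A1) and transitions (S, A1, S).
    """
    r_ind = np.einsum('sb,sab->sa', pi2, R1)    # average rewards over a2 ~ pi2(s)
    P_ind = np.einsum('sb,sabt->sat', pi2, P)   # average transitions over a2 ~ pi2(s)
    return r_ind, P_ind

# Hypothetical single-state game with matching-pennies payoffs for player 1.
R1 = np.array([[[1.0, -1.0],
                [-1.0, 1.0]]])
P = np.ones((1, 2, 2, 1))            # every joint action returns to the only state
pi2 = np.array([[0.75, 0.25]])       # player 2's current policy

r_ind, P_ind = induced_mdp(R1, P, pi2)
```

The resulting `r_ind` and `P_ind` can then be used in place of $r$ and $P$ in a single-agent LONR update for player 1, and symmetrically for player 2.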

### 6.1 NoSDE Markov Game

Our primary setting is a stateful one with multiple agents. Such settings are naturally modelled as Markov games, a generalization of MDPs to multi-agent settings. A Markov Game is a tuple where is the set of states, is the set of players, the set of all state-action pairs , a transition kernel , and a discount factor .

Because Markov Games can model a wide variety of games, algorithms designed for the entirety of this class must be robust to particularly troublesome subclasses. One early negative result found that there exist general-sum Markov Games in which no stationary deterministic equilibria exist, which Zinkevich, Greenwald, and Littman (2006) term NoSDE games. These games have the property that there exists a unique stationary equilibrium with (randomized) policies where the Q-values for each agent are identical in equilibrium but their equilibrium strategies are not. Furthermore, additional complexity exists as the rewards of each player in this NoSDE game can be adjusted within a certain closed interval, where the resulting Q-values remain the same, but the stationary policy changes, thus making Q-value learning even more problematic.

The reward structure for the particular NoSDE game we use is shown in Figure 1(a) for Player 1 and Figure 1(b) for Player 2. Conceptually, a NoSDE game is a deterministic Markov Game with 2 players and 2 states, in which each state has a single player with more than one action. The dynamics of a NoSDE game become cyclic as each player prefers to change actions when the other player does as well, which causes the non-stationarity. In this instance, when player 1 sends, player 2 then prefers to send. This causes player 1 to prefer to keep, which in turn causes player 2 to prefer to keep. Player 1 then prefers to send and the cycle repeats. Due to these negative results, Q-value learning algorithms cannot learn the stationary equilibrium. The state of the art solution is still that of Zinkevich, Greenwald, and Littman (2006), who give a multi-agent value iteration procedure which can approximate a cyclic (non-stationary) equilibrium.

No-regret algorithms are known to converge in self-play, but not necessarily to desirable points, e.g. Nash Equilibrium. This convergence guarantee is in the average policy. Our first results look at the average policies in the NoSDE game with LONR-V. Figure 2 shows the behavior of the average probability with which player 1 chooses to SEND. The unique stationary equilibrium probability for this action is 2/3. Each algorithm shows convergence, but not to the same value. Not shown but important is that each is also converging to the equilibrium in the average Q-values.

RM and MWU converge to a similar average policy (top two lines). These two algorithms choose based on tracking the sum of regrets and rewards respectively. RM+ and DCFR follow a similar path (next two lines), which makes sense given that RM+ is a special case of DCFR. RM++ and OMWU are the only two which find the stationary equilibrium policy (bottom two lines). These two are also the only two with last iterate convergence properties (OMWU provably and RM++ empirically). Figure 3, which plots the current iterate for each regret minimizer, shows that this holds in our NoSDE game as well. RM++ and OMWU achieve last iterate convergence while for the other four cyclic behavior can be seen. This result highlights NoSDE games as a setting where it would be interesting to theoretically study last iterate convergence in between simple normal form games (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018) and rich, complex settings such as GANs (Daskalakis et al., 2017).

While the theory behind OMWU states that the last value need only be counted twice, our results highlight the difference in the last iterate when more optimism is included (i.e. the last value is counted more than twice). Specifically, in Figure 3(a), we plot the last iterate for increasing counts of the last value. The figure indicates the role increased optimism plays not only in convergence versus divergence, but also in how quickly convergence happens. In this case, despite the theory, counting twice does not lead to convergence in the last iterate, but counting 3 or more times does. Again, theoretically exploring this phenomenon is an interesting direction for future work.

Lastly, we analyze LONR-A, the asynchronous version of LONR. We restrict our results to the two regret minimizers which show last iterate convergence, RM++ (Figure 3(b)) and OMWU (Figure 3(c)), plotting 100 runs of each. They show that, despite a more natural process for choosing which state to update than our theory permits, we still see convergence.

Additional experiments which bridge the gap from MDPs to NoSDE Markov games are presented in the Appendix. For a “nicer” Markov game than our deliberately challenging NoSDE game, we use the standard simple 2-player, zero-sum soccer game (Littman, 1994). With any of our six regret minimizers, both LONR-V and LONR-A achieve approximate equilibrium payoffs on average. To probe the assumptions of our theory in a setting closer to it, we run LONR on the typical benchmark GridWorld environment, an MDP. Specifically, we use the standard cliff-walking task, which requires the agent to avoid a high-cost cliff to reach the exit terminal state. Again, LONR-V and LONR-A learn the optimal policy (and optimal Q-values) despite regret minimizers that may not satisfy the no-absolute-regret property and, in the case of LONR-A, on-policy state selection.

## 7 Conclusion

We have proposed a new learning algorithm, local no-regret learning (LONR). We have shown its convergence for the basic case of MDPs (and limited extensions of them) and presented empirical results showing that it achieves convergence, and in some cases last iterate convergence, in a number of settings, most notably NoSDE games. We view this as a proof-of-concept for achieving CFR-style results without requiring perfect recall or terminal states.

Our results point to a number of interesting directions for future research. First, a natural goal given our empirical results would be to extend our convergence results to Markov games. Second, CFR also works in settings with partial observability by appropriately weighting the different states which correspond to the same observed history. Third, we would like to relax the strong assumptions our results about asynchronous updates require. All three seem to rely on the same fundamental building block of better understanding the behavior of no-regret learners whose rewards are determined by (asynchronous) observations of other no-regret learners. Some recent progress along these lines has been made (Farina, Kroer, and Sandholm, 2018; Kovařík and Lisỳ, 2018), but more work is needed.

Orthogonal directions are suggested by our empirical results about last iterate convergence. Can we establish theoretical guarantees for NoSDEs or Markov games more broadly? Is RM++ guaranteed to achieve last iterate convergence? It empirically does in standard games like matching pennies and rock-paper-scissors, which trip up most regret minimizers. If so, does this represent a new style of algorithm to achieve last iterate convergence, or is there a way to interpret its clipping of regrets as optimism?

## References

• Abernethy, Lai, and Wibisono (2019) Abernethy, J.; Lai, K. A.; and Wibisono, A. 2019. Last-iterate convergence rates for min-max optimization. arXiv preprint.
• Auer et al. (2002) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM 32(1):48–77.
• Bailey and Piliouras (2018) Bailey, J. P., and Piliouras, G. 2018. Multiplicative weights update in zero-sum games. In EC 2018, 321–338. ACM.
• Bailey, Gidel, and Piliouras (2019) Bailey, J. P.; Gidel, G.; and Piliouras, G. 2019. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent.
• Banerjee and Peng (2005) Banerjee, B., and Peng, J. 2005. Efficient no-regret multiagent learning. In AAAI, 41–46.
• Bellemare et al. (2016) Bellemare, M. G.; Ostrovski, G.; Guez, A.; Thomas, P. S.; and Munos, R. 2016. Increasing the action gap: New operators for reinforcement learning. In AAAI, 1476–1483.
• Bertsekas (1982) Bertsekas, D. 1982. Distributed dynamic programming. IEEE transactions on Automatic Control 27(3):610–616.
• Bowling et al. (2015) Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up limit hold’em poker is solved. Science 347:145–149.
• Brown and Sandholm (2019a) Brown, N., and Sandholm, T. 2019a. Solving imperfect-information games via discounted regret minimization. In Proc. AAAI Conference on Artificial Intelligence, volume 33, 1829–1836.
• Brown and Sandholm (2019b) Brown, N., and Sandholm, T. 2019b. Superhuman ai for multiplayer poker. Science 365(6456):885–890.
• Brown et al. (2018) Brown, N.; Lerer, A.; Gross, S.; and Sandholm, T. 2018. Deep counterfactual regret minimization. arXiv:1811.00164.
• Cheung and Piliouras (2019) Cheung, Y. K., and Piliouras, G. 2019. Vortices instead of equilibria in minmax optimization: Chaos and butterfly effects of online learning in zero-sum games. arXiv preprint arXiv:1905.08396.
• Daskalakis and Panageas (2018) Daskalakis, C., and Panageas, I. 2018. Last-iterate convergence: Zero-sum games and constrained min-max optimization. arXiv
• Daskalakis et al. (2017) Daskalakis, C.; Ilyas, A.; Syrgkanis, V.; and Zeng, H. 2017. Training gans with optimism. arXiv preprint arXiv:1711.00141.
• Etemadi (2006) Etemadi, N. 2006. Convergence of weighted averages of random variables revisited. Proc. Am. Math. Soc. 134(9):2739–2744.
• Even-Dar, Kakade, and Mansour (2005) Even-Dar, E.; Kakade, S. M.; and Mansour, Y. 2005. Experts in a markov decision process. In NIPS, 401–408.
• Even-Dar, Kakade, and Mansour (2009) Even-Dar, E.; Kakade, S. M.; and Mansour, Y. 2009. Online markov decision processes. Math. OR 34(3):726–736.
• Even-Dar, Mannor, and Mansour (2002) Even-Dar, E.; Mannor, S.; and Mansour, Y. 2002. Pac bounds for multi-armed bandit and markov decision processes. In COLT.
• Farina et al. (2019) Farina, G.; Kroer, C.; Brown, N.; and Sandholm, T. 2019. Stable-predictive optimistic counterfactual regret minimization. arXiv
• Farina, Kroer, and Sandholm (2018) Farina, G.; Kroer, C.; and Sandholm, T. 2018. Composability of regret minimizers. arXiv preprint arXiv:1811.02540.
• Gibson et al. (2012) Gibson, R. G.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized sampling and variance in counterfactual regret minimization. In AAAI.
• Gofer and Mansour (2016) Gofer, E., and Mansour, Y. 2016. Lower bounds on individual sequence regret. Machine Learning 103(1):1–26.
• Gondek, Greenwald, and Hall (2004) Gondek, D.; Greenwald, A.; and Hall, K. 2004. Qnr-learning in markov games.
• Greenwald and Jafari (2003) Greenwald, A., and Jafari, A. 2003. A general class of no-regret learning algorithms and game-theoretic equilibria. In LTKM.
• Greenwald, Hall, and Serrano (2003) Greenwald, A.; Hall, K.; and Serrano, R. 2003. Correlated q-learning. In ICML, volume 3, 242–249.
• Hart and Mas-Colell (2000) Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.
• Heinrich and Silver (2016) Heinrich, J., and Silver, D. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint.
• Hu and Wellman (2003) Hu, J., and Wellman, M. P. 2003. Nash q-learning for general-sum stochastic games. JMLR 4(Nov):1039–1069.
• Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is q-learning provably efficient? In NIPS, 4863–4873.
• Jin, Levine, and Keutzer (2017) Jin, P. H.; Levine, S.; and Keutzer, K. 2017. Regret minimization for partially observable deep reinforcement learning. arXiv
• Johanson et al. (2012) Johanson, M.; Bard, N.; Lanctot, M.; Gibson, R.; and Bowling, M. 2012. Efficient nash equilibrium approximation through monte carlo counterfactual regret minimization. In AAMAS, 837–846.
• Kearns and Singh (1999) Kearns, M. J., and Singh, S. P. 1999. Finite-sample convergence rates for q-learning and indirect algorithms. In NIPS, 996–1002.
• Kovařík and Lisỳ (2018) Kovařík, V., and Lisỳ, V. 2018. Analysis of hannan consistent selection for monte carlo tree search in simultaneous move games.
• Lanctot et al. (2009) Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte carlo sampling for regret minimization in extensive games. In Advances in neural information processing systems, 1078–1086.
• Lanctot et al. (2012) Lanctot, M.; Gibson, R.; Burch, N.; Zinkevich, M.; and Bowling, M. 2012. No-regret learning in extensive-form games with imperfect recall. arXiv preprint arXiv:1205.0622.
• Li et al. (2018) Li, H.; Hu, K.; Ge, Z.; Jiang, T.; Qi, Y.; and Song, L. 2018. Double neural counterfactual regret minimization. arXiv preprint.
• Littman (1994) Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In ML Proc. 1994. 157–163.
• Ma, Zhang, and Sugiyama (2015) Ma, Y.; Zhang, H.; and Sugiyama, M. 2015. Online markov decision processes with policy iteration. arXiv preprint.
• Mannor and Shimkin (2003) Mannor, S., and Shimkin, N. 2003. The empirical bayes envelope and regret minimization in competitive markov decision processes. Mathematics of Operations Research 28(2):327–345.
• Mertikopoulos, Papadimitriou, and Piliouras (2018) Mertikopoulos, P.; Papadimitriou, C.; and Piliouras, G. 2018. Cycles in adversarial regularized learning. In SODA, 2703–2717.
• Moravčík et al. (2017) Moravčík, M.; Schmid, M.; Burch, N.; Lisỳ, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. 2017. Deepstack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.
• Neu, Jonsson, and Gómez (2017) Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized markov decision processes. arXiv preprint.
• Srinivasan et al. (2018) Srinivasan, S.; Lanctot, M.; Zambaldi, V.; Pérolat, J.; Tuyls, K.; Munos, R.; and Bowling, M. 2018. Actor-critic policy optimization in partially observable multiagent environments. In NIPS.
• Tammelin et al. (2015) Tammelin, O.; Burch, N.; Johanson, M.; and Bowling, M. 2015. Solving heads-up limit texas hold’em. In IJCAI, 645–652.
• Tammelin (2014) Tammelin, O. 2014. Solving large imperfect information games using cfr+. arXiv preprint arXiv:1407.5042.
• Tesauro and Kephart (2002) Tesauro, G., and Kephart, J. O. 2002. Pricing in agent economies using multi-agent q-learning. JAAMAS 5(3):289–304.
• Van Seijen et al. (2009) Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of expected sarsa. In ADPRL’09., 177–184. IEEE.
• Waugh et al. (2015) Waugh, K.; Morrill, D.; Bagnell, J. A.; and Bowling, M. 2015. Solving games with functional regret estimation. In AAAI.
• Yu, Mannor, and Shimkin (2009) Yu, J. Y.; Mannor, S.; and Shimkin, N. 2009. Markov decision processes with arbitrary reward processes. MOR 34(3):737–757.
• Zinkevich et al. (2008) Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In Advances in neural information processing systems, 1729–1736.
• Zinkevich, Greenwald, and Littman (2006) Zinkevich, M.; Greenwald, A.; and Littman, M. L. 2006. Cyclic equilibria in markov games. In NIPS, 1641–1648.

## Appendix A Beyond MDPs

If we move beyond MDPs, $P$ and $r$ are no longer stationary and in general we have a $P_t$ and $r_t$. This causes problems with the proof of Lemma 1. Recall the initial part of that proof, updated to this more general setting:

$$
\overline{Q}_k(s,a) = \frac{1}{k}\sum_{t=1}^{k} Q_t(s,a) = \frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma\, \mathbb{E}_{P_t,\pi_t}\left[Q_t(s',a')\right]
$$

In the original proof, we pulled the expectation over $P$ outside the sum, but now we cannot. In particular, writing the expectation more explicitly gives

$$
\frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma \sum_{s' \in S} P_t(s' \mid s,a)\, \mathbb{E}_{\pi_t}\left[Q_t(s',a')\right] \tag{6}
$$

We can still reverse the order of the sums, but the weighting terms now depend on $t$ so they cannot be moved outside. More problematically, they also depend on $s$ and $a$, so it is not immediately clear how to generalize our results.

For intuition about the sort of problems that could arise, consider a state $s'$ where there are two actions. At odd $t$, the first action is the better of the two, and vice versa at even $t$. It is a valid no-regret strategy to randomize uniformly over the actions, but if the $P_t$ are such that you only arrive in $s'$ from $s$ at odd $t$, then this gives an incorrect estimate.

In the remainder of this section, we analyze a special case where we can prove a variant of Lemma 1.

### A.1 Time-invariant P

If $P$ does not change with $t$, but $r$ does, we can still prove a version of Lemma 1. With a single state, this captures learning in normal-form games, where no-regret learning is indeed known to work. This assumption is also common in the literature on “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015). In this setting, the operator is no longer constant but changes over time as

$$
T_k Q(s,a) = \underline{r}_k(s,a) + \gamma\, \mathbb{E}_P\left[\max_i Q(s',a_i)\right]. \tag{7}
$$
###### Lemma 9.
$$
-\gamma\rho_{k-1} + T_k\underline{Q}_k(s,a) \le \overline{Q}_k(s,a) \le \gamma\rho_{k-1} + T_k\underline{Q}_k(s,a). \tag{8}
$$
###### Proof.
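$$
\begin{aligned}
\overline{Q}_k(s,a) &= \frac{1}{k}\sum_{t=1}^{k} Q_t(s,a) \\
&= \frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma\, \mathbb{E}_{P,\pi_t}\left[Q_t(s',a')\right] \\
&= \frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma\, \mathbb{E}_P\left[\frac{1}{k}\sum_{t=0}^{k-1} \mathbb{E}_{\pi_t}\left[Q_t(s',a')\right]\right] \\
&\ge \frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma\, \mathbb{E}_P\left[\max_i \frac{1}{k}\sum_{t=0}^{k-1} Q_t(s',a_i) - \rho_{k-1}\right] \\
&= -\gamma\rho_{k-1} + \frac{1}{k}\sum_{t=0}^{k-1} r_t(s,a) + \gamma\, \mathbb{E}_P\left[\max_i \frac{1}{k}\sum_{t=0}^{k-1} Q_t(s',a_i)\right] \\
&= -\gamma\rho_{k-1} + \underline{r}_k(s,a) + \gamma\, \mathbb{E}_P\left[\max_i \underline{Q}_k(s',a_i)\right] \\
&= -\gamma\rho_{k-1} + T_k\underline{Q}_k(s,a)
\end{aligned}
$$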

As before, the key step is applying the no-regret property to obtain the inequality and we apply the same argument with the no-absolute-regret property to obtain the reverse inequality. ∎
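As a small numerical illustration of the time-varying operator in Equation (7), the sketch below applies $T_k$ to a tabular $Q$ given a history of reward tables and a fixed transition kernel. The function name `averaged_reward_backup` and the array layout are assumptions made for this example.

```python
import numpy as np

def averaged_reward_backup(reward_history, P, Q, gamma):
    """Apply T_k Q(s,a) = mean_t r_t(s,a) + gamma * E_P[max_i Q(s', a_i)].

    reward_history: shape (k, S, A), the rewards r_0, ..., r_{k-1}.
    P: shape (S, A, S), time-invariant transition probabilities P(s' | s, a).
    Q: shape (S, A).
    """
    r_bar = np.asarray(reward_history, dtype=float).mean(axis=0)  # time-averaged reward
    greedy_value = np.asarray(Q, dtype=float).max(axis=1)         # max_i Q(s', a_i) per s'
    # (S, A, S) @ (S,) gives the expectation over s' for every (s, a).
    return r_bar + gamma * np.asarray(P, dtype=float) @ greedy_value
```

With a single state this is the normal-form case mentioned above: the backup depends on the reward history only through its running average.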

## Appendix B Omitted Proofs

Let . Then

###### Proof.

By definition, . Thus by the subadditive property of norms, . By induction, . Thus . ∎
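For orientation, one way to obtain the bound on $\|Q_k - Q_0\|_\infty$ that is used in the proof of Lemma 3 is sketched here. It assumes the update $Q_{t+1}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{P,\pi_t}[Q_t(s',a')]$, and is offered as a plausible derivation rather than a verbatim restatement of the original argument.

$$
\begin{aligned}
\|Q_{t+1}\|_\infty &\le \|r\|_\infty + \gamma\|Q_t\|_\infty, \\
\|Q_k\|_\infty &\le \sum_{i=0}^{k-1}\gamma^i\|r\|_\infty + \gamma^k\|Q_0\|_\infty \le \frac{1}{1-\gamma}\|r\|_\infty + \|Q_0\|_\infty, \\
\|Q_k - Q_0\|_\infty &\le \|Q_k\|_\infty + \|Q_0\|_\infty \le \frac{1}{1-\gamma}\|r\|_\infty + 2\|Q_0\|_\infty.
\end{aligned}
$$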

###### Proof.
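$$
\begin{aligned}
\|\underline{Q}_k - T\underline{Q}_k\|_\infty &\le \|\underline{Q}_k - \overline{Q}_k\|_\infty + \|\overline{Q}_k - T\underline{Q}_k\|_\infty \\
&= \|\underline{Q}_k - \overline{Q}_k\|_\infty + \max_{s,a}\left|\overline{Q}_k(s,a) - T\underline{Q}_k(s,a)\right| \\
&\le \|\underline{Q}_k - \overline{Q}_k\|_\infty + \gamma\rho_{k-1} \\
&= \frac{1}{k}\|Q_k - Q_0\|_\infty + \gamma\rho_{k-1} \\
&\le \frac{1}{k}\left(\frac{1}{1-\gamma}\|r\|_\infty + 2\|Q_0\|_\infty\right) + \gamma\rho_{k-1}
\end{aligned}
$$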

The first step follows by the subadditive property of norms, the second by definition, the third by Lemma 1, the fourth by definition, and the fifth by Lemma 2. ∎

###### Lemma 4.

Let $\{Q_k\}$ be a sequence such that $\lim_{k\to\infty}\|Q_k - TQ_k\|_\infty = 0$. Then $\lim_{k\to\infty} Q_k = Q^*$.

###### Proof.
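$$
\begin{aligned}
\|Q_k - Q^*\|_\infty &\le \|Q_k - TQ_k\|_\infty + \|TQ_k - Q^*\|_\infty \\
&= \|Q_k - TQ_k\|_\infty + \|TQ_k - TQ^*\|_\infty \\
&\le \|Q_k - TQ_k\|_\infty + \gamma\|Q_k - Q^*\|_\infty
\end{aligned}
$$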

The first step follows by the subadditive property of norms, the second by optimality of $Q^*$, the third because $T$ is a contraction map. Rewriting yields

$$
\|Q_k - Q^*\|_\infty \le \frac{1}{1-\gamma}\|Q_k - TQ_k\|_\infty
$$

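Thus, by assumption, $\lim_{k\to\infty}\|Q_k - Q^*\|_\infty \le 0$. Since $\|Q_k - Q^*\|_\infty \ge 0$, $\lim_{k\to\infty}\|Q_k - Q^*\|_\infty \ge 0$. Thus $\lim_{k\to\infty}\|Q_k - Q^*\|_\infty = 0$ and the result follows. ∎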
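The inequality $\|Q - Q^*\|_\infty \le \frac{1}{1-\gamma}\|Q - TQ\|_\infty$ at the heart of Lemma 4 holds for any tabular $Q$, which can be checked numerically. The sketch below does so on a randomly generated MDP; the helper names (`random_mdp`, `lemma4_gap`) and the choice of value iteration to approximate $Q^*$ are assumptions made for this illustration.

```python
import numpy as np

def bellman_optimality_backup(Q, r, P, gamma):
    """(TQ)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) * max_a' Q(s',a')."""
    return r + gamma * P @ Q.max(axis=1)

def random_mdp(num_states, num_actions, rng):
    """Hypothetical helper: a random MDP with rewards in [0, 1)."""
    r = rng.random((num_states, num_actions))
    P = rng.random((num_states, num_actions, num_states))
    P /= P.sum(axis=2, keepdims=True)  # normalise each P(. | s, a)
    return r, P

def lemma4_gap(Q, r, P, gamma, vi_iters=2000):
    """Return (||Q - Q*||_inf, ||Q - TQ||_inf / (1 - gamma))."""
    Q_star = np.zeros_like(r)
    for _ in range(vi_iters):  # value iteration converges geometrically to Q*
        Q_star = bellman_optimality_backup(Q_star, r, P, gamma)
    distance = np.abs(Q - Q_star).max()
    residual = np.abs(Q - bellman_optimality_backup(Q, r, P, gamma)).max()
    return distance, residual / (1.0 - gamma)

rng = np.random.default_rng(0)
r, P = random_mdp(4, 3, rng)
Q = rng.normal(size=(4, 3))       # an arbitrary Q-table
distance, bound = lemma4_gap(Q, r, P, gamma=0.9)
print(distance <= bound + 1e-9)   # the distance should never exceed the bound
```

The same check can be repeated with any other `Q`; driving the Bellman residual to zero, as the lemma assumes, forces the distance to $Q^*$ to zero as well.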

###### Lemma 6.

Let be the state selected uniformly at random and updated in iteration , for which this is the -th update and let and for . Then

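$$
\min_{s'} \gamma\left(-\xi_{ss'}(k) - \rho_{k'}\right) + T\overline{Q}_t(s,a) \le \overline{Q}_{t+1}(s,a) \le \max_{s'} \gamma\left(-\xi_{ss'}(k) + \rho_{k'}\right) + T\overline{Q}_t(s,a).
$$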
###### Proof.

By the definitions of LONR and no-regret algorithms,

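$$
\begin{aligned}
\overline{Q}_{t+1}(s,a) &= \frac{1}{k}\sum_{i=1}^{k} Q_{t_i}(s,a) \\
&= \frac{1}{k}\sum_{i=1}^{k} r(s,a) + \gamma\, \mathbb{E}_{P,\pi_{t_i}}\left[Q_{t_i}(s',a')\right] \\
&= r(s,a) + \gamma\, \mathbb{E}_P\left[\frac{1}{k}\sum_{i=1}^{k} \mathbb{E}_{\pi_{t_i}}\left[Q_{t_i}(s',a')\right]\right] \\
&= r(s,a) + \gamma\, \mathbb{E}_P\left[-\xi_{ss'}(k) + \frac{1}{k'}\sum_{i=1}^{k'} \mathbb{E}_{\pi_{\tau_i}}\left[Q_{\tau_i}(s',a')\right]\right] \\
&\ge r(s,a) + \gamma\, \mathbb{E}_P\left[-\xi_{ss'}(k) + \max_{a'} \frac{1}{k'}\sum_{i=1}^{k'} Q_{\tau_i}(s',a') - \rho_{k'}\right] \\
&\ge \min_{s'} \gamma\left(-\xi_{ss'}(k) - \rho_{k'}\right) + r(s,a) + \gamma\, \mathbb{E}_P\left[\max_{a'} \frac{1}{k'}\sum_{i=1}^{k'} Q_{\tau_i}(s',a')\right] \\
&= \min_{s'} \gamma\left(-\xi_{ss'}(k) - \rho_{k'}\right) + r(s,a) + \gamma\, \mathbb{E}_P\left[\max_{a'} \overline{Q}_t(s',a')\right] \\
&= \min_{s'} \gamma\left(-\xi_{ss'}(k) - \rho_{k'}\right) + T\overline{Q}_t(s,a)
\end{aligned}
$$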

This argument is essentially the same as in the proof of Lemma 1, except that in the fourth equality we apply the definition of $\xi_{ss'}(k)$ to yield a form to which we can then apply the no-regret property. As before, the other half of the proof is symmetric and uses the no-absolute-regret property. ∎

## Appendix C Regret Matching++

In this section, we prove that RM++ is a no-regret algorithm and then demonstrate that it has empirical last iterate convergence.

###### Lemma 10.

Given a sequence of strategies , each defining a probability distribution over a set of actions A, consider any definition for satisfying the following conditions:

1. where

The regret-like value is then an upper bound on the regret

###### Proof.

The lemma and proof closely resemble the similar proofs in Tammelin (2014) and Brown and Sandholm (2019a).

For any ,

This gives

###### Lemma 11.

Given a set of actions A and any sequence of rewards such that for all t and all , after playing a sequence of strategies determined by regret matching but using the regret-like value in place of ,

###### Proof.

Again, the lemma and proof closely follow those of Tammelin (2014) and Brown and Sandholm (2019a).

=

for all , so by induction which gives

### C.1 Empirical results for RM++

Figure 4(a) shows that the last iterate of RM++ converges to the equilibrium of rock-paper-scissors. Similar results, not shown, hold for matching pennies. Prior work has shown that both RM and RM+ diverge in these games in terms of the last iterate (although they converge on average). We also tested RM++ in Soccer, as the no-regret algorithm for CFR in Kuhn poker, and for LONR in GridWorld. In all cases we achieved last iterate convergence.
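The contrast between average and last-iterate behavior for plain regret matching can be reproduced with a few lines of self-play on matching pennies. The sketch below uses expected (not sampled) payoffs and a small asymmetric initial regret so that play leaves the uniform fixed point; both choices, and all function names, are assumptions made for this illustration. It covers standard RM only, not RM++.

```python
import numpy as np

def regret_matching_policy(cum_regret):
    """Play proportionally to positive cumulative regret (uniform if none)."""
    positive = np.maximum(cum_regret, 0.0)
    total = positive.sum()
    if total <= 0.0:
        return np.full(len(cum_regret), 1.0 / len(cum_regret))
    return positive / total

def self_play_matching_pennies(num_iters=10000):
    """Run regret matching for both players using expected payoffs."""
    # Row player's payoff matrix; the column player receives the negative.
    payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
    regret_row = np.array([1.0, 0.0])  # asymmetric start (assumption)
    regret_col = np.zeros(2)
    avg_row = np.zeros(2)
    last_row = None
    for _ in range(num_iters):
        x = regret_matching_policy(regret_row)
        y = regret_matching_policy(regret_col)
        u_row = payoff @ y       # expected payoff of each row action
        u_col = -(x @ payoff)    # expected payoff of each column action
        regret_row += u_row - x @ u_row
        regret_col += u_col - y @ u_col
        avg_row += x
        last_row = x
    return avg_row / num_iters, last_row

average_policy, last_policy = self_play_matching_pennies()
print("average:", average_policy, "last iterate:", last_policy)
```

The average policy should be close to the uniform equilibrium, as the no-regret guarantee implies for two-player zero-sum games; printing the last iterate alongside it allows the two to be compared directly.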

## Appendix D Omitted Experimental Details

### D.1 LONR pseudocode

Here we use standard notation, specifically where is the total number of agents, is current agent, and represent the total states and current state respectively. The notation provides the set of actions for player in state . denotes the set of actions of all other agents excluding agent in state . refers to the current agent when unspecified.

The policy update uses any no-regret algorithm. The update for Regret Matching++ is shown here.
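As a rough, single-agent illustration of the loop described above, the following sketch runs a synchronous LONR-style update on a tabular MDP. It assumes the backup $Q_{t+1}(s,a) = r(s,a) + \gamma\,\mathbb{E}_{P,\pi_t}[Q_t(s',a')]$ followed by a local policy update at every state, and it substitutes plain regret matching for RM++ to keep the example short. The function name `lonr_v` and its parameters are hypothetical.

```python
import numpy as np

def lonr_v(r, P, gamma=0.9, num_iters=500):
    """Synchronous LONR sketch on an MDP with regret matching at every state.

    r: rewards, shape (S, A).  P: transitions P(s' | s, a), shape (S, A, S).
    Returns (average Q-values, final policy).
    """
    num_states, num_actions = r.shape
    Q = np.zeros((num_states, num_actions))
    cum_regret = np.zeros((num_states, num_actions))
    policy = np.full((num_states, num_actions), 1.0 / num_actions)
    Q_sum = np.zeros_like(Q)
    for _ in range(num_iters):
        # Q-learning-like backup: expectation over P and the current policy at s'.
        state_value = (policy * Q).sum(axis=1)
        Q = r + gamma * P @ state_value
        Q_sum += Q
        # Local no-regret step: treat Q(s, .) as the reward vector at state s.
        expected = (policy * Q).sum(axis=1, keepdims=True)
        cum_regret += Q - expected
        positive = np.maximum(cum_regret, 0.0)
        totals = positive.sum(axis=1, keepdims=True)
        uniform = np.full_like(policy, 1.0 / num_actions)
        safe_totals = np.where(totals > 0, totals, 1.0)  # avoid division by zero
        policy = np.where(totals > 0, positive / safe_totals, uniform)
    return Q_sum / num_iters, policy
```

Swapping in a different no-regret algorithm only changes the block after the backup; the asynchronous variant would update a single selected state per iteration instead of all of them.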