Combining No-regret and Q-learning
Abstract
Counterfactual Regret Minimization (CFR) has found success in settings like poker which have both terminal states and perfect recall. We seek to understand how to relax these requirements. As a first step, we introduce a simple algorithm, local no-regret learning (LONR), which uses a Q-learning-like update rule to allow learning without terminal states or perfect recall. We prove its convergence for the basic case of MDPs (and limited extensions of them) and present empirical results showing that it achieves last-iterate convergence in a number of settings, most notably NoSDE games, a class of Markov games specifically designed to be challenging to learn, where no prior algorithm is known to achieve convergence to a stationary equilibrium even on average.
1 Introduction
Versions of counterfactual regret minimization (CFR) (Zinkevich et al., 2008) have found success in playing poker at human expert level (Brown and Sandholm, 2019b; Moravčík et al., 2017) as well as fully solving nontrivial versions of it (Bowling et al., 2015). CFR more generally can solve extensive-form games of incomplete information. It works by using a no-regret algorithm to select actions. In particular, one copy of such an algorithm is used at each information set, which corresponds to the full history of play observed by a single agent. The resulting algorithm satisfies a global no-regret guarantee, so at least in two-player zero-sum games it is guaranteed to converge to an optimal strategy through sufficient self-play.
However, CFR does have limitations. It makes two strong assumptions which are natural for games such as poker, but limit applicability to further settings. First, it assumes that the agent has perfect recall, which in a more general context means that the state representation captures the full history of states visited (and so imposes a tree structure). Current RL domains may rarely repeat states due to their large state spaces, but they certainly do not encode the full history of states and actions. Second, it assumes that a terminal state is eventually reached, and performs updates only after this occurs. Even in episodic RL settings, which do have terminals, it may take thousands of steps to reach them. Neither of these assumptions is required for traditional planning algorithms like value iteration or reinforcement learning algorithms like Q-learning. Nevertheless, approaches inspired by CFR have shown empirical promise in domains that do not necessarily satisfy these requirements (Jin, Levine, and Keutzer, 2017).
In this paper, we take a step toward relaxing these assumptions. We develop a new algorithm, which we call local no-regret learning (LONR). In the same spirit as CFR, LONR uses a copy of an arbitrary no-regret algorithm in each state. (For technical reasons we require a slightly stronger property we term no-absolute-regret.) Our main result is that LONR has the same asymptotic convergence guarantee as value iteration for discounted-reward Markov Decision Processes (MDPs). Our result also generalizes to settings where, from a single agent’s perspective, the transition process is time-invariant but rewards are not. Such settings are traditionally interpreted as “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015), but also include normal-form games. We view this as a proof-of-concept for achieving CFR-style results without requiring perfect recall or terminal states. Under stylized assumptions, we can extend this to asynchronous value iteration and (with a weaker convergence guarantee) a version of on-policy RL.
In our experimental results, we explore settings beyond the reach of our theoretical results. Our primary focus is on a particular class of Markov games known as NoSDE Markov games, which are specifically designed to be challenging for learning algorithms (Zinkevich, Greenwald, and Littman, 2006). These are finite two-agent Markov games with no terminal states in which No Stationary Deterministic Equilibria exist: all stationary equilibria are randomized. Worse, by construction Q-values do not suffice to determine the correct equilibrium randomization. Thus, prior work has focused on designing multi-agent learning algorithms which can converge to nonstationary equilibria (Zinkevich, Greenwald, and Littman, 2006). The sorts of cyclic behavior that NoSDE games induce have also been observed in more realistic settings of economic competition between agents (Tesauro and Kephart, 2002).
In contrast, we demonstrate that LONR converges to the stationary equilibrium for specific choices of regret minimizer. Furthermore, for these choices of minimizer we achieve not just convergence of the average policy but also of the current policy, or last iterate. Thus our results are also interesting from the perspective of highlighting a setting for the study of last-iterate convergence, an area of current interest, in between simple normal-form games (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018) and rich, complex settings such as generative adversarial networks (GANs) (Daskalakis et al., 2017).
Most work on CFR uses some version of regret matching as the regret minimizer. However, all prior variants of regret matching are known to not possess last-iterate convergence in normal-form games such as matching pennies and rock-paper-scissors. As part of our analysis we introduce a novel variant, prove that it is no-regret, and show empirically that it provides last-iterate convergence in these normal-form games as well as in all other settings we have tried. This may be of independent interest, as it is qualitatively different from prior algorithms with last-iterate convergence, which are optimistic versions of standard algorithms (Daskalakis and Panageas, 2018; Daskalakis et al., 2017).
2 Related work
CFR algorithms remain an active topic of research; recent work has shown how to combine CFR with function approximation (Waugh et al., 2015; Moravčík et al., 2017; Jin, Levine, and Keutzer, 2017; Brown et al., 2018; Li et al., 2018), improve its convergence rate in certain settings (Farina et al., 2019), and apply it to more complex structures (Farina, Kroer, and Sandholm, 2018). Most relevant to our work, examples are known where CFR fails to converge to the correct policy without perfect recall (Lanctot et al., 2012).
Both CFR and LONR are guaranteed to converge only in terms of their average policy. This is part of a general phenomenon for no-regret learning in games, where the “last iterate,” or current policy, not only fails to converge but behaves in an extreme and cyclic way (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018; Cheung and Piliouras, 2019; Bailey, Gidel, and Piliouras, 2019). Recent work has explored cases where it is nonetheless effective to use the last iterate. In some poker settings a variant of CFR known as CFR+ (Tammelin, 2014; Bowling et al., 2015) has good last iterates, but it is known to cycle in normal-form games. Motivated by training Generative Adversarial Networks (GANs), recent results have shown that certain no-regret algorithms converge in terms of the last iterate to saddle points in convex-concave min-max optimization problems (Daskalakis et al., 2017; Daskalakis and Panageas, 2018). The ability to use the last iterate is particularly important in the context of function approximation (Heinrich and Silver, 2016; Abernethy, Lai, and Wibisono, 2019). Our experimental results provide examples of LONR achieving last-iterate convergence when the underlying regret minimizer is capable of it.
Prior work has developed algorithms which combine no-regret learning and reinforcement learning, but in ways that are qualitatively different from LONR. A common approach in the literature on multi-agent learning is to use no-regret learning as an outer loop to optimize over the space of policies, with the assumption that the inner loop of evaluating a policy is given to the algorithm. There is a large literature on this approach in normal-form games (Greenwald and Jafari, 2003), where policy evaluation is trivial, and a smaller one on “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015), where it is less so. Of particular note in this literature, Even-Dar, Kakade, and Mansour (2005) also use the idea of having a copy of a no-regret algorithm for each state. An alternate approach to solving multi-agent MDPs is to use Q-learning as an outer loop with some other algorithm as an inner loop to determine the collective action chosen in the next state (Littman, 1994; Hu and Wellman, 2003; Greenwald, Hall, and Serrano, 2003). Of particular note, Gondek, Greenwald, and Hall (2004) proposed the use of no-regret algorithms as an inner loop with Q-learning as an outer loop, while Even-Dar, Mannor, and Mansour (2002) use multi-armed bandit algorithms as the inner loop with Phased Q-learning (Kearns and Singh, 1999) as the outer loop. In contrast to these literatures, we combine no-regret learning and RL in each step of the learning process rather than having one as an inner loop and the other as an outer loop.
Recent work has drawn new connections between no-regret learning and RL. Srinivasan et al. (2018) show that actor-critic methods can be interpreted as a form of regret minimization, but only analyze their performance in games with perfect recall and terminal states. This is complementary to our approach, which focuses on value-iteration-style algorithms, in that it suggests a way of extending our results to other classes of algorithms. Neu, Jonsson, and Gómez (2017) study entropy-regularized RL and interpret it as an approximate version of Mirror Descent, from which no-regret algorithms can be derived as particular instantiations. Kovařík and Lisỳ (2018) study algorithms that instantiate a regret minimizer at each state without the counterfactual weightings from CFR, but explicitly exclude settings without terminals and perfect recall from their analysis. Jin et al. (2018) showed that in finite-horizon MDPs, Q-learning with UCB exploration achieves near-optimal regret bounds.
The closest technical approach to that used in our theoretical results is that of Bellemare et al. (2016), who introduce new variants of the Q-learning operator. However, our algorithm is not an operator, as the policy used to select actions changes from round to round in a history-dependent way, so we instead directly analyze the sequences of Q-values.
3 Preliminaries
Consider a Markov Decision Process $(S, A, P, r, \gamma)$, where $S$ is the state space, $A$ is the (finite) action space, $P$ is the transition probability kernel, $r$ is the (expected) reward function (assumed to be bounded), and $\gamma \in [0, 1)$ is the discount rate. (Q-)value iteration is an operator $T$, whose domain is bounded real-valued functions over $S \times A$, defined as
(1)  $(TQ)(s,a) = r(s,a) + \gamma \sum_{s' \in S} P(s' \mid s, a) \max_{a' \in A} Q(s', a')$
This operator is a contraction map in the sup norm $\|\cdot\|_\infty$, and so its iterates converge to a unique fixed point $Q^*$, where $Q^*(s,a)$ gives the expected value of the MDP starting from state $s$, taking action $a$, and thereafter following the optimal policy $\pi^*$.
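As a concrete reference point, the operator above can be iterated for a small tabular MDP. The following is a minimal sketch (the array shapes, function name, and iteration count are ours, not from the paper):

```python
import numpy as np

def q_value_iteration(P, r, gamma, iters=1000):
    """Iterate (T Q)(s,a) = r(s,a) + gamma * sum_s' P(s'|s,a) max_a' Q(s',a').

    P: array of shape (S, A, S) -- transition kernel P(s' | s, a)
    r: array of shape (S, A)    -- expected rewards
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    for _ in range(iters):
        # P @ v applies each (A, S) slice of P to the state-value vector v,
        # yielding the expected next-state value for every (s, a) pair.
        Q = r + gamma * (P @ Q.max(axis=1))  # contraction in the sup norm
    return Q
```

For example, in a two-state MDP where action 0 in state 0 pays 1 and stays put while action 1 moves to an absorbing zero-reward state, the fixed point has $Q^*(0,0) = 1/(1-\gamma)$.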
Our algorithm makes use of a no-regret learning algorithm. (It may seem strange to use an algorithm designed for nonstationary environments in a stationary one. We do so with the goal of designing an algorithm that generalizes to nonstationary settings such as “online” MDPs and Markov games.) Consider the following (adversarial full-information) setting. There are $K$ actions. At each timestep $t$ an online algorithm chooses a probability distribution $\pi_t$ over the actions. Then an adversary chooses a reward $r_t(a)$ for each action from some closed interval, e.g. $[0,1]$, which the algorithm then observes. The (external) regret of the algorithm at time $T$ is
(2)  $\mathrm{Regret}_T = \max_{a} \sum_{t=1}^{T} r_t(a) - \sum_{t=1}^{T} \sum_{a} \pi_t(a)\, r_t(a)$
An algorithm is no-regret if there is a sequence of constants $\epsilon_T$ such that, regardless of the adversary, the regret at time $T$ is at most $\epsilon_T$ and $\lim_{T \to \infty} \epsilon_T / T = 0$. A common bound is that $\epsilon_T$ is $O(\sqrt{T})$.
For our results, we make use of a stronger property, that the absolute value of the regret is bounded by $\epsilon_T$. We call such an algorithm a no-absolute-regret algorithm. Algorithms exist that satisfy the even stronger property that the regret is at most $\epsilon_T$ and at least 0. Such nonnegative-regret algorithms include all linear-cost Regularized Follow the Leader algorithms, which includes Randomized Weighted Majority and linear-cost Online Gradient Descent (Gofer and Mansour, 2016).
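To make the regret definition concrete, here is a minimal sketch of Randomized Weighted Majority (multiplicative weights), one of the nonnegative-regret algorithms mentioned above, together with the external regret of Equation (2). The learning-rate tuning and reward range $[0,1]$ are standard assumptions, not specified by the paper:

```python
import numpy as np

def mwu_regret(rewards, eta=None):
    """Run multiplicative weights on a T x K reward matrix; return external regret.

    Rewards are assumed to lie in [0, 1]; eta defaults to sqrt(8 ln K / T),
    a standard tuning giving O(sqrt(T log K)) regret.
    """
    T, K = rewards.shape
    if eta is None:
        eta = np.sqrt(8 * np.log(K) / T)
    w = np.zeros(K)                 # cumulative reward of each action
    expected = 0.0                  # algorithm's cumulative expected reward
    for t in range(T):
        p = np.exp(eta * (w - w.max()))
        p /= p.sum()                # distribution pi_t over actions
        expected += p @ rewards[t]
        w += rewards[t]             # full information: observe every action
    return w.max() - expected       # regret vs. best fixed action in hindsight
```

On a sequence where one action always pays 1 and the rest pay 0, the regret stays far below the trivial bound of $T$, consistent with the $O(\sqrt{T})$ guarantee.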
4 Local noregret learning (LONR)
The idea of LONR is to fuse the essence of value iteration / Q-learning and CFR. A standard analysis of value iteration proceeds by analyzing the sequence of matrices $Q_{t+1} = TQ_t$. The essence of CFR is to choose the policy for each state locally using a no-regret algorithm. While doing so does not yield an operator, as the policy changes each round in a history-dependent way, this process still yields a sequence of matrices $Q_t$ as follows.
Fix a matrix $Q_0$. Initialize copies of a no-absolute-regret algorithm (one for each state) with $Q_0$ and find the initial policy $\pi_0(s)$ for each state $s$. Then we iteratively reveal rewards to the copy of the algorithm for state $s$ as $Q_{t+1}(s,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a) \sum_{a'} \pi_t(s',a')\, Q_t(s',a')$ (note that we are revealing the rewards of all actions, so we are in the planning setting rather than the standard RL one; we address settings with limited feedback in Section 5.2), and update the policy to $\pi_{t+1}$ according to the no-absolute-regret algorithm and $Q_{t+1}$.
Call this process local no-regret learning (LONR). It can be viewed as a synchronous version of Expected SARSA (Van Seijen et al., 2009) where, instead of an $\epsilon$-greedy policy with decaying $\epsilon$, a no-absolute-regret policy is used. In the rest of this section we work up to our main result, that LONR converges to $Q^*$. Like many prior results using no-regret learning (e.g. Zinkevich et al. (2008)), the convergence is of the average of the $Q_t$ matrices.
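Under the assumption that the per-state learner is multiplicative weights (the paper allows any no-absolute-regret algorithm), the synchronous LONR process can be sketched as follows. The hyperparameters and tabular representation are illustrative, not prescribed by the paper:

```python
import numpy as np

def lonr_v(P, r, gamma, T=5000, eta=0.05):
    """Synchronous LONR sketch: one multiplicative-weights learner per state.

    Each round, every state's learner is shown, as the 'reward' of action a,
    Q_{t+1}(s,a) = r(s,a) + gamma * E_{s'~P(.|s,a)}[ Q_t(s', pi_t(s')) ],
    i.e. successors are evaluated under the current local policies, not a max.
    Returns the running average of the Q matrices, which is what the theory
    guarantees converges.
    """
    S, A = r.shape
    Q = np.zeros((S, A))
    cum = np.zeros((S, A))          # per-state cumulative rewards for MWU
    Q_avg = np.zeros((S, A))
    for t in range(1, T + 1):
        logits = eta * (cum - cum.max(axis=1, keepdims=True))
        pi = np.exp(logits)
        pi /= pi.sum(axis=1, keepdims=True)   # policy pi_t(s) at each state
        v = (pi * Q).sum(axis=1)              # value of each state under pi_t
        Q = r + gamma * (P @ v)               # reveal new Q-values to learners
        cum += Q
        Q_avg += (Q - Q_avg) / t              # running average of the Q_t
    return Q_avg
```

On the two-state example used above, the averaged Q-values approach the value-iteration fixed point, matching Theorem 1's guarantee for the average.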
We work up to this result through a series of lemmas. To begin, we derive a bound on the average of the Q-values using the no-absolute-regret property. We use two slightly different averages to be able to relate them using the operator $T$.
Lemma 1.
Let $\bar{Q}_T = \frac{1}{T}\sum_{t=0}^{T-1} Q_t$ and $\hat{Q}_T = \frac{1}{T}\sum_{t=1}^{T} Q_t$. Then
(3)  $\left|\hat{Q}_T(s,a) - (T\bar{Q}_T)(s,a)\right| \le \gamma \epsilon_T / T \quad \text{for all } s, a$
Proof.
By the definitions of LONR and no-regret,
The key step is the inequality in the fourth line, where we use the fact that the policy for state $s$ is being determined by a no-regret algorithm, so we can use Equation (2) to bound the expected value of the policy by the value of the hindsight-optimal action and the regret bound of the algorithm. Similarly, by the stronger no-absolute-regret property, we can reverse the inequality to obtain the matching lower bound. This proves Equation (3). ∎
Next, we show that the range that the Q-values take on is bounded. This lemma is similar in spirit to Lemma 2 of Bellemare et al. (2016). This and subsequent omitted proofs can be found in Appendix B.
Lemma 2.
Let $R_{\max}$ be a bound on the absolute value of the rewards. Then $\|Q_t\|_\infty \le R_{\max} / (1 - \gamma)$ for all $t$.
Combining these two lemmas, we can show that $\bar{Q}_T$ is an approximate fixed point of $T$, and that the approximation error converges to 0 as $T \to \infty$.
Lemma 3.
$\lim_{T \to \infty} \|\bar{Q}_T - T\bar{Q}_T\|_\infty = 0$.
It remains to show that a converging sequence of approximate fixed points converges to $Q^*$, the fixed point of $T$.
Lemma 4.
Let $Q_1, Q_2, \ldots$ be a sequence such that $\lim_{T \to \infty} \|Q_T - TQ_T\|_\infty = 0$. Then $\lim_{T \to \infty} Q_T = Q^*$.
Theorem 1.
$\lim_{T \to \infty} \bar{Q}_T = Q^*$.
4.1 Beyond MDPs
While our results do not rely on perfect recall or terminal states the way CFR does, so far they are limited to the case of MDPs, while CFR permits multiple agents and imperfect information. We can straightforwardly extend our results to some settings beyond MDPs. In Appendix A we show that a version of Lemma 1 holds in MDP-like settings where the transition probability kernel does not change from round to round but the rewards do. Examples of such settings include “online MDPs” and normal-form games. This last result is not particularly surprising, as with a single state LONR reduces to standard no-regret learning, whose convergence guarantees in normal-form games are well understood. In Section 6 we present empirical results demonstrating convergence in the richer multi-agent setting of Markov games.
5 Extensions
In this section we consider two extensions to LONR, one allowing it to be updated asynchronously (i.e. not updating every state in every iteration) and the other allowing it to learn from asynchronous updates with bandit feedback (i.e. the standard off-policy RL setting). This introduces novel technical issues around the performance of no-regret algorithms when their performance is assessed on a sample taken with replacement of their rounds (rather than counting each round exactly once). Therefore, we analyze convergence only in the simplified case where the state to update at each iteration is chosen uniformly at random. We emphasize that this is an unreasonably strong assumption in practice, and view our results in this section as providing intuition about why sufficiently “nice” processes should converge. We demonstrate empirical convergence in a more standard on-policy setting in Section 6 and leave a more general theoretical analysis to future work.
5.1 Asynchronous updates
In Section 4 we analyzed an algorithm, LONR, which is similar to value iteration in that each state is updated synchronously at each iteration. However, an alternative is to update them asynchronously, where an arbitrary single state is updated at each iteration. Subject to suitable conditions on the frequency with which each state is updated, asynchronous value iteration also converges (Bertsekas, 1982).
A line of work has shown that CFR will also converge when sampling trajectories (Lanctot et al., 2009; Gibson et al., 2012; Johanson et al., 2012).
In this section, we show that LONR also converges with such asynchronous updates. However, this introduces a new complexity to our analysis. In particular, with synchronous updates there is a guarantee that the learner at each state sees exactly the first $t$ values of each action of each of its successor states. This allows us to immediately apply the no-regret property (2). With asynchronous updates, even if we update all actions in a state at the same time, a state's successors may have been updated more or fewer than $t$ times, and the learner may have missed some of these updates and observed others more than once, meaning we cannot directly apply (2). We prove the following lemma to show that a particular sampling process converges to a correct estimate of the average regret, but believe that similar characterizations should hold for other “nice” processes. We demonstrate empirical convergence of asynchronous LONR when states are selected in an on-policy manner in Section 6.
Lemma 5.
Let $t_1 < t_2 < \cdots < t_T$ be the first $T$ iterations at which state $s$ is updated, let $s'$ be a successor of $s$, and let $N_T$ be the number of iterations before $t_T$ at which $s'$ was updated. If the state to be updated at each iteration is chosen uniformly at random then $N_T / T \to 1$ with probability 1.
Proof.
Let $X_i$ be the number of times $s'$ is updated between the $i$th and $(i+1)$th updates of $s$. The $X_i$ are i.i.d. random variables whose law is the geometric distribution with parameter 0.5 (supported on $\{0, 1, 2, \ldots\}$), so $\mathbb{E}[X_i] = 1$ and by the strong law of large numbers the sample average of the $X_i$ converges to 1 with probability 1. Then by (Etemadi, 2006, Theorem 3), the ratio of partial sums $N_T / T$ also converges to 1 with probability 1. ∎
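The geometric-distribution argument can be sanity-checked numerically. This small simulation (our own, not from the paper) chooses states uniformly at random and tracks how often a fixed successor is updated between consecutive updates of a fixed state; the sample mean should approach 1:

```python
import random

def sample_update_ratio(n_states=4, iters=200_000, seed=0):
    """Simulate uniform-at-random state updates; return the average number of
    times successor s' is updated between consecutive updates of state s."""
    rng = random.Random(seed)
    s, s_prime = 0, 1
    gaps, count = [], 0
    for _ in range(iters):
        chosen = rng.randrange(n_states)
        if chosen == s_prime:
            count += 1
        elif chosen == s:
            gaps.append(count)   # s' updates since the last s update
            count = 0
    # Each gap is geometric with parameter 1/2 on {0, 1, 2, ...}, so mean 1.
    return sum(gaps) / len(gaps)
```

Restricted to the iterations that pick $s$ or $s'$, each is equally likely, which is why the gap counts are geometric with parameter $1/2$ regardless of the total number of states.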
With this in hand, we can now prove a result similar to Lemma 1 for asynchronous updates. The primary difference is that we now have an additional error term in the bounds, but like the regret term it goes to zero, per Lemma 5.
Lemma 6.
Let $s_n$ be the state selected uniformly at random and updated in iteration $n$, for which this is the $t$th update, and define the averages $\bar{Q}_t$ and $\hat{Q}_t$ over these updates analogously to Lemma 1. Then
(4)  $\left|\hat{Q}_t(s,a) - (T\bar{Q}_t)(s,a)\right| \le \gamma \epsilon_t / t + \gamma \delta_t$, where $\delta_t$ is the sampling error of Lemma 5, which goes to 0 with probability 1.
It immediately follows that $\bar{Q}_t$ is an approximate fixed point of $T$, and that the approximation error converges to 0 as $t \to \infty$.
Lemma 7.
Let $m(T)$ be the minimum number of times any state has been chosen uniformly at random for update by time $T$. Then $\|\bar{Q}_T - T\bar{Q}_T\|_\infty \to 0$ with probability 1 as $m(T) \to \infty$.
Combining Lemmas 7 and 4 (the latter of which applies without change) shows the convergence of asynchronous LONR learning.
Theorem 2.
If states are chosen for update uniformly at random, then $\bar{Q}_T \to Q^*$ with probability 1.
5.2 Asynchronous updates with bandit feedback
In RL, algorithms like Q-learning are usually assumed not to know the reward function or transition kernel, and so only have access to feedback corresponding to the action actually taken in the current iteration. In such settings, ordinary no-regret algorithms are not applicable because they require the counterfactual results from actions not chosen. However, multi-armed bandit algorithms, such as Exp3 (Auer et al., 2002), are designed to achieve no-regret guarantees in expectation despite only receiving feedback about the outcomes chosen. It would be natural to adapt LONR to the on-policy RL setting by replacing the no-regret algorithm with a multi-armed bandit one. This type of result has previously been obtained for normal-form games (Banerjee and Peng, 2005), where agents can learn to play optimally even if they only learn their payoff at each stage and not what action the other agents took.
To adapt LONR to make use of multi-armed bandit algorithms, we can use an importance-sampled update rule for the action chosen in the current state, leaving the Q-values of the other actions unchanged. (The use of importance sampling here is to maintain the structure that successor states are evaluated as in the full-information case. Alternatively, we could use the SARSA-style update where the successor is evaluated at the action chosen the last time it was updated and leave all other Q-values unchanged; this also requires appropriately adjusting the way the average is computed.) The no-absolute-regret algorithm for bandit feedback can then be updated with the raw rather than importance-sampled estimate because, e.g., Exp3 already includes importance weighting. Unlike in Q-learning, we do not need to average over Q-values to account for the stochasticity in the choice of action because our convergence results are already for the averages of our Q-values.
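For concreteness, here is a minimal sketch of the kind of bandit learner referred to above, in the style of Exp3 (Auer et al., 2002), whose importance-weighted reward estimates keep the update unbiased under bandit feedback. The particular learning rate, exploration mixture, and class interface are our assumptions:

```python
import numpy as np

class Exp3:
    """Exp3-style bandit learner: exponential weights over importance-weighted
    reward estimates, mixed with uniform exploration."""
    def __init__(self, k, eta=0.05, explore=0.05):
        self.k, self.eta, self.explore = k, eta, explore
        self.cum = np.zeros(k)   # cumulative importance-weighted reward estimates

    def policy(self):
        p = np.exp(self.eta * (self.cum - self.cum.max()))
        p /= p.sum()
        return (1 - self.explore) * p + self.explore / self.k  # mix in exploration

    def update(self, action, reward):
        p = self.policy()
        # Only the chosen arm's estimate changes; dividing by its probability
        # makes the estimate unbiased for the full reward vector.
        self.cum[action] += reward / p[action]
```

In a LONR-style loop, the `reward` fed to `update` would be the sampled Q-value target for the chosen action; here the learner is shown in isolation.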
With these definitions, Lemma 6 can be immediately adapted to this setting, with the caveat that the guarantees now hold only in expectation over the choice of action at each iteration and the resulting state. Furthermore, since we require that states be chosen uniformly at random, the resulting algorithm is on-policy in the sense that the algorithm is choosing which action to receive feedback about, but does not control the sequence of states in which it acts.
Lemma 8.
Let $s_n$ be the state selected uniformly at random and updated in iteration $n$, for which this is the $t$th update, and define the averages as in Lemma 6. Then, in expectation,
(5)  
The same analysis from the asynchronous full-information case then yields the following theorem.
Theorem 3.
If states are chosen for update uniformly at random, then $\lim_{T \to \infty} \mathbb{E}[\bar{Q}_T] = Q^*$.
This convergence in expectation implies that the $\bar{Q}_T$ converge in probability to $Q^*$, a weaker guarantee than the almost sure convergence of algorithms like Q-learning. We leave deriving a stronger convergence guarantee under more natural assumptions about state selection to future work.
6 Experiments
Our theoretical results in Sections 4 and 5 are restricted to (online) MDPs and normal-form games and require a number of technical assumptions. The primary goal of this section is to provide evidence that relaxing these restrictions may be possible.
Another goal is to understand the performance of various well-known regret minimizers within the LONR framework, which may or may not satisfy the no-absolute-regret property that the theory calls for. One popular class of no-regret algorithms is Follow-the-Regularized-Leader (FoReL) algorithms, of which Multiplicative Weights Update (MWU) is perhaps the best known. MWU determines a probability distribution over actions by normalizing weights assigned to each action, with each weight exponential in the sum of past rewards scaled by a learning rate. It satisfies the stronger nonnegative-regret property and therefore the no-absolute-regret property. Another algorithm we consider is Optimistic Multiplicative Weights Update (OMWU), which extends MWU with optimism by making the slight adjustment of counting the last value twice each iteration, a change which guarantees not just that the average policy is no-regret, but that the last one (the last iterate) is as well (Daskalakis and Panageas, 2018). We also consider Regret Matching (RM) (Hart and Mas-Colell, 2000) algorithms, which are the most widely used regret minimizers in CFR-based algorithms due to their simplicity and, unlike FoReL, lack of parameters. With RM, the policy for iteration $t+1$ chooses actions with probability proportional to the accumulated positive regrets over iterations 0 to $t$. Regret Matching+ (RM+) is a variation that resets negative accumulated regret sums to zero after each iteration and applies a linear weighting term to the contributions to the average strategy (Tammelin, 2014). The current state-of-the-art algorithm, Discounted CFR (DCFR), is a parameterized generalization of RM+ in which the accumulated positive and negative regrets, as well as the contributions to the average strategy, are weighted separately (Brown and Sandholm, 2019a).
The parameters used are $\alpha = 3/2$, $\beta = 0$, and $\gamma = 2$, which are the values recommended by the authors. All of these variants of RM are known not to have last-iterate convergence in general and not to satisfy the nonnegative-regret property. (We do not know whether they satisfy the no-absolute-regret property.)
In addition to these standard no-regret algorithms, we introduce a new variant of RM called Regret Matching++ (RM++), which updates in a similar fashion to Regret Matching but clips the instantaneous regrets at 0. That is, if $r_t(a)$ is the instantaneous regret of action $a$ in round $t$, RM tracks $\sum_t r_t(a)$ while RM++ tracks the upper bound $\sum_t \max(r_t(a), 0)$. (The same idea of clipping instantaneous regrets at 0 has recently been used by actor-critic approaches (Srinivasan et al., 2018).) In the appendix we prove that RM++ is in fact a no-regret algorithm. The proof is a minor variation of the proof for RM+ (Tammelin et al., 2015). We also demonstrate that RM++ empirically has last-iterate convergence in a number of settings. This may be of independent interest as, unlike OMWU, it is not obviously describable as an optimistic version of another regret minimizer.
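The bookkeeping difference between RM, RM+, and RM++ can be made concrete with a short sketch (the function names and array representation are ours):

```python
import numpy as np

def rm_family_step(tracker, inst_regret, variant):
    """One bookkeeping step for the regret-matching family.

    tracker: accumulated (possibly clipped) regrets per action
    inst_regret: this round's instantaneous regrets r_t(a)
    variant: 'RM'   -> track the plain sum of r_t(a)
             'RM+'  -> clip the *accumulated* sum at 0 after adding
             'RM++' -> clip each *instantaneous* regret at 0 before adding
    """
    if variant == "RM":
        tracker = tracker + inst_regret
    elif variant == "RM+":
        tracker = np.maximum(tracker + inst_regret, 0.0)
    elif variant == "RM++":
        tracker = tracker + np.maximum(inst_regret, 0.0)
    return tracker

def rm_policy(tracker):
    """Play proportionally to positive tracked regret; uniform if none is positive."""
    pos = np.maximum(tracker, 0.0)
    k = len(tracker)
    return pos / pos.sum() if pos.sum() > 0 else np.full(k, 1.0 / k)
```

Because RM++ discards only the negative part of each round's regret before summing, its tracker dominates RM+'s, which in turn dominates the plain RM sum; this is the "upper bound" noted above.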
Lastly, we present results for the first two versions of LONR we analyzed theoretically: value-iteration style (LONR-V) and with asynchronous updates (LONR-A). For LONR-A, while the theory requires that states be chosen for update uniformly at random, we instead run it on-policy. (We add a small probability of a random action, 0.1, to ensure adequate exploration.) Our results show that empirically this does not prevent convergence.
The settings we use for our results are chosen to demonstrate LONR in settings where neither CFR nor standard RL algorithms are applicable. For CFR, this means we choose settings with repeated states and possibly a lack of terminals. For RL, this means considering settings with multiple agents. Since our exposition of LONR is for a single agent setting, we now explain how we apply it in multiagent settings. We use centralized training, so each agent has access to the current policy of the other agent. This allows the agent to update with the expected rewards and transition probabilities induced by the current policy of the other agent.
6.1 NoSDE Markov Game
Our primary setting is a stateful one with multiple agents. Such settings are naturally modelled as Markov games, a generalization of MDPs to multi-agent settings. A Markov game is a tuple $(S, N, A, P, r, \gamma)$, where $S$ is the set of states, $N$ is the set of players, $A$ is the set of all state-action pairs, $P$ is a transition kernel giving a distribution over next states for each state-action pair, $r = (r_1, \ldots, r_{|N|})$ gives each player's reward function, and $\gamma$ is a discount factor.
Because Markov games can model a wide variety of games, algorithms designed for the entirety of this class must be robust to particularly troublesome subclasses. One early negative result found that there exist general-sum Markov games in which no stationary deterministic equilibria exist, which Zinkevich, Greenwald, and Littman (2006) term NoSDE games. These games have the property that there exists a unique stationary equilibrium with (randomized) policies in which the Q-values for each agent are identical in equilibrium but their equilibrium strategies are not. Worse, the rewards of each player in a NoSDE game can be adjusted within a certain closed interval such that the resulting Q-values remain the same but the stationary policy changes, making Q-value learning even more problematic.
The reward structure for the particular NoSDE game we use is shown in Figure 1(a) for Player 1 and Figure 1(b) for Player 2. Conceptually, a NoSDE game is a deterministic Markov game with 2 players and 2 states, where each state has a single player with more than one action. The dynamics of a NoSDE game become cyclic because each player prefers to change actions when the other player does as well, which causes the nonstationarity. In this instance, when player 1 sends, player 2 then prefers to send. This causes player 1 to prefer to keep, which in turn causes player 2 to prefer to keep. Player 1 then prefers to send, and the cycle repeats. Due to these negative results, Q-value learning algorithms cannot learn the stationary equilibrium. The state of the art remains the approach of Zinkevich, Greenwald, and Littman (2006), who give a multi-agent value iteration procedure which can approximate a cyclic (nonstationary) equilibrium.
No-regret algorithms are known to converge in self-play, but not necessarily to desirable points, e.g. Nash equilibria. This convergence guarantee is for the average policy. Our first results look at the average policies in the NoSDE game with LONR-V. Figure 2 shows the behavior of the average probability with which player 1 chooses to SEND. The unique stationary equilibrium probability for this action is 2/3. Each algorithm converges, but not to the same value. Not shown but important is that each also converges to the equilibrium average Q-values.
RM and MWU converge to a similar average policy (top two lines). These two algorithms choose actions based on tracking the sum of regrets and rewards, respectively. RM+ and DCFR follow a similar path (next two lines), which makes sense given that RM+ is a special case of DCFR. RM++ and OMWU are the only two which find the stationary equilibrium policy (bottom two lines). These two are also the only two with last-iterate convergence properties (OMWU provably and RM++ empirically). Figure 3, which plots the current iterate for each regret minimizer, shows that this holds in our NoSDE game as well: RM++ and OMWU achieve last-iterate convergence, while cyclic behavior can be seen for the other four. This result highlights NoSDE games as an interesting setting for the theoretical study of last-iterate convergence, in between simple normal-form games (Mertikopoulos, Papadimitriou, and Piliouras, 2018; Bailey and Piliouras, 2018) and rich, complex settings such as GANs (Daskalakis et al., 2017).
While the theory behind OMWU states that the last value need only be counted twice, our results highlight the difference in the last iterate when more optimism is included (i.e. when the last value is counted more than twice). Specifically, in Figure 3(a) we plot the last iterate for increasing counts of the last value. The figure indicates the role increased optimism plays not only in convergence versus divergence, but in how quickly convergence happens. In this case, despite the theory, counting twice does not lead to convergence in the last iterate, but counting three or more times does. Again, theoretically exploring this phenomenon is an interesting direction for future work.
Lastly, we analyze LONR-A, the asynchronous version of LONR. We restrict our results to the two regret minimizers which show last-iterate convergence, RM++ (Figure 3(b)) and OMWU (Figure 3(c)), plotting 100 runs of each. They show that, despite a more natural process for choosing which state to update than our theory permits, we still see convergence.
6.2 Additional Experiments
Additional experiments which bridge the gap from MDPs to NoSDE Markov games are presented in the Appendix. For a “nicer” Markov game than our deliberately challenging NoSDE game, we use the standard simple 2-player, zero-sum soccer game (Littman, 1994). With any of our six regret minimizers, both LONR-V and LONR-A achieve approximate equilibrium payoffs on average. To probe the assumptions of our theory in a setting closer to it, we run LONR on the typical benchmark GridWorld environment, an MDP. Specifically, we use the standard cliff-walking task, which requires the agent to avoid a high-cost cliff to reach the exit terminal state. Again, LONR-V and LONR-A learn the optimal policy (and optimal Q-values) despite regret minimizers that may not satisfy the no-absolute-regret property and, in the case of LONR-A, on-policy state selection.
7 Conclusion
We have proposed a new learning algorithm, local no-regret learning (LONR). We have shown its convergence for the basic case of MDPs (and limited extensions of them) and presented empirical results showing that it achieves convergence, and in some cases last iterate convergence, in a number of settings, most notably NoSDE games. We view this as a proof-of-concept for achieving CFR-style results without requiring perfect recall or terminal states.
Our results point to a number of interesting directions for future research. First, a natural goal given our empirical results would be to extend our convergence results to Markov games. Second, CFR also handles partial observability by appropriately weighting the different states which correspond to the same observed history; extending LONR in this way is another natural direction. Third, we would like to relax the strong assumptions our results about asynchronous updates require. All three seem to rely on the same fundamental building block: better understanding the behavior of no-regret learners whose rewards are determined by (asynchronous) observations of other no-regret learners. Some recent progress along these lines has been made (Farina, Kroer, and Sandholm, 2018; Kovařík and Lisỳ, 2018), but more work is needed.
Orthogonal directions are suggested by our empirical results about last iterate convergence. Can we establish theoretical guarantees for NoSDE games, or Markov games more broadly? Is RM++ guaranteed to achieve last iterate convergence? It empirically does so in standard games like matching pennies and rock-paper-scissors, which trip up most regret minimizers. If so, does this represent a new style of algorithm for achieving last iterate convergence, or is there a way to interpret its clipping of regrets as optimism?
References
 Abernethy, Lai, and Wibisono (2019) Abernethy, J.; Lai, K. A.; and Wibisono, A. 2019. Last-iterate convergence rates for min-max optimization. arXiv preprint.
 Auer et al. (2002) Auer, P.; Cesa-Bianchi, N.; Freund, Y.; and Schapire, R. E. 2002. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing 32(1):48–77.
 Bailey and Piliouras (2018) Bailey, J. P., and Piliouras, G. 2018. Multiplicative weights update in zero-sum games. In EC 2018, 321–338. ACM.
 Bailey, Gidel, and Piliouras (2019) Bailey, J. P.; Gidel, G.; and Piliouras, G. 2019. Finite regret and cycles with fixed step-size via alternating gradient descent-ascent. arXiv preprint.
 Banerjee and Peng (2005) Banerjee, B., and Peng, J. 2005. Efficient no-regret multiagent learning. In AAAI, 41–46.
 Bellemare et al. (2016) Bellemare, M. G.; Ostrovski, G.; Guez, A.; Thomas, P. S.; and Munos, R. 2016. Increasing the action gap: New operators for reinforcement learning. In AAAI, 1476–1483.
 Bertsekas (1982) Bertsekas, D. 1982. Distributed dynamic programming. IEEE Transactions on Automatic Control 27(3):610–616.
 Bowling et al. (2015) Bowling, M.; Burch, N.; Johanson, M.; and Tammelin, O. 2015. Heads-up limit hold’em poker is solved. Science 347:145–149.
 Brown and Sandholm (2019a) Brown, N., and Sandholm, T. 2019a. Solving imperfect-information games via discounted regret minimization. In AAAI, volume 33, 1829–1836.
 Brown and Sandholm (2019b) Brown, N., and Sandholm, T. 2019b. Superhuman AI for multiplayer poker. Science 365(6456):885–890.
 Brown et al. (2018) Brown, N.; Lerer, A.; Gross, S.; and Sandholm, T. 2018. Deep counterfactual regret minimization. arXiv preprint arXiv:1811.00164.
 Cheung and Piliouras (2019) Cheung, Y. K., and Piliouras, G. 2019. Vortices instead of equilibria in min-max optimization: Chaos and butterfly effects of online learning in zero-sum games. arXiv preprint arXiv:1905.08396.
 Daskalakis and Panageas (2018) Daskalakis, C., and Panageas, I. 2018. Last-iterate convergence: Zero-sum games and constrained min-max optimization. arXiv preprint.
 Daskalakis et al. (2017) Daskalakis, C.; Ilyas, A.; Syrgkanis, V.; and Zeng, H. 2017. Training GANs with optimism. arXiv preprint arXiv:1711.00141.
 Etemadi (2006) Etemadi, N. 2006. Convergence of weighted averages of random variables revisited. Proceedings of the American Mathematical Society 134(9):2739–2744.
 Even-Dar, Kakade, and Mansour (2005) Even-Dar, E.; Kakade, S. M.; and Mansour, Y. 2005. Experts in a Markov decision process. In NIPS, 401–408.
 Even-Dar, Kakade, and Mansour (2009) Even-Dar, E.; Kakade, S. M.; and Mansour, Y. 2009. Online Markov decision processes. Mathematics of Operations Research 34(3):726–736.
 Even-Dar, Mannor, and Mansour (2002) Even-Dar, E.; Mannor, S.; and Mansour, Y. 2002. PAC bounds for multi-armed bandit and Markov decision processes. In COLT.
 Farina et al. (2019) Farina, G.; Kroer, C.; Brown, N.; and Sandholm, T. 2019. Stable-predictive optimistic counterfactual regret minimization. arXiv preprint.
 Farina, Kroer, and Sandholm (2018) Farina, G.; Kroer, C.; and Sandholm, T. 2018. Composability of regret minimizers. arXiv preprint arXiv:1811.02540.
 Gibson et al. (2012) Gibson, R. G.; Lanctot, M.; Burch, N.; Szafron, D.; and Bowling, M. 2012. Generalized sampling and variance in counterfactual regret minimization. In AAAI.
 Gofer and Mansour (2016) Gofer, E., and Mansour, Y. 2016. Lower bounds on individual sequence regret. Machine Learning 103(1):1–26.
 Gondek, Greenwald, and Hall (2004) Gondek, D.; Greenwald, A.; and Hall, K. 2004. QNR-learning in Markov games.
 Greenwald and Jafari (2003) Greenwald, A., and Jafari, A. 2003. A general class of no-regret learning algorithms and game-theoretic equilibria. In Learning Theory and Kernel Machines.
 Greenwald, Hall, and Serrano (2003) Greenwald, A.; Hall, K.; and Serrano, R. 2003. Correlated Q-learning. In ICML, volume 3, 242–249.
 Hart and Mas-Colell (2000) Hart, S., and Mas-Colell, A. 2000. A simple adaptive procedure leading to correlated equilibrium. Econometrica 68(5):1127–1150.
 Heinrich and Silver (2016) Heinrich, J., and Silver, D. 2016. Deep reinforcement learning from self-play in imperfect-information games. arXiv preprint.
 Hu and Wellman (2003) Hu, J., and Wellman, M. P. 2003. Nash Q-learning for general-sum stochastic games. JMLR 4(Nov):1039–1069.
 Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-learning provably efficient? In NIPS, 4863–4873.
 Jin, Levine, and Keutzer (2017) Jin, P. H.; Levine, S.; and Keutzer, K. 2017. Regret minimization for partially observable deep reinforcement learning. arXiv preprint.
 Johanson et al. (2012) Johanson, M.; Bard, N.; Lanctot, M.; Gibson, R.; and Bowling, M. 2012. Efficient Nash equilibrium approximation through Monte Carlo counterfactual regret minimization. In AAMAS, 837–846.
 Kearns and Singh (1999) Kearns, M. J., and Singh, S. P. 1999. Finite-sample convergence rates for Q-learning and indirect algorithms. In NIPS, 996–1002.
 Kovařík and Lisỳ (2018) Kovařík, V., and Lisỳ, V. 2018. Analysis of Hannan consistent selection for Monte Carlo tree search in simultaneous move games. arXiv preprint.
 Lanctot et al. (2009) Lanctot, M.; Waugh, K.; Zinkevich, M.; and Bowling, M. 2009. Monte Carlo sampling for regret minimization in extensive games. In NIPS, 1078–1086.
 Lanctot et al. (2012) Lanctot, M.; Gibson, R.; Burch, N.; Zinkevich, M.; and Bowling, M. 2012. No-regret learning in extensive-form games with imperfect recall. arXiv preprint arXiv:1205.0622.
 Li et al. (2018) Li, H.; Hu, K.; Ge, Z.; Jiang, T.; Qi, Y.; and Song, L. 2018. Double neural counterfactual regret minimization. arXiv preprint.
 Littman (1994) Littman, M. L. 1994. Markov games as a framework for multi-agent reinforcement learning. In Machine Learning Proceedings 1994, 157–163.
 Ma, Zhang, and Sugiyama (2015) Ma, Y.; Zhang, H.; and Sugiyama, M. 2015. Online Markov decision processes with policy iteration. arXiv preprint.
 Mannor and Shimkin (2003) Mannor, S., and Shimkin, N. 2003. The empirical Bayes envelope and regret minimization in competitive Markov decision processes. Mathematics of Operations Research 28(2):327–345.
 Mertikopoulos, Papadimitriou, and Piliouras (2018) Mertikopoulos, P.; Papadimitriou, C.; and Piliouras, G. 2018. Cycles in adversarial regularized learning. In SODA, 2703–2717.
 Moravčík et al. (2017) Moravčík, M.; Schmid, M.; Burch, N.; Lisỳ, V.; Morrill, D.; Bard, N.; Davis, T.; Waugh, K.; Johanson, M.; and Bowling, M. 2017. DeepStack: Expert-level artificial intelligence in heads-up no-limit poker. Science 356(6337):508–513.
 Neu, Jonsson, and Gómez (2017) Neu, G.; Jonsson, A.; and Gómez, V. 2017. A unified view of entropy-regularized Markov decision processes. arXiv preprint.
 Srinivasan et al. (2018) Srinivasan, S.; Lanctot, M.; Zambaldi, V.; Pérolat, J.; Tuyls, K.; Munos, R.; and Bowling, M. 2018. Actor-critic policy optimization in partially observable multiagent environments. In NIPS.
 Tammelin et al. (2015) Tammelin, O.; Burch, N.; Johanson, M.; and Bowling, M. 2015. Solving heads-up limit Texas hold’em. In IJCAI, 645–652.
 Tammelin (2014) Tammelin, O. 2014. Solving large imperfect information games using CFR+. arXiv preprint arXiv:1407.5042.
 Tesauro and Kephart (2002) Tesauro, G., and Kephart, J. O. 2002. Pricing in agent economies using multi-agent Q-learning. JAAMAS 5(3):289–304.
 Van Seijen et al. (2009) Van Seijen, H.; Van Hasselt, H.; Whiteson, S.; and Wiering, M. 2009. A theoretical and empirical analysis of Expected Sarsa. In ADPRL 2009, 177–184. IEEE.
 Waugh et al. (2015) Waugh, K.; Morrill, D.; Bagnell, J. A.; and Bowling, M. 2015. Solving games with functional regret estimation. In AAAI.
 Yu, Mannor, and Shimkin (2009) Yu, J. Y.; Mannor, S.; and Shimkin, N. 2009. Markov decision processes with arbitrary reward processes. Mathematics of Operations Research 34(3):737–757.
 Zinkevich et al. (2008) Zinkevich, M.; Johanson, M.; Bowling, M.; and Piccione, C. 2008. Regret minimization in games with incomplete information. In NIPS, 1729–1736.
 Zinkevich, Greenwald, and Littman (2006) Zinkevich, M.; Greenwald, A.; and Littman, M. L. 2006. Cyclic equilibria in markov games. In NIPS, 1641–1648.
Appendix A Beyond MDPs
If we move beyond MDPs, the transitions and rewards are no longer stationary and in general we have a P_t and an R_t. This causes problems with the proof of Lemma 1. Recall the initial part of that proof, updated to this more general setting:
In the original proof, we pulled the expectation over the next state outside the sum, but now we cannot. In particular, writing the expectation more explicitly gives
(6) 
We can still reverse the order of the sums, but the weighting terms now depend on t, so they cannot be moved outside. More problematically, they also depend on the states and actions along the way, so it is not immediately clear how to generalize our results.
For intuition about the sort of problems that could arise, consider a state with two actions whose rewards alternate: at odd t the first action has the higher reward, and at even t the second does. It is a valid no-regret strategy to randomize uniformly over the actions, but if the transition probabilities are such that you only arrive in this state at odd t, then this gives an incorrect estimate.
In the remainder of this section, we analyze a special case where we can prove a variant of Lemma 1.
A.1 Time-invariant transitions
If the transition function does not change with t, but the reward function does, we can still prove a version of Lemma 1. With a single state, this captures learning in normal-form games, where no-regret learning is indeed known to work. This assumption is also common in the literature on “online MDPs” (Even-Dar, Kakade, and Mansour, 2009; Mannor and Shimkin, 2003; Yu, Mannor, and Shimkin, 2009; Ma, Zhang, and Sugiyama, 2015). In this setting, a version of Lemma 1 can be proved, but rather than having a constant operator, it changes over time as
(7) 
Lemma 9.
(8) 
Proof.
As before, the key step is applying the no-regret property to obtain the inequality, and we apply the same argument with the no-absolute-regret property to obtain the reverse inequality. ∎
Appendix B Omitted Proofs
Lemma 2.
Let . Then
Proof.
By definition, . Thus by the subadditive property of norms, . By induction, . Thus . ∎
Lemma 3.
Proof.
Lemma 4.
Let be a sequence such that . Then .
Proof.
The first step follows by the subadditive property of norms, the second by optimality of , the third because is a contraction map. Rewriting yields
Thus, by assumption, . Since , . Thus and the result follows. ∎
Lemma 6.
Let be the state selected uniformly at random and updated in iteration , for which this is the th update and let and for . Then
Proof.
By the definitions of LONR and noregret algorithms,
This argument is essentially the same as in the proof of Lemma 1, except that in the fourth equality we apply the relevant definition to yield a form to which we can then apply the no-regret property. As before, the other half of the proof is symmetric and uses the no-absolute-regret property. ∎
Appendix C Regret Matching++
In this section, we prove that RM++ is a no-regret algorithm and then demonstrate that it has empirical last iterate convergence.
Lemma 10.
Given a sequence of strategies , each defining a probability distribution over a set of actions A, consider any definition for satisfying the following conditions:


where
The regret-like value is then an upper bound on the regret
Proof.
The lemma and proof closely resemble the similar proofs in (Tammelin, 2014) and (Brown and Sandholm, 2019a).
For any ,
This gives
Lemma 11.
Given a set of actions A and any sequence of rewards that is bounded for all t and all actions, after playing a sequence of strategies determined by regret matching but using the regret-like value in place of the regret,
Proof.
for all t, so by induction, which gives the result. ∎
C.1 Empirical results for RM++
Figure 4(a) shows that the last iterate of RM++ converges to the equilibrium of rock-paper-scissors. Similar results, not shown, hold for matching pennies. Prior work has shown that both RM and RM+ diverge in these games in terms of the last iterate (although they converge on average). We also tested RM++ in Soccer, and as the no-regret algorithm for CFR in Kuhn poker and for LONR in GridWorld. In all cases we achieved last iterate convergence.
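A sketch of the RM++ update as described here (accumulate only the positive part of each instantaneous regret, in contrast to RM+, which clips the running sum after adding). The fixed-opponent rock-paper-scissors setup below is our own illustration rather than the self-play experiment of Figure 4(a):

```python
def rmpp_update(cum, inst_regret):
    """RM++: add only the positive part of each instantaneous regret to the
    running total (RM+ instead clips the running total after adding)."""
    return [c + max(r, 0.0) for c, r in zip(cum, inst_regret)]

def rm_policy(cum):
    # Regret matching: proportional to positive cumulative values.
    pos = [max(c, 0.0) for c in cum]
    tot = sum(pos)
    return [p / tot for p in pos] if tot > 0 else [1.0 / len(cum)] * len(cum)

# Row player's rock-paper-scissors payoffs against a fixed, non-uniform
# opponent; the best response is paper (action 1).
A = [[0, -1, 1], [1, 0, -1], [-1, 1, 0]]
opponent = [0.5, 0.25, 0.25]
cum = [0.0, 0.0, 0.0]
for _ in range(100):
    pi = rm_policy(cum)
    u = [sum(A[a][b] * opponent[b] for b in range(3)) for a in range(3)]
    baseline = sum(pi[a] * u[a] for a in range(3))
    cum = rmpp_update(cum, [u[a] - baseline for a in range(3)])
print(rm_policy(cum))  # -> [0.0, 1.0, 0.0]: all mass on the best response
```

Because negative instantaneous regrets are discarded before accumulation, the running totals never decrease, which is one way to view the clipping as a form of optimism.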
Appendix D Omitted Experimental Details
D.1 LONR pseudocode
Here we use standard notation: n is the total number of agents, i is the current agent, and S and s represent the set of states and the current state, respectively. A_i(s) provides the set of actions for player i in state s, and A_-i(s) denotes the set of actions of all other agents excluding agent i in state s. Unsubscripted notation refers to the current agent when unspecified.
The policy update uses any no-regret algorithm. The update for Regret Matching++ is shown here.