# Complexity and Algorithms for Exploiting Quantal Opponents in Large Two-Player Games

## Abstract

Solution concepts of traditional game theory assume entirely rational players; therefore, their ability to exploit subrational opponents is limited. One type of subrationality that describes human behavior well is the quantal response. While there exist algorithms for computing solutions against quantal opponents, they either do not scale or may provide strategies that are even worse than the entirely-rational Nash strategies. This paper aims to analyze and propose scalable algorithms for computing effective and robust strategies against a quantal opponent in normal-form and extensive-form games. Our contributions are: (1) we define two different solution concepts related to exploiting quantal opponents and analyze their properties; (2) we prove that computing these solutions is computationally hard; (3) therefore, we evaluate several heuristic approximations based on scalable counterfactual regret minimization (CFR); and (4) we identify a CFR variant that exploits the bounded opponents better than the previously used variants while being less exploitable by the worst-case perfectly-rational opponent.

^{1} Czech Technical University in Prague, Czech Republic
^{2} Nanyang Technological University, Singapore


## 1 Introduction

Extensive-form games are a powerful model able to describe recreational games, such as poker, as well as real-world situations from physical or network security. Recent advances in solving these games, and particularly the Counterfactual Regret Minimization (CFR) framework Zinkevich et al. (2008), allowed creating superhuman agents even in huge games, such as no-limit Texas hold’em with approximately different decision points Moravčík et al. (2017); Brown and Sandholm (2018). The algorithms generally approximate a Nash equilibrium, which assumes that all players are perfectly rational, and is known to be inefficient in exploiting weaker opponents. An algorithm that would be able to take an opponent’s imperfection into account is expected to win by a much larger margin Johanson and Bowling (2009); Bard et al. (2013).

The most common model of bounded rationality in humans is the quantal response (QR) model McKelvey and Palfrey (1995, 1998). Multiple experiments identified it as a good predictor of human behavior in games Yang et al. (2012); Haile et al. (2008). QR is also at the heart of algorithms successfully deployed in the real world Yang et al. (2012); Fang et al. (2017). It suggests that players respond stochastically, picking better actions with higher probability. Therefore, we investigate how to scalably compute a good strategy against a quantal response opponent in two-player normal-form and extensive-form games.

If both players choose their actions based on the QR model, their behavior is described by quantal response equilibrium (QRE). Finding QRE is a computationally tractable problem McKelvey and Palfrey (1995); Turocy (2005), which can also be solved using the CFR framework Farina et al. (2019). However, when creating AI agents competing with humans, we want to assume that one of the players is perfectly rational, and only the opponent’s rationality is bounded. A tempting approach may be to use the algorithms for computing QRE while increasing one player’s rationality, or to use generic algorithms for exploiting opponents Davis et al. (2014) even though the QR model does not satisfy their assumptions, as in Basak et al. (2018). However, this approach generally leads to a solution concept we call Quantal Nash Equilibrium, which we show is very inefficient in exploiting QR opponents and may even perform worse than an arbitrary Nash equilibrium.

Since the very nature of the quantal response model assumes that the sub-rational agent responds to a strategy played by its opponent, a more natural setting for studying the optimal strategies against QR opponents is that of Stackelberg games, in which one player commits to a strategy that is then learned and responded to by the opponent. Optimal commitments against quantal response opponents, called Quantal Stackelberg Equilibrium (QSE), have been studied in security games Yang et al. (2012), and the results were recently extended to normal-form games Černý et al. (2020). Even in these one-shot games, polynomial algorithms are available only for very limited subclasses. In extensive-form games, we show that computing the QSE is NP-hard, even in zero-sum games. Therefore, it is very unlikely that the CFR framework could be adapted to closely approximate these strategies. Since we aim for high scalability, we focus on empirical evaluation of several heuristics, including using QNE as an approximation of QSE. We identify a method that is not only more exploitative than QNE, but also more robust when the opponent is rational.

Our contributions are: 1) We analyze the relationship and properties of two solution concepts with quantal opponents that naturally arise from Nash equilibrium (QNE) and Stackelberg equilibrium (QSE). 2) We prove that computing QNE is PPAD-hard even in NFGs, and computing QSE in EFGs is NP-hard. Therefore, 3) we investigate the performance of CFR-based heuristics against QR opponents. The extensive empirical evaluation on four different classes of games with up to histories identifies a variant of CFR-f Davis et al. (2014) that computes strategies better than both QNE and NE.

## 2 Background

Even though our main focus is on extensive-form games, we study the concepts in normal-form games, which can be seen as their conceptually simpler special case. After defining the models, we proceed to define quantal response and the metrics for evaluating a deployed strategy’s quality.

### Two-player Normal-form Games

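A two-player normal-form game (NFG) is a tuple $G = (N, A, u)$, where $N = \{\triangle, \triangledown\}$ is the set of players. We use $i$ and $-i$ for one player and her opponent. $A = \{A_\triangle, A_\triangledown\}$ denotes the set of ordered sets of actions for both players. The utility function $u_i : A_\triangle \times A_\triangledown \to \mathbb{R}$ assigns a value to each pair of actions. A game is called zero-sum if $u_\triangle = -u_\triangledown$.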

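A mixed strategy $\sigma_i \in \Sigma_i$ is a probability distribution over $A_i$. For any strategy profile $\sigma \in \Sigma = \Sigma_\triangle \times \Sigma_\triangledown$ we use $u_i(\sigma) = u_i(\sigma_i, \sigma_{-i})$ as the expected outcome for player $i$, given the players follow strategy profile $\sigma$. A best response (BR) of player $i$ to the opponent’s strategy $\sigma_{-i}$ is a strategy $\sigma_i^{BR} \in BR_i(\sigma_{-i})$, where $u_i(\sigma_i^{BR}, \sigma_{-i}) \geq u_i(\sigma_i', \sigma_{-i})$ for all $\sigma_i' \in \Sigma_i$. An $\epsilon$-best response is $\sigma_i^{\epsilon BR} \in \epsilon BR_i(\sigma_{-i})$, $\epsilon > 0$, where $u_i(\sigma_i^{\epsilon BR}, \sigma_{-i}) + \epsilon \geq u_i(\sigma_i', \sigma_{-i})$ for all $\sigma_i' \in \Sigma_i$. Given a normal-form game $G$, a tuple of mixed strategies $(\sigma_i^{NE}, \sigma_{-i}^{NE})$, $\sigma_i^{NE} \in \Sigma_i$, is a Nash Equilibrium if $\sigma_i^{NE}$ is an optimal strategy of player $i$ against strategy $\sigma_{-i}^{NE}$. Formally: $\sigma_i^{NE} \in BR_i(\sigma_{-i}^{NE})$ for all $i \in N$.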

In many situations, the roles of the players are asymmetric. One player (the leader, $\triangle$) has the power to commit to a strategy, and the other player (the follower, $\triangledown$) plays the best response. This model has many real-world applications Tambe (2011); for example, the leader can correspond to a defense agency committing to a protocol to protect critical facilities. The common assumption in the literature is that the follower breaks ties in favor of the leader. Then, the concept is called a Strong Stackelberg Equilibrium (SSE).

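A leader’s strategy $\sigma^{SSE} \in \Sigma_\triangle$ is a Strong Stackelberg Equilibrium if $\sigma^{SSE}$ is an optimal strategy of the leader given that the follower best-responds. Formally: $\sigma^{SSE} = \arg\max_{\sigma_\triangle' \in \Sigma_\triangle} u_\triangle(\sigma_\triangle', BR_\triangledown(\sigma_\triangle'))$. In zero-sum games, SSE is equivalent to NE Conitzer and Sandholm (2006), and the expected utility is called the value of the game.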

### Two-player Extensive-form Games

A two-player extensive-form game (EFG) consists of a set of players $N = \{\triangle, \triangledown, c\}$, where $c$ denotes the chance. $A$ is a finite set of all actions available in the game. $H$ is the set of histories in the game. We assume that $H$ forms a non-empty finite prefix tree. We use $g \sqsubset h$ to denote that $h$ extends $g$. The root of $H$ is the empty sequence $\emptyset$. The set of leaves of $H$ is denoted $Z$ and its elements $z$ are called terminal histories. The histories not in $Z$ are non-terminal histories. By $A(h) = \{a \in A \mid ha \in H\}$ we denote the set of actions available at $h$. $P : H \setminus Z \to N$ is the player function, which returns who acts in a given history. Denoting $H_i = \{h \in H \setminus Z \mid P(h) = i\}$, we partition the histories as $H = H_\triangle \cup H_\triangledown \cup H_c \cup Z$. $\sigma_c$ is the chance strategy defined on $H_c$. For each $h \in H_c$, $\sigma_c(h)$ is a probability distribution over $A(h)$. Utility functions assign each player a utility for each leaf node, $u_i : Z \to \mathbb{R}$.

The game is of imperfect information if some actions or chance events are not fully observed by all players. The information structure is described by information sets for each player $i$, which form a partition $\mathcal{I}_i$ of $H_i$. For any information set $I_i \in \mathcal{I}_i$, any two histories $h, h' \in I_i$ are indistinguishable to player $i$. Therefore $A(h) = A(h')$ whenever $h, h' \in I_i$. For $I_i \in \mathcal{I}_i$ we denote by $A(I_i)$ the set $A(h)$ and by $P(I_i)$ the player $P(h)$ for any $h \in I_i$.

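A strategy $\sigma_i \in \Sigma_i$ of player $i$ is a function that assigns a distribution over $A(I_i)$ to each $I_i \in \mathcal{I}_i$. A strategy profile $\sigma = (\sigma_\triangle, \sigma_\triangledown)$ consists of strategies for both players. $\pi^\sigma(h)$ is the probability of reaching $h$ if all players play according to $\sigma$. We can decompose $\pi^\sigma(h) = \prod_{i \in N} \pi_i^\sigma(h)$ into each player’s contribution. Let $\pi_{-i}^\sigma$ be the product of all players’ contributions except that of player $i$ (including chance). For $I_i \in \mathcal{I}_i$ define $\pi^\sigma(I_i) = \sum_{h \in I_i} \pi^\sigma(h)$ as the probability of reaching information set $I_i$ given all players play according to $\sigma$. $\pi_i^\sigma(I_i)$ and $\pi_{-i}^\sigma(I_i)$ are defined similarly. Finally, let $\pi^\sigma(h, z) = \frac{\pi^\sigma(z)}{\pi^\sigma(h)}$ if $h \sqsubset z$, and zero otherwise. $\pi_i^\sigma(h, z)$ and $\pi_{-i}^\sigma(h, z)$ are defined similarly. Using this notation, the expected payoff for player $i$ is $u_i(\sigma) = \sum_{z \in Z} u_i(z) \pi^\sigma(z)$. BR, NE and SSE are defined as in NFGs.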

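Define $u_i(\sigma, h)$ as the expected utility given that the history $h$ is reached and all players play according to $\sigma$. A counterfactual value $v_i(\sigma, I)$ is the expected utility given that the information set $I$ is reached and all players play according to strategy $\sigma$ except player $i$, who plays to reach $I$. Formally, $v_i(\sigma, I) = \sum_{h \in I,\, z \in Z} \pi_{-i}^\sigma(h)\, \pi^\sigma(h, z)\, u_i(z)$. Similarly, the counterfactual value for playing action $a$ in information set $I$ is $v_i(\sigma, I, a) = \sum_{h \in I,\, z \in Z,\, ha \sqsubset z} \pi_{-i}^\sigma(h)\, \pi^\sigma(ha, z)\, u_i(z)$.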

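We define $S_i$ as the set of sequences of actions of player $i$ only. $\mathrm{inf}_i(s_i)$ is the information set where the last action of $s_i \in S_i$ was executed, and $\mathrm{seq}_i(I)$ is the sequence of actions of player $i$ leading to information set $I \in \mathcal{I}_i$.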

### Quantal Response Model of Bounded Rationality

Fully rational players always select the utility-maximizing strategy, i.e., the best response. Relaxing this assumption leads to a “statistical version” of best response, which takes into account the inevitable error-proneness of humans and allows the players to make systematic errors McFadden (1976); McKelvey and Palfrey (1995).

###### Definition 1.

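Let $G$ be an NFG. Function $QR : \Sigma_\triangle \to \Sigma_\triangledown$ is a quantal response function of player $\triangledown$ if the probability of playing action $a$ monotonically increases as the expected utility for $a$ increases. Quantal function QR is called canonical if for some real-valued function $q$:

$$QR(\sigma, a^k) = \frac{q(u_\triangledown(\sigma, a^k))}{\sum_{a^i \in A_\triangledown} q(u_\triangledown(\sigma, a^i))} \quad \forall \sigma \in \Sigma_\triangle,\ a^k \in A_\triangledown. \tag{1}$$

Whenever $q$ is a strictly positive increasing function, the corresponding $QR$ is a valid quantal response function. Such functions $q$ are called generators of canonical quantal functions. The most commonly used generator in the literature is the exponential (logit) function McKelvey and Palfrey (1995), defined as $q(x) = e^{\lambda x}$ where $\lambda > 0$. The parameter $\lambda$ drives the model’s rationality. The player behaves uniformly randomly for $\lambda \to 0$, and becomes more rational as $\lambda \to \infty$. We denote a logit quantal function as LQR.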
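To make the logit quantal function concrete, the following minimal sketch computes the response probabilities from a list of expected utilities. The function name `logit_quantal_response` and the parameter name `lam` (standing for the rationality parameter) are illustrative choices and are not taken from the paper.

```python
import math


def logit_quantal_response(utilities, lam):
    """Return logit quantal response probabilities for a list of expected utilities."""
    # Subtracting the maximum before exponentiating avoids overflow and
    # leaves the resulting probabilities unchanged.
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


if __name__ == "__main__":
    expected_utilities = [1.0, 0.0, -1.0]
    for lam in (0.0, 1.0, 10.0):
        print(lam, logit_quantal_response(expected_utilities, lam))
```

With `lam = 0` every action receives the same probability, and increasing `lam` shifts probability mass toward the action with the highest expected utility.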

In EFGs, we assume the bounded-rational player plays based on a quantal function in every information set separately, according to the counterfactual values.

###### Definition 2.

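Let $G$ be an EFG. Function $QR : \Sigma_\triangle \to \Sigma_\triangledown$ is a canonical counterfactual quantal response function of player $\triangledown$ with generator $q$ if for a strategy $\sigma_\triangle$ it produces strategy $\sigma_\triangledown$ such that in every information set $I \in \mathcal{I}_\triangledown$, for each action $a^k \in A(I)$ it holds that

$$QR(\sigma_\triangle, I, a^k) = \frac{q(v_\triangledown(\sigma, I, a^k))}{\sum_{a^i \in A(I)} q(v_\triangledown(\sigma, I, a^i))}, \tag{2}$$

where $QR(\sigma_\triangle, I, a^k)$ is the probability of playing action $a^k$ in information set $I$ and $\sigma = (\sigma_\triangle, \sigma_\triangledown)$.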

We denote the canonical counterfactual quantal response function with the logit generator counterfactual logit quantal response (CLQR). CLQR differs from the traditional definition of logit agent quantal response (LAQR) McKelvey and Palfrey (1998) in using counterfactual values instead of expected utilities. The main advantage of CLQR over LAQR is that CLQR defines a valid quantal strategy even in information sets unreachable due to a strategy of the opponent, which is necessary for applying regret-minimization algorithms explained later.

Because the logit quantal function is the most well-studied function in the literature with several deployed applications Pita et al. (2008); Delle Fave et al. (2014); Fang et al. (2017), we focus most of our analysis and experimental results on (C)LQR. Without loss of generality, we assume the quantal player is always player $\triangledown$.

### Metrics for Evaluating Quality of Strategy

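In a two-player zero-sum game, the exploitability of a given strategy is defined as the expected utility that a fully rational opponent can achieve above the value of the game. Formally, the exploitability $\mathcal{E}(\sigma_\triangle)$ of strategy $\sigma_\triangle \in \Sigma_\triangle$ is $\mathcal{E}(\sigma_\triangle) = u_\triangledown(\sigma_\triangle, \sigma_\triangledown) - u_\triangledown(\sigma^{NE})$, where $\sigma_\triangledown \in BR_\triangledown(\sigma_\triangle)$.

We also intend to measure how much we are able to exploit an opponent’s bounded-rational behavior. For this purpose, we define the gain of a strategy against quantal response as the expected utility we receive above the value of the game. Formally, the gain $\mathcal{G}(\sigma_\triangle)$ of strategy $\sigma_\triangle$ is defined as $\mathcal{G}(\sigma_\triangle) = u_\triangle(\sigma_\triangle, QR(\sigma_\triangle)) - u_\triangle(\sigma^{NE})$.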

General-sum games do not have the property that all NEs have the same expected utility. Therefore, we simply measure expected utility against LQR and BR opponents there.
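The two metrics can be illustrated on a small zero-sum matrix game. The sketch below assumes that the rational player is the row player, that `payoff[i][j]` is her payoff, and that the value of the game is supplied by the caller; all function names are hypothetical.

```python
import math


def logit_response(utilities, lam):
    """Logit quantal response over a list of expected utilities."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def column_values(payoff, row_strategy):
    """Expected payoff to the row player for each pure column action."""
    return [
        sum(p * row[j] for p, row in zip(row_strategy, payoff))
        for j in range(len(payoff[0]))
    ]


def exploitability(payoff, row_strategy, game_value):
    """What a best-responding column player earns above her game value."""
    # In a zero-sum game the column player earns minus the row payoff,
    # so her best response minimizes the row player's expected payoff.
    return game_value - min(column_values(payoff, row_strategy))


def gain(payoff, row_strategy, game_value, lam):
    """What the row player earns above the game value against a logit opponent."""
    values = column_values(payoff, row_strategy)
    response = logit_response([-v for v in values], lam)
    return sum(q * v for q, v in zip(response, values)) - game_value


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]  # the game value is 0
    for strategy in ([0.5, 0.5], [1.0, 0.0]):
        print(strategy,
              exploitability(matching_pennies, strategy, 0.0),
              gain(matching_pennies, strategy, 0.0, 1.0))
```

In matching pennies the uniform strategy has zero exploitability and zero gain, while a pure strategy is fully exploitable and also loses against the logit opponent.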

## 3 One-Sided Quantal Solution Concepts

This section formally defines two one-sided bounded-rational equilibria, where one of the players is rational and the other subrational – a saddle-point-type equilibrium called Quantal Nash Equilibrium (QNE) and a leader-follower-type equilibrium called Quantal Stackelberg Equilibrium (QSE). We show that contrary to their fully-rational counterparts, QNE differs from QSE even in zero-sum games. Moreover, we show that computing QSE in extensive-form games is an NP-hard problem.

### Quantal Equilibria in Normal-form Games

We first consider a variant of NE, in which one of the players plays a quantal response instead of the best response.

###### Definition 3.

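Given a normal-form game $G$ and a quantal response function $QR$, a strategy profile $(\sigma_\triangle^{QNE}, QR(\sigma_\triangle^{QNE})) \in \Sigma$ describes a Quantal Nash Equilibrium (QNE) if and only if $\sigma_\triangle^{QNE}$ is a best response of player $\triangle$ against quantal-responding player $\triangledown$. Formally:

$$\sigma_\triangle^{QNE} \in BR_\triangle(QR(\sigma_\triangle^{QNE})). \tag{3}$$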

QNE can be seen as a concept between NE and Quantal Response Equilibrium (QRE) McKelvey and Palfrey (1995). While in NE, both players are fully rational, and in QRE, both players are assumed to behave bounded-rationally, in QNE, one player is rational, and the other is bounded-rational.

###### Theorem 1.

Computing a QNE strategy profile in two-player NFGs is a PPAD-hard problem.

###### Proof (Sketch).

We reduce from the problem of computing an $\epsilon$-NE. We derive an upper bound on the maximum distance between the best response and the logit quantal response, which goes to zero as $\lambda$ approaches infinity. For a given $\epsilon$, we find $\lambda$ such that QNE is an $\epsilon$-NE. The full proof is provided in the appendix. ∎

QNE usually outperforms NE against LQR in practice, as we show in the experiments. However, this cannot be guaranteed, as stated in Proposition 2.

###### Proposition 2.

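For any LQR function, there exists a zero-sum normal-form game $G$ with a unique NE $(\sigma_\triangle^{NE}, \sigma_\triangledown^{NE})$ and QNE $(\sigma_\triangle^{QNE}, QR(\sigma_\triangle^{QNE}))$ such that $u_\triangle(\sigma_\triangle^{NE}, QR(\sigma_\triangle^{NE})) > u_\triangle(\sigma_\triangle^{QNE}, QR(\sigma_\triangle^{QNE}))$.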

The second solution concept is a variant of SSE for situations in which the follower is bounded-rational.

###### Definition 4.

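Given a normal-form game $G$ and a quantal response function $QR$, a mixed strategy $\sigma_\triangle^{QSE} \in \Sigma_\triangle$ describes a Quantal Stackelberg Equilibrium (QSE) if and only if

$$\sigma_\triangle^{QSE} = \arg\max_{\sigma_\triangle \in \Sigma_\triangle} u_\triangle(\sigma_\triangle, QR(\sigma_\triangle)). \tag{4}$$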

In QSE, player $\triangle$ is fully rational and commits to a strategy that maximizes her payoff given that player $\triangledown$ observes the strategy and then responds according to her quantal function. This is a standard assumption, and even in problems where the strategy is not known in advance, it can be learned by playing or observing. QSE always exists because all utilities are finite, the game has a finite number of actions, the utility of player $\triangle$ is continuous on her strategy simplex, and the maximum is hence always reached.

###### Observation 3.

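Let $G$ be a normal-form game and $q$ be a generator of a canonical quantal function. Then QSE of $G$ can be formulated as a non-convex mathematical program:

$$\max_{\sigma_\triangle \in \Sigma_\triangle} \frac{\sum_{a_\triangledown \in A_\triangledown} u_\triangle(\sigma_\triangle, a_\triangledown)\, q(u_\triangledown(\sigma_\triangle, a_\triangledown))}{\sum_{a_\triangledown \in A_\triangledown} q(u_\triangledown(\sigma_\triangle, a_\triangledown))}. \tag{5}$$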

###### Example 1.

In Figure 1 we present an example of utility against LQR in Game 1 with . We show QNE, in which both actions have the same expected utility. Therefore, it is a best response for player $\triangle$, and she has no incentive to deviate. However, it is not optimal with regard to maximal expected utility, which is achieved in two global extremes, both being QSE. We can also observe that even a small game like Game 1 can have multiple local extremes: in this case, three.

Example 1 shows that finding QSE is a non-concave problem even in zero-sum NFGs, and it can have multiple global solutions. Moreover, facing a bounded-rational opponent may change the relationship between NE and SSE. They are no longer interchangeable, even in zero-sum games, and QSE may use strictly dominated actions.

### Quantal Equilibria in Extensive-form Games

In EFGs, QNE and QSE are defined in the same manner as in NFGs. However, instead of the normal-form quantal response, the second player acts according to the counterfactual quantal response. QSE in EFGs can be computed by a mathematical program provided in the appendix. The natural formulation of the program is non-linear with non-convex constraints indicating the problem is hard. We show that the problem is indeed NP-hard, even in zero-sum games.

###### Theorem 4.

Let $G$ be a two-player imperfect-information EFG with perfect recall and $QR$ be a quantal response function. Computing an optimal strategy of a rational player against the quantal response opponent in $G$ is NP-hard if one of the following holds: (1) $G$ is zero-sum and $QR$ is generated by a logit generator for some $\lambda > 0$; or (2) $G$ is general-sum.

###### Proof (Sketch).

We reduce from the set partition. The key part of the constructed EFG is zero-sum. For each item of the partition problem, the leader chooses an action that places the item to one or the other subset. The follower has two actions; each gives the leader a reward of the sum of items in one subset. If the sums are different, the follower chooses the lower one. If they are the same, the follower chooses both of them uniformly, which maximizes the leader’s payoff.

A complication is that the leader could split each item in half by playing uniformly. This is prevented by combining the leader’s actions for placing an item with an action in a separate game with two symmetric QSEs. Such a game is the collaborative coordination game in the non-zero-sum case and a game similar to Figure 1 in the zero-sum case. ∎

The proof of the non-zero-sum part of Theorem 4 only requires the follower to play the action with a higher reward with higher probability. This also holds for a rational player; hence, the theorem provides an independent, simpler, and more general proof of NP-hardness of computing Stackelberg equilibria in EFGs, which unlike Letchford and Conitzer (2010) does not rely on the tie-breaking rule.

## 4 Computing Bounded-rational Equilibria

This section describes various algorithms and heuristics for computing the one-sided quantal equilibria introduced in the previous section. In the first part, we focus on QNE and, based on an empirical evaluation, we claim that regret-minimization algorithms converge to QNE in both NFGs and EFGs. The second part then discusses gradient-based algorithms for computing QSE and analyzes cases in which regret minimization methods will or will not converge to QSE.

### Algorithms for Computing QNE

Counterfactual regret minimization (CFR) Zinkevich et al. (2008) is a state-of-the-art algorithm for approximating NE in extensive-form games. CFR is a form of regret matching Hart and Mas-Colell (2000) and uses iterated self play to minimize regret at each information set independently. CFR-f Davis et al. (2014) is a modification capable of computing strategy against some opponent models. In each iteration, it performs a CFR update for one player and computes the response for the other player. We use CFR-f with a quantal response and call it CFR-QR. In normal-form games, we use the same approach with simple regret matching (RM-QR).
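The normal-form variant described above can be sketched in a few lines for a zero-sum matrix game. This is a minimal illustration under stated assumptions, not the authors' implementation: the row player is the rational one, `payoff[i][j]` is her payoff, the opponent answers the row player's current strategy with a logit quantal response, and the row player's strategies are averaged as is usual in regret minimization. The name `rm_qr` and all parameters are hypothetical.

```python
import math


def logit_response(utilities, lam):
    """Logit quantal response over a list of expected utilities."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def regret_matching_strategy(regrets):
    """Play proportionally to positive regrets, uniformly if there are none."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [r / total for r in positive]


def rm_qr(payoff, lam, iterations):
    """Regret matching for the row player against a quantal-responding column player."""
    n_rows, n_cols = len(payoff), len(payoff[0])
    regrets = [0.0] * n_rows
    strategy_sum = [0.0] * n_rows
    for _ in range(iterations):
        sigma = regret_matching_strategy(regrets)
        # Zero-sum assumption: the column player's utility is minus the row payoff.
        col_utils = [-sum(sigma[i] * payoff[i][j] for i in range(n_rows))
                     for j in range(n_cols)]
        response = logit_response(col_utils, lam)
        action_values = [sum(response[j] * payoff[i][j] for j in range(n_cols))
                         for i in range(n_rows)]
        expected = sum(s * v for s, v in zip(sigma, action_values))
        for i in range(n_rows):
            regrets[i] += action_values[i] - expected
            strategy_sum[i] += sigma[i]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]


if __name__ == "__main__":
    print(rm_qr([[2.0, 0.0], [0.0, 1.0]], lam=1.0, iterations=10000))
```

Only the rational player accumulates regrets here; the quantal player's strategy is recomputed from scratch in every iteration.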

###### Conjecture 5.

(1) In NFGs, RM-QR converges to QNE. (2) In EFGs, CFR-QR converges to QNE.

We performed an empirical evaluation on more than games. In each game, the resulting strategy of player $\triangle$ was an $\epsilon$-BR to the quantal response of the opponent with $\epsilon$ lower than after less than iterations.

Furthermore, the performance of QNE comes at the cost of substantial exploitability. We propose two heuristics that address both issues simultaneously. The first one is to play a convex combination of the QNE and NE strategies. We call this heuristic algorithm COMB. We aim to find a parameter $\alpha$ of the combination that maximizes the utility against LQR. However, choosing the correct $\alpha$ is, in general, a non-convex, non-linear problem. We search for the best $\alpha$ by sampling possible values of $\alpha$ and choosing the one with the best utility. The time required to compute one combination’s value is similar to the time required to perform one iteration of the RM-QR algorithm. Sampling and checking the candidate values of $\alpha$ hence does not affect the scalability of COMB. The gain is also guaranteed to be greater than or equal to the gain of the NE strategy, and as we show in the results, some combinations achieve higher gains than both the QNE and the NE strategies.
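The sampling step of COMB can be sketched as follows for a zero-sum matrix game. The sketch assumes the two input strategies are already available (in practice from an NE solver and from RM-QR); the strategies used in the demonstration are placeholders, and the names `comb`, `utility_vs_lqr`, and `samples` are hypothetical.

```python
import math


def logit_response(utilities, lam):
    """Logit quantal response over a list of expected utilities."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def utility_vs_lqr(payoff, row_strategy, lam):
    """Row player's expected payoff when the column player answers with LQR."""
    values = [sum(p * row[j] for p, row in zip(row_strategy, payoff))
              for j in range(len(payoff[0]))]
    response = logit_response([-v for v in values], lam)
    return sum(q * v for q, v in zip(response, values))


def comb(payoff, ne_strategy, qne_strategy, lam, samples=101):
    """Sample mixing weights on an even grid and keep the best one."""
    best_alpha, best_value = 0.0, None
    for k in range(samples):
        alpha = k / (samples - 1)
        mixed = [alpha * q + (1.0 - alpha) * n
                 for q, n in zip(qne_strategy, ne_strategy)]
        value = utility_vs_lqr(payoff, mixed, lam)
        if best_value is None or value > best_value:
            best_alpha, best_value = alpha, value
    return best_alpha, best_value


if __name__ == "__main__":
    game = [[2.0, 0.0], [0.0, 1.0]]
    ne = [1.0 / 3.0, 2.0 / 3.0]   # maximin strategy of the row player
    placeholder_qne = [0.6, 0.4]  # placeholder, not a computed QNE
    print(comb(game, ne, placeholder_qne, lam=1.0))
```

Because the grid contains both endpoints, the selected combination is never worse against LQR than either input strategy, which mirrors the guarantee stated for the NE strategy.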

The second heuristic uses a restricted response approach Johanson et al. (2008), and we call it restricted quantal response (RQR). The key idea is that during the regret minimization, we set a probability $p$ such that in each iteration, the opponent updates her strategy using (i) LQR with probability $p$ and (ii) BR otherwise. We aim to choose the parameter $p$ such that it maximizes the expected payoff. Using sampling as in COMB is not possible, since each sample requires rerunning the whole RM. To avoid the expensive computation, we start with an initial $p$ and update the value during the iterations. In each iteration, we approximate the gradient of gain with respect to $p$ based on a change in the value after both the LQR and the BR iteration. We move the value of $p$ in the gradient’s approximated direction with a step size that decreases after each iteration. However, the strategies do change tremendously with $p$, and the algorithm would require many iterations to produce a meaningful average strategy. Therefore, after a few thousand iterations, we fix the parameter $p$ and perform a clean second run, with $p$ fixed from the first run. Similarly to COMB, RQR achieves higher gains than both the QNE and the NE and performs exceptionally well in terms of exploitability with gains comparable to COMB.
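The second, fixed-parameter run of RQR can be sketched for a zero-sum matrix game as below. This is a simplified illustration, not the authors' code: the adaptive first phase that tunes the mixing probability is omitted, the probability `p` is passed in directly, and the opponent's update is drawn at random in each iteration with a seeded generator. All names are hypothetical.

```python
import math
import random


def logit_response(utilities, lam):
    """Logit quantal response over a list of expected utilities."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def regret_matching_strategy(regrets):
    """Play proportionally to positive regrets, uniformly if there are none."""
    positive = [max(r, 0.0) for r in regrets]
    total = sum(positive)
    if total <= 0.0:
        return [1.0 / len(regrets)] * len(regrets)
    return [r / total for r in positive]


def rqr_fixed_p(payoff, lam, p, iterations, seed=0):
    """Regret matching against an opponent using LQR w.p. p and BR otherwise."""
    rng = random.Random(seed)
    n_rows, n_cols = len(payoff), len(payoff[0])
    regrets = [0.0] * n_rows
    strategy_sum = [0.0] * n_rows
    for _ in range(iterations):
        sigma = regret_matching_strategy(regrets)
        # Zero-sum assumption: the column player's utility is minus the row payoff.
        col_utils = [-sum(sigma[i] * payoff[i][j] for i in range(n_rows))
                     for j in range(n_cols)]
        if rng.random() < p:
            response = logit_response(col_utils, lam)
        else:
            best = max(range(n_cols), key=lambda j: col_utils[j])
            response = [1.0 if j == best else 0.0 for j in range(n_cols)]
        action_values = [sum(response[j] * payoff[i][j] for j in range(n_cols))
                         for i in range(n_rows)]
        expected = sum(s * v for s, v in zip(sigma, action_values))
        for i in range(n_rows):
            regrets[i] += action_values[i] - expected
            strategy_sum[i] += sigma[i]
    total = sum(strategy_sum)
    return [s / total for s in strategy_sum]


if __name__ == "__main__":
    print(rqr_fixed_p([[2.0, 0.0], [0.0, 1.0]], lam=1.0, p=0.5, iterations=10000))
```

Setting `p = 1.0` reduces the sketch to regret matching against a purely quantal opponent, and `p = 0.0` to regret matching against a best-responding opponent.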

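We also adapted both algorithms from NFGs to EFGs. The COMB heuristic requires computing a convex combination of strategies, which is not straightforward in EFGs. Let $\alpha \in [0, 1]$ be a combination coefficient and $\sigma_\triangle^1, \sigma_\triangle^2 \in \Sigma_\triangle$ be two different strategies for player $\triangle$. The convex combination of the strategies is a strategy $\sigma_\triangle^3 \in \Sigma_\triangle$ computed for each information set $I \in \mathcal{I}_\triangle$ and action $a \in A(I)$ as follows:

$$\sigma_\triangle^3(I, a) = \frac{\pi_\triangle^{\sigma^1}(I)\, \sigma_\triangle^1(I, a)\, \alpha + \pi_\triangle^{\sigma^2}(I)\, \sigma_\triangle^2(I, a)\, (1 - \alpha)}{\pi_\triangle^{\sigma^1}(I)\, \alpha + \pi_\triangle^{\sigma^2}(I)\, (1 - \alpha)}. \tag{6}$$

We search for a value of $\alpha$ that maximizes the gain, and we call this approach the counterfactual COMB. Contrary to COMB, the RQR can be directly applied to EFGs. The idea is the same, but instead of regret matching, we use CFR. We call this heuristic algorithm the counterfactual RQR.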
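Combining two behavioral strategies at a single information set might look as follows. The sketch assumes the standard reach-weighted rule, in which each strategy's action probability is weighted by that strategy's own probability of reaching the information set; the function name, the argument names, and the uniform fallback for unreachable information sets are illustrative assumptions.

```python
def combine_at_information_set(alpha, reach_1, probs_1, reach_2, probs_2):
    """Mix two behavioral strategies at one information set.

    reach_1 and reach_2 are the player's own reach probabilities of the
    information set under the first and second strategy; probs_1 and probs_2
    are the corresponding action distributions at that information set.
    """
    denominator = alpha * reach_1 + (1.0 - alpha) * reach_2
    if denominator == 0.0:
        # Neither strategy reaches this information set, so the choice does
        # not affect play; uniform is an arbitrary fallback.
        return [1.0 / len(probs_1)] * len(probs_1)
    return [
        (alpha * reach_1 * p1 + (1.0 - alpha) * reach_2 * p2) / denominator
        for p1, p2 in zip(probs_1, probs_2)
    ]


if __name__ == "__main__":
    # The second strategy reaches the information set four times as often,
    # so it dominates the mixture even with alpha = 0.5.
    print(combine_at_information_set(0.5, 0.2, [1.0, 0.0], 0.8, [0.0, 1.0]))
```

Averaging the action probabilities directly, without the reach weights, would in general produce a strategy that is not equivalent to mixing the two strategies at the start of the game.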

### Algorithms for Computing QSE

In general, the mathematical programs describing the QSE in NFGs and EFGs are non-concave, non-linear problems. We use gradient ascent (GA) methods Boyd and Vandenberghe (2004) to find local optima of these programs. In case a program’s formulation is concave, the GA will reach a global optimum. However, both formulations of QSE contain a fractional part, corresponding to the definition of the follower’s canonical quantal function. Because concavity is not preserved under division, assessing the conditions for concavity of these programs is difficult. The GA performs well on small games, but it does not scale even to moderately sized games, as we show in the experiments.
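A toy version of gradient ascent on the leader's utility against a logit follower is sketched below. It is restricted to a zero-sum game in which the leader has two actions, so her strategy is a single number `x` in [0, 1], and it uses a finite-difference gradient with projection onto the interval in place of an analytic gradient. These simplifications and all names are assumptions made for illustration.

```python
import math


def logit_response(utilities, lam):
    """Logit quantal response over a list of expected utilities."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def leader_utility(payoff, x, lam):
    """Leader's payoff when playing row 0 with probability x against LQR."""
    values = [x * payoff[0][j] + (1.0 - x) * payoff[1][j]
              for j in range(len(payoff[0]))]
    response = logit_response([-v for v in values], lam)
    return sum(q * v for q, v in zip(response, values))


def projected_gradient_ascent(payoff, lam, x0, step=0.05, iterations=500, h=1e-6):
    """Climb the leader's utility from x0, keeping x inside [0, 1]."""
    x = x0
    for _ in range(iterations):
        lo, hi = max(0.0, x - h), min(1.0, x + h)
        gradient = (leader_utility(payoff, hi, lam)
                    - leader_utility(payoff, lo, lam)) / (hi - lo)
        x = min(1.0, max(0.0, x + step * gradient))
    return x


if __name__ == "__main__":
    matching_pennies = [[1.0, -1.0], [-1.0, 1.0]]
    print(projected_gradient_ascent(matching_pennies, lam=1.0, x0=0.8))
```

Since the objective is non-concave in general, the point reached depends on `x0`; restarting from several initial points is one common way to reduce the risk of stopping in a poor local optimum.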

Because QSE and QNE are usually non-equivalent concepts even in zero-sum games (see Figure 1), the regret-minimization algorithms will not converge to QSE in general. However, in case a quantal function satisfies the so-called pretty-good-response condition, the algorithm converges to a strategy of the leader exploiting the follower the most Davis et al. (2014). We show that a class of simple (i.e., attaining only a finite number of values) quantal functions satisfies the pretty-good-response condition.

###### Proposition 6.

Let $G$ be a zero-sum NFG and $QR$ a quantal response function of the follower that depends only on the ordering of the expected utilities of individual actions. Then the RM-QR algorithm converges to QSE.

An example of a simple quantal function depending only on the ordering of expected utilities is, e.g., a function assigning probability to the actions with the highest expected utility, probability to the action with the second-highest utility and probabilities to all remaining actions. Note that the class of quantal functions satisfying the conditions of pretty-good-responses still takes into account the strategy of the opponent (i.e., the responses are not static), but it is limited. In general, quantal functions do not satisfy the condition of pretty-good-responses.
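A quantal function that depends only on the ordering of expected utilities can be sketched as follows. The probabilities 0.5 and 0.3 for the two best actions, with the remainder shared equally, are hypothetical values chosen for illustration, as is the function name; ties are broken in favor of the action listed first.

```python
def rank_based_response(utilities, top=0.5, second=0.3):
    """Assign probabilities by utility rank only, ignoring utility magnitudes."""
    n = len(utilities)
    if n < 3:
        raise ValueError("this sketch expects at least three actions")
    # Indices sorted from the highest to the lowest expected utility;
    # the sort is stable, so tied actions keep their original order.
    order = sorted(range(n), key=lambda a: utilities[a], reverse=True)
    rest = (1.0 - top - second) / (n - 2)
    probabilities = [rest] * n
    probabilities[order[0]] = top
    probabilities[order[1]] = second
    return probabilities


if __name__ == "__main__":
    print(rank_based_response([1.0, 5.0, 3.0, 0.0]))
    # Rescaling the utilities keeps the ordering and hence the response.
    print(rank_based_response([10.0, 50.0, 30.0, 0.0]))
```

The response still reacts to the opponent's strategy, because a change in strategy can reorder the expected utilities, but it takes only finitely many values.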

###### Proposition 7.

Let $QR$ be a canonical quantal function with a strictly monotonically increasing generator $q$. Then $QR$ is not a pretty-good-response.

## 5 Experimental Evaluation

The experimental evaluation aims to compare solutions of our proposed algorithm RQR with QNE strategies computed by RM-QR for NFGs and CFR-QR for EFGs. As baselines, we use (i) Nash equilibrium (NASH) strategies, (ii) a best convex combination of NASH and QNE denoted COMB, and (iii) an approximation of QSE computed by gradient ascent (GA), initialized by NASH. We focus mainly on zero-sum games, because they allow for a more straightforward interpretation of the trade-offs between gain and exploitability. Still, we also provide results on general-sum NFGs. Finally, we show that the performance of RQR is stable over different rationality values and analyze the EFG algorithms more closely on the well-known Leduc Hold’em game. The experimental setup and all the domains are described in the appendix. The code will be published and is appended.

### Scalability

The first experiment shows the difference in runtimes of GA and regret-minimization approaches. In NFGs, we used random square zero-sum games as an evaluation domain, and the runtimes are averaged over 1000 games per game size with [-10,9] integer payoffs. In EFGs, the generation procedure for random games does not guarantee the games will have the same number of histories, so we clustered games with a similar size together, and report runtimes averaged over the clusters. The results on the right of Figure 2 show that regret minimization approaches scale significantly better – the tendency is very similar in both NFGs and EFGs, and we show the results for NFGs in the appendix.

We report scalability in general-sum games on the left in Figure 2. We generated 100 games of Grab the Dollar, Majority Voting, Traveler’s Dilemma, and War of Attrition with an increasing number of actions for both players and also 100 randomly generated general-sum NFGs of the same size. In the rest of the experiments, we use sets of 1000 games with 100 actions for each class. We use a MILP formulation to compute the NE Sandholm et al. (2005) and solve for SE using multiple linear programs Conitzer and Sandholm (2006). The performance of GA against the CFR-based algorithms is similar to the zero-sum case, and the only difference is in NE and SE, which are even less scalable than GA.

### Gain comparison

Now we turn to a comparison of the gains of the solutions of all algorithms in NFGs and EFGs. We report averages with standard errors for zero-sum games in Figure 3 and general-sum games in Figure 4 (left). We use the NE strategy as a baseline, but as different NE strategies can achieve different gains against the subrational opponent, we try to select the best NE strategy. To achieve this, we first compute a feasible NE. Then we run gradient ascent constrained to the set of NE, optimizing the expected value. We aim to show that RQR performs even better than an optimized NE. Moreover, COMB strategies also outperform the best NE, despite COMB using the (possibly suboptimal) NE strategy computed by CFR.

The results show that GA for QSE is the best approach in terms of gain in zero-sum and general-sum games if we ignore scalability issues. The scalable heuristic approaches also achieve significantly higher gain than both the NE baseline and competing QNE in both zero-sum and general-sum games. On top of that, we show that in general-sum games, in all games except one, the heuristic approaches perform as well as or better than SE. This indicates that they are useful in practice even in general-sum settings.

### Robustness comparison

In this work, we are concerned primarily with increasing gain. However, the higher gain might come at the expense of robustness: the quality of strategies might degrade if our expected behavioral model of the opponent is incorrect. Therefore, we also study (i) the exploitability of computed solutions in zero-sum games and (ii) expected utility against the best response that breaks ties in our favor in general-sum games. Both correspond to performance against a perfectly rational selfish opponent.

First, we report the mean exploitability in zero-sum games in Figure 5. Because the exploitability of NE is zero by definition, we do not include NE in the figure. We show that QNE is highly exploitable in both NFGs and EFGs. COMB and GA perform similarly, and RQR has significantly lower exploitability compared to other modeling approaches. Second, we depict the results in general-sum games on the right in Figure 4. By definition, SE is the optimal strategy and provides an upper bound on achievable value. Unlike in zero-sum games, GA outperforms CFR-based approaches even against the rational opponent. Our heuristic approaches are not as good as entirely rational solution concepts, but they always perform better than QNE.

### Different rationality

In the fourth experiment, we assess the algorithms’ performance against opponents with varying rationality parameter $\lambda$ in the logit function. For we report the expected utility on the left in Figure 6. For smaller values of $\lambda$ (i.e., lower rationality), RQR performs similarly to GA and QNE, but it achieves lower exploitability. As rationality increases, the gain of RQR is found between GA and QNE, while having the lowest exploitability. For all values of $\lambda$, both QNE and RQR report higher gain than NASH. We do not include COMB in the figure for the sake of better readability as it achieves similar results to RQR.

### Standard EFG Benchmarks

**Poker.** Poker is a standard evaluation domain, and continual resolving was demonstrated to perform extremely well on it Moravčík et al. (2017). We tested our approaches on two poker variants: one-card poker and Leduc Hold’em. We used because for , QNE is equal to QSE. We report the values achieved in Leduc Hold’em on the right in Figure 6. The horizontal lines correspond to NE and GA strategies, as they do not depend on $p$. The heuristic strategies are reported for different $p$ values. The leftmost point corresponds to the CFR-BR strategy and the rightmost to the QNE strategy. The experiment shows that RQR performs very well for poker games as it gets close to the GA while running significantly faster. Furthermore, the strategy computed by RQR is consistently much less exploitable across the various $p$ values. This suggests that the restricted response can be successfully applied not only against strategies independent of the opponent as in Johanson et al. (2008), but also against adapting opponents. We observe similar performance also in the one-card poker and report the results in the appendix.

**Large game.** We demonstrate our approach on Goofspiel 7, a game with almost 100 million histories to show the practical scalability. While CFR-QR, RQR, and CFR were able to compute a strategy, the games of this size are beyond the computing abilities of GA and memory requirements of COMB. CFR-QR has exploitability 4.045 and gains 2.357, RQR has exploitability 3.849 and gains 2.412, and CFR gains 1.191 with exploitability 0.115. RQR hence performs the best in terms of gain and outperforms CFR-QR in exploitability. All algorithms used 1000 iterations.

### Summary of the Results

In the experiments, we have shown three main points. (1) The GA approach does not scale even to moderate games, making regret minimization approaches much better suited to larger games. (2) In both normal-form and extensive-form games, the RQR approach outperforms the NASH and QNE baselines in terms of gain and outperforms QNE in terms of exploitability, making it currently the best approach against LQR opponents in large games. (3) Our algorithms perform better than the baselines, even with different rationality values, and can be successfully used even in general games. A visual comparison of the algorithms in zero-sum games is provided in the following table. Scalability denotes how well the algorithm scales to larger games. The marks range from three minuses (worst) to three pluses (best), with NE being the zero baseline.

| | COMB | RQR | QNE | NE | GA |
|---|---|---|---|---|---|
| Scalability | - | 0 | 0 | 0 | - - - |
| Gain | ++ | ++ | + | 0 | +++ |
| Exploitability | - - | - | - - - | 0 | - - |

## 6 Conclusion

Bounded rationality models are crucial for applications that involve human decision-makers. Most previous results on bounded rationality consider games among humans, where all players’ rationality is bounded. However, artificial intelligence applications in real-world problems pose a novel challenge of computing optimal strategies for an entirely rational system interacting with bounded-rational humans. We call this optimal strategy Quantal Stackelberg Equilibrium (QSE) and show that natural adaptations of existing algorithms do not lead to QSE, but rather to a different solution we call Quantal Nash Equilibrium (QNE). As we observe, there is a trade-off between computability and solution quality. QSE provides better strategies, but it is computationally hard and does not scale to large domains. QNE scales significantly better, but it typically achieves lower utility than QSE and might be even worse than the worst Nash equilibrium. Therefore, we propose a variant of counterfactual regret minimization which, based on our experimental evaluation, scales to large games, and computes strategies that outperform QNE against both the quantal response opponent and the perfectly rational opponent.

## Appendix A Proofs

### Proof of Theorem 1

###### Lemma 1.

Let , . Then it holds that

(7) |

###### Proof.

We proceed by induction on the size of the set .

Base case: Let . Because , any can be written as . For a given , the difference between and can be written as

To find a maximum of this function, we differentiate it by , which yields

For , the function has a root

where is the Lambert function. The root is unique, because the inner function is increasing as its derivative is positive for all . It is a maximum of , because . By plugging the root into the function , we obtain the upper bound on the distance between and :

which is independent of .

Induction step: For a given , assume . Consider a new . Again, we set . For a given , the difference between and can be written as

because the function is strictly greater than zero. To find a maximum of the second term, we differentiate it by :

As in the base case, for the derivative has a root

The root is unique, because the derivative is positive increasing on and decreasing on , as differentiating it for the second time reveals. Therefore, we obtain the upper bound

The result follows from the induction. Note that the upper bound goes to zero as approaches infinity. ∎

###### Theorem 1.

Computing a QNE strategy profile in two-player NFGs is a PPAD-hard problem.

###### Proof.

Let be a 2-player NFG with strictly positive utilities, in which one of the players has actions to play. Computing an -NASH in is PPAD-complete Daskalakis et al. (2009). We show that computing QNE is PPAD-hard by reducing the problem of finding -NASH in to a problem of computing a specific QNE in .

We construct the reduced game as follows: let the player with actions be the subrational player and let be from a logit class, i.e., for some . Assume that there exists , such that for each and each strategy of the leader . Because the leader plays fully rationally, his QNE strategy is a best response. By the definition of , the follower’s QR is an -best response. Therefore, by solving for QNE with , we find an -NASH in .

Each strategy of a leader generates expected utilities for the follower, playing BR corresponds to , playing QR corresponds to . Because the game we reduce from has actions, there are expected utilities, we can hence use the lemma. Setting finishes the proof. ∎

### Proof of Proposition 1

###### Proposition 2.

For any function, there exists a zero-sum normal-form game with a unique NE - and QNE - such that .

| | A | B | C |
|---|---|---|---|
| X | -6 | 9 | 9 |
| Y | 3 | 0 | 2 |

###### Proof.

In the provided game, the only NE for player is and QR against it results in expected utility 1.6438. QNE strategy for player is (0.1744,0.8256) resulting in expected utility 1.6366. Therefore, in this game with QNE is worse than NE against QR. For a different , the utilities can be re-scaled by to achieve a similar result. ∎
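The numbers in this proof can be checked with a few lines of Python. The sketch below is an illustration under stated assumptions rather than the authors' code: it assumes a logit quantal response with rationality parameter 1 (a value chosen here because it reproduces the reported utilities), it derives the Nash strategy (1/6, 5/6) by equalizing the payoffs of columns A and B, and all function names are hypothetical.

```python
import math

# Row player's utilities in the zero-sum game from the table above
# (rows X, Y; columns A, B, C). The column player receives the negation.
PAYOFF = [[-6.0, 9.0, 9.0],
          [3.0, 0.0, 2.0]]

LAMBDA = 1.0  # assumed rationality parameter of the logit response


def column_utilities(row_strategy):
    """Expected utility of each column from the column player's view."""
    return [-sum(p * PAYOFF[r][c] for r, p in enumerate(row_strategy))
            for c in range(3)]


def logit_response(utilities, lam=LAMBDA):
    """Logit quantal response: probabilities proportional to exp(lam * u)."""
    top = max(utilities)
    weights = [math.exp(lam * (u - top)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def value_against_qr(row_strategy):
    """Row player's expected utility when the column player plays logit QR."""
    response = logit_response(column_utilities(row_strategy))
    return sum(p * q * PAYOFF[r][c]
               for r, p in enumerate(row_strategy)
               for c, q in enumerate(response))


nash_strategy = [1 / 6, 5 / 6]   # derived here, equalizes columns A and B
qne_strategy = [0.1744, 0.8256]  # as reported in the proof

print(value_against_qr(nash_strategy))
print(value_against_qr(qne_strategy))
```

Under these assumptions, both printed values agree with the utilities stated in the proof to about three decimal places, and the Nash strategy earns slightly more against the quantal response than the QNE strategy does.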

### Proof of Theorem 4

###### Theorem 4.

Let be a two-player imperfect-information EFG with perfect recall and be a quantal response function. Computing an optimal strategy of a rational player against the quantal response opponent in is NP-hard if one of the following holds: (1) is zero-sum and is generated by a logit generator for some ; or (2) is general sum.

###### Proof.

We reduce the problem of solving an instance of the partition problem to finding a QSE in a specific zero-sum EFG. An instance of the partition problem is a multiset of positive integers . The question is whether there is a set of indices , such that

(8) |
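For readers unfamiliar with the partition problem, the following Python sketch decides small instances by tracking reachable subset sums. It only illustrates the decision problem used in the reduction and plays no role in the proof; the function name `can_partition` is hypothetical. The procedure is pseudo-polynomial (its running time grows with the magnitude of the integers), so it does not contradict the hardness of the problem.

```python
def can_partition(items):
    """Decide whether a multiset of positive integers can be split
    into two parts with equal sums."""
    total = sum(items)
    if total % 2:          # an odd total can never be split evenly
        return False
    target = total // 2
    reachable = {0}        # subset sums reachable so far
    for value in items:
        reachable |= {s + value for s in reachable if s + value <= target}
    return target in reachable


print(can_partition([3, 1, 1, 2, 2, 1]))
print(can_partition([1, 1, 4]))
```

The first instance is solvable (for example, 3 + 2 on one side and 1 + 1 + 2 + 1 on the other), while the second is not, because no subset of {1, 1, 4} sums to 3.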

To construct the game, we use a special NFG with two distinct QSEs, which are different from the uniform strategy. An example of such an NFG is depicted in Figure 7.

In the first equilibrium, the rational player plays the first action with probability . The second equilibrium is when she plays the second action with probability . The expected reward of when playing either of these strategies is , while any other strategy, and particularly the uniform strategy, achieves a lower reward.

Now we can proceed to constructing the game, which makes the rational player commit to a strategy that solves the partition problem. The game starts with a uniform chance node. For each item, there is a subgame as indicated in the game in Figure 8 (for two of the items and ).

There are two main components of each subgame. The first component (on the left) – the NFG subtree – is the EFG representation of the NFG game introduced earlier. To maximize her utility, the rational player is motivated to play either the first action with probability or , but not a uniform strategy. The second part – the partition subtree – solves the partition problem.

#### Solvable instances

First, we construct the QSE of this game in case the partition problem has a solution, i.e., there exists an index set for which Eq. (8) holds. To maximize the utility in the NFG subtrees, in each of her information sets player chooses only from the two strategies and . For each item, if the item belongs to the set , she chooses the strategy . If she chooses strategy , it means the item is from the complementary set. The expected utilities of player of actions in the lower information set are

Because is the solution, we have and player is incentivized to play uniformly. The ’s utility in the partition subtrees is hence

Next, we show that utility is optimal in the partition subtrees; player can never get a higher utility. Let be a vector of the multiset integers of the partition problem and be a vector of arbitrary probabilities of playing the first action in player ’s partition subtrees. We aim to prove that for any and the corresponding vector of complementary probabilities of playing the second actions it holds that

Simple algebra shows this is equivalent to

(9) |

Because we have Eq. (9) always holds and is indeed an upper bound. Because player ’s utility is maximized in both the NFG and the partition subtrees, it is a QSE and her utility if the partition problem is solvable is therefore

#### Unsolvable instances

Second, assume that the partition problem does not have a solution. We show that in this case, the utility of player in the QSE will always be strictly lower than . Observe that because the QSE with solvable instances achieves the maximum possible utility in the partition subtrees, in order to attempt to reach the same overall utility with unsolvable instances, player has to commit to the solution of the NFG game. Therefore, in each partition subtree, her only viable strategy is to play the first action with probability either or . First, we analyze the utility of player in case the strategy of player is not uniform. From Eq. (9), it follows that in case a vector maximizes the utility of player , it holds that

Consequently, if the strategy is not uniform, the difference in quantal functions is nonzero and it is easy to show that the scalar product also never reaches zero, making it impossible for a non-uniform strategy to be optimal. Therefore, to achieve utility , player has to enforce a uniform strategy of player . Given that player has to commit to either or in her upper information sets, we analyze the conditions when player is incentivized to play a uniform strategy. Let the set be defined similarly as earlier: an item belongs to if the first action in player ’s partition subtree is played with probability . We have

Because there is no such that the sums are equal and because by the setting of the NFG game , player never simultaneously enforces optimal utility in the NFG game and the partition subtrees. Her utility is hence strictly smaller than . By analyzing the QSE of the reduced game we hence separate solvable and unsolvable instances of the partition problem.

#### General-sum games

The situation in non-zero-sum games is even simpler. The structure of the proof is exactly as the proof for zero-sum games above, but the role of the NFG subtree can be played by the cooperative coordination game:

| | A | B |
|---|---|---|
| X | 1,1 | 0,0 |
| Y | 0,0 | 1,1 |

For any quantal response function, player plays the action with higher expected utility with a higher probability. Therefore, the uniform strategy for player corresponds to the strict minimum of his utility achievable against any quantal opponent. Any other strategy will make the two actions of player have different expected utilities and hence the better will be played with probability more than 0.5, giving player better reward than the uniform strategy. Since the game is completely symmetric, it has two distinct QSEs.
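The claim about the uniform strategy can be illustrated numerically for one particular quantal response. The sketch below assumes a logit response with rationality parameter 1 (an arbitrary illustrative choice, while the argument in the text covers any quantal response function) and evaluates the rational player's utility in the coordination game on a grid of commitments. The function name `leader_value` is hypothetical.

```python
import math


def leader_value(p_x, lam=1.0):
    """Rational player's expected utility when playing X with probability
    p_x against a logit quantal opponent in the coordination game."""
    # Opponent's expected utilities: A pays off with X, B pays off with Y.
    u_a, u_b = p_x, 1.0 - p_x
    # Logit probability that the opponent plays A.
    q_a = 1.0 / (1.0 + math.exp(-lam * (u_a - u_b)))
    # Both players are rewarded only when their actions match.
    return p_x * q_a + (1.0 - p_x) * (1.0 - q_a)


grid = [i / 100 for i in range(101)]
values = [leader_value(p) for p in grid]
worst = grid[values.index(min(values))]
print(worst, min(values))
```

On this grid the minimum is attained at the uniform commitment, and the values are symmetric around it, which matches the two distinct optimal commitments described above.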

A similar argument also holds for the partition subtree, which stays unchanged from the zero-sum game. In solvable instances, player ’s commitment makes any quantal player indifferent, so she plays uniformly. In the case of an unsolvable instance, one of her actions will be better and played with a strictly higher probability. This will give player more utility than the uniform strategy and hence it would be suboptimal for player .

∎

### Proof of Proposition 6

###### Proposition 6.

Let be a zero-sum NFG, a quantal response function of the follower, which depends only on the ordering of expected utilities of individual actions. Then the RM-QR algorithm converges to QSE.

###### Proof.

A response function is called a pretty-good-response if it satisfies

(10) |

Let be a simple quantal response function of the follower, which depends only on the descending ordering of expected utilities of follower’s actions and consider two different . In case induces the same ordering as , then . Let induce an ordering of indices and induce a different ordering . By definition of a quantal function, and . For each it holds that and therefore . Simple QR is hence a pretty-good-response and RM-QR converges to a strategy exploiting pretty-good-responses the most, which is a QSE strategy. ∎

### Proof of Proposition 7

###### Proposition 7.

Let be canonical quantal function with a strictly monotonically increasing generator . Then is not a pretty-good-response.

| Game 3 | A | B |
|---|---|---|
| X | b | a |
| Y | c | a |

###### Proof.

In Figure 9, we construct a game with , such that no canonical quantal function is a pretty-good-response in this game. Let , such that . Since is strictly monotonically increasing, we have . By the definition of canonical quantal response, we have . Because , both sides of the equation are positive. Since it holds that , therefore and finally . By definition in Equation (10), is hence not a pretty-good-response. ∎

## Appendix B Evaluation

**Experimental setup.** For all experiments except Goofspiel 7, we use Python 3.7. We solve non-linear optimization using the SLSQP GA from the SciPy 1.3.1 library. LP computations are done using Gurobi 8.1.1, and the experiments were run on an Intel i7 1.8GHz CPU with 8GB RAM. The Goofspiel experiment was run on 24 cores/48 threads at 3.2GHz (2 x Intel Xeon Scalable Gold 6146) with 384GB of RAM and was implemented in C++. For experiments on zero-sum NFGs, we used randomly generated square games, and for general-sum NFGs we used randomly generated games, Grab the Dollar, Majority Voting, Traveler’s Dilemma, and War of Attrition from GAMUT Nudelman et al. (2004). For EFGs, we used randomly generated sequential games and Leduc Hold’em. In the experiments, we wanted to measure the scalability and performance of the proposed solutions and the baseline.

**Domains.** Randomly Generated NFGs are parametrized by the sizes of both players’ action spaces. Utilities are generated uniformly at random from integers between -9 and 10. Grab the Dollar is a game with a prize that both players can grab at any given time, the actions being the times. If both players grab it at the same time, they both receive a low payoff; when one player grabs the prize before the opponent, she receives a high payoff and the opponent receives a payoff somewhere between high and low. In Majority Voting, the players have utilities assigned to each action (candidate) being declared the winner, and the winner is the candidate with the most votes. In a tie, the candidate with higher priority is declared the winner. Traveler’s Dilemma is a game where both players propose a payoff; the player with the lower proposal wins the payoff plus some bonus, and the opponent receives the payoff minus some bonus. In a War of Attrition, two players are in a dispute over an object, and each chooses a time to concede the object to the other player. If both concede at the same time, they share the object. Each player has a valuation of the object, and each player’s utility is decremented at every time step. Randomly Generated EFGs are EFGs where players switch each turn. The game has three parameters. One is the branching factor , the second is the maximal number of observations received , and the last one is the maximal sequence length for one player . Therefore, the maximal depth is . The path from the root correlates utilities, and the generation of utilities proceeds as follows. The value is set to 0 at the root and randomly changes by one up or down each time when moving to the children. The utility of a history is the value with which the leaf node is reached. We generated four sets in the following way. Set 1: , 2: , 3: , 4: . During the generation, we discarded the games where the NE strategy was the same as the GA strategy because such degenerate games would have all reported values the same. We kept generating and discarding until we had 100 games in each set. The number of games we had to generate to obtain 100 non-degenerate games in each set is: 1 - 1431, 2 - 212, 3 - 159, 4 - 112. For Leduc Hold’em we use the definition from Lockhart et al. (2019). Goofspiel 7 is a bidding card game where players are trying to obtain the most points. The point cards are shuffled and set face-down. Each turn, the top point card is revealed, and players simultaneously play a bid card; the point card is given to the highest bidder or discarded if the bids are equal. In this implementation, we use a fixed deck with K = 7.
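The random-walk generation of correlated utilities described above can be sketched as follows. This is a simplified illustration under assumptions that are not in the text: it builds a complete tree with a fixed branching factor and depth, ignores observations and information sets, and uses the hypothetical function name `random_walk_leaf_utilities`.

```python
import random


def random_walk_leaf_utilities(branching, depth, rng):
    """Return the leaf utilities of a complete tree, left to right.

    The value starts at 0 in the root and moves up or down by one on
    every edge, so leaves sharing a long path prefix get similar utilities.
    """
    def expand(value, remaining):
        if remaining == 0:
            return [value]          # a leaf: its utility is the walk value
        leaves = []
        for _ in range(branching):
            step = rng.choice((-1, 1))
            leaves.extend(expand(value + step, remaining - 1))
        return leaves

    return expand(0, depth)


rng = random.Random(0)  # fixed seed so that the run is reproducible
leaves = random_walk_leaf_utilities(branching=2, depth=4, rng=rng)
print(len(leaves), leaves)
```

Because each edge changes the value by exactly one, every leaf utility is bounded in absolute value by the depth and has the same parity as the depth.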

## Appendix C Mathematical program to solve QSE

###### Observation 8.

Let be an extensive-form game and be a generator of a canonical quantal function. Then the QSE of can be formulated as the following non-concave mathematical program:

(11) | ||||

(12) | ||||

(13) | ||||

(14) | ||||

(15) | ||||

(16) | ||||

(17) | ||||

(18) | ||||

Equation 11 maximizes the expected value of player in the root over his realization plans.

Equation 12 fixes the probability of the empty realization plan to 1, Equation 13 constrains realization plans to be probabilities, and Equation 14 defines the relationship of child plans to their parents. Equation 15 defines as the sum of values in each child times the realization plan there. Equation 16 defines the quantal response in realization plans of player . Finally, Equations 17 and 18 define the action value, summing over both descendant infosets and terminal nodes. Because of Equation 16, the program is not linear. Computing the QSE is NP-hard.

## Appendix D Scalability on NFGs.

Figure 10 shows the running time on square zero-sum NFGs, averaged over 1000 games for each size, with sizes ranging from 2 to 377 actions.

## Appendix E One card poker results.

Figure 11 shows the expected utility of COMB and RQR when run with fixed , for different values of . CFR results are on the left end of the RQR and COMB lines, and CFR-QR is on the right end.

### Footnotes

- Full proofs of all propositions are in the appendix.

### References

- Online implicit agent modelling. In Proceedings of the 2013 International Conference on Autonomous Agents and Multi-Agent Systems, pp. 255–262.
- An initial study of targeted personality models in the FlipIt game. In International Conference on Decision and Game Theory for Security, pp. 623–636.
- Convex Optimization. Cambridge University Press.
- Superhuman AI for heads-up no-limit poker: Libratus beats top professionals. Science 359 (6374), pp. 418–424.
- Dinkelbach-type algorithm for computing quantal Stackelberg equilibrium. In Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence, IJCAI-20, C. Bessiere (Ed.), pp. 246–253.
- Computing the optimal strategy to commit to. In Proceedings of the 7th ACM Conference on Electronic Commerce, pp. 82–90.
- The complexity of computing a Nash equilibrium. SIAM Journal on Computing 39 (1), pp. 195–259.
- Using response functions to measure strategy strength. In Twenty-Eighth AAAI Conference on Artificial Intelligence.
- Game-theoretic patrolling with dynamic execution uncertainty and a case study on a real transit system. Journal of Artificial Intelligence Research 50, pp. 321–367.
- PAWS - a deployed game-theoretic application to combat poaching. AI Magazine 38 (1), pp. 23–36.
- Online convex optimization for sequential decision processes and extensive-form games. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1917–1925.
- On the empirical content of quantal response equilibrium. American Economic Review 98 (1), pp. 180–200.
- A simple adaptive procedure leading to correlated equilibrium. Econometrica 68 (5), pp. 1127–1150.
- Data biased robust counter strategies. In Artificial Intelligence and Statistics, pp. 264–271.
- Computing robust counter-strategies. In Advances in Neural Information Processing Systems, pp. 721–728.
- Computing optimal strategies to commit to in extensive-form games. In Proceedings of the 11th ACM Conference on Electronic Commerce, pp. 83–92.
- Computing approximate equilibria in sequential adversarial games by exploitability descent. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pp. 464–470.
- Quantal choice analysis: a survey. In Annals of Economic and Social Measurement, Volume 5, number 4, pp. 363–390.
- Quantal response equilibria for normal form games. Games and Economic Behavior 10 (1), pp. 6–38.
- Quantal response equilibria for extensive form games. Experimental Economics 1 (1), pp. 9–41.
- DeepStack: expert-level artificial intelligence in heads-up no-limit poker. Science 356 (6337), pp. 508–513.
- Run the GAMUT: A comprehensive approach to evaluating game-theoretic algorithms. In Proceedings of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems, Vol. 4, pp. 880–887.
- Deployed ARMOR protection: The application of a game theoretic model for security at the Los Angeles International Airport. In Proceedings of the 7th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 125–132.
- Mixed-integer programming methods for finding Nash equilibria. In AAAI, pp. 495–501.
- Security and game theory: algorithms, deployed systems, lessons learned. Cambridge University Press, New York, NY, USA. ISBN 1107096421, 9781107096424.
- A dynamic homotopy interpretation of the logistic quantal response equilibrium correspondence. Games and Economic Behavior 51 (2), pp. 243–263.
- Computing optimal strategy against quantal response in security games. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems-Volume 2, pp. 847–854.
- Regret minimization in games with incomplete information. In Advances in Neural Information Processing Systems, pp. 1729–1736.