Decentralized Q-Learning
for Stochastic Teams and Games
Part of this paper was presented at the IEEE Conference on Decision and Control, December 2015. This research was partially supported by the Natural Sciences and Engineering Research Council of Canada (NSERC).
Abstract
There are only a few learning algorithms applicable to stochastic dynamic teams and games, which generalize Markov decision processes to decentralized stochastic control problems involving possibly self-interested decision makers. Learning in games is generally difficult because of the non-stationary environment in which each decision maker aims to learn its optimal decisions with minimal information in the presence of the other decision makers who are also learning. In stochastic dynamic games, learning is more challenging because, while learning, the decision makers alter the state of the system and hence the future cost. In this paper, we present decentralized Q-learning algorithms for stochastic games, and study their convergence for the weakly acyclic case, which includes team problems as an important special case. The algorithm is decentralized in that each decision maker has access to only its local information, the state information, and the local cost realizations; furthermore, it is completely oblivious to the presence of other decision makers. We show that these algorithms converge to equilibrium policies almost surely in large classes of stochastic games.
I Introduction
This paper aims at developing new learning algorithms with desirable convergence properties for certain classes of stochastic games, which are discrete-time dynamic games in which the history can be summarized by a “state” [1]. More specifically, we focus on weakly acyclic stochastic games that can be used to model cooperative systems. The chief merit of the paper lies in the fact that learning takes place in stochastic games, which are truly dynamic games, as opposed to learning in repeated games in which the same single-stage game is played in every stage. In stochastic games, the policies selected by the decision makers not only impact their immediate cost but also alter the stage-games to be played in the future through the state dynamics. Hence, our results are applicable to a significantly broader set of applications.
The existing literature on learning in stochastic games is very small in comparison with the literature on learning in repeated games. As the method of reinforcement learning gained popularity in the context of Markov decision problems, a surge of interest in generalizing reinforcement learning, in particular the Q-learning algorithm [2], to stochastic games has led to a set of publications primarily in the computer science literature; see [3] and the references therein. In many of these publications, the authors tend to assume that the real objective of the agents (the terms “agent” and “decision maker” are used interchangeably) is for some reason to find and play an equilibrium strategy (and sometimes this even requires agents to somehow agree on a particular equilibrium strategy), and not necessarily to pursue their own objectives. Another serious issue is that the multi-agent algorithms introduced in many of these recent papers are not scalable, since each agent needs to maintain estimates of its Q-factors for each state/joint action pair and compute an equilibrium at each step of the algorithm using the updated estimates, assuming that the actions and objectives are exchanged between all agents.
Standard Q-learning, which enables an agent to learn how to play optimally in a single-agent environment, has also been applied to very specific multi-agent applications [4, 5]. Here, each agent runs a standard Q-learning algorithm by ignoring the other agents, and hence the information exchange between agents and the computational burden on each agent are substantially lower than in the aforementioned multi-agent extensions of the Q-learning algorithm. Also, standard Q-learning in a multi-agent environment makes sense from the point of view of individual bounded rationality. However, no analytical results exist regarding the properties of standard Q-learning in a stochastic game setting.
We should also mention several attempts to extend a well-known learning algorithm called Fictitious Play (FP) [6, 7] to stochastic games [8, 9, 10]. The joint action learning algorithm presented in [8] quickly becomes computationally prohibitive as the number of agents/states/actions grows. The algorithms presented in [8] are claimed, without proof, to be convergent to an equilibrium in single-state single-stage common interest games. The extension of FP considered in [9] requires each agent to calculate a stationary policy at each step in response to the empirical frequencies of the stationary policies calculated and announced by other agents in the past. The main contribution of [9] is to show that such an FP algorithm is not convergent even in the simplest 2x2x2 stochastic game where there are two states and two agents with two moves for each agent. The version of FP used in [10] is applicable only to zero-sum games (strictly adversarial games).
Other related work includes [11, 12, 13]. In [11], a multi-agent version of an actor-critic algorithm [14] is shown to be convergent to generalized equilibria in a weak sense of convergence, whereas in [12] a policy iteration algorithm is presented without rigorous results for stochastic games. The algorithms given in [11, 12] are rational from the individual agent perspective; however, they require a higher level of data storing and processing than standard Q-learning. The paper [13] uses the policy iteration algorithm given in [12] in conjunction with certain approximation methods to deal with a large state space in a specific card game without rigorous results.
We should emphasize that our viewpoint is individual bounded rationality and strategic decision making, that is, agents should act to pursue their own objectives even in the short term using localized information and reasonable algorithms. It is also desired that agent strategies converge to an agreeable solution in cooperative situations where agent objectives are aligned with system designer’s objective even though agents do not necessarily strive for converging to a particular strategy.
The rest of the paper is organized as follows. In §II, the model is introduced. In §III, the specifics of the learning paradigm and the standard Q-learning algorithm are discussed, followed by the presentation of our first Q-learning algorithm for stochastic games and its convergence properties. Generalizations of our main results in §III are presented in §IV. This is followed by a simulation study in §V. The paper is concluded with some final remarks in §VI. Appendices contain the proofs of the technical results in the paper.
II Stochastic Dynamic Games
Consider the (discrete-time) networked control system illustrated in Figure 1, where $x_t$ is the state of the system at time $t$, $u_t^i$ is the input generated by controller $i$ at time $t$, and $w_t$ is the random disturbance input at time $t$.
Suppose that each controller $i$ is an autonomous decision maker (DM) interested in minimizing its own expected long-term cost, which accumulates the cost $c^i(x_t, u_t^1, \dots, u_t^N)$ incurred by controller $i$ at each time $t$. Here the expectation $E[\cdot]$ is taken given a collection of control policies (which will be specified later in the paper) on a probability space $(\Omega, \mathcal{F}, P)$. Although controller $i$ can only choose its own decisions $u_t^i$, its cost generally depends on the decisions of all controllers through its single-stage cost as well as the state dynamics. This dynamic coupling between self-interested DMs with long-term objectives naturally leads to the framework of stochastic games [1], which generalize Markov decision problems.
Over the past half-century, there have been many applications of stochastic games to control problems; see Chapter XIV in [15] as an early reference. At the present time, the control theory literature includes a large number of papers employing the theory of stochastic games and their continuous-time counterparts called “differential games” [16]. Many papers in this body of work study a zero-sum game between a controller which aims to optimize the system performance and an adversary which controls certain system parameters and inputs to make the system performance as poor as possible. We selectively cite [17] for robust control and minimax estimation problems, [18] for flow control in queueing networks, [19] for control of hybrid systems, and [20] for robustness, security, and resilience of cyber-physical control systems. The case of nonzero-sum games in which the decision makers do not always have diametrically opposed objectives has also received significant attention; see for example [21] on admission, service, and routing control in queueing systems, [22] on transmission control in cognitive radio systems, [23] on network security, and [24] on formation control.
We should also mention the work on team decision problems where all DMs share a common long-term objective albeit with access to different information variables; see e.g., [25, 26]. In this paper, differently from the usual team decision problems in the literature, even though each DM has access to the state information, it has no access to global information on the other DMs and is not even aware of their presence. We also note that the emergence of distributed control systems requires the formulation of “team problems” within a game-theoretic framework where local controllers are tasked to achieve one system-level objective without centralized coordination; see for example [27] on distributed model predictive control. Team problems of this type, and their generalizations where the objectives of DMs are aligned in some sense with a team objective, are the primary focus of our work, though the class of games considered in this paper is more general and even includes some zero-sum stochastic games.
II-A Discounted Stochastic Dynamic Games
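A (finite) discounted stochastic game has the following ingredients; see [1].

A finite set of $N$ DMs, with the $i$th DM referred to as DM$^i$ for $i = 1, \dots, N$

a finite set $\mathbb{X}$ of states

a finite set $\mathbb{U}^i$ of control decisions for each DM$^i$

a cost function $c^i$ for each DM$^i$ determining DM$^i$’s cost $c^i(x, u^1, \dots, u^N)$ at each state $x \in \mathbb{X}$ and for each joint decision $(u^1, \dots, u^N) \in \mathbb{U}^1 \times \cdots \times \mathbb{U}^N$

a discount factor $\beta^i \in (0,1)$ for each DM$^i$

a random initial state $x_0 \in \mathbb{X}$

a transition kernel for the probability $P[x' \mid x, u^1, \dots, u^N]$ of each state transition from $x \in \mathbb{X}$ to $x' \in \mathbb{X}$ for each joint decision $(u^1, \dots, u^N)$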
Such a stochastic game induces a discrete-time controlled Markov process where the state at time $t$ is denoted by $x_t \in \mathbb{X}$, starting with the initial state $x_0$. At any time $t \ge 0$, each DM$^i$ makes a control decision $u_t^i \in \mathbb{U}^i$ (possibly randomly) based on the available information. The state $x_t$ and the joint decisions $(u_t^1, \dots, u_t^N)$ together determine each DM’s cost $c^i(x_t, u_t^1, \dots, u_t^N)$ at time $t$ as well as the probability distribution $P[\,\cdot \mid x_t, u_t^1, \dots, u_t^N]$ with which the next state $x_{t+1}$ is selected.
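As an illustration of the ingredients listed above, the following minimal Python sketch stores a finite discounted stochastic game and samples one transition of the induced controlled Markov process. The class name `StochasticGame`, its field names, and the toy two-state team game are hypothetical choices made for this sketch only; they are not part of the paper.

```python
import random
from dataclasses import dataclass
from itertools import product

@dataclass
class StochasticGame:
    n_states: int      # states are 0, ..., n_states - 1
    n_actions: tuple   # n_actions[i]: number of decisions available to DM i
    beta: tuple        # beta[i]: discount factor of DM i
    cost: dict         # cost[(x, u)]: tuple of per-DM costs for joint decision u
    kernel: dict       # kernel[(x, u)]: list of next-state probabilities

    def step(self, x, u, rng):
        """Sample one transition: return (per-DM costs, next state)."""
        x_next = rng.choices(range(self.n_states), weights=self.kernel[(x, u)])[0]
        return self.cost[(x, u)], x_next

def toy_team_game():
    """Hypothetical two-state, two-DM team: both DMs pay 0 only if both match the state."""
    cost, kernel = {}, {}
    for x, u1, u2 in product(range(2), repeat=3):
        c = 0.0 if u1 == u2 == x else 1.0
        cost[(x, (u1, u2))] = (c, c)            # identical costs: a team problem
        p_stay = 0.9 if u1 == u2 else 0.5       # assumed state dynamics
        kernel[(x, (u1, u2))] = [p_stay if y == x else 1.0 - p_stay for y in range(2)]
    return StochasticGame(2, (2, 2), (0.8, 0.8), cost, kernel)
```

Calling `toy_team_game().step(0, (0, 0), random.Random(0))` returns the cost pair for that state and joint decision together with a sampled next state.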
A policy for a DM is a rule for choosing an appropriate control decision at any time based on the DM’s history of observations. We will focus on stationary policies, where DM$^i$’s decision $u_t^i$ at time $t$ is determined solely based on the state $x_t$. Such policies for each DM$^i$ are identified by mappings from the state space $\mathbb{X}$ to the set $\mathcal{P}(\mathbb{U}^i)$ of probability distributions on $\mathbb{U}^i$. The interpretation is that a DM$^i$ using such a policy $\pi^i$ makes its decision $u_t^i$ at any time $t$ by choosing randomly from $\mathbb{U}^i$ according to $\pi^i(x_t)$. We will denote the set of such policies by $\Delta^i$ for each DM$^i$. We will primarily be interested in deterministic (stationary) policies (when it is not clear from the context, a “policy” will mean a deterministic policy), denoted by $\Pi^i$ for each DM$^i$, where each policy $\pi^i \in \Pi^i$ is identified by a mapping from $\mathbb{X}$ to $\mathbb{U}^i$.
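The objective of each DM$^i$ is to find a policy $\pi^i \in \Delta^i$ that minimizes its expected discounted cost
$$J_x^i(\pi^1, \dots, \pi^N) = E_x^{\pi^1, \dots, \pi^N}\Big[ \sum_{t \ge 0} (\beta^i)^t c^i(x_t, u_t^1, \dots, u_t^N) \Big] \tag{1}$$
for all $x \in \mathbb{X}$, where $E_x^{\pi^1, \dots, \pi^N}$ denotes the conditional expectation given $x_0 = x$ under the policies $\pi^1, \dots, \pi^N$. Since DMs have possibly different cost functions and each DM’s cost may depend on the control decisions of the other DMs, we adopt the notion of equilibrium to represent those policies that are person-by-person optimal. For ease of notation, we denote the policies of all DMs other than DM$^i$ by $\pi^{-i}$. For future reference, we also define $\Delta := \Delta^1 \times \cdots \times \Delta^N$ and $\Delta^{-i} := \times_{j \neq i} \Delta^j$ as well as $\Pi := \Pi^1 \times \cdots \times \Pi^N$ and $\Pi^{-i} := \times_{j \neq i} \Pi^j$. Using this notation, we write a joint policy $(\pi^1, \dots, \pi^N)$ as $(\pi^i, \pi^{-i})$ and $J_x^i(\pi^1, \dots, \pi^N)$ as $J_x^i(\pi^i, \pi^{-i})$.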
Definition 1
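A joint policy $\pi^* = (\pi^{*1}, \dots, \pi^{*N}) \in \Delta$ constitutes a (Markov perfect) equilibrium if, for all $i$ and all $x \in \mathbb{X}$,
$$J_x^i(\pi^{*i}, \pi^{*-i}) = \min_{\pi^i \in \Delta^i} J_x^i(\pi^i, \pi^{*-i}).$$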
It is known that any finite discounted stochastic game possesses an equilibrium policy as defined above [28].
Although the minimum above can always be achieved by a deterministic policy in $\Pi^i$ (since each DM’s problem is a stationary Markov decision problem when the policies of the other DMs are fixed at $\pi^{*-i}$), a deterministic equilibrium policy may not exist in general. However, many interesting classes of games do possess an equilibrium in deterministic policies. In particular, large classes of games arising from applications where all DMs benefit from cooperation possess an equilibrium in deterministic policies. The primary examples of such games of cooperation are team problems where all DMs have the same cost function. In team problems, the deterministic policies minimizing the common cost function are clearly equilibrium policies, although non-optimal deterministic equilibrium policies may also exist. A more general set of games of cooperation are those in which some function, called the potential function, decreases whenever a single DM decreases its own cost by unilaterally switching from one deterministic policy to another one. In this class of games, the deterministic policies minimizing the potential function are equilibrium policies. As such, we are primarily interested in the set of deterministic equilibrium policies, denoted by $\Pi_{\mathrm{eq}}$, where $\Pi_{\mathrm{eq}} \subset \Pi$.
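The remark that a team problem can have non-optimal deterministic equilibria can be checked by brute force on a tiny example. The sketch below is only an illustration under stated assumptions: it approximates each DM's discounted cost by repeated sweeps of the policy evaluation recursion, and it declares a deterministic joint policy an equilibrium when no unilateral deterministic deviation strictly lowers the deviating DM's cost at any initial state. The function names and the single-state team game are hypothetical.

```python
from itertools import product

def evaluate(joint, cost_i, kernel, beta, n_states, sweeps=500):
    """Approximate J_x for every initial state x under a deterministic joint policy.
    joint[j][x] is the decision of DM j in state x."""
    J = [0.0] * n_states
    for _ in range(sweeps):
        J_new = []
        for x in range(n_states):
            u = tuple(pol[x] for pol in joint)
            J_new.append(cost_i(x, u) + beta * sum(p * J[y] for y, p in enumerate(kernel(x, u))))
        J = J_new
    return J

def deterministic_equilibria(costs, kernel, betas, n_states, n_actions, tol=1e-9):
    """Brute-force search: keep joint policies that no single DM can strictly improve on."""
    policies = [list(product(range(m), repeat=n_states)) for m in n_actions]
    found = []
    for joint in product(*policies):
        stable = True
        for i in range(len(n_actions)):
            J = evaluate(joint, costs[i], kernel, betas[i], n_states)
            for dev in policies[i]:
                alt = joint[:i] + (dev,) + joint[i + 1:]
                J_alt = evaluate(alt, costs[i], kernel, betas[i], n_states)
                if any(a < b - tol for a, b in zip(J_alt, J)):
                    stable = False
        if stable:
            found.append(joint)
    return found

# Hypothetical single-state team: cost 0 at (0, 0), 1 at (1, 1), 2 on a mismatch.
def team_cost(x, u):
    return {(0, 0): 0.0, (1, 1): 1.0}.get(u, 2.0)

equilibria = deterministic_equilibria(
    costs=[team_cost, team_cost], kernel=lambda x, u: [1.0],
    betas=[0.5, 0.5], n_states=1, n_actions=[2, 2])
```

In this toy team, both joint decisions on the diagonal are equilibria, although only (0, 0) minimizes the common cost. The enumeration is exponential in the number of states and DMs, so it is only meant for very small games.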
We next formally introduce the set of games considered in this paper.
II-B Weakly Acyclic Games
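Let $\Pi^i_{\pi^{-i}}$ denote DM$^i$’s set of (deterministic) best replies to any $\pi^{-i} \in \Delta^{-i}$, i.e.,
$$\Pi^i_{\pi^{-i}} := \Big\{ \hat\pi^i \in \Pi^i : J_x^i(\hat\pi^i, \pi^{-i}) = \min_{\pi^i \in \Delta^i} J_x^i(\pi^i, \pi^{-i}), \ \text{for all } x \in \mathbb{X} \Big\}.$$
DM$^i$’s best replies to any $\pi^{-i}$ can be characterized by its optimal Q-factors $Q^i_{\pi^{-i}}$ satisfying the fixed point equation
$$Q^i_{\pi^{-i}}(x, u^i) = E_{\pi^{-i}(x)}\Big[ c^i(x, u^i, u^{-i}) + \beta^i \sum_{x' \in \mathbb{X}} P[x' \mid x, u^i, u^{-i}] \min_{v^i \in \mathbb{U}^i} Q^i_{\pi^{-i}}(x', v^i) \Big] \tag{2}$$
for all $(x, u^i) \in \mathbb{X} \times \mathbb{U}^i$, where $E_{\pi^{-i}(x)}$ denotes the expectation with respect to the joint distribution of $u^{-i}$ given by $\pi^{-i}(x)$. The optimal Q-factor $Q^i_{\pi^{-i}}(x, u^i)$ represents DM$^i$’s expected discounted cost to go from the initial state $x$, assuming that DM$^i$ initially chooses $u^i$ and uses an optimal policy thereafter while the other DMs use $\pi^{-i}$. One can then write $\Pi^i_{\pi^{-i}}$ as
$$\Pi^i_{\pi^{-i}} = \Big\{ \hat\pi^i \in \Pi^i : Q^i_{\pi^{-i}}(x, \hat\pi^i(x)) = \min_{v^i \in \mathbb{U}^i} Q^i_{\pi^{-i}}(x, v^i), \ \text{for all } x \in \mathbb{X} \Big\}.$$
The set of (deterministic) joint best replies is denoted by $\Pi_\pi := \Pi^1_{\pi^{-1}} \times \cdots \times \Pi^N_{\pi^{-N}}$. Any best reply $\hat\pi^i \in \Pi^i_{\pi^{-i}}$ of DM$^i$ is called a strict best reply with respect to $(\pi^i, \pi^{-i})$ if
$$J_x^i(\hat\pi^i, \pi^{-i}) < J_x^i(\pi^i, \pi^{-i}), \quad \text{for some } x \in \mathbb{X}.$$
Such a strict best reply achieves DM$^i$’s minimum cost given $\pi^{-i}$ for all initial states and results in a strict improvement over $\pi^i$ for at least one initial state.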
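When the transition probabilities and the other DMs' policies are known, the optimal Q-factors can be approximated by value iteration on the fixed point equation, and the best replies read off state by state. The sketch below makes simplifying assumptions that are not in the paper: the other DMs' joint policy is deterministic (so the expectation over their decisions is trivial), and `cost` and `kernel` are hypothetical callables supplied by the user.

```python
def optimal_q_factors(n_states, n_own, cost, kernel, beta, other_policy, sweeps=1000):
    """Value iteration for one DM facing a fixed deterministic policy of the others.
    cost(x, u, v) and kernel(x, u, v) take the state x, the DM's own decision u,
    and the others' decision v = other_policy[x]."""
    Q = [[0.0] * n_own for _ in range(n_states)]
    for _ in range(sweeps):
        V = [min(row) for row in Q]   # optimal cost-to-go implied by the current Q
        Q = [[cost(x, u, other_policy[x])
              + beta * sum(p * V[y] for y, p in enumerate(kernel(x, u, other_policy[x])))
              for u in range(n_own)]
             for x in range(n_states)]
    return Q

def best_replies(Q, tol=1e-9):
    """For each state, the decisions attaining the minimal Q-factor (up to tol)."""
    return [[u for u, q in enumerate(row) if q <= min(row) + tol] for row in Q]

# Hypothetical single-state coordination example: cost 0 on a match, 1 otherwise.
Q_example = optimal_q_factors(
    n_states=1, n_own=2,
    cost=lambda x, u, v: 0.0 if u == v else 1.0,
    kernel=lambda x, u, v: [1.0],
    beta=0.5, other_policy=[1])
```

A deterministic policy is then a best reply exactly when it picks, in every state, one of the decisions returned by `best_replies`.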
Definition 2
We call a (possibly finite) sequence of deterministic joint policies $\pi_0, \pi_1, \dots$ a strict best reply path if, for each $k$, $\pi_k$ and $\pi_{k+1}$ differ in exactly one DM position, say DM$^i$, and $\pi^i_{k+1}$ is a strict best reply with respect to $\pi_k$.
Definition 3
A discounted stochastic game is called weakly acyclic under strict best replies if there is a strict best reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.
Figure 2 shows the strict best reply graph of a game where the nodes represent the deterministic joint policies and the directed edges represent the single-DM strict best replies. Each deterministic equilibrium policy is represented by a sink, i.e., a node with no outgoing edges, in such a graph. Note that the game illustrated in Figure 2 is weakly acyclic under strict best replies since there is a path from every node to a sink. Note also that a weakly acyclic game may have cycles in its strict best reply graph, as the graph in Figure 2 does.
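Weak acyclicity is a reachability property of the strict best reply graph, so it can be tested mechanically once the graph is available. The following sketch assumes the graph is given as a dictionary from each node to its list of successors; the node labels in the two small examples are made up and do not correspond to Figure 2.

```python
def is_weakly_acyclic(edges):
    """edges[n] lists the nodes reachable from n by one single-DM strict best reply.
    Returns True if every node has a directed path to some sink."""
    reach = {n for n, out in edges.items() if not out}   # sinks = equilibria
    changed = True
    while changed:                                       # backward reachability
        changed = False
        for n, out in edges.items():
            if n not in reach and any(m in reach for m in out):
                reach.add(n)
                changed = True
    return reach == set(edges)

# A cycle a <-> b with an exit edge b -> c to the sink c: weakly acyclic.
with_exit = {"a": ["b"], "b": ["a", "c"], "c": []}
# A cycle with no exit and no sink: not weakly acyclic.
pure_cycle = {"a": ["b"], "b": ["a"]}
```

The first example shows that cycles are allowed as long as some edge leaves each cycle toward a sink.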
Weakly acyclic games constitute a fairly large class of games. In the case of single-stage games, all potential games as well as dominance solvable games are examples of weakly acyclic games; see [29]. We note that the concept of weak acyclicity introduced in this paper is with respect to the stationary Markov policies for stochastic games, and constitutes a generalization of weak acyclicity introduced in [30] for single-stage games. The primary examples of weakly acyclic games in the case of stochastic games are the team problems with finite state and control sets where DMs have identical cost functions and discount factors. Clearly, many other classes of stochastic games are weakly acyclic, e.g., appropriate multi-stage generalizations of potential games and dominance solvable games restricted to the stationary Markov policies are weakly acyclic for the same reason that the single-stage versions of these games are weakly acyclic [29].
II-C A Best Reply Process for Weakly Acyclic Games
Consider a policy adjustment process in which only one DM updates its policy at each step by switching to one of its strict best replies. Such a process would terminate at an equilibrium policy if the game has no cycles in its strict best reply graph and the process continues until no DM has strict best replies. A weakly acyclic game may contain cycles in its strict best reply graph, but there must be some edges leaving each cycle because otherwise there would not be a path from each node to a sink. Therefore, as long as each updating DM considers each of its strict best replies with positive probability, the adjustment process would terminate at an equilibrium policy in a weakly acyclic game with probability (w.p.) one. This adjustment process would require a criterion to determine the updating DM at each step, and the DMs would have to agree a priori on this criterion. An equilibrium policy can be reached through a similar adjustment process without a pre-game agreement on the selection of the updating DM, if all DMs update their policies at each step but with some inertia. Consider now the following policy adjustment process, which is the best reply process with memory length of one and inertia introduced in Sections 6.4-6.5 of [30].
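Best Reply Process with Inertia (for DM$^i$):
Set parameters
  $\lambda^i \in (0,1)$: inertia
Initialize $\pi_0^i \in \Pi^i$ (arbitrary)
Iterate $k \ge 0$
  If $\pi_k^i \in \Pi^i_{\pi_k^{-i}}$
    $\pi_{k+1}^i = \pi_k^i$
  Else
    $\pi_{k+1}^i = \pi_k^i$ w.p. $\lambda^i$, and $\pi_{k+1}^i = $ any $\pi^i \in \Pi^i_{\pi_k^{-i}}$ w.p. $(1 - \lambda^i) / |\Pi^i_{\pi_k^{-i}}|$
  End
End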
On the one hand, if the joint policy $\pi_k$ is an equilibrium policy at any step $k$, then the policies will never change in the subsequent steps. On the other hand, regardless of what the joint policy $\pi_k$ is at any step $k$, the joint policy $\pi_{k+L}$, $L$ steps later, will be an equilibrium policy with probability at least $p_{\min} > 0$, where $L$ is the maximum length of the shortest strict best reply path from any policy to an equilibrium policy, and $p_{\min}$ depends only on the inertias $\lambda^1, \dots, \lambda^N$ and $L$. This readily implies that the best reply process with inertia will reach an equilibrium policy in a finite number of steps w.p. 1 [30], i.e.,
$$P\big[\pi_k = \pi^* \text{ for all sufficiently large } k, \text{ for some deterministic equilibrium policy } \pi^*\big] = 1.$$
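The effect of inertia is easiest to see in a single-stage game, where a policy is just a decision. The sketch below simulates the best reply process with inertia in that simplified setting, which is an assumption of this illustration: all DMs update simultaneously, a DM that is already best replying keeps its decision, and otherwise it keeps its decision with probability equal to its inertia and switches to a uniformly chosen best reply with the remaining probability. The cost functions and parameter values are hypothetical.

```python
import random

def best_reply_process(cost, n_actions, start, inertia, steps, rng):
    """Single-stage sketch: cost[i](u) is DM i's cost of the joint decision u (a tuple)."""
    u = tuple(start)
    for _ in range(steps):
        new = list(u)
        for i, m in enumerate(n_actions):
            vals = [cost[i](u[:i] + (a,) + u[i + 1:]) for a in range(m)]
            best = [a for a in range(m) if vals[a] == min(vals)]
            # Switch only if not already best replying, and only w.p. 1 - inertia[i].
            if u[i] not in best and rng.random() >= inertia[i]:
                new[i] = rng.choice(best)
        u = tuple(new)   # all DMs update simultaneously, based on the old joint decision
    return u

# Hypothetical two-DM coordination game: cost 0 on a match, 1 on a mismatch.
coord = lambda u: 0.0 if u[0] == u[1] else 1.0
final = best_reply_process([coord, coord], (2, 2), start=(0, 1),
                           inertia=(0.5, 0.5), steps=200, rng=random.Random(0))
```

Without inertia, both DMs would switch at every step from a mismatch and miscoordinate forever; with inertia, exactly one of them switches with positive probability at each step, after which the joint decision is an equilibrium and never changes.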
We now note that each updating DM$^i$ at step $k$ needs to compute its best replies $\Pi^i_{\pi_k^{-i}}$, which can be done by first solving the fixed point equation (2) for $\pi^{-i} = \pi_k^{-i}$. DM$^i$ can solve (2), for example through value iterations, provided that DM$^i$ knows the state transition probabilities and the policies of the other DMs to evaluate the expectations in (2). In most realistic situations, DMs would not have access to such information and therefore would not be able to compute their best replies directly. In the next section, we introduce our learning paradigm in which DMs are able to learn their near best replies with minimal information and adjust their policies (approximately) along the strict best reply paths as in the best reply process with inertia.
III Q-Learning in Stochastic Dynamic Games
III-A Learning Paradigm for Stochastic Dynamic Games
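The learning setup involves specifying the information that DMs have access to. We assume that each DM$^i$ knows its own set of decisions $\mathbb{U}^i$ and its own discount factor $\beta^i$. In addition, before choosing its decision $u_t^i$ at any time $t$, each DM$^i$ has the knowledge of

its own past decisions $u_0^i, \dots, u_{t-1}^i$,

past and current state realizations $x_0, \dots, x_t$, and

its own past cost realizations $c^i(x_0, u_0), \dots, c^i(x_{t-1}, u_{t-1})$, where $u_s = (u_s^1, \dots, u_s^N)$ denotes the joint decision at time $s$.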
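Each DM has access to no other information such as the state transition probabilities or any information regarding the other DMs (not even the existence of the other DMs). In effect, the problem of decision making from the perspective of each DM appears to be a stationary Markov decision problem. It is reasonable that each DM with this view of its environment would use the standard Q-learning algorithm [2] to learn its optimal Q-factors and its optimal decisions. This would lead to the following Q-learning dynamics for each DM$^i$:
$$Q^i_{t+1}(x_t, u^i_t) = (1 - \alpha^i_t)\, Q^i_t(x_t, u^i_t) + \alpha^i_t \Big[ c^i(x_t, u_t) + \beta^i \min_{v^i \in \mathbb{U}^i} Q^i_t(x_{t+1}, v^i) \Big],$$
with $Q^i_{t+1}(x, u^i) = Q^i_t(x, u^i)$ for all $(x, u^i) \neq (x_t, u^i_t)$, where $u_t$ is the joint decision at time $t$ and $\alpha^i_t \in [0,1]$ denotes DM$^i$’s step size at time $t$.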
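A single step of the standard Q-learning recursion for cost minimization can be written in a few lines. The sketch below stores the Q-factors of one DM as a list of rows (one row per state); the function name and the numerical values in the example are illustrative assumptions.

```python
def q_update(Q, x, u, c, x_next, alpha, beta):
    """One Q-learning step for cost minimization, performed in place.
    Only the visited pair (x, u) changes; all other entries are left untouched."""
    target = c + beta * min(Q[x_next])          # observed cost plus discounted best cost-to-go
    Q[x][u] = (1.0 - alpha) * Q[x][u] + alpha * target
    return Q

# Example: two states, two decisions; the DM observed cost 1.0 and moved from state 0 to state 1.
Q_demo = q_update([[0.0, 0.0], [2.0, 4.0]], x=0, u=1, c=1.0, x_next=1, alpha=0.5, beta=0.5)
```

Note that the update uses only the DM's own decision, the observed states, and the realized cost, which is exactly the information available in the learning paradigm described above.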
If only one DM, say DM$^i$, were to use Q-learning and the other DMs used constant policies $\pi^{-i}$, then DM$^i$ would asymptotically learn its corresponding optimal Q-factors, i.e.,
$$P\big[ Q^i_t \to Q^i_{\pi^{-i}} \big] = 1,$$
provided that all state-control pairs $(x, u^i)$ are visited infinitely often and the step sizes are reduced at a proper rate. This follows from the well-known convergence of Q-learning in a stationary environment; see [31]. To exploit the learnt Q-factors while maintaining exploration, the actual decisions are often selected with very high probability as
$$u^i_t \in \arg\min_{v^i \in \mathbb{U}^i} Q^i_t(x_t, v^i),$$
and with some small probability any decision in $\mathbb{U}^i$ is experimented with. One common way of achieving this for DM$^i$ is to select any decision $u^i \in \mathbb{U}^i$ randomly according to (Boltzmann action selection)
$$P\big[ u^i_t = u^i \mid \mathcal{F}_t \big] = \frac{e^{-Q^i_t(x_t, u^i)/\tau}}{\sum_{v^i \in \mathbb{U}^i} e^{-Q^i_t(x_t, v^i)/\tau}},$$
where $\tau > 0$ is a small constant called the temperature parameter, and $\mathcal{F}_t$ is the history of the random events realized up to the point just before the selection of $u^i_t$.
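Boltzmann action selection for a cost-minimizing DM assigns higher probability to decisions with lower Q-factors, and concentrates on the minimizers as the temperature decreases. The sketch below is a minimal illustration; subtracting the row minimum before exponentiating is a numerical-stability choice of this sketch and does not change the resulting probabilities.

```python
import math
import random

def boltzmann_probs(q_row, tau):
    """Selection probabilities proportional to exp(-Q/tau) for one state's Q-factors."""
    m = min(q_row)                                   # shift for numerical stability
    weights = [math.exp(-(q - m) / tau) for q in q_row]
    total = sum(weights)
    return [w / total for w in weights]

def boltzmann_select(q_row, tau, rng):
    """Draw one decision index according to the Boltzmann probabilities."""
    return rng.choices(range(len(q_row)), weights=boltzmann_probs(q_row, tau))[0]
```

For instance, with Q-factors `[0.0, 1.0]` the first decision is chosen almost always at temperature 0.01, while at temperature 100 the two decisions are chosen with nearly equal probability.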
However, when all DMs use Q-learning and select their decisions as described above, the environment is non-stationary for all DMs, and there is no reason to expect convergence in that case. In fact, one can construct examples where DMs using Q-learning are caught up in persistent oscillations; see Section 4 in [32] for the non-convergence of Q-learning in Shapley’s game. However, the convergence of Q-learning may still be possible in team problems, coordination-type games, or more generally in weakly acyclic games. It is instructive to first consider the repeated games.
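Here, there is no state dynamics (the set of states is a singleton) and the DMs have no look-ahead ($\beta^i = 0$ for all $i$). The only dynamics in this case is due to Q-learning, which reduces to the averaging dynamics
$$Q^i_{t+1}(u^i) = Q^i_t(u^i), \quad \text{for all } u^i \neq u^i_t \tag{3}$$
$$Q^i_{t+1}(u^i_t) = (1 - \alpha^i_t)\, Q^i_t(u^i_t) + \alpha^i_t\, c^i(u_t) \tag{4}$$
where
$$P\big[ u^i_t = u^i \mid \mathcal{F}_t \big] = \frac{e^{-Q^i_t(u^i)/\tau}}{\sum_{v^i \in \mathbb{U}^i} e^{-Q^i_t(v^i)/\tau}}. \tag{5}$$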
The long-term behavior of these averaging dynamics is analyzed in [32] and strongly connected to the long-term behavior of the well-known Stochastic Fictitious Play (SFP) dynamics [33] in the case of two DMs; see Lemma 4.1 in [32]. In two-DM SFP, each DM$^i$ tracks the empirical frequencies of the past decisions of its opponent DM$^{-i}$ and chooses a nearly optimal response (with some experimentation) based on the incorrect assumption that DM$^{-i}$ will choose its decisions according to the empirical frequencies of its past decisions
where is the indicator function and
Using the connection between the Q-learning dynamics (4)-(5) and the SFP dynamics, the convergence of Q-learning (4)-(5) is established in zero-sum games as well as in partnership games with two DMs; see Proposition 4.2 in [32]. It may be possible to extend this convergence result to multi-DM potential games [34, 35], but this is currently unresolved. However, given the non-convergence of FP (where DMs choose exact optimal responses with no experimentation) in some coordination games [36], the prospect of establishing the convergence of Q-learning even in all two-DM weakly acyclic games does not seem promising.
It is possible to employ additional features such as the truncation of the observation history or multi-time-scale learning to obtain learning dynamics that are convergent in all repeated weakly acyclic games; see our own previous work [37] and the others [38, 30, 39, 40]. However, learning an equilibrium policy in stochastic games remains an open question. The only relevant reference considering stochastic games is [11], where each DM uses value learning coupled with policy search at a slower time scale. The results in [11] apply to all stochastic games and therefore they are necessarily quite weak. Loosely speaking, the main result in [11] shows that the limit points of certain empirical measures (weighted with the step sizes) in the policy space constitute “generalized Nash equilibria”, which in particular does not imply convergence of learning to an equilibrium policy. In the next subsection, we propose a simple variation of Q-learning which converges to an equilibrium policy in all weakly acyclic stochastic games.
III-B Q-Learning in Stochastic Dynamic Games
The discussion in the previous subsection reveals that the standard Q-learning (4)-(5) can lead to robust oscillations even in repeated coordination games. The main obstacle to the convergence of Q-learning in games is the presence of multiple active learners, which leads to a non-stationary environment for all learners. To overcome this obstacle, we draw inspiration from our previous work [37] on repeated games and modify Q-learning for stochastic games as follows. In our variation of Q-learning, we allow DMs to use constant policies for extended periods of time called exploration phases.
As illustrated in Figure 3, the $k$th exploration phase runs through times $t = t_k, t_k + 1, \dots, t_{k+1} - 1$, where
$$t_{k+1} = t_k + T_k \quad (\text{with } t_0 = 0)$$
for some integer $T_k \in [1, \infty)$ denoting the length of the $k$th exploration phase. During the $k$th exploration phase, DMs use some constant policies $\pi_k^1, \dots, \pi_k^N$ as their baseline policies with occasional experimentation. The essence of the main idea is to create a stationary environment over each exploration phase so that DMs can accurately learn their optimal Q-factors corresponding to the constant policies used during each exploration phase. Before arguing why this would lead to an equilibrium policy in all weakly acyclic stochastic games, let us introduce our variation of Q-learning more precisely.
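The bookkeeping of exploration phases amounts to a running sum of the phase lengths. The short sketch below assumes that time starts at zero and that each phase begins where the previous one ends; the function name and the sample lengths are illustrative.

```python
from itertools import accumulate

def phase_starts(lengths):
    """Start times of consecutive exploration phases, beginning at time 0.
    The k-th phase covers the times range(starts[k], starts[k + 1])."""
    return [0] + list(accumulate(lengths))

starts = phase_starts([3, 5, 2])   # three phases of lengths 3, 5, and 2
```

With these lengths, the phases cover the times 0 to 2, 3 to 7, and 8 to 9, and policies are only revised at the boundaries between phases.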
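Algorithm 1 (for DM$^i$)
Set parameters
  $\mathbb{Q}^i$: some compact subset of the Euclidean space $\mathbb{R}^{|\mathbb{X} \times \mathbb{U}^i|}$, where $|\mathbb{X} \times \mathbb{U}^i|$ is the number of pairs $(x, u^i)$
  $\{T_k\}_{k \ge 0}$: sequence of integers in $[1, \infty)$
  $\rho^i \in (0,1)$: experimentation probability
  $\lambda^i \in (0,1)$: inertia
  $\delta^i \in (0, \infty)$: tolerance level for sub-optimality
  $\{\alpha^i_n\}_{n \ge 1}$: sequence of step sizes where $\alpha^i_n \in [0,1]$, $\sum_n \alpha^i_n = \infty$, $\sum_n (\alpha^i_n)^2 < \infty$ (e.g., $\alpha^i_n = 1/n^r$ where $r \in (1/2, 1]$)
Initialize $\pi_0^i \in \Pi^i$ (arbitrary), $Q_0^i \in \mathbb{Q}^i$ (arbitrary)
Receive $x_0$
Iterate $k \ge 0$
  ($k$th exploration phase)
  Iterate $t = t_k, \dots, t_{k+1} - 1$
    $u_t^i = \pi_k^i(x_t)$ w.p. $1 - \rho^i$, and $u_t^i = $ any $u^i \in \mathbb{U}^i$ w.p. $\rho^i / |\mathbb{U}^i|$
    Receive $c^i(x_t, u_t)$
    Receive $x_{t+1}$ (selected according to $P[\,\cdot \mid x_t, u_t]$)
    $n_t^i = $ the number of visits to $(x_t, u_t^i)$ in the $k$th exploration phase up to $t$
    $Q^i_{t+1}(x_t, u_t^i) = (1 - \alpha^i_{n_t^i})\, Q^i_t(x_t, u_t^i) + \alpha^i_{n_t^i} \big[ c^i(x_t, u_t) + \beta^i \min_{v^i} Q^i_t(x_{t+1}, v^i) \big]$
    $Q^i_{t+1}(x, u^i) = Q^i_t(x, u^i)$, for all $(x, u^i) \neq (x_t, u_t^i)$
  End
  $\Pi^i_{k+1} = \big\{ \hat\pi^i \in \Pi^i : Q^i_{t_{k+1}}(x, \hat\pi^i(x)) \le \min_{v^i} Q^i_{t_{k+1}}(x, v^i) + \delta^i, \text{ for all } x \big\}$
  If $\pi_k^i \in \Pi^i_{k+1}$
    $\pi_{k+1}^i = \pi_k^i$
  Else
    $\pi_{k+1}^i = \pi_k^i$ w.p. $\lambda^i$, and $\pi_{k+1}^i = $ any $\pi^i \in \Pi^i_{k+1}$ w.p. $(1 - \lambda^i) / |\Pi^i_{k+1}|$
  End
  Reset $Q^i_{t_{k+1}}$ to any $Q^i \in \mathbb{Q}^i$ (e.g., project $Q^i_{t_{k+1}}$ onto $\mathbb{Q}^i$)
End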
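To make the structure of Algorithm 1 concrete, the following Python sketch simulates all DMs in one loop while keeping each DM's update decentralized: DM i only uses the state, its own decision, and its own realized cost. Several details are assumptions of this sketch rather than statements about the authors' implementation: the environment interface `step(x, u)`, uniform experimentation with probability `rho`, step sizes of the form 1/n^r, resetting the Q-factors to zero at the start of each phase, and the toy team game used in the usage lines.

```python
import random
from itertools import product

def near_best_replies(Q, delta):
    """Deterministic policies whose Q-factor is within delta of the minimum in every state."""
    per_state = [[u for u, q in enumerate(row) if q <= min(row) + delta] for row in Q]
    return list(product(*per_state))

def run_algorithm1(step, n_states, n_actions, betas, phases, T, rho, lam, delta, r, rng):
    """step(x, u) -> (per-DM costs, next state) is the only access to the game."""
    N = len(n_actions)
    policy = [tuple(rng.randrange(m) for _ in range(n_states)) for m in n_actions]
    x = 0
    for _ in range(phases):
        # Reset Q-factors and visit counts at the start of each exploration phase.
        Q = [[[0.0] * m for _ in range(n_states)] for m in n_actions]
        visits = [[[0] * m for _ in range(n_states)] for m in n_actions]
        for _ in range(T):
            # Baseline decision w.p. 1 - rho, uniform experimentation otherwise.
            u = tuple(policy[i][x] if rng.random() > rho else rng.randrange(n_actions[i])
                      for i in range(N))
            costs, x_next = step(x, u)
            for i in range(N):   # each DM updates from its own decision and cost only
                visits[i][x][u[i]] += 1
                a = 1.0 / visits[i][x][u[i]] ** r
                Q[i][x][u[i]] = (1 - a) * Q[i][x][u[i]] + a * (costs[i] + betas[i] * min(Q[i][x_next]))
            x = x_next
        for i in range(N):       # policy update with inertia at the end of the phase
            replies = near_best_replies(Q[i], delta)
            if policy[i] not in replies and rng.random() >= lam:
                policy[i] = rng.choice(replies)
    return policy

# Hypothetical toy team: both DMs pay 0 only if both decisions match the state.
env_rng = random.Random(1)
def toy_step(x, u):
    c = 0.0 if u[0] == u[1] == x else 1.0
    return (c, c), env_rng.randrange(2)

learned = run_algorithm1(toy_step, 2, (2, 2), (0.8, 0.8), phases=5, T=200,
                         rho=0.1, lam=0.5, delta=0.05, r=0.6, rng=random.Random(0))
```

The short phases used here are for illustration only; the convergence guarantees discussed in the text concern sufficiently long exploration phases and sufficiently small experimentation probabilities and tolerance levels.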
Algorithm 1 mimics the best reply process with inertia in §II-C arbitrarily closely with arbitrarily high probability under certain conditions. The key difference here is that each DM using Algorithm 1 approximately learns its optimal Q-factors during each exploration phase with limited observations. Accordingly, each DM updates its (baseline) policy to one of its near best replies with inertia based on its learnt Q-factors. Hence, Algorithm 1 can be regarded as an approximation to the best reply process with inertia in §II-C; see [41] where best replies are obtained based on rewards that must be estimated using noisy observations.
Assumption 1
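For all $(x', x) \in \mathbb{X} \times \mathbb{X}$, there exists a finite integer $H \ge 0$ and joint decisions $\tilde u_0, \dots, \tilde u_H$ such that
$$P\big[ x_{H+1} = x' \mid (x_0, u_0, \dots, u_H) = (x, \tilde u_0, \dots, \tilde u_H) \big] > 0.$$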
Assumption 1 ensures that the step sizes satisfy the well-known conditions of stochastic approximation theory [31] during each exploration phase.
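As a worked check of the standard stochastic approximation conditions, consider step sizes of the polynomial form $\alpha_n = 1/n^r$, a common choice that we assume here for illustration. The two conditions (non-summable step sizes, square-summable step sizes) hold exactly when $r \in (1/2, 1]$, by comparison with the $p$-series:

```latex
\sum_{n \ge 1} \alpha_n = \sum_{n \ge 1} \frac{1}{n^{r}} = \infty
\quad \Longleftrightarrow \quad r \le 1,
\qquad\qquad
\sum_{n \ge 1} \alpha_n^2 = \sum_{n \ge 1} \frac{1}{n^{2r}} < \infty
\quad \Longleftrightarrow \quad 2r > 1 .
```

Because the step-size index counts visits to a state-control pair within a phase, these conditions are only meaningful if every pair keeps being visited, which is where a reachability condition on the state dynamics together with persistent experimentation comes in.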
Assumption 2
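For all $i$, the tolerance level satisfies $\delta^i \in (0, \bar\delta)$ and the experimentation probability satisfies $\rho^i \in (0, \bar\rho)$, where $\bar\delta$ and $\bar\rho$ (which depend only on the parameters of the game at hand) are defined in Appendix B.
Assumption 2 requires that the tolerance levels for sub-optimality used in the computation of near best replies as well as the experimentation probabilities be non-zero but sufficiently small.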
Theorem 1
Proof:
See Appendix B.
Let us discuss the main idea behind this result. Since all DMs use constant policies throughout any particular exploration phase, each DM indeed faces a stationary Markov decision problem in each exploration phase. Therefore, if the length of each exploration phase is long enough and the experimentation probabilities are small enough (but non-zero), each DM can learn its corresponding optimal Q-factors in each exploration phase with arbitrary accuracy with arbitrarily high probability. This allows each DM to accurately compute its near best replies to the other DMs’ policies at the end of the $k$th exploration phase. Intuitively, allowing each DM$^i$ to update its policy to its near best replies (to $\pi_k^{-i}$) at the end of the $k$th exploration phase with some inertia results in a policy adjustment process that approximates the best reply process with inertia in §II-C.
Remark 1
One may also wish to find explicit lower bounds on the exploration phase lengths $T_k$ to achieve almost sure convergence based on the convergence rates of standard Q-learning with a single DM; we refer the reader to [42] for bounds on the convergence rates for standard Q-learning.
IV Generalizations
IV-A Learning in Weakly Acyclic Games under Strict Better Replies
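We present another Q-learning algorithm with provable convergence to equilibrium in discounted stochastic games that are weakly acyclic under strict better replies. For this, we first introduce the notion of weak acyclicity under strict better replies. Given any $\pi = (\pi^i, \pi^{-i}) \in \Pi$, let $\bar\Pi^i_\pi$ denote DM$^i$’s set of (deterministic) better replies with respect to $\pi$, i.e.,
$$\bar\Pi^i_\pi := \big\{ \hat\pi^i \in \Pi^i : J_x^i(\hat\pi^i, \pi^{-i}) \le J_x^i(\pi^i, \pi^{-i}), \ \text{for all } x \in \mathbb{X} \big\}.$$
Any better reply $\hat\pi^i \in \bar\Pi^i_\pi$ of DM$^i$ is called a strict better reply (with respect to $\pi$) if
$$J_x^i(\hat\pi^i, \pi^{-i}) < J_x^i(\pi^i, \pi^{-i}), \quad \text{for some } x \in \mathbb{X}.$$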
Definition 4
We call a (possibly finite) sequence of deterministic joint policies $\pi_0, \pi_1, \dots$ a strict better reply path if, for each $k$, $\pi_k$ and $\pi_{k+1}$ differ in exactly one DM position, say DM$^i$, and $\pi^i_{k+1}$ is a strict better reply with respect to $\pi_k$.
Definition 5
A discounted stochastic game is called weakly acyclic under strict better replies if there is a strict better reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.
Since every strict best reply path is also a strict better reply path, the set of games weakly acyclic under strict better replies contains (in fact, strictly) the set of games weakly acyclic under strict best replies.
It is straightforward to introduce a policy adjustment process analogous to the one in §II-C where, at each step, each DM switches to one of its strict better replies with some inertia; see Sections 6.4-6.5 in [30]. Such a process would clearly converge to an equilibrium in games that are weakly acyclic under strict better replies. We next introduce a learning algorithm which allows each DM to learn the Q-factors corresponding to two policies, a baseline policy and a randomly selected experimental policy, during each exploration phase. If the learnt Q-factors indicate that the experimental policy is better than the baseline policy within a certain tolerance level, then the baseline policy is updated to the experimental policy with some inertia at the end of each exploration phase. This learning algorithm enables DMs to adjust their policies with much less information (as in §III-A), and to follow (approximately) the strict better reply paths that the adjustment process follows.
Algorithm 2 (for DM)
Set parameters as in Algorithm 1
Initialize (arbitrary except ), (arbitrary)
Receive
Iterate
(th exploration phase)
Iterate
Receive
Receive (selected according to )
the number of visits to in the th
exploration phase up to
, for all
, for all
End
If
and
Else
End
any policy with equal
probability
Reset , to any
End
Since any policy except the baseline policy can be chosen as an experimental policy (with equal probability), each DM can switch to any of its strict better replies with positive probability. In contrast, each DM using Algorithm 1 can only switch to one of its strict best replies. As a result, each DM using Algorithm 2 can escape a strict best reply cycle by switching to a strict better reply (if one exists), whereas a DM using Algorithm 1 cannot. This flexibility comes at the cost of running two Q-learning recursions, one for the baseline policy and the other for the experimental policy, instead of one. However, this flexibility also leads to convergent behavior in a strictly larger set of games. We cite [43] as a reference to an earlier use of the idea of comparing two strategies and selecting one according to the Boltzmann distribution.
The counterpart of Theorem 1 can be obtained for Algorithm 2 in games that are weakly acyclic under strict better replies.
Assumption 3
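For all $i$, the tolerance level satisfies $\delta^i \in (0, \bar\delta)$ and the experimentation probability satisfies $\rho^i \in (0, \bar\rho)$, where $\bar\delta$ and $\bar\rho$ (which depend only on the parameters of the game at hand) are defined in Appendix C.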
Theorem 2
Proof:
See Appendix C.
IV-B Learning in Weakly Acyclic Games under Multi-DM Strict Best or Better Replies
The notion of weak acyclicity can be generalized by allowing multiple DMs to simultaneously update their policies in a strict best or better reply path.
Definition 6
We call a (possibly finite) sequence of deterministic joint policies a multi-DM strict best (better) reply path if, for each , and differ for at least one DM and, for each deviating DM, is a strict best (better) reply with respect to .
Definition 7
A discounted stochastic game is called weakly acyclic under multi-DM strict best (better) replies if there is a multi-DM strict best (better) reply path starting from each deterministic joint policy and ending at a deterministic equilibrium policy.
This generalization leads to a strictly larger set of games that are weakly acyclic. To see this, consider a single-stage game characterized by the cost matrices in Figure 4 where DM chooses a row, DM chooses a column, and DM chooses a matrix, simultaneously.
Assume . There is no strict best (or better) reply path to an equilibrium from the joint decisions , , , , , , if only a single DM can update its decision at a time. Therefore, this game is not weakly acyclic under strict best (or better) replies in the sense of Definition 3 (or Definition 5). However, if multiple DMs are allowed to switch to their strict best (or better) replies simultaneously, then it becomes possible to reach the equilibrium from any joint decision. For example, if DM and DM switch to their strict best (or better) replies simultaneously from the joint decision , then the resulting joint decision would be . This would subsequently lead to the equilibrium if DM switches to its strict best (or better) reply from .
All learning algorithms introduced in the paper allow multiple DMs to simultaneously update their policies with positive probability. In view of this, it is straightforward to see that our main convergence results, Theorem 1 (Theorem 2), hold in games that are weakly acyclic under multi-DM strict best (better) replies.
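The gap between the two notions can be checked by brute force on a small single-stage game. The sketch below is illustrative only: the three-DM, two-action cost tables are hypothetical and constructed for this example (they are not the matrices of Figure 4), and the function names are our own. The joint decision (1, 1, 1) is the unique equilibrium of this hypothetical game, but it is not strict, so no single-DM strict best reply move ever enters it.

```python
from itertools import product


def strict_best_replies(costs, actions, joint, i):
    """Cost-minimizing decisions of DM i against joint[-i] that strictly beat joint[i]."""
    def cost_of(a):
        return costs[i][joint[:i] + (a,) + joint[i + 1:]]

    best = min(cost_of(a) for a in actions[i])
    if best >= cost_of(joint[i]):
        return []  # DM i cannot strictly improve
    return [a for a in actions[i] if cost_of(a) == best]


def successors(costs, actions, joint, multi):
    """Joint decisions reachable in one step of a (multi-DM) strict best reply path."""
    options = [[joint[i]] + strict_best_replies(costs, actions, joint, i)
               for i in range(len(actions))]
    result = set()
    for nxt in product(*options):
        deviators = sum(1 for a, b in zip(nxt, joint) if a != b)
        if deviators == 1 or (multi and deviators > 1):
            result.add(nxt)
    return result


def is_weakly_acyclic(costs, actions, multi):
    """True if every joint decision has a path ending at an equilibrium."""
    joints = list(product(*actions))
    # Equilibria: no DM has a strict best reply.
    reach = {j for j in joints if not successors(costs, actions, j, multi=False)}
    grew = True
    while grew:
        grew = False
        for j in joints:
            if j not in reach and successors(costs, actions, j, multi) & reach:
                reach.add(j)
                grew = True
    return len(reach) == len(joints)


# Hypothetical cost tables, listed in the order of product((0, 1), repeat=3).
actions = [(0, 1), (0, 1), (0, 1)]
order = list(product((0, 1), repeat=3))
costs = [
    dict(zip(order, (0, 1, 0, 0, 1, 0, 1, 0))),  # DM 1
    dict(zip(order, (0, 1, 1, 0, 0, 0, 0, 0))),  # DM 2
    dict(zip(order, (1, 0, 0, 1, 0, 1, 0, 0))),  # DM 3
]

single = is_weakly_acyclic(costs, actions, multi=False)
multi = is_weakly_acyclic(costs, actions, multi=True)
```

In this hypothetical game, DM 1 and DM 2 both have strict best replies at (0, 0, 1), and only their simultaneous switch leads to the equilibrium.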
V A Simulation Study: Prisoner’s Dilemma with a State
We consider a discounted stochastic game with two DMs where . Each DM’s utility (to be maximized) at each time depends only on the joint decisions of both DMs as
We assume . The state evolves as
where and .
The single-stage game corresponds to the well-known prisoner’s dilemma where the th prisoner (DM) cooperates (defects) at time by choosing (). The single-stage game has a unique equilibrium , i.e., both DMs defect, leading to rewards . The dilemma is that both DMs would do strictly better if both cooperated, i.e., (which is not an equilibrium).
In the multi-stage game, the state indicates, w.p. , whether or not both DMs cooperated in the previous stage. It turns out that cooperation can be obtained as an equilibrium of the multi-stage game if the DMs are patient, i.e., the discount factors are sufficiently high, and the error probability is sufficiently small . Note that each DM has four different policies of the form . For large enough , and small enough , the multi-stage game has two (Markov perfect) equilibria. In one equilibrium, called the cooperation equilibrium, each DM cooperates if and defects otherwise. In the other equilibrium, called the defection equilibrium, both DMs always defect. Furthermore, from any joint policy in , there is a strict best reply path to one of these two equilibria, which implies that the multi-stage game is weakly acyclic under strict best replies.
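The equilibrium structure described above can be reproduced numerically by enumerating the four stationary policies of each DM. The sketch below is a minimal illustration under assumed values: the rewards (3, 0, 5, 1), the common discount factor 0.9, the error probability 0.05, and the encoding of the state (1 when the signal says both DMs cooperated, 0 otherwise) are hypothetical and are not the parameters used in the simulations.

```python
from itertools import product

COOPERATE, DEFECT = 0, 1
# Assumed rewards (hypothetical values): (own decision, other's decision) -> own reward.
REWARD = {
    (COOPERATE, COOPERATE): 3.0,
    (COOPERATE, DEFECT): 0.0,
    (DEFECT, COOPERATE): 5.0,
    (DEFECT, DEFECT): 1.0,
}
BETA = 0.9   # assumed common discount factor
EPS = 0.05   # assumed probability that the state misreports the last joint decision
STATES = (0, 1)

# A stationary policy is a pair (decision at state 0, decision at state 1).
POLICIES = list(product((COOPERATE, DEFECT), repeat=2))


def evaluate(own, other, sweeps=500):
    """Discounted reward of `own` against `other`, from each initial state."""
    v = [0.0, 0.0]
    for _ in range(sweeps):
        new_v = []
        for x in STATES:
            a, b = own[x], other[x]
            both = a == COOPERATE and b == COOPERATE
            p_one = 1.0 - EPS if both else EPS  # probability that the next state is 1
            new_v.append(REWARD[(a, b)] + BETA * (p_one * v[1] + (1.0 - p_one) * v[0]))
        v = new_v
    return v


def is_best_reply(own, other, tol=1e-9):
    """True if no stationary deviation improves the reward from any state."""
    base = evaluate(own, other)
    for alt in POLICIES:
        alt_value = evaluate(alt, other)
        if any(alt_value[x] > base[x] + tol for x in STATES):
            return False
    return True


def equilibria():
    return [(p1, p2) for p1, p2 in product(POLICIES, repeat=2)
            if is_best_reply(p1, p2) and is_best_reply(p2, p1)]


found = equilibria()
```

Under these assumed values, the enumeration finds exactly two equilibria: the one in which each DM cooperates only at state 1, and the one in which both DMs always defect.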
We set , , , . We simulate Algorithm 1 with the following parameter values: , , , , for all . We keep the lengths of the exploration phases constant, i.e., , for all . We consider different values for since the lengths of the exploration phases appear to be most critical for the behavior of the learning process. For each value of , we run Algorithm 1 and the best reply process with inertia (in §II-C) in parallel, with policy updates starting from each of the initial joint policies in . We initialize all the learnt Q-factors at for each simulation run; however, we do not reset the learnt Q-factors at the end of any exploration phase during any simulation run. We let and denote the policies generated by Algorithm 1 and the best reply process with inertia in §II-C, respectively. For each value of , Table I shows the fraction of times at which visits an equilibrium and the fraction of times at which agrees with , during the policy updates (averaged uniformly over all initial policies in ).
The results in Table I reveal that, as increases, visits an equilibrium and agrees with more often. This is consistent with Theorem 1 since DMs are expected to learn their Q-factors more accurately with higher probability for larger values of . When is sufficiently large, the policies are at equilibrium and agrees with nearly all of the time regardless of the initial policy. In a typical simulation run (with a large enough ), the policies and transition to an equilibrium in a few steps and stay at equilibrium thereafter.
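The two statistics reported for each run can be computed from the recorded policy sequences as in the following sketch. The function name and the representation of joint policies as hashable labels are hypothetical; averaging over initial policies would be done by the caller.

```python
def visit_and_agreement_fractions(learned, reference, equilibria):
    """Fractions of policy updates at which the learned joint policy is an
    equilibrium, and at which it coincides with the reference joint policy."""
    if not learned or len(learned) != len(reference):
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(learned)
    at_equilibrium = sum(1 for p in learned if p in equilibria) / n
    agreement = sum(1 for p, q in zip(learned, reference) if p == q) / n
    return at_equilibrium, agreement


# Toy sequences of joint-policy labels; "e" stands for an equilibrium.
fractions = visit_and_agreement_fractions(
    ["a", "b", "e", "e"], ["a", "e", "e", "e"], {"e"})
```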
