Payoff-based Inhomogeneous Partially Irrational Play for Potential Game Theoretic Cooperative Control of Multi-agent Systems

Tatsuhiko Goto, Takeshi Hatanaka, and Masayuki Fujita. Tatsuhiko Goto is with Toshiba Corporation. Takeshi Hatanaka (corresponding author) and Masayuki Fujita are with the Department of Mechanical and Control Engineering, Tokyo Institute of Technology, Tokyo 152-8550, JAPAN, hatanaka@ctrl.titech.ac.jp, fujita@ctrl.titech.ac.jp
Abstract

This paper addresses a class of strategic games called potential games and develops a novel learning algorithm, Payoff-based Inhomogeneous Partially Irrational Play (PIPIP). The present algorithm is based on Distributed Inhomogeneous Synchronous Learning (DISL) presented in an existing work but, unlike DISL, PIPIP allows agents to make irrational decisions with a specified probability, i.e., an agent may choose the action with the lower utility among the past actions stored in its memory. Thanks to these irrational decisions, we can prove convergence in probability of the collective actions to potential function maximizers. Finally, we demonstrate the effectiveness of the present algorithm through experiments on a sensor coverage problem. The demonstration reveals that the present learning algorithm successfully leads agents to a neighborhood of the potential function maximizers even in the presence of undesirable Nash equilibria. We also see through an experiment with a moving density function that PIPIP adapts to environmental changes.

potential game, learning algorithm, cooperative control, multi-agent system

I Introduction

Cooperative control of multi-agent systems basically aims at designing local interactions of agents in order to meet some global objective of the group [1, 2]. Depending on the scenario, it is also required that agents achieve the global objective under imperfect prior knowledge of the environment while adapting to network and environmental changes. Nevertheless, conventional cooperative control schemes do not always embody such functions. For example, in sensor deployment or coverage, most control schemes, as in [3, 4, 5], assume prior knowledge of a density function defined over the mission space and hence are hardly applicable to missions over unknown surroundings. A game theoretic framework as in [6] holds tremendous potential for overcoming this drawback of the conventional schemes.

A game theoretic approach to cooperative control formulates the problems as non-cooperative games and identifies the objective of cooperative control with arrival at some specific Nash equilibria [6, 7, 8]. In particular, it is shown by J. Marden et al. [6] that a variety of cooperative control problems are related to so-called potential games [9]. Unlike other classes of games, potential games offer a design perspective consisting of two kinds of design problems: utility design and learning algorithm design [10]. The objective of utility design is to align the local utility functions to be maximized by each agent so that the resulting game constitutes a potential game, where the literature [11, 12] provides general design methodologies. The learning algorithm design determines the action selection rules of the agents so that their actions converge to Nash equilibria.

In this paper, we focus on the learning algorithm design for cooperative control of multi-agent systems. A lot of learning algorithms have been established in the game theory literature, and recently some algorithms have also been developed, mainly by J. Marden and his collaborators. These algorithms are classified into several categories depending on their features.

The first issue is whether an algorithm presumes a finite or infinite memory. For example, Fictitious Play (FP) [13], Regret Matching (RM) [14], Joint Strategy Fictitious Play (JSFP) with Inertia [15] and Regret-Based Dynamics [16] require an infinite memory for executing the algorithms. Meanwhile, Adaptive Play (AP) [17], Better Reply Process with Finite Memory and Inertia [18], (Restrictive) Spatial Adaptive Play ((R)SAP) [19, 6], Payoff-based Dynamics (PD) [20], the Payoff-based version of Log-Linear Learning (PLLL) [21] and Distributed Inhomogeneous Synchronous Learning (DISL) [7] require only a finite memory. Of course, the finite memory algorithms are preferable for practical applications.

The second issue is what information is necessary for executing the learning algorithms. For example, FP presumes that all the information on the other agents' actions is available, which strongly restricts its applications. On the other hand, RM, JSFP with Inertia and (R)SAP assume availability of a so-called virtual payoff, i.e., the utility which would be obtained if an agent chose a given action. Moreover, PD, PLLL and DISL utilize only the actual payoffs obtained after taking actions, which has the potential to overcome the aforementioned drawback of the sensor coverage schemes [7].

The main objective of standard game theory is to compute Nash equilibria, and hence most of the above algorithms, except for [6, 21], assure only convergence to pure Nash equilibria. However, in most cooperative control problems this is insufficient for achieving the global objective, and selection of the most efficient equilibria is required [21]. In this paper, we thus deal with convergence of the actions to the Nash equilibria maximizing the potential function, which are called optimal Nash equilibria in this paper, since in many cooperative control problems the potential function is designed so that its maximizers coincide with the action profiles achieving the global objectives.

The primary contribution of this paper is to develop a novel learning algorithm called Payoff-based Inhomogeneous Partially Irrational Play (PIPIP). The learning algorithm is based on DISL presented in [7] and inherits its several desirable features: (i) the algorithm requires only a small finite memory, (ii) the algorithm is payoff-based, (iii) the algorithm allows agents to choose actions in a synchronous fashion at each iteration, (iv) the action selection procedure in PIPIP consists of simple rules, and (v) the algorithm is capable of dealing with constraints on action selection. The main difference of PIPIP from DISL is that it allows agents to make irrational decisions with a certain probability, which gives agents opportunities to escape from undesirable Nash equilibria. Thanks to the irrational decisions, PIPIP assures that the actions of the group converge in probability to the optimal Nash equilibria, whereas only convergence to a pure Nash equilibrium is proved in [7]. Meanwhile, some learning algorithms dealing with convergence to the optimal Nash equilibria have been presented in [6, 21], and we also mention the advantages of PIPIP over these learning algorithms in the following. RSAP [6] guarantees convergence of the distribution of actions to a stationary distribution such that the probability of staying at the optimal Nash equilibria can be specified arbitrarily by a design parameter. However, RSAP is neither synchronous nor payoff-based (it relies on virtual payoffs), and hence its applications are restricted. PLLL [21] also allows irrational and exploratory decisions similarly to PIPIP, and the resulting conclusion is almost comparable with that of this paper. However, in [21], how to handle the action constraints is not explicitly shown and convergence in probability to the optimal Nash equilibria is not proved in a strict sense.

The secondary contribution of this paper is to demonstrate the effectiveness of the present learning algorithm through experiments on a sensor coverage problem, where the learning algorithm is applied to a robotic system compensated by local controllers and logics. Such investigations have not been sufficiently addressed in the existing works. Here, we mainly examine the performance of the learning algorithm over finite time and its adaptability to environmental changes. To address the former issue, we place obstacles in the mission space to generate apparent undesirable Nash equilibria and then compare the performance of PIPIP with that of DISL. The results support our claim that this paper is not a minor extension of [7] and contains a significant contribution from a practical point of view. We next demonstrate the adaptability by employing a moving density function defined over the mission space. Though adaptation to a time-varying density is in principle expected for payoff-based algorithms, its demonstration has not been addressed in previous works. We see from the results that the desirable group behavior, i.e., tracking of the moving high-density region, is achieved by PIPIP even in the absence of any knowledge of the density.

This paper is organized as follows: In Section II, we give the terminology and basic results necessary for stating the results of this paper. In Section III, we present the learning algorithm PIPIP and state the main result associated with the algorithm, i.e., convergence in probability to the optimal Nash equilibria. Then, Section IV gives the proof of the main result. In Section V, we demonstrate the effectiveness of PIPIP through experiments on a sensor coverage problem. Finally, Section VI draws conclusions.

II Preliminary

II-A Constrained Potential Games

In this paper, we consider a constrained strategic game $\Gamma = (\mathcal{V}, \mathcal{A}, \{U_i\}_{i \in \mathcal{V}}, \{\mathcal{C}_i\}_{i \in \mathcal{V}})$. Here, $\mathcal{V} = \{1, \dots, n\}$ is the set of agents' unique identifiers. The set $\mathcal{A}$ is called a collective action set and is defined as $\mathcal{A} = \mathcal{A}_1 \times \cdots \times \mathcal{A}_n$, where $\mathcal{A}_i$ is the set of actions which agent $i$ can take. The function $U_i : \mathcal{A} \to \mathbb{R}$ is a so-called utility function of agent $i$, and each agent basically behaves so as to maximize this function. The function $\mathcal{C}_i : \mathcal{A}_i \to 2^{\mathcal{A}_i}$ provides a so-called constrained action set: $\mathcal{C}_i(a_i)$ is the set of actions which agent $i$ will be able to take in case he currently takes action $a_i$. Namely, at each iteration $t$, each agent $i$ chooses an action $a_i(t+1)$ from the set $\mathcal{C}_i(a_i(t))$.

Throughout this paper, we denote the collection of actions other than agent $i$'s by

$$a_{-i} = (a_1, \dots, a_{i-1}, a_{i+1}, \dots, a_n).$$

Then, the joint action $a$ is described as $a = (a_i, a_{-i})$. Let us now make the following assumptions.

Assumption 1

The function $\mathcal{C}_i$ satisfies the following three conditions.

  • (Reversibility [6]) For any $i \in \mathcal{V}$ and any actions $a_i, a_i' \in \mathcal{A}_i$, the inclusion $a_i' \in \mathcal{C}_i(a_i)$ is equivalent to $a_i \in \mathcal{C}_i(a_i')$.

  • (Feasibility [6]) For any $i \in \mathcal{V}$ and any actions $a_i, a_i' \in \mathcal{A}_i$, there exists a sequence of actions $a_i = a_i^0, a_i^1, \dots, a_i^m = a_i'$ satisfying $a_i^k \in \mathcal{C}_i(a_i^{k-1})$ for all $k \in \{1, \dots, m\}$.

  • For any $i \in \mathcal{V}$ and any action $a_i \in \mathcal{A}_i$, the number of available actions in $\mathcal{C}_i(a_i)$ is greater than or equal to a prescribed lower bound.

Assumption 2

For any $i \in \mathcal{V}$ and any $a, a' \in \mathcal{A}$ satisfying $a_{-i} = a'_{-i}$ and $a_i \neq a'_i$, the inequality $|U_i(a') - U_i(a)| < \bar{u}$ holds true for all $i \in \mathcal{V}$, where $\bar{u} > 0$ is a prescribed constant.

Assumption 2 means that when only one agent changes his action, the difference in his utility should be smaller than the prescribed bound. This assumption can be satisfied by simply scaling all agents' utility functions appropriately.
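As a minimal sketch of such a rescaling (the exact bound in Assumption 2 is not reproduced in this text, so a generic placeholder u_bar is assumed), all utility tables can be shrunk by a common factor; since every utility and the potential are multiplied by the same positive constant, the potential game structure introduced below is preserved.

```python
def scale_utilities(utilities, u_bar):
    """Shrink all agents' utility tables by a common factor so that every
    difference of utility values stays strictly below u_bar (a placeholder
    for the unspecified bound in Assumption 2)."""
    max_diff = max(abs(u - v)
                   for table in utilities.values()
                   for u in table.values() for v in table.values())
    c = 0.9 * u_bar / max(max_diff, 1e-12)   # common scaling factor
    return {i: {a: c * u for a, u in table.items()} for i, table in utilities.items()}

# utilities[i][a] is agent i's utility for the joint action a (a tuple).
utilities = {0: {(0, 0): 5.0, (0, 1): 1.0}, 1: {(0, 0): 3.0, (0, 1): 4.0}}
print(scale_utilities(utilities, u_bar=0.5))
```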

Let us now introduce the potential games under consideration in this paper.

Definition 1 (Constrained Potential Games [6, 7])

A constrained strategic game $\Gamma$ is said to be a constrained potential game with potential function $\phi : \mathcal{A} \to \mathbb{R}$ if, for all $i \in \mathcal{V}$, every $a_i \in \mathcal{A}_i$ and every $a_i' \in \mathcal{A}_i$, the following equation holds for every $a_{-i}$:

$$U_i(a_i', a_{-i}) - U_i(a_i, a_{-i}) = \phi(a_i', a_{-i}) - \phi(a_i, a_{-i}). \qquad (1)$$

Throughout this paper, we suppose that the potential function is designed so that its maximizers coincide with the joint actions achieving the global objective of the group. In this situation, (1) implies that if an agent changes his action, the change in his local utility function is equal to the change in the group objective function.

We next define the Nash equilibria as below.

Definition 2 (Constrained Nash Equilibria)

For a constrained strategic game $\Gamma$, a collection of actions $a^* \in \mathcal{A}$ is said to be a constrained pure Nash equilibrium if the following condition holds for all $i \in \mathcal{V}$:

$$U_i(a_i^*, a_{-i}^*) \geq U_i(a_i, a_{-i}^*) \quad \forall a_i \in \mathcal{C}_i(a_i^*). \qquad (2)$$

It is known [7, 9] that any constrained potential game has at least one pure Nash equilibrium and, in particular, any potential function maximizer is a Nash equilibrium, which is called an optimal Nash equilibrium in this paper. However, there may also exist undesirable pure Nash equilibria that do not maximize the potential function. In order to reach the optimal Nash equilibria while avoiding the undesirable ones, we have to appropriately design a learning algorithm which determines how each agent selects an action at each iteration.
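To make these notions concrete, the following toy example (not taken from the paper) builds an identical-interest game, which is automatically an exact potential game with $\phi = U_i$, checks the identity (1), and enumerates the pure Nash equilibria; two of them are suboptimal in the sense that they do not maximize the potential.

```python
# Two agents each pick a position in {0, 1, 2}; both receive the potential
# value phi(a) (identical interest), so (1) holds with U_i = phi.
# The numbers are illustrative only.
positions = [0, 1, 2]
phi = {(0, 0): 2.0, (0, 1): 0.0, (0, 2): 0.0,
       (1, 0): 0.0, (1, 1): 1.0, (1, 2): 0.0,
       (2, 0): 0.0, (2, 1): 0.0, (2, 2): 3.0}

def U(i, a):                       # identical-interest utilities
    return phi[a]

def deviations(i, a):              # joint actions reachable by agent i alone
    return [(x, a[1]) if i == 0 else (a[0], x) for x in positions]

# The potential game identity (1): a unilateral change alters U_i and phi equally.
assert all(U(i, b) - U(i, a) == phi[b] - phi[a]
           for a in phi for i in (0, 1) for b in deviations(i, a))

# Pure Nash equilibria (constraints ignored here, i.e. C_i is the whole set).
nash = [a for a in phi
        if all(U(i, a) >= U(i, b) for i in (0, 1) for b in deviations(i, a))]
print("Nash equilibria:", nash)                            # [(0, 0), (1, 1), (2, 2)]
print("optimal Nash equilibrium:", max(phi, key=phi.get))  # (2, 2)
```

PIPIP is designed precisely so that the agents are not trapped at suboptimal equilibria such as (0, 0) and (1, 1) in this example.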

II-B Resistance Tree

Let us consider a Markov process $P^0$ defined over a finite state space $\mathcal{Z}$. A perturbation of $P^0$ is a Markov process whose transition probabilities are slightly perturbed relative to $P^0$. Specifically, a perturbed Markov process $P^\epsilon$, $\epsilon > 0$, is a process whose transitions follow $P^0$ with probability $1 - \epsilon$ and deviate from $P^0$ with probability $\epsilon$. Then, we introduce the notion of a regular perturbation as below.

Definition 3 (Regular Perturbation [19])

A family of stochastic processes $\{P^\epsilon\}_{\epsilon > 0}$ is called a regular perturbation of $P^0$ if the following conditions are satisfied:

(A1)

For some $\bar{\epsilon} > 0$, the process $P^\epsilon$ is irreducible and aperiodic for all $\epsilon \in (0, \bar{\epsilon}]$.

(A2)

Let us denote by $P^\epsilon_{z \to z'}$ the transition probability from $z$ to $z'$ under the Markov process $P^\epsilon$. Then, $\lim_{\epsilon \to 0^+} P^\epsilon_{z \to z'} = P^0_{z \to z'}$ holds for all $z, z' \in \mathcal{Z}$.

(A3)

If $P^\epsilon_{z \to z'} > 0$ for some $\epsilon$, then there exists a real number $r(z \to z') \geq 0$ such that

$$0 < \lim_{\epsilon \to 0^+} \frac{P^\epsilon_{z \to z'}}{\epsilon^{r(z \to z')}} < \infty, \qquad (3)$$

where $r(z \to z')$ is called the resistance of the transition from $z$ to $z'$.

Remark that, from (A1), if $P^\epsilon$ is a regular perturbation of $P^0$, then $P^\epsilon$ has a unique stationary distribution $\mu^\epsilon$ for each $\epsilon \in (0, \bar{\epsilon}]$.

We next introduce the resistance of a path $\zeta : z^0 \to z^1 \to \cdots \to z^m$ from $z = z^0$ to $z' = z^m$ as the value $r(\zeta)$ satisfying

$$0 < \lim_{\epsilon \to 0^+} \frac{P^\epsilon(\zeta)}{\epsilon^{r(\zeta)}} < \infty, \qquad (4)$$

where $P^\epsilon(\zeta)$ denotes the probability of the sequence of transitions along $\zeta$. Then, it is easy to confirm that $r(\zeta)$ is simply given by

$$r(\zeta) = \sum_{k=1}^{m} r(z^{k-1} \to z^k). \qquad (5)$$

A state $z$ is said to communicate with a state $z'$ if both $z \to z'$ and $z' \to z$ hold, where the notation $z \to z'$ means that $z'$ is accessible from $z$, i.e., a process starting at state $z$ has non-zero probability of transitioning into $z'$ at some point. A recurrent communication class is a class such that every pair of states in the class communicates with each other and no state outside the class is accessible from the class. Now, let $H_1, \dots, H_J$ be the recurrent communication classes of the Markov process $P^0$. Then, within each class, there is a path with zero resistance from every state to every other. In the case of a perturbed Markov process $P^\epsilon$, there may exist several paths from states in $H_j$ to states in $H_l$ for any two distinct recurrent communication classes $H_j$ and $H_l$. The minimal resistance among all such paths is denoted by $r_{jl}$.

Let us now define a weighted complete directed graph over the recurrent communication classes $H_1, \dots, H_J$, where the weight of the edge from $H_j$ to $H_l$ is equal to the minimal resistance $r_{jl}$. We next define a $j$-tree, which is a spanning tree over this graph whose edges are directed toward the root node $H_j$. We also denote by $T_j$ the set of all $j$-trees. The resistance of a $j$-tree is the sum of the weights on all the edges of the tree. The stochastic potential $\gamma_j$ of the recurrent communication class $H_j$ is the minimal resistance among all $j$-trees in $T_j$. We also introduce the notion of a stochastically stable state as below.
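As a computational illustration (the resistance values below are arbitrary, not from the paper), the stochastic potential of each recurrent class can be obtained by brute force over all trees rooted at that class:

```python
import itertools

def stochastic_potential(r, root):
    """Minimum, over all spanning trees of the complete digraph whose edges are
    directed toward `root`, of the summed edge resistances; r[j][l] is the
    minimal resistance of paths from class j to class l."""
    nodes = list(range(len(r)))
    others = [j for j in nodes if j != root]
    best = float("inf")
    # A rooted tree assigns every non-root node a parent such that following
    # parents always reaches the root (no cycles among the non-root nodes).
    for parents in itertools.product(nodes, repeat=len(others)):
        assign = dict(zip(others, parents))
        if any(j == p for j, p in assign.items()):
            continue                                  # a node cannot be its own parent
        ok = True
        for j in others:
            seen, node = set(), j
            while node != root:
                if node in seen:                      # cycle detected: not a tree
                    ok = False
                    break
                seen.add(node)
                node = assign[node]
            if not ok:
                break
        if ok:
            best = min(best, sum(r[j][assign[j]] for j in others))
    return best

# Three recurrent classes with illustrative pairwise resistances.
r = [[0, 1, 2],
     [2, 0, 1],
     [3, 3, 0]]
print([stochastic_potential(r, j) for j in range(3)])  # [4, 4, 2]
```

In this example, class 2 has the minimum stochastic potential, so by Proposition 1 below the stochastically stable states would be contained in it.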

Definition 4 (Stochastically Stable State [19])

A state $z \in \mathcal{Z}$ is said to be stochastically stable if it satisfies $\lim_{\epsilon \to 0^+} \mu^\epsilon_z > 0$, where $\mu^\epsilon_z$ is the element of the stationary distribution $\mu^\epsilon$ corresponding to state $z$.

Using the above terminology, we introduce the following well-known result, which connects the stochastically stable states and the stochastic potential.

Proposition 1

[19] Let $P^\epsilon$ be a regular perturbation of $P^0$. Then $\lim_{\epsilon \to 0^+} \mu^\epsilon$ exists and the limiting distribution $\mu^0$ is a stationary distribution of $P^0$. Moreover, the stochastically stable states are contained in the recurrent communication classes with minimum stochastic potential.

II-C Ergodicity

Discrete-time Markov processes can be divided into two types, time-homogeneous and time-inhomogeneous, where a Markov process is said to be time-homogeneous if its transition matrix, denoted by $P(t)$, is independent of the time $t$, and time-inhomogeneous if it is time-dependent. We also denote the matrix of state transition probabilities from time $t_1$ to time $t_2 > t_1$ by $P(t_1, t_2) = P(t_1) P(t_1 + 1) \cdots P(t_2 - 1)$.

For such Markov processes, we introduce the notion of ergodicity.

Definition 5 (Strong Ergodicity [23])

A Markov process is said to be strongly ergodic if there exists a stochastic vector $\mu^*$ such that for any distribution $\mu$ on $\mathcal{Z}$ and any time $t_1$, we have $\lim_{t_2 \to \infty} \mu^\top P(t_1, t_2) = (\mu^*)^\top$.

Definition 6 (Weak Ergodicity [23])

A Markov process is said to be weakly ergodic if the following equation holds:

$$\lim_{t_2 \to \infty} \bigl( P_{z z''}(t_1, t_2) - P_{z' z''}(t_1, t_2) \bigr) = 0 \quad \forall z, z', z'' \in \mathcal{Z}, \ \forall t_1,$$

where $P_{z z''}(t_1, t_2)$ denotes the $(z, z'')$ element of $P(t_1, t_2)$.

If a process is strongly ergodic, its distribution converges to the unique distribution $\mu^*$ from any initial state. Weak ergodicity implies that the information on the initial state vanishes as time increases, though convergence of the distribution may not be guaranteed. Note that the notions of weak and strong ergodicity are equivalent for time-homogeneous Markov processes.
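The loss-of-memory property behind weak ergodicity can be seen numerically; the following sketch (with illustrative transition matrices, not taken from the paper) multiplies time-varying transition matrices and checks that the rows of the product, i.e. the distributions started from different initial states, become nearly identical.

```python
import numpy as np

def P(t):
    """Time-varying 2-state transition matrix with a slowly vanishing
    'exploration' level (illustrative choice)."""
    eps = 1.0 / (t + 2)
    return np.array([[1 - eps, eps],
                     [eps / 2, 1 - eps / 2]])

prod = np.eye(2)
for t in range(1, 200):
    prod = prod @ P(t)                 # P(1, 200) = P(1) P(2) ... P(199)

# Row k of the product is the distribution at time 200 when starting in state k.
print(np.max(np.abs(prod[0] - prod[1])))   # ~1e-3: the initial state is forgotten
```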

We finally introduce the following well-known results on ergodicity.

Proposition 2

[23] A Markov process is strongly ergodic if the following conditions hold:

(B1)

The Markov process is weakly ergodic.

(B2)

For each $t$, there exists a stochastic vector $\mu(t)$ on $\mathcal{Z}$ such that $\mu(t)$ is the left eigenvector of the transition matrix $P(t)$ with eigenvalue 1.

(B3)

The eigenvectors $\mu(t)$ in (B2) satisfy $\sum_{t=1}^{\infty} \| \mu(t) - \mu(t+1) \|_1 < \infty$. Moreover, if $\mu^* = \lim_{t \to \infty} \mu(t)$, then $\mu^*$ coincides with the vector in Definition 5.

III Learning Algorithm and Main Result

In this section, we present a learning algorithm called Payoff-based Inhomogeneous Partially Irrational Play (PIPIP) and state the main result of this paper. At each iteration $t$, the learning algorithm chooses an action according to the following procedure, assuming that each agent $i$ stores the two previously chosen actions $a_i^1(t), a_i^2(t)$ and the corresponding outcomes $u_i^1(t), u_i^2(t)$. Each agent first updates a parameter $\epsilon(t)$, called the exploration rate, by

(6)

where the constant $K$ in (6) is defined in terms of $K_i$, the minimal number of steps required for transitioning between any two actions of agent $i$.

Then, each agent compares the values of $u_i^1(t)$ and $u_i^2(t)$. If $u_i^2(t) \geq u_i^1(t)$ holds (the more recently stored action performed at least as well), then he chooses the action $a_i(t+1)$ according to the rule:

  • $a_i(t+1)$ is randomly chosen from $\mathcal{C}_i(a_i^2(t)) \setminus \{a_i^2(t)\}$ with probability $\epsilon^K(t)$ (this is called an exploratory decision),

  • $a_i(t+1) = a_i^2(t)$ with probability $1 - \epsilon^K(t)$.

Otherwise ($u_i^2(t) < u_i^1(t)$), the action $a_i(t+1)$ is chosen according to the rule:

  • $a_i(t+1)$ is randomly chosen from $\mathcal{C}_i(a_i^2(t)) \setminus \{a_i^2(t)\}$ with probability $\epsilon^K(t)$ (an exploratory decision),

  • $a_i(t+1) = a_i^2(t)$, i.e. the stored action with the lower utility, with probability

    (7)

    (this is called an irrational decision),

  • $a_i(t+1) = a_i^1(t)$, i.e. the stored action with the higher utility, with probability

    (8)

The parameter $\kappa$ appearing in (7) should be chosen so as to satisfy

(9)

where $|\mathcal{C}_i(a_i)|$ is the number of elements of the set $\mathcal{C}_i(a_i)$. It is clear under the third item of Assumption 1 that the above action selection rule is well-defined.

Finally, each agent executes the selected action $a_i(t+1)$ and computes the resulting utility $u_i(t+1)$ via feedback from the environment and neighboring agents. At the next iteration, the agents repeat the same procedure.

The algorithm PIPIP is compactly described in Algorithm 1, where the function $\mathrm{rand}(\cdot)$ outputs an action chosen uniformly at random from the set given as its argument. Note that the algorithm with a constant exploration rate $\epsilon$ is called Payoff-based Homogeneous Partially Irrational Play (PHPIP), which will be used in the proof of the main result of this paper.

  Initialization: The action $a_i(1)$ is chosen randomly from $\mathcal{A}_i$, the stored actions and payoffs are initialized with it for all $i \in \mathcal{V}$, and $t$ is set to 1.
  Step 1: Update the exploration rate $\epsilon(t)$ according to (6).
  Step 2: If $u_i^2(t) \geq u_i^1(t)$, then set $a_i(t+1)$ according to the first rule above. Otherwise, set $a_i(t+1)$ according to the second rule above.
  Step 3: Execute the selected action $a_i(t+1)$ and receive $u_i(t+1)$.
  Step 4: Update the stored actions and payoffs with $a_i(t+1)$ and $u_i(t+1)$.
  Step 5: Set $t \leftarrow t + 1$ and go to Step 1.
Algorithm 1 Payoff-based Inhomogeneous Partially Irrational Play (PIPIP)
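To fix ideas, the following Python sketch mimics the selection rule for a single agent. It is not the authors' exact algorithm: the forms of (6)-(9) are not reproduced in this text, so the sketch assumes a representative decaying exploration rate $\epsilon(t) = t^{-1/(2K)}$ and an irrational-decision probability of the form $\kappa \epsilon^{\Delta}$, where $\Delta$ is the gap between the two stored payoffs.

```python
import random

def pipip_step(agent, t, kappa=0.25, K=1, eps=None):
    """One (sketched) PIPIP iteration for a single agent.  `agent` stores its two
    most recent (action, payoff) pairs and a `constrained` function mapping an
    action to the currently selectable actions.  The exploration rate and the
    irrational-decision probability below are assumed forms, not (6)-(8)."""
    if eps is None:
        eps = t ** (-1.0 / (2 * K))                   # assumed stand-in for (6)
    (a_old, u_old), (a_new, u_new) = agent["memory"]  # older / newer entries
    if random.random() < eps ** K:                    # exploratory decision
        candidates = [a for a in agent["constrained"](a_new) if a != a_new]
        return random.choice(candidates) if candidates else a_new
    if u_new >= u_old:                                # the newer action did no worse
        return a_new
    delta = u_old - u_new                             # payoff gap (kept small by Assumption 2)
    if random.random() < kappa * eps ** delta:        # irrational decision (stand-in for (7))
        return a_new                                  # keep the worse action anyway
    return a_old                                      # otherwise revert to the better one

def run(agent, utility, horizon=2000):
    """Play against a fixed payoff function (single-agent toy setting)."""
    for t in range(2, horizon):
        a = pipip_step(agent, t)
        agent["memory"] = [agent["memory"][1], (a, utility(a))]
    return agent["memory"][1][0]

# Toy usage: actions 0..4 on a line, moves restricted to neighbors; action 2 is a
# suboptimal local maximizer of the payoff, action 4 the global one.
agent = {"memory": [(0, 0.0), (0, 0.0)],
         "constrained": lambda a: [b for b in range(5) if abs(b - a) <= 1]}
print(run(agent, utility=lambda a: [0.10, 0.05, 0.20, 0.15, 0.45][a]))  # often 4
```

In a multi-agent coverage setting, every agent would run the same rule in parallel, with the payoff returned by the environment and neighboring agents playing the role of `utility`.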

PIPIP is developed based on the learning algorithm DISL presented in [7]. The main difference of PIPIP from DISL is that an agent may choose the action with the lower utility in Step 2, with a probability that depends on the difference between the last two steps' utilities and on the parameters $\epsilon$ and $\kappa$. Thanks to these irrational decisions, agents can escape from undesirable Nash equilibria, as will be proved in the next section.

We are now ready to state the main result of this paper. Before stating it, we define

(10)

and denote by $\mathcal{A}^*$ the set of the optimal Nash equilibria, i.e. potential function maximizers, of a constrained potential game $\Gamma$.

Theorem 1

Consider a constrained potential game $\Gamma$ satisfying Assumptions 1 and 2, and suppose that each agent behaves according to Algorithm 1. Then, the stored joint actions define a Markov process over the space $\mathcal{A} \times \mathcal{A}$ and the following equation is satisfied:

$$\lim_{t \to \infty} \mathrm{Pr}\left( a(t) \in \mathcal{A}^* \right) = 1, \qquad (11)$$

where $a(t) = (a_1(t), \dots, a_n(t))$ denotes the joint action selected at iteration $t$.

Equation (11) means that the probability that the agents executing PIPIP take one of the potential function maximizers converges to 1. The proof of this theorem is given in the next section.

In PIPIP, the parameter $\epsilon(t)$ is updated by (6) in order to prove the above theorem, which is the same as in DISL. However, with this update rule it takes a long time to reach a sufficiently small $\epsilon$ when the size of the game, i.e. the number of agents and actions, is large. Thus, from a practical point of view, we might have to decrease $\epsilon$ based on heuristics or use PHPIP with a sufficiently small constant $\epsilon$. Even in such cases, the following theorem at least holds, similarly to the paper [20].

Theorem 2

Consider a constrained potential game $\Gamma$ satisfying Assumptions 1 and 2, and suppose that each agent behaves according to PHPIP with a constant exploration rate $\epsilon$. Then, given any probability $p < 1$, if the exploration rate $\epsilon$ is sufficiently small, then for all sufficiently large times $t$ the following inequality holds:

$$\mathrm{Pr}\left( a(t) \in \mathcal{A}^* \right) > p. \qquad (12)$$

Theorem 2 assures that the optimal actions are eventually selected with high probability as long as the final value of $\epsilon$ is sufficiently small, irrespective of the decay rate of $\epsilon$.
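As a usage note for PHPIP (reusing the hypothetical `pipip_step` sketch given after Algorithm 1, with the same caveats; the threshold implied by "sufficiently small" is not quantified here, so 0.02 below is an arbitrary illustrative value), the exploration rate is simply frozen and the late-phase occupancy recorded:

```python
import collections

# PHPIP: freeze the exploration rate at a small constant and reuse pipip_step
# and the toy setup from the sketch after Algorithm 1 (same caveats apply).
agent = {"memory": [(0, 0.0), (0, 0.0)],
         "constrained": lambda a: [b for b in range(5) if abs(b - a) <= 1]}
utility = lambda a: [0.10, 0.05, 0.20, 0.15, 0.45][a]

visits = collections.Counter()
for t in range(2, 20000):
    a = pipip_step(agent, t, eps=0.02)     # constant exploration rate (PHPIP)
    agent["memory"] = [agent["memory"][1], (a, utility(a))]
    if t > 10000:
        visits[a] += 1                     # occupancy over the late phase

# For sufficiently small eps, the optimal action should dominate the occupancy,
# in the spirit of Theorem 2 (the run above is only a finite-sample illustration).
print(visits.most_common())
```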

IV Proof of Main Result

In this section, we prove the main result of this paper (Theorem 1). We first consider PHPIP with a constant exploration rate $\epsilon$. The state consisting of the stored joint actions under PHPIP with exploration rate $\epsilon$ constitutes a perturbed Markov process $P^\epsilon$ on the state space $\mathcal{A} \times \mathcal{A}$.

In terms of the Markov process induced by PHPIP, the following lemma holds.

Lemma 1

The Markov process $P^\epsilon$ induced by PHPIP applied to a constrained potential game $\Gamma$ is a regular perturbation of the unperturbed process $P^0$ under Assumption 1.

Proof

Consider a feasible transition with and and partition the set of agents according to their behaviors along with the transition as

Then, the probability of the transition is represented as

(13)

where if and otherwise. We see from (13) that the resistance of transition defined in (3) is given by since

(14)

holds. Thus, (A3) in Definition 3 is satisfied. In addition, it is straightforward from the procedure of PHPIP to confirm the condition (A2).

It is thus sufficient to check (A1) in Definition 3. From the rule of taking exploratory actions in Algorithm 1 and the second item of Assumption 1, we immediately see that the set of the states accessible from any is equal to . This implies that the perturbed Markov process is irreducible. We next check aperiodicity of . It is clear that any state in has period 1. Let us next pick any from the set . Since holds iff from Assumption 1, the following two paths are both feasible: , . This implies that the period of state is and the process is proved to be aperiodic. Hence the process is both irreducible and aperiodic, which means (A1) in Definition 3.

In summary, conditions (A1)–(A3) in Definition 3 are satisfied and the proof is completed.

From Lemma 1, the perturbed Markov process $P^\epsilon$ is irreducible and hence there exists a unique stationary distribution $\mu^\epsilon$ for every $\epsilon \in (0, \bar{\epsilon}]$. Moreover, because $P^\epsilon$ is a regular perturbation of $P^0$, we see from the first half of Proposition 1 that $\lim_{\epsilon \to 0^+} \mu^\epsilon$ exists and the limiting distribution is a stationary distribution of $P^0$.

We also have the following lemma on the Markov process induced by PHPIP.

Lemma 2

Consider the Markov process induced by PHPIP applied to a constrained potential game $\Gamma$. Then, the recurrent communication classes of the unperturbed Markov process $P^0$ are given by the singletons of states whose two stored joint actions coincide, namely

(15)
Proof

Because of the rule at Step 2 of PHPIP, it is clear that any state whose two stored joint actions coincide cannot move to another state without explorations, which implies that each such state by itself forms a recurrent communication class of the unperturbed Markov process $P^0$.

We next consider the states in and prove that these states are never included in recurrent communication classes of the unperturbed Markov process . Here, we use induction. We first consider the case of . If , then the transition is taken. Otherwise, a sequence of transitions occurs. Thus, in case of , the state is never included in recurrent communication classes of .

We next make a hypothesis that there exists a such that all the states in are not included in recurrent communication classes of the unperturbed Markov process for all . Then, we consider the case , where there are three possible cases:

(i)

,

(ii)

,

(iii)

for agents where .

In case (i), the transition must occur for and, in case (ii), the transition should be selected. Thus, all the states in satisfying (i) or (ii) are never included in recurrent communication classes. In case (iii), at the next iteration, all the agents satisfying choose the current action. Then, such agents possess a single action in the memory and, in case of , each agent has to choose either of the actions in the memory. Namely, these agents never change their actions in all subsequent iterations. The resulting situation is thus the same as the case of . From the above hypothesis, we can conclude that the states in case (iii) are also not included in recurrent communication classes. In summary, the states in are never included in the recurrent communication classes of . The proof is thus completed.

A feasible path over the process from one state to another is said to be a route if both of its end states belong to the set of recurrent states identified in Lemma 2. Note that a route is a path and hence the resistance of a route is also given by (4). In particular, we define a straight route as follows, where we use the notation

(16)
Definition 7 (Straight Route)

A route between two recurrent states differing in only one agent's action is said to be a straight route if the path is given by transitions on the Markov process such that only one agent changes his action at the first iteration and this exploring agent selects the same action again at the next iteration, while the other agents keep their actions during the two steps.

In terms of the straight route, we have the following lemma.

Lemma 3

Consider paths from any state to any state such that over the Markov process induced by PHPIP applied to a constrained potential game . Then, under Assumption 2, the resistance of the straight route from to is strictly smaller than and the resistance is minimal among all paths from to .

Proof

Along with the straight route, only one agent first changes action from to , whose probability is given by

(17)

It is easy to confirm from (17) that the resistance of the transition is equal to . We next consider the transition from to . If is true, the probability of this transition is given by , whose resistance is equal to . Otherwise, holds and the probability of this transition is given by , whose resistance is . Let us now notice that the resistance of the straight route is equal to the sum of the resistances of transitions and from (5) and that from Assumption 2. We can thus conclude that is smaller than . It is also easy to confirm that the resistance of paths such that more than agents take exploratory action should be greater than . Namely, the straight route gives the smallest resistance among all paths from to and hence the proof is completed.

We also introduce the following notion.

Definition 8 ($m$-Straight-Route)

An $m$-straight-route is a route which passes through $m$ vertices in the set of recurrent states identified in Lemma 2, such that all the routes between any two consecutive vertices are straight.

In terms of such routes, we can prove the following lemma, which clarifies a connection between the potential function and the resistance of a route.

Lemma 4

Consider the Markov process induced by PHPIP applied to a constrained potential game . Let us denote an -straight-route over from state to state by

(18)

where and all the arrows between them are straight routes. In addition, we denote its reverse route by

(19)

which is also an -straight route from to . Then, under Assumption 2, if , we have .

Proof

We suppose that the route contains straight routes with resistance greater than and contains straight routes with resistance greater than . Let us now denote the explored agent along with the route by and that with by . From the proof of Lemma 3, the resistance of the route should be exactly equal to (in case of ) or equal to (in case of ). From (1), the following equation holds.

(20)

Namely, one of the resistances of the straight routes and is exactly and the other is greater than except for the case that in which the resistances are both equal to . An illustrative example of the relation is given as follows, where the numbers put on arrows are the resistances of the routes.

Namely, the inequality holds true. Let us now collect all the such that the resistance of is greater than and number them as . Similarly, we define for the reverse route . Then, from equations in (20), we obtain

(21)

Note that (21) holds even in the presence of pairs such that