
# Information-Sharing over Adaptive Networks with Self-interested Agents

Chung-Kai Yu, Mihaela van der Schaar, and Ali H. Sayed. Copyright (c) 2015 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org. This work was supported in part by NSF grants CCF-1011918, CSR-1016081, and ECCS-1407712. An early short version of this work appeared in the conference publication [1]. The authors are with the Department of Electrical Engineering, University of California, Los Angeles, CA 90095-1594 USA (e-mail: ckyuna@ucla.edu, {mihaela,sayed}@ee.ucla.edu).
###### Abstract

We examine the behavior of multi-agent networks where information-sharing is subject to a positive communications cost over the edges linking the agents. We consider a general mean-square-error formulation where all agents are interested in estimating the same target vector. We first show that, in the absence of any incentives to cooperate, the optimal strategy for the agents is to behave in a selfish manner, with each agent seeking the optimal solution independently of the other agents. Pareto inefficiency arises because agents are not using historical data to predict the behavior of their neighbors and to know whether they will reciprocate and participate in sharing information. Motivated by this observation, we develop a reputation protocol to summarize the opponent's past actions into a reputation score, which can then be used to form a belief about the opponent's subsequent actions. The reputation protocol entices agents to cooperate and turns their optimal strategy into an action-choosing strategy that enhances the overall social benefit of the network. In particular, we show that when the communications cost becomes large, the expected social benefit of the proposed protocol outperforms the social benefit that is obtained by cooperative agents that always share data. We perform a detailed mean-square-error analysis of the evolution of the network over three domains: far-field, near-field, and middle-field, and show that the network behavior is stable for sufficiently small step-sizes. The various theoretical results are illustrated by numerical simulations.

**Keywords**: Adaptive networks, self-interested agents, reputation design, diffusion strategy, Pareto efficiency, mean-square-error analysis.

## I Introduction

Adaptive networks enable agents to share information and to solve distributed optimization and inference tasks in an efficient and decentralized manner. In most prior works, agents are assumed to be cooperative and designed to follow certain distributed rules such as the consensus strategy (e.g., [2, 3, 4, 5, 6, 7, 8, 9, 10]) or the diffusion strategy (e.g., [11, 12, 13, 14, 15, 16, 17, 18, 19]). These rules generally include a self-learning step to update the agents’ estimates using their local data, and a social-learning step to fuse and combine the estimates shared by neighboring agents. However, when agents are selfish, they would not obey the preset rules unless these strategies conform to their own interests, such as minimizing their own costs. In this work, we assume that the agents can behave selfishly and that they, therefore, have the freedom to decide whether or not they want to participate in sharing information with their neighbors at every point in time. Under these conditions, the global social benefit for the network can be degraded unless a policy is introduced to entice agents to participate in the collaborative process despite their individual interests. In this article, we will address this difficulty in the context of adaptive networks where agents are continually subjected to streaming data, and where they can predict in real-time, from their successive interactions, how reliable their neighbors are and whether they can be trusted to share information based on their past history. This formulation is different from the useful work in [20], which considered one particular form of selfish behavior in the context of a game-theoretic formulation. In that work, the focus is on activating the self-learning and social learning steps simultaneously, and agents simply decide whether to enter into a sleep mode (to save energy) or to continue acquiring and processing data. 
In the framework considered in our work, agents always remain active and are continually acquiring data; the main question instead is to entice agents to participate in the collaborative information-sharing process regardless of their self-centered evaluations.

More specifically, we study the behavior of multi-agent networks where information-sharing is subject to a positive communication cost over the edges linking the agents. This situation is common in applications, such as information sharing over cognitive networks [21], online learning under communication bandwidth and/or latency constraints  [22],[23, Ch. 14], and over social learning networks when the delivery of opinions involves some costs such as messaging fees [24, 25, 26]. In our network model, each agent is self-interested and seeks to minimize its own sharing cost and estimation error. Motivated by the practical scenario studied in [21], we formulate a general mean-square error estimation problem where all agents are interested in estimating the same target parameter vector. Agents are assumed to be foresighted and to have bounded rationality [27] in the manner defined further ahead in the article. Then, we show that if left unattended, the dominant strategy for all agents is for them not to participate in the sharing of information, which leads to networks operating under an inefficient Pareto condition. This situation arises because agents do not have enough information to tell beforehand if their paired neighbors will reciprocate their actions (i.e., if an agent shares data with a second agent, will the second agent reciprocate and share data back?) This prediction-deficiency problem follows from the fact that agents are not using historical data to predict other agents’ actions.

One method to deal with this inefficient scenario is to assume that agents adapt to their opponents’ strategies and improve returns by forming some regret measures. In [28], a decision maker determines its action using a regret measure to evaluate the utility loss from the chosen action to the optimal action in the previous stage game. For multi-agent networks, a regret-based algorithm was proposed in [20] and [29] for agents to update their actions based on a weighted loss of the utility functions from the previous stage games. However, these works assume myopic agents and formulate repeated games with fixed utility functions over each stage game, which is different from the scenario considered in this article where the benefit of sharing information over adaptive networks continually evolves over time. This is because, as the estimation accuracy improves and/or the communication cost becomes expensive, the return to continue cooperating for estimation purposes falls and thus the act of cooperating with other agents becomes unattractive and inefficient. In this case, the regret measures computed from the previous stage games may not provide an accurate reference to the current stage game.

A second useful method to deal with Pareto inefficient and non-cooperative scenarios is to employ reputation schemes (e.g., [30, 31, 32, 33]). In this method, foresighted agents use reputation scores to assess the willingness of other agents to cooperate; the scores are also used to punish non-cooperative behavior. For example, the works [31, 32] rely on discrete-value reputation scores, say, on a scale 1-10, and these scores are regularly updated according to the agents’ actions. Similar to the regret learning references mentioned before, in our problem the utilities or cost functions of stage games change over time and evolve based on agents’ estimates. Conventional reputation designs do not address this time variation within the payoff of agents, which will be examined more closely in our work. Motivated by these considerations, in Sec. IV, we propose a dynamic/adaptive reputation protocol that is based on the belief measure of future actions with real-time benefit predictions.

In our formulation, we assume a general random-pairing model similar to [10], where agents are randomly paired at the beginning of each time interval. This situation could occur, for example, due to an exogenous matcher or the mobility of the agents. The paired agents are assumed to follow a diffusion strategy [12, 13, 14, 15], which includes an adaptation step and a consultation step, to iteratively update their estimates. Different from conventional diffusion strategies, the consultation step here is influenced by the random-pairing environment and by cooperation uncertainty. The interactions among self-interested agents are formulated as successive stage games of two players using pure strategies. To motivate agents to cooperate with each other, we formulate an adaptive reputation protocol to help agents jointly assess the instantaneous benefit of depreciating information and the transmission cost of sharing information. The reputation score helps agents to form a belief about their opponent's subsequent actions. Based on this belief, we entice agents to cooperate and turn their best response strategy into an action-choosing strategy that conforms to Pareto efficiency and enhances the overall social benefit of the network.

In the performance evaluation, we are interested in ensuring the mean-square-error stability of the network instead of examining equilibria as is common in the game theoretical literature since our emphasis is on adaptation under successive time-variant stage games. The performance analysis is challenging due to the adaptive behavior by the agents. For this reason, we pursue the mean-square-error analysis of the evolution of the network over three domains: far-field, near-field, and middle-field, and show that the network behavior is stable for sufficiently small step-sizes. We also show that when information sharing becomes costly, the expected social benefit of the proposed reputation protocol outperforms the social benefit that is obtained by cooperative agents that always share data.

Notation: We use lowercase letters to denote vectors and scalars, uppercase letters for matrices, plain letters for deterministic variables, and boldface letters for random variables. All vectors in our treatment are column vectors, with the exception of the regression vectors, $u_{k,i}$, which are row vectors.

## II System Model

### II-A Distributed Optimization and Communication Cost

Consider a connected network consisting of $N$ agents. When agents act independently of each other, each agent $k$ would seek to estimate the vector $w^o$ that minimizes an individual estimation cost function denoted by $J_k^{\rm est}(w)$. We assume each of the costs $J_k^{\rm est}(w)$ is strongly convex for $k=1,2,\ldots,N$, and that all agents have the same objective so that all costs are minimized at the common location $w^o$.

In this work, we are interested in scenarios where agents can be motivated to cooperate among themselves as permitted by the network topology. We associate an extended cost function with each agent $k$, and denote it by $J_k(w,a_k)$. In this new cost, the scalar $a_k\in\{0,1\}$ is a binary variable that is used to model whether agent $k$ is willing to cooperate and share information with its neighbors. The value $a_k=1$ means that agent $k$ is willing to share information (e.g., its estimate of $w^o$) with its neighbors, while the value $a_k=0$ means that agent $k$ is not willing to share information. The reason why agents may or may not share information is that this decision generally entails some cost. We consider the scenario where a positive transmission cost, $c_k>0$, is required for each act by agent $k$ involving sharing an estimate with any of its neighbors. By taking $a_k$ into consideration, the extended cost that is now associated with agent $k$ will consist of the sum of two components: the estimation cost and the communication cost¹:

$$J_k(w,a_k)\;\triangleq\; J_k^{\rm est}(w)+J_k^{\rm com}(a_k) \tag{1}$$

where the latter component is modeled as

$$J_k^{\rm com}(a_k)\;\triangleq\; a_k c_k \tag{2}$$

We express the communication expense in the form (2) because, as described further ahead, when an agent decides to share information, it will be sharing the information with one neighbor at a time; the cost for this communication will be $c_k$. With regards to the estimation cost, $J_k^{\rm est}(w)$, this measure can be selected in many ways. One common choice is the mean-square-error (MSE) cost, which we adopt in this work.

¹ We focus on the sum of the estimation cost and the communication cost due to its simplicity and meaningfulness in applications. Note that a possible generalization is to consider a penalty-based objective function for some penalty function.

At each time instant $i$, each agent $k$ is assumed to have access to a scalar measurement $d_k(i)$ and a regression vector $u_{k,i}$ with covariance matrix $R_{u,k}\triangleq E\,u_{k,i}^{*}u_{k,i}>0$. The regressors are assumed to have zero mean and to be temporally white and spatially independent, i.e.,

$$E\,u_{k,i}^{*}u_{\ell,j}=R_{u,k}\,\delta_{k\ell}\,\delta_{ij} \tag{3}$$

in terms of the Kronecker delta function. The data are assumed to be related via the linear regression model:

$$d_k(i)=u_{k,i}w^o+v_k(i) \tag{4}$$

where $w^o$ is the common target vector to be estimated by the agents. In (4), the variable $v_k(i)$ is a zero-mean white-noise process with power $\sigma_{v,k}^2$ that is assumed to be spatially independent, i.e.,

$$E\,v_k^{*}(i)v_\ell(j)=\sigma_{v,k}^2\,\delta_{k\ell}\,\delta_{ij} \tag{5}$$

We further assume that the random processes $u_{k,i}$ and $v_\ell(j)$ are spatially and temporally independent for any $k$, $\ell$, $i$, and $j$. Models of the form (4) are common in many applications, e.g., channel estimation, model fitting, target tracking, etc. (see, e.g., [15]).
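The measurement model (4) and the resulting MSE expression can be checked numerically. The sketch below (with illustrative choices of dimension, noise level, and white regressors so that $R_{u,k}=I$; none of these values come from the paper) generates data according to $d_k(i)=u_{k,i}w^o+v_k(i)$ and compares the empirical mean-square error of a fixed estimate $w$ with the theoretical value $\|\tilde{w}\|^2_{R_{u,k}}+\sigma^2_{v,k}$:

```python
import numpy as np

# Sketch of the linear measurement model (4); M and sigma_v are illustrative.
rng = np.random.default_rng(0)
M = 4                                  # length of the unknown vector w^o
wo = rng.standard_normal((M, 1))       # common target vector
sigma_v = 0.1                          # noise standard deviation for one agent

def measure(num_samples):
    """Generate (d, U): zero-mean white regressors and noisy measurements."""
    U = rng.standard_normal((num_samples, M))     # rows play the role of u_{k,i}
    d = U @ wo + sigma_v * rng.standard_normal((num_samples, 1))
    return d, U

# Empirical check of the MSE cost: for a fixed estimate w,
# E|d - u w|^2 = ||w^o - w||^2_{R_u} + sigma_v^2, with R_u = I here.
w = np.zeros((M, 1))                   # a deliberately poor estimate
d, U = measure(100000)
empirical = float(np.mean((d - U @ w) ** 2))
theoretical = float((wo - w).T @ (wo - w)) + sigma_v ** 2
```

With enough samples the two quantities agree closely, which is the content of the MSE expression derived next in the text.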

Let $w_{k,i}$ denote the estimator for $w^o$ that will be available to agent $k$ at time $i$. We will describe in the sequel how agents evaluate these estimates. The corresponding a-priori estimation error is defined by

$$e_{a,k}(i)\;\triangleq\; d_k(i)-u_{k,i}w_{k,i-1} \tag{6}$$

and it measures how well the weight estimate $w_{k,i-1}$ matches the measured data. In view of model (4), we can also write

$$e_{a,k}(i)=u_{k,i}\tilde{w}_{k,i-1}+v_k(i) \tag{7}$$

in terms of the estimation error vector

$$\tilde{w}_{k,i-1}\;\triangleq\; w^o-w_{k,i-1} \tag{8}$$

Motivated by these expressions and model (4), the instantaneous MSE cost that is associated with agent $k$ based on the estimate from time $i-1$ is given by

$$J_k^{\rm est}(w_{k,i-1})\;\triangleq\;E|e_{a,k}(i)|^2=E|d_k(i)-u_{k,i}w_{k,i-1}|^2=E\|\tilde{w}_{k,i-1}\|^2_{R_{u,k}}+\sigma^2_{v,k} \tag{9}$$

Note that this MSE cost conforms to the strong convexity of $J_k^{\rm est}(w)$ that we mentioned before. Combined with the action $a_k$ by agent $k$, the extended instantaneous cost at agent $k$ that is based on the prior estimate, $w_{k,i-1}$, is then given by:

$$J_k(w_{k,i-1},a_k)=E|e_{a,k}(i)|^2+a_k c_k \tag{10}$$

### II-B Random-Pairing Model

We denote by $\mathcal{N}_k$ the neighborhood of each agent $k$, including agent $k$ itself. We consider a random pairing protocol for agents to share information at the beginning of every iteration cycle. The pairing procedure can be executed either in a centralized or distributed manner. Centralized pairing schemes can be used when an online server randomly assigns its clients into pairs as in crowdsourcing applications [32, 31], or when a base-station makes pairing decisions for its mobile nodes for packet relaying [34]. Distributed pairing schemes arise more naturally in the context of economic and market transactions [35]. In our formulation, we adopt a distributed pairing structure that takes neighborhoods into account when selecting pairs, as explained next.

We assume each agent $k$ has bi-directional links to the other agents in $\mathcal{N}_k$ and that agent $k$ has a positive probability to be paired with any of its neighbors. Once two agents are paired, they can decide on whether or not to share their instantaneous estimates for $w^o$. We therefore model the result of the random-pairing process between each pair of agents $k$ and $\ell$ as temporally-independent Bernoulli random processes defined as:

$$\mathbf{1}_{k\ell}(i)=\mathbf{1}_{\ell k}(i)=\begin{cases}1, & \text{with probability } p_{k\ell}=p_{\ell k}\\ 0, & \text{otherwise}\end{cases} \tag{11}$$

where $\mathbf{1}_{k\ell}(i)=1$ indicates that agents $k$ and $\ell$ are paired at time $i$ and $\mathbf{1}_{k\ell}(i)=0$ indicates that they are not paired. We are setting $\mathbf{1}_{k\ell}(i)=\mathbf{1}_{\ell k}(i)$ because these variables represent the same event: whether agents $k$ and $\ell$ are paired, which results in $p_{k\ell}=p_{\ell k}$. For $\ell\notin\mathcal{N}_k$, we have $p_{k\ell}=0$ since such pairs will never occur. For convenience, we use $\mathbf{1}_{kk}(i)=1$ to indicate the event that agent $k$ is not paired with any agent at time $i$, which happens with probability $p_{kk}$. Since each agent will pair itself with at most one agent at a time from its neighborhood, the following properties are directly obtained from the random-pairing procedure:

$$\sum_{\ell\in\mathcal{N}_k}\mathbf{1}_{k\ell}(i)=1,\qquad \sum_{\ell\in\mathcal{N}_k}p_{k\ell}=1 \tag{12}$$

$$\mathbf{1}_{k\ell}(i)\,\mathbf{1}_{kq}(i)=0,\qquad \text{for } \ell\neq q \tag{13}$$

We assume that the random pairing indicators $\{\mathbf{1}_{k\ell}(i)\}$ for all $k$ and $\ell$ are independent of the random variables $u_{k,i}$ and $v_k(j)$ for any time $i$ and $j$. For example, a widely used setting in the literature is the fully-paired network, which assumes a fully-connected network topology [36, 32], i.e., $\mathcal{N}_k=\mathcal{N}$ for every agent $k$, where $\mathcal{N}$ denotes the set of all agents. The network size $N$ is assumed to be even and every agent is uniformly paired with exactly one other agent in the network. Therefore, we have $N/2$ pairs at each time instant and the random-pairing probability becomes

$$p_{k\ell}=\begin{cases}\dfrac{1}{N-1}, & \text{for } \ell\neq k\\[4pt] 0, & \text{for } \ell=k\end{cases} \tag{14}$$

We will not be assuming fully-connected networks or fully-paired protocols and will deal more generally with networks that can be sparsely connected. Later in Sec. IV we will demonstrate a simple random-pairing protocol which can be implemented in a fully distributed manner.
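A distributed pairing round on a sparse graph can be sketched as follows. The proposal-based matching rule below is an illustrative construction, not the paper's own protocol from Sec. IV: each agent proposes to a uniformly chosen neighbor, and a pair forms only when the proposals are mutual. Encoding the unpaired event as `partner[k] == k` mirrors the convention $\mathbf{1}_{kk}(i)=1$:

```python
import random

# One round of randomized pairing; neighbor lists exclude the agent itself.
def pair_round(neighbors, rng):
    """neighbors: dict agent -> list of neighbors. Returns dict agent -> partner,
    where partner == agent encodes the unpaired event (1_kk(i) = 1)."""
    proposal = {k: rng.choice(neighbors[k]) for k in neighbors}
    partner = {k: k for k in neighbors}            # default: unpaired
    for k in neighbors:
        m = proposal[k]
        # a pair forms only for mutual proposals between two free agents
        if proposal.get(m) == k and partner[k] == k and partner[m] == m and m != k:
            partner[k], partner[m] = m, k
    return partner

rng = random.Random(1)
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
partner = pair_round(neighbors, rng)
# Properties analogous to (12)-(13): each agent ends up with exactly one
# partner (possibly itself) and the pairing is symmetric.
symmetric = all(partner[partner[k]] == k for k in neighbors)
```

Any such rule satisfies the constraints used in the analysis: an agent is in at most one pair per round, and pairing is a symmetric event between the two agents involved.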

### II-C Diffusion Strategy

Conventional diffusion strategies assume that the agents are cooperative (or obedient) and continuously share information with their neighbors as necessary [13, 15, 12]. In the adapt-then-combine (ATC) version of diffusion adaptation, each agent $k$ updates its estimate, $w_{k,i}$, according to the following relations:

$$\psi_{k,i}=w_{k,i-1}+\mu_k u_{k,i}^{*}\left[d_k(i)-u_{k,i}w_{k,i-1}\right] \tag{15}$$

$$w_{k,i}=\sum_{\ell\in\mathcal{N}_k}\alpha_{\ell k}\,\psi_{\ell,i} \tag{16}$$

where $\mu_k>0$ is the step-size parameter of agent $k$, and the $\{\alpha_{\ell k}\}$ are nonnegative combination coefficients that add up to one. In implementation (15)-(16), each agent $k$ computes an intermediate estimate $\psi_{k,i}$ using its local data, and subsequently fuses the intermediate estimates from its neighbors. For the combination step (16), since agent $k$ is allowed to interact with only one of its neighbors at a time, we rewrite (16) in terms of a single coefficient $\alpha_k$ as follows:

$$w_{k,i}=\begin{cases}\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i}, & \text{if } \mathbf{1}_{k\ell}(i)=1 \text{ for some } \ell\neq k\\ \psi_{k,i}, & \text{otherwise}\end{cases} \tag{17}$$

We can capture both situations in (17) in a single equation as follows:

$$w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\sum_{\ell\in\mathcal{N}_k}\mathbf{1}_{k\ell}(i)\,\psi_{\ell,i} \tag{18}$$

In formulation (15) and (18), it is assumed that once agents $k$ and $\ell$ are paired, they share information according to (18).
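The ATC recursion (15)-(18) can be sketched for the simplest case of two agents that are always paired and always cooperate. The step-size, combination weight, and noise levels below are illustrative values, not taken from the paper:

```python
import numpy as np

# Two-agent ATC diffusion: adaptation step (15) followed by the pairwise
# combination step (18) with both agents sharing every iteration.
rng = np.random.default_rng(2)
M, mu, alpha = 4, 0.02, 0.5
wo = rng.standard_normal((M, 1))                 # common target vector w^o
w = [np.zeros((M, 1)), np.zeros((M, 1))]
sigma_v = [0.1, 0.2]                             # heterogeneous noise levels

for i in range(4000):
    psi = []
    for k in range(2):                           # adaptation step (15)
        u = rng.standard_normal((1, M))          # row regression vector u_{k,i}
        d = u @ wo + sigma_v[k] * rng.standard_normal()
        psi.append(w[k] + mu * u.T * (d - u @ w[k]))
    # combination step (18): each agent fuses its partner's intermediate estimate
    w[0] = alpha * psi[0] + (1 - alpha) * psi[1]
    w[1] = alpha * psi[1] + (1 - alpha) * psi[0]

err = max(np.linalg.norm(w[k] - wo) for k in range(2))
```

For small step-sizes both estimates converge to a small neighborhood of $w^o$, which is the cooperative baseline against which selfish behavior is compared below.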

Let us now incorporate an additional layer into the algorithm in order to model instances of selfish behavior. When agents behave in a selfish (strategic) manner, even when agents $k$ and $\ell$ are paired, each one of them may still decide (independently) to refuse to share information with the other agent for selfish reasons (for example, agent $k$ may decide that this cooperation will cost more than the benefit it will reap for the estimation task). To capture this behavior, we use the specific notation $a_{k\ell}(i)$, instead of $a_k$, to represent the action taken by agent $k$ on agent $\ell$ at time $i$, and similarly for $a_{\ell k}(i)$. Both agents will end up sharing information with each other only if $a_{k\ell}(i)=a_{\ell k}(i)=1$, i.e., only when both agents are in favor of cooperating once they have been paired. We set $a_{kk}(i)=1$ for every time $i$. We can now rewrite the combination step (18) more generally as:

$$w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\sum_{\ell\in\mathcal{N}_k}\mathbf{1}_{k\ell}(i)\left[a_{\ell k}(i)\psi_{\ell,i}+(1-a_{\ell k}(i))\psi_{k,i}\right] \tag{19}$$

From (19), when agent $k$ is not paired with any agent at time $i$ ($\mathbf{1}_{kk}(i)=1$), we get $w_{k,i}=\psi_{k,i}$. On the other hand, when agent $k$ is paired with some neighboring agent $\ell$, which means $\mathbf{1}_{k\ell}(i)=1$, we get

$$w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\left[a_{\ell k}(i)\psi_{\ell,i}+(1-a_{\ell k}(i))\psi_{k,i}\right] \tag{20}$$

It is then clear that $a_{\ell k}(i)=0$ results in $w_{k,i}=\psi_{k,i}$, while $a_{\ell k}(i)=1$ results in a combination of the estimates of agents $k$ and $\ell$. In other words, when $\mathbf{1}_{k\ell}(i)=1$:

$$w_{k,i}=\begin{cases}\psi_{k,i}, & \text{if } a_{\ell k}(i)=0\\ \alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i}, & \text{if } a_{\ell k}(i)=1\end{cases} \tag{21}$$

In the sequel, we assume that agents update and combine their estimates using (15) and (19). One important question to address is how the agents determine their actions $\{a_{k\ell}(i)\}$.
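The action-aware combination step (19)-(21) reduces to a simple branching rule from agent $k$'s point of view. A minimal sketch, with plain Python lists standing in for the estimate vectors:

```python
# Combination step (19) seen from agent k: the neighbor's intermediate
# estimate is fused only when the agents are paired AND the neighbor chose
# to share (a_lk = 1); otherwise agent k keeps its own psi_k, as in (21).
def combine(psi_k, psi_l, alpha_k, paired, a_lk):
    """Return w_k after the combination step."""
    if paired and a_lk == 1:
        return [alpha_k * x + (1 - alpha_k) * y for x, y in zip(psi_k, psi_l)]
    return list(psi_k)

psi_k, psi_l = [1.0, 2.0], [3.0, 6.0]
w_share = combine(psi_k, psi_l, 0.5, paired=True, a_lk=1)   # neighbor shares
w_hold = combine(psi_k, psi_l, 0.5, paired=True, a_lk=0)    # neighbor withholds
```

Note that agent $k$'s own action $a_{k\ell}(i)$ does not appear here; it affects only the communication cost $c_k$ incurred by agent $k$, which is exactly the asymmetry exploited in the stage-game analysis of Sec. III.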

## III Agent Interactions

When an arbitrary agent $k$ needs to decide on whether to set its action to $a_{k\ell}(i)=1$ (i.e., to cooperate) or $a_{k\ell}(i)=0$ (i.e., not to cooperate), it generally cannot tell beforehand whether agent $\ell$ will reciprocate. In this section, we first show that when self-interested agents are boundedly rational and incapable of transforming the past actions of neighbors into a prediction of their future actions, then the dominant strategy for each agent will be to choose noncooperation. Consequently, the entire network becomes noncooperative. Later, in Sec. IV, we explain how to address this inefficient scenario by proposing a protocol that will encourage cooperation.

### III-A Long-Term Discounted Cost Function

To begin with, let us examine the interaction between a pair of agents, such as $k$ and $\ell$, at some time instant $i$ ($\mathbf{1}_{k\ell}(i)=1$). We assume that agents $k$ and $\ell$ simultaneously select their actions $a_{k\ell}(i)$ and $a_{\ell k}(i)$ by using some pure strategies (i.e., agents set their action variables by using data or realizations that are available to them, such as the estimates $\{w_{k,i-1}\}$, rather than select their actions according to some probability distributions)². The criterion for setting $a_{k\ell}(i)$ by agent $k$ is to optimize agent $k$'s payoff, which incorporates both the estimation cost, affected by agent $\ell$'s action $a_{\ell k}(i)$, and the communication cost, determined by agent $k$'s own action $a_{k\ell}(i)$. Therefore, the instantaneous cost incurred by agent $k$ is a mapping function from the action space $\{0,1\}\times\{0,1\}$ to a real value. In order to account for selfish behavior, we need to modify the notation used in (1) to incorporate the actions of both agents $k$ and $\ell$. In this way, we denote the value of the cost incurred by agent $k$ at time $i$, after $w_{k,i-1}$ is updated to $w_{k,i}$, more explicitly by $J_k(a_{k\ell}(i),a_{\ell k}(i))$, and it is given by:

$$J_k(a_{k\ell}(i),a_{\ell k}(i))=\begin{cases}J_k^{\rm est}(w_{k,i}=\psi_{k,i}), & \text{if } (0,0)\\ J_k^{\rm est}(w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i}), & \text{if } (0,1)\\ J_k^{\rm est}(w_{k,i}=\psi_{k,i})+c_k, & \text{if } (1,0)\\ J_k^{\rm est}(w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i})+c_k, & \text{if } (1,1)\end{cases} \tag{22}$$

² In our scenario, the discrete action set will be shown to lead to threshold-based pure strategies; see Sec. IV-B.

For example, the first line on the right-hand side of (22) corresponds to the situation in which neither agent decides to cooperate. In that case, agent $k$ can only rely on its intermediate estimate, $\psi_{k,i}$, to improve its estimation accuracy. In comparison, the second line in (22) corresponds to the situation in which agent $\ell$ is willing to share its estimate but not agent $k$. In this case, agent $k$ is able to perform the second combination step in (21) and enhance its estimation accuracy. In the third line in (22), agent $\ell$ does not cooperate while agent $k$ does. In this case, agent $k$ incurs a communication cost, $c_k$. Similarly, for the last line in (22), both agents cooperate. In this case, agent $k$ is able to perform the second step in (21) while incurring a cost $c_k$.

We can write (22) more compactly as follows:

$$J_k(a_{k\ell}(i),a_{\ell k}(i))=J_k^{\rm act}(a_{\ell k}(i))+a_{k\ell}(i)\,c_k \tag{23}$$

where we introduced

$$J_k^{\rm act}(a_{\ell k}(i))\;\triangleq\;\begin{cases}J_k^{\rm est}(w_{k,i}=\psi_{k,i}), & \text{if } a_{\ell k}(i)=0\\ J_k^{\rm est}(w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i}), & \text{if } a_{\ell k}(i)=1\end{cases} \tag{24}$$

The function $J_k^{\rm act}(\cdot)$ helps make explicit the influence of the action by agent $\ell$ on the estimation accuracy that is ultimately attained by agent $k$.

Now, the random-pairing process occurs repeatedly over time and, moreover, agents may leave the network. For this reason, rather than rely on the instantaneous cost function in (22), agent $k$ will determine its action at time $i$ by instead minimizing an expected long-term discounted cost function of the form:

$$J^{\infty}_{k,i}\left[a_{k\ell}(i),a_{\ell k}(i)\,|\,w_{k,i-1}\right]\;\triangleq\;\sum_{t=i}^{\infty}\delta_k^{t-i}\,E\left[J_k(a_{k\ell}(t),a_{\ell k}(t))\,\big|\,w_{k,i-1}\right] \tag{25}$$

where $0\le\delta_k<1$ is a discount factor to model future network uncertainties and the foresightedness level of agent $k$. The expectation is taken over all randomness for $t\ge i$ and is conditioned on the estimate $w_{k,i-1}$, which is available when the actions $a_{k\ell}(i)$ and $a_{\ell k}(i)$ are selected. Formulation (25) is meant to assess the influence of the action selected at time $i$ by agent $k$ on its cumulative (but discounted) future costs. More specifically, whenever $\mathbf{1}_{k\ell}(i)=1$, agent $k$ selects its action at time $i$ to minimize the expected long-term discounted cost given $w_{k,i-1}$:

$$\min_{a_{k\ell}(i)\in\{0,1\}}\;J^{\infty}_{k,i}\left[a_{k\ell}(i),a_{\ell k}(i)\,|\,w_{k,i-1}\right] \tag{26}$$

Based on the payoff function in (25), we can formally regard the interaction between agents as consisting of stage games with recurrent random pairing. The stage information-sharing game for $\mathbf{1}_{k\ell}(i)=1$ is a tuple consisting of the set of players $\{k,\ell\}$, the Cartesian product $\{0,1\}\times\{0,1\}$ of the binary sets representing the available actions for agents $k$ and $\ell$, respectively, and the set of real-valued long-term costs $\{J^{\infty}_{k,i},J^{\infty}_{\ell,i}\}$ defined over the action space. The action profile is $(a_{k\ell}(i),a_{\ell k}(i))$. We remark that since $J^{\infty}_{k,i}$ depends on $w_{k,i-1}$, its value generally varies from stage to stage. As a result, each agent faces a dynamic game structure with repeated interactions, in contrast to conventional repeated games as in [37, 38] where the game structure is fixed over time. Time variation is an essential feature that arises when we examine selfish behavior over adaptive networks.

Therefore, solving problem (26) involves forecasting future game structures and future actions chosen by the opponent. These two factors are coupled and influence each other; this fact makes prediction under such conditions rather challenging. To continue with the analysis, we adopt a common assumption from the literature that agents have computational constraints. In particular, we assume the agents have bounded rationality [27, 39, 40]. In our context, this means that the agents have limited capability to forecast future game structures and are therefore obliged to assume that future parameters remain unchanged at current values. We will show how this assumption enables each agent to evaluate $J^{\infty}_{k,i}$ in later discussions.

###### Assumption 1 (Bounded rationality)

Every agent $k$ solves the optimization problem (26) under the assumptions:

$$w_{k,t}=w_{k,i-1},\qquad \mathbf{1}_{k\ell}(t)=\mathbf{1}_{k\ell}(i),\qquad \text{for } t\ge i \tag{27}$$

We note that the above assumption is only made by the agent at time $i$ while solving problem (26); the actual estimates and pairing choices will continue to evolve over time. We further assume that the bounded rationality assumption is common knowledge to all agents in the network³.

³ Common knowledge of an event means that each agent knows it, each agent knows that all other agents know it, each agent knows that all other agents know that all the agents know it, and so on [41].

### III-B Pareto Inefficiency

In this section, we show that if no further measures are taken, then Pareto inefficiency may occur. Thus, assume that the agents are unable to store the history of their actions and the actions of their neighbors. Each agent $k$ only has access to its immediate estimate $w_{k,i-1}$, which can be interpreted as a state variable at time $i$ for agent $k$. In this case, each agent will need to solve (26) under Assumption 1. It then follows that agent $k$ will predict the same action for future time instants:

$$a_{k\ell}(t)=a_{k\ell}(i),\qquad \text{for } t>i \tag{28}$$

Furthermore, since the bounded rationality condition is common knowledge, agent $k$ knows that the same future actions are used by agent $\ell$, i.e.,

$$a_{\ell k}(t)=a_{\ell k}(i),\qquad \text{for } t>i \tag{29}$$

Using (28) and (29), agent $k$ obtains

$$J^{\infty}_{k,i}\left[a_{k\ell}(i),a_{\ell k}(i)\,|\,w_{k,i-1}\right]=\sum_{t=i}^{\infty}\delta_k^{t-i}\,E\left[J_k(a_{k\ell}(i),a_{\ell k}(i))\,\big|\,w_{k,i-1}\right]=\frac{1}{1-\delta_k}\,E\left[J_k(a_{k\ell}(i),a_{\ell k}(i))\,\big|\,w_{k,i-1}\right]=\frac{1}{1-\delta_k}\left(E\left[J_k^{\rm act}(a_{\ell k}(i))\,\big|\,w_{k,i-1}\right]+a_{k\ell}(i)\,c_k\right) \tag{30}$$

Therefore, the optimization problem (26) reduces to the following minimization problem:

$$\min_{a_{k\ell}(i)\in\{0,1\}}\;J^{1}_{k,i}(a_{k\ell}(i),a_{\ell k}(i)) \tag{31}$$

where

$$J^{1}_{k,i}(a_{k\ell}(i),a_{\ell k}(i))\;\triangleq\;E\left[J_k^{\rm act}(a_{\ell k}(i))\,\big|\,w_{k,i-1}\right]+a_{k\ell}(i)\,c_k \tag{32}$$

is the expected cost of agent $k$ given $w_{k,i-1}$; compare with (23). Table I summarizes the values of $J^{1}_{k,i}$ and $J^{1}_{\ell,i}$ for both agents under their respective actions. From the entries in the table, we conclude that choosing action $a_{k\ell}(i)=0$ is the dominant strategy for agent $k$ regardless of the action chosen by agent $\ell$, because its cost will be the smallest it can be in that situation. Likewise, the dominant strategy for agent $\ell$ is $a_{\ell k}(i)=0$ regardless of the action chosen by agent $k$. Therefore, the action profile $(a_{k\ell}(i),a_{\ell k}(i))=(0,0)$ is the unique outcome as a Nash and dominant strategy equilibrium for every stage game.
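The dominance argument can be illustrated numerically. The concrete numbers below (expected estimation costs with and without the neighbor's help, and a communication cost) are hypothetical stand-ins; the structure is that of the prisoner's dilemma, where not sharing dominates even though mutual sharing would be better for both:

```python
# One-stage expected cost (32): the estimation term depends on the OPPONENT's
# action, while the communication cost depends on MY own action.
def J1(a_me, a_opp, e0, e1, c):
    """e0: expected estimation cost without neighbor's estimate;
    e1: expected estimation cost with it (e1 < e0); c: communication cost."""
    return (e1 if a_opp == 1 else e0) + a_me * c

e0, e1, c = 1.0, 0.4, 0.2       # hypothetical values; benefit b = e0 - e1 > c

# a = 0 dominates: for either opponent action, not sharing is strictly cheaper
dominant = all(J1(0, a_opp, e0, e1, c) < J1(1, a_opp, e0, e1, c)
               for a_opp in (0, 1))
# yet mutual sharing (1,1) strictly beats mutual non-sharing (0,0)
pareto = J1(1, 1, e0, e1, c) < J1(0, 0, e0, e1, c)
```

Here `dominant` and `pareto` are both true: each agent's best response is not to share, but both would be better off if both shared, which is exactly the inefficiency characterized next.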

However, this resulting action profile will be Pareto inefficient for both agents if it can be verified that the alternative action profile $(1,1)$, where both agents cooperate, can lead to improved payoff values for both agents in comparison to the strategy $(0,0)$. To characterize when this is possible, let us denote the expected payoff for agent $k$ when agent $\ell$ selects $a_{\ell k}(i)=0$ by

$$s^{0}_{k,i}(a_{k\ell}(i))\;\triangleq\;E\left[J_k^{\rm act}(a_{\ell k}(i)=0)\,\big|\,w_{k,i-1}\right]+a_{k\ell}(i)\,c_k \tag{33}$$

Likewise, when $a_{\ell k}(i)=1$, we denote the expected payoff for agent $k$ by

$$s^{1}_{k,i}(a_{k\ell}(i))\;\triangleq\;E\left[J_k^{\rm act}(a_{\ell k}(i)=1)\,\big|\,w_{k,i-1}\right]+a_{k\ell}(i)\,c_k \tag{34}$$

The benefit for agent $k$ from agent $\ell$'s sharing action, defined as the improvement from $s^{0}_{k,i}$ to $s^{1}_{k,i}$, is seen to be independent of $a_{k\ell}(i)$:

$$b_k(i)\;\triangleq\;s^{0}_{k,i}(a_{k\ell}(i))-s^{1}_{k,i}(a_{k\ell}(i))=E\left[J_k^{\rm act}(a_{\ell k}(i)=0)\,\big|\,w_{k,i-1}\right]-E\left[J_k^{\rm act}(a_{\ell k}(i)=1)\,\big|\,w_{k,i-1}\right]=E\left[J_k^{\rm est}(w_{k,i}=\psi_{k,i})\,\big|\,w_{k,i-1}\right]-E\left[J_k^{\rm est}(w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i})\,\big|\,w_{k,i-1}\right] \tag{35}$$

Now, note from definition (6) that

$$E\left[J_k^{\rm est}(w_{k,i})\,\big|\,w_{k,i-1}\right]=E\left[|d_k(i+1)-u_{k,i+1}w_{k,i}|^2\,\big|\,w_{k,i-1}\right] \tag{36}$$

so that

$$E\left[J_k^{\rm act}(a_{\ell k}(i)=0)\,\big|\,w_{k,i-1}\right]=E\left[J_k^{\rm est}(w_{k,i}=\psi_{k,i})\,\big|\,w_{k,i-1}\right]=E\left[|d_k(i+1)-u_{k,i+1}\psi_{k,i}|^2\,\big|\,w_{k,i-1}\right]=E\left[|u_{k,i+1}\tilde{\psi}_{k,i}+v_k(i+1)|^2\,\big|\,w_{k,i-1}\right]=E\left[\|\tilde{\psi}_{k,i}\|^2_{R_{u,k}}\,\big|\,w_{k,i-1}\right]+\sigma^2_{v,k} \tag{37}$$

where $\tilde{\psi}_{k,i}\triangleq w^o-\psi_{k,i}$ and, similarly,

$$E\left[J_k^{\rm act}(a_{\ell k}(i)=1)\,\big|\,w_{k,i-1}\right]=E\left[J_k^{\rm est}(w_{k,i}=\alpha_k\psi_{k,i}+(1-\alpha_k)\psi_{\ell,i})\,\big|\,w_{k,i-1}\right]=E\left[\|\alpha_k\tilde{\psi}_{k,i}+(1-\alpha_k)\tilde{\psi}_{\ell,i}\|^2_{R_{u,k}}\,\big|\,w_{k,i-1}\right]+\sigma^2_{v,k} \tag{38}$$

Then, the benefit becomes

$$b_k(i)=E\left[\|\tilde{\psi}_{k,i}\|^2_{R_{u,k}}\,\big|\,w_{k,i-1}\right]-E\left[\|\alpha_k\tilde{\psi}_{k,i}+(1-\alpha_k)\tilde{\psi}_{\ell,i}\|^2_{R_{u,k}}\,\big|\,w_{k,i-1}\right] \tag{39}$$

Note that $b_k(i)$ is determined by the estimation errors $\tilde{\psi}_{k,i}$ and $\tilde{\psi}_{\ell,i}$ and does not depend on the actions $a_{k\ell}(i)$ and $a_{\ell k}(i)$. We will explain how agents assess the information needed to choose actions further ahead in Sec. IV-C. Now, let us define the benefit-cost ratio $\gamma_k(i)$ as the ratio of the estimation benefit to the communication cost:

$$\gamma_k(i)\;\triangleq\;\frac{b_k(i)}{c_k} \tag{40}$$

Then, the action profile $(1,1)$ in the game defined in Table I is Pareto superior to the action profile $(0,0)$ when both of the following two conditions hold:

$$\gamma_k(i)>1\ \ \text{and}\ \ \gamma_\ell(i)>1\quad\Longleftrightarrow\quad c_k<b_k(i)\ \ \text{and}\ \ c_\ell<b_\ell(i) \tag{41}$$

On the other hand, the action profile $(0,0)$ is Pareto superior to the action profile $(1,1)$ if, and only if,

$$\gamma_k(i)<1\ \ \text{and}\ \ \gamma_\ell(i)<1 \tag{42}$$

In Fig. 1(a), we illustrate how the values of the payoffs compare to each other when (41) holds for the four possibilities of action profiles. It is seen from this figure that when $\gamma_k(i)>1$ and $\gamma_\ell(i)>1$, the action profile (S,S), i.e., $(1,1)$ in (32), is Pareto optimal, and that the dominant strategy (NS,NS), i.e., $(0,0)$ in (32), is inefficient and leads to worse performance (which is a manifestation of the famous prisoner's dilemma problem [42]). On the other hand, if $\gamma_k(i)<1$ and $\gamma_\ell(i)<1$, then we are led to Fig. 1(b), where the action profile (NS,NS) becomes Pareto optimal and superior to (S,S). We remark that (NS,S) and (S,NS) are also Pareto optimal in both cases but are not preferred in this work because they benefit only a single agent.
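The benefit (39) and the ratio (40) are easy to evaluate once the error vectors are available. The sketch below uses hypothetical error vectors and a hypothetical communication cost, with $R_{u,k}=I$ so that the weighted norms reduce to ordinary squared norms:

```python
import numpy as np

# Benefit (39) with R_u = I: reduction in squared error from fusing the
# neighbor's intermediate estimate. The inputs stand in for the error
# vectors w^o - psi_k and w^o - psi_l.
def benefit(psi_tilde_k, psi_tilde_l, alpha_k):
    fused = alpha_k * psi_tilde_k + (1 - alpha_k) * psi_tilde_l
    return float(psi_tilde_k @ psi_tilde_k - fused @ fused)

pt_k = np.array([1.0, -1.0])     # agent k's own error is large ...
pt_l = np.array([0.1, 0.1])      # ... while the neighbor's error is small
b = benefit(pt_k, pt_l, alpha_k=0.5)
gamma = b / 0.3                  # benefit-cost ratio (40); c_k = 0.3 is hypothetical
```

In this configuration the neighbor's accurate estimate yields a large positive benefit and $\gamma_k > 1$, i.e., the cooperative regime of (41); shrinking the gap between the two error vectors, or raising $c_k$, drives the pair toward the non-cooperative regime (42).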

## IV Adaptive Reputation Protocol Design

As shown above, when both $\gamma_k(i)>1$ and $\gamma_\ell(i)>1$, the Pareto optimal strategies for agents $k$ and $\ell$ correspond to cooperation; when both $\gamma_k(i)<1$ and $\gamma_\ell(i)<1$, the Pareto optimal strategies for agents $k$ and $\ell$ reduce to non-cooperation. Since agents are self-interested and boundedly rational, we showed earlier that if left without incentives, their dominant strategy is to avoid sharing information because they cannot tell beforehand if their paired neighbor will reciprocate. This Pareto inefficiency therefore arises from the fact that agents are not using historical data to predict other agents' actions. We now propose a reputation protocol to summarize the opponent's past actions into a reputation score. The score will help agents to form a belief about their opponent's subsequent actions. Based on this belief, we will be able to provide agents with a measure that entices them to cooperate. We will show, for example, that the best response rule for agents will be to cooperate whenever $\gamma_k(i)$ is large and not to cooperate whenever $\gamma_k(i)$ is small, in conformity with the Pareto-efficient design.

### IV-A Reputation Protocol

Reputation scores have been used before in the literature as a mechanism to encourage cooperation [32, 43, 44]. Agents that cooperate are rewarded with higher scores; agents that do not cooperate are penalized with lower scores. For example, eBay uses a cumulative score mechanism, which simply sums the seller's feedback scores from all previous periods to provide buyers and sellers with trust evaluation [45]. Likewise, Amazon.com implements a reputation system by using an average score mechanism that averages the feedback scores from the previous periods [46]. However, as already explained in [44], cheating can occur over time in both cumulative and average score mechanisms because past scores carry a large weight in determining the current reputation. To overcome this problem, and in a manner similar to exponential weighting in adaptive filter designs [47], an exponentially-weighted moving average mechanism that gives higher weights to more recent actions is discussed in [44]. We follow a similar weighting formulation, with the main difference being that the reputation scores now need to be adapted in response to the evolution of the estimation task over the network. The construction can be described as follows.

When $1_{k\ell}(i)=1$, meaning that agent $k$ is paired with agent $\ell$, the reputation score $\theta_{\ell k}(i)$ that is maintained by agent $k$ for its neighbor $\ell$ is updated as:

$$
\theta_{\ell k}(i+1) = r_k\,\theta_{\ell k}(i) + (1-r_k)\,a_{\ell k}(i) \tag{43}
$$

where $r_k \in (0,1)$ is a smoothing factor used by agent $k$ to control the dynamics of the reputation updates. On the other hand, if $1_{k\ell}(i)=0$, the reputation score remains at $\theta_{\ell k}(i)$. We can compactly describe the reputation rule as

$$
\theta_{\ell k}(i+1) = 1_{k\ell}(i)\left[r_k\,\theta_{\ell k}(i) + (1-r_k)\,a_{\ell k}(i)\right] + \big(1-1_{k\ell}(i)\big)\,\theta_{\ell k}(i) \tag{44}
$$

Directly applying the above reputation formulation, however, can cause a loss of adaptation ability over the network. For example, the network would become permanently non-cooperative if agent $\ell$ chooses $a_{\ell k}(i)=0$ over many consecutive iterations. That is because, in that case, the reputation score $\theta_{\ell k}(i)$ will decay exponentially to zero, which keeps agent $k$ from choosing $a_{k\ell}(i)=1$ in the future. In order to avoid this situation, we lower-bound the reputation score by a small positive threshold $\varepsilon$, i.e.,

$$
\theta_{\ell k}(i+1) = 1_{k\ell}(i)\cdot\max\left\{r_k\,\theta_{\ell k}(i) + (1-r_k)\,a_{\ell k}(i),\ \varepsilon\right\} + \big(1-1_{k\ell}(i)\big)\,\theta_{\ell k}(i) \tag{45}
$$

and thus $\theta_{\ell k}(i) \in [\varepsilon, 1]$ for all $i$.
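The thresholded update (45) can be sketched as follows; this is a minimal illustration, with the smoothing factor and floor chosen as illustrative values rather than taken from the paper:

```python
def update_reputation(theta, paired, action, r=0.9, eps=1e-3):
    """One step of the reputation update in (45).

    theta  : current score theta_{lk}(i) held by agent k for neighbor l
    paired : pairing indicator 1_{kl}(i) (True when k and l are paired)
    action : neighbor's observed action a_{lk}(i), in {0, 1}
    r      : smoothing factor r_k (illustrative value)
    eps    : positive floor that preserves the network's ability to adapt
    """
    if not paired:
        # Agents did not interact: the score is left unchanged.
        return theta
    # Exponentially-weighted moving average, clipped from below at eps.
    return max(r * theta + (1 - r) * action, eps)

# A long run of defections decays the score geometrically,
# but the floor eps keeps it from reaching zero.
theta = 1.0
for _ in range(200):
    theta = update_reputation(theta, paired=True, action=0)
```

Without the `max(..., eps)` clamp, the final score after the defection run would underflow toward zero and cooperation could never restart, which is exactly the adaptation loss the threshold is meant to prevent.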

The reputation scores can now be utilized to evaluate the belief by agent $k$ about subsequent actions by agent $\ell$. To explain how this can be done, we argue that agent $k$ would expect the probability of $a_{\ell k}(t)=1$, i.e., the probability that agent $\ell$ is willing to cooperate, to be an increasing function of both $\theta_{\ell k}(t)$ and $\theta_{k\ell}(t)$ for $t \ge i$. Specifically, if we denote this belief probability by $B(a_{\ell k}(t)=1)$, then it is expected to satisfy:

$$
\frac{\partial B(a_{\ell k}(t)=1)}{\partial \theta_{\ell k}(t)} \ge 0, \qquad \frac{\partial B(a_{\ell k}(t)=1)}{\partial \theta_{k\ell}(t)} \ge 0 \tag{46}
$$

The first property is motivated by the fact that, according to the history of actions, a higher value for $\theta_{\ell k}(t)$ indicates that agent $\ell$ has a higher willingness to share estimates. The second property is motivated by the fact that lower values for $\theta_{k\ell}(t)$ mean that agent $k$ has rarely shared estimates with agent $\ell$ in the recent past. Therefore, it can be expected that agent $\ell$ will have lower willingness to share information for lower values of $\theta_{k\ell}(t)$. Based on this argument, we suggest a first-order construction for measuring belief with respect to both $\theta_{\ell k}(t)$ and $\theta_{k\ell}(t)$ as follows (other constructions are of course possible; our intent is to keep the complexity of the solution low while meeting the desired objectives):

$$
B(a_{\ell k}(t)=1) = \theta_{k\ell}(t)\cdot\theta_{\ell k}(t), \qquad t \ge i \tag{47}
$$

which satisfies both properties in (46), and where $B(a_{\ell k}(t)=1) \in (0,1]$ since each reputation score lies in $[\varepsilon,1]$. Therefore, the reputation protocol implements (45) and (47) repeatedly. Each agent will then employ this reference knowledge to select its action as described next.
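The first-order belief construction (47) reduces to a single product; the short sketch below encodes it so that the two monotonicity properties in (46) can be checked numerically:

```python
def belief(theta_lk, theta_kl):
    """Belief B(a_lk(t) = 1) from (47): the product of the two scores.

    Non-decreasing in each argument, as required by (46), and valued in
    (0, 1] because each reputation score lies in [eps, 1].
    """
    return theta_kl * theta_lk
```

Note the symmetry of the construction: agent $k$'s belief about $\ell$ is discounted both by $\ell$'s observed record ($\theta_{\ell k}$) and by $k$'s own recent generosity toward $\ell$ ($\theta_{k\ell}$), so an agent that stops sharing also expects less sharing in return.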

### IV-B Best Response Rule

The belief measure (47) provides agent $k$ with additional information about agent $\ell$'s actions. That is, with (47), agent $k$ can treat $a_{\ell k}(t)$ as a Bernoulli random variable with distribution (47) for $t \ge i$. Then, the best response of agent $k$ is obtained by solving the following optimization problem:

$$
\min_{a_{k\ell}(i)\in\{0,1\}}\ J^{\infty\prime}_{k,i}\left[a_{k\ell}(i)\,\middle|\,w_{k,i-1}\right] \tag{48}
$$

where $J^{\infty\prime}_{k,i}$ is defined by (IV) and involves an additional expectation over the distribution of $a_{\ell k}(t)$ (compare with (II-C)). Similarly to Assumption 1, we assume the bounded rationality of the agents extends to the reputation scores for $t \ge i$.

###### Assumption 2 (Extended bounded rationality)

We extend the assumption of bounded rationality from (27) to also include:

$$
\theta_{\ell k}(t) = \theta_{\ell k}(i), \qquad \text{for } t \ge i \tag{50}
$$
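Under Assumption 2 the scores, and hence the belief in (47), are frozen at their time-$i$ values, so (48) reduces to comparing two scalar expected costs. The sketch below illustrates this with a hypothetical stand-in for the cost $J^{\infty\prime}_{k,i}$ (the parameters `comm_cost` and `gain` are illustrative, not from the paper): cooperating incurs a fixed communication cost and, with probability $B$, earns an estimation gain from the reciprocating neighbor.

```python
def best_response(theta_lk, theta_kl, comm_cost, gain):
    """Choose a_kl(i) in {0, 1} under a hypothetical expected-cost model.

    B = theta_kl * theta_lk is the frozen belief (47) that the
    neighbor will share.  Cooperating (action 1) pays comm_cost and
    earns `gain` with probability B; not cooperating (action 0) is
    free.  Both parameters are illustrative stand-ins for (48).
    """
    B = theta_kl * theta_lk
    expected_cost_of_cooperating = comm_cost - B * gain
    return 1 if expected_cost_of_cooperating < 0.0 else 0
```

Even this toy model reproduces the qualitative behavior described earlier: when mutual reputations are high the belief is large and cooperation is the best response, while decayed reputations drive the belief, and hence the incentive to share, toward zero.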

Now, using pure strategies, the best response of agent $k$ is to select the action $a_{k\ell}(i)$ such that

 akℓ(i)=⎧⎪ ⎪⎨⎪ ⎪⎩1,if