# Energy-based Surprise Minimization for Multi-Agent Value Factorization

## Abstract

Multi-Agent Reinforcement Learning (MARL) has demonstrated significant success in training decentralised policies in a centralised manner by making use of value factorization methods. However, addressing surprise across spurious states and approximation bias remain open problems for multi-agent settings. We introduce the Energy-based MIXer (EMIX), an algorithm which minimizes surprise utilizing the energy across agents. Our contributions are threefold: (1) EMIX introduces a novel surprise minimization technique across multiple agents in the case of multi-agent partially-observable settings. (2) EMIX highlights the first practical use of energy functions in MARL (to our knowledge) with theoretical guarantees and experimental validation of the energy operator. Lastly, (3) EMIX presents a novel technique for addressing overestimation bias across agents in MARL. When evaluated on a range of challenging StarCraft II micromanagement scenarios, EMIX demonstrates consistent state-of-the-art performance for multi-agent surprise minimization. Moreover, our ablation study highlights the necessity of the energy-based scheme and the need for elimination of overestimation bias in MARL. Our implementation of EMIX and videos of agents are available at https://karush17.github.io/emix-web/.

###### Key Words.:

Value Factorization, Energy, Multi-Agent, EMIX, Surprise.

AAMAS '21: Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), May 3–7, 2021, London, UK, U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.).

Affiliations: CMTE, University of Toronto; RBC Capital Markets; Multimedia Laboratory, University of Toronto; CMTE, University of Toronto.

## 1 Introduction

Reinforcement Learning (RL) has seen tremendous growth in applications such as arcade games Mnih et al. (2013), board games Silver et al. (2016); Schrittwieser et al. (2019), robot control tasks Lillicrap et al. (2015); Schulman et al. (2017b) and lately, real-time games Vinyals et al. (2017). The rise of RL has led to an increasing interest in the study of multi-agent systems Lowe et al. (2017); Vinyals et al. (2019), commonly known as Multi-Agent Reinforcement Learning (MARL). In the case of partially observable settings, MARL enables the learning of policies with centralised training and decentralised control Kraemer and Banerjee (2016). This has proven to be useful for exploiting value-based methods which are often found to be sample-inefficient Tan (1993); Foerster et al. (2017).

Value Factorization Sunehag et al. (2018); Rashid et al. (2018) is a common technique which enables the joint value function to be represented as a combination of individual value functions conditioned on states and actions. In the case of Value Decomposition Network (VDN) Sunehag et al. (2018), a linear additive factorization is carried out whereas QMIX Rashid et al. (2018) generalizes the factorization to a non-linear combination, hence improving the expressive power of centralised action-value functions. Furthermore, monotonicity constraints in QMIX enable scalability in the number of agents. On the other hand, factorization across multiple value functions leads to the aggregation of approximation biases Hasselt (2010); Hasselt et al. (2016) originating from overoptimistic estimations in action values Fujimoto et al. (2018); Lan et al. (2020) which remain an open problem in the case of multi-agent settings. Moreover, value factorization methods are conditioned on states and do not account for spurious changes in partially-observed observations, commonly referred to as surprise Achiam and Sastry (2017).
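The difference between the additive and the monotonic non-linear factorization can be illustrated with a minimal sketch. It assumes a toy two-layer mixer with hand-picked weights; in QMIX the mixing weights are produced by hypernetworks conditioned on the state, which is omitted here. The function names `vdn_mix` and `monotonic_mix` and all numeric values are hypothetical.

```python
import math


def vdn_mix(agent_qs):
    """VDN-style factorization: joint value is the sum of per-agent values."""
    return sum(agent_qs)


def elu(x):
    """ELU activation, monotonically increasing in x."""
    return x if x > 0 else math.exp(x) - 1.0


def monotonic_mix(agent_qs, w1, b1, w2, b2):
    """QMIX-style mixing sketch with non-negative (absolute-valued) weights.

    Non-negative weights and a monotonic activation keep the joint value
    non-decreasing in every per-agent value.
    """
    hidden = [
        elu(sum(abs(w) * q for w, q in zip(row, agent_qs)) + b)
        for row, b in zip(w1, b1)
    ]
    return sum(abs(w) * h for w, h in zip(w2, hidden)) + b2


agent_qs = [1.0, -0.5, 2.0]
w1 = [[0.3, -1.2, 0.7], [-0.5, 0.4, 0.1]]  # hypothetical hypernetwork outputs
b1 = [0.0, -0.2]
w2 = [0.6, -0.9]
b2 = 0.1

q_vdn = vdn_mix(agent_qs)
q_mix = monotonic_mix(agent_qs, w1, b1, w2, b2)
# Raising one agent's value must not lower the mixed joint value.
q_mix_raised = monotonic_mix([1.5, -0.5, 2.0], w1, b1, w2, b2)
```

The absolute value on the weights is one common way to enforce monotonicity; it is used here only to make the constraint visible.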

Surprise minimization Berseth et al. (2019) is a recent phenomenon observed in the case of single-agent RL methods which deals with environments consisting of spurious states. In the case of model-based RL Kaiser et al. (2019), surprise minimization is used as an effective planning tool in the agent’s model Berseth et al. (2019) whereas in the case of model-free RL, surprise minimization is witnessed as an intrinsic motivation Achiam and Sastry (2017); Macedo et al. (2004) or generalization problem Chen (2020). On the other hand, MARL does not account for surprise across agents as a result of which agents remain unaware of drastic changes in the environment Macedo and Cardoso (2005). Thus, surprise minimization in the case of multi-agent settings requires attention from a critical standpoint.

We introduce the Energy-based MIXer (EMIX), an algorithm based on QMIX which minimizes surprise utilizing the energy across agents. Our contributions are threefold: (1) EMIX introduces a novel surprise minimization technique across multiple agents in the case of multi-agent partially-observable settings. (2) EMIX highlights the first practical use of energy functions in MARL (to our knowledge) with theoretical guarantees and experimental validation of the energy operator. Lastly, (3) EMIX presents a novel technique for addressing overestimation bias across agents in MARL which, unlike previous single-agent methods Lan et al. (2020), does not rely on a computationally-expensive family of action value functions. When evaluated on a range of challenging StarCraft II scenarios Samvelyan et al. (2019), EMIX demonstrates state-of-the-art performance for multi-agent surprise minimization by significantly and consistently improving the performance of QMIX. Moreover, our ablation study highlights the necessity of our energy-based scheme and the need for elimination of overestimation bias in MARL.

## 2 The Value Factorization Problem

### 2.1 Preliminaries

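We review the cooperative MARL setup. The problem is modeled as a Partially Observable Markov Decision Process (POMDP) Sutton and Barto (2018) defined by the tuple $(\mathcal{S}, \mathcal{A}, r, N, P, Z, O, \gamma)$ where the state space $\mathcal{S}$ and action space $\mathcal{A}$ are discrete, $r$ presents the reward observed by agents $a \in N$ where $N$ is the set of all agents, $P$ presents the unknown transition model consisting of the transition probability to the next state $s' \in \mathcal{S}$ given the current state $s \in \mathcal{S}$ and joint action $u \in \mathcal{A}$ at time step $t$ and $\gamma$ is the discount factor. We consider a partially observable setting in which each agent draws individual observations $z \in Z$ according to the observation function $O$. We consider a joint policy $\pi_{\theta}(u \mid s)$ as a function of model parameters $\theta$. Standard RL defines the agent's objective to maximize the expected discounted reward $\mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \gamma^{t} r(s_t, u_t)\right]$ as a function of the parameters $\theta$. The action-value function for an agent is represented as $Q(s, u; \theta) = \mathbb{E}_{\pi_{\theta}}\left[\sum_{t} \gamma^{t} r(s_t, u_t) \mid s_0 = s, u_0 = u\right]$ which is the expected sum of payoffs obtained in state $s$ upon performing action $u$ by following the policy $\pi_{\theta}$. We denote the optimal policy $\pi^{*}$ such that $Q^{\pi^{*}}(s, u) \geq Q^{\pi}(s, u)$ for all $s \in \mathcal{S}$, $u \in \mathcal{A}$ and all policies $\pi$. In the case of multiple agents, the joint optimal policy can be expressed as the Nash Equilibrium Nash (1950) of the Stochastic Markov Game as $\pi^{*} = (\pi^{1,*}, \dots, \pi^{|N|,*})$ such that no agent $a$ can improve its action value by unilaterally deviating from $\pi^{a,*}$. Q-Learning is an off-policy, model-free algorithm suitable for continuous and episodic tasks. The algorithm uses semi-gradient descent to minimize the Temporal Difference (TD) error: $L(\theta) = \mathbb{E}_{b \sim R}\left[\frac{1}{2}\left(y - Q(s, u; \theta)\right)^{2}\right]$ where $y = r + \gamma \max_{u'} Q(s', u'; \theta^{-})$ is the TD target consisting of $\theta^{-}$ as the target parameters and $b$ is the batch sampled from memory $R$.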
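The TD error described above can be sketched for a tabular value function. This is a minimal illustration under stated assumptions: the tables `q` and `q_target`, the tuple layout of the batch, the `done` flag and the factor of one half in the squared error are choices made for this example rather than details taken from the text.

```python
def td_loss(batch, q, q_target, actions, gamma=0.99):
    """Mean of 0.5 * (y - Q(s, u))^2 over a batch of transitions.

    batch: list of (s, u, r, s_next, done) tuples (hypothetical layout)
    q, q_target: dicts mapping (state, action) -> value
    """
    total = 0.0
    for s, u, r, s_next, done in batch:
        # Bootstrap from the target table; terminal transitions do not bootstrap.
        bootstrap = 0.0 if done else max(q_target[(s_next, a)] for a in actions)
        y = r + gamma * bootstrap  # TD target
        total += 0.5 * (y - q[(s, u)]) ** 2
    return total / len(batch)


actions = [0, 1]
q = {(0, 0): 1.0, (0, 1): 0.0, (1, 0): 0.5, (1, 1): 2.0}
q_target = dict(q)  # target parameters start as a copy of the online table
batch = [(0, 0, 1.0, 1, False)]
loss = td_loss(batch, q, q_target, actions, gamma=0.5)
```

Here the target is $y = 1 + 0.5 \cdot 2 = 2$ and the current estimate is $1$, so the loss is $0.5$.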

### 2.2 Surprise Minimization

Despite the recent success of value-based methods Mnih et al. (2016); Hessel et al. (2017) RL agents suffer from spurious state spaces and encounter sudden changes in trajectories. These anomalous transitions between consecutive states are termed as surprise Achiam and Sastry (2017). Quantitatively, surprise can be inferred as a measure of deviation Berseth et al. (2019); Chen (2020) among states encountered by the agent during its interaction with the environment. While exploring Burda et al. (2019); Thrun (1992) the environment, agents tend to have higher deviation among states which is gradually reduced by gaining a significant understanding of state-action transitions. Agents can then start selecting optimal actions which is essential for maximizing reward. These actions often lead the agent to spurious experiences which the agent may not have encountered. In the case of model-based RL, agents can leverage spurious experiences Berseth et al. (2019) and plan effectively for future steps. On the other hand, in the case of model-free RL, surprise results in sample-inefficient learning Achiam and Sastry (2017). This can be tackled by making use of rigorous exploration strategies Stadie et al. (2015); Lee et al. (2019). However, such techniques do not necessarily scale to high-dimensional tasks and often require extrinsic feature engineering Kulkarni et al. (2016) and meta models Gupta et al. (2018). A suitable way to tackle high-dimensional dynamics is by utilizing surprise as a penalty on the reward Chen (2020). This leads to improved generalization. However, such solutions do not show evidence for multiple agents consisting of individual partial observations Ren et al. (2005).

### 2.3 Overestimation Bias

Recent advances Fujimoto et al. (2018) in value-based methods have addressed overestimation bias (also known as approximation error) which stems from the value estimates approximated by the function approximator. Such methods make use of dual target functions Wang et al. (2016) which improve stability in the Bellman updates. This has led to a significant improvement in single-agent off-policy RL methods Haarnoja et al. (2018b). However, MARL value-based methods continue to suffer from overestimation bias Ackermann et al. (2019); Lyu and Amato (2020). Figure 1 highlights the overestimation bias originating from the overoptimistic estimations of the target value estimator. Plots present the variation of absolute TD error during learning for state-of-the-art MARL methods, namely Independent Q-Learning Tan (1993), Counterfactual Multi-Agent Policy Gradients (COMA) Foerster et al. (2017), VDN Sunehag et al. (2018) and QMIX Rashid et al. (2018). The significant rise in error values of value factorization methods such as QMIX and VDN reflects the aggregation of errors from individual $Q$-value functions. Thus, overestimation bias in MARL value factorization requires attention from a critical standpoint.
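The origin of overestimation bias, and the effect of a dual target, can be illustrated with a small simulation. This is a sketch under simplifying assumptions: every action has a true value of zero, estimates are corrupted by independent Gaussian noise, and the dual target takes an element-wise minimum of two estimators before the maximum over actions. The function name and all parameters are hypothetical.

```python
import random


def estimate_bias(num_actions=5, noise_std=1.0, trials=20000, seed=0):
    """Average max-based target for one estimator and for a twin minimum.

    The true value of every action is 0, so any positive average is bias.
    """
    rng = random.Random(seed)
    single, twin = 0.0, 0.0
    for _ in range(trials):
        est_a = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        est_b = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        single += max(est_a)  # greedy target from one noisy estimator
        twin += max(min(a, b) for a, b in zip(est_a, est_b))  # twin-min target
    return single / trials, twin / trials


single_bias, twin_bias = estimate_bias()
```

The single-estimator target is biased upward because the maximum picks out positive noise, while the twin minimum pulls the target back toward the true value.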

Various MARL methods Fu et al. (2020) make use of a dual architecture approach which increases the stability in value factorization. However, these methods are only applicable to a small set of micromanagement tasks and do not generalize to scenarios consisting of a larger number of opponents and environments with different dynamics. Another suitable approach observed in literature is the usage of weighted Bellman updates in double Q-learning Zheng et al. (2018). The Weighted Double Deep $Q$-Network (WDDQN) provides stability and sample efficiency for fully-observable MDPs. In the case of cooperative POMDPs, Weighted-QMIX (WQMIX) Rashid et al. (2020) yields a more sophisticated weighting scheme which aids in the retrieval of the optimal policy Nguyen et al. (2020). Although suitable for value factorization in challenging micromanagement tasks, the method needs to be carefully hand-engineered and, in the case of multiple weighting schemes, does not include a basis for selection. A more practical approach in the case of single-agent methods is the use of a family of $Q$-functions Lan et al. (2020) wherein each estimator is optimized individually. Such a framework provides a generalized method for training agents with greedy policies and minimum approximation error. Although successful in single-agent settings, generalized Q-function methods do not scale well in the number of agents Nguyen et al. (2020) since each agent requires a family of $Q$-functions which needs to be updated concurrently. Thus, addressing overestimation bias from value factorization in cooperative multi-agent frameworks requires a scalable and sample-efficient perspective.

### 2.4 Energy-based Models

Energy-Based Models (EBMs) LeCun et al. (2006, 2007) have been successfully applied in the field of machine learning Teh et al. (2003) and probabilistic inference MacKay (2002). A typical EBM formulates the equilibrium probabilities $P(v, h) = \exp(-E(v, h)) / \sum_{v', h'} \exp(-E(v', h'))$ Sallans and Hinton (2004) via a Boltzmann distribution Levine and Abbeel (2014) where $v$ and $h$ are the values of the visible and hidden variables and $v'$ and $h'$ are all the possible configurations of the visible and hidden variables respectively. The probability distribution over all the visible variables can be obtained by summing over all possible configurations of the hidden variables. This is mathematically expressed in Equation 1.

$$P(v) = \frac{\sum_{h} \exp(-E(v, h))}{\sum_{v', h'} \exp(-E(v', h'))} = \frac{\exp(-F(v))}{Z} \tag{1}$$

Here, $F(v) = -\log \sum_{h} \exp(-E(v, h))$ is called the equilibrium free energy which is the minimum of the variational free energy and $Z = \sum_{v'} \exp(-F(v'))$ is the partition function.
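The relation between the joint Boltzmann distribution, the free energy and the partition function can be checked numerically on a toy model. The sketch below assumes one binary visible and one binary hidden variable and a hand-written energy function; the `energy` function and its coefficients are hypothetical and chosen only for illustration.

```python
import math


def energy(v, h):
    """Hypothetical toy energy over binary visible v and hidden h."""
    return -(1.5 * v * h) - 0.5 * v + 0.2 * h


def free_energy(v, hidden_configs):
    """F(v) = -log sum_h exp(-E(v, h))."""
    return -math.log(sum(math.exp(-energy(v, h)) for h in hidden_configs))


def visible_distribution(visible_configs, hidden_configs):
    """P(v) = exp(-F(v)) / Z with Z = sum_v' exp(-F(v'))."""
    free = {v: free_energy(v, hidden_configs) for v in visible_configs}
    z = sum(math.exp(-f) for f in free.values())
    return {v: math.exp(-f) / z for v, f in free.items()}


def visible_distribution_direct(visible_configs, hidden_configs):
    """Same marginal, computed by summing joint Boltzmann weights over h."""
    weights = {
        v: sum(math.exp(-energy(v, h)) for h in hidden_configs)
        for v in visible_configs
    }
    total = sum(weights.values())
    return {v: w / total for v, w in weights.items()}


p_free = visible_distribution([0, 1], [0, 1])
p_direct = visible_distribution_direct([0, 1], [0, 1])
```

Both routes give the same marginal over the visible variable, which is the identity that the free-energy form of the Boltzmann distribution relies on.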

EBMs have been successfully implemented in single-agent RL methods O’Donoghue et al. (2016); Haarnoja et al. (2017). These typically make use of Boltzmann distributions to approximate policies Levine and Abbeel (2014). Such a formulation results in the minimization of free energy within the agent. While policy approximation depicts promise in the case of unknown dynamics, inference methods Toussaint (2009) play a key role in optimizing goal-oriented behavior. A second type of usage of EBMs follows the maximization of entropy Ziebart et al. (2008). The maximum entropy framework Haarnoja et al. (2018b) highlighted in Soft Q-Learning (SQL) Haarnoja et al. (2017) allows the agent to obey a policy which maximizes its reward and entropy concurrently. Maximization of agent’s entropy results in diverse and adaptive behaviors Ziebart (2010) which may be difficult to accomplish using standard exploration techniques Burda et al. (2019); Thrun (1992). Moreover, the maximum entropy framework is equivalent to approximate inference in the case of policy gradient methods Schulman et al. (2017a). Such a connection between likelihood ratio gradient techniques and energy-based formulations leads to diverse and robust policies Haarnoja (2018) and their hierarchical extensions Haarnoja et al. (2018a) which preserve the lower levels of hierarchies.

In the case of MARL, EBMs have witnessed limited applicability as a result of the increasing number of agents and complexity within each agent Buşoniu et al. (2010). While the probabilistic framework is readily transferable to opponent-aware multi-agent systems Wen et al. (2019), cooperative settings consisting of coordination between agents require a firm formulation of energy which is scalable in the number of agents Grau-Moya et al. (2018) and accounts for environments consisting of spurious states Wei et al. (2018).

## 3 Energy-based Surprise Minimization

In this section we introduce the novel surprise minimizing EMIX agent. The motivation behind EMIX stems from spurious states and overestimation bias among agents in the case of partially-observed settings. EMIX aims to address these challenges by making use of an energy-based surprise value function in conjunction with dual target function approximators.

### 3.1 The Surprise Minimization Objective

Firstly, we formulate the energy-based objective consisting of surprise as a function of states $s$, joint actions $u$ and deviation $\sigma$ within states for each agent $a$. We call this function the surprise value function $V^{a}_{surp}(s, u, \sigma)$ which serves as a mapping from agent and environment dynamics to surprise. We then define an energy operator $\mathcal{T}$, presented in Equation 2, which sums the free energy across all agents.

$$\mathcal{T} V^{a}_{surp}(s, u, \sigma) = \log \sum_{a \in N} \exp\left(V^{a}_{surp}(s, u, \sigma)\right) \tag{2}$$

We make use of the Mellowmax operator Asadi and Littman (2017) as our energy operator. The energy operator is similar to the SQL energy formulation Haarnoja et al. (2017) where the energy across different actions is evaluated. In our case, inference is carried out across all agents with actions as prior variables. However, in the special case of using an EBM as a $Q$-function, the EMIX objective reduces to the SQL objective. Details on the connection between SQL and our energy formulation can be found in section 6.
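A minimal sketch of an energy operator applied across agents is given below. It shows two variants: a plain log-sum-exp over per-agent surprise values, and the Mellowmax form of Asadi and Littman (2017), which additionally averages inside the logarithm and scales by a parameter `omega`. Which exact variant and normalization EMIX uses is an assumption here, and the input values are hypothetical.

```python
import math


def energy_operator(values):
    """Numerically stable log-sum-exp over per-agent values."""
    m = max(values)  # subtract the maximum to avoid overflow in exp
    return m + math.log(sum(math.exp(v - m) for v in values))


def mellowmax(values, omega=1.0):
    """Mellowmax: log(mean(exp(omega * x))) / omega."""
    scaled = [omega * v for v in values]
    return (energy_operator(scaled) - math.log(len(values))) / omega


surprise_per_agent = [0.2, -0.4, 1.3]  # hypothetical surprise values
energy = energy_operator(surprise_per_agent)
soft_max = mellowmax(surprise_per_agent, omega=5.0)
```

Mellowmax always lies between the minimum and the maximum of its inputs, whereas the plain log-sum-exp is at least as large as the maximum.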

Our choice of the energy operator is based on its unique mathematical properties which result in better convergence. Of these properties, the most useful result is that the energy operator forms a contraction on the surprise value function indicating a guaranteed minimization of surprise within agents. This is formally stated in Theorem 1. Proof of Theorem 1 can be found in section 7.

###### Theorem 1.

Given a surprise value function $V^{a}_{surp}(s, u, \sigma)$, the energy operator $\mathcal{T}$ forms a contraction on $V^{a}_{surp}(s, u, \sigma)$.
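A related and weaker property can be checked numerically: a log-sum-exp operator is a non-expansion in the max-norm, meaning that two sets of per-agent values never end up further apart after the operator is applied than they were before. The sketch below is a sanity check under that assumed form of the operator, not a proof of Theorem 1; the sampling ranges and function names are hypothetical.

```python
import math
import random


def log_sum_exp(values):
    """Numerically stable log-sum-exp."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def max_norm_gap(x, y):
    """Largest element-wise absolute difference between x and y."""
    return max(abs(a - b) for a, b in zip(x, y))


def check_non_expansion(num_agents=4, trials=1000, seed=0):
    """Return True if |T(x) - T(y)| <= max_i |x_i - y_i| on all samples."""
    rng = random.Random(seed)
    for _ in range(trials):
        x = [rng.uniform(-5.0, 5.0) for _ in range(num_agents)]
        y = [rng.uniform(-5.0, 5.0) for _ in range(num_agents)]
        gap_after = abs(log_sum_exp(x) - log_sum_exp(y))
        if gap_after > max_norm_gap(x, y) + 1e-12:
            return False
    return True


holds = check_non_expansion()
```

The property follows from the gradient of log-sum-exp being a softmax, whose entries are non-negative and sum to one.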

The energy-based surprise minimization objective can then be formulated by simply adding the approximated energy-based surprise to the initial Bellman objective as expressed below.

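$$L(\theta) = \mathbb{E}_{b \sim R}\left[\frac{1}{2}\left(y - \left(Q(s, u; \theta) + \beta E\right)\right)^{2}\right], \qquad E = \log \frac{\sum_{a \in N} \exp\left(V^{a}_{surp}(s, u, \sigma; \phi)\right)}{\sum_{a \in N} \exp\left(V^{a}_{surp}(s', u', \sigma'; \phi)\right)} \tag{3}$$

Here, $E$ is defined as the surprise ratio with $\beta$ as a temperature parameter and $\sigma'$ as the deviation among next states in the batch. The surprise value function is approximated by a universal function approximator (in our case a neural network) with its parameters as $\phi$. $V^{a}_{surp}(s, u, \sigma)$ is expressed as the negative free energy and $\sum_{a \in N} \exp\left(V^{a}_{surp}(s', u', \sigma')\right)$ the partition function. Alternatively, $V^{a}_{surp}(s', u', \sigma')$ can be formulated as the negative free energy with $\sum_{a \in N} \exp\left(V^{a}_{surp}(s, u, \sigma)\right)$ as the partition function. The objective incorporates the minimization of surprise across all agents as minimizing the energy in spurious states. Such a formulation of surprise acts as intrinsic motivation and at the same time provides robustness to multi-agent behavior. Furthermore, the energy formulation in the form of an energy ratio is suitable as it guarantees convergence to minimum surprise at the optimal policy $\pi^{*}$. This is formally expressed in Theorem 2 with its corresponding proof in section 7.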
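The way a surprise term enters the Bellman objective can be sketched as follows. The sketch assumes that the energy ratio is the logarithm of the ratio between the exponentiated surprise values summed over agents for the current transition and for the next transition, and that it is added to the joint value estimate with a temperature weight `beta`. These forms, the orientation of the ratio and all names are assumptions made for illustration.

```python
import math


def log_sum_exp(values):
    """Numerically stable log-sum-exp."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def energy_ratio(v_surp, v_surp_next):
    """log(sum_a exp(V_a) / sum_a exp(V'_a)) as a difference of log-sum-exps."""
    return log_sum_exp(v_surp) - log_sum_exp(v_surp_next)


def surprise_augmented_loss(y, q_tot, v_surp, v_surp_next, beta):
    """0.5 * (y - (Q_tot + beta * E))^2 for a single transition."""
    e = energy_ratio(v_surp, v_surp_next)
    return 0.5 * (y - (q_tot + beta * e)) ** 2


# Hypothetical per-agent surprise values for current and next transitions.
v_now = [0.3, 0.1, -0.2]
v_next = [0.6, 0.4, 0.1]
loss_plain = surprise_augmented_loss(2.0, 1.0, v_now, v_now, beta=0.01)
loss_surprise = surprise_augmented_loss(2.0, 1.0, v_now, v_next, beta=0.01)
```

When the surprise values of consecutive transitions agree, the ratio vanishes and the loss reduces to the plain squared TD error.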

###### Theorem 2.

Upon agent's convergence to an optimal policy $\pi^{*}$, total energy of $\pi^{*}$, expressed by $E$, will reach a thermal equilibrium consisting of minimum surprise among consecutive states $s$ and $s'$.

The objective can be modified to tackle approximation error in the target $Q$-values. We introduce a total of $M$ target approximators making $\{\theta^{-}_{i}\}_{i=1}^{M}$ the set of target approximators. However, unlike generalized $Q$-learning Lan et al. (2020), we do not instantiate another $Q$-function but simply keep a copy of $\theta^{-}$ and select the target estimates with minimum values during optimization. This allows the objective to address overestimation bias in a scalable manner without using multiple $Q$-functions. The final EMIX objective is mathematically expressed in Equation 4.

$$L(\theta) = \mathbb{E}_{b \sim R}\left[\frac{1}{2}\left(r + \gamma \min_{i} \max_{u'} Q(s', u'; \theta^{-}_{i}) - \left(Q(s, u; \theta) + \beta E\right)\right)^{2}\right] \tag{4}$$

Here, $Q(s', u'; \theta^{-}_{i})$ depicts each of the target estimators with $\min_{i}$ indicating the estimate with minimum error.
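Selecting the minimum among several target estimates can be sketched in a few lines. The example assumes that each target approximator has already produced its own bootstrap value for the next state and that the smallest one is used in the TD target; the function name, the `done` flag and the numbers are hypothetical.

```python
def min_target(reward, target_estimates, gamma=0.99, done=False):
    """TD target that bootstraps from the smallest target estimate.

    target_estimates: one bootstrap value per target approximator.
    """
    if done:
        return reward  # terminal transitions do not bootstrap
    return reward + gamma * min(target_estimates)


# Two target approximators (the dual-target case).
y = min_target(reward=1.0, target_estimates=[2.0, 1.5], gamma=0.5)
```

With estimates of 2.0 and 1.5 the target bootstraps from 1.5, giving $y = 1 + 0.5 \cdot 1.5 = 1.75$.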

### 3.2 Energy-based MIXer (EMIX)

Algorithm 1 presents the EMIX algorithm. We initialize surprise value function parameters $\phi$, mixer parameters $\theta$, target parameters $\theta^{-}_{i}$ for each target approximator $i$ and lastly the agent and hypernetwork parameters of QMIX. A learning rate $\alpha$, temperature $\beta$ and replay buffer $R$ are instantiated. During environment interactions, agents in state $s$ perform joint action $u$, observe reward $r$ and transition to next-states $s'$. These experiences are collected in $R$ as $(s, u, r, s')$ tuples. In order to make the agents explore the environment, an $\epsilon$-greedy schedule is used similar to the original QMIX Rashid et al. (2018) implementation. During the update steps, a random batch $b$ is sampled from $R$. The total $Q$-value is computed by the mixer network with its inputs as the $Q$-values of all the agents conditioned on $s$ via the hypernetworks. Similarly, the target mixers approximate the target $Q$-values conditioned on $s'$. In order to evaluate surprise within agents, we compute the standard deviations $\sigma$ and $\sigma'$ across all observations $z$ and $z'$ for each agent using $s$ and $s'$ respectively. The surprise value function, called the Surprise-Mixer, estimates the surprise $V^{a}_{surp}(s, u, \sigma)$ conditioned on $s$, $u$ and $\sigma$. The same computation is repeated using the Target-Surprise-Mixer for estimating surprise $V^{a}_{surp}(s', u', \sigma')$ within next-states in the batch. Application of the energy operator along the non-singleton agent dimension for $V^{a}_{surp}(s, u, \sigma)$ and $V^{a}_{surp}(s', u', \sigma')$ yields the energy ratio $E$ which is used in Equation 4 to evaluate $L(\theta)$. We then use batch gradient descent to update parameters of the mixer $\theta$. Target parameters $\theta^{-}_{i}$ are updated at a fixed interval of steps.
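The deviation computation in the update step can be sketched as follows. The sketch assumes that each agent's observation is a flat feature vector and that the deviation is the population standard deviation over those features, computed separately for each agent; the exact axis and estimator used by EMIX are assumptions, and the observation values are hypothetical.

```python
from statistics import pstdev


def agent_deviations(observations):
    """Return one standard deviation per agent.

    observations: list with one feature vector (list of floats) per agent.
    """
    return [pstdev(obs) for obs in observations]


# Hypothetical observations for two agents at the current and next step.
obs_now = [[1.0, 1.0, 1.0], [0.0, 2.0]]
obs_next = [[0.0, 4.0], [3.0, 3.0]]
sigma = agent_deviations(obs_now)
sigma_next = agent_deviations(obs_next)
```

The resulting per-agent deviations would then be passed, together with states and joint actions, to the surprise value function.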

We now take a closer look at the surprise-mixer approximating the surprise value function. In order to condition surprise on states, joint actions and the deviation among states, we construct an expressive architecture motivated by provable exploration in RL Misra et al. (2019). The original architecture constructs a state abstraction model for a classification setting. It maps the transitions consisting of states $s$, actions $u$ and next-states $s'$ to the conditional probability depicting whether the transition belongs to the same data distribution or not. Such models have proven to be efficient in the case of provable exploration Misra et al. (2019) as they allow the agent to learn an exploration policy for every value of abstract state related to the latent space. We borrow from this technique of provable exploration and extend it to the surprise minimization setting.

Figure 2 presents the expressive architecture of the surprise-mixer network utilized for surprise value function approximation and minimization. In contrast to the original state abstraction model Misra et al. (2019), the surprise-mixer maps transitions consisting of states $s$, joint actions $u$ and deviations $\sigma$ to a surprise value $V^{a}_{surp}(s, u, \sigma)$ for all agents $a \in N$. Hierarchical layers of the network aid in the extraction of latent space representations followed by the estimation of $V^{a}_{surp}(s, u, \sigma)$. The architecture allows the agent to learn a robust and surprise-agnostic policy for every value of abstract state related to the latent space. Moreover, the latent space accommodates every value of surprise across agents as a result of state deviations induced in the intermediate representations. We refrain from passing next-states $s'$ as part of the transitions in order to maintain causality in the system. Surprise value estimates are evaluated by the energy operator with the resulting expression becoming a part of the Bellman objective in Equation 4 comprising the total $Q$-values estimated by the mixer network.

## 4 Experiments

Our experiments aim to evaluate the performance, consistency, sample-efficiency and effectiveness of the various components of our method. Specifically, we aim to answer the following questions: (1) How does our method compare to current state-of-the-art MARL methods in terms of performance, consistency and sample efficiency? (2) How much does each component of the method contribute to its performance? (3) Does the algorithm validate the theoretical claims corresponding to its components?

### 4.1 Energy-based Surprise Minimization

We assess the performance and sample-efficiency of EMIX on multi-agent StarCraft II micromanagement scenarios Samvelyan et al. (2019). We select StarCraft II scenarios particularly for three reasons. Firstly, micromanagement scenarios consist of a larger number of agents with different action spaces. This requires a greater deal of coordination in comparison to other benchmarks Stone and Veloso (2000) which attend to other aspects of MARL performance such as opponent-awareness Busoniu et al. (2006). Secondly, micromanagement scenarios consist of partial observability wherein agents are restricted from responding to enemy fire and attacking enemies when they are in range Rashid et al. (2018). This allows agents to explore the environment effectively and find an optimal strategy purely based on collaboration rather than built-in game utilities. Lastly, micromanagement scenarios in StarCraft II consist of multiple opponents which introduce a greater degree of surprise within consecutive states. Irrespective of the time evolution of an episode, environment dynamics of each scenario change rapidly as the agents need to respond to enemy’s behavior.

We compare our method to current state-of-the-art methods, namely QMIX Rashid et al. (2018), VDN Sunehag et al. (2018), COMA Foerster et al. (2017) and IQL Tan (1993). In order to compare our surprise-based scheme against pre-existing surprise minimization mechanisms, we compare EMIX additionally to a model-free implementation of SMiRL Berseth et al. (2019) in QMIX. All methods were implemented using the PyMARL framework Samvelyan et al. (2019). The SMiRL component was additionally incorporated as per the update rule provided in Chen (2020). We use the generalized version of SMiRL as it demonstrates reduced variance across batches. We term this implementation SMiRL-QMIX for our comparisons. Agents were trained for a total of 5 random seeds consisting of 2 million steps in each environment. A total of 32 validation episodes, carried out at intervals of 10,000 steps, were interleaved with the agents' interactions. All baseline implementations consist of a Recurrent Neural Network (RNN) agent having memory consisting of past states and actions. We use an $\epsilon$-greedy exploration scheme wherein $\epsilon$ is annealed from 1 to 0.01 during the initial stages of training. Details related to the implementation of EMIX are presented in section 7.
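The exploration scheme can be sketched as a schedule plus an action-selection rule. The start and end values of 1 and 0.01 come from the text; the linear shape of the schedule and the annealing horizon `anneal_steps` are assumptions, as are the function names.

```python
import random


def epsilon_at(step, start=1.0, end=0.01, anneal_steps=50_000):
    """Linearly anneal epsilon from start to end, then hold it at end."""
    frac = min(max(step / anneal_steps, 0.0), 1.0)
    return start + frac * (end - start)


def epsilon_greedy(q_values, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
eps_start = epsilon_at(0)
eps_end = epsilon_at(2_000_000)
greedy_action = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0, rng=rng)
```

With `epsilon=0.0` the rule is fully greedy and selects the action with the highest value.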

In order to assess the performance and sample-efficiency of agents we evaluate the success rate percentages of each multi-agent system in completing each scenario. A completion of a scenario indicates the victory of the team over its enemies. Scenarios consist of varying difficulties in terms of the number of agents, map locations, distance from enemies, number of enemies and the level of difficulty.
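Aggregating success rates across random seeds can be sketched as follows. The per-seed numbers are hypothetical and the use of the sample standard deviation as the reported deviation is an assumption; the text does not state which estimator is used.

```python
from statistics import mean, stdev


def summarize(success_rates):
    """Return (mean, sample standard deviation) of per-seed success rates."""
    return mean(success_rates), stdev(success_rates)


# Hypothetical final success rates (%) for three seeds of one scenario.
per_seed = [90.0, 92.0, 94.0]
avg, dev = summarize(per_seed)
report = f"{avg:.2f} ± {dev:.2f}"
```

The formatted string follows the mean ± deviation layout used when reporting results per scenario.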

Table 1 presents the comparison of success rate percentages between EMIX and state-of-the-art MARL algorithms on the StarCraft II micromanagement scenarios. Along with the success rates, we also measure the deviation of performance across the 5 random seeds considered during experiments. Complete results for all scenarios including plots presenting agents' learning behaviors can be viewed in section 6. We evaluate the performance of agents on a total of 12 scenarios. Naming conventions of the scenarios are in accordance with the multi-agent StarCraft II micromanagement framework Samvelyan et al. (2019) wherein s represents an agent belonging to the stalker unit, z signifies a zealot unit, m indicates a marine unit and sc implies a spine crawler unit. Corresponding to each scenario, the algorithm demonstrating the highest success rate has its entry highlighted. Out of the 12 scenarios considered, EMIX presents the highest success rate on 9, indicating the suitability of the proposed approach. In scenarios such as 3m, 3s5z and 8m, performance gains of EMIX over other methods such as QMIX and VDN are incremental as a result of the small number of agents and simplicity of tasks. On the other hand, EMIX presents significant performance gains in the cases of so_many_baneling and 5m_vs_6m which consist of a large number of opponents and a greater difficulty level respectively.

When compared to QMIX, EMIX depicts improved success rates on all of the 12 scenarios. For instance, in scenarios such as 3s_vs_5z, 8m_vs_9m and 5m_vs_6m QMIX presents sub-optimal performance. On the other hand, EMIX utilizes a comparatively improved joint policy and yields better convergence in a sample-efficient manner. Thus, EMIX augments the performance and sample-efficiency of the QMIX agent utilizing the energy-based surprise minimization scheme. Moreover, on comparing EMIX with SMiRL-QMIX, we note that EMIX demonstrates a higher average success rate. This highlights the suitability of the energy-based scheme in the case of a larger number of agents and complex environment dynamics for surprise minimization.

Scenarios | EMIX | SMiRL-QMIX | QMIX | VDN | COMA | IQL |
---|---|---|---|---|---|---|
2s_vs_1sc | 90.33 ± 0.72 | 88.41 ± 1.31 | 89.19 ± 3.23 | 91.42 ± 1.23 | **96.90 ± 0.54** | 86.07 ± 0.98 |
2s3z | **95.40 ± 0.45** | 94.93 ± 0.32 | 95.30 ± 1.28 | 92.03 ± 2.08 | 43.33 ± 2.70 | 55.74 ± 6.84 |
3m | **94.90 ± 0.39** | 93.94 ± 0.22 | 93.43 ± 0.20 | 94.58 ± 0.58 | 84.75 ± 7.93 | 94.79 ± 0.50 |
3s_vs_3z | **99.58 ± 0.07** | 97.63 ± 1.08 | 99.43 ± 0.20 | 97.90 ± 0.58 | 0.21 ± 0.54 | 92.32 ± 2.83 |
3s_vs_4z | **97.22 ± 0.73** | 0.24 ± 0.11 | 96.01 ± 3.93 | 94.29 ± 2.13 | 0.00 ± 0.00 | 59.75 ± 12.22 |
3s_vs_5z | 52.91 ± 11.80 | 0.00 ± 0.00 | 43.44 ± 7.09 | **68.51 ± 5.60** | 0.00 ± 0.00 | 18.14 ± 2.34 |
3s5z | **88.88 ± 1.07** | 88.53 ± 1.03 | 88.49 ± 2.32 | 63.58 ± 3.99 | 0.25 ± 0.11 | 7.05 ± 3.52 |
8m | **94.47 ± 1.38** | 89.96 ± 1.42 | 94.30 ± 2.90 | 90.26 ± 1.12 | 92.82 ± 0.53 | 83.53 ± 1.62 |
8m_vs_9m | **71.03 ± 2.69** | 69.90 ± 1.94 | 68.28 ± 2.30 | 58.81 ± 4.68 | 4.17 ± 0.58 | 28.48 ± 22.38 |
10m_vs_11m | 75.35 ± 2.30 | **77.85 ± 2.02** | 70.36 ± 2.87 | 71.81 ± 6.50 | 4.55 ± 0.73 | 32.27 ± 25.68 |
so_many_baneling | **95.87 ± 0.16** | 93.61 ± 0.94 | 93.35 ± 0.78 | 92.26 ± 1.06 | 91.65 ± 2.26 | 74.97 ± 6.52 |
5m_vs_6m | **37.07 ± 2.42** | 33.27 ± 2.79 | 34.42 ± 2.63 | 35.63 ± 3.32 | 0.52 ± 0.13 | 14.78 ± 2.72 |

In addition to state-of-the-art performance and sample-efficiency, EMIX also presents consistency in its learning across different random seeds. Deviation in success rates for EMIX is comparable to pre-existing value factorization methods such as QMIX and VDN. This indicates that the energy-based formulation of surprise minimization is compatible with value factorization and enables all the agents to exhibit the same optimal behavior across different runs thus, validating the suitability of the proposed approach.

### 4.2 Ablation Study

We now present the ablation study for the various components of EMIX. Our experiments aim to determine the effectiveness of the energy-based surprise minimization method and the multiple target $Q$-function scheme. Additionally, we also aim to determine the extent to which our proposed framework is viable in the standard QMIX objective.

#### Energy-based Surprise Minimization and Overestimation Bias

To weigh the effectiveness of the multiple target $Q$-function scheme we remove the energy-based surprise minimization from EMIX and replace it with the prior QMIX objective. For simplicity, we make use of two target $Q$-functions. We call this implementation of QMIX combined with the dual target function scheme TwinQMIX. We can now add the energy-based surprise minimization scheme to the TwinQMIX objective to retrieve the EMIX objective. Thus, we can compare between QMIX, TwinQMIX and EMIX to assess the contributions of each of the proposed methods. Figure 3 (top) presents the comparison of average success rates for QMIX, TwinQMIX and EMIX on six different scenarios. Agents were evaluated for a total of 2 million timesteps with the lines in the plot indicating average success rates and the shaded area as the deviation across 5 random seeds.

In comparison to QMIX, TwinQMIX adds stability to the original objective and yields performance gains in the form of improved success rates and sample-efficient convergence. For instance, in the 3s_vs_5z scenario, TwinQMIX significantly improves the performance of QMIX by reducing the overoptimistic estimates in the initial QMIX objective. However, in the 5m_vs_6m scenario, TwinQMIX falls short of optimal sample efficiency as a result of underoptimistic estimates yielded by the minimum operation.

On comparing TwinQMIX to EMIX we note that the energy-based surprise minimization scheme provides significant performance improvement in the modified QMIX objective. The EMIX objective demonstrates sample-efficiency and greater success rate values when compared to the TwinQMIX implementation. Additionally, the surprise minimization term adds to the stability of the TwinQMIX objective. This is demonstrated in the 5m_vs_6m scenario wherein the EMIX implementation improves the performance of TwinQMIX in comparison to QMIX by compensating for the underoptimistic estimations in the Bellman updates. In the case of the so_many_baneling scenario, EMIX tackles surprise effectively by preventing a significant drop in performance which is observed in the cases of QMIX and TwinQMIX. The so_many_baneling scenario consists of a large number of opponents (27 banelings) which force the agents to act quickly. This inherently induces a large amount of surprise in the form of state-to-state deviations. EMIX successfully tackles this hindrance and prevents the drop in success rates as a result of a surprise-robust policy.

#### Temperature Parameter

We now evaluate the extent of effectiveness of our surprise minimization objective in accordance with the temperature parameter $\beta$. Figure 3 (middle) presents the variation of success rates of the EMIX objective with $\beta$ during learning. EMIX was evaluated for three different values (as presented in the legend) of $\beta$ for a total of 5 random seeds. While the objective is robust to significant changes in the value of $\beta$, it presents sub-optimal performance in the case of high and low temperature values. In the case of high values, the objective suffers from overestimation error in the Bellman updates introduced by the energy term. The error offsets the bias removed by the dual $Q$-function scheme. On the other hand, low values contribute negligible surprise minimization and EMIX agents face spurious states as a result. For instance, the 5m_vs_6m and 8m_vs_9m scenarios highlight the necessity for a suitable value of $\beta$ in order to balance the surprise minimization objective with the initial Bellman updates.

The importance of the temperature parameter can be validated by assessing its role in surprise minimization. However, it is difficult to evaluate surprise minimization directly, as surprise value function estimates vary from state to state across different agents and thus present high variance during learning. This, in turn, hinders an intuitive understanding of the surprise distribution. We instead observe the variation of the energy ratio E, as it is a collection of surprise-based sample estimates across the batch. Additionally, E consists of prior samples for which makes inference across different agents tractable. Figure 3 (bottom) presents the variation of the energy ratio E with the temperature parameter during learning. We compare two stable variations of E at and . The objective minimizes E over the course of learning and attains thermal equilibrium with minimum energy. Intuitively, equilibrium corresponds to convergence to the optimal policy, which validates the claim in Theorem 2. With , EMIX presents improved convergence and surprise minimization for 5 out of the 6 considered scenarios, hence validating the suitable choice of the temperature parameter. On the other hand, a lower temperature value does little to minimize surprise across agents. At high values, EMIX demonstrates unstable behavior as a result of increasing overestimation error. Thus, a suitable temperature value is critical for optimal performance and surprise-robust behavior.
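The trade-off described above can be illustrated with a small sketch. It assumes, purely for illustration, that the energy term enters the prediction as the temperature multiplied by the logarithm of the energy ratio; the function `emix_style_loss`, its arguments and the temperature values in the loop are hypothetical, and the exact objective is the one defined in the main text.

```python
import math


def emix_style_loss(target, q_tot, energy_ratio, temperature):
    """Squared TD error with a temperature-weighted log-energy term.

    Assumption: the energy term is added to the joint value estimate as
    temperature * log(energy_ratio). This is an illustrative form only.
    """
    if energy_ratio <= 0.0:
        raise ValueError("energy_ratio must be positive")
    prediction = q_tot + temperature * math.log(energy_ratio)
    return 0.5 * (target - prediction) ** 2


# Hypothetical values: the same transition under three temperatures.
for temperature in (0.0, 0.1, 10.0):
    loss = emix_style_loss(target=1.0, q_tot=0.8, energy_ratio=2.0,
                           temperature=temperature)
    print(temperature, loss)
```

With a temperature of zero the energy term vanishes and the loss reduces to the plain squared TD error, while a large temperature lets the energy term dominate the update, mirroring the low and high regimes discussed in this section.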

## 5 Conclusion

In this paper, we introduced the Energy-based MIXer (EMIX), a multi-agent value factorization algorithm based on QMIX which minimizes surprise utilizing the energy across agents. Our method proposes a novel energy-based surprise minimization objective that combines an energy operator with the surprise value function across multiple agents in multi-agent partially-observable settings. The EMIX objective satisfies theoretical guarantees of total energy and surprise minimization, and our experimental results validate these claims. Additionally, EMIX presents a novel technique for addressing overestimation bias across agents in MARL based on multiple target value approximators. Unlike previous single-agent methods, EMIX does not rely on a computationally-expensive family of action value functions. On a range of challenging StarCraft II micromanagement scenarios, EMIX demonstrates state-of-the-art performance and sample-efficiency for multi-agent surprise minimization by significantly improving on the original QMIX objective. Our ablations of the proposed energy-based scheme, multiple target approximators and temperature parameter highlight the suitability and significance of each of the proposed contributions. While EMIX serves as the first practical example (to our knowledge) of energy-based models in cooperative MARL, we leave the extension of the energy framework to opponent-aware and hierarchical MARL as future work.

## 6 Link to Full Appendix

The full appendix consisting of complete details of the study can be found at https://www.dropbox.com/Appendix.pdf.

We would like to thank the anonymous reviewers for providing valuable feedback on our work. We acknowledge Aravind Varier and Shashank Saurav for helpful discussions and the computing platform provided by the Department of Computer Science (DCS), University of Toronto. This work is supported by RBC Capital Markets, RBC Innovation Lab and the Center for Management of Technology and Entrepreneurship (CMTE).

## 7 Abridged Appendix

### Proofs

#### Theorem 1

Let us first define a norm on surprise values . Suppose ,

(5) |

Similarly, using with ,

(6) |

Results in Equation 5 and Equation 6 prove that the energy operation is a contraction.
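The two-sided bounding argument used in this proof can be illustrated with the standard sup-norm bound for a log-sum-exp operator. The sketch below assumes that the energy operator takes a log-sum-exp form over the agents; the symbols $V_1$, $V_2$, $\epsilon$ and $N$ are notation chosen for this illustration, the arguments of the value functions are suppressed, and the block is not a reproduction of Equations 5 and 6.

```latex
% Assumption: the energy operator is a log-sum-exp over agents a = 1..N,
% and V_1, V_2 are two hypothetical surprise value functions.
\begin{align*}
\epsilon &:= \lVert V_1 - V_2 \rVert_\infty
          = \max_{a} \lvert V_1^a - V_2^a \rvert, \\
\log \sum_{a=1}^{N} \exp(V_1^a)
  &\le \log \sum_{a=1}^{N} \exp(V_2^a + \epsilon)
   = \epsilon + \log \sum_{a=1}^{N} \exp(V_2^a), \\
\log \sum_{a=1}^{N} \exp(V_2^a)
  &\le \log \sum_{a=1}^{N} \exp(V_1^a + \epsilon)
   = \epsilon + \log \sum_{a=1}^{N} \exp(V_1^a), \\
\Rightarrow \quad
\Bigl\lvert \log \sum_{a=1}^{N} \exp(V_1^a)
          - \log \sum_{a=1}^{N} \exp(V_2^a) \Bigr\rvert
  &\le \lVert V_1 - V_2 \rVert_\infty .
\end{align*}
```

Under these assumptions, the two one-sided inequalities together bound the distance between the operator outputs by the distance between the inputs, with constant 1.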

#### Theorem 2

We begin by initializing a set of policies having energy ratios . Consider a policy with surprise value function . can then be expressed as

Assuming a constant surprise between and , we can express where is a constant. Using this expression in we get,

Similarly, ,…,. Thus, the energy residing in policy is proportional to the surprise between consecutive states and . Clearly, an optimal policy is the one with minimum surprise. Mathematically,

This proves that the optimal policy attains minimum surprise at thermal equilibrium.

### Implementation Details

#### Model Specifications

This section describes the model architecture for the surprise value function. At the lower level, the architecture consists of 3 independent networks called state_net, q_net and surp_net. Each of these networks consists of a single layer of 256 units with ReLU activations. As in the mixer network, we use the ReLU non-linearity in order to provide monotonicity constraints across agents. Combining a modular architecture with independent networks leads to a richer extraction of the joint latent transition space. The outputs of the three networks are concatenated and provided as input to the main_net, which consists of 256 units with ReLU activations. The main_net yields a single output, the surprise value, which is reduced along the agent dimension by the energy operator. Alternatively, deeper networks can be used to make the extracted embeddings more expressive. However, increasing the number of layers yields little improvement relative to the additional computational expense.
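The architecture described above can be sketched in plain Python as a forward pass. This is a minimal illustration under several assumptions that are not stated in the text: the input dimensions, the random initialization, the scalar output layer `head` following main_net, the sharing of one network across agents, and the log-sum-exp form of the energy operator are all hypothetical choices.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map; weights is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def init_layer(in_dim, out_dim, rng):
    """Small uniform random weights and zero biases (illustrative init only)."""
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    return weights, [0.0] * out_dim


class SurpriseValueNet:
    """Three independent branches feeding a shared main network."""

    def __init__(self, state_dim, q_dim, surp_dim, hidden=256, seed=0):
        rng = random.Random(seed)
        self.state_net = init_layer(state_dim, hidden, rng)
        self.q_net = init_layer(q_dim, hidden, rng)
        self.surp_net = init_layer(surp_dim, hidden, rng)
        self.main_net = init_layer(3 * hidden, hidden, rng)
        self.head = init_layer(hidden, 1, rng)  # assumed scalar output layer

    def forward(self, state, q_values, surp_inputs):
        # Each branch is a single ReLU layer; list addition concatenates them.
        features = (relu(linear(state, *self.state_net))
                    + relu(linear(q_values, *self.q_net))
                    + relu(linear(surp_inputs, *self.surp_net)))
        hidden = relu(linear(features, *self.main_net))
        return linear(hidden, *self.head)[0]


def energy_operator(values):
    """Numerically stable log-sum-exp over the agent dimension (assumed form)."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


# Hypothetical inputs for 3 agents, with a small hidden size for speed.
net = SurpriseValueNet(state_dim=4, q_dim=3, surp_dim=2, hidden=8)
rng = random.Random(1)
per_agent = [net.forward([rng.random() for _ in range(4)],
                         [rng.random() for _ in range(3)],
                         [rng.random() for _ in range(2)])
             for _ in range(3)]
print(energy_operator(per_agent))
```

The per-agent surprise values are computed independently and only combined in the final reduction, so the number of agents can change without altering the network itself.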

#### Hyperparameters

Table 2 presents hyperparameter values for EMIX. The temperature parameter was tuned between 0.001 and 1 in intervals of 0.01 with best performance observed at . A total of 2 target Q-functions were used, as the model is found to be robust to larger numbers of target Q-functions.

Hyperparameters | Values |
---|---|
batch size | |
learning rate | |
discount factor | |
target update interval | 200 episodes |
gradient clipping | 10 |
exploration schedule | to over 50000 steps |
mixer embedding size | 32 |
agent hidden size | 64 |
temperature | |
target Q-functions | 2 |

### References

- Surprise-based intrinsic motivation for deep reinforcement learning. External Links: 1703.01732 Cited by: §1, §2.2.
- Reducing overestimation bias in multi-agent domains using double centralized critics. arXiv preprint arXiv:1910.01465. Cited by: §2.3.
- An alternative softmax operator for reinforcement learning. In International Conference on Machine Learning, Cited by: §3.1.
- SMiRL: surprise minimizing rl in entropic environments. Cited by: §1, §2.2, §4.1.
- Large-scale study of curiosity-driven learning. In ICLR, Cited by: §2.2, §2.4.
- Multi-agent reinforcement learning: a survey. In 2006 9th International Conference on Control, Automation, Robotics and Vision, Cited by: §4.1.
- Multi-agent reinforcement learning: an overview. In Innovations in multi-agent systems and applications-1, Cited by: §2.4.
- Reinforcement learning generalization with surprise minimization. External Links: 2004.12399 Cited by: §1, §2.2, §4.1.
- Counterfactual multi-agent policy gradients. External Links: 1705.08926 Cited by: §1, §2.3, §4.1.
- Reducing overestimation in value mixing for cooperative deep multi-agent reinforcement learning. ICAART. Cited by: §2.3.
- Addressing function approximation error in actor-critic methods. External Links: 1802.09477 Cited by: §1, §2.3.
- Balancing two-player stochastic games with soft q-learning. arXiv preprint arXiv:1802.03216. Cited by: §2.4.
- Meta-reinforcement learning of structured exploration strategies. In Advances in Neural Information Processing Systems 31, Cited by: §2.2.
- Latent space policies for hierarchical reinforcement learning. arXiv preprint arXiv:1804.02808. Cited by: §2.4.
- Reinforcement learning with deep energy-based policies. arXiv preprint arXiv:1702.08165. Cited by: §2.4, §3.1.
- Soft actor-critic: off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Cited by: §2.3, §2.4.
- Acquiring diverse robot skills via maximum entropy deep reinforcement learning. Ph.D. Thesis, UC Berkeley. Cited by: §2.4.
- Double q-learning. In Advances in Neural Information Processing Systems 23, Cited by: §1.
- Deep reinforcement learning with double q-learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, Cited by: §1.
- Rainbow: combining improvements in deep reinforcement learning. arXiv preprint arXiv:1710.02298. Cited by: §2.2.
- Model-based reinforcement learning for atari. External Links: 1903.00374 Cited by: §1.
- Multi-agent reinforcement learning as a rehearsal for decentralized planning. Neurocomputing 190. Cited by: §1.
- Hierarchical deep reinforcement learning: integrating temporal abstraction and intrinsic motivation. In Advances in neural information processing systems, Cited by: §2.2.
- Maxmin q-learning: controlling the estimation bias of q-learning. In International Conference on Learning Representations, Cited by: §1, §2.3, §3.1.
- A tutorial on energy-based learning. Predicting structured data 1. Cited by: §2.4.
- Energy-based models in document recognition and computer vision. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 1. Cited by: §2.4.
- Efficient exploration via state marginal matching. arXiv preprint arXiv:1906.05274. Cited by: §2.2.
- Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems, Cited by: §2.4.
- Continuous control with deep reinforcement learning. CoRR abs/1509.02971. Cited by: §1.
- Multi-agent actor-critic for mixed cooperative-competitive environments. External Links: 1706.02275 Cited by: §1.
- Likelihood quantile networks for coordinating multi-agent reinforcement learning. In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems, Cited by: §2.3.
- The role of surprise, curiosity and hunger on exploration of unknown environments populated with entities. In 2005 portuguese conference on artificial intelligence, Cited by: §1.
- Modeling forms of surprise in artificial agents: empirical and theoretical study of surprise functions. In Proceedings of the Annual Meeting of the Cognitive Science Society, Vol. 26. Cited by: §1.
- Information theory, inference & learning algorithms. Cambridge University Press. Cited by: §2.4.
- Kinematic state abstraction and provably efficient rich-observation reinforcement learning. arXiv preprint arXiv:1911.05815. Cited by: §3.2, §3.2.
- Asynchronous methods for deep reinforcement learning. In International conference on machine learning, Cited by: §2.2.
- Playing atari with deep reinforcement learning. CoRR abs/1312.5602. External Links: 1312.5602 Cited by: §1.
- Equilibrium points in n-person games. Proceedings of the National Academy of Sciences 36 (1). Cited by: §2.1.
- Deep reinforcement learning for multiagent systems: a review of challenges, solutions, and applications. IEEE transactions on cybernetics. Cited by: §2.3.
- Combining policy gradient and q-learning. arXiv preprint arXiv:1611.01626. Cited by: §2.4.
- Weighted qmix: expanding monotonic value function factorisation. External Links: 2006.10800 Cited by: §2.3.
- QMIX: monotonic value function factorisation for deep multi-agent reinforcement learning. In ICML 2018: Proceedings of the Thirty-Fifth International Conference on Machine Learning, Cited by: §1, §2.3, §3.2, §4.1.
- A survey of consensus problems in multi-agent coordination. In Proceedings of the 2005, American Control Conference, 2005., Cited by: §2.2.
- Reinforcement learning with factored states and actions. Journal of Machine Learning Research 5. Cited by: §2.4.
- The starcraft multi-agent challenge. External Links: 1902.04043 Cited by: §1, §4.1.
- Mastering atari, go, chess and shogi by planning with a learned model. External Links: 1911.08265 Cited by: §1.
- Equivalence between policy gradients and soft q-learning. arXiv preprint arXiv:1704.06440. Cited by: §2.4.
- Proximal policy optimization algorithms. CoRR abs/1707.06347. Cited by: §1.
- Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489. Cited by: §1.
- Incentivizing exploration in reinforcement learning with deep predictive models. arXiv preprint arXiv:1507.00814. Cited by: §2.2.
- Multiagent systems: a survey from a machine learning perspective. Autonomous Robots 8. Cited by: §4.1.
- Value-decomposition networks for cooperative multi-agent learning based on team reward. In Proceedings of the 17th International Conference on Autonomous Agents and MultiAgent Systems, AAMAS ’18, pp. 2085–2087. Cited by: §1, §2.3, §4.1.
- Reinforcement learning: an introduction. Cited by: §2.1.
- Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, Cited by: §1, §2.3, §4.1.
- Energy-based models for sparse overcomplete representations. Journal of Machine Learning Research 4. Cited by: §2.4.
- Efficient exploration in reinforcement learning. Cited by: §2.2, §2.4.
- Robot trajectory optimization using approximate inference. In Proceedings of the 26th annual international conference on machine learning, Cited by: §2.4.
- Grandmaster level in starcraft ii using multi-agent reinforcement learning. Nature 575. Cited by: §1.
- StarCraft ii: a new challenge for reinforcement learning. External Links: 1708.04782 Cited by: §1.
- Dueling network architectures for deep reinforcement learning. In International conference on machine learning, Cited by: §2.3.
- Multiagent soft q-learning. arXiv preprint arXiv:1804.09817. Cited by: §2.4.
- Probabilistic recursive reasoning for multi-agent reinforcement learning. arXiv preprint arXiv:1901.09207. Cited by: §2.4.
- Weighted double deep multiagent reinforcement learning in stochastic cooperative environments. In Pacific Rim international conference on artificial intelligence, Cited by: §2.3.
- Maximum entropy inverse reinforcement learning. In AAAI, Cited by: §2.4.
- Modeling purposeful adaptive behavior with the principle of maximum causal entropy. Cited by: §2.4.