PettingZoo: Gym for Multi-Agent Reinforcement Learning
OpenAI’s Gym library contains a large, diverse set of environments that are useful benchmarks in reinforcement learning, under a single elegant Python API (with tools to develop new compliant environments). The introduction of this library has proven a watershed moment for the reinforcement learning community, because it created an accessible set of benchmark environments that everyone could use (including wrappers for important existing libraries), and because a standardized API let RL methods and environments from anywhere be trivially exchanged. This paper similarly introduces PettingZoo, a library of diverse multi-agent environments under a single elegant Python API, with tools to easily make new compliant environments.
Reinforcement Learning (“RL”) considers learning a policy (a function that takes in an observation from an environment and emits an action) that achieves the maximum expected discounted reward when playing in an environment. OpenAI Gym (Brockman et al., 2016) was introduced shortly after the potential of reinforcement learning became widely known through Mnih et al. (2015). At the time, doing basic research in reinforcement learning was a large engineering challenge. The most popular set of environments were the Atari games of the Arcade Learning Environment (Bellemare et al., 2013), which could be difficult to compile and install, and which had an involved C API and, later, an unofficial fork with a Python wrapper (Goodrich, 2015). A scattering of other environments existed as independent projects, in various languages, all with unique APIs. This level of heterogeneity meant that reinforcement learning code had to be adapted to every environment (including bridging programming languages). Accordingly, standardized reinforcement learning implementations weren’t possible, comparisons against a wide variety of environments were very difficult, and doing even simple research in reinforcement learning was generally restricted to organizations with entire engineering divisions. Gym was created to promote research in reinforcement learning by making comprehensive benchmarking more accessible, by allowing algorithm reuse, and by letting average machine learning researchers access the environments. This last point was achieved by putting every environment that you’d likely want to benchmark with (at the time of creation) under one very simple API that anyone could understand, in Python (which was just starting to become the lingua franca of machine learning).
This led to a mass proliferation of reinforcement learning research, especially at smaller institutions, as well as many environments compliant with the API (Kidziński et al., 2018, Leurent, 2018, Zamora et al., 2016) and many RL libraries based around the API (Hill et al., 2018, Liang et al., 2018, Kuhnle et al., 2017).
Multi-Agent Reinforcement Learning (MARL) in particular has been behind many of the most publicized achievements of modern machine learning, including AlphaGo Zero (Silver et al., 2017), OpenAI Five (OpenAI, 2018), and AlphaStar (Vinyals et al., 2019), and has seen a boom in recent years. However, the field exists in a state similar to that of reinforcement learning before the release of Gym: popular benchmark environments are scattered across many locations, often in unmaintained states and all with heterogeneous APIs; highly influential research in the field is generally restricted to institutions with dedicated engineering teams; research is regularly benchmarked on different environments than other research; and progress has been slow compared to single-agent reinforcement learning (though this obviously cannot be attributed to benchmarks alone).
Prompted by all this, we developed PettingZoo — a Python library collecting maintained versions of all popular MARL environments, under a single very clean Python API that is very similar to that of Gym. It’s on PyPI and can be installed via pip install pettingzoo.
2 Design Philosophy
Simplicity and Similarity to Gym
The ability of the Gym API to be understood near instantly has been a large driving factor in its widespread adoption. While a multi-agent API will inherently add complexity, we wanted to create a similarly simple API, and one that would be instantly familiar to researchers who have used Gym.
Agent Environment Cycle Games Based API
Most environments have APIs that model agents as all stepping at once (Lowe et al., 2017, Zheng et al., 2017, Gupta et al., 2017, Liu et al., 2019, Liang et al., 2018), modeled after Partially Observable Stochastic Games (POSGs). This easily results in bugs, and it is undesirable for strictly turn-based games like chess, where agents are not allowed to all step at once. We instead model our API after the new Agent Environment Cycle (AEC) games model [cite], which treats each agent as stepping sequentially. That is, an agent performs an action, the environment responds, the next agent acts, the environment responds again, and the cycle repeats. AEC games have been shown to be equivalent to POSGs, which means the AEC paradigm can be used to model both turn-based and parallel games. The paper introducing this model expounds on these benefits at great length.
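The sequential stepping described in this section can be illustrated with a toy example. The sketch below is not PettingZoo code: the class ToyAECEnv, its agent_selection attribute, its step and observe methods, and the counting game itself are hypothetical names and rules chosen purely for illustration. It only shows the cycle of one agent acting, the environment responding, and the turn passing to the next agent.

```python
from itertools import cycle


class ToyAECEnv:
    """Hypothetical turn-based game: agents take turns adding to a shared total."""

    def __init__(self, agents=("player_0", "player_1"), target=5):
        self.agents = list(agents)
        self.target = target
        self.reset()

    def reset(self):
        self.total = 0
        self.rewards = {agent: 0 for agent in self.agents}
        self.dones = {agent: False for agent in self.agents}
        self._turn_order = cycle(self.agents)
        self.agent_selection = next(self._turn_order)  # whose turn it is

    def observe(self, agent):
        # Every agent sees the shared total in this toy game.
        return self.total

    def step(self, action):
        # Only the currently selected agent acts. The environment responds
        # immediately, and then the turn passes to the next agent.
        agent = self.agent_selection
        self.total += action
        if self.total >= self.target:
            self.rewards[agent] = 1
            self.dones = {a: True for a in self.agents}
        self.agent_selection = next(self._turn_order)


def play(env, policy):
    """Run one episode and return the order in which agents acted."""
    env.reset()
    history = []
    while not all(env.dones.values()):
        agent = env.agent_selection
        observation = env.observe(agent)
        env.step(policy(agent, observation))
        history.append(agent)
    return history


env = ToyAECEnv()
history = play(env, policy=lambda agent, obs: 1)  # every agent always adds 1
print(history)
print(env.rewards)
```

Because exactly one agent acts per call to step, a strictly turn-based game needs no placeholder actions for agents that are not allowed to move.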
We wanted to make environments that are highly configurable by arguments the norm. In Gym, environments are generally not configurable, and arguments at creation are not used at all. However, varying environment properties is often highly desirable, and this has been embraced by Gym environments outside the official library, as it makes research easier and aids reproducibility. Accordingly, we tried to make every reasonable environment parameter an option for users in PettingZoo.
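As a minimal sketch of what configuration by arguments can look like, the hypothetical class below takes its properties as constructor arguments and can report them back for logging. The class name ConfigurableToyEnv and the parameters n_agents, max_cycles, and shared_reward are assumptions made for illustration, not parameters of any specific PettingZoo environment.

```python
class ConfigurableToyEnv:
    """Hypothetical environment whose properties are all constructor arguments."""

    def __init__(self, n_agents=2, max_cycles=100, shared_reward=False):
        # Validate early so a bad configuration fails with a clear message.
        if n_agents < 1:
            raise ValueError(f"n_agents must be at least 1, got {n_agents}")
        if max_cycles < 1:
            raise ValueError(f"max_cycles must be at least 1, got {max_cycles}")
        self.agents = [f"agent_{i}" for i in range(n_agents)]
        self.max_cycles = max_cycles
        self.shared_reward = shared_reward

    def config(self):
        # Returning the full configuration makes an experiment easy to log and rerun.
        return {
            "n_agents": len(self.agents),
            "max_cycles": self.max_cycles,
            "shared_reward": self.shared_reward,
        }


default_env = ConfigurableToyEnv()
variant_env = ConfigurableToyEnv(n_agents=4, shared_reward=True)
print(default_env.config())
print(variant_env.config())
```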
This notion of configuration extends beyond environment configuration to how learning methods interact with the environment. Due to the wide diversity of optimizations and different strategies applied for MARL, we wanted our API to allow for low level access to
Quality of Life Improvements
Being users of Gym ourselves, we sought to add several “quality of life” improvements in PettingZoo, motivated by frustrations we faced as users. These are:
Comprehensive, production grade continuous integration testing. Testing in Gym is arguably rather lacking, which has led to issues in the past.
Tests of environments for API compliance and proper functionality, both for end users and for continuous integration testing of the library. We also provide detailed recommendations for better practices, inspired by the well-liked messages of the Rust compiler.
Good error messages and warnings. When using Gym, as with most software packages, when you do something wrong you get a traceback that you have to decode to find the actual problem. We added specialized error messages and warnings for all the common errors we’ve made or are aware of, to make development and debugging easier.
Detailed, comprehensive documentation. Documentation is a fundamental part of a user-friendly software library, and an environment’s observation space, action space, reward scheme, and other notable details are things you generally need to know to begin conducting even the most basic research with it. The problem is that in Gym, you have to refer to the source for almost all of these things, so we created a user-friendly wiki-style website that clearly presents all information relevant to each environment, as well as specifics of sets of environments, tests, comprehensive API documentation, and so on. This is discussed further in the documentation section.
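To make the idea of a compliance test with readable messages concrete, here is a rough sketch. The function check_env and the two example classes are hypothetical and are not the tests shipped with the library. The attributes being checked (per-agent rewards, dones, and infos dictionaries, plus an observe method) follow the low-level attributes this paper describes environments as exposing, and the agents list is an added assumption.

```python
REQUIRED_DICTS = ("rewards", "dones", "infos")


def check_env(env):
    """Hypothetical API check: return a list of readable problems (empty if none)."""
    agents = getattr(env, "agents", None)
    if not agents:
        return ["env.agents is missing or empty; define a list of agent names first."]
    problems = []
    for name in REQUIRED_DICTS:
        value = getattr(env, name, None)
        if not isinstance(value, dict):
            problems.append(
                f"env.{name} should be a dict keyed by agent name, "
                f"got {type(value).__name__}."
            )
            continue
        missing = [agent for agent in agents if agent not in value]
        if missing:
            problems.append(
                f"env.{name} has no entry for {missing}; every agent needs a value."
            )
    if not callable(getattr(env, "observe", None)):
        problems.append(
            "env.observe(agent) is missing; it should return that agent's observation."
        )
    return problems


class GoodEnv:
    agents = ["a", "b"]
    rewards = {"a": 0, "b": 0}
    dones = {"a": False, "b": False}
    infos = {"a": {}, "b": {}}

    def observe(self, agent):
        return 0


class BrokenEnv:
    agents = ["a", "b"]
    rewards = {"a": 0}  # no entry for agent "b"
    dones = None  # wrong type
    infos = {"a": {}, "b": {}}
    # no observe method at all


good_report = check_env(GoodEnv())
broken_report = check_env(BrokenEnv())
for line in broken_report:
    print(line)
```

Collecting every problem before reporting, instead of stopping at the first failure, lets a developer fix several issues in one pass.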
We further use the observation/action space objects from Gym, as well as the same seeding method and infrastructure (they were well done and very familiar to users).
Compliant environments wrap a general class (AECEnv). To allow for sufficient flexibility, environments actually only expose lower level attributes (dictionaries of values for all agents: dones, infos, rewards) and an observe method that takes an agent. These are then wrapped into the more general API functions by the base class, which allows entirely new APIs to be efficiently built on top of PettingZoo environments should the need arise. We’ve done this ourselves with a secondary parallel POSG-based API (very similar to RLlib’s multi-agent API (Liang et al., 2018)) for a subset of the environments we include, due to special performance considerations.
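The layering described here can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: AECEnvSketch, its last method, the toy EchoEnv, and the parallel_step helper are all hypothetical names. The sketch shows a base class building a convenience call from the low-level dictionaries and observe, and a second, parallel-style interface built from the same pieces.

```python
class AECEnvSketch:
    """Hypothetical base class: builds a convenience call from low-level attributes."""

    def last(self):
        # Gather everything the currently selected agent needs in one call.
        agent = self.agent_selection
        return (
            self.observe(agent),
            self.rewards[agent],
            self.dones[agent],
            self.infos[agent],
        )


class EchoEnv(AECEnvSketch):
    """Toy environment exposing only per-agent dicts, observe(), and step()."""

    def __init__(self):
        self.agents = ["agent_0", "agent_1"]
        self.reset()

    def reset(self):
        self._index = 0
        self.agent_selection = self.agents[0]
        self._last_action = {a: None for a in self.agents}
        self.rewards = {a: 0 for a in self.agents}
        self.dones = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}

    def observe(self, agent):
        # In this toy game an agent simply observes its own previous action.
        return self._last_action[agent]

    def step(self, action):
        agent = self.agent_selection
        self._last_action[agent] = action
        self.rewards[agent] = action  # toy rule: the reward equals the action
        self._index = (self._index + 1) % len(self.agents)
        self.agent_selection = self.agents[self._index]


def parallel_step(env, actions):
    """Parallel-style API: one action per agent in, per-agent dicts out."""
    for _ in env.agents:
        env.step(actions[env.agent_selection])
    observations = {a: env.observe(a) for a in env.agents}
    return observations, dict(env.rewards), dict(env.dones), dict(env.infos)


env = EchoEnv()
first = env.last()
observations, rewards, dones, infos = parallel_step(env, {"agent_0": 3, "agent_1": 7})
print(first)
print(observations, rewards)
```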
Similar to Gym, we wanted to include popular and interesting environments within one package, in an easily usable format. Half of the environment classes we include (MPE, MAgent, and SISL), despite their popularity, have previously only existed as unmaintained “research grade” code that has not been available for installation via pip, has required large amounts of maintenance to run at all, and has required large amounts of debugging, code review, code cleanup, and documentation to bring to a production grade state. The Atari and Butterfly classes are new environments that we believe pose important and novel challenges to multi-agent reinforcement learning. Finally, we include the Classic class: classic board and card games popular in the RL literature.
Atari games represent the single most popular and iconic class of benchmarks in reinforcement learning. Recently, a multi-agent fork of the Arcade Learning Environment was created that allows programmatic control and reward collection of Atari’s iconic multi-player games (Terry and Black, 2020). As in the single player Atari environments, the observation is the rendered frame of the game, which is shared between all agents, so there is no partial observability. Most of these games have competitive or mixed reward structures, making them suitable for general study of adversarial and mixed reinforcement learning. In particular, Terry and Black (2020) categorize the games into 7 different types: 1v1 tournament games, mixed sum survival games (Space Invaders, shown in Figure 3(a), is an example of this), competitive racing games, long term strategy games, 2v2 tournament games, a four-player free-for-all game, and a cooperative game. AutoROM, a separate PyPI package, can be used to install the needed Atari ROMs in an automated manner.
Of all the environments included, the majority are competitive. We wanted to supplement these with a set of interesting graphical cooperative environments. In Pistonball, depicted in Figure 3(b), the pistons need to coordinate to move the ball to the left while each observing only a local part of the screen, which requires learning nontrivial emergent behavior and indirect communication to perform well. Knights Archers Zombies is a game in which players work together to defeat approaching zombies before they can reach the players. It is designed to be a fast paced, graphically interesting combat game with partial observability and heterogeneous agents, where achieving high performance would require extraordinarily high levels of agent coordination. Cooperative Pong, where two dissimilar paddles work together to keep the ball in play as long as possible, was intended to be a very simple cooperative continuous control-type task with heterogeneous agents. Prison was designed to be the simplest possible game in MARL, and to be used as a debugging tool. Prospector was included to intentionally be a very challenging game for conventional methods: it has two classes of agents, with different goals, action spaces, and observation spaces (something many current cooperative MARL algorithms struggle with), and has very sparse rewards (something all RL algorithms struggle with). Our goal was for it to be something like a multiplayer version of Montezuma’s Revenge.
Classic. Classical board and card games have long been some of the most popular environments in reinforcement learning (Tesauro, 1995, Silver et al., 2016, Bard et al., 2019). We include all the standard multiplayer games in RLCard (Zha et al., 2019): Dou Dizhu, Gin Rummy, Leduc Hold’em, Limit Texas Hold’em, Mahjong, No-limit Texas Hold’em, and Uno. We additionally include all AlphaZero games (Chess, Go, and Shogi), using the same observation and action spaces. We finally included Backgammon, Connect Four, Checkers, Rock Paper Scissors, Rock Paper Scissors Lizard Spock, and Tic Tac Toe to provide a diverse set of simple games popular in the western world, allowing for more robust benchmarking of RL methods.
The MAgent library, from Zheng et al. (2017), was introduced as a configurable and scalable environment that could support thousands of interactive agents. These environments have mostly been studied as a setting for emergent behavior (Pokle, 2018), heterogeneous agents (Subramanian et al., 2020), and efficient learning methods with many agents (Chen et al., 2019). We include a number of preset configurations, for example the Adversarial Pursuit environment shown in Figure 3(d). We make a few changes to the preset configurations used in the original MAgent paper. The global “minimap” observations in the battle environment are turned off by default, requiring implicit communication between the agents for complex emergent behavior to occur. The rewards in Gather and Tiger-Deer are also slightly changed to prevent emergent behavior from being a direct result of the reward structure.
The Multi-Agent Particle Environments (MPE) were introduced as part of Mordatch and Abbeel (2017) and first released as part of Lowe et al. (2017). These are 9 communication-oriented environments in which particle agents can (sometimes) move, communicate, see each other, push each other around, and interact with fixed landmarks. Some environments are cooperative, some are competitive, and some require team play. They have been popular in research on general MARL methods (Lowe et al., 2017), emergent communication (Mordatch and Abbeel, 2017), team play (Palmer, 2020), and much more. As part of their inclusion in PettingZoo, we converted the action spaces to a discrete space which is the Cartesian product of the movement and communication action possibilities. We also added comprehensive documentation, parameterized any local reward shaping (with the default setting being the same as in Lowe et al. (2017)), and made a single render window which captures the activities of all agents (including communication), making it easier to visualize.
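The conversion of a movement choice and a communication choice into one discrete action can be sketched as below. The specific movement and message names are assumptions made for illustration (the text does not list them), as are the helper names encode and decode. Only the idea of indexing into the Cartesian product comes from the description above.

```python
from itertools import product

# Assumed example choices; the real environments define their own sets.
MOVES = ("no_op", "left", "right", "down", "up")
MESSAGES = ("silent", "say_A", "say_B")

# Every (move, message) pair gets one index in a single discrete action space.
JOINT_ACTIONS = list(product(MOVES, MESSAGES))


def encode(move, message):
    """Map a (move, message) pair to its index in the joint discrete space."""
    return MOVES.index(move) * len(MESSAGES) + MESSAGES.index(message)


def decode(action):
    """Recover the (move, message) pair from a joint discrete action index."""
    move_index, message_index = divmod(action, len(MESSAGES))
    return MOVES[move_index], MESSAGES[message_index]


print(len(JOINT_ACTIONS))
print(decode(7))
print(encode("up", "say_B"))
```

The joint space has len(MOVES) * len(MESSAGES) entries, so it grows multiplicatively with the number of movement and communication options.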
We finally included the three cooperative environments introduced in Gupta et al. (2017): Pursuit, Waterworld, and Multiwalker. Pursuit is a standard pursuit-evasion game (Vidal et al., 2002) where pursuers are controlled in a randomly generated map. Pursuer agents are rewarded for capturing randomly generated evaders by surrounding them on all sides. Waterworld is a continuous control game where the pursuing agents cooperatively hunt down food targets while trying to avoid poison targets. Multiwalker (Figure 3(f)) is a more challenging continuous control task that is based on Gym’s BipedalWalker environment. In Multiwalker, a package is placed on three independently controlled robot legs. Each robot is given a small positive reward for every unit of forward horizontal movement of the package, while they receive a large penalty for dropping the package.
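The capture rule in Pursuit ("surrounding them on all sides") can be illustrated with a small grid check. This is a simplified sketch under assumptions not stated in the text: positions are (x, y) grid cells, "all sides" means the four orthogonal neighbours, and walls and map edges are ignored. The function name is_captured is hypothetical.

```python
def is_captured(evader, pursuers):
    """Return True if all four orthogonal neighbours of the evader hold a pursuer."""
    x, y = evader
    neighbours = {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}
    # Subset test: every neighbouring cell must appear among pursuer positions.
    return neighbours <= set(pursuers)


surrounded = is_captured((2, 2), [(1, 2), (3, 2), (2, 1), (2, 3)])
escaping = is_captured((2, 2), [(1, 2), (3, 2), (2, 1)])
print(surrounded, escaping)
```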
Thank you to Deepthi Raghunandan and Kevin Hogan for many helpful discussions surrounding what testing should look like. Thank you to Nathaniel Grammel for many helpful discussions in the early planning stages of the project. Thank you to Ross Allen for reporting numerous bugs in the project.
- Bard et al. (2019). The Hanabi challenge: a new frontier for AI research. CoRR abs/1902.00506.
- Bellemare et al. (2013). The Arcade Learning Environment: an evaluation platform for general agents. Journal of Artificial Intelligence Research 47, pp. 253–279.
- Brockman et al. (2016). OpenAI Gym. arXiv preprint arXiv:1606.01540.
- Chen et al. (2019). Factorized Q-learning for large-scale multi-agent systems. In DAI ’19.
- Goodrich (2015). Ale_python_interface. GitHub repository: https://github.com/bbitmaster/ale_python_interface
- Gupta et al. (2017). Cooperative multi-agent control using deep reinforcement learning. In International Conference on Autonomous Agents and Multiagent Systems, pp. 66–83.
- Hill et al. (2018). Stable Baselines. GitHub repository: https://github.com/hill-a/stable-baselines
- Kidziński et al. (2018). Learning to run challenge: synthesizing physiologically accurate motion using deep reinforcement learning. In NIPS 2017 Competition Book, S. Escalera and M. Weimer (Eds.).
- Kuhnle et al. (2017). Tensorforce: a TensorFlow library for applied reinforcement learning. Web page.
- Leurent (2018). An environment for autonomous driving decision-making. GitHub repository: https://github.com/eleurent/highway-env
- Liang et al. (2018). RLlib: abstractions for distributed reinforcement learning. In International Conference on Machine Learning (ICML).
- Liu et al. (2019). Emergent coordination through competition. CoRR abs/1902.07151.
- Lowe et al. (2017). Multi-agent actor-critic for mixed cooperative-competitive environments. Neural Information Processing Systems (NIPS).
- Mnih et al. (2015). Human-level control through deep reinforcement learning. Nature 518 (7540), pp. 529–533.
- Mordatch and Abbeel (2017). Emergence of grounded compositional language in multi-agent populations. arXiv preprint arXiv:1703.04908.
- OpenAI (2018). OpenAI Five. https://blog.openai.com/openai-five/
- Palmer (2020). Independent learning approaches: overcoming multi-agent learning pathologies in team-games.
- Pokle (2018). Analysis of emergent behavior in multi agent environments using deep reinforcement learning.
- Silver et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature 529 (7587), pp. 484–489.
- Silver et al. (2017). Mastering the game of Go without human knowledge. Nature 550 (7676), pp. 354–359.
- Subramanian et al. (2020). Multi type mean field reinforcement learning. In AAMAS.
- Terry and Black (2020). Multiplayer support for the Arcade Learning Environment. arXiv preprint arXiv:2009.09341.
- Tesauro (1995). Temporal difference learning and TD-Gammon. Commun. ACM 38 (3), pp. 58–68.
- Vidal et al. (2002). Probabilistic pursuit-evasion games: theory, implementation, and experimental evaluation. IEEE Transactions on Robotics and Automation 18 (5), pp. 662–669.
- Vinyals et al. (2019). Grandmaster level in StarCraft II using multi-agent reinforcement learning. Nature 575 (7782), pp. 350–354.
- Zamora et al. (2016). Extending the OpenAI Gym for robotics: a toolkit for reinforcement learning using ROS and Gazebo. arXiv preprint arXiv:1608.05742.
- Zha et al. (2019). RLCard: a toolkit for reinforcement learning in card games. arXiv preprint arXiv:1910.04376.
- Zheng et al. (2017). MAgent: a many-agent reinforcement learning platform for artificial collective intelligence. arXiv preprint arXiv:1712.00600.