Solving Atari Games Using Fractals And Entropy


Sergio Hernandez Cerezo
HCSoft Programación, S.L., 30007 Murcia, Spain
sergio@hcsoft.net
Guillem Duran Ballester
guillem.db@gmail.com
Spiros Baxevanakis
spiros.baxevanakis@gmail.com
Abstract

In this paper we introduce a novel MCTS-based approach derived from the laws of thermodynamics. The algorithm, coined Fractal Monte Carlo (FMC), allows us to create an agent that takes intelligent actions in both continuous and discrete environments while providing control over every aspect of the agent's behavior. Results show that FMC is several orders of magnitude more efficient than similar techniques, such as MCTS, in the Atari games tested.

 


Preprint. Work in progress.

1 Introduction

Artificial intelligence methods are currently limited by the lack of a concrete definition of intelligence that would assist in creating agents that exhibit intelligent behavior. Fractal AI theory (FAI) (Hernández et al., 2018) is inspired by the work of Wissner-Gross and Freer, who proposed the concept of Causal Entropic Forces and showed that an agent exhibits intelligent behavior when it tries to maximize its Causal Path Entropy or, equivalently, its future freedom of action. In order to achieve that, the agent directly modifies its degrees of freedom in such a way that it assumes the state with the highest number of possible, different futures. To find the available futures, the agent needs to scan the action space and recreate the Causal Cone that contains the paths of all possible future internal configurations starting from its initial state. This scanning process is defined by a Scanning Policy. Available actions are then assigned a probability of being chosen by the Deciding Policy. In Fractal AI, intelligence is defined as the ability to minimize a sub-optimality coefficient based on the similarity of the two probability distributions created from the scanning and deciding policies.

Using the principles described in FAI theory, we developed a Monte Carlo approach, coined Fractal Monte Carlo (FMC), that is based on the second law of thermodynamics. The algorithm evolves a swarm of walkers in the environment while balancing exploitation and exploration by means of a mechanism named cloning. This process generates a "fractal tree" of interconnected walker paths that tends to fill the entire causal cone. The algorithm can be applied to both continuous and discrete decision spaces while remaining extremely efficient.

In addition, FMC provides a suite of parameters that control computational resources as well as the agent's reaction time. To test our algorithm we evaluated it on 55 different Atari environments and compared our results with state-of-the-art algorithms such as A3C (Mnih et al., 2016), NoisyNet (Fortunato et al., 2017), and DQN (Mnih et al., 2013, 2015) and its variants.

2 Related work

Recently there have been numerous breakthroughs in reinforcement learning, most of them originating from DeepMind. They created an end-to-end, model-free reinforcement learning technique, named Deep Q Learning (Mnih et al., 2013, 2015), that scored astonishingly well and outperformed previous approaches in Atari games. Their Deep Q Learner achieves such feats by estimating Q-values directly from images and stabilizes learning through techniques such as experience replay and frame skipping. Later, in 2016, they created AlphaGo, which went on to beat the world champion in the game of Go (Silver et al., 2016), and AlphaChem, which was shown to outperform hard-coded heuristics used in retrosynthesis (Segler et al., 2017). Both AlphaGo and AlphaChem use some form of deep reinforcement learning in conjunction with an MCTS variant, UCP (Isasi et al., 2014).

3 Background

FMC is a robust path-search algorithm that efficiently approximates path integrals formulated as a Markov decision process. It exploits the deep link between intelligence and entropy maximization (Wissner-Gross and Freer, 2013), which naturally produces an optimal decision-making process. FMC formulates agents that exhibit intelligent behavior in Atari game emulators. Such agents create a swarm of walkers that explores the Causal Cone and eventually, when the time horizon is met, select an action based on the walkers' distribution over the action space.

3.1 Causal cones

In order to find the best path, Fractal Monte Carlo scans the space of possible future states, thereby constructing a tree which consists of potential trajectories that describe the future evolution of the system. We define a Causal Cone C(x_0, τ) as the set of all possible paths the system can take starting from an initial state x_0 if allowed to evolve over a time interval of length τ, the 'time horizon' of the cone. A Causal Cone can be divided into a set of Causal Slices S_t, where each Causal Slice contains all the possible future states of the paths at a given time t ∈ [0, τ]. If t = τ, the Causal Slice is called the cone's 'horizon' and contains the final states. The rest of the cone, where t < τ, is usually referred to as the cone's 'bulk'.
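In symbols, an informal way to write these sets is sketched below, where x(·) denotes a path of system states; the notation is simply the naming used in the paragraph above.

% x(·) is a path of states; x_0 the initial state; \tau the time horizon
C(x_0, \tau) = \{\, x(\cdot) \mid x(0) = x_0,\; t \in [0, \tau] \,\}            % causal cone
S_t(x_0, \tau) = \{\, x(t) \mid x(\cdot) \in C(x_0, \tau) \,\}, \quad t \in [0, \tau]   % causal slice at time t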

Figure 1: (from Wissner-Gross and Freer (2013)): A Causal Cone visualization. On the left, the Causal Cone expands from the initial point and all the possible future paths are expanded upwards in time. On the right the entropic force compels the walkers to avoid the grey excluded volume and therefore evolve on the remaining space.

3.2 Dead and alive statuses

The death condition is a flag set by the programmer that lets us incorporate arbitrary boundary conditions into the behaviour of the agent and helps the swarm avoid undesired regions of the state space. We will assume an external death condition is defined over the state space, so that a portion of the space can be forbidden for the system, as shown in Figure 1. We will consider a state inside this excluded region "dead", while all other states are "alive".

3.3 Reward function

Agents make decisions based on a non-negative reward function R(x) that (we assume) is defined over the state space. For every slice S_t of the causal cone, we can calculate the total reward of the slice as the integral of the reward over the slice. We may then convert the reward into a probability density over the slice as follows:

    P_t(x) = R(x) / ∫_{S_t} R(x') dx'    (1)

The general idea behind the algorithm is that the density distribution of the scanning should match the reward density distribution over the state space.

3.4 Policies

The proposed algorithm uses a set of two policies to choose and score actions. First we define a scanning policy that, given a swarm of initially identical states, defines its possible evolution over time as a stochastic process.
After the scanning is finished we need a deciding policy that assigns to each action a probability of being chosen:

    π(a_i) = N_{a_i} / N    (2)

where N_{a_i} is the number of walkers whose path in the cone began with action a_i and N is the total number of walkers (see Section 4).

In order to measure how different two probability distributions P and Q are, we will use a modified version of the Kullback–Leibler divergence:

(3)

This divergence is well defined for any possible distributions P and Q, including the problematic case when some Q_i = 0, for which the standard divergence Σ_i P_i log(P_i / Q_i) is undefined.

4 Fractal Monte Carlo

Fractal Monte Carlo is a path-search algorithm derived from Fractal AI theory (Hernández et al., 2018) that produces intelligent behavior by maximizing an intrinsic reward represented as Causal Path Entropy (Wissner-Gross and Freer, 2013). When making a decision, Fractal Monte Carlo (FMC) builds a tree that describes the future evolution of the system. This tree is expanded by a swarm of walkers that populates its leaf nodes. The swarm undergoes an iterative, cellular-automaton-like process in order to make the tree grow efficiently. When the maximum amount of computation has been reached, the utility of each action is considered proportional to the number of walkers that populate leaf nodes originating from that action.

4.1 The algorithm

FMC steps are outlined below:

STEP 1: Initialize the walkers to the root state.

STEP 2: Perturb the swarm of walkers.

STEP 3: Evaluate the position of each walker with respect to the whole swarm.

STEP 4: Recycle the walkers that are in a dead state or have been poorly valued against the rest of the swarm.

STEP 5: Repeat steps 2-4 until the maximum computational resources are reached.

STEP 6: Choose the action with the highest utility.

In more detail (a code sketch of the whole loop follows these steps):

Perturb:

  1. Every walker chooses a random action and acts in the environment.

Evaluate:

  1. For every walker, select an alive walker at random and measure the Euclidean distance between their observations.

  2. Normalize the distances and rewards using the "Relativize" function described below.

  3. Calculate the virtual reward of each walker. We define the virtual reward of walker i at its current state as:

    VR_i = R_i · D_i    (4)

    where R_i is the relativized reward at the walker's state and D_i is its relativized distance to the randomly chosen companion. Virtual reward is a stochastic measure of the importance of a given walker with respect to the whole swarm.

Recycle:

  1. Each walker A is compared to another randomly selected walker C and gets assigned a probability of cloning to the leaf node of walker C.

  2. Determine if the walker A will clone to C based on the cloning probability and the death condition.

  3. Finally transfer the walkers that are set for cloning to their target leaf node.

Choose:

  1. When assigning a utility value to an option, FMC counts how many walkers took each option at the root state. To choose an action in the continuous or general case, we calculate the average of the actions weighted by their normalized utilities or scores. In the discrete case, the action that best approximates the aforementioned average is chosen.

Relativize :

  1. Given real values x_1, …, x_N we calculate their mean μ and standard deviation σ.

  2. Normalize the values using:

    z_i = (x_i − μ) / σ    (5)
  3. Reshape the values into a Gaussian N(0,1) distribution.

  4. Then scale using: if z_i > 0 then 1 + ln(1 + z_i), else exp(z_i).
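To make the loop above concrete, the following Python sketch implements a single FMC decision step on a toy one-dimensional world. It is only an illustration of the Perturb / Evaluate / Recycle / Choose cycle described in this section, not the authors' reference implementation (available in the FractalAI repository); in particular, the step-function interface, the toy environment, the exact cloning-probability formula, and the rescaling inside relativize are our assumptions, and the Max samples resource cap (Section 4.2) is omitted for brevity.

import numpy as np


def relativize(values):
    """Normalize a vector of values as in the "Relativize" step above."""
    values = np.asarray(values, dtype=float)
    std = values.std()
    if std == 0:
        return np.ones_like(values)
    z = (values - values.mean()) / std
    out = np.empty_like(z)
    pos = z > 0
    # Assumed rescaling: logarithmic above the mean, exponential below it,
    # so that all outputs stay positive.
    out[pos] = 1.0 + np.log1p(z[pos])
    out[~pos] = np.exp(z[~pos])
    return out


def fmc_decide(step_fn, root_state, n_actions, n_walkers=30, horizon=15, rng=None):
    """Pick an action for root_state with one FMC planning step.

    step_fn(state, action) -> (next_state, observation, reward, dead) is an
    assumed interface modelling a single environment transition.
    """
    rng = rng or np.random.default_rng()
    states = [root_state] * n_walkers
    obs = np.zeros((n_walkers, 1))
    rewards = np.zeros(n_walkers)
    dead = np.zeros(n_walkers, dtype=bool)
    # First action of each walker; the final utilities are counts over these.
    root_actions = rng.integers(n_actions, size=n_walkers)

    for t in range(horizon):
        # Perturb: every walker takes a random action (its root action at t == 0).
        actions = root_actions if t == 0 else rng.integers(n_actions, size=n_walkers)
        for i in range(n_walkers):
            states[i], obs[i], r, dead[i] = step_fn(states[i], actions[i])
            rewards[i] += r

        # Evaluate: distance to a random alive companion, then virtual reward.
        alive = np.flatnonzero(~dead)
        if alive.size == 0:
            break
        companions = rng.choice(alive, size=n_walkers)
        distances = np.linalg.norm(obs - obs[companions], axis=1)
        virtual_reward = relativize(rewards) * relativize(distances)

        # Recycle: dead or poorly valued walkers clone to their companion.
        # Assumed cloning probability: clipped relative virtual-reward advantage.
        p_clone = np.clip(
            (virtual_reward[companions] - virtual_reward)
            / np.maximum(virtual_reward, 1e-8),
            0.0,
            1.0,
        )
        clone = dead | (rng.random(n_walkers) < p_clone)
        for i in np.flatnonzero(clone):
            j = companions[i]
            states[i], obs[i] = states[j], obs[j].copy()
            rewards[i], dead[i] = rewards[j], dead[j]
            root_actions[i] = root_actions[j]

    # Choose: the utility of an action is the number of walkers whose path
    # started with it; here we simply pick the most populated root action.
    return int(np.bincount(root_actions, minlength=n_actions).argmax())


if __name__ == "__main__":
    def toy_step(state, action):
        # Toy 1-D world used only for illustration: action 1 moves right,
        # action 0 moves left; reward grows to the right; x < -3 is a dead zone.
        x = state + (1 if action == 1 else -1)
        return x, float(x), max(float(x), 0.0), x < -3

    print(fmc_decide(toy_step, root_state=0, n_actions=2))

Note that cloning copies the companion's state, accumulated reward, and root action, so the final action counts reflect where the swarm has concentrated; picking the most populated root action is a simplification of the weighted-average rule described above.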

4.2 Parameters

Time Horizon sets a limit for how far in the future the walkers of the Swarm will foresee the aftermath of their initial actions. In other words, the walkers will seek to meet their set Time Horizon when going deeper in the tree but never go past it. The ideal Time Horizon value allows an agent to see far enough in the future to detect which actions lead inevitably to death.

Max samples is an upper bound on computational resources. It limits the number of times that FMC can make a perturbation while building a causal cone. The algorithm will try to keep computational resource usage as low as it can, provided it meets the time horizon. A good guide is to set this parameter proportionally to the number of walkers and the time horizon, with a proportionality constant that works well in Atari games but depends strongly on the task.

Number of walkers represents the maximum number of paths that FMC will simulate. This number is related to "how thick" we want the resulting representation of the causal cone to be. The algorithm will try to use the maximum number of walkers possible.

Time step (dt) is the time interval during which the agent holds each decision it makes. Time Horizon / dt defines the number of steps taken by the walkers.
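For illustration, the parameter values reported in the caption of Table 2 can be gathered into a single configuration. The sketch below is only indicative: the key names are ours and do not necessarily match the API of the FractalAI repository, and mapping "Repeat Actions" to dt is an assumption.

# Hypothetical FMC configuration mirroring the settings reported in Table 2.
fmc_params = {
    "n_walkers": 30,      # Number of walkers: maximum number of simulated paths
    "time_horizon": 15,   # Time Horizon: how far into the future the swarm looks
    "max_samples": 300,   # Max samples: upper bound on perturbations per decision
    "dt": 5,              # Time step: frames each decision is held ("Repeat Actions=5")
}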

4.3 Time complexity

The computational time complexity per decision is bounded by the maximum number of samples, and thus grows at most linearly with the number of walkers and with the number of steps each walker takes (Time Horizon / dt).

5 Findings

In this section we present our performance results in Atari games and then compare our approach with MCTS and other state-of-the-art learning-based approaches in eight games.

5.1 Atari environments

We tested our algorithm in 55 different Atari games using the OpenAI gym toolkit. In most games the agent used RAM data as observations to make decisions. As seen in Table 1, our results show that Fractal Monte Carlo outperforms previous state-of-the-art (SOtA) approaches in 49 of those games (89%). In each game we chose the appropriate parameters by experimentation and intuition. The extensive table containing the exact parameters used for every game is available in our GitHub repository (Hernández et al., 2017a). Furthermore, we compared FMC performance when using RAM data versus images as observations in eight games and found an overall performance difference in favor of RAM data (Table 2).
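For reference, a minimal way to obtain the RAM observations mentioned above with the OpenAI gym toolkit is sketched below; the environment id and the classic gym API (in which reset returns only the observation) are assumptions about the setup rather than the exact configuration used in our experiments.

import gym

# RAM variant of an Atari environment: observations are the 128-byte emulated memory.
env = gym.make("MsPacman-ram-v0")
obs = env.reset()                                   # uint8 array of shape (128,)
obs, reward, done, info = env.step(env.action_space.sample())
print(obs.shape, reward, done)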

Available games                55
Games played by FMC            55    100.00%
FMC better than avg human      51     92.73%
FMC better than SOtA           49     89.09%
Solved or above human record   25     45.45%
Solved due to the 1M bug       11     20.00%

In 51 out of 55 games FMC scored better than the average human, meaning a human that has played the game for 2 hours. FMC solved or scored higher than the human record in 25 (45.45%) of the games tested. A game is considered 'solved' when we hit an eventual score limit or we can play it endlessly. SOtA is an abbreviation for "State Of the Art". The term "1M bug" refers to some games having a hardcoded score limit, usually found at 999,999.
Table 1: Fractal Monte Carlo in Atari games
Environment     IMG       RAM       RAM vs IMG   RAM vs SOtA
atlantis        145000    139500     96.21%         1.59%
bank heist      160       280       175.00%         0.68%
boxing          100       100       100.00%       100.60%
centipede       immortal  immortal  100.00%      2373.87%
ice hockey      16        33        206.25%       311.32%
ms pacman       23980     29410     122.64%       468.09%
qbert           17950     22500     125.35%       358.11%
video pinball   273011    999999    366.29%       105.31%
Average                             161.47%       464.95%

Comparison of FMC scores when using IMG and RAM data as observations in eight Atari games. Parameters used are: Repeat Actions=5, Time Horizon=15, Max Samples=300, Number of Walkers=30. We find that, overall, RAM data yields better results than image observations and outperforms state-of-the-art (SOtA) methods in most games tested.
Table 2: Image data versus RAM dump

5.2 Comparison against UCT

In Table 3 we compare FMC with UCT, a state-of-the-art MCTS implementation, on the only eight games for which we found published UCT results in the literature (Guo et al., 2014). The results presented in Table 3 show that our method clearly outperforms MCTS while being three to four orders of magnitude more efficient.

Game             MCTS score   FMC score      %        MCTS samples/step   FMC samples/step   Efficiency
Asterix          226,000      999,500       442%      —–                  241                —–
Beam rider       7,233        288,666      3991%      3,000,000           946                x 3,171
Breakout         406          864           213%      3,000,000           866                x 3,386
Enduro           788          5,279         670%      4,000,000           390                x 10,256
Pong             21           21            100%      150,000             158                x 949
Q-bert           18,850       999,999      3523%      3,000,000           3,691              x 813
Seaquest         3,257        999,999     30703%      3,000,000           964                x 3,112
Space invaders   2,354        17,970        763%      3,000,000           1,830              x 1,639

Comparison of MCTS versus FMC performance in eight Atari games. We also compare the number of simulations, or "samples", each algorithm used per step. The "efficiency" metric is calculated as MCTS samples per step divided by FMC samples per step for each game.
Table 3: Comparison against MCTS variant, UCT

6 Conclusions

In this work, we proposed a new thinking framework called Fractal AI theory that we used to define intelligent behavior. FAI principles provided the basis for creating a new Monte Carlo approach based on maximizing Causal Path Entropy. We evaluated our algorithm in Atari environments, and the results show that it performs better than state-of-the-art algorithms, such as DQN and its variants, in most games.

Our algorithm has many potential applications, especially for improving methods that use MCTS. Furthermore, FMC can produce large amounts of high-quality training data for training reinforcement learning agents. For example, deep Q agents learn to associate reward expectations with states after being trained on a huge amount of data, mainly consisting of random rollouts. FMC's high-quality rollouts could be fed to a DQN during the training stage, which might boost training performance. Another promising idea for improving FMC is to add learning capabilities to the walkers using a neural network. The network would be trained on the correct decisions made by the agent and would output a probability distribution over the action space. Walkers would then sample from that distribution instead of picking a random action, thus transitioning into an informed search model.

Fractal Monte Carlo is just one of the possible algorithms inspired by FAI theory. Much research is still needed to explore other possible implementations of this new concept and their many potential applications in the real world.

References

  • Bellemare et al. (2016) Bellemare, M., S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, and R. Munos
    2016.
    Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479.
  • Bellemare et al. (2017) Bellemare, M. G., W. Dabney, and R. Munos
    2017.
    A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
  • Chaslot et al. (2008) Chaslot, G., S. Bakkes, I. Szita, and P. Spronck
    2008.
    Monte-Carlo Tree Search: A New Framework for Game AI.
  • Chuchro and Gupta (2017) Chuchro, R. and D. Gupta
    2017.
    Game Playing with Deep Q-Learning using OpenAI Gym. P.  6.
  • Foley (2017) Foley, D. J.
    2017.
    Model-Based Reinforcement Learning in Atari 2600 Games. PhD thesis.
  • Fortunato et al. (2017) Fortunato, M., M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih, R. Munos, D. Hassabis, O. Pietquin, C. Blundell, and S. Legg
    2017.
    Noisy Networks for Exploration. arXiv:1706.10295 [cs, stat]. arXiv: 1706.10295.
  • Fu and Hsu (2016) Fu, J. and I. Hsu
    2016.
    Model-based reinforcement learning for playing Atari games. Technical report, Stanford University.
  • Guo et al. (2014) Guo, X., S. Singh, H. Lee, R. L. Lewis, and X. Wang
    2014.
    Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346.
  • Hernández et al. (2017a) Hernández, S., G. Durán, and J. M. Amigó
    2017a.
    FractalAI. https://github.com/FragileTheory/FractalAI.
  • Hernández et al. (2017b) Hernández, S., G. Durán, and J. M. Amigó
    2017b.
    General Algorithmic Search. arXiv:1705.08691 [math]. arXiv: 1705.08691.
  • Hernández et al. (2018) Hernández, S., G. Durán, and J. M. Amigó
    2018.
    Fractal AI: A fragile theory of intelligence. arXiv:1803.05049 [cs]. arXiv: 1803.05049.
  • Hester et al. (2017) Hester, T., M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, G. Dulac-Arnold, et al.
    2017.
    Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732.
  • Isasi et al. (2014) Isasi, P., M. Drugan, and B. Manderick
    2014.
    Schemata monte carlo network optimization. In PPSN 2014 Workshop: In Search of Synergies between Reinforcement Learning and Evolutionary Computation.
  • Mnih et al. (2016) Mnih, V., A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu
    2016.
    Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs]. arXiv: 1602.01783.
  • Mnih et al. (2013) Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller
    2013.
    Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
  • Mnih et al. (2015) Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis
    2015.
    Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
  • Plappert et al. (2017) Plappert, M., R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen, T. Asfour, P. Abbeel, and M. Andrychowicz
    2017.
    Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
  • Salimans et al. (2017) Salimans, T., J. Ho, X. Chen, S. Sidor, and I. Sutskever
    2017.
    Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
  • Segler et al. (2017) Segler, M., M. Preuß, and M. P. Waller
    2017.
    Towards" alphachem": Chemical synthesis planning with tree search and deep neural network policies. arXiv preprint arXiv:1702.00020.
  • Silver et al. (2016) Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al.
    2016.
    Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
  • Wissner-Gross and Freer (2013) Wissner-Gross, A. D. and C. E. Freer
    2013.
    Causal entropic forces. Physical review letters, 110(16):168702.