Solving Atari Games Using Fractals And Entropy
In this paper we introduce a novel MCTS-based approach derived from the laws of thermodynamics. The algorithm, coined Fractal Monte Carlo (FMC), allows us to create an agent that takes intelligent actions in both continuous and discrete environments while providing control over every aspect of the agent’s behavior. Results show that FMC is several orders of magnitude more efficient than similar techniques, such as MCTS, in the Atari games tested.
Sergio Hernandez Cerezo HCSoft Programación, S.L., 30007 Murcia, Spain firstname.lastname@example.org Guillem Duran Ballester email@example.com Spiros Baxevanakis firstname.lastname@example.org
Preprint. Work in progress.
Artificial intelligence methods are currently limited by the lack of a concrete definition of intelligence that would assist in creating agents that exhibit intelligent behavior. Fractal AI theory (FAI) (Hernández et al., 2018) is inspired by the work of Wissner-Gross and Freer (2013), who proposed the concept of Causal Entropic Forces and showed that an agent exhibits intelligent behavior when it tries to maximize its Causal Path Entropy or, equivalently, its future freedom of action. To achieve that, the agent directly modifies its degrees of freedom so that it assumes the state with the highest number of possible, different futures. To find the available futures, the agent needs to scan the action space and recreate the Causal Cone that contains the paths of all possible future internal configurations that start from its initial state. This scanning process is called a Scanning Policy. Available actions are then assigned a probability of being chosen by the Deciding Policy. In Fractal AI, intelligence is defined as the ability to minimize a sub-optimality coefficient based on the similarity of two probability distributions created from the scanning and deciding policies.
Using the principles described in FAI theory we developed a Monte Carlo approach, coined Fractal Monte Carlo (FMC), that is based on the second law of thermodynamics. The algorithm develops a swarm of walkers that evolve in an environment while balancing exploitation and exploration by means of a mechanism named cloning. This process generates a "fractal tree", built from the interconnected paths of the walkers, that tends to fill the whole causal cone. The algorithm can be applied to both continuous and discrete decision spaces while remaining extremely efficient.
In addition, FMC provides a suite of parameters that control computational resources as well as agent reaction time. To test our algorithm we ran it on 55 different Atari environments and compared our results to state-of-the-art algorithms like A3C (Mnih et al., 2016), NoisyNet (Fortunato et al., 2017), DQN (Mnih et al., 2013, 2015) and variants.
2 Related work
Recently there have been numerous breakthroughs in reinforcement learning, most of them originating from DeepMind. They created an end-to-end, model-free reinforcement learning technique, named Deep Q-Learning (Mnih et al., 2013, 2015), that scored astonishingly well and outperformed previous approaches in Atari games. Their Deep Q-Learner achieves such feats by estimating Q-values directly from images and stabilizes learning by means such as experience replay and frame skipping. Later, in 2016, they created AlphaGo, which went on to beat the world champion in the game of Go (Silver et al., 2016). AlphaChem was then shown to outperform hardcoded heuristics used in retrosynthesis (Segler et al., 2017). Both AlphaGo and AlphaChem use some form of deep reinforcement learning in conjunction with an MCTS variant, UCP (Isasi et al., 2014).
FMC is a robust path-search algorithm that efficiently approximates path integrals formulated as a Markov decision process by exploiting the deep link between intelligence and entropy maximization (Wissner-Gross and Freer, 2013) that naturally produces an optimal decision-making process. FMC formulates agents that exhibit intelligent behavior in Atari game emulators. Such agents create a swarm of walkers that explores the Causal Cone and eventually, when the time horizon is met, select an action based on the walkers’ distribution over the action space.
3.1 Causal cones
In order to find the best path, Fractal Monte Carlo scans the space of possible future states, thereby constructing a tree of potential trajectories that describe the future evolution of the system. We define a Causal Cone as the set of all possible paths the system can take starting from an initial state $x_0$ if allowed to evolve over a time interval of length $\tau$, the ’time horizon’ of the cone. A Causal Cone can be divided into a set of Causal Slices $X_t$, with $0 \le t \le \tau$, where each Causal Slice contains all the possible future states of the paths at a given time $t$. If $t = \tau$, the Causal Slice is called the cone’s ‘horizon’ and contains the final states. The rest of the cone, where $t < \tau$, is usually referred to as the cone’s ‘bulk’.
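As an illustration of these definitions, the following minimal sketch enumerates the Causal Slices of a toy deterministic system. The integer-line dynamics, the action set and the names `causal_slices` and `step` are assumptions made for this example only; FMC itself samples the cone with walkers rather than enumerating it.

```python
def causal_slices(x0, actions, step, horizon):
    """Return a list of sets; slices[t] holds every state reachable after t steps."""
    slices = [{x0}]
    for _ in range(horizon):
        # Each new slice is the image of the previous slice under every action.
        slices.append({step(x, a) for x in slices[-1] for a in actions})
    return slices


def step(x, action):
    """Hypothetical toy dynamics: move along the integer line."""
    return x + action


slices = causal_slices(0, (-1, 0, 1), step, horizon=3)
bulk = slices[:-1]            # slices with t < horizon
horizon_slice = slices[-1]    # slice with t == horizon, the final states
print(sorted(horizon_slice))
```

Exhaustive enumeration of this kind grows quickly with the horizon and the number of actions, which is the cost a sampling approach is meant to avoid.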
3.2 Dead and alive statuses
The death condition is a flag set by the programmer that lets us incorporate arbitrary boundary conditions into the behaviour of the agent and helps the swarm avoid undesired regions of the state space. We will assume an external death condition is defined over the state space so that a portion of the space can be forbidden for the system, as shown in Figure 1. We will consider a state inside this excluded region "dead" while all other states are "alive".
3.3 Reward function
Agents make decisions based on a non-negative reward function $R(x)$ that (we assume) is defined over the state space. For every slice $X_t$ of the causal cone, we can calculate the total reward of the slice as the integral of the reward over the slice, $R_{tot}(t) = \int_{X_t} R(x)\,dx$. We may then convert the reward into a probability density over the slice as follows:
$$P_R(x \mid t) = \frac{R(x)}{R_{tot}(t)}, \qquad x \in X_t$$
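For a slice with finitely many states the integral becomes a sum. The sketch below shows the conversion from rewards to a probability distribution over one slice; the helper name `reward_density` and the reward values are made up for illustration.

```python
def reward_density(rewards):
    """Turn non-negative rewards over one slice into probabilities that sum to 1."""
    if any(r < 0 for r in rewards):
        raise ValueError("rewards must be non-negative")
    total = sum(rewards)          # discrete analogue of the integral over the slice
    if total == 0:
        raise ValueError("total reward of the slice is zero")
    return [r / total for r in rewards]


density = reward_density([1.0, 3.0, 0.0, 4.0])
print(density)
```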
The general idea behind the algorithm is that the density distribution of the scanning should match the reward density distribution of the state space.
The proposed algorithm uses a set of two policies to choose and score actions. First we define a scanning policy that, given a swarm of initially identical states, defines its possible evolution over time as a stochastic process.
After the scanning is finished we need a deciding policy that will assign a probability of being chosen to each action:
In order to measure how different two probability distributions are we will use a modified version of the Kullback-Leibler divergence:
This divergence is well defined for any pair of distributions, including the problematic case when some of the probabilities are zero.
4 Fractal Monte Carlo
Fractal Monte Carlo is a path-search algorithm derived from Fractal AI theory (Hernández et al., 2018) that produces intelligent behavior by maximizing an intrinsic reward represented as Causal Path Entropy (Wissner-Gross and Freer, 2013). When making a decision, Fractal Monte Carlo (FMC) establishes a tree that describes the future evolution of the system. This tree is expanded by a swarm of walkers that populates its leaf nodes. The swarm undergoes an interactive, cellular-automaton-like process in order to make the tree grow efficiently. When a maximum amount of computation has been reached, the utility of each action is considered proportional to the number of walkers that populate leaf nodes originating from the same action.
4.1 The algorithm
FMC steps are outlined below:
STEP 1: Initialize the walkers to the root state.
STEP 2: Perturb the swarm of walkers.
STEP 3: Evaluate the position of each walker with respect to the whole swarm.
STEP 4: Recycle the walkers that are in a dead state or have been poorly valued against the rest of the swarm.
STEP 5: Repeat steps 2-4 until we reach maximum computational resources.
STEP 6: Choose the action with the highest utility.
In more detail:
Every walker chooses a random action and acts in the environment.
For every walker, select an alive walker at random and measure the Euclidean distance between their observations.
Normalize distances and rewards using the "Relativize" function.
Calculate the virtual reward of each walker. We define virtual reward at a state $x_i$ as: $VR_i = R_i \cdot D_i$
Where $R_i$ is the reward value at state $x_i$ and $D_i$ is the normalized distance to the randomly selected walker. Virtual reward is a stochastic measure of the importance of a given walker with respect to the whole swarm.
Each walker A is compared to another randomly selected walker C and gets assigned a probability of cloning to the leaf node of walker C.
Determine if the walker A will clone to C based on the cloning probability and the death condition.
Finally transfer the walkers that are set for cloning to their target leaf node.
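The steps above can be sketched on a toy problem. Everything specific in this example is an assumption made for illustration: the integer-line environment and its death condition, the form of `relativize`, the virtual reward taken as the product of normalized reward and normalized distance, the cloning probability `(VR_C - VR_A) / VR_A` clipped to [0, 1], and a fixed number of iterations standing in for the computational budget. It is not the authors' implementation.

```python
import math
import random

ACTIONS = (-1, 1)  # hypothetical toy action set


def step(x, action):
    """Toy dynamics: walk on the integer line; positions below 0 are dead."""
    x_new = x + action
    dead = x_new < 0
    reward = 0.0 if dead else x_new + 1.0   # non-negative reward
    return x_new, reward, dead


def relativize(values):
    """Standardize, then squash so every output is positive (assumed form)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return [1.0] * len(values)
    zs = [(v - mean) / std for v in values]
    return [math.log(1 + z) + 1 if z > 0 else math.exp(z) for z in zs]


def fmc_choose_action(root, n_walkers=32, n_steps=10, seed=0):
    rng = random.Random(seed)
    # STEP 1: every walker starts at the root state.
    states = [root] * n_walkers
    rewards = [0.0] * n_walkers
    dead = [False] * n_walkers
    root_actions = [None] * n_walkers
    for t in range(n_steps):                       # STEP 5: fixed budget
        # STEP 2: perturb each walker with a random action.
        for i in range(n_walkers):
            action = rng.choice(ACTIONS)
            if t == 0:
                root_actions[i] = action           # remember the first action taken
            states[i], rewards[i], dead[i] = step(states[i], action)
        alive = [i for i in range(n_walkers) if not dead[i]]
        if not alive:
            break
        # STEP 3: virtual reward from relativized rewards and distances.
        partners = [rng.choice(alive) for _ in range(n_walkers)]
        dists = relativize([abs(states[i] - states[j]) for i, j in enumerate(partners)])
        vr = [r * d for r, d in zip(relativize(rewards), dists)]
        # STEP 4: dead walkers always clone; alive ones clone with some probability.
        targets = [rng.choice(alive) for _ in range(n_walkers)]
        snapshot = list(zip(states, rewards, root_actions))
        for i, c in enumerate(targets):
            p_clone = 1.0 if dead[i] else max(0.0, min(1.0, (vr[c] - vr[i]) / vr[i]))
            if rng.random() < p_clone:
                states[i], rewards[i], root_actions[i] = snapshot[c]
                dead[i] = False
    # STEP 6: utility of an action is the number of walkers that started with it.
    counts = {a: root_actions.count(a) for a in ACTIONS}
    return max(counts, key=counts.get), counts


action, counts = fmc_choose_action(root=0)
print(action, counts)
```

In this toy setting the action -1 taken from the root leads straight into the dead region, so walkers that started with it are recycled onto walkers that started with +1, and the walker count ends up concentrated on +1.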
When assigning a utility value to an option, FMC counts how many walkers took each option at the root state. To choose an action in the continuous or general case we calculate the average of the actions weighted by their normalized utilities or scores. In the discrete case, the action that best approximates the aforementioned average is chosen.
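A minimal sketch of this decision rule follows, assuming actions are plain numbers so that a weighted average is meaningful. The function names and the sample list of root actions are hypothetical.

```python
from collections import Counter


def action_utilities(root_actions):
    """Normalized utility of each action: share of walkers that took it at the root."""
    counts = Counter(root_actions)
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}


def decide_continuous(utilities):
    """Average of the actions weighted by their normalized utilities."""
    return sum(a * u for a, u in utilities.items())


def decide_discrete(utilities, available):
    """Pick the available action closest to the weighted average."""
    target = decide_continuous(utilities)
    return min(available, key=lambda a: abs(a - target))


# Root action of each of eight walkers after the scan (made-up values).
root_actions = [1.0] * 6 + [-1.0] + [0.0]
utilities = action_utilities(root_actions)
print(decide_continuous(utilities))
print(decide_discrete(utilities, (-1.0, 0.0, 1.0)))
```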
Given real values $x_1, \dots, x_n$ we calculate the mean $\mu$ and the standard deviation $\sigma$.
Normalize the values using: $x_i' = (x_i - \mu) / \sigma$
Reshape the values into a Gaussian N(0,1) distribution.
Then scale using: if then else
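A hedged sketch of such a normalization is given below. The standardization step follows the description above; the piecewise scaling (`log(1 + z) + 1` for positive values, `exp(z)` otherwise) is an assumption chosen because it is continuous at zero and always positive, since the exact expression is not legible in the text. The constant output for zero standard deviation is also an assumption.

```python
import math


def relativize(values):
    """Standardize values, then map them to strictly positive numbers."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # All values are equal, so no walker stands out (assumed convention).
        return [1.0] * n
    scaled = []
    for v in values:
        z = (v - mean) / std
        # Assumed piecewise scaling: both branches give 1.0 at z == 0.
        scaled.append(math.log(1 + z) + 1 if z > 0 else math.exp(z))
    return scaled


out = relativize([0.0, 1.0, 2.0])
print(out)
```

Keeping every output positive means that quantities built as products of relativized values stay positive as well.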
Time Horizon sets a limit for how far in the future the walkers of the Swarm will foresee the aftermath of their initial actions. In other words, the walkers will seek to meet their set Time Horizon when going deeper in the tree but never go past it. The ideal Time Horizon value allows an agent to see far enough in the future to detect which actions lead inevitably to death.
Max samples is an upper bound on computational resources. It limits the number of times that FMC can make a perturbation to build a causal cone. The algorithm will try to keep computational resource usage as low as it can providing it meets the time horizon. A good guide to setting this parameter is , with a number that works well in Atari games but highly depends on the task.
Number of walkers represents the maximum number of paths that FMC will simulate. This number is related to "how thick" we want the resulting representation of the causal cone to be. The algorithm will try to use the maximum number of walkers possible.
Time step (dt) is the time interval during which the agent keeps each decision it has made. Time horizon / dt defines the number of steps to be taken by the walkers.
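The four parameters can be grouped in a small configuration object, as in the sketch below. The class name, field names and the numeric values are hypothetical and are not taken from the authors' repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FMCParams:
    time_horizon: float   # how far into the future walkers look
    max_samples: int      # upper bound on the number of perturbations
    n_walkers: int        # maximum number of simulated paths
    dt: float             # how long each decision is kept

    @property
    def steps_per_walker(self) -> int:
        """Number of steps a walker takes to reach the horizon: time_horizon / dt."""
        return int(round(self.time_horizon / self.dt))

    def within_budget(self, samples_used: int) -> bool:
        """True while the scan has not exhausted its sample budget."""
        return samples_used < self.max_samples


params = FMCParams(time_horizon=30.0, max_samples=3000, n_walkers=32, dt=3.0)
print(params.steps_per_walker)
```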
4.3 Time complexity
Computational time complexity of the algorithm can be shown to be of with .
In this section we present our performance results in Atari games and then compare our approach to MCTS and other state-of-the-art learning-based approaches in eight games.
5.1 Atari environments
We tested our algorithm in 55 different Atari games using the OpenAI Gym toolkit. In most games the agent used RAM data as observations to make decisions. As seen in Table 1, our results show that Fractal Monte Carlo outperforms previous state-of-the-art (SOtA) approaches in 49 of those (89%). In each game we chose the parameters by experimentation and intuition. The extensive table containing the exact parameters used for every game is available in our GitHub repository (Hernández et al., 2017a). Furthermore, we compared FMC performance when using RAM data versus using images as observations in eight games and found that there is an overall performance difference in favor of RAM data (Table 2).
| Games played by FMC | 55 | 100.00% |
| FMC better than avg human | 51 | 92.73% |
| FMC better than SOtA | 49 | 89.09% |
| Solved or above human record | 25 | 45.45% |
| Solved due to the 1M bug | 11 | 20.00% |
| Environment | IMG | RAM | RAM vs IMG | RAM vs SoTA |
5.2 Comparison against UCT
In Table 3 we compare FMC with the state-of-the-art MCTS implementation UCT on the only eight games we found to be solved in the literature (Guo et al., 2014). The results presented in Table 3 show that our method clearly outperforms MCTS while being three to four orders of magnitude more efficient.
| Game | Score (UCT) | Score (FMC) | Score ratio | Samples per step (UCT) | Samples per step (FMC) | Sample ratio |
| Beam rider | 7,233 | 288,666 | 3991% | 3,000,000 | 946 | x 3,171 |
| Space invaders | 2,354 | 17,970 | 763% | 3,000,000 | 1,830 | x 1,639 |
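The derived columns of the two rows above can be recomputed from the raw numbers. The sketch assumes that, in each pair, the first value belongs to UCT and the second to FMC; this reading is inferred from the table layout and the surrounding text, and the dictionary keys are hypothetical names.

```python
import math

rows = {
    "Beam rider": {"uct_score": 7233, "fmc_score": 288666,
                   "uct_samples": 3_000_000, "fmc_samples": 946},
    "Space invaders": {"uct_score": 2354, "fmc_score": 17970,
                       "uct_samples": 3_000_000, "fmc_samples": 1830},
}


def derived(row):
    """Return (FMC score as a percentage of UCT score, UCT/FMC samples ratio)."""
    score_pct = round(100 * row["fmc_score"] / row["uct_score"])
    sample_ratio = round(row["uct_samples"] / row["fmc_samples"])
    return score_pct, sample_ratio


for game, row in rows.items():
    pct, ratio = derived(row)
    # log10 of the ratio gives the efficiency gain in orders of magnitude.
    print(game, f"{pct}%", f"x {ratio}", round(math.log10(ratio), 2))
```

Both sample ratios lie between 10^3 and 10^4, which matches the "three to four orders of magnitude" stated in the text.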
In this work, we propose a new thinking framework called Fractal AI theory that we used to define intelligent behavior. FAI principles provided a basis for creating a new Monte Carlo approach based on maximizing Causal Path Entropy. We tested our algorithm on Atari environments and our results showed that it performs better than state-of-the-art algorithms, like DQN and its variants, in most games.
Our algorithm has many potential applications, especially for improving methods that use MCTS. Furthermore, FMC can produce large amounts of high-quality training data for reinforcement learning agents. For example, deep Q agents learn to associate reward expectations with states after being trained on a huge amount of data mainly consisting of random rollouts. High-quality FMC rollouts can be fed into a DQN in the training stage, which might result in a boost in training performance. Another promising idea for improving FMC is to add learning capabilities to walkers using a neural network. The network would be trained on correct decisions made by the agent and output a probability distribution over the action space. Walkers would then sample that distribution instead of picking a random action, thus transitioning into an informed search model.
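The informed-search idea amounts to replacing the uniform action choice of a walker with a draw from a learned distribution. The sketch below shows only that sampling step; the function name `sample_action` is hypothetical and the prior is a hand-written list standing in for the output of a trained network, which is not modeled here.

```python
import random


def sample_action(actions, prior=None, rng=random):
    """Pick an action uniformly, or according to a prior over actions if one is given."""
    if prior is None:
        return rng.choice(actions)                       # uninformed walker
    return rng.choices(actions, weights=prior, k=1)[0]   # informed walker


actions = ("left", "right")
rng = random.Random(0)
uninformed = sample_action(actions, rng=rng)
informed = sample_action(actions, prior=[0.0, 1.0], rng=rng)
print(uninformed, informed)
```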
Fractal Monte Carlo is just one of the possible algorithms inspired by FAI theory. Much research is still needed to explore other possible implementations of this new concept and their many potential applications in the real world.
Bellemare et al. (2016)
Bellemare, M., S. Srinivasan, G. Ostrovski, T. Schaul, D. Saxton, et al.
2016. Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479.
Bellemare et al. (2017)
Bellemare, M. G., W. Dabney, and R. Munos
2017. A distributional perspective on reinforcement learning. arXiv preprint arXiv:1707.06887.
Chaslot et al. (2008)
Chaslot, G., S. Bakkes, I. Szita, and P. Spronck
2008. Monte-Carlo Tree Search: A New Framework for Game AI.
Chuchro and Gupta (2017)
Chuchro, R. and D. Gupta
2017. Game Playing with Deep Q-Learning using OpenAI Gym. P. 6.
Foley (2017)
Foley, D. J.
2017. Model-Based Reinforcement Learning in Atari 2600 Games. PhD thesis.
Fortunato et al. (2017)
Fortunato, M., M. G. Azar, B. Piot, J. Menick, I. Osband, A. Graves, V. Mnih,
R. Munos, D. Hassabis, O. Pietquin, C. Blundell, et al.
2017. Noisy Networks for Exploration. arXiv:1706.10295 [cs, stat]. arXiv: 1706.10295.
Fu and Hsu (2016)
Fu, J. and I. Hsu
2016. Model-based reinforcement learning for playing Atari games. Technical report, Stanford University.
Guo et al. (2014)
Guo, X., S. Singh, H. Lee, R. L. Lewis, et al.
2014. Deep learning for real-time Atari game play using offline Monte-Carlo tree search planning. In Advances in Neural Information Processing Systems, pp. 3338–3346.
Hernández et al. (2017a)
Hernández, S., G. Durán, and J. M. Amigó
2017a. FractalAI. https://github.com/FragileTheory/FractalAI.
Hernández et al. (2017b)
Hernández, S., G. Durán, and J. M. Amigó
2017b. General Algorithmic Search. arXiv:1705.08691 [math]. arXiv: 1705.08691.
Hernández et al. (2018)
Hernández, S., G. Durán, and J. M. Amigó
2018. Fractal AI: A fragile theory of intelligence. arXiv:1803.05049 [cs]. arXiv: 1803.05049.
Hester et al. (2017)
Hester, T., M. Vecerik, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan,
J. Quan, A. Sendonaris, G. Dulac-Arnold, et al.
2017. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732.
Isasi et al. (2014)
Isasi, P., M. Drugan, and B. Manderick
2014. Schemata monte carlo network optimization. In PPSN 2014 Workshop: In Search of Synergies between Reinforcement Learning and Evolutionary Computation.
Mnih et al. (2016)
Mnih, V., A. P. Badia, M. Mirza, A. Graves, T. P. Lillicrap, T. Harley,
D. Silver, and K. Kavukcuoglu
2016. Asynchronous Methods for Deep Reinforcement Learning. arXiv:1602.01783 [cs]. arXiv: 1602.01783.
Mnih et al. (2013)
Mnih, V., K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, et al.
2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
Mnih et al. (2015)
Mnih, V., K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare,
A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen,
C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra,
S. Legg, and D. Hassabis
2015. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533.
Plappert et al. (2017)
Plappert, M., R. Houthooft, P. Dhariwal, S. Sidor, R. Y. Chen, X. Chen,
T. Asfour, P. Abbeel, and M. Andrychowicz
2017. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905.
Salimans et al. (2017)
Salimans, T., J. Ho, X. Chen, S. Sidor, et al.
2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864.
Segler et al. (2017)
Segler, M., M. Preuß, and M. P. Waller
2017. Towards "AlphaChem": Chemical synthesis planning with tree search and deep neural network policies. arXiv preprint arXiv:1702.00020.
Silver et al. (2016)
Silver, D., A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche,
J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al.
2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
Wissner-Gross and Freer (2013)
Wissner-Gross, A. D. and C. E. Freer
2013. Causal entropic forces. Physical Review Letters, 110(16):168702.