Deep Reinforcement Learning with Model Learning and Monte Carlo Tree Search in Minecraft


Deep reinforcement learning has been successfully applied to several visual-input tasks using model-free methods. In this paper, we propose a model-based approach that combines learning a DNN-based transition model with Monte Carlo tree search to solve a block-placing task in Minecraft. Our learned transition model predicts the next frame and the rewards one step ahead given the last four frames of the agent’s first-person-view image and the current action. A Monte Carlo tree search algorithm then uses this model to plan the best sequence of actions for the agent to perform. On the proposed task in Minecraft, our model-based approach reaches performance comparable to that of the Deep Q-Network but learns faster and is thus more sample-efficient in training.




Keywords: Reinforcement Learning, Model-Based Reinforcement Learning, Deep Learning, Model Learning, Monte Carlo Tree Search


I would like to express my sincere gratitude to my supervisor Dr. Stefan Uhlich for his continuous support, patience, and immense knowledge that helped me a lot during this study. My thanks and appreciation also go to my colleague Anna Konobelkina for insightful comments on the paper as well as to Sony Europe Limited for providing the resources for this project.


1 Introduction

In deep reinforcement learning, visual-input tasks (i.e., tasks where the observation from the environment comes in the form of videos or pictures) are often used to evaluate algorithms: Minecraft and various Atari games are quite challenging for agents to solve [oh2016minecraft][mnih2015dqn]. When applied to these tasks, model-free reinforcement learning shows notably good results: e.g., a Deep Q-Network (DQN) agent approaches human-level game-playing performance [mnih2015dqn], and the Asynchronous Advantage Actor-Critic algorithm outperforms the known methods in half the training time [mnih2016a3c]. These achievements, however, do not change the fact that model-free methods are generally considered “statistically less efficient” than model-based ones: model-free approaches do not employ information about the environment directly, whereas model-based solutions do [dayan2008].

Working with a known environment model has its benefits: changes of the environment state can be foreseen, so planning ahead becomes less complicated. Developing algorithms with no known environment model available at the start is more demanding yet promising: less training data is required than for model-free approaches, and agents can utilize planning algorithms. Research has been progressing in this direction: e.g., a model-based agent surpassed the DQN’s results by using the Atari games’ true state for modelling [xiaoxiao2014mcts], and constructing transition models with video-frame prediction was proposed [oh2015frame].

These ideas paved the way for the following question: is it feasible to apply planning algorithms to a learned model of an environment that is only partially observable, such as in a Minecraft building task? To investigate this question, we developed a method that not only predicts future frames of a visual task but also calculates the possible rewards for the agent’s actions. Our model-based approach merges model learning through deep-neural-network training with Monte Carlo tree search and demonstrates results competitive with those of DQN when tested on a block-placing task in Minecraft.

[Figure 1 panels: spectator views at successive time steps; the DNN model is applied with the actions “turn right” and “place block”.]

Figure 1: Transition Model for the Block-Placing Task in Minecraft

2 Block-Placing Task in Minecraft

To evaluate the performance of the suggested approach as well as to compare it with model-free methods, namely, DQN, a block-placing task was designed: it makes use of the Malmo framework and is built inside the Minecraft game world [malmo2016].

At the beginning of the game, the agent is positioned at the wall of the playing “room”. There is a 5×5 playing field in the center of this room. The field is white, with each tile having a 0.1 probability of being colored at the start. Colored tiles indicate the locations where the agent should place a block. The goal of the game is to cover all the colored tiles with blocks in at most 30 actions. Five position-changing actions are allowed: moving forward by one tile, turning left or right by 90°, and moving sideways to the left or right by one tile. When the agent focuses on a tile, it can place a block with the sixth action. For each action, the agent receives feedback: every correctly placed block brings a +1 reward, an erroneously placed block causes a -1 punishment, and any action costs the agent a -0.04 penalty (this reward signal is introduced to stimulate the agent to solve the task in the minimum time required). To evaluate the environment, the agent is provided with a pre-processed (grayscaled and downsampled to 64×64) first-person-view picture of the current state. The task is deterministic and discrete in its action and state space. The challenge lies in the partial observability of the environment, with already placed blocks further obscuring the agent’s view. It is equally important to place blocks systematically so as not to obstruct the agent’s pathway. An example of the task is depicted in Fig. 1 (left) and a short demonstration is available at
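The reward scheme described above can be made concrete with a small sketch. Only the numeric values (+1, -1, -0.04, 30 actions) come from the task description; the function name and the outcome labels are hypothetical, and the sketch assumes that the -0.04 step penalty is added on top of the placement reward.

```python
STEP_PENALTY = -0.04  # charged for every action (value from the task description)
MAX_ACTIONS = 30      # action budget per task (value from the task description)

# Extra reward on top of the step penalty; the outcome labels are illustrative.
PLACEMENT_REWARD = {"correct": 1.0, "wrong": -1.0, "other": 0.0}


def episode_return(outcomes):
    """Return the undiscounted sum of rewards for a list of per-action outcomes."""
    if len(outcomes) > MAX_ACTIONS:
        raise ValueError("a task allows at most 30 actions")
    return sum(STEP_PENALTY + PLACEMENT_REWARD[o] for o in outcomes)


# Two moves followed by one correct placement: 3 * (-0.04) + 1 = 0.88
print(round(episode_return(["other", "other", "correct"]), 2))
```

Under this reading, every additional move lowers the return by 0.04, which is why short solutions score higher than long ones that cover the same tiles.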





[Figure 2 components: pixel input, convolutional layers, fully connected layer; action input, fully connected layer; deconvolutional layers (same dimensions in reversed order), pixel prediction; reward prediction.]

Figure 2: Architecture of the Transition Model

3 Model Learning

To learn the transition model, a deep convolutional neural network is used. The network takes the last four frames $s_{t-3}, \dots, s_t$ and an action $a_t$ as input and predicts the following frame $\hat{s}_{t+1}$. Additionally, it predicts the rewards for all the transitions following the predicted frame $\hat{s}_{t+1}$, one for each action $a_{t+1}$. Predicting the rewards one step ahead makes the application of search-based algorithms more efficient, as no additional simulation is required to explore rewards from transitions to neighboring states. This method, however, fails to predict the reward following the very first state. To address this issue, a “noop” action, predicting the reward of the current state, is introduced.

The network takes four 64×64-sized input frames and uses four convolutional layers, each followed by a rectified linear unit (ReLU), to encode the input information into a vector of size 4096. This vector is concatenated with the one-hot encoded action input, where the “noop” action is represented by a vector of all zeros, and then linearly transformed with a fully connected layer of size 4096, again followed by a ReLU. The resulting embedded vector is used for both the reward prediction and the frame prediction in the last part of the network. For the frame prediction, four deconvolutional layers are used with ReLUs in between and a sigmoid at the end. The dimensions of these layers are equivalent to those of the convolutional layers in the first part, in reversed order. The reward prediction is done by applying two fully connected linear layers of sizes 2048 and 6, respectively. A ReLU is used between the two layers, but not at the final output where we predict the rewards. The architecture of the neural network is illustrated in Fig. 2.
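The action encoding described above can be sketched in a few lines. The action names and their order are hypothetical (the text only specifies five position-changing actions plus one block-placing action); the sizes 4096 and 6 come from the text. This is an illustration of the input to the shared fully connected layer, not the authors' implementation.

```python
# Hypothetical action names and ordering; the text only states that there are
# five position-changing actions plus one block-placing action.
ACTIONS = ["forward", "turn_left", "turn_right",
           "strafe_left", "strafe_right", "place_block"]
EMBEDDING_SIZE = 4096  # size of the encoded frame vector (from the text)


def encode_action(action):
    """One-hot encode an action; "noop" maps to the all-zero vector."""
    vec = [0.0] * len(ACTIONS)
    if action != "noop":
        vec[ACTIONS.index(action)] = 1.0
    return vec


def fc_input(frame_embedding, action):
    """Concatenate the frame embedding with the encoded action."""
    if len(frame_embedding) != EMBEDDING_SIZE:
        raise ValueError("expected a 4096-dimensional embedding")
    return list(frame_embedding) + encode_action(action)
```

With six actions, the concatenated vector fed into the fully connected layer has 4096 + 6 = 4102 entries, and the “noop” action leaves the last six entries at zero.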

Training of the network is done with the help of experience replay [mnih2015dqn] to re-use and de-correlate the training data. Mini-batches of size 32 are sampled for training, and RMSProp [Tieleman2012] is used to update the weights. Both the frame prediction and the reward prediction are trained with a mean squared error (MSE) loss. On each mini-batch update, the gradients of the frame prediction loss are backpropagated completely through the network, while the gradients of the reward prediction loss are backpropagated only through two layers, until the embedded vector shared with the frame prediction is encountered. This procedure ensures that the network uses its full capacity to improve the prediction of the next state, while the reward prediction is trained independently on the embedded feature vector from the last shared intermediate layer. Due to the network’s structure, the shared layer may contain only the information necessary to construct the prediction of the next frame and no further past information. For the block-placing task, one frame suffices to predict the reward. For different tasks, using the previous layer with the action input is worth considering (cf. Fig. 2).
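The data side of this training procedure can be sketched as follows. The class and function names are hypothetical, and the buffer capacity is a free parameter that the text does not specify; only the mini-batch size of 32 and the MSE loss come from the text. The gradient routing between the two prediction heads is not shown, since it depends on the deep learning framework used.

```python
import random
from collections import deque

BATCH_SIZE = 32  # mini-batch size from the text


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.transitions = deque(maxlen=capacity)

    def add(self, frames, action, next_frame, next_rewards):
        self.transitions.append((frames, action, next_frame, next_rewards))

    def sample(self, rng=random):
        # Uniform sampling without replacement breaks up temporal correlation.
        return rng.sample(list(self.transitions), BATCH_SIZE)


def mse(prediction, target):
    """Mean squared error between two equally long sequences of numbers."""
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)
```

In a full training loop, `mse` would be evaluated once on the predicted frame and once on the predicted rewards for each sampled mini-batch.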

4 Monte Carlo Tree Search

Finding actions with maximum future reward is done with the help of a UCT-based strategy, Monte Carlo tree search (MCTS) [coulom2006mcts, kocsis2006uct], and the learned model described in Sec. 3. The input for the next step is based on the frame prediction of the model. This procedure can be repeated several times to roll out the model to future states. One tree-search trajectory is rolled out until a maximum depth is reached. During one rollout, all the rewards along the trajectory are predicted, as well as the rewards for neighboring states, due to the network’s structure. As the outputs of the neural network are deterministic, each state is evaluated only once but can still be visited multiple times since many paths go through the same states. The decision about which action to follow for each state during a tree-search trajectory is made based on the UCT measure $\mathrm{UCT}_s = v_s + k \sqrt{\ln n_p / n_s}$. The action with the maximum UCT value is chosen greedily, where $v_s$ is the maximum discounted future reward of the state $s$ so far, $n_s$ is the number of visits of state $s$, and $n_p$ is the number of visits of its parent. The hyperparameter $k$ controls the trade-off between exploration and exploitation, where a higher $k$ translates to more exploration. If a state has not been visited yet, it is preferred over the already visited ones. In case several states under consideration have not been visited yet, the one with the highest immediate reward is chosen. The reward is given by the neural network model that predicts rewards one step ahead. Whenever a path of higher maximum value is encountered during a search trajectory, the path’s value is propagated to the tree’s root node, updating every node with the new maximum reward value.
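The selection and backup rules of this section can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' code: the `Node` class and all function names are hypothetical, the UCT form used is the standard one (value plus an exploration term scaled by a constant `k`), and a node's value is assumed to include the predicted reward for entering that state.

```python
import math


class Node:
    def __init__(self, reward=0.0, parent=None):
        self.reward = reward        # predicted reward for the transition into this state
        self.parent = parent
        self.children = {}          # action -> Node
        self.visits = 0
        self.value = float("-inf")  # maximum discounted future reward found so far


def uct(node, k):
    """Standard UCT value, v_s + k * sqrt(ln(n_p) / n_s), for a visited node."""
    return node.value + k * math.sqrt(math.log(node.parent.visits) / node.visits)


def select_action(node, k=8.0):
    """Prefer unvisited children (highest immediate reward first), else max UCT."""
    unvisited = [a for a, c in node.children.items() if c.visits == 0]
    if unvisited:
        return max(unvisited, key=lambda a: node.children[a].reward)
    return max(node.children, key=lambda a: uct(node.children[a], k))


def backup_max(leaf, gamma=0.95):
    """Propagate the best value found so far from a leaf up to the root."""
    leaf.value = max(leaf.value, leaf.reward)  # nothing is rolled out beyond the leaf
    node = leaf
    while node.parent is not None:
        parent = node.parent
        parent.value = max(parent.value, parent.reward + gamma * node.value)
        node = parent
```

Because the backup keeps a maximum rather than an average, a single high-value trajectory is enough to raise the value of every node on its path, which suits a deterministic model whose states are evaluated only once.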

During the evaluation of the task in Minecraft, the agent performs MCTS for each action decision it has to make. The agent is given a fixed number of trajectories to roll out and chooses the action with the maximum future discounted reward as its next action. Subsequently, the agent receives a new ground-truth input frame from the environment for the last step and updates the root of the search tree with the new state information. The next step is again chosen by applying MCTS, beginning from the new state. Instead of starting from scratch, the UCT value is calculated with the maximum future reward value of the previous turn. This way, the tree-search results are carried over to the following steps, and trajectories with maximum future reward are updated first with the new input.
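One possible reading of this carry-over step is sketched below. All names are hypothetical, and the sketch assumes that the subtree under the chosen action is kept together with its stored values while the predicted frame at its root is replaced by the observed one; the text does not specify whether visit counts are reset, so they are left out here.

```python
class Node:
    def __init__(self, frame=None, value=0.0):
        self.frame = frame   # predicted (or observed) frame for this state
        self.value = value   # maximum discounted future reward found so far
        self.children = {}   # action -> Node


def best_action(root):
    """Pick the root action whose subtree holds the highest value."""
    return max(root.children, key=lambda a: root.children[a].value)


def advance_root(root, action, observed_frame):
    """Make the chosen child the new root and overwrite its predicted frame."""
    new_root = root.children.get(action)
    if new_root is None:
        new_root = Node()  # nothing was explored under this action
    new_root.frame = observed_frame  # ground truth replaces the prediction
    return new_root
```

A typical turn would call `best_action` after the rollouts, execute the action in the environment, and pass the returned frame to `advance_root` before the next search starts.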

5 Experiments and Results

Figure 3: MCTS Agent vs. DQN Agent: Average Reward and Success Rate on 100 Block-Placing Tasks in Minecraft. One training step = one mini-batch update; raw values (transparent lines, evaluated every 50K steps) are smoothed with a moving average (solid lines, 200K steps to each side).

The MCTS agent uses a maximum depth of ten for its rollouts and explores 100 trajectories before choosing the next action. Over the course of solving a single task, the MCTS agent explores 3000 trajectories at most since the task is limited to 30 actions. Trajectories, however, may be repeated with new input frames arriving after every action decision. The hyperparameter of UCT is set to eight. Compared to other domains, this value is rather large. The reason behind this decision is as follows: our exploitation measure (maximum future reward) is not limited in range, and the number of visits remains low since each node is evaluated only once.

We use the original DQN architecture [mnih2015dqn] to train a DQN agent for the block-placing task. We restrict the input of our method to four frames because the DQN architecture uses the same number of frames; both approaches thus receive equal input information. In Fig. 3, the results for the task-solving success rate and the average reward are presented. For both agents, MCTS and DQN, the two performance measures were evaluated over the course of training steps, where each step corresponds to a mini-batch update. As a test set, 100 block-placing tasks were employed. At the end of the training, both agents appear to level out at roughly the same scores. The DQN agent overestimates the Q-values in the first 1.7 million training steps. Although reducing the learning rate helped weaken this effect, it slowed down the training process. Overestimation of Q-values is a known problem with Q-learning.

As for the model-based approach, it can quickly learn a meaningful model that can achieve good results with MCTS. This suggests that learning the transition function is an easier task than learning Q-values for this block-placing problem. The main pitfall of the model-based approach lies in accumulating errors for rollouts that reach many steps ahead. In this particular block-placing task, MCTS trajectories are only 10 steps deep and Minecraft frames are rather structured, hence, the accumulating error problem is not that prominent. The slight increase in variance of the reward predictions of future steps is alleviated by using a discount rate of 0.95 for the MCTS agent as compared to a discount rate of 0.99 used by DQN. This value was found empirically to provide the best results.
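The effect of the two discount rates at the maximum rollout depth can be checked with a short calculation. The function name is hypothetical; the discount rates 0.95 and 0.99 and the depth of ten come from the text.

```python
def discount_weight(gamma, steps_ahead):
    """Weight applied to a reward that lies `steps_ahead` steps in the future."""
    return gamma ** steps_ahead


for gamma in (0.95, 0.99):
    # prints "0.95 0.599" and then "0.99 0.904"
    print(gamma, round(discount_weight(gamma, 10), 3))
```

With a discount rate of 0.95, a reward predicted ten steps ahead counts for roughly 60% of its nominal value, compared to roughly 90% with 0.99, so less reliable long-range predictions influence the plan less.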

                           Average Reward                      Success Rate
Agent                      1M steps  2.5M steps  5M steps      1M steps  2.5M steps  5M steps
MCTS                       2.15      2.36        2.42          0.64      0.70        0.72
MCTS, no 1-ahead reward    0.68      0.84        0.87          0.27      0.30        0.31
DQN                        0.05      2.22        2.59          0.12      0.64        0.74

Table 1: Results of Different Algorithms for the Block-Placing Task in Minecraft (cf. smoothed values in Fig. 3).

Table 1 shows the smoothed scores for both agents after 1, 2.5, and 5 million training steps. After 1 million steps, the model-based approach can already solve a considerable number of block-placing tasks, whereas the DQN agent has not learned a reasonable policy yet. For the DQN agent to catch up with the MCTS agent, it needs 1.5 million additional training steps, which underlines the model-based approach’s data efficiency. In the end, DQN beats the MCTS agent only by a small margin (74% vs. 72% success rate). Additionally, the table includes the MCTS agent’s scores if the one-step-ahead prediction of the reward is not employed. During the tree search, this agent chooses a random action among unexplored future states instead of greedily choosing the action with maximum immediate reward. For the block-placing task, using one-step-ahead predicted rewards more than doubles the score across training steps, i.e., scores comparable with those of the DQN agent become achievable with only 100 trajectories.

6 Conclusion

In this paper, we explored the idea of creating a model-based reinforcement learning agent that could perform competitively with model-free methods, DQN in particular. To implement such an agent, a synthesis of learning a transition model with a deep neural network and MCTS was developed. Our tests on a block-placing task in Minecraft show that learning a meaningful transition model requires considerably less training data than learning Q-values that yield comparable scores with DQN. As the MCTS agent uses a tree search for finding the best action, it takes longer to perform one action in comparison with DQN. Therefore, our approach is interesting for cases where obtaining training samples from the environment is costly. The nature of the block-placing task justifies the greedy choice of immediate rewards and, hence, applying the one-step-ahead prediction significantly improves the scores. The transition model suffers from the limited information of the last four input frames: if past information becomes unavailable (e.g., no longer visible in the last four frames), the model makes incorrect predictions about the environment, leading to suboptimal actions. Further research into using a recurrent neural network could help eliminate this issue.
