Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search

Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search

Abstract

Monte Carlo Tree Search (MCTS) algorithms have achieved great success on many challenging benchmarks (e.g., Computer Go). However, they generally require a large number of rollouts, making their applications costly. Furthermore, it is also extremely challenging to parallelize MCTS due to its inherent sequential nature: each rollout heavily relies on the statistics (e.g., node visitation counts) estimated from previous simulations to achieve an effective exploration-exploitation tradeoff. In spite of these difficulties, we develop an algorithm, WU-UCT1, to effectively parallelize MCTS, which achieves linear speedup and exhibits only limited performance loss with an increasing number of workers. The key idea in WU-UCT is a set of statistics that we introduce to track the number of on-going yet incomplete simulation queries (named as unobserved samples). These statistics are used to modify the UCT tree policy in the selection steps in a principled manner to retain an effective exploration-exploitation tradeoff when we parallelize the most time-consuming expansion and simulation steps. Experiments on a proprietary benchmark and the Atari Game benchmark demonstrate the linear speedup and the superior performance of WU-UCT compared with existing techniques.


1 Introduction

Recently, Monte Carlo Tree Search (MCTS) algorithms such as UCT (kocsis2006improved) have achieved great success in solving many challenging artificial intelligence (AI) benchmarks, including video games (guo2016deep) and Go (silver2016mastering2). However, they rely on a large number (e.g., millions) of interactions with the environment emulator to construct search trees for decision-making, which leads to high time complexity (browne2012survey). For this reason, there has been an increasing demand for parallelizing MCTS over multiple workers. However, parallelizing MCTS without degrading its performance is difficult (segal2010scalability; mirsoleimani2018lock; chaslot2008parallel), mainly because each MCTS iteration requires information from all previous iterations to provide an effective exploration-exploitation tradeoff. Specifically, parallelizing MCTS inevitably obscures this crucial information, and we will show in Section 2.2 that this loss of information potentially results in a significant performance drop. The key question is therefore how to acquire and utilize more of the available information to mitigate the information loss caused by parallelization and help the algorithm achieve a better exploration-exploitation tradeoff.

To this end, we propose WU-UCT (Watch the Unobserved in UCT), a novel parallel MCTS algorithm that attains linear speedup with only limited performance loss. This is achieved by a conceptual innovation (Section 3.1) as well as an efficient real system implementation (Section 3.2). Specifically, the key idea in WU-UCT to overcome the aforementioned challenge is a set of statistics that tracks the number of on-going yet incomplete simulation queries (named as unobserved samples). We combine these newly devised statistics with the original statistics of observed samples to modify UCT’s policy in the selection steps in a principled manner, which, as we shall show in Section 4, effectively retains the exploration-exploitation tradeoff during parallelization. Our proposed approach has been successfully deployed in a production system for efficiently and accurately estimating the rate at which users pass levels (termed user pass-rate) in a mobile game “Joy City”, with the purpose of reducing its design cycles. On this benchmark, we show that WU-UCT achieves near-optimal linear speedup and superior performance in predicting user pass-rate (Section 5.1). We further evaluate WU-UCT on the Atari Game benchmark and compare it to state-of-the-art parallel MCTS algorithms (Section 5.2), which also demonstrates its superior speedup and performance.

2 On the Difficulties of Parallelizing MCTS

We first introduce the MCTS and the UCT algorithms, along with their difficulties in parallelization.

2.1 Monte Carlo Tree Search and Upper Confidence Bound for Trees (UCT)

We consider the Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, P, r, \gamma \rangle$, where an agent interacts with the environment in order to maximize a long-term cumulative reward. Specifically, an agent at state $s_t$ takes an action $a_t$ according to a policy $\pi(a_t \mid s_t)$, so that the MDP transits to the next state $s_{t+1}$ and emits a reward $r_t = r(s_t, a_t)$.2 The objective of the agent is to learn an optimal policy $\pi^{\ast}$ such that the long-term cumulative reward is maximized:

$\pi^{\ast} = \arg\max_{\pi} \, \mathbb{E}\big[ \textstyle\sum_{t=0}^{\infty} \gamma^{t} r_t \,\big|\, s_0, \pi \big]$   (1)

where $s_0$ denotes the initial state and $\gamma \in [0, 1)$ is the discount factor.3 Many reinforcement learning (RL) algorithms have been developed to solve the above problem (sutton2018reinforcement), including model-free algorithms (mnih2013playing; mnih2016asynchronous; williams1992simple; konda2000actor; schulman2015trust; schulman2017proximal) and model-based algorithms (nagabandi2018neural; weber2017imagination; bertsekas2005dynamic; deisenroth2011pilco). Monte Carlo Tree Search (MCTS) is a model-based RL algorithm that plans the best action at each time step (browne2012survey). Specifically, it uses the MDP model (or its sampler) to identify the best action at each time step by constructing a search tree (Figure 1(a)), where each node $s$ represents a visited state, each edge from $s$ denotes an action $a$ that can be taken at that state, and the landing node $s'$ denotes the state it transits to after taking $a$. As shown in Figure 1(a), MCTS repeatedly performs four sequential steps: selection, expansion, simulation and backpropagation. The selection step traverses over the existing search tree until reaching a leaf node (or until another termination condition is satisfied) by choosing actions (edges) at each node according to a tree policy. One widely used node-selection policy is the one used in the Upper Confidence bound for Trees (UCT) (kocsis2006improved):

$s^{\ast} = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\tfrac{2 \ln N_{s}}{N_{s'}}} \Big\}$   (2)

where $\mathcal{C}(s)$ denotes the set of all child nodes for $s$; the first term $V_{s'}$ is an estimate for the long-term cumulative reward that can be received when starting from the state represented by node $s'$, and the second term represents the uncertainty (size of the confidence interval) of that estimate. The confidence interval is calculated based on the Upper Confidence Bound (UCB) (auer2002finite; auer2002using) using $N_s$ and $N_{s'}$, which denote the number of times that the nodes $s$ and $s'$ have been visited, respectively. Therefore, the key idea of the UCT policy (2) is to select the best action according to an optimistic estimation (i.e., the upper confidence bound) of the expected return, which strikes a balance between exploitation (first term) and exploration (second term), with $\beta$ controlling their tradeoff. Once the selection process reaches a leaf node of the search tree (or other termination conditions are met), we will expand the node according to a prior policy by adding a new child node. Then, in the simulation step, we estimate its value function (cumulative reward) by running the environment simulator with a default (simulation) policy. Finally, during backpropagation, we update the statistics $N_s$ and $V_s$ from the leaf node $s_T$ to the root node $s_0$ of the selected path by recursively performing the following update (i.e., from $t = T$ down to $t = 0$):

$N_{s_t} \leftarrow N_{s_t} + 1, \qquad V_{s_t} \leftarrow V_{s_t} + \tfrac{1}{N_{s_t}}\big(R_t - V_{s_t}\big), \qquad R_t = r(s_t, a_t) + \gamma R_{t+1}, \quad R_T = R$   (3)

where $R$ is the simulation return of $s_T$; $a_t$ denotes the action selected following (2) at state $s_t$.
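
To make the selection and backpropagation steps concrete, the following minimal Python sketch implements the UCT tree policy (2) and the recursive update (3). It is an illustration rather than the implementation used in this paper: the class name Node, its attributes (parent, children, reward, N, V), and the default values of beta and gamma are assumptions of this sketch.

    import math

    class Node:
        def __init__(self, parent=None, reward=0.0):
            self.parent = parent    # parent node (None for the root)
            self.children = []      # expanded child nodes
            self.reward = reward    # immediate reward collected on the edge into this node
            self.N = 0              # visit count N_s
            self.V = 0.0            # value estimate V_s

    def uct_select(node, beta=1.0):
        # UCT policy (2); assumes every child has been visited at least once
        # (unvisited children are handled by the expansion step instead).
        return max(
            node.children,
            key=lambda c: c.V + beta * math.sqrt(2.0 * math.log(node.N) / c.N),
        )

    def backpropagate(leaf, simulation_return, gamma=0.99):
        # Recursive update (3): walk from the leaf back to the root,
        # updating the running mean V_s and the visit count N_s.
        node, ret = leaf, simulation_return
        while node is not None:
            node.N += 1
            node.V += (ret - node.V) / node.N
            ret = node.reward + gamma * ret   # discount and add the edge reward
            node = node.parent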

(a) Each (non-parallel) MCTS rollout consists of four sequential steps: selection, expansion, simulation and backpropagation, where the expansion and the simulation steps are generally most time-consuming.
(b) Ideal parallelization
(c) Naive parallelization
(d) WU-UCT
Figure 1: MCTS and its parallelization. (a) An overview of MCTS. (b) The ideal parallelization: the most up-to-date statistics (shown in color) are assumed to be available to all workers as soon as a simulation begins (unrealistic in practice). (c) The key challenge in parallelizing MCTS: the workers can only access outdated statistics (shown in gray), leading to problems such as collapse of exploration. (d) WU-UCT tracks the number of incomplete simulation queries, denoted as $O_s$, and modifies the UCT policy in a principled manner to retain an effective exploration-exploitation tradeoff. It achieves comparable speedup and performance to the ideal parallelization.

2.2 The Intrinsic Difficulties of Parallelizing MCTS

The above discussion implies that the MCTS algorithm is intrinsically sequential: each selection step in a new rollout requires the previous rollouts to complete in order to deliver the updated statistics, $N_s$ and $V_s$, for the UCT tree policy (2). Although this requirement of up-to-date statistics is not mandatory for implementation, it is in practice crucial for achieving an effective exploration-exploitation tradeoff (auer2002finite). Specifically, up-to-date statistics best help the UCT tree policy to identify and prune non-rewarding branches as well as to visit rewarding paths more extensively for additional planning depth. Likewise, to achieve the best possible performance when multiple workers are used, it is also important to ensure that each worker uses the most recent statistics (the colored $N_s$ and $V_s$ in Figure 1(b)) in its own selection step. However, this is impossible when parallelizing MCTS, based on the following observations. First, the expansion step and the simulation step are generally more time-consuming than the other two steps, because they involve a large number of interactions with the environment (or its simulator). Therefore, as exemplified by Figure 1(c), when a worker C initiates a new selection step, the other workers A and B are most likely still in their simulation or expansion steps. This prevents them from updating the (global) statistics for other workers like C, which only happens at their respective backpropagation steps. Using outdated statistics (the gray-colored $N_s$ and $V_s$) at different workers could lead to a significant performance loss given a fixed target speedup, due to behaviors like collapse of exploration or exploitation failure, which we discuss thoroughly in Section 4. To give an example, Figure 1(c) illustrates the collapse of exploration, where worker C traverses over the same path as worker A in its selection step due to the determinism of (2). Specifically, if the statistics are unchanged between the moments that workers A and C begin their own selection steps, they will choose the same node according to (2), which greatly reduces the diversity of exploration. Therefore, the key question that we want to address in parallelizing MCTS is how to track the correct statistics and modify the UCT policy in a principled manner, with the hope of retaining an effective exploration-exploitation tradeoff at different workers.

3 WU-UCT

In this section, we first develop the conceptual idea of our WU-UCT algorithm (Section 3.1), and then we present a real system implementation using a master-worker architecture (Section 3.2).

3.1 Watch the Unobserved Samples in UCT Tree Policy

As we pointed out earlier, the key question we want to address in parallelizing MCTS is how to deliver the most up-to-date statistics to each worker so that they can achieve an effective exploration-exploitation tradeoff in their selection steps. This is assumed to be the case in the ideal parallelization in Figure 1(b). Algorithmically, it is equivalent to the sequential MCTS except that the rollouts are performed in parallel by different workers. Unfortunately, in practice, the statistics available to each worker are generally outdated because of the slow and incomplete simulation and expansion steps at the other workers. Specifically, since the estimated value of a node is unobservable before its simulations complete, and since workers should not wait for the updated statistics to proceed, the (partial) loss of statistics is unavoidable. Now the question becomes: is there an alternative way to address the issue? The answer is in the affirmative and is explained below.

Aiming at bridging the gap between naive parallelization and the ideal case, we closely examine their difference in terms of the availability of statistics. As illustrated by the colors of the statistics, their only difference in the available statistics is caused by the on-going simulation processes. As suggested by (3), although $N_s$ and $V_s$ can only be updated after a simulation step is completed, the information that a new simulation query has been initiated is actually available as early as a worker initiates the rollout. This is the key insight that we leverage to enable effective parallelization in our WU-UCT algorithm. Motivated by this, we introduce another quantity, $O_s$, to count the number of rollouts that have been initiated but not yet completed, which we name unobserved samples. That is, our new statistics $O_s$ watch the number of unobserved samples, and are then used to correct the UCT tree policy (2) into the following form:

$s^{\ast} = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\tfrac{2 \ln (N_{s} + O_{s})}{N_{s'} + O_{s'}}} \Big\}$   (4)

The intuition of the above modified node-selection policy is that when there are $O_{s'}$ workers simulating (querying) node $s'$, the confidence interval at node $s'$ will eventually shrink after they complete. Therefore, adding $O_s$ and $O_{s'}$ to the exploration term takes this fact into account beforehand and lets the other workers be aware of it. Despite its simple form, (4) provides a principled way to retain an effective exploration-exploitation tradeoff under parallel settings; it corrects the confidence bound towards a better exploration-exploitation tradeoff. Since the confidence level is updated instantly (i.e., at the beginning of a simulation), later workers are guaranteed to observe the additional statistics, which prevents them from extensively querying the same node and helps them find better nodes to query. For example, when multiple children are in demand for exploration, (4) allows them to be explored evenly. In contrast, when a node has been sufficiently visited (i.e., large $N_s$ and $N_{s'}$), adding $O_s$ and $O_{s'}$ from the unobserved samples has little effect on (4) because the confidence interval has already shrunk sufficiently around $V_{s'}$, allowing extensive exploitation of the best-valued child.
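
As a minimal illustration, the modified policy (4) only changes the selection rule; extending the Node sketch from Section 2.1 with an integer attribute O (the number of unobserved samples, initialized to 0), it can be written as follows. The function name and the default beta are assumptions of this sketch.

    def wu_uct_select(node, beta=1.0):
        # WU-UCT policy (4): add the unobserved counts O to both visit counts
        # so that in-flight simulation queries are accounted for immediately.
        return max(
            node.children,
            key=lambda c: c.V + beta * math.sqrt(
                2.0 * math.log(node.N + node.O) / (c.N + c.O)
            ),
        )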

3.2 System Implementation Using a Master-Worker Architecture

We now proceed to explain the system implementation of WU-UCT, where the overall architecture is shown in Figure 2(a) (see Appendix A for the details). Specifically, we use a master-worker architecture to implement the WU-UCT algorithm with the following considerations. First, since the expansion and the simulation steps are much more time-consuming than the selection and the backpropagation steps, they should be intensively parallelized. In fact, they are relatively easy to parallelize (e.g., different simulations can be performed independently). Second, as we discussed earlier, different workers need to access the most up-to-date statistics in order to achieve a successful exploration-exploitation tradeoff. To this end, a centralized architecture for the selection and backpropagation steps is preferable, as it allows strict control over the retrieval and update of the statistics, keeping them up-to-date. Specifically, we use a centralized master process to maintain a global set of statistics (in addition to other data such as game states), and let it be in charge of the backpropagation step (i.e., updating the global statistics) and the selection step (i.e., exploiting the global statistics). As shown in Figure 2(a), the master process repeatedly performs rollouts until a predefined number of simulations is reached. During each rollout, it selects nodes to query, assigns expansion and simulation tasks to different workers, and collects the returned results to update the global statistics. In particular, we use the following incomplete update and complete update (shown in Figure 2(a)) to track $O_s$ and $N_s$ along the traversed path (see Figure 1(d)):

$O_{s_t} \leftarrow O_{s_t} + 1$ (incomplete update)   (5)
$O_{s_t} \leftarrow O_{s_t} - 1, \qquad N_{s_t} \leftarrow N_{s_t} + 1$ (complete update)   (6)

where the incomplete update is performed before the simulation task starts, allowing the updated statistics to be instantly available globally; the complete update is done after the simulation return is available, resembling the backpropagation step in the sequential algorithm. In addition, $V_s$ is also updated in the complete update step using (3). Such a clear division of labor between the master and the workers provides sequential selection and backpropagation steps while we parallelize the costly expansion and simulation steps. It ensures up-to-date statistics for all workers through the centralized master process and achieves linear speedup without much performance degradation (see Section 5 for the experimental results).
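
The two updates can be sketched as follows, mirroring Algorithms 2-3 in Appendix A but only as an illustration; the sketch reuses the Node class introduced earlier (with the attribute O for unobserved samples), and the discount handling and default gamma are assumptions.

    def incomplete_update(node):
        # (5): called by the master right before a simulation task is dispatched,
        # so the in-flight query is immediately visible to later selection steps.
        while node is not None:
            node.O += 1
            node = node.parent

    def complete_update(node, simulation_return, gamma=0.99):
        # (6): called when the simulation result returns; also applies the
        # value and visit-count update (3) along the traversed path.
        ret = simulation_return
        while node is not None:
            node.O -= 1
            node.N += 1
            node.V += (ret - node.V) / node.N
            ret = node.reward + gamma * ret
            node = node.parent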

(a) The system architecture that implements WU-UCT.
(b) Time consumption in the Tap game. (c) Time consumption in the Atari games.
Figure 2: Diagram of WU-UCT’s system architecture and its time consumption. (a) The green blocks and the task buffers are operated by the master, while the blue blocks are executed by the workers. (b)-(c) The breakdown of the time consumption on two game benchmarks (Section 5).

To justify the above rationale of our system design, we perform a set of running-time analyses for our developed WU-UCT system and report the results in Figure 2(b)-(c). We show the time consumption of different parts at the master and at the workers. First, we focus exclusively on the workers. With a close-to-100% occupancy rate for the simulation workers, the simulation step is fully parallelized. Although the expansion workers are not fully utilized, the expansion step is maximally parallelized since the number of required simulation and expansion tasks is identical. This suggests the existence of an optimal (task-dependent) ratio between the number of expansion workers and the number of simulation workers that fully parallelizes both steps with the least resources (e.g., memory). Returning to the master process, on both benchmarks we see a clear dominance of the time spent on the simulation and the expansion steps even though they are both parallelized by 16 workers. This supports our design to parallelize only the simulation and expansion steps. We finally focus on the communication overhead caused by parallelization. Although more time-consuming than the selection and backpropagation steps, the communication overhead is negligible compared to the time used by the expansion and the simulation steps. Other details of our system, such as the centralized game-state storage, are further discussed in Appendix A.

4 The Benefits of Watching Unobserved Samples

In this section, we discuss the benefits of watching unobserved samples in WU-UCT and compare it with several popular parallel MCTS algorithms (Figure 3), including Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP).4 LeafP parallelizes the leaf simulation, which has led to an effective Hex game solver (wang2018hex). TreeP with virtual loss has recently achieved great success in challenging real-world tasks such as Go (silver2016mastering2). RootP assigns the subtrees of the root node to different workers and aggregates the statistics of the subtrees after all the workers complete their simulations (soejima2010evaluating).

(a) Leaf parallelization (LeafP)
(b) Tree parallelization (TreeP)
(c) Root parallelization (RootP)
Figure 3: Three popular parallel MCTS algorithms. LeafP parallelizes the simulation steps, TreeP uses virtual loss to encourage exploration, and RootP parallelizes the subtrees of the root node.

We argue that, by introducing the additional statistics $O_s$, WU-UCT achieves a better exploration-exploitation tradeoff than the above methods. First, LeafP and TreeP represent two extremes in such a tradeoff. LeafP lacks diversity in exploration, as all its workers are assigned to simulating the same node, leading to a performance drop caused by collapse of exploration in much the same way as the naive parallelization (see Figure 1(c)). In contrast, although the virtual loss used in TreeP could encourage exploration diversity, this hard additive penalty could cause exploitation failure: workers will be less likely to co-simulate the same node even when they are certain that it is optimal (mirsoleimani2017analysis). RootP tries to avoid these issues by letting each worker perform an independent tree search. However, this reduces the equivalent number of rollouts at each worker, decreasing the accuracy of the UCT policy (2). Different from the above three approaches, WU-UCT achieves a much better exploration-exploitation tradeoff in the following manner. It encourages exploration by using $O_s$ to "penalize" the nodes that have many in-progress simulations. Meanwhile, it allows multiple workers to exploit the most rewarding node, since this "penalty" vanishes when $N_s$ becomes large (see (4)).
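
The contrast can also be seen directly from the selection rules; the following is a sketch of the argument in the notation of (2) and (4), with the virtual-loss behavior taken as in our TreeP baseline (Appendix B). WU-UCT leaves the value term $V_{s'}$ untouched and only replaces the confidence radius $\beta\sqrt{2\ln N_s / N_{s'}}$ with

\[
\beta\sqrt{\frac{2\ln\left(N_{s}+O_{s}\right)}{N_{s'}+O_{s'}}}.
\]

Since the numerator is shared by all children of $s$, a child with in-flight queries ($O_{s'}>0$) receives a smaller radius than an otherwise identical sibling without them, which discourages redundant queries; and the deviation from the sequential UCT radius vanishes as $N_{s'}$ grows with $O_{s'}$ bounded, so heavily visited optimal children can still be exploited by many workers at once. A fixed virtual loss, in contrast, directly lowers $V_{s'}$ by a constant amount until backpropagation completes, a penalty that does not shrink as $N_{s'}$ grows.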

5 Experiments

This section evaluates the proposed WU-UCT algorithm on a production system to predict the user pass-rate of a mobile game (Section 5.1) as well as on the public Atari Game benchmark (Section 5.2), aiming at demonstrating the superior performance and near-linear speedup of WU-UCT.

5.1 Experiments on the “Joy City” Game

Joy City is a level-oriented game with diverse and challenging gameplay. Players tap to eliminate connected items on the game board. To pass a level, players have to complete certain goals within a given number of steps.5 The number of steps used to pass a level (termed game step) is the main performance metric, which differentiates masters from beginners. It is a challenging reinforcement learning task due to its large number of game states and the high randomness in its transitions. The goal of the production system is to accurately predict the user pass-rate of different game levels, providing useful and fast feedback for game design. Powered by WU-UCT, the system runs substantially faster while accurately predicting user pass-rate (8.6% MAE). In this subsection, we concentrate our analysis on the speedup and performance of WU-UCT using two typical game levels (Level-35 and Level-58)6, and refer the readers interested in the user pass-rate prediction system to Appendix C.

(a) Speedup (level 35)
(b) Speedup (level 58)
(c) Performance (level 35)
(d) Performance (level 58)
Figure 4: WU-UCT speedup and performance. Results are averaged over 10 runs. WU-UCT achieves linear speedup with negligible performance loss (measured in game steps).

We evaluate WU-UCT with different numbers of expansion and simulation workers (from 1 to 16) and report the speedup results in Figure 4(a)-(b). For all experiments, we fix the total number of simulations to 500. First, note that when we use the same number of expansion workers and simulation workers, WU-UCT achieves linear speedup. Furthermore, Figure 4 also suggests that both the expansion workers and the simulation workers are crucial, since lowering the number of workers in either set decreases the speedup. Besides the near-linear speedup, WU-UCT suffers negligible performance loss with an increasing number of workers, as shown in Figure 4(c)-(d). The standard deviations of the performance (measured in average game steps) across different numbers of expansion and simulation workers are small for both Level-35 and Level-58, much smaller than the corresponding average game steps.
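
For reference, the speedup numbers reported here and in Table 3 can be read as the usual wall-clock ratio; the exact measurement protocol below is an assumption rather than something stated explicitly above:

\[
\mathrm{speedup}(M_{\mathrm{exp}}, M_{\mathrm{sim}}) \;=\; \frac{T(1, 1)}{T(M_{\mathrm{exp}}, M_{\mathrm{sim}})},
\]

where $T(M_{\mathrm{exp}}, M_{\mathrm{sim}})$ is the wall-clock time needed to finish the fixed budget of 500 simulations with $M_{\mathrm{exp}}$ expansion workers and $M_{\mathrm{sim}}$ simulation workers; linear speedup then means the ratio grows roughly in proportion to $M$ when $M_{\mathrm{exp}} = M_{\mathrm{sim}} = M$.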

5.2 Experiments on the Atari Game Benchmark

We further evaluate WU-UCT on Atari games (bellemare2013arcade), a classical benchmark for reinforcement learning (RL) and planning algorithms (guo2014deep). The Atari games are an ideal testbed for MCTS algorithms due to their long planning horizons (several thousand steps), sparse rewards, and complex game strategies. We compare WU-UCT to the three parallel MCTS algorithms discussed in Section 4: TreeP, LeafP, and RootP (additional experimental results comparing WU-UCT with a variant of TreeP are provided in Appendix E). We also report the results of sequential UCT (considerably slower than WU-UCT) and PPO (schulman2017proximal) as references. Generally, the performance of sequential UCT sets an upper bound for parallel UCT algorithms. PPO is included since we used a distilled PPO policy network (hinton2015distilling; rusu2015policy) as the roll-out policy for all other algorithms; it is considered a performance lower bound for both parallel and sequential UCT algorithms. All experiments are performed with a total of 128 simulation steps, and all parallel algorithms use 16 workers (see Appendix D for the details).

Environment WU-UCT TreeP LeafP RootP PPO UCT
Alien 5938±1839 4200±1086 4280±1016 5206±282 1850 6820
Boxing 100±0* 99±0 95±4 98±1 94 100
Breakout 408±21 390±33 331±45 281±27 274 462
Centipede 1163034±403910* 439433±207601 162333±69575 184265±104405 4386 652810
Freeway 32±0 32±0 31±1 32±0 32 32
Gravitar 5060±568 4880 3385±155 4160±1811 737 4900
MsPacman 19804±2232* 14000±2807 5378±685 7156±583 2096 23021
NameThisGame 29991±1608* 23326±2585 25390±3659 27440±9533 6254 38455
RoadRunner 46720±1359* 24680±3316 25452±2977 38300±1191 25076 52300
Robotank 101±19 86±13 80±11 78±13 5 82
Qbert 13992±5596 14620±5738 11655±5373 9465±3196 14293 17250
SpaceInvaders 3393±292 2651±828 2435±1159 2543±809 942 3535
Tennis 4±1* -1±0 -1±0 0±1 -14 5
TimePilot 55130±12474* 32600±2165 38075±2307 45100±7421 4342 52600
Zaxxon 39085±6838 39579±3942 12300±821 13380±769 5008 46800
Table 1: The performance on 15 Atari games. Average episode return (± standard deviation) over 10 trials is reported. The best average scores among the parallel algorithms are highlighted in boldface. The mark * indicates that WU-UCT achieves statistically better performance (p-value < 0.0011 in a paired t-test) than TreeP (no mark if the two methods perform statistically similarly); analogous marks indicate that WU-UCT performs statistically better than LeafP and RootP, respectively.
Figure 5: Speed and performance test of our WU-UCT along with three baselines on four Atari games. All experiments are repeated three times and the mean and standard deviation (for episode reward only) are reported. For WU-UCT, the number of expansion workers is fixed to be one.

We first compare the performance, measured by the average episode reward, of WU-UCT and the baselines on 15 Atari games, using 16 simulation workers and 1 expansion worker (for a fair comparison, since the baselines do not parallelize the expansion step). Each task is repeated 10 times, with the mean and standard deviation reported in Table 1. Thanks to its better exploration-exploitation tradeoff during selection, WU-UCT outperforms all other parallel algorithms on 12 out of 15 tasks. Pairwise Student's t-tests further show that WU-UCT performs significantly better (p-value < 0.0011, adjusted by the Bonferroni method) than TreeP, LeafP, and RootP on 7, 9, and 7 tasks, respectively. Next, we examine the influence of the number of simulation workers on speed and performance. In Figure 5, we compare the average episode return as well as the time consumption (per step) for 4, 8, and 16 simulation workers. The bar plots indicate that WU-UCT experiences little performance loss with an increasing number of workers, while the baselines exhibit significant performance degradation when heavily parallelized. WU-UCT also achieves the fastest speed among all compared methods, thanks to the efficient master-worker architecture (Section 3.2). In conclusion, our proposed WU-UCT not only outperforms the baseline approaches significantly under the same number of workers but also suffers negligible performance loss with an increasing level of parallelization.

6 Related Work

MCTS Monte Carlo Tree Search is a planning method for optimal decision making in problems with either deterministic (silver2016mastering2) or stochastic (schafer2008uct) environments. It has made a profound influence on Artificial Intelligence applications (browne2012survey), and has even been applied to predict and mimic human behavior (van2016people). Recently, there has been a wide range of work combining MCTS and other learning methods, providing mutual improvements to both methods. For example, guo2014deep harnesses the power of MCTS to boost the performance of model-free RL approaches; shen2018m bridges the gap between MCTS and graph-based search, outperforming RL and knowledge base completion baselines.

Parallel MCTS Many approaches have been developed to parallelize MCTS, with a two-fold objective: achieving near-linear speedup with a large number of workers while maintaining the algorithm's performance. Popular parallelization approaches for MCTS include leaf parallelization, root parallelization, and tree parallelization (chaslot2008parallel). Leaf parallelization aims at collecting better statistics by assigning multiple workers to query the same node (cazenave2007parallelization). However, this comes at the cost of reduced diversity in the tree search. Therefore, its performance degrades significantly despite the near-ideal speedup achieved with the help of a client-server network architecture (kato2010parallel). In root parallelization, multiple search trees are built and assigned to different workers. Additional work incorporates periodic synchronization of statistics across the different trees, which results in better performance on real-world tasks (bourki2010scalability). However, a case study on Go reveals its inferiority with even a small number of workers (soejima2010evaluating). On the other hand, tree parallelization uses multiple workers to traverse, perform queries on, and update a shared search tree. It benefits significantly from two techniques. First, a virtual loss is added to discourage different workers from querying the same node (chaslot2008parallel). This has been adopted in various successful applications of MCTS such as Go (silver2016mastering2) and Dou-di-zhu (whitehouse2011determinization). Additionally, architecture-side improvements such as pipelining (mirsoleimani2018pipeline) or lock-free structures (mirsoleimani2018lock) speed up the algorithm significantly. However, although it is able to increase diversity, virtual loss degrades the performance even with as few as four workers (mirsoleimani2017analysis; bourki2010scalability).

7 Conclusion

This paper proposes WU-UCT, a novel parallel MCTS algorithm that addresses the problem of outdated statistics during parallelization by watching the number of unobserved samples. Based on the newly devised statistics, it modifies the UCT node-selection policy in a principled manner, which achieves effective exploration-exploitation tradeoff. Together with our efficiency-oriented system implementation, WU-UCT achieves near-optimal linear speedup with only limited performance loss across a wide range of tasks, including a deployed production system and Atari games.

8 Acknowledgements

This work is supported by Tencent AI Lab and Seattle AI Lab, Kwai Inc. We thank Xiangru Lian for his help on the system implementation.

References

Supplementary Material

Appendix A Algorithm details for WU-UCT

The pseudo-code of WU-UCT is provided in Algorithm 1. Specifically, it describes the workflow of the main (master) process. While the number of completed simulations has not exceeded the maximum simulation step (a pre-defined hyperparameter), the main process repeatedly performs a modified rollout that consists of the following steps: selection, expansion, simulation, and backpropagation. The selection and backpropagation steps are performed in the main process, while the other two are assigned to the workers. The backpropagation step is divided into two sub-routines, incomplete update (Algorithm 2) and complete update (Algorithm 3). The former is executed before a simulation starts, while the latter is called after receiving simulation results. A task index is attached to each task to help the main process track the different tasks returned from the workers. To maximize efficiency, the master process keeps assigning expansion and simulation tasks until all workers are fully occupied.
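
The following self-contained Python sketch illustrates the same dispatch pattern on a deliberately tiny problem: a one-level "tree" with three arms, a process pool standing in for the simulation workers, and a sleep-based toy simulation in place of the environment emulator. All names, the arm means, the pool size, and the use of concurrent.futures are assumptions of this toy example, not the deployed system.

    import math
    import random
    import time
    from concurrent.futures import ProcessPoolExecutor

    def simulate_task(arm_mean):
        # Toy stand-in for an expensive simulation query.
        time.sleep(0.01)
        return arm_mean + random.gauss(0.0, 0.1)

    class Child:
        def __init__(self, mean):
            self.mean, self.N, self.O, self.V = mean, 0, 0, 0.0

    def wu_uct_pick(children, total_N, total_O, beta=1.0):
        def score(c):
            if c.N + c.O == 0:
                return float("inf")   # query every child at least once
            return c.V + beta * math.sqrt(
                2.0 * math.log(total_N + total_O) / (c.N + c.O))
        return max(children, key=score)

    def master_loop(max_simulations=64, n_sim_workers=4):
        children = [Child(m) for m in (0.2, 0.5, 0.8)]
        completed, pending = 0, []
        with ProcessPoolExecutor(max_workers=n_sim_workers) as pool:
            while completed < max_simulations:
                chosen = wu_uct_pick(children,
                                     sum(c.N for c in children),
                                     sum(c.O for c in children))
                chosen.O += 1                      # incomplete update (5)
                pending.append((chosen, pool.submit(simulate_task, chosen.mean)))
                if len(pending) >= n_sim_workers:  # pool saturated: wait for one result
                    node, fut = pending.pop(0)
                    r = fut.result()
                    node.O -= 1                    # complete update (6) ...
                    node.N += 1
                    node.V += (r - node.V) / node.N  # ... with the value update (3)
                    completed += 1
        return max(children, key=lambda c: c.V)

    if __name__ == "__main__":
        best = master_loop()
        print("estimated best arm value:", round(best.V, 2))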

Communication overhead of WU-UCT

The choice of centralized game-state storage stems from the following observations: (i) the size of a game-state is usually small, which allows efficient inter-process transfer, and (ii) each game-state is used at most a small number of times,7 so it is inefficient to store it in multiple processes. Although this design may not be ideal for all tasks, it is at least a reasonable choice. During rollouts, game-states generated by any expansion worker may later be used by any other expansion or simulation worker. Therefore, either per-task transfer or decentralized storage is needed. In the latter case, however, since a game-state will be used at most a small number of times, most workers will never need it, which makes decentralized storage inefficient.

Another possible solution is to store the game-states in shared memory. However, to benefit from it, the following conditions should be satisfied: (i) each process can access (read/write) the memory relatively fast even if some collisions happen, and (ii) the shared memory is big enough to hold all game-states that may be accessed. If the two conditions hold, we may be able to further reduce the communication overhead. Since the communication overhead is already negligible even with 16 simulation and expansion workers (as shown in Figures 2(b) and 2(c)), we should instead consider using more workers to speed up the algorithm.

  Input: environment emulator, root tree node, maximum simulation step, maximum simulation depth, number of expansion workers, and number of simulation workers
  Initialize: expansion worker pool, simulation worker pool, game-state buffer, and the completed-simulation counter
  while the completed-simulation counter is smaller than the maximum simulation step do
     Traverse the tree top down from the root node following (4) until (i) its depth is greater than the maximum simulation depth, (ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() < 0.5
     if expansion is required then
        Assign an expansion task (tagged with a task index) to the expansion worker pool
     else
        Assign a simulation task to the simulation worker pool if the episode has not terminated
        Call incomplete_update; if the episode has terminated, call complete_update
     end if
     if the expansion worker pool is fully occupied then
         Wait for an expansion task to return (task index, game state, reward, terminal signal); expand the tree accordingly and assign a simulation task to the simulation worker pool
         Call incomplete_update
     else continue
     if the simulation worker pool is fully occupied then
         Wait for a simulation task to return (task index, node, cumulative reward)
         Call complete_update; increment the completed-simulation counter
     else continue
  end while
Algorithm 1 WU-UCT
  input: node s
  while s is not null do
     O_s ← O_s + 1
     s ← parent(s) // parent(s) denotes the parent node of s
  end while
Algorithm 2 incomplete_update
  input: task index t, node s, reward R
  while s is not null do
     O_s ← O_s - 1; N_s ← N_s + 1
     Retrieve the reward r_s according to the task index t
     R ← r_s + γ·R; V_s ← V_s + (R - V_s) / N_s
     s ← parent(s) // parent(s) denotes the parent node of s
  end while
Algorithm 3 complete_update

Appendix B Algorithm overview of baseline approaches

We give an overview of three baseline parallel UCT algorithms: Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP), with the objective of providing a comprehensive view to the readers. We refer readers interested in the details of these algorithms to chaslot2008parallel. As suggested by their names, LeafP, TreeP, and RootP parallelize different parts of the search tree. Specifically, LeafP (Algorithm 4) parallelizes only the simulation process: whenever a node (state) is selected for querying, all workers perform simulations individually to evaluate it. The main process (master) then waits for all workers to complete their simulations and return their respective cumulative rewards, which are used to update the traversed nodes' statistics.

TreeP (Algorithm 5) parallelizes the whole tree search algorithm by allowing different workers to access a shared search tree simultaneously. Each worker individually performs the selection, expansion, simulation, and back-propagation steps and updates the nodes' statistics. To discourage querying the same node, each worker subtracts a virtual loss $r_{\mathrm{VL}}$ (a hyper-parameter of the algorithm) from each of its traversed nodes during the selection process, and adds it back during back-propagation. Nodes currently being evaluated by some workers therefore have lower utility scores and are less likely to be chosen by other workers, which improves the diversity of the nodes visited simultaneously by different workers.
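
A minimal sketch of the virtual-loss bookkeeping described above, assuming each node exposes a value statistic V as in Section 2; the constant and the function names are illustrative rather than taken from a specific implementation.

    VIRTUAL_LOSS = 1.0   # the hyper-parameter r_VL

    def apply_virtual_loss(path):
        # Called by a worker right after its selection step, before simulating:
        # temporarily make every traversed node look less attractive to others.
        for node in path:
            node.V -= VIRTUAL_LOSS

    def remove_virtual_loss(path):
        # Called during the worker's back-propagation step, restoring the statistic.
        for node in path:
            node.V += VIRTUAL_LOSS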

silver2017mastering and segal2010scalability introduced an additional way to add pseudo reward into the traversed nodes. See Appendix E for details of this variant of TreeP and more experiments of it on Atari games.

As hinted by its name, RootP (Algorithm 6) parallelizes the root node. Specifically, in an initialization step, all children of the root node are expanded, and different workers are assigned to perform rollouts using the expanded child nodes as the roots of their own search trees. The algorithm evenly distributes the workload so that each worker is responsible for roughly the same number of rollouts. After the job assignment, all workers construct search trees in their own local memories and perform sequential tree search until their assigned tasks are finished. Finally, the main process collects statistics from all workers and returns the predicted best action for the state represented by the root node of the search tree.
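
The aggregation step at the end of RootP can be sketched as follows; the per-worker statistics format (visit count and value sum per root action) and the choice to rank root actions by their pooled average return are assumptions of this sketch.

    from collections import defaultdict

    def aggregate_root_statistics(per_worker_stats):
        # per_worker_stats: list of dicts {action: (visit_count, value_sum)},
        # one dict per worker, each produced by an independent sequential UCT search.
        N, W = defaultdict(int), defaultdict(float)
        for stats in per_worker_stats:
            for action, (n, w) in stats.items():
                N[action] += n
                W[action] += w
        # rank root actions by their pooled average return
        return max(N, key=lambda a: W[a] / N[a])

Ranking by the pooled visit count instead of the pooled average return is an equally common design choice.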

(a) A demonstration of gameplay in level 10 of the tap game. Purple dots indicate the tapped cells, and red regions indicate the directly eliminated cells.
(b) Examples of levels with different rule, difficulty, and layout.
Figure 6: Snapshots of the Tap-elimination game.

Appendix C Experiment details and system description of the Joy City task

This section describes the basic rules of the Joy City game (Appendix C.1) as well as details about the deployed user pass-rate prediction system (Appendix C.2).

C.1 Description of the Joy City game

This section serves as an introduction to the basic rules of the tap game. Figure 6 depicts several screenshots of the game. In the main frame, there is a grid, where each cell contains an item. We can tap cells within connected same-color regions to eliminate them (e.g., if the cell marked by the purple dot in the first screenshot of Figure 6(a) is tapped, the region containing the blue boxes will be "eliminated"). The remaining cells then collapse to fill in the gaps left by the eliminated ones. The goal is to fulfill all level requirements (goals) within a fixed number of taps. Figure 6(a) provides consecutive snapshots of playing level 10 of the game. The goal of this level is depicted at the top, which is 3 "cats" and 24 "balloons". The top-left corner shows the number of remaining steps. Players have to accomplish all given goals before the steps run out. Figure 6(a) demonstrates a successful gameplay, where only 6 steps are used to complete the level. In each of the three left frames, the cell marked by the purple circle is tapped. Immediately, the same-color region marked with a red frame is eliminated. Different goal objects and obstacle objects react differently. For instance, when a cell next to a balloon is eliminated, the balloon also explodes. The second frame demonstrates the use of props: tapping regions with connectivity above a certain threshold provides a prop as a bonus, and props have special effects that can help players pass the level faster. Finally, in the last screenshot, all goals are completed and the level is passed.

Figure 6(b) further demonstrates the variety of levels. Specifically, the left-most frame depicts a special "boss level", where the goal is to "defeat" the evil cat. The cat randomly throws objects onto the cells, adding additional randomness. The three other frames illustrate relatively hard levels, as revealed by their low connectivity, the abundance and complexity of the obstacles, and their special layouts.

C.2 Details of the level pass-rate prediction system

Figure 7: Our deployed user pass-rate prediction system.

During a game design cycle, to achieve the desired game pass-rates, a game designer needs to hire many human testers to extensively test all the levels before release, which generally takes a long time and is inaccurate. Therefore, a testing system that provides quick and accurate feedback about the user pass-rates would greatly shorten the game design cycle. Figure 7 gives an overview of our deployed user pass-rate prediction system, where WU-UCT is used to mimic average user performance and provide features for predicting the human pass-rate. As shown in the main paper, it achieves substantial speedup without significant performance loss,8 allowing the game designer to get the feedback in 20 minutes instead of 12 hours. Specifically, we use WU-UCT with different numbers of rollouts to mimic players with different skill levels, where WU-UCT with 10 rollouts is used to represent average players while the agent with 100 rollouts mimics skillful players. This is verified by the pair-wise sample t-test results provided in Table 2: with 10 simulations, the WU-UCT agent performs statistically similarly to human players (p-value > 5%), while with 100 simulations, the agent performs statistically better (p-value < 5%). Besides, Figure 8 shows that our pass-rate prediction system achieves 8.6% mean absolute error (MAE) on 130 released game-levels, with 93% of them having an MAE of less than 20%.

The system consists of two working phases, i.e., training and inference. Training and validation are done on 300 levels that have been released in a test version of the game. In the training phase, the system has access to both the levels and the players' pass-rates, while only the levels are available in the inference phase, where the system needs to give quick and accurate feedback about the (predicted) user pass-rate. In both phases, the levels are first fed into an asynchronous advantage actor-critic (A3C) (mnih2016asynchronous) learner to obtain a base policy. This policy is then used by the WU-UCT agent both as a prior to select expansion actions and as the default policy for simulation. We then use WU-UCT to perform multiple gameplays. The maximum depth and width (maximum number of child nodes per node) of the search tree are 10 and 5, respectively. The number of simulations is set to 10 and 100 to obtain AI bots with different skill levels. Six features (three each for the 10-simulation and the 100-simulation agent) are extracted from the gameplay results. Specifically, the features are the AI's pass-rate, the average number of used steps divided by the provided steps (the number at the top-left corner of the screenshots in Figure 6), and the median number of used steps divided by the provided steps. During training, these features, together with the players' pass-rates, are used to learn a linear regressor; in the inference phase, the regression model is used to predict the user pass-rate.
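
A compact sketch of the feature-extraction and regression stage described above; the feature definitions follow the prose (pass-rate, mean step ratio, and median step ratio for the 10- and 100-simulation agents), while the use of scikit-learn's LinearRegression and all function names are implementation assumptions.

    import numpy as np
    from sklearn.linear_model import LinearRegression

    def gameplay_features(results, step_budget):
        # results: list of (passed: bool, steps_used: int) from repeated WU-UCT
        # gameplays of one level at a fixed simulation budget.
        passed = np.array([r[0] for r in results], dtype=float)
        ratios = np.array([r[1] / step_budget for r in results], dtype=float)
        return [passed.mean(), ratios.mean(), float(np.median(ratios))]

    def level_features(results_10_sim, results_100_sim, step_budget):
        # three features per skill level (10 vs. 100 simulations per move)
        return gameplay_features(results_10_sim, step_budget) + \
               gameplay_features(results_100_sim, step_budget)

    def fit_pass_rate_model(X, y):
        # X: one six-dimensional feature row per training level; y: human pass-rates.
        return LinearRegression().fit(np.asarray(X), np.asarray(y))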

C.3 Additional experimental results

In this section, we list additional experimental results. In Table 3, we report the specific speedup numbers for different numbers of expansion workers and simulation workers.

AI bot # rollouts Avg. diff. Effect size p-value
WU-UCT 10 -1.54 0.07 0.4120
WU-UCT 100 22.18 0.88 0.0000
Table 2: Pair-wise sample t-test of pass-rate across 130 levels between two AI bots (with different numbers of MCTS rollouts) and the players. "Avg. diff." is the average difference between the pass-rate of the bot and that of the human players. The p-value measures the likelihood that two sets of paired samples are statistically similar (i.e., larger means more similar). The effect size measures the strength of the difference (larger means a greater difference).
Figure 8: Distribution of the pass-rate prediction error on 130 game-levels.
Level 35 | Level 58
1 2 4 8 16 | 1 2 4 8 16
1 1.0 2.0 2.8 3.6 4.5 1.0 1.8 4.1 4.8 5.1
2 1.4 2.2 4.1 5.7 6.3 1.1 3.1 5.3 6.7 8.4
4 1.7 2.5 4.5 8.4 8.8 1.1 3.4 6.1 10.1 12.8
8 2.3 3.0 5.1 10.1 12.8 1.2 3.6 6.7 13.2 16.1
16 2.9 3.7 5.7 11.2 15.5 1.2 3.8 7.6 16.1 20.9
Table 3: Speedup on two levels of the tap game under different numbers of expansion workers and simulation workers (each ranging from 1 to 16).

Appendix D Experiment details of the Atari games

This section provides the implementation details of the experiments on Atari games. Specifically, we first describe the training pipeline of the default policy and then illustrate how the default policy is used by the MCTS algorithms to perform simulation.

(a) Full-size PPO network.
(b) Distilled network.
Figure 9: Architecture of the original PPO network (left) and the distilled network (right).
Environment Original PPO policy Distilled policy
Alien 1850 850
Boxing 94 7
Breakout 274 191
Centipede 4386 1701
Freeway 32 32
Gravitar 737 600
MsPacman 2096 1860
NameThisGame 6254 6354
RoadRunner 25076 26600
Robotank 5 13
Qbert 14293 12725
SpaceInvaders 942 1015
Tennis -14 -10
TimePilot 4342 4400
Zaxxon 5008 3504
Table 4: Performance of the original PPO policy and our distilled policy on 15 Atari games.

Training default policy for MCTS

To allow better overall performance, we used Proximal Policy Optimization (PPO) (schulman2017proximal), one of the state-of-the-art on-policy model-free reinforcement learning (RL) algorithms. We adopted the highest-starred third-party implementation of PPO on GitHub, which uses the same hyper-parameters as the original paper. The architecture of the policy network is shown in Figure 9. The original PPO network is trained on 10 million frames for each task. To reduce computation, we then shrink the network size using network distillation (hinton2015distilling), a teacher-student training framework in which the student (distilled) network mimics the output of the teacher network. Samples are collected by the PPO network with an ε-greedy strategy. The student network optimizes its parameters to minimize the mean squared error of the policy logits as well as of the value. The performance of the original PPO policy network and of the distilled network is provided in Table 4.
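
The distillation objective described above can be sketched as follows (written in PyTorch as an assumption; the relative weighting of the two terms and the variable names are also assumptions):

    import torch.nn.functional as F

    def distillation_loss(student_logits, student_value,
                          teacher_logits, teacher_value, value_weight=1.0):
        # Mean squared error on the policy logits plus MSE on the value head,
        # matching the student (distilled) network to the frozen PPO teacher.
        policy_loss = F.mse_loss(student_logits, teacher_logits)
        value_loss = F.mse_loss(student_value, teacher_value)
        return policy_loss + value_weight * value_loss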

MCTS simulation

Both the policy output and the value output of the distilled network are used in the simulation phase. In particular, if a simulation is started from state $s_0$, a rollout is performed using the policy network with an upper bound of 100 steps, reaching a leaf state $s_L$. If the environment does not terminate, the full return is computed from the intermediate rewards plus the value function at state $s_L$. Formally, the cumulative reward provided by the simulation is $R = \sum_{t=0}^{L-1} \gamma^{t} r_t + \gamma^{L} V(s_L)$, where $V(s)$ denotes the value of state $s$. To reduce the variance of Monte Carlo sampling, we average it with the value function at state $s_0$. The final simulation return is then $\tfrac{1}{2}\big(R + V(s_0)\big)$.
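
A sketch of this simulation routine, assuming the distilled network exposes policy(state) and value(state) and the emulator exposes step(state, action) returning (next_state, reward, done); these interfaces, like the function name, are illustrative assumptions.

    def simulate(env, net, s0, gamma=0.99, max_steps=100):
        # Roll out the distilled policy from s0 for at most max_steps steps,
        # bootstrap with the value head if the episode does not terminate,
        # and average with V(s0) to reduce the variance of the Monte Carlo return.
        state, ret, discount = s0, 0.0, 1.0
        for _ in range(max_steps):
            action = net.policy(state)
            state, reward, done = env.step(state, action)
            ret += discount * reward
            discount *= gamma
            if done:
                break
        else:
            ret += discount * net.value(state)   # bootstrap at the horizon
        return 0.5 * (ret + net.value(s0))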

Hyperparameters and experiment details for WU-UCT

For all tree-search-based algorithms (i.e., WU-UCT, TreeP, LeafP, and RootP), the maximum depth of the search tree is set to 100, the search width is limited to 20, and the maximum number of simulations is 128. The discount factor is set to 0.99 (note that the reported scores are not discounted). When performing gameplays, a tree search subroutine is called to plan for the best action at each time step. The subroutine iteratively constructs a search tree, starting from a root node only. Experiments are deployed on 4 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA GeForce RTX 2080 Ti GPUs. To minimize speed fluctuations caused by different workloads on the machine, we ensure that the total number of simulation workers is smaller than the total number of CPU cores, allowing each process to fully occupy a single core. WU-UCT is implemented with multiple processes, with an inter-process pipe between the master process and each worker process.

Figure 10: Relative performance between WU-UCT and three baseline approaches on 15 Atari benchmarks. Relative performance is calculated according to the mean episode reward in 3 trials. The average percentile improvement of WU-UCT on TreeP, LeafP, and RootP is 49%, 104%, and 82%, respectively.9

Hyperparameters and experiments for baseline algorithms

Being unable to find appropriate third-party packages for the baseline algorithms (i.e., tree parallelization, leaf parallelization, and root parallelization), we built our own implementations based on the corresponding papers. Building all algorithms in the same package additionally allows us to conduct accurate speed tests, as it eliminates other factors (e.g., different languages) that may bias the results. Specifically, leaf parallelization is implemented with a master-worker structure: when the main process enters the simulation step, it assigns the task to all workers. When the returns from all workers are available, the master process performs backpropagation according to these statistics and begins a new rollout.

As suggested by browne2012survey, tree parallelization is implemented using a decentralized structure, i.e., each worker performs rollouts on a shared search tree. At the selection step, a fixed virtual loss $r_{\mathrm{VL}}$ is subtracted from each traversed node to guarantee the diversity of the tree search; during backpropagation, $r_{\mathrm{VL}}$ is added back to the traversed nodes. $r_{\mathrm{VL}}$ is chosen from 1.0 and 5.0 for each particular task. In other words, we ran TreeP with $r_{\mathrm{VL}} = 1.0$ and $r_{\mathrm{VL}} = 5.0$ for each task and report the better result.

Root parallelization is implemented according to chaslot2008parallel. Similar to leaf parallelization, root parallelization consists of sub-processes that do not share information with each other. At the beginning of the tree search process, each sub-process is assigned several actions of the root node to query. The sub-processes then perform sequential UCT rollouts until they reach a pre-defined maximum number of rollouts. When all sub-processes have completed their jobs, their statistics are gathered by the main process and used to choose the best action.

Appendix E Additional experiments on the Atari games

Environment WU-UCT TreeP (setting 1) TreeP (setting 2) TreeP (setting 3)
Alien 5938±1839 4850±357 4935±60 5000±0
Boxing 100±0 99±1 99±0 99±1
Breakout 408±21 379±43 265±50 463±60
Freeway 32±0 32±0 32±0 32±0
Gravitar 5060±568 3500±707 4105±463 4950±141
MsPacman 19804±2232 13160±462 12991±851 8640±438
RoadRunner 46720±1359 29800±282 28550±459 29400±494
Qbert 13992±5596 17055±353 13425±194 9075±53
SpaceInvaders 3393±292 2305±176 3210±127 3020±42
Tennis 4±1 1±0 1±0 1±0
TimePilot 55130±12474 52500±707 49800±212 32400±1697
Zaxxon 39085±6838 24300±2828 24600±424 37550±1096
Table 5: Comparison between WU-UCT and three TreeP variants on 12 Atari games. Average episode return (± standard deviation) over 10 runs is reported. The best average scores among the listed algorithms are highlighted in boldface. The hyper-parameter $r_{\mathrm{VL}}$ refers to the virtual loss added before each simulation starts, and $n_{\mathrm{VL}}$ denotes the number of virtual visit counts added before each simulation starts; the three TreeP columns correspond to three different ($r_{\mathrm{VL}}$, $n_{\mathrm{VL}}$) settings.

This section provides additional experimental results that compare WU-UCT with another variant of the Tree Parallelization (TreeP) algorithm. As suggested by silver2016mastering2, besides pre-adjusting the value with a virtual loss $r_{\mathrm{VL}}$, a pre-adjusted visit count can also be used to penalize nodes that are currently being queried. In this variant of TreeP, both the virtual loss $r_{\mathrm{VL}}$ and a hand-crafted count correction $n_{\mathrm{VL}}$ (termed the virtual pseudo-count) are added to adjust the node statistics. Specifically, the value of node $s$ is adjusted as

$V_{s} \leftarrow \dfrac{N_{s} V_{s} - r_{\mathrm{VL}}\, n_{\mathrm{VL}}}{N_{s} + n_{\mathrm{VL}}}$   (7)

which is used in the UCT selection phase. Table 5 compares WU-UCT with this TreeP variant using both the virtual loss and the virtual pseudo-count (i.e., Eq. (7)). Three sets of hyper-parameters (combinations of $r_{\mathrm{VL}}$ and $n_{\mathrm{VL}}$) are used for TreeP. All other experimental setups are the same as in Section 5.2 and Appendix D. Table 5 indicates that on 9 out of 12 tasks, WU-UCT outperforms this new baseline (with its best hyper-parameters). Furthermore, we also observe that TreeP does not have a single set of hyper-parameters that performs uniformly well on all tasks; in other words, to perform well, TreeP requires per-task hyper-parameter tuning. In contrast, WU-UCT performs consistently well across different tasks.

Conceptually, WU-UCT is designed based on the fact that on-going simulations (unobserved samples) will eventually return the results, so their number should be tracked and used to adaptively adjust the UCT selection process. On the other hand, TreeP uses artificially designed virtual loss and virtual pseudo-count to discourage other threads from simultaneously exploring the same node. Therefore, WU-UCT achieves a better exploration-exploitation tradeoff in parallelization, which leads to better performance as confirmed by the experimental results given in Table 5.

  Input: environment emulator, prior policy, root tree node, maximum simulation step, maximum simulation depth, and number of workers
  Initialize: the completed-simulation counter
  while the completed-simulation counter is smaller than the maximum simulation step do
     Traverse the tree top down from the root node following (2) until (i) its depth is greater than the maximum simulation depth, (ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() < 0.5
     Expand the selected node using the environment emulator and the prior policy
     Each of the simulation workers performs a roll-out beginning from the expanded node
     Wait until all workers have completed their simulations and returned their cumulative rewards
     for each worker do
        Call back_propagation with the cumulative reward returned by that worker
     end for
     Increment the completed-simulation counter
  end while
Algorithm 4 Leaf Parallelization (LeafP)
  Input: environment emulator, prior policy, root tree node, maximum simulation step, maximum simulation depth, and number of workers
  Initialize: worker processes, each with access to the environment emulator, the prior policy, and the shared search tree
  Perform asynchronously in each of the workers
      Traverse the tree top down from the root node following (2) until (i) its depth is greater than the maximum simulation depth,
      (ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() < 0.5
      Add the virtual loss to each of the traversed nodes
      Expand the selected node using the environment emulator and the prior policy
      Perform a roll-out beginning from the expanded node and record the returned cumulative reward
      Call back_propagation with the returned cumulative reward
      Remove the virtual loss from each of the traversed nodes
      Increment the shared simulation counter
      if the shared simulation counter exceeds the maximum simulation step
         Terminate the current process
      end if
  end
Algorithm 5 Tree Parallelization (TreeP)
  Input: environment emulator, prior policy, root tree node, maximum simulation step, maximum simulation depth, and number of workers
  Expand all child nodes of the root node
  Evenly distribute the workload (the number of tree-search rollouts to perform on each child of the root node) among the workers, and copy the corresponding child nodes to each worker's local memory
  Perform asynchronously in each of the workers
      Select a child of the root node according to the allocated budget
      Perform sequential UCT rollouts (selection following (2), expansion, simulation, and back-propagation) on the local subtree until the allocated budget is exhausted
  end
  Gather the child nodes' statistics from all workers and return the predicted best action of the root node
Algorithm 6 Root Parallelization (RootP)
  input: node s, environment emulator, prior policy
  Sample an action a from the prior policy at s
  while a has already been expanded at s do
     Re-sample a from the prior policy
  end while
  Perform a in the environment emulator, obtaining the next state, the reward, and the terminal signal
  Construct a new node s_new according to the next state
  Store the reward signal and the termination indicator in s_new
  Link s_new as a child of s via the edge corresponding to a
Algorithm 7 expansion
  input: node s, cumulative reward R
  while s is not null do
     N_s ← N_s + 1
     Retrieve the reward r_s stored in the current node (which was collected during its expansion)
     R ← r_s + γ·R
     V_s ← V_s + (R - V_s) / N_s
     s ← parent(s) // parent(s) denotes the parent node of s
  end while
Algorithm 8 back_propagation

Footnotes

  1. Code is available at https://github.com/liuanji/WU-UCT.
  2. In the context of MCTS, the action space is assumed to be finite and the transition is assumed to be deterministic, i.e., the next state $s_{t+1}$ is determined by the current state $s_t$ and the action $a_t$.
  3. We assume certain regularity conditions hold so that the cumulative reward is always bounded (sutton2018reinforcement).
  4. We refer the readers to chaslot2008parallel for more details. The pseudo-code of the three algorithms is given in Appendix B. LeafP: Algorithm 4, TreeP: Algorithm 5, RootP: Algorithm 6.
  5. We refer it as the tap game below. See Appendix C.1 for more details about the game rules.
  6. Level-35 is relatively simple, requiring 18 steps for an average player to pass, while Level-58 is relatively difficult and needs more than 50 steps to solve.
  7. In our setup, a game state will only be used once to start a simulation and a limited number of times to initialize expansions.
  8. Due to the complexity of the tap game, model-free RL algorithms such as A3C (mnih2016asynchronous) and PPO (schulman2017proximal) fail to achieve satisfactory performance and thus cannot provide an accurate prediction. On the other hand, MCTS can achieve good performance but takes a long time in testing.
  9. The task “Tennis” is not included in the calculation of the average percentile improvement due to the average episode return 0 in RootP.