Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search
Abstract
Monte Carlo Tree Search (MCTS) algorithms have achieved great success on many challenging benchmarks (e.g., Computer Go). However, they generally require a large number of rollouts, making their applications costly. Furthermore, it is also extremely challenging to parallelize MCTS due to its inherent sequential nature: each rollout heavily relies on the statistics (e.g., node visitation counts) estimated from previous simulations to achieve an effective exploration-exploitation tradeoff. In spite of these difficulties, we develop an algorithm, WUUCT
1 Introduction
Recently, Monte Carlo Tree Search (MCTS) algorithms such as UCT (kocsis2006improved) have achieved great success in solving many challenging artificial intelligence (AI) benchmarks, including video games (guo2016deep) and Go (silver2016mastering2). However, they rely on a large number (e.g., millions) of interactions with the environment emulator to construct search trees for decision-making, which leads to high time complexity (browne2012survey). For this reason, there has been an increasing demand for parallelizing MCTS over multiple workers. However, parallelizing MCTS without degrading its performance is difficult (segal2010scalability; mirsoleimani2018lock; chaslot2008parallel), mainly because each MCTS iteration requires information from all previous iterations to provide an effective exploration-exploitation tradeoff. Specifically, parallelizing MCTS inevitably obscures this crucial information, and we will show in Section 2.2 that this loss of information potentially results in a significant performance drop. The key question is therefore how to acquire and utilize more of the available information to mitigate the information loss caused by parallelization and help the algorithm achieve a better exploration-exploitation tradeoff.
To this end, we propose WUUCT (Watch the Unobserved in UCT), a novel parallel MCTS algorithm that attains linear speedup with only limited performance loss. This is achieved by a conceptual innovation (Section 3.1) as well as an efficient real system implementation (Section 3.2). Specifically, the key idea in WUUCT to overcome the aforementioned challenge is a set of statistics that tracks the number of ongoing yet incomplete simulation queries (named unobserved samples). We combine these newly devised statistics with the original statistics of observed samples to modify UCT’s policy in the selection steps in a principled manner, which, as we shall show in Section 4, effectively retains the exploration-exploitation tradeoff during parallelization. Our proposed approach has been successfully deployed in a production system for efficiently and accurately estimating the rate at which users pass levels (termed user pass-rate) in a mobile game “Joy City”, with the purpose of reducing its design cycles. On this benchmark, we show that WUUCT achieves near-optimal linear speedup and superior performance in predicting user pass-rate (Section 5.1). We further evaluate WUUCT on the Atari Game benchmark and compare it to state-of-the-art parallel MCTS algorithms (Section 5.2), which also demonstrates its superior speedup and performance.
2 On the Difficulties of Parallelizing MCTS
We first introduce the MCTS and the UCT algorithms, along with their difficulties in parallelization.
2.1 Monte Carlo Tree Search and Upper Confidence Bound for Trees (UCT)
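We consider the Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$, where an agent interacts with the environment in order to maximize a long-term cumulative reward. Specifically, an agent at state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$ according to a policy $\pi$, so that the MDP transits to the next state $s_{t+1} \sim P(s_{t+1} \mid s_t, a_t)$ and emits a reward $R(s_t, a_t)$. The objective is to find a policy that maximizes the expected discounted cumulative reward:
$$\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big] \qquad (1)$$
where $s \in \mathcal{S}$ denotes the initial state and $\gamma$ is the discount factor.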
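The quantity inside the expectation in (1) can be computed for a single trajectory as follows. This is a minimal sketch: the function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original formulation.

```python
def discounted_return(rewards, gamma):
    """Discounted cumulative reward of one trajectory: sum over t of gamma**t * r_t."""
    total = 0.0
    # Accumulate from the last step backwards so that each earlier step
    # multiplies everything after it by one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical three-step trajectory with reward 1 at every step.
example = discounted_return([1.0, 1.0, 1.0], gamma=0.5)
```

With `gamma=0.5`, the three rewards contribute 1, 0.5, and 0.25, so later rewards count progressively less.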
In the selection step, UCT traverses the search tree from the root and, at each node $s$, chooses the child according to the tree policy
$$a_s = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\frac{2 \log N_s}{N_{s'}}} \Big\} \qquad (2)$$
where $\mathcal{C}(s)$ denotes the set of all child nodes for $s$; the first term $V_{s'}$ is an estimate for the long-term cumulative reward that can be received when starting from the state represented by node $s'$, and the second term represents the uncertainty (size of the confidence interval) of that estimate. The confidence interval is calculated based on the Upper Confidence Bound (UCB) (auer2002finite; auer2002using) using $N_s$ and $N_{s'}$, which denote the number of times that the nodes $s$ and $s'$ have been visited, respectively. Therefore, the key idea of the UCT policy (2) is to select the best action according to an optimistic estimation (i.e., the upper confidence bound) of the expected return, which strikes a balance between exploitation (first term) and exploration (second term), with $\beta$ controlling their tradeoff. Once the selection process reaches a leaf node of the search tree (or other termination conditions are met), we expand the node according to a prior policy by adding a new child node. Then, in the simulation step, we estimate its value function (cumulative reward) by running the environment simulator with a default (simulation) policy. Finally, during backpropagation, we update the statistics $V_s$ and $N_s$ from the leaf node $s_T$ to the root node $s_0$ of the selected path by recursively performing the following update (i.e., from $t = T-1$ to $t = 0$):
$$N_{s_t} \leftarrow N_{s_t} + 1, \quad \hat{V}_{s_t} \leftarrow R(s_t, a_t) + \gamma \hat{V}_{s_{t+1}}, \quad V_{s_t} \leftarrow \frac{(N_{s_t} - 1) V_{s_t} + \hat{V}_{s_t}}{N_{s_t}} \qquad (3)$$
where $\hat{V}_{s_T}$ is the simulation return of $s_T$; $a_t$ denotes the action selected following (2) at state $s_t$.
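The selection rule (2) and the backpropagation update (3) can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the authors' implementation: the `Node` class, the choice to store the reward on the edge into each node, the infinite score for unvisited children, and the fact that the leaf itself is also updated are all simplifications introduced here.

```python
import math


class Node:
    def __init__(self, parent=None, reward=0.0):
        self.parent = parent
        self.reward = reward   # assumed: reward received when moving from parent to this node
        self.children = []
        self.N = 0             # visit count
        self.V = 0.0           # running mean of returns
        if parent is not None:
            parent.children.append(self)


def uct_select(node, beta=1.0):
    """Return the child maximizing V + beta * sqrt(2 * ln(N_parent) / N_child)."""
    def score(child):
        if child.N == 0:
            return math.inf    # assumption: unvisited children are tried first
        return child.V + beta * math.sqrt(2 * math.log(node.N) / child.N)
    return max(node.children, key=score)


def backpropagate(leaf, simulation_return, gamma=1.0):
    """Walk from the leaf to the root, updating N and V along the path."""
    ret = simulation_return
    node = leaf
    while node is not None:
        node.N += 1
        node.V += (ret - node.V) / node.N    # same as ((N - 1) * V + ret) / N
        ret = node.reward + gamma * ret      # return as seen from the parent
        node = node.parent
```

Note that `uct_select` is deterministic given the statistics, which matters for the parallelization issues discussed next.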
2.2 The Intrinsic Difficulties of Parallelizing MCTS
The above discussion implies that the MCTS algorithm is intrinsically sequential: each selection step in a new rollout requires the previous rollouts to complete in order to deliver the updated statistics, the value estimates $V_s$ and visit counts $N_s$, for the UCT tree policy (2). Although up-to-date statistics are not mandatory for implementation, they are in practice essential for achieving an effective exploration-exploitation tradeoff (auer2002finite). Specifically, up-to-date statistics best help the UCT tree policy to identify and prune non-rewarding branches as well as to extensively visit rewarding paths for additional planning depth. Likewise, to achieve the best possible performance when multiple workers are used, it is also important to ensure that each worker uses the most recent statistics (the colored $V_s$ and $N_s$ in Figure 1(b)) in its own selection step. However, this is impossible in parallelizing MCTS based on the following observations. First, the expansion step and the simulation step are generally more time-consuming than the other two steps, because they involve a large number of interactions with the environment (or its simulator). Therefore, as exemplified by Figure 1(c), when a worker C initiates a new selection step, the other workers A and B are most likely still in their simulation or expansion steps. This prevents them from updating the (global) statistics for other workers like C, which happens at their respective backpropagation steps. Using outdated statistics (the gray-colored $V_s$ and $N_s$) at different workers could lead to a significant performance loss given a fixed target speedup, due to behaviors like collapse of exploration or exploitation failure, which we shall discuss thoroughly in Section 4. To give an example, Figure 1(c) illustrates the collapse of exploration, where worker C traverses the same path as worker A in its selection step due to the determinism of (2).
Specifically, if the statistics are unchanged between the moments that workers A and C begin their own selection steps, they will choose the same node according to (2), which greatly reduces the diversity of exploration. Therefore, the key question that we want to address in parallelizing MCTS is how to track the correct statistics and modify the UCT policy in a principled manner, with the hope of retaining an effective exploration-exploitation tradeoff at different workers.
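The collapse of exploration can be reproduced with a toy calculation. The sketch below assumes a single parent with three children and hypothetical statistics; `uct_choice` is an illustrative helper applying the standard UCT score, not code from the paper.

```python
import math


def uct_choice(parent_visits, child_stats, beta=1.0):
    """child_stats is a list of (V, N) pairs with N > 0; returns the index of the best child."""
    scores = [v + beta * math.sqrt(2 * math.log(parent_visits) / n)
              for v, n in child_stats]
    return scores.index(max(scores))


# Hypothetical snapshot of the statistics at the moment selection starts.
stats = [(0.5, 3), (0.4, 3), (0.3, 3)]
parent_visits = 9

# Workers A, B, and C start selection before any simulation has finished,
# so all three read the same stale snapshot and make the same choice.
picks = [uct_choice(parent_visits, stats) for _worker in "ABC"]
```

Because the policy is deterministic and nothing has changed between the three calls, every worker ends up querying child 0.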
3 WUUCT
In this section, we first develop the conceptual idea of our WUUCT algorithm (Section 3.1), and then we present a real system implementation using a master-worker architecture (Section 3.2).
3.1 Watch the Unobserved Samples in UCT Tree Policy
As we pointed out earlier, the key question we want to address in parallelizing MCTS is how to deliver the most up-to-date statistics to each worker so that it can achieve an effective exploration-exploitation tradeoff in its selection step. This is assumed to be the case in the ideal parallelization in Figure 1(b). Algorithmically, it is equivalent to the sequential MCTS except that the rollouts are performed in parallel by different workers. Unfortunately, in practice, the statistics available to each worker are generally outdated because of the slow and incomplete simulation and expansion steps at the other workers. Specifically, since the estimated value $V_s$ is unobservable before simulations complete and workers should not wait for the updated statistics to proceed, the (partial) loss of statistics is unavoidable. Now the question becomes: is there an alternative way to address the issue? The answer is in the affirmative and is explained below.
Aiming at bridging the gap between naive parallelization and the ideal case, we closely examine their difference in terms of the availability of statistics. As illustrated by the colors of the statistics, their only difference in the visit counts $N_s$ is caused by the ongoing simulation process. As suggested by (3), although the value estimate $V_s$ can only be updated after a simulation step is completed, the newest $N_s$ information can actually be available as early as a worker initiates a new rollout. This is the key insight that we leverage to enable effective parallelization in our WUUCT algorithm. Motivated by this, we introduce another quantity, $O_s$, to count the number of rollouts that have been initiated but not yet completed, which we name unobserved samples. That is, our new statistics, $O_s$, watch the number of unobserved samples, and are then used to correct the UCT tree policy (2) into the following form:
$$a_s = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\frac{2 \log (N_s + O_s)}{N_{s'} + O_{s'}}} \Big\} \qquad (4)$$
The intuition of the above modified node-selection policy is that when there are $O_s$ workers simulating (querying) node $s$, the confidence interval at node $s$ will eventually shrink after they complete. Therefore, adding $O_s$ and $O_{s'}$ to the exploration term accounts for this fact beforehand and lets other workers be aware of it. Despite its simple form, (4) provides a principled way to retain an effective exploration-exploitation tradeoff under parallel settings: it corrects the confidence bound towards a better exploration-exploitation tradeoff. As the confidence level is instantly updated (i.e., at the beginning of simulation), more recent workers are guaranteed to observe additional statistics, which prevents them from extensively querying the same node and helps them find better nodes to query. For example, when multiple children are in demand for exploration, (4) allows them to be explored evenly. In contrast, when a node has been sufficiently visited (i.e., large $N_s$ and $N_{s'}$), adding $O_s$ and $O_{s'}$ from the unobserved samples has little effect on (4) because the confidence interval is sufficiently shrunk around $V_{s'}$, allowing extensive exploitation of the best-valued child.
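Both behaviors described above (even exploration among under-visited children, and continued exploitation of a well-visited best child) can be checked with a small simulation. This is a sketch under stated assumptions: the dictionary layout, the helper names `wu_uct_choice` and `dispatch`, and all numeric values are hypothetical; only the scoring rule follows the description of the modified policy, with unobserved counts added to the visit counts in the exploration term.

```python
import math


def wu_uct_choice(parent, children, beta=1.0):
    """Index of the child maximizing V + beta * sqrt(2 ln(N_s + O_s) / (N_s' + O_s')).

    parent and children are dicts with "N" (observed visits) and "O"
    (unobserved, in-progress rollouts); children also carry "V".
    """
    log_term = 2 * math.log(parent["N"] + parent["O"])
    scores = [c["V"] + beta * math.sqrt(log_term / (c["N"] + c["O"]))
              for c in children]
    return scores.index(max(scores))


def dispatch(parent, children, num_workers):
    """Let num_workers start rollouts back to back, before any of them completes."""
    picks = []
    for _ in range(num_workers):
        i = wu_uct_choice(parent, children)
        # Count the rollout as unobserved right away so the next worker sees it.
        children[i]["O"] += 1
        parent["O"] += 1
        picks.append(i)
    return picks


# Three equally promising, lightly visited children: workers spread out evenly.
even = dispatch({"N": 9, "O": 0},
                [{"V": 0.5, "N": 3, "O": 0} for _ in range(3)], 3)

# One clearly better, heavily visited child: all workers keep exploiting it.
greedy = dispatch({"N": 2000, "O": 0},
                  [{"V": 0.9, "N": 1000, "O": 0}, {"V": 0.1, "N": 1000, "O": 0}], 4)
```

In the first case each in-progress rollout noticeably shrinks the exploration bonus of the chosen child, so the next worker moves on; in the second case a handful of unobserved samples barely changes a bonus computed from a thousand visits.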
3.2 System Implementation Using a Master-Worker Architecture
We now proceed to explain the system implementation of WUUCT, where the overall architecture is shown in Figure 2(a) (see Appendix A for the details). Specifically, we use a master-worker architecture to implement the WUUCT algorithm with the following considerations. First, since the expansion and the simulation steps are much more time-consuming than the selection and the backpropagation steps, they should be intensively parallelized. In fact, they are relatively easy to parallelize (e.g., different simulations could be performed independently). Second, as we discussed earlier, different workers need to access the most up-to-date statistics in order to achieve a successful exploration-exploitation tradeoff. To this end, a centralized architecture for the selection and backpropagation steps is preferable, as it allows adding strict restrictions to the retrieval and update of the statistics, keeping them up-to-date. Specifically, we use a centralized master process to maintain a global set of statistics (in addition to other data such as game states), and let it be in charge of the backpropagation step (i.e., updating the global statistics) and the selection step (i.e., exploiting the global statistics). As shown in Figure 2(a), the master process repeatedly performs rollouts until a predefined number of simulations is reached. During each rollout, it selects nodes to query, assigns expansion and simulation tasks to different workers, and collects the returned results to update the global statistics. In particular, we use the following incomplete update and complete update (shown in Figure 2(a)) to track $N_s$ and $O_s$ along the traversed path (see Figure 1(d)):
$$O_s \leftarrow O_s + 1 \quad \text{[incomplete update]} \qquad (5)$$
$$O_s \leftarrow O_s - 1; \quad N_s \leftarrow N_s + 1 \quad \text{[complete update]} \qquad (6)$$
where incomplete update is performed before the simulation task starts, allowing the updated statistics to be instantly available globally; complete update is done after the simulation return is available, resembling the backpropagation step in the sequential algorithm. In addition, $V_s$ is also updated in the complete update step using (3). Such a clear division of labor between the master and the workers provides sequential selection and backpropagation steps while we parallelize the costly expansion and simulation steps. It ensures up-to-date statistics for all workers through the centralized master process and achieves linear speedup without much performance degradation (see Section 5 for the experimental results).
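The two updates can be sketched as a pair of functions that walk from the queried node up to the root. This is a minimal illustration, not the deployed system: the `Node` class, its `reward` field (the reward assumed to be stored on the edge into each node), and the running-mean form of the value update are assumptions made for the example.

```python
class Node:
    def __init__(self, parent=None, reward=0.0):
        self.parent = parent
        self.reward = reward  # assumed: reward received when entering this node
        self.N = 0            # observed (completed) samples
        self.O = 0            # unobserved (in-progress) samples
        self.V = 0.0          # running mean of returns


def incomplete_update(node):
    """Called before a simulation starts: mark one more unobserved sample on the path."""
    while node is not None:
        node.O += 1
        node = node.parent


def complete_update(node, simulation_return, gamma=1.0):
    """Called when a simulation returns: move the sample from unobserved to observed."""
    ret = simulation_return
    while node is not None:
        node.O -= 1
        node.N += 1
        node.V += (ret - node.V) / node.N   # running mean of observed returns
        ret = node.reward + gamma * ret     # return as seen from the parent
        node = node.parent
```

Every `incomplete_update` is eventually matched by one `complete_update` on the same path, so the unobserved counts return to zero once all outstanding simulations have finished.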
To justify the above rationale of our system design, we perform a set of running-time analyses for our developed WUUCT system and report the results in Figure 2(b)–(c). We show the time consumption for different parts at the master and at the workers. First, we focus exclusively on the workers. With a close-to-100% occupancy rate for the simulation workers, the simulation step is fully parallelized. Although the expansion workers are not fully utilized, the expansion step is maximally parallelized since the number of required simulation and expansion tasks is identical. This suggests the existence of an optimal (task-dependent) ratio between the number of expansion workers and the number of simulation workers that fully parallelizes both steps with the least resources (e.g., memory). Returning to the master process, on both benchmarks, we see a clear dominance of the time spent on the simulation and the expansion steps even though they are both parallelized by 16 workers. This supports our design to parallelize only the simulation and expansion steps. We finally focus on the communication overhead caused by parallelization. Although more time-consuming compared to selection and backpropagation, the communication overhead is negligible compared to the time used by the expansion and the simulation steps. Other details in our system, such as the centralized game-state storage, are further discussed in Appendix A.
4 The Benefits of Watching Unobserved Samples
In this section, we discuss the benefits of watching unobserved samples in WUUCT, and compare it with several popular parallel MCTS algorithms (Figure 3), including Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP).



We argue that, by introducing the additional statistics $O_s$, WUUCT achieves a better exploration-exploitation tradeoff than the above methods. First, LeafP and TreeP represent two extremes in such a tradeoff. LeafP lacks diversity in exploration as all its workers are assigned to simulating the same node, leading to a performance drop caused by collapse of exploration in much the same way as the naive parallelization (see Figure 1(c)). In contrast, although the virtual loss used in TreeP could encourage exploration diversity, this hard additive penalty could cause exploitation failure: workers will be less likely to co-simulate the same node even when they are certain that it is optimal (mirsoleimani2017analysis). RootP tries to avoid these issues by letting workers perform an independent tree search. However, this reduces the equivalent number of rollouts at each worker, decreasing the accuracy of the UCT policy (2). Different from the above three approaches, WUUCT achieves a much better exploration-exploitation tradeoff in the following manner. It encourages exploration by using $O_s$ to “penalize” the nodes that have many in-progress simulations. Meanwhile, it allows multiple workers to exploit the most rewarding node since this “penalty” vanishes when $N_s$ becomes large (see (4)).
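The contrast between a hard additive penalty and a vanishing one can be made concrete numerically. The sketch below assumes a simplified model in which virtual loss lowers a node's score by a fixed amount `r_vl` per in-progress worker, and in which the WUUCT penalty is the drop in the exploration bonus caused by `k` unobserved samples; the function names and all numeric values are illustrative, not taken from the paper.

```python
import math


def wu_uct_penalty(parent_n, child_n, k, beta=1.0):
    """Drop in the exploration bonus when k rollouts are in progress at the child."""
    before = beta * math.sqrt(2 * math.log(parent_n) / child_n)
    after = beta * math.sqrt(2 * math.log(parent_n + k) / (child_n + k))
    return before - after


def virtual_loss_penalty(k, r_vl=1.0):
    """Assumed model of virtual loss: a fixed deduction per in-progress worker."""
    return k * r_vl


# Four workers in progress at a lightly visited node versus a heavily visited one.
light = wu_uct_penalty(parent_n=20, child_n=10, k=4)
heavy = wu_uct_penalty(parent_n=20000, child_n=10000, k=4)
fixed = virtual_loss_penalty(k=4)
```

Under this model the count-based penalty is substantial at the lightly visited node and almost zero at the heavily visited one, while the additive penalty is the same in both cases.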
5 Experiments
This section evaluates the proposed WUUCT algorithm on a production system to predict the user pass-rate of a mobile game (Section 5.1) as well as on the public Atari Game benchmark (Section 5.2), aiming at demonstrating the superior performance and near-linear speedup of WUUCT.
5.1 Experiments on the “Joy City” Game
Joy City is a level-oriented game with diverse and challenging gameplay. Players tap to eliminate connected items on the game board. To pass a level, players have to complete certain goals within a given number of steps.




We evaluate WUUCT with different numbers of expansion and simulation workers (from to ) and report the speedup results in Figures 4(a)–(b). For all experiments, we fix the total number of simulations to 500. First, note that when we have the same number of expansion workers and simulation workers, WUUCT achieves linear speedup. Furthermore, Figure 4 also suggests that both the expansion workers and the simulation workers are crucial, since lowering the number of workers in either set decreases the speedup. Besides the near-linear speedup property, WUUCT suffers negligible performance loss with an increasing number of workers, as shown in Figures 4(c)–(d). The standard deviations of the performance (measured in the average game steps) over different numbers of expansion and simulation workers are only and for Level-35 and Level-58, respectively, which are much smaller than their average game steps ( and ).
5.2 Experiments on the Atari Game Benchmark
We further evaluate WUUCT on Atari Games (bellemare2013arcade), a classical benchmark for reinforcement learning (RL) and planning algorithms (guo2014deep). The Atari Games are an ideal testbed for MCTS algorithms because of their long planning horizons (several thousand steps), sparse rewards, and complex game strategies. We compare WUUCT to three parallel MCTS algorithms discussed in Section 4: TreeP, LeafP, and RootP (additional experiment results comparing WUUCT with a variant of TreeP are provided in Appendix E). We also report the results of sequential UCT ( slower than WUUCT) and PPO (schulman2017proximal) as reference. Generally, the performance of sequential UCT sets an upper bound for parallel UCT algorithms. PPO is included since we used a distilled PPO policy network (hinton2015distilling; rusu2015policy) as the rollout policy for all other algorithms. It is considered a performance lower bound for both parallel and sequential UCT algorithms. All experiments are performed with a total of 128 simulation steps, and all parallel algorithms use 16 workers (see Appendix D for the details).
Table 1: Average episode reward (mean ± standard deviation) on 15 Atari games.
Environment    WUUCT            TreeP          LeafP         RootP          PPO    UCT
Alien          5938±1839        4200±1086      4280±1016     5206±282       1850   6820
Boxing         100±0*           99±0           95±4          98±1           94     100
Breakout       408±21           390±33         331±45        281±27         274    462
Centipede      1163034±403910*  439433±207601  162333±69575  184265±104405  4386   652810
Freeway        32±0             32±0           31±1          32±0           32     32
Gravitar       5060±568         4880           3385±155      4160±1811      737    4900
MsPacman       19804±2232*      14000±2807     5378±685      7156±583       2096   23021
NameThisGame   29991±1608*      23326±2585     25390±3659    27440±9533     6254   38455
RoadRunner     46720±1359*      24680±3316     25452±2977    38300±1191     25076  52300
Robotank       101±19           86±13          80±11         78±13          5      82
Qbert          13992±5596       14620±5738     11655±5373    9465±3196      14293  17250
SpaceInvaders  3393±292         2651±828       2435±1159     2543±809       942    3535
Tennis         41*              10             10            01             14     5
TimePilot      55130±12474*     32600±2165     38075±2307    45100±7421     4342   52600
Zaxxon         39085±6838       39579±3942     12300±821     13380±769      5008   46800
We first compare the performance, measured by average episode reward, between WUUCT and the baselines on 15 Atari games, which is done with 16 simulation workers and 1 expansion worker (for a fair comparison, since the baselines do not parallelize the expansion step). Each task is repeated 10 times with the mean and standard deviation reported in Table 1. Due to the better exploration-exploitation tradeoff during selection, WUUCT outperforms all other parallel algorithms in 12 out of 15 tasks. Pairwise Student's t-tests further show that WUUCT performs significantly better (adjusted by the Bonferroni method, $p$-value $<$ 0.0011) than TreeP, LeafP, and RootP in 7, 9, and 7 tasks, respectively. Next, we examine the influence of the number of simulation workers on the speed and the performance. In Figure 5, we compare the average episode return as well as time consumption (per step) for 4, 8, and 16 simulation workers. The bar plots indicate that WUUCT experiences little performance loss with an increasing number of workers, while the baselines exhibit significant performance degradation when heavily parallelized. WUUCT is also the fastest among the compared algorithms, thanks to the efficient master-worker architecture (Section 3.2). In conclusion, our proposed WUUCT not only outperforms baseline approaches significantly under the same number of workers but also suffers negligible performance loss with an increasing level of parallelization.
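The reported significance threshold of 0.0011 is numerically consistent with a Bonferroni correction of a 0.05 significance level over 15 tasks times 3 baselines, i.e., 45 pairwise comparisons. Both the 0.05 level and the count of 45 are assumptions made here for illustration; the text does not state them explicitly.

```python
def bonferroni_threshold(alpha, num_comparisons):
    """Per-comparison significance level under the Bonferroni correction."""
    return alpha / num_comparisons


# Assumed setup: family-wise level 0.05, 15 games compared against 3 baselines.
threshold = bonferroni_threshold(0.05, 15 * 3)
```

A single comparison is then declared significant only if its p-value falls below this per-comparison threshold.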
6 Related Work
MCTS. Monte Carlo Tree Search is a planning method for optimal decision making in problems with either deterministic (silver2016mastering2) or stochastic (schafer2008uct) environments. It has had a profound influence on Artificial Intelligence applications (browne2012survey), and has even been applied to predict and mimic human behavior (van2016people). Recently, there has been a wide range of work combining MCTS and other learning methods, providing mutual improvements to both methods. For example, guo2014deep harnesses the power of MCTS to boost the performance of model-free RL approaches; shen2018m bridges the gap between MCTS and graph-based search, outperforming RL and knowledge base completion baselines.
Parallel MCTS. Many approaches have been developed to parallelize MCTS methods, with a twofold objective: achieve near-linear speedup under a large number of workers while maintaining the algorithm performance. Popular parallelization approaches of MCTS include leaf parallelization, root parallelization, and tree parallelization (chaslot2008parallel). Leaf parallelization aims at collecting better statistics by assigning multiple workers to query the same node (cazenave2007parallelization). However, this comes at the cost of reduced diversity in the tree search. Therefore, its performance degrades significantly despite the near-ideal speedup with the help of a client-server network architecture (kato2010parallel). In root parallelization, multiple search trees are built and assigned to different workers. Additional work incorporates periodical synchronization of statistics from different trees, which results in better performance in real-world tasks (bourki2010scalability). However, a case study on Go reveals its inferiority with even a small number of workers (soejima2010evaluating). On the other hand, tree parallelization uses multiple workers to traverse, perform queries, and update on a shared search tree. It benefits significantly from two techniques. First, a virtual loss is added to avoid querying the same node by different workers (chaslot2008parallel). This has been adopted in various successful applications of MCTS such as Go (silver2016mastering2) and Doudizhu (whitehouse2011determinization). Additionally, architecture-side improvements such as using a pipeline (mirsoleimani2018pipeline) or a lock-free structure (mirsoleimani2018lock) speed up the algorithm significantly. However, though being able to increase diversity, virtual loss degrades the performance under even four workers (mirsoleimani2017analysis; bourki2010scalability).
7 Conclusion
This paper proposes WUUCT, a novel parallel MCTS algorithm that addresses the problem of outdated statistics during parallelization by watching the number of unobserved samples. Based on the newly devised statistics, it modifies the UCT node-selection policy in a principled manner, which achieves an effective exploration-exploitation tradeoff. Together with our efficiency-oriented system implementation, WUUCT achieves near-optimal linear speedup with only limited performance loss across a wide range of tasks, including a deployed production system and Atari games.
8 Acknowledgements
This work is supported by Tencent AI Lab and Seattle AI Lab, Kwai Inc. We thank Xiangru Lian for his help on the system implementation.
References
Supplementary Material
Appendix A Algorithm details for WUUCT
The pseudocode of WUUCT is provided in Algorithm 1. Specifically, it provides the workflow of the master process. While the number of completed updates has not exceeded the maximum simulation step (a predefined hyperparameter), the main process repeatedly performs a modified rollout that consists of the following steps: selection, expansion, simulation, and backpropagation. The selection and backpropagation steps are performed in the main process, while the two others are assigned to the workers. The backpropagation step is divided into two subroutines: incomplete update (Algorithm 2) and complete update (Algorithm 3). The former is executed before simulation starts, while the latter is called after receiving simulation results. A task index is added to help the main process track different tasks returned from the workers. To maximize efficiency, the master process keeps assigning expansion and simulation tasks until all workers are fully occupied.
Communication overhead of WUUCT
The choice of centralized game-state storage stems from the following observations: (i) the size of the game-state is usually small, which allows efficient inter-process transfer, and (ii) each game-state is used at most times,
Another possible solution is to store the game-states in shared memory. However, to benefit from it, the following conditions should be satisfied: (i) each process can access (read/write) the memory relatively fast even if some collisions may happen, and (ii) the shared memory is big enough to hold all game-states that may be accessed. If the two conditions hold, we may be able to reduce the communication overhead. Since the communication overhead is negligible even with 16 simulation and expansion workers (as shown in Figures 2(b) and 2(c)), we should consider using more workers to speed up the algorithm.
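Algorithm 1: WUUCT (master process)
Input: environment emulator $\mathcal{E}$, root tree node $s_{\text{root}}$, maximum simulation step $T_{\max}$, maximum simulation depth $d_{\max}$, number of expansion workers $N_{\text{exp}}$, and number of simulation workers $N_{\text{sim}}$
Initialize: expansion worker pool $\mathcal{W}_{\text{exp}}$, simulation worker pool $\mathcal{W}_{\text{sim}}$, game-state buffer $\mathcal{B}$, $t \leftarrow 0$, and $t_{\text{complete}} \leftarrow 0$
while $t_{\text{complete}} < T_{\max}$ do
    Traverse the tree top down from root node $s_{\text{root}}$ following (4) until (i) its depth is greater than $d_{\max}$, (ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() < 0.5
    if expansion is required then
        Assign expansion task $t$ to pool $\mathcal{W}_{\text{exp}}$ // $t$ is the task index
    else
        Assign simulation task $t$ to pool $\mathcal{W}_{\text{sim}}$ if episode not terminated
        Call incomplete_update; if episode terminated, call complete_update
    end if
    if $\mathcal{W}_{\text{exp}}$ fully occupied then
        Wait for an expansion task with return: (task index $\tau$, game state $s$, reward $r$, terminal signal $d$); expand the tree according to $\tau$, $s$, $r$, and $d$; assign simulation task $\tau$ to pool $\mathcal{W}_{\text{sim}}$
        Call incomplete_update
    else continue
    if $\mathcal{W}_{\text{sim}}$ fully occupied then
        Wait for a simulation task with return: (task index $\tau$, node $s$, cumulative reward $\bar{r}$)
        Call complete_update; $t_{\text{complete}} \leftarrow t_{\text{complete}} + 1$
    else continue
end while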

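Algorithm 2: incomplete_update
input: node $s$
while $s \neq \text{null}$ do
    $O_s \leftarrow O_s + 1$
    $s \leftarrow \mathcal{PR}(s)$ // $\mathcal{PR}(s)$ denotes the parent node of $s$
end while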

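Algorithm 3: complete_update
input: task index $t$, node $s$, reward $\bar{r}$
while $s \neq \text{null}$ do
    $N_s \leftarrow N_s + 1$; $O_s \leftarrow O_s - 1$
    Retrieve reward $r$ according to task index $t$
    $\bar{r} \leftarrow r + \gamma \bar{r}$; $V_s \leftarrow \frac{N_s - 1}{N_s} V_s + \frac{1}{N_s} \bar{r}$
    $s \leftarrow \mathcal{PR}(s)$ // $\mathcal{PR}(s)$ denotes the parent node of $s$
end while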
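The master loop described above can be illustrated on a deliberately simplified problem. The sketch below is not the deployed system: it assumes a one-level tree (a bandit with a few arms), deterministic simulation returns, no expansion workers, and a thread pool from Python's standard `concurrent.futures` module standing in for the simulation worker pool. The function name `wu_uct_bandit` and all parameters are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED


def wu_uct_bandit(arm_means, t_max=200, num_workers=4, beta=1.0):
    k = len(arm_means)
    N = [0] * k      # observed samples per arm
    O = [0] * k      # unobserved (in-progress) samples per arm
    V = [0.0] * k    # running mean return per arm

    def simulate(arm):
        # Stands in for the costly simulation step run by a worker.
        return arm, arm_means[arm]

    def select():
        total = sum(N) + sum(O)
        def score(a):
            n = N[a] + O[a]
            if n == 0:
                return math.inf  # assumption: try every arm once first
            return V[a] + beta * math.sqrt(2 * math.log(total) / n)
        return max(range(k), key=score)

    launched = completed = 0
    pending = set()
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while completed < t_max:
            # Keep assigning tasks until every worker is occupied.
            if launched < t_max and len(pending) < num_workers:
                arm = select()
                O[arm] += 1                       # incomplete update
                pending.add(pool.submit(simulate, arm))
                launched += 1
                continue
            # All workers busy (or nothing left to launch): wait for a result.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                arm, ret = fut.result()
                O[arm] -= 1                       # complete update
                N[arm] += 1
                V[arm] += (ret - V[arm]) / N[arm]
                completed += 1
    return N, O, V
```

Only the master thread touches the statistics, mirroring the centralized design: workers receive a task and return a result, and all selection and update logic stays sequential.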

Appendix B Algorithm overview of baseline approaches
We give an overview of three baseline parallel UCT algorithms: Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP), with the objective of providing a comprehensive view to the readers. We refer readers interested in the details of these algorithms to chaslot2008parallel. As suggested by their names, LeafP, TreeP, and RootP parallelize different parts of the search tree. Specifically, LeafP (Algorithm 4) parallelizes only the simulation process: whenever a node (state) is selected to query, all workers perform simulations individually to evaluate it. The main process (master) then waits for all workers to complete simulation and return their respective cumulative rewards, which are used to update the traversed nodes’ statistics.
TreeP (Algorithm 5) parallelizes the whole tree search algorithm by allowing different workers to access a shared search tree simultaneously. Each worker individually performs the selection, expansion, simulation, and backpropagation steps and updates the nodes’ statistics. To discourage querying the same node, individual workers subtract a virtual loss (a hyperparameter of the algorithm) from each traversed node during the selection process, and add it back during backpropagation. This gives nodes currently being evaluated by some workers lower utility scores (4), so they are less likely to be chosen by other workers, which improves the diversity of the nodes visited by different workers simultaneously.
silver2017mastering and segal2010scalability introduced an additional way to add pseudo reward into the traversed nodes. See Appendix E for details of this variant of TreeP and additional experiments with it on Atari games.
As hinted by its name, RootP (Algorithm 6) parallelizes the root node. Specifically, in an initialization step, all children of the root node are expanded, and different workers are assigned to perform rollouts using the expanded child nodes as the root node of the search tree. The algorithm evenly distributes the workload such that the number of rollouts starting from all child nodes is , where is the number of workers. After the job assignment, all workers construct search trees in their own local memories and perform sequential tree search until their assigned tasks are finished. Finally, the main process collects statistics from all workers and returns the predicted best action of the state represented by the root node of the search tree.


Appendix C Experiment details and system description of the Joy City task
This section describes the basic rules of the Joy City game (Appendix C.1) as well as details about the deployed user pass-rate prediction system (Appendix C.2).
C.1 Description of the Joy City game
This section serves as an introduction to the basic rules of the tap game. Figure 6 depicts several screenshots of the game. In the main frame, there is a grid, where each cell contains an item. We can click cells with connected color regions to eliminate them (e.g., if the cell represented by the purple dot in the first screenshot of Figure 6(a) is tapped, the region containing blue boxes will be “eliminated”). The remaining cells then collapse to fill in the gaps left by the exploded ones. The goal is to fulfill all level requirements (goals) within a fixed number of clicks. Figure 6(a) provides consecutive snapshots for playing level 10 of the game. The goal of this level is depicted on the top, which is 3 “cats” and 24 “balloons”. The top-left corner shows the number of remaining steps. Players have to accomplish all given goals before the steps run out. Figure 6(a) demonstrates successful gameplay, where only 6 steps are used to complete the level. In each of the three left frames, the cell marked by the purple circle is clicked. Immediately, the same-color region marked with a red frame is eliminated. Different goal objects and obstacle objects react differently. For instance, when a cell is exploded beside a balloon, the balloon will also explode. Frame two demonstrates the use of props. Tapping regions with connectivity above a certain threshold will provide props as a bonus. They have special effects that can help players pass the level faster. Finally, in the last screenshot, all goals are completed and we pass the level.
Figure 6(b) further demonstrates the variety of levels. Specifically, the leftmost frame depicts a special “boss level”, where the goal is to “defeat” the evil cat. The cat will randomly throw objects to the cells, adding additional randomness. The three other frames illustrate relatively hard levels, as revealed by their low connectivity, the abundance and complexity of the obstacles, and special layouts.
C.2 Details of the level pass-rate prediction system
During a game design cycle, to achieve the desired game pass-rates, a game designer needs to hire many human testers to extensively test all the levels before release, which generally takes a long time and is inaccurate. Therefore, it would greatly shorten the game design cycle if we could develop a testing system that provides quick and accurate feedback about the user pass-rates. Figure 7 gives an overview of our deployed user pass-rate prediction system, where WUUCT is used to mimic average user performance and provide features for predicting the human pass-rate. As we have shown in the main paper, WUUCT achieves significant speedup without significant performance loss.
The system consists of two working phases, i.e., training and inference. Specifically, training and validation are done on 300 levels that have been released in a test version of the game. In the training phase, the system has access to both the levels and the players’ pass-rates, while only the levels are available in the inference phase, where the system needs to give quick and accurate feedback about the (predicted) user pass-rate. In both phases, the levels are first fed into an asynchronous advantage actor-critic (A3C) (mnih2016asynchronous) learner to obtain a base policy. This policy is then used by the WUUCT agent both as a prior for selecting the action to expand and as the default policy for simulation. We then use WUUCT to perform multiple gameplays. The maximum depth and width (maximum number of child nodes for each node) of the search tree are 10 and 5, respectively. The number of simulations is set to 10 and 100 to get AI bots with different skill levels. Six features (three for each of the 10-simulation and 100-simulation agents) are extracted from the gameplay results. Specifically, the features are the AI’s pass-rate, the average number of used steps divided by the number of provided steps (the number in the top-left corner of the screenshots in Figure 6), and the median number of used steps divided by the number of provided steps. During training, the features and the players’ pass-rates are used to learn a linear regressor, while in the inference phase, the regression model is used to predict the user pass-rate.
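The feature extraction and the linear prediction step can be sketched as below. This is a hedged illustration only: the function names, the toy gameplay data, and the representation of a gameplay set as a pair (pass flags, steps used) are assumptions, and the regressor's weights would come from fitting on the training levels, which is not shown.

```python
from statistics import mean, median


def agent_features(passed, used_steps, provided_steps):
    """Three features from the gameplays of one agent on one level."""
    ratios = [steps / provided_steps for steps in used_steps]
    pass_rate = sum(passed) / len(passed)
    return [pass_rate, mean(ratios), median(ratios)]


def level_features(plays_10, plays_100, provided_steps):
    """Six features: three from the 10-simulation agent, three from the 100-simulation agent."""
    return (agent_features(*plays_10, provided_steps)
            + agent_features(*plays_100, provided_steps))


def predict_pass_rate(weights, bias, features):
    """Linear regressor applied to one feature vector."""
    return bias + sum(w * x for w, x in zip(weights, features))


# Toy gameplays: (pass flags, steps used) for each agent on a 20-step level.
weak = ([1, 0, 0, 1], [18, 20, 20, 16])
strong = ([1, 1, 1, 0], [12, 14, 10, 20])
features = level_features(weak, strong, provided_steps=20)
```

Using two agents with different simulation budgets gives the regressor one view of the level from a weaker player and one from a stronger player.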
C.3 Additional experimental results
In this section, we list additional experimental results. In Table 3, we report the specific speedup for different numbers of expansion workers and simulation workers.

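Table 3: Speedup for different numbers of expansion workers and simulation workers on Level 35 and Level 58.

| Lv. | Level 35 | | | | | Level 58 | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Workers | 1 | 2 | 4 | 8 | 16 | 1 | 2 | 4 | 8 | 16 |
| 1 | 1.0 | 2.0 | 2.8 | 3.6 | 4.5 | 1.0 | 1.8 | 4.1 | 4.8 | 5.1 |
| 2 | 1.4 | 2.2 | 4.1 | 5.7 | 6.3 | 1.1 | 3.1 | 5.3 | 6.7 | 8.4 |
| 4 | 1.7 | 2.5 | 4.5 | 8.4 | 8.8 | 1.1 | 3.4 | 6.1 | 10.1 | 12.8 |
| 8 | 2.3 | 3.0 | 5.1 | 10.1 | 12.8 | 1.2 | 3.6 | 6.7 | 13.2 | 16.1 |
| 16 | 2.9 | 3.7 | 5.7 | 11.2 | 15.5 | 1.2 | 3.8 | 7.6 | 16.1 | 20.9 |
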
Appendix D Experiment details of the Atari games
This section provides the implementation details of the experiments on Atari games. Specifically, we first describe the training pipeline of the default policy. We then illustrate how the default policy is connected with the MCTS algorithms to perform simulation.


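Table 4: Performance of the original PPO policy network and the distilled network.

| Environment | Original PPO policy | Distilled policy |
|---|---|---|
| Alien | 1850 | 850 |
| Boxing | 94 | 7 |
| Breakout | 274 | 191 |
| Centipede | 4386 | 1701 |
| Freeway | 32 | 32 |
| Gravitar | 737 | 600 |
| MsPacman | 2096 | 1860 |
| NameThisGame | 6254 | 6354 |
| RoadRunner | 25076 | 26600 |
| Robotank | 5 | 13 |
| Qbert | 14293 | 12725 |
| SpaceInvaders | 942 | 1015 |
| Tennis | 14 | 10 |
| TimePilot | 4342 | 4400 |
| Zaxxon | 5008 | 3504 |
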
Training default policy for MCTS
To allow better overall performance, we used Proximal Policy Optimization (PPO) (schulman2017proximal), one of the state-of-the-art on-policy model-free reinforcement learning (RL) algorithms. We adopted the highest-starred third-party implementation of PPO on GitHub. The implementation uses the same hyperparameters as the original paper. The architecture of the policy network is shown in Figure 9. The original PPO network is trained on 10 million frames for each task. To reduce the computational cost, we reduce the network size using network distillation (hinton2015distilling). Specifically, distillation is a teacher-student training framework in which the student (distilled) network mimics the output of the teacher network. Samples are collected by the PPO network with the greedy strategy (). The student network optimizes its parameters to minimize the mean squared error of the policy’s logits as well as of the value. The performance of the original PPO policy network and of the distilled network is provided in Table 4.
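The distillation objective described above can be sketched for a single sample as follows. The function names are hypothetical, and the equal weighting of the logit term and the value term is an assumption, since the text does not state how the two errors are combined.

```python
def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_logits, student_value, teacher_logits, teacher_value):
    """Squared error on the policy logits plus squared error on the value.

    The two terms are summed with equal weight, which is an assumption.
    """
    return mse(student_logits, teacher_logits) + (student_value - teacher_value) ** 2
```

In a full training loop this quantity would be averaged over a batch of states collected by the teacher and minimized with respect to the student's parameters.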
MCTS simulation
Both the policy output and the value output of the distilled network are used in the simulation phase. In particular, if a simulation is started from state $s$, a rollout is performed using the policy network with an upper bound of 100 steps and reaches the leaf state $s'$. If the environment does not terminate, the full return is computed as the intermediate rewards plus the value function at state $s'$. Formally, the cumulative reward provided by the simulation is $R = \sum_{i=0}^{99} \gamma^{i} r_i + \gamma^{100} V(s')$, where $V(s)$ denotes the value of state $s$. To reduce the variance of Monte Carlo sampling, we average it with the value function at state $s$. The final simulation return is then $R_{\mathrm{simu}} = 0.5 R + 0.5 V(s)$.
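The computation of the simulation return can be sketched as below. This is a minimal sketch under stated assumptions: the function name and argument names are hypothetical, the rewards are assumed to be discounted with the discount factor reported in the hyperparameter section (0.99), and "average" is read as an equal-weight mean of the rollout return and the value at the starting state.

```python
def simulation_return(rewards, terminated, leaf_value, start_value, gamma=0.99):
    """Bootstrapped rollout return averaged with the value of the start state.

    rewards     -- intermediate rewards collected during the rollout (at most 100)
    terminated  -- True if the environment ended before the step limit
    leaf_value  -- value estimate at the state where the rollout stopped
    start_value -- value estimate at the state where the rollout started
    """
    ret = 0.0
    for step, reward in enumerate(rewards):
        ret += gamma ** step * reward
    if not terminated:
        # Bootstrap the remaining return from the value network (assumed discounting).
        ret += gamma ** len(rewards) * leaf_value
    return 0.5 * ret + 0.5 * start_value
```

Averaging with the value estimate trades a small bias from the value network for lower variance than a pure Monte Carlo return.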
Hyperparameters and experiment details for WUUCT
For all tree-search-based algorithms (i.e., WUUCT, TreeP, LeafP, and RootP), the maximum depth of the search tree is set to 100. The search width is limited to 20 and the maximum number of simulations is 128. The discount factor is set to 0.99 (note that the reported score is not discounted). When performing gameplays, a tree search subroutine is called to plan for the best action at each time step. The subroutine iteratively constructs a search tree, starting from a tree that contains only the root node. Experiments are run on 4 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA GeForce RTX 2080 Ti GPUs. To minimize the speed fluctuation caused by different workloads on the machine, we ensure that the total number of simulation workers is smaller than the total number of CPU cores, so that each process can fully occupy a single core. WUUCT is implemented with multiple processes, with an inter-process pipe between the master process and each worker process.
Hyperparameters and experiments for baseline algorithms
Being unable to find appropriate third-party packages for the baseline algorithms (i.e., tree parallelization, leaf parallelization, and root parallelization), we built our own implementations of them based on the corresponding papers. Building all algorithms in the same package additionally allows us to conduct accurate speed tests, as it eliminates other factors (e.g., different programming languages) that may bias the result. Specifically, leaf parallelization is implemented with a master-worker structure: when the main process enters the simulation step, it assigns the task to all workers. When the returns from all workers are available, the master process performs backpropagation according to these statistics and begins a new rollout.
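The master-worker structure of leaf parallelization can be sketched as follows. Note the substitutions: a thread pool from the standard library stands in for the worker processes, and `simulate(state, seed)` is a hypothetical rollout function; neither is taken from the paper's code.

```python
from concurrent.futures import ThreadPoolExecutor


def leaf_parallel_simulation(leaf_state, simulate, num_workers):
    """Run one rollout per worker from the same leaf and wait for all of them."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        futures = [pool.submit(simulate, leaf_state, seed) for seed in range(num_workers)]
        # Blocking on every future mirrors the synchronous wait of the master.
        return [future.result() for future in futures]


def back_propagate_all(node_stats, returns):
    """Fold every returned reward into a running (visit count, mean value) pair."""
    count, value = node_stats
    for ret in returns:
        count += 1
        value += (ret - value) / count
    return count, value
```

The blocking wait is the defining cost of this scheme: the master cannot start a new selection step until the slowest worker has returned.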
As suggested by browne2012survey, tree parallelization is implemented using a decentralized structure, i.e., each worker performs rollouts on a shared search tree. At the selection step, a fixed virtual loss is added to each traversed node to guarantee diversity of the tree search. When performing backpropagation, the virtual loss is removed from the traversed nodes. The virtual loss is chosen from 1.0 and 5.0 for each particular task. In other words, we ran TreeP with both values for each task and report the better result.
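The bookkeeping for the virtual loss can be sketched as below. This is an illustrative sketch: the `Node` class and function names are hypothetical, locking of the shared tree is omitted, and the virtual loss is assumed to be subtracted directly from the node's mean value while a worker is traversing it.

```python
class Node:
    def __init__(self):
        self.visits = 0
        self.total_return = 0.0
        self.virtual_loss = 0.0

    def value(self):
        """Mean return, lowered by any virtual loss currently applied."""
        mean = self.total_return / self.visits if self.visits else 0.0
        return mean - self.virtual_loss


def apply_virtual_loss(path, r_vl):
    """Called during selection: make the traversed nodes look worse to other workers."""
    for node in path:
        node.virtual_loss += r_vl


def back_propagate(path, ret, r_vl):
    """Remove the virtual loss again and record the observed return."""
    for node in path:
        node.virtual_loss -= r_vl
        node.visits += 1
        node.total_return += ret
```

A larger virtual loss pushes concurrent workers further apart in the tree, which is why the value has to be tuned per task.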
Root parallelization is implemented according to chaslot2008parallel. Similar to leaf parallelization, root parallelization consists of subprocesses that do not share information with each other. At the beginning of the tree search process, each subprocess is assigned several actions of the root node to query. The subprocesses then perform sequential UCT rollouts until they reach a predefined maximum number of rollouts. When all subprocesses have completed their jobs, their statistics are gathered by the main process and used to choose the best action.
Appendix E Additional experiments on the Atari games
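Table 5: Comparison of WUUCT with the TreeP variant under three sets of hyperparameters.

| Environment | WUUCT | TreeP (hyperparameter set 1) | TreeP (hyperparameter set 2) | TreeP (hyperparameter set 3) |
|---|---|---|---|---|
| Alien | 5938 ± 1839 | 4850 ± 357 | 4935 ± 60 | 5000 ± 0 |
| Boxing | 100 ± 0 | 99 ± 1 | 99 ± 0 | 99 ± 1 |
| Breakout | 408 ± 21 | 379 ± 43 | 265 ± 50 | 463 ± 60 |
| Freeway | 32 ± 0 | 32 ± 0 | 32 ± 0 | 32 ± 0 |
| Gravitar | 5060 ± 568 | 3500 ± 707 | 4105 ± 463 | 4950 ± 141 |
| MsPacman | 19804 ± 2232 | 13160 ± 462 | 12991 ± 851 | 8640 ± 438 |
| RoadRunner | 46720 ± 1359 | 29800 ± 282 | 28550 ± 459 | 29400 ± 494 |
| Qbert | 13992 ± 5596 | 17055 ± 353 | 13425 ± 194 | 9075 ± 53 |
| SpaceInvaders | 3393 ± 292 | 2305 ± 176 | 3210 ± 127 | 3020 ± 42 |
| Tennis | 4 ± 1 | 1 ± 0 | 1 ± 0 | 1 ± 0 |
| TimePilot | 55130 ± 12474 | 52500 ± 707 | 49800 ± 212 | 32400 ± 1697 |
| Zaxxon | 39085 ± 6838 | 24300 ± 2828 | 24600 ± 424 | 37550 ± 1096 |
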
This section provides additional experimental results that compare WUUCT with another variant of the Tree Parallelization (TreeP) algorithm. As suggested by silver2016mastering2, besides pre-adjusting the value with virtual loss , a pre-adjusted visit count can also be used to penalize . In this variant of TreeP, both the virtual loss and a hand-crafted count correction (termed the virtual pseudo-count) are added to adjust . Specifically, the value of node is adjusted as
(7) 
which is used in the UCT selection phase. Table 5 compares WUUCT with this TreeP variant using both virtual loss and virtual pseudocount (i.e., Eq. 7). Three sets of hyperparameters are used in TreeP, which are described in the caption of the table (i.e., , , and ). All other experiment setups are the same as Section 5.2 and Appendix D. Table 5 indicates that on 9 out of 12 tasks, WUUCT outperforms this new baseline (with its best hyperparameters). Furthermore, we also observe that TreeP does not have an optimal set of hyperparameters that performs uniformly well on all tasks. In other words, to perform well, TreeP needs to conduct pertask hyperparameter tuning. On the other hand, WUUCT performs consistently well across different tasks.
Conceptually, WUUCT is designed based on the fact that ongoing simulations (unobserved samples) will eventually return their results, so their number should be tracked and used to adaptively adjust the UCT selection process. On the other hand, TreeP uses an artificially designed virtual loss and virtual pseudo-count to discourage other threads from simultaneously exploring the same node. Therefore, WUUCT achieves a better exploration-exploitation tradeoff in parallelization, which leads to better performance, as confirmed by the experimental results given in Table 5.
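To make the idea of tracking unobserved samples concrete, the sketch below adds the number of ongoing simulations to the visit counts of a UCB1-style score. The exact selection rule is defined in the main text and is not reproduced in this appendix, so the functional form, the function name, and the `beta` parameter here are assumptions for illustration only.

```python
import math


def score_with_unobserved(value, observed, unobserved,
                          parent_observed, parent_unobserved, beta=1.0):
    """UCB1-style score in which ongoing simulations count as visits (assumed form)."""
    total = observed + unobserved
    if total == 0:
        # Neither visited nor currently being simulated: explore first.
        return math.inf
    parent_total = parent_observed + parent_unobserved
    return value + beta * math.sqrt(2.0 * math.log(parent_total) / total)
```

Because the counts shrink the exploration bonus as soon as a simulation is dispatched, other workers are steered away from the node without any hand-tuned penalty on its value.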
Input: environment emulator , prior policy , root tree node , maximum simulation step , maximum simulation depth , and number of workers
Initialize:
while do
Traverse the tree top-down from the root node following (2) until (i) its depth is greater than , (ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() 0.5
expand(, , )
Each of the simulation workers performs a rollout beginning from
Wait until all workers have completed their simulations and returned the cumulative reward ( is returned by worker )
for do
Call back_propagation
end for
end while

Input: environment emulator , prior policy , root tree node , maximum simulation step , maximum simulation depth , and number of workers
Initialize: for
Initialize: processes, each with access to the environment emulator, the prior policy, and the search tree
Expand all child nodes of
( is the number of actions)
Evenly distribute the workload (perform tree search times on each child of ) to the workers, and copy the corresponding child nodes to each worker’s local memory.
Perform asynchronously in each of the workers ( denotes the thread ID)
Select a child of according to its allocated budget
(ii) it is a leaf node, or (iii) it is a node that has not been fully expanded and random() 0.5
Add virtual loss to each of the traversed nodes: for each traversed
expand(, , )
Perform rollout beginning from
the returned cumulative reward of the rollout
Call back_propagation
Remove virtual loss from each of the traversed nodes: for each traversed
if
Terminate current process
end if
end
Gather child nodes’ statistics from all workers

input: node , environment emulator , prior policy
while has expanded do
end while
performing in according to (: terminal signal)
a new node constructed according to
Store reward signal and termination indicator in
Link as the child of by the node corresponding to

input: node , cumulative reward
while do
Retrieve the reward in the current node (which is collected during its expansion)
// denotes the parent node of
end while
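The back-propagation routine can be sketched as follows. Several symbols of the pseudocode above were lost, so this is a reconstruction under assumptions: the `TreeNode` class is hypothetical, the reward stored in each node is folded into the return with a discount factor before the node's running mean is updated, and the loop stops after the root (whose parent is `None`).

```python
class TreeNode:
    def __init__(self, parent=None, reward=0.0):
        self.parent = parent
        self.reward = reward  # reward stored in the node during its expansion
        self.visits = 0
        self.value = 0.0


def back_propagation(node, cumulative_reward, gamma=0.99):
    """Walk from `node` up to the root, updating visit counts and mean values."""
    ret = cumulative_reward
    while node is not None:
        node.visits += 1
        # Retrieve the node's stored reward and fold it into the return (assumed order).
        ret = node.reward + gamma * ret
        node.value += (ret - node.value) / node.visits
        node = node.parent
```

Storing the reward at expansion time means the emulator does not have to be queried again while the return travels back up the tree.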

Footnotes
 Code is available at https://github.com/liuanji/WUUCT.
 In the context of MCTS, the action space is assumed to be finite and the transition is assumed to be deterministic, i.e., the next state is determined by the current state and action .
 We assume certain regularity conditions hold so that the cumulative reward is always bounded (sutton2018reinforcement).
 We refer the readers to chaslot2008parallel for more details. The pseudocode of the three algorithms is given in Appendix B. LeafP: Algorithm 4, TreeP: Algorithm 5, RootP: Algorithm 6.
 We refer to it as the tap game below. See Appendix C.1 for more details about the game rules.
 Level 35 is relatively simple, requiring 18 steps for an average player to pass, while Level 58 is relatively difficult and needs more than 50 steps to solve.
 In our setup, the game state will only be used for 1 time to start simulation and times to initialize expansion.
 Due to the complexity of the tap game, model-free RL algorithms such as A3C (mnih2016asynchronous) and PPO (schulman2017proximal) fail to achieve satisfactory performance and thus cannot provide accurate predictions. On the other hand, MCTS can achieve good performance but takes a long time in testing.
 The task “Tennis” is not included in the calculation of the average percentile improvement because the average episode return of RootP is 0.