Watch the Unobserved: A Simple Approach to Parallelizing Monte Carlo Tree Search
Monte Carlo Tree Search (MCTS) algorithms have achieved great success on many challenging benchmarks (e.g., Computer Go). However, they generally require a large number of rollouts, making their applications costly. Furthermore, it is also extremely challenging to parallelize MCTS due to its inherent sequential nature: each rollout heavily relies on the statistics (e.g., node visitation counts) estimated from previous simulations to achieve an effective exploration-exploitation tradeoff. In spite of these difficulties, we develop an algorithm, WU-UCT
Recently, Monte Carlo Tree Search (MCTS) algorithms such as UCT (kocsis2006improved) have achieved great success in solving many challenging artificial intelligence (AI) benchmarks, including video games (guo2016deep) and Go (silver2016mastering2). However, they rely on a large number (e.g., millions) of interactions with the environment emulator to construct search trees for decision-making, which leads to high time complexity (browne2012survey). For this reason, there has been an increasing demand for parallelizing MCTS over multiple workers. However, parallelizing MCTS without degrading its performance is difficult (segal2010scalability; mirsoleimani2018lock; chaslot2008parallel), mainly because each MCTS iteration requires information from all previous iterations to provide an effective exploration-exploitation tradeoff. Specifically, parallelizing MCTS inevitably obscures this crucial information, and we will show in Section 2.2 that this loss of information potentially results in a significant performance drop. The key question is therefore how to acquire and utilize more of the available information to mitigate the information loss caused by parallelization and help the algorithm achieve a better exploration-exploitation tradeoff.
To this end, we propose WU-UCT (Watch the Unobserved in UCT), a novel parallel MCTS algorithm that attains linear speedup with only limited performance loss. This is achieved by a conceptual innovation (Section 3.1) as well as an efficient real system implementation (Section 3.2). Specifically, the key idea in WU-UCT to overcome the aforementioned challenge is a set of statistics that tracks the number of on-going yet incomplete simulation queries (which we call unobserved samples). We combine these newly devised statistics with the original statistics of observed samples to modify UCT’s policy in the selection steps in a principled manner, which, as we shall show in Section 4, effectively retains the exploration-exploitation tradeoff during parallelization. Our proposed approach has been successfully deployed in a production system for efficiently and accurately estimating the rate at which users pass levels (termed user pass-rate) in a mobile game “Joy City”, with the purpose of reducing its design cycles. On this benchmark, we show that WU-UCT achieves near-optimal linear speedup and superior performance in predicting user pass-rate (Section 5.1). We further evaluate WU-UCT on the Atari Game benchmark and compare it to state-of-the-art parallel MCTS algorithms (Section 5.2), which also demonstrates our superior speedup and performance.
2 On the Difficulties of Parallelizing MCTS
We first introduce the MCTS and the UCT algorithms, along with their difficulties in parallelization.
2.1 Monte Carlo Tree Search and Upper Confidence Bound for Trees (UCT)
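We consider the Markov Decision Process (MDP) $\langle \mathcal{S}, \mathcal{A}, R, P, \gamma \rangle$, where an agent interacts with the environment in order to maximize a long-term cumulative reward. Specifically, an agent at state $s_t \in \mathcal{S}$ takes an action $a_t \in \mathcal{A}$ according to a policy $\pi$, so that the MDP transits to the next state $s_{t+1} \sim P(\cdot \mid s_t, a_t)$ and emits a reward $R(s_t, a_t)$. The objective is to find a policy that maximizes the expected discounted cumulative reward
$$\max_{\pi} \; \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\Big|\, s_0 = s\Big], \tag{1}$$
where $s \in \mathcal{S}$ denotes the initial state and $\gamma$ is the discount factor.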
In the selection step, UCT descends the search tree by choosing, at each node $s$, the child given by the tree policy
$$a_s = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\frac{2 \log N_s}{N_{s'}}} \Big\}, \tag{2}$$
where $\mathcal{C}(s)$ denotes the set of all child nodes for $s$; the first term $V_{s'}$ is an estimate for the long-term cumulative reward that can be received when starting from the state represented by node $s'$, and the second term represents the uncertainty (size of the confidence interval) of that estimate. The confidence interval is calculated based on the Upper Confidence Bound (UCB) (auer2002finite; auer2002using) using $N_s$ and $N_{s'}$, which denote the number of times that the nodes $s$ and $s'$ have been visited, respectively. Therefore, the key idea of the UCT policy (2) is to select the best action according to an optimistic estimation (i.e., the upper confidence bound) of the expected return, which strikes a balance between exploitation (first term) and exploration (second term), with $\beta$ controlling their tradeoff. Once the selection process reaches a leaf node of the search tree (or other termination conditions are met), we expand the node according to a prior policy by adding a new child node. Then, in the simulation step, we estimate its value function (cumulative reward) by running the environment simulator with a default (simulation) policy. Finally, during backpropagation, we update the statistics $V_s$ and $N_s$ from the leaf node $s_T$ to the root node $s_0$ of the selected path by recursively performing the following update (i.e., from $t = T-1$ to $t = 0$):
$$N_{s_t} \leftarrow N_{s_t} + 1, \quad \hat{V}_{s_t} \leftarrow R(s_t, a_t) + \gamma \hat{V}_{s_{t+1}}, \quad V_{s_t} \leftarrow \frac{(N_{s_t} - 1)\, V_{s_t} + \hat{V}_{s_t}}{N_{s_t}}, \tag{3}$$
where $\hat{V}_{s_T}$ is the simulation return of $s_T$; $a_t$ denotes the action selected following (2) at state $s_t$.
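The selection and backpropagation steps described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, its field names, and the running-mean form of the value update are assumptions made for the example, and unvisited children are given an infinite score so that they are tried first.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical search-tree node; field names are illustrative."""
    N: int = 0            # number of completed visits
    V: float = 0.0        # running estimate of the cumulative reward
    reward: float = 0.0   # reward on the edge from the parent to this node
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


def uct_select(node: Node, beta: float = 1.0) -> Node:
    """Return the child with the largest upper confidence bound."""
    def score(child: Node) -> float:
        if child.N == 0:
            return math.inf  # unvisited children are tried first
        return child.V + beta * math.sqrt(2.0 * math.log(node.N) / child.N)
    return max(node.children, key=score)


def backpropagate(leaf: Node, simulation_return: float, gamma: float = 0.99) -> None:
    """Walk from the leaf to the root, updating N and V along the path."""
    ret = simulation_return
    node = leaf
    while node is not None:
        node.N += 1
        node.V += (ret - node.V) / node.N  # incremental running mean
        ret = node.reward + gamma * ret    # return as seen from the parent
        node = node.parent
```

A rarely visited child can win the selection even when its value estimate is slightly lower, because its confidence interval is wider.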
2.2 The Intrinsic Difficulties of Parallelizing MCTS
The above discussion implies that the MCTS algorithm is intrinsically sequential: each selection step in a new rollout requires the previous rollouts to complete in order to deliver the updated statistics, $V_s$ and $N_s$, for the UCT tree policy (2). Although up-to-date statistics are not strictly required to run the algorithm, in practice they are essential for achieving an effective exploration-exploitation tradeoff (auer2002finite). Specifically, up-to-date statistics best help the UCT tree policy to identify and prune non-rewarding branches and to visit rewarding paths extensively for additional planning depth. Likewise, to achieve the best possible performance when multiple workers are used, it is also important to ensure that each worker uses the most recent statistics (the colored $V_s$ and $N_s$ in Figure 1(b)) in its own selection step. However, this is impossible when parallelizing MCTS, based on the following observations. First, the expansion step and the simulation step are generally more time-consuming than the other two steps, because they involve a large number of interactions with the environment (or its simulator). Therefore, as exemplified by Figure 1(c), when a worker C initiates a new selection step, the other workers A and B are most likely still in their simulation or expansion steps. This prevents them from updating the (global) statistics for other workers like C, which happens at their respective backpropagation steps. Using outdated statistics (the gray-colored $V_s$ and $N_s$) at different workers could lead to a significant performance loss given a fixed target speedup, due to behaviors like collapse of exploration or exploitation failure, which we shall discuss thoroughly in Section 4. To give an example, Figure 1(c) illustrates the collapse of exploration, where worker C traverses the same path as worker A in its selection step due to the determinism of (2).
Specifically, if the statistics are unchanged between the moments that workers A and C begin their own selection steps, they will choose the same node according to (2), which greatly reduces the diversity of exploration. Therefore, the key question that we want to address in parallelizing MCTS is how to track the correct statistics and modify the UCT policy in a principled manner, with the hope of retaining an effective exploration-exploitation tradeoff at different workers.
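The collapse of exploration can be reproduced with a few lines of code. The sketch below assumes a single parent with three children and made-up statistics; the function name `uct_scores` and all numbers are illustrative. Because no worker has completed its simulation, the statistics do not change between selections, and the deterministic policy returns the same child for every worker.

```python
import math


def uct_scores(parent_visits, child_stats, beta=1.0):
    """UCT scores for a list of (value, visits) pairs; all visits are positive."""
    return [
        value + beta * math.sqrt(2.0 * math.log(parent_visits) / visits)
        for value, visits in child_stats
    ]


# Hypothetical statistics: (value estimate, visit count) per child.
stats = [(0.5, 4), (0.45, 4), (0.3, 2)]

picks = []
for worker in ("A", "B", "C"):
    # No simulation has finished, so every worker sees identical statistics.
    scores = uct_scores(10, stats)
    picks.append(scores.index(max(scores)))
```

All three workers end up querying the same child, so two of the three simulations add little new information.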
3.1 Watch the Unobserved Samples in UCT Tree Policy
As we pointed out earlier, the key question we want to address in parallelizing MCTS is how to deliver the most up-to-date statistics to each worker so that it can achieve an effective exploration-exploitation tradeoff in its selection step. This is assumed to be the case in the ideal parallelization in Figure 1(b). Algorithmically, it is equivalent to the sequential MCTS except that the rollouts are performed in parallel by different workers. Unfortunately, in practice, the statistics available to each worker are generally outdated because of the slow and incomplete simulation and expansion steps at the other workers. Specifically, since the estimated value is unobservable before simulations complete and workers should not wait for the updated statistics to proceed, the (partial) loss of statistics is unavoidable. Now the question becomes: is there an alternative way to address the issue? The answer is in the affirmative and is explained below.
Aiming at bridging the gap between naive parallelization and the ideal case, we closely examine their difference in terms of the availability of statistics. As illustrated by the colors of the statistics, their only difference in $N_s$ is caused by the on-going simulation processes. As suggested by (3), although $V_s$ can only be updated after a simulation step is completed, the newest $N_s$ information can actually be available as soon as a worker initiates a new rollout. This is the key insight that we leverage to enable effective parallelization in our WU-UCT algorithm. Motivated by this, we introduce another quantity, $O_s$, to count the number of rollouts that have been initiated but not yet completed, which we call unobserved samples. That is, our new statistics, $O_s$, watch the number of unobserved samples, and are then used to correct the UCT tree policy (2) into the following form:
$$a_s = \arg\max_{s' \in \mathcal{C}(s)} \Big\{ V_{s'} + \beta \sqrt{\frac{2 \log (N_s + O_s)}{N_{s'} + O_{s'}}} \Big\}. \tag{4}$$
The intuition of the above modified node-selection policy is that when there are $O_s$ workers simulating (querying) node $s$, the confidence interval at node $s$ will eventually shrink after they complete. Therefore, adding $O_s$ and $O_{s'}$ to the exploration term accounts for this fact beforehand and lets other workers be aware of it. Despite its simple form, (4) provides a principled way to retain an effective exploration-exploitation tradeoff under parallel settings; it corrects the confidence bound towards a better exploration-exploitation tradeoff. As the confidence level is instantly updated (i.e., at the beginning of simulation), more recent workers are guaranteed to observe additional statistics, which prevents them from extensively querying the same node and helps them find better nodes to query. For example, when multiple children are in demand for exploration, (4) allows them to be explored evenly. In contrast, when a node has been sufficiently visited (i.e., large $N_s$ and $N_{s'}$), adding $O_s$ and $O_{s'}$ from the unobserved samples has little effect on (4) because the confidence interval is sufficiently shrunk around $V_{s'}$, allowing extensive exploitation of the best-valued child.
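The effect of the modified policy (4) can be illustrated with a small sketch. The function names, the dictionary layout, and the numbers are assumptions made for this example. Three workers select in turn without any simulation completing; each selection immediately increments the unobserved count of the chosen child and of the parent.

```python
import math


def wu_uct_score(v_child, n_child, o_child, n_parent, o_parent, beta=1.0):
    """Score of one child under the modified policy (4)."""
    visits = n_child + o_child
    if visits == 0:
        return math.inf  # nodes never queried are tried first
    return v_child + beta * math.sqrt(2.0 * math.log(n_parent + o_parent) / visits)


def select_child(children, n_parent, o_parent, beta=1.0):
    """Index of the best child; `children` is a list of dicts with V, N, O."""
    scores = [
        wu_uct_score(c["V"], c["N"], c["O"], n_parent, o_parent, beta)
        for c in children
    ]
    return scores.index(max(scores))


# Three equally promising children (hypothetical statistics).
children = [{"V": 0.5, "N": 2, "O": 0} for _ in range(3)]
n_parent, o_parent = 6, 0

picks = []
for worker in range(3):
    idx = select_child(children, n_parent, o_parent)
    children[idx]["O"] += 1  # the query is registered before it completes
    o_parent += 1
    picks.append(idx)
```

With these numbers the three workers query three different children, whereas the unmodified policy (2) would send all of them to the same one. For a heavily visited node the correction is tiny, so exploitation of a clearly best child is not blocked.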
3.2 System implementation using Master-worker architectures
We now proceed to explain the system implementation of WU-UCT, whose overall architecture is shown in Figure 2(a) (see Appendix A for the details). Specifically, we use a master-worker architecture to implement the WU-UCT algorithm with the following considerations. First, since the expansion and the simulation steps are much more time-consuming than the selection and the backpropagation steps, they should be intensively parallelized. In fact, they are relatively easy to parallelize (e.g., different simulations could be performed independently). Second, as we discussed earlier, different workers need to access the most up-to-date statistics in order to achieve a successful exploration-exploitation tradeoff. To this end, a centralized architecture for the selection and backpropagation steps is preferable, as it allows adding strict restrictions to the retrieval and update of the statistics, keeping them up-to-date. Specifically, we use a centralized master process to maintain a global set of statistics (in addition to other data such as game states), and let it be in charge of the backpropagation step (i.e., updating the global statistics) and the selection step (i.e., exploiting the global statistics). As shown in Figure 2(a), the master process repeatedly performs rollouts until a predefined number of simulations is reached. During each rollout, it selects nodes to query, assigns expansion and simulation tasks to different workers, and collects the returned results to update the global statistics. In particular, we use the following incomplete update and complete update (shown in Figure 2(a)) to track $N_s$ and $O_s$ along the traversed path (see Figure 1(d)):
$$\text{[incomplete update]} \quad O_s \leftarrow O_s + 1, \tag{5}$$
$$\text{[complete update]} \quad O_s \leftarrow O_s - 1; \quad N_s \leftarrow N_s + 1, \tag{6}$$
where incomplete update is performed before the simulation task starts, allowing the updated statistics to be instantly available globally; complete update is done after the simulation return is available, resembling the backpropagation step in the sequential algorithm. In addition, $V_s$ is also updated in the complete update step using (3). Such a clear division of labor between the master and the workers keeps the selection and backpropagation steps sequential while parallelizing the costly expansion and simulation steps. It ensures up-to-date statistics for all workers through the centralized master process and achieves linear speedup without much performance degradation (see Section 5 for the experimental results).
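The two updates can be sketched as a pair of functions operating on the traversed path. The `NodeStats` class and its fields are hypothetical names chosen for this example, and the value update uses a running mean, which is an assumption about the exact form of the update.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NodeStats:
    """Hypothetical per-node statistics kept by the master process."""
    N: int = 0           # observed (completed) samples
    O: int = 0           # unobserved (in-flight) samples
    V: float = 0.0       # value estimate
    reward: float = 0.0  # reward on the edge leading into this node


def incomplete_update(path: List[NodeStats]) -> None:
    """Called before a simulation task starts: register the query."""
    for node in path:
        node.O += 1


def complete_update(path: List[NodeStats], simulation_return: float,
                    gamma: float = 0.99) -> None:
    """Called when the simulation return arrives: move O into N, update V."""
    ret = simulation_return
    for node in reversed(path):  # from the leaf back to the root
        node.O -= 1
        node.N += 1
        node.V += (ret - node.V) / node.N
        ret = node.reward + gamma * ret
```

Between the two calls, every node on the path carries a positive `O`, which is what later selections see through the modified tree policy.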
To justify the above rationale of our system design, we perform a set of running time analyses for our developed WU-UCT system and report the results in Figure 2(b)–(c). We show the time consumption for different parts at the master and at the workers. First, we focus exclusively on the workers. With a close-to-100% occupancy rate for the simulation workers, the simulation step is fully parallelized. Although the expansion workers are not fully utilized, the expansion step is maximally parallelized since the number of required simulation and expansion tasks is identical. This suggests the existence of an optimal (task-dependent) ratio between the number of expansion workers and the number of simulation workers that fully parallelizes both steps with the least resources (e.g., memory). Returning to the master process, on both benchmarks, we see a clear dominance of the time spent on the simulation and the expansion steps even though they are both parallelized by 16 workers. This supports our design to parallelize only the simulation and expansion steps. We finally focus on the communication overhead caused by parallelization. Although more time-consuming than selection and backpropagation, the communication overhead is negligible compared to the time used by the expansion and the simulation steps. Other details of our system, such as the centralized game-state storage, are further discussed in Appendix A.
4 The Benefits of Watching Unobserved Samples
In this section, we discuss the benefits of watching unobserved samples in WU-UCT, and compare it with several popular parallel MCTS algorithms (Figure 3), including Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP).
We argue that, by introducing the additional statistics $O_s$, WU-UCT achieves a better exploration-exploitation tradeoff than the above methods. First, LeafP and TreeP represent two extremes in such a tradeoff. LeafP lacks diversity in exploration as all its workers are assigned to simulating the same node, leading to a performance drop caused by collapse of exploration in much the same way as the naive parallelization (see Figure 1(c)). In contrast, although the virtual loss used in TreeP could encourage exploration diversity, this hard additive penalty could cause exploitation failure: workers will be less likely to co-simulate the same node even when they are certain that it is optimal (mirsoleimani2017analysis). RootP tries to avoid these issues by letting workers perform independent tree searches. However, this reduces the equivalent number of rollouts at each worker, decreasing the accuracy of the UCT policy (2). Different from the above three approaches, WU-UCT achieves a much better exploration-exploitation tradeoff in the following manner. It encourages exploration by using $O_s$ to “penalize” the nodes that have many in-progress simulations. Meanwhile, it allows multiple workers to exploit the most rewarding node since this “penalty” vanishes when $N_s$ becomes large (see (4)).
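The contrast between the two kinds of penalty can be made concrete with a short numerical sketch. Two simplifications are assumed here: the virtual loss is modeled as a constant `r_vl` subtracted per in-flight worker, and the WU-UCT penalty is measured as the drop in the exploration term of (4) caused by one additional in-flight worker. Function names and numbers are illustrative.

```python
import math


def wu_uct_penalty(n_child, n_parent, in_flight=1, beta=1.0):
    """Drop in the exploration bonus caused by `in_flight` unobserved samples."""
    before = beta * math.sqrt(2.0 * math.log(n_parent) / n_child)
    after = beta * math.sqrt(
        2.0 * math.log(n_parent + in_flight) / (n_child + in_flight)
    )
    return before - after


def virtual_loss_penalty(in_flight=1, r_vl=1.0):
    """Assumed additive form: a fixed loss per in-flight worker."""
    return r_vl * in_flight


rarely_visited = wu_uct_penalty(n_child=2, n_parent=4)
well_explored = wu_uct_penalty(n_child=2000, n_parent=4000)
fixed = virtual_loss_penalty()
```

Under these assumptions the WU-UCT penalty is noticeable for a rarely visited node and almost zero for a well-explored one, while the additive penalty stays the same regardless of how confident the estimate is.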
This section evaluates the proposed WU-UCT algorithm on a production system to predict the user pass-rate of a mobile game (Section 5.1) as well as on the public Atari Game benchmark (Section 5.2), aiming at demonstrating the superior performance and near-linear speedup of WU-UCT.
5.1 Experiments on the “Joy City” Game
Joy City is a level-oriented game with diverse and challenging gameplay. Players tap to eliminate connected items on the game board. To pass a level, players have to complete certain goals within a given number of steps.
We evaluate WU-UCT with different numbers of expansion and simulation workers (from to ) and report the speedup results in Figures 4(a)–(b). For all experiments, we fix the total number of simulations to 500. First, note that when we have the same number of expansion workers and simulation workers, WU-UCT achieves linear speedup. Furthermore, Figure 4 also suggests that both the expansion workers and the simulation workers are crucial, since lowering the number of workers in either set decreases the speedup. Besides the near-linear speedup property, WU-UCT suffers negligible performance loss as the number of workers increases, as shown in Figures 4(c)–(d). The standard deviations of the performance (measured in the average game steps) over different numbers of expansion and simulation workers are only and for Level-35 and Level-58, respectively, which are much smaller than their average game steps ( and ).
5.2 Experiments on the Atari Game Benchmark
We further evaluate WU-UCT on Atari Games (bellemare2013arcade), a classical benchmark for reinforcement learning (RL) and planning algorithms (guo2014deep). The Atari Games are an ideal testbed for MCTS algorithms because of their long planning horizon (several thousand steps), sparse reward, and complex game strategy. We compare WU-UCT to three parallel MCTS algorithms discussed in Section 4: TreeP, LeafP, and RootP (additional experimental results comparing WU-UCT with a variant of TreeP are provided in Appendix E). We also report the results of sequential UCT ( slower than WU-UCT) and PPO (schulman2017proximal) as reference. Generally, the performance of sequential UCT sets an upper bound for parallel UCT algorithms. PPO is included since we used a distilled PPO policy network (hinton2015distilling; rusu2015policy) as the roll-out policy for all other algorithms. It is considered a performance lower bound for both parallel and sequential UCT algorithms. All experiments are performed with a total of 128 simulation steps, and all parallel algorithms use 16 workers (see Appendix D for the details).
We first compare the performance, measured by average episode reward, between WU-UCT and the baselines on 15 Atari games, which is done with 16 simulation workers and 1 expansion worker (for a fair comparison, since the baselines do not parallelize the expansion step). Each task is repeated 10 times with the mean and standard deviation reported in Table 1. Due to the better exploration-exploitation tradeoff during selection, WU-UCT outperforms all other parallel algorithms in 12 out of 15 tasks. Pairwise Student's t-tests further show that WU-UCT performs significantly better (adjusted by the Bonferroni method, $p$-value $< 0.0011$) than TreeP, LeafP, and RootP in 7, 9, and 7 tasks, respectively. Next, we examine the influence of the number of simulation workers on the speed and the performance. In Figure 5, we compare the average episode return as well as time consumption (per step) for 4, 8, and 16 simulation workers. The bar plots indicate that WU-UCT experiences little performance loss with an increasing number of workers, while the baselines exhibit significant performance degradation when heavily parallelized. WU-UCT is also the fastest among the compared methods, thanks to the efficient master-worker architecture (Section 3.2). In conclusion, our proposed WU-UCT not only outperforms baseline approaches significantly under the same number of workers but also suffers negligible performance loss as the level of parallelization increases.
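The Bonferroni threshold quoted above is consistent with a simple calculation, sketched below under two assumptions that the text does not state explicitly: a family-wise significance level of 0.05, and one comparison per (task, baseline) pair.

```python
# Assumed family-wise significance level (not stated in the text).
alpha = 0.05
num_tasks = 15      # Atari games in the comparison
num_baselines = 3   # TreeP, LeafP, RootP

# Bonferroni correction: divide alpha by the number of comparisons.
num_comparisons = num_tasks * num_baselines
threshold = alpha / num_comparisons
```

With 45 comparisons the per-test threshold is about 0.0011, matching the reported value.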
6 Related Work
MCTS Monte Carlo Tree Search is a planning method for optimal decision making in problems with either deterministic (silver2016mastering2) or stochastic (schafer2008uct) environments. It has made a profound influence on Artificial Intelligence applications (browne2012survey), and has even been applied to predict and mimic human behavior (van2016people). Recently, there has been a wide range of work combining MCTS and other learning methods, providing mutual improvements to both methods. For example, guo2014deep harnesses the power of MCTS to boost the performance of model-free RL approaches; shen2018m bridges the gap between MCTS and graph-based search, outperforming RL and knowledge base completion baselines.
Parallel MCTS Many approaches have been developed to parallelize MCTS methods, with the objective being two-fold: achieve near-linear speedup under a large number of workers while maintaining the algorithm performance. Popular parallelization approaches of MCTS include leaf parallelization, root parallelization, and tree parallelization (chaslot2008parallel). Leaf parallelization aims at collecting better statistics by assigning multiple workers to query the same node (cazenave2007parallelization). However, this comes at the cost of reduced diversity in the tree search. Therefore, its performance degrades significantly despite the near-ideal speedup obtained with the help of a client-server network architecture (kato2010parallel). In root parallelization, multiple search trees are built and assigned to different workers. Additional work incorporates periodic synchronization of statistics from different trees, which results in better performance in real-world tasks (bourki2010scalability). However, a case study on Go reveals its inferiority with even a small number of workers (soejima2010evaluating). On the other hand, tree parallelization uses multiple workers to traverse, perform queries on, and update a shared search tree. It benefits significantly from two techniques. First, a virtual loss is added to avoid querying the same node by different workers (chaslot2008parallel). This has been adopted in various successful applications of MCTS such as Go (silver2016mastering2) and Dou-di-zhu (whitehouse2011determinization). Additionally, architecture-side improvements such as using a pipeline (mirsoleimani2018pipeline) or a lock-free structure (mirsoleimani2018lock) speed up the algorithm significantly. However, although it is able to increase diversity, virtual loss degrades the performance under even four workers (mirsoleimani2017analysis; bourki2010scalability).
This paper proposes WU-UCT, a novel parallel MCTS algorithm that addresses the problem of outdated statistics during parallelization by watching the number of unobserved samples. Based on the newly devised statistics, it modifies the UCT node-selection policy in a principled manner, which achieves effective exploration-exploitation tradeoff. Together with our efficiency-oriented system implementation, WU-UCT achieves near-optimal linear speedup with only limited performance loss across a wide range of tasks, including a deployed production system and Atari games.
This work is supported by Tencent AI Lab and Seattle AI Lab, Kwai Inc. We thank Xiangru Lian for his help on the system implementation.
Appendix A Algorithm details for WU-UCT
The pseudo-code of WU-UCT is provided in Algorithm 1, which describes the workflow of the master process. While the number of completed updates () has not exceeded the maximum simulation step (a pre-defined hyperparameter), the main process repeatedly performs a modified rollout that consists of the following steps: selection, expansion, simulation, and backpropagation. The selection and backpropagation steps are performed in the main process, while the other two are assigned to the workers. The backpropagation step is divided into two sub-routines: incomplete update (Algorithm 2) and complete update (Algorithm 3). The former is executed before simulation starts, while the latter is called after receiving simulation results. A task index is added to help the main process track different tasks returned from the workers. To maximize efficiency, the master process keeps assigning expansion and simulation tasks until all workers are fully occupied.
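The master workflow can be sketched with a thread pool from the Python standard library. This is a heavily simplified illustration under several assumptions, not a reproduction of Algorithm 1: the tree has a single level (a root with one child per arm), the expansion step and the game-state storage are omitted, threads stand in for worker processes, and `simulate` is a hypothetical stand-in for a costly simulation.

```python
import math
import random
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def simulate(arm_mean, seed):
    """Stand-in for a costly simulation; returns a noisy reward."""
    rng = random.Random(seed)
    return arm_mean + rng.uniform(-0.1, 0.1)


def master(arm_means, max_simulations=64, num_workers=4, beta=1.0):
    k = len(arm_means)
    N, O, V = [0] * k, [0] * k, [0.0] * k
    n_root = 0
    o_root = 0
    pending = {}  # future -> index of the child being simulated
    issued = 0
    completed = 0

    def score(i):
        visits = N[i] + O[i]
        if visits == 0:
            return math.inf
        return V[i] + beta * math.sqrt(2.0 * math.log(n_root + o_root) / visits)

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        while completed < max_simulations:
            # Keep assigning tasks until all workers are occupied.
            while issued < max_simulations and len(pending) < num_workers:
                child = max(range(k), key=score)
                O[child] += 1  # incomplete update
                o_root += 1
                future = pool.submit(simulate, arm_means[child], issued)
                pending[future] = child
                issued += 1
            done, _ = wait(list(pending), return_when=FIRST_COMPLETED)
            for future in done:
                child = pending.pop(future)
                O[child] -= 1  # complete update
                o_root -= 1
                N[child] += 1
                n_root += 1
                V[child] += (future.result() - V[child]) / N[child]
                completed += 1
    return N, O, V
```

Selection and both updates run only in the master thread, so the statistics need no locking; the workers only execute `simulate`. The exact visit counts depend on the order in which simulations finish and may vary between runs.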
Communication overhead of WU-UCT
The choice of centralized game-state storage stems from the following observations: (i) the size of the game-state is usually small, which allows efficient inter-process transfer, and (ii) each game-state is used at most times,
Another possible solution is to store the game-states in shared memory. However, to benefit from it, the following conditions should be satisfied: (i) each process can access (read/write) the memory relatively fast even if some collisions may happen, and (ii) the shared memory is big enough to hold all game-states that may be accessed. If the two conditions hold, we may be able to reduce the communication overhead. Since the communication overhead is negligible even with 16 simulation and expansion workers (as shown in Figures 2(b) and 2(c)), we should consider using more workers to speed up the algorithm.
Appendix B Algorithm overview of baseline approaches
We give an overview of three baseline parallel UCT algorithms: Leaf Parallelization (LeafP), Tree Parallelization (TreeP) with virtual loss, and Root Parallelization (RootP), with the objective of providing a comprehensive view to the readers. We refer readers interested in the details of these algorithms to chaslot2008parallel. As suggested by their names, LeafP, TreeP, and RootP parallelize different parts of the search tree. Specifically, LeafP (Algorithm 4) parallelizes only the simulation process: whenever a node (state) is selected to query, all workers perform simulations individually to evaluate it. The main process (master) then waits for all workers to complete their simulations and return their respective cumulative rewards, which are used to update the traversed nodes’ statistics.
TreeP (Algorithm 5) parallelizes the whole tree search algorithm by allowing different workers to access a shared search tree simultaneously. Each worker individually performs the selection, expansion, simulation, and back-propagation steps and updates the nodes’ statistics. To discourage querying the same node, individual workers subtract a virtual loss ( is a hyper-parameter of the algorithm) from each of their traversed nodes during the selection process, and add it back () during back-propagation. As a result, nodes currently being evaluated by some workers have lower utility scores (4) and are less likely to be chosen by other workers, which improves the diversity of the nodes visited by different workers simultaneously.
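The virtual-loss bookkeeping can be sketched as two small functions. The simplest additive form on the value estimate is assumed here; implementations differ in exactly where the loss is applied. The dictionary layout and the name `r_vl` are illustrative.

```python
def apply_virtual_loss(path, r_vl):
    """During selection: lower the value of every traversed node."""
    for node in path:
        node["V"] -= r_vl


def revert_virtual_loss(path, r_vl):
    """During back-propagation: restore the values lowered at selection."""
    for node in path:
        node["V"] += r_vl


# Hypothetical path from the root to a leaf.
path = [{"V": 0.5}, {"V": 0.25}]
apply_virtual_loss(path, r_vl=1.0)
values_during_simulation = [node["V"] for node in path]
revert_virtual_loss(path, r_vl=1.0)
values_after = [node["V"] for node in path]
```

While a worker is simulating, other workers see the lowered values and are steered towards different nodes, regardless of how well the node has already been explored.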
silver2017mastering and segal2010scalability introduced an additional way to add pseudo reward into the traversed nodes. See Appendix E for details of this variant of TreeP and more experiments of it on Atari games.
As hinted by its name, RootP (Algorithm 6) parallelizes the root node. Specifically, in an initialization step, all children of the root node are expanded, and different workers are assigned to perform rollouts using the expanded child nodes as the root node of the search tree. The algorithm evenly distributes the workload such that the number of rollouts starting from all child nodes is , where is the number of workers. After the job assignment, all workers construct search trees in their own local memories and perform sequential tree search until their assigned tasks are finished. Finally, the main process collects statistics from all workers and returns the predicted best action of the state represented by the root node of the search tree.
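The final aggregation step of RootP can be sketched as follows. The text does not specify how the collected statistics are combined, so this example assumes each worker reports a visit count and a mean value per root action, merges them by a visit-weighted average, and picks the action with the highest merged value. All names are hypothetical.

```python
def merge_root_statistics(worker_results):
    """Merge per-worker dicts mapping action -> (visits, mean_value)."""
    merged = {}
    for result in worker_results:
        for action, (visits, value) in result.items():
            old_visits, old_value = merged.get(action, (0, 0.0))
            new_visits = old_visits + visits
            if new_visits == 0:
                merged[action] = (0, 0.0)
                continue
            # Visit-weighted average of the two value estimates.
            new_value = (old_visits * old_value + visits * value) / new_visits
            merged[action] = (new_visits, new_value)
    return merged


def best_action(merged):
    """Pick the action with the highest merged value estimate."""
    return max(merged, key=lambda action: merged[action][1])


results = [{"left": (4, 0.5)}, {"right": (4, 0.75)}, {"left": (4, 0.25)}]
merged = merge_root_statistics(results)
choice = best_action(merged)
```

Choosing by visit count instead of value is an equally common convention and would only change `best_action`.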
Appendix C Experiment details and system description of the Joy City task
C.1 Description of the Joy City game
This section serves as an introduction to the basic rules of the tap game. Figure 6 depicts several screenshots of the game. In the main frame, there is a grid, where each cell contains an item. We can click cells with connected color regions to eliminate them (e.g., if the cell represented by the purple dot in the first screenshot of Figure 6(a) is tapped, the region containing blue boxes will be “eliminated”). The remaining cells then collapse to fill in the gaps left by the exploded ones. The goal is to fulfill all level requirements (goals) within a fixed number of clicks. Figure 6(a) provides consecutive snapshots of playing level 10 of the game. The goal of this level is depicted at the top, which is 3 “cats” and 24 “balloons”. The top-left corner shows the number of remaining steps. Players have to accomplish all given goals before the steps run out. Figure 6(a) demonstrates successful gameplay, where only 6 steps are used to complete the level. In each of the three left frames, the cell marked by the purple circle is clicked. Immediately, the same-color region marked with a red frame is eliminated. Different goal objects and obstacle objects react differently. For instance, when a cell is exploded beside a balloon, the balloon will also explode. Frame two demonstrates the use of props. Tapping regions with connectivity above a certain threshold provides props as a bonus. They have special effects that can help players pass the level faster. Finally, in the last screenshot, all goals are completed and we pass the level.
Figure 6(b) further demonstrates the variety of levels. Specifically, the left-most frame depicts a special “boss level”, where the goal is to “defeat” the evil cat. The cat will randomly throw objects to the cells, adding additional randomness. The three other frames illustrate relatively hard levels, as revealed by their low connectivity, the abundance and complexity of the obstacles, and their special layouts.
C.2 Details of the level pass-rate prediction system
During a game design cycle, to achieve the desired game pass-rates, a game designer needs to hire many human testers to extensively test all the levels before its release, which generally takes a long time and is inaccurate. Therefore, it would greatly reduce the game design cycle if we can develop a testing system that is able to provide quick and accurate feedback about the user pass-rates. Figure 7 gives an overview of our deployed user pass-rate prediction system, where WU-UCT is used to mimic average user performance and provide features for predicting the human pass-rate. As we have shown in the main paper, it can achieve significant speedup without significant performance loss,
The system consists of two working phases, i.e., training and inference. Specifically, training and validation are done on 300 levels that have been released in a test version of the game. In the training phase, the system has access to both the levels and the players’ pass-rates, while only the levels are available in the inference phase, and the system needs to give quick and accurate feedback about the (predicted) user pass-rate. In both phases, the levels are first fed into an asynchronous advantage actor-critic (A3C) (mnih2016asynchronous) learner to obtain a base policy $\pi$. It is then used by the WU-UCT agent as a prior to select the action to expand as well as the default policy for simulation. We then use WU-UCT to perform multiple gameplays. The maximum depth and width (maximum number of child nodes for each node) of the search tree are 10 and 5, respectively. The number of simulations is set to 10 and 100 to get AI bots with different skill levels. Six features (three for each of the 10-simulation and 100-simulation agents) are extracted from the gameplay results. Specifically, the features are the AI’s pass-rate, the average number of used steps divided by the number of provided steps (the number at the top-left corner in the screenshots in Figure 6), and the median number of used steps divided by the number of provided steps. During training, the features, as well as the players’ pass-rates, are used to learn a linear regressor, while in the inference phase, the regression model is used to predict the user pass-rate.
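The three per-agent features can be computed from gameplay logs with a short function. The input layout (a list of `(passed, used_steps)` pairs) and the function name are assumptions for this sketch; the numbers are made up.

```python
from statistics import mean, median


def gameplay_features(results, provided_steps):
    """Return [pass-rate, mean step ratio, median step ratio].

    `results` is a list of (passed, used_steps) pairs, one per gameplay.
    """
    pass_rate = mean(1.0 if passed else 0.0 for passed, _ in results)
    ratios = [used / provided_steps for _, used in results]
    return [pass_rate, mean(ratios), median(ratios)]


# Hypothetical gameplays of one agent on a level with 20 provided steps.
weak_agent = [(True, 10), (False, 20), (True, 15), (True, 15)]
features = gameplay_features(weak_agent, provided_steps=20)
```

Concatenating the output for the 10-simulation and the 100-simulation agents gives the six-dimensional input of the linear regressor.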
C.3 Additional experimental results
In this section, we list additional experimental results. In Table 3, we report the speedup for different numbers of expansion workers and simulation workers.
Table 3: Speedup on Level 35 and Level 58 for different numbers of expansion workers and simulation workers.
Appendix D Experiment details of the Atari games
This section provides the implementation details of the experiments on Atari games. Specifically, we first describe the training pipeline of the default policy. We then illustrate how the default policy is connected with the MCTS algorithm to perform simulation.
Table 4: Performance of the original PPO policy and the distilled policy in each environment.
Training default policy for MCTS
To allow better overall performance, we used Proximal Policy Optimization (PPO) (schulman2017proximal), one of the state-of-the-art on-policy model-free reinforcement learning (RL) algorithms. We adopted the highest-starred third-party implementation of PPO on GitHub. The implementation uses the same hyper-parameters as the original paper. The architecture of the policy network is shown in Figure 9. The original PPO network is trained on 10 million frames for each task. To reduce the computation cost, we reduce the network size using network distillation (hinton2015distilling). Specifically, it is a teacher-student training framework where the student (distilled) network mimics the output of the teacher network. Samples are collected by the PPO network with an ε-greedy strategy. The student network optimizes its parameters to minimize the mean squared error of the policy’s logits as well as the value. The performance of the original PPO policy network and the distilled network is provided in Table 4.
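The distillation objective can be sketched per sample as below. This is only an illustration under stated assumptions: the function name is hypothetical, the inputs are plain Python lists and floats rather than network tensors, and the equal weighting of the logit term and the value term is an assumption, since the text does not specify how the two terms are combined.

```python
def distillation_loss(student_logits, student_value, teacher_logits, teacher_value):
    """Per-sample loss: MSE over policy logits plus squared error on the value.

    Equal weighting of the two terms is an assumption of this sketch.
    """
    if len(student_logits) != len(teacher_logits):
        raise ValueError("student and teacher must have the same number of logits")
    logit_mse = sum((s - t) ** 2
                    for s, t in zip(student_logits, teacher_logits)) / len(teacher_logits)
    value_error = (student_value - teacher_value) ** 2
    return logit_mse + value_error
```

In practice this quantity would be averaged over a batch of states collected by the teacher and minimized with respect to the student's parameters.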
Both the policy output and the value output of the distilled network are used in the simulation phase. In particular, if a simulation is started from state $s$, a rollout is performed using the policy network with an upper bound of 100 steps and reaches the leaf state $s'$. If the environment does not terminate, the full return is computed as the intermediate rewards plus the value function at state $s'$. Formally, the cumulative reward provided by the simulation is $R = \sum_{i=0}^{T-1} \gamma^{i} r_{i} + \gamma^{T} V(s')$, where $T \le 100$ is the number of rollout steps, $r_i$ is the reward at step $i$, $\gamma$ is the discount factor, and $V(s)$ denotes the value of state $s$. To reduce the variance of Monte Carlo sampling, we average it with the value function at state $s$. The final simulation return is then $R_{\text{simu}} = 0.5\,R + 0.5\,V(s)$.
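The simulation return described above can be sketched in a few lines. The function and argument names are hypothetical; the equal-weight average follows the word "average" in the text, and the default discount of 0.99 is taken from the experiment settings.

```python
def simulation_return(rewards, leaf_value, start_value, terminated, gamma=0.99):
    """Discounted rollout return, bootstrapped and averaged with the start-state value.

    rewards: intermediate rewards collected during the rollout (at most 100 steps).
    leaf_value: value estimate of the state where the rollout stopped.
    start_value: value estimate of the state where the simulation started.
    terminated: True if the environment ended during the rollout.
    """
    ret = 0.0
    for i, r in enumerate(rewards):
        ret += (gamma ** i) * r
    if not terminated:
        # Bootstrap with the value function only when the episode is still running.
        ret += (gamma ** len(rewards)) * leaf_value
    # Average with the start-state value to reduce Monte Carlo variance.
    return 0.5 * ret + 0.5 * start_value
```

For example, with rewards `[1.0, 1.0]`, a leaf value of 2.0, a start value of 3.0 and `gamma=0.5`, the rollout part is 1 + 0.5 + 0.25 × 2 = 2.0, so the averaged return is 2.5.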
Hyperparameters and experiment details for WU-UCT
For all tree-search-based algorithms (i.e., WU-UCT, TreeP, LeafP, and RootP), the maximum depth of the search tree is set to 100. The search width is limited to 20 and the maximum number of simulations is 128. The discount factor is set to 0.99 (note that the reported score is not discounted). When performing gameplays, a tree search subroutine is called to plan for the best action at each time step. The subroutine iteratively constructs a search tree, starting from a tree that contains only a root node. Experiments are run on 4 Intel Xeon E5-2650 v4 CPUs and 8 NVIDIA GeForce RTX 2080 Ti GPUs. To minimize the speed fluctuation caused by different workloads on the machine, we ensure that the total number of simulation workers is smaller than the total number of CPU cores, allowing each process to fully occupy a single core. WU-UCT is implemented with multiple processes, with an inter-process pipe between the master process and each worker process.
Hyperparameters and experiments for baseline algorithms
Being unable to find appropriate third-party packages for the baseline algorithms (i.e., tree parallelization, leaf parallelization, and root parallelization), we built our own implementations of them based on the corresponding papers. Building all algorithms in the same package additionally allows us to conduct accurate speed tests, as it eliminates other factors (e.g., different programming languages) that may bias the result. Specifically, leaf parallelization is implemented with a master-worker structure: when the main process enters the simulation step, it assigns the task to all workers. When the returns from all workers are available, the master process performs backpropagation according to these statistics and begins a new rollout.
As suggested by browne2012survey, tree parallelization is implemented using a decentralized structure, i.e., each worker performs rollouts on a shared search tree. At the selection step, a fixed virtual loss is applied to each traversed node to guarantee diversity of the tree search. When performing backpropagation, the virtual loss is added back to the traversed nodes. The virtual loss is chosen from 1.0 and 5.0 for each particular task. In other words, we ran TreeP with a virtual loss of 1.0 and of 5.0 for each task, and report the better result.
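The virtual-loss bookkeeping can be sketched as below. This is a single-threaded illustration under assumptions: the `Node` class and function names are hypothetical, the virtual loss is subtracted from the accumulated value (the text does not say whether it is applied to the sum or to the mean), and the locking that a shared tree would need is left out.

```python
class Node:
    """Minimal node statistics for this sketch (hypothetical structure)."""

    def __init__(self):
        self.visits = 0
        self.total_value = 0.0


def apply_virtual_loss(path, r_vl):
    """Selection step: make every traversed node look worse to other workers."""
    for node in path:
        node.total_value -= r_vl


def backpropagate(path, simulation_return, r_vl):
    """Backpropagation step: add the virtual loss back, then record the real return."""
    for node in path:
        node.total_value += r_vl
        node.total_value += simulation_return
        node.visits += 1
```

Between the two calls, other workers that evaluate the same nodes see a temporarily reduced value and are therefore steered toward different branches.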
Root parallelization is implemented according to chaslot2008parallel. Similar to leaf parallelization, root parallelization consists of sub-processes that do not share information with each other. At the beginning of the tree search process, each sub-process is assigned several actions of the root node to query. The sub-processes then perform sequential UCT rollouts until they reach a pre-defined maximum number of rollouts. When all sub-processes have completed their jobs, their statistics are gathered by the main process and used to choose the best action.
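The assignment and gathering steps of root parallelization can be sketched as follows. All names are hypothetical, the round-robin split of root actions is an assumption, and picking the action with the highest mean return is one possible reading of "best action" (the text does not state the criterion). The UCT search that each sub-process runs is omitted.

```python
def partition_actions(actions, num_workers):
    """Assign root actions to sub-processes in round-robin order (assumed scheme)."""
    return [actions[i::num_workers] for i in range(num_workers)]


def choose_best_action(worker_stats):
    """Merge per-worker root statistics and pick the action with the highest mean return.

    worker_stats: list of dicts mapping action -> (visit_count, total_return),
                  one dict per sub-process, covering disjoint sets of actions.
    """
    merged = {}
    for stats in worker_stats:
        merged.update(stats)

    def mean_return(action):
        visits, total = merged[action]
        return total / visits if visits > 0 else float("-inf")

    return max(merged, key=mean_return)
```

Because the sub-processes never exchange information during the search, the only synchronization point is this final gathering step.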
Appendix E Additional experiments on the Atari games
This section provides additional experimental results that compare WU-UCT with another variant of the Tree Parallelization (TreeP) algorithm. As suggested by silver2016mastering2, besides pre-adjusting the value with a virtual loss, a pre-adjusted visit count can also be used to penalize the node value. In this variant of TreeP, both the virtual loss and a hand-crafted count correction (termed the virtual pseudo-count) are added to adjust the node value. Specifically, the value of each node is adjusted according to Eq. 7,
which is used in the UCT selection phase. Table 5 compares WU-UCT with this TreeP variant using both virtual loss and virtual pseudo-count (i.e., Eq. 7). Three sets of hyper-parameters are used in TreeP, which are described in the caption of the table. All other experiment setups are the same as in Section 5.2 and Appendix D. Table 5 indicates that on 9 out of 12 tasks, WU-UCT outperforms this new baseline (with its best hyper-parameters). Furthermore, we also observe that TreeP does not have a single set of hyper-parameters that performs uniformly well on all tasks. In other words, to perform well, TreeP requires per-task hyper-parameter tuning. On the other hand, WU-UCT performs consistently well across different tasks.
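As a sketch of the kind of adjustment described above, one form consistent with the description is given below. The notation is an assumption and is not fixed by the surrounding text: $N_s$ and $V_s$ denote the visit count and average value of node $s$, $r_{VL}$ the virtual loss, and $n_{VL}$ the virtual pseudo-count.

```latex
\tilde{V}_s \;=\; \frac{N_s V_s - r_{VL}}{N_s + n_{VL}}
```

Under this form, setting $r_{VL} = n_{VL} = 0$ recovers the unadjusted value $V_s$, while positive values lower the estimate and inflate the effective visit count of a node that another worker is currently exploring.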
Conceptually, WU-UCT is designed based on the fact that on-going simulations (unobserved samples) will eventually return the results, so their number should be tracked and used to adaptively adjust the UCT selection process. On the other hand, TreeP uses artificially designed virtual loss and virtual pseudo-count to discourage other threads from simultaneously exploring the same node. Therefore, WU-UCT achieves a better exploration-exploitation tradeoff in parallelization, which leads to better performance as confirmed by the experimental results given in Table 5.
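To make the contrast concrete, the following sketch shows how counts of unobserved samples could enter a UCT-style score. It is an illustration under assumptions rather than the reference implementation: the function name, the exploration constant `beta`, and the exact form of the bonus term are assumptions modelled on the standard UCT rule, with observed and unobserved counts summed.

```python
import math


def wu_uct_score(child_value, child_visits, child_unobserved,
                 parent_visits, parent_unobserved, beta=1.0):
    """UCT-style score in which ongoing simulations count toward visitation.

    child_visits / parent_visits: completed (observed) simulations.
    child_unobserved / parent_unobserved: simulations started but not yet returned.
    """
    child_total = child_visits + child_unobserved
    if child_total == 0:
        # Nodes never queried, even by a pending simulation, are tried first.
        return math.inf
    parent_total = max(parent_visits + parent_unobserved, 1)
    bonus = beta * math.sqrt(2.0 * math.log(parent_total) / child_total)
    return child_value + bonus
```

With this scoring, a child that already has several pending simulations receives a smaller exploration bonus, so additional workers are directed elsewhere without altering the value estimate itself.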
- Code is available at https://github.com/liuanji/WU-UCT.
- In the context of MCTS, the action space is assumed to be finite and the transition is assumed to be deterministic, i.e., the next state is determined by the current state and action.
- We assume certain regularity conditions hold so that the cumulative reward is always bounded (sutton2018reinforcement).
- We refer the readers to chaslot2008parallel for more details. The pseudo-code of the three algorithms is given in Appendix B. LeafP: Algorithm 4, TreeP: Algorithm 5, RootP: Algorithm 6.
- We refer to it as the tap game below. See Appendix C.1 for more details about the game rules.
- Level-35 is relatively simple, requiring 18 steps for an average player to pass, while Level-58 is relatively difficult and needs more than 50 steps to solve.
- In our setup, the game state will only be used for 1 time to start simulation and times to initialize expansion.
- Due to the complexity of the tap game, model-free RL algorithms such as A3C (mnih2016asynchronous) and PPO (schulman2017proximal) fail to achieve satisfactory performance and thus cannot provide an accurate prediction. On the other hand, MCTS can achieve good performance but takes a long time in testing.
- The task “Tennis” is not included in the calculation of the average percentile improvement because the average episode return of RootP on this task is 0.