Mean-based Heuristic Search for Real-Time Planning
Abstract
In this paper, we introduce MHSP, a new heuristic search algorithm based on mean values for real-time planning. It combines the principles of UCT, a bandit-based algorithm that gave very good results in computer games, and especially in Computer Go, with heuristic search in order to obtain a real-time planner in the context of classical planning. MHSP is evaluated on different planning problems and compared to existing algorithms performing online search and learning. Our results highlight the capacity of MHSP to return plans in a real-time manner that tend toward an optimal plan over time, faster and with better quality than existing algorithms in the literature.
1 Introduction
The starting point of this work is to apply UCT [KS06], an efficient algorithm well known in the machine learning and computer games communities, and originally designed for planning, to classical planning problems. UCT is designed for MDPs and based on bandit decision making [ACBF02]. In the context of the current paper, which stresses time constraints, the interesting feature of UCT is its anytime property in a strong sense. At any time, UCT is able to return the first action of a plan, a partial plan, a solution plan, or an optimal plan, depending on the time given. However, [KS06] has not yet led to known successful applications in the classical planning domain [GNT04]. Instead, UCT gave tremendous results in computer games, and specifically in Computer Go with the Go playing program Mogo [GWMT06]. In Computer Go, UCT is efficient because the complexity of Go is high, and because Go games are played in real time, which fits the anytime property of UCT. Therefore, this paper focuses on how to make UCT-like algorithms valuable in a subfield of planning dealing with time constraints.
The state of the art of planning is huge [RH09], and we roughly divide it into two categories, offline and online planning. In the context of offline planning, the planners build solution plans, and then execute the actions of the plan. An important aspect is the existence of very good heuristic functions that drive the search efficiently toward the goal. The heuristic functions are built with the help of a planning graph [BF97]. The state-of-the-art planners use variants of depth-first search, e.g., Enforced Hill Climbing [HN01], and may find a solution plan very quickly from the initial state to the goal state. However, although they find solution plans very quickly on sufficiently easy problems, these planners are not real-time planners, and they may fail if they do not have enough time to find a solution plan.
Conversely, in the context of online planning, the planners make their decision in constant time, and then execute the corresponding action, or sequence of actions, in the world. The literature distinguishes two approaches, one based on MDPs applied to nondeterministic problems, e.g., [BBS95, HZ01], and the other based on Real-Time Search (RTS), e.g., [Kor90]. While the first approach was recently applied to planning [FFB07], the second approach, RTS, has been strongly linked, since Korf's pioneering work on puzzles, to the development of video games in which the agents need good pathfinding algorithms running in real time. This last approach has not been broadened to classical planning problems. Classically, there are several real-time searches, e.g., minimin [Kor88], Trap [Bul04], LRTS [BL06] or even A* [HNR68]. These algorithms perform action selection by using heuristic search. Since the action selection time is limited, these algorithms explore a small part of the state space around the current state. The plans executed by the agent have no reason to be optimal. The RTS literature is concerned with the convergence of these plans toward an optimal solution when the task is repeated iteratively, which ties the RTS literature mainly to learning. Considering the learning aspect as its main objective, and focusing on convergence proofs, the literature then reduced the action selection stage to a depth-one greedy search.
The aim of this paper is to focus on the action selection stage of RTS and to present MHSP, a heuristic search algorithm based on mean values, i.e., an adaptation of UCT to RTS in the context of classical planning. We show that MHSP performs better for action selection in the RTS context, with or without learning, than the existing real-time action selectors.
The outline of the paper is the following. Section 2 describes the domain of Real-Time Search. Section 3 describes UCT principles. Section 4 presents our new algorithm MHSP, an adaptation of UCT for RTS. Section 5 shows the experiments performed to demonstrate the relevance of our approach. Section 6 concludes and discusses future lines of research.
2 Real-Time Search
RTS is considered in the context of agent-centered search. While classical offline search methods first find a solution plan, and then make the agent execute the plan, agent-centered search interleaves planning and execution. The agent repeatedly executes a task, called an episode or a trial. At each step, the agent is situated in a current state, and performs a search within the space around the current state in order to select its best action. The states encountered during the action selection stage are called the explored states in the following. The defining feature of RTS is that this search is performed in constant time. When the action is selected, the agent executes the action, and reaches a new state. When the agent reaches the goal state, another trial or episode is launched, and so on. RTS can be considered with or without learning. Without learning, the efficiency of the agent is based on the ability of the search to select an action. With learning, the agent updates the heuristic value of the encountered states when some conditions are met, and the efficiency of the agent increases over repetitions. In the following, the states encountered by the agent are called the visited states.
The fundamental paper of RTS is Real-Time Heuristic Search [Kor90]. Real-Time A* (RTA*) is an algorithm that computes a tree like A* does, but in constant time. When the time has elapsed, RTA* provides the first move of its current best plan, and executes it to reach a new node. From this new node, RTA* computes a tree again, executes the first move of the new current best plan, and so on until a goal node is reached. RTA* was designed in the spirit of two-player game programs that must play their moves in limited time. Like A*, RTA* uses a heuristic function. RTA* always finds a solution plan, even if not an optimal one. The learning version of RTA* is called LRTA*. When the heuristic value of a node is too low compared to the minimum value of its neighboring nodes, LRTA* updates the heuristic value of this node with the minimum value of its neighboring nodes. In this way the heuristic function is modified, i.e., learnt. When launched several times on the same problem, LRTA* is proved to converge to the optimal plan.
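The LRTA* update described above can be illustrated with a minimal Python sketch. It is not taken from [Kor90]: the function name `lrta_star_step`, the dictionary-based heuristic table, the toy chain graph, and the unit edge cost added to the neighbor value are all assumptions made for illustration.

```python
def lrta_star_step(state, h, neighbors, cost=lambda s, t: 1):
    """One LRTA*-style step: raise h[state] if needed, then move greedily.

    h         -- dict mapping state -> current heuristic estimate (mutated)
    neighbors -- function returning the successor states of a state
    cost      -- edge cost function (unit cost is an assumption here)
    """
    # Best neighbor as seen from `state`: edge cost plus neighbor estimate.
    best = min(neighbors(state), key=lambda t: cost(state, t) + h[t])
    backed_up = cost(state, best) + h[best]
    # Update only when the current estimate is too low w.r.t. the neighbors.
    if h[state] < backed_up:
        h[state] = backed_up
    return best


# Toy chain a - b - c - goal with a deliberately uninformed heuristic.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "goal"], "goal": []}
h = {"a": 0, "b": 0, "c": 0, "goal": 0}

state, steps = "a", 0
while state != "goal":
    state = lrta_star_step(state, h, graph.__getitem__)
    steps += 1
```

After this single trial, the estimates of the visited states have been raised, which is the information later trials would reuse.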
SLA* (Search and Learn A*) [SZ93] is presented as an enhancement of LRTA*. SLA* includes a backtracking mechanism when an update happens. As LRTA* does, SLA* updates the heuristic value of a node. Additionally, when an update occurs, SLA* iteratively updates the neighboring nodes of the updated node as well. Actually, in one trial, SLA* learns the heuristic function, and finds the optimal solution. However, since the first trial can be very long, SLA* cannot be used in practice.
In [FK00], Furcy presents FALCONS, a learning algorithm that converges quickly under some assumptions. Its main feature is to compute two heuristic functions, one for each direction: from start to goal, and from goal to start.
In [SI03], Shimbo shows how weighted A* and upper bound search are worth considering in real-time search. In weighted A*, the heuristic function is multiplied by a weight. The greater the weight, the more important the heuristic function. The risk is that the weighted heuristic function becomes non-admissible. Nevertheless, even with non-admissible functions, the heuristic search finds good plans, although not optimal ones. Upper bound search limits the search to paths whose length is below the upper bound. Weighted A* and upper bound search are suboptimal.
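As an illustration of the weighting idea, here is a minimal Python sketch of a best-first search ordered by f = g + w * h. The names `weighted_a_star`, `grid_neighbors` and `w`, the unit edge costs, and the empty 5x5 grid are assumptions for this example, not the formulation of [SI03].

```python
import heapq


def weighted_a_star(start, goal, neighbors, h, w=1.0):
    """Best-first search ordered by f = g + w * h (unit edge costs assumed).

    Returns (path, number_of_expanded_nodes); path is None if no path exists.
    """
    frontier = [(w * h(start), 0, start, [start])]
    best_g = {start: 0}
    expanded = 0
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path, expanded
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry, a cheaper path was found later
        expanded += 1
        for nxt in neighbors(state):
            ng = g + 1
            if ng < best_g.get(nxt, float("inf")):
                best_g[nxt] = ng
                heapq.heappush(frontier, (ng + w * h(nxt), ng, nxt, path + [nxt]))
    return None, expanded


def grid_neighbors(size):
    """4-connected moves on an empty size x size grid."""
    def neighbors(cell):
        x, y = cell
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nx, ny = x + dx, y + dy
            if 0 <= nx < size and 0 <= ny < size:
                yield (nx, ny)
    return neighbors


goal = (4, 4)
manhattan = lambda c: abs(goal[0] - c[0]) + abs(goal[1] - c[1])
p1, e1 = weighted_a_star((0, 0), goal, grid_neighbors(5), manhattan, w=1.0)
p3, e3 = weighted_a_star((0, 0), goal, grid_neighbors(5), manhattan, w=3.0)
```

With `w=1.0` the search is plain A*; with a larger weight the search leans more on the heuristic and, on this toy grid, expands no more nodes than plain A*. On harder problems the weighted variant may return longer plans, which is the suboptimality mentioned above.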
Trap [Bul04] includes some lookahead to smooth the bad effects of the heuristic function. When compared to LRTA* and FALCONS, because it selects a sequence of actions instead of only the first action, Trap yields 5- to 30-fold improvements in convergence speed. Trap uses an optimality weight associated with the cost function. This weight has the same purpose as the weight in weighted A*: balancing the relative importance of the cost function and the heuristic function.
LRTS [BL06] is a unifying framework for learning in real-time search. It includes LRTA*, SLA* and Trap. LRTS features are: learning in real time (inherited from LRTA*), lookahead (inherited from Trap, and Korf's work), backtracking (inherited from SLA*), and weighted search (inherited from Trap and weighted A*).
Finally, Bulitko [BLS08] describes dynamic control in real-time heuristic search.
To sum up, the limitation of RTS is to focus on the learning part and less on the search for action selection. [Kor90] and following works prove the convergence of their learning real-time algorithms given that the action selection is a depth-one greedy search. However, Korf's work shows the importance of an efficient search for action selection in other papers. The exceptions to this limitation are Trap and LRTS, in which lookahead is used for action selection. The Trap lookahead is a kind of breadth-first search. Consequently, it is of interest to see whether an adaptation of UCT can be used efficiently for action selection in RTS. This adaptation can then be used with or without learning.
3 UCT
UCT worked well in Go playing programs, and it was used in many versions, leading to the Monte-Carlo Tree Search (MCTS) framework [CWv08]. While time remains, an MCTS algorithm iteratively grows a tree in the computer memory by following the steps below:

1. Starting from the root, browse the tree until reaching a leaf by using the selection rule (1).
2. Expand the leaf with its child nodes.
3. Choose one child node.
4. Perform a random simulation starting from this child node until the end of the game, and get the return, i.e., the game's outcome.
5. Update the mean value of the browsed nodes with this return.
With infinite time, the root value converges to the minimax value of the game tree [KS06]. The UCB selection rule (1) answers the requirement of being optimistic when a decision must be made facing uncertainty [ACBF02].
$$i_{\text{select}} = \arg\max_{i \in I} \left( m_i + C \sqrt{\frac{\log p}{s_i}} \right) \tag{1}$$
Here $i_{\text{select}}$ is the selected node, $I$ is the set of children, $m_i$ is the mean value of node $i$, $s_i$ is the number of iterations going through $i$, $p$ is the number of iterations going through the parent of $i$, and $C$ is a constant value set up experimentally. Equation (1) uses the sum of two terms: the mean value $m_i$, and the UCB bias value $C \sqrt{\log p / s_i}$, which guarantees exploration. Following the UCB selection rule guarantees the completeness and the correctness of the algorithm.
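The selection rule (1) can be sketched in a few lines of Python. The dictionary layout of a child, the function name `ucb_select`, and the convention of trying unvisited children first are assumptions for illustration; the formula is the standard UCB form of a mean value plus an exploration bias.

```python
import math


def ucb_select(children, parent_visits, c=1.0):
    """Return the child maximizing mean + c * sqrt(log(parent_visits) / visits)."""
    def score(child):
        if child["visits"] == 0:
            return float("inf")  # assumption: unvisited children are tried first
        bias = c * math.sqrt(math.log(parent_visits) / child["visits"])
        return child["mean"] + bias
    return max(children, key=score)


children = [
    {"name": "a", "mean": 0.60, "visits": 50},
    {"name": "b", "mean": 0.55, "visits": 5},
]
greedy = ucb_select(children, parent_visits=55, c=0.0)["name"]
explore = ucb_select(children, parent_visits=55, c=1.0)["name"]
```

With `c=0.0` the rule degenerates to picking the best mean; with `c=1.0` the rarely visited child receives a larger bias and is selected instead.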
4 MHSP
This section defines our algorithm MHSP (Mean-based Heuristic Search for real-time Planning). We made two important choices in designing MHSP, which we present first; we then give the pseudo-code of MHSP.
4.1 Heuristic values replace simulation
On planning problems, random simulations are not appropriate. Randomly browsing the state space does not enable the algorithm to reach goal states sufficiently often. Many runs complete without reaching goal states. Therefore, replacing the simulations by a call to the heuristic becomes mandatory.
In Computer Go, the random simulations were adequate mainly because they always completed after a limited number of moves, and the return values (won or lost) were roughly equally distributed on most positions of a game. Furthermore, the two return values correspond to actual values of a completed game. In planning, one return means that a solution has been found (episode completed), and the other that it has not. This difference in simulations is a fundamental difference between the planning problem and the two-player game playing problem.
In planning, the heuristic values bring appropriate knowledge into the returns. Consequently, using heuristic values in MHSP should be beneficial, provided that the heuristic value generator is good, which is the case in planning [BF97]. In Computer Go, replacing the simulations by evaluation function calls is ruled out by fifty years of Computer Go history, which followed the converse path: the complex evaluation functions have been replaced by pseudo-random simulations.
To sum up, in MHSP, we replace stage (4) of MCTS above by a call to a heuristic function.
4.2 Optimistic initial mean values
Computer games practice shows that the UCB bias of (1) can simply be removed, provided the mean values of nodes are initialized with sufficiently optimistic values. This simplification removes the problem of tuning the UCB constant. Generally, to estimate a given node, the planning heuristics give a path length estimation. Convergence to the best plan is provided by admissible heuristics, i.e., optimistic heuristics. Consequently, the value returned by planning heuristics on a node can be used to initialize the mean value of this node.
In MHSP, the returns are negative or zero: a return is the opposite of the distance from a state to the goal. Thus, we initialize the mean value of a node with minus the estimated distance from this node to the goal. With this initialization policy, the best node according to the heuristic value will be explored first. Its value will be lowered after some iterations, however good it is, and then the other nodes will be explored in the order given by the heuristic.
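The effect of this initialization can be seen on a two-node toy example in Python. The class `Node`, the helper `best_child`, and the numeric returns are hypothetical and chosen only to show the mechanism; they are not the data structures of MHSP.

```python
class Node:
    """A node whose mean value starts at minus its heuristic estimate."""

    def __init__(self, name, h):
        self.name = name
        self.total = -float(h)  # cumulative return, seeded with -h
        self.visits = 1

    @property
    def mean(self):
        return self.total / self.visits

    def update(self, ret):
        self.total += ret
        self.visits += 1


def best_child(children):
    # No exploration bias: the choice relies on the optimistic initial means.
    return max(children, key=lambda n: n.mean)


left, right = Node("left", h=3), Node("right", h=4)
first_choice = best_child([left, right]).name

# Suppose the returns observed beneath `left` are worse than its estimate.
left.update(-5)
left.update(-5)
second_choice = best_child([left, right]).name
```

The node with the best heuristic value is tried first; once its mean has dropped below the initial value of its sibling, the sibling is tried in turn.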
4.3 The algorithm
The MHSP algorithm is shown in Algo. 1. Its inputs are the set of operators, the initial state, and the goal. For each state, it maintains the set of children of the state, the cumulative return of the state, the number of visits of the state, and the parent of the state.
The outer loop (line 2) ensures the real-time property. The first inner loop (line 4) corresponds to stage (1) in MCTS. The default reward is pessimistic and is tied to the current pessimism threshold. The first two conditionals test whether the inner loop has ended with the goal achieved (line 7) or with a leaf (line 9). If the goal is not reached, the leaf is expanded, which is stage (2) in MCTS. The block that follows the expansion corresponds to stage (3). Stage (4) is performed by writing the heuristic value into the return. The second inner loop (line 22) corresponds to stage (5).
The solution plan reconstruction function browses the tree by selecting the child node with the best mean, which produces the solution plan. The best plan reconstruction function browses the tree by selecting the child node with the highest number of visits. The best plan reconstruction happens when the time is over before a solution plan has been found. In this case, it is important to reconstruct a robust plan, which is not necessarily the best one in terms of mean value. With the child with the best mean, a plan with newly created nodes could be selected, and the plan would not be robust. Conversely, selecting the child with the highest number of visits ensures that the plan has been tried many times, and should be robust to this extent.
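To make the five stages concrete, here is a minimal Python sketch written from the prose description above. It is not Algo. 1: the names `Node`, `plan_to` and `mhsp_sketch`, the iteration budget standing in for the time limit, the unit action cost used in the backup, the dead-end return, and the absence of duplicate-state detection are all assumptions.

```python
class Node:
    def __init__(self, state, h_value, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action          # action leading from parent to this node
        self.children = []
        self.total = -float(h_value)  # cumulative return, seeded with -h
        self.visits = 1

    def mean(self):
        return self.total / self.visits


def plan_to(node):
    """Actions on the path from the root to `node`."""
    actions = []
    while node.parent is not None:
        actions.append(node.action)
        node = node.parent
    return actions[::-1]


def mhsp_sketch(start, is_goal, successors, h, iterations, dead_end_return=-1e6):
    """Return (plan, solved). `iterations` stands in for 'while time remains'."""
    root = Node(start, h(start))
    for _ in range(iterations):
        # Stage 1: descend by best mean until a leaf or a goal state.
        node = root
        while node.children and not is_goal(node.state):
            node = max(node.children, key=Node.mean)
        if is_goal(node.state):
            return plan_to(node), True  # solution plan
        # Stage 2: expand the leaf; children start at the optimistic value -h.
        node.children = [Node(s, h(s), node, a) for a, s in successors(node.state)]
        if node.children:
            # Stages 3 and 4: pick one child, use its heuristic value as return.
            child = max(node.children, key=Node.mean)
            ret = child.mean()
        else:
            ret = dead_end_return  # assumption: strongly pessimistic return
        # Stage 5: back the return up to the root, one unit-cost action per level.
        while node is not None:
            ret -= 1
            node.total += ret
            node.visits += 1
            node = node.parent
    # Budget exhausted: follow the most visited children for a robust partial plan.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return plan_to(node), False


# Toy domain: walk on the integers from 0 to GOAL with an exact heuristic.
GOAL = 5
successors = lambda s: [("+1", s + 1), ("-1", s - 1)]
is_goal = lambda s: s == GOAL
h = lambda s: abs(GOAL - s)

plan, solved = mhsp_sketch(0, is_goal, successors, h, iterations=100)
partial, done = mhsp_sketch(0, is_goal, successors, h, iterations=3)
```

With a generous budget the sketch returns a full solution plan; with a budget of three iterations it returns only a partial plan that starts in the right direction, which mirrors the two reconstruction modes described above.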
5 Experiments
The aim of the experiments described in this section is to show that MHSP is better than existing correct and complete algorithms at action selection in the context of RTS, used with or without learning, for different decision times, on a set of planning problems.
5.1 Planners and domains
We compare MHSP with two algorithms: A* and Breadth-First Search (BFS). We chose A* because it is the reference algorithm in planning (as in LRTA*). It is a best-first algorithm that aims at minimizing the classical heuristic function of A*. A* expands nodes in the open list with heuristic values that decrease with the running time, the value reaching zero with sufficient time. To this extent, A* can be stopped at any time. When A* is stopped, the path from the last expanded node to the root node gives the "best" action selected by A*. Besides, we chose BFS. BFS is a simple generalization of current action selectors, such as LRTS or Trap, used in RTS for pathfinding, to other planning domains. We did not choose depth-first algorithms, such as minimin, since they hardly fit the real-time constraint. Finally, our three algorithms are MHSP, A*, and BFS.
As mentioned in the introduction, RTS algorithms are mainly applied to pathfinding for video games. In order to extend the existing test domain, we selected other domains and problems from the International Planning Competition (IPC).
5.2 Settings
We designed three tests to underline the effectiveness of MHSP. Test 1 is global, and does not especially focus on action selection: it gives the average length of solution plans found by the three algorithms for different decision times and representative problems. It intends to show that MHSP is globally better than A* and BFS in terms of solution plan length. Test 1 does not include learning.
Test 2 repeats test 1 with learning by using the update rule (2) on nodes visited during the episodes. Rule (2) is not applied to nodes explored during action selection time. Test 2 intends to show the consequences of the three action selectors on the convergence of solution plans toward optimal plans when the episode number increases.
(2) 
Test 3 is the most important test to underline the ability of the algorithms to perform effective action selection, or partial plan selection, in real time. This test makes the decision time vary, and assesses the quality of the partial plans obtained. The quality of partial plans is estimated with two distances: the distance to the goal (or goal distance) and the distance to the optimum.
The distance to the goal of a partial plan is the length of the optimal plan linking the end state of this partial plan to the goal state. When the distance to the goal diminishes, the partial plan has been built in the appropriate direction. When the distance to the goal is zero, the partial plan is a solution plan.
The distance to the optimum of a partial plan is the length of the partial plan, plus the distance to the goal of the partial plan, minus the length of the optimal plan. When the distance to the optimum of a partial plan is zero, the partial plan is the beginning of an optimal plan. The distance to the optimum of a solution plan is the difference between its length and the optimal length. The distance to the optimum of the void plan is zero. The distance to the goal and the distance to the optimum of an optimal plan are zero. Conversely, when the distance to the goal and the distance to the optimum of a partial plan are zero, the partial plan is an optimal plan. For each problem, the results are shown with figures giving the distance to the goal and the distance to the optimum of the partial plan as a function of the running time. These distances are computed by calling an optimal planner.
Finally, all the tests were conducted on an Intel Core 2 Quad 6600 (2.4 GHz) with 4 GB of RAM. The implementation of MHSP used for the experiments is written in Java and based on the PDDL4J library.
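The two distances can be checked on a small worked example in Python. The toy domain (walking on the integers from 0 to a goal at 5, where the optimal plan length from a state is its absolute distance to the goal) and all function names are assumptions for illustration; in the experiments these quantities come from an optimal planner.

```python
GOAL = 5
START = 0


def optimal_length_from(state):
    """Optimal plan length from `state` to the goal in the toy domain."""
    return abs(GOAL - state)


def end_state(plan, start=START):
    """State reached after executing a plan made of '+1' and '-1' actions."""
    return start + sum(1 if a == "+1" else -1 for a in plan)


def distance_to_goal(plan):
    return optimal_length_from(end_state(plan))


def distance_to_optimum(plan):
    # Length of the partial plan + its goal distance - optimal plan length.
    return len(plan) + distance_to_goal(plan) - optimal_length_from(START)


good_prefix = ["+1", "+1"]             # beginning of an optimal plan
detour = ["-1", "+1", "+1"]            # one wasted move and its repair
optimal_plan = ["+1"] * 5
void_plan = []
```

Here `good_prefix` has a goal distance of 3 and a distance to the optimum of 0, while `detour` has a goal distance of 4 and a distance to the optimum of 2, reflecting the two extra actions it contains.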
5.3 Test 1
problem  algo.  decision time (ms)  avr. time  avr. length  opt. length  max length  min length  failure %
ferry05  A*  40  1.09  26.26  18  277  19  0 
ferry05  BFS  40  8.91  27.76  18  567  18  16 
ferry05  MHSP  40  0.59  18.02  18  19  18  0 
ferry10  A*  200  97.9  184.94  35  807  42  32 
ferry10  BFS  200  8.91  27.76  35  109  35  0 
ferry10  MHSP  200  0.59  18.02  35  36  35  0 
ferry15  A*  2000  157.85  31.95  51  88  58  55 
ferry15  BFS  2000  103.75  51.45  51  52  51  0 
ferry15  MHSP  2000  86.45  52.45  51  53  51  0 
ferry20  A*  4000  –  –  73  –  –  100 
ferry20  BFS  4000  –  –  73  –  –  100 
ferry20  MHSP  4000  260.49  74.87  73  78  73  0 
gripper05  A*  50  0.56  15.04  15  17  15  0 
gripper05  BFS  50  0.71  15  15  15  15  0 
gripper05  MHSP  50  0.49  15  15  15  15  0 
gripper10  A*  165  92.28  140.04  29  651  33  36 
gripper10  BFS  165  7.86  37  29  37  37  0 
gripper10  MHSP  165  5.08  29  29  29  29  0 
gripper15  A*  450  160.1  47.54  45  229  77  70 
gripper15  BFS  450  31.42  46.52  45  47  45  0 
gripper15  MHSP  450  38.53  54.88  45  55  53  0 
gripper20  A*  1100  –  –  59  –  –  100 
gripper20  BFS  1100  102.76  61  59  61  61  0 
gripper20  MHSP  1100  134.86  73.92  59  75  73  0 
satellite05  A*  300  3.49  15.08  15  18  15  0 
satellite05  BFS  300  20.28  63.72  15  522  19  0 
satellite05  MHSP  300  3.01  15  15  15  15  0 
satellite10  A*  2000  –  –  29  –  –  100 
satellite10  BFS  2000  –  –  29  –  –  100 
satellite10  MHSP  2000  67.912  31.0  29  31  31  0 
Table 1 is indexed by the following inputs: the domain (ferry, gripper or satellite), the problem number, the algorithm used for action selection, and the decision time. The outputs are: the average time spent for one episode, the average solution plan length, the optimal plan length, the maximal and minimal solution plan lengths found by the algorithm, and the percentage of failures.
The optimal plan length, computed offline, is used as a reference. On ferry05 with a decision time of 40 ms, MHSP executes solution plans that are almost optimal (18.02 against 18) while A* and BFS are far from optimal (26.26 and 27.76). The maximal plan length is very high for A* and BFS (277 and 567) and almost optimal for MHSP (19). The minimal plan length is the optimal one for BFS and MHSP. In order to limit the time of the experiments, there is a maximal episode number (50). Consequently, an algorithm that does not reach the goal during an episode is counted as having failed. Here, BFS has a 16% failure rate on the 50 episodes. Furthermore, MHSP is the fastest algorithm. At the beginning of an episode, all the algorithms use the total time to decide. However, when the episode nears its end, the action selection is easier than before, and some algorithms do not use all the available time, and are faster than others. MHSP is clearly the fastest because it finds the goal more easily than A* or BFS when the goal is not far.
On ferry10 with a decision time of 200 ms, MHSP executes solution plans that are almost optimal (35.66 against 35) while A* and BFS are again far from optimal (181.94 and 41.34). The maximal plan length is very high for A* and BFS (807 and 109) and almost optimal for MHSP (36). The minimal plan length is the optimal one for BFS and MHSP. Here, A* has a 32% failure rate.
On gripper05 with a decision time of 50 ms, MHSP, A* and BFS are almost optimal. On gripper10 with a decision time of 165 ms, MHSP executes optimal plans (29) while A* and BFS are again far from optimal (140.04 and 37). Here, A* has a 36% failure rate.
Finally, on satellite05 with a decision time of 300 ms, MHSP, A* and BFS are almost optimal, with a slight advantage for MHSP. On satellite10 with a decision time of 2000 ms, BFS and A* do not find any solution, unlike MHSP. The main reason for this result is the branching factor of the problem, which is greater than in the other studied problems (e.g., 66 for satellite10, 13 for gripper20 and 16 for ferry20). This high branching factor strongly penalizes the exhaustive search strategy of A* and BFS.
To sum up the first test, MHSP finds plans shorter than A* or BFS, and MHSP is faster than A* and BFS.
5.4 Test 2
problem  algo.  decision time (ms)  avr. time  avr. length  opt. length  max length  min length  failure %
ferry05  A*  40  0.68  19.22  18  31  18  0 
ferry05  BFS  40  12.03  133.18  18  532  18  14 
ferry05  MHSP  40  0.48  18.02  18  19  18  0 
ferry10  A*  200  9.36  48.12  35  67  37  0 
ferry10  BFS  200  17.22  78.5  35  892  35  0 
ferry10  MHSP  200  6.24  35  35  35  35  0 
ferry15  A*  2000  112.03  62.36  51  81  57  55 
ferry15  BFS  2000  104.31  51.75  51  53  51  0 
ferry15  MHSP  2000  87.39  52.8  51  53  51  0 
ferry20  A*  4000  247.32  121.1  73  146  107  0 
ferry20  BFS  4000  284.44  73.8  73  75  75  35 
ferry20  MHSP  4000  255.31  73.6  73  73  73  15 
gripper05  A*  50  0.45  15  15  15  15  0 
gripper05  BFS  50  35.72  34.06  15  615  15  68 
gripper05  MHSP  50  0.48  15  15  15  15  0 
gripper10  A*  165  76.48  35.04  29  43  29  0 
gripper10  BFS  165  9.96  32.22  29  37  29  0 
gripper10  MHSP  165  4.8  29  29  29  29  0 
gripper15  A*  450  56.32  54.32  45  57  49  14 
gripper15  BFS  450  36.92  54.8  45  318  45  4 
gripper15  MHSP  450  36.52  46.76  45  55  45  0 
gripper20  A*  1100  –  –  59  –  –  100 
gripper20  BFS  1100  132.44  74.12  59  75  73  2 
gripper20  MHSP  1100  99.47  59.06  59  63  73  0 
satellite05  A*  300  3.54  15.62  15  18  15  0 
satellite05  BFS  300  6.1  18.98  15  41  19  0 
satellite05  MHSP  300  3.17  15  15  15  15  0 
satellite10  A*  2000  –  –  29  –  –  100 
satellite10  BFS  2000  –  –  29  –  –  100 
satellite10  MHSP  2000  65.612  31.2  29  32  30  0 
Table 2 shows the performance of the three algorithms when learning is applied. Compared to the results of Table 1, we can observe that learning most often improves the quality of the best plans. Indeed, except for BFS in gripper20, the minimal plan length is always smaller with learning than without, if it was not already optimal in test 1.
Moreover, learning enabled BFS and A* to find solution plans in ferry20. Neither of them reached an optimal plan, but the best plan of BFS is two actions away from it. It also enabled MHSP to become optimal in satellite05.
In order to illustrate the behavior of the algorithms over time, figure 1 shows the evolution of plan length over episodes on three problems: ferry10, gripper10 and satellite05. As we can see, on these problems, MHSP reaches an optimal plan very quickly, and its plan length is almost constant and stable. Conversely, A* and BFS are not optimal in ferry10 and gripper10 and are very unstable. The BFS peaks correspond to the maximal plan length allowed.
However, in these figures, learning does not show up as a clearly decreasing plan length, as might be expected. There are two reasons. First, the update rule is applied to the visited nodes, and not to the explored nodes. Therefore, the action selection strategy does not really affect the plan length when the episode number increases. Second, the algorithms use heuristic values computed offline by planning graph techniques, which are almost optimal on problems with a sufficiently low complexity. Consequently, the update rule is not triggered very often.
5.5 Test 3
Test 3 assesses the quality of the partial plans available at the end of the action selection stage according to the time given to the decision. Figures 2b, 2c and 2d show the distance to the goal and the distance to the optimum for each algorithm according to decision time on problem gripper05. Figure 2a gives an overview of the distance to the goal of the three algorithms with decision time on a log scale.
In these results, we observe that A* needs a decision time of at least 2650 ms to always find the optimal plan, while BFS needs at least 1504 ms and MHSP only 349 ms. Whatever the given decision time, A* provides partial plans close to the optimum. BFS provides only partial plans that are beginnings of optimal plans. Finally, during its search, MHSP provides partial plans with a high distance to the optimum. This can be explained by the ends of the partial plans, which are unstable. Selecting shorter partial plans would result in a better distance to the optimum.
Table 3 sums up these results and adds results from ferry05 and satellite05. As in gripper, we can see that MHSP is the fastest algorithm to find optimal plans.
problem  algorithm  time (ms)
ferry05  A*  808
ferry05  BFS  –
ferry05  MHSP  288
gripper05  A*  1650
gripper05  BFS  1504
gripper05  MHSP  349
satellite05  A*  8180
satellite05  BFS  –
satellite05  MHSP  1710
6 Discussion and future works
In this paper, we presented and studied a new heuristic search algorithm based on mean values for real-time search (RTS), called MHSP, adapted to the context of classical planning. This algorithm combines heuristic search with the learning principles of the UCT algorithm, i.e., state values based on mean returns, and optimism in the face of uncertainty.
MHSP computes mean values for decisions, not direct values. This means that the value of an action depends on every node explored beneath that action, and not only on the best node found. This may have a strong impact on the way the system explores nodes, because the system may focus on actions leading to globally good nodes, rather than on the action leading to the node with the best heuristic value. In a time-constrained context, focusing on actions that lead to globally good nodes instead of just one node may limit the effect of strongly erroneous heuristic values, and enables a more subtle exploration of the tree. The more complex the problem, the more visible this effect should be.
Three tests were designed in order to compare MHSP, A* and BFS in RTS. The first one gave an overview of the global effectiveness of each algorithm at finding good plans in different problems from the ferry, gripper and satellite domains. It showed that MHSP is globally better than A* and BFS in terms of solution plan length. The second test was intended to observe the convergence of performance when learning is applied in the three algorithms, i.e., when the heuristic values of the visited nodes can be updated according to exploration. The results first showed that learning improves the quality of the best plans obtained with the three algorithms. They moreover showed that MHSP tends to converge very quickly toward an optimal plan, while A* and BFS may stay suboptimal and unstable. Finally, the third test was designed in order to evaluate the ability of the algorithms to perform effective action selection, or partial plan selection, in real time. This test makes the decision time vary, and assesses the quality of the partial plans obtained through two distances: the distance to the goal and the distance to the optimum. The results highlighted that, as decision time grows, MHSP provides optimal plans much faster than the two other algorithms.
In the future, we may study several specific aspects of the presented work:

First of all, we would like to study the possibility of executing sequences of actions instead of just one action, as algorithms such as Trap or LRTS do. Indeed, instead of taking a single action between lookahead search episodes, these algorithms apply several actions to amortize the planning cost. This speeds up the search for a solution plan when the heuristic function is informative.

Moreover, since MHSP uses mean values, we also want to apply MHSP to problems in non-deterministic environments and compare it to online MDP algorithms.

Experiments show that the first action chosen significantly impacts the quality of the solution plan found in terms of length. For instance, in blocksworld domains, choosing a bad block to move first requires many additional actions to repair this bad choice. Consequently, the idea is to allocate more time to the first reasoning step.

In our experiments, for each action selection stage, the tree is computed from scratch. Reusing the tree computed during the previous action selection stages is an interesting enhancement to our work. It would enable the real-time algorithms to tackle more difficult problems.

Our work is done in the context of real-time search and partial plan selection. Removing the real-time constraint, and testing MHSP on problems in which full solution plans are required, is an interesting research direction. However, preliminary tests show that, with almost exact heuristics, MHSP is hardly comparable to efficient and general planners using Enforced Hill Climbing to find full solution plans on a wide range of problems.

Finally, learning the heuristics useful in planning by using MHSP, or another real-time algorithm, and comparing them with the heuristics obtained with planning graphs is a promising perspective linking the two domains of learning and planning.
Footnotes
 For a description and formalization of these benchmark domains and problems, see the official page of IPC.
 http://sourceforge.net/projects/pdd4j/
References
[ACBF02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2–3):235–256, 2002.
[BBS95] A. Barto, S. Bradtke, and S. Singh. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 72(1–2):81–138, 1995.
[BF97] A. Blum and M. Furst. Fast Planning Through Planning Graph Analysis. Artificial Intelligence, 90:1636–1642, 1997.
[BL06] V. Bulitko and G. Lee. Learning in Real-Time Search: A Unifying Framework. Journal of Artificial Intelligence Research, 25:119–157, 2006.
[BLS08] V. Bulitko, M. Lustrek, J. Schaeffer, Y. Bjornsson, and S. Sigmundarson. Dynamic Control in Real-Time Heuristic Search. Journal of Artificial Intelligence Research, 32(1):419–452, 2008.
[Bul04] V. Bulitko. Learning for Adaptive Real-Time Search. Technical report, Department of Computer Science, University of Alberta, http://arxiv.org/abs/cs.AI/0407016, 2004.
[CWv08] G. Chaslot, M. Winands, H. van den Herik, J. Uiterwijk, and B. Bouzy. Progressive Strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation, 4(3):343–357, 2008.
[FFB07] P. Fabiani, V. Fuertes, G. Besnerais, R. Mampey, A. Piquereau, and F. Teichteil. The ReSSAC Autonomous Helicopter: Flying in a Non-Cooperative Uncertain World with embedded Vision and Decision Making. In A.H.S Forum, 2007.
[FK00] D. Furcy and S. Koenig. Speeding up the Convergence of Real-Time Search. In Proceedings of the National Conference on Artificial Intelligence, pages 891–897, 2000.
[GNT04] M. Ghallab, D. Nau, and P. Traverso. Automated Planning: Theory and Practice. Morgan Kaufmann Publishers, 2004.
[GWMT06] S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with Patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
[HN01] J. Hoffmann and B. Nebel. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research, 14(1):253–302, 2001.
[HNR68] P.E. Hart, N.J. Nilsson, and B. Raphael. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100–107, 1968.
[HZ01] E. Hansen and S. Zilberstein. LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops. Artificial Intelligence, 129(1–2):35–62, 2001.
[Kor88] R. Korf. Real-Time Heuristic Search: New Results. In Proceedings of the AAAI conference, pages 139–144, 1988.
[Kor90] R. Korf. Real-Time Heuristic Search. Artificial Intelligence, 42(2–3):189–211, 1990.
[KS06] L. Kocsis and C. Szepesvari. Bandit-based Monte-Carlo Planning. In Proc. ECML, pages 282–293, 2006.
[RH09] M. Roberts and A. Howe. Learning from Planner Performance. Artificial Intelligence, 173:536–561, 2009.
[SI03] M. Shimbo and T. Ishida. Controlling the Learning Process of Real-Time Heuristic Search. Artificial Intelligence, 146(1):1–41, 2003.
[SZ93] L. Shue and R. Zamani. An Admissible Heuristic Search Algorithm. In Methodologies for Intelligent Systems, number 689 in LNAI. Springer, 1993.