Mean-based Heuristic Search for Real-Time Planning


In this paper, we introduce MHSP, a new heuristic search algorithm based on mean values for real-time planning. It combines the principles of UCT, a bandit-based algorithm that gave very good results in computer games, especially in Computer Go, with heuristic search in order to obtain a real-time planner in the context of classical planning. MHSP is evaluated on different planning problems and compared to existing algorithms performing on-line search and learning. Our results highlight the capacity of MHSP to return plans in real time that tend toward an optimal plan as the available time increases, and to do so faster and with better plan quality than existing algorithms in the literature.

1 Introduction

The starting point of this work is to apply UCT [KS06], an efficient algorithm well known in the machine learning and computer games communities and originally designed for planning, to classical planning problems. UCT is designed for MDPs and is based on bandit decision making [ACBF02]. In the context of the current paper, which stresses time constraints, the interesting feature of UCT is its anytime property in a strong sense: at any time, UCT is able to return the first action of a plan, a partial plan, a solution plan, or an optimal plan, depending on the time given. However, UCT has no known successful application in the classical planning domain yet [GNT04]. Instead, UCT gave tremendous results in computer games, and specifically in Computer Go with the Go playing program Mogo [GWMT06]. In Computer Go, UCT is efficient because the complexity of Go is high, and because Go games are played in real time, which fits the anytime property of UCT. Therefore, this paper focuses on how to make UCT-like algorithms valuable in a sub-field of planning dealing with time constraints.

The planning literature is vast [RH09], and we roughly divide it into two categories: off-line and on-line planning. In the context of off-line planning, the planners build solution plans, and then execute the actions of the plan. An important aspect is the existence of very good heuristic functions that drive the search efficiently toward the goal. The heuristic functions are built with the help of a planning graph [BF97]. The state-of-the-art planners use variants of depth-first search, e.g., Enforced Hill Climbing [HN01], and may find a solution plan very quickly from the initial state to the goal state. However, although they find solution plans very quickly on sufficiently easy problems, these planners are not real-time planners, and they may fail if they do not have enough time to find a solution plan.

Conversely, in the context of on-line planning, the planners make their decision in constant time, and then execute the corresponding action, or sequence of actions, in the world. The literature distinguishes two approaches, one based on MDPs applied to non-deterministic problems, e.g., [BBS95, HZ01], and the other based on Real-Time Search (RTS), e.g., [Kor90]. While the first approach was recently applied to planning [FFB07], the second approach, RTS, has been strongly linked, since Korf's pioneering work on puzzles, to the development of video games in which the agents need good path-finding algorithms running in real time. This last approach has not been extended to classical planning problems. There are several real-time searches, e.g., mini-min [Kor88], γ-Trap [Bul04], LRTS [BL06] or even A* [HNR68]. These algorithms perform action selection by using heuristic search. Since the action selection time is limited, these algorithms explore a small part of the state space around the current state. The plans executed by the agent have no reason to be optimal. The RTS literature is concerned with the convergence of these plans toward an optimal solution when the task is repeated iteratively, which ties the RTS literature mainly to learning. Taking learning as its main objective, and focusing on convergence proofs, the literature reduced the action selection stage to a depth-one greedy search.

The aim of this paper is to focus on the action selection stage of RTS, and to present MHSP, a heuristic search algorithm based on mean values, i.e., an adaptation of UCT to RTS in the context of classical planning. We show that MHSP performs better for action selection in the RTS setting, with or without learning, than the existing real-time action selectors.

The outline of the paper is the following. Section 2 describes the domain of Real-Time Search. Section 3 describes UCT principles. Section 4 presents our new algorithm MHSP, an adaptation of UCT for RTS. Section 5 shows the experiments performed to prove the relevance of our approach. Section 6 concludes and discusses future lines of research.

2 Real-Time Search

RTS is considered in the context of agent-centered search. While classical off-line search methods first find a solution plan, and then make the agent execute the plan, agent-centered search interleaves planning and execution. The agent repeatedly executes a task, called an episode or a trial. At each step, the agent is situated in a current state, and performs a search within the space around the current state in order to select its best action. The states encountered during the action selection stage are called the explored states in the following. The distinguishing feature of RTS is that this search is performed in constant time. When the action is selected, the agent executes the action, and reaches a new state. When the agent reaches the goal state, another trial or episode is launched, and so on. RTS can be considered with or without learning. Without learning, the efficiency of the agent is based on the ability of the search to select an action. With learning, the agent updates the heuristic value of the encountered states when some conditions hold, and the efficiency of the agent increases over repetitions. In the following, the states encountered by the agent are called the visited states.

The fundamental paper of RTS is Real-Time Heuristic Search [Kor90]. Real-Time A* (RTA*) is an algorithm that computes a tree like A* does, but in constant time. When the time has elapsed, RTA* provides the first move of its current best plan, and executes it to reach a new node. From this new node, RTA* computes a tree again, executes the first move of the new current best plan, and so on until a goal node is reached. RTA* was designed in the spirit of two-player game programs that must play their moves in limited time. Like A*, RTA* uses a heuristic function. RTA* always finds a solution plan, even if not optimal. The learning version of RTA* is called LRTA*. When the heuristic value of a node is too low compared to the minimum value of its neighboring nodes, LRTA* updates the heuristic value of this node with the minimum value of its neighboring nodes. In this way the heuristic function is modified, and learnt. When launched several times on the same problem, LRTA* is proved to converge to the optimal plan.
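The LRTA* update described above can be illustrated with a minimal sketch. It assumes the usual formulation in which the edge cost is added to the value of each neighbor before taking the minimum; the names `lrta_star_step`, `neighbors` and `h` are hypothetical and do not come from [Kor90].

```python
def lrta_star_step(state, neighbors, h):
    """One LRTA*-style step: update h[state], then move greedily.

    neighbors(state) returns (next_state, cost) pairs (assumed interface);
    h maps each state to its current heuristic estimate.
    """
    best_next, best_value = None, float("inf")
    for nxt, cost in neighbors(state):
        value = cost + h[nxt]          # one-step lookahead value
        if value < best_value:
            best_next, best_value = nxt, value
    # Learning: raise h[state] when it is too low compared to its neighbors.
    if best_next is not None and h[state] < best_value:
        h[state] = best_value
    return best_next


# Tiny chain a - b - goal with unit costs and an uninformed heuristic.
graph = {"a": [("b", 1)], "b": [("a", 1), ("goal", 1)], "goal": []}
h = {"a": 0, "b": 0, "goal": 0}
first = lrta_star_step("a", lambda s: graph[s], h)
second = lrta_star_step(first, lambda s: graph[s], h)
```

After these two steps the agent has moved from `a` to `b` to `goal`, and the heuristic values of `a` and `b` have both been raised to 1.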

SLA* (Search and Learn A*) [SZ93] is presented as an enhancement of LRTA*. SLA* includes a backtracking mechanism when an update happens. As LRTA* does, SLA* updates the heuristic value of a node. Additionally, when an update occurs, SLA* iteratively updates the neighboring nodes of the updated node as well. Actually, in one trial, SLA* learns the heuristic function, and finds the optimal solution. However, since the first trial can be very long, SLA* cannot be used in practice.

In [FK00], Furcy presents FALCONS, a learning algorithm that converges quickly under some assumptions. Its main feature is to compute two heuristic functions, one for each direction: from start to goal, and from goal to start.

In [SI03], Shimbo shows how weighted A* and upper bound search are worth considering in real-time search. In weighted A*, the heuristic function is given a weight. The greater the weight, the more important the heuristic function. The risk is that the heuristic function becomes non-admissible. Nevertheless, even with non-admissible functions, the heuristic search finds good plans, although not optimal ones. Upper bound search limits the search to paths whose length is below the upper bound. Weighted A* and upper bound search are sub-optimal.
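As an illustration of the weighting idea, here is a minimal off-line sketch of weighted A*, assuming the evaluation function is the path cost plus the weight times the heuristic. The function name, the `weight` parameter and the toy graph are assumptions made for this example and are not taken from [SI03].

```python
import heapq


def weighted_a_star(start, goal, neighbors, h, weight=1.0):
    """A* with f = g + weight * h; a weight above 1 favors the heuristic."""
    frontier = [(weight * h(start), 0, start, [start])]
    best_g = {start: 0}
    while frontier:
        _, g, state, path = heapq.heappop(frontier)
        if state == goal:
            return path
        if g > best_g.get(state, float("inf")):
            continue  # stale queue entry
        for nxt, cost in neighbors(state):
            new_g = g + cost
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                f = new_g + weight * h(nxt)
                heapq.heappush(frontier, (f, new_g, nxt, path + [nxt]))
    return None


# Toy graph: the direct edge S -> G costs 3, the detour through A costs 2.
edges = {"S": [("A", 1), ("G", 3)], "A": [("G", 1)], "G": []}
estimate = {"S": 2, "A": 1, "G": 0}
plan = weighted_a_star("S", "G", lambda s: edges[s], lambda s: estimate[s])
```

With a weight of 1 and an admissible heuristic the search is plain A*; larger weights trade optimality guarantees for a more goal-directed search.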

γ-Trap [Bul04] includes some lookahead to smooth the bad effects of the heuristic function. When compared to LRTA* and FALCONS, because it selects a sequence of actions instead of only the first action, γ-Trap yields a 5- to 30-fold improvement in convergence speed. γ is an optimality weight associated with the cost function. It has the same purpose as the weight in weighted A*: balancing the respective weights of the path cost and of the heuristic.

LRTS [BL06] is a unifying framework for learning in real-time search. It includes LRTA*, SLA* and γ-Trap. The features of LRTS are: learning in real time (inherited from LRTA*), lookahead (inherited from γ-Trap and Korf's work), backtracking (inherited from SLA*), and weighted search (inherited from γ-Trap and weighted A*).

Finally, Bulitko [BLS08] describes dynamic control in real-time heuristic search.

To sum up, the limitation of the RTS literature is that it focuses on the learning part and less on the search for action selection. [Kor90] and following works prove the convergence of their learning real-time algorithms given that the action selection is a depth-one greedy search. However, other papers by Korf show the importance of an efficient search for action selection. The exceptions to this limitation are γ-Trap and LRTS, in which lookahead is used for action selection. The γ-Trap lookahead is a kind of breadth-first search. Consequently, it is of interest to see whether an adaptation of UCT can be used efficiently for action selection in RTS. This adaptation can then be used with or without learning.

3 UCT

UCT worked well in Go playing programs, and it was used in many versions, leading to the Monte-Carlo Tree Search (MCTS) framework [CWv08]. While time remains, an MCTS algorithm iteratively grows a tree in memory by following the steps below:

  1. Starting from the root, browse the tree until reaching a leaf by using (1).

  2. Expand the leaf with its child nodes.

  3. Choose one child node.

  4. Perform a random simulation starting from this child node until the end of the game, and get the return, i.e., the game’s outcome.

  5. Update the mean value of the browsed nodes with this return.

With infinite time, the root value converges to the mini-max value of the game tree [KS06]. The UCB selection rule (1) answers the requirement of being optimistic when a decision must be made facing uncertainty [ACBF02].


$$N_{select} = \arg\max_{n \in N} \left\{ m + C \sqrt{\frac{\log p}{s}} \right\} \qquad (1)$$

$N_{select}$ is the selected node, $N$ is the set of children, $m$ is the mean value of node $n$, $s$ is the number of iterations going through $n$, $p$ is the number of iterations going through the parent of $n$, and $C$ is a constant value set up experimentally. Equation (1) uses the sum of two terms: the mean value $m$, and the UCB bias value $C \sqrt{\log p / s}$, which guarantees exploration. Respecting the UCB selection rule guarantees the completeness and the correctness of the algorithm.
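The selection rule (1) can be sketched in a few lines. This is a minimal illustration, assuming each child is stored as a dictionary with a `mean` and a `visits` field; these field names and the function `ucb_select` are hypothetical.

```python
import math


def ucb_select(children, parent_visits, c=1.0):
    """Return the child maximizing mean + c * sqrt(log(parent visits) / visits)."""
    def score(child):
        bias = c * math.sqrt(math.log(parent_visits) / child["visits"])
        return child["mean"] + bias
    return max(children, key=score)


children = [
    {"name": "a", "mean": 0.6, "visits": 10},
    {"name": "b", "mean": 0.5, "visits": 1},
]
explored = ucb_select(children, parent_visits=11, c=1.0)
greedy = ucb_select(children, parent_visits=11, c=0.0)
```

With the bias term, the rarely visited child `b` is selected even though its mean is lower; with the constant set to zero the rule degenerates into a greedy choice of the best mean, here `a`.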

4 MHSP

This section defines our algorithm MHSP (Mean-based Heuristic Search for real-time Planning). We first present the two important choices made in designing MHSP, and then give its pseudo-code.

4.1 Heuristic values replace simulation

On planning problems, random simulations are not appropriate. Browsing randomly the state space does not enable the algorithm to reach goal states sufficiently often. Many runs complete without reaching goal states. Therefore, replacing the simulations by a call to the heuristic becomes mandatory.

In Computer Go, the random simulations were adequate mainly because they always completed after a limited number of moves, and the return values (won or lost) were roughly equally distributed on most positions of a game. Furthermore, the two return values correspond to actual values of a completed game. In planning, one return means that a solution has been found (episode completed), and the other not. This simulation difference is fundamental between the planning problem, and the two-player game playing problem.

In planning, the heuristic values bring appropriate knowledge into the returns. Consequently, using heuristic values in MHSP should be beneficial, provided that the heuristic value generator is good, which is the case in planning [BF97]. In Computer Go, replacing the simulations by evaluation function calls is discouraged by fifty years of Computer Go history, which followed the converse path: the complex evaluation functions have been replaced by pseudo-random simulations.

To sum up, in MHSP, we replace stage (4) of MCTS above by a call to a heuristic function.

4.2 Optimistic initial mean values

Computer games practice shows that the UCB bias of (1) can simply be removed, provided the mean values of nodes are initialized with sufficiently optimistic values. This simplification removes the problem of tuning the UCB constant. Generally, to estimate a given node, the planning heuristics give a path length estimation. Convergence to the best plan is provided by admissible heuristics, i.e., optimistic heuristics. Consequently, the value returned by planning heuristics on a node can be used to initialize the mean value of this node.

In MHSP, the returns are negative or zero, and they must be the opposite of the distance from a state $s$ to the goal $g$. Thus, we initialize the mean value of a node $s$ with $-\Delta(s, g)$, where $\Delta(s, g)$ is the heuristic estimate of the distance needed to reach $g$ from $s$. With this initialization policy, the best node according to the heuristic value will be explored first. Its value will be lowered after some iterations, however good it is, and then the other nodes will be explored in the order given by the heuristic.
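The effect of this initialization can be seen on a small example. The sketch below assumes nodes are stored as dictionaries holding a cumulative return `ret` and a visit count `visits`; the helper names and the numeric returns are illustrative assumptions only.

```python
def init_mean(heuristic_distance):
    """Initial mean value of a node: minus the estimated distance to the goal."""
    return -float(heuristic_distance)


def best_mean_child(children):
    """Pick the child with the highest mean return (no UCB bias term)."""
    return max(children, key=lambda c: c["ret"] / c["visits"])


# Two sibling nodes whose heuristic distances to the goal are 3 and 5.
children = [
    {"ret": init_mean(3), "visits": 1},
    {"ret": init_mean(5), "visits": 1},
]
first_choice = best_mean_child(children)

# Two disappointing returns lower the mean of the first node below -5.
for ret in (-6.0, -8.0):
    children[0]["ret"] += ret
    children[0]["visits"] += 1
second_choice = best_mean_child(children)
```

The node with the best heuristic value is explored first; once its mean has dropped to about -5.67, the sibling with initial value -5 becomes the best mean and is explored in turn.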

4.3 The algorithm

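Algorithm 1: MHSP(O, s0, g)
 1  C[s0] ← ∅; R[s0] ← −Δ(s0, g); V[s0] ← 1; π ← nil
 2  while time remains do
 3      s ← s0
 4      while g ⊄ s and V[s] ≠ 1 do
 5          s ← argmax over s' ∈ C[s] of R[s'] / V[s']
 6      reward ← current pessimism threshold
 7      if g ⊆ s then
 8          reward ← 0
 9      else if V[s] = 1 then
10          A ← {a | a ground instance of an operator in O applicable in s}
11          foreach a ∈ A do
12              s' ← state reached by applying a in s
13              C[s'] ← ∅
14              R[s'] ← −Δ(s', g)
15              V[s'] ← 1
16              P[s'] ← s
17              C[s] ← C[s] ∪ {s'}
18          if C[s] ≠ ∅ then
19              s ← argmax over s' ∈ C[s] of R[s']
20              reward ← R[s]
21      i ← 1
22      while s ≠ s0 do
23          s ← P[s]
24          R[s] ← R[s] + (reward − i); V[s] ← V[s] + 1; i ← i + 1
25      if reward = 0 then
26          π' ← reconstruct_solution_plan()
27          if π = nil or length(π) > length(π') then π ← π'
28  if π = nil then return reconstruct_best_plan() else return π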

The MHSP algorithm is shown in Algo. 1: $O$ is the set of operators, $s_0$ the initial state, $g$ the goal, $C[s]$ the set of children of state $s$, $R[s]$ the cumulative return of state $s$, $V[s]$ the number of visits of state $s$, and $P[s]$ the parent of $s$.

The outer while (line 2) ensures the real-time property. The first inner while (line 4) corresponds to stage (1) in MCTS. The default reward is pessimistic: it is set to the current pessimism threshold. The first two if statements test whether the inner while has ended with the goal achieved (line 7) or with a leaf (line 9). If the goal is not reached, the leaf is expanded, which is stage (2) in MCTS. The if that follows the expansion corresponds to stage (3). Stage (4) is performed by writing the heuristic-based value of the chosen child into the return. The second inner while (line 22) corresponds to stage (5).

The solution plan reconstruction function browses the tree by selecting the child node with the best mean, which produces the solution plan. The best plan reconstruction function browses the tree by selecting the child node with the highest number of visits. The best plan reconstruction happens when the time is over before a solution plan has been found. In this case, it is important to reconstruct a robust plan, maybe not the best one in terms of mean value. With the child with the best mean, a plan with newly created nodes could be selected, and the plan would not be robust. Conversely, selecting the child with the highest number of visits ensures that the plan has been tried many times, and should be robust to this extent.
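The five MCTS stages, as adapted in this section, can be put together in a short runnable sketch. It is not the authors' implementation: the default pessimistic reward (root mean minus one), the decrease of the return by one unit per step during the backup, the use of states instead of actions in the returned plan, and all names (`Node`, `mhsp_sketch`, `max_iters`, `budget_s`) are assumptions made for this illustration.

```python
import time


class Node:
    """Search-tree node holding the statistics described above."""
    def __init__(self, state, parent, h_value):
        self.state, self.parent, self.children = state, parent, []
        self.ret = -float(h_value)   # cumulative return, optimistic start
        self.visits = 1

    def mean(self):
        return self.ret / self.visits


def path_from_root(node):
    """States from the root down to `node`."""
    states = []
    while node is not None:
        states.append(node.state)
        node = node.parent
    return states[::-1]


def mhsp_sketch(start, is_goal, successors, heuristic, budget_s=1.0, max_iters=1000):
    root = Node(start, None, heuristic(start))
    best = None
    deadline = time.monotonic() + budget_s
    for _ in range(max_iters):
        if time.monotonic() >= deadline:
            break
        node = root
        # Stage 1: follow the best mean down to a goal or an unexpanded leaf.
        while node.children and not is_goal(node.state):
            node = max(node.children, key=Node.mean)
        reward = root.mean() - 1.0   # assumed pessimistic default
        if is_goal(node.state):
            reward = 0.0
        else:
            # Stage 2: expand the leaf.
            node.children = [Node(s, node, heuristic(s)) for s in successors(node.state)]
            if node.children:
                # Stages 3 and 4: pick a child; its heuristic value is the return.
                node = max(node.children, key=Node.mean)
                reward = node.ret
        reached, steps = node, 0
        # Stage 5: back up the return, one unit lower per step toward the root.
        while node.parent is not None:
            node, steps = node.parent, steps + 1
            node.ret += reward - steps
            node.visits += 1
        if is_goal(reached.state):
            plan = path_from_root(reached)
            if best is None or len(plan) < len(best):
                best = plan
    if best is not None:
        return best
    # No solution yet: follow the most visited children for a robust partial plan.
    node = root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
    return path_from_root(node)


# Toy domain: walk on the integers 0..3 from 0 to the goal 2.
def moves(s):
    return [t for t in (s - 1, s + 1) if 0 <= t <= 3]

solution = mhsp_sketch(0, lambda s: s == 2, moves, lambda s: abs(2 - s), budget_s=5.0, max_iters=10)
partial = mhsp_sketch(0, lambda s: s == 2, moves, lambda s: abs(2 - s), budget_s=5.0, max_iters=1)
```

With ten iterations the sketch returns the full state sequence to the goal; with a single iteration no solution is known yet, and the most visited branch is returned as a partial plan.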

5 Experiments

The aim of the experiments described in this section is to show that MHSP is better than existing correct and complete algorithms at action selection in the background of RTS, used with or without learning, for different decision times, on a set of planning problems.

5.1 Planners and domains

We compare MHSP with two algorithms: A* and Breadth-First Search (BFS). We chose A* because it is the reference algorithm in planning (LRTA*). It is a best-first algorithm that aims at minimizing the classical evaluation function of A*. A* expands the nodes of the open list with heuristic values that decrease with the running time, the heuristic value reaching zero given sufficient time. To this extent, A* can be stopped at any time. When A* is stopped, the path from the last expanded node to the root node gives the “best” action selected by A*. Besides, we chose BFS. BFS is a simple generalization, to other planning domains, of the current action selectors used in RTS for path-finding, such as LRTS or γ-Trap. We did not choose depth-first algorithms, such as mini-min, since they hardly fit the real-time constraint. Finally, our three algorithms are MHSP, A*, and BFS.

As mentioned in the introduction, RTS algorithms are mainly applied to path-finding for video games. In order to extend the existing test domain, we selected other domains and problems from the International Planning Competition (see footnote 1), which illustrate the effectiveness of the techniques implemented in MHSP. The domains are ferry, gripper, and satellite. For each domain, we selected 20 problems ranked by complexity in terms of number of facts and number of operators. In the rest of the paper, we only show results from this set of problems in order to illustrate the power of MHSP.

5.2 Settings

We designed three tests to underline the effectiveness of MHSP. Test 1 is global, and does not especially focus on action selection: it gives the average length of the solution plans found by the three algorithms for different decision times on representative problems. It intends to show that MHSP is globally better than A* and BFS in terms of solution plan length. Test 1 does not include learning.

Test 2 repeats test 1 with learning, by using the update rule (2) on the nodes visited during the episodes. Rule (2) is not applied to the nodes explored during the action selection stage. Test 2 intends to show the consequences of the three action selectors on the convergence of solution plans toward optimal plans when the number of episodes increases.


Test 3 is the most important test for underlining the ability of the algorithms to perform effective action selection, or partial plan selection, in real time. This test makes the decision time vary, and assesses the quality of the partial plans obtained. The quality of partial plans is estimated with two distances: the distance to the goal (or goal distance) and the distance to the optimum.

The distance to the goal of a partial plan is the length of the optimal plan linking the end state of this partial plan to the goal state. When the distance to the goal diminishes, the partial plan has been built in the appropriate direction. When the distance to the goal is zero, the partial plan is a solution plan.

The distance to the optimum of a partial plan is the length of the partial plan, plus the distance to the goal of the partial plan, minus the length of the optimal plan. When the distance to the optimum of a partial plan is zero, the partial plan is the beginning of an optimal plan. The distance to the optimum of a solution plan is the difference between its length and the optimal length. The distance to the optimum of the empty plan is zero. The distance to the goal and the distance to the optimum of an optimal plan are both zero. Conversely, when the distance to the goal and the distance to the optimum of a partial plan are both zero, the partial plan is an optimal plan. For each problem, the results are shown with figures giving the distance to the goal and the distance to the optimum of the partial plan as a function of the running time. These distances are computed by calling an optimal planner.
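These definitions reduce to simple arithmetic, shown here on made-up numbers. The function name and the example values (an optimal plan of length 15) are illustrative assumptions; in the experiments the goal distance comes from an optimal planner.

```python
def distance_to_optimum(partial_len, goal_distance, optimal_len):
    """Partial plan length + its distance to the goal - optimal plan length."""
    return partial_len + goal_distance - optimal_len


OPTIMAL_LEN = 15

# A 4-action prefix of an optimal plan: 11 actions remain to the goal.
on_track = distance_to_optimum(4, 11, OPTIMAL_LEN)
# A 4-action partial plan that wandered off: 13 actions remain to the goal.
off_track = distance_to_optimum(4, 13, OPTIMAL_LEN)
# The empty plan: the whole optimal plan remains.
empty = distance_to_optimum(0, OPTIMAL_LEN, OPTIMAL_LEN)
# A solution plan of length 17: its goal distance is zero.
solution = distance_to_optimum(17, 0, OPTIMAL_LEN)
```

The first and third cases give zero, the second gives 2 (two wasted actions), and the last one gives 2 as well, the difference between the solution length and the optimal length.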

Finally, all the tests were conducted on an Intel Core 2 Quad 6600 (2.4 GHz) with 4 GB of RAM. The implementation of MHSP used for the experiments is written in Java and based on the PDDL4J library.

5.3 Test 1

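| problem | algo. | decision time (ms) | avr. time | avr. length | opt. length | max length | min length | failure (%) |
|---|---|---|---|---|---|---|---|---|
| ferry-05 | A* | 40 | 1.09 | 26.26 | 18 | 277 | 19 | 0 |
| ferry-05 | BFS | 40 | 8.91 | 27.76 | 18 | 567 | 18 | 16 |
| ferry-05 | MHSP | 40 | 0.59 | 18.02 | 18 | 19 | 18 | 0 |
| ferry-10 | A* | 200 | 97.9 | 184.94 | 35 | 807 | 42 | 32 |
| ferry-10 | BFS | 200 | 8.91 | 27.76 | 35 | 109 | 35 | 0 |
| ferry-10 | MHSP | 200 | 0.59 | 18.02 | 35 | 36 | 35 | 0 |
| ferry-15 | A* | 2000 | 157.85 | 31.95 | 51 | 88 | 58 | 55 |
| ferry-15 | BFS | 2000 | 103.75 | 51.45 | 51 | 52 | 51 | 0 |
| ferry-15 | MHSP | 2000 | 86.45 | 52.45 | 51 | 53 | 51 | 0 |
| ferry-20 | A* | 4000 | - | - | 73 | - | - | 100 |
| ferry-20 | BFS | 4000 | - | - | 73 | - | - | 100 |
| ferry-20 | MHSP | 4000 | 260.49 | 74.87 | 73 | 78 | 73 | 0 |
| gripper-05 | A* | 50 | 0.56 | 15.04 | 15 | 17 | 15 | 0 |
| gripper-05 | BFS | 50 | 0.71 | 15 | 15 | 15 | 15 | 0 |
| gripper-05 | MHSP | 50 | 0.49 | 15 | 15 | 15 | 15 | 0 |
| gripper-10 | A* | 165 | 92.28 | 140.04 | 29 | 651 | 33 | 36 |
| gripper-10 | BFS | 165 | 7.86 | 37 | 29 | 37 | 37 | 0 |
| gripper-10 | MHSP | 165 | 5.08 | 29 | 29 | 29 | 29 | 0 |
| gripper-15 | A* | 450 | 160.1 | 47.54 | 45 | 229 | 77 | 70 |
| gripper-15 | BFS | 450 | 31.42 | 46.52 | 45 | 47 | 45 | 0 |
| gripper-15 | MHSP | 450 | 38.53 | 54.88 | 45 | 55 | 53 | 0 |
| gripper-20 | A* | 1100 | - | - | 59 | - | - | 100 |
| gripper-20 | BFS | 1100 | 102.76 | 61 | 59 | 61 | 61 | 0 |
| gripper-20 | MHSP | 1100 | 134.86 | 73.92 | 59 | 75 | 73 | 0 |
| satellite-05 | A* | 300 | 3.49 | 15.08 | 15 | 18 | 15 | 0 |
| satellite-05 | BFS | 300 | 20.28 | 63.72 | 15 | 522 | 19 | 0 |
| satellite-05 | MHSP | 300 | 3.01 | 15 | 15 | 15 | 15 | 0 |
| satellite-10 | A* | 2000 | - | - | 29 | - | - | 100 |
| satellite-10 | BFS | 2000 | - | - | 29 | - | - | 100 |
| satellite-10 | MHSP | 2000 | 67.912 | 31.0 | 29 | 31 | 31 | 0 |

Table 1: Results of test 1 on the ferry, gripper and satellite domains without learning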

Table 1 lists the following inputs: the domain (ferry, gripper or satellite), the problem number, the algorithm used for action selection, and the decision time. The outputs are: the average time spent for one episode, the average solution plan length, the optimal plan length, the maximal and the minimal solution plan lengths found by the algorithm, and the percentage of failures.

The optimal plan length, computed off-line, is used as a reference. On ferry-05 with a decision time of 40ms, MHSP executes solution plans that are almost optimal (18.02 against 18) while A* and BFS are far from optimal (26.26 and 27.76). The maximal plan length is very high for A* and BFS (277 and 567) and almost optimal for MHSP (19). The minimal plan length is the optimal one for BFS and MHSP. In order to limit the time of the experiments, the number of episodes is bounded (50). An algorithm that does not reach the goal during an episode is counted as having failed that episode. Here, BFS has a failure rate of 16% over the 50 episodes. Furthermore, MHSP is the fastest algorithm. At the beginning of an episode, all the algorithms use the total time to decide. However, when the episode nears its end, the action selection is easier than before, and some algorithms do not use all the available time, and are faster than the others. MHSP is clearly the fastest because it finds the goal more easily than A* or BFS when the goal is not far.

On ferry-10 with a decision time of 200ms, MHSP executes solution plans that are almost optimal (35.66 against 35) while A* and BFS are again far from optimal (181.94 and 41.34). The maximal plan length is very high for A* and BFS (807 and 109) and almost optimal for MHSP (36). The minimal plan length is the optimal one for BFS and MHSP. Here, A* has a failure rate of 32%.

On gripper-05 with a decision time of 50ms, MHSP, A* and BFS are almost optimal. On gripper-10 with a decision time of 165ms, MHSP executes optimal plans (29) while A* and BFS are again far from optimal (140.04 and 37). Here, A* has 36% of failure rate.

Finally, on satellite-05 with a decision time of 300ms, MHSP, A* and BFS are almost optimal, with a slight advantage for MHSP. On satellite-10 with a decision time of 2000ms, BFS and A* do not find any solution, unlike MHSP. The main reason for this result is the branching factor, which is greater than in the other studied problems (e.g., 66 for satellite-10, 13 for gripper-20 and 16 for ferry-20). This high branching factor strongly penalizes the exhaustive search strategy of A* and BFS.

To sum up the first test, MHSP finds plans shorter than A* or BFS, and MHSP is faster than A* and BFS.

5.4 Test 2

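| problem | algo. | decision time (ms) | avr. time | avr. length | opt. length | max length | min length | failure (%) |
|---|---|---|---|---|---|---|---|---|
| ferry-05 | A* | 40 | 0.68 | 19.22 | 18 | 31 | 18 | 0 |
| ferry-05 | BFS | 40 | 12.03 | 133.18 | 18 | 532 | 18 | 14 |
| ferry-05 | MHSP | 40 | 0.48 | 18.02 | 18 | 19 | 18 | 0 |
| ferry-10 | A* | 200 | 9.36 | 48.12 | 35 | 67 | 37 | 0 |
| ferry-10 | BFS | 200 | 17.22 | 78.5 | 35 | 892 | 35 | 0 |
| ferry-10 | MHSP | 200 | 6.24 | 35 | 35 | 35 | 35 | 0 |
| ferry-15 | A* | 2000 | 112.03 | 62.36 | 51 | 81 | 57 | 55 |
| ferry-15 | BFS | 2000 | 104.31 | 51.75 | 51 | 53 | 51 | 0 |
| ferry-15 | MHSP | 2000 | 87.39 | 52.8 | 51 | 53 | 51 | 0 |
| ferry-20 | A* | 4000 | 247.32 | 121.1 | 73 | 146 | 107 | 0 |
| ferry-20 | BFS | 4000 | 284.44 | 73.8 | 73 | 75 | 75 | 35 |
| ferry-20 | MHSP | 4000 | 255.31 | 73.6 | 73 | 73 | 73 | 15 |
| gripper-05 | A* | 50 | 0.45 | 15 | 15 | 15 | 15 | 0 |
| gripper-05 | BFS | 50 | 35.72 | 34.06 | 15 | 615 | 15 | 68 |
| gripper-05 | MHSP | 50 | 0.48 | 15 | 15 | 15 | 15 | 0 |
| gripper-10 | A* | 165 | 76.48 | 35.04 | 29 | 43 | 29 | 0 |
| gripper-10 | BFS | 165 | 9.96 | 32.22 | 29 | 37 | 29 | 0 |
| gripper-10 | MHSP | 165 | 4.8 | 29 | 29 | 29 | 29 | 0 |
| gripper-15 | A* | 450 | 56.32 | 54.32 | 45 | 57 | 49 | 14 |
| gripper-15 | BFS | 450 | 36.92 | 54.8 | 45 | 318 | 45 | 4 |
| gripper-15 | MHSP | 450 | 36.52 | 46.76 | 45 | 55 | 45 | 0 |
| gripper-20 | A* | 1100 | - | - | 59 | - | - | 100 |
| gripper-20 | BFS | 1100 | 132.44 | 74.12 | 59 | 75 | 73 | 2 |
| gripper-20 | MHSP | 1100 | 99.47 | 59.06 | 59 | 63 | 73 | 0 |
| satellite-05 | A* | 300 | 3.54 | 15.62 | 15 | 18 | 15 | 0 |
| satellite-05 | BFS | 300 | 6.1 | 18.98 | 15 | 41 | 19 | 0 |
| satellite-05 | MHSP | 300 | 3.17 | 15 | 15 | 15 | 15 | 0 |
| satellite-10 | A* | 2000 | - | - | 29 | - | - | 100 |
| satellite-10 | BFS | 2000 | - | - | 29 | - | - | 100 |
| satellite-10 | MHSP | 2000 | 65.612 | 31.2 | 29 | 32 | 30 | 0 |

Table 2: Results of test 2 on the ferry, gripper and satellite domains with learning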

Table 2 shows the performance of the three algorithms when learning is applied. Compared to the results of Table 1, we can observe that learning most often improves the quality of the best plans. Indeed, except for BFS in gripper-20, the minimal plan length is always smaller with learning than without, if it was not already optimal in test 1.

Moreover, learning enabled BFS and A* to find solution plans in ferry-20. Neither of them reached an optimal plan, but the best plan of BFS is only two actions away from it. It also enabled MHSP to become optimal in satellite-05.

Figure 1: Test 2 – Convergence of the real-time planning algorithm with learning. Panels: (a) ferry-10, (b) gripper-10, (c) satellite-05.

In order to illustrate the behavior of the algorithms over time, Figure 1 shows the evolution of the plan length over the episodes on three problems: ferry-10, gripper-10 and satellite-05. As we can see, on these problems, MHSP reaches an optimal plan very quickly, and its plan length is almost constant and stable. Conversely, A* and BFS are not optimal in ferry-10 and gripper-10 and are very unstable. The BFS peaks correspond to the maximal plan length allowed.

However, on these figures, learning does not show up as a clearly decreasing plan length, as might be expected. There are two reasons. First, the update rule is applied to the visited nodes, and not to the explored nodes. Therefore the action selection strategy does not really affect the plan length when the number of episodes increases. Second, the algorithms use heuristic values computed off-line by planning graph techniques, which are almost optimal on problems with a sufficiently low complexity. Consequently, the update rule is not triggered very often.

5.5 Test 3

Test 3 assesses the quality of the partial plans available at the end of the action selection stage according to the time given to the decision. Figures 2b, 2c and 2d show the distance to the goal and the distance to the optimum for each algorithm according to the decision time on problem gripper-05. Figure 2a gives an overview of the distance to the goal of the three algorithms with the decision time on a log scale.

Figure 2: Test 3 – Quality of the action selection. Panels: (a) gripper-05 – overview, (b) gripper-05 – A*, (c) gripper-05 – BFS, (d) gripper-05 – MHSP.

In these results, we observe that A* needs a decision time of at least 2650ms to always find the optimal plan, while BFS needs at least 1504ms and MHSP only 349ms. Whatever the given decision time, A* provides partial plans close to the optimum. BFS only provides partial plans that are beginnings of optimal plans. Finally, during its search, MHSP provides partial plans with a high distance to the optimum. This can be explained by the ends of these partial plans, which are unstable. Selecting shorter partial plans would result in a better distance to the optimum.

Table 3 sums up these results and adds results from ferry-05 and satellite-05. As in gripper, we can see that MHSP is the fastest algorithm to find optimal plans.

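| problem | algorithm | time (ms) |
|---|---|---|
| ferry-05 | A* | 808 |
| ferry-05 | MHSP | 288 |
| gripper-05 | A* | 1650 |
| gripper-05 | BFS | 1504 |
| gripper-05 | MHSP | 349 |
| satellite-05 | A* | 8180 |
| satellite-05 | MHSP | 1710 |

Table 3: Time to find a solution plan on the ferry, gripper and satellite domains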

6 Discussion and future works

In this paper, we presented and studied a new heuristic search algorithm based on mean values for real-time search (RTS), called MHSP, adapted to the context of classical planning. This algorithm combines heuristic search with the learning principles of the UCT algorithm, i.e., state values based on mean returns, and optimism in the face of uncertainty.

MHSP bases its decisions on mean values and not on direct values. This means that the value of an action depends on every node explored beneath that action, and not only on the best node found. This fact may have a strong impact on the way the system explores nodes, because the system may focus on the action that leads to globally good nodes, and not on the action that leads to the node with the best heuristic value. In a time-constrained context, focusing on an action that leads to globally good nodes instead of just one node may limit the effect of strongly erroneous heuristic values. It enables a more balanced exploration of the tree. The more complex the problem is, the more visible this effect should be.
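A small numeric example makes the difference between the two kinds of value concrete. The values below are invented for illustration: each list holds the (negative) values of the nodes explored beneath one action, and the two helper functions are hypothetical names.

```python
def action_value_mean(node_values):
    """Value of an action as the mean over all nodes explored beneath it."""
    return sum(node_values) / len(node_values)


def action_value_best(node_values):
    """Value of an action as the best single node found beneath it."""
    return max(node_values)


# Action A has one node with a suspiciously good heuristic value (-3);
# action B leads to uniformly decent nodes.
explored = {"A": [-3.0, -9.0, -10.0], "B": [-5.0, -5.0, -6.0]}

by_best = max(explored, key=lambda a: action_value_best(explored[a]))
by_mean = max(explored, key=lambda a: action_value_mean(explored[a]))
```

A best-value decision picks action A because of its single outlier, while a mean-based decision picks action B, whose explored nodes are good overall (mean about -5.33 against -7.33).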

Three tests were designed in order to compare MHSP, A* and BFS in RTS. The first one gave an overview of the global effectiveness of each algorithm in finding good plans on different problems from the ferry, gripper and satellite domains. It showed that MHSP is globally better than A* and BFS in terms of solution plan length. The second test was intended to observe the convergence of performance when learning is applied in the three algorithms, i.e., when the heuristic values of the visited nodes can be updated according to exploration. The results first showed that learning improves the quality of the best plans obtained with the three algorithms. They moreover showed that MHSP tends to converge very quickly toward an optimal plan, while A* and BFS may stay suboptimal and unstable. Finally, the third test was designed in order to evaluate the ability of the algorithms to perform effective action selection, or partial plan selection, in real time. This test makes the decision time vary, and assesses the quality of the partial plans obtained through two distances: the distance to the goal and the distance to the optimum. The results highlighted that, as the decision time grows, MHSP provides optimal plans much faster than the two other algorithms.

In the future, we may study several specific aspects of the presented work:

  • First of all, we would study the possibility of executing sequences of actions instead of just one action, as algorithms such as γ-Trap or LRTS do. Indeed, instead of taking a single action between the lookahead search episodes, they apply several actions to amortize the planning cost. This speeds up the search for a solution plan when the heuristic function is informative.

  • Moreover, since MHSP uses mean values, we also want to apply MHSP on problems in non deterministic environments and compare it to on-line MDP algorithms.

  • Experiments show that the first action chosen significantly impacts the quality of the solution plan found in terms of length. For instance, in blocksworld domains, choosing a bad block to move first implies adding many actions to repair this bad choice. Consequently, the idea is to allocate more time to the first reasoning step.

  • In our experiments, for each action selection stage, the tree is computed from scratch. Re-using the tree computed during the previous action selection stages is an interesting enhancement to our work. It will enable the real-time algorithms to tackle more difficult problems.

  • Our work is done in the background of real-time search, and partial plan selection. Removing the real-time constraint, and testing MHSP on problems in which full solution plans are required is an interesting research direction. However, preliminary tests show that, with almost exact heuristics, MHSP is hardly comparable to efficient and general planners using Enforced-Hill Climbing to find full solution plans on a wide range of problems.

  • Finally, learning the heuristics useful in planning by using MHSP, or another real-time algorithm, and comparing them with the heuristics obtained with planning graphs is a good perspective linking the two domains of learning and planning.
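
The first perspective above, executing several actions per lookahead episode, can be illustrated with a minimal sketch. It is not the MHSP implementation: the names `run_episodes`, `plan_partial`, `apply_action` and `commit_depth` are hypothetical, and the toy domain (a counter that must reach 10, with a lookahead that always returns up to three correct steps) only stands in for a planner with an informative heuristic.

```python
from typing import Callable, Hashable, List, Tuple

State = Hashable
Action = str


def run_episodes(
    start: State,
    is_goal: Callable[[State], bool],
    plan_partial: Callable[[State], List[Action]],  # hypothetical lookahead search
    apply_action: Callable[[State, Action], State],
    commit_depth: int,
    max_episodes: int = 1000,
) -> Tuple[List[Action], int]:
    """Interleave lookahead and execution.

    After each lookahead episode, execute up to `commit_depth` actions of the
    returned partial plan before planning again. Returns the executed actions
    and the number of lookahead episodes that were needed.
    """
    state = start
    executed: List[Action] = []
    episodes = 0
    while not is_goal(state) and episodes < max_episodes:
        partial = plan_partial(state)
        episodes += 1
        if not partial:
            break  # the lookahead found nothing to execute
        for action in partial[:commit_depth]:
            state = apply_action(state, action)
            executed.append(action)
            if is_goal(state):
                break
    return executed, episodes


# Toy domain (an assumption for illustration): count from 0 up to GOAL.
GOAL = 10


def toy_is_goal(state: int) -> bool:
    return state == GOAL


def toy_plan(state: int) -> List[Action]:
    # Stand-in for an informative lookahead: up to three correct steps.
    return ["inc"] * min(3, GOAL - state)


def toy_apply(state: int, action: Action) -> int:
    return state + 1 if action == "inc" else state


if __name__ == "__main__":
    for depth in (1, 3):
        plan, episodes = run_episodes(0, toy_is_goal, toy_plan, toy_apply, depth)
        print(f"commit_depth={depth}: {len(plan)} actions, {episodes} episodes")
```

In this toy setting the executed plan has the same length in both cases, but committing to three actions per episode needs fewer lookahead episodes. With a less informative heuristic, committing to longer sequences could instead degrade plan quality, which is precisely what would need to be studied.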


  1. For a description and formalization of these benchmark domains and problems, see the official page of IPC.


  1. P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time Analysis of the Multiarmed Bandit Problem. Machine Learning, 47(2–3):235–256, 2002.
  2. A. Barto, S. Bradtke, and S. Singh. Learning to Act Using Real-Time Dynamic Programming. Artificial Intelligence, 72(1-2):81–138, 1995.
  3. A. Blum and M. Furst. Fast Planning Through Planning Graph Analysis. Artificial Intelligence, 90:1636–1642, 1997.
  4. Vadim Bulitko and Greg Lee. Learning in Real-Time Search: A Unifying Framework. Journal of Artificial Intelligence Research, 25:119–157, 2006.
  5. V. Bulitko, M. Lustrek, J. Schaeffer, Y. Bjornsson, and S. Sigmundarson. Dynamic Control in Real-Time Heuristic Search. Journal of Artificial Intelligence Research, 32(1):419–452, 2008.
  6. Vadim Bulitko. Learning for Adaptive Real-Time Search. Technical report, Department of Computer Science, University of Alberta, 2004.
  7. G. Chaslot, M. Winands, H. van den Herik, J. Uiterwijk, and B. Bouzy. Progressive Strategies for Monte-Carlo Tree Search. New Mathematics and Natural Computation, 4(3):343–357, 2008.
  8. P. Fabiani, V. Fuertes, G. Besnerais, R. Mampey, A. Piquereau, and F. Teichteil. The ReSSAC Autonomous Helicopter: Flying in a Non-Cooperative Uncertain World with embedded Vision and Decision Making. In A.H.S Forum, 2007.
  9. D. Furcy and S. Koenig. Speeding up the Convergence of Real-Time Search. In Proceedings of the National Conference on Artificial Intelligence, pages 891–897, 2000.
  10. M. Ghallab, D. Nau, and P. Traverso. Automated Planning Theory and Practice. Morgan Kaufmann Publishers, 2004.
  11. S. Gelly, Y. Wang, R. Munos, and O. Teytaud. Modification of UCT with Patterns in Monte-Carlo Go. Technical Report RR-6062, INRIA, 2006.
  12. J. Hoffmann and B. Nebel. The FF Planning System: Fast Plan Generation Through Heuristic Search. Journal of Artificial Intelligence Research, 14(1):253–302, 2001.
  13. P.E. Hart, N.J. Nilsson, and B. Raphael. A Formal Basis for the Heuristic Determination of Minimum Cost Paths. IEEE Transactions on Systems Science and Cybernetics, SSC-4(2):100–107, 1968.
  14. E. Hansen and S. Zilberstein. LAO*: A Heuristic Search Algorithm that Finds Solutions with Loops. Artificial Intelligence, 129(1-2):35–62, 2001.
  15. Richard Korf. Real-Time Heuristic Search: New Results. In Proceedings of the AAAI conference, pages 139–144, 1988.
  16. Richard Korf. Real-Time Heuristic Search. Artificial Intelligence, 42(2-3):189–211, 1990.
  17. L. Kocsis and C. Szepesvari. Bandit-based Monte-Carlo Planning. In Proc. ECML, pages 282–293, 2006.
  18. Mark Roberts and Adele Howe. Learning from Planner Performance. Artificial Intelligence, 173:536–561, 2009.
  19. M. Shimbo and T. Ishida. Controlling the Learning Process of Real-Time Heuristic Search. Artificial Intelligence, 146(1):1–41, 2003.
  20. L. Shue and R. Zamani. An Admissible Heuristic Search Algorithm. In Methodologies for Intelligent Systems, number 689 in LNAI. Springer, 1993.