Learning to Rank for Synthesizing Planning Heuristics
Abstract
We investigate learning heuristics for domain-specific planning. Prior work framed learning a heuristic as an ordinary regression problem. However, in a greedy best-first search, the ordering of states induced by a heuristic is more indicative of the resulting planner’s performance than mean squared error. Thus, we instead frame learning a heuristic as a learning-to-rank problem, which we solve using a RankSVM formulation. Additionally, we introduce new methods for computing features that capture temporal interactions in an approximate plan. Our experiments on recent International Planning Competition problems show that the RankSVM-learned heuristics outperform both the original heuristics and heuristics learned through ordinary regression.
Caelan Reed Garrett, Leslie Pack Kaelbling, Tomás Lozano-Pérez. MIT CSAIL, Cambridge, MA 02139 USA. {caelan, lpk, tlp}@csail.mit.edu
1 Introduction
Forward state-space greedy heuristic search is a powerful technique that can solve large planning problems. However, its success is strongly dependent on the quality of its heuristic. Many domain-independent heuristics estimate the distance to the goal by quickly solving easier, approximated planning problems [?; ?; ?]. While domain-independent heuristics have enabled planners to solve a much larger class of problems, there is substantial room to improve their estimates. In particular, the effectiveness of many domain-independent heuristics varies across domains, with poor performance occurring when the approximations in the heuristic discard a large amount of information about the problem.
Previous work has attempted to overcome the limitations of these approximations by learning a domain-specific heuristic correction [?; ?]. Yoon et al. formulated learning a correction for the Fast-Forward (FF) heuristic [Hoffmann and Nebel, 2001] as a regression problem and solved it using ordinary least-squares regression. While the resulting planner is no longer domain-independent, the learning process is domain-independent, and the learned heuristic is more effective than the standard FF heuristic.
In this paper, we improve on these results by framing the learning problem as a learning-to-rank problem instead of an ordinary regression problem. This is motivated by the insight that, in a greedy search, the ranking induced by a heuristic, rather than its numerical values, governs the success of the planning. By optimizing for the ranking directly, our RankSVM learner is able to produce a heuristic that outperforms heuristics learned through least-squares regression.
Additionally, we introduce new methods for constructing features for heuristic learners. Like Yoon et al., we derive our features from an existing domain-independent heuristic [?; ?]. However, our features focus on the ordering and interaction between actions in approximate plans. Thus, they can be based on any existing heuristic that implicitly constructs an approximate plan, such as the context-enhanced additive (CEA) heuristic [Helmert and Geffner, 2008]. These features can be easily constructed and still encode a substantial amount of information for heuristic learners.
In our experiments, we evaluate the performance of the different configurations of our learners on several of the International Planning Competition learning track problems [?]. We find that the learned heuristics using the RankSVM approach allow more problems to be solved successfully than using the popular FF and CEA heuristics alone. Additionally, they significantly surpass the performance of heuristics learned through ordinary regression.
2 Related Work
Prior work in learning for planning spans many types of domain-specific planning knowledge [Jiménez et al., 2012]; our focus in this paper is on learning heuristic functions.
Yoon et al. were the first to improve on a heuristic function using machine learning [?; ?]. They centered their learning on improving the FF heuristic [Hoffmann and Nebel, 2001], using ordinary least-squares regression to learn the difference between the actual distance-to-go and the estimate given by the FF heuristic. Their key contribution was deriving features using the relaxed plan that FF produces when computing its estimate. Specifically, they used taxonomic syntax to identify unordered sets of actions and predicates on the relaxed plan that shared common object arguments. Because there are an exponential number of possible subsets of actions and predicates, they iteratively introduced a taxonomic expression that identifies a subset, chosen greedily according to which subset gives the largest decrease in mean squared error. This process resulted in an average of about 20 features per domain [?]. In contrast, our features encode ordering information about the plan and can be successfully applied without any taxonomic syntax or iterative feature selection.
Xu et al. built on the work of Yoon et al. and incorporated ideas from structural prediction [?; ?]. They adapted the learning-as-search optimization framework to the context of beam search. They learn a discriminative model to rank the top successors per state to include in the beam search. In subsequent work, they used RankBoost to more reliably rank successors by bootstrapping the predictions of action-selection rules [?]. Although we also use a ranking approach, we use ranking as a loss function to train a heuristic from the position of states along a trajectory, resulting in a global heuristic that can be directly applied to greedy best-first search.
Arfaee et al. learned heuristics by iteratively improving on prior heuristics for solving combinatorial search problems [Arfaee et al., 2011]. They used neural networks and user-defined features. Finally, Virseda et al. learned combinations of existing heuristic values that would most accurately predict the cost-to-go [?]. However, this strategy does not use features derived from the structure of the heuristics themselves.
Wilt et al. investigated greedy heuristic search performance in several combinatorial search domains [?]. Their results suggest that heuristics that exhibit strong correlation with the distance-to-go are less likely to produce large local minima. Large local minima, in turn, are thought to often dominate the runtime of greedy planners [?; ?]. They later used the Kendall rank correlation coefficient ($\tau$) to select a pattern database for some of these domains [?]. Their use of $\tau$ as a heuristic quality metric differs from our own because they score $\tau$ using sampled states near the goal, while we score $\tau$ by ranking the states on a plan.
3 Planning domains and training data
Our goal is to learn a heuristic that will improve the coverage, or the number of problems solved, for greedy forward-search planning on very large satisficing planning problems. Secondary goals are to decrease the resulting plan length and the time to solve these problems. The search control of our planners is greedy best-first search (GBFS) with alternating, dual open lists [?]. The preferred operators in the second open list are computed by the base heuristic which, as we will later see, is used to generate our learning features [?]. We use the lazy variant of greedy best-first search, which defers heuristic evaluation of successors. We consider STRIPS planning problems [Fikes and Nilsson, 1971] with unit costs, and without axioms or conditional effects, but our techniques can be straightforwardly generalized to handle them.
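The search control described above can be sketched in a few lines of Python. This is only an illustrative toy, not the Fast Downward implementation: the function name `lazy_gbfs`, the `successors` callback signature, and the strict one-to-one alternation between the two open lists are assumptions made for the example. Successors are queued with their parent's heuristic value and are evaluated only when they are popped, which is what "lazy" refers to.

```python
import heapq
from itertools import count

def lazy_gbfs(start, is_goal, successors, heuristic):
    """Lazy greedy best-first search with two alternating open lists.

    successors(state) yields (action, next_state, preferred) triples, where
    `preferred` marks successors reached by a preferred operator.
    Returns a list of actions, or None if both open lists run dry.
    """
    tie = count()                      # FIFO tie-breaking; states are never compared
    open_lists = [[], []]              # 0: all successors, 1: preferred successors only
    heapq.heappush(open_lists[0], (0, next(tie), start, None, None))
    parent = {}                        # expanded state -> (predecessor, action)
    turn = 0
    while open_lists[0] or open_lists[1]:
        if not open_lists[turn]:       # fall back when the chosen list is empty
            turn = 1 - turn
        _, _, state, prev, action = heapq.heappop(open_lists[turn])
        turn = 1 - turn                # alternate between the two lists
        if state in parent:
            continue                   # already expanded via another entry
        parent[state] = (prev, action)
        if is_goal(state):
            plan = []
            while parent[state][1] is not None:
                state, act = parent[state]
                plan.append(act)
            return plan[::-1]
        h = heuristic(state)           # lazy: evaluated only at expansion time
        for act, nxt, preferred in successors(state):
            if nxt in parent:
                continue
            entry = (h, next(tie), nxt, state, act)  # successor inherits parent's h
            heapq.heappush(open_lists[0], entry)
            if preferred:
                heapq.heappush(open_lists[1], entry)
    return None

# Toy usage: walk along the integers from 0 to 5; "+1" is the preferred operator.
def toy_successors(s):
    return [("+1", s + 1, True), ("-1", s - 1, False)]

plan = lazy_gbfs(0, lambda s: s == 5, toy_successors, lambda s: abs(5 - s))
```

Because only the ordering of the priorities matters here, any heuristic that induces the same ordering yields the same search behavior.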
Definition 1 (Planning Domain).
A planning domain $\mathcal{D} = \langle \mathcal{P}, \mathcal{A} \rangle$ consists of a set of predicate schemas $\mathcal{P}$ and a set of action schemas $\mathcal{A}$. Each action schema contains a set of precondition predicates and a set of effect predicates. A predicate schema or action schema can be instantiated by assigning objects to its arguments.
Definition 2 (Planning Problem).
A planning problem $\Pi = \langle \mathcal{D}, O, s_0, g \rangle$ is given by a domain $\mathcal{D}$, a set of objects $O$, an initial state $s_0$, and a goal partial-state $g$. The initial state is fully specified by a set of predicates. The goal partial-state is only partially specified by its set of predicates.
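As a concrete reading of Definitions 1 and 2, the following Python sketch shows one possible in-memory representation. All class and field names are hypothetical, and the split of effects into add and delete sets is the usual STRIPS convention, assumed here rather than taken from the definitions.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

Atom = Tuple[str, ...]  # predicate name followed by its arguments

@dataclass(frozen=True)
class GroundAction:
    schema: str
    args: Tuple[str, ...]
    pre: FrozenSet[Atom]
    add: FrozenSet[Atom]
    delete: FrozenSet[Atom]

    def applicable(self, state: FrozenSet[Atom]) -> bool:
        return self.pre <= state

    def apply(self, state: FrozenSet[Atom]) -> FrozenSet[Atom]:
        return (state - self.delete) | self.add

@dataclass(frozen=True)
class ActionSchema:
    name: str
    parameters: Tuple[str, ...]
    preconditions: FrozenSet[Atom]
    add_effects: FrozenSet[Atom]
    del_effects: FrozenSet[Atom]

    def instantiate(self, binding: Dict[str, str]) -> GroundAction:
        """Assign objects to the schema's arguments."""
        def ground(atoms):
            return frozenset((a[0],) + tuple(binding[v] for v in a[1:]) for a in atoms)
        args = tuple(binding[p] for p in self.parameters)
        return GroundAction(self.name, args, ground(self.preconditions),
                            ground(self.add_effects), ground(self.del_effects))

@dataclass(frozen=True)
class PlanningProblem:
    schemas: Tuple[ActionSchema, ...]   # the domain's action schemas
    objects: FrozenSet[str]
    initial_state: FrozenSet[Atom]      # fully specified
    goal: FrozenSet[Atom]               # partial state

    def is_goal(self, state: FrozenSet[Atom]) -> bool:
        return self.goal <= state

# Hypothetical transport-like example.
drive = ActionSchema("drive", ("?t", "?from", "?to"),
                     frozenset({("at", "?t", "?from")}),
                     frozenset({("at", "?t", "?to")}),
                     frozenset({("at", "?t", "?from")}))
problem = PlanningProblem((drive,), frozenset({"truck1", "A", "B"}),
                          frozenset({("at", "truck1", "A")}),
                          frozenset({("at", "truck1", "B")}))
step = drive.instantiate({"?t": "truck1", "?from": "A", "?to": "B"})
```

Two problems from the same domain would share `schemas` but differ in `objects`, `initial_state`, and `goal`.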
The overall approach will be, for each planning domain, to train a learning algorithm on several planning problem instances, and then to use the learned heuristic to improve planning performance on additional planning problems from that same domain. Note that the new problem instances use the same predicate and action schemas, but may have different universes of objects, initial states, and goal states.
In order to learn a heuristic for a particular domain, we must first gather training examples from a set of existing training problems within the domain [?]. Suppose that we have a distribution over problems for a domain $\mathcal{D}$, which will be used to generate testing problems. We will sample a set of $n$ training problems $\{\Pi^1, \ldots, \Pi^n\}$ from this distribution. From each problem $\Pi^i$, we generate a set of $m_i$ training examples in which the $j$th training example is the pair $\langle x^i_j, y^i_j \rangle$, where $x^i_j = \langle s^i_j, \Pi^i \rangle$ is the input composed of a state $s^i_j$ and the problem $\Pi^i$. Let $y^i_j$ be the length of a plan from $s^i_j$ to the goal $g^i$. Ideally, $y^i_j$ would be the length of the shortest plan, but because obtaining optimal plans is intractable for the problems we consider, we construct approximately optimal plans and use their lengths as the $y^i_j$ values in the training data.
We use the set of states on a single high-quality plan from the initial state to the goal state as training examples. Unfortunately, we have observed that using low-quality plans, which are more easily found, can be dangerous, as it introduces large amounts of noise into the training data. This noise can produce conflicting observations of $y$ for similar $x$, which can prevent the learner from identifying any meaningful predictive structure. Reducing at least this kind of local noise is important for the learning process even if the global plan is still suboptimal. Thus, we post-process each candidate plan using two local search methods: action elimination and plan neighborhood graph search [?].
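A minimal sketch of how labels can be read off a single plan, assuming unit costs: if a plan visits states s_0, ..., s_T, then the state at position j is labeled with the remaining plan length T - j. The function name and the choice to include the final goal state (with label 0) are assumptions of this example.

```python
def examples_from_plan(states, problem_id):
    """Label each state on a plan with the number of remaining steps.

    states: the states s_0, ..., s_T visited by the plan, in order.
    Returns (input, label) pairs, where the input pairs the state with
    an identifier of the problem it came from.
    """
    last = len(states) - 1
    return [((state, problem_id), last - j) for j, state in enumerate(states)]

# Hypothetical plan that visits four states in problem 7.
examples = examples_from_plan(["s0", "s1", "s2", "s3"], problem_id=7)
labels = [y for _, y in examples]
```

Since the labels are tied to positions on one trajectory, they are distinct within a problem, which is what later makes a per-problem ranking well defined.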
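To illustrate the first of these post-processing steps, here is a sketch of greedy action elimination under stated assumptions: actions are STRIPS triples of precondition, add, and delete sets, and the input plan is valid. It may differ in detail from the cited method. Each action is tentatively removed together with any later actions that become inapplicable; the removal is kept only if the goal is still reached.

```python
def action_elimination(state0, plan, goal):
    """Greedily drop actions that are not needed to reach the goal.

    state0, goal: frozensets of atoms.
    plan: list of (pre, add, delete) triples of frozensets.
    """
    i = 0
    while i < len(plan):
        # State reached just before plan[i].
        state = state0
        for pre, add, delete in plan[:i]:
            state = (state - delete) | add
        # Skip plan[i]; keep later actions only while they remain applicable.
        kept = []
        for pre, add, delete in plan[i + 1:]:
            if pre <= state:
                state = (state - delete) | add
                kept.append((pre, add, delete))
        if goal <= state:
            plan = plan[:i] + kept     # plan[i] (and its dependents) were redundant
        else:
            i += 1                     # plan[i] is needed; move on
    return plan

fs = frozenset
a1 = (fs({"a"}), fs({"b"}), fs())      # needed: produces b
a2 = (fs({"a"}), fs({"x"}), fs())      # redundant: x is never used
a3 = (fs({"b"}), fs({"c"}), fs())      # needed: produces the goal atom c
shorter = action_elimination(fs({"a"}), [a1, a2, a3], fs({"c"}))
```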
In separate experiments, we attempted learning a heuristic by instead using a sampled set of successors on these plans as training examples. However, we found that the inclusion of these states slightly worsened the resulting performance of the learners. Our hypothesis is that the inclusion of successor states improves local accuracy at the expense of global accuracy. Because the runtime of greedy search methods is often dominated by the time to escape the largest local minima [?; ?; ?; ?], it is a worthwhile tradeoff to reduce the size of large local minima at the cost of increasing the size of small local minima.
4 Feature Representation
The majority of machine learning methods assume that the inputs are represented as points in a vector space. In our case, the inputs $x$ are a pair of a state and a planning problem, each of which is a complex structured symbolic object. So, we need to define a feature-mapping function $\phi$ that maps an $x$ value into a vector of numeric feature values. Although this can also be done implicitly by defining a kernel, we restrict our attention to finite-dimensional $\phi$ that are straightforwardly computable.
The objective in designing a feature mapping is to arrange for examples that are close in feature space to have similar output values. Thus, we want to reveal the structural aspects of an input value that encode important similarities to other input values. This can be particularly challenging in learning for planning: while problems within the same domain share the same schemas for predicates and actions, the set of objects can be arbitrarily different. For example, a feature representation with a feature for each predicate instance present in $s$ or $g$ will perform poorly on new problems, which may not share any predicate instances with the problems used to create the feature representation.
Yoon et al. used information from the FF heuristic to construct additional features from the resulting relaxed plan [?; ?]. The relaxed plan compresses the large set of possible actions into a small plan of actions that are likely to be relevant to achieving the goal. Many modern heuristics either explicitly or implicitly generate approximate plans, similar to FF’s relaxed plan, that can be represented as directed acyclic graphs (DAGs) where each action is a vertex, and directed edges indicate that the outgoing action is supported by the incoming action. We provide feature mappings that are applicable to any heuristic that gives rise to such a DAG, but in this paper, we focus on the FF [Hoffmann and Nebel, 2001] and CEA [Helmert and Geffner, 2008] heuristics. Our method can be extended to include additional features, for example ones derived from landmark heuristics or domain-dependent heuristics, although we do not consider these extensions here.
We can now view our training inputs as $x^i_j = \langle s^i_j, \Pi^i, \pi^i_j \rangle$, where $\pi^i_j$ is the DAG generated by heuristic $h$ for state $s^i_j$ and goal $g^i$. The computation time of each feature affects the performance of the resulting planner in a complex way: the feature representation is computed for every state encountered in the search, but good features will make the heuristic more effective, causing fewer states to be encountered.
4.1 Single Actions
The first feature representation serves primarily as a baseline. Each feature is the number of instances of a particular action schema in the DAG $\pi$. The number of features is the number of action schemas in the domain and thus around five for many domains. This feature representation is simple and therefore limited in its expressiveness, but it can be easily computed in $O(|\pi|)$ time and is unlikely to overfit. If we are learning a linear function of $\phi(x)$, then the weights can be seen as adjustments to the predictions made by the DAG of how many instances of each action are required. So, for instance, in a domain that requires a robot to do a "move" action every time it "picks" an object, but where the delete relaxation only includes one "move" action, this representation would allow learning a weight of two on pick actions, effectively predicting the necessity of extra action instances.
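The single-action representation amounts to a histogram over action schemas. The sketch below assumes the DAG's vertices are given as (schema name, arguments) pairs; the function names and the example weights are hypothetical and only mirror the "move"/"pick" illustration above.

```python
from collections import Counter

def single_action_features(dag_actions, schemas):
    """Count the instances of each action schema among the DAG's vertices.

    dag_actions: iterable of (schema_name, args) pairs.
    schemas: fixed ordering of schema names, so every state maps to a
    vector of the same dimension.
    """
    counts = Counter(schema for schema, _args in dag_actions)
    return [counts[s] for s in schemas]

def linear_heuristic(weights, features):
    return sum(w * f for w, f in zip(weights, features))

schemas = ["move", "pick"]
relaxed_plan = [("pick", ("a",)), ("pick", ("b",)), ("move", ("r1", "r2"))]
phi = single_action_features(relaxed_plan, schemas)
# A weight of two on "pick" accounts for the extra "move" each pick requires.
estimate = linear_heuristic([1.0, 2.0], phi)
```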
4.2 Pairwise Actions
The second feature representation creates features for pairs of actions, encoding both their intersecting preconditions and effects as well as their temporal ordering in the approximate plan. First, we solve the all-pairs shortest paths problem on $\pi$ by running a BFS from each action vertex. Then, consider each pair of actions $(a_1, a_2)$ where $a_2$ descends from $a_1$, as indicated by having a finite, positive distance from $a_1$ to $a_2$ in the all-pairs shortest paths solution. This indicates that $a_2$ must come after $a_1$ on all topological sorts of the DAG; i.e., $\pi$ contains the implicit partial ordering $a_1 \prec a_2$. Moreover, if there is an edge $(a_1, a_2)$ in $\pi$, then $a_1 \prec a_2$ is an explicit partial ordering because $a_1$ directly supports $a_2$.
For every pair of action schemas $(A_1, A_2)$, we include two features, counting the number of times it happens that, for an instance $a_1$ of $A_1$ and an instance $a_2$ of $A_2$,

,

,
The current state and goal partial-state are included as dummy actions with only effects or preconditions respectively.
This feature representation is able to capture information about the temporal spread of actions in the DAG: for example, whether the DAG is composed of many short parallel sequences of actions or a single long sequence. Additionally, the inclusion of the preconditions and effects that overlap encodes interactions that are not often directly captured in the base heuristic. For example, FF and CEA make predicate independence approximations, which can result in overestimating the distance-to-go. The learner can automatically correct for these estimations if it learns that a single sequence can be used to achieve multiple predicates simultaneously.
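The ordering part of this construction can be sketched as follows. This is a simplified illustration, not the exact feature definition: it runs a BFS from every vertex and counts, per pair of schemas, the explicit orderings (direct edges) and the implicit orderings (any descendant), and it deliberately omits the conditions on overlapping preconditions and effects. The graph encoding and all names are assumptions.

```python
from collections import Counter, deque

def descendant_distances(dag, source):
    """BFS distances from `source` to every vertex reachable from it."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in dag.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    del dist[source]                   # keep only finite, positive distances
    return dist

def ordering_counts(dag, schema_of):
    """Count explicit and implicit orderings per pair of action schemas.

    dag: dict mapping a vertex to the vertices it directly supports.
    schema_of: dict mapping every vertex to its action schema name.
    """
    explicit, implicit = Counter(), Counter()
    for a1 in schema_of:
        for a2, d in descendant_distances(dag, a1).items():
            pair = (schema_of[a1], schema_of[a2])
            implicit[pair] += 1        # a2 follows a1 in every topological sort
            if d == 1:
                explicit[pair] += 1    # a1 directly supports a2
    return explicit, implicit

# Hypothetical chain: pick -> move -> drop.
dag = {"p1": ["m1"], "m1": ["d1"]}
schema_of = {"p1": "pick", "m1": "move", "d1": "drop"}
explicit, implicit = ordering_counts(dag, schema_of)
```

A long chain produces many implicit pairs, while several short parallel chains produce few, which is the "temporal spread" signal such counts expose to the learner.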
In contrast to the single-action feature representation, the computation of the pairwise representation takes $O(|\pi|^2)$ time in the worst case. However, the DAG frequently is composed of almost disjoint subplans, so in practice, the number of pairs considered is fewer than the $\binom{|\pi|}{2}$ possible pairs. Additionally, this tradeoff is still advantageous if the learner is able to produce a much better heuristic. Finally, for both the single and pairwise feature representations, we add three additional features corresponding to the original heuristic value, the number of layers present in the DAG, and the number of unsatisfied goals.
5 Models for heuristic learning
We consider two different framings of the problem of learning a heuristic function $f$. In the first, the goal is to ensure that the values $f(\phi(x))$ are an accurate estimate of the distance-to-go in the planning state and problem encoded by $x$. In the second, the goal is to ensure that the values $f(\phi(x))$ accurately rank the distance-to-go for different states within the same planning problem $\Pi$, but do not necessarily reflect the actual distance-to-go values.
These different framings of the problem lead to different loss functions to be optimized by the learner and to different optimization algorithms. Because our learning algorithms cannot optimize for search performance directly, the loss function serves as a proxy for the search performance. A good loss function will be highly correlated with the performance of learned heuristics. We restrict ourselves to linear models that learn a weight vector $w$ and make a prediction $f(\phi(x)) = \phi(x)^T w$.
5.1 Heuristic value regression
Because learning a heuristic is, at face value, a regression problem, a natural loss function is the root mean squared error (RMSE). A model with a low RMSE produces predictions close to the actual distance-to-go. Because each training problem $\Pi^i$ may produce a different number of examples $m_i$, we use the average RMSE over all $n$ problems. This ensures that we do not assign more weight to problems with more examples. If $f$ is a prediction function mapping a feature vector to the reals, then:
$$\mathrm{RMSE} = \frac{1}{n} \sum_{i=1}^{n} \sqrt{\frac{1}{m_i} \sum_{j=1}^{m_i} \left( f(\phi(x^i_j)) - y^i_j \right)^2}.$$
The first learning technique we applied is ridge regression (RR) [Hoerl and Kennard, 1970]. This serves as a baseline to compare to the results of Yoon et al. [?]. Ridge regression is a regularized version of ordinary least-squares regression (OLS). The regularization trades off optimizing the squared error against preferring a low-magnitude $w$ using a parameter $\lambda$. This results in the following optimization problem. Letting $\Phi$ be the design matrix of concatenated features $\phi(x^i_j)$ and $Y$ be the vector concatenation of $y^i_j$ for all $i, j$, we wish to find
$$\min_{w} \; \|\Phi w - Y\|^2 + \lambda \|w\|^2.$$
This technique is advantageous because it can be quickly solved in closed form for reasonably sized $\Phi$, yielding the weight vector
$$w = (\Phi^T \Phi + \lambda I)^{-1} \Phi^T Y.$$
Optimizing RMSE directly, with no penalty ($\lambda = 0$), will yield a weight vector $w$ that performs well on the training data but might not generalize well to previously unseen problems. Increasing $\lambda$ forces the magnitude of $w$ to be smaller, which prevents the resulting $f$ from "overfitting" the training data and therefore not generalizing well to new examples. This is especially important in our application as we are trying to learn a heuristic that generalizes across the full state-space from only a few representative plans.
We select an appropriate value of $\lambda$ by performing domain-wise leave-one-out cross validation (LOOCV): for different possible values of $\lambda$, and in a domain with $n$ training problems, we train on data from $n - 1$ training problems and evaluate the resulting heuristic on the remaining problem according to the RMSE loss function, and average the scores from holding out each problem instance. We select the value of $\lambda$, searched over a logarithmic scale, for which the LOOCV RMSE is minimized.
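The closed-form ridge solution and the problem-wise LOOCV loop can be sketched in plain Python. This is an illustrative toy, not the implementation used in the experiments: the small Gaussian-elimination solver, the function names, and the candidate grid are assumptions, and the solver presumes a positive penalty so that the system is well conditioned.

```python
import math

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def ridge_fit(X, y, lam):
    """Return w = (X^T X + lam I)^{-1} X^T y."""
    d = len(X[0])
    A = [[sum(row[a] * row[b] for row in X) + (lam if a == b else 0.0)
          for b in range(d)] for a in range(d)]
    rhs = [sum(row[a] * t for row, t in zip(X, y)) for a in range(d)]
    return solve(A, rhs)

def rmse(w, X, y):
    errors = [(sum(wi * xi for wi, xi in zip(w, row)) - t) ** 2 for row, t in zip(X, y)]
    return math.sqrt(sum(errors) / len(y))

def loocv_lambda(problems, lambdas):
    """Pick the penalty with the lowest average held-out-problem RMSE.

    problems: list of (X, y) pairs, one per training problem.
    """
    def score(lam):
        total = 0.0
        for i, (X_held, y_held) in enumerate(problems):
            X = [r for j, (Xj, _) in enumerate(problems) if j != i for r in Xj]
            y = [t for j, (_, yj) in enumerate(problems) if j != i for t in yj]
            total += rmse(ridge_fit(X, y, lam), X_held, y_held)
        return total / len(problems)
    return min(lambdas, key=score)

# Toy data generated exactly by w = [1, 2], so the weakest penalty should win.
problems = [([[1, 0], [2, 1]], [1, 4]),
            ([[0, 1], [1, 1]], [2, 3]),
            ([[3, 1], [1, 2]], [5, 5])]
best_lambda = loocv_lambda(problems, [1e-3, 1.0, 1e3])
w_exact = ridge_fit([[1, 0], [0, 1], [1, 1]], [1, 2, 3], 1e-9)
```

Note that whole problems, not individual examples, are held out, matching the goal of generalizing to unseen problems from the same domain.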
5.2 Learning to Rank
The RMSE, however, is not the most appropriate metric for our learning application. We are learning a heuristic for greedy search, which uses the heuristic solely to determine open list priority. The value of the heuristic per se does not govern the search performance which depends most directly on the ordering on states induced by the heuristic. In this context, any monotonically increasing function of a heuristic results in the same ranking and performance. A heuristic may have arbitrarily bad RMSE despite performing well.
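A small numeric illustration of this point, with made-up values: rescaling a heuristic by a monotonically increasing function leaves the order in which states would be expanded unchanged, while the RMSE against the true distances becomes arbitrarily large.

```python
import math

true_distance = [3, 2, 1, 0]
h = [2.5, 2.0, 0.7, 0.1]                 # hypothetical heuristic values
h_scaled = [100 * v + 7 for v in h]      # monotonically increasing transform of h

def expansion_order(values):
    """Indices of the states sorted by heuristic value, best first."""
    return sorted(range(len(values)), key=values.__getitem__)

def rmse(pred, target):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target))

same_order = expansion_order(h) == expansion_order(h_scaled)
error_original = rmse(h, true_distance)
error_scaled = rmse(h_scaled, true_distance)
```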
For these reasons, we consider the Kendall rank correlation coefficient ($\tau$), a nonparametric ranking statistic, as a loss function. It represents the normalized difference between the number of correctly ranked pairs and the number of incorrectly ranked pairs. As with the RMSE, we compute the average $\tau$ across each problem. The separation of problems is even more important here. Our $\tau$ only scores rankings between examples from the same problem, as examples from separate problems are never encountered together in the same search. This provides a major source of leverage over an ordinary regression framework. Heuristics are not penalized for producing inconsistent distance-to-go values across multiple problems, allowing them to devote more effort to improving the per-problem rankings.
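Let $c^i_{jk}(f)$ score the concordance or discordance of the ranking function $f$ for examples $x^i_j$ and $x^i_k$ from the same problem $\Pi^i$:
$$c^i_{jk}(f) = \operatorname{sign}\!\left( \left( f(\phi(x^i_j)) - f(\phi(x^i_k)) \right) \left( y^i_j - y^i_k \right) \right),$$
which is $+1$ for a concordant pair, $-1$ for a discordant pair, and $0$ when $f$ ties the two examples.
Then the Kendall rank correlation coefficient is specified by
$$\tau = \frac{1}{n} \sum_{i=1}^{n} \frac{2}{m_i (m_i - 1)} \sum_{j=1}^{m_i} \sum_{k=j+1}^{m_i} c^i_{jk}(f).$$
Note that each $y^i_j$ is unique per problem because our examples come from a single trajectory. Observe that $\tau \in [-1, 1]$; values close to one indicate that the ranking induced by the heuristic has strong positive correlation to the true ranking of states as given by the actual labels. Conversely, values close to negative one indicate strong negative correlation.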
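The per-problem averaging can be made concrete with a short sketch. It assumes each problem contributes at least two examples with distinct labels; the function names are illustrative.

```python
def sign(v):
    return (v > 0) - (v < 0)

def kendall_tau(pred, y):
    """Normalized count of concordant minus discordant pairs in one problem."""
    m = len(y)
    total = sum(sign((pred[j] - pred[k]) * (y[j] - y[k]))
                for j in range(m) for k in range(j + 1, m))
    return 2.0 * total / (m * (m - 1))

def mean_tau(problems):
    """Average tau over problems; pairs are never formed across problems.

    problems: list of (predictions, labels) pairs, one per problem.
    """
    return sum(kendall_tau(pred, y) for pred, y in problems) / len(problems)

labels = [3, 2, 1, 0]
perfect = kendall_tau([30.0, 20.0, 10.0, 0.0], labels)   # same order, different scale
reversed_ = kendall_tau([0.0, 1.0, 2.0, 3.0], labels)
one_swap = kendall_tau([3.0, 1.0, 2.0, 0.0], labels)     # one discordant pair of six
average = mean_tau([([30.0, 20.0, 10.0, 0.0], labels), ([0.0, 1.0, 2.0, 3.0], labels)])
```

The first example has a large squared error but a perfect score, which is exactly the property that makes this statistic a better proxy for greedy search.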
If our loss function is $\tau$, it is more effective to optimize $\tau$ directly in the learning process. To this end, we use Rank Support Vector Machines (RankSVM) [?]. RankSVMs are variants of SVMs which penalize the number of incorrectly ranked training examples. Like SVMs, RankSVMs also have a parameter $C$ used to provide regularization. Additionally, their formulation uses the hinge loss function to make the learning problem convex. Thus, a RankSVM finds the vector $w$ that optimizes a convex relaxation of $\tau$. Our formulation of the RankSVM additionally takes into account the fact that we only wish to rank training examples from the same problem. Our formulation is the following:
$$\min_{w, \xi} \; \frac{1}{2} \|w\|^2 + C \sum_{i=1}^{n} \sum_{(j,k) :\, y^i_j > y^i_k} \xi_{ijk}$$
s.t. $\phi(x^i_j)^T w \geq \phi(x^i_k)^T w + 1 - \xi_{ijk}$ and $\xi_{ijk} \geq 0$, for all $i$ and all $j, k$ with $y^i_j > y^i_k$.
The first constraint can also be rewritten as $(\phi(x^i_j) - \phi(x^i_k))^T w \geq 1 - \xi_{ijk}$ to look similar to the original SVM formulation. In this form, the RankSVM can be viewed as classifying whether each pair $(x^i_j, x^i_k)$ is properly ranked.
Notice that the number of constraints and slack variables, corresponding to the number of rankings, grows quadratically in the size of each problem. This makes training the RankSVM more computationally expensive than RRs or SVMs. However, there are efficient methods for training these, and other SVMs, when considering just the linear, primal form of the problem [?; ?]. It is important to note that we generate a number of constraints that is quadratic only in the length of any given training plan, and do not attempt to rank all the actions of all the training plans jointly; this allows us to increase the number of training example plans without dramatically increasing the size of the optimization problem.
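The pairwise-classification view suggests a simple way to see the objective at work: form difference vectors only within each problem and minimize the regularized hinge loss. The sketch below uses plain subgradient descent purely for illustration; it is not the efficient primal solver referred to above, and the step size, epoch count, and function names are assumptions.

```python
def rank_pairs(problems):
    """Difference vectors phi_j - phi_k for same-problem pairs with y_j > y_k."""
    diffs = []
    for feats, ys in problems:
        for j in range(len(ys)):
            for k in range(len(ys)):
                if ys[j] > ys[k]:
                    diffs.append([a - b for a, b in zip(feats[j], feats[k])])
    return diffs

def train_ranksvm(problems, C=1.0, lr=0.01, epochs=200):
    """Minimize 0.5 * ||w||^2 + C * sum(max(0, 1 - w . d)) by subgradient descent."""
    diffs = rank_pairs(problems)
    w = [0.0] * len(diffs[0])
    for _ in range(epochs):
        grad = list(w)                                   # gradient of the regularizer
        for d in diffs:
            if sum(wi * di for wi, di in zip(w, d)) < 1.0:   # hinge is active
                for i, di in enumerate(d):
                    grad[i] -= C * di
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w

# Two toy problems; pairs are formed within each problem only.
problems = [([[3, 1], [2, 1], [1, 0], [0, 0]], [3, 2, 1, 0]),
            ([[2, 0], [1, 0], [0, 0]], [2, 1, 0])]
w = train_ranksvm(problems)
margins = [sum(wi * di for wi, di in zip(w, d)) for d in rank_pairs(problems)]
```

With plans of lengths 3 and 2, this produces 6 + 3 = 9 pairs rather than the 21 that joint ranking of all seven states would require.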
An additional advantage of RankSVM is that it supports the inclusion of the nonnegativity constraint $w \geq 0$, which provides additional regularization. Because each feature represents a count of actions or action pairs, the feature values are always nonnegative, as are the target values. We generally expect that DAGs with a large number of actions indicate that the state is far from the goal. The nonnegativity constraint allows us to incorporate this prior knowledge in the model, which can sometimes improve the generalization of the learned heuristic. As in RR, we select $C$ using a line search over a logarithmic scale, to maximize a cross-validated estimate of $\tau$. As a practical note, we start with an overregularized model where $C$ is close to zero and increase $C$ until the cross-validated $\tau$ reaches a local optimum, because SVMs are trained much more efficiently for small $C$.
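Both practical devices in this paragraph are easy to sketch. The projection below is one common way to enforce nonnegative weights inside an iterative solver, and the line search walks up a logarithmic grid from a heavily regularized model until the validation score stops improving. The callback `score`, the grid bounds, and the factor of ten are assumptions of this example.

```python
import math

def project_nonnegative(w):
    """Clip negative weights to zero (a projection onto the constraint set)."""
    return [max(0.0, wi) for wi in w]

def search_regularization(score, start=1e-4, factor=10.0, limit=1e4):
    """Increase the parameter on a log scale until `score` stops improving.

    score: callable returning a cross-validated quality (higher is better),
    for example the average Kendall coefficient on held-out problems.
    """
    best_c, best_score = start, score(start)
    c = start * factor
    while c <= limit:
        s = score(c)
        if s <= best_score:
            break                      # first local optimum reached
        best_c, best_score = c, s
        c *= factor
    return best_c

# Hypothetical validation curve that peaks at 0.1.
chosen = search_regularization(lambda c: -abs(math.log10(c) + 1.0))
clipped = project_nonnegative([0.5, -0.2, 0.0])
```

Starting from the small end means the expensive, weakly regularized models are only trained if the cheaper ones keep improving.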
6 Results
elevators (35)  
Method  Cov.  Len.  Run T.  Exp.  RMSE  τ  λ/C  Feat.  Train T.  
FF Original  14  318  196  17833  34.370  0.9912  N/A  N/A  N/A 
FF RR Single  22  546  504  34970  4.091  0.9948  100  9/9  3.133 
FF RR Pair  15  561  308  20985  3.789  0.9971  1000  53/53  11.686 
FF RSVM Single  34  375  403  23765  79.867  0.9967  0.1  9/9  55.681 
FF RSVM Pair  34  631  123  7083  418.828  0.9996  1  53/53  140.786 
FF NN RSVM Pair  35  655  61  10709  46.296  0.9992  1  51/53  125.702 
CEA Original  35  397  163  4504  21.494  0.9973  N/A  N/A  N/A 
transport (35)  
FF Original  5  588  470  18103  126.193  0.8460  N/A  N/A  N/A 
FF RR Single  0  None  None  None  31.518  0.9303  100  6/6  3.569 
FF RR Pair  4  529  560  27866  27.570  0.9392  10000  32/32  11.028 
FF RSVM Single  21  1154  650  29452  149.003  0.9720  0.1  6/6  106.901 
FF RSVM Pair  20  587  178  8896  162.141  0.9797  0.001  32/32  117.808 
FF NN RSVM Pair  31  663  206  7803  141.273  0.9798  0.01  17/32  287.586 
CEA Original  9  448  542  9064  57.819  0.9314  N/A  N/A  N/A 
CEA RR Single  11  493  436  6921  33.032  0.9420  10000  6/6  4.536 
CEA RR Pair  2  609  1602  40327  30.731  0.9318  100  45/45  15.716 
CEA RSVM Single  18  722  588  11334  130.653  0.9748  0.1  6/6  158.523 
CEA RSVM Pair  31  650  225  3526  159.139  0.9804  0.0001  45/45  244.164 
CEA NN RSVM Pair  29  696  277  9006  191.064  0.9795  0.0001  29/45  528.665 
parking (10)  
FF Original  0  None  None  None  6.101  0.9525  N/A  N/A  N/A 
FF RR Single  0  None  None  None  4.571  0.9648  100  7/7  0.201 
FF RR Pair  2  156  1419  33896  4.285  0.9757  100  40/40  0.570 
FF RSVM Single  0  None  None  None  10.468  0.9745  0.01  7/7  8.423 
FF RSVM Pair  8  185  208  2852  18.262  0.9918  0.1  40/40  7.030 
FF NN RSVM Pair  6  183  358  5891  143.063  0.9941  10  26/40  7.119 
CEA Original  0  None  None  None  15.885  0.9628  N/A  N/A  N/A 
CEA RR Single  0  None  None  None  4.667  0.9669  0.01  7/7  0.277 
CEA RR Pair  1  280  1230  48180  4.448  0.9660  10  47/47  0.738 
CEA RSVM Single  0  None  None  None  7.950  0.9757  0.1  7/7  10.830 
CEA RSVM Pair  10  272  81  2147  45.823  0.9918  1  47/47  10.237 
CEA NN RSVM Pair  10  260  70  1690  140.297  0.9938  10  27/47  9.179 
nomystery (10)  
FF Original  4  31  583  5658745  3.462  0.9841  N/A  N/A  N/A 
FF RR Single  4  30  1004  8385159  1.662  0.9861  100  6/6  0.085 
FF RR Pair  2  31  700  3898861  1.622  0.9902  1000  21/21  0.193 
FF RSVM Single  1  26  1411  16201215  21.069  0.9871  100  6/6  0.712 
FF RSVM Pair  2  28  892  6894959  39.350  0.9968  1  21/21  0.914 
FF NN RSVM Pair  1  29  1049  7973003  80.588  0.9972  10  17/21  1.024 
CEA Original  3  30  73  107773  16.851  0.9579  N/A  N/A  N/A 
CEA RR Single  2  28  9  26319  1.824  0.9890  100  6/6  0.069 
CEA RR Pair  3  32  104  169434  1.717  0.9892  1000  32/32  0.342 
CEA RSVM Single  2  28  12  33559  36.457  0.9916  1  6/6  1.283 
CEA RSVM Pair  3  32  34  46501  6.358  0.9964  0.01  32/32  4.023 
CEA NN RSVM Pair  3  31  190  264225  55.141  0.9970  1  16/32  62.608 
We implemented our planners using the Fast Downward framework [Helmert, 2006]. Because heuristic values are required to be integers in this framework, we scale up and then round predicted heuristic values in order to capture more of the precision in the values. Recall that scaling will not alter the planner’s performance because positive affine transformations of the heuristic do not affect the resulting ranking in greedy search. Each planning problem is compiled to a representation similar to SAS+ [Bäckström and Nebel, 1995] using the Fast Downward preprocessor. However, the predicates that represent each SAS+ (variable, value) pair are still stored, so actions and states can be mapped back to their prior form. We used the dlib C++ machine learning library to implement the learning algorithms [?].
We experimented on four domains from the 2014 IPC learning track [?]: elevators, transport, parking, and nomystery. For each domain, we constructed a set of unique examples with the competition problem generators by sampling parameters that cover the competition parameter space. We use a variant of the 2014 Fast Downward Stone Soup portfolio planner [Helmert et al., 2011], with a large timeout and memory limit, to generate training example plans. We trained on at most 10 examples randomly selected from the set of problems our training portfolio planner was able to solve, and then tested on the remaining problems.
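The scale-then-round step can be illustrated as follows; the scale factor of 10000 and the function name are assumptions, since the text does not state the factor that was used.

```python
def to_integer_heuristic(value, scale=10000):
    """Scale a real-valued prediction and round it to the nearest integer."""
    return int(round(scale * value))

predictions = [1.2341, 1.2340, 0.5]
unscaled = [to_integer_heuristic(v, scale=1) for v in predictions]
scaled = [to_integer_heuristic(v) for v in predictions]
```

Without scaling, the first two predictions collapse to the same integer and their relative order is lost; with scaling, the order of all three is preserved.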
For each experiment, we report the following values. Cov: coverage, or total number of problems solved; Len: mean plan length; Run T: mean planning time in seconds; Exp: mean number of expansions; RMSE: RMSE of learned heuristic; $\tau$: Kendall rank correlation coefficient of learned heuristic; $\lambda$/$C$: regularization parameter value ($\lambda$ for RR and $C$ for RankSVM); Feat: number of nonzero weights learned relative to the total number of features; and Train T: runtime to train the heuristic learner in seconds. We use the arithmetic mean for plan length and geometric means for planning time and number of expansions, and report these statistics only for solved instances; RMSE and $\tau$ values are cross-validation estimates.
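A sketch of how such per-configuration summaries can be computed, assuming each run is recorded as a dictionary with the hypothetical keys shown below; unsolved runs contribute to coverage only.

```python
from statistics import geometric_mean, mean

def summarize(runs):
    """Coverage plus mean statistics over solved instances only."""
    solved = [r for r in runs if r["solved"]]
    if not solved:
        return {"coverage": 0, "length": None, "time": None, "expansions": None}
    return {
        "coverage": len(solved),
        "length": mean([r["length"] for r in solved]),               # arithmetic mean
        "time": geometric_mean([r["time"] for r in solved]),         # geometric mean
        "expansions": geometric_mean([r["expansions"] for r in solved]),
    }

runs = [
    {"solved": True, "length": 10, "time": 1.0, "expansions": 100},
    {"solved": True, "length": 20, "time": 100.0, "expansions": 10000},
    {"solved": False, "length": None, "time": None, "expansions": None},
]
summary = summarize(runs)
```

The geometric mean keeps a single very slow instance from dominating the reported time, which matters when runtimes span several orders of magnitude.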
Each planner was run on a single 2.5 GHz processor for 30 minutes with 5 GB of memory. We only include the results of the original CEA heuristic on elevators, as the default heuristic was able to solve each problem and the heuristics learned using CEA all performed similarly.
The heuristics learned by RankSVM are able to solve more problems than those learned using ridge regression. Within a domain, $\tau$ seems to be positively correlated with the number of problems solved, while the RMSE does not. The pairwise-action features outperform the single-action features in RankSVM, making it worthwhile to incur a larger heuristic evaluation time for improved heuristic strength. The CEA-learned heuristics performed slightly better than the FF-learned heuristics.
On transport and parking, the training portfolio planner was only able to solve the smallest problems within the parameter space. Thus, our RankSVM learners demonstrate the ability to learn from smaller problems and perform well on larger problems. In separate experiments, we observed that both artificially overregularized and underregularized learners performed poorly, indicating that selection of the regularization parameter is important to the learning process.
The learned heuristics perform slightly worse than the standard heuristics on nomystery despite having almost perfect $\tau$ values. In separate experiments using eager best-first search, the learned heuristics perform slightly better on nomystery, but the improvement is not significant. This domain is known to be challenging for heuristics because it contains a large number of dead ends. We observed that $\tau$ does not seem sufficient for understanding heuristic performance on domains with harmful dead ends. Our hypothesis is that failing to recognize a dead end is often more harmful than incorrectly ranking nearby states and should be handled separately from learning a heuristic. A topic for future work is to combine our learned heuristics with learned dead-end detectors.
Inclusion of the nonnegativity constraint (NN) on transport significantly improved the coverage of the FF learned heuristic over the normal RankSVM formulation. We believe that this constraint can sometimes improve generalization in domains with a large variance in size or specification. For example, the transport generator samples problems involving either two or three cities leading to a bimodal distribution of problems.
Finally, we tested two learned heuristics on the five evaluation problems per domain chosen in the IPC 2014 learning track. Both the FF RSVM Pair heuristic and the CEA RSVM Pair heuristic solved all 5/5 problems in elevators, transport, and parking but only 1/5 problems in nomystery.
7 Conclusion
Our results indicate that, for greedy search, learning a heuristic is best viewed as a ranking problem. The Kendall rank correlation coefficient is a better indicator of a heuristic’s quality than the RMSE, and it is effectively optimized using the RankSVM learning algorithm. Pairwise-action features outperformed simpler features. Further work involves combining features from several heuristics, learning complementary search control using our features, and incorporating the learned heuristics in planning portfolios.
Acknowledgments
We gratefully acknowledge support from NSF grants 1420927 and 1523767, from ONR grant N00014-14-1-0486, and from ARO grant W911NF-14-1-0433. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of our sponsors.
References
 [Arfaee et al., 2011] Shahab Jabbari Arfaee, Sandra Zilles, and Robert C Holte. Learning heuristic functions for large state spaces. Artificial Intelligence, 175(16):2075–2098, 2011.
 [Bäckström and Nebel, 1995] Christer Bäckström and Bernhard Nebel. Complexity results for SAS+ planning. Computational Intelligence, 11(4):625–655, 1995.
 [Fikes and Nilsson, 1971] Richard E. Fikes and Nils J. Nilsson. STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2:189–208, 1971.
 [Franc and Sonnenburg, 2009] Vojtěch Franc and Sören Sonnenburg. Optimized cutting plane algorithm for large-scale risk minimization. The Journal of Machine Learning Research, 10:2157–2192, 2009.
 [Helmert and Geffner, 2008] Malte Helmert and Héctor Geffner. Unifying the causal graph and additive heuristics. In ICAPS, pages 140–147, 2008.
 [Helmert et al., 2011] Malte Helmert, Gabriele Röger, Jendrik Seipp, Erez Karpas, Jörg Hoffmann, Emil Keyder, Raz Nissim, Silvia Richter, and Matthias Westphal. Fast Downward stone soup. Seventh International Planning Competition, pages 38–45, 2011.
 [Helmert, 2006] Malte Helmert. The Fast Downward planning system. Journal of Artificial Intelligence Research, 26:191–246, 2006.
 [Hoerl and Kennard, 1970] Arthur E Hoerl and Robert W Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55–67, 1970.
 [Hoffmann and Nebel, 2001] Jörg Hoffmann and Bernhard Nebel. The FF planning system: Fast plan generation through heuristic search. Journal of Artificial Intelligence Research, 14:253–302, 2001.
 [Hoffmann, 2005] Jörg Hoffmann. Where ‘ignoring delete lists’ works: Local search topology in planning benchmarks. Journal of Artificial Intelligence Research, pages 685–758, 2005.
 [Hoffmann, 2011] Jörg Hoffmann. Where ignoring delete lists works, part II: Causal graphs. In ICAPS, 2011.
 [Jiménez et al., 2012] Sergio Jiménez, Tomás De la Rosa, Susana Fernández, Fernando Fernández, and Daniel Borrajo. A review of machine learning for automated planning. The Knowledge Engineering Review, 27(04):433–467, 2012.
 [Joachims, 2002] Thorsten Joachims. Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133–142. ACM, 2002.
 [Joachims, 2006] Thorsten Joachims. Training linear SVMs in linear time. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 217–226. ACM, 2006.
 [King, 2009] Davis E King. Dlib-ml: A machine learning toolkit. The Journal of Machine Learning Research, 10:1755–1758, 2009.
 [Nakhost, 2010] Hootan Nakhost. Action elimination and plan neighborhood graph search: Two algorithms for plan improvement. In ICAPS, 2010.
 [Richter and Helmert, 2009] Silvia Richter and Malte Helmert. Preferred operators and deferred evaluation in satisficing planning. In ICAPS, 2009.
 [Vallati et al., 2015] M. Vallati, L. Chrpa, M. Grzes, T.L. McCluskey, M. Roberts, and S. Sanner. The 2014 international planning competition: Progress and trends. AI Magazine, 2015.
 [Virseda et al., 2013] Jesús Virseda, Daniel Borrajo, and Vidal Alcázar. Learning heuristic functions for cost-based planning. Planning and Learning, page 6, 2013.
 [Wilt and Ruml, 2012] Christopher Makoto Wilt and Wheeler Ruml. When does weighted A* fail? In SOCS, 2012.
 [Wilt and Ruml, 2015] Christopher Makoto Wilt and Wheeler Ruml. Building a heuristic for greedy search. In Eighth Annual Symposium on Combinatorial Search, 2015.
 [Xu et al., 2007] Yuehua Xu, Alan Fern, and Sung Wook Yoon. Discriminative learning of beam-search heuristics for planning. In IJCAI, 2007.
 [Xu et al., 2009] Yuehua Xu, Alan Fern, and Sungwook Yoon. Learning linear ranking functions for beam search with application to planning. The Journal of Machine Learning Research, 10:1571–1610, 2009.
 [Xu et al., 2010] Yuehua Xu, Alan Fern, and Sung Wook Yoon. Iterative learning of weighted rule sets for greedy search. In ICAPS, pages 201–208, 2010.
 [Yoon et al., 2006] Sung Wook Yoon, Alan Fern, and Robert Givan. Learning heuristic functions from relaxed plans. In ICAPS, 2006.
 [Yoon et al., 2008] Sungwook Yoon, Alan Fern, and Robert Givan. Learning control knowledge for forward search planning. The Journal of Machine Learning Research, 9:683–718, 2008.