Active model learning and diverse action sampling for task and motion planning
Abstract
The objective of this work is to augment the basic abilities of a robot by learning to use new sensorimotor primitives to enable the solution of complex long-horizon problems. Solving long-horizon problems in complex domains requires flexible generative planning that can combine primitive abilities in novel combinations to solve problems as they arise in the world. In order to plan to combine primitive actions, we must have models of the preconditions and effects of those actions: under what circumstances will executing this primitive achieve some particular effect in the world?
We use, and develop novel improvements on, state-of-the-art methods for active learning and sampling. We use Gaussian process methods for learning the conditions of operator effectiveness from small numbers of expensive training examples collected by experimentation on a robot. We develop adaptive sampling methods for generating diverse elements of continuous sets (such as robot configurations and object poses) during planning for solving a new task, so that planning is as efficient as possible. We demonstrate these methods in an integrated system, combining newly learned models with an efficient continuous-space robot task and motion planner to learn to solve long-horizon problems more efficiently than was previously possible.
1 Introduction
For a robot to be effective in a domain that combines novel sensorimotor primitives, such as pouring or stirring, with long-horizon, high-level task objectives, such as cooking a meal or making a cup of coffee, it is necessary to acquire models of these primitives to use in planning robot motions and manipulations. These models characterize (a) the conditions under which a primitive is likely to succeed and (b) the effects of the primitive on the state of the world.
Figure 1 illustrates several instances of a parameterized motor primitive for pouring in a simple two-dimensional domain. The primitive action has control parameters that govern the rate at which the cup is tipped and the target velocity of the poured material. In addition, several properties of the situation in which the pouring occurs are very relevant for its success: the robot configuration, the pouring cup's pose and size, and the target cup's pose and size. To model the effects of the action, we need to specify the resulting robot configuration and the resulting pose of the pouring cup. Only for some settings of the parameters is the action feasible: one key objective of our work is to efficiently learn a representation of the feasible region of this parameter space.
For learning this model, each training example requires running the primitive, which is expensive on real robot hardware and even in highfidelity simulation. To minimize the amount of training data required, we actively select each setting in which the primitive is executed, with the goal of obtaining as much information as possible about how to use the primitive. This results in a dramatic reduction in required examples over our preliminary work [1] on this problem.
Given a learned model of a primitive, we utilize existing sample-based algorithms for task and motion planning (TAMP) to find plans. To use the model within the planner, it is necessary to select candidate instances of the primitives for expansion during the search for a plan. The objective here is not to gain information, but to select primitive instances that are likely to be successful. It is not enough to select a single instance, however, because there may be other circumstances that make a particular instance infeasible within a planning context: for example, although the most reliable way to grasp a particular object might be from the top, the robot might encounter the object situated in a cupboard in a way that makes the top grasp infeasible. Thus, our action-sampling mechanism must select instances that are both likely to succeed and diverse from one another, so that the planner has coverage of the space of possible actions.
One difficulty in sampling is that the feasible region often inhabits a lower-dimensional submanifold of the space in which it is defined, because some relations among robot configurations and object poses, for example, are functional. The STRIPStream planner [2, 3] introduced a strategy for sampling from such dimensionality-reducing constraints by constructing conditional samplers that, given values of some variables, generate values of the other variables that satisfy the constraint. Our goal in this paper is to learn and use conditional samplers within the STRIPStream planner.
Our technical strategy for addressing the problems of (a) learning success constraints and (b) generating diverse samples is based on an explicit representation of uncertainty about an underlying scoring function that measures the quality, or likelihood of success, of a parameter vector. We use Gaussian process (GP) techniques to sample for information gathering during learning, and for success probability and diversity during planning. We begin by describing some basic background, discuss related work, describe our methods in technical detail, and then present experimental results of learning and planning with several motor primitives in a two-dimensional dynamic simulator.
2 Problem formulation and background
We will focus on the formal problem of learning and using a conditional sampler that, given a vector of contextual parameters θ, generates a vector of parameters x conditioned on θ. We assume in the following that the domain of x is a hyperrectangular space B ⊂ R^d, but generalization to other topologies is possible. The conditional sampler generates samples x such that (θ, x) ∈ A, where A characterizes the set of world states and parameters for which the skill is feasible. We assume that A can be expressed in the form of an inequality constraint s(θ, x) > 0, where s is a scoring function with arguments θ and x. We denote the super-level-set of the scoring function given θ by A_θ = {x ∈ B : s(θ, x) > 0}. For example, the scoring function for pouring might be the proportion of poured liquid that actually ends up in the target cup, minus some target proportion. We assume the availability of values of such a score function during training, rather than just binary labels of success or failure. In the following, we give basic background on two important components of our method: Gaussian processes and STRIPStream.
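As a toy illustration of this formulation, the sketch below (the 0.9 target proportion and the one-line simulator stub are assumptions for illustration, not values from this paper) evaluates a pouring-style scoring function and tests super-level-set membership:

```python
import numpy as np

def pouring_score(theta, x, simulate_pour):
    """Scoring function s(theta, x): proportion of particles that land in
    the target cup minus a target proportion (0.9 here, an assumed value).
    `simulate_pour` runs the primitive and returns the poured proportion."""
    return simulate_pour(theta, x) - 0.9

def in_super_level_set(theta, x, score_fn):
    # x is feasible in context theta iff s(theta, x) > 0.
    return score_fn(theta, x) > 0

# Toy stand-in for the simulator: the poured proportion depends only on
# how far the first action parameter is from an ideal value of 0.
toy_sim = lambda theta, x: float(np.clip(1.0 - abs(x[0]), 0.0, 1.0))
score = lambda theta, x: pouring_score(theta, x, toy_sim)

assert in_super_level_set(None, np.array([0.02]), score)     # 0.98 - 0.9 > 0
assert not in_super_level_set(None, np.array([0.5]), score)  # 0.5 - 0.9 < 0
```

In the paper's setting, the simulator stub would be replaced by an execution of the primitive in the Box2D kitchen, with θ and x the context and action parameters.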
Gaussian processes (GPs) are distributions over functions, and popular priors for Bayesian nonparametric regression. In a GP, any finite set of function values has a multivariate Gaussian distribution. In this paper, we use a Gaussian process GP(0, k), which has mean zero and covariance (kernel) function k(·, ·). Let f be a true underlying function sampled from GP(0, k). Given a set of observations D_n = {(x_i, y_i)}_{i=1}^n, where y_i is an evaluation of f at x_i corrupted by i.i.d. additive Gaussian noise ε_i ~ N(0, ν²), we obtain a posterior GP, with mean μ(x) = k_n(x)ᵀ(K_n + ν²I)⁻¹ y_n and covariance k(x, x') − k_n(x)ᵀ(K_n + ν²I)⁻¹ k_n(x'), where the kernel matrix is [K_n]_{ij} = k(x_i, x_j), k_n(x) = [k(x_1, x), …, k(x_n, x)]ᵀ, and y_n = [y_1, …, y_n]ᵀ [4]. With slight abuse of notation, we denote the posterior variance by σ²(x), and the posterior GP by GP(μ, σ).
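The posterior formulas above can be made concrete with a minimal sketch (zero-mean GP on 1-D inputs, an assumed squared exponential kernel, and an assumed noise level):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared exponential kernel k(a, b) for 1-D input arrays."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X, y, Xq, noise=1e-2, lengthscale=1.0):
    """Posterior mean and variance of a zero-mean GP at query points Xq."""
    K = rbf(X, X, lengthscale) + noise * np.eye(len(X))   # K_n + noise*I
    Ks = rbf(Xq, X, lengthscale)                           # k_n(x) rows
    mu = Ks @ np.linalg.solve(K, y)
    var = 1.0 - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
    return mu, var

X = np.array([0.0, 1.0]); y = np.array([0.0, 1.0])
mu, var = gp_posterior(X, y, np.array([0.0, 2.0]))
# Near the data the mean tracks y and the variance is small; far away
# the variance reverts toward the prior variance of 1.
```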
STRIPStream [3] is a framework for incorporating black-box sampling procedures in a planning language. It extends the STRIPS planning language [5] by adding streams, declarative specifications of conditional generators. Streams have previously been used to model off-the-shelf motion planners, collision checkers, and inverse kinematics solvers. In this work, we learn new conditional generators, such as samplers for pouring, and incorporate them using streams.
3 Related Work
Our work draws ideas from model learning, probabilistic modeling of functions, and task and motion planning (TAMP).
There is a large body of work on learning individual motor primitives such as pushing [6, 7], scooping [8], and pouring [9, 10, 11, 12, 13]. We focus on the task of learning models of these primitives suitable for multi-step planning. We extend a particular formulation of planning model learning [1], in which constraint-based preimage models are learned for parameterized action primitives, by giving a probabilistic characterization of the preimage and using these models during planning.
Other approaches exist to learning models of the preconditions and effects of sensorimotor skills suitable for planning. One [14] constructs a completely symbolic model of skills that enables purely symbolic task planning. Our method, on the other hand, learns hybrid models involving continuous parameters. Another [15] learns image classifiers for preconditions but does not support general-purpose planning.
We use GP-based level-set estimation [16, 17, 4, 18] to model the feasible regions (super-level-sets of the scoring function) of action parameters. We use the straddle algorithm [16] to actively sample near the function threshold, in order to estimate, with high probability, the super-level-set of values that satisfy the constraint. Our methods can be extended to other function approximators that give uncertainty estimates, such as Bayesian neural networks and their variants [19, 20].
Determinantal point processes (DPPs) [21] are typically used for diversity-aware sampling. However, both sampling from a continuous DPP [22] and learning the kernel of a DPP [23] are challenging.
Several approaches to TAMP utilize generators to enumerate infinite sequences of values [24, 25, 2]. Our learned samplers can be incorporated in any of these approaches. Additionally, some recent papers have investigated learning effective samplers within the context of TAMP. Chitnis et al. [26] frame learning plan parameters as a reinforcement learning problem and learn a randomized policy that samples from a discrete set of robot base and object poses. Kim et al. [27] proposed a method for selecting from a discrete set of samples by scoring new samples based on their correlation with previously attempted samples. In subsequent work, they instead train a Generative Adversarial Network to directly produce a distribution of satisfactory samples [28].
4 Active sampling for learning and planning
Our objective in the learning phase is to efficiently gather data to characterize the conditional super-level-sets with high confidence. We use a GP on the score function to select informative queries using a level-set estimation approach. Our objective in the planning phase is to select a diverse set of samples that are likely to satisfy the constraint. We do this in two steps: first, we use a novel risk-aware sampler to generate values that satisfy the constraint with high probability; second, we integrate this sampler with STRIPStream, where we generate samples from this set that represent its diversity, in order to expose the full variety of choices to the planner.
4.1 Actively learning the constraint with a GP
Our goal is to be able to sample from the super-level-set A_θ for any given context θ, which requires learning the decision boundary s(θ, x) = 0. During training, we select context values from a distribution reflecting naturally occurring contexts in the underlying domain. Note that learning the level set is a different objective from learning the function values well everywhere, and so it must be handled differently from typical GP-based active learning.
For each context value θ in the training set, we apply the straddle algorithm [16] to actively select samples x for evaluation by running the motor primitive. After each new evaluation of s(θ, x) is obtained, the dataset is augmented with the pair ((θ, x), s(θ, x)) and used to update the GP. The straddle algorithm selects the x that maximizes the acquisition function α(x) = 1.96 σ(x) − |μ(x)|. It has a high value for values of x that are near the boundary s = 0 for the given θ, or for which the score function is highly uncertain. The multiplier 1.96 is selected such that if α(x) is negative, x has less than a 5 percent chance of being in the level set. In practice, this heuristic has been observed to deliver state-of-the-art learning performance for level-set estimation [18, 17]. After each new evaluation, we retrain the Gaussian process by maximizing its marginal data likelihood with respect to its hyperparameters. Alg. 1 specifies the algorithm; GP-predict computes the posterior mean and variance as explained in Sec. 2.
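A small sketch of straddle selection over a candidate grid (the posterior means and standard deviations below are made-up values for illustration):

```python
import numpy as np

def straddle(mu, sigma):
    """Straddle acquisition: high near the boundary s = 0 or where the
    score is uncertain; 1.96 gives the ~95% confidence band."""
    return 1.96 * sigma - np.abs(mu)

# Pick the next x to evaluate from a candidate grid of posterior values.
mu = np.array([2.0, 0.1, -0.1, -3.0])     # confident-in, boundary, boundary, confident-out
sigma = np.array([0.1, 0.5, 0.5, 0.1])
next_idx = int(np.argmax(straddle(mu, sigma)))
# The boundary candidates win; the confidently classified ones score low.
```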
4.2 Risk-aware adaptive sampling for constraint satisfaction
Now we can use this Bayesian estimate of the scoring function to select action instances for planning. Given a new context θ, which need not have occurred in the training set (the GP will provide generalization over contexts), we would like to sample a sequence of values x_1, x_2, … such that, with high probability, s(θ, x_i) > 0 for every i. In order to guarantee this, we adopt a concentration bound and a union bound on the predictive scores of the samples. Notice that by construction of the GP, the predictive scores are Gaussian random variables. The following is a direct corollary of Lemma 3.2 of [29].
Corollary 1.
Let δ ∈ (0, 1), and set ζ_i = (2 log(π_i/(2δ)))^{1/2}, where Σ_{i≥1} π_i⁻¹ ≤ 1, π_i > 0. If μ(x_i) − ζ_i σ(x_i) ≥ 0 for every i, then Pr(∀i, s(θ, x_i) > 0) ≥ 1 − δ.
Define the high-probability super-level-set of s given context θ as Â_θ = {x ∈ B : μ(x) − ζ σ(x) ≥ 0}, where ζ is picked according to Corollary 1. If we draw samples from Â_θ, then with probability at least 1 − δ, all of the samples will satisfy the constraint s(θ, x) > 0.
In practice, however, for a ζ chosen according to Corollary 1, the set Â_θ may be empty. In that case, we can relax our criterion to include the set of values whose probability of satisfying the constraint is within 5% of the value that is currently estimated to have the highest such probability: Â_θ = {x ∈ B : Φ(μ(x)/σ(x)) ≥ 0.95 max_{x'∈B} Φ(μ(x')/σ(x'))}, where Φ is the cumulative distribution function of the standard normal distribution.
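Both set definitions can be sketched directly from the posterior; in this sketch the ζ value and the posterior values are assumptions for illustration, while the 0.95 fraction mirrors the 5% relaxation described above:

```python
import math

def phi(z):
    # Standard normal cumulative distribution function.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def high_prob_set(mu, sigma, zeta):
    """Indices i with mu_i - zeta * sigma_i >= 0 (Corollary 1 style)."""
    return [i for i, (m, s) in enumerate(zip(mu, sigma))
            if m - zeta * s >= 0]

def relaxed_set(mu, sigma, frac=0.95):
    """Fallback set: points whose success probability Phi(mu/sigma) is
    within `frac` of the best candidate's."""
    p = [phi(m / s) for m, s in zip(mu, sigma)]
    best = max(p)
    return [i for i, pi in enumerate(p) if pi >= frac * best]

mu = [0.5, 0.05, -0.2]; sigma = [0.1, 0.1, 0.1]
hi = high_prob_set(mu, sigma, zeta=2.0)   # only the clearly positive point
rel = relaxed_set(mu, sigma)              # here it coincides with hi
```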
Figure 2 illustrates the computation of Â_θ. The green line is the true hidden scoring function; the blue symbols are the training data, gathered using the straddle algorithm; the red line is the posterior mean μ; the pink regions show the two-standard-deviation bounds based on σ; and the black line segments are the high-probability super-level-set Â_θ. We can see that sampling has concentrated near the boundary, that Â_θ is a subset of the true super-level-set, and that as the posterior uncertainty decreases through experience, Â_θ will approach the true super-level-set.
To sample from Â_θ, one simple strategy is rejection sampling with a proposal distribution that is uniform on the search bounding box B. However, the feasible region of a constraint is typically much smaller than B, which means that uniform sampling will have a very low chance of drawing samples within Â_θ, and so rejection sampling will be very inefficient. We address this problem with a novel adaptive sampler, which draws new samples from the neighborhood of the samples that are already known to be feasible with high probability, and then reweights these new samples using importance weights.
The algorithm AdaptiveSampler takes as input the posterior GP parameters μ and σ and the context vector θ, and yields a stream of samples. It begins by computing Â_θ and initializing its buffer with the x that is most likely to satisfy the constraint. It then maintains a buffer of at least n samples, and yields the first one each time a sample is required; it technically never returns, but generates a sample each time it is called. The main work is done by SampleBuffer, which constructs a mixture of truncated Gaussian distributions (TGMM), specified by mixture weights, means, a circular variance with parameter α, and bounds B. The parameter α indicates how far from known good values it is reasonable to search; it is increased if a large portion of the samples from the TGMM are accepted, and decreased otherwise. The algorithm iterates until it has constructed a set of at least n samples from Â_θ. It samples m elements from the TGMM and retains those that are in Â_θ. Then, it computes "importance weights" that are inversely related to the probability of drawing each retained sample from the current TGMM. This tends to spread the mass of the sampling distribution away from the current samples, while keeping it concentrated in the target region. A set of uniform samples is also drawn and filtered, again to maintain the chance of dispersing to good regions that are far from the initialization. The weights associated with the old samples as well as the newly drawn ones are concatenated and normalized into a distribution, the new samples are added to the buffer, and the loop continues. When at least n samples have been obtained, elements are sampled from the buffer according to this distribution, without replacement.
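A simplified one-dimensional sketch of the buffer-growing loop follows; the Gaussian proposals, membership test, and weights are schematic stand-ins for the full TGMM machinery, and all numeric settings are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_buffer(seeds, in_set, lo, hi, alpha=0.3, n_target=20, m=50):
    """Grow a buffer of high-probability-feasible samples (1-D sketch).
    seeds: known-feasible points; in_set(x): membership test for the
    estimated feasible region; [lo, hi]: the bounding box; alpha: how far
    from known-good values to search."""
    buf, wts = list(seeds), [1.0] * len(seeds)
    while len(buf) < n_target:
        # Propose near known-good samples (truncated-Gaussian mixture).
        centers = rng.choice(buf, size=m)
        prop = rng.normal(centers, alpha)
        prop = prop[(prop >= lo) & (prop <= hi)]
        for x in prop:
            if not in_set(x):
                continue
            # Importance weight: inverse of the (unnormalized) mixture
            # density at x, spreading mass away from collected samples.
            dens = np.mean([np.exp(-0.5 * ((x - c) / alpha) ** 2)
                            for c in buf])
            buf.append(float(x)); wts.append(1.0 / (dens + 1e-9))
        # Occasional uniform proposals reach far-away feasible regions.
        for x in rng.uniform(lo, hi, size=5):
            if in_set(x):
                buf.append(float(x)); wts.append(1.0)
    p = np.array(wts) / np.sum(wts)
    return buf, p

# Feasible region is two disjoint intervals; the seed knows only one.
in_set = lambda x: (0.1 < x < 0.3) or (0.7 < x < 0.9)
buf, p = sample_buffer([0.2], in_set, 0.0, 1.0)
```

The uniform proposals let the buffer eventually cover the second interval even though the seed lies in the first.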
It is easy to see that as n goes to infinity, by sampling from the discrete buffer according to the reweighted probabilities, we are essentially sampling uniformly at random from Â_θ. This is because each sample's importance weight is inversely proportional to the density q(x) of the proposal distribution from which it was drawn. For uniform sampling, q(x) = 1/V(B), where V(B) is the volume of B; and for sampling from the truncated mixture of Gaussians, q(x) is the probability density of the TGMM. In practice, n is finite, but this method is much more efficient than rejection sampling.
4.3 Diversity-aware sampling for planning
Now that we have a sampler that can generate approximately uniformly random samples within the region of values that satisfy the constraints with high probability, we can use it inside a planning algorithm for continuous action spaces. Such planners perform backtracking search, potentially needing to consider multiple different parameterized instances of a particular action before finding one that will work well in the overall context of the planning problem. The efficiency of this process depends on the order in which samples of action instances are generated. Intuitively, when previous samples of this action for this context have failed to contribute to a successful plan, it would be wise to try new samples that, while still having high probability of satisfying the constraint, are as different from those that were previously tried as possible. We need, therefore, to consider diversity when generating samples; but the precise characterization of useful diversity depends on the domain in which the method is operating. We address this problem by adapting a kernel that is used in the sampling process, based on experience in previous planning problems.
Diversity-aware sampling has been studied extensively with determinantal point processes (DPPs) [21]. We begin with similar ideas and adapt them to the planning domain, quantifying the diversity of a set of samples X = {x_i} using the determinant of a Gram matrix: D(X) = log det(I + η⁻² K_X), where [K_X]_{ij} = k(x_i, x_j), k is a covariance function, and η is a free parameter (we use a fixed value in our experiments). In DPPs, the determinant of the Gram matrix can be interpreted as the squared volume spanned by the feature-space embeddings of the elements of X. Alternatively, one can interpret D(X) as (twice) the information gain of a GP when the function values on X are observed [30]; this GP has kernel k and observation noise η². Because of the submodularity and monotonicity of D(·), we can maximize D greedily with the promise that the greedily chosen set achieves at least a (1 − 1/e) fraction of the optimal value. In fact, greedily maximizing the marginal gain D(X ∪ {x}) − D(X) is equivalent to maximizing
σ²(x | X) = k(x, x) − k_X(x)ᵀ(K_X + η² I)⁻¹ k_X(x),
which is exactly the posterior variance of a GP.
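The greedy maximization implied by this equivalence can be sketched as follows: each step adds the candidate with the largest GP posterior variance given the points chosen so far (the kernel, noise level, and candidates are illustrative):

```python
import numpy as np

def greedy_diverse(candidates, k_fn, n_pick, eta=0.1):
    """Greedily pick n_pick points maximizing the Gram-determinant
    diversity; each step adds the candidate with the largest posterior
    variance given the chosen set (submodular, so greedy is near-optimal)."""
    chosen = []
    for _ in range(n_pick):
        best, best_var = None, -np.inf
        for x in candidates:
            if any(np.allclose(x, c) for c in chosen):
                continue
            if chosen:
                C = np.array(chosen)
                Kcc = k_fn(C, C) + eta**2 * np.eye(len(C))
                kxc = k_fn(np.array([x]), C)[0]
                var = k_fn(np.array([x]), np.array([x]))[0, 0] \
                      - kxc @ np.linalg.solve(Kcc, kxc)
            else:
                var = k_fn(np.array([x]), np.array([x]))[0, 0]
            if var > best_var:
                best, best_var = x, var
        chosen.append(best)
    return chosen

# Squared exponential kernel on 1-D points.
k_fn = lambda A, B: np.exp(-0.5 * (A[:, None] - B[None, :]) ** 2)
picks = greedy_diverse([0.0, 0.05, 1.0], k_fn, 2)
# The near-duplicate 0.05 is skipped in favor of the distant 1.0.
```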
The DiverseSampler procedure is very similar in structure to AdaptiveSampler, but rather than selecting an arbitrary element of the buffer of good samples to return, we track the set of samples that have already been returned and, on each iteration, yield the element of the buffer that is most diverse from that set. In addition, the sampler exposes the samples it has yielded so that the kernel can be adapted as described in Alg. 4.
It is typical to learn the kernel parameters of a GP or DPP given supervised training examples of function values or diverse sets, but those are not available in our setting; we can only observe which samples are accepted by the planner and which are not. We derive our notion of similarity by assuming that all samples rejected by the planner are similar. Under this assumption, we develop an online learning approach that adapts the kernel parameters to learn a good diversity metric for a sequence of planning tasks.
We use a squared exponential kernel of the form k(x, x') = exp(−Σ_l w_l d_l²(x, x')), where d_l(x, x') is the rescaled "distance" between x and x' on the l-th feature and w_l is the inverse lengthscale of that feature. Let x be the sample that failed, and let X be the set of samples drawn before x. We define the importance of the l-th feature as
γ_l = σ_l²(x | X) = k_l(x, x) − k_l(x, X)(K_l + η² I)⁻¹ k_l(X, x),
which is the conditional variance if we ignore the distance contribution of all features except the l-th; that is, k_l(x, x') = exp(−w_l d_l²(x, x')). Note that we keep the same η for all the features so that the inverse only needs to be computed once.
The diverse sampling procedure is analogous to the weighted majority algorithm [31] in that each feature is seen as an expert that contributes to the conditional variance term, which measures how diverse x is with respect to X. The contribution of feature l is measured by γ_l. If x was rejected by the planner, we decrease the inverse lengthscale of the feature l* = argmax_l γ_l by a multiplicative discount, because feature l* contributed the most to the decision that x was most different from X.
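A sketch of this per-feature update follows; the per-feature conditional variance mirrors the definition above, while the multiplicative discount of 0.7 and the noise level are assumed values, since the paper's constants are not given here:

```python
import numpy as np

def update_inverse_lengthscales(w, x_failed, X_prev, gamma=0.7, eta=0.1):
    """After the planner rejects x_failed, find the feature whose
    single-feature conditional variance (its 'importance') is largest,
    and shrink its inverse lengthscale by gamma (an assumed factor)."""
    importances = []
    for l, wl in enumerate(w):
        # Per-feature kernel: only dimension l contributes distance.
        d = X_prev[:, l] - x_failed[l]
        kxX = np.exp(-wl * d**2)
        D = X_prev[:, l][:, None] - X_prev[:, l][None, :]
        K = np.exp(-wl * D**2) + eta**2 * np.eye(len(X_prev))
        importances.append(1.0 - kxX @ np.linalg.solve(K, kxX))
    l_star = int(np.argmax(importances))
    w = w.copy()
    w[l_star] *= gamma
    return w, l_star

# The rejected sample differs from the previous one only on feature 1,
# so feature 1 drove the (bad) diversity decision and gets discounted.
w_new, l_star = update_inverse_lengthscales(
    np.array([1.0, 1.0]), np.array([0.0, 0.0]), np.array([[0.0, 2.0]]))
```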
Alg. 4 depicts a scenario in which the kernel is updated during interactions with a planner; it is simplified in that it uses a single sampler, but in our experimental applications there are many instances of action samplers in play during a single execution of the planner. Given a sequence of tasks presented to the planner, we can continue to apply this kernel update, molding our diversity measure to the demands of the distribution of tasks in the domain. This simple strategy for kernel learning may lead to a significant reduction in planning time, as we demonstrate in the next section.
5 Experiments
We show the effectiveness and efficiency of each component of our method independently, and then demonstrate their collective performance in the context of planning for longhorizon tasks in a highdimensional continuous domain.
To test our algorithms, we implemented a simulated 2D kitchen based on the physics engine Box2D [32]. Fig. 3 shows several scenes indicating the variability of arrangements of objects in the domain. We use bidirectional RRT [33] to implement motion planning. The parameterized primitive motor actions are: moving the robot (a simple "free-flying" hand), picking up an object, placing an object, pushing an object, filling a cup from a faucet, pouring a material out of a cup, scooping material into a spoon, and dumping material from a spoon. The gripper has 3 degrees of freedom (2D position and rotation). The material to be poured or scooped is simulated as small circular particles.
We learn models and samplers for three of these action primitives: pouring (4 context parameters, 4 predicted parameters), scooping (2 context parameters, 7 predicted parameters), and pushing (2 context parameters, 6 predicted parameters). The actions are represented by a trajectory of waypoints for the gripper, relative to the object it is interacting with. For pouring, the scoring function is an increasing function of the proportion of the liquid particles that are poured into the target cup, and the constraint s > 0 requires that at least a target proportion of the particles be poured correctly into the target cup. The context of pouring includes the sizes of the two cups, with widths and heights each varying over a fixed range (units in Box2D). For scooping, we use the proportion of the capacity of the scoop that is filled with liquid particles; the scoring function is an increasing function of the proportion of the spoon filled with particles. We fix the size of the spoon and learn the action parameters for different cup sizes. For pushing, the scoring function measures how close the position of the pushed object after the pushing action is to the goal position; here the goal position is the context. The pushing action learned in Sec. 5.1 has the same setting as [1], viewing the gripper and object from a bird's-eye view. We will make the code for the simulation and learning methods public at https://github.com/ziw/Kitchen2D.
5.1 Active learning for conditional samplers
We demonstrate the performance of using a GP with the straddle algorithm (gplse) to estimate the level set of the constraints on parameters for pushing, pouring, and scooping. For comparison, we also implemented a simple method [1], which uses a neural network to map (θ, x) pairs to the probability of success using a logistic output. Given a partially trained network and a context θ, the x with the highest predicted probability of success is chosen for execution. Its success or failure is observed, and the network is then retrained with this added data point. We refer to this method as the classification-based baseline in the results. In addition, we implemented a regression-based variant that predicts the score with a linear output layer but, given a context, still chooses the x maximizing the predicted score. We also compare to random sampling of x values, without any training.
gplse is able to learn much more efficiently than the other methods. Fig. 4 shows the accuracy of the first action parameter vector recommended by each method (value 1 if the action with those parameters actually succeeds and 0 otherwise) as a function of the number of actively gathered training examples. gplse recommends its first x by maximizing the probability that the constraint is satisfied. The neural-network methods recommend their first x by maximizing the network output, while random always selects uniformly at random from the domain of x. In every case, the GP-based method achieves perfect or high accuracy well before the others, demonstrating the effectiveness of uncertainty-driven active sampling methods.
5.2 Adaptive sampling and diverse sampling
Given a probabilistic estimate of the desirable set of values, obtained by a method such as gplse, the next step is to sample values from that set to use in planning. We compare simple rejection sampling using a uniform proposal distribution (rejection), the basic adaptive sampler from Sec. 4.2 (adaptive), and the diversity-aware sampler from Sec. 4.3 with a fixed kernel (diverse); the results are shown in Table 1.

Table 1: false positive rate FP (%), sampling time (s), number of samples needed to obtain 5 positives, and diversity, for the rejection, adaptive, and diverse samplers on the pour, scoop, and push tasks.
*1 out of 50 experiments failed (to generate 50 samples within the time limit); 49 out of 50 failed; 34 out of 50 failed; 5 out of 16 experiments failed (to generate 5 positive samples within the sample budget); 7 out of 50 failed; 11 out of 50 failed.
We report the false positive rate (the proportion of generated samples that do not satisfy the true constraint) on 50 samples (FP), the time to generate these samples, the total number of samples required to find 5 positive samples, and the diversity of those samples. We limit the CPU time for gathering samples (running with Python 2.7.13 and Ubuntu 14.04 on an Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz with 64GB memory). If no sample is returned within the time limit, we do not include that experiment in the reported results except for the sampling time; hence the reported sampling time may be a lower bound on the actual sampling time. The diversity term is measured using a squared exponential kernel with fixed inverse lengthscales. We run the sampling algorithm for an additional 50 iterations (a maximum of 100 samples in total) until we have 5 positive examples, and use these samples to report diversity. We also report the total number of samples needed to achieve 5 positive ones. If a method is not able to get 5 positive samples within the sample budget, we report failure and do not include that run in the diversity or sample-count metrics.
diverse uses slightly more samples than adaptive to achieve 5 positive ones, and its false positive rate is slightly higher than adaptive's, but the diversity of its samples is notably higher. The FP rate of diverse can be decreased by increasing the confidence bound on the level set. We illustrate the ending poses of the 5 pouring actions generated by the diverse and adaptive samplers in Fig. 5, which shows that diverse is able to generate more diverse action parameters, which may facilitate planning.
5.3 Learning kernels for diverse sampling in planning
Table 2: runtime and success rate (SR) of adaptive, diversegk, and diverselk after training: Task I (runtime in ms; SR under 0.2s and 0.02s planner time limits), and Tasks II and III (runtime in s; SR under 60s and 6s limits).
In the final set of experiments, we explore the effectiveness of the diverse sampling algorithm with task-level kernel learning. We compare adaptive, diverse sampling with a given fixed kernel (diversegk), and diverse sampling with a learned kernel (diverselk), in every case using a high-probability super-level-set estimated by a GP. In diverselk, the kernel is updated after each planning task, as described in Sec. 4.3.
We define the planning reward of a sampler to be R = Σ_i γ^i b_i, where b_i is the indicator variable for whether the i-th sample from the sampler helped the planner generate the final plan for a particular task instance. The reward is discounted by γ^i with 0 < γ < 1, so that earlier samples get higher rewards. We average the rewards over tasks drawn from a predefined distribution and, by setting a time limit on the planner, effectively report a lower bound on the expected reward.
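The reward metric is straightforward to compute; in this sketch the discount γ = 0.6 is an assumed value:

```python
def planner_reward(helped, gamma=0.6):
    """Discounted reward over the samples a sampler produced for one task:
    helped[i] is 1 if sample i was used in the final plan, 0 otherwise.
    Earlier useful samples earn more (gamma is an assumed discount)."""
    return sum((gamma ** i) * b for i, b in enumerate(helped))

# A sampler whose first sample was the useful one scores higher than one
# whose third sample was.
assert planner_reward([1, 0, 0]) > planner_reward([0, 0, 1])
```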
The first set of tasks (Task I) is a simple controlled example in which the goal is to push an object off a 2D table in the presence of an obstacle on either one side of the table or the other (both situations are equally likely). The presence of these obstacles is not represented in the context of the sampler, but the planner will reject sampled action instances that generate a collision with an object in the world and request a new sample. The feasible actions are drawn from two 2D rectangles of unequal size. The optimal strategy is to first randomly sample from one side of the table and, if no plan is found, sample from the other side.
We show the learning curve of diverselk with respect to the planning reward metric in Fig. 6 (a). 1000 initial arrangements of obstacles were drawn randomly for testing, and we repeat the experiments 5 times to obtain confidence intervals. For diversegk, the kernel inverse lengthscales are fixed at their initial values; if, for example, the sampler tried the left side of the object (pushing to the right) while the obstacle was on the right, it may not choose to sample on the right side next, because the fixed kernel indicates that the other feature offers more diversity. After a few planning instances, however, diverselk is able to find the right configuration of the kernel, and its sampling strategy becomes the optimal one.
We also tested these three sampling algorithms on two more complicated tasks. We select a fixed test set with 50 task specifications and repeat the evaluation 5 times. The first one (Task II) involves picking up cup A, getting water from a faucet, moving to a pouring position, pouring water into cup B, and finally placing cup A back in its initial position. Cup B is placed randomly, either next to the wall on the left or on the right. The second task (Task III) is a harder version of Task II, with the additional constraint that cup A sits in a holder, so the sampler also has to discover that the grasp location must be close to the top of the cup.
We show the learning results in Fig. 6 (b) and (c), and timing results in Tab. 2 (after training). We conjecture that the sharp turning points in the learning curves for Tasks II and III result from the high penalty on the kernel lengthscales and the limited size (50) of the test task set; we plan to investigate this further in future work. Nevertheless, diverselk is still able to find a better solution than the alternatives in Tasks II and III. Moreover, the two diverse sampling methods achieve lower variance on the success rate and perform more stably after training.
5.4 Integrated system
Finally, we integrate the learned action sampling models for pour and scoop with 7 pre-existing robot operations (move, push, pick, place, fill, dump, stir) in a domain specification for STRIPStream. The robot's goal is to "serve" a cup of coffee with cream and sugar by placing it on the green coaster near the edge of the table. Accomplishing this requires general-purpose planning, including choosing where to grasp the objects, where to place them back down on the table, and what the pre-operation poses of the cups and spoon should be before initiating the sensorimotor primitives for pouring and scooping. Significant perturbations of the object arrangements are handled without difficulty.
This work illustrates a critical ability: to augment the existing competences of a robotic system (such as picking and placing objects) with new sensorimotor primitives, by learning probabilistic models of their preconditions and effects and using a state-of-the-art domain-independent continuous-space planning algorithm to combine them fluidly and effectively to achieve complex goals.
Footnotes
 We use the focused algorithm within STRIPStream, and it solves the task in 20–40 seconds for a range of different arrangements of objects.
References
 L. P. Kaelbling and T. Lozano-Pérez, “Learning composable models of parameterized skills,” in ICRA, 2017.
 C. R. Garrett, T. Lozano-Pérez, and L. P. Kaelbling, “Sample-based methods for factored task and motion planning,” in RSS, 2017.
 ——, “STRIPS planning in infinite domains,” arXiv:1701.00287, 2017.
 C. E. Rasmussen and C. K. Williams, “Gaussian processes for machine learning,” The MIT Press, 2006.
 R. E. Fikes and N. J. Nilsson, “STRIPS: A new approach to the application of theorem proving to problem solving,” Artificial Intelligence, vol. 2, pp. 189–208, 1971.
 O. Kroemer and G. Sukhatme, “Metalevel priors for learning manipulation skills with sparse features,” in ISER, 2016.
 T. Hermans, F. Li, J. M. Rehg, and A. F. Bobick, “Learning contact locations for pushing and orienting unknown objects,” in Humanoids, 2013.
 C. Schenck, J. Tompson, D. Fox, and S. Levine, “Learning robotic manipulation of granular media,” in CoRL, 2017.
 Z. Pan, C. Park, and D. Manocha, “Robot motion planning for pouring liquids,” in ICAPS, 2016.
 M. Tamosiunaite, B. Nemec, A. Ude, and F. Wörgötter, “Learning to pour with a robot arm combining goal and shape learning for dynamic movement primitives,” Robotics and Autonomous Systems, vol. 59, no. 11, 2011.
 S. Brandi, O. Kroemer, and J. Peters, “Generalizing pouring actions between objects using warped parameters,” in Humanoids, 2014.
 A. Yamaguchi and C. G. Atkeson, “Differential dynamic programming for graphstructured dynamical systems: Generalization of pouring behavior with different skills,” in Humanoids, 2016.
 C. Schenck and D. Fox, “Visual closedloop control for pouring liquids,” in ICRA, 2017.
 G. Konidaris, L. P. Kaelbling, and T. Lozano-Pérez, “From skills to symbols: Learning symbolic representations for abstract high-level planning,” JAIR, vol. 61, 2018.
 O. Kroemer and G. S. Sukhatme, “Learning spatial preconditions of manipulation skills using random forests,” in Humanoids, 2016.
 B. Bryan, R. C. Nichol, C. R. Genovese, J. Schneider, C. J. Miller, and L. Wasserman, “Active learning for identifying function threshold boundaries,” in NIPS, 2006.
 A. Gotovos, N. Casati, G. Hitz, and A. Krause, “Active learning for level set estimation,” in IJCAI, 2013.
 I. Bogunovic, J. Scarlett, A. Krause, and V. Cevher, “Truncated variance reduction: A unified approach to Bayesian optimization and level-set estimation,” in NIPS, 2016.
 Y. Gal and Z. Ghahramani, “Dropout as a Bayesian approximation: Representing model uncertainty in deep learning,” in ICML, 2016.
 B. Lakshminarayanan, A. Pritzel, and C. Blundell, “Simple and scalable predictive uncertainty estimation using deep ensembles,” in NIPS, 2017.
 A. Kulesza, B. Taskar, et al., “Determinantal point processes for machine learning,” Foundations and Trends in Machine Learning, vol. 5, no. 2–3, 2012.
 R. H. Affandi, E. B. Fox, and B. Taskar, “Approximate inference in continuous determinantal point processes,” in NIPS, 2013.
 R. H. Affandi, E. Fox, R. Adams, and B. Taskar, “Learning the parameters of determinantal point process kernels,” in ICML, 2014.
 L. P. Kaelbling and T. Lozano-Pérez, “Hierarchical task and motion planning in the now,” in ICRA, 2011.
 S. Srivastava, E. Fang, L. Riano, R. Chitnis, S. Russell, and P. Abbeel, “Combined task and motion planning through an extensible planner-independent interface layer,” in ICRA, 2014.
 R. Chitnis, D. Hadfield-Menell, A. Gupta, S. Srivastava, E. Groshev, C. Lin, and P. Abbeel, “Guided search for task and motion plans using learned heuristics,” in ICRA, 2016.
 B. Kim, L. P. Kaelbling, and T. Lozano-Pérez, “Learning to guide task and motion planning using score-space representation,” in ICRA, 2017.
 ——, “Guiding search in continuous state-action spaces by learning an action sampler from off-target search experience,” in AAAI, 2018.
 Z. Wang, B. Zhou, and S. Jegelka, “Optimization as estimation with Gaussian processes in bandit settings,” in AISTATS, 2016.
 N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger, “Gaussian process optimization in the bandit setting: No regret and experimental design,” in ICML, 2010.
 D. P. Foster and R. Vohra, “Regret in the on-line decision problem,” Games and Economic Behavior, vol. 29, no. 1–2, 1999.
 E. Catto, “Box2D, a 2D physics engine for games,” http://box2d.org, 2011.
 J. J. Kuffner, Jr. and S. M. LaValle, “RRT-Connect: An efficient approach to single-query path planning,” in ICRA, 2000.