Introducing a Human-like Planner for Reaching in Cluttered Environments

Introducing a Human-like Planner for Reaching in Cluttered Environments


Humans, in comparison to robots, are remarkably adept at reaching for objects in cluttered environments. The best existing robot planners are based on random sampling in configuration space- which becomes excessively high-dimensional with a large number of objects. Consequently, most of these planners suffer from limited object manipulation. We address this problem by learning high-level manipulation planning skills from humans and transfer these skills to robot planners. We used virtual reality to generate data from human participants whilst they reached for objects on a cluttered table top. From this, we devised a qualitative representation of the task space to abstract human decisions, irrespective of the number of objects in the way. Based on this representation, human demonstrations were segmented and used to train decision classifiers. Using these classifiers, our planner produced a list of waypoints in task space. These waypoints provide a high-level plan, which can be transferred to an arbitrary robot model and used to initialize a local trajectory optimiser. We evaluated this approach through testing on unseen human VR data, a physics-based robot simulation and real robot experiments. We find that this human-like planner outperforms a state-of-the-art standard trajectory optimisation algorithm and is able to generate effective strategies for rapid planning, irrespective of the number of objects in a cluttered environment. Our dataset and source code are publicly available 1.

I Introduction

Imagine grasping a yoghurt tub from the back of a cluttered fridge shelf. For humans, this is a trivial task and one that even very young children are able to perform exquisitely. Yet, for a robot to be able to execute the same action with such ease, there are a number of non-trivial questions that it must resolve. Should the robot try to navigate through the available free space? Or might it be better to move obstructing items away first? In which case, which object should be moved first and where to?

Standard robot motion planning approaches focus on identifying a collision-free trajectory that satisfies a set of given constraints [latombe2012robot] and the majority of current planning techniques are based on random sampling [king2015nonprehensile, moll2017randomized, karaman2011sampling, elbanhawi2014sampling] of the configuration space. A defining feature of these sampling-based planners (SBPs) is the use of a set of probing samples drawn to uniformly cover the state space. To accelerate the planning process, it is thus desirable to devise non-uniform sampling strategies that favor sampling in regions where an optimal solution might lie [ichter2018learning]. Finding such regions is non-trivial, but, as we set out in the opening, humans can find near-optimal solutions very quickly.

Predicated on human expertise, imitation learning from demonstration (LfD) techniques are increasingly being adopted by researchers for robot motion planning [fox2019multi], [zhang2018deep] [rana2018towards], [tan2011computational]. For example, researchers have demonstrated the use of neural networks for learning the dynamics of arm motion from human data [ravichandar2016learning], whilst others have shown the utility of combining planning and learning-based approaches to facilitate goal-directed behavior during human-robot interactions [lawitzky2012feedback]. Alternative approaches to learning from human data include learning qualitative task representations. Evidence indicates that humans recognize and interact with spatio-temporal space in a more qualitative rather than quantitative manner [wallgrun2013understanding], [chen2015survey], [cohn2012thinking]. As such, previous work integrated qualitative spatial representations (QSRs) with manipulation planning at different levels [mansouri2014more], [westphal2011guiding]. Importantly, these approaches avoid the pitfalls of SPBs [csucan2011sampling], [geraerts2004comparative], [karaman2011anytime] - which only allow a small number of objects due to the curse of dimensionality,

Fig. 1: Overview of the HLP approach.

We propose a novel approach to the problem of reaching through a cluttered environment based on geometric reasoning and workspace partitioning [latombe2012robot] (similar to cell decomposition) augmented by QSRs [chen2015survey]. To achieve this, we collected data from human participants for our scenario using virtual reality (VR)[brookes2019studying]. Demonstrations were segmented based on a spatial model of the task space. This segmentation resulted in a set of state-action pairs used to train classifiers via standard machine learning techniques. These trained classifiers are used by a human-like planner (HLP) which, during testing, generates a high-level plan in task space. Finally, the generated high-level plan was forwarded to a robot-specific low-level controller for inverse kinematics (IK) and local optimisation if required. The resulting method is a human-like planning (HLP) algorithm for reaching objects in cluttered environments (see Fig. 1 for an overview of this framework).

Initial scene
Moving an object
Going through a gap
Approaching the target
Fig. 2: Sample keyframes in a VR trial. This structure of two rows with three obstacles in each row is fixed among all trials. However, our approach can generalize to different structures as seen in Sec. VII

Our novel contributions include:

  • the development of a high-level planning algorithm that learns from human participants interacting in VR;

  • a new qualitative space/action representation to mitigate the problem of high dimensionality; and

  • empirical demonstrations of the utility of this high-level planner – showing that it is scalable and can work in conjunction with any existing low-level planner in a seamless manner.

Ii An Overview of the HLP

Given a cluttered environment represented by a start position , a target position and positions of movable objects , we consider the problem of planning a high-level path from to that is most likely to be planned by a human. A high-level plan is represented by a set of keypoints labeled by a set of associated actions .

Our framework (Fig. 1) starts by learning from human demonstrations collected through VR. Each demonstration is given in the form of an environment representation in addition to an associated human trajectory . This framework runs in two phases, training and testing. In the training phase (green arrow path of Fig. 1): given pairs of , demonstrations are segmented into abstracted state-action pairs that are used to train decision classifiers. In the testing phase, given of a new scene, the HLP uses the trained decision models to generate . The generated high-level plan can then be transferred to an arbitrary robot model. A robot-specific low-level controller solves for IK and performs local optimisation if required.

Demonstration segmentation is based on our modeling of the task space in order to extract required state-action pairs. Modeling of the task space is detailed in Sec. IV. Feature extraction and decision classifiers are explained in Sec. V whilst the planning algorithm used in the testing phase is given in Sec. VI. Demonstration of the transfer from high-level planing to robot control is provided in the Experiments section.

Iii Vr Dataset Collection

A dataset 2 of human demonstrations was collected by asking 24 participants to perform 90 reaching trials towards objects placed on a cluttered table top in a virtual environment (Fig. 2). The table top surface was surrounded by transparent screens from all but the front side and the work space dimensions were tailored to suit human arm movement. The position and rotation of the hand, elbow and upper arm (nearest point to shoulder) were tracked and sampled at 90Hz.

Participants were asked to initiate each trial by first moving towards a home position, which was indicated by a transparent pink ball. After onset, a target end-point a red-coloured can appeared along with a total of 6 obstacles placed on two rows. Participants were instructed to interact with the scene to pick up the can and move it to the start point. Participants could achieve this by either navigating around the obstacles or picking up and moving the obstacles to a new position. Trials were failed if any part of the arm interacted with the obstacles, if any moved object hit the edge of the workspace, if the can touched any of the obstacles or if participants knocked over an obstacle with their hand.

Iv Modeling the Task Space

Fig. 3: Modeling the task space. A 2D projection is shown for a structure defined by two rows with three objects (and hence four gaps) each, start location and target object. Space around an object is discretized into eight direction classes. Size of each direction block depends on the object size and a scaling factor .

Devising a model of the task space that enables one to learn from demonstrations and generalize with different environment settings is of critical importance. To this end, we designed the row-based structure shown in Fig. 1 and Fig. 3, with two rows each containing three objects and aligned start-target positions. This structured environment helped model the task space and train decision learners. All VR demonstrations used for training were generated according to this structure and we show in the Experiments section how the HLP is able to generalize to different environment settings.

We model the environment task space qualitatively as a union of three disjoint sets: representing the occupied space while the free space is given by the union of gaps and connecting space sets. Occupied space represents all objects in the scene in addition to walls3 of the surface on which objects rest. A gap is a qualitative representation of the free space existing between two objects and at the same row. The connecting space models the remaining free space used to connect from start to first row, between rows and from second row to target.

Based on this representation of the task space, the action space is discretized into two primitive actions: moving an object and going through a gap (Fig. 2). A primitive action occurs at a keypoint in the space, i.e. an action applies to an object or a gap at a specific row. Actions at different rows are connected through the connecting space .

Keypoints at objects and gaps are conceptually analogous to the idea of keyframe-based learning from demonstration [akgun2012keyframe]. They are also comparable to nodes in RRTs and PRMs [bry2011rapidly], [geraerts2004comparative]. Similar to interpolation between keyframes or connecting graph/tree nodes, keypoints are locally connected in .

The high-level plan is generated hierarchically in three levels: path, segment and action. At the top level, a path comprises a sequence of consecutive segments. Each segment is a qualitative description of a subset of in which the planner (connects) interpolates between two consecutive actions. Finally, an action is one of the two primitive actions in . For example, considering an environment structured in rows, a path consists of - segments. Each segment connects between a pair of consecutive rows. One action takes place at a row and applies to a gap or an object. Start and target points are directly connected to the actions at first and last rows respectively.

V Decision Classifiers

Decision learners are used to locally capture rules underlying human planning skills. We design such learners as classifiers that map from state (feature) domain to action (class) domain . Training examples are drawn from a distribution over where the goal of each classifier [moldovan2018relational], [alpaydin2009introduction], [langford2003reducing] is finding a mapping that minimizes prediction errors under . The classifier output may be represented by the posterior probability .

In this section, we introduce the features extraction required for decision classifiers. Features4 are extracted based on the spatial representations used by humans to recognize the task space. We heavily use relational spatial representations like distance, size, orientation and overlap. In our working scenario, there are four main decisions during a plan: (1) which gap to go through? (2) which object to move? (3) to which direction the object should be moved? (4) which segment to connect two consecutive actions? Except for the object-moving direction, we designed our learners as binary classifiers.

V-a Gaps and Objects Classifiers ()

These classifiers have a binary output defining the probability of either selecting or rejecting a gap or an object. Intuitively, humans prefer gaps close to their hand, having proper size, leading to the target and not requiring acute motions. Hence, the features vector, , input to gap classifier is defined by the distances from gap to initial and target positions, gap (size) diagonal , orientation of the gap-to-start and gap-to-target lines.


Similarly, object features are given by distances and , object diagonal , the orientation angles , in addition to object’s overlap with the target and a measure of the free space volume/area around object .


Space around a given object is discretized into a set of eight classes (forward FF, forward left FL, left LL, back left BL, back BB, back right BR, right RR and forward right FR) covering object’s neighborhood (Fig. 3). The size of each direction block depends on the object size and an arbitrary scaling factor. The free space in each block is computed and is given by the sum of free space in the eight blocks.

V-B Object Direction Classifier ()

If the action is to move an object, we learn appropriate moving direction from human data. To estimate the moving direction, an object is described by the direction of human hand when approaching the object, orientation of an object-to-target line and amount of free space in each surrounding direction around the object. This is a multi-class classifier whose output .


V-C Connecting Segments Classifier ()

Each segment connects two actions applied on gaps and/or objects. Let and denote the elements representing these gaps/objects at the first and second rows respectively. It is desirable to avoid actions that are considerably apart from each other, having no or small overlap with target and expected to result in large collision. Hence, a segment feature vector consists of the signed horizontal and vertical distances, overlap between and the target object, segment orientation w.r.t a straight line connecting initial and target positions and the collision expected to accompany motion through this segment.

Segment collision is computed as the overall overlapping volume (area) between human forearm (link) and . The overlapping area is approximated by the intersection polygon area and found by scanning the surrounding area using sampling lines. 5


Vi Hlp Algorithm

The HLP algorithm uses the trained classifier models explained in the previous section to generate the high-level plan in the testing phase. The algorithm starts by locating rows and gaps in the scene.

For each -th row, objects and gaps classifiers are called (Lines 2-3) to identify the best objects and gaps respectively. Selected gaps and objects in ()-th row are connected to their counterparts in the next ()-th row through connecting segments. Selecting a total gaps and objects at a pair of consecutive rows results in segment combinations. These combinations are constructed and passed to the segments classifier that selects the best connecting segments (Lines 4-6).

0:  Environment representation
0:  High-level path Locate rows and gaps
1:  for all  do
2:      Compute gaps feature vector Compute objects feature vector
3:  end for
4:  for all pairs of consecutive rows do
5:      Segment Constructor Compute segments feature vector
6:  end for
7:  for all  do
9:     if  then
10:        Compute object-direction feature vector Object direction = Augment by expected object’s location
11:     end ifCompute arm configuration feature vector Estimate arm configuration: Compute expected path collision
12:  end forSelect the path with minimum collision score
Algorithm 1 The Human-Like Planner (HLP)
Fig. 4: A human-like manipulation plan for reaching a target amongst 8 obstacles. Note that the testing environment is different from the training one in terms of the number and shape of obstacles.

A segment connects two actions, one of which may belong to the moving-object class. Hence, the object moving direction is estimated by the classifier using same convention of space discretization in Sec. V-B. The expected object location after moving is found and added to the segment’s sequence of keypoints (Lines 7-11).

Next, the human-arm configuration is estimated (explained in next section) at each keypoint and estimated configurations are used to evaluate the overall expected collision of each segment (Lines 11-12). To this end, we get candidate segments between rows that are labeled by a measure of their overall expected collision. For a two-row structure, the best segment is picked as the one with least likely collision.

Vi-a Arm Configuration and Collision Detection

In the 2-row structure, a candidate path has one segment that is connected to the start and target points. Having a number of candidate paths, we selected the one with minimum overall expected collision. Overall expected path collision is found by computing collision of full-arm motion with between path keypoints. Therefore, arm configurations are firstly estimated at each keypoint and then expected arm collision is computed.

A human arm is modeled as a planar arm with four joints at neck, shoulder, elbow and hand. The arm configuration is represented by two angles: between neck-shoulder and upper arm links and between upper arm and forearm links. The arm configuration at a given keypoint is estimated by regression. Input features to the configuration regression model are: hand direction when approaching , arm configuration at the previous keypoint and signed horizontal and vertical distances between the two keypoints.


By estimating arm configurations at keypoints and hence joint positions, full arm collision is computed as sum of its links collision during motion along keypoints from start to target. Collision of each link is computed using the same approach as in Sec. V-C. Intersections between sampling lines and objects are found and the area of the resulting intersection polygon defines the collision amount. To this end, a number of candidate paths are generated, each labeled with a measure of its overall expected collision . This step completes the HLP algorithm and generates a plan defined by the path having minimum expected overall collision.

Vii Experiments and Results

The HLP approach was tested through three protocols: human VR data, robot simulation and real-robot experiments. The VR dataset was randomly split into two disjoint sets: approximately 80% (19 subjects and 1710 trials) for training and 20% (5 subjects and 450 trials) for testing. This cross-subject splitting, (i.e. no overlap between training and testing data in terms of subjects) allowed us to test the generalization of the proposed approach with subjects who had never seen during training. This splitting was repeated for five folds of randomly selected subjects for both sets. Standard (MATLAB) support vector machines with Gaussian kernel and Gaussian process regression were used to implement classifiers and regression models respectively. The same trained models and planning algorithm were used for all experiment protocols.

Vii-a Experiments on Vr Data

This protocol tested the HLP on unseen, but similarly distributed data to the training set. The objective was to measure the similarity of the generated plans to the “ground truth” human plans. This similarity was computed for a scene having rows by the following metric:


where is the indicator function which is when its argument is true and otherwise, is a Boolean variable that is true if HLP action is same as human action at the -th row and is a Boolean variable that is true if the same element (gap or object) is selected by both HLP and human action. To illustrate, if both the HLP and the human participant decided to move an object at the first row and then go through a gap at the second row, then this would be quantified as a similarity rate. This could increase to if both selected the same object and the same gap.

Mean and standard deviation of the 5-fold results are shown in Table I. Accuracy of the gaps and objects classifiers are 95% and 85% respectively. It is worth noting that we compared similarity of the HLP output to a specific participant’s plan at a time and then reported the mean similarity. This means that HLP similarity is counted if it exactly matches a specific testing subject who was never seen during training. On average, our planner was 70% similar to the test participant plans. More specifically, the HLP decided the same action as a human plan 79% of the time, while it selected the same element (specific gap or object) 67% of the time.

Metric Mean STD
accuracy 0.95 0.002
accuracy 0.85 0.005
(overall) 0.70 0.011
() 0.79 0.016
() 0.67 0.012
TABLE I: Results (mean and standard deviation) of the 5-fold VR test experiment.

Vii-B Robot Experiments

Through robot experiments, we compared the HLP with a standard physics-based stochastic trajectory optimisation (STO) approach [Agboh_Dogar_WAFR_2018, Agboh_Dogar_Humanoids_2018, Agboh_Daniel_Dogar_ISRR_2019]. These algorithms were implemented using the Mujoco [mujoco] physics engine and the Deepmind control suite [dm_control]. We assumed a world consisting of objects on a table with a 10-DOF robot as shown in Fig. 4. As a baseline algorithm, we used a state-of-the-art physics-based stochastic trajectory optimiser [Agboh_Dogar_WAFR_2018], initialized with a straight line control sequence from the end-effector’s start pose to the target object.

IK solutions for the high-level plan keypoints were found and connected with straight lines in robot’s configuration space to generate a control sequence. This was passed to a trajectory optimiser as an initial candidate control sequence. Therefore, for the HLP, the number of actions in a given control sequence varied depending on the high level plan. In contrast, the baseline approach (STO) initialized the trajectory optimiser using a straight line control sequence to the goal.

We compared performance of the HLP and STO quantitatively through success rates and planning times of simulation experiments and qualitatively through visual inspection of manipulation plans in real robot experiments.

Robot Simulation Experiments

Simulation experiments evaluated the performance of our planner in scenes generated from a distribution considerably different to that of the training data. Generalization was investigated across different numbers, sizes and types of objects. A plan was classified as being successful if at the final state the target object was inside the gripper’s pre-grasp region and if no other object dropped off the table during the trial. We recorded the total planning time for successful trials of each planner. Planning time which comprised an initial time used to generate the high level plan (straight line for ), and an optimisation time. Two simulation categories were considered:

Fig. 5: Human-like Manipulation (HLP) plan (top), and Baseline (STO) plan (bottom). HLP algorithm can reason over the task space and navigate through the free space around the obstacles similar to what people may do. On the other hand, STO is biased towards its straight-line initialization and hence had to push objects around which made it 1.5 times slower than HLP.

Generalization I Performance of the HLP was tested in simulation scenes with the same two-row structure used in the VR demonstrations. Here, the generalization element involved varying the dimensions of the table and objects. Table II summarizes results of scenes from this category. The planner HLP substantially outperformed the STO by a large margin, indexed through success rates and a reduction in planning time.

Success rate(%) Init. time(s) Opt. time(s) Total time(s)
HLP 94 0.59 0.97 1.56
STO 84 0.04 17.84 17.88
TABLE II: Generalization I Simulation Scenes Results

Generalization II Our second test involved investigating how our approach would fare when generalizing to different numbers of objects, different types of objects and different environment settings. We considered conditions with 5, 7 and 9 objects with two shape classes: cylinders and boxes. Importantly, the start and target positions were no longer aligned in this experiment. randomly sampled scenes were generated for each object-number condition. For each random scene, we selected a random6 initial robot configuration.

No. of Objects HLP STO
Init. time(s) Opt. time(s) Total(s) Total(s)
5 0.60 1.70 2.30 1.85
7 0.61 2.65 3.26 3.68
9 0.63 5.90 6.53 4.98
TABLE III: Planning times (mean) for Generalization II. Note that the actual planning time of HLP (Init. time) is approx. fixed irrespective of the number of obstacles.

The success rates for 100 random scenes for each of the three object categories were computed. The rates for both planners were relatively similar, (93%, 93% and 95% ) for HLP and (96%, 96% and 98% ) for STO for 5, 7 and 9 objects respectively. Results of planning time comparison are given in Table III. The time required for generating a high-level plan by HLP (Init. Time) is fixed irrespective of number of objects.

Real Robot Experiments

In a set of real robot experiments, we used a Robotiq two finger gripper attached to a UR5 arm mounted on an omnidirectional robot (Ridgeback). We obtained object positions using a depth camera mounted above the table. We conducted a limited number of experiments- sample scenes per 7, 9 and 11 objects. We then ran both the HLP and the producing a total of robot runs 7. Sample results are shown in Fig. 5 where the HLP favored going through the free space avoiding pushing objects while the STO was biased to its initialized straight-line path. Other scenes can be found in the attached video 8.

Viii Discussion

This work addresses the challenge of human-like computing i.e. the endowment of systems with capabilities derived from modelling human perception, cognition and action. Humans are known to recognize the spatio-temporal world in a qualitative manner. Hence, instead of cloning the human behavior from demonstrations, we used QSR in order to segment demonstrations in the action space. Extracting human skills at such a high level helps to model a planner that can: (1) generalize to different number of objects without increasing the actual planning time and (2) seamlessly connect with an arbitrary robot model.

Many cluttered environments can be clustered into regions that geometrically approximate to our definition of rows. For an arbitrary number of rows, the HLP can be run recursively for row pairs by defining a set of sub-targets. Moreover, generalization may be improved by augmenting training data with more generalized scenes, using more powerful classifiers and running in a closed-loop manner. These topics will be addressed in our future work.

Ix Conclusions

We used VR to collect data from human participants whilst they reached for objects on a cluttered table-top. From this data, we devised a qualitative representation of task space to segment demonstrations into keypoints in the action space. State-action pairs were used to train decision classifiers that constituted the building blocks of the HLP algorithm. Through testing in VR, robot simulation and real robot experiments against a state-of-the-art planner, we have shown that this HLP is able to generate effective strategies for planning, irrespective of the number of objects in a cluttered environment. We conclude that the optimisation of robot planning has much to gain by extracting features from understanding the processes underlying human action selection and execution.


This work is funded by the EPSRC (EP/R031193/1) in their Human Like Computing initiative. Support was also received from EPSRC studentship 1879668 to the third author, the European Union’s Horizon 2020 research and innovation programme under grant agreement 825619 (AI4EU), and Turing Fellowship from the Alan Turing Institute to the ninth author.



  3. A set of four virtual objects are used to represent top, left, right and bottom walls and added to the occupied space to avoid slipping objects off the surface.
  4. Features are appropriately normalized to relevant factors like surface size, objects area, number of objects and arm dimensions.
  5. Although there are standard graphics polygon clipping algorithms to solve this kind of problems, we preferred an approximate but faster line-sampling approach.
  6. We uniformly sampled a start point along the front edge of the table, and a corresponding random end-effector orientation (in the plane) and found the initial robot configuration using inverse kinematics
  7. Four scenes, three number-of-object categories, and two methods.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description