Object-centric Forward Modeling for Model Predictive Control
We present an approach to learn an object-centric forward model, and show that this allows us to plan sequences of actions that achieve distant desired goals. We propose to model a scene as a collection of objects, each with an explicit spatial location and an implicit visual feature, and learn to model the effects of actions using random interaction data. Our model captures robot-object and object-object interactions, and leads to more sample-efficient and accurate predictions. We show that this learned model can be leveraged to search for action sequences that lead to desired goal configurations, and that in conjunction with a learned correction module, this allows for robust closed-loop execution. We present experiments both in simulation and in the real world, and show that our approach improves over alternative implicit or pixel-space forward models. Please see our project page for result videos.
1 Introduction
What will happen if the robot shown in Fig. 1 moves its hand to the right by a few inches? We can all easily infer that this will result in the red block moving right, and possibly hitting the blue one, which would then also move. This ability to perform forward modeling, i.e., predicting the effect of one’s actions, is a cornerstone of intelligent behaviour: from realizing that turning the knob opens a door, to understanding that falling from a height can lead to an injury. Not only does this allow us to judge actions based on their immediate consequences, it also enables us to reason about sequences of actions needed to achieve desired goals. As an illustration, let us again consider Fig. 1, but now try to imagine how we can get the red block to the right end of the table without disturbing the blue object. We know that this can be achieved by first pushing the object up, then towards the right, and then back down. This seemingly simple judgment is actually quite remarkable. In addition to understanding that the naive solution of simply pushing right would not succeed, we could also find, among myriad other possibilities, the sequence of actions that would, by chaining together our understanding of each individual action to understand the effect of the collective. In this work, our goal is to build agents that can exhibit similar capabilities, i.e., given an image, understand the consequences of their actions, and leverage this ability to achieve distant goals.
This insight of using a ‘forward model’ to find action sequences that lead to desired outcomes is a classical one in the field of AI, and has been successfully adapted to robotic manipulation tasks [16, 33] for scenarios where the state of the system, e.g., shape, pose, mass, etc. of objects, can be easily represented, and the effect of actions analytically obtained. While this explicit state representation can allow efficient and accurate planning, understanding the state of the system from visual observations or analytically modeling the dynamics is not always possible. Some recent approaches [1, 7] therefore propose to learn a forward model over various alternative representations, e.g., implicit features or pixel space. However, we argue that using these implicit or pixel-based representations for forward modeling discards knowledge about the structure of the world, thereby making them less efficient or accurate. When we think of the scene in Fig. 1 and the effects of our actions, we naturally think of the different blocks and their interactions with each other or the robot. Towards learning forward models that have similar inductive biases, we propose to use a semi-implicit representation: explicitly encoding that a scene comprises different entities, but using an implicit representation to capture the appearance of each entity.
Concretely, we represent each object using its spatial location (in image space) and an implicit feature that is descriptive of its appearance (and can implicitly capture transforms like rotations). We build on this object-centric scene representation and present an approach that learns to model the effect of actions in a scene by predicting the change in representation of the objects present, while allowing for interactions between the objects and the robot, as well as among the objects. This object-centric forward model allows us to capture several desirable inductive biases that help in learning more efficient and accurate models: a scene comprises spatial objects, actions can affect these objects, and the objects can, in turn, affect each other. We show we can leverage our learned model to search for a sequence of actions that would allow us to reach a desired scene configuration from the current input image. However, as the forward model is not perfect, we additionally propose to use a ‘refinement’ module that can re-estimate the scene configuration in the context of the observed image. This allows us to robustly act in a closed-loop manner to achieve desired goal configurations, and we show that our approach improves over previous pixel-space or implicit forward models.
2 Related Work
Learning Video Prediction. While forward modeling aims to predict the future conditioned on an action, a related task in the computer vision community is that of simply predicting the future, independent of any action. Several approaches have attempted to predict videos in pixel space [24, 26, 17]. Instead of directly regressing to pixels, alternative flow-based prediction models have also shown promising results [8, 31]. However, these can typically only handle small motions between frames, and need a large number of samples to overcome this inductive bias. Most related to our work is the approach by Ye et al. [32], which also pursues prediction in an object-centric space; in this work we show that such models can be extended to action-conditioned prediction and planning.
Predictive Forward Models for Acting and Planning. In the robotics community, learned forward models have been used for a plethora of tasks, e.g., leveraging forward models for exploration [20, 19, 10] or learning a task policy [25, 11, 5]. More related to ours, some approaches [1, 18] jointly learn a forward and an inverse model, where the latter is regularized by the former and can be used to greedily output actions given the current observation and a short-term goal. We adopt the philosophy of some recent methods [6, 27] that also tackle longer-horizon tasks by training a forward model and then using a planner to generate actions. However, these methods still have difficulty handling large changes per action or a large number of actions. We overcome these limitations by leveraging object-centric representations for forward modeling and planning.
Structured Models for Physical Interactions. Rather than predicting in an implicit or pixel-space representation, a line of work with a motivation similar to ours models physics by explicitly modeling the state transitions, using known or predicted [29, 30] physical properties. However, for generic manipulation tasks in the real world, the dynamics and physical properties cannot easily be captured analytically. Recent learning-based works overcome this in a data-driven manner, and show impressive results for forward modeling and planning with previously unseen, but isolated, objects. Towards handling more generic scenarios, some approaches leverage graph neural networks to reason about the interaction between objects [3, 28, 14, 4], but typically apply their methods to simpler scenarios that do not involve robotic manipulation and where the state can be estimated. Janner et al. [13] show that such compositional forward models can be applied to tasks like block stacking, but learn these for predefined high-level action primitives. In contrast, our work targets forward modeling for low-level continuous control, where a long sequence of actions is required to achieve a goal.
3 Approach
Given an image depicting multiple objects on a table, and a goal image indicating the desired configuration, we aim to plan a sequence of pushing actions such that executing them leads to the goal. To search for an optimal action sequence, a forward model is essential to hallucinate the future configurations that would result from any such sequence. Our insight is that to manipulate in a complicated environment with multiple objects, object-level abstraction is crucial for making long-horizon predictions and plans. We propose to use an object-centric forward model that can predict the effect of actions by implicitly modeling the interactions among the objects and the robot. While the learned model allows planning using the object-centric representation, our estimate of the objects’ locations after a performed action is not perfect and needs to be re-estimated for closed-loop execution. We therefore also propose to learn a refinement module to correct our predicted representation using the updated observation.
3.1 Object-Centric Representation
Given an observation in the form of an image, we aim to predict, in some representation, the effect of actions, and then plan towards desired goals using this learned model. What form this representation should take is an open research question, but it should make both prediction and planning efficient to learn. Our insight is to explicitly leverage the fact that multiple distinct objects are present in typical scenes, and that they can naturally be represented by ‘where’ and ‘what’, i.e., their location and visual description. We operationalize this insight in our representation.
Concretely, given an observed image and the (known or predicted) locations of N objects in the image, we use an object-level representation consisting of one descriptor per object. Each object is represented by a pair: its observed or predicted location, and an implicit visual feature that encodes rotation, color, geometry, etc. The location is simply the object's coordinate in image space. The feature is extracted from a fixed-size window centered on the location by a neural network with a ResNet-18 [12] backbone.
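As a rough illustration of this representation, the following sketch builds one (location, feature) pair per object from an image and a list of locations. The names `crop_window`, `encode_patch`, and `build_representation`, the window size, and the (x, y) ordering of locations are assumptions made for this example; in particular, `encode_patch` is a trivial stand-in (per-channel mean color) for the learned ResNet-18-based encoder, which is not reproduced here.

```python
import numpy as np

def crop_window(image, center_xy, size):
    """Crop a (size x size) window centered on (x, y), zero-padding at the borders."""
    half = size // 2
    height, width = image.shape[:2]
    # Clamp the center to the image so the crop is always well defined.
    x = int(np.clip(round(float(center_xy[0])), 0, width - 1))
    y = int(np.clip(round(float(center_xy[1])), 0, height - 1))
    padded = np.pad(image, ((half, half), (half, half), (0, 0)), mode="constant")
    # Pixel (y, x) of the original image sits at (y + half, x + half) after padding,
    # so the window starting at (y, x) in the padded image is centered on it.
    return padded[y:y + size, x:x + size]

def encode_patch(patch):
    """Hypothetical stand-in for the learned encoder: per-channel mean color."""
    return patch.reshape(-1, patch.shape[-1]).mean(axis=0)

def build_representation(image, locations, size=32):
    """Return one (location, feature) pair per object."""
    return [
        (np.asarray(loc, dtype=float), encode_patch(crop_window(image, loc, size)))
        for loc in locations
    ]

image = np.zeros((64, 64, 3))
image[10, 20] = 1.0  # a single bright pixel at row y=10, column x=20
objects = build_representation(image, [(20, 10), (50, 40)], size=8)
```

Keeping the location explicit and the appearance implicit is what lets the later stages reason about ‘where’ separately from ‘what’.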
3.2 Object-Centric Forward Model
Given the object-centric descriptor at the current time step and the action about to be executed, the forward model predicts the representation of each object at the next time step. To predict the effect of a longer sequence of actions, we simply apply the forward model iteratively, once per action, to hallucinate the representation at the end of the sequence.
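The iterative application can be sketched as a simple rollout loop. Here `rollout` and `toy_forward_model` are hypothetical names; the toy model, which displaces every object by the action, is only a placeholder so that the example runs and is not the learned network.

```python
import numpy as np

def rollout(forward_model, state, actions):
    """Apply a one-step forward model once per action; return every predicted state."""
    states = []
    for action in actions:
        state = forward_model(state, action)
        states.append(state)
    return states

def toy_forward_model(locations, action):
    """Placeholder dynamics (an assumption): every object moves by the action."""
    return locations + action

initial = np.zeros((2, 2))           # two objects, 2D locations
actions = [np.array([1.0, 0.0])] * 3  # three identical pushes to the right
predicted = rollout(toy_forward_model, initial, actions)
```

Because each prediction is fed back as the next input, any one-step error compounds over the horizon, which motivates the correction module described in Section 3.4.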
To model the interaction among the robot and the objects, the forward model is implemented as an instance of an Interaction Network [2]. In general, the network takes in a graph in which each node is associated with a vector, and learns to output a new representation for each node by iterative message passing and aggregation. The message passing process is inherently analogous to physical interactions. To allow for robot-object and object-object interaction, in addition to the nodes representing the objects, the action of the gripper is added as an additional node whose feature is a learned embedding of the action. In addition to the predictor, we also train a decoder to further regularize the features to encode meaningful visual information. Similar to [32], the decoder takes in the object-centric representation and decodes it to pixels.
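To make the message passing concrete, here is a minimal NumPy sketch of one round over a fully connected graph whose nodes are the objects plus one action node. It is an illustration under stated assumptions rather than the architecture used in the paper: the weights are random instead of learned, each function is a single linear layer, messages are aggregated by summation, and all names and dimensions (`interaction_step`, `w_edge`, `w_node`, `w_action`) are invented for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def interaction_step(nodes, w_edge, w_node):
    """One round of message passing on a fully connected graph.

    nodes: (M, D) array, one row per node. Returns an updated (M, D) array.
    """
    m = nodes.shape[0]
    receivers = np.repeat(nodes, m, axis=0)  # row i*m + j holds receiver i
    senders = np.tile(nodes, (m, 1))         # row i*m + j holds sender j
    messages = relu(np.concatenate([receivers, senders], axis=1) @ w_edge)
    messages = messages.reshape(m, m, -1)    # indexed [receiver, sender]
    no_self = 1.0 - np.eye(m)[:, :, None]    # drop messages from a node to itself
    aggregated = (messages * no_self).sum(axis=1)
    return np.concatenate([nodes, aggregated], axis=1) @ w_node

rng = np.random.default_rng(0)
n_objects, dim, hidden, action_dim = 2, 6, 8, 4
w_action = rng.normal(size=(action_dim, dim))   # stands in for the learned action embedding
w_edge = rng.normal(size=(2 * dim, hidden))
w_node = rng.normal(size=(dim + hidden, dim))

object_nodes = rng.normal(size=(n_objects, dim))
action_node = rng.normal(size=action_dim) @ w_action
nodes = np.vstack([object_nodes, action_node])

# Only the object rows are kept as the predicted next-step representation.
predicted_objects = interaction_step(nodes, w_edge, w_node)[:n_objects]
```

Summing incoming messages makes the update independent of the order in which objects are listed, which is a natural property for a scene made of interchangeable entities.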
To train the forward model, we collect training data in the form of triplets consisting of the observed image before an action, the action, and the observed image after it. In addition, we require the location of each object at every time step and the correspondence of those objects across time. We argue that these annotations (with some possible noise) can be obtained using an off-the-shelf visual detector, as we demonstrate on real robot data in Section 4.3. We supervise the model using a combination of two losses: a reconstruction loss and a prediction loss. The reconstruction loss forces the features to encode meaningful visual information (and prevents trivial solutions), and the prediction loss encourages the forward model to predict plausibly both in feature space and in pixel space.
In these losses, the target features are those extracted from the observed image at the ground-truth object locations.
3.3 Planning Via Forward Model
Given this learned forward model, we can leverage it to find action sequences that lead to a desired goal. Specifically, given the goal and the current state, we generate an action trajectory such that executing it leads towards the goal configuration.
We optimize the trajectory with a sample-based optimizer, the cross-entropy method (CEM) [22]. At every iteration, CEM draws a set of trajectories from a Gaussian distribution, each as long as the planning horizon. The forward model evaluates those sequences by computing the distance of the predicted state to the goal configuration, which serves as the cost of the action sequence. The best samples are then selected and used to refit a new Gaussian distribution.
Once the optimization iterations are complete, the trajectory leading to the lowest distance to the goal is returned. Instead of executing the whole sequence, only the first step is actually applied. Then we observe the feedback and re-plan the next action.
We sample trajectories in continuous velocity space and upper-bound their magnitude.
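The planning loop described above can be sketched as follows. This is a minimal illustration, not the configuration used in the experiments: the horizon, sample count, elite count, iteration count, and magnitude bound are arbitrary placeholder values, the state is a single 2D location, and `toy_model` (the location moves by the action) stands in for the learned forward model. The function names `cem_plan`, `mpc`, and `clip_norm` are likewise assumptions.

```python
import numpy as np

def clip_norm(actions, max_norm):
    """Upper-bound the magnitude of every action vector."""
    norms = np.linalg.norm(actions, axis=-1, keepdims=True)
    return actions * np.minimum(1.0, max_norm / np.maximum(norms, 1e-8))

def cem_plan(forward_model, cost_fn, state, goal, horizon=5, action_dim=2,
             n_samples=200, n_elite=20, n_iters=5, max_norm=1.0, rng=None):
    """Cross-entropy method over action sequences; returns (best sequence, its cost)."""
    rng = np.random.default_rng(0) if rng is None else rng
    mean = np.zeros((horizon, action_dim))
    std = np.ones((horizon, action_dim))
    best_seq, best_cost = None, np.inf
    for _ in range(n_iters):
        noise = rng.standard_normal((n_samples, horizon, action_dim))
        samples = clip_norm(mean + std * noise, max_norm)
        costs = np.empty(n_samples)
        for i, seq in enumerate(samples):
            predicted = state
            for action in seq:  # unroll the forward model over the horizon
                predicted = forward_model(predicted, action)
            costs[i] = cost_fn(predicted, goal)
        elite_idx = np.argsort(costs)[:n_elite]
        elite = samples[elite_idx]
        # Refit the Gaussian to the best samples.
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-6
        if costs[elite_idx[0]] < best_cost:
            best_cost = costs[elite_idx[0]]
            best_seq = samples[elite_idx[0]].copy()
    return best_seq, best_cost

def mpc(forward_model, env_step, cost_fn, state, goal, n_steps, **kwargs):
    """Plan, execute only the first action, observe, and re-plan."""
    visited = [state]
    for _ in range(n_steps):
        plan, _cost = cem_plan(forward_model, cost_fn, state, goal, **kwargs)
        state = env_step(state, plan[0])
        visited.append(state)
    return visited

def toy_model(location, action):
    """Placeholder dynamics (an assumption): the location moves by the action."""
    return location + action

def distance(a, b):
    return float(np.linalg.norm(a - b))

start, goal = np.zeros(2), np.array([3.0, 2.0])
plan, plan_cost = cem_plan(toy_model, distance, start, goal)
visited = mpc(toy_model, toy_model, distance, start, goal, n_steps=3)
```

In this toy setting the environment and the model coincide; in the real system the environment step returns a new image, and the state passed to the next planning call has to be re-estimated from it.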
3.4 Robust Closed-Loop Control via Correction Modeling
We saw that, given the current representation and the desired goal configuration, we can generate a sequence of actions, of which only the first is executed at every time step before we replan. However, as we do not assume access to ground-truth object locations at intermediate steps, it is not obvious what the new ‘current’ representation should be for this re-planning. One option is to simply use the predicted representation, but this leads to an open-loop controller in which we do not update our estimates based on the new image observed after our action. As our prediction model is not perfect, such a predicted representation would quickly drift, making robust execution of long-term plans difficult.
To solve this problem, we propose to additionally learn a correction model that updates the predicted location based on the newly observed image. The model operates on regions cropped from an image at a specified location. Given the initial crop, which visually describes the object being tracked, and the crop of the newly observed image at the predicted location, it regresses a residual such that the predicted location plus the residual approximates the true location and re-centers the cropped region on the object. We train this model using random jitters around the ground-truth boxes, on the same training data used to learn the forward model.
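The jitter-based supervision can be illustrated with a short sketch that produces training pairs for such a correction model. The uniform jitter distribution, its range, and the function name `make_correction_targets` are assumptions for illustration; the source does not specify them, and the network that maps the two crops to a residual is not shown.

```python
import numpy as np

def make_correction_targets(gt_locations, max_jitter, rng):
    """Perturb ground-truth locations and return the residuals that undo the perturbation.

    gt_locations: (N, 2) array of ground-truth object locations.
    Returns (jittered locations, target residuals), both of shape (N, 2).
    """
    jitter = rng.uniform(-max_jitter, max_jitter, size=gt_locations.shape)
    jittered = gt_locations + jitter
    residual = gt_locations - jittered  # what the correction model should regress
    return jittered, residual

rng = np.random.default_rng(0)
gt = np.array([[20.0, 10.0], [50.0, 40.0]])
jittered, residual = make_correction_targets(gt, max_jitter=5.0, rng=rng)

# At test time the same residual is added to the forward model's predicted location.
corrected = jittered + residual
```

A model trained on such pairs sees crops that are slightly off-center and learns to output the offset that re-centers them, which mimics the drift of the forward model's predictions.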
4 Experiments
Our goal is to demonstrate that our learned object-centric forward model allows better planning compared to alternatives. To this end, we evaluate our method in both synthetic and real-world settings, and observe qualitative and quantitative improvements over previous approaches.
4.1 Experimental Setup
Collecting Training Dataset. We work on two pushing datasets: one from a synthetic environment in MuJoCo [23] and one from a real Sawyer robot. To collect training data, multiple objects are scattered on a table. The robot performs random pushes and records the observation before and after each push. The push action is represented as the starting and ending position of the end effector in world coordinates.
In the synthetic dataset, we generate 10k videos of pushing two randomly composed L-shaped objects on the table. Each video is of length 60 (600k pushes in total), and the motion between frames is relatively small. To train our prediction model, we extract the ground-truth locations from the MuJoCo state.
To collect real-world data, we generate 5k random pushes (10k images), where the length of each push is relatively large. As a result, in some of these actions, objects can undergo large motion (Figure 3). To obtain the location and correspondence of objects in the training set, we manually annotated around 30 images to train a segmentation network [21]. The location is assumed to be the center of the corresponding mask. All of the data collected for the experiment is publicly available at data link.
Evaluation Setup. In both the synthetic and the real-world setting, the test set is split into two subsets, with one object and two objects, respectively. For quantitative evaluation, we evaluate our model and the baselines in simulation, using the distance of objects to the goal position as the metric. In the simulated test set, the distance from the initial configuration to the goal is 15 times larger than the length of a single push. The locations are available to the models only for the initial and goal configurations, not at the intermediate steps. At those intermediate steps, only new images are observed, and the state information is estimated and updated by the models themselves. In the real robot setting, we manually create some interesting cases for qualitative comparison, such as manipulating novel objects and scenarios in which the robot has to predict interactions to avoid other objects.
Baselines. We compare our object-centric forward modeling approach with the following baselines and their variants:
Implicit forward model [1] (Imp-Inv/Imp-Plan): We follow Agrawal et al. [1] and learn a forward model in a feature space where the entire frame is encoded as one implicit feature. Imp-Inv generates actions greedily with the inverse model, which takes in the current and goal features; Imp-Plan plans an action sequence in the learned representation space.
Flow-based prediction model [7] (Flow/Flow-GT): We follow Ebert et al. [7] and learn to predict a transformation kernel to reconstruct the future frame. In planning, the predicted transformations are applied to designated pixels (locations) to estimate their motion. Both flow baselines update the state information by maintaining probability maps of the designated pixels, as in previous work. During training, Flow-GT additionally leverages known object locations by supervising the desired transform for the object center locations.
Analytic baseline: If the exact center of mass is known at each step, a straightforward solution is to greedily push towards the goal position. This baseline assumes a naive forward model: the change of location at the next step will be the same as the change of gripper position.
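The analytic baseline is simple enough to write down directly. The sketch below assumes a maximum push length per step (`max_step`) and 2D locations; both are illustrative choices, as are the function names.

```python
import numpy as np

def greedy_push(location, goal, max_step):
    """Gripper displacement that pushes straight towards the goal, capped at max_step."""
    delta = goal - location
    dist = np.linalg.norm(delta)
    if dist <= max_step:
        return delta
    return delta * (max_step / dist)

def naive_forward(location, gripper_delta):
    """Naive forward model: the object moves exactly as the gripper does."""
    return location + gripper_delta

location, goal = np.array([0.0, 0.0]), np.array([3.0, 4.0])
trajectory = [location]
for _ in range(6):
    location = naive_forward(location, greedy_push(location, goal, max_step=1.0))
    trajectory.append(location)
```

Under this model the object reaches the goal along a straight line, which is why the baseline has no way to anticipate a collision with a second object lying on that line.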
4.2 Experiment with Synthetic Environment
We measure the performance across methods by analyzing the average distance of objects from their goal positions. In Figure 4, we plot the average distance over time between the current location and the goal in world coordinates. We find that ‘Imp-Inv’ fails to generalize to scenarios in which the distance between the goal and the current observation is much farther than in the training set, thereby showing the importance of planning rather than using a one-step inverse model. The ‘Imp-Plan’ baseline degrades significantly for 2 blocks, suggesting that a single feature cannot encode the whole scene very well. The Flow baseline works much better than Imp-Plan because the motion space is more tractable than the implicit feature space of frames. Its performance further improves by leveraging location information during training, as seen in the ‘Flow-GT’ curve. However, using our object-centric model for planning further improves over these baselines, as shown in Figure 4.
Figure 5 showcases an interesting example where one block needs to be pushed around the other to reach the goal. In this particular case, the learning-based alternatives fail to find a plausible plan. The analytic baseline performs well at the beginning, when the dynamics are simple, but loses track of the object when the block collides with the other. In contrast, our approach carries out the correct action sequence and manages to reach the goal, demonstrating that it can reason about interactions among objects. For more qualitative results, please refer to our website.
4.3 Experiments with Real Robot
In the real robot setting, we compare our model with the best-performing baseline based on the synthetic results, i.e., ‘Flow-GT’. Figure 6 shows a qualitative result with two blocks. Similar to the example in synthetic data, to push the blue block to the goal position, our model manages to carry out a plan that avoids the red block in between. In contrast, Flow-GT generates relatively random actions, probably because the large motions that can result from a single push are difficult to model. We present additional results in the appendix, and also show that our model, trained with simple blocks, can generalize to novel objects.
How important is the interaction? We replace the interaction network with a simpler CNN that models the effect of the action on each object independently, i.e., with no interaction. We create a harder dataset in simulation in which one block lies in the way of the other block reaching the goal. In this setting, it is more crucial to understand interaction and collision. Figure 7 reports our model and the ablated model (w/o IN) in comparison with two strong baselines. Without IN, the performance is slightly better than that of our full model at the beginning but degrades more later on. This is probably because the model without interaction is more greedy, i.e., it makes progress initially but fails to pass around objects. The analytic baseline performs much worse because its simple dynamics cannot estimate the location well once collisions happen. Figure 7 shows an example of executed actions. Our model can push the object around the other object because it learns a good model of the interaction among objects.
Ablating the Correction Model. We ablate the effect of the correction model using two metrics. First, in Figure 7, we evaluate our model in the MPC setting. ‘w/o C’ estimates locations from the predicted output without correction; without the correction model to close the loop, it performs poorly. Second, we evaluate the correction model in terms of reducing the prediction error. In Figure 8 (Left), we measure the error between the predicted location and the true location when a 10-step prediction is unrolled with and without the correction module (when using the correction module, we use the intermediate observations to refine predictions). We see that the prediction error accumulates without any correction.
Lastly, Figure 8 (Right) visualizes some qualitative results. A box around the ground-truth location is plotted in green; the predicted location output by the forward model is plotted in brown; the corrected location is plotted in red. Our model learns to correct the location when the prediction is inaccurate, and retains the prediction when it is accurate.
Visualizing Planned Action Sequence. We visualize in Figure 9 the action sequences sampled from the evolving Gaussian distribution across different iterations of the cross-entropy method (CEM) and highlight the best samples. In the example depicted, we see that after several iterations the model converges to a non-greedy trajectory with the awareness of other objects.
Figure 10 visualizes the prediction of the forward model given the initial configuration and a sequence of one or more actions. In the synthetic data, where only small motions happen, both our method and the baseline generate reasonable predictions. However, in the real dataset, the flow baseline cannot learn to predict the flow because the motion is relatively large. In contrast, in the result predicted by our model, when the blue object in the middle is pushed right, the orange one next to it also moves right due to the interaction between them.
5 Conclusion
We presented an object-centric forward modeling approach for model predictive control. By leveraging the fact that a scene is comprised of a collection of distinct objects, where each object can be described via its location and visual descriptor, we designed a corresponding forward model that learns to predict in this structured space. We showed that this explicit structured representation better captures the interaction among the objects and the robot, and thereby allows better planning in conjunction with an additional correction module. While we could successfully apply our system in both synthetic and real-world settings, we relied on explicit supervision of the object locations during training. It will be an interesting direction to further relax this assumption and let the objects emerge from unsupervised videos. Lastly, while we only modeled the effects of a single class of actions, i.e., pushing, it would be useful to generalize such prediction to work across diverse actions.
Acknowledgements. We thank the reviewers for constructive comments. Yufei would like to thank Tao Chen and Lerrel Pinto for fruitful discussion. This work was partially supported by MURI N000141612007 awarded to CMU and a Young Investigator award to AG.
References
[1] (2016) Learning to poke by poking: experiential learning of intuitive physics. In NeurIPS.
[2] (2016) Interaction networks for learning about objects, relations and physics. In NeurIPS.
[3] (2018) Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261.
[4] (2016) A compositional object-based approach to learning physical dynamics. ICLR.
[5] (2011) PILCO: a model-based and data-efficient approach to policy search. In ICML.
[6] (2018) Robustness via retrying: closed-loop robotic manipulation with self-supervised learning. arXiv preprint arXiv:1810.03043.
[7] (2017) Self-supervised visual planning with temporal skip connections. CoRL.
[8] (2016) Unsupervised learning for physical interaction through video prediction. In NeurIPS.
[9] (2015) Learning visual predictive models of physics for playing billiards. ICLR.
[10] (2017) Learning to fly by crashing. In IROS.
[11] (2018) Learning latent dynamics for planning from pixels. ICML.
[12] (2016) Deep residual learning for image recognition. In CVPR.
[13] (2018) Reasoning about physical interactions with object-oriented prediction and planning. ICLR.
[14] (2018) Neural relational inference for interacting systems. ICML.
[15] (2018) Push-net: deep planar pushing for objects with unknown physical properties. In Robotics: Science and Systems (RSS).
[16] (1996) Stable pushing: mechanics, controllability, and planning. IJRR.
[17] (2015) Deep multi-scale video prediction beyond mean square error. ICLR.
[18] (2017) Combining self-supervised learning and imitation for vision-based rope manipulation. In ICRA.
[19] (2015) Action-conditional video prediction using deep networks in Atari games. In NeurIPS.
[20] (2017) Curiosity-driven exploration by self-supervised prediction. In ICML.
[21] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241.
[22] (1999) The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability.
[23] (2012) MuJoCo: a physics engine for model-based control. In IROS.
[24] (2016) Generating videos with scene dynamics. In NeurIPS.
[25] (2015) From pixels to torques: policy learning with deep dynamical models. ICML.
[26] (2016) An uncertain future: forecasting from static images using variational autoencoders. In ECCV.
[27] (2015) Embed to control: a locally linear latent dynamics model for control from raw images. In NeurIPS.
[28] (2017) Visual interaction networks: learning a physics simulator from video. In NeurIPS.
[29] (2016) Physics 101: learning physical object properties from unlabeled videos. In BMVC.
[30] (2019) DensePhysNet: learning dense physical object representations via multi-step dynamic interactions. In Robotics: Science and Systems (RSS).
[31] (2016) Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In NeurIPS.
[32] (2019) Compositional video prediction. In ICCV.
[33] (2016) A convex polynomial force-motion model for planar sliding: identification and application. In ICRA.
Appendix A Real Robot Pushing
Robot Setup: To collect real-world data, we use a Sawyer robot. We place a table in front of it, on which the objects to be pushed are placed. A Kinect V2 camera is rigidly attached overlooking the table to provide RGB-D perception data. The camera is localized with respect to the robot base via a calibration procedure.
Data Collection Procedure: Given an image of the table with objects on it, we first perform background subtraction to obtain a binary mask of the objects. Using this binary mask, we sample a pixel that lies on an object and treat it as the mid-point of the push. For the push start pixel, we sample a pixel within a square window around the mid-point such that it does not lie on top of the object. The end point of the push is calculated from the mid-point and the start pixel. These pixel locations in image space are converted to the corresponding 3D points in robot space using the depth image and the camera matrix. We then use an off-the-shelf planner to move the robot gripper finger from the start point to the end point. The image is recorded after the arm retracts. For every push, we record the push together with the images observed before and after it. Figure 3 shows some of the pushing data points collected on the real robot. In all, we collected 5K pushing data points on 8 objects.
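The sampling and back-projection steps can be sketched as below. Several details are assumptions made for the example: the end point is taken as the reflection of the start pixel through the mid-point (which follows from treating the sampled pixel as the mid-point of the push), the conversion uses a standard pinhole camera model with intrinsics `(fx, fy, cx, cy)`, and `robot_from_camera` is a 4x4 homogeneous transform obtained from calibration. The function names and the window size are hypothetical.

```python
import numpy as np

def sample_push(mask, half_window, rng, max_tries=1000):
    """Sample (start, mid, end) pixels, each as (x, y), for a push through an object pixel."""
    ys, xs = np.nonzero(mask)
    i = rng.integers(len(ys))
    mid = np.array([xs[i], ys[i]])
    height, width = mask.shape
    for _ in range(max_tries):
        # Rejection-sample a start pixel in a square window that is off the object.
        start = mid + rng.integers(-half_window, half_window + 1, size=2)
        in_image = 0 <= start[0] < width and 0 <= start[1] < height
        if in_image and not mask[start[1], start[0]]:
            end = 2 * mid - start  # mid is the mid-point of the push
            return start, mid, end
    raise RuntimeError("no valid start pixel found")

def pixel_to_robot(pixel_xy, depth_image, intrinsics, robot_from_camera):
    """Back-project a pixel with its depth and express the 3D point in the robot frame."""
    u, v = pixel_xy
    z = depth_image[v, u]
    fx, fy, cx, cy = intrinsics
    point_camera = np.array([(u - cx) * z / fx, (v - cy) * z / fy, z, 1.0])
    return (robot_from_camera @ point_camera)[:3]

rng = np.random.default_rng(0)
mask = np.zeros((20, 20), dtype=bool)
mask[8:12, 8:12] = True                      # one square object
start, mid, end = sample_push(mask, half_window=6, rng=rng)

depth = np.ones((20, 20))                    # flat scene, 1 unit from the camera
intrinsics = (100.0, 100.0, 10.0, 10.0)      # placeholder fx, fy, cx, cy
start_3d = pixel_to_robot(start, depth, intrinsics, np.eye(4))
```

The reflected end point can fall outside the image; a real pipeline would need to handle that case, for example by resampling.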
Push novel object: To see how well our method generalizes to novel objects, we tested it on pushing a measuring tape. In Figure 11, the blue arrow shows the push predicted by our method to move the object to the desired location. Even though our method has not seen this object while training the forward model, it is able to push it very close to the goal location.
Flip the object location: To further test the effectiveness of our method, we evaluated it on a more challenging scenario with two objects on the table. The goal configuration is generated by interchanging the positions of the objects in the start configuration. Figure 12 shows the sequence of actions taken by our method to carry out this task.
Appendix B Baseline Model Details
Implicit forward model (Imp-Inv): The model predicts in an implicit feature space where the entire frame is encoded as one implicit feature, without further factoring into objects. An inverse model is trained to take in the current and goal features and output the action. At test time, the inverse model is applied iteratively to greedily generate an action sequence. The inverse model also regularizes the forward model to prevent trivial solutions.
Implicit forward model with pixel reconstruction (Imp-Plan): This baseline is a variant of Imp-Inv. The action sequences are generated by a planner in the learned feature space. To further regularize the forward model so that it learns a more informative feature space, we train a decoder to reconstruct the frame in pixels. The learned representation of the frame is used by the planner.
Flow-based prediction model SNA [7] (Flow): The model learns to predict a transformation kernel to reconstruct the future frame. In planning, the predicted transformations are applied to designated pixels (locations) to estimate their motion.
Flow baseline with supervision (Flow-GT): The original flow baseline only trains on videos in an unsupervised manner. To leverage the additional location information, we provide a variant: besides transforming the pixels, the model also transforms the ground-truth location and minimizes the expected distance of the transformed location to the ground-truth location at the next step.
Analytic baseline: To leverage the location information, a simple analytic solution is to greedily push in the direction from the current position to the desired position. It assumes simple dynamics: the change in location at the next step equals the change in position of the gripper.
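For the Flow-GT variant above, the supervised quantity can be illustrated as an expected distance under a location probability map. This sketch assumes the transformed location is available as a non-negative map over pixels and that the distance is Euclidean; neither detail is stated in the source, and `expected_distance` is a hypothetical name.

```python
import numpy as np

def expected_distance(prob_map, gt_xy):
    """Expected Euclidean distance to gt_xy under a (possibly unnormalized) location map."""
    height, width = prob_map.shape
    ys, xs = np.mgrid[0:height, 0:width]
    dist = np.sqrt((xs - gt_xy[0]) ** 2 + (ys - gt_xy[1]) ** 2)
    return float((prob_map * dist).sum() / prob_map.sum())

prob_map = np.zeros((16, 16))
prob_map[9, 8] = 1.0  # all probability mass at x=8, y=9
loss_at_target = expected_distance(prob_map, (8, 9))
loss_off_target = expected_distance(prob_map, (5, 5))
```

The expectation is zero only when all mass sits on the ground-truth pixel, so minimizing it pulls the transformed map towards the annotated location.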
Appendix C Plan with Oracle Location
In this part, we compare models when we have access to the ground-truth location for each new observation at every time step. After every push, the distance between the current location and the goal in world coordinates is plotted in Figure 13. The analytic baseline should converge to zero because the exact center of mass is given by the oracle at every time step; hence it serves as the ceiling performance. Imp-Inv barely generalizes to scenarios in which the distance between the goal and the current observation is much farther than in the training set. Imp-Plan degrades in the 2-block setting, suggesting that one feature for the whole frame cannot encode complicated scenes very well. Flow works better than Imp-Plan because the motion space is more tractable. Its performance improves in Flow-GT, which leverages location information. Our model outperforms the other learning-based methods and performs comparably to the ceiling performance (Analytic) without being manually told to push toward the goal through the center of mass.
Appendix D Qualitative Results of All Baselines
In this part, we show qualitative results in comparison with all baselines. This supplements Figure 5, which only showcases a subset of strong baselines. For more results, please refer to the project page.