SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control
Abstract
In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose embedding along with their motion, modeled as a change in the pose space due to the applied actions. We train our model using a pair of point clouds separated by an action and show that, given supervision only in the form of point-wise data associations between the frames, our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real time, achieves good prediction of scene dynamics, and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/
I. Introduction
Imagine we are receiving observations of a scene from a camera and we would like to control our robot to reach a target scene. Traditional approaches to visual servoing [1] decompose this problem into two parts: data-associating the current scene to the target (usually through the use of features) and modeling the effect of applied actions on the scene, combining these in a tight loop to servo to the target. Recent work on deep learning has looked at learning similar predictive models directly in the space of observations, relating changes in pixels or 3D points directly to the applied actions [2, 3, 4]. Given a target scene, we can use this predictive model to generate suitable controls to visually servo to the target using model-predictive control [5]. Unfortunately, for this pipeline to work, we need an external system (such as [6, 7]) capable of providing long-range data associations to measure progress.
As we showed in prior work [4], instead of reasoning about raw pixels, we can predict scene dynamics by decomposing the scene into objects and predicting object dynamics instead. While this significantly improves prediction results, it still does not provide a clear solution to the data-association problem that we encounter during control: we still lack the capability to explicitly associate objects/parts across scenes. We observe three key points: 1) we can data-associate across scenes by learning to predict the poses of detected objects/parts in the scene (the pose implicitly provides tracking), 2) we can model the dynamics of an object directly in the predicted low-dimensional pose space, and 3) we can predict scene dynamics by combining the dynamics predictions of each detected part.
We combine these ideas in this work to propose SE3-Pose-Nets, a deep network architecture for efficient visuomotor control that jointly learns to data-associate across long-term sequences. We make the following contributions:

We show how it is possible to learn predictive models that detect parts of the scene and jointly learn a consistent pose space for these parts with minimal supervision.

We demonstrate how a deep predictive model can be used for reactive visuomotor control using simple gradient backpropagation and a more sophisticated Gauss-Newton optimization, reminiscent of approaches in inverse kinematics [8].

We present results on real-time reactive control of a Baxter arm using raw depth images and velocity control, both in simulation and on real data.
Fig. 1 shows an example scenario where our proposed method can be applied to control the robot to reach the target state (right) from the initial state (left).
II. Related work
Modeling scenes and dynamics: Our work builds on top of prior work on learning structured models of scene dynamics [4]. Unlike SE3-Nets, we now explicitly model data associations through a low-dimensional pose embedding that we train to be consistent across long sequences. Similar to Boots et al. [2], our model learns to predict point clouds based on applied actions, but through a more structured intermediate representation that reasons about objects and their motions. Unlike Finn et al. [3], we operate on depth data and reason about motion in 3D using masks and transforms, while training our networks in a supervised fashion given point-wise data associations across pairs of frames.
Visuomotor control: Recently, there has been a lot of work on visuomotor control, primarily through the use of deep networks [9, 10, 11, 12, 5, 13]. These methods either directly regress to controls from visual data [11, 10], generate controls by planning on learned forward dynamics models [5, 9], use inverse dynamics models [12], or apply reinforcement learning [13]. Similar to some of these methods, we generate controls by planning with a learned dynamics model, albeit in a learned low-dimensional latent space.
Specifically, work by Finn et al. [5] is closely related, but differs in two main ways: unlike their approach, which controls in the observation space through sampled actions (at 5 Hz), our controller runs gradient-based optimization on a learned low-dimensional pose embedding in real time (> 30 Hz). Also, their approach requires an external tracker to measure progress, while we explicitly learn to data-associate across large motions.
Our work borrows several ideas from prior work by Watter et al. [9], which learns a latent low-dimensional embedding for fast reactive control from pairs of images related by an action. Unlike their work though, we use a structured latent representation (object poses), predict object masks, and use a physically grounded 3D loss that only models change in observations, as opposed to a restrictive image reconstruction loss. Lastly, our losses are physically motivated, similar to those proposed for training position-velocity encoders [13], but our learned pose embedding is significantly more structured and we train our networks end-to-end directly for control.
Data association: Related work in the computer vision literature has looked at tackling the data-association problem, primarily by matching visual descriptors, either hand-tuned [14] or, more recently, learned using deep networks [15, 16]. In prior work, Schmidt et al. [15] learn robust visual descriptors for long-range associations using correspondences over short training sequences. Unlike this work, we only use correspondences between pairs of frames to learn a consistent pose space that lets us data-associate across long sequences.
Visual servoing: Finally, there have been multiple approaches to visual servoing over the years [1, 17, 18], including some newer methods that use deep learned features and reinforcement learning [19]. While these methods depend on an external system for data association or on pre-specified features, our system is trained end-to-end and can control directly from raw depth data.
III. SE3-Pose-Nets
Our deep dynamics model, the SE3-Pose-Net, decomposes the problem of modeling scene dynamics into three sub-problems: a) modeling scene structure by identifying parts of the scene that move distinctly and by encoding their latent state as a 6D pose, b) modeling the dynamics of individual parts under the effect of the applied actions as a change in the latent pose space (parameterized as an SE(3) transform), and finally c) combining these local pose changes to model the dynamics of the entire scene. Each sub-problem is modeled by a separate component of the SE3-Pose-Net:

Modeling scene structure: An encoder (h_enc) that decomposes the input point cloud (x)¹ into a set of rigid parts, predicting per part a 6D pose (p) and a dense segmentation mask (m) that highlights points belonging to that part. (¹Bold fonts denote collections of items.)

Modeling part dynamics: A pose transition network (h_delta) that models dynamics in the pose space, taking in the current poses (p_t) and action (u_t) to predict the change in poses (Δp_t).

Predicting scene dynamics: A transform layer (h_trans) that predicts the next point cloud (x̂_{t+1}) given the current point cloud (x_t), the predicted object masks (m_t), and the predicted pose deltas (Δp_t), by explicitly applying 3D rigid-body transforms to the input point cloud.
Fig. 2 shows the network architecture of the SE3-Pose-Net. Next, we present the details of the three sub-components and outline a procedure for training the SE3-Pose-Net end-to-end with minimal supervision.
III-A. Modeling scene structure
Given a 3D point cloud x_t from a depth sensor (represented as a 3-channel image, 3 × H × W), the encoder (blue block in Fig. 2) segments the scene into distinctly moving parts (m_t) and predicts a 6D pose per segmented part (p_t):
m_t, p_t = h_enc(x_t)    (1)
The encoder has three parts: the first is a convolutional network that generates a latent representation of the input point cloud (x_t). This network has five convolutional layers, each followed by a max-pooling layer. The latent representation is further used as input for the mask and pose predictions.
Object masks: We use a deconvolutional network to predict a dense pixel-wise segmentation of the scene into its constituent parts (m_t). Similar to prior work [4], we use a fully-convolutional architecture with five deconvolutional layers and a skip-add architecture to improve the sharpness of the predicted segmentation. The masks predicted by this network are at full resolution with K channels (K × H × W), where K is a pre-specified hyperparameter that is greater than or equal to the number of moving parts in the scene (including the background). The predicted segmentation mask learns to attend to parts of the scene that move together, representing areas of the scene that can move independently as different parts. As in prior work [4], we formalize mask prediction as a soft-classification problem where the network outputs a length-K probability distribution per point, which we sharpen to push towards a binary segmentation mask.
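As a concrete illustration, the soft-classification and sharpening step can be sketched in a few lines of NumPy. The exponentiate-and-renormalize sharpening scheme and the `gamma` value below are illustrative assumptions, not the exact scheme used by the network:

```python
import numpy as np

def soft_masks(logits):
    """Per-point softmax over the K mask channels (axis 0): each point
    gets a length-K probability distribution over the K parts."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def sharpen(masks, gamma=2.0, eps=1e-12):
    """Push each per-point distribution toward a binary assignment by
    exponentiating and renormalizing (gamma would be annealed upward
    during training to approach a hard segmentation)."""
    m = masks ** gamma
    return m / (m.sum(axis=0, keepdims=True) + eps)
```

During training, a sharpened distribution behaves increasingly like a hard part assignment while remaining differentiable.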
Object poses: Given the encoded latent representation, we use a three-layer fully-connected network to predict the 6D pose of each of the segmented parts. We represent each pose by 6 numbers: a 3D position (t) and an orientation (r), represented as a 3-parameter axis-angle vector. As we show later, our pose network learns to predict consistent poses which can be used to data-associate observations over long sequences of motions.
At a high level, the encoder implicitly learns the structure of observed scenes by persistently identifying parts and predicting a consistent pose for each part across multiple scenes.
III-B. Modeling part dynamics
Once we have identified the constituent parts of the scene and their poses, we can reason about the effect of applied actions on these parts. We model this notion of "part dynamics" through a fully-connected pose transition network that takes the predicted poses from the encoder (p_t) and applied actions (u_t) as input to predict the change in pose (Δp_t) for all segmented parts:
Δp_t = h_delta(p_t, u_t)    (2)
where Δp_t is represented as an SE(3) transform per part, with a rotation R (parameterized as an axis-angle transform) and a translation vector t. The transition network first applies two fully-connected layers to each of the two inputs, concatenates their outputs, and applies two final fully-connected layers to predict the pose deltas. As we show later in Sec. IV, we rely on good predictions of pose deltas through the pose transition network for efficient control.
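The shape of this transition network can be sketched as a NumPy forward pass. The layer widths, PReLU slope, and random initialization below are illustrative assumptions (the actual network is trained in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)

def layer(d_in, d_out):
    # Random weights/biases stand in for trained parameters.
    return rng.normal(0.0, 0.1, (d_in, d_out)), np.zeros(d_out)

def prelu(x, a=0.25):
    return np.where(x > 0.0, x, a * x)

class PoseTransitionNet:
    """Two FC layers per input branch, concatenation, then two FC layers
    predicting a 6D delta (translation + axis-angle rotation) per part."""
    def __init__(self, k=8, ctrl_dim=7, hidden=64):
        self.k = k
        self.pose_branch = [layer(6 * k, hidden), layer(hidden, hidden)]
        self.ctrl_branch = [layer(ctrl_dim, hidden), layer(hidden, hidden)]
        self.head = [layer(2 * hidden, hidden), layer(hidden, 6 * k)]

    def __call__(self, poses, u):
        h_p = poses.reshape(-1)
        for w, b in self.pose_branch:
            h_p = prelu(h_p @ w + b)
        h_u = u
        for w, b in self.ctrl_branch:
            h_u = prelu(h_u @ w + b)
        h = np.concatenate([h_p, h_u])
        w, b = self.head[0]
        h = prelu(h @ w + b)
        w, b = self.head[1]
        return (h @ w + b).reshape(self.k, 6)  # delta-pose per part
```

Because the network is a small fully-connected model, a forward pass is cheap, which matters for the control loop in Sec. IV.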
III-C. Predicting scene dynamics
Finally, given the predicted scene segmentation (m_t) and the change in poses (Δp_t), we can model the dynamics of the input scene (x_t) under the effect of the applied action (u_t). We do this through the transform layer (h_trans), which applies the predicted rigid rotations (R_k) and translations (t_k) to the input point cloud, weighted by the predicted mask probabilities (m_t). We predict the transformed point cloud (x̂_{t+1}) as:
x̂_i = Σ_{k=1}^{K} m_t^{k,i} (R_k x_i + t_k)    (3)
where x̂_i is the 3D output point corresponding to input point x_i and m_t^{k,i} is the probability that point i belongs to part k. In effect, we apply the kth rotation and translation (R_k, t_k) to all points that belong to the corresponding object, as indicated by the kth mask channel (assuming that the mask is binary after weight sharpening), to predict the transformed points belonging to that object. Repeating this for all objects gives us the transformed output point cloud (x̂_{t+1}). Note that this part has no trainable parameters. For more details, please refer to prior work [4].
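A minimal NumPy sketch of the transform layer (Eqn. 3), assuming per-part rotations are given as axis-angle vectors converted to rotation matrices via Rodrigues' formula:

```python
import numpy as np

def axis_angle_to_rotmat(aa):
    """Rodrigues' formula: 3-vector axis-angle -> 3x3 rotation matrix."""
    theta = np.linalg.norm(aa)
    if theta < 1e-12:
        return np.eye(3)
    k = aa / theta
    K = np.array([[0.0, -k[2], k[1]],
                  [k[2], 0.0, -k[0]],
                  [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def transform_layer(points, masks, rots, trans):
    """Blend K rigid transforms of the input cloud, weighted per point by
    the mask probabilities. No trainable parameters.
    points: (N, 3), masks: (K, N), rots: (K, 3) axis-angle, trans: (K, 3)."""
    out = np.zeros_like(points)
    for k in range(masks.shape[0]):
        R = axis_angle_to_rotmat(rots[k])
        out += masks[k][:, None] * (points @ R.T + trans[k])
    return out
```

With binary masks this reduces to applying exactly one rigid transform per point, matching the description above.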
III-D. Training
We now outline a procedure to train the SE3-Pose-Net end-to-end, using supervision in the form of point-wise data associations across a pair of point clouds (x_t, x_{t+1}) related by an action (u_t), i.e., for each input point, we know its corresponding point in the next frame if it is visible. No other supervision is given for learning the masks, poses, and the change in poses. Fig. 2 (bottom left) shows a schematic of this procedure. Given two point clouds x_t, x_{t+1}, we use the encoder to predict the corresponding masks and poses:
m_t, p_t = h_enc(x_t);  m_{t+1}, p_{t+1} = h_enc(x_{t+1})    (4)
Next, the predicted pose (p_t) and control (u_t) at time t are used as input to the pose transition network to predict the change in pose from t to t+1:
Δp_t = h_delta(p_t, u_t)    (5)
Finally, we use the transform layer (Eqn. 3) to predict the next point cloud:
x̂_{t+1} = h_trans(x_t, m_t, Δp_t)    (6)
The predicted mask (m_{t+1}) at time t+1 is discarded. We use two losses to train the entire pipeline end-to-end:

A 3D loss (L_3D) that penalizes the error between the predicted point cloud (x̂_{t+1}) and the data-associated target point cloud (x_{t+1}). We use a normalized version of the mean-squared error (MSE) that measures the negative log-likelihood under a Gaussian centered around the target with a standard deviation dependent on the target magnitude:
L_3D = (1/N̂) Σ_{i=1}^{N} Σ_{d=1}^{3} (x̂_{i,d} − x_{t+1,i,d})² / (α |f_{i,d}| + β)    (7) where f_i = x_{t+1,i} − x_{t,i} denotes the ground-truth motion for point i relative to the input point cloud x_t, N is the number of points in the point cloud, N̂ is the number of points that actually move between t and t+1, and α & β are hyperparameters (held fixed in all our experiments). This loss is aimed at tackling two main issues with a standard MSE loss: a) by normalizing the loss by a separate scalar per dimension (α |f_{i,d}| + β) that depends on the target magnitude, we make the loss scale-invariant, allowing us to treat parts that move little (such as the end-effector when only the wrist rotates) equally with those that have large motion (e.g., the elbow), and b) by dividing the total error by the number of points (N̂) that move in the scene, we treat scenes where very few points move equally with those where large parts move.

A pose consistency loss (L_cons) that encourages consistency between the poses predicted by the encoder (p_{t+1}) and the change in pose predicted by the pose transition network (Δp_t):
L_cons = (1/|p|) ‖p_{t+1} − (p_t ⊕ Δp_t)‖²    (8) where ⊕ refers to composition in pose space, p_t ⊕ Δp_t is the expected pose at t+1 from composing the current pose (p_t) and the predicted pose change from the transition model (Δp_t), and |p| is the cardinality of p. In essence, this loss constrains the encoder to predict poses that are consistent with the pose deltas predicted by the transition model. This loss encourages global consistency in the pose space by enforcing local consistency over pairs of frames and is crucial for learning a pose space that is consistent across long-term motions.
The total loss for training (L) is a sum of the two losses: L = L_3D + γ L_cons, where γ controls the relative strengths of the two losses and is held fixed in all our experiments. A key point to note is that we do not provide any explicit supervision to learn the pose space. While the consistency loss ensures that the poses are more or less globally consistent, it does not anchor them to a specific 3D position or orientation. As such, the poses learned by the network need not correspond directly to the canonical 6D pose of the parts: the predicted part position does not need to correspond to its center, and the orientation need not be aligned to the part's principal axes. Providing more constraints to regularize and physically ground the pose space is an interesting area for future work.
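The two training losses can be sketched as follows. The α/β values are illustrative assumptions, and for simplicity the consistency term composes poses represented as rotation-matrix/translation pairs rather than the 6D axis-angle parameterization used by the network:

```python
import numpy as np

def normalized_3d_loss(pred, target, prev, alpha=0.5, beta=1e-3, thresh=1e-6):
    """Sketch of the normalized 3D loss: per-dimension squared error scaled
    by the target motion magnitude, averaged over points that actually move.
    pred, target, prev: (N, 3) point clouds; alpha/beta are illustrative."""
    flow = target - prev                                 # ground-truth motion
    n_moving = max(int((np.linalg.norm(flow, axis=1) > thresh).sum()), 1)
    err = (pred - target) ** 2 / (alpha * np.abs(flow) + beta)
    return err.sum() / n_moving

def compose(R, t, dR, dt):
    """Apply the delta transform (dR, dt) to the pose (R, t) in SE(3)."""
    return dR @ R, dR @ t + dt

def consistency_loss(poses_next, poses_t, deltas):
    """Sketch of the consistency loss: MSE between the encoder's poses at
    t+1 and the poses at t composed with the predicted deltas."""
    total, count = 0.0, 0
    for (Rn, tn), (Rt, tt), (Rd, td) in zip(poses_next, poses_t, deltas):
        Rc, tc = compose(Rt, tt, Rd, td)
        total += ((Rn - Rc) ** 2).sum() + ((tn - tc) ** 2).sum()
        count += 1
    return total / max(count, 1)
```

Note how an identity delta with unchanged poses yields zero consistency loss, which is exactly the local consistency the training objective enforces.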
IV. Closed-Loop Visuomotor Control using SE3-Pose-Nets
We now show how an SE3-Pose-Net can be used for closed-loop visuomotor control to reach a target specified as a target depth image, essentially performing visual servoing [1]. A crucial component of every visual servoing system is to perform data association between the current image and the target image, which can then be used to generate controls that reduce the corresponding offsets. SE3-Pose-Nets solve this problem by making use of the learned, low-dimensional latent pose space. By enforcing frame-to-frame consistency in the pose space through the consistency loss (Eqn. 8), the pose space becomes consistent; that is, our encoder network learns to data-associate observations to unique poses which are consistent under the effect of actions. Importantly, these data associations are generated at the mask, or object, level, resulting in an ability akin to object detection in computer vision. Unlike prior work [20, 4], which is restricted to operate in the observation space of 3D points and requires data associations between current and target points to be provided externally, we can now directly minimize the error between the poses automatically extracted from the initial and the target depth images to recover the sequence of actions that takes the robot from the initial state to the target. Additionally, unlike prior work [20], we do not need an external tracking system to measure progress toward the goal, as our learned encoder implicitly tracks in the pose space.
IV-A. Reactive control
Algorithm 1 presents a simple algorithm for reactive control using SE3-Pose-Nets that efficiently computes a closed-loop sequence of controls that takes the robot from any initial state to the specified target (the corresponding network structure is given in the lower-right panel of Fig. 2). Given a target point cloud x*, the algorithm uses the learned encoder to predict the poses of the constituent parts, p*. This becomes the target for the controller.
At every time step, the algorithm computes the pose embedding p_t of the current observation x_t. We would like to find controls that move these poses closer to the target poses. To do this, the algorithm makes a prediction through the learned pose transition model using the current poses (p_t) and an initial guess for the controls (here we use u = 0), resulting in a predicted change in poses (Δp_t) and the corresponding predicted next pose (p̂_{t+1}).² To move these poses towards the targets, we formulate an error function based on the mean-squared error between these predicted poses and the target poses. The algorithm then computes the gradient of this error with respect to the control inputs, which it uses to generate the next controls. We propose two ways of computing the gradient: (²Even when using a zero-control initialization, this forward pass through the network is necessary to get the correct gradients for the backward pass.)

Backpropagation: A simple approach to compute this gradient update is to backpropagate the gradients of the pose error through the pose transition model. Unlike backpropagation during training, where we compute gradients w.r.t. the network weights, here we fix the weights and compute gradients over the input controls. The resulting control scheme is analogous to the Jacobian Transpose method from inverse kinematics [8], where backprop provides the gradient of the transition model.

Gauss-Newton: A better approach is to compute the Gauss-Newton update:
u = (JᵀJ + λI)⁻¹ Jᵀ ∇E    (9) where J is the Jacobian of the transition model and ∇E is the gradient of the pose error (E). However, instead of computing the update via backpropagation, we condition the pose error gradient (∇E) by the Jacobian's pseudo-inverse, where λ controls the strength of the conditioning (set to 1e-4 in all our experiments). In practice, this leads to significantly faster convergence with little to no additional overhead in computation compared to the backpropagation method, as the Jacobian can be computed efficiently through finite differencing. We do this by running a single forward propagation with perturbed control inputs (perturbation set to 1e-3) stacked along the batch dimension to take advantage of GPU parallelism. Eqn. 9 is also analogous to the Damped Least Squares technique from inverse kinematics [8].
Finally, the algorithm computes the unit vector in the direction of the computed update and scales this by a pre-specified control magnitude (1 radian in all our experiments) to get the next control u_t. We execute this control on the robot and repeat in a closed loop until convergence, measured either by reaching a small error in the pose space or by a maximum number of iterations, whichever comes first.
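The per-step update of Algorithm 1 can be sketched as follows, treating the pose transition model as a black-box function of the controls. The finite-difference and damping constants follow the values quoted above; the linear test model in the usage note is purely illustrative:

```python
import numpy as np

def finite_diff_jacobian(f, u, eps=1e-3):
    """Jacobian of f at u via forward differences; each column is one
    perturbed forward pass (batched on the GPU in the real system)."""
    y0 = f(u)
    J = np.zeros((y0.size, u.size))
    for i in range(u.size):
        du = u.copy()
        du[i] += eps
        J[:, i] = (f(du) - y0) / eps
    return J

def control_step(f, u0, pose_target, lam=1e-4, mag=1.0):
    """One reactive-control update: damped Gauss-Newton direction on the
    pose error, then a unit vector scaled by a fixed control magnitude."""
    residual = f(u0) - pose_target          # error between prediction and target
    J = finite_diff_jacobian(f, u0)
    step = np.linalg.solve(J.T @ J + lam * np.eye(u0.size), J.T @ residual)
    direction = -step                       # descend the pose error
    return mag * direction / (np.linalg.norm(direction) + 1e-12)
```

On a toy linear "transition model" f(u) = A u, a single step points from the initial controls toward the minimizing controls, reducing the pose error, which mirrors the behavior of the full controller on the learned model.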
V. Evaluation
We first evaluate SE3-Pose-Nets on predicting the dynamics of a scene where a Baxter robot moves its right arm in front of the depth camera, both in simulation and in the real world. We also present results on control performance, where the task is to control the joints of the Baxter's right arm to reach a specified target observation.
TABLE I: Average per-point flow MSE (cm) across all models on simulated and real data.

| Setting   | SE3-Pose-Nets | SE3-Pose-Nets + Joint Angles | SE3-Nets | SE3-Nets + Joint Angles | Flow  | Flow + Joint Angles |
|-----------|---------------|------------------------------|----------|-------------------------|-------|---------------------|
| Simulated | 0.044         | 0.038                        | 0.030    | 0.024                   | 0.035 | 0.030               |
| Real      | 0.234         | 0.224                        | 0.221    | 0.212                   | 0.228 | 0.218               |
V-A. Task and data collection
We first provide details on the task setting in simulation. Our simulator uses OpenGL to render depth images from a camera pointed towards the robot (see Fig. 1) and is kinematic, with little to no dynamics in the motion and no depth noise. We use this as a test bed to assess the effectiveness of the proposed algorithm and compare it to various baselines. We collected around 8 hours of training data in the simulator where the robot moves all joints on its right arm. Around half of the examples are whole-arm motions where the robot plans a trajectory to reach a target end-effector position sampled randomly in the workspace in front of the robot. The rest of the motions are made up of perturbations of individual joints on the robot from various initial configurations, sampled to be within the viewpoint of the camera. These additional motions help in decorrelating the kinematic-chain dependencies during training, improving performance especially on joints lower down the kinematic chain. Overall, this dataset has around 800,000 training images collected from a single fixed viewpoint. Similar to the simulated setting, we collect data from the real robot, where the Baxter moves its right arm in front of an ASUS Xtion Pro camera placed around 2.5 meters from the robot. Data associations, ground-truth masks, and ground-truth flows are determined via the DART tracker [6] on the real data. We collected around 4.5 hours of training data on the real robot, with a 2:1 mix of whole-arm motions and single-joint motions. As before, the motions were generated through a planner that tries to get the end-effector to randomly sampled targets in the workspace. Unlike the simulated data, the depth data in the real world is quite noisy and there are significant physical and dynamics effects. For both the simulated and real-world settings, our controls are joint velocities.
V-B. Baselines
We compare the performance of our algorithm against five different baselines:

SE3-Pose-Nets + Joint Angles: Our proposed network with the joint angles of the robot given as an additional input to the encoder. We use this network as a strong baseline that uses significant additional information to inform the pose prediction.

SE3-Nets: Prior work from [4], where the network directly predicts masks and change in poses given input point clouds and control. There is no explicit pose space in this network, so we do control in the full point-cloud observation space for this network.

SE3-Nets + Joint Angles: SE3-Nets that additionally take in joint angles as input.

Flow Net: Baseline flow model from prior work [4]. This network directly regresses to a per-point 3D flow without any explicit transforms or masks.

Flow Net + Joint Angles: Baseline flow network that additionally takes in joint angles as input.
All baseline networks are trained on the same data as the SE3-Pose-Nets using the normalized 3D loss (L_3D).
V-C. Training details
We implemented our networks in PyTorch, using the Adam optimizer for training with a learning rate of 1e-4. All our networks use Batch Normalization [21] and the PReLU non-linearity [22]. We set the maximum number of moving objects to K = 8 for all our experiments (7 joints + background). We train each network for 100,000 iterations in simulation and 75,000 iterations on the real data, and use the network that achieves the lowest validation loss across all training iterations for all our results.
V-D. Results on modeling scene dynamics
First, we present results on the prediction task used for training all the networks. Table I shows the average per-point flow MSE (cm) across all baselines on both simulated and real data. SE3-Nets achieve the best results on both the simulated and real datasets, while the baseline flow network performs slightly worse. Unsurprisingly, networks that have access to the joint angles do better than those that do not, as they have strictly more information that is highly correlated with the sensor data. To our initial surprise, the SE3-Pose-Nets have the largest prediction errors among all baseline models. However, this makes sense given the following considerations: a) SE3-Pose-Nets are trained to explicitly embed the observations in a pose space from which they predict the scene dynamics, rather than using the input point cloud directly. While this provides more structure within the network and is necessary for the control task, it also restricts the prediction to go through an information bottleneck, which generally makes the training problem harder. b) SE3-Pose-Nets additionally have to optimize for the consistency loss, which enforces constraints that are different from those of the prediction problem evaluated in this experiment.
Fig. 3 visualizes the masks predicted by SE3-Pose-Nets and the baseline SE3-Net on one example each from the simulated and real data, along with the ground-truth masks. Even without any supervision, SE3-Pose-Nets and SE3-Nets learn a detailed segmentation of the arm into multiple salient parts, most of which are consistent with ground-truth segments on both the simulated and real data.
V-E. Control performance
Next, we test the performance of the different networks on controlling the Baxter's right arm to reach a target configuration, specified as a target point cloud. We test both of the control algorithms presented in Sec. IV using our SE3-Pose-Net model and the baseline models by comparing their performance on a set of 11 distinct servoing tasks (each with an average initial error of ~30 degrees per joint). We first detail a few specifics, followed by an analysis of the results.
Control with baseline models: While SE3-Pose-Nets learn a pose space that can be used for long-term data associations and control, the baseline models operate directly in the space of observations and thus require external data associations in the observation space to be able to do any control at all. For the simulation experiments, we provide these baseline algorithms with ground-truth associations and use the procedure outlined in Alg. 1, using the MSE between the predicted point cloud and the target as the error to be minimized for generating controls. It is important to keep in mind that the baseline models have an advantage over SE3-Pose-Nets for the control task, as they get strictly more information in the form of ground-truth data associations.
Metric and task specification: We use the mean absolute error in the joint angles as the metric for measuring control performance. We run all models to convergence (based on the pose error for SE3-Pose-Nets and the 3D point/flow error for the baseline models) or for a maximum of 200 iterations. Additionally, for SE3-Pose-Nets, we terminate if the pose error increases for 10 consecutive iterations. We integrate joint velocities forward to generate position commands for the robot, both in simulation and in the real world.
Simulation results: Fig. 4 plots the error in joint angles as a function of the number of control iterations. The plots on the left and middle show results for networks that use only raw depth as input; we control the first six joints of the robot using these networks. The right figure shows results for networks that additionally use joint angles as input; we control all 7 joints of the robot with these networks. In general, SE3-Pose-Nets achieve excellent performance compared to the baseline models, converging quickly to almost zero error even in the absence of any external data associations. The flow model performs comparably to the SE3-Pose-Nets, while SE3-Nets converge far slower. We highlight a few key results: 1) For all methods, Gauss-Newton based optimization (GN) leads to faster convergence than Backprop. This is to be expected, as Gauss-Newton conditions the gradient based on pseudo-second-order information. 2) Baseline models perform worse given joint angles than without. This is due to an issue of credit assignment during gradient computation: the networks learn erroneous causations (where there are only correlations) between the input joint angles and the predicted flows, which diminishes the control's contribution to the prediction problem and subsequently affects the gradient. 3) All models struggle to model the motion of the final wrist joint due to increasing correlations along the kinematic chain that result in a small contribution of the joint's own motion to the full movement of the wrist. SE3-Pose-Nets can overcome this problem given input joint angles (Fig. 4, right), which provides encouraging evidence that adding in the joint state supplements information that is hard to parse directly from the visual state.
4) SE3-Nets converge slowly due to a lack of good control initializations, which are needed to ensure that the network starts off with a meaningful segmentation: given zero controls, the SE3-Net can choose not to segment the arm at all. Finally, 5) the good performance of SE3-Pose-Nets indicates that the learned pose space is consistent across large motions and can be used for fast reactive control, albeit not quite as robustly as the baseline methods given data associations. SE3-Pose-Nets fail to minimize the pose error on one of the tested configurations, leading to an increasing error in Fig. 4, left. Our termination check that looks for increasing pose errors does correctly identify this case, and we succeed on all the other examples (Fig. 4, middle). We discuss ways to further improve the robustness of our approach in Sec. VI.
Real robot results: We further test the control performance using SE3-Pose-Nets on a few real-world examples. We do not compare to any baselines, as they need an explicit external data-association system to be feasible. On the real robot, we restrict ourselves to controlling the first four joints of the right arm using the SE3-Pose-Net, and control the first six joints using the model that additionally takes in joint angles as input. Fig. 5 shows the errors as a function of the iteration count. Both models converge very quickly, which indicates that our network is able to control robustly even in the presence of sensor noise and unmodeled dynamics. Surprisingly, there is very little difference between the GN and Backprop algorithms on the real data. A video showing real-time control results on the Baxter can be found on the project page.
Speed: SE3-Pose-Nets optimize errors directly in the low-dimensional pose space for control. This leads to significant speedups compared to the baselines: while both the flow and SE3-Nets can operate at around 10 Hz (excluding the data-association pipeline), SE3-Pose-Nets run in real time (30 Hz), including the pose detection part.
VI. Discussion
This paper presents SE3-Pose-Nets, a framework for learning predictive models that enable control of objects in a scene. In the context of a robot manipulator, we showed how they solve this problem by learning a predictive model for the individual parts of the manipulator, as in prior work [4]. Additionally, SE3-Pose-Nets learn a consistent pose space for these parts, essentially learning to detect the 6D poses of manipulator parts in raw depth images. This detection capability enables SE3-Pose-Nets to solve the data-association problem that is crucial for relating the current observation of the manipulator to a desired target observation. The difference between these poses can be used to generate control signals to move the manipulator to its target pose, similar to visual servoing applied to an image of the manipulator. We also showed how the learned network can be used to determine the gradients needed for the control signals. Our experiments show that SE3-Pose-Nets generate control superior to representations learned by previous techniques, even when these are provided with external data associations. Furthermore, in addition to providing data associations, SE3-Pose-Nets allow us to compute controls directly in the low-dimensional pose space, enabling far more efficient control than techniques that operate in the raw perception space. Crucially, all these abilities are learned in a single framework based on raw data traces solely annotated with frame-to-frame point cloud correspondences.
Overall, the control performance of our SE3-Pose-Nets is extremely encouraging and provides a strong proof of concept that such networks can learn a consistent pose space that supports long-range correspondences and fast reactive control. While this provides reason to rejoice, there are multiple areas for improvement: 1) As the real robot results show, SE3-Pose-Nets (and the other baselines) have difficulty handling joints further down the kinematic chain (joints 4, 5, and 6 on the Baxter), whose motions are significantly correlated with the motions of the joints above them; additionally, the end-effector has poor visibility in depth images. Adding state information in the form of encoder data significantly alleviates this issue but does not fully solve it. There are potentially multiple ways to improve the model here, including curriculum and active learning, along with better regularization and physical grounding of the pose space to remove inconsistencies. 2) A key area for future work is extending our system to interact with and manipulate external objects. Here, a consistent pose space for objects in the scene will enable the robot to plan its motion toward the objects, enabling smooth interactions. 3) Finally, while we have shown that SE3-Pose-Nets can be used for single-step reactive control, we would like to perform long-term planning using model-based techniques such as iterative LQG [23] to leverage the full strength of the latent pose space, i.e., fast real-time rollouts directly in the pose space.
Acknowledgments
This work was funded in part by the National Science Foundation under contract number NSF-NRI-1637479 and STTR number 1622958 awarded to Lula Robotics. We would also like to thank NVIDIA for generously providing a DGX used for this research via the UW NVIDIA AI Lab (NVAIL).
References
 [1] S. Hutchinson, G. D. Hager, and P. I. Corke, “A tutorial on visual servo control,” IEEE Transactions on Robotics and Automation, vol. 12, no. 5, pp. 651–670, 1996.
 [2] B. Boots, A. Byravan, and D. Fox, “Learning predictive models of a depth camera & manipulator from raw execution traces,” in ICRA. IEEE, 2014, pp. 4021–4028.
 [3] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised learning for physical interaction through video prediction,” in NIPS, 2016.
 [4] A. Byravan and D. Fox, “SE3-Nets: Learning rigid body motion using deep neural networks,” in ICRA. IEEE, 2017, pp. 173–180.
 [5] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” in ICRA. IEEE, 2017, pp. 2786–2793.
 [6] T. Schmidt, R. A. Newcombe, and D. Fox, “DART: Dense articulated real-time tracking,” in RSS, 2014.
 [7] R. Anderson, D. Gallup, J. T. Barron, J. Kontkanen, N. Snavely, C. Hernández, S. Agarwal, and S. M. Seitz, “Jump: virtual reality video,” ACM Transactions on Graphics (TOG), 2016.
 [8] S. R. Buss, “Introduction to inverse kinematics with jacobian transpose, pseudoinverse and damped least squares methods.”
 [9] M. Watter, J. Springenberg, J. Boedecker, and M. Riedmiller, “Embed to control: A locally linear latent dynamics model for control from raw images,” in NIPS, 2015, pp. 2728–2736.
 [10] N. Wahlström, T. B. Schön, and M. P. Deisenroth, “From pixels to torques: Policy learning with deep dynamical models,” arXiv preprint arXiv:1502.02251, 2015.
 [11] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” JMLR, vol. 17, no. 39, pp. 1–40, 2016.
 [12] P. Agrawal, A. Nair, P. Abbeel, J. Malik, and S. Levine, “Learning to poke by poking: Experiential learning of intuitive physics,” arXiv preprint arXiv:1606.07419, 2016.
 [13] R. Jonschkowski, R. Hafner, J. Scholz, and M. Riedmiller, “PVEs: Position-velocity encoders for unsupervised learning of structured state representations,” arXiv preprint arXiv:1705.09805, 2017.
 [14] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” IJCV, vol. 60, no. 2, pp. 91–110, 2004.
 [15] T. Schmidt, R. Newcombe, and D. Fox, “Self-supervised visual descriptor learning for dense correspondence,” IEEE Robotics and Automation Letters, vol. 2, no. 2, pp. 420–427, 2017.
 [16] X. Wang and A. Gupta, “Unsupervised learning of visual representations using videos,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2794–2802.
 [17] B. Espiau, F. Chaumette, and P. Rives, “A new approach to visual servoing in robotics,” Geometric reasoning for perception and action, pp. 106–136, 1993.
 [18] F. Chaumette, S. Hutchinson, and P. Corke, “Visual servoing,” in Springer Handbook of Robotics, 2016, pp. 841–866.
 [19] A. X. Lee, S. Levine, and P. Abbeel, “Learning visual servoing with deep features and fitted Q-iteration,” arXiv preprint arXiv:1703.11000, 2017.
 [20] C. Finn and S. Levine, “Deep visual foresight for planning robot motion,” arXiv preprint arXiv:1610.00696, 2016.
 [21] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” arXiv preprint arXiv:1502.03167, 2015.
 [22] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in ICCV, 2015, pp. 1026–1034.
 [23] E. Todorov and W. Li, “A generalized iterative LQG method for locally-optimal feedback control of constrained nonlinear stochastic systems,” in ACC. IEEE, 2005, pp. 300–306.