Sim2real transfer learning for 3D human pose estimation: motion to the rescue
Synthetic visual data can provide practically infinite diversity and rich labels, while avoiding ethical issues with privacy and bias. However, for many tasks, current models trained on synthetic data generalize poorly to real data. The task of 3D human pose estimation is a particularly interesting example of this sim2real problem, because learning-based approaches perform reasonably well given real training data, yet labeled 3D poses are extremely difficult to obtain in the wild, limiting scalability. In this paper, we show that standard neural-network approaches, which perform poorly when trained on synthetic RGB images, can perform well when the data is pre-processed to extract cues about the person’s motion, notably as optical flow and the motion of 2D keypoints. Therefore, our results suggest that motion can be a simple way to bridge a sim2real gap when video is available. We evaluate on the 3D Poses in the Wild dataset, the most challenging modern benchmark for 3D pose estimation, where we show full 3D mesh recovery that is on par with state-of-the-art methods trained on real 3D sequences, despite training only on synthetic humans from the SURREAL dataset.
3D pose estimation, especially for humans, is a classic computer vision problem, with applications in imitation learning, robotic interaction, and activity understanding. Pose estimation is extremely challenging with objects that are articulated, deformable, or have wide intra-class variation, as is the case with humans. Therefore, state-of-the-art approaches rely on neural networks and learning. However, learning-based methods are extremely data hungry, and acquiring sufficient data in the real world is difficult. First, there is no straightforward way for people to annotate 3D ground truth poses. Worse, in settings like industrial warehouses or homes, hundreds of thousands of different object types may appear, and new objects may arrive at random. Here, even simple labeling will generally be impractical, much less 3D poses. And any time data involves real humans, issues with privacy, intellectual property, and bias can become serious obstacles Jordan and Mitchell (2015); Kay et al. (2015); Zhao et al. (2017a).
Synthetic data, however, provides an answer to all these problems, providing a potentially infinite dataset where ground-truth properties are easily accessible. In domains with many objects where labeling is impractical, scanning and simulating objects may not be Hinterstoisser et al. (2018, 2019). Furthermore, synthetic humans do not have any privacy or intellectual property concerns Jordan and Mitchell (2015), and datasets can be balanced exactly with respect to sensitive attributes like race, gender, and other physical characteristics, minimizing the problems algorithms currently have with bias Kay et al. (2015); Zhao et al. (2017a). Even better, simulations can be made interactive for training robotic policies.
Considering all the advantages, why isn’t simulation the dominant approach in computer vision? One problem is that neural networks trained on synthetic data do not necessarily work on real data as well as methods trained directly on real data. Thus, even though such “sim2real” transfer has performed well in some domains, such as hand tracking Mueller et al. (2018) or text detection Gupta et al. (2016), it is rare on the most popular benchmarks like human pose estimation, object classification, or object detection. Curiously, the community’s gold standard for popular vision algorithms is generally quantitative performance on large evaluation datasets. In practice, this kind of evaluation is limited to tasks where labels are easy to obtain, and for such tasks, it is equally straightforward to create large training sets with matching statistics. In contrast, vision problems like robotic manipulation—where simulation has been influential—are less popular not because they are unimportant, but because they exist in a sort of “evaluation blind spot,” due to the lack of standard benchmarks that can boil real-world performance down to a number. 3D human pose estimation is a rare exception to this trend: manual annotation is almost impossible, yet a small dataset of in-the-wild data is available for evaluation due to clever use of external sensors von Marcard et al. (2018), called 3D Poses in the Wild (3DPW). Thus, we set our sights specifically on this problem, as a testbed for understanding how to design algorithms that can learn real-world pose estimation from simulation.
To train a 3D human pose estimation network, we must first confront the domain gap posed by standard datasets. The synthetic humans video dataset SURREAL Varol et al. (2017), for example, lacks deformable clothing meshes, realistic lighting, and environmental interaction. While these aspects could be improved via better simulators, the problems with SURREAL are representative of the problems that rapidly-scanned real-world objects have: small-scale deformations may be lost, and physical properties will be approximate. Humans have little difficulty understanding the 3D structure of SURREAL despite having no experience with them, giving hope that computers can transfer between the domains as well. Disappointingly, however, we find that naïve transfer for computers is poor from SURREAL to 3DPW.
Our key insight is that motion, extracted from video sequences, can be a better cue for enabling transfer. Our intuition is grounded in the psychology literature, where humans have been shown to extract remarkably detailed 3D interpretations when only simple point-light motion is visible Johansson (1973); Kozlowski and Cutting (1977); Cutting and Kozlowski (1977). Modern simulations (including SURREAL) explicitly use 3D models which match the 3D geometry of real humans, and therefore hypothetically match well in terms of plausible motion.
Armed with this intuition, we build a system to estimate 3D human poses in real videos. Our core contributions are relatively simple modifications to a standard 3D human pose estimation algorithm—Human Mesh Recovery (HMR) Kanazawa et al. (2018)—which greatly improve transfer from simulation to reality. Specifically, we first modify SURREAL to contain more realistic overall motion, for example, by compositing SURREAL humans onto real backgrounds from videos collected in-the-wild. Then we add explicit motion cues, including optical flow from FlowNet Dosovitskiy et al. (2015) (also trained with synthetic data), and 2D keypoint tracks obtained from an off-the-shelf 2D detector (such as Papandreou et al. (2017), which is trained on real 2D keypoints, and therefore used only at test time, while at train time the 2D keypoints come from the simulator). We find that both modifications substantially improve performance on 3DPW, tracking close to state-of-the-art performance, while adding synthetic RGB inputs can actually harm performance. We also compare to the standard Domain Adversarial Neural Network approach to domain transfer Ganin et al. (2016), and find relatively marginal benefits compared to motion cues.
2 Related Work
Our work is part of a long line of research that has attempted to use simulation for human 3D pose estimation. Principal among these is work on using datasets of synthetic humans for human pose estimation Varol et al. (2017); Sminchisescu et al. (2006); Zhou et al. (2016b); Rogez and Schmid (2016); Okada and Soatto (2008); Ghezelghieh et al. (2016); Du et al. (2016); Chen et al. (2016); Tung et al. (2017). These works generally note that transfer is a challenge, and therefore the majority train on real data as well as synthetic using a variety of strategies. For instance, some work constructs 3D datasets entirely by stitching together 2D images Rogez and Schmid (2016); other works use feature selection Okada and Soatto (2008) and stage-wise training Du et al. (2016) to improve transfer. Algorithms trained entirely on synthetic data often underperform those trained entirely on real data, even when the real datasets are small Varol et al. (2017). An interesting exception is work that uses depth images Shotton et al. (2011), where sim2real 3D human pose estimation is effective, although depth cameras are required.
Numerous other areas of computer vision have made use of synthetic humans with varying success. Among work on 2D pose estimation Romero et al. (2015); Qiu (2016); Pishchulin et al. (2012), FlowCap Romero et al. (2015) is particularly relevant due to its reliance on flow, although its model-based optimization renders the algorithm somewhat brittle. Other works consider pedestrian detection Pishchulin et al. (2012); Qiu (2016); Pishchulin et al. (2011) and action recognition Rahmani and Mian (2015, 2016). 3D hand pose estimation Zimmermann and Brox (2017); Mueller et al. (2018) and eye tracking Shrivastava et al. (2017) are particularly promising, as neural networks trained on purely synthetic data are effective, perhaps because the lack of clothing makes appearance easier to model. Again, depth has proven useful Mueller et al. (2017); Taylor et al. (2016); Sridhar et al. (2015), where state-of-the-art algorithms typically incorporate some form of generative model in-the-loop at test time.
Robotics is a particularly inspiring domain for sim2real research, and here it has been again found that more abstract representations than RGB, such as segmentations Müller et al. (2018); Zhou et al. (2019), can improve performance. In some cases, these abstractions can be obtained automatically from a simulator alone James et al. (2018). Other works use generative models that make simulation look more like reality Bousmalis et al. (2017, 2018), or randomize the simulator to increase the distribution overlap Tobin et al. (2017); Sadeghi and Levine (2016). These works emphasize that sim2real is essential: real-world data is impossible to annotate at the level of desired robot commands, and even unlabeled data is expensive since a robot can break itself or its environment. These sim2real works in robotics build on a long tradition of domain adaptation for visual data, which can involve learning maps between feature spaces Gong et al. (2012), invariant feature extractors Ganin et al. (2016), or image-to-image translation Zhu et al. (2017); for a review, see Patel et al. (2015); Csurka (2017).
We are also not the first to note that optical flow can be useful preprocessing for human pose estimation: optical flow has been used for 2D keypoint estimation Pishchulin et al. (2012); Romero et al. (2015), part segmentation Kim et al. (2016), and even 3D pose Alldieck et al. (2017), although the latter work involves fitting a 3D model to optical flow, which is potentially slow, sensitive to initialization, and limits robustness. Similarly, some works have noted that 2D keypoints can be useful in 3D interpretation of humans Martinez et al. (2017) and objects Wu et al. (2016). There is also evidence that flow can aid sim2real transfer for foreground/background segmentation Tokmakov et al. (2017, 2019).
Finally, our work is related to a long tradition of 3D human pose estimation, where learning-based methods have grown recently due to the emergence of motion-capture datasets. One straightforward approach is to ‘lift’ 2D poses into 3D, using either dictionaries or direct regression Ramakrishna et al. (2012); Akhter and Black (2015); Wang et al. (2014); Martinez et al. (2017); Zhao et al. (2017b); Moreno-Noguer (2017); Taylor (2000); Valmadre and Lucey (2010). Other works regress poses directly from pixels Pavlakos et al. (2017); Sárándi et al. (2018); Pavlakos et al. (2018a); Zhou et al. (2017); Rogez et al. (2017); Mehta et al. (2017); Sun et al. (2017, 2018), which generally relies on having a good match between training and testing. Similar to our work, state-of-the-art approaches often rely on parametric body models to incorporate strong priors on 3D poses Anguelov et al. (2005); Guan et al. (2009); Sigal et al. (2008); Balan et al. (2007); Hasler et al. (2010); Loper et al. (2015); Lassner et al. (2017); Pavlakos et al. (2018b); Kanazawa et al. (2018); Bogo et al. (2016); Omran et al. (2018). Among these, we are particularly related to recent works which leverage temporal cues from videos to gain extra information about depth Huang et al. (2017); Zhang et al. (2018); Zanfir et al. (2018); Peng et al. (2018); Hossain and Little (2018); Dabral et al. (2018); Arnab et al. (2019); Kanazawa et al. (2019); Li et al. (2019).
Our goal is to train a network which can predict 3D poses given a sequence of video frames. This means we first require a synthetic dataset of reasonably realistic sequences of synthetic poses, which we obtain by compositing SURREAL renders onto real, unlabeled scenes. We also require a pose estimation model which can properly exploit motion information and, importantly, propagate this information across frames where no motion is available. For this purpose, we augment the Human Mesh Recovery (HMR) algorithm to operate on motion, and add memory in the form of an LSTM.
3.1 Dataset Construction
We require a dataset which captures complex human motion, provides ground-truth 3D poses for each frame, and also has all the distractors that are likely to cause problems in real data. Distractors can include background motion, occluders covering the person, and frames where the person is completely missing. While it is straightforward to get complex human motion following previous datasets such as SURREAL Varol et al. (2017), which take pose sequences from the CMU motion capture dataset 10 and render them using the SMPL mesh Loper et al. (2015), real videos are harder than this. SURREAL people are composited on static backgrounds, which means that the video version allows for a shortcut for pose estimation: it’s straightforward to segment the person from the background by identifying moving pixels. There are also no occlusions or missing frames in this dataset.
Therefore, to construct our dataset, we take humans from the SURREAL dataset and re-composite them onto a background from the large-scale Kinetics dataset Kay et al. (2017). Kinetics videos and SURREAL videos are sampled independently at training time: thus, SURREAL videos may be composited on roughly Kinetics videos for around 20 billion possible combinations. However, naïve compositing following SURREAL (i.e., simply removing the static background and replacing it with a Kinetics video) still makes the task too easy, both because there are no occlusions or missing frames, and also because the motion of the person won’t match the motion of the camera. Therefore, we modify the SURREAL video before compositing. To solve the motion discrepancy problem, our first step is to estimate the camera motion, using off-the-shelf procedures. We then translate the person to follow the camera motion, by offsetting each frame of the SURREAL video by a vector equal to the estimated camera motion of the corresponding Kinetics frame.
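The camera-following step can be sketched as below. This is a minimal illustration rather than the authors' implementation: it assumes the camera motion is already available as one pixel offset (dx, dy) per frame, expressed relative to the first frame and rounded to integers, and the helper names `shift_frame` and `follow_camera` are hypothetical.

```python
import numpy as np


def shift_frame(frame, dx, dy):
    """Translate an (H, W, C) frame by integer (dx, dy) pixels, zero-filling exposed borders."""
    h, w = frame.shape[:2]
    out = np.zeros_like(frame)
    # Source window that stays inside the image after the shift.
    src_x0, src_x1 = max(0, -dx), min(w, w - dx)
    src_y0, src_y1 = max(0, -dy), min(h, h - dy)
    if src_x0 >= src_x1 or src_y0 >= src_y1:
        return out  # the content was shifted entirely out of view
    out[src_y0 + dy:src_y1 + dy, src_x0 + dx:src_x1 + dx] = frame[src_y0:src_y1, src_x0:src_x1]
    return out


def follow_camera(person_frames, camera_offsets):
    """Offset each person frame by the camera motion estimated for the matching background frame.

    person_frames: (T, H, W, C) rendered person frames (e.g. RGB plus alpha).
    camera_offsets: sequence of T (dx, dy) offsets in pixels (assumed input format).
    """
    return np.stack([
        shift_frame(frame, int(round(dx)), int(round(dy)))
        for frame, (dx, dy) in zip(person_frames, camera_offsets)
    ])
```

Shifting the alpha channel together with the colour channels keeps the person mask aligned with the person, so the same offsets can be reused when the ground-truth 2D keypoints are moved.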
We then follow the procedure shown in Figure 2 to actually construct the video. Our first challenge is to simulate occlusions. One approach might be to take occluders from standard segmentation datasets like COCO Lin et al. (2014), but this approach has some of the same problems as the SURREAL static backgrounds did. Specifically, COCO consists of static images, and so its objects don’t have any associated motion information. We might apply random, smooth trajectories to get some motion, but occlusions in real video often have internal motion on top of global translation: i.e., the occluding objects may be deformable and have 3D structure. Furthermore, occluders in real videos tend to move with the scene. Missing either of these properties will lead to occlusions that are artificially easy to identify in synthetic data, which may lead to detectors which generalize poorly on real videos. Therefore, our approach is to extract occluders from the Kinetics video itself. We use standard superpixel segmentation (specifically SLIC Achanta et al. (2012)) to rapidly extract segments from the video, which are generally blobs of roughly uniform color tracked throughout the video, resulting in binary masks as shown in Figure 2. One superpixel is chosen at random, and then all pixels that overlap with the SURREAL person are removed from the person, resulting in a synthetic occlusion.
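The occlusion step can be illustrated with a short sketch. It assumes the superpixel segmentation has already been computed and is supplied as an integer label volume that is consistent across frames; the function names and array layouts are hypothetical choices made for this example.

```python
import numpy as np


def occlude_person(person_alpha, segment_labels, rng):
    """Remove from the person every pixel covered by one randomly chosen segment.

    person_alpha: (T, H, W) float mask, 1 where the rendered person is visible.
    segment_labels: (T, H, W) integer superpixel labels tracked through the video
        (assumed to come from a separate segmentation step).
    rng: a numpy.random.Generator used to pick the occluding segment.
    """
    chosen = rng.choice(np.unique(segment_labels))
    occluder = segment_labels == chosen
    # Wherever the occluder is present, the person becomes invisible.
    occluded_alpha = np.where(occluder, 0.0, person_alpha)
    return occluded_alpha, occluder


def composite(person_rgb, alpha, background_rgb):
    """Alpha-composite the (possibly occluded) person over the background video."""
    a = alpha[..., None]
    return a * person_rgb + (1.0 - a) * background_rgb
```

Because the occluder is cut from the background video itself, the pixels that replace the person already carry that video's own appearance and motion, which is the property the paragraph above argues for.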
As a final step, for some videos we randomly occlude the entire person for a small number of frames, which we call a ‘total’ occlusion. This simulates a detector failure, which is common in real-world datasets like 3DPW where the person may move out of the frame or be otherwise totally hidden. To do this, we select a continuous chunk of frames and set all feature channels to 0. We add an extra channel to the input which is 1 if the frame is totally occluded in this way, and 0 for un-occluded frames. Further implementation details on the dataset construction are given in appendix A.
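A minimal sketch of the 'total' occlusion augmentation follows. The function name, the channels-last layout, and the choice to make the indicator channel spatially constant are assumptions made for illustration; how the start and length of the chunk are sampled is left to the caller.

```python
import numpy as np


def add_total_occlusion(clip, start, length):
    """Zero out a contiguous chunk of frames and append an occlusion-indicator channel.

    clip: (T, H, W, C) array of per-frame input features.
    start, length: first frame and number of frames to hide (assumed to be sampled elsewhere).
    Returns an array of shape (T, H, W, C + 1).
    """
    occluded = np.zeros(clip.shape[0], dtype=bool)
    occluded[start:start + length] = True
    out = clip.copy()
    out[occluded] = 0.0  # all feature channels are set to 0 on hidden frames
    # Indicator channel: 1 on totally occluded frames, 0 elsewhere.
    flag = np.broadcast_to(
        occluded[:, None, None, None], clip.shape[:3] + (1,)
    ).astype(clip.dtype)
    return np.concatenate([out, flag], axis=-1)
```

At test time the same indicator channel could be set for frames where the person detector returns nothing, so that the network sees missing detections in the form it was trained on.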
3.2 Network Architecture
Our hypothesis is that easily-accessible motion information will be useful for bridging the sim2real gap in 3D pose estimation. Therefore, we seek a method to provide motion information to a pose estimation model, while otherwise staying close to existing pipelines for comparability. Our starting point is the Human Mesh Recovery (HMR) pipeline Kanazawa et al. (2018). This recent algorithm directly regresses 3D SMPL poses from pixels, using first a ConvNet (ResNet-50) to obtain a feature vector, and then applying an iterative refinement algorithm on top of that feature vector to infer the pose.
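The iterative refinement idea can be sketched in a few lines. This is an illustration under assumptions, not the HMR code: the regressor is taken to be any callable that maps the concatenated feature vector and current parameter estimate to a correction, and the number of iterations (3 here) is a placeholder value.

```python
import numpy as np


def iterative_refinement(feature, regressor, init_params, num_iters=3):
    """Refine a parameter estimate by repeatedly adding predicted corrections.

    feature: (D,) image (or clip) feature vector.
    regressor: callable taking the concatenation [feature, params] and returning
        a correction with the same shape as params (hypothetical stand-in).
    init_params: (P,) starting estimate, e.g. a mean pose.
    """
    params = init_params.copy()
    for _ in range(num_iters):
        # The regressor sees both the feature and the current estimate,
        # and predicts how the estimate should change.
        params = params + regressor(np.concatenate([feature, params]))
    return params
```

Feeding the current estimate back into the regressor lets each step make a small correction instead of predicting the full pose in one shot.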
A first modification is required to extend HMR to video. Our input, both at training and test time, is short clips (31 frames in most of our experiments). These may be the raw RGB videos, or the videos may be preprocessed to include other features like 2D keypoints or optical flow as described below. We assume that a person detector has already been run, meaning that the sequence tracks a single person whose pose needs to be estimated. A scalable memory architecture is important, because not every frame will be equally discriminative, especially when using motion features on frames that contain little motion. Thus, we need an architecture which can update its beliefs when the pose is easily identified, and otherwise leave them unchanged. We use an LSTM for this purpose (in a similar manner to Tokmakov et al. (2017)). Our architecture, which we call Motion HMR, is shown on the right hand side of Figure 3. This architecture applies a CNN, which is a standard ResNet-50, independently on each frame, average-pooling at the end to obtain a single feature vector per frame. We then pass these features into a bi-directional LSTM that operates in time over the short clips. Finally, we apply HMR’s iterative pose refinement on the output feature vectors from the LSTM for each frame independently. The result is a pose estimate for each frame in the sequence. At training time, we use the simplified version of the HMR loss function that was proposed for training from Kinetics pseudo ground truth Arnab et al. (2019). That is, we train directly for Procrustes-aligned 3D keypoint location error (rather than SMPL joint angles and absolute 3D keypoint positions), as well as 2D reprojection error of the 3D pose.
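The 2D reprojection term of the loss can be sketched as follows. Several details are assumptions made for this example and are not stated in the text above: a weak-perspective camera with parameters (scale, tx, ty), an L1 distance per joint, and a visibility mask over joints. The function name is hypothetical.

```python
import numpy as np


def reprojection_error(joints_3d, camera, keypoints_2d, visible):
    """Mean L1 distance between projected 3D joints and 2D keypoints.

    joints_3d: (J, 3) predicted joint positions.
    camera: (scale, tx, ty) weak-perspective camera (assumed camera model).
    keypoints_2d: (J, 2) target 2D keypoints.
    visible: (J,) boolean mask; hidden joints do not contribute.
    """
    scale, tx, ty = camera
    # Weak perspective: drop depth, then scale and translate in the image plane.
    projected = scale * joints_3d[:, :2] + np.array([tx, ty])
    per_joint = np.abs(projected - keypoints_2d).sum(axis=1)
    return (per_joint * visible).sum() / max(visible.sum(), 1)
```

In the synthetic setting the target 2D keypoints come directly from the simulator, so this term needs no manual annotation.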
Providing motion inputs.
Given a video-based architecture, we next add motion information. Our first strategy is to use optical flow. Optical flow is already known to transfer well across domains, because it relies more on similarities between frames than on recognizing specific patterns Dosovitskiy et al. (2015); Ranjan and Black (2017); Ilg et al. (2017); Gaidon et al. (2016); Mayer et al. (2016); Ranjan et al. (2018). Furthermore, optical flow can have strong cues for depth: for example, if one end of a rigid body is stationary, but the other end is moving toward the first, then this indicates an out-of-plane rotation. We implement this as a simple preprocessing step. That is, we use an off-the-shelf optical flow algorithm FlowNet Dosovitskiy et al. (2015) as a frozen module, which produces an estimate of optical flow at the full resolution of the input sequence.
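The rigid-body example can be made concrete with a short derivation. This is a sketch under the assumption of orthographic projection; the symbols $L$ (true limb length), $\theta$ (angle between the limb and the image plane), and $\ell$ (projected length) are introduced here for illustration only.

```latex
\ell(t) = L \cos\theta(t),
\qquad
\dot{\ell}(t) = -L \sin\theta(t)\,\dot{\theta}(t),
\qquad
|\theta(t)| = \arccos\!\left(\frac{\ell(t)}{L}\right).
```

Since $L$ is constant for a rigid limb, a shrinking projected length ($\dot{\ell} < 0$) can only be explained by $|\theta|$ growing, i.e. by rotation out of the image plane. The flow therefore constrains the magnitude of the out-of-plane angle, while its sign (toward or away from the camera) remains ambiguous from this cue alone.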
One disadvantage of optical flow is that it can become difficult to distinguish body parts from background. Especially for frames with little motion, the movement of individual limbs will be simply blobs of smooth motion, much like blobs of background. We hypothesize that this is mostly a problem of 2D part detection, and we note that 2D keypoint detection is a well-studied field. 2D keypoint detection is far easier to annotate than 3D, and even when 2D keypoints are not available, some recent works have argued that 2D correspondence and keypoints can even be obtained in a self-supervised manner Jakab et al. (2018); Zhou et al. (2016a). 2D keypoints alone do contain some information about 3D pose Martinez et al. (2017), although follow-up work has suggested that this approach to using 2D keypoints by itself performs poorly for 3D pose estimation on 3DPW Kanazawa et al. (2019). This leads to an interesting research question: if neither flow information nor 2D keypoints are enough to perform 3D pose estimation, then is it sufficient to identify the 2D keypoints, and then use the flow and keypoint motion to estimate 3D structure?
To answer this question, we provide 2D keypoints as another input to Motion HMR. At training time, these are obtained automatically from the known synthetic pose; at test time, these are the detections from an automatic 2D keypoint detector (for reproducibility, we use the automatic 2D keypoints provided with the 3DPW dataset). After computing optical flow, we concatenate an additional set of 12 channels to the input image which are keypoint heatmaps. That is, each channel contains zeros everywhere except near the keypoint associated with the channel; they are 1 at the keypoint location and fall off with a Gaussian distribution with a standard deviation of 10 pixels. The 12 keypoints we use are the ankles, knees, hips, shoulders, elbows, and wrists, which are standard in most pose datasets. For more details on the architecture and training, see appendix B.
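The keypoint heatmap encoding described above can be sketched as follows. The function name and the channels-last layout are assumptions; the standard deviation of 10 pixels and the peak value of 1 follow the description in the text. Missing or undetected keypoints are not handled in this sketch.

```python
import numpy as np


def keypoint_heatmaps(keypoints_xy, height, width, sigma=10.0):
    """Render one unnormalized Gaussian heatmap per keypoint.

    keypoints_xy: (K, 2) array of (x, y) pixel coordinates.
    Returns an array of shape (height, width, K) whose value is 1 at each
    keypoint location and falls off as a Gaussian with standard deviation sigma.
    """
    ys, xs = np.mgrid[0:height, 0:width]
    kx = keypoints_xy[:, 0][None, None, :]
    ky = keypoints_xy[:, 1][None, None, :]
    sq_dist = (xs[..., None] - kx) ** 2 + (ys[..., None] - ky) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))
```

With 12 keypoints this yields 12 extra channels, which can be appended to the other per-frame inputs with `np.concatenate(..., axis=-1)`.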
We apply our trained models to the 3D Poses in the Wild dataset. This dataset is challenging because it is shot in real-world environments using handheld cameras, rather than the motion-capture rigs of prior 3D pose estimation work. There is non-trivial camera motion, strong lighting variations (indoor and outdoor scenes), and substantial clutter, including objects moving in the background and occlusions by both objects and humans.
Following prior work Arnab et al. (2019), we evaluate only on sequences in the test set, and among these, only on frames where at least 7 keypoints are visible (although all frames are visible to the algorithm). We pass 31-frame clips to the algorithm, the same as at training time, and evaluate the predicted poses using the 14 joints that are common to the SMPL and COCO models. We use the standard performance metric PA-MPJPE, which uses the Procrustes algorithm to align the poses in 3D before computing squared error, and we average across each individual person before finally averaging across the entire dataset.
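A minimal sketch of a Procrustes-aligned per-joint error for a single frame is given below. It is not the evaluation code used for the reported numbers. The function names are hypothetical, and the sketch reports the mean per-joint Euclidean distance after a similarity alignment (scale, rotation, translation), which is an assumption about the exact form of the error.

```python
import numpy as np


def procrustes_align(pred, target):
    """Similarity-align pred (J, 3) to target (J, 3) in the least-squares sense."""
    mu_p, mu_t = pred.mean(axis=0), target.mean(axis=0)
    p, t = pred - mu_p, target - mu_t
    # Optimal rotation from the SVD of the cross-covariance matrix.
    u, s, vt = np.linalg.svd(p.T @ t)
    # Guard against reflections so that the result is a proper rotation.
    d = np.sign(np.linalg.det(u @ vt))
    diag = np.array([1.0, 1.0, d])
    rot = u @ np.diag(diag) @ vt
    scale = (s * diag).sum() / (p ** 2).sum()
    return scale * p @ rot + mu_t


def pa_mpjpe(pred, target):
    """Mean per-joint distance after Procrustes alignment (units follow the inputs)."""
    aligned = procrustes_align(pred, target)
    return np.linalg.norm(aligned - target, axis=1).mean()
```

Because the alignment removes global scale, rotation, and translation, the metric measures only the articulated pose; per-frame values would then be averaged per person and across the dataset as described above.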
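| Training data | Method | PA-MPJPE |
|---|---|---|
| 3D poses only for synthetic data | Martinez et al. (2017) (from Kanazawa et al. (2019)) | 157.0 |
| 3D poses only for synthetic data | RGB + DANN Ganin et al. (2016) | 103.0 (107.5) |
| 3D poses only for synthetic data | Flow Only (proposed) | 100.1 |
| 3D poses only for synthetic data | RGB + Keypoints (proposed) | 82.4 |
| 3D poses only for synthetic data | Keypoints Only (proposed) | 77.6 |
| 3D poses only for synthetic data | Flow + Keypoints (proposed) | 74.7 |
| 3D poses on real data | HMR Kanazawa et al. (2018) (from Arnab et al. (2019)) | 77.2 |
| 3D poses on real data | Temporal HMR Kanazawa et al. (2019) | 80.1 |
| 3D poses on real data | Temporal HMR + InstaVariety Kanazawa et al. (2019) | 72.4 |
| 3D poses on real data | HMR + Kinetics Arnab et al. (2019) | 72.2 |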
As a baseline, we also ran Domain Adversarial Neural Networks (DANN) Ganin et al. (2016), a mainstay of domain adaptation, which uses an adversarial network trained to distinguish between the representations of real and synthetic images. The trunk of the network is then trained with the negative of this discriminator loss, resulting in representations that are indistinguishable across domains. While straightforward to implement, DANN is challenging to tune: although synthetic and real images are mapped to overlapping distributions, there is no way to guarantee that the mapping preserves semantics. We apply DANN to the per-frame representations directly before the LSTM.
Table 1 shows our results, comparing training with and without motion information as input to the network. RGB alone performs poorly: a network trained only on short RGB video clips fares worse than one trained on flow, despite the relatively uninformative flow images. This relationship holds even when the RGB-only model is trained with DANN. One possible explanation is that neural networks tend to rely heavily on texture cues Geirhos et al. (2018). Synthetic textures are not similar to real ones; therefore the network can localize a synthetic person by distinguishing between sim and real textures. At test time this is impossible, and failures to identify parts early on can amplify at later layers which do detailed depth estimation.
Keypoints, on the other hand, perform surprisingly well, validating the intuitions from psychology that 2D motion encapsulates substantial information about 3D activities Johansson (1973). It is interesting to compare our keypoints-only results to Martinez et al. (2017), which is a comparable algorithm in that it only uses 2D keypoints. There are a number of factors which may explain the relatively poor results of Martinez et al. (2017). Primarily, Martinez et al. (2017) is trained on single frames, whereas we use sequences. Furthermore, Martinez et al. (2017) gives relatively little attention to occlusions, resulting in a domain gap relative to 3DPW.
Another interesting result is that the network which incorporates RGB on top of keypoints actually performs worse than one that uses only keypoints, further emphasizing the size of the domain gap. It’s likely the simple presence of synthetic textures causes the network to rely on them, and ignore 2D motion cues that are more reliable out-of-domain, but harder to learn. On the other hand, adding flow to keypoints yields substantial improvements. This suggests that optical flow contains information about motion and silhouettes that pure 2D keypoints do not, and furthermore, that the optical flow fields estimated on synthetic and real images are a reasonably good match. We conjecture that the flow features lose low-level texture information that neural networks can easily overfit to, replacing it with simple piecewise-smooth regions that capture only shape and motion.
Our final result, of 74.7, is comparable to state-of-the-art works that use similar training pipelines. We outperform HMR Kanazawa et al. (2018), which trains on real-world motion capture datasets as well as real-world 2D images as a regularizer (ensuring that 3D poses are consistent with 2D annotations). In contrast, our network is trained only on annotated synthetic images from SURREAL. Even more interesting are extensions to HMR that use temporal sequences Kanazawa et al. (2019). While Kanazawa et al. (2019) reports pose sequences that are substantially more coherent, they find that adding temporal information harms the method’s absolute pose accuracy. While counter-intuitive, we hypothesize that this is due to another form of domain gap: specifically, the 3D datasets that this algorithm was trained on all contain static backgrounds. Thus, in the training set, motion is a very strong cue for the person’s location. At test time, however, 3DPW contains substantial background motion, which can confuse the algorithm. Our synthetic pipeline, on the other hand, allows us to provide realistic background motion. Overall, the only algorithms which currently outperform us are trained on large, weakly-labeled video datasets Arnab et al. (2019); Kanazawa et al. (2019). This sort of semi-supervised learning on real videos is an interesting avenue for future research, since it would allow us to add real videos to our synthetic training set without any manual annotation, and therefore potentially boost the performance of our algorithm even further.
Figure 5 shows qualitative results of the performance of our algorithm on 3DPW scenes. Our algorithm is often robust to both unusual poses and to occlusion, even when the occluders are other people (bottom left). The baseline, however, fails badly, even missing the 2D limb positions for relatively simple poses. This again confirms our suspicion that the RGB-only algorithm cannot identify the 2D locations of joints. Interestingly, there are similarities between the wrongly-estimated pose (note the elbow angles), suggesting some similarity in the way that real human textures are misinterpreted by the network.
The ablation results for the dataset preprocessing steps are given in Table 3. In all cases, we train a model from scratch using both optical flow and 2D keypoints as input. We begin with a model that is composited as SURREAL was: we select a single Kinetics frame as background, and composite all of the SURREAL images from a sequence onto it. Replacing static backgrounds with moving ones gives a substantial boost, confirming that networks which segment a moving person from a static background may fail to generalize to dynamic backgrounds. Tracking the background also helps, suggesting that camera motion is a non-trivial artifact in 3DPW. Finally, the boost from using occlusions validates that superpixels are a good approximation to the occlusions seen in real videos.
| Dataset construction approach | PA-MPJPE |
|---|---|
| No background tracking, no occlusions | 80.3 |
| Static background, no occlusions | 88.9 |
Finally, we consider the importance of long-term versus short-term motion for our network by varying the length of the clips that were fed into the network and retraining from scratch. Table 3 shows the results. We can see that performance improves until 31 frames, corresponding to roughly 1 second of video. This isn’t surprising because 3DPW contains clips where people are occasionally standing still. However, we don’t see any improvement moving to 2 seconds of video. One possible explanation is that errors are accumulating in the LSTM as the sequence length increases, indicating a potential area for future research in architectures. Furthermore, at the longest clip length we could only fit 2 clips in GPU memory simultaneously (a batch size of 2), which reduces the stability of the Batch Norm required by HMR (31 frames uses batch size 3; 8 and 16 used batch size 6).
Our results show that motion information can help neural networks learn 3D human pose estimation from synthetic images. Human pose estimation is challenging because humans are articulated and deformable, with wide appearance variation, yet they are far from the only thing in the visual world like this. Our results may have wide-ranging applications in, for example, robotics, where both camera motion and object motion (via manipulation) can provide strong cues for object pose. While it is somewhat disappointing that neural networks overfit to the RGB appearance of synthetic images, and therefore our final model loses out on cues like shading, it is possible that the advantages of RGB might be recovered through self-supervised learning. That is, we can estimate poses in video using sim2real, potentially fix errors using e.g. bundle adjustment Arnab et al. (2019), and then train a single-frame RGB model on the result. Overall, we believe motion information, and the sim2real transfer that it enables, may become an essential component of pose estimation systems whenever video is available.
We thank Konstantinos Bousmalis, João Carreira, Ankush Gupta, Mateusz Malinowski, Relja Arandjelović, Jean-Baptiste Alayrac, Viorica Pătrăucean, Jacob Walker, Yuxiang Zhou, and Anurag Arnab for helpful discussions.
- SLIC superpixels compared to state-of-the-art superpixel methods. IEEE transactions on pattern analysis and machine intelligence 34 (11), pp. 2274–2282. Cited by: §3.1.
- Pose-conditioned joint angle limits for 3D human pose reconstruction. In CVPR, Cited by: §2.
- Optical flow-based 3D human motion estimation from monocular video. In German Conference on Pattern Recognition, pp. 347–360. Cited by: §2.
- SCAPE: shape completion and animation of people. ACM TOG 24 (3), pp. 408–416. Cited by: §2.
- Exploiting temporal context for 3D human pose estimation in the wild. In CVPR, Cited by: §2, §3.2, Table 1, §4, §4, §5.
- Detailed human shape and pose from images. In CVPR, pp. 1–8. Cited by: §2.
- Keep it SMPL: automatic estimation of 3D human pose and shape from a single image. In ECCV, Cited by: §2.
- Using simulation and domain adaptation to improve efficiency of deep robotic grasping. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 4243–4250. Cited by: §2.
- Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3722–3731. Cited by: §2.
- Carnegie-Mellon mocap database, http://mocap.cs.cmu.edu/. Cited by: §3.1.
- Synthesizing training images for boosting human 3D pose estimation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 479–488. Cited by: §2.
- Domain adaptation for visual applications: a comprehensive survey. arXiv preprint arXiv:1702.05374. Cited by: §2.
- Recognizing friends by their walk: gait perception without familiarity cues. Bulletin of the psychonomic society 9 (5), pp. 353–356. Cited by: §1.
- Structure-aware and temporally coherent 3D human pose estimation. In ECCV, Cited by: §2.
- FlowNet: learning optical flow with convolutional networks. In Proceedings of the IEEE international conference on computer vision, pp. 2758–2766. Cited by: §1, §3.2.
- Marker-less 3D human motion capture with monocular image sequence and height-maps. In European Conference on Computer Vision, pp. 20–36. Cited by: §2.
- Virtual worlds as proxy for multi-object tracking analysis. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4340–4349. Cited by: §3.2.
- Domain-adversarial training of neural networks. The Journal of Machine Learning Research 17 (1), pp. 2096–2030. Cited by: §1, §2, Table 1, §4.
- ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. International Conference on Learning Representations. Cited by: §4.
- Learning camera viewpoint using CNN to improve 3D body pose estimation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 685–693. Cited by: §2.
- Geodesic flow kernel for unsupervised domain adaptation. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2066–2073. Cited by: §2.
- Estimating human shape and pose from a single image. In ICCV, pp. 1381–1388. Cited by: §2.
- Synthetic data for text localisation in natural images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2315–2324. Cited by: §1.
- Multilinear pose and body shape estimation of dressed subjects from image sets. In CVPR, pp. 1823–1830. Cited by: §2.
- On pre-trained image features and synthetic images for deep learning. In Proceedings of the European Conference on Computer Vision (ECCV), Cited by: §1.
- An annotation saved is an annotation earned: using fully synthetic training for object instance detection. arXiv preprint arXiv:1902.09967. Cited by: §1.
- Exploiting temporal information for 3D pose estimation. In ECCV, Cited by: §2.
- Towards accurate marker-less human shape and pose estimation over time. In 3DV, pp. 421–430. Cited by: §2.
- FlowNet 2.0: evolution of optical flow estimation with deep networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2462–2470. Cited by: §3.2.
- Unsupervised learning of object landmarks through conditional image generation. In Advances in Neural Information Processing Systems, pp. 4016–4027. Cited by: §3.2.
- Sim-to-real via sim-to-sim: data-efficient robotic grasping via randomized-to-canonical adaptation networks. arXiv preprint arXiv:1812.07252. Cited by: §2.
- Visual perception of biological motion and a model for its analysis. Perception & psychophysics 14 (2), pp. 201–211. Cited by: §1, §4.
- Machine learning: trends, perspectives, and prospects. Science 349 (6245), pp. 255–260. Cited by: §1, §1.
- End-to-end recovery of human shape and pose. In CVPR, Cited by: §1, §2, §3.2, Table 1, §4.
- Learning 3D human dynamics from video. In CVPR, Cited by: §2, §3.2, Table 1, §4.
- Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp. 3819–3828. Cited by: §1, §1.
- The kinetics human action video dataset. arXiv preprint arXiv:1705.06950. Cited by: §3.1.
- Human body part classification from optical flow. In 2016 13th International Conference on Ubiquitous Robots and Ambient Intelligence (URAI), pp. 903–904. Cited by: §2.
- Recognizing the sex of a walker from a dynamic point-light display. Perception & psychophysics 21 (6), pp. 575–580. Cited by: §1.
- Unite the people: closing the loop between 3D and 2D human representations. In CVPR, Cited by: §2.
- Learning the depths of moving people by watching frozen people. arXiv preprint arXiv:1904.11111. Cited by: §2.
- Microsoft COCO: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §3.1.
- SMPL: a skinned multi-person linear model. ACM TOG 34 (6), pp. 248:1–248:16. Cited by: §2, §3.1.
- A simple yet effective baseline for 3D human pose estimation. In ICCV, Cited by: §2, §2, §3.2, Table 1, §4.
- A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048. Cited by: §3.2.
- Monocular 3D human pose estimation in the wild using improved CNN supervision. In 3DV, Cited by: §2.
- 3D human pose estimation from a single image via distance matrix regression. In CVPR, pp. 1561–1570. Cited by: §2.
- GANerated hands for real-time 3D hand tracking from monocular RGB. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–59. Cited by: §1, §2.
- Real-time hand tracking under occlusion from an egocentric RGB-D sensor. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1284–1293. Cited by: §2.
- Driving policy transfer via modularity and abstraction. arXiv preprint arXiv:1804.09364. Cited by: §2.
- Relevant feature selection for human pose estimation and localization in cluttered images. In European Conference on Computer Vision, pp. 434–445. Cited by: §2.
- Neural body fitting: unifying deep learning and model-based human pose and shape estimation. In 3DV, Verona, Italy. Cited by: §2.
- Towards accurate multi-person pose estimation in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4903–4911. Cited by: §1.
- Visual domain adaptation: a survey of recent advances. IEEE signal processing magazine 32 (3), pp. 53–69. Cited by: §2.
- Ordinal depth supervision for 3D human pose estimation. In CVPR, Cited by: §2.
- Coarse-to-fine volumetric prediction for single-image 3D human pose. In CVPR, pp. 1263–1272. Cited by: §2.
- Learning to estimate 3D human pose and shape from a single color image. In CVPR, Cited by: §2.
- SFV: reinforcement learning of physical skills from videos. arXiv preprint arXiv:1810.03599. Cited by: §2.
- Articulated people detection and pose estimation: reshaping the future. In 2012 IEEE Conference on Computer Vision and Pattern Recognition, pp. 3178–3185. Cited by: §2, §2.
- Learning people detection models from few training samples. In CVPR 2011, pp. 1473–1480. Cited by: §2.
- Generating human images and ground truth using computer graphics. Ph.D. Thesis, UCLA. Cited by: §2.
- Learning a non-linear knowledge transfer model for cross-view action recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2458–2466. Cited by: §2.
- 3D action recognition from novel viewpoints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1506–1515. Cited by: §2.
- Reconstructing 3D human pose from 2D image landmarks. In ECCV, pp. 573–586. Cited by: §2.
- Optical flow estimation using a spatial pyramid network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4161–4170. Cited by: §3.2.
- Learning human optical flow. British Machine Vision Conference. Cited by: §3.2.
- Mocap-guided data augmentation for 3D pose estimation in the wild. In Advances in neural information processing systems, pp. 3108–3116. Cited by: §2.
- LCR-Net: localization-classification-regression for human pose. In CVPR, Cited by: §2.
- FlowCap: 2D human pose from optical flow. In German conference on pattern recognition, pp. 412–423. Cited by: §2, §2.
- CAD2RL: real single-image flight without a single real image. In Robotics: Science and Systems Conference, Cited by: Appendix B, §2.
- How robust is 3D human pose estimation to occlusion?. arXiv preprint arXiv:1808.09316. Cited by: §2.
- Real-time human pose recognition in parts from single depth images. In CVPR, Vol. 2, pp. 3. Cited by: §2.
- Learning from simulated and unsupervised images through adversarial training. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2107–2116. Cited by: §2.
- Combined discriminative and generative articulated pose and non-rigid shape estimation. In NIPS, pp. 1337–1344. Cited by: §2.
- Learning joint top-down and bottom-up processes for 3D visual inference. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’06), Vol. 2, pp. 1743–1752. Cited by: §2.
- Fast and robust hand tracking using detection-guided optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3221. Cited by: §2.
- Compositional human pose regression. In ICCV, Cited by: §2.
- Integral human pose regression. In ECCV, Cited by: §2.
- Reconstruction of articulated objects from point correspondences in a single uncalibrated image. CVIU 80 (3), pp. 349–363. Cited by: §2.
- Efficient and precise interactive hand tracking through joint, continuous optimization of pose and correspondences. ACM Transactions on Graphics (TOG) 35 (4), pp. 143. Cited by: §2.
- Domain randomization for transferring deep neural networks from simulation to the real world. In 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30. Cited by: Appendix B, §2.
- Learning motion patterns in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3386–3394. Cited by: §2, §3.2.
- Learning to segment moving objects. International Journal of Computer Vision 127 (3), pp. 282–301. Cited by: §2.
- Self-supervised learning of motion capture. In Advances in Neural Information Processing Systems, pp. 5236–5246. Cited by: §2.
- Deterministic 3D human pose estimation using rigid structure. In ECCV, Cited by: §2.
- Learning from synthetic humans. In CVPR, Cited by: §1, §2, §3.1.
- Recovering accurate 3D human pose in the wild using IMUs and a moving camera. In ECCV, Cited by: §1.
- Robust estimation of 3D human poses from a single image. In CVPR, pp. 2361–2368. Cited by: §2.
- Single image 3D interpreter network. In European Conference on Computer Vision, pp. 365–382. Cited by: §2.
- A duality based approach for realtime TV-L1 optical flow. In Joint pattern recognition symposium, pp. 214–223. Cited by: Appendix A.
- Monocular 3D pose and shape estimation of multiple people in natural scenes–the importance of multiple scene constraints. In CVPR, Cited by: §2.
- MoSculp: interactive visualization of shape and time. arXiv preprint arXiv:1809.05491. Cited by: §2.
- Men also like shopping: reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457. Cited by: §1, §1.
- A simple, fast and highly-accurate algorithm to recover 3D shape from 2D landmarks on a single image. PAMI. Cited by: §2.
- Does computer vision matter for action?. arXiv preprint arXiv:1905.12887. Cited by: §2.
- Learning dense correspondence via 3D-guided cycle consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 117–126. Cited by: §3.2.
- Sparseness meets deepness: 3D human pose estimation from monocular video. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 4966–4975. Cited by: §2.
- Towards 3D human pose estimation in the wild: a weakly-supervised approach. In ICCV, Cited by: §2.
- Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision, pp. 2223–2232. Cited by: §2.
- Learning to estimate 3D hand pose from single RGB images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4903–4911. Cited by: §2.
Appendix A Implementation details for dataset generation
We randomly crop the Kinetics videos to match the SURREAL resolution (potentially resizing so that one dimension exactly matches it), and then randomly extract a 31-frame clip from both Kinetics and SURREAL. For every frame of the Kinetics data, we compute optical flow offline using TVL1 Zach et al. (2007). Then we compute the median x and y flow for each frame. Finally, we translate the person according to the median flow at each frame, using the center frame of the clip as a reference.
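As an illustration of the camera-motion estimate described above, the following NumPy sketch computes a per-frame displacement relative to the center frame from the per-frame median flow. It is a minimal sketch rather than the exact pipeline: the function name `camera_offsets`, the `(T, H, W, 2)` array layout, the choice to accumulate the median flow over time, and the convention that the flow at frame t points to frame t+1 are all assumptions made for this example.

```python
import numpy as np


def camera_offsets(flow, ref=None):
    """Estimate per-frame camera displacement from dense optical flow.

    flow: (T, H, W, 2) array; flow[t] is assumed to map frame t to frame t+1,
          with the last axis holding the (x, y) components.
    ref:  index of the reference frame (defaults to the center frame).
    Returns a (T, 2) array of (x, y) offsets relative to the reference frame.
    """
    flow = np.asarray(flow, dtype=np.float64)
    # Median x and y flow for each frame, taken over all pixels.
    median = np.median(flow, axis=(1, 2))  # (T, 2)
    # Accumulate the median flow: position[t] is the sum of median[0..t-1].
    position = np.concatenate([np.zeros((1, 2)), np.cumsum(median[:-1], axis=0)])
    if ref is None:
        ref = flow.shape[0] // 2
    # Express every position relative to the reference frame.
    return position - position[ref]
```

For a 31-frame clip the default reference is frame 15. Under these assumptions, the synthetic person in frame t would be shifted by the returned offset for that frame before compositing.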
We generate SLIC superpixels directly in the 4-D video tensor. Recall that SLIC is a clustering in RGB/XYZ space. This is sub-optimal under large camera motions, as the clusters are artificially encouraged to remain in the same place relative to the image frame. Therefore, we modify the standard SLIC algorithm by first integrating the median flow vector that we used initially in order to get an estimate of global camera motion throughout the video, and then subtracting the integrated value from the X/Y coordinates used by SLIC. The result is SLIC superpixels of color blobs in the video that roughly follow camera motion, which are inexpensive to compute. Given these superpixels, we choose a random superpixel, and mask out any part of the synthetic human covered by it. Note that, if the superpixel does not occlude the person, or if it occludes so much of the person that an average of fewer than 7 keypoints (out of the 14 standard COCO joints) remain per frame, then we simply discard the superpixel and do no masking. This means that only roughly 30% of videos have occlusions generated in this way. When computing the superpixels, we use between 10 and 30 superpixels per image. We implement SLIC using an off-the-shelf implementation from skimage, where we add extra channels containing the (x,y) pixel coordinates, which have been modified to account for camera motion, and then set a very low compactness for the SLIC computation (0.01). The concatenated (x,y) coordinates are multiplied by a random scalar between 4e-4 and 6e-4. To compute the final composite, we use hard binary masks derived from the ground-truth segmentation in SURREAL.
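The two steps that are specific to this procedure, building camera-compensated coordinate channels and deciding whether a sampled superpixel is kept as an occluder, can be sketched in NumPy as follows. This is a hedged illustration only: the function names, array layouts, and the assumption that all keypoints are integer pixel locations inside the frame are ours, and the SLIC call itself is omitted.

```python
import numpy as np


def compensated_coords(offsets, height, width, scale):
    """Build (x, y) coordinate channels that follow the camera.

    offsets: (T, 2) integrated median flow per frame, in (x, y) pixels.
    scale:   scalar weight applied to the coordinates.
    Returns a (T, height, width, 2) array of scaled coordinates.
    """
    ys, xs = np.mgrid[0:height, 0:width].astype(np.float64)
    grid = np.stack([xs, ys], axis=-1)  # (H, W, 2), last axis is (x, y)
    offsets = np.asarray(offsets, dtype=np.float64)
    # Subtract the integrated camera motion from every pixel coordinate.
    coords = grid[None] - offsets[:, None, None, :]
    return coords * scale


def accept_occluder(occluder, person, keypoints, min_visible=7):
    """Decide whether a superpixel should be used to mask the person.

    occluder:  (T, H, W) bool mask of the chosen superpixel.
    person:    (T, H, W) bool ground-truth segmentation of the person.
    keypoints: (T, K, 2) integer (x, y) locations, assumed to lie in the frame.
    Returns False if the superpixel never overlaps the person, or if fewer
    than `min_visible` keypoints per frame remain uncovered on average.
    """
    if not np.any(occluder & person):
        return False
    t = np.arange(keypoints.shape[0])[:, None]
    covered = occluder[t, keypoints[..., 1], keypoints[..., 0]]  # (T, K)
    remaining = (~covered).sum(axis=1)  # uncovered keypoints per frame
    return bool(remaining.mean() >= min_visible)
```

In this sketch, `scale` would be drawn uniformly between 4e-4 and 6e-4, and the returned coordinates would be concatenated to the RGB channels before running SLIC with a low compactness.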
Bounding boxes are cropped at a resolution, following the procedure from HMR which uses the keypoints to ensure that the person is centered and roughly 150 pixels tall. The ‘total occlusions’ are up to 15 frames long in our training set. Such ‘total occlusions’ are also used at evaluation time for any frame where fewer than 7 keypoints are visible; note that such frames are not counted in the evaluation.
Appendix B Implementation details for network training
We compute FlowNet-based optical flow at the level of bounding boxes to avoid wasting computation on the background. To compute the flow, then, we need two frames for each bounding box. We obtain these by extending the bounding box at each frame into the future to create a set of ‘paired boxes’ for optical flow. One drawback of this approach is that there is no ‘paired box’ for the final frame of a clip. To deal with this problem, we predict two poses from each pair of frames: one for the present frame and one for the frame in the future. In this way, a 31-frame clip becomes 30 bounding box pairs, and then 60 total predictions. While we compute a loss for all 60 predictions at training time, at test time, we throw out all ‘future’ predictions except the final one, in order to get 31 predictions.
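The bookkeeping for the paired predictions can be made concrete with a short NumPy sketch. It assumes, purely for illustration, that the network outputs are stored as two arrays with one row per bounding box pair; the function name `merge_pair_predictions` is hypothetical.

```python
import numpy as np


def merge_pair_predictions(current, future):
    """Combine per-pair predictions into one prediction per frame.

    current: (T-1, D) array; current[i] is the prediction for frame i.
    future:  (T-1, D) array; future[i] is the prediction for frame i+1.
    Returns a (T, D) array: every 'current' prediction, followed by the
    'future' prediction of the last pair, which covers the final frame.
    """
    current = np.asarray(current)
    future = np.asarray(future)
    return np.concatenate([current, future[-1:]], axis=0)
```

For a 31-frame clip, `current` and `future` each have 30 rows (60 predictions in total), and the merged result has 31 rows.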
Our LSTM contains 1024 hidden units, and we add another layer of 1024 units on top of this for each of the prediction heads for the 2 predictions (current and future) that must be made for each frame.
Because keypoint detectors do not reliably detect all keypoints, we find that domain randomization Tobin et al. (2017); Sadeghi and Levine (2016) is useful for generating hidden keypoints at training time. That is, we randomly hide keypoints with some probability. All keypoints outside the frame of the Kinetics video are considered hidden. All keypoints occluded by the SLIC mask are also considered hidden, as well as keypoints on frames that are totally occluded. Finally, any keypoints that are at least 20 cm behind the depth map of the mesh in 3D are considered hidden, because they likely indicate that a limb is occluded behind the body. We then unhide all leg keypoints with a 50% probability, as we find that the 3DPW keypoint detector often produces estimates for leg keypoints even when they are occluded. We then randomly unhide hidden keypoints and hide visible keypoints with a 5% probability. We initialize HMR using a standard ImageNet-pretrained ResNet-50 model for all experiments (as was done in the original HMR); when using optical flow, we expand 2 channels to 3 by adding an extra channel which is the flow magnitude, and pass this to the ResNet. Optical flow estimates are divided by 20 to put them in the same range as RGB pixels (roughly -1 to 1). When training with keypoint channels, these are concatenated alongside the other inputs, and extra weights in conv1 are initialized randomly.
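The flow pre-processing and the last two randomization steps can be illustrated with the NumPy sketch below. It is not the training code: the function names and the `leg_idx` argument are hypothetical, we assume the magnitude channel is computed from the already-scaled flow, and we assume the 50% leg decision is made once per clip while the 5% flips are independent per keypoint and frame.

```python
import numpy as np


def flow_to_three_channels(flow, scale=20.0):
    """Turn a 2-channel flow field into a 3-channel network input.

    flow: (..., 2) array of (x, y) flow in pixels.
    Returns a (..., 3) array: scaled x flow, scaled y flow, and the
    magnitude of the scaled flow.
    """
    flow = np.asarray(flow, dtype=np.float32) / scale
    magnitude = np.linalg.norm(flow, axis=-1, keepdims=True)
    return np.concatenate([flow, magnitude], axis=-1)


def randomize_visibility(hidden, leg_idx, rng, p_leg=0.5, p_flip=0.05):
    """Randomize which keypoints are marked as hidden.

    hidden:  (T, K) bool array, True where a keypoint is hidden.
    leg_idx: indices of the leg keypoints (hypothetical; depends on the
             joint ordering in use).
    rng:     a numpy.random.Generator.
    """
    hidden = np.array(hidden, dtype=bool)  # copy, so the input is untouched
    # With probability p_leg, mark every leg keypoint in the clip as visible.
    if rng.random() < p_leg:
        hidden[:, leg_idx] = False
    # Independently flip the state of each keypoint with probability p_flip.
    flip = rng.random(hidden.shape) < p_flip
    return hidden ^ flip
```

A typical call under these assumptions would be `randomize_visibility(hidden, leg_idx, np.random.default_rng())`, applied after the deterministic hiding rules (out of frame, occluded by the mask, or behind the body).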
Our implementation of DANN uses a 2-layer (1024 hidden units) adversary on the ConvNet output (right before the LSTM). We performed a hyperparameter sweep with 5 seeds, using a DANN loss weight between 0.2 and 5 (we found values outside this range performed worse) and report the best, with the mean in parentheses. This is the only experiment for which we performed any hyperparameter sweep.