# Synthetic Occlusion Augmentation for 3D Human Pose Estimation with Volumetric Heatmaps

## Abstract

In this paper we present our winning entry at the 2018 ECCV PoseTrack Challenge on 3D human pose estimation. Using a fully-convolutional backbone architecture, we obtain volumetric heatmaps per body joint, which we convert to coordinates using soft-argmax. Absolute person center depth is estimated by a 1D heatmap prediction head. The coordinates are back-projected to 3D camera space, where we minimize the L1 loss. Key to our good results is the training data augmentation with randomly placed occluders from the Pascal VOC dataset. In addition to reaching first place in the Challenge, our method also surpasses the state-of-the-art on the full Human3.6M benchmark among methods that use no additional pose datasets in training.

## 1 Introduction

The 3D part of the 2018 ECCV PoseTrack Challenge invited participants to tackle the following task. Given an uncropped, static RGB image containing a single person, estimate the position of body joints in 3D camera space, relative to the root (pelvis) joint position. Predicting human poses in 3D space has several important applications, such as human-robot collaboration and virtual reality.

## 2 Related Work

3D Human Pose Estimation. State-of-the-art 3D pose estimation methods are based on deep convolutional neural networks. We recommend Sarafianos et al.’s survey [19] for an overview of methods. Recently, building on experience gained from 2D human pose estimation (e.g., [13]), heatmap-based methods have been introduced for 3D pose estimation with promising results. This includes volumetric [17][24] and marginal heatmaps [14].

Occlusion Augmentation. Erasing or pasting over parts of an image has been successfully used as data augmentation in image classification, object detection, person re-identification [27][1][3][2][5], and facial landmark localization [26]. Ke et al. [10] augment images for 2D pose estimation by copying background patches over some of the body joints. We have recently found that such techniques are also very effective for 3D pose estimation [20].

## 3 Approach

The dataset in this challenge is a subset of Human3.6M [9][8] with a few important differences compared to the full benchmark. First, it lacks person bounding boxes and camera intrinsics as input. Second, the ground truth labels are more restricted, consisting only of camera-space 3D joint coordinates after subtraction of the root joint. Image-space joint coordinates are not available either.

We present a modified version of the method we recently used for studying occlusion-robustness in 3D pose estimation [20], extending it to handle the above-mentioned differences.

Image Preprocessing. We obtain person bounding boxes using the YOLOv3 detector [18]. Treating the original camera’s focal length as a global hyperparameter, we reproject the image to be centered on the person box, at a scale where the larger side of the box fills 90% of the output.

Backbone Network. We feed the cropped and zoomed image ( px) into a fully-convolutional backbone network (ResNet v2-50 [6][21]). We directly obtain volumetric heatmaps from the backbone net by adding a 1x1 convolutional layer on the last spatial feature map of the backbone, producing output channels. The resulting tensor is reshaped to yield volumes, one per body joint, each with depth .
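The reshaping step can be illustrated with a short NumPy sketch. The function name, the joint-major channel ordering, and the sizes below are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def channels_to_volumes(feature_map, num_joints, depth_bins):
    """Reshape an (H, W, J*D) tensor into J volumes of shape (D, H, W)."""
    height, width, channels = feature_map.shape
    if channels != num_joints * depth_bins:
        raise ValueError("channel count must equal num_joints * depth_bins")
    # Assumed layout: channels are ordered joint-major, then depth.
    volumes = feature_map.reshape(height, width, num_joints, depth_bins)
    # Reorder to (J, D, H, W): one depth-by-height-by-width volume per joint.
    return volumes.transpose(2, 3, 0, 1)

# Illustrative sizes only; not the values used in the paper.
logits = np.zeros((8, 8, 17 * 8), dtype=np.float32)
volumes = channels_to_volumes(logits, num_joints=17, depth_bins=8)
```

With these illustrative sizes, `volumes` has shape `(17, 8, 8, 8)`.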

Volumetric Heatmaps. We follow Pavlakos et al. in the interpretation of the volumetric heatmap’s axes [17]: X and Y correspond to image space and the depth axis to camera space, relative to the person center. Relative depths are not sufficient, however, when back-projecting from image to camera space. Pavlakos et al. optimize the root joint depth in post-processing, based on bone-length priors. By contrast, we predict it using a second prediction head on the backbone net (see Fig. 1). This outputs a 1D heatmap discretized to 32 units, representing a 10 meter range in front of the camera.

Soft-Argmax. We extract coordinate predictions from the heatmaps using soft-argmax [11][15]. Since this operation is differentiable, there is no need to provide ground-truth heatmaps at training time [24]. Instead, the loss can be computed deeper in the network and backpropagated through the soft-argmax operation. Soft-argmax also reduces the quantization errors inherent in hard argmax and gives fine-grained, continuous results without requiring memory-expensive, high-resolution heatmaps [24]. Indeed, we use a heatmap resolution as low as for the results presented in this paper.
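As a minimal sketch of the idea, soft-argmax over one joint's volume can be written as a softmax followed by an expected-coordinate computation. The function name and the convention of returning voxel-center coordinates normalized to [0, 1] are assumptions for illustration, not necessarily the convention used in our implementation.

```python
import numpy as np

def soft_argmax_3d(logits):
    """Return expected (x, y, z) coordinates in [0, 1] for a (D, H, W) volume."""
    depth, height, width = logits.shape
    # Numerically stable softmax over all voxels of the volume.
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    # Normalized voxel-center positions along each axis (assumed convention).
    zs = (np.arange(depth) + 0.5) / depth
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    # Expected coordinate = marginal probability along an axis times position.
    x = (weights.sum(axis=(0, 1)) * xs).sum()
    y = (weights.sum(axis=(0, 2)) * ys).sum()
    z = (weights.sum(axis=(1, 2)) * zs).sum()
    return np.array([x, y, z])
```

Because the output is a weighted average rather than an index, it varies continuously with the logits, which is what allows a coordinate loss to be backpropagated into a low-resolution heatmap.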

Camera Intrinsics. Having predicted image coordinates , , depth coordinates relative to the person center and the absolute depth of the person center by soft-argmax, we now need camera intrinsics to move from image space to 3D camera space. As mentioned earlier, the original camera’s focal length is treated as a hyperparameter, and we must also take into account the zooming factor applied in preprocessing.

To avoid the need for precise hyperparameter tuning of , we learn an additional, input-independent corrective factor for the focal length during training, to achieve better alignment of image and heatmap locations. Denoting the image height and width as and , back-projection is performed as

(1) |
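For intuition, the following sketch shows a standard pinhole back-projection of the kind described here. It is not a transcription of Eq. 1: the parameter names, the multiplicative form of the zoom and corrective factors, and the placement of the principal point at the image center are all assumptions.

```python
import numpy as np

def back_project(xy_pixels, rel_depth, center_depth, focal_length,
                 zoom, focal_correction, width, height):
    """Back-project image-space joints to camera space with a pinhole model."""
    xy_pixels = np.asarray(xy_pixels, dtype=np.float64)   # shape (J, 2)
    rel_depth = np.asarray(rel_depth, dtype=np.float64)   # shape (J,)
    # Effective focal length of the zoomed crop in pixels (assumed form).
    f = focal_length * zoom * focal_correction
    # Absolute joint depth = person-center depth + relative depth.
    z = center_depth + rel_depth
    # Assumed principal point: the center of the cropped image.
    principal_point = np.array([width / 2.0, height / 2.0])
    xy_camera = (xy_pixels - principal_point) / f * z[:, np.newaxis]
    return np.concatenate([xy_camera, z[:, np.newaxis]], axis=1)
```

Under these assumptions, a joint predicted at the image center with zero relative depth maps to a point on the optical axis at the person center depth.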

Loss. After subtracting the root joint coordinates, we compute the L1 loss in the original camera space w.r.t. the provided root-relative ground truth. No explicit heatmap loss is used. Since all of the above operations are differentiable, the whole network can be trained end-to-end.

Data Augmentation. In our recent study on the occlusion-robustness of 3D pose estimation [20], we found that augmenting training images with synthetic occlusions acts as an effective regularizer. Starting with the objects in the Pascal VOC dataset [4], we filter out persons, segments labeled as difficult or truncated and segments with area below 500 px, leaving 2638 objects. With probability , we paste a random number (between 1 and 8) of these objects at random locations in each frame. We also apply standard geometric augmentations (scaling, rotation, translation, horizontal flip) and appearance distortions (blurs and color manipulations). At test time only horizontal flipping augmentation is used.
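The pasting step might look roughly like the sketch below. It assumes the occluders have already been extracted and filtered as described, and are supplied as RGBA arrays whose alpha channel holds the segmentation mask. The function name, the uniform choice of object count, and placement by a uniformly random center are assumptions for illustration.

```python
import numpy as np

def paste_occluders(image, occluders, occlusion_prob, rng, max_count=8):
    """Paste 1..max_count random RGBA occluders onto a copy of an RGB image."""
    result = image.astype(np.float32)  # astype returns a copy
    if not occluders or rng.random() >= occlusion_prob:
        return result.astype(image.dtype)
    height, width = image.shape[:2]
    count = rng.integers(1, max_count + 1)  # uniform in 1..max_count (assumed)
    for _ in range(count):
        occluder = occluders[rng.integers(len(occluders))]
        occ_h, occ_w = occluder.shape[:2]
        # Random center inside the image; the occluder may overhang the border.
        top = rng.integers(height) - occ_h // 2
        left = rng.integers(width) - occ_w // 2
        # Clip the paste region to the image bounds.
        y0, x0 = max(top, 0), max(left, 0)
        y1, x1 = min(top + occ_h, height), min(left + occ_w, width)
        patch = occluder[y0 - top:y1 - top, x0 - left:x1 - left].astype(np.float32)
        # Alpha-blend using the mask stored in the fourth channel.
        alpha = patch[..., 3:4] / 255.0
        result[y0:y1, x0:x1] = (alpha * patch[..., :3]
                                + (1.0 - alpha) * result[y0:y1, x0:x1])
    return result.astype(image.dtype)
```

A typical call would pass `rng = np.random.default_rng(seed)` so that the augmentation is reproducible per worker.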

Training Details. The backbone net is initialized with ImageNet-pretrained weights from [21]. We train the final method for 410 epochs on the union of the training and validation set using the Adam optimizer and cyclical (triangular) learning rates [22]. Our final challenge predictions were produced using a snapshot ensemble [7], averaging the predictions of snapshots taken at the last three learning-rate-minima of the cyclical schedule. We set and .
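A triangular cyclical schedule and the averaging of snapshot predictions can be sketched as follows. The function names, the cycle length, and the learning-rate bounds are hypothetical placeholders, not the settings used for the challenge entry.

```python
import numpy as np

def triangular_learning_rate(step, cycle_length, min_lr, max_lr):
    """Triangular schedule: min_lr -> max_lr -> min_lr within each cycle."""
    position = (step % cycle_length) / cycle_length  # in [0, 1)
    # Rises linearly to 1 at mid-cycle, then falls back to 0.
    fraction = 1.0 - abs(2.0 * position - 1.0)
    return min_lr + (max_lr - min_lr) * fraction

def average_snapshot_predictions(predictions):
    """Average pose predictions from several snapshots of the same network."""
    return np.mean(np.stack(predictions), axis=0)
```

In this sketch the learning rate returns to its minimum at every multiple of `cycle_length`, which is where snapshots would be saved for the ensemble.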

## 4 Results

The evaluation metric is the mean per joint position error (MPJPE) over all joints after subtraction of the root joint position. Our method achieves 45.2 mm MPJPE on the Challenge test set; the next best methods reach 47.8, 52.6, 58.7, 59.0, 59.5, 66.2, and 66.5 mm.
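The metric can be computed as in the sketch below. The function name and the default `root_index=0` are assumptions; the index of the pelvis depends on the joint ordering of the dataset. The average here includes the root joint itself, whose error is zero after subtraction.

```python
import numpy as np

def mpjpe(predicted, target, root_index=0):
    """Mean per joint position error after root subtraction.

    Both inputs have shape (N, J, 3); the result is in the input units.
    """
    predicted = np.asarray(predicted, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Express every joint relative to the root (pelvis) joint of its pose.
    pred_rel = predicted - predicted[:, root_index:root_index + 1]
    target_rel = target - target[:, root_index:root_index + 1]
    # Euclidean distance per joint, averaged over joints and poses.
    return np.linalg.norm(pred_rel - target_rel, axis=-1).mean()
```

Because of the root subtraction, the metric is insensitive to a global translation of the predicted pose.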

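Table 1: Validation MPJPE for different occlusion augmentation probabilities.

| Occlusion probability | MPJPE (mm) |
|---|---|
| 15% | 52.9 |
| 50% | 50.4 |
| 85% | 49.5 |
| no occlusion augm. | 62.6 |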

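Table 2: MPJPE (mm) on the full Human3.6M benchmark, without and with extra 2D pose data in training.

| Method | No extra pose data | Extra 2D pose data |
|---|---|---|
| Tekin (CVPR’16) [25] | 125.0 | – |
| Zhou (CVPR’16) [28] | 113.0 | – |
| Xingyi (ECCV’16) [30] | 107.3 | – |
| Sun (ICCV’17) [23] | 92.4 | 59.1 |
| Martinez (ICCV’17) [12] | – | 62.9 |
| Zhou (ICCV’17) [29] | – | 55.9 |
| Pavlakos (CVPR’18) [16] | 71.9 | 56.2 |
| Sun (ArXiv) [24] | 64.1 | 49.6 |
| Ours (no occl.) | 65.7 | – |
| Ours (full) | 55.4 | – |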

Importance of Occlusion Augmentation. Table 1 shows how synthetic occlusion augmentation improves results on the validation set as we vary the probability of augmenting each frame. Augmenting just 15% of the images already improves MPJPE by about 10 mm, and increasing the probability beyond 50% influences the results only slightly.

Full Human3.6M Benchmark. For comparison with prior work, we train and evaluate our method on the full Human3.6M benchmark as well (Table 2). Here we use the bounding boxes and camera intrinsics provided with the dataset and minimize the loss computed on the absolute (i.e. non-root-relative) coordinates in camera space for 40 epochs. The person center depth is estimated as described in Section 3. We follow the common protocol of training on five subjects (S1, S5, S6, S7, S8) and evaluating on two (S9, S11), without Procrustes alignment. We use no snapshot ensembling here, for better comparability. We set . Our method outperforms all prior work on Human3.6M, except for [24], which uses extra 2D pose data in training.

## 5 Conclusion

We have presented an architecture and data augmentation method for 3D human pose estimation and have shown that it outperforms other methods both by achieving first place in the 2018 ECCV PoseTrack Challenge and by surpassing the state-of-the-art on the full benchmark among methods using no additional pose datasets in training.

## Acknowledgments

This project has been funded by a grant from the Bosch Research Foundation, by ILIAD (H2020-ICT-2016-732737) and by ERC Consolidator Grant DeeVise (ERC-2017-COG-773161).

### References

- [1] DeVries, T., Taylor, G.W.: Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552 (2017)
- [2] Dvornik, N., Mairal, J., Schmid, C.: Modeling visual context is key to augmenting object detection datasets. arXiv:1807.07428 (2018)
- [3] Dwibedi, D., Misra, I., Hebert, M.: Cut, paste and learn: Surprisingly easy synthesis for instance detection. In: ICCV (2017)
- [4] Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
- [5] Georgakis, G., Mousavian, A., Berg, A.C., Kosecka, J.: Synthesizing training data for object detection in indoor scenes. arXiv:1702.07836 (2017)
- [6] He, K., Zhang, X., Ren, S., Sun, J.: Identity mappings in deep residual networks. In: ECCV (2016)
- [7] Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J.E., Weinberger, K.Q.: Snapshot ensembles: Train 1, get M for free. arXiv:1704.00109 (2017)
- [8] Ionescu, C., Li, F., Sminchisescu, C.: Latent structured models for human pose estimation. In: ICCV (2011)
- [9] Ionescu, C., Papava, D., Olaru, V., Sminchisescu, C.: Human3.6M: Large scale datasets and predictive methods for 3D human sensing in natural environments. PAMI (2014)
- [10] Ke, L., Chang, M.C., Qi, H., Lyu, S.: Multi-scale structure-aware network for human pose estimation. arXiv:1803.09894 (2018)
- [11] Levine, S., Finn, C., Darrell, T., Abbeel, P.: End-to-end training of deep visuomotor policies. JMLR 17(1), 1334–1373 (2016)
- [12] Martinez, J., Hossain, R., Romero, J., Little, J.J.: A simple yet effective baseline for 3D human pose estimation. In: ICCV (2017)
- [13] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV (2016)
- [14] Nibali, A., He, Z., Morgan, S., Prendergast, L.: 3D human pose estimation with 2D marginal heatmaps. arXiv:1806.01484 (2018)
- [15] Nibali, A., He, Z., Morgan, S., Prendergast, L.: Numerical coordinate regression with convolutional neural networks. arXiv:1801.07372 (2018)
- [16] Pavlakos, G., Zhou, X., Daniilidis, K.: Ordinal depth supervision for 3D human pose estimation. In: CVPR (2018)
- [17] Pavlakos, G., Zhou, X., Derpanis, K.G., Daniilidis, K.: Coarse-to-fine volumetric prediction for single-image 3D human pose. In: CVPR (2017)
- [18] Redmon, J., Farhadi, A.: YOLOv3: An incremental improvement. arXiv:1804.02767 (2018)
- [19] Sarafianos, N., Boteanu, B., Ionescu, B., Kakadiaris, I.A.: 3D human pose estimation: A review of the literature and analysis of covariates. CVIU 152, 1–20 (2016)
- [20] Sárándi, I., Linder, T., Arras, K.O., Leibe, B.: How robust is 3D human pose estimation to occlusion? arXiv:1808.09316 (2018)
- [21] Silberman, N., Guadarrama, S.: TensorFlow-Slim image classification model library. https://github.com/tensorflow/models/tree/master/research/slim (2016)
- [22] Smith, L.N.: Cyclical learning rates for training neural networks. In: WACV (2017)
- [23] Sun, X., Shang, J., Liang, S., Wei, Y.: Compositional human pose regression. In: ICCV (2017)
- [24] Sun, X., Xiao, B., Liang, S., Wei, Y.: Integral human pose regression. arXiv:1711.08229 (2017)
- [25] Tekin, B., Rozantsev, A., Lepetit, V., Fua, P.: Direct prediction of 3D body poses from motion compensated sequences. In: CVPR (2016)
- [26] Yuen, K., Trivedi, M.M.: An occluded stacked hourglass approach to facial landmark localization and occlusion estimation. IEEE Trans. Intel. Veh. 2(4), 321–331 (2017)
- [27] Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. arXiv:1708.04896 (2017)
- [28] Zhou, X., Zhu, M., Leonardos, S., Derpanis, K.G., Daniilidis, K.: Sparseness meets deepness: 3D human pose estimation from monocular video. In: CVPR (2016)
- [29] Zhou, X., Huang, Q., Sun, X., Xue, X., Wei, Y.: Towards 3D human pose estimation in the wild: a weakly-supervised approach. In: ICCV (2017)
- [30] Zhou, X., Sun, X., Zhang, W., Liang, S., Wei, Y.: Deep kinematic pose regression. In: ECCV (2016)