JointFlow: Temporal Flow Fields for Multi Person Pose Tracking


Abstract

In this work we propose an online multi-person pose tracker which works on two consecutive frames $I_{t-1}$ and $I_t$. The general formulation of our temporal network allows us to rely on any multi-person pose estimation network as spatial network. From the spatial network we extract image features and pose features for both frames. These features serve as input for our temporal model that predicts Temporal Flow Fields (TFF). These TFF are vector fields which indicate the direction in which each body joint is going to move from frame $t-1$ to frame $t$. This novel representation allows us to formulate a similarity measure of detected joints. These similarities are used as binary potentials in a bipartite graph optimization problem in order to perform tracking of multiple poses. We show that these TFF can be learned by a relatively small CNN network whilst achieving state-of-the-art multi-person pose tracking results.

Andreas Doering (doering@iai.uni-bonn.de), Umar Iqbal (uiqbal@iai.uni-bonn.de), Juergen Gall (gall@iai.uni-bonn.de)
Computer Vision Group, University of Bonn, Bonn, Germany

1 Introduction

Understanding human body pose provides important information for many scene understanding problems such as activity recognition, surveillance and human-computer interaction. Estimating the pose in unconstrained environments with multiple interacting people is a challenging problem. Apart from large appearance variation and complex human body articulation, it poses additional challenges such as large scale variation within a single scene, varying numbers of persons, and body part occlusion and truncation. Multi-person pose estimation in videos increases the complexity even further since it also requires tackling the problems of person association over time, large person or camera motion, motion blur, etc.

In this work, we address the problem of multi-person pose tracking in videos, i.e., our goal is to estimate the pose of all persons appearing in the video and assign a unique identity to each person over time. The state-of-the-art approaches [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran, Xiu et al.(2018)Xiu, Li, Wang, Fang, and Lu] in this direction build on the recent progress in multi-person pose estimation in images and first estimate the poses from images using off-the-shelf methods, followed by an additional step for person association over time. There exist two main approaches for person association. The online approach matches the poses estimated at each time frame with the previously tracked poses and assigns an identity to each pose before moving to the next time step. In contrast, offline or batch-processing based approaches [Iqbal et al.(2017)Iqbal, Milan, and Gall] first estimate the poses in the entire video and then perform pose tracking while enforcing global temporal coherency of the tracks. In any case, both types of approaches require some metric to measure the similarity between a pair of poses. The choice of the metric and the features used for matching plays a crucial role in the performance of these approaches. Recent methods for pose tracking [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] rely on non-parametric metrics such as the PCKh (head-normalized Percentage of Correct Keypoints) [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] between a pair of poses, the IoU (Intersection over Union) between the bounding boxes tightly enclosing each body pose [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran], the similarity between the image features extracted from the person bounding boxes [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran], or optical flow information [Iqbal et al.(2017)Iqbal, Milan, and Gall, Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele, Xiu et al.(2018)Xiu, Li, Wang, Fang, and Lu]. The location-based metrics such as PCKh or IoU, however, assume that the poses change smoothly over time and therefore struggle in case of large camera or body pose motion and scale variations due to camera zoom. Appearance-based similarity metrics or optical flow information, on the other hand, cannot handle large appearance variations due to person occlusions or truncation, motion blur, etc. The offline approaches try to tackle these challenges by enforcing long-range temporal coherence. This is often done by formulating the problem using complex spatio-temporal graphs [Iqbal et al.(2017)Iqbal, Milan, and Gall, Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele], which results in very high inference times and therefore makes these methods infeasible for many applications.

In this work we present an efficient approach for online multi-person pose tracking. In contrast to existing methods that rely on task-agnostic similarity metrics, we propose a novel task-specific representation for person association over time. We refer to this representation as Temporal Flow Fields (TFF). TFF represent the movement of each body part between two consecutive frames by a set of 2D vectors encoded in an image. Our TFF representation is inspired by the Part Affinity Fields representation [Cao et al.(2017)Cao, Simon, Wei, and Sheikh], which measures the spatial association between different body parts. The TFF are learned by a CNN. We integrate TFF in an online multi-person tracking approach and demonstrate that a greedy matching approach is sufficient to obtain state-of-the-art multi-person pose tracking results on the PoseTrack benchmark dataset [Andriluka et al.(2018)Andriluka, Iqbal, Ensafutdinov, Pishchulin, Milan, Gall, and B.].

2 Related Work

The problem of multi-person pose estimation in images has seen a drastic improvement over the last few years [Pishchulin et al.(2012)Pishchulin, Jain, Andriluka, Thormaehlen, and Schiele, Chen and Yuille(2015), Eichner and Ferrari(2010), Ladicky et al.(2013)Ladicky, Torr, and Zisserman, Belagiannis et al.(2016)Belagiannis, Amin, Andriluka, Schiele, Navab, and Ilic, Iqbal and Gall(2016), Linna et al.(2016)Linna, Kannala, and Rahtu, Fang et al.(2017)Fang, Xie, Tai, and Lu, Papandreou et al.(2017)Papandreou, Zhu, Kanazawa, Toshev, Tompson, Bregler, and Murphy, Xia et al.(2017)Xia, Wang, Chen, and Yuille, Nie et al.(2017)Nie, Feng, Xing, and Yan, Wang et al.(2017a)Wang, Körding, and Yarkony, Wang et al.(2017b)Wang, Zhang, Ballester, Ihler, and Yarkony, Insafutdinov et al.(2016)Insafutdinov, Pishchulin, Andres, Andriluka, and Schiele, Pishchulin et al.(2016)Pishchulin, Insafutdinov, Tang, Andres, Andriluka, Gehler, and Schiele, Sun et al.(2017)Sun, Xiao, Liang, and Wei, Cao et al.(2017)Cao, Simon, Wei, and Sheikh, Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele, Chen et al.(2018)Chen, Wang, Peng, Zhang, Yu, and Sun, He et al.(2017)He, Gkioxari, Dollár, and Girshick, Ning and He(2017), Varadarajan et al.(2017)Varadarajan, Datta, and Tickoo, Rogez et al.(2017)Rogez, Weinzaepfel, and Schmid], however, very few works have addressed the problem in videos [Iqbal et al.(2017)Iqbal, Milan, and Gall, Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele, Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran, Xiao et al.(2018)Xiao, Wu, and Wei]. [Iqbal et al.(2017)Iqbal, Milan, and Gall] is one of the first works which tackles the problem of multi-person pose estimation and tracking by solving a spatio-temporal graph matching problem. The spatio-temporal graph is created by densely connecting all detected joint candidates in the spatial domain. In the temporal domain, all joints of the same class are connected. In order to find the best graph partition, a constrained integer linear program has to be solved. For runtime reasons, [Iqbal et al.(2017)Iqbal, Milan, and Gall] propose to optimize sequentially over temporal windows of a fixed size only. Nevertheless, the runtime is still too high, which makes this approach impractical for real-time applications. A very similar approach with comparable performance is proposed by [Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele], which in contrast to [Iqbal et al.(2017)Iqbal, Milan, and Gall] relies on a sparse spatio-temporal graph. [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] propose a video pose estimation formulation which consists of a 3D extension of the Mask R-CNN model [He et al.(2017)He, Gkioxari, Dollár, and Girshick]. By integrating temporal information, the proposed model estimates strong person bounding boxes and poses, which the authors refer to as person tubes. To achieve this, their network first predicts bounding boxes for each frame, followed by a pre-trained ResNet-101 network [He et al.(2016)He, Zhang, Ren, and Sun] for pose estimation. In order to link the estimated poses in time, [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] propose to solve a bipartite graph matching problem in a greedy fashion and show that the achieved results are very close to the optimal solution obtained via the Hungarian algorithm.
By comparing different distance metrics, the authors show that the Intersection over Union (IoU) of person bounding boxes achieves the best trade-off between performance and runtime. Nevertheless, this approach requires processing entire sequences or portions of a sequence, which limits its applicability for real-time applications.

In [Xiao et al.(2018)Xiao, Wu, and Wei], the authors follow a very similar baseline as proposed in [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran], but in contrast rely on two different sources for person bounding boxes: a bounding box detector and optical flow. This allows them to warp the estimated poses of the previous frame into the current frame using the optical flow, and a similarity metric between estimated and warped poses based on the Object Keypoint Similarity (OKS) is used for the calculation of the binary potentials of a temporal graph. By utilizing greedy graph matching similar to [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran], this approach achieves state-of-the-art results.

3 Overview

Figure 1: Proposed approach: For two consecutive input frames $I_{t-1}$ and $I_t$ we utilize a Siamese network initialized by an arbitrary multi-person pose estimation network. During spatial inference, pose features such as belief maps or part affinity fields are used to estimate the poses for each frame. Building on these pose features, our proposed temporal model predicts the Temporal Flow Fields for the detected joints, which are used during inference to associate poses in time.

In this work we propose to predict Temporal Flow Fields in an online fashion. To this end, we evaluate two frames at a time, as visualized in Figure 1, and estimate their poses. The structure of our temporal model allows us to utilize any network architecture for the task of multi-person pose estimation. In the context of this work, we use the CNN of [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] as a component in our Siamese network.

While the Siamese network is used to predict the poses in both frames, we take the features of its last layer as input for the temporal CNN, which predicts the Temporal Flow Fields (TFF).

To track the poses, we then create a bipartite graph as illustrated in Figure 4b) from the estimated poses and use the estimated TFF as similarity measure (Sec. 4) in a bipartite graph matching problem.

4 Multi-Person Pose Tracking

We represent the body pose of a person $k$ with $J$ body joints as $P_k^t = \{p_{j,k}^t\}_{j=1}^{J}$, where $p_{j,k}^t \in \mathbb{R}^2$ are the 2D pixel coordinates of body joint $j$ in video frame $t$. Given an input video, our goal is to perform multi-person pose estimation and tracking in an online manner. Formally, at every time instance $t$ with video frame $I_t$ containing $K_t$ persons, we first estimate the set of poses $\mathcal{P}_t = \{P_k^t\}_{k=1}^{K_t}$ and then perform person association with the set of poses $\mathcal{P}_{t-1}$ tracked until the previous video frame $I_{t-1}$. For pose estimation, we use an improved version of [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] that we explain briefly in Sec. 5. We formulate the problem of person association between the sets of poses $\mathcal{P}_t$ and $\mathcal{P}_{t-1}$ as an energy maximization problem over a bipartite graph (Figure 4b) as follows

$$\max_{z}\; \sum_{k=1}^{K_t} \sum_{\tilde{k}=1}^{K_{t-1}} \psi_{k,\tilde{k}}\, z_{k,\tilde{k}} \qquad (1)$$
$$\text{s.t.} \quad \sum_{\tilde{k}} z_{k,\tilde{k}} \le 1 \;\;\forall k, \qquad \sum_{k} z_{k,\tilde{k}} \le 1 \;\;\forall \tilde{k}, \qquad z_{k,\tilde{k}} \in \{0,1\}$$

where $z_{k,\tilde{k}}$ is a binary variable which indicates that the poses $P_k^t$ and $P_{\tilde{k}}^{t-1}$ are associated with each other, and the binary potential $\psi_{k,\tilde{k}}$ defines the similarity between the pair of poses $P_k^t$ and $P_{\tilde{k}}^{t-1}$.
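For illustration, the assignment problem (1) can be solved exactly with the Hungarian algorithm once the potentials $\psi_{k,\tilde{k}}$ are arranged in a matrix. The sketch below uses SciPy's linear_sum_assignment on a toy similarity matrix with illustrative values; during inference we actually use a greedy approximation (Sec. 4.2).

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy similarity matrix psi[k, k_tilde]: rows = poses in frame t,
# columns = poses tracked until frame t-1 (values are illustrative only).
psi = np.array([[0.9, 0.1, 0.2],
                [0.2, 0.8, 0.1]])

# linear_sum_assignment minimizes cost, so negate the similarities
# to maximize the total potential of Equation (1).
rows, cols = linear_sum_assignment(-psi)

for k, k_tilde in zip(rows, cols):
    print(f"pose {k} in frame t is assigned to track {k_tilde} (psi={psi[k, k_tilde]:.2f})")
```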

4.1 Temporal Flow Fields

Figure 2: Calculation of Temporal Flow Fields: Let $p_{j,k}^{t-1}$ and $p_{j,k}^{t}$ be the locations of joint $j$ of person $k$ in frames $t-1$ and $t$. For every point $p$ located on the flow field, the TFF contains a unit vector $v$, and $0$ otherwise.

We model the binary potentials $\psi$ in (1) by Temporal Flow Fields (TFF) and define each TFF $T_{j,k}$ as a vector field that contains a unit vector $v$ for each pixel $p$. Each unit vector $v = (p_{j,k}^{t} - p_{j,k}^{t-1}) / \ell_{j,k}$ points towards the direction of the target joint location $p_{j,k}^{t}$, where $\ell_{j,k} = \lVert p_{j,k}^{t} - p_{j,k}^{t-1} \rVert_2$ is the Euclidean distance between the estimated joint locations of person $k$ in frames $t-1$ and $t$. We restrict the TFF to pixels that are close to the joint motion by a parameter $\sigma$ and describe the set of pixels of the TFF as

$$S_{j,k} = \left\{\, p \;\middle|\; 0 \le v \cdot \big(p - p_{j,k}^{t-1}\big) \le \ell_{j,k} \;\text{ and }\; \big\lvert v_{\perp} \cdot \big(p - p_{j,k}^{t-1}\big) \big\rvert \le \sigma \,\right\} \qquad (2)$$

where $v_{\perp}$ is a unit vector perpendicular to $v$, as illustrated in Figure 2. This allows a pixel-wise definition of the TFF for joint class $j$ of person $k$:

$$T_{j,k}(p) = \begin{cases} v & \text{if } p \in S_{j,k} \\ 0 & \text{otherwise} \end{cases} \qquad (3)$$

In a final step, a single flow field representation $T_j$ is generated for each joint class $j$ by aggregating the TFF among all estimated persons:

$$T_{j}(p) = \frac{1}{n_j(p)} \sum_{k} T_{j,k}(p) \qquad (4)$$

where $n_j(p)$ is the number of non-zero unit vectors at location $p$ across all persons.
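The following numpy sketch shows how a ground-truth TFF for a single joint class could be rasterized according to Equations (2)-(4). The image size, joint coordinates and the width $\sigma$ in the example are illustrative assumptions, not values from the paper.

```python
import numpy as np

def single_person_tff(p_prev, p_curr, shape, sigma):
    """Rasterize the TFF of one joint of one person (Eqs. 2-3).

    p_prev, p_curr: 2D joint locations (x, y) in frames t-1 and t.
    shape:          (height, width) of the flow field.
    sigma:          half-width of the region around the joint motion.
    """
    tff = np.zeros((*shape, 2), dtype=np.float32)
    d = np.asarray(p_curr, np.float32) - np.asarray(p_prev, np.float32)
    length = np.linalg.norm(d)
    if length < 1e-6:
        return tff                      # no motion -> empty flow field
    v = d / length                      # unit vector along the joint motion
    v_perp = np.array([-v[1], v[0]])    # unit vector perpendicular to v

    ys, xs = np.mgrid[0:shape[0], 0:shape[1]]
    rel = np.stack([xs - p_prev[0], ys - p_prev[1]], axis=-1).astype(np.float32)
    along = rel @ v                     # projection onto the motion direction
    across = rel @ v_perp               # signed distance from the motion segment
    mask = (along >= 0) & (along <= length) & (np.abs(across) <= sigma)  # Eq. (2)
    tff[mask] = v                       # Eq. (3)
    return tff

def aggregate_tff(per_person_tffs):
    """Average non-zero vectors over all persons (Eq. 4)."""
    stacked = np.stack(per_person_tffs)                      # (K, H, W, 2)
    nonzero = np.linalg.norm(stacked, axis=-1) > 0            # (K, H, W)
    counts = np.maximum(nonzero.sum(axis=0), 1)[..., None]    # n_j(p), avoid div by 0
    return stacked.sum(axis=0) / counts

# Illustrative example: two persons, one joint class, 64x64 field, sigma = 2.
t1 = single_person_tff((10, 12), (20, 18), (64, 64), sigma=2.0)
t2 = single_person_tff((40, 40), (40, 41), (64, 64), sigma=2.0)
T = aggregate_tff([t1, t2])
```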

4.1.1 Model

Figure 3: Proposed Model Structure: a) Siamese network to extract pose features (SVGG, Belief, PAF) for frames $I_{t-1}$ and $I_t$, and b) temporal network which predicts Temporal Flow Fields from the feature map input SVGG and belief maps (SVGG + Belief) of the two frames.

For the prediction of Temporal Flow Fields, we propose an efficient CNN as illustrated in Figure 3b), which consists of five convolution layers with a stride of one pixel followed by two further convolutions. Non-linearity is achieved by ReLU layers after each convolution.

As input, the network expects image features and pose features. These are obtained from the Siamese network visualized in Figure 3a), which is initialized by a modified version of [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] consisting of six stages. In particular, image features of both frames $I_{t-1}$ and $I_t$ are obtained by a feature extraction layer as illustrated in Figure 3a). We refer to these image features as SVGG. Additionally, the Siamese network predicts belief maps and Part Affinity Fields (PAFs) (cf. [Cao et al.(2017)Cao, Simon, Wei, and Sheikh]) at each stage, which we refer to as pose features. Based on the image features and the pose features extracted from the last stage, the temporal model predicts the TFF.
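A rough PyTorch sketch of a temporal network of the kind described above is given below. The kernel sizes, channel counts and the number of input channels (concatenated SVGG features and belief maps of both frames) are assumptions for illustration only; they are not specified by this text.

```python
import torch
import torch.nn as nn

class TemporalFlowNet(nn.Module):
    """Small fully convolutional net that maps concatenated image and pose
    features of frames t-1 and t to Temporal Flow Fields (2 channels per joint).
    Channel counts and kernel sizes are illustrative assumptions."""

    def __init__(self, in_channels, num_joints, hidden=128):
        super().__init__()
        layers = []
        c = in_channels
        # five convolutions with stride 1 (kernel size assumed to be 7)
        for _ in range(5):
            layers += [nn.Conv2d(c, hidden, kernel_size=7, stride=1, padding=3),
                       nn.ReLU(inplace=True)]
            c = hidden
        # two further convolutions (assumed 1x1) producing 2*J flow channels
        layers += [nn.Conv2d(c, hidden, kernel_size=1), nn.ReLU(inplace=True),
                   nn.Conv2d(hidden, 2 * num_joints, kernel_size=1)]
        self.net = nn.Sequential(*layers)

    def forward(self, feats_prev, feats_curr):
        # concatenate SVGG features and belief maps of both frames
        x = torch.cat([feats_prev, feats_curr], dim=1)
        return self.net(x)

# Illustrative usage with assumed feature dimensions (128 SVGG + 15 belief channels).
net = TemporalFlowNet(in_channels=2 * (128 + 15), num_joints=15)
f_prev = torch.randn(1, 128 + 15, 46, 46)
f_curr = torch.randn(1, 128 + 15, 46, 46)
tff = net(f_prev, f_curr)   # (1, 2*15, 46, 46)
```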

For the training of our model, we calculate the weighted squared L2 loss of the form

$$\mathcal{L} = \sum_{j=1}^{J} \sum_{p} W(p)\, \big\lVert T_{j}^{*}(p) - \hat{T}_{j}(p) \big\rVert_2^2 \qquad (5)$$

where $T_j^{*}(p)$ and $\hat{T}_j(p)$ are the ground truth TFF and the predicted TFF at pixel location $p$, respectively. $W$ is a binary mask with $W(p) = 0$ for all pixels located on an ignore region, i.e., a region for which the dataset does not provide any annotations.
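The weighted loss of Equation (5) amounts to only a few lines; the PyTorch sketch below zeroes out ignore regions with the binary mask $W$. The tensor shapes in the example are assumptions.

```python
import torch

def tff_loss(tff_pred, tff_gt, ignore_mask):
    """Weighted squared L2 loss of Eq. (5).

    tff_pred, tff_gt: (B, 2*J, H, W) predicted and ground-truth flow fields.
    ignore_mask:      (B, 1, H, W) binary mask, 0 on ignore regions, 1 elsewhere.
    """
    sq_err = (tff_pred - tff_gt) ** 2          # per-pixel squared error
    return (ignore_mask * sq_err).sum()        # W(p) zeroes unannotated regions

# Illustrative call with random tensors.
pred = torch.randn(2, 30, 46, 46)
gt   = torch.randn(2, 30, 46, 46)
mask = torch.ones(2, 1, 46, 46)
loss = tff_loss(pred, gt, mask)
```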

4.2 Inference

Figure 4: Temporal edge candidate generation for incomplete pose estimates: a) on a joint level and b) on a person level, where the edge weights $\psi_{k,\tilde{k}}$ are the accumulated temporal edge potentials estimated according to Equation (8).

During inference, we partition the bipartite graph by optimizing (1) using a greedy approach. In order to obtain the binary potentials $\psi_{k,\tilde{k}}$, we first generate temporal bipartite subgraphs $G_j$ with a set of edges that connect all detected joints of class $j$ in frame $t-1$ with all detected joints of the same class in frame $t$. Figure 4a) illustrates such subgraphs. Along each temporal edge of $G_j$, we follow the estimated TFF $T_j$ and obtain a flow field aggregate given by

$$E_j(k, \tilde{k}) = \int_{0}^{1} T_j\big(i(u)\big) \cdot \frac{p_{j,k}^{t} - p_{j,\tilde{k}}^{t-1}}{\big\lVert p_{j,k}^{t} - p_{j,\tilde{k}}^{t-1} \big\rVert_2}\, \mathrm{d}u \qquad (6)$$

where $i(u) = (1-u)\, p_{j,\tilde{k}}^{t-1} + u\, p_{j,k}^{t}$ is a function that interpolates the location between both detected joints. The value $E_j(k, \tilde{k})$ is high if the TFF points in the same direction as the displacement between the joints along $i(u)$. In addition to this formulation, we have to consider a special case: if there is no motion of a joint between frames, no flow field exists and the flow field aggregate would be zero. To overcome this issue, we incorporate the Euclidean distance between both joint locations into our similarity measure and define

$$\psi_j(k, \tilde{k}) = \begin{cases} 1 & \text{if } \big\lVert p_{j,k}^{t} - p_{j,\tilde{k}}^{t-1} \big\rVert_2 < \tau \\ E_j(k, \tilde{k}) & \text{otherwise} \end{cases} \qquad (7)$$

where $\tau$ is a pre-defined distance threshold. This definition allows us to formulate the binary potentials required to solve (1).
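A numpy sketch of this similarity is given below: the TFF is sampled at a fixed number of points on the segment between the two joint candidates and scored against the displacement direction, while joints closer than $\tau$ are treated as a match. The number of samples and the exact handling of the zero-motion case follow the reconstruction in Equation (7) and are assumptions for illustration.

```python
import numpy as np

def flow_aggregate(tff, p_prev, p_curr, num_samples=10):
    """Approximate the line integral of Eq. (6) by sampling the TFF
    along the segment between the joint candidates in t-1 and t."""
    p_prev = np.asarray(p_prev, np.float32)
    p_curr = np.asarray(p_curr, np.float32)
    d = p_curr - p_prev
    length = np.linalg.norm(d)
    if length < 1e-6:
        return 0.0                       # no displacement -> empty aggregate
    v = d / length
    score = 0.0
    for u in np.linspace(0.0, 1.0, num_samples):
        x, y = p_prev + u * d            # interpolated location i(u)
        score += tff[int(round(y)), int(round(x))] @ v
    return float(score) / num_samples

def joint_similarity(tff, p_prev, p_curr, tau=2.0):
    """Eq. (7): joints that barely move are treated as a confident match
    (rule assumed from the text), otherwise the flow aggregate is used."""
    if np.linalg.norm(np.asarray(p_curr) - np.asarray(p_prev)) < tau:
        return 1.0
    return flow_aggregate(tff, p_prev, p_curr)

# Illustrative example: a constant flow field pointing to the right.
tff = np.zeros((64, 64, 2), np.float32)
tff[..., 0] = 1.0
print(joint_similarity(tff, (10, 20), (22, 20)))   # high similarity (~1.0)
```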

Instead of solving a separate bipartite graph matching problem for each joint class $j$, we convert each estimated pose into a node of a person-level graph as illustrated in Figure 4b). The temporal potential $\psi_{k,\tilde{k}}$ is then defined as the accumulated similarity over all joints of persons $k$ and $\tilde{k}$:

$$\psi_{k,\tilde{k}} = \sum_{j=1}^{J} \delta_j(k, \tilde{k})\, \psi_j(k, \tilde{k}) \qquad (8)$$

where $\delta_j(k, \tilde{k})$ is a binary function with $\delta_j(k, \tilde{k}) = 1$ if joint $j$ is detected for both persons. An example is shown in Figure 4a) and 4b): temporal edges between the estimated joints of a person in frame $t-1$ and the persons in frame $t$ are generated. In this particular case, one pair of persons shares 12 temporal connections among joints, whereas the other pair shares nine temporal edges. The costs of these edges are accumulated and assigned as potentials to the temporal connections among the persons, as shown in Figure 4b).

By solving (1), each detected person is either assigned to one of the poses from the previous frame, which continues the corresponding track, or a new track is initialized if no assignment is possible.
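A self-contained sketch of a greedy optimization of (1) is shown below: the person-level potentials $\psi$ (Eq. 8) are visited in decreasing order and matched greedily, and detections without a partner start new tracks. The minimum-potential threshold for accepting a match is an assumption.

```python
import numpy as np

def greedy_match(psi, min_potential=0.0):
    """Greedy optimization of Eq. (1).

    psi: (K_t, K_{t-1}) matrix of accumulated potentials (Eq. 8).
    Returns a dict mapping detection index -> track index (or None for new tracks).
    """
    assignment = {k: None for k in range(psi.shape[0])}
    used_tracks = set()
    # visit person pairs in order of decreasing potential
    order = np.dstack(np.unravel_index(np.argsort(-psi, axis=None), psi.shape))[0]
    for k, k_tilde in order:
        if assignment[k] is None and k_tilde not in used_tracks \
                and psi[k, k_tilde] > min_potential:
            assignment[k] = int(k_tilde)        # continue existing track
            used_tracks.add(int(k_tilde))
    return assignment                            # None entries become new tracks

# Illustrative example with three detections and two existing tracks.
psi = np.array([[0.9, 0.2],
                [0.1, 0.7],
                [0.0, 0.1]])
print(greedy_match(psi, min_potential=0.15))
# {0: 0, 1: 1, 2: None}  -> detection 2 initializes a new track
```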

5 Implementation Details

For the spatial part of our proposed approach (Figure 1), we re-implemented the work of [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] and applied two minor modifications: instead of initializing the feature extraction part with 10 layers of VGG19, we increase the number of layers to 12. The second modification is a different edge configuration for the prediction of Part Affinity Fields. Both changes result in a gain in pose estimation performance. For further evaluations and a detailed description of the underlying edge configuration, we refer to the supplementary material. For the detection of joints, we perform Non-Maximum Suppression (NMS) on the estimated belief maps and discard all detections that do not meet a threshold $\gamma$. The spatial model was trained on the MSCOCO dataset [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] for 22 epochs, with the learning rate decayed after 7 epochs. For finetuning, we rely on the PoseTrack dataset [Andriluka et al.(2018)Andriluka, Iqbal, Ensafutdinov, Pishchulin, Milan, Gall, and B.] and train for 3000 iterations, with a learning rate decay after 1000 iterations.
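A small sketch of the joint candidate detection step: local maxima of a belief map are kept if they exceed the threshold $\gamma$. The 3x3 neighbourhood used for the maximum filter is an assumption.

```python
import numpy as np
from scipy.ndimage import maximum_filter

def detect_joints(belief_map, gamma=0.2):
    """Non-maximum suppression on a single belief map.

    Returns (x, y, score) tuples for all local maxima above the threshold gamma.
    """
    local_max = (belief_map == maximum_filter(belief_map, size=3))  # assumed 3x3 window
    ys, xs = np.nonzero(local_max & (belief_map > gamma))
    return [(int(x), int(y), float(belief_map[y, x])) for x, y in zip(xs, ys)]

# Illustrative example: a belief map with two peaks at (12, 10) and (33, 30).
bm = np.zeros((46, 46), np.float32)
bm[10, 12] = 0.9
bm[30, 33] = 0.4
print(detect_joints(bm, gamma=0.2))
```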

Our temporal model is trained on the PoseTrack dataset for 40 epochs, with the learning rate decayed after seven epochs. Additionally, we fix the distance threshold $\tau$ and the width $\sigma$ of our Temporal Flow Fields.

6 Experiments

MOTA MOTP Prec Rec
Input Features $\gamma$ Total Total Total Total
SVGG 0.1 55.0 50.5 81.5 75.7
SVGG + Belief 0.1 55.2 50.5 81.5 75.7
SVGG + Belief + PAF 0.1 55.2 50.5 81.5 75.7
Table 1: Impact of different combinations of input features on the pose tracking performance.
MOTA MOTP Prec Rec mAP
Input Features $\gamma$ Total Total Total Total Total
SVGG + Belief 0.1 55.0 50.5 81.5 75.7 69.1
SVGG + Belief 0.2 58.3 45.9 86.2 72.7 67.0
SVGG + Belief 0.3 58.0 23.2 90.4 67.4 62.7
Table 2: Impact of the threshold $\gamma$ during NMS of the belief maps on pose estimation and tracking performance.

Within the first experiment, we tested different combinations of pose features (Table 1) in order to evaluate their performance. To this end, each model was trained for 40 epochs on the PoseTrack dataset [Andriluka et al.(2018)Andriluka, Iqbal, Ensafutdinov, Pishchulin, Milan, Gall, and B.], and we rely on the metrics proposed in [Milan et al.(2016)Milan, Leal-Taixé, Reid, Roth, and Schindler] to measure pose tracking performance. Spatial image features (SVGG) alone already provide a strong cue for the prediction of the temporal vector fields of each joint class. Additional knowledge is provided by the estimated belief maps (Belief) of both input frames. Part Affinity Fields (PAFs) [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] represent the skeletal structure of the human body and were expected to boost the performance even further. As Table 1 reveals, this is not the case. In order to reduce the number of input parameters, we use SVGG + Belief as the input for the temporal model.

In an additional set of experiments, we evaluate different thresholds $\gamma$ for the non-maximum suppression of the heatmaps used for the detection of joint candidates. Even though a higher threshold results in less accurate pose estimates, more confident detection candidates result in stronger person tracks. According to Table 2, we select $\gamma = 0.2$ as a reasonable trade-off, resulting in a boost in tracking performance.

6.1 Comparison to Baselines

In an additional set of experiments, the performance of TFF is compared to different tracking metrics, namely the intersection over union (IoU) of person bounding boxes, PCKh [Andriluka et al.(2014)Andriluka, Pishchulin, Gehler, and Schiele], and optical flow based tracking. For this purpose, the temporal potential defined in (8) has to be adapted to

$$\psi_{k,\tilde{k}} = \mathrm{IoU}\big(B_k^{t}, B_{\tilde{k}}^{t-1}\big) \qquad (9)$$

where $B_k^{t}$ and $B_{\tilde{k}}^{t-1}$ are the bounding boxes for persons $k$ and $\tilde{k}$ in frames $t$ and $t-1$, respectively. The bounding boxes for each person are estimated from the detected poses. Table 3 summarizes the results. Neither metric can compete with the proposed TFF.
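For the IoU baseline, the potential of Equation (9) reduces to the bounding-box overlap of the two poses. A compact numpy sketch follows, where the boxes are derived as tight bounds around the detected joints; the example poses are illustrative.

```python
import numpy as np

def pose_bbox(joints):
    """Tight bounding box (x1, y1, x2, y2) around the detected joints of a pose."""
    joints = np.asarray(joints, np.float32)
    return joints[:, 0].min(), joints[:, 1].min(), joints[:, 0].max(), joints[:, 1].max()

def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

# Illustrative example: two poses with strongly overlapping extents.
pose_t   = [(10, 10), (30, 50), (20, 80)]
pose_tm1 = [(12, 12), (32, 48), (22, 78)]
psi = iou(pose_bbox(pose_t), pose_bbox(pose_tm1))
print(psi)
```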

MOTA MOTP Prec Rec
Baselines Total Total Total Total
PCKh 49.6 45.9 86.2 72.7
IoU 56.5 45.9 86.2 72.7
Optical Flow 57.7 45.9 86.2 72.7
Temporal Vector Fields 58.3 45.9 86.2 72.7
Table 3: Comparison to different baselines.

Optical flow based tracking requires a different set of changes. First of all, we rely on the approach of [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] in order to estimate the optical flow $F$. Similar to Temporal Flow Fields, the optical flow is a vector field which can be used to predict the movement of each joint from frame $t-1$ to frame $t$. In order to incorporate the optical flow into the greedy bipartite graph matching algorithm, the flow field aggregation energy (Equation (6)) has to be adapted as follows:

$$E_j(k, \tilde{k}) = \exp\!\left( -\frac{\big\lVert p_{j,\tilde{k}}^{t-1} + F\big(p_{j,\tilde{k}}^{t-1}\big) - p_{j,k}^{t} \big\rVert_2^2}{2\alpha^2} \right) \qquad (10)$$

where $\alpha$ controls the tolerance radius to mistakes. In that way, optical flow vectors which vote for locations close to the target joint location $p_{j,k}^{t}$ still contribute significantly to the energy. We select the value of $\alpha$ that performed best in our experiments.
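A hedged sketch of this optical-flow baseline score: the joint from frame $t-1$ is warped with the flow and compared to the candidate in frame $t$, with $\alpha$ controlling the tolerance radius. The Gaussian form mirrors the reconstructed Equation (10) and is an interpretation of the description above rather than a formula stated verbatim in the text.

```python
import numpy as np

def optical_flow_score(flow, p_prev, p_curr, alpha=5.0):
    """Score a temporal edge by warping the joint in frame t-1 with the
    optical flow and measuring how close it lands to the candidate in frame t.
    The Gaussian penalty with tolerance radius alpha is an assumed form."""
    p_prev = np.asarray(p_prev, np.float32)
    p_curr = np.asarray(p_curr, np.float32)
    fx, fy = flow[int(round(p_prev[1])), int(round(p_prev[0]))]   # flow vector at joint
    warped = p_prev + np.array([fx, fy], np.float32)
    dist = np.linalg.norm(warped - p_curr)
    return float(np.exp(-dist ** 2 / (2.0 * alpha ** 2)))

# Illustrative example: a constant flow of (+10, 0) pixels.
flow = np.zeros((64, 64, 2), np.float32)
flow[..., 0] = 10.0
print(optical_flow_score(flow, (10, 20), (21, 20)))   # close to 1, warped joint is ~1 px off
```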

Although the network for optical flow [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] is much larger and more expensive than our network for TFF, the TFF outperform the optical flow baseline.

6.2 Comparison to State-of-the-Art

For a comparison to the state of the art, we consider the results on the PoseTrack validation set reported in [Xiao et al.(2018)Xiao, Wu, and Wei, Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] and the results on the PoseTrack test set taken from the PoseTrack Challenge Leaderboard [lea(2018)]. On the validation set we achieve a total MOTA of 58.3, which can be improved up to 59.2 after pruning tracks shorter than 7 frames. We submitted our results to the official evaluation server and achieved the second place on the leaderboard with a final MOTA of 53.0.

MOTA MOTP Prec Rec mAP
Approach Evaluation Set Total Total Total Total Total
FlowTrack [Xiao et al.(2018)Xiao, Wu, and Wei] val 65.4 85.4 85.5 80.3 76.7
TFF val 58.3 45.9 86.2 72.7 67.0
TFF + pruning val 59.2 45.9 87.0 71.8 66.7
PoseFlow [Xiu et al.(2018)Xiu, Li, Wang, Fang, and Lu] val 58.3 67.8 87.0 70.3 66.5
ProTracker [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] val 55.2 61.5 88.1 66.5 60.6
FlowTrack [Xiao et al.(2018)Xiao, Wu, and Wei] test 57.8 57.3 79.4 80.3 74.6
TFF + pruning test 53.0 23.2 82.1 70.6 63.3
ProTracker [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] test 51.8 - - - 59.6
PoseFlow [Xiu et al.(2018)Xiu, Li, Wang, Fang, and Lu] test 51.0 16.9 78.9 71.2 63.0
MVIG test 50.8 - - - 63.2
SOPT-PT test 42.0 - - - 58.2
ML_Lab test 41.8 - - - 70.3
ICG test 32.0 - - - 51.2
IC_IBUG test -190.1 - - - 47.6
Table 4: Comparison to state-of-the-art. Approaches listed without a reference have not been published yet.

7 Conclusions

In this work, we proposed a convolutional neural network architecture for the task of online multi-person pose tracking. Our approach consists of two sub-networks: a spatial network for multi-person pose estimation and a temporal network which predicts Temporal Flow Fields. The TFF are used by a greedy temporal bipartite graph matching algorithm which associates the estimated poses of two consecutive frames $t-1$ and $t$. The results showed that strong structural knowledge in the form of image features and belief maps of both frames is crucial for a good performance of our temporal model. By relying on such feature input, our approach achieves state-of-the-art pose tracking results, even with a small network architecture. For this reason, in future work we will investigate stronger network architectures in order to produce stronger Temporal Flow Fields which are able to cope with additional challenges like occlusions and long-term dependencies.

References

  • [lea(2018)] PoseTrack Challenge Leaderboard. https://posetrack.net/leaderboard.php, 2018. [Online; accessed 30-April-2018].
  • [Andriluka et al.(2018)Andriluka, Iqbal, Ensafutdinov, Pishchulin, Milan, Gall, and B.] M. Andriluka, U. Iqbal, E. Ensafutdinov, L. Pishchulin, A. Milan, J. Gall, and Schiele B. PoseTrack: A benchmark for human pose estimation and tracking. In CVPR, 2018.
  • [Andriluka et al.(2014)Andriluka, Pishchulin, Gehler, and Schiele] Mykhaylo Andriluka, Leonid Pishchulin, Peter Gehler, and Bernt Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
  • [Belagiannis et al.(2016)Belagiannis, Amin, Andriluka, Schiele, Navab, and Ilic] Vasileios Belagiannis, Sikandar Amin, Mykhaylo Andriluka, Bernt Schiele, Nassir Navab, and Slobodan Ilic. 3d pictorial structures revisited: Multiple human pose estimation. TPAMI, 38(10):1929–1942, October 2016.
  • [Cao et al.(2017)Cao, Simon, Wei, and Sheikh] Zhe Cao, Tomas Simon, Shih-En Wei, and Yaser Sheikh. Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, 2017.
  • [Chen and Yuille(2015)] Xianjie Chen and Alan L. Yuille. Parsing occluded people by flexible compositions. In CVPR, 2015.
  • [Chen et al.(2018)Chen, Wang, Peng, Zhang, Yu, and Sun] Yilun Chen, Zhicheng Wang, Yuxiang Peng, Zhiqiang Zhang, Gang Yu, and Jian Sun. Cascaded pyramid network for multi-person pose estimation. CVPR, 2018.
  • [Eichner and Ferrari(2010)] Marcin Eichner and Vittorio Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV, 2010.
  • [Fang et al.(2017)Fang, Xie, Tai, and Lu] Hao-Shu Fang, Shuqin Xie, Yu-Wing Tai, and Cewu Lu. RMPE: Regional multi-person pose estimation. In ICCV, 2017.
  • [Girdhar et al.(2018)Girdhar, Gkioxari, Torresani, Paluri, and Tran] Rohit Girdhar, Georgia Gkioxari, Lorenzo Torresani, Manohar Paluri, and Du Tran. Detect-and-track: Efficient pose estimation in videos. CVPR, 2018.
  • [He et al.(2016)He, Zhang, Ren, and Sun] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
  • [He et al.(2017)He, Gkioxari, Dollár, and Girshick] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross B. Girshick. Mask R-CNN. ICCV, 2017.
  • [Ilg et al.(2017)Ilg, Mayer, Saikia, Keuper, Dosovitskiy, and Brox] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical flow estimation with deep networks. In CVPR, 2017.
  • [Insafutdinov et al.(2016)Insafutdinov, Pishchulin, Andres, Andriluka, and Schiele] Eldar Insafutdinov, Leonid Pishchulin, Bjoern Andres, Mykhaylo Andriluka, and Bernt Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
  • [Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele] Eldar Insafutdinov, Mykhaylo Andriluka, Leonid Pishchulin, Siyu Tang, Evgeny Levinkov, Bjoern Andres, and Bernt Schiele. ArtTrack: Articulated Multi-person Tracking in the Wild. In CVPR, 2017.
  • [Iqbal and Gall(2016)] Umar Iqbal and Juergen Gall. Multi-person pose estimation with local joint-to-person associations. In ECCV, 2016.
  • [Iqbal et al.(2017)Iqbal, Milan, and Gall] Umar Iqbal, Anton Milan, and Juergen Gall. Posetrack: Joint multi-person pose estimation and tracking. In CVPR, 2017.
  • [Ladicky et al.(2013)Ladicky, Torr, and Zisserman] Lubor Ladicky, Philip H. S. Torr, and Andrew Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
  • [Lin et al.(2014)Lin, Maire, Belongie, Hays, Perona, Ramanan, Dollár, and Zitnick] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft coco: Common objects in context. In ECCV, 2014.
  • [Linna et al.(2016)Linna, Kannala, and Rahtu] Marko Linna, Juho Kannala, and Esa Rahtu. Real-time human pose estimation from video with convolutional neural networks. ArXiv-Preprint, 2016.
  • [Milan et al.(2016)Milan, Leal-Taixé, Reid, Roth, and Schindler] Anton Milan, Laura Leal-Taixé, Ian D. Reid, Stefan Roth, and Konrad Schindler. MOT16: A benchmark for multi-object tracking. ArXiv-Preprint, 2016.
  • [Nie et al.(2017)Nie, Feng, Xing, and Yan] Xuecheng Nie, Jiashi Feng, Junliang Xing, and Shuicheng Yan. Generative partition networks for multi-person pose estimation. ArXiv-Preprint, 2017.
  • [Ning and He(2017)] Guanghan Ning and Zhihai He. Dual path networks for multi-person human pose estimation. ArXiv-Preprint, 2017.
  • [Papandreou et al.(2017)Papandreou, Zhu, Kanazawa, Toshev, Tompson, Bregler, and Murphy] George Papandreou, Tyler Zhu, Nori Kanazawa, Alexander Toshev, Jonathan Tompson, Chris Bregler, and Kevin Murphy. Towards accurate multi-person pose estimation in the wild. In CVPR, 2017.
  • [Pishchulin et al.(2012)Pishchulin, Jain, Andriluka, Thormaehlen, and Schiele] Leonid Pishchulin, Arjun Jain, Mykhaylo Andriluka, Thorsten Thormaehlen, and Bernt Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
  • [Pishchulin et al.(2016)Pishchulin, Insafutdinov, Tang, Andres, Andriluka, Gehler, and Schiele] Leonid Pishchulin, Eldar Insafutdinov, Siyu Tang, Björn Andres, Mykhaylo Andriluka, Peter Gehler, and Bernt Schiele. Deepcut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
  • [Rogez et al.(2017)Rogez, Weinzaepfel, and Schmid] Gregory Rogez, Philippe Weinzaepfel, and Cordelia Schmid. LCR-Net: Localization-Classification-Regression for Human Pose. In CVPR, 2017.
  • [Sun et al.(2017)Sun, Xiao, Liang, and Wei] Xiao Sun, Bin Xiao, Shuang Liang, and Yichen Wei. Integral human pose regression. ArXiv-Preprint, 2017.
  • [Varadarajan et al.(2017)Varadarajan, Datta, and Tickoo] Srenivas Varadarajan, Parual Datta, and Omesh Tickoo. A greedy part assignment algorithm for real-time multi-person 2d pose estimation. ArXiv-Preprint, 2017.
  • [Wang et al.(2017a)Wang, Körding, and Yarkony] Shaofei Wang, Konrad Paul Körding, and Julian Yarkony. Efficient multi-person pose estimation with provable guarantees. ArXiv-Preprint, 2017a.
  • [Wang et al.(2017b)Wang, Zhang, Ballester, Ihler, and Yarkony] Shaofei Wang, Chong Zhang, Miguel Ángel González Ballester, Alexander T. Ihler, and Julian Yarkony. Multi-person pose estimation via column generation. ArXiv-Preprint, 2017b.
  • [Xia et al.(2017)Xia, Wang, Chen, and Yuille] Fangting Xia, Peng Wang, Xianjie Chen, and Alan L. Yuille. Joint multi-person pose estimation and semantic part segmentation. In CVPR, 2017.
  • [Xiao et al.(2018)Xiao, Wu, and Wei] B. Xiao, H. Wu, and Y. Wei. Simple Baselines for Human Pose Estimation and Tracking. ArXiv-Preprint, 2018.
  • [Xiu et al.(2018)Xiu, Li, Wang, Fang, and Lu] Yuliang Xiu, Jiefeng Li, Haoyu Wang, Yinghong Fang, and Cewu Lu. Pose Flow: Efficient online pose tracking. ArXiv-Preprint, 2018.
  • [Zhu et al.(2017)Zhu, Jiang, and Luo] Xiangyu Zhu, Yingying Jiang, and Zhenbo Luo. Multi-person pose estimation for posetrack with enhanced part affinity fields. Technical report, Samsung Research Beijing, 2017.

Appendix A Appendix

A.1 Qualitative Results

Figure 5: Qualitative results for sequences of the PoseTrack validation set [Andriluka et al.(2018)Andriluka, Iqbal, Ensafutdinov, Pishchulin, Milan, Gall, and B.].

A.2 Baseline Improvement

Figure 6: Different edge configurations used for the training of different spatial models.

We evaluate the robustness of different edge configurations as shown in Figure 6. This is motivated by the fact that edge configuration a) is prone to errors: if a single edge is not estimated correctly, the entire pose breaks. Similar to [Zhu et al.(2017)Zhu, Jiang, and Luo], we introduce skip connections to our model (Figure 6 b), Bypass model). Figure 6 c) illustrates a different idea to connect joints which we refer to as Range of Motion (ROM) model, since pairs of joints are connected if both lie within the same ROM of a third joint. Furthermore, we train an edge configuration as proposed in [Insafutdinov et al.(2017)Insafutdinov, Andriluka, Pishchulin, Tang, Levinkov, Andres, and Schiele], which we refer to as Extended model. For completeness, we introduce a nearly-fully-connected (NFC) model (Figure 6 e)) which connects most nearby joints. We rely on the metric proposed in [Pishchulin et al.(2016)Pishchulin, Insafutdinov, Tang, Andres, Andriluka, Gehler, and Schiele] for the estimation of the mean average precision (mAP) of all our pose estimation models. Table 5 shows the results.

Model VGG Layers Trained on Head Shou Elb Wri Hip Knee Ankl Total mAP
Standard 12 MSCOCO + PoseTrack 82.9 80.3 69.9 59.0 67.8 59.2 51.4 68.3
Bypass 12 MSCOCO + PoseTrack 83.0 79.2 67.6 59.0 66.2 61.2 53.6 68.2
ROM 12 MSCOCO + PoseTrack 82.0 76.2 70.3 57.9 69.3 61.7 54.1 68.3
Extended 12 MSCOCO + PoseTrack 82.1 77.9 70.5 57.0 72.1 63.6 53.9 69.1
NFC 12 MSCOCO + PoseTrack 78.3 75.8 68.3 56.9 69.2 62.1 53.5 67.1
Table 5: The evaluation of different edge configurations reveals that the Extended edge configuration performs best compared to the Standard edge configuration.