# PoseTrack: Joint Multi-Person Pose Estimation and Tracking

## Abstract

In this work, we introduce the challenging problem of joint multi-person pose estimation and tracking of an unknown number of persons in unconstrained videos. Existing methods for multi-person pose estimation in images cannot be applied directly to this problem, since it requires solving the problem of person association over time in addition to the pose estimation for each person. We therefore propose a novel method that jointly models multi-person pose estimation and tracking in a single formulation. To this end, we represent body joint detections in a video by a spatio-temporal graph and solve an integer linear program to partition the graph into sub-graphs that correspond to plausible body pose trajectories for each person. The proposed approach implicitly handles occlusion and truncation of persons. Since the problem has not been addressed quantitatively in the literature, we introduce a challenging “Multi-Person PoseTrack” dataset, and also propose a completely unconstrained evaluation protocol that does not make any assumptions about the scale, size, location or the number of persons. Finally, we evaluate the proposed approach and several baseline methods on our new dataset.

## 1 Introduction

Human pose estimation has long been motivated by its applications in understanding human interactions, activity recognition, video surveillance and sports video analytics. The field of human pose estimation in images has progressed remarkably over the past few years. The methods have advanced from pose estimation of single pre-localized persons [30, 6, 40, 14, 16, 27, 5, 32] to the more challenging and realistic case of multiple, potentially overlapping and truncated persons [12, 8, 30, 16, 17]. Many of the applications mentioned above, however, aim to analyze human body motion over time. While there exists a notable number of works that track the pose of a single person in a video [28, 9, 44, 33, 46, 20, 29, 7, 13, 18], multi-person human pose estimation in unconstrained videos has not been addressed in the literature.

In this work, we address the problem of tracking the poses of multiple persons in an unconstrained setting. This means that we have to deal with large pose and scale variations, fast motions, and a varying number of persons and visible body parts due to occlusion or truncation. In contrast to previous works, we aim to solve the association of each person across the video and the pose estimation together. To this end, we build upon the recent methods for multi-person pose estimation in images [30, 16, 17] that build a spatial graph based on joint proposals to estimate the pose of multiple persons in an image. In particular, we cast the problem as an optimization of a densely connected spatio-temporal graph connecting body joint candidates spatially as well as temporally. The optimization problem is formulated as a constrained Integer Linear Program (ILP) whose feasible solution partitions the graph into valid body pose trajectories for an unknown number of persons. In this way, we can handle occlusion, truncation, and temporal association within a single formulation.

Since there exists no dataset that provides annotations to quantitatively evaluate joint multi-person pose estimation and tracking, we also propose a new challenging *Multi-Person PoseTrack* dataset as a second contribution of the paper. The dataset provides detailed and dense annotations for multiple persons in each video, as shown in Fig. 1, and introduces new challenges to the field of pose estimation in videos. In order to evaluate the pose estimation and tracking accuracy, we introduce a new protocol that also deals with occluded body joints. We quantify the proposed method in detail on the proposed dataset, and also report results for several baseline methods.
The source code, pre-trained models and the dataset are publicly available.^{1}

## 2 Related Work

Single person pose estimation in images has seen remarkable progress over the past few years [39, 30, 6, 40, 14, 16, 27, 5, 32]. However, all these approaches assume that only a single person is visible in the image, and cannot handle realistic cases where several people appear in the scene and interact with each other. In contrast to single person pose estimation, multi-person pose estimation introduces significantly more challenges, since the number of persons in an image is not known a priori. Moreover, it is natural that persons occlude each other during interactions, and may also become partially truncated to various degrees. Multi-person pose estimation has therefore gained much attention recently [11, 37, 31, 43, 23, 12, 8, 3, 30, 16, 17]. Earlier methods in this direction follow a two-stage approach [31, 12, 8], first detecting the persons in an image and then applying a human pose estimation technique to each person individually. Such approaches are, however, applicable only if people appear well separated and do not occlude each other. Moreover, most single person pose estimation methods always output a fixed number of body joints and do not account for occlusion and truncation, which are common in multi-person scenarios. Other approaches address the problem using tree structured graphical models [43, 37, 11, 23]. However, such models struggle to cope with large pose variations, and are shown to be significantly outperformed by more recent methods based on Convolutional Neural Networks [30, 16]. For example, [30] jointly estimate the pose of all persons visible in an image, while also handling occlusion and truncation. The approach has been further improved by stronger part detectors and efficient approximations [16]. The approach in [17] also proposes a simplification of [30] by tackling the problem locally for each person. However, it still relies on a separate person detector.

Single person pose estimation in videos has also been studied extensively in the literature [28, 9, 46, 33, 46, 20, 44, 29, 13, 18]. These approaches mainly aim to improve pose estimation by utilizing temporal smoothing constraints [28, 9, 44, 33, 13] and/or optical flow information [46, 20, 29], but they are not directly applicable to videos with multiple potentially occluding persons.

In this work we focus on the challenging problem of joint multi-person pose estimation and data association across frames. While the problem has not been studied quantitatively in the literature^{2}

Previous datasets used to benchmark pose estimation algorithms in-the-wild are summarized in Tab. 1. While there exists a number of datasets to evaluate single person pose estimation methods in videos, e.g., J-HMDB [21] and Penn-Action [45], none of the video datasets provides annotations to benchmark multi-person pose estimation and tracking at the same time. To allow for a quantitative evaluation of this problem, we therefore also introduce a new “Multi-Person PoseTrack” dataset which provides pose annotations for multiple persons in each video to measure pose estimation accuracy, and also provides a unique ID for each of the annotated persons to benchmark multi-person pose tracking. The proposed dataset introduces new challenges to the field of human pose estimation and tracking since it contains a large amount of appearance and pose variations, body part occlusion and truncation, large scale variations, fast camera and person movements, motion blur, and a sufficiently large number of persons per video.

## 3 Multi-Person Pose Tracking

Our method jointly solves the problem of multi-person pose estimation and tracking for all persons appearing in a video. We first generate a set of joint detection candidates in each video as illustrated in Fig. 2. From the detections, we build a graph consisting of spatial edges connecting the detections within a frame and temporal edges connecting detections of the same joint type over frames. We solve the problem using integer linear programming (ILP), whose feasible solution provides the pose estimate for each person in all video frames and also performs person association across frames. We first introduce the proposed method and then discuss the dataset proposed for evaluation in Sec. 4.

### 3.1 Spatio-Temporal Graph

Given a video sequence $F$ containing an arbitrary number of persons, we generate a set of body joint detection candidates $D = \{D_f\}_{f \in F}$, where $D_f$ is the set for frame $f$. Every detection $d \in D$ at location $x_d^f \in \mathbb{R}^2$ in frame $f$ belongs to a joint type $j \in \{1, \dots, J\}$. Additional details regarding the detector are provided in Sec. 3.4.

For multi-person pose tracking, we aim to identify the joint hypotheses that belong to an individual person in the entire video. This can be formulated by a graph structure $G = (D, E)$, where $D$ is the set of nodes. The set of edges $E$ consists of two types of edges, namely spatial edges $E_s$ and temporal edges $E_t$. The spatial edges correspond to the union of edges of a fully connected graph for each frame, i.e.,

$$E_s = \bigcup_{f \in F} E_s^f \quad \text{with} \quad E_s^f = \{(d, d') : d \neq d' \wedge d, d' \in D_f\}. \tag{1}$$

Note that these edges connect joint candidates independently of the associated joint type $j$. The temporal edges connect only joint hypotheses of the same joint type over two different frames, i.e.,

$$E_t = \{(d, d') : j = j' \wedge d \in D_f \wedge d' \in D_{f'} \wedge 1 \le |f - f'| \le \tau \wedge f, f' \in F\}, \tag{2}$$

where $j$ and $j'$ denote the joint types of $d$ and $d'$. The temporal connections are not only modeled for neighboring frames, i.e., $|f - f'| = 1$, but we also take temporal relations up to $\tau$ frames into account to handle short-term occlusion and missing detections. The graph structure is illustrated in Fig. 2.
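The edge construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `build_edges`, the `(frame, joint_type)` tuple layout, and the toy detections are all assumptions made for the example.

```python
from itertools import combinations


def build_edges(detections, tau=1):
    """Build spatial and temporal edge sets for a list of detections.

    detections: list of (frame, joint_type) tuples; a node id is the list index.
    tau: maximum frame distance for temporal edges (hypothetical parameter name).
    Returns (spatial_edges, temporal_edges) as sets of (i, j) pairs with i < j.
    """
    spatial, temporal = set(), set()
    for i, j in combinations(range(len(detections)), 2):
        (frame_i, type_i), (frame_j, type_j) = detections[i], detections[j]
        if frame_i == frame_j:
            # Spatial edge: same frame, regardless of joint type.
            spatial.add((i, j))
        elif type_i == type_j and abs(frame_i - frame_j) <= tau:
            # Temporal edge: same joint type, frames at most tau apart.
            temporal.add((i, j))
    return spatial, temporal


# Toy input: two joints in frames 0 and 1, one joint in frame 2.
dets = [(0, "head"), (0, "neck"), (1, "head"), (1, "neck"), (2, "head")]
spatial, temporal = build_edges(dets, tau=1)
print(sorted(spatial))
print(sorted(temporal))
```

Raising `tau` to 2 in this toy example adds one more temporal edge, between the head detections in frames 0 and 2, which is how longer-range connections can bridge a missing detection in the frame in between.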

### 3.2 Graph Partitioning

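By removing edges and nodes from the graph $G$, we obtain several partitions of the spatio-temporal graph, and each partition corresponds to a tracked pose of an individual person. In order to solve the graph partitioning problem, we introduce the three binary vectors $v \in \{0,1\}^{|D|}$, $s \in \{0,1\}^{|E_s|}$, and $t \in \{0,1\}^{|E_t|}$. Each binary variable indicates whether a node or edge is removed, i.e., $v_d = 0$ implies that the joint detection $d$ is removed. Similarly, $s_{(d_f, d'_f)} = 0$ with $(d_f, d'_f) \in E_s$ implies that the spatial edge between the joint hypotheses $d$ and $d'$ in frame $f$ is removed, while $t_{(d_f, d'_{f'})} = 0$ with $(d_f, d'_{f'}) \in E_t$ implies that the temporal edge between the joint hypothesis $d$ in frame $f$ and $d'$ in frame $f'$ is removed.

A partitioning is obtained by minimizing the cost function

$$\underset{v, s, t}{\arg\min} \; \big( \langle v, \phi \rangle + \langle s, \psi_s \rangle + \langle t, \psi_t \rangle \big) \tag{3}$$

$$\langle v, \phi \rangle = \sum_{d \in D} v_d \, \phi(d) \tag{4}$$

$$\langle s, \psi_s \rangle = \sum_{(d_f, d'_f) \in E_s} s_{(d_f, d'_f)} \, \psi_s(d_f, d'_f) \tag{5}$$

$$\langle t, \psi_t \rangle = \sum_{(d_f, d'_{f'}) \in E_t} t_{(d_f, d'_{f'})} \, \psi_t(d_f, d'_{f'}) \tag{6}$$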

This means that we search for a graph partitioning such that the cost of the remaining nodes and edges is minimal. The cost for a node is defined by the unary term:

$$\phi(d) = \log \frac{1 - p_d}{p_d} \tag{7}$$

where $p_d \in (0, 1)$ corresponds to the probability of the joint hypothesis $d$. Note that $\phi(d)$ is negative when $p_d > 0.5$, so detections with a high confidence are preferred since they reduce the cost function (3). The cost for a spatial or temporal edge is defined similarly by

$$\psi_s(d_f, d'_f) = \log \frac{1 - p^s_{(d_f, d'_f)}}{p^s_{(d_f, d'_f)}} \tag{8}$$

$$\psi_t(d_f, d'_{f'}) = \log \frac{1 - p^t_{(d_f, d'_{f'})}}{p^t_{(d_f, d'_{f'})}} \tag{9}$$

While $p^s$ denotes the probability that two joint detections $d$ and $d'$ in a frame $f$ belong to the same person, $p^t$ denotes the probability that two detections of a joint in frames $f$ and $f'$ are the same. In Sec. 3.4 we will discuss how the probabilities $p_d$, $p^s$, and $p^t$ are learned.
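To make the sign behaviour of the costs concrete, the following sketch evaluates a log-odds cost for a few probabilities. It assumes the form $\log((1-p)/p)$ that is commonly used in the graph partitioning formulations of [30, 16]; the function name `log_odds_cost` and the clamping constant `eps` are illustrative choices, not taken from the paper.

```python
import math


def log_odds_cost(p, eps=1e-6):
    """Cost of keeping a node or edge whose probability is p.

    Assumed form: log((1 - p) / p). The value is negative for p > 0.5,
    zero at p = 0.5, and positive for p < 0.5.
    """
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0) and division by zero
    return math.log((1.0 - p) / p)


for p in (0.1, 0.5, 0.9):
    print(f"p = {p:.1f} -> cost = {log_odds_cost(p):+.3f}")
```

Because the objective is minimized, a confident detection or association (negative cost) is worth keeping, whereas keeping an unlikely one (positive cost) is penalized.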

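In order to ensure that the feasible solutions of the objective (3) result in well defined body poses and valid pose tracks, we have to add additional constraints. The first set of constraints ensures that two joint hypotheses are associated to the same person ($s_{(d_f, d'_f)} = 1$) only if both detections are considered as valid, i.e., $v_{d_f} = 1$ and $v_{d'_f} = 1$:

$$s_{(d_f, d'_f)} \le v_{d_f} \;\wedge\; s_{(d_f, d'_f)} \le v_{d'_f} \quad \forall (d_f, d'_f) \in E_s \tag{10}$$

The same holds for the temporal edges:

$$t_{(d_f, d'_{f'})} \le v_{d_f} \;\wedge\; t_{(d_f, d'_{f'})} \le v_{d'_{f'}} \quad \forall (d_f, d'_{f'}) \in E_t \tag{11}$$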

The second set of constraints are transitivity constraints in the spatial domain. Such transitivity constraints have been proposed for multi-person pose estimation in images [30, 16, 17]. They enforce for any triplet of joint detection candidates $(d_f, d'_f, d''_f)$ that if $d_f$ and $d'_f$ are associated to one person and $d'_f$ and $d''_f$ are also associated to one person, i.e., $s_{(d_f, d'_f)} = 1$ and $s_{(d'_f, d''_f)} = 1$, then the edge $(d_f, d''_f)$ should also be added:

$$s_{(d_f, d'_f)} + s_{(d'_f, d''_f)} - 1 \le s_{(d_f, d''_f)} \quad \forall (d_f, d'_f), (d'_f, d''_f) \in E_s \tag{12}$$

An example of a transitivity constraint is illustrated in Fig. 7. The transitivity constraints can be used to enforce that a human can have only one joint of each type $j$, e.g., only one head. Let $d_f$ and $d''_f$ have the same joint type $j$ while $d'_f$ belongs to another joint type $j'$. Without transitivity constraints, connecting $d_f$ and $d''_f$ with $d'_f$ might result in a low cost. The transitivity constraints, however, enforce that the binary cost $\psi_s(d_f, d''_f)$ is added. To prevent poses with multiple joints of the same type, we thus only have to ensure that the binary cost $\psi_s(d_f, d''_f)$ is very high if the two detections have the same joint type. We discuss this in more detail in Sec. 3.4.
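The effect of the validity and spatial transitivity constraints can be checked on a toy single-frame problem. The sketch below enumerates all binary labelings by brute force, which is only feasible for a handful of nodes and is not the branch-and-cut optimization used in the paper. The helper names (`is_feasible`, `solve_brute_force`), the node names, and all cost values are hypothetical.

```python
from itertools import combinations, product


def is_feasible(v, s):
    """Check validity (edge kept only if both endpoints are kept) and
    spatial transitivity for one frame.

    v: dict node -> 0/1; s: dict frozenset({a, b}) -> 0/1.
    """
    nodes = list(v)
    for a, b in combinations(nodes, 2):
        if s[frozenset((a, b))] > min(v[a], v[b]):
            return False
    for a, b, c in combinations(nodes, 3):
        # Try each node of the triplet as the shared (middle) node.
        for x, y, z in ((a, b, c), (b, a, c), (a, c, b)):
            if s[frozenset((x, y))] + s[frozenset((y, z))] - 1 > s[frozenset((x, z))]:
                return False
    return True


def solve_brute_force(node_cost, edge_cost):
    """Return (cost, v, s) of the cheapest feasible labeling."""
    nodes = list(node_cost)
    edges = [frozenset(e) for e in combinations(nodes, 2)]
    best = None
    for v_bits in product((0, 1), repeat=len(nodes)):
        v = dict(zip(nodes, v_bits))
        for s_bits in product((0, 1), repeat=len(edges)):
            s = dict(zip(edges, s_bits))
            if not is_feasible(v, s):
                continue
            cost = sum(node_cost[n] * v[n] for n in nodes)
            cost += sum(edge_cost[e] * s[e] for e in edges)
            if best is None or cost < best[0]:
                best = (cost, v, s)
    return best


# Two head detections and one neck; all three detections are confident.
node_cost = {"head_a": -2.0, "neck_a": -1.5, "head_b": -1.0}
edge_cost = {
    frozenset(("head_a", "neck_a")): -3.0,  # very likely the same person
    frozenset(("head_a", "head_b")): 5.0,   # same joint type: high cost
    frozenset(("neck_a", "head_b")): -0.5,  # weakly attractive on its own
}
cost, v, s = solve_brute_force(node_cost, edge_cost)
print(cost, {tuple(sorted(e)): bit for e, bit in s.items()})
```

In this toy setting the weakly attractive neck-to-second-head edge is dropped: keeping it together with the first head-to-neck edge would force the expensive head-to-head edge through transitivity, so the second head ends up in its own partition.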

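In contrast to previous work, we also have to ensure spatio-temporal consistency. Similar to the spatial transitivity constraints (12), we can define temporal transitivity constraints:

$$t_{(d_f, d'_{f'})} + t_{(d'_{f'}, d''_{f''})} - 1 \le t_{(d_f, d''_{f''})} \quad \forall (d_f, d'_{f'}), (d'_{f'}, d''_{f''}) \in E_t \tag{13}$$

The last set of constraints are spatio-temporal constraints that ensure that the pose is consistent over time. We define two types of spatio-temporal constraints. The first type consists of a triplet of joint detection candidates $(d_f, d'_{f'}, d''_{f'})$ from two different frames $f$ and $f'$. The constraints are defined as

$$\begin{aligned} t_{(d_f, d'_{f'})} + t_{(d_f, d''_{f'})} - 1 &\le s_{(d'_{f'}, d''_{f'})} \\ t_{(d_f, d'_{f'})} + s_{(d'_{f'}, d''_{f'})} - 1 &\le t_{(d_f, d''_{f'})} \end{aligned} \quad \forall (d_f, d'_{f'}), (d_f, d''_{f'}) \in E_t \tag{14}$$

and enforce transitivity for two temporal edges and one spatial edge. The second type of spatio-temporal constraints is based on quadruples of joint detection candidates $(d_f, d'_{f'}, d''_f, d'''_{f'})$ from two different frames $f$ and $f'$. The spatio-temporal constraints ensure that if $(d_f, d'_{f'})$ and $(d''_f, d'''_{f'})$ are temporally connected and $(d_f, d''_f)$ are spatially connected, then the spatial edge $(d'_{f'}, d'''_{f'})$ has to be added:

$$\begin{aligned} t_{(d_f, d'_{f'})} + t_{(d''_f, d'''_{f'})} + s_{(d_f, d''_f)} - 2 &\le s_{(d'_{f'}, d'''_{f'})} \\ t_{(d_f, d'_{f'})} + t_{(d''_f, d'''_{f'})} + s_{(d'_{f'}, d'''_{f'})} - 2 &\le s_{(d_f, d''_f)} \end{aligned} \quad \forall (d_f, d'_{f'}), (d''_f, d'''_{f'}) \in E_t \tag{15}$$

An example of both types of spatio-temporal constraints can be seen in Fig. 7 and Fig. 7, respectively.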

### 3.3 Optimization

We optimize the objective (3) with the branch-and-cut algorithm of the ILP solver Gurobi. To reduce the runtime for long sequences, we process the video batch-wise, where each batch consists of $k$ frames. For the first $k$ frames, we build the spatio-temporal graph as discussed and optimize the objective (3). We then continue to build a graph for the next $k$ frames and add the previously selected nodes and edges to the graph, but fix them such that they cannot be removed anymore. Since the graph partitioning also produces small partitions, which usually correspond to clusters of false positive joint detections, we remove any partition that is shorter than 7 frames or has fewer than 6 nodes per frame on average.
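The final pruning rule can be written as a small filter. The sketch below assumes that a partition is stored as a mapping from frame index to the list of detections it contains in that frame, and that the length of a partition is the number of frames in which it has at least one node; both are assumptions for illustration, as is the function name `keep_partition`.

```python
def keep_partition(partition, min_frames=7, min_nodes_per_frame=6):
    """Return True if a partition passes the size filter.

    partition: dict mapping frame index -> list of detections in that frame.
    A partition is dropped if it covers fewer than min_frames frames or has
    fewer than min_nodes_per_frame nodes per frame on average.
    """
    frames = [f for f, dets in partition.items() if dets]
    if len(frames) < min_frames:
        return False
    average_nodes = sum(len(partition[f]) for f in frames) / len(frames)
    return average_nodes >= min_nodes_per_frame


# Toy partitions: detections are represented by placeholder integers.
long_and_dense = {f: list(range(8)) for f in range(10)}
too_short = {f: list(range(14)) for f in range(6)}
too_sparse = {f: list(range(5)) for f in range(10)}
print([keep_partition(p) for p in (long_and_dense, too_short, too_sparse)])
```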

### 3.4 Potentials

In order to compute the unaries (7) and binaries (8), (9), we have to learn the probabilities $p_d$, $p^s$, and $p^t$.

The probability $p_d$ is given by the confidence of the joint detector. As joint detector, we use the publicly available pre-trained CNN [16] trained on the MPII Multi-Person Pose dataset [30]. In contrast to [16], we do not assume that any scale information is given. We therefore apply the detector to an image pyramid with 4 scales $\gamma$. For each detection $d$ located at $x_d^f$, we compute a square bounding box $B_d$ with side length $h_d$, which is used for both the width and the height. To reduce the number of detections, we remove all bounding boxes that have an intersection-over-union (IoU) ratio over 0.7 with another bounding box that has a higher detection confidence.
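The suppression step can be sketched as follows. The code takes the rule literally: a box is removed if any higher-scoring box overlaps it with an IoU above the threshold. The box layout `(x_min, y_min, x_max, y_max)`, the helper names, the assumption that boxes are centered on the detection, and the toy coordinates are illustrative choices, not details from the paper.

```python
def square_box(x, y, side):
    """Square box of the given side length centered at (x, y) (assumed layout)."""
    half = side / 2.0
    return (x - half, y - half, x + half, y + half)


def iou(a, b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def suppress(detections, threshold=0.7):
    """Drop every box whose IoU with a higher-scoring box exceeds threshold.

    detections: list of (score, box) pairs.
    """
    ordered = sorted(detections, key=lambda d: d[0], reverse=True)
    kept = []
    for idx, (score, box) in enumerate(ordered):
        if all(iou(box, other) <= threshold for _, other in ordered[:idx]):
            kept.append((score, box))
    return kept


dets = [
    (0.9, square_box(10, 10, 10)),
    (0.8, square_box(11, 10, 10)),  # overlaps the first box heavily
    (0.5, square_box(30, 30, 10)),  # far away from both
]
print([score for score, _ in suppress(dets)])
```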

The spatial probability $p^s_{(d_f, d'_f)}$ depends on the joint types $j$ and $j'$ of the detections. If $j = j'$, we define $p^s_{(d_f, d'_f)} = \mathrm{IoU}(B_d, B_{d'})$. This means that a joint type $j$ cannot be added multiple times to a person except if the detections are very close. If a partition includes detections of the same type in a single frame, the detections are merged by computing the weighted mean of the detections, where the weights are proportional to $p_d$. If $j \neq j'$, we use the pre-trained binaries [16] after a scale normalization.
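The merging of same-type detections within a partition can be illustrated with a weighted mean. The sketch assumes that the weights are the detection confidences and that a detection is stored as an `(x, y, confidence)` tuple; the function name `merge_detections` is hypothetical.

```python
def merge_detections(detections):
    """Merge detections of one joint type into a single location.

    detections: list of (x, y, confidence) tuples with positive confidences.
    Returns the confidence-weighted mean location (x, y).
    """
    total = sum(conf for _, _, conf in detections)
    if total <= 0:
        raise ValueError("at least one detection with positive confidence is required")
    x = sum(px * conf for px, _, conf in detections) / total
    y = sum(py * conf for _, py, conf in detections) / total
    return x, y


# The merged location is pulled towards the more confident detection.
print(merge_detections([(0.0, 0.0, 0.25), (4.0, 8.0, 0.75)]))
```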

The temporal probability $p^t_{(d_f, d'_{f'})}$ should be high if two detections of the same joint type at different frames belong to the same person. To that end, we build on the idea recently used in multi-person tracking [38] and compute dense correspondences between two frames using DeepMatching [41]. Let $K_d$ and $K_{d'}$ be the sets of matched key-points inside the bounding boxes $B_d$ and $B_{d'}$, and consider the union and intersection of these two sets. We then form a feature vector from these quantities and append non-linear terms to it as done in [38]. The mapping from the feature vector to the probability $p^t_{(d_f, d'_{f'})}$ is obtained by logistic regression.

Table 1: Datasets used to benchmark pose estimation algorithms in-the-wild.

| Dataset | Video-labeled poses | Multi-person | Large scale variation | Variable skeleton size | # of persons |
|---|---|---|---|---|---|
| Leeds Sports [22] | | | | | 2,000 |
| MPII Pose [1] | | | ✓ | ✓ | 40,522 |
| We Are Family [11] | | ✓ | | | 3,131 |
| MPII Multi-Person Pose [30] | | ✓ | ✓ | ✓ | 14,161 |
| MS-COCO Keypoints [25] | | ✓ | ✓ | ✓ | 105,698 |
| J-HMDB [21] | ✓ | | ✓ | ✓ | 32,173 |
| Penn-Action [45] | ✓ | | ✓ | | 159,633 |
| VideoPose [35] | ✓ | | | | 1,286 |
| Poses-in-the-wild [9] | ✓ | | | | 831 |
| YouTube Pose [7] | ✓ | | | | 5,000 |
| FYDP [36] | ✓ | | | | 1,680 |
| UYDP [36] | ✓ | | | | 2,000 |
| Multi-Person PoseTrack | ✓ | ✓ | ✓ | ✓ | 16,219 |

## 4 The Multi-Person PoseTrack Dataset

In this section we introduce our new dataset for multi-person pose estimation in videos. The MPII Multi-Person Pose [1] is currently one of the most popular benchmarks for multi-person pose estimation in images, and covers a wide range of activities. For each annotated image, the dataset also provides unlabeled video clips spanning 20 frames both forward and backward in time relative to that image. For our video dataset, we manually select a subset of all available videos that contain multiple persons and cover a wide variety of person-person or person-object interactions. Moreover, the selected videos are chosen to contain a large amount of body pose appearance and scale variation, as well as body part occlusion and truncation. The videos also contain severe body motion, i.e., people occlude each other, re-appear after complete occlusion, vary in scale across the video, and also significantly change their body pose. The number of visible persons and body parts may also vary during the video. The duration of all provided video clips is exactly 41 frames. To include longer and variable-length sequences, we downloaded the original raw video clips using the provided URLs and obtained an additional set of videos. To prevent an overlap with the existing data, we only considered sequences that are at least 150 frames apart from the training samples, and followed the same rationale as above to ensure diversity.

In total, we compiled a set of 60 videos with the number of frames per video ranging between 41 and 151. The number of persons ranges between 2 and 16 with an average of more than 5 persons per video sequence, totaling over 16,000 annotated poses. The person heights are between 100 and 1200 pixels. We split the dataset into a training and testing set with an equal number of videos.

### 4.1 Annotation

As in [1], we annotate 14 body joints and a rectangle enclosing the person’s head. The latter is required to estimate the absolute scale which is used for evaluation. We assign a unique identity to every person appearing in the video. This person ID remains the same throughout the video until the person moves out of the field-of-view. Since we do not target person re-identification in this work, we assign a new ID if a person re-appears in the video. We also provide occlusion flags for all body joints. A joint is marked occluded if it was in the field-of-view but became invisible due to an occlusion. Truncated joints, i.e., those outside the image border limits, are not annotated; therefore, the number of joints per person varies across the dataset. Very small persons were zoomed in to a reasonable size to accurately perform the annotation. To ensure a high quality of the annotation, all annotations were performed by trained in-house workers, following a clearly defined protocol. An example annotation can be seen in Fig. 1.

### 4.2 Experimental setup and evaluation metrics

Since the problem of simultaneous multi-person pose estimation and person tracking has not been quantitatively evaluated in the literature, we define a new evaluation protocol for this problem. To this end, we follow the best practices of both multi-person pose estimation [30] and multi-target tracking [26]. In order to evaluate whether a part is predicted correctly, we use the widely adopted PCKh (head-normalized probability of correct keypoint) metric [1], which considers a body joint to be correctly localized if the predicted location of the joint is within a certain threshold from the true location. Due to the large scale variation of people across videos and even within a frame, this threshold needs to be selected adaptively, based on the person’s size. To that end, [1] propose to use 30% of the head box diagonal. We have found this threshold to be too relaxed because recent pose estimation approaches are capable of predicting the joint locations rather accurately. Therefore, we use a stricter evaluation with a 20% threshold.
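The adaptive threshold can be made concrete with a small check. The sketch assumes that the head rectangle is given as `(x_min, y_min, x_max, y_max)` and that a prediction lying exactly on the threshold counts as correct; the function name `joint_is_correct` and the toy coordinates are hypothetical.

```python
import math


def joint_is_correct(predicted, ground_truth, head_box, fraction=0.2):
    """Head-normalized localization test for a single joint.

    predicted, ground_truth: (x, y) locations in pixels.
    head_box: (x_min, y_min, x_max, y_max) of the annotated head rectangle.
    fraction: share of the head box diagonal used as the distance threshold.
    """
    x_min, y_min, x_max, y_max = head_box
    diagonal = math.hypot(x_max - x_min, y_max - y_min)
    distance = math.hypot(predicted[0] - ground_truth[0],
                          predicted[1] - ground_truth[1])
    return distance <= fraction * diagonal


# A 30 x 40 pixel head box has a diagonal of 50 pixels, so the threshold is
# 10 pixels at 20% and 15 pixels at 30%.
head = (0.0, 0.0, 30.0, 40.0)
print(joint_is_correct((0.0, 0.0), (12.0, 0.0), head, fraction=0.2))
print(joint_is_correct((0.0, 0.0), (12.0, 0.0), head, fraction=0.3))
```

A prediction that is 12 pixels off is accepted under the 30% rule but rejected under the 20% rule, which shows how the stricter threshold separates accurate from merely approximate localizations.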

Given the joint localization threshold for each person, we compute two sets of evaluation metrics, one adopted from the multi-target tracking literature [42, 10, 26] to evaluate multi-person pose tracking, and one which is commonly used for evaluating multi-person pose estimation [30].

Tracking.
To evaluate multi-person pose tracking, we consider each joint trajectory as one individual target,^{3}

Pose.
For measuring frame-wise multi-person pose accuracy, we use *Mean Average Precision* (mAP) as is done in [30].
The protocol to evaluate multi-person pose estimation in [30] assumes that the rough scale and location of a group of persons are known during testing, which is not the case in realistic scenarios, and in particular in videos. We therefore propose to make no such assumption during testing and evaluate the predictions without rescaling or shifting them according to the ground-truth.

Occlusion handling.
Neither of the aforementioned protocols for measuring pose estimation and tracking accuracy considers occlusion during evaluation; both penalize a method if an occluded target that is annotated in the ground-truth is not correctly estimated [26, 30]. This, however, puts methods that detect occlusion and do not predict the occluded joints at a disadvantage compared to approaches that predict the joint position even for occluded joints. We want to provide a fair comparison for both types of occlusion handling. We therefore extend both measures to incorporate occlusion information explicitly. To this end, we first assign each person to one of the ground-truth poses based on the PCKh measure as done in [30]. For each matched person, we consider an occluded joint correctly estimated
either if *a)* it is predicted at the correct location despite being occluded,
or *b)* it is not predicted at all. Otherwise, the prediction is considered as a false positive.
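The rule for occluded joints can be summarized as a small decision function. This is only a sketch of the two cases listed above; the function name, argument names, and the string labels are hypothetical.

```python
def occluded_joint_outcome(is_predicted, is_within_threshold):
    """Outcome for a ground-truth joint that is annotated as occluded.

    is_predicted: whether the method output a location for this joint.
    is_within_threshold: whether that location passes the localization test
        (ignored when nothing was predicted).
    """
    if not is_predicted:
        return "correct"          # case b): the joint is not predicted at all
    if is_within_threshold:
        return "correct"          # case a): predicted at the correct location
    return "false_positive"       # predicted, but at the wrong location


for predicted, close in ((False, False), (True, True), (True, False)):
    print(predicted, close, occluded_joint_outcome(predicted, close))
```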

## 5 Experiments

In this section we evaluate the proposed method for joint multi-person pose estimation and tracking on the newly introduced Multi-Person PoseTrack dataset.

Table 2: Multi-person pose tracking results (MOT CLEAR metrics) on the Multi-Person PoseTrack dataset.

| Method | Rcll | Prcn | MT | ML | IDs | FM | MOTA | MOTP |
|---|---|---|---|---|---|---|---|---|
| **Impact of temporal connection density** | | | | | | | | |
| HT | 57.6 | 66.0 | 632 | 623 | 674 | 5080 | 27.2 | 56.1 |
| HT:N:S | 62.7 | 64.9 | 760 | 510 | 470 | 5557 | 28.2 | 55.8 |
| HT:N:S:H | 63.1 | 64.5 | 774 | 494 | 478 | 5564 | 27.8 | 55.7 |
| HT:W:A | 62.8 | 64.9 | 758 | 526 | 516 | 5458 | 28.2 | 55.8 |
| **Impact of the length of temporal connection ($\tau$)** | | | | | | | | |
| HT:N:S ($\tau = 1$) | 62.7 | 64.9 | 760 | 510 | 470 | 5557 | 28.2 | 55.8 |
| HT:N:S ($\tau = 3$) | 63.0 | 64.8 | 775 | 502 | 431 | 5629 | 28.2 | 55.7 |
| HT:N:S ($\tau > 3$) | 62.8 | 64.7 | 763 | 508 | 381 | 5676 | 28.0 | 55.7 |
| **Impact of the constraints** | | | | | | | | |
| All | 63.0 | 64.8 | 775 | 502 | 431 | 5629 | 28.2 | 55.7 |
| All w/o spat. transitivity | 22.2 | 76.0 | 115 | 1521 | 39 | 3947 | 15.1 | 58.0 |
| All w/o temp. transitivity | 60.3 | 65.1 | 712 | 544 | 268 | 5610 | 27.7 | 55.8 |
| All w/o spatio-temporal | 55.1 | 64.1 | 592 | 628 | 262 | 5444 | 23.9 | 55.7 |
| **Comparison with the baselines** | | | | | | | | |
| Ours | 63.0 | 64.8 | 775 | 502 | 431 | 5629 | 28.2 | 55.7 |
| BBox-Tracking [38, 34] + LJPA [17] | 58.8 | 64.8 | 716 | 646 | 319 | 5026 | 26.6 | 53.5 |
| BBox-Tracking [38, 34] + CPM [40] | 60.1 | 57.7 | 754 | 611 | 347 | 4969 | 15.6 | 53.4 |

### 5.1 Multi-Person Pose Tracking

The results for multi-person pose tracking (MOT CLEAR metrics) are reported in Tab. 2. To find the best setting, we first perform a series of experiments, investigating the influence of temporal connection density, temporal connection length, and inclusion of different constraint types.
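For readers unfamiliar with the CLEAR MOT metrics, the accuracy score can be sketched as below. The code uses the standard definition of MOTA, one minus the ratio of all errors (misses, false positives, identity switches) to the number of ground-truth targets, accumulated over all frames. The function name and the toy counts are illustrative, and the function returns a fraction rather than a percentage.

```python
def mota(false_negatives, false_positives, id_switches, num_ground_truth):
    """Multi-Object Tracking Accuracy under the standard CLEAR MOT definition.

    All counts are summed over every frame of the sequence(s).
    """
    if num_ground_truth <= 0:
        raise ValueError("num_ground_truth must be positive")
    errors = false_negatives + false_positives + id_switches
    return 1.0 - errors / num_ground_truth


# Toy counts: 30 misses, 20 false positives and 5 identity switches
# against 100 ground-truth targets.
print(f"{100 * mota(30, 20, 5, 100):.1f}")
```

Since the error counts can exceed the number of ground-truth targets, the score can become negative for very poor trackers.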

We first examine the impact of different joint combinations for temporal connections. Connecting only the Head Tops (HT) between frames results in a Multi-Object Tracking Accuracy (MOTA) of 27.2 with a recall and precision of 57.6 and 66.0, respectively. Adding Neck and Shoulder (HT:N:S) detections for temporal connections improves the MOTA score to 28.2, while also improving the recall from 57.6 to 62.7. Adding more temporal connections also improves other metrics such as MT and ML and results in a lower number of ID switches (IDs), although the number of fragments (FM) increases from 5080 to 5557. However, increasing the number of joints for temporal edges even further (HT:N:S:H) results in a slight decrease in performance. This is most likely due to the weaker DeepMatching correspondences between hip joints, which are difficult to match. When only the body extremities (HT:W:A) are used for temporal edges, we obtain a similar MOTA as for (HT:N:S), but slightly worse values for the other tracking measures. Considering the MOTA performance and the complexity of our graph structure, we use (HT:N:S) as our default setting.

Instead of considering only neighboring frames for temporal edges, we also evaluate the tracking performance while introducing longer-range temporal edges with larger values of $\tau$. Adding temporal edges between detections that are at most three frames apart improves the performance only slightly, whereas increasing the distance even further worsens the performance. For the rest of our experiments we therefore set $\tau = 3$.

To evaluate the proposed optimization objective (3) for joint multi-person pose estimation and tracking in more detail, we have quantified the impact of various kinds of constraints (10)-(15) enforced during the optimization. To this end, we remove one type of constraints at a time and solve the optimization problem. As shown in Tab. 2, all types of constraints are important to achieve best performance, with the spatial transitivity constraints playing the most crucial role. This is expected since these constraints ensure that we obtain valid poses without multiple joint types assigned to one person. Temporal transitivity and spatio-temporal constraints also turn out to be important to obtain good results. Removing either of the two significantly decreases the recall, resulting in a drop in MOTA.

Since we are the first to report results on the Multi-Person PoseTrack dataset, we also develop two baseline methods using existing approaches. For this, we rely on a state-of-the-art method for multi-person pose estimation in images [17]. The approach uses a person detector [34] to first obtain person bounding box hypotheses, and then estimates the pose for each person independently. We extend it to videos as follows. We first generate person bounding boxes for all frames in the video using a state-of-the-art person detector (Faster R-CNN [34]), and then perform person tracking using a state-of-the-art person tracker [38], which we train on the training set of the Multi-Person PoseTrack dataset. We also discard all tracks that are shorter than 7 frames. The final pose estimates are obtained by using the Local Joint-to-Person Association (LJPA) approach proposed by [17] for each person track. We also report results when Convolutional Pose Machines (CPM) [40] are used instead. Since CPM does not account for joint occlusion and truncation, the MOTA score is significantly lower than for LJPA. LJPA [17] improves the performance, but remains inferior w.r.t. most measures compared to our proposed method. In particular, our method achieves the highest MOTA and MOTP scores. The former is due to a significantly higher recall, while the latter is a result of a more precise part localization. Interestingly, the person bounding-box tracking based baselines achieve a lower number of ID switches. We believe that this is primarily due to the powerful multi-target tracking approach [38], which can handle person identities more robustly.

Method | Head | Sho | Elb | Wri | Hip | Knee | Ank | mAP |
---|---|---|---|---|---|---|---|---|
Impact of the temporal connection density | | | | | | | | |
HT | 52.5 | 47.0 | 37.6 | 28.2 | 19.7 | 27.8 | 27.4 | 34.3 |
HT:N:S | 56.1 | 51.3 | 42.1 | 31.2 | 22.0 | 31.6 | 31.3 | 37.9 |
HT:N:S:H | 56.3 | 51.5 | 42.2 | 31.4 | 21.7 | 31.6 | 32.0 | 38.1 |
HT:W:A | 56.0 | 51.2 | 42.2 | 31.6 | 21.6 | 31.2 | 31.7 | 37.9 |
Impact of the length of temporal connection () | | | | | | | | |
HT:N:S | 56.1 | 51.3 | 42.1 | 31.2 | 22.0 | 31.6 | 31.3 | 37.9 |
HT:N:S | 56.5 | 51.6 | 42.3 | 31.4 | 22.0 | 31.9 | 31.6 | 38.2 |
HT:N:S | 56.2 | 51.3 | 41.8 | 31.1 | 22.0 | 31.4 | 31.5 | 37.9 |
Impact of the constraints | | | | | | | | |
All | 56.5 | 51.6 | 42.3 | 31.4 | 22.0 | 31.9 | 31.6 | 38.2 |
All w/o spat. transitivity | 7.8 | 10.1 | 7.2 | 4.6 | 2.7 | 4.9 | 5.9 | 6.2 |
All w/o temp. transitivity | 50.5 | 46.8 | 37.5 | 27.6 | 20.3 | 30.1 | 28.7 | 34.5 |
All w/o spatio-temporal | 42.3 | 40.8 | 32.8 | 24.3 | 17.0 | 25.3 | 22.4 | 29.3 |
Comparison with the state-of-the-art | | | | | | | | |
Ours | 56.5 | 51.6 | 42.3 | 31.4 | 22.0 | 31.9 | 31.6 | 38.2 |
BBox-Detection [34] | | | | | | | | |
+ LJPA [17] | 50.5 | 49.3 | 38.3 | 33.0 | 21.7 | 29.6 | 29.2 | 35.9 |
+ CPM [40] | 48.8 | 47.5 | 35.8 | 29.2 | 20.7 | 27.1 | 22.4 | 33.1 |
DeeperCut [16] | 56.2 | 52.4 | 40.1 | 30.0 | 22.8 | 30.5 | 30.8 | 37.5 |

### 5.2 Frame-wise Multi-Person Pose Estimation

The results for frame-wise multi-person pose estimation (mAP) are summarized in Tab. 3. Similar to the evaluation for pose tracking, we evaluate the impact of the spatio-temporal connection density, the length of temporal connections and the influence of different constraint types. Having connections only between Head Top (HT) detections results in a mAP of 34.3. As for pose tracking, introducing temporal connections for Neck and Shoulders (HT:N:S) results in a higher accuracy and improves the mAP from 34.3 to 37.9. The mAP increases slightly further when we also incorporate connections for hip joints (HT:N:S:H). This is in contrast to pose tracking, where MOTA dropped slightly when we also used connections for hip joints. As before, inclusion of edges between all detections that are in the range of frames improves the performance, while increasing the distance further starts to deteriorate the performance. A similar trend can also be seen for the impact of different types of constraints. The removal of spatial transitivity constraints results in a drastic decrease in pose estimation accuracy, from a mAP of 38.2 to 6.2. Without temporal transitivity constraints or spatio-temporal constraints, the mAP drops from 38.2 to 34.5 and 29.3, respectively. This once again indicates that all types of constraints are essential to obtain better pose estimation and tracking performance.
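The mAP column of Tab. 3 can be cross-checked against the per-joint columns. The sketch below assumes, since this section does not state the aggregation rule, that mAP is the unweighted mean of the seven per-joint APs. The row values are copied from the table, and `mean_ap` is an illustrative helper name rather than part of any evaluation toolkit.

```python
# Per-joint APs (Head, Sho, Elb, Wri, Hip, Knee, Ank) and the reported mAP,
# copied from Tab. 3.
ROWS = {
    "HT": ([52.5, 47.0, 37.6, 28.2, 19.7, 27.8, 27.4], 34.3),
    "HT:N:S": ([56.1, 51.3, 42.1, 31.2, 22.0, 31.6, 31.3], 37.9),
    "Ours": ([56.5, 51.6, 42.3, 31.4, 22.0, 31.9, 31.6], 38.2),
    "BBox + LJPA": ([50.5, 49.3, 38.3, 33.0, 21.7, 29.6, 29.2], 35.9),
    "BBox + CPM": ([48.8, 47.5, 35.8, 29.2, 20.7, 27.1, 22.4], 33.1),
    "DeeperCut": ([56.2, 52.4, 40.1, 30.0, 22.8, 30.5, 30.8], 37.5),
}


def mean_ap(per_joint_ap):
    """Unweighted mean over joint types (assumed aggregation rule)."""
    return sum(per_joint_ap) / len(per_joint_ap)


for name, (aps, reported) in ROWS.items():
    # Print the recomputed mean next to the value reported in the table.
    print(f"{name}: recomputed {mean_ap(aps):.2f}, reported {reported}")
```

Under that assumption, each recomputed mean agrees with the reported mAP to within rounding.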

We also compare the proposed method with state-of-the-art approaches for multi-person pose estimation in images. Similar to [17], we use Faster R-CNN [34] as the person detector, and use the code provided for LJPA [17] and CPM [40] to process each bounding box detection independently. We can see that the person bounding box based approaches perform significantly worse than the proposed method. We also compare with the state-of-the-art method DeeperCut [16]. This approach, however, requires the rough scale of the persons during testing. For this, we use the person detections obtained from [34] and take the median scale of all detected persons.
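The scale estimate handed to DeeperCut can be illustrated with a short sketch. The text only says that the median scale of all detected persons is used; the definition of scale as box height divided by a reference height, the value of `REFERENCE_HEIGHT`, the `(x1, y1, x2, y2)` box layout, and the function name `rough_scale` are assumptions made for illustration.

```python
from statistics import median

# Hypothetical reference person height in pixels; not given in the text.
REFERENCE_HEIGHT = 200.0


def rough_scale(boxes, reference_height=REFERENCE_HEIGHT):
    """Median person scale over (x1, y1, x2, y2) boxes.

    Scale is assumed to be box height / reference height.
    Returns None when there are no valid detections.
    """
    heights = [y2 - y1 for (_, y1, _, y2) in boxes if y2 > y1]
    if not heights:
        return None
    return median(h / reference_height for h in heights)


# Toy example with three detections of different heights.
boxes = [(0, 0, 50, 100), (10, 20, 90, 220), (5, 0, 120, 400)]
print(rough_scale(boxes))
```

The median is less sensitive than the mean to a single very large or very small detection, which is one plausible reason for preferring it here.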

Our approach achieves a higher mAP than all other methods. Moreover, all these approaches require an additional person detector, either to obtain the bounding boxes [17, 40] or the rough scale of the persons [16]. Our approach, on the other hand, does not require a separate person detector: we perform joint detection across different scales while also solving the person association problem across frames.

We also visualize how the multi-person pose estimation accuracy (mAP) relates to the multi-person tracking accuracy (MOTA) in Fig. 8. Finally, Tab. 4 provides mean and median runtimes for constructing and solving the spatio-temporal graph, along with the graph size for frames over all test videos.

| | Runtime (sec./frame) | # of nodes | # of spatial edges | # of temp. edges |
|---|---|---|---|---|
| Mean | 14.7 | 2084 | 65535 | 12903 |
| Median | 4.2 | 1907 | 58164 | 8540 |

## 6 Conclusion

In this paper, we have presented a novel approach to simultaneously perform multi-person pose estimation and tracking. We demonstrate that the problem can be formulated as a spatio-temporal graph which can be efficiently optimized using integer linear programming. We have also presented a challenging and diverse annotated dataset with a comprehensive evaluation protocol to analyze algorithms for multi-person pose estimation and tracking. In line with the evaluation protocol, the proposed method does not make any assumptions about the number, size, or location of the persons, and can perform pose estimation and tracking in completely unconstrained videos. Moreover, the method is able to perform pose estimation and tracking under severe occlusion and truncation. Experimental results on the proposed dataset demonstrate that our method outperforms the baseline methods.

Acknowledgments. The authors are thankful to Chau Minh Triet, Andreas Doering, and Zain Umer Javaid for the help with annotating the dataset. The work has been financially supported by the DFG project GA 1927/5-1 (DFG Research Unit FOR 2535 Anticipating Human Behavior) and the ERC Starting Grant ARCA (677650).

### Footnotes

- http://pages.iai.uni-bonn.de/iqbal_umar/PoseTrack/
- Contemporaneously with this work, the problem has also been studied in [15]
- Note that only joints of the same type are matched.

### References

- M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2D human pose estimation: New benchmark and state of the art analysis. In CVPR, 2014.
- M. Andriluka, S. Roth, and B. Schiele. People-tracking-by-detection and people-detection-by-tracking. In CVPR, 2008.
- V. Belagiannis, S. Amin, M. Andriluka, B. Schiele, N. Navab, and S. Ilic. 3D pictorial structures revisited: Multiple human pose estimation. TPAMI, 2015.
- K. Bernardin and R. Stiefelhagen. Evaluating multiple object tracking performance: The CLEAR MOT metrics. EURASIP Journal on Image and Video Processing, 2008.
- A. Bulat and G. Tzimiropoulos. Human pose estimation via convolutional part heatmap regression. In ECCV, 2016.
- J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.
- J. Charles, T. Pfister, D. Magee, D. Hogg, and A. Zisserman. Personalizing human video pose estimation. In CVPR, 2016.
- X. Chen and A. L. Yuille. Parsing occluded people by flexible compositions. In CVPR, 2015.
- A. Cherian, J. Mairal, K. Alahari, and C. Schmid. Mixing Body-Part Sequences for Human Pose Estimation. In CVPR, 2014.
- W. Choi. Near-online multi-target tracking with aggregated local flow descriptor. In ICCV, 2015.
- M. Eichner and V. Ferrari. We are family: Joint pose estimation of multiple persons. In ECCV, 2010.
- G. Gkioxari, B. Hariharan, R. Girshick, and J. Malik. Using k-poselets for detecting people and localizing their keypoints. In CVPR, 2014.
- G. Gkioxari, A. Toshev, and N. Jaitly. Chained predictions using convolutional neural networks. In ECCV, 2016.
- P. Hu and D. Ramanan. Bottom-up and top-down reasoning with hierarchical rectified gaussians. In CVPR, 2016.
- E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. Articulated multi-person tracking in the wild. In CVPR, 2017.
- E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. DeeperCut: A deeper, stronger, and faster multi-person pose estimation model. In ECCV, 2016.
- U. Iqbal and J. Gall. Multi-person pose estimation with local joint-to-person associations. In ECCV Workshop on Crowd Understanding, 2016.
- U. Iqbal, M. Garbade, and J. Gall. Pose for action - action for pose. In FG, 2017.
- H. Izadinia, I. Saleemi, W. Li, and M. Shah. (MP)2T: Multiple people multiple parts tracker. In ECCV, 2012.
- A. Jain, J. Tompson, Y. LeCun, and C. Bregler. MoDeep: A deep learning framework using motion features for human pose estimation. In ACCV, 2014.
- H. Jhuang, J. Gall, S. Zuffi, C. Schmid, and M. Black. Towards understanding action recognition. In ICCV, 2013.
- S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In BMVC, 2010.
- L. Ladicky, P. H. Torr, and A. Zisserman. Human pose estimation using a joint pixel-wise and part-wise formulation. In CVPR, 2013.
- Y. Li, C. Huang, and R. Nevatia. Learning to associate: Hybridboosted multi-target tracker for crowded scene. In CVPR, 2009.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
- A. Milan, L. Leal-Taixe, I. Reid, S. Roth, and K. Schindler. MOT16: A benchmark for multi-object tracking. arXiv preprint arXiv:1603.00831, 2016.
- A. Newell, K. Yang, and J. Deng. Stacked hourglass networks for human pose estimation. In ECCV, 2016.
- D. Park and D. Ramanan. N-best maximal decoders for part models. In ICCV, 2011.
- T. Pfister, J. Charles, and A. Zisserman. Flowing convnets for human pose estimation in videos. In ICCV, 2015.
- L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In CVPR, 2016.
- L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen, and B. Schiele. Articulated people detection and pose estimation: Reshaping the future. In CVPR, 2012.
- U. Rafi, I. Kostrikov, J. Gall, and B. Leibe. An efficient convolutional network for human pose estimation. In BMVC, 2016.
- V. Ramakrishna, T. Kanade, and Y. Sheikh. Tracking human pose by tracking symmetric parts. In CVPR, 2013.
- S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
- B. Sapp, D. J. Weiss, and B. Taskar. Parsing human motion with stretchable models. In CVPR, 2011.
- H. Shen, S.-I. Yu, Y. Yang, D. Meng, and A. Hauptmann. Unsupervised video adaptation for parsing human motion. In ECCV, 2014.
- M. Sun and S. Savarese. Articulated part-based model for joint object detection and pose estimation. In ICCV, 2011.
- S. Tang, B. Andres, M. Andriluka, and B. Schiele. Multi-person tracking by multicut and deep matching. In ECCV Workshop on Benchmarking Multi-target Tracking, 2016.
- A. Toshev and C. Szegedy. DeepPose: Human pose estimation via deep neural networks. In CVPR, 2014.
- S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
- P. Weinzaepfel, J. Revaud, Z. Harchaoui, and C. Schmid. DeepFlow: Large displacement optical flow with deep matching. In ICCV, 2013.
- B. Yang and R. Nevatia. An online learned CRF model for multi-target tracking. In CVPR, 2012.
- Y. Yang and D. Ramanan. Articulated human detection with flexible mixtures of parts. TPAMI, 2013.
- D. Zhang and M. Shah. Human pose estimation in videos. In ICCV, 2015.
- W. Zhang, M. Zhu, and K. G. Derpanis. From actemes to action: A strongly-supervised representation for detailed action understanding. In ICCV, 2013.
- S. Zuffi, J. Romero, C. Schmid, and M. J. Black. Estimating human pose with flowing puppets. In ICCV, 2013.