Multi-Person Pose Estimation via Column Generation
We study the problem of multi-person pose estimation in natural images. A pose estimate describes the spatial position and identity (head, foot, knee, etc.) of every non-occluded body part of a person. Pose estimation is difficult due to issues such as deformation and variation in body configurations and occlusion of parts, while multi-person settings add complications such as an unknown number of people, with unknown appearance and possible interactions in their poses and part locations. We give a novel integer program formulation of the multi-person pose estimation problem, in which variables correspond to assignments of parts in the image to poses in a two-tier, hierarchical way. This enables us to develop an efficient custom optimization procedure based on column generation, where columns are produced by exact optimization of very small scale integer programs. We demonstrate improved accuracy and speed for our method on the MPII multi-person pose estimation benchmark.
Shaofei Wang (Beijing A&E Technologies); Chong Zhang (SIMBioSys Group, DTIC, Universitat Pompeu Fabra, Barcelona, Spain); Miguel A. Gonzalez-Ballester (SIMBioSys Group, DTIC, Universitat Pompeu Fabra & ICREA, Barcelona, Spain); Alexander Ihler (University of California, Irvine, California); Julian Yarkony (Experian Data Lab, San Diego, CA)
In this paper we consider the problem of multi-person pose estimation (MPPE) in natural images. MPPE is the problem of detecting and localizing people and their corresponding body parts. In practice, most MPPE systems work by running part detectors over the image, extracting a number of possible part locations, then integrating this information using a pose model to determine both the number of people present in the image, and the assignment of detected parts to people (the pose).
For instance,  employs a flexible mixture-of-parts model for joint detection and estimation of human poses, where human poses are modeled by pictorial structure  and efficient inference is achieved via dynamic programming and distance transforms. In  the problem of finding the pose of a person is equivalent to finding the maximum a posteriori (MAP) configuration of a probabilistic graphical model where the likelihood function trades off two terms. The first encourages that the part locations of a predicted person are supported by evidence in the image as described by local image features [4, 19]. The second encourages that the part locations of a predicted person satisfy the angular and distance relationships consistent with a person . An example of such a relationship is that the head of a person tends to be above the neck.
Often, the part detectors may detect the presence of a given part several times in close proximity, leading to a multiple detection problem; a simple way to solve this is via non-max suppression (NMS), which removes all but the best detections in a small region. NMS can be done either as a pre-processing step to suppress non-local-maximum part detections, or as a post-processing step to suppress poses with lower scores/probabilities that overlap with poses of high scores/probabilities. Either way, distortion or missing detection problems may occur, particularly in multi-person images, either by removing the correct detections, or by removing detections corresponding to separate persons.
More recent works [12, 16] cast the MPPE problem as an integer linear program (ILP), in which multiple detections of a single part may be assigned to the same person. This allows non-max suppression to be folded into the pose model, improving its ability to find the correct pose. The cost function of the ILP is generated using deep neural networks [17, 2], and the ILP is optimized using a state-of-the-art ILP solver, assisted by a greedy multi-stage optimization procedure.
We propose an alternative ILP formulation of MPPE, in which we impose several additional structural assumptions on the ILP. In particular, we model the part assignments using a two-tier structure, in which a local assignment tier handles non-max suppression by grouping multiple detections, while a global pose tier handles the overall pose shape using an augmented-tree structure for the human body. We exploit this problem structure to design a highly efficient column generation algorithm for optimizing the ILP [9, 3] tailored to this model; for example, the global pose tier exploits the tree structured body model [7, 6, 21] to generate columns efficiently using dynamic programming. Figure 1 shows an illustration contrasting  with our model; given many detections,  uses a dense model to associate parts with individuals, while our model corresponds to a two-tier structure with a tree-like body model. In combination, this results in a novel MPPE model that is both more accurate and significantly faster than the baseline method of [12, 16].
Figure 1: (a) raw input; (b) Deeper Cut; (c) our approach; (d) final output.
We also note that a more recent approach of  achieves considerable speed up over : it is about three orders of magnitude faster than  while being 10x faster than our proposed method. Nevertheless, as will be shown later in the experiments section, it is not as accurate as our method, especially for difficult-to-localize parts such as ankles and wrists.
Our paper is organized as follows. In Section 2 we outline the assumptions of our model and its structure, then formulate it more precisely as an ILP. In Section 3 we introduce our column generation approach for computing the optimal MPPE assignment, where the column generation steps are solved using efficient dynamic programming and small scale, exactly solvable integer programs (IP). In Section 4 we demonstrate that our model and inference process provide state of the art results for MPPE on benchmark data. Finally, we conclude and discuss extensions in Section 5. Additional derivations and discussion are provided in the supplements.
2 Multi-Person Pose Estimation Model
In this section, we describe our two-tiered structure for reasoning about pose estimation. The input to our model is a set of body part detections; in practice, we use the body part detector of , which employs a deep convolutional neural network [5, 14]. Each detection is associated with exactly one body part. Our model uses fourteen parts, consisting of the head and neck, along with right and left variants of the ankle, knee, hip, wrist, elbow, and shoulder. We use the term complete pose to describe a person in an image, as represented by the detections associated with their body parts.
2.1 Assignment of Parts and Validity
We partition the body parts into two types: major parts, of which at least one is required to be present (not occluded) in any complete pose, and minor parts, any of which may be occluded. In practice, we take the neck to be the only major part, thus requiring that each complete pose be associated with at least one neck detection.
We reason about the assignment of parts to a complete pose in two tiers: a local assignment, which corresponds to a grouping of detections for a single part that are all associated with a single complete pose; and a global pose, which corresponds to at most one detection of each part. In practice, the score of a local assignment evaluates the coherence of the detections for that part (for example, two visually similar detections of a part in close proximity are more likely to correspond to the same person), while the score of the global pose captures the coherence of these part locations according to a (nearly) tree structured model of the human body (for example, the head is typically located above the neck). In any local assignment, we require that exactly one detection be assigned to some global pose, so that the global pose reasons about the overall position and visibility of the person, and the local assignment captures any additional detections associated with each visible part. A complete pose corresponds to a single person in the image, and consists of a single global pose and the local assignments (additional detections) associated with each of its visible parts.
Finally, we categorize detections as either global, local, or false positive. Global detections are those associated with some global pose; local detections are the non-global detections in a local assignment; and false positives are detections not contained in any global pose or local assignment.
These definitions result in the following requirements for a set of complete poses, which describe a group of people in the image:
A detection can only be global, local, or neither.
No two global poses can share a common detection.
No two local assignments can share a common detection.
The global detection of a local assignment must also be included in a global pose.
We refer to these conditions as the validity conditions and a selection of global poses and local assignments that meet them is referred to as valid.
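The validity conditions above can be checked mechanically. The following is a minimal sketch, assuming a hypothetical representation in which a global pose is a set of detection ids and a local assignment is a pair `(global_detection, local_detections)`; the function name and data layout are illustrative and not taken from the paper.

```python
def is_valid(global_poses, local_assignments):
    """Check the validity conditions for a selection of columns.

    global_poses: iterable of sets of detection ids (assumed layout).
    local_assignments: iterable of (global_det, set_of_local_dets) pairs.
    """
    global_dets = set()
    for pose in global_poses:
        if global_dets & set(pose):
            return False  # two global poses share a detection
        global_dets |= set(pose)

    used_in_local = set()
    local_dets = set()
    for anchor, locals_ in local_assignments:
        members = {anchor} | set(locals_)
        if used_in_local & members:
            return False  # two local assignments share a detection
        used_in_local |= members
        if anchor not in global_dets:
            return False  # global detection is not part of any global pose
        local_dets |= set(locals_)

    # A detection may be global, local, or neither, but never both.
    return not (global_dets & local_dets)
```

For example, `is_valid([{1, 2}], [(1, {3})])` holds, while a selection whose local assignment is anchored at a detection outside every global pose is rejected.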
2.2 Integer Linear Program Formulation
Table 1: Summary of notation.
|set|set of detections|
|set|set of parts|
|set|set of major parts|
||indicates that a detection is associated with a part|
||shorthand for the part associated with a detection|
|set|set of all global poses|
|set|set of all local assignments|
|set|set of global poses generated during column generation|
|set|set of local assignments generated during column generation|
||cost of including a detection in a complete pose|
||cost of including two detections in the same local assignment or global pose|
||indicates that a detection is a global detection in a global pose|
||indicates that a detection is a local detection in a local assignment|
||indicates that a detection is a global detection in a local assignment|
||cost of a global pose|
||cost of a local assignment|
||indicates that a global pose is selected|
||indicates that a local assignment is selected|
We now formally define the MPPE task as an integer linear program (ILP). We first describe the variables associated with detections and parts, global poses, and local assignments; give the validity constraints on these variables as linear inequalities; and finally define the cost of a pose and the overall optimization problem, and discuss its linear program (LP) relaxation. We summarize our notation in Table 1.
Detections and Parts.
We denote the set of detections in the image as , and index these detections by . Similarly, we use to denote the set of parts, indexed by , and denote the set of major parts by . We describe the mapping of detections to parts using a matrix , indexed by . Specifically, indicates that detection is associated with part . As a useful shorthand, we define to be the part associated with detection .
Given the set of detections , we define the set of all possible global poses over as . Members of have at least one global detection corresponding to a major part and no more than one detection corresponding to any given part. We describe mappings of detections to global poses using a matrix , and set if and only if detection is associated with global pose .
Note that the set of all possible poses is impractically large (it contains all valid assignments of detections to a global pose). Thus in practice, we never construct explicitly; instead, we maintain an active set of poses, , restricting to this set.
Next we denote the set of all possible local assignments over the detections by , and index these possible local assignments by . Since we require that, for any local assignment , exactly one of the detections in is global, we describe using two matrices , where if and only if detection is associated with as a local (non-global) detection, and if and only if detection is associated with as a global detection.
The set is too large to be considered explicitly during optimization. We maintain a subset during optimization, and explicitly represent and restricted to .
We index a set of global poses and local assignments using indicator vectors, so that with to indicate that global pose is selected, and otherwise. Similarly, we let with to indicate that local assignment is selected, with otherwise.
A solution is a valid solution if and only if it satisfies the rules defined previously, which is written formally as the following set of linear inequalities:
We now describe the cost function for MPPE. Our total cost is expressed in terms of unary costs , where is the cost of assigning detection to a pose, and pairwise costs , where is the cost of assigning detections and to a common global pose or local assignment. We use to denote the cost of instancing a pose, which serves to regularize the number of people in an image.
The cost of a complete pose is thus the sum of the costs of the following.
terms associated with pairs of detections in its global pose
terms associated with pairs of detections within each of its local assignments
terms associated with detections in either its global or local assignments
term associated with instancing a pose.
For convenience, we separate these costs into as the cost associated with the global pose , and as the cost of local assignment , respectively:
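As a hedged illustration of this split, the sketch below computes the two cost components from unary and pairwise terms. The names `theta`, `phi`, and `theta0`, the dictionary layout, and the choice to charge the unary cost of the global detection to the global pose (so it is not counted twice) are assumptions made for this example.

```python
from itertools import combinations


def pair_sum(dets, phi):
    """Sum pairwise costs over all unordered pairs; missing pairs cost 0."""
    return sum(phi.get(frozenset(p), 0.0) for p in combinations(dets, 2))


def global_pose_cost(pose, theta, phi, theta0):
    """Instancing cost + unary costs + pairwise costs within the global pose."""
    return theta0 + sum(theta[d] for d in pose) + pair_sum(pose, phi)


def local_assignment_cost(anchor, locals_, theta, phi):
    """Unary costs of the local detections + pairwise costs among all members.

    The anchor (global detection) contributes pairwise terms only; its unary
    cost is assumed to be paid by the global pose that contains it.
    """
    return sum(theta[d] for d in locals_) + pair_sum([anchor, *locals_], phi)
```

Under this convention, the cost of a complete pose is the global pose cost plus the costs of the local assignments anchored at its detections.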
Integer Linear Program.
We now cast the problem of finding the lowest cost set of poses as an integer linear program subject to our validity constraints:
By relaxing the integrality constraints on , we obtain a linear program relaxation of the ILP, and can convert Eq. (1) to its dual form using Lagrange multiplier sets :
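As a worked sketch of the relaxation and its dual, the block below uses assumed notation that is introduced here for illustration only: $\gamma$ and $\psi$ are the selection vectors for global poses and local assignments, $\Gamma$ and $\Psi$ their cost vectors, $G$, $L$, $M$ the detection-to-column indicator matrices (global detections of global poses, local detections of local assignments, and global detections of local assignments, respectively), and $\lambda^1, \lambda^2, \lambda^3$ the multipliers of the three families of validity constraints. The exact form in the original may differ.

```latex
\begin{align*}
\text{Primal (relaxed):}\quad
\min_{\gamma \ge 0,\ \psi \ge 0}\ & \Gamma^\top \gamma + \Psi^\top \psi \\
\text{s.t.}\quad
& G\gamma + L\psi \le 1 && (\lambda^1) \\
& L\psi + M\psi \le 1 && (\lambda^2) \\
& M\psi - G\gamma \le 0 && (\lambda^3) \\[4pt]
\text{Dual:}\quad
\max_{\lambda^1,\lambda^2,\lambda^3 \ge 0}\ & -1^\top \lambda^1 - 1^\top \lambda^2 \\
\text{s.t.}\quad
& \Gamma + G^\top(\lambda^1 - \lambda^3) \ge 0 \\
& \Psi + L^\top(\lambda^1 + \lambda^2) + M^\top(\lambda^2 + \lambda^3) \ge 0
\end{align*}
```

Under this reading, each dual constraint corresponds to one column of the primal, and a column is worth adding to the active set exactly when its dual constraint is violated, that is, when the left-hand side is negative.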
3 Column Generation Solution
In this section we consider optimization of the LP relaxation in Eq. (2). As discussed, the primary difficulty is the intractable sizes of the sets . Instead, we consider subsets and that are constructed strategically during optimization so as to be small, while still solving the LP in Eq. (2) exactly. This type of column generation approach is common in the operations research literature, in which the task of generating the columns is often called pricing .
We solve the dual form LP in Eq. (2) iteratively with two steps. We first solve the dual LP over constraint sets and , which are initialized to be empty. Then, we identify violated constraints in the dual using combinatorial optimization and add these to sets and . One local assignment is identified corresponding to each possible selection of a global detection, and one global pose is identified for each selection of a detection corresponding to a major part. We repeat these two steps until no more violated constraints exist. We then solve the integer linear program over sets and . We diagram this procedure in Figure 3 and show the corresponding algorithm in Alg 1.
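The alternation described above can be sketched as a short loop. This is an illustrative skeleton only: `solve_dual`, `price_local`, and `price_global` are hypothetical callables standing in for the restricted dual LP solve and the two pricing routines, and the final ILP solve over the generated columns is left to the caller.

```python
def column_generation(detections, major_detections, solve_dual,
                      price_local, price_global, tol=1e-9, max_iters=1000):
    """Grow the active column sets until no pricing step finds a violation.

    solve_dual(global_cols, local_cols) -> dual values (any object).
    price_local(d, duals) / price_global(d, duals) -> (reduced_cost, column).
    """
    global_cols, local_cols = [], []  # both active sets start empty
    for _ in range(max_iters):
        duals = solve_dual(global_cols, local_cols)
        added = False
        # One candidate local assignment per choice of global detection.
        for d in detections:
            cost, col = price_local(d, duals)
            if cost < -tol and col not in local_cols:
                local_cols.append(col)
                added = True
        # One candidate global pose per detection of a major part.
        for d in major_detections:
            cost, col = price_global(d, duals)
            if cost < -tol and col not in global_cols:
                global_cols.append(col)
                added = True
        if not added:
            break  # no violated dual constraints remain
    return global_cols, local_cols
```

Because the pricing calls for different detections are independent given the dual values, the two inner loops could be run in parallel.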
3.1 Identifying Violated Local Assignments
For each detection , we compute the most violated constraint corresponding to a local assignment in which is the global detection. We write this as an IP using the indicator vector , and define a new column for inclusion in matrices and , assigning and for all , where is the solution to
In practice, we solve this IP by explicit enumeration over the possible local assignments. Since the number of detections associated with any given part (and thus eligible to participate in the local assignment of ) is small (no larger than 15 and usually less than 10), exhaustive search is feasible. Alternatively, one can convert this problem to an equivalent ILP and solve it with an off-the-shelf ILP solver that employs branch-and-cut.
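A minimal sketch of such an enumeration follows. It assumes the reduced cost of a local assignment decomposes into unary costs plus hypothetical per-detection dual offsets (`offset`, and `anchor_offset` for the global detection) and pairwise costs among all members; these names and the exact objective are assumptions for illustration.

```python
from itertools import combinations


def pair_cost(phi, a, b):
    """Symmetric lookup of a pairwise cost; missing pairs cost 0."""
    return phi.get((a, b), phi.get((b, a), 0.0))


def price_local_assignment(d_star, candidates, theta, phi, offset,
                           anchor_offset=0.0):
    """Enumerate every subset of same-part candidates around anchor d_star.

    Returns (best_reduced_cost, best_subset_of_local_detections).
    """
    best_cost, best_subset = float("inf"), None
    for size in range(len(candidates) + 1):
        for subset in combinations(candidates, size):
            members = (d_star,) + subset
            cost = anchor_offset
            cost += sum(theta[d] + offset.get(d, 0.0) for d in subset)
            cost += sum(pair_cost(phi, a, b)
                        for a, b in combinations(members, 2))
            if cost < best_cost:
                best_cost, best_subset = cost, subset
    return best_cost, best_subset
```

The loop visits 2^n subsets for n candidates, which stays manageable at the sizes quoted above (at most 2^15 = 32768 subsets per global detection).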
3.2 Identifying Violated Global Poses
For each detection such that (i.e., corresponds to a major part), we compute the most violated constraint corresponding to a global pose that includes detection . Again, we write this as an IP using an indicator vector , and define a new column to be included in , defined by for all , where is the solution to:
By enforcing some structure in the pairwise costs , we can ensure that this optimization problem is tractable. A common model in computer vision is to represent the location of parts in the body using a tree-structured model, for example in the deformable part model of [7, 6, 21]; this forces the terms to be zero between non-adjacent parts on the tree.
In our application we augment this tree model with additional edges from the major part (i.e., the neck) to all other non-adjacent body parts. This is illustrated in Fig 2. Then, given the global detection associated with the neck, the conditional model is tree-structured and can be optimized using dynamic programming in time, where is the maximum number of detections per part ( in practice).
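Once the neck detection is fixed, its pairwise terms can be folded into per-detection costs and the remaining parts form a forest, so a standard tree dynamic program applies. The sketch below is one way to write it, under stated assumptions: `node_cost[d]` already includes the unary cost, any dual terms, and the pairwise cost to the chosen neck; `edge_cost` holds pairwise costs on tree edges; each part may be left unassigned (occluded), in which case no edge cost is paid. All names are hypothetical.

```python
from functools import lru_cache


def price_global_pose(roots, children, candidates, node_cost, edge_cost,
                      base_cost=0.0):
    """Minimise total cost over one-or-no detection per part on a forest.

    roots: parts attached directly to the (fixed) neck.
    children: dict part -> list of child parts.
    candidates: dict part -> list of detection ids.
    base_cost: cost of the neck detection plus the pose instancing cost.
    Returns (best_cost, tuple_of_selected_detections).
    """
    @lru_cache(maxsize=None)
    def best(part, parent_det):
        best_cost, best_sel = float("inf"), ()
        for det in [None] + list(candidates.get(part, [])):
            cost, sel = 0.0, ()
            if det is not None:
                cost += node_cost[det]
                sel = (det,)
                if parent_det is not None:
                    cost += edge_cost.get((parent_det, det), 0.0)
            for child in children.get(part, ()):
                c_cost, c_sel = best(child, det)  # memoised subtree optimum
                cost += c_cost
                sel += c_sel
            if cost < best_cost:
                best_cost, best_sel = cost, sel
        return best_cost, best_sel

    total, chosen = base_cost, ()
    for root in roots:
        c, s = best(root, None)
        total += c
        chosen += s
    return total, chosen
```

With memoisation, each part is evaluated once per parent state, so the work in this sketch grows with the number of parts times the square of the number of candidates per part.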
4.1 Experiment Setup
We evaluate our approach in terms of the Average Precision (AP) on the MPII-Multiperson training set , which consists of 3844 images. For a fair comparison, we use the unary and pairwise costs directly provided by Insafutdinov et al., and did not modify or weight these costs in any way for any approach considered in this experiment. Our model thus only differs from  and  in that our two-tier structure defines a distinct and novel cost function. In particular, our introduction of the two-tier structure forces us to ignore the pairwise terms corresponding to interactions between non-global detections that are associated with different parts in a given pose. A major benefit of this difference is a fast and typically exact optimization process. Moreover, local detections in a local assignment often do not align well with the ground-truth position of a body part (e.g., Figures 1 and 2); pairwise interactions between such detections across part types can therefore be noisy due to inaccurate localization, and ignoring such interactions may contribute to more accurate localization of body parts.
In addition to the structure depicted in Figure 2(a), we found that adding edges to the global pose model that do not break the conditional tree structure slightly improves mean Average Precision (mAP) from 78.8 to 79.1, with a negligible increase in running time. The additional edges we employ in our final model are left hip to left shoulder, right hip to right shoulder, and shoulders to head.
We set heuristically to discourage the selection of global poses that include few detections, which tend to be lower magnitude in their cost. After solving the LP (2), we tighten the relaxation if necessary using odd set inequalities of size three [10, 20], which does not interfere with pricing; more details can be found in the supplements. In practice, however, we find that these refinements are rarely necessary to produce integer solutions with identical cost to the LP relaxation at termination.
We compare our results against two baselines: 1) , whose results were obtained by its authors at our request, owing to our limited access to computing resources and commercial LP solvers; 2) , whose results were obtained by running their code over the costs from . We found that employing the augmented-tree structure instead of a fully connected structure gives  slightly better performance (from 78.4 to 78.5). Note that even with the same graph structure,  still has more pairwise connections than our model, as it considers connections between all detections from different parts.
4.2 Benchmark Results
As shown in Table 2, our approach runs much faster than  due to both the reduced model size and our more sophisticated inference algorithm. While  runs about 10x faster than our approach, we achieve more accurate results: the improvement in mAP might seem small (78.5 to 79.1), but we achieve better AP on difficult-to-localize parts such as the wrist (70.1 versus 68.6) and ankle (66.6 versus 65.2), while using only a subset of the edges used by  and . Also keep in mind that all experiments are based on the same set of unary/pairwise costs without any form of learning; our improvement is thus solely due to our novel modeling of the MPPE problem and the ability to find the global minimum of our cost.
We also note that the code of  is in pure C++ and is heavily optimized, while our code is in pure Python and we did not take advantage of the parallelizable nature of our pricing problems. Nevertheless, we still achieve considerable speed up over . We will release the code and data we used upon acceptance of this paper.
We introduce a new formulation of the multi-person pose estimation problem, along with a novel inference algorithm based on column generation that admits efficient inference. We compare our results to a state of the art algorithm and demonstrate that our approach rapidly produces more accurate results than the baseline.
In future work we intend to apply our method to other domains where similar local/global structure is present, and can assist in non-maximum suppression or clustering, for example in relevant ILP optimization formulations of multi-object tracking , moral lineage tracing, and MPPE tasks on video .
- [1] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele. 2d human pose estimation: New benchmark and state of the art analysis. In Proc. of CVPR, 2014.
- [2] P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics with deep learning. Nature communications, 5(4308), 2014.
- [3] C. Barnhart, E. L. Johnson, G. L. Nemhauser, M. W. P. Savelsbergh, and P. H. Vance. Branch-and-price: Column generation for solving huge integer programs. Operations Research, 46:316–329, 1996.
- [4] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Proc. of CVPR, 2005.
- [5] C. Farabet, C. Couprie, L. Najman, and Y. LeCun. Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1915–1929, 2013.
- [6] P. Felzenszwalb, D. McAllester, and D. Ramanan. A discriminatively trained, multiscale, deformable part model. In Proc. of CVPR, 2008.
- [7] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence, 32(9):1627–1645, 2010.
- [8] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International journal of computer vision, 61(1):55–79, 2005.
- [9] P. Gilmore and R. Gomory. A linear programming approach to the cutting-stock problem. Operations Research (volume 9), 1961.
- [10] O. Heismann and R. Borndörfer. A generalization of odd set inequalities for the set packing problem. In Operations Research Proc., 2014.
- [11] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres, and B. Schiele. ArtTrack: Articulated multi-person tracking in the wild. In Proc. of CVPR, 2017.
- [12] E. Insafutdinov, L. Pishchulin, B. Andres, M. Andriluka, and B. Schiele. Deepercut: A deeper, stronger, and faster multi-person pose estimation model. CoRR, abs/1605.03170, 2016.
- [13] F. Jug, E. Levinkov, C. Blasse, E. W. Myers, and B. Andres. Moral lineage tracing. In Proc. of CVPR, 2016.
- [14] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. of NIPS, 2012.
- [15] E. Levinkov, J. Uhrig, S. Tang, M. Omran, E. Insafutdinov, A. Kirillov, C. Rother, T. Brox, B. Schiele, and B. Andres. Joint graph decomposition and node labeling: Problem, algorithms, applications. In Proc. of CVPR, 2017.
- [16] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. Gehler, and B. Schiele. DeepCut: Joint subset partition and labeling for multi person pose estimation. In Proc. of CVPR, 2016.
- [17] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1. MIT Press, Cambridge, MA, USA, 1986.
- [18] S. Tang, B. Andres, M. Andriluka, and B. Schiele. Subgraph decomposition for multi-target tracking. In Proc. of CVPR, 2015.
- [19] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. Hoggles: Visualizing object detection features. In Proc. of ICCV, 2013.
- [20] S. Wang, S. Wolf, C. Fowlkes, and J. Yarkony. Tracking objects with higher order interactions using delayed column generation. In Proc. of AISTATS, 2017.
- [21] Y. Yang and D. Ramanan. Articulated pose estimation with flexible mixtures-of-parts. In Proc. of CVPR, 2011.
Appendix A Tighter Bound for Multi-Person Pose Estimation
A tighter LP relaxation than that in the main paper can be motivated by the following observations: (1) no more than one global pose can include two or more members of a given set of three detections; (2) no more than one local assignment can include two or more members of a given set of three detections (either as local or global). These constraints are called odd set inequalities of order three . We formalize this below.
We refer to the set of all sets of three unique detections (triples) as . We use to define the adjacency matrix between triples and local assignments. Similarly we use to define the adjacency matrix between triples and global poses. Here if and only if local assignment contains two or more members of set . Similarly we set if and only if global pose contains two or more members of set . We define formally below.
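To make the triple constraint concrete, the sketch below measures how much a fractional LP solution violates it for one triple. The representation (columns as sets of detection ids, with a parallel list of LP values) and the function name are assumptions for illustration; for local assignments, a column's set would contain both its local detections and its global detection.

```python
def triple_violation(triple, columns, weights):
    """Amount by which the odd set inequality for `triple` is violated.

    The inequality states that the total LP weight of columns containing two
    or more members of the triple is at most 1. A positive return value
    means the constraint is violated and the triple is worth adding.
    """
    triple = frozenset(triple)
    total = sum(w for col, w in zip(columns, weights)
                if len(triple & set(col)) >= 2)
    return total - 1.0
```

A classic fractional case: three columns `{1, 2}`, `{2, 3}`, `{1, 3}`, each at value 0.5, use every detection exactly once in total, yet all three contain two members of the triple `{1, 2, 3}`, giving a total of 1.5 and a violation of 0.5.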
A.1 Dual Form
We now write the corresponding primal LP for multi-person pose estimation with triples added.
The constraints and are referred to as “rows” of the primal problem. We now take the dual of Eq. (6). This induces two additional sets of Lagrange multipliers . We now write the dual below.
In order to tackle optimization we introduce subsets of and , denoted and respectively. These subsets are initially empty and grow only when needed. We write an optimization algorithm below in Alg 2 with subroutines (Section A.3) and (Section A.4) describing the generation of new triples and columns respectively.
A.3 Generating Rows
Generating rows corresponding to local assignments is done separately for each part. We write the corresponding optimization for identifying the most violated constraint corresponding to a local assignment over a given part as follows.
Finding violated rows corresponding to global poses is assisted by the knowledge that one need only consider triples over three unique part types as no global pose includes two or more detections of a given part. Hence only such triples need be considered for global pose. For any given let the detections associated with it be , the corresponding optimization can then be written as below:
Triples are only added to if the corresponding constraint is violated.
A.4 Generating Columns
Generating columns is considered separately for global poses and local assignments. The corresponding equations are unmodified from the main document except for the introduction of terms over triples. We write the IP for generating the most violated constraint corresponding to a local assignment given the global detection below.
We optimize Eq. (10) via explicit enumeration as described in the main paper.
For each such that we compute the most violated constraint corresponding to a global pose including . We write this as an IP below.
The introduction of triples breaks the structure of the problem, thus we can no longer optimize Eq. (11) via dynamic programming. We found that employing the branch and bound algorithm proposed by  is not computationally problematic for our problems as the number of triplets needed for convergence is small.
Appendix B Additional Statistics for Results on MPII Training Set
With up to 150 detections per image, we found that our column generation solver usually terminates with a few hundred columns, and never more than 1000 (i.e., the total number of global poses and local assignments).
Out of all 3844 instances, we observe fractional LP solutions on 131 instances. For 45 of these we reached integer solutions with the help of triple constraints; for the remaining 86 fractional instances, running the trial version of the CPLEX ILP solver over the columns we generated yields integer solutions at negligible additional time.