Chained Predictions Using Convolutional Neural Networks
Abstract
In this work, we present an adaptation of the sequence-to-sequence model for structured vision tasks. In this model, the output variables for a given input are predicted sequentially using neural networks. The prediction for each output variable depends not only on the input but also on the previously predicted output variables. The model is applied to spatial localization tasks and uses convolutional neural networks (CNNs) for processing input images and a multi-scale deconvolutional architecture for making spatial predictions at each step. We explore the impact of weight sharing with a recurrent connection matrix between consecutive predictions, and compare it to a formulation where these weights are not tied. Untied weights are particularly suited for problems with a fixed-size structure, where different classes of output are predicted at different steps. We show that chain models achieve top-performing results on human pose estimation from images and videos.
Keywords:
Structured tasks, chain model, human pose estimation
1 Introduction
Structured prediction methods have long been used for various vision tasks, such as segmentation, object detection and human pose estimation, to deal with complicated constraints and relationships between the different output variables predicted from an input image. For example, in human pose estimation the location of one body part is constrained by the locations of most of the other body parts. Conditional Random Fields, Latent Structural Support Vector Machines and related methods are popular examples of structured output prediction models that model dependencies among output variables.
A major drawback of such models is the need to hand-design the structure of the model in order to capture important problem-specific dependencies amongst the different output variables and at the same time allow for tractable inference. For the sake of efficiency, a specific form of conditional independence amongst output variables is often assumed. For example, in human pose estimation, a predefined kinematic body model is often used to assume that each body part is independent of all the others except for the ones it is attached to.
To alleviate some of the above modeling simplifications, structured prediction problems have been solved with sequential decision making, where all earlier predictions influence later predictions. The SEARN algorithm [1] introduced a very general formulation for this approach, and demonstrated its application to various natural language processing tasks using losses from binary classifiers. A related, more recent model, the sequence-to-sequence model, has been applied to various sequence mapping tasks, such as machine translation, speech recognition and image caption generation [2, 3, 4]. In all these models the output is a sentence whose words are predicted in first-to-last order. This model maximizes the log probability of the output sequence conditioned on the input, by decomposing the probability of an output sequence with the multiplicative chain rule of probability; at each index of the output, the next prediction is made conditioned on all previous outputs and the input. A recurrent neural network is used at every step of the output and this allows parameter sharing across all the output steps.
In this paper we borrow ideas from the above sequence-to-sequence model and propose to extend it to more general structured outputs encountered in computer vision: human pose estimation from a single image and from video. The contributions of this work are as follows:

A chain model for structured outputs, such as human pose estimation. The body part locations are predicted sequentially, where the prediction of each body part is dependent on all previously predicted body parts (see Fig. 1). The model is formulated using a neural network in which the feature extraction and prediction models are learned end-to-end. Since we apply the model to spatial labelling tasks we use convolutional neural networks for both the inputs and the outputs. The output convolutional neural network is a multi-scale deconvolution that we call deception because of its relationship to deconvolution [5, 6] and inception models [7].

We demonstrate two formulations of the chain model: one without weight sharing between different predictors (poses in images) to allow semantic-specific flow of information, and the other with weight sharing to enforce recurrence in time (poses in videos). The latter model is an RNN similar to the sequence-to-sequence model.
The above model achieves top-performing results on the MPII human pose dataset (86.1% PCKh). We achieve state-of-the-art performance for pose estimation on the PennAction video dataset (91.8% PCK).
2 Related Work
Structured output prediction as sequence prediction.
The use of sequential models for structured predictions is not new. The SEARN algorithm [1] laid down a broad framework for such models in which a sequence of actions is generated by conditioning the next action on previous actions and the data. The optimization method proposed in SEARN is based on iterative improvement over policies using reinforcement learning.
A similar class of models are the more recent sequence-to-sequence models [2, 8] that map an input sequence to an output sequence of fixed vocabulary. The models produce output variables, one at a time, conditioned on inputs and previous output variables. A next-step loss function is computed at each step, using a recurrent neural network. Sequence-to-sequence models have been shown to be very effective at a variety of language tasks including machine translation [2], speech recognition [3], image captioning [4] and parsing [9]. In this paper we use the same idea of chaining predictions for structured prediction on two vision problems: human pose estimation in individual frames and in video sequences. However, as exemplified in the pose estimation case, since we have a fixed output structure we are not limited to using recurrent models.
In the pose prediction problem, we use a fixed ordering of joints that is motivated by the kinematics of the human body. Prior work in sequential modelling has explored the idea of choosing the best ordering for a task [10, 11, 12]. For example, Vinyals et al. [10] explored this question and found that for some problems, such as geometric problems, choosing an intuitive ordering of the outputs results in slightly better performance. However, for simpler problems most orderings were able to perform equally well. For our problem, the number of joints being predicted is small, and a tree-based ordering of joints from head to torso to the extremities seems to be the intuitively correct ordering.
Human pose estimation
Human pose estimation has been one of the major playgrounds for structured prediction models in computer vision. Historically, most of the research has focused on graphical models, starting with tree-based decompositions [13, 14, 15, 16] motivated by kinematic models of the human body.
Many of these models assume conditional independence of a body part from all other parts except the parent part as defined by the kinematic body model (see pictorial structure model [13]). This simplification comes at a performance cost and has been addressed in various ways: mixture model of parts [17]; mixtures of full body models [18]; higher-order spatial relationships [19]; image dependent pictorial structures [20, 21, 22, Karlinsky2010]. Like the above approaches, we assume an order among the body parts. However, this ordering is used only to decompose the joint probability of the output joints into a particular ordering of variables in the chain rule of probability, and not to make assumptions about the structure of the probability distribution. Because no simplifying assumptions are made about the joint distribution of the output variables, the model is more expressive, as exemplified in the experimental section. The model is only constrained by the ability of neural networks to model the conditional probability distributions that arise from the particular ordering of the variables chosen. In addition, the correlations among parts are learned through a set of non-linear operations instead of imposing binary term constraints on hand-designed image features (e.g. RGB values, location) as done in CRFs.
It is worth noting that there have been models for pose estimation where parts are sequentially refined [23, 24, 25, 26]. In these models an initial prediction is made of all the parts; in subsequent steps, all part predictions are refined based on the image and earlier part predictions. However, note that the predictions are initially independent of each other.
3 Chain Models for Structured Tasks
Chain models exploit the structure of the tasks they are designed to tackle by sequentially predicting their outputs. To capture this structure each output prediction is conditioned on all outputs predicted already. This philosophy has been exploited in language processing where sentences, expressed as word sequences, need to be predicted [2, 8] from inputs. In recent automatic image captioning work [4, 27], for example, a sentence $S$ is generated from an image $I$ by maximizing the likelihood $P(S \mid I)$. The chain rule is applied consecutively to model each output (here a word) given the image and all the previous outputs in the output sequence.
In computer vision, recognition problems, such as segmentation, detection and pose estimation, demonstrate rich structure with complex dependencies. In this work, we model this structure with a simple and efficient recognition machine that makes little to no assumptions about the structure, other than the ability of a neural network to model complex, incremental conditional distributions.
Mathematically, let $Y = \{Y_0, \ldots, Y_{T-1}\}$ be the $T$ objects to be detected, and let $X$ denote the input. For example, for the pose prediction problem, $Y_t$ is the location of the $t$-th body part. In video prediction problems, $Y_t$ is the location of an object in the $t$-th frame of a video. Using the chain rule we decompose $P(Y = y \mid X)$ as follows:
$$P(Y = y \mid X) = P(Y_0 = y_0 \mid X) \prod_{t=1}^{T-1} P(Y_t = y_t \mid X, y_0, \ldots, y_{t-1}) \tag{1}$$
From the above equation, we see that the likelihood of assigning value $y_t$ to the $t$-th variable is given by $P(Y_t = y_t \mid X, y_0, \ldots, y_{t-1})$, and depends on both the input $X$ as well as the assignment of previous variables. In this work, we model the likelihood with a convolutional neural network (CNN). The direct dependence of the current prediction on the ground truth values of previous variables allows for the model to capture all necessary relationships without making any assumption about the joint distribution of all the variables, other than assuming that each successive conditional distribution, $P(Y_t = y_t \mid X, y_0, \ldots, y_{t-1})$, can be computed with a neural network.
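The chain-rule decomposition can be illustrated with a small numerical sketch. The code below is only an illustration under stated assumptions: the callables `first` and `second` are hypothetical stand-ins for the learned conditional distributions, the input is left implicit, and the variables are binary rather than image locations. It shows that summing the per-step log-probabilities yields a properly normalized joint distribution without any independence assumption.

```python
import math

def chain_log_likelihood(conditionals, assignment):
    """Sum of log P(Y_t = y_t | y_0, ..., y_{t-1}) over the chain.

    conditionals[t] is a hypothetical callable that takes the tuple of
    previously assigned values and returns a dict {value: probability}.
    """
    total = 0.0
    for t, y_t in enumerate(assignment):
        dist = conditionals[t](tuple(assignment[:t]))
        total += math.log(dist[y_t])
    return total

# Toy conditionals for two binary variables.
def first(prev):
    return {0: 0.7, 1: 0.3}

def second(prev):
    # The second variable depends on the value assigned to the first one.
    return {0: 0.9, 1: 0.1} if prev[0] == 0 else {0: 0.2, 1: 0.8}

# Joint probability of every assignment, recovered from the chain.
joint = {
    (a, b): math.exp(chain_log_likelihood([first, second], [a, b]))
    for a in (0, 1)
    for b in (0, 1)
}
```

In the model described here, each conditional would instead be produced by a neural network that sees the image and the previously predicted locations.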
3.1 Chain Models for Single Images
In the case of single images, the input $X$ is the image while the $t$-th variable $Y_t$ can be, for example, the location of the $t$-th object in the image (see Fig. 2).
The probability of each step in the decomposition of Eq. (1) is defined through a hidden state $h_t$ at step $t$, which carries information about the input as well as states at previous steps. In addition it incorporates the values $y_0, \ldots, y_{t-1}$ from previous steps. The final probability for variable $Y_t$ is computed from the hidden state:
$$h_t = \sigma\Big(w^h_t * h_{t-1} + \sum_{i=0}^{t-1} w^y_{i,t} * e(y_i)\Big) \tag{2}$$
$$P(Y_t = y_t \mid X, y_0, \ldots, y_{t-1}) = \mathrm{Softmax}(m_t(h_t)) \tag{3}$$
In the above equations, the previous variables are first transformed through a full neural net $e(\cdot)$. Parameters $w^h_t$ and $w^y_{i,t}$ then linearly transform the previous hidden state and a function of previous output variables, $e(y_i)$, and a nonlinearity $\sigma$ is then applied to each dimension of this output. The nonlinearity of choice is a Rectified Linear Unit. Finally, $*$ denotes multiplication. In image applications, however, the hidden state can be a feature map and the prediction a location in the image. In such cases, $*$ denotes convolution and $e(\cdot)$ is a CNN. Note that, as long as we feed in just the last variable $y_{t-1}$ in this equation, the recurrent equation ensures that we condition on the entire history of joints. However, feeding in more of the previous joints makes it easier for the model to learn the conditional distributions directly. In the computation of the conditional probability of $Y_t$ from $h_t$ we use another neural net $m_t$, which produces scores for potential object locations. By applying a softmax function over these scores we convert them to a probability distribution over locations.
The initial state $h_0$ is computed based solely on the input $X$: $h_0 = \mathrm{CNN}_x(X)$.
This formulation is reminiscent of recurrent networks (RNNs): the equations define how to transform a state from one step to the next. We differ, however, from RNNs in one important aspect: the parameters in Eqs. (2) and (3) are not necessarily tied. Indeed, parameters $w^h_t$ and $w^y_{i,t}$ are indexed by the step $t$. This design choice is appropriate for tasks such as human pose estimation where the number of outputs is fixed and where each step is different from the rest. In other applications, e.g. video, we tie these parameters: $w^h_t = w^h$ and $w^y_{i,t} = w^y_i$, $\forall t$.
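A minimal sketch of one step of this recurrence is given below. It is not the implementation used in the experiments: small dense matrices stand in for the convolutions, and all names (`chain_step`, `w_h`, `w_y`, `decoder`) are hypothetical. Passing different weights at each step gives the untied variant; reusing the same weights at every step gives the tied, recurrent variant.

```python
import math

def relu(v):
    return [max(0.0, a) for a in v]

def matvec(w, v):
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def chain_step(h_prev, prev_embeddings, w_h, w_y):
    """New hidden state: relu(w_h h_prev + sum_i w_y[i] e(y_i)).

    prev_embeddings[i] plays the role of e(y_i), w_y[i] the weight
    applied to it at the current step.
    """
    acc = matvec(w_h, h_prev)
    for w_yi, e_yi in zip(w_y, prev_embeddings):
        acc = [a + b for a, b in zip(acc, matvec(w_yi, e_yi))]
    return relu(acc)

# Toy values: a 2-dimensional hidden state and one previous output.
h0 = [0.5, 1.0]
e_y0 = [1.0, 0.0]
w_h = [[1.0, 0.0], [0.0, 1.0]]
w_y = [[[0.0, 0.0], [-2.0, 0.0]]]
h1 = chain_step(h0, [e_y0], w_h, w_y)

# A linear stand-in for the decoder, scoring three candidate locations.
decoder = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs = softmax(matvec(decoder, h1))
```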
3.2 Chain Models for Videos
For videos, the input is a sequence of images (Fig. 2). Predictions are made at each step, as the images are fed in. At each step $t$, we make predictions for the image $X_t$ at that step, using the past images and the past output variables. Thus, we modify the equation for the hidden state as follows:
$$h_t = \sigma\Big(w^h_t * h_{t-1} + \sum_{i=0}^{t-1} w^y_{i,t} * e(y_i) + \mathrm{CNN}_x(X_t)\Big) \tag{4}$$
where, relative to the single-image update of Eq. (2), we add the term $\mathrm{CNN}_x(X_t)$, the features extracted from image $X_t$ using a CNN. The final probability is computed as in Eq. (3).
In videos we often need to predict the same type of information at each step, e.g. the location of all body joints of the person in the current frame. As such, the predictors can have the same weights. Thus, we tie the parameters $w^h_t$, $w^y_{i,t}$ and $m_t$ together across steps, which results in a convolutional RNN.
As before, the connections from the hidden state at the previous step guarantee that the prediction at each time step uses output variables from all previous steps, as long as the previous output variable $y_{t-1}$ is fed in at time $t$. However, feeding in a larger time horizon leads to an easier learning problem.
3.3 Improved Learning with Scheduled Sampling
So far, we have described the method as using the input and only ground truth values of the previous output variables when making a prediction for the next output variable. However, it has previously been observed that for sequence-to-sequence models overfitting can be mitigated by probabilistically substituting ground truth values of previous output variables with samples from the probability distribution predicted by the model [28]. One challenge that arises is that, at the start of training, the predicted probability distributions are wildly inaccurate and thus feeding in samples from the distribution is counterproductive. The authors of [28] propose a method, called scheduled sampling, that uses an annealing schedule that feeds in only the ground truth outputs at the start of training and increases the rate of sampling from the predictions of the model towards the end of training. We use the idea of scheduled sampling in our paper and find that it leads to improved results.
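The annealing idea can be sketched in a few lines. The linear schedule, the `floor` value and the function names below are assumptions made for illustration; they are not the schedule used in our experiments, which is not specified here.

```python
import random

def ground_truth_prob(step, total_steps, floor=0.25):
    """Probability of feeding the ground truth at a given training step.

    Decays linearly from 1.0 at step 0 down to `floor` at the last step
    (an assumed schedule, chosen only for illustration).
    """
    frac = min(max(step / total_steps, 0.0), 1.0)
    return 1.0 - (1.0 - floor) * frac

def choose_previous_output(gt_value, sample_from_model, step, total_steps, rng=random):
    """Return the ground truth or a model sample for the previous variable."""
    if rng.random() < ground_truth_prob(step, total_steps):
        return gt_value
    return sample_from_model()

# Early in training the ground truth is always used.
rng = random.Random(0)
early = [
    choose_previous_output("gt", lambda: "sample", 0, 1000, rng)
    for _ in range(20)
]
```

Late in training, a growing share of the previous outputs comes from `sample_from_model`, so the model is exposed to its own mistakes.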
4 Experimental Evaluation
To evaluate the proposed model, we apply it to human pose estimation, which is challenging and of great interest due to the complex relationships among body parts. In the single image case, we use the chain model to capture the structure of pose in space, i.e. how the location of a part influences others. For videos, our model captures the constraints and dynamics of the body pose in time.
4.0.1 Tasks and Datasets
For our single image experiments we use the MPII Human Pose dataset [29], which consists of about 40K instances of people performing various actions. All frames come with a maximum of 16 annotated joints (e.g. Top Head, Right Ankle, Left Knee, etc.). For the task of pose estimation in video we use the Penn Action dataset [30], which consists of 2326 video sequences of people performing various sports. All frames come with a maximum of 13 annotated joints. During evaluation, if a joint prediction lies within a predefined distance, proportional to the size of the person, from the ground truth location it is counted as a correct detection. This metric is called PCK [31, 29].
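The PCK criterion can be written down compactly. The sketch below is an illustration only: the proportionality factor `alpha` and the way the person size is measured are assumptions, since they differ between datasets, and joints without annotation are represented by `None` and skipped.

```python
import math

def pck(predictions, ground_truth, person_sizes, alpha):
    """Fraction of annotated joints predicted within alpha * person size.

    predictions[n][j] and ground_truth[n][j] are (x, y) locations of
    joint j for person n; a missing annotation is None.
    """
    correct = 0
    total = 0
    for pred, gt, size in zip(predictions, ground_truth, person_sizes):
        for p, g in zip(pred, gt):
            if g is None:
                continue
            total += 1
            if math.dist(p, g) <= alpha * size:
                correct += 1
    return correct / total if total else 0.0

# One person of size 20 with three joints, one of them not annotated.
preds = [[(10, 10), (50, 50), (0, 0)]]
truth = [[(12, 10), (80, 50), None]]
score = pck(preds, truth, person_sizes=[20], alpha=0.5)
```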
Our model is illustrated in Fig. 2. We experiment with two choices for $\mathrm{CNN}_x$, the network which encodes the input image. First, a shallow CNN which consists of six layers, each followed by a rectified linear unit [32] and Batch Normalization [33]. The first four layers include max pooling with stride 2, leading to an effective stride of 16. This network is described in Fig. 3. Second, we experiment with a deeper network of identical architecture to inception-v3 [34]. We discard the last convolutional layer of inception-v3 and connect the output to the decoder network.
The decoder network $m_t$ decodes the hidden state $h_t$ to a heatmap over possible locations of a single body part. This heatmap is converted to a probability distribution over locations using a softmax. The network consists of two towers of deconvolutional layers, each of which increases the width and height of the feature maps by a factor of 2. Note that the deconvolutional towers are multi-scale: in one layer, different filter sizes are used and combined together. This is similar to the inception model [7], with the difference that here it is applied with the deconvolution operation, and hence we call it deception.
4.1 Pose Estimation From a Single Image
In this application case, we use the chain model to predict the joints sequentially. The sequence with which the joints are processed is fixed and is motivated by the marginal distributions of the joints. In particular, we sort the joints in descending order according to the detection rates of an unchained feed forward net. This allows for the easy cases to be processed first (e.g. Torso, Head) while the harder cases (e.g. Wrist, Ankle) are processed last, and as a result use the contextual information from the joints predicted before them.
4.1.1 Inference
At test time, we use beam search to infer the optimal location of the joints. Note that exact inference is infeasible, due to the size of the search space (a total of $(n^2)^T$ possible solutions, where $n \times n$ is the size of the prediction heatmap and $T$ is the number of joints). At each step $t$, the best $B$ predictions are stored, where $B$ is the beam width and each prediction is the sequence of the first $t$ joints. The quality of a full body pose prediction is measured by its log-probability, which is the sum of the log-probabilities corresponding to the individual joint predictions.
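The beam search procedure can be sketched as follows. The interface `step_log_probs(t, prefix)` is a hypothetical stand-in for running the chain model on the joints chosen so far, and the two-step toy distribution is invented for illustration. It is chosen so that the greedy choice at the first step does not lead to the most probable full sequence.

```python
import math

def beam_search(step_log_probs, num_steps, beam_width):
    """Keep the beam_width best partial sequences at every step.

    step_log_probs(t, prefix) returns {location: log-probability} for
    step t given the tuple of previously chosen locations.
    """
    beam = [((), 0.0)]
    for t in range(num_steps):
        candidates = []
        for prefix, score in beam:
            for loc, log_p in step_log_probs(t, prefix).items():
                candidates.append((prefix + (loc,), score + log_p))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = candidates[:beam_width]
    return beam

def toy_model(t, prefix):
    # Step 0 slightly prefers "A", but "B" leads to a confident step 1.
    if t == 0:
        return {"A": math.log(0.6), "B": math.log(0.4)}
    if prefix[0] == "A":
        return {"A": math.log(0.5), "B": math.log(0.5)}
    return {"A": math.log(0.9), "B": math.log(0.1)}

greedy = beam_search(toy_model, num_steps=2, beam_width=1)
wider = beam_search(toy_model, num_steps=2, beam_width=2)
```

With a beam width of 1 the search commits to "A" at the first step, whereas a width of 2 recovers the sequence ("B", "A"), which has the higher total log-probability.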
An exact implementation of the chain rule conditions on predictions made at every step. Alternatively, one could skip the non-differentiable sampling operation and use the probability distributions directly. Even though this is not an exact application of the chain rule, it allows for the gradients to flow back to the output of each task. We found that this approximation led to very similar performance; it slowed down training by a factor of 3 while speeding up inference.
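Table 1: PCKh performance on the MPII validation set of our chain model and our baseline variants.
PCKh (%)  Torso  Head  Shldr  Elbow  Wrist  Hip  Knee  Ankle  Mean
Base network  86.8  91.9  85.8  74.5  69.0  71.1  61.4  50.6  73.9
Base network, single deconvolution  86.0  91.7  85.1  72.9  68.0  69.4  59.7  48.5  72.6
Very deep network (24 layers)  88.1  92.0  86.1  74.1  67.7  73.7  64.7  58.0  75.6
Chain model  86.8  93.2  88.3  79.4  74.6  77.8  71.4  65.2  79.6
Chain model, multi-crop  88.7  94.4  90.0  82.6  78.6  80.2  74.8  68.4  82.2
Oracle chain model  87.2  95.9  93.4  83.3  82.3  95.2  77.6  72.3  85.9
Inception base network, multi-crop  91.1  95.0  90.2  81.0  77.4  77.2  73.7  64.6  81.3
Inception chain model, multi-crop  91.7  95.7  92.2  85.3  82.2  82.9  80.0  72.4  85.3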
4.1.2 Learning details
We use an SGD solver with momentum to learn the model parameters by optimizing the loss. The loss for one image $X$ is defined as the sum of losses for individual joints. The loss for the $t$-th joint is the cross entropy between the predicted probability over locations of the joint and the ground-truth probability $P^t_{gt}$. The former is defined based on the heatmap output by $m_t$ for the $t$-th joint: $P_t(l \mid h_t) = \mathrm{Softmax}(m_t(h_t))_l$ for each location $l$. The latter is defined based on a distance $r$: all locations within radius $r$ of the ground-truth joint location are assigned the same nonzero probability $1/N$, and all other locations are assigned probability $0$. $N$ is a normalizer guaranteeing that $P^t_{gt}$ is a probability distribution.
The final loss for $X$ reads as follows:
$$L(X) = -\sum_{t=0}^{T-1} \sum_{l} P^t_{gt}(l) \log P_t(l \mid h_t) \tag{5}$$
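The per-joint loss can be sketched as follows, under stated assumptions: heatmaps are stored as dictionaries keyed by pixel location, the heatmap size and radius in the example are arbitrary, and the function names are hypothetical. The loss for one image is then the sum of `joint_cross_entropy` over all joints.

```python
import math

def ground_truth_distribution(height, width, joint_xy, radius):
    """Uniform distribution over all pixels within `radius` of the joint."""
    gx, gy = joint_xy
    inside = [
        (x, y)
        for y in range(height)
        for x in range(width)
        if math.hypot(x - gx, y - gy) <= radius
    ]
    n = len(inside)  # the normalizer
    return {loc: 1.0 / n for loc in inside}

def joint_cross_entropy(scores, gt_dist):
    """Cross entropy between gt_dist and the softmax of `scores`.

    scores maps every heatmap location (x, y) to an unnormalized score.
    """
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return -sum(p * (scores[loc] - log_z) for loc, p in gt_dist.items())

# A 5x5 heatmap with uniform scores and a joint at the centre.
gt = ground_truth_distribution(5, 5, joint_xy=(2, 2), radius=1)
flat_scores = {(x, y): 0.0 for y in range(5) for x in range(5)}
loss = joint_cross_entropy(flat_scores, gt)
```

With uniform scores the predicted distribution assigns 1/25 to each location, so the loss equals log 25 regardless of where the joint lies.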
We use a batch size of 16 and an initial learning rate of 0.003 that was decayed every 100K steps (50K for the inception model). The ground-truth distribution uses a fixed radius. The model was trained for 120K iterations (55K for the inception model). Input images are rescaled to a fixed resolution, which differs for the inception model. The weights of the network are initialized by sampling from a normal distribution of zero mean and 0.01 standard deviation. For the inception model, we initialize the weights of the encoder with weights from an ImageNet model.
4.1.3 Results
Table 1 shows the PCKh performance on the MPII validation set of our chain model and our baseline variants.
Rows 1, 2 and 3 show the performance of pure feed-forward networks for the task in question. The 1st row shows the performance of a 9-layer network, the shallow encoder $\mathrm{CNN}_x$ followed by the decoder $m_t$, which we call the base network. The 2nd row is a similar network, where each deconvolutional tower (which we call deception) in the decoder is replaced by a single deconvolution. The difference in performance shows that multi-scale deconvolutions lead to a better and very competitive baseline. Finally, the 3rd row shows the performance of a very deep network consisting of 24 layers. This network has the same number of parameters and the same depth as our chain model and serves as the baseline which we improve upon using the chain model.
Row 4 shows the performance of our chain model. This model improves significantly over all the baselines. The biggest gains are observed for Wrists and Ankles, which is a clear indication that conditioning on the predictions of previous joints provides cues for better localization.
Row 5 shows the performance of the chain model with multi-crop evaluation, where at test time we average the predictions from flipping and jittering of the input image.
Row 6 shows the performance of an oracle chain model. For this model, at each step we use the oracle (ground truth) locations of all previous joints. This model is an estimate of the upper bound performance of our chain model, as it predicts the location of a joint given perfect knowledge of the location of all other joints which precede it in the sequence.
Row 7 shows the performance of the inception base network, an encoder followed by the decoder $m_t$, where the encoder $\mathrm{CNN}_x$ is inception-v3 [34]. We observe significant gains when using the inception-v3 architecture compared to a shallower 6-layer network for the encoder network, at the expense of more computation.
Row 8 shows the performance of the inception chain model. For both the inception base and chain model we use multi-crop evaluation. In both cases, the inception-v3 parameters were initialized with weights from an ImageNet model. The inception chain model leads to significant gains compared to its base network (row 7). The improvements are most evident for the Wrist, Knee and Ankle joints.
4.1.4 Error Analysis
Digging deeper into the models, we perform an error analysis for the base network (the encoder $\mathrm{CNN}_x$ followed by the decoder $m_t$), the very deep network and our chain model. For this analysis, the 6-layer encoder network is used for all models. Similar to [35], we categorize the erroneous predictions into three distinct classes: a) localization error, i.e. the prediction misses the PCKh threshold but still lies near the true location, b) confusion with other joints, i.e. the prediction lies near a different joint, and c) confusion with the background, i.e. the prediction lies somewhere else in the image. According to PCKh, a prediction is correct if it falls within a threshold distance of the true location that is proportional to the head size.
Fig. 4 shows the error analysis for the hardest joints, namely Wrist and Ankle. Each plot consists of three sets of bars: the rates for localization error, confusion with other joints and confusion with the background. According to the plots, the chain model reduces the misses due to confusion with other joints and the background. For Wrists, the confusion with other joints is the dominating error mode, and further analysis shows that the confusion comes mainly from the opposite wrist and then from the nearby joints. For Ankles, the biggest error mode comes from confusion with the background, which is not surprising since lower legs are usually heavily occluded and lack strong appearance cues.
Fig. 5 shows some examples of our predictions on the MPII dataset.
4.1.5 Comparison to Other Approaches
We evaluate our approach on the MPII test set and compare to other methods on the task of pose estimation from a single image. Table 2 shows the results of our approach and other leading methods in the field. We show the performance of both versions of our chain model, using a shallow 6-layer encoder as well as the inception-v3 architecture. For the shallow chain model, we ensemble two chain models trained at different input scales. For the inception chain model, no ensembling was performed.
The leading approaches by Wei et al. [26] and Newell et al. [36] rely on iteratively refining predictions. In particular, predictions are made initially for all joints independently. These predictions, which are quite poor (see [26]), are fed subsequently into a network for further refinement. Our approach produces only one set of predictions via a single chain model and does not refine them further. One could combine the two ideas, the one of chained predictions and the one of iterative refinement, to achieve better results.
Table 2: PCKh performance on the MPII test set of our approach and other leading methods.
Method  Head  Shoulder  Elbow  Wrist  Hip  Knee  Ankle  Total
Carreira et al. [25]  95.7  91.7  81.7  72.4  82.8  73.2  66.4  81.3
Tompson et al. [37]  96.1  91.9  83.9  77.8  80.9  72.3  64.8  82.0
Hu & Ramanan [38]  95.0  91.6  83.0  76.6  81.9  74.5  69.5  82.4
Pishchulin et al. [39]  94.1  90.2  83.4  77.3  82.6  75.7  68.6  82.4
Lifshitz et al. [40]  97.8  93.3  85.7  80.4  85.3  76.6  70.2  85.0
Wei et al. [26]  97.8  95.0  88.7  84.0  88.4  82.8  79.4  88.5
Newell et al. [36]  97.6  95.4  90.0  85.2  88.7  85.0  80.6  89.4
Chain model  93.8  91.8  84.2  79.4  84.4  77.9  70.7  84.1
Inception chain model  97.9  93.2  86.7  82.1  85.2  81.5  74.0  86.1
4.2 Pose Estimation From Videos
Our chain models in time are described in Eq. (4) and illustrated in Fig. 2. Here, the task is to localize body parts in time across video frames. The output variables from the joints of the previous frames are used as inputs to make a prediction for the joints in the current frame. We apply the chaining in two different ways: first, only in time, where each joint is predicted independently of the other joints (as in our baseline models) but chaining is done in time, and second, with chaining both in time and in joints.
4.2.1 Pose Estimation in Time
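Table 3: PCK performance on the Penn Action test set.
PCK (%)  Head  Shldr  Elbow  Wrist  Hip  Knee  Ankle  Mean
Previous work [41]  64.2  55.4  33.8  24.4  56.4  54.1  48.0  48.0
Base network (per frame)  94.1  90.3  84.2  83.5  88.7  87.2  87.7  87.5
Base network + temporal smoothing  93.1  91.8  85.7  78.8  90.2  91.9  91.1  88.6
Convolutional RNN  95.3  92.5  87.9  87.5  91.1  89.8  90.1  90.1
Chain model in time, horizon of 1 frame  95.8  93.2  88.9  89.6  91.3  89.8  91.2  91.0
Chain model in time, horizon of 3 frames  95.8  94.1  90.0  90.2  91.3  90.6  91.8  91.7
Chain model in time and in joints  95.6  93.8  90.4  90.7  91.8  90.8  91.5  91.8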
As shown in Fig. 2, the chain model sequentially processes the video frames. The predictions at the previous time steps are used through a recurrent module in order to make a prediction at the current time step. Again, we use a heatmap to encode the location of a part in the frame.
The details of our learning procedure are identical to the ones described for the single image case. The only difference is that each training example is now a sequence of images, each of which has a ground-truth pose. Thus, the loss for a sequence is the sum of the losses for its frames. Each frame loss is defined as in the case of a single image (see Eq. (5)).
We train our model for 120K iterations using SGD with momentum of 0.9, a batch size of 6 and a learning rate of 0.003 with a step decay of 100K. Images are rescaled to a fixed resolution. A relative radius is used for the loss. The weights are initialized randomly from a normal distribution with zero mean and standard deviation of 0.01.
Table 3 shows the performance on the Penn Action test set. For consistency with previous work on the dataset [41], a prediction is considered correct if it lies within a distance threshold determined by $h$ and $w$, the height and width, respectively, of the instance in question. We refer to this metric as PCK. (Note that this is a weaker criterion than the one used on the MPII dataset.) We show the per-frame performance, as produced by a base network (encoder followed by decoder) trained to predict the location of the joints at each frame. We also provide results after applying temporal smoothing to the predictions via the Viterbi algorithm, where the transition function is the Euclidean distance of the same joints in two neighboring frames. Additionally, we show the performance of a convolutional RNN with $w^y_{i,t} = 0$, $\forall i, t$ in Eq. (4). This model corresponds to a standard convolutional RNN where the output variables of the previous time steps are not connected to the hidden state. All networks have roughly the same number of parameters, to ensure a fair comparison. For our chain model in time, we show results for two choices of time horizon: one frame, where predictions of only the previous time step are considered, and three frames, where predictions of the past 3 frames are considered at each time step. Finally, we show the performance of a chain model in time and in joints.
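The temporal smoothing baseline can be sketched with a small Viterbi routine for a single joint. Only the use of the Euclidean distance between neighboring frames comes from the description above; the candidate scores, the trade-off `weight` and the function name are assumptions made for illustration.

```python
import math

def viterbi_smooth(candidates, weight=1.0):
    """Pick one candidate per frame for a single joint.

    candidates[f] is a list of ((x, y), score) pairs for frame f. The
    returned path maximizes the sum of scores minus `weight` times the
    Euclidean distance between locations in neighboring frames.
    """
    best = [score for _, score in candidates[0]]
    back = []
    for f in range(1, len(candidates)):
        new_best, pointers = [], []
        for loc, score in candidates[f]:
            options = [
                best[k] - weight * math.dist(loc, prev_loc)
                for k, (prev_loc, _) in enumerate(candidates[f - 1])
            ]
            k_star = max(range(len(options)), key=options.__getitem__)
            new_best.append(options[k_star] + score)
            pointers.append(k_star)
        best = new_best
        back.append(pointers)
    # Trace the best path backwards from the last frame.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for pointers in reversed(back):
        idx = pointers[idx]
        path.append(idx)
    path.reverse()
    return [candidates[f][i][0] for f, i in enumerate(path)]

# The middle frame has a spurious, higher-scoring detection far away.
frames = [
    [((10, 10), 0.9), ((90, 90), 0.1)],
    [((11, 10), 0.4), ((90, 90), 0.6)],
    [((12, 10), 0.9), ((90, 90), 0.1)],
]
smoothed = viterbi_smooth(frames, weight=0.01)
per_frame = [max(c, key=lambda item: item[1])[0] for c in frames]
```

The per-frame argmax jumps to the distant detection in the middle frame, while the smoothed path stays on the consistent trajectory.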
We compare to previous work on the Penn Action dataset [41]. This model uses action specific pose models, with shallow handcrafted features, and improves upon Yang & Ramanan [31].
We observe a gain in performance compared to the per-frame CNN as well as the RNN across all joints. Interestingly, chain models show bigger improvements for arms compared to legs. This is due to the fact that the people in the videos play sports which involve big arm movements, while the legs are mostly unoccluded and less kinematic. In addition, we see that a time horizon of three frames leads to better performance, which is not surprising since the model makes a decision about the location of the joints at the current time step based on observations from 3 past frames. We did not observe additional gains for longer time horizons. Chaining in time and in joints does not improve performance much further, possibly due to the already high accuracy achieved by the chain model in time.
Fig. 6 shows examples of predictions by our chain model on the Penn Action dataset. We also show the predictions made by the per-frame detector. We see that the chain model is able to disambiguate right-left confusions, which occur often due to the constant motion of the person while performing actions, while the per-frame detector switches very often between erroneous detections.
5 Conclusions
In this paper, motivated by sequence-to-sequence models, we show how chained predictions can lead to a powerful tool for structured vision tasks. Chain models allow us to sidestep any assumptions about the joint distribution of the output variables, other than the capacity of a neural network to model conditional distributions. We support this point experimentally by showing top-performing results on the task of pose estimation from images and videos.
References
 [1] Daumé III, H., Langford, J., Marcu, D.: Search-based structured prediction. Machine Learning 75(3) (2009) 297–325
 [2] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS. (2014)
 [3] Chan, W., Jaitly, N., Le, Q.V., Vinyals, O.: Listen, attend and spell. arXiv preprint arXiv:1508.01211 (2015)
 [4] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3156–3164
 [5] Dosovitskiy, A., Tobias Springenberg, J., Brox, T.: Learning to generate chairs with convolutional neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 1538–1546
 [6] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. (2015) 3431–3440
 [7] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., Rabinovich, A.: Going deeper with convolutions. In: CVPR. (2015)
 [8] Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473 (2014)
 [9] Vinyals, O., Kaiser, Ł., Koo, T., Petrov, S., Sutskever, I., Hinton, G.: Grammar as a foreign language. In: Advances in Neural Information Processing Systems. (2015) 2755–2763
 [10] Vinyals, O., Bengio, S., Kudlur, M.: Order Matters: Sequence to sequence for sets. ArXiv eprints (November 2015)
 [11] Goldberg, Y., Elhadad, M.: An efficient algorithm for easyfirst nondirectional dependency parsing. In: Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Association for Computational Linguistics (2010) 742–750
 [12] Ross, S., Gordon, G.J., Bagnell, J.A.: A Reduction of Imitation Learning and Structured Prediction to NoRegret Online Learning. ArXiv eprints (November 2010)
 [13] Felzenszwalb, P.F., Huttenlocher, D.P.: Pictorial structures for object recognition. International Journal of Computer Vision 61(1) (2005) 55–79
 [14] Ramanan, D.: Learning to parse images of articulated bodies. In: NIPS. (2006)
 [15] Andriluka, M., Roth, S., Schiele, B.: Pictorial structures revisited: People detection and articulated pose estimation. In: CVPR. (2009)
 [16] Eichner, M., Ferrari, V.: Better appearance models for pictorial structures. (2009)
 [17] Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixturesofparts. In: CVPR. (2011)
 [18] Johnson, S., Everingham, M.: Learning effective human pose estimation from inaccurate annotation. In: CVPR. (2011)
 [19] Tian, Y., Zitnick, C.L., Narasimhan, S.G.: Exploring the spatial hierarchy of mixture models for human pose estimation. In: ECCV. (2012)
 [20] Wang, F., Li, Y.: Beyond physical connections: Tree models in human pose estimation. In: CVPR. (2013)
 [21] Sapp, B., Taskar, B.: Modec: Multimodal decomposable models for human pose estimation. In: CVPR. (2013)
 [22] Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Poselet conditioned pictorial structures. In: CVPR. (2013)
 [23] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: CVPR. (2014)
 [24] Ramakrishna, V., Munoz, D., Hebert, M., Bagnell, J.A., Sheikh, Y.: Pose machines: Articulated pose estimation via inference machines. In: Computer Vision–ECCV 2014. Springer (2014) 33–47
 [25] Carreira, J., Agrawal, P., Fragkiadaki, K., Malik, J.: Human pose estimation with iterative error feedback. (2015)
 [26] Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. CVPR (2016)
 [27] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A.C., Salakhutdinov, R., Zemel, R.S., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. CoRR abs/1502.03044 (2015)
 [28] Bengio, S., Vinyals, O., Jaitly, N., Shazeer, N.: Scheduled sampling for sequence prediction with recurrent neural networks. In: NIPS. (2015)
 [29] Andriluka, M., Pishchulin, L., Gehler, P., Schiele, B.: 2d human pose estimation: New benchmark and state of the art analysis. In: CVPR. (2014)
 [30] Zhang, W., Zhu, M., Derpanis, K.: From actemes to action: A stronglysupervised representation for detailed action understanding. In: ICCV. (2013)
 [31] Yang, Y., Ramanan, D.: Articulated human detection with flexible mixturesofparts. PAMI (2012)
 [32] Nair, V., Hinton, G.E.: Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML10). (2010) 807–814
 [33] Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
 [34] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. CoRR abs/1512.00567 (2015)
 [35] Hoiem, D., Chodpathumwan, Y., Dai, Q.: Diagnosing error in object detectors. In: ECCV. (2012)
 [36] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. CoRR abs/1603.06937 (2016)
 [37] Tompson, J., Goroshin, R., Jain, A., LeCun, Y., Bregler, C.: Efficient object localization using convolutional networks. In: CVPR. (2015)
 [38] Hu, P., Ramanan, D.: Bottomup and topdown reasoning with hierarchical rectified gaussians. CVPR (2016)
 [39] Pishchulin, L., Insafutdinov, E., Tang, S., Andres, B., Andriluka, M., Gehler, P., Schiele, B.: Deepcut: Joint subset partition and labeling for multi person pose estimation. CVPR (2016)
 [40] Lifshitz, I., Fetaya, E., Ullman, S.: Human pose estimation using deep consensusvoting. CoRR abs/1603.08212 (2016)
 [41] Xiaohan Nie, B., Xiong, C., Zhu, S.C.: Joint action recognition and pose estimation from video. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR). (June 2015)