Deep Deformation Network for Object Landmark Localization
Abstract
We propose a novel cascaded framework, namely deep deformation network (DDN), for localizing landmarks in non-rigid objects. The hallmarks of DDN are its incorporation of geometric constraints within a convolutional neural network (CNN) framework, ease and efficiency of training, as well as generality of application. A novel shape basis network (SBN) forms the first stage of the cascade, whereby landmarks are initialized by combining the benefits of CNN features and a learned shape basis to reduce the complexity of the highly nonlinear pose manifold. In the second stage, a point transformer network (PTN) estimates local deformation, parameterized as a thin-plate spline transformation, for finer refinement. Our framework incorporates neither hand-crafted features nor part connectivity, which enables an end-to-end shape prediction pipeline during both training and testing. In contrast to prior cascaded networks for landmark localization that learn a mapping from feature space to landmark locations, we demonstrate that the regularization induced through geometric priors in the DDN makes it easier to train, yet produces superior results. The efficacy and generality of the architecture are demonstrated through state-of-the-art performance on several benchmarks for multiple tasks such as facial landmark localization, human body pose estimation and bird part localization.
Keywords:
Landmark localization, convolutional neural network, non-rigid shape analysis
1 Introduction
Consistent localization of semantically meaningful landmarks or keypoints in images forms a precursor to several important applications in computer vision, such as face recognition, human body pose estimation or 3D visualization. However, it remains a significant challenge due to the need to handle non-rigid shape deformations, appearance variations and occlusions. For instance, facial landmark localization must handle not only coarse variations such as head pose and illumination, but also finer ones such as expressions and skin tones. Human body pose estimation introduces additional challenges in the form of large layout changes of parts due to articulations. Objects such as birds display tremendous variations in both appearance across species, as well as shape within the same species (for example, a perched bird as opposed to a flying one), which renders accurate part localization a largely open problem.
Consequently, a wide range of approaches has been proposed, starting with those based on PCA shape constraints (such as active shape models [1], active appearance models [2] and constrained local models [3]) and pictorial structures (such as DPM [4, 5] and poselets [6]). In recent years, the advent of convolutional neural networks (CNNs) has led to significant gains in feature representation [7]. In particular, cascaded regression networks specifically designed for problems such as facial landmark localization [8, 9, 10] or human body pose estimation [11] have led to improvements by exploiting problem structure at coarse and fine levels. But challenges for such frameworks have been the need for careful design and initialization, the difficulty of training complex cascades, and the absence of learned geometric relationships.
In this paper, we propose a novel cascaded framework, termed the deep deformation network (DDN), that also decomposes landmark localization into coarse and fine localization stages. But in contrast to prior works, we do not train cascaded regressors to learn a mapping between CNN features and landmark locations. Rather, the stages in our cascade explicitly account for the geometric structure of the problem within the CNN framework. We postulate that this has three advantages. First, our framework is easier to train and test than previous cascaded regressors, which require proper initialization and necessitate training a battery of individual CNNs as subproblems. Second, incorporating geometric structure at both the coarse and fine levels regularizes the learning in each stage of the cascade by acting as a spatial prior. Third, our cascade structure is general, yet yields higher accuracy by learning part geometries and avoiding hard-coded connections of parts. These advantages are illustrated in Fig. 1.
Specifically, in Section 3, we propose two distinct mechanisms to inject geometric knowledge into the problem of landmark localization. First, in Section 3.2, we propose a shape basis network (SBN) to predict the optimal shape that lies on a low-rank manifold defined by the training samples. Our hypothesis is that the shapes or landmarks of each object type reside close to a shape space for which a low-rank decomposition reduces representation complexity and acts as a regularizer for learning. Note that unlike DPM, we do not define geometric connections among parts prior to localization; rather, these relationships are learned. Further, even cascaded CNN frameworks such as [8] train individual CNNs for predefined relative localization of groups of parts within the first level of the cascade, which serves as initialization for later stages. Our SBN avoids such delicate considerations in favor of a learned basis that provides good global initializations. Second, in Section 3.3, we propose a point transformer network (PTN) that learns the optimal local deformation, in the form of a thin-plate spline (TPS) transformation, that maps the initialized landmarks to their final positions.
A notable feature of our framework is its generality. Prior works explicitly design network structures to handle shape and appearance properties specific to object types such as faces, human bodies or birds. In contrast, our insights are quite general: a shape basis representation is suitable for regularized learning of a global initialization in a CNN framework, after which local deformations through learned TPS transformations can finely localize landmarks. We demonstrate this generality in Section 5 with extensive experiments on landmark localization that achieve state-of-the-art accuracy for three distinct object types, namely faces, human bodies and birds, on several different benchmarks. We use the same CNN architectures for each of these experiments, with identical training mechanisms that are novel but straightforward.
To summarize, the main contributions of this paper are:
- A novel cascaded CNN framework, called the deep deformation network, for highly accurate landmark localization.
- A shape basis network that learns a low-rank representation of global object shape.
- A point transformer network that learns local non-rigid transformations for fine deformations, using the SBN output as initialization.
- A demonstration of the ease and generality of the deep deformation network through state-of-the-art performance for several object types.
2 Related Work
Facial landmark localization Facial landmark localization, or face alignment, is well-studied in computer vision. Models that impose a PCA shape basis have been proposed as Active Shape Models [1], with variants that account for holistic appearance [2] and local patch appearance [3]. The nonconvexity of the problem has been addressed through better optimization strategies that improve modeling of either the shape [12, 13, 14, 15] or the appearance [16, 17, 18]. Exemplar consensus [19, 20] and graph matching [21] show high localization accuracy. Regression-based methods [22, 23, 24] that directly learn a mapping from the feature space to the landmark coordinates have been shown to perform better. Traditional regression-based methods rely on hand-crafted features, for example, shape-indexed features [22], SIFT [24] or local binary features [25]. Subsequent works such as [26, 27] have also improved efficiency. The recent success of deep networks has inspired cascaded CNNs that jointly optimize over several facial parts [8]. Variants based on coarse-to-fine autoencoders [10, 28] and multi-task deep learning [9, 29] have been proposed to further improve performance. In contrast, the cascade stages in our deep deformation network do not require careful design or initialization, and they explicitly account for both coarse and fine geometric transformations.
Human body pose estimation Estimating human pose is more challenging due to greater articulation. Pictorial structures is one of the early influential models for representing human body structure [30]. The deformable part model (DPM) achieved significant progress in human body detection by combining pictorial structures with strong template features and latent-SVM learning [4]. Yang and Ramanan extend the model by incorporating body part patterns [5], while Wang and Li propose a tree-structured learning framework that performs better than hand-crafted part connections [31]. Pishchulin et al. [32] apply poselets [6] to generate mid-level features that regularize pictorial structures. Chen and Yuille [33] propose dependent pairwise relations with a graphical model for articulated pose estimation. Deep neural network based methods have yielded better performance in this domain too. Toshev and Szegedy [11] propose cascaded CNN regressors, Tompson et al. [34] propose joint training of a CNN and a graphical model, and Fan et al. [35] propose a dual-source deep network that combines local appearance with a holistic view. In contrast, our DDN also effectively learns part relationships, while being easier to train and more efficient to evaluate.
Bird part localization Birds display significant appearance variations between classes and shape variations within the same class. An early work that incorporates a probabilistic model and user responses to localize bird parts is presented in [36]. Chai et al. [37] apply symbiotic segmentation for part detection. The exemplar-based model of [38], similar to [19], enforces pose and subcategory consistency to localize bird parts. Recently, CNN-based methods, for example part-based R-CNN [39] and Deep LAC [40], have demonstrated significant performance improvements.
General pose estimation While the above works focus on a specific object domain, a few methods have been proposed towards pose estimation for general object categories. As a general framework, DPM has also been shown to be effective beyond human bodies for facial landmark localization [41]. A successful example of more general pose estimation is the regressionbased framework of [42] and its variants such as [43, 44]. However, such methods are sensitive to initialization, which our framework avoids through an effective shape basis network.
Learning transformations with CNNs Agrawal et al. use a Siamese network to predict discretized rigid ego-motion transformations, formulated as a classification problem [45]. Razavian et al. [46] analyze the generation of spatial information with CNNs, of which our SBN and PTN are specific examples that design spatial constraints. Our point transformer network is inspired by the spatial transformer network of [47]. Similar to WarpNet [48], we move beyond the motivation of the spatial transformer as an attention mechanism driven by a classification objective, predicting instead a non-rigid transformation for geometric alignment. In contrast to WarpNet, we exploit both supervised and synthesized landmarks and use the point transformer network only for finer local deformations, relying on the earlier stage of the cascade (the shape basis network) for global alignment.
3 Proposed Method
In this section, we present a general deep network to efficiently and accurately localize object landmarks. As shown in Fig. 2, the network is composed of three components:
- A modified VGG network [7] to extract discriminative features.
- A shape basis network (SBN) that combines a set of shape bases, using weights generated from convolutional features, to approximately localize the landmarks.
- A point transformer network (PTN) for local refinement using a TPS transformation.
The entire network is trained end-to-end. We now introduce each of the above components in detail.
3.1 Convolutional Feature Extraction
For feature extraction, we adopt the VGG-16 network [7], which achieves state-of-the-art performance in various applications [49, 50]. The upper-left corner of Fig. 2 shows the network structure, where each stage consists of a convolutional layer followed by a ReLU unit. We apply a stride of 2 across the network. Similar to most localization algorithms, our network takes as input a region of interest cropped by an object detector. We scale input images to fixed resolutions, one for facial landmark localization and another for body and bird pose estimation. Compared to classification and detection tasks, landmark localization requires extracting much finer, lower-level image information. Therefore, we remove the last stage from the original five-stage VGG-16 network of [7]. We also experimented with using just the first three stages, but this performs worse than using four stages. In addition, we remove the pooling layers, since we find they harm localization performance. We hypothesize that the shift-invariance achieved by pooling layers is helpful for recognition tasks, but that it is beneficial to keep the learned features shift-sensitive for keypoint localization. Given an input image, the four-stage convolutional layers generate a multi-channel response map.
3.2 Shape Basis Network
Let X denote the features extracted by the convolutional layers. Each training image is annotated with up to n 2D landmarks, denoted S^*. To predict landmark locations, previous works such as [8, 9] learn a direct mapping between CNN features X and ground-truth landmarks S^*. Despite the success of these approaches, learning a vanilla regressor has the limitations alluded to in Sec. 1. First, a single linear model is not powerful enough to handle large shape variations. Although cascaded regression can largely improve performance, a proper initialization is required, which is nontrivial. Second, with limited data, training a large-capacity network without the regularization of geometric constraints entails a high risk of overfitting.
Both of the above limitations are effectively addressed by our Shape Basis Network (SBN), which predicts an optimal object shape that lies in a low-dimensional manifold defined by the training samples. Intuitively, while CNNs allow learning highly descriptive feature representations, frameworks such as active shape models [1] have historically been effective in learning with nonlinear pose manifolds. Our SBN provides an end-to-end trainable framework that combines these complementary advantages. Moreover, for more challenging tasks such as the human body, the highly articulated structure cannot easily be represented by multi-scale autoencoders or detection [10, 28]. The SBN's compact basis representation alleviates the problem, while retaining accuracy and low cost.
More specifically, the SBN predicts the shape as

S_0 = \bar{S} + B\,\phi(X; \Theta),   (1)

where \bar{S} \in \mathbb{R}^{2n} is the mean shape over all training inputs, the columns of B \in \mathbb{R}^{2n \times k} store the top-k orthogonal PCA bases, and \phi is a nonlinear mapping, parameterized by \Theta, that takes the CNN feature X as input and generates the k basis weights as output. We choose k to preserve most of the energy in the covariance matrix of the training inputs. As shown in the upper-right corner of Fig. 3a, the mapping \phi is represented by stacking two fully connected layers, where the first layer encodes each input as an intermediate vector, which is further reduced to k dimensions by the second one. Conceptually, we jointly train the SBN with the other network components in an end-to-end manner. During backward propagation, given the gradient \partial L / \partial S_0, the gradient with respect to \phi is available as \partial L / \partial \phi = B^\top (\partial L / \partial S_0). We then propagate this gradient back to update the parameters of the fully connected layers as well as the lower convolutional layers.
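As a concrete illustration of the shape-basis idea in Eq. (1), the following is a minimal NumPy sketch. All sizes and data are synthetic stand-ins; in the actual SBN the coefficients come from fully connected layers on CNN features, whereas here we substitute the optimal least-squares projection to show the role of the mean shape and basis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: n landmarks per shape, m training shapes, k bases.
n, m, k = 68, 500, 10
shapes = rng.normal(size=(m, 2 * n))   # stand-in for vectorized training shapes

# Mean shape and orthogonal PCA basis B (columns = top-k components).
mean_shape = shapes.mean(axis=0)
centered = shapes - mean_shape
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
B = Vt[:k].T                           # (2n, k), orthonormal columns

# Eq. (1): the network maps CNN features to coefficients w; here we
# substitute the optimal least-squares coefficients for one sample.
target = shapes[0]
w = B.T @ (target - mean_shape)        # orthonormality makes projection trivial
pred = mean_shape + B @ w

# The prediction is the closest point to `target` in the affine subspace
# spanned by the mean shape and the k bases.
residual = np.linalg.norm(pred - target)
print(residual <= np.linalg.norm(target - mean_shape))
```

The low-rank projection can only shrink the distance to the mean, which is why the SBN output serves as a regularized initialization rather than a final prediction.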
In practice, we find it advantageous to pretrain the SBN on a simpler task and fine-tune it with the later stages of the cascade. This shares its motivation with curriculum learning [51] and avoids the difficulties of training the whole network from scratch. Given the PCA shape model (\bar{S} and B) and the set of training images \{(X_i, S_i^*)\}, we pretrain the SBN to seek the optimal embedding \phi such that the Euclidean distance between the prediction and the ground truth is minimized (strictly speaking, the loss is defined over a mini-batch of images), that is,

\min_{\Theta} \sum_i \| \bar{S} + B\,\phi(X_i; \Theta) - S_i^* \|_2^2 + \lambda \| \phi(X_i; \Theta) \|_2^2,   (2)

where \lambda is a regularization factor that penalizes coefficients with large norm and is fixed across all experiments. To solve (2), we propagate back the gradient with respect to \phi,

\frac{\partial L}{\partial \phi} = 2 B^\top (\bar{S} + B\,\phi - S^*) + 2 \lambda\, \phi,   (3)

to update the parameters of the fully connected layers and the lower convolutional layers.
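The gradient of the pretraining loss (2) with respect to the basis coefficients, Eq. (3), can be sanity-checked numerically. A minimal NumPy sketch under illustrative sizes (the regularizer weight and all data are stand-ins):

```python
import numpy as np

rng = np.random.default_rng(1)
n, k, lam = 5, 3, 0.1                             # illustrative sizes and weight
B = np.linalg.qr(rng.normal(size=(2 * n, k)))[0]  # orthonormal basis columns
mean_shape = rng.normal(size=2 * n)
y = rng.normal(size=2 * n)                        # ground-truth landmark vector
w = rng.normal(size=k)                            # basis coefficients

def loss(w):
    # One-sample version of the pretraining objective (2).
    r = mean_shape + B @ w - y
    return r @ r + lam * (w @ w)

# Analytic gradient from Eq. (3): 2 B^T (mean + B w - y) + 2 lam w
grad = 2 * B.T @ (mean_shape + B @ w - y) + 2 * lam * w

# Central finite-difference check, one coordinate at a time.
eps = 1e-6
num = np.array([(loss(w + eps * e) - loss(w - eps * e)) / (2 * eps)
                for e in np.eye(k)])
print(np.max(np.abs(grad - num)))  # close to zero
```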
Thus, the SBN brings the powerful CNN framework to bear on generating embedding coefficients that span the highly nonlinear pose manifold. Due to the low-rank truncation inherent in the representation, the SBN alone is clearly insufficient for accurate localization. Optimizing both the coefficients and the PCA basis is possible, but the orthogonality of the basis must then be preserved, which requires extra effort in tuning the network. By design, the role of the SBN is to alleviate the difficulty of learning an embedding with large nonlinear distortions and to reduce the complexity of the shape deformations that must be considered by the next stage of the cascade.
3.3 Point Transformer Network
Given the input feature X, the SBN generates the object landmarks S_0 as a linear combination of predefined shape bases. As discussed before, this prediction is limited by its assumption of a linear regression model. To handle more challenging pose variations, this section proposes the Point Transformer Network (PTN), which deforms the initialized shape S_0 using a thin-plate spline (TPS) transformation to best match the ground truth S^*. The refinement is not purely local, since global deformation is also present. Some neural methods [52] place more emphasis on the local response map, whereas the PTN incorporates the global transformation into the overall deformation procedure.
A TPS transformation consists of an affine transformation and a nonlinear transformation parameterized by K control points \{c_k\} with corresponding coefficients \{w_k\} [53]. In our experiments, the control points form a regular grid. The TPS transformation T for any 2D point x is defined as:

T(x) = A\,\tilde{x} + \sum_{k=1}^{K} w_k\, U(\| x - c_k \|_2),   (4)

where \tilde{x} denotes x in homogeneous form, A is the affine matrix, and U(r) = r^2 \log r is the radial basis function (RBF).
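A minimal NumPy sketch of the TPS mapping in Eq. (4), using the standard 2D RBF U(r) = r² log r. The grid size, affine parameters and test points below are illustrative:

```python
import numpy as np

def tps_rbf(r):
    # U(r) = r^2 log r, with U(0) defined as 0.
    out = np.zeros_like(r)
    nz = r > 0
    out[nz] = r[nz] ** 2 * np.log(r[nz])
    return out

def tps_transform(points, A, W, ctrl):
    """Eq. (4): affine part A (2x3 on homogeneous coords) plus RBF part with
    weights W (K x 2) over control points ctrl (K x 2). points is (N, 2)."""
    hom = np.hstack([points, np.ones((len(points), 1))])
    affine = hom @ A.T
    dists = np.linalg.norm(points[:, None, :] - ctrl[None, :, :], axis=2)
    return affine + tps_rbf(dists) @ W

# A 4x4 control-point grid on [0, 1]^2 (grid size is illustrative).
g = np.linspace(0, 1, 4)
ctrl = np.array([[x, y] for x in g for y in g])

pts = np.array([[0.2, 0.3], [0.7, 0.1]])
A = np.array([[1.0, 0.0, 0.1],   # identity plus a small translation
              [0.0, 1.0, -0.2]])
W = np.zeros((len(ctrl), 2))     # zero RBF weights -> purely affine warp
out = tps_transform(pts, A, W, ctrl)
print(out)  # each point shifted by (0.1, -0.2)
```

With W set to zero the warp reduces to the affine part; non-zero W adds smooth non-rigid displacements anchored at the control points.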
Given convolutional features X and landmarks S_0 initialized by the SBN of Sec. 3.2, the PTN seeks the optimal nonlinear TPS mapping T, with parameters (A, W), that matches the ground truth S^*. Similar to the SBN, this mapping is achieved by concatenating two fully connected layers, which first generate an intermediate representation that is then used to compute A and W. See Fig. 3b for an overview of the PTN. Following [53], the PTN optimizes:

\min_{A, W} \sum_{i=1}^{n} \| T(s_i) - s_i^* \|_2^2 + \lambda \iint \Big[ \big(\tfrac{\partial^2 T}{\partial x^2}\big)^2 + 2 \big(\tfrac{\partial^2 T}{\partial x\, \partial y}\big)^2 + \big(\tfrac{\partial^2 T}{\partial y^2}\big)^2 \Big]\, dx\, dy,   (5)

where the integrand consists of the second-order derivatives of the transformation T with respect to the spatial coordinates. The weight \lambda trades off the transformation error against the bending energy. Substituting (4) into (5) yields an equivalent objective,

\min_{A, W} \| T(S_0) - S^* \|_F^2 + \lambda\, \mathrm{tr}(W^\top K W),   (6)

where each element of the RBF kernel computes K_{jk} = U(\| c_j - c_k \|_2).

It is known that (6) can be optimized over the TPS parameters A and W in closed form for a pair of shapes. But in our case, these two parameters are generated on-the-fly by the nonlinear mapping from image features. Thus, instead of computing the optimal solution, we optimize (6) over the network parameters using stochastic gradient descent.
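For reference, the closed-form fit for a single pair of shapes mentioned above can be sketched in NumPy via the standard TPS linear system (Bookstein-style). The point sets are synthetic, and for simplicity we use the source points themselves as control points, which differs from the fixed grid used by the PTN:

```python
import numpy as np

def fit_tps(src, dst, lam=0.0):
    """Closed-form TPS fit mapping src -> dst, with src as control points."""
    n = len(src)
    d = np.linalg.norm(src[:, None] - src[None, :], axis=2)
    K = np.where(d > 0, d ** 2 * np.log(np.where(d > 0, d, 1.0)), 0.0)
    P = np.hstack([src, np.ones((n, 1))])
    L = np.zeros((n + 3, n + 3))
    L[:n, :n] = K + lam * np.eye(n)   # lam plays the role of the bending weight
    L[:n, n:] = P
    L[n:, :n] = P.T
    rhs = np.vstack([dst, np.zeros((3, 2))])
    sol = np.linalg.solve(L, rhs)
    return sol[:n], sol[n:]           # W (n x 2), affine (3 x 2)

def apply_tps(pts, src, W, A):
    d = np.linalg.norm(pts[:, None] - src[None, :], axis=2)
    U = np.where(d > 0, d ** 2 * np.log(np.where(d > 0, d, 1.0)), 0.0)
    return np.hstack([pts, np.ones((len(pts), 1))]) @ A + U @ W

rng = np.random.default_rng(2)
src = rng.uniform(size=(12, 2))
dst = src + 0.05 * rng.normal(size=src.shape)   # small non-rigid perturbation

W, A = fit_tps(src, dst)             # lam = 0 -> exact interpolation
warped = apply_tps(src, src, W, A)
print(np.max(np.abs(warped - dst)))  # ~0: control points map exactly
```

This closed form exists only when both shapes are given; in the PTN the parameters must instead come from image features, hence the SGD formulation.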
In practice, a key difficulty in training the PTN stems from overfitting of the nonlinear mapping, since the number of TPS parameters exceeds the number of labeled point pairs in each mini-batch. For instance, Fig. 4 visualizes the effect of using different spatial transformations to match an example with the mean face. Replacing the TPS in (6) with an affine transformation, the PTN can warp the face toward a frontal pose only to a limited extent (Fig. 4c) due to out-of-plane head pose variations. Optimizing (6) with the supervised landmarks, the PTN is able to align the landmarks with the mean face more accurately. As shown in Fig. 4d, however, the estimated transformation is highly nonlinear, which suggests severe overfitting of the parameters of (6).
To address overfitting, a common solution is to increase the regularization weight \lambda. However, a large \lambda reduces the flexibility of the TPS transformation, which runs counter to the PTN's purpose of generating highly non-rigid local deformations. Thus, we propose a control-point grid regularization method to further constrain the point transformation. For each training image, we estimate the optimal TPS transformation from the mean shape to the ground truth offline. This TPS transformation is then applied to the control points C to obtain their transformed locations C'. We thereby obtain additional synthesized landmark pairs: the original control points and their transformed locations. Finally, we define an improved loss incorporating the synthesized control points:

\min_{A, W} \| T(S_0) - S^* \|_F^2 + \mu \| T(C) - C' \|_F^2 + \lambda\, \mathrm{tr}(W^\top K W),   (7)

where the three terms are defined analogously to (6): the first matches the supervised landmarks, the second matches the warped control points to their synthesized targets, and the third is the bending-energy regularizer. As shown in Fig. 4e, the new loss incorporates information from additional points, which helps to reduce overfitting and produces more stable TPS warps. The trade-off weights \mu and \lambda are fixed across our experiments.
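The structure of the combined loss in (7) can be sketched as follows. Everything here (shapes, the weights mu and lam, the bending-energy value) is a random or fixed stand-in, purely to show how the supervised term, the synthesized control-point term and the regularizer combine:

```python
import numpy as np

def point_match_loss(pred, target):
    # Squared matching error between warped points and their targets.
    return np.sum((pred - target) ** 2)

rng = np.random.default_rng(3)
n_landmarks, n_ctrl = 15, 16
warped_landmarks = rng.normal(size=(n_landmarks, 2))  # T(S0) from the PTN
gt_landmarks = rng.normal(size=(n_landmarks, 2))      # ground truth S*
warped_ctrl = rng.normal(size=(n_ctrl, 2))            # T applied to control pts
synth_ctrl = rng.normal(size=(n_ctrl, 2))             # offline-warped targets C'
bending = 0.5                                         # stand-in bending energy
mu, lam = 0.1, 0.01                                   # illustrative weights

# Loss (7): supervised term + synthesized control-point term + regularizer.
loss = (point_match_loss(warped_landmarks, gt_landmarks)
        + mu * point_match_loss(warped_ctrl, synth_ctrl)
        + lam * bending)
print(loss)
```

The synthesized pairs add supervision at every grid point, which is the mechanism that tames the otherwise under-constrained TPS parameters.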
To summarize, the PTN forms the refinement stage of our cascade, generating highly non-rigid local deformations for finer-level localization of landmarks. It is pertinent to note that, unlike the spatial transformer network [47], we directly optimize a geometric criterion rather than a classification objective. Similar to WarpNet [48], we transform point sets rather than the entire image or dense feature maps, but we go beyond it by striking a balance between supervised and synthesized landmark points.
4 Implementation Details
We implement DDN using the Caffe platform [54]. The three components of DDN shown in Fig. 2 (the convolutional layers for extracting features, the SBN for computing the intermediate landmarks, and the PTN for generating the final positions) can be trained from scratch end-to-end. However, for ease of training, we pretrain the SBN and PTN separately before joint training. For each task of localizing landmarks on faces, human bodies and birds, we synthetically augment the training data by randomly cropping, rotating and flipping images and landmarks. We use standard hyperparameters (0.9 for momentum and 0.004 for weight decay) in all experiments.
To pretrain the SBN, we minimize (2) without the PTN part. For the convolutional layers, we initialize with weights from the original VGG-16 model. During pretraining, we first fix the convolutional weights and update the fully connected layers of the SBN. When the error stops decreasing, the objective is relaxed to update both the convolutional layers and the fully connected layers. To pretrain the PTN, we remove the SBN component from the network and replace its input with the mean shape. We fix the convolutional weights to those pretrained with the SBN and train only the fully connected layers of the PTN; afterwards, we train both the convolutional and fully connected layers together.
After pretraining, we combine SBN and PTN in a joint network, where the SBN provides shape input to the PTN. The loss in (7) is evaluated at the end of PTN and is propagated back to update the fully connected and convolutional layers. During the joint training, we first update the weights of PTN by fixing the weights of SBN. Then the weights for SBN are relaxed and the entire network is updated jointly.
5 Experiments
We now evaluate DDN on its accuracy, efficiency and generality on three landmark localization problems: faces, human bodies and birds. Runtimes, on the order of milliseconds per image for each task, are measured on a Tesla K80 GPU with 12 GB memory, on a machine with an Intel 2.4 GHz 8-core CPU.
Evaluation metrics For all tasks, we use the percentage of correctly localized keypoints (PCK) [5] as the metric for evaluating localization accuracy. For the i-th sample in the test set, PCK defines the predicted position \hat{p}_{ij} of the j-th landmark to be correct if it falls within a threshold of the ground-truth position p_{ij}, that is, if

\| \hat{p}_{ij} - p_{ij} \|_2 \le \alpha\, d_i,   (8)

where d_i is the reference normalizer: the interocular distance for the face task, and the maximum of the height and width of the bounding box for human body pose estimation and bird part localization. The parameter \alpha controls the threshold for correctness.
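A minimal NumPy implementation of the PCK metric in Eq. (8); the toy predictions and normalizers below are illustrative:

```python
import numpy as np

def pck(pred, gt, norm, alpha=0.1):
    """Percentage of correctly localized keypoints, Eq. (8).
    pred, gt: (num_samples, num_landmarks, 2) landmark coordinates;
    norm: (num_samples,) reference distances (e.g., interocular distance
    or the larger side of the bounding box)."""
    err = np.linalg.norm(pred - gt, axis=2)        # per-landmark error
    correct = err <= alpha * norm[:, None]
    return correct.mean()

# Toy example: 2 samples, 3 landmarks each, unit normalizers.
gt = np.zeros((2, 3, 2))
pred = np.array([[[0.0, 0.0], [0.05, 0.0], [0.2, 0.0]],
                 [[0.0, 0.0], [0.0, 0.0], [0.0, 0.3]]])
norm = np.ones(2)
print(pck(pred, gt, norm, alpha=0.1))  # 4 of 6 landmarks within 0.1 -> 0.666...
```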
5.1 Facial Landmark Localization
To test face alignment in real scenarios, we choose the challenging 300 Faces in-the-Wild Challenge (300W) [55] as the main benchmark. It contains facial images with large head pose variations, expressions, background types and occlusions. For instance, the first row of Fig. 6 shows a few test examples from 300W. The dataset is created from five well-known datasets: LFPW [19], AFW [41], Helen [56], XM2VTS [57] and iBug [55]. The training sets of Helen and LFPW are adopted as the overall training set. All the datasets use the 68-landmark annotation. We report the relative normalized error on the entire 300W database and further compare PCK for the proposed component-wise variants.
Dataset  ESR [22]  SDM [24]  ERT [26]  LBF [25]  cGPRT [27]  DDN (Ours) 
300W [55]  7.58  7.52  6.40  6.32  5.71  5.65 
Table 1 lists the accuracy of five state-of-the-art methods as reported in the corresponding literature: explicit shape regression (ESR) [22], the supervised descent method (SDM) [24], ensemble of regression trees (ERT) [26], regression of local binary features (LBF) [25] and cascade of Gaussian process regression trees (cGPRT) [27]. The benefit of the CNN feature representation allows our DDN framework to outperform all the other methods on 300W. We note that the improvement over cGPRT is moderate; for face alignment, hand-crafted features such as SIFT remain competitive with CNN features. This indicates that the extent of non-rigid warping in facial images, as opposed to human bodies or birds, is not significant enough to derive full advantage of the power of CNNs, which is also visualized in Fig. 3(b).
To provide further insight into DDN, we also evaluate the independent performance of the two components (SBN and PTN) in Table 2. We note that the SBN achieves poorer localization than the PTN; the PTN alone, however, is limited to in-plane transformations. Thus, using the SBN as initialization, the combined DDN framework consistently outperforms the two independent networks.
To illustrate the need for nonrigid TPS transformations, we modify the PTN to use an affine transformation. The network is denoted as aDDN in Table 2. The performance is worse than DDN, which indicates that the flexibility of representing nonrigid transformations is essential. Visual results in the first row of Fig. 6 show that our method is robust to some degree of illumination changes, head pose variations and partial occlusions. The first failure case in the red box of the first row shows that distortions occur when large parts of the face are completely occluded. Note that in the second failure example, DDN actually adapts well to strong expression change, but is confused by a larger contrast for the teeth than the upper lip.
Method  Helen [56]  LFPW [19]  AFW [41]  iBug [55]
PCK at α:  0.05  0.10  0.05  0.10  0.05  0.10  0.05  0.10
SBN (Ours)  51.4  87.0  48.1  84.0  36.4  74.0  22.6  57.3 
PTN (Ours)  81.4  96.4  63.8  91.3  57.4  90.5  49.1  86.5 
aDDN (Ours)  67.8  93.5  55.5  89.3  44.3  82.9  38.7  79.3 
DDN (Ours)  85.2  96.5  64.1  91.6  59.5  90.6  56.6  88.9 
5.2 Human Body Pose Estimation
Compared to facial landmarks, localization of human body parts is more challenging due to the greater degrees of freedom introduced by the articulation of body joints. To evaluate our method, we use the Leeds Sports Pose (LSP) dataset [58], which is widely used as a benchmark for human pose estimation. The original LSP dataset contains 2,000 images of sportspersons gathered from Flickr, 1,000 for training and 1,000 for testing. Each image is annotated with 14 joint locations, where left and right joints are consistently labelled from a person-centric viewpoint. To enhance the training procedure, we also use the extended LSP dataset [59], which contains 10,000 labeled training images; this is the same setup used by most of the baselines in our comparisons.
We compare the proposed DDN with five state-of-the-art methods publicly reported on LSP. Two of them [31, 32] are traditional DPM-based methods that model the appearance and shape of body joints in a tree-structured model. The other three [34, 33, 35] utilize CNNs for human pose estimation combined with certain canonical frameworks. Our method also uses convolutional features, but proposes two new network structures, the SBN and PTN, which can be integrated for end-to-end training and testing. This is in contrast to previous CNN-based methods that include graphical models or pictorial structures, which introduce extra inference cost in both training and testing.
Method  Head  Shoulder  Elbow  Wrist  Hip  Knee  Ankle  Mean 
Wang & Li [31]  84.7  57.1  43.7  36.7  56.7  52.4  50.8  54.6 
Pishchulin et al. [32]  87.2  56.7  46.7  38.0  61.0  57.5  52.7  57.1 
Tompson et al. [34]  90.6  79.2  67.9  63.4  69.5  71.0  64.2  72.0 
Chen & Yuille [33]  91.8  78.2  71.8  65.5  73.3  70.2  63.4  73.4 
Fan et al. [35]  92.4  75.2  65.3  64.0  75.7  68.3  70.4  73.0 
aDDN (Ours)  82.3  82.5  62.3  41.2  55.3  77.6  77.1  68.3 
DDN (Ours)  87.2  88.2  82.4  76.3  91.4  85.8  78.7  84.3 
Figure 5 compares our method with the baselines in terms of PCK scores. In particular, the left two subfigures show the individual performance for two landmarks (elbow and wrist), while the right subfigure contains the overall performance averaged over all joints. Detailed plots for all parts are provided in the supplementary material. Among the baselines, the classical DPM-based methods [31, 32] achieve the worst performance due to weaker low-level features. The CNN-based methods of [34, 33, 35] improve over those by a large margin. However, the proposed DDN achieves a significant further improvement. PCK numbers for all the landmarks are listed in Table 3, where DDN performs better across almost all the articulated joints, and its mean accuracy over all landmarks is better than the best reported result, that of [33].
We also report numbers for the version of DDN trained with affine transformations (aDDN). The improvement in accuracy from using TPS warps rather than affine transformations is significantly larger for human body parts than for facial landmarks. This reflects the greater non-rigidity inherent in the human body pose estimation problem, which makes our improvement over previous CNN-based methods notable.
Methods  Ba  Be  By  Bt  Cn  Fo  Le  Ll  Lw  Na  Re  Rl  Rw  Ta  Th
α = 0.02
[60]  9.4  12.7  8.2  9.8  12.2  13.2  11.3  7.8  6.7  11.5  12.5  7.3  6.2  8.2  11.8
Ours  18.8  12.8  14.2  15.9  15.9  16.2  20.3  7.1  8.3  13.8  19.7  7.8  9.6  9.6  18.3
α = 0.05
[60]  46.8  62.5  40.7  45.1  59.8  63.7  66.3  33.7  31.7  54.3  63.8  36.2  33.3  39.6  56.9
Ours  66.4  49.2  56.4  60.4  61.0  60.0  66.9  32.3  35.8  53.1  66.3  35.0  37.1  40.9  65.9
α = 0.08
[60]  74.8  89.1  70.3  74.2  87.7  91.0  91.0  56.6  56.7  82.9  88.4  56.4  58.6  65.0  87.2
Ours  88.3  73.1  83.5  85.7  85.0  84.7  88.3  57.5  58.9  77.1  88.7  62.1  59.1  66.6  87.4
α = 0.10
[60]  85.6  94.9  81.9  84.5  94.8  96.0  95.7  64.6  67.8  90.7  93.8  64.9  69.3  74.7  94.5
Ours  94.0  82.5  92.2  93.0  92.2  91.5  93.3  69.7  68.1  86.0  93.8  74.2  68.9  77.4  93.4
These results highlight some of the previously discussed advantages of DDN over prior CNN-based frameworks. DDN incorporates geometric structure directly into the network, which makes shape prediction end-to-end during both training and testing, while also regularizing the learning. Thus, DDN can learn highly nonlinear mappings, which is nontrivial with hand-designed graphical models. Further, we hypothesize that extra modules such as graphical-model inference over joint neighborhoods incur additional error. The second row of Fig. 6 shows several qualitative results generated by DDN, which handles a wide range of body poses with good accuracy. The challenging cases within the red box in the second row of Fig. 6 show that our method degrades when body parts are heavily occluded or folded.
5.3 Bird Part Localization
We now evaluate DDN on the well-known CUB-200-2011 [61] dataset for bird part localization. The dataset contains 11,788 images of 200 bird species. Each image is annotated with a bounding box and 15 keypoints. We adopt the standard dataset split, where 5,994 images are used for training and the remaining 5,794 for testing. CUB-200-2011 was originally designed for fine-grained recognition in the wild; thus, it contains very challenging pose variations and severe occlusions. Compared to facial landmarks and human body joints, a further difficulty is the non-discriminative texture of many bird parts. For instance, the last row of Fig. 6 shows examples where part definitions such as wings or tail might be ambiguous even for humans. Abbreviated part names are used in Table 4.
For comparison, we choose the recent work of Zhang et al. [60] as the baseline and report PCK numbers at multiple thresholds for each of the landmarks in Table 4. By converting the landmark labels to a dense segmentation mask, Zhang et al. exploit fully convolutional networks [49] for landmark localization. In contrast, our DDN directly regresses from the VGG features to the locations of the sparse landmarks, which incurs significantly less computational cost. In addition, Zhang et al. predict each landmark independently without considering the geometric relations among landmarks, which are naturally encoded in our SBN. Therefore, our method achieves highly competitive and sometimes better performance than the state-of-the-art, at a significantly lower expense.
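For reference, the PCK metric used throughout this comparison counts a predicted landmark as correct when its distance to the ground truth is within a fraction α of a normalizing length (e.g. the longer side of the bounding box). A minimal sketch, assuming NumPy arrays of (x, y) coordinates; the function name and the optional visibility mask are illustrative:

```python
import numpy as np

def pck(pred, gt, norm, alpha=0.05, visible=None):
    """Percentage of Correct Keypoints.
    pred, gt: (L, 2) predicted and ground-truth landmark coordinates
    norm:     scalar reference length (e.g. longer bounding-box side)
    alpha:    threshold fraction of the reference length
    visible:  optional (L,) boolean mask restricting the evaluation
    """
    d = np.linalg.norm(pred - gt, axis=-1)   # per-landmark error
    correct = d <= alpha * norm              # within alpha * norm counts
    if visible is not None:
        correct = correct[visible]
    return 100.0 * correct.mean()            # percentage over landmarks
```

A prediction 1 px off with a 100 px reference length and α = 0.05 is correct; one 100 px off is not, so that pair scores 50%.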
6 Conclusion
We propose a cascaded network called Deep Deformation Network (DDN) for object landmark localization. We argue that incorporating geometric constraints in the CNN framework is a better approach than directly regressing from feature maps to landmark locations. This hypothesis is realized by designing a two-stage cascade. The first stage, called the Shape Basis Network, initializes the shape as constrained to lie within a low-rank manifold. This allows a fast initialization that can account for large out-of-plane rotations, while regularizing the estimation. The second stage, called the Point Transformer Network, estimates local deformation in the form of non-rigid thin-plate spline warps. The DDN framework is trainable end-to-end, which combines the power of CNN feature representations with learned geometric transformations. In contrast to prior approaches, DDN avoids complex initializations, large cascades with several CNNs, hand-crafted features or pre-specified part connectivities of DPMs. Our DDN framework consistently achieves state-of-the-art results on three separate tasks, i.e., face landmark localization, human pose estimation and bird part localization, which shows the generality of the proposed method.
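To make the low-rank shape constraint concrete, the following sketch learns a linear shape basis by PCA over training shapes and reconstructs a shape from low-dimensional coefficients. This illustrates only the general idea behind the constraint; in DDN the basis is learned jointly with the CNN rather than by a separate PCA, and the function names here are illustrative.

```python
import numpy as np

def learn_shape_basis(shapes, k):
    """PCA shape basis.
    shapes: (M, 2L) matrix of M training shapes, each with L landmarks
            flattened as (x1, y1, ..., xL, yL)
    k:      number of basis vectors to keep
    Returns the mean shape (2L,) and the top-k basis (k, 2L)."""
    mean = shapes.mean(axis=0)
    # Rows of vt are orthonormal principal directions of the centered data.
    _, _, vt = np.linalg.svd(shapes - mean, full_matrices=False)
    return mean, vt[:k]

def reconstruct(mean, basis, coeffs):
    # Any shape on the low-rank manifold: s = mean + coeffs @ basis
    return mean + coeffs @ basis
```

Predicting a shape then amounts to regressing only the k coefficients rather than all 2L coordinates, which is the regularization the first stage of the cascade exploits.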
References
 [1] Cootes, T., Taylor, C., Cooper, D., Graham, J.: Active shape models - their training and application. CVIU (1995)
 [2] Cootes, T., Edwards, G., Taylor, C.: Active appearance models. In: ECCV. (1998)
 [3] Cristinacce, D., Cootes, T.: Automatic feature localization with constrained local models. PR (2007)
 [4] Felzenszwalb, P., Girshick, R., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. PAMI (2010)
 [5] Yang, Y., Ramanan, D.: Articulated pose estimation with flexible mixtures-of-parts. In: CVPR. (2011)
 [6] Bourdev, L., Maji, S., Brox, T., Malik, J.: Detecting people using mutually consistent poselet activations. In: ECCV. (2010)
 [7] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint (2014)
 [8] Sun, Y., Wang, X., Tang, X.: Deep convolutional network cascade for facial point detection. In: CVPR. (2013)
 [9] Zhang, Z., Luo, P., Loy, C., Tang, X.: Facial landmark detection by deep multitask learning. In: ECCV. (2014)
 [10] Zhang, J., Shan, S., Kan, M., Chen, X.: Coarse-to-fine auto-encoder networks (CFAN) for real-time face alignment. In: ECCV. (2014)
 [11] Toshev, A., Szegedy, C.: Deeppose: Human pose estimation via deep neural networks. In: CVPR. (2014)
 [12] Saragih, J., Lucey, S., Cohn, J.: Deformable model fitting by regularized landmark meanshift. IJCV (2011)
 [13] Yu, X., Yang, F., Huang, J., Metaxas, D.: Explicit occlusion detection based deformable fitting for facial landmark localization. In: FG. (2013)
 [14] Pedersoli, M., Timofte, R., Tuytelaars, T., Gool, L.V.: Using a deformation field model for localizing faces and facial points under weak supervision. In: CVPR. (2014)
 [15] Yu, X., Huang, J., Zhang, S., Metaxas, D.: Face landmark fitting via optimized part mixtures and cascaded deformable model. PAMI (2015)
 [16] Matthews, I., Baker, S.: Active appearance models revisited. IJCV (2004)
 [17] Tzimiropoulos, G., Pantic, M.: Optimization problems for fast AAM fitting in-the-wild. In: ICCV. (2013)
 [18] Cheng, X., Sridharan, S., Saragih, J., Lucey, S.: Rank minimization across appearance and shape for AAM ensemble fitting. In: ICCV. (2013)
 [19] Belhumeur, P., Jacobs, D., Kriegman, D., Kumar, N.: Localizing parts of faces using a consensus of exemplars. In: CVPR. (2011)
 [20] Yu, X., Lin, Z., Brandt, J., Metaxas, D.: Consensus of regression for occlusion-robust facial feature localization. In: ECCV. (2014)
 [21] Zhou, F., Brandt, J., Lin, Z.: Exemplar-based graph matching for robust facial landmark localization. In: ICCV. (2013)
 [22] Cao, X., Wei, Y., Wen, F., Sun, J.: Face alignment by explicit shape regression. International Journal of Computer Vision (2013)
 [23] Dantone, M., Gall, J., Fanelli, G., Gool, L.V.: Real-time facial feature detection using conditional regression forests. In: CVPR. (2012)
 [24] Xiong, X., De la Torre, F.: Supervised descent method and its applications to face alignment. In: CVPR. (2013)
 [25] Ren, S., Cao, X., Wei, Y., Sun, J.: Face alignment at 3000 fps via regressing local binary features. In: CVPR. (2014)
 [26] Kazemi, V., Sullivan, J.: One millisecond face alignment with an ensemble of regression trees. In: CVPR. (2014)
 [27] Lee, D., Park, H., Yoo, C.: Face alignment using cascade gaussian process regression trees. In: CVPR. (2015)
 [28] Zhu, S., Li, C., Loy, C., Tang, X.: Face alignment by coarse-to-fine shape searching. In: CVPR. (2015)
 [29] Yang, H., Mou, W., Zhang, Y., Patras, I., Gunes, H., Robinson, P.: Face alignment assisted by head pose estimation. In: BMVC. (2015)
 [30] Felzenszwalb, P., Huttenlocher, D.: Pictorial structures for object recognition. IJCV (2005)
 [31] Wang, F., Li, Y.: Beyond physical connections: Tree models in human pose estimation. In: CVPR. (2013)
 [32] Pishchulin, L., Andriluka, M., Gehler, P., Schiele, B.: Strong appearance and expressive spatial models for human pose estimation. In: ICCV. (2013)
 [33] Chen, X., Yuille, A.: Articulated pose estimation by a graphical model with image dependent pairwise relations. In: NIPS. (2014)
 [34] Tompson, J., Jain, A., LeCun, Y., Bregler, C.: Joint training of a convolutional network and a graphical model for human pose estimation. In: NIPS. (2014)
 [35] Fan, X., Zheng, K., Lin, Y., Wang, S.: Combining local appearance and holistic view: Dual-source deep neural networks for human pose estimation. In: CVPR. (2015)
 [36] Wah, C., Branson, S., Perona, P., Belongie, S.: Multiclass recognition and part localization with humans in the loop. In: ICCV. (2011)
 [37] Chai, Y., Lempitsky, V., Zisserman, A.: Symbiotic segmentation and part localization for fine-grained categorization. In: ICCV. (2013)
 [38] Liu, J., Belhumeur, P.: Bird part localization using exemplar-based models with enforced pose and subcategory consistency. In: ICCV. (2013)
 [39] Zhang, N., Donahue, J., Girshick, R., Darrell, T.: Part-based R-CNNs for fine-grained category detection. In: ECCV. (2014)
 [40] Lin, D., Shen, X., Lu, C., Jia, J.: Deep LAC: Deep localization, alignment and classification for fine-grained recognition. In: CVPR. (2015)
 [41] Zhu, X., Ramanan, D.: Face detection, pose estimation and landmark localization in the wild. In: CVPR. (2012)
 [42] Dollar, P., Welinder, P., Perona, P.: Cascaded pose regression. In: CVPR. (2010)
 [43] BurgosArtizzu, X., Perona, P., Dollar, P.: Robust face landmark estimation under occlusion. In: ICCV. (2013)
 [44] Yan, J., Lei, Z., Yang, Y., Li, S.: Stacked deformable part model with shape regression for object part localization. In: ECCV. (2014)
 [45] Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV. (2015)
 [46] Razavian, A., Azizpour, H., Maki, A., Sullivan, J., Ek, C., Carlsson, S.: Persistent evidence of local image properties in generic convnets. In: SCIA. (2015)
 [47] Jaderberg, M., Simonyan, K., Zisserman, A., Kavukcuoglu, K.: Spatial transformer networks. In: NIPS. (2015)
 [48] Kanazawa, A., Jacobs, D., Chandraker, M.: WarpNet: Weakly supervised matching for single-view reconstruction. In: CVPR. (2016)
 [49] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: CVPR. (2015)
 [50] Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. arXiv preprint (2016)
 [51] Bengio, Y., Louradour, J., Collobert, R., Weston, J.: Curriculum learning. In: ICML. (2009)
 [52] Baltrusaitis, T., Robinson, P., Morency, L.: Constrained local neural fields for robust facial landmark detection in the wild. In: ICCVW. (2013)
 [53] Bookstein, F.L.: Principal warps: Thin-plate splines and the decomposition of deformations. PAMI (1989)
 [54] Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding. arXiv preprint (2014)
 [55] Sagonas, C., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: The first facial landmark localization challenge. In: ICCVW. (2013)
 [56] Le, V., Brandt, J., Lin, Z., Bourdev, L., Huang, T.: Interactive facial feature localization. In: ECCV. (2012)
 [57] Messer, K., Matas, J., Kittler, J., Luettin, J., Maitre, G.: XM2VTSDB: The extended M2VTS database. In: AVBPA. (1999)
 [58] Johnson, S., Everingham, M.: Clustered pose and nonlinear appearance models for human pose estimation. In: BMVC. (2010)
 [59] Johnson, S., Everingham, M.: Learning effective human pose estimation from inaccurate annotation. In: CVPR. (2011)
 [60] Zhang, N., Shelhamer, E., Gao, Y., Darrell, T.: Fine-grained pose prediction, normalization and recognition. arXiv preprint (2016)
 [61] Welinder, P., Branson, S., Mita, T., Wah, C., Schroff, F., Belongie, S., Perona, P.: Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001 (2010)