Cross-Task Representation Learning for Anatomical Landmark Detection

Recently, there has been an increasing demand for automatically detecting anatomical landmarks, which provide rich structural information to facilitate subsequent medical image analysis. Current methods for this task often leverage the power of deep neural networks, but a major challenge in fine-tuning such models for medical applications arises from the insufficient number of labeled samples. To address this, we propose to regularize the knowledge transfer across source and target tasks through cross-task representation learning. The proposed method is demonstrated for extracting facial anatomical landmarks which facilitate the diagnosis of fetal alcohol syndrome. The source and target tasks in this work are face recognition and landmark detection, respectively. The main idea of the proposed method is to retain the feature representations of the source model on the target task data, and to leverage them as an additional source of supervisory signals for regularizing the target model learning, thereby improving its performance under limited training samples. Concretely, we present two approaches for the proposed representation learning by constraining either the final or intermediate features of the target model. Experimental results on a clinical face image dataset demonstrate that the proposed approach works well with few labeled data and outperforms other compared approaches.

Keywords: Anatomical landmark detection · Knowledge transfer


1 Introduction

Accurate localization of anatomical landmarks plays an important role in medical image analysis and applications such as image registration and shape analysis [4]. It also has the potential to facilitate the early diagnosis of Fetal Alcohol Syndrome (FAS) [11]. An FAS diagnosis requires the identification of at least 2 of 3 cardinal facial features: a thin upper lip, a smooth philtrum, and a reduced palpebral fissure length (PFL) [10], which means that even a small inaccuracy in the PFL measurement can easily result in misdiagnosis. Conventional approaches for extracting anatomical landmarks mostly rely on manual examination, which is tedious and subject to inter-operator variability. To automate landmark detection, recent methods in computer vision [16, 22, 25] and medical image analysis [26, 4, 11] have extensively relied on convolutional neural networks (CNNs) for keypoint regression. Although these models have achieved promising performance, the task remains challenging, especially given the scarcity of labeled data in the medical domain due to the expensive and inefficient annotation process. Transfer learning, in particular fine-tuning models pre-trained on similar domains, has been widely used to help reduce over-fitting by providing a better initialization [17]. However, merely fine-tuning the existing parameters may arguably lead to a suboptimal local minimum for the target task, because much of the pre-trained model's knowledge in the feature space is barely explored [13, 12]. To address this, we explore the following question: Is it possible to leverage the abundant knowledge from a domain-similar source task to guide or regularize the training of a target task with limited training samples?

We investigate this hypothesis via cross-task representation learning, where “cross-task” here means that the learning process is made between the source and target tasks with different objectives. In this work, the proposed cross-task representation learning approach is illustrated for localizing anatomical landmarks in clinical face images to facilitate early recognition of fetal alcohol syndrome [1], where the source and target tasks are face recognition and landmark detection. Intuitively, the proposed representation learning is interpreted as preserving feature representations of a source classification model on the target task data, which serves as a regularization constraint for learning the landmark detector. Two approaches for the proposed representation learning are developed by constraining either final or intermediate network features on the target model.

Related Work.

Current state-of-the-art methods formulate landmark detection as a CNN-based regression problem, including two main frameworks: direct coordinate regression [24, 6] and heatmap regression [16, 22]. Heatmap regression usually outperforms its counterpart as it preserves high spatial resolution during regression. In medical imaging, several CNN architectures have been developed based on attention mechanisms [4, 26] and cascaded processing [23] to enhance anatomical landmark detection. In contrast, the learning approach proposed in this paper focuses on internally enriching the feature representations for keypoint localization without complicating the network design.

Among existing knowledge transfer approaches, fine-tuning [22], as a standard practice, initializes from a pre-trained model and shifts its original capability towards a target task, where a small learning rate is often applied and some model parameters may need to be frozen to avoid overfitting. However, empirically modifying the existing parameters may not generalize well over a small training dataset. Knowledge distillation, originally proposed for model compression [9], is also related to knowledge transfer. This technique has been successfully extended and applied to various applications, including hint learning [20], incremental learning [14, 5], privileged learning [15], domain adaptation [7] and human expert knowledge distillation [19]. These distillation methods focus on training a compact model by operating the knowledge transfer within the same task [9, 19, 20]. In contrast, our proposed learning approach aims to regularize the transfer learning across different tasks.


We propose a new deep learning framework for anatomical landmark detection under limited training samples. The main contributions are: (1) We propose a cross-task representation learning approach whereby the feature representations of a pre-trained classification model are leveraged to regularize the optimization of landmark detection. (2) We present two approaches for the proposed representation learning by constraining either the final or intermediate network features on the target task data. In addition, a cosine similarity inspired by metric learning is adopted as a regularization loss to transfer relational knowledge between tasks. (3) We experimentally show that the proposed learning approach performs well in anatomical landmark detection with limited training samples and is superior to standard transfer learning approaches.

2 Method

In this section, we first present the problem formulation of anatomical landmark detection, and then describe the design of the proposed cross-task representation learning to address this task.

2.1 Problem Formulation

In this paper, our target task is anatomical landmark detection, which aims to localize a set of pre-defined anatomical landmarks in a facial image. Let D = {(x_i, y_i)}_{i=1}^{N} be the training dataset with N pairs of training samples in the target domain, where x_i is a 2D RGB image with height H and width W, y_i denotes the corresponding labeled landmark coordinates, and L is the number of anatomical landmarks. We formulate this task using heatmap regression, inspired by its recent success in keypoint localization [16, 22]. Following prior work [16], we downscale the labeled coordinates to a fraction of the input size, and then transform them into a set of heatmaps {h_1, ..., h_L}. Each heatmap h_l is defined as a 2D Gaussian kernel of width sigma centered on the l-th landmark coordinate. Consequently, the goal is to learn a network that regresses each input image to a set of heatmaps, based on the updated dataset of image-heatmap pairs.
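The heatmap construction above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation; the function names and the exact normalization (peak value of 1 at the landmark) are assumptions of this sketch.

```python
import numpy as np

def make_heatmap(center, size, sigma=1.0):
    """Render one 2D Gaussian heatmap centred on a (downscaled) landmark.

    center: (cx, cy) landmark coordinate in heatmap space.
    size:   (height, width) of the heatmap.
    sigma:  Gaussian kernel width in pixels (value assumed here).
    """
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w]           # pixel coordinate grid
    cx, cy = center
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2  # squared distance to the landmark
    return np.exp(-d2 / (2.0 * sigma ** 2))

def make_heatmaps(landmarks, size, sigma=1.0):
    """Stack one heatmap per landmark: output shape (L, H, W)."""
    return np.stack([make_heatmap(c, size, sigma) for c in landmarks])
```

Each heatmap peaks at its landmark and decays smoothly, which gives the regression target a tolerance to small annotation noise.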

For this regression problem, most state-of-the-art methods [22, 25] follow the encoder-decoder design, in which a pre-trained network (e.g. ResNet-50 [8]) is usually utilized as the encoder for feature extraction, and then the entire network or only the decoder is fine-tuned during training. However, due to the limited number of training samples in our case, merely relying on standard fine-tuning may not always provide good localization accuracy. Therefore, we present the proposed solution to this problem in the next section.

2.2 Cross-Task Representation Learning


Fig. 1 depicts the overall design of the proposed cross-task representation learning approach. Firstly, the source model pre-trained on a face classification task is operated in inference mode to predict rich feature representations, from either the classification layer or an intermediate layer, for the target task data. The target model is then initialized from the source model and extended with a task-specific decoder for landmark detection. The obtained feature representations are then transferred through regularization losses that constrain the target model learning.

Source Model.

Figure 1: Illustration of the proposed approaches for learning the anatomical landmark detection models, where (a) presents the regularization constraint on the final layer output, and (b) constrains the predictions on the encoder output.

We consider a pre-trained face classification network as our source model, since generic facial representations generated from this domain-similar task have been demonstrated to be helpful for other facial analysis tasks [21]. Formally, the source network for a face classification task consists of a feature extractor (encoder) and a classifier. A cross-entropy loss is typically used to train the network, which maps a facial image to classification scores based on a richly labeled dataset. In practice, we adopt a ResNet-50 [8] model pre-trained on VGGFace2 [3] as the source network. Other available deep network architectures could also be utilized for this purpose.

Target Model.

For the task of heatmap regression, the target network is firstly initialized from the pre-trained source network. We then follow the design of [22], employing three deconvolutional layers after the encoder output to recover the desired spatial resolution, each with 256 filters, followed by a final convolutional layer that completes this task-specific decoder. The primary learning objective is to minimize the following loss between the decoder outputs h'_l and the labeled heatmaps h_l,

    L_t = (1/L) * sum_{l=1}^{L} || h'_l - h_l ||_F^2,   (1)

where ||.||_F denotes the Frobenius norm.
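The heatmap regression objective can be sketched as follows in NumPy. This is an illustrative sketch; averaging over landmarks (rather than summing) is an assumption made here for readability.

```python
import numpy as np

def heatmap_loss(pred, target):
    """Squared Frobenius distance between predicted and labelled heatmaps,
    averaged over landmarks.

    pred, target: arrays of shape (L, H, W), one heatmap per landmark.
    """
    assert pred.shape == target.shape
    # Squared Frobenius norm of the per-landmark residual, then the mean.
    per_landmark = np.sum((pred - target) ** 2, axis=(1, 2))
    return float(per_landmark.mean())
```

In practice this would be computed on mini-batches by an automatic-differentiation framework; the sketch only shows the arithmetic of the loss.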

Regularized Knowledge Transfer.

Figure 2: Illustration of the proposed framework for testing landmark detection models.

Motivated by knowledge distillation, we aim to regularize the network training by directly acquiring the source model's predictions for the target task data, which are further transferred through a regularization loss L_reg. Hence, the total loss is defined as,

    L_total = L_t + lambda * L_reg,   (2)

where lambda is a weighting parameter. If lambda = 0, the knowledge transfer reduces to standard fine-tuning, as no regularization is included.
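The composition of the total loss is straightforward; the short sketch below makes the fine-tuning special case explicit (names are illustrative):

```python
def total_loss(task_loss, reg_loss, lam):
    """L_total = L_t + lambda * L_reg.

    With lam = 0 the regularizer vanishes and training reduces to
    standard fine-tuning of the target model.
    """
    return task_loss + lam * reg_loss
```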

For the design of L_reg, we firstly consider constraining the distance between the final layer outputs of the two networks, as shown in Fig. 1 (a). Similar to the distillation loss in [9], we use a temperature parameter T with a softmax function to smooth the predictions, but the original cross-entropy function is replaced by the following term,

    L_cd = || softmax(z_s / T) - softmax(z_t / T) ||^2,   (3)

where z_s and z_t denote the final layer outputs of the source and target networks, respectively. The purpose of this design of L_cd is to directly align the facial embeddings between instances, instead of preserving the original classification ability.
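A minimal NumPy sketch of this temperature-softened alignment follows. The squared L2 distance between softened distributions is the reading of the loss assumed here; the temperature value is likewise an illustrative choice.

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-scaled softmax; a larger T yields a softer distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()                # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def classifier_output_loss(z_src, z_tgt, T=2.0):
    """Squared L2 distance between temperature-softened predictions,
    replacing the cross-entropy used in standard distillation [9]."""
    return float(np.sum((softmax(z_src, T) - softmax(z_tgt, T)) ** 2))
```

The loss is zero when both networks produce identical logits, and grows as the target model's final-layer output drifts away from the source representation.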

Moreover, we consider matching the feature maps produced by both encoders as another choice, as shown in Fig. 1 (b). Motivated by the work in [18], we adopt the cosine similarity for the feature alignment as described below,

    L_ed = 1 - cos(e_s, e_t) = 1 - (e_s . e_t) / (||e_s|| ||e_t||),   (4)

where e_s and e_t denote the encoder outputs of the source and target networks. We conjecture that penalizing higher-order angular differences in this context helps transfer the relational information across different tasks, and also gives more flexibility for the target model learning. In addition, both regularization terms can be combined to regularize the learning process. The different variants of the proposed learning strategy are evaluated in the experimental section.
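The cosine-based feature regularizer can be sketched as below; flattening the encoder feature maps into vectors before comparison is an assumption of this illustration.

```python
import numpy as np

def cosine_feature_loss(f_src, f_tgt, eps=1e-8):
    """1 - cosine similarity between flattened encoder features.

    Penalises only the angular difference between the two feature
    vectors, leaving the target model's feature magnitudes free to
    change during fine-tuning.
    """
    a = np.ravel(np.asarray(f_src, dtype=float))
    b = np.ravel(np.asarray(f_tgt, dtype=float))
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    return 1.0 - cos
```

Because the loss is scale-invariant, a target feature that is a rescaled copy of the source feature incurs (almost) no penalty, which is the extra flexibility argued for above.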

During inference, as shown in Fig. 2, only the trained target model is used to infer the heatmaps, and each of them is further processed via an argmax function to obtain the final landmark locations.
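Decoding heatmaps back to coordinates amounts to a per-heatmap argmax followed by rescaling to the input resolution, as sketched below (the stride parameter standing in for the encoder/decoder downscaling factor is an assumption of this sketch).

```python
import numpy as np

def decode_heatmaps(heatmaps, stride=1):
    """Recover (x, y) landmark coordinates via a per-heatmap argmax.

    heatmaps: array of shape (L, H, W); stride rescales coordinates
    back to the input image resolution.
    """
    L, H, W = heatmaps.shape
    flat = heatmaps.reshape(L, -1).argmax(axis=1)  # flat index of each peak
    ys, xs = np.divmod(flat, W)                    # row, column of each peak
    return np.stack([xs, ys], axis=1) * stride
```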

3 Experiments

3.1 Dataset and Implementation Details

We evaluate the proposed approach for extracting facial anatomical landmarks. Images used for the training and test datasets were collected by the Collaborative Initiative on Fetal Alcohol Spectrum Disorders (CIFASD), a global multi-disciplinary consortium focused on furthering the understanding of FASD. The dataset contains subjects from 4 sites across the USA, aged between 4 and 18 years. Each subject was imaged using a commercially available static 3D photogrammetry system from 3dMD. For this study, we utilize the high-resolution 2D images captured during 3D acquisition, which are used as UV-mapped textures for the 3D surfaces.

Specifically, we acquired in total 1549 facial images annotated by an expert, and randomly split them into a training/validation set (80%) and a test set (20%). All the images were cropped and resized to a fixed input size for network training and evaluation. Standard data augmentation was performed with random horizontal flipping (50%) and scaling. During training, the Adam optimizer [2] was used for the optimization with a fixed mini-batch size, and a polynomial-decay learning rate schedule was applied. The weighting parameter lambda and temperature T used in (2) and (3) were fixed for all experiments.

3.2 Evaluation Metrics

For the evaluation, we firstly employ the Mean Error (ME), a commonly used metric in facial landmark detection, defined as the point-to-point Euclidean distance between the manual annotations and the predictions, averaged over all landmarks and all test images. Note that the normalization factor commonly measured by the inter-ocular distance (the Euclidean distance between the outer eye corners) is not included in this evaluation, due to the unavailable annotations for the other eye, as illustrated in Fig. 3. In addition, we use the Cumulative Errors Distribution (CED) curve with the metrics of Area-Under-the-Curve (AUC) and Failure Rate (FR), where a failure case is counted if the point-to-point Euclidean error exceeds a pre-defined threshold. A higher AUC or a lower FR indicates that a larger proportion of the test set is well predicted.
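The two error metrics can be sketched as follows; the (N, L, 2) array layout for N images with L landmarks each, and the per-image averaging used for the failure criterion, are assumptions of this illustration.

```python
import numpy as np

def mean_error(preds, gts):
    """Mean point-to-point Euclidean error over all landmarks and images.

    preds, gts: arrays of shape (N, L, 2) holding (x, y) coordinates.
    """
    return float(np.linalg.norm(preds - gts, axis=-1).mean())

def failure_rate(preds, gts, threshold):
    """Fraction of images whose mean landmark error exceeds the threshold."""
    per_image = np.linalg.norm(preds - gts, axis=-1).mean(axis=1)
    return float((per_image > threshold).mean())
```

The AUC reported alongside these metrics is simply the area under the CED curve built from the same per-image errors, truncated at the failure threshold.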

Figure 3: Qualitative performance of landmark prediction and heatmap regression on the test set. Subjects’ eyes are masked for privacy preservation. Better viewed in color.
Figure 4: Evaluation of CED curve on the test set. Better viewed in color.
Method   | ME          | FR     | AUC
---------|-------------|--------|-----
FE [22]  | 1.822±0.501 | 94.52% | 0.01
FTP [22] | 1.161±0.261 | 40.32% | 0.10
FT [22]  | 0.858±0.238 | 10.65% | 0.29
HG [16]  | 0.879±0.386 | 12.58% | 0.30
CTD-CD   | 0.842±0.246 | 5.81%  | 0.31
CTD-ED   | 0.830±0.245 | 7.74%  | 0.32
CTD-Com  | 0.829±0.253 | 6.45%  | 0.32

Figure 5: Quantitative evaluation on the test set.

3.3 Results and Discussions

To verify the effectiveness of the proposed cross-task representation learning (CTD) approach, we compare to a widely-used CNN model, the stacked Hourglass (HG) [16], and three variants of fine-tuning [22] without regularization: Feature Extraction (FE), which freezes the encoder; Fine Tuning Parts (FTP), which additionally unfreezes the final convolutional layer of the encoder; and Fine Tuning (FT), which freezes no layers. In addition, we present an ablation study to examine the significance of each approach in our proposed CTD, including the regularization on the classifier output (CTD-CD), the regularization on the encoder output (CTD-ED), and the regularization on both outputs (CTD-Com).

Fig. 3 shows the qualitative comparisons between different models on the test set. As we can see, the predicted landmarks from the proposed methods generally achieve better alignment with the ground truth (the first left column) than the others, and appear more robust in difficult cases, especially when landmarks are in close proximity (e.g. the upper lip). One possible reason is that feature representations generated from the source model encode richer facial semantics, which make landmark spatial locations more discriminative. Furthermore, the visualization of predicted heatmaps explains how each compared model responds to the desired task. We observe that our cross-task representation learning can effectively suppress spurious responses and improve the feature confidence in relevant regions, so that more accurate predictions can be achieved.

On the other hand, Table 5 summarizes the quantitative evaluation by reporting the statistics for each model. Fig. 5 depicts the CED curve, which provides an intuitive understanding of the overall performance of the compared models. These evaluations demonstrate that the proposed methods consistently outperform standard fine-tuning solutions. Moreover, CTD-ED performs slightly better than CTD-CD in terms of ME and AUC. This may be explained by the fact that features from intermediate layers are not only semantic but also contain, to some extent, structural information which is beneficial for localization [7]. Interestingly, CTD-Com, which uses both regularization losses, achieves results similar to CTD-ED; as a result, CTD-ED may be considered the better choice for regularizing the transfer learning.

4 Conclusions

In this paper, we presented a new cross-task representation learning approach to address the problem of anatomical landmark detection where labeled training data is limited. The proposed learning approach reuses the knowledge from a domain-similar source task as a regularization constraint for learning the target landmark detector, and several regularization constraints were considered for this purpose. Experimental results suggested that the proposed learning approach works well with limited training samples and outperforms other compared solutions. The proposed approach can potentially be applied to other related applications in the clinical domain where the target task has a small training set and the source task data is not accessible.


This work was done in conjunction with the Collaborative Initiative on Fetal Alcohol Spectrum Disorders (CIFASD), which is funded by grants from the National Institute on Alcohol Abuse and Alcoholism (NIAAA). This work was supported by NIH grant U01AA014809 and EPSRC grant EP/M013774/1.




  1. S. J. Astley (2015) Palpebral fissure length measurement: accuracy of the FAS facial photographic analysis software and inaccuracy of the ruler. Journal of Population Therapeutics and Clinical Pharmacology 22 (1), pp. e9–e26.
  2. D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. In Proc. of International Conference on Learning Representations (ICLR), pp. 1–15.
  3. Q. Cao, L. Shen, W. Xie, O. M. Parkhi and A. Zisserman (2018) VGGFace2: a dataset for recognising faces across pose and age. In IEEE International Conference on Automatic Face and Gesture Recognition, pp. 67–74.
  4. R. Chen, Y. Ma, N. Chen, D. Lee and W. Wang (2019) Cephalometric landmark detection by attentive feature pyramid fusion and regression-voting. In Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 873–881.
  5. P. Dhar, R. V. Singh, K. Peng, Z. Wu and R. Chellappa (2019) Learning without memorizing. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  6. Z. Feng, J. Kittler, M. Awais, P. Huber and X. Wu (2018) Wing loss for robust facial landmark localisation with convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  7. S. Gupta, J. Hoffman and J. Malik (2016) Cross modal distillation for supervision transfer. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  8. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  9. G. Hinton, O. Vinyals and J. Dean (2015) Distilling the knowledge in a neural network. In Conference on Neural Information Processing Systems (NeurIPS) Workshops.
  10. H. E. Hoyme, P. A. May and W. O. Kalberg (2006) A practical clinical approach to diagnosis of fetal alcohol spectrum disorders: clarification of the 1996 Institute of Medicine criteria. Pediatrics 115 (1), pp. 39–47.
  11. R. Huang, M. Suttie and J. A. Noble (2019) An automated CNN-based 3D anatomical landmark detection method to facilitate surface-based 3D facial shape analysis. In Medical Image Computing and Computer-Assisted Intervention (MICCAI) Workshops, pp. 163–171.
  12. X. Li, H. Xiong, H. Wang, Y. Rao, L. Liu, Z. Chen and J. Huan (2019) DELTA: DEep Learning Transfer using feature map with Attention for convolutional networks. In Proc. of International Conference on Learning Representations (ICLR), pp. 1–13.
  13. X. Li, Y. Grandvalet and F. Davoine (2018) Explicit inductive bias for transfer learning with convolutional networks. In International Conference on Machine Learning (ICML), Vol. 80, pp. 2830–2839.
  14. Z. Li and D. Hoiem (2018) Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence 40 (12), pp. 2935–2947.
  15. D. Lopez-Paz (2016) Unifying distillation and privileged information. pp. 1–10.
  16. A. Newell, K. Yang and J. Deng (2016) Stacked hourglass networks for human pose estimation. In European Conference on Computer Vision (ECCV), pp. 483–499.
  17. S. J. Pan and Q. Yang (2010) A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22 (10), pp. 1345–1359.
  18. W. Park, D. Kim, Y. Lu and M. Cho (2019) Relational knowledge distillation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  19. A. Patra, Y. Cai, P. Chatelain, H. Sharma, L. Drukker, A. T. Papageorghiou and J. A. Noble (2019) Efficient ultrasound image analysis models with sonographer gaze assisted distillation. In Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 394–402.
  20. A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta and Y. Bengio (2015) FitNets: hints for thin deep nets. In Proc. of International Conference on Learning Representations (ICLR), pp. 1–13.
  21. O. Wiles, A. S. Koepke and A. Zisserman (2018) Self-supervised learning of a facial attribute embedding from video. In British Machine Vision Conference (BMVC).
  22. B. Xiao, H. Wu and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In European Conference on Computer Vision (ECCV), pp. 472–487.
  23. J. Zhang, M. Liu and D. Shen (2017) Detecting anatomical landmarks from limited medical imaging data using two-stage task-oriented deep neural networks. IEEE Transactions on Image Processing 26 (10), pp. 4753–4764.
  24. Z. Zhang, P. Luo, C. C. Loy and X. Tang (2016) Learning deep representation for face alignment with auxiliary attributes. IEEE Transactions on Pattern Analysis and Machine Intelligence 38 (5), pp. 918–930.
  25. Y. Zhao, Y. Liu, C. Shen, Y. Gao and S. Xiong (2020) MobileFAN: transferring deep hidden representation for face alignment. Pattern Recognition 100, pp. 107–114.
  26. Z. Zhong, J. Li, Z. Zhang, Z. Jiao and X. Gao (2019) An attention-guided deep regression model for landmark detection in cephalograms. In Proc. of Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 540–548.