Infant Pose Learning with Small Data


As human pose estimation has matured, its applications have broadened considerably. Yet the performance of state-of-the-art pose estimation models degrades significantly in applications that involve novel subjects or poses, such as infants with their unique movements. Infant motion analysis is of critical importance in child health and developmental studies. However, models trained on large-scale adult pose datasets are barely successful in estimating infant poses because of significant differences in body ratio and in the range of poses infants can take compared to adults. Moreover, privacy and security considerations limit the availability of the infant images required to train a robust pose estimation model from scratch. Here, we propose a fine-tuned domain-adapted infant pose (FiDIP) estimation model that transfers knowledge of adult poses to infant pose estimation, supervised by a domain adaptation technique on a mixed real and synthetic infant pose dataset. In developing FiDIP, we also built a synthetic and real infant pose (SyRIP) dataset with diverse, fully annotated real infant images and generated synthetic infant images. We demonstrate that our FiDIP model outperforms other state-of-the-art human pose estimation models on infant pose estimation, with a mean average precision (AP) as high as 92.2.


1 Introduction

Current efforts in machine learning, especially the recent waves of deep learning models introduced in the last decade, have obliterated records for regression and classification tasks that had previously seen only incremental accuracy improvements [31, 8, 17]. However, this performance comes at a large data cost, frequently requiring very large numbers of data/label pairs. Many other applications would benefit significantly from machine learning (ML)-based inference but involve data that is expensive to collect or label. In these domains, which we refer to as “Small Data” domains, the challenge is how to learn efficiently and reach the same performance with less data. One example of an application with the small data challenge is infant pose estimation.

The pose, a collection of human joint angles, is a succinct representation of an important portion of a person’s state. Almost any physical activity, such as yoga, dance, or sport, can be analyzed by looking at the poses of the participants. For human infants, long-term monitoring of poses provides information about their health condition, and accurate recognition of these poses can lead to better early developmental risk assessment and diagnosis [26, 9]. For example, an infant’s physical activity can be quantified by estimating their poses over time and can be used to screen for the risk of motor delays. Both motor delays and atypical movements are present in children with cerebral palsy and are risk indicators for autism spectrum disorders [41, 36]. In addition, studies have shown that particular poses during sleep can affect the risk of sudden infant death syndrome (SIDS) up to 20-fold when combined with other risk factors [2].

There are several publicly available human pose datasets, such as Microsoft COCO [18], MPII [1], LSP [16], FLIC [32], and Buffy [28]. However, their images come predominantly from scenes such as sports, TV shows, and other daily activities performed by adults, and none of these datasets provides pose images specifically of infants or young children. Besides the privacy issues that hamper large-scale data collection from infant and young child populations, infant pose images differ from available adult pose datasets because their pose distribution differs notably from the common adult poses collected from surveillance viewpoints [19]. These differences arise mainly because infants have shorter limbs and a completely different bone-to-muscle ratio compared to adults. The approximate positions of the body keypoints used for pose estimation also differ significantly between adults and infants. We show that successful mainstream human pose estimation algorithms do not yield accurate results when tested on infant images or videos (see Section 5). Our results reveal that general-purpose pose inference models often over-predict or under-predict limb sizes for infants.

Figure 1: An overview of the architecture of our fine-tuned domain-adapted infant pose (FiDIP) network, composed of two sub-networks: the pose estimation network (red-dot box) and the domain confusion network (blue-dot box). The main components of FiDIP include a feature extractor (orange), a pose predictor (blue), and a domain classifier (green).

In this paper, towards building a robust infant pose estimation model, we propose our fine-tuned domain-adapted infant pose (FiDIP) estimation network, a data-efficient inference model bootstrapped from both transfer learning and synthetic data augmentation. FiDIP benefits from the reposability of a 3D infant model [12] to generate a series of pose-diverse synthetic infant images that augment our small real infant image dataset. FiDIP consists of two main sub-networks, a pose estimation network and a domain confusion network, as shown in Fig. 1. Following homogeneous transfer learning theory, we start with a pose estimation network pre-trained on adult pose data (i.e., our source domain) and then fine-tune the network with both synthetic and real infant pose data (i.e., our target domain). Moreover, to tackle the domain shift between real and synthetic infant pose images, a domain confusion network is embedded into our primary pose estimation network to adapt the synthetic and real pose data at the feature level. Using these data-efficient learning paradigms, our FiDIP model demonstrates superior pose estimation performance on real infant pose images compared to state-of-the-art (SOTA) general-purpose pose estimation models. In short, we address a critical small data problem centered around infant pose estimation by making the following contributions:

  • Proposing a fine-tuned domain-adapted infant pose (FiDIP) model built upon a two-stage training paradigm. In Stage I of training, we fine-tune a pre-trained domain confusion network in a pose-unsupervised manner. In Stage II, we fine-tune a pre-trained pose estimation model under the guidance of the Stage I-trained domain confusion network. The domain confusion network and the pose estimation network are updated separately in an iterative way.

  • Achieving two transfer learning goals simultaneously. The FiDIP network involves two transfer learning tasks: (1) from the adult pose domain to the infant pose domain, and (2) from the synthetic image domain to the real image domain. We fine-tune the pose estimation network while constraining it to extract features that carry domain knowledge common to synthetic and real data.

  • Building and publicly releasing our synthetic and real infant pose (SyRIP) dataset, which includes 400 fully-labeled real infant images in various poses as well as 504 synthetic infant images generated by fitting a 3D skinned multi-infant linear (SMIL) body model into different feasible poses and then rendering them with various textures, backgrounds, and poses. In order to annotate the real infant pose data in a time-efficient manner, we have utilized an AI-human co-labeling toolbox (AH-CoLT) that was previously developed by the authors.

2 Related Work

Infant Pose Estimation.

For applications that require infant posture or motion analysis, current approaches are predominantly based on (real-time or recorded) visual observation by the infant’s pediatrician or on contact-based inertial sensors. Meanwhile, there have been very few recent attempts by the computer vision community to automatically perform pose estimation and tracking on videos of infants. In [15], the authors estimated the 3D body pose of infants in depth images for motion analysis. They employed a pixel-wise body part classifier using random ferns to predict 3D joints. The aim of their work was to automate motion analysis in order to identify infantile motor disorders. In [13], the authors presented a statistical learning method that builds the 3D skinned multi-infant linear (SMIL) body model from incomplete, low-quality RGB-D sequences of freely moving infants. The specific dataset they used is provided in [11], where real infant movements are mapped to the SMIL model with natural shapes and textures, and RGB and depth images are generated with 2D and 3D joint positions. However, both of these works rely heavily on access to RGB-D data sequences, which are difficult to obtain, and this hinders the use of these algorithms in regular webcam-based monitoring systems. Additionally, the definitions of some joint positions in the SMIL 3D body model differ from the positions of the corresponding keypoints marked in commonly used human pose datasets, such as COCO [18] or MPII [1]. For instance, the location of the joint “thigh” in SMIL does not completely correspond to the position of the keypoint “hip” in the COCO fashion. Therefore, its comparability is greatly reduced.

Synthetic Human Pose Data Generation.

Synthesizing complicated articulated 3D models such as the human body has drawn considerable attention lately because of its extensive applications in studying human poses, gestures, and activities [25, 27, 30, 3, 5, 7, 23, 20]. Among the benefits of synthesizing data is the possibility of automatically generating enough labeled data for supervised learning, especially in small data domains [33]. In [20], the authors introduce a semi-supervised data augmentation approach that can synthesize large-scale labeled pose datasets using 3D graphical engines based on a physically valid low-dimensional pose descriptor. As introduced in [29], 3D human poses can be reconstructed by learning a geometry-aware body representation from multi-view images without annotations. Another research trend in synthesizing human pose images is simulating human figures with generative adversarial network (GAN) techniques. The authors of [21] presented a two-stage pose-guided person generation network that feeds a reference image and a novel pose into a U-Net-like network to generate a coarse reposed person image, and then refines the image by training the U-Net-like generator adversarially. In [4], a multi-pose guided virtual try-on network is proposed to produce new person images by manipulating the desired clothes and poses. In these works, however, neither the generated human avatars nor the reconstructed poses adapt accurately to the infant style. Additionally, these GAN-based approaches to synthesizing human figures are not capable of simulating the complicated poses regularly taken by infants.

Given the above-mentioned challenges in achieving a robust infant pose estimation model and the shortcomings of the prior art, we propose a data-efficient infant pose learning method targeted at small datasets. Our fine-tuned domain-adapted infant pose (FiDIP) model outperforms the state-of-the-art (SOTA) general pose estimation models, especially on poses commonly seen in infants (see Fig. 2).

Figure 2: Samples of infant pose prediction results of DarkPose-AP:65.9 (2nd column), Faster R-CNN-AP:70.1 (3rd column), Pose-ResNet-AP:82.4 (4th column), DarkPose-AP:88.4 (5th column), and our FiDIP-AP:92.2 (6th column), which are listed in Table 1. The 1st column visualizes the groundtruth poses. Incorrect predictions are highlighted in red for each model.

3 FiDIP: Fine-tuned Domain-adapted Infant Pose Estimation Network

Our algorithm starts from an initial pose estimation model trained on abundant adult pose data, then fine-tunes that model on an augmented dataset consisting of a small amount of manually labeled real infant pose data and a series of pose-diverse synthetic infant images. Within the augmented dataset, a domain adaptation method aligns the features of synthetic infant data with those of real-world infant images. Because labeling articulated objects such as the human body is extremely time-consuming and does not scale well to the variety of poses a human (an infant in our case) can take, the number of images in our dataset is limited. Therefore, rather than re-training the whole adult pose estimation network, we update only a few layers of that network to fine-tune it for infant pose estimation.

3.1 FiDIP Network Architecture

Our FiDIP network consists of two sub-networks, as shown in Fig. 1. The pose estimation component shares the structure of the Pose-ResNet model, with the feature extractor as its encoder and the pose estimator as its decoder [38]. Pose-ResNet adds a few deconvolutional layers to a backbone network, ResNet-50 [10]: the fully connected layer of ResNet-50 is removed, and three deconvolutional layers and a convolution are added. The rationale for choosing a simple pose estimation model such as Pose-ResNet instead of a more complex network is as follows. Our FiDIP model should be able to learn infant poses from only a small amount of labeled infant data, so we have to start with a pose estimation model pre-trained on adult pose data and fine-tune it with the available infant data. To keep the fine-tuning time, and therefore the required data, as small as possible, we tested the performance of several SOTA pose estimators on our SyRIP test dataset (as listed in Table 1). The DarkPose models using a high-resolution net (HRNet) [34] backbone achieved the highest pose estimation accuracy, especially compared to the version with the simpler ResNet [10] backbone. However, HRNet connects high-to-low resolution sub-networks in parallel, which makes it too complex to be fine-tuned with a small amount of data. Hence, for our pose estimation sub-network, we select a model that performed well on real infant images before fine-tuning while having a simpler backbone such as ResNet-50, namely Pose-ResNet.

The second component of the FiDIP network is its domain confusion network. Its function is to force images from both the real domain and the synthetic domain to be mapped into the same feature space after feature extraction. Our domain confusion network is composed of a domain classifier network and a feature extractor, which is shared with the pose estimation component. The domain classifier, as an embedded part, is a binary classifier with only three fully connected layers that distinguishes whether the input feature belongs to a real or a synthetic image.

3.2 FiDIP Network Training

The FiDIP training procedure consists of a pre-stage followed by a two-stage training paradigm: a domain confusion network is embedded into the pose estimation network, and different components are trained or fine-tuned separately to achieve a robust infant pose estimation model.


Pre-stage.

The pose estimation component of the FiDIP network is already pre-trained on adult pose images from the COCO dataset [18]. Since our training strategy uses fine-tuning as a means of transfer learning, the domain classifier of our domain confusion sub-network also needs to be pre-trained in advance, to avoid unbalanced updating of the components during fine-tuning. It is pre-trained on both real and synthetic data from adult humans (a combined dataset built from the real adult images in the validation part of the COCO dataset and part of the synthetic humans for real (SURREAL) dataset [35]). During this pre-training, the feature extractor stays frozen, and only the weights of the domain classifier are initialized. The following stages are carried out after this initialization.

Stage I.

In this stage, we lock the pose estimation sub-network and fine-tune only the domain classifier of the domain confusion sub-network, based on the current performance of the feature extractor, using real and synthetic infant pose data. The objective of this stage is to obtain a domain classifier that predicts whether features come from a synthetic or a real infant image. Since the pose estimation network is locked and only the domain classifier is optimized, the optimization objective in this stage is the domain classifier loss $L_D$, which is calculated by the binary cross entropy:

$$L_D = -\frac{1}{N} \sum_{i=1}^{N} \left[ d_i \log \sigma(\hat{d}_i) + (1 - d_i) \log\left(1 - \sigma(\hat{d}_i)\right) \right]$$

where $\hat{d}_i$ is the score of the $i$-th feature belonging to the synthetic domain, $d_i$ is the corresponding groundtruth, $\sigma(\cdot)$ represents the sigmoid function, and $N$ is the batch size.
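As an illustration only, the Stage I loss can be written in a few lines of plain Python. The sketch assumes that the loss is averaged over the batch and that the label 1 denotes a synthetic image and 0 a real one; the function names are hypothetical.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def domain_loss(scores, labels, eps=1e-12):
    """Mean binary cross entropy over a batch.

    scores: raw domain-classifier outputs (logits), one per feature.
    labels: 1 for synthetic, 0 for real (assumed encoding).
    """
    if len(scores) != len(labels) or not scores:
        raise ValueError("scores and labels must be non-empty and of equal length")
    total = 0.0
    for s, d in zip(scores, labels):
        p = sigmoid(s)
        # eps guards against log(0) when the classifier saturates.
        total += d * math.log(p + eps) + (1 - d) * math.log(1.0 - p + eps)
    return -total / len(scores)


# An undecided classifier (logit 0) gives a loss close to log(2) for either label.
print(domain_loss([0.0, 0.0], [1, 0]))
```

A confident and correct classifier drives this value towards zero, which is what Stage I optimizes for while the feature extractor stays fixed.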

Stage II.

In this stage, the pose estimation network is fine-tuned while the domain classifier is locked. We refine the feature extractor so that it not only serves the pose predictor but also confuses the domain classifier. We leverage the domain classifier updated in Stage I to encourage the feature extractor to retain its ability to extract keypoint information during fine-tuning while ignoring the differences between the real and synthetic domains. An adversarial training method proposed in [6] is used to push features from synthetic and real images into a common domain. A gradient reversal layer (GRL) is introduced so that training minimizes the pose loss ($L_P$), which measures the mean squared error (MSE) between the predicted heatmap and the target heatmap for each keypoint:

$$L_P = \frac{1}{M} \sum_{m=1}^{M} \left\| H_m - \hat{H}_m \right\|_2^2$$

where $H_m$ and $\hat{H}_m$ are the target and predicted heatmaps of the $m$-th of $M$ keypoints.

It simultaneously maximizes the domain loss ($L_D$), so that the features representing the synthetic and real domains become similar. The optimization objective is represented as:

$$L(\theta_f, \theta_y, \theta_d) = L_P(\theta_f, \theta_y) - \lambda L_D(\theta_f, \theta_d)$$

where $\lambda$ controls the trade-off between the two losses that shape the features during fine-tuning, and $\theta_f$, $\theta_y$, and $\theta_d$ represent the parameters of the feature extractor, pose predictor, and domain classifier, respectively.
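The effect of the gradient reversal layer can be seen on a deliberately tiny scalar model. The sketch below is a toy illustration under stated assumptions, not the FiDIP implementation: the feature is f = w·x, the pose output is a·f, the domain logit is v·f, and all names (`stage2_gradients`, `w`, `a`, `v`) are hypothetical. The GRL is the identity in the forward pass and multiplies the gradient flowing back from the domain branch by −λ.

```python
import math


def stage2_gradients(w, a, v, x, target, domain, lam):
    """Gradients for one sample in a scalar toy model (illustrative only).

    feature f = w*x, pose output = a*f, domain logit = v*f.
    The domain classifier weight v is frozen in this stage.
    """
    f = w * x
    # Pose branch: squared error between prediction and target.
    pose_err = a * f - target
    dLp_df = 2.0 * pose_err * a
    dLp_da = 2.0 * pose_err * f
    # Domain branch: sigmoid + binary cross entropy, d(loss)/d(logit) = p - d.
    p = 1.0 / (1.0 + math.exp(-v * f))
    dLd_df = (p - domain) * v
    # Gradient reversal: identity forward, multiply by -lam backward.
    df_total = dLp_df + (-lam) * dLd_df
    return {"w": df_total * x, "a": dLp_da}


grads = stage2_gradients(w=1.0, a=1.0, v=1.0, x=1.0, target=1.0, domain=1, lam=0.5)
print(grads)
```

In this example the pose error is zero, so the only gradient reaching `w` comes from the reversed domain branch. A descent step on `w` therefore increases the domain loss, which is the feature-confusion behaviour described above.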

The domain confusion network and the pose estimation network are updated separately in an alternating cycle. First, the domain classifier is fine-tuned so that it can distinguish the feature distributions of the synthetic and real infant data. Then, the pose estimation sub-network is fine-tuned, and the GRL drives the feature extraction layers to generate features that confuse the classifier. We repeat the fine-tuning until the feature extraction layers map inputs from the two domains to features with the same distribution. In particular, when fine-tuning the pose estimation network in Stage II, we update only the high-level representation layers of the feature extractor, which are more closely related to keypoint information, rather than all of its layers.

4 SyRIP: Building a Synthetic/Real Infant Pose Dataset

As stated earlier, there is a shortage of labeled infant pose datasets, and despite recent efforts to develop them, a versatile dataset with diverse and complex poses on which to train a deep model is yet to be built. The only publicly available infant image dataset is the MINI-RGBD dataset [11], which provides only 12 synthetic infant models with continuous pose sequences. Besides having simple poses, the sequential nature of MINI-RGBD leads to small pose variation between adjacent frames, and the poses across the whole dataset are largely repeated. In Fig. 3(a), we show the distribution of body poses in the MINI-RGBD dataset and observe that they vary relatively little. Because its poses are simple and its images are synthetic, pose estimation models trained on MINI-RGBD would not generalize well to real-world infant poses.

To address this limitation, we have built a new infant pose dataset that includes both real and synthetic images of infants in various positions and performing various activities, and we use it to train our robust FiDIP model. Our synthetic and real infant pose (SyRIP) dataset includes a training part consisting of 400 real and 504 synthetic infant images, and a test part with 100 real infant images, all with fully annotated body joints. Infants in these images take many different poses, such as crawling, lying, and sitting. The real images all come from YouTube videos and Google Images, and the synthetic infant images are generated by fitting the 3D SMIL body model to real images with known 2D pose ground truth.

Figure 3: A sample image and pose distributions of (a) the MINI-RGBD dataset, (b) the real part of the SyRIP dataset, and (c) the synthetic part of the SyRIP dataset. The left side shows a sample image from each dataset with its groundtruth labels. The right side shows the pose distribution of 200 images randomly selected from each dataset, in which the colors of the body parts correspond to those in the left figures. For the pose distribution, we normalize all images based on the infant’s body bounding box to scale them to a similar size and then align them based on the torso. To better represent poses, we omit the points for the ears and eyes when visualizing the joints.

4.1 Real Pose Data Gathering

Because of the difficulty of controlling infant movements as well as critical privacy concerns, access to infant images with various poses is limited. Therefore, for the real portion of the SyRIP dataset, we look for publicly available yet scattered real infant images from sources such as YouTube and Google Images. The biggest benefit of this collection method is that it maximizes the diversity of infant poses. We choose infants (newborn to one year old) in various poses and against many different backgrounds.

We manually query YouTube and download more than 40 videos of different infants, ranging in length from 30 seconds to 5 minutes. We then use our splitting tool to split each video sequence into separate frames. From those frames, we collect about 400 images covering more than 50 infants in different poses. We also select about 100 high-resolution images containing more than 90 infants from Google Images. Compared to the images taken from the YouTube videos, the Google images show infants in more complex poses, and their high resolution improves the quality of the whole dataset. The pose distribution of the real part of the SyRIP dataset is shown in Fig. 3(b). The poses in the real part of SyRIP are clearly more diverse than those in the MINI-RGBD dataset.

4.2 Synthetic Pose Image Generation

On the one hand, it is almost impossible to train a deep neural network model from scratch, or even to fine-tune it, using just 400 images. On the other hand, it turns out to be challenging to find more real infant images online. Therefore, we generate synthetic infant images to expand our dataset and use this augmented dataset to fine-tune the existing pose estimation model. To obtain plenty of synthetic infant images with manifold poses, we use the 3D skinned multi-infant linear (SMIL) body model [14], inspired by the approach used for the synthetic humans for real (SURREAL) dataset [35]. For SURREAL, images are rendered from synthetic adult bodies created with the skinned multi-person linear (SMPL) body model, whose parameters are fitted by the MoSh method given raw 3D MoCap marker data. What differentiates our method from SURREAL is that it uses the body model of an infant (i.e., the SMIL model) instead of the adult body model, and applies the SMPLify-X [24] method to generate the SMIL model parameters. Our pipeline for synthetic infant data generation is illustrated in Fig. 4.

Figure 4: Our pipeline of synthetic infant image generation. 3D infant body models are posed by fitting SMIL model pose and shape parameters into real infant images. Output images are rendered using random background images, texture maps on the body, lighting, and camera positions.

The SMIL model has $N_V$ vertices and $N_J$ joints, and can be parameterized by the pose coefficients $\theta \in \mathbb{R}^{3(N_J+1)}$, where $N_J+1$ stands for the body joints plus one more joint (the pelvis, which is the root of the kinematic tree) for global rotation, and the shape coefficients $\beta$, which represent the proportions of the individual’s height, length, fatness, thinness, and head-to-body ratio. To fit the SMIL model’s pose and shape to the poses (skeletons) of real infant images, we minimize an objective function formulated by rewriting the objective function of SMPLify-X in [24]. It is the sum of four loss terms: (1) a joint-based data term $E_J$, which is the distance between the groundtruth 2D joints $J_{2D}$ and the 2D projection of the corresponding posed 3D joints of SMIL for each joint, (2) $E_\theta$, defined as a mixture-of-Gaussians pose prior learnt from pose data, (3) a shape penalty $E_\beta$, which is the Mahalanobis distance between the shape prior of SMIL and the shape parameters being optimized, and (4) a pose prior $E_\alpha$ penalizing elbows and knees:

$$E(\beta, \theta) = E_J(\beta, \theta; K, J_{2D}) + \lambda_\theta E_\theta(\theta) + \lambda_\beta E_\beta(\beta) + \lambda_\alpha E_\alpha(\theta)$$

where $K$ represents the intrinsic camera parameters, and $\lambda_\theta$, $\lambda_\beta$, and $\lambda_\alpha$ are weights for the specific loss terms, which are introduced in [24].
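To make the joint-based data term concrete, the following sketch projects posed 3D joints with a simple pinhole camera and sums the squared 2D distances to the groundtruth joints. It is a simplified illustration: the fixed focal length and principal point, the plain squared Euclidean penalty, and the names `project` and `joint_data_term` are assumptions made here, and the other three terms of the objective are omitted.

```python
def project(point3d, focal, center):
    """Pinhole projection of a 3D point (camera coordinates) to pixels."""
    x, y, z = point3d
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (focal * x / z + center[0], focal * y / z + center[1])


def joint_data_term(joints3d, joints2d_gt, focal, center, weights=None):
    """Sum of (optionally weighted) squared 2D reprojection errors."""
    if len(joints3d) != len(joints2d_gt):
        raise ValueError("need one groundtruth 2D joint per 3D joint")
    total = 0.0
    for i, (p3, gt) in enumerate(zip(joints3d, joints2d_gt)):
        u, v = project(p3, focal, center)
        w = 1.0 if weights is None else weights[i]
        total += w * ((u - gt[0]) ** 2 + (v - gt[1]) ** 2)
    return total


# One joint that projects 10 pixels to the right of its annotation.
print(joint_data_term([(1.0, 0.0, 2.0)], [(190.0, 100.0)], 200.0, (100.0, 100.0)))
```

Setting a per-joint weight to zero is one simple way to ignore joints that have no reliable 2D annotation.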

In our case, we generate 504 synthetic infant images to expand the training portion of the SyRIP dataset. As shown in Fig. 4, we randomly select 100 different poses/skeletons from our annotated real subset as initial poses. The synthetic infant bodies are generated by applying the approach described above to fit the SMIL model to these initial poses. To make our dataset as diverse as possible, we render the generated infant bodies with random textures/clothes and random backgrounds, from different viewpoints and under different lighting. Since there are very few infant texture resources, we create a textures/clothes set consisting of 12 infant texture images (naked with only a diaper) provided by the MINI-RGBD dataset and 478 male clothing images from the SURREAL dataset. For the backgrounds, we pick 600 scenes approximately related to infant indoor and outdoor activities from the LSUN dataset [39]. For each initial pose, we generate 10 synthetic images with different global rotations. However, fitting the SMIL model produces many unnatural or invalid infant bodies, because (1) the focal length of the camera is assumed to be a known constant, while real images actually have different intrinsic camera parameters, and (2) SMIL is, after all, a linear model and cannot fit very complicated poses. We therefore filter these out manually and retain 504 good-quality synthetic infant images of uniform resolution as our synthetic subset. We also visualize the pose distribution of the synthetic subset, as shown in Fig. 3(c), to make sure that the poses in our synthetic dataset have enough variation as well.
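The randomized rendering setup can be sketched as a sampling routine. The pool sizes and the 10 renders per initial pose come from the text; uniform sampling, the rotation range, and the field names are assumptions for illustration, and the manual filtering step is not modeled.

```python
import random

NUM_INFANT_TEXTURES = 12      # MINI-RGBD infant textures
NUM_CLOTHING_TEXTURES = 478   # SURREAL clothing images
NUM_BACKGROUNDS = 600         # background scenes
ROTATIONS_PER_POSE = 10       # renders per initial pose


def sample_render_settings(num_poses, seed=0):
    """Draw one random render configuration per (pose, rotation) pair."""
    rng = random.Random(seed)
    num_textures = NUM_INFANT_TEXTURES + NUM_CLOTHING_TEXTURES
    settings = []
    for pose_id in range(num_poses):
        for _ in range(ROTATIONS_PER_POSE):
            settings.append({
                "pose_id": pose_id,
                "texture_id": rng.randrange(num_textures),
                "background_id": rng.randrange(NUM_BACKGROUNDS),
                # Assumed: a uniformly random global rotation in degrees.
                "global_rotation_deg": rng.uniform(0.0, 360.0),
            })
    return settings


candidates = sample_render_settings(num_poses=100)
print(len(candidates))  # 100 poses x 10 rotations = 1000 candidates before filtering
```

With 100 initial poses this yields 1000 candidate renders, of which 504 are retained after manual filtering.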

4.3 Pose Data Annotation

The purpose of creating our SyRIP dataset is to train a robust infant pose estimation network, so the quality of the dataset annotation is very important. Infant poses are often difficult to distinguish, and for synthetic infants with naked textures it is sometimes hard to separate keypoints where body parts overlap without clear boundaries, which makes purely manual annotation very time-consuming. Hence, we use our AI-human co-labeling toolbox (AH-CoLT) to annotate the SyRIP dataset. This toolbox provides an efficient and augmentative annotation tool that facilitates the creation of large labeled visual datasets and enables accurate ground truth labeling by incorporating the outcomes of SOTA AI recognizers into a time-efficient human review and revision process.

The whole AH-CoLT process can be divided into three steps: AI labeling, human review, and human revision. First, a set of images is chosen as the unlabeled data source, and an appropriate pre-trained model is selected as the AI labeler to produce the initial annotation results, which are stored in a pickle file. In this step, we set the AI labeler to Faster R-CNN, a successful adult pose estimation model. Even though Faster R-CNN gives highly accurate results on adult poses, its annotations of infant poses are not fully accurate. Therefore, the second step, human review, is needed. In this step, we review the AI results and click on each joint to mark it as correct or as an error. This produces a pickle file that contains all information about all joints (their coordinates, whether they are visible, and whether they are correct). Finally, using the human reviser interface, a human revises the joints marked as errors by clicking the correct locations, which become the new joint positions.
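The three steps can be illustrated with a small data-flow sketch. The record layout and function names below are hypothetical and do not reflect the actual AH-CoLT file format; the sketch only mirrors the information the text says is stored for each joint (coordinates, visibility, correctness).

```python
import pickle


def ai_label(predictions):
    """Step 1: wrap raw (x, y, visible) predictions as reviewable records."""
    return [{"x": x, "y": y, "visible": v, "correct": True} for x, y, v in predictions]


def human_review(joints, error_indices):
    """Step 2: the reviewer flags the joints judged to be wrong."""
    for i in error_indices:
        joints[i]["correct"] = False
    return joints


def human_revise(joints, corrections):
    """Step 3: replace coordinates only for joints flagged as errors."""
    for i, (x, y) in corrections.items():
        if not joints[i]["correct"]:
            joints[i].update(x=x, y=y, correct=True)
    return joints


joints = ai_label([(10.0, 20.0, True), (30.0, 40.0, True)])
joints = human_review(joints, error_indices=[1])
joints = human_revise(joints, corrections={1: (33.0, 44.0)})
blob = pickle.dumps(joints)          # serialized the way a pickle file would be
print(pickle.loads(blob) == joints)
```

Restricting revision to flagged joints keeps the human effort proportional to the number of AI errors rather than to the number of keypoints.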

For the SyRIP dataset, we annotate 17 keypoints across the infant body in the COCO fashion: the nose, two eyes, two ears, two shoulders, two elbows, two wrists, two hips, two knees, and two ankles.
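For reference, the 17 keypoints can be listed in the order used by the COCO keypoint annotation format. That SyRIP stores them in exactly this order is an assumption based on the "COCO fashion" wording; the constant names are illustrative.

```python
# Keypoint names in the standard COCO person-keypoint order.
COCO_KEYPOINTS = [
    "nose",
    "left_eye", "right_eye",
    "left_ear", "right_ear",
    "left_shoulder", "right_shoulder",
    "left_elbow", "right_elbow",
    "left_wrist", "right_wrist",
    "left_hip", "right_hip",
    "left_knee", "right_knee",
    "left_ankle", "right_ankle",
]

# Lookup from keypoint name to its index in an annotation array.
KEYPOINT_INDEX = {name: i for i, name in enumerate(COCO_KEYPOINTS)}

# Left/right index pairs, useful for horizontal-flip augmentation.
FLIP_PAIRS = [
    (KEYPOINT_INDEX["left_" + part], KEYPOINT_INDEX["right_" + part])
    for part in ("eye", "ear", "shoulder", "elbow", "wrist", "hip", "knee", "ankle")
]

print(len(COCO_KEYPOINTS), FLIP_PAIRS[0])
```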

5 Experimental Evaluation

5.1 Training and Test Datasets

As described in Section 3.2, we first pre-train the domain classifier, and then alternately fine-tune the domain classifier of the domain confusion sub-network based on the updated feature extractor in Stage I and fine-tune the pose estimation sub-network under the constraint of the domain classifier in Stage II. Our training data is therefore divided into a pre-training dataset for the pre-stage and a stage training dataset for the fine-tuning in Stage I and Stage II. The pre-training dataset, which carries only real/synthetic labels, consists of 1904 samples from the COCO Val2017 dataset and 2000 synthetic adult images from the SURREAL dataset. As introduced in Section 4, we created the SyRIP dataset by purposefully collecting 500 online infant images with poses as varied as possible, and expanded this small dataset with 504 synthetic infant images. The training part of the SyRIP dataset (400 real and 504 synthetic infant images), with pose and domain annotations, is the stage training dataset. We demonstrate the performance of our FiDIP network through comparative experiments on the test part of the SyRIP dataset, which includes 100 real infant images.

5.2 Implementation Details

In our case, Pose-ResNet serves as the pose estimation sub-network of FiDIP, and a domain classifier, a binary classifier with only 3 fully connected layers, is attached after its feature extraction layers (ResNet-50). To train our FiDIP network, we used the Adam optimizer with a learning rate of 0.001 throughout, but different batch sizes and numbers of epochs. The pre-stage used a batch size of 128 and 10 epochs. The fine-tuning stages (Stage I and Stage II) used 80 epochs with 85 images per batch. During Stage II, we set the GRL parameter $\lambda$ to 0.0005 and froze the first three layers (Res1, Res2, and Res3) of the feature extractor.
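These settings can be collected in a small configuration sketch. The numeric values are those stated above; the dictionary layout, the component names, the helper `is_trainable`, and the block name `Res4` used in the example are hypothetical.

```python
# Hyperparameters as stated in the text; names are illustrative.
OPTIMIZER = {"name": "Adam", "learning_rate": 1e-3}
STAGES = {
    "pre": {"epochs": 10, "batch_size": 128},
    "fine_tune": {"epochs": 80, "batch_size": 85},  # Stage I and Stage II
}
GRL_LAMBDA = 5e-4
FROZEN_BLOCKS = ("Res1", "Res2", "Res3")  # frozen during Stage II

# Assumed component names for the three parts of the network.
TRAINABLE = {
    "pre": {"domain_classifier"},
    "I": {"domain_classifier"},
    "II": {"feature_extractor", "pose_predictor"},
}


def is_trainable(component, stage, block=None):
    """Return True if the given component (and backbone block) is updated."""
    if component not in TRAINABLE[stage]:
        return False
    if component == "feature_extractor" and block in FROZEN_BLOCKS:
        return False
    return True


print(is_trainable("feature_extractor", "II", block="Res2"))  # frozen low-level block
print(is_trainable("feature_extractor", "II", block="Res4"))  # higher-level block
```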

5.3 Pose Estimation Performance

We evaluated the pose estimation performance of FiDIP on the SyRIP test dataset and the COCO Val2017 dataset, and compared it with widely used pose estimation models based on the Faster R-CNN [37], DarkPose [40], and Pose-ResNet [38] algorithms, as listed in Table 1. The pose evaluation metric is the mean average precision (AP) over 10 thresholds of the object keypoint similarity (OKS), which is computed from the distance between predicted and ground truth keypoints normalized by the scale of the person. By being fine-tuned on the augmented dataset, our FiDIP model greatly improves on its initial Pose-ResNet model for infant poses. The pose estimation accuracy of FiDIP on the SyRIP test dataset is as high as 92.2 AP. Note that our SyRIP test dataset contains only 100 single-infant images, while the COCO Val2017 dataset has about 5000 images with single or multiple people. So in theory, if a pose estimator is generalizable, it should also perform well on the SyRIP test dataset, which is the case for the Pose-ResNet and DarkPose models. However, we observe that the AP of the Faster R-CNN models and of one of the DarkPose models with input size 128×96 is much lower on the infant test dataset than on the COCO dataset. This result suggests that these two pose estimators do not generalize well enough to adapt robustly to other pose-specific datasets.
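The OKS used in this metric can be computed for a single person as sketched below. The formula and the per-keypoint constants follow the public COCO keypoint evaluation convention; the function name and argument layout are illustrative, and the matching and precision-recall steps needed for a full AP computation are omitted.

```python
import math

# Per-keypoint standard deviations from the COCO keypoint evaluation,
# in COCO keypoint order (nose, eyes, ears, shoulders, ..., ankles).
COCO_SIGMAS = [0.026, 0.025, 0.025, 0.035, 0.035, 0.079, 0.079, 0.072, 0.072,
               0.062, 0.062, 0.107, 0.107, 0.087, 0.087, 0.089, 0.089]

# The 10 OKS thresholds (0.50 to 0.95) over which AP is averaged.
OKS_THRESHOLDS = [0.50 + 0.05 * i for i in range(10)]


def oks(pred, gt, visible, area, sigmas=COCO_SIGMAS):
    """Object keypoint similarity between one prediction and one annotation.

    pred, gt: lists of (x, y) pixel coordinates, one pair per keypoint.
    visible:  ground-truth visibility flags; keypoints with flag <= 0 are skipped.
    area:     area of the annotated person, used as the squared object scale.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, s in zip(pred, gt, visible, sigmas):
        if v <= 0:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * (2.0 * s) ** 2))
        count += 1
    return total / count if count else 0.0


gt = [(float(i), float(i)) for i in range(17)]
print(oks(gt, gt, [2] * 17, area=100.0))  # identical keypoints give the maximum similarity
```

Because the distance is normalized by the person area, the same pixel error is penalized more on a small infant crop than on a large one.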

Pose Estimation Model   Backbone Network   Input Image Size   COCO Val2017 AP   SyRIP Test AP
Faster R-CNN [37]       ResNet-50-FPN      Flexible           65.5              70.1
Faster R-CNN [37]       ResNet-101-FPN     Flexible           66.1              64.4
DarkPose [40]           ResNet-50          128×96             64.5              65.9
DarkPose [40]           HRNet-W48          128×96             74.2              82.1
DarkPose [40]           HRNet-W32          256×192            77.9              88.5
DarkPose [40]           HRNet-W48          384×288            79.2              88.4
Pose-ResNet [38]        ResNet-50          256×192            72.4              80.4
Pose-ResNet [38]        ResNet-50          384×288            72.3              82.4
FiDIP (Ours)            ResNet-50          384×288            59.1              92.2
Table 1: Performance comparison between the FiDIP network and the SOTA pose estimators on the COCO Val2017 and SyRIP test datasets.
Figure 5: t-SNE visualizations of extracted features for (a) the original Pose-ResNet method, (b) method g (fine-tuning without domain adaptation), and (c) method j (fine-tuning with domain adaptation) on our SyRIP dataset.

We also provide qualitative visualizations of our FiDIP network on the SyRIP test dataset, compared with the Faster R-CNN, DarkPose, and Pose-ResNet models, in Fig. 2. Simple poses, such as the example in the 1st row of Fig. 2, are predicted accurately by almost all SOTA models. However, in infants' daily activities, poses are often varied and more complex, especially in the lower body. Both the DarkPose model based on ResNet-50 with input size 128×96 (2nd column) and the Faster R-CNN model based on ResNet-50 (3rd column), which were trained on adult datasets, show obvious inaccuracies in localizing the infant's legs and feet. Even the Pose-ResNet and HRNet-based DarkPose models are unable to maintain high performance on infant lower-body estimation. In contrast, FiDIP has a much greater chance of inferring keypoints correctly for infant pose images than the other models, as shown in the other rows of Fig. 2. Our model, like the other models, also failed to predict some keypoints (in the last three rows of Fig. 2). This is because we used only 904 images in total to fine-tune FiDIP for infant pose estimation. Compared with the scale of the datasets used to train the SOTA models (e.g., the COCO training dataset contains 200,000 images), 904 images form a very small dataset, too small to capture the entire infant pose distribution.

Method   Training Data   Domain Adaptation   Pre-train DC   Update Layers   SyRIP Test-AP
a        504 Syn         -                   -              Res 4, 5        86.3
b        504 Syn         -                   -              Res 5           85.9
c        504 Syn         yes                 -              Res 4, 5        86.9
d        504 Syn         yes                 yes            Res 4, 5        87.0
e        504 Syn         yes                 -              Res 5           86.0
f        504 Syn         yes                 yes            Res 5           86.5
g        904 R+S         -                   -              Res 4, 5        91.8
h        904 R+S         -                   -              Res 5           90.8
i        904 R+S         yes                 -              Res 4, 5        91.3
j        904 R+S         yes                 yes            Res 4, 5        92.2
k        904 R+S         yes                 -              Res 5           89.9
l        904 R+S         yes                 yes            Res 5           90.8
Table 2: Ablation study of FiDIP network on SyRIP test dataset. (DC stands for domain classifier.)

5.4 Ablation Study

Table 2 investigates the performance of alternative design choices in the FiDIP model when trained on different datasets. Among them, method j is our best-performing FiDIP model, as reported in Table 1.

Domain Adaptation.

We explore whether the domain adaptation method we implemented can effectively overcome the difference between the feature spaces of the real (R) domain and the synthetic (S) domain in our SyRIP training dataset (904 R+S). Methods that include domain adaptation tend to show higher AP than the corresponding methods without domain adaptation. t-SNE [22] is used to visualize the distributions of the extracted features for the original Pose-ResNet, method g, and method j in Fig. 5. The visualization shows that the FiDIP network with embedded domain adaptation aligns the feature distributions more successfully than the other networks.

Update Layers.

Freezing the weights of the first few layers of a pre-trained network is a common practice when fine-tuning a network with an insufficient amount of training data. The first few layers are responsible for capturing universal features such as curves and edges, so we fix them to force our network to focus on learning dataset-specific features in the subsequent layers at Stage II. We explore how updating different numbers of the last few layers of the network affects the performance of the trained model. In Table 2, for methods g, i, and j, the 4th and 5th ResNet blocks of our feature extractor (ResNet-50) are updated, whereas in methods h, k, and l the first four ResNet blocks are fixed and only the weights of the last block are updated. We observe that methods g, i, and j perform much better than the other three.
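The two update schemes compared above can be sketched as a simple mapping from ResNet block names to trainable flags. This is a plain-Python illustration under assumed names (`RESNET_BLOCKS`, `trainable_flags`), not the FiDIP training code. In a deep learning framework the flags would typically be applied by disabling gradient computation for the parameters of the frozen blocks.

```python
from typing import Dict, Iterable, List

# Block names follow the Res1..Res5 naming used in the text.
RESNET_BLOCKS = ("Res1", "Res2", "Res3", "Res4", "Res5")


def trainable_flags(update_blocks: Iterable[str]) -> Dict[str, bool]:
    """Map each ResNet block to True (updated) or False (frozen)."""
    update = set(update_blocks)
    unknown = update - set(RESNET_BLOCKS)
    if unknown:
        raise ValueError(f"unknown blocks: {sorted(unknown)}")
    return {name: name in update for name in RESNET_BLOCKS}


def frozen_blocks(flags: Dict[str, bool]) -> List[str]:
    """List the blocks whose weights stay fixed during fine-tuning."""
    return [name for name, trainable in flags.items() if not trainable]


# Methods g, i, j: update Res4 and Res5, freeze Res1 to Res3.
flags_gij = trainable_flags(["Res4", "Res5"])
# Methods h, k, l: update only Res5, freeze Res1 to Res4.
flags_hkl = trainable_flags(["Res5"])

print(frozen_blocks(flags_gij))
print(frozen_blocks(flags_hkl))
```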

6 Conclusion

In this paper, we present our FiDIP model, which consists of a pose estimation sub-network that leverages transfer learning from a pose estimation network pre-trained on adult poses, and a domain confusion sub-network that adapts the model to both real and synthetic infant datasets. To expand the pool of available infant pose images, a series of synthetic infant images is generated and added to a set of real infant pose images, which together form our SyRIP dataset. Our FiDIP model achieves much better results on the SyRIP test dataset than other SOTA pose estimation models, with AP as high as 92.2.


  1. The code is available at: The SyRIP dataset can be downloaded at: Synthetic and Real Infant Pose (SyRIP).


  1. M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3686–3693. Cited by: §1, §2.
  2. E. Athanasakis, S. Karavasiliadou and I. Styliadis (2011) The factors contributing to the risk of sudden infant death syndrome. Hippokratia 15 (2), pp. 127. Cited by: §1.
  3. W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or and B. Chen (2016) Synthesizing training images for boosting human 3d pose estimation. 3D Vision (3DV), 2016 Fourth International Conference on, pp. 479–488. Cited by: §2.
  4. H. Dong, X. Liang, X. Shen, B. Wang, H. Lai, J. Zhu, Z. Hu and J. Yin (2019) Towards multi-pose guided virtual try-on network. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9026–9035. Cited by: §2.
  5. Y. Du, Y. Wong, Y. Liu, F. Han, Y. Gui, Z. Wang, M. Kankanhalli and W. Geng (2016) Marker-less 3d human motion capture with monocular image sequence and height-maps. European Conference on Computer Vision, pp. 20–36. Cited by: §2.
  6. Y. Ganin and V. Lempitsky (2015) Unsupervised domain adaptation by backpropagation. In International conference on machine learning, pp. 1180–1189. Cited by: §3.2.
  7. M. F. Ghezelghieh, R. Kasturi and S. Sarkar (2016) Learning camera viewpoint using cnn to improve 3d body pose estimation. 3D Vision (3DV), 2016 Fourth International Conference on, pp. 685–693. Cited by: §2.
  8. E. Gibney (2016) Google ai algorithm masters ancient game of go. Nature News 529 (7587), pp. 445. Cited by: §1.
  9. M. Hadders-Algra, A. W. K. Van den Nieuwendijk, A. Maitijn and L. A. van Eykern (1997) Assessment of general movements: towards a better understanding of a sensitive method to evaluate brain function in young infants. Developmental Medicine & Child Neurology 39 (2), pp. 88–98. Cited by: §1.
  10. K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  11. N. Hesse, C. Bodensteiner, M. Arens, U. G. Hofmann, R. Weinberger and A. Sebastian Schroeder (2018) Computer vision for medical infant motion analysis: state of the art and rgb-d data set. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 0–0. Cited by: §2, §4.
  12. N. Hesse, C. Bodensteiner, M. Arens, U. G. Hofmann, R. Weinberger and A. Sebastian Schroeder (2018-09) Computer vision for medical infant motion analysis: state of the art and rgb-d data set. In The European Conference on Computer Vision (ECCV) Workshops, Cited by: §1.
  13. N. Hesse, S. Pujades, M. Black, M. Arens, U. Hofmann and S. Schroeder (2019) Learning and tracking the 3d body shape of freely moving infants from rgb-d sequences. IEEE transactions on pattern analysis and machine intelligence. Cited by: §2.
  14. N. Hesse, S. Pujades, J. Romero, M. J. Black, C. Bodensteiner, M. Arens, U. G. Hofmann, U. Tacke, M. Hadders-Algra and R. Weinberger (2018) Learning an infant body model from rgb-d data for accurate full body motion analysis. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 792–800. Cited by: §4.2.
  15. N. Hesse, A. S. Schröder, W. Müller-Felber, C. Bodensteiner, M. Arens and U. G. Hofmann (2017) Body pose estimation in depth images for infant motion analysis. In 2017 39th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1909–1912. Cited by: §2.
  16. S. Johnson and M. Everingham (2010) Clustered pose and nonlinear appearance models for human pose estimation. In Proceedings of the British Machine Vision Conference, Note: doi:10.5244/C.24.12 Cited by: §1.
  17. T. Karras, S. Laine and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4401–4410. Cited by: §1.
  18. T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft coco: common objects in context. In European conference on computer vision, pp. 740–755. Cited by: §1, §2, §3.2.
  19. S. Liu and S. Ostadabbas (October 28th, 2017, Venice, Italy) A vision-based system for in-bed posture tracking. In Fifth International Workshop on Assistive Computer Vision and Robotics (ICCV/ACVR’17), Cited by: §1.
  20. S. Liu and S. Ostadabbas (2018) A semi-supervised data augmentation approach using 3d graphical engines. European Conference on Computer Vision, pp. 395–408. Cited by: §2.
  21. L. Ma, X. Jia, Q. Sun, B. Schiele, T. Tuytelaars and L. Van Gool (2017) Pose guided person image generation. In Advances in neural information processing systems, pp. 406–416. Cited by: §2.
  22. L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-sne. Journal of machine learning research 9 (Nov), pp. 2579–2605. Cited by: §5.4.
  23. R. Okada and S. Soatto (2008) Relevant feature selection for human pose estimation and localization in cluttered images. European Conference on Computer Vision, pp. 434–445. Cited by: §2.
  24. G. Pavlakos, V. Choutas, N. Ghorbani, T. Bolkart, A. A. Osman, D. Tzionas and M. J. Black (2019) Expressive body capture: 3d hands, face, and body from a single image. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10975–10985. Cited by: §4.2, §4.2, §4.2.
  25. L. Pishchulin, A. Jain, M. Andriluka, T. Thormählen and B. Schiele (2012) Articulated people detection and pose estimation: reshaping the future. Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, pp. 3178–3185. Cited by: §2.
  26. H. F. Prechtl (1990) Qualitative changes of spontaneous movements in fetus and preterm infant are a marker of neurological dysfunction.. Early human development. Cited by: §1.
  27. W. Qiu (2016) Generating human images and ground truth using computer graphics. Ph.D. Thesis, University of California, Los Angeles. Cited by: §2.
  28. D. Ramanan (2006) Learning to parse images of articulated bodies. NIPS 1, pp. 7. Cited by: §1.
  29. H. Rhodin, M. Salzmann and P. Fua (2018) Unsupervised geometry-aware representation for 3d human pose estimation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 750–767. Cited by: §2.
  30. J. Romero, M. Loper and M. J. Black (2015) FlowCap: 2d human pose from optical flow. German Conference on Pattern Recognition, pp. 412–423. Cited by: §2.
  31. O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla and M. Bernstein (2015) Imagenet large scale visual recognition challenge. International journal of computer vision 115 (3), pp. 211–252. Cited by: §1.
  32. B. Sapp and B. Taskar (2013) MODEC: multimodal decomposable models for human pose estimation. In Proc. CVPR, Cited by: §1.
  33. H. Su, C. R. Qi, Y. Li and L. J. Guibas (2015) Render for cnn: viewpoint estimation in images using cnns trained with rendered 3d model views. Proceedings of the IEEE International Conference on Computer Vision, pp. 2686–2694. Cited by: §2.
  34. K. Sun, B. Xiao, D. Liu and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5693–5703. Cited by: §3.1.
  35. G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev and C. Schmid (2017) Learning from synthetic humans. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 109–117. Cited by: §3.2, §4.2.
  36. K. Vyas, R. Ma, B. Rezaei, S. Liu, M. Neubauer, T. Ploetz, R. Oberleitner and S. Ostadabbas (2019) Recognition of atypical behavior in autism diagnosis from video using pose estimation over time. In 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. Cited by: §1.
  37. Y. Wu, A. Kirillov, F. Massa, W. Lo and R. Girshick (2019) Detectron2. Cited by: §5.3, Table 1.
  38. B. Xiao, H. Wu and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In Proceedings of the European conference on computer vision (ECCV), pp. 466–481. Cited by: §3.1, §5.3, Table 1.
  39. F. Yu, Y. Zhang, S. Song, A. Seff and J. Xiao (2015) LSUN: construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint:1506.03365. Cited by: §4.2.
  40. F. Zhang, X. Zhu, H. Dai, M. Ye and C. Zhu (2020) Distribution-aware coordinate representation for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7093–7102. Cited by: §5.3, Table 1.
  41. L. Zwaigenbaum, S. Bryson and N. Garon (2013) Early identification of autism spectrum disorders. Behavioural brain research 251, pp. 133–146. Cited by: §1.