Learning from Synthetic Animals


Despite great success in human parsing, progress in parsing other deformable articulated objects, such as animals, is still limited by the lack of labeled data. In this paper, we use synthetic images and ground truth generated from CAD animal models to address this challenge. To bridge the gap between real and synthetic images, we propose a novel consistency-constrained semi-supervised learning method (CC-SSL). Our method leverages both spatial and temporal consistencies to bootstrap weak models trained on synthetic data with unlabeled real images. We demonstrate the effectiveness of our method on highly deformable animals, such as horses and tigers. Without using any real image label, our method allows for accurate keypoint prediction on real images. Moreover, we quantitatively show that models using synthetic data achieve better generalization performance than models trained on real images across different domains in the Visual Domain Adaptation Challenge dataset. Our synthetic dataset contains 10+ animals with diverse poses and rich ground truth, which enables us to use a multi-task learning strategy to further boost models’ performance.


1 Introduction

Thanks to the presence of large-scale annotated datasets and powerful Convolutional Neural Networks (CNNs), the state of human parsing has advanced rapidly. By contrast, there is little previous work on parsing animals. Parsing animals is important for many tasks, including, but not limited to, monitoring wild animal behaviors, developing bio-inspired robots, and building motion capture systems. All these may bring improvements to our ecosystems and society.

One main problem for parsing animals is the scarcity of datasets. Though many datasets containing animals are built for classification, bounding box detection, recognition, and instance segmentation, only a small number of datasets are built for parsing animal keypoints and part segmentation. Annotating large-scale datasets for animals is prohibitively expensive. Therefore, most existing approaches to parsing humans, which often require enormous amounts of annotated data [1, 37], are less suited to parsing animals.

In this work, we use synthetic data to address this challenge. Many works [39, 34] show that by jointly using synthetic images and real images, models can achieve better results than those trained on real images only. In addition, synthetic data has several unique advantages compared to real-world datasets. First, rendering synthetic data with rich ground truth at scale is easier and cheaper than capturing real-world images. Second, synthetic data can provide accurate ground truth for cases where annotations are hard to acquire for natural images, such as optical flow [11] or keypoints under occlusion and low resolution. Third, real-world datasets usually suffer from the long-tail problem, where rare cases are underrepresented. Generated synthetic datasets can avoid this problem by sampling rendering parameters.

Figure 1: Overview. We generate a synthetic dataset by randomly sampling rendering parameters including camera viewpoints, lighting, textures and poses. The dataset contains 10+ animals along with rich ground truth, such as dense 2D pose, part segmentation and depth maps. Using synthetic datasets, we show an effective method which allows for accurate keypoint prediction across domains. In addition to 2D pose estimation, we also show that models can predict accurate part segmentation.

However, there are large domain gaps [7, 43, 16] between synthetic images and real images, which prevent models trained on synthetic data from generalizing well to real-world images. Moreover, synthetic data is also limited by object diversity. ShapeNet [6] has been created to include diverse 3D models and SMPL [29] has been built for humans. Nevertheless, creating such diverse synthetic models is a difficult task, which requires capturing the appearance and attaching a skeleton to the object. Besides, considering the number of animal categories in the world, creating diverse synthetic models along with realistic textures for each animal is almost infeasible.

In this paper, we propose a method where models are trained using synthetic CAD models. Our method can achieve high performance with only a single CAD animal model. We use pseudo-labels generated on an unlabeled real dataset for semi-supervised learning. In order to handle noisy labels generated by the weak model trained only on synthetic data, we design three consistency-check criteria to evaluate the quality of the predicted labels; we call the resulting method consistency-constrained semi-supervised learning (CC-SSL). Through extensive experiments, we show that the model achieves performance similar to models trained on real data, but without using any annotation of real images. It also outperforms other domain adaptation methods by a large margin. When real image annotations are provided, the performance can be further improved. Furthermore, we demonstrate that models trained with synthetic data show better domain generalization performance than those trained on real data in multiple visual domains.

We summarize the contributions of our paper as follows. First, we propose a consistency-constrained semi-supervised learning framework (CC-SSL) to learn a model with one single CAD object. We show that models trained with synthetic data and unlabeled real images allow for accurate keypoints prediction on real-world images. Second, when using real image labels, we show that models trained jointly on synthetic and real images achieve better results compared to models trained only on real images. Third, we evaluate the generalizability of our learned models across different visual domains in the Visual Domain Adaptation Challenge dataset and we quantitatively demonstrate that models trained using synthetic data show better generalization performance than models trained on real-world images. Lastly, we generate an animal dataset with 10+ different animal CAD models and we demonstrate the data can be effectively used for 2D pose estimation, part segmentation, and multi-task learning.

2 Related Work

2.1 Animal Parsing

Though many datasets containing animals are built for classification, bounding box detection, recognition, and instance segmentation, only a small number of datasets are built for parsing animals, such as pose estimation [33, 44, 5, 32, 25] and animal part segmentation [8]. In addition, due to the labor required to annotate images, these datasets only cover a tiny portion of animal species in the world.

Due to the lack of annotations, synthetic data has been widely used to address the problem [48, 3, 49, 50]. Similar to SMPL models [29] for humans, [50] proposes a method to learn articulated SMAL shape models for animals. Later, [49] extracts more 3D shape details and is able to model new species. Unfortunately, these methods are built on manually extracted silhouettes and keypoint annotations. Recently, [48] proposes to copy texture from real animals and trains models to predict 3D meshes of animals in an end-to-end manner. Most related to our method is [3], where the authors propose a method to estimate animal poses in real images using synthetic silhouettes, which requires an additional robust segmentation model for real images during inference. In contrast, our strategy does not require any additional pre-trained models.

2.2 Unsupervised Domain Adaptation

Unsupervised domain adaptation focuses on learning a model that works well on a target domain when provided with labeled source samples and unlabeled target samples. A number of image-to-image translation methods [27, 45, 17] have been proposed to transfer images between domains. Another line of work studies how to explicitly minimize some measure of feature difference, such as maximum mean discrepancy [42, 28] or correlation distances [38, 40]. [4] proposes to explicitly partition features into a shared space and a private space. Recently, adversarial losses [41, 16] have been used to learn domain-invariant features, where a domain classifier is trained to best distinguish the source and target distributions. [41] proposes a general framework to bring features from different domains closer. [16, 30] extend this idea with cycle consistency to improve results. More recent works have extended these ideas to object detection [15, 21], semantic segmentation [13, 16, 26] and pose estimation [7] tasks.

For deformable object parsing, [7] studies using synthetic human images combined with domain adaptation to improve human 3D pose estimation. [43] renders 145 realistic synthetic human models to reduce the domain gap. Different from previous works, where a large number of realistic synthetic models is required, we show that models trained on one CAD model can learn domain-invariant features.

2.3 Self-training

Self-training has been proved effective in semi-supervised learning. Early work [24] draws the connection between deep self-training and entropy regularization. However, since generated pseudo-labels are noisy, a number of methods [22, 10, 46, 47, 23, 12, 26, 9, 35, 36] have been proposed to address the problem. [46, 47] formulate self-training as a general EM algorithm and propose a confidence-regularized self-training framework. [23] proposes a self-ensembling framework to bootstrap models using unlabeled data. [12] extends the previous work to unsupervised domain adaptation and demonstrates its effectiveness in bridging domain gaps. [9] suggests that the idea can be extended to semantic segmentation by incorporating GANs. Recently, [36, 18, 20] demonstrate the effectiveness of self-training in object detection.

Closely related to our work on 2D pose estimation is [35], where the authors propose a simple method for omni-supervised learning that distills knowledge from unlabeled data and demonstrate its effectiveness on detection and pose estimation. However, under large domain discrepancy, the assumption that the teacher model assigns high-confidence pseudo-labels is not guaranteed. To tackle the problem, we introduce a curriculum learning strategy [2, 14, 19] to progressively increase pseudo-labels and train models in iterations. We also extend [35] by leveraging both spatial and temporal consistencies.

3 Approach

Figure 2: Consistency-constrained semi-supervised learning pipeline. $T_\beta$ indicates the invariance consistency, $T_\alpha$ indicates the equivariance consistency and $T_\Delta$ indicates the temporal consistency. The training procedure is as follows: we start by training a model using only synthetic data and obtain an initial weak model $f^{(0)}$. Then we iterate the following procedure. For the $n$-th iteration, we first use the proposed Pseudo-Label Generation Algorithm 1 to generate labels $\hat{Y}_t^{(n)}$. Next, we train the model using $(X_s, Y_s)$ and $(X_t, \hat{Y}_t^{(n)})$ jointly.

In this section, we first formulate a unified image generation procedure in Section 3.1, which is built on the low dimension manifold assumption. In Section 3.2, we define three consistencies and discuss how to take advantage of these consistencies during pseudo-label generation process. In Section 3.3, we propose a Pseudo-Label Generation algorithm using consistency-check. Then in Section 3.4 we present our consistency-constrained semi-supervised learning algorithm and discuss the iterative training pipeline. Lastly, in Section 3.5, we explain how our synthetic datasets are generated.

We consider the problem under the unsupervised domain adaptation framework with two datasets. We name our synthetic dataset the source dataset $(X_s, Y_s)$ and the real images the target dataset $X_t$. The goal is to learn a model $f$ to predict labels $Y_t$ for the target data $X_t$. We simply start by learning a source model $f^{(0)}$ using the paired data $(X_s, Y_s)$ in a fully supervised way. Then we bootstrap the source model using the target dataset in an iterative manner. An overview of the pipeline is presented in Figure 2.

3.1 Formulate Image Generation Procedure

In order to learn a model from synthetic data that generalizes well to real data, one needs to assume that there exists some essential knowledge shared between the two domains. Take animal 2D pose estimation as an example: though synthetic and natural images differ in texture and background, they are quite similar in terms of pose and shape. These are exactly what we hope a model trained on synthetic data can learn. An ideal model should therefore capture these essential factors and ignore less relevant ones, such as lighting and background.

Formally, we introduce a generator $G$ that transforms poses, shapes, viewpoints, textures and other factors into an image. Mathematically, we group all these factors into two categories: task-related factors $\alpha$, which are what a model cares about, and the remaining factors $\beta$, which are irrelevant to the task at hand. We therefore parametrize the image generation process as follows,

$$X = G(\alpha, \beta) \tag{1}$$

where $X$ is a generated image and $G$ denotes the generator. Specifically, for 2D pose estimation, $\alpha$ represents factors related to the 2D keypoints, such as pose and shape; $\beta$ indicates factors independent of $\alpha$, such as textures, lighting and background.

3.2 Consistency

Based on the formulation in the previous section, we define three consistencies and discuss how to take advantage of these consistencies during pseudo-label generation process for self-training.

Since model-generated labels on the target dataset are noisy, one needs to tell the model which predictions are correct and which are wrong. An ideal 2D keypoint detector should give consistent predictions on one image no matter how the background is perturbed. In addition, if one rotates the image, the prediction should change accordingly as well. Based on these intuitions, we propose to use consistency-check to reduce false positives.

We formulate these intuitive observations in a formal way. In the following paragraphs, we introduce invariance consistency, equivariance consistency and temporal consistency. We also discuss how to use consistency-check to generate pseudo-labels, which serves as the basis for our semi-supervised learning method.

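A transformation applied to an image can be considered as directly transforming the underlying factors in Equation 1. We define a general tensor operator $T$ acting on images. In addition, we introduce $\tau_\alpha$ for operations that affect $\alpha$ and $\tau_\beta$ for operations independent of $\alpha$. Then Equation 1 can be expressed as follows,

$$T(X) = G(\tau_\alpha(\alpha), \tau_\beta(\beta)) \tag{2}$$

We use $f$ to denote a perfect 2D pose estimation model. When $f$ is applied to Equation 2, its output depends only on the task-related factors, that is, $f[G(\tau_\alpha(\alpha), \tau_\beta(\beta))] = f[G(\tau_\alpha(\alpha), \beta)]$.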

Invariance consistency: If a transform does not change the factors associated with the task, the model’s prediction is expected to stay the same. In other words, a well-behaved model should be invariant to operations on $\beta$. For example, in 2D pose estimation, adding noise to the image or perturbing colors should not affect the model’s prediction. We call these transforms invariant transforms $T_\beta$, as shown in Equation 3.

$$f[T_\beta(X)] = f(X) \tag{3}$$

If we apply multiple invariant transforms to the same image, the predictions on these transformed images should be consistent. This consistency can be used to verify whether the prediction is correct, which we refer to as invariance consistency.
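As a minimal sketch of how an invariance check of this kind could look, the Python snippet below applies a few appearance-only transforms to one image and measures how far each predicted keypoint moves. The toy detector `brightest_pixel_model`, the transforms `brighten` and `add_offset`, and the function `invariance_spread` are hypothetical names introduced only for illustration; they stand in for a real keypoint detector and are not the implementation used in the experiments.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]
Point = Tuple[float, float]


def brightest_pixel_model(image: Image) -> List[Point]:
    """Toy stand-in for a keypoint detector: one keypoint at the brightest pixel."""
    _, x, y = max(
        ((value, x, y) for y, row in enumerate(image) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return [(float(x), float(y))]


def brighten(image: Image, gain: float = 1.5) -> Image:
    """Appearance-only transform: scale all intensities."""
    return [[value * gain for value in row] for row in image]


def add_offset(image: Image, offset: float = 0.1) -> Image:
    """Appearance-only transform: shift all intensities."""
    return [[value + offset for value in row] for row in image]


def invariance_spread(
    model: Callable[[Image], List[Point]],
    image: Image,
    transforms: List[Callable[[Image], Image]],
) -> List[float]:
    """Largest distance of each keypoint from its mean over all transformed copies."""
    predictions = [model(image)] + [model(t(image)) for t in transforms]
    spreads = []
    for k in range(len(predictions[0])):
        xs = [pred[k][0] for pred in predictions]
        ys = [pred[k][1] for pred in predictions]
        cx, cy = sum(xs) / len(xs), sum(ys) / len(ys)
        spreads.append(max(math.hypot(x - cx, y - cy) for x, y in zip(xs, ys)))
    return spreads


image = [[0.1, 0.2, 0.1],
         [0.1, 0.2, 0.9],
         [0.3, 0.1, 0.1]]
spread = invariance_spread(brightest_pixel_model, image, [brighten, add_offset])
print(spread)  # a small spread suggests the prediction passes the check
```

A keypoint whose spread exceeds some tolerance would be treated as unreliable; the tolerance itself is a design choice that this sketch leaves open.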

Equivariance consistency: Besides invariant transforms, there are other cases where the task-related factors are changed. We use $\tau_\alpha$ to denote operations that affect 2D poses. There are special cases where we can easily obtain the corresponding image-level transform. One easy case is when the effect of $\tau_\alpha$ is only a geometric transformation of the 2D image, which we refer to as an equivariant transform $T_\alpha$. This is essentially similar to what [35] proposes. Therefore, we have equivariance consistency as shown in Equation 4.

$$f[T_\alpha(X)] = T_\alpha[f(X)] \tag{4}$$

It is also easy to show that $f(X) = T_\alpha^{-1}[f[T_\alpha(X)]]$, which means that after applying the inverse transform $T_\alpha^{-1}$, a good model should give back the original prediction.
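The inverse-transform check can be illustrated with a short Python sketch. It uses a horizontal flip as the equivariant transform purely because a flip is its own inverse, which keeps the example small; scaling or rotation would follow the same pattern. The toy detector and all function names here are assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

Image = List[List[float]]
Point = Tuple[float, float]


def brightest_pixel_model(image: Image) -> List[Point]:
    """Toy stand-in for a keypoint detector: one keypoint at the brightest pixel."""
    _, x, y = max(
        ((value, x, y) for y, row in enumerate(image) for x, value in enumerate(row)),
        key=lambda item: item[0],
    )
    return [(float(x), float(y))]


def flip_image(image: Image) -> Image:
    """Geometric transform applied to the image: horizontal flip."""
    return [list(reversed(row)) for row in image]


def flip_points(points: List[Point], width: int) -> List[Point]:
    """The same flip applied to keypoints; a horizontal flip is its own inverse."""
    return [(width - 1 - x, y) for x, y in points]


def equivariance_error(model: Callable[[Image], List[Point]], image: Image) -> List[float]:
    """Distance between the direct prediction and the prediction mapped back from the flipped image."""
    width = len(image[0])
    direct = model(image)
    mapped_back = flip_points(model(flip_image(image)), width)
    return [
        math.hypot(ax - bx, ay - by)
        for (ax, ay), (bx, by) in zip(direct, mapped_back)
    ]


image = [[0.0, 0.8, 0.0, 0.0],
         [0.0, 0.0, 0.0, 0.1]]
errors = equivariance_error(brightest_pixel_model, image)
print(errors)  # small errors suggest the prediction is equivariant to the flip
```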

Temporal consistency: It is difficult to model the transform between frames in a video. This transform, which we denote $T_\Delta$, does not satisfy the invariance and equivariance properties described above. However, $T_\Delta$ is still caused by variations of the underlying factors $\alpha$ and $\beta$, and in a real-world video these factors cannot change dramatically between neighboring frames.

$$f[T_\Delta(X)] = f(X) + \Delta \tag{5}$$

Although we cannot directly model $T_\Delta$, we can assume that the keypoint shift $\Delta$ between two frames is relatively small, as shown in Equation 5. Intuitively, this means that the keypoint predictions for the same joint in consecutive frames should not be too far apart; otherwise the prediction is likely to be incorrect.

For 2D keypoint estimation, we observe that $\Delta$ can be approximated by the optical flow result, which allows us to use optical flow to propagate pseudo-labels from confident frames to less confident ones.
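To make the flow-based propagation concrete, here is a small Python sketch, assuming a dense flow field stored as `flow[y][x] = (dx, dy)` that points from the previous frame to the current one. The nearest-pixel lookup, the function names and the `max_shift` tolerance are all illustrative assumptions rather than details taken from the method above.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]
Flow = List[List[Tuple[float, float]]]  # flow[y][x] = (dx, dy), previous -> current frame


def propagate_keypoint(point: Point, flow: Flow) -> Point:
    """Shift a keypoint by the flow vector at its nearest pixel (clamped to the grid)."""
    height, width = len(flow), len(flow[0])
    col = min(max(int(round(point[0])), 0), width - 1)
    row = min(max(int(round(point[1])), 0), height - 1)
    dx, dy = flow[row][col]
    return (point[0] + dx, point[1] + dy)


def is_temporally_consistent(expected: Point, predicted: Point, max_shift: float) -> bool:
    """True when the current prediction stays close to the flow-propagated location."""
    return math.hypot(predicted[0] - expected[0], predicted[1] - expected[1]) <= max_shift


# Uniform flow: every pixel moves 1 px right and 0.5 px down between frames.
flow = [[(1.0, 0.5) for _ in range(3)] for _ in range(3)]
expected = propagate_keypoint((1.0, 1.0), flow)
print(expected)
print(is_temporally_consistent(expected, (2.0, 2.0), max_shift=1.0))
print(is_temporally_consistent(expected, (0.0, 0.0), max_shift=1.0))
```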

We define these three consistencies in terms of 2D pose estimation. However, they are not restricted to 2D pose estimation. $\alpha$ can actually take an arbitrary form, such as factors related to 3D pose. In that case the invariance consistency stays the same, but the equivariance consistency no longer holds in full, since the mapping between 3D pose and 2D pose is not one-to-one and there are ambiguities in the depth dimension. However, one can still use it as a consistency for the other two dimensions, which means the projected poses should still satisfy the same consistency. The temporal consistency follows the same principle. Thus, although the corresponding consistencies may differ between tasks, they all follow the same philosophy.

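Algorithm 1 Pseudo-Label Generation Algorithm
Input: Target dataset $X_t$; model $f^{(n-1)}$; flow decay factor $\lambda$
Intermediate Result: $P_\beta$, $P_\alpha$ are predictions after applying
      invariance and equivariance transforms.
Output: Pseudo-labels $\hat{Y}_t^{(n)}$; confidence scores $C_t^{(n)}$.

for $X_t^i$ in $X_t$ do
    (Invariance Consistency)
    $P_\beta = f^{(n-1)}[T_\beta(X_t^i)]$
    (Equivariance Consistency)
    $P_\alpha = T_\alpha^{-1}[f^{(n-1)}[T_\alpha(X_t^i)]]$
    (Self-Ensembling)
    Ensemble $P_\alpha$ and $P_\beta$ to get $(\hat{Y}_t^{(n),i}, C_t^{(n),i})$
    (Temporal Consistency)
    if $C_t^{(n),i} < \lambda \, C_t^{(n),i-1}$ then
        $\hat{Y}_t^{(n),i} = \hat{Y}_t^{(n),i-1} + \Delta$
        $C_t^{(n),i} = \lambda \, C_t^{(n),i-1}$
    end if
end for
Sort $C_t^{(n)}$ and obtain $\hat{Y}_t^{(n)}$ based on a fixed curriculum learning policy.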

3.3 Pseudo-Label Generation

In this section, we explain the details about how these consistencies can be used in practice for generating pseudo-labels and propose pseudo-label generation Algorithm 1.

We address the noisy label problem in two ways. First, we develop an algorithm to generate pseudo-labels using consistency-check to remove false positives, which is based on the assumption that labels generated using the correct information always satisfy these consistencies. Second, we apply the curriculum learning idea to gradually increase the number of training samples and learn models in an iterative fashion.

To this end, we present our Pseudo-Label Generation Algorithm 1. For the $n$-th iteration, given the target dataset $X_t$ and the model $f^{(n-1)}$ obtained from the $(n-1)$-th iteration, we iterate through each image $X_t^i$ in the target dataset. $f^{(n-1)}$ is not updated in this process. We apply multiple invariant transforms $T_\beta$ and equivariant transforms $T_\alpha$ to $X_t^i$, and ensemble all predictions to get the pair of estimated labels and confidence scores $(\hat{Y}_t^{(n),i}, C_t^{(n),i})$. Then we check whether the confidence score $C_t^{(n),i}$ is strong compared to the previous frame's confidence $C_t^{(n),i-1}$. If the current frame's prediction is strong, we keep the prediction and its confidence score; otherwise, we replace the prediction with the flow prediction $\hat{Y}_t^{(n),i-1} + \Delta$ and replace $C_t^{(n),i}$ with the previous frame's confidence multiplied by a decay factor $\lambda$. At this point, the algorithm has generated labels and confidence scores for every keypoint. Finally, we iterate through the target dataset again to select a confidence threshold, which determines the percentage of labels used for training. Here, we employ the curriculum learning strategy: one uses keypoints with high confidence first and gradually includes more keypoints in later iterations. For instance, one may use the keypoints ranked in the top 20% at the beginning, the top 30% in the second iteration, and so on.
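The temporal fallback and the curriculum selection described above can be sketched in a few lines of Python. This is a simplified illustration for a single keypoint tracked through one video, assuming the ensembled predictions, their confidences and the per-frame flow shifts are already available; the names `temporal_fallback` and `select_top_fraction` and the example numbers are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def temporal_fallback(
    preds: List[Point],
    confs: List[float],
    shifts: List[Point],
    decay: float,
) -> Tuple[List[Point], List[float]]:
    """Keep confident predictions; otherwise propagate the previous label by the flow shift.

    preds[i], confs[i]: ensembled prediction and confidence at frame i.
    shifts[i]: assumed flow displacement (dx, dy) from frame i-1 to frame i.
    """
    labels, scores = [preds[0]], [confs[0]]
    for i in range(1, len(preds)):
        decayed = decay * scores[i - 1]
        if confs[i] < decayed:
            prev_x, prev_y = labels[i - 1]
            dx, dy = shifts[i]
            labels.append((prev_x + dx, prev_y + dy))
            scores.append(decayed)
        else:
            labels.append(preds[i])
            scores.append(confs[i])
    return labels, scores


def select_top_fraction(scores: List[float], fraction: float) -> List[int]:
    """Indices of the most confident labels; the fraction grows over iterations."""
    keep = int(len(scores) * fraction)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:keep])


preds = [(10.0, 10.0), (11.0, 10.0), (40.0, 40.0), (13.0, 10.0)]
confs = [0.9, 0.85, 0.2, 0.7]
shifts = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
labels, scores = temporal_fallback(preds, confs, shifts, decay=0.9)
print(labels)  # the outlier at frame 2 is replaced by the flow-propagated label
print(select_top_fraction(scores, 0.5))
```

In this example the low-confidence jump to (40, 40) at frame 2 is replaced by the previous label shifted by the flow, and its confidence becomes the decayed confidence of frame 1.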

3.4 Consistency-Constrained Semi-Supervised Learning (CC-SSL)

For the $n$-th iteration, the model $f^{(n)}$ is learned using the loss $L^{(n)}$ defined in Equation 6. The loss function is the Mean Square Error $L_{MSE}$ on heatmaps of both the source data and the target data, and $\gamma$ is used to balance the loss between the source and target datasets.

$$L^{(n)} = \sum_i L_{MSE}\big(f^{(n)}(X_s^i), Y_s^i\big) + \gamma \sum_j L_{MSE}\big(f^{(n)}(X_t^j), \hat{Y}_t^{(n),j}\big) \tag{6}$$

To this end, we present our Consistency-Constrained Semi-Supervised Learning (CC-SSL) approach as follows: we start by training a model using only synthetic data and obtain an initial weak model $f^{(0)}$. Then we iterate the following procedure. For the $n$-th iteration, we first use Algorithm 1 to generate labels $\hat{Y}_t^{(n)}$. With the generated labels, we simply train the model using $(X_s, Y_s)$ and $(X_t, \hat{Y}_t^{(n)})$ jointly.
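The joint objective can be illustrated with a small pure-Python sketch. It assumes heatmaps are stored as nested lists, that the per-sample losses are summed, and that each batch is a list of (predicted heatmap, label heatmap) pairs; these conventions and the names `mse` and `joint_loss` are assumptions for illustration only.

```python
from typing import List, Tuple

Heatmap = List[List[float]]


def mse(pred: Heatmap, target: Heatmap) -> float:
    """Mean squared error between two heatmaps of equal size."""
    diffs = [(p - t) ** 2 for pr, tr in zip(pred, target) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


def joint_loss(
    source_batch: List[Tuple[Heatmap, Heatmap]],
    target_batch: List[Tuple[Heatmap, Heatmap]],
    gamma: float,
) -> float:
    """Source loss plus gamma times the loss on pseudo-labeled target samples."""
    source_term = sum(mse(pred, label) for pred, label in source_batch)
    target_term = sum(mse(pred, pseudo) for pred, pseudo in target_batch)
    return source_term + gamma * target_term


zeros = [[0.0, 0.0], [0.0, 0.0]]
source_batch = [
    ([[0.0, 1.0], [0.0, 0.0]], [[0.0, 1.0], [0.0, 0.0]]),  # perfect prediction
    ([[1.0, 0.0], [0.0, 0.0]], zeros),                     # one pixel off by 1
]
target_batch = [([[0.5, 0.0], [0.0, 0.0]], zeros)]         # one pixel off by 0.5
loss = joint_loss(source_batch, target_batch, gamma=10.0)
print(loss)
```

In a real training loop only the pseudo-labels selected by the curriculum policy would contribute to the target term.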

3.5 Synthetic Dataset Generation

In order to create diverse combinations of animal appearances and poses, we collect a synthetic animal dataset containing 10+ animals. Each animal comes with several animation sequences. We use Unreal Engine to collect rich ground truth and enable nuisance factor control. The implemented factor control includes randomizing lighting and textures and changing viewpoints and animal poses. We also implement domain randomization and ground truth generation to enable training models with our synthetic data.

The pipeline for generating synthetic data is as follows. Given a CAD model along with a few animation sequences, an animal at a random time step and with a random texture is rendered from a random viewpoint under random lighting against a random background image. Since the data is synthetic, we also generate ground truth depth maps, part segmentation and dense joint locations (both 2D and 3D). See Figure 1 for samples from the synthetic dataset.
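The random sampling step can be sketched as follows. This is only an illustration of drawing one set of rendering parameters: the `RenderParams` fields, the numeric ranges and the asset names are hypothetical and are not the values used to build the dataset, and the actual rendering call is omitted.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass
class RenderParams:
    animation: str          # which animation sequence to play
    time_step: float        # normalized position within the sequence, in [0, 1)
    texture: str            # texture applied to the animal
    azimuth_deg: float      # camera viewpoint (assumed range)
    elevation_deg: float    # camera viewpoint (assumed range)
    light_intensity: float  # lighting strength (assumed range)
    background: str         # background image


def sample_render_params(
    rng: random.Random,
    animations: List[str],
    textures: List[str],
    backgrounds: List[str],
) -> RenderParams:
    """Draw one random combination of pose, texture, viewpoint, lighting and background."""
    return RenderParams(
        animation=rng.choice(animations),
        time_step=rng.random(),
        texture=rng.choice(textures),
        azimuth_deg=rng.uniform(0.0, 360.0),
        elevation_deg=rng.uniform(-10.0, 60.0),
        light_intensity=rng.uniform(0.5, 2.0),
        background=rng.choice(backgrounds),
    )


animations = ["walk", "run", "eat"]
textures = ["original", "random_001", "random_002"]
backgrounds = ["bg_000.jpg", "bg_001.jpg"]
rng = random.Random(0)  # fixed seed so the sampled set is reproducible
params = sample_render_params(rng, animations, textures, backgrounds)
print(params)
```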

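Horse Accuracy

| Setting | Method | Eye | Chin | Shoulder | Hip | Elbow | Knee | Hooves | Mean |
|---|---|---|---|---|---|---|---|---|---|
| synthetic + real | Real | 79.04 | 89.71 | 71.38 | 91.78 | 82.85 | 80.80 | 72.76 | 78.98 |
| synthetic + real | CC-SSL-R | 89.39 | 92.01 | 69.05 | 92.28 | 86.39 | 83.72 | 76.89 | 82.43 |
| synthetic only | Syn | 46.08 | 53.86 | 20.46 | 32.53 | 20.20 | 24.20 | 17.45 | 25.33 |
| synthetic only | CycleGAN [45] | 70.73 | 84.46 | 56.97 | 69.30 | 52.94 | 49.91 | 35.95 | 51.86 |
| synthetic only | BDL [26] | 74.37 | 86.53 | 64.43 | 75.65 | 63.04 | 60.18 | 51.96 | 62.33 |
| synthetic only | CyCADA [16] | 67.57 | 84.77 | 56.92 | 76.75 | 55.47 | 48.72 | 43.08 | 55.57 |
| synthetic only | CC-SSL | 84.60 | 90.26 | 69.69 | 85.89 | 68.58 | 68.73 | 61.33 | 70.77 |

Tiger Accuracy

| Setting | Method | Eye | Chin | Shoulder | Hip | Elbow | Knee | Hooves | Mean |
|---|---|---|---|---|---|---|---|---|---|
| synthetic + real | Real | 96.77 | 93.68 | 65.90 | 94.99 | 67.64 | 80.25 | 81.72 | 81.99 |
| synthetic + real | CC-SSL-R | 95.72 | 96.32 | 74.41 | 91.64 | 71.25 | 82.37 | 82.73 | 84.00 |
| synthetic only | Syn | 23.45 | 27.88 | 14.26 | 52.99 | 17.32 | 16.27 | 19.29 | 21.17 |
| synthetic only | CycleGAN [45] | 71.80 | 62.49 | 29.77 | 61.22 | 36.16 | 37.48 | 40.59 | 46.47 |
| synthetic only | BDL [26] | 77.46 | 65.28 | 36.23 | 62.33 | 35.81 | 45.95 | 54.39 | 52.26 |
| synthetic only | CyCADA [16] | 75.17 | 69.64 | 35.04 | 65.41 | 38.40 | 42.89 | 48.90 | 51.48 |
| synthetic only | CC-SSL | 96.75 | 90.46 | 44.84 | 77.61 | 55.82 | 42.85 | 64.55 | 64.14 |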
Table 1: Horse and tiger keypoint prediction, PCK@0.05. Synthetic data use randomized backgrounds and textures. Synthetic only shows results when no real image label is available; Synthetic + Real shows results when labeled real images are available. In both scenarios, our proposed CC-SSL based methods achieve the highest mean accuracy.

4 Experiments

First, we quantitatively test our approach on the TigDog dataset [33] in Section 4.2. We compare our method with other popular unsupervised domain adaptation methods, such as CycleGAN [45], BDL [26] and CyCADA [16]. We also qualitatively show keypoint detection for other animals where no labeled real image is available, such as elephants, sheep and dogs. Second, in order to show the domain generalization ability, we annotate the keypoints of animals from the Visual Domain Adaptation Challenge dataset (VisDA2019). In Section 4.3, we evaluate our models on these images from different visual domains. Third, the rich ground truth in synthetic data enables us to do more tasks beyond 2D pose estimation, so we also visualize part segmentation on horses and tigers and demonstrate the effectiveness of multi-task learning in Section 4.4.

4.1 Experiment Setup

Figure 3: Visualization of horse and tiger 2D pose estimation and part segmentation prediction. The 2D pose estimations are predicted using CC-SSL as described in Section 4.2 and part segmentation predictions are generated using the multi-task learning as described in Section 4.4. Best viewed in color.
Figure 4: Visualization of 2D pose estimation of other animals. Our method can be easily generalized to flexible pose estimation tasks, such as elephants’ trunks. Best viewed in color.

Network Architecture. We use Stacked Hourglass [31] as our backbone for all experiments. Since architecture design is not our main purpose, we strictly follow parameters from the original paper. Each model is trained with RMSProp for epochs. The learning rate starts with and decays twice at and epochs respectively. Input images are cropped with the size of and augmented with scaling, rotation, flipping and color perturbation.

Synthetic Datasets. We explain the details of our data generation parameters as follows. The virtual camera has a resolution of and field of view of 90 degrees. We randomize synthetic animal textures and backgrounds using the COCO val2017 dataset. For each animal, we generate 5,000 images with random textures and 5,000 images with the texture that comes with the CAD model, which we refer to as the original texture. We split the data into a training set and a test set with a ratio of 4:1, resulting in 8,000 images for training and 2,000 for validation. We also generate multiple kinds of ground truth, including part segmentation, depth maps and dense 2D and 3D poses. For part segmentation, we define nine parts for each animal: eyes, head, ears, torso, left-front leg, left-back leg, right-front leg, right-back leg and tail. The part definition follows [8], with the minor difference that we also distinguish front and back legs.

CC-SSL. In our experiments, we pick scaling and rotation as the equivariant transforms and obtain the inter-frame keypoint shift using optical flow. The flow decay factor is set to 0.9. We train one model for 10 epochs and then re-generate pseudo-labels with the new model. In this process, models are trained for 60 epochs in total. The weight that balances the source and target losses is set to 10 for all our experiments.

TigDog Dataset. The TigDog dataset is a large dataset containing 79 videos of horses and 96 videos of tigers. In total, for horses, we have 8380 frames for training and 1772 frames for testing. For tigers, we have 6523 frames for training and 1765 frames for testing. Each frame is provided with 19 keypoint annotations, which are defined as eyes (2), chin (1), shoulders (2), legs (12), hip (1) and neck (1). The neck keypoint is not clearly distinguished for left and right, so we ignore it in our experiments.

4.2 2D Pose Estimation

Results Analysis. Our main results for horse and tiger keypoint prediction are summarized in Table 1. We present our results separately in two different setups: the first is the unsupervised domain adaptation setting, where real image annotations are not available; the second is when labeled real images are available.

When annotations of real images are not available, our proposed CC-SSL surpasses other methods by a significant margin. The PCK@0.05 accuracy for horses reaches 70.77, compared with 78.98 for the model trained directly on real images. For tigers, the proposed method achieves 64.14, compared with 81.99. It is worth noting that these results are achieved without accessing any real data annotation, which demonstrates the effectiveness of our proposed method.
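For readers unfamiliar with the metric, the following Python sketch shows one common way to compute PCK at a threshold of 0.05. It assumes that a keypoint counts as correct when it lies within 0.05 times a reference size (here, the longer side of the animal's bounding box) of the annotation, and that unannotated keypoints are skipped; the exact normalization used for the reported numbers is not stated above, so this convention and the function name `pck` are assumptions.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]


def pck(
    preds: List[Point],
    targets: List[Optional[Point]],
    bbox_size: float,
    threshold: float = 0.05,
) -> float:
    """Percentage of annotated keypoints predicted within threshold * bbox_size."""
    hits, total = 0, 0
    for pred, target in zip(preds, targets):
        if target is None:  # keypoint not annotated in this frame
            continue
        total += 1
        distance = math.hypot(pred[0] - target[0], pred[1] - target[1])
        if distance <= threshold * bbox_size:
            hits += 1
    return 100.0 * hits / total if total else 0.0


# With a 200 px bounding box, the tolerance is 0.05 * 200 = 10 px.
preds = [(50.0, 50.0), (100.0, 120.0), (30.0, 30.0)]
targets = [(53.0, 54.0), (100.0, 140.0), None]
score = pck(preds, targets, bbox_size=200.0)
print(score)
```

Here the first keypoint is 5 px from its annotation and counts as correct, the second is 20 px away and does not, and the third is ignored, so the score is 50.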

We also visualize the predicted keypoints in Figure 3. Surprisingly, even for some extreme poses, such as horse riding and lying on the ground, our method can still generate accurate predictions. The observations for tigers are similar.

When annotations of real images are available, our proposed CC-SSL-R achieves 82.43 for horses and 84.00 for tigers, which is noticeably better than models trained on real images only (78.98 and 81.99, respectively). CC-SSL-R simply finetunes the CC-SSL pretrained models using real data, and we find that this is very effective.

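Horse

| Method | Visible Kpts: Sketch | Visible Kpts: Painting | Visible Kpts: Clipart | Full Kpts: Sketch | Full Kpts: Painting | Full Kpts: Clipart |
|---|---|---|---|---|---|---|
| Real | 65.37 | 64.45 | 64.43 | 61.28 | 58.19 | 60.49 |
| CC-SSL | 72.29 | 73.71 | 73.47 | 70.31 | 71.56 | 72.24 |
| CC-SSL-R | 73.25 | 74.56 | 71.78 | 67.82 | 65.15 | 65.87 |

Tiger

| Method | Visible Kpts: Sketch | Visible Kpts: Painting | Visible Kpts: Clipart | Full Kpts: Sketch | Full Kpts: Painting | Full Kpts: Clipart |
|---|---|---|---|---|---|---|
| Real | 48.10 | 61.48 | 53.36 | 46.23 | 53.14 | 50.92 |
| CC-SSL | 53.34 | 55.78 | 59.34 | 52.64 | 48.42 | 54.66 |
| CC-SSL-R | 54.94 | 68.12 | 63.47 | 53.43 | 58.66 | 59.29 |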
Table 2: Horse and tiger 2D pose estimation on VisDA2019, PCK@0.05. We present our results under two settings: Visible Kpts Accuracy only accounts for visible keypoints; Full Kpts Accuracy also includes self-occluded keypoints. Under all settings, CC-SSL-R achieves better performance than the Real baseline, and CC-SSL does so in every case except tiger paintings.

In addition to horses and tigers, we apply the same method to other animals as well. Our method can be easily transferred to other animal categories and we qualitatively show keypoints prediction results for other animals, as shown in Figure 4, such as sheep, dogs and elephants. Another advantage is that synthetic data can also provide flexible ground truth for different animals. For instance, our method can also detect trunks for elephants.

We empirically find that the performance does not improve much with CycleGAN. We conjecture that one reason is that CycleGAN in general requires a large number of real images to work well, whereas in our case the diversity of real images is limited. Another reason is that animal shapes in the transferred images are not well preserved, which has a negative impact on performance. We also try different adversarial training strategies. Though BDL works quite well for semantic segmentation, we find that the improvements on keypoint detection are small. CyCADA suffers from the same problem as CycleGAN. In comparison, CC-SSL does not suffer from those problems and works well even with limited diversity of real data.

We apply domain randomization to all synthetic datasets. The intuition is to encourage the model to rely more on shape and edge cues, which are indistinguishable between domains. In addition, we use the same set of augmentations as in [31] for the baselines Real and Syn, and a different set of augmentations, which we refer to as Strong Augmentation. In addition to what [31] uses, Strong Augmentation further includes affine transforms, Gaussian noise and Gaussian blurring.

4.3 Generalization Test on VisDA2019

In this section, we test model generalization on images from the Visual Domain Adaptation Challenge dataset (VisDA2019). The dataset contains six domains: real, sketch, clipart, painting, infograph and quickdraw. We pick sketch, painting and clipart for our experiments, since infograph and quickdraw are not suitable for keypoint detection. We manually annotate images of horses and tigers for each of these three domains, and evaluation results are summarized in Table 2. As before, we use Real as our baseline, and CC-SSL and CC-SSL-R for comparison.

For both animals, we observe that models trained using synthetic data achieve best performance in all settings. We present our results under two settings. Visible Keypoints Accuracy only accounts for keypoints that are directly visible whereas Full Keypoints Accuracy shows results with self-occluded keypoints.

Under all settings, CC-SSL-R is better than Real. More interestingly, even without using real image labels, our CC-SSL method yields better performance than Real in almost all domains. The only exception is the painting domain for tigers. We hypothesize that this is because texture information (yellow and black stripes) is still well preserved in paintings, so models trained on real images can still "generalize". For sketches and cliparts, appearances differ more from real images and models trained on synthetic data show better results.

4.4 Part Segmentation

| Models | Horse | Tiger |
|---|---|---|
| Baseline | 60.84 | 50.26 |
| +Part segmentation | 62.25 | 51.69 |
Table 3: Multi-task Learning. We show that models can generalize better to real images when trained jointly using 2D keypoints and part segmentation.

Since the synthetic dataset is generated with rich ground truth, the task is not limited to 2D pose estimation. We also experiment with part segmentation and visualize the results on the TigDog dataset in Figure 3. When models are trained using both 2D poses and part segmentation, they generalize better to real images for both animals, as shown in Table 3.

Here the baseline is trained only on synthetic data, since part segmentation annotations on real images are not available. For part segmentation, we add a branch to the model in parallel with the original keypoint prediction branch.
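As a rough illustration of joint training with two parallel branches, the sketch below combines a keypoint heatmap loss with a per-pixel part segmentation loss. The specific loss functions (mean squared error and cross-entropy), the weighting factor `seg_weight`, and all function names are assumptions made for this example; the text above does not specify them.

```python
from math import log


def keypoint_loss(pred_heatmap, gt_heatmap):
    """Mean squared error between flattened predicted and target heatmaps."""
    n = len(pred_heatmap)
    return sum((p - g) ** 2 for p, g in zip(pred_heatmap, gt_heatmap)) / n


def segmentation_loss(pred_probs, gt_labels):
    """Mean per-pixel cross-entropy.

    pred_probs[i] is a probability distribution over part classes for pixel i;
    gt_labels[i] is the index of the ground-truth part for that pixel.
    """
    eps = 1e-12  # guards against log(0)
    total = sum(log(probs[label] + eps) for probs, label in zip(pred_probs, gt_labels))
    return -total / len(gt_labels)


def multitask_loss(kp_pred, kp_gt, seg_pred, seg_gt, seg_weight=1.0):
    """Sum of the two branch losses; seg_weight is a hypothetical balancing factor."""
    return keypoint_loss(kp_pred, kp_gt) + seg_weight * segmentation_loss(seg_pred, seg_gt)


# Toy example: a 4-pixel heatmap and a 2-pixel, 2-class segmentation map.
kp_pred = [0.0, 1.0, 0.5, 0.0]
kp_gt = [0.0, 1.0, 0.0, 0.0]
seg_pred = [[0.5, 0.5], [1.0, 0.0]]
seg_gt = [0, 0]

joint = multitask_loss(kp_pred, kp_gt, seg_pred, seg_gt)
keypoints_only = multitask_loss(kp_pred, kp_gt, seg_pred, seg_gt, seg_weight=0.0)
print(joint, keypoints_only)
```

Setting `seg_weight` to zero recovers the keypoint-only objective, which corresponds to the baseline row of Table 3 under these assumptions.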

5 Conclusions

In this paper, we present a simple yet efficient method that uses synthetic images to parse animals. To bridge the gap between real and synthetic images, we present a novel consistency-constrained semi-supervised learning (CC-SSL) method, which leverages both spatial and temporal constraints. We demonstrate the effectiveness of the proposed method on horses and tigers in the TigDog dataset. Without any real image labels, our model can detect keypoints reliably on real images. We further quantitatively evaluate the generalizability of our learned models and demonstrate that models using synthetic data achieve better generalization performance across different domains in the Visual Domain Adaptation Challenge dataset. Finally, we build a synthetic dataset containing 10+ animals with diverse poses and rich ground truth, and show that multi-task learning is effective.


This work is supported by IARPA via DOI/IBC contract No. D17PC00342. The authors would like to thank Chunyu Wang, Qingfu Wan, and Yi Zhang for helpful discussions.


  1. M. Andriluka, L. Pishchulin, P. V. Gehler and B. Schiele (2014) 2D human pose estimation: new benchmark and state of the art analysis. In CVPR, pp. 3686–3693.
  2. Y. Bengio, J. Louradour, R. Collobert and J. Weston (2009) Curriculum learning. In ICML, pp. 41–48.
  3. B. Biggs, T. Roddick, A. W. Fitzgibbon and R. Cipolla (2018) Creatures great and SMAL: recovering the shape and motion of animals from video. CoRR abs/1811.05804.
  4. K. Bousmalis, G. Trigeorgis, N. Silberman, D. Krishnan and D. Erhan (2016) Domain separation networks. In NeurIPS, pp. 343–351.
  5. J. Cao, H. Tang, H. Fang, X. Shen, C. Lu and Y. Tai (2019) Cross-domain adaptation for animal pose estimation. CoRR abs/1908.05806.
  6. A. X. Chang, T. A. Funkhouser, L. J. Guibas, P. Hanrahan, Q. Huang, Z. Li, S. Savarese, M. Savva, S. Song, H. Su, J. Xiao, L. Yi and F. Yu (2015) ShapeNet: an information-rich 3d model repository. CoRR abs/1512.03012.
  7. W. Chen, H. Wang, Y. Li, H. Su, Z. Wang, C. Tu, D. Lischinski, D. Cohen-Or and B. Chen (2016) Synthesizing training images for boosting human 3d pose estimation. In 3DV, pp. 479–488.
  8. X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun and A. L. Yuille (2014) Detect what you can: detecting and representing objects using holistic models and body parts. In CVPR, pp. 1979–1986.
  9. J. Choi, T. Kim and C. Kim (2019) Self-ensembling with gan-based data augmentation for domain adaptation in semantic segmentation. CoRR abs/1909.00589.
  10. Y. Ding, L. Wang, D. Fan and B. Gong (2018) A semi-supervised two-stage approach to learning from noisy labels. In 2018 IEEE Winter Conference on Applications of Computer Vision, WACV 2018, Lake Tahoe, NV, USA, March 12-15, 2018, pp. 1215–1224.
  11. A. Dosovitskiy, P. Fischer, E. Ilg, P. Häusser, C. Hazirbas, V. Golkov, P. van der Smagt, D. Cremers and T. Brox (2015) FlowNet: learning optical flow with convolutional networks. In ICCV, pp. 2758–2766.
  12. G. French, M. Mackiewicz and M. H. Fisher (2018) Self-ensembling for visual domain adaptation. In ICLR.
  13. R. Gong, W. Li, Y. Chen and L. V. Gool (2019) DLOW: domain flow for adaptation and generalization. In CVPR, pp. 2477–2486.
  14. S. Guo, W. Huang, H. Zhang, C. Zhuang, D. Dong, M. R. Scott and D. Huang (2018) CurriculumNet: weakly supervised learning from large-scale web images. In ECCV, pp. 139–154.
  15. Z. He and L. Zhang (2019) Multi-adversarial faster-rcnn for unrestricted object detection. CoRR abs/1907.10343.
  16. J. Hoffman, E. Tzeng, T. Park, J. Zhu, P. Isola, K. Saenko, A. A. Efros and T. Darrell (2018) CyCADA: cycle-consistent adversarial domain adaptation. In ICML, pp. 1994–2003.
  17. X. Huang, M. Liu, S. J. Belongie and J. Kautz (2018) Multimodal unsupervised image-to-image translation. In ECCV, pp. 179–196.
  18. N. Inoue, R. Furuta, T. Yamasaki and K. Aizawa (2018) Cross-domain weakly-supervised object detection through progressive domain adaptation. In CVPR, pp. 5001–5009.
  19. L. Jiang, D. Meng, Q. Zhao, S. Shan and A. G. Hauptmann (2015) Self-paced curriculum learning. In AAAI, pp. 2694–2700.
  20. M. Khodabandeh, A. Vahdat, M. Ranjbar and W. G. Macready (2019) A robust learning approach to domain adaptive object detection. CoRR abs/1904.02361.
  21. T. Kim, M. Jeong, S. Kim, S. Choi and C. Kim (2019) Diversify and match: A domain adaptive representation learning paradigm for object detection. In CVPR, pp. 12456–12465.
  22. Y. Kim, J. Yim, J. Yun and J. Kim (2019) NLNL: negative learning for noisy labels. CoRR abs/1908.07387.
  23. S. Laine and T. Aila (2017) Temporal ensembling for semi-supervised learning. In ICLR.
  24. D. Lee (2013) Pseudo-label: the simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3, pp. 2.
  25. S. Li, J. Li, W. Lin and H. Tang (2019) Amur tiger re-identification in the wild. CoRR abs/1906.05586.
  26. Y. Li, L. Yuan and N. Vasconcelos (2019) Bidirectional learning for domain adaptation of semantic segmentation. In CVPR, pp. 6936–6945.
  27. M. Liu, T. Breuel and J. Kautz (2017) Unsupervised image-to-image translation networks. In NeurIPS, pp. 700–708.
  28. M. Long, Y. Cao, J. Wang and M. I. Jordan (2015) Learning transferable features with deep adaptation networks. In ICML, pp. 97–105.
  29. M. Loper, N. Mahmood, J. Romero, G. Pons-Moll and M. J. Black (2015) SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34 (6), pp. 248:1–248:16.
  30. Z. Murez, S. Kolouri, D. J. Kriegman, R. Ramamoorthi and K. Kim (2018) Image to image translation for domain adaptation. In CVPR, pp. 4500–4509.
  31. A. Newell, K. Yang and J. Deng (2016) Stacked hourglass networks for human pose estimation. In Computer Vision - ECCV 2016 - 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part VIII, pp. 483–499.
  32. D. Novotný, D. Larlus and A. Vedaldi (2016) I have seen enough: transferring parts across categories. In BMVC.
  33. L. D. Pero, S. Ricco, R. Sukthankar and V. Ferrari (2015) Articulated motion discovery using pairs of trajectories. In CVPR, pp. 2151–2160.
  34. A. Prakash, S. Boochoon, M. Brophy, D. Acuna, E. Cameracci, G. State, O. Shapira and S. Birchfield (2019) Structured domain randomization: bridging the reality gap by context-aware synthetic data. In ICRA, pp. 7249–7255.
  35. I. Radosavovic, P. Dollár, R. B. Girshick, G. Gkioxari and K. He (2018) Data distillation: towards omni-supervised learning. In CVPR, pp. 4119–4128.
  36. A. Roy Chowdhury, P. Chakrabarty, A. Singh, S. Jin, H. Jiang, L. Cao and E. G. Learned-Miller (2019) Automatic adaptation of object detectors to new domains using self-training. In CVPR, pp. 780–790.
  37. B. Sapp and B. Taskar (2013) MODEC: multimodal decomposable models for human pose estimation. In CVPR, pp. 3674–3681.
  38. B. Sun and K. Saenko (2016) Deep CORAL: correlation alignment for deep domain adaptation. In ECCV Workshops, pp. 443–450.
  39. J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon and S. Birchfield (2018) Training deep networks with synthetic data: bridging the reality gap by domain randomization. In CVPR, pp. 969–977.
  40. E. Tzeng, J. Hoffman, T. Darrell and K. Saenko (2015) Simultaneous deep transfer across domains and tasks. In ICCV, pp. 4068–4076.
  41. E. Tzeng, J. Hoffman, K. Saenko and T. Darrell (2017) Adversarial discriminative domain adaptation. In CVPR, pp. 2962–2971.
  42. E. Tzeng, J. Hoffman, N. Zhang, K. Saenko and T. Darrell (2014) Deep domain confusion: maximizing for domain invariance. CoRR abs/1412.3474.
  43. G. Varol, J. Romero, X. Martin, N. Mahmood, M. J. Black, I. Laptev and C. Schmid (2017) Learning from synthetic humans. In CVPR, pp. 4627–4635.
  44. P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie and P. Perona (2010) Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology.
  45. J. Zhu, T. Park, P. Isola and A. A. Efros (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV 2017, pp. 2242–2251.
  46. Y. Zou, Z. Yu, B. V. K. V. Kumar and J. Wang (2018) Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In ECCV, pp. 297–313.
  47. Y. Zou, Z. Yu, X. Liu, B. V. K. V. Kumar and J. Wang (2019) Confidence regularized self-training. CoRR abs/1908.09822.
  48. S. Zuffi, A. Kanazawa, T. Y. Berger-Wolf and M. J. Black (2019) Three-d safari: learning to estimate zebra pose, shape, and texture from images "in the wild". CoRR abs/1908.07201.
  49. S. Zuffi, A. Kanazawa and M. J. Black (2018) Lions and tigers and bears: capturing non-rigid, 3d, articulated shape from images. In CVPR, pp. 3955–3963.
  50. S. Zuffi, A. Kanazawa, D. W. Jacobs and M. J. Black (2017) 3D menagerie: modeling the 3d shape and pose of animals. In CVPR, pp. 5524–5532.