Generating Realistic Training Images Based on Tonality-Alignment Generative Adversarial Networks for Hand Pose Estimation

Liangjian Chen (Dept. of Computer Science, University of California, Irvine), Shih-Yao Lin (Tencent Medical AI Lab), Yusheng Xie (Tencent Medical AI Lab), Hui Tang (Tencent Medical AI Lab), Yufan Xue (Tencent Medical AI Lab), Xiaohui Xie (Dept. of Computer Science, University of California, Irvine), Yen-Yu Lin (Research Center for Information Technology Innovation, Academia Sinica), Wei Fan (Tencent Medical AI Lab)
Abstract

Hand pose estimation from a monocular RGB image is an important but challenging task. A main factor affecting its performance is the lack of a sufficiently large training dataset with accurate hand-keypoint annotations. In this work, we circumvent this problem by proposing an effective method for generating realistic hand poses and show that state-of-the-art algorithms for hand pose estimation can be greatly improved by utilizing the generated hand poses as training data. Specifically, we first adopt an augmented reality (AR) simulator to synthesize hand poses with accurate hand-keypoint labels. Although the synthetic hand poses come with precise joint labels, eliminating the need for manual annotation, they look unnatural and are not ideal training data. To produce more realistic hand poses, we propose to blend a synthetic hand pose with a real background, such as arms and sleeves. To this end, we develop tonality-alignment generative adversarial networks (TAGANs), which align the tonality and color distributions between synthetic hand poses and real backgrounds, and can generate high-quality hand poses. We evaluate TAGANs on three benchmarks: the RHP, STB, and CMU-PS hand pose datasets. With the aid of the synthesized poses, our method performs favorably against the state of the art in both 2D and 3D hand pose estimation.


1 Introduction

Estimating hand poses from monocular RGB images has drawn increasing attention in recent decades because it is essential to many applications such as health care [31], robotics [15], virtual and augmented reality [17, 18], and human-computer interaction [30]. This task has gained significant progress, e.g., [5, 22, 32], owing to the fast development of deep neural networks (DNNs). These DNN-based methods learn the hand representation and perform pose estimation jointly. Despite their effectiveness, DNN-based methods rely heavily on a vast amount of training data. However, it is not practical to enumerate all hand poses of interest and collect plenty of pose-specific training data.

Figure 1: The crucial issue for the hand pose estimation task is the lack of training data with accurate hand-keypoint labels. Although an augmented reality (AR) simulator can produce a large amount of training data, the produced synthetic hand poses look unnatural and are not ideal training data. We aim at producing more realistic hand poses by blending a synthetic hand pose with a real background.

There exist two popular ways to address the issue of insufficient training data: transfer learning and synthesizing training data. Transfer learning [8, 19] enables learning neural networks with limited training data in the target domain. A DNN model is trained in advance on a large dataset in the source domain. By learning the transformation from the source to the target domain, the DNN model can be re-used in the target domain after fine-tuning with limited training data. Yet, transfer learning typically works when the data modality in the source and target domains is the same. Aytar et al. [2], for example, propose a cross-modal transfer learning approach for scene analysis that learns scene representations across different modalities (e.g., transferring from real images to sketch images). However, the categories of the source and target data are still required to be the same (e.g., a bedroom photo and a text description of a bedroom), which limits its usability.

Recent studies, e.g., [22, 42], have adopted AR simulators to generate large-scale training examples. In this way, plenty of hand images with various poses, skin textures, and lighting conditions can be systematically synthesized. Moreover, accurate hand-keypoint annotations for these synthesized hand images are also available.

Training with such synthetic images directly may not result in a much improved hand pose estimator. The reason lies in the dissimilarity between the real and synthetic data. To address this issue, we suggest blending a synthetic hand pose (foreground) image with a real background image. In this work, we aim at seamless fusion of the foreground and background images so that the fused images can be more realistic and greatly improve the hand pose estimator.

We are aware of the dissimilarity in styles and appearances between the synthetic hand pose image and the real background image. Thus, we present a GAN-based method, tonality-alignment generative adversarial networks (TAGANs), to generate realistic training data.

We are inspired by the image-to-image translation technique based on conditional GANs, Pix2pix Nets [12], where extra shape features serve as the input to the GANs and constrain the object shape in the synthesized photo. In addition to the shape constraint, we design a tonality-alignment loss function in TAGANs to align the color distribution and tonality of the input and generated images. It turns out that the hand pose images can be better embedded into the background images, resulting in more realistic hand pose images. The hand pose estimator is then considerably improved by using the generated hand pose images as augmented training data.

The main contributions of this paper are listed as follows:

First, to the best of our knowledge, this is the first study to fuse synthetic hand images into real scene images, and we propose TAGANs to blend them realistically. Second, the experimental results demonstrate that existing pose estimators trained on our generated hand pose data gain significant improvements over the current state of the art on both 2D and 3D datasets. Third, we will release our synthetic data, which consists of a large number of hand pose images with accurate hand-keypoint annotations.

2 Related Work

The performance of DNN-based approaches is highly reliant on a large amount of high-quality labeled data. Yet, collecting and labeling such large-scale data is a daunting task, especially for hand pose estimation, where each hand is typically annotated with 21 keypoints. Data augmentation is therefore crucial for learning a hand pose estimator.

2.1 Data Augmentation via Simulator

Recent works [16] in hand pose estimation have trained their models on synthetic training data. Zimmermann and Brox [42] propose a synthetic hand pose dataset generated by an open-source simulator and show that their pose estimator can be greatly improved by training on the augmented synthetic hand pose images. However, the synthetic hand images produced by the AR simulator look very artificial. To deal with this problem, Mueller et al. [22] propose geometrically consistent GANs to enhance the appearance quality of synthetic hand images.

2.2 Data Augmentation via Adversarial Networks

Generating realistic images with generative adversarial networks (GANs) [7, 28] has been an active research topic. The scheme of generating data via GANs is to feed an image as input, from which the GANs generate a realistic image. Isola et al. propose Pix2Pix Net [13] to learn a mapping from a sketch to a realistic image; for example, it can translate a car sketch into a car photo. Unlike the original GANs requiring paired training data, Zhu et al. [41] propose cycle-consistent adversarial networks (CycleGANs) for learning to translate an image from a source domain to a target domain with unpaired examples. To increase the amount of training data, recent progress has utilized AR simulators to produce synthetic training data. However, the synthetic data generated by the AR simulator look unnatural. Shrivastava et al. [28] present a simulated+unsupervised learning approach based on GANs, named SimGAN, to refine the realism of a simulator's output with unlabeled real data. However, their simulator's data include only the object itself and not the background scene; the resulting synthetic image is filled with a single object (e.g., a hand or an eye). SimGAN does not take the relation between the synthetic object and the real background into account, yet background information is crucial in many real-world applications, for example, detecting faces in a crowded market. In this work, we explore techniques that directly regularize the foreground (hand) and the background (natural scenes in which the hand appears) [38, 27, 10].

2.3 Vision-based Hand Pose Estimation

Hand pose estimation has drawn increasing attention for decades [1, 3, 9, 5, 37, 32, 6, 21, 25, 36, 40]. Recent research efforts can be categorized by their input data forms, which primarily include 2D RGB images and 3D RGB-D images with depth information. Specifically, recent progress has explored directly estimating 3D hand pose from a monocular RGB image. Oikonomidis et al. [23, 24] propose model-based hand tracking approaches based on Particle Swarm Optimization (PSO). Simon et al. [29] adopt multiview bootstrapping to calculate hand keypoints from RGB images. Zimmermann and Brox [42] propose a 3D pose regression network that estimates 3D hand pose from a regular RGB image.

3 Methodology

This section describes our approach to hand pose image generation. We first explain how GANs [7] and conditional GANs [20] can be applied to this problem, then describe the synthetic hand image generation process, and finally describe how our approach improves the quality of the synthetic images.

3.1 GAN and Conditional GANs

GANs learn a mapping from a random noise vector $z$ to a generated image $y$, i.e., $G: z \rightarrow y$, where $G$ is the generator. Conditional GANs (CGANs) are an extension of GANs whose inputs can be augmented with additional conditions. Unlike GANs, CGANs leverage the additional conditions to constrain the output image $y$. The conditions can be specified by feeding extra inputs to both the generator network $G$ and the discriminator network $D$.

CGANs-based methods can be applied to image-to-image translation. In a representative work called pix2pix [12], a condition regarding shapes is used to constrain the shape of the output image $y$ to be similar to an additional input shape map $x$. The shape map $x$ is pre-computed by applying the edge detector HED [35] to the image. $x$ is fed to both the generator and the discriminator as additional input layers. In this way, $x$ and the output $y$ (or its latent space representation) are transformed into a joint hidden representation. CGANs learn a mapping from $x$ and a random noise vector $z$ to the generated output $y$, i.e., $G: \{x, z\} \rightarrow y$. Thus, the CGANs objective can be formulated as

$$\mathcal{L}_{\mathrm{cGAN}}(G, D) = \mathbb{E}_{x,y}\big[\log D(x, y)\big] + \mathbb{E}_{x,z}\big[\log\big(1 - D(x, G(x, z))\big)\big].$$

In this objective, the generator $G$ aims at minimizing the differences between the real images and the images $G$ generates so that the discriminator $D$ cannot separate the real images from the generated ones. On the other hand, $G$'s adversary, $D$, tries to learn a discriminating function to maximize such differences. The shape of the output $y$ is constrained by $x$. Thus, $G$ is optimized via

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{cGAN}}(G, D) + \lambda_{s}\,\mathcal{L}_{\mathrm{shape}}(G), \qquad (1)$$

where $\mathcal{L}_{\mathrm{shape}}$ and $\lambda_{s}$ are the shape loss and its weight, respectively. The shape loss is calculated using the L1 distance to lower the unfavorable effect of blurring, and it is defined by

$$\mathcal{L}_{\mathrm{shape}}(G) = \mathbb{E}_{x,y,z}\big[\, \| y - G(x, z) \|_{1} \,\big]. \qquad (2)$$
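To make this concrete, below is a minimal PyTorch-style sketch of the conditional adversarial loss combined with the L1 term of Eq. (2); the function and tensor names, and the default weight `lambda_s`, are illustrative assumptions rather than the exact implementation used here.

```python
import torch
import torch.nn.functional as F

def cgan_losses(G, D, x, y_real, z, lambda_s=100.0):
    """Pix2pix-style conditional GAN losses (sketch).

    x      : conditioning shape map, shape (B, C, H, W)
    y_real : real target image,      shape (B, 3, H, W)
    z      : noise (often realized as dropout inside G)
    """
    y_fake = G(x, z)

    # Discriminator scores (condition, image) pairs.
    d_real = D(x, y_real)
    d_fake = D(x, y_fake.detach())
    loss_D = F.binary_cross_entropy_with_logits(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy_with_logits(d_fake, torch.zeros_like(d_fake))

    # Generator: fool D and stay close to the target image (Eq. (2)).
    d_fake_for_G = D(x, y_fake)
    loss_G = F.binary_cross_entropy_with_logits(d_fake_for_G, torch.ones_like(d_fake_for_G)) + \
             lambda_s * F.l1_loss(y_fake, y_real)
    return loss_G, loss_D
```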
Figure 2: Process of synthetic hand image generation. From left to right: 1) a hand image in our synthetic dataset, 2) the hand representation after an affine transformation, 3) hand detection and localization in the target image, and 4) the merged result.

3.2 Synthetic hand image generation

Existing hand pose datasets are not large enough to learn a stable neural network hand pose estimator. Moreover, the hand-keypoints are annotated manually. The annotation process is expensive and labor-intensive. Even so, the annotated hand-keypoints are still error-prone and often not accurate enough. To address the quantitative and qualitative issues of hand pose training data, we adopt an open-source AR simulator to produce large-scale, high-quality hand pose images with accurate 2D/3D hand-keypoint labels.

Given a hand pose $p$, our goal is to synthesize a hand image whose pose is consistent with the given one. To find the best match, the first step is to use $p$ as a query to a dataset we collected, in which there are millions of hand poses generated by various hand models under different lighting conditions. For each candidate hand pose $q$ in the dataset, its similarity to the target pose $p$ is defined as

$$s(q, p) = -\big\| \phi(\mathcal{T}(q)) - \phi(p) \big\|_{2}, \qquad (3)$$

where $\mathcal{T}$ is the affine transformation with which the transformed candidate pose best matches $p$, and $\phi$ is the feature representation of a pose. In this work, each hand pose is expressed as the concatenation vector of its 2D keypoints, e.g., $q = (x_{1}, y_{1}, \dots, x_{21}, y_{21})$. The feature representation $\phi(q)$ of a hand pose is the ordered collection of pair-wise keypoint differences along the $x$ and $y$ axes, i.e.,

$$\phi(q) = \big( x_{i} - x_{j},\; y_{i} - y_{j} \big)_{1 \le i < j \le 21}. \qquad (4)$$

We select the candidate pose with the highest similarity. Our dataset contains tens of millions of hand poses, so we employ the product quantization (PQ) technique [34] with GPU acceleration [14] to speed up the selection process. We then superpose the selected pose over the original scene.
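As a concrete illustration, the following NumPy sketch computes the pairwise-difference feature of Eq. (4) and retrieves the most similar candidate pose by brute force; in the actual pipeline this search is accelerated with product quantization [34] and GPU search [14], and the affine alignment $\mathcal{T}$ is omitted here for brevity.

```python
import numpy as np

def pose_feature(pose):
    """Pairwise keypoint differences along x and y (Eq. (4)).

    pose: array of shape (21, 2) holding 2D keypoints.
    """
    diff = pose[:, None, :] - pose[None, :, :]   # (21, 21, 2)
    iu = np.triu_indices(pose.shape[0], k=1)     # ordered pairs i < j
    return diff[iu].reshape(-1)                  # flattened (210 * 2,) vector

def retrieve_best_pose(query_pose, candidate_poses):
    """Return the index of the candidate whose feature is closest to the query."""
    q = pose_feature(query_pose)
    feats = np.stack([pose_feature(c) for c in candidate_poses])
    dists = np.linalg.norm(feats - q, axis=1)
    return int(np.argmin(dists))

# Usage sketch with random poses standing in for the synthetic pose library.
rng = np.random.default_rng(0)
library = rng.uniform(0, 256, size=(1000, 21, 2))
query = rng.uniform(0, 256, size=(21, 2))
best = retrieve_best_pose(query, library)
```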

The process of synthetic hand image generation is summarized in Figure 2.

Figure 3: The proposed tonality-alignment GANs (TAGANs) perform conditional adversarial learning to derive a mapping from the shape map $x_{s}$ and the color map $x_{c}$ to the generated image $y$. The generator $G$ learns to produce realistic images to fool the discriminator $D$, while the discriminator aims to separate the fake (synthetic) images from the real images. To make the synthetic images more realistic, we propose to further blend a synthetic hand image with a real background image. Our model takes both the shape and color features into account so that the resulting synthetic images are consistent with $x_{s}$ and $x_{c}$ simultaneously. The discriminator in TAGANs is also applied to verify this consistency, which in turn helps the generator produce more realistic hand pose images.
Figure 4: Implementation details of TAGANs' generator. The input shape map $x_{s}$ and color map $x_{c}$ are pre-computed from the synthetic hand image. The shape loss between $x_{s}$ and the shape inferred from the generated image $y$ is computed using the $L_{1}$ distance. The color loss between the color histogram of $x_{c}$ and the color histogram of $y$ is calculated using the KL-divergence.

3.3 Tonality-Alignment GANs

Although the AR simulator can successfully relieve the lack of training data, the background of its produced images is artificial. The background tonality and color distributions between the synthetic and real hand poses are inconsistent. These issues make the synthetic hand poses unrealistic. Inspired by pix2pix’s  [13] successful use of the shape map to constrain the shape of the output image, we propose tonality-alignment GANs (TAGANs) to take the color distribution and shape features into account.

Given a synthetic image $x$, we utilize its blurred counterpart $x_{c}$ as the color reference for the output image $y$. The blurred counterpart in our system is derived by applying an average filter to $x$. In TAGANs, the shape map $x_{s}$ and the color map $x_{c}$ are fed to both the generator and the discriminator as additional input layers such that $x_{s}$, $x_{c}$, and the output $y$ are transformed into a joint hidden representation. The main idea of TAGANs is illustrated in Figure 3.
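For reference, a minimal sketch of how the two conditioning maps can be pre-computed is given below; the average-filter kernel size is an assumed hyperparameter, and a Canny detector is used here only as a stand-in for the HED edge detector [35] used in the paper.

```python
import cv2

def compute_condition_maps(image_bgr, blur_ksize=15):
    """Pre-compute the color map (blurred image) and a shape (edge) map.

    image_bgr : uint8 array of shape (H, W, 3).
    Returns (color_map, shape_map).
    """
    # Color map x_c: an average-filtered copy that keeps only coarse tonality.
    color_map = cv2.blur(image_bgr, (blur_ksize, blur_ksize))

    # Shape map x_s: edges of the hand; the paper uses HED, Canny is a stand-in.
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    shape_map = cv2.Canny(gray, 50, 150)
    return color_map, shape_map
```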

The proposed TAGANs learn a mapping from the shape map $x_{s}$, the color map $x_{c}$, and a random vector $z$ to the generated output $y$, i.e., $G: \{x_{s}, x_{c}, z\} \rightarrow y$. The objective of TAGANs is given below:

$$\mathcal{L}_{\mathrm{TAGAN}}(G, D) = \mathbb{E}_{x_{s}, x_{c}, y}\big[\log D(x_{s}, x_{c}, y)\big] + \mathbb{E}_{x_{s}, x_{c}, z}\big[\log\big(1 - D(x_{s}, x_{c}, G(x_{s}, x_{c}, z))\big)\big]. \qquad (5)$$

The generator in our model is optimized by

$$G^{*} = \arg\min_{G}\max_{D}\; \mathcal{L}_{\mathrm{TAGAN}}(G, D) + \mathcal{L}_{\mathrm{ts}}(G), \qquad (6)$$

where $\mathcal{L}_{\mathrm{ts}}$ is the loss function for enforcing the shape similarity between $x_{s}$ and $y$ as well as the color consistency between $x_{c}$ and $y$. The loss $\mathcal{L}_{\mathrm{ts}}$ is defined by

$$\mathcal{L}_{\mathrm{ts}}(G) = \lambda_{c}\,\mathcal{L}_{\mathrm{color}}(G) + \lambda_{s}\,\mathcal{L}_{\mathrm{shape}}(G), \qquad (7)$$

where $\mathcal{L}_{\mathrm{color}}$ and $\mathcal{L}_{\mathrm{shape}}$ denote the color and shape distance functions, respectively. Constants $\lambda_{c}$ and $\lambda_{s}$ are their weights. The shape distance function is expressed as

$$\mathcal{L}_{\mathrm{shape}}(G) = \mathbb{E}_{x_{s}, x_{c}, z}\big[\, \| x_{s} - H(G(x_{s}, x_{c}, z)) \|_{1} \,\big], \qquad (8)$$

where $H(\cdot)$ denotes the shape (edge) map inferred from the generated image.

In addition to the shape constraint, we design a tonality-alignment loss to align the color distributions of the input and the generated images. The color distance function is defined by

$$\mathcal{L}_{\mathrm{color}}(G) = D_{\mathrm{KL}}\big( h(x_{c}) \,\|\, h(G(x_{s}, x_{c}, z)) \big), \qquad (9)$$

where $h(x_{c})$ and $h(G(x_{s}, x_{c}, z))$ are the color histograms of the color map and the generated image, respectively. Thereby, $D_{\mathrm{KL}}$ in Eq. (9) is the Kullback-Leibler divergence between the two histograms.
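A minimal PyTorch sketch of these two distance terms is shown below; the simple per-channel histogram binning and the `edge_of` callable standing in for the HED-based shape extraction are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def color_histogram(img, bins=32):
    """Per-channel color histogram, normalized to a probability distribution.

    img: tensor of shape (3, H, W) with values in [0, 1].
    Note: torch.histc is not differentiable; a soft binning would be needed
    to backpropagate the color loss through the generator.
    """
    hists = [torch.histc(img[c], bins=bins, min=0.0, max=1.0) for c in range(3)]
    h = torch.cat(hists)
    return h / h.sum().clamp_min(1e-8)

def tonality_alignment_loss(x_c, y_fake, bins=32, eps=1e-8):
    """KL divergence between histograms of the color map and the output (Eq. (9))."""
    p = color_histogram(x_c, bins) + eps
    q = color_histogram(y_fake, bins) + eps
    return torch.sum(p * (torch.log(p) - torch.log(q)))

def shape_distance(x_s, y_fake, edge_of):
    """L1 distance between the input shape map and the edges of the output (Eq. (8)).

    edge_of: callable that maps an image to its edge/shape map (HED in the paper).
    """
    return F.l1_loss(edge_of(y_fake), x_s)
```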

4 Hand Pose Estimators

This section introduces the two existing hand pose estimators evaluated in our experiments: Hand3D [42] and the convolutional pose machine (CPM) [33]. The two approaches estimate 3D and 2D hand poses from a monocular RGB image, respectively.

Hand3D applies the PosePrior network to estimate the canonical coordinates and the viewpoint relative to the coordinate system, and then uses such information to estimate 3D hand poses. For 2D hand pose estimation, CPM leverages a multi-stage convolutional architecture with across-layer receptive fields to implicitly learn a spatial model for hand pose prediction, refining keypoint heatmaps stage by stage; a simplified sketch of this idea is given below.
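As an illustration only, the following is a highly simplified PyTorch sketch of such a multi-stage heatmap-refinement design; the layer widths, kernel sizes, and stage count are assumptions chosen for readability and do not reproduce the actual CPM architecture [33].

```python
import torch
import torch.nn as nn

class TinyStage(nn.Module):
    """One refinement stage: image features + previous heatmaps -> new heatmaps."""
    def __init__(self, feat_ch, n_joints=21):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch + n_joints, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=7, padding=3), nn.ReLU(),
            nn.Conv2d(64, n_joints, kernel_size=1),
        )

    def forward(self, feats, prev_heatmaps):
        return self.net(torch.cat([feats, prev_heatmaps], dim=1))

class TinyCPM(nn.Module):
    """Multi-stage keypoint heatmap prediction with intermediate supervision (sketch)."""
    def __init__(self, n_joints=21, n_stages=3):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.init_head = nn.Conv2d(64, n_joints, 1)
        self.stages = nn.ModuleList([TinyStage(64, n_joints) for _ in range(n_stages)])

    def forward(self, image):
        feats = self.backbone(image)
        heatmaps = [self.init_head(feats)]
        for stage in self.stages:
            heatmaps.append(stage(feats, heatmaps[-1]))
        return heatmaps  # one heatmap tensor per stage; all are supervised in training
```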

We have made minimal modifications to both algorithms in this work. Both Hand3D and CPM are popular baselines for developing more advanced hand pose estimators, such as [4, 22, 26, 29, 37]. Thus, we expect that if the synthetic hand pose images we generate can improve Hand3D and CPM, they can also facilitate follow-up research built on Hand3D and CPM.

Figure 5: Some examples of our RHTD dataset. These hand images are used to train our TAGANs: the first row shows the original images, and the second and third rows show their color maps and shape maps, respectively. The color maps and shape maps are the inputs to TAGANs' generator, and the original images serve as the real images for the discriminator.
Figure 6: Some examples of the three benchmark datasets used for evaluation: RHP, STB, and CMU-PS. The RHP dataset [42] provides synthetic hand images with 3D hand keypoint annotations. In contrast, the STB [39] and CMU-PS [29] datasets provide real hand images; 3D and 2D keypoint annotations are included in STB and CMU-PS, respectively.

5 Experimental Setting

In this section, we first describe our real hand training dataset (RHTD), which is used to train the data generators, including CycleGANs, Pix2pix, and our TAGANs. Then we introduce the three evaluation datasets and our generated synthetic hand training data. Finally, we describe the hand pose estimators and the evaluation metrics we adopt.

5.1 Datasets for Training and Evaluation

To train the data generators, we collect real, unlabeled hand images captured from a person performing various hand gestures. We name this dataset the "real hand training dataset" (RHTD). RHTD covers various hand poses, perspective views, and lighting conditions. Some examples of RHTD are shown in Figure 5. To train TAGANs' generator, RHTD also provides pre-computed edge [35] and color maps.

In addition to RHTD, we adopt an AR simulator to generate synthetic hand images with various hand poses, perspectives, and lighting conditions. Some synthetic hand image examples are shown in the first column of Figure 7. Using our synthetic hand image generation process, the synthetic hands are placed at the appropriate locations in the real background images. The proposed TAGANs are then applied to blend the synthetic hands with the real background images.

To evaluate the quality and the efficacy of the simulated data, we select three benchmark datasets including “Rendered Hand Pose” (RHP) [42], “Stereo Tracking Benchmark” (STB) [39], and “CMU Panoptic Studio” (CMU-PS) [29]. Some examples of them are shown in Figure 6.

The RHP dataset contains training and testing samples built from different subjects performing various actions. Each sample provides 1) an RGB image, 2) a depth map, and 3) segmentation masks for the background, the person, and each finger. The annotation of each hand contains keypoints in both 2D pixel coordinates and 3D world coordinates. The RHP data are split into a validation set (R-val) and a training set (R-train). The split ensures that a subject or action appears in only one of R-val and R-train.

The STB dataset provides real hand images split into two subsets: the stereo subset STB-BB and the color-depth subset STB-SK. As illustrated in [4], the STB and RHP datasets use similar but different annotation schemes: STB chooses the palm point as the root point, whereas RHP uses the wrist. To bridge this gap, we move the palm point in the STB dataset to its wrist point by doubling the vector from the metacarpophalangeal (MCP) joint of the middle finger to the palm, as sketched below.
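Under this reading of the conversion (the wrist lies on the ray from the middle-finger MCP through the palm center, at twice the MCP-to-palm distance), the adjustment reduces to a one-line NumPy operation; the exact keypoint indexing of the STB annotations is an assumption here.

```python
import numpy as np

def palm_to_wrist(palm_xyz, middle_mcp_xyz):
    """Replace the STB palm root with an approximate wrist point.

    wrist = mcp + 2 * (palm - mcp) = 2 * palm - mcp, i.e., the point obtained by
    doubling the vector from the middle-finger MCP to the palm center.
    """
    palm = np.asarray(palm_xyz, dtype=np.float64)
    mcp = np.asarray(middle_mcp_xyz, dtype=np.float64)
    return 2.0 * palm - mcp

# Usage sketch with made-up coordinates (millimeters).
wrist = palm_to_wrist([10.0, 0.0, 300.0], [10.0, 30.0, 295.0])
```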

CMU-PS is a relatively small dataset that provides separate examples for training and testing. The hand keypoint labels in this dataset are given as 2D joint pixel coordinates.

5.2 Evaluation Metrics

Following previous works [4, 29, 42], two metrics for hand pose estimation are adopted in our experiments: 1) the average End-Point-Error (EPE) and 2) the Area Under the Curve (AUC) of the Percentage of Correct Keypoints (PCK). We report the performance for both 2D and 3D hand pose, where the metrics are computed in pixels (px) and millimeters (mm), respectively. We measure the performance of 3D hand joint prediction using PCK curves averaged over all 21 keypoints, and use 2D PCK and 3D PCK to evaluate our approach on the RHP, STB, and CMU-PS datasets.
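For clarity, here is a minimal NumPy sketch of how EPE, PCK at a threshold, and the AUC over a range of PCK thresholds can be computed; the threshold grid in the usage example is an illustrative assumption.

```python
import numpy as np

def epe(pred, gt):
    """Mean End-Point-Error: average Euclidean distance per keypoint.

    pred, gt: arrays of shape (N, 21, D) with D = 2 (pixels) or 3 (millimeters).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pck(pred, gt, threshold):
    """Percentage of Correct Keypoints under a distance threshold."""
    dists = np.linalg.norm(pred - gt, axis=-1)
    return float((dists <= threshold).mean())

def pck_auc(pred, gt, thresholds):
    """Area under the PCK curve, normalized by the threshold range."""
    curve = np.array([pck(pred, gt, t) for t in thresholds])
    return float(np.trapz(curve, thresholds) / (thresholds[-1] - thresholds[0]))

# Usage sketch with random data standing in for predictions and ground truth.
rng = np.random.default_rng(0)
gt = rng.normal(size=(100, 21, 3)) * 10
pred = gt + rng.normal(size=gt.shape)
print(epe(pred, gt), pck(pred, gt, 20.0), pck_auc(pred, gt, np.linspace(20, 50, 31)))
```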

5.3 Hand Estimators

To evaluate the efficacy of our generated data, we select two recent state-of-the-art hand pose estimators, Hand3D [42] and the convolutional pose machine (CPM) [33], train them on our data, and evaluate their performance. They cover both the 3D and 2D hand pose estimation tasks. For Hand3D, we train it on different datasets (a benchmark dataset with or without our augmented data) for a fixed number of training epochs and test it on the benchmark dataset's testing set. Hand3D is applied to estimate the 3D hand pose from a monocular RGB image, while CPM is adopted to estimate the 2D hand pose from a single RGB image. As with Hand3D, we train CPM on the various datasets and evaluate its performance on the corresponding test sets.

Figure 7: TAGANs generation results. Columns from left to right: synthetic hand images, their annotations, color maps, shape maps, the results of CycleGANs [41], Pix2pix Nets [12], and our TAGANs, respectively. Both Pix2pix and our TAGANs are fed the shape maps; in addition to the shape maps, the color maps are also fed into our TAGANs. The hands in the CycleGANs results lose their shape and color features because CycleGANs take neither shape nor color conditions into account. The hands generated by Pix2pix maintain the shape features of the synthetic hand images, but the background color is lost. The results produced by TAGANs look natural and realistic, as they maintain both shape and color information by using the shape and color maps.

6 Experimental Results

This section holistically evaluates the results of TAGANs. First, we visually inspect the TAGANs-generated images and compare them with those generated by CycleGANs and Pix2pix. Second, we perform an ablation study to independently verify the efficacy of additional hand pose data and the importance of using TAGANs as a background-aware generator. We show that the hand pose estimators can be greatly improved by fine-tuning with the data generated by our TAGANs.

6.1 Comparison of the Generated Data

We compare the quality of the hand images generated by CycleGANs, Pix2pix, and our TAGANs. CycleGANs learn the translation mapping from synthetic hand images to real images, and the mapping from real images to synthetic images. In contrast, Pix2pix learns the translation from a shape map to a realistic image. Unlike Pix2pix, which adopts shape information only, our TAGANs take both hand shape and background tonality into account. Figure 7 shows the results generated by all approaches. The first two columns show the synthetic hand images generated by the AR simulator and their hand-keypoint labels, respectively. The third and fourth columns show the pre-computed color maps and shape maps of those synthetic hand images. The fifth to seventh columns show the results generated by CycleGANs, Pix2pix Nets, and our TAGANs, respectively.

As shown in Figure 7, CycleGANs have neither shape nor color constraints and thus generate unnatural hand images. Although Pix2pix can successfully preserve the hand's shape by using shape features, it has no color constraint, so its generated hand images still look unnatural. Our TAGANs impose both color and shape constraints, and the generated results therefore jointly preserve the color and shape features, which makes them look more realistic.

6.2 Ablation Study

In order to analyze the impact of our augmented data on the hand pose estimators, we conduct experiments on both the 2D and 3D pose estimation tasks.

First, we examine the impact of directly augmenting with simulator patches. We conduct four comparative experiments: STB, STB+green (clean) background (STB+GBG), RHP, and RHP+green background (RHP+GBG). Examples of "green background" images are shown in the first column of Figure 7. For STB and RHP, we train the Hand3D and CPM models on the training sets of these two datasets and test the two models on their respective validation sets. For RHP+GBG, we train the pose estimators on the combination of RHP's training set and our simulated green-background data, and test them on RHP's validation set; STB+GBG follows the same setting. The results of the 3D pose estimation model are presented in Table 2, and the results of 2D pose estimation are shown in Table 3, Figure 8(c), and Figure 8(d). As shown in Table 2 and Table 3, the hand pose estimators can be greatly improved by using large-scale synthetic data: the EPE mean error is reduced by 32% in RHP vs. RHP+GBG and by 57% in STB vs. STB+GBG. Similarly, in 2D pose estimation, both the pixel EPE error and the PCK@20 accuracy improve; see Table 3. In addition, we show that TAGANs-generated images can further improve the estimators beyond what GBG can do. This is illustrated with two experiments, STB+TAGANs and RHP+TAGANs. For RHP+TAGANs, the model is trained on the combination of RHP's training set and TAGANs-generated data, and is then tested on RHP's validation set; STB+TAGANs follows the same setting. The results show that, by substituting the green background (GBG) images with TAGANs-generated images, the model performance is boosted in both the 3D and 2D settings on the RHP and STB datasets. In RHP, the TAGANs-generated data help RHP+TAGANs improve the EPE and AUC by 2.2 mm and 0.032 over RHP+GBG. In STB, STB+TAGANs improves the EPE and AUC by 0.3 mm and 0.010 over STB+GBG. Table 3 shows that our TAGANs-generated data benefit 2D pose estimators as well.

Figure 8: Performance evaluation: (a) results on the 3D STB dataset, (b) results on the 3D RHD dataset, (c) the comparison of STB, STB+GBG, and STB+TAGANs in 2D, and (d) the comparison of RHD, RHD+GBG, and RHD+TAGANs in 2D. In (a) and (b), the x-axis is in millimeters (3D Euclidean distance). In (c) and (d), the x-axis is the 2D pixel distance.
PCK@20      EPE mean
STB 51.42  100.25
STB + GBG 49.57 102.98
STB + TAGANs 53.57 97.31
Table 1: 2D pose estimation [29] trained on STB, tested on the CMU-PS dataset. EPE mean is measured in millimeters.

To evaluate the generalization ability of models trained with TAGANs data, we design the following experiment. We deploy the CPM models trained on STB, STB+GBG, and STB+TAGANs, and test them on the CMU-PS dataset. The results are summarized in Table 1. On one hand, the comparison between STB and STB+GBG shows that adding GBG data actually undermines the model's generalizability on CMU-PS due to the lack of natural background scenes and fine-grained skin detail. On the other hand, STB+TAGANs on CMU-PS shows that TAGANs are able to bridge the gap between simulated and real data: STB+TAGANs improves PCK@20 and EPE by 2.15 and 2.94 mm over STB, respectively.

6.3 Comparison with State-Of-the-Art Results

Figure 8(b) compares our results with current state-of-the-art methods on the RHD dataset. The Hand3D model we deploy is less powerful than the model proposed in [11]. However, thanks to the additional TAGANs-generated data, our Hand3D results significantly outperform the current best models despite their more sophisticated architectures. Figure 8(a) compares the results on the STB dataset. It shows that, with the help of TAGANs, we can significantly close the gap between the Hand3D model and a much more sophisticated state-of-the-art method.

AUC EPE mean EPE median
RHP 0.424 35.6 28.6
RHP + GBG 0.568 26.4 21.5
RHP + TAGANs 0.600 24.2 19.9
STB 0.664 15.7 13.1
STB + GBG 0.672 8.2 7.7
STB + TAGANs 0.682 7.9 7.3
Table 2: 3D pose estimation [42] results on the RHP and STB datasets. EPE mean and median are measured in millimeters.
PCK@20 EPE mean
RHP 77.59 23.13
RHP + GBG 78.06 22.70
RHP + TAGANs 78.12 22.69
STB 87.18 11.18
STB + GBG 87.46 11.12
STB + TAGANs 87.78 10.95
Table 3: 2D pose estimation [29] results on the RHP and STB datasets. EPE mean is measured in pixel distance.

7 Conclusions and Future Work

This study presents a novel data augmentation approach for improving the hand pose estimation task. To produce more realistic hand images for training pose estimators, we propose TAGANs, a conditional adversarial network model, to blend synthetic hand poses with real background images. Our generated results align the hand shape and the color tonality distribution between the synthetic hand and the real background image. The experimental results show that state-of-the-art hand pose estimators can be improved by training on our generated data in both the 2D and 3D pose estimation tasks.

References

  • [1] M. Abdi, E. Abbasnejad, C. P. Lim, and S. Nahavandi. 3d hand pose estimation using simulation and partial-supervision with a shared latent space. In BMVC, 2018.
  • [2] Y. Aytar, L. Castrejon, C. Vondrick, H. Pirsiavash, and A. Torralba. Cross-modal scene networks. IEEE transactions on pattern analysis and machine intelligence, 40(10):2303–2314, 2018.
  • [3] S. Baek, K. I. Kim, and T.-K. Kim. Augmented skeleton space transfer for depth-based hand pose estimation. In CVPR, 2018.
  • [4] Y. Cai, L. Ge, J. Cai, and J. Yuan. Weakly-supervised 3d hand pose estimation from monocular rgb images. In ECCV, volume 12, 2018.
  • [5] G. Garcia-Hernando, S. Yuan, S. Baek, and T.-K. Kim. First-person hand action benchmark with rgb-d videos and 3d hand pose annotations. In CVPR, 2018.
  • [6] L. Ge, Y. Cai, J. Weng, and J. Yuan. Hand pointnet: 3d hand pose estimation using point sets. In CVPR, 2018.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, 2014.
  • [8] D. He, Y. Xia, T. Qin, L. Wang, N. Yu, T. Liu, and W.-Y. Ma. Dual learning for machine translation. In NIPS, 2016.
  • [9] F. Huang, A. Zeng, M. Liu, J. Qin, and Q. Xu. Structure-aware 3d hourglass network for hand pose estimation from single depth image. In BMVC, 2018.
  • [10] C. Ionescu, D. Papava, V. Olaru, and C. Sminchisescu. Human3.6m: Large scale datasets and predictive methods for 3d human sensing in natural environments. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7), jul 2014.
  • [11] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz. Hand pose estimation via latent 2.5D heatmap regression. In ECCV, 2018.
  • [12] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [13] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
  • [14] J. Johnson, M. Douze, and H. Jégou. Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734, 2017.
  • [15] Z. Ju, X. Ji, J. Li, and H. Liu. An integrative framework of human hand gesture segmentation for human–robot interaction. IEEE Systems Journal, 11(3), 2017.
  • [16] A. U. Khan and A. Borji. Analysis of hand segmentation in the wild. In CVPR, 2018.
  • [17] S. Lin, H. F. Cheng, W. Li, Z. Huang, P. Hui, and C. Peylo. Ubii: Physical world interaction through augmented reality. TMC, 16(3):872–885, 2017.
  • [18] W. Lin, L. Du, C. Harris-Adamson, A. Barr, and D. Rempel. Design of hand gestures for manipulating objects in virtual reality. In ICHCI, 2017.
  • [19] F. Liu, X. Xu, S. Qiu, C. Qing, and D. Tao. Simple to complex transfer learning for action recognition. In TIP, volume 25, 2016.
  • [20] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  • [21] G. Moon, J. Y. Chang, and K. M. Lee. V2v-posenet: Voxel-to-voxel prediction network for accurate 3d hand and human pose estimation from a single depth map. In CVPR, 2018.
  • [22] F. Mueller, F. Bernard, O. Sotnychenko, D. Mehta, S. Sridhar, D. Casas, and C. Theobalt. Ganerated hands for real-time 3d hand tracking from monocular rgb. In CVPR, 2018.
  • [23] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Efficient model-based 3d tracking of hand articulations using kinect. In BMVC, volume 1, 2011.
  • [24] I. Oikonomidis, N. Kyriazis, and A. A. Argyros. Tracking the articulated motion of two strongly interacting hands. In CVPR, 2012.
  • [25] R. Pandey, P. Pidlypenskyi, S. Yang, and C. Kaeser-Chen. Efficient 6-dof tracking of handheld objects from an egocentric viewpoint. In ECCV, 2018.
  • [26] P. Panteleris, I. Oikonomidis, and A. Argyros. Using a single rgb frame for real time 3d hand pose estimation in the wild. In WACV, 2018.
  • [27] J. M. Rehg and T. Kanade. Visual tracking of high dof articulated structures: an application to human hand tracking. In ECCV. Springer, 1994.
  • [28] A. Shrivastava, T. Pfister, O. Tuzel, J. Susskind, W. Wang, and R. Webb. Learning from simulated and unsupervised images through adversarial training. In CVPR, 2017.
  • [29] T. Simon, H. Joo, I. A. Matthews, and Y. Sheikh. Hand keypoint detection in single images using multiview bootstrapping. In CVPR, 2017.
  • [30] S. Sridhar, A. M. Feit, C. Theobalt, and A. Oulasvirta. Investigating the dexterity of multi-finger input for mid-air text entry. In CHI, 2015.
  • [31] J. P. Wachs, M. Kölsch, H. Stern, and Y. Edan. Vision-based hand-gesture applications. Commun. ACM, 54(2), 2011.
  • [32] C. Wan, T. Probst, L. Van Gool, and A. Yao. Dense 3d regression for hand pose estimation. In CVPR, 2018.
  • [33] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh. Convolutional pose machines. In CVPR, 2016.
  • [34] X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu. Multiscale quantization for fast similarity search. In NIPS. 2017.
  • [35] S. Xie and Z. Tu. Holistically-nested edge detection. In CVPR, 2015.
  • [36] Q. Ye and T.-K. Kim. Occlusion-aware hand pose estimation using hierarchical mixture density network. In ECCV, 2018.
  • [37] S. Yuan, G. Garcia-Hernando, B. Stenger, G. Moon, J. Yong Chang, K. Mu Lee, P. Molchanov, J. Kautz, S. Honari, L. Ge, et al. Depth-based 3d hand pose estimation: From current achievements to future goals. In CVPR, 2018.
  • [38] J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang. 3d hand pose tracking and estimation using stereo matching. In ICIP, 2016.
  • [39] J. Zhang, J. Jiao, M. Chen, L. Qu, X. Xu, and Q. Yang. 3d hand pose tracking and estimation using stereo matching. arXiv preprint arXiv:1610.07214, 2016.
  • [40] Y. Zhou, J. Lu, K. Du, X. Lin, Y. Sun, and X. Ma. Hbe: Hand branch ensemble network for real-time 3d hand pose estimation. In ECCV, 2018.
  • [41] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [42] C. Zimmermann and T. Brox. Learning to estimate 3d hand pose from single rgb images. In ICCV, 2017.