EvoPose2D: Pushing the Boundaries of 2D Human Pose Estimation using Neuroevolution

William McNally  Kanav Vats  Alexander Wong  John McPhee
Systems Design Engineering
   University of Waterloo
{wmcnally, k2vats, mcphee, a28wong}@uwaterloo.ca

Neural architecture search has proven to be highly effective in the design of computationally efficient, task-specific convolutional neural networks across several areas of computer vision. In 2D human pose estimation, however, its application has been limited by high computational demands. Hypothesizing that neural architecture search holds great potential for 2D human pose estimation, we propose a new weight transfer scheme that relaxes function-preserving mutations, enabling us to accelerate neuroevolution in a flexible manner. Our method produces 2D human pose network designs that are more efficient and more accurate than state-of-the-art hand-designed networks. In fact, the generated networks can process images at higher resolutions using less computation than previous networks at lower resolutions, permitting us to push the boundaries of 2D human pose estimation. Our baseline network designed using neuroevolution, which we refer to as EvoPose2D-S, provides comparable accuracy to SimpleBaseline while using 4.9x fewer floating-point operations and 13.5x fewer parameters. Our largest network, EvoPose2D-L, achieves new state-of-the-art accuracy on the Microsoft COCO Keypoints benchmark while using 2.0x fewer operations and 4.3x fewer parameters than its nearest competitor. The code is available at https://github.com/wmcnally/evopose2d.

1 Introduction

Two-dimensional human pose estimation is a visual recognition task dealing with the autonomous localization of anatomical human joints, or more broadly, “keypoints,” in RGB images and video [44, 43, 2]. It is widely considered a fundamental problem in computer vision due to its many downstream applications, including action recognition [10, 32] and human tracking [19, 1, 49]. In particular, it is a precursor to 3D human pose estimation [31, 36], which serves as a potential alternative to invasive marker-based motion capture.

Figure 1: A comparison of computational efficiency between EvoPose2D, SimpleBaseline, and HRNet at different scales. Circle size is proportional to the number of network parameters. EvoPose2D-S provides comparable accuracy to SimpleBaseline (ResNet-50) using 4.9x fewer FLOPs and 13.5x fewer parameters. At full-scale, EvoPose2D-L obtains state-of-the-art accuracy using 2.0x fewer FLOPs and 4.3x fewer parameters than HRNet-W48. FLOPs for SimpleBaseline and HRNet were re-calculated using TensorFlow profiler for consistency. Our results do not make use of ImageNet pretraining, half-body augmentation, or non-maximum suppression during post-processing.

In line with other streams of computer vision, the use of deep learning [23], and specifically convolutional neural networks [24] (CNNs), has been prevalent in 2D human pose estimation [44, 43, 35, 7, 9, 49, 40]. State-of-the-art methods use a two-stage, top-down pipeline, where an off-the-shelf person detector is first used to detect human instances in an image, and the 2D human pose network is run on the person detections to obtain keypoint predictions [9, 49, 40]. This paper focuses on the latter stage of this commonly used pipeline.

Recently, there has been a growing interest in the use of machines to help design CNN architectures through a process referred to as neural architecture search (NAS) [53, 3, 47]. These methods eliminate human bias and permit the automated exploration of diverse network architectures that often transcend human intuition, leading to better accuracy and computational efficiency. Despite the widespread success of NAS in many areas of computer vision [42, 8, 28, 12, 34, 52], the design of 2D human pose networks has remained, for the most part, human-principled.

Motivated by the success of NAS in other visual recognition tasks, this paper explores the application of neuroevolution, a form of NAS, to 2D human pose estimation for the first time. First, we propose a new weight transfer scheme that is highly flexible and reduces the computational expense of neuroevolution. Next, we exploit this weight transfer scheme, along with large-batch training on high-bandwidth Tensor Processing Units (TPUs), to accelerate a neuroevolution within a tailor-made search space geared towards 2D human pose estimation. In experiments, our method produces a 2D human pose network that has a relatively simple design, provides state-of-the-art accuracy when scaled, and uses fewer operations and parameters than the best performing networks in the literature (see Fig. 1). We summarize our research contributions as follows.

  • We propose a new weight transfer scheme to accelerate neuroevolution. In contrast to previous neuroevolution methods that exploit weight transfer, our method is not constrained by complete function preservation [48, 46]. Despite relaxing this constraint, our experiments indicate that the level of functional preservation afforded by our weight transfer scheme is sufficient to provide fitness convergence, thereby simplifying neuroevolution and making it more flexible.

  • We present empirical evidence that large-batch training can be used in conjunction with the Adam optimizer [21] to accelerate the training of 2D human pose networks with no loss in accuracy. We reap the benefits of large-batch training in our neuroevolution by maximizing training throughput on high-memory TPUs.

  • We design a search space conducive to 2D human pose estimation and leverage the above contributions to run a full-scale neuroevolution of 2D human pose networks within a practical time-frame (1 day using eight v2-8 TPUs). As a result, we are able to produce a computationally efficient 2D human pose estimation model that achieves state-of-the-art accuracy on the most widely used benchmark.

2 Related Work

This work draws upon several research areas in deep learning to engineer a high-performing 2D human pose estimation model. We review the three areas of the literature that are most relevant in the following sections.

Large-batch Training of Deep Neural Networks. Recent experiments have indicated that training deep neural networks using large batch sizes (e.g., 256 and greater) with stochastic gradient descent causes a degradation in the quality of the model as measured by its ability to generalize to unseen data [16, 20]. It has been shown that the difference in accuracy on training and test sets, sometimes referred to as the generalization gap, can be as large as 5% as a result of using large batch sizes. Goyal et al. [14] implemented measures for mitigating the training difficulties caused by large batch sizes, including linear scaling of the learning rate, and an initial warmup period where the learning rate was gradually increased. These measures permitted them to train a ResNet-50 [15] on the ImageNet classification task [22] using a batch size of 8192 with no loss in accuracy, and training took just 1 hour on 256 GPUs.

Maximizing training efficiency using large-batch training is critical in situations where the computational demand of training is very high, such as in neural architecture search. However, deep learning methods are often data-dependent, and so it remains unclear whether the training measures imposed by Goyal et al. apply in the general case. It is also unclear whether the learning rate modifications are applicable to optimizers that use adaptive learning rates. Adam [21] is an example of such an optimizer and is widely used in 2D human pose estimation. To this end, we empirically investigate the use of large batch sizes in conjunction with the Adam optimizer in the training of 2D human pose networks in Section 4.2.2.

2D Human Pose Estimation using Deep Learning. Interest in human pose estimation dates back to 1975, when Fischler and Elschlager [11] used pictorial structures to recognize facial attributes in photographs. The first use of deep learning for human pose estimation came in 2014, when Toshev and Szegedy [44] regressed 2D keypoint coordinates directly from RGB images using a cascade of deep CNNs. Their method laid the foundation for a series of CNN-based methods offering superior performance over part-based models by learning features directly from the data as opposed to using primitive hand-crafted feature descriptors.

Arguing that the direct regression of pose vectors from images was a highly non-linear and difficult to learn mapping, Tompson et al. [43] introduced the notion of learning a heatmap representation, which represented the per-pixel likelihood for the existence of keypoints. The mean squared error (MSE) was used to minimize the distance between the predicted and target heatmaps, where the targets were generated using Gaussians with small variance centered on the ground-truth keypoint locations. The heatmap representation was highly effective, and continues to be used in the most recent human pose estimation models.

Several of the methods that followed built upon iterative heatmap refinement in a multi-stage fashion including intermediate supervision [45, 7, 35]. Noting the inefficiencies associated with multi-stage stacking, Chen et al. [9] designed the Cascaded Pyramid Network (CPN), a holistic network constructed using a ResNet-50 [15] feature pyramid [26]. Xiao et al. [49] presented yet another holistic architecture that stacked transpose convolutions on top of ResNet. The aptly named SimpleBaseline network outperformed CPN despite having a simple architecture and implementation. Sun et al. [40] observed that most existing methods recover high-resolution features from low-resolution embeddings. They demonstrated with HRNet that maintaining high-resolution features throughout the entire network could provide greater accuracy. HRNet represents the state-of-the-art in 2D human pose estimation among peer-reviewed works at the time of writing.

An issue surrounding the 2D human pose estimation literature is that it is often difficult to make fair comparisons of model performance due to the heavy use of model-agnostic improvements. Examples include using different learning rate schedules [40, 25], more data augmentation [25, 5], loss functions that target more challenging keypoints [9], specialized post-processing steps [33, 18], or more accurate person detectors [25, 18]. These discrepancies can potentially account for reported differences in accuracy. To directly compare our method with the state-of-the-art, we re-implement SimpleBaseline [49] and HRNet [40] and train all networks under the same settings.

Neuroevolution. Until recently, the design of CNNs has primarily been human-principled, guided by rules of thumb based on previous experimental results. Hand-designing a CNN that performs optimally for a specific task is therefore very time consuming. Consequently, there has been a growing interest in NAS methods [53]. Neuroevolution is a form of neural architecture search that harnesses evolutionary algorithms to search for optimal network architectures [38]. We focus on neuroevolution due to its flexibility and simplicity compared to other approaches using reinforcement learning [53, 3, 54, 41, 42], one-shot NAS [4, 6, 37], or gradient-based NAS [29, 50].

Due to the large size of architectural search spaces, and the fact that sampled architectures need to be trained to convergence to evaluate their performance, NAS requires a substantial amount of computation. In fact, some of the first implementations required several GPU years [53, 54, 38]. This inevitably led to a branch of research aimed at making NAS practical by reducing the search time. Network morphisms [46] and function-preserving mutations [48] are techniques used in neuroevolution that tackle this problem. In essence, these methods iteratively mutate networks and transfer weights in such a way that the function of the network is completely preserved upon mutation, i.e., the output of the mutated network is identical to that of the parent network. Ergo, the mutated child networks need only be trained for a relatively small number of steps compared to when training from a randomly initialized state. As a result, these techniques are capable of reducing the search time to a matter of GPU days. However, function-preserving mutations can be challenging to implement and also restricting (e.g., complexity cannot be reduced [48]).
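To make the function-preservation constraint concrete, a Net2Net-style widening of a hidden layer can be sketched as follows. This is a simplified dense-layer illustration with hypothetical names, not the implementation of [46, 48]: extra units are copies of existing ones, and their outgoing weights are divided by the replica count so the layer's output is exactly unchanged.

```python
import numpy as np

def widen_dense(w1, b1, w2, new_units, rng=None):
    """Function-preserving widening of a hidden layer (Net2WiderNet-style).

    w1: (d_in, old_units) incoming weights; b1: (old_units,) biases;
    w2: (old_units, d_out) outgoing weights. Returns widened copies such
    that the two-layer function x -> relu(x @ w1 + b1) @ w2 is preserved.
    """
    rng = rng or np.random.default_rng(0)
    old_units = w1.shape[1]
    # Each new unit replicates a randomly chosen existing unit.
    mapping = np.concatenate([np.arange(old_units),
                              rng.integers(0, old_units, new_units - old_units)])
    counts = np.bincount(mapping, minlength=old_units)
    w1_new = w1[:, mapping]
    b1_new = b1[mapping]
    # Divide outgoing weights by replica counts so the summed output is unchanged.
    w2_new = w2[mapping, :] / counts[mapping][:, None]
    return w1_new, b1_new, w2_new
```

Relaxing this constraint, as in our weight transfer scheme, avoids the bookkeeping above and permits mutations (e.g., shrinking a layer) that strict function preservation forbids.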

NAS algorithms have predominantly been developed and evaluated on small-scale image datasets [47]. The use of NAS in more complex visual recognition tasks remains limited, in large part because the computational demands make it infeasible. This is especially true for 2D human pose estimation, where training a single model can take several days [9]. Nevertheless, the use of NAS in the design of 2D human pose networks has been attempted in a few cases [50, 13, 51]. Although some of the resulting networks provided superior computational efficiency as a result of having fewer parameters and operations, none managed to surpass the best performing hand-crafted networks in terms of accuracy.

3 Neuroevolution of 2D Human Pose Networks

The cornerstone of the proposed neuroevolution framework is a simple yet effective weight transfer scheme that enables searching for optimal deep neural networks in a fast and flexible manner. In this paper, we tailor our search space to the task of 2D human pose estimation using prior knowledge of cutting-edge hand-crafted pose networks, but emphasize that our method is generally applicable to all types of deep networks.

Weight transfer. Suppose that a parent network is represented by the function $y = \mathcal{N}_p(x; \theta_p)$, where $x$ is the input to the network and $\theta_p$ are its parameters. The foundation of our neuroevolution framework lies in the process by which the parameters $\theta_c$ in a mutated child network $\mathcal{N}_c$ are inherited from $\theta_p$ such that $\mathcal{N}_c(x; \theta_c) \approx \mathcal{N}_p(x; \theta_p)$. That is, the output, or “function,” of the mutated child network is similar to that of the parent but not necessarily equal. To enable fast neural architecture search, the degree to which the parent’s function is preserved must be sufficient to allow $\mathcal{N}_c$ to be trained to convergence in a small fraction of the number of steps normally required when training from a randomly initialized state.

To formalize the proposed weight transfer in the context of 2D convolution, we denote $\theta_p \in \mathbb{R}^{k_p \times k_p \times c_{1,p} \times c_{2,p}}$ as the weights used by layer $l$ of the parent network, and $\theta_c \in \mathbb{R}^{k_c \times k_c \times c_{1,c} \times c_{2,c}}$ as the weights of the corresponding layer in the mutated child network, where $k$ is the kernel size, $c_1$ is the number of input channels, and $c_2$ is the number of output channels. For the sake of brevity, we consider the special case when $k_c \le k_p$, $c_{1,c} \le c_{1,p}$, and $c_{2,c} \le c_{2,p}$, but the following definition can easily be extended to when $k_c > k_p$, $c_{1,c} > c_{1,p}$, or $c_{2,c} > c_{2,p}$. The inherited weights $\hat{\theta}_c$ are given by:

$\hat{\theta}_c = \theta_p[\,o : o + k_c,\; o : o + k_c,\; 1 : c_{1,c},\; 1 : c_{2,c}\,]$

where $o = \lfloor (k_p - k_c)/2 \rfloor$. $\hat{\theta}_c$ is transferred to $\theta_c$ and the remaining non-inherited weights in $\theta_c$ are randomly initialized. An example of weight transfer between two convolutional layers is depicted in Fig. 2. In principle, the proposed weight transfer can be used with convolutions of any dimensionality (e.g., 1D or 3D convolutions), and is permitted between convolutional operators with different kernel size, stride, dilation, input channels, and output channels. More generally, it can be applied to any operations with learnable parameters, including batch normalization and dense layers.
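For illustration, the centered-crop weight transfer can be sketched in NumPy as follows. The function name and the He-style initialization of the non-inherited weights are our own assumptions; the released code may differ.

```python
import numpy as np

def transfer_conv_weights(parent_w, child_shape, rng=None):
    """Transfer a centered crop of a parent conv kernel into a child kernel.

    parent_w: (k, k, c_in, c_out) trained parent weights.
    child_shape: (k', k', c_in', c_out') target shape; each dimension may be
    smaller or larger than the parent's. Non-inherited entries are randomly
    initialized (He-style, as an assumption).
    """
    rng = rng or np.random.default_rng(0)
    kc, _, ci_c, co_c = child_shape
    kp, _, ci_p, co_p = parent_w.shape
    fan_in = kc * kc * ci_c
    child_w = rng.normal(0.0, np.sqrt(2.0 / fan_in), size=child_shape)
    # Size of the overlapping region in each dimension.
    k = min(kc, kp)
    ci = min(ci_c, ci_p)
    co = min(co_c, co_p)
    # Center the kernel crop; take the leading input/output channels.
    op = (kp - k) // 2  # offset into the parent kernel
    oc = (kc - k) // 2  # offset into the child kernel
    child_w[oc:oc + k, oc:oc + k, :ci, :co] = parent_w[op:op + k, op:op + k, :ci, :co]
    return child_w
```

For example, mutating a 5x5 kernel into a 3x3 kernel inherits the 3x3 center of the trained filter, while growing it to 7x7 embeds the full 5x5 filter in the center of a freshly initialized kernel.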

In essence, the proposed weight transfer method relaxes the function-preservation constraint imposed in [46, 48]. In practice, we find that the proposed weight transfer preserves the majority of the function of deep CNNs following mutation. This enables us to perform network mutations in a simple and flexible manner while maintaining good parameter initialization in the mutated network. As a result, the mutated networks can be trained using fewer iterations, which accelerates the neuroevolution.

Search space. Neural architecture search helps moderate human involvement in the design of deep neural networks. However, neural architecture search is by no means fully automatic. To some extent, our role transitions from a network designer to a search designer. Decisions regarding the search space are particularly important because the search space encompasses all possible solutions to the optimization problem, and its size correlates with the amount of computation required to thoroughly explore the space. As such, it is common to exploit prior knowledge in order to reduce the size of the search space and ensure that the sampled architectures are tailored toward the task at hand [54].

Motivated by the simplicity and elegance of the SimpleBaseline architecture for 2D human pose estimation [49], we search for an optimal backbone using a search space inspired by [41, 42]. Specifically, the search space encompasses a single-branch hierarchical structure that includes seven modules stacked in series. Each module is constructed of chain-linked inverted residual blocks [39] that use an expansion ratio of six and squeeze-excitation [17]. For each module, we search for the optimal kernel size, number of inverted residual blocks, and output channels. Considering the newfound importance of spatial resolution in the deeper layers of 2D human pose networks [40], we additionally search for the optimal stride of the last three modules. Without going into too much detail, our search space contains a vast number of unique backbones. To complete the network, an initial convolutional layer with 32 output channels precedes the seven modules, and three transpose convolutions with kernel size of 3x3, stride of 2, and 128 output channels are used to construct the network head.
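A genotype for this search space might be represented as follows. The option sets and fixed early-module strides below are hypothetical placeholders, not the paper's exact choices:

```python
import random
from dataclasses import dataclass

# Hypothetical option sets -- the paper does not enumerate its exact choices here.
KERNELS = [3, 5, 7]
REPEATS = [1, 2, 3, 4]
CHANNELS = [16, 24, 40, 80, 112, 128, 160]

@dataclass
class Module:
    kernel: int    # depthwise kernel size of the inverted residual blocks
    repeats: int   # number of chain-linked inverted residual blocks
    channels: int  # output channels
    stride: int

def sample_backbone(rng=random):
    """Sample one candidate backbone: seven modules in series, with the
    stride of the last three modules also searchable (1 or 2)."""
    fixed_strides = [1, 2, 2, 2]  # assumed strides for the first four modules
    modules = []
    for i in range(7):
        stride = fixed_strides[i] if i < 4 else rng.choice([1, 2])
        modules.append(Module(rng.choice(KERNELS), rng.choice(REPEATS),
                              rng.choice(CHANNELS), stride))
    return modules
```

Mutation then amounts to perturbing one or more fields of a sampled genotype, with the weight transfer scheme supplying the child's initialization.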

Figure 2: Two examples of the weight transfer used in the proposed neuroevolution framework. The trained weights (shown in blue) in the parent convolutional filter are transferred, either in part or in full, to the corresponding filter in the mutated child network. The weight transfer extends to all output channels in the same manner as depicted here for input channels.

Fitness. To strike a balance between computational efficiency and accuracy, we build on the Pareto optimizations in [41, 42] and minimize a multi-objective fitness function including the validation loss and the number of network parameters. Given a 2D pose network represented by the function $\hat{H} = \mathcal{N}(x; \theta)$, the loss for a single RGB input image $x$ and corresponding target heatmap $H$ is given by

$\mathcal{L} = \frac{1}{K} \sum_{k=1}^{K} v_k \, \lVert \hat{H}_k - H_k \rVert_2^2 \qquad (1)$

where $K$ is the number of keypoints and $v_k$ represents the keypoint visibility flags [27]. $H$ is generated by centering 2D Gaussians, whose standard deviation in pixels scales with $h$, on the ground-truth keypoint coordinates, where $h$ is the height of the output heatmaps. The fitness of a network can then be defined as:

$f(\mathcal{N}) = \left( \frac{1}{N} \sum_{i=1}^{N} \mathcal{L}_i \right) \left( \frac{|\theta|}{T} \right)^{\omega} \qquad (2)$

where $N$ is the number of samples in the validation dataset, $|\theta|$ is the number of parameters in $\mathcal{N}$, $T$ is the target number of parameters, and $\omega$ controls the fitness trade-off between the number of parameters and the validation loss. Minimizing the number of parameters instead of the number of floating-point operations (FLOPs) allows us to indirectly minimize FLOPs while not penalizing mutations that decrease stride too severely.
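The target generation, masked loss, and fitness can be sketched in NumPy as follows. The Gaussian width and the exact coupling between validation loss and parameter count are reconstructions, so treat the constants as assumptions:

```python
import numpy as np

def gaussian_target(h, w, kx, ky, sigma=2.0):
    """Target heatmap: a 2D Gaussian centered on a keypoint. sigma is an
    assumed constant; the paper ties it to the heatmap height."""
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - kx) ** 2 + (ys - ky) ** 2) / (2 * sigma ** 2))

def keypoint_mse(pred, target, vis):
    """Masked MSE over K heatmaps: pred/target are (K, h, w), vis is (K,)
    with 1 for labeled keypoints and 0 otherwise."""
    per_kp = ((pred - target) ** 2).mean(axis=(1, 2))
    return float((vis * per_kp).mean())

def fitness(mean_val_loss, num_params, target_params, omega):
    """Multi-objective fitness: mean validation loss scaled by a parameter
    penalty (MnasNet-style; the exact coupling is an assumption)."""
    return mean_val_loss * (num_params / target_params) ** omega
```

With omega > 0, networks larger than the parameter target are penalized multiplicatively, while networks at the target are judged on validation loss alone.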

Evolutionary strategy. The evolutionary strategy proceeds as follows. In generation “0”, a common ancestor network is manually defined and trained from scratch for $E_a$ epochs. In generation 1, $\lambda$ children are generated by mutating the ancestor network. Weight transfer is performed between the ancestor and each child, after which the children’s weights are trained for $E_c$ epochs ($E_c \ll E_a$). At the end of generation 1, the $\mu$ networks with the best fitness from the pool of $\lambda + 1$ networks (children + ancestor) become the parents in the next generation. In generation 2 and beyond, the mutation / weight transfer / training process is repeated and the top-$\mu$ networks from the pool of $\lambda + \mu$ networks (children + parents) become the parents in the next generation. The evolution continues until manual termination, typically after the fitness has converged.
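The strategy above amounts to a (mu + lambda)-style evolutionary loop, sketched here with `mutate`, `train`, and `fitness` as stand-ins for the components described in this section:

```python
import random

def evolve(ancestor, mutate, train, fitness, mu=4, lam=32, generations=10,
           rng=random):
    """Minimal (mu + lambda) evolutionary loop.

    `mutate` returns a child (with weights transferred from its parent),
    `train` briefly fine-tunes a network, and `fitness` returns a scalar
    to minimize. All three callables are placeholders for the paper's
    actual components.
    """
    parents = [train(ancestor)]  # generation "0": the trained ancestor
    for _ in range(generations):
        # Each child is mutated from a randomly chosen parent, then trained.
        children = [train(mutate(rng.choice(parents))) for _ in range(lam)]
        # Parents survive into the pool, so the best fitness never worsens.
        pool = parents + children
        pool.sort(key=fitness)
        parents = pool[:mu]
    return parents[0]
```

Because surviving parents re-enter the selection pool each generation, the best fitness in the population is monotonically non-increasing, matching the convergence behaviour described above.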

Large-batch training. Even with the computational savings afforded by weight transfer, running a full-scale neuroevolution of 2D human pose networks at a standard input resolution of 256x192 would not be feasible within a practical time-frame using common GPU resources (e.g., 8-GPU server). To reduce the search time to within a practical range, we exploit large batch sizes when training 2D human pose networks on TPUs. In line with [14], we linearly scale the learning rate with the batch size and gradually ramp-up the learning rate during the first few epochs. In Section 4.2, we empirically demonstrate that this training regimen can be used in conjunction with the Adam optimizer [21] to train 2D human pose networks up to a batch size of 2048 with no loss in accuracy. To our best knowledge, the largest batch size previously used to train a 2D human pose network was 256, which required 8 GPUs [25].

Compound scaling. It has been shown recently that scaling a network’s resolution, width (channels), and depth (layers) together is more efficient than scaling one of these dimensions individually [42]. Motivated by this finding, we scale the base network found through neuroevolution to different input resolutions using the following depth ($d$) and width ($w$) coefficients:

$\phi = \log_{\gamma}(R / R_0), \qquad d = \alpha^{\phi}, \qquad w = \beta^{\phi}$

where $R_0$ is the search resolution, $R$ is the desired resolution, and $\alpha$, $\beta$, $\gamma$ are scaling parameters. For convenience, we use the same scaling parameters as in [42] ($\alpha$ = 1.2, $\beta$ = 1.1, $\gamma$ = 1.15) but hypothesize that better results could be obtained if these parameters were tuned.
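Under the assumption that the compound exponent is fixed by the resolution ratio (i.e., the EfficientNet-style relation gamma**phi = R/R0), the coefficients can be computed as:

```python
import math

def scaling_coefficients(search_res, target_res, alpha=1.2, beta=1.1, gamma=1.15):
    """Depth/width multipliers for scaling a searched network to a new
    input resolution. The coupling of phi to the resolution ratio is a
    reconstruction in the spirit of EfficientNet compound scaling."""
    phi = math.log(target_res / search_res) / math.log(gamma)
    depth = alpha ** phi   # multiplier on the number of blocks per module
    width = beta ** phi    # multiplier on the number of channels
    return depth, width
```

For instance, scaling from a 256-pixel search resolution to 384 pixels yields phi of roughly 2.9, so depth grows faster than width, consistent with alpha > beta.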

4 Experiments

4.1 Microsoft COCO Dataset

The 2017 Microsoft COCO Keypoints dataset [27] is the predominant dataset used to evaluate 2D human pose estimation models. It contains over 200k images and 250k person instances labeled with 17 keypoints. We fit our models to the training subset, which contains 57k images and 150k person instances. We evaluate our models on both the validation and test-dev sets, which contain 5k and 20k images, respectively. We report the standard average precision (AP) and average recall (AR) scores based on Object Keypoint Similarity (OKS): AP (mean AP at OKS = 0.50, 0.55, ..., 0.90, 0.95), AP50 (AP at OKS = 0.50), AP75, APM (medium objects), APL (large objects), and AR (mean AR at OKS = 0.50, 0.55, ..., 0.90, 0.95). More details are available at https://cocodataset.org/#keypoints-eval.

4.2 Large-batch Training of 2D Human Pose Networks on TPUs

To maximize training throughput on TPUs, we run experiments to investigate the training behaviour of 2D human pose networks using larger batch sizes than have been used previously. For these experiments, we re-implement the SimpleBaseline model of Xiao et al. [49], which stacks three transpose convolutions with 256 channels and kernel size of 3x3 on top of a ResNet-50 pretrained on ImageNet [22]. We run the experiments at an input resolution of 256x192, which yields output heatmap predictions of size 64x48. According to the TensorFlow profiler used, this model has 34.1M parameters and 5.21G FLOPs.

4.2.1 Implementation Details

Except during neuroevolution, where further details are provided in Section 4.3, the following experimental setup was used to obtain the results for all models trained in this paper. TensorFlow 2.3 and the tf.keras API were used for implementation. The COCO keypoints dataset was first converted to TFRecords for TPU compatibility (1024 examples per shard). The TFRecord dataset contained the serialized examples including the raw images, keypoint locations, and bounding boxes, and the dataset was stored in a Google Cloud Storage Bucket where it was accessed remotely by the TPU host CPU over the network. Thus, all preprocessing, including target heatmap generation, image transformations, and data augmentation, was performed on the host CPU. A single-device v3-8 TPU (8 TPU cores, 16GB of high-bandwidth memory per core) was used for training, validation, and testing.

Preprocessing. The RGB input images were first normalized to a range of [0, 1], then centered and scaled by the ImageNet pixel means and standard deviations. The images were then transformed and cropped to the input size of the network. During training, random horizontal flipping, scaling, and rotation were used for data augmentation. The exact data augmentation configuration is provided in the linked code.
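As a sketch, the normalization step can be written as follows, using the standard ImageNet channel statistics (the usual published constants; the transformation, cropping, and augmentation steps are omitted):

```python
import numpy as np

# Standard ImageNet channel statistics (RGB order).
IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def preprocess(image_uint8):
    """Normalize an RGB uint8 image to [0, 1], then center and scale it by
    the ImageNet pixel means and standard deviations."""
    x = image_uint8.astype(np.float32) / 255.0
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```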

Training. The networks were trained for 200 epochs using the bfloat16 floating-point format, which consumes half the memory of the commonly used float32. The loss represented in Eq. (1) was minimized using the Adam optimizer [21] with a cosine-decay learning rate schedule [30] and L2 regularization with weight decay. The base learning rate was scaled linearly with the global batch size $N$. Additionally, a warmup period was implemented by gradually increasing the learning rate from zero to the scaled value over the first five epochs. The validation loss was evaluated after every epoch using the ground-truth bounding boxes and a batch size of 256.
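The resulting schedule (linear scaling, five-epoch warmup, cosine decay) can be sketched as a pure function of the training step; the reference batch size of 256 is an assumption:

```python
import math

def learning_rate(step, steps_per_epoch, total_epochs, base_lr, batch_size,
                  warmup_epochs=5, base_batch=256):
    """Linearly scaled cosine-decay learning rate with warmup.

    The base learning rate is scaled by batch_size / base_batch, ramped
    linearly from zero over the warmup epochs, then decayed with a cosine
    to zero at the final step.
    """
    scaled = base_lr * batch_size / base_batch
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = total_epochs * steps_per_epoch
    if step < warmup_steps:
        return scaled * step / warmup_steps  # linear ramp from 0
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return 0.5 * scaled * (1.0 + math.cos(math.pi * progress))
```

In practice this would be wrapped in a tf.keras LearningRateSchedule; the plain function above is just the shape of the curve.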

Testing. The common two-stage, top-down pipeline was used during testing [9, 49, 40]. We use the same detections as [49, 40] and follow the standard testing protocol: the heatmaps from the original and horizontally flipped images were averaged and the keypoint predictions were obtained after applying a quarter offset in the direction from the highest response to the second highest response. We do not use non-maximum suppression.
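The final decoding step can be sketched as follows; the per-axis neighbour comparison is a common implementation of the quarter-offset heuristic and not necessarily the exact post-processing used here:

```python
import numpy as np

def decode_heatmap(hm):
    """Decode one heatmap into (x, y) keypoint coordinates.

    The argmax is shifted by 0.25 px toward the neighbouring pixel with
    the larger response along each axis, a standard heuristic that reduces
    the quantization error of the discrete heatmap grid.
    """
    h, w = hm.shape
    y, x = np.unravel_index(np.argmax(hm), hm.shape)
    px, py = float(x), float(y)
    if 0 < x < w - 1:
        px += 0.25 * np.sign(hm[y, x + 1] - hm[y, x - 1])
    if 0 < y < h - 1:
        py += 0.25 * np.sign(hm[y + 1, x] - hm[y - 1, x])
    return px, py
```

In the full pipeline, the heatmaps from the original and flipped images are first averaged, then each of the 17 keypoint channels is decoded independently and mapped back to image coordinates.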

4.2.2 Results

The batch size was doubled from an initial batch size of 256 until the memory of the v3-8 TPU was exceeded. The maximum batch size attained was 2048. The loss curves for the corresponding training runs are shown in Fig. 3. While the final training loss increased marginally with batch size, the validation losses converged in the latter part of training, signifying that the networks provide similar accuracy. The AP values in Table 1 confirm that we are able to train up to a batch size of 2048 with no loss in accuracy. We credit the increase in AP over the original implementation (AP of 70.4) to the longer training cycle. Furthermore, we demonstrate the importance of warmup and learning rate scaling. When training at the maximum batch size, removing warmup resulted in a loss of 1.3 AP, and removing learning rate scaling resulted in a loss of 0.7 AP.

While preprocessing the data on the TPU host CPU provides flexibility for training using different input resolutions and data augmentation, it ultimately causes a bottleneck in the input pipeline. This is evidenced by the training times in Table 1, which decreased after increasing the batch size to 512, but leveled off at around 5.3 hours for batch sizes of 512 or greater. We expect that the training time could be reduced substantially if preprocessing and augmentation were included in the TFRecord dataset, or if the TPU host CPU had more processing power. It is also noted that training these models for 140 epochs instead of 200, as in the original implementation [49], reduces the training time to 3.7 hours. Bypassing validation after every epoch speeds up training further. For comparison, training a model of similar size on eight NVIDIA TITAN Xp GPUs takes about 1.5 days [9].

Figure 3: Loss curves during training of SimpleBaseline (ResNet-50) [49] on a v3-8 Cloud TPU using various batch sizes and learning rate schedules. Top: Training loss. Bottom: Validation loss.
Batch size  Warmup  Scale  Training Time (hrs)  AP
256 Y Y 7.20 71.0
512 Y Y 5.42 71.0
1024 Y Y 5.25 71.2
2048 Y Y 5.32 71.0
2048 N Y 5.35 69.7
2048 Y N 5.33 70.3
Table 1: Training time and final AP for large-batch training of SimpleBaseline on Cloud TPU. The original implementation reports an AP of 70.4 [49]. The bottom two rows highlight the importance of warmup and scaling the learning rate when using large batch sizes.

4.3 Neuroevolution

The neuroevolution described in Section 3 was run under various settings on an 8-CPU, 40 GB memory virtual machine that called on eight v2-8 Cloud TPUs to train several generations of 2D human pose networks. The input resolution used was 256x192, and the target number of parameters $T$ was set in the low millions. Other settings, including $\omega$, $\lambda$, and $\mu$, are provided in the legend of Fig. 4. ImageNet pretraining was exploited by seeding the common ancestor network using the same inverted residual blocks as used in EfficientNet-B0 [42]. The ancestor network was trained for 30 epochs and all other networks were trained for 5 epochs. A batch size of 512 was used to provide near-optimal training efficiency (as per the results in the previous section) and prevent memory exhaustion mid-search. No learning rate warmup was used during neuroevolution, and the only data augmentation used was horizontal flipping. All other training details are the same as in Section 4.2.1.

Search results. Fig. 4 shows the convergence of fitness for three independent neuroevolutions (E1, E2, E3). The runtimes for E1, E2, and E3 were 1.5, 0.8, and 1.1 days, respectively. The gap between the fitness (solid line) and validation loss (dashed line) is larger in E2 and E3 compared to E1, indicating that smaller networks were favored more as a result of the change in $\omega$. After increasing the number of children from 32 in E2 to 64 in E3, it became apparent that using fewer children may provide faster convergence, but may also cause the fitness to converge to a local minimum. Fig. 5 plots the validation loss against the number of parameters for all sampled networks. The prominent Pareto frontier near the bottom-left of the figure gives us confidence that the search space was thoroughly explored.

Figure 4: Tracking the network with the best fitness in three independent neuroevolutions. The dashed line represents the validation loss of the network with the lowest fitness. $\omega$: fitness coefficient controlling the trade-off between validation loss and number of parameters. $\lambda$: number of children. $\mu$: number of parents.

EvoPose2D. The network with the lowest fitness from neuroevolution E3 was selected as our baseline network, which we refer to as EvoPose2D-S. Its architectural details are provided in Table 2. The overall stride of the backbone is less than what is typically seen in hand-designed 2D human pose networks. Specifically, the lowest spatial resolution observed in the network is 1/16 of the input size, compared to 1/32 in SimpleBaseline and HRNet. As a result, the output heatmap is twice as large.

Figure 5: Validation loss versus the number of network parameters for all sampled networks.

The baseline network was scaled to various levels of computational expense. We create a lighter version (EvoPose2D-XS) by increasing the stride in Module 6, which cuts the FLOPs in half. Using the compound scaling method described in Section 3, we scale EvoPose2D-S to an input resolution of 384x288 (EvoPose2D-M), which is currently the highest resolution used in single-person 2D human pose estimation. We push the boundaries of 2D human pose estimation by scaling to an input resolution of 512x384 (EvoPose2D-L). Even at this high spatial resolution, EvoPose2D-L has approximately half the FLOPs of the largest version of HRNet.

Component  Blocks  Kernel Size  Stride  Output Shape
Stem Conv  -  3  2  (H/2, W/2, 32)
Module 1  1  3  1  (H/2, W/2, 16)
Module 2  3  3  2  (H/4, W/4, 24)
Module 3  2  5  2  (H/8, W/8, 40)
Module 4  4  3  2  (H/16, W/16, 80)
Module 5  2  5  1  (H/16, W/16, 112)
Module 6  4  5  1  (H/16, W/16, 128)
Module 7  2  3  1  (H/16, W/16, 80)
Head Conv 1  -  3  2  (H/8, W/8, 128)
Head Conv 2  -  3  2  (H/4, W/4, 128)
Head Conv 3  -  3  2  (H/2, W/2, 128)
Final Conv  -  1  1  (H/2, W/2, K)
Table 2: The architecture of our base 2D pose network, EvoPose2D-S, designed via neuroevolution. With H = 256, W = 192, and K = 17 keypoints, EvoPose2D-S contains 2.53M parameters and 1.07G FLOPs.
Method Backbone Pretrain Input size Params (M) FLOPs (G)
CPN [9] ResNet-50 Y
SimpleBaseline [49] ResNet-50 Y
SimpleBaseline [49] ResNet-101 Y
SimpleBaseline [49] ResNet-152 Y
HRNet-W [40] - N
HRNet-W [40] - Y
HRNet-W [40] - Y
MSPN [25] 4xResNet-50 Y
SimpleBaseline [49] ResNet-152 Y
HRNet-W [40] - Y
HRNet-W [40] - Y
HRNet-W + PF [33] - Y 90.9 84.4
SimpleBaseline ResNet-50 N
SimpleBaseline ResNet-50 Y
HRNet-W32 - N
EvoPose2D-XS - N 2.53 0.47
EvoPose2D-XS - WT 2.53 0.47
EvoPose2D-S - N 2.53
EvoPose2D-S - WT 2.53
EvoPose2D-M - N
EvoPose2D-L - N
EvoPose2D-L + PF - N 77.5 90.9 83.6 74.0 82.5
Table 3: Comparisons on the COCO validation set. All models in the bottom section were implemented as per Section 4.2.1. Pretrain: backbone pretrained on the ImageNet classification task. WT: ImageNet weights partially transferred from EfficientNet [42] according to the weight transfer method defined in Section 3. PF: including PoseFix post-processing [33]. Where marked, FLOPs were re-calculated using the TensorFlow profiler for consistency. Best results shown in bold.
Method Backbone Pretrain Input size Params (M) FLOPs (G)
CPN [9] Res-Inception Y - -
SimpleBaseline [49] ResNet-152 Y
HRNet-W [40] - Y
HRNet-W + PF [33] - Y 92.6 82.6
EvoPose2D-L - N 14.7 17.7
EvoPose2D-L + PF - N 14.7 17.7 76.8 84.3 73.5 81.7
Table 4: Comparisons on the COCO test-dev set. Pretrain: backbone pretrained on the ImageNet classification task. PF: including PoseFix post-processing [33]. Best results shown in bold.

Comparison with the state-of-the-art. To directly compare EvoPose2D with the best methods in the literature, we re-implement SimpleBaseline (ResNet-50) and HRNet-W32 as per the implementation described in Section 4.2.1. In our implementation of HRNet, we use a strided transpose pointwise convolution in place of a pointwise convolution followed by nearest-neighbour upsampling. This modification was required to make the model TPU-compatible, and did not change the number of parameters or FLOPs. The accuracy of our implementation is verified against the original in Table 3.
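The parameter neutrality of this swap is easy to see from the kernel shapes: a 1x1 convolution and a 1x1 transpose convolution carry identically shaped weight tensors, so replacing "pointwise conv + nearest-neighbour upsample" with a strided pointwise transpose conv (e.g. `tf.keras.layers.Conv2DTranspose(filters, 1, strides=2)`) changes neither the parameter count nor the FLOPs. The channel numbers in the sketch below are illustrative:

```python
# A 1x1 (pointwise) conv has kernel shape (1, 1, c_in, c_out); a 1x1
# transpose conv has the same kernel shape, so both layers have the same
# number of parameters regardless of stride. Channel counts are illustrative.
def conv1x1_params(c_in, c_out, bias=True):
    return c_in * c_out + (c_out if bias else 0)

# Same kernel tensor shape -> identical parameter count.
transpose1x1_params = conv1x1_params
```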

Comparing EvoPose2D-S with our SimpleBaseline implementation without ImageNet pretraining, we find that EvoPose2D-S is less accurate than SimpleBaseline by 1.2 AP on the COCO validation set (Table 3) but uses 13.5x fewer parameters and 4.9x fewer FLOPs. Similarly, we compare EvoPose2D-M with our HRNet-W32 (256x192) implementation, and observe that EvoPose2D-M is more accurate by 1.5 AP while using 3.9x fewer parameters and 1.4x fewer FLOPs.

Given the well-known benefits of ImageNet pretraining, we explore transferring some of the weights from pretrained EfficientNets [42] using the weight transfer scheme defined in Section 3 (indicated by WT in Table 3). An improvement of 0.3 and 0.4 AP was observed for EvoPose2D-XS and EvoPose2D-S, respectively, but we did not observe any improvement for the larger models (EvoPose2D-M/L). We suspect that the accuracy of all our networks would improve with proper ImageNet pretraining, as was performed for HRNet [40].
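The partial transfer can be sketched as below. This is a deliberately minimal, hypothetical version (variable names are invented for illustration): a pretrained tensor is copied only when the target network has a variable with a matching name and shape, and everything else keeps its random initialization. The actual scheme in Section 3 relaxes this strict matching:

```python
import numpy as np

# Minimal sketch of partial weight transfer: copy a pretrained tensor only
# when the target network contains a variable with the same name and shape;
# all other variables keep their random init. Variable names are hypothetical.
def transfer_weights(target, pretrained):
    """Both arguments map variable names to numpy arrays. Updates `target`
    in place and returns the names of the transferred variables."""
    transferred = []
    for name, weight in target.items():
        if name in pretrained and pretrained[name].shape == weight.shape:
            target[name] = pretrained[name].copy()
            transferred.append(name)
    return transferred
```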

Despite not using ImageNet pretraining, EvoPose2D-L achieves state-of-the-art AP on the COCO validation set (with and without PoseFix [33]), using 4.3x fewer parameters and 2.0x fewer FLOPs than its nearest competitor.² Since EvoPose2D was designed using the COCO validation data, it is especially important to perform evaluation on the COCO test-dev set. We therefore show in Table 4 that EvoPose2D-L also achieves state-of-the-art accuracy on the test-dev dataset, again without ImageNet pretraining.

²Higher AP has been reported using HRNet with model-agnostic improvements, including a better person detector and unbiased data processing [18].

5 Conclusion

We proposed a simple yet effective weight transfer scheme and used it, in conjunction with large-batch training, to accelerate the neuroevolution of 2D human pose networks. We provided supporting experiments demonstrating that 2D human pose networks can be trained with a batch size of up to 2048 on a single TPU device with no loss in accuracy. We exploited large-batch training in our neuroevolution experiments, which produced a lightweight 2D human pose network design. When scaled to higher input resolutions, the EvoPose2D networks designed using neuroevolution proved to be more accurate and more computationally efficient than the best-performing 2D human pose estimation models in the literature.
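Large-batch training of the kind summarized above is typically paired with the linear learning-rate scaling rule and gradual warmup of Goyal et al. [14], cited earlier in this paper. A minimal sketch follows; the base values are illustrative, not the schedule actually used for EvoPose2D:

```python
# Linear learning-rate scaling with gradual warmup (after Goyal et al. [14]).
# base_lr, base_batch, and warmup_steps are illustrative values only.
def learning_rate(step, batch_size, base_lr=1e-3, base_batch=256,
                  warmup_steps=1000):
    scaled = base_lr * batch_size / base_batch  # linear scaling rule
    if step < warmup_steps:
        return scaled * (step + 1) / warmup_steps  # warm up from near zero
    return scaled
```

For a batch size of 2048 against a base batch of 256, the rule scales the base learning rate by 8x once warmup completes.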

Acknowledgements. We acknowledge financial support from the Canada Research Chairs Program, the Natural Sciences and Engineering Research Council of Canada (NSERC), and a Google Cloud Academic Research Grant. We also acknowledge the TensorFlow Research Cloud Program and an NVIDIA GPU Grant for hardware support.


  • [1] M. Andriluka, U. Iqbal, E. Insafutdinov, L. Pishchulin, A. Milan, J. Gall and B. Schiele (2018) Posetrack: a benchmark for human pose estimation and tracking. In CVPR, Cited by: §1.
  • [2] M. Andriluka, L. Pishchulin, P. Gehler and B. Schiele (2014) 2d human pose estimation: new benchmark and state of the art analysis. In CVPR, Cited by: §1.
  • [3] B. Baker, O. Gupta, N. Naik and R. Raskar (2017) Designing neural network architectures using reinforcement learning. In ICLR, Cited by: §1, §2.
  • [4] G. Bender, P. Kindermans, B. Zoph, V. Vasudevan and Q. Le (2018) Understanding and simplifying one-shot architecture search. In ICML, pp. 550–559. Cited by: §2.
  • [5] Y. Bin, X. Cao, X. Chen, Y. Ge, Y. Tai, C. Wang, J. Li, F. Huang, C. Gao and N. Sang (2020) Adversarial semantic data augmentation for human pose estimation. In ECCV, Cited by: §2.
  • [6] A. Brock, T. Lim, J. M. Ritchie and N. Weston (2017) Smash: one-shot model architecture search through hypernetworks. arXiv preprint arXiv:1708.05344. Cited by: §2.
  • [7] Z. Cao, T. Simon, S. Wei and Y. Sheikh (2017) Realtime multi-person 2d pose estimation using part affinity fields. In CVPR, Cited by: §1, §2.
  • [8] L. Chen, M. Collins, Y. Zhu, G. Papandreou, B. Zoph, F. Schroff, H. Adam and J. Shlens (2018) Searching for efficient multi-scale architectures for dense image prediction. In NeurIPS, Cited by: §1.
  • [9] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu and J. Sun (2018) Cascaded pyramid network for multi-person pose estimation. In CVPR, Cited by: §1, §2, §2, §2, §4.2.1, §4.2.2, Table 3, Table 4.
  • [10] G. Chéron, I. Laptev and C. Schmid (2015) P-cnn: pose-based cnn features for action recognition. In ICCV, Cited by: §1.
  • [11] M. A. Fischler and R. A. Elschlager (1973) The representation and matching of pictorial structures. IEEE Transactions on Computers 100 (1), pp. 67–92. Cited by: §2.
  • [12] G. Ghiasi, T. Lin and Q. V. Le (2019) Nas-fpn: learning scalable feature pyramid architecture for object detection. In CVPR, Cited by: §1.
  • [13] X. Gong, W. Chen, Y. Jiang, Y. Yuan, X. Liu, Q. Zhang, Y. Li and Z. Wang (2020) AutoPose: searching multi-scale branch aggregation for pose estimation. arXiv preprint arXiv:2008.07018. Cited by: §2.
  • [14] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia and K. He (2017) Accurate, large minibatch sgd: training imagenet in 1 hour. arXiv preprint arXiv:1706.02677. Cited by: §2, §3.
  • [15] K. He, X. Zhang, S. Ren and J. Sun (2016) Deep residual learning for image recognition. In CVPR, Cited by: §2, §2.
  • [16] E. Hoffer, I. Hubara and D. Soudry (2017) Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NeurIPS, Cited by: §2.
  • [17] J. Hu, L. Shen and G. Sun (2018) Squeeze-and-excitation networks. In CVPR, Cited by: §3.
  • [18] J. Huang, Z. Zhu, F. Guo and G. Huang (2020) The devil is in the details: delving into unbiased data processing for human pose estimation. In CVPR, Cited by: §2, footnote 2.
  • [19] E. Insafutdinov, M. Andriluka, L. Pishchulin, S. Tang, E. Levinkov, B. Andres and B. Schiele (2017) Arttrack: articulated multi-person tracking in the wild. In CVPR, Cited by: §1.
  • [20] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy and P. T. P. Tang (2017) On large-batch training for deep learning: generalization gap and sharp minima. In ICLR, Cited by: §2.
  • [21] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: 2nd item, §2, §3, §4.2.1.
  • [22] A. Krizhevsky, I. Sutskever and G. E. Hinton (2012) Imagenet classification with deep convolutional neural networks. In NeurIPS, Cited by: §2, §4.2.
  • [23] Y. LeCun, Y. Bengio and G. Hinton (2015) Deep learning. nature 521 (7553), pp. 436–444. Cited by: §1.
  • [24] Y. LeCun and Y. Bengio (1995) Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks 3361 (10), pp. 1995. Cited by: §1.
  • [25] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei and J. Sun (2019) Rethinking on multi-stage networks for human pose estimation. arXiv preprint arXiv:1901.00148. Cited by: §2, §3, Table 3.
  • [26] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie (2017) Feature pyramid networks for object detection. In CVPR, Cited by: §2.
  • [27] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In ECCV, Cited by: §3, §4.1.
  • [28] C. Liu, L. Chen, F. Schroff, H. Adam, W. Hua, A. L. Yuille and L. Fei-Fei (2019) Auto-deeplab: hierarchical neural architecture search for semantic image segmentation. In CVPR, Cited by: §1.
  • [29] H. Liu, K. Simonyan and Y. Yang (2019) Darts: differentiable architecture search. In ICLR, Cited by: §2.
  • [30] I. Loshchilov and F. Hutter (2017) SGDR: stochastic gradient descent with warm restarts. In ICLR, Cited by: §4.2.1.
  • [31] J. Martinez, R. Hossain, J. Romero and J. J. Little (2017) A simple yet effective baseline for 3d human pose estimation. In ICCV, Cited by: §1.
  • [32] W. McNally, A. Wong and J. McPhee (2019) STAR-net: action recognition using spatio-temporal activation reprojection. In CRV, Cited by: §1.
  • [33] G. Moon, J. Y. Chang and K. M. Lee (2019) Posefix: model-agnostic general human pose refinement network. In CVPR, Cited by: §2, §4.3, Table 3, Table 4.
  • [34] V. Nekrasov, H. Chen, C. Shen and I. Reid (2019) Fast neural architecture search of compact semantic segmentation models via auxiliary cells. In CVPR, Cited by: §1.
  • [35] A. Newell, K. Yang and J. Deng (2016) Stacked hourglass networks for human pose estimation. In ECCV, Cited by: §1, §2.
  • [36] D. Pavllo, C. Feichtenhofer, D. Grangier and M. Auli (2019) 3d human pose estimation in video with temporal convolutions and semi-supervised training. In CVPR, Cited by: §1.
  • [37] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le and J. Dean (2018) Efficient neural architecture search via parameter sharing. In ICML, Cited by: §2.
  • [38] E. Real, S. Moore, A. Selle, S. Saxena, Y. L. Suematsu, J. Tan, Q. V. Le and A. Kurakin (2017) Large-scale evolution of image classifiers. In ICML, Cited by: §2, §2.
  • [39] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen (2018) Mobilenetv2: inverted residuals and linear bottlenecks. In CVPR, Cited by: §3.
  • [40] K. Sun, B. Xiao, D. Liu and J. Wang (2019) Deep high-resolution representation learning for human pose estimation. In CVPR, Cited by: §1, §2, §2, §3, §4.2.1, §4.3, Table 3, Table 4.
  • [41] M. Tan, B. Chen, R. Pang, V. Vasudevan, M. Sandler, A. Howard and Q. V. Le (2019) Mnasnet: platform-aware neural architecture search for mobile. In CVPR, Cited by: §2, §3, §3.
  • [42] M. Tan and Q. V. Le (2019) Efficientnet: rethinking model scaling for convolutional neural networks. In ICML, Cited by: §1, §2, §3, §3, §3, §4.3, §4.3, Table 3.
  • [43] J. J. Tompson, A. Jain, Y. LeCun and C. Bregler (2014) Joint training of a convolutional network and a graphical model for human pose estimation. In NeurIPS, Cited by: §1, §1, §2.
  • [44] A. Toshev and C. Szegedy (2014) DeepPose: human pose estimation via deep neural networks. In CVPR, Cited by: §1, §1, §2.
  • [45] S. Wei, V. Ramakrishna, T. Kanade and Y. Sheikh (2016) Convolutional pose machines. In CVPR, Cited by: §2.
  • [46] T. Wei, C. Wang, Y. Rui and C. W. Chen (2016) Network morphism. In ICML, Cited by: 1st item, §2, §3.
  • [47] M. Wistuba, A. Rawat and T. Pedapati (2019) A survey on neural architecture search. arXiv preprint arXiv:1905.01392. Cited by: §1, §2.
  • [48] M. Wistuba (2018) Deep learning architecture search by neuro-cell-based evolution with function-preserving mutations. In ECML PKDD, Cited by: 1st item, §2, §3.
  • [49] B. Xiao, H. Wu and Y. Wei (2018) Simple baselines for human pose estimation and tracking. In ECCV, Cited by: §1, §1, §2, §2, §3, Figure 3, §4.2.1, §4.2.2, §4.2, Table 1, Table 3, Table 4.
  • [50] S. Yang, W. Yang and Z. Cui (2019) Pose neural fabrics search. arXiv preprint arXiv:1909.07068. Cited by: §2, §2.
  • [51] H. Zhang, L. Wang, S. Jun, N. Imamura, Y. Fujii and H. Kobashi (2020) CPNAS: cascaded pyramid network via neural architecture search for multi-person pose estimation. In CVPRW, Cited by: §2.
  • [52] Z. Zhu, C. Liu, D. Yang, A. Yuille and D. Xu (2019) V-nas: neural architecture search for volumetric medical image segmentation. In 3DV, Cited by: §1.
  • [53] B. Zoph and Q. V. Le (2017) Neural architecture search with reinforcement learning. In ICLR, Cited by: §1, §2, §2.
  • [54] B. Zoph, V. Vasudevan, J. Shlens and Q. V. Le (2018) Learning transferable architectures for scalable image recognition. In CVPR, Cited by: §2, §2, §3.