Joint Estimation of Age and Gender from Unconstrained Face Images using Lightweight Multi-task CNN for Mobile Applications

Jia-Hong Lee, Yi-Ming Chan, Ting-Yen Chen, and Chu-Song Chen
Institute of Information Science, Academia Sinica, Taipei

Automatic age and gender classification from unconstrained images has become an essential technique on mobile devices. With limited computing power, developing a robust system is a challenging task. In this paper, we present an efficient convolutional neural network (CNN), called the lightweight multi-task CNN, for simultaneous age and gender classification. The lightweight multi-task CNN uses depthwise separable convolution to reduce the model size and the inference time. On the challenging public Adience dataset, its age and gender classification accuracy is better than that of baseline multi-task CNN methods.

1 Introduction

Understanding age and gender from the human face plays an essential role in social interaction. To make communication proper and efficient, people subconsciously judge others’ age or gender. Thus, age and gender estimation is important in several applications, including re-identification in surveillance videos, intelligent advertising and human-computer interaction. Nevertheless, accurately and efficiently estimating age and gender from unconstrained face images is difficult.

Prior to the deep neural network era, several approaches estimated age and gender from face images using hand-designed image features and machine learning. In [4], Eidinger et al. combine four-patch local binary patterns (FPLBP) [14] and a support vector machine (SVM) [3] to achieve joint estimation of age and gender. Han et al. [5] use biologically inspired features (BIF) and a hierarchical estimator of their own design for this task. Following the great success of deep CNNs in object classification, Rothe et al. [11] develop the DEX method, which consists of the face detector in [10] and the deep CNN architecture VGG-16 [13], for age estimation. Levi and Hassner [8] introduce a five-layer CNN architecture that achieves the most favorable performance on the unconstrained public Adience dataset [4].

Although the Levi_Hassner CNN [8] achieves high accuracy, it needs two independent models to predict age and gender, respectively. To reduce the memory cost, we develop a single lightweight model that handles both tasks through multi-task learning and depthwise separable convolution.

2 Related Work

There are two types of structures commonly used in multi-task learning with deep neural networks [12]: soft and hard parameter sharing of hidden layers. In soft parameter sharing, each task has its own deep neural network with the same structure, and a similarity function is used to regularize these models [15]. Thus the space required at run time is proportional to the number of tasks. To reduce the space complexity, hard parameter sharing is most commonly used in deep multi-task learning. It employs only one shared deep neural network and keeps several task-specific output layers [18]. Hard parameter sharing not only reduces the space complexity but also decreases the risk of over-fitting [2].

Figure 1: The architecture of the LMTCNN.

The growing popularity of mobile devices motivates researchers to develop deep neural networks for mobile applications [6, 9, 17, 7]. MobileNet [6] is one of the most interesting approaches to speeding up a deep neural network. MobileNet is based on a streamlined architecture that factorizes a general convolution into a depthwise convolution and a pointwise convolution. By combining the output values of the depthwise convolution with a pointwise convolution, a lightweight deep neural network can be constructed. To further parameterize the tradeoff between latency and accuracy, the authors use two global hyper-parameters, the width multiplier and the resolution multiplier, to adjust the computational cost of the neural network.

3 Lightweight Multi-Task CNN

We refine the state-of-the-art approach [8] in two aspects that are critical for mobile applications: simultaneous inference and model reduction. Unlike the Levi_Hassner CNN [8], which uses two independent models to recognize age and gender, our system uses a single CNN for feature extraction across both tasks. Thus, the memory requirement for the deep neural networks is reduced. We employ a hard parameter sharing paradigm to learn the single CNN for both tasks. To further decrease the computation cost, we decompose the general convolutions in [8] into depthwise and pointwise convolutions. The pointwise convolution is a convolution with a $1 \times 1$ kernel, and it combines the output values of the depthwise convolution.

3.1 Depthwise Separable Convolution

Before detailing our network architecture, we briefly review depthwise separable convolution in this section. First, we consider the computational complexity of a general convolution. Let us denote the size of a general convolutional layer by $D_K \times D_K \times M \times N$, where $D_K$ is the size of the kernel $K$, and $M$ and $N$ are the numbers of input and output channels, respectively. The dimension of the input map $F$ is $W_F \times H_F \times M$, where $W_F$ and $H_F$ are the width and height of the input feature map, respectively. The size of the output map $G$ is $W_G \times H_G \times N$, where $W_G$ and $H_G$ are the width and height of the map, respectively. Figure 2(a) shows a common convolutional layer. The computational cost of the common convolution layer is $D_K \cdot D_K \cdot M \cdot N \cdot W_G \cdot H_G$.

In depthwise separable convolution, we split the general convolution layer into two layers. One is the depthwise convolution layer, which applies one 2D kernel filter of size $D_K \times D_K$ to each input channel, and the other is the pointwise convolution layer, which uses a $1 \times 1$ convolution to generate a linear combination of the outputs of the depthwise layer, as shown in Figure 2(b). The computational cost of the depthwise separable convolution layer is $D_K \cdot D_K \cdot M \cdot W_G \cdot H_G + M \cdot N \cdot W_G \cdot H_G$.

Figure 2: (a) The general convolutional filter. (b) The depthwise separable filter.

Dividing the computational cost of the depthwise separable convolution by that of the general convolution, we obtain the cost ratio $\frac{1}{N} + \frac{1}{D_K^2}$. The greater the number of output channels, the greater the speedup that depthwise separable convolution can achieve. In [6], Howard et al. demonstrate that simplifying the architecture of a CNN in this manner can considerably increase the inference speed with little loss in classification performance.
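The cost comparison can be checked numerically. The following is a minimal sketch, not the authors' code: the function and variable names (`d_k` for the kernel size, `m` and `n` for the input and output channel counts, `w_g` and `h_g` for the output map width and height) are chosen here for illustration, and the layer shape in the example is an assumed one.

```python
def general_conv_cost(d_k, m, n, w_g, h_g):
    """Multiply-accumulate count of a standard convolution layer."""
    return d_k * d_k * m * n * w_g * h_g


def depthwise_separable_cost(d_k, m, n, w_g, h_g):
    """Multiply-accumulate count of a depthwise plus a 1x1 pointwise layer."""
    depthwise = d_k * d_k * m * w_g * h_g  # one d_k x d_k filter per input channel
    pointwise = m * n * w_g * h_g          # 1x1 convolution mixing the channels
    return depthwise + pointwise


# Assumed example shape: 3x3 kernel, 96 -> 256 channels, 28x28 output map.
d_k, m, n, w_g, h_g = 3, 96, 256, 28, 28

general = general_conv_cost(d_k, m, n, w_g, h_g)
separable = depthwise_separable_cost(d_k, m, n, w_g, h_g)

ratio = separable / general          # measured cost ratio
closed_form = 1 / n + 1 / d_k ** 2   # ratio predicted by the formula

print(f"general: {general}, separable: {separable}, ratio: {ratio:.4f}")
```

For a 3x3 kernel the second term of the ratio dominates, so the separable layer costs a little more than one ninth of the standard layer once the number of output channels is large.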

3.2 Lightweight Multi-Task Network

The Lightweight Multi-Task CNN (LMTCNN) is composed of one general convolution layer, two depthwise separable convolution layers and two fully connected layers. Thus, it can accomplish multiple tasks while reducing the memory cost. The system overview is shown in Figure 1. To handle the age and gender classification on the Adience dataset [4], our proposed method consists of three steps.

First, the input color face image is scaled to $256 \times 256$ and then cropped to $227 \times 227$ using over-sampling. Over-sampling here means that the system extracts five cropped regions from the scaled color face image: four from the corners and one from the center. The LMTCNN processes the five cropped regions together with their horizontal reflections and estimates the final result as the average score over these ten regions.
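The over-sampling step can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' implementation: images are represented as square nested lists, the helper names are hypothetical, and the sizes 256 and 227 are assumed to match the preprocessing of the baseline [8].

```python
def five_crop_boxes(image_size, crop_size):
    """Return (left, top, right, bottom) boxes: four corners, then the center."""
    offset = image_size - crop_size
    center = offset // 2
    origins = [(0, 0), (offset, 0), (0, offset), (offset, offset), (center, center)]
    return [(x, y, x + crop_size, y + crop_size) for x, y in origins]


def crop(image, box):
    """Cut a box out of an image stored as a list of rows."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def hflip(image):
    """Mirror an image horizontally."""
    return [row[::-1] for row in image]


def oversample(image, crop_size):
    """Five crops plus their horizontal reflections (ten regions in total)."""
    crops = [crop(image, box) for box in five_crop_boxes(len(image), crop_size)]
    return crops + [hflip(c) for c in crops]


def average_scores(score_lists):
    """Average per-class scores over all regions."""
    count = len(score_lists)
    return [sum(column) / count for column in zip(*score_lists)]


# Assumed sizes: a 256x256 scaled image and 227x227 crops.
image = [[row * 256 + col for col in range(256)] for row in range(256)]
boxes = five_crop_boxes(256, 227)
regions = oversample(image, 227)
print(len(regions), boxes[-1])
```

In a full pipeline each of the ten regions would be passed through the network, and `average_scores` would be applied separately to the age scores and to the gender scores.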

Second, 96 kernel filters of size $7 \times 7 \times 3$ are applied to the input in the first general convolution layer, followed by a rectified linear unit (ReLU), a max pooling layer with a $3 \times 3$ window and a stride of two pixels, and a local response normalization layer. The output feature map (size $28 \times 28 \times 96$) of the first general convolution layer is processed by the two subsequent depthwise separable convolution layers defined in Table 1. The output feature map of the last depthwise separable convolution layer is fed to a max pooling layer that partitions the input feature map into a set of non-overlapping regions.

Type        Filter Shape         Input Size
dw Conv1    3 x 3 x 96           28 x 28 x 96
pw Conv1    1 x 1 x 96 x 256     28 x 28 x 96
dw Conv2    3 x 3 x 256          14 x 14 x 256
pw Conv2    1 x 1 x 256 x 384    14 x 14 x 256
Table 1: The architecture of the two depthwise separable convolution layers used in the lightweight multi-task CNN
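To give a sense of the weight savings implied by the filter shapes in Table 1, the sketch below counts the weights of the four layers and compares them with a hypothetical standard 3 x 3 convolution using the same channel counts. The comparison target is an assumption made here for illustration (it is not the layer configuration of [8]), and biases are ignored.

```python
# Filter shapes copied from Table 1: (height, width, input channels, output channels).
# A depthwise layer holds one filter per input channel, so its last factor is 1.
table1_layers = {
    "dw Conv1": (3, 3, 96, 1),
    "pw Conv1": (1, 1, 96, 256),
    "dw Conv2": (3, 3, 256, 1),
    "pw Conv2": (1, 1, 256, 384),
}


def weight_count(shape):
    """Number of weights in a filter bank of the given shape (no biases)."""
    height, width, in_channels, out_channels = shape
    return height * width * in_channels * out_channels


separable_weights = sum(weight_count(s) for s in table1_layers.values())

# Hypothetical standard 3x3 convolutions with the same channel counts.
standard_weights = weight_count((3, 3, 96, 256)) + weight_count((3, 3, 256, 384))

print(f"separable: {separable_weights}, standard: {standard_weights}, "
      f"ratio: {separable_weights / standard_weights:.3f}")
```

Under these assumptions the two depthwise separable layers need roughly one ninth of the weights of their standard counterparts.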

Finally, the output feature map of the max pooling layer is fed to the two fully connected layers, each of which contains 512 neurons and is followed by a ReLU and a dropout layer. To perform both the age classification over eight age classes and the gender classification over two gender classes, two separate softmax layers are applied to the resulting feature vector. The first softmax layer assigns a probability to each age class and the other assigns a probability to each gender class. Figure 1 visualizes the network configuration.
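The two task-specific softmax outputs on top of a shared feature vector can be sketched as below. This is a toy illustration of hard parameter sharing, not the trained network: the weights are random, the helper names are hypothetical, and the joint loss (an unweighted sum of the two cross-entropy terms) is an assumption, since the text does not state how the two task losses are combined.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def linear(features, weights, biases):
    """Dense layer: one weight row and one bias per output unit."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, biases)]


def make_head(n_out, n_in, rng):
    """Randomly initialised task-specific output layer (illustrative only)."""
    weights = [[rng.gauss(0.0, 0.01) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def joint_loss(age_probs, gender_probs, age_label, gender_label):
    """Assumed joint objective: sum of the two cross-entropy terms."""
    return -math.log(age_probs[age_label]) - math.log(gender_probs[gender_label])


rng = random.Random(0)
shared_features = [rng.random() for _ in range(512)]  # stand-in for the 512-d shared output

age_head = make_head(8, 512, rng)      # eight age classes
gender_head = make_head(2, 512, rng)   # two gender classes

age_probs = softmax(linear(shared_features, *age_head))
gender_probs = softmax(linear(shared_features, *gender_head))
loss = joint_loss(age_probs, gender_probs, age_label=4, gender_label=1)
print(len(age_probs), len(gender_probs), round(loss, 3))
```

Both heads read the same `shared_features`, so the feature extractor is stored and evaluated only once for the two tasks.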

4 Experiment and Result

Our proposed network is implemented using the TensorFlow framework [1]. Training and testing are executed on a desktop with an Intel Xeon E5 3.5 GHz CPU, 64 GB of RAM and a GeForce GTX TITAN X GPU. Training our proposed network takes approximately six hours. When testing on the desktop, predicting age and gender on a single image requires approximately milliseconds.

4.1 The Result of Adience Dataset

The Adience dataset [4] is composed of pictures taken with smartphone or tablet cameras. Its images capture extreme variations, including extreme blur (low resolution), occlusions, out-of-plane pose variations and expressions. The entire Adience dataset includes 26,580 unconstrained images of 2,284 subjects. Its age labels contain eight groups: 0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53 and 60+.

Unlike other datasets (such as Morph II), where the face images are taken in a controlled environment, the Adience dataset is an in-the-wild benchmark for joint age and gender estimation and is thus more demanding. Because our purpose is to develop a mobile system that can estimate age and gender in real environments, testing on this benchmark reflects the performance more appropriately.

For age and gender classification, we measure and compare the accuracy using five-fold cross validation. The numbers of images in each fold for training, validation and testing are shown in Table 2. The in-plane aligned version of the faces defined in [4] is used.

Fold      Training    Validation    Testing
First     11,136      1,242         3,879
Second    11,905      1,348         3,005
Third     11,814      1,323         3,121
Fourth    12,056      1,335         2,866
Fifth     11,593      1,277         3,387
Table 2: The number of images in each fold of the training, validation and testing sets

We compare our proposed method with the baseline Levi_Hassner CNN [8] using five-fold cross validation: in each fold, the model is trained on the training set and evaluated on the testing set, with the numbers of images shown in Table 2. Our proposed method is the LMTCNN with the width multiplier of each depthwise separable convolution set to 1 or 2. Table 3 reports the age and gender classification accuracy of LMTCNN-1-1 and LMTCNN-2-1, where the two numbers denote the width multipliers of the first and second depthwise separable convolutions, respectively. As can be seen, although our architectures are more compact, their performance is comparable to that of the Levi_Hassner CNN [8].

Methods                 Age Top-1 Acc. (%)    Age Top-2 Acc. (%)    Gender Top-1 Acc. (%)
Levi_Hassner CNN [8]    44.14                 69.98                 82.52
LMTCNN-1-1              40.84                 66.10                 82.04
LMTCNN-2-1              44.26                 70.78                 85.16
Table 3: The accuracy of the age and gender classification generated by five-fold cross validation on the Adience dataset
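The Top-1 and Top-2 columns of Table 3 can be computed with a small helper like the one below. It is a sketch under one assumption: a prediction counts as a Top-k hit when the true class is among the k highest-scoring classes, which is the usual reading but is not spelled out in the text. The scores and labels are made-up toy values, not results from the paper.

```python
def top_k_accuracy(scores, labels, k):
    """Percentage of samples whose true label is among the k best-scored classes."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda index: row[index], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(labels)


# Toy per-class scores for four samples and three classes (illustrative only).
toy_scores = [
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
    [0.2, 0.3, 0.5],
    [0.6, 0.3, 0.1],
]
toy_labels = [1, 1, 0, 0]

top1 = top_k_accuracy(toy_scores, toy_labels, k=1)
top2 = top_k_accuracy(toy_scores, toy_labels, k=2)
print(top1, top2)
```

With five-fold cross validation, such a function would be applied to the testing set of each fold and the per-fold results combined.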

4.2 Mobile Applications

To run a deep neural network model on mobile devices with the Android operating system, we convert the model into a computational graph of the TensorFlow library [1] and compare the model size of each method, as shown in Table 4. The model size of LMTCNN-1-1 (8.7 MB) is approximately eight times smaller than that of the Levi_Hassner CNN [8] (70.8 MB), and the model size of LMTCNN-2-1 (30.0 MB) is less than half that of the Levi_Hassner CNN.

Methods                 Model size (MB)
Levi_Hassner CNN [8]    70.8
LMTCNN-1-1              8.7
LMTCNN-2-1              30.0
Table 4: Comparison of the model sizes

For mobile applications, we port the system, comprising face detection, age recognition and gender recognition, to mobile devices such as smartphones, tablets and smart robots. Face detection is implemented using the MTCNN method [16]. The facial regions are then cropped from each frame for the LMTCNN or the Levi_Hassner CNN [8] to recognize the age and gender. Figure 3 demonstrates our system on the ASUS Zenbo and the ASUS Zenfone 3. The ASUS Zenbo is a smart robot developed by ASUS with an Intel Atom x5-Z8550 2.4 GHz CPU, 4 GB of RAM and Android 6.0.1. The ASUS Zenfone 3 is a smartphone developed by ASUS with a Qualcomm Snapdragon 625 2.02 GHz CPU, 3 GB of RAM and Android 7.0. We also measure the processing time of each method on both devices, as shown in Table 5.

Methods                 Asus Zenbo Speed (ms/frame)    Asus Zenfone 3 Speed (ms/frame)
Levi_Hassner CNN [8]    4800                           4800
LMTCNN-1-1              204.7                          204.9
LMTCNN-2-1              297.6                          367.2
Table 5: The speed of each method executed on the mobile devices

In summary, the above results (Tables 3, 4 and 5) reveal that the LMTCNN can decrease the model size and speed up inference on mobile devices while maintaining the accuracy of age and gender classification.
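The size and speed gains can be read off directly from the reported numbers. The short script below only restates the figures from Tables 4 and 5 and divides the baseline value by each LMTCNN value; the dictionary layout and function names are chosen here for illustration.

```python
# Values copied from Table 4 (model size in MB) and Table 5 (ms per frame).
model_size_mb = {"Levi_Hassner CNN": 70.8, "LMTCNN-1-1": 8.7, "LMTCNN-2-1": 30.0}
ms_per_frame = {
    "Levi_Hassner CNN": {"Zenbo": 4800.0, "Zenfone 3": 4800.0},
    "LMTCNN-1-1": {"Zenbo": 204.7, "Zenfone 3": 204.9},
    "LMTCNN-2-1": {"Zenbo": 297.6, "Zenfone 3": 367.2},
}
BASELINE = "Levi_Hassner CNN"


def size_reduction(method):
    """How many times smaller the model is than the baseline."""
    return model_size_mb[BASELINE] / model_size_mb[method]


def speedup(method, device):
    """How many times faster one frame is processed than with the baseline."""
    return ms_per_frame[BASELINE][device] / ms_per_frame[method][device]


for method in ("LMTCNN-1-1", "LMTCNN-2-1"):
    print(f"{method}: {size_reduction(method):.1f}x smaller, "
          f"{speedup(method, 'Zenbo'):.1f}x faster on Zenbo, "
          f"{speedup(method, 'Zenfone 3'):.1f}x faster on Zenfone 3")
```

On these figures, LMTCNN-1-1 is about 8 times smaller and more than 20 times faster than the baseline on both devices, while LMTCNN-2-1 trades part of that gain for the higher accuracy shown in Table 3.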

5 Conclusion

Figure 3: The demonstration of LMTCNN executed on mobile devices. (a) Asus Zenbo and (b) Asus Zenfone 3.

We introduce a new network structure, the LMTCNN, which accomplishes multiple tasks while maintaining the accuracy of age and gender classification. We also show that our architecture can be realized on mobile devices with limited computational resources. In the future, we will improve the performance of the LMTCNN and further reduce the model size for datasets of unconstrained face images with face attributes.


  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv, 2016.
  • [2] J. Baxter. A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine learning, 1997.
  • [3] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 1995.
  • [4] E. Eidinger, R. Enbar, and T. Hassner. Age and gender estimation of unfiltered faces. IEEE TIFS, 2014.
  • [5] H. Han, C. Otto, X. Liu, and A. K. Jain. Demographic estimation from face images: Human vs. machine performance. IEEE TPAMI, 2015.
  • [6] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv, 2017.
  • [7] Y.-D. Kim, E. Park, S. Yoo, T. Choi, L. Yang, and D. Shin. Compression of deep convolutional neural networks for fast and low power mobile applications. In ICLR, 2016.
  • [8] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In CVPR Workshops, 2015.
  • [9] D. Li, X. Wang, and D. Kong. Deeprebirth: Accelerating deep neural network execution on mobile devices. In AAAI, 2018.
  • [10] M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. Face detection without bells and whistles. In ECCV, 2014.
  • [11] R. Rothe, R. Timofte, and L. Van Gool. Dex: Deep expectation of apparent age from a single image. In ICCV Workshops, 2015.
  • [12] S. Ruder. An overview of multi-task learning in deep neural networks. arXiv, 2017.
  • [13] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
  • [14] L. Wolf, T. Hassner, and Y. Taigman. Descriptor based methods in the wild. In Workshop on Faces in 'Real-Life' Images: Detection, Alignment, and Recognition, 2008.
  • [15] Y. Yang and T. M. Hospedales. Trace norm regularised deep multi-task learning. In ICLR Workshops, 2017.
  • [16] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE SPL, 2016.
  • [17] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. arXiv, 2017.
  • [18] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning. In ECCV, 2014.