Fine-Grained Age Estimation in the wild with Attention LSTM Networks
Age estimation from a single face image is an essential task in human-computer interaction and computer vision, with a wide range of practical applications. Existing methods achieve relatively low accuracy on face images in the wild because they consider only the global features of the face image and neglect the fine-grained features of age-sensitive areas. To address this, we propose a method based on an Attention LSTM network for fine-grained age estimation in the wild, drawing on the ideas of fine-grained categorization and visual attention. The method combines ResNets or RoR models with an LSTM unit to construct AL-ResNets or AL-RoR networks that extract age-sensitive local regions, which effectively improves age estimation accuracy. First, a ResNets or RoR model pre-trained on the ImageNet dataset is selected as the base model and fine-tuned on the IMDB-WIKI-101 dataset for age estimation. Then, we fine-tune ResNets or RoR on the target age datasets to extract the global features of face images. To extract the local features of age-sensitive areas, the LSTM unit is introduced to obtain the coordinates of the age-sensitive region automatically. Finally, age group classification experiments are conducted directly on the Adience dataset, and age regression experiments are performed with the Deep EXpectation (DEX) algorithm on the MORPH Album 2, FG-NET and LAP datasets. The final prediction is obtained by combining the global and local features. Our experiments demonstrate the effectiveness of AL-ResNets and AL-RoR for age estimation in the wild, where they achieve new state-of-the-art performance compared with other CNN methods on the Adience, MORPH Album 2, FG-NET and LAP datasets.
A variety of face attributes extracted from face images in the wild are known to be very useful in characterizing individuals. Age is one of the most significant of these attributes: it is an inherent attribute and a crucial biological characteristic, and it plays a fundamental role in human social interaction. Automatic age estimation is therefore closely tied to the field of artificial intelligence. Age prediction has attracted ever-growing attention in the computer vision and machine learning communities, since it is a useful technique for many applications, such as medical diagnosis (premature facial aging due to various causes), age-based human-computer interaction systems, advanced video surveillance, and demographic information collection. These applications need reliable support and therefore depend on accurate facial age information.
Most previous studies addressed the age prediction problem using hand-designed features with statistical models. However, manually designed features perform unsatisfactorily on benchmarks of unconstrained images. Later research approached age estimation from face images with convolutional neural networks (CNNs), which automatically extract feature representations from input images. Discriminative feature extraction for age estimation is affected by facial gestures, lighting, makeup, background, etc., so automatic age estimation is regarded as a very challenging task for several reasons. One of the principal reasons is that face images of adjacent ages show significant similarity and only subtle inter-class differences. Because of this, distinguishing age with only a global feature vector may not achieve good results, whereas age-sensitive areas (wrinkles, hair, spots, etc.) can provide more distinctive features for age estimation. Hence, the effectiveness of an age estimation method depends heavily on accurately locating the local regions with distinctive features and on introducing the concept of fine-grained image categorization appropriately. On traditional fine-grained categorization datasets, manual annotation of specific parts is the dominant strategy, and a wide array of fine-grained classification methods still rely on this additional supervisory information, which increases computational complexity. In contrast, none of the public large-scale age datasets annotate image parts, which severely limits the development of age estimation. Therefore, there is an urgent need for a way to automatically obtain local information relevant to age characteristics.
To solve these problems, we propose a fine-grained age estimation method based on an Attention LSTM (AL) network to improve feature extraction. As shown in Fig. 1(a, b), a Long Short-Term Memory (LSTM) unit is inserted between the residual groups and the fully connected layer of a residual network (ResNets) or a residual network of residual networks (RoR) to form an AL-ResNets or AL-RoR network. The network is designed to combine the strengths of ResNets or RoR with those of the LSTM unit to generate feature representations of the critical age-sensitive region. In the context of fine-grained age estimation, feature extraction can be regarded as a two-level process, with one image-level stage and one part-level stage.
Fig. 1(c) shows the pipeline of our framework. To improve performance and alleviate over-fitting on small-scale datasets, we first train a ResNets or RoR model on ImageNet, then fine-tune it on the IMDB-WIKI-101 dataset, and finally fine-tune the model on the target age datasets to extract the global features of the images. On top of this image-level feature extraction, an AL-ResNets or AL-RoR network is constructed from ResNets or RoR to extract the age-sensitive features of the effective local area of the target images. The final prediction is obtained by combining the predictions from the global and local features. In extensive experiments on age datasets, our models achieve new state-of-the-art results on the Adience, MORPH Album 2, FG-NET and LAP datasets.
The remainder of the paper is organized as follows. Section II briefly reviews related work for age estimation methods. The proposed AL-ResNets or AL-RoR age estimation method is described in Section III. Experimental results and analysis are presented in Section IV, leading to conclusions in Section V.
II Related Work
Since the beginning of the 21st century, deep learning has driven advances in the field of computer vision, and convolutional networks have become a prominent approach to image classification. Many convolutional network models have been developed, some centered on improving accuracy and others on optimizing the network architecture. From LeNet-5 to the 5-conv+3-fc AlexNet and the 16-conv+3-fc VGG networks, then to the 21-conv+1-fc GoogleNet, both the accuracy and the depth of CNNs increased rapidly. However, as network depth increases, performance is severely affected by vanishing gradients and over-fitting. Moreover, the difficulty of image classification stems from the fact that network performance relies not just on model depth but, more importantly, on feature representation. To learn stronger feature representations, the thousand-layer residual network made the residual block an essential part of the network; stacking these residual blocks greatly improves training efficiency and largely resolves the degradation problem. To further exploit the optimization ability of residual networks, Zhang et al. constructed RoR on top of residual networks by adding shortcuts level by level, so that the network optimizes the residual mapping of a residual mapping. These additional shortcut levels further ease the optimization difficulty and strengthen the learning ability of convolutional networks.
Automatic face analysis is a research topic that currently receives much attention from the computer vision and pattern recognition communities. Age estimation has historically been one of the most challenging problems within facial analysis. In the past, age estimation relied on manually extracted facial features, but CNN methods are now becoming popular, and there have been many achievements in training CNNs directly on age datasets. Different age datasets call for different age estimation methods. Face age datasets are divided into biological age datasets and apparent age datasets. A large amount of research has been devoted to age estimation from a face image in its best-known form, biological age estimation. Adience, MORPH Album 2 and FG-NET are prevalent benchmarks for biological age estimation, where the images are labeled with an age group or the actual age, and the predicted output of the network is the biological age of a person. In contrast, apparent age estimation research is still at an early stage. Only one publicly available dataset is widely used for apparent age estimation: ChaLearn LAP. What distinguishes it from other datasets is that the age of each image is labeled with the average of the annotators’ subjective opinions, so apparent age can be seen as the visual age perceived by others. Given a reference dataset, one of the pivotal issues in age estimation is how to learn distinctive characteristics.
II-A Biological Age Estimation
In the past twenty years, biological age estimation from face images has benefited tremendously from the development of facial analysis. Based on manually designed features, regression and classification methods were used to predict the age of face images. The AGing pattErn Subspace (AGES) was constructed to model the aging pattern for automatic age estimation. Subsequently, Chang et al. proposed an ordinal hyperplane ranking algorithm called OHRank for estimating human age from facial images. Wang et al. proposed a new framework for age feature extraction based on a manifold learning algorithm and deep learned aging pattern (DLA), which greatly improved age estimation performance. Chen et al. proposed a cumulative attribute concept based on SVR for learning a regression model, which gained a notable accuracy advantage for age estimation. Guo et al. proposed a kernel canonical correlation analysis (KCCA) method, which could derive an extremely low dimensionality in estimating age, but the amount of kernel calculation was tremendous. All of these methods share the same limited scope of applicability: they were proven effective only on constrained benchmarks and could not achieve respectable results on benchmarks in the wild.
Recent research showed that a CNN model can learn a compact and discriminative feature representation when the training data is sufficiently large, so an increasing number of researchers have started to use CNNs for age estimation. Levi et al. applied a DCNN for the first time to age classification on the unconstrained Adience benchmark. Yi et al. proposed a multi-scale convolutional neural network based on the traditional face analysis method. Hou et al. proposed a VGG-16 model with Smooth Adaptive Activation Functions (SAAF) to predict age groups on the Adience benchmark. They then used the exact squared Earth Mover's Distance (EMD2) in the loss function for CNN training and obtained better age estimation results. Rothe et al. combined a VGG-16 network pre-trained on the ImageNet dataset with principal component analysis (PCA) to obtain a lower MAE on MORPH Album 2. They then transformed age regression into an age classification problem through the Deep EXpectation (DEX) method and achieved better results. Recently, Hou et al. used the R-SAAFc2+IMDB-WIKI method and obtained the best results on the FG-NET dataset. Zhang et al. proposed an age-and-gender estimation method combining a multi-level residual network (RoR) with two modest mechanisms, which achieved state-of-the-art results on the Adience benchmark. Gao et al. proposed the deep label distribution learning (DLDL) method, which effectively utilizes label ambiguity in both feature learning and classifier learning and achieved the best MAE on the MORPH dataset.
II-B Apparent Age Estimation
With the breakthroughs that convolutional neural networks have brought to face age estimation, researchers are no longer confined to biological age and have shifted their focus to apparent age. Apparent age estimation was originally inspired by the 2015 ChaLearn Looking at People (LAP) competition, where the apparent age dataset was released. A method called Logistic Boosting Regression (Logit Boost) was proposed, which optimizes the network progressively. Xu et al. proposed a deep label distribution method with distribution-based loss functions, and used the Coc-DPM algorithm and a face point detector for face search. Zhu et al. used the Microsoft Project Oxford API and the Face++ API to preprocess the LAP dataset, and then used a GoogleNet pre-trained on several other datasets. Kuang et al. studied age-related discriminative performance over multiple age datasets (MORPH, FG-NET, Adience and FACES), combined with random forest, quadratic regression and local adjustment methods. Lin et al. fused real-value based regression models and Gaussian label distribution based classification models, which were pre-trained on the CASIA WebFace, CACD, WebFaceAge and MORPH datasets and fine-tuned on the LAP dataset. The Deep EXpectation (DEX) formulation was proposed for apparent age estimation and won the LAP 2015 challenge. Recently, Agustsson et al. proposed a nonlinear regression network called Anchored Regression Network (ARN), which achieved state-of-the-art results on the 2015 LAP validation set.
The 2016 ChaLearn LAP Apparent Age Estimation (AAE) competition expanded the scale of the 2015 LAP dataset. Gurpinar et al. proposed a two-level system for estimating the apparent age of facial images, where the samples were classified into eight age groups. Duan et al. proposed a CNN2ELM method, where apparent age was estimated by Race-Net + Age-Net + Gender-Net + ELM Classifier + ELM Regression (RAGN). Malli et al. divided the LAP dataset into age groups and age-shifted groups, and used these groups to train VGG-16 models. Uricar et al. extracted deep features and formulated an SO-SVM multi-class classifier on top of them. Huo et al. proposed a novel method called Deep Age Distribution Learning (DADL), which uses a deep CNN model to predict the age distribution. Dehghan et al. introduced a large dataset of 4 million face recognition images to pre-train the model, and then predicted apparent age on the age dataset. Antipov et al. employed different age encoding strategies for training “general” and “children” networks, comprising 11 “general” models and 3 “children” models, which achieved state-of-the-art results on the 2016 LAP dataset.
In this section, we describe the proposed AL-ResNets and AL-RoR architectures with an Attention LSTM network for age estimation. Our methodology is composed of three steps: (1) constructing the AL-ResNets or AL-RoR architecture to improve the optimization ability of the model; (2) pre-training the ResNets or RoR CNN model on ImageNet, fine-tuning it on the IMDB-WIKI-101 dataset to alleviate over-fitting, and training it for global features on the target age datasets; (3) extracting age-sensitive features of the local region with the LSTM to further improve age estimation performance. In the following, we describe the three components in detail.
III-A Network Architecture
It is widely acknowledged that the performance of CNN-based age estimation relies heavily on the optimization ability of the CNN architecture, and increasingly deep CNNs have been constructed. In particular, ResNets won first place in the ILSVRC 2015 classification task and have achieved tremendous success in various recognition and classification tasks. RoR was constructed by adding identity shortcuts level by level to the original residual networks. Notably, RoR has also recently performed well in age estimation. Therefore, in this paper, we construct new network structures named AL-ResNets and AL-RoR, which are based on the ResNets and RoR models, with both the network depth and the residual block information reflected in the architecture description.
To train the ResNets models for image classification tasks, the input RGB images go through an ordered preprocessing procedure. Images are first resized to a fixed size of 256×256 and then randomly cropped to 224×224 before entering the network. ResNets are built on 4 groups of residual blocks, whose basic components (conv, BN and ReLU) operate on shortcut levels. ResNets use shortcuts to propagate information only between neighboring layers in residual blocks. The LSTM unit is suitable not only for shallow ResNets but also for various deeper residual networks. As shown in Fig. 2, AL-ResNets-34 and AL-ResNets-152 have different residual block structures, where each residual block can be expressed in a general form:

$$y_l = h(x_l) + F(x_l, W_l), \qquad x_{l+1} = f(y_l) \tag{1}$$

where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th block, respectively. $F$ is a residual mapping function with weights $W_l$, $h(x_l) = x_l$ is an identity mapping function, and $f$ is a ReLU function.
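As a minimal illustration of this residual block form, the following Python sketch computes the block output as the ReLU of the identity shortcut plus a residual mapping. The function names and the toy residual mapping are assumptions made for illustration only; in the actual networks the residual mapping is built from conv, BN and ReLU layers.

```python
def relu(values):
    """Element-wise ReLU, playing the role of f in the block form."""
    return [max(0.0, v) for v in values]


def residual_block(x, residual_fn):
    """Return f(h(x) + F(x)), where h is the identity shortcut."""
    fx = residual_fn(x)                       # residual mapping F(x_l, W_l)
    y = [a + b for a, b in zip(x, fx)]        # y_l = h(x_l) + F(x_l, W_l)
    return relu(y)                            # x_{l+1} = f(y_l)


# Toy residual mapping (hypothetical): scale every entry by -0.5.
toy_residual = lambda v: [-0.5 * a for a in v]

block_output = residual_block([2.0, -4.0], toy_residual)
print(block_output)
```

Because the shortcut carries the input through unchanged, the residual mapping only has to model the difference between the block input and the desired output.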
ResNets transform the learning of the block output $y_l$ into the learning of the residual mapping $F(x_l, W_l)$ through the residual block structure. RoR learns the convolutional feature representation differently: it transfers the learning problem to learning the residual mapping of a residual mapping, which is simpler and easier to learn than the original residual mapping. RoR creates several direct paths for propagating information between different original residual blocks by adding extra shortcuts, so layers in upper blocks can propagate information to layers in lower blocks. Figure 3 shows the basic structure of a 34-layer AL-RoR based on RoR, which has root-level, middle-level, and final-level shortcuts. The four residual block groups contain 3, 4, 6 and 3 residual blocks, respectively. The junctions located at the end of each residual block group can be expressed by the following formulations.
where $x_l$ and $x_{l+1}$ are the input and output of the $l$-th block, $F$ is a residual mapping function, and $g$ and $h$ are both identity mapping functions: $g$ expresses the identity mapping of the first-level and second-level shortcuts, and $h$ denotes the identity mapping of the final-level shortcuts.
When images are passed through the trained ResNets or RoR network, effective global facial features are obtained by extracting high-dimensional features of the entire image. After the fully connected layers, an age category and a preliminary age prediction are obtained. A reasonable assumption is that facial age is reflected not only in global characteristics but also in many age-sensitive facial parts, so local features can be introduced to further enhance network performance for fine-grained age estimation. In this work, we build an AL-ResNets or AL-RoR network, based on an LSTM unit and a CNN (ResNets or RoR), to learn local region information with distinctive features. The output feature map of the last residual block of ResNets or RoR is used both as the input of the original fully connected layer and as the input of the LSTM unit. The LSTM unit extracts the most significant regional features for softmax classification in forward propagation and is optimized in backward propagation according to the loss function. To combine the global image-level features and the local attention features in the AL-ResNets or AL-RoR network, we do not concatenate the two sets of features; instead, age classification is first performed separately on each set of features, and a weighted average of the two predictions gives the final prediction.
III-B Pre-Training for Global Features
CNNs are used to learn both the global and the local features of images on the Adience, MORPH Album 2, FG-NET and LAP datasets, but the network structure is not the same in the two training stages. For the global features, we use ResNets or RoR as the feature extraction architecture.
Because the target age datasets are small, over-fitting occurs readily when training directly on them. Drawing on the idea of transfer learning, we first train the ResNets or RoR network on the ImageNet dataset to learn a basic image feature representation, which efficiently reduces over-fitting. The accuracy of age estimation depends on both the scale and the age distribution of the dataset. To help the ResNets or RoR model further learn the feature representation of facial images and alleviate over-fitting, the large-scale face image dataset IMDB-WIKI-101 is used to fine-tune the model after it has been pre-trained on ImageNet. IMDB-WIKI is the largest publicly available dataset for age estimation of people in the wild, containing more than half a million images with accurate age labels ranging from 0 to 100. However, there are many poor-quality images in the IMDB-WIKI dataset. Zhang et al. cleaned the dataset and divided it into 101 categories according to age; the resulting dataset, renamed IMDB-WIKI-101, is used to fine-tune the network model so that it adapts to the age distribution of face images.
Pre-training thus involves two large datasets, ImageNet and IMDB-WIKI-101, whose learned feature representations together carry the model from general image classification to face age classification. We then fine-tune the pre-trained ResNets or RoR structure on the target age dataset to obtain a new network for global features.
III-C Training for Local Features
Since the main difficulty of age estimation is classifying the similar features of adjacent ages, we are prompted to use age-sensitive features in facial images when learning a particular age. Age estimation in the wild can thus be treated as a fine-grained classification problem, which is most commonly approached as part search. To automatically locate age-sensitive regions, we adopt the attention idea proposed by Mnih et al. to construct the AL-ResNets or AL-RoR network, which extracts local features from aligned images. Fine-grained age group classification thereby enables part-based approaches rather than being confined to global, image-level features, which strengthens the contextual age information and further reduces age prediction error.
A straightforward approach to part-based fine-grained age estimation in the wild is to extract features at the part location from the feature map of the last convolution layer and then build a softmax classifier for age estimation. The AL-ResNets or AL-RoR model, which automatically finds discriminative features in a support region, builds on the ResNets or RoR network that provides the global features of the target age datasets. We use the facial detail features to update the internal state of the LSTM unit and produce the position information for the next timestep. Training for local features with the AL-ResNets or AL-RoR network involves several modules, as shown in Fig. 4: the input feature module, LSTM unit module, location network module, feature cropping module, and output module.
Input feature module: The patch selected by the LSTM unit is taken from the feature map of the last residual block of the ResNets or RoR model after global pre-training, and is used to train a new AL-ResNets or AL-RoR network. The output feature map of the last residual block in the ResNets or RoR model is taken as the input features. Besides reducing the possibility of part information being confused by overly rich semantic information, there are two other reasons for using the feature map of the last convolution layer as the active region. First, the cropped region is a local area of the feature map, which is much smaller than the input image, so it requires only a fraction of the computational cost of the entire network. Second, cropping features on the feature map shares the same network features, so no separate network needs to be trained for feature cropping. For the 34-layer ResNets or RoR, the output feature map of Group 4 has size 512×7×7. It then goes through a pooling operation that generates 512 output channels. Matching the number of output channels, the LSTM unit also has a 512-dimensional input. The input feature dimension of the LSTM is tied to the feature maps of the ResNets or RoR models, so as the model depth increases, the feature dimension goes up correspondingly. The input feature module thus generates 512-dimensional features as the input of the LSTM unit. Because these input features have already been trained through the CNN, the input feature module does not apply gradient descent during AL-CNN (AL-ResNets or AL-RoR) training, which means that the module parameters are not updated.
LSTM unit module: The LSTM unit controls the cell state through a structure of “gates”: the input gate, the forget gate, and the output gate. A typical gate consists of two fundamental parts: a sigmoid layer and a pointwise operation. Its main steps are as follows. First, the forget gate selects which information to keep from the cell state output at the previous moment, while the input gate is multiplied by a new candidate vector generated by the tanh layer. Then the two sources of information are combined for the state update, which discards unnecessary information and adds new information. Finally, the hidden layer output of the LSTM is obtained by passing the cell state through the tanh layer, which keeps it between -1 and 1, and multiplying the result by the output of the output gate. In its standard form, the LSTM unit performs the following computation:

$$\begin{aligned} f_t &= \sigma(W_f [h_{t-1}, x_t] + b_f) \\ i_t &= \sigma(W_i [h_{t-1}, x_t] + b_i) \\ \tilde{c}_t &= \tanh(W_c [h_{t-1}, x_t] + b_c) \\ c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\ o_t &= \sigma(W_o [h_{t-1}, x_t] + b_o) \\ h_t &= o_t \odot \tanh(c_t) \end{aligned} \tag{3}$$

where $c_{t-1}$ and $c_t$ are the output cell states of the LSTM at the previous and current moment, respectively, $h_t$ is the hidden layer output state of the LSTM, and all of them have the same feature dimension of 128. $\tilde{c}_t$ is the candidate vector for updating the cell state, $x_t$ is the input feature, $f_t$, $i_t$ and $o_t$ are the forget, input and output gates, $\sigma$ is the sigmoid function, and $\odot$ denotes element-wise multiplication.
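The gate mechanism described above can be sketched as a single LSTM step in plain Python. This is a minimal sketch that assumes the standard LSTM formulation (forget, input and output gates acting on the concatenation of the previous hidden state and the current input); the helper names, the parameter layout and the tiny toy dimensions are assumptions for illustration and do not reproduce the 512-dimensional input or 128-dimensional states of the actual model.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(weights, bias, vec):
    """Return W v + b for a list-of-rows matrix W."""
    return [sum(w * a for w, a in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def lstm_step(x, h_prev, c_prev, params):
    """One LSTM step; params maps a gate name to (W, b) acting on [h_prev, x]."""
    z = h_prev + x                                         # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(*params["f"], z)]      # forget gate
    i = [sigmoid(a) for a in affine(*params["i"], z)]      # input gate
    g = [math.tanh(a) for a in affine(*params["g"], z)]    # candidate vector
    o = [sigmoid(a) for a in affine(*params["o"], z)]      # output gate
    # Cell update: drop part of the old state, add part of the candidate.
    c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c_prev, i, g)]
    # Hidden output: squash the cell state to (-1, 1), then gate it.
    h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h, c


# Toy setup (hypothetical sizes): 2-d input, 1-d state, all-zero parameters.
hidden, n_in = 1, 2
zero_gate = ([[0.0] * (hidden + n_in) for _ in range(hidden)], [0.0] * hidden)
params = {name: zero_gate for name in ("f", "i", "g", "o")}
h_t, c_t = lstm_step([0.3, -0.7], [0.0], [2.0], params)
```

With all-zero parameters every gate evaluates to 0.5 and the candidate vector to 0, so the cell state is simply halved, which makes the role of the forget gate easy to check by hand.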
Location network module: The location network module consists of a convolution layer followed by a sigmoid activation function, and it takes the joint output of the LSTM as the input of the convolution layer. The output of the convolution layer is composed of four locating values, as given in (4), which are used to generate the coordinates of the attended region at the current moment, including the width and height of the bounding box. The LSTM unit module and the location network module are updated in backward propagation with the same loss function, the cross-entropy criterion.
Here the input is the joint output of the two states of the LSTM, with a feature dimension of 256, and the mapping is governed by the overall parameters of the location network. Specifically, the mapping can be represented as one convolution layer with four outputs, which are the parameters of the attended region.
Feature cropping module: This module adopts a pooling strategy to reduce the computational complexity of the network. The module first crops the features on the output map of the ResNets or RoR convolution layer according to the location coordinates. Then, a pooling operation is performed on the cropped features, producing an output of dimension 512.
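The interplay of the location network output and the feature cropping module can be sketched as follows. The sketch assumes that the four sigmoid outputs are a normalized box center, width and height in [0, 1], and that the pooling is an average over the cropped cells; both are assumptions made for illustration, since the exact parameterization and pooling type are not specified here. The function names are hypothetical, and a 2-channel toy map stands in for the 512×7×7 feature map.

```python
import math


def location_to_box(loc, size=7):
    """Map (cx, cy, w, h) in [0, 1] to cell bounds (x0, y0, x1, y1) on a size x size map."""
    cx, cy, w, h = loc
    x0 = min(size - 1, max(0, int((cx - w / 2.0) * size)))
    y0 = min(size - 1, max(0, int((cy - h / 2.0) * size)))
    # Guarantee a box of at least one cell that stays inside the map.
    x1 = min(size, max(x0 + 1, math.ceil((cx + w / 2.0) * size)))
    y1 = min(size, max(y0 + 1, math.ceil((cy + h / 2.0) * size)))
    return x0, y0, x1, y1


def crop_and_pool(feature_map, box):
    """Average-pool each channel of feature_map[c][y][x] over the box."""
    x0, y0, x1, y1 = box
    area = (x1 - x0) * (y1 - y0)
    return [sum(sum(row[x0:x1]) for row in channel[y0:y1]) / area
            for channel in feature_map]


# Toy feature map: channel 0 is constant, channel 1 stores the column index.
toy_map = [
    [[1.0] * 7 for _ in range(7)],
    [[float(x) for x in range(7)] for _ in range(7)],
]
box = location_to_box((0.5, 0.5, 0.4, 0.4))
local_feature = crop_and_pool(toy_map, box)
```

The pooled vector has one entry per channel, so its dimension equals the channel count of the feature map regardless of the size of the attended region.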
Output module: Finally, the output module determines how the predictions from the global image-level features and the local features are combined into the network output. We propose a weighted combination, in which the weights of the global and local predictions are set to 1 and 0.5, respectively. The global features come from the original ResNets or RoR training, and the local features are the “age-sensitive features” extracted from the feature map of the last residual block of the AL-ResNets or AL-RoR network. The final age probability distribution is the weighted combination of the predictions obtained from the extracted global and local representations.
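A minimal sketch of this weighted combination is given below, using the stated weights of 1 for the global prediction and 0.5 for the local prediction. Dividing by the sum of the weights, so that the result remains a probability distribution, is an assumption of this sketch, and the function name is hypothetical.

```python
def combine_predictions(p_global, p_local, w_global=1.0, w_local=0.5):
    """Weighted average of two class-probability vectors of equal length."""
    total = w_global + w_local
    return [(w_global * g + w_local * l) / total
            for g, l in zip(p_global, p_local)]


# Toy two-class example: the local branch disagrees with the global branch.
fused = combine_predictions([0.6, 0.4], [0.0, 1.0])
print(fused)
```

In this toy case the global branch alone would pick the first class, while the fused distribution favors the second, which shows how a confident local prediction can change the final decision.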
We empirically demonstrate the effectiveness of AL-ResNets and AL-RoR on a series of benchmark datasets: Adience, MORPH Album 2, FG-NET and LAP.
The ResNets or RoR network pre-trained on the ImageNet dataset is used as the base model. When fine-tuning the ResNets or RoR model, the IMDB-WIKI-101 dataset is randomly divided into training and testing parts of 90% and 10%, and the number of outputs of the softmax classifier is changed from 1000 to 101. The learning rate starts from 0.01 and is divided by 10 at epochs 60 and 90.
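The step schedule used for fine-tuning on IMDB-WIKI-101 can be written as a small helper. This is a sketch with a hypothetical function name; whether the drop takes effect at the start or the end of the milestone epoch is an assumption here.

```python
def step_lr(epoch, base_lr=0.01, milestones=(60, 90), factor=10.0):
    """Learning rate at a given epoch: divide by `factor` at each milestone reached."""
    lr = base_lr
    for milestone in milestones:
        if epoch >= milestone:
            lr /= factor
    return lr


schedule = [step_lr(e) for e in (0, 59, 60, 89, 90)]
```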
In the target age experiments, we use the oversampling method, taking ten crops (cropped and mirrored images) in each forward testing pass. We use the Deep EXpectation algorithm to handle age regression: the network is trained for classification with M output neurons, each corresponding to an integer age, where M is set to 62, 70, and 101 for the MORPH, FG-NET and LAP datasets, respectively. The weight decay is set to 1e-4 and the momentum is 0.9. For the Adience dataset, the total number of epochs is 60 and the learning rate starts from 0.0001. For the MORPH Album 2 dataset, the number of epochs is 120 with a learning rate of 0.001. The learning rates for these two datasets are divided by 10 after epoch 60. For training the global and local features on the FG-NET/LAP datasets, the number of epochs is set to 90 and 120, respectively, with a learning rate of 0.001; in the former case the learning rate is divided by 10 after epoch 30, and in the latter after epoch 60. Our implementation is based on Torch 7 and runs on one NVIDIA GeForce GTX Titan X.
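The Deep EXpectation idea of turning a classification output into a real-valued age can be sketched as the expected value of the integer ages under the softmax distribution. In the sketch below, the mapping of the M = 62 MORPH output neurons to the ages 16 to 77 is an assumption inferred from the age range of that dataset, and the function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(probs, ages):
    """Expected value of the age: sum over classes of probability times age."""
    return sum(p * a for p, a in zip(probs, ages))


ages = list(range(16, 78))                 # 62 integer ages (assumed class layout)
probs = [0.0] * len(ages)
probs[ages.index(30)] = 0.5                # half of the mass on age 30
probs[ages.index(40)] = 0.5                # half of the mass on age 40
estimate = expected_age(probs, ages)       # falls between the two peaks
uniform_estimate = expected_age(softmax([0.0] * len(ages)), ages)
```

The expectation lets the network output ages that lie between the integer classes, which a plain arg-max over the output neurons cannot do.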
MORPH: MORPH Album 2 is one of the largest publicly available age datasets. It contains 55,134 face images, with ages ranging from 16 to 77. The dataset covers multiple races, such as white, black and others. To avoid the influence of ethnic differences on age, we randomly selected 5475 face images of white people and randomly divided them into 80% for training and 20% for testing.
FG-NET: FG-NET is a small dataset that includes only 1002 images of 82 individuals in the wild, ranging from 0 to 69 years old, with about 12 images per person. To ensure that every subject has pictures at different ages, the images were collected by scanning paper documents from personal collections in addition to digital images from recent years. We follow the standard Leave-One-Person-Out (LOPO) protocol for FG-NET and report the average performance over the 82 splits.
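The LOPO protocol can be sketched as a generator that holds out all images of one subject per split, so that 82 subjects yield 82 splits. The function name and the toy subject identifiers are hypothetical.

```python
def leave_one_person_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy example with three subjects and six images.
toy_subjects = ["a", "a", "b", "c", "c", "c"]
splits = list(leave_one_person_out(toy_subjects))
```

Each image appears in exactly one test split, and no subject is ever shared between the training and test parts of a split.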
Adience: The entire Adience collection includes 26,580 256×256 color facial images of 2,284 subjects, with eight age group classes (0-2, 4-6, 8-13, 15-20, 25-32, 38-43, 48-53, 60-100). The Adience images were automatically uploaded to online albums from smartphones. They were not manually filtered before uploading, and they are completely unconstrained, as they were taken under widely varying conditions. Testing for age classification is performed using the standard five-fold, subject-exclusive cross-validation protocol, and the accuracies of the five folds are averaged to give the final age group classification result.
LAP: The ChaLearn LAP dataset is mainly used to study apparent age estimation of face images. Each image label consists of the average age and the standard deviation. Most images are taken under unconstrained conditions (varying background, character rotation, partial occlusion, etc.) and need face detection, alignment and cropping as preprocessing. We rotate the input image over a range of angles in fixed steps, as well as by several additional fixed angles, and keep the face box with the strongest detection score returned by the face detector; the face box is then enlarged by 40% in both width and height, and the face image is cropped. Images that remain unaligned are kept rather than deleted. All images are finally squeezed to a size of 256×256. The 2015 LAP dataset is relatively small, with a total of 4691 images: 2476 for training, 1136 for validation and 1079 for testing. The 2016 dataset is an expanded version of its predecessor, which adds nearly 3,000 images to the original. Apparent age estimation is divided into development and test phases.
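The face box enlargement step can be sketched as follows. The sketch assumes that the box grows symmetrically around its center and is clipped to the image bounds; how boxes that extend beyond the image are actually handled is not stated, so the clipping is an assumption, and the function name is hypothetical.

```python
def enlarge_box(x0, y0, x1, y1, img_w, img_h, ratio=0.4):
    """Grow a box by `ratio` of its width and height, centered, clipped to the image."""
    dw = (x1 - x0) * ratio / 2.0      # half of the extra width goes to each side
    dh = (y1 - y0) * ratio / 2.0      # half of the extra height goes to each side
    return (max(0.0, x0 - dw), max(0.0, y0 - dh),
            min(float(img_w), x1 + dw), min(float(img_h), y1 + dh))


# A 100x100 detection inside a 400x400 image becomes a 140x140 crop region.
enlarged = enlarge_box(100, 100, 200, 200, 400, 400)
```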
IV-C Evaluation Protocol
IV-C1 Accuracy and 1-off Accuracy
The evaluation measures used in the Adience experiments are exact accuracy and 1-off accuracy. Exact accuracy is the fraction of images whose estimated age group matches the true age group, while 1-off accuracy also counts a prediction as correct when it falls in an age group adjacent to the true one.
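As a small illustration of the two measures, the sketch below assumes that the eight Adience age groups are encoded as consecutive integers 0 to 7, so that adjacent groups differ by exactly one. The function name and the toy labels are hypothetical.

```python
def exact_and_one_off_accuracy(true_groups, predicted_groups):
    """Return (exact accuracy, 1-off accuracy) for integer-coded age groups."""
    if len(true_groups) != len(predicted_groups) or not true_groups:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_groups)
    pairs = list(zip(true_groups, predicted_groups))
    exact = sum(t == p for t, p in pairs)
    # A prediction in the true group or a neighbouring group counts as 1-off correct.
    one_off = sum(abs(t - p) <= 1 for t, p in pairs)
    return exact / n, one_off / n


y_true = [0, 3, 4, 7]
y_pred = [0, 4, 6, 7]
exact, one_off = exact_and_one_off_accuracy(y_true, y_pred)
print(exact, one_off)  # 0.5 0.75
```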
IV-C2 Mean Absolute Error
The results of the MORPH Album 2 and FG-NET experiments are evaluated with the standard MAE measure, which computes the average absolute error between the predicted age and the real one:

$$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\left|\hat{y}_i - y_i\right| \qquad (6)$$

where $y_i$ and $\hat{y}_i$ represent the actual age and the estimated age of the $i$-th test image, respectively, and $N$ represents the number of test images.
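The MAE can be computed directly from paired lists of real and predicted ages. The following is a minimal sketch with hypothetical toy values, not the evaluation script used in the experiments.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between predicted and real ages, in years."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Errors of 2, 2 and 3 years give an MAE of 7/3 years.
mae = mean_absolute_error([25, 40, 63], [27, 38, 60])
print(round(mae, 3))
```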
For age estimation on the LAP dataset, we employ two evaluation metrics: the mean absolute error (MAE) and the ε-error, which accounts for the uncertainty expressed by the standard deviation of the age annotations. The ε-error depends on the mean $\mu$ and standard deviation $\sigma$ of the annotations of each image, which are assumed to follow a normal distribution, and on the value predicted by the network. It is defined as

$$\epsilon = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \exp\left(-\frac{(x_i - \mu_i)^2}{2\sigma_i^2}\right)\right) \qquad (7)$$

where $x_i$, $\mu_i$ and $\sigma_i$ are the predicted age, the apparent age value and the standard deviation for the $i$-th image, respectively, and $N$ is the number of test images.
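A minimal sketch of the ε-error follows. It assumes the usual ChaLearn LAP form, in which each image contributes 1 - exp(-(x - μ)² / (2σ²)) and the contributions are averaged over all test images. The function name and the toy values are hypothetical.

```python
import math


def epsilon_error(predicted_ages, apparent_means, apparent_stds):
    """Mean of 1 - exp(-(x - mu)^2 / (2 * sigma^2)) over all images."""
    triples = list(zip(predicted_ages, apparent_means, apparent_stds))
    if not triples:
        raise ValueError("inputs must be non-empty")
    total = 0.0
    for x, mu, sigma in triples:
        # The error is 0 for a perfect prediction and approaches 1 as the
        # prediction moves many standard deviations away from the mean.
        total += 1.0 - math.exp(-((x - mu) ** 2) / (2.0 * sigma ** 2))
    return total / len(triples)


perfect = epsilon_error([30.0], [30.0], [4.0])
one_sigma_off = epsilon_error([34.0], [30.0], [4.0])
print(perfect, round(one_sigma_off, 4))
```

Because the squared error is divided by the annotation variance, a given age error is penalized less on images where the human annotators themselves disagreed strongly.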
IV-D Age Group Classification Experiments
By improving information propagation, ResNets alleviate the vanishing gradient problem. RoR, which builds on ResNets, benefits further from an optimization standpoint: its residual mapping and extra shortcuts expedite information propagation between layers. We therefore use ResNets or RoR as the base model. To find the best 34-layer network on the Adience dataset, we carry out a series of comparative experiments. Table I presents the age group classification results, in terms of exact accuracy and 1-off accuracy, for the different training methods (A-ResNets-34, I-ResNets-34, Ft-101-RoR-34, Ft-101-ResNets-34 and the proposed Ft-101-AL-RoR-34 and Ft-101-AL-ResNets-34) tested on fold4 of Adience. We evaluate the effect of each step of the proposed method on age group classification in the following six ways:
(1) Train the A-ResNets-34 network on Adience only.
(2) After pre-training on the ImageNet dataset, fine-tune the I-ResNets-34 network on Adience.
(3) After pre-training on the ImageNet and IMDB-WIKI-101 datasets, fine-tune the Ft-101-RoR-34 network on Adience.
(4) After pre-training on the ImageNet and IMDB-WIKI-101 datasets, fine-tune the Ft-101-ResNets-34 network on Adience.
(5) Starting from (3), train the Ft-101-AL-RoR-34 network on Adience.
(6) Starting from (4), train the Ft-101-AL-ResNets-34 network on Adience.
The results of the different methods tested on fold4 of the Adience dataset are shown in Table I. The learning rate starts at 0.1 for A-ResNets-34 and at 0.01 for I-ResNets-34. For Ft-101-RoR-34, Ft-101-ResNets-34, Ft-101-AL-RoR-34 and Ft-101-AL-ResNets-34, the learning rate starts at 0.0001. All models are trained for 160 epochs. I-ResNets-34 obtains higher accuracy than A-ResNets-34 because ImageNet pre-training provides basic image feature representations. Ft-101-RoR-34 and Ft-101-ResNets-34 clearly outperform I-ResNets-34, which shows that pre-training on ImageNet and then fine-tuning on the IMDB-WIKI-101 dataset achieves transfer learning, alleviates over-fitting, and works better than pre-training on ImageNet alone.
Ft-101-AL-ResNets-34 outperforms Ft-101-ResNets-34, and Ft-101-AL-RoR-34 likewise outperforms Ft-101-RoR-34. Because Ft-101-AL-ResNets-34 is built on Ft-101-ResNets-34 and additionally trains on local regions with distinctive features, the age group classification results benefit from both the global and the local features of the input face images. This indicates that the proposed method learns effective feature vectors for face images: the age-sensitive local features extracted by the attention LSTM further improve classification accuracy.
The results in Table I show that our approach consistently improves age estimation performance. Therefore, the further experiments on all five folds of Adience use the AL-ResNets-34 and AL-RoR-34 networks for age group classification, each trained for 120 epochs. ResNets or RoR is first pre-trained on ImageNet and fine-tuned on the IMDB-WIKI-101 and Adience datasets; the AL-ResNets-34 and AL-RoR-34 networks are then constructed and trained on Adience. Furthermore, since the number of images is critical for classification, an over-sampling method is applied to the dataset to improve network training, which leads to better age group classification results. The results in Table II demonstrate the effectiveness of the proposed method. Table I and Table II also show that RoR performs slightly worse than ResNets. This is because the stochastic depth algorithm does not improve accuracy when a large dataset is used to fine-tune the model, so RoR without stochastic depth and ResNets perform similarly in these experiments.
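The text does not spell out the over-sampling procedure. As a hedged illustration only, the sketch below uses plain random over-sampling, in which images of under-represented age groups are repeated until every group is as large as the largest one. The function name `oversample_indices`, the seed and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict


def oversample_indices(labels, seed=0):
    """Return sample indices in which every class appears equally often.

    Assumption: minority classes are balanced by drawing random repeats of
    their own samples (random over-sampling with replacement).
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = max(len(idxs) for idxs in by_class.values())
    balanced = []
    for label in sorted(by_class):
        idxs = by_class[label]
        balanced.extend(idxs)  # keep every original sample once
        balanced.extend(rng.choice(idxs) for _ in range(target - len(idxs)))
    rng.shuffle(balanced)
    return balanced


toy_labels = [0, 0, 0, 0, 1, 2, 2]
balanced = oversample_indices(toy_labels)
print(Counter(toy_labels[i] for i in balanced))
```

In a cross-validation setting this kind of resampling would be applied to the training folds only, so that repeated images never leak into the test fold.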
The proposed model already achieves good accuracy with the shallow AL-ResNets, and it can be extended to the deeper AL-ResNets-152 network to obtain more accurate age classification under unconstrained conditions: AL-ResNets-152 raises performance to 67.83% exact accuracy and 97.53% 1-off accuracy over the five folds of the Adience dataset. To evaluate our network thoroughly, we compare it with seven other state-of-the-art methods: 4c2f-CNN, R-SAAFc2, Chained Net, DEX, EMD, RSAAFc2(IMDB-WIKI) and RoR+IMDB-WIKI with two mechanisms. The age group classification results of the various methods are shown in Table III. By relying on two mechanisms, gender and weight loss layers, the RoR+IMDB-WIKI method achieved a significant improvement in classification accuracy. However, without any extra mechanism, our AL-ResNets-152 (ten-crop), trained as a single-model architecture, provides new state-of-the-art age classification results. This success should be credited to the AL-ResNets-152 model, which combines ResNets pre-trained on the Adience dataset with an LSTM unit, and to its effective training on age-sensitive local facial features. Extensive comparisons on the Adience dataset verify the effectiveness of the proposed method.
|Method||Exact accuracy (%)||1-off accuracy (%)|
|DEX w/o IMDB-WIKI Pretrain||55.6 ± 6.1||89.7 ± 1.8|
|DEX w/ IMDB-WIKI Pretrain||64.0 ± 4.2||96.60 ± 0.90|
|RoR34+IMDB-WIKI with two mechanisms||66.91 ± 2.51||97.49 ± 0.76|
|RoR152+IMDB-WIKI with two mechanisms||67.34 ± 3.56||97.51 ± 0.67|
IV-E Age Value Estimation Experiments
The preceding experiments in this section show that the attention network improves model performance. Here we analyze how the models pre-trained on the ImageNet and IMDB-WIKI-101 datasets transfer to other age datasets. The deep expectation algorithm and oversampling are used to estimate the exact age value on the MORPH Album 2, FG-NET and LAP datasets, which demonstrates the generalization ability of our method.
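The core idea of deep expectation is to treat age estimation as classification over discrete age bins and to report the expected value of the softmax distribution rather than the most likely bin. The sketch below assumes 101 bins for ages 0 to 100, in line with the IMDB-WIKI-101 naming. The function names and the toy logits are hypothetical and do not reproduce the trained network.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits, ages=None):
    """Expected age: sum over bins of (bin age * softmax probability)."""
    if ages is None:
        ages = range(len(logits))  # assumption: bin i represents age i
    probs = softmax(logits)
    return sum(age * p for age, p in zip(ages, probs))


# Toy logits over ages 0..100, peaked symmetrically around 30 years.
toy_logits = [-((a - 30) ** 2) / 8.0 for a in range(101)]
print(round(expected_age(toy_logits), 3))  # 30.0
```

Taking the expectation yields a continuous age estimate and lets neighbouring bins contribute, which is why a classification network can be evaluated with regression metrics such as MAE.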
The performance comparison on the 15LAP dataset is summarized in Table IV. To evaluate the impact of the attention LSTM, we train a ResNets-34 model on the global features of the LAP dataset as a baseline. The visual attention component of the AL-ResNets-34 model provides a dynamic strategy for highlighting the important and discriminative regions of the image. Compared with the ResNets-34 baseline, AL-ResNets-34 significantly improves both the MAE and the ε-error on the LAP dataset, mainly because it captures effective age-sensitive features. AL-ResNets-152 obtains better results still, with relative MAE and ε-error gains over AL-ResNets-34 of 11.4% and 3.57%, respectively, which shows the power of network depth combined with the attention LSTM.
RoR improves the optimization ability of ResNets by adding a few identity shortcuts. To achieve better age estimation results, it is important to choose a suitable RoR base model. Because the LAP datasets are relatively small, overfitting can be a critical problem on the 15LAP dataset, and an excessively deep network may make overfitting on small age datasets even more severe. We therefore employ the shallow RoR-34 model to alleviate overfitting on LAP and other small age datasets. In the AL-RoR-34 model, the 34-layer RoR is trained on the global features, and the local age-sensitive features are extracted by the attention LSTM. Our single AL-RoR-34 model achieves an ε-error of 0.2683 in the development phase and 0.2548 in the test phase of the dataset. The MAE reaches 3.137 on the validation set. The proposed method thus achieves state-of-the-art results and surpasses the winners (1st place) of ChaLearn LAP 2015. Figure 6 shows some validation images on which our method significantly outperforms DEX; the range of the age prediction error is greatly reduced by the proposed method. Figure 7 shows some of the images with the smallest age prediction errors in the 15LAP validation set, which indicates that our method obtains better age values across different age ranges.
|Logit Boost ||7.2949||0.5483||–|
|age group ||–||0.3162||0.2948|
|Rich Coding ||3.29||0.3273||0.2872|
We also performed an additional experiment on the 16LAP dataset to demonstrate the strength of our model. Table V summarizes the results for the ChaLearn apparent age estimation 2016 challenge. Our AL-RoR-34 model achieves a test ε-error of 0.2859, the second-best result. Our result is slightly worse than that of the best previous method because we round the label values of the 16LAP dataset to satisfy the classification requirements, which affects the range of the predicted age errors. In addition, OrangeLabs introduced an additional private dataset into network training to address the lack of children's images, and their final test result combined the outputs of multiple models, whereas our results are based on a single 34-layer model.
|ITU SiMiT ||0.3668||No|
|palm seu ||0.3214||No|
|Method||MORPH Album 2||FG-NET|
|CNN (Multi-Task) ||3.63||–|
The constructed models also perform well on both MORPH Album 2 and FG-NET, where the attention LSTM network is essential for each individual model. As can be seen in Table VI, training the AL-RoR-34 network on the MORPH Album 2 dataset gives a best MAE of 2.36, an improvement of 0.06 over the DLDL method. On the FG-NET dataset, we achieve a new state-of-the-art MAE of 2.39 years. The shallow AL-RoR-34 network therefore achieves better results on both the MORPH and FG-NET datasets. Compared with using only global features, adding local features further reduces the age prediction error on both datasets, which suggests that combining global and local features yields better performance.
This paper proposes the new AL-ResNets and AL-RoR architectures, based on an attention LSTM network, for age estimation from face images. The fine-grained age estimation method effectively learns the discriminative age-sensitive features extracted by the attention LSTM and combines global and local features on the target age dataset to achieve better results. Pre-training on ImageNet is used to learn basic image feature representations, and further fine-tuning on IMDB-WIKI-101 helps to learn the feature representations of facial images. By introducing visual attention into age estimation, we not only obtain state-of-the-art performance on the Adience, MORPH, FG-NET and LAP datasets, but also provide feasible new ideas for research on face age estimation and face analysis.
The authors gratefully acknowledge the support of NVIDIA Corporation with the kind donation of the GPU used for this research.
-  E. Eidinger, R. Enbar, and T. Hassner, “Age and gender estimation of unfiltered faces,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2170–2179, 2014.
-  T. Berg, and P N. Belhumeur, “Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 955–962.
-  T. Berg, J. Liu, and S W. Lee, “Birdsnap: Large-scale fine-grained visual categorization of birds,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2019–2026.
-  X S. Wei, C W. Xie, and J. Wu, “Mask-cnn: Localizing parts and selecting descriptors for fine-grained image recognition,” arXiv preprint arXiv:1605.06878, 2016.
-  S. Huang, Z. Xu , and D. Tao, “Part-stacked cnn for fine-grained visual categorization,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1173–1182.
-  J. Deng, W. Dong, and R. Socher, “Imagenet: A large-scale hierarchical image database,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
-  K. Zhang, L R. Guo, and C. Gao , “Age group classification in the wild with deep RoR architecture,” International Conference on Image Processing, 2017.
-  W Y. Zou, X. Wang, and M. Sun , “Generic object detection with dense neural patterns and regionlets,” arXiv preprint arXiv:1404.4316, 2014.
-  Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May. 2015.
-  O. Russakovsky, J. Deng, and H. Su, “Imagenet large scale visual recognition challenge,” International Journal of Computer Vision, vol. 115, no. 3, pp. 211–252, May. 2015.
-  Y. LeCun, L. Bottou, and Y. Bengio, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
-  A. Krizhevsky, I. Sutskever, and G E. Hinton, “Imagenet classification with deep convolutional neural networks,” Advances in neural information processing systems, 2012, pp. 1097–1105.
-  K. Simonyan, and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
-  K. He, X. Zhang, and S. Ren, “Deep residual learning for image recognition,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
-  K. Zhang, M. Sun, and T. X. Han, “Residual Networks of Residual Networks: Multilevel Residual Networks,” IEEE Transactions on Circuits and Systems for Video Technology, 2017.
-  G. Levi, and T. Hassner, “Age and gender classification using convolutional neural networks,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 34–42.
-  L. Hou, D. Samaras, and T M. Kurc, “Neural networks with smooth adaptive activation functions for regression,” arXiv preprint arXiv:1608.06557, 2016.
-  L. Hou, C P. Yu, and D. Samaras, “Squared Earth Mover’s Distance-based Loss for Training Deep Neural Networks,” arXiv preprint arXiv:1611.05916, 2016.
-  R. Rothe, R. Timofte, and L. Van Gool, “Deep expectation of real and apparent age from a single image without facial landmarks,” International Journal of Computer Vision, vol. 126, no. 2-4, pp. 144–157, 2018.
-  FG-NET Aging Database, http://www.fgnet.rsunit.com.
-  L. Hou, D. Samaras, and T. Kurc, “ConvNets with smooth adaptive activation functions for regression,” Artificial Intelligence and Statistics, 2017, pp. 430–439.
-  K. Zhang, C. Gao, and L. Guo, “Age Group and Gender Estimation in the Wild With Deep RoR Architecture,” IEEE Access, vol. 5, pp. 22492–22503, 2017.
-  X. Geng, Z H. Zhou, and K. Smith-Miles, “Automatic age estimation based on facial aging patterns,” IEEE Transactions on pattern analysis and machine intelligence, vol. 29, no. 12, pp. 2234–2240, 2007.
-  K Y. Chang, C S. Chen, and Y P. Hung, “Ordinal hyperplanes ranker with cost sensitivities for age estimation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 585–592.
-  X. Wang, R. Guo, and C. Kambhamettu, “Deeply-Learned Feature for Age Estimation,” IEEE Winter Conference on Applications of Computer Vision, 2015, pp. 534–541.
-  K. Chen, S. Gong, and T. Xiang, “Cumulative Attribute Space for Age and Crowd Density Estimation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 2467–2474.
-  V. Mnih, N. Heess, and A. Graves, “Recurrent models of visual attention,” Advances in neural information processing systems, 2014, pp. 2204–2212.
-  G. Guo and G. Mu, “Joint estimation of age, gender and ethnicity,” International Conference on Automatic Face and Gesture Recognition, 2013, pp. 1–6.
-  D. Yi, Z. Lei, and S Z. Li, “Age estimation by multi-scale convolutional network,” Asian Conference on Computer Vision, Springer, Cham, 2014, pp. 144–158.
-  R. Rothe, R. Timofte, and L. Van Gool, “Some like it hot-visual guidance for preference prediction,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1–9.
-  K. Ricanek Jr. and T. Tesafaye, “MORPH: A Longitudinal Image Database of Normal Adult Age-Progression,” International Conference on Automatic Face and Gesture Recognition, 2006, pp. 341–345.
-  B B. Gao, C. Xing and C W. Xie, “Deep label distribution learning with label ambiguity,” IEEE Transactions on Image Processing, vol. 26, no. 6, pp. 2825–2838, 2017.
-  C. Xing, X. Geng, and H. Xue, “Logistic boosting regression for label distribution learning,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4489–4497.
-  X. Yang, B B. Gao, and C. Xing, “Deep label distribution learning for apparent age estimation,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 102–108.
-  M. Mathias, R. Benenson, and M. Pedersoli, “Face detection without bells and whistles,” European Conference on Computer Vision, Springer, Cham, 2014, pp. 720–735.
-  Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013, pp. 3476–3483.
-  Y. Zhu, Y. Li, and G. Mu, “A Study on Apparent Age Estimation,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 267–273.
-  Face++ api. http://www.faceplusplus.com.
-  Microsoft project oxford api. https://www.projectoxford.ai.
-  Z. Kuang, C. Huang, and W. Zhang, “Deeply learned rich coding for cross-dataset facial age estimation,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 96–101.
-  N C. Ebner, M. Riediger, and U. Lindenberger, “FACES - A database of facial expressions in young, middle-aged, and older women and men: Development and validation,” Behavior Research Methods, vol. 42, no. 1, pp. 351–362, 2010.
-  X. Liu, S. Li, and M. Kan, “Agenet: Deeply learned regressor and classifier for robust apparent age estimation,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 16–24.
-  D. Yi, Z. Lei, and S. Liao, “Learning face representation from scratch,” arXiv preprint arXiv:1411.7923, 2014.
-  B C. Chen, C S. Chen, and W H. Hsu, “Cross-age reference coding for age-invariant face recognition and retrieval,” European Conference on Computer Vision, Springer, Cham, 2014, pp. 768–783.
-  B. Ni, Z. Song, and S. Yan, “Web image and video mining towards universal and robust age estimator,” IEEE Transactions on Multimedia, vol. 13, no. 6, pp. 1217–1229, 2011.
-  S. Escalera, J. Fabian, and P. Pardo, “Chalearn looking at people 2015: Apparent age and cultural event recognition datasets and results,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 1–9.
-  R. Rothe, R. Timofte, and L. Van Gool, “Dex: Deep expectation of apparent age from a single image,” Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 10–15.
-  E. Agustsson, R. Timofte, and L. Van Gool, “Anchored Regression Networks applied to Age Estimation and Super Resolution,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1643–1652.
-  F. Gurpinar, H. Kaya, and H. Dibeklioglu, “Kernel ELM and CNN based facial age estimation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 80–86.
-  M. Duan, K. Li, and K. Li, “An Ensemble CNN2ELM for Age Estimation,” IEEE Transactions on Information Forensics and Security, vol. 13, no. 3, pp. 758–772, 2018.
-  R C. Malli, M. Aygün, and H K. Ekenel, “Apparent age estimation using ensemble of deep learning models,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 714–721.
-  M. Uricár, R. Timofte, and R. Rothe, “Structured output svm prediction of apparent age, gender and smile from deep features,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 25–33.
-  Z. Huo, X. Yang, and C. Xing, “Deep age distribution learning for apparent age estimation,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 17–24.
-  A. Dehghan, E G. Ortiz, and G. Shu, “Dager: Deep age, gender and emotion recognition using convolutional neural network,” arXiv preprint arXiv:1702.04280, 2017.
-  G. Antipov, M. Baccouche, and S A. Berrani, “Apparent age estimation from face images combining general and children-specialized deep learning models,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 96–104.
-  S. Escalera, M. Torres, and S B. Martinez, “Chalearn looking at people and faces of the world: Face analysis workshop and challenge,” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1–8.
-  O M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep Face Recognition,” British Machine Vision Conference, vol. 1, no. 3, pp. 6, 2015.
-  Cubbee, and S. Gross. ResNet models trained on ImageNet. https://github.com/facebook/fb.resnet.torch/tree/master/pretrained.
-  G. Huang, Y. Sun, and Z. Liu, “Deep networks with stochastic depth,” European Conference on Computer Vision, Springer, Cham, 2016, pp. 646–661.