Exploring Correlations for Multiple Facial Attributes Recognition through Graph Attention Network

Exploring Correlations in Multiple Facial Attributes through Graph Attention Network


Estimating multiple attributes from a single facial image gives a comprehensive description of the high-level semantics of the face. It is naturally regarded as a multi-task supervised learning problem solved with a single deep CNN, in which the lower layers are shared and the higher ones are task-dependent, forming a multi-branch structure. Within the traditional deep multi-task learning (DMTL) framework, this paper aims to fully exploit the correlations among different attributes by constructing a graph. Each node in the graph represents the feature vector from the branch for a given attribute, and each edge can be defined either by prior knowledge or, in a fully data-driven manner, by the similarity between two nodes in the embedding space. We show that the attention mechanism takes effect in the latter case, and we use a Graph Attention Layer (GAL) to find the most relevant attribute features and to refine each task-dependent feature by considering the other attributes. Experiments show that, by mining the correlations among attributes, our method improves recognition accuracy over its baseline on the CelebA and LFWA datasets and achieves performance competitive with existing methods.


Facial images provide rich high-level attributes that are useful for describing their semantics. Recognizing facial attributes has many real-world applications in video surveillance [Vaquero et al. 2009], human-computer interaction [Cowie et al. 2001], and image retrieval [Parikh and Grauman 2011]. Although different attributes lie in distinct facial regions and may have different characteristics, recent works still tend to construct a unified network to recognize them simultaneously [Liu et al. 2015a; Rudd, Günther, and Boult 2016; Han et al. 2017; Hand and Chellappa 2017]. The reasons are mainly as follows. First, it is costly in both time and space to build a deep network for each individual attribute. Second, recent results in deep learning show that even totally different tasks share the same low-level representation, so both the structure and the weights of the lower layers can be shared among tasks [Yosinski et al. 2014]. Third, in multi-task learning (MTL), the parameters of the network are optimized by minimizing the combined loss functions of all tasks, which tends to make the learned model generalize better [Meyerson and Miikkulainen 2018].

In general, the definition of MTL can be rather broad. As soon as more than one loss function is optimized, the model is effectively doing MTL [Ruder 2017]. Tasks in MTL may not even share the same training data, yet a single model makes multiple predictions in the testing phase. Specifically, this paper considers the recognition of multiple facial attributes. Each facial image in the training set is labeled with multiple binary attributes, such as male, young, brown hair, eyeglasses, etc., and our goal is to design a network that outputs a prediction for every attribute. Similar to previous works, we share the lower layers in order to mine correlations among different attributes, and create branches that learn the feature representation of each individual attribute at the higher level until the final binary result is given. In such a framework, the correlations among tasks are reflected only in the lower layers. Once the branches separate, they are treated as independent of each other in the later layers. Thus, no features are shared in the higher layers, which means that task-related correlation is not exploited enough. We argue that sharing only the lower layers, which happens even between two irrelevant tasks in transfer learning, is inadequate for tasks as closely related as these.

To model the correlations among different attributes, we use a graph attention layer (GAL), originally proposed in [Velickovic et al. 2017] to model relations in graph-structured data, to create high-level feature representations across different attributes. Our aim is to set up a graph in which each node represents the feature vector of an attribute and an edge between two nodes indicates that they are directly dependent, in other words, that they are correlated with each other. The graph can be set up from prior knowledge: for example, the attribute "wavy hair" is strongly negatively correlated with "straight hair", and "young" is related to "attractiveness" to some extent, hence there should be links between the corresponding nodes. However, the strength of a link is difficult or even impossible to determine from prior knowledge. Our solution is to explore the correlation in a data-driven way, so that both the links and their strengths are learned from data. Specifically, we use an attention-based architecture, in which the similarities among different attribute features are first measured to generate attention weights, and these weights are then used to linearly combine and augment the features for classification. Note that the input features of the GAL come from the individual branches, so they do not fully reflect the correlations even though they share low-level features, whereas the outputs of the GAL take the relations among attributes into account and are more expressive. Since the features before the GAL are already capable of classification, we also propose an optimization scheme in which two cross-entropy loss functions, one computed before the GAL and one after it, constrain two different parts of the network. The idea is to apply the gradient of the former only to the weights of the feature learning network and the gradient of the latter only to the weights of the correlation learning network, as shown in Figure 1. That is to say, the latter loss is only responsible for finding the high-level attribute correlations. In summary, the contributions of the paper are as follows:

  • A GAL structure for multiple facial attributes recognition is proposed. It is a fully data driven approach to explore attributes’ correlations.

  • A separate optimization scheme for the GAL and the lower layers is also designed. Two gradient streams, computed from the two loss functions placed before and after the correlation learning layers, are responsible for updating the weights in the lower layers and in the GAL, respectively.

The remainder of the paper is organized as follows. We first discuss related work on network structure design in MTL, graph neural networks, and attribute analysis. This is followed by an introduction to our proposed method. In the last section, we provide the details of our experiments and analyze our results.

Figure 1: Flowchart of our proposed approach for multiple facial attributes recognition. The whole network consists of the feature learning network and the correlation learning network. In the feature learning net, we use task-specific blocks, all with exactly the same structure, to extract the features of the corresponding attributes. The inputs of the correlation learning net are the sets of feature maps from each block in the feature learning net. After the convolution operation, each set of features is treated as a node in the graph, and the correlation among the nodes is explored by the graph attention mechanism. The outputs of the graph attention are refined nodes, which are fed to their respective classifiers to recognize the corresponding attributes. The feature learning net and the correlation learning net are trained with two independent losses, one attached to each net.

Related Works

This section introduces the works related to this paper. There are many works on MTL, and most of the recent ones are built on deep CNNs and focus on the structural design of the network. Since our work uses a graph neural network to mine the correlations among different attributes, we also give a brief introduction to graph-based networks, from which we can learn how to construct a neural network that models the relations among attributes. Note that MTL is a general concept in machine learning and its application is not restricted to facial attribute recognition.

Network Structure Design in MTL

MTL has been successfully used in many applications of machine learning. The key issue in MTL is how to design the structure of the network. The simplest way is to share the layers and their parameters (hard sharing), which was first analyzed in [Caruana 1993], and it is shown in [Baxter 1997] that sharing more parameters can reduce the risk of overfitting. Besides hard sharing, there is also soft sharing, in which each task has its own model and parameters, but the distance between the parameters is regularized to encourage them to be similar [Yang and Hospedales 2016]. Many deep learning approaches that use MTL, explicitly or implicitly, as part of their model can be regarded as either soft or hard parameter sharing. In deep CNNs, hard parameter sharing is common: it forces the lower convolution layers to use the same parameters while keeping several task-specific parameters in the higher fully connected layers.

Besides soft or hard sharing of parameters and layers, there are more flexible ways to design the network structure. [Long and Wang 2015] consider the relations among different tasks. In addition to sharing the lower convolution layers, they place matrix priors on the task-specific fully connected layers, which allows the model to learn the relations between tasks. The matrix prior provides extra constraints on the tasks, but it is based on prior knowledge and is therefore not fully data-driven. [Lu et al. 2017] give a fully adaptive feature sharing method that gradually generates the network branches. They start from a thin network in which only the final layer is task-specific, and dynamically widen it by greedily creating more branches with task-specific parameters. This greedy approach determines the network structure in a data-driven way, but its solution is not guaranteed to be optimal, and it is rather time-consuming. [Misra et al. 2016] start from two identical, separate CNN models with soft parameter sharing between them. They add an extra unit, named the cross-stitch unit, to share the features at the same level of each model. The cross-stitch unit takes the input features from both models, combines them by a simple calculation, and distributes the results back into each model. With cross-stitch units at different levels of the two models, sharing happens at multiple stages of the feature representation, but the training becomes very unstable, particularly for a large number of tasks. [Ruder et al. 2017] improve on [Misra et al. 2016] by taking the task hierarchy into account. They allow features at different levels to feed the task-specific layers for the final prediction directly, in other words, the loss gradients can reach the lower layers directly, which makes the training more stable.

Apart from network structure design, there are other issues in MTL, such as determining the weight of the loss of each task and incorporating auxiliary losses to improve performance. Due to the page limit, we cannot elaborate on them here.

Graph Neural Network

The idea of the Graph Neural Network (GNN) was first proposed in [Gori, Monfardini, and Scarselli 2005; Scarselli et al. 2009] to deal with graph-structured data. Unlike the CNN, which suits data with a regular grid-like structure such as images, the GNN mainly deals with data in irregular domains, such as social networks, 3D meshes, or telecommunication networks. [Kipf and Welling 2016] introduce the convolution operation on graphs and propose the multi-layer Graph Convolution Network (GCN). Similar to the convolution in a CNN, the graph convolution computes a weighted linear combination over a neighbourhood. The key difference is that the neighbourhood of a node is irregular and determined by the edges between nodes, hence the structure of the graph must be known before the convolution. They also show that the convolution on a graph can be performed easily and equivalently in the spectral domain. [Velickovic et al. 2017] present the Graph Attention (GAT) network, in which the graph structure can be learned entirely or refined from data by the self-attention mechanism. Moreover, GAT can be computed efficiently without matrix inversion. In our work, GAT is used to explore the correlations among facial attributes without any prior knowledge about them.
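To make the contrast between a fixed and a learned graph concrete, the following is a minimal NumPy sketch of one graph-convolution step in the symmetric-normalized form commonly attributed to Kipf and Welling. The function name `gcn_layer`, the toy path graph and the identity weight matrix are illustrative assumptions and are not taken from this paper. The point to notice is that the adjacency matrix `adj` has to be supplied in advance, which is exactly the requirement that graph attention removes.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    """One graph-convolution step: normalized neighbourhood sum, then ReLU."""
    a_hat = adj + np.eye(adj.shape[0])        # add self-loops
    deg = a_hat.sum(axis=1)                   # node degrees (always >= 1)
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    # Symmetric normalization D^{-1/2} (A + I) D^{-1/2}
    a_norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.maximum(a_norm @ feats @ weight, 0.0)

# Toy graph (assumed for illustration): a 3-node path 0 - 1 - 2.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
feats = np.array([[1.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
weight = np.eye(2)
out = gcn_layer(adj, feats, weight)
```

With an empty edge set the normalized adjacency reduces to the identity, so each node only sees itself; any mixing between nodes comes from the edges that were given as input.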

Attribute Analysis

The face carries many important high-level attributes. Algorithms for a single facial attribute, such as gender, age or kinship [Levi and Hassner 2015; Ranjan, Patel, and Chellappa 2017; Robinson et al. 2016], usually rely on prior knowledge about the face and aim to design discriminative features or loss functions. However, building a single deep model for multiple facial attributes is still difficult, mainly because a design tailored to one particular attribute may not generalize to the others. [Liu et al. 2015a] propose the CelebA dataset, in which 40 binary attributes need to be estimated by a single model. [Han et al. 2017] provide a DMTL (deep MTL) approach in which they first divide the attributes into several groups and construct group-specific layers. The outputs of these layers are then used by attribute-specific layers. This work designs the network structure based on prior knowledge, and it gives the best performance. [Hand and Chellappa 2017] use a similar idea, but they aim to find and use the correlations among different attributes. They design a simple fully connected AUX layer which takes all the attribute-specific features as input and refines them before making the final predictions. Among the above works, only [Han et al. 2017; Hand and Chellappa 2017] consider the correlation among attributes, and both depend heavily on prior knowledge. Although the AUX layer in [Hand and Chellappa 2017] is a data-driven approach, a single fully connected layer is still not enough for correlation mining.

Our Approach

In multiple facial attributes recognition, correlations between attributes always exist, and they deserve to be better exploited in a single MTL model. There are two main difficulties in modelling the correlation of attributes. First, attributes are heterogeneous in many aspects. For example, "blonde hair" concerns color while "big lips" describes geometry. Second, "blonde hair" and "big lips" are both relatively low-level attributes that can be identified directly from the face image, while an attribute like "attractiveness" is a high-level semantic attribute. The relation between each low-level attribute and "attractiveness" is difficult to determine. Only with a data-driven approach can we reduce subjective influence and obtain an unbiased result based on the average aesthetic judgment reflected in a given dataset. Because of this heterogeneity, guidance from prior knowledge easily introduces bias, hence reliable correlations cannot be found with prior knowledge alone.

To fully explore the relationships among attributes, we divide the framework into two parts, as shown in Figure 1. The first part is the feature learning network (FLN), which has a backbone of several fully shared layers followed by one task-specific branch per attribute. The branches have the same structure but different parameters, and each extracts the features of its own attribute. The second part is the correlation learning network (CLN). Here, we introduce the concept of a graph, and regard the sets of features extracted by the feature learning network as the nodes of the graph. The nodes are then given to the GAL, which explores the relations among attributes and assigns a weight to each relation to represent the strength of the correlation. Then, by integrating the information from every node, refined and complete features are obtained, and the classifiers can be learned in an unbiased way.

Feature Learning Net

We use the Alexnet-cvgj model [Simon, Rodner, and Denzler 2016], without the two fully connected layers, as the shared backbone to extract the low-level shared features. The output of the shared layers is given to the task-specific learning branches, each branch corresponding to one attribute. A branch consists of a convolution layer, a batchnorm layer and a position squeeze excitation (PSE) module, which helps to find the relevant positions in the spatial dimension. The diagram of the PSE module can be found in Figure 2. The output of a whole branch is one feature set with enhanced spatial information. Each feature set is fed to its own fully connected layer for classification. The cross-entropy losses from the different classifiers are summed for gradient calculation, and the branches are then updated separately: they have no parameters in common and thus do not interact. The shared layers, in contrast, receive the effect of all the losses, so they eventually learn to transform the input image into global features that meet the requirements of all branches at the same time.

Figure 2: Flowchart of the PSE module. The PSE module first compresses the features by calculating the average value at the same position across the different channels, which is called position average pooling (PAP), and thus obtains a single-channel feature map of position attention. Second, the PSE module applies two convolution layers and a sigmoid function to this position attention map and obtains a mask that focuses on local information. Finally, PSE computes the elementwise multiplication of the output of the batchnorm layer and the mask.
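The three PSE steps in the caption can be sketched in NumPy as follows. This is only an illustration under stated assumptions: the text does not give the kernel size or whether a nonlinearity sits between the two convolutions, so the sketch assumes 3x3 single-channel kernels with zero padding and no intermediate activation. The helper names `conv2d_same` and `pse` are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv2d_same(x, kernel):
    """Single-channel 2D cross-correlation with zero padding (output size = input size)."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))      # zero padding by default
    out = np.zeros_like(x, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = np.sum(xp[i:i + kh, j:j + kw] * kernel)
    return out

def pse(features, kernel1, kernel2):
    """features: (H, W, C) batchnorm output. Returns masked features and the mask."""
    pooled = features.mean(axis=-1)           # PAP: average over channels -> (H, W)
    mask = sigmoid(conv2d_same(conv2d_same(pooled, kernel1), kernel2))
    return features * mask[..., None], mask   # broadcast the mask over channels

rng = np.random.default_rng(0)
features = rng.normal(size=(6, 6, 8))         # assumed toy feature map
k1 = rng.normal(scale=0.1, size=(3, 3))       # assumed 3x3 kernels
k2 = rng.normal(scale=0.1, size=(3, 3))
refined, mask = pse(features, k1, k2)
```

Because the mask has a single channel, every channel at a given position is scaled by the same factor, which is what makes the module a position-wise rather than a channel-wise attention.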

Correlation Learning Net

The Correlation Learning Net (CLN) consists of two parts: the first projects the branch features into a common feature space and adds them to the list of graph nodes; the second applies the GAT to find the correlations among the nodes. As shown in Figure 1 and Figure 3, the output features of each branch in the FLN are fed into the CLN. To obtain features expressive enough for graph attention learning, at least one learnable linear transformation is required, so we use convolutions whose parameters are not shared across attributes. In [Hand and Chellappa 2017], a fully connected layer performs the projection, because it can fuse all the features in one feature set. However, facial attribute recognition is sensitive to geometric structure, and mapping with fully connected layers loses spatial information, so we choose the convolution operation to perform the projection. To prepare the input of the graph attention, we flatten the outputs of the convolutions into vectors and treat these vectors as nodes. Formally, we denote the input of the CLN, i.e. the output of the $i$-th branch of the FLN, as $F_i \in \mathbb{R}^{H \times W \times C}$, where $i$ ranges from 1 to $M$ and is the index of the attribute, and $H$, $W$ and $C$ are the height, width and number of channels of the feature $F_i$, respectively. The convolution is written as $\mathrm{conv}_i(\cdot)$, the reshape operation as $\mathrm{reshape}(\cdot)$, and the output node as $n_i$. The node is computed as follows.

$$ n_i = \mathrm{reshape}\big(\mathrm{conv}_i(F_i)\big), \quad i = 1, \dots, M $$

Figure 3: Operations before the GAL. The input feature is the output of a branch in the FLN. The convolution operation acts as the projection, while the reshape operation turns the projected features into a node, which helps the learning of the graph attention network.
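The projection and reshape step can be sketched as below. The kernel size of the projection is not stated in the text, so this sketch assumes a 1x1 convolution, which reduces to a per-position matrix product over channels. The sizes and the names `project_to_node`, `weights` and `biases` are illustrative assumptions. Each attribute has its own weights, reflecting the statement that the convolution parameters are not shared.

```python
import numpy as np

def project_to_node(feature, weight, bias):
    """1x1 convolution over channels, then flatten into a single node vector."""
    projected = feature @ weight + bias       # (H, W, C) @ (C, C_out) -> (H, W, C_out)
    return projected.reshape(-1)              # node of dimension H * W * C_out

rng = np.random.default_rng(0)
M, H, W, C, C_OUT = 4, 6, 6, 8, 2             # assumed toy sizes
branch_features = [rng.normal(size=(H, W, C)) for _ in range(M)]
weights = [rng.normal(size=(C, C_OUT)) for _ in range(M)]   # one set per attribute
biases = [np.zeros(C_OUT) for _ in range(M)]

# Stack the M node vectors as the rows of one matrix for the graph attention step.
nodes = np.stack([project_to_node(f, w, b)
                  for f, w, b in zip(branch_features, weights, biases)])
```

Since flattening keeps the position ordering, entry-wise comparisons between two node vectors compare the same spatial locations, which a fully connected projection would not guarantee.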

After projection, each node $n_i$ ($i = 1, \dots, M$, where $M$ is the number of attributes) is a vector of dimension $d$. As shown in Figure 4, we take the inner product between every pair of node vectors. The larger the value, the higher the similarity and the stronger the correlation. Stacking the nodes as the rows of a matrix $X \in \mathbb{R}^{M \times d}$, this matrix multiplication gives an $M \times M$ affinity matrix $A$:

$$ A = X X^{\top}, \qquad A_{ij} = n_i^{\top} n_j $$

In the matrix $A$, each row represents the correlation between the corresponding node and all the nodes, including itself. $A$ is then normalized row by row with a softmax. Finally, we multiply the normalized attention by the original nodes, that is, we sum all the nodes weighted by their correlation weights, and obtain refined nodes that integrate the information of all nodes:

$$ \tilde{X} = \mathrm{softmax}(A)\, X $$

Figure 4: Flowchart of graph attention (GAT). The input is the set of nodes produced by the mapping operation. We apply matrix multiplication to compute the affinity matrix among the nodes, and finally weight the original nodes according to each row of the affinity matrix to integrate the complementary information from correlated nodes.

As each node vector is the reshaped form of multiple feature maps, it keeps the spatial information of the features. The matrix multiplication therefore compares the similarity between two sets of spatial feature maps. If two sets of spatial feature maps are highly correlated, the recognition of the two attributes depends on a similar source; otherwise, the two attributes do not focus on similar positions. Through the weighting operation, the complementary information of correlated feature sets is strengthened and that of uncorrelated feature sets is suppressed. In conclusion, this data-driven approach lets each attribute find, from the data itself, the complementary information it needs for classification.
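The attention step described above (pairwise inner products, row-wise softmax, weighted sum of the nodes) can be written in a few lines of NumPy. This is a sketch of the computation as described in the text, not the authors' implementation; the function name `graph_attention` and the random toy nodes are assumptions, and the subtraction of the row maximum is only a standard numerical-stability step that does not change the softmax result.

```python
import numpy as np

def graph_attention(nodes):
    """nodes: (M, d) matrix with one row per attribute node."""
    affinity = nodes @ nodes.T                                # (M, M) inner products
    shifted = affinity - affinity.max(axis=1, keepdims=True)  # stabilise the softmax
    attention = np.exp(shifted)
    attention /= attention.sum(axis=1, keepdims=True)         # each row sums to one
    refined = attention @ nodes                               # weighted sum of all nodes
    return refined, attention, affinity

rng = np.random.default_rng(0)
nodes = rng.normal(size=(5, 16))                              # assumed toy nodes
refined, attention, affinity = graph_attention(nodes)
```

As a sanity check on the behaviour, nodes that are mutually orthogonal and have large norm attend almost only to themselves, so their refined versions stay nearly unchanged; correlated nodes, in contrast, pull information from each other.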

Loss Functions and the Optimization Strategy

As shown in Figure 1, there are two loss terms for optimization, $L_{\mathrm{FLN}}$ and $L_{\mathrm{CLN}}$. $L_{\mathrm{FLN}}$ is computed directly from the branches of the FLN. The prediction for each attribute is first evaluated by the softmax cross-entropy loss. Then, the losses from all branches are summed to form $L_{\mathrm{FLN}}$, as shown in (1).

$$ L_{\mathrm{FLN}} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} \left[ y_{ij} \log p_{ij} + (1 - y_{ij}) \log (1 - p_{ij}) \right] \tag{1} $$

Here $N$ is the total number of training samples and $M$ is the number of attributes. $y_{ij}$ is the binary label and $p_{ij}$ is the probability estimated by the softmax function for the $j$-th attribute of the $i$-th sample. Note that $L_{\mathrm{CLN}}$ has the same form as $L_{\mathrm{FLN}}$, but it is computed on the refined features after the GAL.

When minimizing the two loss terms, the basic idea is to keep each branch before the GAL task-specific while making the features after the GAL more expressive by considering the correlation of attributes. The optimization strategy therefore separates the gradient flows of the CLN and the FLN. The gradient stream from the loss on the FLN branch outputs, $L_{\mathrm{FLN}}$, is applied only to the parameters of the FLN to keep each branch expressive, and the gradient stream from the loss on the GAL outputs, $L_{\mathrm{CLN}}$, is only responsible for updating the parameters of the CLN so that it explores attribute correlations. The gradient flow is shown in Figure 1 with dashed arrows of different colors.
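The gradient separation can be illustrated with a deliberately tiny model and hand-written backpropagation. Everything in this sketch is an assumption made for illustration: the FLN and the CLN are each reduced to a single linear map (`w_f`, `w_c`), and a sigmoid cross entropy stands in for the two-class softmax cross entropy, to which it is equivalent for binary labels. What matters is which gradient reaches which weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def separated_grads(w_f, w_c, x, y):
    f = w_f @ x                        # toy FLN output: one logit per attribute
    r = w_c @ f                        # toy CLN output: refined logits
    d_f = sigmoid(f) - y               # gradient of the FLN loss w.r.t. f
    d_r = sigmoid(r) - y               # gradient of the CLN loss w.r.t. r
    grad_w_f = np.outer(d_f, x)        # FLN weights see only the FLN loss
    grad_w_c = np.outer(d_r, f)        # CLN weights see only the CLN loss
    # Term that ordinary joint backpropagation would add to grad_w_f; the
    # separated scheme drops it by treating f as a constant in the CLN loss.
    leaked = np.outer(w_c.T @ d_r, x)
    return grad_w_f, grad_w_c, leaked

rng = np.random.default_rng(0)
x = rng.normal(size=4)                 # assumed toy input
y = np.array([1.0, 0.0, 1.0])          # assumed binary labels for 3 attributes
w_f = rng.normal(size=(3, 4))
w_c = rng.normal(size=(3, 3))
grad_w_f, grad_w_c, leaked = separated_grads(w_f, w_c, x, y)
```

In an automatic-differentiation framework the same effect is usually obtained by applying a stop-gradient (detach) operation to the branch features before they enter the CLN, so that `leaked` is never computed.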



We evaluated our method on two challenging face attribute datasets, CelebA [Liu et al. 2015b] and LFWA [Wolf, Hassner, and Taigman 2011]. The CelebFaces Attributes Dataset (CelebA) is a large-scale face attributes dataset with about 10 thousand identities and 200 thousand face images. About 80 percent of the images are used for training and 20 percent for validation and test. Each image has 40 binary attribute annotations, including "5 o'clock shadow", "arched eyebrows", "attractive", "bags under eyes", "bald", "bangs", etc. It provides In-The-Wild and Aligned-and-Cropped sets, and we apply our method to the aligned one. The LFWA dataset contains over 13 thousand face images collected from the web. It is partitioned into about half for training and half for test. Each image is annotated with exactly the same forty attributes used in the CelebA dataset. To reduce overfitting, we add some distortion [Bloice, Stocker, and Holzinger 2017] to the LFWA training set and expand it to over 75 thousand images.

Implementation Details

The proposed network places no restriction on the structure of the backbone. For simplicity, we use Alexnet-cvgj as the bottom structure to extract global semantic features. To demonstrate the effectiveness of the proposed method, we provide several comparative results. A network with only the FLN serves as our baseline. Results of training the FLN together with the CLN are marked as GAL-j and GAL-c. GAL-j means that the FLN and the CLN are trained jointly, with the loss on the FLN branch outputs and the loss on the CLN outputs minimized at the same time. The parameters of the FLN and the CLN are updated by the gradients of the former and the latter loss, respectively, which is our proposed optimization scheme. GAL-c means that all the parameters of the FLN are fixed and only the correlation learning net is trained. In fact, both training schemes add constraints to the FLN. In our scheme, the FLN is required to extract the features of each attribute accurately and without deviation, and the CLN is required to fully explore the correlation between each pair of attributes. If the constraint on the FLN were absent, the gradients from the CLN would flow into the FLN and introduce deviation into each independent branch, which would affect the final performance. GAL-j and GAL-c both learn the affinity matrix in a data-driven way. For comparison, we also define an affinity matrix by hand, in other words, we build a correlation graph from prior knowledge. This approach is marked as GAL-p. The correlation graph is detailed in Table 1. There are 8 groups in total, and each group is formed according to the facial location where its attributes naturally appear. All the nodes in the same group are adjacent, and the values on the edges of each node sum to one, while nodes in different groups have no linking edge.
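One way to realize the hand-defined GAL-p graph is a block-diagonal affinity matrix built from the group sizes of Table 1. The sketch below makes assumptions that the text does not confirm: attributes are ordered group by group, every node also links to itself, and the edge weights within a group are uniform so that each row sums to one. The name `prior_affinity` is hypothetical.

```python
import numpy as np

# Group sizes read off Table 1: Global, Hair, Eye, Nose, CheekEar, Mouth, Chin, Neck.
GROUP_SIZES = [9, 10, 5, 2, 4, 5, 3, 2]

def prior_affinity(group_sizes):
    """Block-diagonal affinity: uniform weights inside a group, zero across groups."""
    m = sum(group_sizes)
    affinity = np.zeros((m, m))
    start = 0
    for size in group_sizes:
        affinity[start:start + size, start:start + size] = 1.0 / size
        start += size
    return affinity

A_prior = prior_affinity(GROUP_SIZES)   # 40 x 40, fixed instead of learned
```

Under these assumptions every attribute is refined by the plain average of the nodes in its own group, which gives a sense of how coarse a fixed prior is compared with learned, pairwise attention weights.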

To assist training, we initialize the shared layers with the publicly available Alexnet-cvgj model pretrained on ImageNet. All training images are first resized to a fixed size and then randomly flipped left-to-right online before they are fed into the network. On the CelebA dataset, for all the networks, we set the initial learning rate to 0.005 and let it follow a polynomial decay function as training goes on. The batch size is 256 and the maximum number of iterations is 25600. For the baseline, the weight decay is set to 0.0005, while for the other three methods it is 0.001. When training on the LFWA dataset, we apply the cyclical learning rate [Smith 2017] to all the networks. The maximum learning rate is 0.005, the minimum is 0, the step size is 5000, and no decay is applied. The total number of training iterations is 10000.
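The two learning-rate schedules can be sketched as plain Python functions using the values quoted above. Two details are assumptions, since the text does not state them: the power of the polynomial decay (set to 1.0 here, with an end rate of 0) and the use of the triangular policy from the cyclical learning rate paper, which matches the description of a schedule without decay. The function names are hypothetical.

```python
import math

def polynomial_decay(step, base_lr=0.005, max_steps=25600, power=1.0, end_lr=0.0):
    """CelebA schedule: decay from base_lr to end_lr; power=1.0 is an assumed value."""
    frac = min(step, max_steps) / max_steps
    return (base_lr - end_lr) * (1.0 - frac) ** power + end_lr

def cyclical_lr(step, min_lr=0.0, max_lr=0.005, stepsize=5000):
    """LFWA schedule: triangular cyclical policy with constant amplitude."""
    cycle = math.floor(1 + step / (2 * stepsize))
    x = abs(step / stepsize - 2 * cycle + 1)
    return min_lr + (max_lr - min_lr) * max(0.0, 1.0 - x)

# With stepsize 5000 and 10000 iterations, LFWA training covers exactly one cycle.
lfwa_schedule = [cyclical_lr(s) for s in range(0, 10001, 2500)]
```

With these settings the LFWA rate rises linearly from 0 to 0.005 over the first 5000 iterations and falls back to 0 over the next 5000.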

Results and Analysis

Based on the methods described above, our results on the CelebA dataset are listed in Table 2, and those on LFWA are listed in Table 3. A comparison with other current methods is given in Table 4.

Group Attributes
Global Attractive, Blurry, Chubby, Heavy Makeup, Male, Oval Face, Pale Skin, Smiling, Young
Hair Bald, Bangs, Black Hair, Blond Hair, Brown Hair, Gray Hair, Receding Hairline, Straight Hair, Wavy Hair, Wearing Hat
Eye Arched Eyebrows, Bags Under Eyes, Bushy Eyebrows, Eyeglasses, Narrow Eyes
Nose Big Nose, Pointy Nose
CheekEar High Cheekbones, Rosy Cheeks, Sideburns, Wearing Earrings
Mouth 5 o'Clock Shadow, Big Lips, Mouth Slightly Open, Mustache, Wearing Lipstick
Chin Double Chin, Goatee, No Beard
Neck Wearing Necklace, Wearing Necktie
Table 1: Custom Pre-classification of Nodes
Approach Baseline GAL-c GAL-j GAL-p
Accuracy 90.73 90.89 91.43 90.13
Table 2: Results on CelebA Dataset
Approach LNets-ANet MCNN-AUX Baseline GAL-j
Accuracy 84 86.3 85.19 85.25
Table 3: Comparison on LFWA Dataset





5 Shadow 94.51 95.00 94.21 94.80
Arched Eyebrows 83.42 86.00 82.12 84.16
Attractive 83.06 85.00 82.83 82.89
Bags Un Eyes 84.92 85.00 83.75 85.29
Bald 98.90 99.00 99.06 98.90
Bangs 96.05 99.00 96.05 96.13
Big Lips 71.47 96.00 70.88 71.81
Big Nose 84.53 85.00 83.82 84.35
Black Hair 89.78 91.00 90.32 90.34
Blonde Hair 96.01 96.00 96.07 95.98
Blurry 96.17 96.00 95.50 96.22
Brown Hair 89.15 88.00 89.16 89.15
Bushy Eyebrows 92.84 92.00 92.41 92.88
Chubby 95.67 96.00 94.98 95.75
Double Chin 96.32 97.00 96.18 96.40
Eyeglasses 99.63 99.00 99.61 99.55
Goatee 97.24 99.00 97.31 97.40
Gray Hair 98.20 98.00 98.28 98.34
Heavy Makeup 91.55 92.00 91.10 91.80
H. Cheekbones 87.58 88.00 86.88 87.98
Male 98.17 98.00 98.26 98.17
Mouth S. O. 93.74 94.00 92.60 93.92
Mustache 96.88 97.00 96.89 96.83
Narrow Eyes 87.23 90.00 87.23 87.57
No Beard 96.05 97.00 95.99 96.21
Oval Face 75.84 78.00 75.79 75.78
Pale Skin 97.05 97.00 97.04 97.22
Pointy Nose 77.47 78.00 74.83 77.61
Reced. Hairline 93.81 94.00 93.29 93.76
Rosy Cheeks 95.16 96.00 94.45 95.22
Sideburns 97.85 98.00 97.83 97.93
Smiling 92.73 94.00 91.77 92.98
Straight Hair 83.58 85.00 84.10 83.67
Wavy Hair 83.91 87.00 85.65 84.32
Wear. Earrings 90.43 91.00 90.20 90.34
Wear. Hat 99.05 99.00 99.02 99.05
Wear. Lipstick 94.11 93.00 91.69 94.18
Wear. Necklace 86.63 89.00 87.85 86.96
Wear. Necktie 96.51 97.00 96.90 96.62
Young 88.48 90.00 88.66 88.57
Average 91.29 92.60 91.01 91.43
Table 4: Comparison on CelebA Dataset

Analysis on GAL

As listed in Table 2, the mean accuracies of Baseline, GAL-c, GAL-j, and GAL-p are 90.73, 90.89, 91.43 and 90.13, respectively. Adding a data-driven CLN does improve performance, and the proposed optimization strategy also plays an important role in facial attribute recognition. GAL-c trains the FLN and the CLN separately. The parameters of the FLN are loaded from the Baseline model, which is regarded as the best attribute feature extractor. During training, the parameters of the FLN are fixed and only the parameters of the correlation learning net are updated. GAL-j trains the FLN and the CLN at the same time with two separate gradient streams, one from the loss on the FLN branch outputs and one from the loss on the CLN outputs. Both GAL-c and GAL-j constrain the independence of the attribute features as well as the attribute correlation, aiming to extract independent features and to exploit the correlation among attributes fully. However, the mean accuracy of GAL-j is 0.54 points higher than that of GAL-c. The difference between the two methods lies in whether the FLN and the CLN share the training process. We believe that, with the guidance from the FLN and our proposed optimization strategy, the CLN is less likely to fall into local minima and can escape from them as training goes on. Moreover, as the GAL-c results show, the final model cannot reach the optimum with a fixed FLN. GAL-p is trained in exactly the same way as GAL-j, except that the affinity matrix of GAL-p is defined by hand whereas GAL-j learns the affinity matrix on its own. It is reasonable that the mean accuracy of GAL-p is even lower than that of the baseline, as it is very hard to determine by hand the adjacency of attributes and the degree of their correlation. On the LFWA dataset, the mean accuracy of GAL-j is 0.06 points higher than that of the Baseline, as listed in Table 3, which indicates that GAL is effective, although the effect is small. The main reason is that the LFWA dataset is too small, so it is difficult to train our model using LFWA alone.

Comparison with Other Approaches

As shown in Table 4 and Table 3, the performance of our approach is better than that of AFFACT [Günther, Rozsa, and Boult 2016] and LNets-ANet [Liu et al. 2015a]. The DMTL method [Han et al. 2017] has the best performance at present. DMTL also explores the correlation of attributes. It first learns the common shared features and then uses several branches to learn the features of different groups. Finally, each individual attribute is classified based on the features of its group. The grouping is determined by prior knowledge. Training DMTL takes at least 100,000 iterations and requires pretraining the model on the CASIA dataset. Our network, in contrast, is trained end-to-end: it obtains its results in 25600 iterations and does not need pretraining on a much bigger dataset. MCNN-AUX [Hand and Chellappa 2017] is an end-to-end network that uses fully connected layers to explore the correlation of attributes. As it is a quite shallow network, it suffers less from overfitting on LFWA than our approach. However, our approach performs better on the large CelebA dataset.


Figure 5: Heatmap of the affinity matrix in GAL-j. All attributes are listed on both the x-axis and the y-axis. The deeper the color, the stronger the relationship between the two attributes. Best viewed in color.

Figure 5 shows the heatmap of the affinity matrix learned in GAL-j. We can clearly see that every attribute has the strongest correlation with itself. Besides, the heatmap contains some intuitive relationships, such as “smile” and “narrow eyes”. With the data-driven approach, we also find some interesting points: “young” is strongly correlated with “wearing earrings”, “lipstick”, “hat”, and “necktie”, which reflects the subjective concept of being young. We also notice that “high cheekbones” and “smile” are closely related. Indeed, when someone smiles, they are more likely to be seen as having high cheekbones, so label noise may have been introduced for this reason. In addition, some attributes have almost nothing to do with each other. For instance, “bushy eyebrows” has no relation with “bags under eyes” or “mustache”, and “mustache” is not related to “heavy makeup”.
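To make the structure of such a heatmap concrete, the following is a minimal sketch of one way a data-driven affinity matrix could be computed from node features. It assumes cosine similarity between nodes followed by a row-wise softmax; this is an illustrative choice, not necessarily the scoring function used in GAL-j. The feature values and the helper names `cosine`, `softmax`, and `affinity_matrix` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def affinity_matrix(nodes):
    """Row-normalised attention weights between every pair of nodes."""
    return [softmax([cosine(u, v) for v in nodes]) for u in nodes]


# Made-up 3-dimensional node features for three attributes (assumption).
names = ["smile", "high cheekbones", "mustache"]
nodes = [[1.0, 0.2, 0.0], [0.9, 0.4, 0.1], [0.0, 0.1, 1.0]]

A = affinity_matrix(nodes)
for name, row in zip(names, A):
    print(name, [round(w, 3) for w in row])
```

With this choice of similarity, each row sums to one and the diagonal entry is the largest in its row, which mirrors the observation that every attribute correlates most strongly with itself. Nodes with similar features (here the first two) receive a larger mutual weight than dissimilar ones.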


To fully explore the relations among attributes and synthesize the information from related attribute features, we propose adding GAL to the FLN so that independent features and their correlations are learned at the same time in an end-to-end way. After the FLN, we use a convolution layer to reduce the dimension of each independent feature and reshape it into a node. The nodes are fed into the GAL, and the relations among them are mapped with a data-driven approach. From the visualization of the affinity matrix, we can see that intuitively related attributes are still correlated in the graph, but the degree of correlation, that is, the correlation weight, is now determined from the data. Meanwhile, the learned affinity matrix also reveals some unexpected attribute relations. During training, we found that training with two independent losses simultaneously can guide the network toward the optimum and keep it from falling into local minima. Experiments show that this method obtains good classification results. Overall, the proposed approach is effective and meets our expectations. Our code is available at https://github.com/crazydemo/facial-attribute-classification-with-graph
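As a rough illustration of how an attribute node can be refined with information from the other nodes, the sketch below replaces each node by the affinity-weighted sum of all nodes. It is a simplified stand-in written for this summary, not the released code: it omits the learned linear projection and the nonlinearity that a graph attention layer would normally apply, and the affinity matrix, the feature values, and the function name `refine` are all hypothetical.

```python
def refine(affinity, nodes):
    """Replace each node by the affinity-weighted sum of all nodes."""
    dim = len(nodes[0])
    return [
        [sum(w * node[k] for w, node in zip(row, nodes)) for k in range(dim)]
        for row in affinity
    ]


# Made-up 2-dimensional features for three attribute nodes (assumption).
nodes = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0]]

# Made-up row-stochastic affinity matrix: each row sums to 1 and the
# diagonal dominates, so every node mostly keeps its own feature.
affinity = [
    [0.6, 0.3, 0.1],
    [0.3, 0.6, 0.1],
    [0.1, 0.1, 0.8],
]

refined = refine(affinity, nodes)
for before, after in zip(nodes, refined):
    print(before, "->", [round(x, 2) for x in after])
```

Because the weights in each row sum to one, every refined node is a convex combination of the original nodes, so strongly related attributes pull each other's features together while weakly related ones contribute little.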


  1. Baxter, J. 1997. A Bayesian/information theoretic model of learning to learn via multiple task sampling. Machine learning 28(1):7–39.
  2. Bloice, M. D.; Stocker, C.; and Holzinger, A. 2017. Augmentor: An image augmentation library for machine learning. CoRR abs/1708.04680.
  3. Caruana, R. 1993. Multitask learning: A knowledge-based source of inductive bias. In Machine Learning: Proceedings of the Tenth International Conference, 41–48.
  4. Cowie, R.; Douglas-Cowie, E.; Tsapatsoulis, N.; Votsis, G.; Kollias, S.; Fellenz, W.; and Taylor, J. G. 2001. Emotion recognition in human-computer interaction. IEEE Signal processing magazine 18(1):32–80.
  5. Gori, M.; Monfardini, G.; and Scarselli, F. 2005. A new model for learning in graph domains. In Neural Networks, 2005. IJCNN’05. Proceedings. 2005 IEEE International Joint Conference on, volume 2, 729–734. IEEE.
  6. Günther, M.; Rozsa, A.; and Boult, T. E. 2016. AFFACT - alignment free facial attribute classification technique. CoRR abs/1611.06158.
  7. Han, H.; Jain, A. K.; Shan, S.; and Chen, X. 2017. Heterogeneous face attribute estimation: A deep multi-task learning approach. IEEE transactions on pattern analysis and machine intelligence.
  8. Hand, E. M., and Chellappa, R. 2017. Attributes for improved attributes: A multi-task network utilizing implicit and explicit relationships for facial attribute classification. In AAAI, 4068–4074.
  9. Kipf, T. N., and Welling, M. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
  10. Levi, G., and Hassner, T. 2015. Age and gender classification using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 34–42.
  11. Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015a. Deep learning face attributes in the wild. In Proceedings of the IEEE International Conference on Computer Vision, 3730–3738.
  12. Liu, Z.; Luo, P.; Wang, X.; and Tang, X. 2015b. Deep learning face attributes in the wild. In Proceedings of International Conference on Computer Vision (ICCV).
  13. Long, M., and Wang, J. 2015. Learning multiple tasks with deep relationship networks. CoRR abs/1506.02117.
  14. Lu, Y.; Kumar, A.; Zhai, S.; Cheng, Y.; Javidi, T.; and Feris, R. S. 2017. Fully-adaptive feature sharing in multi-task networks with applications in person attribute classification. In CVPR, volume 1,  6.
  15. Meyerson, E., and Miikkulainen, R. 2018. Pseudo-task augmentation: From deep multitask learning to intratask sharing—and back. arXiv preprint arXiv:1803.04062.
  16. Misra, I.; Shrivastava, A.; Gupta, A.; and Hebert, M. 2016. Cross-stitch networks for multi-task learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3994–4003.
  17. Parikh, D., and Grauman, K. 2011. Relative attributes. In Computer Vision (ICCV), 2011 IEEE International Conference on, 503–510. IEEE.
  18. Ranjan, R.; Patel, V. M.; and Chellappa, R. 2017. Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence.
  19. Robinson, J. P.; Shao, M.; Wu, Y.; and Fu, Y. 2016. Families in the wild (fiw): Large-scale kinship image database and benchmarks. In Proceedings of the 2016 ACM on Multimedia Conference, 242–246. ACM.
  20. Rudd, E. M.; Günther, M.; and Boult, T. E. 2016. Moon: A mixed objective optimization network for the recognition of facial attributes. In European Conference on Computer Vision, 19–35. Springer.
  21. Ruder, S.; Bingel, J.; Augenstein, I.; and Søgaard, A. 2017. Sluice networks: Learning what to share between loosely related tasks. stat 1050:23.
  22. Ruder, S. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
  23. Scarselli, F.; Gori, M.; Tsoi, A. C.; Hagenbuchner, M.; and Monfardini, G. 2009. The graph neural network model. IEEE Transactions on Neural Networks 20(1):61–80.
  24. Simon, M.; Rodner, E.; and Denzler, J. 2016. Imagenet pre-trained models with batch normalization.
  25. Smith, L. N. 2017. Cyclical learning rates for training neural networks. In 2017 IEEE Winter Conference on Applications of Computer Vision (WACV), 464–472.
  26. Vaquero, D. A.; Feris, R. S.; Tran, D.; Brown, L.; Hampapur, A.; and Turk, M. 2009. Attribute-based people search in surveillance environments. In Applications of Computer Vision (WACV), 2009 Workshop on, 1–8. IEEE.
  27. Velickovic, P.; Cucurull, G.; Casanova, A.; Romero, A.; Lio, P.; and Bengio, Y. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
  28. Wolf, L.; Hassner, T.; and Taigman, Y. 2011. Effective unconstrained face recognition by combining multiple descriptors and learned background statistics. IEEE Trans Pattern Anal Mach Intell 33(10):1978–1990.
  29. Yang, Y., and Hospedales, T. M. 2016. Trace norm regularised deep multi-task learning. arXiv preprint arXiv:1606.04038.
  30. Yosinski, J.; Clune, J.; Bengio, Y.; and Lipson, H. 2014. How transferable are features in deep neural networks? In Advances in neural information processing systems, 3320–3328.