Deep Learning Face Attributes in the Wild
This work has been accepted to appear in ICCV 2015. This is the pre-printed version. Content may slightly change prior to the final publication.
Predicting face attributes in the wild is challenging due to complex face variations. We propose a novel deep learning framework for attribute prediction in the wild. It cascades two CNNs, LNet and ANet, which are fine-tuned jointly with attribute tags but pre-trained differently. LNet is pre-trained on massive general object categories for face localization, while ANet is pre-trained on massive face identities for attribute prediction. This framework not only outperforms the state of the art by a large margin, but also reveals valuable facts about learning face representations.
(1) It shows how the performance of face localization (LNet) and attribute prediction (ANet) can be improved by different pre-training strategies. (2) It reveals that although the filters of LNet are fine-tuned only with image-level attribute tags, their response maps over entire images strongly indicate face locations. This fact enables training LNet for face localization with only image-level annotations, without the face bounding boxes or landmarks that are required by all attribute recognition works. (3) It also demonstrates that the high-level hidden neurons of ANet automatically discover semantic concepts after pre-training with massive face identities, and that such concepts are significantly enriched after fine-tuning with attribute tags. Each attribute can be well explained by a sparse linear combination of these concepts.
Face attributes are beneficial for multiple applications such as face verification, identification, and retrieval. Predicting face attributes from images in the wild is challenging because of complex face variations such as pose, lighting, and occlusion, as shown in Fig. ?.
Attribute recognition methods are generally categorized into two groups: global and local methods. Global methods extract features from the entire object, so accurate locations of object parts or landmarks are not required, but they are not robust to deformations of objects. Recent local models first detect object parts and extract features from each part. These local features are concatenated to train classifiers. For example, Kumar et al. predicted face attributes by extracting hand-crafted features from ten face parts. Zhang et al. recognized human attributes by employing hundreds of poselets to align human body parts. These local methods may fail on unconstrained face images with complex variations, which make face localization and alignment difficult. As shown in Fig. ? (a), HOG+SVM fails because the faces or landmarks are wrongly localized or misaligned, so the features are extracted at wrong positions. Recent research shows that face localization and alignment are still not well solved, especially in the wild, although much progress has been achieved in the past decade. Our experimental results confirm this.
This work revisits global methods by proposing a novel deep learning framework, which integrates two CNNs, LNet and ANet, where LNet locates the entire face region and ANet extracts a high-level face representation from the located region. The novelties are in three aspects. Firstly, LNet is trained in a weakly supervised manner: only image-level attribute tags of training images are provided, making data preparation much easier. This is different from training face and landmark detectors, where face bounding boxes and landmark positions are required. LNet is pre-trained by classifying massive general object categories, such that its pre-trained features generalize well when handling large background clutter. LNet is then fine-tuned by attribute tags. We demonstrate that features learned in this way are effective for face localization and can also distinguish subtle differences between human faces and analogous patterns, such as a cat face.
Secondly, ANet extracts discriminative face representation, making attribute recognition from the entire face region possible. ANet is pre-trained by classifying massive face identities and is fine-tuned by attributes. We show that the pre-training step enables ANet to account for complex variations in the unconstrained face images.
Thirdly, within the rough locations of face regions provided by LNet, averaging the predictions of multiple patches can improve the performance. A simple way is to evaluate the feed-forward pass for each single patch. However, this is slow and involves a lot of redundant computation. A novel fast feed-forward scheme is proposed to replace patch-by-patch evaluation. It evaluates images of arbitrary size with only a one-pass feed-forward operation. This becomes non-trivial if the filters are locally shared, yet studies have shown that locally shared filters perform better in face-related tasks. We solve this by proposing an interweaved operation.
Besides proposing new methods, our framework also reveals valuable facts about learning face representations. They not only motivate this work but also benefit future research on faces and deep learning. (1) It shows how pre-training with massive object categories and massive identities can improve feature learning for face localization and attribute recognition, respectively. (2) It demonstrates that although the filters of LNet are fine-tuned by attribute tags, their response maps over the entire image strongly indicate face location. Good features for face localization should be able to capture rich face variations, and more supervised information on these variations improves the learning process. The examples in Fig. ? (a) show that as the number of attributes decreases, the localization capability of the learned neurons is reduced dramatically. (3) ANet is pre-trained with massive face identities. This discloses that the pre-trained high-level hidden neurons of ANet implicitly learn and discover semantic concepts that are related to identity, such as race, gender, and age. It indicates that when a deep model is pre-trained for face recognition, it implicitly learns attributes. The performance of attribute prediction drops without this pre-training stage.
The main contributions are summarized as follows. (1) We propose a novel deep learning framework, which combines massive objects and massive identities to pre-train two CNNs for face localization and attribute prediction, respectively. It achieves state-of-the-art attribute classification results on both the challenging CelebFaces and LFW datasets, improving existing methods by and percent, respectively. (2) A novel fast feed-forward algorithm for CNNs with locally shared filters is devised. (3) Our study reveals multiple valuable facts about learning face representations with deep models. (4) We also contribute a large facial attribute database with more than eight million attribute labels and it is times larger than the largest publicly available dataset.
Extracting hand-crafted features at pre-defined landmarks has become a standard step in attribute recognition. Kumar et al. extracted HOG-like features on various face regions to tackle attribute classification and face verification. To improve the discriminativeness of hand-crafted features for a specific task, Bourdev et al. built a three-level SVM system to extract higher-level information. Deep learning has recently achieved great success in attribute prediction, due to its ability to learn compact and discriminative features. Razavian et al. and Donahue et al. demonstrated that off-the-shelf features learned by CNNs trained on ImageNet can be effectively adapted to attribute classification. Zhang et al. showed that better performance can be achieved by ensembling the learned features of multiple pose-normalized CNNs. The main drawback of these methods is that they rely on accurate landmark detection and pose estimation in both training and testing. Even though a recent work can perform automatic part localization at test time, it still requires landmark annotations of the training data.
Fig. ? illustrates our pipeline where LNet locates the entire face region in a coarse-to-fine manner as shown in (a) and (b), while ANet extracts features for attribute recognition as shown in (c).
Different from existing works that rely on accurate face and landmark annotations, LNet is trained in a weakly supervised manner with only image-level annotations. Specifically, it is pre-trained with one thousand object categories of ImageNet and fine-tuned by image-level attribute tags. The former step accounts for background clutter, while the latter step learns features robust to complex face variations. Learning LNet in this way not only significantly reduces data labeling, but also improves the accuracy of face localization. LNet consists of two cascaded networks, referred to here as the first-stage and the second-stage LNet. Both have network structures similar to AlexNet, whose hyper-parameters are specified in Fig. ? (a) and (b) respectively. The fifth convolutional layer (C5) of the first-stage LNet indicates head-shoulders while C5 of the second-stage LNet indicates faces, as shown by the high-response regions in their averaged response maps. Moreover, the input of the first-stage LNet is the full image, while the input of the second-stage LNet is the head-shoulder region, which is localized by the first-stage LNet and resized.
As illustrated in Fig. ? (c), ANet is learned to predict attributes from the input face region, which is detected by LNet and properly resized. Specifically, multi-view versions of this face region are utilized to train ANet. Furthermore, ANet contains four convolutional layers, where the filters of C1 and C2 are globally shared and the filters of C3 and C4 are locally shared. The effectiveness of local filters has been demonstrated in many face-related tasks. To handle complex face variations, ANet is pre-trained by distinguishing massive face identities, which facilitates the learning of discriminative features.
Fig. ? (d) outlines the procedure of attribute recognition. ANet extracts a set of feature vectors (FCs) by cropping overlapping patches on the face region. An efficient feed-forward algorithm is developed to reduce redundant computation in the feature extraction stage. SVMs are trained to predict attribute values given each FC. The final prediction is obtained by averaging all these values, to cope with small misalignments in face localization.
The cascade of the two LNets accurately localizes face regions after being trained on image-level attribute tags only.
Both LNets are pre-trained with general object categories from the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012, containing million training images and thousands validation images. All the data is employed for pre-training except one third of the validation data, which is held out for choosing hyper-parameters. We augment data by cropping ten patches from each image: one patch at the center, four at the corners, and their horizontal flips. We adopt softmax for object classification, which is optimized by stochastic gradient descent (SGD) with back-propagation (BP). As shown in Fig. ? (a.2), the averaged response map in C5 of LNet already indicates locations of objects including human faces after pre-training.
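The ten-patch augmentation described above can be sketched as follows. This is a minimal illustration in plain Python, assuming an image stored as a list of pixel rows; the function name `ten_crop` and the crop sizes are hypothetical and not taken from the paper.

```python
def ten_crop(image, crop_h, crop_w):
    """Return ten crops: center, four corners, and their horizontal flips.

    `image` is a list of rows, each row a list of pixel values.
    """
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop size exceeds image size")

    top_c, left_c = (height - crop_h) // 2, (width - crop_w) // 2
    # (top, left) offsets: center first, then the four corners.
    offsets = [
        (top_c, left_c),
        (0, 0),
        (0, width - crop_w),
        (height - crop_h, 0),
        (height - crop_h, width - crop_w),
    ]
    crops = []
    for top, left in offsets:
        crop = [row[left:left + crop_w] for row in image[top:top + crop_h]]
        crops.append(crop)
        crops.append([row[::-1] for row in crop])  # horizontal flip
    return crops
```

Each crop is immediately followed by its flipped copy, so the returned list always has ten entries.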
Both LNets are fine-tuned with attribute tags. Additional output layers are added to each LNet for fine-tuning and then removed for evaluation. The first-stage LNet takes the full image as input, while the second-stage LNet takes as input the high-response region in the averaged C5 response map of the first-stage LNet, which roughly corresponds to head-shoulders. The cross-entropy loss is used for attribute classification, $L = -\sum_{i} \left[ y_i \log p(y_i \mid x) + (1 - y_i) \log\left(1 - p(y_i \mid x)\right) \right]$, where $p(y_i \mid x)$ is the probability of the $i$-th attribute given image $x$ and $y_i \in \{0, 1\}$ is the corresponding attribute tag. As shown in Fig. ? (a.3), the response maps after fine-tuning become much cleaner and smoother, indicating that the filters learned with attribute tags can detect face patterns with complex variations. To appreciate the effectiveness of pre-training, we also include in Fig. ? (a.4) the averaged response map in C5 of an LNet trained from scratch with attribute tags but without pre-training. It cannot separate face regions from background and other body parts well.
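As a sketch of the attribute classification loss, the following computes a per-image sum of binary cross-entropy terms over attributes from raw output scores. It assumes a sigmoid output per attribute and uses a numerically stable formulation; the function name and the use of logits as input are illustrative assumptions rather than details given in the paper.

```python
import math


def attribute_cross_entropy(logits, labels):
    """Sum of binary cross-entropy terms over all attributes of one image.

    logits: raw network outputs, one per attribute (assumed sigmoid outputs).
    labels: attribute tags, each 0 or 1.
    """
    loss = 0.0
    for z, y in zip(logits, labels):
        # Stable form of -[y*log(s(z)) + (1-y)*log(1-s(z))], s = sigmoid.
        loss += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return loss
```

For an uninformative score of zero on every attribute, each term equals log 2 regardless of the tag.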
Thresholding and Proposing Windows
We show that the responses of C5 in LNet are discriminative enough to separate faces from background by simply searching for a threshold: a window whose response is larger than this threshold is treated as a face, and otherwise as background. To determine the threshold, we select face images, each of which contains a single face, and background images from the SUN dataset. For each image, EdgeBox is adopted to propose candidate windows, each of which is measured by a score that sums its response values and normalizes the sum by the window size. A larger score indicates that the localized pattern is more likely to be a face. Each image is then represented by the maximum score over all its windows. In Fig. ? (b), the histogram of the maximum scores shows that these scores clearly separate face images from background images. The threshold is chosen as the decision boundary shown in Fig. ? (b). More results are given in Fig. ? (a), showing that the above strategy can precisely localize the face within a single test image. Since each training image contains only a single face, we localize the face region during training using the window with the largest score.
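The window scoring and thresholding described above can be illustrated with a small sketch. It assumes the averaged C5 response map is available as a 2-D list and that candidate windows are supplied by some external proposal method (EdgeBox in the paper); the function names and the window format `(top, left, bottom, right)` are hypothetical.

```python
def window_score(response, window):
    """Sum of responses inside the window, normalized by the window area."""
    top, left, bottom, right = window  # bottom and right are exclusive
    total = sum(sum(row[left:right]) for row in response[top:bottom])
    area = (bottom - top) * (right - left)
    return total / area


def best_window(response, windows):
    """Return (score, window) for the highest-scoring candidate window."""
    return max((window_score(response, w), w) for w in windows)


def is_face_image(response, windows, threshold):
    """An image is called a face image if its best window beats the threshold."""
    score, _ = best_window(response, windows)
    return score > threshold
```

Normalizing by area keeps a large window that merely contains the face from outscoring a tight window around it.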
Pruning Multiple Faces within a single Window
In some challenging test cases, multiple faces are present within a single window, so there may be multiple regions with high responses. We predict attributes based on the face region that generates the largest response.
To understand why rich attribute information enables accurate face localization, one could consider the examples in Fig. ?. If only a single detector is used to classify all the positive and negative samples in Fig. ? (a), it is difficult to handle complex face variations. Therefore, multi-view face detectors were developed: as shown in Fig. ? (b), face images in different views are handled by different detectors. View labels are used in training the detectors, and the whole training set is divided into subsets according to views. If views are treated as one type of face attribute, learning face representation by predicting attributes with deep models actually extends this idea to the extreme. As shown in Fig. ? (c), a filter (or a group of filters) functions as a detector of an attribute. When a subset of neurons is activated, they indicate the existence of face images with a particular attribute configuration. The neurons at different layers can form many activation patterns, implying that the whole set of face images can be divided into many subsets based on attribute configurations, and each activation pattern corresponds to one subset (e.g., ‘pointy nose’, ‘rosy cheek’, and ‘smiling’). Therefore, it is not surprising that filters learned by attributes lead to effective representations for face localization.
As shown in Fig. ? (c) and (d), ANet is learned to extract features and SVM classifiers are used to predict attributes. Specifically, in the pre-training stage, ANet is trained by classifying massive face identities. In the fine-tuning stage, we first extend the localized face region, which is properly resized, by a small factor to incorporate more context information. Then, multiple patches are cropped from the enlarged face region and utilized as inputs of ANet. ANet is fine-tuned by attributes to learn the high-level feature. Furthermore, as shown in Fig. ? (d), each feature vector is used to train an SVM classifier for attribute prediction. The above strategy is similar to multi-view data augmentation, increasing the robustness of attribute recognition. In the testing stage, attributes are predicted by averaging the SVM scores over all the patches.
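The test-time averaging over patches can be sketched as below. The example assumes a linear SVM per attribute, with weights and bias already trained, and thresholds the averaged score at zero; these details and the function names are assumptions for illustration.

```python
def linear_score(weights, bias, feature):
    """Decision value of a linear SVM for one feature vector."""
    return sum(w * x for w, x in zip(weights, feature)) + bias


def predict_attribute(weights, bias, patch_features):
    """Average the SVM scores of all patches and threshold at zero."""
    scores = [linear_score(weights, bias, f) for f in patch_features]
    mean_score = sum(scores) / len(scores)
    return mean_score, mean_score > 0
```

Because a single misaligned patch contributes only one term to the mean, a small localization error is less likely to flip the final decision.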
Pre-training of ANet
We introduce how to learn discriminative features by pre-training ANet with a large number of identities. We select eight thousand face identities from the CelebFaces dataset, where each identity has around twenty images. There are over thousand training images in total. A simple way to train ANet is to classify eight thousand categories with the softmax loss. However, this is challenging because the limited number of samples per identity makes it hard to maintain intra-class invariance. To improve intra-class invariance, we additionally employ a similarity loss, which decreases the distances between samples of the same identity. We have $L_{sim} = \sum_{(i,j):\, l_{ij} = 1} \| f_i - f_j \|_2^2$, where $f_i$ and $f_j$ denote the feature vectors of the $i$-th and $j$-th face images respectively, and $l_{ij} = 1$ indicates that the identities of these samples are the same. In summary, ANet is pre-trained by combining the softmax loss and the similarity loss.
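A minimal sketch of combining an identity classification loss with a same-identity similarity term is given below. It assumes squared Euclidean distance for the similarity term and a simple weighted sum of the two losses; the weight `sim_weight` and all function names are hypothetical, since the text does not specify how the two losses are balanced.

```python
import math


def softmax_loss(logits, target):
    """Negative log-probability of the target class under softmax."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]


def similarity_loss(features, identities):
    """Sum of squared distances over all pairs sharing an identity."""
    loss = 0.0
    n = len(features)
    for i in range(n):
        for j in range(i + 1, n):
            if identities[i] == identities[j]:
                loss += sum((a - b) ** 2
                            for a, b in zip(features[i], features[j]))
    return loss


def pretraining_loss(logits_batch, features, identities, sim_weight=1.0):
    """Softmax loss over the batch plus a weighted similarity term."""
    cls = sum(softmax_loss(z, y) for z, y in zip(logits_batch, identities))
    return cls + sim_weight * similarity_loss(features, identities)
```

Pairs with different identities contribute nothing to the similarity term, so it only pulls same-identity samples together.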
Efficient Feature Extractions
At test time, ANet is evaluated on multiple patches of the face region as shown in Fig. ? (d), leading to redundant convolutional computations because of the large overlaps between these patches. When all the filters are globally shared, the computational cost can be reduced by a shared-convolution scheme, which convolves the filters over the whole input image and then obtains a feature vector for each patch by pooling over the last convolutional layer. Given a simple example with one convolutional layer as shown in Figure 1 (a), the feature vector for each patch (rectangle in red) can be extracted by pooling in the corresponding region of the response map, without evaluating convolutions on the input image patch by patch. Therefore, the convolutions are shared by every patch.
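The sharing of convolutions across patches for a globally shared filter can be checked with a small sketch. It assumes one square filter, valid (unpadded) cross-correlation, and max pooling of each patch's response region into a single value; these choices and the function names are illustrative assumptions.

```python
def correlate_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation on nested lists."""
    kh, kw = len(kernel), len(kernel[0])
    out_h, out_w = len(image) - kh + 1, len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)] for r in range(out_h)]


def crop(grid, top, left, height, width):
    return [row[left:left + width] for row in grid[top:top + height]]


def patch_features_shared(image, kernel, patch_size, corners):
    """Convolve the whole image once, then pool each patch's region."""
    response = correlate_valid(image, kernel)
    span = patch_size - len(kernel) + 1  # response extent of one patch
    feats = []
    for top, left in corners:
        region = crop(response, top, left, span, span)
        feats.append(max(max(row) for row in region))
    return feats


def patch_features_naive(image, kernel, patch_size, corners):
    """Reference: convolve every patch separately."""
    feats = []
    for top, left in corners:
        patch = crop(image, top, left, patch_size, patch_size)
        response = correlate_valid(patch, kernel)
        feats.append(max(max(row) for row in response))
    return feats
```

Both functions return identical features, but the shared version evaluates the filter once per image position instead of once per patch that covers that position.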
However, this scheme is not applicable when we have more than two convolutional layers whose filters are locally shared. An example is illustrated in Figure 1 (b), where each patch is equally divided into nine cells and we learn different filters for different cells. To reduce computation in the first convolutional layer, each local filter can be applied to the entire image, resulting in a response map with nine channels. The final response map is obtained by cropping and padding the regions (rectangles in black) in these nine channels. As a result, each feature vector can be pooled from the final response map, without convolving the input image patch by patch. Nevertheless, since the final response map corresponds to one patch of the input image, the succeeding local convolutions have to be handled patch by patch, leading to redundant computation.
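The first-layer trick described above, applying every local filter to the whole image and then cropping the matching cell from the matching channel, can be sketched as follows. This is a simplified illustration that assumes square cells, one square filter per cell, and valid cross-correlation inside each cell; it covers only this first locally shared layer, not the interweaved operation, and all names are hypothetical.

```python
def correlate_valid(image, kernel):
    """Valid (no padding) 2-D cross-correlation with a square kernel."""
    k = len(kernel)
    out_h, out_w = len(image) - k + 1, len(image[0]) - k + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(k) for j in range(k))
             for c in range(out_w)] for r in range(out_h)]


def crop(grid, top, left, size):
    return [row[left:left + size] for row in grid[top:top + size]]


def full_image_channels(image, filters):
    """One response channel per local filter, computed once per image."""
    return [[correlate_valid(image, f) for f in row] for row in filters]


def local_layer_shared(channels, filters, cell, top, left):
    """Build one patch's local responses by cropping the shared channels."""
    grid = len(filters)  # the patch is split into grid x grid cells
    span = cell - len(filters[0][0]) + 1  # response size of one cell
    return [[crop(channels[a][b], top + a * cell, left + b * cell, span)
             for b in range(grid)] for a in range(grid)]


def local_layer_naive(image, filters, cell, top, left):
    """Reference: apply each cell's own filter to that cell of the patch."""
    grid = len(filters)
    return [[correlate_valid(
                 crop(image, top + a * cell, left + b * cell, cell),
                 filters[a][b])
             for b in range(grid)] for a in range(grid)]
```

The channels are computed once and reused for every patch position, which is where the saving in the first layer comes from.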
To this end, we propose an interweaved operation, which is a fast feed-forward method for CNNs with locally shared filters. Suppose we have four local filters in the next locally convolutional layer and each filter is applied to a group of cells of the response map, as shown in (b). These cells are the receptive fields of the filters. Instead of directly applying the local filters to the response map, the interweaved operation generates an interweaved map for each filter. Each local filter is then applied to its corresponding interweaved map. Since each interweaved map covers the entire image, each local filter is turned into a global filter whose computation can be shared across different patches.
Specifically, each interweaved map is obtained by padding the cells of the corresponding channels in an interweaved manner, as shown in Figure 1 (d). All of the interweaved maps are illustrated in Figure 1 (c). After that, each of the four local filters is applied to its corresponding interweaved map, leading to four response maps. As a result, the feature vector is pooled and concatenated from the receptive fields of the filters, which are the rectangles in black shown in (c).
Intuitively, instead of padding cells according to the receptive fields of all the local filters (as in (b)), which has to be performed patch by patch, the interweaved operation pads the cells with respect to the receptive field of each local filter over the entire image. It enables extracting multiple feature vectors with only one pass of feed-forward evaluation. This operation can be repeated when more locally convolutional layers are added. Empirically, the proposed feature extraction scheme achieves a speedup over patch-by-patch scanning. It is applicable to CNNs with local filters and compatible with all existing CNN operations.
Large-scale Data Collection
We construct two face attribute datasets, namely CelebA and LFWA, by labeling images selected from two challenging face datasets, CelebFaces and LFW. CelebA contains ten thousand identities, each of which has twenty images. There are two hundred thousand images in total. LFWA has images of identities. Each image in CelebA and LFWA is annotated with forty face attributes and five key points by a professional labeling company. CelebA and LFWA have over eight million and five hundred thousand attribute labels, respectively.
CelebA is partitioned into three parts. Images of the first eight thousand identities (with thousand images) are used to pre-train and fine-tune ANet and LNet, and the images of another one thousand identities (with twenty thousand images) are employed to train SVM. The images of the remaining one thousand identities (with twenty thousand images) are used for testing. LFWA is partitioned into half for training and half for testing. Specifically, images are adopted to train SVM and the remaining images for test. When being evaluated on LFWA, LNet and ANet are trained on CelebA.
Methods for Comparisons
The proposed method is compared with three competitive approaches, i.e. FaceTracer, PANDA-w, and PANDA-l. FaceTracer extracts HOG and color histograms in several important functional face regions and then trains SVMs for attribute classification. We extract these functional regions by referring to the ground truth landmark points. PANDA-w and PANDA-l are based on PANDA, which was proposed recently for human attribute recognition by ensembling multiple CNNs, each of which extracts features from a well-aligned human part. These features are concatenated to train SVMs for attribute recognition. It is straightforward to adapt this method to face attributes, since face parts can be well aligned by landmark points. Here, we consider two settings. PANDA-w obtains the face parts by applying state-of-the-art face detection and alignment on wild images, while PANDA-l obtains the face parts by using ground truth landmark points. For fair comparison, all the above methods are trained with the same data as ours.
Effectiveness of the Framework
This section demonstrates the effectiveness of the framework. All experiments in this section are done on CelebA.
We compare LNet with four state-of-the-art face detectors, including DPM, ACF Multi-view, SURF Cascade, and Face++. We evaluate them by using ROC curves when
LNet significantly outperforms LNet (without pre-training) by 74 percent when the overlap ratio equals , which validates the effectiveness of pre-training, as shown in Fig. ?(c). We then explore the influence of the number of attributes on localization. Fig. ?(d) illustrates that rich attribute information facilitates face localization. To examine the generalization ability of LNet, we collect another face images for testing, namely MobileFaces, which comes from a different source
Attribute-specific Regions Discovery
Different attributes capture information from different regions of the face. We show that LNet automatically learns to discover these regions. Given an attribute, by converting the fully connected layers of LNet into fully convolutional layers, we can locate the important region of this attribute. Figure 2 shows some examples. The important regions of some attributes are locally distributed, such as ‘Bags Under Eyes’, ‘Straight Hair’ and ‘Wearing Necklace’, but some are globally distributed, such as ‘Young’, ‘Male’ and ‘Attractive’.
Pre-training Discovers Semantic Concepts
We show that pre-training of ANet can implicitly discover semantic concepts related to face identity. Given a hidden neuron at the FC layer of ANet as shown in Fig. ?(c), we partition the face images into three groups: the face images with high, medium, and low responses at this neuron. The face images of each group are then averaged to obtain the mean face. We visualize these mean faces for several neurons in Fig. ?(a). Interestingly, these mean faces change smoothly from high response to low response, following a high-level concept. Humans can easily assign to each neuron the semantic concept it measures (the text in yellow). For example, the neurons in (a.1) and (a.4) correspond to ‘gender’ and ‘race’, respectively. This reveals that the high-level hidden neurons of ANet can implicitly learn to discover semantic concepts, even though they are only optimized for face recognition using identity information and attribute labels are not used in pre-training. We also observe that most of these concepts are intrinsic to face identity, such as the shape of facial components, gender, and race.
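The mean-face visualization above can be sketched as follows. The example assumes grayscale images stored as nested lists, at least three images, and a split into equal thirds by response rank; the paper does not state how the three groups are delimited, so the split rule and the function names are assumptions.

```python
def mean_image(images):
    """Pixel-wise average of equally sized grayscale images."""
    n = len(images)
    h, w = len(images[0]), len(images[0][0])
    return [[sum(img[r][c] for img in images) / n for c in range(w)]
            for r in range(h)]


def mean_faces_by_response(images, responses):
    """Split images into high/medium/low thirds by response, then average."""
    order = sorted(range(len(images)), key=lambda i: responses[i],
                   reverse=True)
    third = len(order) // 3
    groups = {
        "high": order[:third],
        "medium": order[third:len(order) - third],
        "low": order[len(order) - third:],
    }
    return {name: mean_image([images[i] for i in idx])
            for name, idx in groups.items()}
```

Inspecting how the three averaged images differ is what lets a human attach a concept such as gender to a neuron.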
To better explain this phenomenon, we compare the accuracy of attribute prediction using features at different layers of ANet right after pre-training, namely FC, C4, and C3. The forty attributes are roughly separated into two groups: identity-related attributes, such as gender and race, and identity-non-related attributes, e.g. attributes of expressions, wearing a hat, and wearing sunglasses. We select some representative attributes for each group and plot the results in Fig. ?(a), which shows that FC outperforms C4 and C3 on identity-related attributes, but is relatively weaker on identity-non-related attributes. This is because the top layer FC learns identity features, which are insensitive to intra-personal face variations.
Fine-tuning Expands Semantic Concepts
Fig. ? shows that after fine-tuning, ANet can expand these concepts to more attribute types. Fig. ?(b) visualizes the neurons in the FC layer, which are ranked by their responses in descending order with respect to several test images. Humans can assign a semantic meaning to each of these neurons. We found that a large number of new concepts can be observed. Remarkably, these neurons express diverse high-level meanings and cooperate to explain the test images. The activations of all the neurons are visualized in Fig. ?(b), and they are sparse. In some sense, the attributes present in each test image are explained by a sparse linear combination of these concepts. For instance, the first image is described as “a lady with bangs, brown hair, pale skin, narrow eyes and high cheekbones”, which matches human perception well.
To validate this, we explore how the number of neurons influences attribute prediction accuracies. Best performing neurons for each attribute are identified by sorting corresponding SVM weights. Fig. ?(b) illustrates that only of ANet best performing neurons are needed to achieve of the original performance of a particular attribute.
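Selecting the best performing neurons for an attribute can be sketched as below. It assumes a linear SVM and ranks neurons by the absolute value of their weights; the text only says the weights are sorted, so ranking by magnitude is an assumption, as are the function names.

```python
def top_neurons(svm_weights, k):
    """Indices of the k neurons with the largest absolute SVM weight."""
    ranked = sorted(range(len(svm_weights)),
                    key=lambda i: abs(svm_weights[i]), reverse=True)
    return ranked[:k]


def masked_score(svm_weights, bias, feature, keep):
    """SVM decision value using only the neurons listed in `keep`."""
    return sum(svm_weights[i] * feature[i] for i in keep) + bias
```

Comparing predictions from `masked_score` with the full decision value for growing `k` gives an accuracy curve as a function of the number of neurons kept.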
Automatic Attributes Grouping
Here we show that the weight matrix at the FC layer of ANet can implicitly capture relations between attributes. Each column vector of the weight matrix can be viewed as a decision hyperplane that separates the negative and positive samples of an attribute. By simply applying k-means to these vectors, the clusters show clear grouping patterns, which can be interpreted semantically. As shown in Figure 3, Group #1, Group #2 and Group #4 demonstrate co-occurrence relationships between attributes; for example, ‘Attractive’ and ‘Heavy Makeup’ are highly correlated. Attributes in Group #3 share similar color descriptors, while attributes in Group #6 correspond to certain texture and appearance traits.
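Grouping attributes by clustering the columns of a weight matrix can be sketched with a plain Lloyd's k-means. The example assumes the matrix is stored as rows of neurons by columns of attributes and initializes the centers with the first k columns for determinism; the initialization, iteration count, and function names are illustrative assumptions.

```python
def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def columns(matrix):
    """Column vectors of a matrix given as a list of rows."""
    return [list(col) for col in zip(*matrix)]


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd's k-means; centers start at the first k vectors."""
    centers = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: nearest center for every vector.
        labels = [min(range(k),
                      key=lambda c: squared_distance(v, centers[c]))
                  for v in vectors]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members)
                              for col in zip(*members)]
    return labels, centers
```

Attributes whose columns receive the same label fall into the same group, which can then be inspected for a shared meaning.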
The attribute prediction performance is reported in Table ?. On CelebA, the prediction accuracies of FaceTracer, PANDA-w, PANDA-l, and our LNets+ANet are , , , and percent respectively, while the corresponding accuracies on LFWA are , , , and percent. Our method outperforms PANDA-w by nearly percent. Remarkably, even when PANDA-l is equipped with ground truth bounding boxes and landmark positions, our method still achieves percent gain. The strength of our method is illustrated not only on global attributes, such as “Chubby” and “Young”, but also on fine-grained facial traits, such as “Mustache” and “Pointy Nose”. We also report performance on extended attributes and compare our result with and . The evaluation protocol is the same as . In Table ?, LNets+ANet outperforms them by and percent respectively.
When compared with +ANet, LNets accounts for nearly percentage improvement over using an off-the-shelf face detector. We also experiment with the case of providing ANet with the face region localized by LNets, but without pre-training, denoted as LNets+ANet(w/o). The average accuracies have dropped and percent on CelebA and LFWA, which indicates that pre-training with massive facial identities helps discover semantic concepts.
Performance on LFWA+
To further examine whether the proposed approach can be generalized to unseen attributes, we manually label more attributes for the testing images on LFWA and denote this extended dataset as LFWA+. To test on these attributes, we directly transfer weights learned by deep models to extract features, and only re-train SVMs using one third of the images. LNets+ANet leads to , , and percent average gains over the other three approaches (FaceTracer, PANDA-w, and PANDA-l). It demonstrates that our method learns discriminative face representations and has good generalization ability.
Size of Training Dataset
We compare the attribute prediction accuracy of the proposed method with that of PANDA-l for different sizes of the training dataset. For fair comparison, only the training data of ANet is changed in our method. Figure 5 demonstrates that LNets+ANet still performs well when the dataset is small, whereas the performance of PANDA-l drops significantly.
For a image, LNets takes ms to localize face region while ANet takes ms to extract features on GPU. In contrast, a naïve patch-by-patch scanning needs nearly ms to extract features. Our framework has large potential in real-world applications.
This paper has proposed a novel deep learning framework for face attribute prediction in the wild. With carefully designed pre-training strategies, our method is robust to background clutter and face variations. We devise a new fast feed-forward algorithm for locally shared filters to avoid redundant computation, which enables evaluating images of arbitrary size in real time without normalizing the input size. We have also revealed multiple important facts about learning face representations, which shed light on new directions for face localization and representation learning.
- In CelebFaces and LFW, it is assumed that each image has a “dominant” face, based on which the attribute tags were labeled by users.
- IoU indicates Intersection over Union.
- MobileFaces was collected by normal users with mobile phones, while CelebA and LFWA collected face images of celebrities taken by professional photographers.
- Best performing neurons are different for different attributes.
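For reference, the Intersection over Union (IoU) mentioned in the notes above can be computed as in the following sketch. It assumes axis-aligned boxes given as `(left, top, right, bottom)` coordinates; the box format and function name are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (left, top, right, bottom) boxes."""
    inter_w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    inter_h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    inter = max(inter_w, 0) * max(inter_h, 0)  # zero if boxes are disjoint
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

A localized window would typically be counted as correct when its IoU with the ground truth box exceeds a chosen overlap ratio.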
- Poof: Part-based one-vs.-one features for fine-grained categorization, face verification, and attribute estimation.
T. Berg and P. N. Belhumeur. In CVPR, pages 955–962, 2013.
- Self-taught object localization with deep networks.
A. Bergamo, L. Bazzani, D. Anguelov, and L. Torresani. arXiv preprint arXiv:1409.3964, 2014.
- Describing people: A poselet-based approach to attribute classification.
L. Bourdev, S. Maji, and J. Malik. In ICCV, pages 1543–1550, 2011.
- Deep attribute networks.
J. Chung, D. Lee, Y. Seo, and C. D. Yoo. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 3, 2012.
- Imagenet: A large-scale hierarchical image database.
J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. In CVPR, pages 248–255, 2009.
- Decaf: A deep convolutional activation feature for generic visual recognition.
J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. arXiv preprint arXiv:1310.1531, 2013.
- Liblinear: A library for large linear classification.
R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. JMLR, 9:1871–1874, 2008.
- Describing objects by their attributes.
A. Farhadi, I. Endres, D. Hoiem, and D. Forsyth. In CVPR, pages 1778–1785, 2009.
- Dimensionality reduction by learning an invariant mapping.
R. Hadsell, S. Chopra, and Y. LeCun. In CVPR, volume 2, pages 1735–1742, 2006.
- Spatial pyramid pooling in deep convolutional networks for visual recognition.
K. He, X. Zhang, S. Ren, and J. Sun. In ECCV, pages 346–361, 2014.
- Labeled faces in the wild: A database for studying face recognition in unconstrained environments.
G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
- Imagenet classification with deep convolutional neural networks.
A. Krizhevsky, I. Sutskever, and G. E. Hinton. In NIPS, pages 1097–1105, 2012.
- Facetracer: A search engine for large collections of images with faces.
N. Kumar, P. Belhumeur, and S. Nayar. In ECCV, pages 340–353, 2008.
- Attribute and simile classifiers for face verification.
N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. In ICCV, pages 365–372, 2009.
- Handwritten digit recognition with a back-propagation network.
Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel. In NIPS, 1990.
- Learning surf cascade for fast and accurate object detection.
J. Li and Y. Zhang. In CVPR, pages 3468–3475, 2013.
- Fully convolutional networks for semantic segmentation.
J. Long, E. Shelhamer, and T. Darrell. In CVPR, 2015.
- A deep sum-product architecture for robust facial attributes analysis.
P. Luo, X. Wang, and X. Tang. In ICCV, pages 2864–2871, 2013.
- Two faces are better than one: Face recognition in group photographs.
O. K. Manyam, N. Kumar, P. Belhumeur, and D. Kriegman. In IJCB, pages 1–8, 2011.
- Face detection without bells and whistles.
M. Mathias, R. Benenson, M. Pedersoli, and L. Van Gool. In ECCV, pages 720–735, 2014.
- Is object localization for free?–weakly-supervised learning with convolutional neural networks.
M. Oquab, L. Bottou, I. Laptev, and J. Sivic. In CVPR, pages 685–694, 2015.
- Cnn features off-the-shelf: an astounding baseline for recognition.
A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. arXiv preprint arXiv:1403.6382, 2014.
- Clustering by fast search and find of density peaks.
A. Rodriguez and A. Laio. Science, 344(6191):1492–1496, 2014.
- Exploiting relationship between attributes for improved face verification.
F. Song, X. Tan, and S. Chen. CVIU, 122:143–154, 2014.
- Deep convolutional network cascade for facial point detection.
Y. Sun, X. Wang, and X. Tang. In CVPR, pages 3476–3483, 2013.
- Deep learning face representation by joint identification-verification.
Y. Sun, X. Wang, and X. Tang. In NIPS, 2014.
- Deepface: Closing the gap to human-level performance in face verification.
Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. In CVPR, pages 1701–1708, 2014.
- Sun database: Large-scale scene recognition from abbey to zoo.
J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba. In CVPR, pages 3485–3492, 2010.
- Aggregate channel features for multi-view face detection.
B. Yang, J. Yan, Z. Lei, and S. Z. Li. In IJCB, pages 1–8, 2014.
- Part-based r-cnns for fine-grained category detection.
N. Zhang, J. Donahue, R. Girshick, and T. Darrell. In ECCV, pages 834–849, 2014.
- Panda: Pose aligned networks for deep attribute modeling.
N. Zhang, M. Paluri, M. Ranzato, T. Darrell, and L. Bourdev. In CVPR, 2014.
- Object detectors emerge in deep scene cnns.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. In ICLR, 2015.
- Edge boxes: Locating object proposals from edges.
C. L. Zitnick and P. Dollár. In ECCV, pages 391–405, 2014.