FAME: Face Association through Model Evolution
We attack the problem of learning face models for public faces from weakly-labelled images collected from the web by querying a name. The data is very noisy even after face detection, with several irrelevant faces corresponding to other people. We propose a novel method, Face Association through Model Evolution (FAME), that prunes the data iteratively so that the face models associated with a name evolve. The idea is based on capturing the discriminativeness and representativeness of each instance and eliminating the outliers. The final models are used to classify faces on novel datasets with possibly different characteristics. On benchmark datasets, our results are comparable to or better than state-of-the-art studies for the task of face identification.
To label faces of friends in social networks or of celebrities and politicians in the news, automatic methods are indispensable for managing the large number of face images piling up on the web. However, unlike their counterparts in controlled datasets, faces on the web naturally exhibit all types of challenges, which leaves traditional methods unable to recognise them.
The recent availability of real-world face datasets has accelerated work on web-scale face verification, that is, deciding whether a given pair of faces shares the same identity. On the other hand, identification, that is, finding the identity of a face, is still relatively less studied for real-world faces. The requirement for a considerable number of labeled faces is the main bottleneck for scalability in identification. The continuous inclusion of new individuals, and of new instances for each individual, should also be considered for the web-scale identification task.
In this study, we address the identification of faces of famous people. Famous people tend to change their make-up, hair style/colour, and accessories more often than other people, resulting in a large variety of face images. Moreover, they are likely to appear with others in photographs, causing faces of irrelevant people to be retrieved.
We propose a new method, FAME, that utilises the noisy results obtained through a name query to construct models for identifying famous people. Our models evolve through consecutive iterations to associate the query name with the correct set of faces. These models are then used to label faces on novel datasets. FAME removes the outlier faces when constructing models, while retaining as much diversity as possible. Details of FAME follow the review of recent work in relevant domains.
Naming faces using weakly-labeled data: The work of Berg et al. is one of the first attempts at labelling a large number of faces from weakly-labeled web images, with the “Labeled Faces in the Wild” (LFW) dataset introduced. It is assumed that at most one face in an image can correspond to a name, and names are used as constraints in clustering faces. Appearances of faces are modelled through a Gaussian mixture model with one mixture per name. In one approach, k-PCA is used to reduce the dimensionality of the data and LDA is used for projection. The initial discriminant space, learned from faces with a single associated name, is used for clustering through a modified k-means. Better discriminants are then learned to re-cluster. In another, face-name associations are captured through an EM-based approach.
For aligning names and faces in an (a)symmetric way, Pham et al. cluster the faces using a hierarchical agglomerative clustering method, under the constraint that faces in the same image cannot be in the same cluster. They then use an EM-based approach for aligning names and faces based on the probability of re-occurrence, with a 3D morphable model for face representation. They also introduce picturedness and namedness: the probability of a person being in the picture based on textual information, and of being in the text based on visual information.
Ideally, there should be a single cluster per person. However, these methods are likely to produce clusters with several people mixed in, and multiple clusters for the same person.
Ozkan and Duygulu consider the problem as retrieving faces for a single query name and then pruning the irrelevant faces from the set. A similarity graph is constructed where the nodes are faces and the edges are the similarities between faces. With the assumption that the most similar subset of faces will correspond to the queried name, the densest component in the graph is sought using a greedy method. In later work, this method is improved by introducing the constraint that each image contains a single instance of the queried person, and by replacing the threshold used in constructing the binary graphs with non-zero weights assigned to the k nearest neighbours. The authors further generalised the graph-based method for multi-person naming, as well as null assignments. They propose a min-cost max-flow based approach to optimise face-name assignments under unique matching constraints.
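To make the densest-component idea concrete, the following is a minimal sketch of one common greedy scheme: repeatedly peel off the node with the smallest degree and remember the densest intermediate subgraph. Whether this matches the cited method in every detail is an assumption; the function name `greedy_densest_subgraph` and the unweighted edge-list input are illustrative only.

```python
def greedy_densest_subgraph(edges):
    """Greedy peeling: return (node set, density) of the densest subgraph found.

    Density is taken as (number of edges) / (number of nodes).
    """
    # Build an undirected adjacency map from the edge list.
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if not adj:
        return set(), 0.0

    n_edges = sum(len(nb) for nb in adj.values()) // 2
    best_nodes, best_density = set(adj), n_edges / len(adj)

    while len(adj) > 1:
        # Remove the node with the fewest remaining neighbours.
        u = min(adj, key=lambda x: len(adj[x]))
        for v in adj.pop(u):
            adj[v].discard(u)
            n_edges -= 1
        density = n_edges / len(adj)
        if density > best_density:
            best_nodes, best_density = set(adj), density
    return best_nodes, best_density


# Faces 0-3 are mutually similar; faces 4 and 5 are loosely attached.
similar_pairs = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4), (4, 5)]
nodes, density = greedy_densest_subgraph(similar_pairs)
```

In this toy graph the tightly connected group of four faces is returned, which is the behaviour the pruning step relies on.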
The face-name association problem has also been tackled as a multiple instance learning problem over pairs of bags. The faces detected in an image are put into a bag, and the names detected in the caption are put into the corresponding set of labels. A pair of bags is labeled as positive if they share at least one label, and negative otherwise. The results are reported on the Labelled Yahoo! News dataset, which is obtained by manually annotating and extending the LFW dataset. It has also been shown that the performance of graph-based and generative approaches for text-based face retrieval and face-name association tasks can be improved by incorporating logistic discriminant based metric learning (LDML).
Kumar et al. introduced attribute and simile classifiers for verifying the identity of faces. For describable aspects of visual appearance, binary attribute classifiers are trained with the help of Amazon Mechanical Turk (AMT). Moreover, simile classifiers are trained to recognise the similarity of faces to specific reference people. PubFig, a dataset of public figures on the web, is presented as an alternative to LFW with a larger number of individuals, each having more instances.
Recently, PubFig83, a subset of the PubFig dataset in which near-duplicates are eliminated and individuals with a large number of instances are selected, was provided for the face identification task. Inspired by biological systems, Pinto et al. consider V1-like features and introduce both single- and multi-layer feature extraction architectures followed by a linear SVM classifier.
Becker and Ortiz define the open-universe face identification problem as identifying faces with one of the labeled categories in a dataset that includes distractor faces not belonging to any of the labels. The authors combine PubFig83, as the set of labeled individuals, and LFW, as the set of distractors. On this set, they evaluate a range of identification methods including nearest neighbour, SVM, sparse representation based classification (SRC) and its variants, as well as the linearly approximated SRC that they proposed.
Other recent work includes that of Simonyan et al., where Fisher vectors on densely sampled SIFT features are utilised and large-margin dimensionality reduction is used to reduce the high dimensionality.
Harvesting web for concept learning: Recently, there have been many studies on harvesting the web for re-ranking search results and building qualified training sets. In one study, visual features and the surrounding text are used for collecting animal images from the web, and visual exemplars are obtained through clustering text. Relevant clusters, as well as irrelevant images within clusters, must be identified manually. The OPTIMOL framework incrementally learns object categories from web search results. Given a set of seed images, a non-parametric latent topic model is applied to categorise the collected web images. The model is iteratively updated with the newly categorised images. To prevent over-specialised results, a set of cache images with high diversity is retained at each iteration. In another study, after the removal of abstract images from the search results collected through text and image search, text and metadata are used to re-rank the images. A visual classifier is trained by sampling from the top-ranked images as positives and random images from other categories as negatives. Recently, NEIL was proposed to learn object and scene categories, as well as common sense knowledge, using web search results.
Discovering representative and discriminative instances: Our method is also related to recent studies on discovering discriminative patches. In one such study, discriminative patches in images are discovered through an iterative method which alternates between clustering and training discriminative classifiers. Li et al. solve the same problem with multiple instance learning. Other studies apply the idea to scene images for learning discriminative properties by embracing unsupervised exemplar models. Moreover, the unsupervised learning scheme has been enhanced with a more robust variant of the Mean-Shift clustering algorithm. The discriminative patch idea has also been applied to the video domain.
An important caveat in learning models from weakly-labelled data is the impurity of the collection. For the data to be useful, spurious instances should be eliminated before generating models for each category. In this study, we present a new approach for learning better models by iteratively pruning the data (see Figure 1). First, we benefit from a large number of global negatives representing the rest of the world against the class of interest. Next, among the candidate in-class examples, we try to separate the most confident instances from the others. These two successive steps are repeated to eliminate outlier instances iteratively. To account for intra-class variability, we use a representation that results in high-dimensional feature vectors, making each class linearly separable even when the data include some level of variation. The model evolution and the representation are detailed in the following.
We propose a method that allows the models to evolve by eliminating the outlier instances with successive linear classifiers. First, we learn a hyperplane that separates the initial set of candidate class instances from the large set of global negatives. The global negative set is composed of the instances of other classes and random face images collected from the web. Then, we select a fraction of the class instances that are distant from the separating hyperplane. We use these instances as the discriminative seed set, since they are confidently classified against the rest of the world. We consider the rest of the class data as possible negatives. We then learn another model that tries to capture in-class dissimilarities between the discriminative examples and the possible negatives. At the final step, we combine the confidence scores of the first and the second models. By combining the two scores, which respectively correspond to the confidence of being different from the rest of the world and the in-class affinity of the instance, we obtain a measure of instance saliency. Using these combined scores, we detect the instances with the lowest scores as the outliers for that iteration. These steps are iterated up to a desired level of pruning. The representation that we use (see Section 3.2) might cause a computational burden with complicated learning models. Therefore, we leverage simple logistic regression (LR) models with L1-norm regularisation, which perform sparse feature selection as the learning evolves. Sparsity makes categories more distinct and captures category-related commonalities.
Algorithm ? summarises our data elimination procedure. C refers to the examples collected for a class and N refers to the vast number of global negatives. Each vector is a d-dimensional representation of a single face image. At each iteration t, the first LR model M1 learns a hyperplane between the candidate set of class instances C and the global negatives N. Then the current C is divided into two subsets: the instances in C that are farthest from the hyperplane are kept as the candidate positive set (C+) and the rest are considered as the negative set (C-) for the next model. C+ is the set of salient instances for the class and C- is the set of possible spurious instances. The second LR model M2 uses C+ as the positive set and C- as the negative set to learn the best possible hyperplane separating them. For each instance in C-, we aggregate the confidence values of both models and eliminate the instances with the lowest scores as the outliers. In the following iterations, we run all the steps again and end up with a clean set of class instances.
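The elimination step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the two scores are combined by simple addition, the L1-regularised logistic regression is trained by proximal gradient descent rather than a Gauss-Seidel solver, and the names `fame_iteration`, `keep_fraction`, and `n_remove` are hypothetical. The class set and the global negatives are called `C` and `N` in the code.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))


def train_l1_logreg(X, y, l1=0.01, lr=0.5, epochs=200):
    """L1-regularised logistic regression via proximal gradient descent
    (a stand-in solver chosen for brevity)."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(dot(w, xi) + b) - yi
            grad_b += err / n
            for j in range(d):
                grad_w[j] += err * xi[j] / n
        for j in range(d):
            step = w[j] - lr * grad_w[j]
            # Soft-thresholding applies the L1 penalty and yields sparse weights.
            w[j] = math.copysign(max(abs(step) - lr * l1, 0.0), step)
        b -= lr * grad_b
    return w, b


def fame_iteration(C, N, keep_fraction=0.5, n_remove=1):
    """One pruning step; returns (pruned class set, removed instances)."""
    # First model: candidate class instances versus global negatives.
    m1 = train_l1_logreg(C + N, [1] * len(C) + [0] * len(N))
    s1 = [dot(m1[0], x) + m1[1] for x in C]
    # The most confidently classified instances are the candidate positives;
    # the remaining class instances are the possible negatives.
    order = sorted(range(len(C)), key=lambda i: s1[i], reverse=True)
    n_pos = max(1, int(keep_fraction * len(C)))
    pos, neg = order[:n_pos], order[n_pos:]
    if not neg:
        return list(C), []
    # Second model: candidate positives versus possible negatives.
    m2 = train_l1_logreg([C[i] for i in pos + neg],
                         [1] * len(pos) + [0] * len(neg))
    # Assumption: the two confidence scores are combined by summation.
    combined = {i: s1[i] + dot(m2[0], C[i]) + m2[1] for i in neg}
    removed = set(sorted(combined, key=combined.get)[:n_remove])
    kept = [x for i, x in enumerate(C) if i not in removed]
    return kept, [C[i] for i in sorted(removed)]


# Toy data: a class cluster with one injected outlier, plus global negatives.
rng = random.Random(0)
C = [[rng.gauss(2, 0.5), rng.gauss(2, 0.5)] for _ in range(20)] + [[-2.0, -2.0]]
N = [[rng.gauss(-2, 0.5), rng.gauss(-2, 0.5)] for _ in range(40)]
kept, removed = fame_iteration(C, N)
```

Calling `fame_iteration` repeatedly on its own output gives the iterative pruning loop; on toy data like this, one would expect the injected point to be among the first instances removed.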
This iterative procedure continues until a stopping condition is satisfied. The condition is defined by the training accuracy of the first model (M1), which serves as the measure of present data quality. As we incrementally remove poor instances, we expect better separation against the negative instances, and therefore M1's accuracy increases. However, if the accuracy saturates or degrades, we stop the algorithm. Alternatively, when we have a very large number of class instances, we can divide the data into two independent subsets and apply the iterative elimination to both, measuring the quality of the model learned from one subset on the other subset at each iteration. This is similar to the co-training approach and more robust to over-fitting, although it requires a very large number of instances for convincing results.
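The stopping condition can be expressed as a small check on the history of the first model's training accuracy. The sketch below is illustrative only; the function name `should_stop` and the `min_gain` tolerance are assumptions rather than part of the method's specification.

```python
def should_stop(m1_accuracies, min_gain=0.0):
    """Return True when the latest first-model training accuracy has
    saturated or degraded relative to the previous iteration."""
    if len(m1_accuracies) < 2:
        # Not enough history to judge saturation.
        return False
    return m1_accuracies[-1] - m1_accuracies[-2] <= min_gain
```

A pruning loop would append the accuracy after each iteration and break as soon as `should_stop` returns `True`.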
To represent face images, we learn two distinct sets of filters with an unsupervised method similar to that of Coates et al. (see Figure ?). The first set is learned from random raw-pixel patches extracted from grey-scale images. The second set is learned from LBP encoded images. The first set of learned filters is receptive to edge- and corner-like structural points, and the second set is more sensitive to textural commonalities of the LBP histogram statistics. LBP encoded images are invariant to illumination, since the intensity relations between pixels are considered instead of exact pixel values. We use rotation-invariant LBP encoding, which gives a binary code for each pixel. Then, we convert these binary codes into the corresponding integer values. A Gaussian filter is used to smooth out the heavy-tailed locations.
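As an illustration of why LBP codes depend only on intensity relations, the sketch below computes a rotation-invariant code for one pixel. It is deliberately simplified: it uses 8 neighbours at radius 1 without interpolation, which is an assumption made for brevity, and the function names are hypothetical.

```python
def lbp_bits(img, r, c):
    """Binary pattern of the 8 neighbours of pixel (r, c), listed clockwise
    from the top-left; a bit is 1 when the neighbour is >= the centre."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    centre = img[r][c]
    return [1 if img[r + dr][c + dc] >= centre else 0 for dr, dc in offsets]


def rotation_invariant_value(bits):
    """Smallest integer obtained over all circular shifts of the pattern."""
    values = []
    for s in range(len(bits)):
        rotated = bits[s:] + bits[:s]
        values.append(int("".join(str(b) for b in rotated), 2))
    return min(values)


img = [[5, 9, 1],
       [4, 6, 7],
       [2, 3, 8]]
bits = lbp_bits(img, 1, 1)              # [0, 1, 0, 1, 1, 0, 0, 0]
code = rotation_invariant_value(bits)   # 11, i.e. binary 00001011
```

Adding a constant to every pixel leaves `bits` unchanged, and rotating the neighbourhood only shifts the pattern circularly, so `code` stays the same in both cases.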
The pipeline for learning filters from both raw-pixel and LBP images is as follows. First, we extract a set of randomly sampled patches of the size of the predefined receptive field. Then contrast normalisation is applied to each patch (for raw-image filters only) and patches are whitened to reduce the correlations among dimensions. These patches are clustered using k-means into K groups. We threshold the centroids using box-plot statistics over their activation counts to remove the outlier centroids, which presumably represent background clutter rather than faces. After the learning phase, centroid activations are collected from receptive fields with a small stride. We apply spatial average pooling over five different grids, namely four equal-sized quadrants plus a grid at the centre of the image, since face images exhibit important spatial regularities at the centre. We use the triangular activation function to map each receptive field to the learned centroids. This yields a 5 x K dimensional representation for each face. However, since we use two different sets of filters, each image is finally represented by 2 x 5 x K dimensions. Thresholding the centroid activations provides an implicit removal of outlier patches as well as a salient set of centroids. We use the outlier centroids to eliminate patches at the feature extraction step: patches assigned to outlier centroids are assumed to be irrelevant and are excluded from pooling.
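The encoding and pooling steps can be sketched as follows. The triangular activation is written in its usual form from single-layer feature learning, max(0, mean distance minus distance to the centroid). The size and placement of the centre grid (here a quadrant-sized window centred in the image) is an assumption, as are the function names.

```python
import math


def triangle_activation(patch, centroids):
    """Triangular activation: centroids closer than the average distance
    receive a positive response, all others are set to zero."""
    dists = [math.dist(patch, c) for c in centroids]
    mean_dist = sum(dists) / len(dists)
    return [max(0.0, mean_dist - z) for z in dists]


def average_pool(act_map, r0, r1, c0, c1):
    """Average the K-dimensional activations inside rows r0..r1, cols c0..c1."""
    k = len(act_map[0][0])
    cells = [act_map[r][c] for r in range(r0, r1) for c in range(c0, c1)]
    return [sum(cell[j] for cell in cells) / len(cells) for j in range(k)]


def pool_five_grids(act_map):
    """Concatenate average-pooled activations of four quadrants and a
    centre window (assumed here to be the size of one quadrant)."""
    h, w = len(act_map), len(act_map[0])
    hh, hw = h // 2, w // 2
    regions = [
        (0, hh, 0, hw), (0, hh, hw, w),      # top-left, top-right
        (hh, h, 0, hw), (hh, h, hw, w),      # bottom-left, bottom-right
        (h // 4, h // 4 + hh, w // 4, w // 4 + hw),  # centre window
    ]
    feature = []
    for r0, r1, c0, c1 in regions:
        feature.extend(average_pool(act_map, r0, r1, c0, c1))
    return feature
```

For an activation map with K values per location, `pool_five_grids` returns a vector of length 5 x K, matching the dimensionality stated above for one set of filters.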
Images are collected using Bing to train models. Then, two recent benchmark datasets, FAN-large and PubFig83, are used for testing.
Bing collection: For a given name, 500 images are gathered using Bing image search
The dataset is expanded with horizontally flipped images. Before learning filters from raw-pixel images, each grey-level face image is resized to a height of 60 pixels, and LBP images are resized to a height of 120 pixels. LBP encoding is done with 16 different filter orientations at radius 2. We sample random patches from images and apply contrast normalisation to raw-pixel patches only. Then, we perform the ZCA whitening transform with its regularisation parameter set to 0.5. We use a receptive field of 6x6 regions with a stride of 1 and learn 2400 centroids for both raw-pixel images and LBP encoded images. Hence, we obtain a 2 (raw-pixel + LBP) x 5 (pooling grids) x 2400 (centroids) = 24,000 dimensional feature representation of each image. For instance-to-centroid distances we use the Euclidean distance. We detect the outliers by a threshold at the 99% upper whisker of the centroid activations. Our implementation of the feature learning framework builds upon existing code. For iterative elimination, we train an L1-norm logistic regression model with the Gauss-Seidel algorithm, and final classification is done with a linear SVM through the grafting algorithm, which incrementally learns a sparse set of important features by using gradient information. At each FAME iteration we eliminate five images. We stop when there is no improvement in the accuracy of the first model. If the classifier saturates too quickly, iterations continue until 10% of the instances are pruned. If we encounter memory constraints due to the large number of global negatives, we sample a different set of negative instances at each iteration, which provides slightly different linear boundaries that are able to detect different spurious instances.
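Two of the settings above lend themselves to a quick check: the size of the final representation, and the per-iteration resampling of global negatives. The sketch below is illustrative; the helper names and the seeding scheme are assumptions, and only image heights are used because widths are not stated.

```python
import random


def feature_dimension(n_filter_sets=2, n_grids=5, n_centroids=2400):
    """Length of the concatenated feature vector for one image."""
    return n_filter_sets * n_grids * n_centroids


def receptive_field_positions(size, field=6, stride=1):
    """Number of receptive-field positions along one image axis."""
    return (size - field) // stride + 1


def sample_negatives(negatives, k, iteration, seed=0):
    """Draw a subset of the global negatives for the given iteration.
    A per-iteration seed keeps each draw reproducible."""
    rng = random.Random(seed + iteration)
    return rng.sample(negatives, min(k, len(negatives)))


dim = feature_dimension()                      # 2 * 5 * 2400 = 24000
rows_raw = receptive_field_positions(60)       # 55 positions for 60-pixel height
rows_lbp = receptive_field_positions(120)      # 115 positions for 120-pixel height
```

Because each iteration sees a different subset of negatives, the first model's decision boundary shifts slightly between iterations, which is what allows different spurious instances to surface.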
We conduct controlled experiments on PubFig83+LFW. We select classes with at least 270 instances and inject 10% noise (27 instances) into each. Six classes satisfy this criterion. Noisy images are randomly chosen from global negatives consisting of the “distract” set of PubFig83+LFW and the FAN-large faces that we collected. As a result, we have 297x6 training instances. We apply FAME to this data, applying cross-validation among these six classes at each iteration step.
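The noise injection used in this controlled setting can be sketched as below. This is only an illustration of the bookkeeping: the function name `inject_noise` and the returned ground-truth flags are assumptions introduced here so that eliminated instances can later be checked against the known noise.

```python
import random


def inject_noise(class_instances, global_negatives, noise_ratio=0.10, seed=0):
    """Append randomly chosen global negatives to a class.

    Returns the noisy set and a parallel list of flags that are True for
    genuine class instances and False for injected noise.
    """
    rng = random.Random(seed)
    n_noise = round(noise_ratio * len(class_instances))
    noise = rng.sample(global_negatives, n_noise)
    is_genuine = [True] * len(class_instances) + [False] * n_noise
    return class_instances + noise, is_genuine


# 270 genuine instances plus 10% noise gives 297 training instances per class.
genuine = [("face", i) for i in range(270)]
negatives = [("other", i) for i in range(1000)]
noisy_set, is_genuine = inject_noise(genuine, negatives)
```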
Figure ? helps to visualise the model evolution in FAME. As shown on the left, at each iteration the dataset is divided into candidate positives and possible negatives: candidate positives are selected as the most representative instances of the class, and true outliers are found among the possible negatives. As shown on the right, FAME is able to learn models from noisy weakly-labelled sets while eliminating the outliers at successive iterations for a variety of people.
As Figure ?-(a) shows, more outliers are eliminated as the number of iterations increases. Although some correct instances are also eliminated, their ratio is very low compared to the spurious instances. Moreover, our observations show that the eliminated positive examples are usually of poor quality, and therefore their elimination from the final model is not harmful but rather helpful, as supported by the results in Figure ?-(b). As seen in Figure ?-(c), we can achieve up to 75.2 on FAN-Large (EASY) and 79.8 on PubFig83 by removing one outlier at each iteration; we prefer to eliminate five outliers per iteration for efficiency.
We compare FAME with a baseline method that learns models from the raw collection gathered by querying the name, without any pruning. As seen in Table ?, with a one-vs-all L1-norm linear SVM model on the raw data, the performance is very low on all datasets. Note that, on the FAN-Large EASY and ALL datasets, as well as PubFig83, we learn the models from web images and test them on these novel datasets for the same categories. We also divide the collected Bing images into two subsets to test the effect of training and testing on the same type of dataset. FAME gives encouraging results, with a significant improvement over the baseline, even though the models are susceptible to the domain shift problem.
The data handling approach most similar to ours is the method of Singh et al., although there are important differences. First, their method clusters the data to capture intra-cluster variance and uncover the representative instances. However, it requires the optimal number of clusters to be decided in advance and divides the problem into multiple homologous pieces which need to be solved separately. This increases the complexity of their system.
The second difference lies in the philosophy. They aim to discover a representative and discriminative set of instances, whereas we aim to prune spurious ones. Hence, they need to keep all of the vast number of negative instances in memory, whereas we can sample different subsets of global negatives and find the corresponding outlier instances. This provides a faster and easier way of pruning data. They divide each class into two sets and apply their scheme by interchanging the data after each iteration, as in co-training. However, co-training demands a large number of instances for reliable results. In our method, we prefer to use all the class data at once. We evaluate the method of Singh et al. on the same datasets and show that FAME is superior to their method (see Table ?). We use the code released by Singh et al. with the upper-limit settings that our resources allow.
To test the effectiveness of the proposed logistic regression (LR) based model learning, we also compare our results with those obtained using only the first model, M1 (FAME-M1), and using an SVM for classification (FAME-SVM). As shown in Table ?, all FAME models outperform the baseline method as well as the method of Singh et al., with a large improvement for the proposed LR model.
|-||Bing||FAN-Large (EASY)||FAN-Large (ALL)||PubFig83|
Finally, we compare the performance of FAME on the benchmark PubFig83 dataset with other state-of-the-art studies on face identification. In this case, unlike the previous experiments where we learned the models from noisy images, we learn the models from the same dataset in order to make a fair comparison. As seen in Table ?, FAME achieves the best accuracy in this setting. Referring back to Table ?, even in the domain adaptation setting, where the model is learned from noisy web images, our results are comparable to the most recent studies on face identification that train and test on the same dataset. Note that the method of Pinto et al. is similar to our classification pipeline, but we prefer to learn the filters in an unsupervised way with the method of Coates et al.
|Method||Pinto et al. (S)||Pinto et al. (M)||face.com||Becker and Ortiz|
5 Summary and future work
We propose a novel method for pruning the web images collected for a query in order to learn models for classification on novel datasets. We rely on a large number of negative instances to select a set of good instances, which are then used to learn models that eliminate the bad ones. The proposed method outperforms the baseline and is comparable to state-of-the-art methods, even with the difficulties of domain adaptation. Although the proposed method is tested on the identification of faces, it is a general method that could be used in other domains, which we aim to address in future work.
- Face description with local binary patterns: Application to face recognition.
Timo Ahonen, Abdenour Hadid, and Matti Pietikainen. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(12):2037–2041, 2006.
- Evaluating open-universe face identification on the web.
Brian C Becker and Enrique G Ortiz. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2013 IEEE Conference on, pages 904–911. IEEE, 2013.
- Finding iconic images.
Tamara L Berg and Alexander C Berg. In Computer Vision and Pattern Recognition Workshops, 2009.
- Who’s in the picture?
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, and David A. Forsyth.
- Names and faces in the news.
Tamara L. Berg, Alexander C. Berg, Jaety Edwards, Michael Maire, Ryan White, Yee Whye Teh, Erik Learned-Miller, and David A. Forsyth.
- Animals on the web.
Tamara L. Berg and David A. Forsyth.
- Neil: Extracting visual knowledge from web data.
Xinlei Chen, Abhinav Shrivastava, and Abhinav Gupta. In International Conference on Computer Vision (ICCV), 2013.
- An analysis of single-layer networks in unsupervised feature learning.
Adam Coates, Andrew Y Ng, and Honglak Lee. In International Conference on Artificial Intelligence and Statistics, pages 215–223, 2011.
- Mid-level visual element discovery as discriminative mode seeking.
Carl Doersch, Abhinav Gupta, and Alexei A Efros. In Advances in Neural Information Processing Systems, pages 494–502, 2013.
- What makes paris look like paris?
Carl Doersch, Saurabh Singh, Abhinav Gupta, Josef Sivic, and Alexei A Efros. ACM Transactions on Graphics (TOG), 31(4):101, 2012.
- Learning collections of part models for object recognition.
Ian Endres, Kevin J Shih, Johnston Jiaa, and Derek Hoiem. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 939–946. IEEE, 2013.
- Harvesting large-scale weakly-tagged image databases from the web.
Jianping Fan, Yi Shen, Ning Zhou, and Yuli Gao. In Computer Vision and Pattern Recognition (CVPR), 2010.
- Learning object categories from google’s image search.
Robert Fergus, Li Fei-Fei, Pietro Perona, and Andrew Zisserman. In International Conference on Computer Vision (ICCV), 2005.
- Automatic Face Naming with Caption-based Supervision.
Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. In Computer Vision and Pattern Recognition (CVPR).
- Face recognition from caption-based supervision.
Matthieu Guillaumin, Thomas Mensink, Jakob Verbeek, and Cordelia Schmid. International Journal of Computer Vision, 96(1):64–82, January 2012.
- Is that you? Metric learning approaches for face identification.
Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. In International Conference on Computer Vision (ICCV 2009), 2009.
- Multiple instance metric learning from automatically labeled bags of faces.
Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. In European Conference on Computer Vision (ECCV) , 2010.
- Labeled faces in the wild: A database for studying face recognition in unconstrained environments.
Gary B Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Technical report, Technical Report 07-49, University of Massachusetts, Amherst, 2007.
- Representing videos using mid-level discriminative patches.
Arpit Jain, Abhinav Gupta, Mikel Rodriguez, and Larry S Davis. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 2571–2578. IEEE, 2013.
- Blocks that shout: Distinctive parts for scene classification.
Mayank Juneja, Andrea Vedaldi, CV Jawahar, and Andrew Zisserman. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 923–930. IEEE, 2013.
- Unsupervised detection of regions of interest using iterative link analysis.
Gunhee Kim and Antonio Torralba. In NIPS, volume 1, pages 4–2, 2009.
- Attribute and Simile Classifiers for Face Verification.
N. Kumar, A. C. Berg, P. N. Belhumeur, and S. K. Nayar. In International Conference on Computer Vision (ICCV), 2009.
- Optimol: automatic online picture collection via incremental model learning.
Li-Jia Li and Li Fei-Fei. International journal of computer vision, 88(2):147–168, 2010.
- Harvesting mid-level visual concepts from large-scale internet images.
Quannan Li, Jiajun Wu, and Zhuowen Tu. CVPR, 2013.
- Gray scale and rotation invariant texture classification with local binary patterns.
Timo Ojala, Matti Pietikäinen, and Topi Mäenpää. In Computer Vision-ECCV 2000, pages 404–420. Springer, 2000.
- Face recognition for web-scale datasets.
Enrique G. Ortiz and Brian C. Becker. Computer Vision and Image Understanding, 118:153–170, January 2014.
- A large-scale database of images and captions for automatic face naming.
Mert Özcan, Jie Luo, Vittorio Ferrari, and Barbara Caputo. In BMVC, pages 1–11, 2011.
- A graph based approach for naming faces in news photos.
Derya Ozkan and Pinar Duygulu.
- Interesting faces: A graph based approach for finding people in news.
Derya Ozkan and Pinar Duygulu. Pattern Recognition, 43(5):1717–1735, May 2010.
- Grafting: Fast, incremental feature selection by gradient descent in function space.
Simon Perkins, Kevin Lacker, and James Theiler. The Journal of Machine Learning Research, 3:1333–1356, 2003.
- Cross-media alignment of names and faces.
P.T. Pham, M.F. Moens, and T. Tuytelaars. IEEE Transactions on Multimedia, 12(1):13–27, 2010.
- Scaling up biologically-inspired computer vision: A case study in unconstrained face recognition on facebook.
Nicolas Pinto, Zak Stone, Todd Zickler, and David Cox. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2011 IEEE Computer Society Conference on, pages 35–42. IEEE, 2011.
- Harvesting image databases from the web.
Florian Schroff, Antonio Criminisi, and Andrew Zisserman. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(4):754–766, 2011.
- A simple and efficient algorithm for gene selection using sparse logistic regression.
Shirish Krishnaj Shevade and S Sathiya Keerthi. Bioinformatics, 19(17):2246–2253, 2003.
- Fisher Vector Faces in the Wild.
K. Simonyan, O. M. Parkhi, A. Vedaldi, and A. Zisserman. In British Machine Vision Conference (BMVC), 2013.
- Unsupervised discovery of mid-level discriminative patches.
Saurabh Singh, Abhinav Gupta, and Alexei A. Efros. In European Conference on Computer Vision (ECCV), 2012.
- Face detection, pose estimation, and landmark localization in the wild.
Xiangxin Zhu and Deva Ramanan. In Computer Vision and Pattern Recognition (CVPR), 2012.