Merge or Not? Learning to Group Faces via Imitation Learning
Abstract
Given a large number of unlabeled face images, face grouping aims at clustering the images into the individual identities present in the data. This task remains challenging despite the remarkable capability of deep learning approaches in learning face representations. In particular, grouping results can still be egregious given profile faces and a large number of uninteresting faces and noisy detections. Often, a user needs to correct the erroneous grouping manually. In this study, we formulate a novel face grouping framework that learns a clustering strategy from ground-truth simulated behavior. This is achieved through imitation learning (a.k.a. apprenticeship learning or learning by watching) via inverse reinforcement learning (IRL). In contrast to existing clustering approaches that group instances by similarity, our framework makes sequential decisions to dynamically decide when to merge two face instances/groups, driven by short- and long-term rewards. Extensive experiments on three benchmark datasets show that our framework outperforms unsupervised and supervised baselines.
1 Introduction
Face grouping is an actively researched computer vision problem due to its enormous potential in commercial applications. It not only allows users to organize and tag photos based on faces but also lets them retrieve and revisit a huge quantity of relevant images effortlessly.
The performance of face grouping benefits significantly from the recent emergence of deep learning approaches [5, 24, 28, 30, 33, 37]. Nevertheless, we still observe some challenges when we apply existing methods to real-world photo albums. In particular, we found that deeply learned representations can still perform poorly given profile faces and false detections. In addition, there is no obvious mechanism to disambiguate the large quantity of non-interested faces (faces that we do not want to group, e.g., faces in the background; the term was popularized by earlier work in face clustering [47]) that are captured under the same conditions as the persons of interest. We provide an illustrative example in Fig. 1, whose results were obtained from the Inception-v3 model [32] fine-tuned with MS-Celeb-1M [13] images with face identity. Although the model achieves an accuracy of 99.27% on LFW [14], which is on par with the accuracy reported by a state-of-the-art method [37], its performance on the open-world face grouping task is unsatisfactory. We attempted to adapt the deep model with open-world albums [45] but with limited success. We show experimental results in Sec. 5. Learning such an open-world model is still far from being solved due to highly imbalanced data (far more frontal faces than profile instances in existing datasets) and a large negative space to cover.
Humans tend to execute a visual grouping task in sequence, with intermediate decisions governing the next step, like solving a jigsaw puzzle [42] with pieces of varying visual complexity. First, we link pieces with strong correlation and high confidence, then gain insights and accumulate visual evidence from these stable clusters. Consequently, a larger group can be formed by merging ambiguous positives and discarding uninteresting outliers. In the process, we may exploit contextual cues and the global picture by considering other samples.
The above intuition motivates a novel face grouping framework. Our goal is not to design a better deep representation, but to learn to make better merging/not-merging decisions from an expert's demonstration using an existing representation. In particular, we wish to introduce intermediate sequential decisions between the clustering steps, i.e., when to merge two samples or groups given the dynamic context. Towards this goal, we assume different clustering states, where the states differ in their current partitions of data. At each time step, an agent chooses between two possible actions, i.e., to merge or not to merge a pair of face groups. The process responds at the next time step by moving to a new state and providing a reward to the agent. A sequence of good actions leads to a higher cumulative reward than suboptimal decisions.
Learning a decision strategy in our problem is non-trivial. In particular, the decision process is adversely affected by uninteresting faces and noisy detections. Defining a reward function for face grouping is thus not straightforward, since it needs to consider the similarity of faces, group consistency, and the quality of images. In addition, we need to consider the operation cost involved, i.e., the manual human effort spent on adding or removing a photo from a group. It is hard to determine the relative weights of these terms a priori. This is in contrast to the (first-person) imitation learning setting, in which the reward is usually assumed known and fixed, e.g., using the change of game score [20].
Contributions: We make the following contributions to overcome the aforementioned challenges:
1) We formulate a novel face grouping framework based on imitation learning (IL) via inverse reinforcement learning [21, 27]. To our knowledge, this is the first attempt to address visual clustering via inverse reinforcement learning. Once learned, the policy can be transferred to unseen photo albums with good generalization performance.
2) We assume the reward is an unknown to be ascertained through learning by watching an expert’s behavior. We formulate the learning such that both short- and long-term rewards are considered. The former considers the similarity, consistency, and quality of local candidate clusters, whereas the latter measures the operation cost to get from an arbitrary photo partition to the final ground-truth partition. The new reward system effectively handles the challenges of profile, noisy, and uninteresting faces, and works well with conventional face similarity under an open-world context.
3) We introduce a large-scale dataset called Grouping Faces in the Wild (GFW) to facilitate research on real-world photo grouping. The new dataset contains 78,000 faces of 3,132 identities collected from a social network. This dataset is realistic, providing a large number of uninteresting faces and noisy detections.
Extensive experiments are conducted on three datasets, namely, LFW simulated albums, the ACCIO dataset (Harry Potter movies) [12], and the GFW dataset introduced by us. We show that the proposed method can be adapted to a variety of clustering algorithms, from conventional k-means and hierarchical clustering to the more elaborate graph degree linkage (GDL) approach [44]. We show that it outperforms a number of unsupervised and supervised baselines.
2 Related Work
Face Grouping: Traditional face clustering methods [4, 18, 22, 47] are usually purely data-driven and unsupervised. They mainly focus on finding a good distance metric between faces or effective subspaces for face representation. For instance, Zhu et al. [47] propose a rank-order distance that measures the similarity between two faces using their neighboring information. Fitzgibbon and Zisserman [9] further develop a joint manifold distance (JMD) that measures the distance between two subspaces, each of which is invariant to a desired group of transformations. Zhang et al. [44] propose agglomerative clustering on a directed graph to better capture the global manifold structures of face data. There exist techniques that employ user interactions [35], extra information on the web [3], and prior knowledge of family photo albums [39]. Deep representation has recently been found effective for face clustering [28], and large-scale face clustering has been attempted [23]. Beyond image-based clustering, most existing video-based approaches employ pairwise constraints derived from face tracklets [6, 38, 41, 45] or other auxiliary information [8, 34, 46] to facilitate face clustering in video. The state-of-the-art method by Zhang et al. [45] adapts the DeepID2+ model [31] to a target domain with joint face representation adaptation and clustering.
In this study, we focus on image-based face grouping without temporal information. Our method differs significantly from existing methods [45] that cluster instances by deep representation alone. Instead, our method learns from experts to make sequential decisions on grouping, considering both short- and long-term rewards. It is thus capable of coping with uninteresting faces and noisy detections effectively.
Clustering with Reinforcement Learning: There exist some pioneering studies that explored clustering with RL. Likas [19] models the decision process of assigning a sample from a data stream to a prototype, e.g., cluster centers produced by online K-means. Barbakh and Fyfe [2] employ RL to select a better initialization for K-means. Our work differs from the aforementioned studies: (1) [2, 19] are unsupervised, e.g., their loss is related to the distance from the data to a cluster prototype. In contrast, our framework guides an agent with a teacher’s behavior. (2) We consider a decision that extends more flexibly to merging arbitrary instances or groups. We also investigate a novel reward function and new mechanisms to deal with noise.
Imitation Learning: Ng and Russell [21] introduced the concept of inverse reinforcement learning (IRL), which is also known as imitation learning or apprenticeship learning [1]. The goal of IRL is to find a reward function that explains the observed behavior of an expert who acts according to an unknown policy. Inverse reinforcement learning is useful when a reward function is multivariate, i.e., consists of several reward terms whose relative weights are unknown a priori. Imitation learning has been shown to be effective when the supervision of a dynamic process is obtainable, e.g., in robotic navigation [1], activity understanding and forecasting [16], and visual tracking [40].
3 Overview
An illustration of the proposed framework is given in Fig. 2. We treat grouping as a sequential process. In each step during test time, two candidate groups, $C_i$ and $C_j$, are chosen. Without loss of generality, a group can be formed by just a single instance. Given the two groups, we extract meaningful features to characterize their similarity, group consistency, and image quality. Based on the features, an agent then performs an action, which can be either i) merging the two groups, or ii) not merging the two groups. Once the action is executed, the grouping proceeds to select the next pair of groups. The merging stops when no further candidate groups can be chosen, e.g., when the distance between every remaining pair of groups is higher than a predefined threshold. Next, we define some key terminologies.
Recommender: At each time step we pick and consider the merging of two face groups. The action space is large, with a complexity of $O(N^2)$, where $N$ is the number of groups. This adds hurdles to both the learning and test stages. To make our approach scalable, we employ a recommender, $M$, which recommends two candidate clusters $C_i$ and $C_j$ at each time step. This reduces the action space to a binary problem, i.e., to merge or not to merge a pair of face groups. A recommender can be derived from many classic clustering algorithms, especially agglomerative algorithms such as hierarchical clustering (HC), rank-order clustering [47], and the GDL approach [44]. For instance, a hierarchical-clustering-based $M$ always suggests the two clusters that are nearest by some distance metric. In Sec. 5, we perform rigorous evaluations on plausible choices of a recommender.
State: Each state $s_t = (h_t, H_t)$ contains the current grouping partition $h_t$ and the recommender history $H_t$ at time step $t$. In each discrete state, the recommender $M$ recommends a pair of clusters $(C_i, C_j)$ based on the current state.
Action: An action is denoted as $a$. An agent can execute two possible actions, i.e., merge two groups or not. That is, the action set is defined as $A = \{\text{merge}, \text{not merge}\}$, and $a \in A$.
Transition: If a merging action is executed, the candidate groups $C_i$ and $C_j$ are merged. The corresponding partition is updated as $h_{t+1} = (h_t \setminus \{C_i, C_j\}) \cup \{C_i \cup C_j\}$. Otherwise, the partition remains unchanged, $h_{t+1} = h_t$. The candidate information is appended to the history $H_t$ so that the same pair will not be recommended again by $M$. The transition is thus represented as $s_{t+1} = T(s_t, a)$, where $T$ denotes the transition function, $s_t = (h_t, H_t)$, and $s_{t+1} = (h_{t+1}, H_{t+1})$.
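The test-time process described in this section can be sketched as a short loop. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`nearest_pair`, `group_faces`, `single_link`), the single-linkage distance, the scalar "faces", and the thresholded stand-in policy are all hypothetical. The recommender proposes the nearest pair not yet in the history, the agent answers merge or not merge, and the partition and history are updated accordingly.

```python
from itertools import combinations


def nearest_pair(partition, history, distance):
    """HC-style recommender (assumed): closest pair of groups not proposed before."""
    best = None
    for i, j in combinations(range(len(partition)), 2):
        key = frozenset((frozenset(partition[i]), frozenset(partition[j])))
        if key in history:
            continue  # this exact pair of groups was already recommended
        d = distance(partition[i], partition[j])
        if best is None or d < best[0]:
            best = (d, i, j, key)
    return best


def group_faces(faces, distance, policy, max_distance):
    partition = [[f] for f in faces]   # initial partition: one group per face
    history = set()                    # recommender history
    while True:
        candidate = nearest_pair(partition, history, distance)
        if candidate is None or candidate[0] > max_distance:
            break                      # no further candidate pair can be chosen
        _, i, j, key = candidate
        history.add(key)
        if policy(partition[i], partition[j]) == "merge":
            merged = partition[i] + partition[j]
            partition = [g for k, g in enumerate(partition) if k not in (i, j)]
            partition.append(merged)
    return partition


def single_link(a, b):
    """Toy distance between groups of scalar 'faces' (assumption for the demo)."""
    return min(abs(x - y) for x in a for y in b)


faces = [0, 1, 2, 50, 51]
threshold_policy = lambda a, b: "merge" if single_link(a, b) < 5 else "not merge"
groups = group_faces(faces, single_link, threshold_policy, max_distance=10)
```

In the framework itself the policy is learned from expert demonstrations; the threshold rule above merely stands in for it so that the loop can be run end to end.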
4 Learning Face Grouping by Imitation
The previous section explains the face grouping process at test time. An agent is used to determine the right action at each step, i.e., merging or not merging a pair of groups. To learn an agent with the desired behavior, we assume access to demonstrations by some expert. In our study, we obtain these demonstrations from a set of training photo albums for which the ground-truth partition of the photos is known. Consequently, given any two candidate groups, $C_i$ and $C_j$, we know whether merging them is a correct action or not. These ground-truth actions represent the pseudo expert’s behavior.
Towards the goal of learning an agent from the expert’s behavior, we perform the learning in two stages: (1) we find a reward function to explain the behavior via inverse reinforcement learning [21], (2) with the learned reward function we find a policy that maximizes the cumulative rewards.
Formally, let $R(s, a)$ denote the reward function, which rewards the agent after it executes action $a$ in state $s$, and let $T$ be the set of state transition probabilities upon taking action $a$ in state $s$. For any policy $\pi$, a value function $V^{\pi}$ evaluates the value of a state as the total amount of reward an agent can expect to accumulate over the future, starting from that state, $s_0$,
$$V^{\pi}(s_0) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, \pi\right] \tag{1}$$
where $\gamma$ is a discount factor.
An action-value function $Q^{\pi}$ is used to judge the value of actions, according to
$$Q^{\pi}(s, a) = R(s, a) + \gamma V^{\pi}(s'), \quad s' = T(s, a) \tag{2}$$
where the notation $s' = T(s, a)$ represents the transition to state $s'$ after taking action $a$ at state $s$. Our goal is to first uncover the reward function $R$ from the expert’s behavior, and then find a policy $\pi$ that maximizes $Q^{\pi}(s, a)$.
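To make the role of the discount factor concrete, the following minimal sketch evaluates the discounted sum of rewards along one observed trajectory and the one-step action value. The function names are hypothetical, and the sketch assumes a single deterministic trajectory rather than an expectation over transitions. With a discount factor of zero, only the immediate reward counts, which corresponds to a myopic agent.

```python
def discounted_return(rewards, gamma):
    """Discounted sum of rewards along one trajectory: sum of gamma**t * r_t."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))


def action_value(immediate_reward, next_state_value, gamma):
    """One-step action value: reward now plus the discounted value of the next state."""
    return immediate_reward + gamma * next_state_value


far_sighted = discounted_return([1.0, 1.0, 1.0], gamma=0.5)   # 1 + 0.5 + 0.25
myopic = discounted_return([2.0, 5.0], gamma=0.0)             # only the first reward counts
```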
Rewards: In our study, the reward function that we wish to learn consists of two terms, denoted as
$$R = R_{\text{short}} + \beta R_{\text{long}} \tag{3}$$
The first and second terms correspond to the short- and long-term rewards, respectively. The parameter $\beta$ balances the scale of the two terms. The short-term reward is multivariate. It considers how strongly two instances/groups should be merged locally, based on face similarity, group consistency, and face quality. The long-term reward captures a more far-sighted clustering strategy by measuring the operation cost to get from an arbitrary photo partition to the final ground-truth partition. Note that at test time, the long-term reward function is absorbed into our learned action-value function $Q^{\pi}$ for a policy $\pi$, thus no ground truth is needed during testing. We explain the short- and long-term rewards as follows.
4.1 Short-Term Reward
Before a human user decides on a merge between any two face groups, he/she will determine how close the two groups are in terms of face similarity. In addition, he/she may consider the quality and consistency of the images in each group to prevent any accidental merging of uninteresting faces and noisy detections. We wish to capture such behavior through learning a reward function.
The reward is considered short-term since it only examines the current partition of groups. Specifically, we compute the similarity between the two groups, the quality of each group, and the photo consistency within each group as a feature vector $\phi(s)$, and we project this feature into a scalar reward,
$$R_{\text{short}}(s, a) = y(a)\left(w^{\top}\phi(s) + b\right) \tag{4}$$
where $y(a) = 1$ if action $a = \text{merge}$, and $y(a) = -1$ if $a = \text{not merge}$. Note that we assume the actual reward function is unknown, so its parameters $w$ and $b$ should be learned through IRL. We observe that a powerful reward function can be learned through IRL: an agent can achieve a competitive result even by myopically deciding based on the one-step reward rather than on multiple steps. We will show that optimizing $w$ is equivalent to learning a hyperplane in a support vector machine (SVM) (Sec. 4.3).
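The projection of the feature vector into a signed scalar reward can be sketched as follows. This is a minimal sketch assuming a linear projection with a weight vector and a bias, where the sign is +1 for merge and -1 for not merge; the function names and the one-dimensional toy feature are hypothetical and are not taken from the paper.

```python
def short_term_reward(features, weights, bias, action):
    """Signed linear projection of the feature vector (assumed linear form)."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    sign = 1.0 if action == "merge" else -1.0
    return sign * score


def myopic_action(features, weights, bias):
    """Pick the action with the larger one-step reward."""
    return max(("merge", "not merge"),
               key=lambda a: short_term_reward(features, weights, bias, a))


# Toy feature: a single distance between two groups; a small distance favors merging.
weights, bias = [-1.0], 0.5
close_pair = myopic_action([0.2], weights, bias)
far_pair = myopic_action([0.9], weights, bias)
```

Under this form, choosing the action with the larger reward is the same as checking which side of the hyperplane the feature vector falls on, which is why the learning can be cast as a binary classification problem.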
Next, we describe how we design the feature vector $\phi(s)$, which determines the characteristics an agent should examine before making a group merging decision. The feature vector is extracted from the candidate groups, the representations of all faces in the groups, and the current partition.
The proposed feature vector contains three kinds of features, so as to capture face similarity, group consistency, and image quality. All face representations are extracted from the Inception-v3 model [32] fine-tuned with MS-Celeb-1M [13]. More elaborate features can be considered given the flexibility of the framework.
Face Similarity: We compute a multi-dimensional similarity vector to describe the relationship between two face groups $C_i$ and $C_j$. Specifically, we first define the distance between the representations of two arbitrary faces $x_i^{u}$ and $x_j^{v}$ as $d(x_i^{u}, x_j^{v})$. The subscript on $x$ indicates its group. In this study, we define the distance function as the angular distance. We then start from $C_i$: for a face $x_i^{u}$ in $C_i$, we compute its distance to all the faces in $C_j$ and select the median of the resulting distances. That is,
$$d_{\text{med}}(x_i^{u}, C_j) = \operatorname{median}\left\{ d(x_i^{u}, x_j^{1}), \ldots, d(x_i^{u}, x_j^{n_j}) \right\} \tag{5}$$
where $n_j = |C_j|$. We select the $\eta$ instances in $C_i$ with the shortest median distances to define the distance from $C_i$ to $C_j$. Note that the distance is not symmetric. Hence, we repeat the above process to obtain another $\eta$ shortest distances from $C_j$ to $C_i$, which define the distance from $C_j$ to $C_i$. Lastly, these $2\eta$ distances are concatenated to form a $2\eta$-dimensional feature vector.
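The construction of the similarity vector can be sketched as below. This is a hedged sketch rather than the authors' code: the function names are hypothetical, faces are plain tuples of floats, and short groups are padded with the largest retained distance, which is one possible reading of the padding rule given in the implementation details.

```python
import math
from statistics import median


def angular_distance(u, v):
    """Angle (in radians) between two representation vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return math.acos(max(-1.0, min(1.0, dot / norm)))


def directed_distances(source, target, k):
    """For each face in `source`, take the median distance to all faces in `target`,
    then keep the k shortest medians (padded with the farthest kept one)."""
    medians = sorted(median(angular_distance(x, y) for y in target) for x in source)
    kept = medians[:k]
    return kept + [kept[-1]] * (k - len(kept))


def similarity_feature(group_i, group_j, k=5):
    """Concatenate both directions, since the directed distance is not symmetric."""
    return directed_distances(group_i, group_j, k) + directed_distances(group_j, group_i, k)


group_i = [(1.0, 0.0)]
group_j = [(1.0, 0.0), (0.0, 1.0)]
feature = similarity_feature(group_i, group_j, k=2)
```

The two halves of the vector generally differ, which illustrates why both directions are kept.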
Group Consistency: Group consistency measures how close the samples in a group are to each other. Even if two groups have high similarity between their respective members, we may not want to merge them if one of the groups is not consistent, which may happen when there are a number of non-interested faces inside the group. We define the consistency of a group as the median of the pairwise distances between faces in the group itself. Given a group $C_i$:
$$\text{consistency}(C_i) = \operatorname{median}\left\{ d(x, x') : x, x' \in C_i,\; x \neq x' \right\} \tag{6}$$
Consistency is computed for each of the two candidate groups, contributing a two-dimensional feature vector to $\phi(s)$.
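As a small illustration, the consistency of a group can be computed as the median over all unordered pairs of its members. The function name and the value returned for a single-face group (which has no pairs) are assumptions of this sketch; the source does not say how singletons are handled.

```python
from itertools import combinations
from statistics import median


def group_consistency(group, distance, singleton_value=0.0):
    """Median of the pairwise distances between faces within one group."""
    pairwise = [distance(a, b) for a, b in combinations(group, 2)]
    return median(pairwise) if pairwise else singleton_value


# Toy example with scalar "faces" and absolute difference as the distance.
toy_distance = lambda a, b: abs(a - b)
tight = group_consistency([0, 1, 3], toy_distance)   # pairwise distances 1, 3, 2
alone = group_consistency([7], toy_distance)
```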
Face Quality: As depicted in Fig. 1, profile faces and noise could easily confuse a state-of-the-art face recognition model. To make our reward function more informed about the quality of the images, we train a linear classifier using annotated profile and falsely detected faces as negative samples, and clear frontal faces as positive samples. A total of 100k face images extracted from movies is used for training. The output of the classifier serves as the quality measure. Here, we concatenate the quality values of the top $\eta$ faces in each of the two groups to form another $2\eta$-dimensional feature that is appended to $\phi(s)$.
4.2 Long-Term Reward
While the short-term reward captures how likely two groups should be merged given the current partition, the long-term reward needs to encapsulate a more far-sighted clustering strategy.
To facilitate the learning of this reward, we introduce the term ‘operation cost’, which measures the effort needed to manipulate the images in the current partition so that it approaches the ground-truth partition. Formally, consider a partition $h$ and the ground-truth partition $\tilde{h}$. A sequence of operations $o_1, \ldots, o_{\Gamma}$ can be executed to gradually modify the partition $h$ into $\tilde{h}$. The cost function $c(\cdot)$ maps each type of operation into a positive time cost. We then define $Op(h, \tilde{h})$ as the minimal cost for this change:
$$Op(h, \tilde{h}) = \min_{\Gamma,\, o_1, \ldots, o_{\Gamma}} \sum_{t=1}^{\Gamma} c(o_t) \tag{7}$$
where $\Gamma$ is the number of steps needed to get from $h$ to $\tilde{h}$.
The cost function can be obtained from a user study. In particular, we recruited 30 volunteers and showed them a number of randomly shuffled images as an album. Their task was to reorganize the photos into a desired partition of groups. We recorded the time needed for three types of operations: (1) adding a photo into a group, (2) removing a photo from a group, and (3) merging two groups. The key results are shown in Fig. 3. It can be observed that the ‘removing’ operation takes roughly 6× longer than the ‘adding’ operation. The ‘merging’ operation takes about as long as ‘adding’. Consequently, we set the costs for these three operations to 1, 6, and 1, respectively. The validity is further confirmed by the plot in Fig. 3, which shows a high correlation between the time consumed and the computed operation cost.
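With the costs from the user study, the cost of any given sequence of corrective operations is a simple sum. The sketch below only evaluates a given sequence; it does not search for the minimal sequence, and the names `OPERATION_COST`, `sequence_cost`, and `normalized_cost` are hypothetical.

```python
# Per-operation costs taken from the user study: add = 1, remove = 6, merge = 1.
OPERATION_COST = {"add": 1, "remove": 6, "merge": 1}


def sequence_cost(operations):
    """Total cost of one sequence of corrective operations."""
    return sum(OPERATION_COST[op] for op in operations)


def normalized_cost(operations, num_photos):
    """Cost per photo, so that albums of different sizes can be compared."""
    return sequence_cost(operations) / num_photos


# Fixing one wrongly grouped photo: remove it, then add it to the right group.
fix_false_merge = sequence_cost(["remove", "add"])
# Fixing one identity that was split into two groups: a single merge.
fix_split = sequence_cost(["merge"])
```

The asymmetry between the two cases shows why a wrongly merged photo is much more expensive for a user to repair than a group that was left split.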
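Given Eqn. (7), we define the long-term reward as:
$$R_{\text{long}}(s_t, a) = -\left( Op(h_{t+K}, \tilde{h}) - Op(h_t, \tilde{h}) \right) \tag{8}$$
which encodes the change in operation cost over $K$ steps.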
The key benefit brought by $R_{\text{long}}$ is that it provides a long-term reward that guides an agent to consider the global picture of the grouping process. For any action that can hardly be decided (e.g., merging two noisy groups or merging a clean group with a noisy group), this term provides strong evidence to the action-value function.
4.3 Finding the Reward and Policy
As discussed in Sec. 4, we assume the availability of a set of training photo albums for which the ground-truth partition of the photos is known. From the ground-truth partition of each training album, we can derive the ground-truth actions that serve as the expert’s behavior. Our goal is to find a reward function based on this behavior. We perform the learning in two steps to ease the convergence of our method: (1) we first employ IRL [21] to find the reward function with a myopic or short-sighted policy; (2) we then use the classic $\epsilon$-greedy algorithm [36] to find the optimal policy.
Step 1: Algorithm 1 summarizes the first step. Specifically, we set $\gamma = 0$ in Eqn. (2) and $\beta = 0$ in Eqn. (3). This leads to a myopic policy that considers only the current maximal short-term reward. This assumption greatly simplifies our optimization, as the parameters $w$ of $R_{\text{short}}$ (Eqn. (4)) are the only parameters to be learned. We solve this using a binary RBF-kernel SVM with the actions as the classes. We start the learning process with an SVM of random weights and an empty training set. We execute the myopic policy repeatedly on the albums. Whenever the agent chooses the wrong action w.r.t. the ground truth, the representations of the involved groups and the associated ground-truth action are added to the SVM training set. Different albums constitute different games, in which the SVM is continually optimized using the instances on which it does not perform well. Note that the training set is accumulated, so each retraining uses all samples collected over time. The learning stops when all albums are correctly partitioned.
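The mistake-driven loop of Step 1 can be sketched as follows. This sketch makes two loud simplifications that are not in the paper: a plain linear perceptron stands in for the RBF-kernel SVM, and each album is pre-flattened into a fixed list of (feature, expert action) decisions, whereas in the actual procedure the sequence of recommended pairs depends on the agent's earlier actions. All names are hypothetical.

```python
def predict(weights, bias, features):
    """+1 means merge, -1 means not merge."""
    score = sum(w * f for w, f in zip(weights, features)) + bias
    return 1 if score > 0 else -1


def train_perceptron(samples, dim, epochs=100):
    """Stand-in for the SVM: fit a linear separator on (features, label) pairs."""
    weights, bias = [0.0] * dim, 0.0
    for _ in range(epochs):
        for features, label in samples:
            if predict(weights, bias, features) != label:
                weights = [w + label * f for w, f in zip(weights, features)]
                bias += label
    return weights, bias


def learn_reward(albums, dim, max_rounds=20):
    weights, bias = [0.0] * dim, 0.0
    training_set = []                      # accumulated across all games
    for _ in range(max_rounds):
        mistakes = 0
        for album in albums:               # each album is one game
            for features, expert_label in album:
                if predict(weights, bias, features) != expert_label:
                    mistakes += 1
                    training_set.append((features, expert_label))
            # Retrain on everything collected so far.
            weights, bias = train_perceptron(training_set, dim)
        if mistakes == 0:                  # every album handled correctly
            break
    return weights, bias, training_set


# Toy albums: the single feature is a distance; the expert merges close pairs.
albums = [[([0.1], 1), ([0.9], -1)], [([0.2], 1), ([0.8], -1)]]
weights, bias, training_set = learn_reward(albums, dim=1)
```

Only the decisions that the current model gets wrong enter the training set, so the classifier is repeatedly refit on the cases it finds hardest.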
Step 2: Once the reward function is learned, finding the best policy becomes a classic RL problem. Here we apply the $\epsilon$-greedy algorithm [36]. Under an $\epsilon$-greedy policy, the agent selects an action uniformly at random from the set of available actions with probability $\epsilon$, and selects the action with the highest estimated value in the given state with probability $1 - \epsilon$. Specifically, we set $\gamma$ in Eqn. (2) and $\beta$ in Eqn. (3) so that, unlike in Step 1, future and long-term rewards are taken into account. We first approximate the action-value function $Q$ in Eqn. (2) by a random forest regressor [25]. The input to the regressor is $(\phi(s), a)$ and the output is the associated $Q$ value. The regressor is initialized with the $\phi(s)$, $a$, and $Q$ values obtained in the first step (Algorithm 1). After the initialization, the agent selects and executes an action according to $Q$, i.e., $a = \arg\max_{a'} Q(s, a')$, but with probability $\epsilon$ the agent acts randomly so as to discover states that it has never visited before. At the same time, the parameters of $Q$ are updated directly from samples of experience drawn from the algorithm’s past games. At the end of learning, the value of $\epsilon$ is decayed to 0, and $Q$ is used as our action-value function for the policy $\pi$.
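The action selection rule of Step 2 can be sketched in a few lines. The function names are hypothetical, the dictionary of action values stands in for the output of the learned regressor, and the linear decay schedule is an assumption, since the source only states that the exploration probability is decayed to 0 by the end of learning.

```python
import random


def epsilon_greedy(q_values, epsilon, rng):
    """q_values maps each available action to its estimated value."""
    if rng.random() < epsilon:
        return rng.choice(sorted(q_values))      # explore: uniform over actions
    return max(q_values, key=q_values.get)       # exploit: highest estimated value


def decayed_epsilon(initial, episode, num_episodes):
    """Linear schedule (an assumption) from `initial` down to 0 at the last episode."""
    remaining = 1.0 - episode / max(1, num_episodes - 1)
    return initial * max(0.0, remaining)


rng = random.Random(0)
q_values = {"merge": 0.7, "not merge": 0.2}
greedy_choice = epsilon_greedy(q_values, 0.0, rng)      # never explores
exploring_choice = epsilon_greedy(q_values, 1.0, rng)   # always explores
```

Once the exploration probability reaches 0, the rule reduces to always taking the action with the highest estimated value.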
5 Experiments
Training Data: Our algorithm needs to learn a grouping policy from a training set. The learned policy can then be applied to other datasets for face grouping. Here we employ albums simulated from MS-Celeb-1M [13] with 80k identities as our training source. We will release the training data.
Test Data: To show the generalizability of the learned policy, we evaluate the proposed approach on three datasets covering different scenarios, all disjoint from the training source. Example images are provided in Fig. 4.
1) LFW-Album: We construct challenging simulated albums from LFW [14], MS-Celeb-1M [13], and PFW [29], with a good mix of frontal, profile, and non-interested faces. We prepare 20 albums with exclusive identities. Note that the MS-Celeb-1M samples used here are disjoint from the training data.
2) ACCIO Dataset: This dataset [12] is commonly used in studies of video face clustering. It contains face tracklets extracted from the Harry Potter movie series. Following [45], we conduct experiments on the first instalment of the series, which contains 3243 tracklets from 36 known identities. For a fair comparison, we do not consider non-interested faces in this dataset, following [45]. We discard the temporal information and use only the frames in our experiments.
3) Grouping Faces in the Wild (GFW): To better evaluate our algorithm for real-world applications, we collect 60 real users’ albums with permission from a Chinese social network portal. The size of an album varies from 120 to 3600 faces, with up to 321 identities per album. In total, the dataset contains 84,200 images with 78,000 faces of 3,132 different identities. All faces are automatically detected using Faster R-CNN [26]. False detections are observed. We annotate all detections with identity/noise labels. The images are unconstrained and taken in various indoor/outdoor scenes. Faces are naturally distributed over different poses with spontaneous expressions. In addition, faces can be severely occluded, blurred by motion, and differently illuminated under different scenes. We will release the data and annotations. To our knowledge, this is the largest real-world face clustering dataset.
Given the limited space, we exclude results on traditional grouping datasets like YaleB [11, 17], MSRA-A [47], MSRA-B [47], and Easyalbum [7]. YaleB was captured under controlled conditions with very few profile faces and little noise. The number of albums is limited in the other three datasets.
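Table 1: Face grouping results on LFW-Album, ACCIO-1, and GFW. P, R, and F1 are the B-cubed precision, recall, and F1 score (%); Op is the normalized operation cost; "-" marks values that are not available.

| Method | LFW-Album P | LFW-Album R | LFW-Album F1 | LFW-Album Op | ACCIO-1 P | ACCIO-1 R | ACCIO-1 F1 | ACCIO-1 Op | GFW P | GFW R | GFW F1 | GFW Op |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | 73.6 | 86.6 | 79.3 | 1.12 | 72.2 | 34.4 | 46.6 | 0.65 | 66.6 | 35.7 | 41.1 | 1.47 |
| GDL [44] | 66.5 | 92.2 | 76.4 | 1.21 | 18.1 | 91.1 | 30.2 | 3.51 | 67.4 | 59.4 | 55.9 | 1.30 |
| HC | 74.2 | 80.8 | 76.6 | 0.35 | 17.1 | 91.9 | 28.9 | 3.28 | 77.5 | 22.3 | 15.0 | 0.81 |
| AP [10] | 76.7 | 71.1 | 73.7 | 1.07 | 82.2 | 9.6 | 17.1 | 0.59 | 69.7 | 25.3 | 32.7 | 0.86 |
| Deep Adaptation [45] | - | - | - | - | 71.1 | 35.2 | 47.1 | - | - | - | - | - |
| IL-Kmeans | 76.7 | 87.8 | 81.6 | 0.95 | 82.8 | 34.1 | 48.3 | 0.54 | 53.4 | 43.6 | 43.3 | 1.17 |
| IL-GDL | 79.9 | 90.1 | 84.5 | 0.54 | 88.6 | 46.3 | 60.8 | 0.78 | 78.4 | 76.2 | 74.5 | 0.68 |
| IL-HC | 97.8 | 85.3 | 91.1 | 0.14 | 90.8 | 78.6 | 84.3 | 0.52 | 96.6 | 53.7 | 67.3 | 0.17 |
| SVM Deep Features | 82.7 | 87.4 | 85.0 | 0.45 | 89.0 | 61.3 | 72.6 | 0.74 | 84.3 | 46.4 | 56.3 | 0.33 |
| Siamese Network Deep Features | 87.1 | 87.6 | 87.3 | 0.44 | 59.7 | 88.1 | 71.2 | 0.79 | 49.9 | 92.3 | 62.8 | 0.33 |
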
Implementation Details: All face representations are extracted from the Inception-v3 model [32] fine-tuned with MS-Celeb-1M [13]. We suggest some parameter settings as follows. We set $\beta$ in Eqn. (3) to balance the scales of the short- and long-term rewards. We fix the number of faces used to form the similarity and quality features (Sec. 4.1) to five: using the five shortest distances is a good trade-off between performance and feature complexity. If a group has fewer than five faces (in the extreme case, only one face), we pad the distance vector with the farthest distance.
Evaluation Metrics: We employ multiple metrics to evaluate the face grouping performance, including the B-cubed precision, recall, and $F_1$ score suggested by [43] and [45]. Specifically, B-cubed recall measures the average fraction of face pairs belonging to the same ground-truth identity that are assigned to the same cluster, and B-cubed precision is the fraction of face pairs assigned to the same cluster that have matching identity labels. The $F_1$ score is the harmonic mean of these two metrics. We also use the operation cost introduced in Sec. 4.2. To facilitate comparisons across datasets of different sizes, we use the operation cost normalized by the number of photos as our metric. We believe that this metric is more important than the others since it directly reflects how much effort per image a user needs to spend to organize a photo album.
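For reference, the B-cubed scores can be computed per face and then averaged, as in the following sketch. It follows the standard B-cubed definition; the function name is hypothetical, and the handling of noise labels in the benchmark is not specified in this section, so every item is treated as an ordinary labeled face.

```python
from collections import Counter


def bcubed(predicted, truth):
    """Return B-cubed (precision, recall, F1) for two parallel label lists."""
    n = len(predicted)
    cluster_size = Counter(predicted)
    identity_size = Counter(truth)
    overlap = Counter(zip(predicted, truth))   # faces sharing cluster and identity
    precision = sum(overlap[(c, t)] / cluster_size[c]
                    for c, t in zip(predicted, truth)) / n
    recall = sum(overlap[(c, t)] / identity_size[t]
                 for c, t in zip(predicted, truth)) / n
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Four faces, two identities; one face of identity "b" is put in the wrong cluster.
precision, recall, f1 = bcubed([0, 0, 0, 1], ["a", "a", "b", "b"])
```

In this toy case the misplaced face lowers both scores: precision suffers because cluster 0 is impure, and recall suffers because identity "b" is split.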
5.1 Comparison with Unsupervised Methods
We compare our method with classic and popular clustering approaches: 1) K-means, 2) Graph Degree Linkage (GDL) [44], 3) Hierarchical Clustering (HC), and 4) Affinity Propagation (AP) [10]. We also compare with [45]. Since its code is not publicly available, we only compare with its reported precision, recall, and $F_1$ scores on the ACCIO-1 dataset. Note that these baselines use the same features as our approach, as discussed in Sec. 4.1. To verify whether the proposed imitation learning (IL) framework helps existing clustering methods, we adapt K-means, GDL, and HC into IL-Kmeans, IL-GDL, and IL-HC to equip them with the sequential decision capability. This is achieved by using the respective algorithm as the recommender (see Sec. 3). For the IL-Kmeans algorithm, the action space is no longer binary due to the nature of K-means: we adapt the framework so that the size of the action space equals the number of clusters, for deciding which cluster a sample is merged into, and we replace the SVM with a RankSVM [15] to compute the rewards for each cluster.
Table 1 summarizes the results on the three datasets. We observe that: (1) imitation learning consistently improves the different clustering baselines. For instance, on LFW-Album, the $F_1$ score and normalized operation cost of HC improve from 76.6% and 0.35 to 91.1% and 0.14. Notably, IL-HC outperforms the other variants based on the proposed IL, although our framework is not specifically developed to work only with hierarchical clustering. (2) The operation cost is lower with a high-precision algorithm. This result matches our user study, since a user is good at adding similar photos into a group but poor at removing noisy faces that can be hard to distinguish.
We compare the grouping results of IL-HC and HC qualitatively in Fig. 5. IL-HC yields more coherent face groupings with exceptional robustness to outliers.
5.2 Comparison with Supervised Methods
We compare our framework with two supervised baselines, namely an SVM classifier and a three-layer Siamese network. The three layers of the Siamese network have 256, 64, and 64 hidden neurons, respectively. A contrastive loss is used for training. To train the baselines, each time we sample two subsets of identities from MS-Celeb-1M as the training data. The SVM and the Siamese network are used to predict whether two groups should be merged or not. Features are extracted following the method presented in Sec. 4.1. These supervised baselines are thus strong, since their input features are identical to those we use in our IL framework. The features include the face similarity vector derived from the Inception-v3 face recognition model fine-tuned with the MS-Celeb-1M dataset. The deep representation achieves 99.27% on LFW, which is better than [30] and on par with [37]. The results of the baselines are presented in Table 1. It is observed that the IL-based approach outperforms the supervised baselines by a considerable margin.
5.3 Ablation Study
Further Analysis on Recommender: In Sec. 5.1, we tested three different recommenders based on different clustering methods, namely K-means, GDL, and HC. In this experiment, we further analyze the use of a random recommender that randomly chooses a pair to recommend. Figure 6 shows the $F_1$ score comparisons between a Hierarchical Clustering (HC) recommender and a random recommender. In comparison to the recommender based on HC, which always recommends the nearest groups, the random recommender exhibits slower convergence and poorer results. It is worth pointing out that the random recommender still achieves an $F_1$ score of 61.9% on GFW, outperforming the unsupervised HC baseline, which achieves only 15.0%. The results suggest the usefulness of deploying a recommender.
We also evaluate an extreme approach that does not employ a recommender but selects a group pair to merge based on the values produced by the learned action-value function. Specifically, in each step, we compute the $Q$ value exhaustively for all possible pairs of groups, and select the pair with the highest value to merge. It is not surprising that the result of this approach on GFW is better than that of our IL-HC, as it performs an exhaustive search over pairs. However, this method has a much higher runtime complexity than IL-HC. The results suggest the effectiveness of the clustering-based recommender in our framework.
Discard the Face Quality Feature: If we remove the face quality feature from the feature vector , the score achieved by IL-HC on LFW-Album, ACCIO-1, and GFW drops from 91.1%, 84.3%, and 67.3% to 89.5%, 65.0%, and 48.4%, respectively. The results suggest that the importance of the quality measure depends on the dataset. The face quality feature is essential on the GFW dataset but less so on the others, since GFW contains more poor-quality images.
Reward Function Settings: We evaluate the effect of two reward terms in the reward function defined in Eqn. (3).
1) & : The full reward setting with .
2) w/o : Without the long-term reward based on operation cost, i.e., .
3) w/o : In this setting, we discarded learned by IRL, and redefined it to take a naïve loss, i.e., , where is an indicator function that outputs 1 if the condition is true, and −1 if it is false.
The results reported in Table 2 show that both short- and long-term rewards are indispensable for achieving good results. Comparing the baseline without the learned short-term reward against the full reward, we observe that IL learns a more powerful short-term reward function than the naïve loss. Comparing the baseline without the long-term reward against the full reward, removing the long-term reward reduces the score only slightly, but the numbers of false positive and false negative merges increase for noisy and hard cases. Figure 7 shows some representative groups that were mistakenly handled by IL-HC without the long-term reward. It is worth pointing out that by adjusting the cost distribution of the long-term reward, e.g., changing the cost of ‘add, remove, merge’ from (1,6,1) to (1,1,1), one can alter the algorithm’s bias towards precision or recall to suit different application scenarios. B-cubed PR curves are depicted in Fig. 8 to show the influence of the cost distribution. Hierarchical clustering with imitation learning (IL-HC) outperforms the baselines HC and AP regardless of the setting used. We recommend a high-precision setting in order to achieve a low normalized operation cost, as suggested by the experiments in Sec. 5.1.
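Reward setting | LFW-Album: score (%) | LFW-Album: normalized operation cost | GFW: score (%) | GFW: normalized operation cost
Full reward (short- and long-term) | 91.1 | 0.14 | 67.3 | 0.17
w/o long-term reward | 90.7 | 0.14 | 62.6 | 0.17
w/o short-term reward (naïve loss) | 73.0 | 0.54 | 17.1 | 0.65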
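For readers unfamiliar with the B-cubed metric behind the PR curves mentioned above, the following is a minimal sketch of B-cubed precision and recall for a flat clustering. It is a generic textbook formulation rather than the evaluation code used in the experiments; the function name and the harmonic-mean F-score returned alongside are assumptions added for convenience.

```python
from collections import Counter


def bcubed(pred_labels, true_labels):
    """B-cubed precision, recall, and their harmonic mean.

    pred_labels[i] is the predicted group of face i and
    true_labels[i] is its groundtruth identity.
    """
    n = len(pred_labels)
    pred_sizes = Counter(pred_labels)                 # size of each predicted group
    true_sizes = Counter(true_labels)                 # size of each identity
    overlap = Counter(zip(pred_labels, true_labels))  # group/identity co-occurrence

    # Per-item precision: share of the item's predicted group that has its identity.
    precision = sum(overlap[(p, t)] / pred_sizes[p]
                    for p, t in zip(pred_labels, true_labels)) / n
    # Per-item recall: share of the item's identity that lies in its predicted group.
    recall = sum(overlap[(p, t)] / true_sizes[t]
                 for p, t in zip(pred_labels, true_labels)) / n
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


if __name__ == "__main__":
    # Over-merging: everything in one group gives perfect recall, low precision.
    print(bcubed([0, 0, 0, 0], ["a", "a", "b", "b"]))
    # Over-splitting: singleton groups give perfect precision, low recall.
    print(bcubed([0, 1, 2, 3], ["a", "a", "b", "b"]))
```

The two toy calls illustrate the precision/recall bias discussed above: over-merging raises recall at the expense of precision, and over-splitting does the opposite.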
6 Conclusion
We have proposed a novel face grouping framework that makes sequential merging decisions based on short- and long-term rewards. With inverse reinforcement learning, we learn a powerful reward function to cope with real-world grouping tasks involving unconstrained face poses, illumination, occlusion, and an abundance of uninteresting faces and false detections. We have demonstrated that the framework benefits many existing agglomerative-based clustering algorithms.
References
 [1] P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In ICML, 2004.
 [2] W. Barbakh and C. Fyfe. Clustering with reinforcement learning. In IDEAL, 2007.
 [3] T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y. W. Teh, E. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In CVPR, 2004.
 [4] X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang. Diversity-induced multi-view subspace clustering. In CVPR, 2015.
 [5] J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In WACV, 2016.
 [6] R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. In ICCV, 2011.
 [7] J. Cui, F. Wen, R. Xiao, Y. Tian, and X. Tang. EasyAlbum: an interactive photo annotation system based on face clustering and re-ranking. In ACM SIGCHI, 2007.
 [8] E. El Khoury, C. Senac, and P. Joly. Face-and-clothing based people clustering in video content. In ICMIR, 2010.
 [9] A. W. Fitzgibbon and A. Zisserman. Joint manifold distance: a new approach to appearance based clustering. In CVPR, 2003.
 [10] B. J. Frey and D. Dueck. Clustering by passing messages between data points. Science, 2007.
 [11] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. TPAMI, 23(6):643–660, 2001.
 [12] E. Ghaleb, M. Tapaswi, Z. Al-Halah, H. K. Ekenel, and R. Stiefelhagen. Accio: A data set for face track retrieval in movies across age. In ICMR, 2015.
 [13] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In ECCV, 2016.
 [14] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
 [15] T. Joachims. Optimizing search engines using clickthrough data. In ACM SIGKDD, 2002.
 [16] K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In ECCV, 2012.
 [17] K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. TPAMI, 27(5):684–698, 2005.
 [18] Z. Li and X. Tang. Bayesian face recognition using support vector machine and face clustering. In CVPR, 2004.
 [19] A. Likas. A reinforcement learning approach to online clustering. Neural Computation, 11(8):1915–1932, 1999.
 [20] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
 [21] A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In ICML, 2000.
 [22] C. Otto, B. Klare, and A. K. Jain. An efficient approach for clustering face images. In ICB, 2015.
 [23] C. Otto, D. Wang, and A. K. Jain. Clustering millions of faces by identity. arXiv preprint arXiv:1604.00989, 2016.
 [24] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
 [25] L. D. Pyeatt, A. E. Howe, et al. Decision tree function approximation in reinforcement learning. In ISAS, 2001.
 [26] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
 [27] S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In AISTATS, pages 627–635, 2011.
 [28] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
 [29] S. Sengupta, J. C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
 [30] Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In NIPS, 2014.
 [31] Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In CVPR, 2015.
 [32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015.
 [33] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In CVPR, 2014.
 [34] Z. Tang, Y. Zhang, Z. Li, and H. Lu. Face clustering in videos with proportion prior. In IJCAI, 2015.
 [35] Y. Tian, W. Liu, R. Xiao, F. Wen, and X. Tang. A face annotation framework with partial clustering and interactive labeling. In CVPR, 2007.
 [36] C. J. C. H. Watkins. Learning from delayed rewards. PhD thesis, University of Cambridge, England, 1989.
 [37] Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In ECCV, 2016.
 [38] B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained clustering and its application to face clustering in videos. In CVPR, 2013.
 [39] S. Xia, H. Pan, and A. Qin. Face clustering in photo album. In ICPR, 2014.
 [40] Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In ICCV, 2015.
 [41] S. Xiao, M. Tan, and D. Xu. Weighted block-sparse low rank representation for face clustering in videos. In ECCV, 2014.
 [42] L. Xie, A. N. Antle, and N. Motamedi. Are tangibles more fun?: Comparing children’s enjoyment and engagement using physical, graphical and tangible user interfaces. In International Conference on Tangible and Embedded Interaction, pages 191–198. ACM, 2008.
 [43] L. Zhang, D. V. Kalashnikov, and S. Mehrotra. A unified framework for context assisted face clustering. In ICMR, pages 9–16, 2013.
 [44] W. Zhang, X. Wang, D. Zhao, and X. Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In ECCV, 2012.
 [45] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Joint face representation adaptation and clustering in videos. In ECCV, 2016.
 [46] C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao. Multi-cue augmented face clustering. In ACM MM, 2015.
 [47] C. Zhu, F. Wen, and J. Sun. A rank-order distance based clustering algorithm for face tagging. In CVPR, 2011.