# Merge or Not? Learning to Group Faces via Imitation Learning

## Abstract

Given a large number of unlabeled face images, face grouping aims at clustering the images into the individual identities present in the data. This task remains a challenging problem despite the remarkable capability of deep learning approaches in learning face representations. In particular, grouping results can still be egregious given profile faces and a large number of uninteresting faces and noisy detections. Often, a user needs to correct the erroneous grouping manually. In this study, we formulate a novel face grouping framework that learns a clustering strategy from ground-truth simulated behavior. This is achieved through imitation learning (a.k.a. apprenticeship learning or learning by watching) via inverse reinforcement learning (IRL). In contrast to existing clustering approaches that group instances by similarity, our framework makes sequential decisions to dynamically decide when to merge two face instances/groups, driven by short- and long-term rewards. Extensive experiments on three benchmark datasets show that our framework outperforms unsupervised and supervised baselines.

## 1 Introduction

Face grouping is an actively researched computer vision problem due to its enormous potential in commercial applications. It not only allows users to organize and tag photos based on faces, but also to retrieve and revisit huge quantities of relevant images effortlessly.

The performance of face grouping benefits significantly from the recent emergence of deep learning approaches [5]. Nevertheless, we still observe some challenges when applying existing methods to real-world photo albums. In particular, we found that deeply learned representations can still perform poorly given *profile faces and false detections*. In addition, there is no obvious mechanism to disambiguate a *large quantity of uninteresting faces*^{1}.

Humans tend to execute a visual grouping task in sequence, with intermediate decisions governing the next step, much like playing a jigsaw puzzle [42] with pieces of varying visual complexity. First, we link pieces with strong correlation and high confidence, then gain insights and accumulate visual evidence from these stable clusters. Consequently, a larger group can be formed by merging ambiguous positives and discarding uninteresting outliers. In the process, we may exploit contextual cues and the global picture by considering other samples.

The above intuition motivates a novel face grouping framework. Our goal is not to design a better deep representation, but to learn to make better merging/not-merging decisions from an expert's demonstration using an existing representation. In particular, we wish to introduce intermediate sequential decisions between the clustering steps, i.e., when to merge two samples or groups given the dynamic context. Towards this goal, we assume different clustering states, where the states differ in their current partitions of the data. At each time step, an agent chooses from two possible actions, i.e., to merge or not to merge a pair of face groups. The process responds at the next time step by moving to a new state and providing a reward to the agent. A sequence of good actions leads to a higher cumulative reward than suboptimal decisions.

Learning a decision strategy in our problem is non-trivial. In particular, the decision process is adversely affected by uninteresting faces and noisy detections. Defining a reward function for face grouping is thus not straightforward: it needs to consider the similarity of faces, group consistency, and the quality of images. In addition, we also need to consider the operation cost involved, i.e., the manual human effort spent on adding or removing a photo from a group. It is hard to determine the relative weights of these terms a priori. This is in contrast to the (first-person) imitation learning setting, in which the reward is usually assumed known and fixed, e.g., the change of game score [20].

**Contributions:** We make the following contributions to overcome the aforementioned challenges:

1) We formulate a novel face grouping framework based on imitation learning (IL) via inverse reinforcement learning [21]. To our knowledge, this is the first attempt to address visual clustering via inverse reinforcement learning. Once learned, the policy can be transferred to unseen photo albums with good generalization performance.

2) We assume the reward is an unknown to be ascertained by learning from watching an expert's behavior. We formulate the learning such that both short- and long-term rewards are considered. The former considers the similarity, consistency, and quality of local candidate clusters, whereas the latter measures the operation cost to get from an arbitrary partition of the photos to the final ground-truth partition. The new reward system effectively handles the challenges of profile, noisy, and uninteresting faces, and works well with conventional face similarity in an open-world context.

3) We introduce a large-scale dataset called Grouping Faces in the Wild (GFW) to facilitate the research of real-world photo grouping. The new dataset contains 78,000 faces of 3,132 identities collected from a social network. This dataset is realistic, providing a large number of uninteresting faces and noisy detections.

Extensive experiments are conducted on three datasets, namely, LFW simulated albums, the ACCIO dataset (Harry Potter movies) [12], and the GFW dataset introduced by us. We show that the proposed method can be adapted to a variety of clustering algorithms, from the conventional k-means and hierarchical clustering to the more elaborate graph degree linkage (GDL) approach [44]. We show that it outperforms a number of unsupervised and supervised baselines.

## 2 Related Work

**Face Grouping:** Traditional face clustering methods [4] are usually purely data-driven and unsupervised. They mainly focus on finding a good distance metric between faces or effective subspaces for face representation. For instance, Zhu et al. [47] propose a rank-order distance that measures the similarity between two faces using their neighboring information. Fitzgibbon and Zisserman [9] further develop a joint manifold distance (JMD) that measures the distance between two subspaces, each of which is invariant to a desired group of transformations. Zhang et al. [44] propose agglomerative clustering on a directed graph to better capture the global manifold structure of face data. There exist techniques that employ user interactions [35], extra information from the web [3], and prior knowledge of family photo albums [39]. Deep representations have recently been found effective for face clustering [28], and large-scale face clustering has been attempted [23]. Beyond image-based clustering, most existing video-based approaches employ pairwise constraints derived from face tracklets [6] or other auxiliary information [8] to facilitate face clustering in video. The state-of-the-art method by Zhang et al. [45] adapts the DeepID2+ model [31] to a target domain with joint face representation adaptation and clustering.

In this study, we focus on image-based face grouping without temporal information. Our method differs significantly from existing methods [45] that cluster instances by deep representation alone. Instead, our method learns from experts to make sequential decisions on grouping, considering both short- and long-term rewards. It is thus capable of coping with uninteresting faces and noisy detections effectively.

**Clustering with Reinforcement Learning:** There exist some pioneering studies that explored clustering with RL. Likas [19] models the decision process of assigning a sample from a data stream to a prototype, i.e., a cluster center produced by on-line K-means. Barbakh and Fyfe [2] employ RL to select a better initialization for K-means. Our work differs from the aforementioned studies: (1) these methods are unsupervised, i.e., their loss is related to the distance from data to a cluster prototype, whereas our framework guides an agent with a teacher's behavior; (2) we consider a decision that extends more flexibly to merging arbitrary instances or groups. We also investigate a novel reward function and new mechanisms to deal with noise.

**Imitation Learning:** Ng and Russell [21] introduced the concept of *inverse reinforcement learning* (IRL), which is also known as *imitation learning* or apprenticeship learning [1]. The goal of IRL is to find a reward function that explains the observed behavior of an expert who acts according to an unknown policy. Inverse reinforcement learning is useful when a reward function is multivariate, i.e., consists of several reward terms whose relative weights are unknown a priori. Imitation learning has been shown effective when the supervision of a dynamic process is obtainable, e.g., in robotic navigation [1], activity understanding and forecasting [16], and visual tracking [40].

## 3 Overview


### 3.1 Preliminaries

*Markov decision process* (MDP) has been extensively employed for modeling dynamic environments where an agent needs to make sequential decisions and execute actions. Applications of MDP can be found in different computer vision tasks, e.g., tracking [40], feature selection [?], human activity forecasting [16], and interactive data annotation [?]. Formally, an MDP is represented as a tuple $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R})$, where $\mathcal{S}$ is a finite set of states and $\mathcal{A}$ is a set of actions (decisions). $\mathcal{P}(s' \mid s, a)$ is the set of state transition probabilities upon taking action $a$ in state $s$. A reward function is denoted as $\mathcal{R}(s, a)$, which rewards the agent after it executes action $a$ in state $s$.

*Reinforcement learning* (RL) [?] aims at finding a policy $\pi$ in an MDP, which maps from states to probability distributions over actions, so as to maximize the numerical reward signal. A value function $V^{\pi}(s)$ evaluates the value of a state as the total amount of reward an agent can expect to accumulate over the future, starting from that state, i.e.,

$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big],$$

where $\gamma \in [0, 1)$ is a discount factor. RL also defines an action-value function $Q^{\pi}(s, a)$ to judge the value of actions, according to

$$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma\, \mathbb{E}_{s'}\big[V^{\pi}(s')\big].$$

The optimal value function is $V^{*}(s) = \max_{\pi} V^{\pi}(s)$ and the optimal $Q$-function is $Q^{*}(s, a) = \max_{\pi} Q^{\pi}(s, a)$. The goal of RL is to find a policy $\pi^{*}$ that maximizes $V^{\pi}(s)$.

Ng and Russell [21] introduced the concept of *inverse reinforcement learning* (IRL), of which the goal is to find a reward function^{2} that explains observed behavior. The behavior can be observed from an expert who acts according to an unknown policy. In this case, the task of inverse reinforcement learning can be regarded as *imitation learning* or apprenticeship learning [1].

An illustration of the proposed framework is given in Figure 2. We treat grouping as a sequential process. In each step during test time, two candidate groups $G_i$ and $G_j$ are chosen. Without loss of generality, a group can be formed by just a single instance. Given the two groups, we extract meaningful features to characterize their similarity, group consistency, and image quality. Based on the features, an agent then performs an action, which can be either i) merging the two groups, or ii) not merging the two groups. Once the action is executed, the grouping proceeds to select the next pair of groups. The merging stops when no further candidate groups can be chosen, i.e., when the distance between any two groups is higher than a pre-defined threshold. Next, we define some key terminologies.
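The loop above can be sketched in a few lines. This is a minimal, hypothetical illustration rather than the authors' implementation: `distance` scores a pair of groups, a naive nearest-pair recommender proposes the closest pair not previously declined, and `agent` stands in for the learned merge/not-merge policy.

```python
# Hypothetical sketch of the sequential grouping loop (not the paper's code).
def group_faces(items, distance, agent, threshold=1.0):
    groups = [frozenset([x]) for x in items]   # every face starts as its own group
    declined = set()                           # history: pairs the agent refused
    while True:
        # recommender: nearest un-vetoed pair with distance below the threshold
        best, best_d = None, threshold
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if frozenset((groups[i], groups[j])) in declined:
                    continue
                d = distance(groups[i], groups[j])
                if d < best_d:
                    best, best_d = (i, j), d
        if best is None:                       # no candidate pair left: stop
            break
        i, j = best
        if agent(groups[i], groups[j]):        # action: merge the pair
            merged = groups[i] | groups[j]
            groups = [g for k, g in enumerate(groups) if k not in (i, j)]
            groups.append(merged)
        else:                                  # action: do not merge; remember it
            declined.add(frozenset((groups[i], groups[j])))
    return groups
```

A toy run with a single-linkage distance and a threshold-based agent illustrates how declined pairs are never re-recommended, so the loop always terminates.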

**Recommender**: At each time step we pick and consider the merging of two face groups. The action space is large, with a complexity of $O(N^2)$, where $N$ is the number of groups. This adds hurdles to both the learning and test stages. To make our approach scalable, we employ a recommender $m$, which recommends two candidate clusters $G_i$ and $G_j$ at each time step. This reduces the action space to a binary problem, i.e., to merge or not to merge a pair of face groups. A recommender can be derived from many classic clustering algorithms, especially agglomerative algorithms such as hierarchical clustering (HC), rank-order clustering [47], and the GDL approach [44]. For instance, a hierarchical clustering-based recommender always suggests the two clusters that are nearest by some distance metric. In Section 5, we perform rigorous evaluations on plausible choices of a recommender.

**State:** Each state $s_t$ contains the current grouping partition $P_t$ and the recommender history $H_t$ at time step $t$. In each discrete state, the recommender will recommend a pair of clusters based on the current state.

**Action:** An action is denoted as $a_t$. An agent can execute two possible actions, i.e., to merge two groups or not. That is, the action set is defined as $\mathcal{A} = \{a^{\text{merge}}, a^{\text{not-merge}}\}$.

**Transition:** If a merging action is executed, the candidate groups $G_i$ and $G_j$ will be merged. The corresponding partition is updated as $P_{t+1} = \left(P_t \setminus \{G_i, G_j\}\right) \cup \{G_i \cup G_j\}$. Otherwise, the partition remains unchanged, i.e., $P_{t+1} = P_t$. The candidate information is appended to the history $H_{t+1}$ so that the same pair will not be recommended by $m$ again. The transition is thus represented as $s_{t+1} = T(s_t, a_t)$, where $T$ denotes the transition function.

## 4 Learning Face Grouping by Imitation

The previous section explains the face grouping process at test time. An agent is used to determine the right action at each step, i.e., merging or not merging a pair of groups. To learn an agent with the desired behavior, we assume access to demonstrations by some expert. In our study, we obtain these demonstrations from a set of training photo albums of which the ground-truth partition of the photos is known. Consequently, given any two candidate groups $G_i$ and $G_j$, we know whether merging them is a correct action or not. These ground-truth actions represent the pseudo expert's behavior.

Towards the goal of learning an agent from the expert's behavior, we perform the learning in two stages: (1) we find a reward function to explain the behavior via inverse reinforcement learning [21]; (2) with the learned reward function, we find a policy that maximizes the cumulative rewards.

Formally, let $\mathcal{R}(s, a)$ denote the reward function, which rewards the agent after it executes action $a$ in state $s$, and let $\mathcal{P}$ be the set of state transition probabilities upon taking action $a$ in state $s$. For any policy $\pi$, a value function $V^{\pi}(s)$ evaluates the value of a state as the total amount of reward an agent can expect to accumulate over the future, starting from that state, i.e.,

$$V^{\pi}(s) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^{t}\, \mathcal{R}(s_t, a_t) \,\Big|\, s_0 = s, \pi\Big],$$

where $\gamma \in [0, 1)$ is a discount factor.

An action-value function $Q^{\pi}(s, a)$ is used to judge the value of actions, according to

$$Q^{\pi}(s, a) = \mathcal{R}(s, a) + \gamma\, V^{\pi}(s'),$$

where the notation $s'$ represents the state reached after taking action $a$ at state $s$. Our goal is to first uncover the reward function from the expert's behavior, and then find a policy $\pi$ that maximizes $Q^{\pi}(s, a)$.

**Rewards:** In our study, the reward function that we wish to learn consists of two terms, denoted as

$$\mathcal{R}(s, a) = \mathcal{R}_{s}(s, a) + \eta\, \mathcal{R}_{\ell}(s, a).$$

The first and second terms correspond to the short- and long-term rewards, respectively. The parameter $\eta$ helps balance the scale of the two terms. The short-term reward $\mathcal{R}_{s}$ is multivariate. It considers how strongly two instances/groups should be merged locally, based on face similarity, group consistency, and face quality. The long-term reward $\mathcal{R}_{\ell}$ captures a more far-sighted clustering strategy by measuring the operation cost to get from an arbitrary partition of the photos to the final ground-truth partition. Note that at test time, the long-term reward function is absorbed into our learned action-value function for a policy $\pi$, thus no ground truth is needed during testing. We explain the short- and long-term rewards as follows.

### 4.1 Short-Term Reward

Before a human user decides a merge between any two face groups, he/she will determine how close the two groups are in terms of face similarity. In addition, he/she may consider the quality and consistency of images in each group to prevent any accidental merging of uninteresting faces and noisy detections. We wish to capture such a behavior through learning a reward function.

The reward is considered short-term since it only examines the current partition of groups. Specifically, we compute the similarity between the two groups, the quality of each group, and the photo consistency within each group as a feature vector $\phi$, and we project this feature into a scalar reward,

$$\mathcal{R}_{s}(s, a) = y\, f(\phi), \quad \text{where } y = 1 \text{ if } a = a^{\text{merge}}, \text{ and } y = -1 \text{ if } a = a^{\text{not-merge}}.$$

Note that we assume the actual reward function $f$ is unknown and should be learned through IRL. We observe that through IRL, a powerful reward function can be learned: an agent can achieve a competitive result even by deciding myopically based on one step's reward rather than multiple steps. We will show that optimizing $f$ is equivalent to learning a hyperplane in a support vector machine (SVM) (Section 4.3).
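As a toy illustration of this signed short-term reward, the sketch below uses a linear stand-in for the scoring function (the paper learns it with an RBF-kernel SVM); the weights `w` and bias `b` are assumed given, not learned here.

```python
# Illustrative sketch: the short-term reward as a signed score.
# `w` and `b` are hypothetical parameters of a linear stand-in for f.
def short_term_reward(w, b, phi, action):
    score = sum(wi * fi for wi, fi in zip(w, phi)) + b   # f(phi)
    # the sign flips with the action: merging a "good" pair is rewarded,
    # declining it is penalized, and vice versa
    return score if action == "merge" else -score
```

The same scalar thus rewards one action and penalizes the other, which is what makes a single learned function sufficient for the binary action set.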

Next, we describe how we design the feature vector $\phi$, which determines the characteristics an agent should examine before making a group merging decision. The feature vector is extracted considering the candidate groups, the representations of all faces in the groups, and the current partition, that is, $\phi = \phi(G_i, G_j, P_t)$.

The proposed feature vector contains three kinds of features, so as to capture face similarity, group consistency, and image quality. All face representations are extracted from an Inception-v3 model [32] fine-tuned on MS-Celeb-1M [13]. More elaborate features can be considered given the flexibility of the framework.

**Face Similarity:** We compute a multi-dimensional similarity vector to describe the relationship between two face groups $G_i$ and $G_j$. Specifically, we first define the distance between the representations of two arbitrary faces $x_p \in G_i$ and $x_q \in G_j$ as $d(x_p, x_q)$; in this study, the distance function is the angular distance. We then start from $G_i$: for a face $x_p$ in $G_i$, we compute its distance to all the faces in $G_j$ and take the median of the resulting distances, that is,

$$d(x_p, G_j) = \operatorname{median}\{\, d(x_p, x_q) \mid x_q \in G_j \,\}.$$

We select the $k$ instances with the shortest distances $d(x_p, G_j)$ to define the distance from $G_i$ to $G_j$. Note that this distance is not symmetric. Hence, we repeat the above process to obtain another $k$ shortest distances that define the distance from $G_j$ to $G_i$. Lastly, these distances are concatenated to form a $2k$-dimensional feature vector.
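A minimal sketch of this $2k$-dimensional feature follows. It assumes a generic pairwise distance `dist` and borrows the padding rule from the implementation details in Section 5 (groups with fewer than $k$ faces are padded with the farthest distance); it is illustrative, not the authors' code.

```python
from statistics import median

# Illustrative sketch of the 2k-dim face-similarity feature.
def similarity_feature(G1, G2, dist, k=5):
    def one_side(A, B):
        # median distance from each face in A to group B, keep the k smallest
        meds = sorted(median(dist(a, b) for b in B) for a in A)
        if len(meds) < k:                  # pad short groups with the farthest
            meds += [meds[-1]] * (k - len(meds))
        return meds[:k]
    # the measure is asymmetric, so both directions are concatenated
    return one_side(G1, G2) + one_side(G2, G1)
```

With `k = 5` as in the paper's experiments, each pair of groups yields a 10-dimensional similarity descriptor regardless of the group sizes.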

**Group Consistency:** Group consistency measures how close the samples in a group are to each other. Even if two groups have high similarity between their respective members, we may not want to merge them if one of the groups is not consistent, which may happen when there are a number of uninteresting faces inside the group. We define the consistency of a group as the median of the pairwise distances between faces within the group itself. Given a group $G$:

$$c(G) = \operatorname{median}\{\, d(x_p, x_q) \mid x_p, x_q \in G,\ p \neq q \,\}.$$

Consistency is computed for each of the two candidate groups, contributing a two-dimensional feature vector to $\phi$.
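The consistency measure can be sketched directly from its definition; treating a singleton group as perfectly consistent is our assumption, since the median over zero pairs is undefined.

```python
from itertools import combinations
from statistics import median

# Sketch: group consistency = median pairwise distance within the group.
def consistency(G, dist):
    pairs = [dist(a, b) for a, b in combinations(G, 2)]
    return median(pairs) if pairs else 0.0   # singleton: assumed fully consistent
```

Using the median rather than the mean keeps the measure robust to a few outlying (e.g., noisy) faces inside an otherwise clean group.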

**Face Quality:** As depicted in Figure 1, profile faces and noise can easily confuse a state-of-the-art face recognition model. To make our reward function more informed about the quality of the images, we train a linear classifier using annotated profile and falsely detected faces as negative samples, and clear frontal faces as positive samples. A total of 100k face images extracted from movies is used for training. The output of the classifier serves as the quality measure. Here, we concatenate the quality values of the top $k$ faces in each of the two groups to form another $2k$-dimensional feature to add to $\phi$.

### 4.2 Long-Term Reward

While the short-term reward captures how likely two groups should be merged given the current partition, the long-term reward needs to encapsulate a more far-sighted clustering strategy.

To facilitate the learning of this reward, we introduce the term '*operation cost*', which measures the effort needed to manipulate the images in the current partition to approach the ground-truth partition. Formally, given a partition $P$ and the ground-truth partition $P^{*}$, a sequence of operations $(o_1, \dots, o_M)$ can be executed to gradually modify $P$ into $P^{*}$. A cost function $c(\cdot)$ maps each type of operation to a positive time cost. We then define $C(P, P^{*})$ as the minimal cost for this change:

$$C(P, P^{*}) = \min_{(o_1, \dots, o_M)} \sum_{m=1}^{M} c(o_m),$$

where $M$ is the number of steps needed to get from $P$ to $P^{*}$.

The cost function can be obtained from a user study. In particular, we recruited 30 volunteers and showed them a number of randomly shuffled images as an album. Their task was to reorganize the photos into the desired partition. We recorded the time needed for three types of operations: (1) adding a photo to a group, (2) removing a photo from a group, and (3) merging two groups. The key results are shown in Figure 3. It can be observed that the 'removing' operation takes roughly 6× longer than the 'adding' operation, while the 'merging' operation costs about the same as 'adding'. Consequently, we set the costs for these three operations to 1, 6, and 1, respectively. The validity is further confirmed by the plot in Figure 3, which shows a high correlation between the time consumed and the computed operation cost.
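To make the cost concrete, the sketch below computes a greedy approximation of the operation cost: each cluster is assigned to its majority identity, misplaced faces are removed and re-added, and clusters sharing a majority identity are merged. The paper defines the *minimal* cost, so this greedy scheme is only a plausible upper-bound illustration, not the authors' exact procedure.

```python
from collections import Counter

# Costs from the user study in Sec. 4.2: add = 1, remove = 6, merge = 1.
COST_ADD, COST_REMOVE, COST_MERGE = 1, 6, 1

# Greedy approximation of the operation cost (illustrative only).
def operation_cost(partition, labels):
    majorities = []
    cost = 0
    for group in partition:
        maj, _ = Counter(labels[x] for x in group).most_common(1)[0]
        majorities.append(maj)
        for member in group:
            if labels[member] != maj:          # misplaced face: remove + re-add
                cost += COST_REMOVE + COST_ADD
    # clusters sharing a majority identity must be merged pairwise
    cost += sum((n - 1) * COST_MERGE for n in Counter(majorities).values())
    return cost
```

Note how the asymmetric costs encode the user study: a single wrongly grouped face costs 7 units to fix, so high-precision partitions end up far cheaper than high-recall but noisy ones.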

Given the definition of $C(P, P^{*})$, we define the long-term reward as:

$$\mathcal{R}_{\ell}(s_t, a_t) = C(P_t, P^{*}) - C(P_{t+1}, P^{*}),$$

which encodes the change in operation cost incurred by one step.

The key benefit brought by $\mathcal{R}_{\ell}$ is that it provides a long-term reward that guides the agent to think about the global picture of the grouping process. For any action that can hardly be decided locally (e.g., merging two noisy groups, or merging a clean group with a noisy group), this term provides strong evidence to the action-value function.

### 4.3 Finding the Reward and Policy

As discussed in Section 4, we assume the availability of a set of training photo albums for which the ground-truth partition of the photos is known. The ground-truth partitions of these albums provide the ground-truth actions, which serve as the expert's behavior. Our goal is to find a reward function based on this behavior. We perform the learning in two steps to ease the convergence of our method: (1) first, we employ IRL [21] to find the reward function under a myopic, or short-sighted, policy; (2) we then use the classic $\epsilon$-greedy algorithm [36] to find the optimal policy.

**Step 1**: Algorithm ? summarizes the first step. Specifically, we set the long-term weight $\eta = 0$ in the reward and the discount factor $\gamma = 0$ in the value function. This leads to a myopic policy that considers only the current maximal short-term reward. This assumption greatly simplifies our optimization, as the parameters of $f$ are the only ones to be learned. We solve this using a binary RBF-kernel SVM with the actions as the classes. We start the learning process with an SVM with random weights and an empty training set $D$. We execute the myopic policy repeatedly on the albums. Whenever the agent chooses the wrong action w.r.t. the ground truth, the representations of the involved groups and the associated ground-truth action are added to the SVM training set. Different albums constitute different games in which the SVM is continually optimized using the instances on which it does not perform well. Note that the set $D$ is accumulated, hence each time we retrain the SVM using all samples collected over time. The learning stops when all albums are correctly partitioned.
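The mistake-driven data aggregation in Step 1 can be sketched compactly. This is a simplified stand-in under explicit assumptions: each episode is a (feature, expert-action) pair with the expert's action $y \in \{+1, -1\}$, and a plain perceptron replaces the paper's RBF-kernel SVM.

```python
# Sketch of Step 1: train only where the current policy disagrees with the
# expert, accumulating those states into a growing training set D.
def learn_reward(episodes, epochs=50):
    w = [0.0] * len(episodes[0][0])      # linear reward weights (perceptron)
    D = []                               # accumulated training set
    for _ in range(epochs):
        clean = True
        for phi, y in episodes:          # roll out the current myopic policy
            pred = 1 if sum(wi * fi for wi, fi in zip(w, phi)) >= 0 else -1
            if pred != y:                # wrong action w.r.t. the expert
                D.append((phi, y))
                clean = False
        for phi, y in D:                 # retrain on everything collected so far
            if y * sum(wi * fi for wi, fi in zip(w, phi)) <= 0:
                w = [wi + y * fi for wi, fi in zip(w, phi)]
        if clean:                        # all decisions correct: stop
            break
    return w
```

The design mirrors the paper's loop: the classifier never sees states it already handles correctly, so the training set concentrates on the hard merge decisions.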

**Step 2**: Once the reward function is learned, finding the best policy becomes a classic RL problem. Here we apply the $\epsilon$-greedy algorithm [36]. The $\epsilon$-greedy policy selects a random action, uniformly distributed over the set of available actions, with probability $\epsilon$; with probability $1 - \epsilon$, it selects the action that gives the maximum value in the given state. Specifically, we now restore the long-term weight $\eta$ and the discount factor $\gamma$ to non-zero values. We first approximate the action-value function $Q(s, a)$ by a random forest regressor [25]. The input to the regressor is the feature $\phi$ with the action, and the output is the associated value. The parameters of the regressor are initialized using the states, actions, and values obtained in the first step (Algorithm ?). After the initialization, the agent selects and executes an action according to the greedy rule, $a = \arg\max_a Q(s, a)$, but with probability $\epsilon$ the agent acts randomly so as to discover states it has never visited before. At the same time, the parameters of the regressor are updated directly from the samples of experience drawn from the algorithm's past games. At the end of learning, the value of $\epsilon$ is decayed to 0, and $Q$ is used as our action-value function for the policy $\pi$.
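The action-selection rule of Step 2 can be sketched as follows; a plain dict stands in for the random forest $Q$-function, which is our simplification for illustration.

```python
import random

# Sketch of epsilon-greedy action selection (Step 2).
def epsilon_greedy(Q, state, actions, eps, rng=random):
    if rng.random() < eps:                 # explore: uniform random action
        return rng.choice(actions)
    # exploit: action with the maximal estimated value (unseen pairs get 0)
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```

During training, `eps` starts high and is decayed towards 0, at which point the rule becomes the purely greedy policy used at test time.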

## 5 Experiments

**Training Data:** Our algorithm needs to learn a grouping policy from a training set. The learned policy can then be applied to other datasets for face grouping. Here we employ albums simulated from MS-Celeb-1M [13], covering 80k identities, as our training source. We will release the training data.

**Test Data:** To show the generalizability of the learned policy, we evaluate the proposed approach on three datasets of different scenarios exclusive from the training source. Example images are provided in Figure 4.

*1) LFW-Album*: We construct challenging simulated albums from LFW [14], MS-Celeb-1M [13], and PFW [29], with a good mix of frontal, profile, and uninteresting faces. We prepare 20 albums with mutually exclusive identities. Note that the MS-Celeb-1M samples used here are excluded from the training data.

*2) ACCIO Dataset*: This dataset [12] is commonly used in studies of video face clustering. It contains face tracklets extracted from the Harry Potter movie series. Following [45], we conduct experiments on the first instalment of the series, which contains 3243 tracklets from 36 known identities. For a fair comparison, we do not consider uninteresting faces in this dataset, following [45]. We discard the temporal information and use only the frames in our experiments.

*3) Grouping Faces in the Wild (GFW)*: To better evaluate our algorithm for real-world applications, we collect 60 real users' albums, with permission, from a Chinese social network portal. The size of an album varies from 120 to 3600 faces, with a maximum of 321 identities. In total, the dataset contains 84,200 images with 78,000 faces of 3,132 different identities. All faces are automatically detected using Faster R-CNN [26]; false detections are present. We annotate all detections with identity/noise labels. The images are unconstrained, taken in various indoor/outdoor scenes. Faces are naturally distributed over different poses with spontaneous expressions. In addition, faces can be severely occluded, motion-blurred, and illuminated differently across scenes. We will release the data and annotations. To our knowledge, this is the largest real-world face clustering dataset.

Given the limited space, we exclude results on traditional grouping datasets like Yale-B [11], MSRA-A [47], MSRA-B [47], and Easyalbum [7]. Yale-B was captured under controlled conditions with very few profile faces and noises, and the number of albums is limited in the other three datasets.

Comparison on the three datasets. For each dataset we report B-cubed precision P, recall R, F-score F, and the normalized operation cost OC; columns are grouped as LFW-Album / ACCIO-1 / GFW.

| Method | P(%) | R(%) | F(%) | OC | P(%) | R(%) | F(%) | OC | P(%) | R(%) | F(%) | OC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| K-means | 73.6 | 86.6 | 79.3 | 1.12 | 72.2 | 34.4 | 46.6 | 0.65 | 66.6 | 35.7 | 41.1 | 1.47 |
| GDL [44] | 66.5 | 92.2 | 76.4 | 1.21 | 18.1 | 91.1 | 30.2 | 3.51 | 67.4 | 59.4 | 55.9 | 1.30 |
| HC | 74.2 | 80.8 | 76.6 | 0.35 | 17.1 | 91.9 | 28.9 | 3.28 | 77.5 | 22.3 | 15.0 | 0.81 |
| AP [10] | 76.7 | 71.1 | 73.7 | 1.07 | 82.2 | 9.6 | 17.1 | 0.59 | 69.7 | 25.3 | 32.7 | 0.86 |
| Deep Adaptation [45] | - | - | - | - | 71.1 | 35.2 | 47.1 | - | - | - | - | - |
| IL-K-means | 76.7 | 87.8 | 81.6 | 0.95 | 82.8 | 34.1 | 48.3 | 0.54 | 53.4 | 43.6 | 43.3 | 1.17 |
| IL-GDL | 79.9 | 90.1 | 84.5 | 0.54 | 88.6 | 46.3 | 60.8 | 0.78 | 78.4 | 76.2 | 74.5 | 0.68 |
| IL-HC | 97.8 | 85.3 | 91.1 | 0.14 | 90.8 | 78.6 | 84.3 | 0.52 | 96.6 | 53.7 | 67.3 | 0.17 |
| SVM (deep features) | 82.7 | 87.4 | 85.0 | 0.45 | 89.0 | 61.3 | 72.6 | 0.74 | 84.3 | 46.4 | 56.3 | 0.33 |
| Siamese network (deep features) | 87.1 | 87.6 | 87.3 | 0.44 | 59.7 | 88.1 | 71.2 | 0.79 | 49.9 | 92.3 | 62.8 | 0.33 |

**Implementation Details:** All face representations are extracted from an Inception-v3 model [32] fine-tuned on MS-Celeb-1M [13]. We suggest some parameter settings as follows. We set $\eta$ to balance the scales of the short- and long-term rewards. We fix the number of faces used to form the similarity and quality features (Section 4.1) to $k = 5$: the five shortest distances are a good trade-off between performance and feature complexity. If a group has fewer than five faces (in the extreme, only one face), we pad the distance vector with the farthest distance.

**Evaluation Metrics:** We employ multiple metrics to evaluate face grouping performance, including the B-cubed precision, recall, and F-score suggested by [43] and [45]. Specifically, B-cubed recall measures the average fraction of face pairs belonging to the same ground-truth identity that are assigned to the same cluster, and B-cubed precision is the fraction of face pairs assigned to the same cluster that have matching identity labels. The F-score is the harmonic mean of these two metrics. We also use the *operation cost* introduced in Section 4.2. To facilitate comparisons across datasets of different sizes, we use the operation cost normalized by the number of photos as our metric. We believe that this metric is more important than the others since it directly reflects how much effort per image a user needs to spend to organize a photo album.
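The B-cubed metrics described above can be sketched as follows; this is one common per-item-averaged formulation consistent with the description, not necessarily the exact evaluation script used in the paper.

```python
# Sketch of B-cubed precision/recall, averaged per item.
# pred_of / true_of map each item to its predicted cluster / true identity.
def bcubed(pred_of, true_of):
    items = list(pred_of)
    p = r = 0.0
    for x in items:
        same_pred = [y for y in items if pred_of[y] == pred_of[x]]
        same_true = [y for y in items if true_of[y] == true_of[x]]
        correct = [y for y in same_pred if true_of[y] == true_of[x]]
        p += len(correct) / len(same_pred)   # purity of x's cluster w.r.t. x
        r += len(correct) / len(same_true)   # coverage of x's identity
    n = len(items)
    return p / n, r / n
```

A perfect clustering yields (1.0, 1.0); merging everything into one cluster keeps recall at 1.0 while precision drops, which is exactly the trade-off the F-score summarizes.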

### 5.1 Comparison with Unsupervised Methods

We compare our method with classic and popular clustering approaches: 1) K-means, 2) Graph Degree Linkage (GDL) [44], 3) Hierarchical Clustering (HC), and 4) Affinity Propagation (AP) [10]. We also compare with [45]; since its code is not publicly available, we only compare with its reported precision, recall, and F-scores on the ACCIO-1 dataset. Note that these baselines use the same features as our approach, as discussed in Section 4.1. To verify whether the proposed imitation learning (IL) framework helps existing clustering methods, we adapt K-means, GDL, and HC into IL-K-means^{3}, IL-GDL, and IL-HC.

Table ? summarizes the results on the three datasets. We observe that: (1) imitation learning consistently improves the different clustering baselines. For instance, on LFW-Album, the F-score and operation cost of HC improve from 76.6% and 0.35 to 91.1% and 0.14. Notably, IL-HC outperforms the other IL-based variants, although our framework is not specifically developed to work only with hierarchical clustering. (2) The operation cost is lower with a high-precision algorithm. This result matches our user study, since a user is good at adding similar photos to a group but poor at removing noisy faces that can be hard to distinguish.

We compare grouping results of IL-HC and HC qualitatively in Figure 5. IL-HC yields more coherent face groupings with exceptional robustness to outliers.

### 5.2 Comparison with Supervised Methods

We compare our framework with two supervised baselines, namely an SVM classifier and a three-layer Siamese network. The three layers of the Siamese network have 256, 64, and 64 hidden neurons, respectively. A contrastive loss is used for training. To train the baselines, each time we sample two subsets of identities from MS-Celeb-1M as the training data. The SVM and the Siamese network are used to predict whether two groups should be merged or not. Features are extracted following the method presented in Section 4.1. These supervised baselines are thus strong, since their input features are identical to those we use in our IL framework. The features include the face similarity vector derived from the Inception-v3 face recognition model fine-tuned on the MS-Celeb-1M dataset. The deep representation achieves 99.27% on LFW, which is better than [30] and on par with [37]. The results of the baselines are presented in Table ?. It is observed that the IL-based approach outperforms the supervised baselines by a considerable margin.

### 5.3 Ablation Study

**Further Analysis on the Recommender**: In Section 5.1, we tested three different recommenders based on different clustering methods, namely K-means, GDL, and HC. In this experiment, we further analyze the use of a random recommender that randomly chooses a pair to recommend. Figure 6 shows the F-score comparison between a hierarchical clustering (HC) recommender and a random recommender. In comparison to the HC-based recommender, which always recommends the nearest groups, the random recommender exhibits slower convergence and poorer results. It is worth pointing out that the random recommender still achieves an F-score of 61.9% on GFW, which outperforms the unsupervised baseline at only 15%. The results suggest the usefulness of deploying a recommender.

We also evaluate an extreme variant that does not employ a recommender but instead selects the group pair to merge directly from the values produced by the learned action-value function. Specifically, in each step we compute the action value exhaustively for all possible pairs of groups and merge the pair with the highest value. On GFW, this variant scores better than IL-HC, which is not surprising since it performs an exhaustive search over pairs; however, its runtime complexity is much higher than that of IL-HC. The results suggest that the clustering-based recommender in our framework offers a favorable trade-off between accuracy and cost.
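As a sketch of this exhaustive variant, the step below scores every candidate pair with an action-value function and takes the argmax, making O(n^2) evaluations per merge step instead of the single evaluation needed for a recommended pair. `toy_q` is a hypothetical stand-in for the learned action value, not the paper's model:

```python
import numpy as np
from itertools import combinations

def exhaustive_select(groups, features, q_value):
    """Score every candidate pair of groups with `q_value` and return
    the best pair: O(n^2) evaluations per merge step."""
    pairs = list(combinations(range(len(groups)), 2))
    scores = [q_value(features[groups[i]], features[groups[j]])
              for i, j in pairs]
    return pairs[int(np.argmax(scores))]

def toy_q(fa, fb):
    """Toy stand-in for the learned action value: higher when the two
    group centroids are closer."""
    return -np.linalg.norm(fa.mean(axis=0) - fb.mean(axis=0))
```

A recommender sidesteps this quadratic per-step cost by surfacing a single candidate pair for the agent to accept or reject.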

**Discarding the Face Quality Feature**: If we remove the face quality feature from the feature vector, the scores achieved by IL-HC on LFW-Album, ACCIO-1, and GFW drop from 91.1%, 84.3%, and 67.3% to 89.5%, 65.0%, and 48.4%, respectively. The results suggest that the importance of the quality measure depends on the dataset: the face quality feature is essential on GFW, which contains more poor-quality images, but less so on the others.

**Reward Function Settings**: We evaluate the effect of the two reward terms in the reward function defined in Eqn. .

1) Full reward: the full setting with both the short-term and the long-term reward.

2) w/o long-term reward: the long-term reward based on operation cost is removed.

3) w/o IRL short-term reward: the short-term reward learned by IRL is discarded and redefined as a naïve loss, i.e., an indicator that outputs 1 if the merge is correct and -1 otherwise.

The results reported in Table ? show that both short- and long-term rewards are indispensable for good results. Comparing the “w/o IRL short-term reward” baseline against the full reward, we observe that IL learns a more powerful short-term reward function than the naïve loss. Comparing the “w/o long-term reward” baseline against the full reward, although removing the long-term reward only reduces the score slightly, the numbers of false positive and false negative merges actually increase for noisy and hard cases. Figure 7 shows some representative groups that were mistakenly handled by IL-HC without the long-term reward. It is worth pointing out that by adjusting the operation cost distribution, e.g., changing the costs of ‘add, remove, merge’ from (1,6,1) to (1,1,1), one can shift the algorithm’s bias between precision and recall to suit different application scenarios. B-cubed PR curves in Figure 8 depict the influence of the cost distribution. Hierarchical clustering with imitation learning (IL-HC) outperforms the HC and AP baselines under every setting. We recommend a high-precision setting in order to achieve a low normalized operation cost, as suggested by the experiments in Section 5.1.

| Reward setting | LFW-Album score (%) | LFW-Album op. cost | GFW score (%) | GFW op. cost |
|---|---|---|---|---|
| Full (short- and long-term) | 91.1 | 0.14 | 67.3 | 0.17 |
| w/o long-term reward | 90.7 | 0.14 | 62.6 | 0.17 |
| w/o IRL short-term reward | 73.0 | 0.54 | 17.1 | 0.65 |
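The B-cubed precision and recall used in the PR curves above score each item by how pure and how complete its predicted group is. A minimal sketch following the standard B-cubed definitions (not the authors' exact evaluation code):

```python
from collections import Counter

def bcubed(pred_labels, true_labels):
    """B-cubed precision and recall for a flat clustering.

    For each item: precision is the fraction of its predicted cluster
    that shares its true identity; recall is the fraction of its true
    identity recovered inside its predicted cluster. Both are averaged
    over all items.
    """
    n = len(pred_labels)
    pred_sizes = Counter(pred_labels)                  # size of each predicted cluster
    true_sizes = Counter(true_labels)                  # size of each true identity
    pair_sizes = Counter(zip(pred_labels, true_labels))  # overlap counts
    precision = sum(pair_sizes[(p, t)] / pred_sizes[p]
                    for p, t in zip(pred_labels, true_labels)) / n
    recall = sum(pair_sizes[(p, t)] / true_sizes[t]
                 for p, t in zip(pred_labels, true_labels)) / n
    return precision, recall
```

For instance, merging two equally-sized identities into one predicted group keeps recall at 1.0 but halves precision, which is exactly the trade-off the operation-cost distribution biases.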

### 5.4Generalization of Learned Policy

The policy learned by the proposed approach can be transferred to other datasets with a good generalization performance. Here we conduct two experiments:

**Cross-Dataset Evaluation**: Our algorithm needs to learn a grouping policy from a training set. To show that the learned policy can be transferred to an unseen distribution, we deliberately train IL-HC on LFW and apply the policy to GFW, which has an entirely different data distribution. We then repeat the experiment with the training and test sets swapped. We also evaluate the generalization of policies learned on LFW and GFW to ACCIO. Table ? suggests that IL-HC’s policies generalize well: in all cases, the cross-dataset performance remains better than that of the baselines reported.

| Test \ Train | LFW score (%) | LFW op. cost | GFW score (%) | GFW op. cost |
|---|---|---|---|---|
| LFW | 91.1 | 0.142 | 90.4 | 0.156 |
| GFW | 46.8 | 0.268 | 67.3 | 0.174 |
| ACCIO | 77.7 | 0.549 | 73.6 | 0.642 |

| Method | Original score (%) | Original op. cost | Added noise score (%) | Added noise op. cost |
|---|---|---|---|---|
| IL-HC | 91.1 | 0.14 | 88.2 | 0.21 |
| HC | 76.6 | 0.35 | 58.7 | 0.59 |

**Testing with Additional Noise:** The original LFW-Album dataset contains 10–30% noise, including uninteresting faces and false detections. To evaluate the robustness of IL-HC to noise, we increase the number of uninteresting faces and false detections in LFW-Album. Table ? shows that the proposed algorithm remains competitive even on this more challenging dataset, while the baseline clustering method degrades much more rapidly.

## 6Conclusion

We have proposed a novel face grouping framework that makes sequential merging decisions based on short- and long-term rewards. With inverse reinforcement learning, we learn a powerful reward function to cope with real-world grouping tasks involving unconstrained face poses, illumination, occlusion, and an abundance of uninteresting faces and false detections. We have demonstrated that the framework benefits many existing agglomerative clustering algorithms.

### Footnotes

- Non-interested faces refer to faces that we do not want to group (e.g., faces in the background). The term was popularized by earlier work in face clustering [47].
- Ng and Russell [21] suggest learning the reward function rather than the policy, since a reward function provides a more parsimonious description of behavior.
- For the IL-K-means algorithm, the action space is no longer binary due to the nature of K-means. Here we adapt the framework to a larger action space for determining which cluster a sample should be merged into, and we replace the SVM with a RankSVM [15] to compute the reward for each cluster.

### References

1. P. Abbeel and A. Y. Ng. Apprenticeship learning via inverse reinforcement learning. In *ICML*, 2004.
2. W. Barbakh and C. Fyfe. Clustering with reinforcement learning. In *IDEAL*, 2007.
3. T. L. Berg, A. C. Berg, J. Edwards, M. Maire, R. White, Y.-W. Teh, E. Learned-Miller, and D. A. Forsyth. Names and faces in the news. In *CVPR*, 2004.
4. X. Cao, C. Zhang, H. Fu, S. Liu, and H. Zhang. Diversity-induced multi-view subspace clustering. In *CVPR*, 2015.
5. J.-C. Chen, V. M. Patel, and R. Chellappa. Unconstrained face verification using deep CNN features. In *WACV*, 2016.
6. R. G. Cinbis, J. Verbeek, and C. Schmid. Unsupervised metric learning for face identification in TV video. In *ICCV*, 2011.
7. J. Cui, F. Wen, R. Xiao, Y. Tian, and X. Tang. EasyAlbum: An interactive photo annotation system based on face clustering and re-ranking. In *ACM SIGCHI*, 2007.
8. E. El Khoury, C. Senac, and P. Joly. Face-and-clothing based people clustering in video content. In *ICMIR*, 2010.
9. A. W. Fitzgibbon and A. Zisserman. Joint manifold distance: A new approach to appearance based clustering. In *CVPR*, 2003.
10. B. J. Frey and D. Dueck. Clustering by passing messages between data points. *Science*, 2007.
11. A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. *TPAMI*, 23(6):643–660, 2001.
12. E. Ghaleb, M. Tapaswi, Z. Al-Halah, H. K. Ekenel, and R. Stiefelhagen. Accio: A data set for face track retrieval in movies across age. In *ICMR*, 2015.
13. Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. MS-Celeb-1M: A dataset and benchmark for large scale face recognition. In *ECCV*, 2016.
14. G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst, October 2007.
15. T. Joachims. Optimizing search engines using clickthrough data. In *ACM SIGKDD*, 2002.
16. K. M. Kitani, B. D. Ziebart, J. A. Bagnell, and M. Hebert. Activity forecasting. In *ECCV*, 2012.
17. K.-C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under variable lighting. *TPAMI*, 27(5):684–698, 2005.
18. Z. Li and X. Tang. Bayesian face recognition using support vector machine and face clustering. In *CVPR*, 2004.
19. A. Likas. A reinforcement learning approach to online clustering. *Neural Computation*, 11(8):1915–1932, 1999.
20. V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. *Nature*, 518(7540):529–533, 2015.
21. A. Y. Ng, S. J. Russell, et al. Algorithms for inverse reinforcement learning. In *ICML*, 2000.
22. C. Otto, B. Klare, and A. K. Jain. An efficient approach for clustering face images. In *ICB*, 2015.
23. C. Otto, D. Wang, and A. K. Jain. Clustering millions of faces by identity. *arXiv preprint arXiv:1604.00989*, 2016.
24. O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In *BMVC*, 2015.
25. L. D. Pyeatt, A. E. Howe, et al. Decision tree function approximation in reinforcement learning. In *ISAS*, 2001.
26. S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In *NIPS*, 2015.
27. S. Ross, G. J. Gordon, and D. Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In *AISTATS*, pages 627–635, 2011.
28. F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In *CVPR*, 2015.
29. S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In *WACV*, 2016.
30. Y. Sun, Y. Chen, X. Wang, and X. Tang. Deep learning face representation by joint identification-verification. In *NIPS*, 2014.
31. Y. Sun, X. Wang, and X. Tang. Deeply learned face representations are sparse, selective, and robust. In *CVPR*, 2015.
32. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. *arXiv preprint arXiv:1512.00567*, 2015.
33. Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In *CVPR*, 2014.
34. Z. Tang, Y. Zhang, Z. Li, and H. Lu. Face clustering in videos with proportion prior. In *IJCAI*, 2015.
35. Y. Tian, W. Liu, R. Xiao, F. Wen, and X. Tang. A face annotation framework with partial clustering and interactive labeling. In *CVPR*, 2007.
36. C. J. C. H. Watkins. *Learning from delayed rewards*. PhD thesis, University of Cambridge, England, 1989.
37. Y. Wen, K. Zhang, Z. Li, and Y. Qiao. A discriminative feature learning approach for deep face recognition. In *ECCV*, 2016.
38. B. Wu, Y. Zhang, B.-G. Hu, and Q. Ji. Constrained clustering and its application to face clustering in videos. In *CVPR*, 2013.
39. S. Xia, H. Pan, and A. Qin. Face clustering in photo album. In *ICPR*, 2014.
40. Y. Xiang, A. Alahi, and S. Savarese. Learning to track: Online multi-object tracking by decision making. In *ICCV*, 2015.
41. S. Xiao, M. Tan, and D. Xu. Weighted block-sparse low rank representation for face clustering in videos. In *ECCV*, 2014.
42. L. Xie, A. N. Antle, and N. Motamedi. Are tangibles more fun? Comparing children’s enjoyment and engagement using physical, graphical and tangible user interfaces. In *International Conference on Tangible and Embedded Interaction*, pages 191–198, 2008.
43. L. Zhang, D. V. Kalashnikov, and S. Mehrotra. A unified framework for context assisted face clustering. In *ICMR*, pages 9–16, 2013.
44. W. Zhang, X. Wang, D. Zhao, and X. Tang. Graph degree linkage: Agglomerative clustering on a directed graph. In *ECCV*, 2012.
45. Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Joint face representation adaptation and clustering in videos. In *ECCV*, 2016.
46. C. Zhou, C. Zhang, H. Fu, R. Wang, and X. Cao. Multi-cue augmented face clustering. In *ACM MM*, 2015.
47. C. Zhu, F. Wen, and J. Sun. A rank-order distance based clustering algorithm for face tagging. In *CVPR*, 2011.