Low-resolution Face Recognition in the Wild
via Selective Knowledge Distillation
Typically, the deployment of face recognition models in the wild needs to identify low-resolution faces with extremely low computational cost. To address this problem, a feasible solution is compressing a complex face model to achieve higher speed and lower memory at the cost of minimal performance drop. Inspired by that, this paper proposes a learning approach to recognize low-resolution faces via selective knowledge distillation. In this approach, a two-stream convolutional neural network (CNN) is first initialized to recognize high-resolution faces and resolution-degraded faces with a teacher stream and a student stream, respectively. The teacher stream is represented by a complex CNN for high-accuracy recognition, and the student stream is represented by a much simpler CNN for low-complexity recognition. To avoid significant performance drop at the student stream, we then selectively distil the most informative facial features from the teacher stream by solving a sparse graph optimization problem, which are then used to regularize the fine-tuning process of the student stream. In this way, the student stream is actually trained by simultaneously handling two tasks with limited computational resources: approximating the most informative facial cues via feature regression, and recovering the missing facial cues via low-resolution face classification. Experimental results show that the student stream performs impressively in recognizing low-resolution faces and costs only MB memory and runs at faces per second on CPU and faces per second on GPU.
Face, a fundamental attribute that distinguishes one subject from another, needs to be recognized many times every day in modern computer vision and multimedia applications. Among these applications, many well-known face recognition models [1, 2, 3, 4] need to be re-deployed on mobile phones or even smart cameras to meet real-world requirements that aim to identify low-resolution faces with extremely low computational cost and memory footprint (i.e., face recognition in the wild). Toward this end, it is necessary to explore a feasible solution to a key challenge in face recognition: how can an existing complex face model be converted into a more efficient one that still works well on low-resolution faces without a remarkable loss of recognition accuracy?
Compared with high-resolution faces, low-resolution faces have their unique visual attributes. As shown in Fig. 1, many details are missing in low-resolution faces. However, they are still recognizable for subjects who are familiar with the corresponding high-resolution faces, implying that the neural systems of human beings may have the capability of recovering missing details of familiar faces. Inspired by this fact, many existing low-resolution face models have been proposed, which can be roughly grouped into two categories: the hallucination category and the embedding category.
Models in the hallucination category reconstruct high-resolution faces before recognition [8, 9, 10]. For example, Kolouri et al. described a transport-based single-frame super-resolution method that automatically constructs a nonlinear Lagrangian model of high-resolution facial appearance. The low-resolution facial image was then enhanced by finding the model parameters that best fit the given low-resolution data. Jian et al. observed that the singular values of a face image at different resolutions have an approximately linear relationship. Based on this observation, they first applied singular value decomposition for face representation to learn the mapping function between low-resolution and high-resolution face pairs, and then performed hallucination and recognition of low-resolution faces simultaneously. A similar method proposed by Yang et al. used sparse representation to perform joint hallucination and recognition, which can synthesize person-specific versions of low-resolution faces with a recognition guarantee. Typically, these approaches exhibit impressive performance in recognizing the reconstructed high-resolution faces, but the super-resolution operation often incurs additional computational cost and leads to low recognition speed.
Different from the hallucination-based models, models in the embedding category directly extract discriminative features from low-resolution faces by using various external face contexts. For example, Biswas et al. proposed embedding low-resolution facial images into a Euclidean space such that the distances between them in the transformed space can well approximate the best distances of high-resolution faces. Ren et al. proposed coupled kernel embedding to map the facial images with different resolutions onto an infinite subspace. The recognition process was then carried out in the new space by minimizing the dissimilarities captured by their kernel Gram matrices in the low-resolution and high-resolution spaces. Generally speaking, the most important process in the embedding-based approaches is transferring the knowledge from high-resolution faces to low-resolution ones. However, a key issue that needs to be carefully addressed in this process is correctly transferring only the desired knowledge, rather than all of it, from the high-resolution domain to the low-resolution domain. Such selective knowledge transfer is one of the most important challenges in converting existing face models into more efficient ones that also work well on low-resolution faces.
Inspired by this fact, we propose a selective knowledge distillation approach for low-resolution face recognition in the wild. As shown in Fig. 2, a two-stream CNN is first trained to simultaneously recognize high-resolution faces and their resolution-degraded versions by using two streams. The two streams consist of a teacher stream that operates on high-resolution faces, and a student stream that is much simpler for low-resolution face recognition. To ensure that the student stream has comparable recognition performance with the teacher stream, we then selectively distil only the most informative facial features from the teacher stream by solving a sparse graph optimization problem, which are then used to regularize the fine-tuning process of the student stream. In this way, the student stream is actually trained by simultaneously handling two tasks with limited computational resources: approximating the most informative facial cues via feature regression, and recovering the missing facial cues via low-resolution face classification. Note that the teacher stream can adopt any architecture of existing deep face models, implying that the proposed approach can convert any existing face model into a much simpler one with higher speed and lower memory at the cost of minimal performance drop. Experimental results on four public datasets show that the student stream performs impressively in recognizing faces at extremely low resolutions. In particular, it uses only MB memory and runs at about faces per second on a single CPU thread or faces per second on GPU.
The main contributions of this paper are summarized as follows. 1) We propose a face model compression method via selective knowledge distillation, which can greatly reduce model size and complexity without a remarkable performance drop; 2) We propose a graph-based optimization algorithm that can extract the most discriminative facial features from existing face models, which can be used to supervise the training process of low-resolution face models; and 3) We conduct comprehensive experiments to show that the compressed model can achieve an extremely high recognition speed with an accuracy comparable to that of state-of-the-art high-resolution face models.
The rest of this paper is organized as follows: Section II reviews related works and Section III presents the selective knowledge distillation approach. Extensive experiments are conducted in Section IV to evaluate the proposed approach, and the paper is concluded in Section V.
II Related Works
The approach we proposed in this paper aims to distil knowledge from complex face models for low-resolution face recognition. Therefore, we briefly review related works from three aspects, including the general face recognition models, low-resolution face recognition and knowledge distillation.
II-A General Face Recognition Models
Recently, the general face recognition technique has evolved from the classic shallow frameworks [13, 14] to deep ones [1, 15, 3, 4, 16, 17, 18] with impressive performance improvements. Among the deep approaches, a key distinguishing factor is the loss function they adopt. For example, DeepFace is an early attempt to ensemble Convolutional Neural Networks (CNNs) by building 3D faces with identification loss. After that, various loss functions have been proposed for training face recognition CNNs, such as triplet loss [3, 4], center loss and range loss. In , the tasks of identifying faces and their attributes were simultaneously considered to enhance the recognition performance. For the DeepID series, several small CNNs using different facial patches were first separately trained in , and its subsequent works incorporate face verification signals and change the base networks to increase accuracy.
Generally speaking, these deep models have achieved impressive performance in recognizing general faces. As shown in Tab. I, however, many of these generic models have a large number of parameters, high-dimensional feature representations and complex classification functions for inference. The complexity of these models prevents them from being directly deployed in the wild, where computational resources are limited. Although the DeepID series takes low-resolution faces as the input, the unique attributes of low-resolution faces are not explored. To further enhance the performance of low-resolution face recognition, it is necessary to explore the features that go missing during resolution degradation.
II-B Low-Resolution Face Recognition
Typically, there are two ways for low-resolution face recognition. The hallucination category aims to reconstruct high-resolution faces before recognition, while the embedding category proposes extracting features directly from low-resolution faces via the embedding schema. In the hallucination category, Kolouri et al.  constructed a nonlinear Lagrangian model of high-resolution facial appearance and then found the model parameters that best fit the low-resolution faces. Jian et al.  proposed a framework based on singular value decomposition and performed face hallucination and recognition simultaneously. In , a joint face hallucination and recognition framework was proposed based on sparse representation. This framework can synthesize person-specific low-resolution faces for recognition. In , a system was proposed to recognize faces by using sparse representation with the specific dictionary involving many natural and facial images. Moreover, deep models like  and  can generate extremely realistic high-resolution images from low-resolution faces. However, the speed of such hallucination or super-resolution based approaches may be a little slow due to the complex high-resolution face reconstruction process, which hinders their direct deployment in real-world scenarios with limited computational resources.
Instead of reconstructing high-resolution faces, a more direct approach is embedding low-resolution faces into various external contexts to recover the missing features during resolution degradation. Inspired by that, some approaches proposed transforming both high-resolution and low-resolution faces into a unified feature space for matching [25, 26, 27, 28, 29, 30, 31], while in [32, 33] the multi-scale (multi-resolution) faces were simultaneously analyzed to extract better features. In , the multidimensional scaling was adopted to learn a common transformation matrix to simultaneously transform the facial features of low-resolution and high-resolution training images. Shekhar et al.  proposed a joint sparse coding technique for robust recognition at low-resolution, while Wang et al.  attempted to solve very low resolution recognition problem using deep learning methods. In , CNNs were adopted with a manifold-based track comparison strategy for low-resolution face recognition in videos.
From these approaches, we find that the core idea of the embedding-based approaches is transferring (or making use of) the knowledge from high-resolution faces. As a result, the performance in low-resolution face recognition is mainly influenced by two key factors: what knowledge to transfer and how to make use of it. In other words, the desired knowledge should be selectively distilled from high-resolution data (or models) and should guide the low-resolution face recognition process in the right way. This is also the core idea of this paper.
II-C Knowledge Distillation
Instead of mining the knowledge from high-resolution faces, another way to obtain a low-resolution face model (i.e., the student network) is distilling such knowledge directly from pre-trained complex face models (i.e., the teacher network). With the development of much deeper and wider networks, such distillation technique has been adopted in many works [38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49] to compress a complex model (or an ensemble) into a simpler model that is much easier to deploy. Among these works, Luo et al.  utilised the learned knowledge of a large teacher network or the ensemble of some networks as the supervision to train a compact student network for face recognition. In their approach, the most relevant neurons for face recognition were selected at the higher hidden layers for knowledge transfer. Lopez-Paz et al.  proposed the general distillation framework to combine distillation and learning with privileged information. Su and Maji  proposed cross quality distillation to learn models for recognizing low-resolution images, non-localized objects and line-drawings by using labeled high-resolution images, labeled localized objects and color images, respectively. Radosavovic et al.  proposed data distillation to ensemble predictions from multiple transformations of unlabeled data to automatically generate new training annotations.
To sum up, the core component of knowledge distillation is the trade-off between speed and performance, and such a technique provides an opportunity to convert many complex models into simple models that can be deployed in the wild. Note that in this study we not only try to distil complex face models into simple ones, but also explore the feasibility of using resolution-degraded faces as the input to further increase the recognition speed while maintaining the recognition accuracy. In this way, the challenges in low-resolution face recognition and knowledge distillation are simultaneously addressed within a single framework.
III The Approach
Our two-stream knowledge distillation framework consists of a teacher stream and a student stream (see Fig. 2). The teacher stream can adopt any complex face recognition neural network that has been previously trained (and whose training data may no longer be available). The distillation process aims to learn a simple and compact student stream that imitates the teacher stream for practical deployment in real-world scenarios.
The learning process consists of three stages: 1) the Initialization stage initializes the teacher stream by taking a complex CNN or an ensemble of several CNNs pre-trained on high-resolution face images, and the student stream by classifying low-resolution face images with identity labels; 2) the Selection stage extracts the most informative knowledge from the teacher stream, where the “right” knowledge is selected while the “wrong” knowledge is wiped out; and 3) the Fine-tuning stage transfers the selected knowledge from the teacher and uses it together with the low-resolution face images to co-supervise the fine-tuning process of the student stream by jointly performing feature regression and face identity classification. More details of the three stages are described as follows.
For the sake of simplicity, we define the key components of the two-stream CNNs as follows:
Teacher stream. The teacher stream is a complex CNN (or an ensemble of several complex CNNs) with the set of parameters pre-trained for recognizing a high-resolution face . Here we assume that absorbs the rich knowledge encoded in massive high-resolution face images from a teacher face set , in which each face is labeled by an integer from the identity set . Generally, the number of training face images is very large and may be invisible to the student stream (e.g., a CNN model released on the Internet is pre-trained with additional face images from private datasets).
Student stream. The student stream is a much simpler CNN for recognizing a low-resolution face with parameters . It is learned from the student face set , where is the number of high-resolution faces. For each high-resolution face , the student face set also contains its resolution-degraded versions, and the th resolution-degraded face is denoted as . Note that both the high-resolution face and its degraded versions correspond to the same identity label from the identity set . Here we assume that there are totally classes of faces in , and the number of high-resolution faces for the th class is such that .
III-B Initialization of the Two-stream CNNs
As shown in Fig. 2, our two-stream CNNs simultaneously conduct high-resolution and low-resolution face recognition with a teacher stream and a student stream , respectively. The parameter set of the teacher stream can be initialized by state-of-the-art face recognition models or their ensemble, such as VGGFace  with VGG16 architecture , FaceNet  with GoogLeNet architecture  and VGGFace2  with ResNet50 architecture . As a representative example, we use the architecture of VGGFace  in the teacher stream and initialize with the author-released model. Note that VGGFace is pre-trained on a massive face image dataset , which we assume is no longer available in the knowledge distillation process.
The student stream aims to recognize a low-resolution face with a compact network trained on . Therefore, we adopt a lightweight network architecture that is similar to . As shown in Fig. 3, the student stream can take a low-resolution (e.g., ) face as the input with its majority using filters and increasing the number of channels after every pooling step. Moreover, global average pooling is used to make predictions, and filters are used to reduce the feature dimension between convolutions. Note that a mimic layer is adopted here to receive the knowledge from the teacher stream in the future, where is the dimension of the learned high-resolution face representation in the teacher stream. Thus, the features from the mimic layer can be used for feature approximation with the teacher network. In addition, since the capacity of the student model is weak, the feature layer that mimics the transferred knowledge should be sufficiently deep; we therefore empirically insert the identity layer between the mimic layer and the softmax layer. The identity layer also plays the role of feature compression. Finally, the architecture has ten convolutional layers, three max pooling layers and three fully-connected layers. As a result, the number of parameters in reaches only M, which is only % of the teacher parameter set (i.e., M). These parameters are first initialized with the Xavier method and then optimized by minimizing the classification loss over :
III-C Selective Knowledge Distillation from the Teacher Stream
After the initialization stage, the student stream usually suffers from low recognition accuracy over low-resolution faces since many identity cues are missing during the resolution degradation. As a result, its parameters need to be fine-tuned again under the supervision of the teacher stream to learn how to extract the most discriminative features even when the face resolution is very low. Let and be the sub-networks composed by the first several layers of the teacher stream and the student stream , respectively. is the feature extraction backend before the softmax layer for extracting the identity features of high-resolution face images, while corresponds to the main feature branch till the mimic layer and is used to extract approximated features to match the teacher. The fine-tuning process of the student stream can be described as mimicking the feature representation with to improve the final recognition ability of . The problem is: how to conduct the inter-network supervision?
Typically, the teacher stream has a very powerful ability to recognize high-resolution faces with the identities available in the teacher face set . However, it cannot directly recognize the unfamiliar faces from the student face set due to the diverse identities from and . In this case, the knowledge from the teacher stream may contain some errors, which will mislead the fine-tuning process of the student stream. Thus, we selectively distil only the most informative knowledge and reject the wrong knowledge from to improve the feature extraction ability of the sub-network and the face recognition ability of the student stream .
To this end, a feasible solution is to find the most informative faces from by using the features given by the teacher stream, where informative faces are defined as those with small inter-class similarity and large intra-class similarity. We therefore formulate the selective knowledge distillation process as an inference problem on a graph, in which the nodes represent faces and the edges represent their correlations. As shown in Fig. 4, a densely-connected graph will contain massive edges between all the nodes from face classes and thus slow down the inference process. In order to conduct the graph-based inference efficiently, we add a centroid node for each face class and then construct a sparsely-connected graph . In the graph , the node set contains two types of nodes: face nodes and centroid nodes . The th face node and the th centroid node are represented by -dimensional column feature vectors and , respectively. These two types of feature vectors are extracted from high-resolution face images with the teacher model and can be computed as
where is an indicator function which equals 1 if and 0 otherwise. We can see that each face node is characterized by the appearance of a specific face, and the centroid node is represented by the average appearance.
With the assistance of centroid nodes, we can construct a sparse graph whose edge set consists of densely-connected intra-class edges that link all face nodes within the same class, and sparsely-connected inter-class edges that link face nodes in one class with the remaining centroid nodes outside the class. In this way, the sparsely-connected graph contains nodes in total and only edges.
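The construction above can be illustrated with a minimal Python sketch. It builds each centroid node as the mean of the teacher features in its class and enumerates the two edge types, so that the saving over dense inter-class connections can be counted directly. The function names, the toy features and the use of undirected edges are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations


def class_centroids(features, labels):
    """Average the teacher features of each class to obtain its centroid node."""
    sums, counts = {}, defaultdict(int)
    for f, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(f)
        sums[y] = [s + v for s, v in zip(sums[y], f)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}


def sparse_edges(labels):
    """Return (intra, inter) edge lists of the sparsely-connected graph.

    intra: undirected edges between face nodes that share a class.
    inter: edges from each face node to the centroids of all other classes.
    """
    classes = sorted(set(labels))
    intra = [(i, j) for i, j in combinations(range(len(labels)), 2)
             if labels[i] == labels[j]]
    inter = [(i, c) for i, y in enumerate(labels) for c in classes if c != y]
    return intra, inter


# Toy data (hypothetical values): 5 faces, 2-D features, 2 identities.
features = [[1.0, 0.0], [0.8, 0.2], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
labels = [0, 0, 0, 1, 1]

centroids = class_centroids(features, labels)
intra, inter = sparse_edges(labels)

# Inter-class edges a dense graph would need (face-to-face across classes).
dense_inter = sum(1 for i, j in combinations(range(len(labels)), 2)
                  if labels[i] != labels[j])
print(len(intra), len(inter), dense_inter)
```

With many classes and many faces per class, the face-to-centroid edges grow with the number of faces times the number of classes, whereas dense inter-class edges grow with the square of the number of faces.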
Given the sparse graph, we can select the most informative faces by solving a binary labeling problem:
where is a binary vector with components, and its th component equals 1 if the face is an informative face and 0 otherwise. Note that we use the cosine distance to measure the similarity between two feature vectors. The first term favors selecting fewer face nodes, namely those that have low similarity with the “average” faces in other classes. is a negative weight that balances the two terms so that the second term favors the face nodes that have high similarity with other faces with the same identity label. In particular, with the non-negative distance measure and the negative weight , the first term tends to select fewer faces and the second term tends to select more. In practice, we can solve the problem (3) by using the graph cut algorithm .
After solving (3), we can select a limited number of informative faces with high intra-class similarity and low inter-class similarity. In this process, the outliers, which are likely to be the errors made by the teacher stream, are discarded from the perspective of feature clustering. In this way, much helpful knowledge can be accurately distilled and the influence of noisy knowledge introduced by the teacher network can be greatly alleviated, which refines the feature supervision for training the student network. The number of outliers discarded can be controlled by (the influence of will be discussed in experiments).
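As a rough illustration of the selection step, the sketch below minimizes an assumed energy over the binary indicator vector. The energy form (a unary similarity to other-class centroids plus a negatively weighted pairwise intra-class similarity), the function names and the toy features are assumptions made for this example. The solver is iterated conditional modes, a simple local search swapped in for the graph cut algorithm, so it is not guaranteed to reach the same optimum.

```python
import math


def cosine(u, v):
    """Cosine similarity; non-negative when all features are non-negative."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_informative(features, labels, lam=-1.0, max_sweeps=20):
    """Approximately minimize the ASSUMED energy

        E(a) = sum_i a_i * sum_{c != y_i} sim(f_i, mu_c)
             + lam * sum_{i<j, y_i == y_j} a_i * a_j * sim(f_i, f_j)

    over binary a, using iterated conditional modes (not graph cut).
    """
    n = len(features)
    classes = sorted(set(labels))
    # Centroid node of each class: mean teacher feature of its faces.
    mu = {}
    for c in classes:
        members = [f for f, y in zip(features, labels) if y == c]
        mu[c] = [sum(col) / len(members) for col in zip(*members)]
    # Unary cost: similarity to the centroids of all other classes.
    unary = [sum(cosine(features[i], mu[c]) for c in classes if c != labels[i])
             for i in range(n)]
    alpha = [1] * n  # start by keeping every face
    for _ in range(max_sweeps):
        changed = False
        for i in range(n):
            pair = sum(alpha[j] * cosine(features[i], features[j])
                       for j in range(n) if j != i and labels[j] == labels[i])
            # Keep face i only if doing so lowers the energy.
            new = 1 if unary[i] + lam * pair < 0 else 0
            if new != alpha[i]:
                changed = True
            alpha[i] = new
        if not changed:
            break
    return alpha


# Toy data (hypothetical): the third face of class 0 looks like class 1.
features = [[1.0, 0.1], [0.9, 0.2], [0.2, 1.0],
            [0.1, 1.0], [0.2, 0.9], [0.15, 0.95]]
labels = [0, 0, 0, 1, 1, 1]
selected = select_informative(features, labels, lam=-1.0)
print(selected)
```

In this toy case the outlier in class 0 is the only face whose inter-class cost outweighs its intra-class support, so it is the one dropped from feature regression.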
III-D Teacher-supervised Student Stream Fine-tuning
With the selected informative faces and their features extracted by the teacher stream, the fine-tuning of the student stream will jointly address two issues: 1) approximating the features of informative faces given by the teacher stream via feature regression, and 2) recovering the missing facial cues from low-resolution faces. Thus, we can fine-tune the student stream by solving the minimization problem
where the influences of the classification loss and the regression loss are combined together with equal importance to form a multi-task learning problem. The first term is the classification loss of the student stream over all low-resolution faces. Similar to (1), it is defined as
The term in (4) is the feature regression loss of the sub-network formed by the feature extraction backend of the student stream. It can be defined as
By incorporating these two terms into (4), we can solve the classification and regression tasks via the stochastic gradient descent algorithm with standard back-propagation . In this way, the student stream can be fine-tuned under the supervision of the teacher stream in the form of feature regression, leading to improved low-resolution face recognition ability with a limited computational cost.
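The joint objective can be sketched in plain Python as follows. The exact loss definitions are assumed here: softmax cross-entropy for classification, squared Euclidean distance between student mimic features and teacher features for regression, a 0/1 mask from the selection step, and averaging over the batch. All names and toy values are hypothetical.

```python
import math


def softmax_cross_entropy(logits, label):
    """Negative log-likelihood of the true class under a softmax."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_z - logits[label]


def joint_loss(logits, labels, student_feats, teacher_feats, alpha):
    """Equal-weight sum of classification and masked regression losses.

    alpha[i] is 1 if face i was selected as informative, else 0, so that
    discarded faces contribute to classification only (assumed form).
    """
    n = len(labels)
    cls = sum(softmax_cross_entropy(z, y) for z, y in zip(logits, labels)) / n
    reg = sum(a * sum((s - t) ** 2 for s, t in zip(sf, tf))
              for a, sf, tf in zip(alpha, student_feats, teacher_feats)) / n
    return cls + reg, cls, reg


# Toy batch of two low-resolution faces (hypothetical values).
logits = [[2.0, 0.0], [0.0, 0.0]]
labels = [0, 1]
student_feats = [[1.0, 0.0], [0.0, 1.0]]
teacher_feats = [[1.0, 1.0], [5.0, 5.0]]
alpha = [1, 0]  # the second face was discarded by the selection step

total, cls, reg = joint_loss(logits, labels, student_feats, teacher_feats, alpha)
print(total, cls, reg)
```

The second face has a large feature mismatch with the teacher, but because its indicator is 0 it does not influence the regression term.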
IV Experiments
In this section, we first introduce the experiment setting and then conduct four experiments to verify the proposed approach. The first experiment is conducted to analyze and discuss the influence of selective knowledge distillation, and the second experiment compares the performance of teacher and student networks in a face verification task. In the third and the fourth experiments, we further compare the student stream with state-of-the-art low-resolution face models in the face recognition task and the face retrieval task, respectively. Finally, we conduct an efficiency analysis of the learned student models.
IV-A Experiment Setting
We conduct experiments on four well-known face datasets: UMDFaces , LFW (Labeled Faces in the Wild) , UCCS (UnConstrained College Students)  and SCface (Surveillance Cameras face) , which are used to verify the proposed approach from the perspective of selective knowledge distillation, face verification, face identification and face retrieval, respectively. Details of the four datasets (and experimental settings) are listed as follows. We implement all the models with TensorFlow  on an NVIDIA K80 GPU and a single-core 2.6 GHz Intel CPU.
The UMDFaces dataset  contains images with annotations from subjects, which is obtained by crawling public images on the Internet. In the experiments, we use this dataset to train all the student models and verify the selective distillation operation. For each training face, we first perform face alignment by using the algorithm in  to localize facial landmarks. Faces are then cropped and normalized into high-resolution images, which are used as the input of the teacher network. Similarly, to form the low-resolution faces for training each student network, we perform random perturbation times on the localized facial landmarks, and then crop and normalize the face regions into face images with size where the resolution value .
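The generation of resolution-degraded training faces can be sketched as below. This is a simplified stand-in: it jitters the landmarks, crops their bounding box with a margin, and resizes by nearest-neighbor sampling, whereas a real pipeline would typically align faces with a geometric transform and use an image library for interpolation. The number of perturbations, the shift range, the margin and the output size are all hypothetical parameters, since the actual values are not recoverable from the text.

```python
import random


def perturb_landmarks(landmarks, max_shift, rng):
    """Shift every landmark by a random offset of at most max_shift pixels."""
    return [(x + rng.uniform(-max_shift, max_shift),
             y + rng.uniform(-max_shift, max_shift)) for x, y in landmarks]


def crop_and_resize(image, landmarks, size, margin=0.25):
    """Crop the landmark bounding box (plus margin) and resize to size x size.

    image is a 2-D list of gray values; resizing is nearest neighbor.
    """
    h, w = len(image), len(image[0])
    xs = [p[0] for p in landmarks]
    ys = [p[1] for p in landmarks]
    pad_x = margin * (max(xs) - min(xs))
    pad_y = margin * (max(ys) - min(ys))
    # Clamp the crop window to the image borders.
    x0, x1 = max(0.0, min(xs) - pad_x), min(w - 1.0, max(xs) + pad_x)
    y0, y1 = max(0.0, min(ys) - pad_y), min(h - 1.0, max(ys) + pad_y)
    out = []
    for r in range(size):
        src_y = int(round(y0 + (y1 - y0) * r / (size - 1)))
        row = [image[src_y][int(round(x0 + (x1 - x0) * c / (size - 1)))]
               for c in range(size)]
        out.append(row)
    return out


# Toy 64x64 "face" image and five landmarks (hypothetical values).
image = [[(r + c) % 256 for c in range(64)] for r in range(64)]
landmarks = [(20, 24), (44, 24), (32, 36), (24, 46), (40, 46)]

rng = random.Random(0)  # fixed seed for reproducibility
low_res_faces = [crop_and_resize(image, perturb_landmarks(landmarks, 2.0, rng), size=16)
                 for _ in range(3)]
print(len(low_res_faces), len(low_res_faces[0]), len(low_res_faces[0][0]))
```

Each perturbation yields a slightly different crop of the same identity, which is how one high-resolution face gives rise to several low-resolution training samples.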
On the UMDFaces dataset, all the face images are first fed into VGGFace, the selected teacher stream, to extract D feature vectors. By solving the graph optimization problem in Eq. (3), informative features are selected and represented with an indicator vector . After that, with the selected informative features, face identity labels and the student input faces, the student network is trained by using the standard back-propagation (BP) algorithm. In the training, we set the batch size as . A batch normalization layer is introduced to accelerate the network training and prevent over-fitting.
On the LFW dataset , we evaluate all student models in the task of face verification. In the experiment, pairs of face images, including positive and negative pairs, are adopted in the evaluation. The performance is reported as the area under the ROC curve (AUC). In the experiment, feature vectors in the hidden layers (mimic layer and identity layer) are first extracted and normalized from a pair of face images. The similarity between them is calculated for verification by using a simple threshold. Unlike  that trained Joint Bayesian  for face verification, the similarity is used throughout the experiments to directly show the benefit from better supervision utilized to train students.
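The verification protocol can be sketched as follows. The choice of cosine similarity on L2-normalized features is an assumption made for this example, and the threshold value and toy feature pairs are hypothetical. The AUC is computed with the rank (Mann-Whitney) formulation, which equals the area under the ROC curve.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity(f1, f2):
    """Cosine similarity as the dot product of L2-normalized features."""
    return sum(a * b for a, b in zip(l2_normalize(f1), l2_normalize(f2)))


def auc(scores, is_same):
    """Probability that a positive pair scores higher than a negative pair."""
    pos = [s for s, y in zip(scores, is_same) if y]
    neg = [s for s, y in zip(scores, is_same) if not y]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0
               for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy pairs: (feature of face A, feature of face B, same identity?).
pairs = [
    ([1.0, 0.0], [0.9, 0.1], True),
    ([0.0, 1.0], [0.1, 0.8], True),
    ([1.0, 0.0], [0.0, 1.0], False),
    ([1.0, 0.2], [0.3, 1.0], False),
]
scores = [similarity(a, b) for a, b, _ in pairs]
is_same = [y for _, _, y in pairs]

threshold = 0.7  # hypothetical operating point
decisions = [s >= threshold for s in scores]
print(auc(scores, is_same), decisions)
```

Because the AUC depends only on the ranking of the scores, it does not require fixing a threshold; the threshold matters only when a hard same/different decision is needed.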
On the UCCS dataset , we compare the student models with state-of-the-art methods in the face recognition task. Faces from labeled identities subjects are adopted, where blurry, occluded and badly illuminated images are common. Note that the identities in training and testing are exclusive. This dataset is suitable for benchmarking more challenging unconstrained face recognition in surveillance conditions.
On the SCface dataset , we compare the student models with state-of-the-art methods in the face retrieval task. The dataset contains subjects, each having one high-resolution frontal face image and multiple low-resolution images, captured from three distances (4.2m, 2.6m and 1.0m, respectively) using surveillance cameras of different quality. In the experiment, subjects are randomly selected for training and the rest subjects for testing. Among the testing images, for each subject, one high-resolution face image is used for constructing the retrieval dataset and low-resolution face images are used for retrieving. As a result, each low-resolution face image from total is matched with high-resolution face images. The rank-1 recognition accuracies on the three subsets with different distances and on the total set are reported, respectively.
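The rank-1 retrieval metric can be sketched as below: every low-resolution probe is matched against the gallery of high-resolution faces, and a hit is counted when the most similar gallery face has the same identity. The use of cosine similarity, the function names and the toy features are assumptions for illustration only.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids):
    """Fraction of probes whose nearest gallery face has the same identity."""
    hits = 0
    for feat, pid in zip(probe_feats, probe_ids):
        best = max(range(len(gallery_feats)),
                   key=lambda g: cosine(feat, gallery_feats[g]))
        if gallery_ids[best] == pid:
            hits += 1
    return hits / len(probe_ids)


# Gallery: one high-resolution face per subject (hypothetical features).
gallery_feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
gallery_ids = ["a", "b", "c"]

# Probes: low-resolution faces; the third one is confused with subject "a".
probe_feats = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.6, 0.5, 0.1], [0.1, 0.1, 0.8]]
probe_ids = ["a", "b", "b", "c"]

accuracy = rank1_accuracy(probe_feats, probe_ids, gallery_feats, gallery_ids)
print(accuracy)
```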
IV-B Selective Knowledge Distillation
To study selective knowledge distillation, we first explore the influence of different settings of parameter on the sparse graph optimization algorithm. Therefore, we gradually increase from to with integer power of , and then investigate the decreasing tendency of the number of selected informative faces. In Fig. 5, we show the influence of parameter on the number of selected informative faces.
When the negative constant is very small, the number of the informative faces decreases very slowly. After the increases to around , the number of the informative faces starts to decrease sharply. It continues to decrease and remains after becomes larger than .
We further delve into the process of discarding faces during selective knowledge distillation. Fig. 6 gives an example showing the process of discarding faces in five identity classes when increases, where we adopt t-Distributed Stochastic Neighbor Embedding (t-SNE)  to visualize the high-dimensional face nodes. By solving the sparse graph optimization problem in Eq. (3), some noisy face nodes (see the original face nodes in Fig. 6(a) and the noisy nodes in Fig. 6(b) marked in gray) that are usually far away from their own class centroid or closer to other classes will be discarded, leading to a more compact visualization (see Fig. 6(c)). In Fig. 6(d), we show the discarded noisy face images, where we can see that the discarded faces are often characterized by side poses, heavy occlusions, inconsistent illuminations or blurry appearances. These challenging images, which may be beyond the recognition capability of the teacher stream, are identified and discarded. This implies that selective knowledge distillation indeed selects the more informative faces while discarding the less informative ones. Ideally, the teacher model should have a powerful ability to handle various face variations and cluster the faces correctly, which means that high intra-class similarity and low inter-class similarity are achieved with the features extracted by the teacher model. However, in some challenging cases such as large pose variations, the teacher model may fail and thus cause low intra-class similarity (as shown in Fig. 6). In this case, the extracted teacher knowledge is considered “wrong” and thus will be discarded by our method in the feature regression task.
IV-C Low-Resolution Face Verification on LFW
With the selected face images, we train a set of student models to compress the teacher model with different input resolutions and various supervision signals. The supervision signals we study are abbreviated as follows:
: only face class supervision (no distillation).
: selective distillation without face class supervision
: selective distillation with face class supervision
: direct distillation with face class supervision
For the sake of simplicity, a student model is represented as S--, where supervision signal . For example, model S-- means the student model uses a input resolution of and is trained with both selective distillation and class supervision. Note that S-- is the baseline student model that is directly trained with the supervision of face classes. Similarly, the teacher model is represented as T- with input. All the student models are trained with the same architecture, as shown in Fig. 3.
The performance of various student models is shown in Fig. 7, from which we can see that the recognition accuracy decreases as the face resolution decreases. The student model S-- achieves an accuracy of by using both selective knowledge distillation and face class supervision, which is only lower than the teacher model T- without metric learning. Note that the number of parameters in S-- is much smaller than that of the teacher model VGGFace (M vs. M), and the dimension of face feature vectors has a remarkable compression rate of . These results suggest that this performance is very competitive, particularly for practical deployment on resource-limited devices.
From Fig. 7, we can also find that, without the supervision from the teacher stream, the baseline student model S-- has a very low accuracy of , implying that the baseline model itself may lack the capability of extracting discriminative features when being directly trained on low-resolution faces. After being trained with joint supervision signals from the teacher stream and face identities, the model S-- achieves a sharp improvement of in terms of recognition accuracy (i.e., from to ). Similar accuracy improvements can be found between S-- and S-- as well as S-- and S--, implying that either selective or direct knowledge distillation can effectively transfer the teacher’s knowledge into the student network so as to improve the recognition performance remarkably.
To further verify the importance of knowledge selection, we compare the performance between S-- and S--. By carefully selecting informative knowledge, S-- achieves an accuracy gain of against S-- which does not discard noisy faces during training. In addition, the face class supervision signal can also improve the performance, so that the model S-- achieves a higher accuracy than S--. In summary, our two-stream structure can accurately distil informative knowledge from the teacher stream and recover missing knowledge from the student stream.
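To make the joint supervision concrete, the following is a minimal sketch of a per-batch loss that adds a feature regression term (student features regressed onto teacher features) to a softmax classification term. It is written under stated assumptions rather than taken from the training code: the `weight` balancing factor, the dictionary keys, and the choice to let unselected faces contribute only the classification term are all hypothetical.

```python
import math

def softmax_cross_entropy(logits, label):
    """Cross-entropy of one sample given raw class scores and its identity label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def squared_error(student_feat, teacher_feat):
    """Squared L2 distance between student and teacher feature vectors."""
    return sum((s - t) ** 2 for s, t in zip(student_feat, teacher_feat))

def selective_multitask_loss(batch, weight=1.0):
    """Mean over the batch of classification loss plus weighted regression loss.
    Assumption: the regression term is applied only to faces flagged as selected."""
    total = 0.0
    for sample in batch:
        loss = softmax_cross_entropy(sample["logits"], sample["label"])
        if sample["selected"]:
            loss += weight * squared_error(sample["student_feat"], sample["teacher_feat"])
        total += loss
    return total / len(batch)

# Two toy samples: the first is a selected (informative) face, the second is not.
batch = [
    {"logits": [0.0, 0.0], "label": 0, "selected": True,
     "student_feat": [1.0, 0.0], "teacher_feat": [0.0, 0.0]},
    {"logits": [0.0, 0.0], "label": 1, "selected": False,
     "student_feat": [3.0, 3.0], "teacher_feat": [0.0, 0.0]},
]
loss = selective_multitask_loss(batch)
```

Setting `weight` to zero in this sketch removes the teacher's supervision and leaves only class supervision, while always applying the regression term corresponds to distilling from every face without selection.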
IV-D Low-Resolution Face Identification on UCCS
Since the performance on the low-resolution face verification task is promising, we further study the low-resolution face identification task on a challenging benchmark, UCCS , and compare with the state-of-the-art method proposed in , VLRR (very low-resolution recognition). In VLRR, the cropped face regions are normalized into as high-resolution faces, which are then down-sampled by a factor for low-resolution images of . The evaluation is performed on a -subject subset by layer-by-layer greedy unsupervised model training. Their model reported the best error rates of at top-1 and at top-5.
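As an illustration of how a low-resolution input can be produced from a normalized high-resolution face, the sketch below down-samples a grayscale image by an integer factor using block averaging. The interpolation kernel used in the experiments is not stated here, so the box filter, the function name `downsample` and the toy image are assumptions made only for this example.

```python
def downsample(image, factor):
    """Box-filter down-sampling of a 2-D grayscale image (list of rows):
    each factor x factor block is replaced by its mean intensity."""
    height, width = len(image), len(image[0])
    if height % factor or width % factor:
        raise ValueError("image size must be divisible by factor")
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [image[r + dr][c + dc]
                     for dr in range(factor) for dc in range(factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out

# A toy 4x4 "face" reduced to 2x2 with a down-sampling factor of 2.
image = [
    [0, 2, 4, 6],
    [2, 4, 6, 8],
    [1, 1, 3, 3],
    [1, 1, 3, 3],
]
low_res = downsample(image, 2)
```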
Following the experimental settings of , we choose a -subject subset of original-resolution images by ranking the subjects according to the number of images. The cropped face regions are then normalized to to obtain images. Note that this number is a little smaller than those claimed by VLRR (i.e., 4,500 training images and 935 testing images). After that, to achieve fair comparisons, we randomly separate the images according to a ratio of to training and testing sets. Finally, we have images for training and the rest for testing.
On these data, we first train a student model with the input directly on the training set of UCCS and then test the performance on its testing set. In this case, the model achieves % top-1 error rate and % top-5 error rate, which are worse than VLRR. We suspect that models pre-trained on other datasets can provide valuable prior knowledge for low-resolution visual recognition problems, as stated in . To verify that, we use the student model S-- pre-trained on UMDFaces to fine-tune a new model for face identification on UCCS. First, we fix the parameters before the mimicking layer and modify the last softmax layer to -way. Then, we train the feature reduction sub-network with the images. The fine-tuned model reaches top-1 error rate and top-5 error rate, indicating the correct classification of out of testing samples in top-1 results and in top-5, respectively. This implies that our method can achieve better accuracy than VLRR, possibly because the selective supervision from the teacher stream helps the student network learn discriminative features even when the face resolution is very low.
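For readers unfamiliar with the metric, the top-k error rate used above is the fraction of test samples whose true identity is not among the k highest-scoring classes, so the number of correctly classified samples equals (1 - error) times the test set size. A minimal sketch follows; the function name `top_k_error` and the toy scores are illustrative assumptions, not values from the experiments.

```python
def top_k_error(scores, labels, k):
    """Fraction of samples whose true label is not among the k highest-scoring classes."""
    errors = 0
    for row, label in zip(scores, labels):
        # Class indices sorted from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(labels)

# Toy softmax outputs for four test faces over three identities.
scores = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.5, 0.4, 0.1],
    [0.2, 0.1, 0.7],
]
labels = [0, 1, 2, 2]
top1 = top_k_error(scores, labels, k=1)
top2 = top_k_error(scores, labels, k=2)
# Number of correctly classified samples implied by the top-1 error rate.
correct_top1 = round((1 - top1) * len(labels))
```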
IV-E Low-Resolution Face Retrieval on SCface
We further study the low-resolution face retrieval task on SCface , and compare with the baseline (PCA ) and state-of-the-art methods, including three embedding-based models (DCA , DAlign and LRFRW ) and one hallucination-based model (SHSR ). Here, LRFRW employs deep learning to perform cross-domain transfer. The results are shown in Tab. II, where the accuracies on three subsets with different distances and on the total set are reported. In the experiment, we fine-tune two student models on the SCface training set: S-- with the default D identity features and S--- with D identity features.
From the results, we can make several observations. First, all the models give accuracies of less than , and the baseline PCA model in particular gives an extremely low total accuracy of , showing that face retrieval on SCface is a very challenging task. Second, as the resolution increases with decreasing distance, the recognition accuracy gradually increases as expected, which implies that resolution is indeed an important factor in recognition performance. Accordingly, the hallucination-based SHSR super-resolves low-resolution face images before feeding them to a pre-trained face recognizer, which improves the total accuracy to . Third, the embedding models transfer knowledge between different domains: DCA transfers features from high-resolution to low-resolution faces by discriminant correlation analysis, LRFRW does so by supervised discriminative cross-resolution learning, and DAlign transfers knowledge from near-infrared to visible images by dictionary alignment. These three models achieve improved total accuracies of , and , respectively, which reveals the impact of the knowledge transferred from other domains. Finally, our two models give better total accuracies than other models, e.g., S-64- achieves an improved total accuracy of over DAlign, implying that the selective knowledge from the teacher stream can facilitate the student network.
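The retrieval accuracies discussed above can be read as rank-1 matching rates: each low-resolution probe is compared against a gallery and counted as correct when its best match has the same identity. The sketch below assumes cosine similarity on identity features and a single gallery feature per subject; both choices, like the names `cosine` and `rank1_accuracy`, are assumptions for illustration rather than the evaluation protocol of the benchmark.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank1_accuracy(probes, probe_ids, gallery, gallery_ids):
    """Fraction of probes whose most similar gallery feature has the same identity."""
    hits = 0
    for feat, pid in zip(probes, probe_ids):
        best = max(range(len(gallery)), key=lambda j: cosine(feat, gallery[j]))
        if gallery_ids[best] == pid:
            hits += 1
    return hits / len(probes)

# Toy gallery with two subjects and three probes; the third probe is mismatched.
gallery = [[1.0, 0.0], [0.0, 1.0]]
gallery_ids = ["a", "b"]
probes = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
probe_ids = ["a", "b", "b"]
accuracy = rank1_accuracy(probes, probe_ids, gallery, gallery_ids)
```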
Tab. II columns: Model | Dist 1 | Dist 2 | Dist 3 | Total
Tab. III columns: Model | GPU (Nvidia K80), time / #faces per second | CPU (Intel 2.6 GHz), time / #faces per second
IV-F Efficiency Analysis
Our approach can greatly reduce the number of model parameters and the memory footprint without a significant accuracy drop. As shown in Fig. 8, the memory reductions are , , and for the low-resolution student models with , , and , respectively. In particular, for faces with a very low resolution of , the inference memory is only MB.
In addition, the teacher model, VGGFace (T-224), contains million parameters, while the student network has only about million parameters, a reduction of in model complexity at the cost of a very small drop in recognition accuracy. Owing to the large reductions in memory cost and parameter count, the computational complexity also decreases greatly. As shown in Tab. III, the inference runtime on both a high-end GPU and a low-end CPU is greatly reduced. With an NVIDIA K80 GPU, the inference time for a face is reduced from ms with T-224 (VGGFace teacher) to ms, ms, ms and ms with S-96-sc, S-64-sc, S-32-sc and S-16-sc, respectively. The inference times are also remarkably reduced on CPU. Our model takes ms and ms to recognize a face with a very low resolution of on GPU and CPU, respectively, which corresponds to faces per second and faces per second.
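The relation between per-face inference time and throughput, and between teacher and student sizes and the reduction factor, is simple arithmetic. The sketch below makes it explicit; the function names are hypothetical and the numbers in the example are placeholders chosen for easy checking, not measurements from Tab. III.

```python
def faces_per_second(ms_per_face):
    """Throughput implied by a per-face inference time given in milliseconds."""
    if ms_per_face <= 0:
        raise ValueError("inference time must be positive")
    return 1000.0 / ms_per_face

def reduction_ratio(teacher_value, student_value):
    """How many times smaller the student quantity is than the teacher quantity
    (applicable to parameter counts, memory, or feature dimension)."""
    if student_value <= 0:
        raise ValueError("student value must be positive")
    return teacher_value / student_value

# Placeholder numbers only: 2 ms per face corresponds to 500 faces per second,
# and a 100-unit teacher compressed to a 4-unit student is a 25x reduction.
throughput = faces_per_second(2.0)
ratio = reduction_ratio(100.0, 4.0)
```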
At present, large numbers of model parameters and high feature dimensions are common in deep learning-based face recognition models, which hinders their practical deployment in resource-restricted applications (e.g., on embedded or mobile devices). To address this problem, this paper proposes a knowledge distillation method that adopts the original large model as the teacher network and lets the teacher selectively supervise the training of student networks via a multi-task loss function combining regression and classification terms. By combining high-dimensional deep feature regression with low-resolution face classification, the approach jointly compresses the deep network and the feature dimension while preserving recognition accuracy. Experimental results show that the proposed approach can transfer the informative knowledge from the teacher network to student models, leading to compact face recognition models with impressive effectiveness and efficiency.
In future work, we will explore the use of a recurrent mechanism to handle the failure cases in the teacher stream. Face attributes such as gender, age and makeup will also be incorporated into the multi-task framework to further enhance the performance of the compressed model.
Acknowledgement. This work was partially supported by grants from National Key Research and Development Plan (2016YFC0801005), National Natural Science Foundation of China (61772513 & 61672072), Beijing Nova Program (Z181100006218063), and the International Cooperation Project of Institute of Information Engineering at Chinese Academy of Sciences (Y7Z0511101). Shiming Ge is also supported by Youth Innovation Promotion Association, CAS.
-  Y. Taigman, M. Yang, M. Ranzato, and L. Wolf, “Deepface: Closing the gap to human-level performance in face verification,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1701–1708.
-  Y. Sun, X. Wang, and X. Tang, “Deep learning face representation from predicting 10,000 classes,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1891–1898.
-  F. Schroff, D. Kalenichenko, and J. Philbin, “FaceNet: A unified embedding for face recognition and clustering,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 815–823.
-  O. M. Parkhi, A. Vedaldi, and A. Zisserman, “Deep face recognition,” in British Machine Vision Conference (BMVC), vol. 1, no. 3, 2015, p. 6.
-  B. Amos, B. Ludwiczuk, and M. Satyanarayanan, “OpenFace: A general-purpose face recognition library with mobile applications,” CMU School of Computer Science, 2016.
-  A. Pentland and T. Choudhury, “Face recognition for smart environments,” Computer, vol. 33, no. 2, pp. 50–55, 2000.
-  D. Liu, B. Cheng, Z. Wang, H. Zhang, and T. S. Huang, “Enhance visual recognition under adverse conditions via deep networks,” arXiv preprint arXiv:1712.07732, 2017.
-  S. Kolouri and G. K. Rohde, “Transport-based single frame super resolution of very low resolution face images,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4876–4884.
-  M. Jian and K.-M. Lam, “Simultaneous hallucination and recognition of low-resolution faces based on singular value decomposition,” IEEE Transactions on Circuits and Systems for Video Technology (CSVT), vol. 25, no. 11, pp. 1761–1772, 2015.
-  M.-C. Yang, C.-P. Wei, Y.-R. Yeh, and Y.-C. F. Wang, “Recognition at a long distance: Very low resolution face recognition and hallucination,” in International Conference on Biometrics (ICB), 2015, pp. 237–242.
-  S. Biswas, K. W. Bowyer, and P. J. Flynn, “Multidimensional scaling for matching low-resolution face images,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 34, no. 10, pp. 2019–2030, 2012.
-  C.-X. Ren, D.-Q. Dai, and H. Yan, “Coupled kernel embedding for low-resolution face image recognition,” IEEE Transactions on Image Processing (TIP), vol. 21, no. 8, pp. 3770–3783, 2012.
-  T. Ahonen, A. Hadid, and M. Pietikäinen, “Face recognition with local binary patterns,” in European Conference on Computer Vision (ECCV). Springer, 2004, pp. 469–481.
-  S. J. D. Prince and J. H. Elder, “Probabilistic linear discriminant analysis for inferences about identity,” in IEEE International Conference on Computer Vision (ICCV), 2007, pp. 1–8.
-  Y. Wen, K. Zhang, Z. Li, and Y. Qiao, “A discriminative feature learning approach for deep face recognition,” in European Conference on Computer Vision (ECCV), 2016, pp. 499–515.
-  A. T. Tran, T. Hassner, I. Masi, and G. Medioni, “Regressing robust and discriminative 3D morphable models with a very deep neural network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1493–1502.
-  X. Zhang, Z. Fang, Y. Wen, Z. Li, and Y. Qiao, “Range loss for deep face recognition with long-tail,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5409–5418.
-  G. Hu, Y. Hua, Y. Yuan, Z. Zhang, Z. Lu, S. S. Mukherjee, T. M. Hospedales, N. M. Robertson, and Y. Yang, “Attribute-enhanced face recognition with neural tensor fusion networks,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3764–3773.
-  Y. Sun, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” Neural Information Processing Systems (NIPS), vol. 27, pp. 1988–1996, 2014.
-  Y. Sun, D. Liang, X. Wang, and X. Tang, “Deepid3: Face recognition with very deep neural networks,” arXiv, 2015.
-  Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2018.
-  T. Uiboupin, P. Rasti, G. Anbarjafari, and H. Demirel, “Facial image super resolution using sparse representation for improving face recognition in surveillance monitoring,” in IEEE Conference on Signal Processing and Communication Application, 2016, pp. 437–440.
-  C. Dong, C. C. Loy, K. He, and X. Tang, “Learning a deep convolutional network for image super-resolution,” in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 184–199.
-  C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, and Z. Wang, “Photo-realistic single image super-resolution using a generative adversarial network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 105–114.
-  W. W. Zou and P. C. Yuen, “Very low resolution face recognition problem,” IEEE Transactions on Image Processing (TIP), vol. 21, no. 1, pp. 327–340, 2012.
-  P. Zhang, X. Ben, W. Jiang, R. Yan, and Y. Zhang, “Coupled marginal discriminant mappings for low-resolution face recognition,” Optik-International Journal for Light and Electron Optics, vol. 126, no. 23, pp. 4352–4357, 2015.
-  J. Jiang, R. Hu, Z. Wang, and Z. Cai, “CDMMA: Coupled discriminant multi-manifold analysis for matching low-resolution face images,” Signal Processing, vol. 124, pp. 162–172, 2016.
-  X. Wang, H. Hu, and J. Gu, “Pose robust low-resolution face recognition via coupled kernel-based enhanced discriminant analysis,” IEEE/CAA Journal of Automatica Sinica, vol. 3, no. 2, pp. 203–212, 2016.
-  X. Xing and K. Wang, “Couple manifold discriminant analysis with bipartite graph embedding for low-resolution face recognition,” Signal Processing, vol. 125, pp. 329–335, 2016.
-  J. Shi and C. Qi, “From local geometry to global structure: Learning latent subspace for low-resolution face image recognition,” IEEE Signal Processing Letters, vol. 22, no. 5, pp. 554–558, 2015.
-  M. Haghighat and M. Abdel-Mottaleb, “Low resolution face recognition in surveillance systems using discriminant correlation analysis,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG). IEEE, 2017, pp. 912–917.
-  X. Li, W.-S. Zheng, X. Wang, T. Xiang, and S. Gong, “Multi-scale learning for low-resolution person re-identification,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 3765–3773.
-  K.-H. Pong and K.-M. Lam, “Multi-resolution feature fusion for face recognition,” Pattern Recognition, vol. 47, no. 2, pp. 556–567, 2014.
-  S. P. Mudunuri and S. Biswas, “Low resolution face recognition across variations in pose and illumination,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 38, no. 5, pp. 1034–1040, 2016.
-  S. Shekhar, V. M. Patel, and R. Chellappa, “Synthesis-based robust low resolution face recognition,” arXiv preprint arXiv:1707.02733, 2017.
-  Z. Wang, S. Chang, Y. Yang, et al., “Studying very low resolution recognition using deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4792–4800.
-  C. Herrmann, D. Willersinn, and J. Beyerer, “Low-resolution convolutional neural networks for video face recognition,” in IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS). IEEE, 2016, pp. 221–227.
-  C. Bucilua, R. Caruana, and A. Niculescumizil, “Model compression,” in ACM Conference on Knowledge Discovery and Data Mining (KDD), 2006, pp. 535–541.
-  G. Hinton, J. Dean, and O. Vinyals, “Distilling the knowledge in a neural network,” in Neural Information Processing Systems (NIPS) Workshop, 2014, pp. 1–9.
-  A. Romero, N. Ballas, S. Kahou, et al., “FitNets: Hints for thin deep nets,” in International Conference on Learning Representations (ICLR), 2015.
-  T. Chen, I. Goodfellow, and J. Shlens, “Net2Net: Accelerating learning via knowledge transfer,” in International Conference on Learning Representations (ICLR), 2016.
-  P. Luo, Z. Zhu, Z. Liu, et al., “Face model compression by distilling knowledge from neurons,” in The AAAI Conference on Artificial Intelligence (AAAI), 2016, pp. 3560–3566.
-  Z. Li and D. Hoiem, “Learning without forgetting,” in European Conference on Computer Vision (ECCV), 2016, pp. 614–629.
-  Y. Kim and A. M. Rush, “Sequence-level knowledge distillation,” in Conference on Empirical Methods on Natural Language Processing (EMNLP), 2016, pp. 1317–1327.
-  G. Urban, K. J. Geras, S. E. Kahou, et al., “Do deep convolutional nets really need to be deep and convolutional?” in International Conference on Learning Representations (ICLR), 2017.
-  G. Chen, W. Choi, X. Yu, et al., “Learning efficient object detection models with knowledge distillation,” in Neural Information Processing Systems (NIPS), 2017, pp. 742–751.
-  Y. Chen, N. Wang, and Z. Zhang, “DarkRank: Accelerating deep metric learning via cross sample similarities transfer,” in The AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  T. Chen, L. Lin, W. Zuo, X. Luo, and L. Zhang, “Learning a wavelet-like auto-encoder to accelerate deep neural networks,” in The AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  G. Zhou, Y. Fan, R. Cui, W. Bian, X. Zhu, and G. Kun, “Rocket launching: A unified and efficient framework for training well-behaved light net,” in The AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  D. Lopezpaz, L. Bottou, B. Scholkopf, and V. Vapnik, “Unifying distillation and privileged information,” in International Conference on Learning Representations (ICLR), 2016.
-  J.-C. Su and S. Maji, “Adapting models to signal degradation using distillation,” in British Machine Vision Conference (BMVC), 2017.
-  I. Radosavovic, P. Dollar, R. Girshick, G. Gkioxari, and K. He, “Data distillation: Towards omni-supervised learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 4119–4128.
-  K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” in International Conference on Learning Representations (ICLR), 2015.
-  C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1–9.
-  K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
-  J. Redmon and A. Farhadi, “Yolo9000: Better, faster, stronger,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7263–7271.
-  A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,” in Neural Information Processing Systems (NIPS), 2012, pp. 1097–1105.
-  Y. Boykov and V. Kolmogorov, “An experimental comparison of min-cut/max-flow algorithms for energy minimization in computer vision,” IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), vol. 26, no. 9, pp. 1124–1137, 2004.
-  A. Bansal, A. Nanduri, C. Castillo, et al., “UMDFaces: An annotated face dataset for training deep networks,” in IEEE International Joint Conference on Biometrics (IJCB), 2017, pp. 464–473.
-  E. Learned-Miller, G. B. Huang, A. RoyChowdhury, H. Li, and G. Hua, “Labeled faces in the wild: A survey,” 2016.
-  A. Sapkota and T. E. Boult, “Large scale unconstrained open set face database,” in IEEE Sixth International Conference on Biometrics: Theory, Applications and Systems, 2014, pp. 1–8.
-  M. Grgic, K. Delac, and S. Grgic, “SCface–surveillance cameras face database,” Multimedia Tools and Applications (MTA), vol. 51, no. 3, pp. 863–879, 2011.
-  M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, and M. Devin, “Tensorflow: A system for large-scale machine learning,” in The USENIX Symposium on Operating Systems Design and Implementation (OSDI), vol. 16, 2016, pp. 265–283.
-  S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment via regressing local binary features,” IEEE Transactions on Image Processing (TIP), vol. 25, no. 3, pp. 1233–1245, 2016.
-  D. Chen, X. Cao, L. Wang, F. Wen, and J. Sun, “Bayesian face revisited: a joint formulation,” in European Conference on Computer Vision (ECCV), 2012, pp. 566–579.
-  L. van der Maaten and G. Hinton, “Visualizing high-dimensional data using t-sne,” Journal of Machine Learning Research (JMLR), vol. 9, no. 11, pp. 2579–2605, 2008.
-  B. Cheng, D. Liu, Z. Wang, H. Zhang, and T. S. Huang, “Visual recognition in very low-quality settings: Delving into the power of pre-training,” in The AAAI Conference on Artificial Intelligence (AAAI), 2018.
-  S. P. Mudunuri, S. Venkataramanan, and S. Biswas, “Dictionary alignment with re-ranking for low resolution nir-vis face recognition,” IEEE Transactions on Information Forensics and Security (TIFS), pp. 1–1, 2018.
-  P. Li, L. Prieto, D. Mery, and P. Flynn, “Low resolution face recognition in the wild,” arXiv preprint arXiv:1805.11529, 2018.
-  M. Singh, S. Nagpal, M. Vatsa, R. Singh, and A. Majumdar, “Identity aware synthesis for cross resolution face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2018, pp. 479–488.
Shiming Ge (M’13-SM’15) is currently an Associate Professor at the Institute of Information Engineering at Chinese Academy of Sciences. Prior to that, he was a senior researcher at ShanDa Innovations and a researcher at Samsung Electronics and Nokia Research Center. He received the B.S. and Ph.D. degrees, both in Electronic Engineering, from the University of Science and Technology of China (USTC) in 2003 and 2008, respectively. His research mainly focuses on computer vision, deep learning and AI security, especially high-performance deep models towards scalable applications.
Shengwei Zhao received his B.S. degree from the School of Mathematics and Statistics in Wuhan University in 2017. He is now a Master student at the Institute of Information Engineering at Chinese Academy of Sciences and the School of Cyber Security at the University of Chinese Academy of Sciences. His major research interests are deep learning and computer vision.
Chenyu Li is currently a Ph.D. candidate at the Institute of Information Engineering at Chinese Academy of Sciences and the School of Cyber Security at the University of Chinese Academy of Sciences. She received the B.S. degree from the School of Electronics and Information Engineering at Tongji University. Her research interests are computer vision and deep learning.
Jia Li (M’12-SM’15) is currently an Associate Professor with the School of Computer Science and Engineering, Beihang University, Beijing, China. He received the B.E. degree from Tsinghua University in Jul. 2005 and the Ph.D. degree from the Institute of Computing Technology, Chinese Academy of Sciences, in Jan. 2011. Before he joined Beihang University, he worked at Nanyang Technological University, Peking University and Shanda Innovations. His research interests include computer vision and multimedia big data, especially deep learning-based visual content understanding. He is the author or coauthor of over 50 technical articles in refereed journals and conferences such as TPAMI, TIP, IJCV, ICCV and CVPR. His major research interests are cognitive vision towards evolvable algorithms and models.