A Paired Sparse Representation Model for Robust Face Recognition from a Single Sample


Fania Mokhayeri fmokhayeri@livia.etsmtl.ca Eric Granger eric.granger@etsmtl.ca Laboratoire d’imagerie, de vision et d’intelligence artificielle (LIVIA)
École de Technologie Supérieure, Université du Québec, Montreal, Canada
Abstract

Sparse representation-based classification (SRC) has been shown to achieve a high level of accuracy in face recognition (FR). However, matching faces captured in unconstrained video against a gallery with a single reference facial still per individual typically yields low accuracy. For improved robustness to intra-class variations, SRC techniques for FR have recently been extended to incorporate variational information from an external generic set into an auxiliary dictionary. Despite their success in handling linear variations, non-linear variations (e.g., pose and expressions) between probe and reference facial images cannot be accurately reconstructed with a linear combination of images in the gallery and auxiliary dictionaries because they do not share the same type of variations. In order to account for non-linear variations due to pose, a paired sparse representation model is introduced allowing for joint use of variational information and synthetic face images. The proposed model, called synthetic plus variational model, reconstructs a probe image by jointly using (1) a variational dictionary and (2) a gallery dictionary augmented with a set of synthetic images generated over a wide diversity of pose angles. The augmented gallery dictionary is then encouraged to pair the same sparsity pattern with the variational dictionary for similar pose angles by solving a newly formulated simultaneous sparsity-based optimization problem. Experimental results obtained on Chokepoint and COX-S2V datasets, using different face representations, indicate that the proposed approach can outperform state-of-the-art SRC-based methods for still-to-video FR with a single sample per person.

keywords:
Face Recognition, Sparse Representation-Based Classification, Face Synthesis, Generic Learning, Simultaneous Sparsity, Video Surveillance

1 Introduction

Video-based face recognition (FR) has attracted considerable interest from both academia and industry due to its wide range of applications in surveillance and security. In contrast to FR systems based on still images, an abundance of spatio-temporal information can be extracted from target-domain videos to contribute to the design of discriminant still-to-video FR systems.

Sparse Representation-based Classification (SRC) techniques can provide an accurate and cost-effective solution in many video FR applications when a sufficient number of reference training images per person is available under controlled conditions Wright1; xu; xu2017. However, single sample per person (SSPP) problems are common in video-based security and surveillance applications, e.g., biometric authentication and watch-list screening farshad; Dewan. For example, still-to-video FR systems are typically designed using only one reference still image per individual in the source domain, and faces captured with video surveillance cameras in the target domain are then matched against these reference stills S2; S3. Additionally, when faces are captured under challenging uncontrolled conditions, they may vary considerably in pose, illumination, occlusion, blur, scale, resolution, expression, etc. In such cases, SRC techniques often exhibit limited robustness to intra-class variations, and a lower recognition rate.

State-of-the-art approaches designed to address SSPP problems in SRC-based FR systems can be roughly divided into three categories: (1) image patching methods, where the images are partitioned into several patches Zhu; gaor, (2) face synthesis techniques that expand the gallery dictionary mokhayeri; hu, and (3) generic learning methods, where variational information is leveraged from a generic training set (an auxiliary set comprised of many facial video ROIs from unknown individuals captured in the target domain) to represent the differences between probe and gallery images Wei; deng2018. Indeed, similar intra-class variations may be shared by different individuals in the generic set and reference regions of interest (ROIs) in the gallery. Moreover, a generic set can easily be collected during operations or some camera calibration process, and encodes subtle knowledge on faces appearing in the operational environment. One of the pioneering techniques in generic learning is extended SRC (ESRC) Deng, which manually constructs an auxiliary variational dictionary from a generic set to accurately represent a probe face with unknown variations from the target domain. ESRC was subsequently generalized to employ different sparsity for the identity and variational parts of the sparse coefficients Li, and to learn a variational dictionary that accounts for the relationship between the reference gallery and the external generic set Yang.

Although leveraging intra-class variations from a generic set has been shown to improve robustness to some linear facial variations, it cannot accurately address non-linear facial variations (e.g., pose and expression) between reference still ROIs in the source domain and probe video ROIs captured under real-world conditions in the target domain. Indeed, non-linear variations are neither additive nor shareable. For instance, a probe video ROI under varying lighting can be recovered with a linear combination of an image with natural lighting and its corresponding illumination component. However, a probe ROI with a profile view cannot be accurately reconstructed with a linear combination of frontal-view ROIs in the gallery dictionary and profile-view ROIs in the auxiliary dictionary, because they do not share the same type of variations. Such non-linear facial variations between still and video ROIs make it difficult to represent a probe image using a linear combination of reference and generic set images. Another concern with ESRC is its large, manually designed auxiliary dictionary (obtained via random selection in the generic set), which is computationally expensive. To address these concerns, we focus on three issues: (1) how to represent a probe image under non-linear variations with a linear combination of the reference set and generic set, (2) how to design a discriminative dictionary, and (3) how to yield a robust representation with a minimum number of images.

In this paper, a paired sparse representation framework referred to as the synthetic plus variational (S+V) model is proposed to address the problem of non-linear pose variations by increasing the range of pose variations in the gallery dictionary. Since collecting a large database with a wide variety of views is extremely expensive and time-consuming, a set of synthetic face images under representative poses is generated. As illustrated in Fig. 1, a probe video ROI is reconstructed using an auxiliary dictionary as well as a gallery dictionary augmented with a set of synthetic face images generated under a representative diversity of azimuth angles. The proposed sparse model not only allows a probe image to be represented by the atoms of both the augmented and auxiliary dictionaries, but also restricts the selected atoms to be combined within the same viewpoint, thus providing an improved representation.

Figure 1: Overall architecture of the proposed approach. The gallery dictionary, augmented with a diverse set of synthetic images, and the auxiliary variational dictionary jointly encode non-linear variations in appearance. Sparse coefficients within each dictionary share the same sparsity pattern in terms of pose angle.

Under this model, facial ROIs from trajectories in the generic set are clustered in the capture condition space (defined by pose angle) by applying row sparsity Elhamifar2. The auxiliary variational dictionary with block structure is designed using intra-class variations as subsets of the pose clusters. Following this, the gallery dictionary is augmented with synthetic face images generated from the original reference image in the source domain, where the rendering parameters are estimated based on the center of each cluster in the target domain. By introducing a joint sparsity structure, the pose-guided augmented gallery dictionary is encouraged to share the same sparsity pattern with the auxiliary dictionary for the same pose angles. Each synthetic facial ROI in the augmented gallery dictionary is thereby combined with approximately the same facial viewpoint in the variational dictionary in a joint manner rakotomamonjy. During operations, each probe face captured in videos is represented by a linear combination of ROIs of the same person and pose in the augmented gallery dictionary, along with intra-class variations of the same pose in the auxiliary variational dictionary. In this framework, the auxiliary dictionary models linear variations (such as illumination changes and different occlusion levels), while non-linear pose variations are modeled by the augmented gallery dictionary. Note that the S+V model is paired across different domains in the enrollment stage. The main contributions of this paper are:

  • A generalized sparse representation model for still-to-video FR, using generic learning and data augmentation to represent both linear and non-linear variations based on only one reference still ROI;

  • A simultaneous optimization technique to encourage pairing between each synthetic profile image in the augmented gallery dictionary and a similar view in the auxiliary dictionary;

  • An efficient SRC method to design a compact augmented dictionary using row sparsity.

This paper extends our preliminary investigation of synthetic plus variational models Fania_FG2019 in several ways, in particular with: (1) a comprehensive analysis of dictionary design and of selection of representative face exemplars; (2) a detailed description of the proposed joint sparsity structure; and (3) more experimental results and interpretations, including results with deep facial representations, an ablation study and complexity analysis.

For proof-of-concept validation, a particular implementation of the proposed SRC technique for still-to-video FR is considered, where representative pose angles are selected by applying clustering on the generic set. The original and synthetic ROIs rendered under these pose angles are employed to design an augmented gallery dictionary, while the pose clusters of video ROIs are exploited to design an auxiliary variational dictionary with block structure. The simultaneous sparsity constraint is then applied to both dictionaries to improve their discriminative power. Moreover, since most state-of-the-art FR methods rely on Convolutional Neural Network (CNN) architectures such as ResNet He and VGGNet simonyan, the model is fed with CNN features extracted from the atoms of the dictionaries Gao; Cai, in order to further improve still-to-video FR accuracy. Performance of the SRC implementation is evaluated on two public video FR databases – Chokepoint Wong and COX-S2V Huang3.

The rest of the paper is organized as follows. Section 2 provides a brief review of SRC methods that employ generic learning to address SSPP problems. Section 3 describes the proposed S+V model. Section 4 presents a particular implementation of the S+V model for a still-to-video FR system. Finally, Sections 5 and 6 describe the experimental methodology and results, respectively.

2 Background on Sparse Modelling for Still-to-Video FR

In the following, D = [d_1, d_2, …, d_n] ∈ ℝ^{m×n} denotes the set composed of n reference still ROIs belonging to one of C different classes, where m is the number of pixels or features representing an ROI and n is the total number of reference still ROIs. G denotes the auxiliary generic set composed of external generic images of unknown persons captured in the target domain. V denotes the auxiliary variational dictionary composed of intra-class variations extracted from G.

2.1 Sparse Representation-based Classification (SRC):

Given a probe image y ∈ ℝ^m, SRC represents y as a sparse linear combination of the reference set D. SRC uses ℓ1-minimization to regularize the representation coefficients. More precisely, SRC derives the sparse coefficient vector α of y by solving the following ℓ1-minimization problem:

α̂ = arg min_α ‖y − Dα‖₂² + λ‖α‖₁   (1)

where λ is a regularization parameter, and α ∈ ℝ^n. After the sparse vector of coefficients α̂ is obtained, the probe image y is recognized as belonging to class k if it satisfies:

identity(y) = arg min_k ‖y − D δ_k(α̂)‖₂   (2)

where δ_k(α̂) is a vector whose only nonzero entries are the entries in α̂ that are associated with class k. SRC is based on the idea that a probe image can best be linearly reconstructed by the columns of D corresponding to its true class k. As a result, most non-zero elements of α̂ will be associated with class k, which yields the minimum reconstruction error. An important assumption of SRC is that it requires a large number of reference training images to form an over-complete dictionary. However, in many practical applications, the number of labeled reference images is limited, and SRC accuracy declines in such cases Wright1.
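To make the SRC pipeline of Eqs. 1-2 concrete, the following is a minimal numerical sketch (not the authors' implementation): it solves the ℓ1-regularized problem with plain ISTA and classifies by class-restricted residuals. The dictionary, labels and parameter values are illustrative assumptions.

```python
import numpy as np

def soft_threshold(x, t):
    """Proximal operator of the l1 norm."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def src_classify(y, D, labels, lam=0.01, n_iter=500):
    """SRC sketch: solve min_a ||y - D a||_2^2 + lam*||a||_1 with ISTA (Eq. 1),
    then assign y to the class with the smallest class-restricted residual (Eq. 2).
    D: (m, n) dictionary with (ideally) unit-norm columns; labels: class of each column."""
    L = np.linalg.norm(D, 2) ** 2              # Lipschitz constant of the smooth part
    a = np.zeros(D.shape[1])
    for _ in range(n_iter):
        a = soft_threshold(a + D.T @ (y - D @ a) / L, lam / L)
    classes = np.unique(labels)
    # delta_k(a): keep only the coefficients of class k, then compare residuals
    resid = [np.linalg.norm(y - D @ np.where(labels == k, a, 0.0)) for k in classes]
    return classes[int(np.argmin(resid))], a
```

With a toy dictionary whose class-0 columns cluster around one direction and class-1 columns around another, a probe drawn near the first direction is assigned to class 0.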

2.2 SRC through Generic Learning:

Since facial variations share much similarity across different individuals, an external generic set with multiple images of unknown persons, as they appear in the target domain, can provide discriminant information on intra-class variations. These additional variations can enrich the gallery diversity, especially in SSPP scenarios. The general model solves the following minimization problem:

{α̂, β̂} = arg min_{α,β} ‖y − Dα − Vβ‖₂² + λ₁‖α‖₁ + λ₂‖β‖₁   (3)

where α is a sparse vector that selects a limited number of atoms from the gallery dictionary D, β is another sparse vector that selects variant bases from the auxiliary variational dictionary V, and λ₁, λ₂ are regularization parameters. The variant bases can be estimated by subtracting the natural (original) image of a class from the other images of the same class, as the difference from the class centroid, or as pairwise differences. The probe image y is recognized as belonging to class k if it satisfies:

identity(y) = arg min_k ‖y − D δ_k(α̂) − V β̂‖₂   (4)

where δ_k(·) is reused as the operator that selects the coefficients associated with class k.
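The ESRC scheme of Eqs. 3-4 can be sketched as follows (a simplified illustration, not the reference implementation): variant bases are built by subtracting each class's natural image from its other images, the probe is coded over the concatenated dictionary [D V] with an ISTA lasso solver, and the variational component Vβ is removed before computing class residuals. All names and parameter values are illustrative.

```python
import numpy as np

def ista_lasso(y, M, lam=0.01, n_iter=500):
    """Minimize ||y - M x||_2^2 + lam*||x||_1 by iterative soft-thresholding."""
    L = np.linalg.norm(M, 2) ** 2
    x = np.zeros(M.shape[1])
    for _ in range(n_iter):
        g = x + M.T @ (y - M @ x) / L                          # gradient step
        x = np.sign(g) * np.maximum(np.abs(g) - lam / L, 0.0)  # shrinkage
    return x

def build_variation_dictionary(generic_faces, generic_ids):
    """Variant bases: each generic image minus the 'natural' (first) image of its class."""
    blocks = []
    for k in np.unique(generic_ids):
        imgs = generic_faces[:, generic_ids == k]
        blocks.append(imgs[:, 1:] - imgs[:, :1])
    return np.hstack(blocks)

def esrc_classify(y, D, labels, V, lam=0.01):
    """ESRC sketch (Eqs. 3-4): code y over [D V]; the variation part V@beta is
    shared by all classes and removed from the residual of every class."""
    x = ista_lasso(y, np.hstack([D, V]), lam)
    alpha, beta = x[:D.shape[1]], x[D.shape[1]:]
    classes = np.unique(labels)
    resid = [np.linalg.norm(y - D @ np.where(labels == k, alpha, 0.0) - V @ beta)
             for k in classes]
    return classes[int(np.argmin(resid))]
```

In an SSPP toy setting where the gallery holds one image per class and the generic set contributes an illumination-like offset, a probe equal to a gallery image plus that offset is still assigned to the correct class, because Vβ absorbs the shared variation.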

Deng et al. Deng introduced extended SRC (ESRC), which manually designs an auxiliary dictionary (through random selection from a generic set) to accurately represent a probe face with unknown variations from the target domain. The general model above degenerates to the ESRC model when the same regularization parameter is used for both the identity and variational coefficients. Motivated by ESRC, Yang et al. Yang proposed the sparse variation dictionary learning (SVDL) model to learn the variational dictionary by accounting for the relationship between the reference gallery and the external generic set. A robust auxiliary dictionary learning (RADL) technique was proposed in Wei that extracts representative information from external data via dictionary learning, without assuming prior knowledge of occlusion in probe images. In farshad, variational information from the target domain was integrated with the reference gallery set through domain adaptation to enhance the facial models for still-to-video FR. In fan2018, a kernel SRC model is learned based on a virtual dictionary and the original training set. The authors in deng2018 developed a superposed linear representation classifier that casts the recognition problem as representing the test image in terms of a superposition of the class centroids and the shared intra-class differences. A local generic representation-based (LGR) framework for FR with SSPP was proposed in Zhu. It builds a gallery dictionary by extracting patches from the gallery database, while an intra-class variation dictionary is formed using an external generic set to predict the possible facial variations (e.g., illumination, pose, and expression). In order to address non-linearity, the authors in fan used a nonlinear mapping to transform the original reference data into a high-dimensional feature space, which is achieved using a kernel-based method.
A customized SRC (CSR) was proposed to leverage the different sparsity of the identity and variational parts of the sparse coefficients, and to assign different parameters to their regularization terms Li. In yang2017, a joint and collaborative sparse representation framework was presented that exploits the distinctiveness and commonality of different local regions. A discriminative approach is proposed in lin2018, in which a robust dictionary is learned from the diversity of training samples, generated by extracting and synthesizing facial variations. In xie2019, feature sparseness-based regularization is proposed to learn deep features with better generalization capability; the regularization is integrated into the original loss function and optimized within a deep metric learning framework. The authors in luo2019 propose a multi-resolution dictionary learning method for FR that provides multiple dictionaries – each one associated with a resolution – while encoding the similarity of representations obtained using different dictionaries in the training phase. The 3D Morphable Model (3DMM), proposed by Blanz and Vetter blanz1, has been widely used to synthesize new face images from a single 2D face image. The 3DMM was expanded by adopting a shared covariance structure to mitigate small sample estimation problems associated with data in high-dimensional spaces koppen2018; it models the global population as a mixture of Gaussian sub-populations, each with its own mean. Finally, an efficient deep learning model for face synthesis that does not rely on complex optimization is proposed in jiao2018.

The aforementioned techniques work well in video-based FR. However, they neglect the impact of non-linear variations between probe images and facial images in the gallery and auxiliary dictionaries. To account for these non-linearities, particularly pose variations, the range of viewpoints represented in the gallery dictionary should be increased so that a probe image can be represented with same-view gallery atoms and variations, thereby compensating for non-linear pose variations. Additionally, the sparsity pattern should enforce the correlation between the gallery and variational dictionaries in terms of pose angles.

3 The Proposed Approach - A Synthetic plus Variational Model

In this section, a new sparse representation model – called the Synthetic plus Variational (S+V) model – is proposed to overcome issues related to non-linear pose variations with conventional SRC and ESRC models. SRC techniques commonly assume that frontal and profile views share the same type of variations. To address this limitation, we increase the range of pose variations of the gallery dictionary so that a probe can be represented with same-view gallery atoms and variations, thereby compensating for non-linear pose variations.

The proposed S+V model exploits two dictionaries: (1) an augmented gallery dictionary containing the original reference still ROI of each individual enrolled to the still-to-video FR system, as well as their synthetic profile ROIs (with diverse poses), and (2) an auxiliary variational dictionary that contains variations from the target domain that can be shared by different persons. The two dictionaries are correlated by imposing a simultaneous sparsity prior that forces the augmented gallery dictionary to pair the same sparsity pattern with the auxiliary dictionary for the same pose angles. In this manner, each synthetic profile image in the augmented gallery dictionary is combined with approximately the same view in the auxiliary dictionary. Fig. 2 gives an illustrative example that compares the sparsity structures of the SRC, ESRC and S+V models. The rest of this section presents more details on the dictionary design and encoding process with the S+V model.

Figure 2: A comparison of the coefficient matrices for three sparsity models: (a) Independent sparsity (SRC) with a single dictionary, (b) Extended sparsity (ESRC) with two dictionaries, and (c) Paired extended sparsity (S+V model) with pair-wise correlation between two dictionaries where the sparse coefficients of same poses share the same sparsity pattern. Each column represents a sparse coefficient vector and each square block denotes a coefficient value.

3.1 Dictionary Design:

In order to design the gallery and auxiliary dictionaries, representative pose angles are determined by characterizing the capture conditions from a large generic set of video ROIs in the pose space (estimations of pitch, roll, and yaw). Prior to operation, e.g., during a camera calibration process, facial ROIs are isolated in facial trajectories from the videos of unknown persons captured in the target domain. A representative set of video ROIs is selected by applying a row-sparsity regularized optimization program to facial trajectories in the capture condition space defined by pose angles. Next, the variational information of the generic set with multiple samples per person is extracted to form an auxiliary dictionary based on the subsets of the pose clusters. A compact set of synthetic images is then generated from the reference set in the source domain, based on the information obtained from the center of each cluster in the target domain (called pose representatives), and integrated into the gallery dictionary to enrich the diversity of the gallery set. The two dictionaries are correlated by imposing a simultaneous sparsity prior that forces the same sparsity patterns among the multiple sparse representation vectors of the augmented and auxiliary dictionaries in terms of pose angles. The representative poses are not only employed to establish a pair-wise correlation between the dictionaries, but also save time and memory, and improve recognition performance by preventing over-fitting. Inspired by Elhamifar2; Elhamifar3, the representative selection problem is formulated as a row-sparsity regularized trace minimization problem, where the objective is to find a few representatives (exemplars) that efficiently represent the collection of data points according to their dissimilarities.

The proposed model allows selecting K pose representatives from a collection of N pose samples. The pose angles are estimated using the discriminative response map fitting method zhx, a state-of-the-art fitting method suitable for handling occlusions and changing illumination conditions. The estimated head pose for the i-th video ROI in the generic set is defined as p_i = (θ_i^roll, θ_i^yaw, θ_i^pitch), where the Euler angles represent the roll, yaw and pitch rotations around the axes of the global coordinate system. The dissimilarity between every pair of pose data points is then calculated using the Euclidean distance, d_{ij} = ‖p_i − p_j‖₂, which indicates how well data point i is suited to be an exemplar of data point j. The dissimilarities are arranged into a matrix:

Δ ≜ [d_{ij}] = [d_1; …; d_N] ∈ ℝ^{N×N}   (5)

where d_i denotes the i-th row of Δ. Variables z_{ij} are associated with the dissimilarities d_{ij}, and organized into a matrix Z of the same size:

Z ≜ [z_{ij}] = [z_1; …; z_N] ∈ ℝ^{N×N}   (6)

where z_i denotes the i-th row of Z. Each z_{ij} ∈ [0, 1] is the probability that data point i is a representative for data point j, with Σ_i z_{ij} = 1. The row-sparsity regularized trace minimization algorithm is applied to the matrix Z to select a few representative exemplars that can suitably encode the pose data according to the dissimilarities, as follows:

min_Z  λ Σ_{i=1}^N ‖z_i‖_∞ + Σ_{i=1}^N Σ_{j=1}^N d_{ij} z_{ij}   (7)

subject to:  Σ_{i=1}^N z_{ij} = 1, ∀j;   z_{ij} ≥ 0, ∀i, j

where the parameter λ > 0 sets the trade-off between these two terms: the number of selected representatives (row-sparsity term) versus the encoding cost.

Once this optimization problem (Eq. 7) has been solved, the representative indices can be found from the nonzero rows of Z. The clustering of data points into K clusters, associated with the K representatives, is obtained by assigning each data point to its closest representative. In particular, if {i_1, …, i_K} denote the indices of the representatives, data point j is assigned to the pose representative i_ℓ that minimizes d_{i_ℓ j} over {i_1, …, i_K}.
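With the ℓ∞ norm on the rows of Z, the program of Eq. 7 becomes a linear program and can be solved exactly for small generic sets. Below is a hedged sketch using scipy's LP solver; the pose values, λ and the nonzero-row threshold are illustrative assumptions, and large-scale use would require the dedicated solvers of Elhamifar2; Elhamifar3.

```python
import numpy as np
from scipy.optimize import linprog

def select_representatives(poses, lam=10.0, tol=1e-6):
    """Row-sparsity regularized representative selection in the spirit of Eq. 7:
        min_Z  lam * sum_i ||z_i||_inf + sum_ij d_ij * z_ij
        s.t.   sum_i z_ij = 1 for all j,  z_ij >= 0.
    Introducing t_i = ||z_i||_inf turns this into an LP in (Z, t).
    poses: (N, 3) array of (roll, yaw, pitch) angles in degrees."""
    N = len(poses)
    Dm = np.linalg.norm(poses[:, None, :] - poses[None, :, :], axis=2)  # d_ij
    # variables: z_ij (row-major, N*N of them) followed by t_i (N of them)
    c = np.concatenate([Dm.ravel(), lam * np.ones(N)])
    A_ub = np.zeros((N * N, N * N + N))          # encodes z_ij - t_i <= 0
    for i in range(N):
        for j in range(N):
            A_ub[i * N + j, i * N + j] = 1.0
            A_ub[i * N + j, N * N + i] = -1.0
    A_eq = np.zeros((N, N * N + N))              # each column of Z sums to 1
    for j in range(N):
        A_eq[j, j:N * N:N] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(N * N),
                  A_eq=A_eq, b_eq=np.ones(N),
                  bounds=[(0, None)] * (N * N + N), method="highs")
    Z = res.x[:N * N].reshape(N, N)
    reps = np.where(Z.max(axis=1) > tol)[0]      # indices of nonzero rows
    assign = reps[np.argmin(Dm[reps], axis=0)]   # nearest-representative clustering
    return reps, assign
```

On two well-separated pose clusters, the LP keeps one nonzero row (one exemplar, here the medoid) per cluster, and every ROI is assigned to the exemplar of its own cluster.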

The auxiliary dictionary is designed based on these pose clusters, where each cluster forms a block in the dictionary. The pose angle of the representative video ROI of each pose cluster, referred to as a pose exemplar, is used as the rendering parameter to generate synthetic face images with varying poses, using off-the-shelf 3D face models blanz1; Tran1; Tran2. In this way, K synthetic profile faces are generated under the representative pose angles from each given single still face image.

The augmented gallery dictionary A is formed by merging each still ROI of the reference set with the K synthetic images rendered w.r.t. the representative pose exemplars.

3.2 Synthetic Plus Variational Encoding:

With the S+V model (see Fig. 3), each probe video ROI y is seen as a combination of two different sub-signals from the augmented gallery dictionary A and the auxiliary variational dictionary V in the linear additive model:

y = Aα + Vβ + e   (8)

where A denotes the augmented gallery dictionary, V denotes the variational dictionary, and e is a noise term. This model searches for the sparsest representation of the probe sample over both the A and V dictionaries. We first extend the original ESRC to the following robust formulation (Eq. 9):

{α̂, β̂} = arg min_{α,β} ‖y − Aα − Vβ‖₂² + λ₁‖α‖₁ + λ₂ φ(β)   (9)

where φ(·) corresponds to a combination of Gaussian and Laplacian priors, defined in Eq. 10. This model assigns different regularization parameters λ₁ and λ₂ to the α and β coefficients to guarantee the robustness of the variational information from the generic set Li.

φ(β) = η‖β‖₁ + (1 − η)‖β‖₂²   (10)

The simultaneous sparsity constraint is then imposed to fully benefit from the variational information as well as the synthetic still ROIs. Each generic set cluster found during representative selection forms a block in the auxiliary dictionary, and the exemplar of each cluster is used as the rendering parameter in face synthesis for augmenting the gallery dictionary. The same sparsity pattern constraint in terms of pose angle is imposed on the dictionaries, which encourages similar pose angles to select the same set of atoms for representing each view. In this way, the coefficient vectors of the still ROIs in the augmented gallery dictionary are forced to share the same sparsity pattern with the non-zero coefficients associated with video ROIs belonging to the corresponding block (cluster) of the same view in the auxiliary dictionary. This improves the discriminative power of the dictionaries accordingly.
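For intuition, when the number of pose clusters K is small, the effect of this paired constraint can be approximated by an exhaustive block search: restrict the support to a single pose block shared by both dictionaries, solve a least-squares problem per block, and keep the block with the smallest residual. This is a simplified stand-in for the joint optimization, with illustrative names throughout; it is not the paper's solver.

```python
import numpy as np

def paired_block_code(y, A_blocks, V_blocks):
    """Pose-paired coding sketch: the gallery coefficients and the variational
    coefficients are forced to live in the SAME pose block. A_blocks[k] holds the
    gallery atoms synthesized at pose k; V_blocks[k] holds the variation atoms of
    pose cluster k. Returns (residual, chosen pose block, alpha_k, beta_k)."""
    best = None
    for k, (Ak, Vk) in enumerate(zip(A_blocks, V_blocks)):
        M = np.hstack([Ak, Vk])                       # joint same-pose dictionary
        coef, *_ = np.linalg.lstsq(M, y, rcond=None)
        r = np.linalg.norm(y - M @ coef)
        if best is None or r < best[0]:
            best = (r, k, coef[:Ak.shape[1]], coef[Ak.shape[1]:])
    return best
```

A probe synthesized from the atoms of pose block 1 is explained with near-zero residual by that block, while mismatched-pose blocks leave a large residual, so the same-view pairing emerges from the search.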

Figure 3: An illustration of sparsity pattern with the S+V model based on clustering results in the pose space. Each column represents a sparse representation vector, each square denotes a coefficient and each matrix is a dictionary.

The new sparse coefficients can be obtained by solving the following optimization problem:

(11)

where ‖·‖_F denotes the Frobenius norm, and the coefficient matrices consist of K blocks, where K is the number of clusters/representatives.

(12)

subject to:

where s is the sparsity level, and the mixed ℓ₁,₂ norm is defined by taking the ℓ₂ norm of each row of the matrix and then applying the ℓ₁ norm to the resulting vector. Note that each view in the formulation of Eq. 12 shares the same sparsity pattern at the class level, but not necessarily at the atom level in real-world scenarios. This problem, called joint dynamic sparse representation, can be solved by applying the norm across the coefficients of the dynamic active sets Nasrabadi as follows:

(13)

subject to:

where the joint dynamic sparsity term is defined as follows:

(14)

where each term is the set of coefficients associated with a dynamic active set:

(15)

where each dynamic active set g_l, for l = 1, …, s, refers to the indices of a set of coefficients belonging to the same class in the coefficient matrix. In order to solve this optimization problem, the classical alternating direction method of multipliers is considered tropp2006. The joint dynamic sparsity regularization term allows combining cues from all the views during joint sparse representation. Moreover, it provides a better representation of multiple-view images, which constitute different measurements of the same individual from different viewpoints. Finally, the residuals for each class are calculated for the final classification as follows:

r_k(y) = ‖y − A δ_k(α̂) − V β̂‖₂   (16)

where δ_k(α̂) is a vector whose nonzero entries are the entries in α̂ that are associated with class k. The class with the minimum reconstruction error is then regarded as the label of the probe subject y. Algorithm 1 summarizes the S+V model for still-to-video FR from a SSPP.

Input: Reference still ROIs, generic set G, probe sample y, and regularization parameters.
1 Estimate the pose angles of G.
2 Apply row sparsity clustering in the pose space of G, producing K clusters (representative exemplars).
3 Find the center of each cluster as the representative pose angles.
4 Construct the variational dictionary V with K blocks.
5 for each reference still ROI do
6        Generate K synthetic images for the individual based on the representative pose angles.
7        Merge them with the reference still ROI to form the augmented gallery dictionary A.
8 end for
9 Solve the sparse representation problem to estimate the coefficients α̂ and β̂ for y by Eq. 13.
10 Compute the residuals r_k(y) by Eq. 16.
Output: identity(y).
Algorithm 1 Synthetic Plus Variational Model.

4 Still-to-Video Face Recognition with the S+V Model

In this section, a particular implementation is considered (see Fig. 4) to assess the impact of using the S+V model for still-to-video FR. The augmented and auxiliary dictionaries are constructed by employing the representative synthetic ROIs and generic variations, respectively, and classification is performed by SRC, where the generic set atoms in the auxiliary dictionary are forced to combine with approximately the same facial viewpoint in the augmented gallery dictionary. The main steps of the proposed domain-invariant FR with the S+V model are summarized as follows.

Figure 4: Block diagram of the proposed still-to-video FR system with the S+V modeling.
  • Step 1. Select Representatives: The generic set in the target domain is clustered based on pose angles using row sparsity.

  • Step 2. Design an Augmented Gallery Dictionary: K synthetic ROIs are generated for each still ROI of the reference gallery set in the source domain to form an augmented gallery dictionary A, where K is the number of clusters/representatives.

  • Step 3. Form an Auxiliary Dictionary: Variations of the natural albedo of the generic set in the target domain are extracted by subtracting the natural image from the other images of the same class, to form a generic auxiliary dictionary V with block structure.

  • Step 4. Extract Features: The deep CNN features of the atoms of both dictionaries are extracted.

  • Step 5. Apply Simultaneous Sparsity: The augmented gallery dictionary is encouraged to pair its sparsity pattern with the auxiliary dictionary for the same pose angles by applying the simultaneous sparsity constraint.

  • Step 6. Validation: The proposed system assesses whether a given probe ROI belongs to one of the enrolled persons, and rejects invalid probe ROIs using the sparsity concentration index (SCI) criterion defined in Wright1:

    SCI(α̂) = (C · max_k ‖δ_k(α̂)‖₁ / ‖α̂‖₁ − 1) / (C − 1) ∈ [0, 1]   (17)

    A probe ROI is accepted as valid if SCI(α̂) ≥ τ, and otherwise rejected as invalid, where τ ∈ (0, 1] is an outlier rejection threshold and C is the number of classes.
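The SCI-based rejection of Step 6 can be sketched as follows (a minimal illustration; the labels and the threshold value are assumptions, with C the number of enrolled classes):

```python
import numpy as np

def sparsity_concentration_index(alpha, labels, n_classes):
    """SCI of Wright1: (C * max_k ||delta_k(alpha)||_1 / ||alpha||_1 - 1) / (C - 1).
    Returns ~1 when the nonzero coefficients concentrate on one class and
    ~0 when they spread evenly over all classes."""
    l1 = np.abs(alpha).sum()
    if l1 == 0.0:
        return 0.0
    frac = max(np.abs(alpha[labels == k]).sum() / l1 for k in range(n_classes))
    return (n_classes * frac - 1.0) / (n_classes - 1.0)

def is_valid_probe(alpha, labels, n_classes, tau=0.6):
    """Accept the probe if SCI >= tau (tau is an illustrative threshold)."""
    return sparsity_concentration_index(alpha, labels, n_classes) >= tau
```

A coefficient vector concentrated on a single class gives SCI = 1 and is accepted; a vector spread evenly over all classes gives SCI = 0 and is rejected as an outlier.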

5 Experimental Methodology

5.1 Datasets:

In order to evaluate the performance of the proposed S+V model for still-to-video FR, an extensive series of experiments is conducted on the Chokepoint (http://arma.sourceforge.net/chokepoint) Wong and COX-S2V (http://vipl.ict.ac.cn) Huang3 datasets. Both datasets are suitable for experiments in still-to-video FR in video surveillance, because they are composed of a high-quality still image and lower-resolution video sequences, with variations in illumination conditions, pose, expression, blur and scale.

Chokepoint Wong (see Fig. 5) consists of subjects walking through two portals (P1 and P2). Videos were recorded over four sessions (S1, S2, S3, S4), one month apart. An array of three cameras (Cam1, Cam2, Cam3) is mounted above P1 and P2 to capture the subjects during the sessions while they are either entering (E) or leaving (L) the portals in a natural manner. In total, four data subsets are available (P1E, P1L, P2E, and P2L), and the dataset consists of the corresponding video sequences.

The COX-S2V dataset Huang3 (see Fig. 6) contains a set of individuals, each with one high-quality still image and several low-resolution video sequences that simulate a video surveillance scenario. The video frames are captured by four cameras (Cam1, Cam2, Cam3, Cam4) mounted at fixed, elevated locations. In each video, an individual walks through an S-shaped route with changes in pose, illumination, and scale.

((a)) Still Reference ROIs
((b)) Examples of Video ROIs
Figure 5: Examples of still images and video frames from portals and cameras of Chokepoint dataset.
((a)) Still Reference ROIs
((b)) Examples of Video ROIs
Figure 6: Examples of still images and video frames from 3 cameras of COX-S2V dataset.

5.2 Protocol and Performance Measures:

A particular implementation of the S+V model for still-to-video FR has been considered to validate the proposed approach. We hypothesize that accuracy can be improved by adding synthetic reference faces to the gallery dictionary, and that encouraging both dictionaries to share the same sparsity pattern for the same pose angles can address non-linear pose variations.

First, it is assumed that during the calibration process, representative pose angles are selected based on the pose clusters obtained from facial ROI trajectories of unknown persons captured in the target domain using row sparsity clustering. During the enrollment of an individual to the system, synthetic ROIs are generated from each reference still ROI under typical pose variations corresponding to the different camera viewpoints. For face synthesis, we employ the conventional 3D Morphable Model (3DMM) blanz1 and the CNN-regressed 3DMM Tran1, which relies on a CNN to regress the 3DMM parameters. The gallery dictionary is constructed using the reference still ROIs of the enrolled individuals along with their synthetic ROIs. Next, the auxiliary variational dictionary is designed using the intra-class variations of the generic set with a block structure (one block per generic subject).

Additionally, we consider extracting deep features using CNN models to further improve the recognition rate. The networks are pre-trained on the VGGFace2 dataset with the AlexNet krizhevsky, ResNet He and VGGNet simonyan architectures using the triplet loss facenet. The extracted features are concatenated into a row feature vector that forms an atom of the dictionary, and the sparse model is fed with these features.

In all experiments with the Chokepoint dataset, target individuals are selected randomly to design a watch-list of high-quality frontal still images; for the experiments with COX-S2V, individuals are randomly selected to build a watch-list from their high-quality faces. Videos of individuals assumed to be non-target persons are used as the generic set. The remaining videos, including those of other non-target individuals and those of individuals already enrolled in the watch-list, are used for testing. In order to obtain representative results, this process is repeated with different random selections of the watch-list, and the average performance is reported with its standard deviation over all runs.
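As a minimal sketch of how the auxiliary variational dictionary described above can be assembled from a generic set. It assumes, purely for illustration, that the first image of each generic subject serves as the "natural" reference image that is subtracted from the others:

```python
import numpy as np

def build_auxiliary_dictionary(generic_classes):
    """Build a block-structured variational dictionary.

    generic_classes: list of arrays, one per generic subject, each of
    shape (n_images, d); row 0 is taken as the 'natural' reference image.
    Returns a (d, total_variations) matrix whose columns are intra-class
    variation vectors, grouped in one block per subject.
    """
    blocks = []
    for X in generic_classes:
        natural = X[0]                 # reference/natural image of the class
        variations = X[1:] - natural   # subtract it from the other images
        blocks.append(variations.T)    # columns are variation vectors
    return np.hstack(blocks)
```

Because each subject contributes a contiguous block of columns, a block-sparsity constraint on the coding coefficients can later select variations from only a few generic subjects at a time.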

During the operational phase, FR is performed by sparse coding the features of the probe ROI over the features of the augmented gallery and auxiliary (variational) dictionaries. The sparsity parameter is fixed for all experiments. We also compare the S+V method to several state-of-the-art baseline methods: ESRC Deng, SVDL Yang, RADL Wei, LGR Zhu, CSR Li, face frontalization Hassner, and recognition via generation masi2.

The average performance of the proposed and baseline FR systems is measured in terms of accuracy and complexity. For accuracy, we measure the partial area under the ROC curve, pAUC(20%) (the AUC up to a false positive rate of 20%), and the area under the precision-recall curve (AUPR). An estimate of time complexity is provided analytically based on the worst-case number of operations performed per iteration. Then, the average running time of our algorithm is measured with randomly selected probe ROIs on a PC workstation with an Intel Core i7 CPU (3.41 GHz) and 16 GB of RAM.
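The pAUC(20%) measure can be computed from ROC points by integrating the curve only up to the 20% false positive rate; the sketch below normalizes the partial area by the cut-off so that a perfect classifier scores 1, which is one common convention and assumed here:

```python
import numpy as np

def pauc(fpr, tpr, max_fpr=0.2):
    """Partial AUC of an ROC curve up to max_fpr, normalized by max_fpr.
    fpr and tpr are ROC points sorted by increasing fpr."""
    fpr, tpr = np.asarray(fpr, float), np.asarray(tpr, float)
    # truncate the curve at max_fpr, interpolating the TPR at the cut-off
    cut = np.searchsorted(fpr, max_fpr, side="right")
    f = np.concatenate([fpr[:cut], [max_fpr]])
    t = np.concatenate([tpr[:cut], [np.interp(max_fpr, fpr, tpr)]])
    # trapezoidal integration, then normalize so a perfect curve scores 1
    area = np.sum((f[1:] - f[:-1]) * (t[1:] + t[:-1]) / 2.0)
    return area / max_fpr
```

With this normalization a chance-level classifier scores max_fpr/2 divided by max_fpr, i.e. 0.1 at the 20% cut-off, rather than the 0.5 familiar from the full AUC.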

6 Results and Discussion

This section first shows some examples of synthetic face images produced under representative pose variations, and then presents still-to-video FR performance achieved when augmenting SRC dictionaries with such images to address non-linear variations caused by pose changes. In order to investigate the impact of the proposed S+V model on performance, we considered the still-to-video FR system described in Section 4 with a growing number of synthetic faces, along with a generic training set. Finally, this section presents an ablation study (showing the effect of each module on the performance) and a complexity analysis for our proposed approach.

6.1 Synthetic Face Generation:

Fig. 7 shows an example of the clustering (based on row sparsity) obtained with facial ROIs of trajectories extracted from Chokepoint and COX-S2V videos in the 3-dimensional pose (roll-pitch-yaw) space. In this experiment, the representative pose condition clusters are determined using row sparsity for the Chokepoint and COX-S2V data, respectively. The exemplars selected from these clusters (black circles) are used to define representative pose angles for synthetic face generation with the 3DMM and 3DMM-CNN techniques. For instance, the representative pose angles obtained with the Chokepoint database are: (pitch, yaw, roll) = (15.65, 14.77, -0.62), (12.44, 2.76, 3.64), (9.06, -5.46, 4.73), (1.98, 6.09, 2.79), (13.21, 15.32, 6.14), (0.64, -18.93, 0.86), (5.23, 2.92, 2.03) degrees.
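The selection of representative pose exemplars can be sketched as follows. Plain k-means over roll-pitch-yaw angles is used here only as a simpler stand-in for the paper's row-sparsity clustering, and each exemplar is the actually observed pose closest to its cluster centre:

```python
import numpy as np

def select_pose_exemplars(poses, k, n_iter=100, seed=0):
    """Pick k representative pose exemplars from an (N, 3) array of
    roll-pitch-yaw angles. Runs a basic k-means, then returns the real
    sample nearest to each cluster centre, so exemplars are observed
    poses rather than synthetic averages."""
    poses = np.asarray(poses, float)
    rng = np.random.default_rng(seed)
    centres = poses[rng.choice(len(poses), k, replace=False)]
    for _ in range(n_iter):
        # assign each pose to its nearest centre, then recompute centres
        labels = np.argmin(((poses[:, None] - centres) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centres[j] = poses[labels == j].mean(0)
    # exemplar = nearest actual sample to each final centre
    idx = [np.argmin(((poses - c) ** 2).sum(-1)) for c in centres]
    return poses[idx]
```

The returned exemplars play the role of the black circles in Fig. 7: one pose per cluster at which synthetic faces are rendered.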

Figs. 8 and 9 show the synthetic face images generated based on 3DMM and 3DMM-CNN under representative exemplars using reference still ROIs of the Chokepoint and COX-S2V datasets, respectively.

((a)) Chokepoint
((b)) COX-S2V
Figure 7: Example of clusters obtained with facial trajectories represented in the pose space for the Chokepoint (ID#1, #5, #6, #7, #9) and COX-S2V (ID#1, #9, #24, #33, #36, #38, #44, #56, #78, #80) datasets, respectively. Clusters are shown in different colors, and their representative pose exemplars are indicated with black circles.
Figure 8: Examples of synthetic face images generated from the reference still ROI of individuals ID#25 and ID#26 (a) of Chokepoint dataset. They are produced based on representative exemplars (poses) and using 3DMM (b) and 3DMM-CNN (c).
Figure 9: Examples of synthetic face images generated from the reference still ROI of individuals ID#21 and ID#151 (a) of COX-S2V dataset. They are produced based on representative exemplars (poses) and using 3DMM (b) and 3DMM-CNN (c).

6.2 Impact of Number of Synthetic Images:

In this subsection, the proposed S+V model is evaluated for a growing set of synthetic facial images in the augmented gallery dictionary. Fig. 10 shows the average pAUC(20%) and AUPR accuracy obtained for the implementation in Section 4 when increasing the number of synthetic ROIs per individual. These ROIs were sampled from the representative pose exemplars of the Chokepoint and COX-S2V datasets. Results indicate that adding representative synthetic ROIs to the gallery dictionary allows the system to outperform the baseline designed with the original reference still ROI alone. pAUC(20%) and AUPR accuracy increase considerably with only one synthetic ROI per pose cluster for the Chokepoint and COX-S2V datasets.

((a)) Chokepoint
((b)) Chokepoint
((c)) COX-S2V
((d)) COX-S2V
Figure 10: Average pAUC(20%) and AUPR accuracy of the S+V model versus the size of the synthetic set generated using 3DMM and 3DMM-CNN on the Chokepoint (a,b) and COX-S2V (c,d) databases. Error bars are standard deviations.

To further assess the benefits, Fig. 11 compares the performance of the proposed S+V method (which adds synthetic samples) with the original SRC (without an auxiliary dictionary) and with ESRC (with a manually designed auxiliary dictionary). Results in this figure show that the proposed method outperforms the others, and that FR performance is higher when the dictionary is designed using the representative views rather than the manually designed dictionary. The proposed method can therefore generate representative facial ROIs for the gallery and match them with the corresponding variations in the auxiliary dictionary. Encouraging pair-wise relationships between the variational and augmented gallery dictionaries has a positive impact on the performance of SRC-based still-to-video FR systems.

((a)) Chokepoint
((b)) Chokepoint
((c)) COX-S2V
((d)) COX-S2V
Figure 11: Average pAUC(20%) and AUPR accuracy for SRC, ESRC and S+V model on Chokepoint (a,b) and COX-S2V (c,d) databases. Error bars are standard deviation.

6.3 Impact of Camera Viewpoint:

To evaluate the robustness of the proposed S+V model to pose variations, accuracy is measured for different portals and video cameras, as well as for a fusion of cameras. Tables 1 and 2 summarize the average accuracy on the Chokepoint and COX-S2V datasets, respectively. For the Chokepoint dataset, videos are captured over 4 sessions with three cameras (Camera1, Camera2, Camera3) over portal 1 (P1E, P1L) and portal 2 (P2E, P2L), while for the COX-S2V dataset, videos are captured with three cameras (Camera1, Camera2 and Camera3). The performance of the S+V model is compared with that of SRC and ESRC using the same configurations. Results show that the S+V model outperforms the other techniques across different pose variations, and that using synthetic profile views can improve the robustness of FR systems to pose variations. As expected, designing a system that combines faces from all cameras (and portals) always provides a higher level of accuracy.

Portal | Viewpoint | SRC pAUC(20%) | SRC AUPR | ESRC pAUC(20%) | ESRC AUPR | S+V pAUC(20%) | S+V AUPR
P1 | Camera1 | 0.482±0.023 | 0.361±0.021 | 0.691±0.020 | 0.534±0.023 | 0.712±0.024 | 0.607±0.021
P1 | Camera2 | 0.495±0.021 | 0.389±0.022 | 0.703±0.022 | 0.553±0.020 | 0.719±0.022 | 0.615±0.022
P1 | Camera3 | 0.412±0.025 | 0.377±0.023 | 0.532±0.023 | 0.512±0.022 | 0.672±0.026 | 0.572±0.023
P1 | All Cameras | 0.513±0.022 | 0.438±0.024 | 0.718±0.019 | 0.579±0.018 | 0.731±0.021 | 0.706±0.022
P2 | Camera1 | 0.422±0.023 | 0.387±0.020 | 0.604±0.024 | 0.526±0.021 | 0.622±0.022 | 0.518±0.020
P2 | Camera2 | 0.452±0.022 | 0.416±0.023 | 0.631±0.025 | 0.548±0.020 | 0.652±0.021 | 0.546±0.021
P2 | Camera3 | 0.378±0.021 | 0.351±0.022 | 0.517±0.022 | 0.435±0.023 | 0.538±0.025 | 0.441±0.022
P2 | All Cameras | 0.471±0.020 | 0.423±0.021 | 0.651±0.020 | 0.547±0.019 | 0.672±0.018 | 0.573±0.023
P1&P2 | All Cameras | 0.524±0.032 | 0.475±0.031 | 0.802±0.028 | 0.651±0.025 | 0.892±0.019 | 0.751±0.020

Table 1: Average accuracy of FR systems based on the proposed S+V model, SRC, and ESRC over different sessions, portals and cameras of the Chokepoint dataset. Feature representations are raw pixels, the 3DMM method is used for face synthesis.
Viewpoint | SRC pAUC(20%) | SRC AUPR | ESRC pAUC(20%) | ESRC AUPR | S+V pAUC(20%) | S+V AUPR
Camera1 | 0.481±0.020 | 0.432±0.021 | 0.765±0.019 | 0.645±0.022 | 0.780±0.020 | 0.657±0.021
Camera2 | 0.475±0.023 | 0.419±0.022 | 0.716±0.020 | 0.602±0.020 | 0.747±0.023 | 0.629±0.022
Camera3 | 0.507±0.021 | 0.441±0.019 | 0.802±0.021 | 0.671±0.021 | 0.824±0.021 | 0.715±0.019
All 3 Cameras | 0.566±0.030 | 0.480±0.027 | 0.835±0.027 | 0.695±0.026 | 0.905±0.020 | 0.776±0.017

Table 2: Average accuracy of FR systems using the proposed S+V model, SRC, and ESRC over different sessions and portals of the COX-S2V dataset. Feature representations are raw pixels, the 3DMM method is used for face synthesis.

6.4 Impact of Feature Representations:

Table 3 shows the effect on FR performance of using different feature representations (including raw pixels, AlexNet krizhevsky, ResNet He and VGGNet simonyan) and face synthesis methods (3DMM and 3DMM-CNN) for videos from all cameras of the Chokepoint and COX-S2V datasets.

Technique | Face Synthesis | Features | Chokepoint pAUC(20%) | Chokepoint AUPR | COX-S2V pAUC(20%) | COX-S2V AUPR
TM | N/A | Raw pixels | 0.551±0.027 | 0.503±0.028 | 0.574±0.031 | 0.512±0.029
TM | N/A | AlexNet | 0.563±0.026 | 0.513±0.029 | 0.586±0.030 | 0.519±0.027
TM | N/A | VGGNet-16 | 0.570±0.028 | 0.524±0.026 | 0.597±0.027 | 0.528±0.030
TM | N/A | VGGNet-19 | 0.578±0.025 | 0.531±0.027 | 0.605±0.029 | 0.533±0.028
TM | N/A | ResNet-50 | 0.595±0.027 | 0.550±0.026 | 0.628±0.024 | 0.551±0.025
SRC | N/A | Raw pixels | 0.525±0.030 | 0.475±0.029 | 0.568±0.031 | 0.481±0.030
SRC | N/A | AlexNet | 0.537±0.025 | 0.487±0.028 | 0.581±0.027 | 0.494±0.026
SRC | N/A | VGGNet-16 | 0.552±0.026 | 0.491±0.027 | 0.590±0.025 | 0.505±0.027
SRC | N/A | VGGNet-19 | 0.567±0.027 | 0.512±0.024 | 0.602±0.023 | 0.511±0.028
SRC | N/A | ResNet-50 | 0.581±0.026 | 0.533±0.025 | 0.623±0.022 | 0.523±0.024
S+V Model | 3DMM | Raw pixels | 0.892±0.018 | 0.751±0.019 | 0.903±0.020 | 0.775±0.016
S+V Model | 3DMM | AlexNet | 0.905±0.019 | 0.771±0.020 | 0.913±0.016 | 0.783±0.015
S+V Model | 3DMM | VGGNet-16 | 0.908±0.016 | 0.773±0.017 | 0.916±0.018 | 0.786±0.016
S+V Model | 3DMM | VGGNet-19 | 0.912±0.017 | 0.779±0.018 | 0.921±0.016 | 0.791±0.017
S+V Model | 3DMM | ResNet-50 | 0.917±0.015 | 0.783±0.016 | 0.925±0.015 | 0.798±0.014
S+V Model | 3DMM-CNN | Raw pixels | 0.855±0.019 | 0.737±0.018 | 0.871±0.019 | 0.741±0.018
S+V Model | 3DMM-CNN | AlexNet | 0.873±0.020 | 0.752±0.020 | 0.884±0.018 | 0.753±0.019
S+V Model | 3DMM-CNN | VGGNet-16 | 0.880±0.017 | 0.759±0.017 | 0.891±0.017 | 0.761±0.016
S+V Model | 3DMM-CNN | VGGNet-19 | 0.884±0.018 | 0.763±0.020 | 0.902±0.016 | 0.765±0.017
S+V Model | 3DMM-CNN | ResNet-50 | 0.891±0.016 | 0.769±0.014 | 0.907±0.017 | 0.771±0.015
Table 3: Average accuracy of FR systems using the proposed S+V model and template matching using different feature representation on Chokepoint and COX-S2V databases.

We further evaluate the impact on performance of different CNN feature extractors and loss functions for FR with the S+V model. Table 4 shows the average pAUC(20%) and AUPR accuracy of FR systems using the proposed S+V model with different pre-trained CNNs for feature representation and different loss functions (triplet loss facenet, cosine loss b and angular softmax a) on the Chokepoint and COX-S2V databases. Results indicate that coupling the S+V model with deep CNN features can further improve FR accuracy over using raw pixels, that ResNet-50 outperforms the other CNN architectures, and that the SphereFace training method yields the highest accuracy. By using CNN features along with 3DMM or 3DMM-CNN, a still-to-video FR system with the S+V model outperforms the baseline template matcher (TM) and SRC.


Technique | Deep Architecture | Training | Chokepoint pAUC(20%) | Chokepoint AUPR | COX-S2V pAUC(20%) | COX-S2V AUPR
S+V Model | AlexNet | FaceNet facenet | 0.905±0.019 | 0.771±0.020 | 0.913±0.016 | 0.783±0.015
S+V Model | AlexNet | CosFace b | 0.908±0.021 | 0.774±0.022 | 0.915±0.017 | 0.787±0.016
S+V Model | AlexNet | SphereFace a | 0.912±0.020 | 0.780±0.018 | 0.918±0.015 | 0.792±0.014
S+V Model | VGGNet-19 | FaceNet facenet | 0.884±0.021 | 0.763±0.020 | 0.902±0.019 | 0.765±0.018
S+V Model | VGGNet-19 | CosFace b | 0.889±0.019 | 0.768±0.022 | 0.907±0.017 | 0.772±0.016
S+V Model | VGGNet-19 | SphereFace a | 0.906±0.018 | 0.771±0.017 | 0.913±0.015 | 0.778±0.017
S+V Model | ResNet-50 | FaceNet facenet | 0.917±0.015 | 0.783±0.016 | 0.924±0.015 | 0.798±0.014
S+V Model | ResNet-50 | CosFace b | 0.920±0.018 | 0.786±0.019 | 0.927±0.018 | 0.802±0.020
S+V Model | ResNet-50 | SphereFace a | 0.922±0.015 | 0.791±0.014 | 0.928±0.017 | 0.805±0.015

Table 4: Average accuracy of FR systems using the proposed S+V model (3DMM face synthesis) with different deep feature representations on Chokepoint and COX-S2V databases.

Table 5 shows the average accuracy of FR for the augmented and auxiliary dictionaries with the videos from all cameras of the Chokepoint and COX-S2V datasets.

Technique | Dictionary | Chokepoint pAUC(20%) | Chokepoint AUPR | COX-S2V pAUC(20%) | COX-S2V AUPR
S+V Model | Augmented Dictionary | 0.829±0.28 | 0.705±0.27 | 0.847±0.26 | 0.718±0.254
S+V Model | Auxiliary Dictionary | 0.836±0.23 | 0.714±0.25 | 0.862±0.22 | 0.731±0.021
Table 5: Average accuracy of FR systems using the augmented dictionary (3DMM face synthesis) and auxiliary dictionaries on Chokepoint and COX-S2V databases.

6.5 Comparison with State-of-the-Art Methods:

Table 6 presents the FR accuracy obtained with the proposed S+V model compared with state-of-the-art SRC techniques based on generic learning – ESRC Deng, SVDL Yang, LGR Zhu, RADL Wei, and CSR Li. Each one uses the same number of samples, raw pixel-based features, and the same regularization parameter. Accuracy of the S+V model is also compared with that of the flow-based face frontalization Hassner and recognition via generation masi2 techniques. The baseline system is an SRC model designed with the original reference still ROI of each enrolled person and raw pixel-based features. The table shows that the S+V model, jointly exploiting generic learning and face synthesis, achieves a higher level of accuracy than the other methods under the same configuration, showing its potential for surveillance FR.

Techniques | Chokepoint pAUC(20%) | Chokepoint AUPR | COX-S2V pAUC(20%) | COX-S2V AUPR
SRC (Baseline) Wright1 | 0.524±0.032 | 0.475±0.031 | 0.568±0.030 | 0.480±0.027
ESRC Deng | 0.802±0.028 | 0.651±0.025 | 0.835±0.027 | 0.695±0.026
ESRC-KSVD | 0.811±0.023 | 0.659±0.022 | 0.840±0.023 | 0.712±0.021
SVDL Yang | 0.825±0.023 | 0.703±0.025 | 0.843±0.025 | 0.724±0.023
RADL Wei | 0.832±0.019 | 0.711±0.020 | 0.849±0.022 | 0.730±0.021
LGR Zhu | 0.849±0.022 | 0.717±0.024 | 0.878±0.023 | 0.744±0.025
CSR Li | 0.852±0.025 | 0.722±0.020 | 0.880±0.021 | 0.753±0.020
Face Frontalization Hassner | 0.822±0.021 | 0.711±0.023 | 0.843±0.022 | 0.719±0.023
Recognition via Generation masi2 | 0.815±0.023 | 0.703±0.025 | 0.838±0.024 | 0.705±0.026
S+V Model (Ours) | 0.892±0.019 | 0.751±0.020 | 0.905±0.018 | 0.776±0.017
Table 6: Average accuracy of FR systems based on the proposed S+V model and related state-of-the-art SRC methods for videos from all 3 cameras of the Chokepoint and COX-S2V databases. Feature representations are raw pixels; the 3DMM method is used for face synthesis.
Figure 12: Illustration of procedure for the selection of the largest pose variations.
((a)) Chokepoint
((b)) Chokepoint
((c)) COX-S2V
((d)) COX-S2V
Figure 13: Average pAUC(20%) and AUPR accuracy of S+V model and related state-of-the-art techniques versus the different pose variations on Chokepoint (a,b) and COX-S2V (c,d) databases. Error bars are standard deviation.

In order to assess still-to-video FR accuracy under the worst-case pose variations between the probe video ROIs and the augmented gallery dictionary ROIs, we compute the minimum distance between the pose angle of each probe video ROI and the pose angles of both the reference still and synthetic ROIs in the augmented gallery dictionary:

d_i = \min_j \|\theta_i - \theta_j\|_2   (18)

where θ_i is the pose angle (roll, pitch, yaw) of the i-th probe video ROI and θ_j ranges over the pose angles of the reference still and synthetic ROIs in the augmented gallery dictionary. Next, the 5 video ROIs with the largest distance d_i are chosen as the faces with the largest pose differences (see Fig. 12). Fig. 13 shows the accuracy obtained with the SRC, ESRC, RADL, LGR and S+V models when these ROIs are classified as probe ROIs.
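This worst-case selection procedure can be sketched directly from Eq. (18); the function names below are illustrative:

```python
import numpy as np

def worst_case_pose_distance(probe_poses, gallery_poses):
    """For each probe pose (roll-pitch-yaw), Eq. (18): the Euclidean
    distance to its closest gallery pose. Probes with the largest such
    distance are the least well represented by the gallery."""
    probe_poses = np.asarray(probe_poses, float)
    gallery_poses = np.asarray(gallery_poses, float)
    # pairwise differences, then minimum over all gallery poses per probe
    diff = probe_poses[:, None, :] - gallery_poses[None, :, :]
    return np.sqrt((diff ** 2).sum(-1)).min(axis=1)

def hardest_probes(probe_poses, gallery_poses, n=5):
    """Indices of the n probes with the largest minimum pose distance."""
    d = worst_case_pose_distance(probe_poses, gallery_poses)
    return np.argsort(d)[::-1][:n]
```

The indices returned by `hardest_probes` correspond to the probe ROIs with the largest pose differences, i.e. the faces used in Fig. 13.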

As the pose differences increase, FR accuracy decreases. The FR system using the S+V model reaches the highest accuracy due to its added robustness to pose variations. LGR, in turn, outperforms SRC, ESRC and RADL across all pose variations. Accuracy of SRC is much lower than the others because, with only one frontal reference gallery ROI per person, the probe ROIs are not well represented.

Fig. 14 shows the impact of the size of the generic set in the auxiliary variational dictionary on FR accuracy. Results of SRC, ESRC, RADL and LGR are also shown for the same configurations for comparison. Accuracy of the S+V model increases significantly with respect to the other state-of-the-art methods as the number of generic ROIs grows. The results support the conclusion that augmenting the gallery dictionary allows the S+V model to increasingly benefit from the variational information of the generic set.

((a)) Chokepoint
((b)) Chokepoint
((c)) COX-S2V
((d)) COX-S2V
Figure 14: Average pAUC(20%) and AUPR accuracy versus the size of the generic set on Chokepoint (a,b) and COX-S2V (c,d) databases. Error bars are standard deviation.

6.6 Ablation Study:

Designing the S+V model for still-to-video FR consists of three main steps: (1) face synthesis, (2) adding intra-class variations, and (3) pairing the dictionaries. In this subsection, an ablation study is presented to show the impact of each module on FR performance. We assume that all FR systems use a pixel-based feature representation, 3DMM face synthesis, and a fixed number of synthetic images in the augmented dictionary.

Tables 7 and 8 show the average accuracy of the ablation study with videos from all cameras of the Chokepoint and COX-S2V datasets, respectively. Firstly, we disabled the face synthesis module and performed experiments to show the impact of augmenting the reference gallery with synthetic faces on FR accuracy. Next, we removed the auxiliary dictionary to evaluate the impact of considering generic set variations in the S+V model. Removing both modules from the S+V model causes accuracy to decline significantly. The results suggest that the addition of synthetic and generic set faces is an effective strategy to cope with facial variations. Another important component of the S+V model is the selection of representative ROIs and the pairing of the dictionaries. By removing the row sparsity and joint sparsity from the S+V model, and instead adding randomly selected synthetic ROIs, accuracy also decreases.

Accuracy | Baseline (none removed) | w/o face synthesis | w/o auxiliary dictionary | w/o dictionary pairing
pAUC(20%) | 0.892±0.019 | 0.839±0.21 | 0.827±0.27 | 0.883±0.25
AUPR | 0.751±0.020 | 0.709±0.23 | 0.702±0.25 | 0.721±0.22
Table 7: Results of the ablation study with the Chokepoint database.
Accuracy | Baseline (none removed) | w/o face synthesis | w/o auxiliary dictionary | w/o dictionary pairing
pAUC(20%) | 0.905±0.018 | 0.857±0.22 | 0.835±0.24 | 0.887±0.20
AUPR | 0.776±0.017 | 0.721±0.20 | 0.712±0.21 | 0.769±0.21
Table 8: Results of the ablation study with the COX-S2V database.

6.7 Complexity Analysis:

Time complexity is an important consideration in many real-time FR applications in video surveillance. The time required by the S+V model to classify a probe ROI grows with the dimension of the face descriptors, the number of ROIs per class in the augmented gallery dictionary, the number of enrolled classes, the size of the external generic set, the number of views, and the number of active sets (at each iteration, the most representative dynamic active sets must be selected from the coefficient matrix). In video FR applications, the dictionaries may grow large, so the computational burden of handling them can become the bottleneck of the proposed method. SRC and ESRC have lower per-probe complexity since they do not pair dictionaries, while the complexity of LGR additionally depends on the number of patches and the feature dimension of each patch. Although the proposed S+V model outperforms SRC and ESRC in accuracy, it requires more computation, mostly because of the pairing of the dictionaries.

Table 9 reports the average test time required by the proposed and baseline techniques to classify a probe ROI from Chokepoint and COX-S2V videos. LGR and RADL are more computationally intensive than the S+V model. Finally, Table 10 reports the average time for the main steps of the proposed framework – face synthesis, intra-class variation extraction, and dictionary pairing – on videos from all 3 cameras of the Chokepoint and COX-S2V datasets. The face synthesis step has the highest time complexity, followed by the dictionary pairing step, whose complexity depends on the numbers of rows and columns of the dissimilarity matrix.

Technique | Chokepoint (sec) | COX-S2V (sec)
SRC Wright1 | 1.03 | 2.56
ESRC Deng | 1.72 | 3.42
RADL Wei | 4.62 | 8.15
LGR Zhu | 7.13 | 12.37
S+V Model | 2.81 | 4.83
Table 9: Average time required by each technique to classify a probe video ROI with the Chokepoint and COX-S2V datasets.
Module | Chokepoint (sec) | COX-S2V (sec)
Face synthesis (3DMM) | 120 | 120
Face synthesis (3DMM-CNN) | 1.3 | 1.3
Intra-class variation extraction | 0.53 | 0.53
Dictionary pairing | 2.47 | 4.41
Table 10: Average computational time of the different steps of the S+V model with the Chokepoint and COX-S2V datasets.

7 Conclusion

In this paper, a paired sparse representation model is proposed to account for linear and non-linear variations in the context of still-to-video FR. The proposed S+V model leverages both face synthesis and generic learning to effectively represent probe ROIs from a single reference still. It manages non-linear variations by enriching the gallery dictionary with a representative set of synthetic profile faces, where synthetic (still) faces are paired with generic set (video) faces in the auxiliary variational dictionary. In this way, the augmented gallery dictionary is encouraged to share the same sparsity pattern with the auxiliary dictionary for the same pose angles. Experimental results obtained on the Chokepoint and COX-S2V datasets suggest that the proposed S+V model can efficiently represent linear and non-linear variations in facial pose with no need to collect a large amount of training data, and with only a moderate increase in time complexity. Results indicate that generic learning alone cannot effectively resolve the challenges of the SSPP and visual domain shift problems; with the S+V model, generic learning and face synthesis are complementary. The results also reveal that the performance of FR systems based on the S+V model can be further improved with CNN features. Future research includes investigating the geometrical structure of the data space in the dictionaries and the corresponding coefficients to improve discrimination. To reduce reconstruction time, we plan to extend the current S+V model so that it can represent larger sparse codes.

