1 Introduction

Because of high dimensionality, correlation among covariates, and noise contained in data, dimension reduction (DR) techniques are often employed to the application of machine learning algorithms. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and their kernel variants (KPCA, KLDA) are among the most popular DR methods. Recently, Supervised Kernel Principal Component Analysis (SKPCA) has been shown as another successful alternative. In this paper, brief reviews of these popular techniques are presented first. We then conduct a comparative performance study based on three simulated datasets, after which the performance of the techniques are evaluated through application to a pattern recognition problem in face image analysis. The gender classification problem is considered on MORPH-II and FG-NET, two popular longitudinal face aging databases. Several feature extraction methods are used, including biologically-inspired features (BIF), local binary patterns (LBP), histogram of oriented gradients (HOG), and the Active Appearance Model (AAM). After applications of DR methods, a linear support vector machine (SVM) is deployed with gender classification accuracy rates exceeding 95% on MORPH-II, competitive with benchmark results. A parallel computational approach is also proposed, attaining faster processing speeds and similar recognition rates on MORPH-II. Our computational approach can be applied to practical gender classification systems and generalized to other face analysis tasks, such as race classification and age prediction.

Dimension Reduction; PCA; LDA; FDA; KPCA; KFDA; SKPCA; SVM; Parameter Optimization; Gender Classification; MORPH-II.

A Comparison Study on Nonlinear Dimension Reduction Methods with Kernel Variations: Visualization, Optimization and Classification \AuthorKatherine C. Kempfert, Yishi Wang *, Cuixian Chen , and Samuel W.K. Wong \AuthorNamesKatherine C. Kempfert, Yishi Wang, Cuixian Chen and Samuel W.K. Wong \corresCorrespondence: wangy@uncw.edu; Tel.: +1-910-962-3292 \firstnoteCurrent address: The University of Florida. Gainesville, FL 32611

1 Introduction

Due to advances in data collection and storage capabilities, the demand has been growing substantially for gaining insights into high-dimensional, complex-structured, and noisy data. Researchers from diverse areas have applied DR techniques to visualize and analyze such data fodor2002survey; sorzano2014survey. DR techniques are also helpful to address the issues of collinearity and (i.e., number of features exceeding the sample size in a dataset), by projecting the data into a lower dimensional space with less correlation, so that classical statistical methods can be applied izenman2008modern. Principal Component Analysis (PCA) pearson1901liii; hotelling1933analysis is a well-studied algorithm with the goal of projecting input features onto a lower dimensional subspace while preserving the largest variance possible; lower dimensionality permits easier visualization, for example via heat maps. While PCA is a fully automatic algorithm, DR techniques that account for domain expertise via user input have also been more recently studied yang2003interactive; johansson2009interactive. For classification problems, in which the label information as the response variable is available, Linear Discriminant Analysis (LDA) (sometimes referred to as Fisher’s Discriminant Analysis (FDA)) can be used for DR by minimizing intra-class variation and maximizing inter-class variation fisher1936use; rao1948utilization. Since PCA only utilizes the correlation or covariance matrices, it is considered an unsupervised approach, whereas LDA is considered a supervised approach with labeling information built into its objective function. Despite the dissimilarities, both PCA and LDA search for linear combinations of the features and, therefore, can be applied in linearly separable types of problems lee2007nonlinear. The main challenge is that many problems in practical applications of machine learning are nonlinear nhan2015beyond; yin2012kernel. For nonlinear DR, kernel methods are popular choices because of their flexibility shawe2004kernel; motai2015kernel; xie2012robust, e.g, Kernel PCA scholkopf1997kernel, Kernel LDA for two classes mika1999fisher, and more generalized Kernel LDA for multiple classes baudat2000generalized. For kernel methods, it is also possible to design specialized kernels based on domain knowledge of a problem barzilay1999domain; scholkopf1998prior.

Given the problems in image analysis of high dimensionality and complex correlation structures, DR techniques are often a necessary step hinton2006reducing. Thus, variants of PCA, LDA, and their kernel extensions have been popular in computer vision with applications of image classification and discrimination zhao1998discriminant; martinez2001pca; yang2003can. Studies include Eigenfaces turk1991eigenfaces, Fisherfaces belhumeur1997eigenfaces, face recognition with KPCA kim2002face, face recognition with Kernel Direct LDA lu2003face, 2D-PCA yang2004two, 2D-LDA li20052d, among many others. When there are sufficient labeled face images, LDA is experimentally reported to outperform PCA for face recognition belhumeur1997eigenfaces. In the case of a small training set, the conclusion could be reversed martinez2001pca. Studies comparing classification performance of PCA, LDA, and their kernel variations include karg2009comparison; ye2009comparative. The connections among KLDA, KPCA, and LDA are further discussed in yang2004essence. By incorporating labeling information into the construction of the objective function, Supervised Kernel PCA (SKPCA) barshan2011supervised has been proposed for visualization, regression, and classification. A modified version of SKPCA for classification problems can be found in wang2015modified. These studies suggest that SKPCA works well in practice among different DR algorithms fewzee2012dimensionality; samadani2013discriminative; wu2013biomedical. Moreover, it has been found in ashtiani2015dimension that with bounded kernels, projections from SKPCA are uniformly converging, regardless of the input features’ dimension.

2 Associated Work

In recent years, facial demographic analysis has become popular in computer vision, because of its broad applications in human-computer interaction (HCI), security, surveillance, and marketing, which can benefit from the automatic estimation of characteristics like age, gender, and race. Recent surveys on demographic estimation from biometrics are presented in fu2010age; sun2018demographic. Specifically, a major task is gender classification, aiming to automatically determine if a person is male or female. Beyond computer vision, the topic has been studied extensively by anthropologists, sociologists, and psychologists. Gender can easily be identified by humans, achieving 96% accuracy in an experiment classifying photographs of adult faces burton1993s. Automating gender classification has been a priority in real-world applications. A number of biometrics have been used to identify gender, including face, voice, gait, handwriting, and even the iris sun2018demographic. However, gender classification from faces is the most common, probably because photography of faces is non-intrusive and ubiquitous. Ng et al. provide a survey of gender classification via face and gait ng2012vision.

Gender classification with faces launched in 1990, when neural networks were applied directly to pixels from face photographs golomb1990sexnet; cottrell1991empath. Many other early studies utilized the geometric-based approach to represent human faces, relying on measurements of facial landmarks poggio1992hyberbf; wiskott1995face. Though intuitive, such approaches are sensitive to the placement of landmarks, which can only accommodate frontal representations of the face, and may omit some important information from the face (such as texture of the skin). In recent years, the appearance-based methods have been more commonly adopted, which rely on a transformation of an image’s pixels guo2010human; guo2011simultaneous; shan2012learning. Such methods capture both the geometric relationships of the face and texture information. However, a drawback is their sensitivity to illumination and viewpoint variations. Other issues are associated with the high dimensionality of the transformed pixels, which will be discussed further in the next paragraph. Some most recent gender classification studies involve convolutional neural networks (CNN) yang2013automatic; yi2014age; antipov2016minimalistic; antipov2017effective. Though CNNs have reached state-of-the-art accuracy rates, they are known to be less interpretable than some other methods.

Pixels often contain high redundancy and noise, which cannot be removed completely by pre-processing steps. Hence, the vectors resulting from appearance-based feature extraction methods genetically inherit redundancy and noise. Popular image feature extraction methods include local texture techniques such as local binary patterns (LBP) yang2007demographic; lian2006multi; makinen2008experimental; alexandre2010gender, Gabor filters xia2008multi, biologically-inspired features (BIF) guo2009gender; han2015demographic, and histogram of oriented gradients (HOG) guo2009gender. Such methods could lead to a high dimension of extracted features, thwarting practical applications by increasing runtime and memory consumption. When "", for which the dimension of the feature space exceeds the sample size of the dataset, a fundamental assumption of many standard statistical procedures is violated. Additionally, collinearity of features can cause numerical problems, while noisy features can obscure true relationships with the response variable and hinder predictive performance. These significant issues motivate the use of DR techniques. The fundamental goal of DR is to extract and retain information in a lower dimensional space. Many of these methods fall under manifold learning, identifying a low-dimensional manifold embedded in a high-dimensional ambient space ma2011manifold.

Even though PCA and LDA have been widely considered as popular and effective approaches for DR in machine learning, their kernel versions are much less investigated. To our best knowledge, KPCA, KLDA, and SKPCA have never before been directly compared on visualization and classification performance through simulations and practical applications to face image analysis problems.

Our main contributions in this study can be summarized as follows. (1) The nonlinear manifold learning projections for KPCA, KLDA, and SKPCA are directly compared with visualization through simulated datasets. (2) Motivated by the nonlinear nature of soft-biometric analysis problems, we utilize KPCA, KLDA, and SKPCA for dimension reduction on four types of appearance-based extracted features (BIF, HOG, LBP, and AAM) for the gender classification task. Moreover, the classification performance is compared systematically on parameter optimization. (3) For applications to practical large-scale systems, we propose an additional parallel computational framework that can decrease runtime while maintaining similar classification rates.

The remainder of the paper is structured as follows. In Section 3, we review the theory of KPCA, SKPCA, and KLDA. In Section 4, we conduct simulation studies to visualize projections. We propose our machine learning methods for gender classification on Morph-II in Section 5. The comparative performance of KPCA, SKPCA, and KLDA on Morph-II is presented and discussed in Section 6. The performance of these DR methods is further compared in Section 7 through application to the FG-NET dataset. The computational framework for large-scale practical systems is proposed in Section 8 and investigated on Morph-II. Finally, we conclude and offer future directions of research in Section 9.

3 Kernel-Based Dimension Reduction Methods

The nonlinearity in a classification problem can often be addressed by kernel-based DR methods, with the appropriate choice of kernels. The driving reasons are the nonlinearity of chosen kernels, flexibility of tuning parameter selection, and most importantly, the kernel tricks. Mercer’s theorem guarantees that a symmetric positive-definite function can be written as the sum of a convergent sequence of product functions, which potentially project the data into infinite-dimensional space scholkopf2001generalized. Thus, it is feasible to separate the data in the new space. On the other hand, Representer Theorem shows that the solution for certain kernel methods lies in the finite-dimensional span of the training data wahba1990spline; scholkopf2001generalized. This is very helpful, since we do not need to compute the coordinates of the projected data in the infinite-dimensional space, but only the inner products between all pairs of data in the feature space.

3.1 Notations

With the goal of emphasizing the connections between KPCA, SKPCA, and KLDA, we define the following notations for classification problems.

Let be the feature space, a non-empty subset in with as the number of covariates for each subject. Let be the space for the response variable, a subset in . Let be a series of independent observations following a joint probability measure . Let denote the outcomes of the response variable. Let be an feature matrix, with as the -th row for , and for as its -th column. Thus, the matrix can be written as:

Without loss of generality, we may assume that each column of the matrix is normalized, such that the mean of is 0 and standard deviation is 1, for .

Let be the sample covariance matrix of . We then have


Let be a reproducing kernel Hilbert space on from a kernel function , which is a Mercer kernel (symmetric and positive-definite), and be a reproducing kernel Hilbert space on from a kernel function .

For the kernel , its associated space may be infinite-dimensional, but with some additional conditions, the minimizer of a regularized risk function lies in the finite span of the training observations scholkopf2001generalized. Additionally, it has been shown scholkopf2001generalized that there exists a function


such that for all ,


where is the dot product. Let be a matrix such that its -th element is . We then have


where . Here, the kernel matrix is the Gram matrix of the ’s.

3.2 Principal Component Analysis and Kernel Principal Component Analysis

In standard PCA, we seek an orthogonal transformation matrix satisfying


where for some , such that each column vector successively inherits maximal proportion of variance from the column vectors ’s, while ensuring the projection directions are perpendicular. The solutions can be expressed as the eigenvalue problem


where is the -th column of , for .

Following the work of scholkopf1998nonlinear, PCA can be extended to KPCA by first choosing a Mercer kernel , with which is transformed to . This maps the features in to . Assume that is a vector with 0 in each entry. With the Gram matrix as defined in (4) and through the kernel trick from (3), we have the eigenvalue problem


where is the desired dimension and are the eigenvectors of , with associated eigenvalues . Hence, the advantage of the kernel-based approach is to calculate the Gram matrix without an explicit expression for . Without the centralization assumption on , the matrix in (7) can be replaced by


where is the desired dimension, , is an identity matrix with dimension , and is a matrix of 1’s with dimension .

We note that is idempotent, since it is a square matrix satisfying . For any square matrix with dimension , the average of each column of the matrix is 0, as is the average of each row of the matrix . Thus, the matrix is the "centralized" version of the original matrix.

3.3 Supervised Kernel Principal Component Analysis

PCA and KPCA are unsupervised methods, since they do not consider the response variable, only considering directions of maximum variability in the covariates. If the goal is classification, this may not be ideal, since the principal components may be unrelated to the class difference. SKPCA is a supervised generalization of KPCA, which aims to find the principal components with maximal dependence on the response variable. Drawing from barshan2011supervised and wang2015modified, we formulate SKPCA as follows.

In SKPCA, class information is incorporated by maximizing the Hilbert Schmidt independence criterion (HSIC) gretton2005measuring. With the aforementioned reproducing kernel Hilbert spaces on and on and related kernel functions and respectively, the HSIC can be expressed as


where represents the expectation on independent pairs of and (with respect to ) and and alike are the expectations based on various marginal distributions from .

With the results from gretton2005measuring, an empirical estimator of (9) is


where and are defined as before for KPCA and is a link matrix with dimension , where is an indicator function with value 1 if the event is true and 0 otherwise.

Similarly as for KPCA, and can be adjusted to satisfy the centralization assumption. As discussed previously, is an idempotent matrix. Therefore, following from (10),


where and are the "centralized" versions of the and matrices respectively.

On another note, in the binary gender classification problem, and wang2015modified. Therefore, we modify the link matrix according to wang2015modified by


It can be shown that maximization of (10) is equivalent to solving the generalized eigenvalue problem


where and each is an eigenvector with related eigenvalue for , where is the desired dimension wang2015modified. Therefore, the main advantage of the link matrix in (12) becomes apparent: the rank of may increase, permitting more eigenvalues to be computed.

3.4 Linear Discriminant Analysis and Kernel Linear Discriminant Analysis

Given a dataset with finite classes, LDA aims to find the best set of features to discriminate among the classes. We first review standard LDA, then generalize to KLDA. We note that sometimes parametric assumptions for LDA are made, such as that observations from each class are normally distributed with common covariance. Here, we make no such assumptions. Suppose that each observation for belongs to exactly one of classes. Define the following feature vectors: as the overall mean and as the mean of the -th class with the size of the -th class in the sample, for .

In standard LDA, we seek to maximize the objective function


where is a x vector, is the between-class scatter matrix, and is the within-class scatter matrix defined by


Hence, maximizing involves finding some rotation of the scatter matrices such that the "distance" between the groups is maximized relative to the variations within each group.

Maximization of in (14) is equivalent to solving the generalized eigenvalue problem


where each is an eigenvector with corresponding eigenvalue , for , where is the desired dimension.

LDA is generalized to KLDA using the kernel representation from (3). Analogously to LDA above, we seek a solution that will result in the maximization of the objective function


where now and and are the between-class and within-class scatter matrices in defined by


The above expressions involve knowledge of , which is often not available. It can be shown that equation (17) is equivalent to




is a matrix of dimension with rows being features from the -th class, and is an matrix with its -th element as . A full discussion of KLDA can be found in mika1999fisher.

Maximization of in equation (19) is equivalent to solving the generalized eigenvalue problem


where each is an eigenvector with associated eigenvalue , for with as the desired dimension.

Comparing the generalized eigenvalue problems in (16) and (21), the structures of matrices and are similar, since both "measure" the variation between different classes.

Let , a matrix of dimension . Due to the centralization function of , has row-sum equal to zero for every row. Besides, . For the matrix , due to the idempotent property of ,


Thus, the matrix has an identical structure to the and matrices, which "measure" the overall variation within groups.

4 Visualization on Simulation Studies

To visualize and improve understanding of the manifold learning methods KPCA, SKPCA, and KLDA, we apply them in three simulation studies. For comparison, the linear techniques PCA and LDA are also considered. Each dataset contains nonlinear patterns, and the goal is to transform the data to be linearly separable. For this reason, the radial basis function (RBF)


is chosen as a kernel for each pair of observed vectors . For each DR method, a range of values for the tuning parameter are tested and selected to visually separate the classes. A full discussion of the RBF kernel, among others, can be found in steinwart2008support. Figures 1, 2, and 3 compare the original data to 2-dimensional projections from each DR method. In each plot, color corresponds to the true class to which an observation belongs.

(a) Original Data (b) KLDA Projections in 2D (c) KPCA Projections in 2D
(d) SKPCA Projections in 2D (e) PCA Projections in 2D (f) LDA Projections in 2D
Figure 1: Wine Chocolate Simulation Study.

In the first simulation study, the original data are plotted in 3D in Figure 1(a); the green sphere is embedded within the magenta group, necessitating nonlinear manifold learning. The KLDA projections in (b) are linearly separable with very good variation between the classes and a fair amount of variation within the classes. KPCA and SKPCA projections in (c) and (d) are at least approximately linearly separable, as it is not clear whether there is a linear boundary that perfectly separates the two classes. In (e), PCA fails to linearly separate the groups, rotating the wine chocolate in 2D. The maximum dimension LDA can retain is ; with 2 classes, the projections must be plotted on a 1D number line, given in (f). Points from the two classes overlap considerably in plots (e) and (f).

(a) Original Data (b) KLDA Projections in 2D (c) KPCA Projections in 2D
(d) SKPCA Projections in 2D (e) PCA Projections in 2D (f) LDA Projections in 2D
Figure 2: Apple Tart Simulation Study.

In the second simulation study, the original data in Figure 2(a) follow a nonlinear pattern. In (b), KLDA produces groups which are linearly separable. The KPCA projections are approximately linearly separable in (c); however, there is some overlap between groups, especially the green and pink groups in the third quadrant. In (d), SKPCA produces almost linearly separable groups. In plots (e) and (f), PCA and LDA simply rotate the original data in 2D space, as expected.

(a) Original Data (b) KLDA Projections in 2D (c) KPCA Projections in 2D
(d) SKPCA Projections in 2D (e) PCA Projections in 2D (f) LDA Projections in 2D
Figure 3: Swiss Roll Simulation Study.

For the third simulation study, the original data in Figure 3(a) are in 3D and follow a swirling, nonlinear pattern. In (b), KLDA yields favorable results; the groups are well-separated linearly. KPCA and SKPCA in (c) and (d) also produce good results, although in (c) more separation between the purple and bright green groups would be ideal. In (e) and (f), respectively, PCA and LDA merely rotate the original data projected in 2D space; there is no linear separation between the magenta and purple groups, nor between the two green groups.

For all three simulation studies, KLDA, KPCA, and SKPCA are effective to transform the data into linearly separable groups. In all cases, the projected data become approximately linearly separable after applying KLDA, KPCA, or SKPCA. In general, KLDA and SKPCA perform the best here. Their success over KPCA is expected, since KLDA and SKPCA are supervised techniques. On the other hand, results indicate that KPCA and SKPCA are more sensitive than KLDA to different choices of tuning parameters. Hence, SKPCA and KPCA may perform better for alternative choices of parameters. In all our studies, the nonlinear techniques outperform linear PCA and LDA. These preliminary studies suggest the radial kernel is appropriate for our face analysis experiments.

5 Kernel-based Dimension Reduction Optimization and Classification on Morph-II

Motivated by the nonlinear nature of facial demographic analysis, we propose and implement a novel machine learning process for the Morph-II dataset. We consider the kernel-based DR methods KPCA, SKPCA, and KLDA on three types of appearance-based extracted features (LBP, BIF, and HOG) for the gender classification task. We illustrate parameter optimization and compare the performance of these methods on Morph-II.

5.1 Longitudinal Face Database

MORPH ricanek2006morph is one of the most popular face databases available to the public, especially for age estimation, race classification, and gender classification. Multiple versions of MORPH have been released, and the version adopted in this work is the 2008 MORPH-II non-commercial release (referred to as Morph-II in this paper). Morph-II includes over 55,000 mugshots with longitudinal spans and metadata such as date of birth, race, gender, and age.

In addition to its size, Morph-II presents challenges because of disproportionate race and gender ratios. About 84.6% of images are of males, while only about 15.4% of images are of females. Imbalanced classes are known to negatively affect certain classification algorithms japkowicz2002class. Moreover, Morph-II is skewed in terms of race, with approximately 77.2% of images picturing black subjects. Guo et al. found that age, gender, and race interact for demographic analysis tasks including gender classification, race classification, and age prediction guo2009gender; guo2010study; guo2010human, so both race and gender imbalance in Morph-II can hamper gender classification.

5.2 Subsetting Scheme

To overcome the uneven race and gender distributions in Morph-II, Guo et al. proposed a subsetting scheme guo2010human. Since then, many studies on Morph-II have adopted such an evaluation protocol. Based on discussions in Guo et al. guo2010human, a new automatic subsetting scheme is proposed in Yip2018preliminary, aiming to automatically ensure independent training and testing sets. Additionally, inconsistencies in age, gender, and race in Morph-II have been identified and corrected in Yip2018preliminary. After following the steps to clean MORPH-II outlined in Yip2018preliminary, we apply the automatic subsetting scheme, summarized in Figure 4 and described below.

Let be the Whole Morph-II dataset, the selected training/testing set, and the remaining set. We further divide into even subsets and . Separately within each subset and , we fix the ratios of white (W) to black (B) images as 1:1 and male (M) to female (F) images as 3:1. Further, and have been selected such that the age distributions within each set are similar (details shown in Yip2018preliminary). The gender and race summaries for the subsetting scheme are shown in Table 2. Most importantly, the sets and are independent; no sets share images from the same subject. We use as an alternating training and testing set. First, we train on and test on , then we train on and test on . The final classification accuracy is obtained by averaging the classification accuracies from the alternations.

Figure 4: Flowchart representing our subsetting scheme Yip2018preliminary for MORPH-II, which improves the one from guo2010human.
WF BF WM BM dF dM Overall F M
S1 1,285 1,285 3855 3,855 0 0 10,280 2570 7,710
S2 1,285 1,285 3,855 3,855 0 0 10,280 2,570 7,710
R 0 3,150 220 28,980 144 1,850 34,344 3,294 31,050
Overall 2,570 5,720 7,930 36,690 144 1,850 54,904 8,434 46,470

Race-gender combinations are abbreviated, e.g., BF represents the black female group. Abbreviations dF and dM represent those who are neither black nor white in race.

Table 2: Number of Images in Subsets by Race and Gender

5.3 Facial Feature Extraction

In computer vision, image preprocessing is often an essential first step to reduce unnecessary variation, decrease pixel dimension, and simplify pixel encoding. Despite the standard format of police photography in mugshots, Morph-II photographs vary in head-tilt, camera distance, occlusion, and illumination. We address this variation as follows. Images are first converted to grayscale. Next, faces are automatically detected, eliminating the background and hair, so that no external cues can be used to classify gender. The resulting images are centered and scaled with respect to the center of the irises. Finally, the images are cropped to be 70 pixels tall by 60 pixels wide. Full methodological details are given in Yip2018preprocess and align with standard preprocessing protocols from face analysis.

After preprocessing, pixel-related features are extracted from the face images in Morph-II. As discussed previously, there are numerous approaches for feature extraction. In this study on Morph-II, we incorporate domain expertise by choosing three well-established appearance-based models from image analysis: local texture techniques such as local binary patterns (LBP) yang2007demographic; lian2006multi; makinen2008experimental; alexandre2010gender, biologically-inspired features (BIF) guo2009gender; han2015demographic, and histogram of oriented gradients (HOG) guo2009gender. Additionally, these model-based approaches provide "robust interpretation by constraining solutions to be face-like" edwards1998interpreting. Detailed documentation of our feature extraction process can be found in feature; Yip2018preprocess.

Dimension Reduction
Classifier Linear SVM
Table 3: Parameter Summary

5.4 Kernel-Based Dimension Reduction Optimization

Tuning parameter selection is essential for kernel-based methods in order to achieve good results. Within the framework of feature extraction, dimension reduction, and the classification model, there are many combinations of parameters to be considered. The main parameters and tested values are summarized in Table 3 and discussed as follows. LBP features have two main parameters: block size and window radius . For HOG, the two main parameters are block size and number of orientations . For BIF, we consider the block size and the parameter , which represents the spatial aspect ratio; there is also a choice of pooling operation, which we select here as the standard deviation operation.

For each dimension reduction method, the radial kernel


is used for each pair of observation vectors , based on the results from our simulation studies. In the kernel, we must select the tuning parameter , which scales the extent of similarity between pairs of vectors. This parameter must be chosen with particular care, since a poor choice can result in transformed features with little to no variability. Empirically, we observed that SKPCA was more sensitive than KLDA and KPCA to the choice of ; values of at or above resulted in a rank deficient matrix and failure to compute all requested eigenvalues. For SKPCA, we consider an additional scaling parameter in the modified link function proposed by Wang et al. wang2015modified:


for all observed responses in the same class. The scale parameter enables the weighing of dependence between the covariates and response.

Finally, we choose a linear SVM to classify gender based on the dimension-reduced, transformed features. The motivation for this classifier is discussed in the next section. The main parameter for linear SVM is the cost , which measures the extent to which misclassification in training will be permitted. We consider values of from to .

Method Feature Parameters Accuracy
Table 4: Tuning Results on a Subset of MORPH-II

We tune on small subsets of Morph-II to reduce runtime, memory consumption, and risk of over-fitting. images from and images from are randomly selected. The standard method of grid search is followed for tuning on these subsets. For each combination of parameters, a model is trained on the subset from and then tested on the subset from . For each dimension reduction method paired with each feature type (BIF, HOG, and LBP), the best three or four accuracy rates from tuning are obtained. (Except in the case of ties, we choose only the top three accuracy rates.) The tuning results for these top-performing parameters are given in Table 4. The parameters corresponding to these maximum accuracy rates are applied to the full dataset through the previously discussed evaluation protocol. Although this protocol involves testing on images from and , any overlap of images is minor (in each testing set, less than of images have been used in tuning) and has little impact on the reported accuracy (discussed in Section 6). Regardless, the tuning parameters are selected through the same procedure, so the classification performances can be fairly compared among all considered DR methods.


width=0.99 Method Feature Parameters Acc(1) TPR(2) TNR(3) Mem(4) Time(5) KPCA BIF 0.9296 0.9473 0.8127 34.04 42.26 BIF 0.9297 0.9455 0.8112 34.68 36.94 BIF 0.9071 0.9377 0.7050 31.74 33.83 BIF 0.9096 0.9374 0.7266 31.80 35.97 HOG 0.9391 0.9726 0.7172 34.00 31.54 HOG 0.9391 0.9727 0.7170 34.00 30.86 HOG 0.9391 0.9724 0.7192 34.00 32.17 HOG 0.9364 0.9626 0.7634 34.35 31.41 LBP 0.9391 0.9726 0.7172 34.00 31.54 LBP 0.9391 0.9726 0.7172 34.00 30.86 LBP 0.9391 0.9724 0.7192 34.00 32.17 LBP 0.9364 0.9626 0.7634 34.35 31.41 SKPCA BIF 0.9507 0.9616 0.8781 35.48 42.04 BIF 0.9532 0.9639 0.8823 33.04 38.34 BIF 0.9260 0.9477 0.7827 20.03 34.58 HOG 0.9467 0.9645 0.8292 36.69 37.39 HOG 0.9489 0.9786 0.7528 38.28 53.96 HOG 0.9488 0.9786 0.7517 39.83 60.55 LBP 0.9585 0.9727 0.8641 28.68 25.33 LBP 0.9585 0.9764 0.8642 23.22 38.42 LBP 0.9585 0.9730 0.8640 29.87 28.00 LBP 0.9585 0.9727 0.8640 27.92 22.83 KLDA BIF 0.9415 0.9539 0.8594 24.89 34.50 BIF 0.9426 0.9558 0.8858 24.74 35.46 BIF 0.9131 0.9374 0.7532 22.80 26.78 BIF 0.9205 0.9421 0.7783 22.83 33.88 HOG 0.9369 0.9517 0.8392 36.52 81.71 HOG 0.9398 0.9545 0.8425 52.48 148.24 HOG 0.9175 0.9421 0.7542 21.18 21.57 LBP 0.9418 0.9578 0.8428 24.58 37.17 LBP 0.9417 0.9558 0.8486 24.70 36.45 LBP 0.9392 0.9543 0.8397 20.77 31.12 (1) Acc represents mean accuracy.
(2) TPR represents mean true positive rate (recall/sensitivity): the proportion of male faces correctly classified.
(3) TNR represents mean true negative rate (specificity): the proportion of female faces correctly classified.
(4) Mem represents mean memory in gigabytes.
(5) Time represents mean runtime in hours for training and testing.

Table 5: Gender Classification Results on MORPH-II

5.5 Gender Classification

For the classification part of the modeling, linear SVM is adopted. Many face analysis studies have involved SVM, as summarized in byun2002applications. Briefly, SVM identifies a separating hyperplane with maximal margin between the classes. Several popular kernels for SVM include linear, polynomial, and RBF steinwart2008support. We select the linear kernel, because directions of variability in the data are expected to be linear after the nonlinear transformations of KPCA, SKPCA, or KLDA. Indeed, Schölkopf et al. observed this to be true for KPCA in their landmark study scholkopf1998nonlinear. The linear kernel for SVM also reduces computational cost, compared to nonlinear kernels.

With the parameters in Table 4 that are selected from tuning on subsets, we implement dimension reduction and classification on the full Morph-II dataset, following the subsetting scheme discussed in Section 5.2. The challenges of the large size of Morph-II, the high dimensionality of the features, and the computational complexity of these dimension reduction methods necessitate the use of high-performance computing (HPC). For example, the kernel matrix for each dimension reduction method is , requiring approximately 23 gigabytes of storage. Thus, we implement the process on the HiPerGator 2.0 supercomputing cluster at the University of Florida. The code is written in R. The R package rARPACK is used to optimize the solving of eigenvalue problems qiu2016package, and the e1071 package is utilized for training and testing the SVM model dimitriadou2005misc.

6 Experiment Results

The kernel-based DR methods KPCA, SKPCA, and KLDA are applied to three facial feature extraction methods: BIF, HOG, and LBP. The DR methods transform the feature data, then reduce the dimension. In all cases, a dimension of is retained, substantially lower than the dimension of the original feature space. The dimensionality of 100 is selected as a trade-off between computation time and classification accuracy based on our preliminary studies. The transformed and dimension-reduced data serve as input for the linear SVM, which classifies each image subject as male or female. Additionally, these predicted gender classes are mapped to probabilities through a sigmoid function, following platt1999probabilistic. This process is applied to each alternation of the evaluation protocol: 1) train on , test on and 2) train on , test on . The classification results are averaged over these two testing sets. The mean classification accuracy over the testing images is chosen as the evaluation criterion for our methods on Morph-II, as it is the usual performance metric for gender classification guo2009gender, especially in similar studies guo2011simultaneous; guo2014framework; yi2014age; yang2013automatic.

These mean classification results from Morph-II are shown in Table 5. In addition to the accuracy, the true positive rate (also known as sensitivity or recall) and true negative rate (also called specificity) are given. For this study, we define the true positive rate (TPR) as the proportion of male faces correctly classified, while the true negative rate (TNR) as the proportion of female faces correctly classified. The memory and runtime are also listed in Table 5. The runtime is the total time for training and testing on HPC, i.e., the average of time1 (train on , test on ) and time2 (train on , test on ). As mentioned in Section 5.4, there is a small overlap between the tuning and testing sets that could contribute to over-fitting. We have assessed the potential impact of over-fitting on our reported accuracy rates and found it to be very small: it is estimated to be (at most) between 0.09% and 0.2% and to monotonically decrease as reported accuracy rates increase.

Figure 5: Receiver operating characteristic (ROC) curve and area under the curve (AUC) are compared by method for gender classification on Morph-II. Each color corresponds to a DR method paired with feature type. For each probability threshold, the true and false positive rates are reported as the averages from the testing sets of the alternating evaluation protocol.

The classification performance is further visualized in Figure 5 through receiver operating characteristic (ROC) curves for the nine combinations of DR method and feature extraction type. For each combination, its displayed curve corresponds to the "best" results from Table 5 (the combination of parameters reaching maximum mean classification accuracy or maximum mean true positive rate in the event of ties). For each alternation of the evaluation protocol, the true and false positive rates in testing are calculated for each probability threshold. To construct the ROC curves, each of the resulting rates for each threshold is averaged over the testing sets.

Table 5 shows that for the feature BIF, SKPCA and KLDA outperform KPCA. For the feature HOG, SKPCA achieves higher accuracy than both KPCA and KLDA, while the latter two techniques perform very similarly. Last, for the feature LBP, SKPCA produces better classification accuracy than KPCA and KLDA. In summary, our experiment’s results indicate that SKPCA outperforms KLDA consistently, while KLDA outperforms KPCA for all three features BIF, LBP, and HOG. On the other hand, for KPCA, the features HOG and LBP produce approximately the same accuracies, outperforming BIF. For SKPCA, LBP achieves slightly better results than BIF, while LBP and BIF both outperform HOG. Finally, for KLDA, BIF reaches slightly higher accuracy than LBP, while BIF and LBP both exceed HOG.

In most cases, the accuracy (in Table 5) and AUC (in Figure 5) metrics agree on the best methods. An exception is that SKPCA with the HOG features achieves slightly higher accuracy () than KLDA with the BIF features (), but SKPCA with HOG has lower AUC than KLDA with BIF. The other exception is that KPCA with the HOG features has the lowest AUC of the nine methods, but its accuracy is higher than KPCA with the BIF features. In summary, the accuracy and AUC results imply that SKPCA generally performs best for gender classification on Morph-II, while KLDA tends to outperform KPCA. Meanwhile, the LBP and BIF features often yield better classification performance, with less memory usage, than the HOG features.

It is interesting that, overall, LBP achieves even slightly better performance than BIF for the dimension reduction method SKPCA on the task of gender classification, since BIF is popular in demographic analysis such as age estimation, gender classification, and race classification guo2009gender; guo2010study; guo2010human; guo2011simultaneous; guo2014framework. Another interesting fact is displayed by the results of the true positive and negative rates in Table 5: males have a higher probability of correct identification than females, with the biggest margin exceeding 20%. Our finding is consistent with han2015demographic: females are more challenging to correctly classify than males, both for automatic approaches and human perception. Similarly, for race classification on Morph-II, Guo and Mu found in guo2010study that training a model on female faces (and testing on male faces) contributed to significantly more errors on average than training on male faces (and testing on female faces), even when controlling for differences in the training sample sizes. Our results also indicate that, overall, HOG and LBP outperform BIF for males, while BIF works consistently better than LBP and HOG for females.

Next, in Table 6 we compare our results to studies using similar methods on Morph-II, as well as recent state-of-the-art works with deep learning on MORPH-II. With the exception of han2015demographic, all studies’ results in the table are mean testing classification accuracy from an alternating evaluation protocol based on Guo et al guo2010human. Hence, our results can be directly compared to these studies. With LBP features, SKPCA, and a linear support vector machine (SVM), our gender classification accuracies approximate 96%, competitive with benchmark results. Interestingly, several reported accuracy rates from human observers of gender range from 96% burton1993s to 96.9% han2015demographic. The similarity in recognition rates between our methods and human observers can further validate the success of our approach.

Method Accuracy Reference Year
BIF+OLPP 98% guo2011simultaneous 2011
BIF+PLS 97.34% guo2011simultaneous 2011
BIF+KPLS 98.2% guo2011simultaneous 2011
BIF+CCA 95.2% guo2014framework 2014
BIF+KCCA 98.4% guo2014framework 2014
BIF+rCCA 97.6% guo2014framework 2014
Multi-scale CNN 97.9% yi2014age 2014
Ranking CNN 97.9% yang2013automatic 2015
BIF+Hierarchical-SVM 97.6% han2015demographic 2015
Human Estimators 96.9% han2015demographic 2015
LBP+SKPCA+L-SVM 95.85% This work 2019
Table 6: Comparison Results for Gender Classification on MORPH-II

7 Kernel-based Dimension Reduction Optimization and Classification on FG-NET

For further comparison between KPCA, SKPCA, and KLDA, we apply a modification of our approach from Section 5 to a smaller face dataset, the face and gesture recognition network (FG-NET). FG-NET is a popular, publicly available database used for age estimation, gender classification, face recognition, and other demographic analysis tasks panis2016overview. It contains 1002 images from 82 subjects: 47 males and 35 females with ages varying from 0 to 69 years panis2016overview.

For each image, 109 features are extracted using the Active Appearance Model (AAM), a commonly adopted appearance-based approach that models the shape and texture of the face edwards1998interpreting; cootes1998active. As in Section 5.4, the radial kernel defined in equation (24) is chosen for each of the DR methods KPCA, SKPCA, and KLDA. Additionally, the modified link function from equation (25) is applied in the SKPCA algorithm. Thus, the tuning parameter in the radial kernel and in the modified link function must be selected. As in our experiments on Morph-II, linear SVM is chosen as the classifier for FG-NET. On Morph-II, values of the cost parameter ranging from to were tested. On FG-NET, we have observed convergence issues in the SVM algorithm for values of exceeding 10, so only the values are tested. The considered tuning parameters are summarized in Table 7.

Dimension Reduction
Classifier Linear SVM
Table 7: Parameter Summary for FG-NET

For cross-validation, we use leave-one-person-out (LOPO), the most well-accepted scheme for FG-NET panis2016overview. LOPO is a variation of -fold cross-validation that produces independent training and testing folds in longitudinal datasets. The number of folds is set equal to the number of subjects in the dataset, so here. For , testing fold contains only images of person , while training fold contains all remaining images. Similarly to on Morph-II, we choose the mean classification accuracy over the testing folds to be the evaluation criterion.

Method Parameters Acc(1) TPR(2) TNR(3)
0.7025 0.7325 0.6621
0.6932 0.7233 0.6528
0.6801 0.6651 0.7001
0.7154 0.7542 0.6633
0.6933 0.7413 0.6289
0.6893 0.7701 0.5809
0.7225 0.7593 0.6730
0.7176 0.7810 0.6324
0.7131 0.7431 0.6727
(1) Acc represents mean accuracy. (2) TPR represents mean true positive rate (recall/sensitivity): the proportion of male faces correctly classified. (3) TNR represents mean true negative rate (specificity): the proportion of female faces correctly classified.
Table 8: Gender Classification Results on FG-NET
Figure 6: Receiver operating characteristic (ROC) curve and area under the curve (AUC) are compared by method for gender classification on FG-NET. Each color corresponds to a DR method.

For each fold, we transform and reduce the dimension of the features through each DR method. In all cases, a dimension of 100 is retained to facilitate comparison with the results on Morph-II. The transformed, dimension-reduced features then predict the gender of the testing fold’s images through a linear SVM. The predicted classes from SVM are also mapped to probabilities through platt1999probabilistic, similarly as in Section 6. The gender classification accuracy is calculated for the testing fold. Finally, all such testing classification accuracies are averaged to compute the mean classification accuracy from testing; the testing probabilities are used to form ROC curves.

The optimum gender classification results on FG-NET are presented in Table 8. The maximum classification accuracy of about 72.25% is achieved by KLDA. For other choices of parameters, KLDA reaches above 71% accuracy, which is close to the maximum accuracy attained by SKPCA. Meanwhile, the peak accuracy reached by KPCA is 70.25%. In general here, KLDA is observed to outperform SKPCA and KPCA, while SKPCA tends to surpass KPCA. In most cases, the probability of correctly classifying males (sensitivity/true positive rate) is higher than the probability of correctly classifying females (specificity/true negative rate). For each DR method, an ROC curve (corresponding to the results from Table 8 with maximal mean classification accuracy) is displayed in Figure 6. The area under the curve (AUC) is highest for KLDA, followed by KPCA then SKPCA.

Overall, the gender classification results on Morph-II are stronger than on FG-NET. Lower accuracy on FG-NET could be caused by the greater number of minors (aged 0-18), who have been more difficult to classify than adults in some studies wang2015modified; wang2010gender. Additionally, there are substantially fewer faces for training in FG-NET versus Morph-II (under 1000 versus 10280 images). Another contributor could be the choice of features and its dimension; the AAM features have dimension 109 on FG-NET, while the HOG, LBP, and BIF features have dimensions ranging from 500 to thousands on Morph-II. SKPCA reaches peak performance on Morph-II, while KLDA attains optimal results on FG-NET. However, the results on Morph-II and FG-NET are similar in that the supervised methods KLDA and SKPCA outperform the unsupervised method KPCA for gender classification. Further, both datasets evidence that female faces are more challenging to classify than male faces.

8 Computational Framework for Practical Systems

To tackle the challenges of high dimensionality and intensive computation for large-scale databases (like Morph-II, as shown in the Time column of Table 5) in real-world applications, we propose a computational framework to substantially decrease runtime.

Our approach involves parallel computing, the bootstrap resampling method, and ensemble learning. Let denote the main training set and the testing set. If is very large, we can save some time by drawing bootstrapped samples from . Let denote the th bootstrapped sample from . Send to a core (or processor), Core . Train the model on . Test on the full testing set , obtaining a set of gender predictions corresponding to Core and . Repeat this process for all bootstrapped samples and corresponding cores . The final predictions are obtained by taking the majority rule of the predictions from all cores and samples. Hence, the results from this scheme approximate the results from the full Morph-II. This framework is summarized in Figure 7.

Figure 7: Flowchart representing the parallel computational framework for practical systems proposed for Morph-II and other large datasets.

To explore the effectiveness, this framework is applied to Morph-II with a selection of BIF, LBP, and HOG features as preliminary studies. This experiment is implemented through the HiPerGator 2.0 supercomputer at University of Florida with five cores per combination of feature and dimension reduction method. Following the subsetting scheme discussed in Section 5.2, for simplicity, we consider only the case of bootstrapping image samples from for training, while each image from is used for testing.

Method Feature Accuracy Memory (gb) Time (min)
0.9330 27.59 90
0.9178 29.77 101
0.8927 25.85 37
0.9417 53.28 89
0.9056 51.43 74
0.9274 20.33 24
0.9416 30.99 100
0.9133 25.42 102
0.9118 17.05 26
Table 9: Classification Results Based on Bootstrapping

We evaluate this framework by comparing the approximated results in Table 9 to the results from Table 5. For each combination of feature and dimension reduction method, each of the five cores independently trains a bootstrapped sample of images from and tests on . Then the gender predictions over all five cores are compared with a simple majority rule; e.g., if an image is predicted male for three images and female for two images, the final gender prediction is male. The times in Table 9 are the total runtimes for this process, which include training and testing on HPC. Therefore, the times and memory can be compared between Tables 5 and 9. A distinction is that in Table 5, results are averaged for the alternating scheme, while in Table 9, the results are only from when is used for training and for testing.

It is shown in Table 9 that, in many cases, the accuracy rates from the approximations are similar to those from the main approach in Table 5. This is a very good result, especially considering that the bootstrapping approach uses no more than images total for training, while the main approach used all images for training. This finding suggests that our methods may perform reasonably well on Morph-II with smaller training sets. The most substantial difference between the bootstrapped approach and the main approach is in the runtime. For all combinations of features and dimension reduction methods, the bootstrapping approach has decreased the runtime to under two hours. Meanwhile, the main approach in Table 5 yields runtimes exceeding 20 hours. Hence, our preliminary results indicate the parallel approximation approach can attain similar accuracy rates to the main approach, while substantially saving time. Such a result is promising for practical gender classification systems, where gender predictions must be made in real-time.

9 Conclusion

We have performed a comparative study of the nonlinear dimension reduction methods KPCA, SKPCA, and KLDA. These kernel-based methods are first applied to three simulated datasets for visualization and comparison. SKPCA and KLDA outperform KPCA, reinforcing the need for supervised approaches in classification tasks. The radial kernel performed well, encouraging its use for face analysis.

Next, we have proposed and evaluated a new machine learning process for Morph-II. First, we use a novel subsetting scheme that reduces class imbalances while establishing independence between training and testing sets. Then we preprocess Morph-II photographs and extract three appearance-based features: HOG, LBP, and BIF. We transform and reduce the dimension of these features through KPCA, SKPCA, and KLDA. Linear SVM classifies the gender of Morph-II subjects, reaching accuracy rates of 95%. With promising preliminary results on Morph-II, a practical computational framework is offered that reduces runtime through parallelization and approximation.

The performance of the dimension reduction methods are further compared through an application to the FG-NET dataset. Images are represented through the appearance-based AAM features; transformed and reduced in dimension through KPCA, SKPCA, and KLDA; and classified as containing a male or female subject through linear SVM. While SKPCA performed optimally on Morph-II, KLDA reached top performance on FG-NET with 72% leave-one-person-out (LOPO) accuracy.

Further directions of research involve automatic tuning parameter selection, reduction of computational cost, and application to other face analysis tasks. Our approach could yield improved results with better choices of parameters, but it is impossible to anticipate and try all combinations. Automatic parameter selection for kernels could help identify a good set of parameters more easily. Perhaps the most important future direction of research on Morph-II is to reduce computational cost. For many practical demographic analysis systems, predictions must be made in real-time. For our gender classification methods, our parallel approximation approach substantially reduced runtime while attaining similar accuracy rates to the main approach. Such computational strategies should be further investigated to help bring gender classification and other face analysis tasks to practical implementation. Finally, our machine learning pipeline for Morph-II could be generalized to race classification or even age estimation.

10 Acknowledgments

This material is based in part upon work supported by the National Science Foundation under Grant Numbers DMS-1659288. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank the reviewers for the helpful comments that significantly improves the presentation of the paper.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description