A Comparison Study on Nonlinear Dimension Reduction Methods with Kernel Variations: Visualization, Optimization and Classification

Katherine C. Kempfert, Yishi Wang*, Cuixian Chen, and Samuel W.K. Wong

*Correspondence: wangy@uncw.edu; Tel.: +1-910-962-3292
Current address: The University of Florida, Gainesville, FL 32611

Abstract

Because of high dimensionality, correlation among covariates, and noise contained in data, dimension reduction (DR) techniques are often employed before machine learning algorithms are applied. Principal Component Analysis (PCA), Linear Discriminant Analysis (LDA), and their kernel variants (KPCA, KLDA) are among the most popular DR methods. Recently, Supervised Kernel Principal Component Analysis (SKPCA) has been shown to be another successful alternative. In this paper, brief reviews of these popular techniques are presented first. We then conduct a comparative performance study based on three simulated datasets, after which the performance of the techniques is evaluated through application to a pattern recognition problem in face image analysis. The gender classification problem is considered on MORPH-II and FG-NET, two popular longitudinal face aging databases. Several feature extraction methods are used, including biologically-inspired features (BIF), local binary patterns (LBP), histogram of oriented gradients (HOG), and the Active Appearance Model (AAM). After application of DR methods, a linear support vector machine (SVM) is deployed, with gender classification accuracy rates exceeding 95% on MORPH-II, competitive with benchmark results. A parallel computational approach is also proposed, attaining faster processing speeds and similar recognition rates on MORPH-II. Our computational approach can be applied to practical gender classification systems and generalized to other face analysis tasks, such as race classification and age prediction.
1 Introduction
Due to advances in data collection and storage capabilities, the demand has grown substantially for gaining insights into high-dimensional, complex-structured, and noisy data. Researchers from diverse areas have applied DR techniques to visualize and analyze such data fodor2002survey; sorzano2014survey. DR techniques also help to address collinearity and high dimensionality (i.e., the number of features exceeding the sample size in a dataset) by projecting the data into a lower-dimensional space with less correlation, so that classical statistical methods can be applied izenman2008modern. Principal Component Analysis (PCA) pearson1901liii; hotelling1933analysis is a well-studied algorithm with the goal of projecting input features onto a lower-dimensional subspace while preserving the largest variance possible; lower dimensionality permits easier visualization, for example via heat maps. While PCA is a fully automatic algorithm, DR techniques that account for domain expertise via user input have also been studied more recently yang2003interactive; johansson2009interactive. For classification problems, in which label information is available as the response variable, Linear Discriminant Analysis (LDA) (sometimes referred to as Fisher's Discriminant Analysis (FDA)) can be used for DR by minimizing intra-class variation and maximizing inter-class variation fisher1936use; rao1948utilization. Since PCA only utilizes the correlation or covariance matrices, it is considered an unsupervised approach, whereas LDA is considered a supervised approach with labeling information built into its objective function. Despite the dissimilarities, both PCA and LDA search for linear combinations of the features and, therefore, can be applied to linearly separable problems lee2007nonlinear. The main challenge is that many problems in practical applications of machine learning are nonlinear nhan2015beyond; yin2012kernel.
For nonlinear DR, kernel methods are popular choices because of their flexibility shawe2004kernel; motai2015kernel; xie2012robust, e.g., Kernel PCA scholkopf1997kernel, Kernel LDA for two classes mika1999fisher, and the more generalized Kernel LDA for multiple classes baudat2000generalized. For kernel methods, it is also possible to design specialized kernels based on domain knowledge of a problem barzilay1999domain; scholkopf1998prior.
Given the problems of high dimensionality and complex correlation structures in image analysis, DR techniques are often a necessary step hinton2006reducing. Thus, variants of PCA, LDA, and their kernel extensions have been popular in computer vision, with applications in image classification and discrimination zhao1998discriminant; martinez2001pca; yang2003can. Studies include Eigenfaces turk1991eigenfaces, Fisherfaces belhumeur1997eigenfaces, face recognition with KPCA kim2002face, face recognition with Kernel Direct LDA lu2003face, 2DPCA yang2004two, and 2DLDA li20052d, among many others. When there are sufficient labeled face images, LDA has been experimentally reported to outperform PCA for face recognition belhumeur1997eigenfaces. In the case of a small training set, the conclusion can be reversed martinez2001pca. Studies comparing the classification performance of PCA, LDA, and their kernel variations include karg2009comparison; ye2009comparative. The connections among KLDA, KPCA, and LDA are further discussed in yang2004essence. By incorporating labeling information into the construction of the objective function, Supervised Kernel PCA (SKPCA) barshan2011supervised has been proposed for visualization, regression, and classification. A modified version of SKPCA for classification problems can be found in wang2015modified. These studies suggest that SKPCA works well in practice among different DR algorithms fewzee2012dimensionality; samadani2013discriminative; wu2013biomedical. Moreover, it has been found in ashtiani2015dimension that with bounded kernels, projections from SKPCA converge uniformly, regardless of the dimension of the input features.
2 Related Work
In recent years, facial demographic analysis has become popular in computer vision, because of its broad applications in human-computer interaction (HCI), security, surveillance, and marketing, which can benefit from the automatic estimation of characteristics like age, gender, and race. Recent surveys on demographic estimation from biometrics are presented in fu2010age; sun2018demographic. Specifically, a major task is gender classification, aiming to automatically determine if a person is male or female. Beyond computer vision, the topic has been studied extensively by anthropologists, sociologists, and psychologists. Gender can easily be identified by humans, who achieved 96% accuracy in an experiment classifying photographs of adult faces burton1993s. Automating gender classification has been a priority in real-world applications. A number of biometrics have been used to identify gender, including face, voice, gait, handwriting, and even the iris sun2018demographic. However, gender classification from faces is the most common, probably because photography of faces is nonintrusive and ubiquitous. Ng et al. provide a survey of gender classification via face and gait ng2012vision.
Gender classification with faces launched in 1990, when neural networks were applied directly to pixels from face photographs golomb1990sexnet; cottrell1991empath. Many other early studies utilized the geometric-based approach to represent human faces, relying on measurements of facial landmarks poggio1992hyberbf; wiskott1995face. Though intuitive, such approaches are sensitive to landmark placement, can only accommodate frontal representations of the face, and may omit important information from the face (such as texture of the skin). In recent years, appearance-based methods, which rely on a transformation of an image's pixels, have been more commonly adopted guo2010human; guo2011simultaneous; shan2012learning. Such methods capture both the geometric relationships of the face and texture information. However, a drawback is their sensitivity to illumination and viewpoint variations. Other issues are associated with the high dimensionality of the transformed pixels, which will be discussed further in the next paragraph. The most recent gender classification studies involve convolutional neural networks (CNN) yang2013automatic; yi2014age; antipov2016minimalistic; antipov2017effective. Though CNNs have reached state-of-the-art accuracy rates, they are known to be less interpretable than some other methods.
Pixels often contain high redundancy and noise, which cannot be removed completely by preprocessing steps. Hence, the vectors resulting from appearance-based feature extraction methods inherit redundancy and noise. Popular image feature extraction methods include local texture techniques such as local binary patterns (LBP) yang2007demographic; lian2006multi; makinen2008experimental; alexandre2010gender, Gabor filters xia2008multi, biologically-inspired features (BIF) guo2009gender; han2015demographic, and histogram of oriented gradients (HOG) guo2009gender. Such methods can lead to a high dimension of extracted features, thwarting practical applications by increasing runtime and memory consumption. In the "large $p$, small $n$" setting, for which the dimension of the feature space exceeds the sample size of the dataset, a fundamental assumption of many standard statistical procedures is violated. Additionally, collinearity of features can cause numerical problems, while noisy features can obscure true relationships with the response variable and hinder predictive performance. These significant issues motivate the use of DR techniques. The fundamental goal of DR is to extract and retain information in a lower-dimensional space. Many of these methods fall under manifold learning, identifying a low-dimensional manifold embedded in a high-dimensional ambient space ma2011manifold.
Even though PCA and LDA have been widely considered as popular and effective approaches for DR in machine learning, their kernel versions are much less investigated. To the best of our knowledge, KPCA, KLDA, and SKPCA have never before been directly compared on visualization and classification performance through simulations and practical applications to face image analysis problems.
Our main contributions in this study can be summarized as follows. (1) The nonlinear manifold learning projections for KPCA, KLDA, and SKPCA are directly compared with visualization through simulated datasets. (2) Motivated by the nonlinear nature of soft-biometric analysis problems, we utilize KPCA, KLDA, and SKPCA for dimension reduction on four types of appearance-based extracted features (BIF, HOG, LBP, and AAM) for the gender classification task. Moreover, the classification performance is compared systematically with respect to parameter optimization. (3) For applications to practical large-scale systems, we propose an additional parallel computational framework that can decrease runtime while maintaining similar classification rates.
The remainder of the paper is structured as follows. In Section 3, we review the theory of KPCA, SKPCA, and KLDA. In Section 4, we conduct simulation studies to visualize projections. We propose our machine learning methods for gender classification on MorphII in Section 5. The comparative performance of KPCA, SKPCA, and KLDA on MorphII is presented and discussed in Section 6. The performance of these DR methods is further compared in Section 7 through application to the FGNET dataset. The computational framework for largescale practical systems is proposed in Section 8 and investigated on MorphII. Finally, we conclude and offer future directions of research in Section 9.
3 Kernel-Based Dimension Reduction Methods
The nonlinearity in a classification problem can often be addressed by kernel-based DR methods, with an appropriate choice of kernels. The driving reasons are the nonlinearity of chosen kernels, flexibility of tuning parameter selection, and, most importantly, the kernel trick. Mercer's theorem guarantees that a symmetric positive-definite function can be written as the sum of a convergent sequence of product functions, which potentially project the data into an infinite-dimensional space scholkopf2001generalized. Thus, it may become feasible to separate the data in the new space. On the other hand, the Representer Theorem shows that the solution for certain kernel methods lies in the finite-dimensional span of the training data wahba1990spline; scholkopf2001generalized. This is very helpful, since we do not need to compute the coordinates of the projected data in the infinite-dimensional space, but only the inner products between all pairs of data in the feature space.
3.1 Notations
With the goal of emphasizing the connections between KPCA, SKPCA, and KLDA, we define the following notations for classification problems.
Let $\mathcal{X}$ be the feature space, a nonempty subset of $\mathbb{R}^p$, with $p$ as the number of covariates for each subject. Let $\mathcal{Y}$ be the space for the response variable, a subset of $\mathbb{R}$. Let $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)$ be a series of independent observations following a joint probability measure $P_{XY}$. Let $y_1, \ldots, y_n$ denote the outcomes of the response variable. Let $X$ be an $n \times p$ feature matrix, with $\mathbf{x}_i^T$ as the $i$th row for $i = 1, \ldots, n$, and $X_j$ for $j = 1, \ldots, p$ as its $j$th column. Thus, the matrix $X$ can be written as:

$$X = \begin{pmatrix} x_{11} & \cdots & x_{1p} \\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix} = (X_1, \ldots, X_p).$$
Without loss of generality, we may assume that each column $X_j$ of the matrix $X$ is normalized, such that the mean of $X_j$ is 0 and its standard deviation is 1, for $j = 1, \ldots, p$.
Let $S$ be the sample covariance matrix of $X$. We then have

$$S = \frac{1}{n-1} X^T X. \qquad (1)$$
Let $\mathcal{H}_X$ be a reproducing kernel Hilbert space on $\mathcal{X}$ from a kernel function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, which is a Mercer kernel (symmetric and positive-definite), and let $\mathcal{H}_Y$ be a reproducing kernel Hilbert space on $\mathcal{Y}$ from a kernel function $l: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$.
For the kernel $k$, its associated space $\mathcal{H}_X$ may be infinite-dimensional, but with some additional conditions, the minimizer of a regularized risk function lies in the finite span of the training observations scholkopf2001generalized. Additionally, it has been shown scholkopf2001generalized that there exists a function

$$\Phi: \mathcal{X} \to \mathcal{H}_X \qquad (2)$$

such that for all $\mathbf{x}, \mathbf{x}' \in \mathcal{X}$,

$$k(\mathbf{x}, \mathbf{x}') = \langle \Phi(\mathbf{x}), \Phi(\mathbf{x}') \rangle, \qquad (3)$$
where $\langle \cdot, \cdot \rangle$ is the dot product in $\mathcal{H}_X$. Let $K$ be an $n \times n$ matrix such that its $(i,j)$th element is $k(\mathbf{x}_i, \mathbf{x}_j)$. We then have

$$K = \Psi \Psi^T, \qquad (4)$$

where $\Psi = (\Phi(\mathbf{x}_1), \ldots, \Phi(\mathbf{x}_n))^T$. Here, the kernel matrix $K$ is the Gram matrix of the $\Phi(\mathbf{x}_i)$'s.
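To make the Gram matrix concrete, here is a minimal NumPy sketch (toy data of our own; the RBF kernel, introduced formally in Section 4, is assumed as the Mercer kernel) that computes $K$ for a small sample and checks the Mercer properties of symmetry and positive semi-definiteness:

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Gram matrix K with K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X**2, axis=1)
    # pairwise squared Euclidean distances via the expansion ||a-b||^2
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))      # 6 toy observations with p = 3 covariates
K = rbf_gram(X, gamma=0.5)
```

The kernel trick is visible here: $K$ is computed entirely from inner-product-like evaluations $k(\mathbf{x}_i, \mathbf{x}_j)$, with no explicit construction of $\Phi$.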
3.2 Principal Component Analysis and Kernel Principal Component Analysis
In standard PCA, we seek an orthogonal transformation matrix $W$ satisfying

$$Z = XW, \qquad (5)$$

where $W = (\mathbf{w}_1, \ldots, \mathbf{w}_d)$ for some $d \le p$, such that each column vector $Z_k$ of $Z$ successively inherits the maximal proportion of variance from the column vectors $X_j$'s, while ensuring the projection directions are perpendicular. The solutions can be expressed as the eigenvalue problem

$$S \mathbf{w}_k = \lambda_k \mathbf{w}_k, \qquad (6)$$

where $\mathbf{w}_k$ is the $k$th column of $W$, for $k = 1, \ldots, d$.
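The eigenvalue formulation of PCA can be sketched directly in NumPy (simulated standardized data of our own, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # standardize each column
S = X.T @ X / (X.shape[0] - 1)                    # sample covariance matrix

vals, vecs = np.linalg.eigh(S)                    # eigh returns ascending order
order = np.argsort(vals)[::-1]                    # sort eigenvalues descending
vals, W = vals[order], vecs[:, order]
Z = X @ W[:, :2]                                  # scores on the top 2 PCs
```

The first column of $Z$ then carries the largest sample variance, equal to the leading eigenvalue of $S$.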
Following the work of scholkopf1998nonlinear, PCA can be extended to KPCA by first choosing a Mercer kernel $k$, with which $\mathbf{x}_i$ is transformed to $\Phi(\mathbf{x}_i)$. This maps the features in $\mathcal{X}$ to $\mathcal{H}_X$. Assume that $\sum_{i=1}^n \Phi(\mathbf{x}_i)$ is a vector with 0 in each entry. With the Gram matrix $K$ as defined in (4) and through the kernel trick from (3), we have the eigenvalue problem

$$K \boldsymbol{\alpha}_k = \lambda_k \boldsymbol{\alpha}_k, \quad k = 1, \ldots, d, \qquad (7)$$

where $d$ is the desired dimension and $\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_d$ are the eigenvectors of $K$, with associated eigenvalues $\lambda_1 \ge \cdots \ge \lambda_d$. Hence, the advantage of the kernel-based approach is to calculate the Gram matrix without an explicit expression for $\Phi$. Without the centralization assumption on the $\Phi(\mathbf{x}_i)$'s, the matrix $K$ in (7) can be replaced by

$$\widetilde{K} = \left( I_n - \frac{1}{n} \mathbf{1}_{n \times n} \right) K \left( I_n - \frac{1}{n} \mathbf{1}_{n \times n} \right), \qquad (8)$$

where $I_n$ is an identity matrix with dimension $n$, and $\mathbf{1}_{n \times n}$ is a matrix of 1's with dimension $n \times n$.
We note that $H = I_n - \frac{1}{n} \mathbf{1}_{n \times n}$ is idempotent, since it is a square matrix satisfying $HH = H$. For any square matrix $A$ with dimension $n$, the average of each column of the matrix $HA$ is 0, as is the average of each row of the matrix $AH$. Thus, the matrix $\widetilde{K} = HKH$ is the "centralized" version of the original matrix $K$.
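A short NumPy sketch of KPCA with the centering matrix above (toy data of our own and an assumed RBF kernel with an illustrative $\gamma$):

```python
import numpy as np

def rbf_gram(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
n = X.shape[0]

K = rbf_gram(X, gamma=0.5)
H = np.eye(n) - np.ones((n, n)) / n   # centering matrix, idempotent
Kc = H @ K @ H                        # double-centered Gram matrix

vals, vecs = np.linalg.eigh(Kc)
vals, vecs = vals[::-1], vecs[:, ::-1]            # descending eigenvalues
Z = vecs[:, :2] * np.sqrt(np.maximum(vals[:2], 0))  # 2-D training scores
```

Every row and column of the centered Gram matrix sums to zero, which is the kernel-space analogue of mean-centering the data.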
3.3 Supervised Kernel Principal Component Analysis
PCA and KPCA are unsupervised methods: they do not consider the response variable, only directions of maximum variability in the covariates. If the goal is classification, this may not be ideal, since the principal components may be unrelated to the class difference. SKPCA is a supervised generalization of KPCA, which aims to find the principal components with maximal dependence on the response variable. Drawing from barshan2011supervised and wang2015modified, we formulate SKPCA as follows.
In SKPCA, class information is incorporated by maximizing the Hilbert–Schmidt independence criterion (HSIC) gretton2005measuring. With the aforementioned reproducing kernel Hilbert spaces $\mathcal{H}_X$ on $\mathcal{X}$ and $\mathcal{H}_Y$ on $\mathcal{Y}$ and related kernel functions $k$ and $l$ respectively, the HSIC can be expressed as

$$\mathrm{HSIC}(P_{XY}, \mathcal{H}_X, \mathcal{H}_Y) = \mathbb{E}_{\mathbf{x}, \mathbf{x}', y, y'}\!\left[ k(\mathbf{x}, \mathbf{x}')\, l(y, y') \right] + \mathbb{E}_{\mathbf{x}, \mathbf{x}'}\!\left[ k(\mathbf{x}, \mathbf{x}') \right] \mathbb{E}_{y, y'}\!\left[ l(y, y') \right] - 2\, \mathbb{E}_{\mathbf{x}, y}\!\left[ \mathbb{E}_{\mathbf{x}'}\!\left[ k(\mathbf{x}, \mathbf{x}') \right] \mathbb{E}_{y'}\!\left[ l(y, y') \right] \right], \qquad (9)$$

where $\mathbb{E}_{\mathbf{x}, \mathbf{x}', y, y'}$ represents the expectation over independent pairs $(\mathbf{x}, y)$ and $(\mathbf{x}', y')$ (with respect to $P_{XY}$), and $\mathbb{E}_{\mathbf{x}, \mathbf{x}'}$ and alike are the expectations based on the various marginal distributions from $P_{XY}$.
With the results from gretton2005measuring, an empirical estimator of (9) is

$$\frac{1}{(n-1)^2} \operatorname{tr}(KHLH), \qquad (10)$$

where $K$ and $H$ are defined as before for KPCA and $L$ is a link matrix with dimension $n \times n$ whose $(i,j)$th element is $\mathbb{1}\{y_i = y_j\}$, where $\mathbb{1}\{\cdot\}$ is an indicator function with value 1 if the event is true and 0 otherwise.
Similarly as for KPCA, $K$ and $L$ can be adjusted to satisfy the centralization assumption. As discussed previously, $H$ is an idempotent matrix. Therefore, following from (10),

$$\operatorname{tr}(KHLH) = \operatorname{tr}\!\left( HKH \cdot HLH \right) = \operatorname{tr}\!\left( \widetilde{K} \widetilde{L} \right), \qquad (11)$$

where $\widetilde{K} = HKH$ and $\widetilde{L} = HLH$ are the "centralized" versions of the $K$ and $L$ matrices respectively.
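The empirical HSIC estimator in (10) reduces to a few matrix products; a NumPy sketch on toy labeled data (our own illustrative example, with an assumed RBF kernel on the covariates and an indicator link matrix on the labels) is:

```python
import numpy as np

def rbf_gram(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

rng = np.random.default_rng(3)
n = 40
X = rng.normal(size=(n, 2))
y = (X[:, 0] > 0).astype(int)                  # labels depend on the covariates

K = rbf_gram(X, gamma=1.0)
L = (y[:, None] == y[None, :]).astype(float)   # link matrix, L_ij = 1{y_i = y_j}
H = np.eye(n) - np.ones((n, n)) / n
hsic = np.trace(K @ H @ L @ H) / (n - 1) ** 2  # empirical HSIC

y_perm = rng.permutation(y)                    # permuting labels breaks dependence
L_perm = (y_perm[:, None] == y_perm[None, :]).astype(float)
hsic_perm = np.trace(K @ H @ L_perm @ H) / (n - 1) ** 2
```

Since both $HKH$ and $HLH$ are positive semi-definite, the estimator is nonnegative, and it is larger when the covariates and labels are dependent.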
On another note, in the binary gender classification problem, the link matrix $L$ defined above can be of low rank wang2015modified. Therefore, we modify the link matrix according to wang2015modified by
(12) 
It can be shown that maximization of (10) is equivalent to solving the generalized eigenvalue problem

$$KHLHK \boldsymbol{\beta}_k = \lambda_k K \boldsymbol{\beta}_k, \qquad (13)$$

where each $\boldsymbol{\beta}_k$ is an eigenvector with related eigenvalue $\lambda_k$ for $k = 1, \ldots, d$, where $d$ is the desired dimension wang2015modified. Therefore, the main advantage of the modified link matrix in (12) becomes apparent: the rank of the left-hand-side matrix may increase, permitting more eigenvalues to be computed.
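A NumPy sketch of the SKPCA generalized eigenvalue problem (13) on toy two-class data; the small ridge term added to $K$ on the right-hand side is our own numerical safeguard against singularity, not part of the method's definition:

```python
import numpy as np

def rbf_gram(X, gamma):
    sq = np.sum(X**2, axis=1)
    return np.exp(-gamma * np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0))

rng = np.random.default_rng(4)
n = 60
X = np.vstack([rng.normal(-1, 1, size=(n // 2, 2)),
               rng.normal(+1, 1, size=(n // 2, 2))])
y = np.repeat([0, 1], n // 2)

K = rbf_gram(X, gamma=0.5)
L = (y[:, None] == y[None, :]).astype(float)   # indicator link matrix
H = np.eye(n) - np.ones((n, n)) / n
A = K @ H @ L @ H @ K                          # left-hand side of (13)
B = K + 1e-6 * np.eye(n)                       # ridge keeps B invertible

# generalized problem A beta = lambda B beta, solved via B^{-1} A
vals, vecs = np.linalg.eig(np.linalg.solve(B, A))
order = np.argsort(-vals.real)
beta = vecs[:, order[:2]].real                 # top-2 generalized eigenvectors
Z = K @ beta                                   # projected training points
```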
3.4 Linear Discriminant Analysis and Kernel Linear Discriminant Analysis
Given a dataset with finitely many classes, LDA aims to find the best set of features to discriminate among the classes. We first review standard LDA, then generalize to KLDA. We note that parametric assumptions are sometimes made for LDA, such as that observations from each class are normally distributed with common covariance. Here, we make no such assumptions. Suppose that each observation $\mathbf{x}_i$ for $i = 1, \ldots, n$ belongs to exactly one of $C$ classes. Define the following feature vectors: $\bar{\mathbf{x}}$ as the overall mean and $\bar{\mathbf{x}}_c$ as the mean of the $c$th class, with $n_c$ as the size of the $c$th class in the sample, for $c = 1, \ldots, C$.
In standard LDA, we seek to maximize the objective function

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B \mathbf{w}}{\mathbf{w}^T S_W \mathbf{w}}, \qquad (14)$$

where $\mathbf{w}$ is a $p \times 1$ vector, $S_B$ is the between-class scatter matrix, and $S_W$ is the within-class scatter matrix, defined by

$$S_B = \sum_{c=1}^{C} n_c (\bar{\mathbf{x}}_c - \bar{\mathbf{x}})(\bar{\mathbf{x}}_c - \bar{\mathbf{x}})^T, \qquad S_W = \sum_{c=1}^{C} \sum_{i : y_i = c} (\mathbf{x}_i - \bar{\mathbf{x}}_c)(\mathbf{x}_i - \bar{\mathbf{x}}_c)^T. \qquad (15)$$
Hence, maximizing involves finding some rotation of the scatter matrices such that the "distance" between the groups is maximized relative to the variations within each group.
Maximization of $J(\mathbf{w})$ in (14) is equivalent to solving the generalized eigenvalue problem

$$S_B \mathbf{w}_k = \lambda_k S_W \mathbf{w}_k, \qquad (16)$$

where each $\mathbf{w}_k$ is an eigenvector with corresponding eigenvalue $\lambda_k$, for $k = 1, \ldots, d$, where $d$ is the desired dimension.
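The scatter matrices in (15) and the generalized eigenvalue problem (16) can be sketched in NumPy as follows (simulated two-class data of our own):

```python
import numpy as np

rng = np.random.default_rng(5)
n_c, p, C = 50, 3, 2
X = np.vstack([rng.normal(0, 1, size=(n_c, p)),
               rng.normal(2, 1, size=(n_c, p))])
y = np.repeat([0, 1], n_c)

mean_all = X.mean(axis=0)
S_B = np.zeros((p, p))          # between-class scatter
S_W = np.zeros((p, p))          # within-class scatter
for c in range(C):
    Xc = X[y == c]
    mc = Xc.mean(axis=0)
    S_B += len(Xc) * np.outer(mc - mean_all, mc - mean_all)
    S_W += (Xc - mc).T @ (Xc - mc)

# generalized eigenproblem S_B w = lambda S_W w, solved via S_W^{-1} S_B
vals, vecs = np.linalg.eig(np.linalg.solve(S_W, S_B))
w = vecs[:, np.argmax(vals.real)].real   # leading discriminant direction
```

With $C = 2$ classes, $S_B$ has rank $C - 1 = 1$, so LDA retains at most one discriminant direction, matching the 1D projections seen in the simulations of Section 4.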
LDA is generalized to KLDA using the kernel representation from (3). Analogously to LDA above, we seek a solution that will result in the maximization of the objective function

$$J(\mathbf{w}) = \frac{\mathbf{w}^T S_B^{\Phi} \mathbf{w}}{\mathbf{w}^T S_W^{\Phi} \mathbf{w}}, \qquad (17)$$

where now $\mathbf{w} \in \mathcal{H}_X$ and $S_B^{\Phi}$ and $S_W^{\Phi}$ are the between-class and within-class scatter matrices in $\mathcal{H}_X$, defined by

$$S_B^{\Phi} = \sum_{c=1}^{C} n_c (\bar{\Phi}_c - \bar{\Phi})(\bar{\Phi}_c - \bar{\Phi})^T, \qquad S_W^{\Phi} = \sum_{c=1}^{C} \sum_{i : y_i = c} (\Phi(\mathbf{x}_i) - \bar{\Phi}_c)(\Phi(\mathbf{x}_i) - \bar{\Phi}_c)^T, \qquad (18)$$

with $\bar{\Phi}_c$ denoting the mean of the transformed features in the $c$th class and $\bar{\Phi}$ the overall mean of the transformed features.
The above expressions involve knowledge of $\Phi$, which is often not available. It can be shown that equation (17) is equivalent to

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^T M \boldsymbol{\alpha}}{\boldsymbol{\alpha}^T N \boldsymbol{\alpha}}, \qquad (19)$$

where

$$M = \sum_{c=1}^{C} n_c (\mathbf{m}_c - \mathbf{m})(\mathbf{m}_c - \mathbf{m})^T, \qquad N = \sum_{c=1}^{C} K_c \left( I_{n_c} - \frac{1}{n_c} \mathbf{1}_{n_c \times n_c} \right) K_c^T, \qquad (20)$$

with $(\mathbf{m}_c)_j = \frac{1}{n_c} \sum_{i : y_i = c} k(\mathbf{x}_j, \mathbf{x}_i)$ and $\mathbf{m} = \frac{1}{n} \sum_{c=1}^{C} n_c \mathbf{m}_c$. Here $K_c$ is a matrix of dimension $n \times n_c$ corresponding to the features from the $c$th class, with its $(j, i)$th element as $k(\mathbf{x}_j, \mathbf{x}_i^{(c)})$, where $\mathbf{x}_i^{(c)}$ is the $i$th observation in the $c$th class. A full discussion of KLDA can be found in mika1999fisher.
Maximization of $J(\boldsymbol{\alpha})$ in equation (19) is equivalent to solving the generalized eigenvalue problem

$$M \boldsymbol{\alpha}_k = \lambda_k N \boldsymbol{\alpha}_k, \qquad (21)$$

where each $\boldsymbol{\alpha}_k$ is an eigenvector with associated eigenvalue $\lambda_k$, for $k = 1, \ldots, d$ with $d$ as the desired dimension.
Comparing the generalized eigenvalue problems in (16) and (21), the structures of the matrices $S_B$ and $M$ are similar, since both "measure" the variation between different classes.
Let $\widetilde{K}_c = K_c \left( I_{n_c} - \frac{1}{n_c} \mathbf{1}_{n_c \times n_c} \right)$, a matrix of dimension $n \times n_c$. Due to the centralization function of $I_{n_c} - \frac{1}{n_c} \mathbf{1}_{n_c \times n_c}$, $\widetilde{K}_c$ has row sum equal to zero for every row. For the matrix $N$, due to the idempotent property of $I_{n_c} - \frac{1}{n_c} \mathbf{1}_{n_c \times n_c}$,

$$N = \sum_{c=1}^{C} \widetilde{K}_c \widetilde{K}_c^T. \qquad (22)$$

Thus, the matrix $N$ has an identical structure to the $S_W$ and $S_W^{\Phi}$ matrices, which "measure" the overall variation within groups.
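For the two-class case treated in mika1999fisher, the KLDA direction can be computed from the class kernel means; the following NumPy sketch (toy data of our own; the ridge term on $N$ is our own regularization choice for invertibility) illustrates this:

```python
import numpy as np

def rbf_gram(A, B, gamma=0.5):
    d2 = (np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T)
    return np.exp(-gamma * np.maximum(d2, 0))

rng = np.random.default_rng(6)
n_c = 40
X = np.vstack([rng.normal(-1, 0.5, size=(n_c, 2)),
               rng.normal(+1, 0.5, size=(n_c, 2))])
y = np.repeat([0, 1], n_c)
n = len(y)

m = []
N = 1e-6 * np.eye(n)                       # ridge term for invertibility
for c in (0, 1):
    Kc = rbf_gram(X, X[y == c])            # n x n_c kernel block for class c
    m.append(Kc.mean(axis=1))              # class kernel mean vector m_c
    Hc = np.eye(n_c) - np.ones((n_c, n_c)) / n_c
    N += Kc @ Hc @ Kc.T                    # within-class term from (20)

alpha = np.linalg.solve(N, m[0] - m[1])    # leading KLDA direction (2 classes)
scores = rbf_gram(X, X) @ alpha            # 1-D projections of training data
```

For two classes the leading eigenvector of (21) is proportional to $N^{-1}(\mathbf{m}_1 - \mathbf{m}_2)$, so a single linear solve suffices.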
4 Visualization on Simulation Studies
To visualize and improve understanding of the manifold learning methods KPCA, SKPCA, and KLDA, we apply them in three simulation studies. For comparison, the linear techniques PCA and LDA are also considered. Each dataset contains nonlinear patterns, and the goal is to transform the data to be linearly separable. For this reason, the radial basis function (RBF) kernel

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right) \qquad (23)$$

is chosen for each pair of observed vectors $(\mathbf{x}_i, \mathbf{x}_j)$. For each DR method, a range of values for the tuning parameter $\gamma$ is tested and selected to visually separate the classes. A full discussion of the RBF kernel, among others, can be found in steinwart2008support. Figures 1, 2, and 3 compare the original data to 2-dimensional projections from each DR method. In each plot, color corresponds to the true class to which an observation belongs.
Figure 1. (a) Original Data; (b) KLDA Projections in 2D; (c) KPCA Projections in 2D; (d) SKPCA Projections in 2D; (e) PCA Projections in 2D; (f) LDA Projections in 2D.
In the first simulation study, the original data are plotted in 3D in Figure 1(a); the green sphere is embedded within the magenta group, necessitating nonlinear manifold learning. The KLDA projections in (b) are linearly separable, with very good variation between the classes and a fair amount of variation within the classes. The KPCA and SKPCA projections in (c) and (d) are at least approximately linearly separable, as it is not clear whether there is a linear boundary that perfectly separates the two classes. In (e), PCA fails to linearly separate the groups, merely rotating the original data in 2D. The maximum dimension LDA can retain is $C - 1$; with 2 classes, the projections must be plotted on a 1D number line, given in (f). Points from the two classes overlap considerably in plots (e) and (f).
Figure 2. (a) Original Data; (b) KLDA Projections in 2D; (c) KPCA Projections in 2D; (d) SKPCA Projections in 2D; (e) PCA Projections in 2D; (f) LDA Projections in 2D.
In the second simulation study, the original data in Figure 2(a) follow a nonlinear pattern. In (b), KLDA produces groups which are linearly separable. The KPCA projections are approximately linearly separable in (c); however, there is some overlap between groups, especially the green and pink groups in the third quadrant. In (d), SKPCA produces almost linearly separable groups. In plots (e) and (f), PCA and LDA simply rotate the original data in 2D space, as expected.
Figure 3. (a) Original Data; (b) KLDA Projections in 2D; (c) KPCA Projections in 2D; (d) SKPCA Projections in 2D; (e) PCA Projections in 2D; (f) LDA Projections in 2D.
For the third simulation study, the original data in Figure 3(a) are in 3D and follow a swirling, nonlinear pattern. In (b), KLDA yields favorable results; the groups are wellseparated linearly. KPCA and SKPCA in (c) and (d) also produce good results, although in (c) more separation between the purple and bright green groups would be ideal. In (e) and (f), respectively, PCA and LDA merely rotate the original data projected in 2D space; there is no linear separation between the magenta and purple groups, nor between the two green groups.
For all three simulation studies, KLDA, KPCA, and SKPCA are effective in transforming the data into linearly separable groups. In all cases, the projected data become approximately linearly separable after applying KLDA, KPCA, or SKPCA. In general, KLDA and SKPCA perform the best here. Their success over KPCA is expected, since KLDA and SKPCA are supervised techniques. On the other hand, the results indicate that KPCA and SKPCA are more sensitive than KLDA to different choices of tuning parameters. Hence, SKPCA and KPCA may perform better for alternative choices of parameters. In all our studies, the nonlinear techniques outperform linear PCA and LDA. These preliminary studies suggest the radial kernel is appropriate for our face analysis experiments.
5 Kernel-Based Dimension Reduction Optimization and Classification on MorphII
Motivated by the nonlinear nature of facial demographic analysis, we propose and implement a novel machine learning process for the MorphII dataset. We consider the kernel-based DR methods KPCA, SKPCA, and KLDA on three types of appearance-based extracted features (LBP, BIF, and HOG) for the gender classification task. We illustrate parameter optimization and compare the performance of these methods on MorphII.
5.1 Longitudinal Face Database
MORPH ricanek2006morph is one of the most popular face databases available to the public, especially for age estimation, race classification, and gender classification. Multiple versions of MORPH have been released, and the version adopted in this work is the 2008 MORPH-II non-commercial release (referred to as MorphII in this paper). MorphII includes over 55,000 mugshots with longitudinal spans and metadata such as date of birth, race, gender, and age.
In addition to its size, MorphII presents challenges because of disproportionate race and gender ratios. About 84.6% of images are of males, while only about 15.4% of images are of females. Imbalanced classes are known to negatively affect certain classification algorithms japkowicz2002class. Moreover, MorphII is skewed in terms of race, with approximately 77.2% of images picturing black subjects. Guo et al. found that age, gender, and race interact for demographic analysis tasks including gender classification, race classification, and age prediction guo2009gender; guo2010study; guo2010human, so both race and gender imbalance in MorphII can hamper gender classification.
5.2 Subsetting Scheme
To overcome the uneven race and gender distributions in MorphII, Guo et al. proposed a subsetting scheme guo2010human. Since then, many studies on MorphII have adopted such an evaluation protocol. Based on discussions in Guo et al. guo2010human, a new automatic subsetting scheme is proposed in Yip2018preliminary, aiming to automatically ensure independent training and testing sets. Additionally, inconsistencies in age, gender, and race in MorphII have been identified and corrected in Yip2018preliminary. After following the steps to clean MORPHII outlined in Yip2018preliminary, we apply the automatic subsetting scheme, summarized in Figure 4 and described below.
Let $\mathcal{W}$ be the whole MorphII dataset, $\mathcal{S}$ the selected training/testing set, and $\mathcal{R}$ the remaining set. We further divide $\mathcal{S}$ into even subsets $S_1$ and $S_2$. Separately within each subset $S_1$ and $S_2$, we fix the ratios of white (W) to black (B) images as 1:1 and male (M) to female (F) images as 3:1. Further, $S_1$ and $S_2$ have been selected such that the age distributions within each set are similar (details shown in Yip2018preliminary). The gender and race summaries for the subsetting scheme are shown in Table 2. Most importantly, the sets $S_1$, $S_2$, and $\mathcal{R}$ are independent; no sets share images from the same subject. We use $\mathcal{S}$ as an alternating training and testing set. First, we train on $S_1$ and test on $S_2$; then, we train on $S_2$ and test on $S_1$. The final classification accuracy is obtained by averaging the classification accuracies from the two alternations.
Table 2. Gender and race summary of the subsetting scheme.

            WF      BF      WM      BM      dF    dM      Overall   F       M
S1          1,285   1,285   3,855   3,855   0     0       10,280    2,570   7,710
S2          1,285   1,285   3,855   3,855   0     0       10,280    2,570   7,710
R           0       3,150   220     28,980  144   1,850   34,344    3,294   31,050
Overall     2,570   5,720   7,930   36,690  144   1,850   54,904    8,434   46,470

Race-gender combinations are abbreviated, e.g., BF represents the black female group. Abbreviations dF and dM represent those who are neither black nor white in race.
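The ratio-fixing step of the subsetting scheme can be sketched as follows; this is an illustrative simplification with hypothetical group labels and quotas, not the exact procedure of Yip2018preliminary:

```python
import random

def draw_subset(groups, quotas, seed=0):
    """Randomly draw `quotas[g]` image ids from each demographic group g.

    `groups` maps image id -> group label such as 'WF' or 'BM'."""
    rng = random.Random(seed)
    pool = {}
    for img_id, g in groups.items():
        pool.setdefault(g, []).append(img_id)
    subset = []
    for g, k in quotas.items():
        subset.extend(rng.sample(sorted(pool[g]), k))  # without replacement
    return subset

# toy example: 1:1 white-to-black and 3:1 male-to-female ratios, as in S1/S2
groups = {i: g for i, g in enumerate(
    ['WF'] * 10 + ['BF'] * 10 + ['WM'] * 30 + ['BM'] * 30)}
subset = draw_subset(groups, {'WF': 2, 'BF': 2, 'WM': 6, 'BM': 6})
```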
5.3 Facial Feature Extraction
In computer vision, image preprocessing is often an essential first step to reduce unnecessary variation, decrease pixel dimension, and simplify pixel encoding. Despite the standard format of police photography in mugshots, MorphII photographs vary in head tilt, camera distance, occlusion, and illumination. We address this variation as follows. Images are first converted to grayscale. Next, faces are automatically detected, eliminating the background and hair, so that no external cues can be used to classify gender. The resulting images are centered and scaled with respect to the centers of the irises. Finally, the images are cropped to be 70 pixels tall by 60 pixels wide. Full methodological details are given in Yip2018preprocess and align with standard preprocessing protocols from face analysis.
After preprocessing, pixel-related features are extracted from the face images in MorphII. As discussed previously, there are numerous approaches for feature extraction. In this study on MorphII, we incorporate domain expertise by choosing three well-established appearance-based models from image analysis: local texture techniques such as local binary patterns (LBP) yang2007demographic; lian2006multi; makinen2008experimental; alexandre2010gender, biologically-inspired features (BIF) guo2009gender; han2015demographic, and histogram of oriented gradients (HOG) guo2009gender. Additionally, these model-based approaches provide "robust interpretation by constraining solutions to be face-like" edwards1998interpreting. Detailed documentation of our feature extraction process can be found in feature; Yip2018preprocess.

Table 3. Main tuning parameters and tested values for each feature type (LBP block size and window radius; HOG block size and number of orientations; BIF block size and spatial aspect ratio), each kernel-based DR method (KPCA, SKPCA, KLDA; e.g., RBF parameter values including 0.0001 and 0.001), and the classifier (linear SVM).
5.4 Kernel-Based Dimension Reduction Optimization
Tuning parameter selection is essential for kernel-based methods in order to achieve good results. Within the framework of feature extraction, dimension reduction, and the classification model, there are many combinations of parameters to be considered. The main parameters and tested values are summarized in Table 3 and discussed as follows. LBP features have two main parameters: block size and window radius. For HOG, the two main parameters are block size and number of orientations. For BIF, we consider the block size and the spatial aspect ratio parameter; there is also a choice of pooling operation, which we select here as the standard deviation operation.
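Enumerating all parameter combinations for a grid search can be sketched as follows (the grids shown are illustrative placeholders, not the exact values of Table 3):

```python
import itertools

# hypothetical tuning grids; the paper's actual values are listed in Table 3
grid = {
    "gamma": [1e-4, 1e-3],        # RBF scale for the DR step
    "cost":  [0.1, 1.0, 10.0],    # linear-SVM cost C
    "block": [8, 16],             # feature block size
}

# Cartesian product of all grid values, one dict per parameter combination
combos = [dict(zip(grid, vals)) for vals in itertools.product(*grid.values())]
```

Each entry of `combos` is then one candidate configuration to train and evaluate during tuning.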
For each dimension reduction method, the radial kernel

$$k(\mathbf{x}_i, \mathbf{x}_j) = \exp\!\left( -\gamma \|\mathbf{x}_i - \mathbf{x}_j\|^2 \right) \qquad (24)$$

is used for each pair of observation vectors $(\mathbf{x}_i, \mathbf{x}_j)$, based on the results from our simulation studies. In the kernel, we must select the tuning parameter $\gamma$, which scales the extent of similarity between pairs of vectors. This parameter must be chosen with particular care, since a poor choice can result in transformed features with little to no variability. Empirically, we observed that SKPCA was more sensitive than KLDA and KPCA to the choice of $\gamma$; overly large values of $\gamma$ resulted in a rank-deficient matrix and failure to compute all requested eigenvalues. For SKPCA, we consider an additional scaling parameter $s$ in the modified link function proposed by Wang et al. wang2015modified:

(25)

for all observed responses in the same class. The scale parameter $s$ enables the weighing of dependence between the covariates and the response.
Finally, we choose a linear SVM to classify gender based on the dimension-reduced, transformed features. The motivation for this classifier is discussed in the next section. The main parameter for linear SVM is the cost $C$, which measures the extent to which misclassification in training will be permitted. A range of values of $C$ is considered.
Table 4. Top tuning accuracy rates for each DR method paired with each feature type (BIF, HOG, and LBP); the per-row feature and parameter values are omitted.

Method  Accuracy rates
KPCA    0.882, 0.882, 0.882, 0.882, 0.917, 0.919, 0.917, 0.917, 0.912, 0.912, 0.912, 0.912
SKPCA   0.899, 0.899, 0.899, 0.931, 0.931, 0.931, 0.937, 0.937, 0.938, 0.939
KLDA    0.875, 0.875, 0.875, 0.875, 0.917, 0.917, 0.904, 0.906, 0.908, 0.898
We tune on small subsets of MorphII to reduce runtime, memory consumption, and risk of overfitting. Images are randomly selected from $S_1$ and from $S_2$. The standard method of grid search is followed for tuning on these subsets. For each combination of parameters, a model is trained on the subset from $S_1$ and then tested on the subset from $S_2$. For each dimension reduction method paired with each feature type (BIF, HOG, and LBP), the best three or four accuracy rates from tuning are obtained. (Except in the case of ties, we choose only the top three accuracy rates.) The tuning results for these top-performing parameters are given in Table 4. The parameters corresponding to these maximum accuracy rates are applied to the full dataset through the previously discussed evaluation protocol. Although this protocol involves testing on images from $S_1$ and $S_2$, any overlap of images is minor (in each testing set, only a small fraction of images has been used in tuning) and has little impact on the reported accuracy (discussed in Section 6). Regardless, the tuning parameters are selected through the same procedure, so the classification performances can be fairly compared among all considered DR methods.
5.5 Gender Classification
For the classification part of the modeling, linear SVM is adopted. Many face analysis studies have involved SVM, as summarized in byun2002applications. Briefly, SVM identifies a separating hyperplane with maximal margin between the classes. Several popular kernels for SVM include linear, polynomial, and RBF steinwart2008support. We select the linear kernel, because directions of variability in the data are expected to be linear after the nonlinear transformations of KPCA, SKPCA, or KLDA. Indeed, Schölkopf et al. observed this to be true for KPCA in their landmark study scholkopf1998nonlinear. The linear kernel for SVM also reduces computational cost, compared to nonlinear kernels.
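The rationale above, that a linear decision boundary suffices after a nonlinear kernel map, can be illustrated with a from-scratch RBF kernel PCA in NumPy. This is a sketch only (the study's pipeline is implemented in R with rARPACK and e1071); the ring data and `gamma` value are illustrative:

```python
import numpy as np

def rbf_kpca(X, n_components, gamma):
    """Project data onto the leading kernel principal components of an
    RBF kernel: form K, double-center it, then scale the eigenvectors."""
    sq = np.sum(X**2, axis=1)
    K = np.exp(-gamma * (sq[:, None] + sq[None, :] - 2.0 * X @ X.T))
    n = K.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J                    # centering in feature space
    vals, vecs = np.linalg.eigh(Kc)   # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:n_components]
    return vecs[:, idx] * np.sqrt(vals[idx])  # component scores

# Two concentric rings: not linearly separable in the input space.
rng = np.random.default_rng(0)
t = rng.uniform(0.0, 2.0 * np.pi, 200)
r = np.repeat([1.0, 3.0], 100)
X = np.c_[r * np.cos(t), r * np.sin(t)]

Z = rbf_kpca(X, n_components=2, gamma=0.5)
# A linear SVM would now be trained on Z rather than X; after the
# nonlinear map, a linear boundary between the rings becomes feasible.
```

Because the kernel matrix is double-centered, each retained component has exactly zero mean over the training points, which is convenient for the downstream linear classifier.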
With the parameters in Table 4 that are selected from tuning on subsets, we implement dimension reduction and classification on the full MorphII dataset, following the subsetting scheme discussed in Section 5.2. The challenges of the large size of MorphII, the high dimensionality of the features, and the computational complexity of these dimension reduction methods necessitate the use of high-performance computing (HPC). For example, the kernel matrix for each dimension reduction method is n x n for n training images, requiring approximately 23 gigabytes of storage. Thus, we implement the process on the HiPerGator 2.0 supercomputing cluster at the University of Florida. The code is written in R. The R package rARPACK is used to optimize the solving of eigenvalue problems qiu2016package, and the e1071 package is utilized for training and testing the SVM model dimitriadou2005misc.
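The storage figure follows directly from the size of a dense n x n matrix of 8-byte doubles. A quick back-of-the-envelope check (the value of n below is an assumption for illustration, not a figure from the paper):

```python
def kernel_matrix_gb(n, bytes_per_entry=8):
    """Storage (in GiB) for a dense n x n kernel matrix of doubles."""
    return n * n * bytes_per_entry / 2**30

# An n of roughly 55,000 images yields a kernel matrix in the
# neighborhood of the ~23 GB cited in the text (illustrative n).
gb = kernel_matrix_gb(55_000)
```

The quadratic growth in n is what makes kernel methods memory-bound on large face databases, motivating the bootstrapped approximation of Section 8.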
6 Experiment Results
The kernel-based DR methods KPCA, SKPCA, and KLDA are applied to three facial feature extraction methods: BIF, HOG, and LBP. The DR methods transform the feature data, then reduce the dimension. In all cases, a dimension of 100 is retained, substantially lower than the dimension of the original feature space; this dimensionality is selected as a tradeoff between computation time and classification accuracy based on our preliminary studies. The transformed and dimension-reduced data serve as input for the linear SVM, which classifies each image subject as male or female. Additionally, the SVM outputs are mapped to probabilities through a sigmoid function, following platt1999probabilistic. This process is applied to each alternation of the evaluation protocol: 1) train on the first subset and test on the second, and 2) train on the second subset and test on the first. The classification results are averaged over these two testing sets. The mean classification accuracy over the testing images is chosen as the evaluation criterion for our methods on MorphII, as it is the usual performance metric for gender classification guo2009gender, especially in similar studies guo2011simultaneous; guo2014framework; yi2014age; yang2013automatic.
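Platt's method maps an SVM decision value f to a probability via the sigmoid p = 1 / (1 + exp(A f + B)), where A and B are fit by maximum likelihood on held-out decision values. A minimal sketch (the coefficients below are illustrative, not fitted):

```python
import math

def platt_probability(f, A=-1.0, B=0.0):
    """Map an SVM decision value f to P(male | f) via Platt's sigmoid,
    p = 1 / (1 + exp(A*f + B)).  A and B are illustrative here; in
    practice they are estimated by maximizing a held-out likelihood."""
    return 1.0 / (1.0 + math.exp(A * f + B))

p = platt_probability(2.0)   # decision value far on the positive side
q = platt_probability(-2.0)  # decision value far on the negative side
```

With A negative and B zero, large positive decision values map near 1 and large negative values near 0, and the two tails are symmetric about 0.5.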
These mean classification results from MorphII are shown in Table 5. In addition to the accuracy, the true positive rate (also known as sensitivity or recall) and true negative rate (also called specificity) are given. For this study, we define the true positive rate (TPR) as the proportion of male faces correctly classified and the true negative rate (TNR) as the proportion of female faces correctly classified. The memory usage and runtime are also listed in Table 5. The runtime is the total time for training and testing on HPC, i.e., the average of the runtimes from the two alternations of the evaluation protocol. As mentioned in Section 5.4, there is a small overlap between the tuning and testing sets that could contribute to overfitting. We have assessed the potential impact of overfitting on our reported accuracy rates and found it to be very small: it is estimated to be at most between 0.09% and 0.2% and to decrease monotonically as reported accuracy rates increase.
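The three reported metrics follow directly from a confusion table; coding male = 1 as the positive class per the definition above, they can be computed as:

```python
def rates(y_true, y_pred):
    """Accuracy, TPR (male correctly classified), and TNR (female
    correctly classified), with male coded as 1 and female as 0."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    acc = (tp + tn) / len(y_true)
    return acc, tp / n_pos, tn / n_neg

# Toy labels: 3 males, 2 females; one male and one female misclassified.
acc, tpr, tnr = rates([1, 1, 1, 0, 0], [1, 1, 0, 0, 1])
```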
The classification performance is further visualized in Figure 5 through receiver operating characteristic (ROC) curves for the nine combinations of DR method and feature extraction type. For each combination, its displayed curve corresponds to the "best" results from Table 5 (the combination of parameters reaching maximum mean classification accuracy or maximum mean true positive rate in the event of ties). For each alternation of the evaluation protocol, the true and false positive rates in testing are calculated for each probability threshold. To construct the ROC curves, each of the resulting rates for each threshold is averaged over the testing sets.
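The curve construction described above, computing TPR and FPR at each probability threshold for each protocol alternation and then averaging threshold by threshold, can be sketched as follows (toy labels and probabilities stand in for the SVM outputs from the two testing sets):

```python
import numpy as np

def roc_points(labels, probs, thresholds):
    """(FPR, TPR) at each threshold; label 1 = male (positive class)."""
    labels = np.asarray(labels)
    probs = np.asarray(probs)
    pts = []
    for t in thresholds:
        pred = probs >= t
        tp = np.sum(pred & (labels == 1))
        fn = np.sum(~pred & (labels == 1))
        fp = np.sum(pred & (labels == 0))
        tn = np.sum(~pred & (labels == 0))
        pts.append((fp / (fp + tn), tp / (tp + fn)))
    return np.array(pts)

# Average the curves from the two protocol alternations, threshold
# by threshold (toy data for illustration).
th = np.linspace(0.0, 1.0, 11)
curve1 = roc_points([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2], th)
curve2 = roc_points([1, 1, 0, 0], [0.8, 0.7, 0.3, 0.1], th)
mean_curve = (curve1 + curve2) / 2.0
```

At threshold 0 every image is predicted male, giving the (1, 1) corner; at threshold 1 none is, giving (0, 0), so the averaged curve spans the full ROC square.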
Table 5 shows that for the feature BIF, SKPCA and KLDA outperform KPCA. For the feature HOG, SKPCA achieves higher accuracy than both KPCA and KLDA, while the latter two techniques perform very similarly. Lastly, for the feature LBP, SKPCA produces better classification accuracy than KPCA and KLDA. In summary, our experimental results indicate that SKPCA outperforms KLDA consistently, while KLDA outperforms KPCA for all three features BIF, LBP, and HOG. On the other hand, for KPCA, the features HOG and LBP produce approximately the same accuracies, outperforming BIF. For SKPCA, LBP achieves slightly better results than BIF, while LBP and BIF both outperform HOG. Finally, for KLDA, BIF reaches slightly higher accuracy than LBP, while BIF and LBP both exceed HOG.
In most cases, the accuracy (in Table 5) and AUC (in Figure 5) metrics agree on the best methods. An exception is that SKPCA with the HOG features achieves slightly higher accuracy () than KLDA with the BIF features (), but SKPCA with HOG has lower AUC than KLDA with BIF. The other exception is that KPCA with the HOG features has the lowest AUC of the nine methods, but its accuracy is higher than KPCA with the BIF features. In summary, the accuracy and AUC results imply that SKPCA generally performs best for gender classification on MorphII, while KLDA tends to outperform KPCA. Meanwhile, the LBP and BIF features often yield better classification performance, with less memory usage, than the HOG features.
It is interesting that, overall, LBP achieves even slightly better performance than BIF for the dimension reduction method SKPCA on the task of gender classification, since BIF is popular in demographic analysis such as age estimation, gender classification, and race classification guo2009gender; guo2010study; guo2010human; guo2011simultaneous; guo2014framework. Another interesting fact is displayed by the results of the true positive and negative rates in Table 5: males have a higher probability of correct identification than females, with the biggest margin exceeding 20%. Our finding is consistent with han2015demographic: females are more challenging to correctly classify than males, both for automatic approaches and human perception. Similarly, for race classification on MorphII, Guo and Mu found in guo2010study that training a model on female faces (and testing on male faces) contributed to significantly more errors on average than training on male faces (and testing on female faces), even when controlling for differences in the training sample sizes. Our results also indicate that, overall, HOG and LBP outperform BIF for males, while BIF works consistently better than LBP and HOG for females.
Next, in Table 6 we compare our results to studies using similar methods on MorphII, as well as recent state-of-the-art works with deep learning on MORPHII. With the exception of han2015demographic, all studies' results in the table are mean testing classification accuracy from an alternating evaluation protocol based on Guo et al. guo2010human. Hence, our results can be directly compared to these studies. With LBP features, SKPCA, and a linear support vector machine (SVM), our gender classification accuracies approximate 96%, competitive with benchmark results. Interestingly, several reported accuracy rates from human observers of gender range from 96% burton1993s to 96.9% han2015demographic. The similarity in recognition rates between our methods and human observers further validates the success of our approach.
Method  Accuracy  Reference  Year 

BIF+OLPP  98%  guo2011simultaneous  2011 
BIF+PLS  97.34%  guo2011simultaneous  2011 
BIF+KPLS  98.2%  guo2011simultaneous  2011 
BIF+CCA  95.2%  guo2014framework  2014 
BIF+KCCA  98.4%  guo2014framework  2014 
BIF+rCCA  97.6%  guo2014framework  2014 
Multiscale CNN  97.9%  yi2014age  2014 
Ranking CNN  97.9%  yang2013automatic  2015 
BIF+HierarchicalSVM  97.6%  han2015demographic  2015 
Human Estimators  96.9%  han2015demographic  2015 
LBP+SKPCA+LSVM  95.85%  This work  2019 
7 Kernel-based Dimension Reduction Optimization and Classification on FGNET
For further comparison between KPCA, SKPCA, and KLDA, we apply a modification of our approach from Section 5 to a smaller face dataset, the face and gesture recognition network (FGNET). FGNET is a popular, publicly available database used for age estimation, gender classification, face recognition, and other demographic analysis tasks panis2016overview. It contains 1002 images from 82 subjects: 47 males and 35 females with ages varying from 0 to 69 years panis2016overview.
For each image, 109 features are extracted using the Active Appearance Model (AAM), a commonly adopted appearance-based approach that models the shape and texture of the face edwards1998interpreting; cootes1998active. As in Section 5.4, the radial kernel defined in equation (24) is chosen for each of the DR methods KPCA, SKPCA, and KLDA. Additionally, the modified link function from equation (25) is applied in the SKPCA algorithm. Thus, the tuning parameter of the radial kernel and that of the modified link function must be selected. As in our experiments on MorphII, linear SVM is chosen as the classifier for FGNET. On MorphII, a wide range of values of the cost parameter was tested. On FGNET, we have observed convergence issues in the SVM algorithm for cost values exceeding 10, so only smaller values are tested. The considered tuning parameters are summarized in Table 7.

KPCA
SKPCA
KLDA
Classifier  Linear SVM
For cross-validation, we use leave-one-person-out (LOPO), the most well-accepted scheme for FGNET panis2016overview. LOPO is a variation of K-fold cross-validation that produces independent training and testing folds in longitudinal datasets. The number of folds is set equal to the number of subjects in the dataset, so there are 82 folds here. The testing fold for a given subject contains only images of that person, while the corresponding training fold contains all remaining images. As on MorphII, we choose the mean classification accuracy over the testing folds as the evaluation criterion.
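The LOPO fold construction described above can be sketched in a few lines of Python (the subject IDs below are illustrative):

```python
def lopo_folds(subject_ids):
    """Leave-one-person-out: one fold per subject.  The test fold holds
    all images of that subject; the training fold holds the rest."""
    subjects = sorted(set(subject_ids))
    for s in subjects:
        test_idx = [i for i, sid in enumerate(subject_ids) if sid == s]
        train_idx = [i for i, sid in enumerate(subject_ids) if sid != s]
        yield train_idx, test_idx

# Toy example: 5 images from 3 subjects -> 3 folds.
ids = ["a", "a", "b", "c", "c"]
folds = list(lopo_folds(ids))
```

Because all of a person's images land in the same fold, no subject ever appears in both the training and testing sets, which is what makes the folds independent for a longitudinal database.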
Method  Parameters  Acc^{(1)}  TPR^{(2)}  TNR^{(3)}
KPCA    0.7025  0.7325  0.6621
KPCA    0.6932  0.7233  0.6528
KPCA    0.6801  0.6651  0.7001
SKPCA    0.7154  0.7542  0.6633
SKPCA    0.6933  0.7413  0.6289
SKPCA    0.6893  0.7701  0.5809
KLDA    0.7225  0.7593  0.6730
KLDA    0.7176  0.7810  0.6324
KLDA    0.7131  0.7431  0.6727
For each fold, we transform and reduce the dimension of the features through each DR method. In all cases, a dimension of 100 is retained to facilitate comparison with the results on MorphII. The transformed, dimension-reduced features are then used to predict the gender of the testing fold's images through a linear SVM. The predicted classes from SVM are also mapped to probabilities following platt1999probabilistic, as in Section 6. The gender classification accuracy is calculated for the testing fold. Finally, all such testing classification accuracies are averaged to compute the mean testing classification accuracy, and the testing probabilities are used to form ROC curves.
The optimum gender classification results on FGNET are presented in Table 8. The maximum classification accuracy of about 72.25% is achieved by KLDA. For other choices of parameters, KLDA reaches above 71% accuracy, which is close to the maximum accuracy attained by SKPCA. Meanwhile, the peak accuracy reached by KPCA is 70.25%. In general, KLDA is observed to outperform SKPCA and KPCA, while SKPCA tends to surpass KPCA. In most cases, the probability of correctly classifying males (sensitivity/true positive rate) is higher than the probability of correctly classifying females (specificity/true negative rate). For each DR method, an ROC curve (corresponding to the results from Table 8 with maximal mean classification accuracy) is displayed in Figure 6. The area under the curve (AUC) is highest for KLDA, followed by KPCA then SKPCA.
Overall, the gender classification results on MorphII are stronger than on FGNET. The lower accuracy on FGNET could be caused by the greater number of minors (aged 0-18), who have been more difficult to classify than adults in some studies wang2015modified; wang2010gender. Additionally, there are substantially fewer faces for training in FGNET than in MorphII (under 1000 versus 10280 images). Another contributor could be the choice of features and their dimension; the AAM features have dimension 109 on FGNET, while the HOG, LBP, and BIF features have dimensions ranging from 500 to several thousand on MorphII. SKPCA reaches peak performance on MorphII, while KLDA attains optimal results on FGNET. However, the results on MorphII and FGNET are similar in that the supervised methods KLDA and SKPCA outperform the unsupervised method KPCA for gender classification. Further, both datasets provide evidence that female faces are more challenging to classify than male faces.
8 Computational Framework for Practical Systems
To tackle the challenges of high dimensionality and intensive computation for largescale databases (like MorphII, as shown in the Time column of Table 5) in realworld applications, we propose a computational framework to substantially decrease runtime.
Our approach combines parallel computing, bootstrap resampling, and ensemble learning. If the main training set is very large, we can save time by drawing bootstrapped samples from it. Each bootstrapped sample is sent to its own core (or processor), where a model is trained on that sample and then tested on the full testing set, yielding one set of gender predictions per core. After repeating this process for all bootstrapped samples and their corresponding cores, the final predictions are obtained by majority rule over the predictions from all cores. Hence, the results from this scheme approximate the results from the full MorphII. This framework is summarized in Figure 7.
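The bootstrap-and-vote scheme can be sketched as follows, using Python threads as stand-ins for HPC cores. The worker's "training" step is a trivial placeholder (in the paper it would be dimension reduction plus a linear SVM), and the labels and counts are illustrative:

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def bootstrap_sample(data, rng):
    """Sample len(data) items with replacement."""
    return [rng.choice(data) for _ in data]

def train_and_predict(args):
    """Placeholder worker: 'train' on one bootstrapped sample and predict
    for every test image.  A majority-class rule stands in for the real
    model (DR + linear SVM) purely for illustration."""
    sample_labels, n_test = args
    majority = Counter(sample_labels).most_common(1)[0][0]
    return [majority] * n_test

def ensemble_predict(train_labels, n_test, n_cores=5, seed=0):
    rng = random.Random(seed)
    jobs = [(bootstrap_sample(train_labels, rng), n_test)
            for _ in range(n_cores)]
    # One worker per bootstrapped sample, mimicking one core each.
    with ThreadPoolExecutor(max_workers=n_cores) as pool:
        all_preds = list(pool.map(train_and_predict, jobs))
    # Majority vote across cores, image by image.
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*all_preds)]

preds = ensemble_predict(["male"] * 7 + ["female"] * 3, n_test=4)
```

Using an odd number of cores avoids ties in the binary majority vote.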
To explore its effectiveness, this framework is applied to MorphII with a selection of BIF, LBP, and HOG features as a preliminary study. The experiment is implemented on the HiPerGator 2.0 supercomputer at the University of Florida with five cores per combination of feature and dimension reduction method. Following the subsetting scheme discussed in Section 5.2, for simplicity, we consider only the case of bootstrapping image samples from the training subset, while each image of the testing subset is used for testing.
Method  Feature  Accuracy  Memory (gb)  Time (min)
KPCA    0.9330  27.59  90
KPCA    0.9178  29.77  101
KPCA    0.8927  25.85  37
SKPCA    0.9417  53.28  89
SKPCA    0.9056  51.43  74
SKPCA    0.9274  20.33  24
KLDA    0.9416  30.99  100
KLDA    0.9133  25.42  102
KLDA    0.9118  17.05  26
We evaluate this framework by comparing the approximated results in Table 9 to the results from Table 5. For each combination of feature and dimension reduction method, each of the five cores independently trains on a bootstrapped sample of images from the training subset and tests on the full testing subset. Then the gender predictions over all five cores are combined by a simple majority rule; e.g., if an image is predicted male by three cores and female by two, the final gender prediction is male. The times in Table 9 are the total runtimes for this process, including training and testing on HPC. Therefore, the times and memory can be compared between Tables 5 and 9. A distinction is that in Table 5, results are averaged over the alternating scheme, while in Table 9, the results come from a single train/test assignment of the subsets.
Table 9 shows that, in many cases, the accuracy rates from the approximations are similar to those from the main approach in Table 5. This is a very good result, especially considering that the bootstrapping approach trains on only a small fraction of the images used for training by the main approach. This finding suggests that our methods may perform reasonably well on MorphII with smaller training sets. The most substantial difference between the bootstrapped approach and the main approach is in the runtime. For all combinations of features and dimension reduction methods, the bootstrapping approach decreases the runtime to under two hours. Meanwhile, the main approach in Table 5 yields runtimes exceeding 20 hours. Hence, our preliminary results indicate that the parallel approximation approach can attain accuracy rates similar to the main approach while substantially saving time. Such a result is promising for practical gender classification systems, where gender predictions must be made in real time.
9 Conclusion
We have performed a comparative study of the nonlinear dimension reduction methods KPCA, SKPCA, and KLDA. These kernel-based methods are first applied to three simulated datasets for visualization and comparison. SKPCA and KLDA outperform KPCA, reinforcing the need for supervised approaches in classification tasks. The radial kernel performed well, encouraging its use for face analysis.
Next, we have proposed and evaluated a new machine learning process for MorphII. First, we use a novel subsetting scheme that reduces class imbalances while establishing independence between training and testing sets. Then we preprocess MorphII photographs and extract three appearance-based features: HOG, LBP, and BIF. We transform and reduce the dimension of these features through KPCA, SKPCA, and KLDA. Linear SVM classifies the gender of MorphII subjects, reaching accuracy rates exceeding 95%. With promising preliminary results on MorphII, a practical computational framework is offered that reduces runtime through parallelization and approximation.
The performance of the dimension reduction methods is further compared through an application to the FGNET dataset. Images are represented through the appearance-based AAM features; transformed and reduced in dimension through KPCA, SKPCA, and KLDA; and classified as containing a male or female subject through linear SVM. While SKPCA performed optimally on MorphII, KLDA reached top performance on FGNET with 72% leave-one-person-out (LOPO) accuracy.
Further directions of research involve automatic tuning parameter selection, reduction of computational cost, and application to other face analysis tasks. Our approach could yield improved results with better choices of parameters, but it is impossible to anticipate and try all combinations. Automatic parameter selection for kernels could help identify a good set of parameters more easily. Perhaps the most important future direction of research on MorphII is to reduce computational cost. For many practical demographic analysis systems, predictions must be made in realtime. For our gender classification methods, our parallel approximation approach substantially reduced runtime while attaining similar accuracy rates to the main approach. Such computational strategies should be further investigated to help bring gender classification and other face analysis tasks to practical implementation. Finally, our machine learning pipeline for MorphII could be generalized to race classification or even age estimation.
10 Acknowledgments
This material is based in part upon work supported by the National Science Foundation under Grant Number DMS-1659288. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation. The authors would like to thank the reviewers for the helpful comments that significantly improved the presentation of the paper.