# Nonlinear Supervised Dimensionality Reduction via Smooth Regular Embeddings

###### Abstract

The recovery of the intrinsic geometric structures of data collections is an important problem in data analysis. Supervised extensions of several manifold learning approaches have been proposed in recent years. However, existing methods primarily focus on the embedding of the training data, and the generalization of the embedding to initially unseen test data is largely ignored. In this work, we build on recent theoretical results on the generalization performance of supervised manifold learning algorithms. Motivated by these performance bounds, we propose a supervised manifold learning method that computes a nonlinear embedding while constructing a smooth and regular interpolation function that extends the embedding to the whole data space in order to achieve satisfactory generalization. The embedding and the interpolator are jointly learnt such that the Lipschitz regularity of the interpolator is imposed while ensuring the separation between different classes. Experimental results on several image data sets show that the proposed method compares favorably with other supervised dimensionality reduction algorithms and traditional classifiers.

Keywords: Manifold learning, dimensionality reduction, supervised learning, out-of-sample, nonlinear embeddings

## 1 Introduction

In many data analysis applications, collections of data are acquired in a high-dimensional ambient space, although the intrinsic dimension of the data is much lower. For instance, the face images of a person reside in a high-dimensional space; however, they are concentrated around a low-dimensional manifold that can be parameterized with a few variables such as pose and illumination parameters. An important problem in data analysis has been the learning of low-dimensional models that provide suitable representations of data for accurate classification. Many supervised manifold learning methods proposed in recent years aim to enhance the separation between training samples from different classes while respecting the geometric structure of data manifolds. However, the generalization capabilities of such methods to initially unavailable novel samples have largely been overlooked so far. In this work, we propose a nonlinear supervised dimensionality reduction method that builds on theoretically established generalization bounds for manifold learning.

Classical methods such as LDA and Fisher’s linear discriminant reduce the dimensionality of data by learning a projection so that the between-class separation is increased while the within-class separation is reduced. In recent years, much research effort has focused on the discovery of low-dimensional structures in data sets, which gave rise to the topic of manifold learning [1], [2], [3], [4], [5], [6]. Following these works, many supervised extensions of methods such as the Laplacian eigenmaps algorithm [3] have been proposed. Linear dimensionality reduction methods such as [7], [8], [9], [10], [11], [12], [13], [14] learn a linear projection of training samples onto a lower-dimensional domain, where the distances between samples from different classes are increased and the distances within the same class are decreased. Most of these methods include a structure preservation objective as well, which aims to map nearby samples in the original domain to nearby locations in the new domain of embedding. Nonlinear methods such as [15] pursue a similar objective in the learnt embedding; however, the embedding is given by a pointwise nonlinear mapping instead of a linear projection.

The performance of linear methods depends largely on the distribution of the data in the original ambient space, since the distribution of the data after the embedding is tied to the original distribution through a linear projection. Nonlinear dimensionality reduction methods such as [15] have greater flexibility in the learnt representation. However, two critical issues arise concerning supervised dimensionality reduction methods. First, most nonlinear methods compute a pointwise mapping only for the initially available data samples. In order to generalize them to initially unavailable points, an interpolation needs to be done, which is called the out-of-sample extension of the embedding. Second, existing dimensionality reduction methods focus on the properties of the computed embedding only as far as the training samples are concerned: existing algorithms mostly aim to increase the between-class separation and preserve the local structure, but only for the training data. The important question, however, is how well these algorithms generalize to test data. This question is even more critical for nonlinear dimensionality reduction methods, as the classification performance obtained on test data depends not only on the properties of the embedding of the training data, but also on the properties of the interpolator used for extending the embedding to the whole ambient space. Several methods have been proposed to solve the out-of-sample extension problem, such as unsupervised generalizations with smooth functions [16], [17], [18], [19] or semi-supervised interpolators [20]. These methods intend to generalize an already computed embedding to new data and are constrained by the initially prescribed coordinates for training data. However, the best strategy for achieving satisfactory generalization to test data would be to learn the embedding and the interpolation in a joint and coherent manner rather than sequentially.

In this work, we propose a nonlinear supervised manifold learning method for classification where the embeddings of training data are learned and optimized in a joint way along with the interpolator that extends the embedding to the whole ambient space. A distinctive property of our method is the fact that it explicitly aims to have good generalization to test data in the learning objective. In order to achieve this, we build on the previous work [21] where a theoretical analysis of supervised manifold learning is proposed. The theoretical results in [21] show that for good classification performance, the separation between different classes in the embedding of training data needs to be sufficiently high, while at the same time the interpolation function that extends the embedding to test data must be sufficiently regular. For good generalization to initially unavailable test samples, a compromise needs to be found between these two important criteria. In this work, we adopt radial basis function interpolators for the generalization of the embedding, and learn the embedding of the training data and the parameters of the interpolator, i.e., the coefficients and the scale parameter of the interpolation function, at the same time with a joint optimization algorithm. The analysis in [21] characterizes the regularity of an interpolator via its Lipschitz regularity. We first derive an upper bound on the Lipschitz constant of the interpolator in terms of the parameters of the embedding. Then, relying on the theoretical analysis in [21], we propose to optimize an objective function that maximizes the separation between different classes and preserves the local geometry of training samples, while at the same time minimizing an upper bound on the Lipschitz constant of the RBF interpolator. We propose an alternating iterative optimization scheme that first updates the embedding coordinates, and then the interpolator parameters in each iteration. 
We test the classification performance of the proposed method on several real data sets and show that it outperforms the compared supervised manifold learning methods and traditional classifiers.

The rest of the paper is organized as follows. In Section 2, we overview the related work. In Section 3, we review the recent theoretical results that motivate our method and in Section 4, we formulate the supervised manifold learning problem and present the proposed algorithm. In Section 5, we evaluate our method with experiments on several face and object data sets. Finally, we conclude in Section 6.

## 2 Related Work

### 2.1 Unsupervised manifold learning

Manifold learning algorithms aim to compute a low-dimensional representation of data that is coherent with its intrinsic geometry, which is characterized in several different ways via geodesic distances [1], locally linear representations [2], second order characteristics [5], and graph spectral decompositions [3], [4] in previous works. When the underlying manifold model is not analytically known, it is common to represent data with a graph model. Given a set of data samples , most manifold learning methods build a data graph such that two samples and are linked with an edge when they are nearest neighbors of each other (). The edge weights are typically assigned with respect to a similarity measure between neighboring samples.

Denoting as the weight matrix containing the edge weights , and the degree matrix as the diagonal matrix having as -th entry the total edge weight between and its neighbors, the graph Laplacian matrix is given by . The Laplacian eigenmaps algorithm [3] maps each data sample to a sample such that the following optimization problem is solved

(1) |

where is the data matrix consisting of the coordinates to be learned and is the identity matrix. Hence, the Laplacian eigenmaps algorithm formulates the new coordinates of data as the functions that have the slowest variation on the data graph, so that neighboring samples in the original domain are mapped to nearby coordinates in the new domain of embedding. The locality preserving projections (LPP) [4] algorithm has the same objective; however, the new coordinates are constrained to be given by a linear projection of the original coordinates.
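The Laplacian eigenmaps construction described above can be sketched numerically as follows. This is only an illustrative sketch under stated assumptions: the graph Laplacian is taken as the degree matrix minus the weight matrix, the edge weights use a Gaussian similarity, an edge is kept if either endpoint selects the other as a neighbor, and the embedding has orthonormal columns (the constraint suggested by the identity matrix mentioned in the text). The function names `knn_gaussian_weights` and `laplacian_eigenmaps` are hypothetical.

```python
import numpy as np

def knn_gaussian_weights(X, k=4, sigma=1.0):
    """Symmetric k-NN graph with Gaussian edge weights (assumed similarity measure)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    masked = sq.copy()
    np.fill_diagonal(masked, np.inf)           # a sample is not its own neighbor
    W = np.zeros_like(sq)
    for i in range(X.shape[0]):
        nbrs = np.argsort(masked[i])[:k]
        W[i, nbrs] = np.exp(-sq[i, nbrs] / sigma**2)
    return np.maximum(W, W.T)                  # keep an edge if either end selected it

def laplacian_eigenmaps(W, d):
    """Return d eigenvectors of L = D - W, skipping the first (constant on a connected graph)."""
    L = np.diag(W.sum(axis=1)) - W
    _, eigvecs = np.linalg.eigh(L)             # eigenvalues come back in ascending order
    return eigvecs[:, 1:d + 1], L

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                   # toy data, not from the paper
Y, L = laplacian_eigenmaps(knn_gaussian_weights(X), d=2)
print(Y.shape)
```

Restricting the new coordinates to a linear projection of the original coordinates, as LPP does, would turn the same objective into a generalized eigenvalue problem in the projection matrix; that variant is not shown here.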

### 2.2 Supervised manifold learning

Many supervised manifold learning algorithms have been proposed in recent years, most of which are extensions of the Laplacian eigenmaps method. These methods seek to embed data into new coordinates such that neighboring samples in the same class are mapped to nearby coordinates, while samples from different classes are mapped to distant points. This is often represented as an objective function that minimizes while maximizing , where and are the within-class and between-class Laplacian matrices, derived respectively from the within-class and between-class weight matrices and . The between-class edges in can be set with respect to different strategies in different methods. The supervised dimensionality reduction problem is formulated in [15] as

(2) |

where is a constant that adjusts the weight between the structure preservation and the class-aware discrimination terms. Similar formulations are adopted in [22], [8]; however, under the linear projection constraint . The recent work in [23] is based on a similar objective, which also exploits subclass information by identifying favorable data connections within the same class. A local adaptation of the Fisher discriminant analysis is proposed in [7]. A projection matrix is sought so that the following objective is maximized

(3) |

where and are the within-class and between-class scatter matrices obtained with the edge weights of samples on the data graph. The methods in [9], [10], [12], [13], [24] also optimize a similar Fisher-like objective by maximizing the between-class local scatter and minimizing the within-class local scatter. The method in [25] proposes a scatter discriminant analysis to learn embeddings of local image descriptors. In another recent work [26], the optimization of within- and between-class local scatters is formulated via -norms for robustness against image degradations.

Several supervised dimensionality reduction methods are based on preserving locally linear representations of data. The algorithm in [27] provides a supervised extension of the well-known LLE method [2] by introducing a label-dependent distance function; however, it is a nonlinear method without an explicit consideration of the out-of-sample problem. The Neighborhood Preserving Discriminant Embedding method presented in [28] is a linear dimensionality reduction method extending the unsupervised NPE method [29] based on locally linear representations. The Hybrid Manifold Embedding method [30] computes a locally linear but globally nonlinear mapping function by first grouping the data into local subsets via geodesic clustering and then learning a supervised embedding of each cluster. The supervised dimensionality reduction method in [31] partitions the manifold into local regions and takes into account the variation of the embedding along tangent directions of the manifold.

### 2.3 Continuous embeddings via nonlinear functions

The vast majority of supervised dimensionality reduction methods rely on linear projections, and the methods computing a continuous supervised nonlinear embedding are less common. The generalization of the embedding of a given set of training samples to the whole space via continuous interpolation functions is known as the out-of-sample problem. The out-of-sample problem is of critical importance especially for nonlinear supervised manifold learning methods computing a pointwise embedding at only training samples. While out-of-sample extensions via the Nyström method [16], locally linear representations [32], or smooth interpolators such as polynomials [17] are commonly employed in unsupervised manifold learning, fewer works have studied the interpolation problem within a formulation specifically suited to supervised manifold learning [20], [33].

A possible way to get around the limitations of linear embeddings while avoiding the out-of-sample problem of nonlinear embeddings is to employ kernel extensions of linear dimensionality reduction methods. Kernel extensions exist for many well-known dimensionality reduction methods such as PCA, LDA, and ICA [34], [35], [36]. The construction of continuous functions via smooth kernels is also quite common in Reproducing Kernel Hilbert Space (RKHS) methods [37], [38]; however, these methods differ from supervised manifold learning methods in that the learnt mapping often represents class labels of data samples rather than their coordinates in a lower-dimensional domain of embedding as in manifold learning. The choice of the kernel type and parameters can be critical in kernel methods. Several previous works in the semi-supervised learning literature have addressed the learning of kernels by combining known kernels [39], [40]. A two-stage multiple kernel learning method was recently proposed in [41] for supervised dimensionality reduction, which finds a nonlinear mapping by optimizing between-class and within-class distances.

## 3 Theoretical Bounds in Supervised Manifold Learning

Nonlinear dimensionality reduction methods in the literature that minimize objectives as in (2) often yield embeddings where training samples from different classes are linearly separable, and the local neighborhoods on the same manifold are preserved as imposed by the term involving the within-class graph Laplacian. On the other hand, most existing methods fail to consider how well these embeddings generalize to new test data: When a test sample of unknown class label is mapped to the low-dimensional domain of embedding via an interpolator or an out-of-sample extension method, what is critical is how likely the test sample is to be correctly classified. This depends both on the coordinates of the embedding for the training samples and the interpolator used to generalize the embedding to the whole ambient space. In the previous work [21], this problem is theoretically studied. In this section, we overview some main results from [21], which will provide a basis for the manifold learning algorithm we propose in Section 4.

The classification problem is analyzed in [21] in a setting where each data sample in the training set is assumed to belong to one of the classes and the samples of each class are distributed according to the probability measure . Let denote the support of the probability measure . Denoting as an open ball of radius around a point

the following definition introduces the smallest possible measure for a ball of radius centered around a point in the support of the -th class.

Next, we recall the definition of Lipschitz continuity for a function .

###### Definition 1.

A function is Lipschitz continuous with constant if for any

The analysis in [21] considers supervised manifold learning algorithms that compute the embedding of each training sample . It is assumed that a test sample of unknown class label is mapped to via an interpolation function . The following main result from [21] gives a bound on the classification error when the class label of is estimated via nearest-neighbor classification in as , where

###### Theorem 1.

Let be a set of training samples such that each is drawn i.i.d. from one of the probability measures , with denoting the probability measure of the -th class. Let be an embedding of in such that there exist a constant and a constant depending on satisfying

For given and , let be a Lipschitz-continuous interpolation function with constant , which maps each to , such that

(4) |

Consider a test sample randomly drawn according to the probability measure of class . For any , if contains at least training samples from the -th class drawn i.i.d. from such that

then the probability of correctly classifying with nearest-neighbor classification in is lower bounded as

(5) |

Theorem 1 considers an embedding such that nearby training samples from the same class are mapped to nearby coordinates, while training samples from different classes are separated by a distance of at least in the low-dimensional domain of embedding. The parameter can be considered as the separation margin of the embedding. Then for such an embedding, the condition in (4) assumes an interpolator that is sufficiently regular (with a sufficiently small Lipschitz constant ) compared to the separation margin . Finally, a probabilistic classification guarantee is given for this setting in (5), which states that the misclassification probability decreases exponentially with the number of samples under these assumptions. An extension of this result is also presented in [21] which studies the performance of classification when a linear classifier is used instead of nearest-neighbor classification in the low-dimensional domain. If a linear classifier is used in the domain of embedding, a very similar condition to (4) relating the interpolator regularity to the separation margin is obtained, which yields a similar probabilistic bound on the misclassification error.

While most supervised manifold learning methods in the literature focus on achieving a large separation between the training samples from different classes in the embedding, the condition (4) in the above theoretical analysis points to a critical compromise that must be sought in supervised dimensionality reduction: Achieving high separation between different classes in the training set does not necessarily mean that the classifier will generalize well to test samples. The presence of a sufficiently regular interpolator is furthermore needed, so that the Lipschitz constant of the interpolator remains below a threshold involving the separation margin of the embedding. From this perspective, depending on the data distribution, increasing the separation too much has the risk of forcing the interpolator to be too irregular, which may in turn cause condition (4) to fail. What we propose in this work is to learn the embedding together with the interpolator in view of the condition (4), which is detailed in the next section.

## 4 Proposed Nonlinear Supervised Smooth Embedding Method

In this section, we present our proposed supervised dimensionality reduction method. We first formulate the manifold learning problem and define an optimization problem based on the perspectives discussed in Section 3. We then describe our algorithm.

### 4.1 Formulation of the manifold learning problem

Given training points from classes, our purpose is to learn an embedding of data together with a continuous interpolation function , such that . The interpolator will then be used to classify new test points by mapping to the low-dimensional domain as , so that examining with respect to the embedding of the training points with known class labels provides an estimate of the class label of .

Our method relies on the theoretical results presented in Section 3. Recall from Theorem 1 that a necessary condition to obtain good generalization performance is

In the sequel, we formulate a manifold learning problem in view of this condition, whose purpose is to make the Lipschitz constant of the interpolator and the distance between neighboring points from the same class as small as possible, while making the separation between different classes as large as possible, in order to increase the chances that the above condition is met.

Let , where is the -th dimension of , with . We propose to choose the function as a radial basis function (RBF) interpolator, as RBF interpolators are a well-studied family of functions [42], [43] with many desirable properties such as smoothness and adjustable spread around anchor points. Hence, each component of is of the form

(6) |

where is an RBF kernel, are the coefficients, and are the kernel centers. A common choice for the RBF kernel is the Gaussian kernel , which we also adopt in this work. Under this setting, we now examine our three entities of interest, namely the regularity of the interpolator, the distance between neighboring points from the same class and the separation between different classes.
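To make the form of the interpolator concrete, the following minimal sketch evaluates an RBF interpolator whose kernel centers are the training points. It assumes the Gaussian kernel is written as exp(-r^2 / sigma^2); the exact normalization of the exponent is an assumption here, and the names `gaussian_rbf` and `rbf_interpolate` are hypothetical.

```python
import numpy as np

def gaussian_rbf(r, sigma):
    # Assumed Gaussian kernel form: phi(r) = exp(-r^2 / sigma^2)
    return np.exp(-(r ** 2) / sigma**2)

def rbf_interpolate(X_query, centers, C, sigma):
    """Evaluate f(x) = sum_i c_i * phi(||x - x_i||) for every query row.

    centers: (N, n) kernel centers, C: (N, d) coefficients, one column per
    embedding dimension. Returns an (M, d) array of embedded coordinates.
    """
    dists = np.linalg.norm(X_query[:, None, :] - centers[None, :, :], axis=-1)
    return gaussian_rbf(dists, sigma) @ C

rng = np.random.default_rng(0)
centers = rng.normal(size=(4, 3))   # toy kernel centers
C = rng.normal(size=(4, 2))         # toy coefficients
Xq = rng.normal(size=(5, 3))        # toy query points
out = rbf_interpolate(Xq, centers, C, sigma=1.5)
print(out.shape)
```

Each column of `C` plays the role of the coefficient vector of one component of the interpolator, so all embedding dimensions share the same kernel centers and the same scale.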

Interpolator regularity. We begin with proposing a Lipschitz constant for in terms of the function parameters.

###### Proposition 1.

Let and let be the matrix consisting of the RBF coefficients such that . Then the RBF interpolator satisfies for all

where .

###### Proof.

We first show that is a Lipschitz constant of the RBF kernel. It is easy to show that the derivative of takes its maximum magnitude at , so that for all

This upper bound on the derivative magnitude shows that is a Lipschitz constant for the Gaussian kernel as

for all .

Then we derive the Lipschitz constant of as follows. First, observe that

where the second term inside the sum can be upper bounded as

This gives in the above equation

where denotes the coefficient vector of the function and is the -norm of a vector. Then we have

where denotes the Frobenius norm of a matrix. Hence, we get

where is the Lipschitz constant of . ∎
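The chain of inequalities in the proof can be checked numerically on a toy example. The sketch below assumes the Gaussian kernel exp(-r^2 / sigma^2), for which the derivative magnitude peaks at r = sigma / sqrt(2) with value sqrt(2/e) / sigma, and it assumes the resulting bound has the form sqrt(N) times the kernel Lipschitz constant times the Frobenius norm of the coefficient matrix (the factor sqrt(N) coming from bounding an l1-norm by an l2-norm). The variable names are hypothetical.

```python
import numpy as np

sigma = 0.8
# Assumed kernel phi(r) = exp(-r^2 / sigma^2); max |phi'(r)| is attained at r = sigma / sqrt(2)
L_phi = np.sqrt(2.0 / np.e) / sigma

# Empirical maximum slope of the kernel on a fine grid
r = np.linspace(0.0, 5.0, 200001)
max_slope = np.abs(-2.0 * r / sigma**2 * np.exp(-r**2 / sigma**2)).max()

rng = np.random.default_rng(1)
N, n, d = 12, 4, 2
centers = rng.normal(size=(N, n))   # toy kernel centers
C = rng.normal(size=(N, d))         # toy coefficient matrix

def f(x):
    dist = np.linalg.norm(x - centers, axis=1)
    return np.exp(-dist**2 / sigma**2) @ C

# Assumed form of the bound: sqrt(N) * L_phi * ||C||_F
bound = np.sqrt(N) * L_phi * np.linalg.norm(C, "fro")

# Largest observed ratio ||f(x) - f(y)|| / ||x - y|| over random pairs
ratios = []
for _ in range(2000):
    x, y = rng.normal(size=n), rng.normal(size=n)
    ratios.append(np.linalg.norm(f(x) - f(y)) / np.linalg.norm(x - y))
worst_ratio = max(ratios)
print(worst_ratio <= bound)
```

The observed ratio is typically well below the bound, which is expected since the bound treats every kernel term as if it varied at its maximum rate simultaneously.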

When learning an interpolator, we would like to minimize the Lipschitz constant of . From the form (6) of the interpolator components and the fact that the interpolator values at training points must correspond to the coordinates of the embedding , we get the relation

where is the matrix consisting of the values of the RBF kernels with and is the matrix consisting of the coordinates of the embeddings of the training samples. Then the coefficient matrix is given by , so that

(7) |
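The relation between the embedding coordinates and the interpolator coefficients can be illustrated as follows. The sketch assumes a Gaussian kernel exp(-r^2 / sigma^2), builds the symmetric kernel matrix on toy training points, solves the linear system for the coefficients, and checks that the squared Frobenius norm of the coefficient matrix equals a trace expression involving the inverse kernel matrix squared. The names `Psi`, `C`, and `Y` are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
N, n, d, sigma = 10, 3, 2, 1.0
X = rng.normal(size=(N, n))   # toy training samples
Y = rng.normal(size=(N, d))   # toy embedding coordinates

sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
Psi = np.exp(-sq / sigma**2)          # Psi[i, j] = phi(||x_i - x_j||), assumed Gaussian form
C = np.linalg.solve(Psi, Y)           # coefficients from Psi C = Y, without forming the inverse

# Since Psi is symmetric, ||C||_F^2 = tr(Y^T Psi^{-2} Y)
Psi_inv = np.linalg.inv(Psi)
trace_form = np.trace(Y.T @ Psi_inv @ Psi_inv @ Y)
frob_sq = np.linalg.norm(C, "fro") ** 2
print(np.isclose(frob_sq, trace_form))
```

Using `np.linalg.solve` rather than an explicit inverse is a common numerical choice for the coefficients; the explicit inverse appears here only to verify the trace identity.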

In order to keep the Lipschitz constant of the interpolator small in the learnt embedding, we need to keep both the Lipschitz constant of the Gaussian kernel and the norm of the coefficient matrix small. Using the expression of in (7) and recalling that , we thus propose to minimize the following objective for controlling the interpolator regularity

(8) |

where is a weight parameter. The objective is chosen proportionally to the squares of the terms and instead of themselves, due to the convenience of the analytical expression obtained for in (7).

Distance between neighboring points from the same class. Recall from Theorem 1 that the condition (4) required for good classification performance enforces the term to be sufficiently small, where is an upper bound on the distance between the embeddings of nearby samples; i.e., whenever . It is not easy to study the distance in relation with the ambient space distance for each pair of samples , . Nevertheless, we adopt a constructive solution here and relax this problem to the minimization of the distance between the embeddings of nearby points from the same class. The total distance between the embeddings of neighboring points from the same class, weighted by the edge weights, is given by

where is the within-class weight matrix containing the weights of the edges between each pair of neighboring samples from the same class, and is the corresponding within-class Laplacian matrix. Hence, the objective

(9) |

used in several previous works as discussed in Section 2 is an appropriate choice for our purpose.
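The link between the weighted sum of pairwise distances and the Laplacian trace form can be verified numerically. The sketch below uses hypothetical symmetric within-class weights that connect every same-class pair (a simplification of the neighborhood graph described in the text) and checks the standard identity that the sum over all ordered pairs equals twice the trace form, so both expressions lead to the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d = 8, 2
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
Y = rng.normal(size=(N, d))   # toy embedding coordinates

# Hypothetical within-class weights: symmetric, nonzero only for same-class pairs
W = rng.uniform(size=(N, N))
W = (W + W.T) / 2
W_w = W * (labels[:, None] == labels[None, :])
np.fill_diagonal(W_w, 0.0)

L_w = np.diag(W_w.sum(axis=1)) - W_w   # within-class Laplacian

# Sum over all ordered pairs (i, j) of w_ij * ||y_i - y_j||^2
pairwise = sum(W_w[i, j] * np.sum((Y[i] - Y[j]) ** 2)
               for i in range(N) for j in range(N))
trace_form = np.trace(Y.T @ L_w @ Y)
print(np.isclose(pairwise, 2 * trace_form))
```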

Separation between samples from different classes. The last entity to be examined in view of the condition (4) is the separation margin . In order to satisfy the condition (4), the separation between samples from different classes must be sufficiently high. Although the margin stands for a lower bound on the distance between any pair of samples from different classes in Theorem 1, the examination of the minimum value of for all pairs of samples is a relatively hard problem. We propose to relax this and evaluate the total distance between the embeddings of different-class samples in this study. Hence, in order to increase the separation margin , we propose to maximize

where is a between-class weight matrix given by

and is the corresponding between-class Laplacian matrix. Thus, the maximization of the separation margin is represented by the objective function

(10) |

Overall optimization problem. Now, bringing together the objective functions presented in (8), (9), and (10), we propose to solve the following optimization problem in order to learn an embedding together with its corresponding interpolator:

(11) |

Here , , and are positive weights that balance the different terms in the objective function, and the normalization condition is imposed in order to prevent solutions with arbitrarily small embedding coordinates that might trivially minimize the objective.

### 4.2 Proposed manifold learning algorithm

The proposed objective function (11) can be made convex with respect to if the weight parameters and are suitably chosen; however, it is not jointly convex with respect to both optimization variables and . We thus propose to minimize (11) with an alternating iterative optimization algorithm. In each iteration, we first fix the scale parameter and optimize the embedding coordinates , which is then followed by fixing and optimizing .

Optimization of . When the scale parameter is fixed, the minimization of the objective (11) is equivalent to the following optimization problem

(12) |

The solution to this problem is given by the matrix whose -th column consists of the eigenvector of the matrix

(13) |

that corresponds to its -th smallest eigenvalue, for .
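The embedding update with a fixed kernel scale can be sketched as an eigenvalue problem. The sketch assumes the matrix in (13) combines the within-class Laplacian, the negatively weighted between-class Laplacian, and the weighted squared inverse kernel matrix, and that the constraint is orthonormality of the embedding columns. The weights `mu1` and `mu2`, the toy weight matrices, and the function name `update_embedding` are hypothetical.

```python
import numpy as np

def update_embedding(L_w, L_b, Psi, mu1, mu2, d):
    """Return the d eigenvectors of the assumed matrix with the smallest eigenvalues."""
    Psi_inv = np.linalg.inv(Psi)
    A = L_w - mu1 * L_b + mu2 * (Psi_inv @ Psi_inv)   # assumed form of the matrix in (13)
    A = (A + A.T) / 2                                  # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(A)               # ascending eigenvalues
    return eigvecs[:, :d], eigvals[:d], A

rng = np.random.default_rng(4)
N, d = 12, 2
X = rng.normal(size=(N, 3))                  # toy training samples
labels = np.repeat([0, 1, 2], 4)
same = labels[:, None] == labels[None, :]
sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)

W_w = np.exp(-sq) * same                     # hypothetical within-class weights
np.fill_diagonal(W_w, 0.0)
W_b = (~same).astype(float)                  # hypothetical between-class weights
L_w = np.diag(W_w.sum(axis=1)) - W_w
L_b = np.diag(W_b.sum(axis=1)) - W_b
Psi = np.exp(-sq / 1.0**2)                   # Gaussian kernel matrix with sigma = 1

Y, smallest, A = update_embedding(L_w, L_b, Psi, mu1=0.05, mu2=1.0, d=d)
print(Y.shape, smallest)
```

With orthonormal columns, the objective value at the returned embedding equals the sum of the selected eigenvalues, and no other orthonormal embedding can achieve a lower value.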

Optimization of . Note that the dependence of the objective function (11) on the scale parameter is through its third term and fourth term . Hence, when the embedding is fixed, the optimization of the objective is reduced to the problem

(14) |

The objective in (14) is not a convex function of in general. Nevertheless, a useful observation is the following: As the entries of the matrix consist of the RBF kernel terms

the matrix and its inverse have poor conditioning when takes arbitrarily large values. Hence, the first term in (14) grows as becomes large. On the other hand, the term approaches infinity as approaches 0. These observations imply that there exists a positive kernel scale that minimizes the objective (14). As the problem (14) requires the optimization of a single parameter , an optimal value can be computed easily in practice via a basic search algorithm within a suitable range of values.
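The one-dimensional search for the kernel scale can be sketched with a plain grid search. The sketch assumes the scale-dependent part of the objective is a weighted squared Frobenius norm of the coefficient matrix plus a weighted inverse squared scale, with a Gaussian kernel exp(-r^2 / sigma^2). The weights `mu2` and `mu3`, the grid range, and the function names are hypothetical, and a grid search is only one possible choice of "basic search algorithm".

```python
import numpy as np

def sigma_objective(sigma, X, Y, mu2, mu3):
    """Assumed scale-dependent objective: mu2 * ||Psi^{-1} Y||_F^2 + mu3 / sigma^2."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    Psi = np.exp(-sq / sigma**2)
    C = np.linalg.solve(Psi, Y)
    return mu2 * np.sum(C**2) + mu3 / sigma**2

def search_sigma(X, Y, mu2, mu3, grid):
    values = np.array([sigma_objective(s, X, Y, mu2, mu3) for s in grid])
    return grid[int(np.argmin(values))], values

# Toy data: a 4 x 4 lattice of points and a random embedding with orthonormal columns
X = np.array([[i, j] for i in range(4) for j in range(4)], dtype=float)
rng = np.random.default_rng(5)
Y, _ = np.linalg.qr(rng.normal(size=(16, 2)))

grid = np.geomspace(0.2, 3.0, 30)
best_sigma, values = search_sigma(X, Y, mu2=1.0, mu3=1.0, grid=grid)
print(best_sigma)
```

On this toy example the objective is large at both ends of the grid, for the two reasons given in the text: the inverse squared scale dominates for small scales, and the ill-conditioned kernel matrix inflates the coefficients for large scales.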

These steps for the alternating optimization of and are applied successively until the stabilization of the objective function. Note that if is chosen sufficiently small to make the matrix in (13) positive semidefinite, the overall objective function (11) is positive. In this case, neither of the alternating optimization steps in (12) and (14) can increase the objective function in any iteration; since the objective is also bounded from below, its value is guaranteed to converge.

Once the embedding of the training points and the kernel scale parameter are computed in this way, the interpolation function is simply obtained as in (6) by computing the coefficients as . We call the proposed method Nonlinear Supervised Smooth Embedding (NSSE) and give its description in Algorithm 1.
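The overall alternating scheme can be sketched end to end as below. This is a simplified illustration and not the authors' implementation: it assumes a Gaussian kernel exp(-r^2 / sigma^2), an objective made of a within-class trace term, a negatively weighted between-class trace term, a weighted squared coefficient norm, and a weighted inverse squared scale, and it uses hypothetical fully connected class-based weights and a grid search over the scale. All names (`nsse_sketch`, `mu1`, `mu2`, `mu3`, `sigma_grid`) are illustrative.

```python
import numpy as np

def laplacian(W):
    return np.diag(W.sum(axis=1)) - W

def kernel_matrix(X, sigma):
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / sigma**2)          # assumed Gaussian form exp(-r^2 / sigma^2)

def objective(Y, X, L_w, L_b, sigma, mu1, mu2, mu3):
    C = np.linalg.solve(kernel_matrix(X, sigma), Y)
    return (np.trace(Y.T @ L_w @ Y) - mu1 * np.trace(Y.T @ L_b @ Y)
            + mu2 * np.sum(C**2) + mu3 / sigma**2)

def nsse_sketch(X, labels, d, mu1, mu2, mu3, sigma_grid, n_iter=5):
    same = labels[:, None] == labels[None, :]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W_w = np.exp(-sq) * same               # hypothetical within-class weights
    np.fill_diagonal(W_w, 0.0)
    W_b = (~same).astype(float)            # hypothetical between-class weights
    L_w, L_b = laplacian(W_w), laplacian(W_b)

    sigma = sigma_grid[-1]                 # deliberately start from the largest scale
    history = []
    for _ in range(n_iter):
        # Step 1: fix sigma and take the d eigenvectors with the smallest eigenvalues
        Psi_inv = np.linalg.inv(kernel_matrix(X, sigma))
        A = L_w - mu1 * L_b + mu2 * (Psi_inv @ Psi_inv)
        _, vecs = np.linalg.eigh((A + A.T) / 2)
        Y = vecs[:, :d]
        # Step 2: fix Y and pick the best scale on the grid
        vals = [objective(Y, X, L_w, L_b, s, mu1, mu2, mu3) for s in sigma_grid]
        sigma = sigma_grid[int(np.argmin(vals))]
        history.append(min(vals))
    # Final step: interpolator coefficients for the learnt embedding and scale
    C = np.linalg.solve(kernel_matrix(X, sigma), Y)
    return Y, C, sigma, history

# Toy data: two classes, each a 3 x 3 lattice, separated along the first axis
X = np.array([[i + 5.0 * c, j] for c in range(2) for i in range(3) for j in range(3)],
             dtype=float)
labels = np.repeat([0, 1], 9)
Y, C, sigma, history = nsse_sketch(X, labels, d=2, mu1=0.1, mu2=1.0, mu3=1.0,
                                   sigma_grid=np.geomspace(0.3, 2.0, 12))
print(sigma, history)
```

Because the current scale is always a member of the search grid, neither step can increase the objective in this sketch, which mirrors the convergence argument given above.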

### 4.3 Complexity of the proposed algorithm

We now analyze the computational complexity of the proposed NSSE method. The algorithm is composed of three main stages, which are the initialization stage (calculation of the and matrices), the main loop between steps 3 and 6 of Algorithm 1, and the finalization stage in step 7.

In the initialization step, the complexity of the computation of and is mainly determined by the complexity of computing the within-class and between-class weight matrices and , which is of .

We next consider the main loop of the algorithm. The matrix in step 4 can be calculated with complexity and it is inverted with complexity to obtain . As a result, the computation of is of complexity . In order to find in step 4, the eigenvectors of should be found, which is of complexity . Consequently, the total complexity of step 4 is . In step 5, the expression must be computed repeatedly to find , which is of complexity . Hence, the complexity of the main loop of the algorithm is found as .

## 5 Experimental Results

In this section, we evaluate the performance of the proposed NSSE method on six real data sets. We first describe the data sets, then study the iterative optimization procedure employed in the proposed method, and then compare the performance of NSSE with that of other supervised manifold learning algorithms and traditional classifiers.

### 5.1 Data sets and experimentation setting

We experiment on the data sets listed below. Some sample images from one class of each data set are presented in Figure 1.

Yale Face Database. The data set consists of 2242 grayscale face images of 38 different subjects, where each subject has 59 images [44]. All images are taken from a single viewpoint with variations in the lighting angles and lighting rates.

COIL-20 Database. The Columbia Object Image Library database consists of 1440 grayscale images of 20 different objects, where each object has 72 images captured by rotation increments of 5 degrees [45].

ORL Database. The database consists of a total of 400 images, with 10 images of each one of the 40 subjects taken in an upright, frontal position [46]. The images contain variations in the lighting, facial expressions and facial details such as glasses.

FEI Database. The FEI database is a Brazilian face database containing a total of 2800 images, with 14 images for each one of the 200 subjects taken in an upright frontal position with profile rotation of up to about 180 degrees and scale variation of about 10% [47]. We experiment on 50 classes from this database.

ROBOTICS-CSIE Database. The database contains a total of 3330 grayscale face images of 90 subjects, with 37 images for each subject captured under rotation increments of 5 degrees [48]. We experiment on 40 classes from this database.

MIT-CBCL Database. The database contains face images of 10 subjects [49]. We experiment on a total of 5240 images, with 524 images per subject captured under rotations of up to 30 degrees and varying illumination conditions.

We experiment on grayscale versions of the images resized to around pixels. All experiments are conducted in a supervised setup, by randomly separating the images into a training set and a test set in each repetition of the experiment. In all experiments, the proposed NSSE algorithm is evaluated in a setting where the training images are used to learn a continuous embedding into a low-dimensional domain. The test images are then mapped to the domain of embedding via the learnt interpolator and their class labels are estimated via nearest neighbor classification in the low-dimensional domain. The graph edge weights are set with a Gaussian kernel. In all experiments, the three weight parameters of NSSE are set with cross-validation.
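As an illustration of the graph construction step, the following minimal sketch computes Gaussian-kernel edge weights between samples. The exact kernel form and bandwidth used in the experiments are not restated in this section, so the form $\exp(-\|x_i - x_j\|^2 / \sigma^2)$, the function name `gaussian_edge_weights`, and the parameter `sigma` are assumptions made for illustration only.

```python
import numpy as np


def gaussian_edge_weights(X, sigma):
    """Pairwise weights w_ij = exp(-||x_i - x_j||^2 / sigma^2).

    X     : (n_samples, n_features) array of vectorized images.
    sigma : kernel bandwidth (hypothetical parameter name).
    The kernel form is an assumption, not taken from the paper.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    # Squared Euclidean distances via the expansion ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)  # clip tiny negatives caused by round-off
    return np.exp(-sq_dists / sigma ** 2)


if __name__ == "__main__":
    X_demo = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
    print(gaussian_edge_weights(X_demo, sigma=1.0))
```

The resulting matrix is symmetric with unit diagonal; in a supervised setting it would typically be split into within-class and between-class parts using the training labels.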

### 5.2 Study of the iterative optimization procedure

In this first experiment, we study the iterative optimization procedure employed in the proposed method. As discussed in Section 4.2, the NSSE algorithm follows an alternating optimization scheme by minimizing the objective function in (11) first with respect to the embedding of the training samples, and then the scale parameter of the RBF kernels.
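The general pattern of such an alternating scheme can be sketched on a toy problem. The objective below is a hypothetical stand-in with one vector block `y` (playing the role of the embedding) and one scalar block `s` (playing the role of the kernel scale); it is not the objective in (11), and the closed-form updates are specific to this toy example.

```python
import numpy as np


def toy_objective(y, s, t, s0):
    """Hypothetical coupled objective: ||y - s t||^2 + 0.5 ||y||^2 + (s - s0)^2."""
    return float(np.sum((y - s * t) ** 2) + 0.5 * np.sum(y ** 2) + (s - s0) ** 2)


def alternating_minimization(t, s0, s_init, n_iters=20):
    """Alternate exact minimization over y (s fixed) and over s (y fixed)."""
    s = s_init
    y = np.zeros_like(t)
    history = [toy_objective(y, s, t, s0)]
    for _ in range(n_iters):
        # Block 1: minimize over y with s fixed (gradient 3y - 2 s t = 0).
        y = (2.0 / 3.0) * s * t
        # Block 2: minimize over s with y fixed (gradient set to zero).
        s = (t @ y + s0) / (t @ t + 1.0)
        history.append(toy_objective(y, s, t, s0))
    return y, s, history


if __name__ == "__main__":
    t = np.array([1.0, 2.0])
    for s_init in (10.0, 0.01):  # deliberately too large / too small initial scale
        _, s_final, hist = alternating_minimization(t, s0=1.0, s_init=s_init)
        print(s_init, s_final, hist[0], hist[-1])
```

Because each block update is an exact minimization, the recorded objective values never increase. In this convex toy case both initializations reach the same scale, whereas for a non-convex objective only convergence to a local optimum can be expected.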

The results given in Figure 2 are obtained on the FEI face data set, where an embedding into a dimensional domain is computed using a total of training samples. Figure 2 shows the variation of the objective function in (11) throughout the iterations. Although the proposed alternating optimization procedure is not theoretically guaranteed to find the global optimum of the objective, the figure shows that the scheme effectively minimizes the objective function, which converges in a small number of iterations. The misclassification rates of the test images, in percent, obtained with the embeddings and interpolators computed in each iteration are also reported in Figure 2. The results show that the progressive update of the continuous embedding throughout the iterations improves the classification performance. A comparison of the objective and misclassification plots in Figure 2 reveals that the two quantities vary quite similarly throughout the iterations. This suggests that the objective function in (11), motivated by theoretical bounds, is indeed well aligned with the actual classification error.

Figure 2 also shows the evolution of the RBF kernel scale parameter throughout the iterations. The RBF kernel scale is deliberately initialized with too high a value in this experiment in order to study the effect of the initial conditions on the algorithm performance. Despite this initialization, the iterative minimization of the objective gradually pulls the kernel scale towards a favorable value that improves the classification performance.

The same experiment is repeated in Figure 3, by initializing the RBF kernel scale this time with a small value. It is observed that the RBF scale is effectively optimized throughout the iterations towards a larger value, which gradually decreases the objective function and improves the classification accuracy. These results suggest that the algorithm performance is not affected much by the initialization of the RBF kernel scale. We have obtained similar results on the other data sets and under different choices of the parameters such as the number of training samples, which we skip here for brevity.

### 5.3 Variation of the classification performance with the embedding dimension

We now study the classification performance of the proposed algorithm in relation to the dimension of the embedding. The proposed NSSE method is compared to the dimensionality reduction algorithms listed below.

(SUPLAP) The Supervised Laplacian Eigenmaps method proposed in [15] computes a nonlinear low-dimensional embedding of the training samples by minimizing the objective in (2). We extend the embedding of the training samples given by the SUPLAP method to the whole space via an RBF interpolator of the same form as in NSSE. We then embed the test samples into the low-dimensional domain with this interpolation function.

(LFDA) The Local Fisher Discriminant Analysis method proposed in [7] is a supervised manifold learning algorithm that computes a linear embedding by optimizing a Fisher-type cost with additional locality preservation objectives.

(LDE) The Local Discriminant Embedding method [22] is a manifold learning method that optimizes an objective similar to that of the SUPLAP method, but learns a linear projection.

(LDA) Linear Discriminant Analysis is a classical dimensionality reduction technique that maximizes the between-class scatter while minimizing the within-class scatter.
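The out-of-sample extension used for SUPLAP above, where a pointwise embedding of the training samples is extended to the whole space by an RBF interpolator, can be sketched as follows. This is a minimal illustration assuming a Gaussian RBF kernel of the form $\exp(-\|x - x_i\|^2 / \sigma^2)$; the function names, the `sigma` parameter, and the small `ridge` term added for numerical stability are assumptions and not part of the original method description.

```python
import numpy as np


def gaussian_kernel(A, B, sigma):
    """Kernel matrix with entries exp(-||a_i - b_j||^2 / sigma^2) (assumed form)."""
    sq = np.sum(A ** 2, axis=1)[:, None] + np.sum(B ** 2, axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / sigma ** 2)


def fit_rbf_interpolator(X_train, Y_train, sigma, ridge=1e-10):
    """Solve K C = Y for the RBF coefficients C.

    X_train : (n, D) training samples in the ambient space.
    Y_train : (n, d) their low-dimensional embedding.
    ridge   : tiny diagonal term for numerical stability (an assumption).
    """
    K = gaussian_kernel(X_train, X_train, sigma)
    return np.linalg.solve(K + ridge * np.eye(len(X_train)), Y_train)


def apply_rbf_interpolator(X_new, X_train, coeffs, sigma):
    """Map new samples into the embedding domain: f(x) = sum_i c_i k(x, x_i)."""
    return gaussian_kernel(X_new, X_train, sigma) @ coeffs


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X_train = rng.normal(size=(8, 5))   # toy stand-in for training images
    Y_train = rng.normal(size=(8, 2))   # toy stand-in for their embedding
    C = fit_rbf_interpolator(X_train, Y_train, sigma=2.0)
    X_test = rng.normal(size=(3, 5))
    print(apply_rbf_interpolator(X_test, X_train, C, sigma=2.0))
```

By construction the interpolator reproduces the training embedding at the training samples. How regularly it behaves between them depends on the kernel scale and on how the training samples are placed in the embedding domain.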

The dimensionality reduction methods are applied on training samples to compute an embedding of a given dimension, which is then used to classify test samples via nearest neighbor classification in the domain of embedding. The algorithms are evaluated for a range of embedding dimensions. The parameters of the other methods in comparison are adjusted to attain their best performance.

The variation of the misclassification rates of test samples in percentage with the dimension of the embedding is presented in Figures 4, 5, 6, 7, 8, and 9, respectively for the Yale, COIL-20, ORL, FEI, ROBOTICS-CSIE and the MIT-CBCL databases. The numbers of training images used in the computation of the embeddings are indicated in the figure captions for each experiment, and are chosen proportionally to the total number of samples in the data set. The results are the average of 20 random realizations of the experiments with different training and test sets. Most of the tested methods are based on solving a generalized eigenvalue problem and the rank of the involved matrices may be different for each method depending on the number of training samples and the number of classes. Hence, the maximum possible dimension of the embedding may vary between different methods, as well as the best range of dimensions where the methods perform well. For this reason, the results on each data set are grouped into two figures with different ranges for better visual clarity.
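The dependence of the maximum embedding dimension on the rank of the involved matrices can be illustrated with classical LDA, whose between-class scatter matrix has rank at most one less than the number of classes. The sketch below checks this on random toy data; the function names and the relative tolerance used to count nonzero eigenvalues are illustrative choices.

```python
import numpy as np


def between_class_scatter(X, labels):
    """S_b = sum_c n_c (m_c - m)(m_c - m)^T over the classes c."""
    mean_all = X.mean(axis=0)
    S_b = np.zeros((X.shape[1], X.shape[1]))
    for c in np.unique(labels):
        X_c = X[labels == c]
        diff = (X_c.mean(axis=0) - mean_all)[:, None]
        S_b += len(X_c) * (diff @ diff.T)
    return S_b


def numerical_rank(S, rel_tol=1e-8):
    """Count eigenvalues of a symmetric matrix above a relative threshold."""
    eigvals = np.linalg.eigvalsh(S)
    return int(np.sum(eigvals > rel_tol * eigvals.max()))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n_classes, n_per_class, dim = 4, 10, 10
    X = rng.normal(size=(n_classes * n_per_class, dim))
    labels = np.repeat(np.arange(n_classes), n_per_class)
    print(numerical_rank(between_class_scatter(X, labels)))
```

Since the weighted class-mean deviations sum to zero, at most `n_classes - 1` of them are linearly independent, which bounds the number of useful discriminant directions for LDA. The other compared methods involve different matrices and hence different rank limits.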

The results in Figures 4-9 show that the classification accuracy of the proposed NSSE algorithm compares quite favorably to those of the other methods, as NSSE often yields the smallest misclassification rate at the optimal dimension. The misclassification rate of LDA is observed to decrease monotonically with the dimension and its best performance is attained when the embedding dimension reaches the number of classes. The LDE and LFDA algorithms exhibit their best performances at much higher dimensions compared to the other algorithms. The error rates of these algorithms usually decrease as the embedding dimension increases; however, in some data sets a local optimum of the dimension can also be observed.

Among all methods, the nonlinear NSSE and SUPLAP methods often perform better than the linear LDA, LFDA, and LDE methods. This shows that the flexibility of nonlinear methods when learning an embedding is likely to bring an advantage in computing better representations for data. It is then interesting to compare the performances of the two nonlinear methods, NSSE and SUPLAP. The SUPLAP algorithm attains its best performance when the dimension of the embedding is close to the number of classes, while the optimal dimension for the proposed NSSE algorithm is smaller in most data sets. Interestingly, the optimal dimension of NSSE is much smaller than that of SUPLAP in data sets with a low intrinsic dimension such as COIL-20, FEI, and ROBOTICS-CSIE, which are generated by the variation of only one or two camera angle parameters. Similarly, in data sets of larger intrinsic dimension such as MIT-CBCL, which involves several pose and lighting parameters, the optimal dimension of NSSE is higher and closer to that of SUPLAP. This may suggest that the embedding computed with NSSE tries to capture the intrinsic geometry of data and provides a better representation when the embedding dimension is chosen proportionally to the intrinsic dimension of data.

The reduction of the embedding dimension is desirable especially regarding the complexity of the classification of test samples in a practical application. Another advantage of NSSE over SUPLAP is that NSSE is less sensitive to the choice of the dimension, as the misclassification performance is less affected for non-optimal dimensions. Such benefits of the proposed NSSE algorithm mainly result from the fact that the Lipschitz continuity of the interpolator is imposed in the learning objective. Consequently, the training samples are embedded more evenly in the low-dimensional space so as to allow the construction of a regular interpolator, which in turn reduces the required number of dimensions and the sensitivity to a non-optimal choice of the dimension.
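One simple way to probe the regularity of a learnt mapping is to compute the largest ratio between output and input distances over pairs of samples, which gives a lower bound on the Lipschitz constant of the mapping. The following sketch is only a diagnostic under that interpretation; the function name `empirical_lipschitz` is hypothetical and this quantity is not the regularization term used in the learning objective.

```python
import numpy as np


def empirical_lipschitz(X, Y):
    """Largest ratio ||y_i - y_j|| / ||x_i - x_j|| over all sample pairs.

    X : (n, D) samples in the ambient space.
    Y : (n, d) their images under the mapping being examined.
    Returns a lower bound on the Lipschitz constant of the mapping.
    """
    best = 0.0
    n = len(X)
    for i in range(n):
        for j in range(i + 1, n):
            dx = np.linalg.norm(X[i] - X[j])
            if dx > 0:  # skip duplicate samples
                best = max(best, np.linalg.norm(Y[i] - Y[j]) / dx)
    return best


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = rng.normal(size=(20, 3))
    A = rng.normal(size=(3, 2))
    # For a linear map the estimate cannot exceed the spectral norm of A.
    print(empirical_lipschitz(X, X @ A), np.linalg.norm(A, 2))
```

A mapping that sends nearby samples to distant points in the embedding domain produces a large estimate, which is the kind of irregular behavior that a Lipschitz prior on the interpolator is meant to discourage.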

Figure 10 provides a visual comparison of the embeddings obtained with the NSSE and the SUPLAP algorithms. Panels (a) and (b) show the two-dimensional embeddings of 70 training samples from 10 classes of the ROBOTICS-CSIE data set, respectively with the NSSE and the SUPLAP methods. The embeddings of training samples look similar between the two methods, although different classes are more regularly spaced in NSSE. The performance difference between these two methods becomes much clearer when the embeddings of the test samples in panels (c) and (d) are observed. Even at this very small embedding dimension of 2, the NSSE method separates test samples from different classes much more successfully than SUPLAP, which is due to the inclusion of the interpolator parameters in the learning objective in order to attain good generalization performance.

### 5.4 Overall comparison with baseline classifiers and supervised dimensionality reduction methods

We now provide an overall comparison of the proposed NSSE method with baseline classifiers and other supervised manifold learning methods. We compare NSSE with the supervised manifold learning algorithms discussed in Section 5.3, as well as the SVM and the nearest neighbor (NN) classifiers in the original domain. The embedding dimensions and other algorithm parameters of the manifold learning methods are set to their optimal values yielding the best performance. The classification errors over test samples are studied by varying the training/test ratio and the results are averaged over 20 realizations of the experiments under different random choices of the training and test samples.

The misclassification rates of test samples in percentage are presented for the compared methods for different training data sizes in Tables 1, 2, 3, 4, 5, 6, respectively for the Yale, COIL-20, ORL, FEI, ROBOTICS-CSIE, and MIT-CBCL data sets. The leftmost columns of the tables show the number of training samples per class used for learning the classifiers. Experiments are conducted over a suitable range of number of training samples for each data set, considering the total number of samples in the data set. The smallest classification error of each experiment is shown in bold.
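The evaluation protocol described above, with a fixed number of training samples per class and errors averaged over random realizations, can be sketched as follows. The example uses the nearest neighbor baseline in the original domain on synthetic data; the function names, the seed handling, and the toy data are assumptions made for illustration and do not reproduce the reported experiments.

```python
import numpy as np


def split_per_class(labels, n_train_per_class, rng):
    """Boolean training mask with n_train_per_class random samples from each class."""
    train_idx = []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        train_idx.extend(idx[:n_train_per_class])
    mask = np.zeros(len(labels), dtype=bool)
    mask[train_idx] = True
    return mask


def nn_error_rate(X_train, y_train, X_test, y_test):
    """Misclassification rate (%) of 1-nearest-neighbor classification."""
    sq = np.sum((X_test[:, None, :] - X_train[None, :, :]) ** 2, axis=2)
    pred = y_train[np.argmin(sq, axis=1)]
    return 100.0 * np.mean(pred != y_test)


def average_error(X, labels, n_train_per_class, n_realizations=20, seed=0):
    """Average the test error over random training/test splits."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(n_realizations):
        mask = split_per_class(labels, n_train_per_class, rng)
        errors.append(nn_error_rate(X[mask], labels[mask], X[~mask], labels[~mask]))
    return float(np.mean(errors))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic classes standing in for vectorized images.
    X = np.vstack([rng.normal(0.0, 1.0, size=(30, 5)), rng.normal(1.5, 1.0, size=(30, 5))])
    labels = np.repeat(np.array([0, 1]), 30)
    for n_train in (2, 5, 10):
        print(n_train, average_error(X, labels, n_train))
```

For a dimensionality reduction method, the same loop would apply with the nearest neighbor step carried out in the domain of embedding rather than in the original domain.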

The proposed NSSE method is observed to often outperform the other methods in Tables 1-6. As expected, the performances of the algorithms improve as the number of training samples increases, while the algorithm closest in performance to NSSE in almost all experiments is the nonlinear supervised manifold learning algorithm SUPLAP. On the other hand, the linear manifold learning algorithms LFDA, LDA, and LDE exhibit variable performance depending on the data set. Their comparison to the baseline SVM and NN classifiers shows that these linear methods may get outperformed by baseline classifiers especially when the number of samples is sufficiently high, while the nonlinear NSSE and SUPLAP methods often yield better performance than these baseline classifiers. The proposed NSSE often outperforms the SUPLAP method, with the performance gap being more significant when the number of training samples is limited. This can be explained by the fact that the lack of training samples is likely to lead to degenerate embeddings in nonlinear methods computing a pointwise embedding as in SUPLAP, while the regularization term enforcing the regularity of the interpolator in NSSE proves effective in preventing such degeneracies and in preserving the overall geometric structure of data in the embedding.

Table 1: Misclassification rates (%) of test samples for the Yale data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 6 | 22.09 | 23.10 | 29.43 | 63.48 | 26.88 | 35.85 | **20.04** |
| 10 | **12.59** | 12.92 | 15.16 | 52.89 | 40.62 | 63.97 | 21.78 |
| 15 | **7.52** | 7.95 | 9.09 | 43.57 | 10.77 | 57.86 | 7.95 |
| 20 | **5.02** | 5.60 | 6.14 | 37.51 | 7.42 | 52.79 | 5.15 |
| 30 | **2.56** | **2.56** | 2.98 | 30.12 | 3.22 | 46.43 | 3.04 |

Table 2: Misclassification rates (%) of test samples for the COIL-20 data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 7 | **9.18** | 10.97 | 10.38 | 13.89 | 17.93 | 11.84 | 20.86 |
| 10 | **5.88** | 6.81 | 6.92 | 10.21 | 13.59 | 7.68 | 16.84 |
| 15 | **3.26** | 3.84 | 4.59 | 6.87 | 11.32 | 4.22 | 14.00 |
| 20 | **1.50** | 2.03 | 3.23 | 4.51 | 9.53 | 2.28 | 12.64 |
| 30 | 0.81 | **0.80** | 2.27 | 2.30 | 7.08 | 0.99 | 13.27 |

Table 3: Misclassification rates (%) of test samples for the ORL data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 2 | **14.63** | 16.04 | 19.73 | 19.34 | 27.69 | 21.17 | 24.91 |
| 3 | **8.54** | 9.48 | 10.70 | 12.96 | 14.89 | 13.13 | 12.73 |
| 5 | **3.90** | 5.31 | 4.35 | 6.91 | 8.10 | 7.73 | 7.05 |

Table 4: Misclassification rates (%) of test samples for the FEI data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 2 | **21.45** | 27.06 | 35.37 | 32.13 | 29.82 | 30.92 | 30.05 |
| 4 | **8.28** | 12.46 | 12.85 | 19.45 | 12.90 | 12.56 | 10.80 |
| 7 | **5.06** | 6.41 | 9.09 | 10.85 | 9.74 | 5.40 | 7.76 |

Table 5: Misclassification rates (%) of test samples for the ROBOTICS-CSIE data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 7 | **13.55** | 27.22 | 23.96 | 34.46 | 24.87 | 29.42 | 25.13 |
| 14 | **4.38** | 11.73 | 8.77 | 17.80 | 11.96 | 14.15 | 9.73 |
| 21 | **2.83** | 6.51 | 4.76 | 10.09 | 6.99 | 10.56 | 5.88 |

Table 6: Misclassification rates (%) of test samples for the MIT-CBCL data set.

| Training samples per class | NSSE | SUPLAP | SVM | NN | LFDA | LDA | LDE |
|---|---|---|---|---|---|---|---|
| 10 | **6.48** | 7.31 | 9.90 | 14.42 | 12.32 | 18.43 | 9.69 |
| 20 | **2.56** | 3.38 | 4.18 | 5.64 | 8.36 | 8.38 | 6.01 |
| 40 | **0.85** | 1.21 | 1.52 | 1.45 | 5.29 | 3.18 | 2.97 |

## 6 Conclusion

We have proposed a nonlinear supervised manifold learning method that learns an embedding of the training data jointly with a smooth RBF interpolation function that extends the embedding to the whole space. The embedding and the interpolator parameters are jointly optimized with the purpose of good generalization to initially unavailable data, based on recent theoretical results on the performance of supervised manifold learning algorithms. In particular, the embedding and the RBF parameters are learnt such that the interpolator has sufficiently good Lipschitz regularity while the samples from different classes are separated as much as possible. Experiments on image data sets have shown that the proposed method learns embeddings yielding better classification performance while requiring a smaller number of dimensions in comparison with other supervised manifold learning approaches. Thanks to the priors on the Lipschitz regularity of the interpolator incorporated in the learning objective, the proposed method can learn efficient representations even under limited availability of training samples, and is relatively robust to conditions such as the non-optimal choice of the embedding dimension and unfavorable initialization of the interpolator parameters. Our study shows that nonlinear mappings are promising in supervised dimensionality reduction, and that explicitly taking into account the generalizability of the embedding in the learning objective substantially improves the classification performance.

## References

- [1] J. B. Tenenbaum, V. de Silva, J. C. Langford, A global geometric framework for nonlinear dimensionality reduction, Science 290 (5500) (2000) 2319–2323.
- [2] S. T. Roweis, L. K. Saul, Nonlinear dimensionality reduction by locally linear embedding, Science 290 (2000) 2323–2326.
- [3] M. Belkin, P. Niyogi, Laplacian eigenmaps for dimensionality reduction and data representation, Neural Computation 15 (6) (2003) 1373–1396.
- [4] X. He, P. Niyogi, Locality Preserving Projections, in: Advances in Neural Information Processing Systems 16, MIT Press, Cambridge, MA, 2004.
- [5] D. L. Donoho, C. Grimes, Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data, Proceedings of the National Academy of Sciences of the United States of America 100 (10) (2003) 5591–5596.
- [6] Z. Zhang, H. Zha, Principal manifolds and nonlinear dimension reduction via local tangent space alignment, SIAM Journal on Scientific Computing 26 (2005) 313–338.
- [7] M. Sugiyama, Dimensionality reduction of multimodal labeled data by local fisher discriminant analysis, Journal of Machine Learning Research 8 (2007) 1027–1061.
- [8] Q. Hua, L. Bai, X. Wang, Y. Liu, Local similarity and diversity preserving discriminant projection for face and handwriting digits recognition, Neurocomputing 86 (2012) 150–157.
- [9] W. Yang, C. Sun, L. Zhang, A multi-manifold discriminant analysis method for image feature extraction, Pattern Recognition 44 (8) (2011) 1649–1657.
- [10] Z. Zhang, M. Zhao, T. Chow, Marginal semi-supervised sub-manifold projections with informative constraints for dimensionality reduction and recognition, Neural Networks 36 (2012) 97–111.
- [11] B. Li, J. Liu, Z. Zhao, W. Zhang, Locally linear representation fisher criterion, in: The 2013 International Joint Conference on Neural Networks, 2013, pp. 1–7.
- [12] Y. Cui, L. Fan, A novel supervised dimensionality reduction algorithm: Graph-based fisher analysis, Pattern Recognition 45 (4) (2012) 1471–1481.
- [13] R. Wang, X. Chen, Manifold discriminant analysis, in: CVPR, 2009, pp. 429–436.
- [14] M. Yu, L. Shao, X. Zhen, X. He, Local feature discriminant projection, IEEE Trans. Pattern Anal. Mach. Intell. 38 (9) (2016) 1908–1914.
- [15] B. Raducanu, F. Dornaika, A supervised non-linear dimensionality reduction approach for manifold learning, Pattern Recognition 45 (6) (2012) 2432–2444.
- [16] Y. Bengio, J. F. Paiement, P. Vincent, O. Delalleau, N. Le Roux, M. Ouimet, Out-of-sample extensions for LLE, ISOMAP, MDS, Eigenmaps, and Spectral Clustering, in: Adv. Neural Inf. Process. Syst., MIT Press, 2004, pp. 177–184.
- [17] H. Qiao, P. Zhang, D. Wang, B. Zhang, An explicit nonlinear mapping for manifold learning, IEEE T. Cybernetics 43 (1) (2013) 51–63.
- [18] G. H. Chen, C. Wachinger, P. Golland, Sparse projections of medical images onto manifolds, in: Proc. 23rd Int. Conf. Information Processing in Medical Imaging, 2013, pp. 292–303.
- [19] B. Peherstorfer, D. Pflüger, H. J. Bungartz, A sparse-grid-based out-of-sample extension for dimensionality reduction and clustering with laplacian eigenmaps, in: Proc. 24th Australasian Joint Conf. Advances in Artificial Intelligence, 2011, pp. 112–121.
- [20] E. Vural, C. Guillemot, Out-of-sample generalizations for supervised manifold learning for classification, IEEE Transactions on Image Processing 25 (3) (2016) 1410–1424.
- [21] E. Vural, C. Guillemot, A study of the classification of low-dimensional data with supervised manifold learning, Submitted for publication. [Online]. Available: https://arxiv.org/abs/1507.05880.
- [22] H. Chen, H. Chang, T. Liu, Local discriminant embedding and its variants, in: IEEE Computer Society Conf. Computer Vision and Pattern Recognition, 2005, pp. 846–853.
- [23] A. Maronidis, A. Tefas, I. Pitas, Subclass graph embedding and a marginal fisher analysis paradigm, Pattern Recognition 48 (12) (2015) 4024–4035.
- [24] Q. Gao, J. Ma, H. Zhang, X. Gao, Y. Liu, Stable orthogonal local discriminant embedding for linear dimensionality reduction, IEEE Trans. Image Processing 22 (7) (2013) 2521–2531.
- [25] M. Yu, L. Shao, X. Zhen, X. He, Local feature discriminant projection, IEEE Trans. Pattern Anal. Mach. Intell. 38 (9) (2016) 1908–1914.
- [26] S. Chen, J. Wang, C. Liu, B. Luo, Two-dimensional discriminant locality preserving projection based on l1-norm maximization, Pattern Recognition Letters 87 (2017) 147–154.
- [27] S. Zhang, Enhanced supervised locally linear embedding, Pattern Recognition Letters 30 (13) (2009) 1208–1218.
- [28] Y. Pang, A. T. B. Jin, F. S. Abas, Neighbourhood preserving discriminant embedding in face recognition, J. Visual Communication and Image Representation 20 (8) (2009) 532–542.
- [29] X. He, D. Cai, S. Yan, H. Zhang, Neighborhood preserving embedding, in: 10th IEEE Int. Conf. Computer Vision, 2005, pp. 1208–1213.
- [30] Y. Liu, Y. Liu, K. C. C. Chan, K. A. Hua, Hybrid manifold embedding, IEEE Trans. Neural Netw. Learning Syst. 25 (12) (2014) 2295–2302.
- [31] Y. Zhou, S. Sun, Manifold partition discriminant analysis, IEEE Trans. Cybernetics 47 (4) (2017) 830–840.
- [32] F. Dornaika, B. Raducanu, Out-of-sample embedding for manifold learning applied to face recognition, in: IEEE Conf. Computer Vision and Pattern Recognition, CVPR Workshop, 2013, pp. 862–868.
- [33] C. Orsenigo, C. Vercellis, Kernel ridge regression for out-of-sample mapping in supervised manifold learning, Expert Syst. Appl. 39 (9) (2012) 7757–7762.
- [34] B. Schölkopf, A. J. Smola, K. R. Müller, Kernel principal component analysis, in: 7th Int. Conf. Artificial Neural Networks, 1997, pp. 583–588.
- [35] G. Baudat, F. Anouar, Generalized discriminant analysis using a kernel approach, Neural Computation 12 (10) (2000) 2385–2404.
- [36] F. R. Bach, M. I. Jordan, Kernel independent component analysis, Journal of Machine Learning Research 3 (2002) 1–48.
- [37] M. Belkin, P. Niyogi, V. Sindhwani, Manifold regularization: A geometric framework for learning from labeled and unlabeled examples, Journal of Machine Learning Research 7 (2006) 2399–2434.
- [38] N. Aronszajn, Theory of reproducing kernels, Transactions of the American Mathematical Society 68 (3) (1950) 337–404.
- [39] G. Dai, D. Y. Yeung, Kernel selection for semi-supervised kernel machines, in: Proc. 24th Int. Conf. Machine Learning, 2007, pp. 185–192.
- [40] A. Argyriou, M. Herbster, M. Pontil, Combining graph laplacians for semi-supervised learning, in: Advances in Neural Information Processing Systems 18, 2005, pp. 67–74.
- [41] A. Nazarpour, P. Adibi, Two-stage multiple kernel learning for supervised dimensionality reduction, Pattern Recognition 48 (5) (2015) 1854–1862.
- [42] B. J. C. Baxter, The interpolation theory of radial basis functions, Ph.D. thesis, Cambridge University, Trinity College (1992).
- [43] C. Piret, Analytical and numerical advances in radial basis functions, Ph.D. thesis, University of Colorado (2007).
- [44] A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman, From few to many: Illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intelligence 23 (6) (2001) 643–660.
- [45] S. A. Nene, S. K. Nayar, H. Murase, Columbia Object Image Library (COIL-20), Tech. rep. (Feb 1996).
- [46] F. Samaria, A. Harter, Parameterisation of a stochastic model for human face identification, in: Proc. Second IEEE Workshop on Applications of Computer Vision, 1994, pp. 138–142.
- [47] C. E. Thomaz, G. A. Giraldi, A new ranking method for principal components analysis and its application to face image analysis, Image Vision Comput. 28 (6) (2010) 902–913.
- [48] Robotics CSIE database for face detection, available: http://robotics.csie.ncku.edu.tw/Databases/FaceDetect_PoseEstimate.htm.
- [49] MIT-CBCL face recognition database, available: http://cbcl.mit.edu/software-datasets/heisele/facerecognition-database.html.