# Class Mean Vector Component and Discriminant Analysis for Kernel Subspace Learning

###### Abstract

The kernel matrix used in kernel methods encodes all the information required for solving complex nonlinear problems defined on data representations in the input space using simple, but implicitly defined, solutions. Spectral analysis on the kernel matrix defines an explicit nonlinear mapping of the input data representations to a subspace of the kernel space, which can be used for directly applying linear methods. However, the selection of the kernel subspace is crucial for the performance of the subsequent processing steps. In this paper, we propose a component analysis method for kernel-based dimensionality reduction that optimally preserves the pair-wise distances of the class means in the feature space. We provide extensive analysis on the connection of the proposed criterion to those used in kernel principal component analysis and kernel discriminant analysis, leading to a discriminant analysis version of the proposed method. We illustrate the properties of the proposed approach on real-world data.


Index Terms— Kernel subspace learning, Principal Component Analysis, Kernel Discriminant Analysis, Approximate kernel subspace learning

## I Introduction

Kernel methods are very effective in numerous machine learning problems, including nonlinear regression, classification, and retrieval. The main idea in kernel-based learning is to nonlinearly map the original data representations to a feature space of (usually) increased dimensionality and solve an equivalent (but simpler) problem using a simple (linear) method for the transformed data. That is, all the variability and richness required for solving a complex problem defined on the original data representations is encoded by the adopted nonlinear mapping. Since for commonly used nonlinear mappings in kernel methods the dimensionality of the feature space is arbitrary (virtually infinite), the data representations in the feature space are implicitly obtained by expressing their pair-wise products, stored in the so-called kernel matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$, where $N$ is the number of samples forming the problem at hand.

The feature space determined by spectral decomposition of $\mathbf{K}$ has been shown to encode several properties of interest: it has been used to define low-dimensional features suitable for linear class discrimination [1], to train linear classifiers capturing nonlinear patterns of the input data [2], to reveal nonlinear data structures in spectral clustering [3] and it has been shown to encode information related to the entropy of the input data distribution [4]. The expressive power of $\mathbf{K}$ and its resulting basis motivates us to study its discriminative power as well.

In this paper we first propose a kernel matrix component analysis method for kernel-based dimensionality reduction optimally preserving the pair-wise distances of the class means in the kernel space. We show that the proposed criterion also preserves the distances of the class means with respect to the total mean of the data in the kernel space, as well as the Euclidean divergence between the class distributions in the input space. We analyze the connection of the proposed criterion with those used in (uncentered) kernel principal component analysis and kernel discriminant analysis, providing new insights related to the dimensionality selection process of these two methods. Extensions using approximate kernel matrix definitions are subsequently proposed. Finally, exploiting the connection of the proposed method to kernel discriminant analysis, we propose a discriminant analysis method that is able to produce kernel subspaces whose dimensionality is not bounded by the number of classes forming the problem at hand. Experiments on real-world data illustrate our findings.

## II Preliminaries

Let us denote by $\mathcal{S} = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}$ a set of $D$-dimensional vectors, where $\mathcal{S}_k$, $k = 1, \dots, C$, is the set of vectors belonging to class $k$. In kernel-based learning [2], the samples in $\mathcal{S}$ are mapped to the kernel space $\mathcal{F}$ by using a nonlinear function $\phi(\cdot)$, such that $\mathbf{x}_i \mapsto \phi(\mathbf{x}_i)$, where $\phi(\mathbf{x}_i) \in \mathcal{F}$. Since the dimensionality of $\mathcal{F}$ is arbitrary (virtually infinite), the data representations in $\mathcal{F}$ are not calculated. Instead, the nonlinear mapping is implicitly performed using the kernel function $\kappa(\cdot, \cdot)$ expressing dot products between the data representations in $\mathcal{F}$, i.e. $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^T \phi(\mathbf{x}_j)$. By applying the kernel function on all training data pairs, the so-called kernel matrix $\mathbf{K} \in \mathbb{R}^{N \times N}$ with $[\mathbf{K}]_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ is calculated. One of the most important properties of the kernel function is that it leads to a positive semi-definite (psd) kernel matrix $\mathbf{K}$. While indefinite matrices [5, 6] and general similarity matrices [7] have also been researched, in this paper we will consider only positive semi-definite kernel functions.

The importance of kernel methods in Machine Learning comes from the fact that, in the case when a linear method can be expressed based on dot products of the input data, they can be readily used to devise nonlinear extensions. This is achieved by exploiting the Representer theorem [2] stating that the solution of a linear model in $\mathcal{F}$, e.g. $\mathbf{W}$, can be expressed as a linear combination of the training data, i.e. $\mathbf{W} = \mathbf{\Phi}\mathbf{A}$, where $\mathbf{\Phi} = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_N)]$ and $\mathbf{A}$ is a matrix containing the combination weights. Then, the output of a linear model in $\mathcal{F}$ can be calculated by $\mathbf{o} = \mathbf{W}^T \phi(\mathbf{x}) = \mathbf{A}^T \mathbf{k}_x$, where $\mathbf{k}_x \in \mathbb{R}^N$ is a vector having its $i$-th element equal to $\kappa(\mathbf{x}_i, \mathbf{x})$. That is, instead of optimizing with respect to the arbitrary-dimensional $\mathbf{W}$, the solution involves the optimization of the combination weights $\mathbf{A}$.

Another important aspect of using kernel methods is that they allow us to train models of increased discrimination power [2, 8]. Considering the Vapnik-Chervonenkis (VC) dimension of a linear classifier defined on the data representations in the original feature space $\mathbb{R}^D$, the number of samples that can be shattered (i.e., correctly classified irrespectively of their arrangement) is equal to $D+1$. On the other hand, the VC dimension of a linear classifier defined on the data representations in $\mathcal{F}$ is higher. For the most widely used kernel function, i.e. the Gaussian kernel function $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\|\mathbf{x}_i - \mathbf{x}_j\|_2^2 / (2\sigma^2)\right)$, it is virtually infinite. In practice this means that, under mild assumptions, a linear classifier applied on data representations in $\mathcal{F}$ can classify all training data.

Using the definition of $\mathbf{K}$ and its psd property, its spectral decomposition leads to $\mathbf{K} = \sum_{d=1}^{N} \lambda_d \mathbf{u}_d \mathbf{u}_d^T$, where $\lambda_d$ and $\mathbf{u}_d$ are the eigenvalues and the corresponding eigenvectors of $\mathbf{K}$. Thus, an explicit nonlinear mapping from $\mathbb{R}^D$ to a subspace of $\mathcal{F}$ is defined, such that the $d$-th dimension of the $i$-th training sample is:

$$\tilde{\phi}_d(\mathbf{x}_i) = \sqrt{\lambda_d}\,[\mathbf{u}_d]_i, \qquad (1)$$

where $\lambda_d$ is the $d$-th largest eigenvalue of $\mathbf{K}$ and $\mathbf{u}_d$ is the corresponding eigenvector. In the case where the data is centered in $\mathcal{F}$, this is the space defined by kernel Principal Component Analysis (kPCA) [2]. Moreover, as has been shown in [9, 10], the kernel matrix need not be centered. In the latter case, the number of non-zero eigenvalues is called the effective dimensionality of $\mathcal{F}$ and the spanned subspace is the corresponding effective subspace of $\mathcal{F}$. The latter case is also essentially the same as uncentered kernel PCA. Recently, kernel Entropy Component Analysis (ECA) was proposed [4], following the uncentered kernel approach and sorting the eigenvectors according to the size of the entropy values defined as $V_d = \frac{1}{N^2}\lambda_d \left(\mathbf{u}_d^T \mathbf{1}\right)^2$. Kernel ECA has also been shown to be the projection that optimally preserves the length of the data mean vector in $\mathcal{F}$ [11].

After sorting the eigenvectors based on the size of either the eigenvalues or the entropy values, the $d$-th dimension of a sample $\mathbf{x}$ in the kernel subspace is obtained by:

$$\tilde{\phi}_d(\mathbf{x}) = \frac{1}{\sqrt{\lambda_d}}\,\mathbf{u}_d^T \mathbf{k}_x, \qquad (2)$$

where $\mathbf{k}_x \in \mathbb{R}^N$ is a vector having elements $[\mathbf{k}_x]_i = \kappa(\mathbf{x}_i, \mathbf{x})$. Note that the use of such an explicit mapping preserves the discriminative power of the kernel space, since a linear classifier on the data representations in the kernel subspace can successfully classify all training samples.
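To make the mappings in (1) and (2) concrete, the following minimal numpy sketch (all function names are illustrative, not from the paper) computes the uncentered kernel subspace of a training set and maps a new sample into it:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    # Pairwise Gaussian kernel values between the rows of X and the rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kernel_subspace(K, n_dims):
    # Spectral mapping of Eq. (1): the d-th coordinate of the i-th training
    # sample is sqrt(lambda_d) * U[i, d], with eigenvalues sorted descending.
    lam, U = np.linalg.eigh(K)              # eigh returns ascending order
    order = np.argsort(lam)[::-1][:n_dims]
    lam, U = lam[order], U[:, order]
    return lam, U, U * np.sqrt(lam)         # training embeddings: N x n_dims

def map_new_sample(k_x, lam, U):
    # Out-of-sample mapping of Eq. (2): phi_d(x) = u_d^T k_x / sqrt(lambda_d).
    return (U.T @ k_x) / np.sqrt(lam)
```

As a sanity check, applying the out-of-sample formula (2) to the kernel vector of a training sample reproduces the corresponding row of the training embedding given by (1).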

When a lower-dimensional subspace of $\mathcal{F}$ is sought, the criterion for selecting an eigen-pair $(\lambda_d, \mathbf{u}_d)$ is defined in a generative manner: minimizing the reconstruction error of the kernel matrix leads to selecting the eigen-pairs corresponding to the maximal eigenvalues in the case of kernel PCA, while maximizing the entropy of the data distribution leads to selecting the eigen-pairs corresponding to the maximal entropy values in the case of kernel ECA.

## III Class Mean Vector Component Analysis

Since the data representations in the kernel space form classes which are linearly separable, we make the assumption that the classes in this space are unimodal. The proposed Class Mean Vector Component Analysis (CMVCA) expresses the distance between classes $k$ and $m$ by:

$$d(k, m) = \left\|\boldsymbol{\mu}_k - \boldsymbol{\mu}_m\right\|_2^2, \qquad (3)$$

where $\boldsymbol{\mu}_k = \frac{1}{N_k}\sum_{\mathbf{x}_i \in \mathcal{S}_k} \phi(\mathbf{x}_i)$ is the mean vector of class $k$ in $\mathcal{F}$. Since $d(k,m)$ is calculated by using elements of the kernel matrix $\mathbf{K}$, we exploit the spectral decomposition of $\mathbf{K}$ and express the mean vectors in the effective kernel subspace, i.e., with their $d$-th dimension equal to:

$$[\tilde{\boldsymbol{\mu}}_k]_d = \sqrt{\lambda_d}\,\mathbf{u}_d^T \mathbf{e}_k, \qquad (4)$$

where $\mathbf{e}_k \in \mathbb{R}^N$ is the indicator vector for class $k$, having elements equal to $\frac{1}{N_k}$ for $\mathbf{x}_i \in \mathcal{S}_k$, and $0$ otherwise. Then, $d(k,m)$ takes the form:

$$d(k, m) = \sum_{d=1}^{N} \lambda_d \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e}_m)\right)^2. \qquad (5)$$

From the above, it can be seen that the eigen-pairs of $\mathbf{K}$ maximally contributing to the distance between the two class means are those with a high eigenvalue $\lambda_d$ and an eigenvector $\mathbf{u}_d$ angularly aligned to the vector $\mathbf{e}_k - \mathbf{e}_m$.

We express the weighted pair-wise distance between all classes in $\mathcal{F}$ by:

$$\mathcal{D} = \sum_{k=1}^{C}\sum_{m=1}^{C} \frac{N_k N_m}{N^2}\, d(k, m) = \sum_{d=1}^{N} \lambda_d w_d, \qquad (6)$$

where each class contributes proportionally to its cardinality and

$$w_d = \sum_{k=1}^{C}\sum_{m=1}^{C} \frac{N_k N_m}{N^2} \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e}_m)\right)^2 \qquad (7)$$

expresses the weighted alignment of the eigenvector $\mathbf{u}_d$ to all possible differences of class indicator vectors.

To define the subspace of the kernel space that maximally preserves the pair-wise distances between the class means in the kernel space, we keep the set of eigen-pairs $\mathcal{I}$ minimizing:

$$\mathcal{D} - \sum_{d \in \mathcal{I}} \lambda_d w_d, \qquad (8)$$

i.e., the eigen-pairs with the largest values $\lambda_d w_d$.

Thus, in contrast to (uncentered) kernel PCA and kernel ECA, which select the eigen-pairs corresponding to the maximal eigenvalues or entropy values, respectively, for an eigen-pair to be selected by CMVCA both the eigenvalue $\lambda_d$ needs to be high and the corresponding eigenvector $\mathbf{u}_d$ needs to be angularly aligned to the differences of pairs of class indicator vectors.
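As an illustration, the CMVCA scores $\lambda_d w_d$ of (6)-(7) can be computed directly from the kernel matrix and the class labels; the numpy sketch below (illustrative names, not the authors' code) weights each pair of classes by $N_k N_m / N^2$:

```python
import numpy as np

def cmvca_scores(K, labels):
    # CMVCA score of each eigen-pair: lambda_d * w_d, where w_d measures the
    # weighted alignment of u_d with all differences of class indicator vectors.
    N = len(labels)
    classes = np.unique(labels)
    lam, U = np.linalg.eigh(K)
    order = np.argsort(lam)[::-1]
    lam, U = lam[order], U[:, order]
    # indicator vectors e_k: 1/N_k on members of class k, 0 elsewhere
    E = np.stack([(labels == c) / np.sum(labels == c) for c in classes])
    p = np.array([np.mean(labels == c) for c in classes])   # N_k / N
    w = np.zeros(N)
    for a in range(len(classes)):
        for b in range(len(classes)):
            proj = U.T @ (E[a] - E[b])
            w += p[a] * p[b] * proj ** 2
    return lam * w, lam, U
```

Summing the returned scores over all dimensions recovers the weighted pair-wise distance $\mathcal{D}$ of (6), which can also be computed directly as $\sum_{k,m} \frac{N_k N_m}{N^2} (\mathbf{e}_k - \mathbf{e}_m)^T \mathbf{K} (\mathbf{e}_k - \mathbf{e}_m)$.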

### III-A CMVCA preserves the class means to total mean distances

In the above we defined CMVCA as the method preserving the pair-wise distances between class means in $\mathcal{F}$. Considering the weighted distance value of dimension $d$ we have:

$$\lambda_d w_d = \lambda_d \sum_{k=1}^{C}\sum_{m=1}^{C} \frac{N_k N_m}{N^2} \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e}_m)\right)^2 = 2\lambda_d \sum_{k=1}^{C} \frac{N_k}{N} \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e})\right)^2, \qquad (9)$$

where $\mathbf{e} \in \mathbb{R}^N$ is a vector having elements equal to $\frac{1}{N}$. In the above we considered that $\sum_{k=1}^{C} \frac{N_k}{N} = 1$ and that $\sum_{k=1}^{C} \frac{N_k}{N}\mathbf{e}_k = \mathbf{e}$. Thus, the eigen-pairs selected by minimizing the criterion in (8) are those preserving the distances between the class means and the total mean in $\mathcal{F}$.
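The identity underlying (9), namely that the weighted pair-wise distances between class means equal twice the weighted distances of the class means to the total mean, holds for any set of vectors and weights summing to one; it can be checked numerically on arbitrary (hypothetical) class means:

```python
import numpy as np

def pairwise_vs_total_mean(mus, p):
    # Weighted pairwise distances between class means, and twice the weighted
    # distances of the class means to the total mean mu = sum_k p_k mu_k.
    mu = p @ mus
    C = len(p)
    pairwise = sum(p[a] * p[b] * np.sum((mus[a] - mus[b]) ** 2)
                   for a in range(C) for b in range(C))
    to_mean = 2.0 * sum(p[k] * np.sum((mus[k] - mu) ** 2) for k in range(C))
    return pairwise, to_mean
```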

### III-B CMVCA as the Euclidean divergence between the class probability density functions

Let us assume that the data forming $\mathcal{S}_k$ and $\mathcal{S}_m$ are drawn from the probability density functions $p_k(\mathbf{x})$ and $p_m(\mathbf{x})$, respectively. The Euclidean divergence between these two probability density functions is given by:

$$D_E(p_k, p_m) = \int p_k^2(\mathbf{x})\, d\mathbf{x} - 2\int p_k(\mathbf{x})\, p_m(\mathbf{x})\, d\mathbf{x} + \int p_m^2(\mathbf{x})\, d\mathbf{x}. \qquad (10)$$

Given the observations of these two probability density functions in $\mathcal{S}_k$ and $\mathcal{S}_m$, $D_E(p_k, p_m)$ can be estimated using the Parzen window method [12, 13]. Let $\kappa_\sigma(\mathbf{x}, \mathbf{x}_i)$ be the Gaussian kernel centered at $\mathbf{x}_i$ with width $\sigma$. Then, we have:

$$\hat{D}_E(p_k, p_m) = \mathbf{e}_k^T \mathbf{K}\, \mathbf{e}_k - 2\,\mathbf{e}_k^T \mathbf{K}\, \mathbf{e}_m + \mathbf{e}_m^T \mathbf{K}\, \mathbf{e}_m = (\mathbf{e}_k - \mathbf{e}_m)^T \mathbf{K}\, (\mathbf{e}_k - \mathbf{e}_m), \qquad (11)$$

or, expressing it using the eigen-decomposition of $\mathbf{K}$:

$$\hat{D}_E(p_k, p_m) = \sum_{d=1}^{N} \lambda_d \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e}_m)\right)^2. \qquad (12)$$

Note here that the estimated Euclidean divergence between $p_k$ and $p_m$ takes the same form as the distance of the class mean vectors of classes $k$ and $m$ in (5). Thus, $\mathcal{D}$ in (6) can be expressed as:

$$\mathcal{D} = \sum_{k=1}^{C}\sum_{m=1}^{C} \frac{N_k N_m}{N^2}\, \hat{D}_E(p_k, p_m). \qquad (13)$$

From the above, it can be seen that the dimensions minimizing the criterion in (8) are those optimally preserving the weighted pair-wise Euclidean divergence between the probability density functions of the classes in the input space. Interestingly, exploiting the psd property of the kernel matrix, the analysis in [14], based on the expected value of the kernel convolution operator, shows that the Parzen window estimator can be replaced by any psd kernel function.

### III-C CMVCA in terms of uncentered PCA projections

Let us denote by $\mathbf{v}_d$ the $d$-th eigenvector of the scatter matrix $\mathbf{S} = \mathbf{\Phi}\mathbf{\Phi}^T$, where $\mathbf{\Phi} = [\phi(\mathbf{x}_1), \dots, \phi(\mathbf{x}_N)]$. $\mathbf{v}_d$ is in essence a projection vector defined by applying uncentered kernel PCA on the input vectors. By using the connection between the eigenvectors of $\mathbf{S}$ and $\mathbf{K}$, i.e. $\mathbf{v}_d = \frac{1}{\sqrt{\lambda_d}}\mathbf{\Phi}\mathbf{u}_d$ [15], we have:

$$d(k, m) = \sum_{d=1}^{N} \lambda_d \left(\mathbf{u}_d^T (\mathbf{e}_k - \mathbf{e}_m)\right)^2 = \sum_{d=1}^{N} \left(\mathbf{v}_d^T (\boldsymbol{\mu}_k - \boldsymbol{\mu}_m)\right)^2. \qquad (14)$$

Using $\mathbf{v}_d^T(\boldsymbol{\mu}_k - \boldsymbol{\mu}_m) = \|\mathbf{v}_d\|_2\, \|\boldsymbol{\mu}_k - \boldsymbol{\mu}_m\|_2 \cos\theta_{dkm}$, we get:

$$d(k, m) = \left\|\boldsymbol{\mu}_k - \boldsymbol{\mu}_m\right\|_2^2 \sum_{d=1}^{N} \cos^2\theta_{dkm}. \qquad (15)$$

Since $\|\mathbf{v}_d\|_2 = 1$, the contribution of uncentered kernel PCA axis $\mathbf{v}_d$ to $d(k,m)$ is determined by the cosine of the angle between $\mathbf{v}_d$ and $\boldsymbol{\mu}_k - \boldsymbol{\mu}_m$, in the sense that the axes which are most angularly aligned with $\boldsymbol{\mu}_k - \boldsymbol{\mu}_m$ contribute the most. This result adds to the insight provided in [16, 11] and defines CMVCA in terms of the projections obtained by applying uncentered kernel PCA on the input data.

### III-D Connection between CMVCA and KDA

To analyze the connection between CMVCA and Kernel Discriminant Analysis (KDA), we define the within-class scatter matrix:

$$\mathbf{S}_w = \sum_{k=1}^{C} \sum_{\mathbf{x}_i \in \mathcal{S}_k} \left(\phi(\mathbf{x}_i) - \boldsymbol{\mu}_k\right)\left(\phi(\mathbf{x}_i) - \boldsymbol{\mu}_k\right)^T \qquad (16)$$

and the between-class scatter matrix:

$$\mathbf{S}_b = \sum_{k=1}^{C} N_k \left(\boldsymbol{\mu}_k - \boldsymbol{\mu}\right)\left(\boldsymbol{\mu}_k - \boldsymbol{\mu}\right)^T, \qquad (17)$$

where $\boldsymbol{\mu}$ is the total mean of the data in $\mathcal{F}$.

The total scatter matrix is then given by $\mathbf{S}_t = \mathbf{S}_w + \mathbf{S}_b$, i.e.:

$$\mathbf{S}_t = \sum_{i=1}^{N} \left(\phi(\mathbf{x}_i) - \boldsymbol{\mu}\right)\left(\phi(\mathbf{x}_i) - \boldsymbol{\mu}\right)^T. \qquad (18)$$

Using the above scatter matrices, KDA and its variants [17, 18] determine the eigenvectors maximizing the Rayleigh quotient:

$$\mathcal{J}(\mathbf{w}) = \frac{\mathbf{w}^T \mathbf{S}_b \mathbf{w}}{\mathbf{w}^T \mathbf{S}_w \mathbf{w}}, \qquad (19)$$

leading to at most $C-1$ axes, which are the eigenvectors corresponding to the positive eigenvalues of the generalized eigen-problem $\mathbf{S}_b \mathbf{w} = \lambda\, \mathbf{S}_w \mathbf{w}$.

Here we are interested in the discrimination power, in terms of the KDA criterion, of the axes defined from the spectral decomposition of $\mathbf{K}$. Expressing the above projections based on the eigenvectors of $\mathbf{K}$ and assuming the data to be centered, i.e. $\boldsymbol{\mu} = \mathbf{0}$, the Rayleigh quotient for axis $d$ is equal to:

$$\mathcal{J}(\mathbf{v}_d) = \frac{\mathbf{v}_d^T \mathbf{S}_b \mathbf{v}_d}{\mathbf{v}_d^T \mathbf{S}_w \mathbf{v}_d} = \frac{\lambda_d \sum_{k=1}^{C} N_k \left(\mathbf{u}_d^T \mathbf{e}_k\right)^2}{\lambda_d - \lambda_d \sum_{k=1}^{C} N_k \left(\mathbf{u}_d^T \mathbf{e}_k\right)^2} = \frac{\sum_{k=1}^{C} N_k \left(\mathbf{u}_d^T \mathbf{e}_k\right)^2}{1 - \sum_{k=1}^{C} N_k \left(\mathbf{u}_d^T \mathbf{e}_k\right)^2}. \qquad (20)$$

The criterion of CMVCA from (6) and (9) for axis $d$ becomes:

$$\lambda_d w_d = 2\lambda_d \sum_{k=1}^{C} \frac{N_k}{N} \left(\mathbf{u}_d^T \mathbf{e}_k\right)^2, \qquad (21)$$

since for centered data $\mathbf{u}_d^T \mathbf{e} = 0$ for all $d$ with $\lambda_d > 0$.

Thus, while in CMVCA an eigen-pair contributes to the Rayleigh quotient based on both the size of $\lambda_d$ and the angular alignment between $\mathbf{u}_d$ and the class indicator vectors $\mathbf{e}_k$, the criterion of KDA selects dimensions based only on the angular alignment between $\mathbf{u}_d$ and the class indicator vectors. Note that (20) also gives new insights on why the KDA criterion restricts the dimensionality of the produced subspace by the number of classes. That is, since by definition the eigenvectors $\mathbf{u}_d$ form an orthogonal basis, the number of eigenvectors that can be angularly aligned to the class indicator vectors is restricted by the number of classes $C$, which is equal to the rank of the between-class scatter matrix for uncentered data. We will exploit this observation in Section IV to define a discriminative version of CMVCA.

### III-E CMVCA on approximate kernel subspaces

In cases where the cardinality $N$ of $\mathcal{S}$ is prohibitive for applying kernel methods, approximate kernel matrix spectral analysis can be used. Probably the most widely used approach is based on the Nyström method, which first chooses a set of $n \ll N$ reference vectors to calculate the kernel matrix $\mathbf{A} \in \mathbb{R}^{n \times n}$ between the reference vectors and the matrix $\mathbf{B} \in \mathbb{R}^{N \times n}$ containing the kernel function values between the training and reference vectors. In order to determine the reference vectors, two approaches have been proposed. The first is based on selecting columns of $\mathbf{K}$ using random or probabilistic sampling [19, 20], while the second determines the reference vectors by applying clustering on the training vectors [21, 22].

The Nyström-based approximation of $\mathbf{K}$ is given by:

$$\tilde{\mathbf{K}} = \mathbf{B}\mathbf{A}^{-1}\mathbf{B}^T. \qquad (22)$$

When the ranks of $\tilde{\mathbf{K}}$ and $\mathbf{K}$ match, (22) gives an exact calculation of $\mathbf{K}$ and the resulting subspace is the same as the one defined in Section II. Eigen-decomposition of $\mathbf{A}$ leads to $\mathbf{A} = \mathbf{U}_A \mathbf{\Lambda}_A \mathbf{U}_A^T$, and $\mathcal{S}$ is represented in the approximate kernel subspace by:

$$\tilde{\mathbf{\Phi}} = \mathbf{B}\mathbf{U}_A \mathbf{\Lambda}_A^{-1/2}, \qquad (23)$$

so that $\tilde{\mathbf{K}} = \tilde{\mathbf{\Phi}}\tilde{\mathbf{\Phi}}^T$. When $\tilde{\mathbf{\Phi}}$ is an $r$-rank matrix, the matrices $\tilde{\mathbf{\Phi}}\tilde{\mathbf{\Phi}}^T$ and $\tilde{\mathbf{\Phi}}^T\tilde{\mathbf{\Phi}}$ have the same $r$ leading eigenvalues [15]. The latter is an $n \times n$ matrix and, thus, applying eigen-analysis to it can highly reduce the computational complexity required for the calculation of the eigenvalues of the approximate kernel matrix in (22). After the calculation of $\tilde{\mathbf{\Phi}}$, the leading eigenvectors and the corresponding eigenvalues of the approximate kernel matrix correspond to the left singular vectors and the squared singular values of $\tilde{\mathbf{\Phi}}$.
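A minimal numpy sketch of the Nyström pipeline of (22)-(23), assuming a Gaussian kernel (function names are illustrative): eigen-analysis is performed only on the small $n \times n$ matrix among the reference vectors, and the SVD of the resulting factor yields the approximate eigen-pairs.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nystrom_eigs(X, refs, sigma, r):
    # Nystrom approximation K_tilde = B A^{-1} B^T (Eq. (22)); eigen-analysis
    # is performed on the small n x n matrix A only.
    A = gaussian_kernel(refs, refs, sigma)
    B = gaussian_kernel(X, refs, sigma)
    lamA, UA = np.linalg.eigh(A)
    keep = np.argsort(lamA)[::-1][:r]
    lamA, UA = lamA[keep], UA[:, keep]
    Phi = B @ UA / np.sqrt(lamA)            # Phi @ Phi.T == K_tilde (Eq. (23))
    Uo, s, _ = np.linalg.svd(Phi, full_matrices=False)
    return Uo, s ** 2                       # approx. eigenvectors / eigenvalues
```

When the reference set equals the full training set, the approximation is exact and the recovered eigenvalues coincide with those of $\mathbf{K}$.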

### III-F CMVCA on randomized kernel subspaces

Another approach proposed for making the use of kernel methods in big data feasible is based on random nonlinear mappings [23]. A nonlinear mapping $\mathbf{z}(\cdot): \mathbb{R}^D \rightarrow \mathbb{R}^n$ is defined such that $\mathbf{z}(\mathbf{x}_i)^T \mathbf{z}(\mathbf{x}_j) \simeq \kappa(\mathbf{x}_i, \mathbf{x}_j)$, or, by using the entire dataset, $\mathbf{Z}^T\mathbf{Z} \simeq \mathbf{K}$, where $\mathbf{Z} = [\mathbf{z}(\mathbf{x}_1), \dots, \mathbf{z}(\mathbf{x}_N)]$. Singular value decomposition of $\mathbf{Z}$ can lead to a basis in a subspace of $\mathcal{F}$. That is, the right singular vectors of $\mathbf{Z}$ corresponding to the non-zero singular values define the axes minimizing the criterion in (8), calculated on $\mathbf{Z}^T\mathbf{Z}$ using its eigen-pairs in place of those of $\mathbf{K}$.
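For the Gaussian kernel, the random mapping $\mathbf{z}(\cdot)$ can be instantiated with random Fourier features [23]; a sketch with illustrative names:

```python
import numpy as np

def random_fourier_features(X, n_feats, sigma, rng):
    # Random Fourier features (Rahimi & Recht): z(x)^T z(y) approximates the
    # Gaussian kernel exp(-||x - y||^2 / (2 sigma^2)).
    W = rng.normal(scale=1.0 / sigma, size=(X.shape[1], n_feats))
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_feats)
    return np.sqrt(2.0 / n_feats) * np.cos(X @ W + b)
```

The approximation error of $\mathbf{Z}^T\mathbf{Z}$ with respect to $\mathbf{K}$ decreases at a rate of roughly $1/\sqrt{n}$ in the number of random features.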

## IV Class Mean Vector Discriminant Analysis

An interesting extension of CMVCA is motivated by its connection with KDA, obtained by following the analysis in Subsection III-D. By comparing (20) and (21), we see that in the case where all non-zero eigenvalues of $\mathbf{K}$ are equal, the scores calculated for the kernel subspace dimensions by CMVCA and KDA induce the same ranking. This situation arises when the data is whitened, i.e. when $\mathbf{S}_t = \mathbf{I}$, where $\mathbf{I}$ is the identity matrix. Interestingly, the information needed for whitening can be directly obtained from the eigenanalysis of $\mathbf{K}$, since there is a direct connection between the eigenvalues and eigenvectors of $\mathbf{K}$ and $\mathbf{S}_t$.
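Whitening at the kernel matrix level can be sketched as follows (illustrative names, not the authors' code): replacing every non-zero eigenvalue of $\mathbf{K}$ by one yields a matrix whose eigen-spectrum is flat on its span, as assumed above.

```python
import numpy as np

def whiten_kernel(K, tol=1e-10):
    # Replace every non-zero eigenvalue of K by 1: the whitened kernel matrix
    # is the orthogonal projector onto the span of K.
    lam, U = np.linalg.eigh(K)
    Ur = U[:, lam > tol]
    return Ur @ Ur.T
```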

Given a kernel matrix $\mathbf{K}_w$ corresponding to whitened data, application of CMVCA requires its eigen-decomposition for calculating the eigenvectors and the corresponding eigenvalues to be used for weighting the dimensions of the kernel subspace based on (6). $\mathbf{K}_w$ has all its non-zero eigenvalues equal to $1$ and, thus, its eigenvectors form the axes of an arbitrary basis of its span, i.e.:

$$\mathbf{K}_w = \mathbf{U}_w \mathbf{U}_w^T. \qquad (24)$$

Such a basis can be efficiently calculated by applying an orthogonalization process (e.g. based on Cholesky decomposition) starting from a vector belonging to the span of $\mathbf{K}_w$ and, thus, such a vector can be used for generating the basis.

| Dataset | | | |
|---|---|---|---|
| MNIST-100 | | | |
| AR | | | |
| scene | | | |
| MIT-indoor | | | |

Moreover, the class indicator vectors $\mathbf{e}_k$, $k = 1, \dots, C$, belong to the span of $\mathbf{K}_w$ and are mutually orthogonal. Thus, after normalization, they can be selected to form the first $C$ eigenvectors of $\mathbf{K}_w$. Note that from (21) it can be seen that these vectors contribute the most to the Rayleigh quotient criterion. To form the remaining basis vectors, we can apply an orthogonalization process on the subspaces determined by each class indicator vector $\mathbf{e}_k$, each generating a basis appended by zeros for the remaining dimensions, leading to $N$ eigenvectors in total.

## V Experiments

In our experiments we used four datasets, namely the MNIST [24], AR [25], scene [26] and MIT indoor [27] datasets. For the MNIST dataset, we used the first 100 training samples per class to form the training set (MNIST-100) and we report performance on the entire test set. For the remaining datasets, we perform ten experiments by randomly keeping half of the samples per class for training and the remaining for evaluation, and we report the average performance over the ten experiments. We used the vectorized pixel intensity values for representing images in the MNIST and AR datasets. For the scene and MIT indoor datasets we used deep features generated by average pooling over the spatial dimensions of the last convolutional layer of the VGG network [28] trained on the ILSVRC2012 database, and we follow the approximate kernel and randomized kernel approaches. Details of these datasets are shown in Table I. In all experiments we used the Gaussian kernel function and set the width value equal to the mean pair-wise distance between all training samples. In order to illustrate the effect of using different subspace dimensionalities, we used the nearest class centroid classifier for all possible subspaces produced by each of the methods.

| Dataset | kPCA | kECA | CMVCA | KDA | CMVDA |
|---|---|---|---|---|---|
| MNIST-100 | | | | | |
| AR | | | | | |
| scene (N) | | | | | |
| MIT Indoor (N) | | | | | |
| scene (R) | | | | | |
| MIT Indoor (R) | | | | | |

Figure 1 illustrates the performance obtained by applying kernel PCA, kernel ECA, KDA and the proposed CMVCA, CMVDA and the CMVDA-R variant using the random basis of the whitened kernel effective space (Eq. (24)), as a function of the subspace dimensionality. CMVCA performs on par with kPCA and kECA on the MNIST-100 and AR datasets, while it outperforms them for small subspace dimensionalities on scene and MIT Indoor. On the scene dataset, CMVCA clearly outperforms kPCA and kECA, probably due to the unimodal structure of the classes obtained by using CNN features. CMVDA and KDA outperform kPCA, kECA and CMVCA using a small subspace dimensionality, while the performance obtained by applying the CMVDA-R variant gradually increases to match the performance of CMVDA when all the dimensions are used. For completeness, we provide the maximum performance obtained by each method in Table II.

Figure 2 illustrates the Rayleigh quotient values as a function of the dimensionality of the subspace produced by all methods for the AR and MIT indoor datasets. As can be seen, the Rayleigh quotient values of the subspaces obtained by applying the unsupervised methods are, as expected, low. The subspaces obtained by KDA lead to a high value, which gradually decreases as more dimensions are added. CMVDA also leads to subspaces with a high Rayleigh quotient value that is gradually reduced, similarly to KDA. Similar behaviors were observed for the rest of the datasets.

## VI Conclusion

In this paper, we proposed a component analysis method for kernel-based dimensionality reduction preserving the distances between the class means in the kernel space. Analysis of the proposed criterion shows that it also preserves the distances between the class means and the total mean in the kernel space, as well as the Euclidean divergence between the class probability density functions in the input space. Moreover, we showed that the proposed criterion, while expressing different properties, has relations to the criteria used in kernel principal component analysis and kernel discriminant analysis. The latter connection leads to a discriminant analysis version of the proposed method. The properties of the proposed approach were illustrated through experiments on real-world data.

## References

- [1] J. Yang, A. Frangi, J. Yang, D. Zhang, Z. Jin, and Z. Jin, “Kpca plus lda: a complete kernel fisher discriminant framework for feature extraction and recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 2, pp. 230–244, 2005.
- [2] K. Müller, S. Mika, G. Rätsch, K. Tsuda, and B. Schölkopf, “An introduction to kernel-based learning algorithms,” IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
- [3] J. Shi and J. Malik, “Normalized cuts and image segmentation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 8, pp. 888–905, 2000.
- [4] R. Jenssen, “Kernel entropy component analysis,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 32, no. 5, pp. 847–860, 2010.
- [5] F. Schleif and P. Tiño, “Indefinite proximity learning: A review,” Neural Computation, vol. 27, no. 10, pp. 2039–2096, 2015.
- [6] A. Gisbrecht and F. Schleif, “Metric and non-metric proximity transformations at linear costs,” Neurocomputing, vol. 167, pp. 643–657, 2015.
- [7] M. Balcan, A. Blum, and N. Srebro, “A theory of learning with similarity functions,” Machine Learning, vol. 72, no. 1–2, pp. 89–112, 2008.
- [8] V. Vapnik, Statistical Learning Theory. New York: Wiley, 1998.
- [9] N. Kwak, “Nonlinear Projection Trick in kernel methods: an alternative to the kernel trick,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 2113–2119, 2013.
- [10] N. Kwak, “Implementing kernel methods incrementally by incremental nonlinear projection trick,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 4003–4009, 2017.
- [11] R. Jenssen, “Mean vector component analysis for visualization and clustering of nonnegative data,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 10, pp. 1553–1564, 2013.
- [12] J. Principe, Information Theoretic Learning: Renyi Entropy and Kernel Perspectives. Springer, 2010.
- [13] C. Williams and M. Seeger, “The effect of the input density distribution on kernel based classifiers,” International Conference on Machine Learning, 2000.
- [14] R. Jenssen, “Kernel entropy component analysis: New theory and semi-supervised learning,” IEEE International Workshop on Machine Learning for Signal Processing, 2011.
- [15] N. Wermuth and H. Rüssmann, “Eigenanalysis of symmetrizable matrix products: a result with statistical applications,” Scandinavian Journal of Statistics, vol. 20, pp. 361–367, 1993.
- [16] J. Cadima and I. Jolliffe, “On relationships between uncentered and column-centered principal component analysis,” Pakistan Journal on Statistics, vol. 25, no. 4, pp. 473–503, 2009.
- [17] A. Iosifidis, A. Tefas, and I. Pitas, “On the optimal class representation in Linear Discriminant Analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, pp. 1491–1497, 2013.
- [18] A. Iosifidis, A. Tefas, and I. Pitas, “Kernel reference discriminant analysis,” Pattern Recognition Letters, vol. 49, pp. 85–91, 2014.
- [19] C. Williams and M. Seeger, “Using the Nyström method to speed up kernel machines,” Advances in Neural Information Processing Systems, pp. 682–688, 2001.
- [20] S. Kumar, M. Mohri, and A. Talwalkar, “Sampling techniques for the Nyström method,” Journal of Machine Learning Research, vol. 13, no. 1.
- [21] K. Zhang and J. Kwok, “Clustered Nyström method for large scale manifold learning and dimensionality reduction,” IEEE Transactions on Neural Networks, vol. 21, no. 10, pp. 1576–1587, 2010.
- [22] A. Iosifidis and M. Gabbouj, “Nyström-based approximate kernel subspace learning,” Pattern Recognition, vol. 57, pp. 190–197, 2016.
- [23] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in Neural Information Processing Systems, pp. 1177–1184, 2007.
- [24] Y. LeCun, L. Bottou, and Y. Bengio, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
- [25] A. Martinez and A. Kak, “PCA versus LDA,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 23, no. 2, pp. 228–233, 2001.
- [26] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” IEEE Conference on Computer Vision and Pattern Recognition, 2006.
- [27] A. Quattoni and A. Torralba, “Recognizing indoor scenes,” IEEE Conference on Computer Vision and Pattern Recognition, 2009.
- [28] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.