Probabilistic Class-Specific Discriminant Analysis
Abstract
In this paper we formulate a probabilistic model for class-specific discriminant subspace learning. The proposed model can naturally incorporate the multimodal structure of the negative class, which is neglected by existing methods. Moreover, it can be directly used to define a probabilistic classification rule in the discriminant subspace. We show that existing class-specific discriminant analysis methods are special cases of the proposed probabilistic model and, by casting them as probabilistic models, they can be extended to class-specific classifiers. We illustrate the performance of the proposed model, in comparison with that of related methods, in both verification and classification problems.
Index Terms— Class-Specific Discriminant Analysis, Multimodal data distributions, Verification, Classification.
I Introduction
Class-Specific Discriminant Analysis (CSDA) [1, 2, 3, 4, 5, 6] determines an optimal subspace suitable for verification problems, where the objective is the discrimination of the class of interest from the rest of the world. As an example, let us consider the person identification problem, either through face verification [1] or through exploiting movement information [7]. Different from person recognition, which is a multi-class classification problem assigning a sample (facial image or movement sequence) to a class in a predefined set of classes (person IDs in this case), person identification discriminates the person of interest from all other people.
While multi-class discriminant analysis models, like Linear Discriminant Analysis (LDA) and its variants [8, 9, 10], can be applied in such problems, they are inherently limited by the adopted class discrimination definition. That is, the maximal dimensionality of the resulting subspace is restricted by the number of classes, due to the rank of the between-class scatter matrix. In verification problems involving two classes, LDA and its variants lead to one-dimensional subspaces. On the other hand, CSDA, by expressing class discrimination using the out-of-class and intra-class scatter matrices, is able to define subspaces the dimensionality of which is restricted by the number of samples forming the smallest class (which is usually the class of interest) or the number of original space dimensions. By defining multiple discriminant dimensions, CSDA has been shown to achieve better class discrimination and better performance compared to LDA [2, 4, 5].
While the definition of class discrimination in CSDA and its variants based on the intra-class and out-of-class scatter matrices overcomes the limitations of LDA related to the dimensionality of the discriminant subspace, it overlooks the structure of the negative class. Since in practice samples forming the negative class belong to many classes, different from the positive one, it is expected that they will form subclasses, as illustrated in the example of Figure 1. Class discrimination as defined by CSDA and its variants disregards this structure.
In this paper, we formulate the class-specific discriminant analysis optimization problem based on a probabilistic model that incorporates the above-described structure of the negative class. We show that the optimization criterion used by standard CSDA and its variants corresponds to a special case of the proposed probabilistic model, while new discriminant subspaces can be obtained by allowing samples of the negative class to form subclasses automatically determined by applying (unsupervised) clustering techniques on the negative class data. Moreover, the use of the proposed probabilistic model for class-specific discriminant learning naturally leads to a classification rule in the discriminant subspace, something that is not possible when the standard CSDA criterion is considered.
II Related Work
Let us denote by $\{\mathbf{x}_i\}_{i=1}^{N_p}$ a set of $D$-dimensional vectors representing samples of the positive class and by $\{\mathbf{y}_j\}_{j=1}^{N_n}$ a set of vectors representing samples of the negative class. In the following we consider the linear class-specific subspace learning case; we describe how to perform nonlinear (kernel-based) class-specific subspace learning following the same processing steps in Section III-D. We would like to determine a linear projection $\mathbf{W} \in \mathbb{R}^{D \times d}$, mapping $\mathbf{x} \in \mathbb{R}^D$ to a $d$-dimensional subspace, i.e. $\mathbf{z} = \mathbf{W}^\top \mathbf{x}$, that enhances discrimination of the two classes.
Class-specific Discriminant Analysis [2] defines the projection matrix $\mathbf{W}$ as the one maximizing the following criterion:

$\mathcal{J}(\mathbf{W}) = \dfrac{D_o}{D_i}$,    (1)

where $D_o$ and $D_i$ are the out-of-class and intra-class distances defined as follows:

$D_o = \sum_{j=1}^{N_n} \left\| \mathbf{W}^\top (\mathbf{y}_j - \mathbf{m}) \right\|_2^2$    (2)

and

$D_i = \sum_{i=1}^{N_p} \left\| \mathbf{W}^\top (\mathbf{x}_i - \mathbf{m}) \right\|_2^2.$    (3)

$\mathbf{m}$ is the mean vector of the positive class, i.e. $\mathbf{m} = \frac{1}{N_p} \sum_{i=1}^{N_p} \mathbf{x}_i$. That is, it is assumed that the positive class is unimodal, and $\mathbf{W}$ is defined as the matrix mapping the positive class vectors as close as possible to the positive class mean vector, while mapping the negative class vectors as far as possible from it, in the reduced-dimensionality space $\mathbb{R}^d$.
The criterion in (1) can be expressed as:

$\mathcal{J}(\mathbf{W}) = \dfrac{\mathrm{tr}\left(\mathbf{W}^\top \mathbf{S}_o \mathbf{W}\right)}{\mathrm{tr}\left(\mathbf{W}^\top \mathbf{S}_i \mathbf{W}\right)}$,    (4)

where $\mathrm{tr}(\cdot)$ is the trace operator. $\mathbf{S}_o$ and $\mathbf{S}_i$ are the out-of-class and intra-class scatter matrices defined in the space $\mathbb{R}^D$:

$\mathbf{S}_o = \sum_{j=1}^{N_n} (\mathbf{y}_j - \mathbf{m})(\mathbf{y}_j - \mathbf{m})^\top$    (5)

and

$\mathbf{S}_i = \sum_{i=1}^{N_p} (\mathbf{x}_i - \mathbf{m})(\mathbf{x}_i - \mathbf{m})^\top.$    (6)

$\mathbf{W}$ is obtained by solving the generalized eigenanalysis problem $\mathbf{S}_o \mathbf{w} = \lambda \mathbf{S}_i \mathbf{w}$ and keeping the eigenvectors corresponding to the $d$ largest eigenvalues [11]. Here, we should note that, since the rank of $\mathbf{S}_i$ is at most $N_p$ (and considering that usually $N_p < D$), the dimensionality of the learnt subspace is restricted by the number of samples forming the positive class (or by the dimensionality $D$ of the original space), i.e. $d \le \min(N_p, D)$. In the case where $\mathbf{S}_i$ is singular, a regularized version of the above problem is solved by using $\tilde{\mathbf{S}}_i = \mathbf{S}_i + \epsilon \mathbf{I}$, where $\mathbf{I}$ is the identity matrix and $\epsilon$ is a small positive value.
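As an illustration, the construction of the two scatter matrices and the (regularized) generalized eigenanalysis described above can be sketched in a few lines of NumPy. The function name, toy dimensions, and the value of the regularization constant are our own choices, not part of the original formulation:

```python
import numpy as np

def csda(X_pos, X_neg, d, eps=1e-6):
    """Sketch of linear CSDA: maximize tr(W^T S_o W) / tr(W^T S_i W).

    X_pos: (N_p, D) positive-class samples; X_neg: (N_n, D) negative samples.
    Returns W of shape (D, d) holding the d leading generalized eigenvectors.
    """
    m = X_pos.mean(axis=0)                           # positive-class mean vector
    Xi = X_pos - m                                   # intra-class deviations
    Xo = X_neg - m                                   # out-of-class deviations
    S_i = Xi.T @ Xi + eps * np.eye(X_pos.shape[1])   # regularized intra-class scatter
    S_o = Xo.T @ Xo                                  # out-of-class scatter
    # generalized eigenproblem S_o w = lambda S_i w, solved via S_i^{-1} S_o
    evals, evecs = np.linalg.eig(np.linalg.solve(S_i, S_o))
    order = np.argsort(evals.real)[::-1]             # keep the d largest eigenvalues
    return evecs[:, order[:d]].real
```

On a toy problem where the negative samples lie far from the positive mean, the learnt directions map the positives close to the projected mean and the negatives far from it.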
A Spectral Regression [12] based solution of (4) has been proposed in [4, 5]. Let us denote by $\mathbf{w}$ an eigenvector of the generalized eigenanalysis problem with eigenvalue $\lambda$. By introducing as targets $\mathbf{t}$ the projections of the training vectors on $\mathbf{w}$, the original eigenanalysis problem can be transformed to an eigenanalysis problem expressed in terms of the positive and negative class indicator vectors, which encode the labels of the training vectors. Based on the above, the projection matrix optimizing (4) is obtained by applying a two-step process:

1. Solution of the transformed eigenanalysis problem, leading to the determination of a matrix $\mathbf{T} = [\mathbf{t}_1, \dots, \mathbf{t}_d]$, where $\mathbf{t}_k$ is the eigenvector corresponding to the $k$-th largest eigenvalue.

2. Calculation of $\mathbf{W}$ by regularized regression of the training data on the targets, i.e.:

$\mathbf{W} = \left(\mathbf{X}\mathbf{X}^\top + \epsilon \mathbf{I}\right)^{-1} \mathbf{X} \mathbf{T}$, where $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_{N_p}, \mathbf{y}_1, \dots, \mathbf{y}_{N_n}]$.    (7)
It has been shown in [13] that, by exploiting the structure of the scatter matrices, their generalized eigenvectors can be directly obtained using the labeling information of the training vectors. However, in that case the order of the eigenvectors is not related to their discrimination ability. Finally, it has been shown in [14] that the class-specific discriminant analysis problem in (1) can be cast as a low-rank regression problem in which the target vectors are the same as those defined in [13], providing a new proof of the equivalence between class-specific discriminant analysis and class-specific spectral regression.
After determining the data projection matrix $\mathbf{W}$, the training vectors are mapped to the discriminant subspace, i.e. $\mathbf{z}_i = \mathbf{W}^\top \mathbf{x}_i$. When a classification problem is considered, a classifier is trained using these representations. For example, a linear SVM is used in [5], trained on the vectors $|\mathbf{z}_i - \bar{\mathbf{z}}|$, where $\bar{\mathbf{z}}$ is the mean vector of the positive samples in the discriminant subspace and the absolute value operator is applied elementwise on the resulting vector. Due to the need of training an additional classifier, class-specific discriminant analysis models are usually employed in class-specific ranking settings, where test vectors are mapped to the discriminant subspace and are subsequently ordered based on their distance w.r.t. the positive class mean $\bar{\mathbf{z}}$.
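In a ranking setting, test time therefore only requires a projection and a sort by distance to the projected positive mean. A minimal sketch (the function and variable names are ours):

```python
import numpy as np

def rank_by_positive_mean(W, m, X_test):
    """Rank test vectors by their distance, in the discriminant subspace,
    to the projected positive-class mean; a smaller distance is stronger
    evidence for the class of interest."""
    z_mean = W.T @ m                          # positive-class mean in the subspace
    Z = X_test @ W                            # projected test vectors
    dists = np.linalg.norm(Z - z_mean[None, :], axis=1)
    order = np.argsort(dists)                 # best-ranked samples first
    return order, dists
```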
III Probabilistic Class-Specific Learning
In this section, we follow a probabilistic approach for defining a class-specific discrimination criterion that is able to encode subclass information of the negative class. We call the proposed method Probabilistic Class-Specific Discriminant Analysis (PCSDA). PCSDA assumes that there exists a data generation process for the positive class following the distribution:
$\mathbf{x} \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma}_p)$,    (8)

where $\mathbf{m}$ is the mean vector of the positive class and $\boldsymbol{\Sigma}_p$ is the covariance of the underlying data generation process. A (multimodal) negative class is formed by subclasses, the representations of which are drawn from the following distribution:

$\mathbf{u}_k \sim \mathcal{N}(\mathbf{m}, \boldsymbol{\Sigma}_s).$    (9)

Here $\mathbf{u}_k$, $k=1,\dots,K$, denote the representations of the negative subclasses and $\boldsymbol{\Sigma}_s$ is the covariance of the negative subclass generation process. Samples of the negative subclasses are drawn from the following distribution:

$\mathbf{y} \mid k \sim \mathcal{N}(\mathbf{u}_k, \boldsymbol{\Sigma}_n)$,    (10)

where $\boldsymbol{\Sigma}_n$ is the covariance of the underlying data generation process for each subclass of the negative class. As we will show later, a special case of the above definition for the negative class is when each subclass is formed by one sample, leading to the class discrimination criterion used by the standard CSDA and its variants.
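To make the assumed three-level generation process concrete, the following toy sampler draws positives around the class mean, negative subclass representations around the same mean, and negative samples around their subclass representations. All names and covariance values below are illustrative assumptions, not quantities from the model:

```python
import numpy as np

def sample_pcsda(m, Sp, Ss, Sn, N_p, K, N_k, seed=0):
    """Draw a toy dataset from the assumed PCSDA generative hierarchy:
    positives ~ N(m, Sp); subclass means u_k ~ N(m, Ss);
    negatives of subclass k ~ N(u_k, Sn)."""
    rng = np.random.default_rng(seed)
    X_pos = rng.multivariate_normal(m, Sp, size=N_p)      # positive samples
    U = rng.multivariate_normal(m, Ss, size=K)            # subclass representations
    X_neg = np.vstack([rng.multivariate_normal(u, Sn, size=N_k) for u in U])
    return X_pos, U, X_neg
```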
Given a set of positive samples, the probability of correct assignment under our model (under the assumption that the data are i.i.d.) is equal to:

$P_p = \prod_{i=1}^{N_p} \mathcal{N}(\mathbf{x}_i \mid \mathbf{m}, \boldsymbol{\Sigma}_p).$    (11)

Let us assume that the negative class is formed by $K$ subclasses, each having a cardinality of $N_k$. We will show how to relax this assumption in the next subsection. Then the probability of assigning each of the negative samples to the corresponding subclass is equal to:

$P_n = \prod_{k=1}^{K} \mathcal{N}(\mathbf{u}_k \mid \mathbf{m}, \boldsymbol{\Sigma}_s) \prod_{j \in \mathcal{C}_k} \mathcal{N}(\mathbf{y}_j \mid \mathbf{u}_k, \boldsymbol{\Sigma}_n)$,    (12)

where $\mathcal{C}_k$ denotes the set of indices of the negative samples assigned to the $k$-th subclass.
By centering the training samples to the positive class mean (this can always be done by setting $\mathbf{m} = \mathbf{0}$) we have:

$P_p = (2\pi)^{-\frac{N_p D}{2}} \, |\boldsymbol{\Sigma}_p|^{-\frac{N_p}{2}} \exp\left(-\tfrac{1}{2}\,\mathrm{tr}\left(\boldsymbol{\Sigma}_p^{-1}\mathbf{S}_p\right)\right)$    (13)

and

(14)

In the above, $\mathbf{S}_p = \sum_{i=1}^{N_p} \mathbf{x}_i \mathbf{x}_i^\top$ is the scatter matrix of the positive class, $\mathbf{S}_k = \sum_{j \in \mathcal{C}_k} (\mathbf{y}_j - \mathbf{m}_k)(\mathbf{y}_j - \mathbf{m}_k)^\top$ is the within-subclass scatter matrix of the $k$-th subclass of the negative class, and $\mathbf{S}'_k$ is the scatter matrix of the $k$-th subclass w.r.t. the positive class, where $\mathbf{m}_k = \frac{1}{N_k} \sum_{j \in \mathcal{C}_k} \mathbf{y}_j$ is the mean vector of the $k$-th subclass of the negative class. Since the assignment of the negative samples to subclasses is not provided by the labels used during training, we define them by applying a clustering method (e.g. $K$-Means) on the negative class vectors.
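Since the subclass labels are obtained by clustering, the whole preprocessing step can be sketched as plain Lloyd's K-Means followed by the scatter-matrix computation. This is a NumPy illustration; the farthest-point initialization and the weighting of the subclass-mean scatter by the subclass cardinality are our own choices:

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain Lloyd's K-Means (numpy-only sketch) to split the negative class."""
    rng = np.random.default_rng(seed)
    C = [X[rng.integers(len(X))]]            # farthest-point initialization
    for _ in range(K - 1):
        d2 = np.min(((X[:, None] - np.array(C)[None]) ** 2).sum(-1), axis=1)
        C.append(X[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        newC = np.array([X[labels == k].mean(0) if np.any(labels == k) else C[k]
                         for k in range(K)])
        if np.allclose(newC, C):             # converged
            break
        C = newC
    return labels, C

def subclass_scatters(X_neg, labels, K, m):
    """Within-subclass scatter S_w and subclass-mean scatter w.r.t. the
    positive-class mean m (cardinality weighting is an assumption here)."""
    D = X_neg.shape[1]
    S_w = np.zeros((D, D))
    S_b = np.zeros((D, D))
    for k in range(K):
        Xk = X_neg[labels == k]
        if len(Xk) == 0:                     # skip empty clusters
            continue
        mk = Xk.mean(0)
        S_w += (Xk - mk).T @ (Xk - mk)
        S_b += len(Xk) * np.outer(mk - m, mk - m)
    return S_w, S_b
```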
III-A Training phase
The parameters of the proposed PCSDA are the covariance matrices $\boldsymbol{\Sigma}_p$, $\boldsymbol{\Sigma}_s$ and $\boldsymbol{\Sigma}_n$ defining the data generation processes for the positive and negative classes. These parameters are estimated by maximizing the (log) probability of correct assignment of the training vectors:
(15) 
where
(16) 
and
(17)  
Here, the total scatter matrices of the negative subclasses are obtained by summing the corresponding within-subclass and subclass-mean scatter matrices over all $K$ subclasses, i.e. $\mathbf{S}_w = \sum_{k=1}^{K} \mathbf{S}_k$ and $\mathbf{S}_b = \sum_{k=1}^{K} \mathbf{S}'_k$.
By substituting (16) and (17) in (15) the optimization function takes the form:
(18)  
Setting the derivatives of the optimization function with respect to $\boldsymbol{\Sigma}_p$, $\boldsymbol{\Sigma}_s$ and $\boldsymbol{\Sigma}_n$ to zero leads to:
(19)  
(20)  
(21) 
Combining (20) and (21) we get:
(22) 
By combining (21) and (22) we can define the total scatter matrix for the negative class:
(23) 
Let us denote by $\mathbf{V}$ the matrix having as columns the eigenvectors of the generalized eigenanalysis problem:

(24)

The columns of $\mathbf{V}$ simultaneously diagonalize the two matrices of (24), with the corresponding generalized eigenvalues forming a diagonal matrix. Using this diagonalization, we have:

(25)

Let us also consider the solution of the associated standard eigenanalysis problem. From (19) and (25):

(26)

where a diagonal matrix scales the eigenvector columns accordingly. Combining (24), (25) and (26):

(27)

Thus, the covariance parameters of the model can be computed by solving an eigenanalysis problem defined on the input vectors.
In the above we made the assumption that the numbers of samples forming the negative subclasses are equal. This assumption can be relaxed following one of the following approaches. After assigning all negative samples to subclasses and calculating the conditional distribution of each subclass, one can sample an equal number of vectors from these distributions. Alternatively, one can calculate the total within-subclass scatter matrix of the negative class and define the subclass scatter matrices by distributing it equally over the subclasses. Note that for the calculation of the model's parameters (Eqs. (19) and (23)), only the total scatter matrices are used.
III-B Test phase
After the estimation of the model's parameters, a new sample represented by a vector $\mathbf{x}_t$ can be evaluated. The posterior probabilities of the positive and negative classes are given by:
(28) 
(29) 
The a priori class probabilities $P(\mathcal{C}_+)$ and $P(\mathcal{C}_-)$ can be calculated from the proportions of the positive and negative samples in the training set, i.e. $P(\mathcal{C}_+) = \frac{N_p}{N_p + N_n}$ and $P(\mathcal{C}_-) = \frac{N_n}{N_p + N_n}$. Depending on the problem at hand, it may sometimes be preferable to consider equiprobable classes, i.e. $P(\mathcal{C}_+) = P(\mathcal{C}_-) = 0.5$, leading to the maximum likelihood classification case. The class-conditional probabilities of $\mathbf{x}_t$ are given by:
(30)  
and
(31)  
where the representations are expressed in the discriminant subspace. In the case where we want a lower-dimensional representation, we keep the dimensions corresponding to the largest diagonal elements of the diagonalized covariance matrix.
Combining (28)–(31), the ratio of class posterior probabilities is equal to:
(32) 
and can also be used to define a classification rule:
(33)  
classifying $\mathbf{x}_t$ to the positive class if the ratio in (32) is greater than or equal to one, and to the negative class otherwise.
In class-specific ranking settings, one can follow the process applied when using the standard CSDA approach. First, the test vectors are mapped to the discriminant subspace. Then, the projected test vectors are ordered based on their distance w.r.t. the positive class mean.
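Assuming Gaussian class-conditional densities in the discriminant subspace, the posterior-ratio rule reduces to comparing two log-likelihoods plus log-priors. A sketch follows; the function names, and the use of a single Gaussian for each class, are simplifying assumptions of ours:

```python
import numpy as np

def gaussian_logpdf(Z, mean, cov):
    """Log-density of N(mean, cov) evaluated at the rows of Z."""
    d = Z.shape[1]
    diff = Z - mean
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum('ij,jk,ik->i', diff, inv, diff)   # Mahalanobis terms
    return -0.5 * (d * np.log(2 * np.pi) + logdet + quad)

def classify(Z, m_pos, cov_pos, m_neg, cov_neg, prior_pos=0.5):
    """Posterior-ratio rule (sketch): assign to the positive class iff
    log p(z|+) + log P(+) >= log p(z|-) + log P(-)."""
    lp = gaussian_logpdf(Z, m_pos, cov_pos) + np.log(prior_pos)
    ln = gaussian_logpdf(Z, m_neg, cov_neg) + np.log(1 - prior_pos)
    return (lp >= ln).astype(int)   # 1 = positive class, 0 = negative class
```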
Table I: Class-specific verification performance (average precision) on the MNIST dataset.

Class  CSDA  PCSDA ()  PCSDA ()  PCSDA ()  PCSDA ()  CSSR [14]  PCSR
0  0.9318  0.9368  0.9368  0.9370  0.9363  0.4494  0.9235
1  0.9348  0.9342  0.9342  0.9344  0.9337  0.9213  0.9194
2  0.9003  0.8937  0.9011  0.8987  0.9059  0.9069  0.9069
3  0.8841  0.8861  0.8908  0.8954  0.8903  0.7532  0.9043
4  0.8696  0.8643  0.8728  0.8747  0.8693  0.8223  0.9169
5  0.8760  0.8775  0.8795  0.8756  0.8815  0.8831  0.8831
6  0.9266  0.9270  0.9279  0.9275  0.9292  0.8155  0.9131
7  0.8917  0.8920  0.8982  0.8965  0.9040  0.9046  0.9046
8  0.8508  0.8441  0.8449  0.8522  0.8507  0.7921  0.8996
9  0.8331  0.8341  0.8400  0.8567  0.8595  0.7801  0.8365
Mean  0.8899  0.8890  0.8926  0.8949  0.8960  0.8029  0.9008
Table II: Class-specific verification performance (average precision) on the JAFFE dataset.

Expression  CSDA  PCSDA ()  PCSDA ()  PCSDA ()  PCSDA ()  CSSR [14]  PCSR
Anger  0.7343 (0.0642)  0.7520 (0.1001)  0.7178 (0.1438)  0.7660 (0.0844)  0.7003 (0.0921)  0.6170 (0.1537)  0.8748 (0.0814)
Disgust  0.5844 (0.0808)  0.6167 (0.1020)  0.5868 (0.0738)  0.6066 (0.0560)  0.6184 (0.0437)  0.6751 (0.1240)  0.6782 (0.1017)
Fear  0.6075 (0.0664)  0.6297 (0.0866)  0.6300 (0.0869)  0.6465 (0.0313)  0.6432 (0.0468)  0.6727 (0.0635)  0.6641 (0.1871)
Happiness  0.5511 (0.0253)  0.5430 (0.0165)  0.5752 (0.0405)  0.5558 (0.0682)  0.5606 (0.0479)  0.8513 (0.0716)  0.8126 (0.1418)
Sadness  0.3901 (0.0635)  0.4386 (0.0874)  0.4357 (0.0789)  0.4236 (0.0794)  0.4471 (0.0932)  0.4324 (0.1444)  0.4496 (0.1282)
Surprise  0.7482 (0.1119)  0.7680 (0.1024)  0.7572 (0.0958)  0.7571 (0.0970)  0.7612 (0.1344)  0.6534 (0.1444)  0.8881 (0.0582)
Neutral  0.7128 (0.0655)  0.7216 (0.0644)  0.7166 (0.0767)  0.7148 (0.0687)  0.6956 (0.1043)  0.5948 (0.1783)  0.6907 (0.1926)
Mean  0.6183 (0.0682)  0.6385 (0.0799)  0.6313 (0.0852)  0.6386 (0.0693)  0.6324 (0.0803)  0.6424 (0.1257)  0.7226 (0.1273)
Table III: Class-specific verification performance (average precision) on the scene dataset.

Class  CSDA  PCSDA ()  PCSDA ()  PCSDA ()  PCSDA ()  CSSR [14]  PCSR
C1  0.5484 (0.0365)  0.5557 (0.0455)  0.5922 (0.0459)  0.5794 (0.0317)  0.6035 (0.0385)  0.6616 (0.1390)  0.7456 (0.0295)
C2  0.6256 (0.0500)  0.6328 (0.0444)  0.7236 (0.0597)  0.7654 (0.0257)  0.7998 (0.0483)  0.8721 (0.0649)  0.8728 (0.0303)
C3  0.5683 (0.1185)  0.5718 (0.1147)  0.6165 (0.1169)  0.6881 (0.1141)  0.7038 (0.1169)  0.6482 (0.1593)  0.8417 (0.0160)
C4  0.5175 (0.0269)  0.5192 (0.0269)  0.5270 (0.0249)  0.5483 (0.0311)  0.5571 (0.0435)  0.6642 (0.0269)  0.6506 (0.0236)
C5  0.8650 (0.0230)  0.8647 (0.0267)  0.8675 (0.0268)  0.8776 (0.0250)  0.8904 (0.0262)  0.8666 (0.0518)  0.9186 (0.0217)
C6  0.5090 (0.0205)  0.5129 (0.0182)  0.5205 (0.0165)  0.5527 (0.0441)  0.5471 (0.0224)  0.5476 (0.1566)  0.6410 (0.0386)
C7  0.8634 (0.0192)  0.8611 (0.0209)  0.8656 (0.0196)  0.8793 (0.0162)  0.8852 (0.0049)  0.7561 (0.1141)  0.9253 (0.0168)
C8  0.9441 (0.0180)  0.9361 (0.0151)  0.9396 (0.0184)  0.9417 (0.0241)  0.9400 (0.0195)  0.7403 (0.1601)  0.9679 (0.0339)
C9  0.7589 (0.0496)  0.7529 (0.0459)  0.7718 (0.0600)  0.7837 (0.0541)  0.7921 (0.0322)  0.6980 (0.1235)  0.8347 (0.0173)
C10  0.8944 (0.0089)  0.8906 (0.0095)  0.8931 (0.0101)  0.8944 (0.0094)  0.9008 (0.0124)  0.6607 (0.2370)  0.9057 (0.0107)
C11  0.8796 (0.0117)  0.8770 (0.0158)  0.8795 (0.0118)  0.8725 (0.0268)  0.8795 (0.0406)  0.7839 (0.0832)  0.9173 (0.0390)
C12  0.8905 (0.0217)  0.8893 (0.0145)  0.8894 (0.0112)  0.8898 (0.0176)  0.8965 (0.0151)  0.8072 (0.0870)  0.9063 (0.0081)
C13  0.5181 (0.0540)  0.5076 (0.0528)  0.6335 (0.0501)  0.6548 (0.0391)  0.6504 (0.0595)  0.6281 (0.2371)  0.7601 (0.0596)
C14  0.7934 (0.0158)  0.7925 (0.0172)  0.7900 (0.0222)  0.7891 (0.0213)  0.8021 (0.0300)  0.6869 (0.1697)  0.8609 (0.0156)
C15  0.8404 (0.0195)  0.8363 (0.0186)  0.8799 (0.0188)  0.8906 (0.0184)  0.9000 (0.0167)  0.8381 (0.0509)  0.8917 (0.0160)
Mean  0.7344 (0.0329)  0.7334 (0.0324)  0.7593 (0.0342)  0.7738 (0.0333)  0.7832 (0.0351)  0.7240 (0.1241)  0.8427 (0.0251)
III-C CSDA variants under the probabilistic model
A special case of the PCSDA can be obtained by setting the assumption that each negative sample forms a negative subclass, i.e. $K = N_n$ and $N_k = 1$. In that case the within-subclass scatter matrices vanish, the negative samples are drawn from a single Gaussian distribution centered at the positive class mean, and the projection matrix is calculated by solving a generalized eigenanalysis problem equivalent to that of CSDA, i.e., we obtain the class discrimination definition of CSDA. The Spectral Regression-based solutions of CSDA in [4, 5, 13] and the low-rank regression solution of [14] use the same class discrimination criterion and, thus, correspond to the same setting of PCSDA. Since the discrimination criterion used in CSDA is a special case of the proposed probabilistic model, all the above-mentioned methods can be extended to perform classification using the rule in (33).
III-D Nonlinear PCSDA
In the above analysis we considered the linear class-specific subspace learning case. In order to learn nonlinear mappings, traditional kernel-based learning methods perform a nonlinear mapping of the input space to a feature space $\mathcal{F}$ using a function $\phi(\cdot)$, i.e. $\mathbf{x} \mapsto \phi(\mathbf{x}) \in \mathcal{F}$. Then, linear class-specific projections are defined by using the training data representations in $\mathcal{F}$. Since the dimensionality of $\mathcal{F}$ is arbitrary (virtually infinite), the data representations in $\mathcal{F}$ cannot be calculated directly. Traditional kernel-based learning methods address this issue by exploiting the Representer theorem, and the nonlinear mapping is implicitly performed using the kernel function encoding dot products in the feature space, i.e. $\kappa(\mathbf{x}_i, \mathbf{x}_j) = \phi(\mathbf{x}_i)^\top \phi(\mathbf{x}_j)$ [15].
As has been shown in [16], the effective dimensionality of the kernel space is at most equal to the number of training samples and, thus, an explicit nonlinear mapping can be calculated. This is achieved by using the eigenvectors and eigenvalues of the kernel matrix [17]. Thus, extension of PCSDA to the nonlinear (kernel) case can be readily obtained by applying the above-described linear PCSDA on the explicitly mapped training vectors. For the cases where the size of the training set is prohibitive for applying kernel-based discriminant learning, the Nyström-based kernel subspace learning method of [18] or nonlinear data mappings based on randomized features, as proposed in [19], can be used. Here we should note that the application of $K$-Means on the explicitly mapped data corresponds to the application of kernel $K$-Means in the original space.
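The explicit mapping can be sketched as an eigendecomposition of the kernel matrix, after which the inner products of the mapped samples reproduce the kernel values. This is an illustrative NumPy sketch (the RBF width handling, the truncation tolerance, and the function names are ours):

```python
import numpy as np

def rbf_kernel(X, Y, gamma):
    """RBF kernel matrix K_ij = exp(-gamma * ||x_i - y_j||^2)."""
    d2 = ((X[:, None] - Y[None]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def explicit_kernel_map(X, gamma, tol=1e-10):
    """Explicit kernel-space representations: eigendecompose K = U L U^T and
    map the samples to Phi = U L^{1/2}, so that Phi @ Phi.T reproduces K."""
    K = rbf_kernel(X, X, gamma)
    evals, U = np.linalg.eigh(K)
    keep = evals > tol                         # discard numerically-zero modes
    return U[:, keep] * np.sqrt(evals[keep])   # (N, r) explicit representations
```

Linear PCSDA (or CSDA) can then be run directly on the rows of the returned matrix.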
IV Experiments
In this section we provide experimental results illustrating the performance of the proposed PCSDA in comparison with existing CSDA variants. For each dataset, we formed one class-specific ranking problem per class, using the class data as positive samples and the data of the remaining classes as negative samples. In all the experiments the data representations are first mapped to the PCA space preserving most of the spectrum energy and subsequently nonlinearly mapped to the subspace of the kernel space (as discussed in Section III-D). We used the RBF kernel function, setting its width parameter equal to the mean distance between the positive training vectors. The discriminant subspace is subsequently obtained by applying one of the competing discriminant methods. The dimensionality of the discriminant subspace is determined by applying five-fold cross-validation on the training data. Finally, performance is measured in the discriminant subspace using the average precision metric.
Table I illustrates the performance over all classes of the MNIST dataset [20]. Since performance saturates when the entire training set is used, we formed a reduced problem by using only the first images of each class in the training set, and we report performance on the entire test set. Table II illustrates the performance over all classes of the JAFFE dataset [21]. We used the grayscale values of the facial images as a representation. For each class-specific ranking problem, we applied five experiments by randomly splitting each class into subsets, using the first subset of each class to form the training set and the second one to form the test set. Table III illustrates the performance over all classes of the scene dataset [22]. We employed the deep features generated by average pooling over the spatial dimensions of the last convolutional layer of the VGG network [23] trained on the ILSVRC2012 database. For each class-specific ranking problem, we applied five experiments by randomly splitting each class using different splits.
In Tables I–III, the variants of the proposed method that outperform the corresponding baseline are highlighted for each problem. As can be seen, the performance obtained for the standard CSDA and its probabilistic version PCSDA () is very similar. The differences observed are mainly due to the small differences in the ranking of the used projection vectors (Eq. (27)). Moreover, exploiting subclass information of the negative class can be beneficial, since performance improves in most tested cases. Comparing the CSSR in [14] and its probabilistic version PCSR, it can be seen that the latter improves the performance considerably. This is due to the use of (27), which ranks all eigenvectors of CSSR according to their class-specific discrimination power.
Finally, in Table IV we compare the performance of CSDA and PCSDA for classification. Since each of the class-specific classification problems is imbalanced, we used the g-mean metric, which is defined as the square root of the product of the correct classification rates of the two classes. Classification is performed by using the classification rule in (33) for PCSDA. For CSDA, a linear SVM trained on the absolute deviations from the positive class mean in the discriminant subspace is used, similar to [5]. The SVM hyperparameter is optimized jointly with the discriminant subspace dimensionality by applying five-fold cross-validation on the training data. Comparing the performance of the CSDA+SVM classification scheme with that of PCSDA () on the MNIST and JAFFE datasets, we can see that the use of an additional classifier trained on the data representations in the discriminant subspace increases classification performance. By allowing the negative class to form subclasses, the performance of PCSDA increases and is competitive with that of the CSDA+SVM classification scheme. Overall, the PCSDA model achieves competitive performance without requiring the training of a new classifier in the discriminant subspace.
Table IV: Class-specific classification performance (g-mean) on the three datasets.

Method  MNIST  scene  JAFFE
CSDA  0.9531  0.9119 (0.0074)  0.7439 (0.0679)
PCSDA ()  0.9362  0.9144 (0.0125)  0.7102 (0.1164)
PCSDA ()  0.9378  0.9167 (0.0143)  0.8029 (0.0665)
PCSDA ()  0.9403  0.9191 (0.0145)  0.7661 (0.0762)
PCSDA ()  0.9501  0.9189 (0.0156)  0.7560 (0.0592)
V Conclusions
In this paper we proposed a probabilistic model for class-specific discriminant subspace learning that is able to incorporate subclass information naturally appearing in the negative class in the optimization criterion. The adoption of a probability-based optimization criterion for class-specific discriminant subspace learning leads to a classifier defined on the data representations in the discriminant subspace. We showed that the proposed approach includes as special cases existing class-specific discriminant analysis methods. Experimental results illustrated the performance of the proposed model, in comparison with that of related methods, in both verification and classification problems.
References
 [1] J. Kittler, Y. Li, and J. Matas, “Face verification using client specific Fisher faces,” The Statistics of Directions, Shapes and Images, pp. 63–66, 2000.
 [2] G. Goudelis, S. Zafeiriou, A. Tefas, and I. Pitas, “Class-specific kernel discriminant analysis for face verification,” IEEE Transactions on Information Forensics and Security, vol. 2, no. 3, pp. 570–587, 2007.
 [3] S. Zafeiriou, G. Tzimiropoulos, M. Petrou, and T. Stathaki, “Regularized kernel discriminant analysis with a robust kernel for face recognition and verification,” IEEE Transactions on Neural Networks, vol. 23, no. 3, pp. 526–534, 2012.
 [4] S. Arashloo and J. Kittler, “Class-specific kernel fusion of multiple descriptors for face verification using multiscale binarized statistical image features,” IEEE Transactions on Information Forensics and Security, vol. 9, no. 12, pp. 2100–2109, 2014.
 [5] A. Iosifidis, A. Tefas, and I. Pitas, “Class-specific reference discriminant analysis with application in human behavior analysis,” IEEE Transactions on Human-Machine Systems, vol. 45, no. 3, pp. 315–326, 2015.
 [6] A. Iosifidis, M. Gabbouj, and P. Pekki, “Class-specific nonlinear projections using class-specific kernel spaces,” IEEE International Conference on Big Data Science and Engineering, pp. 1–8, 2015.
 [7] A. Iosifidis, A. Tefas, and I. Pitas, “Activity based person identification using fuzzy representation and discriminant learning,” IEEE Transactions on Information Forensics and Security, vol. 7, no. 2, pp. 530–542, 2012.
 [8] R. Duda, P. Hart, and D. Stork, Pattern Classification, 2nd ed. Wiley-Interscience, 2000.
 [9] A. Iosifidis, A. Tefas, and I. Pitas, “On the optimal class representation in Linear Discriminant Analysis,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 9, pp. 1491–1497, 2013.
 [10] A. Iosifidis, A. Tefas, and I. Pitas, “Kernel reference discriminant analysis,” Pattern Recognition Letters, vol. 49, pp. 85–91, 2014.
 [11] Y. Jia, F. Nie, and C. Zhang, “Trace ratio problem revisited,” IEEE Transactions on Neural Networks, vol. 20, no. 4, pp. 729–735, 2009.
 [12] D. Cai, X. He, and J. Han, “Spectral regression for efficient regularized subspace learning,” International Conference on Computer Vision, 2007.
 [13] A. Iosifidis and M. Gabbouj, “Scaling up class-specific kernel discriminant analysis for large-scale face verification,” IEEE Transactions on Information Forensics and Security, vol. 11, no. 11, pp. 2453–2465, 2016.
 [14] A. Iosifidis and M. Gabbouj, “Class-specific kernel discriminant analysis revisited: Further analysis and extensions,” IEEE Transactions on Cybernetics, vol. 47, no. 12, pp. 4485–4496, 2017.
 [15] B. Schölkopf and A. Smola, Learning with Kernels. MIT Press, 2001.
 [16] N. Kwak, “Nonlinear Projection Trick in kernel methods: an alternative to the kernel trick,” IEEE Transactions on Neural Networks and Learning Systems, vol. 24, no. 12, pp. 2113–2119, 2013.
 [17] N. Kwak, “Implementing kernel methods incrementally by incremental nonlinear projection trick,” IEEE Transactions on Cybernetics, vol. 47, no. 11, pp. 4003–4009, 2017.
 [18] A. Iosifidis and M. Gabbouj, “Nyström-based approximate kernel subspace learning,” Pattern Recognition, vol. 57, pp. 190–197, 2016.
 [19] A. Rahimi and B. Recht, “Random features for large-scale kernel machines,” Advances in Neural Information Processing Systems, 2007.
 [20] Y. LeCun, L. Bottou, and Y. Bengio, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, pp. 2278–2324, 1998.
 [21] M. Lyons, S. Akamatsu, M. Kamachi, and J. Gyoba, “Coding facial expressions with Gabor wavelets,” IEEE International Conference on Automatic Face and Gesture Recognition, pp. 200–205, 1998.
 [22] S. Lazebnik, C. Schmid, and J. Ponce, “Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories,” IEEE Conference on Computer Vision and Pattern Recognition, 2006.
 [23] K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” arXiv:1409.1556, 2014.