# Null Space Analysis for Class-Specific Discriminant Learning

## Abstract

In this paper, we carry out null space analysis for Class-Specific Discriminant Analysis (CSDA) and formulate a number of solutions based on the analysis. We analyze both theoretically and experimentally the significance of each algorithmic step. The innate subspace dimensionality resulting from the proposed solutions is typically quite high and we discuss how the need for further dimensionality reduction changes the situation. Experimental evaluation of the proposed solutions shows that the straightforward extension of null space analysis approaches to the class-specific setting can outperform the standard CSDA method. Furthermore, by exploiting a recently proposed out-of-class scatter definition encoding the multi-modality of the negative class naturally appearing in class-specific problems, null space projections can lead to a performance comparable to or outperforming the most recent CSDA methods.


Index Terms— Class-Specific Discriminant Analysis, Dimensionality reduction, Multi-modal data distributions, Null space analysis.

## I Introduction


Class-specific discrimination finds application in problems where the objective is to discriminate a class of interest from any other possibility. One of the most notable examples of class-specific problems is person identification, e.g., through face or motion analysis [1, 2]. Unlike person recognition, which is a multi-class classification problem defined on a pre-defined set of identity classes, person identification discriminates a person of interest from all other people, i.e., it is a binary problem. The application of Linear Discriminant Analysis (LDA) [3, 4, 5] to such binary problems leads to a one-dimensional discriminant subspace due to the rank of the adopted between-class scatter matrix. CSDA [1, 6, 7, 8, 9, 10] allows discriminant subspaces of higher dimensionality by defining suitable intra-class and out-of-class scatter matrices. This leads to better class discrimination in such binary problems compared to LDA [6, 8, 9].

Existing methods for CSDA typically operate on nonsingular scatter matrices and seek discriminant directions in the span of the positive training data. When the data dimensionality is higher than the cardinality of the training set, the scatter matrices are singular and regularization is applied to address computational stability problems. However, experience from multi-class discriminant analysis approaches dealing with this small sample size problem [11] indicates that null space directions contain high discrimination power [12, 13, 14]. Interestingly, one-class discrimination approaches based on null space analysis have also been proposed recently [15, 16]. However, these are in fact designed by following a multi-class setting and cannot be directly extended to class-specific discrimination.

In this paper, we provide a null space analysis for CSDA. We then formulate straightforward class-specific variants of Null space Linear Discriminant Analysis (NLDA) [12, 11] and of the closely related Uncorrelated Linear Discriminant Analysis (ULDA) [17, 18, 19], Orthogonal Linear Discriminant Analysis (OLDA) [20], and Regularized Orthogonal Linear Discriminant Analysis (ROLDA) [13] methods. We carry out a detailed evaluation of the significance of each algorithmic step, both theoretically and experimentally, and discuss different implementation strategies. Furthermore, we combine the concepts of null space analysis with a recently proposed out-of-class scatter definition encoding the multi-modality of the negative class naturally appearing in class-specific problems [21] and propose heterogeneous extensions of the class-specific null space algorithms. Our experimental evaluation of the proposed methods shows that the straightforward extensions can outperform the baseline CSDA algorithm, while the heterogeneous extensions can achieve a performance comparable to or outperforming the most recent CSDA extensions.

The rest of the paper is organized as follows. In Section II, we give the generic problem statement for class-specific subspace learning. In Section III, we introduce the standard CSDA algorithm as well as recent extensions. The main contributions of this paper are described in Section IV, where we provide the null space analysis and propose a number of CSDA extensions exploiting the analysis. We give our experimental results in Section V and conclude the paper in Section VI.

## II Problem Statement

Let us denote by the training vectors, each followed by a binary label , where a label indicates that sample belongs to the class of interest, or the positive class, while a label indicates that sample belongs to the negative class. In practice, the latter case corresponds to a sample belonging to one of the subclasses forming the negative class, the labels of which are not available during training either because these subclasses are not adequately sampled or because they are expensive to annotate. We want to map the training vectors to a lower-dimensional feature space, i.e., , so that the discrimination of the positive class from the negative class is increased.

A basic assumption in class-specific learning is that the two classes are not linearly separable, but the negative samples lie in multiple directions around the samples of the class of interest. Therefore, class-specific methods typically rely on non-linear approaches. To non-linearly map to , traditional kernel-based learning methods map the training vectors to an intermediate feature space using a function , i.e., . Then, linear class-specific projections are defined by exploiting the Representer theorem and the non-linear mapping is implicitly performed using the kernel function encoding dot products in the feature space, i.e., [22]. In this way, a non-linear projection from to is expressed as a linear transformation of the kernel matrix having as elements the pair-wise dot products of the training data representations in calculated using the kernel function .

One can also exploit the Nonlinear Projection Trick (NPT) [23] and apply first an explicit non-linear mapping , where has elements . This is achieved by setting , where and contain the non-zero eigenvalues and the corresponding eigenvectors of the centered kernel matrix . For an unseen test sample , the corresponding mapping is performed as , where is the centered version of the (uncentered) kernel vector having elements . In cases where the size of the training set is prohibitive for applying NPT, approximate methods for kernel subspace learning can be used, like the one in [24]. After applying NPT, a linear projection , where , corresponds to a nonlinear mapping from to . In the rest of this paper, we assume that the data has been preprocessed with NPT, which allows us to obtain non-linear mappings with linear formulations.
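The NPT mapping described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names `rbf_kernel`, `npt_fit`, and `npt_transform`, the RBF width `gamma`, and the eigenvalue threshold `tol` are all hypothetical choices made here. Training representations are formed from the non-zero eigenpairs of the centered kernel matrix, and unseen samples are mapped from their centered kernel vectors.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel between the columns of A (D x n) and B (D x m)."""
    sq = (A**2).sum(axis=0)[:, None] + (B**2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * np.maximum(sq, 0.0))

def npt_fit(K, tol=1e-10):
    """Explicit training representations from the centered kernel matrix."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N          # centering matrix
    evals, evecs = np.linalg.eigh(H @ K @ H)
    keep = evals > tol * evals.max()             # keep non-zero eigenvalues only
    lam, U = evals[keep], evecs[:, keep]
    Phi = np.sqrt(lam)[:, None] * U.T            # columns are the mapped samples
    return Phi, lam, U

def npt_transform(K, k, lam, U):
    """Map unseen samples given their uncentered kernel vectors k (N x m)."""
    N = K.shape[0]
    H = np.eye(N) - np.ones((N, N)) / N
    k_centered = H @ (k - K.mean(axis=1, keepdims=True))
    return (U.T @ k_centered) / np.sqrt(lam)[:, None]

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 12))                     # 12 toy samples in R^5
K = rbf_kernel(X, X)
Phi, lam, U = npt_fit(K)
Phi_again = npt_transform(K, K[:, :3], lam, U)   # re-map three training samples
```

Mapping a training sample through `npt_transform` should reproduce its column in `Phi`, which is a convenient sanity check for the centering of the kernel vector.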

## III Class-Specific Discriminant Analysis

CSDA [6] defines the optimal projection matrix as the one projecting the data representations of the positive class as close as possible to the positive class mean while at the same time maximizing the scatter of the negative class data from the positive class mean. This objective is achieved by calculating the out-of-class and intra-class scatter matrices w.r.t. the positive class mean :

The optimal is the one maximizing the following criterion:

(1) |

where is the trace operator. is obtained by solving the generalized eigenproblem and keeping the eigenvectors in the row space of both scatter matrices corresponding to the largest eigenvalues [25], where with the assumption that . By defining the total scatter matrix as

we can easily see that and we get two equivalent optimization criteria for CSDA:

While the scatter matrices are symmetric (and also the inverse of a symmetric matrix is symmetric), their product is typically not symmetric, which means that the eigenvectors of used as the solution of CSDA are not guaranteed to be orthogonal. Furthermore, when the data dimensionality leads to rank deficient scatter matrices, the inverse of cannot be directly computed. This is known as the small sample size problem [11].

A long line of research has considered these issues for LDA and Kernel Discriminant Analysis (KDA) including variants such as regularized LDA [26], pseudo-inverse LDA [27], NLDA [12, 11], ULDA [17, 18, 19], OLDA [20], and ROLDA [13]. However, the issues induced by the small sample size problem and the scatter matrix singularity have not received much attention in connection with CSDA. A common remedy is to follow regularized LDA and solve for the eigenvectors of , where is a small positive value and is an identity matrix. The drawback of this approach is the additional hyperparameter , whose value may have a significant impact on the results and is usually determined by cross-validation.
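As an illustration of the regularized approach, the sketch below builds the scatter matrices with respect to the positive class mean and solves the regularized generalized eigenproblem by Cholesky whitening. The names `S_i`, `S_o`, `S_t`, `eps`, and both helper functions are assumptions introduced here for the example. The eigenvectors returned this way are orthonormal with respect to the regularized intra-class scatter, not necessarily in the standard sense.

```python
import numpy as np

def class_specific_scatters(X_pos, X_neg):
    """Scatter matrices w.r.t. the positive class mean (samples are columns)."""
    m = X_pos.mean(axis=1, keepdims=True)
    P, Q = X_pos - m, X_neg - m
    S_i = P @ P.T                      # intra-class scatter
    S_o = Q @ Q.T                      # out-of-class scatter
    return S_i, S_o, S_i + S_o         # total scatter is their sum

def regularized_csda(S_i, S_o, d, eps=1e-3):
    """Solve S_o w = lambda (S_i + eps I) w; keep the d leading eigenvectors."""
    B = S_i + eps * np.eye(S_i.shape[0])
    L = np.linalg.cholesky(B)          # B = L L^T
    L_inv = np.linalg.inv(L)
    C = L_inv @ S_o @ L_inv.T          # equivalent symmetric eigenproblem
    evals, V = np.linalg.eigh(C)
    order = np.argsort(evals)[::-1][:d]
    W = L_inv.T @ V[:, order]          # back-substitution: w = L^{-T} v
    return W, evals[order]

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))       # D = 20 exceeds N = 13 (small sample size)
X_neg = rng.normal(size=(20, 8))
S_i, S_o, S_t = class_specific_scatters(X_pos, X_neg)
eps = 1e-3
W, lam = regularized_csda(S_i, S_o, d=3, eps=eps)
```

The value of `eps` directly scales the leading eigenvalues, which is one way to see why the hyperparameter can matter in practice.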

A Spectral Regression [28] based solution of (1) was proposed in [8, 9]. It has been shown in [29] that the spectral regression based solution of (1) can be efficiently calculated by exploiting the labeling information of the training vectors. The equivalence of (1) to a low-rank regression problem, in which the target vectors can be determined by exploiting the labeling information of the training vectors, was shown in [30]. Finally, a probabilistic framework for class-specific subspace learning was recently proposed in [21], encapsulating criterion (1) as a special case.

## IV Null Space Analysis for CSDA

In the rest of the paper, we assume that the data is centered to the positive class mean. This can always be done by setting . Then, the total, intra-class, and out-of-class scatter matrices are given as

where and with are matrices having as columns the positive and negative training data, respectively. For linearly independent training samples , we have , and . When and , all scatter matrices are full rank and the corresponding null spaces are trivial. As the null spaces are the main focus of this paper, we concentrate on cases where . Thus, we have .
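A quick numerical check of the rank behavior in the small sample size setting is sketched below. The symbol names and the toy sizes are assumptions made for this example. For generic data centered at the positive class mean, one would expect the intra-class scatter to lose one rank to the centering, the out-of-class scatter to have rank equal to the number of negative samples, and the total scatter to have rank one less than the total number of samples.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_pos, n_neg = 20, 5, 8                 # D > N = n_pos + n_neg
X_pos = rng.normal(size=(D, n_pos))
X_neg = rng.normal(size=(D, n_neg))

# Center all data at the positive class mean
m = X_pos.mean(axis=1, keepdims=True)
P, Q = X_pos - m, X_neg - m

S_i, S_o = P @ P.T, Q @ Q.T
S_t = S_i + S_o

ranks = {name: int(np.linalg.matrix_rank(S))
         for name, S in [("S_i", S_i), ("S_o", S_o), ("S_t", S_t)]}
null_dim_S_t = D - ranks["S_t"]            # dimension of the null space of S_t
```

With these toy sizes the null space of the total scatter is eight-dimensional, so there are directions along which the training data carries no information at all.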

Here, we should note that linear independence of training vectors does not necessarily imply linear independence of for any given kernel function. However, for widely used kernel functions, like the Radial Basis Function (RBF) and linear kernels, this connection exists. For data representations in obtained by applying NPT, we have with full rank. This is because the dimensions corresponding to the zero eigenvalues of have been discarded and because and are symmetrizable matrix products ( and ) meaning that they share the same eigenvalues [31]. We have , whenever the training vectors are linearly independent and .

To define discriminant directions for our null space CSDA, we will follow ideas similar to those in Null Foley-Sammon transform [32]. That is, the projection matrix is formed by the projection directions satisfying

(2) | |||||

(3) |

The vectors satisfying the above expressions are called null projections and, since , lead to the best separability of the positive and negative classes. As , the projections satisfying both (2) and (3) satisfy

(4) |

In order to further analyze the null space projections , we denote by any of the above-defined scatter matrices and define the null and orthogonal complement spaces of a matrix as follows: and , respectively. Here, is the row space of (and as is symmetric it is also equal to the column space). From the above equations, we see that . Similarly, . Moreover, is a null projection of , i.e., , only if and , since and all three matrices are positive semi-definite. Thus, we have shown that

(5) |

When is full rank, we have , which means that , i.e., the directions satisfying (3) (or equivalently (4)) also satisfy (2). This also implies that (and ). When does not hold, it can be achieved by mapping the data to the row space of .
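The relation between the null spaces can be checked numerically. The sketch below is an assumption-laden illustration (symbol names, toy sizes, and the threshold `1e-10` are choices made here): it verifies that directions in the null space of the total scatter are annihilated by both the intra-class and the out-of-class scatter, and that the common null space of the latter two has the same dimension.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 20
X_pos = rng.normal(size=(D, 5))
X_neg = rng.normal(size=(D, 8))
m = X_pos.mean(axis=1, keepdims=True)
P, Q = X_pos - m, X_neg - m
S_i, S_o = P @ P.T, Q @ Q.T
S_t = S_i + S_o

# Null space of S_t: eigenvectors with (numerically) zero eigenvalues
evals, evecs = np.linalg.eigh(S_t)
N_t = evecs[:, evals < 1e-10 * evals.max()]

# Energy of these directions under S_i and S_o (should both vanish)
leak_i = np.abs(N_t.T @ S_i @ N_t).max()
leak_o = np.abs(N_t.T @ S_o @ N_t).max()

# Dimension of null(S_i) intersected with null(S_o), via the stacked matrix
dim_intersection = D - np.linalg.matrix_rank(np.vstack([S_i, S_o]))
```

Because all three matrices are positive semi-definite and the total scatter is the sum of the other two, a zero quadratic form under the total scatter forces a zero quadratic form under each summand.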

The eigenvectors corresponding to zero eigenvalues span the null space of a matrix, while the eigenvectors corresponding to non-zero eigenvalues do not necessarily span the whole row space. However, since we are dealing with symmetric matrices having orthogonal eigenvectors, the whole space is spanned. Furthermore, when , the eigenvalues of are the positive eigenvalues of and , i.e.,

(6) |

where , are the eigenvalues of , are the non-zero eigenvalues of , and are the non-zero eigenvalues of . This is because for the non-zero eigenvalues of , we have and in a similar manner the non-zero eigenvalues of are eigenvalues of . As , the non-zero eigenvalues of and form the full set of eigenvalues of . However, we observed experimentally that for most datasets is quite ill-conditioned (i.e., it has a large ratio of largest to smallest eigenvalues). We also observed numerical instability, especially in computations involving the eigenvalues of . As a result, (6) typically does not hold accurately in practice, and the null space of and the row space of are not properly aligned.

Now we proceed to formulate CSDA extensions based on the null space analysis. We will analyze the significance of each algorithmic step and also discuss the consequences of the above-mentioned numerical instability.

### IV-A Null Space Class-Specific Discriminant Analysis

In this section, we propose Null space Class-Specific Discriminant Analysis (NCSDA), where we exploit similar steps as proposed for the original NLDA [12] and its extensions [11, 13]. We aim at exploiting the discriminant information available in the null space of intra-class scatter matrix by maximizing the following constrained criterion:

(7) | ||||||

As for NLDA, the main idea is to first remove the null space of to ensure , then map the data to the remaining null space of , and finally maximize there. A major difference w.r.t. NLDA is the final subspace dimensionality innately following from the algorithm. After removing the null space of , and, in the same way, for NLDA. The rank of the between-class scatter matrix used in NLDA is limited by the number of classes. Thus, the innate subspace dimensionality is low and the NLDA methods do not apply any further dimensionality reduction. However, the innate NCSDA dimensionality is typically much higher, equal to , and it therefore becomes desirable that the algorithm can, in addition to mapping data to the , also provide an optimal ranking for the projection vectors so that only some of them can be selected to obtain lower-dimensional final representations. The pseudo-code of the proposed NCSDA is given in Algorithm 1 and each step is analyzed below.

Steps 4-6 remove the null space of to obtain (due to (5)). We follow [13] and use Singular Value Decomposition (SVD) to get , where and are orthogonal,

is a diagonal matrix with positive diagonal elements in decreasing order and . Therefore,

can be partitioned as , where contains the eigenvectors of corresponding to non-zero eigenvalues, contains the eigenvectors corresponding to zero eigenvalues, and contains the non-zero eigenvalues. The reduced SVD of can be now given as

(8) |

Step 6 projects the data to the subspace spanned by the columns of to remove the null space of . We denote the scatter matrices of projected data as , , and .

We note that Step 6 corresponds to applying uncentered Principal Component Analysis (PCA) on the data centered to the positive class mean with the dimensionality set to . As discussed above, for data matrix obtained by applying NPT, is full rank, i.e., . In this case, Steps 4 and 6 are not needed, but we keep them in the algorithm to ensure that for any input data.
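The removal of the null space of the total scatter can be sketched with a reduced SVD of the data matrix centered at the positive class mean. This is a minimal sketch, assuming samples are stored as columns; the function name `remove_total_null_space` and the relative threshold `tol` are hypothetical choices made here.

```python
import numpy as np

def remove_total_null_space(X_pos, X_neg, tol=1e-10):
    """Project data onto the row space of the total scatter (uncentered PCA)."""
    m = X_pos.mean(axis=1, keepdims=True)
    X = np.hstack([X_pos, X_neg]) - m            # centered at the positive mean
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    t = int(np.sum(s > tol * s[0]))              # numerical rank of the data
    U1 = U[:, :t]                                # basis of the row space of S_t
    return U1, U1.T @ (X_pos - m), U1.T @ (X_neg - m)

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))
U1, Z_pos, Z_neg = remove_total_null_space(X_pos, X_neg)

# Total scatter of the projected data: full rank in the reduced space
S_t_reduced = Z_pos @ Z_pos.T + Z_neg @ Z_neg.T
```

Working with the SVD of the data matrix avoids forming the total scatter explicitly, which squares the condition number.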

In Step 8, the null space of is computed. It is the most critical step in NCSDA. Following the approach of [11, 32], the null space can be found by solving the eigenproblem

(9) |

and forming the projection matrix from the eigenvectors corresponding to the zero eigenvalues. As is symmetric, the resulting projection vectors will be orthogonal. While these projection vectors span the null space of , the zero eigenvalues do not provide any additional information for ranking the vectors when an additional dimensionality reduction is desired.
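A minimal sketch of this null space step is given below, assuming the data has already been mapped to the row space of the total scatter. The function name `null_projections`, the variable names, and the zero threshold are assumptions introduced for the example.

```python
import numpy as np

def null_projections(S_i, tol=1e-10):
    """Orthonormal eigenvectors of a symmetric S_i with zero eigenvalues."""
    evals, evecs = np.linalg.eigh(S_i)
    return evecs[:, evals < tol * evals.max()]

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))
m = X_pos.mean(axis=1, keepdims=True)

# Remove the null space of the total scatter first
X = np.hstack([X_pos, X_neg]) - m
U, s, _ = np.linalg.svd(X, full_matrices=False)
U1 = U[:, s > 1e-10 * s[0]]
Z_pos, Z_neg = U1.T @ (X_pos - m), U1.T @ (X_neg - m)
S_i, S_o = Z_pos @ Z_pos.T, Z_neg @ Z_neg.T

W = null_projections(S_i)
intra = np.abs(W.T @ S_i @ W).max()          # should vanish
out_energy = np.diag(W.T @ S_o @ W)          # should be strictly positive
```

All selected eigenvalues are (numerically) zero, so they give no basis for preferring one projection vector over another.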

As , we can turn our attention to and solve the eigenproblem

(10) |

to find the vectors spanning the row space of (i.e., select the eigenvectors corresponding to non-zero eigenvalues). This approach also results in orthogonal projection vectors and allows ranking them according to their ability to maximize in (7). However, our experiments show that the projection vectors obtained by solving (10) fail to span the null space of due to the numerical instability discussed above, which leads to poor classification performance.

Therefore, we also investigate the use of the following generalized eigenproblems in Step 8 to analyze their ability to provide null projections for and to rank the projection vectors for further dimensionality reduction:

(11) |

(12) |

(13) |

where is a small positive value and is an identity matrix. To obtain the projection vectors, we select the eigenvectors corresponding to the zero eigenvalues for (11), but the eigenvectors corresponding to the non-zero eigenvalues for (12) and (13). Other possible generalized eigenproblems to consider could be and . However, we leave them out, because the analysis for the former combines the elements (drawbacks) of those for (11) and (13), while the latter gives exactly the same results as (12) when selecting the eigenvalues smaller than one.

The projection vectors resulting from the generalized eigenproblems are no longer guaranteed to be orthogonal. However, for symmetric matrices and and for a positive definite , the generalized eigenproblem has real eigenvalues and the eigenvectors are linearly independent and -orthogonal, i.e., for [33]. All the scatter matrices are symmetric and positive semi-definite. After removing the null space of , is full rank and, therefore, positive definite. Furthermore, the regularization applied in (11) and (12) preserves the symmetry, while making and positive definite. Thus, (11), (12), and (13) will have -orthogonal, -orthogonal, and -orthogonal eigenvectors, respectively. For (12) and (13), the linear independence of the eigenvectors is important to maintain the assumption that the eigenvectors corresponding to the non-zero eigenvalues span the row space of and, thus, the null space of due to (5). For (12), we have . Since we select only the eigenvectors spanning the row space of to form and assume that they are null projections for , we get , i.e., the projection vectors are orthogonal.

Considering the usefulness of (11)-(13) for projection vector ranking, the eigenvectors corresponding to zero eigenvalues are used in (11) and, therefore, the eigenvalues do not offer ranking information. Furthermore, after mapping the data to the row space of all the non-zero generalized eigenvalues computed with respect to as in (13) are equal to one and, thus, they are also useless for ranking. Only for (12) do we have non-zero eigenvalues which can be directly used for ranking the projection vectors. We also note that (12) corresponds to applying the standard CSDA in the row space of , which results in orthogonal projection vectors. In our experiments, the projection vectors solved from (11)-(13) span the null space of with the same accuracy as the solution of (9).

Step 10 aims at finding a mapping that maximizes in the null space of . The mapping can be formed by solving the eigenproblem and taking the eigenvectors corresponding to non-zero eigenvalues. A corresponding step was a part of the original NLDA [12]. However, it was considered unnecessary in [11] since , or in the case of NLDA, has no null space to remove ( should be full rank after removing the null space of and mapping to the null space of ), and it was proved in [13] that the projection has no effect on the algorithm. This is because for any orthogonal matrix , . We know that is orthogonal because is symmetric. Therefore,

and has no effect in maximizing (7). However, these arguments against Step 10 do not take into account the need for ranking the projection vectors for further dimensionality reduction, while it offers another approach for evaluating the usefulness of the vectors.
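Both points, the invariance of the trace and the usefulness of the rotation for ranking, can be illustrated numerically. The sketch below assumes orthonormal null projections `W` in the reduced space; the function name `rank_by_out_of_class_scatter` and all variable names are hypothetical.

```python
import numpy as np

def rank_by_out_of_class_scatter(W, S_o):
    """Rotate W so that its columns are sorted by out-of-class scatter."""
    evals, R = np.linalg.eigh(W.T @ S_o @ W)     # R is orthogonal
    order = np.argsort(evals)[::-1]
    return W @ R[:, order], evals[order]

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))
m = X_pos.mean(axis=1, keepdims=True)
X = np.hstack([X_pos, X_neg]) - m
U, s, _ = np.linalg.svd(X, full_matrices=False)
U1 = U[:, s > 1e-10 * s[0]]
Z_pos, Z_neg = U1.T @ (X_pos - m), U1.T @ (X_neg - m)
S_i, S_o = Z_pos @ Z_pos.T, Z_neg @ Z_neg.T

evals_i, evecs_i = np.linalg.eigh(S_i)
W = evecs_i[:, evals_i < 1e-10 * evals_i.max()]  # null projections of S_i

W_ranked, scores = rank_by_out_of_class_scatter(W, S_o)
full_before = np.trace(W.T @ S_o @ W)
full_after = np.trace(W_ranked.T @ S_o @ W_ranked)
top3_unranked = np.trace(W[:, :3].T @ S_o @ W[:, :3])
top3_ranked = scores[:3].sum()
```

The full trace is unchanged by the rotation, but once only a few leading columns are kept, the ranked basis retains at least as much out-of-class scatter as any unranked selection of the same size.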

Our experiments confirm that Step 10 improves the results when combined with using (9) in Step 8 and the subspace dimension is cut to 1-25. With (10) and (12), the vectors have been already ranked to maximize (7) and Step 10 does not change the results. For (11) and (13), /-orthogonality of projection vectors in is now problematic. For (11), we have and, for (13), , which for the null space of becomes . Thus, in both cases all the eigenvalues in Step 10 will be equal to one and no further ranking information is gained.

In Step 12, the separate projection matrices are combined to form . Finally, we add an optional step to orthogonalize . This Step 14 compensates for the lack of orthogonality following from using (11) or (13) in Step 8. When (9), (10), or (12) is used, the projection vectors are already orthogonal and Step 14 has no effect.
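The text does not specify how the optional orthogonalization is carried out. One common choice, assumed here purely for illustration, is a reduced QR decomposition, which keeps the span of the first k columns intact for every k and therefore preserves a nested ranking of the projection vectors.

```python
import numpy as np

def orthogonalize(W):
    """Orthonormal basis for the columns of W via reduced QR (assumed choice)."""
    Q, _ = np.linalg.qr(W)
    return Q

rng = np.random.default_rng(0)
W = rng.normal(size=(12, 5))          # stand-in for non-orthogonal projections
Q = orthogonalize(W)

# The leading k columns of Q span the same subspace as those of W
nested_ranks = [int(np.linalg.matrix_rank(np.hstack([W[:, :k], Q[:, :k]])))
                for k in range(1, 6)]
```

If the input columns are already orthonormal, the result differs from the input at most by column signs, consistent with the step having no effect in that case.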

### IV-B Orthogonal CSDA

Next, we formulate three modified optimization criteria as straightforward class-specific versions of those originally used for (generalized) ULDA [20], OLDA [20], and ROLDA [13]. The criterion for the proposed Uncorrelated Class-Specific Discriminant Analysis (UCSDA) is

(14) | ||||||

where denotes the pseudo-inverse and is the subspace dimensionality. The projection vectors are required to be -orthogonal, which guarantees that the vectors mapped to the projection space are mutually uncorrelated. The criterion for the proposed Orthogonal Class-Specific Discriminant Analysis (OCSDA) is almost the same, but the constraint now requires standard orthogonality:

(15) | ||||||

Finally, the criterion for Regularized Orthogonal Class-Specific Discriminant Analysis (ROCSDA) regularizes the total scatter matrix in :

(16) | ||||||

Let us consider the criterion in and without taking the constraints into account first. We have

where is the subspace dimensionality, the first equality follows from and basic properties of trace, the second equality follows from for all square matrices, and the inequality is due to and for the positive semi-definite . Thus, the criterion gets its maximum value for a given if is full rank and is a null projection for . Therefore, NCSDA maximizes the criterion for if . From our null space analysis, we know that this can be obtained by removing the null space of if . As we consider in this paper only cases where , the solution of NCSDA always maximizes the criterion and provides a solution to UCSDA/OCSDA whenever the constraints are satisfied. In Section IV-A, we presented solutions which are either -orthogonal, orthogonal, or both.
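The bound can be checked numerically. The sketch below assumes the criterion has the ULDA-like form Tr((W^T S_t W)^+ (W^T S_o W)), which is an assumption made here since the criterion itself is not rendered in the text; the function and variable names are likewise hypothetical. Null projections of the intra-class scatter should attain the subspace dimensionality, while a generic projection of the same size should fall below it.

```python
import numpy as np

def trace_criterion(W, S_t, S_o):
    """Assumed criterion: Tr(pinv(W^T S_t W) (W^T S_o W))."""
    return np.trace(np.linalg.pinv(W.T @ S_t @ W) @ (W.T @ S_o @ W))

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))
m = X_pos.mean(axis=1, keepdims=True)
X = np.hstack([X_pos, X_neg]) - m
U, s, _ = np.linalg.svd(X, full_matrices=False)
U1 = U[:, s > 1e-10 * s[0]]                      # remove the null space of S_t
Z_pos, Z_neg = U1.T @ (X_pos - m), U1.T @ (X_neg - m)
S_i, S_o = Z_pos @ Z_pos.T, Z_neg @ Z_neg.T
S_t = S_i + S_o

evals, evecs = np.linalg.eigh(S_i)
W_null = evecs[:, evals < 1e-10 * evals.max()]   # null projections of S_i
W_rand = rng.normal(size=W_null.shape)           # generic projection, same size

d = W_null.shape[1]
J_null = trace_criterion(W_null, S_t, S_o)
J_rand = trace_criterion(W_rand, S_t, S_o)
```

The gap between `d` and `J_rand` equals the trace contribution of the intra-class scatter that the generic projection fails to suppress.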

The original solutions to generalized ULDA [20] and OLDA [20] were derived using simultaneous diagonalization of the three scatter matrices. This solution does not require to hold. It can be shown that simultaneously diagonalizes , , and and is a solution to for (see [20, 19] for details), when , where and are as defined in the reduced SVD of in (8) and is obtained by first mapping the negative samples as and then applying the reduced SVD on as

(17) |

where is orthogonal and contains the eigenvectors of corresponding to the non-zero eigenvalues. Thus, spans the row space of . We now provide a pseudo-code for the proposed UCSDA, OCSDA, and ROCSDA based on the above derivation in Algorithm 2.

While Steps 4-6 in Algorithm 1 apply uncentered PCA on the data, Steps 4-9 in Algorithm 2 apply uncentered PCA whitening. In our NCSDA experiments, we observed a discrepancy between the vectors spanning the null space of (as solved from (9)) and vectors spanning the row space of (as solved from (10)), which we believe to be related to the ill-conditioned total scatter matrix . The whitening operation gives -orthogonal projection vectors and, after mapping, all the eigenvalues of will be ones. Thus, is no longer ill-conditioned. Indeed, we observe that whitening cures the discrepancy, and the projection vectors solved using (10) then satisfy the null constraint for .

However, due to (6), all the non-zero eigenvalues of and will also be ones. This removes the ability of the non-zero eigenvalues to rank the projection vectors for dimensionality reduction. Applying ROCSDA provides a compromise between the ill-conditioned total scatter matrix and losing the ranking ability due to equalized eigenvalues. The regularized whitening leads to stable eigenvalues close to one, but leaves some variability to use for ranking the projection vectors, which significantly improves the performance. While ROCSDA can also prevent numerical errors due to division by very small numbers, this is not relevant for our implementation, because typically a threshold (e.g., ) is used to decide whether an eigenvalue equals zero and setting values of smaller than brings the same improvement. We believe that the whitening applied in UCSDA and OCSDA prevents numerical errors of NCSDA, while the main contribution of ROCSDA is to improve the projection vector ranking.
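The effect of whitening on the eigenvalues can be reproduced on toy data. In the sketch below, the function name `whiten`, the regularization parameter `alpha`, and the specific regularized form (adding `alpha` to the squared singular values before inverting) are assumptions made for illustration and not necessarily the exact form used in Algorithm 2.

```python
import numpy as np

def whiten(X_pos, X_neg, alpha=0.0, tol=1e-10):
    """Uncentered PCA whitening w.r.t. the positive mean; alpha > 0 regularizes."""
    m = X_pos.mean(axis=1, keepdims=True)
    X = np.hstack([X_pos, X_neg]) - m
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    keep = s > tol * s[0]
    T = U[:, keep] / np.sqrt(s[keep] ** 2 + alpha)   # scale each basis vector
    return T.T @ (X_pos - m), T.T @ (X_neg - m)

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))

# Plain whitening: the total scatter becomes the identity
Z_pos, Z_neg = whiten(X_pos, X_neg)
S_t_white = Z_pos @ Z_pos.T + Z_neg @ Z_neg.T
eig_plain = np.linalg.eigvalsh(Z_neg @ Z_neg.T)      # ascending order

# Regularized whitening: eigenvalues stay below one and keep some spread
R_pos, R_neg = whiten(X_pos, X_neg, alpha=1.0)
eig_reg = np.linalg.eigvalsh(R_neg @ R_neg.T)
top_reg = eig_reg[-8:]                               # the eight non-zero ones
```

With plain whitening the out-of-class eigenvalues collapse to zeros and ones, while the regularized variant leaves differences among the non-zero eigenvalues that can be used for ranking.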

Step 11 finds vectors spanning the row space of . We know from our null space analysis that due to (5) this corresponds to finding the vectors spanning the null space of . In fact, we could use any of the eigenproblems used in Step 8 of Algorithm 1 also here. However, we now confine ourselves to the best performing eigenproblem (12) along with (17) and a similar SVD approach for given as

(18) |

where we select as our projection vectors the columns of , which correspond to the zero singular values in the full SVD.
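An SVD-based null space computation of this kind can be sketched as follows, assuming the full SVD is applied to the positive data matrix in the reduced space. The function name `null_space_via_svd` and the threshold are hypothetical, and the exact matrix decomposed in (18) is an assumption here.

```python
import numpy as np

def null_space_via_svd(Z_pos, tol=1e-10):
    """Left singular vectors of Z_pos associated with zero singular values."""
    U, s, _ = np.linalg.svd(Z_pos, full_matrices=True)
    r = int(np.sum(s > tol * s[0]))              # numerical rank
    return U[:, r:]                              # remaining columns of the full U

rng = np.random.default_rng(0)
X_pos = rng.normal(size=(20, 5))
X_neg = rng.normal(size=(20, 8))
m = X_pos.mean(axis=1, keepdims=True)
X = np.hstack([X_pos, X_neg]) - m
U_t, s_t, _ = np.linalg.svd(X, full_matrices=False)
U1 = U_t[:, s_t > 1e-10 * s_t[0]]
Z_pos = U1.T @ (X_pos - m)                       # positive data, reduced space

W = null_space_via_svd(Z_pos)
residual = np.abs(W.T @ Z_pos).max()             # W annihilates positive data
```

Setting `full_matrices=True` matters: the reduced SVD would return only as many left singular vectors as there are positive samples and would miss most of the null space.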

### IV-C Heterogeneous Null Space CSDA and Heterogeneous Orthogonal CSDA

Up to this point, we have aimed at maximizing following the standard CSDA assumption, i.e., that the negative samples are evenly spread out around the positive class. However, this is typically not the case: the negative class actually consists of more than one distinct class. While class-specific approaches assume that the labels of such classes are unknown, we can still cluster the negative data and reformulate the solution so that we allow the clusters of similar items to stay close to each other and concentrate on maximizing the distance of these clusters to the positive class mean. Such an approach has been recently proposed in [21] and here we combine the heterogeneous formulation for the negative class with the concepts based on our null space analysis.

We now assume the negative class to be formed of clusters. The centroid of the cluster can be computed as , where is the number of items in the cluster and is the sample in the cluster. The out-of-class scatter was earlier defined as . Let us define the negative class within-cluster scatter and the between-cluster scatter as

We can show that as follows

where the zero comes from .
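The decomposition of the out-of-class scatter can be verified numerically for any partition of the negative samples. In the sketch below the cluster assignment is an arbitrary fixed labeling (an assumption for illustration; in practice it would come from a clustering algorithm), and the names `S_w` and `S_b` for the within-cluster and between-cluster scatter are introduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n_pos, n_neg, n_clusters = 6, 10, 30, 3
X_pos = rng.normal(size=(D, n_pos))
X_neg = rng.normal(size=(D, n_neg))

# Center at the positive class mean, so that mean plays the role of the origin
m = X_pos.mean(axis=1, keepdims=True)
Q = X_neg - m
labels = np.arange(n_neg) % n_clusters       # arbitrary stand-in clustering

S_o = Q @ Q.T
S_w = np.zeros((D, D))
S_b = np.zeros((D, D))
for k in range(n_clusters):
    Q_k = Q[:, labels == k]
    mu_k = Q_k.mean(axis=1, keepdims=True)   # cluster centroid
    S_w += (Q_k - mu_k) @ (Q_k - mu_k).T     # scatter around the centroid
    S_b += Q_k.shape[1] * (mu_k @ mu_k.T)    # centroid scatter around the origin

rank_S_b = int(np.linalg.matrix_rank(S_b))
```

The cross terms cancel because the deviations from a centroid sum to zero within each cluster, and the rank of the between-cluster scatter cannot exceed the number of clusters.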

We now proceed to propose optimization criteria for two heterogeneous CSDA variants exploiting the null space analysis, namely Heterogeneous Null space Class-Specific Discriminant Analysis (HNCSDA) and Heterogeneous Orthogonal Class-Specific Discriminant Analysis (HOCSDA). The criterion for HNCSDA is

(19) | ||||||

and for HOCSDA

(20) | ||||||

Maximizing instead of will try to push the clusters of the negative class far away from the positive class mean while allowing the samples within the clusters to remain close to each other. In many cases, this describes the negative class in a more natural way. Furthermore, we can see that , , and for linearly independent samples. Thus, the innate dimensionality of the proposed heterogeneous methods will be limited by the number of clusters in the negative class, which is low enough to be used as the final subspace dimensionality. The pseudo-code for the proposed HNCSDA is presented in Algorithm 3 and the pseudo-code for the proposed HOCSDA in Algorithm 4.