# Nonlinear Dimensionality Reduction for Discriminative Analytics of Multiple Datasets

Jia Chen, Gang Wang, and Georgios B. Giannakis. This work was supported in part by NSF grants 1711471, 1500713, and the NIH grant no. 1R01GM104975-01. This paper was presented in part at the 43rd IEEE International Conference on Acoustics, Speech, and Signal Processing, Calgary, Canada, April 15-20, 2018 [1]. The authors are with the Digital Technology Center and the Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA. Emails: {chen5625, gangwang, georgios}@umn.edu.
###### Abstract

Principal component analysis (PCA) is widely used for feature extraction and dimensionality reduction, with documented merits in diverse tasks involving high-dimensional data. Standard PCA copes with one dataset at a time, but it is challenged when it comes to analyzing multiple datasets jointly. In certain data science settings however, one is often interested in extracting the most discriminative information from one dataset of particular interest (a.k.a. target data) relative to the other(s) (a.k.a. background data). To this end, this paper puts forth a novel approach, termed discriminative (d) PCA, for such discriminative analytics of multiple datasets. Under certain conditions, dPCA is proved to be least-squares optimal in recovering the component vector unique to the target data relative to background data. To account for nonlinear data correlations, (linear) dPCA models for one or multiple background datasets are generalized through kernel-based learning. Interestingly, all dPCA variants admit an analytical solution obtainable with a single (generalized) eigenvalue decomposition. Finally, corroborating dimensionality reduction tests using both synthetic and real datasets are provided to validate the effectiveness of the proposed methods.

Index terms— Principal component analysis, discriminative analytics, multiple background datasets, kernel learning.

## I Introduction

Principal component analysis (PCA) is the ‘workhorse’ method for dimensionality reduction and feature extraction. It finds well-documented applications in the fields of statistics, bioinformatics, genomics, quantitative finance, and engineering, to name a few. The central objective of PCA is to obtain low-dimensional representations for high-dimensional data, while preserving most of the high-dimensional data variance [2].

However, various practical scenarios involve multiple datasets, in which one is tasked with extracting the most discriminative information of one dataset of particular interest relative to others. For instance, consider two gene-expression measurement datasets of volunteers from different geographical areas and genders: the first dataset collects gene-expression levels of cancer patients and is known as target data, while the second contains levels from healthy individuals and is called background data. The critical goal is to identify molecular subtypes of cancer within cancer patients. Performing PCA on either the target data or the target together with background data is likely to yield principal components (PCs) that correspond to the background information common to both datasets (e.g., the demographic patterns and genders) [3], rather than the PCs uniquely describing the subtypes of cancer. Albeit simple to comprehend and practically relevant, such discriminative data analytics has not been broadly addressed.

Generalizations of PCA include kernel (K) PCA [4, 5], graph PCA [6], L1-PCA [7], robust PCA [8, 9], multi-dimensional scaling [10], locally linear embedding [11], Isomap [12], and Laplacian eigenmaps [13]. Linear discriminant analysis (LDA) is a ‘supervised’ dimensionality reduction method, which seeks linear combinations of data vectors by reducing the variation within the same class while increasing the separation between classes of labeled data points [14]. Nonetheless, the aforementioned tools work with only a single dataset, and they are not able to analyze multiple datasets jointly. On the other hand, canonical correlation analysis is widely employed for analyzing multiple datasets [15, 16], but its goal is to extract the shared low-dimensional structure. The recent proposal called contrastive (c) PCA aims at extracting contrastive information between two datasets [17], by searching for directions along which the target data variance is large whereas the background data variance is small. cPCA can reveal dataset-specific information often missed by standard PCA if the involved hyper-parameter is properly selected. The cPCA solution is often found with SVD. Although it is feasible to automatically choose the best hyper-parameter from a list of candidate values, performing SVD multiple times can be computationally cumbersome in large-scale feature extraction settings.

Building on but going beyond cPCA, this paper starts by developing a novel approach, termed discriminative (d) PCA, for discriminative analytics of two datasets. dPCA looks for linear combinations of data vectors, by maximizing the ratio of the variance of target data to that of background data. This also justifies our chosen description as discriminative PCA. Under certain conditions, dPCA is proved to be least-squares (LS) optimal in the sense that dPCA reveals PCs specific to the target data relative to background data. Different from cPCA, dPCA is parameter-free, and it requires a single generalized eigen-decomposition, lending itself favorably to large-scale discriminative data exploration applications. However, real-world observations often exhibit nonlinear correlations, rendering dPCA inadequate for complex practical setups. To this end, nonlinear dPCA is developed via kernel-based learning. Similarly, the solution of KdPCA can be provided analytically in terms of generalized eigenvalue decompositions. As the complexity of KdPCA grows only linearly with the dimensionality of data vectors, KdPCA is preferable over dPCA for discriminative analytics of high-dimensional data.

This paper further extends dPCA to cope with multiple (more than two) background datasets. Specifically, we develop multi-background (M) dPCA to extract low-dimensional discriminative structure unique to the target data but not to multiple sets of background data. This becomes possible by looking for linear combinations of data vectors to maximize the ratio of the variance of target data to the sum of variances of all background data. At last, kernel (K) MdPCA is put forth to account for nonlinear data correlations.

The remainder of this paper is structured as follows. Upon reviewing the prior art in Section II, linear dPCA is motivated and presented in Section III. The optimality of dPCA is established in Section IV. To account for nonlinearities, KdPCA is developed in Section V. Generalizing their single-background variants, multi-background (M) dPCA and KMdPCA models are discussed in Section VI. Numerical tests are reported in Section VII, while the paper is concluded with research outlook in Section VIII.

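Notation: Bold uppercase (lowercase) letters denote matrices (column vectors). Operators $(\cdot)^{\top}$, $(\cdot)^{-1}$, and $\mathrm{Tr}(\cdot)$ denote matrix transposition, inverse, and trace, respectively; $\|\mathbf{a}\|_2$ is the $\ell_2$-norm of vector $\mathbf{a}$; $\mathbf{A}\succ\mathbf{0}$ means that symmetric matrix $\mathbf{A}$ is positive definite; $\mathrm{diag}(\{a_i\}_{i=1}^m)$ is a diagonal matrix holding elements $\{a_i\}_{i=1}^m$ on its main diagonal; $\mathbf{0}$ denotes all-zero vectors or matrices; and $\mathbf{I}$ represents identity matrices of suitable dimensions.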

## II Preliminaries and Prior Art

Let us start by considering two datasets, namely a target dataset $\{\mathbf{x}_i\in\mathbb{R}^D\}_{i=1}^m$ that we are interested in analyzing, and a background dataset $\{\mathbf{y}_j\in\mathbb{R}^D\}_{j=1}^n$ that contains latent background component vectors also present in the target data. Generalization to multiple background datasets will be presented in Section VI. Assume without loss of generality that both datasets are centered; in other words, their corresponding sample means have been removed from the datasets. To motivate our novel approaches in subsequent sections, some basics of PCA and cPCA are outlined next.

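Standard PCA handles a single dataset at a time. To extract useful information from $\{\mathbf{x}_i\}_{i=1}^m$, PCA looks for low-dimensional representations $\{\boldsymbol{\chi}_i\in\mathbb{R}^d\}_{i=1}^m$ with $d<D$ as linear combinations of $\{\mathbf{x}_i\}_{i=1}^m$ by maximizing the variances of $\{\boldsymbol{\chi}_i\}_{i=1}^m$ [2]. Specifically for $d=1$, (linear) PCA yields $\chi_i:=\hat{\mathbf{u}}^{\top}\mathbf{x}_i$, with the component (projection) vector $\hat{\mathbf{u}}\in\mathbb{R}^D$ found by

$$\hat{\mathbf{u}}:=\arg\max_{\mathbf{u}\in\mathbb{R}^D}\ \mathbf{u}^{\top}\mathbf{C}_{xx}\mathbf{u} \tag{1a}$$
$$\text{s. to}\quad \mathbf{u}^{\top}\mathbf{u}=1 \tag{1b}$$

where $\mathbf{C}_{xx}:=\frac{1}{m}\sum_{i=1}^m\mathbf{x}_i\mathbf{x}_i^{\top}\in\mathbb{R}^{D\times D}$ is the sample covariance matrix of $\{\mathbf{x}_i\}_{i=1}^m$. Solving (1) yields $\hat{\mathbf{u}}$ as the eigenvector of $\mathbf{C}_{xx}$ corresponding to the largest eigenvalue. The resulting projections $\{\chi_i=\hat{\mathbf{u}}^{\top}\mathbf{x}_i\}_{i=1}^m$ are the first principal components (PCs) of the target data vectors. When $d>1$, PCA looks for $\{\mathbf{u}_i\in\mathbb{R}^D\}_{i=1}^d$, obtained from the top $d$ eigenvectors of $\mathbf{C}_{xx}$. As alluded to in Section I, PCA applied on $\{\mathbf{x}_i\}_{i=1}^m$ only, or on the combined datasets $\{\{\mathbf{x}_i\}_{i=1}^m,\{\mathbf{y}_j\}_{j=1}^n\}$, generally cannot uncover the discriminative patterns or features of the target data relative to the background data.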
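As a minimal numerical sketch of (1), the snippet below forms the sample covariance of a toy dataset and extracts its leading eigenvectors with NumPy. The helper name `pca_directions` and the synthetic data are illustrative assumptions rather than part of the original development.

```python
import numpy as np

def pca_directions(X, d=1):
    """Top-d eigenvectors/eigenvalues of the sample covariance of the rows of X."""
    Xc = X - X.mean(axis=0)                    # remove the sample mean
    C_xx = Xc.T @ Xc / Xc.shape[0]             # D x D sample covariance
    eigvals, eigvecs = np.linalg.eigh(C_xx)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:d]      # indices of the d largest
    return eigvecs[:, order], eigvals[order]

rng = np.random.default_rng(0)
# Toy data (assumption): variance 9 along the first axis, 1 along the others.
X = rng.standard_normal((2000, 3)) * np.array([3.0, 1.0, 1.0])
U, lam = pca_directions(X, d=1)
scores = (X - X.mean(axis=0)) @ U              # first principal components
```

Here the leading direction aligns with the first coordinate axis simply because that axis carries the largest variance, regardless of whether that variance is of interest.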

On the other hand, the recent cPCA seeks a vector along which the target data exhibit large variations while the background data exhibit small variations. Concretely, cPCA solves [17]

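$$\max_{\mathbf{u}\in\mathbb{R}^D}\ \mathbf{u}^{\top}\mathbf{C}_{xx}\mathbf{u}-\alpha\,\mathbf{u}^{\top}\mathbf{C}_{yy}\mathbf{u} \tag{2a}$$
$$\text{s. to}\quad \mathbf{u}^{\top}\mathbf{u}=1 \tag{2b}$$

where $\mathbf{C}_{yy}:=\frac{1}{n}\sum_{j=1}^n\mathbf{y}_j\mathbf{y}_j^{\top}\in\mathbb{R}^{D\times D}$ denotes the sample covariance matrix of $\{\mathbf{y}_j\}_{j=1}^n$, and the hyper-parameter $\alpha$ trades off maximizing the target data variance (the first term in (2a)) for minimizing the background data variance (the second term). For a given $\alpha$, the solution of (2) is given by the eigenvector of $\mathbf{C}_{xx}-\alpha\mathbf{C}_{yy}$ associated with its largest eigenvalue, along which the obtained data projections constitute the first contrastive (c) PCs. Nonetheless, there is no rule of thumb for choosing $\alpha$. Even though a spectral-clustering based algorithm is utilized to automatically select $\alpha$ from a list of candidate values, its brute-force search discourages its use in large-scale datasets.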
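The dependence of cPCA on its hyper-parameter can be illustrated with a small sketch of (2). The diagonal covariance matrices and the function name `cpca_direction` below are hypothetical choices made only for illustration.

```python
import numpy as np

def cpca_direction(C_xx, C_yy, alpha):
    """Leading eigenvector of C_xx - alpha * C_yy, i.e., a solution of (2)."""
    eigvals, eigvecs = np.linalg.eigh(C_xx - alpha * C_yy)
    return eigvecs[:, -1]                      # eigh sorts eigenvalues ascending

# Hypothetical covariances: axis 0 is strong in both datasets,
# axis 1 is weaker but more pronounced in the target data.
C_xx = np.diag([10.0, 4.0, 1.0])
C_yy = np.diag([10.0, 1.0, 1.0])

u_small = cpca_direction(C_xx, C_yy, alpha=0.0)   # alpha = 0 reduces to PCA
u_large = cpca_direction(C_xx, C_yy, alpha=1.0)   # shared axis is suppressed
```

In this toy case a small value of the hyper-parameter returns the shared axis, whereas a larger one returns the target-specific axis, which is why the choice matters in practice.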

## III Discriminative Principal Component Analysis

Unlike PCA, LDA is a ‘supervised’ dimensionality reduction method. It looks for linear combinations of data vectors that reduce the variation within the same class and increase the separation between classes [14]. This is accomplished by maximizing the ratio of the labeled data variance between classes to that within the classes.

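In a related but unsupervised setup, when we are given a target dataset $\{\mathbf{x}_i\}_{i=1}^m$ and a background dataset $\{\mathbf{y}_j\}_{j=1}^n$, and we are tasked with unveiling component vectors that are only present in $\{\mathbf{x}_i\}_{i=1}^m$ but not in $\{\mathbf{y}_j\}_{j=1}^n$, a meaningful approach would be maximizing the ratio of the variance of target data over that of the background data. With a slight abuse of the term ‘discriminant’, we call our approach discriminative (d) PCA, which solves

$$\hat{\mathbf{u}}:=\arg\max_{\mathbf{u}\in\mathbb{R}^D}\ \frac{\mathbf{u}^{\top}\mathbf{C}_{xx}\mathbf{u}}{\mathbf{u}^{\top}\mathbf{C}_{yy}\mathbf{u}} \tag{3a}$$
$$\text{s. to}\quad \mathbf{u}^{\top}\mathbf{u}=1. \tag{3b}$$

We will term the solution of (3) the discriminant component vector, and the projections $\{\hat{\mathbf{u}}^{\top}\mathbf{x}_i\}_{i=1}^m$ the first discriminative (d) PCs. Next, we discuss the solution of (3).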

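Using Lagrangian duality theory, the solution of (3) can be found as the right eigenvector of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ associated with the largest eigenvalue. To establish this, note that (3) can be equivalently rewritten as

$$\hat{\mathbf{u}}:=\arg\max_{\mathbf{u}\in\mathbb{R}^D}\ \mathbf{u}^{\top}\mathbf{C}_{xx}\mathbf{u} \tag{4a}$$
$$\text{s. to}\quad \mathbf{u}^{\top}\mathbf{C}_{yy}\mathbf{u}=1 \tag{4b}$$

followed by scaling to have unit norm in accordance with (3b). Letting $\lambda$ denote the dual variable associated with the constraint (4b), the Lagrangian of (4) becomes

$$\mathcal{L}(\mathbf{u};\lambda)=\mathbf{u}^{\top}\mathbf{C}_{xx}\mathbf{u}+\lambda\left(1-\mathbf{u}^{\top}\mathbf{C}_{yy}\mathbf{u}\right).$$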

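At the optimum $(\hat{\mathbf{u}};\hat{\lambda})$, the KKT conditions confirm that

$$\mathbf{C}_{xx}\hat{\mathbf{u}}=\hat{\lambda}\,\mathbf{C}_{yy}\hat{\mathbf{u}}. \tag{5}$$

This is a generalized eigenvalue problem, whose solution $\hat{\mathbf{u}}$ is the generalized eigenvector of $(\mathbf{C}_{xx},\mathbf{C}_{yy})$ corresponding to the generalized eigenvalue $\hat{\lambda}$. Left-multiplying (5) by $\hat{\mathbf{u}}^{\top}$ and using (4b) yields $\hat{\mathbf{u}}^{\top}\mathbf{C}_{xx}\hat{\mathbf{u}}=\hat{\lambda}\,\hat{\mathbf{u}}^{\top}\mathbf{C}_{yy}\hat{\mathbf{u}}=\hat{\lambda}$, corroborating that the optimal objective value of (4a) is attained when $\hat{\lambda}$ is the largest generalized eigenvalue. Furthermore, (5) can be solved efficiently by means of well-documented generalized eigenvalue decomposition solvers, including e.g., those based on the Cholesky factorization [18].

Supposing further that $\mathbf{C}_{yy}$ is nonsingular, left-multiplying (5) by $\mathbf{C}_{yy}^{-1}$ yields

$$\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}\hat{\mathbf{u}}=\hat{\lambda}\hat{\mathbf{u}}. \tag{6}$$

Evidently, the optimal solution of (4) can also be found as the right eigenvector of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ corresponding to the largest eigenvalue $\hat{\lambda}$.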
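One possible way to solve (5) numerically is sketched below: a Cholesky factor of the background covariance turns the generalized problem into a standard symmetric one. This is a sketch under the assumption that the background covariance is positive definite; the function name `dpca_direction` and the synthetic data are hypothetical.

```python
import numpy as np

def dpca_direction(C_xx, C_yy):
    """Leading generalized eigenpair of (C_xx, C_yy); u is scaled to unit norm."""
    L = np.linalg.cholesky(C_yy)               # C_yy = L L^T (requires C_yy > 0)
    A = np.linalg.solve(L, C_xx)               # L^{-1} C_xx
    M = np.linalg.solve(L, A.T)                # L^{-1} C_xx L^{-T}
    M = (M + M.T) / 2                          # symmetrize against round-off
    eigvals, eigvecs = np.linalg.eigh(M)
    u = np.linalg.solve(L.T, eigvecs[:, -1])   # u = L^{-T} v satisfies (5)
    return u / np.linalg.norm(u), eigvals[-1]

rng = np.random.default_rng(1)
# Assumed toy data: axis 0 is strong in both sets, axis 1 only in the target.
X = rng.standard_normal((500, 4)) * np.array([3.0, 2.0, 1.0, 1.0])
Y = rng.standard_normal((400, 4)) * np.array([3.0, 1.0, 1.0, 1.0])
X -= X.mean(axis=0)
Y -= Y.mean(axis=0)
C_xx = X.T @ X / len(X)
C_yy = Y.T @ Y / len(Y)

u_hat, lam_hat = dpca_direction(C_xx, C_yy)
ratio = (u_hat @ C_xx @ u_hat) / (u_hat @ C_yy @ u_hat)   # objective (3a)
```

The variance ratio evaluated at the returned vector matches the largest generalized eigenvalue, in line with the argument following (5).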

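To find multiple ($d\ge 2$) component vectors, namely $\{\mathbf{u}_i\in\mathbb{R}^D\}_{i=1}^d$ that form $\mathbf{U}:=[\mathbf{u}_1\ \cdots\ \mathbf{u}_d]\in\mathbb{R}^{D\times d}$, (3) can be generalized as follows (cf. (3))

$$\hat{\mathbf{U}}:=\arg\max_{\mathbf{U}\in\mathbb{R}^{D\times d}}\ \mathrm{Tr}\!\left[\left(\mathbf{U}^{\top}\mathbf{C}_{yy}\mathbf{U}\right)^{-1}\mathbf{U}^{\top}\mathbf{C}_{xx}\mathbf{U}\right]. \tag{7}$$

For concreteness, the solution of (7) is given in Theorem 1 next. Our proposed dPCA for discriminative analytics of two datasets is summarized in Algorithm 1 for future reference.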

###### Theorem 1.

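Given centered data $\{\mathbf{x}_i\in\mathbb{R}^D\}_{i=1}^m$ and $\{\mathbf{y}_j\in\mathbb{R}^D\}_{j=1}^n$ with sample covariance matrices $\mathbf{C}_{xx}:=\frac{1}{m}\sum_{i=1}^m\mathbf{x}_i\mathbf{x}_i^{\top}$ and $\mathbf{C}_{yy}:=\frac{1}{n}\sum_{j=1}^n\mathbf{y}_j\mathbf{y}_j^{\top}\succ\mathbf{0}$, the $i$-th column of the dPCA optimal solution $\hat{\mathbf{U}}\in\mathbb{R}^{D\times d}$ in (7) is given by the right eigenvector of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ associated with the $i$-th largest eigenvalue, where $i=1,\ldots,d$.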
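The statement of Theorem 1 suggests the following sketch of a dPCA routine, which centers both datasets, forms the two sample covariances, and keeps the leading right eigenvectors. It is an illustration under assumed synthetic data, not a reproduction of Algorithm 1; the name `dpca` and the two-subgroup toy example are hypothetical.

```python
import numpy as np

def dpca(X, Y, d):
    """Top-d dPCA directions U (D x d) and projected target data (m x d).

    X: (m, D) target data and Y: (n, D) background data, one sample per row.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    C_xx = Xc.T @ Xc / Xc.shape[0]
    C_yy = Yc.T @ Yc / Yc.shape[0]
    # Right eigenvectors of C_yy^{-1} C_xx; solve() avoids an explicit inverse.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(C_yy, C_xx))
    order = np.argsort(eigvals.real)[::-1][:d]
    U = eigvecs[:, order].real
    U /= np.linalg.norm(U, axis=0)
    return U, Xc @ U

rng = np.random.default_rng(2)
scales = np.array([5.0, 1.0, 1.0, 1.0, 1.0])     # axis 0: shared strong variation
Y = rng.standard_normal((300, 5)) * scales
X = rng.standard_normal((300, 5)) * scales
X[:150, 1] += 3.0                                 # two target-only subgroups
X[150:, 1] -= 3.0                                 # separated along axis 1

U, scores = dpca(X, Y, d=2)
Xc = X - X.mean(axis=0)
u_pca = np.linalg.eigh(Xc.T @ Xc / len(Xc))[1][:, -1]   # PCA direction for comparison
```

In this toy setting the first dPCA direction concentrates on the subgroup axis, while the PCA direction follows the shared high-variance axis.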

Starting with the first column of $\hat{\mathbf{U}}$ as in (3), the proof of Theorem 1 proceeds inductively to determine the second column as in (3) that is also orthogonal to the first column. Since this proof of (7) parallels that of PCA, we skip it for brevity. Three remarks are in order.

###### Remark 1.

When there is no background data, with $\mathbf{C}_{yy}=\mathbf{I}$, dPCA boils down to standard PCA.

###### Remark 2.

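Consider the eigenvalue decomposition $\mathbf{C}_{yy}=\mathbf{U}_y\boldsymbol{\Sigma}_{yy}\mathbf{U}_y^{\top}$. With $\mathbf{C}_{yy}^{1/2}:=\mathbf{U}_y\boldsymbol{\Sigma}_{yy}^{1/2}$, and upon changing variables to $\mathbf{v}:=\mathbf{C}_{yy}^{\top/2}\mathbf{u}$, (4) can be expressed as

$$\hat{\mathbf{v}}:=\arg\max_{\mathbf{v}\in\mathbb{R}^D}\ \mathbf{v}^{\top}\mathbf{C}_{yy}^{-1/2}\mathbf{C}_{xx}\mathbf{C}_{yy}^{-\top/2}\mathbf{v}\quad\text{s. to}\quad \mathbf{v}^{\top}\mathbf{v}=1$$

whose solution is provided by the leading eigenvector of $\mathbf{C}_{yy}^{-1/2}\mathbf{C}_{xx}\mathbf{C}_{yy}^{-\top/2}$. Subsequently, the solution of (4) is recovered as $\hat{\mathbf{u}}:=\mathbf{C}_{yy}^{-\top/2}\hat{\mathbf{v}}$, followed by normalization to obey the unit norm in (3b). This indeed suggests that discriminative analytics of $\{\mathbf{x}_i\}_{i=1}^m$ and $\{\mathbf{y}_j\}_{j=1}^n$ using dPCA can be understood as PCA of the ‘denoised’ or ‘background-removed’ data $\{\mathbf{C}_{yy}^{-1/2}\mathbf{x}_i\}_{i=1}^m$, followed by an ‘inverse’ transformation to map the obtained component vector from the $\mathbf{C}_{yy}^{-1/2}\mathbf{x}_i$ data space to the $\mathbf{x}_i$ data space. In this sense, $\{\mathbf{C}_{yy}^{-1/2}\mathbf{x}_i\}_{i=1}^m$ can be seen as the data obtained after removing the dominant ‘background’ component vectors from the target data.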

###### Remark 3.

Inexpensive power or Lanczos iterations [18] can be employed to compute the principal eigenvectors in (6) efficiently.
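A plain power iteration for (6) can be sketched as follows. Each step applies the background-whitened operator through a linear solve rather than an explicit inverse. The function name `dpca_power_iteration`, the stopping rule, and the diagonal test matrices are assumptions made for illustration.

```python
import numpy as np

def dpca_power_iteration(C_xx, C_yy, num_iters=500, tol=1e-12, seed=0):
    """Approximate the principal right eigenvector of C_yy^{-1} C_xx."""
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(C_xx.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(num_iters):
        w = np.linalg.solve(C_yy, C_xx @ u)    # apply C_yy^{-1} C_xx to u
        w /= np.linalg.norm(w)
        converged = np.linalg.norm(w - u) < tol
        u = w
        if converged:
            break
    lam = (u @ C_xx @ u) / (u @ C_yy @ u)      # generalized Rayleigh quotient
    return u, lam

# Hypothetical covariances whose variance ratios are 1, 4, and 1.
C_xx = np.diag([10.0, 4.0, 1.0])
C_yy = np.diag([10.0, 1.0, 1.0])
u, lam = dpca_power_iteration(C_xx, C_yy)
```

Since the eigenvalues involved are nonnegative, the iterates do not alternate in sign, and the convergence rate is governed by the ratio of the two largest eigenvalues.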

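Consider again (4). Based on Lagrange duality, when selecting $\alpha=\hat{\lambda}$ in (2), cPCA maximizing $\mathbf{u}^{\top}(\mathbf{C}_{xx}-\hat{\lambda}\mathbf{C}_{yy})\mathbf{u}$ is equivalent to $\max_{\mathbf{u}}\mathcal{L}(\mathbf{u};\hat{\lambda})=\mathbf{u}^{\top}(\mathbf{C}_{xx}-\hat{\lambda}\mathbf{C}_{yy})\mathbf{u}+\hat{\lambda}$, which coincides with dPCA. This suggests that cPCA and dPCA become equivalent when $\alpha$ in cPCA is carefully chosen as the optimal dual variable $\hat{\lambda}$ of our dPCA formulation (4), namely to be the largest eigenvalue of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$.

To gain further insight on the relationship between dPCA and cPCA, let us suppose that $\mathbf{C}_{xx}$ and $\mathbf{C}_{yy}$ are simultaneously diagonalizable; that is, there exists a unitary matrix $\mathbf{U}\in\mathbb{R}^{D\times D}$ such that

$$\mathbf{C}_{xx}:=\mathbf{U}\boldsymbol{\Sigma}_{xx}\mathbf{U}^{\top},\quad\text{and}\quad\mathbf{C}_{yy}:=\mathbf{U}\boldsymbol{\Sigma}_{yy}\mathbf{U}^{\top}$$

where diagonal matrices $\boldsymbol{\Sigma}_{xx},\boldsymbol{\Sigma}_{yy}\succ\mathbf{0}$ hold accordingly eigenvalues $\{\lambda_{x,i}\}_{i=1}^D$ of $\mathbf{C}_{xx}$ and $\{\lambda_{y,i}\}_{i=1}^D$ of $\mathbf{C}_{yy}$ on their main diagonals. It is easy to check that $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}=\mathbf{U}\boldsymbol{\Sigma}_{yy}^{-1}\boldsymbol{\Sigma}_{xx}\mathbf{U}^{\top}=\mathbf{U}\,\mathrm{diag}(\{\lambda_{x,i}/\lambda_{y,i}\}_{i=1}^D)\,\mathbf{U}^{\top}$. Seeking the first $d$ component vectors is tantamount to taking the $d$ columns of $\mathbf{U}$ that correspond to the $d$ largest values among $\{\lambda_{x,i}/\lambda_{y,i}\}_{i=1}^D$. On the other hand, cPCA for a fixed $\alpha$ looks for the first $d$ component vectors of $\mathbf{C}_{xx}-\alpha\mathbf{C}_{yy}=\mathbf{U}(\boldsymbol{\Sigma}_{xx}-\alpha\boldsymbol{\Sigma}_{yy})\mathbf{U}^{\top}$, which amounts to taking the $d$ columns of $\mathbf{U}$ associated with the $d$ largest values in $\{\lambda_{x,i}-\alpha\lambda_{y,i}\}_{i=1}^D$.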
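The ratio-versus-difference ranking in the simultaneously diagonalizable case can be checked on a small hypothetical spectrum. The eigenvalues below are made-up numbers chosen so that the two rules disagree for small values of the cPCA hyper-parameter.

```python
import numpy as np

# Hypothetical eigenvalues of C_xx and C_yy in a common eigenbasis.
lam_x = np.array([10.0, 4.0, 1.0])
lam_y = np.array([10.0, 1.0, 1.0])

# dPCA ranks the common eigenvectors by the ratio lam_x / lam_y.
dpca_pick = int(np.argmax(lam_x / lam_y))

# cPCA ranks them by lam_x - alpha * lam_y, which depends on alpha.
cpca_picks = {alpha: int(np.argmax(lam_x - alpha * lam_y))
              for alpha in (0.1, 1.0, 4.0)}
```

With these numbers dPCA selects the second eigenvector, and cPCA agrees with it once the hyper-parameter is large enough, including when it equals the largest ratio (here 4), which is the case singled out by the duality argument.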

## IV Optimality of dPCA

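In this section, we show that dPCA is optimal when data obey a certain affine model. In a similar vein, PCA adopts a factor analysis model to express the non-centered background data $\{\mathring{\mathbf{y}}_j\in\mathbb{R}^D\}_{j=1}^n$ as

$$\mathring{\mathbf{y}}_j=\mathbf{m}_y+\mathbf{U}_b\boldsymbol{\psi}_j+\mathbf{e}_{y,j},\quad j=1,2,\ldots,n \tag{9}$$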

where denotes the unknown location (mean) vector; has orthonormal columns with ; are some unknown coefficients with covariance matrix ; and the modeling errors are assumed to be independent and identically distributed (i.i.d.) zero-mean random vectors with covariance matrix . Adopting the least-squares (LS) criterion, the unknowns , , and can be estimated by [19]

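$$\min_{\mathbf{m}_y,\{\boldsymbol{\psi}_j\},\mathbf{U}_b}\ \sum_{j=1}^n\left\|\mathring{\mathbf{y}}_j-\mathbf{m}_y-\mathbf{U}_b\boldsymbol{\psi}_j\right\|_2^2\quad\text{s. to}\quad \mathbf{U}_b^{\top}\mathbf{U}_b=\mathbf{I}$$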

whose solution is provided by , , and columns are the first leading eigenvectors of , in which . It is clear that . Introduce matrix such that its orthonormal columns satisfy . Furthermore, let and with . Therefore, . As , the strong law of large numbers asserts that ; that is, as .

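Here we assume that the target data, namely $\{\mathring{\mathbf{x}}_i\in\mathbb{R}^D\}_{i=1}^m$, share the background component vectors with data $\{\mathring{\mathbf{y}}_j\}_{j=1}^n$, but also have $d$ extra component vectors specific to the target data relative to the background data. Focusing for simplicity on $d=1$, we model $\{\mathring{\mathbf{x}}_i\}_{i=1}^m$ as

$$\mathring{\mathbf{x}}_i=\mathbf{m}_x+[\mathbf{U}_b\ \ \mathbf{u}_s]\begin{bmatrix}\boldsymbol{\chi}_{b,i}\\ \chi_{s,i}\end{bmatrix}+\mathbf{e}_{x,i},\quad i=1,2,\ldots,m \tag{10}$$

where $\mathbf{m}_x$ represents the location of $\{\mathring{\mathbf{x}}_i\}_{i=1}^m$; $\{\mathbf{e}_{x,i}\}_{i=1}^m$ account for zero-mean modeling errors; $\mathbf{U}_x:=[\mathbf{U}_b\ \ \mathbf{u}_s]\in\mathbb{R}^{D\times(k+1)}$ collects orthonormal columns, where $\mathbf{U}_b$ holds the component vectors shared with the background data, and $\mathbf{u}_s\in\mathbb{R}^D$ is the pattern that is present only in the target data. Simply put, our goal is to extract this discriminative subspace $\mathbf{u}_s$ given $\{\mathring{\mathbf{x}}_i\}_{i=1}^m$ and $\{\mathring{\mathbf{y}}_j\}_{j=1}^n$.

Similarly, given $\{\mathring{\mathbf{x}}_i\}_{i=1}^m$, the unknowns $\mathbf{m}_x$, $\mathbf{U}_x$, and $\{\boldsymbol{\chi}_i:=[\boldsymbol{\chi}_{b,i}^{\top}\ \chi_{s,i}]^{\top}\}_{i=1}^m$ can be estimated by

$$\min_{\mathbf{m}_x,\{\boldsymbol{\chi}_i\},\mathbf{U}_x}\ \sum_{i=1}^m\left\|\mathring{\mathbf{x}}_i-\mathbf{m}_x-\mathbf{U}_x\boldsymbol{\chi}_i\right\|_2^2\quad\text{s. to}\quad \mathbf{U}_x^{\top}\mathbf{U}_x=\mathbf{I}$$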

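yielding $\hat{\mathbf{m}}_x:=\frac{1}{m}\sum_{i=1}^m\mathring{\mathbf{x}}_i$, $\hat{\boldsymbol{\chi}}_i:=\hat{\mathbf{U}}_x^{\top}\mathbf{x}_i$ with $\mathbf{x}_i:=\mathring{\mathbf{x}}_i-\hat{\mathbf{m}}_x$, where $\hat{\mathbf{U}}_x$ stacks up as its columns the eigenvectors of $\mathbf{C}_{xx}$. When $m\to\infty$, it holds that $\mathbf{C}_{xx}=\mathbf{U}_x\boldsymbol{\Sigma}_x\mathbf{U}_x^{\top}$, with $\boldsymbol{\Sigma}_x:=\mathrm{diag}(\lambda_{x,1},\ldots,\lambda_{x,k+1})\in\mathbb{R}^{(k+1)\times(k+1)}$.

Let $\boldsymbol{\Sigma}_{x,k}\in\mathbb{R}^{k\times k}$ denote the submatrix of $\boldsymbol{\Sigma}_x$ formed by its first $k$ rows and columns. When $m,n\to\infty$ and $\mathbf{C}_{yy}$ is nonsingular, one can express $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ as follows

$$\begin{aligned}\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}&=\mathbf{U}_y\boldsymbol{\Sigma}_y^{-1}\mathbf{U}_y^{\top}\mathbf{U}_x\boldsymbol{\Sigma}_x\mathbf{U}_x^{\top}\\&=[\mathbf{U}_b\ \ \mathbf{U}_n]\begin{bmatrix}\boldsymbol{\Sigma}_b^{-1}&\mathbf{0}\\ \mathbf{0}&\mathbf{I}\end{bmatrix}\begin{bmatrix}\mathbf{I}&\mathbf{0}\\ \mathbf{0}&\mathbf{U}_n^{\top}\mathbf{u}_s\end{bmatrix}\begin{bmatrix}\boldsymbol{\Sigma}_{x,k}&\mathbf{0}\\ \mathbf{0}&\lambda_{x,k+1}\end{bmatrix}\begin{bmatrix}\mathbf{U}_b^{\top}\\ \mathbf{u}_s^{\top}\end{bmatrix}\\&=[\mathbf{U}_b\ \ \mathbf{U}_n]\begin{bmatrix}\boldsymbol{\Sigma}_b^{-1}\boldsymbol{\Sigma}_{x,k}&\mathbf{0}\\ \mathbf{0}&\lambda_{x,k+1}\mathbf{U}_n^{\top}\mathbf{u}_s\end{bmatrix}\begin{bmatrix}\mathbf{U}_b^{\top}\\ \mathbf{u}_s^{\top}\end{bmatrix}\\&=\mathbf{U}_b\boldsymbol{\Sigma}_b^{-1}\boldsymbol{\Sigma}_{x,k}\mathbf{U}_b^{\top}+\lambda_{x,k+1}\mathbf{U}_n\mathbf{U}_n^{\top}\mathbf{u}_s\mathbf{u}_s^{\top}.\end{aligned}$$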

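Observe that the first and second summands have rank $k$ and $1$, respectively, thus implying that $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ has at most rank $k+1$. If $\mathbf{u}_{b,i}$ denotes the $i$-th column of $\mathbf{U}_b$, which is orthogonal to $\{\mathbf{u}_{b,j}\}_{j\ne i}$ and $\mathbf{u}_s$, right-multiplying $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ by $\mathbf{u}_{b,i}$ yields

$$\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}\mathbf{u}_{b,i}=(\lambda_{x,i}/\lambda_{y,i})\,\mathbf{u}_{b,i}$$

for $i=1,\ldots,k$, which hints that $\{\mathbf{u}_{b,i}\}_{i=1}^k$ are $k$ eigenvectors of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ associated with eigenvalues $\{\lambda_{x,i}/\lambda_{y,i}\}_{i=1}^k$. Again, right-multiplying $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ by $\mathbf{u}_s$ gives rise to

$$\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}\mathbf{u}_s=\lambda_{x,k+1}\mathbf{U}_n\mathbf{U}_n^{\top}\mathbf{u}_s\mathbf{u}_s^{\top}\mathbf{u}_s=\lambda_{x,k+1}\mathbf{U}_n\mathbf{U}_n^{\top}\mathbf{u}_s. \tag{11}$$

To proceed, we will leverage the following three facts: i) $\mathbf{u}_s$ is orthogonal to all columns of $\mathbf{U}_b$; ii) columns of $\mathbf{U}_n$ are orthogonal to those of $\mathbf{U}_b$; and iii) $[\mathbf{U}_b\ \ \mathbf{U}_n]$ has full rank. Based on i)-iii), it follows readily that $\mathbf{u}_s$ can be uniquely expressed as a linear combination of columns of $\mathbf{U}_n$; that is, $\mathbf{u}_s:=\sum_{i=1}^{D-k}p_i\mathbf{u}_{n,i}$, where $\{p_i\}_{i=1}^{D-k}$ are some unknown coefficients, and $\mathbf{u}_{n,i}$ denotes the $i$-th column of $\mathbf{U}_n$. One can manipulate $\mathbf{U}_n\mathbf{U}_n^{\top}\mathbf{u}_s$ in (11) as

$$\begin{aligned}\mathbf{U}_n\mathbf{U}_n^{\top}\mathbf{u}_s&=\mathbf{u}_{n,1}\mathbf{u}_{n,1}^{\top}\mathbf{u}_s+\cdots+\mathbf{u}_{n,D-k}\mathbf{u}_{n,D-k}^{\top}\mathbf{u}_s\\&=p_1\mathbf{u}_{n,1}+\cdots+p_{D-k}\mathbf{u}_{n,D-k}\\&=\mathbf{u}_s\end{aligned}$$

yielding $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}\mathbf{u}_s=\lambda_{x,k+1}\mathbf{u}_s$; that is, $\mathbf{u}_s$ is the $(k+1)$-st eigenvector of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ corresponding to eigenvalue $\lambda_{x,k+1}$.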

Before moving on, we will make two assumptions.

###### Assumption 1.

Background and target data are generated according to the models (9) and (10), respectively, with the background data sample covariance matrix being nonsingular.

###### Assumption 2.

It holds for all $i=1,\ldots,k$ that $\lambda_{x,k+1}>\lambda_{x,i}/\lambda_{y,i}$.

Assumption 2 essentially requires that $\mathbf{u}_s$ is discriminative enough in the target data relative to the background data. Assumption 2 states that the eigenvector of $\mathbf{C}_{yy}^{-1}\mathbf{C}_{xx}$ associated with the largest eigenvalue is $\mathbf{u}_s$. Under these two assumptions, we establish the optimality of dPCA next.

###### Theorem 2.

Under Assumptions 1 and 2 with $d=1$, as $m,n\to\infty$, the solution of (3) recovers the component vector specific to target data relative to background data, namely $\mathbf{u}_s$.
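The claim of Theorem 2 can be illustrated numerically with population (infinite-sample) covariances built to match the structure used in the derivation above: the background covariance has a chosen spectrum along the shared basis and unit eigenvalues on the complement, while the target covariance has rank $k+1$. All numerical values and variable names below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(3)
D, k = 6, 2
Q, _ = np.linalg.qr(rng.standard_normal((D, D)))    # random orthonormal basis
U_b, u_s, U_n = Q[:, :k], Q[:, k], Q[:, k:]          # u_s lies in the span of U_n

lam_y = np.array([20.0, 10.0])   # assumed background eigenvalues along U_b
lam_x = np.array([30.0, 12.0])   # assumed target eigenvalues along U_b
lam_s = 3.0                       # target eigenvalue along u_s; exceeds 30/20 and 12/10

C_yy = U_b @ np.diag(lam_y) @ U_b.T + U_n @ U_n.T
C_xx = U_b @ np.diag(lam_x) @ U_b.T + lam_s * np.outer(u_s, u_s)

# dPCA: principal right eigenvector of C_yy^{-1} C_xx.
eigvals, eigvecs = np.linalg.eig(np.linalg.solve(C_yy, C_xx))
u_dpca = eigvecs[:, np.argmax(eigvals.real)].real

# PCA on the target covariance alone, for comparison.
u_pca = np.linalg.eigh(C_xx)[1][:, -1]
```

Under these assumed values the dPCA direction coincides with the target-specific vector up to sign, whereas the PCA direction is a shared background vector orthogonal to it.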

## V Kernel dPCA

With advances in data acquisition and data storage technologies, a sheer volume of possibly high-dimensional data are collected every day, which in general lie on a nonlinear manifold. This goes beyond the ability of the (linear) dPCA in Section III, mainly for two reasons: i) dPCA presumes a linear low-dimensional hyperplane onto which the target data vectors are projected; and ii) dPCA incurs computational complexity that grows quadratically with the dimensionality of data vectors. To address these challenges, this section generalizes dPCA to account for nonlinear data relationships via kernel-based learning, and puts forth kernel (K) dPCA for nonlinear discriminative analytics. Specifically, KdPCA starts by ‘lifting’ both the target and background data vectors from the original data space to a higher-dimensional (possibly infinite-dimensional) feature space using a common nonlinear mapping, which is followed by performing linear dPCA on the lifted data.

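Consider first the dual version of dPCA. Toward this end, define the augmented data vector $\mathbf{z}_i\in\mathbb{R}^D$ with $1\le i\le N:=m+n$ as

$$\mathbf{z}_i:=\begin{cases}\mathbf{x}_i, & 1\le i\le m\\ \mathbf{y}_{i-m}, & m<i\le N\end{cases}$$

and express the wanted component vector $\mathbf{u}$ in terms of $\mathbf{Z}:=[\mathbf{z}_1\ \cdots\ \mathbf{z}_N]\in\mathbb{R}^{D\times N}$, yielding $\mathbf{u}:=\mathbf{Z}\mathbf{a}$, where $\mathbf{a}\in\mathbb{R}^N$ denotes the dual vector. Substituting $\mathbf{u}=\mathbf{Z}\mathbf{a}$ into (3) leads to our dual dPCA

$$\max_{\mathbf{a}\in\mathbb{R}^N}\ \frac{\mathbf{a}^{\top}\mathbf{Z}^{\top}\mathbf{C}_{xx}\mathbf{Z}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{Z}^{\top}\mathbf{C}_{yy}\mathbf{Z}\mathbf{a}} \tag{12a}$$
$$\text{s. to}\quad \mathbf{a}^{\top}\mathbf{a}=1 \tag{12b}$$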

based on which we will develop our KdPCA in the sequel.

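Similar to deriving KPCA from dual PCA [4], our approach is first to transform $\{\mathbf{z}_i\}_{i=1}^N$ from $\mathbb{R}^D$ to a high-dimensional space $\mathbb{R}^L$ (possibly with $L=\infty$) by some nonlinear mapping function $\boldsymbol{\phi}(\cdot)$, followed by removing the sample means of $\{\boldsymbol{\phi}(\mathbf{x}_i)\}_{i=1}^m$ and $\{\boldsymbol{\phi}(\mathbf{y}_j)\}_{j=1}^n$ from the corresponding transformed data; and subsequently, implementing dPCA on the centered transformed datasets to obtain the low-dimensional dPCs. Specifically, the sample covariance matrices of the lifted data $\{\boldsymbol{\phi}(\mathbf{x}_i)\}_{i=1}^m$ and $\{\boldsymbol{\phi}(\mathbf{y}_j)\}_{j=1}^n$ can be expressed as follows

$$\begin{aligned}\mathbf{C}^{\phi}_{xx}&:=\frac{1}{m}\sum_{i=1}^m\left(\boldsymbol{\phi}(\mathbf{x}_i)-\boldsymbol{\mu}_x\right)\left(\boldsymbol{\phi}(\mathbf{x}_i)-\boldsymbol{\mu}_x\right)^{\top}\in\mathbb{R}^{L\times L}\\ \mathbf{C}^{\phi}_{yy}&:=\frac{1}{n}\sum_{j=1}^n\left(\boldsymbol{\phi}(\mathbf{y}_j)-\boldsymbol{\mu}_y\right)\left(\boldsymbol{\phi}(\mathbf{y}_j)-\boldsymbol{\mu}_y\right)^{\top}\in\mathbb{R}^{L\times L}\end{aligned}$$

where the $L$-dimensional vectors $\boldsymbol{\mu}_x:=\frac{1}{m}\sum_{i=1}^m\boldsymbol{\phi}(\mathbf{x}_i)$ and $\boldsymbol{\mu}_y:=\frac{1}{n}\sum_{j=1}^n\boldsymbol{\phi}(\mathbf{y}_j)$ are accordingly the sample means of $\{\boldsymbol{\phi}(\mathbf{x}_i)\}_{i=1}^m$ and $\{\boldsymbol{\phi}(\mathbf{y}_j)\}_{j=1}^n$. For convenience, let $\boldsymbol{\Phi}(\mathbf{Z}):=[\boldsymbol{\phi}(\mathbf{x}_1)-\boldsymbol{\mu}_x,\ \cdots,\ \boldsymbol{\phi}(\mathbf{x}_m)-\boldsymbol{\mu}_x,\ \boldsymbol{\phi}(\mathbf{y}_1)-\boldsymbol{\mu}_y,\ \cdots,\ \boldsymbol{\phi}(\mathbf{y}_n)-\boldsymbol{\mu}_y]\in\mathbb{R}^{L\times N}$. Upon replacing $\{\mathbf{x}_i\}_{i=1}^m$ and $\{\mathbf{y}_j\}_{j=1}^n$ in (12) with $\{\boldsymbol{\phi}(\mathbf{x}_i)-\boldsymbol{\mu}_x\}_{i=1}^m$ and $\{\boldsymbol{\phi}(\mathbf{y}_j)-\boldsymbol{\mu}_y\}_{j=1}^n$, respectively, the KdPCA formulation in (12) boils down to

$$\max_{\mathbf{a}\in\mathbb{R}^N}\ \frac{\mathbf{a}^{\top}\boldsymbol{\Phi}^{\top}(\mathbf{Z})\mathbf{C}^{\phi}_{xx}\boldsymbol{\Phi}(\mathbf{Z})\mathbf{a}}{\mathbf{a}^{\top}\boldsymbol{\Phi}^{\top}(\mathbf{Z})\mathbf{C}^{\phi}_{yy}\boldsymbol{\Phi}(\mathbf{Z})\mathbf{a}} \tag{13a}$$
$$\text{s. to}\quad \mathbf{a}^{\top}\mathbf{a}=1. \tag{13b}$$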

In the sequel, (13) will be further simplified by leveraging the so-termed ‘kernel trick’ [20].

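To start, let us define a kernel matrix $\mathbf{K}_{xx}\in\mathbb{R}^{m\times m}$ of the dataset $\{\mathbf{x}_i\}_{i=1}^m$ whose $(i,j)$-th entry is $\kappa(\mathbf{x}_i,\mathbf{x}_j):=\langle\boldsymbol{\phi}(\mathbf{x}_i),\boldsymbol{\phi}(\mathbf{x}_j)\rangle$ for $i,j=1,\ldots,m$, where $\kappa(\cdot,\cdot)$ represents some kernel function. Kernel matrix $\mathbf{K}_{yy}\in\mathbb{R}^{n\times n}$ of $\{\mathbf{y}_j\}_{j=1}^n$ is defined analogously. Further, the $(i,j)$-th entry of matrix $\mathbf{K}_{xy}\in\mathbb{R}^{m\times n}$ is $\kappa(\mathbf{x}_i,\mathbf{y}_j)$ for $i=1,\ldots,m$ and $j=1,\ldots,n$. Centering $\mathbf{K}_{xx}$, $\mathbf{K}_{yy}$, and $\mathbf{K}_{xy}$ produces

$$\begin{aligned}\mathbf{K}^{c}_{xx}&:=\mathbf{K}_{xx}-\tfrac{1}{m}\mathbf{1}_m\mathbf{K}_{xx}-\tfrac{1}{m}\mathbf{K}_{xx}\mathbf{1}_m+\tfrac{1}{m^2}\mathbf{1}_m\mathbf{K}_{xx}\mathbf{1}_m\\ \mathbf{K}^{c}_{yy}&:=\mathbf{K}_{yy}-\tfrac{1}{n}\mathbf{1}_n\mathbf{K}_{yy}-\tfrac{1}{n}\mathbf{K}_{yy}\mathbf{1}_n+\tfrac{1}{n^2}\mathbf{1}_n\mathbf{K}_{yy}\mathbf{1}_n\\ \mathbf{K}^{c}_{xy}&:=\mathbf{K}_{xy}-\tfrac{1}{m}\mathbf{1}_m\mathbf{K}_{xy}-\tfrac{1}{n}\mathbf{K}_{xy}\mathbf{1}_n+\tfrac{1}{mn}\mathbf{1}_m\mathbf{K}_{xy}\mathbf{1}_n\end{aligned}$$

with matrices $\mathbf{1}_m\in\mathbb{R}^{m\times m}$ and $\mathbf{1}_n\in\mathbb{R}^{n\times n}$ having all entries $1$. Based on those centered matrices, let

$$\mathbf{K}:=\begin{bmatrix}\mathbf{K}^{c}_{xx}&\mathbf{K}^{c}_{xy}\\ (\mathbf{K}^{c}_{xy})^{\top}&\mathbf{K}^{c}_{yy}\end{bmatrix}\in\mathbb{R}^{N\times N}. \tag{14}$$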
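The centering formulas and the assembly in (14) can be sketched as below. To make the result checkable, the example uses the linear kernel, for which the lifted data are the data themselves and the centered kernel matrix must equal the Gram matrix of the explicitly centered samples. The function name `centered_kernel` and the toy data are assumptions.

```python
import numpy as np

def centered_kernel(K_xx, K_yy, K_xy):
    """Center the three kernel blocks and stack them as in (14)."""
    m, n = K_xy.shape
    E_m = np.ones((m, m)) / m        # (1/m) times the all-ones matrix
    E_n = np.ones((n, n)) / n        # (1/n) times the all-ones matrix
    Kc_xx = K_xx - E_m @ K_xx - K_xx @ E_m + E_m @ K_xx @ E_m
    Kc_yy = K_yy - E_n @ K_yy - K_yy @ E_n + E_n @ K_yy @ E_n
    Kc_xy = K_xy - E_m @ K_xy - K_xy @ E_n + E_m @ K_xy @ E_n
    return np.block([[Kc_xx, Kc_xy], [Kc_xy.T, Kc_yy]])

rng = np.random.default_rng(4)
X = rng.standard_normal((5, 3)) + 2.0     # toy target data with nonzero mean
Y = rng.standard_normal((4, 3)) - 1.0     # toy background data with nonzero mean

# Linear kernel: kappa(a, b) = a^T b, so phi is the identity map.
K = centered_kernel(X @ X.T, Y @ Y.T, X @ Y.T)

# Reference: Gram matrix of samples centered with their own dataset means.
Zc = np.vstack([X - X.mean(axis=0), Y - Y.mean(axis=0)])
```

Note that each dataset is centered with its own sample mean, which is why the cross block mixes the two averaging matrices.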

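Define further two auxiliary matrices $\mathbf{K}^{x}\in\mathbb{R}^{N\times N}$ and $\mathbf{K}^{y}\in\mathbb{R}^{N\times N}$ with $(i,j)$-th entries

$$K^{x}_{i,j}:=\begin{cases}K_{i,j}/m, & 1\le i\le m\\ 0, & m<i\le N\end{cases}\qquad K^{y}_{i,j}:=\begin{cases}0, & 1\le i\le m\\ K_{i,j}/n, & m<i\le N\end{cases} \tag{15}$$

where $K_{i,j}$ stands for the $(i,j)$-th entry of $\mathbf{K}$.

Substituting (14) and (15) into (13) gives rise to our KdPCA formulation for $d=1$, namely

$$\hat{\mathbf{a}}:=\arg\max_{\mathbf{a}\in\mathbb{R}^N}\ \frac{\mathbf{a}^{\top}\mathbf{K}\mathbf{K}^{x}\mathbf{a}}{\mathbf{a}^{\top}\mathbf{K}\mathbf{K}^{y}\mathbf{a}} \tag{16a}$$
$$\text{s. to}\quad \mathbf{a}^{\top}\mathbf{a}=1. \tag{16b}$$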

Along the lines of dPCA, the solution of KdPCA in (16) can be provided by

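$$\left[\mathbf{K}\mathbf{K}^{x}+(\mathbf{K}^{x})^{\top}\mathbf{K}\right]\hat{\mathbf{a}}=\hat{\lambda}\left[\mathbf{K}\mathbf{K}^{y}+(\mathbf{K}^{y})^{\top}\mathbf{K}\right]\hat{\mathbf{a}}. \tag{17}$$

The optimum $\hat{\mathbf{a}}$ coincides with the generalized eigenvector of $\left(\mathbf{K}\mathbf{K}^{x}+(\mathbf{K}^{x})^{\top}\mathbf{K},\ \mathbf{K}\mathbf{K}^{y}+(\mathbf{K}^{y})^{\top}\mathbf{K}\right)$ corresponding to the largest generalized eigenvalue $\hat{\lambda}$. To enforce the constraint in (16b), one simply normalizes $\hat{\mathbf{a}}$ to have unit norm.

When looking for $d$ dPCs, with $\{\mathbf{a}_i\}_{i=1}^d$ collected as columns in $\mathbf{A}:=[\mathbf{a}_1\ \cdots\ \mathbf{a}_d]\in\mathbb{R}^{N\times d}$, the KdPCA in (16) can be generalized to $d\ge 2$ as

$$\hat{\mathbf{A}}:=\arg\max_{\mathbf{A}\in\mathbb{R}^{N\times d}}\ \mathrm{Tr}\!\left[\left(\mathbf{A}^{\top}\mathbf{K}\mathbf{K}^{y}\mathbf{A}\right)^{-1}\mathbf{A}^{\top}\mathbf{K}\mathbf{K}^{x}\mathbf{A}\right]$$

whose columns correspond to the $d$ generalized eigenvectors of $(\mathbf{K}\mathbf{K}^{x},\mathbf{K}\mathbf{K}^{y})$ associated with the $d$ largest generalized eigenvalues. Having found $\hat{\mathbf{A}}$, one can project the data $\boldsymbol{\Phi}(\mathbf{Z})$ onto the obtained $d$ component vectors by $\mathbf{K}\hat{\mathbf{A}}$. It is worth remarking that KdPCA can be performed in the high-dimensional feature space without explicitly forming and evaluating the nonlinear transformations. Indeed, this is accomplished by the ‘kernel trick’ [20]. The main steps of our KdPCA are summarized in Algorithm 2.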
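The KdPCA steps can be sketched end to end as follows. This is not a reproduction of Algorithm 2: the Gaussian kernel, its bandwidth `gamma`, and in particular the small ridge `eps` added to the denominator matrix so that a Cholesky factor exists are all assumptions introduced here, and the function names are hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix with entries exp(-gamma * ||a_i - b_j||^2)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def kdpca_matrices(X, Y, gamma):
    """Build K as in (14) and the auxiliary matrices K^x, K^y as in (15)."""
    m, n = len(X), len(Y)
    Z = np.vstack([X, Y])
    # Block-diagonal centering: each dataset is centered with its own mean.
    H = np.zeros((m + n, m + n))
    H[:m, :m] = np.eye(m) - np.ones((m, m)) / m
    H[m:, m:] = np.eye(n) - np.ones((n, n)) / n
    K = H @ rbf_kernel(Z, Z, gamma) @ H
    Kx = np.zeros_like(K)
    Ky = np.zeros_like(K)
    Kx[:m] = K[:m] / m          # first m rows scaled by 1/m, other rows zero
    Ky[m:] = K[m:] / n          # last n rows scaled by 1/n, other rows zero
    return K, Kx, Ky

def kdpca(X, Y, d, gamma=0.5, eps=1e-3):
    """Top-d dual vectors, projections K @ A_hat, and generalized eigenvalues."""
    K, Kx, Ky = kdpca_matrices(X, Y, gamma)
    num = K @ Kx
    den = K @ Ky + eps * np.eye(len(K))       # eps * I is an assumed regularizer
    num, den = (num + num.T) / 2, (den + den.T) / 2
    L = np.linalg.cholesky(den)
    M = np.linalg.solve(L, np.linalg.solve(L, num).T)   # L^{-1} num L^{-T}
    eigvals, V = np.linalg.eigh((M + M.T) / 2)
    A_hat = np.linalg.solve(L.T, V[:, ::-1][:, :d])     # map back, top-d first
    A_hat /= np.linalg.norm(A_hat, axis=0)
    return A_hat, K @ A_hat, eigvals[::-1][:d]

rng = np.random.default_rng(5)
X = rng.standard_normal((40, 3))     # toy target data
Y = rng.standard_normal((30, 3))     # toy background data
A_hat, scores, lams = kdpca(X, Y, d=2)
```

The first 40 rows of `scores` hold the projections of the target samples. The ridge changes the problem slightly, so the returned eigenvalues refer to the regularized pair rather than to (17) itself.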

## VI Discriminative Analytics with Multiple Background Datasets

So far, we have presented discriminative analytics methods for two datasets. This section presents their generalizations to cope with multiple (specifically, one target plus more than one background) datasets. Suppose that, in addition to the zero-mean target dataset , we are also given centered background datasets for . The sets of background data