Whitening-Free Least-Squares Non-Gaussian Component Analysis

Hiroaki Shiino    Hiroaki Sasaki    Gang Niu    Masashi Sugiyama

Yahoo Japan Corporation, Kioi Tower, 1-3 Kioicho, Chiyoda-ku, Tokyo 102-8282, Japan.
Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan.
The University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa-shi, Chiba 277-8561, Japan.
RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan.
Abstract

Non-Gaussian component analysis (NGCA) is an unsupervised linear dimension reduction method that extracts low-dimensional non-Gaussian “signals” from high-dimensional data contaminated with Gaussian noise. NGCA can be regarded as a generalization of projection pursuit (PP) and independent component analysis (ICA) to multi-dimensional and dependent non-Gaussian components. Indeed, seminal approaches to NGCA are based on PP and ICA. Recently, a novel NGCA approach called least-squares NGCA (LSNGCA) has been developed, which gives a solution analytically through least-squares estimation of log-density gradients and eigendecomposition. However, since pre-whitening of data is involved in LSNGCA, it performs unreliably when the data covariance matrix is ill-conditioned, which is often the case in high-dimensional data analysis. In this paper, we propose a whitening-free variant of LSNGCA and experimentally demonstrate its superiority.

Keywords:
non-Gaussian component analysis, dimension reduction, unsupervised learning


1 Introduction

Dimension reduction is a common technique in high-dimensional data analysis to mitigate the curse of dimensionality [1]. Among various approaches to dimension reduction, we focus on unsupervised linear dimension reduction in this paper.

It is known that the distribution of randomly projected data is close to Gaussian [2]. Based on this observation, non-Gaussian component analysis (NGCA) [3] tries to find a subspace that contains non-Gaussian signal components so that Gaussian noise components can be projected out. NGCA is formulated in an elegant semi-parametric framework and non-Gaussian components can be extracted without specifying their distributions. Mathematically, NGCA can be regarded as a generalization of projection pursuit (PP) [4] and independent component analysis (ICA) [5] to multi-dimensional and dependent non-Gaussian components.

The first NGCA algorithm is called multi-index PP (MIPP) [3]. PP algorithms such as FastICA [5] use a non-Gaussian index function (NGIF) to find either a super-Gaussian or a sub-Gaussian component. MIPP uses a family of such NGIFs to find multiple non-Gaussian components and applies principal component analysis (PCA) to extract a non-Gaussian subspace. However, MIPP requires us to prepare appropriate NGIFs, which is not necessarily straightforward in practice. Furthermore, MIPP requires pre-whitening of the data, which can be unreliable when the data covariance matrix is ill-conditioned.

To cope with these problems, MIPP has been extended in various ways. The method called iterative metric adaptation for radial kernel functions (IMAK) [6] tries to avoid the manual design of NGIFs by learning the NGIFs from data in the form of radial kernel functions. However, this learning part is computationally highly expensive and pre-whitening is still necessary. Sparse NGCA (SNGCA) [7, 2] tries to avoid pre-whitening by imposing an appropriate constraint so that the solution is independent of the data covariance matrix. However, SNGCA involves semi-definite programming which is computationally highly demanding, and NGIFs still need to be manually designed.

Recently, a novel approach to NGCA called least-squares NGCA (LSNGCA) has been proposed [8]. Based on the gradient of the log-density function, LSNGCA constructs a vector that belongs to the non-Gaussian subspace from each sample. Then the method of least-squares log-density gradients (LSLDG) [9, 10] is employed to directly estimate the log-density gradient without density estimation. Finally, the principal subspace of the set of vectors generated from all samples is extracted by eigendecomposition. LSNGCA is computationally efficient and no manual design of NGIFs is involved. However, it still requires pre-whitening of data.

Method                  Manual NGIF design   Computational efficiency   Pre-whitening
MIPP                    Need                 Efficient (iterative)      Need
IMAK                    No Need              Inefficient (iterative)    Need
SNGCA                   Need                 Inefficient (iterative)    No Need
LSNGCA                  No Need              Efficient (analytic)       Need
WF-LSNGCA (proposed)    No Need              Efficient (analytic)       No Need
Table 1: NGCA methods.

The existing NGCA methods reviewed above are summarized in Table 1. In this paper, we propose a novel NGCA method that is computationally efficient, involves no manual design of NGIFs, and requires no pre-whitening. Our proposed method is essentially an extension of LSNGCA in which the covariance of the data is handled implicitly, without explicit pre-whitening or explicit constraints. Through experiments, we demonstrate that our proposed method, called whitening-free LSNGCA (WF-LSNGCA), performs very well even when the data covariance matrix is ill-conditioned.

2 Non-Gaussian Component Analysis

In this section, we formulate the problem of NGCA and review the MIPP and LSNGCA methods.

2.1 Problem Formulation

Suppose that we are given a set of D-dimensional i.i.d. samples of size n, {x_i}_{i=1}^n, which are generated by the following model:

x_i = A s_i + n_i,   (1)

where s_i (i = 1, …, n) is a d-dimensional signal vector independently generated from an unknown non-Gaussian distribution (we assume that d is known), n_i is a noise vector independently generated from a centered Gaussian distribution with an unknown covariance matrix Q, and A is an unknown D-by-d mixing matrix of rank d. Under this data generative model, the probability density function p(x) that the samples follow can be expressed in the following semi-parametric form [3]:

p(x) = f(L^⊤ x) φ_Q(x),   (2)

where f is an unknown smooth positive function on R^d, L is an unknown D-by-d matrix (so that x ↦ L^⊤ x is a linear mapping from R^D to R^d), φ_Q is the centered Gaussian density with the covariance matrix Q, and ⊤ denotes the transpose. We note that decomposition (2) is not unique; multiple combinations of f and L can give the same probability density function. Nevertheless, the following d-dimensional subspace I, called the non-Gaussian index space, can be determined uniquely [11]:

I := Ker(L^⊤)^⊥ = Range(L),   (3)

where Ker(·) denotes the null space, ⊥ denotes the orthogonal complement, and Range(L) denotes the column space of L.

The goal of NGCA is to estimate the non-Gaussian index space I from the samples {x_i}_{i=1}^n.
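To make the generative model (1) concrete, the following minimal sketch draws a non-Gaussian signal, adds correlated Gaussian noise, and mixes them. It is purely illustrative and not taken from the paper; the dimensions, the Gaussian-mixture signal, and the random covariance are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, D, d = 1000, 10, 2          # sample size, data dimension, signal dimension (illustrative)

# Non-Gaussian signal s_i in R^d: here a simple two-component Gaussian mixture.
z = rng.integers(0, 2, size=n)
s = rng.normal(loc=np.where(z[:, None] == 0, -2.0, 2.0), scale=0.5, size=(n, d))

# Gaussian noise n_i in R^D with an unknown (here randomly drawn) covariance Q.
M = rng.normal(size=(D, D))
Q = M @ M.T + 0.1 * np.eye(D)   # symmetric positive definite
noise = rng.multivariate_normal(np.zeros(D), Q, size=n)

# Mixing matrix A in R^{D x d} of rank d.
A = rng.normal(size=(D, d))

# Observed samples x_i = A s_i + n_i, cf. model (1).
X = s @ A.T + noise             # shape (n, D)
```

An NGCA method receives only X (and d) and is asked to recover the index space I in Eq.(3).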

2.2 Multi-Index Projection Pursuit (MIPP)

MIPP [3] is the first algorithm of NGCA.

Let us whiten the samples so that their covariance matrix becomes identity:

y_i = C^{−1/2} x_i,

where C is the covariance matrix of x. In practice, C is replaced by the sample covariance matrix. Then, for an NGIF h, the following vector was shown to belong to the (whitened) non-Gaussian index space [3]:

β(h) := E[ y h(y) − ∇_y h(y) ],

where ∇_y denotes the differential operator w.r.t. y and E denotes the expectation over y. MIPP generates a set of such vectors from various NGIFs h_1, …, h_K:

β̂(h_k) := (1/n) ∑_{i=1}^n [ y_i h_k(y_i) − ∇_y h_k(y_i) ],  k = 1, …, K,   (4)

where the expectation is estimated by the sample average. Then each β̂(h_k) is normalized so that its norm becomes proportional to its signal-to-noise ratio, and vectors whose norm is less than a pre-specified threshold are eliminated. Finally, PCA is applied to the remaining vectors to obtain an estimate of the non-Gaussian index space I.

The behavior of MIPP strongly depends on the choice of the NGIFs h. To improve the performance, MIPP actively searches for informative h as follows. First, the form of h is restricted to h_{ω,a}(y) = g_a(ω^⊤ y), where ω denotes a unit-norm vector and g_a is a smooth real function indexed by a parameter a. Then the estimated vector is written as

β̂(h_{ω,a}) = (1/n) ∑_{i=1}^n [ y_i g_a(ω^⊤ y_i) − g_a′(ω^⊤ y_i) ω ],

where g_a′ is the derivative of g_a. This equation is actually equivalent to a single iteration of the PP algorithm called FastICA [12]. Based on this fact, the parameter ω is optimized by iteratively applying the following update rule until convergence:

ω ← β̂(h_{ω,a}) / ‖ β̂(h_{ω,a}) ‖.

The superiority of MIPP has been investigated both theoretically and experimentally [3]. However, MIPP has the weaknesses that NGIFs should be manually designed and pre-whitening is necessary.
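The sketch below illustrates the MIPP pipeline in a heavily simplified form. It is not the authors' MATLAB implementation: only a single tanh-type NGIF family is used, and the normalization and thresholding of the β̂ vectors are omitted; the number of candidate directions and iterations are arbitrary choices.

```python
import numpy as np

def whiten(X):
    """Center and whiten samples so that their covariance becomes identity.
    Note: this inversion of the sample covariance is exactly the step that
    becomes unreliable when the covariance matrix is ill-conditioned."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    evals, evecs = np.linalg.eigh(cov)
    W = evecs @ np.diag(evals ** -0.5) @ evecs.T   # cov^{-1/2}
    return Xc @ W

def beta_hat(Y, omega, g, g_prime):
    """beta_hat(h) = mean[y h(y) - grad h(y)] for h(y) = g(omega^T y)."""
    t = Y @ omega
    return (Y * g(t)[:, None]).mean(axis=0) - g_prime(t).mean() * omega

def mipp_subspace(X, d, n_dirs=50, n_iter=10, seed=None):
    rng = np.random.default_rng(seed)
    Y = whiten(X)
    g, g_prime = np.tanh, lambda t: 1.0 - np.tanh(t) ** 2   # one illustrative NGIF family
    betas = []
    for _ in range(n_dirs):
        omega = rng.normal(size=Y.shape[1])
        omega /= np.linalg.norm(omega)
        for _ in range(n_iter):                 # FastICA-like refinement of omega
            b = beta_hat(Y, omega, g, g_prime)
            omega = b / np.linalg.norm(b)
        betas.append(beta_hat(Y, omega, g, g_prime))
    B = np.array(betas)
    # PCA on the collected vectors: the d leading right singular vectors span the estimate.
    _, _, Vt = np.linalg.svd(B, full_matrices=False)
    return Vt[:d].T        # orthonormal basis (in the whitened space)
```

The returned basis lives in the whitened coordinates; mapping it back to the original space requires the whitening matrix, which again depends on the (possibly ill-conditioned) sample covariance.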

2.3 Least-Squares Non-Gaussian Component Analysis (LSNGCA)

LSNGCA [8] is a recently proposed NGCA algorithm that does not require manual design of NGIFs (Table 1). Here, the LSNGCA algorithm is reviewed; it will be used to develop the new method in the next section.

Derivation:

For the whitened samples y_i = C^{−1/2} x_i, the semi-parametric form of NGCA given in Eq.(2) can be simplified as

q(y) = f̃(L̃^⊤ y) φ_{I_D}(y),   (6)

where f̃ is an unknown smooth positive function on R^d, L̃ is an unknown D-by-d matrix, and φ_{I_D} is the standard Gaussian density with the identity covariance matrix I_D. Under this simplified semi-parametric form, the non-Gaussian index space of the whitened data can be represented as

I_y := Ker(L̃^⊤)^⊥ = Range(L̃).

Taking the logarithm and differentiating both sides of Eq.(6) w.r.t. y yield

∇_y log q(y) = L̃ ∇ log f̃(L̃^⊤ y) − y,   (7)

where ∇_y denotes the differential operator w.r.t. y and ∇ denotes the differential operator w.r.t. L̃^⊤ y. This implies that

v(y) := ∇_y log q(y) + y = L̃ ∇ log f̃(L̃^⊤ y)

belongs to the non-Gaussian index space I_y. Then applying eigendecomposition to (1/n) ∑_{i=1}^n v(y_i) v(y_i)^⊤ and extracting the d leading eigenvectors allow us to recover I_y. In LSNGCA, the method of least-squares log-density gradients (LSLDG) [9, 10] is used to estimate the log-density gradient ∇_y log q(y) included in v(y), which is briefly reviewed below.

LSLDG:

Let ∂_j denote the differential operator w.r.t. the j-th element of y. LSLDG fits a model g_j(y) to ∂_j log q(y), the j-th element of the log-density gradient ∇_y log q(y), under the squared loss:

J_j(g_j) := ∫ ( g_j(y) − ∂_j log q(y) )² q(y) dy
          = ∫ g_j(y)² q(y) dy − 2 ∫ g_j(y) ∂_j log q(y) q(y) dy + C_j,   (8)

where C_j := ∫ ( ∂_j log q(y) )² q(y) dy is a constant independent of g_j. The second term in Eq.(8) yields

∫ g_j(y) ∂_j log q(y) q(y) dy = ∫ g_j(y) ∂_j q(y) dy = − ∫ ∂_j g_j(y) q(y) dy,

where the second equality follows from integration by parts under the assumption that g_j(y) q(y) → 0 as |y^{(j)}| → ∞. Then sample approximation yields

Ĵ_j(g_j) := (1/n) ∑_{i=1}^n [ g_j(y_i)² + 2 ∂_j g_j(y_i) ].   (9)

As a model of the log-density gradient, LSLDG uses a linear-in-parameter form:

g_j(y) = θ_j^⊤ ψ_j(y),   (10)

where b denotes the number of basis functions, θ_j ∈ R^b is a parameter vector to be estimated, and ψ_j(y) is a b-dimensional basis function vector. The parameter vector is learned by solving the following regularized empirical optimization problem:

θ̂_j := argmin_{θ_j ∈ R^b} [ θ_j^⊤ Ĝ_j θ_j + 2 θ_j^⊤ ĥ_j + λ θ_j^⊤ θ_j ],

where λ ≥ 0 is the regularization parameter,

Ĝ_j := (1/n) ∑_{i=1}^n ψ_j(y_i) ψ_j(y_i)^⊤   and   ĥ_j := (1/n) ∑_{i=1}^n ∂_j ψ_j(y_i).

This optimization problem can be analytically solved as

θ̂_j = − ( Ĝ_j + λ I_b )^{−1} ĥ_j,

where I_b is the b-by-b identity matrix. Finally, an estimator of the log-density gradient is obtained as

ĝ_j(y) = θ̂_j^⊤ ψ_j(y).

All tuning parameters such as the regularization parameter λ and parameters included in the basis function ψ_j can be systematically chosen based on cross-validation w.r.t. Eq.(9).
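In code, the analytic LSLDG solution per coordinate is a single regularized linear solve. The minimal sketch below assumes that the basis design matrices have already been evaluated (a concrete Gaussian-derivative basis is sketched in Section 4.1.2); it is illustrative rather than the authors' implementation.

```python
import numpy as np

def lsldg_solve(Psi, dPsi, lam):
    """Closed-form LSLDG solution for one coordinate j.

    Psi  : (n, b) basis values psi_j(y_i)
    dPsi : (n, b) partial derivatives d/dy_j psi_j(y_i)
    lam  : regularization parameter
    Returns theta_hat = -(G_hat + lam * I)^{-1} h_hat, cf. the analytic solution above.
    """
    n, b = Psi.shape
    G = Psi.T @ Psi / n            # G_hat_j
    h = dPsi.mean(axis=0)          # h_hat_j
    return -np.linalg.solve(G + lam * np.eye(b), h)

# The estimate g_hat_j(y) is then Psi(y) @ theta_hat, with Psi(y) evaluated at new points.
```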

3 Whitening-Free LSNGCA

In this section, we propose a novel NGCA algorithm that does not involve pre-whitening. A pseudo-code of the proposed method, which we call whitening-free LSNGCA (WF-LSNGCA), is summarized in Algorithm 1.

0:  Element-wise standardized data samples {x_i}_{i=1}^n.
1:  Obtain estimates v̂(x_i) of v(x_i) for i = 1, …, n by the method described in Section 3.2.
2:  Apply eigendecomposition to (1/n) ∑_{i=1}^n v̂(x_i) v̂(x_i)^⊤ and extract the d leading eigenvectors as an orthonormal basis of the non-Gaussian index space I.
Algorithm 1 Pseudo-code of WF-LSNGCA.
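In code, Algorithm 1 amounts to an eigendecomposition built around the estimates v̂(x_i). The minimal driver sketch below is illustrative; the helper `estimate_v_hat` is a hypothetical placeholder for the estimator of Section 3.2 (a concrete sketch of that step appears at the end of Section 3.2).

```python
import numpy as np

def wf_lsngca(X, d, estimate_v_hat):
    """WF-LSNGCA driver following Algorithm 1.

    X              : (n, D) element-wise standardized samples.
    d              : dimension of the non-Gaussian subspace (assumed known).
    estimate_v_hat : callable returning an (n, D) array of v_hat(x_i)
                     (hypothetical helper; see Section 3.2).
    """
    V = estimate_v_hat(X)                        # step 1
    S = V.T @ V / X.shape[0]                     # (1/n) sum_i v_hat(x_i) v_hat(x_i)^T
    evals, evecs = np.linalg.eigh(S)
    order = np.argsort(evals)[::-1]
    return evecs[:, order[:d]]                   # step 2: orthonormal basis of the index space
```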

3.1 Derivation

Unlike LSNGCA, which uses the simplified semi-parametric form (6), we directly use the original semi-parametric form (2) without whitening. Taking the logarithm and differentiating both sides of Eq.(2) w.r.t. x yield

∇_x log p(x) = L ∇ log f(L^⊤ x) − Q^{−1} x,   (11)

where ∇_x denotes the derivative w.r.t. x and ∇ denotes the derivative w.r.t. L^⊤ x. Further taking the derivative of Eq.(11) w.r.t. x yields

∇_x ∇_x^⊤ log p(x) = L [ ∇∇^⊤ log f(L^⊤ x) ] L^⊤ − Q^{−1},   (12)

where ∇∇^⊤ denotes the second derivative w.r.t. L^⊤ x. Substituting Eq.(12) back into Eq.(11), i.e., eliminating the unknown Q^{−1}, yields

∇_x log p(x) − [ ∇_x ∇_x^⊤ log p(x) ] x = L [ ∇ log f(L^⊤ x) − ( ∇∇^⊤ log f(L^⊤ x) ) L^⊤ x ].   (13)

This implies that

v(x) := ∇_x log p(x) − [ ∇_x ∇_x^⊤ log p(x) ] x

belongs to the non-Gaussian index space I = Range(L). Then we apply eigendecomposition to (1/n) ∑_{i=1}^n v(x_i) v(x_i)^⊤ and extract the d leading eigenvectors as an orthonormal basis of the non-Gaussian index space I.
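As a quick numerical sanity check of Eq.(13) (illustrative only; not part of the paper): when s follows a Gaussian mixture, x = A s + n is itself a Gaussian mixture with components N(Aμ_m, AΣ_mA^⊤ + Q), so ∇ log p and its Hessian can be computed by finite differences of the exact log-density, and the resulting v(x) should lie in the index space, which for this model is the column space of Q^{−1}A.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(1)
D, d = 4, 1
A = rng.normal(size=(D, d))
M = rng.normal(size=(D, D)); Q = M @ M.T + 0.5 * np.eye(D)

# Signal: 1-D two-component Gaussian mixture  =>  x = A s + n is a D-dim Gaussian mixture.
weights, means, variances = [0.5, 0.5], [-2.0, 2.0], [0.3, 0.3]
mus = [m * A.ravel() for m in means]
covs = [v * A @ A.T + Q for v in variances]

def log_p(x):
    comps = [np.log(w) + multivariate_normal.logpdf(x, mean=mu, cov=C)
             for w, mu, C in zip(weights, mus, covs)]
    m = max(comps)
    return m + np.log(sum(np.exp(c - m) for c in comps))   # log-sum-exp

def grad_hess(x, h=1e-4):
    """Finite-difference gradient and Hessian of log p."""
    g = np.zeros(D); H = np.zeros((D, D))
    for i in range(D):
        ei = np.eye(D)[i]
        g[i] = (log_p(x + h * ei) - log_p(x - h * ei)) / (2 * h)
        for j in range(D):
            ej = np.eye(D)[j]
            H[i, j] = (log_p(x + h*ei + h*ej) - log_p(x + h*ei - h*ej)
                       - log_p(x - h*ei + h*ej) + log_p(x - h*ei - h*ej)) / (4 * h**2)
    return g, H

x = rng.normal(size=D)
g, H = grad_hess(x)
v = g - H @ x                                   # v(x) = grad log p(x) - [Hessian log p(x)] x
B = np.linalg.solve(Q, A)                       # index space = span(Q^{-1} A) for this model
P = B @ np.linalg.solve(B.T @ B, B.T)           # orthogonal projector onto span(B)
print("relative residual:", np.linalg.norm(v - P @ v) / np.linalg.norm(v))  # ~0 up to FD error
```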

Now the remaining task is to approximate v(x) from data, which is discussed below.

3.2 Estimation of v(x)

Let v_j(x) be the j-th element of v(x):

v_j(x) = ∂_j log p(x) − ∑_{k=1}^D [ ∂_j ∂_k log p(x) ] x^{(k)},

where, with a slight abuse of notation, ∂_j now denotes the differential operator w.r.t. the j-th element of x and x^{(k)} denotes the k-th element of x. To estimate v_j(x), let us fit a model w_j(x) to it under the squared loss:

J_j(w_j) := ∫ ( w_j(x) − v_j(x) )² p(x) dx
          = ∫ w_j(x)² p(x) dx − 2 ∫ w_j(x) ∂_j log p(x) p(x) dx + 2 ∑_{k=1}^D ∫ w_j(x) [ ∂_j ∂_k log p(x) ] x^{(k)} p(x) dx + C′_j,   (14)

where C′_j := ∫ v_j(x)² p(x) dx is a constant independent of w_j. The second term in Eq.(14) yields

∫ w_j(x) ∂_j log p(x) p(x) dx = ∫ w_j(x) ∂_j p(x) dx = − ∫ ∂_j w_j(x) p(x) dx,

where the second equality follows from integration by parts under the assumption that w_j(x) p(x) → 0 as |x^{(j)}| → ∞. The log-density gradient ∂_j log p(x), which appears through its partial derivatives ∂_k ∂_j log p(x) = ∂_k ( ∂_j log p(x) ) in the third term in Eq.(14), may be replaced with the LSLDG estimator ĝ_j reviewed in Section 2.3. Note that the LSLDG estimator is obtained with non-whitened data in this method. Then we have

Ĵ_j(w_j) := (1/n) ∑_{i=1}^n [ w_j(x_i)² + 2 ∂_j w_j(x_i) + 2 w_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k ĝ_j(x_i) ].   (15)

Here, let us employ the following linear-in-parameter model as w_j:

w_j(x) = α_j^⊤ φ_j(x),   (16)

where b′ denotes the number of basis functions, α_j ∈ R^{b′} is a parameter vector to be estimated, and φ_j(x) is a b′-dimensional basis function vector. The parameter vector is learned by solving the following regularized empirical optimization problem:

α̂_j := argmin_{α_j ∈ R^{b′}} [ α_j^⊤ Ĥ_j α_j + 2 α_j^⊤ b̂_j + λ′ α_j^⊤ α_j ],

where λ′ ≥ 0 is the regularization parameter,

Ĥ_j := (1/n) ∑_{i=1}^n φ_j(x_i) φ_j(x_i)^⊤   and   b̂_j := (1/n) ∑_{i=1}^n [ ∂_j φ_j(x_i) + φ_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k ĝ_j(x_i) ].

This optimization problem can be analytically solved as

α̂_j = − ( Ĥ_j + λ′ I_{b′} )^{−1} b̂_j.

Finally, an estimator of v_j(x) is obtained as

v̂_j(x) = ŵ_j(x) = α̂_j^⊤ φ_j(x).

All tuning parameters such as the regularization parameter λ′ and parameters included in the basis function φ_j can be systematically chosen based on cross-validation w.r.t. Eq.(15).
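Collecting the pieces, estimating v̂_j reduces to another ridge-regularized linear solve, now with the extra term that plugs in the derivative of the LSLDG estimate. The sketch below assumes the quadratic form of the empirical criterion (15); the callables `phi`, `dphi_j`, and `dg_hat_j` are hypothetical interfaces for the basis values φ_j(x_i), their j-th partial derivatives, and the partial derivatives ∂_k ĝ_j(x_i) of the LSLDG fit.

```python
import numpy as np

def wf_lsngca_fit_vj(X, j, phi, dphi_j, dg_hat_j, lam):
    """Fit w_j(x) = alpha_j^T phi_j(x) to v_j(x) by minimizing the empirical
    criterion of Eq.(15) plus an l2 regularizer.

    phi(X)      -> (n, b') basis values phi_j(x_i)                (hypothetical callable)
    dphi_j(X)   -> (n, b') partial derivatives d/dx_j phi_j(x_i)  (hypothetical callable)
    dg_hat_j(X) -> (n, D)  partial derivatives d/dx_k g_hat_j(x_i) of the LSLDG estimate
    """
    n = X.shape[0]
    Phi, dPhi = phi(X), dphi_j(X)
    # weight_i = sum_k x_i^{(k)} * d/dx_k g_hat_j(x_i)
    weight = np.sum(X * dg_hat_j(X), axis=1)                     # (n,)
    H = Phi.T @ Phi / n                                          # H_hat_j
    b = dPhi.mean(axis=0) + (Phi * weight[:, None]).mean(axis=0) # b_hat_j
    alpha = -np.linalg.solve(H + lam * np.eye(H.shape[0]), b)

    def v_hat_j(Xnew):
        return phi(Xnew) @ alpha                                 # estimate of v_j(x)

    return alpha, v_hat_j
```

Stacking v̂_j over j = 1, …, D gives the vectors v̂(x_i) required in step 1 of Algorithm 1.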

3.3 Theoretical Analysis

Here, we investigate the convergence rate of WF-LSNGCA in a parametric setting.

Let g_j* denote the optimal estimate of ∂_j log p given by LSLDG based on the linear-in-parameter model (10), and let ŵ_j(x) = α̂_j^⊤ φ_j(x) and w_j*(x) = α_j*^⊤ φ_j(x) denote, respectively, the WF-LSNGCA estimate of v_j(x) based on n samples and the optimal estimate based on the linear-in-parameter model (16), i.e., the minimizer of the expected objective with g_j* plugged in. Moreover, let

H_j := E[ φ_j(x) φ_j(x)^⊤ ] + λ′ I_{b′},

where H_j must be strictly positive definite. In fact, E[ φ_j(x) φ_j(x)^⊤ ] should already be strictly positive definite, and thus λ′ = 0 is also allowed in our theoretical analysis.

We have the following theorem (its proof is given in Section 3.4):

Theorem 3.1.

As n → ∞, for any x and any j = 1, …, D,

| ŵ_j(x) − w_j*(x) | = O_p(n^{−1/2}),

provided that the LSLDG estimates ĝ_j for all j = 1, …, D converge in O_p(n^{−1/2}) to g_j*, i.e., | ĝ_j(x) − g_j*(x) | = O_p(n^{−1/2}) for any x.

Theorem 3.1 is based on the theory of perturbed optimization [13, 14] as well as the convergence of LSLDG shown in [8]. It guarantees that, for any x, the estimate ŵ_j(x) in WF-LSNGCA converges to the optimal estimate w_j*(x) based on the linear-in-parameter model (16), and it achieves the optimal parametric convergence rate O_p(n^{−1/2}). Note that Theorem 3.1 deals only with the estimation error; the approximation error is not taken into account. Indeed, approximation errors exist in two places: from v_j to w_j* in WF-LSNGCA itself, and from ∂_j log p to g_j* in the plug-in LSLDG estimator. Since the original LSNGCA also relies on LSLDG, it cannot avoid the approximation error introduced by LSLDG. For this reason, the convergence of WF-LSNGCA is expected to be as good as that of LSNGCA.

Theorem 3.1 is basically a theoretical guarantee that is similar to Part One in the proof of Theorem 1 in [8]. Hence, based on Theorem 3.1, we can go along the line of Part Two in the proof of Theorem 1 in [8] and obtain the following corollary.

Corollary 1.

For the eigendecomposition, define the matrices Γ̂ := (1/n) ∑_{i=1}^n v̂(x_i) v̂(x_i)^⊤ and Γ* := E[ w*(x) w*(x)^⊤ ], where v̂(x) := (ŵ_1(x), …, ŵ_D(x))^⊤ and w*(x) := (w_1*(x), …, w_D*(x))^⊤. Given the estimated subspace Î based on n samples and the optimal estimated subspace I* based on infinite data, denote by Ê the matrix form of an arbitrary orthonormal basis of Î and by E* that of I*. Define the distance between the subspaces as

D_F(Î, I*) := ‖ Ê Ê^⊤ − E* E*^⊤ ‖_Fro,

where ‖·‖_Fro stands for the Frobenius norm. Then, as n → ∞,

D_F(Î, I*) = O_p(n^{−1/2}),

provided that ĝ_j for all j = 1, …, D converge in O_p(n^{−1/2}) to g_j*, and φ_1, …, φ_D are well-chosen basis functions such that the first d eigenvalues of Γ* are neither 0 nor +∞.

3.4 Proof of Theorem 3.1

Step 1.

First of all, we establish the growth condition (see Definition 6.1 in [14]). Denote the expected and empirical objective functions by

J_j(α) := α^⊤ H_j α + 2 α^⊤ b_j   and   Ĵ_j(α) := α^⊤ ( Ĥ_j + λ′ I_{b′} ) α + 2 α^⊤ b̂_j,

where b_j := E[ ∂_j φ_j(x) + φ_j(x) ∑_{k=1}^D x^{(k)} ∂_k g_j*(x) ] and Ĥ_j, b̂_j are the empirical counterparts defined in Section 3.2. Then α_j* = argmin_α J_j(α), α̂_j = argmin_α Ĵ_j(α), and we have

Lemma 1.

Let λ_min > 0 be the smallest eigenvalue of H_j; then the following second-order growth condition holds:

J_j(α) ≥ J_j(α_j*) + λ_min ‖ α − α_j* ‖².

Proof.

J_j must be strongly convex with parameter at least 2 λ_min. Hence,

J_j(α) ≥ J_j(α_j*) + ∇J_j(α_j*)^⊤ (α − α_j*) + λ_min ‖ α − α_j* ‖² = J_j(α_j*) + λ_min ‖ α − α_j* ‖²,

where we used the optimality condition ∇J_j(α_j*) = 0. ∎

Step 2.

Second, we study the stability (with respect to perturbation) of Ĵ_j at α_j*. Let

U := { u = (U, u) | U ∈ S_+^{b′}, u ∈ R^{b′} }

be a set of perturbation parameters, where S_+^{b′} is the cone of b′-by-b′ symmetric positive semi-definite matrices. Define our perturbed objective function by

J_j(α; u) := α^⊤ ( U + λ′ I_{b′} ) α + 2 α^⊤ u.

It is clear that Ĵ_j(α) = J_j(α; Ĥ_j, b̂_j), and then the stability of J_j(α; u) at α_j* can be characterized as follows.

Lemma 2.

The difference function J_j(α; u) − J_j(α) is Lipschitz continuous in α with modulus

ω(u) = O( ‖ U − E[φ_j(x)φ_j(x)^⊤] ‖_Fro + ‖ u − b_j ‖ )

on a sufficiently small neighborhood of α_j*.

Proof.

The difference function is

J_j(α; u) − J_j(α) = α^⊤ ( U − E[φ_j(x)φ_j(x)^⊤] ) α + 2 α^⊤ ( u − b_j ),

with a partial gradient

∇_α [ J_j(α; u) − J_j(α) ] = 2 ( U − E[φ_j(x)φ_j(x)^⊤] ) α + 2 ( u − b_j ).

Notice that due to the ℓ₂-regularization in J_j, there exists C_α > 0 such that ‖α_j*‖ ≤ C_α. Now given a δ-ball of α_j*, i.e., B_δ(α_j*) := { α | ‖α − α_j*‖ ≤ δ }, it is easy to see that, for all α ∈ B_δ(α_j*), ‖α‖ ≤ C_α + δ,

and consequently

‖ ∇_α [ J_j(α; u) − J_j(α) ] ‖ ≤ 2 (C_α + δ) ‖ U − E[φ_j(x)φ_j(x)^⊤] ‖_Fro + 2 ‖ u − b_j ‖.

This says that the gradient has a bounded norm of order O( ‖U − E[φ_j(x)φ_j(x)^⊤]‖_Fro + ‖u − b_j‖ ), and proves that the difference function is Lipschitz continuous on the ball B_δ(α_j*), with a Lipschitz constant of the same order. ∎

Step 3.

Lemma 1 ensures that the unperturbed objective J_j grows quickly when α leaves α_j*; Lemma 2 ensures that the perturbed objective changes slowly for α around α_j*, where the slowness is compared with the perturbation it suffers. Based on Lemma 1, Lemma 2, and Proposition 6.1 in [14],

‖ α̂_j − α_j* ‖ ≤ λ_min^{−1} ω((Ĥ_j, b̂_j)) = O( ‖ Ĥ_j − E[φ_j(x)φ_j(x)^⊤] ‖_Fro + ‖ b̂_j − b_j ‖ ),

since α̂_j is the exact solution to min_α J_j(α; u) given u = (Ĥ_j, b̂_j), and Ĵ_j(α) = J_j(α; Ĥ_j, b̂_j).

According to the central limit theorem (CLT), ‖ Ĥ_j − E[φ_j(x)φ_j(x)^⊤] ‖_Fro = O_p(n^{−1/2}). Consider b̂_j:

b̂_j = (1/n) ∑_{i=1}^n ∂_j φ_j(x_i) + (1/n) ∑_{i=1}^n φ_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k ĝ_j(x_i).

The first half is clearly within O_p(n^{−1/2}) of its expectation due to CLT. For the second half, the estimate ĝ_j given by LSLDG converges to g_j* in O_p(n^{−1/2}) for any x according to Part One in the proof of Theorem 1 in [8], and ∂_k ĝ_j converges to ∂_k g_j* in the same order because the basis functions in ψ_j are all derivatives of Gaussian functions. Consequently,

(1/n) ∑_{i=1}^n φ_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k ĝ_j(x_i) = (1/n) ∑_{i=1}^n φ_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k g_j*(x_i) + O_p(n^{−1/2}),

since ∂_k ĝ_j(x) converges to ∂_k g_j*(x) in O_p(n^{−1/2}) for any x, and

(1/n) ∑_{i=1}^n φ_j(x_i) ∑_{k=1}^D x_i^{(k)} ∂_k g_j*(x_i) = E[ φ_j(x) ∑_{k=1}^D x^{(k)} ∂_k g_j*(x) ] + O_p(n^{−1/2})

due to CLT, which proves ‖ b̂_j − b_j ‖ = O_p(n^{−1/2}). Furthermore, we have already assumed that H_j is strictly positive definite, so λ_min > 0. Hence, as n → ∞,

‖ α̂_j − α_j* ‖ = O_p(n^{−1/2}).

Step 4.

Finally, for any x, the gap between ŵ_j(x) and w_j*(x) is bounded by

| ŵ_j(x) − w_j*(x) | = | (α̂_j − α_j*)^⊤ φ_j(x) | ≤ ‖ α̂_j − α_j* ‖ · ‖ φ_j(x) ‖,

where the Cauchy–Schwarz inequality is used. Since the basis functions in φ_j are again all derivatives of Gaussian functions, ‖ φ_j(x) ‖ must be bounded uniformly in x, and then

| ŵ_j(x) − w_j*(x) | = O_p(n^{−1/2}).

Applying the same argument for all j = 1, …, D completes the proof. ∎

4 Experiments

In this section, we experimentally investigate the performance of MIPP, LSNGCA, and WF-LSNGCA. (The source code of the experiments is available at https://github.com/hgeno/WFLSNGCA.)

4.1 Configurations of NGCA Algorithms

4.1.1 MIPP

We use the MATLAB script that was used in the original MIPP paper [3] (available at http://www.ms.k.u-tokyo.ac.jp/software.html). In this script, NGIFs from several parametric families are used, whose scalar parameters are chosen at regular intervals from pre-specified ranges. The cut-off threshold and the number of FastICA iterations are fixed at pre-specified values (see Section 2.2).

4.1.2 LSNGCA

Following [10], the derivative of the Gaussian kernel is used as the basis function in the linear-in-parameter model (10):

ψ_j^{(k)}(y) = ∂_j exp( −‖y − c_k‖² / (2σ²) ) = − ( (y^{(j)} − c_k^{(j)}) / σ² ) exp( −‖y − c_k‖² / (2σ²) ),

where σ is the Gaussian bandwidth and c_k is a Gaussian center randomly selected from the whitened data samples {y_i}_{i=1}^n. The number of basis functions b is fixed in advance. For model selection, cross-validation is performed with respect to the hold-out error of Eq.(9), using candidate values chosen at regular intervals in logarithmic scale for the Gaussian bandwidth σ and the regularization parameter λ.
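For concreteness, the Gaussian-kernel-derivative basis and its j-th partial derivative can be evaluated as follows (an illustrative sketch; the bandwidth and the choice of centers are left to random subsampling and cross-validation as described above).

```python
import numpy as np

def gauss_deriv_basis(Y, centers, sigma, j):
    """Basis psi_k(y) = d/dy_j exp(-||y - c_k||^2 / (2 sigma^2)) and its derivative d/dy_j psi_k(y).

    Y       : (n, D) samples, centers : (b, D) Gaussian centers, sigma : bandwidth.
    Returns (Psi, dPsi_j), both of shape (n, b), usable as inputs to an LSLDG-type solver.
    """
    diff = Y[:, None, :] - centers[None, :, :]                   # (n, b, D)
    K = np.exp(-np.sum(diff ** 2, axis=2) / (2 * sigma ** 2))    # Gaussian kernel values (n, b)
    Psi = -(diff[:, :, j] / sigma ** 2) * K                      # psi_k(y_i)
    dPsi_j = (diff[:, :, j] ** 2 / sigma ** 4 - 1.0 / sigma ** 2) * K   # d/dy_j psi_k(y_i)
    return Psi, dPsi_j
```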

4.1.3 WF-LSNGCA

Similarly to LSNGCA, the derivative of the Gaussian kernel is used as the basis function in the linear-in-parameter model (16), with the number of basis functions fixed in the same way. For model selection, cross-validation is performed with respect to the hold-out error of Eq.(15) in the same way as for LSNGCA.
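Model selection for both LSNGCA and WF-LSNGCA can follow the same pattern: grid over the bandwidth and regularization parameter, split the data into folds, and pick the pair minimizing the hold-out value of the criterion (Eq.(9) or Eq.(15)). A generic sketch follows; the `holdout_error` callable is hypothetical and would evaluate the chosen criterion on the validation fold for a model fitted on the training fold.

```python
import numpy as np

def select_by_cv(X, sigmas, lambdas, holdout_error, n_folds=5, seed=0):
    """Pick (sigma, lambda) minimizing the hold-out error averaged over folds."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    best, best_err = None, np.inf
    for sigma in sigmas:
        for lam in lambdas:
            errs = []
            for k in range(n_folds):
                val = folds[k]
                trn = np.concatenate([folds[m] for m in range(n_folds) if m != k])
                errs.append(holdout_error(X[trn], X[val], sigma, lam))
            err = float(np.mean(errs))
            if err < best_err:
                best, best_err = (sigma, lam), err
    return best
```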

4.2 Artificial Datasets

Figure 1: Distributions of the non-Gaussian components: (a) independent Gaussian mixture, (b) dependent super-Gaussian, (c) dependent sub-Gaussian, (d) dependent super- and sub-Gaussian.

Figure 2: Averages and standard deviations of the subspace estimation error as functions of the condition-number controller over 50 simulations on the artificial datasets, for the four cases (a)-(d) in Figure 1.

We construct each sample x from d-dimensional non-Gaussian signal components s and Gaussian noise components n. For the non-Gaussian signal components, we consider the following four distributions, plotted in Figure 1:

(a) Independent Gaussian mixture: each signal component independently follows a Gaussian mixture distribution.

(b) Dependent super-Gaussian: the signal components follow a super-Gaussian distribution and are statistically dependent.

(c) Dependent sub-Gaussian: the signal components follow a uniform (sub-Gaussian) distribution and are statistically dependent.

(d) Dependent super- and sub-Gaussian: one signal component is super-Gaussian and the other follows a uniform (sub-Gaussian) distribution, and the two are statistically dependent.

For the Gaussian noise components, we include a parameter that controls the condition number of the data covariance matrix; the larger this parameter is, the more ill-conditioned the data covariance matrix becomes. The details are described in Appendix A.

We generate training samples for each case and standardize each element of the data before applying the NGCA algorithms. The performance of the NGCA algorithms is measured by the following subspace estimation error:

Error(Î, I) := (1/d) ∑_{i=1}^d ‖ ê_i − P_I ê_i ‖²,   (17)

where I is the true non-Gaussian index space, Î is its estimate, P_I is the orthogonal projection onto I, and {ê_i}_{i=1}^d is an orthonormal basis of Î.
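A small helper consistent with the error measure in Eq.(17) can be written as follows (a sketch: the true space is represented by any matrix whose columns span it, and the estimate by an orthonormal basis).

```python
import numpy as np

def subspace_error(B_true, E_hat):
    """Error(I_hat, I) = (1/d) * sum_i || e_i - P_I e_i ||^2, cf. Eq.(17).

    B_true : (D, d) matrix whose columns span the true index space I
    E_hat  : (D, d) orthonormal basis of the estimated space I_hat
    """
    P = B_true @ np.linalg.solve(B_true.T @ B_true, B_true.T)   # orthogonal projector onto I
    R = E_hat - P @ E_hat                                       # components of e_i outside I
    return np.sum(R ** 2) / E_hat.shape[1]
```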

The averages and standard deviations of the subspace estimation error over 50 runs for MIPP, LSNGCA, and WF-LSNGCA are depicted in Figure 2. The figure shows that, for all four cases, the error of MIPP grows rapidly as the condition-number controller increases. On the other hand, LSNGCA and WF-LSNGCA behave much more stably against this change. However, LSNGCA performs poorly in case (a). Overall, WF-LSNGCA is much more robust against ill-conditioning than MIPP and LSNGCA.

Figure 3: Average CPU time over 50 runs when the Gaussian mixture is used as the non-Gaussian components and the condition-number controller is fixed, plotted (a) as a function of the sample size and (b) as a function of the data dimension. The vertical axis is in logarithmic scale.

In terms of computation time, WF-LSNGCA is less efficient than LSNGCA and MIPP, but it is still only a few times slower than LSNGCA, as seen in Figure 3. The computational efficiency of WF-LSNGCA would therefore still be acceptable in practice.

4.3 Benchmark Datasets

Finally, we evaluate the performance of the NGCA methods using LIBSVM binary classification benchmark datasets [15], preprocessed as follows: vehicle: original labels '1' and '2' are converted to the positive label and original labels '3' and '4' to the negative label; SUSY: original label '0' is converted to the negative label; shuttle: only the data labeled as '1' and '4' are used and regarded as the positive and negative classes, respectively; svmguide1: the original training and test datasets are mixed. From each dataset, an equal number of positive and negative points is selected as training and test samples, and each dimension is standardized. For each dataset, we append noise dimensions following the standard Gaussian distribution so that all datasets have the same dimensionality. Then we use PCA, MIPP, LSNGCA, and WF-LSNGCA to obtain low-dimensional expressions, and apply the support vector machine (SVM), implemented by LIBSVM with MATLAB [15], to evaluate the test misclassification rate. As a baseline, we also evaluate the misclassification rate of the raw SVM without dimension reduction.

The averages and standard deviations of the misclassification rate over multiple runs are summarized in Table 2. As can be seen in the table, the appended Gaussian noise dimensions have a negative effect on classification accuracy, and thus the baseline has relatively high misclassification rates. PCA yields overall higher misclassification rates than the baseline, since a lot of information valuable for classification is lost. Among the NGCA algorithms, WF-LSNGCA overall compares favorably with the other methods. This means that it can find valuable low-dimensional expressions for each classification problem without the harmful effects of a pre-whitening procedure.

 
Dataset