WhiteningFree LeastSquares
NonGaussian Component Analysis
Abstract
NonGaussian component analysis (NGCA) is an unsupervised linear dimension reduction method that extracts lowdimensional nonGaussian “signals” from highdimensional data contaminated with Gaussian noise. NGCA can be regarded as a generalization of projection pursuit (PP) and independent component analysis (ICA) to multidimensional and dependent nonGaussian components. Indeed, seminal approaches to NGCA are based on PP and ICA. Recently, a novel NGCA approach called leastsquares NGCA (LSNGCA) has been developed, which gives a solution analytically through leastsquares estimation of logdensity gradients and eigendecomposition. However, since prewhitening of data is involved in LSNGCA, it performs unreliably when the data covariance matrix is illconditioned, which is often the case in highdimensional data analysis. In this paper, we propose a whiteningfree variant of LSNGCA and experimentally demonstrate its superiority.
Keywords:
nonGaussian component analysis, dimension reduction, unsupervised learningAuthors’ Instructions
1 Introduction
Dimension reduction is a common technique in highdimensional data analysis to mitigate the curse of dimensionality [1]. Among various approaches to dimension reduction, we focus on unsupervised linear dimension reduction in this paper.
It is known that the distribution of randomly projected data is close to Gaussian [2]. Based on this observation, nonGaussian component analysis (NGCA) [3] tries to find a subspace that contains nonGaussian signal components so that Gaussian noise components can be projected out. NGCA is formulated in an elegant semiparametric framework and nonGaussian components can be extracted without specifying their distributions. Mathematically, NGCA can be regarded as a generalization of projection pursuit (PP) [4] and independent component analysis (ICA) [5] to multidimensional and dependent nonGaussian components.
The first NGCA algorithm is called multiindex PP (MIPP). PP algorithms such as FastICA [5] use a nonGaussian index function (NGIF) to find either a superGaussian or subGaussian component. MIPP uses a family of such NGIFs to find multiple nonGaussian components and apply principal component analysis (PCA) to extract a nonGaussian subspace. However, MIPP requires us to prepare appropriate NGIFs, which is not necessarily straightforward in practice. Furthermore, MIPP requires prewhitening of data, which can be unreliable when the data covariance matrix is illconditioned.
To cope with these problems, MIPP has been extended in various ways. The method called iterative metric adaptation for radial kernel functions (IMAK) [6] tries to avoid the manual design of NGIFs by learning the NGIFs from data in the form of radial kernel functions. However, this learning part is computationally highly expensive and prewhitening is still necessary. Sparse NGCA (SNGCA) [7, 2] tries to avoid prewhitening by imposing an appropriate constraint so that the solution is independent of the data covariance matrix. However, SNGCA involves semidefinite programming which is computationally highly demanding, and NGIFs still need to be manually designed.
Recently, a novel approach to NGCA called leastsquares NGCA (LSNGCA) has been proposed [8]. Based on the gradient of the logdensity function, LSNGCA constructs a vector that belongs to the nonGaussian subspace from each sample. Then the method of leastsquares logdensity gradients (LSLDG) [9, 10] is employed to directly estimate the logdensity gradient without density estimation. Finally, the principal subspace of the set of vectors generated from all samples is extracted by eigendecomposition. LSNGCA is computationally efficient and no manual design of NGIFs is involved. However, it still requires prewhitening of data.
MIPP  IMAK  SNGCA  LSNGCA 



Need  No Need  Need  No Need  No Need  







Prewhitening  Need  Need  No Need  Need  No Need 
The existing NGCA methods reviewed above are summarized in Table 1. In this paper, we propose a novel NGCA method that is computationally efficient, no manual design of NGIFs is involved, and no prewhitening is necessary. Our proposed method is essentially an extention of LSNGCA so that the covariance of data is implicitly handled without explicit prewhitening or explicit constraints. Through experiments, we demonstrate that our proposed method, called whiteningfree LSNGCA (WFLSNGCA), performs very well even when the data covariance matrix is illconditioned.
2 NonGaussian Component Analysis
In this section, we formulate the problem of NGCA and review the MIPP and LSNGCA methods.
2.1 Problem Formulation
Suppose that we are given a set of dimensional i.i.d. samples of size , , which are generated by the following model:
(1) 
where () is an dimensional signal vector independently generated from an unknown nonGaussian distribution (we assume that is known), is a noise vector independently generated from a centered Gaussian distribution with an unknown covariance matrix , and is an unknown mixing matrix of rank . Under this data generative model, probability density function that samples follow can be expressed in the following semiparametric form [3]:
(2) 
where is an unknown smooth positive function on , is an unknown linear mapping, is the centered Gaussian density with the covariance matrix , and denotes the transpose. We note that decomposition (2) is not unique; multiple combinations of and can give the same probability density function. Nevertheless, the following dimensional subspace , called the nonGaussian index space, can be determined uniquely [11]:
(3) 
where denotes the null space of , denotes the orthogonal complement, and denotes the column space of .
The goal of NGCA is to estimate the nonGaussian index space from samples .
2.2 MultiIndex Projection Pursuit (MIPP)
MIPP [3] is the first algorithm of NGCA.
Let us whiten the samples so that their covariance matrix becomes identity:
where is the covariance matrix of . In practice, is replaced by the sample covariance matrix. Then, for an NGIF , the following vector was shown to belong to the nonGaussian index space [3]:
where denotes the differential operator w.r.t. and denotes the expectation over . MIPP generates a set of such vectors from various NGIFs :
(4) 
where the expectation is estimated by the sample average. Then is normalized as
(5) 
by which is proportional to its signaltonoise ratio. Then vectors with their norm less than a prespecified threshold are eliminated. Finally, PCA is applied to the remaining vectors to obtain an estimate of the nonGaussian index space .
The behavior of MIPP strongly depends on the choice of NGIF . To improve the performance, MIPP actively searches informative as follows. First, the form of is restricted to , where denotes a unitnorm vector and is a smooth real function. Then, estimated vector is written as
where is the derivative of . This equation is actually equivalent to a single iteration of the PP algorithm called FastICA [12]. Based on this fact, the parameter is optimized by iteratively applying the following update rule until convergence:
The superiority of MIPP has been investigated both theoretically and experimentally [3]. However, MIPP has the weaknesses that NGIFs should be manually designed and prewhitening is necessary.
2.3 LeastSquares NonGaussian Component Analysis (LSNGCA)
LSNGCA [8] is a recently proposed NGCA algorithm that does not require manual design of NGIFs (Table 1). Here the algorithm of LSNGCA is reviewed, which will be used for further developing a new method in the next section.
Derivation:
For whitened samples , the semiparametric form of NGCA given in Eq.(2) can be simplified as
(6) 
where is an unknown smooth positive function on and is an unknown linear mapping. Under this simplified semiparametric form, the nonGaussian index space can be represented as
Taking the logarithm and differentiating the both sides of Eq.(6) w.r.t. yield
(7) 
where denotes the differential operator w.r.t. . This implies that
belongs to the nonGaussian index space . Then applying eigendecomposition to and extracting the leading eigenvectors allow us to recover . In LSNGCA, the method of leastsquares logdensity gradients (LSLDG) [9, 10] is used to estimate the logdensity gradient included in , which is briefly reviewed below.
Lsldg:
Let denote the differential operator w.r.t. the th element of . LSLDG fits a model to , the th element of logdensity gradient , under the squared loss:
(8) 
The second term in Eq.(8) yields
where the secondlast equation follows from integration by parts under the assumption . Then sample approximation yields
(9) 
As a model of the logdensity gradient, LSLDG uses a linearinparameter form:
(10) 
where denotes the number of basis functions, is a parameter vector to be estimated, and is a basis function vector. The parameter vector is learned by solving the following regularized empirical optimization problem:
where is the regularization parameter,
This optimization problem can be analytically solved as
where is the by identity matrix. Finally, an estimator of the logdensity gradient is obtained as
All tuning parameters such as the regularization parameter and parameters included in the basis function can be systematically chosen based on crossvalidation w.r.t. Eq.(9).
3 WhiteningFree LSNGCA
In this section, we propose a novel NGCA algorithm that does not involve prewhitening. A pseudocode of the proposed method, which we call whiteningfree LSNGCA (WFLSNGCA), is summarized in Algorithm 1.
3.1 Derivation
Unlike LSNGCA which used the simplified semiparametric form (6), we directly use the original semiparametric form (2) without whitening. Taking the logarithm and differentiating the both sides of Eq.(2) w.r.t. yield
(11) 
where denotes the derivative w.r.t. and denotes the derivative w.r.t. . Further taking the derivative of Eq.(11) w.r.t. yields
(12) 
where denotes the second derivative w.r.t. . Substituting Eq.(12) back into Eq.(11) yields
(13) 
This implies that
belongs to the nonGaussian index space . Then we apply eigendecomposition to and extract the leading eigenvectors as an orthonormal basis of nonGaussian index space .
Now the remaining task is to approximate from data, which is discussed below.
3.2 Estimation of
Let be the th element of :
To estimate , let us fit a model to it under the squared loss:
(14) 
The second term in Eq.(14) yields
where the secondlast equation follows from integration by parts under the assumption . included in the third term in Eq.(14) may be replaced with the LSLDG estimator reviewed in Section 2.3. Note that the LSLDG estimator is obtained with nonwhitened data in this method. Then we have
(15)  
Here, let us employ the following linearinparameter model as :
(16) 
where denotes the number of basis functions, is a parameter vector to be estimated, and is a basis function vector. The parameter vector is learned by minimizing the following regularized empirical optimization problem:
where is the regularization parameter,
This optimization problem can be analytically solved as
Finally, an estimator of is obtained as
All tuning parameters such as the regularization parameter and parameters included in the basis function can be systematically chosen based on crossvalidation w.r.t. Eq.(15).
3.3 Theoretical Analysis
Here, we investigate the convergence rate of WFLSNGCA in a parametric setting.
Let be the optimal estimate to given by LSLDG based on the linearinparameter model , and let
where must be strictly positive definite. In fact, should already be strictly positive definite, and thus is also allowed in our theoretical analysis.
We have the following theorem (its proof is given in Section 3.4):
Theorem 3.1.
As , for any ,
provided that for all converge in to , i.e., .
Theorem 3.1 is based on the theory of perturbed optimizations [13, 14] as well as the convergence of LSLDG shown in [8]. It guarantees that for any , the estimate in WFLSNGCA converges to the optimal estimate based on the linearinparameter model , and it achieves the optimal parametric convergence rate . Note that Theorem 3.1 deals only with the estimation error, and the approximation error is not taken into account. Indeed, approximation errors exist in two places, from to in WFLSNGCA itself and from to in the plugin LSLDG estimator. Since the original LSNGCA also relies on LSLDG, it cannot avoid the approximation error introduced by LSLDG. For this reason, the convergence of WFLSNGCA is expected to be as good as LSNGCA.
Theorem 3.1 is basically a theoretical guarantee that is similar to Part One in the proof of Theorem 1 in [8]. Hence, based on Theorem 3.1, we can go along the line of Part Two in the proof of Theorem 1 in [8] and obtain the following corollary.
Corollary 1.
For eigendecomposition, define matrices and . Given the estimated subspace based on samples and the optimal estimated subspace based on infinite data, denote by the matrix form of an arbitrary orthonormal basis of and by that of . Define the distance between subspaces as
where stands for the Frobenius norm. Then, as ,
provided that for all converge in to and are wellchosen basis functions such that the first eigenvalues of are neither nor .
3.4 Proof of Theorem 3.1
Step 1.
First of all, we establish the growth condition (see Definition 6.1 in [14]). Denote the expected and empirical objective functions by
Then , , and we have
Lemma 1.
Let be the smallest eigenvalue of , then the following secondorder growth condition holds
Proof.
must be strongly convex with parameter at least . Hence,
where we used the optimality condition . ∎
Step 2.
Second, we study the stability (with respect to perturbation) of at . Let
be a set of perturbation parameters, where is the cone of by symmetric positive semidefinite matrices. Define our perturbed objective function by
It is clear that , and then the stability of at can be characterized as follows.
Lemma 2.
The difference function is Lipschitz continuous in modulus
on a sufficiently small neighborhood of .
Proof.
The difference function is
with a partial gradient
Notice that due to the regularization in , such that . Now given a ball of , i.e., , it is easy to see that ,
and consequently
This says that the gradient has a bounded norm of order , and proves that the difference function is Lipschitz continuous on the ball , with a Lipschitz constant of the same order. ∎
Step 3.
Lemma 1 ensures the unperturbed objective grows quickly when leaves ; Lemma 2 ensures the perturbed objective changes slowly for around , where the slowness is compared with the perturbation it suffers. Based on Lemma 1, Lemma 2, and Proposition 6.1 in [14],
since is the exact solution to given , , and .
According to the central limit theorem (CLT), . Consider :
The first half is clearly due to CLT. For the second half, the estimate given by LSLDG converges to for any in according to Part One in the proof of Theorem 1 in [8], and converges to in the same order because the basis functions in are all derivatives of Gaussian functions. Consequently,
since converges to for any in , and
due to CLT, which proves . Furthermore, we have already assumed that . Hence, as ,
Step 4.
Finally, for any , the gap of and is bounded by
where the CauchySchwarz inequality is used. Since the basis functions in are again all derivatives of Gaussian functions, must be bounded uniformly, and then
Applying the same argument for all completes the proof. ∎
4 Experiments
In this section, we experimentally investigate the performance of MIPP, LSNGCA, and WFLSNGCA.^{1}^{1}1 The source code of the experiments is at https://github.com/hgeno/WFLSNGCA.
4.1 Configurations of NGCA Algorithms
4.1.1 Mipp
We use the MATLAB script which was used in the original MIPP paper [3]^{2}^{2}2http://www.ms.k.utokyo.ac.jp/software.html. In this script, NGIFs of the form () are used:
where , , and are scalars chosen at the regular intervals from , , and . The cutoff threshold is set at and the number of FastICA iterations is set at (see Section 2.2).
4.1.2 Lsngca
Following [10], the derivative of the Gaussian kernel is used as the basis function in the linearinparameter model (10):
where is the Gaussian bandwidth and is the Gaussian center randomly selected from the whitened data samples . The number of basis functions is set at . For model selection, fold crossvalidation is performed with respect to the holdout error of Eq.(9) using candidate values at the regular intervals in logarithmic scale for Gaussian bandwidth and regularization parameter .
4.1.3 WfLsngca
Similarly to LSNGCA, the derivative of the Gaussian kernel is used as the basis function in the linearinparameter model (16) and the number of basis functions is set as . For model selection, fold crossvalidation is performed with respect to the holdout error of Eq.(15) in the same way as LSNGCA.
4.2 Artificial Datasets
Let , where are the dimensional nonGaussian signal components and are the dimensional Gaussian noise components. For the nonGaussian signal components, we consider the following four distributions plotted in Figure 1:

(a) Independent Gaussian Mixture:
. 
(b) Dependent superGaussian:
. 
(c) Dependent subGaussian:
is the uniform distribution on . 
(d) Dependent super and subGaussian:
and is the uniform distribution on , where if and otherwise.
For the Gaussian noise components, we include a certain parameter , which controls the condition number; the larger is, the more illposed the data covariance matrix is. The detail is described in Appendix A.
We generate samples for each case, and standardize each element of the data before applying NGCA algorithms. The performance of NGCA algorithms is measured by the following subspace estimation error:
(17) 
where is the true nonGaussian index space, is its estimate, is the orthogonal projection on , and is an orthonormal basis in .
The averages and the standard derivations of the subspace estimation error over runs for MIPP, LSNGCA, and WFLSNGCA are depicted in Figure 2. This shows that, for all 4 cases, the error of MIPP grows rapidly as increases. On the other hand, LSNGCA and WFLSNGCA perform much stably against the change in . However, LSNGCA performs poorly for (a). Overall, WFLSNGCA is shown to be much more robust against illconditioning than MIPP and LSNGCA.
In terms of the computation time, WFLSNGCA is less efficient than LSNGCA and MIPP, but its computation time is still just a few times slower than LSNGCA, as seen in Figure 3. For this reason, the computational efficiency of WFLSNGCA would still be acceptable in practice.
4.3 Benchmark Datasets
Finally, we evaluate the performance of NGCA methods using the LIBSVM binary classification benchmark datasets^{3}^{3}3 We preprocessed the LIBSVM binary classification benchmark datasets as follows: vehicle: We convert original labels ‘1’ and ‘2’ to the positive label and original labels ‘3’ and ‘4’ to the negative label. SUSY: We convert original label ‘0’ to the negative label. shuttle: We use only the data labeled as ‘1’ and ‘4’ and regard them as positive and negative labels. svmguide1: We mix the original training and test datasets. [15]. From each dataset, points are selected as training (test) samples so that the number of positive and negative samples are equal, and datasets are standardized in each dimension. For an dimensional dataset, we append dimensional noise dimensions following the standard Gaussian distribution so that all datasets have dimensions. Then we use PCA, MIPP, LSNGCA, and WFLSNGCA to obtain dimensional expressions, and apply the support vector machine (SVM) ^{4}^{4}4 We used LIBSVM with MATLAB [15]. to evaluate the test misclassification rate. As a baseline, we also evaluate the misclassification rate by the raw SVM without dimension reduction.
The averages and standard deviations of the misclassification rate over runs for are summarized in Table 2. As can be seen in the table, the appended Gaussian noise dimensions have negative effects on each classification accuracy, and thus the baseline has relatively high misclassification rates. PCA has overall higher misclassification rates than the baseline since a lot of valuable information for each classification problem is lost. Among the NGCA algorithms, WFLSNGCA overall compares favorably with the other methods. This means that it can find valuable lowdimensional expressions for each classification problem without harmful effects of a prewhitening procedure.

