ADIS - A robust pursuit algorithm for probabilistic, constrained and non-square blind source separation with application to fMRI

# ADIS - A robust pursuit algorithm for probabilistic, constrained and non-square blind source separation with application to fMRI

Gautam Pendse To whom correspondence should be addressed. e-mail: gpendse@mclean.harvard.edu    David Borsook    Lino Becerra

Imaging and Analysis Group (IMAG), Harvard Medical School
Feb 27, 2009
###### Abstract

In this article, we develop an algorithm for probabilistic and constrained projection pursuit. Our algorithm called ADIS (automated decomposition into sources) accepts arbitrary non-linear contrast functions and constraints from the user and performs non-square blind source separation (BSS). In the first stage, we estimate the latent dimensionality using a combination of bootstrap and cross validation techniques. In the second stage, we apply our state-of-the-art optimization algorithm to perform BSS. We validate the latent dimensionality estimation procedure via simulations on sources with different kurtosis excess properties. Our optimization algorithm is benchmarked via standard benchmarks from GAMS performance library. We develop two different algorithmic frameworks for improving the quality of local solution for BSS. Our algorithm also outputs extensive convergence diagnostics that validate the convergence to an optimal solution for each extracted component. The quality of extracted sources from ADIS is compared to other well known algorithms such as Fixed Point ICA (FPICA), efficient Fast ICA (EFICA), Joint Approximate Diagonalization (JADE) and others using the ICALAB toolbox for algorithm comparison. In several cases, ADIS outperforms these algorithms. Finally we apply our algorithm to a standard functional MRI data-set as a case study.

## 1 Introduction

The Generalized Linear Model (GLM) is a popular tool for analyzing functional MRI (fMRI) data. GLM analysis proceeds on a voxel by voxel basis using the same design matrix. One of the difficulties associated with GLM analysis is the construction of an appropriate design matrix. Unmodeled regressors that modulate the fMRI signal in addition to the EVs but are not a part of the design matrix will invalidate the analysis and the associated inferences. Further, these unmodeled regressors might be different in different brain regions and a voxel by voxel analysis with a common design matrix may not be appropriate. These considerations imply that model based analyses make very strong assumptions which are very likely violated in a real fMRI dataset.

Model free analysis on the other hand does not need any postulations as to the shape of the expected response. One such technique that has become popular in recent years, particularly for application to fMRI is Independent Component Analysis (ICA) [9]. See [25] for a survey on ICA. Popular software packages such as FSL ([35]) come with ICA software for doing non-square ICA via automatic latent dimensionality estimation also known as Probabilistic Independent Component Analysis (PICA) [2]. Essentially these techniques consists of a data reduction step using PCA followed by the application of standard ICA algorithms. In the signal processing community, many well established algorithms exist for doing ICA, the most popular ones being efficient FastICA [27], Fixed Point ICA (FPICA) [26], Joint Approximate Diagonalization of Cumulant Matrices (JADE) [6], Extended Robust ICA (ERICA) [13] and unbiased ICA (UNICA) [14]. In this article, we will question the validity of solution produced by current ICA codes. Are these solutions really optimal? Can we afford to pay the price for using non-optimal solutions?

One of the challenges involved in applying ICA to real data is the confidence in the quality of estimated solution. This problem has been recognized before. For example the software package ICASSO [21] uses bootstrapping simulations to run an ICA algorithm multiple times and then clusters the estimated sources to assess reliability. ICASSO takes into account the ”algorithmic” variability and the variability in the original data induced due to sampling, but since it assumes square, noise free mixing it ignores the estimation errors induced due to noisy source mixing. The presence of non-square mixing in real data also introduces additional variability due to the unknown latent dimensionality of the sources.

While a bootstrapping strategy can always be used to test the ”sensitivity” of estimated BSS solution from any algorithm, it is critical to have a reliable and verifiable optimization solver to solve the non-convex BSS problem in the first place.

Currently existing ICA codes perform optimization using techniques that have formulas for updating the unknown variables using a Newton step, gradient descent, natural gradient [3], [1] [18] or similar strategies. In addition they also potentially have a number of heuristic rules for updating the various ”learning” parameters in these algorithms. No convergence diagnostics are used to check the optimality of estimated solution. To the best of our knowledge, such optimization codes are not benchmarked using standard optimization test cases. Such ad-hoc strategies along with non-verified optimality may severely affect the quality of solution produced by these algorithms and potentially impact the practical conclusions drawn from incorrect results. The popular FastICA algorithm [27] uses an approximate Newton iteration where the approximate Hessian simplification reduces it to a gradient descent algorithm with a fixed step size. However there is no reason to use these approximations. One can use state of the art optimization software to compute near exact step sizes with locally varying Hessian approximations. In fact, it has been shown [39] that these exact step size search significantly increases the estimation efficiency and robustness to initialization in comparison to the fixed update rules of FastICA. The optimization core in our algorithm ADIS efficiently handles local non-convexity as well as allows for infeasible steps, i.e., steps that violate the constraints leading to a more fuller exploration of parameter space and increasing the likelihood of converging to a global optimum.

Another issue is the modification of the default contrast function in ICA (e.g. Negentropy) to do other types of source extraction. It is also desirable to be able to add additional equality and/or inequality constraints to BSS estimation depending on the application at hand. These issues cannot be addressed using current ICA software. In this paper, we develop our algorithm ADIS and validate its various components. Finally, we compare ADIS to currently existing ICA codes on many standard benchmark datasets from ICALAB.

Our algorithm ADIS (section 2, 3):

1. Uses a state-of-the-art optimization algorithm at its core (inspired by LANCELOT software [12]) (section 4, 14)

2. Uses a bootstrap simulation/cross-validation based approach for latent dimensionality estimation (in case of non-square BSS) (section 6)

3. Enables the user to use arbitrary contrast functions and nonlinear constraints for BSS

4. Produces ”good quality” local solutions using a special multistage framework for BSS (section 3)

5. Produces extensive convergence diagnostics for each extracted component to validate the ”optimality” of the extracted source (section 4)

We perform validation of each component of ADIS as follows:

1. Validation of the latent dimensionality estimation procedure using simulations on sources with different statistical properties (section 6)

2. Validation of our optimization core using standard benchmarks from the GAMS performance library (http://www.gamsworld.org/performance, [15]) (section 15)

3. We then use the ”Negentropy” contrast function as a special case and compare the results of ADIS in terms of separation quality and robustness to other well known algorithms such as efficient FastICA, FPICA, JADE and others using the ICALAB toolbox [8], [7] for BSS algorithm comparison. (section 7)

4. Finally we apply ADIS to real fMRI data as a case study (section 8)

## 2 Probabilistic Projection Pursuit

Projection Pursuit is a standard statistical technique for data analysis [17],[16], [22]. In this article we generalize projection pursuit in a probabilistic framework similar to the one proposed by Beckmann et. al. ([2]). We consider data generation at points via a noisy mixing process as follows:

 x=μ+As+η,i=1,2,…,n (1)

where , and , . We assume that to achieve a compact representation of the observed data.

The problem is to estimate automatically , , and given observations . This problem is also called a blind source separation (BSS) problem since , and are all unknown. The inclusion of noise term makes the problem into a probabilistic one.

First consider the case when for all . Section 5.1 shows how to handle the case .

If be a vector of ones. Then

 1Tpx=1Tpμ+1TpAs+1Tpη,i=1,2,…,n (2)

Let

 ¯D=D−1p1TpD, where D=x,A,μ,η (3)

Then we can write

 ¯x=¯μ+¯As+¯η (4)

where Since the scaling of and is arbitrary we assume without loss of generality that

 E(s)=0 and Cov(s)=Iq (5)

No other assumptions are made about the joint source density other than the ones in 5. From equations 4 and 5:

 ¯μ=E(¯x) (6)
 E[(¯x−¯μ)(¯x−¯μ)T]=¯A¯AT+σ2(Ip−1p1Tp/p) (7)

Using the law of large numbers (LLN) we can approximate the covariance matrix as:

 E[(¯x−¯μ)(¯x−¯μ)T]≈1nn∑i=1(¯xi−¯μ)(¯xi−¯μ)T=UΣUT (8)

In 8, with the singular values and a matrix containing the corresponding singular vectors of the covariance matrix. The estimate of is easily obtained from 6.

 ^¯μ=1nn∑i=1¯xi (9)

Without knowing the true source densities it is not possible to estimate the maximum likelihood estimates of and . However one can find estimates that try to satisfy the second order condition 7 as closely as possible by minimizing the Frobenius norm:

 ^¯A,^σ2=argmin¯A,σ2||1nn∑i=1(¯xi−^¯μ)(¯xi−^¯μ)T−¯A¯AT−σ2(Ip−1p1Tp/p)||2F

It is easily shown that if is a submatrix of containing the first singular vectors corresponding to the largest singular values and is the submatrix of then the solution to 2 is given by:

 ^¯A=Uq(Σq−^σ2Iq)1/2QT (10)

and

 ^σ2=1p−q−1p−1∑i=q+1λi (11)

where is an arbitrary orthogonal matrix (). Given and , the least squares estimate of are given by:

 ^si=(^¯AT^¯A)−1^¯AT(¯xi−^¯μ)=Q(Σq−^σ2Iq)−1/2UTq(¯xi−^¯μ)=Q~xi (12)

where

 ~xi=(Σq−^σ2Iq)−1/2UTq(¯xi−^¯μ) (13)

## 3 Problems solved in Projection Pursuit

In Projection Pursuit (PP), we parameterize the orthogonal matrix as

 Q=[w1,w2…,wq]T (14)

where . The th PP projection is defined as for each point :

 ^ski=wTk~xi,i=1…n and k=1…q (15)

In vector form we can write the th projection as:

 ^sk=wTk~x (16)

where is a vector and is a matrix. We define an objective function and optimize for the projection vectors. Mathematically we solve the optimization problem:

 [w∗1,…,w∗q]=argmaxw1,…,wqf(wT1~x,…,wTq~x)+b(wT1~x,…,wTq~x;Θ)

Here is function accounting for supplementary information that we want to include in the optimization problem. is all supplementary information of interest to the optimization problem. For example, in spatial problems could be a spatial location of points and could be a function accounting for spatial smoothness (such as a Markov random field). Many such problem specific functions can be proposed based on user objectives. There could also be additional user defined equality and inequality constraints, for example:

 ci(wTk~x)=0,i=1…m,,k=1…q (17)
 gi(wTk~x)≥0,i=1…L,,k=1…q (18)

Thus in general, we get a constrained projection pursuit problem. Constraints 17 and 18 can be arbitrary non-linear constraints not necessarily parameterized by .

### 3.1 Separability

The optimization problem in 3 must involve a joint optimization of the vectors in general. In many important practical cases (such as for example when is the joint negentropy index) the objective function has a separable structure [26] such that

 f(wT1~x,wT2~x,…,wTq~x)=q∑k=1h(wTk~x) (19)

for some function , then the optimization can proceed sequentially where at each stage we solve

 w∗k= arg min wkh(wTk~x) (20)

At each stage is a unit vector orthogonal to the previously calculated vectors i.e

 w∗Tkw∗l={0 when l

### 3.2 Multistage Optimization stragegy

In this section we describe a 2 or 3 stage strategy which we have found via experience to converge to good ”local” solutions. It is well known that if an optimization problem that has multiple optima then the ”local” solution found by an algorithm is strongly dependent on how the algorithm is initialized. For applications such as BSS, even though an algorithm finds a ”local” solution as indicated by convergence diagnostics, it may not be a ”global” solution. In order to increase our chances of finding a solution that is global, we propose a random sampling strategy for initialilzation of an optimization algorithm first.

#### 3.2.1 Stage 0: Search for good seed points

For concreteness, suppose are the optimal solutions found previously and suppose we are tying to find . Let be a matrix that is the orthogonal complement of as determined by say Gram-Schmidt orthogonalization. Then

 ~WTw∗l=0,l=1,2,…,k−1 (22)
1. Generate vectors in whose elements are drawn from a uniform distribution on .

2. Standardize each vector to have unit norm to get the set of vectors where each .

For each we define seed points as

 ui=~Wzi (23)

It is easy to see that and . Thus satisfies the constraints in 21. We then compute the objective function at each of these points and choose points that give the highest objective function values as candidate seed points for the next step.

#### 3.2.2 Stage 1: Computing local optimum at R best points from Stage 1

In this stage, we compute the local solutions of the optimization problem 20 starting from and choose the best local solution that has the highest function value.

 w∗k=argmaxhk(w∗ik),i=1,2,…,R (24)

ADIS uses and as the defaults.

#### 3.2.3 Stage 2: Joint optimization with initialization from Stage 2

After Stage 1 has been applied from we have an estimate of the solution vectors . In this stage we solve the joint optimization problem 19 subject to the single joint constraint:

 q∑i=1q∑j=i(wTiwj−δij)2=0 (25)

where if and otherwise. We initialize the algorithm with the solution from Stage 1. The algorithm converges to a local joint solution only in a few iterations.

## 4 Optimization Algorithm

For flexible and powerful BSS algorithms, a primary requirement is a fast and robust optimization algorithm that can handle non-linear user defined constraints as well as handle non-convexity in the objective function or the constraints. Furthermore, extensive convergence diagnostics should be a standard output of the optimization process to ensure convergence to a local solution. The optimization core in ADIS (coded in MATLAB, www.mathworks.com) uses a modified augmented lagrangian algorithm (inspired by the implementation in LANCELOT package [10], [12]) to solve equality constrained problems generated in constrained non-convex BSS problems. Inequality constraints are handled by first transforming them to equality constraints via slack variables and solving the resulting bound constrained optimization problem. Some features of interest are as follows:

1. A Trust region based approach [30] is used to generate search directions at each step (for both equality constrained and inequality constrained problems).

2. For equality constraints only, the subproblems above are solved using a conjugate gradient approach (Newton-CG -Steihaug) [36] that is fast and accurate even for large problems and can handle both positive definite and indefinite hessian approximations. If both equality and inequality constraints are present then we solve the trust region problem with a non-linear gradient projection technique [5] followed by subspace optimization using Newton-CG-Steihaug. Our algorithm allows for infeasible iterates i.e., those that do not satisfy the problem constraints during optimization. This allows for a fuller exploration of parameter space and increases the likelihood of converging to a global optimum.

3. A symmetric rank 1 (SR1) quasi-Newton approximation to the hessian [11] is used which is known to generate good hessian approximations for both convex and non-convex problems. As suggested in [33] we do the update also on the rejected steps to gather curvature information about the function. We provide options for BFGS [4] especially for convex problems and an option for preconditioning the CG iterations. We also implement limited memory variants of SR1 and BFGS for large problems.

4. Our algorithm accepts vectorized constraints so that multiple constraints can be programmed simultaneously. Only gradient information is required. Hessian information is optional but not required. Optionally, it is easy to interface our code with INTLAB software package [34] for automatic differentiation in which case the user only codes the function and constraints and the gradients/hessians are generated automatically.

We tested the performance of our algorithm using standard optimization benchmarks from the GAMS performance benchmark problems (http://www.gamsworld.org/performance, [15]). The appendix shows some sample benchmarks as well as provides more technical details of the algorithm.

### 4.1 Convergence Diagnostics

It is critical to verify that the optimization algorithm has found a local solution by checking convergence diagnostics. These diagnostics help us determine if the Karush-Kuhn-Tucker (KKT) [33] necessary conditions for optimality have been satisfied or not. Profile plots are plots of a diagnostic measure versus iteration number. We propose the following checks for all BSS algorithms:

1. Convergence to a point satisfying necessary conditions for optimality can be accessed by looking at profile plots for

• Objective function

• Optimality error (such as ”gradient of the lagrangian” for equality constraints or ”KKT optimality checks” for general constraints)

• Feasibility error (checking constraint satisfaction)

• Lagrange multipliers

• Other parameters in the algorithm (such as a barrier or penalty parameter)

The user should at the very least check these diagnostic plots to make sure convergence is attained. If possible the second order sufficient conditions for optimality should also be verified at the solution point using Hessian information.

2. The algorithm used for optimization should flag an error and stop running in case convergence is not attained at any intermediate stage. This prevents the user from getting access to incorrect results.

These convergence diagnostics are a standard feature of ADIS. Any solution returned by ADIS is guaranteed to be optimal.

## 5 Statistics on estimated sources

In this section we develop equations that enable us to apply ADIS to a real data-set and make inferences from extracted sources.

Once are estimated we can compute their variance using the GLM estimate

 ^Cov(^si)=(^¯AT^¯A)−1^σ2i (26)

where is the estimated variance at point

 ^σ2i=(¯xi−^¯μ−^¯A^si)T(¯xi−^¯μ−^¯A^si)p−q (27)

We can create maps of contrasts of interest using the above equations. We also estimate the relative variance (RV) contribution at point using the component as follows:

If then

 RV(k,i)=Var(ak)^ski2∑qk=1Var(ak)^ski2 (28)

Inspection of these voxelwise variance explained maps is very useful in searching through the estimated sources for application based relevance.

### 5.1 Correcting for Autocorrelation

Extending to the case of autocorrelated noise is straightforward. When then we proceed in an iterative fashion as follows:

1. Set and estimate , and compute the pointwise residual

 ¯ri=¯xi−^¯μ−^¯A^si (29)
2. Compute autocorrelation in and prewhiten the data using the estimated correlation matrix to get . Run the PP algorithm on until does not change from one iteration to the next in an average sense. Various prewhitening schemes such as models or non-parametric approaches can be used. ADIS uses the non-parametric approach proposed in [38] to do prewhitening. By default ADIS will not do iterative prewhitening unless explicitly specified by the user.

## 6 Latent dimensionality estimation

Sophisticated bayesian strategies exist for estimating the latent dimensionality in the case of Gaussian sources [29]. In this section we develop a latent dimensionality estimation procedure that works very well both with Gaussian and non-Gaussian sources using a bootstrap/cross-validation procedure.

The latent dimensionality is estimated in two steps. We first estimate a lower bound on the latent dimensionality (6.1) followed by a cross validation analysis to refine the lower bound (6.2). In subsection 6.3 we validate this approach via extensive numerical simulations.

### 6.1 Stage 1 - Estimating the lower bound

Let

 X=[¯x1−^¯μ,¯x2−^¯μ,…,¯xn−^¯μ] (30)

and suppose are the eigenvalues of .

1. Randomly permute each column of the matrix to get the matrix .

2. Compute the eigenvalues of such that

3. Choose to be the largest value of such that .

First we destroy the systematic correlations between the columns of via random permutations of each column. Then we estimate the singular values of the permuted covariance matrix and compare these with the singular values of the unpermuted covariance matrix. Only those singular values that exceed the ones from random permutation are deemed significant and the lower bound on latent dimensionality is estimated to be the cardinality of those singular values.

If are permutation matrices then we can write:

 Xb=1nn∑i=1Pi(¯xi−^¯μ)(¯xi−^¯μ)TPTi (31)

Suppose

 ¯A=UaΣaVTa (32)

is the singular value decomposition of A with

Let be the true latent dimensionality. Then for large it can be shown (and verified by simulation) that:

 λi=⎧⎪⎨⎪⎩σ2ai+σ2 if% i≤qσ2 if q+1≤i≤(p−1)0 if i=p (33)

and

 λbi={σ2+1p−1∑qi=1σ2ai if i≤(p−1)0 if i=p (34)

Thus the non-zero eigenvalues of satisfy , the noise variance. Hence the largest index such that is a lower bound for the latent dimensionality .

### 6.2 Stage 2 - Cross Validation

Suppose the true latent dimensionality is assumed to be . Then for large the eigenvalues will follow equation (33) i.e, the eigenvalues {} should be well approximated by a constant . We estimate the quality of this model using leave one out cross validation [20]. Let be the mean of the eigenvalues from to excluding the index .

 M−kq=1p−2−qp−1∑j=q+1,j≠kλi (35)

The leave one out cross validation error assuming the true latent dimensionality to be at point is given by:

 E(q,k)=(λk−M−kq)2,k=q+1,…,p−1 (36)

The mean cross validation error and its variance can be estimated from these pointwise values as follows:

 ¯E(q)=1p−1−qp−1∑k=q+1E(q,k) (37)
 Var(¯E(q))=1p−1−qVar{E(q,k),k=q+1,…,p−1} (38)

When is smaller than then both and will be large and when is greater than then both and will be small. Since the eigenvalues are expected to remain constant beyond the change in will be small beyond . We define the following index for a given value of quantifying the change in cross validation error from to .

 Δ(q)=¯E(q)−¯E(q+1)√Var(¯E(q))+% Var(¯E(q+1)) (39)

If is the true latent dimensionality then will tend to have a maximum at since this is the point which will show the largest change in cross validation error in going from to . We thus propose the estimate

 qtrue=1+argmaxqΔ(q),q=ql,…,p−4 (40)

where is a lower bound on calculated from stage 1.

The estimation of uses a smaller number of points when gets very close to . Thus the estimate becomes unstable when is within a few time points of . In order to robustify our estimate against this instability we define the cumulative maximum index function which calculates the index of maximum of from .

 f(r)=argmaxqΔ(q),q=ql,…,r (41)

Then we count the number of times that a maximum is detected at

 g(y)=Card{r:f(r)=y} (42)

and define the estimate of dimensionality to be

 qtrue=1+argmaxyg(y) (43)

### 6.3 Validation of the approach

To test this latent dimensionality algorithm we generate data as per equation (1). Details of the simulation are as follows:

1. The true mixing matrix was chosen to be a matrix where the elements were drawn from a uniform distribution in (0,1). This random matrix was then scaled so that its minimum singular value .

2. The sources in (1) were generated from three different types of distributions based on their kurtosis ”excess” values . The chosen distributions were Gaussian (), Uniform () and Gamma ().

3. The ratio of noise standard deviation in (1), to the minimum singular value of , was varied between .

4. The ratio of the true latent dimensionality to the dimensionality of the observed data was varied between 0.1,0.2,…,0.5.

5. This process was repeated 20 times for each combination of source type, ratio and ratio.

6. For each individual simulation we estimated the latent dimensionality based on the 2 stage estimation strategy described above.

We chose and as fixed parameters of the simulation. Simulations show that this 2 stage estimation is almost unbiased for both Gaussian and non-Gaussian embedded sources for all values of the ratios and . Results are shown in figure 1(a) - 1(c).

## 7 Benchmarking: Comparison with other BSS algorithms

Choosing the negative entropy index as our objective function and without imposing any additional constraints, our algorithm attempts to estimate sources that are independent. To test and compare our algorithm with others, we used an approximation to negative entropy as proposed in [26] (see appendix for details).

ICALAB [8], [7] (available from http://www.bsp.brain.riken.go.jp/ICALAB/) is a Matlab package for comparing algorithms for BSS. We used ICALAB to compare the performance of our algorithm with other standard BSS algorithms such as FJADE [6], FPICA [26], [24], EFICA [27], [28] , ERICA [13] and UNICA [14] which use higher order statistics to separate sources.

The quality of source extraction is measured using the Source to Interferences Ratio (SIR) [37] (of the estimated mixing matrix) which measures the ratio of the energy of the estimated source projected onto the true source to the energy of the estimated source projected onto the other sources. Higher values of (SIR) indicate better performance. Please see the appendix for details. ICALAB also comes with standard benchmarking datasets (http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/).

A Monte Carlo analysis was performed using ICALAB by generating uniformly distributed random matrices and creating a mixed source data-set for a given set of sources .

 Xi=AiS,i=1,2,…,nb (44)

To get a baseline measure of performance for each algorithm, we use square mixing without additional noise for the simulation. Each algorithm was then run on this mixed data set. This process was repeated times for each dataset using a new mixing matrix every time. In ICALAB, most of the algorithms are given default parameters that are tuned optimum values for typical data. As suggested in ICALAB, we use these default algorithmic parameters for benchmarking purposes.

ADIS is able to perform non-square BSS in the presence of noise. However, since the other algorithms in our benchmarking test have not been designed to do this, we think its unfair to compare non-square ability of ADIS with other algorithms.

The 13 benchmarking datasets and their short descriptions are given in the appendix. In order to evaluate the effect of different types of mixing matrices, we ran additional Monte Carlo simulations when was chosen to be one of the following (a) Random sparse (b) Random bipolar (c) Symmetric random (d) Ill conditioned random (e) Hilbert (f) Toeplitz (g) Hankel (h) Orthogonal (i) Nonnegative symmetric (j) Bipolar symmetric (k) Skew symmetric.

The results are shown in figures 2(a) - 4(d) and table 1. The key performance feature is the SIR index [37] (see appendix for definition), the higher the value the better. Another important feature is the variability of SIR over 100 mixtures each generated using a different mixing matrix but containing the same underlying sources. Ideally an algorithm should be robust enough to converge to the same solution regardless of variation in the mixing matrix. The standard deviation of SIR captures this variability, the lower the variability of SIR the better. Additional results showing simulation results with different types of mixing matrices are shown in 7(a) - 9(d).

ADIS perfomed better than all other algorithms in almost all cases both in terms of the mean SIR index as well as the standard deviation of the mean SIR, which measures the robustness and convergence to the same solution. ADIS also produces extensive convergence diagnostics a sample of which is shown in figure 6(a). These diagnostics guarantee convergence and improve confidence in the estimated sources.

An illustration of performance improvement using ADIS (Stage 1 + 2) over Stage 1 is shown fir the ACsparse10 dataset in figure 5(b). The corresponding convergence diagnostics are shown in figure 6(b). ADIS Stage 2 performs better than ADIS Stage 1 but we did not observe as dramatic an improvement as seen for ACsparse10. Other ICA algorithms were outperformed using only ADIS Stage 1.

All experiments were performed on a computer with an Intel Xeon (TM) processor (3.4 GHz) and 4GB of RAM. The runtimes of ADIS (Stage 1) were comparable to those of FPICA, ERICA and UNICA. We found EFICA and JADE to be faster than other algorithms in general.

## 8 Case Study using real fMRI data

### 8.1 fMRI case study

To demonstrate how ADIS performs on a real dataset, we used the ”FSL Evaluation and Example Data Suite” (FEEDS) from FMRIB Image Analysis Group, Oxford University. The URL for this data suite is: http://www.fmrib.ox.ac.uk/fsl/feeds/doc/index.html One of the datasets in the example suite contains an audio visual experiment with two explanatory variables, the visual stimulus (30s off, 30s on) and an auditory stimulus (45s off, 45s on). Analysis was carried out using FEAT (FMRI Expert Analysis Tool) Version 5.4, part of FSL (FMRIB’s Software Library).

www.fmrib.ox.ac.uk/fsl


The following pre-statistics processing was applied; motion correction using MCFLIRT [Jenkinson 2002]; non-brain removal using BET [Smith 2002]; spatial smoothing using a Gaussian kernel of FWHM 5mm; mean-based intensity normalisation of all volumes by the same factor; highpass temporal filtering (Gaussian-weighted LSF straight line fitting, with sigma=50.0s).

ADIS estimated a latent dimensionality of . The source components activating the auditory and visual cortex were identified by inspecting the estimated voxelwise variance explained map. The results are shown in figures 11-12.

## 9 Conclusion

We implemented ADIS, an algorithm for probabilistic, constrained and non-square projection pursuit. We validated all aspects of ADIS including the latent dimensionality estimation procedure and its optimization core. When compared to other algorithms using standard benchmarking datasets using ICALAB, we find our algorithm outperforms other standard algorithms such as FastICA, FPICA, JADE, ERICA and UNICA in terms of both robustness and separation quality. Our algorithm also guarantees ”optimality” for each blind source via extensive convergence diagnostics and enables the user to use arbitrary contrast function and constraints for BSS. We hope it will be useful as a general BSS tool for the signal processing and fMRI community.

## 11 Negentropy Index

Given a random variable , the negative entropy a measure of non-Gaussianity. It is easy to show that imposition of independence on sources in BSS is equivalent to maximization of negentropy.

Robust approximations to negative entropy were developed in [23]. If is a non-quadratic, non-linear function and is a Gaussian random variable of the same variance as then the negentropy measure is given as [26]

 J(X)∝[E(G(X))−E(G(v)]2 (45)

In this paper, we used the following function [23] for :

 G(x)=log[cosh(x)] (46)

where is the hyperbolic cosine function

 cosh(x)=ex+e−x2 (47)

## 12 Sources to Interferences Ratio (SIR)

The SIR ratio is defined in [37]. We give here a brief summary of the key equations from that paper. Let and let be the orthogonal projector onto the subspace spanned by . If are the true sources and if are the corresponding estimated values then define:

 starget=Psj^sj (48)
 einterf=Ps^sj−Psj^sj (49)

The purity of source separation is measured using the SIR performance index defined as follows:

 SIR=10log10||starget||2||einterf||2 (50)

## 13 Benchmarking datasets

The 13 benchmarking datasets and their short descriptions are as follows:
(http://www.bsp.brain.riken.jp/ICALAB/ICALABSignalProc/) :

• nband5 - contains 5 narrow band sources. This is a rather ”easy” benchmark for second order separation algorithms but apprently presents challenges for higher order algorithms.

• 10halo - contains 10 speech signals that are highly correlated (all 10 speakers say the same sentence).

• GnBand - contains 5 fourth order colored sources with a distribution close to Gaussian. This is a rather ”difficult” benchmark.

• acspeech16 - contains 16 typical speech signals which have a temporal structure but are not precisely independent

• ABio5 - contains 5 typical biological sources

• ACsparse10 - contains 10 sparse (smooth bell-shape) sources that are approximately independent. The SOS blind source separation algorithms fail to separate such sources.

• 25SpeakersHALO - 25 highly correlated speech signals

• Vsparserand10 - very sparse random signals

• ACsincpos10 - positive sparse signals

• X5smooth - smooth signals

• Speech20 - 20 speech/music sources

• X10randsparse - random sparse signals

• 64soundsstd - a variety of sound sources (64 in total)

## 14 Details on Optimization Algorithm

Our optimization algorithm solves the general problem:

 min xf(x) (51) s.t. ci(x)=0, i=1,2,…,m (52) s.t. gj(x)≥0, j=1,2,…L (53)

where .

We convert the inequality constraints into equality constraints via slack variables as follows:

 gj(x)−sj=0 (54) sj≥0, j=1,2,…L (55)

Thus the optimization problem becomes:

 min f(x) (56) s.t. ci(x)=0, i=1,2,…,m (57) s.t. gj(x)−sj=0, j=1,2,…L (58) sj≥0 (59)

This problem is now an equality constrained problem where the inequalities have been replaced by the bound constraints on the slack variables. Thus it suffices to consider equality constrained problems with bounds on independent variables as follows:

 min f(x) (60) s.t. ci(x)=0, i=1,2,…,m (61) s.t. li≤xi≤ui, i=1,2,…n (62)

where .

Our code uses a trust region based augmented lagrangian approach to solve these bound constrained problems following closely the LANCELOT software package [12], [10]. The augmented lagrangian function for the above problem is defined as:

 L(x,λ,μ)=f(x)−m∑i=1λici(x)+μ2m∑i=1ci(x)2 (63)

At each outer iteration , given current values of and we solve the subproblem:

 min L(x,λk,μk) (64) s.t. li≤xi≤ui (65)

If is the projection operator defined as

 [P(z,l,u)]i=⎧⎪⎨⎪⎩li if zi≤lizi if li≤zi≤uiui if zi≥ui (66)

then the Karush-Kuhn-Tucker (KKT) optimality condition for 64 is given as [10]:

 x−P(x−∇xL(x,λk,μk),l,u)=0 (67)

The outer iteration code is given in Framework 1. Note that the penalty parameter is updated based on a feasibility monitoring strategy that allows for a decrease in if sufficient accuracy is not achieved in solving the subproblem 64.

At each inner iteration we form a quadratic approximation to the augmented lagrangian and approximately solve the inequality constrained quadratic sub-problem:

 min p12pT∇2xxL(x,λ,μ)p+∇xL(x,λ,μ)Tp (68) s.t. li≤xi≤ui (69) s.t. ||p||∞≤Δ (70)

The inner iteration code uses non-linear gradient projection [5] followed by Newton-CG-Steihaug conjugate gradient iterations [36]. Quasi-Newton updates are performed using either SR1 [11] (recommended for non-convex functions) or BFGS [4] (recommended for convex functions). For very large problems, we switch to the limited memory variants [32] of these quasi-Newton approximations. The algorithm details are given in Framework 2. The trust region update code is based on a standard progress monitoring strategy [33] and is given in Framework 3.