Alternating Diffusion Map Based Fusion of Multimodal Brain Connectivity Networks for IQ Prediction
Objective: To explain individual differences in development, behavior, and cognition, most previous studies focused on projecting resting-state functional MRI (fMRI) based functional connectivity (FC) data into a low-dimensional space via linear dimensionality reduction techniques, followed by executing analysis operations. However, linear dimensionality analysis techniques may fail to capture nonlinearity of brain neuroactivity. Moreover, besides resting-state FC, FC based on task fMRI can be expected to provide complementary information. Motivated by these considerations, we nonlinearly fuse resting-state and task-based FC networks (FCNs) to seek a better representation in this paper. Methods: We propose a framework based on alternating diffusion map (ADM), which extracts geometry-preserving low-dimensional embeddings that successfully parameterize the intrinsic variables driving the phenomenon of interest. Specifically, we first separately build resting-state and task-based FCNs by symmetric positive definite matrices using sparse inverse covariance estimation for each subject, and then utilize the ADM to fuse them in order to extract significant low-dimensional embeddings, which are used as fingerprints to identify individuals. Results: The proposed framework is validated on the Philadelphia Neurodevelopmental Cohort data, where we conduct extensive experimental study on resting-state and fractal -back task fMRI for the classification of intelligence quotient (IQ). The fusion of resting-state and -back task fMRI by the proposed framework achieves better classification accuracy than any single fMRI, and the proposed framework is shown to outperform several other data fusion methods. Conclusion and Significance: To our knowledge, this paper is the first to demonstrate a successful extension of the ADM to fuse resting-state and task-based fMRI data for accurate prediction of IQ.
Over the past few decades, there has been great attention to data fusion techniques and their applications in various fields; see, e.g., [1, 4, 2, 3, 5] and references therein. Integrating multiple datasets acquired by different sensors for a phenomenon of interest may yield more informative knowledge than any individual dataset does, because the multiple datasets can provide complementary information of the observed phenomenon from several different views. A straightforward approach is to simply concatenate feature vectors from multiple datasets into a single feature vector. However, such a concatenation scheme is very sensitive to the scaling of the data. Multivariate approaches, such as canonical correlation analysis (CCA) , independent component analysis (ICA) , and partial least squares (PLS) , have been independently developed by maximizing the correlation between the linear combinations of features from two datasets. Their penalized versions for high-dimensional settings and extensions to multiple datasets have also been proposed in [9, 10, 11, 12]. To analyze the joint information between different tasks and different brain regions in multiple functional MRI (fMRI) datasets, Calhoun and his collaborators [13, 14, 15, 16, 17] have proposed many ICA-based multitask data fusion approaches (e.g., joint ICA and multimodal CCA+joint ICA) according to various optimization assumptions. All the aforementioned approaches are based on linear mixture models, so they cannot optimally handle datasets that appear to have nonlinear structures and relations. To overcome this issue, many kernel based data fusion approaches have been studied in recent years [18, 19, 20, 21, 22, 23, 24, 25], where each dataset is individually used to construct a kernel matrix, and then the obtained kernel matrices are combined in linear or nonlinear ways to seek a unified kernel matrix that best represents all available information. A typical kernel based approach is multiple kernel learning [18, 19], which finds the unified kernel matrix by linearly combining the multiple kernel matrices. However, this approach assumes that the complementary information from multiple data sources is linearly provided, which might not necessarily be true in practice. Moreover, to learn an optimal unified kernel matrix, tuning the weight coefficient assigned to each single kernel matrix is computationally intensive. In [20, 21, 22, 23, 24, 25], nonlinear kernel fusion processes have been proposed, which represent complementary information nonlinearly into the intrinsic low-dimensional geometry, and avoid assigning weight coefficient to each single kernel matrix.
An approach of particular interest in this paper is alternating diffusion map (ADM), which was proposed more recently in [23, 24, 25]. The ADM is based on the framework of diffusion map (DM) , one class of manifold learning algorithms , and can achieve nonlinear dimensionality reduction in such a way that the intrinsic common structures underlying multiple high-dimensional datasets are maintained. More concretely, the ADM takes advantage of the product of the kernel matrices constructed separately by each dataset based on a stochastic Markov matrix to produce a unified representation, which can be interpreted as employing diffusion processes on each dataset in an alternating manner. This allows one to extract the common latent variables across multiple datasets that are assumed to drive the observed phenomenon, while filtering out other hidden variables that are sensor-specific and thought of as nuisance, irrelevant to the phenomenon. Hence, the ADM can provide a more reliable description of the phenomenon. So far the ADM has proven to be a powerful tool in voice detection from audio-visual signals [28, 29], Alzheimer’s disease classification from multiple electroencephalography (EEG) signals , and sleep stage classification from EEG and respiration signals . Here we show that the ADM can also be adapted to multimodal fMRI data. To our knowledge, this paper is the first to demonstrate a successful extension of the ADM to fuse resting-state and task-based fMRI data for the prediction of intelligence quotient (IQ).
The proposed framework in this paper begins with a preprocessing stage in which a brain functional connectivity network (FCN) is individually built for each subject from fMRI data. More specifically, the brain is graphically depicted as a network with regions of interest (ROIs) as the nodes and functional connectivities (FCs) as the edges, where the FC between two nodes is defined as statistical dependence between the blood oxygenation level-dependent (BOLD) fMRI time series in the two corresponding ROIs. Different from conventionally representing the FCN by a sample covariance matrix of the multi-ROI time series, we represent it by a symmetric positive definite (SPD) matrix [32, 33, 34], which is computed based on sparse inverse covariance estimation using the graphical least absolute shrinkage and selection operator (GLASSO) algorithm . Accordingly, two sets of SPD matrices are respectively derived from resting-state and task-based fMRI datasets. The FCN organization varies between individuals, and accordingly acts as a “fingerprint” of a subject . Recent works [37, 38, 39] have also studied the relations between the functional and structural brain connectivity patterns to improve the reliability of individual “fingerprint” as a biomarker.
We therefore store the SPD matrices of all subjects and treat them as new features from fMRI data for subsequent analysis. However, the dimension of the SPD matrix is usually much larger than the number of subjects. For example, there are FCs with ROIs in our study. If we directly use the SPD matrices to train a classifier, it will suffer from the curse of dimensionality, which often leads to overfitting and poor generalization performance. Fortunately, despite individual variation, human brains do in fact share common connectivity patterns across different subjects, i.e., variations of the SPD matrices representing brain connectivity are driven by a small subset of unknown parameters. It suggests that we adopt nonlinear dimensionality reduction (e.g., manifold learning) algorithms to extract the intrinsic variables of the SPD matrices prior to training a classifier. In this paper, based on the two sets of SPD matrices derived from two fMRI datasets, respectively, we use the ADM to fuse them to find meaningful low-dimensional embeddings, so that their shared source of variability is maintained while noise specific to any single set of SPD matrices is reduced. These low-dimensional embeddings are then used as fingerprints to classify individuals of different behaviors and cognitions (e.g., IQ).
As the set of SPD matrices is known to form a Riemannian manifold instead of a full Euclidean space, geometric distances, such as affine-invariant Riemannian distance  and root stein divergence distance ), have been proposed to measure the similarities of SPD matrices by considering the underlying manifold where they reside. These distances can better discover the Riemannian geometry than the traditional Euclidean distance, and have been used successfully to characterize FC differences [32, 33, 34, 42]. In this paper, we adopt a geodesic distance on SPD matrices, namely the Log-Euclidean distance , to measure the similarities of SPD matrices in the ADM because of its computational efficiency. The Euclidean distance and the Cholesky distance  are tested for comparison.
We finally validate our proposed framework by fusing two fMRI datasets (i.e., resting-state and fractal -back task fMRI) from the publicly available Philadelphia Neurodevelopmental Cohort (PNC) data [47, 48] to build a predictor for subjects with different IQs. The subjects’ IQ scores were assessed by the Wide Range Achievement Test (WRAT) administered in the PNC. The WRAT is an achievement test that measures an individual’s learning ability including reading, spelling, and arithmetic , which can provide a reliable estimate of IQ. A large body of clinical studies has emerged to argue that distinct patterns of brain functional activity account for the proportion of difference of IQ among individuals [50, 51]. These findings suggest that the ADM of fusing multiple sets of FCNs in this paper has the potential to automate the task of classifying populations of low and high IQs. As will be seen experimentally, the classification results demonstrate the advantage of our proposed framework. Specifically, the ADM achieves superior classification performance over that of the DM (using any single set of FCNs) and several existing fusion methods. In addition, the effectiveness of incorporating the log-Euclidean distance into the DM and the ADM is verified in comparison to the Euclidean and Cholesky distances.
The rest of this paper is organized as follows. In Section II, the proposed framework is presented, including the brain FCN construction and two manifold learning methods (i.e., the DM and the ADM). In Section III, a simulation example is first illustrated, and then the experimental results on the PNC data are shown. The discussions are presented in Section IV, followed by the conclusion in Section V.
Notations: Uppercase boldface, lowercase boldface, and normal italic letters are used to denote matrices, vectors, and scalars, respectively. The superscript denotes the transpose of a vector or matrix. denotes the -th entry of matrix , and denotes the -th entry of vector . We denote the set of real numbers as .
In this section, an overview of the proposed framework is outlined in Fig. 1. There are three major steps: brain FCNs are extracted as SPD matrices from each fMRI dataset; the ADM is applied for fusing two sets of FCNs derived from two fMRI datasets to find a meaningful low-dimensional representation; and support vector machine (SVM) classification is carried out based on the low-dimensional embeddings. In the following, we will present the details of these steps.
Ii-a Brain FCN representation using SPD matrices
The BOLD fMRI signal, as a time series, measures neural activity by detecting changes in blood flow at many spatial locations of the brain. In fMRI, studies can focus on specific tasks as well as at rest, and brain networks are usually built based on the BOLD signals to describe FC across brain regions. The network nodes are brain ROIs, and the FC between two nodes is defined as temporal covariance or correlation of fMRI time series in the two nodes.
Let be a BOLD fMRI time series for a subject, where is the number of time points and is a -dimensional vector, corresponding to an observation of brain ROIs at the -th time point. Assume that has been normalized to have zero mean and unit variance along each row. As described above, the FCN is represented by a covariance matrix of the multi-ROI time series. To estimate , we generally obtain the estimation of its inverse by maximizing the penalized log-likelihood over the space of all SPD matrices:
where is the sample covariance matrix, and , , denote the determinant, the trace, the sum of the absolute values of the entries of a matrix, respectively. In (1), the regularization parameter controls the tradeoff between the degree of sparsity and the log-likelihood estimation of . In this paper, we use the Bayesian Information Criterion (BIC)  to select the optimum , and the maximization problem (1) can be efficiently solved via the graphical LASSO (GLASSO) algorithm  (its Matlab software package: http://statweb.stanford.edu/~tibs/glasso/).
Ii-B Nonlinear dimensionality reduction of FCNs
From (1), we can individually compute the SPD matrices , , to represent the FCNs of subjects from one fMRI dataset. In what follows, we shall use the terms “SPD matrices” and “FCNs” interchangeably. The SPD matrices are treated as the features extracted from subjects’ fMRI data for subsequent analysis, and considered as points distributed in a high-dimensional space. We have to reduce the dimension of these SPD matrices by finding the significant features, since many of the features may be noninformative or redundant while increasing computational cost and classification complexity. In spite of individual variation, brains do share some common FC patterns across different subjects. Therefore, the SPD matrices used to represent FCNs shall have some similar structures , and their variations only depend on a small subset of unknown parameters. Inspired by this evidence, we aim to generate a low-dimensional representation of the SPD matrices. Since brain activity involves multiple nonlinear neural dynamics, we adopt here a nonlinear dimensionality reduction algorithm for best representing the high-dimensional SPD matrices by their low-dimensional embeddings, where the intrinsic geometry of the SPD matrices can be well preserved in the embedding coordinates. The details are elaborated as follows.
Ii-B1 Gaussian kernel function
In machine learning, kernel functions are often used to define similarity measures to learn the relations among subjects via the kernel trick, and in particular the Gaussian kernels are widely used. In this paper, we calculate a similarity matrix by using the Gaussian kernel function with a distance of SPD matrices, i.e.,
where is the bandwidth of the Gaussian kernel function and is a distance chosen by the user to measure two SPD matrices. This construction defines a weighted graph, in which the nodes correspond to the subjects , and is the weight matrix of the graph.
Different definitions of would lead to different similarity matrices. An appropriate distance is crucial to perform the following dimension reduction while revealing the intrinsic geometry of the SPD matrices, since the set of SPD matrices is restricted to some Riemannian manifold, not a full Euclidean space. For ease of computation, we investigate one commonly used geodesic distance, i.e., the log-Euclidean distance (LEU) , that considers the specific geometry of the manifold. The LEU between and is given by
where, for a SPD matrix with its eigenvalue decomposition , the matrix logarithm of is defined by , and denotes the Frobenius matrix norm. For comparison, we consider the Cholesky distance (CK)  and the traditional Euclidean distance (EU) as well. The CK is given by
where denotes the low triangular matrix obtained by the Cholesky decomposition of , i.e., . The EU is given by
The LEU (3) is known as one of the most widely adopted distances for SPD matrices, because it is a geodesic distance induced by Riemannian metrics and provides a more accurate distance measure than the EU (5). Apart from these geodesic distances, a number of other distances (e.g., the CK (4)) that do not necessarily arise from Riemannian metrics can also be used to capture the nonlinearity among SPD matrices. Different from the LEU that is derived based on matrix logarithm, the CK induces a reparameterization measure of SPD matrices based on matrix decomposition, because a SPD matrix has a unique Cholesky decomposition. It is shown in  that the Gaussian kernels (2) with the LEU, the CK, and the EU are all positive semidefinite on manifolds for any , such that one would be able to freely tune to reflect the data distribution.
Ii-B2 DM for single FCN dataset
Considering that the data points lie on an intrinsically low-dimensional manifold embedded into , we use the DM  to obtain their low-dimensional embeddings with . The DM is a graph-based nonlinear dimensionality reduction method, which extends and enhances ideas from other manifold learning methods by deploying a stochastic Markov matrix based on the similarities between data points in high-dimensional space to identify a low-dimensional representation that captures the intrinsic geometry in the dataset. The procedure of the DM is demonstrated in Fig. 2 and described in detail below.
Based on the similarity matrix calculated in (2), we first get a normalized kernel matrix by
such that each row sums to , where is a diagonal matrix with . Hence, we can imagine a Markov chain on the graph with the transition matrix , in the sense that the -th entry represents the transition probability from node to node .
It is easy to check that is similar to the positive semidefinite matrix . As such, let and denote the ordered eigenvalues and corresponding normalized eigenvectors of , i.e.,
where and . Moreover, we can readily verify that the largest eigenvalue is equal to and its associated eigenvector is a constant vector. Then, a compact representation, referred to as DM, is achieved by keeping only the largest non-trivial eigenvalues and eigenvectors of , i.e.,
where is an estimated dimension of the embedding space.
The key idea in the DM is that the Euclidean distance between two embeddings (e.g., and ) is approximately equal to the diffusion distance between the two corresponding data points (e.g., and ) in the original space. The diffusion distance between the -th and -th subjects is defined as the weighted distance between the transition probabilities of node and node , i.e.,
where stands for the stationary distribution of , calculated by for . The diffusion distance is a metric that can reveal the intrinsic geometry among data points. It is robust to noise as well, since the diffusion could be viewed as a nonlinear process that averages all possible connectivity between pairs of data points on the graph.
Ii-B3 ADM based fusion of two FCN datasets
The ADM [23, 24, 25] is a recently developed data fusion technique on the basis of the DM framework. The purpose of the ADM is to fuse two datasets to find a more coherent and accurate representation, in the sense that the information from the two datasets is diffused to yield the underlying common information (which is assumed to drive the phenomenon of interest), and meanwhile nuisance specific to any single dataset is reduced. Let and be the FCNs extracted from two different fMRI datasets for the same subjects, respectively. By using the ADM described below, we can obtain low-dimensional embeddings .
In the same way as in (2), we separately construct similarity matrices of the two datasets: for all and ,
where is the tuneable kernel bandwidth and is a chosen metric on the data points. From the similarity matrices, we get the normalized kernel matrices and as in (6), respectively. According to the ADM in , a unified kernel matrix is given by
Since is real and symmetric, it has real eigenvalues, and the eigenvectors are real and orthogonal to each other. As such, let be the eigenvalues of with decreasing magnitude, and be the corresponding normalized eigenvectors. Hence, a low-dimensional representation (referred to as ADM) for the common structures in the datasets is obtained by taking its eigenvectors corresponding to the largest eigenvalues in magnitude, i.e.,
where is an estimated dimension of the embedding space.
In the ADM, a Markov chain on a graph is first built for each dataset, where the subjects represent the graph nodes, and the normalized kernel matrix is viewed as the transition matrix of the Markov chain on the graph. In other words, we obtain two graphs with the same set of nodes and two different transition matrices (i.e., and ). Then, we combine the information from the two datasets by the product of the transition matrices, which takes into account all the various connectivities of two nodes hopping within and across the two graphs. It is shown in  that efficient low-dimensional embeddings (12) based on the matrix characterize the common structures (common latent variables) between the manifolds underlying the different datasets, and in the meantime attenuate the differences (sensor-specific variables) between the manifolds. The interested reader can find a theoretical foundation of the ADM in [23, 25].
Ii-B4 Out-of-sample extension
In the above, we present how to use the DM (or ADM) to provide a mapping for a training set with FCNs (or FCNs and ) to a -dimensional (or -dimensional) space. In order to extend the mapping to new data points (unlabeled FCNs) without reapplying a large-scale eigendecomposition on the entire data, we introduce the Nyström extension [45, 46, 32, 22], which is an efficient non-parametric solution widely used for the methods involving spectral decomposition. Accordingly, for the DM and the ADM, respectively, we derive an explicit mapping between new FCNs and the low-dimensional embedding space obtained from the training set as follows.
Given a new FCN , we want to extend the DM mapping to get . We first calculate the similarities between and , , and then normalize them to get for , i.e.,
The extended eigenvectors for the new data point are approximated as the weighted sums of the original eigenvectors, i.e.,
and the embedding is given by
Given new FCNs and for a two-dataset scenario, we want to extend the ADM mapping to get . Similar to (13), we calculate, for and ,
Let for , and
Then, the extension is given by
and the embedding is
Ii-C Classification using SVM
In this paper, classification is explored as a potential application to validate our proposed framework, in that if the intrinsic manifold structures of data are faithfully preserved by the proposed framework, the obtained embeddings of the original high-dimensional data points that belong to different classes will be separated far from each other in the low-dimensional embedding space. The classification performance is assessed by using a linear kernel SVM with default hyper-parameters on the embeddings. We remark that we here choose a simple linear kernel SVM classifier for three reasons: 1) since the DM and the ADM mentioned above provide embedded features globally in linear coordinates, we limited the tests to linear classifiers; 2) SVM is known as one of the state-of-the-art classifiers and has been extensively used in biomedical data analysis because of its accurate classification performance [54, 55]; and 3) although there are many other advanced classifiers, the emphasis in this paper is the superior performance of the proposed framework, not the optimal classification scheme.
Iii Experimental results and discussion
Iii-a Simulation result
Let be three statistically independent uniform random variables on . We generate samples of , and define two sets of simulated samples in by
for , where is an orthonormal transformation matrix. Assume that these two datasets are observations acquired by two sensors, respectively, where is a common variable, and and are the variables that are sensor-specific. As can be seen in the first and third columns of Fig. 3, each set of simulated samples lies on a -dimensional Swiss roll embedded in .
We first apply the DM separately to each dataset, and the -dimensional embeddings are presented in the 2nd and 4th columns of Fig. 3. The subfigures in each row are obtained from the same dataset, i.e., (a)–(d) are scatter plots of and their embeddings, and (e)–(h) are scatter plots of and their embeddings. In the first two columns of Fig. 3, data points are colored according to . In subfigures (c), (d), data points are colored according to . In subfigures (g), (h), data points are colored according to . One can see that all the scatter plots of the -dimensional embeddings exhibit a smooth color gradient, which implies accurate parametrization of both the common and the sensor-specific variables for each dataset.
We next apply the ADM to fuse the two datasets. The -dimensional embeddings are shown with different color coding schemes in Fig. 4. The data points in the leftmost subfigure are colored according to the common variable , while those in the middle and the rightmost subfigures are colored according to the sensor-specific variables and , respectively. We observe that the color gradient is smooth only for the common variable. Equivalently, this means that the embeddings obtained by the ADM successfully extract a parametrization of the common variable , while filtering out the nuisance variables and that are specific to each dataset.
Iii-B Application to IQ classification
Iii-B1 Data preprocessing and experimental setting
The PNC [47, 48] is a large-scale collaborative study of child development between the Children’s Hospital of Philadelphia and the Brain Behavior Laboratory at the University of Pennsylvania. The publicly available PNC data were downloaded from dbGap (www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000607.v1.p1). In this PNC sample, genetics, neuroimaging, and cognitive assessment measures were all acquired in nearly 900 adolescents aged from to years. In this paper, we study two functional imaging datasets (i.e., functional imaging of working memory task and resting state), and their classification performance on IQ. The scores of the WRAT administered in the PNC reflect subjects’ IQ levels, since the WRAT is a standardized achievement test that measures an individual’s ability, e.g., reading recognition, spelling, and math computation , which can provide a reliable estimate of IQ. To mitigate the influence of age over the results, we first selected a subset of all subjects for whom ages were above years. Next, we converted their WRAT scores to -scores, and only kept subjects whose absolute values of -scores were above . The low IQ group consisted of the subjects with -scores smaller than , and the high IQ group consisted of the subjects with with -scores larger than . As a consequence, we were left with subjects that were separated into two groups according to IQ levels: the low and high IQ groups (Table I).
|Group||Age ()||Male/Female||WRAT score ()|
MRI examinations were conducted on a single T Siemens TIM Trio whole-body scanner. Both task-based and resting-state images were collected using a single-shot, interleaved multi-slice, gradient-echo, echo planar imaging sequence. All the images were preprocessed in SPM12 (www.fil.ion.ucl.ac.uk/spm/), including motion correction, spatial normalization to standard MNI space, and spatial smoothing with a mm FWHM Gaussian kernel. A regression procedure was applied to address motion-related influences and a Hz–Hz band-pass filter was applied to the functional time series. In resting-state, subjects were instructed to stay awake with the eyes open, fixate on the displayed crosshair, and keep still. In fractal -back task to probe working memory, subjects were required to respond to a presented fractal only when it was the same as the one presented on a previous trial. Based on a recently validated -region functional parcellation scheme , ROIs were defined to describe the whole brain as mm diameter spheres centered upon ROI coordinates. Thus, for each subject, each type of fMRI data can be represented by a matrix of which the rows correspond to the ROIs and the columns the time points. All the fMRI data were centered and normalized by subtracting from each row the mean and dividing it by its standard deviation. We finally obtained two fMRI datasets, i.e., resting-state and fractal -back task fMRI.
Iii-B2 Visualization of brain FCNs
Recall that for each subject the FCN is defined by a SPD matrix obtained in Subsection II-A, where is the number of ROIs. Within this network, there are unique edges (or connections) and functional modules, i.e., somatomotor/hand, somatomotor/mouth, cingulo-opercular control, auditory, default mode, memory retrieval, visual, fronto-parietal control, salience, subcortical, ventral attention, dorsal attention, cerebellar, and uncertain. We sought to interrogate significantly different connections between low and high IQ groups. Two-sample t-tests were performed for each of the Fisher z-transformed connection strength values in the network. In the first column of Fig. 5, we displayed the number of connections by setting different thresholds (i.e., ) in terms of typical modules. For ease of visualization, we ranked all connections according to their -values, and selected the top of the connections (i.e., uncorrected, for resting-state, and for -back task). The number of these selected connections differing between groups was assessed for each of the modules both for within- and between-module connections shown in the second column of Fig. 5, and the corresponding three-dimensional axial views in anatomical space are visualized using the BrainNet Viewer . One can see that a majority of the significantly different connections associated with IQ were involved with the default mode, fronto-parietal control, and visual modules, which is in agreement with the reports in previous studies [59, 60, 61]. The default mode module has been linked to self-referential thought and autobiographical memory. The fronto-parietal module, including portions of the lateral prefrontal cortex and posterior parietal cortex, is thought to serve cognitive control abilities and working memory, among others. The visual module is related to the ability to process visual stimuli and to understand spatial relation between objects.
Iii-B3 Classification results
We first assessed the classification performance for high vs low IQ when only one single dataset (resting-state FCNs or -back task FCNs) was used with and without applying the DM. Second, we evaluated the classification performance when both resting-state and -back task FCNs were used with applying the ADM. Third, we compared the performance of the proposed ADM based framework with that of several other common data fusion methods.
In nonlinear dimensionality reduction of the FCNs by the DM and the ADM, two important parameters have to be set, i.e., the kernel bandwidth in the Gaussian kernel matrix and the target dimension of the reduced space, both of which influence the embedding and thus the subsequent classification results. Too small will result in a sparse (or even disconnected) graph that is unable to capture the local structures around the data points, whereas too large will cause a dense graph that may generate a redundant description of the data. Analogously, if the target dimension ( in the DM or in the ADM) is too large or too small, the mapping will tend to be noisy and unstable or may not capture sufficient information about the manifold geometry. Choosing parameters from a reasonable range is of importance. Notably, a max-min scheme has been suggested in  for choosing :
where is typically set in the range . In this paper, we fixed for the kernel bandwidth in the DM. However, in the ADM, the unified kernel matrix (11) involves the product of two single kernel matrices. This insight indicates that the max-min measure for kernel bandwidth in the DM could be relaxed in the ADM. That is, smaller values for could be used to set and in (10). Although an automated method for determining and has been proposed , we choose to tune them by cross-validation in this study. Different values of the kernel bandwidth employed in our experiments were tested by setting for each dataset in the ADM. In both the DM and the ADM, the target dimension varied in the range of .
A -fold cross-validation (CV) procedure was implemented to evaluate the classification performance in all experiments. The whole data were randomly partitioned into equal-sized disjoint subsets with similar class distributions. Each subset in turn was used as the test set and the remaining subsets were used to train the SVM classifier. Specifically, for every pair of training and test sets, the low-dimensional embeddings of the training set were first computed, and an SVM classifier was trained by the labeled samples in the embedded training set. Then, the low-dimensional embeddings of the test set were obtained by using the out-of-sample extension, and the trained SVM was applied to predict class labels of the samples in the embedded test set. The classifier accuracy was estimated by comparing against the ground-truth labels on the test set. The test result in the CV was the average of the individual accuracy measures. The whole process was repeated times to reduce the effect of sampling bias, and the average classification accuracy (ACC) was computed over all realizations. All free parameters, i.e., the kernel bandwidth and the target dimension, were tuned from their respective ranges by -fold inner CV on the training set, and the parameters with the best performance in the inner CV were used in the testing.
3.1) Results of the DM and the ADM: The performance of the DM incorporating different distances (i.e., LEU (3), CK (4), and EU (5)) on SPD matrices was tested for each single dataset of FCNs, respectively. To see if more or less significant information got lost after the embedding, we also vectorized the original high-dimensional FC data without applying the DM and then directly used them to train an SVM classifier. The results are reported in Table II. We found that the classification performance using -back task FCNs was usually better than that using resting-state FCNs. This highlights the importance of -back task FCs in IQ classification. Compared with the results of the vectorized method, DM+CK and DM+EU got similar or even worse results, while DM+LEU made significant improvement. Among all the methods, DM+LEU achieved the best performance (DM+LEU vs the other methods: for resting-state and for -back task). It means that the incorporation of the LEU into the DM successfully extracted the most informative low-dimensional embeddings, but the incorporation of the other distances (i.e., the CK and the EU) into the DM did not.
We next compared the performance of the ADM for fusion of the two datasets of FCNs (i.e., resting-state FCNs and n-back task FCNs), as shown in the last row of Table III. The performance using the ADM based data fusion was better than that using the DM on any single dataset. In particular, ADM+LEU achieved classification accuracy, which was better than the results of the DM on any single dataset in Table II (e.g., in DM+LEU, for -back task and for resting-state), and made improvement of about in comparison to the vectorized method for each single dataset. It demonstrates the power of ADM based data fusion and also justifies the assumption that a proper fusion of different datasets can produce more coherent information useful to understand the observed phenomenon. In accordance with the performance of both the DM and the ADM with respect to different distances on SPD matrices, the LEU always achieved the best result, the CK followed, and the EU was the lowest. This again indicates that it is important to consider the manifold property of SPD matrices to obtain low-dimensional embeddings, resulting in discrimination of individuals with different IQ levels.
|Concatenated DM II|
We also investigated the effect of free parameters on classification performance in such a way that the parameters of interest were successively set to one combination across their ranges and for every setting of the parameters the testing accuracy was computed in the CV with the left-out parameters being optimally tuned. In the left of Fig. 6, the classification accuracies of DM+LEU for each single dataset and ADM+LEU for fusion of two datasets are shown with varying settings of the target dimension. As seen from the figure, the target dimension has an important impact on the classification. If the selected target dimension is too small, the mapping will lose some important information. If the selected target dimension is too large, the embeddings will be still noisy and redundant such that they cannot effectively reflect the intrinsic structures of the original high-dimensional data. Both of the above cases will lead to poor classification accuracy. Similarly, selecting optimum kernel bandwidths in the ADM plays a role in the classification performance. As shown in the right of Fig. 6, the parameters’ sensitivity by changing values of and in ADM+LEU is presented. We observed that the best parameter combination was always found in our experiments; for example, in ADM+LEU the selected target dimension was usually in the range of , and the selected and were usually in the range of .
3.2) Comparison with other data fusion methods: To further demonstrate the strength of the ADM, we compare it with other data fusion methods described below.
Concatenated DM I: concatenate all the features from two datasets into a single feature vector, and then apply the DM.
Concatenated DM II : apply the DM to obtain low-dimensional embeddings of each dataset separately, and then concatenate the embeddings into a unified vector.
Kernel-sum DM : add up the similarity matrices constructed from each dataset to get a unified similarity matrix as , and then perform the rest of the procedures of the DM based on .
Kernel-dot-product DM : multiply the similarities matrices constructed from each dataset element by element to get a unified similarity matrix as , and then perform the rest procedures of the DM based on .
For fair comparison, all experiments for the above methods were implemented by the same evaluation framework as the ADM. It turns out that the ADM with the LEU still achieved the highest accuracies among all the methods with all different distances on SPD matrices. It demonstrates the effectiveness of applying the LEU to measure the similarities of FCNs (e.g., compared with ADM+LEU, for both ADM+CK and ADM+EU). When using the LEU on SPD matrices, the ADM performed improved results compared with the kernel-dot-product DM and the concatenated DM II with , and yielded results most similar to the kernel-sum DM. For the EU and the CK, there were no substantial differences of accuracy between the kernel-sum DM, the kernel-dot-product DM, and the ADM. Importantly, similar to the ADM, the kernel-sum and kernel-dot-product DM methods define a unified similarity matrix that sums or multiplies the pairwise similarities between subjects from each dataset, resulting in a better combination of complementary information from each dataset. It is shown from Table III that both of them achieved better classification results than those using the concatenated methods (i.e., the concatenated DM I and the concatenated DM II). In the concatenated DM I, the classification accuracy was only . The classification performance of the concatenated DM II was slightly better than that of the concatenated DM I. The poor classification performance based on the concatenated feature set in the concatenated methods may be largely ascribed to ignoring the mutual relations that exist between the datasets. This suggests that it is better to fuse heterogeneous datasets using kernel/similarity matrices rather than direct fusion in the original feature space.
Iv-a Most discriminative brain FCs
It can be seen that the ADM with the LEU achieved the best classification performance. Equivalently, the low-dimensional embeddings obtained by this method best characterized the underlying data structures associated with IQ variability. Therefore, the alternating diffusion distance, defined by the Euclidean distance in the low-dimensional embedding space, i.e., for each pair of subjects and , can provide a measure between subjects in terms of the common latent variables of interest extracted from the two sets of FCNs. Based on the alternating diffusion distance, we attempted to evaluate the discriminative power of the features (i.e., FCs) according to their Laplacian scores  as follows.
In each CV of the ADM with the LEU, we first learnt the embeddings corresponding to the highest classification accuracy on the training set. We then constructed a -nearest-neighbor graph with nodes. The -th node corresponds to . If is among nearest neighbors of or is among nearest neighbors of , we put the edge
with being set as ; otherwise, set . This graph structure can nicely reflect the common manifold geometry of the data. Thus, the importance of a feature can be regarded as the degree to which the feature is consistent with the graph structure induced from (21).
Let denote the -th resting-state FC of the -th subject, and with subjects. The Laplacian score of the -th resting-state FC is defined by
where is the estimated variance on the graph. By spectral graph theory, we compute as
where and . Similarly, the Laplacian score of the -th -back task FC is also defined. Obviously, the smaller the Laplacian score is, the better the feature is. Since the Laplacian score of a feature is different in each CV, we averaged the Laplacian scores of each feature in all CV folds, and ranked the features according to their averaged Laplacian scores in increasing order. We visualized resting-state and -back task FCs with the smallest averaged Laplacian scores in Fig. 7, respectively, where was set as . It is found that the majority of the selected FCs are located in frontal, parietal, temporal, and occipital lobes.
Iv-B Future work and limitations
The free parameter tuning in the manifold learning methods, e.g., the kernel bandwidth and target dimension in the DM and the ADM in this paper, is crucial for classification. How to choose the optimal values for the free parameters remains an open and actively researched question. Although algorithms for automatic tuning of the optimal kernel bandwidth and target dimension in the DM and the ADM have been proposed in , they have been experimentally shown to be unsuitable for the datasets in this study. Therefore, in this paper we implemented grid search CV for parameter tuning. Note that Dudoit and van der Laan  have provided the asymptotic proof for choosing the tuning parameter with minimal CV error, which gives a theoretical basis for this approach.
In fMRI data analysis, not all ROIs are related to IQ differences. The ROIs are filtered to extract only those that can help to discriminate between high and low IQ. Therefore, feature selection could be performed to extract the most informative ROIs prior to constructing FCNs in our proposed framework. We will investigate the effect of using different feature selection approaches on the classification performance in future work.
In line with recent studies , task fMRI data have a better prediction of IQ than resting-state fMRI data. Furthermore, it has been shown in  that combining multiple different task fMRI datasets can significantly improve IQ predictive power, compared with using any single task fMRI dataset. Apart from resting-state and -back task fMRI datasets, there exist emotional task fMRI and single nucleotide polymorphism (SNP) datasets in the PNC. Therefore, an interesting future work is to fuse all the three neuroimaging datasets and one genomic dataset together by means of the ADM, which could capture more discriminative information and further improve the IQ classification performance.
In this paper, we considered a manifold based data fusion method (i.e., the ADM), by which the information from two datasets acquired by different sensors is diffused to extract the common information driving the phenomenon of interest, and simultaneously to reduce the sensor-specific nuisance. We tested the potential of the ADM for predicting IQ with the PNC dataset, resulting from a comprehensive study of brain development. Specifically, for each of resting-state and -back task fMRI, we first represented the FCN by a SPD matrix using the graphical LASSO for each subject. This results in two FCNs (or two SPD matrices), i.e., resting-state and -back task FCNs, for each subject. We next utilized the ADM to fuse the resting-state and -back task FCNs to extract a meaningful low-dimensional representation. The obtained low-dimensional embeddings were used to train a linear kernel SVM classifier. The experimental results show that the prediction accuracy of the fused data by means of the ADM is larger than that of using any single set of FCNs, and the ADM also achieves superior classification performance in comparison with several other data fusion methods. Moreover, in the construction of similarity matrices, we employed the Log-Euclidean manifold based metric to measure the distance between SPD matrices. The effectiveness of incorporating it into the DM or the ADM was verified by the comparative experiments with the Cholesky metric and the traditional Euclidean metric on SPD matrices.
The authors wish to thank the NIH (R01GM109068, R01MH104680, R01MH107354, R01AR059781, R01EB006841, R01EB005846, P20GM103472), and NSF (#1539067) for their partial support.
-  V. D. Calhoun and T. Adali, “Feature-based fusion of medical imaging data,” IEEE Trans. Inf. Technol. Biomed., vol. 13, pp. 711-720, 2009.
-  D. Lahat, T. Adali, and C. Jutten, “Multimodal data fusion: An overview of methods, challenges, and prospects,” Proc. IEEE, vol. 103, no. 9, pp. 1449-1477, 2015.
-  T. Adali, Y. Levin-Schwartz, and V. D. Calhoun, “Multimodal data fusion using source separation: Application to medical imaging,” Proc. IEEE, vol. 103, no. 9, pp. 1494-1506, 2015.
-  V. D. Calhoun and J. Sui, “Multimodal fusion of brain imaging data: A key to finding the missing link(s) in complex mental illness,” Biol. Psych.: Cogn. Neurosci. Neuroimag., vol. 1, no. 3, pp. 230-244, 2016.
-  N. Yokoya, C. Grohnfeldt, and J. Chanussot, “Hyperspectral and multispectral data fusion: A comparative review of the recent literature,” IEEE Geosci. Remote Sens. Mag., vol. 5, no. 2, pp. 29-56, 2017.
-  H. Hotelling, “Relations between two sets of variates,” Biometrika, vol. 28, no. 3/4, pp. 321-377, 1936.
-  P. Comon, “Independent component analysis, a new concept,” Signal Process., vol. 36, no. 3, pp. 287-314, 1994.
-  H. Wold, “Partial least squares,” in Encyclopedia of Statistical Sciences, New York:Wiley, vol. 6, pp. 581-591, 1985.
-  E. Parkhomenko, D. Tritchler, and J. Beyene, “Sparse canonical correlation analysis with application to genomic data integration,” Stat. Appl. Genet. Mol. Biol, vol. 8, no. 1, pp. 1-34, 2009.
-  J. Fang, D. Lin, S. C. Schulz, et al., “Joint sparse canonical correlation analysis for detecting differential imaging genetics modules,” Bioinformatics, vol. 32, no. 22, pp. 3480-3488, 2016.
-  H. Chun and S. Keleş, “Sparse partial least squares regression for simultaneous dimension reduction and variable selection,” J. of the Royal Stat. Soc.: Series B, vol. 72, no. 1, pp. 3-25, 2010.
-  J. R. Kettenring, “Canonical analysis of several sets of variables,” Biometrika, vol. 58, no. 3, pp. 433-451, 1971.
-  V. M. Vergara, A. Ulloa, V. D. Calhoun, et al., “A three-way parallel ICA approach to analyze links among genetics, brain structure and brain function,” NeuroImage, vol. 98, pp. 386-394, 2014.
-  V. D. Calhoun, T. Adali, K. A. Kiehl, et al., “A method for multitask fMRI data fusion applied to schizophrenia,” Hum. Brain Mapp., vol. 27, pp. 598-610, 2006.
-  J. Sui, G. Pearlson, A. Caprihan, et al., “Discriminating schizophrenia and bipolar disorder by fusing fMRI and DTI in a multimodal CCA+joint ICA model,” NeuroImage, vol. 57, no. 3, pp. 839-855, 2011.
-  S. M. Plis, J. Sui, T. Lane, et al., “High-order interactions observed in multi-task intrinsic networks are dominant indicators of aberrant brain function in schizophrenia,” NeuroImage, vol. 102, pp. 35-48, 2014.
-  M. S. Çetin, F. Christensen, et al., “Thalamus and posterior temporal lobe show greater inter-network connectivity at rest and across sensory paradigms in schizophrenia,” NeuroImage, vol. 97, pp. 117-126, 2014.
-  E. Castro, V. Gómez-Verdejo, M. Martínez-Ramón, et al., “A multiple kernel learning approach to perform classification of groups from complex-valued fMRI data analysis: Application to schizophrenia,” NeuroImage, vol. 87, pp. 1-17, 2014.
-  Y. Y. Lin, T. L. Liu, and C. S. Fuh, “Multiple kernel learning for dimensionality reduction,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 33, no. 6, pp. 1147-1160, 2011.
-  B. Wang, A. M. Mezlini, F. Demir, et al., “Similarity network fusion for aggregation data types on a genomic scale,” Nat. Methods, vol. 11, pp. 333-337, 2014.
-  S.-P. Deng, S. Cao, D.-S. Huang, and Y.-P. Wang, “Identifying stages of kidney renal cell carcinoma by combining gene expression and DNA methylation data,” IEEE/ACM Trans. Comput. Biol. Bioinform., vol. 14, pp. 1147-1153, 2017.
-  O. Lindenbaum, A. Yeredor, M. Salhov, and A. Averbuch, “Multiview diffusion maps,” in arXiv:1508.05550, 2015.
-  R. R. Lederman and R. Talmon, “Learning the geometry of common latent variables using alternating-diffusion,” Appl. Comput. Harmon. Anal., vol. 44, no. 3, pp. 509-536, 2018.
-  O. Katz, R. Talmon, Y.-L. Lo, and H.-T. Wu, “Alternating diffusion maps for multimodal data fusion,” Inf. Fusion, vol. 45, pp. 346-360, 2019.
-  T. Shnitzer, M. Ben-Chen, L. Guibas, R. Talmon, and H.-T. Wu, “Recovering hidden components in multimodal data with composite diffusion operators,” arXiv preprint arXiv:1808.07312 (2018).
-  R. R. Coifman and S. Lafon, “Diffusion maps,” Appl. Comput. Harmon. Anal., vol. 21, no. 1, pp. 5-30, 2006.
-  Y. Ma, P. Niyogi, G. Sapiro, and R. Vidal, “Dimensionality reduction via subspace and submanifold learning,” IEEE Signal Process. Mag., vol. 28, no. 2, 2011.
-  D. Dov, R. Talmon, and I. Cohen, “Kernel-based sensor fusion with application to audio-visual voice activity detection,” IEEE Trans. Signal Process., vol. 64, no. 24, pp. 6406-6416, 2016.
-  D. Dov, R. Talmon, and I. Cohen, “Sequential audio-visual correspondence with alternating diffusion kernels,” IEEE Trans. Signal Process., vol. 66, no. 12, pp. 3100-3111, 2018.
-  T. Shnitzer, M. Rapaport, et al., “Alternating diffusion maps for dementia severity assessment,” in Proc. IEEE ICASSP, pp. 831-835, 2017.
-  G.-R. Liu, Y.-L. Lo, Y.-C. Sheu, and H.-T. Wu, “Diffuse to fuse EEG spectra – intrinsic geometry of sleep dynamics for classification,” arXiv:1803.01710, 2018.
-  A. Qiu, A. Lee, M. Tan, et al., “Manifold learning on brain functional networks in aging,” Med. Image Anal., vol. 20, no. 1, pp. 52-60, 2015.
-  S. I. Ktena, S. Arslan, and D. Rueckert, “Gender classification and manifold learning on functional brain networks,” in BIH Symposium on Big Data Initiatives for Connectomics Research, 2015.
-  J. Young, D. Lei, and A. Mechelli, “Discriminative Log-Euclidean kernels for learning on brain networks,” in International Workshop on Connectomics in Neuroimaging, 2017.
-  J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432-441, 2008.
-  E. Amico and J. Goñi, “The quest for identifiability in human functional connectomes,” Sci. Rep., (2018) 8:8254.
-  B. Mišić, R. F. Betzel, M. A. de Reus, et al., “Network-level structure-function relationships in human neocortex,” Cerebral Cortex, vol. 26, pp. 3285-3296, 2016.
-  H. Liang and H. Wang, “Structure-function network mapping and its assessment via persistent homology,” PLoS Comput. Biol., 13.1 (2017): e1005325.
-  E. Amico and J. Goñi, “Mapping hybrid functional-structural connectivity traits in the human connectome,” Network Neuroscience, vol. 2, pp. 306-322, 2018.
-  W. Förstner and B. Moonen, “A metric for covariance matrices,” in Geodesy—The Challenge of the 3rd Millennium, pp. 144-152, 2012.
-  S. Sra, “A new metric on the manifold of kernel matrices with application to matrix geometric mean,” in Advances in Neural Information Processing Systems 25, New York, NY, USA: Springer, pp. 299-309, 2003.
-  J. Zhang, L. Zhou, L. Wang, and W. Li, “Functional brain network classification with compact representation of SICE matrices,” IEEE Trans. Biomed. Eng., vol. 62, pp. 1623-1634, 2015.
-  V. Arsigny, P. Fillard, X. Pennec, and N. Ayache, “Log-euclidean metrics for fast and simple calculus on diffusion tensors,” Magn. Reson. Med., vol. 56, pp. 411-421, 2006.
-  S. Jayasumana, R. Hartley, M. Salzmann, et al., “Kernel methods on Riemannian manifolds with Gaussian RBF kernels,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, pp. 2464-2477, 2015.
-  C. Fowlkes, S. Belongie, F. Chung, and J. Malik, “Spectral grouping using the Nyström method,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 26, pp. 214-225, 2004.
-  O. Lindenbaum, A. Yeredor, and I. Cohen, “Musical key extraction using diffusion maps,” Signal Process., vol. 117, pp. 198-207, 2015.
-  T. D. Satterthwaite, M. A. Elliott, K. Ruparel, et al., “Neuroimaging of the Philadelphia neurodevelopmental cohort,” Neuroimage, vol. 86, pp. 544-553, 2014.
-  T. D. Satterthwaite, J. J. Connolly, K. Ruparel, et al., “The Philadelphia neurodevelopmental cohort: A publicly available resource for the study of normal and abnormal brain development in youth,” Neuroimage, vol. 124, pp. 1115-1119, 2016.
-  G. S. Wilkinson and G. J. Robertson, Wide Range Achievement Test 4 (WRAT4), Lutz, FL, 2006.
-  K. G. Noble, N. Tottenham, and B. J. Casey, “Neuroscience perspectives on disparities in school readiness and cognitive achievement,” The Future of Children, vol. 15, pp. 71-89, 2005.
-  I. J. Deary, L. Penke, and W. Johnson, “The neuroscience of human intelligence differences,” Nat. Rev. Neurosci., vol. 11, pp. 201-211, 2010.
-  G. Schwarz, “Estimating the dimension of a model,” Ann. Statist., vol. 6, no. 2, pp. 461-464, 1978.
-  P. Zille, V. D. Calhoun, J. M. Stephen, et al., “Fused estimation of sparse connectivity patterns from rest fMRI. Application to comparison of children and adult brains,” IEEE Trans. Med. Imag., DOI: 10.1109/TMI.2017.2721640.
-  M. Ramezani, P. Abolmaesumi, K. Marble, et al., “Fusion analysis of functional MRI data for classification of individuals based on patterns of activation,” Brain Imaging Behav., vol. 9, no. 2, pp. 149-161, 2015.
-  H. Yang, J. Liu, J. Sui, et al., “A hybrid machine learning method for fusing fMRI and genetic data: combining both improves classification of schizophrenia,” Front. Hum. Neurosci., 2010.
-  J. D. Power, A. L. Cohen, et al., “Functional network organization of the human brain,” Neuron, vol. 72, no. 4, pp. 665-678, 2011.
-  Y. Keller, R. R. Coifman, S. Lafon, and S. W. Zucker, “Audio-visual group recognition using diffusion maps,” IEEE Trans. Signal Process., vol. 58, no. 1, pp. 403-413, 2010.
-  M. Xia, J. Wang, and Y. He, “BrainNet Viewer: A network visualization tool for human brain connectomics,” PloS one, 8.7 (2013): e68910.
-  L. Chaddock-Heyman, T. B. Weng, C. Kienzler, et al., “Scholastic performance and functional connectivity of brain networks in children,” PloS one, 13.1 (2018): e0190073.
-  M. Song, Y. Zhou, J. Li, et al., “Brain spontaneous functional connectivity and intelligence,” NeuroImage, vol. 41, pp. 1168-1176, 2008.
-  M. W. Cole, T. Ito, and T. S. Braver, “Lateral prefrontal cortex contributes to fluid intelligence through multinetwork connectivity,” Brain Connectivity, vol. 5, pp. 497-504, 2015.
-  D. Zhou and C. J. C. Burges, “Spectral clustering and transductive learning with multiple views,” in Proc. 24th Int. Conf. Mach. Learn., pp. 1159-1166, 2007.
-  X. He, D. Cai, and P. Niyogi, “Laplacian score for feature selection,” in Advances in Neural Information Processing Systems, pp. 507-514, 2006.
-  S. Dudoit and M. J. van der Laan, “Asymptotics of cross-validated risk estimation in estimator selection and performance assessment,” Stat. Meth., vol. 2, pp. 131-154, 2005.
-  A. S. Greene, S. Gao, D.Scheinost, and R. T. Constable, “Task-induced brain state manipulation improves prediction of individual traits,” Nat. Commun., 9.1 (2018): 2807.
-  S. Gao, A. S. Greene, R. T. Constable, and D.Scheinost, “Task integration for connectome-based prediction via canonical correlation analysis,” in IEEE 15th International Symposium on Biomedical Imaging, 2018.