Autism Spectrum Disorder Classification using Graph Kernels on Multidimensional Time Series
We present an approach to model time series data from resting state fMRI for autism spectrum disorder (ASD) severity classification. We propose to adopt kernel machines and employ graph kernels that define a kernel dot product between two graphs. This enables us to take advantage of spatio-temporal information to capture the dynamics of the brain network, as opposed to aggregating them in the spatial or temporal dimension. In addition to the conventional similarity graphs, we explore the use of graph using sparse coding, and the persistent homology of time delay embeddings, in the proposed pipeline for ASD classification. In our experiments on two datasets from the ABIDE collection, we demonstrate a consistent and significant advantage in using graph kernels over traditional linear or non linear kernels for a variety of time series features.
Modeling the relationships between functional or structural regions in the brain is a significant step towards understanding, diagnosing and eventually treating a gamut of neurological conditions including epilepsy, stroke, and autism. A variety of sensing mechanisms, such as functional-MRI, Electroencephalography (EEG) and Electrocorticography (ECoG), are commonly adopted to uncover patterns in both brain structure and function. In particular, the resting state fMRI [Kelly2008] has been proven effective in identifying diagnostic biomarkers for mental health conditions such as the Alzheimer disease [Chen2011] and autism [Plitt2015]. At the core of these neuropathology studies are predictive models that map the variations in brain functionality, obtained as time-series measurements in regions of interest, to suitable clinical measures. For example, the Autism Brain Imaging Data Exchange (ABIDE) is a collaborative effort [ABIDEpaper2014], which seeks to build a data-driven approach for autism diagnosis. Further, several published studies have reported that predictive models can reveal patterns in brain activity that act as effective biomarkers for classifying patients with mental illness [Plitt2015].
Figure 1 illustrates a generic pipeline used in these studies. Given the rest state fMRI measurements, the functional connectivities between the different regions of the brain can be estimated. Though the network can be constructed using the individual voxels, it is common practice to extract regions of interest (ROI) based on pre-defined atlases or the correlation structure in the data [Calhoun2001]. In addition to making the analysis more interpretable, this process enables dimensionality reduction by allowing the use of a single representative time-series for each region [Behzadi2007]. Building predictive models requires the use of appropriate features for each subject, whose brain activity is represented as a multivariate time series. While exploiting the statistics of features, e.g. covariance structure of the multivariate time series data, is critical to building effective models, it can be highly beneficial to utilize other non-imaging characteristics shared across subjects from a larger population.
Graphs are a natural representation to encode the relationships in a population. In addition to revealing the correlations in the imaging features (e.g. fMRI), the graphs could include a wide range of non-imaging features based on more general characteristics of the subjects, for example geographical, socio-cultural or gender. The advances in graph signal processing and the generalization of complex machine learning techniques, such as deep neural networks, to arbitrarily structured data make graphs an attractive solution. Despite the availability of such tools, the choice of graph construction, with subjects as nodes and their similarities as edges , is crucial to the success of this pipeline. Existing approaches construct a neighborhood graph based on the imaging features and then remove edges based on other criteria such as gender [Parisot2017Spectral]. However, as we show in our empirical studies with the ABIDE dataset, these hybrid graphs do not provide significant improvements to the prediction performance over baseline methods, such as kernel machines, based only on the imaging features.
Contributions: In this paper, we propose a new approach to predictive modeling, which relies on generating an ensemble of population graphs, utilizing graph convolutional neural networks (G-CNNs) ([Defferrard2016GCNN, Kipf2016GCNN]) as our predictive model, and employing a consensus strategy to obtain inference. First, using a bootstrapping approach to design graph ensembles allows our predictive model to better explore connections between subjects in a large population graph that are not captured by simple heuristics. Second, graph CNNs provide a powerful computing framework to make inferences on graphs, by treating the subject-specific image features as a function, , defined at the nodes of the population graph. In addition to existing population graph construction strategies, we study the use of graph kernel similarities obtained by treating the measurements for each subject as a graph representation. Note, the latter is a subject-specific graph and can effectively model the spatio-temporal statistics of the brain activity. Our results show that the proposed bootstrapped G-CNN approach achieves the state-of-the-art performance in ASD classification and more interestingly, the graph ensemble strategy improves the prediction accuracies of all population graph construction approaches. In addition, the proposed bootstrapping reduces the sensitivity of the prediction performance to the graph construction step. Consequently, even non-experts can design simpler graphs, which, with bootstrapping, can perform on par with more sophisticated graph construction strategies.
In this section, we provide a brief overview of some of the related work for graph convolutional networks
Iii Proposed Approach
Figure 2 illustrates an overview of the proposed approach for predictive modeling in ASD classification. As it can be observed, the pipeline requires an initial population graph and the features at each node (i.e. subject) in the graph as inputs. Subsequently, we create an ensemble of randomized graph realizations and invoke the training of a graph CNN model for every realization. The output layer of these neural networks implement the softmax function, which computes the probabilities for class association of each node. Finally, the consensus module fuses the decisions from the ensemble to obtain the final class label.
In this section, we begin describing the proposed approach for predictive modeling to classify subjects with autism. More specifically, we describe the feature extraction procedure and the different strategies adopted for constructing population graphs. In the next section, we will present the predictive modeling algorithm, based on graph CNNs, that incorporates both the extracted features and information from the population graph.
Iii-a Feature Extraction: The Connectivity Matrix
Feature design has become an integral part of advanced machine learning systems. A good feature representation is characterized by its ability to describe the variabilities in the dataset, while being concise and preferably low-dimensional. It has been recently shown in [Abraham2017NeuroImage] that the connectivity matrix can be reliably estimated from the resting state fMRI data as the covariance matrix obtained using the Ledoit-Wolf shrinkage estimator. Denoting the total number of regions of interest for each subject as , the resulting covariance matrix captures the relationships between the time-series measurements from the different ROIs. As shown in [Abraham2017NeuroImage], these features are informative and can be directly used for classification by using the vectorized upper triangular part of the covariance matrix as the feature for each subject. Though more sophisticated feature learning strategies could be employed, we observed in our experiments that the connectivity matrix produces the best performance. Consequently, we adopt that feature representation in our approach. Since the resulting feature vector, commonly referred as the cursed representation, is high-dimensional, we perform dimensionality reduction on the features before using them to train a classifier.
Iii-B Population Graph Construction
Though the classifier can be directly trained using the extracted features, such an approach can fail to incorporate non-imaging/sensing information that can be critical to discriminate between different classes. For example, it is likely that there is discrepancy in some aspects of data collection at different sites, or the gender of the subject is important in generalizing autism spectrum disorder classifiers. It is non-trivial to directly incorporate such information into the subject features, but a graph can be a very intuitive way to introduce these relationships into the learning process. Graphs are natural data structures to model data in high dimensional spaces, where nodes represent the subjects and the edges describe the relations between them. This can be particularly effective while studying larger populations for traits of interest, because the graph can encode information that is different from the imaging features estimated for each subject independently. Furthermore, this can provide additional context to the machine learning algorithm used for prediction, thereby avoiding overfitting in cases of limited training data. It is important to distinguish the population graph construction process from graphs that are used to analyze the brain activity of each subject, i.e. connectivity matrix of statistical dependencies between different ROIs.
Common data processing tasks such as filtering and localized transforms of signals do not directly generalize to irregular domains such as signals defined on graphs. Consequently, several graph signal processing tools have been developed to generalize ideas from Euclidean domain analysis. In particular, spectral filtering and wavelet analysis have become popular graph signal processing techniques in several applications [shuman2013emerging]. These ideas have been further extended to build convolutional neural networks directly in the graph domain, which take advantage of the fact that convolutions are multiplications in the spectral domain ([Defferrard2016GCNN, Kipf2016GCNN]). In this paper, we propose to use graph convolutional neural networks (G-CNNs) to address the task of ASD classification using a population graph, with each node being characterized by the extracted features.
An inherent challenge with such an analysis is that the results obtained are directly dictated by the weighted graph defined for the analysis. Consequently, designing suitable weighted graphs that capture the geometric structure of data is essential for meaningful analysis. In our context, the population graph construction determines how two subjects are connected, so that context information could be shared between them. In addition to approaches that employ simple heuristics (e.g. distances between features) or domain-specific characteristics (e.g. gender), we also investigate the use of a more sophisticated graph construction strategy, aimed at exploiting the spatio-temporal statistics of each subject. Here is the list of graph construction mechanisms adopted in this paper:
Site and Gender: Previous studies have showed that the geographical site information and the gender of the subject are important to build generalizable models and hence we use them to define the similarity graph between subjects. If two subjects have the same gender, they are given a score of , and if they are not. Similarity, the subjects were given a score of , if they were processed at the same site, and if not.
Linear Kernel Feature Graph: The edge weights of the graph are computed as the Euclidean dot product or a linear kernel, between connectivity features from two different ROIs, for a given subject. As expected, this graph does not provide any additional information because it is directly based on the features that were defined at the nodes. However, the authors in [Parisot2017Spectral] showed that this can be combined with the gender and site graphs to build a new graph with additional information for improved prediction.
Graph Kernel on the Connectivity Matrix: This graph is constructed by interpreting the connectivity matrix as an adjacency matrix of a graph. We refer to this as the subject graph. This has favorable properties in that it can model the spatio-temporal statistics between ROIs explicitly as opposed to vectorizing them. This is followed by defining a graph kernel such as Weisfeler-Lehman ([shervashidze2011weisfeiler]) on pairs of subject graphs, resulting in a kernel matrix that can be used for classification or prediction [jie2014topological]. In contrast, we interpret the resulting similarity kernel matrix as the population graph, after multiplying them edge weights from the gender and site graphs.
Iv Similarity Measures for Graph Design
In this section we briefly introduce convolutional neural networks on graphs, and present our bootstrapped training strategy using an ensemble of randomized population graphs.
Iv-a Graph Convolutional Neural Networks
Convolutional neural networks enable extraction of statistical features from structured data, in the form of local stationary patterns, and their aggregation for different semantic analysis tasks, e.g. image recognition or activity analysis. When the signal of interest does not lie on a regular domain, for example graphs, generalizing CNNs is particuarly challenging due to the presence of convolution and pooling operators, typically defined on regular grids.
In graph signal processing theory, this challenge is alleviated by switching to the spectral domain, where the convolution operations can be viewed as simple multiplications. In general, there exists no mathematical definition for the translation operation on graphs. However, a spectral domain approach defines the localization operator on graphs via convolution with a Kronecker delta signal. However, localizing a filter in the spectral domain requires the computation of the graph Fourier transform and hence translations on graphs are computationally expensive. Before we describe the graph CNN architecture used in our approach, we define the Fourier transform for signals on graphs.
Preliminaries: Formally, an undirected weighted graph is represented by , where denotes the set of vertices or nodes, denotes the set of edges and is the adjacency matrix that specifies the weights between edges . The fundamental component of graph analysis is the normalized graph Laplacian , which is defined as , where is the degree matrix and denotes the identity matrix. The set of eigenvectors of the Laplacian are referred to as the graph Fourier basis, , and hence the Fourier transform of a signal is defined as . Finally, the spectral filtering operation can be defined as , where is the parametric filter.
Formulation: Since spectral filtering is computationally prohibitive as the number of nodes in the graph increases, [Hammond2011] proposed to approximate as a truncated expansion in terms of the Chebyshev polynomials recursively.
where is the set of Chebyshev coefficients, is the order for the approximation, are rescaled eigenvalues and the Chebyshev polyomials are recursively defined as . Using this approximation in the spectral filtering operation implies that the filtering is localized, i.e., the filtering depends only on the neighborhood nodes. We can use this K-localized filtering to define convolutional neural networks on graphs.
For example, the approach in [Kipf2016GCNN] uses such that the layer-wise computation is linear w.r.t . Though, this is a crude approximation, by stacking multiple layers one can still recover a large class of complex filter functions, that are not just limited to the linear functions supported by the first order Chebyshev approximation. The resulting filtering can then be expressed as
Here, indicates the convolution operation in the spatial domain. Note that, the filter parameters are shared over the whole graph. In summary, the complete processing in a layer is defined as follows: Given an input function , where is the number of nodes, and is the dimension of the multivariate function at each node, the activations can be computed as
Here is the set of filter parameters, the graph Laplacian is reparameterized as . The activation function is chosen to be ReLU in our implementation. This linear formulation of graph convolutional networks is much more computationally efficient compared to the accurate filtering implementation.
Iv-B Ensemble Learning
As described in the previous section, population graphs provide a convenient way to incorporate non-imaging features into the predictive modeling framework, where the connectivity features from fMRI data are used as a function on the graph. While G-CNN can automatically infer spectral filters for achieving discrimination across different classes, the sensitivity of its performance to the choice of the population graph is not straightforward to understand. Consequently, debugging and fine-tuning these networks can be quite challenging.
The conventional approach to building robust predictive models is to infer an ensemble of weak learners from data and then fuse the decisions using a consensus strategy. The need for ensemble models in supervised learning has been well-studied [Dietterich2000]. A variety of bootstrapping techniques have been developed, wherein multiple weak models are inferred using different random subsets of data. The intuition behind the success of this approach is statistical, whereby different models may have a similar training error when learned using a subset of training samples, but their performance on test data can be different since they optimize for different regions of the input data space.
We employ a similar intuition to building models with population graphs, wherein the quality of different similarity metrics for graph construction can lead to vastly different predictive models. More specifically, starting with a population graph, we create an ensemble of graphs, , by dropping out a pre-defined factor of edges randomly. In this paper, we use a uniform random distribution for the dropout, though more sophisticated weighted distributions could be used. For each of the graphs in the ensemble, we build a G-CNN model (details in Section V) with the connectivity features as the dimensional multivariate function at each node. The output of each of the networks is a softmax function for each node, indicating the probability for the subject to be affected by ASD. Note that, unlike conventional ensemble learners, we do not subsample data, but only subsample edges of the population graph. Given the decisions from all the weak learners, we employ simple consensus strategies such as averaging or obtaining the maximum of the probabilities estimated by each of the G-CNNs for a test subject. As we will show in our experimental results with the ABIDE dataset, the proposed ensemble approach boosts the performance of all the population graph construction strategies discussed in Section III.
In this section, we describe our experiments to evaluate the performance of the proposed ensemble approach in ASD classification, for different population graphs. For comparison, we report the state-of-the-art results obtained using the methods in [Abraham2017NeuroImage] and [Parisot2017Spectral].
V-a The ABIDE dataset
We present our results on the Autism Brain Imaging Data Exchange (ABIDE) dataset [ABIDEpaper2014] that contains resting state fMRI data (rs-fMRI) for 1112 patients, as part of the preprocessed connectomes project
The resulting data per subject consists of the mean time series obtained from the rs-fMRI for each Region of Interest (ROI). In total, there are 111 ROIs for the HO atlas considered here, resulting in the data, for the subject, of size matrix, where is the total number of slices in the fMRI measurement. We use the same train/test splits as [Abraham2017NeuroImage] and the 10-fold split of the ABIDE dataset, resulted in patients used for training and the remaining patients for testing. Instead of reporting the average performance, we report the accuracies for the best performing and the worst performing splits chosen w.r.t the baseline described in the next subsection (i.e split 1, split 8), since this provides a much better understanding of the performance variabilities.
Baseline: First, for each patient we estimate the connectivity matrix between the ROIs, as the covariance of the multivariate time-series data estimated using the Ledoit-Wolf shrinkage estimator. In order to establish a baseline, we construct the connectivity feature by vectorizing the upper triangular part of the covariance matrix and performing dimensionality reduction (D) on the vectorized feature. The connectivity features of the subjects are then fed into a linear SVM [Abraham2017NeuroImage] and a kernel SVM with the radial basis function kernel.
Graph Convolutional Neural Network (G-CNN): We use the G-CNN implementation from [Kipf2016GCNN], particularly the Chebychev polynomial approximation for the Fourier basis as described in [Defferrard2016GCNN]. We constructed a fully convolutional network, with 6 layers of 64 neurons each, a learning rate of 0.005, and dropout of 0.1/0.2 depending on the graph. We used Chebychev polynomials upto degree 3 to approximate the Graph Fourier Transform in the G-CNN in all our experiments. The G-CNN requires a suitable graph , features defined for each node, and the classification label for the binary classification task. For a given graph, the best results were obtained typically in around epochs, similar numbers have also been reported in [Parisot2017Spectral].
Ensemble G-CNN: For a given graph , we generate new graphs such that the new set of graphs were obtained by randomly dropping of the edges of . The G-CNN for each graph was trained for 100 epochs, with a dropout of 0.1. The predictions from each model obtained using the ensemble are fused using two simple strategies - (a) avg, where we take the mean of the predictions and assign the class as the one with the largest value, (b) max, where we take the most confident value for each class and assign the label which has a higher value. In general, we found that a G-CNN for the ABIDE dataset, that is particularly sensitive, tends to overfit very easily making it extremely challenging to tune the hyperparameters. However, introducing randomness in the graphs significantly improved robustness and led to improved training, behaving a lot like dropout inside traditional neural networks. This behavior can be possibly attributed to the fact that the dropout in the G-CNN tends to ignore random edges in the hidden layers of the neural network, while we are proposing to use multiple randomized graphs in the input layer. In general, the random ensembles behave as weak learners, where the performance for each individual network is suboptimal, but the consensus decision is better than the state-of-the-art.
|\cellcolorgray!15Population Graph||\cellcolorgray!15Predictive Model||\cellcolorgray!15Best Split (%)||\cellcolorgray!15Worst Split(%)|
|-||Linear SVM (baseline)||66.28||65.14|
|-||Kernel SVM (baseline)||69.14||62.29|
|Ensemble G-CNN (avg)||67.42||68.00|
|Ensemble G-CNN (max)||68.57||67.42|
|Ensemble G-CNN (avg)||69.14||65.14|
|Ensemble G-CNN (max)||69.14||66.28|
|Ensemble G-CNN (avg)||69.14||67.42|
|Ensemble G-CNN (max)||70.28||66.28|
V-C Classification Results
The classification accuracy for different graphs and their corresponding random ensembles are shown in table I. A few observations are important to note – There is a clear and obvious advantage in using the graph based approach to classifying populations. This is especially true in datasets where the features alone are not sufficient, but require contextual information, which a population graph can provide. Secondly, the difference in performance for different graphs illustrates the importance of graph construction, further emphasizing the need for robust training strategies such as those proposed in this paper. It is also seen that using a random graph ensemble significantly improves performance by up to higher than the state-of-the-art using only features as in [Abraham2017NeuroImage]. We also show improvements over using the linear kernel graph proposed in a recent manuscript ([Parisot2017Spectral]). Finally, it can be observed that the best performing split has a consistently high performance across different kinds of graphs and only marginally better than the baseline, perhaps indicating that the connectivity features dictate the performance in that case. In contrast, the worst performing split contains nearly lower performance without using the graph information. We observe that the largest gains in performance are in this split, indicating that the proposed ensemble strategy is able to recover important information for classification, which can be missed by heuristic graph construction. Another interesting observation is that a simple domain-specific graph such as the , can perform as well as any state-of-the-art method, when couple with the proposed ensemble strategy.
In this paper, we studied the advantages of using graph kernels for autism spectrum disorder classification. We performed experiments with the ABIDE dataset, consisting of multidimensional time series which is obtained from resting state fMRI. We showed that the graph kernel is an effective strategy across different kinds of graphs constructed using similarity measures on time series such as – correlation, graph, persistent homology. The graph kernel with a temporal feature such as persistent homology, combines both spatial and temporal information to effectively model the dynamics of a brain network.