Improving Classification Rate of Schizophrenia Using a Multimodal Multi-Layer Perceptron Model with Structural and Functional MRI
The wide variety of brain imaging technologies allows us to exploit information inherent to different data modalities. The richness of multimodal datasets may increase predictive power and reveal latent variables that would not otherwise have been found. However, the analysis of multimodal data is often conducted under the assumption of linear interactions, which can reduce the accuracy of the results.
We propose a multimodal multi-layer perceptron model to enhance the predictive power of structural and functional magnetic resonance imaging (sMRI and fMRI) combined. We also use a synthetic data generator to pre-train the input layers of each modality, alleviating the effects of the small sample sizes that are common in brain imaging studies.
The proposed model improved the average and uncertainty of the area under the ROC curve to 0.850 ± 0.051 compared to the best results on individual modalities (0.741 ± 0.075 for sMRI, and 0.833 ± 0.050 for fMRI).
Alvaro E. Ulloa, Sergey Plis, Vince D. Calhoun email@example.com, [splis,vcalhoun]@mrn.org Department of Electrical and Computer Engineering, The University of New Mexico, NM, USA; The Mind Research Network, NM, USA
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
Recent advances in detecting and monitoring biomedical signals make simultaneous investigation of multiple aspects of human physiology and their interactions more practical and efficient. This is particularly true for studies on brain imaging, where the functional, structural, and chemical composition of the brain can be measured.
Brain imaging is often used for the evaluation and study of mental disorders, mainly because it measures the brain directly. Schizophrenia is among the most prevalent mental disorders, affecting about 1% of the population worldwide [1]. Schizophrenia is a devastating disease that alters normal behavior and may provoke hallucinations. Depending on its severity, schizophrenia can impair an individual and significantly degrade their quality of life, resulting in a social burden. The need to better diagnose and treat this disorder motivates the study of schizophrenia at the behavioral and biological levels.
In this paper, we focus on assessing the predictive power of two popular brain imaging modalities: structural magnetic resonance imaging (sMRI) and functional MRI (fMRI). sMRI is a non-invasive technique for measuring gray matter concentration (GMC), which provides direct anatomical information about brain structure in the form of a three-dimensional image. On the other hand, fMRI uses blood-oxygen-level dependent contrast to estimate blood flow as an indirect measure of neuronal activity in the brain over time, which results in a four-dimensional image.
The literature presents overwhelming evidence that GMC abnormalities, captured by sMRI, among individuals diagnosed with schizophrenia yield meaningful information for clinical evaluation [2, 3, 4]. Likewise, fMRI has also proven informative for mental illness discrimination [5, 6, 7]. Therefore, we hypothesize that, given that each modality is informative in a complementary fashion, the combination of two or more modalities may increase the accuracy of mental illness prediction.
While the richness of the information embedded in brain data provides great promise for unveiling hidden features and knowledge [8], it also raises a pressing question on how to systematically analyze the data to maximize the benefits [9, 10]. Some of the challenges inherent to multimodal data analysis are:
Different data modalities are often incompatible in dimensionality and inherent physical properties of the data. For example, sMRI is a three-dimensional image while fMRI is a sequence of three-dimensional images. This makes the data incompatible with classical data analysis techniques such as correlation or linear regression.
Brain imaging data is high-dimensional and scarce in samples. For instance, brain structure measured with sMRI can present 300,000 or more intracranial voxels, while the number of samples (subjects) collected ranges from fewer than 100 to 2,800 [11].
Most current data mining methods for multimodal analysis assume linear interaction between modalities. However, the independent nature of each modality weakens this assumption and compromises the reliability of the results.
In order to address the first problem, we first segment the sMRI image to produce a gray matter map and summarize the four-dimensional fMRI image as a three-dimensional image using amplitude of low frequency fluctuations (ALFF) maps [12]. This is thus a feature-based fusion approach [13]. Even though the interpretability of ALFF maps is not as direct as that of sMRI, ALFF has proven informative and useful for the analysis of brain behavior.
We use synthetic data generators [14] to address the second problem. The generators feed machine learning classifiers in an online fashion. This allows the model to learn from a large number of synthetic samples and enhances the model's predictive power. We then propose a multimodal multi-layer perceptron (MLP) model pre-trained at the input layer with an independent MLP on synthetic data. The MLP model also addresses the third challenge because of the non-linear nature of MLPs.
To the best of our knowledge, we present the first study on multimodal classification with an MLP, pre-trained with synthetic data, and applied to sMRI and fMRI modalities to predict schizophrenia diagnosis.
2 Materials and Methods
In this section, we present the dataset used for our experiments as well as a detailed description of the MLP architecture, including the synthetic data generators, used for prediction of schizophrenia diagnosis using the sMRI and fMRI contained in the dataset.
The dataset consists of data collected from multiple sites including the University of California Irvine, the University of California Los Angeles, the University of California San Francisco, Duke University, University of North Carolina, University of New Mexico, University of Iowa, and University of Minnesota Institutional Review Boards.
Both the sMRI and fMRI data collected belong to the Function Biomedical Informatics Research Network (fBIRN) dataset as described in [15, 16]. After purging subjects without both sMRI and fMRI data, the fBIRN dataset includes 135 schizophrenia patients (including schizophrenia and schizoaffective disorder) and 169 healthy controls.
As described in , the schizophrenia patients and healthy controls, recruited from the eight sites that participated in the study, were matched as much as possible for age, sex, handedness, and race distributions. Each individual was diagnosed following the Structured Clinical Interview for DSM-IV-TR Axis I Disorders (SCID-I/P) [17]. All patients were clinically stable on anti-psychotic medication for at least 2 months, and had an illness duration of more than 1 year.
2.1.2 MRI parameters
The sMRI data was set to a slice thickness of 1.2 mm, sagittal slice orientation, and re-sliced to mm. The latter was not a requirement but set for convenience. SPM5 was used to segment the brain into white matter (WM), gray matter (GM), and cerebrospinal fluid with unmodulated normalized parameters.
The imaging protocol for the fMRI scans at all sites was a T2*-weighted AC-PC aligned echo planar imaging sequence (TR/TE 2s/30ms, flip angle 77 degrees, 32 slices collected sequentially from superior to inferior, mm with 1 mm gap, 162 frames, 5:38 min). For the resting scan, subjects were instructed to lie still with eyes closed.
2.1.3 Quality control
For quality control, each sMRI volume was correlated with all others to compute the mean correlation as a quality metric. Images with a quality metric 2 standard deviations below the mean were categorized as noisy. Nine images were discarded based on this analysis, yielding a final dataset composed of 290 images.
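The quality metric described above can be sketched as follows. This is a minimal illustration on toy data, not the authors' pipeline: the function name `flag_noisy_volumes`, the flattened `(n_subjects, n_voxels)` layout, and the use of Pearson correlation are assumptions made for the example.

```python
import numpy as np

def flag_noisy_volumes(volumes, n_std=2.0):
    """Return (quality, noisy_mask) for a (n_subjects, n_voxels) array.

    quality[i] is the mean Pearson correlation of volume i with all other
    volumes; a volume is flagged when its quality is more than n_std
    standard deviations below the mean quality.
    """
    corr = np.corrcoef(volumes)                  # (n_subjects, n_subjects)
    n = corr.shape[0]
    # Exclude the diagonal (self-correlation of 1) from each row's mean.
    quality = (corr.sum(axis=1) - 1.0) / (n - 1)
    threshold = quality.mean() - n_std * quality.std()
    return quality, quality < threshold

# Toy data (assumption): 40 subjects sharing a common template, one corrupted.
rng = np.random.default_rng(0)
template = rng.normal(size=500)
volumes = template + 0.5 * rng.normal(size=(40, 500))
volumes[7] = rng.normal(size=500)                # unrelated, "noisy" volume
quality, noisy = flag_noisy_volumes(volumes)
```

In this toy setting only the corrupted volume falls below the threshold, since its mean correlation with the rest is close to zero.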
2.2 Multi-layer perceptron
A multilayer perceptron (MLP) is a feed-forward neural network model that projects a set of inputs through a set of non-linear operations. The model is trained by reducing the binary cross-entropy between the true and estimated labels. We used the AdaGrad [18] learning algorithm to optimize the cost function.
We designed three MLPs: two for unimodal pre-training with the synthetic data generator, and one multimodal MLP that combines both modalities by concatenation of hidden units.
The unimodal MLP is designed with 3 layers, where aside from the input layer we set 20 hidden nodes at each other layer, sigmoid activation functions, 50% dropout, and regularization with a weight of 0.1 at the input layer and 0.01 for the other layers.
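As a rough illustration of the training objective, the sketch below minimizes binary cross-entropy with an AdaGrad-style update on a single sigmoid unit. The toy data, the learning rate, and the helper names (`binary_cross_entropy`, `adagrad_step`) are assumptions for the example. The update rule is the commonly used AdaGrad form, which may differ in detail from the implementation used in the experiments.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y, p, eps=1e-12):
    """Mean binary cross-entropy between labels y and probabilities p."""
    p = np.clip(p, eps, 1.0 - eps)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad update: per-parameter step scaled by accumulated squared gradients."""
    accum = accum + grad ** 2
    w = w - lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy, linearly separable problem (assumption, for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_w = np.array([1.5, -2.0, 0.0, 0.5, 1.0])
y = (X @ true_w > 0).astype(float)

w = np.zeros(5)
accum = np.zeros(5)
initial_loss = binary_cross_entropy(y, sigmoid(X @ w))
for _ in range(200):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / len(y)    # gradient of the mean BCE w.r.t. w
    w, accum = adagrad_step(w, grad, accum)
final_loss = binary_cross_entropy(y, sigmoid(X @ w))
```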
The multimodal MLP is designed with 3 layers for each modality, a merging layer that concatenates hidden units on the fourth layer, and 2 merged layers for final output. Before concatenation all input layers have 20 hidden units, and after concatenation it has 40, 20, and 1 for the output. The regularization weight is set to 0.1 at the input layers and 0.01 for the rest.
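One possible reading of this architecture is sketched below as a NumPy forward pass: each modality branch has two sigmoid layers of 20 units, the branch outputs are concatenated into 40 units, and two merged layers reduce them to 20 units and then to a single output. The layer counts follow the description above, but the exact wiring, the initialization scale, and all function names are assumptions. Dropout and regularization are omitted for brevity.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init_dense(rng, n_in, n_out):
    return {"W": rng.normal(scale=0.01, size=(n_in, n_out)), "b": np.zeros(n_out)}

def dense(layer, x):
    return sigmoid(x @ layer["W"] + layer["b"])

def init_multimodal_mlp(rng, n_smri, n_fmri, n_hidden=20):
    return {
        "smri": [init_dense(rng, n_smri, n_hidden), init_dense(rng, n_hidden, n_hidden)],
        "fmri": [init_dense(rng, n_fmri, n_hidden), init_dense(rng, n_hidden, n_hidden)],
        "merged": [init_dense(rng, 2 * n_hidden, n_hidden), init_dense(rng, n_hidden, 1)],
    }

def forward(params, x_smri, x_fmri):
    h_s, h_f = x_smri, x_fmri
    for layer in params["smri"]:
        h_s = dense(layer, h_s)
    for layer in params["fmri"]:
        h_f = dense(layer, h_f)
    h = np.concatenate([h_s, h_f], axis=1)    # 20 + 20 = 40 merged units
    for layer in params["merged"]:
        h = dense(layer, h)
    return h[:, 0]                            # estimated probability per subject

# Toy inputs (assumption): 4 subjects, 300 features per modality.
rng = np.random.default_rng(0)
params = init_multimodal_mlp(rng, n_smri=300, n_fmri=300)
x_smri = rng.normal(size=(4, 300))
x_fmri = rng.normal(size=(4, 300))
probabilities = forward(params, x_smri, x_fmri)
```

Under this layout, pre-trained unimodal weights would be copied into the first entry of `params["smri"]` and `params["fmri"]` before multimodal training starts.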
2.3 Synthetic data generator
The synthetic data generator is set to produce synthetic sMRI and ALFF maps. It starts by fitting unlabeled data using independent component analysis (ICA), then it passes the ICA parameters to a random variable (RV) generator that imitates the parameters to generate more samples with the same statistical properties. In our case, labeled data is given to the RV generator to capture two sets of statistical parameters, one for healthy controls and the other for schizophrenia patients. The following sections present a brief description of ICA, describe two RV generators, and present the final method.
2.3.1 Independent Component Analysis
ICA is a matrix factorization technique in which an observed data matrix $X$ is factored as
$$X_{M \times V} = A_{M \times C} \, S_{C \times V},$$
where $A$ is the mixing matrix, $S$ the source matrix, $M$ the number of samples, $V$ the number of variables, and $C$ the number of sources. When the number of sources, $C$, is unknown, as in most real-world problems, it can be estimated using the criterion defined in [19]. This matrix factorization is possible given that the sources in $S$ are mutually independent and non-Gaussian.
When factorizing structural MRI and ALFF maps, data is organized into a subject $\times$ voxel matrix. The mixing matrix $A$ then represents the subjects' loading patterns, i.e., how each source is weighted across subjects, and the rows of $S$ represent the sources, which are weighted patterns of voxels.
2.3.2 RV generator: Rejection sampling
Rejection sampling is a well-known technique for RV generation which samples from complex probability distribution functions (PDF). However, it is only defined for one-dimensional RVs and requires a PDF in closed form. We use the classical rejection sampling method on the marginal distributions of the multivariate RVs, without prior knowledge of the RV PDF.
Let $G(p, n)$ be a random variable (RV) generator function that receives as input a probability density function $p$ and the number of samples to generate, $n$. The generated samples are randomly drawn from the input PDF, $p$.
First, the generator function samples two RVs, $x \sim U(x_{\min}, x_{\max})$ and $u \sim U(0, 1)$, where $U$ denotes the PDF of the uniform distribution and $x_{\min}$, $x_{\max}$ denote the minimum and maximum observed sample. The method then accepts $x$ as a sample from $p$ given that $u < p(x)$. This procedure is repeated until the desired number of samples is obtained.
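A minimal sketch of this procedure for a single marginal is given below. It assumes the target PDF is estimated with a normalized histogram of the observed samples, that candidates are drawn uniformly between the minimum and maximum observation, and that the acceptance variable is drawn from the unit interval. The function name `rejection_sample` and the number of bins are illustrative choices.

```python
import numpy as np

def rejection_sample(observed, n_samples, n_bins=20, rng=None):
    """Draw n_samples from the histogram-estimated PDF of 1-D `observed` data."""
    rng = np.random.default_rng() if rng is None else rng
    counts, edges = np.histogram(observed, bins=n_bins)
    p = counts / counts.sum()             # normalized histogram, every value <= 1
    lo, hi = edges[0], edges[-1]          # minimum and maximum observed sample
    accepted = []
    while len(accepted) < n_samples:
        x = rng.uniform(lo, hi)           # candidate sample
        u = rng.uniform(0.0, 1.0)         # acceptance variable
        # Index of the histogram bin that contains x.
        idx = min(np.searchsorted(edges, x, side="right") - 1, n_bins - 1)
        if u < p[idx]:
            accepted.append(x)
    return np.array(accepted)

# Toy marginal (assumption): 1000 observations centred at 3.
rng = np.random.default_rng(0)
observed = rng.normal(loc=3.0, scale=1.0, size=1000)
synthetic = rejection_sample(observed, n_samples=300, rng=rng)
```

With a unit-interval acceptance variable, the expected acceptance rate is one over the number of bins. Drawing the acceptance variable up to the largest histogram value instead would accept more candidates without changing the target distribution.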
2.3.3 RV generator: Multivariate Normal
We use the sample mean and sample covariance matrix of the data matrix as input to this RV generator. Then, we use a spectral decomposition approach for generating multivariate random normal samples. In contrast to the rejection sampling generator, this approach accounts for the correlation structure among the RVs; however, it loses generality for the marginal distributions.
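The spectral decomposition approach can be sketched as follows: the sample covariance is decomposed into eigenvectors and eigenvalues, and standard normal draws are scaled and rotated accordingly. The function name `mvn_sample_spectral` and the toy data are assumptions for illustration.

```python
import numpy as np

def mvn_sample_spectral(data, n_samples, rng=None):
    """Sample from N(mean, cov) fitted to the rows of `data` via eigendecomposition."""
    rng = np.random.default_rng() if rng is None else rng
    mean = data.mean(axis=0)
    cov = np.cov(data, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)       # cov = V diag(w) V^T
    eigvals = np.clip(eigvals, 0.0, None)        # guard against tiny negative values
    transform = eigvecs * np.sqrt(eigvals)       # V diag(sqrt(w))
    z = rng.standard_normal(size=(n_samples, data.shape[1]))
    return mean + z @ transform.T                # rows have covariance V diag(w) V^T

# Toy correlated data (assumption): two variables with correlation 0.8.
rng = np.random.default_rng(0)
data = rng.multivariate_normal([0.0, 1.0], [[1.0, 0.8], [0.8, 1.0]], size=500)
samples = mvn_sample_spectral(data, n_samples=5000, rng=rng)
```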
The data generator is based on two assumptions:
The estimation of ICA sources from the observed data is a good approximation of the true sources.
A group of individuals with a common diagnosis shares statistical properties that are reflected in their loading coefficients (the mixing matrix $A$).
The data generator fits unlabeled data with ICA, from which it passes the estimated mixing matrix to an RV generator of choice. Then, given the labels, the generator captures observed statistical parameters for each label group and generates mixing parameters with the same observed statistical properties. The new mixing parameters are then used to reconstruct synthetic images using the sources estimated from unlabeled data. A more detailed description follows.
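Given the listed assumptions are met, our generator first factors the observed dataset $X$ into the mixing matrix $A$ and the source matrix $S$ as described in sec. 2.3.1. Then, it splits $A$ into sub-matrices $A_{hc}$ and $A_{sz}$, which represent healthy controls and schizophrenia patients respectively. Next, the method feeds each matrix to an RV generator of choice, as described in sections 2.3.2 or 2.3.3.
In the case of using the rejection sampling method, we estimate the probability density functions (PDF) of each column of $A_{hc}$ and $A_{sz}$ as follows
$$p_{hc}^{(i)} = \mathrm{hist}\big(A_{hc}^{(i)}\big), \qquad p_{sz}^{(i)} = \mathrm{hist}\big(A_{sz}^{(i)}\big),$$
where $i$ indicates the $i$-th column of the matrix, and the function $\mathrm{hist}$ denotes an $N$-bin normalized histogram. The algorithm then proceeds to input $p_{hc}^{(i)}$ and $p_{sz}^{(i)}$ to the rejection sampling RV generator as follows
$$\hat{A}_{hc}^{(i)} = G\big(p_{hc}^{(i)}, n\big), \qquad \hat{A}_{sz}^{(i)} = G\big(p_{sz}^{(i)}, n\big),$$
where $G$ is the RV generator defined in 2.3.2, $n$ is the number of samples to generate, and $\hat{A}$ denotes a synthetic mixing matrix.
In the case of the multivariate normal sampling method, we simply estimate the mean and covariance matrix of $A_{hc}$ and $A_{sz}$ and generate samples using the estimated parameters,
$$\hat{A}_{hc} \sim \mathcal{N}(\mu_{hc}, \Sigma_{hc}) \quad \text{and} \quad \hat{A}_{sz} \sim \mathcal{N}(\mu_{sz}, \Sigma_{sz}).$$
Finally, we reconstruct images for each diagnosis group by
$$\hat{X}_{hc} = \hat{A}_{hc} S + \bar{x}, \qquad \hat{X}_{sz} = \hat{A}_{sz} S + \bar{x},$$
where $\bar{x}$ is the voxel mean computed at the beginning of the method, and $\hat{X}$ is the resulting simulated image.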
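The generation and reconstruction steps can be sketched end to end as below, using the multivariate normal option. The sketch assumes the ICA step has already produced a loading matrix and a source matrix; here both are simulated toy arrays, and the names `generate_synthetic_images`, `A`, `S`, and `voxel_mean` are illustrative rather than taken from the authors' code.

```python
import numpy as np

def generate_synthetic_images(A, labels, S, voxel_mean, n_per_group, rng=None):
    """Sample new loadings per diagnosis group and reconstruct synthetic images.

    A: (n_subjects, n_sources) loadings, S: (n_sources, n_voxels) sources,
    labels: (n_subjects,) group label per subject, voxel_mean: (n_voxels,).
    """
    rng = np.random.default_rng() if rng is None else rng
    synthetic = {}
    for group in np.unique(labels):
        A_group = A[labels == group]
        mu = A_group.mean(axis=0)
        sigma = np.cov(A_group, rowvar=False)
        A_hat = rng.multivariate_normal(mu, sigma, size=n_per_group)
        synthetic[int(group)] = A_hat @ S + voxel_mean   # reconstruct images
    return synthetic

# Toy stand-ins for ICA outputs (assumption): 3 sources, 50 voxels, 60 subjects.
rng = np.random.default_rng(0)
n_sources, n_voxels = 3, 50
S = rng.laplace(size=(n_sources, n_voxels))      # non-Gaussian toy sources
labels = np.repeat([0, 1], 30)                   # 0 = control, 1 = patient
A = rng.normal(size=(60, n_sources))
A[labels == 1, 0] += 2.0                         # group difference in one loading
voxel_mean = rng.normal(size=n_voxels)
synthetic = generate_synthetic_images(A, labels, S, voxel_mean, n_per_group=10, rng=rng)
```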
2.4 Experimental Setup
First, we fit the data generator for each modality using all unlabeled data available. This ensures we estimate the best sources and mixing matrix possible in order to meet with the first assumption of the generator. Then, we split the data of each modality into 87.5% training and 12.5% testing (8-fold cross-validation).
For each data modality, the training dataset is fed to the data generator, which is set to produce 10,000 batches of 20 samples: 10 healthy controls and 10 patients with 20 estimated sources. This results in a total of 200,000 samples per modality to use as pre-training data. See Fig. 1 for a view of the experimental setup for pre-training.
Once each unimodal MLP is pre-trained, we use its input-layer weights to initialize the corresponding weight parameters of the multimodal MLP. The multimodal MLP then starts training with real training data. The training procedure splits the training data into 90% for training and 10% for validation. We use the validation dataset as a proxy for the testing set to avoid overfitting: after 100 epochs we measure the loss on validation data and keep the weights that result in the minimum validation loss after 1000 epochs. See Fig. 2 for a view of the complete system.
In the pre-training phase, the method sequentially feeds batches of data to an online trainer for a unimodal MLP classifier. It is important to note that a sample of synthetic data is only seen by the trained model once. In practice, the online learners are fed batches sequentially, generating data on-the-fly rather than pre-generating the dataset.
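The on-the-fly batching can be sketched with a Python generator. Here `sample_group` is a hypothetical callable standing in for the fitted synthetic data generator, and the toy sampler with 5 features is an assumption for illustration. The batch composition (10 controls and 10 patients) and the number of batches follow the setup described above.

```python
import numpy as np

def batch_stream(sample_group, n_batches, n_per_group=10):
    """Yield (X, y) batches generated on the fly; each sample is produced once."""
    for _ in range(n_batches):
        x_hc = sample_group(0, n_per_group)      # synthetic healthy controls
        x_sz = sample_group(1, n_per_group)      # synthetic patients
        X = np.vstack([x_hc, x_sz])
        y = np.concatenate([np.zeros(n_per_group), np.ones(n_per_group)])
        yield X, y

rng = np.random.default_rng(0)

def toy_sample_group(label, n):
    # Hypothetical stand-in for the fitted generator: 5 features per sample.
    return rng.normal(loc=float(label), size=(n, 5))

n_seen = 0
for X_batch, y_batch in batch_stream(toy_sample_group, n_batches=10_000):
    n_seen += len(y_batch)   # an online learner would update on this batch here
```

Because the batches are never stored, memory use stays constant regardless of how many synthetic samples the learner consumes.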
We also train and test several other classical classifiers on raw data for comparison. All our experiments are implemented using free software provided by scikit-learn [20] and Theano [21]. Finally, we report the area under the ROC curve (AUC) on the testing set.
For completeness, we run classical classifiers including naive Bayes, logistic regression, RBF and linear support vector machines, linear discriminant analysis, random forest, nearest neighbors, and decision tree. These were run on raw data for each modality and on concatenated data for the multimodal approach. We used a grid search over hyper-parameters and evaluated the best combination within a nested 10-fold cross-validation; we then report the average and standard deviation of the AUC across the 8 folds. For the concatenated approach, some classifiers were too expensive to compute, so we do not report their results as of the submission of this paper.
The results indicate that the best classification results are obtained when the information of sMRI and fMRI is combined. Also, the proposed MLP model achieved a significantly higher average AUC. See a complete summary of the results in Table 1.
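The evaluation protocol (AUC per test fold, then mean and standard deviation across 8 folds) can be sketched as below. The AUC is computed with the rank-sum identity rather than a library call, and the "classifier" is replaced by fixed toy scores. The function names and the toy data are assumptions for illustration, and no nested hyper-parameter search is shown.

```python
import numpy as np

def auc_score(y_true, scores):
    """AUC via the rank-sum (Mann-Whitney U) identity, averaging ranks over ties."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for value in np.unique(scores):              # average the ranks of tied scores
        tied = scores == value
        ranks[tied] = ranks[tied].mean()
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    u = ranks[y_true].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

def kfold_indices(n_samples, n_folds=8, rng=None):
    """Yield (train, test) index arrays; with 8 folds each test set holds 12.5%."""
    rng = np.random.default_rng() if rng is None else rng
    folds = np.array_split(rng.permutation(n_samples), n_folds)
    for i in range(n_folds):
        train = np.concatenate([folds[j] for j in range(n_folds) if j != i])
        yield train, folds[i]

# Toy labels and scores (assumption): scores are noisy copies of the labels.
rng = np.random.default_rng(0)
y = np.arange(160) % 2
scores = y + rng.normal(scale=0.5, size=160)
fold_aucs = [auc_score(y[test], scores[test]) for _, test in kfold_indices(160, rng=rng)]
mean_auc, std_auc = np.mean(fold_aucs), np.std(fold_aucs)
```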
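Table 1: AUC on the testing set (mean ± standard deviation across folds).
| Classifier | sMRI | fMRI | sMRI + fMRI |
|---|---|---|---|
| Online learning and synthetic data | | | |
| MLP with MVN | 0.65 ± 0.05 | 0.82 ± 0.06 | 0.85 ± 0.05 |
| MLP with rejection | 0.74 ± 0.07 | 0.83 ± 0.05 | 0.84 ± 0.05 |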
We first investigated the ability of sMRI and fMRI data to predict schizophrenia diagnosis. The results of various classifiers applied to each individual modality provide evidence that both modalities are indeed informative, since the overall prediction accuracy is significantly higher than random chance (0.5). The literature supports our findings in both modalities [22, 23]; thus we can assume with confidence that sMRI and fMRI are of relevance for schizophrenia diagnosis.
Then, we hypothesized that, given that sMRI and fMRI hold potential for schizophrenia diagnosis prediction, the combination of both modalities may improve the overall classification accuracy. Again, the results show evidence in favor of the stated hypothesis because, as shown in Table 1, the proposed multimodal MLP model significantly increased (p-value: 0.016, one-tailed Student's t-test) the average AUC and reduced uncertainty among data folds compared to the best multimodal result with various classifiers.
Previous efforts on multimodal classification showed promising results, yet most of the analysis is focused on feature extraction under linearity assumptions. In this study, we proposed a non-linear approach, an MLP, that improves generalization in the classification of schizophrenia patients and healthy controls from their sMRI and fMRI images. Based on the classification results, the use of the proposed MLP model in combination with the data generator is promising.
In general, multimodal deep learning has gained popularity due to the high classification rates reported in the literature [24, 25]. However, all of these application fields are big data problems. This is not the case in brain imaging, where the high cost of MRI data collection constrains the amount of data that can be collected per study. In this paper, we used a synthetic data generation technique to mitigate the effects of a limited sample size. As the results show, the MLP model appears to benefit from the use of the generator, which reduces AUC variance and slightly improves classification results compared to the MLP model without pre-training.
It is well known that large MLPs tend to overfit the training data. However, this seems not to be the case for big data problems, where the overfitting expected from nets with excess capacity did not occur [26]. Even though our sample size does not enter the category of big data, we are confident that the synthetic data generator used for pre-training played a role in regularizing the unimodal training. The generator was used to provide more than 200,000 samples to the MLP trainer. Additionally, to avoid overfitting in the multimodal training phase, we used other regularization methods, norm and dropout, besides using the pre-trained weights.
We presented, to our knowledge, the first MLP design using data generators for pre-training applied to multimodal brain imaging data. The use of the data generator proved useful for pre-training in the sense that it improved classification performance, probably acting as a regularizer to avoid overfitting of the unimodal MLP model.
The multimodal design was a simple concatenation of unimodal MLPs and can be further used for more than two data modalities. As future work, we could assess the utility of the multimodal MLP including genetic and behavioral information.
References
[1] D. Bhugra, “The global prevalence of schizophrenia,” PLoS Medicine, vol. 2, no. 5, p. e151, 2005.
[2] S. A. Meda, N. R. Giuliani, V. D. Calhoun, K. Jagannathan, D. J. Schretlen, A. Pulver, N. Cascella, M. Keshavan, W. Kates, R. Buchanan et al., “A large scale (n= 400) investigation of gray matter differences in schizophrenia using optimized voxel-based morphometry,” Schizophrenia research, vol. 101, no. 1, pp. 95–105, 2008.
[3] C. N. Gupta, V. D. Calhoun, S. Rachakonda, J. Chen, V. Patel, J. Liu, J. Segall, B. Franke, M. P. Zwiers, A. Arias-Vasquez et al., “Patterns of gray matter abnormalities in schizophrenia based on an international mega-analysis,” Schizophrenia bulletin, p. sbu177, 2014.
[4] D. Cooper, V. Barker, J. Radua, P. Fusar-Poli, and S. M. Lawrie, “Multimodal voxel-based meta-analysis of structural and functional magnetic resonance imaging studies in those at elevated genetic risk of developing schizophrenia,” Psychiatry Research: Neuroimaging, vol. 221, no. 1, pp. 69–77, 2014.
[5] A. J. Gaebler, K. Mathiak, J. W. Koten, A. A. König, Y. Koush, D. Weyer, C. Depner, S. Matentzoglu, J. C. Edgar, K. Willmes et al., “Auditory mismatch impairments are characterized by core neural dysfunctions in schizophrenia,” Brain, vol. 138, no. 5, pp. 1410–1423, 2015.
[6] J. A. Turner, H. Chen, D. H. Mathalon, E. A. Allen, A. R. Mayer, C. C. Abbott, V. D. Calhoun, and J. Bustillo, “Reliability of the amplitude of low-frequency fluctuations in resting state fmri in chronic schizophrenia,” Psychiatry Research: Neuroimaging, vol. 201, no. 3, pp. 253–255, 2012.
[7] F. Liu, W. Guo, L. Liu, Z. Long, C. Ma, Z. Xue, Y. Wang, J. Li, M. Hu, J. Zhang et al., “Abnormal amplitude low-frequency oscillations in medication-naive, first-episode patients with major depressive disorder: a resting-state fmri study,” Journal of affective disorders, vol. 146, no. 3, pp. 401–406, 2013.
[8] J. Sui, T. Adali, Q. Yu, J. Chen, and V. D. Calhoun, “A review of multivariate methods for multimodal fusion of brain imaging data,” Journal of neuroscience methods, vol. 204, no. 1, pp. 68–81, 2012.
[9] J. Liu and V. D. Calhoun, “A review of multivariate analyses in imaging genetics,” Frontiers in neuroinformatics, vol. 8, 2014.
[10] A. Meyer-Lindenberg, “The future of fmri and genetics research,” Neuroimage, vol. 62, no. 2, pp. 1286–1292, 2012.
[11] M. R. Sabuncu, E. Konukoglu, A. D. N. Initiative et al., “Clinical prediction from structural brain mri scans: A large-scale empirical study,” Neuroinformatics, pp. 1–16, 2014.
[12] Z. Yu-Feng, H. Yong, Z. Chao-Zhe, C. Qing-Jiu, S. Man-Qiu, L. Meng, T. Li-Xia, J. Tian-Zi, and W. Yu-Feng, “Altered baseline brain activity in children with adhd revealed by resting-state functional mri,” Brain and Development, vol. 29, no. 2, pp. 83–91, 2007.
[13] V. D. Calhoun and T. Adali, “Feature-based fusion of medical imaging data,” Information Technology in Biomedicine, IEEE Transactions on, vol. 13, no. 5, pp. 711–720, 2009.
[14] A. Ulloa, S. Plis, E. Erhardt, and V. Calhoun, “Synthetic structural magnetic resonance image generator improves deep learning prediction of schizophrenia,” May 2015, submitted to MLSP 2015.
[15] J. M. Segall, J. A. Turner, T. G. van Erp, T. White, H. J. Bockholt, R. L. Gollub, B. C. Ho, V. Magnotta, R. E. Jung, R. W. McCarley et al., “Voxel-based morphometric multisite collaborative study on schizophrenia,” Schizophrenia bulletin, vol. 35, no. 1, pp. 82–95, 2009.
[16] J. A. Turner, E. Damaraju, T. G. Van Erp, D. H. Mathalon, J. M. Ford, J. Voyvodic, B. A. Mueller, A. Belger, J. Bustillo, S. McEwen et al., “A multi-site resting state fmri study on the amplitude of low frequency fluctuations in schizophrenia,” Frontiers in neuroscience, vol. 7, 2013.
[17] M. B. First, R. L. Spitzer, M. Gibbon, and J. B. Williams, “Structured clinical interview for dsm-iv-tr axis i disorders, research version, patient edition,” SCID-I/P, Tech. Rep., 2002.
[18] M. D. Zeiler, “ADADELTA: an adaptive learning rate method,” CoRR, vol. abs/1212.5701, 2012.
[19] Y. Li, T. Adalí, and V. Calhoun, “Estimating the number of independent components for functional magnetic resonance imaging data,” Hum Brain Mapp, vol. 28, no. 11, pp. 1251–1266, 2007.
[20] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, “Scikit-learn: Machine learning in Python,” Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[21] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: a CPU and GPU math expression compiler,” in Proceedings of the Python for Scientific Computing Conference (SciPy), Jun. 2010, oral presentation.
[22] Y. Takayanagi, T. Takahashi, L. Orikabe, Y. Mozue, Y. Kawasaki, K. Nakamura, Y. Sato, M. Itokawa, H. Yamasue, K. Kasai et al., “Classification of first-episode schizophrenia patients and healthy subjects by automated mri measures of regional brain volume and cortical thickness,” PloS one, vol. 6, no. 6, p. e21047, 2011.
[23] D. Chyzhyk and M. Graña, “Classification of schizophrenia patients on lattice computing resting-state fmri features,” Neurocomputing, vol. 151, pp. 151–160, 2015.
[24] N. Srivastava and R. R. Salakhutdinov, “Multimodal learning with deep boltzmann machines,” in Advances in neural information processing systems, 2012, pp. 2222–2230.
[25] S. E. Kahou, X. Bouthillier, P. Lamblin, C. Gulcehre, V. Michalski, K. Konda, S. Jean, P. Froumenty, A. Courville, P. Vincent et al., “Emonets: Multimodal deep learning approaches for emotion recognition in video,” arXiv preprint arXiv:1503.01800, 2015.
[26] R. Caruana, S. Lawrence, and L. Giles, “Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping,” in Advances in Neural Information Processing Systems 13: Proceedings of the 2000 Conference, vol. 13. MIT Press, 2001, p. 402.