Detect, Quantify, and Incorporate Dataset Bias: A Neuroimaging Analysis on 12,207 Individuals
Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today are, on their own, too small for training complex models or for finding genome-wide associations. A solution is to grow the sample size by merging data across several datasets. However, bias in datasets complicates this approach, because merging introduces additional sources of variation into the data. In this work, we combine 15 large neuroimaging datasets to study bias. First, we detect bias by demonstrating that scans can be correctly assigned to a dataset with 73.3% accuracy. Next, we introduce metrics to quantify the compatibility across datasets and to create embeddings of neuroimaging sites. Finally, we incorporate the presence of bias in the selection of a training set for predicting autism. For the quantification of the dataset bias, we introduce two metrics: the Bhattacharyya distance between datasets and the age prediction error. The presented embedding of neuroimaging sites provides a new visualization of the similarity of different sites. This could be used to guide the merging of data sources, while limiting the introduction of unwanted variation. Finally, we demonstrate a clear performance increase when incorporating dataset bias for training set selection in autism prediction. Overall, we believe that the growing amount of neuroimaging data necessitates incorporating data-driven methods for quantifying dataset bias in future analyses.
As neuroimaging is joining the ranks of a “big data” science with more and larger datasets becoming available, the issue of dataset bias is becoming prevalent. In general, bias refers to statistics that are systematically different from the population parameters. In a collection of unbiased datasets, similar results should be achieved by running independent analyses on each dataset and it would be straightforward to pool subjects across datasets without introducing unwanted variation. Further, models learned on one dataset would naturally generalize to other datasets. However, in practice, neuroimaging datasets are subject to various types of biases. These include subject selection, acquisition method, and processing biases. While efforts have been made and are still ongoing to improve image processing to limit the impact of dataset bias on the outcome (e.g., volume or thickness measurements), substantial bias still remains [10, 13, 14, 15, 23].
Selection bias stems from the fact that subjects included in the study do not represent the overall population. Examples are (i) the recruitment of particular target groups, e.g., young adults; (ii) the recruitment of a particular disease group; or (iii) an over-representation of more educated participants in convenience samples. While the first two are potentially related to the study objective and can be controlled for, the third one is more difficult to control and also seems to appear in epidemiological studies. A second bias stems from the image acquisition, where magnetic field strength, manufacturer, gradients, pulse sequences and head positioning cause variations in the images. While standardization efforts are undertaken, for instance by the ADNI, variation related to the scanner remains, and it is questionable whether further standardization is in the manufacturers’ interest. Finally, there is processing bias in image segmentation and registration, which is in part related to acquisition bias. The type of segmentation method and the selected parameters can largely influence the outcome. Further, head motion, voxel size and image noise can cause bias in segmentation results.
In this paper, we first detect the magnitude of dataset bias present in large neuroimaging studies. Instead of trying to remove the bias, we propose to incorporate it in the analysis, which requires quantifying it first. To this end, we introduce two dataset metrics: the Bhattacharyya distance in feature space and the age prediction error, which quantifies model generalization by including a variable from subject demographics. In addition to operating on the level of datasets, we also perform a more fine-grained analysis on the level of acquisition sites. Based on the dataset metric, we create an embedding of neuroimaging sites to visualize the similarity among them. Finally, we demonstrate the benefit of composing a training set based on the dataset metric for autism prediction.
We work on T1-weighted MRI brain scans from 15 large-scale public datasets: ABIDE I+II, ADHD200, ADNI, AIBL, COBRE, CORR, GSP, HBN, HCP, IXI
|Dataset||Diagnosis||Age (mean)||Age (SD)||Males %||Sites||Patients|
III Name That Dataset
To evaluate the impact of dataset bias, we play the game Name That Dataset on neuroimaging data, which was originally proposed by Torralba and Efros for natural images. The task is to predict the dataset a scan comes from based solely on image measurements. Fig. 1 illustrates the performance for classifying the 15 datasets with different image features. A random forest classifier with default settings was used for the prediction. The split into training and testing sets takes the dataset membership into account. The performance of image-based classifiers increases logarithmically with the amount of training data. If no dataset bias were present, the prediction accuracy should be close to random chance (6.7% for 15 datasets). Because the datasets differ in size and in their distributions of age and sex, we use a classifier trained on age and sex as a baseline. With only 0.1% of the data used for training, volume measures perform similarly to the prediction from metadata. As we increase the amount of training data to 70%, the accuracy rises to 73.3% for the combination of volume and thickness features, which performs better than either of them alone. Compared to 42.2% for age and sex, this illustrates that there is a strong bias in the datasets that cannot be explained by basic demographics. We also restricted the analysis to healthy controls, because we expected that the inclusion of patients would facilitate the classification. However, the results are similar, as shown for the combination of volume and thickness in Fig. 1.
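The experiment above can be sketched in a few lines of Python. This is a minimal illustration on synthetic features, not the pipeline used for Fig. 1: it assumes scikit-learn is available, the three "datasets", their feature offsets, and the stratified 70/30 split are all hypothetical stand-ins, and only the use of a random forest with default settings follows the text.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for image-derived features: 3 hypothetical datasets,
# 200 scans each, 10 features, each dataset shifted by its own offset ("bias").
n_datasets, n_scans, n_features = 3, 200, 10
offsets = rng.normal(scale=0.8, size=(n_datasets, n_features))
X = np.vstack([rng.normal(size=(n_scans, n_features)) + offsets[k]
               for k in range(n_datasets)])
y = np.repeat(np.arange(n_datasets), n_scans)  # label = dataset of origin

# Assumption: stratify by dataset so every dataset appears in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, stratify=y, random_state=0)

# Random forest with default settings (only the seed is fixed).
clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

accuracy = clf.score(X_test, y_test)
chance = 1.0 / n_datasets  # expected accuracy without any dataset bias
print(f"accuracy = {accuracy:.3f}, chance = {chance:.3f}")
```

An accuracy well above the chance level indicates that the features carry a dataset-specific signature; on real data, the age-and-sex baseline would be obtained by fitting the same classifier on those two variables only.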
From the confusion matrix, we see that datasets with a similar population result in higher confusion, e.g., between ABIDE I, ABIDE II, and ADHD200. Single-site datasets like HCP are very homogeneous and therefore show almost no confusion with any of the other datasets. In contrast, multi-site datasets like CORR, which also cover a wide age range, show high confusion with other datasets. Overall, however, the high classification accuracy and the strong diagonal indicate that datasets possess unique, identifiable characteristics.
The lesson learned from this experiment is that even when working with image-derived values that represent physical measures (volume, thickness), there is still substantial bias in datasets, although techniques like atlas renormalization were employed to improve consistency across scanners. Of course, much of the bias can be attributed to the different goals of the studies, like the inclusion of subjects from different age groups. However, even when focusing on datasets that cover a similar age range, we observe a high accuracy. While we are not aware of previous attempts to Name That Dataset in neuroimaging, our results echo concerns raised in previous studies. In a large ENIGMA study of over 15,000 subjects on brain asymmetry, it was reported that dataset heterogeneity explained over 10% of the total observed variance per structure. On the ADNI, with an optimized MPRAGE imaging protocol across all sites, the intra-subject variability of compartment volumes for scans on different scanners was roughly 10 times higher than for repeated scans on the same scanner. Similarly, previous studies reported a drop in accuracy when training on different datasets or working with multi-site data.
IV Quantifying Dataset Compatibility
IV-A Compatibility Metrics
Having shown the presence of dataset bias, our next aim is to define metrics that quantify the compatibility of datasets. Given data sources $S$ and $T$, the metric $d(S, T)$ expresses the compatibility between them. As first metric, we propose to compute the Bhattacharyya distance between data sources. To this end, we estimate multivariate normal distributions $\mathcal{N}(\mu_S, \Sigma_S)$ and $\mathcal{N}(\mu_T, \Sigma_T)$, respectively, and compute the Bhattacharyya distance between them. The dimensionality of the distributions corresponds to the number of image-derived measures, where we use brain structure volumes in our experiments.
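For two multivariate normal distributions, the Bhattacharyya distance has the standard closed form $\frac{1}{8}(\mu_1-\mu_2)^\top \Sigma^{-1} (\mu_1-\mu_2) + \frac{1}{2}\ln\frac{\det\Sigma}{\sqrt{\det\Sigma_1 \det\Sigma_2}}$ with $\Sigma = (\Sigma_1+\Sigma_2)/2$. The following NumPy sketch fits a Gaussian to each data source and evaluates this expression; the function name and the synthetic feature matrices are illustrative assumptions, and the sketch presumes more subjects than features so that the covariance estimates are invertible.

```python
import numpy as np

def bhattacharyya_distance(X_a, X_b):
    """Bhattacharyya distance between Gaussians fitted to two feature matrices.

    X_a, X_b: arrays of shape (n_subjects, n_features), e.g. structure volumes.
    """
    mu_a, mu_b = X_a.mean(axis=0), X_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(X_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(X_b, rowvar=False))
    cov = 0.5 * (cov_a + cov_b)  # averaged covariance

    # Term 1: Mahalanobis-type distance between the means.
    diff = mu_a - mu_b
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)

    # Term 2: mismatch between the covariances, via stable log-determinants.
    _, logdet = np.linalg.slogdet(cov)
    _, logdet_a = np.linalg.slogdet(cov_a)
    _, logdet_b = np.linalg.slogdet(cov_b)
    term_cov = 0.5 * (logdet - 0.5 * (logdet_a + logdet_b))

    return float(term_mean + term_cov)

# Hypothetical example: two sources with shifted and rescaled features.
rng = np.random.default_rng(0)
site_a = rng.normal(loc=0.0, scale=1.0, size=(200, 5))
site_b = rng.normal(loc=0.5, scale=1.2, size=(200, 5))
d_ab = bhattacharyya_distance(site_a, site_b)
print(f"d_B(a, b) = {d_ab:.3f}")
```

The distance is zero for identical distributions and symmetric in its arguments, so no additional symmetrization is needed for this metric.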
As second metric, we propose to compute the age prediction error, which includes a variable from subject demographics (age). We train an age regression model on the source set $S$ and predict on the target set $T$. Since we know the chronological age on the target set, we compute the mean age prediction error, $d_{\mathrm{Age}}(S \rightarrow T)$. To have a symmetric metric, we combine the errors from both directions, $S \rightarrow T$ and $T \rightarrow S$. Age estimation has previously been used for modeling healthy aging and differentiating it from abnormal aging in dementia [8, 2] and has the advantage, in contrast to other prediction tasks, that age is a commonly available variable. We use random forest regression on volume measures for the age regression. While the Bhattacharyya distance measures the similarity of image features, the age prediction error expresses how well one data source is suited for training a model that is deployed on a second dataset.
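A minimal sketch of the age prediction error is given below, assuming scikit-learn is available. Two details are assumptions made for illustration rather than taken from the text: the error is measured as mean absolute error in years, and the symmetric version averages the two directions. The helper `make_site` and its linear age effect are purely synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def age_error(X_src, age_src, X_tgt, age_tgt, seed=0):
    """Train on the source set, predict on the target set.

    Returns the mean absolute age prediction error (assumed error measure).
    """
    model = RandomForestRegressor(random_state=seed)
    model.fit(X_src, age_src)
    return float(np.mean(np.abs(model.predict(X_tgt) - age_tgt)))

def symmetric_age_error(X_a, age_a, X_b, age_b):
    """Average of both directions (assumed way of symmetrizing)."""
    return 0.5 * (age_error(X_a, age_a, X_b, age_b)
                  + age_error(X_b, age_b, X_a, age_a))

def make_site(n, offset, rng):
    """Hypothetical site: 3 volume-like features that shrink with age."""
    age = rng.uniform(20, 80, size=n)
    X = (100.0 - 0.5 * age[:, None]
         + rng.normal(scale=2.0, size=(n, 3)) + offset)
    return X, age

rng = np.random.default_rng(0)
X_a, age_a = make_site(150, offset=0.0, rng=rng)   # reference site
X_b, age_b = make_site(150, offset=0.0, rng=rng)   # same acquisition bias
X_c, age_c = make_site(150, offset=10.0, rng=rng)  # systematic volume offset

d_ab = symmetric_age_error(X_a, age_a, X_b, age_b)
d_ac = symmetric_age_error(X_a, age_a, X_c, age_c)
print(f"d_Age(a, b) = {d_ab:.2f} years, d_Age(a, c) = {d_ac:.2f} years")
```

In this toy setting, the site with a systematic feature offset yields a much larger error than the site without one, which is the behavior the metric is meant to capture.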
IV-B Site Embedding
To investigate the similarity across datasets, we create an embedding based on the metrics. However, many of the large neuroimaging datasets are multi-site datasets, i.e., scans were acquired at different scanning sites. Some initiatives like the ADNI put major efforts into the standardization of scans across sites. Other multi-site datasets like ABIDE retrospectively aggregate data that was independently acquired by laboratories around the world. To study the variation in such datasets, we perform an analysis of variance (ANOVA) on the ABIDE I dataset with age, age squared, sex, diagnosis, and site as variables. For putamen, amygdala, and nucleus accumbens, the percentage of variance explained by site is 20.9%, 23.7%, and 32.7%, respectively, while the total variance explained by all variables ranged from 32.9% to 38.7%. Site is therefore the major source of variation, several times higher than age, sex, or diagnosis. Based on these results, we will operate on the level of sites, instead of datasets, in the following.
We compute the metric between all pairs of sites in our data, where we limit the analysis to sites with more than 25 subjects to have enough samples for a reliable estimation. Based on the pairwise age prediction across all sites, we use the resulting distance matrix in t-SNE for visualizing the similarity of sites. Fig. 2 shows the embedding, where the age prediction error was used as metric and the perplexity in t-SNE was set to 5. We only show results for the age prediction error in this experiment because it yielded a clearer separation of datasets. We compare both metrics in Section IV-C.
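Feeding a pairwise distance matrix into t-SNE can be sketched with scikit-learn's `TSNE`, which accepts precomputed distances when `metric="precomputed"` and a random initialization are used. The distance matrix below is synthetic and stands in for the site-by-site metric; only the perplexity of 5 is taken from the text.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
n_sites = 30

# Hypothetical symmetric site-by-site distance matrix with zero diagonal;
# in practice each entry would hold the metric between two sites.
points = rng.normal(size=(n_sites, 4))
D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# Precomputed distances require init="random" (PCA init needs raw features).
tsne = TSNE(n_components=2, metric="precomputed", init="random",
            perplexity=5, random_state=0)
embedding = tsne.fit_transform(D)  # one 2D coordinate per site
print(embedding.shape)
```

Each row of `embedding` can then be plotted and labeled by dataset or by mean age of the site. Note that t-SNE layouts depend on the random seed, so cluster shapes should be interpreted qualitatively.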
It is striking to see that some sites are more similar to sites from other datasets than to sites from the same dataset. We observe four clusters. Cluster I contains all sites from ADNI and AIBL, representing old subjects. Cluster II consists of sites from IXI, NKI, COBRE, and OASIS, which include subjects from a very wide age range. Cluster III has younger subjects mainly in their twenties, including GSP and HCP, together with many sites from ABIDE and CORR. Cluster IV mainly contains children and adolescents, e.g., HBN and sites from ABIDE. In Fig. 3, we show the same embedding as in Fig. 2 but with the label color according to the age. As expected, the major variations are due to age, given its predominant impact on brain morphology [24, 28]. Within those clusters, age is relatively homogeneous, so that other factors like field strength and manufacturer can play a role. All in all, we believe that such an embedding of the majority of neuroimaging datasets is of great value to clarify the relationship between different datasets. In addition, it could be used to guide the combination of data from sites, while limiting the introduction of unwanted variation.
IV-C Incorporate Bias in Training Set Selection
We demonstrate the benefits of the compatibility metric for the classification of autism, where we only operate with the ABIDE I + II datasets because the other datasets do not contain autistic subjects. To this end, we select one site for testing and we compose the training set based on the metric $d$. The rationale is that sites that are close to the target site will be better suited for training a classifier than sites that are very distant. In detail, we sample the training set from the source dataset that consists of all sites, except for the testing site. We encourage the selection of samples from nearby sites by setting the probability of a sample being selected proportional to $\exp(-d)$, the negative exponential of the site metric. As baseline, we use a uniform distribution, which corresponds to random sampling. Fig. 4 illustrates the autism classification accuracy for the two largest sites in ABIDE I and ABIDE II, respectively. We observe that selecting the training set with either of the two metrics outperforms the random selection, and further that the computation of the distance with age prediction yields the best results.
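The sampling scheme can be illustrated with a short NumPy sketch. Each candidate training sample receives a weight equal to the negative exponential of the metric between its site and the test site, and the weights are normalized into selection probabilities. The site names, distances, training set size, and the choice of sampling without replacement are hypothetical assumptions for this illustration.

```python
import numpy as np

def selection_probabilities(site_of_sample, dist_to_target):
    """Per-sample probabilities proportional to exp(-d(site, target site))."""
    d = np.array([dist_to_target[s] for s in site_of_sample], dtype=float)
    w = np.exp(-d)
    return w / w.sum()

rng = np.random.default_rng(0)

# Hypothetical metric values between each source site and the test site.
dist_to_target = {"site_A": 0.5, "site_B": 2.0, "site_C": 4.0}
site_of_sample = np.array(["site_A"] * 100 + ["site_B"] * 100 + ["site_C"] * 100)

# Biased selection: samples from nearby sites are drawn more often.
p = selection_probabilities(site_of_sample, dist_to_target)
train_idx = rng.choice(len(site_of_sample), size=60, replace=False, p=p)

# Baseline: uniform probabilities, i.e. plain random sampling.
baseline_idx = rng.choice(len(site_of_sample), size=60, replace=False)

sites, counts = np.unique(site_of_sample[train_idx], return_counts=True)
print(dict(zip(sites.tolist(), counts.tolist())))
```

Because the weights decay exponentially, the scale of the metric controls how strongly the selection concentrates on the closest sites; rescaling the distances before exponentiation would be one way to tune this.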
Notably, the selection algorithm is driven by image measurements. On the one hand, this makes it very versatile, as it can be easily applied to image archives with T1-weighted MRI scans. On the other hand, by directly operating on the output, it models all of the previously discussed biases.
On a large collection of datasets with 12,207 individuals, we have illustrated that dataset bias has a strong influence on neuroimaging measures. We have quantified dataset compatibility with metrics based on the age prediction error and the Bhattacharyya distance. Computation of the metric between all pairs of neuroimaging sites enabled the creation of an embedding, which illustrated that sites across datasets can be more similar than sites within datasets. Finally, we demonstrated the advantages of incorporating dataset bias for training set selection in autism prediction, where age prediction outperformed the Bhattacharyya distance. Overall, we believe that the growing amount of neuroimaging data necessitates incorporating data-driven methods for quantifying dataset bias in future analyses.
Acknowledgement: This work was supported in part by the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).
[1] Alexander, L.M., Escalera, J., Ai, L., Andreotti, C., Febre, K., Mangone, A., Potler, N.V., Langer, N., et al.: An open resource for transdiagnostic research in pediatric mental health and learning disorders. bioRxiv p. 149369 (2017)
[2] Becker, B.G., Klein, T., Wachinger, C.: Gaussian process uncertainty in age estimation as a measure of brain abnormality. NeuroImage (2018)
[3] Breiman, L.: Random forests. Machine Learning 45(1), 5–32 (2001)
[4] Buckner, R., Hollinshead, M., Holmes, A., Brohawn, D., Fagerness, J., O’Keefe, T., Roffman, J.: The brain genomics superstruct project. Harvard Dataverse Network (2012)
[5] Di Martino, A., Yan, C., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular Psychiatry 19(6), 659–667 (2014)
[6] Ellis, K., Bush, A., Darby, D., et al.: The Australian Imaging, Biomarkers and Lifestyle (AIBL) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of Alzheimer’s disease. International Psychogeriatrics 21(04), 672–687 (2009)
[7] Fischl, B., Salat, D.H., et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–355 (2002)
[8] Franke, K., Ziegler, G., Klöppel, S., Gaser, C., Alzheimer’s Disease Neuroimaging Initiative, et al.: Estimating the age of healthy subjects from T1-weighted MRI scans using kernel methods: Exploring the influence of various parameters. NeuroImage 50(3), 883–892 (2010)
[9] Gollub, R.L., Shoemaker, J., King, M., White, T., Ehrlich, S., Sponheim, S., Clark, V., Turner, J., Mueller, B., Magnotta, V., et al.: The MCIC collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11(3), 367–388 (2013)
[10] Guadalupe, T., Mathias, S.R., Theo, G., et al.: Human subcortical brain asymmetries in 15,847 people worldwide reveal effects of age and sex. Brain Imaging and Behavior 11(5), 1497–1514 (2017)
[11] Han, X., Fischl, B.: Atlas renormalization for improved brain MR image segmentation across scanner platforms. IEEE Transactions on Medical Imaging 26(4), 479–486 (2007)
[12] Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., Whitwell, J.L., Ward, C., et al.: The Alzheimer’s Disease Neuroimaging Initiative (ADNI): MRI methods. Journal of Magnetic Resonance Imaging 27(4), 685–691 (2008)
[13] Jovicich, J., Czanner, S., Han, X., Salat, D., van der Kouwe, A., Quinn, B., Pacheco, J., Albert, M., Killiany, R., Blacker, D., et al.: MRI-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. NeuroImage 46(1), 177–192 (2009)
[14] Kruggel, F., Turner, J., Muftuler, L.T., Alzheimer’s Disease Neuroimaging Initiative, et al.: Impact of scanner hardware and imaging protocol on image quality and compartment volume precision in the ADNI cohort. NeuroImage 49(3), 2123–2133 (2010)
[15] LeWinn, K.Z., Sheridan, M.A., Keyes, K.M., Hamilton, A., McLaughlin, K.A.: Sample composition alters associations between age and brain structure. Nature Communications 8(1), 874 (2017)
[16] Maaten, L.v.d., Hinton, G.: Visualizing data using t-SNE. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
[17] Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies (OASIS): cross-sectional MRI data in young, middle aged, nondemented, and demented older adults. J. Cognitive Neurosci. 19(9), 1498–1507 (2007)
[18] Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., Coffey, C., Kieburtz, K., Flagg, E., Chowdhury, S., et al.: The Parkinson Progression Marker Initiative (PPMI). Progress in Neurobiology 95(4), 629–635 (2011)
[19] Mayer, A., Ruhl, D., Merideth, F., Ling, J., Hanlon, F., Bustillo, J., Cañive, J.: Functional imaging of the hemodynamic sensory gating response in schizophrenia. Human Brain Mapping 34(9), 2302–2312 (2013)
[20] Milham, M.P., Fair, D., Mennes, M., Mostofsky, S.H., et al.: The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in Systems Neuroscience 6, 62 (2012)
[21] Nielsen, J.A., Zielinski, B.A., Fletcher, P.T., Alexander, A.L., Lange, N., Bigler, E.D., Lainhart, J.E., Anderson, J.S.: Multisite functional connectivity MRI classification of autism: ABIDE results. Frontiers in Human Neuroscience 7 (2013)
[22] Nooner, K.B., Colcombe, S.J., Tobe, R.H., Mennes, M., Benedict, M.M., Moreno, A.L., Panek, L.J., Brown, S., Zavitz, S.T., Li, Q., et al.: The NKI-Rockland sample: a model for accelerating the pace of discovery science in psychiatry. Frontiers in Neuroscience 6 (2012)
[23] Nugent, A.C., Luckenbaugh, D.A., Wood, S.E., Bogers, W., Zarate, C.A., Drevets, W.C.: Automated subcortical segmentation using FIRST: test–retest reliability, interscanner reliability, and comparison to manual segmentation. Human Brain Mapping 34(9), 2313–2329 (2013)
[24] Potvin, O., Dieumegarde, L., Duchesne, S.: Normative morphometric data for cerebral cortical areas over the lifetime of the adult human brain. NeuroImage 156, 315–339 (2017)
[25] Smith, S.M., Nichols, T.E.: Statistical challenges in “big data” human neuroimaging. Neuron 97(2), 263–268 (2018)
[26] Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1521–1528. IEEE (2011)
[27] Van Essen, D.C., Smith, S.M., Barch, D.M., Behrens, T., Yacoub, E., Ugurbil, K., WU-Minn HCP Consortium, et al.: The WU-Minn Human Connectome Project: an overview. NeuroImage 80, 62–79 (2013)
[28] Wachinger, C., Golland, P., Kremen, W., Fischl, B., Reuter, M.: Brainprint: A discriminative characterization of brain morphology. NeuroImage 109 (2015)
[29] Wachinger, C., Reuter, M.: Domain adaptation for Alzheimer’s disease diagnostics. NeuroImage 139, 470–479 (2016)
[30] Zuo, X.N., Anderson, J.S., Bellec, P., et al.: An open science resource for establishing reliability and reproducibility in functional connectomics. Scientific Data 1, 140049 (2014)