Detect, Quantify, and Incorporate Dataset Bias: A Neuroimaging Analysis on 12,207 Individuals

Detect, Quantify, and Incorporate Dataset Bias: A Neuroimaging Analysis on 12,207 Individuals


Neuroimaging datasets keep growing in size to address increasingly complex medical questions. However, even the largest datasets today alone are too small for training complex models or for finding genome wide associations. A solution is to grow the sample size by merging data across several datasets. However, bias in datasets complicates this approach and includes additional sources of variation in the data instead. In this work, we combine 15 large neuroimaging datasets to study bias. First, we detect bias by demonstrating that scans can be correctly assigned to a dataset with 73.3% accuracy. Next, we introduce metrics to quantify the compatibility across datasets and to create embeddings of neuroimaging sites. Finally, we incorporate the presence of bias for the selection of a training set for predicting autism. For the quantification of the dataset bias, we introduce two metrics: the Bhattacharyya distance between datasets and the age prediction error. The presented embedding of neuroimaging sites provides an interesting new visualization about the similarity of different sites. This could be used to guide the merging of data sources, while limiting the introduction of unwanted variation. Finally, we demonstrate a clear performance increase when incorporating dataset bias for training set selection in autism prediction. Overall, we believe that the growing amount of neuroimaging data necessitates to incorporate data-driven methods for quantifying dataset bias in future analyses.

I Introduction

As neuroimaging is joining the ranks of a ”big data” science with more and larger datasets becoming available [25], the issue of dataset bias is becoming prevalent. In general, bias refers to statistics that are systematically different from the population parameters. In a collection of unbiased datasets, similar results should be achieved by running independent analyses on each dataset and it would be straightforward to pool subjects across datasets without introducing unwanted variation. Further, models learned on one dataset would naturally generalize to other datasets. However, in practice, neuroimaging datasets are subject to various types of biases. These include subject selection, acquisition method, and processing biases. While efforts have been made and are still ongoing to improve image processing to limit the impact of dataset bias in the outcome (e.g., volume or thickness measurements), substantial bias still remains [10, 13, 14, 15, 23].

Selection bias stems from the fact that subjects included in the study do not represent the overall population. Examples are (i) the recruitment of particular target groups, e.g., young adults; (ii) the recruitment of a particular disease group; or (iii) an over-representation of more educated participants in convenience samples. While the first two are potentially related to the study objective and can be controlled for, the third one is more difficult to control and also seems to appear in epidemiological studies [25]. A second bias stems from the image acquisition, where magnetic field strength, manufacturer, gradients, pulse sequences and head positioning cause variations in the images. While standardization efforts are undertaken for instance by the ADNI [12], variation related to the scanner remains [14], and it is questionable if a further standardization is in the manufacturer’s interest. Finally, there is processing bias in image segmentation and registration, which is in part related to acquisition bias. The type of segmentation method and the selected parameters can largely influence the outcome. Further, head motion, voxel size and image noise can cause bias in segmentation results.

In this paper, we first detect the magnitude of dataset bias present in large neuroimaging studies. Instead of trying to remove the bias, we propose to incorporate it in the analysis, which requires to quantify it first. To this end, we introduce two dataset metrics: the Bhattacharyya distance in feature space and the age prediction error for quantifying model generalization by including a variable from subject demographics. In addition to operating on the level of datasets, we also look at a more fine-grained analysis on acquisition sites. Based on the dataset metric, we create an embedding of neuroimaging sites to visualize the similarity among them. Finally, we demonstrate the benefit of composing a training set based on the dataset metric for autism prediction.

Ii Data

We work on MRI T1 brain scans from 15 large-scale public datasets: ABIDE I+II [5], ADHD200 [20], ADNI [12], AIBL [6], COBRE [19], CORR [30], GSP [4], HBN [1], HCP [27], IXI1, MCIC [9], NKI [22], OASIS [17], and PPMI [18]. All datasets were processed with FreeSurfer [7] version 5.3. We keep only one scan per subject from longitudinal or test-retest datasets. After exclusion of scans with processing errors and incomplete meta data, we work with scans from 12,207 subjects (6,827 male; 8,126 controls; mean age: 33.5 (sd=23.9)). Demographics per dataset are reported in Table I.

Dataset Diagnosis    Age (mean)   Age (SD)   Males %   Sites   Patients
ABIDE I Autism      1,095 17.1 8.1 85.2 24 526
ABIDE II Autism 1,032 15.2 9.4 76.1 17 477
ADHD200 ADHD 965 12.1 3.3 61.8 8 384
ADNI Alzheimer’s 1,682 73.6 7.2 55.0 62 1144
AIBL Alzheimer’s 262 72.9 7.6 47.3 2 91
COBRE    Schizophrenia 146 37.0 12.8 74.7 1 72
CORR 1,476 25.9 15.4 50.0 32 0
GSP 1,563 21.5 2.8 42.3 5 0
HBN Psychiatric 689 10.7 3.6 59.7 2 497
HCP 1,113 28.8 3.7 45.6 1 0
IXI 561 48.6 16.5 44.6 3 0
MCIC Schizophrenia 194 33.1 11.5 71.6 3 104
NKI Psychiatric 624 38.4 22.5 39.1 1 268
OASIS Alzheimer’s 415 52.8 25.1 38.6 1 100
PPMI Parkinson’s 390 61.2 10.0 62.6 284
TABLE I: Demographics of 15 neuroimaging datasets.

Iii Name That Dataset

Fig. 1: Left: Dataset classification accuracy for volume, thickness, and the combination of both, together with age and sex. The percentage of the data used for training is shown in log scale. Curves show the average score over 50 repetitions, error bars show the standard deviation. Right: Confusion matrix for volume and thickness with 70% training data.

In order to evaluate the impact of dataset bias, we play the game Name That Dataset on neuroimaing data that was originally proposed by Torralba and Efros [26] on natural images. The task is to predict the dataset a scan is coming from solely based on image measurements. Fig. 1 illustrates the performance for classifying the 15 datasets for different image features. A random forest classifier with default settings was used for the prediction [3]. The splitting of training and testing dataset is done under consideration of the dataset. The performance of image-based classifiers increases logarithmically with the amount of training data. If no dataset bias was present, the prediction accuracy should be close to random chance (6.7% for 15 datasets). As not all datasets have the same size and have different distributions of age and sex, we compare to results of a classifier trained on age and sex as baseline. With only 0.1% of the data used for training, volume measures perform similar to prediction with meta data. As we increase the amount of training data to 70%, the accuracy increases over 73.3% for the combination of volume and thickness features, which perform better than each of them alone. Compared to 42.2% for age and sex, this illustrates that there is a strong bias in the datasets that cannot be explained by basic demographics. We focused the analysis on selecting only healthy controls, because we thought that the inclusion of patients would facilitate the classification. However, the results are similar, as shown for the combination of volume and thickness in Fig. 1.

From the confusion matrix, we see that datasets with a similar population result in higher confusion, e.g., between ABIDE I, ABIDE II, and ADHD200. Single site datasets like HCP are very homogeneous and do therefore show almost no confusion with any of the other datasets. In contrast, multi-site datasets like CORR that also cover a wide age range, show high confusion with other datasets. Overall, however, high classification accuracy and the strong diagonal indicate that datasets possess unique, identifiable characteristics.

The lesson learned from this experiment is that even when working with image-derived values that represent physical measures (volume, thickness), there is still substantial bias in datasets, although techniques like atlas renormalization [11] were employed to improve consistency across scanners. Of course, much of the bias can be attributed to the different goals of the studies, like the inclusion of subjects from different age groups. However, even when focusing on datasets that cover a similar age range, we observe a high accuracy. While we are not aware of previous attempts on trying to Name That Dataset, our results echo concerns raised in previous studies. In a large ENIGMA study of over 15,000 subjects on brain asymmetry [10], it was reported that dataset heterogeneity explained over 10% of the total observed variance per structure. On the ADNI, with an optimized MPRAGE imaging protocol across all sites [12], the intra-subject variability of compartment volumes for scans on different scanners was roughly 10 times higher than repeated scans on the same scanner [14]. Similarly, previous studies reported on a drop of accuracy when training on different datasets [29] or working with multi-site data [21].

Iv Quantifying Dataset Compatibility

Iv-a Compatibility Metrics

Having shown the presence of dataset bias, our next aim is to define metrics that quantify their compatibility. Given data sources and , the metric expresses the compatibility among them. As first metric, we propose to compute the Bhattacharyya distance between data sources. To this end, we estimate multivariate normal distributions and , respectively, and compute the Bhattacharyya distance between them . The dimensionality of the distributions corresponds to the number of image-derived measures, where we use brain structure volumes in our experiments.

As second metric, we propose to compute the age prediction error, which includes a variable from subject demographics (age). We train an age regression model on the source set and predict on the target set . Since we know the chronological age on the target set, we compute the average mean age prediction error, . To have a symmetric metric, we set . Age estimation has previously been used for modeling healthy aging and differentiating it to abnormal aging in dementia [8, 2] and has the advantage, in contrast to other prediction tasks, that age is a commonly available variable. We use random forest regression on volume measures for the age regression. While the Bhattacharyya distance is measuring the similarity of image features, the age prediction error expresses how well one data source is suited for training a model that is deployed on a second dataset.

Fig. 2: Embedding of neuroimaging sites. Color encodes dataset. Name of site and dataset displayed next to point. Four clusters are highlighted.

Iv-B Site Embedding

To investigate the similarity across datasets, we create an embedding based on the metrics. However, many of the large neuroimaging datasets are multi-site datasets, i.e., scans were acquired at different scanning sites. Some initiatives like the ADNI put major efforts in the standardization of scans across sites. Other multi-site datasets like ABIDE [5] retrospectively aggregate data that was independently acquired from laboratories around the world. To study the variation in such datasets, we perform an analysis of variance (ANOVA) on the ABIDE I dataset with age, age squared, sex, diagnosis, and site as variables. For putamen, amygdala, and nucleus accumbens, the percentage of variance explained by site is 20.9%, 23.7%, and 32.7%, respectively, while the total variance explained from all variables ranged between 32.9% to 38.7%. Site is therefore the major source of variation, several times higher than age, sex, or diagnosis. Based on these results, we will operate on the level of sites, instead of datasets, in the following.

We compute the metric between all pairs of sites in our data, where we limit the analysis to sites with more than 25 subjects to have enough samples for a reliable estimation. Based on the pair-wise age prediction across all sites, we use the resulting distance matrix in t-SNE [16] for visualizing the similarity of sites. Fig. 2 shows the embedding, where the age prediction error was used as metric and the perplexity in t-SNE was set to 5. We only show results for the age prediction error in this experiment because it yielded a clearer separation of datasets. We compare both metrics in section IV-C.

It is striking to see that some sites are more similar to sites from other datasets than to sites from the same dataset. We observe four clusters. Cluster I contains all sites from ADNI and AIBL, representing old subjects. Cluster II consists of sites from IXI, NKI, COBRE, and OASIS, which include subjects from a very wide age range. Cluster III has younger subjects mainly in their twenties, including GSP and HCP, together with many sites from ABIDE and CORR. Cluster IV mainly contains children and adolescents, e.g., HBN and sites from ABIDE. In Fig. 3, we show the same embedding as in Fig. 2 but with the label color according to the age. It is natural to see that the major variations are due to age, due to its predominant impact on brain morphology [24, 28]. Within those clusters age is relatively homogeneous so that other factors like field strength and manufacturer can play a role. All in all, we believe that such an embedding of the majority of neuroimaging datasets is of great value to clarify the relationship between different datasets. In addition, it could be used to guide the combination of data from sites, while limiting the introduction of unwanted variation.

Fig. 3: Embedding of neuroimaging sites. Color encodes mean age of site. Name of site and dataset displayed next to data point.

Iv-C Incorporate Bias in Training Set Selection

We demonstrate the benefits of the compatibility metric for the classification of autism, where we only operate with the ABIDE I + II datasets because the other datasets do not contain autistic subjects. To this end, we select one site for testing and we compose the training set based on the metric . The rationale is that sites that are close to the target site will be better suited for training a classifier than sites that are very distant. In details, we sample the training set from the source dataset that consists of all sites, except for the testing site. We encourage the selection of samples from sites that are near by setting the probability of the sample being selected proportional to , the negative exponential of the site metric. As baseline, we use a uniform distribution, which corresponds to random sampling. Fig. 4 illustrates the autism classification accuracy for the two largest sites in ABIDE I and ABIDE II, respectively. We observe that selecting the training set with either of the two metrics outperforms the random selection, and further that the computation of the distance with age prediction yields the best results.

Noteworthy, the selection algorithm is driven by image measurements. This makes it on the one hand very versatile, as it can be easily applied to image archives with T1-weighted MRI scans. On the other hand, by directly operating on the output, this models all of the previously discussed biases.

Fig. 4: Autism prediction results for site UM from ABIDE I (left) and site KKI from ABIDE II (right) by training set selection with Bhattacharyya distance, age prediction, and random sampling. Curves show the average score over 200 repetitions, error bars show the standard deviation.

V Conclusion

On a large collection of datasets with 12,207 individuals, we have illustrated that dataset bias has a strong influence on neuroimaging measures. We have quantified dataset compatibility with metrics based on the age prediction error and the Bhattacharyya distance. Computation of the metric between all pairs of neuroimaging sites enabled the creation of an embedding, which illustrated that sites across datasets can be more similar than sites within datasets. Finally, we demonstrated the advantages of incorporating dataset bias for training set selection in autism prediction, where age prediction outperformed Bhattacharyya distance. Overall, we believe that the growing amount of neuroimaging data necessitates to incorporate data-driven methods for quantifying dataset bias in future analyses.

Acknowledgement: This work was supported in part by the Bavarian State Ministry of Education, Science and the Arts in the framework of the Centre Digitisation.Bavaria (ZD.B).




  1. Alexander, L.M., Escalera, J., Ai, L., Andreotti, C., Febre, K., Mangone, A., Potler, N.V., Langer, N., et al.: An open resource for transdiagnostic research in pediatric mental health and learning disorders. bioRxiv p. 149369 (2017)
  2. Becker, B.G., Klein, T., Wachinger, C.: Gaussian process uncertainty in age estimation as a measure of brain abnormality. NeuroImage (2018)
  3. Breiman, L.: Random forests. Machine learning 45(1), 5–32 (2001)
  4. Buckner, R., Hollinshead, M., Holmes, A., Brohawn, D., Fagerness, J., O’Keefe, T., Roffman, J.: The brain genomics superstruct project. Harvard Dataverse Network (2012)
  5. Di Martino, A., Yan, C., et al.: The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry 19(6), 659–667 (2014)
  6. Ellis, K., Bush, A., Darby, D., et al.: The australian imaging, biomarkers and lifestyle (aibl) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease. International Psychogeriatrics 21(04), 672–687 (2009)
  7. Fischl, B., Salat, D.H., et al.: Whole brain segmentation: automated labeling of neuroanatomical structures in the human brain. Neuron 33(3), 341–355 (2002)
  8. Franke, K., Ziegler, G., Klöppel, S., Gaser, C., Initiative, A.D.N., et al.: Estimating the age of healthy subjects from t 1-weighted mri scans using kernel methods: Exploring the influence of various parameters. Neuroimage 50(3), 883–892 (2010)
  9. Gollub, R.L., Shoemaker, J., King, M., White, T., Ehrlich, S., Sponheim, S., Clark, V., Turner, J., Mueller, B., Magnotta, V., et al.: The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11(3), 367–388 (2013)
  10. Guadalupe, T., Mathias, S.R., Theo, G., et al.: Human subcortical brain asymmetries in 15,847 people worldwide reveal effects of age and sex. Brain imaging and behavior 11(5), 1497–1514 (2017)
  11. Han, X., Fischl, B.: Atlas renormalization for improved brain mr image segmentation across scanner platforms. IEEE transactions on medical imaging 26(4), 479–486 (2007)
  12. Jack, C.R., Bernstein, M.A., Fox, N.C., Thompson, P., Alexander, G., Harvey, D., Borowski, B., Britson, P.J., L Whitwell, J., Ward, C., et al.: The alzheimer’s disease neuroimaging initiative (adni): Mri methods. Journal of magnetic resonance imaging 27(4), 685–691 (2008)
  13. Jovicich, J., Czanner, S., Han, X., Salat, D., van der Kouwe, A., Quinn, B., Pacheco, J., Albert, M., Killiany, R., Blacker, D., et al.: Mri-derived measurements of human subcortical, ventricular and intracranial brain volumes: reliability effects of scan sessions, acquisition sequences, data analyses, scanner upgrade, scanner vendors and field strengths. Neuroimage 46(1), 177–192 (2009)
  14. Kruggel, F., Turner, J., Muftuler, L.T., Initiative, A.D.N., et al.: Impact of scanner hardware and imaging protocol on image quality and compartment volume precision in the adni cohort. Neuroimage 49(3), 2123–2133 (2010)
  15. LeWinn, K.Z., Sheridan, M.A., Keyes, K.M., Hamilton, A., McLaughlin, K.A.: Sample composition alters associations between age and brain structure. Nature Communications 8(1), 874 (2017)
  16. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of Machine Learning Research 9(Nov), 2579–2605 (2008)
  17. Marcus, D.S., Wang, T.H., Parker, J., Csernansky, J.G., Morris, J.C., Buckner, R.L.: Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. J. Cognitive Neurosci. 19(9), 1498–1507 (2007)
  18. Marek, K., Jennings, D., Lasch, S., Siderowf, A., Tanner, C., Simuni, T., Coffey, C., Kieburtz, K., Flagg, E., Chowdhury, S., et al.: The parkinson progression marker initiative (ppmi). Progress in neurobiology 95(4), 629–635 (2011)
  19. Mayer, A., Ruhl, D., Merideth, F., Ling, J., Hanlon, F., Bustillo, J., Cañive, J.: Functional imaging of the hemodynamic sensory gating response in schizophrenia. Human brain mapping 34(9), 2302–2312 (2013)
  20. Milham, M.P., Fair, D., Mennes, M., Mostofsky, S.H., et al.: The adhd-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in systems neuroscience 6,  62 (2012)
  21. Nielsen, J.A., Zielinski, B.A., Fletcher, P.T., Alexander, A.L., Lange, N., Bigler, E.D., Lainhart, J.E., Anderson, J.S.: Multisite functional connectivity mri classification of autism: Abide results. Frontiers in human neuroscience 7 (2013)
  22. Nooner, K.B., Colcombe, S.J., Tobe, R.H., Mennes, M., Benedict, M.M., Moreno, A.L., Panek, L.J., Brown, S., Zavitz, S.T., Li, Q., et al.: The nki-rockland sample: a model for accelerating the pace of discovery science in psychiatry. Frontiers in neuroscience 6 (2012)
  23. Nugent, A.C., Luckenbaugh, D.A., Wood, S.E., Bogers, W., Zarate, C.A., Drevets, W.C.: Automated subcortical segmentation using first: test–retest reliability, interscanner reliability, and comparison to manual segmentation. Human brain mapping 34(9), 2313–2329 (2013)
  24. Potvin, O., Dieumegarde, L., Duchesne, S.: Normative morphometric data for cerebral cortical areas over the lifetime of the adult human brain. Neuroimage 156, 315–339 (2017)
  25. Smith, S.M., Nichols, T.E.: Statistical challenges in ”big data” human neuroimaging. Neuron 97(2), 263–268 (2018)
  26. Torralba, A., Efros, A.A.: Unbiased look at dataset bias. In: Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. pp. 1521–1528. IEEE (2011)
  27. Van Essen, D.C., Smith, S.M., Barch, D.M., Behrens, T., Yacoub, E., Ugurbil, K., Consortium, W.M.H., et al.: The wu-minn human connectome project: an overview. Neuroimage 80, 62–79 (2013)
  28. Wachinger, C., Golland, P., Kremen, W., Fischl, B., Reuter, M.: Brainprint: A discriminative characterization of brain morphology. NeuroImage 109 (2015)
  29. Wachinger, C., Reuter, M.: Domain adaptation for alzheimer’s disease diagnostics. Neuroimage 139, 470–479 (2016)
  30. Zuo, X.N., Anderson, J.S., Bellec, P., et al.: An open science resource for establishing reliability and reproducibility in functional connectomics. Scientific data 1, 140049 (2014)
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description