RIDDLE: Race and ethnicity Imputation from Disease history with Deep LEarning
Ji-Sung Kim1, Xin Gao2, Andrey Rzhetsky3*
1 Center for Statistics and Machine Learning, Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
2 King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal, Saudi Arabia.
3 Institute for Genomics and Systems Biology, Computation Institute, Departments of Medicine and Human Genetics, University of Chicago, Chicago, Illinois, United States of America
This manuscript was revised and published in PLOS Computational Biology. This arXiv preprint is now outdated; the updated article is available as a free and open access publication at https://doi.org/10.1371/journal.pcbi.1006106.
Anonymized electronic medical records are an increasingly popular source of research data. However, these datasets often lack race and ethnicity information. This creates problems for researchers modeling human disease, as race and ethnicity are powerful confounders for many health exposures and treatment outcomes; race and ethnicity are also closely linked to population-specific genetic variation. We showed that deep neural networks generate more accurate estimates for missing racial and ethnic information than competing methods (e.g., logistic regression, random forest). RIDDLE yielded significantly better classification performance across all metrics considered: accuracy, cross-entropy loss (error), and area under the curve for receiver operating characteristic plots. We made specific efforts to interpret the trained neural network models to identify, quantify, and visualize medical features which are predictive of race and ethnicity. We used these characterizations of informative features to perform a systematic comparison of differential disease patterns by race and ethnicity. The fact that clinical histories are informative for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health care, (3) possible variation in lifestyle, such as dietary habits, and (4) differences in background genetic variation which predispose to diseases.
Electronic medical records (EMRs) are an increasingly popular source of biomedical research data . EMRs are digital records of patient medical histories, describing the occurrence of specific diseases and medical events such as the observation of heart disease or dietary counseling. EMRs can also contain demographic information such as gender or age.
However, these datasets are often anonymized and lack race and ethnicity information (e.g., insurance claims datasets). Race and ethnicity information may also be missing for specific individuals within datasets. This is problematic in research settings, as race and ethnicity can be a powerful confounder for a variety of effects. Race and ethnicity are strong correlates of socioeconomic status, a predictor of access to and quality of education and healthcare. These factors are differentially associated with disease incidence and trajectories. As a result of this correlation, race and ethnicity may be associated with variation in medical histories. As an example, it has been reported that referrals for cardiac catheterization are rarer among African American patients than among White patients . Furthermore, researchers have reported differences in genetic variation which influence disease across racial and ethnic groups . Due to the association between race and ethnicity and medical histories, we hypothesize that clinical features in EMRs can be used to impute missing race and ethnicity information.
In addition, race and ethnicity information can be useful for producing and investigating hypotheses in epidemiology. For example, variation in disease risk across racial and ethnic groups that cannot be fully explained by allele frequency information may provide insights into the possible environmental modifiers of genes .
The task of race and ethnicity imputation can be framed as a supervised learning problem. Typically, the goal of imputation is to estimate a posterior probability distribution over plausible values for a missing variable. This distribution of plausible values can be used to generate a single imputed dataset (e.g., by choosing the plausible values with highest probability), or to generate multiple imputed datasets as in multiple imputation . In our setting, the goal was to impute the distribution of mutually exclusive racial and ethnic classes given a set of clinical features. Features comprised age, gender, and codes from the International Classification of Diseases, Ninth Revision (ICD9, ); ICD9 codes describe medical conditions, medical procedures, family information, and some treatment outcomes.
Bayesian approaches to race and ethnicity imputation using census data have been proposed  and have been used for race and ethnicity imputation in EMR datasets . However, these approaches require sensitive geolocation and surname data from patients. Geolocation and surname data can be missing in anonymized EMR datasets (as in the datasets used here), limiting the utility of approaches which use this information.
1.2 Deep learning
Traditionally, logistic regression classifiers have been used to impute categorical variables such as race and ethnicity . However, there has been recent interest in the use of deep learning for solving similar supervised learning tasks. Deep learning is particularly exciting as it offers the ability to automatically learn complex representations of high-dimensional data. These representations can be used to solve learning tasks such as regression or classification .
Deep learning involves the approximation of some utility function (e.g., classification of an image) as a neural network. A neural network is a directed graph of functions which are referred to as units, neurons or nodes. This network is organized into several layers; each layer corresponds to a different representation of the input data. As the input data is transformed and propagated through this network, the data at each layer corresponds to a new alternate-dimension representation of the sample . For our imputation task, the aim was to learn the representation of an individual as a mixture of race and ethnicity classes where each class is assigned a probability. This representation is encoded in the final output layer of the neural network. The output of a neural network functions as a prediction of the distribution of race and ethnicity classes given a set of input features.
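To make this final representation concrete, the output layer's raw scores can be converted into a probability distribution over the four race and ethnicity classes with a softmax function. The sketch below is a minimal pure-Python illustration, not RIDDLE's implementation; the raw scores and class order are hypothetical.

```python
import math

def softmax(logits):
    """Convert raw output-layer scores into a probability distribution."""
    m = max(logits)                           # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# hypothetical raw scores for four classes (e.g., White, Black, Other, Hispanic)
probs = softmax([2.0, 1.0, 0.5, -1.0])
```

The resulting `probs` sums to one, and the class with the largest raw score receives the largest probability.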
We introduce a framework for using deep learning to estimate missing race and ethnicity information in EMR datasets: RIDDLE or Race and ethnicity Imputation from Disease history with Deep LEarning. RIDDLE uses a relatively simple multilayer perceptron (MLP), a type of neural network architecture that is a directed acyclic graph (see Fig 1). For its nodes, our neural network architecture utilizes Parametric Rectified Linear Units (PReLUs) , which are rectifier functions of the form

f(x) = x if x > 0, and f(x) = a x otherwise,

where x is the input to the node, a is a coefficient learned during training, and f(x) is the output of the PReLU node. Further details of RIDDLE’s implementation are described in the Methods.
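The PReLU rectifier can be sketched in a couple of lines. This is a toy illustration only: in an actual trained network the slope coefficient `a` is learned per node, whereas the default value of 0.25 below is an assumption for demonstration.

```python
def prelu(x, a=0.25):
    """Parametric ReLU: identity for positive inputs, slope `a` for negative
    inputs. The coefficient `a` is learned during training; 0.25 here is
    only an illustrative default."""
    return x if x > 0 else a * x
```

Unlike a plain ReLU, a PReLU passes a small (learned) fraction of negative inputs through rather than zeroing them.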
In addition to investigating the novel utility of deep learning for race and ethnicity imputation, we used recent methods in interpreting neural network models  to perform a systematic evaluation of racial and ethnic patterns for approximately 15,000 different medical events. We believe that this type of large-scale evaluation of disease patterns and maladies by race and ethnicity has not been done heretofore.
We aimed to assess RIDDLE’s imputation performance in a multiclass classification setting. We used EMR datasets from Chicago and New York City, collectively describing over 1.5 million unique patients. There were approximately 15,000 unique input features consisting of basic demographic information (gender, age) and observations of clinical events (codified as ICD9 codes). The target class was race and ethnicity; possible values were White, Black, Other or Hispanic. Although race and ethnicity can be described as a mixture, our training datasets labeled race and ethnicity as one of four mutually exclusive classes. For the testing set, we treated the target race and ethnicity class as missing, and compared the predicted class against the true class. The large dimensionality of features, high number of samples, and heterogeneity of the source populations present a unique and challenging classification problem.
In our experiments, RIDDLE yielded an average accuracy of 0.671, top-two accuracy of 0.865, and cross-entropy loss of 0.849 on test data, significantly outperforming logistic regression and random forest classifiers (see Fig 2). Support vector machines (SVMs) with various kernels were also evaluated. However, SVMs could not be feasibly used as the computational cost was too high; each experiment required more than 36 hours.
While the multiclass learning problem appeared relatively hard, RIDDLE exhibited class-specific receiver operating characteristic (ROC) area under the curve (AUC) values above 0.8 (see Fig 3), and a mean micro-average (all cases considered as binary) AUC of 0.877 – significantly higher than that of logistic regression (mean = 0.854) and random forest (mean = 0.799) classifiers.
RIDDLE exhibited runtime performance comparable to that of other machine learning methods on a standard computing configuration without the use of a graphics processing unit or GPU (see Table 1). Support vector machines were also evaluated, but precise runtime measurements could not be obtained as experiments took greater than 36 hours each (36 hours runtime was the allowed maximum on the system used in our analysis). However, on a smaller subset (150K samples) of the full dataset, RIDDLE exhibited significantly better classification accuracy and faster runtime performance than SVMs with various kernels (see Table 3 in the Supporting Information).
|Method||Average runtime (h)|
|SVM, linear kernel||>36|
|SVM, polynomial kernel||>36|
|SVM, RBF kernel||>36|
2.1 Influence of missing data on classifier performance
In order to replicate real-world applications where data other than race and ethnicity (e.g., features) may be missing, we conducted additional experiments to simulate random missing data. A random subset of sample features (ranging from 10% to 30% of all features) was artificially masked completely at random. Otherwise, the same classification training and evaluation scheme was used as before. Under simulation of random missing data, RIDDLE significantly outperformed logistic regression and random forest classifiers in terms of classification accuracy (see Fig 4).
2.2 Feature interpretation
A major criticism of deep learning is the opaqueness of trained neural network models to intuitive interpretation. While intricate functional architectures enable neural networks to learn complex tasks, they also create a barrier to understanding how learning decisions (e.g., classifications) are made. In addition to creating a precise race and ethnicity estimation framework, we sought to identify and describe the factors which contribute to these estimations. We computed DeepLIFT (Deep Learning Important FeaTures) scores to quantitatively describe how specific features contribute to the probability estimates of each class. The DeepLIFT algorithm compares the activation of each node to a reference activation; the difference between the reference and observed activation is used to compute the contribution score of a neuron to a class .
If a feature contributes to selecting for a particular class, this feature-class pair is assigned a positive DeepLIFT score; conversely, if a feature contributes to selecting against a particular class, the pair is assigned a negative score. The magnitude of a DeepLIFT score represents the strength of the contribution.
Using DeepLIFT scores, we were able to construct natural orderings of race and ethnicity classes for each feature, sorting classes from positive to negative scores. For example, such an ordering shows that the heart disease feature is a strong predictor for the African American class, and a weak (or negative) predictor for the Other class.
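This sorting step can be sketched directly. The aggregate DeepLIFT scores below are hypothetical values chosen to illustrate the heart-disease example, not figures from the paper.

```python
def class_ordering(scores):
    """Order race/ethnicity classes from most-for to most-against a feature,
    by descending aggregate DeepLIFT score."""
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical aggregate DeepLIFT scores for a heart-disease feature
heart_disease = {"Black": 0.8, "Hispanic": 0.2, "White": -0.1, "Other": -0.5}
ordering = class_ordering(heart_disease)  # strongest-for class first
```

Classes with positive scores (selected for) precede classes with negative scores (selected against) in the resulting ordering.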
We computed the class orderings for all of the approximately 15,000 features. The orderings of the 10 most predictive features (by highest range of DeepLIFT scores) are described in Table 2.
We visualized the orderings of the 25 most common features using both frequencies and DeepLIFT scores (see Fig 5). Race and ethnicity class orderings obtained from frequency scores were distinctly different than those obtained from DeepLIFT scores. This suggests that RIDDLE’s MLP network is able to learn non-linear and non-frequentist relationships between ICD9 codes and race and ethnicity categories.
According to orderings constructed using DeepLIFT scores, sex is an important feature for predicting race and ethnicity in our models: men who seek medical attention are least likely to be African American, followed by Hispanic men. Men who seek medical attention are most likely to be White or Other.
In addition, specific medical diagnoses convey grains of racial and ethnic information: hypertension and human immunodeficiency virus (HIV) are more predictive for African American individuals than White individuals. This finding is also reflected in medical literature, where it has been reported that African Americans are at significantly higher risk for heart disease [12, 13] and HIV [14, 15] than their White peers.
The fact that these features are important for imputing race and ethnicity could reflect (1) a skewed distribution of blue- and white-collar professions across racial and ethnic groups, (2) uneven accessibility and subjective importance of prophylactic health care across racial and ethnic groups, (3) and possible variation in lifestyle, such as dietary habits. Further work would involve investigating epidemiological hypotheses on how these environmental factors may affect differential clinical patterns across race and ethnicity.
Some genetic diseases are famously discriminative across races and ethnicities. For example, sickle cell disease occurs 88 times more frequently in African Americans than in the rest of the US population . In our model, sickle cell anemia most strongly predicts for the African American class. It has been reported that Lyme disease predominantly occurs in Whites, and is largely unreported for Hispanics and African Americans . This finding is also reflected in our model, where Lyme disease serves as a strong predictor of the White class. Additional strongly White-predictive diseases and medical procedures include atrial fibrillation, hypothyroidism, prostate neoplasm, dressing and sutures, lump in breast, and coronary atherosclerosis. These are primarily diseases of older age, suggesting that lifespan varies across race and ethnicity for socioeconomic and lifestyle reasons.
These orderings provide a high-level description of community structure, and may reflect socioeconomic, cultural, habitual, and genetic variation linked to race and ethnicity across the population of two large cities, New York City and Chicago.
In our experiments, RIDDLE yielded favorable classification performance with class-specific AUC values of above 0.8. RIDDLE displayed significantly better classification performance across all tested metrics compared to the popular classification methods logistic regression and random forest. RIDDLE’s superior top-two accuracy and loss results suggest that RIDDLE produces more accurate probability estimates for race and ethnicity classes compared to currently used techniques. Although results could not be obtained for SVMs due to unacceptably high computational costs, RIDDLE outperformed SVMs in runtime efficiency and classification performance on a smaller subset of the full dataset (see Table 3 in the Supporting Information).
Furthermore, RIDDLE, without the use of a GPU, displayed runtimes comparable to those of traditional classification techniques and required less memory. With these findings, we argue that deep-learning-driven imputation offers notable utility for race and ethnicity imputation in anonymized EMR datasets. Our current work simulated conditions where ethnicity was missing completely at random. Future work will involve simulating conditions where race and ethnicity are missing at random or missing not at random, and formalizing a multiple imputation framework involving deep-learning estimators.
However, these results also highlight a growing privacy concern. It has been shown that the application of machine learning poses non-trivial privacy risks, as sensitive information can be recovered from non-sensitive features . Our results underscore the need for further anonymization in clinical datasets where race and ethnicity are private information; simple exclusion is not sufficient.
In addition to assessing the predictive and computational performance of our imputation framework, we made efforts to analyze how specific features contribute to race and ethnicity imputations in our neural network model. Each individual feature may represent only a weak trend, but together numerous indicators can synergize to provide compelling evidence of how a person’s lifestyle, social circles, and even genetic background can vary by race and ethnicity.
The aforementioned highlights of race- and ethnicity-influenced patterns of health diversity and disparity (see the Results) can be extended to thousands of codes. To the best of our knowledge, our study is the first to perform this systematic comparison across all classes of maladies with respect to race and ethnicity.
4.1 Ethics Statement
Our study used de-identified, independently collected patient data, and was determined by the Institutional Review Board (IRB) of the University of Chicago to be exempt from further IRB review, under the Federal Regulations category 45 CFR 46.101(b).
We used anonymized EMR datasets jointly comprising 1,650,000 individual medical histories from the New York City (Columbia University) and Chicago (University of Chicago) metropolitan populations. Medical histories are encoded as variable-length lists of ICD9 codes (approximately 15,000 unique codes) coupled with onset ages in years. Each individual belongs to one of four mutually exclusive classes of race (Other, White, Black) or ethnicity (Hispanic). Features included quinary gender (male, female, trans, other, unknown) and reported age in years.
Onset age information of each ICD9 code was removed and continuous age information was coerced into discrete integer categories. Features were vectorized in a binary encoding scheme, where each individual is represented by a binary vector of zeros (feature absent) and ones (feature present). Each element in the binary encoded vector corresponds to an input node in the trained neural network (see Fig 1).
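The binary encoding step can be sketched as follows. The vocabulary and feature names below are hypothetical stand-ins for the real ~15,000-entry feature set (e.g., ICD9 code 401.9 for hypertension is used purely for illustration).

```python
def binary_encode(features, vocab):
    """Map a patient's feature list to a fixed-length 0/1 vector.
    `vocab` maps each feature (ICD9 code, age bin, gender) to a vector index."""
    vec = [0] * len(vocab)
    for f in features:
        vec[vocab[f]] = 1  # feature present -> 1; absent positions stay 0
    return vec

# toy vocabulary; the real model uses ~15,000 features
vocab = {"401.9": 0, "V65.3": 1, "age_40-49": 2, "gender_F": 3}
encoded = binary_encode(["401.9", "gender_F"], vocab)  # -> [1, 0, 0, 1]
```

Each position in the encoded vector corresponds to one input node of the network.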
K-fold cross-validation (K = 10) with random shuffling was used to produce ten complementary subsets of training and testing data, corresponding to ten classification experiments; this allowed for test coverage of the entire dataset. From the training set, approximately 10% of samples were used as holdout validation data for parameter tuning and performance monitoring. Testing data was held out separately and was only used during the evaluation process.
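The fold-splitting scheme can be sketched in pure Python (a minimal illustration, not the paper's actual pipeline, which likely relied on library utilities):

```python
import random

def kfold_indices(n, k=10, seed=0):
    """Shuffle sample indices and split them into k disjoint test folds.
    Every sample lands in exactly one test fold, giving full test coverage."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

folds = kfold_indices(1000, k=10)
```

For each experiment, one fold serves as the test set and the remaining nine (minus a ~10% validation holdout) are used for training.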
4.3 Hyperparameter tuning
Hyperparameters were selected using randomized grid search on 10,000 samples from the validation data. It has been reported that randomized grid search requires far less computational effort than exhaustive grid search, with only slightly worse performance .
4.4 A deep learning approach
We used Keras  with a TensorFlow backend  to train a deep multilayer perceptron (MLP) with parametric rectified linear units (PReLUs) . The network was composed of an input layer of 15,122 nodes, two hidden layers of 512 PReLU nodes each, and a softmax output layer of four nodes (see Fig 1). Dropout regularization was applied to each hidden layer . The MLP was trained iteratively using the Adam optimizer . Training was performed in a batch-wise fashion; data vectorization (via binary encoding) was also done batch-wise in coordination with training. The large number of samples (1.65M) and attention to scalability necessitated “on the fly” vectorization. The number of training epochs (passes over the data) was determined by early stopping with patience and model caching , where the model from the epoch with minimal validation loss was selected.
Categorical cross-entropy was chosen as the loss function; categorical cross-entropy penalizes the assignment of lower probability on the correct class and the assignment of non-zero probability to incorrect classes.
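The loss computation for a single sample can be sketched as follows (a minimal pure-Python illustration; frameworks such as Keras compute this in batched, vectorized form):

```python
import math

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Loss for a one-hot label `y_true` and predicted distribution `y_pred`.
    Only the log-probability of the correct class enters the sum directly;
    probability diverted to wrong classes lowers it, raising the loss."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

confident = categorical_cross_entropy([0, 1, 0, 0], [0.05, 0.90, 0.03, 0.02])
diffuse   = categorical_cross_entropy([0, 1, 0, 0], [0.25, 0.25, 0.25, 0.25])
# a confident correct prediction incurs lower loss than a diffuse one
```

The `eps` clamp guards against `log(0)` when the model assigns zero probability to the true class.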
4.5 Other machine learning approaches
We evaluated several other machine learning approaches: random forest classifiers, logistic regression, and support vector machines (SVMs) with various kernels (linear, polynomial, radial basis function). Traditionally, logistic regression has been used for categorical imputation tasks . We used the fast Cython (C compiled from Python) implementations of these methods offered in the popular scikit-learn library.
4.6 Missing data simulation
In order to replicate real-world scenarios where additional information (other than race and ethnicity) may be absent, we conducted simulation experiments in which we randomly removed some proportion of the feature data (10%, 20%, or 30%). We conducted separate training and testing pipelines with these new “deficient” datasets.
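The masking procedure can be sketched as follows (an illustrative implementation operating on the binary-encoded vectors; the exact masking code used in the study may differ):

```python
import random

def mask_features(vector, frac, seed=0):
    """Zero out a random `frac` of feature positions, simulating features
    missing completely at random."""
    rng = random.Random(seed)
    n_mask = int(len(vector) * frac)
    masked = list(vector)
    for i in rng.sample(range(len(vector)), n_mask):
        masked[i] = 0
    return masked

patient = [1] * 10              # toy all-present feature vector
deficient = mask_features(patient, 0.3)  # 30% of positions zeroed
```

Because absent features are already encoded as zeros, masking a present feature is indistinguishable from the feature never having been observed.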
We computed standard accuracy and cross-entropy loss scores for testing data across all ten experiments. We also computed top-two accuracy, a special case of top-k accuracy. In top-k accuracy, a prediction is considered correct if the true class is contained within the k classes with the highest probability assignments. In addition to evaluating classification performance, we also monitored runtime performance across methods. Models were trained on a standard computing configuration: 16 Intel Sandy Bridge cores at 2.6 GHz.
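Top-k accuracy can be sketched as follows (a minimal illustration; the probability values below are made up):

```python
def top_k_accuracy(y_true, probs, k=2):
    """Fraction of samples whose true class index is among the k classes
    with the highest predicted probabilities."""
    hits = 0
    for label, p in zip(y_true, probs):
        top = sorted(range(len(p)), key=lambda c: p[c], reverse=True)[:k]
        hits += label in top
    return hits / len(y_true)

# made-up predicted distributions for two samples over four classes
probs = [[0.5, 0.3, 0.1, 0.1], [0.1, 0.2, 0.4, 0.3]]
acc_top2 = top_k_accuracy([1, 3], probs, k=2)  # both true classes rank in the top two
```

With k = 1 this reduces to standard accuracy.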
Significant differences in performance scores were detected using paired t-tests with Bonferroni adjustment.
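The Bonferroni step can be sketched as a simple threshold adjustment (an illustrative helper; the p-values below are hypothetical, and the study would have obtained them from paired t-tests on per-fold scores):

```python
def bonferroni_significant(p_values, alpha=0.05):
    """Bonferroni correction: with m comparisons, each p-value must fall
    below alpha / m to be declared significant."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

# hypothetical p-values from three pairwise method comparisons
flags = bonferroni_significant([0.001, 0.02, 0.2])  # threshold is 0.05 / 3
```

The correction controls the family-wise error rate at `alpha` across all pairwise method comparisons.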
4.8 Neural network interpretation
We computed DeepLIFT scores to interpret how certain features contribute to probability estimates for each class . The DeepLIFT algorithm takes a trained neural network and produces feature-to-class contribution scores for each passed sample. We computed DeepLIFT scores using test samples in each of our K-fold (K = 10) cross-validation experiments, to achieve full coverage of the dataset. To describe high-level relationships between features and classes, we summed scores across all samples to produce an aggregate score. The aggregate DeepLIFT scores for the ten most predictive features are summarized in Table 2. As described above, we computed orderings of race and ethnicity classes with each feature’s DeepLIFT scores. These orderings describe how certain features (e.g., medical conditions) can predict for or against a particular race and ethnicity class. We visualize the orderings defined by DeepLIFT scores for the twenty-five most common features in Fig 5, and compare them to orderings produced from total frequencies of feature-class observations. For the visualizations, frequency counts were mean-centered to facilitate comparison to DeepLIFT scores.
We are grateful to Drs. Raul Rabadan and Rachel Melamed for preparing the Columbia University dataset. This work was funded by the DARPA Big Mechanism program under ARO contract W911NF1410333, by NIH grants R01HL122712, 1P50MH094267, U01HL108634-01, and a gift from Liz and Kent Dauten.
|Method||Average accuracy||Average runtime (s)|
|SVM, linear kernel|
|SVM, polynomial kernel|
|SVM, Gaussian kernel|
- P. B. Jensen, L. J. Jensen, and S. Brunak, “Mining electronic health records: towards better research applications and clinical care,” Nature Reviews Genetics, vol. 13, no. 6, pp. 395–405, 2012.
- K. A. Schulman, J. A. Berlin, W. Harless, J. F. Kerner, S. Sistrunk, B. J. Gersh, R. Dube, C. K. Taleghani, J. E. Burke, S. Williams, et al., “The effect of race and sex on physicians’ recommendations for cardiac catheterization,” New England Journal of Medicine, vol. 340, no. 8, pp. 618–626, 1999.
- E. G. Burchard, E. Ziv, E. J. Pérez-Stable, and D. Sheppard, “The importance of race and ethnic background in biomedical research and clinical practice,” The New England Journal of Medicine, vol. 348, no. 12, p. 1170, 2003.
- J. A. Sterne, I. R. White, J. B. Carlin, M. Spratt, P. Royston, M. G. Kenward, A. M. Wood, and J. R. Carpenter, “Multiple imputation for missing data in epidemiological and clinical research: potential and pitfalls,” BMJ, vol. 338, p. b2393, 2009.
- WHO, 2010.
- M. N. Elliott, A. Fremont, P. A. Morrison, P. Pantoja, and N. Lurie, “A new method for estimating race/ethnicity and associated disparities where administrative records lack self-reported race/ethnicity,” Health Services Research, vol. 43, no. 5p1, pp. 1722–1736, 2008.
- R. W. Grundmeier, L. Song, M. J. Ramos, A. G. Fiks, M. N. Elliott, A. Fremont, W. Pace, R. C. Wasserman, and R. Localio, “Imputing missing race/ethnicity in pediatric electronic health records: reducing bias with use of us census location and surname data,” Health Services Research, vol. 50, no. 4, pp. 946–960, 2015.
- P. Sentas and L. Angelis, “Categorical missing data imputation for software cost estimation by multinomial logistic regression,” Journal of Systems and Software, vol. 79, no. 3, pp. 404–414, 2006.
- Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
- K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification,” in Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
- A. Shrikumar, P. Greenside, A. Shcherbina, and A. Kundaje, “Not just a black box: Learning important features through propagating activation differences,” arXiv preprint arXiv:1605.01713, 2016.
- S. Barber, D. A. Hickson, X. Wang, M. Sims, C. Nelson, and A. V. Diez-Roux, “Neighborhood disadvantage, poor social conditions, and cardiovascular disease incidence among african american adults in the jackson heart study,” Am J Public Health, vol. 106, no. 12, pp. 2219–2226, 2016.
- K. L. Gilbert, K. Elder, S. Lyons, K. Kaphingst, M. Blanchard, and M. Goodman, “Racial composition over the life course: Examining separate and unequal environments and the risk for heart disease for african american men,” Ethn Dis, vol. 25, no. 3, pp. 295–304, 2015.
- N. Crepaz, A. K. Horn, S. M. Rama, T. Griffin, J. B. Deluca, M. M. Mullins, S. O. Aral, H. P. R. S. Team, et al., “The efficacy of behavioral interventions in reducing HIV risk sex behaviors and incident sexually transmitted disease in black and Hispanic sexually transmitted disease clinic patients in the united states: a meta-analytic review,” Sexually Transmitted Diseases, vol. 34, no. 6, pp. 319–332, 2007.
- R. F. Gillum, M. E. Mussolino, and J. H. Madans, “Diabetes mellitus, coronary heart disease incidence, and death from all causes in african american and european american women: The nhanes i epidemiologic follow-up study,” J Clin Epidemiol, vol. 53, no. 5, pp. 511–8, 2000.
- J. Ojodu, M. M. Hulihan, S. N. Pope, A. M. Grant, C. Centers for Disease, and Prevention, “Incidence of sickle cell trait–united states, 2010,” MMWR Morb Mortal Wkly Rep, vol. 63, no. 49, pp. 1155–8, 2014.
- A. D. Fix, C. A. Peña, and G. T. Strickland, “Racial differences in reported lyme disease incidence,” American Journal of Epidemiology, vol. 152, no. 8, pp. 756–759, 2000.
- J. A. Calandrino, A. Kilzer, A. Narayanan, E. W. Felten, and V. Shmatikov, ““You might also like:” Privacy risks of collaborative filtering,” in Security and Privacy (SP), 2011 IEEE Symposium on, pp. 231–246, IEEE, 2011.
- Y. Bengio, “Practical recommendations for gradient-based training of deep architectures,” in Neural Networks: Tricks of the Trade, pp. 437–478, Springer, 2012.
- F. Chollet, “Keras,” 2015.
- M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al., “Tensorflow: Large-scale machine learning on heterogeneous distributed systems,” arXiv preprint arXiv:1603.04467, 2016.
- N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.,” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.