Identifiable Phenotyping using Constrained Non–Negative Matrix Factorization

Abstract

This work proposes a new algorithm for automated and simultaneous phenotyping of multiple co–occurring medical conditions, also referred to as comorbidities, using clinical notes from electronic health records (EHRs). A latent factor estimation technique, non-negative matrix factorization (NMF), is augmented with domain constraints from weak supervision to obtain sparse latent factors that are grounded to a fixed set of chronic conditions. The proposed grounding mechanism ensures a one-to-one, identifiable, and interpretable mapping between the latent factors and the target comorbidities. Qualitative assessment of the empirical results by clinical experts shows that the proposed model learns clinically interpretable phenotypes, which are also shown to perform competitively on a mortality prediction task. The proposed method can be readily adapted to any non-negative EHR data across various healthcare institutions.


Shalmali Joshi shalmali@utexas.edu
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA
Suriya Gunasekar suriya@utexas.edu
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA
David Sontag dsontag@cs.nyu.edu
Computer Science
New York University
NYC, NY, USA
Joydeep Ghosh jghosh@utexas.edu
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA

1 Introduction

Reliably querying for patients with specific medical conditions across multiple organizations facilitates many large-scale healthcare applications such as cohort selection, multi-site clinical trials, and epidemiological studies (Richesson et al., 2013; Hripcsak and Albers, 2013; Pathak et al., 2013). However, raw EHR data collected across diverse populations and multiple caregivers can be extremely high dimensional, unstructured, heterogeneous, and noisy. Manually querying such data is a formidable challenge for healthcare professionals.

EHR driven phenotypes are concise representations of medical concepts composed of clinical features, conditions, and other observable traits that facilitate accurate querying of individuals from EHRs (NIH Health Care Systems Research Collaboratory, 2014). Efforts like the eMERGE Network (http://emerge.mc.vanderbilt.edu/) and PheKB (http://phekb.org/) are well known examples of EHR driven phenotyping. Traditional rule-based methods for composing phenotypes require substantial time and expert knowledge and leave little scope for exploratory analyses. This motivates automated EHR driven phenotyping using machine learning with limited expert intervention.

We propose a weakly supervised model for jointly phenotyping co–occurring conditions (comorbidities) observed in intensive care unit (ICU) patients. Comorbidities are a set of co-occurring conditions in a patient at the time of admission that are not directly related to the primary diagnosis for hospitalization (Elixhauser et al., 1998). Phenotypes for the comorbidities listed in Table 1 are derived using text-based features from clinical notes in the publicly accessible MIMIC-III EHR database (Saeed et al., 2011). We present a novel constrained non–negative matrix factorization (CNMF) of the EHR matrix that aligns the factors with target comorbidities, yielding sparse, interpretable, and identifiable phenotypes.

The following aspects of our model distinguish our work from prior efforts:

  1. Identifiability: A key shortcoming of standard unsupervised latent factor models such as NMF (Lee and Seung, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) for phenotyping is that the estimated latent factors are interchangeable and hence unidentifiable as phenotypes for specific conditions of interest. We tackle identifiability by incorporating weak (noisy) but inexpensive supervision as constraints in our framework. Specifically, we obtain weak supervision for the target conditions in Table 1 using the Elixhauser Comorbidity Index (ECI) (Elixhauser et al., 1998), computed solely from patient administrative data (without human intervention). We then ground the latent factors to have a one-to-one mapping with the conditions of interest by incorporating the comorbidities predicted by ECI as support constraints on the patient loadings along the latent factors.

  2. Simultaneous modeling of comorbidities: ICU patients studied in this paper are frequently afflicted with multiple co–occurring conditions besides the primary cause for admission. In the proposed NMF model, phenotypes for such co–occurring conditions are jointly modeled to capture the resulting correlations.

  3. Interpretability: For wider applicability of EHR driven phenotyping to advanced clinical decision making, it is desirable that these phenotype definitions be clinically interpretable and represented as a concise set of rules. We consider sparsity of the representations as a proxy for interpretability and explicitly encourage conciseness of phenotypes using tuneable sparsity–inducing soft constraints.

We evaluate the effectiveness of the proposed method towards interpretability, clinical relevance, and prediction performance on EHR data from MIMIC-III. Although we focus on ICU patients using clinical notes, the proposed model and algorithm are general and can be applied on any non-negative EHR data from any population group.

2 Data Extraction

The MIMIC-III dataset consists of de-identified EHRs for adult ICU patients at the Beth Israel Deaconess Medical Center, Boston, Massachusetts. For all ICU stays within each admission, clinical notes including nursing progress reports, physician notes, discharge summaries, ECG reports, etc. are available. We analyze patients who stayed in the ICU for at least 48 hours. We derive phenotypes using clinical notes collected within the first 48 hours of each patient's ICU stay, in order to evaluate the quality of phenotypes when only limited patient data is available. Further, we evaluate the phenotypes on a mortality prediction problem. To avoid obvious indicators of mortality and comorbidities, apart from restricting to the first 48 hours of data, we exclude discharge summaries as they explicitly mention patient outcomes (including mortality).

Congestive Heart Failure; Cardiac Arrhythmias; Valvular Disease; Pulmonary Circulation Disorder; Peripheral Vascular Disorder
Hypertension; Paralysis; Other Neurological Disorders; Chronic Pulmonary Diseases; Diabetes Uncomplicated
Diabetes Complicated; Hypothyroidism; Renal Failure; Liver Disease (excluding bleeding); Peptic Ulcer
AIDS; Lymphoma; Metastatic Cancer; Solid Tumor (without metastasis); Rheumatoid Arthritis
Coagulopathy; Obesity; Weight Loss; Fluid Electrolyte Disorder; Blood Loss Anemia
Deficiency Anemia; Alcohol Abuse; Drug Abuse; Psychoses; Depression
Table 1: Target comorbidities
  1. Clinically relevant bag-of-words features: Aggregated clinical notes from all sources are represented as a single bag-of-words feature vector. To enhance clinical relevance, we create a custom vocabulary containing clinical terms from two sources: (a) the Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT), and (b) the level-0 terms provided by the Unified Medical Language System (UMLS) (see https://www.nlm.nih.gov/healthit/snomedct/ and https://www.nlm.nih.gov/research/umls/), consolidated into a standard vocabulary format using Metamorphosys, an application provided by UMLS for custom vocabulary creation. To extract clinical terms from the raw text, the notes were tagged for chunking using a conditional random field tagger (https://taku910.github.io/crfpp/). The tags are looked up against the custom vocabulary (generated from Metamorphosys) to obtain the bag-of-words representation. Our final vocabulary has 3600 clinical terms.

  2. Computable weak diagnosis: We incorporate domain constraints from weak supervision to ground the latent factors to have a one-to-one mapping with the conditions of interest. In the model described in Section 3, this is enforced by constraining the non-zero entries of each patient's loading along the latent factors using a weak diagnosis for comorbidities. The weak diagnoses of target comorbidities in Table 1 are obtained using ECI (https://git.io/v6e7q), computed solely from patient administrative data without human annotation. We refer to this index as a weak diagnosis as it is not a physician's exact diagnosis and is subject to noise and misspecification. Note that ECI ignores diagnosis codes related to the primary diagnosis at admission. Thus, ECI models the presence and absence of conditions other than the primary reason for admission (comorbidities). The phenotype candidates from the proposed model can be considered as concise representations of such comorbidities.
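For illustration, a toy version of the vocabulary-lookup step above might look as follows. This sketch omits the CRF chunking stage, uses a naive uni/bigram tokenizer, and the function name is ours, not part of the original pipeline.

```python
import re
from collections import Counter

def bag_of_words(note_text, vocabulary):
    """Count vocabulary hits (unigrams and bigrams) in one clinical note."""
    tokens = re.findall(r"[a-z]+", note_text.lower())
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(g for g in tokens + bigrams if g in vocabulary)
```

In the actual pipeline, CRF-tagged chunks rather than raw n-grams would be matched against the Metamorphosys-generated vocabulary.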

3 Identifiable High–Throughput Phenotyping

The notation used in the paper is enumerated in Table 2.

Notation | Description
$[K]$, for integer $K$ | Set of indices $\{1, 2, \ldots, K\}$.
$\Delta^{d}$ | Simplex in dimension $d$: $\{v \in \mathbb{R}^{d}_{+} : \sum_{j} v_j = 1\}$.
$M_{:i}$ | $i$-th column of a matrix $M$.
$\mathrm{supp}(v)$ | Support of a vector $v$: $\{j : v_j \neq 0\}$.
Observations
$N$, $d$ | Number of patients ($N$) and features ($d$), respectively.
$X \in \mathbb{Z}^{d \times N}_{+}$ | EHR matrix from MIMIC-III: clinically relevant bag-of-words features from notes in the first 48 hours of ICU stay for $N$ patients.
$[K]$ | Indices for the $K = 30$ comorbidities in Table 1.
$S_n \subseteq [K]$, for $n \in [N]$ | Set of comorbidities patient $n$ is diagnosed with using ECI.
Factor matrices
$W \in [0,1]^{K \times N}$ | Estimate of patients' risk for the $K$ conditions.
$H \in \mathbb{R}^{d \times K}_{+}$, $b \in \mathbb{R}^{d}_{+}$ | Estimate of phenotype factor matrix and feature bias vector.
Table 2: Notation used in the paper

In summary, for each patient $n \in [N]$, (a) the bag-of-words features from clinical notes are represented as column $x_n = X_{:n}$ of the EHR matrix $X$, and (b) the list of comorbidities diagnosed using ECI is denoted by $S_n \subseteq [K]$. Let an unknown $W^* \in [0,1]^{K \times N}$ represent the risk of the $N$ patients for the $K$ comorbidities of interest; each entry $W^*_{kn}$ lies in the interval $[0,1]$, with $0$ and $1$ indicating no-risk and maximum-risk, respectively, of patient $n$ being afflicted with condition $k$. If $S^*_n$ denotes an accurate diagnosis for patient $n$, then $W^*$ satisfies $\mathrm{supp}(W^*_{:n}) \subseteq S^*_n$.

Definition 1 (EHR driven phenotype)

EHR driven phenotypes for the $K$ co–occurring conditions are a set of vectors $\{h_k \in \mathbb{R}^{d}_{+} : k \in [K]\}$ such that for a patient $n$ afflicted with conditions $S^*_n \subseteq [K]$,

$$x_n \approx \sum_{k \in S^*_n} W^*_{kn}\, h_k + b, \qquad (1)$$

where $b \in \mathbb{R}^{d}_{+}$ is a bias representing the feature component observed independent of the target conditions. $H \in \mathbb{R}^{d \times K}_{+}$ with $\{h_k : k \in [K]\}$ as columns is referred to as the phenotype factor matrix.

Note that we explicitly model a feature bias $b$ to capture frequently occurring terms that are not discriminative of the target conditions, e.g., temperature, pain, etc.

Cost Function

The bag-of-words features are represented as counts in the EHR matrix $X \in \mathbb{Z}^{d \times N}_{+}$. We consider a factorized approximation of $X$ parametrized by matrices $W \in [0,1]^{K \times N}$, $H \in \mathbb{R}^{d \times K}_{+}$, and $b \in \mathbb{R}^{d}_{+}$ as $\widehat{X} = HW + b\mathbf{1}^\top$, where $\mathbf{1}$ denotes a vector of all ones of appropriate dimension. The approximation error of the estimate is measured using the generalized KL–divergence defined as follows:

$$D\big(X \,\big\|\, \widehat{X}\big) = \sum_{j=1}^{d} \sum_{n=1}^{N} \Big( X_{jn} \log \frac{X_{jn}}{\widehat{X}_{jn}} - X_{jn} + \widehat{X}_{jn} \Big) \qquad (2)$$

Minimizing the generalized KL–divergence is equivalent to maximum likelihood estimation under a Poisson distributional assumption on individual entries of the EHR matrix parameterized by $\widehat{X}$ (Banerjee et al., 2005).
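As a concrete check of this equivalence, the following sketch (function names are ours) computes the generalized KL-divergence and the Poisson negative log-likelihood and verifies that they differ only by a term independent of the estimate $\widehat{X}$:

```python
import numpy as np

def gen_kl_divergence(X, Xhat, eps=1e-12):
    """Generalized KL (I-)divergence D(X || Xhat), summed over entries.

    Zero entries of X contribute only their +Xhat term (0 * log 0 := 0).
    """
    X = np.asarray(X, dtype=float)
    Xhat = np.asarray(Xhat, dtype=float)
    logs = np.where(X > 0, X * np.log((X + eps) / (Xhat + eps)) - X, 0.0)
    return float(np.sum(logs + Xhat))

def poisson_nll(X, Xhat, eps=1e-12):
    """Negative Poisson log-likelihood of counts X under rates Xhat,
    dropping the log(X!) term, which does not depend on Xhat."""
    X = np.asarray(X, dtype=float)
    Xhat = np.asarray(Xhat, dtype=float)
    return float(np.sum(Xhat - X * np.log(Xhat + eps)))
```

The two objectives differ by $\sum_{jn} (X_{jn}\log X_{jn} - X_{jn})$, a constant in $\widehat{X}$, so they share the same minimizers.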

Phenotypes

For the $K$ comorbidities, the columns of the estimated $H$, $\{h_k : k \in [K]\}$, are proposed as candidate phenotypes derived from the EHR matrix $X$, i.e. approximations to the phenotypes in Definition 1.

Constraints

The following constraints are incorporated in learning $W$, $H$, and $b$.

  1. Support Constraints: The non-negative rank–$K$ factorization of $X$ is ‘grounded’ to the target comorbidities by constraining the support of the risk column $W_{:n}$ corresponding to patient $n$, using the weak diagnosis $S_n$ from ECI as an approximation of the conditions $S^*_n$ in Definition 1.

  2. Sparsity Constraints: Scaled simplex constraints $h_k \in \alpha\,\Delta^{d}$ are imposed on the columns of $H$ with a tuneable parameter $\alpha$ to encourage sparsity of the phenotypes. Restricting the patient loadings matrix as $W \in [0,1]^{K \times N}$ not only allows us to interpret the loadings as the patients' risk, but also makes the simplex constraints effective in a bilinear optimization.
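Enforcing a scaled-simplex constraint in a projected-gradient scheme requires a Euclidean projection step. The paper does not specify its projection routine; a standard sort-based sketch is:

```python
import numpy as np

def project_scaled_simplex(v, alpha=1.0):
    """Euclidean projection of v onto the scaled simplex
    {h : h >= 0, sum(h) = alpha}, via the standard sort-based method."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                      # sort descending
    css = np.cumsum(u) - alpha
    ind = np.arange(1, v.size + 1)
    rho = ind[u - css / ind > 0][-1]          # largest feasible index
    theta = css[rho - 1] / rho                # shift that hits the sum constraint
    return np.maximum(v - theta, 0.0)
```

Applied column-wise to $H$, this keeps each candidate phenotype non-negative with entries summing to $\alpha$.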

Simultaneous phenotyping of comorbidities using constrained NMF is posed as follows:

$$\min_{W,\,H,\,b}\; D\big(X \,\big\|\, HW + b\mathbf{1}^\top\big) \qquad (3)$$
s.t. $\;\mathrm{supp}(W_{:n}) \subseteq S_n\ \forall\, n \in [N],\quad W \in [0,1]^{K \times N},\quad h_k \in \alpha\,\Delta^{d}\ \forall\, k \in [K],\quad b \ge 0.$

The optimization in (3) is convex in either factor with the other factor fixed. It is solved using alternating minimization with projected gradient descent (Parikh and Boyd, 2014; Lin, 2007); see Algorithm 1. The proposed model can in general incorporate any weak diagnosis of medical conditions. We note that, since we use ECI, the results are not representative of the primary diagnoses at admission.

Algorithm 1: Phenotyping using constrained NMF.
Input: $X$, $\{S_n : n \in [N]\}$, and parameter $\alpha$. Initialization: $W^{(0)}, H^{(0)}, b^{(0)}$.
  while not converged do
     $W \leftarrow \Pi_{\mathcal{W}}\big(W - \eta_W \nabla_W D(X \,\|\, HW + b\mathbf{1}^\top)\big)$, with $\mathcal{W} = \{W \in [0,1]^{K \times N} : \mathrm{supp}(W_{:n}) \subseteq S_n\ \forall n\}$
     $(H, b) \leftarrow \Pi_{\mathcal{H}}\big((H, b) - \eta_H \nabla_{(H,b)} D(X \,\|\, HW + b\mathbf{1}^\top)\big)$, with $\mathcal{H} = \{(H, b) : h_k \in \alpha\,\Delta^{d}\ \forall k,\ b \ge 0\}$
  end while
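A minimal NumPy sketch of this alternating projected-gradient scheme follows. All names (`cnmf`, the learning rate `lr`) and hyperparameter values are illustrative assumptions, not the authors' implementation; in particular, the $H$-step renormalizes columns onto the $\alpha$-scaled simplex as a crude stand-in for the exact Euclidean projection.

```python
import numpy as np

def cnmf(X, supports, K, alpha=0.4, n_iters=200, lr=1e-3, eps=1e-10, seed=0):
    """Alternating projected-gradient sketch of constrained NMF.

    X        : (d, N) non-negative count matrix (bag-of-words)
    supports : length-N list; supports[n] = allowed comorbidity indices
               for patient n (the weak ECI diagnosis S_n)
    Returns W (K, N) with supp(W[:, n]) in S_n and entries in [0, 1],
    H (d, K) with columns on the alpha-scaled simplex, and bias b (d,).
    """
    rng = np.random.default_rng(seed)
    d, N = X.shape
    mask = np.zeros((K, N))                       # support pattern from ECI
    for n, S in enumerate(supports):
        mask[list(S), n] = 1.0
    W = rng.uniform(0, 1, size=(K, N)) * mask
    H = rng.uniform(0, 1, size=(d, K))
    H *= alpha / H.sum(axis=0, keepdims=True)
    b = rng.uniform(0, 1, size=d)

    for _ in range(n_iters):
        # gradient of the generalized KL cost w.r.t. the estimate HW + b1^T
        R = 1.0 - X / (H @ W + b[:, None] + eps)
        # W-step: gradient step, then project onto [0,1] and the support
        W = np.clip(W - lr * (H.T @ R), 0.0, 1.0) * mask
        R = 1.0 - X / (H @ W + b[:, None] + eps)
        # (H, b)-step: gradient step; column renormalization stands in
        # for the exact projection onto the alpha-scaled simplex
        H = np.maximum(H - lr * (R @ W.T), eps)
        H *= alpha / H.sum(axis=0, keepdims=True)
        b = np.maximum(b - lr * R.sum(axis=1), 0.0)
    return W, H, b
```

By construction, patient loadings stay zero outside the ECI support, which is what grounds factor $k$ to comorbidity $k$.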

4 Empirical Evaluation

The estimated phenotypes are evaluated on various metrics. We denote the model learned using Algorithm 1 with a given sparsity parameter $\alpha$ as $\alpha$–CNMF. The following baselines are used for comparison:

  1. Labeled LDA (LLDA): LLDA (Ramage et al., 2009) is the supervised counterpart of LDA, a probabilistic model that estimates the topic distribution of a corpus. It assumes that the word counts of documents arise from multinomial distributions. It incorporates supervision on the topics contained in a document and can be naturally adapted for phenotyping from bag-of-words clinical features, where the topic–word distributions form candidate phenotypes. While LLDA assumes that the topic loadings of a document lie on the probability simplex, $\alpha$–CNMF allows each patient–condition loading to lie in $[0,1]$. In interpreting the patient loading as a disease risk, the latter allows patients to have varying levels of disease prevalence. Also, LLDA can induce sparsity only indirectly via a hyperparameter of the informative prior on the topic–word distributions. While this does not guarantee sparse estimates, we obtain reasonable sparsity on LLDA estimates. We use the Gibbs sampling code from MALLET (McCallum, 2002) for inference. For a fair comparison to CNMF, which uses an extra bias factor, we allow LLDA to model an extra topic shared by all documents in the corpus.

  2. NMF with support constraints (NMF+support): This NMF model incorporates the non–negativity and support constraints from weak supervision, but not the sparsity-inducing constraints on the phenotype matrix. This allows us to study the effect of the sparsity-inducing constraints on interpretability. On the other hand, imposing sparsity without our grounding technique does not yield identifiable topics and hence is not studied as a baseline.

  3. Multi-label Classification (MLC): This baseline treats the weak supervision (from ECI) as accurate labels in a fully supervised model. A sparsity-inducing regularized logistic regression classifier is learned for each condition independently. The learned weight vector for each condition determines the importance of clinical terms towards discriminating patients with that condition, and is treated as the candidate phenotype for the condition.
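A minimal version of the per-condition MLC classifier can be sketched as L1-regularized logistic regression via proximal gradient (ISTA). The function name, hyperparameters, and the choice of ISTA as the solver are our illustrative assumptions, not the paper's exact setup.

```python
import numpy as np

def l1_logistic(X, y, lam=0.05, lr=0.5, n_iters=300):
    """L1-regularized logistic regression via proximal gradient (ISTA):
    a gradient step on the logistic loss followed by soft-thresholding.

    X : (N, d) features; y : (N,) binary labels. Returns weights (d,).
    """
    N, d = X.shape
    w = np.zeros(d)
    for _ in range(n_iters):
        p = 1.0 / (1.0 + np.exp(-(X @ w)))        # predicted probabilities
        w = w - lr * (X.T @ (p - y)) / N          # gradient step on the loss
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)  # prox of L1
    return w
```

Running this once per comorbidity against the weak ECI labels yields one sparse weight vector (candidate phenotype) per condition.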

The weak supervision does not account for the primary diagnosis for admission in the ICU population, as the ECI ignores primary diagnoses at admission (Elixhauser et al., 1998). However, the learning algorithm can be easily modified to account for the primary diagnoses, if required, by using a modified form of supervision or by absorbing the effects in an additional additive term appended to the model. Nevertheless, the proposed model generates highly interpretable phenotypes for comorbidities. Finally, to mitigate the effect of local minima, whenever applicable, the algorithm for each model was run with multiple random initializations and the results with the lowest divergence were chosen for comparison.

4.1 Interpretability–accuracy trade–off

Sparsity of the latent factors is used as a proxy for the interpretability of phenotypes. Sparsity is measured as the median of the number of non–zero entries in the columns of the phenotype matrix (lower is better). The parameter $\alpha$ in $\alpha$–CNMF controls the sparsity by imposing scaled simplex constraints on the columns of $H$. CNMF was trained over a range of values of $\alpha$. Stronger sparsity-inducing constraints result in a worse fit to the cost function; this trade–off is indeed observed in all models (see Appendix A for details). For all models, we pick the estimates with the lowest median sparsity while ensuring that the phenotype candidate for every condition is represented by a minimum number of non-zero clinical terms.
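The sparsity metric described above can be computed as follows (our helper, with a small tolerance for numerically zero entries):

```python
import numpy as np

def median_phenotype_sparsity(H, tol=1e-8):
    """Median number of non-zero entries across the columns of the
    phenotype matrix H (lower means more interpretable)."""
    nnz_per_phenotype = (np.abs(H) > tol).sum(axis=0)
    return float(np.median(nnz_per_phenotype))
```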

4.2 Clinical relevance of phenotypes

Figure 1: Qualitative Ratings from Annotation: The two bars per model represent the ratings provided by the two annotators. Each bar is a histogram of the scores for the comorbidities, sorted by score.

We requested two clinicians to evaluate the candidate phenotypes based on the top terms learned by each model. Ratings were requested on a scale from ‘poor’ to ‘excellent’. The experts were asked to rate based on whether the terms are relevant to the corresponding condition and whether the terms are jointly discriminative of the condition. Figure 1 summarizes the qualitative ratings obtained for all models. For each model, we show two columns (corresponding to the two experts). The stacked bars show the histogram of the ratings for the models. A large fraction of the phenotypes learned by our model were rated ‘good’ or better by both annotators. In contrast, NMF with support constraints but without sparsity-inducing constraints hardly learns clinically relevant phenotypes.

            0.4–CNMF LLDA MLC NMF+support
0.4–CNMF           0   28  20          44
LLDA               7    0  12          35
MLC                6   21   0          42
NMF+support        1    0   1           0
Table 3: Relative Rankings Matrix: Each row of the table gives the number of times the model along the row was rated strictly better than the model along the column by the clinical experts, e.g., the entry 12 in row LLDA, column MLC means that LLDA was rated better than MLC 12 times over all conditions and experts.

The proposed model 0.4–CNMF also received a significantly higher number of ‘excellent’ and ‘good’ ratings from both experts. Although LLDA and MLC estimate sparse phenotypes, they are not on par with 0.4–CNMF. Table 3 summarizes the relative rankings for all models. Each cell entry shows the number of times the model along the corresponding row was rated strictly better than that along the column. 0.4–CNMF is rated better than all three baselines. The supervised baseline MLC outperforms LLDA even though LLDA learns comorbidities jointly, suggesting that the simplex constraint imposed by LLDA may be restrictive.

Figure 2 shows an example of a phenotype (top 15 terms) learned by all models for psychoses. For this condition, the proposed model was rated “excellent” and strictly better than both LLDA and MLC by both annotators, while the LLDA and MLC ratings were tied. However, the phenotype for hypertension (Figure 3) learned by 0.4–CNMF has more terms related to ‘Renal Failure’ or ‘End Stage Renal Disease’ than to hypertension itself. One of our annotators pointed out that “Candidate 1 is a fairly good description of renal disease, which is an end organ complication of hypertension”, where the anonymized Candidate 1 refers to 0.4–CNMF. Exploratory analysis suggests that hypertension and renal failure are the most commonly co-occurring pair of conditions: over 93% of patients that have hypertension (according to ECI) also suffer from renal failure. Thus, our model is unable to distinguish between highly co-occurring conditions. The other baselines were also rated poorly for hypertension, with LLDA rated only slightly better. More examples of phenotypes are provided in Appendix B.

Figure 2: Phenotypes learned for ‘Psychoses’ (words are listed in order of importance)
Figure 3: Phenotypes learned for ‘Hypertension’

4.3 Mortality prediction

To quantitatively evaluate the utility of the learned phenotypes, we consider the mortality prediction task. We divide the EHR into cross-validation folds of training and test patients. As this is an imbalanced class problem, the training–test splits are stratified by mortality labels. For each split, all models were applied on the training data to obtain phenotype candidates $H$ and feature biases $b$. For each model, the patient loadings along the respective phenotype space are used as features to train a logistic regression classifier for mortality prediction. For CNMF and NMF+support, these loadings are obtained by minimizing the divergence in (3) over $W$ with $H$ and $b$ fixed. For LLDA, they are obtained using Gibbs sampling with fixed topic–word distributions. For MLC, the predicted class probabilities of the $K$ comorbidities are used as features. Additionally, we train a logistic regression classifier using the full EHR matrix as features.

We clarify the following points on the methodology: (1) $H$ (along with $b$) is learned on the patients in the training dataset only, hence there is no information leak from test patients into training. (2) Test patients’ comorbidities from ECI are not used as support constraints on their loadings. (3) Regularized logistic regression classifiers are used to learn models for mortality prediction; the regularization parameters are chosen via grid search.
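For CNMF, the test-time loadings can be sketched as a projected-gradient solve over $W$ with $H$ and $b$ frozen and, per point (2) above, no ECI support constraints; the function name and step size are illustrative assumptions.

```python
import numpy as np

def infer_loadings(X, H, b, n_iters=300, lr=0.05, eps=1e-10):
    """Infer loadings for unseen patients with phenotypes H and bias b
    held fixed: projected gradient on the generalized KL cost, with
    W constrained to [0, 1] and no support constraints at test time."""
    K, N = H.shape[1], X.shape[1]
    W = np.full((K, N), 0.5)
    for _ in range(n_iters):
        Xhat = H @ W + b[:, None] + eps
        W = np.clip(W - lr * (H.T @ (1.0 - X / Xhat)), 0.0, 1.0)
    return W
```

The resulting columns of $W$ serve as the per-patient feature vectors fed to the downstream mortality classifier.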

Model AUROC Sensitivity Specificity
1. $\alpha$–CNMF 0.63(0.02) 0.59(0.04) 0.62(0.03)
2. NMF+support 0.52(0.02) 0.56(0.13) 0.51(0.14)
3. LLDA 0.64(0.02) 0.62(0.03) 0.61(0.05)
4. MLC 0.66(0.01) 0.62(0.06) 0.62(0.05)
5. Full EHR 0.72(0.02) 0.69(0.02) 0.63(0.04)
6. CNMF+Full EHR 0.72(0.02) 0.61(0.09) 0.71(0.07)
Table 4: Mortality prediction: cross-validation performance of logistic regression classifiers. Classifiers for $\alpha$–CNMF and the competing baselines (NMF+support, LLDA, MLC) were trained on the $K$–dimensional phenotype loadings as features. Full EHR denotes the baseline classifier (sparsity–regularized logistic regression) using the full $d$–dimensional EHR as features. CNMF+Full EHR denotes the performance of the regularized classifier learned on the Full EHR features augmented with the CNMF loadings (the regularization hyperparameters were manually tuned to match the performance of the Full EHR model).

The performance of the above models trained with regularized logistic regression over the cross-validation folds is reported in Table 4 (rows 1–5). The classifier trained on the full EHR unsurprisingly outperforms all baselines as it uses richer, high dimensional information. All phenotyping baselines, except NMF+support, show comparable performance on mortality prediction which, in spite of learning on a small number of features, is only slightly worse than the predictive performance of the full EHR with its $d$–dimensional feature set.

Augmented features for mortality prediction (CNMF+Full EHR)

Unsurprisingly, Table 4 suggests that the high dimensional EHR data carries additional information towards mortality prediction which is lacking in the $K$–dimensional features generated via phenotyping. To evaluate whether this additional information can be captured by CNMF if augmented with a small number of raw EHR features, we train a mortality prediction classifier using regularized logistic regression on the CNMF loadings combined with the raw bag–of–words features, with parameters tuned to match the performance of the full EHR model. The results are reported in the final row of Table 4.

In exploring the weights learned by the classifier for all features, we observe that only a small fraction of the raw EHR bag-of-words features have non–zero weights. This suggests that the comorbidities capture a significant amount of predictive information on mortality and achieve performance comparable to the full EHR model with only a small number of additional terms. See Figure 35 in the Appendix, which shows the weights learned by the classifier for all features. Figure 4 shows the comorbidities and EHR terms with the top magnitude weights learned by the CNMF+Full EHR classifier. For example, it is interesting to note that the top weighted EHR term, dnr (‘Do Not Resuscitate’), is not indicative of any comorbidity but is predictive of patient mortality.

Figure 4: Top magnitude weights on (a) EHR and (b) CNMF features in CNMF+Full EHR classifier

5 Discussion and Related Work

Supervised learning methods like Carroll et al. (2011); Kawaler et al. (2012); Chen et al. (2013) or deep learning methods (Lipton et al., 2015; Kale et al., 2015; Henao et al., 2015) for EHR driven phenotyping require expert supervision. Although unsupervised methods like NMF (Anderson et al., 2014) and non–negative tensor factorization (Kolda and Bader, 2009; Harshman, 1970) are inexpensive alternatives (Ho et al., 2014a, b, c; Luo et al., 2015), they pose challenges with respect to identifiability, interpretability and computation, limiting their scalability.

Most closely related to our paper is the work of Halpern et al. (2016b), a semi-supervised algorithm for learning the joint distribution over conditions, requiring only that a domain expert specify one or more ‘anchor’ features for each condition (no other labeled data). An ‘anchor’ for a condition is a set of clinical features whose presence is highly indicative of the target condition, but whose absence is not a strong signal for absence of the condition (Halpern et al., 2014, 2016a). For example, the presence of insulin medication is highly indicative of diabetes, but the converse is not true. Joshi et al. (2015) use a similar supervision approach for comorbidity prediction. Whereas the conditions in Halpern et al. (2016b) are binary valued, in our work they are real-valued between 0 and 1. Furthermore, we assume that the support of the conditions is known in the training data.

Our approach achieves identifiability using support constraints to ground the latent factors, and interpretability using sparsity constraints. The learned phenotypes are clinically interpretable and, when augmented with a sparse set of raw bag-of-words features, predictive of mortality on an unseen patient population. The model outperforms baselines in terms of clinical relevance according to experts and is significantly better than the variant that includes supervision but no sparsity constraints. The proposed method can be easily extended to other non–negative data to obtain more comprehensive phenotypes. However, it was observed that the algorithm does not discriminate between frequently co–occurring conditions, e.g. renal failure and hypertension. Further, the weak supervision (using ECI) does not account for the primary diagnoses at admission. Additional model flexibility to account for a primary condition in explaining the observations could potentially improve performance. Addressing the above limitations, along with quantitative evaluation of risk for disease prediction and understanding conditions for uniqueness of the phenotyping solutions, are interesting areas of follow-up work.

Acknowledgements

We thank Dr. Saul Blecker and Dr. Stephanie Kreml for their qualitative evaluation of the computational phenotypes. SJ, SG and JG were supported by NSF: SCH #1418511. DS was supported by NSF CAREER award #1350965. We also thank Yacine Jernite for sharing code used in preprocessing the clinical notes.

References

  • Anderson et al. (2014) A. Anderson, P. K. Douglas, W. T. Kerr, V. S. Haynes, A. L. Yuille, J. Xie, Y. N. Wu, J. A. Brown, and M. S. Cohen. Non-negative matrix factorization of multimodal MRI, fMRI and phenotypic data reveals differential changes in default mode subnetworks in ADHD. Neuroimage, 2014.
  • Banerjee et al. (2005) A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 2005.
  • Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.
  • Carroll et al. (2011) R. J. Carroll, A. E. Eyler, and J. C. Denny. Naive electronic health record phenotype identification for rheumatoid arthritis. In AMIA Annual Symposium, 2011.
  • Chen et al. (2013) Y. Chen, R. J. Carroll, E. Hinz, A. Shah, A. E. Eyler, J. C. Denny, and H. Xu. Applying active learning to high-throughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association, 2013.
  • Elixhauser et al. (1998) A. Elixhauser, C. Steiner, D. R. Harris, and R. M. Coffey. Comorbidity measures for use with administrative data. Medical Care, 1998.
  • Halpern et al. (2014) Y. Halpern, Y. Choi, S. Horng, and D. Sontag. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium, 2014.
  • Halpern et al. (2016a) Y. Halpern, S. Horng, Y. Choi, and D. Sontag. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 2016a.
  • Halpern et al. (2016b) Y. Halpern, S. Horng, and D. Sontag. Clinical tagging with joint probabilistic models. Conference on Machine Learning for Health Care, 2016b.
  • Harshman (1970) R. A. Harshman. Foundations of the PARAFAC procedure: Models and conditions for an explanatory multi-modal factor analysis. UCLA Working Papers in Phonetics, 1970.
  • Henao et al. (2015) R. Henao, J. T. Lu, J. E. Lucas, J. Ferranti, and L. Carin. Electronic health record analysis via deep Poisson factor models. Journal of Machine Learning Research, 2015.
  • Ho et al. (2014a) J. C Ho, J. Ghosh, S. R. Steinhubl, W. F. Stewart, J. C. Denny, B. A. Malin, and J. Sun. Limestone: High-throughput candidate phenotype generation via tensor factorization. Journal of Biomedical Informatics, 2014a.
  • Ho et al. (2014b) J. C. Ho, J. Ghosh, and J. Sun. Extracting phenotypes from patient claim records using nonnegative tensor factorization. In International Conference on Brain Informatics and Health, 2014b.
  • Ho et al. (2014c) J. C. Ho, J. Ghosh, and J. Sun. Marble: High-throughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014c.
  • Hripcsak and Albers (2013) G. Hripcsak and D. J. Albers. Next-generation phenotyping of electronic health records. Journal of the American Medical Informatics Association, 2013.
  • Joshi et al. (2015) S. Joshi, O. Koyejo, and J. Ghosh. Simultaneous prognosis of multiple chronic conditions from heterogeneous EHR data. In International Conference on Healthcare Informatics, 2015.
  • Kale et al. (2015) D. C. Kale, Z. Che, M. T. Bahadori, W. Li, Y. Liu, and R. Wetzel. Causal Phenotype Discovery via Deep Networks. In AMIA Annual Symposium, 2015.
  • Kawaler et al. (2012) E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, and M. Craven. Learning to predict post-hospitalization VTE risk from EHR data. In AMIA Annual Symposium, 2012.
  • Kolda and Bader (2009) T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 2009.
  • Lee and Seung (2001) D. Lee and H. S. Seung. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
  • Lin (2007) C. J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural Computation, 2007.
  • Lipton et al. (2015) Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
  • Luo et al. (2015) Y. Luo, Y. Xin, E. Hochberg, R. Joshi, O. Uzuner, and P. Szolovits. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text. Journal of the American Medical Informatics Association, 2015.
  • McCallum (2002) A. K. McCallum. Mallet: A machine learning for language toolkit, 2002. URL http://mallet.cs.umass.edu.
  • NIH Health Care Systems Research Collaboratory (2014) NIH Health Care Systems Research Collaboratory. Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials, 2014.
  • Parikh and Boyd (2014) N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in Optimization, 2014.
  • Pathak et al. (2013) J. Pathak, A. N. Kho, and J. C. Denny. Electronic health records-driven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association, 2013.
  • Ramage et al. (2009) D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing, 2009.
  • Richesson et al. (2013) R. L. Richesson, W. E. Hammond, M. Nahm, D. Wixted, G. E. Simon, J. G. Robinson, A. E. Bauck, D. Cifelli, M. M. Smerek, J. Dickerson, et al. Electronic health records based phenotyping in next-generation clinical trials: a perspective from the NIH Health Care Systems Collaboratory. Journal of the American Medical Informatics Association, 2013.
  • Saeed et al. (2011) M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMIC-II): a public-access intensive care unit database. Critical Care Medicine, 2011.

Appendix A Phenotype Sparsity

As suggested in Section 4.1, there is an inherent trade-off between the fit to the cost function and the desired sparsity. This trade-off is made explicit for the proposed CNMF model in Figure 5. The sparsity of LLDA is controlled by tuning the hyperparameter of the word–topic multinomial distributions (Blei et al., 2003), and that of MLC via its regularization parameter. A smaller value of the LLDA hyperparameter ensures that the word–topic probabilities are sparse; as its value increases, sparsity decreases (i.e., the number of non-zero elements increases). For the logistic regression used by MLC, sparsity increases as the regularization parameter increases. Figure 6(a) demonstrates the sparsity of the phenotypes estimated by LLDA, and Figure 6(b) shows that of logistic regression. The parameter values used for qualitative annotation were chosen to achieve the lowest median sparsity while ensuring that, for each chronic condition, the corresponding phenotype candidate is represented by at least a minimum number of non-zero clinical terms. Our fourth baseline (NMF + support) has no tunable sparsity parameter and did not estimate sparse phenotypes (its phenotypes were nevertheless annotated for qualitative evaluation). The proposed model yields the sparsest phenotypes among all methods compared.
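The parameter-selection rule above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: `phenotype_matrices` is a hypothetical mapping from each candidate sparsity-parameter value to its estimated word-by-condition phenotype matrix, and `min_support` is the required minimum number of non-zero clinical terms per condition.

```python
import numpy as np

def median_sparsity(W):
    """Median number of non-zero entries across the columns of W."""
    return float(np.median((W != 0).sum(axis=0)))

def select_sparsity_param(phenotype_matrices, min_support):
    """Pick the parameter value whose phenotype matrix has the lowest
    median column sparsity, subject to every condition (column) retaining
    at least `min_support` non-zero clinical terms."""
    best_param, best_med = None, np.inf
    for param, W in phenotype_matrices.items():
        support = (W != 0).sum(axis=0)
        if support.min() < min_support:
            continue  # some condition is under-represented; reject
        med = median_sparsity(W)
        if med < best_med:
            best_param, best_med = param, med
    return best_param
```

Candidates violating the minimum-support constraint are rejected outright, and the sparsest admissible solution wins, mirroring the criterion described above.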

Figure 5: Sparsity–Accuracy Trade-off for the proposed CNMF model. Sparsity is measured as the median number of non-zero entries in the columns of the phenotype matrix. (a) Box plots of the median sparsity across the chronic conditions for varying values of the sparsity parameter; the median and third-quartile values are noted on the plots. (b) Divergence function value of the estimate from Algorithm 1 plotted against the sparsity parameter.
Figure 6: Phenotype sparsity for baseline models. (a) LLDA. (b) MLC.

Appendix B Sample Phenotypes for Baseline Models

Figures 7–34 show the top terms learned for all target chronic conditions by the proposed model and the baselines. The sparsity levels are chosen according to the criterion described in Section 4.1. For all conditions, the terms are ordered in decreasing order of importance as learned by the models.

Figure 7: Learned Phenotypes for Liver Disease
Figure 8: Learned Phenotypes for Solid Tumor
Figure 9: Learned Phenotypes for Metastatic Cancer
Figure 10: Learned Phenotypes for Chronic Pulmonary Disorder
Figure 11: Learned Phenotypes for Alcohol Abuse
Figure 12: Learned Phenotypes for Diabetes Uncomplicated
Figure 13: Learned Phenotypes for Diabetes Complicated
Figure 14: Learned Phenotypes for Peripheral Vascular Disorder
Figure 15: Learned Phenotypes for Renal Failure
Figure 16: Learned Phenotypes for Other Neurological Disorders
Figure 17: Learned Phenotypes for Cardiac Arrhythmias
Figure 18: Learned Phenotypes for Drug Abuse
Figure 19: Learned Phenotypes for Paralysis
Figure 20: Learned Phenotypes for AIDS
Figure 21: Learned Phenotypes for Fluid Electrolyte Disorders
Figure 22: Learned Phenotypes for Rheumatoid Arthritis
Figure 23: Learned Phenotypes for Lymphoma
Figure 24: Learned Phenotypes for Coagulopathy
Figure 25: Learned Phenotypes for Obesity
Figure 26: Learned Phenotypes for Pulmonary Circulation Disorder
Figure 27: Learned Phenotypes for Valvular Disease
Figure 28: Learned Phenotypes for Peptic Ulcer
Figure 29: Learned Phenotypes for Congestive Heart Failure
Figure 30: Learned Phenotypes for Hypothyroidism
Figure 31: Learned Phenotypes for Weight loss
Figure 32: Learned Phenotypes for Deficiency Anemias
Figure 33: Learned Phenotypes for Blood Loss Anemia
Figure 34: Learned Phenotypes for Depression

Appendix C Augmented Mortality Prediction

Figure 35 shows the weights learned by the classifier for all features. The weights shaded red correspond to phenotypes and are relatively high compared to those of the raw notes-based features (shaded blue), indicating that the comorbidity phenotypes capture a significant amount of the predictive information on mortality and, when augmented with additional raw clinical terms, achieve performance comparable to the full EHR model.

Figure 35: Weights learned by the CNMF+Full EHR classifier for all features. The weights shaded red correspond to phenotypes.
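The comparison of weight magnitudes across the two feature groups can be sketched as below. This is a minimal, self-contained illustration assuming a feature matrix whose first `n_phenotypes` columns are phenotype features and the rest are raw clinical-term features; it uses a plain gradient-descent logistic regression rather than the paper's exact classifier.

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, steps=2000):
    """Plain (unregularized) gradient-descent logistic regression;
    returns the learned weight vector."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))  # predicted probabilities
        w -= lr * X.T @ (p - y) / len(y)  # gradient of the log-loss
    return w

def mean_abs_weight_by_group(X, y, n_phenotypes):
    """Mean absolute learned weight within the phenotype block versus
    the raw clinical-term block of the feature vector."""
    w = np.abs(fit_logistic(X, y))
    return w[:n_phenotypes].mean(), w[n_phenotypes:].mean()
```

Comparing the two returned means gives a single-number summary of the relative contribution of the phenotype features, analogous to the red-versus-blue comparison in Figure 35.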