Identifiable Phenotyping using Constrained Non–Negative Matrix Factorization
Abstract
This work proposes a new algorithm for automated and simultaneous phenotyping of multiple co-occurring medical conditions, also referred to as comorbidities, using clinical notes from electronic health records (EHRs). A latent factor estimation technique, non-negative matrix factorization (NMF), is augmented with domain constraints from weak supervision to obtain sparse latent factors that are grounded to a fixed set of chronic conditions. The proposed grounding mechanism ensures a one-to-one, identifiable, and interpretable mapping between the latent factors and the target comorbidities. Qualitative assessment of the empirical results by clinical experts shows that the proposed model learns clinically interpretable phenotypes, which are also shown to have competitive performance on a mortality prediction task. The proposed method can be readily adapted to any non-negative EHR data across various healthcare institutions.
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA

Suriya Gunasekar (suriya@utexas.edu)
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA

David Sontag (dsontag@cs.nyu.edu)
Computer Science
New York University
NYC, NY, USA

Joydeep Ghosh (jghosh@utexas.edu)
Electrical and Computer Engineering
The University of Texas at Austin
Austin, TX, USA
1 Introduction
Reliably querying for patients with specific medical conditions across multiple organizations facilitates many large-scale healthcare applications such as cohort selection, multi-site clinical trials, and epidemiological studies (Richesson et al., 2013; Hripcsak and Albers, 2013; Pathak et al., 2013). However, raw EHR data collected across diverse populations and multiple caregivers can be extremely high dimensional, unstructured, heterogeneous, and noisy. Manually querying such data is a formidable challenge for healthcare professionals.
EHR driven phenotypes are concise representations of medical concepts composed of clinical features, conditions, and other observable traits, facilitating accurate querying of individuals from EHRs (NIH Health Care Systems Research Collaboratory, 2014). Efforts like the eMERGE Network (http://emerge.mc.vanderbilt.edu/) and PheKB (http://phekb.org/) are well known examples of EHR driven phenotyping. Traditional rule-based methods for composing phenotypes require substantial time and expert knowledge, and leave little scope for exploratory analyses. This motivates automated EHR driven phenotyping using machine learning with limited expert intervention.
We propose a weakly supervised model for jointly phenotyping co-occurring conditions (comorbidities) observed in intensive care unit (ICU) patients. Comorbidities are a set of co-occurring conditions present in a patient at the time of admission that are not directly related to the primary diagnosis for hospitalization (Elixhauser et al., 1998). Phenotypes for the comorbidities listed in Table 1 are derived using text-based features from clinical notes in the publicly accessible MIMIC-III EHR database (Saeed et al., 2011). We present a novel constrained non-negative matrix factorization (CNMF) of the EHR matrix that aligns the factors with the target comorbidities, yielding sparse, interpretable, and identifiable phenotypes.
The following aspects of our model distinguish our work from prior efforts:

Identifiability: A key shortcoming of standard unsupervised latent factor models such as NMF (Lee and Seung, 2001) and Latent Dirichlet Allocation (LDA) (Blei et al., 2003) for phenotyping is that the estimated latent factors are interchangeable and hence unidentifiable as phenotypes for specific conditions of interest. We tackle identifiability by incorporating weak (noisy) but inexpensive supervision as constraints in our framework. Specifically, we obtain weak supervision for the target conditions in Table 1 using the Elixhauser Comorbidity Index (ECI) (Elixhauser et al., 1998), computed solely from patient administrative data (without human intervention). We then ground the latent factors to have a one-to-one mapping with the conditions of interest by incorporating the comorbidities predicted by ECI as support constraints on the patient loadings along the latent factors.

Simultaneous modeling of comorbidities: ICU patients studied in this paper are frequently afflicted with multiple co-occurring conditions besides the primary cause for admission. In the proposed NMF model, phenotypes for such co-occurring conditions are jointly modeled to capture the resulting correlations.

Interpretability: For wider applicability of EHR driven phenotyping to advanced clinical decision making, it is desirable that phenotype definitions be clinically interpretable and representable as a concise set of rules. We consider sparsity in the representations as a proxy for interpretability and explicitly encourage conciseness of the phenotypes using tuneable sparsity-inducing soft constraints.
We evaluate the effectiveness of the proposed method in terms of interpretability, clinical relevance, and prediction performance on EHR data from MIMIC-III. Although we focus on ICU patients using clinical notes, the proposed model and algorithm are general and can be applied to any non-negative EHR data from any population group.
2 Data Extraction
The MIMIC-III dataset consists of de-identified EHRs for adult ICU patients at the Beth Israel Deaconess Medical Center, Boston, Massachusetts. For all ICU stays within each admission, clinical notes including nursing progress reports, physician notes, discharge summaries, ECG reports, etc. are available. We analyze patients who stayed in the ICU for at least 48 hours. We derive phenotypes using clinical notes collected within the first 48 hours of patients' ICU stay to evaluate the quality of phenotypes when limited patient data is available. Further, we evaluate the phenotypes on a mortality prediction problem. To avoid obvious indicators of mortality and comorbidities, apart from restricting to 48-hour data, we exclude discharge summaries as they explicitly mention patient outcomes (including mortality).
Table 1: The 30 target comorbidities (Elixhauser et al., 1998).

| Congestive Heart Failure | Cardiac Arrhythmias | Valvular Disease | Pulmonary Circulation Disorder | Peripheral Vascular Disorder |
| Hypertension | Paralysis | Other Neurological Disorders | Chronic Pulmonary Diseases | Diabetes Uncomplicated |
| Diabetes Complicated | Hypothyroidism | Renal Failure | Liver Disease (excluding bleeding) | Peptic Ulcer |
| AIDS | Lymphoma | Metastatic Cancer | Solid Tumor (without metastasis) | Rheumatoid Arthritis |
| Coagulopathy | Obesity | Weight Loss | Fluid Electrolyte Disorder | Blood Loss Anemia |
| Deficiency Anemia | Alcohol Abuse | Drug Abuse | Psychoses | Depression |

Clinically relevant bag-of-words features: Aggregated clinical notes from all sources are represented as a single bag-of-words feature vector. To enhance clinical relevance, we create a custom vocabulary containing clinical terms from two sources: (a) the Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT), and (b) the level-0 terms provided by the Unified Medical Language System (UMLS), consolidated into a standard vocabulary format using Metamorphosys, an application provided by UMLS for custom vocabulary creation (see https://www.nlm.nih.gov/healthit/snomedct/ and https://www.nlm.nih.gov/research/umls/). To extract clinical terms from the raw text, the notes were tagged for chunking using a conditional random field tagger (https://taku910.github.io/crfpp/). The tags are looked up against the custom vocabulary (generated from Metamorphosys) to obtain the bag-of-words representation. Our final vocabulary has 3600 clinical terms.
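As a rough illustration of this pipeline (not the authors' code), the vocabulary-lookup step can be sketched as follows; the toy vocabulary and the whitespace tokenizer below are hypothetical stand-ins for the Metamorphosys-generated vocabulary and the CRF-based chunker:

```python
from collections import Counter

def bag_of_words(note: str, vocabulary: set) -> Counter:
    """Count occurrences of vocabulary terms in a clinical note.

    Whitespace tokenization stands in for the CRF-based chunking used in
    the paper; the vocabulary is a toy stand-in for the SNOMED CT / UMLS
    terms produced by Metamorphosys.
    """
    tokens = note.lower().split()
    return Counter(t for t in tokens if t in vocabulary)

vocab = {"insulin", "dyspnea", "hypertension"}  # hypothetical mini-vocabulary
note = "Pt with hypertension started on insulin insulin titration"
counts = bag_of_words(note, vocab)  # counts form one column of the EHR matrix
```

Terms outside the curated vocabulary are simply dropped, which is what keeps the resulting features clinically relevant.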

Computable weak diagnosis: We incorporate domain constraints from weak supervision to ground the latent factors to have a one-to-one mapping with the conditions of interest. In the model described in Section 3, this is enforced by constraining the non-zero entries of the patient loadings along the latent factors using a weak diagnosis of comorbidities. The weak diagnoses of the target comorbidities in Table 1 are obtained using ECI (implementation: https://git.io/v6e7q), computed solely from patient administrative data without human annotation. We refer to this index as a weak diagnosis as it is not a physician's exact diagnosis and is subject to noise and misspecification. Note that ECI ignores diagnosis codes related to the primary diagnosis of admission. Thus, ECI models the presence and absence of conditions other than the primary reason for admission (comorbidities). The phenotype candidates from the proposed model can be considered concise representations of such comorbidities.
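To make the mechanism concrete, a minimal hypothetical sketch of deriving a weak diagnosis set from administrative billing codes is given below. The real ECI implementation linked above covers all 30 comorbidities over many ICD-9 code ranges; this toy table shows only the lookup-and-exclude logic:

```python
# Hypothetical, abbreviated code-to-comorbidity table (illustration only;
# the actual ECI mapping spans many ICD-9 ranges per comorbidity).
ECI_MAP = {
    "4280": "Congestive Heart Failure",
    "4019": "Hypertension",
    "25000": "Diabetes Uncomplicated",
}

def weak_diagnoses(icd9_codes, primary_code):
    """Comorbidity set C_n from a patient's billing codes.

    Codes matching the primary diagnosis of admission are excluded,
    mirroring how ECI ignores the primary diagnosis.
    """
    return {ECI_MAP[c] for c in icd9_codes
            if c in ECI_MAP and c != primary_code}

c_n = weak_diagnoses(["4280", "4019", "V1046"], primary_code="4019")
```

Because the table is computed from billing codes alone, this labeling requires no clinician time, at the cost of noise and misspecification.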
3 Identifiable High–Throughput Phenotyping
The notation used in the paper is enumerated in Table 2.
Table 2: Notation.

| Notation | Description |
| [K], for an integer K | Set of indices {1, 2, ..., K}. |
| Δ_α^D | Scaled simplex in dimension D: {v ∈ R_+^D : Σ_i v_i = α}. |
| M_j | j-th column of a matrix M. |
| supp(w) | Support of a vector w: {i : w_i ≠ 0}. |
| Observations | |
| N, D | Number of patients (N) and features (D), respectively. |
| X ∈ Z_+^{D×N} | EHR matrix from MIMIC-III: clinically relevant bag-of-words features from notes in the first 48 hours of ICU stay for N patients. |
| K = 30 | Number of comorbidities in Table 1, indexed by c ∈ [K]. |
| C_n ⊆ [K], for n ∈ [N] | Set of comorbidities patient n is diagnosed with using ECI. |
| Factor matrices | |
| W ∈ [0, 1]^{K×N} | Estimate of patients' risk for the K conditions. |
| V ∈ R_+^{D×K}, v_0 ∈ R_+^D | Estimates of the phenotype factor matrix and the feature bias vector. |
In summary, for each patient n ∈ [N]: (a) the bag-of-words features from clinical notes are represented as column X_n of the EHR matrix X, and (b) the list of comorbidities diagnosed using ECI is denoted C_n. Let an unknown W* ∈ [0, 1]^{K×N} represent the risk of patients for the K comorbidities of interest; each entry W*_{cn} lies in the interval [0, 1], with 0 and 1 indicating no risk and maximum risk, respectively, of patient n being afflicted with condition c. If C*_n denotes an accurate diagnosis for patient n, then W* satisfies supp(W*_n) = C*_n.
Definition 1 (EHR driven phenotype)
EHR driven phenotypes for K co-occurring conditions are a set of vectors {V_c ∈ R_+^D : c ∈ [K]}, such that for a patient n afflicted with conditions C*_n,

X_n ≈ Σ_{c ∈ C*_n} W*_{cn} V_c + v_0,   (1)

where v_0 ∈ R_+^D is a bias representing the feature component observed independently of the target conditions. The matrix V ∈ R_+^{D×K} with {V_c} as columns is referred to as the phenotype factor matrix.
Note that we explicitly model a feature bias v_0 to capture frequently occurring terms that are not discriminative of the target conditions, e.g., temperature, pain, etc.
Cost Function
The bag-of-words features are represented as counts in the EHR matrix X ∈ Z_+^{D×N}. We consider a factorized approximation of X parametrized by the matrices W ∈ [0, 1]^{K×N}, V ∈ R_+^{D×K}, and v_0 ∈ R_+^D as X̂ = V W + v_0 1^T, where 1 denotes a vector of all ones of appropriate dimension. The approximation error of the estimate is measured using the generalized KL-divergence defined as follows:

D(X ‖ X̂) = Σ_{d ∈ [D], n ∈ [N]} ( X_{dn} log(X_{dn} / X̂_{dn}) − X_{dn} + X̂_{dn} ).   (2)

Minimizing the generalized KL-divergence is equivalent to maximum likelihood estimation under a Poisson distributional assumption on the individual entries of the EHR matrix parameterized by X̂ (Banerjee et al., 2005).
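A direct implementation of the divergence in (2) may make the Poisson connection concrete; the epsilon guard and the zero-handling convention below are our own numerical choices, not from the paper:

```python
import numpy as np

def gkl_divergence(X, X_hat, eps=1e-12):
    """Generalized KL-divergence D(X || X_hat) summed over all entries.

    Uses the convention 0 * log(0) = 0; minimizing this in X_hat matches
    Poisson maximum likelihood up to a constant independent of X_hat.
    """
    X = np.asarray(X, dtype=float)
    X_hat = np.asarray(X_hat, dtype=float)
    log_term = np.where(X > 0, X * np.log((X + eps) / (X_hat + eps)), 0.0)
    return float(np.sum(log_term - X + X_hat))

X = np.array([[1.0, 2.0], [0.0, 4.0]])
d_same = gkl_divergence(X, X)        # divergence of X from itself is 0
d_off = gkl_divergence(X, X + 1.0)   # any other estimate gives a positive value
```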
Phenotypes
For the K comorbidities, the columns of V, {V_c : c ∈ [K]}, are proposed as candidate phenotypes derived from the EHR matrix X, i.e., as approximations to the phenotypes in Definition 1.
Constraints
The following constraints are incorporated in learning W, V, and v_0.

Support Constraints: The non-negative rank-K factorization of X is 'grounded' to the target comorbidities by constraining the support of the risk vector W_n corresponding to patient n, using the weak diagnosis C_n from ECI as an approximation of the conditions C*_n in Definition 1.

Sparsity Constraints: Scaled simplex constraints are imposed on the columns of V with a tuneable parameter α to encourage sparsity of the phenotypes. Restricting the patient loadings matrix as W ∈ [0, 1]^{K×N} not only allows us to interpret the loadings as the patients' risk, but also makes the simplex constraints effective in a bilinear optimization.
Simultaneous phenotyping of comorbidities using constrained NMF is posed as follows:

min_{W, V, v_0}  D(X ‖ V W + v_0 1^T)   (3)
s.t.  supp(W_n) ⊆ C_n for all n ∈ [N],  W ∈ [0, 1]^{K×N},
      V_c ∈ Δ_α^D for all c ∈ [K],  v_0 ∈ R_+^D.

The optimization in (3) is convex in either factor with the other factor fixed. It is solved using alternating minimization with projected gradient descent (Parikh and Boyd, 2014; Lin, 2007); see Algorithm 1 for the complete algorithm. The proposed model can in general incorporate any weak diagnosis of medical conditions. We note that, since we use ECI, the results are not representative of the primary diagnoses at admission.
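The alternating projected-gradient scheme can be sketched as follows. This is a simplified sketch under our own choices (fixed step size, a standard Euclidean simplex projection, random initialization), not a reproduction of Algorithm 1:

```python
import numpy as np

def project_scaled_simplex(v, alpha):
    """Euclidean projection of v onto {u >= 0 : sum(u) = alpha}."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - alpha
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def cnmf(X, supports, K, alpha, n_iters=100, step=1e-3, seed=0):
    """Toy alternating projected gradient for min_{W,V,v0} D(X || V W + v0 1^T)."""
    rng = np.random.default_rng(seed)
    D, N = X.shape
    V = np.apply_along_axis(project_scaled_simplex, 0, rng.random((D, K)), alpha)
    v0 = rng.random(D) + 0.1          # keep the estimate strictly positive
    M = np.zeros((K, N))              # support mask from the weak diagnoses C_n
    for n, C_n in enumerate(supports):
        M[list(C_n), n] = 1.0
    W = rng.random((K, N)) * M
    for _ in range(n_iters):
        # Gradient of the generalized KL-divergence w.r.t. the estimate X_hat.
        R = 1.0 - X / (V @ W + v0[:, None])
        # W-step: gradient step, then enforce support and box constraints.
        W = np.clip((W - step * (V.T @ R)) * M, 0.0, 1.0)
        R = 1.0 - X / (V @ W + v0[:, None])
        # V-step: gradient step, then project each column onto the alpha-simplex.
        V = np.apply_along_axis(project_scaled_simplex, 0, V - step * (R @ W.T), alpha)
        v0 = np.maximum(v0 - step * R.sum(axis=1), 1e-8)
    return W, V, v0
```

By construction, every iterate satisfies the constraints of (3): supp(W_n) ⊆ C_n, W ∈ [0, 1]^{K×N}, and each column of V sums to α.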
4 Empirical Evaluation
The estimated phenotypes are evaluated on various metrics. We denote the model learned using Algorithm 1 with a given parameter α as α-CNMF. The following baselines are used for comparison:

Labeled LDA (LLDA): LLDA (Ramage et al., 2009) is the supervised counterpart of LDA, a probabilistic model for estimating the topic distribution of a corpus. It assumes that the word counts of documents arise from multinomial distributions. It incorporates supervision on the topics contained in a document and can be naturally adapted for phenotyping from bag-of-words clinical features, where the topic-word distributions form candidate phenotypes. While LLDA assumes that the topic loadings of a document lie on the probability simplex, α-CNMF allows each patient-condition loading to lie in [0, 1]. In interpreting the patient loading as a disease risk, the latter allows patients to have varying levels of disease prevalence. Also, LLDA can induce sparsity only indirectly via a hyperparameter of the informative prior on the topic-word distributions. While this does not guarantee sparse estimates, we obtain reasonable sparsity in the LLDA estimates. We use the Gibbs sampling code from MALLET (McCallum, 2002) for inference. For a fair comparison to CNMF, which uses an extra bias factor, we allow LLDA to model an extra topic shared by all documents in the corpus.

NMF with support constraints (NMF+support): This NMF model incorporates the non-negativity and support constraints from weak supervision, but not the sparsity-inducing constraints on the phenotype matrix. This allows us to study the effect of the sparsity-inducing constraints on interpretability. On the other hand, imposing sparsity without our grounding technique does not yield identifiable topics and hence is not studied as a baseline.

Multilabel Classification (MLC): This baseline treats the weak supervision (from ECI) as accurate labels in a fully supervised model. A sparsity-inducing regularized logistic regression classifier is learned for each condition independently. The learned weight vector for each condition c determines the importance of clinical terms in discriminating patients with condition c and is treated as the candidate phenotype for condition c.
The weak supervision does not account for the primary diagnosis for admission in the ICU population, as the ECI ignores primary diagnoses at admission (Elixhauser et al., 1998). However, the learning algorithm can be easily modified to account for the primary diagnoses, if required, by using a modified form of supervision or by absorbing the effects in an additional additive term appended to the model. Nevertheless, the proposed model generates highly interpretable phenotypes for comorbidities. Finally, to mitigate the effect of local minima, for each model (wherever applicable) the corresponding algorithm was run with multiple random initializations, and the results providing the lowest divergence were chosen for comparison.
4.1 Interpretability–accuracy trade–off
Sparsity of the latent factors is used as a proxy for interpretability of the phenotypes. Sparsity is measured as the median of the number of non-zero entries in the columns of the phenotype matrix V (lower is better). The parameter α in α-CNMF controls the sparsity by imposing scaled simplex constraints on V. CNMF was trained for multiple values of α. Stronger sparsity-inducing constraints result in a worse fit to the cost function. This trade-off is indeed observed in all models (see Appendix A for details). For all models, we pick the estimates with the lowest median sparsity while ensuring that the phenotype candidate for every condition is represented by at least a minimum number of non-zero clinical terms.
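The sparsity metric described above is straightforward to compute; a short version follows, with a tolerance of our own choosing for what counts as "non-zero":

```python
import numpy as np

def median_phenotype_sparsity(V, tol=1e-8):
    """Median count of non-zero entries across the columns of the phenotype
    matrix V (one column per condition); lower means sparser phenotypes."""
    return float(np.median((np.abs(V) > tol).sum(axis=0)))

V = np.array([[0.7, 0.0],
              [0.0, 1.0],
              [0.3, 0.0]])
s = median_phenotype_sparsity(V)  # columns have 2 and 1 non-zeros
```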
4.2 Clinical relevance of phenotypes
We asked two clinicians to evaluate the candidate phenotypes based on the top terms learned by each model. Ratings were requested on a scale from 'poor' to 'excellent'. The experts were asked to rate based on whether the terms are relevant to the corresponding condition and whether the terms are jointly discriminative of the condition. Figure 1 summarizes the qualitative ratings obtained for all models. For each model, we show two columns (corresponding to the two experts). The stacked bars show the histogram of the ratings for each model. A substantial fraction of the phenotypes learned from our model were rated 'good' or better by both annotators. In contrast, NMF with support constraints but without sparsity-inducing constraints hardly learns clinically relevant phenotypes.
Table 3: Pairwise comparison of expert ratings across models.

|             | 0.4-CNMF | LLDA | MLC | NMF+support |
| 0.4-CNMF    | 0        | 28   | 20  | 44          |
| LLDA        | 7        | 0    | 12  | 35          |
| MLC         | 6        | 21   | 0   | 42          |
| NMF+support | 1        | 0    | 1   | 0           |
The proposed model 0.4-CNMF also received a significantly higher number of 'excellent' and 'good' ratings from both experts. Although LLDA and MLC estimate sparse phenotypes, they are not on par with 0.4-CNMF. Table 3 summarizes the relative rankings of all models. Each cell entry shows the number of times the model along the corresponding row was rated strictly better than the model along the column. 0.4-CNMF is better than all three baselines. The supervised baseline MLC outperforms LLDA even though LLDA learns comorbidities jointly, suggesting that the simplex constraint imposed by LLDA may be restrictive.
Figure 2 shows an example of a phenotype (top 15 terms) learned by all models for psychoses. For this condition, the proposed model was rated 'excellent' and strictly better than both LLDA and MLC by both annotators, while the LLDA and MLC ratings were tied. However, the phenotype for hypertension (Figure 3) learned by 0.4-CNMF has more terms related to 'Renal Failure' or 'End Stage Renal Disease' than to hypertension. One of our annotators pointed out that "Candidate 1 is a fairly good description of renal disease, which is an end organ complication of hypertension", where the anonymized Candidate 1 refers to 0.4-CNMF. Exploratory analysis suggests that hypertension and renal failure are the most commonly co-occurring pair of conditions: over 93% of patients that have hypertension (according to ECI) also suffer from renal failure. Thus, our model is unable to distinguish between highly co-occurring conditions. The other baselines were also rated poorly for hypertension, with LLDA rated only slightly better. More examples of phenotypes are provided in Appendix B.
4.3 Mortality prediction
To quantitatively evaluate the utility of the learned phenotypes, we consider a mortality prediction task. We divide the EHR into cross-validation folds of training and test patients. As this is an imbalanced class problem, the training-test splits are stratified by the mortality labels. For each split, all models were applied to the training data to obtain phenotype candidates V and feature biases v_0. For each model, the patient loadings along the respective phenotype space are used as features to train a logistic regression classifier for mortality prediction. For CNMF and NMF+support, these loadings are obtained by minimizing (3) over W with V and v_0 fixed. For LLDA, they are obtained using Gibbs sampling with fixed topic-word distributions. For MLC, the predicted class probabilities of the comorbidities are used as features. Additionally, we train a logistic regression classifier using the full EHR matrix as features.
We clarify the following points on the methodology: (1) V is learned on the patients in the training dataset only; hence there is no information leak from test patients into training. (2) Test patients' comorbidities from ECI are not used as support constraints on their loadings. (3) Regularized logistic regression classifiers are used to learn the models for mortality prediction; the regularization parameters are chosen via grid search.
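For CNMF, the test-time loadings described above can be obtained by projected gradient descent on W alone, with V and v_0 held fixed and no support constraints applied; a sketch under our own step-size and initialization choices:

```python
import numpy as np

def infer_loadings(X_test, V, v0, n_iters=100, step=1e-3):
    """Patient loadings W in [0, 1]^{K x N} for unseen patients, holding the
    learned phenotypes V and bias v0 fixed. No support constraints are used,
    since test patients' ECI diagnoses are withheld."""
    K, N = V.shape[1], X_test.shape[1]
    W = np.full((K, N), 0.5)
    for _ in range(n_iters):
        R = 1.0 - X_test / (V @ W + v0[:, None])  # gradient of the divergence
        W = np.clip(W - step * (V.T @ R), 0.0, 1.0)
    return W

V = np.array([[0.5, 0.0], [0.5, 0.0], [0.0, 0.5], [0.0, 0.5]])
v0 = np.full(4, 0.1)
X_test = np.array([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0], [1.0, 1.0]])
W_test = infer_loadings(X_test, V, v0)  # features for the mortality classifier
```

The resulting low-dimensional loadings can then be fed to any standard classifier, e.g., a regularized logistic regression.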
Table 4: Mortality prediction performance; mean (standard deviation) over cross-validation folds.

| Model            | AUROC       | Sensitivity | Specificity |
| 1. 0.4-CNMF      | 0.63 (0.02) | 0.59 (0.04) | 0.62 (0.03) |
| 2. NMF+support   | 0.52 (0.02) | 0.56 (0.13) | 0.51 (0.14) |
| 3. LLDA          | 0.64 (0.02) | 0.62 (0.03) | 0.61 (0.05) |
| 4. MLC           | 0.66 (0.01) | 0.62 (0.06) | 0.62 (0.05) |
| 5. Full EHR      | 0.72 (0.02) | 0.69 (0.02) | 0.63 (0.04) |
| 6. CNMF+Full EHR | 0.72 (0.02) | 0.61 (0.09) | 0.71 (0.07) |
The performance of the above models trained with regularized logistic regression over cross-validation is reported in rows 1-5 of Table 4. The classifier trained on the full EHR unsurprisingly outperforms all other models, as it uses richer high-dimensional information. All phenotyping baselines except NMF+support show comparable performance on mortality prediction which, in spite of learning on a small number of features, is only slightly worse than the predictive performance of the full EHR with its 3600 features.
Augmented features for mortality prediction (CNMF+Full EHR)
Unsurprisingly, Table 4 suggests that the high-dimensional EHR data carries additional information for mortality prediction that is lacking in the low-dimensional features generated via phenotyping. To evaluate whether this additional information can be captured by CNMF when augmented with a small number of raw EHR features, we train a mortality prediction classifier using regularized logistic regression on the CNMF loadings combined with the raw bag-of-words features, with parameters tuned to match the performance of the full EHR model. The results are reported in the final row of Table 4.
Exploring the weights learned by this classifier for all features, we observe that only a small fraction of the raw EHR bag-of-words features have non-zero weights. This suggests that the comorbidities capture a significant amount of the predictive information on mortality and achieve performance comparable to the full EHR model with only a small number of additional terms. See Figure 35 in the Appendix for the weights learned by the classifier for all features. Figure 4 shows the comorbidities and EHR terms with the top-magnitude weights learned by the CNMF+full EHR classifier. For example, it is interesting to note that the top-weighted EHR term, dnr ('Do Not Resuscitate'), is not indicative of any comorbidity but is predictive of patient mortality.
5 Discussion and Related Work
Supervised learning methods like Carroll et al. (2011); Kawaler et al. (2012); Chen et al. (2013) or deep learning methods (Lipton et al., 2015; Kale et al., 2015; Henao et al., 2015) for EHR driven phenotyping require expert supervision. Although unsupervised methods like NMF (Anderson et al., 2014) and non–negative tensor factorization (Kolda and Bader, 2009; Harshman, 1970) are inexpensive alternatives (Ho et al., 2014a, b, c; Luo et al., 2015), they pose challenges with respect to identifiability, interpretability and computation, limiting their scalability.
Most closely related to our paper is the work of Halpern et al. (2016b), a semi-supervised algorithm for learning the joint distribution over conditions, requiring only that a domain expert specify one or more 'anchor' features for each condition (no other labeled data). An 'anchor' for a condition is a set of clinical features whose presence is highly indicative of the target condition, but whose absence is not a strong label for absence of the target condition (Halpern et al., 2014, 2016a). For example, the presence of insulin medication is highly indicative of diabetes, but the converse is not true. Joshi et al. (2015) use a similar supervision approach for comorbidity prediction. Whereas the conditions in Halpern et al. (2016b) are binary valued, in our work they are real-valued between 0 and 1. Furthermore, we assume that the support of the conditions is known in the training data.
Our approach achieves identifiability using support constraints to ground the latent factors, and interpretability using sparsity constraints. The phenotypes learned are clinically interpretable and, when augmented with a sparse set of raw bag-of-words features, are predictive of mortality on an unseen patient population. The model outperforms the baselines in terms of clinical relevance according to experts and is significantly better than the model with supervision but no sparsity constraints. The proposed method can be easily extended to other non-negative data to obtain more comprehensive phenotypes. However, we observed that the algorithm does not discriminate between frequently co-occurring conditions, e.g., renal failure and hypertension. Further, the weak supervision (using ECI) does not account for the primary diagnosis of admission. Additional model flexibility to account for a primary condition in explaining the observations could potentially improve performance. Addressing the above limitations, along with quantitative evaluation of risk for disease prediction and understanding conditions for uniqueness of the phenotyping solutions, are interesting areas of follow-up work.
Acknowledgements
We thank Dr. Saul Blecker and Dr. Stephanie Kreml for their qualitative evaluation of the computational phenotypes. SJ, SG and JG were supported by NSF: SCH #1418511. DS was supported by NSF CAREER award #1350965. We also thank Yacine Jernite for sharing code used in preprocessing the clinical notes.
References
 Anderson et al. (2014) A. Anderson, P. K. Douglas, W. T. Kerr, V. S. Haynes, A. L. Yuille, J. Xie, Y. N. Wu, J. A. Brown, and M. S. Cohen. Nonnegative matrix factorization of multimodal MRI, fMRI and phenotypic data reveals differential changes in default mode subnetworks in ADHD. Neuroimage, 2014.
 Banerjee et al. (2005) A. Banerjee, S. Merugu, I. S. Dhillon, and J. Ghosh. Clustering with Bregman divergences. Journal of Machine Learning Research, 2005.
 Blei et al. (2003) D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent dirichlet allocation. Journal of Machine Learning Research, 2003.
 Carroll et al. (2011) R. J. Carroll, A. E. Eyler, and J. C. Denny. Naive electronic health record phenotype identification for rheumatoid arthritis. In AMIA Annual Symposium, 2011.
 Chen et al. (2013) Y. Chen, R. J. Carroll, E. Hinz, A. Shah, A. E. Eyler, J. C. Denny, and H. Xu. Applying active learning to highthroughput phenotyping algorithms for electronic health records data. Journal of the American Medical Informatics Association, 2013.
 Elixhauser et al. (1998) A. Elixhauser, C. Steiner, D. R. Harris, and R. M. Coffey. Comorbidity measures for use with administrative data. Medical Care, 1998.
 Halpern et al. (2014) Y. Halpern, Y. Choi, S. Horng, and D. Sontag. Using anchors to estimate clinical state without labeled data. In AMIA Annual Symposium, 2014.
 Halpern et al. (2016a) Y. Halpern, S. Horng, Y. Choi, and D. Sontag. Electronic medical record phenotyping using the anchor and learn framework. Journal of the American Medical Informatics Association, 2016a.
 Halpern et al. (2016b) Y. Halpern, S. Horng, and D. Sontag. Clinical tagging with joint probabilistic models. Conference on Machine Learning for Health Care, 2016b.
 Harshman (1970) R. A. Harshman. Foundations of the parafac procedure: Models and conditions for an explanatory multimodal factor analysis. UCLA Working Papers in Phonetics, 1970.
 Henao et al. (2015) R. Henao, J. T. Lu, J. E. Lucas, J. Ferranti, and L. Carin. Electronic health record analysis via deep poisson factor models. Journal of Machine Learning Research, 2015.
 Ho et al. (2014a) J. C Ho, J. Ghosh, S. R. Steinhubl, W. F. Stewart, J. C. Denny, B. A. Malin, and J. Sun. Limestone: Highthroughput candidate phenotype generation via tensor factorization. Journal of Biomedical Informatics, 2014a.
 Ho et al. (2014b) J. C. Ho, J. Ghosh, and J. Sun. Extracting phenotypes from patient claim records using nonnegative tensor factorization. In International Conference on Brain Informatics and Health, 2014b.
 Ho et al. (2014c) J. C. Ho, J. Ghosh, and J. Sun. Marble: Highthroughput phenotyping from electronic health records via sparse nonnegative tensor factorization. In SIGKDD International Conference on Knowledge Discovery and Data Mining, 2014c.
 Hripcsak and Albers (2013) G. Hripcsak and D. J. Albers. Nextgeneration phenotyping of electronic health records. Journal of the American Medical Informatics Association, 2013.
 Joshi et al. (2015) S. Joshi, O. Koyejo, and J. Ghosh. Simultaneous prognosis of multiple chronic conditions from heterogeneous EHR data. In International Conference on Healthcare Informatics, 2015.
 Kale et al. (2015) D. C. Kale, Z. Che, M. T. Bahadori, W. Li, Y. Liu, and R. Wetzel. Causal Phenotype Discovery via Deep Networks. In AMIA Annual Symposium, 2015.
 Kawaler et al. (2012) E. Kawaler, A. Cobian, P. Peissig, D. Cross, S. Yale, and M. Craven. Learning to predict posthospitalization VTE risk from EHR data. In AMIA Annual Symposium, 2012.
 Kolda and Bader (2009) T. G. Kolda and B. W. Bader. Tensor decompositions and applications. SIAM Review, 2009.
 Lee and Seung (2001) D. Lee and H. S. Seung. Algorithms for nonnegative matrix factorization. In Advances in Neural Information Processing Systems, 2001.
 Lin (2007) C. J. Lin. Projected gradient methods for nonnegative matrix factorization. Neural computation, 2007.
 Lipton et al. (2015) Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzell. Learning to diagnose with LSTM recurrent neural networks. arXiv preprint arXiv:1511.03677, 2015.
 Luo et al. (2015) Y. Luo, Y. Xin, E. Hochberg, R. Joshi, O. Uzuner, and P. Szolovits. Subgraph augmented non-negative tensor factorization (SANTF) for modeling clinical narrative text. Journal of the American Medical Informatics Association, 2015.
 McCallum (2002) A. K. McCallum. Mallet: A machine learning for language toolkit, 2002. URL http://mallet.cs.umass.edu.
 NIH Health Care Systems Research Collaboratory (2014) NIH Health Care Systems Research Collaboratory. Rethinking Clinical Trials: A Living Textbook of Pragmatic Clinical Trials, 2014.
 Parikh and Boyd (2014) N. Parikh and S. P. Boyd. Proximal algorithms. Foundations and Trends in optimization, 2014.
 Pathak et al. (2013) J. Pathak, A. N. Kho, and J. C. Denny. Electronic health recordsdriven phenotyping: challenges, recent advances, and perspectives. Journal of the American Medical Informatics Association, 2013.
 Ramage et al. (2009) D. Ramage, D. Hall, R. Nallapati, and C. D. Manning. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Empirical Methods in Natural Language Processing, 2009.
 Richesson et al. (2013) R. L. Richesson, W. E. Hammond, M. Nahm, D. Wixted, G. E. Simon, J. G. Robinson, A. E. Bauck, D. Cifelli, M. M. Smerek, J. Dickerson, et al. Electronic health records based phenotyping in nextgeneration clinical trials: a perspective from the nih health care systems collaboratory. Journal of the American Medical Informatics Association, 2013.
 Saeed et al. (2011) M. Saeed, M. Villarroel, A. T. Reisner, G. Clifford, L. W. Lehman, G. Moody, T. Heldt, T. H. Kyaw, B. Moody, and R. G. Mark. Multiparameter intelligent monitoring in intensive care II (MIMICII): a publicaccess intensive care unit database. Critical Care Medicine, 2011.
Appendix A Phenotype Sparsity
As suggested in Section 4.1, there is an inherent trade-off between fit to the cost function and the desired sparsity. This trade-off is made explicit for α-CNMF in Figure 5. The sparsity of LLDA is controlled by tuning the hyperparameter of the word-topic multinomial parameters (Blei et al., 2003), and that of MLC via the regularization parameter. A smaller value of the LLDA hyperparameter ensures that the word-topic probabilities are sparse; as its value is increased, sparsity decreases (i.e., the number of non-zero elements increases). For the logistic regression used by MLC, as the regularization parameter increases, sparsity increases. Figure 5(a) demonstrates the sparsity of the estimated phenotypes for LLDA and Figure 5(b) shows that of logistic regression. We choose the phenotypes obtained at the selected hyperparameter values for qualitative annotation. The parameters were chosen to achieve the lowest median sparsity while ensuring that for each chronic condition, the corresponding phenotype candidate is represented by at least a minimum number of non-zero clinical terms. Our fourth baseline (NMF+support) did not estimate sparse phenotypes and does not have a tuneable sparsity parameter (but was nevertheless annotated for qualitative evaluation). The proposed model provides the best sparsity among all models.
Appendix B Sample Phenotypes for Baseline Models
Appendix C Augmented Mortality Prediction
Figure 35 shows the weights learned by the classifier for all features. The weights shaded red correspond to phenotypes and are relatively high compared to the raw notes-based features (shaded blue), indicating that the comorbidities capture a significant amount of the predictive information on mortality and achieve performance comparable to the full EHR model when augmented with additional raw clinical terms.