Multimodal Machine Learning for Automated ICD Coding

Keyang Xu, Mike Lam, Jingzhi Pang, Xin Gao, Charlotte Band,
Piyush Mathur MD, Frank Papay MD, Ashish K. Khanna MD, Jacek B. Cywinski MD,
Kamal Maheshwari MD, Pengtao Xie, Eric Xing
Petuum Inc., Pittsburgh, Pennsylvania, USA
Cleveland Clinic, Cleveland, Ohio USA

This study presents a multimodal machine learning model to predict ICD-10 diagnostic codes. We developed separate machine learning models that can handle data from different modalities, including unstructured text, semi-structured text and structured tabular data. We further employed an ensemble method to integrate all modality-specific models to generate ICD-10 codes. Key evidence was also extracted to make our prediction more convincing and explainable.

We used the Medical Information Mart for Intensive Care III (MIMIC-III) dataset to validate our approach. For ICD code prediction, our best-performing model (micro-F1 = 0.7633, micro-AUC = 0.9541) significantly outperforms other baseline models, including TF-IDF (micro-F1 = 0.6721, micro-AUC = 0.7879) and the Text-CNN model (micro-F1 = 0.6569, micro-AUC = 0.9235). For interpretability, our approach achieves a Jaccard Similarity Coefficient (JSC) of 0.1806 on text data and 0.3105 on tabular data, where well-trained physicians achieve 0.2780 and 0.5002 respectively.

Automated ICD Coding; Multimodal Machine Learning; Deep Learning

I Introduction

The International Classification of Diseases (ICD), endorsed by the World Health Organization (WHO), is a medical classification list of codes for diagnoses and procedures. ICD codes have been adopted widely by physicians and other health care providers for reimbursement, storage and retrieval of diagnostic information ([1], [2]). The process of assigning ICD codes to a patient visit is time-consuming and error-prone. Clinical coders need to extract key information from Electronic Medical Records (EMRs) and assign correct codes based on category, anatomic site, laterality, severity, and etiology [3]. The amount of information and the complex code hierarchy greatly increase the difficulty. The second annual ICD-10 coding contest results show that the average accuracy for inpatient coding overall is around 61%, far below the 95% accuracy standard under ICD-9. Coding errors can further cause billing mistakes, claim denials and underpayment [4]. Therefore, an automatic and robust ICD coding process is in great demand.

EMRs store data in different modalities, such as unstructured text, semi-structured text, and structured tabular data [5]. Unstructured text data includes notes taken by physicians, nursing notes, lab reports, test reports, and discharge summaries. Semi-structured text data refers to a list of structured phrases and unstructured sentences that describe diagnoses. Structured tabular data contains prescriptions and clinical measurements such as vital signs, lab test results, and microbiology test results ([6], [7], [8]). How to leverage all information from large-volume EMR data is a non-trivial task. Besides, in clinical practice, providing predictions with black-box machine learning models is not convincing for physicians and insurance companies. How to provide evidence related to predicted ICD codes is also an important task.

In this work, we developed an ensemble-based approach that integrates three modality-specific models to fully exploit complementary information from multiple data sources and boost the predictive accuracy of ICD codes. We also explored methods to produce interpretable and explainable predictions.

I-A Related Work

Many researchers have explored the topic of automatic ICD coding using text data in recent years [9]. Larkey and Croft [10] considered it as a single-label classification on patient discharge summaries with multiple classifiers. Kavuluru et al. [11] treated it as a multi-label classification problem [12], and developed ranking approaches after text feature engineering and selection using EMR data. Koopman et al. [13] developed a classification system combining Support Vector Machine (SVM) and rule-based methods to identify four high-impact diseases with patient death certificates. It is also common to see unsupervised and semi-supervised strategies applied in predicting ICD codes. Scheurwegs et al. [14] proposed an unsupervised medical concept extraction approach using an unlabelled corpus to frame clinical text into a list of concepts, which helps with ICD code prediction. Deep learning models are also widely used [15]. Duarte et al. [16] developed a recurrent neural model and Mullenbach et al. [17] adopted a CNN-based model with a per-label attention mechanism to assign ICD codes based on free-text descriptions. Shi et al. [18] used a neural architecture with Long Short-Term Memory (LSTM) and an attention mechanism that takes diagnostic descriptions as input to predict ICD codes. Similarly, Xie et al. [19] further explored tree-of-sequences LSTM encoding and adversarial learning to improve the prediction results.

I-B Contributions

The primary contribution of this study is twofold:

  • We applied NLP and multimodal machine learning to predict ICD diagnostic codes, achieving state-of-the-art accuracy. The complementary nature of multimodal data makes our model more robust and accurate. In addition, we effectively addressed the data imbalance issue, which is a common problem in ICD code prediction.

  • We proposed approaches to interpret predictions for both unstructured text and structured tables, which make our prediction more explainable and convincing to physicians.

II Methods

In this section, we will introduce our approach for predicting ICD-10 codes for a patient visit, as well as for identifying the evidence to make our predictions explainable.

We developed an ensemble-based approach as shown in Figure 1, which integrates modality-specific models. For unstructured text, we applied the Text Convolutional Neural Network (Text-CNN) model [20] for multi-label classification. For semi-structured text data, we constructed a deep learning model to analyze semantic similarities between diagnosis descriptions and ICD code descriptions. For tabular data, our approach transformed numeric features to binary features and fed them to a decision tree [21] to classify the ICD codes. During testing, our model ensembled the three aforementioned models on different modalities for improving prediction accuracy [22]. To make our predictions explainable, key evidence was retrieved from raw data and presented to physicians.

Fig. 1: Model architecture for ICD code prediction based on multimodal data, where each prediction is interpreted using retrieval-based methods.

II-A Dataset Description

The dataset used for this study is MIMIC-III [23], which contains approximately 58,000 hospital admissions of 47,000 patients who stayed in the ICU of the Beth Israel Deaconess Medical Center between 2001 and 2012. The original diagnostic codes are ICD-9 codes. We mapped these to ICD-10 codes, as they are more widely adopted in today’s clinical practice. Based on clinical meaningfulness and the minimal sample size requirements for training workable ML models, we selected 32 ICD codes (see details in Table I). Patient admissions that have ground truth labels of these 32 codes and at least one discharge summary were selected. Six major tables from the MIMIC-III dataset were selected:

  • ADMISSIONS contains all information regarding a patient admission, including a preliminary diagnosis.

  • LABEVENTS contains all laboratory measurements.

  • PRESCRIPTIONS contains medications related to order entries.

  • MICROBIOLOGYEVENTS contains microbiology information such as whether an organism tested negative or positive in the culture.

  • CHARTEVENTS contains all charted data including patients’ routine vital signs and other information related to their health.

  • NOTEEVENTS contains all notes including nursing and physician notes, echocardiography reports, and discharge summaries.

ICD-10 Code
Essential (primary) hypertension
Heart failure, unspecified
Unspecified atrial fibrillation
Atherosclerotic heart disease of native coronary artery without angina pectoris
Acute kidney failure, unspecified
Type 2 diabetes mellitus without complications
Hyperlipidemia, unspecified
Urinary tract infection, site not specified
Pure hypercholesterolemia, unspecified
Anemia, unspecified
Hypothyroidism, unspecified
Pneumonia, unspecified organism
Acute posthemorrhagic anemia
Severe sepsis without septic shock
Major depressive disorder, single episode, unspecified
Nicotine dependence, unspecified, uncomplicated
Thrombocytopenia, unspecified
Presence of aortocoronary bypass graft
Personal history of nicotine dependence
Hypertensive chronic kidney disease with stage 5 chronic kidney disease or end stage renal disease
Severe sepsis with septic shock
Long term (current) use of insulin
Obstructive sleep apnea (adult) (pediatric)
Unspecified asthma, uncomplicated
Age-related osteoporosis without current pathological fracture
Unspecified convulsions
End stage renal disease
Obesity, unspecified
Delirium due to known physiological condition
Unspecified protein-calorie malnutrition
Morbid (severe) obesity due to excess calories
TABLE I: 32 ICD-10 codes and associated descriptions

II-B Classification Based on Unstructured Text

In this section, we will discuss how our model predicts diagnostic codes based on unstructured text from NOTEEVENTS. Our approach includes two steps: data pre-processing and deep learning based classification.

The pre-processing step aims to provide a clean and standardized input for the deep learning model. It applied a standardized pipeline including tokenization and word normalization. Tokens with frequency less than 10 were removed. At test time, out-of-vocabulary words were considered to be a special token “UNK”.
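The vocabulary step described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the whitespace tokenizer and lowercasing stand in for the unspecified tokenization and word normalization, and the function names are hypothetical.

```python
from collections import Counter

UNK = "UNK"  # special token for out-of-vocabulary words, as in the text


def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for the paper's
    # (unspecified) tokenization and word normalization.
    return text.lower().split()


def build_vocab(documents, min_freq=10):
    # Keep tokens that occur at least `min_freq` times in the training corpus.
    counts = Counter(tok for doc in documents for tok in tokenize(doc))
    return {tok for tok, n in counts.items() if n >= min_freq}


def encode(text, vocab):
    # At test time, any token outside the vocabulary becomes "UNK".
    return [tok if tok in vocab else UNK for tok in tokenize(text)]
```

For example, a token seen only once in training is dropped from the vocabulary and is mapped to "UNK" when it appears in a test note.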

The processed text was fed into a multi-label classification model for ICD code prediction. We denote the ICD code set as $\mathcal{C}$. Given a text input $x$, which is a sequence of tokens, our goal is to select a subset of codes $\mathcal{S} \subseteq \mathcal{C}$ that is most relevant to $x$. A Text-CNN model [20] was applied to achieve this. This model represents tokens in $x$ using word embeddings ([24, 25]), and applies a convolution layer on top of the embeddings to capture the temporal correlations among adjacent tokens. To learn the Text-CNN model, the objective is to minimize the average cross-entropy (CE) loss [26] for each ICD code class, which is defined as follows,

$$L_{CE} = -\frac{1}{N}\sum_{i=1}^{N}\sum_{c=1}^{C}\left[\, y_{ic}\log p_{ic} + (1-y_{ic})\log(1-p_{ic}) \,\right] \qquad (1)$$

where $N$ and $C$ are the numbers of training examples (patients’ admissions) and unique ICD codes. $y_{ic}$ is a binary label representing whether example $i$ is assigned code $c$, and $p_{ic}$ is the probability (predicted by Text-CNN) indicating the likelihood that code $c$ is relevant to example $i$.
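As an illustration of the loss described above, the sketch below computes a multi-label cross-entropy averaged over examples in plain Python. It assumes the binary (per-code) form of the loss; the function name and the clipping constant are hypothetical choices made here for numerical safety.

```python
import math


def average_cross_entropy(labels, probs, eps=1e-12):
    """Binary cross-entropy summed over codes, averaged over examples.

    labels: list of rows of 0/1 labels, one row per admission
    probs:  list of rows of predicted probabilities, same shape
    """
    total = 0.0
    for y_row, p_row in zip(labels, probs):
        for y, p in zip(y_row, p_row):
            # Clip to avoid log(0); an implementation detail assumed here.
            p = min(max(p, eps), 1.0 - eps)
            total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(labels)
```

A confident, correct prediction contributes a loss near zero, while an uninformative prediction of 0.5 contributes log 2 per code.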

In clinical practice, clinical guidelines developed by human experts are used to guide diagnoses. Medical knowledge from the guidelines can be leveraged by machine learning models to improve prediction accuracy. To achieve this, keywords and phrases were extracted from unstructured guidelines to generate TF-IDF feature vectors [27]. These features were fed into the fully connected layer of Text-CNN as additional inputs. We refer to this modified model as Text-TF-IDF-CNN. The detailed network structure is shown in Figure 2.

Fig. 2: A modified Text-CNN which uses clinical notes as inputs. The model that utilizes TF-IDF features is denoted as Text-TF-IDF-CNN.

Training samples for the 32 ICD code classes are not evenly distributed: some classes have many samples while others have few. Such a data imbalance issue often significantly compromises the performance of machine learning models. To address it, we applied Label Smoothing Regularization (LSR) [28]. LSR prevents our classifier from being too certain about labels during training and thus avoids overfitting, which improves the generalization of our model. More specifically, $y_{ic} \in \{0, 1\}$ as defined in (1), and the average cross-entropy loss is minimized when $p_{ic} = y_{ic}$. LSR was implemented by replacing the truth label $y_{ic}$ with the linear combination between the ground truth label and a sample $u_{ic}$ drawn from a prior distribution, with weights $1-\epsilon$ and $\epsilon$. Here prior distribution is defined as . We sampled $\epsilon$ from a beta distribution in each iteration during training, where $\epsilon \sim \mathrm{Beta}(\alpha, \alpha)$ and $\alpha$ is the hyperparameter for the Beta distribution. As a result, the smoothed label of class $c$ for sample $i$ is described as follows,

$$y'_{ic} = (1-\epsilon)\, y_{ic} + \epsilon\, u_{ic} \qquad (2)$$

The average cross-entropy (CE) loss is adjusted in Equation (1) by replacing $y_{ic}$ with the smoothed label $y'_{ic}$, to train our Text-CNN or Text-TF-IDF-CNN model.
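The label smoothing step can be sketched as below. This is a hedged illustration only: the prior sample is passed in by the caller because the definition of the prior is not recoverable from the text, the symmetric Beta(alpha, alpha) form is an assumption based on the single hyperparameter mentioned, and `smooth_labels` is a hypothetical name.

```python
import random


def smooth_labels(labels, prior_sample, alpha=0.3, rng=random):
    """Mix ground-truth labels with a prior sample using a Beta-drawn weight.

    labels:       0/1 labels of one admission, one entry per ICD code
    prior_sample: values drawn from the prior, same length (caller-supplied)
    alpha:        Beta hyperparameter (0.3 is the value reported in the paper)
    """
    # Assumption: epsilon ~ Beta(alpha, alpha), redrawn at each training step.
    eps = rng.betavariate(alpha, alpha)
    smoothed = [(1 - eps) * y + eps * u for y, u in zip(labels, prior_sample)]
    return smoothed, eps
```

The smoothed labels then replace the hard 0/1 labels in the cross-entropy loss, so the classifier is never pushed toward fully saturated predictions.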

II-C Ranking Based on Semi-structured Text

Medical coders review the diagnosis descriptions taken by physicians in the form of textual phrases and sentences in the clinical notes or admissions table, then manually assign appropriate ICD codes by following the coding guidelines [29]. For example, the description “Acute on Chronic Kidney Failure” is a strong signal for the ICD code N17.9 “Acute Kidney Failure, unspecified” because they share close semantic similarity.

We formulated this task as a Diagnosis-based Ranking (DR) problem where latent representations of all the code descriptions are computed as points in a low-dimensional dense vector space. During inference, diagnosis descriptions are mapped to the same vector space and ICD codes are ranked based on their distance from the diagnosis vector. A neural network model that considers both character- and word-level features was trained to represent the diagnoses and ICD code descriptions in the same space. To train the model, a triplet loss function is minimized [30]. It takes a triplet as input: an anchor example ($x_a$), a positive example ($x_p$) and a negative example ($x_n$), respectively. The triplet loss defines a relative similarity between instances, by minimizing the distance between the positive pairs and maximizing the distance between the negative ones. This loss function is defined as:

$$L(x_a, x_p, x_n) = \max\left( d(x_a, x_p) - d(x_a, x_n) + \gamma,\ 0 \right) \qquad (3)$$

where $d(x_i, x_j)$ is the Euclidean distance for $(x_i, x_j)$ and $\gamma$ is the margin. Here we took the margin as 1. The associated network structure is shown in Figure 3. Pre-trained word embeddings and a Character-level Convolutional Neural Network (Char-CNN) [31] are used to represent each word token. The Char-CNN layer creates representations such that tokens with similar form like “injection” and “injections” are mapped closer to each other. Similarly, the word embeddings represent a semantic space, such that a token “Hypertension” and its abbreviation “HTN” map closer to each other. The word embedding matrix was pre-trained on PubMed [25], which contains abstracts of over 550,000 biomedical papers. Then, a bidirectional LSTM [32] layer encodes the sequential context information, followed by a max-pooling layer to compute final representations.
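A minimal sketch of the triplet loss with Euclidean distance and a margin of 1 is given below. It operates on already-computed embedding vectors (plain Python lists here); the encoder that produces them is out of scope, and the function names are hypothetical.

```python
import math


def euclidean(u, v):
    # Euclidean distance between two embedding vectors of equal length.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    # Zero once the negative is at least `margin` farther from the anchor
    # than the positive; otherwise the violation is penalized linearly.
    return max(euclidean(anchor, positive) - euclidean(anchor, negative) + margin, 0.0)
```

When the positive example is already much closer to the anchor than the negative one, the loss is zero and the triplet contributes no gradient, which is why hard negatives matter for training.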

To construct the triplets for training, we crawled ICD-10 codes and synonyms for their corresponding code descriptions from online resources. Then, we mapped all the synonyms for each ICD code as the positive and anchor examples respectively. To create high-quality negative examples, instead of randomly sampling, we computed the edit distance to extract n-grams that are similar to the code description. During inference, min-max normalization transformed the distance values to the range [0, 1]. If one admission has multiple diagnoses, the mean of features was computed to combine those normalized results.
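The inference-time post-processing described above can be sketched as follows. This assumes one row of distances per diagnosis (one distance per ICD code), normalized row by row and then averaged per code; the handling of a constant row and the function names are assumptions made for this example.

```python
def min_max_normalize(values):
    # Rescale distances to [0, 1]; a constant row maps to all zeros
    # (a convention assumed here to avoid division by zero).
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def combine_diagnoses(distance_rows):
    # distance_rows: one list of per-code distances for each diagnosis
    # of an admission. Returns the per-code mean of the normalized rows.
    normalized = [min_max_normalize(row) for row in distance_rows]
    return [sum(col) / len(normalized) for col in zip(*normalized)]
```

With two diagnoses whose distances to three codes are `[2, 4, 6]` and `[1, 1, 3]`, the combined feature vector is `[0.0, 0.25, 1.0]`.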

Fig. 3: Model architecture for diagnosis-based ranking.

II-D Classification Based on Tabular Data

Tabular data mainly comes from four tables: LABEVENTS, PRESCRIPTIONS, MICROBIOLOGYEVENTS, and CHARTEVENTS.

  • LABEVENTS (LAB): It uses a binary value to indicate whether each laboratory value is abnormal. Furthermore, if a patient has multiple records of the same lab test, a majority vote strategy was adopted. Missing values were considered as normal. In our experiments, a total of 753 lab tests were used to construct features for classification.

  • CHARTEVENTS: Based on physicians’ suggestion, we mainly focused on three measurements: Body Mass Index (BMI), heart rate, and blood pressure. A binary vector was used to denote whether a specific measurement result is within the normal range. We combined features from this table with LABEVENTS.

  • PRESCRIPTIONS (MED): A binary vector was used to denote whether a medication is prescribed. Missing medications were considered as not being prescribed. Medications with a frequency less than 50 were removed to reduce noise. A total of 1135 medications were used to construct binary features.

  • MICROBIOLOGYEVENTS (BIO): A binary feature was used to represent a microbiology event, i.e., whether an organism test is positive or negative. A total of 363 organisms were used to form feature vectors.

Our solution applied a decision tree [21] as the classifier using binary features and leveraged a one-versus-all strategy [33] for multi-label classification. Higher weights were given to samples from minority classes to handle data imbalance.
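The binary feature construction for lab tests and prescriptions can be sketched as below. The input layouts, the tie-breaking rule in the majority vote (ties count as normal), and counting medication frequency once per admission are assumptions made for this illustration; the function names are hypothetical.

```python
from collections import Counter, defaultdict


def lab_features(records, lab_tests):
    # records: (test_name, is_abnormal) pairs for one admission.
    votes = defaultdict(list)
    for name, abnormal in records:
        votes[name].append(bool(abnormal))
    features = []
    for name in lab_tests:
        flags = votes.get(name, [])
        # Majority vote over repeated tests; missing tests (and ties,
        # by assumption) are treated as normal (0).
        features.append(1 if 2 * sum(flags) > len(flags) else 0)
    return features


def medication_vocab(admissions, min_freq=50):
    # admissions: one list of medication names per admission.
    counts = Counter(med for meds in admissions for med in set(meds))
    return sorted(m for m, n in counts.items() if n >= min_freq)


def medication_features(meds, vocab):
    # 1 if the medication was prescribed, 0 otherwise (including missing).
    prescribed = set(meds)
    return [1 if m in prescribed else 0 for m in vocab]
```

The resulting 0/1 vectors from the different tables can be concatenated and passed to any one-versus-all tree classifier with per-sample weights for the minority classes.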

II-E Model Ensemble

During testing, our solution takes an ensemble of the trained models for predicting ICD codes. Specifically, for each class $c$, the final predicted probability $p_c$ is a weighted sum of probabilities predicted by individual models:

$$p_c = \sum_{m=1}^{M} w_m\, p_{m,c} \qquad (4)$$

where $p_{m,c}$ is the probability predicted by model $m$ on class $c$, and $M$ is the total number of models. The weight parameters $w_m$ were tuned by performing grid search on the validation set. However, not every patient visit is guaranteed to have all the data resources available. Hence, in order to ensure that all weights sum up to one, i.e., $\sum_{m=1}^{M} w_m = 1$, if the $m$-th predictor is missing, its weight will be given to the Text-CNN or Text-TF-IDF-CNN model.
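The weighted ensemble with reassignment of weights for missing modalities can be sketched as follows. The model names and weight values in the usage example are hypothetical placeholders, not the tuned weights from the paper, and the text model is assumed to be always available.

```python
def ensemble_probability(probs, weights, fallback="text"):
    """Weighted sum of per-model probabilities for one ICD code.

    probs:    model name -> probability, or None if that modality is missing
    weights:  model name -> weight, assumed to sum to one
    fallback: the text model, which receives the weight of missing models
    """
    total = 0.0
    fallback_weight = weights[fallback]
    for name, w in weights.items():
        if name == fallback:
            continue
        p = probs.get(name)
        if p is None:
            # Missing predictor: hand its weight to the text model.
            fallback_weight += w
        else:
            total += w * p
    return total + fallback_weight * probs[fallback]
```

For instance, with weights `{"text": 0.6, "dr": 0.1, "td": 0.3}` and no diagnosis description available, the text model is effectively weighted 0.7 so that the weights still sum to one.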

II-F Methods for Interpretation

Two methods were adopted to extract the important textual and tabular features that have the highest influence on the final ICD code predictions.

II-F1 Text Interpretability

After our model predicts a specific ICD code $c$, it is desirable to identify key phrases that lead to such a prediction from the textual input. To capture the association between a word $w$ and an ICD code $c$, we extracted all the paths connecting $w$ and $c$ from the trained neural network. For each path, we computed the influence score by multiplying the values of all hidden units and the weights associated with all edges along this path. The scores of all paths were added up to measure the association strength between $w$ and $c$. Consecutive words with non-zero scores were combined into phrases and then ranked by the maximum score. Top ranked phrases were considered important signals for models to determine a specific ICD code.
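The final phrase-building step can be sketched as below. It assumes the per-word influence scores for one ICD code have already been computed from the network; only the merging of consecutive non-zero-scored words and the ranking by maximum score are shown, and `extract_phrases` is a hypothetical name.

```python
def extract_phrases(tokens, scores, top_k=10):
    # tokens: words of the note; scores: per-word influence for one code.
    phrases = []
    run_tokens, run_scores = [], []
    for tok, score in zip(tokens, scores):
        if score != 0:
            # Extend the current run of consecutive scored words.
            run_tokens.append(tok)
            run_scores.append(score)
        elif run_tokens:
            # A zero score closes the current phrase.
            phrases.append((" ".join(run_tokens), max(run_scores)))
            run_tokens, run_scores = [], []
    if run_tokens:
        phrases.append((" ".join(run_tokens), max(run_scores)))
    # Rank phrases by their maximum word score, highest first.
    phrases.sort(key=lambda item: item[1], reverse=True)
    return phrases[:top_k]
```

Ranking by the maximum rather than the sum keeps a single highly influential keyword from being outranked by a long phrase of weakly scored words.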

II-F2 Table Interpretability

The nature of structured tabular features makes the method inherently different from those using text data. Therefore, the Local Interpretable Model-Agnostic Explanation (LIME) method [34] was adopted to tackle this problem. LIME computes a score for each feature of an instance that represents the importance of this feature in contributing to the model’s final prediction. Instead of going through the trained model, LIME learns the weights of features by approximating the decision boundary locally (i.e. instance-specific) based on a simple model.

III Experiments

In this section, we will introduce experiments for validating our approach and evaluation results for both classification and interpretability.

III-A Experimental Settings

The data was randomly split into training, validation, and test sets containing 31,155, 4,484, and 9,020 admissions respectively. Admissions of the same patient were categorized into the same set. This prevents the model from memorizing information of a patient from the training set and leveraging that information to inflate performance on the test set.
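A patient-level split of this kind can be sketched as follows. The 70/10/20 fractions roughly match the reported set sizes but are an assumption, as are the input layout and the function name.

```python
import random


def split_by_patient(admissions, fractions=(0.7, 0.1, 0.2), seed=0):
    """Assign admissions to train/val/test so that no patient spans two sets.

    admissions: list of (admission_id, patient_id) pairs
    fractions:  approximate share of patients per split (assumed values)
    """
    patients = sorted({pid for _, pid in admissions})
    random.Random(seed).shuffle(patients)
    n_train = int(len(patients) * fractions[0])
    n_val = int(len(patients) * fractions[1])
    train_p = set(patients[:n_train])
    val_p = set(patients[n_train:n_train + n_val])
    splits = {"train": [], "val": [], "test": []}
    for adm, pid in admissions:
        # The split is decided by the patient, never by the admission.
        key = "train" if pid in train_p else "val" if pid in val_p else "test"
        splits[key].append(adm)
    return splits
```

Because the shuffle is over patients rather than admissions, the admission counts per split only approximate the requested fractions.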

III-B Hyperparameters

Hyperparameters were tuned on the validation set, including: (1) For the Text-CNN and Text-TF-IDF-CNN models, kernel sizes of 2, 3, 4 were adopted and each kernel had 128 feature maps. Word embedding dimension was 256. Dropout rate was 0.1. L2-regularization coefficient was . Adam optimizer [35] with learning rate was used. Batch size was 32. The hyperparameter of the beta distribution in label smoothing regularization was 0.3. (2) For diagnosis ranking, character embedding size was 50. One layer of bi-directional LSTM with 100 hidden units was used for encoding.

III-C Evaluation Metrics

  • For Classification, we used F1 and the Area Under the ROC Curve (AUC) ([36, 37]) as evaluation metrics. F1 score is the harmonic mean of precision and recall, and AUC score summarizes performances under different thresholds. To better compute the average across different classes, we adopted micro-averages and macro-averages. Classes with more samples have larger weights for micro-averaged metrics but are treated equally for macro-averaged metrics.

  • For Interpretability, we used the Jaccard Similarity Coefficient (JSC) [38] to measure the overlap between two sets $A$ and $B$, which are our extracted evidence and physicians’ annotations. It is defined as,

$$\mathrm{JSC}(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
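The metrics above can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions: predictions are already thresholded to 0/1, a class with no true or predicted positives gets an F1 of 0, and the JSC of two empty sets is taken as 0; the function names are hypothetical.

```python
def f1(tp, fp, fn):
    # Harmonic mean of precision and recall, written in terms of counts.
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    # y_true, y_pred: lists of 0/1 rows, one row per admission.
    n_classes = len(y_true[0])
    counts = []
    for c in range(n_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[c] and p[c])
        fp = sum(1 for t, p in zip(y_true, y_pred) if not t[c] and p[c])
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[c] and not p[c])
        counts.append((tp, fp, fn))
    # Macro: average of per-class F1, every class weighted equally.
    macro = sum(f1(*c) for c in counts) / n_classes
    # Micro: pool the counts first, so frequent classes weigh more.
    micro = f1(*(sum(col) for col in zip(*counts)))
    return micro, macro


def jaccard(a, b):
    # Overlap between two evidence sets (e.g., model vs. physician).
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

In a toy case where a frequent class is predicted perfectly and a rare class is always missed, micro-F1 stays high while macro-F1 drops to 0.5, which is why both averages are reported.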

III-D Evaluation of Code Classification

The overall performance for multi-label classification is shown in Table II. Word-TF-IDF and MetaMap-TF-IDF are baseline methods, where MetaMap extracts medical entities and links them to UMLS concepts [39]. Keyword-TF-IDF only adopts TF-IDF values of keywords from clinical guidelines.

Most models in Table II were developed based on the CNN architecture. The vanilla Text-CNN model and DenseNet [40] perform similarly to the TF-IDF models on F1 but achieve better results on AUC.

Label Smoothing (LS) significantly improves performance, especially on F1 scores, as it alleviates the data imbalance issue. Text-CNN+DR also enhances Text-CNN performance on all four metrics.

For Tabular Data (TD) that comprises BIO, LAB and MED, we observe improvements on all four metrics after ensembling any type with Text-CNN+LS. Incorporating TD, Text-CNN+LS+TD gets the best macro- and micro-F1 (0.6137 and 0.6929 respectively) among all models with Text-CNN structure. By further ensembling DR, our solution gets a higher AUC score but slightly lower F1 scores. Text-CNN+LS+DR+TD improves 13% on macro-F1 and 5.4% on micro-F1 over Text-CNN, and 25.9% on macro-F1 and 19.5% on micro-F1 over Word-TF-IDF, which demonstrates the effectiveness of our proposed ensemble approach and the benefits of using multimodal data.

In addition, evaluation results show that the use of the most relevant subsets from clinical guidelines, denoted as Text-TF-IDF-CNN, can significantly increase the performance. The combined model of Text-TF-IDF-CNN, LS, DR and TD achieves 0.6867 Macro-F1, 0.7633 Micro-F1, 0.9337 Macro-AUC and 0.9541 Micro-AUC, the highest scores among all models tested. Detailed evaluation results for the 32 ICD classes are shown in Figure 4.

| Methods | Macro-F1 | Micro-F1 | Macro-AUC | Micro-AUC |
|---|---|---|---|---|
| Word-TF-IDF | .5277 | .6648 | .7249 | .7834 |
| Word-TF-IDF + MetaMap-TF-IDF | .5336 | .6721 | .7298 | .7879 |
| Keyword-TF-IDF | .5678 | .7225 | .7375 | .7867 |
| DenseNet | .5327 | .6621 | .8885 | .9228 |
| Text-CNN | .5429 | .6569 | .8959 | .9235 |
| Text-CNN + LS | .6054 | .6908 | .9010 | .9293 |
| Text-CNN + LS + DR | .6097 | .6914 | .9029 | .9307 |
| Text-CNN + LS + Bio | .6108 | .6922 | .9076 | .9323 |
| Text-CNN + LS + Lab | .6107 | .6925 | .9136 | .9366 |
| Text-CNN + LS + Med | .6122 | .6928 | .9124 | .9379 |
| Text-CNN + LS + TD | .6137 | .6929 | .9177 | .9406 |
| Text-CNN + LS + DR + TD | .6133 | .6921 | .9188 | .9416 |
| Text-TF-IDF-CNN + LS | .6813 | .7632 | .9157 | .9420 |
| Text-TF-IDF-CNN + LS + DR + TD | .6867 | .7633 | .9337 | .9541 |

TABLE II: Evaluations of classification for our models and baselines. LS stands for label smoothing. DR is the diagnosis-based ranking model. LAB contains both LABEVENTS and CHARTEVENTS features. Tabular Data (TD) includes Bio, Med and Lab.

Fig. 4: Detailed AUC and F1 scores for 32 ICD codes for Text-TF-IDF-CNN + LS + TD + DR model. ICD codes are listed on the X axis by the training sample size, in descending order.

III-E Interpretability Evaluation

We collected a test set of 25 samples from five selected ICD-10 codes, including I10, I50.9, N17.9, E11.9, and D64.9. Annotations were obtained independently from three practicing physicians at Cleveland Clinic, who were asked to annotate key evidence from all data sources that determines the assignment of ICD-10 codes. The physicians have 15, 22 and 32 years of experience, respectively.

III-E1 Text-oriented evaluation

We compare the top-$k$ phrases extracted by our model with physicians’ annotations, where $k$ is the number of snippets the physician annotated. If tokenized phrases and annotations coincide above a certain threshold, they are considered overlapped. This also applies when comparing overlap between physicians’ annotations. We find that different physicians have different annotation styles. For instance, physicians 2 and 3 tend to highlight keywords while physician 1 tends to highlight sentences, resulting in fewer tokens obtained from physicians 2 and 3. Therefore, the average overlap score is as low as 0.2784 among highly-trained professionals, indicating that finding evidence from clinical notes to determine ICD codes is a non-trivial task.

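Text data

| | Physician 1 | Physician 2 | Physician 3 | Our model |
|---|---|---|---|---|
| Physician 1 | - | .3105 | .1798 | .1477 |
| Physician 2 | .3105 | - | .3450 | .2137 |
| Physician 3 | .1798 | .3450 | - | .1905 |

Tabular data

| | Physician 1 | Physician 2 | Physician 3 | Our model |
|---|---|---|---|---|
| Physician 1 | - | .5455 | .4225 | .3867 |
| Physician 2 | .5455 | - | .5325 | .2885 |
| Physician 3 | .4225 | .5325 | - | .2564 |

TABLE III: Jaccard Similarity Coefficient (JSC) between physicians and our model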

Table III shows the overlap score between physicians’ annotations and the outputs of our model. On average, our model obtains JSC of 0.1806. Our approach is designed to find n-grams carrying important signals. It can capture numerous phrases that are either directly related to a specific disease or that can provide insights for the final diagnosis prediction.

III-E2 Table-oriented evaluation

For each table, the $k$ most important features found by LIME were selected as evidence for each sample, where $k$ is the number of important features the physician annotated. We removed duplicates and computed the JSC. All results are shown in Table III. We observe that the average agreement among physicians on tabular data is 0.5002, higher than that on text data. The average JSC between our model and physicians’ annotations is 0.3105. In general, our model captures more features than physicians. Some of those extra features are informative for diagnoses even though they are not annotated by physicians. This is similar to the finding from text interpretability.

Interpretability Demonstration
D64.9 Anemia, Unspecified
Unstructured Text
‘Right Ventricular’ (1)
‘Block, Postoperative MRSA Pneumonia, Postoperative Pleural Effusions, Anemia, Acute Renal Insufficiency, Hypertension, Hyperlipidemia, Rectal Abscess’ (2)
‘nosocomial pneumonia. Cultures eventually grew out mrsa’ (3)
‘Possible CAD on his mother’s’ (4)
‘Bentall Procedure utilizing a 23mm Homograft with Repair of’ (5)
‘performed a Bentall procedure’ (6)
‘anemic’ (7)
‘postoperative day 8’ (8)
‘reason: check’ (9)
‘AV endocarditis (Strep viridans)’ (10)
LABEVENTS ‘Hematocrit’(1), ‘Hyaline Casts’(2), ‘pO2’(3), ‘Hemoglobin’(10)
PRESCRIPTION ‘Vancomycin’(1), ‘Heparin’(2), ‘Sodium Chloride 0.9% Flush’(3)
E11.9 Type 2 diabetes mellitus without complications
Unstructured Text
‘Medical History: Copd Obesity DM II’ (1)
‘nebs changed to inhalers. Pts DM’ (2)
‘COPD, DM, Tobacco Abuse, Obesity, Iron Deficiency Anemia’ (3)
‘B12 Deficiency’ (4)
‘Metformin 850 mg po’ (5)
‘h/o CHF, DM II’ (6)
‘4) FEN - Diabetic diet’ (7)
‘Metformin HCL 850 mg’ (8)
‘CHF, COPD’ (9)
‘circumstances should he smoke while on oxygen as this can result in fire’ (10)
LABEVENTS ‘Glucose’(1), ‘Basophils’(2), ‘Lactate Dehydrogenase (LD)’(3)
PRESCRIPTION ‘Insulin’(1), ‘Metformin’(2), ‘Glyburide’(3), ‘Humulin-R Insulin’(6)
I10 Essential (primary) hypertension
Unstructured Text
‘coumadin, HTN, COPD, Hepatocellular carcinoma’ (1)
‘diagnostic thoracentesis, urinalysis negative, urine cx’ (2)
‘subsequent enucleation, stent in pancreas’ (3)
‘abdomen, HTN: Pt’s blood’ (4)
‘EF 55%, severe pulm a HTN’ (5)
‘qday, COPD: started prednisone 60mg’ (6)
‘Atrial Fibrillation, Hypertension, Chronic Obstructive Pulmonary’ (7)
‘COPD, Afib on coumadin was in his USOH until’ (8)
‘Septic-picture’ (9)
‘Abdominal ultrasound impression:’ (10)
LABEVENTS ‘Creatinine’(1), ‘Monocytes’(2), ‘Myelocytes’(3)
PRESCRIPTION ‘Atenolol’(1), ‘Sodium CHloride 0.9% Flush’(2), ‘Albuterol-Ipratropium’(3), ‘Losartan Potassium’(8), ‘Furosemide’(10)
I50.9 Heart failure, unspecified
Unstructured Text
‘Lisinopril’ (1)
‘Anxiety/Depression, CKD, HLD, Obesity, HTN’ (2)
‘Chronic Systolic Heart Failure’ (3)
‘Syndrome which is’ (4)
‘a witnessed grand mal seizure’ (5)
‘Carvedilol’ (6)
‘LAD, RBBB’ (7)
‘Fluticasone-salmeterol’ (8)
‘Medications for this’ (9)
‘Cardiomyopathy’ (10)
LABEVENTS ‘Troponin T’(1), ‘Urea Nitrogen’(2), ‘Heart_rate’(3)
PRESCRIPTION ‘Furosemide’(1), ‘Carvedilol’(2), ‘Topiramate (Topamax)’(3), ‘Lisinopril’(4)
N17.9 Acute kidney failure, unspecified
Unstructured Text
‘Acute Kidney Injury: Multifactorial etiology’ (1)
‘Renal failure’ (2)
‘Hands, wrists, elbows, shoulders, forearm developed a’ (3)
‘Creatinine’ (4)
‘IVDU, states he used heroin, barbiturates, cocaine’ (5)
‘concomitant bacterial infection, sepsis, a drug effect or billiary obstruction remain’ (6)
‘casts. Tacrolimus nephrotoxicity was considered’ (7)
‘Kidney injury, normocytic anemia, hypertension’ (8)
‘Hx polysubstance abuse, alcohol use’ (9)
‘Hypertension, GERD’ (10)
LABEVENTS ‘Creatinine’(1), ‘Urea Nitrogen’(2), ‘Monocytes’(3)
PRESCRIPTION ‘Sodium Bicarbonate’(1), ‘Iso-Osmotic Dextrose’(2), ‘Phytonadione’(3)
TABLE IV: Samples of prediction interpretability from different modalities. Samples are displayed in descending order and their rankings are shown in brackets.

IV Discussion

IV-A Code Classification

The use of TF-IDF features makes the vector space models perform well on F1 score, indicating the importance of retrieving disease-related keywords in these classification tasks. Inspired by Keyword-TF-IDF, which filters out noise and irrelevant features, we proposed Text-TF-IDF-CNN, which incorporates cognitive TF-IDF features extracted from clinical guidelines into the fully-connected layer of our deep learning model. It shows the possibility of leveraging human domain knowledge to improve ML model performance in clinical settings.

Performance improvements on macro-averaged metrics are greater than micro-averaged metrics. One reason is that multimodal data mitigates the issue of limited inference from text data caused by small training sample size. Additionally, given physical examination and lab tests are mandatory for most inpatients, tabular data considerably benefits classes with a small sample size. As we expand the prediction scope to cover the entire code list in the future, resolving data imbalance will play a bigger role in improving the predictive accuracy.

IV-B Interpretability

Table IV shows some examples of evidence extracted from different modalities. For text data, key evidence was extracted based on the Text-TF-IDF-CNN model. For tabular data, key evidence was extracted based on LIME from LABEVENTS and PRESCRIPTIONS tables.

For text data, among all ranked evidence extracted, our model is able to identify keywords highly relevant to a specific code. For example, “DM II” and “Metformin” were extracted for E11.9. Results reveal that our model is able to identify text snippets that are related to a specific disease, even though the relationship can be indirect. Taking N17.9 as an example, the snippet “IVDU, states he used heroin, barbiturates, cocaine” suggests that drug abuse could be the cause of acute kidney failure.

For tabular data, valuable items can be extracted as well, especially where text evidence is not sufficient. For example, “Glucose” from LABEVENTS and “Insulin” from PRESCRIPTIONS were extracted as top evidence for E11.9, which were not found in top ranked keywords from unstructured text.

Interpretation results are affected by the performance of classification. For example, items extracted in classes with an AUC score above 0.9 are more relevant than those in class D64.9 with an AUC score around 0.7 (see Figure 4).

IV-C Future Work

We plan to improve our models in the following areas:

  • Enlarge the code list. Currently we focus on 32 ICD-10 codes that are frequent in hospitals; we will extend coverage to more minority classes.

  • Reduce feature dimensions. Currently the feature dimensions for tabular data are very large and some features are duplicated. Feature dimension reduction and duplicate removal may improve performance for both classification and interpretability.

  • Add human knowledge. Incorporating domain expertise proves helpful in improving prediction. Exploring new methods that make better use of existing knowledge, beyond keyword matching, is another interesting research direction.
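As one possible starting point for the duplicate-removal item above, the sketch below drops tabular feature columns whose values are identical to an earlier column. It is only an illustration under simple assumptions (exact duplicates, a small in-memory table); the function `drop_duplicate_columns` and the feature names are hypothetical and do not describe an implemented component of our system.

```python
def drop_duplicate_columns(rows, names):
    """Keep the first of every group of columns that hold identical values."""
    seen = set()
    keep = []
    for j in range(len(names)):
        column = tuple(row[j] for row in rows)
        if column not in seen:
            seen.add(column)
            keep.append(j)
    kept_names = [names[j] for j in keep]
    kept_rows = [[row[j] for j in keep] for row in rows]
    return kept_rows, kept_names


# Hypothetical feature table: the second column duplicates the first.
names = ["glucose", "glucose_copy", "insulin_given"]
rows = [[110, 110, 0], [250, 250, 1], [98, 98, 0]]
reduced_rows, reduced_names = drop_duplicate_columns(rows, names)
print(reduced_names)
```

Near-duplicate or highly correlated features would need a different criterion, such as a correlation threshold, which this sketch does not attempt.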

V Conclusion

We proposed a novel multimodal machine learning approach to predict ICD-10 diagnostic codes using EMRs. We developed separate models that handle data from different modalities, including unstructured text, semi-structured text, and structured tabular data. Experiments show that our ensemble model outperforms all the baseline methods in classification. In future work, we would like to explore incorporating human knowledge into machine learning models to further enhance performance. In addition, our interpretability method makes our predictions more explainable, transparent and trustworthy. The capability to extract valuable information, including but not limited to code-specific text phrases, important lab test records and prescriptions, demonstrates great potential for use in real clinical practice.


Many individuals contributed to this research project. Keyang Xu, Mike Lam, and Jingzhi Pang implemented the methods; designed, tested, and validated the machine learning models; conducted the experiments; and compiled the results. All contributors collaborated to select the 32 ICD codes used. Dr. Piyush Mathur coordinated the advising medical team and, together with Dr. Ashish K. Khanna, Dr. Jacek B. Cywinski and Dr. Frank Papay, completed the medical annotations. Xin Gao and Charlotte Band coordinated the team, and edited and revised the report. Pengtao Xie and Eric Xing supervised the research and edited the report.


The authors thank Devendra Singh, Yaodong Yu and Yuan Yang for their valuable help.


  • [1] S. G. Nadathur, “Maximising the value of hospital administrative datasets,” Australian Health Review, vol. 34, no. 2, pp. 216–223, 2010.
  • [2] A. Bottle and P. Aylin, “Intelligent information: a national system for monitoring clinical performance,” Health services research, vol. 43, no. 1p1, pp. 10–31, 2008.
  • [3] H. Quan, V. Sundararajan, P. Halfon, A. Fong, B. Burnand, J.-C. Luthi, L. D. Saunders, C. A. Beck, T. E. Feasby, and W. A. Ghali, “Coding algorithms for defining comorbidities in icd-9-cm and icd-10 administrative data,” Medical care, pp. 1130–1139, 2005.
  • [4] D. L. Adams, H. Norman, and V. J. Burroughs, “Addressing medical coding and billing part ii: a strategy for achieving compliance. a risk management approach for reducing coding and billing errors.” Journal of the National Medical Association, vol. 94, no. 6, p. 430, 2002.
  • [5] D. W. Bates, M. Ebell, E. Gotlieb, J. Zapp, and H. Mullins, “A proposal for electronic medical records in us primary care,” Journal of the American Medical Informatics Association, vol. 10, no. 1, pp. 1–10, 2003.
  • [6] J. M. Lee and A. O. Muis, “Diagnosis code prediction from electronic health records as multilabel text classification: A survey.”
  • [7] Z. C. Lipton, D. C. Kale, C. Elkan, and R. Wetzel, “Learning to diagnose with lstm recurrent neural networks,” arXiv preprint arXiv:1511.03677, 2015.
  • [8] G. Parthiban and S. Srivatsa, “Applying machine learning methods in diagnosing heart disease for diabetic patients,” International Journal of Applied Information Systems (IJAIS), vol. 3, pp. 2249–0868, 2012.
  • [9] M. H. Stanfill, M. Williams, S. H. Fenton, R. A. Jenders, and W. R. Hersh, “A systematic literature review of automated clinical coding and classification systems,” Journal of the American Medical Informatics Association, vol. 17, no. 6, pp. 646–651, 2010.
  • [10] “Combining classifiers in text categorization,” in Proceedings of the 19th annual international ACM SIGIR conference on Research and development in information retrieval.   ACM, 1996, pp. 289–297.
  • [11] R. Kavuluru, A. Rios, and Y. Lu, “An empirical evaluation of supervised learning approaches in assigning diagnosis codes to electronic medical records,” Artificial intelligence in medicine, vol. 65, no. 2, pp. 155–166, 2015.
  • [12] G. Tsoumakas and I. Katakis, “Multi-label classification: An overview,” International Journal of Data Warehousing and Mining (IJDWM), vol. 3, no. 3, pp. 1–13, 2007.
  • [13] B. Koopman, G. Zuccon, A. Nguyen, A. Bergheim, and N. Grayson, “Automatic icd-10 classification of cancers from free-text death certificates,” International journal of medical informatics, vol. 84, no. 11, pp. 956–965, 2015.
  • [14] E. Scheurwegs, K. Luyckx, L. Luyten, B. Goethals, and W. Daelemans, “Assigning clinical codes with data-driven concept representation on dutch clinical free text,” Journal of biomedical informatics, vol. 69, pp. 118–127, 2017.
  • [15] B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi, “Deep ehr: A survey of recent advances in deep learning techniques for electronic health record (ehr) analysis,” IEEE journal of biomedical and health informatics, vol. 22, no. 5, pp. 1589–1604, 2018.
  • [16] F. Duarte, B. Martins, C. S. Pinto, and M. J. Silva, “Deep neural models for icd-10 coding of death certificates and autopsy reports in free-text,” Journal of biomedical informatics, vol. 80, pp. 64–77, 2018.
  • [17] J. Mullenbach, S. Wiegreffe, J. Duke, J. Sun, and J. Eisenstein, “Explainable prediction of medical codes from clinical text,” arXiv preprint arXiv:1802.05695, 2018.
  • [18] H. Shi, P. Xie, Z. Hu, M. Zhang, and E. P. Xing, “Towards automated icd coding using deep learning,” arXiv preprint arXiv:1711.04075, 2017.
  • [19] P. Xie and E. Xing, “A neural architecture for automated icd coding,” in Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2018, pp. 1066–1076.
  • [20] Y. Kim, “Convolutional neural networks for sentence classification,” CoRR, vol. abs/1408.5882, 2014.
  • [21] J. R. Quinlan, “Induction of decision trees,” Mach. Learn., vol. 1, no. 1, pp. 81–106, Mar. 1986.
  • [22] T. G. Dietterich, “Ensemble methods in machine learning,” in International workshop on multiple classifier systems.   Springer, 2000, pp. 1–15.
  • [23] A. E. Johnson, T. J. Pollard, L. Shen, H. L. Li-wei, M. Feng, M. Ghassemi, B. Moody, P. Szolovits, L. A. Celi, and R. G. Mark, “Mimic-iii, a freely accessible critical care database,” Scientific data, vol. 3, p. 160035, 2016.
  • [24] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
  • [25] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed representations of words and phrases and their compositionality,” in Advances in neural information processing systems, 2013, pp. 3111–3119.
  • [26] J. Shore and R. Johnson, “Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy,” IEEE Transactions on information theory, vol. 26, no. 1, pp. 26–37, 1980.
  • [27] J. Ramos et al., “Using tf-idf to determine word relevance in document queries,” in Proceedings of the first instructional conference on machine learning, vol. 242, 2003, pp. 133–142.
  • [28] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” CoRR, vol. abs/1512.00567, 2015.
  • [29] K. J. O’malley, K. F. Cook, M. D. Price, K. R. Wildes, J. F. Hurdle, and C. M. Ashton, “Measuring diagnoses: Icd code accuracy,” Health services research, vol. 40, no. 5p2, pp. 1620–1639, 2005.
  • [30] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based cnn with improved triplet loss function,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344.
  • [31] X. Zhang, J. Zhao, and Y. LeCun, “Character-level convolutional networks for text classification,” in Advances in neural information processing systems, 2015, pp. 649–657.
  • [32] Z. Huang, W. Xu, and K. Yu, “Bidirectional lstm-crf models for sequence tagging,” arXiv preprint arXiv:1508.01991, 2015.
  • [33] R. Rifkin and A. Klautau, “In defense of one-vs-all classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, Dec. 2004.
  • [34] M. T. Ribeiro, S. Singh, and C. Guestrin, “‘Why should I trust you?’: Explaining the predictions of any classifier,” CoRR, vol. abs/1602.04938, 2016.
  • [35] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
  • [36] C. Goutte and E. Gaussier, “A probabilistic interpretation of precision, recall and f-score, with implication for evaluation,” in European Conference on Information Retrieval.   Springer, 2005, pp. 345–359.
  • [37] J. Huang and C. X. Ling, “Using auc and accuracy in evaluating learning algorithms,” IEEE Transactions on knowledge and Data Engineering, vol. 17, no. 3, pp. 299–310, 2005.
  • [38] S. Niwattanakul, J. Singthongchai, E. Naenudorn, and S. Wanapu, “Using of jaccard coefficient for keywords similarity,” in Proceedings of the International MultiConference of Engineers and Computer Scientists, vol. 1, no. 6, 2013.
  • [39] A. R. Aronson, “Effective mapping of biomedical text to the umls metathesaurus: the metamap program.” in Proceedings of the AMIA Symposium.   American Medical Informatics Association, 2001, p. 17.
  • [40] G. Huang, Z. Liu, L. van der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, 2017, pp. 2261–2269.