HYPE: A High Performing NLP System for Automatically Detecting Hypoglycemia Events from Electronic Health Record Notes


Yonghao Jin
Department of Computer Science
University of Massachusetts Lowell
Lowell, MA 01854
Fei Li
Department of Computer Science
University of Massachusetts Lowell
Lowell, MA 01854
Hong Yu
Department of Computer Science
University of Massachusetts Lowell
Lowell, MA, 01854
The Bedford Veterans Affairs Medical Center, Bedford, MA

Hypoglycemia is common and potentially dangerous among patients treated for diabetes. Electronic health records (EHRs) are an important resource for hypoglycemia surveillance. In this study, we report the development and evaluation of deep learning-based natural language processing (NLP) systems that automatically detect hypoglycemia events from EHR narratives. Experts in public health annotated 500 EHR notes from patients with diabetes. We used this annotated dataset to train and evaluate HYPE, a set of supervised NLP systems for hypoglycemia detection. In our experiments, the convolutional neural network (CNN) model yielded promising performance in a 10-fold cross-validation setting. Although the annotated data are highly imbalanced, our CNN-based HYPE system still achieved high performance for hypoglycemia detection. HYPE could be used for EHR-based hypoglycemia surveillance and to help clinicians provide timely treatment to high-risk patients.


Contact: yonghao_jin@student.uml.edu, foxlf823@gmail.com, hong_yu@uml.edu


Machine Learning for Health (ML4H) Workshop at NeurIPS 2018.

1 Introduction

In 2014, the Centers for Disease Control and Prevention (CDC) estimated that 29.1 million Americans aged 20 or older have diabetes mellitus (DM) [1]. Treatment-associated hypoglycemia (low blood sugar) in patients with DM is the third most common adverse drug event (ADE), resulting in about 25,000 emergency department visits and 11,000 hospitalizations yearly among Medicare patients [2]. Electronic health records (EHRs) are an important resource for reporting hypoglycemia [2]. However, studies have shown that many hypoglycemic events are not captured by ICD codes but are instead described in EHR narratives [3]. Manual chart review is prohibitively expensive. Therefore, automatically extracting hypoglycemia-related information from EHR notes can be a crucial complement to extracting such information from structured EHR data [2].

However, reliably detecting hypoglycemia events in EHR notes is very challenging for the following reasons. First, the descriptions of adverse events (in this case, hypoglycemia) can be very flexible in the clinical notes (e.g., “patient with hypoglycemia”, “she has low bs level”, “bs is in low 20”), so it is impossible to manually specify rules to recognize all the patterns. Second, hypoglycemia is a relatively rare adverse event, making it difficult to collect enough data to train a machine learning model.

In this paper, we report the development and evaluation of HYPE, a set of natural language processing (NLP) systems that effectively detect hypoglycemia events in EHR narratives. HYPE is built upon neural network (NN) architectures, which have been shown to greatly outperform traditional machine learning models in a variety of NLP applications [4, 5, 6]. We show that our deep-learning-based HYPE systems are high-performing and state-of-the-art, outperforming traditional machine learning models by a wide margin in both recall and precision.

2 EHR Corpus and Annotation

With approval from the Institutional Review Board (IRB) at the University of Massachusetts Medical School, we annotated 500 English EHR notes from patients with diabetes who were treated at the UMass Memorial Center in 2015. Since hypoglycemia can be a rare event [2, 7], we developed a strategy to increase the likelihood that it is mentioned in the selected EHR notes. Specifically, we selected the first 500 notes returned by querying both hypoglycemia ICD-9 codes (251.*) and related diabetic medications (e.g., insulin and metformin). We split each note into sentences using the NLTK package [8], which we adapted to EHR text. We asked experienced experts to annotate each sentence as containing a hypoglycemic event (Positive) or not (Negative). A sentence was annotated as Positive if it describes any hypoglycemia-related diagnosis or symptoms (e.g., "patient has low blood sugar level").

In all, the 500 EHR notes contain a total of 95,246 sentences (minimum 6 and maximum 912 sentences per note), of which 1,316 (3%) were annotated as Positive. Sentence length ranges from 2 to 318 words. We cropped sentences with more than 40 words for efficiency reasons. All EHR notes were fully deidentified prior to annotation or being used to develop the NLP systems.
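The cropping step above can be sketched in a few lines. This is a minimal illustration, assuming whitespace tokenization; the paper does not specify which tokenizer was used for this step.

```python
# Crop sentences to at most 40 words, as described in the corpus
# preprocessing. Whitespace tokenization is an assumption here.
MAX_LEN = 40

def crop_sentence(sentence: str, max_len: int = MAX_LEN) -> list:
    """Return at most the first `max_len` whitespace-separated tokens."""
    return sentence.split()[:max_len]

# A 100-word sentence is cropped to 40 tokens; short sentences are unchanged.
tokens = crop_sentence("patient reports low blood sugar " * 20)
print(len(tokens))  # 40
```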

3 Method

We used a standard deep learning text classification model [9]. The architecture of the model is shown in Figure 1 and consists of three parts: (1) an input layer that takes an input sentence and constructs a matrix containing the word embedding of each word; (2) a sentence embedding layer that computes a fixed-dimensional vector from the variable-dimension matrix of the previous layer; and (3) a final output layer that projects the sentence vector to probability scores for each class.

Figure 1: Model Architecture

We initialized the input layer with publicly available 100-dimensional word vectors [10] trained with Word2Vec on a combined corpus of general-domain and medical text. For the sentence embedding layer, we experimented with recurrent and convolutional neural network layers, shown in Figure 2. In the RNN setting, the output of the last step is taken as the sentence vector. In the CNN setting, we applied max-pooling to each filter to build a fixed-dimensional vector. The dimension of the sentence vector was fixed at 300 in all experiments. In the output layer, the elements of the sentence vector are randomly zeroed out by a dropout layer with dropout rate 0.5, and finally a softmax layer produces the class probabilities. During training, we used the Adam algorithm [11] with a learning rate of 0.00005 on a cross-entropy loss.

Figure 2: Sentence Embedding Layers. A. RNN Model B. CNN Model C. TCN Model
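The CNN variant described above can be sketched in PyTorch roughly as follows. The vocabulary size, filter widths, and number of filters per width are illustrative assumptions (three widths of 100 filters each, so that pooling yields the 300-dimensional sentence vector reported in the paper); the paper does not report these hyperparameters.

```python
# Illustrative sketch of a Kim-style sentence CNN matching the description
# above: 100-dim word embeddings, convolution filters max-pooled into a
# 300-dim sentence vector, dropout 0.5, and a softmax output layer.
import torch
import torch.nn as nn

class HypeCNN(nn.Module):
    def __init__(self, vocab_size=20000, emb_dim=100,
                 filter_widths=(3, 4, 5), n_filters=100, n_classes=2):
        super().__init__()
        # In the paper this layer is initialized with pretrained
        # biomedical Word2Vec vectors; here it is random for brevity.
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # 3 widths x 100 filters -> 300-dim sentence vector after pooling.
        self.convs = nn.ModuleList(
            [nn.Conv1d(emb_dim, n_filters, w) for w in filter_widths])
        self.dropout = nn.Dropout(0.5)
        self.out = nn.Linear(n_filters * len(filter_widths), n_classes)

    def forward(self, token_ids):                      # (batch, seq_len)
        x = self.embedding(token_ids).transpose(1, 2)  # (batch, emb, seq)
        # Max-pool each filter's output over the sequence, then concatenate.
        pooled = [conv(x).max(dim=2).values for conv in self.convs]
        sent_vec = self.dropout(torch.cat(pooled, dim=1))  # (batch, 300)
        return self.out(sent_vec)  # logits; softmax is applied in the loss

model = HypeCNN()
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
loss_fn = nn.CrossEntropyLoss()  # applies log-softmax + NLL over the logits

batch = torch.randint(0, 20000, (8, 40))  # 8 cropped sentences of 40 tokens
logits = model(batch)
print(logits.shape)  # torch.Size([8, 2])
```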

4 Experiments

4.1 Evaluation

Due to the sparsity of positive instances in the dataset, we performed 10-fold cross-validation to robustly evaluate the performance of each model. The dataset was randomly split into 10 groups. In each fold, one group was held out as the test set and the rest were used as the training set. The development set was constructed by randomly separating out 10% of each training set.
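The splitting protocol above can be sketched with scikit-learn. The random seeds and the choice of splitting utilities are assumptions; the paper describes only the fold structure.

```python
# Sketch of the evaluation protocol: 10-fold cross-validation, with 10%
# of each training fold held out as a development set.
from sklearn.model_selection import KFold, train_test_split

sentences = [f"sentence {i}" for i in range(100)]  # stand-in dataset
kfold = KFold(n_splits=10, shuffle=True, random_state=0)

fold_sizes = []
for train_idx, test_idx in kfold.split(sentences):
    # Carve a 10% development set out of the training portion.
    train_idx, dev_idx = train_test_split(
        train_idx, test_size=0.1, random_state=0)
    fold_sizes.append((len(train_idx), len(dev_idx), len(test_idx)))
    # ... train on train_idx, tune on dev_idx, evaluate on test_idx ...

print(fold_sizes[0])  # (81, 9, 10)
```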

4.2 Baseline Model

We applied a support vector machine (SVM), a commonly used learning algorithm for classification problems, as a strong baseline. Each sentence is represented by a long sparse vector whose dimension equals the vocabulary size of the training corpus (after removing common stop words). We used the scikit-learn package [12] to build the sentence vectors and to train the SVM with a radial basis function kernel.
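A minimal sketch of this baseline, assuming default hyperparameters (the paper reports only the kernel choice) and a tiny made-up training set for illustration:

```python
# SVM baseline sketch: sparse bag-of-words vectors with English stop words
# removed, fed to an RBF-kernel SVM, all via scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy examples; 1 = Positive (hypoglycemia), 0 = Negative.
train_sents = ["patient has low blood sugar", "blood pressure is normal",
               "hypoglycemia noted overnight", "no acute distress today"]
train_labels = [1, 0, 1, 0]

baseline = make_pipeline(
    CountVectorizer(stop_words="english"),  # vocabulary-sized sparse vectors
    SVC(kernel="rbf"),                      # radial basis function kernel
)
baseline.fit(train_sents, train_labels)
preds = baseline.predict(["patient has low blood sugar"])
print(preds)
```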

5 Results and Discussion

Table 1: Detailed performance of each experiment

5.1 Principal Results

Our results (Table 1) show that, compared with the strong SVM baseline, the NN models all improved performance (precision, recall, F1, PR-AUC, and ROC-AUC) by a large margin. The fundamental difference between an NN model and an SVM model lies in how they represent the data. Our SVM models use bag-of-words and n-gram features to represent the input sentences. In contrast, NN models are able to learn high-level sentence features, capturing both semantic and syntactic information. Our results also show that high-performing NN models can be trained on a relatively small and imbalanced set of annotated EHR data (a total of 41,034 sentences, of which 1,316 are Positive instances). This implication is significant, as the "knowledge-bottleneck" challenge makes it unrealistic to annotate large amounts of clinical data for supervised machine-learning applications.

5.2 Comparison Between Different Neural Network Models

Our results show that the CNN performed best for detecting sentence-level hypoglycemia, even though the data are imbalanced. One advantage of the recently proposed TCN model is its improved performance on longer sentences. It is therefore not surprising that the TCN model yields the best recall, although its precision is lower than that of the CNN model. The two RNN models (LSTM and Bi-LSTM) have similar performance, and both underperformed the CNN models. This suggests that RNN-based models are less effective than CNN models at capturing the sentence patterns important for this task, even in a bidirectional configuration. Their performance might improve with an attention mechanism, but that would greatly increase model complexity.

5.3 Error Analysis

We manually examined the error cases and identified two types of common errors. The models often failed on cases where a hypoglycemia event is indicated by a numerical blood sugar measurement. While the models could easily identify sentences like "BS is low", they almost always made mistakes on sentences like "BS is 68" or "fsbs noted to be 9". Such sentences are difficult to identify for two reasons. First, there is no good way to represent the relative magnitude of a number in the embedding space, so the model cannot learn a "less than" operation to identify low blood sugar values. Second, the unit of a numeric value is often absent and must be inferred from its order of magnitude: in the examples above, "68" should be "68 mg/dL" and "9" should be "9 nmol/L". External human knowledge must be incorporated to correctly identify this kind of sentence. In the future, we may detect such numeric expressions and replace them with either "less than" or "more than."
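One possible realization of this normalization idea is a rule that detects a numeric value following a blood-sugar cue word and replaces it with a categorical token. The cue words, threshold, and token names below are illustrative assumptions, not values from the paper.

```python
# Rough sketch of numeric normalization: replace blood-sugar readings
# with "less-than-normal" / "normal-or-high" tokens before classification.
# Cue words and the 70 (mg/dL) threshold are illustrative assumptions.
import re

def normalize_bs(sentence: str, low_threshold: float = 70.0) -> str:
    """Replace a numeric reading after a blood-sugar cue with a category."""
    def repl(m):
        value = float(m.group(2))
        label = "less-than-normal" if value < low_threshold else "normal-or-high"
        return m.group(1) + label
    # A blood-sugar cue, up to 20 non-digit characters, then a number.
    pattern = r"((?:bs|fsbs|blood sugar|glucose)\D{0,20}?)(\d+(?:\.\d+)?)"
    return re.sub(pattern, repl, sentence, flags=re.IGNORECASE)

print(normalize_bs("BS is 68"))            # BS is less-than-normal
print(normalize_bs("fsbs noted to be 9"))  # fsbs noted to be less-than-normal
```

This kind of preprocessing would let the downstream model treat "BS is 68" the same way it treats "BS is low", sidestepping the embedding-space problem described above.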

5.4 Limitations and Future Work

Our study has several limitations. Our annotated EHR dataset was small (a total of 500 EHR notes) and was selected using diabetes-related ICD codes; it therefore may not reflect the natural distribution of EHR data, in which hypoglycemic events are sparse. To mitigate this, we focused on sentence-level classification, so HYPE should be robust to naturally distributed EHR data. On the other hand, because HYPE operates at the sentence level, it may miss hypoglycemic events that are expressed across multiple sentences. In future work, we may explore paragraph- or document-level classification. Another limitation is that HYPE only detects the presence of a hypoglycemic event; it does not identify assertion status or severity. More annotated data is needed for such refined classification, which we leave to future work.

6 Conclusion

In this study, we addressed the question of how to automatically detect EHR sentences containing hypoglycemia events. Our deep learning models achieved both high precision (up to about 96%) and high recall (up to about 89%). Among the three deep learning architectures, RNN, TCN, and CNN, the CNN achieved the best performance. These encouraging results indicate that deep-learning-based approaches are promising for hypoglycemia event detection. Our work is an important step toward accurate surveillance of hypoglycemic events in EHRs and may provide clinicians a valid tool to improve the treatment of diabetes mellitus.


References

  • [1] Centers for Disease Control and Prevention, "2011 National Diabetes Fact Sheet."
  • [2] K. J. Lipska, J. S. Ross, Y. Wang, et al., "National trends in US hospital admissions for hyperglycemia and hypoglycemia among Medicare beneficiaries, 1999 to 2011," vol. 174, no. 7, pp. 1116–1124, 2014.
  • [3] American Diabetes Association Workgroup on Hypoglycemia, "Defining and reporting hypoglycemia in diabetes," Diabetes Care, vol. 28, no. 5, pp. 1245–1249, 2005.
  • [4] L. Rumeng, A. N. Jagannatha, and H. Yu, "A hybrid neural network model for joint prediction of presence and period assertions of medical events in clinical notes," vol. 2017, pp. 1149–1158.
  • [5] A. N. Jagannatha and H. Yu, "Structured prediction models for RNN based sequence labeling in clinical text," in Proceedings of the Conference on Empirical Methods in Natural Language Processing, vol. 2016, pp. 856–865.
  • [6] A. N. Jagannatha and H. Yu, "Bidirectional RNN for medical event detection in electronic health records," vol. 2016, pp. 473–482.
  • [7] D. S. Bell and V. Yumuk, "Frequency of severe hypoglycemia in patients with non-insulin-dependent diabetes mellitus treated with sulfonylureas or insulin," vol. 3, no. 5, pp. 281–283, 1997.
  • [8] S. Bird, E. Klein, and E. Loper, Natural Language Processing with Python: Analyzing Text with the Natural Language Toolkit. O'Reilly Media, 2009.
  • [9] Y. Kim, "Convolutional neural networks for sentence classification," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1746–1751, Association for Computational Linguistics.
  • [10] S. Pyysalo, F. Ginter, H. Moen, T. Salakoski, and S. Ananiadou, "Distributional semantics resources for biomedical text processing," in The 5th International Symposium on Languages in Biology and Medicine, pp. 39–43, 2013.
  • [11] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
  • [12] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, et al., "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.