Neural Language Model for Automated Classification of Electronic Medical Records at the Emergency Room. The Significant Benefit of Unsupervised Generative Pre-training
In order to build a national injury surveillance system based on emergency room (ER) visits, we are developing a coding system to classify their causes from the content of clinical notes. Supervised learning techniques have shown good results in this area but require the manual construction of a large annotated training dataset. New levels of performance have recently been achieved in neural language models (NLMs) with models based on the Transformer architecture that include an unsupervised generative pre-training step. Our hypothesis is that methods involving a generative self-supervised pre-training step significantly reduce the number of annotated samples required for supervised fine-tuning. In this case study, we assessed whether we could predict from free-text clinical notes whether a visit was the consequence of a traumatic or a non-traumatic event. We compared two strategies. Strategy A consisted of training the GPT-2 NLM on the full training dataset with all labels (trauma/non-trauma). In Strategy B, we split the training dataset into two parts: a large one without any labels for the self-supervised pre-training phase and a smaller one (up to 10 000 samples) with labels for the supervised fine-tuning. While Strategy A needed to process samples to achieve good performance (), Strategy B needed only samples, a gain of 80. Moreover, an AUC of was measured with only 30 labeled samples processed 3 times (3 epochs). To conclude, it is possible to adapt a multi-purpose NLM such as GPT-2 to create a powerful tool for the classification of free-text notes that requires only a very small number of labeled samples. Only two classes (trauma/non-trauma) were predicted in this case study, but the same method can be applied to multi-class classification tasks such as coding against diagnosis/disease terminologies.
Keywords: Neural Language Model, pre-training, Transformer, GPT-2
Over the past 10 years, neural language models (NLMs) have progressively taken the largest share of the field of natural language processing, with techniques based on long short-term memory and gated recurrent networks or on convolutional networks. NLMs have since become indispensable in this field, with applications in machine translation, document classification, text summarization and speech recognition.
The benefits of unsupervised pre-training were quickly identified, but in the domain of NLMs, new levels of performance have only recently been achieved with models based on the Transformer architecture that include an unsupervised generative pre-training step. One of the latest examples is GPT-2, published in February 2019 by OpenAI, a large Transformer-based language model with 1.5 billion parameters, pre-trained on a dataset of 8 million web pages to predict the next word after a given prompt sentence. This work quickly attracted attention because the authors demonstrated the ability of the model to generate artificial texts that are difficult to distinguish from texts written by humans. Moreover, the meaning of these artificial sentences was surprisingly consistent with the original text used as the prompt. Although only a reduced version of the original model has been made public, its potential applications are already numerous. Indeed, beyond its capability to produce coherent texts, GPT-2 has the potential to perform tasks such as question answering and document classification. Following the same idea as the BERT model, transferring many self-attention blocks from a pre-trained model proved sufficient to transfer contextual representations in the dataset.
The training of the model is then performed in two distinct phases. The first is a generative pre-training phase, called unsupervised or, more accurately, self-supervised. It exploits a corpus of unlabeled texts and gives the model its ability to produce text automatically; the relevance of the generated text indicates that the network has learned contextual representations. The second is a supervised fine-tuning phase, which resumes learning on a corpus of annotated texts with the objective of creating a system that performs a specific task.
We intended to leverage the document classification potential of GPT-2 to classify free-text clinical notes in the context of the TARPON project. This French project proposes to build a national surveillance system based on the exhaustive collection of emergency room (ER) visit reports in France. Its main feature is the application of automatic language analysis to extract injury mechanisms and causes from the computerized medical record information produced as free text by the medical staff for each visit. The creation of this database and its matching with the French national health data system will be used to create a nation-wide, comprehensive and automated trauma monitoring, research and alert system. More than 21 million unlabeled clinical notes are produced by emergency rooms every year in France. The cause of the visit is not available in a standardized form, although it is fully described in free-text computerized clinical reports. The overall project objective is to develop a tool that derives standardized data describing injury mechanisms and causes from these notes. To that end, substantial amounts of manually annotated data would be necessary to train a conventional model.
Our hypothesis is that methods involving a generative self-supervised pre-training step, such as GPT-2, significantly reduce the number of annotated samples required for the supervised fine-tuning phase. This is of paramount importance for all projects wishing to use NLMs for free-text classification tasks, because the manual annotation phase is by far the most expensive one. The objective of our study is therefore to measure the reduction in manual annotation work obtained by adopting this pre-training step.
2.1 Study overview
To test our hypothesis, we exploited the fact that we could derive the traumatic/non-traumatic nature of the cause of the ER visit from the available diagnostic codes assigned by clinicians or technical staff at the time of the patient’s hospitalization. We then designed a case study to assess whether we could also predict the traumatic/non-traumatic nature of the cause of the ER visit from computerized free-text clinical notes. The traumatic/non-traumatic cause of the visit derived from the diagnostic codes is used here as the label.
In order to measure the gain offered by a self-supervised training phase, we compared the performance of two strategies (Figures 1 and 2). Strategy A consisted of training the GPT-2 NLM on our full training dataset with all labels in a single fully supervised phase. In Strategy B, we split the training dataset into two parts: a large one without any labels for the self-supervised pre-training phase and a much smaller one with labels for the supervised fine-tuning. The main question was therefore how many samples would be necessary in the fine-tuning part of Strategy B to achieve the same performance as Strategy A. This gives a measure of how much annotation work is saved by Strategy B.
We retrieved clinical notes and International Classification of Diseases, 10th revision (ICD-10) diagnostic codes from the computerized system of the adult ER of the University Hospital of Bordeaux from 2011 to 2018. The ICD-10 is the most widely used standardized way of indicating diagnoses and medical procedures, and its use is mandatory in France for all stays in private or public hospitals.
The dataset contains medical records of visits of which contain both a diagnosis code and a clinical note. The label (trauma/non-trauma event) was derived from the ICD-10 code: a total of visits with an ICD-10 code beginning with S, T1 to T35 or V were coded as trauma, and visits with an ICD-10 code beginning with A, C, D, E, G, H, I, J, L or N were coded as non-trauma. A total of visits with codes beginning with other letters (F, M, O, P, Q, T36 to T98, X40 to X57, Y10 to Y98, U, Z) were excluded because they correspond to pathologies whose traumatic nature is either uncertain or debatable from a semantic point of view. The total study sample size was therefore .
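The labeling rule above can be sketched as a small function. This is a minimal illustration, not the authors' code: the function name `label_from_icd10` is hypothetical, "T1 to T35" is read here as every T code numbered 35 or below, and any code not explicitly listed as trauma or non-trauma is treated as excluded.

```python
import re
from typing import Optional

TRAUMA_LETTERS = {"S", "V"}
NON_TRAUMA_LETTERS = set("ACDEGHIJLN")


def label_from_icd10(code: str) -> Optional[int]:
    """Return 1 (trauma), 0 (non-trauma) or None (excluded from the study)."""
    match = re.match(r"^([A-Z])(\d{2})", code.strip().upper())
    if match is None:
        return None  # malformed or empty code
    letter, number = match.group(1), int(match.group(2))
    if letter in TRAUMA_LETTERS:
        return 1
    if letter == "T":
        # Assumption: "T1 to T35" covers T codes up to T35; T36 and above are excluded.
        return 1 if number <= 35 else None
    if letter in NON_TRAUMA_LETTERS:
        return 0
    # Assumption: every other code is excluded, including letters the text does not list.
    return None
```

For example, `label_from_icd10("S72.0")` yields the trauma label, while a code such as `"F32"` is dropped from the study sample.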
A random sample of clinical notes was selected for validation. The samples from the remaining dataset of notes were first used with labels in Strategy A in order to estimate the number of samples needed to achieve maximum prediction performance on the -sample validation set. For Strategy B, we further split the notes into a sample of notes without labels for unsupervised pre-training over 3 epochs and a second dataset with a maximum number of samples with labels for the supervised fine-tuning step.
Like neural language models based on convolutional and recurrent networks, the GPT-2 proposed by Radford and colleagues is a sequence-to-sequence transduction model. The new feature is that it is built on a Transformer architecture. The main feature of the Transformer architecture is the use of attention weights on text inputs. During the training process, the network learns a context vector that gives global information on the inputs, indicating where attention should be focused. The novel approach is to eliminate recurrence completely and replace it with attention to handle the dependencies between input and output.
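To make the notion of attention weights concrete, the following is a minimal sketch of scaled dot-product attention for a single query, written in plain Python. It is a simplification for illustration only: GPT-2 uses masked multi-head self-attention over learned projections, and the function names here are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def softmax(xs: Vector) -> Vector:
    """Numerically stable softmax: turns raw scores into weights summing to 1."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Weighted average of `values`, weighted by how well each key matches the query."""
    d = len(query)
    # Dot product of the query with each key, scaled by sqrt(d).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    # The context vector is the attention-weighted sum of the value vectors.
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]
```

When all keys match the query equally, the weights are uniform and the context vector is simply the mean of the value vectors; a key that matches the query more strongly pulls the context vector toward its value.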
GPT-2 was developed to be applicable to a wide range of problems not defined in advance. The model is designed to predict the next token given an input text sequence. By looping this process, it functions as an artificial text generator. The text can be generated de novo or from an arbitrary portion of text called a "prompt". The model is trained on millions of web pages without any explicit supervision. There are 4 versions of GPT-2, with 117, 345, 762 and 1542 million parameters respectively. Only the two smallest ones are trainable on standard workstations. Their model files are 0.49 GB and 1.42 GB in size, respectively.
The published models were trained on web text mostly written in English, whereas our clinical notes are in French. Consequently, we did not use these pre-trained models in the present work and started training from randomly initialized weights.
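The looping process described above can be sketched as follows. The `next_token` callable stands in for the trained model and is an assumption of this example; in practice it would sample from or take the most likely entry of the model's predicted token distribution.

```python
from typing import Callable, List, Optional


def generate(
    next_token: Callable[[List[str]], str],
    prompt: List[str],
    max_new_tokens: int,
    stop_token: Optional[str] = None,
) -> List[str]:
    """Extend `prompt` one token at a time by feeding the growing sequence back in."""
    tokens = list(prompt)  # copy so the caller's prompt is left untouched
    for _ in range(max_new_tokens):
        token = next_token(tokens)  # hypothetical model call
        tokens.append(token)
        if token == stop_token:
            break
    return tokens
```

An empty prompt corresponds to de novo generation; a non-empty one conditions every subsequent prediction on the given text.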
2.4 Text representation and input format
The authors of GPT-2 chose a modified version of the Byte Pair Encoding method, which is a middle ground between word-level encoding for frequent symbol sequences and character-level encoding for infrequent symbol sequences. The evaluation of GPT-2 on its ability to predict the final word of sentences (an ability that requires modeling long-range dependencies in text) showed that accuracy was significantly improved by adding a stop-word filter.
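As an illustration of the principle, here is a minimal sketch of how Byte Pair Encoding learns merges, in the spirit of the character-level algorithm of Sennrich and colleagues. It is not the modified byte-level variant used by GPT-2, and the function names and toy vocabulary are assumptions of this example.

```python
from collections import Counter
from typing import Dict, List, Tuple

Word = Tuple[str, ...]


def count_pairs(vocab: Dict[Word, int]) -> Counter:
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs: Counter = Counter()
    for symbols, freq in vocab.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs


def merge_pair(pair: Tuple[str, str], vocab: Dict[Word, int]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` by a single merged symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in vocab.items():
        out: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(vocab: Dict[Word, int], num_merges: int):
    """Greedily merge the most frequent pair, `num_merges` times at most."""
    merges: List[Tuple[str, str]] = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = merge_pair(best, vocab)
    return merges, vocab
```

Frequent sequences end up as single word-like symbols, while rare sequences stay split into smaller units, which is the middle ground described above.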
2.5 Operating principle
In Strategy B, the pre-training step is referred to as unsupervised learning because it consists simply of reading the text database of clinical notes, without labels (Figure 2). It actually uses a sliding learning window over the text. The first part of this window corresponds to the input and the final token corresponds to the token to be predicted. The term unsupervised is therefore something of a misnomer, and self-supervised is more appropriate. In our case study, the result is a model that can generate texts that resemble clinical notes.
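The sliding window can be sketched as a generator of (input, target) pairs, where the target is taken from the text itself rather than from a manual label. This is a simplified view under stated assumptions: the function name is hypothetical, tokens are shown as integer ids, and an actual GPT-2 training step predicts the next token at every position of the window in parallel.

```python
from typing import Iterator, List, Tuple


def sliding_windows(tokens: List[int], window: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (context, next_token) pairs from a token sequence."""
    if window < 2:
        raise ValueError("window must hold at least one input token and one target")
    for start in range(len(tokens) - window + 1):
        chunk = tokens[start:start + window]
        # All but the last token form the input; the last token is the target.
        yield chunk[:-1], chunk[-1]
```

With the token sequence `[1, 2, 3, 4]` and a window of 3, this produces the pairs `([1, 2], 3)` and `([2, 3], 4)`.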
For the supervised learning phases (Strategy A and the second learning process in Strategy B), we added a sequence at the end of each clinical note, consisting of an arbitrary textual identifier (e.g. TARPON) followed by an arbitrary code, say 1 for clinical notes corresponding to traumatic events and 0 for clinical notes corresponding to non-traumatic events. As described above, this code was derived from the diagnosis classification manually coded by clinicians.
For both strategies (Figures 1 and 2), the validation phase consisted of building a prompt by appending the arbitrary textual identifier (TARPON) to the clinical note whose trauma code we are trying to predict, and asking the model to predict the next token (here ideally 0 or 1). During the first iterations, the prediction can be any token but, as expected, it quickly converged to only 0s and 1s.
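The input format for fine-tuning and validation can be sketched as follows. The helper names, the use of a single space as separator and the example note are assumptions of this illustration; only the TARPON identifier and the 0/1 codes come from the description above.

```python
from typing import Optional

MARKER = "TARPON"  # arbitrary textual identifier named in the text


def make_training_sample(note: str, is_trauma: bool) -> str:
    """Append the marker and the label code to a clinical note (fine-tuning input)."""
    return f"{note.strip()} {MARKER} {1 if is_trauma else 0}"


def make_prompt(note: str) -> str:
    """Append only the marker, so the model's next token is the predicted code."""
    return f"{note.strip()} {MARKER}"


def parse_prediction(token: str) -> Optional[int]:
    """Map the generated token to a label; anything other than 0 or 1 is invalid."""
    token = token.strip()
    return int(token) if token in {"0", "1"} else None
```

For a hypothetical note such as `"Fall from ladder, wrist pain"`, the training sample ends in `TARPON 1`, while the validation prompt ends in `TARPON` and leaves the code for the model to produce.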
The prediction performance of the model was measured with the F1 score and the area under the ROC curve (AUC). An evaluation on the same -sample dataset was performed every 50 iterations for Strategy A and Strategy B.
The 117M model was trained on a PC with a single Nvidia GeForce GTX 1080 Ti GPU with 11 GB of video RAM. The 345M model was trained on a PC with a single Nvidia TITAN RTX GPU with 24 GB of video RAM. The training phase took about one week for each strategy.
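Both metrics can be computed from first principles, as in the sketch below. It assumes binary labels with 1 for trauma, and, for the AUC, a per-note score such as the probability the model assigns to the token 1; how the scores were actually obtained is not stated in the text, so that part is an assumption.

```python
from typing import Sequence


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Harmonic mean of precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

The pairwise AUC computation is quadratic in the number of samples, which is fine for illustration; on a large validation set a rank-based implementation from a statistics library would be preferable.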
2.8 Ethics, Confidentiality of data
No directly identifying (nominative) data were necessary for this work. The data were also not indirectly identifying, as no admission date or time was used. The dataset was, however, not specifically de-identified. Data processing and computing were conducted within the facilities of the Emergency Department of the University Hospital of Bordeaux.
Figures 3 and 4 compare Strategy A (fully supervised training without pre-training) and Strategy B (supervised training with pre-training) by plotting AUC and F1 score against the number of iterations, with a batch size of one sample read per iteration. In Strategy A, AUC and F1 score reach the values of and respectively after the processing of samples. The use of generative pre-training (Strategy B) achieved the same performance after iterations, a gain of . An AUC of and an F1 score of were observed after the processing of only 120 samples (Figure 5). The same performance was achieved with only 30 labeled samples processed 3 times (3 epochs of learning).
Comparing the 117M and 345M GPT-2 models showed no significant improvement from using the larger model.
As suggested by Radford and colleagues, large gains can be realized by generative pre-training on a corpus of unlabeled text, saving a large amount of labeling work. In our example of a clinical note classification task, the order of magnitude is a factor of . In their work, Radford and colleagues reported an improvement of on commonsense reasoning (Stories Cloze Test), on question answering (RACE), and on textual entailment (MultiNLI).
These results are in line with recent work showing that self-supervised pre-training methods, such as ELMo and BERT, have established a qualitatively new level of performance on the most widely used Natural Language Understanding benchmarks. Howard and Ruder in particular report very similar results in a comparable text classification task, with a model that matches, with only 100 labeled examples, the performance of training from scratch on 80x more data. While the extensive use of pre-trained word embeddings could be considered similar in nature to generative pre-training, the gain provided by generative pre-training is a major step forward for those who seek to classify free-text documents with minimal manual coding effort.
We have benefited from the work of the researchers who published the GPT-2 model, which still seems to be among the most efficient available today. Other models have been and will be proposed, and strategies for classification will need to be updated. Recent and promising work includes that of Yang and colleagues and their XLNet model, which currently ranks first on the Stanford Question Answering Dataset (SQuAD2.0, rajpurkar.github.io/SQuAD-explorer).
Probably because the GPT-2 model was only recently made public, very few applications have been published to date. However, this type of tool will probably be used extensively in the near future for a wide range of tasks. In the area of document classification alone, such tools will likely provide faster and more relevant access to information. Certainly, these applications will go beyond simple classification tasks. The 345M GPT-2 model did not generate significantly better results than the 117M model. The use of larger models may bring further improvement, which could have been tested had we had access to the necessary computing power. Unfortunately, this was not the case and we will have to be satisfied with the results presented here. Of note, no variance was estimated in this work. This proved unnecessary given the large size of the validation sample ().
In this study, we used as a reference a label based on the ICD-10 codes. We attempted to increase the reliability of this gold standard by selecting only a subset of diagnoses. This method has the advantage of providing us with a large amount of labeled data but does not allow us to compare the model’s performance with human annotation.
Our work shows that it is possible to easily adapt a multi-purpose NLM such as GPT-2 to create a powerful tool for classifying free-text notes. The self-supervised training phase appeared to be an efficient strategy for dramatically decreasing the number of labeled samples needed for supervised learning. These results will be used in the coming months to implement the exhaustive coding of all events leading to trauma-related ER visits, making it possible to build a national trauma observatory within the TARPON project. More generally, this also opens broad perspectives for all those interested in the automatic coding of free text. In the field of health, this will be particularly useful for diagnosis coding, clinical report classification and the analysis of patient reports.
- Jinmiao Huang, Cesar Osorio, and Luke Wicent Sy. An empirical evaluation of deep learning for ICD-9 code assignment using MIMIC-III clinical notes. Computer Methods and Programs in Biomedicine, 177:141–153, 2019.
- M. Li, Z. Fei, M. Zeng, F. Wu, Y. Li, Y. Pan, and J. Wang. Automated ICD-9 coding via a deep learning approach. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 16(4):1193–1202, July 2019.
- Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625–660, 2010.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 5998–6008. Curran Associates, Inc., 2017.
- Sascha Rothe, Shashi Narayan, and Aliaksei Severyn. Leveraging pre-trained checkpoints for sequence generation tasks. CoRR, abs/1907.12461, 2019.
- Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 2019.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805, 2018.
- Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. Improving language understanding by generative pre-training. 2018.
- World Health Organization. International statistical classification of diseases and related health problems: 10th revision (ICD-10), Fifth edition, 2016. World Health Organization, 2015.
- Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using RNN encoder-decoder for statistical machine translation. CoRR, abs/1406.1078, 2014.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909, 2015.
- David Martin Powers. Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation. 2011.
- Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018.
- Jeremy Howard and Sebastian Ruder. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia, July 2018. Association for Computational Linguistics.
- Zhilin Yang, Zihang Dai, Yiming Yang, Jaime G. Carbonell, Ruslan Salakhutdinov, and Quoc V. Le. XLNet: Generalized autoregressive pretraining for language understanding. CoRR, abs/1906.08237, 2019.