Extracting Clinical Concepts from User Queries
Clinical concept extraction often begins with clinical Named Entity Recognition (NER). Although a number of annotated clinical notes are available, NER models trained exclusively on sentences from these notes tend to struggle with tagging clinical entities in user queries because of the structural differences between clinical note sentences and user queries. In many cases, for example, a user query is composed of multiple clinical entities with no commas or conjunctions separating them. Using a dataset that mixes annotated clinical note sentences with synthesized user queries, which require no manual annotation, we adapt a clinical NER model based on the BiLSTM-CRF architecture for tagging clinical entities in user queries. Our contributions are the following: 1) We find that when trained on a mixture of synthesized user queries and clinical note sentences, the NER model performs better on both types of input data. 2) We provide an end-to-end and easy-to-implement framework for clinical concept extraction from user queries.
Keywords: Clinical Concept Extraction, Named Entity Recognition, Information Retrieval
Medical search engines are an essential component for many online medical applications, such as online diagnosis systems and medical document databases. A typical online diagnosis system, for instance, relies on a medical search engine. The search engine takes as input a user query that describes some symptoms and then outputs clinical concept entries that provide relevant information to assist in diagnosing the problem. One challenge medical search engines face is the segmentation of individual clinical entities. When a user query consists of multiple clinical entities, a search engine would often fail to recognize them as separate entities. For example, the user query “fever joint pain weight loss headache” contains four separate clinical entities: “fever”, “joint pain”, “weight loss”, and “headache”. But when the search engine does not recognize them as separate entities and proceeds to retrieve results for each word in the query, it may find “pain” in body locations other than “joint pain”, or it may miss “headache” altogether, for example. Some search engines allow the users to enter a single clinical concept by selecting from an auto-completion pick list. But this could also result in retrieving inaccurate or partial results and lead to poor user experience.
We want to improve the medical search engine so that it can accurately retrieve all the relevant clinical concepts mentioned in a user query, where relevant clinical concepts are defined with respect to the terminologies the search engine uses. The problem of extracting clinical concept mentions from a user query can be seen as a variant of the Concept Extraction (CE) task in frequently cited NLP challenges in healthcare, such as 2010 i2b2/VA [17] and 2013 ShARe/CLEF Task 1 [15]. The CE tasks in both challenges ask the participants to design an algorithm to tag a set of predefined entities of interest in clinical notes. These entity tagging tasks are also known as clinical Named Entity Recognition (NER). For example, the CE task in 2010 i2b2/VA defines three types of entities: “problem”, “treatment”, and “test”. The CE task in 2013 ShARe/CLEF defines various types of disorder such as “injury or poisoning”, “disease or syndrome”, etc. In addition to tagging, the CE task in 2013 ShARe/CLEF has an encoding component, which requires selecting one and only one Concept Unique Identifier (CUI) from the Systematized Nomenclature Of Medicine Clinical Terms (SNOMED-CT) for each disorder entity tagged. Our problem, similar to the CE task in 2013 ShARe/CLEF, also contains two sub-problems: tagging mentions of entities of interest (entity tagging), and selecting appropriate terms from a glossary to match the mentions (term matching). However, several major differences exist. First, compared to clinical notes, the user queries are much shorter, less technical, and often less coherent. Second, instead of encoding, we are dealing with term matching, where we rank a few best terms that match an entity instead of selecting only one. The users who type the queries may not have a clear idea of what they are looking for, or may be laymen who know little terminology, so it can be more helpful to provide a set of likely results and let the users choose. Third, the types of entities are different. Each medical search engine may have its own types of entities to tag. There is also one minor difference in the tagging scheme between our problem and the CE task in 2013 ShARe/CLEF: we limit our scope to entities of consecutive words and do not handle disjoint entities.
2 Related Work
An effective model that has been commonly used for the NER problem is a bidirectional LSTM with a Conditional Random Field (CRF) on the top layer (BiLSTM-CRF), which is described in the next section. Combining the LSTM’s power of representing relations between words and the CRF’s capability of accounting for tag sequence constraints, Huang et al. [7] proposed the BiLSTM-CRF model and used handcrafted word features as the input to the model. Lample et al. [9] used a combination of character-level and word-level word embeddings as the input to BiLSTM-CRF. Since then, similar models with variations in the type of word embeddings have been used extensively for clinical CE tasks and have produced state-of-the-art results [18, 5, 6, 3]. Word embeddings have become the cornerstone of neural models in NLP since the Word2vec [12] model demonstrated its power in word analogy tasks. One well-known example is that after training Word2vec on a large amount of news data, relations between words can be recovered through simple arithmetic on their vectors. More sophisticated word embedding techniques have emerged since Word2vec. It has been shown empirically that better-quality word embeddings lead to better performance in many downstream NLP tasks, including entity tagging [14, 11]. Recently, contextualized word embeddings generated by deep learning models, such as ELMo [13], BERT [4], and Flair [2], have been shown to be more effective in various NLP tasks. In our project, we make use of a fine-tuned ELMo model.
Tang et al. [16] provided a straightforward algorithm for term matching. The algorithm starts by finding candidate terms that contain ALL the entity words, using term frequency-inverse document frequency (tf-idf) weighting. The candidates are then ranked based on the pairwise cosine distance between the word embeddings of the candidates and the entity.
We adopt the tagging-encoding pipeline framework from the CE task in 2013 ShARe/CLEF. We first tag the clinical entities in the user query and then select relevant terms from a dermatology glossary.
3.1 Entity Tagging
We use the same BiLSTM-CRF model proposed by Huang et al. [7]. An illustration of the architecture is shown in Figure 1. Given a sequence (or sentence) of $n$ tokens, $r = (w_1, w_2, \ldots, w_n)$, we use a fine-tuned ELMo model to generate contextual word embeddings for all the tokens in the sentence, where a token refers to a word or punctuation mark. We denote the ELMo embedding, $x_i$, for a token $w_i$ in the sentence $r$ by $x_i = \mathrm{ELMo}(w_i \mid r)$. The notation and the procedure described here can be adopted for Flair embeddings or other embeddings. Now, given a sequence of tokens in ELMo embeddings, $X = (x_1, x_2, \ldots, x_n)$, the BiLSTM layer generates a matrix of scores, $P(\theta)$, of size $n \times m$, where $m$ is the number of tag types and $\theta$ denotes the parameters of the BiLSTM. To simplify notation, we omit $\theta$ and write $P$. Then $P_{i,j}$ denotes the score of the token $w_i$ being assigned to the $j$th tag. Since certain constraints may exist in the transition between tags (an “O” tag should not be followed by an “I” tag, for example), a transition matrix, $A$, of dimension $m \times m$, is initialized to model the constraints. The learnable parameters, $A_{i,j}$, represent the probability that the $j$th tag follows the $i$th tag in a sequence. For example, if we index the tags by 1:“B”, 2:“I”, and 3:“O”, then $A_{1,3}$ would be the probability that an “O” tag follows a “B” tag. A beginning transition and an end transition are inserted in $A$, and hence $A$ is of dimension $(m+2) \times (m+2)$.

Given a sequence of tags, $y = (y_1, y_2, \ldots, y_n)$, where each $y_i$, $1 \le i \le n$, corresponds to an index of the tags, the score of the sequence is then given by

$$ s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}, \tag{1} $$

where $y_0$ and $y_{n+1}$ denote the beginning and end tags, respectively. The probability of the sequence of tags is then calculated by a softmax,

$$ p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y} e^{s(X, \tilde{y})}}, \tag{2} $$

where $Y$ denotes the set of all possible tag sequences. During training, the objective is to maximize $\log p(y \mid X)$ by adjusting $\theta$ and $A$.
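To make the scoring concrete, the following is a minimal sketch of the standard linear-chain CRF computation: the score of a tag sequence is the sum of its transition and emission scores, and its probability is a softmax over all tag sequences. It is an illustration only, not the flair implementation. The emission and transition values are made up, the transition entries are treated as unnormalized scores, tags are indexed from 0, and the partition function is computed by brute-force enumeration, which is only practical for toy inputs.

```python
import itertools
import math

def sequence_score(P, A, tags, start, end):
    """Score of one tag sequence: transitions (incl. start/end) plus emissions."""
    path = [start] + list(tags) + [end]
    transition = sum(A[a][b] for a, b in zip(path, path[1:]))
    emission = sum(P[i][t] for i, t in enumerate(tags))
    return transition + emission

def sequence_probability(P, A, tags, start, end):
    """Softmax of the score over all possible tag sequences (brute force)."""
    n, m = len(P), len(P[0])
    scores = [sequence_score(P, A, cand, start, end)
              for cand in itertools.product(range(m), repeat=n)]
    top = max(scores)  # subtract the max before exponentiating, for stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return math.exp(sequence_score(P, A, tags, start, end) - log_z)

# Tag indices: 0:"B", 1:"I", 2:"O", plus 3:start and 4:end.
B, I, O, START, END = range(5)
m = 3

# Hypothetical emission scores for the 3-token query "joint pain fever".
P = [[2.0, 0.1, 0.3],
     [0.2, 1.8, 0.4],
     [1.9, 0.3, 0.5]]

# Hypothetical transition scores, A[previous][next]; strongly negative
# entries discourage "O" -> "I" and start -> "I".
A = [[0.0] * 5 for _ in range(5)]
A[O][I] = -10.0
A[START][I] = -10.0

probs = {cand: sequence_probability(P, A, cand, START, END)
         for cand in itertools.product(range(m), repeat=len(P))}
best = max(probs, key=probs.get)  # most probable tag sequence
```

In practice the normalizer is computed with the forward algorithm and the best sequence with Viterbi decoding, both of which avoid enumerating every tag sequence.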
3.2 Term Matching
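The term matching algorithm of Tang et al. [16] is adopted with some major modifications. First, to identify candidate terms, we use a much looser string search algorithm: we stem the entity words with the Snowball stemmer and then select the candidate terms that contain ANY of the non-stopword entity words, rather than ALL of them. For example, suppose the tagged entity is (Ex. 3.1)
“severe burns on legs”,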
and one relevant term is “leg burn”. After stemming, “burns” and “legs” in Ex. 3.1 become “burn” and “leg”, respectively, allowing “leg burn” to be considered as a candidate. Although the word “severe” is not in the term “leg burn”, the term is still considered a candidate because we select using ANY. The stopword “on” is ignored when finding candidate terms so that not every term that contains the word “on” is added to the candidate pool. When a candidate term, $t$, is found in this manner for the tagged entity, $e$, we calculate the semantic similarity score, $s(e, t)$, between $e$ and $t$ in two steps. In the first step, we calculate the maximum similarity score for each word in $t$, as shown in Figure 2. Given a word in the candidate term, $t_i$ ($1 \le i \le l$, where $l$ is the number of words in the candidate term), and a word in the tagged entity, $e_j$, their similarity score, $s_{ij}$ (shown as an element of the boxed matrix in Figure 2), is given by

$$ s_{ij} = d\big(\mathrm{ELMo}(t_i), \mathrm{ELMo}(e_j)\big), \tag{3} $$

where $\mathrm{ELMo}(t_i)$ and $\mathrm{ELMo}(e_j)$ are the ELMo embeddings for the words $t_i$ and $e_j$, respectively. The ELMo embeddings have the same dimension for all words when using the same fine-tuned ELMo model. Thus, we can use a similarity function (e.g., the cosine similarity), denoted $d$ in equation 3, to compute the semantic similarity between words. In step 2, we calculate the candidate-entity relevance score (similarity) using the formula

$$ s(e, t) = \frac{1}{l} \sum_{i=1}^{l} \mathbb{1}\Big[\max_{j} s_{ij} \ge T\Big] \cdot \max_{j} s_{ij}, \tag{4} $$

where $T$ is a score threshold, and $\mathbb{1}[\cdot]$ is an indicator function that equals 1 if $\max_{j} s_{ij} \ge T$ and 0 otherwise. In equation 4 we define a metric that measures the “information coverage” of a candidate term with respect to a tagged entity. If the constituent words of a candidate term are relevant to the constituent words of the tagged entity, then the candidate term offers more information coverage. Intuitively, the more relevant words are present in the candidate term, the more relevant the candidate is to the tagged entity. The purpose of the cutoff, $T$, is to screen out the word pairs that are dissimilar, so that they do not contribute to information coverage. One can adjust the strictness of the entity-terminology matching by adjusting $T$. The higher we set $T$, the fewer candidate terms will be selected for a tagged entity. A normalization factor, $1/l$, is added to give preference to more concise candidate terms given the same amount of information coverage.
We also create an extra stopword list that includes words such as “configuration” and “color”, and we exclude these words from the word count of a candidate term. This is because the terms associated with the description of color or configuration usually have the word “color” or “configuration” in them, whereas a user query normally does not contain such words. For example, a tagged entity in a user query could be “round yellow patches”, for which the relevant terms include “round configuration” and “yellow color”. Since the relevance score is normalized by the number of words in the candidate term, the words “color” and “configuration” would lower the relevance score because they have no counterpart in the tagged entity. Therefore, we exclude them from the word count. Once this is done, we calculate the relevance score for all candidate terms and apply a threshold to discard candidate terms with low information coverage. Finally, we rank the remaining terms by their relevance scores and return the ranked list as the result.
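The term matching procedure described in this section can be sketched as follows. This is a simplified illustration under several assumptions rather than the project's implementation: the `stem` function is a toy stand-in for the Snowball stemmer, the `embed` lookup uses made-up static vectors in place of contextual ELMo embeddings, cosine similarity plays the role of the word-level similarity function, and the stopword lists, glossary, threshold values, and function names are all hypothetical.

```python
import math

STOPWORDS = {"on", "of", "the", "in"}         # hypothetical stopword list
EXTRA_STOPWORDS = {"color", "configuration"}  # excluded from a term's word count

def stem(word):
    """Toy stand-in for the Snowball stemmer: strips a plural 'es' or 's'."""
    word = word.lower()
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def content_stems(text, extra=frozenset()):
    """Stems of all words in `text` that are not stopwords."""
    return [stem(w) for w in text.split() if w.lower() not in (STOPWORDS | extra)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def relevance(entity, term, embed, threshold=0.5):
    """Thresholded, length-normalized coverage score of `term` for `entity`."""
    entity_words = content_stems(entity)
    term_words = content_stems(term, EXTRA_STOPWORDS)
    if not term_words:
        return 0.0
    total = 0.0
    for t in term_words:
        # Step 1: best match for this term word among the entity words.
        best = max(cosine(embed[t], embed[e]) for e in entity_words)
        # Step 2: only sufficiently similar pairs contribute to coverage.
        if best >= threshold:
            total += best
    return total / len(term_words)

def match_terms(entity, glossary, embed, threshold=0.5, min_score=0.3):
    """Find candidates sharing ANY stem with the entity, then score and rank."""
    entity_stems = set(content_stems(entity))
    candidates = [t for t in glossary
                  if entity_stems & set(content_stems(t, EXTRA_STOPWORDS))]
    scored = [(relevance(entity, t, embed, threshold), t) for t in candidates]
    return sorted(((s, t) for s, t in scored if s >= min_score), reverse=True)

# Made-up 3-dimensional vectors standing in for word embeddings.
embed = {
    "severe": [0.0, 0.0, 1.0], "burn": [1.0, 0.0, 0.0], "leg": [0.0, 1.0, 0.0],
    "scar": [0.4, 0.0, -0.9], "arm": [0.5, 0.5, 0.0], "pain": [0.0, 0.5, 0.5],
    "round": [1.0, 0.0, 0.0], "yellow": [0.0, 1.0, 0.0], "patch": [0.0, 0.0, 1.0],
}
glossary = ["leg burn", "burn scar", "arm pain"]
results = match_terms("severe burns on legs", glossary, embed)
ranked_terms = [term for _, term in results]
```

With these toy vectors, “arm pain” never enters the candidate pool because it shares no stem with the entity, while “burn scar” is kept as a candidate but ranks below “leg burn” because only one of its two words is covered by the entity. Excluding “color” from the word count means that “yellow color” is scored against “round yellow patches” as the single word “yellow”.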
Despite the greater similarity between our task and the 2013 ShARe/CLEF Task 1, we use the clinical notes from the CE task in 2010 i2b2/VA because 1) the data from 2010 i2b2/VA are easier to access and parse, and 2) 2013 ShARe/CLEF contains disjoint entities and hence requires more complicated tagging schemes. The synthesized user queries are generated using the aforementioned dermatology glossary. Tagged sentences are extracted from the clinical notes, and sentences with no clinical entity present are ignored. This yields 22,489 tagged sentences, which we refer to as the i2b2 data. The sentences are shuffled and split into train/dev/test sets with a ratio of 7:2:1. The synthesized user queries are composed by randomly selecting several clinical terms from the dermatology glossary and combining them in no particular order. When combining the clinical terms, we attach the BIO tags to their constituent words. The synthesized user queries (13,697 in total) are then split into train/dev/test sets with the same ratio. Next, each set in the i2b2 data is combined with the corresponding set in the synthesized query data to form a hybrid train/dev/test set. This ensures that the ratio between the i2b2 data and the synthesized query data is the same in each hybrid set.
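The query synthesis and the 7:2:1 split can be sketched as follows. This is a minimal illustration and not the code used in the project: the glossary entries, the function names `synthesize_query` and `split_7_2_1`, and the bounds on the number of terms per query are assumptions.

```python
import random

# Hypothetical glossary entries; the real glossary is far larger.
GLOSSARY = ["fever", "joint pain", "weight loss", "headache",
            "double vision", "leg burn"]

def synthesize_query(glossary, rng, min_terms=2, max_terms=5):
    """Combine randomly chosen glossary terms into one BIO-tagged query."""
    k = rng.randint(min_terms, max_terms)
    terms = rng.sample(glossary, k)  # random terms in random order
    tokens, tags = [], []
    for term in terms:
        words = term.split()
        tokens.extend(words)
        # The first word of a term opens an entity; the rest continue it.
        tags.extend(["B"] + ["I"] * (len(words) - 1))
    return tokens, tags

def split_7_2_1(items, rng):
    """Shuffle and split into train/dev/test sets with a 7:2:1 ratio."""
    items = list(items)
    rng.shuffle(items)
    n_train = len(items) * 7 // 10
    n_dev = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_dev],
            items[n_train + n_dev:])

rng = random.Random(0)
queries = [synthesize_query(GLOSSARY, rng) for _ in range(100)]
query_train, query_dev, query_test = split_7_2_1(queries, rng)
```

Under this sketch, a hybrid training set would be the concatenation of `query_train` with the i2b2 training sentences, and likewise for the dev and test sets. Note that every token in a synthesized query belongs to an entity, so no “O” tag is ever produced.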
The reason for combining the two datasets is their drastic structural difference (see Figure 3 for an example). Previously, when trained on the i2b2 data only, the BiLSTM-CRF model was not able to segment clinical entities at the correct boundaries. It would fail to recognize the user query in Figure 3(a) as four separate entities. On the other hand, if the model were trained solely on the synthesized user queries, we would expect it to fail on any query that resembles the sentence in Figure 3(b), because the model would never have seen an “O” tag in the training data. Therefore, it is necessary to use hybrid training data containing both the i2b2 data and the synthesized user queries.
To make the hybrid training data, we need to unify the tags. Recall from Section 1 that the tags differ across tasks and datasets. Since we use custom tags for the dermatology glossary in our problem, we would need to convert the tags used in 2010 i2b2/VA. However, this is infeasible because it would require experts to do the conversion manually. An alternative is to avoid distinguishing the tag types and to label all entities with generic BIO tags.
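Collapsing typed tags into generic BIO tags is a one-line mapping. The sketch below assumes that the i2b2 annotations have already been converted to typed BIO tags of the form `B-problem` or `I-test`; that tag format and the function name `to_generic_bio` are assumptions made for illustration.

```python
def to_generic_bio(tag):
    """Map a typed tag such as 'B-problem' to its generic prefix 'B'."""
    return tag.split("-", 1)[0]

typed = ["B-problem", "I-problem", "O", "B-test", "O", "B-treatment"]
generic = [to_generic_bio(tag) for tag in typed]
```

The entity boundaries are preserved while the entity types are discarded, which is what allows the two datasets to share one tag set.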
To show the effects of using the hybrid training data, we trained two models with the same architecture and hyperparameters. One model was trained on the hybrid data and will be referred to as the hybrid NER model. The other model was trained on clinical notes only and will be referred to as the i2b2 NER model. We evaluated the performance of the NER models by micro-F1 score on the test sets of both the synthesized queries and the i2b2 data.
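As an illustration of the evaluation metric, the sketch below computes an entity-level micro-F1 score from generic BIO tag sequences. It assumes that a predicted entity counts as correct only when its span matches a gold span exactly; the evaluation code in the library actually used may differ in detail, and the function names are hypothetical.

```python
def bio_spans(tags):
    """Return the set of (start, end) token spans encoded by generic BIO tags."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes a trailing span
        if tag != "I" and start is not None:
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans

def micro_f1(gold_sequences, predicted_sequences):
    """Micro-averaged F1 over all entities in all sequences (exact span match)."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_sequences, predicted_sequences):
        g, p = bio_spans(gold), bio_spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Query: "fever joint pain weight loss headache" (four gold entities).
gold = ["B", "B", "I", "B", "I", "B"]
# An under-segmented prediction that merges the first three entities.
pred = ["B", "I", "I", "I", "I", "B"]
f1 = micro_f1([gold], [pred])
```

In this example only “headache” is recovered with the correct boundaries, giving a precision of 1/2, a recall of 1/4, and an F1 score of 1/3. This is the kind of segmentation error that penalizes a model trained on clinical notes alone.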
We used the BiLSTM-CRF implementation provided by the flair package [1]. We set the hidden size to 256 in the LSTM and left everything else at the default values for the SequenceTagger model in flair. For word embeddings, we used the ELMo embeddings fine-tuned on PubMed articles.
4.3 Hyperparameter Tuning
We defined the following hyperparameter search space:
embeddings: [“ELMo on pubmed”, “stacked flair on pubmed”],
hidden_size: [128, 256],
learning_rate: [0.05, 0.1],
mini_batch_size: [32, 64, 128].
The hyperparameter optimization was performed using Hyperopt.
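To give a sense of the size of this search space, the sketch below enumerates every combination of the listed values. It is not how Hyperopt works internally, since Hyperopt samples configurations from the space rather than enumerating a grid, and the dictionary layout and the helper name `enumerate_configs` are assumptions for illustration.

```python
import itertools

search_space = {
    "embeddings": ["ELMo on pubmed", "stacked flair on pubmed"],
    "hidden_size": [128, 256],
    "learning_rate": [0.05, 0.1],
    "mini_batch_size": [32, 64, 128],
}

def enumerate_configs(space):
    """Yield one dict per combination of hyperparameter values."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))

configs = list(enumerate_configs(search_space))
num_configs = len(configs)  # 2 * 2 * 2 * 3 combinations
```

The space contains 24 distinct configurations, which is small enough that an exhaustive search would also be feasible if training time allows.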
From the hyperparameter tuning we found that the best combination was
embeddings: “ELMo on pubmed”,
| Model | Synthesized Queries | Clinical Notes |
With the above hyperparameter setting, the hybrid NER model achieved a higher F1 score than the i2b2 NER model on both the synthesized queries and the clinical notes (see Table 1).
Since there was no ground truth available for the retrieved terms, we randomly picked a few samples to assess the performance of term matching. Some example outputs of our complete framework on real user queries are shown in Figure 4. For example, the query “child fever double vision dizzy” was correctly tagged with four entities: “child”, “fever”, “double vision”, and “dizzy”. A list of terms from our glossary was matched to each entity. In a real-world application, the lists of terms would be presented to the user as the retrieval results for their query.
In most of the real user queries we sampled, the entities were tagged at the correct boundaries and the tagging was complete (as in the examples shown in Figure 4). The tagging was questionable on only a few user queries. For example, the query “Erythematous blanching round, oval patches on torso, extremities” was tagged as “Erythematous blanching” and “oval patches on torso”. The entity “extremities” was missed, and the segmentation was incorrect. A more appropriate tagging would be “Erythematous blanching round, oval patches”, “torso”, and “extremities”. The tagging could be further improved by synthesizing more realistic user queries. Recall that the synthesized user queries were created by randomly combining terms from the dermatology glossary. While this provided data that helped the model learn entity segmentation, it did not reflect the co-occurrence information in real user queries. For example, two clinical entities might often co-occur, or never co-occur, in a user query. Because the synthesized user queries combined terms randomly, this co-occurrence information was missing.
The final retrieval results of our framework were not evaluated quantitatively in terms of recall and precision, due to the lack of ground truth. When ground truth becomes available, we will be able to evaluate our framework more thoroughly. Recently, a fine-tuned BERT model for the medical domain called BioBERT [10] has attracted some attention in medical NLP. We could experiment with BioBERT embeddings in the future. We could also include a query expansion technique for term matching. When finding candidate terms for an entity, our first step was still based on string matching. Since many differently worded entities may correspond to the same term, it is hard to include all of them in the glossary, and string matching alone may fail to match terms to them.
In this project, we tackle the problem of extracting clinical concepts from user queries on medical search engines. By training a BiLSTM-CRF model on hybrid data consisting of synthesized user queries and sentences from clinical notes, we adapt a CE framework to clinical user queries with minimal effort spent on annotating user queries. We find that the hybrid data enables the NER model to perform better on tagging both the user queries and the clinical note sentences. Furthermore, our framework is built on an easy-to-use deep learning NLP Python library, which increases its prospective value to the various online medical applications that employ medical search engines.
This paper results from a technical report of a project the authors have worked on with visualDx, a healthcare informatics company that provides a web-based clinical decision support system. The authors would like to thank visualDx for providing them the opportunity to work on such an exciting project. In particular, the authors would like to thank Roy Robinson, the Vice President of Technology and Medical Informatics at visualDx, for providing the synthesized user queries, as well as preliminary feedback on the performance of our framework.
- See section 1.8 of annotation guidelines, https://drive.google.com/file/d/0B7oJZ-fwZvH5VmhyY3lHRFJhWkk/edit
- We worked with visualDx on this project and used their glossary. See Acknowledgment.
[1] (2019-06) FLAIR: an easy-to-use framework for state-of-the-art NLP. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), Minneapolis, Minnesota, pp. 54–59.
[2] (2018-08) Contextual string embeddings for sequence labeling. In Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, pp. 1638–1649.
[3] (2016) Bidirectional LSTM-CRF for clinical concept extraction. CoRR abs/1610.05858.
[4] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[5] (2018-05) Named entity recognition using neural networks for clinical notes. In Proceedings of the 1st International Workshop on Medication and Adverse Drug Event Detection, F. Liu, A. Jagannatha and H. Yu (Eds.), Proceedings of Machine Learning Research, Vol. 90, pp. 7–15.
[6] (2017-07) Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics (Oxford, England) 33, pp. i37–i48.
[7] (2015) Bidirectional LSTM-CRF models for sequence tagging. CoRR abs/1508.01991.
[8] (2019-10) Speech and language processing (3rd edition draft). Draft available at https://web.stanford.edu/~jurafsky/slp3/
[9] (2016) Neural architectures for named entity recognition. CoRR abs/1603.01360.
[10] (2019) BioBERT: a pre-trained biomedical language representation model for biomedical text mining. CoRR abs/1901.08746.
[11] (2017) Learned in translation: contextualized word vectors. CoRR abs/1708.00107.
[12] (2013) Efficient estimation of word representations in vector space. CoRR abs/1301.3781.
[13] (2018) Deep contextualized word representations. CoRR abs/1802.05365.
[14] (2019) Enhancing clinical concept extraction with contextual embedding. CoRR abs/1902.08691.
[15] (2013) Overview of the ShARe/CLEF eHealth evaluation lab 2013. In Proceedings of the 4th International Conference on Information Access Evaluation. Multilinguality, Multimodality, and Visualization - Volume 8138, CLEF 2013, Berlin, Heidelberg, pp. 212–231.
[16] (2013-01) Recognizing and encoding disorder concepts in clinical text using machine learning and vector space model. CEUR Workshop Proceedings 1179.
[17] (2011-06) 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association: JAMIA 18, pp. 552–6.
[18] (2018) Clinical concept extraction with contextual word embedding. CoRR abs/1810.10566.