DeepTag: inferring all-cause diagnoses from clinical notes in under-resourced medical domain

Allen Nie Ashley Zehnder Rodney L. Page Department of Clinical Sciences, Colorado State University, Fort Collins, CO 80523, USA Arturo Lopez Pineda Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA Manuel A. Rivas Department of Biomedical Data Science, Stanford University, Stanford, CA 94305, USA Carlos D. Bustamante James Zou

In many under-resourced settings, clinicians lack the time and expertise to annotate patients with standard medical diagnosis codes. Veterinary medicine is one example: clinical encounters are largely captured in free-text notes that are not labeled with diagnosis codes. The lack of such standard coding makes it challenging to apply data science to improve patient care. It is also a major impediment to translational research, where, for example, we would like to leverage veterinary data to inform drug development for humans. We develop a deep learning algorithm, DeepTag, to automatically infer diagnosis codes from veterinary free-text notes. DeepTag is trained on a newly curated dataset of 112,558 veterinary notes manually annotated by experts. DeepTag extends the multi-task LSTM with an improved hierarchical objective that captures structure between diseases. To foster human-machine collaboration, DeepTag also learns to abstain on examples where it is uncertain and to defer them to human experts, which improves the performance of the model. DeepTag accurately infers disease codes from free text even in challenging out-of-domain settings where the text comes from different clinics than the ones used for training. It enables automated disease annotation across a broad range of clinical diagnoses with minimal pre-processing. The technical framework in this work can be applied in other medical domains that currently lack medical coding infrastructure.



While a robust medical coding infrastructure exists in the US health care system for human medical records, this is not the case in veterinary medicine, which suffers from a lack of coding infrastructure and standardized nomenclatures across medical institutions. We refer to this as under-resourced. This hampers efforts at clinical research and public health monitoring. Due to the relative ease of obtaining large volumes of free-text veterinary clinical records for research (compared to similar volumes of human medical data), we use veterinary records in this work as a use case to investigate methods for building automatic tagging systems for free text clinical notes.

It is becoming increasingly accepted that spontaneous diseases in animals have important translational impact on the study of human disease for a variety of disciplines [kol2015companion]. Beyond the study of zoonotic diseases, which represent 60-70% of all emerging diseases, non-infectious diseases, like cancer, have become increasingly studied in companion animals as a way to mitigate some of the problems with rodent models of disease [leblanc2015defining]. However, when it comes to identifying clinical cohorts of veterinary patients on a large scale for clinical research, there are several problems. One of the first is that veterinary clinical visits rarely have diagnoses applied to them, either by clinicians or medical coders. There is no substantial third party payer system and no HealthIT act that applies to veterinary medicine, so there are few incentives for clinicians or hospitals to annotate their records for diseases to be able to identify patients by diagnosis. Billing codes are largely institution-specific and rarely applicable across institutions, unless hospitals are under the same management structure and records system. Some large corporate practice groups have their own internal clinical coding structures, but that data is rarely made available to outside researchers. A small number of academic veterinary centers (of a total of 30 veterinary schools in the US) employ dedicated medical coding staff who apply disease codes to clinical records so that these records can be identified by clinical faculty for research purposes. How best to utilize this rare, well-annotated veterinary clinical data for the development of tools that can help organize the remaining segments of the veterinary medical domain is an area of active research.

Related work

The field of natural language processing has largely focused on learning language patterns without human specification. The techniques have improved, from discrete pattern generation such as n-grams [jurafsky2014speech], to continuous learning algorithms like long short-term memory networks (LSTMs) [hochreiter1997long]. The goal is to learn to identify meaningful patterns of language implicitly with a model, without the need for domain expertise. This strategy has proven to be extremely successful when a sizable amount of data can be acquired. Combined with advances in optimization and classification algorithms, the field has developed algorithms that match or exceed human performance in traditionally difficult tasks such as machine translation [vaswani2017attention], reading comprehension [yu2018qanet], and text-to-speech generation [van2016wavenet].

In addition, various disease tagging systems have been built on the MIMIC dataset [saeed2011multiparameter], an ICU dataset that contains human patients' discharge summary text and the associated ICD-9 codes. These include some models that incorporate hierarchical structure from the ICD coding system [perotte2013diagnosis]. In countries like the US, the deployment of such systems is not as crucial due to a large medical coding infrastructure, so the dataset has only served as a test bed for various algorithmic improvements. For under-resourced domains like veterinary records, only traditional methods, such as building categorization dictionaries with human experts, have been previously explored [anholt2014mining]. However, the amount of effort needed to generate such a dictionary is substantial and generally limits the scope of the work to a specific disease or disease subset.

Other work in this area of broad, high-level, multi-category medical record classification is somewhat rare: most prior published work focuses on a subset of specific diseases, even if that subset covers a range of diseases [halpern2016electronic, gehrmann2017comparing]. The methods utilized in prior work range from noisy labeling techniques built from domain-expert curated phenotype libraries [halpern2016electronic] to recurrent neural networks used to predict a well-curated subset of diseases of interest [gehrmann2017comparing]. Recent reviews highlight these methods and demonstrate that the vast majority of publications that attempt to identify patient cohorts utilize NLP or rule-based methodologies [shivade2013review]. Other reviews discuss a variety of deep learning methods on EHR data, but highlight the difficulty of assigning meaning to free text documents [shickel2017deep]. A recent paper utilized deep learning models to perform several key prediction tasks using clinical data from two hospitals, one of which utilized the unstructured free text clinical notes, and was able to predict all of a patient's ICD codes at the time of discharge [rajkomar2018scalable]. However, these predictions relied heavily on the structured data used to create patient representations and also utilized hundreds of thousands of CPU hours, resources which cannot be assumed to be generally available for these kinds of tasks, especially in under-resourced settings.

Our contribution focuses on building a modern NLP-based tagging system for veterinarian medical documents. We investigate the out-of-domain generalization problem for text-based machine learning algorithms in this poorly resourced domain. By using human expert knowledge to define latent structure between labels, we are able to observe a significant increase in performance for out-of-domain data adaptation. We have also investigated possible ways to foster human-machine interaction by allowing the tagger to abstain when there is a lack of confidence in assigning correct categorization labels. We show that in a multi-label classification setting, a learned abstention model outperforms the baseline.


CSU clinical note example

Jem is a 10 year old male castrated hound mix that was presented for continuation of chemotherapy for previously diagnosed B-cell multicentric lymphoma. Jem was started on CHOP chemotherapy last week and has been doing very well since receiving doxorubicin. The owners have noted his lymph nodes have gotten much smaller. He has some loose stool, yet improved with metronidazole. Current medications include prednisolone. Assessment: Jem is in a strong partial remission based on today’s physical exam. He is also doing very well since starting chemotherapy. A CBC today was unremarkable and adequate for chemotherapy. She was dispensed oral cyclophosphamide and furosemide that the owners were instructed to give at home.

Expert annotated diseases: Disorder of hematopoietic cell proliferation, Neoplasm and/or hamartoma

PP clinical note example

Likely ear infection shaking head now swollen drooping ear otherwise doing well amublating well- had RF carpal arthrodesis at UCD. wt: 95.3 lbs. Ears/Eyes/Nose/Throat: Clear OU/ brown yeasty debris AU errythema AU no fb/tm;s intact AU mod aural hematoma AD Cardiovascular: No murmur/arrhythmia. Femoral pulses strong and synchronous. HR:84 Respiratory: Lungs sound clear bilaterally no crackles or wheezes. Eupneic. RR:20 Lymph nodes: No palpable peripheral lymphadenopathy Oral Exam: mm pink and moist. CRT 2 sec Musculoskeletal: No lameness noted. Arthrodesis carpal joint RF thickened stifle LH ambulating well Nervous System: Appropriate mentation. No overt neurologic deficits Integument: Full haircoat. Adequate skin turgor. otitis externa AU aural hematoma AD. Ear cytology - ++cocci yeast AU. Cleaned with epiotic. Rx Tresaderm BID x 14 days. applied first dose Rx temarilP taper Skin prep medial AD over hematoma. then 19g butterfly needle attached to syringe drained 10ml bloody fluid held pressure with guaze no more bleeding CE: Recheck in 14d to ensure infection cleared discussed aural hematomas options for tx sx. may recur.

Expert annotated diseases: Infectious disease, Disorder of the integument, Disorder of the auditory system

Figure 1: System workflow and clinical note examples. Figure (a) shows the workflow of DeepTag with abstention. Then we show two example hierarchies among the subsets of the 42 SNOMED-CT codes. Figure (b) shows two example notes from the CSU and PP datasets. The highlighted text shows the supporting evidence human curators use to assign disease tags to these documents.


DeepTag takes a clinician's notes as input and predicts a set of SNOMED-CT disease codes. SNOMED-CT is a comprehensive and precise clinical health terminology managed by the International Health Terminology Standards Development Organization. DeepTag is a bi-directional long short-term memory (LSTM) neural network with a hierarchical learning objective. We leverage the fact that the SNOMED-CT codes are structured and develop a novel hierarchical objective function, which improves the generalization performance of DeepTag.

DeepTag is trained on 112,558 annotated veterinary notes from the Colorado State University College of Veterinary Medicine and Biomedical Sciences (CSU) for research purposes. Each of these notes is a free text description of a patient visit, and is manually labeled by experts with at least one, but on average eight, SNOMED-CT codes. When the coder-applied disease codes are mapped up to the children of the parent node Disease (disorder) (ConceptID: 64572001), there are 41 SNOMED-CT top-level disease codes present in the CSU dataset. In addition, we map every non-disease-related code to an extra spurious code. In total, DeepTag learns to tag a clinical note with a subset of 42 codes.
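The roll-up from a coder-applied code to a top-level category can be sketched as follows. This is a minimal illustration, not the mapping code used for the dataset: the `PARENT` map and every code name in it are hypothetical placeholders rather than real SNOMED-CT identifiers (only the root ConceptID 64572001 comes from the text), and the sketch assumes each code has a single parent, whereas SNOMED-CT concepts can have several.

```python
# Hypothetical toy hierarchy (child -> parent). Only the root ID is from the text.
DISEASE_ROOT = "64572001"  # Disease (disorder)
SPURIOUS = "spurious"      # extra label for non-disease-related codes

PARENT = {
    "neoplasm": DISEASE_ROOT,
    "lymphoma": "neoplasm",
    "b_cell_lymphoma": "lymphoma",
    "ear_disorder": DISEASE_ROOT,
    "otitis_externa": "ear_disorder",
    "vomiting_finding": "clinical_finding",  # not under the disease root
}


def top_level_code(code, parent=PARENT, root=DISEASE_ROOT):
    """Walk up the hierarchy until reaching a direct child of `root`."""
    seen = set()
    while code in parent and code not in seen:
        seen.add(code)  # guards against accidental cycles in the map
        if parent[code] == root:
            return code
        code = parent[code]
    return SPURIOUS  # the walk never reached the disease root


def tag_note(codes):
    """Map a note's coder-applied codes to its set of top-level labels."""
    return {top_level_code(c) for c in codes}
```

Under these assumptions, a note coded with a specific lymphoma subtype and a specific ear condition receives two top-level labels, and a finding outside the disease subtree falls into the spurious label.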

We evaluate DeepTag on two different datasets. One consists of 5,628 randomly sampled non-overlapping documents from the same CSU dataset that the system is trained on. The other dataset contains 586 documents collected from a private practice (PP) located in northern California. Each of these documents is also manually annotated with the appropriate SNOMED-CT codes by human experts. We refer to this dataset as the PP dataset. We regard the PP dataset as an "out-of-domain" dataset because it differs substantially from the CSU dataset in writing style and institution type.

Tagging performance

| Disease | CSU Prec | CSU Rec | CSU F1 | CSU Accu | CSU N | CSU Sub | PP Prec | PP Rec | PP F1 | PP Accu | PP N | PP Sub |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Autoimmune disease | 94.0 | 72.3 | 81.4 | 99.6 | 1280 | 11 | 0.0 | 0.0 | 0.0 | 99.8 | 1 | 1(1) |
| Congenital disease | 72.9 | 35.9 | 47.3 | 97.8 | 3345 | 224 | 46.7 | 3.5 | 6.4 | 97.0 | 17 | 8(6) |
| Propensity to adverse reactions | 89.1 | 70.2 | 78.1 | 98.2 | 5105 | 8 | 67.2 | 12.6 | 19.5 | 92.9 | 43 | 7(2) |
| Metabolic disease | 68.9 | 55.4 | 61.0 | 96.9 | 5265 | 82 | 56.6 | 48.5 | 51.1 | 95.9 | 26 | 12(9) |
| Disorder of auditory system | 81.0 | 66.2 | 72.8 | 97.7 | 5393 | 67 | 78.8 | 70.3 | 73.8 | 94.6 | 64 | 12(6) |
| Hypersensitivity condition | 85.7 | 74.6 | 79.5 | 97.7 | 6871 | 31 | 67.7 | 22.4 | 31.6 | 92.3 | 50 | 11(4) |
| Disorder of endocrine system | 79.2 | 66.7 | 72.2 | 96.9 | 7009 | 84 | 44.4 | 21.7 | 28.7 | 91.7 | 46 | 8(8) |
| Disorder of hematopoietic cell proliferation | 95.1 | 87.4 | 91.0 | 98.9 | 7294 | 22 | 62.7 | 25.0 | 34.5 | 97.5 | 16 | 6(1) |
| Disorder of nervous system | 76.1 | 63.8 | 69.2 | 96.4 | 7488 | 243 | 40.4 | 26.7 | 30.8 | 94.7 | 27 | 19(14) |
| Disorder of cardiovascular system | 79.3 | 62.5 | 69.7 | 95.7 | 8733 | 351 | 44.1 | 52.1 | 46.4 | 89.4 | 53 | 30(24) |
| Disorder of the genitourinary system | 77.7 | 62.6 | 69.3 | 95.7 | 8892 | 317 | 47.8 | 39.1 | 42.2 | 92.2 | 44 | 19(12) |
| Traumatic AND/OR non-traumatic injury | 72.8 | 57.2 | 63.5 | 94.8 | 9027 | 536 | 50.5 | 15.8 | 23.1 | 96.6 | 19 | 13(8) |
| Visual system disorder | 84.3 | 81.1 | 82.6 | 96.9 | 10139 | 413 | 65.0 | 62.6 | 63.2 | 92.4 | 62 | 39(34) |
| Infectious disease | 71.2 | 53.7 | 60.8 | 92.9 | 11304 | 260 | 63.8 | 23.0 | 32.3 | 86.1 | 88 | 20(10) |
| Disorder of respiratory system | 79.5 | 65.5 | 71.8 | 95.2 | 11322 | 274 | 38.3 | 42.2 | 38.2 | 93.6 | 27 | 16(14) |
| Disorder of connective tissue | 75.4 | 67.0 | 70.7 | 91.3 | 17477 | 567 | 30.4 | 24.2 | 26.3 | 94.4 | 24 | 15(11) |
| Disorder of musculoskeletal system | 77.0 | 73.4 | 74.8 | 91.1 | 20060 | 670 | 54.0 | 41.4 | 46.1 | 90.9 | 56 | 31(19) |
| Disorder of integument | 84.2 | 71.6 | 77.3 | 92.3 | 21052 | 360 | 65.7 | 60.1 | 62.6 | 80.8 | 156 | 58(32) |
| Disorder of digestive system | 76.8 | 67.1 | 71.5 | 89.7 | 22589 | 694 | 58.0 | 47.9 | 51.3 | 70.8 | 195 | 47(36) |
| Neoplasm and/or hamartoma | 92.2 | 88.9 | 90.5 | 93.9 | 36108 | 749 | 26.1 | 72.5 | 37.8 | 74.7 | 59 | 18(7) |

This table reports DeepTag's performance (precision, recall, F1, and accuracy) for the 21 most frequent disease categories (from a total of 42 categories). N indicates the total number of examples in the dataset. Sub indicates the number of specific disease codes present in the dataset that are binned into the given disease-level code. For the PP dataset, the Sub number in parentheses indicates the number of subtypes that are also present in the CSU dataset.

Table 1: Report of DeepTag performance on CSU test data and PP data

We present DeepTag's performance on the CSU and PP test data in Table 1. To save space, we display the 21 most frequent disease codes in Table 1. Each SNOMED-CT code corresponds to one disease category. For each category, we report precision, recall, F1, accuracy, the number of training examples in the category (N), and the number of disease subtypes in the category. While DeepTag achieves reasonable F1 scores overall, its performance is quite heterogeneous across categories. Moreover, the performance decreases when DeepTag is applied to the out-of-domain PP test data.

We identify two factors that substantially impact DeepTag’s performance: 1) the number of training examples that are tagged with the given disease label; 2) the number of subtypes, which are SNOMED-CT codes that are actually applied to the CSU dataset that are lower in the SNOMED-CT hierarchy compared to the disease-level codes DeepTag is predicting. We use the number of subtypes as a proxy for the diversity of the clinical text descriptions. Thus, a higher number of subtypes indicates a wider spectrum of diseases.

Figure 2: Per-label F1 score plotted against the log of the number of examples in the training dataset. Results shown here are from the DeepTag model. Each point represents a label, its corresponding number of training examples in CSU, and the per-label F1 score from the DeepTag model.

Performance improves with more training examples.

We first note that DeepTag works relatively well when training examples for a label are abundant. We generate a scatter plot to capture the correlation between the number of examples in the CSU dataset and the label's F1 score evaluated on the CSU test set. We also plot the F1 score for the label evaluated on the PP dataset against its number of training examples in the CSU dataset.

For the CSU dataset, we observe an almost linear relationship between the log number of examples and the F1 score in Figure 2. We observe a similar pattern when evaluating on the PP dataset, though the correlation is weaker and the pattern is less linear. This is due to the out-of-domain nature of PP, which we investigate in depth below.

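| Test data | Model | EM | Precision (unwgt) | Precision (wgt) | Recall (unwgt) | Recall (wgt) | F1 (unwgt) | F1 (wgt) |
|---|---|---|---|---|---|---|---|---|
| CSU | LSTM | 47.4 | 76.6 | 85.9 | 59.3 | 78.7 | 65.3 | 81.7 |
| CSU | BLSTM | 48.2 | 76.1 | 86.0 | 57.6 | 79.4 | 63.5 | 82.2 |
| CSU | DeepTag-M | 48.6 | 76.8 | 86.3 | 58.7 | 79.6 | 64.6 | 82.4 |
| CSU | DeepTag | 48.4 | 79.9 | 86.1 | 62.1 | 79.8 | 68.0 | 82.4 |
| PP | LSTM | 13.8 | 48.1 | 65.7 | 31.8 | 51.9 | 33.8 | 54.4 |
| PP | BLSTM | 13.8 | 47.3 | 66.0 | 35.6 | 57.9 | 36.9 | 58.4 |
| PP | DeepTag-M | 17.1 | 53.4 | 68.0 | 37.9 | 59.9 | 40.6 | 61.1 |
| PP | DeepTag | 17.4 | 56.5 | 70.3 | 41.4 | 62.4 | 43.2 | 63.4 |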

Aggregate prediction performance across the 42 categories. BLSTM refers to the multi-task bidirectional LSTM. DeepTag is our best model, and DeepTag-M is the variation with a meta-category loss. EM indicates the exact match ratio, which is the percentage of clinical notes for which the algorithm perfectly predicts all of the disease categories. For example, if a note has three true disease labels, then the algorithm achieves an exact match if it predicts exactly these three labels, no more and no less. For precision, recall, and F1 score, there are two ways to compute an algorithm's performance: we can take an unweighted average of the score across all the disease categories (unwgt), or we can take an average weighted by the number of test examples in each category (wgt).

Table 2: Evaluation of trained classifiers on the CSU and PP data
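The aggregate metrics in Table 2 can be illustrated with a short sketch. It assumes binary indicator matrices of shape (notes, labels) and scores a label with no predictions or no true examples as 0; that zero-division convention and the function name `multilabel_report` are assumptions for illustration, not details taken from the paper.

```python
import numpy as np


def multilabel_report(y_true, y_pred):
    """y_true, y_pred: binary arrays of shape (n_notes, n_labels)."""
    y_true = np.asarray(y_true, dtype=bool)
    y_pred = np.asarray(y_pred, dtype=bool)

    # Exact match: every label of a note must be predicted correctly.
    em = float(np.mean(np.all(y_true == y_pred, axis=1)))

    tp = np.sum(y_true & y_pred, axis=0).astype(float)
    n_pred = y_pred.sum(axis=0).astype(float)
    n_true = y_true.sum(axis=0).astype(float)

    # Per-label scores; a zero denominator yields 0 (assumed convention).
    prec = np.divide(tp, n_pred, out=np.zeros_like(tp), where=n_pred > 0)
    rec = np.divide(tp, n_true, out=np.zeros_like(tp), where=n_true > 0)
    denom = prec + rec
    f1 = np.divide(2 * prec * rec, denom, out=np.zeros_like(tp), where=denom > 0)

    def avg(scores):
        unweighted = float(scores.mean())
        # Weighted by the number of test examples per label; this requires
        # at least one positive example somewhere in y_true.
        weighted = float(np.average(scores, weights=n_true))
        return unweighted, weighted

    return {"EM": em, "precision": avg(prec), "recall": avg(rec), "F1": avg(f1)}
```

The unweighted average treats rare and frequent categories equally, so it is the more sensitive of the two to performance on rare diseases, while the weighted average is dominated by frequent categories.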

More diverse categories are harder to predict.

After observing the general correlation between the number of training examples and per-label F1 scores, we can investigate outliers: diseases that have many examples but on which the tagger performs poorly, and diseases that have few examples but on which the tagger performs well. For disorder of digestive system, despite having the second highest number of training examples (22,589), both precision and recall are lower than for other frequent diseases. We find that this disorder category covers the second largest number of disease subtypes (694). On the other hand, disorder of hematopoietic cell proliferation has the highest F1 score with relatively few training examples (N = 7,294). This category has only 22 subtypes. Similarly, autoimmune disease has few training examples (N = 1,280) but still has a relatively high F1, and it has only 11 subtypes.

The number of subtypes (i.e., the number of distinct lower-level codes that are mapped to each higher-level disease code) can serve as an indicator of the diversity of the text descriptions. A category like disorder of digestive system subsumes many different types of diseases, such as periodontal disease, hepatic disease, and disease of stomach, which all have different diagnoses. Similarly, neoplasm and/or hamartoma encapsulates many different histologic types, each of which can be categorized as benign, malignant, or unknown, resulting in many different lower-level codes (749 codes) being mapped into the same top-level disease code. The tagger needs to associate diverse descriptions with the same high-level label, which increases the difficulty of tagging.

We hypothesize that disease labels with many subtypes will be difficult for the system to predict. This hypothesis suggests that the number of subtypes within a diagnosis category could explain some of the heterogeneity in DeepTag's performance beyond the heterogeneity due to the training sample size. We conduct a multiple linear regression test with both the number of training examples and the number of subtypes each label contains as covariates and the F1 score as the outcome. In the regression, the coefficient for number of subtypes is negative with . This indicates that, controlling for the number of training examples, having more subtypes in a disease category makes tagging more challenging and decreases DeepTag's performance on the label.
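A regression of this kind can be sketched with ordinary least squares. The sketch below is illustrative only: it assumes the training-set size enters on a log scale (suggested by Figure 2, but not stated for the regression itself), the function name `fit_f1_regression` is hypothetical, and it returns coefficients only, without the significance test reported in the text.

```python
import numpy as np


def fit_f1_regression(n_train, n_subtypes, f1):
    """Least-squares fit of f1 ~ b0 + b1 * log(n_train) + b2 * n_subtypes."""
    f1 = np.asarray(f1, dtype=float)
    X = np.column_stack([
        np.ones(len(f1)),                          # intercept
        np.log(np.asarray(n_train, dtype=float)),  # assumed log transform
        np.asarray(n_subtypes, dtype=float),
    ])
    coef, *_ = np.linalg.lstsq(X, f1, rcond=None)
    return coef  # [intercept, slope for log N, slope for number of subtypes]


# Synthetic values, made up only to check the mechanics (not data from the paper).
n = np.array([100.0, 1000.0, 10000.0, 500.0, 5000.0])
sub = np.array([5.0, 50.0, 20.0, 200.0, 100.0])
toy_f1 = 20.0 + 6.0 * np.log(n) - 0.05 * sub
coef = fit_f1_regression(n, sub, toy_f1)
```

A negative third coefficient corresponds to the reading in the text: at a fixed number of training examples, more subtypes goes with a lower F1 score.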

Performance on PP

Next we investigate DeepTag's performance discrepancy between the CSU and PP test data. A primary contributing factor to the discrepancy is that the underlying text in PP is stylistically and functionally different from the text in CSU. Note that DeepTag was only trained on CSU text and was not fine-tuned on PP. The example texts in Figure 1(b) illustrate the striking difference. In particular, PP uses many more abbreviations that are not observed in CSU.

After filtering out numbers, 15.4% of words in PP are not found in CSU. Many of the PP-specific words appear to be medical acronyms that are not used in CSU, or terms that describe test results or medical procedures. Since these words have no trained and updated word embeddings from the CSU dataset, the tagger is not able to leverage them in the disease tagging process.
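A vocabulary-mismatch statistic of this kind can be computed as sketched below. The tokenization (lowercased alphabetic runs) and the choice to count distinct word types rather than word occurrences are assumptions made for illustration; the text does not specify either, so this sketch is not expected to reproduce the 15.4% figure exactly.

```python
import re


def tokenize(text):
    # Lowercased alphabetic tokens only, so numbers are filtered out.
    return re.findall(r"[a-z]+", text.lower())


def oov_rate(train_docs, eval_docs):
    """Fraction of distinct eval-set words that never occur in the training set."""
    train_vocab = {tok for doc in train_docs for tok in tokenize(doc)}
    eval_vocab = {tok for doc in eval_docs for tok in tokenize(doc)}
    if not eval_vocab:
        return 0.0
    return len(eval_vocab - train_vocab) / len(eval_vocab)
```

For instance, an abbreviation such as "AU" that appears in the evaluation notes but never in the training notes counts toward the out-of-vocabulary fraction.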

Despite having many training examples, DeepTag does poorly on some very frequent diseases, for example, neoplasm and/or hamartoma. At the opposite end of the spectrum, the tagger does well on disorder of auditory system on both the CSU and PP datasets, despite having only a moderate number of training examples. Besides the main issue of vocabulary mismatch, many subtypes (lower-level codes) that get mapped to a certain disease-level code do not exist in PP, and subtypes in PP also might not exist in CSU. We refer to this as the subtype distribution shift.

For example, in CSU, neoplasm and/or hamartoma has 749 observed subtypes. Only 7 out of 749 subtypes are present in PP. Moreover, 11 subtypes are unique to the PP dataset and are not observed in the CSU training set. These differences appear to be primarily due to differences in how the primary medical codes are applied to the datasets and not to significant discrepancies in the types of neoplasias observed.

In addition to the subtype analysis, we note that for rarer diseases, the precision drop between CSU and PP is not as deep as the recall drop. One interpretation is that the model is fairly confident and precise about the key phrases it discovered from the CSU dataset. However, the PP dataset uses other terms or phrases (that are not covered in the CSU dataset) to describe the disease, resulting in a sharper loss in recall.

Improvements from disease hierarchy

DeepTag is designed to leverage the hierarchy between labels, which is defined by a map of correspondence between ICD-9 and SNOMED disease-level codes (see the supplementary material). Based on the hierarchy, we can augment the system with the knowledge that diseases under the same parent, which we call a meta-category, should be more similar to each other than diseases that belong to different parents. The hierarchy is an implicit constraint that we can place on the model, and it serves as a regularizer. Basic deep learning systems like LSTM and BLSTM do not incorporate this information.

DeepTag uses a distance-based objective to place this constraint between disease label embeddings. The objective encourages the embeddings of diseases that are in the same meta-category to be closer to each other than embeddings of diseases across meta-categories. In addition, we investigated another approach that can also leverage the hierarchy: DeepTag-M. This method computes the probability of a parent-level code based on the probability of its children-level codes. Instead of forcing similarity/dissimilarity constraints on disease label embeddings, DeepTag-M encourages the model to make correct predictions on the parent level as well as on the child level.
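The two ideas can be made concrete with a small sketch. Neither function is the paper's exact formulation: the squared Euclidean distance, the "within minus across" form of the penalty, and the noisy-OR rule for combining child probabilities are all assumptions chosen for illustration, and the names `hierarchy_penalty` and `parent_probability` are hypothetical.

```python
import numpy as np


def hierarchy_penalty(emb, meta):
    """Mean within-group minus mean across-group squared distance.

    emb:  (n_labels, dim) label embeddings
    meta: (n_labels,) meta-category id of each label
    Lower values mean same-parent labels sit closer than cross-parent labels.
    """
    emb = np.asarray(emb, dtype=float)
    meta = np.asarray(meta)
    diff = emb[:, None, :] - emb[None, :, :]
    dist2 = np.sum(diff ** 2, axis=-1)          # pairwise squared distances
    same = meta[:, None] == meta[None, :]
    off_diag = ~np.eye(len(meta), dtype=bool)   # exclude each label's self-pair
    within = dist2[same & off_diag]
    across = dist2[~same]
    w = within.mean() if within.size else 0.0
    a = across.mean() if across.size else 0.0
    return float(w - a)


def parent_probability(child_probs):
    """Noisy-OR (assumed): the parent is present if at least one child is."""
    child_probs = np.asarray(child_probs, dtype=float)
    return float(1.0 - np.prod(1.0 - child_probs))
```

Added to the classification loss with a small weight, a penalty of the first kind acts as the regularizer described above, while the second function gives a parent-level prediction that can be trained with its own loss term in the spirit of DeepTag-M.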

In Table 2, we compare the performance of DeepTag, DeepTag-M, the standard multi-task LSTM, and the bidirectional LSTM (BLSTM). On the CSU dataset, DeepTag and DeepTag-M perform slightly better than (or on par with) the baseline models (LSTM and BLSTM). DeepTag achieves higher unweighted precision, recall, and F1 score than the other models, indicating good performance across a wide spectrum of diseases. The importance of leveraging the hierarchy is shown on the PP dataset (Figure 3). Since this dataset is out-of-domain, the disease hierarchy provides much-needed regularization, and both DeepTag and DeepTag-M outperform the baseline models by a substantial margin, with DeepTag the best model overall.

Figure 3: Performance comparison on PP. We compare the per-label F1 score between the baseline LSTM model and the DeepTag model on the PP dataset. The disease categories are sorted from the least frequent to the most frequent in the training dataset, which comes from CSU.

Learning to abstain

Augmenting a tagging system with the ability to abstain (decline to assign codes) can foster human-machine collaboration. When the system does not have enough confidence to make decisions, it has the option to defer to its human counterparts. This aspect is important in DeepTag because after tagging the documents, further analysis from various parties might be conducted on the tagged documents such as investigating the prevalence of certain specific diseases. In order to not mislead further clinical research, having the ability to abstain from making very erroneous predictions and ensuring highly precise tagging is an important feature.

We developed an additional abstention wrapper on top of DeepTag that we call DeepTag-abstain. The module learns to estimate how well the DeepTag system will perform on a document based on the predicted categories DeepTag makes on the document as well as DeepTag’s internal confidence on the predictions.

We compare DeepTag-abstain to an intuitive baseline where the abstention score is computed directly from the confidence associated with the diagnosis code assignments. To evaluate how well DeepTag-abstain performs compared to the baseline, we compute an abstention priority score for each document. A document with a higher abstention priority score is removed earlier than a document with a lower score. We then compute the weighted average of F1 and the exact match ratio (EM) over all the documents that are not removed.

For both the baseline and DeepTag-abstain, we specify the proportion of documents that need to be removed. We vary the dropped proportion from 0 to 0.9 (dropping 90% of the examples at the high end). An abstention method that drops more erroneously tagged documents earlier will see a faster increase in its performance, corresponding to a curve with a steeper slope.
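The evaluation protocol can be sketched as follows, using the exact match ratio as the tracked metric. The curve computation follows the description above; `baseline_priority` is only one possible learning-free score (mean closeness of the label probabilities to 0.5) and is an assumption, since the text does not give the baseline's exact formula.

```python
import numpy as np


def abstention_curve(priority, correct, fractions):
    """Exact-match ratio on the documents kept after dropping each fraction.

    priority: (n_docs,) abstention priority score, higher = dropped earlier
    correct:  (n_docs,) 1 if the predicted label set is an exact match, else 0
    """
    priority = np.asarray(priority, dtype=float)
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-priority, kind="stable")  # highest priority first
    n = len(priority)
    curve = []
    for frac in fractions:
        n_drop = int(np.floor(frac * n))
        kept = order[n_drop:]  # non-empty as long as frac < 1
        curve.append(float(correct[kept].mean()))
    return curve


def baseline_priority(label_probs):
    """Assumed learning-free score: 0 when all probabilities are 0 or 1,
    1 when every label probability equals 0.5."""
    p = np.asarray(label_probs, dtype=float)
    return 1.0 - 2.0 * np.abs(p - 0.5).mean(axis=1)
```

A learned abstention module would replace `baseline_priority` with a model that outputs the priority score; the curve computation stays the same, which is what makes the two approaches directly comparable.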

DeepTag-abstain demonstrates a substantial improvement over the baseline in Figure 4. The baseline here is the natural approach that abstains based on the original DeepTag’s uncertainty at the last layer. DeepTag-abstain is a more powerful approach that learns where to abstain based on the model’s internal representation of the input text. We note that not all learning to abstain schemes are able to out-perform the baseline. The details of module design and improvement curve for the rest of the modules can be seen in Appendix Fig S2.

Figure 4: Comparison of the abstention models. DeepTag-abstain is the abstention priority score estimator that uses confidence scores as input and estimates the instance-level accuracy of a given document. Baseline refers to the abstention scheme where the instance-level abstention priority score is computed from individual label confidence scores without any learning. As a greater proportion of the examples are abstained from, the performance (F1 and exact match, EM) of both methods improves. DeepTag-abstain shows faster improvement, indicating that it learns to abstain in more difficult cases.


In this study, we developed a multi-label classification algorithm for veterinary clinical text, which is an under-resourced clinical domain. In order to improve the performance of DeepTag on diseases with rare occurrences, we investigated loss augmentation strategies that leverage the hierarchical structure of the disease categories. These augmentations provide gains over the LSTM and BLSTM baselines, which are common methods used for these types of prediction tasks. We also experimented with different methodologies to allow the model to learn to abstain on examples where it is not confident in its predictions. We demonstrate that learned abstention rules outperform rules set manually.

Our work demonstrates novel methods for applying broad disease category labels to under-resourced clinical records as well as applying those trained algorithms to an external dataset in order to examine out-of-domain generalization. We have also demonstrated means to allow human domain experts to use their judgement where automated taggers have a high level of uncertainty in order to improve the overall workflow. We confirm that out-of-domain generalization is a significant concern for learned tagging systems to be deployed in real world implementations that may vary substantially from the data on which they were trained. Even though our work attempts to mitigate this problem, there is significant research to be done in optimizing methods for domain adaptation. Our current work is important not only for veterinary medical records, which are rarely coded, but also may have implications for human medical records in under-resourced countries that are important regions of the world for public health surveillance and clinical research.

There are several aspects of the data that may have limited our ability to apply methods from our training set to our external validation set. Private veterinary practices often have data records that closely resemble the PP dataset used to evaluate our methods here. However, the large annotated dataset we used for training is from an academic institution (as these are, largely, the institutions that have dedicated medical coding staff). As can be seen from Table 2, the performance drop due to domain mismatch is non-negligible. The domain shift comes from two sources. First, text style mismatch: private commercial notes use more abbreviations and tend to include many procedural examinations (even though many are non-informative or non-diagnostic). This requires the model to learn beyond keyword or phrase matching. Second, label distribution mismatch: the CSU training dataset focuses largely on neoplasm and several other tumor-related diseases, because the CSU hospital is a regional tertiary referral center for cancer and cancer represents nearly 30% of the caseload. Other practices will have datasets composed of labels that appear with different frequencies, depending on the specializations of that particular practice. A very important path forward is to use learning algorithms that are robust to domain shift, and to experiment with unsupervised representation learning to mitigate the domain shift between academic datasets and private practice datasets.

Currently we are predicting top-level SNOMED-CT disease codes, which are not the SNOMED-CT codes that have been directly annotated on the dataset. A possible extension of the research is to learn to also tag specific important procedures, drug reactions, etc. from the same set of notes. Additionally, many of the SNOMED-CT codes that are applied to clinical records are categorized as 'Findings' rather than actual 'Disorders', as the actual diagnosis of a patient may not be clear at the time the codes are applied. For example, an animal that is evaluated for 'vomiting' without the actual cause being determined may have the code 'vomiting (finding)' (300359004) applied and not 'vomiting (disorder)' (422400008). These 'non-disorder' codes are not evaluated in our current work. However, they are an important subset of codes and represent another means to identify patient cohorts with particular clinical signs or presentations, as opposed to diagnosed disorders.



Colorado State University dataset

The CSU dataset contains discharge summaries as well as applied diagnostic codes for clinical patients from the Colorado State University College of Veterinary Medicine and Biomedical Sciences. This institution is a tertiary referral center with an active and nationally recognized cancer center. Because of this, the CSU dataset over-represents cancer-related diseases. Rare disease categories in the CSU dataset include pregnancy-related, perinatal and mental disorders, but these are also rare in the larger veterinary population as a whole and do not represent a dataset bias. Overall, there are 112,558 unique discharge summaries in the CSU dataset. We split this dataset into training, validation, and test sets with proportions 0.9/0.05/0.05.

Private Practice dataset

An external validation dataset was obtained from a regional private practice. These records did not have diagnostic codes available and only approximately 3% of these records had free text diagnoses applied by the attending clinician. Two veterinary domain experts applied SNOMED-CT disease codes to a subset of these records and achieved consensus on the records used for validation. This dataset (PP) is used for external validation of algorithms developed using the CSU dataset. There are 586 documents in this external validation dataset.

Data processing

Documents in our corpus have been tagged with SNOMED-CT codes that describe the clinical conditions present at the time of the visit being annotated. Annotations are applied from the SNOMED-CT veterinary extension (SNOMEDCT_VET), which is an extension of, and fully compatible with, the International SNOMED-CT edition. It can be accessed in a dedicated browser and is maintained by the Veterinary Terminology Services Laboratory at the Virginia-Maryland Regional College of Veterinary Medicine. Medical coders applying diagnostic codes are either veterinarians or trained medical coders with expertise in the veterinary domain and the SNOMED terminology. We further develop a correspondence between high-level SNOMED-CT disease codes and ICD-9 top-level codes. We explain our process fully in the Supplement and provide the correspondence map.

Difference in data structures

Due to the inherent differences in clinical notes/discharge summaries prepared for patients in an academic setting compared to the shorter ’SOAP’ format notes (Subjective, Objective, Assessment, Plan) prepared in private practice, there is a substantial difference in the format as well as the writing style and level of detail between these two datasets. In addition, the private practice records exhibit significant differences in record styles between clinicians, with some clinicians using standardized forms and others using abbreviated clinical notes containing only references to abnormal clinical findings.

As can be seen in Fig S1, more than 80% of the documents in both datasets are associated with more than one label. In terms of document length, PP documents are much shorter than CSU documents: the average PP document length is 191 words, while the average CSU document length is 325 words.

In order to bridge the gap between two domains, we additionally use a curated veterinarian abbreviation list that maps an abbreviation to its full text. We include this abbreviation list in our supplementary materials.
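A minimal sketch of how such an abbreviation map could be applied before tokenization is shown below. The three entries and the function name `expand_abbreviations` are hypothetical placeholders chosen for illustration; they are not taken from the curated list in the supplementary materials, and the matching rule (case-insensitive, whole alphabetic tokens) is an assumption.

```python
import re

# Hypothetical entries for illustration only; the curated list from the
# supplementary materials is not reproduced here.
ABBREVIATIONS = {
    "hx": "history",
    "dx": "diagnosis",
    "rx": "prescription",
}


def expand_abbreviations(text, mapping=ABBREVIATIONS):
    """Replace whole alphabetic tokens that appear in the mapping (case-insensitive)."""
    def replace(match):
        token = match.group(0)
        # Tokens that are not abbreviations are returned unchanged.
        return mapping.get(token.lower(), token)

    return re.sub(r"[A-Za-z]+", replace, text)


expanded = expand_abbreviations("Hx of vomiting; Dx pending.")
print(expanded)
```

Matching whole tokens rather than substrings avoids rewriting letters inside longer words, which matters for very short abbreviations.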

Algorithm development and analysis

We trained our model on the CSU dataset and evaluated it on a held-out portion of the CSU dataset as well as on the PP dataset. Our base model is a recurrent neural network with long short-term memory cells (LSTMs). We additionally run this recurrent neural network in both the forward and backward directions over the document (bidirectional), which was found beneficial by Graves et al. [graves2005bidirectional]. We augment this baseline model with two losses, a cluster penalty and a meta-label prediction loss, to leverage human expert knowledge about how semantically related these disease labels are.

We tuned the three cluster penalty hyperparameters, and our search range was [1e-1, 1e-5]. We also tuned the meta-label prediction loss hyperparameter in a similar range.


We would like to acknowledge Devin Johnsen for her help in annotating the private practice records used in this work. We also want to thank Selina Dwight and Matthew Wright for helpful feedback. Our work is funded by the Chan-Zuckerberg Investigator Program and National Science Foundation (NSF) Grant CRII 1657155.

Author contributions statement

A.N. performed the natural language processing work, designed and built DeepTag, and performed all of the experiments. A.Z. acquired the data, established the meta-hierarchies, organized and aided in annotation of the private practice data, and provided feedback on the NLP and machine learning outputs. R.P. provided access to the training data. A.P. aided in initial data acquisition and provided feedback on the project focus and strategy. M.R., S.D. and C.B. provided feedback on the project. J.Z. designed and supervised the project. A.N., A.Z. and J.Z. wrote the paper.


Supplementary Materials

Development of SNOMED-CT to ICD9 correspondence

In order to link top-level SNOMED-CT disease codes into related categories for the purposes of learning with a label hierarchy in this work, we have arranged a subset (n=59) of the 95 SNOMED disease codes (direct children of conceptID 64572001) into a hierarchy that mimics that of top-level ICD9 codes. Excluded codes are those that are not directly relevant to a veterinary classification task due to the infrequency of certain disorders (e.g., 278919001: Communication disorders; 242028000: Weightlessness), or code categories that are functionally redundant due to the structure of the hierarchy (i.e., diseases categorized under diabetic complication also map to the relevant body system categories, thus ’diabetic cataracts’ are captured under visual system disorders). This mapping of high-level disease hierarchies allows veterinary diseases to be directly compared to human medical notes, which are more commonly coded in ICD9 or ICD10 than in SNOMED-CT. Currently available SNOMED-CT to ICD9 maps do not include the SNOMEDCT-VET extension and also do not map the higher-level disease codes used in this work. The mapping of top-level SNOMED disease codes to the relevant ICD9 codes is provided as a supplemental table. Mappings were reviewed by experienced veterinary medical coders familiar with the SNOMED-CT hierarchy.

Model description

We formulate the problem of veterinary disease tagging as a multi-label classification problem. Given a veterinary record that contains a detailed description of the diagnosis, we try to infer the subset of diseases that applies to it, drawn from a pre-defined set of diseases. The problem of inferring a subset of labels can be viewed as a series of independent binary prediction problems [sorower2010literature]: for each disease in the pre-defined set, a binary classifier learns to predict whether the corresponding tag is present or not.

Our learning system has two components: a text processing module and a tag prediction module. The text processing module uses long short-term memory networks (LSTMs), which have demonstrated their effectiveness in learning implicit language patterns from text [mikolov2012statistical]. The tag prediction module consists of binary classifiers that are parameterized independently.

A long short-term memory network is a recurrent neural network with an LSTM cell. At each time step it takes one word as input, together with the previous cell state and hidden state. Given a sequence of word embeddings, the recurrent computation of the LSTM network at a time step is described in Eq 1, where σ is the sigmoid function, tanh is the hyperbolic tangent function, and ⊙ denotes the Hadamard product.
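For reference, a commonly used form of the LSTM recurrence that is consistent with this description can be written as below. The gate names (i, f, o), the weight matrices W and U, and the biases b are assumed notation rather than the paper's own, and Eq 1 may differ in detail (for example, in how the weights are grouped).

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

Here x_t is the word embedding at time step t, c_t is the cell state, and h_t is the hidden state passed to the next step.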


An extension of this recurrent neural network with LSTM cells is to introduce bidirectional passes [graves2005bidirectional]. Graves et al. show that introducing bidirectional passes can effectively eliminate problems such as the difficulty of retaining long-term dependencies when the document is very long. We parameterize two LSTM cells with different sets of parameters: one cell is used in the forward pass, where the sequence is passed in sequentially from the beginning, and one cell is used in the backward pass, where the sequence is passed in with reversed ordering. At the end of both passes, the bidirectional LSTM has produced two hidden states for each input, and we stack these two hidden states to form the new hidden state for that input.

After computing hidden states over the entire document, we apply global max pooling over the hidden states, as suggested by Collobert & Weston [collobert2008unified], so that the pooled representation aggregates information from the entire document. Global max pooling applies an element-wise maximum operation over the temporal dimension of the hidden state matrix, so its output has the same dimension as a single hidden state, as described in Eq 2.
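As an illustration of the pooling step, the sketch below takes the element-wise maximum over time for a small list of hidden states. It uses plain Python lists in place of tensors, and the function name `global_max_pool` and the toy values are assumptions made for this example.

```python
def global_max_pool(hidden_states):
    """Element-wise maximum over the temporal dimension.

    hidden_states: list of T vectors (one per token), each of the same length.
    Returns one vector of that length.
    """
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    # zip(*...) regroups the values by hidden dimension instead of by time step.
    return [max(column) for column in zip(*hidden_states)]


# Three time steps, hidden dimension 3 (toy values).
states = [
    [0.1, -0.4, 0.3],
    [0.7, -0.2, 0.0],
    [0.2, -0.9, 0.5],
]
pooled = global_max_pool(states)
print(pooled)
```

The pooled vector has a fixed size regardless of document length, which is what allows the same classifiers to be applied to short and long notes.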


Then we define a binary classifier for each label in our pre-defined set. Each binary classifier takes in the vector that represents the veterinary record and outputs a sufficient statistic for the Bernoulli probability distribution, indicating the probability that the tag should be predicted, for each label in the set:


We use binary cross entropy loss averaged across all labels as the training loss. Given the binary predictions from the model and the correct one-hot label, the binary cross entropy loss is written as follows:
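A minimal sketch of this loss for a single document is given below, using the standard definition of binary cross entropy. The function names, the clamping constant `eps`, and the toy logits are assumptions for illustration; the classifier outputs are assumed to be logits that are passed through a sigmoid.

```python
import math


def sigmoid(z):
    """Map a classifier output (logit) to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross entropy over all labels of one document."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


logits = [2.0, -1.0, 0.0]   # one output per label (toy values)
labels = [1, 0, 1]          # 1 = tag present, 0 = tag absent
probs = [sigmoid(z) for z in logits]
loss = binary_cross_entropy(probs, labels)
print(round(loss, 6))
```

Because each label contributes its own term, a document with several correct tags is penalized separately for each one that the model misses.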


As usual, the decision boundary in our model is 0.5. We can generate a list of predicted labels by applying a decision function to each predicted probability:
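The thresholding step can be sketched as follows. The function name `predict_labels` is hypothetical, and treating a probability of exactly 0.5 as "not predicted" is an assumption, since the text only states that the boundary is 0.5.

```python
def predict_labels(probs, label_names, threshold=0.5):
    """Return the names of the labels whose probability exceeds the threshold."""
    return [name for name, p in zip(label_names, probs) if p > threshold]


# Label names taken from the hierarchy listed in this supplement; probabilities are toy values.
names = [
    "Neoplasm and/or hamartoma (disorder)",
    "Disorder of digestive system (disorder)",
    "Anemia (disorder)",
]
predicted = predict_labels([0.91, 0.12, 0.55], names)
print(predicted)
```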


Leveraging disease hierarchy

We introduce two penalties that are inspired by the implicit groupings of the SNOMED-CT disease codes, which we refer to as meta-labels or clusters. By augmenting our loss with these two penalties, we aim to increase the model’s ability to predict labels that have fewer instances. In the result section, we refer to the model trained with the cluster penalty as “DeepTag”, and the model trained with the meta-label prediction loss as “DeepTag-M”.

Cluster penalty

After defining the meta-labels for the SNOMED-CT disease tags, we can use techniques from the multi-task learning literature. Jacob et al. [jacob2009clustered] hypothesized that if two tasks are similar, the task-specific parameters for these two tasks should be close in vector space, and vice versa.

Following Jacob et al., we first compute the mean of the task-specific weight vectors over all tasks. For each cluster, we define the set of labels that belong to that cluster, and then compute the mean weight vector over the tasks in that set.

The within-cluster closeness constraint can be computed as the distance between the task-specific weight vectors and the mean vector of their cluster. The between-cluster constraint can be computed as the distance between each cluster mean vector and the mean vector of all tasks. We formulate this as an additional loss term, and allow three hyperparameters to control the strength of this penalty.
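The sketch below shows one way to compute such a penalty, following the three-term decomposition of Jacob et al. (norm of the overall mean, between-cluster spread, within-cluster spread). The use of squared Euclidean distances, the weighting of the between-cluster term by cluster size, and the names `g_mean`, `g_between` and `g_within` are assumptions of this example and may not match the paper's equation exactly.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cluster_penalty(weights, clusters, g_mean, g_between, g_within):
    """weights: label -> classifier weight vector; clusters: cluster -> list of labels."""
    overall = mean_vector(list(weights.values()))
    mean_term = sum(x * x for x in overall)  # squared norm of the overall mean
    between = 0.0
    within = 0.0
    for labels in clusters.values():
        center = mean_vector([weights[label] for label in labels])
        # Spread of the cluster mean around the overall mean, weighted by cluster size.
        between += len(labels) * squared_distance(center, overall)
        # Spread of each task's weights around its own cluster mean.
        within += sum(squared_distance(weights[label], center) for label in labels)
    return g_mean * mean_term + g_between * between + g_within * within


# Toy 2-dimensional weights; cluster membership follows the hierarchy in this supplement.
weights = {
    "Anemia": [1.0, 0.0],
    "Disorder of hemostatic system": [3.0, 0.0],
    "Vomiting": [0.0, 4.0],
}
clusters = {
    "Diseases of blood and blood-forming organs": ["Anemia", "Disorder of hemostatic system"],
    "Diseases of the digestive system": ["Vomiting"],
}
penalty = cluster_penalty(weights, clusters, g_mean=1.0, g_between=1.0, g_within=1.0)
print(round(penalty, 4))
```

Raising the within-cluster coefficient pulls the classifiers of related diseases toward each other, which is how frequent labels can lend statistical strength to rare labels in the same cluster.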


Meta-label prediction loss

We propose an additional penalty following the intuition that we want the model to make accurate predictions for the broad category even though mistakes can be made on the fine-grained level. Meta labels are created by examining whether any disease label under this meta label has been marked as tagged. Following the same logic, since the disease labels are predicted independently, we can compute the probability of the presence of a meta label from the probability of disease labels that belong to this meta label.
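Under the independence assumption stated above, one natural way to obtain the meta-label probability is as the probability that at least one of its disease labels is present. The sketch below implements that reading; the function names are hypothetical, and this "at least one" form is an assumption about the omitted equation rather than a quotation of it.

```python
def meta_label_probability(label_probs):
    """Probability that at least one label under the meta label is present,
    assuming the labels are predicted independently."""
    prob_none = 1.0
    for p in label_probs:
        prob_none *= (1.0 - p)
    return 1.0 - prob_none


def meta_label_target(labels):
    """True meta label: 1 if any disease label under it is tagged, else 0."""
    return 1 if any(labels) else 0


meta_prob = meta_label_probability([0.5, 0.2])
meta_true = meta_label_target([0, 1])
print(round(meta_prob, 6), meta_true)
```

With this form, a meta label can receive a high probability even when no single disease label under it crosses the decision boundary, which matches the stated goal of being right at the broad-category level.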


After computing the probability of presence of each meta-label, given the set of meta labels that are created from our true set of labels, we can then compute the binary cross entropy loss between the model’s estimate of the meta-label probabilities and the true meta labels (Eq 8). We use a hyperparameter to adjust the strength of this penalty.


Learning to abstain

In practice, it is often desirable for the model to forfeit the prediction if the prediction is likely to be incorrect. When the method is used in collaboration with human experts, the model can just defer difficult cases to them, fostering human-computer collaboration. However, this is still an under-explored field in machine learning, and previous research has focused largely on binary-class single-label classification [cortes2016learning]. We formally describe the set-up and our learning-based approach in the following sections, and extend relevant discussion to a multi-label setting.

We propose two abstention settings. Each setting computes a score for each document, which we refer to as the abstention priority score. We can then rank the documents using this score. When the user specifies a percentage of documents to be dropped, the documents with the highest scores are dropped first.
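The ranking-and-dropping procedure can be sketched as below. The function name `abstain`, the document identifiers and the scores are hypothetical; rounding the number of dropped documents down is an assumption.

```python
def abstain(doc_ids, scores, drop_fraction):
    """Split documents into (kept, dropped) by abstention priority score.

    Documents with the highest scores are dropped first and would be
    deferred to human experts.
    """
    if not 0.0 <= drop_fraction <= 1.0:
        raise ValueError("drop_fraction must be between 0 and 1")
    n_drop = int(len(doc_ids) * drop_fraction)
    ranked = sorted(zip(doc_ids, scores), key=lambda pair: pair[1], reverse=True)
    dropped = [doc for doc, _ in ranked[:n_drop]]
    kept = [doc for doc, _ in ranked[n_drop:]]
    return kept, dropped


kept, dropped = abstain(["a", "b", "c", "d"], [0.1, 0.9, 0.4, 0.7], drop_fraction=0.5)
print(kept, dropped)
```

Sweeping `drop_fraction` from 0 to 1 and re-evaluating on the kept documents is one way to produce an abstention improvement curve of the kind shown in Figure S2.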

Confidence-based abstention

Since our model already outputs a probability for each label, if our model is well-calibrated, meaning that the output probability satisfies the following constraint in Eq 9, then our probability should reflect how uncertain the model is about the output.


The notion of calibration means that if we collect all instances for which the model gives a probability of p% that its prediction is correct, the model will be correct on p% of those instances. A well-calibrated model’s output probability corresponds to the model’s confidence/certainty on how correct its prediction is. Previous research has shown that binary classifiers with sigmoid scoring function and cross-entropy loss are often well-calibrated [niculescu2005predicting].

Given calibrated probabilities, we want to compute how confident the model is in these predictions. Notably, for each prediction, the model is more confident when the probability is farther away from 0.5. Based on this observation, we can convert the probability into a confidence score with a function:


We can now compute the probability of the model getting labels correct on a single example. We choose all subsets from the entire label set, and for each subset compute the probability that the chosen labels are correct and the labels not chosen are incorrect.


The score is an abstention priority score because it is a valid indication of how confident the model is in its overall output. We refer to this scheme as the confidence-based abstention module (“CB” in Figure S2, “Baseline” in Figure 4).
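The subset computation described in this section can be sketched as follows. Two things here are assumptions: the confidence function is taken to be max(p, 1 - p), which is one simple choice that grows as p moves away from 0.5, and the parameter `k` (the number of labels that are correct) is a hypothetical name. How these probabilities are combined into the final priority score is not shown.

```python
from itertools import combinations


def confidence(p):
    """Assumed confidence function: distance-from-0.5 expressed as max(p, 1 - p)."""
    return max(p, 1.0 - p)


def prob_exactly_k_correct(confidences, k):
    """Probability that exactly k label predictions are correct, assuming
    each prediction is correct independently with its own confidence."""
    n = len(confidences)
    total = 0.0
    for chosen in combinations(range(n), k):
        chosen = set(chosen)
        term = 1.0
        for i, c in enumerate(confidences):
            # Chosen labels are correct; the remaining labels are incorrect.
            term *= c if i in chosen else (1.0 - c)
        total += term
    return total


confs = [confidence(p) for p in [0.9, 0.2]]   # toy probabilities for two labels
p_all = prob_exactly_k_correct(confs, 2)
p_one = prob_exactly_k_correct(confs, 1)
p_none = prob_exactly_k_correct(confs, 0)
print(round(p_all, 4), round(p_one, 4), round(p_none, 4))
```

Enumerating subsets is exponential in the number of labels in the worst case; a dynamic-programming formulation over labels gives the same probabilities at lower cost, but the enumeration mirrors the description above more directly.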

Learning-based abstention

Instead of computing the abstention priority score from a fixed formula, we can try to link it to a value that we care about. For example, we want to drop examples that will induce a high loss, or equivalently, examples where the predicted result gives a low accuracy. However, we do not have access to ground-truth answers in the real world. Instead, we propose that if the data distributions at training and deployment are consistent (which is also the underlying assumption of calibration), then we can learn to estimate the loss or accuracy for each example. We can compute a regression target for the learned abstention module using the training dataset’s accuracy and loss value for each example (Eq 12).


This abstention learning module takes an input and outputs an estimated abstention score. We train this module by minimizing the mean squared error with respect to the regression target:
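A small sketch of the two ingredients, a per-example regression target and the squared-error objective, is given below. Defining per-example accuracy as the fraction of labels predicted correctly is an assumption of this example, as are the function names and the toy numbers.

```python
def per_example_accuracy(predicted, actual):
    """Fraction of labels predicted correctly for one document (assumed target)."""
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def mean_squared_error(estimates, targets):
    """Average squared difference between estimated scores and regression targets."""
    return sum((e - t) ** 2 for e, t in zip(estimates, targets)) / len(targets)


# Regression targets computed on training documents where the true labels are known.
targets = [
    per_example_accuracy([1, 0, 1, 0], [1, 1, 1, 0]),  # 3 of 4 labels correct
    per_example_accuracy([0, 0, 1, 0], [0, 0, 1, 0]),  # all labels correct
]
estimates = [0.8, 0.5]  # toy outputs of the abstention module
mse = mean_squared_error(estimates, targets)
print(targets, round(mse, 5))
```

At deployment only the estimates are available, so the quality of the ranking depends on how well the module learned to predict these targets during training.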


We choose four possible inputs from various parts of the DeepTag model that the DeepTag-abstention module can use to predict accuracy or loss without knowing the ground-truth label. Two choices are obvious: the confidence scores used to compute the confidence-based abstention priority score in the previous section, and the estimated probability of the presence of each label, from which those confidence scores are computed. However, since this probability is obtained by applying a sigmoid function to the output of the classifier, we can also use the prior-to-sigmoid value as input. Lastly, we hypothesize that the representation of the document (the global max pooled hidden states) might also contain relevant information that helps the model determine whether the document is difficult to process.

We fit the abstention module to estimate the regression target on the training set of our data, using the same split as the one used to train the overall model. We then evaluate on a previously unseen test set.

Experimental Details

We initialize our model with 100-dimensional GloVe word vectors [pennington2014glove], and we initialize un-matched words in the CSU training data with vectors sampled from a multivariate normal distribution. We allow all word embeddings to be updated through the training process. We use a recurrent neural network with a 512-dimension LSTM cell, and set the feed-forward dropout rate to 20%. We use a batch size of 32 and clip gradients at 5. We use the Adam [kingma2014adam] optimizer with a learning rate of 0.001.

We trained all models for a maximum of 5 epochs with early stopping; the maximum number of epochs was picked by observing performance on the validation dataset. After picking the best hyperparameters on the validation set, we evaluate each model’s in-domain generalization performance on the CSU test dataset and its out-of-domain generalization performance on the PP dataset.

After hyperparameter searching, we report models with the hyperparameters that perform well on each dataset. We train each model five times and report the averaged result. For the CSU dataset, we find works best for DeepTag-M, and works best for DeepTag (cluster penalty). For the PP dataset, we find works the best for DeepTag-M, and works best for DeepTag (cluster penalty). We report these results in Table 2.

For Table 1, we report DeepTag trained with and we regard this as our best setting.

Abstention Experimental Details

We use a 3-layer neural network with SELU activation [klambauer2017self] to parameterize the abstention model. The learning-to-abstain model is trained on various outputs generated by the DeepTag system after training the bidirectional LSTM with cluster penalty. All configurations of learning-to-abstain models are trained for 3 epochs on the training set, and evaluated on the unseen test set.

Figure S1: Document length and label distribution on CSU and PP dataset. Proportion of records in each dataset with certain length (number of words) or certain number of labels.
Figure S2: Abstention improvement curve. Top-left: learning to reject model with confidence score as input, estimate accuracy or loss. Top-right: learning to reject model with post-sigmoid probabilities score as input, estimate accuracy or loss. Bottom-left: learning to reject model with prior-to-sigmoid logits as input, estimate accuracy or loss. Bottom-right: learning to reject model with global max pooled hidden states as input, estimate accuracy or loss.

SNOMED-ICD9 Correspondence

Here we provide the full list of the hierarchy that we used throughout the paper.

  1. Complications of pregnancy, childbirth, and the puerperium

    1. Disorder of labor / delivery (disorder)

    2. Disorder of pregnancy (disorder)

    3. Disorder of puerperium (disorder)

  2. Diseases of the genitourinary system

    1. Disorder of the genitourinary system (disorder)

  3. Diseases of the musculoskeletal system and connective tissue

    1. Disorder of connective tissue (disorder)

    2. Disorder of musculoskeletal system (disorder)

  4. Diseases of the skin and subcutaneous tissue

    1. Angioedema and/or urticaria (disorder)

    2. Disorder of pigmentation (disorder)

    3. Disorder of integument (disorder)

  5. Certain conditions originating in the perinatal period

    1. Disorder of fetus or newborn (disorder)

  6. Congenital anomalies

    1. Hereditary disease (disorder)

    2. Congenital disease (disorder)

    3. Familial disease (disorder)

  7. Injury and poisoning

    1. Disorder caused by exposure to ionizing radiation (disorder)

    2. Poisoning (disorder)

    3. Traumatic AND/OR non-traumatic injury (disorder)

    4. Self-induced disease (disorder)

  8. Symptoms, signs, and ill-defined conditions

    1. Hyperproteinemia (disorder)

    2. Clinical finding (finding)

  9. Neoplasms

    1. Neoplasm and/or hamartoma (disorder)

    2. Fibromatosis (disorder)

  10. Infectious and parasitic diseases

    1. Disease caused by Arthropod (disorder)

    2. Disease caused by Annelida (disorder)

    3. Infectious disease (disorder)

    4. Disease of presumed infectious origin (disorder)

    5. Disease caused by parasite (disorder)

    6. Enzootic disease (disorder)

    7. Epizootic disease (disorder)

  11. Diseases of blood and blood-forming organs

    1. Anemia (disorder)

    2. Disorder of cellular component of blood (disorder)

    3. Disorder of hematopoietic cell proliferation (disorder)

    4. Disorder of hemostatic system (disorder)

    5. Spontaneous hemorrhage (disorder)

    6. Hyperviscosity syndrome (disorder)

    7. Secondary and recurrent hemorrhage (disorder)

    8. Secondary hemorrhage (disorder)

  12. Endocrine, nutritional and metabolic diseases, and immunity disorders

    1. Autoimmune disease (disorder)

    2. Disorder of immune function (disorder)

    3. Hypersensitivity condition (disorder)

    4. Metabolic disease (disorder)

    5. Nutritional deficiency associated condition (disorder)

    6. Nutritional disorder (disorder)

    7. Obesity (disorder)

    8. Obesity associated disorder (disorder)

    9. Propensity to adverse reactions (disorder)

    10. Disorder of endocrine system (disorder)

  13. Diseases of the nervous system

    1. Feline hyperesthesia syndrome (disorder)

    2. Disorder of nervous system (disorder)

  14. Mental disorders

    1. Mental disorder (disorder)

  15. Diseases of the circulatory system

    1. Disorder of cardiovascular system (disorder)

  16. Diseases of sense organs

    1. Disorder of auditory system (disorder)

    2. Vertiginous syndrome (disorder)

    3. Visual system disorder (disorder)

    4. Sensory disorder (disorder)

  17. Diseases of the digestive system

    1. Vomiting (disorder)

    2. Enterotoxemia (disorder)

    3. Disorder of digestive system (disorder)

  18. Diseases of the respiratory system

    1. Disorder of respiratory system (disorder)
