Dilated Convolutional Attention Network for Medical Code Assignment from Clinical Text


Medical code assignment, which predicts medical codes from clinical texts, is a fundamental task of intelligent medical information systems. The emergence of deep models in natural language processing has boosted the development of automatic assignment methods. However, recent advanced neural architectures with flat convolutions or multi-channel feature concatenation ignore the sequential causal constraint within a text sequence and may not learn meaningful clinical text representations, especially for lengthy clinical notes with long-term sequential dependency. This paper proposes a Dilated Convolutional Attention Network (DCAN), integrating dilated convolutions, residual connections, and label attention, for medical code assignment. It adopts dilated convolutions to capture complex medical patterns with a receptive field which increases exponentially with dilation size. Experiments on a real-world clinical dataset empirically show that our model improves the state of the art.


1 Introduction

Medical code assignment categorizes clinical documents with sets of codes to facilitate hospital management and improve health record searching (Hsia et al., 1988; Farkas and Szarvas, 2008). These clinical texts comprise physiological signals, laboratory tests, and physician notes, where the International Classification of Diseases (ICD) coding system is widely used for annotation. Most hospitals rely on manual coding by human coders to assign standard diagnosis codes to the discharge summaries for billing purposes. However, this work is error-prone (Hsia et al., 1988; Farzandipour et al., 2010). Incorrect coding can cause billing mistakes and mislead other general practitioners when patients are readmitted. Intelligent automated coding systems could act as recommendation systems that help coders allocate correct medical codes to clinical notes.

Automatic medical code assignment has been intensively researched during the past decades (Crammer et al., 2007; Stanfill et al., 2010). Recent advances in natural language processing (NLP) with deep learning techniques have inspired many methods for automatic medical code assignment (Shi et al., 2017; Mullenbach et al., 2018; Li and Yu, 2020). Zhang et al. (2019) incorporated structured knowledge into medical text representations by preserving the translational property of concept embeddings. However, several challenges remain in medical text understanding. Diagnosis notes contain complex diagnosis information, which includes a large professional medical vocabulary and noisy information such as non-standard synonyms and misspellings. Free-text clinical notes are lengthy documents, usually from hundreds to thousands of tokens. Thus, medical text understanding requires effective feature representation learning and complex cognitive processes to enable the assignment of multiple diagnosis codes.

Previous neural methods for medical text encoding generally fall into two categories. The first uses recurrent neural networks (RNNs), which capture sequential dependency and are commonly treated as the default choice for medical text modeling. Such works include AttentiveLSTM (Shi et al., 2017), Bi-GRU (Mullenbach et al., 2018) and HA-GRU (Baumel et al., 2018). The other category uses convolutional neural networks (CNNs) such as CAML (Mullenbach et al., 2018) and MultiResCNN (Li and Yu, 2020). These methods capture only local features, yet they have achieved the best reported predictive performance on medical code assignment.

Inspired by the generic temporal convolutional network (TCN) architecture (Bai et al., 2018), we consider medical text modeling with causal constraints, where the encoding of the current token only depends on previous tokens, using the dilated convolutional network. We combine it with the label attention network for fine-grained information aggregation.

Distinction of Our Model

MultiResCNN is currently the state-of-the-art model. It applies a multi-channel CNN with different filters to learn features and then concatenates these features to produce a final prediction. In contrast, our model extends the TCN for sequence modeling, which uses a single filter and the dilation operation to control the receptive field. In addition, instead of the weight tying used in the TCN, we customize it with label attention pooling to extract rich, relevant features.

Our Contributions

We contribute to the literature in three ways. (1) We consider medical text modeling from the perspective of imposing the sequential causal constraint in medical code assignment using dilated convolutions, which effectively capture long sequential dependencies and learn contextual representations in long clinical notes. (2) We propose a dilated convolutional attention network (DCAN), coupling residual dilated convolutions with a label attention network for more effective and efficient medical text modeling. (3) Experiments on real-world medical data show improvement over the state of the art. Compared with multi-channel CNN and RNN models, our model also has a smaller computational cost.

2 Proposed Model

This section describes the proposed model - Dilated Convolutional Attention Network (DCAN). It includes three main components, i.e., dilated convolution for learning features from word embeddings of clinical notes, residual connection for stacking a deep neural architecture, and label attention module for prioritizing relevant representation for different labels. The architecture of our proposed model is illustrated in Fig. 1.

Our model benefits from the effective integration of these three neural modules. Dilated convolutions are widely used in audio signal modeling (Oord et al., 2016) and semantic segmentation (Yu and Koltun, 2015). Yu and Koltun (2015) proposed dilated convolutions with an exponentially large receptive field. Bai et al. (2018) utilized causal convolutions with dilation and weight tying for sequence modeling. Following their works, we integrate dilated convolutions with a label attention network for better medical text encoding to predict diagnosis codes. The dilated convolution follows the causal constraint of sequence modeling. By stacking dilated convolutions with residual connections (He et al., 2016), our DCAN model can be built as a very deep neural network to learn different levels of features. The final label attention module further extracts the information most relevant to the label space.
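
To make the exponential growth of the receptive field concrete, the following minimal sketch counts how many input tokens a single output position can see in a stack of causal dilated convolutions. It assumes that each block stacks two dilated convolution layers (as Section 2.1 describes) and that the dilation doubles from block to block (1, 2, 4, ...). The function name `receptive_field`, the kernel size of 3, and the block counts are illustrative choices, not settings reported in this paper.

```python
def receptive_field(kernel_size, num_blocks, layers_per_block=2):
    """Number of input positions visible to one output position.

    Assumption: block m uses dilation 2**m, and every causal convolution
    with kernel size k and dilation d widens the view by (k - 1) * d.
    """
    field = 1  # an output position always sees its own input position
    for m in range(num_blocks):
        dilation = 2 ** m
        field += layers_per_block * (kernel_size - 1) * dilation
    return field


# Hypothetical kernel size of 3; the depth is varied for illustration only.
for blocks in (1, 2, 4, 8):
    print(blocks, receptive_field(kernel_size=3, num_blocks=blocks))
```

Under these assumptions, eight blocks already cover about a thousand tokens, which is the same order of magnitude as the note lengths discussed in the introduction.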

Figure 1: Model architecture of dilated convolutional attention network

2.1 Dilated Convolution Layers

A clinical note with $n$ words is denoted as $\{w_1, w_2, \dots, w_n\}$. We use word2vec (Mikolov et al., 2013) to train word embeddings from raw tokens. The word embedding matrix of a clinical note is denoted as $\mathbf{X} = [\mathbf{x}_1, \dots, \mathbf{x}_n]^{\top} \in \mathbb{R}^{n \times d_e}$, where $d_e$ is the dimension of word vectors. The word embeddings are then fed into the dilated convolution layers, which are also called convolutions with dilated filters. Specifically, we apply a 1D convolution operator to each dimension (i.e., channel) of the word vectors. Given a sequence of one-dimensional elements $\mathbf{x} \in \mathbb{R}^{n}$ and a convolutional filter $f: \{0, \dots, k-1\} \rightarrow \mathbb{R}$, the one-dimensional dilated convolution at position $s$ is denoted as

$$F(s) = (\mathbf{x} *_{l} f)(s) = \sum_{i=0}^{k-1} f(i) \cdot x_{s - l \cdot i},$$

where $l$ is the dilation size, i.e., the spacing between kernel elements, $x_{s}$ is the $s$-th element of the input sequence, $k$ is the size of the convolving kernel (i.e., the filter), and $s - l \cdot i$ refers to past time steps. The 1D dilated convolution has $d_c$ output channels, i.e., features are learned for each of the $d_e$ input channels and summed over the input channels. The dilated convolution is followed by weight normalization, an activation function, and a dropout operation. Two dilated convolution layers are stacked into a dilated convolution block. It outputs a hidden representation $\mathbf{H}^{m} \in \mathbb{R}^{n \times d_h}$ of the $m$-th layer, where the dimension $d_h$ of the hidden representation is the number of output channels in the last dilated convolution layer. To expand the receptive field, the dilation size is exponentially increased, i.e., $l = 2^{m}$ for $m = 0, 1, 2, \dots$.
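
The causal dilated convolution described above can be sketched for a single input channel as follows. This is a minimal pure-Python illustration, not the released implementation: the function name `dilated_causal_conv1d` is hypothetical, positions before the start of the sequence are assumed to contribute zero (implicit left padding), and the multi-channel summation, weight normalization, activation, and dropout are left out.

```python
def dilated_causal_conv1d(x, f, dilation):
    """Single-channel causal dilated convolution.

    Output position s combines x[s], x[s - dilation], x[s - 2 * dilation], ...
    weighted by f[0], f[1], f[2], ... and never looks at positions after s.
    Assumption: positions before the start of the sequence count as zero.
    """
    k = len(f)
    out = []
    for s in range(len(x)):
        total = 0.0
        for i in range(k):
            t = s - dilation * i  # index of a current or past time step
            if t >= 0:
                total += f[i] * x[t]
        out.append(total)
    return out


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f = [1.0, 1.0]  # toy filter with kernel size 2
print(dilated_causal_conv1d(x, f, dilation=1))  # [1.0, 3.0, 5.0, 7.0, 9.0, 11.0]
print(dilated_causal_conv1d(x, f, dilation=2))  # [1.0, 2.0, 4.0, 6.0, 8.0, 10.0]
```

With dilation 2 the filter skips every other token, so the same two weights reach further into the past without any extra parameters.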

2.2 Residual Connections

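Residual connections (He et al., 2016) of residual blocks are built upon the dilated convolution layers to create deep neural networks. Given the input encoding vector $\mathbf{x}$, the output of the residual connection is denoted as $\mathbf{o} = \sigma(\mathbf{x} + G(\mathbf{x}))$, where $G$ represents neural layers and $\sigma$ is a non-linear activation function. We use the residual mechanism between two stacked dilated convolution layers, which is formalized as:

$$\mathbf{H}^{m+1} = \sigma\left(\mathbf{H}^{m} + G(\mathbf{H}^{m})\right),$$

where $\mathbf{H}^{m}$ is the hidden representation of the $m$-th layer and $G$ here stands for the two stacked dilated convolution layers.
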
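As a minimal sketch of the residual mechanism, the snippet below adds the input of a block back to the output of its inner layers before applying the activation. The names `relu` and `residual_block` and the two toy layers are illustrative assumptions; in the model the inner layers would be the two dilated convolutions, and ReLU is used here only as one possible non-linear activation.

```python
def relu(values):
    """Element-wise ReLU, used here as an example non-linear activation."""
    return [max(0.0, v) for v in values]


def residual_block(x, layers, activation=relu):
    """Return activation(x + G(x)), where G applies `layers` in order."""
    g = x
    for layer in layers:
        g = layer(g)
    if len(g) != len(x):
        raise ValueError("the residual sum needs matching input and output sizes")
    return activation([a + b for a, b in zip(x, g)])


# Toy stand-ins for the two stacked layers (hypothetical, for illustration).
halve = lambda v: [0.5 * a for a in v]
shift = lambda v: [a - 1.0 for a in v]

print(residual_block([1.0, -4.0, 2.0], [halve, shift]))  # [0.5, 0.0, 2.0]
```

Because the shortcut carries the input through unchanged, a block whose inner layers output zeros simply passes on the activated input, which is what makes deep stacks easier to train.
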
2.3 Label Attention Layer

We apply the label attention layer to prioritize important information in the hidden representation relevant to ICD codes. Specifically, the dot product attention is used to calculate the attention score $\mathbf{A} \in \mathbb{R}^{n \times L}$ as:

$$\mathbf{A} = \mathrm{softmax}(\mathbf{H}^{m} \mathbf{U}),$$

where $\mathbf{H}^{m} \in \mathbb{R}^{n \times d_h}$ (the superscript represents the ordinal of the layer and not the power) is the hidden encoding of the $m$-th layer, $\mathbf{U} \in \mathbb{R}^{d_h \times L}$ is the parameter matrix of the label attention layer (also known as the query), and $L$ is the number of ICD codes. The attention matrix $\mathbf{A}$ captures the importance of each pair of ICD code and hidden word representation. The output of the attention layer is then calculated by multiplying the attention $\mathbf{A}$ with the hidden representation from the residual dilated convolution layers. The attentive representation $\mathbf{V} \in \mathbb{R}^{L \times d_h}$ is formalized as

$$\mathbf{V} = \mathbf{A}^{\top} \mathbf{H}^{m}.$$

With features representing sequential dependency and label awareness, the final representation is further used for medical code classification.
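
The label attention step can be sketched in plain Python as follows. The sketch assumes that the softmax is normalized over the tokens of the note separately for each code, which is a common choice for label-wise attention but is not stated explicitly in the text; the names `softmax` and `label_attention` and the toy matrices are illustrative only.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def label_attention(H, U):
    """H: n x d_h hidden states, U: d_h x L label queries.

    Returns (A_T, V): A_T is L x n with one attention distribution over
    tokens per code (assumed normalization axis), V is L x d_h with one
    attended representation per code.
    """
    n, d_h = len(H), len(H[0])
    num_labels = len(U[0])
    # Dot-product score between every token and every label query.
    scores = [[sum(H[t][c] * U[c][j] for c in range(d_h))
               for j in range(num_labels)] for t in range(n)]
    A_T = [softmax([scores[t][j] for t in range(n)]) for j in range(num_labels)]
    V = [[sum(A_T[j][t] * H[t][c] for t in range(n)) for c in range(d_h)]
         for j in range(num_labels)]
    return A_T, V


H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens, hidden size 2
U = [[2.0, 0.0], [0.0, 2.0]]              # 2 hypothetical codes
A_T, V = label_attention(H, U)
print([round(sum(row), 6) for row in A_T])  # each attention row sums to 1
print(len(V), len(V[0]))                    # one vector per code
```

Each code thus receives its own weighted summary of the note, so tokens that matter for one code need not matter for another.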

2.4 Classification Layer

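The classification layer is a linear fully-connected layer. The $i$-th projected representation is calculated as:

$$\mathbf{y}_i = \mathbf{W} \mathbf{v}_i + \mathbf{b},$$

where $\mathbf{W}$ is the linear weight, $\mathbf{b}$ is the bias, and $\mathbf{y}_i$ and $\mathbf{v}_i$ are the $i$-th rows of $\mathbf{Y}$ and $\mathbf{V}$ for $i = 1, \dots, L$, with $L$ the number of ICD codes. The predicted scores between 0 and 1 are produced by a pooling operation over the linearly projected matrix $\mathbf{Y}$, whose result is passed into the sigmoid activation function $\sigma$, denoted as:

$$\hat{\mathbf{y}} = \sigma\left(\mathrm{Pooling}(\mathbf{Y})\right).$$
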
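A minimal sketch of the classification step is given below. It assumes a weight matrix shared across codes and sum pooling over the projected dimension; the text does not specify the pooling operator, so `pool=sum` is only a placeholder, and the names `sigmoid` and `classify` as well as the toy numbers are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(V, W, b, pool=sum):
    """V: L x d_h label-wise representations, W: d_o x d_h, b: length d_o.

    Each row of V is projected linearly, pooled to one score per code
    (sum pooling is an assumption), and squashed by the sigmoid.
    """
    probabilities = []
    for v in V:
        projected = [sum(w * x for w, x in zip(row, v)) + bias
                     for row, bias in zip(W, b)]
        probabilities.append(sigmoid(pool(projected)))
    return probabilities


V = [[0.0, 0.0], [1.0, 1.0]]   # two hypothetical codes, hidden size 2
W = [[1.0, -1.0], [2.0, 0.0]]
b = [0.0, 0.0]
print(classify(V, W, b))
```

A code is then assigned when its score exceeds a chosen decision threshold; the threshold itself is a separate design choice that this sketch leaves open.
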
2.5 Training

ICD code assignment is a typical multi-label multi-class classification problem. We adopt the binary cross entropy loss denoted as:

$$\mathcal{L} = -\sum_{i=1}^{L} \left[ y_i \log \hat{y}_i + (1 - y_i) \log (1 - \hat{y}_i) \right],$$

where $y_i \in \{0, 1\}$ is the ground-truth label, $\hat{y}_i$ is the sigmoid score for prediction, and $L$ is the number of ICD codes. To mitigate the effect of noisy labels, we apply label smoothing over the ground-truth labels, which penalizes the model for over-confident predictions. The modified targets are calculated as:

$$y_i^{\mathrm{LS}} = (1 - \alpha) \, y_i + \frac{\alpha}{L},$$

where $\alpha$ is the smoothing coefficient. We use the Adam optimizer (Kingma and Ba, 2014) to train the model with backpropagation.
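
The training objective can be illustrated with the short sketch below. It assumes the common form of label smoothing, in which each target is scaled by one minus the smoothing coefficient and a uniform share of the coefficient is added, and it sums the binary cross entropy over codes rather than averaging it; both choices, the function names, and the value 0.1 are assumptions made for illustration.

```python
import math


def smooth_labels(targets, alpha):
    """Common label smoothing: (1 - alpha) * y + alpha / number_of_codes."""
    num_codes = len(targets)
    return [(1.0 - alpha) * y + alpha / num_codes for y in targets]


def binary_cross_entropy(targets, scores, eps=1e-12):
    """Binary cross entropy summed over codes; eps guards against log(0)."""
    loss = 0.0
    for y, p in zip(targets, scores):
        loss -= y * math.log(max(p, eps)) + (1.0 - y) * math.log(max(1.0 - p, eps))
    return loss


targets = [1.0, 0.0, 0.0, 0.0]
scores = [0.9, 0.2, 0.1, 0.05]
soft = smooth_labels(targets, alpha=0.1)  # hypothetical coefficient
print(soft)
print(binary_cross_entropy(targets, scores))
print(binary_cross_entropy(soft, scores))
```

With smoothed targets the loss no longer reaches its minimum at scores of exactly 0 or 1, which is how smoothing discourages over-confident predictions.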

3 Experiments

This section introduces the experimental analysis of real-world clinical datasets. Our proposed models are compared with several recent strong baselines. The code is publicly available at https://agit.ai/jsx/DCAN.

| Model | AUC-ROC (Macro) | AUC-ROC (Micro) | F1 (Macro) | F1 (Micro) | P@5 |
|---|---|---|---|---|---|
| CNN (Kim, 2014) | 87.6 | 90.7 | 57.6 | 62.5 | 62.0 |
| C-MemNN (Prakash et al., 2017) | 83.3 | - | - | - | 42.0 |
| Attentive LSTM (Shi et al., 2017) | - | 90.0 | - | 53.2 | - |
| Bi-GRU (Mullenbach et al., 2018) | 82.8 | 86.8 | 48.4 | 54.9 | 59.1 |
| CAML (Mullenbach et al., 2018) | 87.5 | 90.9 | 53.2 | 61.4 | 60.9 |
| DR-CAML (Mullenbach et al., 2018) | 88.4 | 91.6 | 57.6 | 63.3 | 61.8 |
| LEAM (Wang et al., 2018) | 88.1 | 91.2 | 54.0 | 61.9 | 61.2 |
| MultiResCNN (Li and Yu, 2020) | 89.9 ± 0.4 | 92.8 ± 0.2 | 60.6 ± 1.1 | 67.0 ± 0.3 | 64.1 ± 0.1 |
| DCAN (Ours) | 90.2 ± 0.6 | 93.1 ± 0.1 | 61.5 ± 0.7 | 67.1 ± 0.1 | 64.2 ± 0.2 |

Table 1: Results (in %) on the MIMIC-III dataset with top-50 ICD codes. Values after ± are standard deviations. “-” indicates no results reported in the original paper.

3.1 Dataset and Settings

This paper focuses on textual discharge summaries from a hospital stay. Following Shi et al. (2017) and Mullenbach et al. (2018), we conduct experiments on the subset of MIMIC-III (Johnson et al., 2016) with the 50 most frequent labels. Free-text discharge summaries are extracted, including raw notes, ICD diagnoses, and procedures for patients. Textual notes related to the same admission are concatenated into a single document that is used as input to our model. Each document is labeled with a set of ICD-9 diagnosis and procedure codes, which are the prediction targets. We use the standard train-test partition. The MIMIC-III dataset with top-50 codes contains 8,066 training, 1,573 development, and 1,729 test instances.


We preprocess the textual documents following the preprocessing procedures developed by Mullenbach et al. (2018) and Li and Yu (2020). The NLTK package (see footnote 1) is utilized for tokenization, and all tokens are converted into lowercase. Non-alphabetic tokens such as numbers and punctuation are removed. All documents are truncated at a length of 2,500 tokens. We choose some common settings from prior publications. For example, the word embedding dimension is 100 and the dropout rate is 0.2. The Adam optimizer (Kingma and Ba, 2014) is used to optimize our model parameters. The remaining hyper-parameters are configured via random search.
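
The preprocessing steps described above can be approximated with the standard library alone, as in the sketch below. It is not the original pipeline: a regular-expression tokenizer stands in for the NLTK tokenizer, tokens without any alphabetic character are dropped as an interpretation of removing numbers and punctuation, and the function name `preprocess_note` and the example sentence are hypothetical.

```python
import re

MAX_TOKENS = 2500  # truncation length stated in the text


def preprocess_note(text, max_tokens=MAX_TOKENS):
    """Lowercase, tokenize, drop non-alphabetic tokens, and truncate."""
    # Assumption: runs of word characters are tokens, so punctuation is discarded.
    tokens = re.findall(r"\w+", text.lower())
    # Keep only tokens containing at least one letter (drops pure numbers).
    tokens = [t for t in tokens if any(ch.isalpha() for ch in t)]
    return tokens[:max_tokens]


note = "Pt admitted on 2101-10-20; given 250mg Aspirin, BP 120/80."
print(preprocess_note(note))
# ['pt', 'admitted', 'on', 'given', '250mg', 'aspirin', 'bp']
```

Mixed tokens such as dosages survive this filter because they contain letters, while dates and measurements made only of digits are removed.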

3.2 Baselines

Baseline models include the memory-network-based C-MemNN (Prakash et al., 2017), the joint embedding model LEAM (Wang et al., 2018), RNN-based models such as Attentive LSTM (Shi et al., 2017) and Bi-GRU (Mullenbach et al., 2018), and CNN-based models such as the vanilla CNN (Kim, 2014), CAML (Mullenbach et al., 2018) and MultiResCNN (Li and Yu, 2020).

3.3 Results

We evaluate the F1-score and the area under the receiver operating characteristic curve (AUC-ROC) with both micro and macro averaging, and the precision at $k$ codes with $k = 5$ (P@5). The results are shown in Table 1. Our model outperforms the state of the art on all metrics. To compare with the MultiResCNN model, we follow its setting and run our model three times. We average the predictive scores and calculate their standard deviation. Our model has a clear improvement in the macro F1-score, where macro averaging computes the label-wise average and treats all codes equally. For the other metrics, our model still has a marginal improvement with a lower or comparable standard deviation. We also tried the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model (Devlin et al., 2018) for sequence classification. However, the BERT model does not work well in this task. This observation is also reported by Li and Yu (2020).
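
For readers unfamiliar with these metrics, the sketch below shows one common way to compute micro- and macro-averaged F1 and precision at k for multi-label predictions. The macro score here averages per-code F1 values; evaluation scripts in prior work may define the macro average differently, so this is not necessarily the exact procedure behind Table 1. The function names and the toy labels and scores are illustrative.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from true-positive, false-positive, and false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(y_true, y_pred):
    """y_true, y_pred: lists of 0/1 vectors, one vector per document."""
    num_codes = len(y_true[0])
    counts = []
    for j in range(num_codes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))
    macro = sum(f1_from_counts(*c) for c in counts) / num_codes  # every code weighs the same
    micro = f1_from_counts(*[sum(c[i] for c in counts) for i in range(3)])  # pooled counts
    return micro, macro


def precision_at_k(y_true, scores, k=5):
    """Average fraction of the k highest-scored codes that are correct."""
    total = 0.0
    for truth, row in zip(y_true, scores):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        total += sum(truth[j] for j in top) / k
    return total / len(y_true)


y_true = [[1, 0, 1], [1, 1, 0]]
y_pred = [[1, 0, 0], [1, 1, 1]]
scores = [[0.9, 0.1, 0.8], [0.2, 0.7, 0.6]]
print(micro_macro_f1(y_true, y_pred))       # micro 0.75, macro about 0.667
print(precision_at_k(y_true, scores, k=2))  # 0.75
```

In this toy example the third code is always mispredicted, which lowers the macro score more than the micro score because rare and frequent codes count equally in the macro average.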

| Model | # Parameters | Training time | Training epochs |
|---|---|---|---|
| CAML | 6.2M | 673 s/epoch | 85 |
| MultiResCNN | 11.9M | 1161 s/epoch | 26 |
| DCAN (Ours) | 8.7M | 951 s/epoch | 23 |

Table 2: Computational cost comparison

Computational efficiency

We compare the computational efficiency from two perspectives, i.e., the number of parameters and the number of epochs needed to converge. The results are shown in Table 2. Because it does not rely on concatenated multi-channel features, our model has fewer trainable parameters and takes less training time per epoch than the state-of-the-art MultiResCNN. Moreover, our model converges in fewer epochs.

4 Conclusion

Automatic medical code assignment has been studied extensively in recent years. Neural clinical text encoding models use CNNs to extract local features and RNNs to preserve sequential dependency. This paper combines both by using dilated convolutions. The dilated convolutional attention network (DCAN) consists of dilated convolution layers, residual connections, and a label attention layer. The DCAN model obeys the causal constraint of sequence encoding and learns rich representations to capture label-aware importance. In experiments on the MIMIC-III dataset, our model shows better predictive performance than the state-of-the-art methods.


We thank Academy of Finland (grants no. 286607 and 294015 to PM) and Finnish Center for Artificial Intelligence for support of this research. We acknowledge the computational resources provided by the Aalto Science-IT project. The authors wish to acknowledge CSC - IT Center for Science, Finland, for computational resources.


  1. http://www.nltk.org


  1. Shaojie Bai, J Zico Kolter, and Vladlen Koltun. 2018. An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv preprint arXiv:1803.01271.
  2. Tal Baumel, Jumana Nassour-Kassis, Raphael Cohen, Michael Elhadad, and Noemie Elhadad. 2018. Multi-label classification of patient notes: case study on ICD code assignment. In Workshops at the Thirty-Second AAAI Conference on Artificial Intelligence.
  3. Koby Crammer, Mark Dredze, Kuzman Ganchev, Partha Talukdar, and Steven Carroll. 2007. Automatic code assignment to medical text. In Biological, translational, and clinical language processing, pages 129–136.
  4. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  5. Richárd Farkas and György Szarvas. 2008. Automatic construction of rule-based ICD-9-CM coding systems. In BMC bioinformatics, volume 9(Suppl 3), pages 1–9. Springer.
  6. Mehrdad Farzandipour, Abbas Sheikhtaheri, and Farahnaz Sadoughi. 2010. Effective factors on accuracy of principal diagnosis coding based on international classification of diseases, the 10th revision (ICD-10). International Journal of Information Management, 30(1):78–84.
  7. Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  8. David C Hsia, W Mark Krushat, Ann B Fagan, Jane A Tebbutt, and Richard P Kusserow. 1988. Accuracy of diagnostic coding for medicare patients under the prospective-payment system. New England Journal of Medicine, 318(6):352–355.
  9. Alistair EW Johnson, Tom J Pollard, Lu Shen, Li-wei H Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3:160035.
  10. Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.
  11. Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  12. Fei Li and Hong Yu. 2020. ICD coding from clinical text using multi-filter residual convolutional neural network. In Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence.
  13. Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
  14. James Mullenbach, Sarah Wiegreffe, Jon Duke, Jimeng Sun, and Jacob Eisenstein. 2018. Explainable prediction of medical codes from clinical text. In Proceedings of NAACL-HLT, pages 1101–1111.
  15. Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu. 2016. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499.
  16. Aaditya Prakash, Siyuan Zhao, Sadid A Hasan, Vivek Datla, Kathy Lee, Ashequl Qadir, Joey Liu, and Oladimeji Farri. 2017. Condensed memory networks for clinical diagnostic inferencing. In Thirty-First AAAI Conference on Artificial Intelligence.
  17. Haoran Shi, Pengtao Xie, Zhiting Hu, Ming Zhang, and Eric P Xing. 2017. Towards automated ICD coding using deep learning. arXiv preprint arXiv:1711.04075.
  18. Mary H Stanfill, Margaret Williams, Susan H Fenton, Robert A Jenders, and William R Hersh. 2010. A systematic literature review of automated clinical coding and classification systems. Journal of the American Medical Informatics Association, 17(6):646–651.
  19. Guoyin Wang, Chunyuan Li, Wenlin Wang, Yizhe Zhang, Dinghan Shen, Xinyuan Zhang, Ricardo Henao, and Lawrence Carin. 2018. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2321–2331.
  20. Fisher Yu and Vladlen Koltun. 2015. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122.
  21. Xiao Zhang, Dejing Dou, and Ji Wu. 2019. Learning conceptual-contextual embeddings for medical text. arXiv preprint arXiv:1908.06203.