A Weakly Supervised Adaptive DenseNet for Classifying Thoracic Diseases and Identifying Abnormalities

A Weakly Supervised Adaptive DenseNet for Classifying Thoracic Diseases and Identifying Abnormalities

Bo Zhou
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA, USA
   Yuemeng Li
Department of Bioengineering
University of Pennsylvania
Philadelphia, PA, USA
   Jiangcong Wang
Department of Bioengineering
University of Pennsylvania
Philadelphia, PA, USA

We present a weakly supervised deep learning model for classifying diseases and identifying abnormalities based on medical imaging data. In this work, instead of learning from medical imaging data with region-level annotations, our model was trained on imaging data with image-level labels to classify diseases, and is able to identify abnormal image regions simultaneously. Our model consists of a customized pooling structure and an adaptive DenseNet front-end, which can effectively recognize possible disease features for classification and localization tasks. Our method has been validated on the publicly available ChestX-ray14 dataset. Experimental results have demonstrated that our classification and localization prediction performance achieved significant improvement over the previous models on the ChestX-ray14 dataset. In summary, our network can produce accurate disease classification and localization, which can potentially support clinical decisions. Both authors equally contributed to this work

Figure 1: Examples of the disease localizations generated from our network. The localization output from our network (red bounding box), that trained only with image-level annotation, match with the ground truth localization (green bounding box).

1 Introduction

Large scale annotated visual datasets have boosted performance of deep learning methods on many challenging computer vision problems [1, 2, 3]. Tasks like object detection, classification, tracking, and segmentation have been successfully tackled by techniques built on top of these large-scale dataset with annotations [4, 5, 6]. There are increasing numbers of applications utilizing deep learning methods in medical imaging analysis over the last decade [7]. In clinical procedures, visual evidence such as segmentation or spatial localization of abnormal regions that supports the diagnosis results, is an vital part of clinical diagnosis. This provides a comprehensive interpretation of diagnosis results and potentially decreases the false positive rate.

In this work, we focus on the automatic disease diagnosis and localization in chest radiography released by [8] named ChestX-ray 14, which is one of the largest public chest radiography dataset with image-level disease labels and contains a small subset of region-level disease localizations (bounding boxes). Our goal is to develop a deep learning scheme capable of both classifying the disease and localizing the associated lesion sites. Divergent from standard strongly supervised object detection, our model does not require ground truth localization annotations during training. Firstly, we adopted and modified a pre-trained classification CNN as our front-end feature extractor [9]. The pre-trained front-end encodes the information from a large perceptive field. Then, after passing through a simple bridging structure, the extracted feature from the front-end are fed into a customized two-stage pooling network structure, which produces both classification and associated localization simultaneously.

Both quantitative and qualitative visual evaluations show that our proposed model obtains significant improvement over the previous published state-of-the-art results on disease classification and localization. Visual evaluations indicate a strong alignment and correspondence between the clinical annotations and the predicted disease candidate regions shown in figure 12.

1.1 Contributions

In summary, the contributions of this work are two-folds:

  • We developed a weakly supervised end-to-end learning structure that learns from chest radiography images containing multiple common thoracic diseases by explicitly searching over possible disease features and locations in the image.

  • We performed an extensive experimental analysis of our model’s classification and localization performance on the large-scale ChestX-ray14 datasets. We found that our model (i) outputs more accurate image-level labels than current state-of-the-art; (ii) predicts better lesion localization than previous published methods.

2 Related Work

CAD for chest radiography: The chest radiography is the most ordered and common radiological examination for chest diseases. It is a low-cost, low radiation and fast imaging exam. Recent studies also demonstrated the chest radiography’s application on detection and evaluation of coronary artery diseases, which are usually evaluated using Computed Tomography (CT) with expensive cost and high radiation dose [10, 11, 12]. CAD techniques have been widely applied in chest radiography for task such as automatic diagnosis and patient image retrieval. Previously, Bar et al. adapted the Decaf-Net for 8 thoracic diseases classification on a relatively small chest x-ray dataset [13]. Lajhani et al. proposed the DCNN for tuberculosis classification task. Their model ensembled both Alex-Net and Google-Net and achieved impressive results [14]. Anavi et al. has worked on image retrieval in medicine, specifically for chest radiography given the pathology [15]. All the aforementioned research showed promising results. However, they only conducted their experiments on relatively small dataset, ranging from 10 to 500 images.

ChestX-ray14: Recently, National Institute of Health (NIH) released one of the largest public chest radiography dataset, consisting of 108,948 posterior-anterior view images from 32,797 patients with eight major chest diseases [8]. A small subset of this dataset is provided with hand labeled bounding boxes for evaluation. Lately, NIH further expanded this dataset to 112,120 frontal-view images with 6 additional thoracic diseases, named ChestX-ray14. Several deep learning methods have been addressed the application of CNN on this dataset for thoracic disease classification and localization [16, 17, 18]. Wang et al. [8] applied a pre-trained Res-Net as the backbone to generate heatmap as localization, and subsequently used a global max pooling to obtain classification. However, most of the localizations generated by their network mismatched with the ground-truth bounding boxes. Yao et al. [16] used DenseNet [9] to extract features. To harness the correlation between some of the diseases, they used a LSTM module to repeatedly decode the feature vector from a DenseNet [9] front end and produced one disease prediction at each step. They achieved improved results compared to the baseline in [8]. One of the most recent work from Rajpurkar et al. achieved a good multi-label classification results by fine-tuning a pre-trained DenseNet-121 [9, 18]. Their classification results outperformed the previous state-of-the-art methods [8, 16, 17]. However, there was no quantification localization analysis from their work. Another most recent work from Li et al. [17] used a pre-trained Res-Net to extract features and divided them into patches. They passed the extracted patches through a fully-convolutional classification CNN to obtain a disease probability map. In the mean time, a classification score was acquired by multiplying all probability values together. They achieved significantly better classification and localization result than the baseline [8] for certain diseases.

Figure 2: The architecture of the adaptive densenet-169 is shown above. The network receives input of size , passes it through a convolution and max pool layer, and then dense blocks interleaved with transition blocks, and produces feature maps. (a) illustrates the dense connection in dense block, where output from each conv-bn-relu layer is concatenated to the input before fed to the next layer. (b) shows the average pool layer is taken out in the modified transition block, so that it maintains same spatial resolution as previous dense block’s. (c) illustrates the kernel dilation in dilated dense block. Pre-trained convolution kernels in the last dense block were dilated to keep the receptive field unchanged, after an average pool layer is removed in (b).

Weakly supervised learning: The ChestX-ray14 dataset [8] contains mostly classification labels but few bounding boxes annotations. Therefore, we seek a model that is capable of localizing the diseased regions given only image wise labels. This falls into category of weakly supervise learning (WSL), which often refers as the task of capturing object location through a customized deep learning model that trained with only image-level labels. Oquab et al. [19] proposed a WSL scheme. They used a pre-trained CNN to generate class probability map across spatial locations and applied max-pooling across spatial locations to get a single binary score vector. With their strategy, they were able to get both decent classification accuracy and localization mAP on the benchmark dataset [20, 3]. Recently, Durand et al. proposed a more sophisticated WSL approach [21]. They presented the idea of class pooling based on the assumption that an object inherently can be decomposed into different sub-maps to capture better semantic segmentation in WSL. Their experimental results tested on PASCAL 2007, PASCAL 2012, MS COCO, 15 Scene, MIT67 [22, 20, 3, 23, 24] indicated the advantage of such design and achieve significant improvement over [19] ’s model.

3 Method

In this work, we aim to use chest radiographs as input with only image-wise labels to train a model that generates classification for thoracic disease along with the localization heatmaps. The architecture of our proposed network can be summarized into three parts: 1) using an Adaptive DenseNet to generate feature maps and 2) using a bridging layer to convert the feature maps into class-specific sub-maps 3) using a WSL pooling structure to pool the sub-maps into class-specific heatmaps and single probability score for each disease.

3.1 Adaptive DenseNet

The Adaptive DenseNet inherits the basic structure from DenseNet [9]. The dense block of DenseNet consists of direct connections from preceding layers to all subsequent layers which pass image information in deep model’s training. We maintained the DenseNet structures before the fully connected classification layers as our basic structure. However, the output feature map size after the final dense block is greatly decreased (), losing too much spatial resolution as compared to the input image (), which is not suitable for our localization task. To overcome this, firstly, we took out the average pooling operation at the third transition layer to achieve a fine resolution. Secondly, in order to maintain the same receptive field of each convolution kernel, we dilated all kernels in the fourth dense block. The Adaptive DenseNet outputs a feature map with spatial size of . Details of our Adaptive Densenet is shown in Figure 2. In this work, we chose the DenseNet-169 as our basic structure. Our adaptation can be applied to other DenseNet structures as well, such as DenseNet-121 and Densenet-201. With the Adaptive DenseNet, we were able to acquire a fine resolution feature map.

Figure 3: Bridging layer connects adaptive DenseNet and pooling layers. Given the DenseNet output feature, the bridging layer performs a 1x1 convolution to transform DenseNet output feature into sub-maps. The output channels of sub-maps are , where being number of sub-maps per class and being number of class. The sub-maps fed into the WSL pooling layers.
Figure 4: Illustration of our 2-stage pooling process: class-wise pooling and spatial pooling. During class-wise pooling (green region), multiple class-specific sub-maps are averaged and a single heatmap is generated for each class. Each classes’ heatmap are used for their localization. The heatmap is then passed to spatial pooling. We employed two different spatial pooling schemes during training and testing. During training, single random score from top scores in the heatmap is selected to increase robustness against erroneous top scores. During testing, top scores and bottom scores are weighted averaged with weight on bottom samples to give maximum performance. The final output of spatial pooling is a single class score for each class.

3.2 Bridging layer

To connect the output of the Adaptive DenseNet and the pooling structure described in section 3.3, a bridging layer is proposed and used here. Specifically, the output of the Adaptive DenseNet is transformed by a convolutional layer, which learns the mapping from the Adaptive DenseNet’s output feature maps to class specific sub-maps. Through the bridging layer, the channel size of the feature map is transferred from (channel size of Adaptive DenseNet’s output) to , as shown in Figure 3. This bridging layer connects the Adaptive DenseNet and the WSL pooling structure together to form the overall network.

3.3 WSL pooling

To obtain accurate disease classification along with disease localization heatmap, we integrated a customized weakly supervised learning strategy into our network. Similar to [21], we implemented two types of global pooling layers to summarize all information for each class in feature map outputted from the Adaptive DenseNet, which include: 1) a class-wise pooling within the same class, and 2) a spatial pooling along different classes. The class-wise pooling summarizes submaps into final localization heatmap, and spatial-wise pooling summarizes heatmap into classification. Details of the pooling designs are demonstrated in Figure 4.

class-wise pooling: The class-wise pooling combines the submaps from each of the classes from the bridging layer. In specific, our class-wise pooling combines the maps for all disease classes independently through Equation 1.


where is the number of sub-maps that contains partial information for each disease class. , , are feature map’s height, width, and the number of class, respectively. The input feature with size of is divided into branches with each size of . Then, the sub-maps information are composed into one final feature map for each class through an average pooling. This results in a transformed class heatmap of size .

spatial-wise pooling: Given a class heatmap with size of , we used a spatial-wise pooling to generate our final classification output. Our spatial-wise pooling extracts a classification vector () from the class heatmap outputted from class-wise pooling shown above. The most common strategy for spatial-wise pooling is maxpooling, which might potentially ignore most of the feature information, generate localization heatmap only in small regions, and cause insufficient pass of gradient for training.

Therefore, we proposed to use a customized spatial-wise pooling strategy for both training and testing of our network, as shown in Figure 4. During the training phase, we firstly selected top scores from heatmaps and randomly chose one among the k numbers for each class. Because the class heatmap may not be perfect, the highest score may not correctly correspond to the disease location. Thus, random sampling from top spatial locations instead of simply choosing the max gives the network higher probability to capture the correct spatial location of the disease and generate the correct gradient. During the testing phase, we looked at regions with the highest()/lowest() activation from the heatmap . The highest activation indicates presence of classes while the lowest activation indicates absence of classes. Thus both are incorporated to the prediction layer to achieve a robust classification performance. We introduced a weighting factor to control the relative importance between these two terms.

Formally, let be the heatmap for class , let / be the set of top /bottom scores in , and let be the final class score. During training phase, we have:


where is a single element uniformly random sampled from . During testing phase, we have:


where / are elements in /. and is the weighting factor added on bottom scores.

3.4 Network’s training

For network initialization, we transfered the weights from the pre-trained models on ImageNet for our Adaptive DenseNet. Then, we randomly initialized parameters for bridging and WSL pooling layers. We used learning rate of 0.002 with weight decay of 0.1 for every 10 epochs during training. Our model were implemented with Pytorch (https://pytorch.org/) on a Nvidia GTX 1080Ti. We added batch normalization [25] after each convolution and additional drop out of rate of 0.1 for each DenseBlock inside Adaptive DenseNet [26] to prevent overfitting. Since there is a large data imbalance between each class, we trained our model with weighted binary cross entropy loss:


where represents percent of positive sample and represents percent of negative sample among all dataset.

3.5 Disease classification and localization

The heatmap obtained from our class-wise pooling is further used as a reference map to generate disease bounding boxes. A thresholding followed by connected component analysis are applied for each disease class to generate bounding boxes. Empirically, we used a threshold value of 0.8 for ”Cardiomegaly” and threshold values of 0.9 for the rest of classes to obtain the best results. The disease classification score is acquired from the spatial-wise pooling on the heatmap. In short, our network structure is designed for both disease classification and localization tasks, yet only use chest radiographs along with image-level annotations for our training.

4 Experiments and Results

4.1 Data for experiments

We performed our experiment using the ChestX-ray14 dataset. The diseases in the corresponding chest radiograph are diagnosed by radiologists and the labeled ground truth are obtained through text mining on the patients’ diagnostic reports. We used exactly same published data split as in [18, 8, 16]. The dataset is divided into training (70%), validation (10%) and testing (20%). The original resolution of image is and was downscaled to . In order to fit dataset into ImageNet [1] pre-trained models, we normalized the image by mean and standard deviation of the images from ImageNet. We used random crop of size from the downscaled image as the network input for training and we used center crop of same size for testing. For localization validation, the ChestX-ray14 dataset contains 880 images with 983 disease bounding boxes annotated by board-certified radiologists. We calculated the intersection over union (IoU) between our prediction and the ground truth bounding boxes.

4.2 Disease classification

Pathology Wang [8] Yao [16] Li [17] Rajpurkar* [18] Ours
Atelectasis 0.7003 0.772 0.80 0.8124 0.8123
Cardiomegaly 0.81 0.904 0.87 0.9101 0.9068
Effusion 0.7585 0.859 0.87 0.8763 0.8783
Infiltration 0.6614 0.695 0.70 0.6973 0.7055
Mass 0.6933 0.792 0.83 0.8265 0.8348
Nodule 0.6687 0.717 0.75 0.7689 0.7822
Pneumonia 0.658 0.713 0.67 0.7588 0.7744
Pneumothorax 0.7993 0.841 0.87 0.8625 0.8702
Consolidation 0.7032 0.788 0.80 0.8000 0.8111
Edema 0.8052 0.882 0.88 0.8898 0.8953
Emphysema 0.833 0.829 0.91 0.9047 0.9255
Fibrosis 0.7859 0.767 0.78 0.8208 0.8193
Pleural Thickening 0.6835 0.765 0.79 0.7764 0.7796
Hernia 0.8717 0.914 0.77 0.8969 0.9482
Table 1: AUROC is used to evaluate model classification performance for 14 diseases. The best results are shown in bold text. In order to obtain a comprehensive comparison, performance of all previous networks to our knowledge are included. All results are obtained from their latest update. * means the score is obtained by running their released model on our testing data.

We calculated the Area under Receiver Operating Characteristic curve (AUROC) for each class to evaluate the classification performance of our model. Our results are selected based on the best performance of classifier for each class throughout our training iterations. Our classification performance is compared with other state-of-the-art results [8, 16, 17, 18]. The experiment is tested with the same dataset splitting as demonstrated in [8, 16, 18].

Figure 5: Examples of localization results on 8 diseases along with ground truth bounding box annotations. In each pair, a chest radiograph (left) inputs into our model and the corresponding heatmap (right) is generated. The heatmaps produced from our model match with the ground-truth bounding boxes (green) annotated by radiologists and indicated good IoU results (yellow).

In all 14 diseases, our model achieved better classification AUROC than [8, 16] as shown in Table 1. In comparison to [17], our model achieved better classification AUROC in 13 out of 14 diseases. In particular, our model obtained more than 2% higher AUROC in ”Cardiomegaly”, ”Nodule”, ”Pneumonia”, ”Fibrosis” and ”Hernia”. Comparing our results with [18]’s work, our model outperformed theirs in 11 out of 14 diseases. We observed significant improvements over [18]’s work, especially for ”Nodule”, ”Pneumonia”, ”Consolidation”, ”Emphysema” and ”Hernia”.

As shown in table 1, our network performed better on diseases classification with large lesions than one with smaller lesions. ”Cardiomegaly” and ”Emphysema” are the representative classes for disease with large lesion and our model achieved classification score over 90% on both. Although our network performed less accurate on identifying smaller lesions such as mass and nodule, our model is significantly improved from [8, 16, 17, 18] on these classes.

In addition, we performed hyper-parameter searching for best classification performance (details shown in Supplemental Materials). In class-wise pooling, we tested number of sub-maps () from number 2 to 18 and found 14 sub-maps yielded best result. We then kept this setting for the rest of experiment. In spatial pooling, we tested hyper-parameters on two different strategies separately. During training, we tested our model using top scores, ranging from 1 to 20. We found that at 10 yielded best network. During testing, we tested from 1 to 20 and from 1 to 25. We found that both and at 15 yielded best classification result. We tested ranging from 0.25 to 1 and equals at 1 gave the best importance balance between and .

4.3 Disease localization

T(IoU) Model Atelectasis Cardiomegaly Effusion Infiltration Mass Nodule Pneumonia Pneumothorax
0.1 Wang 0.6888 0.9383 0.6601 0.7073 0.4000 0.1392 0.6333 0.3775
Ours 0.4111 0.9658 0.5948 0.8130 0.5294 0.0759 0.7417 0.3265
Ours* 0.4278 0.9658 0.6928 0.8211 0.5529 0.2152 0.7750 0.3265
0.2 Wang 0.4722 0.6849 0.4509 0.4796 0.2588 0.0506 0.35 0.2346
Ours 0.2389 0.9452 0.3399 0.6748 0.3059 0.0127 0.5750 0.2041
Ours* 0.2556 0.9452 0.5033 0.6829 0.3294 0.1519 0.6333 0.2041
0.3 Wang 0.2444 0.4589 0.3006 0.2764 0.1529 0.0379 0.1666 0.1326
Ours 0.1444 0.9247 0.1438 0.5203 0.1529 0.0000 0.4417 0.1429
Ours* 0.1611 0.9247 0.3333 0.5366 0.1765 0.1392 0.5000 0.1429
0.4 Wang 0.0944 0.2808 0.2026 0.1219 0.0705 0.0126 0.075 0.0714
Ours 0.0889 0.8767 0.0588 0.3577 0.1176 0.0000 0.2583 0.1122
Ours* 0.1056 0.8767 0.2680 0.3902 0.1412 0.1392 0.3250 0.1122
0.5 Wang 0.0500 0.1780 0.1111 0.0650 0.0117 0.0126 0.0333 0.0306
Ours 0.0444 0.7808 0.0131 0.2520 0.0588 0.0000 0.1417 0.0510
Ours* 0.0611 0.7808 0.2288 0.2846 0.0824 0.1392 0.2083 0.0510
0.6 Wang 0.0222 0.0753 0.0457 0.0243 0.0000 0.0126 0.0166 0.0306
Ours 0.0111 0.5205 0.0000 0.1545 0.0235 0.0000 0.0417 0.0408
Ours* 0.0278 0.5205 0.2157 0.1870 0.0471 0.1392 0.1083 0.0408
0.7 Wang 0.0055 0.0273 0.0196 0.0000 0.0000 0.0000 0.0083 0.0204
Ours 0.0056 0.2330 0.0000 0.0732 0.0118 0.0000 0.0167 0.0102
Ours* 0.0222 0.2329 0.2157 0.1057 0.0353 0.1392 0.0833 0.0102
IoU Ours 0.1196 0.5733 0.1463 0.3226 0.1515 0.0146 0.2514 0.1113
Ours* 0.1362 0.5733 0.3346 0.3476 0.1740 0.1601 0.3115 0.1113
Table 2: Comparison of disease localization using IoU, where threshold of IoU are set to 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7 for evaluation, respectively. Best results between Wang et al.’s and ours are shown in bold. Last row shows average IoU for all diseases. * means results obtained from same dataset with expanded annotation from certified radiologist.

We generated our predicted bounding boxes by applying a thresholding on the heatmap obtained from our class-wise pooling layer. Examples are shown in Figure 5. No localization annotation was used during training process. The heatmaps were generated by network with training only from image-level labels. All 880 images with bounding boxes ground truth were used for evaluation.

In Table 2, we compared our results with [8]. Our model achieved significantly higher IoU score over [8] on diseases with large lesion, such as ”Cardiomegaly” (abnormally large heart). The heatmap is well-fitted with the ground truth bounding box. Disease such as ”Infiltration” (substance such as blood infiltrates through vessel into lung) and ”Pneumonia” (inflammation in lung) that affect large area of lungs manifest as visually more prominent patterns, which are learned by our network. Moreover, the localization for small disease region such as ”Atelectasis” (partially collapsed lung) and Mass (an abnormal lump 3cm) can also be well captured by our network. Noted that for ”Effusion” (liquid occupying lung space), our network detects the disease that covers whole lesion while the ground truth includes some extra parts such as the shoulder.

We presents 5 ambiguous cases on 5 diseases in Figure 6, which represents bias in localization annotations that leads to lower IoU score. In the case of ”Effusion”, the annotation outlined the liquid-lung boundary. Instead, our network includes the full liquid rinsed area. In the cases of ”Infiltration” and ”Pneumonia” that spread both lungs, the ground truth annotations only includes single lung, whereas our network captures both lungs. For cases like ”Mass” and ”Nodule”, the ground truth bounding box only highlights one of many instances of ”Mass” and ”Nodule”, but our localization highlights all instances.

Observing annotation ambiguity, we attempted to quantify the effect. We expanded the annotations by having a certified radiologist manually re-labeled bounding boxes on lesion on 80 ambiguous cases. We evaluated our model with the expanded annotations and the updated localization IoU is shown in Table 2. The updated IoU score on expanded annotations is significantly improved for certain classes, such as ”Effusion”, ”Nodule”, ”Atelectasis”, ”Infiltration”, ”Mass”, and ”Pneumonia”.

Figure 6: This shows 5 ambiguous cases on 5 diseases where bounding annotations are bias. The input image is shown on the right, with two bounding boxes labels: green is the provided ground truth and red is our localized bounding boxes. The class-pooling output heatmap is shown on the right. The red bounding boxes are generated by applying a naive thresholding.

5 Discussion

In clinical procedures, visual evidence such as segmentation or spatial localization of disease lesions, in support of disease classification results, is a vital part of clinical diagnosis. It provides a comprehensive insight into the disease and potentially decreases the false positive of diagnosis. In this work, we proposed a weakly supervised adaptive DenseNet architecture. It only trains on image-level annotation, yet is able to provide both disease classification and corresponding visual evidence, making it potentially valuable in clinical setting.

In our experiment, we specifically looked at 14 different thoracic diseases. Our network demonstrated its ability to precisely identify disease patterns which generated accurate disease classification and corresponding heatmaps for disease localization. In our classification experiments, our network outperformed the current state-of-the-art method on 10 out of 14 diseases. Specifically, our network has shown significant classification improvements on diseases with large lesion such as ”Pneumonia” and ”Emphysema”. Our model also gave robust classification results on diseases with small lesions such as ”Nodule”. In our localization experiments, our network achieved significant better localization performance on 5 out of 8 diseases with mean T(IoU)=, as compared to the NIH baseline [8]. The possible reason for our network to achieve higher classification accuracy is two-fold. First of all, the use of a two-stage pooling allows the network to capture complex intra-class variations. Secondly, the removal of average pooling helps to improve spatial resolution and maintain more image details.

We also evaluated the effect of localization label ambiguity. We collaborated with a certified radiologist to expand the ground truth bounding box annotations on 80 cases. We found that our localization has significantly better overlap with the bounding box in the expanded annotation than the one provided in ChestX-ray14 dataset on certain cases. Examples of the ambiguity cases are illustrated in our Supplemental Materials with detailed explanations.

It is worth noting that the annotations in the ChestX-ray14 dataset is generated by text mining instead of manual annotation, which guarantees 90% correctness of labeling. Given that manual annotation is laborious, text mining is a viable way to automatically extract large amount of image-level label from existing clinical text corpus. In this case, the labels can potentially be further improved. The model should be able to achieve enhanced performance if such data is available. Furthermore, a dataset with more disease categories included will be favorable as it is closer to real life clinic settings. Future works include expanding the current ChestX-ray14 dataset with more manual label and more disease categories such as coronary artery disease [10, 11] for evaluation. Applications of our model on different medical imaging dataset that targeting multiple diseases should also be evaluated in the future.

Currently, our localization result is on the level of bounding box. Another future improvement of our network is to produce a fine-grain instance segmentation of object given the probability map (heatmap). Fully-connected CRF (FC-CRF) is a post-processing technique that can produce such segmentation. One downfall of FC-CRF is it requires objects in the image with a clear boundary. In our dataset, chest radiography generates a projective images where objects overlay, so the objects tends to be blurry and FC-CRF is hardly applicable. However, extensive studies have shown FC-CRF’s successful applications on other medical imaging modalities, such as CT and MRI, which capture medical images with clear organ/disease boundary. [27, 28, 29]. Thus, our model can also be applied to other medical image dataset and further produce object segmentation given only image-level annotation for training.

6 Conclusion

In summary, we presented a weakly supervised adaptive DenseNet to classify and localize 14 thoracic diseases by using only image-level annotation during training. Extensive experiments demonstrated the effectiveness of our network which achieved both the best classification and localization results among the previous published state-of-the-art approaches on the ChestX-ray14 dataset.


  • [1] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012.
  • [2] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
  • [3] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
  • [4] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask r-cnn. In Computer Vision (ICCV), 2017 IEEE International Conference on, pages 2980–2988. IEEE, 2017.
  • [5] Ross Girshick. Fast r-cnn. arXiv preprint arXiv:1504.08083, 2015.
  • [6] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [7] Geert Litjens, Thijs Kooi, Babak Ehteshami Bejnordi, Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Mohsen Ghafoorian, Jeroen AWM van der Laak, Bram van Ginneken, and Clara I Sánchez. A survey on deep learning in medical image analysis. Medical image analysis, 42:60–88, 2017.
  • [8] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3462–3471. IEEE, 2017.
  • [9] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
  • [10] Bo Zhou, Di Wen, Katelyn Nye, Robert C Gilkeson, Brendan Eck, David Jordan, and David L Wilson. Detection and quantification of coronary calcium from dual energy chest x-rays: Phantom feasibility study. Medical physics, 44(10):5106–5119, 2017.
  • [11] Bo Zhou, Yi Jiang, Di Wen, Robert C Gilkeson, Jun Hou, and David L Wilson. Visualization of coronary artery calcium in dual energy chest radiography using automatic rib suppression. In Medical Imaging 2018: Image Processing, volume 10574, page 105740E. International Society for Optics and Photonics, 2018.
  • [12] Di Wen, Katelyn Nye, Bo Zhou, Robert C Gilkeson, Amit Gupta, Shiraz Ranim, Spencer Couturier, and David L Wilson. Enhanced coronary calcium visualization and detection from dual energy chest x-rays with sliding organ registration. Computerized Medical Imaging and Graphics, 2018.
  • [13] Yaniv Bar, Idit Diamant, Lior Wolf, Sivan Lieberman, Eli Konen, and Hayit Greenspan. Chest pathology detection using deep learning with non-medical training. In Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on, pages 294–297. IEEE, 2015.
  • [14] Paras Lakhani and Baskaran Sundaram. Deep learning at chest radiography: automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284(2):574–582, 2017.
  • [15] Yaron Anavi, Ilya Kogan, Elad Gelbart, Ofer Geva, and Hayit Greenspan. A comparative study for chest radiograph image retrieval using binary texture and deep learning classification. In Engineering in Medicine and Biology Society (EMBC), 2015 37th Annual International Conference of the IEEE, pages 2940–2943. IEEE, 2015.
  • [16] Li Yao, Eric Poblenz, Dmitry Dagunts, Ben Covington, Devon Bernard, and Kevin Lyman. Learning to diagnose from scratch by exploiting dependencies among labels. arXiv preprint arXiv:1710.10501, 2017.
  • [17] Zhe Li, Chong Wang, Mei Han, Yuan Xue, Wei Wei, Li-Jia Li, and Fei-Fei Li. Thoracic disease identification and localization with limited supervision. arXiv preprint arXiv:1711.06373, 2017.
  • [18] Pranav Rajpurkar, Jeremy Irvin, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis Langlotz, Katie Shpanskaya, et al. Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning. arXiv preprint arXiv:1711.05225, 2017.
  • [19] Maxime Oquab, Léon Bottou, Ivan Laptev, and Josef Sivic. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 685–694, 2015.
  • [20] M Everingham, L Van Gool, CKI Williams, J Winn, and A Zisserman. The pascal visual object classes challenge 2012 (voc2012) results (2012). In URL http://www. pascal-network. org/challenges/VOC/voc2011/workshop/index. html, 2011.
  • [21] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning of deep convnets for image classification, pointwise localization and segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2017), 2017.
  • [22] Mark Everingham, L Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The pascal visual object classes challenge 2007 (voc 2007) results (2007), 2008.
  • [23] Ariadna Quattoni and Antonio Torralba. Recognizing indoor scenes. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 413–420. IEEE, 2009.
  • [24] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer vision and pattern recognition, 2006 IEEE computer society conference on, volume 2, pages 2169–2178. IEEE, 2006.
  • [25] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
  • [26] Vincent Tinto. Dropout from higher education: A theoretical synthesis of recent research. Review of educational research, 45(1):89–125, 1975.
  • [27] Konstantinos Kamnitsas, Christian Ledig, Virginia FJ Newcombe, Joanna P Simpson, Andrew D Kane, David K Menon, Daniel Rueckert, and Ben Glocker. Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis, 36:61–78, 2017.
  • [28] Xuanang Xu, Fugen Zhou, and Bo Liu. Automatic bladder segmentation from ct images using deep cnn and 3d fully connected crf-rnn. International journal of computer assisted radiology and surgery, pages 1–9, 2018.
  • [29] Jeovane Honório Alves, Pedro Martins Moreira Neto, and Lucas Ferrari de Oliveira. Extracting lungs from ct images using fully convolutional networks. arXiv preprint arXiv:1804.10704, 2018.
  • [30] Fisher Yu, Vladlen Koltun, and Thomas Funkhouser. Dilated residual networks. In Computer Vision and Pattern Recognition, volume 1, 2017.

Supplemental Materials

Appendix A Analysis of Model Parameters

Figure 7: This shows classification AUC for different number of sub-maps (). The red mark indicates result of best yielded by comparing overall classification performance among all 14 diseases, where =10.
Figure 8: This shows classification AUC for different number of top . The red mark indicates result of best yielded by comparing overall classification performance among all 14 diseases, where top =10.
Figure 9: This shows classification AUC for different number of kmaxs. The red mark indicates result of best kmaxs yielded by comparing overall classification performance among all 14 diseases, where kmax =15.
Figure 10: This shows classification AUC for different number of kmins. The red mark indicates result of best kmins yielded by comparing overall classification performance among all 14 diseases, where kmin=15.
Figure 11: This shows classification AUC for different number of . The red mark indicates result of best yielded by comparing overall classification performance among all 14 diseases, where =1.

Appendix B Detailed Model Discussion

Figure 12 shows cases where our localization results are preferred over the given ground truth by a certified radiologist. In example ”Mass”, ”Nodule”, ”Pneumonia”, our localization covered more instances of lesions than the ground truth. In example ”Effusion”, ”Infiltration”, our model defined more precise bounding boxes than the ground truth. In example ”Atelectasis”, our model captured more likely lesion region than the ground truth.

Figure 12: Localization results for 6 thoracic disease. Green boxes are provided ground truth annotations which incorrectly locates diseases and red boxes are generated by our model that accurately identifies the disease region. In these cases our bounding boxes are favored over the annotations by a certified radiologist, based on accuracy of localization and thoroughness of covering all lesions.
Figure 13: Example cases of atelectasis and pneumonia where our network fails to capture the medical complexity and predicts wrong bounding box.

Localization of ”Atelectasis” and ”Pneumothorax” is often complicated by associated medical complexity. Figure 13 shows two example cases. ”Atelectasis” (partially collapsed lung) is a disease of heterogeneous appearances, whose appearances is affected by locale, extent and cause of the pathology. Two common sub-types of atelectasis is plate-like atelectasis and lobar atelectasis. Plate-like atelectasis manifests as a visible white dense strip on lower lung, while atelectasis on lower lobe can lead to shrink up of the lobe and disrupts the right interlobar artery. In this case our network mis-classifies a disrupted right interlobar artery as lobar atelectasis and misses the plate-like atelectasis. ”Pneumothorax” (air enter chest cavity) is an acute symptom which must be promptly treated in clinical setting. The treatment vacuums the excessive air in chest cavity by inserting a tube. Therefore pneumothroax X-ray image tends to manifests as a restored lung with the vacuum tube presenting. This misleads the network to associate the tube, instead of the actual pathology, to pneumothorax label.

Appendix C Artifacts from Dilated Kernel

Despite the performance of our network, there is a design that potentially can further improve the network’s performance. In specific, we removed average pooling in the last transition block and introduced dilated convolution in sub-sequent dense block. This doubles the size of output feature map and provides better localization. Empirically this also boosts up classification accuracy. However, the dilated convolution introduces griding artifact in the heatmap. An example is shown in Figure 14. This is caused by high frequency components in the convolution inputs, as explained in [30]. We suspect the high frequency components in inputs comes from dense connection from shallow layers, where features are still raw and noisy. One of our future improvement is to remove the griding artifact while still making use of pre-trained convolution weights.

Figure 14: Example heatmap showing griding artifact. The dash line helps visualizing the artifact.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description