Abnormality Detection and Localization in Chest X-Rays using Deep Convolutional Neural Networks
Chest X-Rays (CXRs) are widely used for diagnosing abnormalities in the heart and lung area. Automatically detecting these abnormalities with high accuracy could greatly enhance real-world diagnosis processes. The lack of standard publicly available datasets and benchmark studies, however, makes it difficult to compare and establish the best detection methods. To overcome these difficulties, we have used the publicly available Indiana chest X-Ray dataset, JSRT dataset and Shenzhen dataset and studied the performance of known deep convolutional network (DCN) architectures on different abnormalities. We employed heat maps obtained from occlusion sensitivity as a measure of localization in the CXRs. We find that the same DCN architecture doesn’t perform well across all abnormalities. Shallow features, i.e. features from earlier layers, consistently provide higher detection accuracy than deep features. We have also found that ensemble models improve classification significantly compared to a single model. Combining these insights, we report the highest accuracy on chest X-Ray abnormality detection on this dataset. We find that in the cardiomegaly classification task, where comparison could be made, the deep learning method improves the accuracy by a staggering 17 percentage points compared to rule based methods. We applied the techniques developed along the way to the problem of tuberculosis detection on a different dataset and achieved the highest accuracy on that task. Our localization experiments using these trained classifiers show that for spatially spread out abnormalities like cardiomegaly and pulmonary edema, the network can localize the abnormalities successfully most of the time. One remarkable result of the cardiomegaly localization is that the heart and its surrounding region are most responsible for cardiomegaly detection, in contrast to the rule based models where the ratio of heart and lung area is used as the measure.
We believe that through deep learning based classification and localization, we will discover many more interesting features in medical image diagnosis that are not considered traditionally.
Medical X-rays are one of the first choices for diagnosis due to their “ability of revealing some unsuspected pathologic alterations, its non-invasive characteristics, radiation dose and economic considerations”. X-Rays are mostly used as a preliminary diagnosis tool. There are many benefits of developing computer aided detection (CAD) tools for X-Ray analysis. First of all, CAD tools help the radiologist to make a quantitative and well informed decision. As the data volume increases, it will become increasingly difficult for radiologists to go through all the X-Rays that are taken while maintaining the same level of efficiency. Automation and augmentation are sorely needed to help radiologists maintain the quality of diagnosis.
Over the past decade, a number of research groups have focused on developing CAD tools to extract useful information from X-Rays. Historically, these CAD tools depended on rule based methods to extract useful features and draw inferences from them. The features are often useful for the doctor to gain quantitative insight about an X-Ray, while inference helps them connect those abnormal features to a certain disease diagnosis. However, the accuracy of these CAD tools has not reached a level high enough for them to work as independent inference tools. Thus CAD tools in X-Ray analysis are mostly left to provide easy visualization functionality.
In recent years, deep learning has achieved superhuman performance on a number of image classification tasks. This success in recognizing objects in natural images has spurred a renewed interest in applying deep learning to medical images as well. A number of reports have recently emerged in which superhuman accuracies were indeed obtained on a number of abnormality detection tasks. This success in classifying abnormalities in images has not translated to other radiological modalities, mainly because of the absence of large standard datasets. The creation of high quality datasets that are orders of magnitude larger will certainly drive the field forward.
In this work, we report DCN based classification and localization on the publicly available datasets for chest X-Rays. Our contributions are the following:
We show a 17 percentage point improvement in accuracy over rule based methods for cardiomegaly detection using an ensemble of deep convolutional networks.
Averaging over multiple random train/test data splits gives robust accuracy results when the number of training examples is low.
Shallow features, i.e. features from earlier layers, give better classification accuracy than deep features.
An ensemble of DCN models performs better than single models. However, a mix of rule based and DCN ensemble models degrades accuracy.
Sensitivity based localization provides correct localization for spatially spread out diseases.
We report results on 20 different abnormalities, which we believe will serve as a benchmark for other studies to be compared against.
Direct application of the methods developed in the paper to the Shenzhen dataset achieves the highest accuracy for tuberculosis detection.
The paper is organized as follows. In Section 2 we give an overview of the related work. In Section 3, we describe the datasets, analysis method, evaluation figures of merit and the localization method used. Then in Section 4.1, we present our results on single and ensemble models and critique the various issues discussed above. In Section 4.2, we describe the localization results and discuss their performance. Cardiomegaly detection and tuberculosis detection are discussed in detail, along with comparisons to prior methods, in Section 4.3 and Section 4.4. Finally, we conclude by summarizing our results in Section 5. The conclusions in the paper are derived by analyzing two representative abnormalities, i.e. cardiomegaly and pulmonary edema. The classification results for other abnormalities are given in the supplementary materials.
Local binary pattern (LBP) features have been employed on segmented images to classify normal vs. pathological CXRs for early detection purposes. The dataset used in that study was private and contained 48 images in total. In another study, an image registration technique was used to localize the heart and lung regions, and radiographic indices such as the cardiothoracic ratio (CTR) and cardiothoracic area ratio (CTAR) were then computed to classify cardiomegaly from the X-ray images. Lung segmentation has been performed using 247 images from JSRT, 138 images from Montgomery and 397 images from the India dataset, with segmentation accuracies of 95.4%, 94.1%, and 91.7% respectively. Jaeger et al. segmented lungs using the graph cut method, used large feature sets from the domains of both object detection and content based image retrieval for early screening of tuberculosis (TB), and made the databases public. Additionally, a few other works on TB screening have been conducted using the public datasets, and using additional data along with the public datasets. They achieved near human performance in detecting TB. Gabor filter features have been extracted from histogram equalized CXRs in order to detect pulmonary edema using 40 pulmonary edema and 40 normal images, achieving 97% accuracy. That dataset is private, hence the accuracy cannot be compared. In an attempt to identify multiple pathologies in a single CXR, a bag of visual words was constructed from local features and fed to a probabilistic latent semantic analysis (PLSA) pipeline. The authors used the ImageClef dataset and clustered the various types of X-Rays present in the dataset. However, they didn’t detect any abnormality in that paper. With a view to classifying abnormalities in CXRs, a cascade of a convolutional neural network (CNN) and a recurrent neural network (RNN) has been employed on the Indiana dataset chest X-Rays. However, no test accuracy was given, nor was any comparison with previous results discussed. Hence it was impossible to determine the robustness of the results. The use of a pre-trained Decaf model in a binary classifier scheme for normal vs. pathology, cardiomegaly, mediastinum and right pleural effusion has also been attempted. That work was reported on a private dataset, and hence no comparison can be made.
2.1 Deep Learning on Medical Image Analysis
A detailed survey of deep learning in medical image analysis is available in the literature. Localization of cancer cells has been demonstrated. Using an Inception network, human level diabetic retinopathy detection has been shown. Using a multiclass approach, an Inception network has also been used to obtain human level skin cancer detection.
The three publicly available datasets for our studies in this paper are:
Indiana Dataset: The set consists of 7284 CXRs, both frontal and lateral images, with disease annotations such as cardiomegaly, pulmonary edema, opacity or pleural effusion. The Indiana set was collected from various hospitals affiliated with the Indiana University School of Medicine. The set is publicly available through Open-i SM, which is a multimodal (image + text) biomedical literature search engine developed by the U.S. National Library of Medicine. A typical example of a normal CXR (left) and a CXR with cardiomegaly (right) is shown in Figure 2. Visually, it can be observed that the heart in the cardiomegaly example is considerably larger than that of the normal CXR.
JSRT Dataset: The set was compiled by the Japanese Society of Radiological Technology (JSRT). It contains 247 chest X-rays, among which 154 have lung nodules (100 malignant cases, 54 benign cases), and 93 have no nodules. All X-ray images have a size of pixels and a gray-scale color depth of 12 bit. The pixel spacing in the vertical and horizontal directions is 0.175 mm. The JSRT set is publicly available and has gold standard masks for performance evaluation.
Shenzhen Dataset: This set was compiled at Shenzhen No.3 People’s Hospital, Guangdong Medical College, Shenzhen, China. The recorded frontal CXRs are classified into two categories: normal and tuberculosis (TB). In a one month period, 326 normal cases and 336 cases with tuberculosis were recorded from the outpatient clinics, comprising a total of 662 CXRs in the dataset. The clinical reading of each of the CXRs is also provided.
3.2 Deep Convolution Network Models
As described in Section 2.1, deep convolutional networks (DCNs) have achieved significantly higher accuracy than previous methods in disease detection in various diagnostic modalities. In many cases, these accuracies have surpassed human detection capabilities. Here, we explore the performance of various DCNs for heart disease detection on chest X-Rays. We use binary classification of Cardiomegaly and Pulmonary Atelectasis against normal chest X-Rays as representative examples. Results for other diseases are given in the supplementary materials. We explored several DCN models, e.g., AlexNet, VGG-Net and ResNet. These models vary in the number of convolution layers used and achieve higher classification accuracy as the number of convolution layers is increased. Specifically, ResNet and its variants have achieved superhuman performance on the celebrated ImageNet dataset. In the experiments we extracted features from one of the layers of the DCN. We froze all the layers up to this layer and added a binary classifier layer to detect the abnormality. The second fully connected layer was selected for feature extraction in the AlexNet, VGG-16 and VGG-19 networks. The features from the ResNet-50, ResNet-101 and ResNet-152 are extracted from the res4b35 layers respectively. All the DCN models were implemented in TensorFlow and fine-tuned using the Adam optimizer with learning rate . The weights of the AlexNet and VGG networks were obtained from the respective project pages, while the weights of the ResNet models were obtained from the MatConvNet Pre-train Library.
The quality of detection was evaluated in terms of four measures: accuracy, area under the receiver operating characteristic (ROC) curve (AUC), sensitivity and specificity. The accuracy is the ratio of the number of correctly classified samples to the total number of samples. Unless otherwise stated, classifier threshold is set to in the reported values of accuracy, sensitivity and specificity. The ROC curve is the graphical plot of true positive rate (TPR) vs. false positive rate (FPR) of a binary classifier when classifier threshold is varied from to . The number of pathological samples that are correctly identified as pathological by the classifier is called true positive (TP). The number of pathological samples that are incorrectly classified as normal by the classifier is called false negative (FN). The number of normal samples that are correctly classified as normal is called true negative (TN), and in a similar fashion, the number of normal samples that are incorrectly identified as pathological is called false positive (FP). True positive rate (TPR) is the proportion of pathological samples that are correctly identified as pathological, given as
$$\mathrm{TPR} = \frac{TP}{TP + FN}.$$
TPR is also called sensitivity, because it shows the degree to which the classifier does not miss a pathological sample. False positive rate (FPR) is the proportion of normal samples that are incorrectly identified as pathological samples, given as
$$\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \frac{TN}{TN + FP},$$
where $TN / (TN + FP)$ is the specificity.
The specificity measure shows the degree to which the classifier correctly identifies normal samples as normal. The objective of a classifier is to attain high sensitivity as well as high specificity, so that the classifier attains a low diagnosis error.
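The four measures defined above follow directly from the confusion counts. The following is a minimal sketch, assuming the classifier outputs a pathology probability per image and that a score at or above the threshold counts as pathological. The function names and the threshold of 0.5 in the example are illustrative assumptions, not values taken from this paper.

```python
def confusion_counts(labels, scores, threshold):
    """Count TP, FN, TN, FP for binary labels (1 = pathological, 0 = normal)."""
    tp = fn = tn = fp = 0
    for label, score in zip(labels, scores):
        predicted = 1 if score >= threshold else 0
        if label == 1 and predicted == 1:
            tp += 1
        elif label == 1:
            fn += 1
        elif predicted == 0:
            tn += 1
        else:
            fp += 1
    return tp, fn, tn, fp


def classification_metrics(labels, scores, threshold):
    """Accuracy, sensitivity (TPR), specificity and FPR at one threshold.

    Assumes at least one pathological and one normal sample are present.
    """
    tp, fn, tn, fp = confusion_counts(labels, scores, threshold)
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "sensitivity": tp / (tp + fn),   # TPR = TP / (TP + FN)
        "specificity": tn / (tn + fp),   # TN / (TN + FP)
        "fpr": fp / (fp + tn),           # FPR = 1 - specificity
    }


if __name__ == "__main__":
    # Hypothetical scores for three pathological and three normal images.
    example_labels = [1, 1, 1, 0, 0, 0]
    example_scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.7]
    print(classification_metrics(example_labels, example_scores, threshold=0.5))
```

Raising the threshold trades sensitivity for specificity, which is the trade-off the ROC curve traces out.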
The sensitivity of the softmax score to occlusion of a certain region in the chest X-Ray was used to find which region in the image is responsible for the classification decision. We followed the occlusion sensitivity approach to localization. In this experiment, a square patch of the CXR is occluded and it is observed whether the classifier can still detect the pathology in the presence of the occlusion. If the region corresponding to the pathology is occluded, the classifier should no longer detect the pathology with high probability, and this drop in probability indicates that the pathology is located at the position of the occlusion. The occluded region is slid across the whole CXR, and thus a probability map of the pathology for the CXR is obtained. The regions where the probabilities fall below a certain threshold indicate that the pathology is likely to occupy those regions. Thus, the pathology in the CXR can be localized.
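The sliding occlusion procedure described above can be sketched in a few lines. This is an illustrative sketch only: `predict_proba` stands in for a trained classifier returning the pathology probability, and the patch size, stride, fill value and threshold are hypothetical parameters rather than the settings used in the experiments.

```python
def occlusion_map(image, predict_proba, patch, stride, fill=0.0):
    """Probability of pathology for each position of a square occluding patch.

    image: list of rows of pixel intensities.
    predict_proba: callable mapping an image to a pathology probability
                   (a stand-in for the trained classifier).
    """
    height, width = len(image), len(image[0])
    heat = []
    for top in range(0, height - patch + 1, stride):
        row = []
        for left in range(0, width - patch + 1, stride):
            occluded = [list(r) for r in image]          # copy the image
            for y in range(top, top + patch):
                for x in range(left, left + patch):
                    occluded[y][x] = fill                # hide the patch
            row.append(predict_proba(occluded))
        heat.append(row)
    return heat


def localize(heat, threshold):
    """Patch positions whose occlusion drops the probability below threshold."""
    return [(i, j) for i, row in enumerate(heat)
            for j, p in enumerate(row) if p < threshold]


def toy_predict(img):
    """Toy classifier: probability grows with total brightness (assumption)."""
    return min(1.0, sum(map(sum, img)) / 4.0)


if __name__ == "__main__":
    # 4x4 toy image with a bright 2x2 "abnormality" in the lower-left corner.
    toy_image = [[0.0, 0.0, 0.0, 0.0],
                 [0.0, 0.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0, 0.0],
                 [1.0, 1.0, 0.0, 0.0]]
    heat = occlusion_map(toy_image, toy_predict, patch=2, stride=2)
    print(localize(heat, threshold=0.5))
```

A stride smaller than the patch size produces overlapping occlusions and a finer map, at the cost of one classifier evaluation per patch position.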
The overall classification and localization schemes are visualized in Figure 3. In summary, the classification scheme (top) is an ensemble of different types of DCNs, and the localization (bottom) is obtained from the overlapping occlusions.
Classification using single models
Our first experiment uses single models, with DCNs fine-tuned from models trained on ImageNet. Detection of cardiomegaly is done only for the frontal CXR images from the Indiana Dataset, which contains 332 frontal CXRs with cardiomegaly. In order to balance the binary classification, 332 normal frontal CXRs were selected randomly from the database. Of these images, 282 of each class were selected for training and 50 of each class for testing.
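The balanced sampling and train/test partition described above can be sketched as follows. The image identifiers, the size of the normal pool and the seed are placeholders for illustration; only the counts 332, 282 and 50 come from the text.

```python
import random


def balanced_split(pathology_ids, normal_ids, n_train_per_class, seed):
    """Undersample normals to match the pathology count, then split per class.

    Returns (train, test) as lists of (image_id, label), label 1 = pathology.
    """
    rng = random.Random(seed)
    normals = rng.sample(normal_ids, len(pathology_ids))  # balance the classes
    pathology = list(pathology_ids)
    rng.shuffle(pathology)
    k = n_train_per_class
    train = [(i, 1) for i in pathology[:k]] + [(i, 0) for i in normals[:k]]
    test = [(i, 1) for i in pathology[k:]] + [(i, 0) for i in normals[k:]]
    return train, test


if __name__ == "__main__":
    # Hypothetical identifiers: 332 cardiomegaly images and a larger normal pool.
    cardiomegaly = [f"c{i}" for i in range(332)]
    normal_pool = [f"n{i}" for i in range(1000)]
    train, test = balanced_split(cardiomegaly, normal_pool, 282, seed=0)
    print(len(train), len(test))
```

Fixing the seed makes a given split reproducible, and changing it yields a different random split of the same sizes.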
In addition to training the DCNs, we also computed rule based features for cardiomegaly detection. Overall, we ran experiments with the following characteristics: (1) the NNs are fine-tuned on the Indiana dataset, (2) the NNs are fine-tuned using the dropout technique, (3) NN features are fused with rule based features, and (4) NN features are fused with rule based features and trained using the dropout technique. The results are summarized in tables ?- ?.
In table ?, the results obtained by fine-tuning the DCNs are shown. We find that deeper models like VGG-19 and ResNet improve the classification accuracy significantly. For example, the accuracy of cardiomegaly detection is 6 percentage points higher with VGG-16 and ResNet-101 than with AlexNet. In order to understand the robustness of these results, we further calculate the sensitivity, the specificity and the sensitivity vs. 1-specificity curve, and derive the area under curve (AUC) metric for classification using the different networks. We find that although ResNet-101 gives the highest specificity and VGG-16 gives the highest sensitivity, VGG-19 gives better overall performance with the highest AUC of . The AUC obtained using VGG-19 is at least one percentage point higher than that of the other networks considered here.
Adding dropout improves the classification accuracy of the shallower networks but degrades the performance of deep models. We find that VGG-16 and AlexNet achieve the highest accuracy and AUC respectively when dropout is used as shown in table ?. On the contrary, the accuracy of deeper models like ResNet-101 and VGG-19 drops by about 4 percentage points.
For all these experiments, we found that taking features from earlier layers rather than later layers improves accuracy by 2 to 4 percentage points. Shallow DCN features are often useful for detecting small objects in images. Our findings are similar for chest X-Ray abnormality classification. As an example, we show the performance obtained by taking features from different layers of the ResNet-152 model. The candidate layers are chosen from the 4th, 5th and final stages of the network based on the type of operation they perform. The chosen layers and their corresponding operations are listed in Table ?. The notation of the layers is based on the pre-trained model obtained from the MatConvNet Pre-train Library. We trained five models to detect cardiomegaly using features from each of the layers; the average performance of these features in terms of accuracy, AUC, sensitivity, and specificity is shown in Figure 4. It can be observed that the performance of the final pooling layer (pool5) is degraded compared to the other layers in terms of accuracy, sensitivity and specificity. In particular, features from the residual connections (res5c) and ReLU (res5cx) are considerably better, with features from res4b35 providing the highest accuracy. Similar observations are made for the other ResNet variants, VGG nets and AlexNet.
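In addition to the DCN features, we experimented with fusing DCN and rule based features for single model classification. The rule based features used in the study are the 1D cardio-thoracic ratio (1D-CTR), the 2D cardio-thoracic ratio (2D-CTR) and the cardio-thoracic area ratio (CTAR).
1D-CTR is the ratio between the maximum transverse cardiac diameter and the maximum thoracic diameter measured between the inner margins of the ribs, which is formulated as
$$\text{1D-CTR} = \frac{D_{\mathrm{heart}}}{D_{\mathrm{thorax}}},$$
where $D_{\mathrm{heart}}$ is the maximum transverse cardiac diameter and $D_{\mathrm{thorax}}$ is the maximum thoracic diameter.
The 2D-CTR is the ratio between the perimeter of the heart region, $P_{\mathrm{heart}}$, and the perimeter of the entire thoracic region, $P_{\mathrm{thorax}}$, formulated as
$$\text{2D-CTR} = \frac{P_{\mathrm{heart}}}{P_{\mathrm{thorax}}},$$
while CTAR, the ratio between the area of the heart region and the sum of the areas of the left and right lung regions, is formulated as
$$\mathrm{CTAR} = \frac{A_{\mathrm{heart}}}{A_{\mathrm{left\ lung}} + A_{\mathrm{right\ lung}}}.$$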
In the experiments involving rule based features, we concatenated the features with the features extracted from a DCN and trained a fully connected layer to detect cardiomegaly. However, the results degraded and hence are not shown here.
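The three rule based ratios described above, and their concatenation with a DCN feature vector, can be sketched as follows. The measurements are assumed to come from a prior heart and lung segmentation step; the function names and the numbers in the example are hypothetical.

```python
def ctr_1d(cardiac_diameter, thoracic_diameter):
    """1D-CTR: maximum transverse cardiac diameter over maximum thoracic diameter."""
    return cardiac_diameter / thoracic_diameter


def ctr_2d(heart_perimeter, thorax_perimeter):
    """2D-CTR: perimeter of the heart region over perimeter of the thoracic region."""
    return heart_perimeter / thorax_perimeter


def ctar(heart_area, left_lung_area, right_lung_area):
    """CTAR: heart area over the summed areas of the left and right lungs."""
    return heart_area / (left_lung_area + right_lung_area)


def fuse_features(dcn_features, rule_features):
    """Concatenate DCN features with rule based features into one vector."""
    return list(dcn_features) + list(rule_features)


if __name__ == "__main__":
    # Hypothetical measurements (arbitrary units) from one segmented CXR.
    rule = [ctr_1d(15.0, 30.0), ctr_2d(40.0, 100.0), ctar(100.0, 150.0, 250.0)]
    fused = fuse_features([0.12, 0.80, 0.33], rule)  # placeholder DCN features
    print(fused)
```

Because the three ratios are scale-free while DCN features can have very different magnitudes and dimensionality, some normalization before fusion may be worth considering; the text does not state whether any was applied.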
One observation from these single model classification results is that different figures of merit are maximized by different DCNs. We wanted to explore whether this is expected or due to some limitation of the data or the training process itself. Hence, rather than taking a single train-test split of the data, we randomly split the data into train and test sets and trained nine different models for each architecture. Then we calculated the mean and standard deviation of the figures of merit of interest. The results can be seen in table ?. We find that after averaging the results of the nine random train-test splits, a clear trend emerges in which a single model, ResNet-152 in this case, achieves the highest accuracy, AUC and sensitivity. The mean specificity for ResNet-152 is close to the highest value, while its maximum specificity is indeed the highest.
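The averaging over random train-test splits described above can be sketched as follows. Here `train_and_score` is a hypothetical callable that trains a model on the training split and returns one figure of merit on the test split; the toy classifier and data are placeholders, and the use of the sample standard deviation is an assumption.

```python
import random
import statistics


def repeated_split_scores(positives, negatives, train_and_score,
                          n_test_per_class, n_repeats, seed=0):
    """Mean and sample standard deviation of a score over random balanced splits."""
    rng = random.Random(seed)
    k = n_test_per_class
    scores = []
    for _ in range(n_repeats):
        pos, neg = list(positives), list(negatives)
        rng.shuffle(pos)
        rng.shuffle(neg)
        test = [(x, 1) for x in pos[:k]] + [(x, 0) for x in neg[:k]]
        train = [(x, 1) for x in pos[k:]] + [(x, 0) for x in neg[k:]]
        scores.append(train_and_score(train, test))
    return statistics.mean(scores), statistics.stdev(scores)


def toy_train_and_score(train, test):
    """Toy stand-in: threshold a scalar feature midway between the class means."""
    pos_mean = statistics.mean(x for x, y in train if y == 1)
    neg_mean = statistics.mean(x for x, y in train if y == 0)
    cut = (pos_mean + neg_mean) / 2
    correct = sum((x >= cut) == (y == 1) for x, y in test)
    return correct / len(test)


if __name__ == "__main__":
    # Hypothetical overlapping scalar features for the two classes.
    positives = [0.40 + 0.02 * i for i in range(20)]
    negatives = [0.20 + 0.02 * i for i in range(20)]
    mean, sd = repeated_split_scores(positives, negatives, toy_train_and_score,
                                     n_test_per_class=5, n_repeats=9, seed=1)
    print(f"accuracy = {mean:.3f} +/- {sd:.3f}")
```

Reporting the spread alongside the mean shows how much of the difference between two architectures could be explained by the choice of split alone.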
Having around 600 images for training a network is not sufficient. We wanted to see how the mean accuracy and the standard deviation vary as we change the number of training examples. Since averaging over multiple train-test splits gave robust values for classification accuracy and the other figures of merit, we used this classification process to measure the deviation of the result as a function of the number of training images. The results are shown in Figure 5. As expected, for both accuracy and AUC, the mean is lower and the deviation is higher for fewer than 50 training examples per category. As the number of examples increases, the mean increases and the deviation decreases, saturating at about 200 images.
To check whether the same model gives the highest accuracy for different abnormalities, we model pulmonary edema using the same averaging process described above. Our dataset for the detection of pulmonary edema contains the 45 available frontal CXR images with pulmonary edema and 45 randomly chosen normal frontal CXRs from the Indiana Dataset. We partitioned the dataset into train and test sets such that 30 of each class were selected for training and 15 of each class for testing. We ran our program with 15 different seeds and report the overall performance metrics in table ? as (mean ± s.d.). We find that whereas ResNet-152 gave the highest accuracy for cardiomegaly detection, for pulmonary edema detection ResNet-50 gives the highest accuracy, highest AUC and highest sensitivity. ResNet-152 has a slightly higher specificity. This shows that there is no single model appropriate for all abnormalities; rather, the most suitable network varies between abnormalities. This observation is consistent with conclusions drawn elsewhere in the literature. In this case, ResNet-152, which gave the highest accuracy for cardiomegaly detection, achieves an accuracy almost one percentage point lower than ResNet-50.
Classification using ensemble of models
We trained four different instances of each of the DCNs, i.e., AlexNet, VGG-16, VGG-19, ResNet-50, ResNet-101 and ResNet-152, to detect cardiomegaly. Thus a total of 24 networks were trained on the same training data. There are a number of ways to build an ensemble from the trained models, including linear averaging, bagging, boosting and stacked regression. Since the number of images in the training dataset is only 564, which is far less than the number of trainable parameters in the classifiers, the individual classifiers always overfit the training set. In this situation, employing bagging, boosting and/or stacked regression to build the ensemble model would result in a completely biased model. Thus, the ensemble models were obtained by simple linear averaging of the probabilities given by the individual models. The performance of the ensembles was measured using 50 cardiomegaly and 50 normal images for all possible combinations of the trained individual models. The performance of these combinations is shown in Figure 6 using boxplots. The horizontal red bars indicate the 50th percentile values and the extents of the blue boxes indicate the 25th and 75th percentile values. The black stars indicate extreme points in the data. It can be observed from the figure that combinations of 7 to 10 models can achieve higher accuracy, but they also have the largest spread. On the other hand, as the number of models in the ensemble increases, the accuracy of the ensemble model converges to a certain value, which for this experiment was .
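Linear averaging of model probabilities, evaluated over every combination of a given size, can be sketched as follows. The per-model probabilities, labels and the decision threshold of 0.5 are made-up illustrative values, and the helper names are not from the paper.

```python
from itertools import combinations


def ensemble_average(prob_lists):
    """Average the per-sample probabilities of several models."""
    n_models = len(prob_lists)
    return [sum(column) / n_models for column in zip(*prob_lists)]


def accuracy(probs, labels, threshold):
    """Fraction of samples whose thresholded probability matches the label."""
    correct = sum((p >= threshold) == (y == 1) for p, y in zip(probs, labels))
    return correct / len(labels)


def all_ensemble_accuracies(model_probs, labels, size, threshold):
    """Accuracy of every ensemble formed from `size` of the given models."""
    return [accuracy(ensemble_average(list(combo)), labels, threshold)
            for combo in combinations(model_probs, size)]


if __name__ == "__main__":
    # Hypothetical pathology probabilities from three models on four test images.
    labels = [1, 1, 0, 0]
    m1 = [0.9, 0.4, 0.2, 0.6]
    m2 = [0.8, 0.7, 0.4, 0.3]
    m3 = [0.6, 0.7, 0.1, 0.3]
    print(accuracy(m1, labels, 0.5))
    print(accuracy(ensemble_average([m1, m2, m3]), labels, 0.5))
    print(all_ensemble_accuracies([m1, m2, m3], labels, size=2, threshold=0.5))
```

Averaging needs no additional fitting on the small training set, which is why it avoids the bias concern raised above for bagging, boosting and stacking. Note that the number of combinations grows quickly with the pool size, so exhaustive enumeration over 24 models is expensive for mid-sized ensembles.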
The ROC curves of one instance each of AlexNet, VGG-19 and ResNet-152, and of one ensemble model that is the linear average of 6 different types of DCNs, are shown in Figure 7. The curves are obtained using 50 cardiomegaly and 50 normal images. The AUC obtained for each model are , , and , respectively. We can understand from the AUC values that the separation between the pathology class and the normal class increases when an ensemble of multiple DCNs is used. For the ensemble model to be used as a screening tool with high sensitivity, the operating point on the curve is set to achieve sensitivity. The specificity obtained at this point is . The second operating point is set for high specificity of and the sensitivity at this point is .
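Tracing an ROC curve, computing its AUC and selecting an operating point for a target sensitivity can be sketched as follows. This is an illustrative sketch using the trapezoidal rule; the labels, scores and target sensitivity in the example are placeholders, not the values behind Figure 7.

```python
def roc_points(labels, scores):
    """ROC curve as a list of (fpr, tpr, threshold), from strictest threshold down."""
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0, float("inf"))]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
        points.append((fp / n_neg, tp / n_pos, threshold))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


def operating_point(points, min_sensitivity):
    """Lowest-FPR point reaching the target sensitivity.

    Returns (threshold, sensitivity, specificity).
    """
    for fpr, tpr, threshold in points:
        if tpr >= min_sensitivity:
            return threshold, tpr, 1.0 - fpr
    raise ValueError("target sensitivity is not reachable")


if __name__ == "__main__":
    # Hypothetical scores for three pathological and three normal images.
    labels = [1, 1, 1, 0, 0, 0]
    scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.7]
    curve = roc_points(labels, scores)
    print(auc(curve))
    print(operating_point(curve, min_sensitivity=1.0))
```

A high-specificity operating point can be chosen symmetrically by scanning the curve for the highest sensitivity whose false positive rate stays under a target.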
For any diagnostic task, it is desirable to gain an intuitive understanding of why a certain classification decision is made, rather than to rely on a black box method. In other words, it is desirable to identify the features in the chest X-Ray that contributed most to the detection of a certain abnormality. There are various ways of achieving this goal. Occlusion sensitivity is the simplest, where a patch is occluded in the image to measure its impact on the eventual classification confidence score. We have used this method to find the regions in the image responsible for the detection of a certain abnormality. As representative examples, we have used cardiomegaly and pulmonary edema, which occur in the heart and lung areas respectively. The localization scheme described in Section 3.4 is followed with a patch size of pixels taking lowest values of probabilities. Instead of gray level occlusion, we found that black level occlusion works better for CXRs. This is because the CXRs themselves are mostly gray level, so occlusion at the same level does not hide much information relative to the neighborhood.
The localization of abnormalities in cardiomegaly examples is shown in Figure 9. Here, of the image area is shown which has the highest sensitivity. It can be observed from the figures that the network is indeed most sensitive to the region where the heart is larger than a normal heart. We performed this experiment on cardiomegaly and normal images and found this localization to be consistent for most examples. There is not much functional difference between a normal and a cardiomegaly example other than the fact that the heart in cardiomegaly is larger than a normal heart. Given that normal images can also show hearts of various sizes depending on the age or physical attributes of the patient, we found this level of localization sensitivity to be remarkable. It is also interesting that the standard rule based features like CTR and CTAR take into account the relative size of the heart and lungs to determine whether cardiomegaly is present. In the DCN localization experiment, we see that most of the signal contributing to the softmax score comes from the heart only. This means that there are characteristic features in the shape of the heart and its surrounding regions that alone are sufficient to detect cardiomegaly. The lungs and their relative size are probably less important features when trying to detect cardiomegaly. This observation is counterintuitive and needs to be explored further in future work.
Pulmonary Edema Localization
In order to test the effectiveness of the localization procedure in areas other than the heart region, we chose pulmonary edema, which occurs in the lung region. Pulmonary edema is detected by the net-like white structure in the lung area; no anatomical shape change is associated with the abnormality. We have found that the best localization is obtained when the ROIs of the lungs are used to compute the map. Following the scheme in Section 3.4, the localization experiment on pulmonary edema is performed as shown in Figure 11. It has been observed that the classifier is not sensitive to fine features like septal or Kerley B lines. The localization is mainly obtained in the lung region where excess fluid is observed. Some localization regions are outside the lung region; this occurs primarily because, even when the occlusion center is outside the lung, the patch still occludes part of the lung region and thus the probability drops.
4.3 Comparison between Rule based and DCN based cardiomegaly detection
A comparison between rule based and DCN based cardiomegaly detection is shown in table ?. The state-of-the-art method by Candemir et al. reported an accuracy of while classifying between 250 cardiomegaly and 250 normal images. They employed 1D-CTR, 2D-CTR, and CTAR computed from segmented CXRs as features. A brief discussion of the rule based approach is given in the supplementary materials in Section 7.1. To verify the claim in that paper, we reproduced those results and achieved an accuracy of on the same train-test split on which the DCNs are trained. It can be observed from the table that our results are similar to those originally reported. However, it is evident from the table that all DCN based approaches outperform the rule based method. As stated earlier, the DCNs were fine-tuned on a sample of 560 images and validated on 100 images. Among the independent DCN models, the VGG-19 model achieves the highest accuracy of and highest AUC of for detecting cardiomegaly. The ensemble model, which is the linear average of the six individual DCN models, shows the best accuracy of and AUC of . The accuracy is percentage point higher than that reported in Candemir’s paper and the AUC is percentage points higher than our implementation of the paper. Similarly, a percentage higher sensitivity and percentage point higher specificity than in Candemir’s paper are reported. This degree of improvement in accuracy, AUC, sensitivity and specificity makes a strong case for the use of deep learning based detection techniques in real-world applications of medical image analysis.
In this section we evaluate the effectiveness of the network design and DCN pipelines on a different dataset and abnormality. We use the Shenzhen dataset, as it is often used for reporting accuracy on tuberculosis detection.
A detailed study of tuberculosis detection will be provided in a future publication. However, a comparison among several TB classification methods and the proposed DCN based methods, along with their ensemble, on the Shenzhen Dataset is shown in table ?. Previously, Jaeger et al. extracted several features from lung segmented CXRs and employed various classification methods to benchmark the features. The results reported in the table were obtained using low-level content-based image retrieval features and linear logistic regression based classification. Hwang et al. trained three different DCNs on three different train/test splits of a large private KIT dataset and tested the ensemble of the models on the Shenzhen dataset. It is to be noted that both the KIT and Shenzhen datasets were obtained using Digital Radiography. Lopes and Valiati employed bags of features and an ensemble method using features from ResNet, VGG and GoogLeNet models, and trained SVM classifiers on them. They obtained the highest AUC on the Shenzhen dataset using an ensemble of individual SVM classifiers. Lakhani and Sundaram employed AlexNet and GoogLeNet on a combined dataset of four different databases and built an ensemble of the trained models. They do not report test results on the Shenzhen dataset, and thus their work is not shown in the table. The DCN based methods shown in table ? have comparable or higher accuracy and lower AUC than the results already present in the literature. The VGG-16 model obtains the highest sensitivity and the AlexNet model obtains the highest specificity. The sensitivity and specificity measures for Jaeger’s, Hwang’s and Lopes and Valiati’s papers are not shown in the table as they were not reported in the respective papers. In terms of accuracy and AUC, our ensemble method obtains highest values of and , respectively. This accuracy is obtained when classifier threshold is set to . When classifier threshold is set to , the accuracy obtained is . Thus, we report a percentage point higher accuracy and percentage point higher AUC compared to the nearest result, from Lopes and Valiati’s paper.
In summary, we have explored DCNN based abnormality detection in frontal chest XRays. We have found the existing literature to be insufficient for making comparison of various detection techniques either due to studies reported on private datasets or not reporting the test scores in proper detail . In order to overcome these difficulties, we have used the publicly available Indiana chest X-Ray dataset and studied the performance of various DCN architectures on different abnormalities. We have found that the same DCNN architecture doesn’t perform well across all abnormalities. When the number of training examples is low, a consistent detection result can be achieved by doing multiple train-test with random data split and the average values are used as the accuracy measure. Shallow features or earlier layers consistently provide higher detection accuracy compared to deep features. We have also found ensemble models to improve classification significantly compared to single model when only DCNN models are used. Combining DCNN models with rule based models degraded the accuracy. Combining these insights, we have reported the highest accuracy on a number of chest X-Ray abnormality detection where comparison could be made. For the cardiomegaly classification task, the deep learning method improves the accuracy by a staggering 17 percentage point. Using the same method developed in the paper, we achieve the highest accuracy on the Shenzen dataset for Tuberculosis detection. We have also performed localization of features responsible for classification decision. We found that for spatially spread out abnormalities like cardiomegaly and pulmonary edema, the network can localize the abnormalities successfully most of the time. However, the localization fails for pointed features like lung nodule or bone fracture. One remarkable result of the cardiomegaly localization is that the heart and its surrounding region is most responsible for cardiomegaly detection. 
This is counterintuitive considering the usual method of using the ratio of heart and lung area as a measure for cardiomegaly. However, expert radiologists often diagnose cardiomegaly by looking at the heart’s shape rather than by using a quantitative method. We believe that deep learning based classification and localization will reveal many more interesting features that are not traditionally considered.
While finishing this paper, we became aware of a new dataset announcement and paper focused on a similar problem. It would be interesting to apply the techniques discussed in our paper to the new dataset once it becomes available.
This research used resources of the National Energy Research Scientific Computing Center, a DOE Office of Science User Facility supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. We thank Leonid Oliker at NERSC for sharing his allocation on the OLCF Titan supercomputer with us on project CSC103. We also thank Hoo-Chang Shin for correspondence regarding the Indiana chest X-Ray dataset.
In this section, we first give a detailed description of the segmentation of the CXRs. We then describe the rule based machine learning model. After that, we show classification results for 20 chest X-Ray abnormalities. Finally, we show more localization results with additional insights.
7.1 Rule based Approach in Detecting Cardiomegaly
This method uses existing CXRs and their radiologist marked lung/heart boundaries as models, and estimates the lung/heart boundary of a patient X-ray by registering the model X-rays to the patient X-ray. 
Using Radon Transform and Bhattacharyya Distance to find visually similar images
We use the publicly available chest X-ray dataset (JSRT) with reference boundaries given in the SCR dataset. For a given test image, the Radon transform of that image is calculated at projection angles ranging from 0 to 90 degrees. The Radon transform computes projections of an image along specified directions. The Bhattacharyya distance is calculated between the Radon transform of the test CXR and those of the sample CXRs to find the 5 most visually similar samples from the JSRT dataset. We then register these most similar images to the test image. As mentioned by Candemir et al., the main objective of similarity measurement is to increase the correspondence performance and reduce the computational cost during registration.
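The retrieval step can be illustrated with a small NumPy sketch. It is not the implementation used in the paper: the Radon transform is approximated by binning pixels along each projection direction, the whole set of projections is normalized into a single distribution before the Bhattacharyya distance is taken, and the 10 degree angle step is an assumption. The names `radon_projections`, `bhattacharyya_distance` and `most_similar` are hypothetical.

```python
import numpy as np

def radon_projections(image, angles_deg):
    """Approximate Radon transform: one 1-D projection per angle (pixel binning)."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0          # centre the coordinate system
    ys = ys - (h - 1) / 2.0
    half = int(np.ceil(np.hypot(h, w) / 2.0))
    n_bins = 2 * half + 1
    out = np.empty((len(angles_deg), n_bins))
    for i, a in enumerate(np.deg2rad(angles_deg)):
        # signed distance of every pixel from the projection axis through the centre
        s = xs * np.cos(a) + ys * np.sin(a)
        idx = np.rint(s).astype(int) + half
        out[i] = np.bincount(idx.ravel(), weights=image.ravel(), minlength=n_bins)
    return out

def bhattacharyya_distance(p, q, eps=1e-12):
    """-ln of the Bhattacharyya coefficient of two non-negative arrays."""
    p = p.ravel() / (p.sum() + eps)
    q = q.ravel() / (q.sum() + eps)
    bc = np.sum(np.sqrt(p * q))
    return -np.log(max(bc, eps))

def most_similar(test_image, model_images, k=5, angles_deg=None):
    """Indices and distances of the k model images closest to the test image.

    All images are assumed to share the same shape so that projections align.
    """
    if angles_deg is None:
        angles_deg = np.arange(0, 91, 10)   # assumed step; the paper gives only the range
    test_r = radon_projections(test_image, angles_deg)
    dists = np.array([
        bhattacharyya_distance(test_r, radon_projections(m, angles_deg))
        for m in model_images
    ])
    order = np.argsort(dists)[:k]
    return order, dists[order]
```

Calling `most_similar(test_cxr, jsrt_images, k=5)` on equally sized grayscale arrays would return the five candidate models to pass on to registration.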
Calculating correspondence between test CXR and model CXRs using SIFTFlow
We compute the correspondence map between the test CXR and visually similar CXR models by calculating local image features and matching the most similar locations. We employ the SIFT-flow algorithm which matches densely sampled SIFT features between two images. The computed correspondence map is a transformation from model X-ray to the patient X-ray. Finally, the computed transformation matrix is applied on the model CXR’s lung-heart boundary to generate an approximate lung-heart segmentation of the test image.
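The last step, transferring the model boundary to the test image, can be sketched as follows. The sketch assumes the correspondence map is available as a dense per-pixel displacement field of shape (H, W, 2) holding (dx, dy) from model to patient coordinates; computing that field with SIFT-flow is not shown, and `warp_boundary` is a hypothetical helper name.

```python
import numpy as np

def warp_boundary(boundary_xy, flow):
    """Move model boundary points into the patient image.

    boundary_xy : (N, 2) array of (x, y) points on the model lung/heart boundary.
    flow        : (H, W, 2) array of (dx, dy) displacements, model -> patient
                  (assumed output format of the correspondence step).
    """
    pts = np.asarray(boundary_xy, dtype=float)
    h, w = flow.shape[:2]
    # look up the displacement at the nearest pixel of each boundary point
    cols = np.clip(np.rint(pts[:, 0]).astype(int), 0, w - 1)
    rows = np.clip(np.rint(pts[:, 1]).astype(int), 0, h - 1)
    return pts + flow[rows, cols]
```

With several model CXRs, each model boundary would be warped with its own displacement field.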
Rule based feature extraction
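One rule of the kind mentioned earlier, the ratio of heart area to lung area, can be computed directly from the segmentation masks. The following is a minimal sketch, assuming binary masks of equal shape; the function name is hypothetical and no decision threshold from the paper is implied.

```python
import numpy as np

def heart_lung_area_ratio(heart_mask, lung_mask):
    """Ratio of heart pixel count to lung pixel count from binary masks."""
    heart_area = np.count_nonzero(heart_mask)
    lung_area = np.count_nonzero(lung_mask)
    if lung_area == 0:
        raise ValueError("lung mask is empty")
    return heart_area / lung_area
```

A rule based classifier would then compare this ratio, possibly together with other geometric features, against a threshold chosen on training data.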
7.2 Classification Results on the 20 chest X-Ray abnormalities
In this section, we report the classification accuracy, sensitivity and specificity using the ResNet-152 model. We hope that these numbers will set a benchmark to compare against other machine learning methods on this dataset.
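For reference, the three reported quantities follow from the confusion matrix of a binary classifier. The sketch below uses the standard definitions, with label 1 for abnormal and 0 for normal; the function name and label encoding are assumptions made for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity and specificity for labels in {0, 1} (1 = abnormal)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    total = tp + tn + fp + fn
    nan = float("nan")
    return {
        "accuracy": (tp + tn) / total if total else nan,
        "sensitivity": tp / (tp + fn) if (tp + fn) else nan,   # true positive rate
        "specificity": tn / (tn + fp) if (tn + fp) else nan,   # true negative rate
    }
```

Reporting sensitivity and specificity alongside accuracy matters here because the abnormal and normal classes can be strongly imbalanced.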
8 Additional Examples of Localization
In this section we show more examples of localization. A few localization samples are shown in Figure 12. In the CXRs with Cardiomegaly (Figure 12(a) and (b)), a fine localization around the heart is observed. In the normal CXRs (Figure 12(c) and (d)), such localization is not observed; rather, the lowest probabilities are spread out over the CXR image. It is interesting to note that the localization algorithm gets low probability where the heart is enlarged during cardiomegaly, but the proportion is small compared to the localization in other areas of normal CXRs. To assess the performance of the heat maps, we computed the histogram of the heat map of each of the 100 CXRs in the test set for Cardiomegaly detection; the average histograms are shown in Figure 12(e) and (f) for CXRs with Cardiomegaly and normal CXRs, respectively. Note that the histograms include both success and failure cases. For CXRs with Cardiomegaly, the classifier is highly sensitive toward Cardiomegaly detection even under occlusion. This indicates that the classifier primarily looks for local features in a CXR instead of some feature that is spread out over the entire CXR. However, the classifier is not sensitive toward normal CXRs under occlusion; rather, the probabilities are spread out across the probability spectrum. We then analyzed the failure cases, where the classifier is unable to classify the image correctly. Two such examples are shown in Figure 13. The CXR shown in Figure 13(a) contains Cardiomegaly, whereas the classifier detects it as normal. However, the heat map localizes around the heart quite well despite the inaccurate classification. On the other hand, Figure 13(b) shows an example of a normal image which has been classified as Cardiomegaly by the classifier. 
There is stronger localization around the heart than that observed for normal images as in Figure 12(c) and (d); however, as in those images, the localization is spread out.
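The heat maps and average histograms discussed above can be sketched as follows. This is an illustration of occlusion sensitivity in general, not the exact procedure of the paper: `predict_fn` stands for any hypothetical function returning the probability of the abnormality for a 2-D image, and the patch size, stride, fill value and number of histogram bins are assumed values.

```python
import numpy as np

def occlusion_heatmap(image, predict_fn, patch=16, stride=8, fill=0.0):
    """Per-pixel mean of the predicted probability over all occlusions covering it."""
    image = np.asarray(image, dtype=float)
    h, w = image.shape
    prob_sum = np.zeros((h, w))
    count = np.zeros((h, w))
    for top in range(0, h - patch + 1, stride):
        for left in range(0, w - patch + 1, stride):
            occluded = image.copy()
            occluded[top:top + patch, left:left + patch] = fill
            p = predict_fn(occluded)
            prob_sum[top:top + patch, left:left + patch] += p
            count[top:top + patch, left:left + patch] += 1
    # pixels never covered by a patch keep the unoccluded prediction
    baseline = predict_fn(image)
    return np.where(count > 0, prob_sum / np.maximum(count, 1), baseline)

def average_histogram(heatmaps, bins=20):
    """Mean of the normalized probability histograms of several heat maps."""
    hists = [
        np.histogram(hm, bins=bins, range=(0.0, 1.0))[0] / hm.size
        for hm in heatmaps
    ]
    return np.mean(hists, axis=0)
```

Low values in the heat map mark regions whose occlusion reduces the predicted probability most, which is the sense in which the classifier is said to localize around the heart.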
In a similar fashion, additional localization results for Pulmonary Edema are shown in Figure 14. Figure 14(a) and (b) show the localization for two examples of CXRs with Pulmonary Edema. As stated earlier, the classifier localizes in the lung region. This is not the case when normal images are used to localize Pulmonary Edema, as seen in Figure 14(c) and (d): the localizations fall in random dense locations such as the sternum or heart. As in the cardiomegaly case, the average histogram for CXRs with Pulmonary Edema (Figure 14(e)) shows a sensitivity toward pulmonary edema detection, while that for the normal CXRs shows a spread out detection. It is interesting to note that high probabilities (>0.85) are non-existent in the histogram of normal images, thus ensuring a low false positive rate. In the test set, none of the normal images were diagnosed as Pulmonary Edema. The failure cases are shown in Figure 13. These CXRs are with Pulmonary Edema. However, the localization algorithm shows that one of them localizes in the lungs, whereas the other shows a localization pattern similar to that obtained in normal CXRs.
- P. Campadelli and E. Casiraghi, “Lung field segmentation in digital postero-anterior chest radiographs,” in International Conference on Pattern Recognition and Image Analysis. Springer, 2005, pp. 736–745.
- A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105.
- K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
- J. M. Carrillo-de Gea, G. García-Mateos, J. L. Fernández-Alemán, and J. L. Hernández-Hernández, “A computer-aided detection system for digital chest radiographs,” Journal of Healthcare Engineering, vol. 2016, 2016.
- S. Candemir, S. Jaeger, W. Lin, Z. Xue, S. Antani, and G. Thoma, “Automatic heart localization and radiographic index computation in chest x-rays,” in SPIE Medical Imaging. International Society for Optics and Photonics, 2016, pp. 978 517–978 517.
- S. Candemir, S. Jaeger, K. Palaniappan, J. P. Musco, R. K. Singh, Z. Xue, A. Karargyris, S. Antani, G. Thoma, and C. J. McDonald, “Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration,” IEEE transactions on medical imaging, vol. 33, no. 2, pp. 577–590, 2014.
- S. Jaeger, A. Karargyris, S. Candemir, L. Folio, J. Siegelman, F. Callaghan, Z. Xue, K. Palaniappan, R. K. Singh, S. Antani et al., “Automatic tuberculosis screening using chest radiographs,” IEEE transactions on medical imaging, vol. 33, no. 2, pp. 233–245, 2014.
- U. Lopes and J. Valiati, “Pre-trained convolutional neural networks as feature extractors for tuberculosis detection,” Computers in Biology and Medicine, 2017.
- S. Hwang, H.-E. Kim, J. Jeong, and H.-J. Kim, “A novel approach for tuberculosis screening based on deep convolutional neural networks,” in SPIE Medical Imaging. International Society for Optics and Photonics, 2016, pp. 97 852W–97 852W.
- P. Lakhani and B. Sundaram, “Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks,” Radiology, p. 162326, 2017.
- A. Kumar, Y.-Y. Wang, K.-C. Liu, I.-C. Tsai, C.-C. Huang, and N. Hung, “Distinguishing normal and pulmonary edema chest x-ray using gabor filter and svm,” in Bioelectronics and Bioinformatics (ISBB), 2014 IEEE International Symposium on. IEEE, 2014, pp. 1–4.
- M. R. Zare, A. Mueen, M. Awedh, and W. C. Seng, “Automatic classification of medical x-ray images: hybrid generative-discriminative approach,” IET Image Processing, vol. 7, no. 5, pp. 523–532, 2013.
- H.-C. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers, “Learning to read chest x-rays: recurrent neural cascade model for automated image annotation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2497–2506.
- Y. Bar, I. Diamant, L. Wolf, S. Lieberman, E. Konen, and H. Greenspan, “Chest pathology detection using deep learning with non-medical training,” in Biomedical Imaging (ISBI), 2015 IEEE 12th International Symposium on. IEEE, 2015, pp. 294–297.
- G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. van der Laak, B. van Ginneken, and C. I. Sánchez, “A survey on deep learning in medical image analysis,” arXiv preprint arXiv:1702.05747, 2017.
- D. Wang, A. Khosla, R. Gargeya, H. Irshad, and A. H. Beck, “Deep learning for identifying metastatic breast cancer,” arXiv preprint arXiv:1606.05718, 2016.
- V. Gulshan, L. Peng, M. Coram, M. C. Stumpe, D. Wu, A. Narayanaswamy, S. Venugopalan, K. Widner, T. Madams, J. Cuadros et al., “Development and validation of a deep learning algorithm for detection of diabetic retinopathy in retinal fundus photographs,” JAMA, vol. 316, no. 22, pp. 2402–2410, 2016.
- A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun, “Dermatologist-level classification of skin cancer with deep neural networks,” Nature, vol. 542, no. 7639, pp. 115–118, 2017.
- D. Demner-Fushman, M. D. Kohli, M. B. Rosenman, S. E. Shooshan, L. Rodriguez, S. Antani, G. R. Thoma, and C. J. McDonald, “Preparing a collection of radiology examinations for distribution and retrieval,” Journal of the American Medical Informatics Association, vol. 23, no. 2, pp. 304–310, 2016.
- J. Shiraishi, S. Katsuragawa, J. Ikezoe, T. Matsumoto, T. Kobayashi, K.-i. Komatsu, M. Matsui, H. Fujita, Y. Kodera, and K. Doi, “Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodules,” American Journal of Roentgenology, vol. 174, no. 1, pp. 71–74, 2000.
- B. Van Ginneken, M. B. Stegmann, and M. Loog, “Segmentation of anatomical structures in chest radiographs using supervised methods: a comparative study on a public database,” Medical image analysis, vol. 10, no. 1, pp. 19–40, 2006.
- S. Jaeger, S. Candemir, S. Antani, Y.-X. J. Wáng, P.-X. Lu, and G. Thoma, “Two public chest x-ray datasets for computer-aided screening of pulmonary diseases,” Quantitative imaging in medicine and surgery, vol. 4, no. 6, p. 475, 2014.
- K. Simonyan and A. Zisserman, “Very deep convolutional networks for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
- D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
- M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European conference on computer vision. Springer, 2014, pp. 818–833.
- N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting.” Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
- K. Ashraf, B. Wu, F. Iandola, M. Moskewicz, and K. Keutzer, “Shallow networks for high-accuracy road object-detection,” in Proceedings of the 3rd International Conference on Vehicle Technology and Intelligent Transport Systems, 2017, pp. 33–40.
- H. Azizpour, A. Sharif Razavian, J. Sullivan, A. Maki, and S. Carlsson, “From generic to specific deep representations for visual recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2015, pp. 36–45.
- L. Breiman, “Stacked regressions,” Machine learning, vol. 24, no. 1, pp. 49–64, 1996.
- S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, “On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation,” PloS one, vol. 10, no. 7, p. e0130140, 2015.
- J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller, “Striving for simplicity: The all convolutional net,” arXiv preprint arXiv:1412.6806, 2014.
- B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, “Learning deep features for discriminative localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2921–2929.
- X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers, “Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases,” arXiv preprint arXiv:1705.02315v2, 2017.
- A. Bhattacharyya, “On a measure of divergence between two statistical populations defined by their probability distribution,” Bull. Calcutta Math. Soc, 1943.