Encoding High-Level Visual Attributes in Capsules for Explainable Medical Diagnoses

Rodney LaLonde
Center for Research in Computer Vision
University of Central Florida
Orlando, FL 32816
Drew Torigian
Department of Radiology
University of Pennsylvania
Philadelphia, PA 19104
Ulas Bagci
Center for Research in Computer Vision
University of Central Florida
Orlando, FL 32816

Deep neural networks are often called black-boxes due to their difficult-to-interpret decisions. This is characteristic of a deeper trend in machine learning, where predictive performance typically comes at the cost of interpretability. In some domains, such as image-based diagnostic tasks, understanding the reasons behind machine generated predictions is vital in assessing trust. In this study, we introduce novel designs of capsule networks to provide explainable diagnoses. Our proposed deep explainable capsule architecture, called DX-Caps, can encode high-level visual attributes within the vectors of capsules in order to simultaneously produce malignancy predictions for lung cancer as well as approximations of six visually-interpretable attributes used by radiologists to explain their predictions. To reduce the parameter and memory burden of this deeper network, we introduce a new capsule-average pooling function. With this simple but fundamental addition, capsule networks can be designed in a deeper fashion than was possible before. Our overall approach can be characterized as multi-task learning; we learn to approximate the six high-level visual attributes of a lung nodule within the vectors of our uniquely constructed deep capsule network, while simultaneously segmenting the nodule and predicting its malignancy potential (diagnosis). Tested on over 1000 CT scans, our experimental results show that our proposed algorithm can approximate the visual attributes of lung nodules far better than a deep multi-path dense 3D CNN Shen et al. (2019). The proposed network also achieves higher diagnostic accuracy than both a baseline explainable capsule network, X-Caps, and CapsNet Sabour et al. (2017), which is applied to this task here for the first time. To the best of our knowledge, this is the first study to investigate capsule networks for visual attribute prediction in general, and explainable medical image diagnosis in particular.

1 Introduction

Although deep learning (DL) has played a major role in a wide array of fields, there exist several which have yet to be comparably impacted: military, security, transportation, finance, legal, and healthcare among others Bloomberg (11.16.2018); Lehnis (15.06.2018); Polonski (10.01.2018). At its core, DL owes its success to the joining of two essential tasks, feature extraction and feature classification, learned in a joint manner, usually through a form of backpropagation. Although this direction has dramatically improved predictive performance on a diverse range of tasks, it has also come at a great cost: the sacrifice of human-level explainability. As features become less interpretable, and the functions learned more complex, model predictions become more difficult to explain and the generalization ability of trained networks is less well understood. Several works have begun to press towards this goal of explainable deep learning, as explored in Section 2, but the problem remains largely unsolved.

1.1 Explainable lung cancer diagnoses

Figure 1: Lung nodules with high-level visual attribute scores as determined by expert radiologists. Scores were given from 1 to 5 for six different visual attributes related to diagnosing lung cancer.

DL-based computer-aided diagnosis (CAD) systems have largely failed to be adopted into routine clinical work-flows. Unlike detection tasks, diagnosis (classification) requires radiologists to explain their predictions through the language of high-level visual attributes, shown in Fig. 1. For DL-powered CAD systems to be adopted by the healthcare industry and other high-risk domains, methods must be developed which can provide this same level of explainability. Towards this goal, we propose a novel multi-task deep capsule-based architecture for learning visually-interpretable feature representations within the vectors of the capsules. Although the proposed method is generic and it can be applied to many computer vision problems, we focus here on a high impact healthcare problem: lung cancer diagnosis.

Lung cancer is by far the leading cause of cancer-related death in both men and women National Center for Health Statistics (US) and others (2017). The National Lung Screening Trial showed that screening patients with low-dose computed tomography (CT) has reduced lung cancer mortality by Team (2011); Yankelevitz and Smith (2013). However, only of lung cancer cases are diagnosed at an early stage Howlader et al. (04.2018). This is due to screening-related challenges, including (a) high false positive rates, (b) over-diagnosis, and (c) missed nodules (i.e., tumors) during screening Marshall et al. (2013). Several studies based on DL models, such as 2D and 3D deep convolutional neural networks (CNN), have been conducted to alleviate these challenges Shin et al. (2016); Setio et al. (2017); Buty et al. (2016); Hussein et al. (2017b, a); Khosravan and Bagci (2018); Khosravan et al. (2019); Setio et al. (2016); Huang et al. (2017); Ding et al. (2017); Dou et al. (2017), and such explorations were partially successful in improving both nodule detection and image-based diagnostic rates. Notably, some achieved highly successful diagnosis results, comparable to or even better than expert-level diagnosis Hussein et al. (2017a). Nevertheless, the black-box nature of these previous studies contributed to these methods not making their way into clinical routines. The purpose of this study is therefore to fill this important research gap by creating explainable medical diagnoses through learning visually-interpretable features from medical images with new DL models, specifically novel capsule network architectures.

1.2 Capsule neural networks and their visually-interpretable features

Capsule networks differ from traditional CNNs by replacing the scalar feature maps with vectorized representations, where these vectors are responsible for encoding orientation information, and thus provide equivariance to affine transformations on the input (as opposed to CNNs, which are only equivariant to translation). These capsule vectors are then used in a dynamic routing algorithm which seeks to maximize the agreement between low-level and high-level features, not only in presence, but also in part-whole relationship agreement. In their introductory work Sabour et al. (2017), a capsule network (CapsNet) was shown to produce promising results on both the MNIST LeCun et al. (1998) and CIFAR10 Krizhevsky (2009) data sets; but more importantly, CapsNet was shown to encode high-level visually-interpretable feature representations of digits in MNIST (e.g. stroke thickness, skew, localized-parts) within the dimensions of its capsule vectors. While capsule networks are still a young area of research with many improvements to be made in terms of performance and accuracy, their ability to capture visually-interpretable features can be paramount in critical application domains that demand explainability of models.

1.3 Summary of our contributions

In this work, we introduce two novel multi-task capsule network architectures, providing explainable diagnoses in the same form as radiologists by learning to approximate six high-level visual attributes from lung nodules, as seen in Fig. 1, which radiologists estimate when determining the malignancy of a nodule. Additionally, our networks both segment nodules and determine their malignancy score in a multi-task learning (MTL) approach. Our first proposed architecture, called X-Caps, is an intuitive extension of the original CapsNet, where each dimension of the output capsule layer is supervised by a visual attribute label from multiple radiologists. By forcing each dimension of the capsule vector output to embed a specific visually-interpretable feature, a significant benefit (explainable decisions) is obtained by unraveling the knowledge hierarchy inside deep networks. The multiple visual attributes are learned simultaneously, with their associated weights being updated by the radiologists' visual interpretation scores, by their contribution to the final malignancy score, and by the segmentation reconstruction error. This first network proved too restrictive and reaffirms that deep features carry more discriminative power than these interpretable ones.

Our second architecture, called DX-Caps, is a much deeper capsule network with branching shared-weight paths for visual attribute and malignancy prediction. DX-Caps utilizes the recently introduced locally-constrained dynamic routing algorithm LaLonde and Bagci (2018) and a novel capsule-average pooling (CAP) layer to reduce the spatial dimension of each capsule type. CAP recombines their vectors in a manner far more computationally efficient than the fully-connected capsules used by the conventional CapsNet, similar to using global average pooling in a traditional CNN. These novelties allowed for the creation of a much deeper capsule network while keeping memory usage within a single 12 GB GPU. The malignancy prediction performance of DX-Caps (without any pre-training) is on par with previously state-of-the-art deep pre-trained 2D/3D CNN works (e.g. Buty et al. (2016); Hussein et al. (2017b)), while also outputting visual attribute scores, which almost no previous works do.

Furthermore, since radiologists’ scores vary significantly between each other for both malignancy and visual characteristics of a given nodule, it is not possible to train the proposed networks directly against these scores. Previous works train against the mean of the radiologist scores and convert the mean to a binary label (malignant or benign). However, this throws away significant information. In our proposed method, we fit a Gaussian distribution of mean and variance equal to that of the radiologists’ scores for a given nodule and compute the mean squared error between this and our network output for supervised training. In this way, overconfidence by the network on more ambiguous nodules is punished in proportion to radiologists’ disagreement, and likewise for under-confidence and strong radiologist agreement. This allows our method to produce classification scores across all five possible score values, rather than simply binary labels as in previous studies.

The rest of the paper is organized as follows. In Section 2, we summarize related works in the literature pertaining to interpretable deep learning, lung cancer diagnosis using deep networks, and capsule networks in medical imaging. In Section 3, we introduce our proposed paradigm of learning visually-interpretable features via our newly designed deep multi-task capsule networks. In Section 4, we explain our experiments and results. We conclude our work with discussions in Section 5.

2 Related work

The majority of work in explainable deep learning has focused around post hoc deconstruction of already trained models. Two main approaches are primarily investigated, interpretation of the features learned by the networks and explaining deep networks’ final predictions, at both the local (i.e. individual neurons) and global (i.e. entire layers/networks) level. These approaches typically rely on human-experts to examine their results and attempt to discover meaningful patterns. While there are numerous studies on explainable deep learning, we will attempt to faithfully cover the more prominent approaches. Following this, Section 2 will cover relevant lung cancer diagnosis and capsule-based works.

Visualization of features

Several works have attempted to examine network interpretability at the individual neuron level. Some of the earliest methods focused on visualizing individual filters and activation maps. While this can provide some insight into certain aspects of a network, such as dead neurons, the visualization of individual filters or feature maps is typically not interpretable at the human level. Zeiler and Fergus (2014) attached a deconvolutional network to network layers to map activations back to pixel space for visualization. Later, Springenberg et al. (2014) used an all-convolutional network and a guided-backpropagation algorithm to create much sharper visualizations which did not require the keys of the pooling operations. Mahendran and Vedaldi (2015) focused more on layers of neurons and examined the representations learned by shallow and deep CNNs by inverting images using gradient descent. While these methods provide some insight into what CNNs learn, they are ultimately limited, as deep networks typically have hundreds of thousands of neurons and it is intractable to visually examine all or even large subsets of neurons in a network. Additionally, there is evidence to suggest these visualizations are unrelated to network predictions Nie et al. (2018).

Receptive fields, input contributions

Beyond visualizing the features of CNNs, several methods have attempted to examine the effect of individual neurons or image regions on network outputs. In the first category, Girshick et al. (2014) examined the receptive field of individual neurons and found the images which maximally activated each. Kindermans et al. (2018) showed that Springenberg et al. (2014); Zeiler and Fergus (2014) (discussed above) did not create theoretically correct explanations for linear models, and created PatternNet and PatternAttribution to better visualize neuron activations. In the latter category, Kumar et al. (2017) examined which input regions correspond most strongly with each output class. An occlusion-based approach was used by Zeiler and Fergus (2014) for masking out image regions to examine their contribution to the final output. One of the most popular methods of visualizing input contributions is Grad-CAM Selvaraju et al. (2017), which highlights the relative positive activation map of convolutional layers with respect to network outputs. Arguably, saliency detection can also fall into this category of determining which input regions are important. While these methods give important information related to designing networks and training data, they tell us very little about the internal representations being learned.

Feature spaces and GANs

Rather than looking at individual neurons or image regions, several approaches instead focus on examining the feature spaces learned by deep networks. Generative adversarial networks (GANs) by Goodfellow et al. (2014) show vulnerable regions of a learned feature space for a given network. Chen et al. (2016) create a GAN-based method called InfoGAN to separate noise from the “latent code” in images. Using this method, they maximize the mutual information between the latent representations and the image inputs, encoding concepts such as rotation, width, and digit type for MNIST. In a similar way, the capsule networks of Sabour et al. (2017) (CapsNet) encode visually-interpretable concepts such as stroke thickness, skew, rotation, and others. These two methods are the most similar to the proposed approach. Lakkaraju et al. (2017) attempt to discover a CNN’s “blind spots” by sampling points in feature space in a weakly-supervised manner. While the other methods mentioned can provide some important clues about the feature space being learned, InfoGAN and CapsNet show the most promise for encoding and extracting visually-interpretable features.

Disentangling representations

Methods for disentangling representations are focused on discovering the visual patterns learned by CNN filters, then disentangling their relationship to each other. Zhang et al. (2018) created a multi-layer graph structure, where each layer of the graph matches a layer of the CNN. Activated visual patterns across all training images are added as nodes, and patterns which co-occur in images have edges added between them. Bau et al. (2017) introduced six types of semantic filters for CNNs: objects, parts, scenes, textures, materials, and colors. Networks are then trained using these labels at the pixel level to identify hidden units’ semantics for any given CNN and align them with human-interpretable concepts. Unfortunately, the former of these methods reveals little about the features learned, while the latter requires a dramatic increase in labeled data, where multiple labels need to be provided at the pixel level.

Lung nodule classification

The majority of recent lung cancer diagnosis (nodule classification) studies have focused on deep 2D, multi-view, and 3D CNNs, with most works trained/tested on the publicly available LIDC-IDRI data set from the Lung Image Database Consortium Armato III et al. (2011). Buty et al. (2016) extracted features from a pre-trained 2D multi-view CNN while encoding shape information through spherical harmonics (SH) to improve diagnostic accuracy from (CNN) to (CNN+SH). Hussein et al. (2017b) achieved a similar result, extracting deep features from a multi-view CNN then applying a Gaussian process regression strategy to achieve accuracy. In a later work, Hussein et al. (2017a) reported that a deep 3D CNN, pre-trained on Sports-1M Karpathy et al. (2014) and used in an MTL with trace norm approach to combine visual attributes, achieved accuracy. In this same work, the authors propose a more complicated regularized graphical lasso post-processing algorithm to combine imaging features with radiologists’ visual attributes and gained an accuracy improvement; however, no results were reported on the visual attribute predictions. Shen et al. (2019) is one of the only works in the literature to attempt to create an interpretable framework by simultaneously predicting visual attribute scores along with malignancy. The authors used a deep multi-path dense 3D CNN to achieve an accuracy of , however their results on individual attribute predictions were as low as . Most recently, some deeper multi-crop Shen et al. (2017), multi-scale Shen et al. (2015), and denser multi-path multi-output Dey et al. (2018) CNNs, using methods such as curriculum learning Nibali et al. (2017) or gradient boosting machines Zhu et al. (2018) and complicated post-processing techniques Hussein et al. (2017a), have been applied to push diagnosis accuracy to . However, adding such techniques is beyond the scope of this work and would require an unwieldy number of ablation studies to separate the contribution of our proposed capsule architectures from that of such techniques. For a fair comparison in this study, we compare our method directly against CapsNet and CNNs without post-processing techniques.

Capsule network-based medical diagnosis

It is also worth noting that a number of recent studies have proposed using CapsNet for a variety of medical imaging classification tasks Afshar et al. (2018); Iesmantas and Alzbutas (2018); Shen and Gao (2018), although no work in the literature has studied capsule networks for lung cancer diagnosis. Nonetheless, since these methods nearly all follow the exact CapsNet architecture, or propose minor modifications which produce nearly identical predictive performance Mobiny and Van Nguyen (2018), it is sufficient to compare only with CapsNet in reference to these works. Lastly, a recent study by Duarte et al. (2018) proposed a network which performed action recognition and localization in videos. However, since this network only contains two capsule layers inside a deep 3D CNN, whereas our proposed architectures consist almost entirely of capsule layers, we do not compare with this work.

3 Learning visually-interpretable features

The goal of our proposed method is to model visual attributes using capsule neural networks in order to provide the same explanations as radiologists for predicting malignancy, while simultaneously performing malignancy prediction and nodule segmentation/reconstruction. The Lung Image Database Consortium and Image Database Resource Initiative (LIDC-IDRI) Armato III et al. (2011) contains a collection of lung nodules with scores ranging from 1 to 5 across a set of visual attributes, indicating their relative appearance, and malignancy, as scored by up to four radiologists. These characteristics and scores are shown in Figure 1.

3.1 Capsules for encoding visual attributes

Figure 2: X-Caps: Explainable Capsule Networks. For a screen-detected nodule, the proposed network (1) predicts high-level visual attributes of the nodule, (2) segments the nodule and reconstructs the input image, and (3) classifies the nodule as malignant or benign.

Our first approach, referred to as explainable capsules, or X-Caps, was designed to remain as similar as possible to CapsNet, while allowing us to have more control over the visually-interpretable features learned by the capsule vectors. CapsNet already showed great promise when trained on the MNIST data set for its ability to model high-level visually-interpretable features. With this first study, we examine the ability of capsules to model specific visual attributes within their vectors, rather than simply hoping some are learned successfully in the more challenging lung nodule data. As shown in Figure 2, X-Caps shares the same overall structure as CapsNet, with the major difference being the addition of supervised labels for each dimension of the X-Caps vectors. To compute a final malignancy score, we attach a fully-connected layer with a sigmoid activation to all six of these vectors. For X-Caps, all output labels have their values scaled to [0, 1] to allow for easier training with the activation functions.
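As a rough illustration of the output stage described above, the following NumPy sketch rescales radiologist scores and passes six attribute capsules through a single fully-connected unit with a sigmoid. The [0, 1] target range, the capsule length of 16, the random weights, and the helper names `scale_scores` and `malignancy_head` are assumptions made for illustration; they are not taken from the authors' implementation.

```python
import numpy as np

def scale_scores(scores, low=1.0, high=5.0):
    """Linearly map radiologist scores from [low, high] to [0, 1] (assumed range)."""
    scores = np.asarray(scores, dtype=np.float64)
    return (scores - low) / (high - low)

def sigmoid(x):
    """Logistic function, keeping the output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def malignancy_head(attribute_capsules, weights, bias):
    """Fully-connected unit with a sigmoid over the flattened attribute capsules."""
    flat = np.asarray(attribute_capsules, dtype=np.float64).reshape(-1)
    return sigmoid(flat @ weights + bias)

rng = np.random.default_rng(0)
capsules = rng.normal(size=(6, 16))            # six attribute capsules, hypothetical length 16
weights = rng.normal(scale=0.1, size=6 * 16)   # placeholder, untrained weights
score = malignancy_head(capsules, weights, bias=0.0)
scaled = scale_scores([1, 3, 5])
```

Because the sigmoid output lies in (0, 1), rescaling the labels to the same interval keeps targets and predictions directly comparable.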

As in CapsNet, we also perform reconstruction of the input as a form of regularization. However, we extend the idea of regularization to perform a pseudo-segmentation. Whereas in segmentation the goal is to output a binary mask of pixels which belong to the nodule region, in our formulation we attempt to reconstruct only the pixels which belong to the nodule region, while the rest are mapped to zero. More specifically, we formulate this problem as


$$R^{x,y} = I^{x,y} \times S^{x,y}, \quad S^{x,y} \in \{0, 1\}, \qquad \mathcal{L}_r = \frac{\gamma}{X \times Y} \sum_{x=1}^{X} \sum_{y=1}^{Y} \left\| R^{x,y} - O_r^{x,y} \right\|,$$

where $\mathcal{L}_r$ is the supervised loss for the reconstruction regularization, $\gamma$ is a weighting coefficient for the reconstruction loss, $I^{x,y}$ is the input image pixel, $R^{x,y}$ is the reconstruction target pixel, $S^{x,y}$ is the ground-truth segmentation mask value, and $O_r^{x,y}$ is the output of the reconstruction network, each at pixel location $(x, y)$, and $X$ and $Y$ are the width and height, respectively, of the input image. This adds another task to our MTL approach and an additional supervisory signal which can help our network distinguish visual characteristics from background noise. The malignancy prediction score, as well as each of the visual attribute scores, also provides a supervisory signal in the form of


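$$\mathcal{L}_a = \sum_{n=1}^{N} \alpha^n \left\| A^n - O_a^n \right\|, \qquad \mathcal{L}_m = \beta \left\| M - O_m \right\|,$$

where $\mathcal{L}_a$ is the combined loss for the visual attributes, $A^n$ is the average of the attribute scores given by at minimum three radiologists for attribute $n$, $N$ is the total number of attributes, $\alpha^n$ is the weighting coefficient placed on the $n$-th attribute, $O_a^n$ is the network prediction for the score of the $n$-th attribute, $\mathcal{L}_m$ is the loss for the malignancy score, $M$ is the average of the malignancy scores given by at minimum three radiologists, $O_m$ is the network prediction for the average malignancy score, and $\beta$ is the weighting coefficient for the malignancy score. In this way, the overall loss for X-Caps is simply $\mathcal{L} = \mathcal{L}_m + \mathcal{L}_a + \mathcal{L}_r$, where $\mathcal{L}_r$ is the reconstruction loss defined above with weighting coefficient $\gamma$. For simplicity, the values of each $\alpha^n$ and $\beta$ are set to , and $\gamma$ is set to . (Footnote 1: Further tuning of these parameters could potentially lead to superior results, but we did not have the computational resources to perform such an analysis for this study.)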
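The three supervisory signals described above (masked reconstruction, attribute scores, and malignancy score) can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: the per-element distance is taken to be an absolute difference, the weighting coefficients are placeholder values of 1, and all function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(image, mask, recon, gamma):
    """Masked reconstruction loss: only nodule pixels are reconstruction targets."""
    target = image * mask                      # background pixels are mapped to zero
    height, width = image.shape
    return gamma / (height * width) * float(np.sum(np.abs(target - recon)))

def attribute_loss(attr_targets, attr_preds, alphas):
    """Weighted sum of per-attribute errors against mean radiologist scores."""
    return float(np.sum(alphas * np.abs(attr_targets - attr_preds)))

def malignancy_loss(m_target, m_pred, beta):
    """Weighted error against the mean radiologist malignancy score."""
    return beta * abs(m_target - m_pred)

# Toy example: a 4x4 image whose central 2x2 block is the nodule.
image = np.ones((4, 4))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1.0
recon = np.zeros((4, 4))                       # an untrained network outputting zeros

l_r = reconstruction_loss(image, mask, recon, gamma=1.0)
l_a = attribute_loss(np.array([0.5, 0.5]), np.array([0.25, 0.75]), alphas=np.ones(2))
l_m = malignancy_loss(0.75, 0.5, beta=1.0)
total = l_m + l_a + l_r                        # overall multi-task loss
```

Masking the target rather than the output means the network is also penalized for reconstructing background, which is what turns the regularizer into a pseudo-segmentation task.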

3.2 Going deeper with explainable capsules

Figure 3: DX-Caps: Deep Explainable Capsule Network. Similar to X-Caps, the proposed network (1) predicts high-level visual attributes of the nodule, (2) segments the nodule through a masked reconstruction of the input image, and (3) classifies nodule scores on a scale of 1 to 5. The newly proposed capsule-average pooling allows us to create very deep networks while performing classification.

We hypothesize that the lung nodules and visual attributes being studied are more complex in nature than handwritten digits and require a deeper hierarchical structure to better represent them. While X-Caps provides some empirical evidence that capsules can have their vectors explicitly supervised to learn specific visual attributes, we push the network deeper and study a more complex network structure, while relaxing the requirement to use only visually-explainable features in malignancy prediction. Building on the locally-constrained dynamic routing introduced by LaLonde and Bagci (2018), and with our newly proposed capsule-average pooling (CAP), we are able to create a deep network structure which we call DX-Caps, or deep explainable capsules.

The proposed DX-Caps, illustrated in Figure 3, consists of a single convolutional layer, followed by five capsule layers before branching into separate paths for predicting malignancy and visual attributes. With this structure, each visual attribute and malignancy have their own capsule types. This allows the network to encode and predict high-level visual attribute information to a greater degree, since for a given attribute each score has its own vector: the vector used for one attribute score is different from the vector used to identify another. Since these weights are shared, we force the capsule vectors to jointly learn to encode orientation information about all visual attributes in each of these capsule types, while capsules before the branching learn features relevant to both visual attribute and malignancy prediction from our MTL loss function.

For a deeper capsule network, there was a need to replace the fully-connected capsule layer used by CapsNet, which was far too memory intensive to be computationally tractable in single GPU training. To this end, we introduce a capsule-average pooling (CAP) algorithm which splits apart capsules by capsule type in a given layer and forms new capsules as the average of the vectorized activations from the previous layer. More formally, for any given layer $\ell$, there exists a set of capsule types $T^\ell = \{t_1^\ell, t_2^\ell, \ldots, t_n^\ell \mid n \in \mathbb{N}\}$, and within each capsule type there exists a grid of capsule vectors $C = \{c_{1,1}, c_{1,2}, \ldots, c_{h^\ell, w^\ell}\}$, where $h^\ell \times w^\ell$ is the spatial dimensions of the capsule type at layer $\ell$. Each $c_{i,j}$ has dimension $z^\ell$, such that $z^\ell$ is the length of the capsule vectors. Parent capsules are formed by computing the average across the spatial grid along each of the $z^\ell$ dimensions of the capsule vectors. Therefore each child capsule in $t_k^\ell \in T^\ell$ has exactly one corresponding parent capsule, where the set of parent capsules is denoted as $P = \{p_1, p_2, \ldots, p_n\}$. For each $p_k \in P$, we compute $p_k = \frac{1}{h^\ell w^\ell} \sum_{i=1}^{h^\ell} \sum_{j=1}^{w^\ell} c_{i,j}$, with the sum taken over the capsules of type $t_k^\ell$, where each $p_k$ now has dimension $1 \times 1 \times z^\ell$. A single overall parent capsule is formed by concatenating each $p_k$ to form a 2D grid of dimension $n \times z^\ell$. In the case of our proposed DX-Caps, $n$ is the number of score classes we have, i.e. five. The output is then formed as normal by computing the length of each vector in this 2D grid to arrive at the final $n$ values corresponding to our classification prediction. This formulation reduces the parameter and memory burden and allows us to create DX-Caps while still fitting into a single 12 GB GPU’s memory.
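The capsule-average pooling step described above amounts to a spatial mean per capsule type followed by a vector-length readout. A minimal NumPy sketch follows; the array layout `(types, height, width, vector_length)`, the example sizes, and the function name `capsule_average_pooling` are assumptions for illustration rather than the authors' code.

```python
import numpy as np

def capsule_average_pooling(capsules):
    """Average each capsule type over its spatial grid.

    capsules: array of shape (n_types, h, w, z), an assumed layout.
    Returns the parent capsules, shape (n_types, z), and their lengths, shape (n_types,).
    """
    capsules = np.asarray(capsules, dtype=np.float64)
    parents = capsules.mean(axis=(1, 2))        # one parent vector per capsule type
    lengths = np.linalg.norm(parents, axis=-1)  # vector length used as the class score
    return parents, lengths

# Five score classes, a 4x4 spatial grid, and capsule vectors of length 16.
grid = np.ones((5, 4, 4, 16))
parents, lengths = capsule_average_pooling(grid)
```

The operation has no trainable parameters, which is the same reason global average pooling is cheaper than a fully-connected layer in a conventional CNN.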

Uncertainty modeling of the visual scoring

All previous works in lung nodule classification follow the same strategy of averaging radiologists’ scores for visual attributes and malignancy. To better model the uncertainty inherently present in the labels due to inter-observer variation, we propose a different approach: rather than simply trying to regress the average of the values submitted by radiologists, or performing binary classification of these values rounded as above or below the score of 3, we attempt to predict the distribution of radiologists’ scores. Specifically, for a given nodule where we have at minimum three radiologists’ score values for each attribute and for malignancy prediction, we compute the mean and standard deviation of those values and fit a Gaussian function to them, which is in turn used as the ground-truth for our classification vector. Nodules with strong inter-observer agreement produce a sharp peak, in which case wrong or unsure (i.e. low confidence score) predictions are severely punished. Likewise, for nodules with low inter-observer agreement, we expect our network to output a more spread distribution, and it will be punished for strongly predicting a single class label. This proposed approach allows us to model the uncertainty present in radiologists’ labels in a way that no previous study has.
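To make the label construction above concrete, the following NumPy sketch builds a five-element target vector from a set of radiologist scores and compares it to a prediction with mean squared error. The floor on the standard deviation (needed when all radiologists agree exactly), the normalization of the target to sum to one, and the function names are assumptions added for illustration.

```python
import numpy as np

def gaussian_label(scores, classes=(1, 2, 3, 4, 5), min_std=0.25):
    """Soft target over score classes from the mean and std of radiologist scores.

    min_std is an assumed floor that avoids a zero-width Gaussian when all
    radiologists give the same score; sum-to-one normalization is also assumed.
    """
    scores = np.asarray(scores, dtype=np.float64)
    mean = scores.mean()
    std = max(scores.std(), min_std)
    classes = np.asarray(classes, dtype=np.float64)
    density = np.exp(-0.5 * ((classes - mean) / std) ** 2)
    return density / density.sum()

def mse(target, prediction):
    """Mean squared error between the target vector and the network output."""
    target = np.asarray(target, dtype=np.float64)
    prediction = np.asarray(prediction, dtype=np.float64)
    return float(np.mean((target - prediction) ** 2))

agree = gaussian_label([4, 4, 4])       # strong agreement: sharp peak at score 4
disagree = gaussian_label([2, 3, 5])    # disagreement: flatter, more spread target
confident = np.array([0.0, 0.0, 0.0, 1.0, 0.0])
```

With these targets, a confident one-hot prediction at score 4 incurs a small error on the high-agreement nodule and a larger one on the ambiguous nodule, which mirrors the penalty behaviour described in the text.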

4 Experiments and results

For our experiments, we used the publicly available LIDC-IDRI data set Armato III et al. (2011). Five-fold stratified cross-validation was performed to split the nodules into training and testing sets, with of each training set set aside for validation and early stopping. All models were trained using Adam Kingma and Ba (2014) with an initial learning rate of reduced by a factor of after validation loss plateau. All code is implemented in Keras with TensorFlow backend support and will be made publicly available. Consistent with the literature, predictions were considered correct if within 1.0 of the radiologists’ average score.
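The correctness criterion stated above (a prediction counts as correct if it is within 1.0 of the radiologists' average score) can be written as a short NumPy function. Treating the boundary as inclusive and the function name `within_one_accuracy` are assumptions; the scores in the example are made up for illustration.

```python
import numpy as np

def within_one_accuracy(predicted_scores, radiologist_scores, tolerance=1.0):
    """Fraction of nodules whose prediction lies within `tolerance` of the mean score.

    radiologist_scores: one list of scores per nodule (lists may differ in length).
    The comparison is inclusive (<=), which is an assumption.
    """
    predictions = np.asarray(predicted_scores, dtype=np.float64)
    means = np.array([np.mean(scores) for scores in radiologist_scores])
    return float(np.mean(np.abs(predictions - means) <= tolerance))

# Three hypothetical nodules: the second prediction is off by more than 1.0.
predictions = [3.0, 5.0, 1.0]
annotations = [[3, 4, 4], [2, 3, 3], [1, 2, 2]]
accuracy = within_one_accuracy(predictions, annotations)
```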

The experimental results summarized in Table 1 illustrate the prediction of visual attributes with the proposed X-Caps and DX-Caps in comparison with an adapted version of CapsNet and a deep multi-path dense 3D CNN (HSCNN Shen et al. (2019)). To the best of our knowledge, HSCNN is the only other work in the literature which presents attribute-level results pursuant to learning interpretable features through the modeling of high-level visual attributes for lung cancer diagnosis. DX-Caps outperformed baseline CapsNet as well as X-Caps in predicting both malignancy and visual attribute scores. While HSCNN slightly outperformed DX-Caps in malignancy prediction, it performed significantly worse than DX-Caps on average at attribute prediction, which is the main focus for explainability of predictions to radiologists. Experimental results support our hypothesis that a deep capsule network, through the aid of the introduced capsule-average pooling, can model visual attributes better than a baseline X-Caps, CapsNet, and a dense CNN.

Prediction Accuracy Capsule Networks CNNs
Attributes CapsNet X-Caps DX-Caps HSCNN Shen et al. (2019)
subtlety -
sphericity -
margin -
lobulation - -
spiculation - -
texture -
Table 1: Prediction accuracy of visual attribute learning with capsule networks. While HSCNN, a multi-path dense 3D CNN, predicts malignancy at a slightly higher accuracy, it performs significantly worse than DX-Caps at attribute prediction, which is the main focus for explainability. It is worth noting that margin and lobulation have a lower correlation with malignancy than nearly all other attributes Li et al. (2017), and this might in part explain the lower accuracy of DX-Caps on these attributes.

5 Discussions and concluding remarks

Deep learning-generated predictions are mostly black-box in nature and not explainable; hence, they are not trusted by healthcare specialists. Available studies for explaining DL models typically focus on post hoc interpretations of trained networks, rather than attempting to build in explainability. This is the first study, to the best of our knowledge, for learning to encode high-level visual attributes from radiologists within the vectors of a capsule-based network to perform explainable image-based diagnosis. We simultaneously approximate visually-interpretable attributes along with malignancy predictions through individual capsule types in order to explain these malignancy predictions in the same language as radiologists. The results of our study show that the proposed deep explainable capsule architecture, DX-Caps, made possible by introducing a capsule-average pooling function, approximated visual attribute scores far better than a deep multi-path dense 3D CNN. We also implemented a version of CapsNet for lung cancer diagnosis for the first time in the literature, and a shallow version of our explainable capsule network, X-Caps, with both achieving inferior performance compared with DX-Caps. As the field of capsule networks progresses and advancements similar to those made with CNNs (e.g. residual/dense connections, batch/group normalization) are adopted, deeper and more powerful capsule networks can be created to boost performance even further.


  • [1] P. Afshar, A. Mohammadi, and K. N. Plataniotis (2018) Brain tumor type classification via capsule networks. In 2018 25th IEEE International Conference on Image Processing (ICIP), pp. 3129–3133. Cited by: §2.
  • [2] S. Armato III, G. McLennan, L. Bidaut, M. F. McNitt-Gray, C. R. Meyer, A. P. Reeves, B. Zhao, D. R. Aberle, C. I. Henschke, E. A. Hoffman, et al. (2011) The Lung Image Database Consortium (LIDC) and Image Database Resource Initiative (IDRI): a completed reference database of lung nodules on CT scans. Medical Physics 38 (2), pp. 915–931. Cited by: §2, §3, §4.
  • [3] D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3319–3327. ISSN 1063-6919. Cited by: §2.
  • [4] J. Bloomberg (11.16.2018) Don’t Trust Artificial Intelligence? Time To Open The AI ‘Black Box’. Forbes Magazine. Note: http://www.forbes.com/sites/jasonbloomberg/2018/09/16/dont-trust-artificial-intelligence-time-to-open-the-ai-black-box/#6ceaf3793b4a Cited by: §1.
  • [5] M. Buty, Z. Xu, M. Gao, U. Bagci, A. Wu, and D. J. Mollura (2016) Characterization of lung nodule malignancy using hybrid shape and appearance features. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 662–670. Cited by: §1.1, §1.3, §2.
  • [6] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel (2016) Infogan: interpretable representation learning by information maximizing generative adversarial nets. In Advances in neural information processing systems, pp. 2172–2180. Cited by: §2.
  • [7] R. Dey, Z. Lu, and Y. Hong (2018) Diagnostic classification of lung nodules using 3d neural networks. In 2018 IEEE 15th International Symposium on Biomedical Imaging (ISBI 2018), pp. 774–778. Cited by: §2.
  • [8] J. Ding, A. Li, Z. Hu, and L. Wang (2017) Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 559–567. Cited by: §1.1.
  • [9] Q. Dou, H. Chen, L. Yu, J. Qin, and P. Heng (2017) Multilevel contextual 3-d cnns for false positive reduction in pulmonary nodule detection. IEEE Transactions on Biomedical Engineering 64 (7), pp. 1558–1567. Cited by: §1.1.
  • [10] K. Duarte, Y. Rawat, and M. Shah (2018) Videocapsulenet: a simplified network for action detection. In Advances in Neural Information Processing Systems, pp. 7610–7619. Cited by: §2.
  • [11] National Center for Health Statistics (US) (2017) Health, United States, 2016: with chartbook on long-term trends in health. National Center for Health Statistics (US). Cited by: §1.1.
  • [12] R. Girshick, J. Donahue, T. Darrell, and J. Malik (2014) Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 580–587. Cited by: §2.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in neural information processing systems, pp. 2672–2680. Cited by: §2.
  • [14] N. Howlader, A. Noone, M. Krapcho, D. Miller, K. Bishop, S. Altekruse, C. Kosary, M. Yu, J. Ruhl, Z. Tatalovich, A. Mariotto, D. Lewis, H. Chen, E. Feuer, and K. Cronin (04.2018) SEER Cancer Statistics Review, 1975-2013, National Cancer Institute. Note: https://seer.cancer.gov/archive/csr/1975_2013/ Cited by: §1.1.
  • [15] X. Huang, J. Shan, and V. Vaidya (2017) Lung nodule detection in ct using 3d convolutional neural networks. In Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on, pp. 379–383. Cited by: §1.1.
  • [16] S. Hussein, K. Cao, Q. Song, and U. Bagci (2017) Risk stratification of lung nodules using 3d cnn-based multi-task learning. In International Conference on Information Processing in Medical Imaging, pp. 249–260. Cited by: §1.1, §2.
  • [17] S. Hussein, R. Gillies, K. Cao, Q. Song, and U. Bagci (2017) Tumornet: lung nodule characterization using multi-view convolutional neural network with gaussian process. In Biomedical Imaging (ISBI 2017), 2017 IEEE 14th International Symposium on, pp. 1007–1010. Cited by: §1.1, §1.3, §2.
  • [18] T. Iesmantas and R. Alzbutas (2018) Convolutional capsule network for classification of breast cancer histology images. In International Conference Image Analysis and Recognition, pp. 853–860. Cited by: §2.
  • [19] A. Karpathy, G. Toderici, S. Shetty, T. Leung, R. Sukthankar, and L. Fei-Fei (2014) Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp. 1725–1732. Cited by: §2.
  • [20] N. Khosravan and U. Bagci (2018) S4ND: single-shot single-scale lung nodule detection. MICCAI. Cited by: §1.1.
  • [21] N. Khosravan, H. Celik, B. Turkbey, E. Jones, B. Wood, and U. Bagci (2019) A collaborative computer aided diagnosis (c-cad) system with eye-tracking, sparse attentional model, and deep learning. Medical Image Analysis. Cited by: §1.1.
  • [22] P. Kindermans, K. T. Schütt, M. Alber, K. Müller, D. Erhan, B. Kim, and S. Dähne (2018) Learning how to explain neural networks: patternnet and patternattribution. In International Conference on Learning Representations (ICLR), Cited by: §2.
  • [23] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.
  • [24] A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Citeseer. Cited by: §1.2.
  • [25] D. Kumar, A. Wong, and G. W. Taylor (2017) Explaining the unexplained: a class-enhanced attentive response (clear) approach to understanding deep neural networks. In IEEE Computer Vision and Pattern Recognition (CVPR) Workshop, Cited by: §2.
  • [26] H. Lakkaraju, E. Kamar, R. Caruana, and E. Horvitz (2017) Identifying unknown unknowns in the open world: representations and policies for guided exploration.. In AAAI, pp. 2124–2132. Cited by: §2.
  • [27] R. LaLonde and U. Bagci (2018) Capsules for object segmentation. arXiv preprint arXiv:1804.04241. Cited by: §1.3, §3.2.
  • [28] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §1.2.
  • [29] M. Lehnis (15.06.2018) Can We Trust AI If We Don’t Know How It Works? BBC News. Note: http://www.bbc.com/news/business-44466213 Cited by: §1.
  • [30] X. Li, Y. Kao, W. Shen, X. Li, and G. Xie (2017) Lung nodule malignancy prediction using multi-task convolutional neural network. In Medical Imaging 2017: Computer-Aided Diagnosis, Vol. 10134, pp. 1013424. Cited by: Table 1.
  • [31] A. Mahendran and A. Vedaldi (2015) Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 5188–5196. Cited by: §2.
  • [32] H. M. Marshall, R. V. Bowman, I. A. Yang, K. M. Fong, and C. D. Berg (2013) Screening for lung cancer with low-dose computed tomography: a review of current status. Journal of thoracic disease 5 (Suppl 5), pp. S524. Cited by: §1.1.
  • [33] A. Mobiny and H. Van Nguyen (2018) Fast capsnet for lung cancer screening. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 741–749. Cited by: §2.
  • [34] A. Nibali, Z. He, and D. Wollersheim (2017) Pulmonary nodule classification with deep residual networks. International journal of computer assisted radiology and surgery 12 (10), pp. 1799–1808. Cited by: §2.
  • [35] W. Nie, Y. Zhang, and A. Patel (2018) A theoretical explanation for perplexing behaviors of backpropagation-based visualizations. In International Conference on Machine Learning, pp. 3806–3815. Cited by: §2.
  • [36] V. Polonski (10.01.2018) People Don’t Trust AI–Here’s How We Can Change That. Scientific American. Note: http://www.scientificamerican.com/article/people-dont-trust-ai-heres-how-we-can-change-that/ Cited by: §1.
  • [37] S. Sabour, N. Frosst, and G. E. Hinton (2017) Dynamic routing between capsules. In Advances in Neural Information Processing Systems, pp. 3856–3866. Cited by: Encoding High-Level Visual Attributes in Capsules for Explainable Medical Diagnoses, §1.2, §2.
  • [38] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra (2017) Grad-cam: visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision, pp. 618–626. Cited by: §2.
  • [39] A. A. A. Setio, F. Ciompi, G. Litjens, P. Gerke, C. Jacobs, S. J. van Riel, M. M. W. Wille, M. Naqibullah, C. I. Sánchez, and B. van Ginneken (2016) Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE transactions on medical imaging 35 (5), pp. 1160–1169. Cited by: §1.1.
  • [40] A. A. A. Setio, A. Traverso, T. De Bel, M. S. Berens, C. van den Bogaard, P. Cerello, H. Chen, Q. Dou, M. E. Fantacci, B. Geurts, et al. (2017) Validation, comparison, and combination of algorithms for automatic detection of pulmonary nodules in computed tomography images: the luna16 challenge. Medical image analysis 42, pp. 1–13. Cited by: §1.1.
  • [41] S. Shen, S. X. Han, D. R. Aberle, A. A. Bui, and W. Hsu (2019) An interpretable deep hierarchical semantic convolutional neural network for lung nodule malignancy classification. Expert Systems with Applications. Cited by: Encoding High-Level Visual Attributes in Capsules for Explainable Medical Diagnoses, §2, Table 1, §4.
  • [42] W. Shen, M. Zhou, F. Yang, C. Yang, and J. Tian (2015) Multi-scale convolutional neural networks for lung nodule classification. In International Conference on Information Processing in Medical Imaging, pp. 588–599. Cited by: §2.
  • [43] W. Shen, M. Zhou, F. Yang, D. Yu, D. Dong, C. Yang, Y. Zang, and J. Tian (2017) Multi-crop convolutional neural networks for lung nodule malignancy suspiciousness classification. Pattern Recognition 61, pp. 663–673. Cited by: §2.
  • [44] Y. Shen and M. Gao (2018) Dynamic routing on deep neural network for thoracic disease classification and sensitive area localization. In International Workshop on Machine Learning in Medical Imaging, pp. 389–397. Cited by: §2.
  • [45] H. Shin, H. R. Roth, M. Gao, L. Lu, Z. Xu, I. Nogues, J. Yao, D. Mollura, and R. M. Summers (2016) Deep convolutional neural networks for computer-aided detection: cnn architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35 (5), pp. 1285–1298. Cited by: §1.1.
  • [46] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller (2014) Striving for simplicity: the all convolutional net. arXiv preprint arXiv:1412.6806. Cited by: §2, §2.
  • [47] National Lung Screening Trial Research Team (2011) Reduced lung-cancer mortality with low-dose computed tomographic screening. New England Journal of Medicine 365 (5), pp. 395–409. Cited by: §1.1.
  • [48] D. F. Yankelevitz and J. P. Smith (2013) Understanding the core result of the national lung screening trial. New England Journal of Medicine 368 (15), pp. 1460–1461. Cited by: §1.1.
  • [49] M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Cited by: §2, §2.
  • [50] Q. Zhang, R. Cao, F. Shi, Y. N. Wu, and S. Zhu (2018) Interpreting cnn knowledge via an explanatory graph. In AAAI, Cited by: §2.
  • [51] W. Zhu, C. Liu, W. Fan, and X. Xie (2018) Deeplung: deep 3d dual path nets for automated pulmonary nodule detection and classification. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pp. 673–681. Cited by: §2.