Assessment of Breast Cancer Histology using Densely Connected Convolutional Networks
Breast cancer is the most frequently diagnosed cancer and the leading cause of cancer-related death among females worldwide. In this article, we investigate the applicability of densely connected convolutional neural networks to the problems of histology image classification and whole slide image segmentation in the area of computer-aided diagnosis for breast cancer. To this end, we study various approaches for transfer learning and apply them to the data set from the 2018 grand challenge on breast cancer histology images (BACH).
Keywords: digital pathology, breast cancer, deep learning
This work presents approaches for the classification of microscopy images as well as the segmentation of whole slide images (WSIs) in the area of computer-aided diagnosis for breast cancer. In particular, it describes how the recently introduced densely connected convolutional neural networks [13] can be applied to the aforementioned tasks on data from the 2018 grand challenge on Breast Cancer Histology Images (BACH).
1.0.1 Clinical Background
According to the global cancer statistics 2012 [24], breast cancer is the most frequently diagnosed cancer and the leading cause of cancer-related death among females worldwide, with an estimated 1.6 million new cases and over 0.5 million deaths per year. With tumor stage remaining the most important determinant of the outcome [27], early detection of breast cancer is crucial for reducing mortality rates. Among other factors, such as patient age, axillary lymph node status, tumor size, hormone receptor status, and HER2 status, histological features play an important role in categorizing patients with invasive breast cancer in order to assess prognosis and determine the appropriate therapy [20]. While the importance of histomorphological grading for breast cancer was acknowledged almost 30 years ago [10], the computerized assessment of histological features has become increasingly popular during the last decade. Tumor grading in breast cancer is typically based on the following three criteria suggested by Elston and Ellis [10]:
mitotic activity as a measure of cellular proliferation,
nuclear pleomorphism, i.e. how different the tumor cells are in comparison to normal cells, and
glandular and tubular differentiation, i.e. how well the tumor resembles normal structures.
Current developments in the area of digital pathology are driven by the observation that genetic and phenotypic intra-tumor heterogeneity have a direct impact on both diagnosis and disease management [18], as well as by the availability of effective machine learning techniques, such as deep convolutional neural networks. In particular, the segmentation of WSIs, i.e. the second part of the BACH challenge, plays an increasingly important role as it facilitates not only a standardized assessment of resection margins, but also novel scoring approaches, such as the ImmunoScore [11], and a better understanding of tumor heterogeneity and micro-environment, e.g. via phenotype-guided genetic readouts.
1.0.2 Related Work
Related work in digital pathology can be categorized into approaches that focus on one of the three aforementioned criteria for breast cancer grading and approaches for WSI segmentation. In the following discussion, we focus on the most recent deep-learning-based approaches for breast cancer and breast cancer metastases. For a more exhaustive overview, we refer the interested reader to recent overview articles, such as  or .
Possibly the largest class of methods focuses on the computational assessment of mitotic activity. This field has been extensively promoted by the recent success of deep-learning-based approaches, starting with the seminal work of Cireşan et al. [6]. We refer the interested reader to the review paper of Veta et al. [25] for an overview of methods for mitosis detection until 2015, and specifically mention the more recent works on leveraging the potential of crowdsourcing for training deep networks [1], on deep regression networks [5], and on deep residual Hough voting [28]. The next category comprises methods aiming at cellular or nuclear features. Recent examples include works on stacked sparse autoencoders for nuclei detection [29] and on hierarchical learning [14]. Regarding the assessment of glandular and tubular structures, there are only a few works in the field of breast cancer, such as [2] or [9]. However, for a general overview on the computational assessment of relevant pathological structures and primitives, we refer the interested reader to .
In contrast to the approaches for particular histopathological tasks, there is a group of methods that aim at the classification of whole tissue regions or at WSI segmentation, which requires learning features at both the cellular and the structural level. A good example is the recent work of the BACH challenge organizers presenting a classification method for Hematoxylin- and Eosin-stained (HE-stained) histological images from breast cancer patients, cf. Araújo et al. [3].
Regarding WSI segmentation, there exists a series of methods related to the recent challenges on cancer metastasis detection in lymph nodes (CAMELYON16 & CAMELYON17). Notable works using the associated data sets are the ones of the organizers [16, 4], as well as the works of Wang et al. [26] and Liu et al. [17]. Conceptually, these approaches are also comparable to the recent works of Su et al. [22] and Cruz-Roa et al. [7].
1.0.3 Contributions and Organization
We participated in the BACH challenge due to our interest in the learning and integration of features from multiple levels and their application to WSI segmentation, particularly in the case of small data sets. As several contributions for the CAMELYON challenges were based on the popular Inception-v3 architecture proposed by Szegedy et al. [23], we wanted to assess the performance of another recently published and very promising architecture, i.e. the densely connected convolutional networks proposed by Huang et al. [13]. As the data set of the challenge is too small for training such large architectures from scratch, we investigated two approaches for transfer learning: one based on weights obtained from training on ImageNet [8] and one based on weights obtained from training the network on data from the CAMELYON challenges. Both approaches are described in Sec. 2. The evaluation of these two approaches for both sub-challenges is described in Sec. 3 and discussed in Sec. 4, before we conclude this paper with Sec. 5.
The BACH challenge comprises two sub-challenges, i.e. classification of histology images (part A) and segmentation of WSIs (part B). In order to achieve these goals, we train classifiers to predict the correct label for a given microscopic image (in case of part A) or a patch extracted from a WSI (in case of part B). Each label takes a value in {0, 1, 2, 3}, where the numeric value encodes one of the four class labels: normal (0), benign (1), carcinoma in situ (2), invasive carcinoma (3). As explained later in Sec. 2.2 and Sec. 2.3, these classifiers are implemented via densely connected convolutional neural networks (DenseNets).
2.1 Pre-training on CAMELYON data
As the size of both data sets for part A and part B is too small for training deep networks from scratch, we decided to employ transfer learning with pre-trained networks. Besides using a network which has been pre-trained on ImageNet data [8], we also investigated the possibility of using a network pre-trained on data from the two CAMELYON challenges . As of now, the data from these challenges consists of approximately 691 WSIs, of which 210 are tumor cases. The tumor cases contain metastases of breast cancer in lymph nodes, ranging from large metastatic areas down to individual cancerous cells in lymph node tissue. All non-tumor WSIs are control cases exhibiting no pathological findings.
Prior to patch extraction, we subtracted the background as described in  to ensure that only patches from foreground regions are sampled. To speed up this process for the large CAMELYON data set, background subtraction was done at a resolution level where each extracted patch is represented by a single pixel with a value obtained through interpolation. We then covered the entire WSI with a regular grid of patch center points and extracted each patch along with the label found at its grid center point. This ensures that the extracted patches exhibit a random degree of class overlap, which should help to reduce over-fitting on large homogeneous regions. After downscaling all extracted patches to match the physical resolution of the BACH data set, we obtained in total 274,272 image patches of physical size at pixels from both CAMELYON data sets. For pre-training the network, we randomly split the data with a ratio of 80% and 20% for training and validation, respectively, making sure that data from all sub-groups of the two challenges is equally represented in training and validation. This way, we obtained 119,705 normal and 101,347 invasive patches for training and 30,240 normal and 22,980 invasive patches for validation.
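The grid-based sampling described above can be illustrated with a minimal sketch. It assumes a toy annotation mask stored as a nested list; the helper names `grid_centers` and `label_patches`, the patch size and the stride are hypothetical placeholders and not the values used for the CAMELYON data.

```python
def grid_centers(width, height, patch_size, stride):
    """Return (x, y) centers of a regular grid whose patches lie inside the image.

    Assumes an even patch_size, so a patch spans [c - half, c + half).
    """
    half = patch_size // 2
    xs = range(half, width - half + 1, stride)
    ys = range(half, height - half + 1, stride)
    return [(x, y) for y in ys for x in xs]


def label_patches(mask, patch_size, stride):
    """Pair every grid center with the label found at that pixel of `mask`."""
    height, width = len(mask), len(mask[0])
    centers = grid_centers(width, height, patch_size, stride)
    return [((x, y), mask[y][x]) for x, y in centers]


# Toy 6x6 annotation mask: 0 = normal (left half), 1 = tumor (right half).
mask = [[0, 0, 0, 1, 1, 1] for _ in range(6)]
patches = label_patches(mask, patch_size=2, stride=2)

# The patch centered at x = 3 covers columns 2 and 3, i.e. both classes,
# but it is labeled by its center pixel only.
for center, label in patches:
    print(center, label)
```

Because the label is taken at the center point, patches near a class boundary contain a varying share of the neighboring class, which is the effect the text refers to.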
For pre-training the network, we used a uniform Xavier initialization and trained the network for 90 epochs starting with a learning rate of , which is decreased by a factor of every 20 epochs. The employed data augmentation strategy is identical to the one used for fine tuning, which is described in Sec. 2.3.
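As a rough illustration of the two ingredients named above, the following sketch computes the bound of a uniform Xavier (Glorot) initialization and a step-wise learning rate schedule. The initial learning rate and the decay factor are not preserved in this text, so the numbers below are placeholders chosen only to make the example runnable; the function names are hypothetical.

```python
import math


def xavier_uniform_limit(fan_in, fan_out):
    """Bound a of the uniform distribution U(-a, a) used by Xavier initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))


def step_decay(epoch, initial_lr, factor, step=20):
    """Learning rate for a 0-based epoch when the rate is divided by `factor`
    once every `step` epochs."""
    return initial_lr / factor ** (epoch // step)


# Placeholder values (assumptions): the actual rate and factor are elided in the source.
schedule = [step_decay(epoch, initial_lr=0.1, factor=10.0) for epoch in range(90)]

print(schedule[0], schedule[20], schedule[40])
```

With 90 epochs and a step of 20 epochs, the rate is reduced four times, after epochs 20, 40, 60 and 80.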
2.2 Classification of Histological Images (A)
2.2.1 Data Preparation, Scale Selection and Augmentation
The data of part A consists of 400 images of size pixels with a pixel resolution of . Each image was assigned one of the four aforementioned classes, provided that two medical experts agreed on the predominant type of cancer in the image. For classification, we rescaled all images by a factor of 10 resulting in images of size pixels.
It is important to note that this data set contains multiple subsets of images acquired from the same patient. In order to evaluate a classification method in a clinically correct way, it is essential to prevent images from the same patient from being present in both the training and the validation set. Due to the limited amount of data, however, we decided to explicitly drop this constraint and performed a 5-fold cross-validation with a data distribution of 80% and 20% for training and validation, respectively. This way, we wanted to ensure that the network has seen the maximum variability of the data during training.
As pathological images do not have a canonical orientation, we used arbitrary rotations as well as horizontal and vertical flips for data augmentation. In order to achieve slight scale-invariance, we also used random scale changes in the range . To achieve robustness with respect to the spatial location of features, we employed random shifting of up to 50% of the width and height for each image. Pixels outside of the range of the original image are replaced by their nearest neighbors inside the image. Finally, we normalized each image by the mean and the standard deviation of all images in the data set.
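A minimal sketch of such an image-level 5-fold split is given below. It deliberately ignores patient identity, mirroring the choice described above; the function name `k_fold_indices` and the seed are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Return k (train, validation) index splits; every sample is validated once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint folds
    splits = []
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, validation))
    return splits


# 400 images as in part A: each split has 320 training and 80 validation images.
splits = k_fold_indices(400, k=5)
for train, validation in splits:
    print(len(train), len(validation))
```

A patient-wise evaluation would instead assign all images of one patient to the same fold, which requires patient identifiers that are not used here.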
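The following sketch illustrates part of this augmentation on a toy image stored as a nested list: flips, a shift whose missing pixels are filled with their nearest in-image neighbors, and normalization. Arbitrary rotations and scale changes are omitted for brevity, and all function names are hypothetical; this is not the augmentation code used for the experiments.

```python
import random


def shift_nearest(image, dx, dy):
    """Shift a 2D image by (dx, dy); pixels from outside the image are
    replaced by the nearest pixel inside the image."""
    height, width = len(image), len(image[0])

    def clamp(value, upper):
        return max(0, min(value, upper))

    return [
        [image[clamp(y - dy, height - 1)][clamp(x - dx, width - 1)] for x in range(width)]
        for y in range(height)
    ]


def flip_horizontal(image):
    return [row[::-1] for row in image]


def flip_vertical(image):
    return image[::-1]


def normalize(image, mean, std):
    """Normalize with the mean and standard deviation of the whole data set."""
    return [[(pixel - mean) / std for pixel in row] for row in image]


def random_augment(image, rng, max_shift=0.5):
    """Random flips followed by a random shift of up to `max_shift` of each side."""
    height, width = len(image), len(image[0])
    if rng.random() < 0.5:
        image = flip_horizontal(image)
    if rng.random() < 0.5:
        image = flip_vertical(image)
    max_dx, max_dy = int(width * max_shift), int(height * max_shift)
    return shift_nearest(image, rng.randint(-max_dx, max_dx), rng.randint(-max_dy, max_dy))


image = [[1, 2, 3], [4, 5, 6]]
shifted = shift_nearest(image, dx=1, dy=0)
augmented = random_augment(image, random.Random(0))
print(shifted)
```

Shifting by one pixel to the right duplicates the left-most column, which corresponds to the nearest-neighbor replacement described in the text.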
2.2.2 Network Architecture and Training
We used a DenseNet-161 architecture as proposed by Huang et al. [13], which generalizes the concept of residual learning introduced by He et al. [12]. The architecture consists of seven stages, six spatial reduction stages and one classifier stage. Each spatial reduction uses a stride of . The first two stages consist of a single convolution (kernel size ) and a single max pooling (kernel size ), respectively. The next four stages consist of densely connected convolutions (kernel sizes and ) followed by a full max pooling. The head of the classifier consists of a global average pooling of the spatial feature map and a single fully-connected layer. We employ a categorical cross-entropy loss and retain the weights with the highest classification accuracy on the validation set. Neither dropout nor weight decay was used.
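To make the effect of dense connectivity concrete, the sketch below tracks the number of feature channels through the dense blocks. The hyper-parameters (96 initial features, growth rate 48, block sizes 6, 12, 36 and 24, compression 0.5) are those commonly associated with DenseNet-161 as defined by Huang et al. and are not stated in this text, so they should be read as assumptions; the function name is hypothetical.

```python
def densenet_feature_counts(init_features, growth_rate, block_config, compression=0.5):
    """Channel count at the end of each dense block.

    Inside a block every layer adds `growth_rate` channels, because its output
    is concatenated with all previous feature maps. Between blocks, a
    transition layer compresses the channel count.
    """
    channels = init_features
    counts = []
    for i, num_layers in enumerate(block_config):
        channels += num_layers * growth_rate
        counts.append(channels)
        if i != len(block_config) - 1:
            channels = int(channels * compression)
    return counts


counts = densenet_feature_counts(96, 48, (6, 12, 36, 24))
print(counts)  # [384, 768, 2112, 2208]
```

Under these assumptions, the global average pooling in the classifier head operates on 2208 feature maps, which are then mapped to the four class scores by the fully-connected layer.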
For transfer learning, we first trained only the fully connected layer for 25 epochs with a learning rate of in order to avoid over-fitting. Next, we trained the whole network for 250 epochs with a learning rate of . One epoch consists of all possible batches of size 32 of the training data. The training data was randomly shuffled between each epoch.
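The two-phase schedule can be sketched as follows. The example only counts weight updates and does not train a network; it assumes 320 training images per fold (80% of 400) and interprets "all possible batches" as the non-overlapping batches that cover the shuffled training set once. The name `epoch_batches` is hypothetical.

```python
import random


def epoch_batches(n_samples, batch_size, rng):
    """Shuffle the sample indices and cut them into consecutive mini-batches."""
    indices = list(range(n_samples))
    rng.shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]


# Two fine-tuning phases from the text: (trainable part, number of epochs).
phases = [("fully connected layer only", 25), ("whole network", 250)]

rng = random.Random(0)
n_train = 320  # assumption: 80% of the 400 images in one cross-validation fold
updates = 0
for trainable, epochs in phases:
    for _ in range(epochs):
        batches = epoch_batches(n_train, 32, rng)  # reshuffled for every epoch
        updates += len(batches)

print(updates)  # 2750
```

Training only the classifier head first keeps the pre-trained features intact while the randomly initialized layer settles, which is the usual motivation for such a warm-up phase.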
2.3 Segmentation of Whole Slide Images (B)
2.3.1 Data Preparation, Scale Selection and Augmentation
The data of part B consists of 30 whole slide images of which only 10 are annotated with regard to the aforementioned tissue classes. All WSIs have a spatial resolution of per pixel.
From the ten annotated WSIs, we extracted patches of physical size into pixels, corresponding to a down-sampling factor of . We followed a similar procedure as described in Sec. 2.1. The only difference is that the background subtraction is done at the patch level, with a patch being considered background if at least 80% of its pixels are considered background. We obtained a total of 24,406 patches, with 13,280 being labeled normal, 903 benign, 354 in situ, and 9,869 invasive. Due to the very limited amount of data in the benign and the carcinoma in situ classes, we refrained from splitting training and validation data according to the individual WSIs and instead performed a random split, as done for part A of the challenge. In contrast to part A, we did not perform a cross-validation. Again, our motivation was to expose the maximum variability to the network during training.
We employed a similar strategy for data augmentation as described in Sec. 2.2. The main difference is that missing pixels are replaced by the actual pixels from the larger image, except on the borders of the WSI, where they are replaced by their nearest neighbors. Finally, in order to achieve robustness with respect to color perturbations, introduced for example by varying staining conditions, we employed the color augmentation procedure suggested by . All other augmentation parameters are kept as in Sec. 2.2.
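The patch-level background criterion and the resulting class imbalance can be illustrated with a short sketch. The binary background mask is assumed to be given (how it is computed is not covered here), and the function name is hypothetical; the class counts are the ones reported above.

```python
def is_background_patch(background_mask, threshold=0.8):
    """A patch counts as background if at least `threshold` of its pixels do."""
    pixels = [bool(pixel) for row in background_mask for pixel in row]
    return sum(pixels) >= threshold * len(pixels)


# Toy masks with 10 pixels each: 1 = background pixel, 0 = tissue pixel.
mostly_background = [[1] * 9 + [0]]
mostly_tissue = [[1] * 7 + [0] * 3]
print(is_background_patch(mostly_background), is_background_patch(mostly_tissue))

# Class distribution of the extracted patches as reported in the text.
counts = {"normal": 13280, "benign": 903, "in situ": 354, "invasive": 9869}
total = sum(counts.values())
shares = {label: count / total for label, count in counts.items()}
print(total, shares)
```

The shares make the imbalance explicit: the benign and in situ classes together account for only a small fraction of all patches.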
2.3.2 Additional Data
To further alleviate the data shortage, we partially annotated 16 of the 20 originally non-annotated WSIs. This was done with the help of a trained pathologist. In particular, we aimed at reducing the problem of imbalanced classes and specifically annotated regions containing benign malformations and carcinoma in situ.
After performing the data extraction again as described above, we extracted a total of 41,506 image patches, with 25,230 normal, 1,723 benign, 1,759 in situ and 12,794 invasive tissue regions.
2.3.3 Network Architecture and Training
We used the same DenseNet-161 architecture as in part A, as described in Sec. 2.2.2. For transfer learning, we first trained only the fully connected layers for 6 epochs with a learning rate of in order to avoid over-fitting. Next, we trained the whole network for 60 epochs with a learning rate of and 40 epochs with a learning rate of . In order to compensate for the highly imbalanced classes, we employed log-balanced class weights, i.e. the weight w_i for class i is defined as w_i = log(N / n_i), where N denotes the number of all training patches and n_i the number of training patches belonging to class i.
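A minimal sketch of log-balanced class weights is given below. The exact formula is not preserved in this text; the sketch assumes the weight of a class to be the logarithm of the total number of patches divided by the number of patches in that class, which is one natural reading of "log-balanced". For illustration it uses the patch counts of the extended data set rather than the counts of the training split, and the function name is hypothetical.

```python
import math


def log_balanced_weights(class_counts):
    """Weight per class: log(total / count). This formula is an assumption."""
    total = sum(class_counts.values())
    return {label: math.log(total / count) for label, count in class_counts.items()}


# Patch counts of the extended data set as reported in the text.
counts = {"normal": 25230, "benign": 1723, "in situ": 1759, "invasive": 12794}
weights = log_balanced_weights(counts)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}: {weight:.3f}")
```

Compared with plain inverse-frequency weights, the logarithm dampens the weight of very rare classes, so that a handful of patches does not dominate the loss.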
2.3.4 Patch-based Segmentation and Post-Processing
In order to produce a segmentation of a full WSI, we first down-scale the WSI to obtain the expected resolution of the classifier. We then classify every grid center point of a grid with cell size pixels. In total, the down-sampling factor is approximately .
The resulting label image is then post-processed in two steps. First, a median filter smooths the segmentation. Second, a small dilation is applied to all non-normal classes (overlapping classes are resolved in the order benign, in situ, invasive) in order to slightly decrease the false-negative rate and to slightly enlarge the tumor regions again after the median filter has shrunk them.
We conducted several experiments for both parts of the challenge on a dedicated workstation with an Intel i7-6850K processor, 64 GB RAM and two NVIDIA GeForce GTX 1080 Ti graphics cards. As an operating system we used Ubuntu 16.04 LTS with Docker and NVIDIA-Docker. For implementing the network architecture and conducting the training we used Python 2.7.12 and Keras 2.1.3 with the TensorFlow 1.4.0 backend (official TensorFlow Docker image). Training time on this machine (using one GPU) was around 10 hours for pre-training on the CAMELYON data set and between 7 and 9 hours for transfer learning. The inference time per image or patch is around and for a full WSI is around .
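The post-processing can be sketched on a small label image as follows. The filter sizes are placeholders, and the dilation is implemented as a maximum filter under the assumption that classes later in the order benign, in situ, invasive (encoded as 1, 2, 3) override earlier ones. The function names are hypothetical, and the sketch is not the code used for the submission.

```python
def _window(labels, y, x, radius):
    """Values in the square neighborhood of (y, x); borders are clamped."""
    height, width = len(labels), len(labels[0])
    return [
        labels[min(max(y + dy, 0), height - 1)][min(max(x + dx, 0), width - 1)]
        for dy in range(-radius, radius + 1)
        for dx in range(-radius, radius + 1)
    ]


def median_filter(labels, radius=1):
    """Replace every label by the median of its neighborhood."""
    middle = (2 * radius + 1) ** 2 // 2
    return [
        [sorted(_window(labels, y, x, radius))[middle] for x in range(len(labels[0]))]
        for y in range(len(labels))
    ]


def dilate_non_normal(labels, radius=1):
    """Grow non-normal regions; the highest class code in the neighborhood wins
    (normal 0 < benign 1 < in situ 2 < invasive 3)."""
    return [
        [max(_window(labels, y, x, radius)) for x in range(len(labels[0]))]
        for y in range(len(labels))
    ]


# 5x5 label image with a 3x3 invasive region (3) surrounded by normal tissue (0).
labels = [[3 if 1 <= y <= 3 and 1 <= x <= 3 else 0 for x in range(5)] for y in range(5)]
smoothed = median_filter(labels)       # rounds off the corners of the region
final = dilate_non_normal(smoothed)    # grows the region again

for row in final:
    print(row)
```

In this toy example the median filter removes the corners of the square region, and the subsequent dilation restores and slightly extends it, which matches the motivation given in the text.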
3.1 Classification of Histological Images (A)
The results of our experiments for part A of the BACH challenge are summarized in Tab. 1. In lines one and two, we report the achieved accuracies for baseline approaches based on the VGG-19 and Inception-v3 architectures [21, 23], which were trained using the same hyper-parameter settings as the DenseNet-161 architectures (cf. Sec. 2.2), but with less aggressive data augmentation. The performed experiments show that the baseline architectures exhibit worse performance in our setting. Although we did not perform a grid search for hyper-parameter tuning, we believe that the discrepancy between these architectures is not solely caused by a difference in the quality of the hyper-parameters, such that the reported results give a fair qualitative impression of the performance of these architectures. In lines three and four of Tab. 1, we report the results of the proposed approach using ImageNet data and CAMELYON data for pre-training, respectively. These two experiments suggest that pre-training on ImageNet data outperforms pre-training on CAMELYON data. For composing the challenge submission, we selected the best performing network pre-trained on ImageNet from the cross-validation experiment, cf. line three of Tab. 1.
3.2 Segmentation of Whole Slide Images (B)
For the second part of the challenge we conducted two main experiments. First, we limited ourselves to the 10 annotated WSIs, using a random stratified split into 80% training and 20% validation data. We tested this approach using networks pre-trained on CAMELYON as well as on ImageNet. Second, we added additional data from selected WSIs as described in Sec. 2.3 and repeated the training, comparing the obtained results with the VGG-19 and Inception-v3 architectures as baseline approaches, cf. Tab. 2.
Similar to part A, the DenseNet architecture outperforms the Inception architecture and pre-training on ImageNet outperforms pre-training on CAMELYON. We chose the model trained on the extended data set for submission as it achieves the highest accuracy. Since the remaining four unlabeled WSIs do not exhibit sufficient variability to assess the network performance based on the score suggested by the challenge, we based our decision solely on patch accuracy.
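A random stratified split of the kind mentioned above can be sketched as follows: the indices are shuffled within each class and then cut at 80%, so that rare classes appear in both subsets. The function name, the seed and the toy labels are assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_fraction=0.8, seed=0):
    """Split sample indices into training and validation sets per class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)
    train, validation = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        cut = round(len(indices) * train_fraction)
        train.extend(indices[:cut])
        validation.extend(indices[cut:])
    return train, validation


# Toy example with an imbalanced label list: 50 patches of class 0, 10 of class 1.
labels = [0] * 50 + [1] * 10
train, validation = stratified_split(labels)
print(len(train), len(validation))
```

Note that such a split is performed on patches, so patches from the same WSI can end up in both subsets, as discussed for the chosen evaluation setup.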
| architecture | pre-training on | data | patch-based acc. |
|---|---|---|---|
| DenseNet-161 | ImageNet | annotated 80/20 split | 95.75% |
| DenseNet-161 | CAMELYON | annotated 80/20 split | 95.33% |
| VGG-19 | ImageNet | ext. annotated 80/20 split | 96.04% |
| Inception-v3 | ImageNet | ext. annotated 80/20 split | 95.51% |
| DenseNet-161 | ImageNet | ext. annotated 80/20 split | 96.24% |
Regarding the results for part A of the challenge, it becomes apparent that the DenseNet architecture outperforms the other baseline methods. More interesting than this first qualitative comparison, however, is the fact that pre-training on ImageNet is considerably better than pre-training on CAMELYON data. We hypothesize that this discrepancy arises from the fact that features learned from the CAMELYON data base do not generalize well enough to the specific appearance of the images from part A.
Comparing the achieved results to the ones reported by Araújo et al. [3] is not straightforward: In [3], a classification accuracy of 78% is reported for the same task; however, the data set used there is even smaller (285 images), and we do not have any information regarding the splitting of the data.
Regarding the results for part B, we observe again that the DenseNet architecture outperforms the baseline approaches, i.e. the VGG-19 and Inception-v3 architectures. Furthermore, we observed comparable results for the DenseNet trained on the original ten WSIs and the extended data base of 26 WSIs. As we observed a better generalization performance in preliminary experiments, where we gradually added additional training data, we decided to submit the network which has seen the largest data variability during training to the challenge phase.
In both parts, we observed a better performance for networks pre-trained on ImageNet data in comparison to the ones pre-trained on CAMELYON data. By training networks on the CAMELYON data base we were hoping to learn features, particularly in the first layer, which are better suited to digital pathology images. On the other hand, networks pre-trained on ImageNet are known to learn very robust and general features due to the high variability of the ImageNet data base and it might be that the features learnt from the CAMELYON data base generalize less well to the data of this challenge.
The conducted experiments demonstrate that densely connected convolutional networks are well-suited for transfer learning, even if the considered data set is small. In order to develop classification algorithms which can be used in clinical practice, a significantly larger amount of data is necessary. We want to emphasize that the chosen splits for training and validation (in both part A and part B) are not suited for a clinical evaluation. In addition, a data base for training such a network possibly requires more precise and also different annotations. This can be observed not only by inspecting the rather coarse annotations of the WSIs, but also from the fact that part A only contains images where two pathologists agreed. In fact, computer-assisted diagnosis would be particularly helpful in those excluded cases. However, future work should not be limited to the creation of larger and carefully annotated data bases. The development of sophisticated feature visualization techniques will be crucial not only to understand performance differences between differently trained networks, but also to make the computed decision more understandable to the medical expert.
- [1] Albarqouni, S., Baur, C., Achilles, F., Belagiannis, V., Demirci, S., Navab, N.: Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging 35(5), 1313–1321 (2016)
- [2] Apou, G., Schaadt, N.S., Naegel, B., Forestier, G., Schönmeyer, R., Feuerhake, F., Wemmert, C., Grote, A.: Detection of lobular structures in normal breast tissue. Computers in biology and medicine 74, 91–102 (2016)
- [3] Araújo, T., Aresta, G., Castro, E., Rouco, J., Aguiar, P., Eloy, C., Polónia, A., Campilho, A.: Classification of breast cancer histology images using convolutional neural networks. PloS one 12(6), e0177544 (2017)
- [4] Bejnordi, B.E., Veta, M., van Diest, P.J., van Ginneken, B., Karssemeijer, N., Litjens, G., van der Laak, J.A., Hermsen, M., Manson, Q., Balkenhol, M., et al.: Diagnostic assessment of deep learning algorithms for detection of lymph node metastases in women with breast cancer. JAMA 318(22), 2199–2210 (2017)
- [5] Chen, H., Wang, X., Heng, P.A.: Automated mitosis detection with deep regression networks. In: Biomedical Imaging (ISBI), 2016 IEEE 13th International Symposium on. pp. 1204–1207. IEEE (2016)
- [6] Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Mitosis detection in breast cancer histology images with deep neural networks. In: International Conference on Medical Image Computing and Computer-assisted Intervention. pp. 411–418. Springer (2013)
- [7] Cruz-Roa, A., Gilmore, H., Basavanhally, A., Feldman, M., Ganesan, S., Shih, N., Tomaszewski, J., González, F., Madabhushi, A.: Accurate and reproducible invasive breast cancer detection in whole-slide images: A deep learning approach for quantifying tumor extent. Scientific Reports 7, 46450 (2017)
- [8] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
- [9] Dong, F., Irshad, H., Oh, E.Y., Lerwill, M.F., Brachtel, E.F., Jones, N.C., Knoblauch, N.W., Montaser-Kouhsari, L., Johnson, N.B., Rao, L.K., et al.: Computational pathology to discriminate benign from malignant intraductal proliferations of the breast. PloS one 9(12), e114885 (2014)
- [10] Elston, C., Ellis, I.: Pathological prognostic factors in breast cancer. i. the value of histological grade in breast cancer: experience from a large study with long-term follow-up. Histopathology 19(5), 403–410 (1991)
- [11] Fridman, W.H., Pagès, F., Sautès-Fridman, C., Galon, J.: The immune contexture in human tumours: impact on clinical outcome. Nature Reviews Cancer 12(4), 298–306 (2012)
- [12] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
- [13] Huang, G., Liu, Z., van der Maaten, L., Weinberger, K.: Densely connected convolutional networks. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 4700–4708 (2017)
- [14] Janowczyk, A., Doyle, S., Gilmore, H., Madabhushi, A.: A resolution adaptive deep hierarchical (radhical) learning scheme applied to nuclear segmentation of digital pathology images. Computer Methods in Biomechanics and Biomedical Engineering: Imaging & Visualization pp. 1–7 (2016)
- [15] Janowczyk, A., Madabhushi, A.: Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of pathology informatics 7 (2016)
- [16] Litjens, G., Sánchez, C., Timofeeva, N., Hermsen, M., Nagtegaal, I., Kovacs, I., Hulsbergen-Van De Kaa, C., Bult, P., Van Ginneken, B., Van Der Laak, J.: Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis. Scientific reports 6, 26286 (2016)
- [17] Liu, Y., Gadepalli, K., Norouzi, M., Dahl, G.E., Kohlberger, T., Boyko, A., Venugopalan, S., Timofeev, A., Nelson, P.Q., Corrado, G.S., et al.: Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442 (2017)
- [18] Martelotto, L., Ng, C., Piscuoglio, S., Weigelt, B., Reis-Filho, J.: Breast cancer intra-tumor heterogeneity. Breast Cancer Research 16(3), 210 (2014)
- [19] Robertson, S., Azizpour, H., Smith, K., Hartman, J.: Digital image analysis in breast pathology–from image processing techniques to artificial intelligence. Translational Research (2017)
- [20] Schnitt, S.: Classification and prognosis of invasive breast cancer: from morphology to molecular taxonomy. Modern pathology 23, S60–S64 (2010)
- [21] Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
- [22] Su, H., Liu, F., Xie, Y., Xing, F., Meyyappan, S., Yang, L.: Region segmentation in histopathological breast cancer images using deep convolutional neural network. In: IEEE International Symposium on Biomedical Imaging. pp. 55–58. IEEE (2015)
- [23] Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: IEEE Conference on Computer Vision and Pattern Recognition. pp. 2818–2826 (2016)
- [24] Torre, L., Bray, F., Siegel, R., Ferlay, J., Lortet-Tieulent, J., Jemal, A.: Global cancer statistics, 2012. CA: A Cancer Journal for Clinicians 65(2), 87–108 (2015), http://dx.doi.org/10.3322/caac.21262
- [25] Veta, M., Van Diest, P., Willems, S., Wang, H., Madabhushi, A., Cruz-Roa, A., Gonzalez, F., Larsen, A., Vestergaard, J., Dahl, A., et al.: Assessment of algorithms for mitosis detection in breast cancer histopathology images. Medical image analysis 20(1), 237–248 (2015)
- [26] Wang, D., Khosla, A., Gargeya, R., Irshad, H., Beck, A.: Deep learning for identifying metastatic breast cancer. arXiv preprint arXiv:1606.05718 (2016)
- [27] Warner, E.: Breast-cancer screening. New England Journal of Medicine 365(11), 1025–1032 (2011)
- [28] Wollmann, T., Rohr, K.: Deep residual hough voting for mitotic cell detection in histopathology images. In: International Symposium on Biomedical Imaging. pp. 341–344. IEEE (2017)
- [29] Xu, J., Xiang, L., Liu, Q., Gilmore, H., Wu, J., Tang, J., Madabhushi, A.: Stacked sparse autoencoder (ssae) for nuclei detection on breast cancer histopathology images. IEEE transactions on medical imaging 35(1), 119–130 (2016)