BACH: Grand Challenge on Breast Cancer Histology Images
Breast cancer is the most common invasive cancer in women, affecting more than 10% of women worldwide. Microscopic analysis of a biopsy remains one of the most important methods to diagnose the type of breast cancer. This requires specialized analysis by pathologists, in a task that i) is highly time- and cost-consuming and ii) often leads to non-consensual results. The relevance and potential of automatic classification algorithms using hematoxylin-eosin stained histopathological images has already been demonstrated, but the reported results are still sub-optimal for clinical use. With the goal of advancing the state of the art in automatic classification, the Grand Challenge on BreAst Cancer Histology images (BACH) was organized in conjunction with the 15th International Conference on Image Analysis and Recognition (ICIAR 2018). A large annotated dataset, composed of both microscopy and whole-slide images, was specifically compiled and made publicly available for the BACH challenge. Following a positive response from the scientific community, a total of 64 submissions, out of 677 registrations, effectively entered the competition. The submitted algorithms pushed forward the state of the art in the automatic classification of breast cancer from histopathological images, reaching an accuracy of 87%. Convolutional neural networks were the most successful methodology in the BACH challenge. Detailed analysis of the collective results allowed the identification of remaining challenges in the field and recommendations for future developments. The BACH dataset remains publicly available to promote further improvements to the field of automatic classification in digital pathology.
Keywords: Breast cancer, Histology, Digital pathology, Challenge, Comparative study, Deep learning
Breast cancer is one of the most common causes of cancer-related death in women of all ages Siegel2017 (), but early diagnosis and treatment can significantly prevent the disease's progression and reduce its morbidity rates Smith2005 (). Because of this, women are recommended to do self check-ups via palpation and regular screenings via ultrasound or mammography; if an abnormality is found, a breast tissue biopsy is performed NationalBreastCancerFoundation2015 (). Usually, the collected tissue sample is stained with hematoxylin and eosin (H&E), which makes it possible to distinguish the nuclei from the parenchyma, and is observed via an optical microscope. Complementarily, these samples can also be scanned at giga-pixel resolution, producing a so-called whole-slide image (WSI), for posterior digital processing. During assessment, pathologists search for signs of cancer on microscopic portions of the tissue by analyzing its histological properties. This procedure allows malignant regions to be distinguished from non-malignant (benign) tissue, the latter presenting changes in the normal structures of the breast parenchyma that are not directly related with progression to malignancy. Malignant lesions can be further classified as in situ carcinoma, where the cancerous cells are restrained inside the mammary ductal-lobular system, or invasive carcinoma, if the cancer cells spread beyond the ducts. Due to the importance of correct diagnosis in patient management, the search for precise, robust and automated systems has increased. The differentiation of breast samples into normal, benign and malignant (either in situ or invasive) brings relevant changes in the treatment of the patients, making an accurate diagnosis essential. For instance, benign lesions can usually be followed clinically without the need for surgery, whereas malignancy almost always requires surgery, with or without the addition of chemotherapy.
The analysis of breast cancer WSIs is non-trivial due to the large amount of data to visualize and the complexity of the task Elmore2015 (). In this setting, computer-aided diagnosis (CAD) systems can alleviate the procedure by providing a complementary and objective assessment to the pathologist. Despite the high performance of these systems for the binary classification (healthy vs malignant) of microscopy Kowal2013 (); Filipczuk2013 (); George2014 (); Belsare2015 () and whole-slide images Cruz-Roa2014 (), the standard clinical classification procedure described above has only recently started to be explored Araujo2017 (); Fondon2018 (); Han2017 (); Bejnordi2017 ().
Challenges are known to enable advances in the medical image analysis field by promoting the participation of multiple researchers of different backgrounds in a competitive, but scientifically constructive, setting. Over the past years, the scientific community has been promoting challenges on different imaging modalities and topics Grandchallenge (). Related to breast cancer, CAMELYON Camelyon () is a two-edition challenge aimed at cancer metastasis detection on WSIs of lymph node sections.
To further promote and complement research in the breast cancer image analysis field, the Grand Challenge on BreAst Cancer Histology images (BACH) was organized as part of the ICIAR 2018 conference (15th International Conference on Image Analysis and Recognition). BACH is a biomedical image challenge built on top of the Bioimaging2015 challenge, which aims at the classification of H&E-stained microscopy and WSI breast cancer tissue images. Specifically, the participants of BACH were asked to classify these tissue samples as 1) Normal, 2) Benign, 3) In situ carcinoma or 4) Invasive carcinoma, with the goal of providing pathologists a tool to reduce the diagnosis workload. The rest of the paper is organized as follows. Section 2 details the challenge in terms of organization, dataset and participant evaluation. Then, Section 3 describes the approaches of the best performing methods, and the corresponding performance is reported in Section 4. Finally, Section 6 summarizes the findings of this study.
2 Challenge description
The BACH challenge was organized into different stages, providing a well structured workflow to potentiate the success of the initiative (Fig. 1). The challenge was hosted on Grand Challenge111https://grand-challenge.org/, which allowed for an easy platform set-up. At the time of this writing, Grand Challenge accounts for more than 12 000 registered users and, alongside Kaggle Kaggle (), it is one of the preferred platforms for medical imaging-related challenges. BACH was also announced via the Sci-diku-imageworld mailing list222https://list.ku.dk/listinfo/sci-diku-imageworld. Participants were asked to register on Grand Challenge to access most of the contents of the BACH webpage. All registrations were manually validated by the organization to minimize spam and anonymous participation. Once accepted, participants could download the data by filling in a form asking for their name, institution and e-mail address. Once the form was submitted, an e-mail containing a unique set of credentials (username, password) and the dataset download link was automatically sent to the provided e-mail address. This allowed the organization to restrict dataset access to registered participants, as well as to collect a list of the institutions/companies interested in the challenge.
BACH was divided into two parts, A and B. Part A consisted of automatically classifying H&E-stained breast histology microscopy images into four classes: 1) Normal, 2) Benign, 3) In situ carcinoma and 4) Invasive carcinoma. Part B consisted of performing pixel-wise labeling of whole-slide breast histology images into the same four classes. Participants were allowed to take part in a single part of the challenge. Also, to promote participation (and thus more competition and higher quality methods), the ICIAR 2018 conference ICIAR2018 () sponsored the challenge by awarding monetary prizes to the first and second best performing methods of both challenge parts. Prize awarding was contingent on the acceptance and presentation of the methodology at the ICIAR 2018 conference.
The BACH website333https://iciar2018-challenge.grand-challenge.org/ was first made publicly available in November 2017 with the release of the labeled training set. The registered participants had until February 2018 (4 months) to submit their methods and a paper describing their approach. To promote the dissemination of the methods, participants were also required to submit their paper to the ICIAR conference. The test set was released in February 2018 and submissions were open for a week. Results were announced one month later, in March 2018.
The BACH challenge made two labeled training datasets available to the registered participants. The first dataset is composed of microscopy images annotated image-wise by two expert pathologists from the Institute of Molecular Pathology and Immunology of the University of Porto (IPATIMUP) and from the Institute for Research and Innovation in Health (i3S). The second dataset contains pixel-wise annotated and non-annotated WSIs. For the WSIs, annotations were performed by a pathologist and revised by a second expert. The training data will be made publicly available after the paper is published. [sentence to be replaced by a link to an institutional (long-term) repository, upon manuscript acceptance for publication]
2.2.1 Microscopy images dataset
The microscopy dataset is composed of 400 training and 100 test images, distributed evenly across the four classes. All images were acquired in 2014, 2015 and 2017 using a Leica DM 2000 LED microscope and a Leica ICC50 HD camera, and all patients are from the Covilhã and Porto regions (Portugal). The annotation was performed by two medical experts. Images where there was disagreement between the Normal and Benign classes were discarded. The remaining doubtful cases were confirmed via immunohistochemical analysis. The provided images are in RGB .tiff format and have a size of pixels and a pixel scale of 0.42 µm × 0.42 µm. The labels of the images were provided in .csv format. Participants were also provided with a partial patient-wise distribution of the images. Note that the microscopy image dataset is an extension of the one used for developing the approach in Araujo2017 ().
2.2.2 Whole-slide images dataset
Whole-slide images (WSIs) are high resolution images containing the entire sampled tissue; because of that, each WSI can contain multiple pathological regions. The BACH Part B dataset is composed of 30 WSIs for training and 10 WSIs for algorithm testing. Specifically for training, the organization provided 10 WSIs with pixel-wise annotated regions for the Benign, In situ carcinoma and Invasive carcinoma classes, and 20 non-annotated WSIs. The provided annotations aim at identifying regions of interest for the diagnosis at the lowest zoom setting and thus may include non-tissue and normal tissue regions, as depicted in Fig. 3. The distribution of the labels is shown in Table 1.
The WSIs were acquired in 2013–2015 from patients from the Castelo Branco region (Portugal) with a Leica SCN400, and were made available in .svs format, with a pixel scale of 0.467 µm/pixel and variable size (e.g. 42113 × 62625 pixels). The ground truth was released as the coordinates of the points that enclose each labeled region, via a .xml file.
2.3 Performance Evaluation
The methods developed by the participants were evaluated on independent test sets for which the ground truth was hidden. Specifically, for Part A participants were requested to submit a .csv file containing row-wise pairs of (image name, predicted label) for the 100 microscopy images. Performance on the microscopy images was evaluated based on the overall prediction accuracy, i.e., the ratio between the number of correctly classified images and the total number of evaluated images.
For Part B, participants were required to submit downsampled WSI .png masks with values 0 – Normal, 1 – Benign, 2 – In situ carcinoma and 3 – Invasive carcinoma. Possible mismatches between the prediction's and ground truth's sizes were corrected by padding or cropping the prediction masks. The performance on the WSIs was evaluated based on the custom score s:

s = 1 - \frac{\sum_{i=1}^{N} |pred_i - gt_i|}{3 \sum_{i=1}^{N} \left[1 - (1 - bin(pred_i))(1 - bin(gt_i))\right]}

where pred_i is the predicted class (0, 1, 2 or 3), gt_i is the ground truth class, i is the linear index of a pixel in the image, N is the total number of pixels in the image and bin(·) is the binarized value of the label, i.e., bin(c) is 0 if the label is 0 and 1 otherwise. This score is based on the accuracy metric, but penalizes more heavily the predictions that are farther from the ground truth value (the factor 3 is the maximum possible class distance). Note that, in the denominator, the cases in which the prediction and ground truth are both 0 (Normal class) are not counted, since these can be seen as true negative cases.
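A minimal NumPy sketch of this score, assuming the denominator counts only pixels where the prediction or the ground truth is non-Normal and that 3 is the maximum class distance:

```python
import numpy as np

def bach_score(pred, gt):
    """Custom Part B score: accuracy-like, penalizing predictions farther
    from the ground truth; pixels where both prediction and ground truth
    are Normal (0) are excluded from the denominator."""
    pred = np.asarray(pred, dtype=np.int64)
    gt = np.asarray(gt, dtype=np.int64)
    # binarized labels: 0 stays 0, any other class becomes 1
    bpred = (pred != 0).astype(np.int64)
    bgt = (gt != 0).astype(np.int64)
    num = np.abs(pred - gt).sum()
    # count pixels where at least one label is non-Normal; 3 = max distance
    den = 3 * (1 - (1 - bpred) * (1 - bgt)).sum()
    return 1.0 - num / den
```

A perfect prediction scores 1.0, while predicting Invasive (3) everywhere on a fully Normal region scores 0.0.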
Note that this custom evaluation score was preferred over the quadratic weighted Cohen's kappa statistic Cohen1960 () since it ignores correct Normal class predictions (highly dominant) while still penalizing wrong Normal predictions, whereas for kappa the Normal class would have either to be completely considered or completely ignored.
3 Competing solutions
This section provides a comprehensive description of the participating approaches. Table 2 and Table 3 summarize the methods that achieved an accuracy and score on Part A and B, respectively. The most relevant methods in terms of performance and applicability are detailed in the next sections. For methods that solve Parts A and B jointly, refer to Section 3.4; for Part A or Part B exclusively, refer to Sections 3.2 and 3.3, respectively.
Table 2: Summary of the Part A submissions (blank cells: information not reported).

| Team | Accuracy | Architecture | Pre-trained | Ensemble | Extra data | Scale | Input size | Stain norm. |
|---|---|---|---|---|---|---|---|---|
| Chennamsetty et al. (216) Chennamsetty2018 () | 0.87 | | | | | | | |
| Kwok (248) Kwok2018 () | 0.87 | Inception-Resnet-v2 | ✓ | ✗ | Part B | 0.71 | 299×299 | ✗ |
| Brancati et al. (1) Brancati2018 () | 0.86 | Resnet-34, 50, 101 | ✓ | 3 | ✗ | 1 | | |
| Marami et al. (16) Marami2018 () | 0.84 | Inception-v3 | ✓ | 4 | | | | |
| Kohl et al. (54) Kohl2018 () | 0.83 | Densenet-161 | ✓ | ✗ | ✗ | 1 | 205×154 | ✗ |
| Wang et al. (157) Wang2018 () | 0.83 | VGG16 | ✓ | ✗ | ✗ | 0.765 | 224×224 | ✗ |
| Steinfeldt et al. (186) | 0.81 | | | | | | | |
| Kone et al. (19) Kone2018 () | 0.81 | ResNeXt50 | ✓ | ✓ | BISQUE | 1 | 299×299 | ✗ |
| Nedjar et al. (36) | 0.81 | | | | | | | |
| Ravi et al. (412) | 0.80 | Resnet-152 | ✓ | ✗ | ✗ | 0.875 | 224×224 | Krishnan2012 () |
| Wang et al. (22) ZWang2018 () | 0.79 | VGG16 | ✓ | ✗ | ✗ | 0.255 | 224×224 | Macenko2009 () |
| Cao et al. (425) Cao2018 () | 0.79 | | | | | | | |
| Seo et al. (60) | 0.79 | | | | | | | |
| Sidhom et al. (370) | 0.78 | ResNet-50 | ✓ | ✗ | ✗ | | | |
| Guo et al. (242) Guo2018 () | 0.77 | GoogLeNet | ✓ | 2 | ✗ | | | |
| Ranjan et al. (61) | 0.77 | AlexNet | ✓ | 2 | ✗ | 1 | 224×224 | ✗ |
| Mahbod et al. (73) Mahbod2018 () | 0.77 | | | | | | | |
| Ferreira et al. (18) Ferreira2018 () | 0.76 | Inception-ResNet-v2 | ✓ | ✗ | ✗ | 1 | 224×224 | ✗ |
| Pimkin et al. (256) Pimkin2018 () | 0.76 | | | | | | | |
| Sarker et al. (358) | 0.75 | Inception-v4 | ✓ | ✗ | ✗ | 0.083 | 299×299 | ✗ |
| Rakhlin et al. (98) Rakhlin2018 () | 0.74 | | | | | | | |
| Iesmantas et al. (164) Iesmantas2018 () | 0.72 | | | | | | | |
| Xie et al. (253) | 0.72 | CNN | ✗ | ✗ | ✗ | 0.083 | 512×512 | ✗ |
| Weiss et al. (268) Weiss2018 () | 0.72 | | | | | | | |
| Awan et al. (6) Awan2018 () | 0.71 | | | | | | | |
Table 3: Summary of the Part B submissions (blank cells: information not reported).

| Team | Score | Architecture | Pre-trained | Ensemble | Extra data | Scale | Input size | Stain norm. |
|---|---|---|---|---|---|---|---|---|
| Kwok (248) Kwok2018 () | 0.69 | Inception-Resnet-v2 | ✓ | ✗ | Part A | 8.5e | 299×299 | ✗ |
| Marami et al. (16) Marami2018 () | 0.55 | | | | | | | |
| Jia et al. (296) | 0.52 | | | | | | | |
| Li et al. (137) | 0.52 | | | | | | | |
| Murata et al. (91) | 0.50 | U-Net | ✗ | ✗ | ✗ | 1.6e | 256×256 | ✗ |
| Galal et al. (264) Galal2018 () | 0.50 | DenseNet | ✗ | ✗ | ✗ | 1.6e | 2048×2048 | ✗ |
| Vu et al. (166) Vu2018 () | 0.49 | | | | | | | |
| Kohl et al. (54) Kohl2018 () | 0.42 | Densenet-161 | ✓ | ✗ | | | | |
The vast majority of Part A and all of the Part B participants proposed a convolutional neural network (CNN) approach to solve BACH. CNNs are now the state-of-the-art approach for computer vision problems and show high promise in the field of medical image analysis Litjens2017 (); Tajbakhsh2016 () because they are easy to set up, require little domain knowledge (especially when compared with handcrafted feature approaches) and allow base features to be migrated from generic natural image applications Deng2009 ().
Generically, CNNs are composed of 3 main building blocks: 1) convolutional layers, which are responsible for learning the convolutional kernels that extract, integrate and process features relevant for the task in question; 2) pooling layers, which reduce the dimension of the feature space by selecting/merging the activations (i.e. the outputs of the convolutional kernels) generated by the convolutional layers; and 3) fully-connected layers, which process the output of the convolutional and pooling layers to produce the final network output. For instance, in the case of image classification tasks the outputs are image-wise class probabilities, and in the case of segmentation-as-classification tasks the outputs are pixel-wise class probabilities. The number of features that are learned is mainly controlled by the number of feature maps in each convolutional layer. Stacking multiple convolutional layers significantly increases the predictive power of the model and, as a consequence, CNNs have millions of parameters that need to be optimized for the problem under analysis. For classification tasks, this optimization is usually performed by backpropagating the gradient of the prediction error computed via the differentiable categorical cross-entropy loss function.
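To illustrate the first two building blocks, a minimal NumPy sketch of a single-channel valid convolution followed by non-overlapping max pooling (a didactic toy, not any participant's implementation):

```python
import numpy as np

def conv2d(x, kernel):
    """Valid 2-D convolution (really cross-correlation, as in CNNs)."""
    kh, kw = kernel.shape
    h, w = x.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, size=2):
    """Non-overlapping max pooling: keeps the strongest activation per window."""
    h, w = x.shape
    out = x[:h - h % size, :w - w % size]
    out = out.reshape(h // size, size, w // size, size)
    return out.max(axis=(1, 3))

x = np.arange(36, dtype=float).reshape(6, 6)   # toy single-channel "image"
feat = conv2d(x, np.ones((3, 3)) / 9)          # 4x4 feature map
pooled = max_pool(feat)                        # 2x2 map after pooling
```

In a real CNN the kernels are learned via backpropagation rather than fixed, and many feature maps are computed per layer.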
CNN performance is highly dependent on the architecture of the network as well as on hyper-parameter optimization (the learning rate, for instance). The large number of parameters in CNNs makes them prone to overfitting the training data, especially when a relatively low number of training images is available. Because of that, it is a common practice in medical image analysis to fine-tune networks pre-trained on large collections of natural images. In BACH, participants opted mainly for pre-trained networks that have historically achieved high performance on the ImageNet natural image analysis challenge Russakovsky2015 (). Of those, VGG Simonyan2014 (), Inception Szegedy2015 (), ResNet He2016 () and DenseNet Huang2017 () were the ones that achieved the overall highest results. A brief description of these networks is provided below.
VGG (Visual Geometry Group) was one of the first networks to show that increasing model depth allows higher prediction performance. This network is composed of blocks of 2-3 convolutional layers with a large number of filters, each block followed by a max pooling layer. The output of the last layer is then connected to a set of fully-connected layers to produce the final classification. However, despite the success of this model on the ImageNet challenge, the linear structure of VGG and its large number of parameters (approximately 140M for 16 layers) do not allow the depth of the model to be increased significantly and increase the tendency to overfit.
The Inception network follows the theory that most activations in a (deep) CNN are either unnecessary or redundant, and thus the number of parameters can be reduced by using locally sparse building blocks (a.k.a. inception blocks). At each inception block, the number of feature maps of the previous block is reduced via a 1×1 convolution. The projected features are then convolved in parallel by kernels of increasing size, allowing information to be combined at multiple scales. Finally, replacing the fully-connected layers by a global average pooling significantly reduces the model parameters (23M parameters with 159 layers) and makes the network fully convolutional, enabling its application to different input sizes.
Increasing network depth leads to vanishing gradient problems as a result of the large number of multiplication operations: the error gradient becomes vanishingly small, preventing effective updates of the weights in the initial layers of the model. The recent versions of Inception tackle this issue by using Batch Normalization Ioffe2015 (), which reestablishes the gradient by normalizing the intermediary activation maps with the statistics of the training batch. Alternatively, ResNet uses residual blocks to stabilize the value of the error gradient during training. In each residual block the input activation map is summed to the output of a set of convolutional layers, thus stopping the gradient from vanishing and easing the flow of information. A ResNet with 50 residual blocks (169 layers) has approximately 25M parameters.
The high performance of models like Inception or ResNet has strengthened the deep learning design principle that "deeper networks are better" by improving on the feature redundancy and gradient vanishing problems. Recently, an even deeper network, DenseNet, has addressed these same issues by using dense blocks. Dense blocks introduce short connections between convolutional layers, i.e., for each layer, the activations of all preceding layers are used as inputs. By doing so, DenseNet promotes feature re-use, reducing feature redundancy and thus allowing the number of feature maps per layer to be decreased. Specifically, a DenseNet with 201 layers has approximately 20M parameters to optimize.
As already mentioned, fine-tuning high performance networks pre-trained on natural images is the preferred approach for medical image analysis. Fine-tuning for classification is usually performed as follows: 1) the network is initialized with weights trained to solve a natural image classification problem, such as the ImageNet classification task; 2) the classification head, usually a fully-connected layer, is replaced by a new one with randomly initialized parameters; 3) initially, the new classification head is trained for a fixed number of iterations by inputting the medical images while keeping the filters of the pre-trained model frozen; 4) then, different blocks of the pre-trained model are progressively allowed to learn and adapt to the new features, allowing the model to move to new local optima and increasing the overall performance of the network.
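The progressive unfreezing schedule in steps 3-4 can be sketched framework-agnostically; the block names below are hypothetical stand-ins for a real model's parameter groups:

```python
# Hypothetical model: an ordered list of parameter groups with trainable flags.
model = [
    {"name": "block1", "trainable": False},   # earliest pre-trained features
    {"name": "block2", "trainable": False},
    {"name": "block3", "trainable": False},
    {"name": "head",   "trainable": True},    # new, randomly initialized classifier
]

def unfreeze_up_to(model, depth):
    """Progressively allow the last `depth` pre-trained blocks to adapt,
    while the new head always remains trainable."""
    body = model[:-1]
    for i, group in enumerate(body):
        group["trainable"] = i >= len(body) - depth
    return [g["name"] for g in model if g["trainable"]]

# Stage 1 (step 3): only the new head learns.
stage1 = [g["name"] for g in model if g["trainable"]]
# Stage 2 (step 4): the deepest pre-trained block joins the optimization.
stage2 = unfreeze_up_to(model, 1)
```

In a deep learning framework this corresponds to toggling each parameter group's gradient flag between training stages.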
3.2 Part A
3.2.1 Chennamsetty et al. (team 216)
Chennamsetty et al. Chennamsetty2018 () used an ensemble of ImageNet pre-trained CNNs to classify the images from Part A. Specifically, the algorithm is composed of a ResNet-101 He2016 () and two DenseNet-161 Huang2017 () networks fine-tuned with images from varying data normalization schemes. Initializing the models with pre-trained weights alleviates the problem of training the networks with a limited amount of high quality labeled data. First, the images were resized to pixels via bilinear interpolation and normalized to zero mean and unit standard deviation according to statistics derived either from the ImageNet or the Part A datasets, as detailed below.
During training, the ResNet-101 and one DenseNet-161 were fine-tuned with images normalized with the breast histology data statistics, whereas the other DenseNet-161 was fine-tuned with the ImageNet normalization. Then, for inference, each model in the ensemble predicts the cancer grade of the input image and a majority voting scheme is posteriorly used for assigning the final class.
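The majority voting step of such an ensemble can be sketched as follows (class names are illustrative):

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from an ensemble; ties are resolved by
    the order in which classes first appear (an arbitrary but fixed rule)."""
    counts = Counter(predictions)
    return counts.most_common(1)[0][0]

# Three models vote on one image: Invasive wins 2-1.
label = majority_vote(["Invasive", "Invasive", "InSitu"])
```

With three models a strict majority always exists unless all three disagree, in which case the tie-breaking rule decides.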
3.2.2 Brancati et al. (team 1)
Brancati et al. Brancati2018 () proposed a deep learning approach based on a fine-tuning strategy, exploiting transfer learning on an ensemble of ResNet He2016 () models. ResNet was preferred over other deep network architectures because it has a small number of parameters and shows a relatively low complexity in comparison to other models. The authors opted to further reduce the complexity of the problem by down-sampling the image by a factor and using only the central patch of size as input to the network. In particular, was fixed to 80% of the original image size and was set equal to the minimum of the width and height of the resized image.
The proposed ensemble is composed of 3 ResNet configurations: 34, 50 and 101. Each configuration was trained on the images from Part A, and the classification of a test image is obtained by taking the class with the highest probability across the three configurations.
3.2.3 Wang et al. (team 157)
Wang et al. Wang2018 () proposed the direct application of VGG-16 Simonyan2014 () to solve Part A. Prior to fine-tuning the model, all images from Part A were resized to and normalized to zero mean and unit standard deviation. To account for the model input size, training is performed by cropping patches of pixels at random locations of the input image. First, the model is trained using a Sample Pairing [A] data augmentation scheme. Specifically, a random pair of images of different labels is independently augmented (translations, rotations, etc.) and the two images are then superimposed with each other. The resulting mixed patch receives the label of one of the initial images and is afterwards used to train the classifier. The learned weights are then used as a starting point to train the network with the initial (i.e. non-mixed) dataset.
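A minimal sketch of the Sample Pairing mixing step, assuming simple pixel averaging of the two augmented images (the exact superimposition used by the authors may differ):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_pairing(img_a, label_a, img_b):
    """Superimpose two independently augmented images; the mixed patch
    keeps the label of the first image, as in the Sample Pairing scheme."""
    mixed = (img_a.astype(np.float32) + img_b.astype(np.float32)) / 2.0
    return mixed, label_a

# Toy 8x8 RGB "patches" standing in for two augmented histology crops.
a = rng.integers(0, 256, (8, 8, 3))
b = rng.integers(0, 256, (8, 8, 3))
mixed, label = sample_pairing(a, "Benign", b)
```

The mixed patches act as a strong regularizer during the first training phase; the second phase uses the unmixed images.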
3.2.4 Kone et al. (team 19)
Kone et al. Kone2018 () proposed a hierarchy of 3 ResNeXt50 Xie2017 () models in a binary-tree-like structure (one parent and two child nodes) for the 4-class classification of Part A. The top CNN classifies images into two high-level groups: 1) carcinoma, which includes the in situ and invasive classes, and 2) non-carcinoma, which includes normal and benign. Then, each of the child CNNs sub-classifies the images into the respective 2 classes.
The training is performed in two steps. First, the parent ResNeXt50, pre-trained on ImageNet, is fine-tuned with the images from Part A. The learned filters are then used as the starting point for the child networks. The authors also divide the ResNeXt50 layers into three groups and assign them different learning rates based on the optimal one found during training.
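The two-level routing of this hierarchy can be sketched with stand-in classifiers; the feature names and thresholds below are purely illustrative (the real nodes are ResNeXt50 models):

```python
def hierarchical_predict(image, parent, child_carcinoma, child_non_carcinoma):
    """Route an image through a binary tree of classifiers: the parent
    separates carcinoma from non-carcinoma, then a child refines the label."""
    if parent(image) == "carcinoma":
        return child_carcinoma(image)      # 'InSitu' or 'Invasive'
    return child_non_carcinoma(image)      # 'Normal' or 'Benign'

# Toy stand-in classifiers over hypothetical scalar features.
parent = lambda x: "carcinoma" if x["atypia"] > 0.5 else "non-carcinoma"
child_c = lambda x: "Invasive" if x["spread"] > 0.5 else "InSitu"
child_n = lambda x: "Benign" if x["lesion"] > 0.5 else "Normal"

label = hierarchical_predict({"atypia": 0.9, "spread": 0.2, "lesion": 0.0},
                             parent, child_c, child_n)
```

Each image thus passes through exactly two of the three classifiers, and errors at the parent node propagate to the final label.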
3.3 Part B
3.3.1 Galal et al. (team 264)
Galal et al. Galal2018 () proposed Candy Cane, a fully convolutional network based on DenseNets Huang2017 () for the segmentation of WSIs. Candy Cane was designed following an auto-encoder scheme of downsampling and upsampling paths, with skip connections between corresponding down and up feature maps to preserve low-level feature information. Accounting for GPU memory restrictions, the authors propose a downsampling path much longer than the upsampling counterpart. Specifically, Candy Cane operates on slice images and outputs the corresponding labels at a size of pixels. Similarly to an expert who looks through a microscope at a few adjacent regions to examine the tissue but then identifies regions in the larger context of the tissue, the large input size allows the network to have both microscopy and tissue organization contexts. The output of the system is then resized to the original size.
3.4 Part A and B
3.4.1 Kwok (team 248)
Kwok Kwok2018 () used a two-stage approach to take advantage of both microscopy and WSI images. To account for the partially missing patient-wise origin in Part A, images whose origin was not available were clustered based on color similarity, and the data was then split accordingly. For stage 1, patches of pixels were extracted from Part A's images with a stride of 99 pixels. These patches were then resized to pixels and used for fine-tuning a 4-class Inception-Resnet-v2 Szegedy2017 () trained on ImageNet. This network was then used for analyzing the WSIs. Specifically, WSI foreground masks were computed via a threshold on the L*a*b color space. Then, patches were extracted from the WSIs in the same way as for Part A. This second patch dataset was refined by discarding images with foreground and posteriorly labeled using the CNN trained on Part A. Finally, patches from the top 40% incorrect predictions (evenly sampled from each of the 4 classes) were selected as hard examples for stage 2.
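The stride-based patch extraction used in stage 1 can be sketched as follows (toy sizes; the actual patches are larger and are resized before being fed to the network):

```python
import numpy as np

def extract_patches(image, patch, stride):
    """Extract square patches on a regular grid, as in stride-based tiling."""
    h, w = image.shape[:2]
    patches, coords = [], []
    for y in range(0, h - patch + 1, stride):
        for x in range(0, w - patch + 1, stride):
            patches.append(image[y:y + patch, x:x + patch])
            coords.append((y, x))
    return patches, coords

# Toy 300x300 "image" tiled with 100-pixel patches and stride 100.
img = np.zeros((300, 300), dtype=np.uint8)
patches, coords = extract_patches(img, patch=100, stride=100)
```

A stride smaller than the patch size (as in the actual submission) yields overlapping patches and denser coverage.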
For stage 2, the CNN was retrained by combining the patches extracted from Part A () and Part B (). The resulting model was used for labeling both microscopy images and WSIs. Prediction results were aggregated from patch-wise predictions back into image-wise predictions (for Part A) and WSI-wise heatmaps (for Part B). Specifically for Part B, the patch-wise predictions were mapped to hard labels (Normal=0, Benign=1, In situ=2 and Invasive=3) and combined into a single image based on the patch coordinates and the network's stride. The resulting map was then normalized to and multi-thresholded at to bias the predictions more towards Normal/Benign and less towards In situ and Invasive carcinomas.
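A minimal sketch of combining patch-wise hard labels into a WSI-level map by averaging over overlapping patches (the normalization and multi-thresholding steps are omitted):

```python
import numpy as np

def aggregate_heatmap(shape, coords, labels, patch):
    """Average overlapping patch-wise hard labels into a WSI-level map."""
    acc = np.zeros(shape, dtype=np.float64)
    cnt = np.zeros(shape, dtype=np.float64)
    for (y, x), lab in zip(coords, labels):
        acc[y:y + patch, x:x + patch] += lab
        cnt[y:y + patch, x:x + patch] += 1
    # average only where at least one patch contributed
    return np.divide(acc, cnt, out=np.zeros_like(acc), where=cnt > 0)

# Two overlapping patches labeled Benign (1) and Invasive (3):
# the overlap region averages to 2.
heat = aggregate_heatmap((4, 6), [(0, 0), (0, 2)], [1, 3], patch=4)
```

Thresholding such an averaged map then converts it back into the four hard classes.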
3.4.2 Marami et al. (team 16)
Marami et al. Marami2018 () proposed a classification scheme based on an ensemble of four modified Inception-v3 Szegedy2016 () CNNs that aims at increasing the generalization capability of the method by combining different networks trained on random subsets of the data. Specifically, the networks were adapted by adding an adaptive pooling layer before a set of custom fully-connected layers, allowing higher robustness to small scale changes. Each of these CNNs was trained via a 4-fold cross-validation approach on 512×512 images extracted at 20× magnification from both the microscopy images of Part A and the WSIs of Part B, as well as on benign tissue images from the public BreakHis dataset Spanhol2016 ().
Predictions on unseen data are inferred by averaging the output probabilities of the trained ensemble for each class, making the system more robust to potential inconsistencies and corruption in the labeled data. For Part A, the final label was obtained by majority voting over 12 overlapping 512×512 regions. For the WSIs, local predictions were generated using a 512×512 sliding window with a stride of 256 pixels. The resulting output map was then refined using a ResNet34 He2016 () to separate tissue regions from the background and regions with artifacts, reducing potential misclassifications due to ink and other artifacts in whole-slide images.
3.4.3 Kohl et al. (team 54)
Kohl et al. Kohl2018 () used an ImageNet pre-trained DenseNet Huang2017 () to approach both parts of the challenge. For Part A, the 400 training images were downsampled by a factor of 10 and normalized to zero mean and unit standard deviation. The network was then trained in two steps: 1) fine-tuning the fully-connected portion of the network for 25 epochs to avoid over-fitting and 2) training the entire network for 250 epochs.
For Part B, the authors extracted patches of ( pixels) from the annotated WSIs. This patch dataset was then refined by removing patches consisting of at least 80% background pixels, similarly to Litjens et al. Litjens2016 (). Due to the very limited amount of data in the Benign and In situ carcinoma classes, the authors did not perform a WSI-wise split for validation purposes and instead used a randomly split dataset. Also, 16 of the 20 originally non-annotated WSIs were annotated with the help of a trained pathologist, and thus in total image patches ( normal, benign, in situ and invasive carcinomas) were used. Network training was similar for Part A and Part B: 25 epochs for training the fully-connected layers followed by 250 epochs for training the whole network in the case of Part A, and 6 epochs for training the fully-connected layers followed by 100 epochs for training the whole network, using log-balanced class weights, in the case of Part B.
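The background-based patch filtering can be sketched as follows; only the 80% criterion comes from the text, while the near-white background definition and its threshold are assumptions:

```python
import numpy as np

def keep_patch(patch_rgb, white_thresh=210, max_background=0.8):
    """Keep a patch only if fewer than 80% of its pixels look like
    background; background is approximated here as near-white pixels
    (the white threshold value is an assumption)."""
    background = patch_rgb.min(axis=-1) >= white_thresh
    return bool(background.mean() < max_background)

tissue = np.full((16, 16, 3), 120, dtype=np.uint8)   # stained-tissue-like patch
blank = np.full((16, 16, 3), 245, dtype=np.uint8)    # mostly-background patch
```

Filtering out blank patches keeps the training set focused on informative tissue regions.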
3.4.4 Vu et al. (team 166)
Vu et al. Vu2018 () proposed an encoder-decoder network to solve both Part A and Part B. For Part A, the authors use the encoder part of the model. The encoder is composed of five convolutional processing blocks that integrate dense skip connections, group and dilated convolutions, and a self-attention mechanism for dynamic channel selection, following the design trends of DenseNet, Squeeze-and-Excitation Networks (SENet) and ResNeXt Huang2017 (); Jegou2016 (); Yu2015 (); Hu2017 (); Xie2017 (). For classifying the microscopy images, the model has a head composed of a global average pooling and a fully-connected softmax layer. Training is performed on downsampled images with online data augmentation.
For Part B the full encoder-decoder scheme is used. This segmentation network follows the U-Net Ronneberger2015 () structure, with skip connections between the downsampling and upsampling paths; the decoder is composed of the same convolutional blocks and upsampling is performed via nearest-neighbor interpolation. Also, to ease network convergence, the encoder is initialized with the weights learned from Part A. For training, the WSIs are first downscaled by a factor of 4 and sub-regions containing the labels of interest are collected. Specifically, the authors collect sub-regions of size from which the central regions of are used as input to the model. The corresponding output segmentation map has a size of . To prioritize the detection of pathological regions, the segmentation network is trained with two categorical cross-entropy loss terms, where the main loss targets the four histology classes and the auxiliary loss is computed for the normal and benign vs the in situ carcinoma and invasive carcinoma groups.
BACH had worldwide participation, with a total of 677 registrations and 64 submissions across Part A (51) and Part B (13), as shown in Fig. 4.
4.1 Performance in Part A
Participants of Part A of the challenge were ranked in terms of accuracy. As in Araújo et al. Araujo2017 (), these submissions were further evaluated in terms of class-wise sensitivity and specificity:

$\text{sensitivity}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \text{specificity}_c = \frac{TN_c}{TN_c + FP_c}$

where $TP_c$, $TN_c$, $FP_c$ and $FN_c$ are the class-wise true-positive, true-negative, false-positive and false-negative predictions, respectively. For benchmarking purposes, a simple fine-tuning experiment on BACH Part A was conducted. Specifically, the classification parts of VGG16, Inception v3, ResNet50 and DenseNet169 were replaced by a pair of untrained fully-connected layers with 1024 and 4 neurons. These networks were then trained in two steps: first updating only the new fully-connected layers until the validation loss stopped improving, and then training the entire model until the same stopping criterion was met. Adam Kingma2015 () was used as the optimizer and the loss was the categorical cross-entropy.
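The class-wise sensitivity and specificity above can be computed one-vs-rest from a multi-class confusion matrix. This is a minimal sketch, not the challenge's evaluation code; the toy confusion matrix is illustrative.

```python
# Class-wise sensitivity/specificity from a confusion matrix
# (rows: true class, columns: predicted class), one-vs-rest.

def sensitivity_specificity(conf, c):
    """Return (sensitivity, specificity) for class index c."""
    total = sum(sum(row) for row in conf)
    tp = conf[c][c]
    fn = sum(conf[c]) - tp                                # missed class-c samples
    fp = sum(conf[r][c] for r in range(len(conf))) - tp   # other classes predicted as c
    tn = total - tp - fn - fp
    return tp / (tp + fn), tn / (tn + fp)

# Toy confusion matrix for (Normal, Benign, InSitu, Invasive), 10 images per class
conf = [
    [8, 2, 0, 0],
    [1, 7, 2, 0],
    [0, 1, 9, 0],
    [0, 0, 1, 9],
]
se, sp = sensitivity_specificity(conf, 0)  # metrics for the Normal class
```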
Finally, to further evaluate the performance of the methods submitted to Part A, four pathologists were asked to classify the field images from the BACH training set. One of these pathologists was involved in the construction of the BACH training and test sets; the remaining three were external to the process. The difference in the annotation process was that during the construction of the BACH sets the pathologists had access to other regions of the patient tissue (and potentially to immunohistochemical analysis), whereas in this second phase they could only see the field image, i.e., they only had access to the same information as the automated classification algorithms.
The class-wise performance of the methods is shown in Table 4. Table 5 shows the performance for two binary cases: 1) a referral scenario, Pathological, where the objective is to distinguish Normal images from the remaining classes, and 2) a cancer detection scenario, Cancer, where the Normal and Benign classes are grouped vs the In situ and Invasive classes.
| 216 Chennamsetty2018 () | 0.87 | 0.96 | 0.88 | 0.80 | 0.96 | 0.84 | 1.00 | 0.88 | 0.99 |
| 248 Kwok2018 () | 0.87 | 0.96 | 0.93 | 0.72 | 0.96 | 0.88 | 0.97 | 0.92 | 0.96 |
| 1 Brancati2018 () | 0.86 | 0.96 | 0.91 | 0.68 | 0.97 | 0.84 | 0.99 | 0.96 | 0.95 |
| 16 Marami2018 () | 0.84 | 0.92 | 0.95 | 0.64 | 0.96 | 0.84 | 0.99 | 0.96 | 0.89 |
| 54 Kohl2018 () | 0.83 | 0.96 | 0.92 | 0.52 | 0.97 | 0.88 | 0.92 | 0.96 | 0.96 |
| 157 Wang2018 () | 0.83 | 0.96 | 0.91 | 0.64 | 0.99 | 0.92 | 0.91 | 0.80 | 0.97 |
| 19 Kone2018 () | 0.81 | 1.00 | 0.95 | 0.40 | 0.99 | 0.92 | 0.92 | 0.92 | 0.89 |
| 216 Chennamsetty2018 () | 0.90 | 0.88 | 0.96 | 0.92 | 0.86 | 0.98 |
| 248 Kwok2018 () | 0.94 | 0.93 | 0.96 | 0.92 | 0.92 | 0.92 |
| 1 Brancati2018 () | 0.92 | 0.91 | 0.96 | 0.90 | 0.90 | 0.90 |
| 16 Marami2018 () | 0.94 | 0.95 | 0.92 | 0.90 | 0.94 | 0.86 |
| 54 Kohl2018 () | 0.93 | 0.92 | 0.96 | 0.89 | 0.94 | 0.84 |
| 157 Wang2018 () | 0.92 | 0.91 | 0.96 | 0.94 | 0.96 | 0.92 |
| 19 Kone2018 () | 0.96 | 0.95 | 1.00 | 0.86 | 0.96 | 0.76 |
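The two binary scenarios, Pathological and Cancer, amount to remapping the four-class labels before scoring. A minimal sketch (the class ordering and helper names are illustrative, not the challenge's evaluation code):

```python
# Remapping the 4 BACH classes into the two binary evaluation scenarios
# described above, then computing a binary accuracy.

NORMAL, BENIGN, IN_SITU, INVASIVE = range(4)

def to_pathological(label):
    """Referral scenario: 1 for any lesion (benign or malignant), 0 for Normal."""
    return int(label != NORMAL)

def to_cancer(label):
    """Cancer-detection scenario: 1 only for the malignant (In situ / Invasive) classes."""
    return int(label in (IN_SITU, INVASIVE))

def binary_accuracy(y_true, y_pred, remap):
    pairs = [(remap(t), remap(p)) for t, p in zip(y_true, y_pred)]
    return sum(t == p for t, p in pairs) / len(pairs)

y_true = [NORMAL, BENIGN, IN_SITU, INVASIVE]
y_pred = [NORMAL, IN_SITU, BENIGN, INVASIVE]  # two fine-grained errors
```

Note how confusing Benign with In situ is forgiven in the Pathological scenario but penalized in the Cancer scenario, which is why the two tables can rank methods differently.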
Fig. 4(a) depicts, for the top-10 participants, the difference between the performances reported on the training set (cross-validation) and those achieved on the hidden test set. A class-wise study of these methods shows that the Benign and In situ classes are the most challenging to classify (Fig. 4(b)). In particular, Figs. 4(c) and 4(d) show two images with 100% inter-observer agreement that were misclassified by the majority (at least 80%) of the top-10 approaches.
4.1.1 Inter-observer analysis
The accuracies of the three external pathologists are 94%, 78% and 73%, while the accuracy of the pathologist from BACH is 96%. Note that this pathologist annotated the images one month after the first annotation, in order to avoid the influence of past knowledge of the patients' exams. For comparison purposes, Table 6 shows the inter-observer and method-observer agreement via the quadratic Cohen's kappa score.
| 216 Chennamsetty2018 () | 0.83 | 0.76 | 0.84 |
| 248 Kwok2018 () | 0.88 | 0.85 | 0.84 |
| 1 Brancati2018 () | 0.84 | 0.80 | 0.80 |
| 16 Marami2018 () | 0.83 | 0.84 | 0.80 |
| 54 Kohl2018 () | 0.88 | 0.82 | 0.78 |
| 157 Wang2018 () | 0.88 | 0.80 | 0.88 |
| 19 Kone2018 () | 0.83 | 0.90 | 0.72 |
| Experts (avg.) | 0.90 ± 0.05 | 0.82 ± 0.20 | 0.83 ± 0.12 |
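For reference, the quadratic Cohen's kappa used for the agreement scores above weights disagreements by the squared distance between class indices. A minimal sketch (not the organizers' evaluation code; in practice a library implementation would be used):

```python
# Quadratic-weighted Cohen's kappa between two raters' labels over n_classes.
# Disagreement between classes i and j is weighted by (i - j)**2; the common
# (n_classes - 1)**2 normalization cancels in the observed/expected ratio.

def quadratic_kappa(a, b, n_classes=4):
    n = len(a)
    w = lambda i, j: (i - j) ** 2
    # observed disagreement, averaged over paired ratings
    obs = sum(w(x, y) for x, y in zip(a, b)) / n
    # expected disagreement under independent marginal label distributions
    hist_a = [a.count(c) / n for c in range(n_classes)]
    hist_b = [b.count(c) / n for c in range(n_classes)]
    exp = sum(w(i, j) * hist_a[i] * hist_b[j]
              for i in range(n_classes) for j in range(n_classes))
    return 1.0 - obs / exp
```

A kappa of 1 indicates perfect agreement and 0 indicates chance-level agreement; the quadratic weighting penalizes a Normal-vs-Invasive confusion more heavily than a Normal-vs-Benign one.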
4.2 Performance in Part B
The overall challenge performance and main approaches of the participating teams are shown in Table 3. Similarly to Part A, please refer to Tables 7 and 8 for the team-wise sensitivity and specificity of the methods. Examples of pixel-wise predictions for Part B are shown in Fig. 6. The correct identification of invasive regions was more successful than that of benign and in situ regions.
| 248 Kwok2018 () | 0.69 | 0.36 | 0.70 | 0.03 | 0.59 | 0.40 | 0.96 |
| 16 Marami2018 () | 0.55 | 0.09 | 0.99 | 0.05 | 0.95 | 0.45 | 0.92 |
| 264 Galal2018 () | 0.50 | 0.05 | 0.88 | 0.08 | 0.52 | 0.47 | 0.74 |
| 166 Vu2018 () | 0.49 | 0.14 | 0.90 | 0.05 | 0.63 | 0.44 | 0.76 |
| 256 Pimkin2018 () | 0.46 | 0.18 | 0.58 | 0.00 | 0.00 | 0.58 | 0.47 |
| 54 Kohl2018 () | 0.42 | 0.03 | 0.98 | 0.06 | 0.75 | 0.52 | 0.74 |

| 248 Kwok2018 () | 0.78 | 0.59 | 0.46 | 0.93 |
| 16 Marami2018 () | 0.60 | 0.95 | 0.52 | 0.93 |
| 264 Galal2018 () | 0.81 | 0.52 | 0.61 | 0.58 |
| 166 Vu2018 () | 0.68 | 0.63 | 0.52 | 0.72 |
| 256 Pimkin2018 () | 0.90 | 0.00 | 0.68 | 0.48 |
| 54 Kohl2018 () | 0.68 | 0.75 | 0.57 | 0.74 |
BACH received a large number of final submissions in comparison to other medical image analysis challenges. Despite this, there is a similarly significant difference between the number of registrations and effective submissions. This stems from common factors such as: 1) registrations made to inspect the data before deciding to participate, or to obtain the data for other purposes; 2) difficulty in downloading the data, which is especially true in Asian countries due to Internet accessibility limitations; and 3) the high complexity of the task, especially of Part B. This drop in submission rate is common in biomedical imaging challenges Grandchallenge (), which points to a need to revise future challenge designs to keep the participants' interest throughout the entire duration. BACH partially addressed this issue by partnering with the ICIAR conference, which empirically motivated participants by providing an opportunity to show their work to the scientific community. Future challenge organizations will need to further promote participation, not only by improving data access but also, for instance, by establishing intermediary benchmark timepoints at which participants can compare their performance, motivating them to further improve their methods.
The vast majority of the submitted methods used deep learning for solving both tasks A and B. This follows the common trend in the field of medical image analysis, where deep learning approaches are complementing or even replacing standard manual feature engineering, since they achieve high performance while significantly reducing the need for field knowledge Litjens2017 (). Deep learning approaches, however, require large amounts of training data to produce a generalizable model, and such data are usually not available in medical image analysis due to the complexity and high cost of the annotation process. As a consequence, it is common practice to initialize the models with filters trained on large datasets of natural images, such as ImageNet, and fine-tune them for the specific problem Tajbakhsh2016 (). In fact, as shown in Table 2, all of the top performing methods are composed of one or more deep CNN architectures, such as Inception Szegedy2015 (), DenseNet Huang2017 (), VGG Simonyan2014 () or ResNet He2016 (), pre-trained on ImageNet. The difference in performance is thus mainly a consequence of design and training details. For Part A, and unlike previous approaches for the analysis of breast cancer histology images Araujo2017 (), the results of BACH suggest that training the models with a large portion of (or the entire) image, even if resized to fit the standard input size of the network, is better than using local patches. This indicates that the overall nuclei and tissue organization may be more relevant than nuclei-scale features, such as nuclei texture, for distinguishing different types of breast cancer. Interestingly, this matches the importance that clinical pathologists give to tissue architecture in the diagnostic task. In fact, unlike small patch-based approaches, using large portions of the images eases the integration of both local and global context in the decision process.
Besides, patch-based approaches have to handle the problem of attributing patch labels based on the image-level label. Although more sophisticated methods, such as Multiple Instance Learning-based ones, could be applied, the vast majority of the teams simply attributed the image label to each patch, which has obvious limitations since a patch may, for instance, contain only normal tissue and yet be labeled with a different class.
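The contrast above can be made concrete: naive patch labeling copies the image label onto every patch, whereas a simple Multiple Instance Learning style aggregation lets the image-level prediction come from the most severe patch. The scores and class ordering (higher index = more severe) are illustrative assumptions.

```python
# Naive patch labeling vs a minimal MIL-style max aggregation, illustrating
# the limitation discussed above. Class indices: 0=Normal ... 3=Invasive.

def naive_patch_labels(image_label, n_patches):
    """Every patch inherits the image label, even patches showing normal tissue."""
    return [image_label] * n_patches

def mil_image_label(patch_predictions):
    """Image label = prediction of the most severe patch (max aggregation)."""
    return max(patch_predictions)

# An invasive-carcinoma image (label 3) whose patches are mostly normal tissue:
patches = [0, 0, 3, 0]  # hypothetical per-patch predictions
```

Under naive labeling, the three normal-tissue patches would be trained as "invasive", injecting label noise; the MIL aggregation still recovers the correct image label from a single pathological patch.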
For Part B, the large image size inhibits the direct application of standard segmentation networks, such as U-Net Ronneberger2015 (), to the entire image. Consequently, participants dealt with this issue by analyzing local patches and fusing the outputs a posteriori to produce the final probability map. In fact, following the same trend as Part A, these methods preferred a large receptive field that guarantees the integration of contextual and local features during prediction and thus eases the generation of the final class map.
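The patch-then-fuse strategy above can be sketched as pasting patch-level probability maps back at their offsets and averaging where they overlap. The example is 1-D and pure Python for brevity; the patch size and stride are illustrative, not any team's actual settings.

```python
# Sliding-window fusion of patch probability maps into a full-slide
# probability map by overlap-averaging, as described above.

def fuse_patches(length, patch_probs, stride):
    """Average overlapping patch probabilities over a 1-D 'slide' of given length."""
    acc = [0.0] * length
    count = [0] * length
    for i, probs in enumerate(patch_probs):
        start = i * stride
        for j, p in enumerate(probs):
            acc[start + j] += p
            count[start + j] += 1
    return [a / c if c else 0.0 for a, c in zip(acc, count)]

# Two overlapping patches of size 4 with stride 2 over a length-6 "slide":
fused = fuse_patches(6, [[1.0, 1.0, 0.0, 0.0], [1.0, 0.0, 0.0, 0.0]], stride=2)
```

In 2-D the same accumulate-and-normalize scheme applies per class channel; overlapping windows smooth the seams between adjacent patch predictions.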
5.1 Performance in Part A
A significant number of submitted methods surpassed the performance of Araújo et al. Araujo2017 (), who reported an overall 4-class accuracy of 77.8%. Indeed, BACH provided a larger and more representative dataset which, combined with advances in architectures and transfer-learning techniques, enabled the development of methods with higher generalization ability. Specifically, these architectures show a high sensitivity for cancer (especially Invasive) cases, which is of great relevance for clinical application (faster automated diagnosis in the cases demanding more urgent attention). Also, as shown in Tables 4 and 5, the approaches proposed by the participants outperform simple fine-tuning solutions, indicating that there was a clear effort to improve network performance by changing relevant design and training details.
On the other hand, the submitted methods still failed to correctly predict images of the more subtle Benign and In situ classes. In fact, Fig. 5b shows that the Benign class is the one that most affects the performance of the methods. This is to be expected, since the presence of normal elements and the usual preservation of tissue architecture associated with benign lesions make this class especially hard to distinguish from normal tissue. Furthermore, the Benign class presents the greatest morphological variability and thus its discriminant features are more difficult to learn.
The generalization capacity of the methods is also affected by the image acquisition pipeline. Specifically, during the acquisition of field images, pathologists focus on capturing regions that contain features (tissue architecture, cytological features, etc.) representative of the given label. As a consequence, whenever those features are subtle, as is common in normal tissue, specialists tend to also capture non-relevant structures, such as fat cells. Likewise, for in situ carcinomas it is common to center the images on mammary ducts, where the cancer is contained. Fig. 4(c) and Fig. 4(e) show two images from the test and training sets, respectively. Fig. 4(e), labeled In situ, is centered on a duct and bordered on the left by non-relevant fat tissue. Fig. 4(c) has, by coincidence, the same acquisition scheme as Fig. 4(e) and, despite being correctly classified as Benign by 100% of the experts, was classified as In situ by 60% of the top-10 methods and as Normal by 10%. Likewise, Fig. 4(d) shows a full-consensus Benign test image that was classified as Invasive by 70% of the top-10. Once again, this image has a similar overall tissue organization to training cases of other classes, such as the invasive tissue depicted in Fig. 4(f). The differences, which lie in the cytological features (nuclei size, color and variability), are clear, and yet the approaches failed to capture these discriminant characteristics. This suggests that the networks may be partially modeling how images were acquired instead of focusing on what leads to the classification Abramoff2016 ().
Finally, Fig. 4(a) shows the difference between the top-10 participants' results reported at submission time, obtained via splitting of the training data, and the performance achieved on the independent test set. The majority of the methods show a 10% difference from the expected accuracy, which may be due to: 1) patient-wise overfitting to the training data, i.e., the authors did not take the origin of the images into account when splitting and, due to the lack of stain normalization, the networks may have memorized specific staining patterns. In fact, Kwok Kwok2018 () was the only top-10 performer to report a lower expected accuracy; as described in Section 3.4, the author performed a patient-wise division by clustering images of similar colors, which may have contributed to the robustness of the method; 2) an over-optimistic train-validation-test split based on a single split round; and/or 3) excessive hyperparameter tuning to increase performance on the split test set, reducing generalization capability. With this in mind, future versions of BACH will provide guidelines on data splitting to reduce this discrepancy in the reported results.
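The patient-wise split discussed above amounts to keeping all images of a given patient on the same side of the train/validation boundary, so a network cannot exploit patient- or stain-specific patterns it has memorized. A minimal sketch, with hypothetical patient ids:

```python
# Patient-wise (grouped) train/validation split: images are assigned to a fold
# according to their patient, never individually.

def patient_wise_split(image_to_patient, val_patients):
    """Split image ids so that every patient's images land in a single fold."""
    train, val = [], []
    for image, patient in image_to_patient.items():
        (val if patient in val_patients else train).append(image)
    return sorted(train), sorted(val)

# Hypothetical mapping of images to patients:
images = {"img1": "pA", "img2": "pA", "img3": "pB", "img4": "pC"}
train, val = patient_wise_split(images, val_patients={"pB"})
```

A random image-wise split, by contrast, could place "img1" in training and "img2" in validation even though both come from patient pA, inflating the validation score.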
5.1.1 Inter-observer analysis
The Part A BACH dataset was manually annotated by two medical experts, and so in the best scenario the performance of the automatic methods is expected to equal that of the observers. Taking the average of the above stated accuracies, 85%, one can see that the performance obtained by the competing solutions (Table 2) is in line with this value, with the highest accuracy being 87%. Interestingly, Tables 4 and 5 also show that, similarly to the deep learning models, the specialists had more difficulty distinguishing between the Normal and Benign classes than between the cancerous classes. This further corroborates the hypothesis that the participants tended to fail on Benign images due to the previously discussed complexity of this class.
Overall, the results in Tables 4 and 5, as well as the comparison of the quadratic Cohen's kappa scores (Table 6) for the pathologists vs the ground truth and the automatic methods vs the ground truth, show that deep learning models trained on properly annotated data can achieve human performance in complex medical tasks and may, in the near future, play an important role in second-opinion systems.
5.2 Performance in Part B
In general, Part B is much more challenging than Part A due to the large amount of information to process and the need to integrate a wide range of scales. The pixel-wise sensitivity and specificity of the Part B methods, detailed in Table 7, show that the Invasive class tends to be the easiest to detect, with the methods achieving an average sensitivity of 0.4. This is to be expected, since invasive carcinoma is characterized by an abnormal and non-confined nuclei density, and thus methods tend to require less contextual information for the prediction. In fact, this is corroborated by the results in Part A, which indicate that Invasive is the easiest of the pathological classes (see the discussion of Fig. 4(b)). On the other hand, In situ has the lowest detection sensitivity (average of 0.06 and maximum of 0.18) of the pathological classes. Unlike invasive carcinoma, classification as in situ relies on the location of the pathological cells, which means that without proper global and local context, complex to achieve due to the large size of the images, this class becomes non-trivial to classify. On the other hand, the methods of Part A did not tend to fail on images from the In situ class. This indicates that the microscopy images provide enough local and global context to perform the labeling, and thus that human experience played an essential role during the acquisition and annotation of these images.
Fig. 6 shows examples of predictions from the top-3 performing participants. In general, one can observe an overestimation of invasive (blue) regions and more difficulty in predicting the in situ (green) and benign (red) ones, which can also be seen in Table 7, where the sensitivity of the solutions for the invasive class is clearly superior to the others. Fig. 6 illustrates this tendency: two of the top performing teams fail to predict the in situ and benign regions and tend to estimate them as invasive or as background.
5.3 Diversity in the Solutions of the BACH Challenge
Challenge designs should also promote a higher diversity of methodologies. However, BACH submissions followed the recent computer vision trend, with deep learning being by far the preferred approach. Specifically, deep networks pre-trained on natural images are relatively easy to set up and achieve high performance while significantly reducing the need for field knowledge, easing participation in this and other challenges. Although the raw high performance of these methods is of interest, the scientific novelty of the approaches is reduced and usually limited to hyperparameter setup or network ensembling. Also, the black-box behavior of deep learning approaches hinders their application in the medical field, where specialists need to understand the reasoning behind the system's decision. It is the authors' belief that medical imaging challenges should further promote advances in the field by incentivizing participants to propose significantly novel solutions that move from "what?" to "why?". For instance, it would be of interest in future editions to ask participants to produce an automatic explanation of the method's decision. This will require the planning of new ground truths and metrics that benefit systems that, by providing proper decision reasoning, are more suitable for clinical practice.
5.3.1 Limitations of the BACH Challenge
While an effort has been made to create a relevant, stimulating and fair challenge, capable of advancing the state-of-the-art, the authors are aware of some limitations, namely: 1) The reference labels for both Part A and Part B were obtained via manual annotation by two medical experts. Even though images on which the observers disagreed were discarded, the labeling process is still reliant on the subjectivity and experience of the annotators (especially for Normal vs Benign labeling, where immunohistochemical analysis is of no help), limiting the performance of the submitted methods to that of the human experts. Increasing the number of annotators would further increase the reliability of the dataset. 2) Patient-wise labels were only partially available. Participants could have used the data with known patients for training and the remainder for method validation, or used alternative approaches (such as clustering) to estimate the origin of the images. Still, the availability of this information for all images would allow a fairer patient-wise split for training and evaluating the algorithms, and could eventually lead to a smaller discrepancy between the cross-validated training set results and the test set results. 3) The pixel-wise annotations of the WSIs are not highly detailed, and thus the delineated regions may include normal tissue besides the class assigned to the region. 4) Automatic evaluation of the participants' algorithms would have eased the submission procedure, allowing almost real-time feedback on the teams' performance. In this scenario, a scheme of multiple submissions could have been implemented, in which teams would be allowed to submit results on the website during the challenge running period, up to a maximum number of submissions. This would probably also boost the number of final submissions out of the challenge registrations.
BACH was organized to promote research on computer-aided diagnosis for automatic breast cancer histology image analysis. Despite the complexity of the task, the challenge received a large number of high quality solutions that achieve performance similar to human experts. Namely, the best performing methods achieved an accuracy of 0.87 in classifying high resolution microscopy images into Normal, Benign, In situ carcinoma and Invasive carcinoma classes, and a 0.69 score in labeling entire WSIs.
Proper experiment design seems to be essential to achieve high performance in breast cancer histology image analysis. Specifically, the conducted study supports the following conclusions: 1) generically, using the latest CNN designs positively impacts the system's performance, provided that fine-tuning is properly performed; 2) CNNs seem to be robust to small color variations of H&E images, and thus color normalization was not essential to attain high accuracies; 3) proper training data splitting is essential to infer the generalization capability of the model, since CNNs may overfit to patient/acquisition details; and 4) using large-context images as the network input allows overall high performance even if the input image (and thus the overall quality of the information) has to be downsampled. On the other hand, current deep learning solutions still have issues dealing with large, high resolution images, and further investment in the development of methods for WSI analysis is needed.
It is the organizers' hope that the comprehensive analysis herein presented will motivate more challenges on medical imaging and especially pave the way for the development of new breast cancer CAD methods that contribute to the early detection of this pathology, with clear benefits for our societies.
Guilherme Aresta is funded by the FCT grant contract
SFRH/BD/120435/2016. Teresa Araújo is funded by the FCT grant contract SFRH/BD/122365/2016. Aurélio Campilho is with the project ”NanoSTIMA: Macro-to-Nano Human Sensing: Towards Integrated Multimodal Health Monitoring and Analytics/NORTE-01-0145-FEDER-000016”, financed by the North Portugal Regional Operational Programme (NORTE 2020), under the PORTUGAL 2020 Partnership Agreement, and through the European Regional Development Fund (ERDF). Quoc Dang Vu, Minh Nguyen Nhat To, Eal Kim and Jin Tae Kwak are supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No.
The authors would like to thank the pathologists Dr Ana Ribeiro, Dr Rita Canas Marques and Dr Ierecê Aymoré for their help in labeling the microscopy images.
The authors would also like to thank all the other BACH Challenge participants that registered, submitted their method and were accepted at the 15th International Conference on Image Analysis and Recognition (ICIAR 2018): Kamyar Nazeri, Azad Aminpour, and Mehran Ebrahimi; Nick Weiss, Henning Kost, and André Homeyer; Alexander Rakhlin, Alexey Shvets, Vladimir Iglovikov, and Alexandr A. Kalinin; Zeya Wang, Nanqing Dong, Wei Dai, Sean D. Rosario, and Eric P. Xing; Carlos A. Ferreira, Tânia Melo, Patrick Sousa, Maria Inês Meyer, Elham Shakibapour and Pedro Costa; Hongliu Cao, Simon Bernard, Laurent Heutte, and Robert Sabourin; Ruqayya Awan, Navid Alemi Koohbanani, Muhammad Shaban, Anna Lisowska, and Nasir Rajpoot; Sulaiman Vesal, Nishant Ravikumar, AmirAbbas Davari, Stephan Ellmann, and Andreas Maier; Yao Guo, Huihui Dong, Fangzhou Song, Chuang Zhu, and Jun Liu; Aditya Golatkar, Deepak Anand, and Amit Sethi; Tomas Iesmantas and Robertas Alzbutas; Wajahat Nawaz, Sagheer Ahmed, Ali Tahir, and Hassan Aqeel Khan; Artem Pimkin, Gleb Makarchuk, Vladimir Kondratenko, Maxim Pisov, Egor Krivov, and Mikhail Belyaev; Auxiliadora Sarmiento, and Irene Fondón; Quoc Dang Vu, Minh Nguyen Nhat To, Eal Kim, and Jin Tae Kwak; Yeeleng S. Vang, Zhen Chen, and Xiaohui Xie; Chao-Hui Huang, Jens Brodbeck, Nena M. Dimaano, John Kang, Belma Dogdas, Douglas Rollins, and Eric M. Gifford (authors are ordered according to conference proceedings).
- (1) R. L. Siegel, K. D. Miller, A. Jemal, Cancer Statistics, 2017, CA: A Cancer Journal for Clinicians 67 (1) (2017) 7–30.
- (2) R. A. Smith, V. Cokkinides, H. J. Eyre, American Cancer Society guidelines for the early detection of cancer, 2006, CA: A Cancer Journal for Clinicians 55 (1) (2005) 31–44.
- (3) NationalBreastCancerFoundation, Breast Cancer Diagnosis (2015).
- (4) J. G. Elmore, G. M. Longton, P. A. Carney, B. M. Geller, T. Onega, A. N. A. Tosteson, H. D. Nelson, M. S. Pepe, K. H. Allison, S. J. Schnitt, F. P. O'Malley, D. L. Weaver, Diagnostic Concordance Among Pathologists Interpreting Breast Biopsy Specimens, JAMA 313 (11) (2015) 1122.
- (5) M. Kowal, P. Filipczuk, A. Obuchowicz, J. Korbicz, R. Monczak, Computer-aided diagnosis of breast cancer based on fine needle biopsy microscopic images, Computers in Biology and Medicine 43 (10) (2013) 1563–1572.
- (6) P. Filipczuk, T. Fevens, A. Krzyzak, R. Monczak, Computer-aided breast cancer diagnosis based on the analysis of cytological images of fine needle biopsies, IEEE Transactions on Medical Imaging 32 (12) (2013) 2169–2178.
- (7) Y. M. George, H. H. Zayed, M. I. Roushdy, B. M. Elbagoury, Remote computer-aided breast cancer detection and diagnosis system based on cytological images, IEEE Systems Journal 8 (3) (2014) 949–964.
- (8) A. D. Belsare, M. M. Mushrif, M. A. Pangarkar, N. Meshram, Classification of breast cancer histopathology images using texture feature analysis, TENCON 2015 - 2015 IEEE Region 10 Conference (2015) 1–5.
- (9) A. Cruz-Roa, A. Basavanhally, F. González, H. Gilmore, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, A. Madabhushi, Automatic detection of invasive ductal carcinoma in whole slide images with convolutional neural networks, in: M. N. Gurcan, A. Madabhushi (Eds.), Medical Imaging 2014: Digital Pathology, San Diego, 2014, p. 904103.
- (10) T. Araújo, G. Aresta, E. Castro, J. Rouco, P. Aguiar, C. Eloy, A. Polónia, A. Campilho, Classification of breast cancer histology images using Convolutional Neural Networks, PLoS ONE 12 (6) (2017) e0177544.
- (11) I. Fondón, A. Sarmiento, A. I. García, M. Silvestre, C. Eloy, A. Polónia, P. Aguiar, Automatic classification of tissue malignancy for breast carcinoma diagnosis, Computers in Biology and Medicine 96 (December 2017) (2018) 41–51.
- (12) Z. Han, B. Wei, Y. Zheng, Y. Yin, K. Li, S. Li, Breast Cancer Multi-classification from Histopathological Images with Structured Deep Learning Model, Scientific Reports 7 (1) (2017) 1–10.
- (13) B. E. Bejnordi, G. Zuidhof, M. Balkenhol, M. Hermsen, P. Bult, B. van Ginneken, N. Karssemeijer, G. Litjens, J. van der Laak, Context-aware stacked convolutional neural networks for classification of breast carcinomas in whole-slide histopathology images, arXiv (2017) 1–13.
- (14) Grand Challenges in Biomedical Image Analysis.
- (15) Camelyon17 Grand Challenge.
- (16) ICIAR 2018: 15th International Conference on Image Analysis and Recognition (2018).
- (18) J. Cohen, A Coefficient of Agreement for Nominal Scales, Educational and Psychological Measurement 20 (1) (1960) 37–46.
- (19) A. Krizhevsky, I. Sutskever, G. Hinton, ImageNet Classification with Deep Convolutional Neural Networks, Advances in Neural Information Processing Systems 25 (2012) 1106–1114.
- (20) S. S. Chennamsetty, M. Safwan, V. Alex, Classification of Breast Cancer Histology Image using Ensemble of Pre-trained Neural Networks, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 804–811.
- (21) S. Kwok, Multiclass Classification of Breast Cancer in Whole-Slide Images, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 931–940.
- (22) N. Brancati, M. Frucci, D. Riccio, Multi-classification of Breast Cancer Histology Images by Using a Fine-Tuning Strategy, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 771–778.
- (23) B. Marami, M. Prastawa, M. Chan, M. Donovan, G. Fernandez, J. Zeineh, Ensemble Network for Region Identification in Breast Histopathology Slides, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 861–868.
- (24) B. E. Bejnordi, G. Litjens, N. Timofeeva, I. Otte-Höller, A. Homeyer, N. Karssemeijer, J. A. Van Der Laak, Stain specific standardization of whole-slide histopathological images, IEEE Transactions on Medical Imaging 35 (2) (2016) 404–415.
- (25) M. Kohl, C. Walz, F. Ludwig, S. Braunewell, M. Baust, Assessment of Breast Cancer Histology Using Densely Connected Convolutional Networks, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 903–913.
- (26) Y. Wang, L. Sun, K. Ma, J. Fang, Breast Cancer Microscope Image Classification Based on CNN with Image Deformation, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 845–852.
- (27) I. Koné, L. Boulmane, Hierarchical ResNeXt Models for Breast Cancer Histology Image Classification, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 796–803.
- (28) M. M. R. Krishnan, P. Shah, Statistical Analysis of Textural Features for Improved Classification of Oral Histopathological Images, J Med Syst 36 (2) (2012) 865–881.
- (29) Z. Wang, N. Dong, W. Dai, S. D. Rosario, E. P. Xing, Classification of Breast Cancer Histopathological Images using Convolutional Neural Networks with Hierarchical Loss and Global Pooling, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 745–753.
- (30) M. Macenko, M. Niethammer, J. S. Marron, D. Borland, J. T. Woosley, X. Guan, C. Schmitt, N. E. Thomas, A method for normalizing histology slides for quantitative analysis, in: Proceedings - 2009 IEEE International Symposium on Biomedical Imaging: From Nano to Macro, ISBI 2009, Boston, 2009, pp. 1107–1110.
- (31) H. Cao, S. Bernard, L. Heutte, R. Sabourin, Improve the Performance of Transfer Learning Without Fine-Tuning Using Dissimilarity-Based Multi-view Learning for Breast Cancer Histology Images, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 779–787.
- (32) E. Reinhard, M. Ashikhmin, B. Gooch, P. Shirley, Color Transfer between Images, IEEE Comput. Graph. Appl. 21 (2001) 34–41.
- (33) Y. Guo, H. Dong, F. Song, C. Zhu, J. Liu, Breast Cancer Histology Image Classification Based on Deep Neural Networks, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 827–836.
- (34) A. Mahbod, I. Ellinger, R. Ecker, Ö. Smedby, C. Wang, Breast Cancer Histological Image Classification Using Fine-Tuned Deep Network Fusion, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 754–762.
- (35) C. A. Ferreira, T. Melo, P. Sousa, M. I. Meyer, E. Shakibapour, P. Costa, A. Campilho, Classification of Breast Cancer Histology Images Through Transfer Learning Using a Pre-trained Inception Resnet V2, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 763–770.
- (36) A. Pimkin, G. Makarchuk, V. Kondratenko, M. Pisov, E. Krivov, M. Belyaev, Ensembling Neural Networks for Digital Pathology Images Classification and Segmentation, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 877–886.
- (37) A. Rakhlin, A. Shvets, V. Iglovikov, A. A. Kalinin, Deep Convolutional Neural Networks for Breast Cancer Histology Image Analysis, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 737–744.
- (38) T. Iesmantas, R. Alzbutas, Convolutional Capsule Network for Classification of Breast Cancer Histology Images, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 853–860.
- (39) N. Weiss, H. Kost, A. Homeyer, Towards Interactive Breast Tumor Classification Using Transfer Learning, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 727–736.
- (40) R. Awan, N. A. Koohbanani, M. Shaban, A. Lisowska, N. Rajpoot, Context-Aware Learning Using Transferable Features for Classification of Breast Cancer Histology Images, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 788–795.
- (41) S. Galal, V. Sanchez-Freire, Candy Cane: Breast Cancer Pixel-Wise Labeling with Fully Convolutional DenseNets, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 820–826.
- (42) Q. D. Vu, M. N. N. To, E. Kim, J. T. Kwak, Micro and Macro Breast Histology Image Analysis by Partial Network Re-use, in: A. Campilho, F. Karray, B. ter Haar Romeny (Eds.), Image Analysis and Recognition, Springer International Publishing, Póvoa de Varzim, 2018, pp. 895–902.
- (43) G. Litjens, T. Kooi, B. E. Bejnordi, A. A. A. Setio, F. Ciompi, M. Ghafoorian, J. A. W. M. van der Laak, B. van Ginneken, C. I. Sánchez, A Survey on Deep Learning in Medical Image Analysis, Medical Image Analysis 42 (2017) 60–88.
- (44) N. Tajbakhsh, J. Y. Shin, S. R. Gurudu, R. T. Hurst, C. B. Kendall, M. B. Gotway, J. Liang, Convolutional Neural Networks for Medical Image Analysis: Full Training or Fine Tuning?, IEEE Transactions on Medical Imaging 35 (5) (2016) 1299–1312.
- (45) J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, L. Fei-Fei, ImageNet: A large-scale hierarchical image database, in: 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
- (46) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, L. Fei-Fei, ImageNet Large Scale Visual Recognition Challenge, International Journal of Computer Vision 115 (3) (2015) 211–252.
- (47) K. Simonyan, A. Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition, arXiv (2014).
- (48) C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich, Going deeper with convolutions, in: Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015, pp. 1–9.
- (49) K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 770–778.
- (50) G. Huang, Z. Liu, L. v. d. Maaten, K. Q. Weinberger, Densely Connected Convolutional Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 2261–2269.
- (51) S. Ioffe, C. Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, arXiv (2015) 1–11.
- (52) S. Xie, R. Girshick, P. Dollar, Z. Tu, K. He, Aggregated Residual Transformations for Deep Neural Networks, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), IEEE, 2017, pp. 5987–5995.
- (53) C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi, Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, in: Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17), 2017, pp. 4278–4284.
- (54) C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the Inception Architecture for Computer Vision, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR) (2016) 2818–2826.
- (55) F. A. Spanhol, L. S. Oliveira, C. Petitjean, L. Heutte, Breast Cancer Histopathological Image Classification using Convolutional Neural Networks, in: International Joint Conference on Neural Networks (IJCNN 2016), Vancouver, 2016, pp. 2560–2567.
- (56) G. Litjens, C. I. Sánchez, N. Timofeeva, M. Hermsen, I. Nagtegaal, I. Kovacs, C. Hulsbergen-van de Kaa, P. Bult, B. van Ginneken, J. van der Laak, Deep learning as a tool for increased accuracy and efficiency of histopathological diagnosis., Scientific reports 6 (January) (2016) 26286.
- (57) S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, Y. Bengio, The One Hundred Layers Tiramisu: Fully Convolutional DenseNets for Semantic Segmentation, arXiv e-prints 1611.
- (58) F. Yu, V. Koltun, Multi-Scale Context Aggregation by Dilated Convolutions, arXiv e-prints 1511.
- (59) J. Hu, L. Shen, G. Sun, Squeeze-and-Excitation Networks, arXiv e-prints 1709.
- (60) O. Ronneberger, P. Fischer, T. Brox, U-Net: Convolutional Networks for Biomedical Image Segmentation, Medical Image Computing and Computer-Assisted Intervention (MICCAI) 9351 (2015) 234–241.
- (61) D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, in: International Conference on Learning Representations (ICLR), San Diego, 2015.
- (62) M. D. Abràmoff, Y. Lou, A. Erginay, W. Clarida, R. Amelon, J. C. Folk, M. Niemeijer, Improved Automated Detection of Diabetic Retinopathy on a Publicly Available Dataset Through Integration of Deep Learning, Investigative Ophthalmology & Visual Science 57 (2016) 5200.