An End-to-End Deep Learning Histochemical Scoring System for Breast Cancer Tissue Microarray
One of the methods for stratifying different molecular classes of breast cancer is the Nottingham Prognostic Index Plus (NPI+) which uses breast cancer relevant biomarkers to stain tumour tissues prepared on tissue microarray (TMA). To determine the molecular class of the tumour, pathologists will have to manually mark the nuclei activity biomarkers through a microscope and use a semi-quantitative assessment method to assign a histochemical score (H-Score) to each TMA core. However, manually marking positively stained nuclei is a time consuming, imprecise and subjective process which will lead to inter-observer and intra-observer discrepancies. In this paper, we present an end-to-end deep learning system which directly predicts the H-Score automatically. The innovative characteristics of our method is that it is inspired by the H-Scoring process of the pathologists where they count the total number of cells, the number of tumour cells, and categorise the cells based on the intensity of their positive stains. Our system imitates the pathologists’ decision process and uses one fully convolutional network (FCN) to extract all nuclei region (tumour and non-tumour), a second FCN to extract tumour nuclei region, and a multi-column convolutional neural network which takes the outputs of the first two FCNs and the stain intensity description image as input and acts as the high-level decision making mechanism to directly output the H-Score of the input TMA image. In additional to developing the deep learning framework, we also present methods for constructing positive stain intensity description image and for handling discrete scores with numerical gaps. Whilst deep learning has been widely applied in digital pathology image analysis, to the best of our knowledge, this is the first end-to-end system that takes a TMA image as input and directly outputs a clinical score. We will present experimental results which demonstrate that the H-Scores predicted by our model have very high and statistically significant correlation with experienced pathologists’ scores and that the H-Scoring discrepancy between our algorithm and the pathologits is on par with that between the pathologists. Although it is still a long way from clinical use, this work demonstrates the possibility of using deep learning techniques to automatically and directly predicting the clinical scores of digital pathology images.
Breast cancer (BC) is a heterogeneous group of tumours with varied genotype and phenotype features [rakha2014nottingham]. Recent research of Gene Expression Profiling (GEP) suggests that BC can be divided into distinct molecular tumour groups [perou1999distinctive, nielsen2010comparison]. Personalised BC management often utilizes robust commonplace technology such as immunohistochemistry (IHC) for tumour molecular profiling [soria2010methodology, green2013identification].
Diaminobenzidine (DAB) based IHC techniques stain the target antigens (detected by biomarkers) with brown colouration (positive) against a blue colouration (negative) counter-stained by Hematoxylin (see Fig.1 for some example images). To determine the biological class of the tumour, pathologists will mark the nuclei activity biomarkers through a microscope and give a score based on a semi-quantitative assessment method called the modified histochemical scoring (H-Score) [mccarty1985estrogen, goulding1995new]. The H-Scores of tissue samples stained with different biomarkers are combined together to determine the biological class of a case. Clinical decision making is to choose an appropriate treatment from a number of available treatment options according to the biological class of the tumour. For instance, one of the methods for stratifying different molecular classes is the Nottingham Prognosis Index Plus (NPI +)[rakha2014nottingham] which uses 10 breast cancer relevant biomarkers to stain tumour tissues prepared on tissue microarray (TMA). Tissue samples stained by each of these 10 biomarkers are given a histochemical score (H-Score) and these 10 scores together will determine the biological class of the case.
Therefore, H-Score is one of the most important pieces of information for molecular tumour classification. When the tumour region occupies more than 15% of the TMA section, a H-Score is calculated based on a linear combination of the percentage of strongly stained nuclei (), the percentage of moderately stained nuclei () and the percentage of weakly stained nuclei () according to equation (1):
The final score has a numerical value ranges from 0 to 300. Thus, the histochemical assessment of the TMA’s is based on the following semi-quantitative information: the total number of cells, the number of tumour cells and the stain intensity distributions within the tumour cells. In clinical practice, diagnosis requires averaging two experienced pathologists’ assessments. Manually marking the positively stained nuclei is obviously a time consuming process. As visual assessment of the TMA’s is subjective, there is the problem of inter-observer discrepancy and the issue of repeatability. The semi-quantitative nature of the method (strongly stained, moderately stained and weakly stained, the definitions of strong, moderate and weak cannot be precise and subjective), makes it even more difficult to ensure inter-subject as well as intra-subject consistency.
With the increasing application of clinicopathologic prognosis, Computer Aided Diagnosis (CAD) systems have been proposed to support the pathologists’ decision making. The key parameters in tissue image assessment include the number of tumour cells, the positive staining intensities within these cells and the total number of all cells in the image. To classify the positively stained pixels and their stain intensity, methods such as colour deconvolution that perform mathematical transformation of the RGB image [ruifrok1997quantification] [ruifrok2001quantification] are widely used to separate positive stains from negative stains. Numerous computer-assisted approaches have been proposed for cell or nuclei detection and segmentation [irshad2014methods]. Most literature on histopathology image analysis perform various low-level quantification steps, there is still little attempt to perform end-to-end assessment of the image directly.
In this paper, we ask this question: is it possible to develop a CAD model that would directly give a high-level assessment of a digital pathological image, just like an experienced pathologist would, for example, to give out a H-Score directly? In an attempt to answer this question, we propose an end-to-end deep learning system for directly predicting the H-Scores of breast cancer TMA images, see Fig. 6. Instead of pushing the raw digital images into the neural network directly, we follow a similar process that pathologists use for H-Score estimation. We first construct a stain intensity nuclei image (SINI) which only contains nuclei pixels and their corresponding stain intensity information, and a stain intensity tumour image (SITI) which only contains tumour nuclei pixels and their corresponding stain intensity information. The SINI and SITI block irrelevant background pixels while only retain useful information for calculating the H-Score. These two H-Score relevant images are then fed into a dual-channel convolutional neural network with two input pipelines, which are finally merged into one pipeline to give an output (H-Score). To the best of our knowledge, this is a first work that attempts to develop deep learning based TMA processing model that directly outputs the histochemical scores. We will present experimental results which demonstrate that the H-Scores predicted by our model have high and statistically significant correlation with experienced pathologists’ scores and that the H-Scoring discrepancy between our algorithm and the pathologists is on par with that between the pathologists. Although it is still perhaps a long way from clinical use, this work nevertheless demonstrates the possibility of automatically scoring cancer TMA’s based on deep learning.
2 Related Works
Researchers have proposed various computer-assisted analysis methods for histopathological images [kothari2013pathology]. For pixel-level positive stain segmentation, Pham [pham2007quantitative] adapted Yellow channel in CMYK model, which is believed to have strong correlation with the DAB stain; Ruifrok [ruifrok1997quantification] presented the brown image calculated based on mathematical transformation of the RGB image. Yao [yao2012interactive] employed Hough forest for mitotic cell detection, which is a combination of generalized Hough transform and random decision trees. Shu et al. [shu2013segmenting] proposed utilizing morphological filtering and seeded watershed for overlapping nuclei segmentation. Object-based CAD systems have also been developed for tubule detection in breast cancer [basavanhally2011incorporating], glandular structure segmentation [fu2014novel], and etc.
With the development of deep learning techniques, various deep neural network based CAD models have been published. Deep convolutional networks with deeper architectures can be used to build more complex models which will result in more powerful solutions. Li [li2017hep] used a 88-layer residual network for human epithelial type 2 (HEp-2) cell segmentation and classification. AggNet with a novel aggregation layer is proposed for mitosis detection in breast cancer histology images [albarqouni2016aggnet]. Google brain presented a multi scale CNN model to aid breast cancer metastasis detection in lymph nodes[liu2017detecting]. A deep learning-based system is proposed for the detection of metastatic cancer from whole slide images, which won the Camelyon Grand Challenge 2016 [wang2016deep]. Shah et al. [shah2016deep] presented the first completely data-driven model integrated numerous biologically salient classifiers for invasive breast cancer prognosis. A symmetric fully convolutional network is proposed by Ronneberger for microscopy image segmentation [ronneberger2015u].
Digital pathology is relative new compared with other type of medical imaging such as X-ray, MRI, and CT. Deep learning as one of the most powerful machine learning techniques emerged in recent years has seen widespread applications in many areas. Yap et al. [yap2017automated] investigated three deep learning models for breast ultrasound lesion detection. Moeskops [moeskops2016deep] introduced a single CNN model with triplanar input patches for segmenting three different types of medical images, brain MRI, breast MRI and cardiac CTA. A combination of multi-channel image representation and unsupervised candidate proposals is proposed for automatic lesion detection in breast MRI [amit2017hybrid].
Most existing high-level CAD frameworks directly follow the assessment criteria by extracting quantitative information from the digital images. Masmoudi et al.[masmoudi2009automated] proposed an automatic Human Epidermal Growth Factor Receptor 2 (HER2) assessment method, which is an assemble algorithm of colour pixel classification, nuclei segmentation and cell membrane modelling. Gaussian-based bar filter was used for membrane isolation after colour decomposition in [hall2008computer]. Trahearn et al. [trahearn2016simultaneous] established a two-stage registration process for IHC stained WSI scoring. Thresholds were defined for DAB stain intensity groups, and tumour region and nuclei were detected by two different detectors. Recently, Zhu [zhu2017sur] proposed to train an aggregation model based on deep convolutional network for patient survival status prediction.
3 Problem and Method
An immunohistochemical assessment can be formulated as a model that maps the input images from the input space to the a label space . Given an input image , its label is assigned according to the quantitative information of positive staining intensity , the number of tumour cells and total number of cells in the image :
Traditional assessment methods have at least three unsolved issues for both the pathologists and the CAD systems. Firstly, the positive staining intensity needs to be categorized into four classes: unstained, weak, moderate, and strong. However, there is no standard quantitative criterion for classifying the DAB stain intensity. Thus, two pathologists often classify the same staining intensity into two different categories or two different intensities into the same category. Furthermore, the human visual system may pay more attention to strongly stained regions but they are often surrounded by a variety of staining intensities [trahearn2016simultaneous], which may also affect the assessment results. Secondly, cell/nuclei instance counting is a very important parameter in the assessment. Nevertheless, both human and computer still cannot deal with the difficulty of counting overlapping cells very well. Moreover, variability in the appearance of different types of nucleus, heterogeneous staining, and the complex tissue architectures make individually segmenting cell/nuclei a very challenging problem. Thirdly, the apparent size differences between tumour nuclei and normal nuclei will affect the quantitative judgement of tumour nuclei assessment. Examples of these challenging cases are illustrated in Fig. 2.
To tackle the problem mentioned above, we propose to develop a convolutional neural network (CNN) based CAD framework for biomarker assessment of TMA images. Instead of using CNN as a feature extractor or for low level processing such as cell segmentation only, we have developed an end-to-end system which directly predicts the biomarker score (H-Score). The innovative characteristic of our method is that it is inspired by the H-Scoring process of the pathologists where they count the total number of nuclei and the number of tumour nuclei and categorise tumour nucleus based on the intensity of their positive stains. In the complete system, as illustrated in Fig. 6, one fully convolutional network (FCN) is used to extract all nuclei region which acts as the step of counting all nucleus and capture all foreground information, another FCN is used to extract tumour nuclei region which acts as the step of counting all tumour nucleus. To mimic the process of categorising tumour nuclei based on their positive stain intensities, we derive a stain intensity image which together with the outputs of the two FCNs are presented to another deep learning network which acts as the high-level decision making mechanism to directly output the H-Score of the input TMA image.
3.1 Stain Intensity Description
Although various DAB stain separation methods have been proposed [ruifrok2001quantification, brey2003automated], few work studied the stain intensity description and grouping. Since there is no formal definitions for the boundaries between stain intensity groups (e.g, strong, moderate, weak), previous works used manually defined thresholds for pixel-wise classification to segment positive stains into each stain group [trahearn2016simultaneous].
In this work, we propose to directly use the luminance values of the image to describe the staining intensity instead of setting artificial intensity category boundaries. The original RGB image is first transformed into three-channel stain component image () using colour deconvolution [ruifrok2001quantification]:
where is the stain matrix composed of staining colours equal to
for DAB-H stained images, and is Optical Density converted image calculated according Lambert-Beers law:
is the spectral radiation intensity for a typical 8bit RGB camera [haub2015model].
Only the DAB channel image from the three colour deconvolution output channels is used, which describes the DAB stain according the chroma difference.
Most previous works set a single threshold on to separate positively stained tissues. However, as shown in Fig.3, the deeply stained positive nuclei can have dark and light pixel values on the DAB channel image, since the strongly stained pixels will have significantly broader hue spectrum. Furthermore, as illustrated in Fig.4, the same DAB channel value can correspond to different pixel colours. Also, from Fig.4, it is clear that in order to separate the positive stain (brown colour) from the negative stain (blue colour), the DAB channel thresholds should be set based on the luminance values. In this paper, we use the Luminance Adaptive Multi-Thresholding (LAMT) method developed by the authors [liu2016luminance] to classify positively stained pixels. Specifically, the transformed pixel is divided into equal intervals according to the luminance:
where ; and are lower and upper boundary respectively of th luminance interval. is the luminance image of the original RGB image calculated according to Rec. 601 [rec1995bt]:
The transformed pixels are thresholded with different values according to its luminance instead of a single threshold, the threshold is assigned as follows:
where is the stain label.
Once we have separated the positive stain from the negative stain, we need to find a way to describe the stain intensity. As we have already seen in Fig.3 and Fig.4, the pixel values of can not describe the biomarker stain intensity. We propose to use a scheme described in Eq.3.1 to assign stain intensity values to pixels:
where is the stain intensity description image.
The idea is that for the positive stain pixels, is the same as the luminance component of the original image in order to preserve the morphology of the positive nuclei; for the negative stain pixels, will have a higher value for strongly stained pixels (darker blue colour) and a lower value for weakly stained pixels (lighter blue colour). In order to separate the positive and negative pixel values clearly, we add an offset of 255 to the negatively stained pixels (most negative stain pixels will have a high and positive stain pixels will have a low , the value of positive and negative pixels will be clearly separated in ). Therefore, the larger is, the weaker is the stain, the smaller is, the stronger is the stain. When is below or equal to 255, it is a positive stain pixel. In this way, we have obtained an image which gives a continuous description of the stain intensity of the image. Instead of setting artificial boundaries to separate the different degrees of stain intensity, we have now a continuous description of the stain intensity (see Fig.5). Note that the pixel values of final image are normalized to the range from 0 to 1.
3.2 Nuclei and Tumour Maps
As discussed above, the important information pathologists use to come up with the H-Score is the number of nuclei and the number of tumour nuclei in the TMA image. We therefore need to extract these two pieces of information and we use two separate FCNs, one for segmenting all nucleus and the other for segmenting tumour nucleus only.
To segment the tumour region, we use our own manually pixel-wise labelled tumour TMA images to train the FCN. While for segmenting general nuclei which detects both tumour and non-tumour nuclei, we utilize a transfer learning strategy to train another FCN. For general nuclei detection, the training data is obtained from three different datasets: immunofluorescence (IIF) stained HEp-2 cell dataset [hobson2016hep], Warwick hematoxylin and eosin (H&E) stained colon cancer dataset [sirinukunwattana2016locality], and our own DAB-H TMA images. Since these three image sets are stained with different types of biomarker, we transform the colour image into grayscale for training. Training on a mixed image set could help to reduce overfitting on limited medical dataset and further boost the performance and robustness [chen2016dcan].
For both the general nuclei detection network and the tumour nuclei detection network, we use the symmetric U shape network architecture (U-Net) [ronneberger2015u] with skip connection. The high resolution features from the contracting path are combined with the output from the upsampling path, which allows the network to learn the high resolution contextual information. The loss function is designed according the Dice coefficient as:
where is the predicted pixel and is the ground truth.
3.3 The H-Score Prediction Framework
The overview of the H-Score prediction framework is illustrated in Fig.6. It consists of three stages: 1) Nuclei segmentation, tumour segmentation, and stain intensity description; 2) Constructing the Stain Intensity Nuclei Image (SINI) and the Stain Intensity Tumour Image (SITI); 3) Predicting the final histochemical score (H-Score) by the Region Attention Multi-column Convolutional Neural Network (RAM-CNN). The rationale of this architecture is as follows: as only the number of nuclei, the number of tumour nuclei and the stain intensity of the tumour nuclei are the useful information for predicting H-Score, we therefore first extract these information. Rather than setting artificial boundaries for the categories of stain intensity, we retain a continuous description of the stain intensity. Only the information useful for predicting the H-Score is presented to a deep CNN to estimate the H-Score of the input image. This is in contrast to many work in the literature where the whole image is thrown to the CNN regardless if a region is useful or not for the purpose.
The detail of the first stage have been described in Section 3.1 and 3.2. As illustrated in Fig.6, an input TMA image is processed by the tumour detection network which will output a binary image mask, , marking all the tumour nuclei, where if is a part of a tumour nuclei and otherwise; by the general nuclei detection network which will output another binary image mask, , marking all tumour and non-tumour nuclei, where if is a part of a nuclei and otherwise; and by the colour deconvolution and stain intensity labelling operation of Equation (8) to produce the stain intensity description image . In the second stage, we construct SINI and SITI by multiplying the nuclei mask image and tumour mask image with the stain intensity description image , i.e. , and . Hence, all background pixels are zero, while only region of interests (ROI) are retained in SINI and SITI. All necessary information is preserved for histochemical assessment. Removing the background and only retaining ROI will enable the RAM-CNN convolutional layers to focus on foreground objects [li2017not] which will significantly reduce computational costs and improve performance.
|Layer||Input / Filter Dimensions|
The proposed RAM-CNN is a deep regression model with dual input channels. The architecture of RAM-CNN is shown in Table I. Two inputs correspond to SINI and SITI respectively, and the input size is . The parameters of the two individual branches are updated independently for extracting cell and tumour features respectively, without interfering with each other. The two pipelines are merged into one after two convolutional layers for H-Score prediction. The loss function for H-Score prediction is defined as:
where is the estimated score generated by RAM-CNN. is the ground truth H-Score.
4 Experiments and Results
The H-Score dataset used in our experiment contains 105 TMA images of breast adenocarcinomas from the NPI+ set [rakha2014nottingham]. Each image contains one whole TMA core. The tissues are cropped from a sample of one patient which are stained with three different nuclei activity biomarkers: ER, p53, and PgR. The original images are captured at a high resolution of optical magnification, and then resized to 10241024 pixels. The dataset is manually marked by two experienced pathologists with H-Score based on common practice. For each TMA core, the pathologists give the percentage of nuclei of different stain intensity levels, and then calculate the H-Score using Eq.1. The final label (H-Score) is determined by averaging two pathologists’ scores, if the difference between two pathologists is smaller than 20. The dataset is available from the authors on request.
For training the general nuclei detection network, we transform Warwick H&E colon adenocarcinoma [sirinukunwattana2016locality] and NPI+ images to grayscale; the green channel was extracted from HEp-2 cell dataset [hobson2016hep]. As HEp-2 cell images are IIF stained, the gray value should be inversed.
4.2 Data and Label Augmentation
As in typical medical imaging applications, the dataset sizes are relatively small. In developing deep learning based solutions, it is a common practice to augment the training dataset for training. The training images for general nuclei detection network and tumour detection network are augmented by randomly cropping sub-images as input samples. For the H-Score dataset, rotation with random angles and randomly shifting the image horizontally and vertically within 5% of image height and width are performed to augment the training set.
As shown in the top row of Fig.7, the distribution of the label (H-Score) in the original dataset is unbalanced, some labels (H-Scores) have far more samples than others. Furthermore, one of the biggest problems is that because we have only limited number of samples, the H-Score values are discrete and discontinuous. There are many gaps between two H-Scores that has no data. Also, the values of the TMA image score given by the pathologists have a quantitative step-size of 5. Therefore, if an image has a score of 125, it means it has a value of around 125, the values in the vicinity of 125, i.e., 126 or 124 should also be suitable for labelling that image. In order to solve the ambiguity issue, we introduce Distributed Label Augmentation (DLA) which was inspired by the work of [gao2017deep, zhou2012multi].