Label Super Resolution with Inter-Instance Loss

Maozheng Zhao (Stony Brook University), Le Hou (Stony Brook University), Han Le (Stony Brook University), Dimitris Samaras (Stony Brook University), Nebojsa Jojic (Microsoft Research), Danielle Fassler (Stony Brook University), Tahsin Kurc (Stony Brook University), Rajarsi Gupta (Stony Brook University), Kolya Malkin, Shahira Abousamra (Stony Brook University), Kenneth Shroyer (Stony Brook University), Joel Saltz (Stony Brook University)
Abstract

For the task of semantic segmentation, high-resolution (pixel-level) ground truth is very expensive to collect, especially for high-resolution images such as gigapixel pathology images. On the other hand, collecting low-resolution labels (labels for a block of pixels) for these high-resolution images is much more cost-efficient. Conventional methods trained on these low-resolution labels are only capable of giving low-resolution predictions. The existing state-of-the-art label super resolution (LSR) method is capable of predicting high-resolution labels using only low-resolution supervision, given the joint distribution between low-resolution and high-resolution labels. However, it does not consider the inter-instance variance, which is crucial in the ideal mathematical formulation. In this work, we propose a novel loss function modeling the inter-instance variance. We test our method on two real-world applications: cell detection in multiplex immunohistochemistry (IHC) images, and infiltrating breast cancer region segmentation in histopathology slides. Experimental results show the effectiveness of our method.

1 Introduction

Given an input image $X$ with pixels $x_{jk}$, a semantic segmentation model [18, 9, 2, 21, 1, 19, 8, 12, 33] outputs a prediction image $Y$, where each pixel label $Y_{jk}$ is one of $C$ predefined classes: $Y_{jk} \in \{1, 2, \ldots, C\}$.

Figure 1: We focus on the problem of training a neural network for high-resolution semantic segmentation with low-resolution ground truth. The key component is to construct a loss between two distributions: predicted label count, and suggested label count from a low-resolution image block.

Conventional high-resolution semantic segmentation models require large amounts of high-resolution ground truth data (pixel-level labels) [1, 19, 8]. It is very labor intensive to collect these large scale datasets, especially for datasets of gigapixel images such as pathology images [13, 17]. Weakly supervised semantic segmentation approaches [2, 21, 22, 31, 23] learn to produce pixel-level segmentation results given sparse (e.g., image-level) labels. These approaches require that the set of image-level classes be the same as the set of pixel-level classes. For example, given that an image contains a cat, the network learns to segment the cat [21]. In many applications, however, low-resolution (e.g., block-level) information may correlate with pixel-level labels in a more complex way [16]. For example, a patch in a tissue image may be assigned a probability of containing cancer tissue and may contain high/low amounts of different types of cells [16, 27].

The Label Super Resolution (LSR) method [16] models this problem by utilizing the joint distribution between low-resolution and high-resolution labels, as shown in Fig. 1. The LSR model is trained with a low-resolution label $z$ assigned to each group of pixels (i.e., an image block) $X$. Let $n_y$ be the number of pixels with high-resolution class label $y$ in an image block. LSR tries to match the actual count $n_y$ in the prediction with the count distribution $p(n_y \mid z)$ indicated by $z$.

For each fixed image block, the LSR loss matches the distribution of the predicted count $n_y$ given by the network with the distribution of $n_y$ designated by the low-resolution label $z$, i.e., $p(n_y \mid z)$. Note that the ground truth $p(n_y \mid z)$ is computed across multiple image blocks with the same low-resolution label $z$. On the other hand, the distribution of the predicted $n_y$ is computed on each fixed image block. In other words, the existing LSR loss does not consider variance across image blocks with the same $z$.

To address this problem, we propose new loss functions. The proposed loss functions match the distribution of $n_y$ across a set of image blocks with the same label $z$ to the distribution $p(n_y \mid z)$ suggested by the low-resolution label $z$. Mathematically, this models the true variance of class/label counts across image blocks, not just within an image block.

We evaluate the proposed loss functions on two image analysis tasks: semantic segmentation to identify different cell types in multiplex immunohistochemistry (IHC) images, and infiltrating breast cancer region segmentation in Hematoxylin and Eosin stained pathology images. The experimental results show that both loss functions outperform the existing LSR loss function significantly. To summarize, our contributions are as follows:

  1. Novel loss functions for label super resolution, which take into account the variance across image blocks with the same low-resolution label.

  2. A multi-class cell detection model with low resolution pathologist annotations. The model significantly outperforms color-based baselines and the existing LSR method.

  3. A breast cancer region segmentation model. The model can produce accurate high-resolution cancer segmentation boundaries with only low-resolution supervision in the training phase.

The rest of the paper is organized as follows. Sec. 2 introduces the proposed loss functions; Sec. 3 describes the detailed implementation of our method, with experiments on two pathology image analysis tasks; Finally, Sec. 4 concludes this paper.

2 Label Super Resolution

The existing Label Super Resolution (LSR) approach [16] proposed an intra-instance loss function with which it learns to super-resolve low-resolution labels. The key source of information it utilizes is the conditional distribution $p(n_y \mid z)$: the probability distribution of the count $n_y$ within an image block with low-resolution label $z$. As an example, Table 1 shows $p(n_y \mid z)$ for each high-resolution label $y$ and low-resolution label $z$ for the cancer segmentation task. In this example, $z$ is a binary label indicating if an image block is a cancer block or not; $y$ is a binary label indicating if a pixel is a cancer pixel or not (i.e., if the pixel is in a cancer cell or not). The cancer probability of an image block is provided by a patch-level cancer classifier [17, 13]. The values in Table 1 were computed through manual annotation. For each label $z$, a domain expert examined 10 to 12 image blocks with label $z$ and visually approximated $n_y$ for each image block. The visual approximation process and the effect of using visually approximated counts rather than exact counts from ground truth masks are elaborated in Sec. B of the appendices. In total, the domain expert examined 100 to 120 image blocks, instead of painstakingly delineating the precise boundaries of small and large cancer and non-cancer regions in whole slide tissue images. The cost of annotation in LSR is very low compared with conventional per-pixel labeling.

Image block with low-resolution class $z$ (probability% of being a cancer block): 0-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-95, 95-100. For each of these bins, the table reports the count% (mean ± standard deviation) of high-resolution class $y$ = cancer and $y$ = non-cancer.
Table 1: The distribution of the count (in percentage) of high-resolution labels $y$ in image blocks with low-resolution labels $z$. For example, for an image block whose probability of being a cancer block falls in a given bin, the corresponding row gives the expectation and standard deviation of the percentage of cancer pixels. Here, the cancer-block probability is given by a low-resolution cancer classifier.

All super resolution methods in this paper use the conditional distribution $p(n_y \mid z)$. We first describe the baseline method [16] as an intra-instance loss. We then formulate two new loss functions. An overview of these three loss functions is shown in Fig. 2.

Figure 2: (a). The intra-instance loss baseline [16] described in Sec. 2.1. This method models the label counts as a random variable, and derives the distribution of it given a fixed input image block . (b). Our proposed inter-instance loss described in Sec. 2.2. This method computes the distribution of label counts across input blocks, considering the label counts given by the network as a constant given a fixed input block. (c). Our proposed intra + inter-instance loss described in Sec. 2.3. This method computes the distribution of label counts across input blocks, considering the label counts given by the network as a random variable given a fixed input image block.

2.1 Baseline: intra-instance loss

We introduce the intra-instance loss [16] starting with label counting. The classification/segmentation network produces, for each pixel in the image, a probability that the pixel is in class $y$. This is expressed as $p(Y_{jk} = y \mid X^{(i)}_z)$, where $X^{(i)}_z$ is the $i$-th input image block with low-resolution label $z$, and $Y_{jk}$ is the class of the pixel with coordinates $(j, k)$. The LSR approach models the network's output at a pixel as a Bernoulli distribution. If we sampled the model's prediction at each pixel, the value of the count $n_y$ would be

$$n_y = \sum_{(j,k)} \mathbb{1}[Y_{jk} = y], \qquad Y_{jk} \sim p(\,\cdot \mid X^{(i)}_z), \qquad (1)$$

where $\mathbb{1}[\cdot]$ is the indicator function. Given the set of pixels in $X^{(i)}_z$, the value of $n_y$ is approximated by a Gaussian distribution:

$$n_y \sim \mathcal{N}(\hat{\mu}_y, \hat{\sigma}_y^2), \qquad (2)$$

where

$$\hat{\mu}_y = \sum_{(j,k)} p(Y_{jk} = y \mid X^{(i)}_z), \qquad \hat{\sigma}_y^2 = \sum_{(j,k)} p(Y_{jk} = y \mid X^{(i)}_z)\bigl(1 - p(Y_{jk} = y \mid X^{(i)}_z)\bigr). \qquad (3)$$

As shown in Table 1, the ground truth is also modeled as a Gaussian distribution, depending only on the low-resolution class $z$:

$$n_y \mid z \sim \mathcal{N}(\mu_{y,z}, \sigma_{y,z}^2). \qquad (4)$$

Statistics matching:

The LSR method minimizes the distance between $\mathcal{N}(\hat{\mu}_y, \hat{\sigma}_y^2)$ and $\mathcal{N}(\mu_{y,z}, \sigma_{y,z}^2)$ for each input $X^{(i)}_z$ with label $z$. The distance between the two Gaussian distributions is formulated as the KL divergence:

$$D\bigl(\mathcal{N}(\hat{\mu}_y, \hat{\sigma}_y^2)\,\|\,\mathcal{N}(\mu_{y,z}, \sigma_{y,z}^2)\bigr) = \log\frac{\sigma_{y,z}}{\hat{\sigma}_y} + \frac{\hat{\sigma}_y^2 + (\hat{\mu}_y - \mu_{y,z})^2}{2\sigma_{y,z}^2} - \frac{1}{2}. \qquad (5)$$
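To make the statistics matching concrete, here is a minimal PyTorch sketch of the intra-instance loss for a single image block and a single class, under the assumption that Eq. 5 is the Gaussian KL divergence written above; the function name, tensor shapes, and the small `eps` stabilizer are our own choices rather than the authors'.

```python
import torch

def intra_instance_lsr_loss(probs, mu_gt, sigma_gt, eps=1e-6):
    """Intra-instance LSR loss sketch (Eqs. 1-5).

    `probs`: (H, W) tensor of per-pixel probabilities for one class y on one block.
    `mu_gt`, `sigma_gt`: ground-truth count statistics for the block's low-resolution
    label z (e.g., from Table 1). These names and shapes are assumptions.
    """
    # Eq. 3: mean and variance of the predicted count under per-pixel Bernoullis.
    mu_hat = probs.sum()
    var_hat = (probs * (1.0 - probs)).sum()
    sigma_hat = torch.sqrt(var_hat + eps)
    # Eq. 5 (assumed form): KL divergence between the two Gaussians.
    loss = (torch.log(sigma_gt / sigma_hat)
            + (var_hat + (mu_hat - mu_gt) ** 2) / (2.0 * sigma_gt ** 2)
            - 0.5)
    return loss
```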
Drawback of Intra-instance Loss:

Given an instance (an image block) with a low-resolution label, the distribution of predicted class counts is computed with the input instance fixed. In other words, $p(n_y \mid X^{(i)}_z)$ is computed instead of $p(n_y \mid z)$. By minimizing the dissimilarity between $p(n_y \mid X^{(i)}_z)$ and $p(n_y \mid z)$, a classification/segmentation network trained to the optimum of the training error produces the same distribution of class counts for every instance, regardless of the fact that $X^{(i)}_z$ varies across instances.

2.2 Inter-instance loss

Because the distribution of real class counts $p(n_y \mid z)$ is computed across different instances (image blocks) with the same label $z$, we argue that one should also model the distribution of predicted class counts across instances.

We formulate our proposed inter-instance loss as follows. First, we develop a new intra-instance formulation. For each input instance $X^{(i)}_z$ with low-resolution label $z$, the predicted value of $n_y$ is defined as the average predicted probability for high-resolution class $y$. In this case, the predicted count for a fixed input block is deterministic:

$$n_y^{(i)} = \frac{1}{|X^{(i)}_z|} \sum_{(j,k)} p(Y_{jk} = y \mid X^{(i)}_z). \qquad (6)$$

In other words, we model the predicted count as a constant given the input block: $p(n_y \mid X^{(i)}_z) = \delta\bigl(n_y - n_y^{(i)}\bigr)$.

Using this simplified formulation, we model the predicted count across different instances as an approximate Gaussian distribution:

$$p(n_y \mid z) \approx \mathcal{N}(\tilde{\mu}_{y,z}, \tilde{\sigma}_{y,z}^2), \qquad (7)$$

where $\tilde{\mu}_{y,z}$ and $\tilde{\sigma}_{y,z}$ are computed empirically over the $N_z$ image blocks with label $z$:

$$\tilde{\mu}_{y,z} = \frac{1}{N_z}\sum_{i=1}^{N_z} n_y^{(i)}, \qquad \tilde{\sigma}_{y,z}^2 = \frac{1}{N_z}\sum_{i=1}^{N_z} \bigl(n_y^{(i)} - \tilde{\mu}_{y,z}\bigr)^2. \qquad (8)$$

In practice, it may not be possible to compute the exact $\tilde{\mu}_{y,z}$ and $\tilde{\sigma}_{y,z}$ when the number of image blocks is large and computational resources are limited. We address this problem by estimating $\tilde{\mu}_{y,z}$ and $\tilde{\sigma}_{y,z}$ on a batch of sampled instances. This strategy is well in line with stochastic mini-batch training of neural networks.

The inter-instance loss is computed as follows:

$$\mathcal{L}_{\text{inter}} = D\bigl(\mathcal{N}(\tilde{\mu}_{y,z}, \tilde{\sigma}_{y,z}^2)\,\|\,\mathcal{N}(\mu_{y,z}, \sigma_{y,z}^2)\bigr), \qquad (9)$$

where $D$ is the same Gaussian distance as in Eq. 5. Our method matches $\mathcal{N}(\tilde{\mu}_{y,z}, \tilde{\sigma}_{y,z}^2)$ to $\mathcal{N}(\mu_{y,z}, \sigma_{y,z}^2)$ by assuming that the predicted value of $n_y$ is a constant given an input block $X^{(i)}_z$.
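Below is a hedged PyTorch sketch of the inter-instance loss for one class and one group of blocks sharing the same low-resolution label. It assumes the per-block count of Eq. 6 is the average probability (so counts are fractions of pixels) and reuses the Gaussian distance of Eq. 5 for Eq. 9; shapes and names are illustrative only.

```python
import torch

def inter_instance_lsr_loss(probs_batch, mu_gt, sigma_gt, eps=1e-6):
    """Inter-instance LSR loss sketch (Eqs. 6-9).

    `probs_batch`: (N, H, W) tensor of per-pixel probabilities for one class y,
    over a group of N image blocks that share the same low-resolution label z.
    `mu_gt`, `sigma_gt`: ground-truth count statistics for label z (assumptions).
    """
    # Eq. 6: one deterministic count per block (average probability -> fraction of pixels).
    counts = probs_batch.mean(dim=(1, 2))              # shape (N,)
    # Eq. 8: empirical statistics across the group of blocks.
    mu_tilde = counts.mean()
    sigma_tilde = torch.sqrt(counts.var(unbiased=False) + eps)
    # Eq. 9 (assumed form): same Gaussian distance as Eq. 5.
    loss = (torch.log(sigma_gt / sigma_tilde)
            + (sigma_tilde ** 2 + (mu_tilde - mu_gt) ** 2) / (2.0 * sigma_gt ** 2)
            - 0.5)
    return loss
```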

Drawback of Inter-instance Loss:

The inter-instance loss does not consider intra-image variation, i.e., the confidence of the model's prediction: less confident predictions yield larger intra-image variation.

2.3 Intra + inter-instance loss

Following the intra-instance loss formulation in Sec. 2.1, the predicted label counts vary when the prediction for each pixel is viewed as a Bernoulli random variable.

Our intra+inter-instance loss is based on label count sampling, using the following sampling strategy. Given low-resolution label $z$, we first sample an image block $X^{(i)}_z$. We then use the segmentation network to compute $p(Y_{jk} = y \mid X^{(i)}_z)$. Finally, we sample a class count $n_y^{(i)} \sim \mathcal{N}(\hat{\mu}_y^{(i)}, \hat{\sigma}_y^{(i)2})$ (Eqs. 2 and 3) for every class $y$. This across-block label count is approximated by the following Gaussian distribution:

$$p(n_y \mid z) \approx \mathcal{N}(\bar{\mu}_{y,z}, \bar{\sigma}_{y,z}^2). \qquad (10)$$

Here, $n_y^{(i)}$ is the sampled label count of block $X^{(i)}_z$ with low-resolution label $z$. We compute $\bar{\mu}_{y,z}$ and $\bar{\sigma}_{y,z}$ empirically over the sampled counts:

$$\bar{\mu}_{y,z} = \frac{1}{N_z}\sum_{i=1}^{N_z} n_y^{(i)}, \qquad \bar{\sigma}_{y,z}^2 = \frac{1}{N_z}\sum_{i=1}^{N_z} \bigl(n_y^{(i)} - \bar{\mu}_{y,z}\bigr)^2. \qquad (11)$$

In practice, we estimate $\bar{\mu}_{y,z}$ and $\bar{\sigma}_{y,z}$ using a batch of image blocks. We use Eq. 9 as the statistics matching loss.
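The following PyTorch sketch illustrates the intra+inter-instance loss under the sampling strategy above. The reparameterized Gaussian sampling (so that gradients flow through the sampled counts) and the expression of counts as fractions of pixels are implementation assumptions on our part.

```python
import torch

def intra_inter_instance_lsr_loss(probs_batch, mu_gt, sigma_gt, eps=1e-6):
    """Intra+inter-instance LSR loss sketch (Eqs. 10-11 with the Eq. 9 matching).

    `probs_batch`: (N, H, W) tensor of per-pixel probabilities for one class y,
    over N blocks sharing low-resolution label z. Names/shapes are assumptions.
    """
    # Per-block count statistics under the Bernoulli model (Eqs. 2-3),
    # expressed as fractions of pixels rather than absolute counts.
    n_pix = probs_batch.shape[1] * probs_batch.shape[2]
    mu_i = probs_batch.mean(dim=(1, 2))
    var_i = (probs_batch * (1.0 - probs_batch)).sum(dim=(1, 2)) / (n_pix ** 2)
    # Sample one count per block: n_i ~ N(mu_i, var_i), reparameterized.
    n_i = mu_i + torch.sqrt(var_i + eps) * torch.randn_like(mu_i)
    # Eq. 11: empirical statistics of the sampled counts across blocks.
    mu_bar = n_i.mean()
    sigma_bar = torch.sqrt(n_i.var(unbiased=False) + eps)
    # Eq. 9: Gaussian statistics matching against the ground truth for label z.
    loss = (torch.log(sigma_gt / sigma_bar)
            + (sigma_bar ** 2 + (mu_bar - mu_gt) ** 2) / (2.0 * sigma_gt ** 2)
            - 0.5)
    return loss
```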

3 Experiments

We evaluated our loss functions with two image analysis tasks in digital pathology: cell detection in multiplex Immunohistochemistry (IHC) images and cancer segmentation in Hematoxylin and Eosin (H&E) stained images.

3.1 Cell detection in multiplex images

Analysis of human patient tissue stained by immunohistochemistry (IHC) provides information on protein expression and distribution [24, 4]. These proteins function as biomarkers that can be used to classify cells that might otherwise be indistinguishable [7]. Furthermore, these assays can give physicians and researchers information that could be used to deliver more accurate prognoses or treatments to patients, or to identify novel therapeutic targets to investigate.

Traditional IHC techniques are only able to stain one type of protein per slide; if assessment of more than one protein is necessary, multiple individual IHC staining operations must be performed on consecutive slides from the same tissue specimen. Each of these operations may add error, such as inaccurate image registration, to the final IHC results.

Multiplex IHC [30, 34] is a new technique that allows the staining of multiple markers on the same slide [6]. In our case, each of the 5 protein markers used is indicative of a specific immune cell type. Thus, analysis of the multiplex IHC digital pathology images gives information on the immune response to cancer. This is an active area of cancer research, as it is known that immune response can impact patient outcome and response to treatment.

3.1.1 Training data

The training data was extracted from 16 whole slide multiplex IHC images. Each tissue sample was stained with 6 stains binding to different types of cells, and digitized at high resolution using a digital microscopy scanner. Because the staining process and image scanning are done at once in RGB space without any specialized filters, stains may bleed into neighboring cells, and stain colors may overlap and mix in the resulting images. This makes it difficult for methods that depend on explicit color and intensity manipulations to accurately detect different types of cells. Table 2 shows the associations between the 5 types of cells considered in this work and the corresponding stains.

| | CD16 (Black) Myeloid cell | CD20 (Red) B cell | CD3 (Yellow) DN T cell | CD4 (Cyan) Helper T cell | CD8 (Purple) Cytotoxic T cell |
| # of patches with “high” cell count | 422 | 170 | 773 | 229 | 461 |
| # of patches with “low” cell count | 2552 | 2804 | 2201 | 2745 | 2513 |
| # of cells dotted in validation set | 200 | 83 | 169 | 77 | 265 |
| # of cells dotted in testing set | 326 | 183 | 343 | 111 | 478 |
Table 2: The multiplex dataset for the detection of different cell types. Identifying cell types is crucial in pathology image analysis. The “high” and “low” cell counts are low-resolution labels.

A total of 2974 patches were randomly extracted from the 11 multiplex IHC images at 10X magnification (i.e., a pixel is 1 micron in each image dimension). Under the supervision of a pathologist, a medical student assigned a high cell count or low cell count label to each patch for each cell type. The definition of high and low cell counts was given by a pathologist beforehand. These patch-level labels were used as low-resolution training labels.

The unknown high-resolution label $y$ is the pixel-level classification of each cell type. The LSR model needs the joint distribution between low-resolution labels $z$ and high-resolution labels $y$. We assume that in each patch (image block), the count of a cell type is independent of the counts of other cell types. To create the joint distribution table for our methods, a graduate student selected 12 patches for each low-resolution label and visually approximated the count of each cell type in terms of the number of pixels. The mean and standard deviation of pixel counts among the 12 patches were used as the ground truth values of $\mu_{y,z}$ and $\sigma_{y,z}$.

3.1.2 Testing and validation data

Ideally, semantic segmentation results are evaluated by the Intersection Over Union (IoU) score computed from predicted and ground truth segmentation masks. As mentioned before, the pixel level ground truth mask is very expensive to produce. We use cell locations as weak labels. Each cell’s center pixel is given a cell type label. In our experiments, 2 undergraduate students labeled 68 patches under the supervision of a pathologist. In these 68 patches, 12 patches were labeled by both of the 2 students to evaluate the inter-rater agreement. We used 24 labeled patches as validation data and the remaining 44 patches as testing data.

3.1.3 Evaluation method

We use the F1-score to evaluate the cell segmentation results. The segmentation network outputs a class (cell type) prediction at each pixel in the image. We view each isolated segmented region of class $y$ as a detection result for a cell or a group of cells of class $y$. The recall and precision for each cell type are defined as below:

$$\mathrm{Recall}_y = \frac{\#\{\text{ground truth dots of class } y \text{ covered by a predicted region of class } y\}}{\#\{\text{ground truth dots of class } y\}}, \qquad (12)$$

$$\mathrm{Precision}_y = \frac{\#\{\text{predicted regions of class } y \text{ containing at least one ground truth dot of class } y\}}{\#\{\text{predicted regions of class } y\}}. \qquad (13)$$

Since the cell types are mutually exclusive, no trivial method (such as predicting every pixel as one cell type) would achieve the best overall precision and recall.
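Since Eqs. 12-13 are reconstructed here from the surrounding text (a ground-truth dot counts as recalled if it falls inside a predicted region of its class, and a predicted region counts as a true positive if it contains at least one such dot), the following sketch should be read under that assumption; the function and variable names are ours.

```python
import numpy as np
from scipy import ndimage

def detection_f1(pred_mask, gt_dots):
    """F1 sketch for one cell type, following the assumed Eqs. 12-13.

    `pred_mask`: binary (H, W) array of the class's predicted pixels.
    `gt_dots`: list of (row, col) integer coordinates of annotated cell centers.
    """
    labeled, n_regions = ndimage.label(pred_mask)   # isolated predicted regions
    hit_regions = set()
    recalled = 0
    for (r, c) in gt_dots:
        region_id = labeled[r, c]
        if region_id > 0:                           # dot falls inside a predicted region
            recalled += 1
            hit_regions.add(region_id)
    recall = recalled / max(len(gt_dots), 1)
    precision = len(hit_regions) / max(n_regions, 1)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```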

3.1.4 Baseline and proposed loss methods

Since the cell types are differentiated by the color of the corresponding stain(s), a straightforward way to detect different cells is to do color decomposition. Color based methods, however, generally do not work well; they are limited by chromatic overlapping [28], the bleeding and mixing of stains, and the lack of other important information such as cell shape. We compared our approach to two color-based methods.

Color deconvolution:

Color deconvolution [26, 15, 20] is widely used for color decomposition of IHC images with no more than 3 stains. It is limited by the matrix inverse process which requires that the number of input color channels be equal to the number of decomposed channels. Multiplex IHC images possibly contain more than 3 stains, encoded in 3-channel (RGB) images. In those cases, color deconvolution is applied to decompose 3 stains at a time.

Color separation via L2 distance:

This method decomposes each stain by directly computing the L2 distance between each pixel's RGB value and the reference RGB value of the stain. This color decomposition method does not limit the number of output decomposed stains.
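A minimal sketch of this color separation step, assuming the reference stain colors are supplied by the user and that the L2 distance is simply rescaled into a [0, 1] confidence; the exact reference RGB values and normalization used in the paper are not specified, so these are placeholders.

```python
import numpy as np

def l2_stain_maps(rgb_image, stain_rgbs):
    """Color-separation-by-L2 sketch: one per-pixel response map per stain.

    `rgb_image`: (H, W, 3) array with values in [0, 255].
    `stain_rgbs`: dict mapping stain name -> reference (r, g, b) tuple (assumed).
    """
    img = rgb_image.astype(np.float32)
    maps = {}
    for name, ref in stain_rgbs.items():
        dist = np.linalg.norm(img - np.asarray(ref, dtype=np.float32), axis=-1)
        # Convert distance to a confidence in [0, 1]: closer color -> higher value.
        maps[name] = 1.0 - dist / np.sqrt(3.0 * 255.0 ** 2)
    return maps
```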

Segmentation with limited high res labels:

We use the “color separation via L2 distance” method to derive a small number of high-resolution labels for training a semantic segmentation network. In particular, limited, high-confidence, high-resolution predictions are obtained based on the low-resolution ground truth labels and the color decomposition results. These sparse high-resolution prediction results are used as pseudo labels for training a fully supervised semantic segmentation network. We refer to this method as high res. If a training image patch has label high for a stain (type of cell), the pixels in its L2 color decomposition with the top 0.25% to 0.5% confidence are selected and assigned the high-resolution label for that stain. This is a time-efficient way to generate high-resolution ground truth labels for training. We use a U-net-like semantic segmentation network [25]. The output has 6 classes: 5 cell types and 1 background class. We assume that the different types of cells do not overlap spatially. Thus, the last layer of the network is a 6-way softmax layer.

Label super resolution:

We build Label Super Resolution (LSR) models to segment multiple types of cells to sub-pixel level, trained with low resolution labels. We test the three LSR loss functions as described in the previous sections: Intra-instance, as described in Sec. 2.1; Inter-instance, as described in Sec. 2.2; and the Intra+inter-instance, as described in Sec. 2.3. The only difference between these three methods is the way the predicted label count distribution is modeled, as described in Fig. 2.

Label super resolution adding limited high res labels:

We trained the same semantic segmentation network with both the limited high-resolution labels and the label super resolution loss. The high-resolution loss and super-resolution loss terms are weighted and averaged to form the final loss. The weights of the losses are selected based on visual and quantitative evaluation on the validation set. We introduce three methods with this setting: Intra-instance & high res, Inter-instance & high res, and Intra+inter-instance & high res.

3.1.5 Training details

We use the RMSprop optimizer [11] with a learning rate of 0.5 (the loss applied to each output pixel is averaged instead of summed) to train all of the networks. For the high-resolution-only setting (the “high res” method) the batch size is 10. For the networks with the intra-instance label super resolution loss, we also used a batch size of 10. In the inter-instance setting, the loss is first computed within a group of 10 images (recall that the loss requires inter-instance statistics), and two groups are used in a batch. Similarly, in the intra+inter-instance setting, the group size is 10 and there are 2 groups in each batch.

3.1.6 Experimental results

The F1-scores on the testing set are in Table 3. Color decomposition based methods perform poorly because they do not consider critical information such as the shapes of different cell types. Using the intra-instance loss together with limited high-resolution supervision outperforms the network that uses only limited high-resolution supervision. More importantly, the intra+inter-instance loss outperforms the intra-instance loss significantly.

Figure 3: Multiplex immunohistochemistry image patches and cell detection results. The yellow contours in an image are the detected/segmented cell boundaries. Ground truth cell locations annotated by medical students are shown as red (successfully detected) and green (missed) dots. Due to chromatic overlapping [28] and other problems, color based methods such as color deconvolution do not perform well on multiplex images. Our network with the intra+inter-instance label super resolution loss achieves the best result.
Figure 4: Cancer segmentation results. The baseline intra-instance loss LSR [16] generates prediction results with pepper noise, due to the lack of inter-instance variance modeling: it forces the network to predict a certain label count given a fixed low resolution label, regardless of its input image block (patch). On the other hand, our Intra+inter-instance yields smoother and more accurate segmentation results. The green area in the cancer boundary image indicates the mask in which we compute the masked IoU and DICE.

3.1.7 Inter-rater agreement

The inter-rater agreement is the averaged F1-score across each pair of human raters: if two dots given by two separate human raters are within $d$ pixels of each other, we consider these two dots a match; otherwise, we consider them unmatched. One dot given by a human rater can match at most one dot (the closest dot) given by another human rater. The F1-score is then computed using one human rater's dot annotations as if they were ground truth and another human rater's dot annotations as if they were detection results. The F1-score computed in this way is shown in Table 3. The value of $d$ is roughly the average radius of a cell. We note that the inter-rater F1-scores and the algorithm's F1-scores are not directly comparable due to different evaluation protocols. The purpose of showing inter-rater F1-scores is to stress that cell detection is very hard, even in multiplex images.
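A sketch of the dot-matching procedure, assuming a greedy closest-pair matching within distance $d$ (the text only states that a dot matches at most one, closest, counterpart, so the exact matching algorithm is our assumption):

```python
import numpy as np

def dot_matching_f1(dots_a, dots_b, d):
    """Inter-rater F1 sketch: rater A's dots as ground truth, rater B's as detections.

    `dots_a`, `dots_b`: arrays of (row, col) coordinates; `d`: match radius in pixels.
    """
    dots_a, dots_b = np.asarray(dots_a, float), np.asarray(dots_b, float)
    if len(dots_a) == 0 or len(dots_b) == 0:
        return 0.0
    dist = np.linalg.norm(dots_a[:, None, :] - dots_b[None, :, :], axis=-1)
    matched_a, matched_b = set(), set()
    # Greedily match the globally closest remaining pair within distance d.
    for idx in np.argsort(dist, axis=None):
        i, j = np.unravel_index(idx, dist.shape)
        if dist[i, j] > d:
            break
        if i not in matched_a and j not in matched_b:
            matched_a.add(i)
            matched_b.add(j)
    tp = len(matched_a)
    precision = tp / len(dots_b)
    recall = tp / len(dots_a)
    return 2 * precision * recall / max(precision + recall, 1e-12)
```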

| Method | Average | CD16 Myeloid cell | CD20 B cell | CD3 DN T cell | CD4 Helper T cell | CD8 Cytotoxic T cell |
| Inter-rater agreement* | 0.7411 | 0.6387 | 0.7007 | 0.7035 | 0.8253 | 0.8372 |
| Color deconvolution | 0.1846 | 0.1294 | 0.2031 | 0.1236 | 0.3420 | 0.1246 |
| Color separation via L2 | 0.2655 | 0.2306 | 0.1911 | 0.2623 | 0.2913 | 0.3521 |
| high res | 0.5013 | 0.4601 | 0.3690 | 0.5060 | 0.5505 | 0.6207 |
| Intra-instance & high res | 0.5241 | 0.4031 | 0.4339 | 0.4801 | 0.6137 | 0.6894 |
| Inter-instance & high res | 0.5355 | 0.5082 | 0.3825 | 0.5541 | 0.5755 | 0.6574 |
| Intra+inter-instance & high res | 0.5507 | 0.4934 | 0.4394 | 0.5248 | 0.6190 | 0.6772 |
Table 3: The F1-scores of cell detection in multiplex pathology images. Descriptions of the methods tested are in Sec. 3.1.4. Color-based methods perform poorly because they do not consider critical information such as the shapes of different cell types. More importantly, the inter-instance loss outperforms the intra-instance loss baseline, and the proposed intra+inter-instance loss achieves the best result. *: the inter-rater F1-scores and the algorithm's F1-scores are not directly comparable; see Sec. 3.1.7.

3.2 Breast cancer region segmentation

Automatic cancer segmentation in pathology images has significant applications such as computer aided diagnosis and scientific studies [5]. Manually annotating pixel-accurate cancer regions is time consuming, cost ineffective, and ambiguous. On the other hand, low resolution labels are relatively easy to collect and publicly available. Existing methods utilize low resolution labels to automatically produce low resolution segmentation results [27]. However, high resolution segmentation results have unique advantages such as showing accurate cancer boundaries which are important for the analysis of invasive carcinoma and infiltrating patterns of cancer [14, 32]. Our proposed method is able to produce high resolution segmentation results using the low-resolution annotations.

3.2.1 Dataset

We applied the proposed method to the task of cancer segmentation in breast carcinoma. Our low-resolution labels are automatically generated by a cancer/non-cancer region classifier. The classifier labels one patch at a time, giving it a probability of being cancer. The probability value is then quantized into 10 bins, which serve as 10 low-resolution classes. Using this classifier, we labeled 1,092 breast carcinoma (BRCA) slides in The Cancer Genome Atlas (TCGA) repository [29], patch by patch. From 1,000 slides, we randomly extracted 26,767 patches with their low-resolution labels as training data. The patches from the remaining 92 slides were used for validation and testing. For training, the patches were downsampled to 2.4X magnification (4.2 microns per pixel). The classifier has a DICE score of 0.726 on the HASHI cancer segmentation dataset [3], which has 196 TCGA slides. The details of the classifier are described in Sec. C of the appendices.

The joint distribution between the low-resolution labels $z$ and the count of high-resolution labels $n_y$ is in Table 1. The process of computing this table is described in Sec. 2.

3.2.2 Evaluation method

For evaluating our high-resolution cancer segmentation results, we collected 49 patches at 2.5X magnification and carefully annotated cancer regions in detail. 42 of them are used as the test set and 7 of them are used as the validation set. We use the Intersection over Union (IoU) and DICE coefficient scores as the evaluation metrics.

Since the only difference between low and high resolution cancer maps appears near cancer/non-cancer boundaries, we compute IoU and DICE scores only in areas within a distance of 240 pixels (1000 microns, the width of an input patch) from the ground truth cancer/non-cancer boundaries. We call these metrics masked IoU and masked DICE. These two scores show performance differences only in the regions that matter.
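A sketch of the masked IoU/DICE computation, assuming the boundary band is built with Euclidean distance transforms of the ground truth mask; variable names and the exact band construction are our own choices.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def masked_iou_dice(pred, gt, band=240):
    """Masked IoU/DICE sketch: scores computed only within `band` pixels of the
    ground-truth cancer/non-cancer boundary. `pred`, `gt` are binary (H, W) arrays."""
    gt_b = gt.astype(bool)
    # Distance of each pixel to the GT boundary: cancer pixels get their distance to
    # non-cancer, non-cancer pixels get their distance to cancer.
    dist_to_boundary = distance_transform_edt(gt_b) + distance_transform_edt(~gt_b)
    mask = dist_to_boundary <= band
    p = pred.astype(bool) & mask
    g = gt_b & mask
    inter = np.logical_and(p, g).sum()
    union = np.logical_or(p, g).sum()
    iou = inter / max(union, 1)
    dice = 2 * inter / max(p.sum() + g.sum(), 1)
    return iou, dice
```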

3.2.3 Implementation details

Similar to the multiplex application, we use a U-net-like architecture [25] with label super resolution losses. We do not use any high-resolution data during training: only label super resolution methods are used. We use the RMSprop optimizer [11] to train all networks. In the intra-instance setting, we use a batch size of 30 and a learning rate of 0.00001. For the intra+inter-instance loss, the loss is computed using a group of 15 instances, each batch has 2 groups, and the learning rate is 0.001.

3.2.4 Experimental results

We compare our methods to the original low-resolution results given by the cancer/non-cancer region classification method, which we call the low resolution model. The quantitative results are shown in Table 4. The proposed intra+inter-instance loss super-resolves the low-resolution cancer region boundaries given by the low resolution model. This means that our method can generate finer cancer segmentation results with a very limited amount of additional annotation labor. More importantly, the network with the intra+inter-instance loss outperforms the network with the intra-instance loss.

| Method | Masked IoU | Masked DICE |
| Low resolution model | 0.5722 | 0.7278 |
| Intra-instance | 0.5810 | 0.7350 |
| Intra+inter-instance | 0.5953 | 0.7463 |
Table 4: Quantitative results for cancer segmentation in pathology slides. The masked IoU/DICE is computed only in areas around cancer/non-cancer boundaries. It evaluates label super resolution methods in areas that matter, since prediction results totally inside/outside cancer regions do not need to be super resolved. In this sense, the proposed intra+inter-instance loss yields better results compared to the original low resolution cancer results. The network with intra+inter-instance loss outperforms the network with intra-instance loss consistently.

4 Conclusions

The high cost of high-resolution annotations for training pixel-level classification and segmentation models is a major roadblock to the effective application of deep learning in digital pathology and other domains that generate and analyze very high-resolution images. A label super resolution approach can address this problem by using low-resolution annotations, but the current implementations do not take into account variations across image patches. The novel loss functions proposed in this work aim to alleviate this limitation. Our empirical results show that the across-instance losses better capture and model the variance of high-resolution labels across image blocks with the same low-resolution label. As a result, they are capable of outperforming the existing baselines significantly. In the future, we plan to generalize this approach to detection networks, in addition to segmentation.

5 Acknowledgements

This work was supported in part by 1U24CA180924-01A1, 3U24CA215109-02, and 1UG3CA225021-01 from the National Cancer Institute, R01LM009239 from the U.S. National Library of Medicine, and a gift from Adobe. Approval from (SBU) Institutional Review Board (IRB) - SBU IRB number 94651-31.

References

  • [1] V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(12):2481–2495, 2017.
  • [2] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE transactions on pattern analysis and machine intelligence, 40(4):834–848, 2018.
  • [3] A. Cruz-Roa, H. Gilmore, A. Basavanhally, M. Feldman, S. Ganesan, N. Shih, J. Tomaszewski, A. Madabhushi, and F. González. High-throughput adaptive sampling for whole-slide histopathology image analysis (hashi) via convolutional neural networks: Application to invasive breast cancer detection. PloS one, 13(5):e0196828, 2018.
  • [4] D. J. Dabbs. Diagnostic immunohistochemistry e-book. Elsevier Health Sciences, 2013.
  • [5] B. Gecer, S. Aksoy, E. Mercan, L. G. Shapiro, D. L. Weaver, and J. G. Elmore. Detection and classification of cancer in whole slide breast histopathology images using deep convolutional networks. Pattern recognition, 84:345–356, 2018.
  • [6] M. A. Gorris, A. Halilovic, K. Rabold, A. van Duffelen, I. N. Wickramasinghe, D. Verweij, I. M. Wortel, J. C. Textor, I. J. M. de Vries, and C. G. Figdor. Eight-color multiplex immunohistochemistry for simultaneous detection of multiple immune checkpoint molecules within the tumor microenvironment. The Journal of Immunology, 200(1):347–354, 2018.
  • [7] C. P. Hans, D. D. Weisenburger, T. C. Greiner, R. D. Gascoyne, J. Delabie, G. Ott, H. K. Müller-Hermelink, E. Campo, R. M. Braziel, E. S. Jaffe, et al. Confirmation of the molecular classification of diffuse large b-cell lymphoma by immunohistochemistry using a tissue microarray. Blood, 103(1):275–282, 2004.
  • [8] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle. Brain tumor segmentation with deep neural networks. Medical image analysis, 35:18–31, 2017.
  • [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 2961–2969, 2017.
  • [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.
  • [11] G. Hinton, N. Srivastava, and K. Swersky. Neural networks for machine learning, Lecture 6a: Overview of mini-batch gradient descent. Coursera lecture slides, 2012.
  • [12] L. Hou, V. Nguyen, A. B. Kanevsky, D. Samaras, T. M. Kurc, T. Zhao, R. R. Gupta, Y. Gao, W. Chen, D. Foran, et al. Sparse autoencoder for unsupervised nucleus detection and representation in histopathology images. Pattern recognition, 86:188–200, 2019.
  • [13] L. Hou, D. Samaras, T. M. Kurc, Y. Gao, J. E. Davis, and J. H. Saltz. Patch-based convolutional neural network for whole slide tissue image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2424–2433, 2016.
  • [14] J. Jass, Y. Ajioka, J. Allen, Y. Chan, R. Cohen, J. Nixon, M. Radojkovic, A. Restall, S. Stables, and L. Zwi. Assessment of invasive growth pattern and lymphocytic infiltration in colorectal cancer. Histopathology, 28(6):543–548, 1996.
  • [15] A. M. Khan, N. Rajpoot, D. Treanor, and D. Magee. A nonlinear mapping approach to stain normalization in digital histopathology images using image-specific color deconvolution. IEEE Transactions on Biomedical Engineering, 61(6):1729–1738, 2014.
  • [16] K. Malkin, C. Robinson, L. Hou, R. Soobitsky, J. Czawlytko, D. Samaras, J. Saltz, L. Joppa, and N. Jojic. Label super-resolution networks. In International Conference on Learning Representations (ICLR), 2019.
  • [17] Y. Liu, K. Gadepalli, M. Norouzi, G. E. Dahl, T. Kohlberger, A. Boyko, S. Venugopalan, A. Timofeev, P. Q. Nelson, G. S. Corrado, et al. Detecting cancer metastases on gigapixel pathology images. arXiv preprint arXiv:1703.02442, 2017.
  • [18] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
  • [19] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1520–1528, 2015.
  • [20] D. Onder, S. Zengin, and S. Sarioglu. A review on color normalization and color deconvolution methods in histopathology. Applied Immunohistochemistry & Molecular Morphology, 22(10):713–719, 2014.
  • [21] G. Papandreou, L.-C. Chen, K. P. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a deep convolutional network for semantic image segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1742–1750, 2015.
  • [22] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE international conference on computer vision, pages 1796–1804, 2015.
  • [23] K. Rakelly, E. Shelhamer, T. Darrell, A. A. Efros, and S. Levine. Few-shot segmentation propagation with guided networks. arXiv preprint arXiv:1806.07373, 2018.
  • [24] J. Ramos-Vara. Technical aspects of immunohistochemistry. Veterinary pathology, 42(4):405–426, 2005.
  • [25] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pages 234–241. Springer, 2015.
  • [26] A. C. Ruifrok, D. A. Johnston, et al. Quantification of histochemical staining by color deconvolution. Analytical and quantitative cytology and histology, 23(4):291–299, 2001.
  • [27] J. Saltz, R. Gupta, L. Hou, T. Kurc, P. Singh, V. Nguyen, D. Samaras, K. R. Shroyer, T. Zhao, R. Batiste, et al. Spatial organization and molecular correlation of tumor-infiltrating lymphocytes using deep learning on pathology images. Cell reports, 23(1):181–193, 2018.
  • [28] E. C. Stack, C. Wang, K. A. Roman, and C. C. Hoyt. Multiplexed immunohistochemistry, imaging, and quantitation: a review, with an assessment of tyramide signal amplification, multispectral imaging and multiplex analysis. Methods, 70(1):46–58, 2014.
  • [29] The TCGA team. The Cancer Genome Atlas. https://cancergenome.nih.gov/.
  • [30] T. Tsujikawa, S. Kumar, R. N. Borkar, V. Azimi, G. Thibault, Y. H. Chang, A. Balter, R. Kawashima, G. Choe, D. Sauer, et al. Quantitative multiplex immunohistochemistry reveals myeloid-inflamed tumor-immune complexity associated with poor prognosis. Cell reports, 19(1):203–217, 2017.
  • [31] Y. Wei, X. Liang, Y. Chen, X. Shen, M.-M. Cheng, J. Feng, Y. Zhao, and S. Yan. Stc: A simple to complex framework for weakly-supervised semantic segmentation. IEEE transactions on pattern analysis and machine intelligence, 39(11):2314–2320, 2017.
  • [32] B. Weigelt, F. C. Geyer, R. Natrajan, M. A. Lopez-Garcia, A. S. Ahmad, K. Savage, B. Kreike, and J. S. Reis-Filho. The molecular underpinning of lobular histological growth pattern: a genome-wide transcriptomic analysis of invasive lobular carcinomas and grade-and molecular subtype-matched invasive ductal carcinomas of no special type. The Journal of Pathology: A Journal of the Pathological Society of Great Britain and Ireland, 220(1):45–57, 2010.
  • [33] J. Xu, X. Luo, G. Wang, H. Gilmore, and A. Madabhushi. A deep convolutional neural network for segmenting and classifying epithelial and stromal regions in histopathological images. Neurocomputing, 191:214–223, 2016.
  • [34] E. Yanagita, N. Imagawa, C. Ohbayashi, and T. Itoh. Rapid multiplex immunohistochemistry using the 4-antibody cocktail yana-4 in differentiating primary adenocarcinoma from squamous cell carcinoma of the lung. Applied Immunohistochemistry & Molecular Morphology, 19(6):509–513, 2011.

APPENDICES

Appendix A Assumption made during the computation of intra-instance variance

In the Intra-instance loss, the variance of label counts is computed using the following equation (also Eq. 3 in the main submission):

$$\hat{\sigma}_y^2 = \sum_{(j,k)} p(Y_{jk} = y \mid X)\bigl(1 - p(Y_{jk} = y \mid X)\bigr). \qquad (14)$$

Eq. 14 assumes that the per-pixel predictions $Y_{jk}$, for different pixels $(j,k)$, are independent of each other. We explain this in detail:

Let $m$ be a condensed index representing $(j,k)$, and let $b_m = \mathbb{1}[Y_m = y]$, where $\mathbb{1}[\cdot]$ is the indicator function. Then

$$\mathrm{Var}(n_y) = \mathrm{Var}\Bigl(\sum_m b_m\Bigr) = \sum_m \mathrm{Var}(b_m) + \sum_{m \neq m'} \mathrm{Cov}(b_m, b_{m'}). \qquad (15)$$

By assuming that the $b_m$ are independent of each other, we have $\mathrm{Cov}(b_m, b_{m'}) = 0$ for all $m \neq m'$. Thus,

$$\mathrm{Var}(n_y) \approx \sum_m \mathrm{Var}(b_m) = \sum_m p(Y_m = y \mid X)\bigl(1 - p(Y_m = y \mid X)\bigr). \qquad (16)$$

In practice, the assumption that the $b_m$ are independent of each other is usually not true. In other words, $\mathrm{Cov}(b_m, b_{m'}) > 0$ for some $m \neq m'$, since nearby pixels tend to share the same label. As a result, the value of Eq. 16 is strictly smaller than the true variance in Eq. 15.

The statistics matching process tries to match the variance computed by Eq. 16 to the empirical variance. Since Eq. 16 would be smaller than the empirical variance, directly matching them would introduce bias. Let $\alpha$ be a scale factor with

$$0 < \alpha \leq 1, \qquad (17)$$

and denote the empirical variance as $\sigma^2$. We can match the value of Eq. 16 with $(\alpha\sigma)^2$. The optimal value of $\alpha$ depends on the distribution of the data.

For the Intra-instance LSR baseline, instead of matching $(\hat{\mu}, \hat{\sigma})$ with $(\mu, \sigma)$ respectively by Eq. 5 in the original submission, we match $(\hat{\mu}, \hat{\sigma})$ with $(\mu, \alpha\sigma)$ respectively. The hyperparameter $\alpha$ is selected via experiments. Note that this term is neither presented nor used in the original Intra-instance LSR paper [16].

In the proposed Intra+inter-instance LSR setting, the variance computed by Eq. (11) in the original submission is, in expectation,

$$\bar{\sigma}_{y,z}^2 \approx \frac{1}{N_z}\sum_{i=1}^{N_z} \hat{\sigma}_y^{(i)2} + \frac{1}{N_z}\sum_{i=1}^{N_z} \bigl(\hat{\mu}_y^{(i)} - \bar{\mu}_{y,z}\bigr)^2. \qquad (18)$$

Here $\hat{\sigma}_y^{(i)2}$ is also computed by Eq. 16. Thus, it is also strictly smaller than the actual intra-block variance on real-world datasets, and therefore $\bar{\sigma}_{y,z}^2$ is also smaller than the actual across-block variance. In our experiments, we also select a hyperparameter $\alpha$ for matching $(\bar{\mu}_{y,z}, \bar{\sigma}_{y,z})$ with $(\mu_{y,z}, \alpha\sigma_{y,z})$ respectively. To investigate the influence of $\alpha$ on the final performance of different models, we test different values of $\alpha$ on the breast cancer segmentation task; the results are shown in Table 5. We see that for the breast cancer segmentation task, the best value is $\alpha = 0.8$.

| Scale factor $\alpha$ | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 |
| Intra+inter-instance loss (Masked IoU) | 0.5656 | 0.5753 | 0.5716 | 0.6275 | 0.6223 |
Table 5: The performance of the proposed Intra+inter-instance LSR for the breast cancer segmentation task with different scale factors $\alpha$ for scaling the ground truth empirical standard deviation.

Appendix B Results using alternative ground truth label counts

The label super resolution network is trained using the conditional distribution $p(n_y \mid z)$, which is the distribution of label counts $n_y$ within an image block with a given low-resolution label $z$.

B.1 Visually approximating ground truth label counts

In the main submission we show a less accurate but faster way of obtaining $p(n_y \mid z)$, which we call visual approximation. The process is as follows: a domain expert visually approximates the count of cancer pixels in an image block without drawing the exact cancer mask. The count of the remaining pixels in the block is then the count of pixels for the non-cancer region. The distribution of these counts for each low-resolution class $z$ is the visually approximated $p(n_y \mid z)$, as shown in Table 6.

B.2 Estimating ground truth label counts using masks

In practice, training with the visually approximated $p(n_y \mid z)$ saves annotation time but may impede the performance of the model. Thus, we also show the performance of models trained with a mask-estimated $p(n_y \mid z)$. We call this process mask estimation. The process is as follows: a domain expert draws an accurate mask of the cancer regions for each image block. The count of pixels for a class (cancer or non-cancer) is then directly computed from the mask of this image block. Given a low-resolution label $z$, the mask-estimated distribution of $n_y$ can be computed, as shown in Table 7.

For both the visual approximation and mask estimation methods, we extracted 12-20 blocks for each of the low-resolution classes, for a total of 167 blocks. For each low-resolution label $z$, we extracted at most one image block with label $z$ per WSI.

B.3 Results of visual approximation and mask estimation

The time consumed to visually approximate the counts for all the images and to draw the cancer masks for all the images is shown in Table 8. From the table we can see that the mask drawing time per image is roughly 2.4 times the visual approximation time. It should be noted that training a deep model using pixel-level supervision that generalizes well to different slides requires many more drawn masks; the time to draw 167 blocks here is still much less than the actual annotation time for training a traditional pixel-level supervised semantic segmentation model. We also show the performance of the model trained only with those 167 blocks with high-resolution supervision in the second row of Table 9.

The performance of models trained using the visually approximated $p(n_y \mid z)$, the mask-estimated $p(n_y \mid z)$, and limited high-resolution supervision is shown in Table 9. We can see that for the Intra-instance LSR, mask estimation does not significantly improve performance, whereas for the intra+inter-instance LSR, mask estimation significantly improves performance. As a result, with the mask-estimated $p(n_y \mid z)$, the intra+inter-instance LSR outperforms the intra-instance LSR significantly. Both LSR methods outperform the low-resolution labels (the first row in the table), which shows the effectiveness of label super resolution. Both LSR methods also outperform the model trained with the 167 blocks with pixel-level supervision; this shows that training a pixel-level supervised semantic segmentation model requires a large amount of pixel-level supervision, and a limited amount of pixel-level supervision may lead to overfitting.

Image block with low-resolution class $z$ (probability% of being a cancer block): 0-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-95, 95-100. For each bin, the table reports the visually approximated count% (mean ± standard deviation) of high-resolution class $y$ = cancer and $y$ = non-cancer.
Table 6: The mean% ± standard deviation% of the visually approximated count (in percentage) of high-resolution labels $y$ in image blocks with low-resolution labels $z$.
Image block with low-resolution class $z$ (probability% of being a cancer block): 0-20, 20-30, 30-40, 40-50, 50-60, 60-70, 70-80, 80-90, 90-95, 95-100. For each bin, the table reports the mask-estimated count% (mean ± standard deviation) of high-resolution class $y$ = cancer and $y$ = non-cancer.
Table 7: The mean% ± standard deviation% of the mask-estimated count (in percentage) of high-resolution labels $y$ in image blocks with low-resolution labels $z$.
| | Visual approximation | Drawing masks |
| Time for 167 images | 56 min 32 s | 136 min 17 s |
| Average time for 1 image | 20.3 s | 48.96 s |
Table 8: The time consumption of the two methods for estimating $p(n_y \mid z)$.
| Method | Masked IoU | Masked DICE |
| Low resolution model | 0.5722 | 0.7279 |
| Model trained with limited high-res supervision | 0.5507 | 0.7103 |
| Intra-instance LSR (visually approximated $p(n_y \mid z)$) | 0.5827 | 0.7363 |
| Intra-instance LSR (mask estimated $p(n_y \mid z)$) | 0.5832 | 0.7367 |
| Intra+inter-instance LSR (visually approximated $p(n_y \mid z)$) | 0.5850 | 0.7381 |
| Intra+inter-instance LSR (mask estimated $p(n_y \mid z)$) | 0.6315 | 0.7741 |
Table 9: Quantitative results for cancer segmentation in pathology slides. The masked IoU/DICE is computed only in areas around cancer/non-cancer boundaries. It evaluates label super resolution methods in areas that are within a distance of 240 pixels (1000 microns, width of an input patch) away from the ground truth cancer/non-cancer boundaries.

Appendix C Details of the patch-level breast cancer classifier

In Sec. 3.2.1 of the main submission, we use a patch-level classifier to automatically generate low resolution labels. We show details of the classifier here.

The patch-level breast cancer classifier labels patches with probabilities of containing cancer. These probabilities are quantized into 10 bins as low-resolution labels for label super resolution, since the probability of containing cancer given by a classifier is correlated with the percentage of cancer regions. We trained the classifier using 102 Whole Slide Images (WSIs) from the Surveillance, Epidemiology, and End Results (SEER) dataset as training data. A pathologist drew the boundaries of cancer regions in the WSIs, generating a cancer region mask. To train the patch-level classifier, we extracted patches at 40X magnification. The label for each patch (0 or 1) was set by thresholding the ratio of cancer region in the patch at 0.5. We used ResNet34 [10] as the patch classification network. The resulting classifier was validated on a set of 7 WSIs and tested on a set of 89 WSIs. The DICE score between the prediction of this classifier and the ground truth masks of the test set in [3] is 0.791.

For label super resolution in Sec. 3.2 of the main submission, we merge four patches into one image block. The low-resolution label, a quantized cancer probability, of an image block is the maximum probability among its 4 patches. The image block is then resized for training the label super resolution network.
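A small sketch of how a block's low-resolution label could be derived from its constituent patch probabilities, following the max-then-quantize rule above and the bin edges of Table 1; the function name and the treatment of bin boundaries are assumptions.

```python
import numpy as np

# Bin edges (in %) follow Table 1 of the main text.
BIN_EDGES = [0, 20, 30, 40, 50, 60, 70, 80, 90, 95, 100]

def block_low_res_label(patch_probs):
    """Low-resolution label sketch for one image block.

    `patch_probs`: the cancer probabilities of the block's 4 constituent patches.
    Returns a bin index in {0, ..., 9}, used as the low-resolution class z.
    """
    p = 100.0 * float(np.max(patch_probs))        # max patch probability, in percent
    z = int(np.digitize(p, BIN_EDGES[1:-1]))      # quantize into the 10 Table 1 bins
    return z
```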

Appendix D Visual examples of cell detection results in multiplex Immunohistochemistry (IHC) images

Fig. 5 and Fig. 6 show the cell detection results for two patches using different methods. From the results we can see that the color based methods (color deconvolution and color separation by L2 distance) work poorly because they do not consider critical information such as the shapes of different cell types. The LSR methods are better able to detect cells with proper shapes. The Inter-instance LSR and Intra+inter-instance LSR tend to have fewer false positive predictions than the Intra-instance LSR, especially for CD16 and CD8, as shown in Fig. 5.

Fig. 7 shows more cell segmentation results using the proposed intra+inter-instance loss. The model detects different types of cells reasonably well. Note that the model is trained without accurate pixel-level annotations drawn by humans.

Figure 5: Visual examples of cell detection in multiplex Immunohistochemistry (IHC) images using different methods. Images in each row are the results of detecting 5 different cell types, for an input patch. The yellow contours in an image are the detected/segmented cell boundaries. Ground truth cell locations annotated by medical students are shown in red (successfully detected) and green (missed) crosses.
Figure 6: Visual examples of cell detection in multiplex Immunohistochemistry (IHC) images using different methods. Images in each row are the results of detecting 5 different cell types, for an input block. The yellow contours in an image are the detected/segmented cell boundaries. Ground truth cell locations annotated by medical students are shown in red (successfully detected) and green (missed) crosses.
Figure 7: Visual examples of cell segmentation in multiplex Immunohistochemistry (IHC) images using the proposed Intra+inter-instance loss together with the high resolution loss. Images in each row are the results of detecting 5 different cell types, for an input block. The yellow contours in an image are the detected/segmented cell boundaries. Ground truth cell locations annotated by medical students are shown in red (successfully detected) and green (missed) crosses.

Appendix E More visual examples of breast cancer segmentation results

Fig. 8 shows more breast cancer segmentation results. The green lines are the segmentation boundaries obtained by thresholding the low-resolution probability scores for patches. The red lines are ground truth cancer boundaries given by pathologists. The blue lines are the cancer segmentation boundaries predicted by the proposed Intra+inter-instance LSR. The cyan lines are the cancer segmentation boundaries predicted by the Intra-instance LSR baseline.

From these figures, we can see that the proposed Intra+inter-instance LSR predicts more continuous boundaries than the Intra-instance LSR baseline. This is because, given the low-resolution label $z$ of a block, the Intra-instance LSR tries to match the count of cancer pixels in each individual block to $p(n_y \mid z)$, while the proposed Intra+inter-instance method considers the variance among blocks with the same low-resolution label $z$.

The segmentation results with super resolution are much closer to the ground truth than the low resolution results (green lines). It is to be noted that the annotation effort to train a super resolution model to super resolve from low resolution labels is much less than the effort for training a pixel-level supervised semantic segmentation model.

Figure 8: Visual examples of breast cancer segmentation. The green lines are the segmentation boundaries obtained by thresholding the low-resolution probability scores for patches. The red lines are ground truth cancer boundaries given by pathologists. The blue lines are the cancer segmentation boundaries predicted by the proposed Intra+inter-instance LSR. The cyan lines are the cancer segmentation boundaries predicted by the Intra-instance LSR baseline.