Robust Image Segmentation Quality Assessment without Ground Truth

Deep learning based image segmentation methods have achieved great success, even reaching human-level accuracy in some applications. However, due to the black-box nature of deep learning, even the best method may fail in some situations. Predicting segmentation quality without ground truth is therefore crucial, especially in clinical practice. Recently, neural networks have been trained to estimate the quality score by regression. Although this approach can achieve promising prediction accuracy, the network suffers from a robustness problem: for example, it is vulnerable to adversarial attacks. In this paper, we propose to alleviate this problem by utilizing the difference between the input image and a reconstructed image, where the reconstruction is conditioned on the segmentation to be assessed. The deep learning based reconstruction network (REC-Net) is trained with the input image masked by the ground truth segmentation as input and the original input image as the target. The rationale is that the trained REC-Net can best reconstruct the input image only when it is masked by an accurate segmentation. The quality score regression network (REG-Net) is then trained with the difference images and the corresponding segmentations as input. In this way, the regression network has a lower chance of overfitting to undesired image features of the original input image, and is thus more robust. Results on the ACDC17 dataset demonstrate that our method is promising.

Keywords: Segmentation Assessment, Adversarial.

1 Introduction

Unsupervised segmentation quality assessment, which estimates segmentation accuracy without human or expert intervention, is of high interest in both medical imaging research and clinical practice. In the era of big data, automated deep learning based image processing makes it possible to process large amounts of medical image data efficiently. This is especially helpful for one essential image analysis task, semantic segmentation, as manual contouring is tedious and time consuming. In many applications, deep learning based segmentation methods can even achieve expert-level accuracy. In practice, however, deep learning methods may fail due to many factors, such as domain shift [11], adversarial noise, and low image quality. Predicting segmentation quality without ground truth is therefore crucial and of high interest for downstream analysis. In 3D interactive segmentation, it would be substantially helpful if the user could be navigated to erroneous parts of the segmentation [20, 6], as browsing a segmentation in 3D to check its accuracy everywhere is painful. In addition, in a clinical setting, it is costly and critically important to find out whether an acquired image can support the clinical decisions in diagnosis and treatment that rely on its segmentation. Ideally, if such unsupervised assessment could be performed while the patient is still in the scanner, a new scan could be obtained immediately, or even automatically, whenever the current image is not usable.

One straightforward idea is to predict segmentation quality with a CNN regression network, where the image and its segmentation are concatenated as different channels and fed into the network [14, 13]. However, this state-of-the-art method suffers from a robustness problem when the input images have a distribution different from that of the training data of the regression network. This can be demonstrated with adversarial attacks, which add hand-crafted perturbations to images drawn from the training distribution and cause deep neural networks to misbehave.

Inspired by work on representation learning and factorization [9, 3], we propose to improve prediction robustness by extracting features directly related to the segmentation. More precisely, we utilize the difference between the original input image and a reconstructed image that is conditioned on the input image and the input segmentation. Our work is most related to Kohlberger et al.'s work [7], in which the quality assessment score is estimated by regression on numerous statistical and energy measures from segmentation algorithms. We share Kohlberger et al.'s idea that, by explicitly computing some features, one can fit a model to estimate segmentation quality. The intuition behind our idea, however, is that comparing two images should be much easier than comparing an image to its segmentation. We may then need only a simple metric, not dozens of metrics as in [7], to predict the segmentation quality. This is due to the effective capability of deep CNNs (e.g., U-net [16]) to reconstruct images from masked images. Our method also shares merits with unsupervised lesion or outlier detection [17, 2, 4, 12, 18, 1], where only normal data (ground truth segmentations in our scenario) is utilized to train the reconstruction network.

Related Work: Segmentation quality assessment has been attracting considerable attention. Most earlier work takes a reverse-testing strategy, which relies on existing reference segmentations (or generates surrogate reference segmentations first). The methods in [5, 23] rely on cross-validation to assess the quality of the whole model, instead of assessing the quality of the segmentation result for each individual dataset. Recently, the reverse classification accuracy (RCA) method [15] estimates segmentation quality metrics for each image. It is based on image registration and atlas images with manual segmentations. First, the input image is registered to the atlas images, and a set of surrogate reference segmentations for the input image is generated by reversely transforming the manual segmentations of the atlases. A quality metric is then evaluated between the candidate segmentation and each surrogate reference segmentation, and the best value is taken as the prediction of segmentation quality. However, reverse classification is computationally demanding, as it involves expensive operations such as registration to the atlas segmentations and re-training the deep learning segmentation network with the segmentation to be assessed [22]. Robinson et al. [14, 13] propose to predict segmentation quality using a CNN regression network, in which the image and its segmentation are concatenated as the network input. The major drawback of that method is that the trained network is generally vulnerable to adversarial attacks.

Contributions: In this paper, we propose to make use of features directly related to the segmentation to improve the robustness of the quality regression network for segmentation quality assessment. To achieve this goal, we have developed two CNNs: a reconstruction network (REC-Net), which aims to reconstruct the image masked by the provided segmentation, and a quality regression network (REG-Net), which predicts the segmentation quality based on the reconstruction difference image and the provided segmentation. Our experiments on the ACDC17 dataset have demonstrated highly promising performance of the proposed method.

2 Method

In this section, we develop the proposed reconstruction network (REC-Net) and the quality regression network (REG-Net) for assessing segmentation quality in the absence of ground truth.

Assume the input image, its ground truth segmentation, and the candidate segmentation (to be assessed) are $x$, $s_{gt}$, and $s$, respectively. For supervised segmentation quality assessment, it is trivial to apply any metric function (e.g., the Dice or Jaccard score) to the pair $(s, s_{gt})$ to get the ground truth segmentation quality value. In the unsupervised setting, however, only $x$ and $s$ are provided, which makes the problem really challenging. In this paper, the Dice score is chosen as the metric. The ground truth Dice and the predicted Dice are denoted $d_{gt}$ and $d_{pred}$, respectively.

Figure 1: The workflow of the proposed segmentation quality assessment method.

We use $x_m$ to represent the image with the segmented target masked by zero, i.e., $x_m(i) = 0$ if the corresponding pixel $i$ belongs to the target, and $x_m(i) = x(i)$ otherwise. More specifically,

$$x_m(i) = \begin{cases} 0 & \text{if } s(i) = 1, \\ x(i) & \text{otherwise.} \end{cases} \qquad (1)$$

In other words, all pixels that are labeled by $s$ as the target object in $x$ are set to zero intensity. The image reconstructed from $x_m$ by the proposed reconstruction network (REC-Net) is denoted $x_{rec}$. The difference image $x_{diff}$, which serves as one input channel to the quality regression network (REG-Net), is defined as

$$x_{diff} = x - x_{rec}. \qquad (2)$$
The output of REG-Net is the predicted score $d_{pred}$ for the segmentation quality. The workflow of the proposed method is illustrated in Fig. 1.
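As a minimal sketch, the masking and difference-image definitions above can be written as follows (NumPy; the function names and the 2x2 toy arrays are hypothetical, introduced only for illustration):

```python
import numpy as np

def mask_image(x, s):
    """Zero out the pixels that the segmentation s labels as target."""
    x_m = x.copy()
    x_m[s == 1] = 0.0
    return x_m

def difference_image(x, x_rec):
    """Difference image fed to REG-Net as one input channel."""
    return x - x_rec

# toy 2x2 example: diagonal pixels belong to the target
x = np.array([[0.5, -0.5], [1.0, 0.0]])
s = np.array([[1, 0], [0, 1]])
x_m = mask_image(x, s)  # target pixels set to zero intensity
```

With a perfect reconstruction ($x_{rec} = x$) the difference image is identically zero; any reconstruction error shows up directly in $x_{diff}$.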

We will first develop the REC-Net for reconstructing $x$ from $x_m$; the REG-Net will then be proposed.

Figure 2: The architecture design of REC-Net and REG-Net. For simplicity, the ReLU layers that follow each convolution layer and fully connected layer are not shown. The final tanh layer in REC-Net and the sigmoid layer in REG-Net are also omitted from the figure. In REC-Net, each number indicates the number of filters. In REG-Net, the number above each convolution layer indicates the number of filters, and the number above each fully connected layer represents the dimension of the flattened feature vector. Each convolution layer has a kernel size of 3 with a stride of 1 and a padding of 1. Each pooling layer or transposed convolution layer has a kernel size of 2 with a stride of 2. The two networks are trained separately.

2.1 REC-Net

The proposed REC-Net architecture is an auto-encoder network with skip connections, similar to the U-net architecture. The L1 loss is chosen as the reconstruction loss:

$$L_{rec} = \| x - x_{rec} \|_1. \qquad (3)$$
The architecture of REC-Net is demonstrated in Fig. 2.

In our setting, the input to REC-Net is the original input image $x$ masked by the segmentation $s$, i.e., $x_m$. During training, only pairs of $x$ and its ground truth segmentation $s_{gt}$ are fed into REC-Net. The rationale is that REC-Net learns to recover the original input image well from the masked image only when the provided segmentation is a good one. Conversely, the reconstruction error should be correlated with the quality of the provided segmentation. The proposed method shares its idea with unsupervised lesion or outlier detection [17, 2, 4, 12, 18, 1], where only normal data is utilized during network training. Sample images reconstructed conditioned on the ground truth segmentations are shown in Fig. 3. As can be noticed, REC-Net can effectively recover $x$ from the masked image $x_m$. Quantitatively, the pixel intensity range of $x$ is [-1, 1] and the validation L1 reconstruction error is around 0.01. To demonstrate the reconstruction behavior of REC-Net under different segmentations, sample reconstructed images are shown in Fig. 4. Generally speaking, it is non-trivial to predict the reconstructed images, but it is clear that if the segmentation is not good enough, the reconstructed image differs visibly from the original input image. It is therefore highly probable that the difference image is a proper feature for segmentation quality assessment. This claim is verified by our experiments.
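The L1 reconstruction loss above can be sketched as below (NumPy; interpreting the L1 norm as a mean over pixels is an assumption, chosen to be consistent with the reported validation error of about 0.01 on intensities in [-1, 1]):

```python
import numpy as np

def l1_reconstruction_loss(x, x_rec):
    """Mean absolute pixel error between the original and reconstructed image."""
    return float(np.mean(np.abs(x - x_rec)))

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=(8, 8))   # intensities in [-1, 1], as in the paper
noisy = np.clip(x + 0.01, -1.0, 1.0)      # a hypothetical imperfect reconstruction
```

A perfect reconstruction yields zero loss, while the slightly perturbed reconstruction yields a loss on the order of the 0.01 validation error mentioned above.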

Figure 3: Sample reconstructions conditioned on ground truth segmentations. (a) Samples $x$ from the validation set. (b) Corresponding reconstructions $x_{rec}$ with the ground truth segmentations as input masks.
Figure 4: Sample reconstructions conditioned on different segmentations: an over-segmentation, a non-overlapping segmentation, and an under-segmentation.

2.2 REG-Net

After obtaining the reconstructed image $x_{rec}$ from the masked image $x_m$, REG-Net is used to automatically extract proper features to estimate the segmentation quality of $s$ by regression. The network architecture is shown in Fig. 2. REG-Net is a lightweight AlexNet-like network, consisting of convolution layers, max pooling layers, fully connected layers, and non-linear layers. To make the architecture as robust as possible, two dropout layers [19] with rate 0.5 were introduced. The key idea of dropout is to randomly drop neurons from the neural network during training, which prevents neurons from co-adapting too much. Besides using a robust architecture, we propose to utilize robust features, i.e., features truly related to the target prediction. More precisely, we use $x_{diff}$ instead of $x$ as in [14, 13]. The rationale is that $x$ may contain many features that are totally unrelated to segmentation quality prediction but may still be exploited by the network. As $x_{diff}$ is conditioned on the segmentation, and REC-Net is trained only with ground truth segmentations, $x_{diff}$ should be a more robust feature than $x$.
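The dropout mechanism described above can be sketched as follows (NumPy; this is a generic inverted-dropout sketch of the idea, not the paper's actual layers, though `rate=0.5` matches the rate stated above):

```python
import numpy as np

def dropout(activations, rate=0.5, training=True, rng=None):
    """Randomly zero neurons during training; scale the survivors by
    1 / (1 - rate) so the expected activation is unchanged (inverted dropout).
    At inference time the layer is a no-op."""
    if not training or rate == 0.0:
        return activations
    rng = np.random.default_rng() if rng is None else rng
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

a = np.ones(1000)
out = dropout(a, rate=0.5, rng=np.random.default_rng(0))
# roughly half the neurons are zeroed; the survivors are scaled to 2.0
```

Because each neuron is dropped independently, no single neuron can be relied upon, which is what discourages co-adaptation.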

3 Experiments and Results

3.1 Data

To validate the proposed segmentation quality assessment method, we utilize a public dataset from the Automated Cardiac Diagnosis Challenge (ACDC) of MICCAI 2017. It consists of 3-D cine-MR images from patients with various heart diseases. Images were acquired with 1.5T or 3T MR scanners. The resolution is between 1.22 and 1.68, and the number of phases varies from 28 to 40 per patient. The dataset consists of images from 100 patients, with expert annotations for the left-ventricular cavity (LVC), right-ventricular cavity (RVC), and left-ventricular myocardium (LVM). In our experiments, only the segmentation of the LVM, which is very challenging, was considered. Each slice was resampled to a fixed resolution, the intensity was normalized to the range [-1, 1], and the slice was center cropped to a fixed size. The dataset was randomly split into training, validation, and test sets of 80, 10, and 10 patients, respectively.

3.2 Implementation

The networks were implemented in PyTorch [10]. The same learning rate was used for REC-Net and REG-Net. Training ran for 500 and 200 epochs for REC-Net and REG-Net, respectively. Random rotation and random flipping were utilized to augment the data on the fly.
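A minimal sketch of the on-the-fly augmentation (NumPy; restricting the rotations to multiples of 90 degrees is an assumption, and the key point is that the identical transform must be applied to the image and its segmentation):

```python
import numpy as np

def augment(x, s, rng):
    """Apply the same random rotation and flip to image and segmentation."""
    k = int(rng.integers(0, 4))          # number of 90-degree rotations
    x, s = np.rot90(x, k), np.rot90(s, k)
    if rng.random() < 0.5:               # random horizontal flip
        x, s = np.fliplr(x), np.fliplr(s)
    return x.copy(), s.copy()

rng = np.random.default_rng(0)
x = np.arange(16.0).reshape(4, 4)        # hypothetical image slice
s = (x > 8).astype(int)                  # hypothetical segmentation mask
xa, sa = augment(x, s, rng)
```

Since the transform is shared, the augmented mask still labels exactly the transformed target pixels.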

3.3 Segmentation simulation via U-nets

To generate data for REG-Net training, segmentations of different quality have to be generated first. Robinson et al. utilized random forests of different depths to generate simulated segmentations [13]. In our experiments, instead of random forests, U-nets with different depths (4 or 2), different numbers of starting filters (8 or 4), and different numbers of training epochs (10, 20, 30, 50, 150) were applied to generate simulated segmentations of varying quality. The networks were trained on the training set, and inference was then conducted on the whole dataset, including the training, validation, and test sets. For each image slice, 20 simulated segmentations were thus generated. To keep the Dice scores of the simulated segmentations used for REG-Net unbiased, we split the range of possible Dice scores ([0, 1]) into ten bins of equal width and randomly sampled simulated segmentations such that each bin contains the same number of them. This sampling was done separately for the training, validation, and test sets. The final datasets used for REG-Net training, validation, and testing consist of 3200, 1000, and 1000 slices, respectively.
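The bin-balanced sampling described above can be sketched as follows (NumPy; the pool of Dice scores and `n_per_bin` are hypothetical, while the ten equal-width bins follow the text):

```python
import numpy as np

def sample_balanced(dice_scores, n_per_bin, rng):
    """Split [0, 1] into ten equal-width bins and draw the same number of
    simulated segmentations from each bin; returns indices into the pool."""
    bins = np.clip((np.asarray(dice_scores) * 10).astype(int), 0, 9)
    chosen = []
    for b in range(10):
        idx = np.flatnonzero(bins == b)
        chosen.extend(rng.choice(idx, size=n_per_bin, replace=False))
    return np.array(chosen)

rng = np.random.default_rng(0)
scores = rng.random(2000)                         # hypothetical pool of Dice scores
picked = sample_balanced(scores, n_per_bin=10, rng=rng)
```

The resulting subset has a uniform Dice distribution over the ten bins, which is what makes the blind-prediction baseline error of 0.25 mentioned later meaningful.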

3.4 Robustness to adversarial attacks

We compared the robustness of our proposed unsupervised segmentation quality assessment method under adversarial attacks against the state-of-the-art methods [13, 15, 14]. Those previous methods mainly utilize a deep learning regression network (similar to AlexNet) to predict the Dice score. Although this type of simple network can achieve promising prediction accuracy, it is also known to be very vulnerable to adversarial attacks [21]. In this paper, we applied the simple fast gradient sign method [8] to generate adversarial images for REG-Net. Only adversarial attacks on the original image $x$ and the difference image $x_{diff}$ were considered, and no changes were made to the segmentation $s$. The processes are illustrated in Fig. 5.

Figure 5: Illustration of adversarial image generation. (a) Adversarial image generation for REG-Net$_x$, which is trained with $x$ and $s$ as inputs. (b) Generation of $x_{diff}^{adv}$ for REG-Net$_{diff}$, which is trained with $x_{diff}$ and $s$ as inputs. (c) Generation of $\tilde{x}_{diff}$ for REG-Net$_{diff}$. The adversarial image $x_{adv}$ generated in (a) is utilized.

For the method in [13, 15, 14], a REG-Net was trained with $x$ and $s$ as inputs. The trained network is denoted REG-Net$_x$, and the adversarial images were computed as

$$x_{adv} = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x L_{reg}(x, s)\big), \qquad (4)$$

where $\epsilon$ is the adversarial attack level: $\epsilon = 0$ means no attack, and a bigger $\epsilon$ means a more severe attack. $\nabla_x L_{reg}(x, s)$ denotes the gradient of the loss of REG-Net$_x$ with respect to the input image $x$, and $\mathrm{sign}(\cdot)$ is the sign function.

For the proposed method, another REG-Net was trained with $x_{diff}$ and $s$ as inputs; the trained network is denoted REG-Net$_{diff}$. The adversarial images were computed as

$$x_{diff}^{adv} = x_{diff} + \epsilon \cdot \mathrm{sign}\big(\nabla_{x_{diff}} L_{reg}(x_{diff}, s)\big), \qquad (5)$$

where $\nabla_{x_{diff}} L_{reg}(x_{diff}, s)$ denotes the gradient of the loss of REG-Net$_{diff}$ with respect to the input image $x_{diff}$. One should note that the gradients in Eq. 4 and Eq. 5 are different, as they correspond to the two differently trained networks REG-Net$_x$ and REG-Net$_{diff}$, respectively. Then $x_{adv}$ (respectively $x_{diff}^{adv}$) was fed into REG-Net$_x$ (respectively REG-Net$_{diff}$) together with $s$ to predict the Dice score.
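The fast gradient sign update can be sketched as below (NumPy; the toy quadratic loss and its analytic gradient are stand-ins for backpropagation through the trained REG-Net, and clipping back to the [-1, 1] intensity range is an assumption):

```python
import numpy as np

def fgsm(x, grad, eps):
    """One fast gradient sign step: perturb each pixel by eps in the
    direction that increases the loss, then clip to [-1, 1]."""
    return np.clip(x + eps * np.sign(grad), -1.0, 1.0)

# toy stand-in for the REG-Net loss: L(x) = 0.5 * ||x - t||^2
x = np.zeros((4, 4))
t = np.full((4, 4), 0.3)
grad = x - t                     # analytic gradient of the toy loss
x_adv = fgsm(x, grad, eps=0.1)   # every pixel shifted by -0.1
```

A larger `eps` produces a more severe attack, matching the role of the attack level $\epsilon$ above.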

Note that $x_{diff}^{adv}$ mainly tests the robustness of the quality regression network (REG-Net$_{diff}$) alone. We also want to test the robustness of the whole proposed framework, including the reconstruction network (REC-Net) and REG-Net$_{diff}$. The attack image, denoted $\tilde{x}_{diff}$, is thus generated by plugging the adversarial image $x_{adv}$ computed by Eq. 4 into the trained REC-Net to get the reconstructed image, and then computing the difference to $x_{adv}$ as

$$\tilde{x}_{diff} = x_{adv} - \mathrm{REC}\big(m(x_{adv}, s)\big), \qquad (6)$$

where $m(\cdot, s)$ denotes the masking operation defined above.
Method           Network, Input                        MAE
Robinson et al.  REG-Net$_x$, $(x, s)$                 0.04 ± 0.05
proposed         REG-Net$_{diff}$, $(x_{diff}, s)$     0.04 ± 0.05
Table 1: Mean absolute errors of Dice prediction when there is no attack.
Figure 6: Sample segmentation quality assessments for Robinson et al.'s method (left column) and the proposed method (right column) without attacks. The digits in the bottom right corners indicate the predicted Dice scores $d_{pred}$.

3.5 Performance comparison

For performance comparison, the mean absolute error (MAE) of the Dice scores,

$$\mathrm{MAE} = \frac{1}{N} \sum_{n=1}^{N} \big| d_{gt}^{(n)} - d_{pred}^{(n)} \big|, \qquad (7)$$

was utilized as the metric, where $N$ is the total number of slices in the test set. The results without adversarial attacks are shown in Table 1. As can be seen, when there is no attack, the proposed method works as well as Robinson et al.'s [13, 15, 14]. Sample results without attacks are shown in Fig. 6.
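The MAE metric above can be sketched as (NumPy; the three Dice values are hypothetical):

```python
import numpy as np

def mae(d_gt, d_pred):
    """Mean absolute error between ground-truth and predicted Dice scores."""
    return float(np.mean(np.abs(np.asarray(d_gt) - np.asarray(d_pred))))

# e.g. three hypothetical test slices: errors 0.1, 0.0, 0.2 average to 0.1
score = mae([0.9, 0.5, 0.2], [0.8, 0.5, 0.4])
```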

The performance under attacks is shown in Table 2.

Method           Network, Input                             $\epsilon_1$   $\epsilon_2$   $\epsilon_3$   $\epsilon_4$
Robinson et al.  REG-Net$_x$, $(x_{adv}, s)$                0.08 ± 0.06    0.11 ± 0.07    0.14 ± 0.08    0.16 ± 0.09
proposed         REG-Net$_{diff}$, $(x_{diff}^{adv}, s)$    0.07 ± 0.06    0.09 ± 0.06    0.09 ± 0.07    0.12 ± 0.09
proposed         REG-Net$_{diff}$, $(\tilde{x}_{diff}, s)$  0.04 ± 0.05    0.04 ± 0.05    0.07 ± 0.06    0.14 ± 0.10
Table 2: Mean absolute errors of Dice prediction under four increasing adversarial attack levels $\epsilon_1 < \epsilon_2 < \epsilon_3 < \epsilon_4$. One should note that the error would be 0.25 if one blindly predicted a Dice of 0.5, since the distribution of Dice scores in the test set is uniform over [0, 1].

It can be noticed that for all methods the MAEs are monotonically non-decreasing as the attack level increases. When attacking REG-Net$_{diff}$ directly, i.e., when its inputs are $s$ and the $x_{diff}^{adv}$ computed by Eq. 5, the proposed method degrades more slowly and works better than Robinson et al.'s. When using the surrogate attack image $\tilde{x}_{diff}$, the performance of the proposed method is the best as long as the attack level is not very high. This is expected, since the adversarial image is computed based on REG-Net$_x$, not REG-Net$_{diff}$. However, at the highest attack level the degradation is significant. A possible explanation is that REC-Net cannot reconstruct the input images accurately enough when the attack is too severe. This is demonstrated in Fig. 7 for the highest attack level.

Figure 7: Sample segmentation quality assessments for the different methods. The digits in the bottom right corners indicate the predicted Dice scores $d_{pred}$. (a) Sample predictions without adversarial attacks. (b) Predictions under different attack levels, together with the image reconstructed from the adversarial input masked by $s$.

In conclusion, the results demonstrate that the proposed method (REG-Net$_{diff}$ with the reconstruction difference image as the input feature) is more robust to adversarial attacks than Robinson et al.'s (REG-Net$_x$ with the original input image as the input feature). Sample comparisons are illustrated in Fig. 7.

Discussions: Why not train REC-Net and REG-Net simultaneously? Note that only the ground truth segmentation of each input image is utilized to train REC-Net, so that REC-Net accurately reconstructs the original image only from the input image masked by the ground truth, i.e., from $m(x, s_{gt})$. For the training of REG-Net, however, the training data should contain segmentations of different qualities so that the bias problem can be avoided. Therefore, it is not proper to train the two networks simultaneously.

4 Conclusion

In this paper, a robust method for segmentation quality assessment has been proposed. We use the difference between the input image and the image reconstructed by our proposed reconstruction network (REC-Net) as the feature image. REC-Net is trained with the input image masked by the ground truth segmentation as input and the original input image as the reconstruction target. The quality score regression network (REG-Net) is then trained with the reconstruction difference image and the segmentation as input. By using the reconstruction difference image as the feature, the regression network has a lower chance of overfitting to undesired image features and can thus be more robust. Results on the ACDC17 dataset demonstrate that our method is promising.




  1. Alaverdyan, Z., Jung, J., Bouet, R., Lartizien, C.: Regularized siamese neural network for unsupervised outlier detection on brain multiparametric magnetic resonance imaging: application to epilepsy lesion screening (2018)
  2. Baur, C., Wiestler, B., Albarqouni, S., Navab, N.: Deep autoencoding models for unsupervised anomaly segmentation in brain mr images. arXiv preprint arXiv:1804.04488 (2018)
  3. Chartsias, A., Joyce, T., Papanastasiou, G., Semple, S., Williams, M., Newby, D., Dharmakumar, R., Tsaftaris, S.A.: Factorised spatial representation learning: application in semi-supervised myocardial segmentation. arXiv preprint arXiv:1803.07031 (2018)
  4. Chen, X., Konukoglu, E.: Unsupervised detection of lesions in brain mri using constrained adversarial auto-encoders. arXiv preprint arXiv:1806.04972 (2018)
  5. Fan, W., Davidson, I.: Reverse testing: an efficient framework to select amongst classifiers under sample selection bias. In: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 147–156. ACM (2006)
  6. Kashyap, S., Oguz, I., Zhang, H., Sonka, M.: Automated segmentation of knee mri using hierarchical classifiers and just enough interaction based learning: Data from osteoarthritis initiative. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 344–351. Springer (2016)
  7. Kohlberger, T., Singh, V., Alvino, C., Bahlmann, C., Grady, L.: Evaluating segmentation error without ground truth. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 528–536. Springer (2012)
  8. Kurakin, A., Goodfellow, I., Bengio, S.: Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533 (2016)
  9. Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
  10. Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in pytorch (2017)
  11. Patel, V.M., Gopalan, R., Li, R., Chellappa, R.: Visual domain adaptation: A survey of recent advances. IEEE signal processing magazine 32(3), 53–69 (2015)
  12. Pawlowski, N., Lee, M.C., Rajchl, M., McDonagh, S., Ferrante, E., Kamnitsas, K., Cooke, S., Stevenson, S., Khetani, A., Newman, T., et al.: Unsupervised lesion detection in brain ct using bayesian convolutional autoencoders (2018)
  13. Robinson, R., Oktay, O., Bai, W., Valindria, V., Sanghvi, M., Aung, N., Paiva, J., Zemrak, F., Fung, K., Lukaschuk, E., et al.: Real-time prediction of segmentation quality. arXiv preprint arXiv:1806.06244 (2018)
  14. Robinson, R., Oktay, O., Bai, W., Valindria, V.V., Sanghvi, M.M., Aung, N., Paiva, J.M., Zemrak, F., Fung, K., Lukaschuk, E., et al.: Subject-level prediction of segmentation failure using real-time convolutional neural nets (2018)
  15. Robinson, R., Valindria, V.V., Bai, W., Suzuki, H., Matthews, P.M., Page, C., Rueckert, D., Glocker, B.: Automatic quality control of cardiac mri segmentation in large-scale population imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention. pp. 720–727. Springer (2017)
  16. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
  17. Schlegl, T., Seeböck, P., Waldstein, S.M., Schmidt-Erfurth, U., Langs, G.: Unsupervised anomaly detection with generative adversarial networks to guide marker discovery. In: International Conference on Information Processing in Medical Imaging. pp. 146–157. Springer (2017)
  18. Seeböck, P., Waldstein, S., Klimscha, S., Gerendas, B.S., Donner, R., Schlegl, T., Schmidt-Erfurth, U., Langs, G.: Identifying and categorizing anomalies in retinal imaging data. arXiv preprint arXiv:1612.00686 (2016)
  19. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15(1), 1929–1958 (2014)
  20. Sun, S., Sonka, M., Beichel, R.R.: Graph-based ivus segmentation with efficient computer-aided refinement. IEEE transactions on medical imaging 32(8), 1536–1549 (2013)
  21. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)
  22. Valindria, V.V., Lavdas, I., Bai, W., Kamnitsas, K., Aboagye, E.O., Rockall, A.G., Rueckert, D., Glocker, B.: Reverse classification accuracy: predicting segmentation performance in the absence of ground truth. IEEE transactions on medical imaging 36(8), 1597–1606 (2017)
  23. Zhong, E., Fan, W., Yang, Q., Verscheure, O., Ren, J.: Cross validation framework to choose amongst models and datasets for transfer learning. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. pp. 547–562. Springer (2010)