# Fully Automated Pancreas Segmentation with Two-stage 3D Convolutional Neural Networks

###### Abstract

The pancreas is an abdominal organ with very large variations in shape and size, which makes automatic and accurate pancreas segmentation challenging for medical image analysis. In this work, we propose a fully automated two-stage framework for pancreas segmentation based on convolutional neural networks (CNNs). In the first stage, a U-Net is trained to segment the down-sampled 3D volume, and a candidate region covering the pancreas is then extracted from the estimated labels. Motivated by the superior performance reported for region-based CNNs, in the second stage another 3D U-Net is trained on the candidate region generated in the first stage. We evaluated the proposed method on the NIH computed tomography (CT) dataset and verified its superiority over other state-of-the-art 2D and 3D approaches for pancreas segmentation in terms of Dice-Sørensen coefficient (DSC) accuracy in testing. The mean DSC of the proposed method is 85.99%.

###### Keywords:

Computed tomography (CT), automated pancreas segmentation, multi-stage deep convolutional neural network.

## 1 Introduction

Automated and accurate organ segmentation is a fundamental step in medical image analysis, computer-assisted diagnosis and radiation therapy planning. Recently, thanks to the availability of large annotated datasets and computational resources, deep learning based methods such as convolutional neural networks (CNNs) have proven to be more powerful tools for organ segmentation than traditional segmentation techniques.

One issue in organ segmentation is whether to work with 2D slices or 3D volumes, since each choice has its advantages and disadvantages. Specifically, training models directly on 3D volumes can leverage the inherent spatial and anatomical information in volumetric organs, at the cost of significantly higher computational power and memory than training 2D models. In contrast, slicing volumes along three orthogonal planes (sagittal, coronal and transverse) usually yields more training samples for 2D networks, but sacrifices the 3D geometric information. Moreover, the 2D segmentation results must then be fused to construct a 3D mask. Existing deep learning based techniques for pancreas segmentation cover both cases. For example, 2D networks were explored in [8], [9], and 3D networks for pancreas segmentation were studied in [3], [10]. In [2] and [7], the authors combined 2D and 3D networks for pancreas segmentation.

In addition, automatic and accurate segmentation of small, soft and flexible organs such as the pancreas can be difficult because of their large variations in shape and size and their varying surrounding contents, in comparison with large organs (e.g., liver, kidney, stomach). More accurate segmentation can therefore be achieved by using a smaller input region around the target. Coarse-to-fine multi-stage techniques have been explored widely to address this problem. The basic idea is to determine the regions of interest (ROIs), or candidate regions, in a coarse step and then refine the segmentation on the ROIs. However, generating ROIs through bounding box estimation can be difficult for the pancreas. Several methods that generate meaningful regions on 2D slices with a recall of 99% have been explored. For example, machine learning based techniques are implemented in [1], [5] for candidate region generation through bounding box regression. A fixed-point algorithm for the testing stage was studied in [9]. Recurrent neural networks were considered in [8] to keep the training and testing stages consistent. In [10], candidate regions are generated using a patch-based method.

In this work, we propose a two-stage method for automated pancreas segmentation on 3D computed tomography (CT) scans, which consists of two steps: i) coarse segmentation on down-sampled 3D volumes for candidate region generation; ii) refinement of the pancreas segmentation on smaller regions of interest (ROIs) at the finest resolution scale. The performance of the proposed algorithm is demonstrated on the NIH dataset.

## 2 Method

We denote a 3D CT scan as $X$ with size $L \times W \times H$. The down-sampled CT scan is $X^{R}$, where the superscript $R$ is the decimation factor. The ground truth masks corresponding to the original and down-sampled CT scans are represented by $Y$ and $Y^{R}$, respectively. $N$ is the total number of voxels in a 3D scan. The vector version of a 3D volume is denoted by the corresponding lower-case letter; for example, $y$ is the vector version of the ground truth mask $Y$. The two steps of the proposed method are detailed as follows.

### 2.1 Coarse scale segmentation

Due to the high dimensionality of the original 3D CT scans, training models on them directly incurs a high cost in computational power and memory, which limits the depth and architecture of the networks. Thus, we first train a 3D U-Net on the volume down-sampled by the decimation factor $R$. Based on the tight relationship between segmentation and localization, a candidate region of the pancreas can be extracted once the coarse-scale segmentation mask is obtained. Note that the normalized bounding box of the pancreas is the same in the down-sampled volume and in the original volume, since the down-sampling operation does not change the shape or relative location of the pancreas. In addition, candidate region generation is conducted on 3D volumes instead of 2D slices, since the location of the pancreas in 3D volumes is more consistent across subjects than its location in 2D slices, both between and within subjects.
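The claim about normalized bounding boxes can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes strided decimation as the down-sampling operation and uses a toy box-shaped mask, and the helper names `bounding_box` and `normalized_box` are illustrative. On a discrete grid the two normalized boxes agree up to about one voxel of the coarse grid.

```python
import numpy as np

def bounding_box(mask):
    """Tightest box around the nonzero voxels: (start, exclusive stop) per axis."""
    coords = np.argwhere(mask > 0)
    return coords.min(axis=0), coords.max(axis=0) + 1

def normalized_box(mask):
    """Bounding box expressed as fractions of the volume size along each axis."""
    start, stop = bounding_box(mask)
    shape = np.array(mask.shape, dtype=float)
    return start / shape, stop / shape

R = 2  # decimation factor (illustrative value, not taken from the paper)

# Toy full-resolution mask with a box-shaped "organ".
mask = np.zeros((64, 64, 32), dtype=np.uint8)
mask[20:41, 10:31, 8:17] = 1

# Strided decimation: one simple way to down-sample a label volume (assumption).
mask_small = mask[::R, ::R, ::R]

start_full, stop_full = normalized_box(mask)
start_small, stop_small = normalized_box(mask_small)

# Largest disagreement per axis, to be compared with one coarse voxel (1 / coarse size).
print(np.abs(stop_full - stop_small))
print(1.0 / np.array(mask_small.shape))
```

Because the box is stored in normalized coordinates, it can be mapped back to either resolution by multiplying by the corresponding volume size.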

### 2.2 Fine scale segmentation

In the second stage, another U-Net of the same architecture is trained on the candidate regions generated in the first stage at the finest resolution scale. Since we cannot guarantee that the ROIs generated in the first stage have both high precision and high recall, the second stage is implemented differently in training and in testing.

#### 2.2.1 In training,

the bounding boxes are extracted from the ground-truth mask and then enlarged by adding margins (10 pixels) along the three orthogonal axes. The candidate regions cropped from the original 3D CT scans with the enlarged ground-truth bounding boxes are the inputs of the segmentation network in the second stage. Note that the networks of the two stages can be trained simultaneously, since the ROIs used here are generated without the aid of the first-stage output.
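The training-time cropping described above can be sketched as follows. This is a minimal NumPy illustration under stated assumptions: the margin is applied on both sides of each axis and the box is clipped to the volume bounds, neither of which is specified in the text. The function name `enlarged_bounding_box` and the toy volumes are hypothetical.

```python
import numpy as np

def enlarged_bounding_box(mask, margin):
    """Tightest box around the nonzero voxels, grown by `margin` voxels on each
    side and clipped to the volume bounds. Returns one slice per axis."""
    coords = np.argwhere(mask > 0)
    start = np.maximum(coords.min(axis=0) - margin, 0)
    stop = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return tuple(slice(int(a), int(b)) for a, b in zip(start, stop))

# Toy CT volume and ground-truth mask (random data, for illustration only).
rng = np.random.default_rng(0)
ct = rng.normal(size=(96, 96, 48)).astype(np.float32)
gt = np.zeros(ct.shape, dtype=np.uint8)
gt[30:60, 40:70, 5:25] = 1

# Crop both the scan and the mask with the same enlarged ground-truth box.
roi = enlarged_bounding_box(gt, margin=10)
ct_roi, gt_roi = ct[roi], gt[roi]
print(ct_roi.shape, gt_roi.shape)
```

Since the box depends only on the ground-truth mask, this cropping needs no first-stage output, which is what allows the two networks to be trained independently.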

#### 2.2.2 In testing,

after obtaining the estimated label map from the down-sampled data, we generated two bounding boxes in different ways: i) We extracted a bounding box directly from the estimated coarse label map and then enlarged it by adding margins of 2 pixels along each axis. A bounding box covering the pancreas on the original CT scan was finally obtained by rescaling this enlarged box, i.e., multiplying it by $R$ along each axis. ii) After up-sampling the estimated coarse label map by a factor of $R$, a bounding box was extracted from it directly and then enlarged by adding 10 pixels of margin along each axis. Note that the differences between the two bounding boxes come from the errors of the sampling operations and the different margins considered.

Since two bounding boxes are generated in the testing procedure, two estimated masks can be obtained by feeding the two cropped ROIs into the U-Net separately. Finally, the two estimated masks and the up-sampled pancreas mask estimated in the first stage are combined by majority voting. Fig. 2 illustrates the testing procedure of the proposed method.
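The testing procedure can be sketched end to end in NumPy. This is an illustrative sketch only: nearest-neighbour up-sampling, margins applied on both sides, and the value of the decimation factor are assumptions, and `refine_in_box` is a hypothetical placeholder that stands in for the second-stage U-Net by copying a reference mask inside the box.

```python
import numpy as np

def box_from_mask(mask, margin):
    """Bounding box (start, exclusive stop) of the nonzero voxels, grown by
    `margin` voxels on every side and clipped to the volume."""
    coords = np.argwhere(mask > 0)
    start = np.maximum(coords.min(axis=0) - margin, 0)
    stop = np.minimum(coords.max(axis=0) + 1 + margin, mask.shape)
    return start, stop

def upsample_nearest(mask, factor):
    """Nearest-neighbour up-sampling: repeat every voxel `factor` times per axis."""
    for axis in range(mask.ndim):
        mask = np.repeat(mask, factor, axis=axis)
    return mask

def refine_in_box(reference_mask, start, stop):
    """Placeholder for the second-stage network (hypothetical): copies
    `reference_mask` inside the box and predicts background elsewhere."""
    out = np.zeros_like(reference_mask)
    box = tuple(slice(int(a), int(b)) for a, b in zip(start, stop))
    out[box] = reference_mask[box]
    return out

def majority_vote(masks):
    """A voxel is foreground if more than half of the binary masks say so."""
    votes = np.sum(np.stack(masks), axis=0)
    return (2 * votes > len(masks)).astype(np.uint8)

R = 2  # decimation factor (illustrative value)
coarse = np.zeros((32, 32, 16), dtype=np.uint8)  # toy first-stage output
coarse[10:20, 12:22, 4:10] = 1

# Box 1: extract on the coarse grid, add a 2-voxel margin, rescale by R.
start1, stop1 = box_from_mask(coarse, margin=2)
start1, stop1 = start1 * R, stop1 * R

# Box 2: up-sample the coarse mask first, then extract with a 10-voxel margin.
coarse_up = upsample_nearest(coarse, R)
start2, stop2 = box_from_mask(coarse_up, margin=10)

# Two refined masks (placeholders) plus the up-sampled coarse mask are fused.
mask1 = refine_in_box(coarse_up, start1, stop1)
mask2 = refine_in_box(coarse_up, start2, stop2)
final = majority_vote([coarse_up, mask1, mask2])
print(start1, stop1, start2, stop2, int(final.sum()))
```

With three binary masks, majority voting keeps a voxel whenever at least two of the three masks label it as pancreas, so a single outlier prediction cannot add or remove a voxel on its own.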

### 2.3 Network architecture and loss function

The architecture of the U-Net employed in this work is shown in Fig. 3. Note that the architectures in stage 1 and stage 2 are the same. The parameters of the two networks can be trained simultaneously since they are independent during training.

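The loss function is formulated as follows. Note that the same loss function is used in both stages.

$$\mathcal{L} = \mathcal{L}_{\text{dice}} + \lambda\,\mathcal{L}_{\text{center}} \tag{1}$$

where $\lambda$ is the penalty parameter. The dice loss $\mathcal{L}_{\text{dice}}$ and center loss $\mathcal{L}_{\text{center}}$ are given by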

(2) | |||||

(3) |

where

(4) |

where the center point of a slice of a mask is defined as the weighted sum of coordinates, and a spatial normalization factor is applied. Note that both the dice loss and the center loss are differentiable. In this work, the penalty parameter is held at a fixed value for the initial 50 epochs and is decreased to 0 in the following epochs.
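To make the structure of the loss concrete, the following NumPy sketch combines a Dice term with a penalized center term. It is an illustration under several assumptions rather than the formulation used in this work: the soft Dice form, the squared distance between per-slice centers, normalization by the slice size, slicing along the first axis, and the step-shaped schedule are all choices made for this sketch, and the names `slice_centers`, `penalty_schedule` and `lam0` are hypothetical.

```python
import numpy as np

def dice_loss(y_true, y_pred, eps=1e-6):
    """Soft Dice loss on flattened volumes (one common form; an assumption)."""
    y_true, y_pred = y_true.ravel(), y_pred.ravel()
    inter = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def slice_centers(mask, eps=1e-6):
    """Center point of every slice along axis 0: mask-weighted sum of (row, col)
    coordinates, divided by the mask mass and by the slice size (assumed
    normalization), so that coordinates lie in [0, 1)."""
    depth, height, width = mask.shape
    rows = np.arange(height).reshape(1, height, 1)
    cols = np.arange(width).reshape(1, 1, width)
    mass = mask.sum(axis=(1, 2)) + eps
    center_r = (mask * rows).sum(axis=(1, 2)) / mass / height
    center_c = (mask * cols).sum(axis=(1, 2)) / mass / width
    return np.stack([center_r, center_c], axis=1)  # shape (depth, 2)

def center_loss(y_true, y_pred):
    """Mean squared distance between per-slice centers (assumed form)."""
    diff = slice_centers(y_true) - slice_centers(y_pred)
    return float(np.mean(np.sum(diff ** 2, axis=1)))

def total_loss(y_true, y_pred, lam):
    """Dice loss plus the penalty parameter times the center loss."""
    return dice_loss(y_true, y_pred) + lam * center_loss(y_true, y_pred)

def penalty_schedule(epoch, lam0, warm_epochs=50):
    """Penalty held at `lam0` for the first `warm_epochs` epochs, then 0.
    The decay shape is not specified in the text; a step is assumed here."""
    return lam0 if epoch < warm_epochs else 0.0

# Toy example: a prediction shifted by three columns relative to the target.
y = np.zeros((4, 16, 16))
y[:, 4:10, 5:11] = 1.0
y_shift = np.roll(y, 3, axis=2)
print(total_loss(y, y_shift, lam=penalty_schedule(0, lam0=0.1)))
print(total_loss(y, y_shift, lam=penalty_schedule(60, lam0=0.1)))
```

Both terms are built from sums and products of the predicted probabilities, which is why such a loss remains differentiable with respect to the network output.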

## 3 Experiments

### 3.1 Dataset

Our method is evaluated on the public NIH pancreas segmentation dataset (https://wiki.cancerimagingarchive.net/display/Public/Pancreas-CT) [5]. This dataset contains 82 contrast-enhanced abdominal CT scans and the corresponding annotated labels. The CT scans have resolutions of 512 × 512 pixels with varying pixel sizes and slice thickness between 1.5 and 2.5 mm. The pixel values of all CT scans were clipped to a fixed Hounsfield unit (HU) window and then rescaled to a normalized range. Following previous work on pancreas segmentation, we used 4-fold cross-validation to assess the robustness of the model, i.e., 20 subjects are chosen for validation in each fold.
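The intensity preprocessing and the fold split can be sketched as below. This is a hedged illustration: the HU window `HU_MIN`/`HU_MAX` and the target range [0, 1] are placeholder assumptions, not the values used in this work, and `normalize_ct` and `make_folds` are hypothetical helper names.

```python
import numpy as np

def normalize_ct(volume_hu, hu_min, hu_max):
    """Clip intensities to [hu_min, hu_max] and rescale linearly to [0, 1]."""
    clipped = np.clip(volume_hu.astype(np.float32), hu_min, hu_max)
    return (clipped - hu_min) / (hu_max - hu_min)

def make_folds(num_subjects, num_folds, seed=0):
    """Randomly partition subject indices into `num_folds` validation folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(num_subjects), num_folds)

# Placeholder soft-tissue window (assumed values, for illustration only).
HU_MIN, HU_MAX = -100.0, 240.0

volume = np.array([[-1000.0, -100.0, 70.0, 240.0, 1500.0]])
normalized = normalize_ct(volume, HU_MIN, HU_MAX)
print(normalized)

# 82 subjects are not divisible by 4, so this split gives folds of 21 or 20.
folds = make_folds(num_subjects=82, num_folds=4)
print([len(fold) for fold in folds])
```

In each cross-validation round, one fold would serve as the validation set and the remaining three as the training set.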

### 3.2 Quantitative Assessment Metrics

In both steps, the Dice similarity coefficient (DSC) is employed to evaluate segmentation accuracy. Moreover, recall and intersection-over-union (IoU) are used to assess the localization performance. The metrics used in this work are defined as follows.

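$$\text{DSC} = \frac{2\,|Y \cap \hat{Y}|}{|Y| + |\hat{Y}|} \tag{5}$$

$$\text{Recall} = \frac{|B \cap \hat{B}|}{|B|} \tag{6}$$

$$\text{IoU} = \frac{|B \cap \hat{B}|}{|B \cup \hat{B}|} \tag{7}$$

where $Y$ and $\hat{Y}$ are the ground truth and estimated masks, and $B$ and $\hat{B}$ are the bounding boxes extracted from the ground truth and estimated masks, respectively.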
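The three metrics can be computed with a few lines of NumPy. The sketch below is illustrative: it assumes binary masks, axis-aligned boxes given as (start, exclusive stop) pairs, and a box-level recall defined as the fraction of the ground-truth box covered by the estimated box, which is one plausible reading of the text. All function names are hypothetical.

```python
import numpy as np

def dsc(y_true, y_pred):
    """Dice similarity coefficient between two binary masks."""
    y_true, y_pred = np.asarray(y_true, bool), np.asarray(y_pred, bool)
    inter = np.logical_and(y_true, y_pred).sum()
    return 2.0 * inter / (y_true.sum() + y_pred.sum())

def box_volume(start, stop):
    """Number of voxels in an axis-aligned box (0 if the box is empty)."""
    return int(np.prod(np.maximum(np.array(stop) - np.array(start), 0)))

def box_intersection(box_a, box_b):
    """Voxel count of the overlap between two boxes."""
    start = np.maximum(box_a[0], box_b[0])
    stop = np.minimum(box_a[1], box_b[1])
    return box_volume(start, stop)

def box_recall(gt_box, est_box):
    """Fraction of the ground-truth box covered by the estimated box."""
    return box_intersection(gt_box, est_box) / box_volume(*gt_box)

def box_iou(gt_box, est_box):
    """Intersection-over-union of the two boxes."""
    inter = box_intersection(gt_box, est_box)
    union = box_volume(*gt_box) + box_volume(*est_box) - inter
    return inter / union

# An estimated box that fully contains the ground-truth box with a 2-voxel margin.
gt_box = ((10, 10, 10), (20, 20, 20))
est_box = ((8, 8, 8), (22, 22, 22))
print(box_recall(gt_box, est_box), box_iou(gt_box, est_box))
print(dsc([1, 1, 1, 0], [1, 1, 0, 1]))
```

The example shows why the two localization metrics pull in opposite directions: enlarging the estimated box raises recall but lowers IoU, since the union grows while the intersection is capped by the ground-truth box.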

### 3.3 Results

Fig. 4 shows the segmentation results from stage 1 and stage 2 for three test subjects. For all three subjects, the overlap between the estimated mask and the ground truth mask is larger in the second stage than in the first. Quantitative results of the two-stage method are reported in Table 1. In every fold, the mean DSC of the stage 2 segmentation is at least 3 percentage points higher than the stage 1 result.

The bounding box accuracy evaluated by recall and IoU is summarized in Table 2. The mean recalls of both bounding boxes are higher than 97%. The IoUs of the two bounding boxes from different subjects are all higher than 60% except for one subject whose IoU is lower than 40%. However, the segmentation performance of this subject also benefits from the two-stage method, i.e., DSC increased from 39.98% to 57.20%.

The comparison of pancreas segmentation using different methods is reported in Table 3. The proposed method outperforms the others in terms of the mean DSC.

Table 1: Quantitative results (DSC) of the two-stage method for each fold.

Fold | Stage 1 Mean DSC | Stage 1 Max DSC | Stage 1 Min DSC | Stage 2 Mean DSC | Stage 2 Max DSC | Stage 2 Min DSC |
---|---|---|---|---|---|---|
F0 | 82.18% ± 5.28% | 89.31% | 65.59% | 85.82% ± 4.58% | 91.20% | 74.95% |
F1 | 78.20% ± 10.60% | 88.67% | 39.98% | 84.85% ± 6.75% | 90.80% | 57.20% |
F2 | 83.23% ± 3.70% | 89.61% | 76.36% | 86.52% ± 2.56% | 91.14% | 82.00% |
F3 | 81.91% ± 6.84% | 88.68% | 78.69% | 86.79% ± 2.43% | 90.64% | 82.13% |

Table 2: Bounding box accuracy in terms of recall and IoU for the two bounding boxes.

Fold | Box 1 Mean Recall | Box 1 Max Recall | Box 1 Min Recall | Box 2 Mean Recall | Box 2 Max Recall | Box 2 Min Recall |
---|---|---|---|---|---|---|
F0 | 98.44% ± 2.23% | 100% | 90.24% | 99.41% ± 1.69% | 100% | 93.22% |
F1 | 98.28% ± 2.24% | 100% | 90.74% | 99.38% ± 1.39% | 100% | 93.89% |
F2 | 97.67% ± 4.18% | 100% | 85.24% | 98.71% ± 3.13% | 100% | 88.59% |
F3 | 98.42% ± 1.95% | 100% | 93.42% | 99.61% ± 0.73% | 100% | 97.78% |

Fold | Box 1 Mean IoU | Box 1 Max IoU | Box 1 Min IoU | Box 2 Mean IoU | Box 2 Max IoU | Box 2 Min IoU |
---|---|---|---|---|---|---|
F0 | 81.04% ± 5.18% | 88.23% | 69.55% | 69.52% ± 5.39% | 77.68% | 60.04% |
F1 | 76.93% ± 13.37% | 86.78% | 20.80% | 66.45% ± 11.60% | 77.66% | 18.35% |
F2 | 77.74% ± 4.33% | 84.85% | 69.43% | 66.78% ± 3.85% | 72.61% | 59.19% |
F3 | 76.30% ± 12.69% | 90.87% | 42.42% | 66.96% ± 10.15% | 83.38% | 40.51% |

Table 3: Comparison of pancreas segmentation results using different methods.

Method | Mean DSC | Max DSC | Min DSC |
---|---|---|---|
Roth et al. MICCAI’2016 [6] | 78.01% ± 8.20% | 88.65% | 34.11% |
Holistically Nested 2D FCN [4] | 81.27% ± 6.27% | 88.96% | 50.69% |
Zhou et al. MICCAI’2017 [9] | 82.65% ± 5.47% | 90.85% | 63.02% |
Attention-UNet [3] | 83.10% ± 3.80% | - | - |
ResDSN C2F [10] | 84.59% ± 4.86% | 91.45% | 69.62% |
Ours (Coarse) | 81.91% ± 6.84% | 89.61% | 39.98% |
Ours (Refine) | 85.99% ± 4.51% | 91.20% | 57.20% |

### 3.4 Discussion

According to the quantitative results on bounding box estimation in Table 2, it is difficult to achieve high IoUs for all subjects. Thus, a testing procedure different from the training procedure is introduced. In testing, two candidate regions are extracted from the mask estimated in the first stage. Then, the up-sampled segmentation mask from the first stage and the two refined segmentation masks are combined by majority voting. Despite the inconsistency between training and testing, the proposed method achieves competitive segmentation accuracy compared with state-of-the-art algorithms.

It is also interesting to note that more isolated false positive (FP) errors are introduced in the second-stage segmentation for subject 39. These FP errors are caused by the center loss term used for training. We noticed that the center loss term helps to increase the convergence speed during training; however, it can also cause more isolated false positive errors.

## 4 Conclusions

A two-stage pancreas segmentation method was proposed in this work. Two deep networks of the same architecture were trained on down-sampled and original 3D CT scans for coarse ROI definition and refined segmentation, respectively. We also proposed a novel testing framework, which can easily be used for the segmentation of other small organs. The proposed method achieves competitive segmentation accuracy compared with state-of-the-art algorithms.

## References

- [1] Farag, A., Lu, L., Roth, H.R., Liu, J., Turkbey, E., Summers, R.M.: A Bottom-Up Approach for Pancreas Segmentation Using Cascaded Superpixels and (Deep) Image Patch Labeling. IEEE Trans. Image Process. 26(1), 386–399 (Jan 2017)
- [2] Li, J., Lin, X., Che, H., Li, H., Qian, X.: Probability Map Guided Bi-directional Recurrent UNet for Pancreas Segmentation. Arxiv (2019), https://arxiv.org/abs/1903.00923
- [3] Oktay, O., Schlemper, J., Folgoc, L.L., Matthew Lee, M.H., Misawa, K., Mori, K., McDonagh, S., Hammerla, N.Y., Kainz, B., Glocker, B., Rueckert, D.: Attention U-Net: Learning Where to Look for the Pancreas. In: Medical Imaging with Deep Learning (MIDL) (2018)
- [4] Roth, H., Lu, L., Lay, N., Harrison, A., Farag, A., Sohn, A., Summers, R.: Spatial aggregation of holistically-nested convolutional neural networks for automated pancreas localization and segmentation. Medical Image Analysis 45 (2017)
- [5] Roth, H.R., Lu, L., Farag, A., Shin, H.C., Liu, J., Turkbey, E.B., Summers, R.M.: DeepOrgan: Multi-level Deep Convolutional Networks for Automated Pancreas Segmentation. In: Navab, N., Hornegger, J., Wells, W.M., Frangi, A. (eds.) Medical Image Computing and Computer-Assisted Intervention (MICCAI). pp. 556–564. Springer International Publishing, Cham (2015)
- [6] Roth, H.R., Lu, L., Farag, A., Sohn, A., Summers, R.M.: Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In: Ourselin, S., Joskowicz, L., Sabuncu, M.R., Unal, G., Wells, W. (eds.) Medical Image Computing and Computer-Assisted Intervention – MICCAI 2016. pp. 451–459. Springer International Publishing, Cham (2016)
- [7] Xia, Y., Xie, L., Liu, F., Zhu, Z., Fishman, E., Yuille, A.: Bridging the gap between 2d and 3d organ segmentation with volumetric fusion net. In: Frangi, A., Fichtinger, G., Schnabel, J., Alberola-López, C., Davatzikos, C. (eds.) Medical Image Computing and Computer Assisted Intervention – MICCAI 2018 - 21st International Conference, 2018, Proceedings. pp. 445–453. Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), Springer Verlag (2018)
- [8] Yu, Q., Xie, L., Wang, Y., Zhou, Y., Fishman, E.K., Yuille, A.L.: Recurrent saliency transformation network: Incorporating multi-stage visual cues for small organ segmentation. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2018)
- [9] Zhou, Y., Xie, L., Shen, W., Wang, Y., Fishman, E.K., Yuille, A.L.: A fixed-point model for pancreas segmentation in abdominal ct scans. In: International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI). vol. 1, pp. 693–701 (2017)
- [10] Zhu, Z., Xia, Y., Shen, W., Fishman, E.K., Yuille, A.L.: A 3d coarse-to-fine framework for volumetric medical image segmentation. In: International Conference on 3D Vision (2018)