A multi-scale pyramid of 3D fully convolutional networks for abdominal multi-organ segmentation

Holger R. Roth Graduate School of Informatics, Nagoya University, Nagoya, Japan
Contact: rothhr@mori.m.is.nagoya-u.ac.jp or kensaku@is.nagoya-u.ac.jp
   Chen Shen Graduate School of Informatics, Nagoya University, Nagoya, Japan
   Hirohisa Oda Graduate School of Informatics, Nagoya University, Nagoya, Japan
   Takaaki Sugino Graduate School of Informatics, Nagoya University, Nagoya, Japan
   Masahiro Oda Graduate School of Informatics, Nagoya University, Nagoya, Japan
   Yuichiro Hayashi Graduate School of Informatics, Nagoya University, Nagoya, Japan
   Kazunari Misawa Aichi Cancer Center, Nagoya, Japan
   Kensaku Mori Graduate School of Informatics, Nagoya University, Nagoya, Japan
Information Technology Center, Nagoya University, Nagoya, Japan
Research Center for Medical Bigdata, National Institute of Informatics, Tokyo, Japan

Recent advances in deep learning, like 3D fully convolutional networks (FCNs), have improved the state-of-the-art in dense semantic segmentation of medical images. However, most network architectures require severely downsampling or cropping the images to meet the memory limitations of today’s GPU cards while still considering enough context in the images for accurate segmentation. In this work, we propose a novel approach that utilizes auto-context to perform semantic segmentation at higher resolutions in a multi-scale pyramid of stacked 3D FCNs. We train and validate our models on a dataset of manually annotated abdominal organs and vessels from 377 clinical CT images used in gastric surgery, and achieve promising results with close to 90% Dice score on average. For additional evaluation, we perform separate testing on datasets from different sources and achieve competitive results, illustrating the robustness of the model and approach.

1 Introduction

Multi-organ segmentation has attracted considerable interest over the years. The recent success of deep learning-based classification and segmentation methods has triggered widespread applications of deep learning-based semantic segmentation in medical imaging [1, 2]. Many methods focused on the segmentation of single organs like the prostate [1], liver [3], or pancreas [4, 5]. Deep learning-based multi-organ segmentation in abdominal CT has also been approached recently in works like [6, 7]. Most of these methods are based on variants of fully convolutional networks (FCNs) [8] that either employ 2D convolutions on orthogonal cross-sections in a slice-by-slice fashion [3, 4, 5, 9] or 3D convolutions [1, 2, 7]. A common feature of these segmentation methods is that they are able to extract features useful for image segmentation directly from the training imaging data, which is crucial for the success of deep learning. This avoids the need for handcrafting features that are suitable for detection of individual organs.

However, most network architectures require severely downsampling or cropping the images for 3D processing to meet the memory limitations of today’s GPU cards [1, 7] while still considering enough context in the images for accurate segmentation of organs.

In this work, we propose a multi-scale 3D FCN approach that utilizes a scale-space pyramid with auto-context to perform semantic image segmentation at a higher resolution while also considering large contextual information from lower resolution levels. We train our models on a large dataset of manually annotated abdominal organs and vessels from pre-operative clinical computed tomography (CT) images used in gastric surgery and evaluate them on a completely unseen dataset from a different hospital, achieving a promising performance compared to the state-of-the-art.

Our approach is shown schematically in Fig. 1. We are influenced by classical scale-space pyramid [10] and auto-context ideas [11] for integrating multi-scale and varying context information into our deep learning-based image segmentation method. Instead of having separate FCN pathways for each scale as explored in other work [12, 13], we utilize the auto-context principle to fuse the information from different image scales and different amounts of context within the 3D FCN at the same time. Our model can be trained end-to-end using modern deep learning frameworks. This is in contrast to previous work, which utilized auto-context with separately trained models for brain segmentation [13].

In summary, our contributions are 1) introduction of a multi-scale pyramid of 3D FCNs; 2) improved segmentation of fine structures at higher resolution; 3) end-to-end training of multi-scale pyramid FCNs showing improved performance and good learning properties. We perform a comprehensive evaluation on a large training and validation dataset, plus unseen testing on data from different hospitals and public sources, showing promising generalizability.


Figure 1: Multi-scale pyramid of 3D fully convolutional networks (FCNs) for multi-organ segmentation. The lower-resolution-level 3D FCN predictions are upsampled, cropped and concatenated with the inputs of a higher resolution 3D FCN. The Dice loss is used for optimization at each level and training is performed end-to-end.

2 Methods

2.1 3D fully convolutional networks

Convolutional neural networks (CNNs) have the ability to solve challenging classification tasks in a data-driven manner. Fully convolutional networks (FCNs) are an extension to CNNs that have made it feasible to train models for pixel-wise semantic segmentation in an end-to-end fashion [8]. In FCNs, feature learning is purely driven by the data, the segmentation task at hand, and the network architecture. Given a training set of images and labels $\mathcal{S} = \{(X_n, L_n),\ n = 1, \dots, N\}$, where $X_n$ denotes a CT image and $L_n$ a ground truth label image, the model is trained to minimize a loss function $\mathcal{L}$ in order to optimize the FCN model $f(X, \Theta)$, where $\Theta$ denotes the network parameters, including the convolutional kernel weights for hierarchical feature extraction.

While efficient implementations of 3D convolutions and growing GPU memory have made it possible to deploy FCN on 3D biomedical imaging data [1, 2], image volumes are in practice often cropped and downsampled in order for the network to access enough context to learn an effective semantic segmentation model while still fitting into memory. Our employed network model is inspired by the fully convolutional type 3D U-Net architecture proposed in Çiçek et al. [2].

The 3D U-Net architecture is based on U-Net proposed in [14] and consists of analysis and synthesis paths with four resolution levels each. It utilizes deconvolution [8] (also called transposed convolution) to remap the lower resolution and more abstract feature maps within the network to the denser space of the input images. This operation allows for efficient dense voxel-to-voxel predictions. Each resolution level in the analysis path contains two $3 \times 3 \times 3$ convolutional layers, each followed by rectified linear units (ReLU), and a $2 \times 2 \times 2$ max pooling with strides of two in each dimension. In the synthesis path, the convolutional layers are replaced by deconvolutions of $2 \times 2 \times 2$ with strides of two in each dimension. These are followed by two $3 \times 3 \times 3$ convolutions, each followed by ReLU activations. Furthermore, 3D U-Net employs shortcut (or skip) connections from layers of equal resolution in the analysis path to provide higher-resolution features to the synthesis path [2]. The last layer contains a $1 \times 1 \times 1$ convolution that reduces the number of output channels to the number of class labels $K$. This architecture has over 19 million learnable parameters and can be trained to minimize the average Dice loss derived from the binary case in [1]:

$$\mathcal{L}(X, \Theta, L) = -\frac{1}{K} \sum_{k=1}^{K} \frac{2 \sum_{i} p_{i,k}\, l_{i,k}}{\sum_{i} p_{i,k} + \sum_{i} l_{i,k}}$$

Here, $p_{i,k} \in [0, 1]$ represents the continuous values of the softmax 3D prediction maps for each class label $k$ of $K$ and $l_{i,k}$ the corresponding ground truth value in $L$ at each voxel $i$.
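The average Dice loss described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: it assumes the soft (non-squared) denominator, a channels-first layout, and a small smoothing constant `eps`, none of which are specified in the text.

```python
import numpy as np


def average_dice_loss(probs, onehot, eps=1e-7):
    """Negative mean soft Dice over classes.

    probs:  softmax outputs, shape (K, D, H, W), values in [0, 1]
    onehot: one-hot ground truth, same shape as probs
    eps:    small constant (an assumption) that avoids division by zero
    """
    axes = tuple(range(1, probs.ndim))  # sum over all voxels of each class map
    intersection = np.sum(probs * onehot, axis=axes)
    denominator = np.sum(probs, axis=axes) + np.sum(onehot, axis=axes)
    dice_per_class = (2.0 * intersection + eps) / (denominator + eps)
    return -np.mean(dice_per_class)


# Toy example: two classes on a 2x2x2 volume.
labels = np.zeros((2, 2, 2), dtype=int)
labels[0] = 1
onehot = np.stack([(labels == k).astype(float) for k in range(2)])

perfect_loss = average_dice_loss(onehot, onehot)                     # perfect overlap
uniform_loss = average_dice_loss(np.full(onehot.shape, 0.5), onehot) # uninformative prediction
```

A perfect prediction gives a loss of -1, while the uninformative prediction of 0.5 everywhere gives -0.5 in this toy case, so minimizing the loss pushes the per-class overlap towards 1.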

2.2 Multi-scale auto-context pyramid approach


Figure 10: Axial CT images and 3D surface rendering with ground truth (g.t.) and predictions (pred.) overlaid. Panels: (a) g.t. (scale 0), (b) pred. (scale 0), (c) g.t. (scale 1), (d) pred. (scale 1), (e) g.t. (scale 0), (f) pred. (scale 0), (g) g.t. (scale 1), (h) pred. (scale 1). We show the two scales used in our experiments. Each scale's input has the same fixed size in this setting.

To effectively process an image at higher resolutions, we propose a method that is inspired by the auto-context algorithm [11]. Our method both captures the context information in lower resolution downsampled images and learns more accurate segmentations from higher resolution images in two levels of a scale-space pyramid $F = \{f_m(X_m, \Theta_m),\ m = 1, \dots, M\}$, with $M$ being the number of levels in our multi-scale pyramid, and $X_m$ being one of the multi-scale input subvolumes at each level $m$.

In the first level, the 3D FCN is trained on images of the lowest resolution in order to capture the largest amount of context, downsampled by a factor of $ds_1$ and optimized using the Dice loss $\mathcal{L}_1$. This can be thought of as a form of deep supervision [15]. In the next level, we use the predicted segmentation maps as a second input channel to the 3D FCN while learning from the images at a higher resolution, downsampled by a factor of $ds_2 = ds_1 / 2$, and optimized using the Dice loss $\mathcal{L}_2$. For input to this second level of the pyramid, the previous level's prediction maps are upsampled by a factor of 2 and cropped in order to spatially align with the higher resolution level. These predictions can then be fed together with the appropriately cropped image data as a second channel. This approach can be learned end-to-end using modern multi-GPU devices and deep learning frameworks, with the total loss being $\mathcal{L}_{\mathrm{total}} = \mathcal{L}_1 + \mathcal{L}_2$. This idea is shown schematically in Fig. 1. The resulting segmentation masks for the two-level case are shown in Fig. 10. It can be observed that the second-level auto-context network markedly outperforms the first-level predictions and is able to segment structures with improved detail, especially at the vessels.
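The way the second-level input is assembled (upsample the first-level prediction by 2, crop it to align with the higher-resolution image, and stack it as a second channel) can be sketched as below. This is an illustrative NumPy sketch, not the authors' code: strided subsampling and nearest-neighbour upsampling stand in for proper resampling, the first-level prediction is assumed to be a single-channel label map, and all function names are hypothetical.

```python
import numpy as np


def downsample(volume, factor):
    """Strided subsampling (a simple stand-in for proper resampling)."""
    return volume[::factor, ::factor, ::factor]


def upsample2(volume):
    """Nearest-neighbour upsampling by a factor of 2 along each axis."""
    for axis in range(3):
        volume = np.repeat(volume, 2, axis=axis)
    return volume


def crop(volume, start, size):
    """Extract a cubic subvolume of edge length `size` at corner `start`."""
    z, y, x = start
    return volume[z:z + size, y:y + size, x:x + size]


def second_level_input(image, level1_pred, start, size, ds2=2):
    """Stack the higher-resolution image crop with the aligned level-1 prediction."""
    image_hi = downsample(image, ds2)   # image at the second-level resolution
    pred_hi = upsample2(level1_pred)    # level-1 prediction brought to the same grid
    image_crop = crop(image_hi, start, size)
    pred_crop = crop(pred_hi, start, size)
    return np.stack([image_crop, pred_crop], axis=-1)  # channels-last, 2 channels


rng = np.random.default_rng(0)
image = rng.normal(size=(32, 32, 32))                             # full-resolution toy volume
level1_pred = rng.integers(0, 8, size=(8, 8, 8)).astype(float)    # prediction at ds1 = 4
x2 = second_level_input(image, level1_pred, start=(4, 4, 4), size=8)
```

Here `x2` has shape `(8, 8, 8, 2)`: the first channel holds the image crop at the second-level resolution and the second channel holds the spatially aligned first-level prediction.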

2.3 Implementation & Training

We implement our approach in Keras (https://keras.io/) using the TensorFlow (https://www.tensorflow.org/) backend. The Dice loss [3] is used for optimization with Adam and automatic differentiation for gradient computations. Batch normalization layers are inserted throughout the network, using a mini-batch size of three, sampled from different CT volumes of the training set. We use randomly extracted subvolumes of fixed size during training, such that at least one foreground voxel is at the center of each subvolume. On-the-fly data augmentation is used via random translations, rotations and elastic deformations similar to [2].
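The subvolume sampling strategy, where each training crop is centered on a foreground voxel, might look like the following NumPy sketch. The zero padding at the volume border and the function name are assumptions made for illustration; the text does not say how border cases are handled.

```python
import numpy as np


def sample_foreground_centered(image, labels, size, rng):
    """Return an (image, label) crop of edge `size` centered on a random foreground voxel."""
    half = size // 2
    # Pad so that crops near the border keep the requested size (an assumption).
    image_p = np.pad(image, half, mode="constant")
    labels_p = np.pad(labels, half, mode="constant")
    foreground = np.argwhere(labels > 0)
    z, y, x = foreground[rng.integers(len(foreground))]
    # Voxel (z, y, x) sits at (z + half, y + half, x + half) in the padded volume,
    # so a crop starting at (z, y, x) has that voxel at index `half` along each axis.
    window = (slice(z, z + size), slice(y, y + size), slice(x, x + size))
    return image_p[window], labels_p[window]


rng = np.random.default_rng(0)
image = rng.normal(size=(10, 10, 10))
labels = np.zeros((10, 10, 10), dtype=int)
labels[0, 9, 5] = 3  # a single foreground voxel close to the border
img_crop, lab_crop = sample_foreground_centered(image, labels, size=4, rng=rng)
```

Sampling around foreground voxels keeps every mini-batch informative even when the organs occupy a small fraction of the volume.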

3 Experiments & Results

In our implementation, a constant input and output size of randomly cropped subvolumes is used for training in each level. For inference, we employ network reshaping [8] to more efficiently process the testing image with a larger input size while building up the full image in a tiling approach [2]. The resulting segmentation masks for both levels are shown in Fig. 17. It can be observed that the second-level auto-context network markedly outperforms the first-level predictions and is able to segment structures with improved detail. All experiments were performed using a DeepLearning BOX (GDEP Advance) with four NVIDIA Quadro P6000s with 24 GB memory each. Training of 20,000 iterations using this unoptimized implementation took several days, while inference on a full CT scan takes just a few minutes on one GPU card.
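The tiling idea used at inference can be illustrated with a short NumPy sketch that splits a volume into tiles, applies a prediction function to each tile, and stitches the results. It assumes non-overlapping tiles, edge padding, and a hypothetical `predict_fn` whose output has the same shape as its input; the network reshaping step from the text is not shown.

```python
import numpy as np


def predict_tiled(volume, predict_fn, tile):
    """Apply `predict_fn` to non-overlapping cubic tiles and stitch the outputs."""
    # Pad each axis up to the next multiple of the tile size.
    pads = [(0, (-dim) % tile) for dim in volume.shape]
    padded = np.pad(volume, pads, mode="edge")
    output = np.zeros(padded.shape, dtype=np.int64)
    for z in range(0, padded.shape[0], tile):
        for y in range(0, padded.shape[1], tile):
            for x in range(0, padded.shape[2], tile):
                block = padded[z:z + tile, y:y + tile, x:x + tile]
                output[z:z + tile, y:y + tile, x:x + tile] = predict_fn(block)
    d, h, w = volume.shape
    return output[:d, :h, :w]  # remove the padding again


def toy_model(block):
    """Stand-in for a trained network: label voxels above zero as foreground."""
    return (block > 0).astype(np.int64)


rng = np.random.default_rng(0)
volume = rng.normal(size=(10, 7, 5))
segmentation = predict_tiled(volume, toy_model, tile=4)
```

With a real FCN, overlapping tiles are often used to reduce border effects; that refinement is omitted here for brevity.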


Our data set includes 377 contrast-enhanced clinical CT images of the abdomen in the portal-venous phase used for pre-operative planning in gastric surgery. Each CT volume consists of 460–1,177 slices of 512 × 512 pixels. Voxel dimensions are mm. With two pyramid levels, we downsample each volume by a factor of 4 in the first level and a factor of 2 in the second level. A random 90/10% split of 340/37 patients is used for training and testing the network. We achieve Dice similarity scores for each organ labeled in the testing cases as summarized in Table 1. We list the performance for the first-level and second-level models when utilizing auto-context trained separately or end-to-end, and compare to using no auto-context in the second level. This shows the impact of using or not using the lower resolution auto-context channel at the higher resolution input while training from the same input resolution from scratch. In our case, each ground truth label image contains eight labels consisting of the manual annotations of seven anatomical structures (artery, portal vein, liver, spleen, stomach, gallbladder, pancreas), plus background.
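For reference, the per-organ Dice similarity score reported in the tables can be computed from a predicted and a ground truth label map as in this small sketch. The function name and the convention that label 0 is background are assumptions for illustration.

```python
import numpy as np


def dice_per_label(pred, truth, num_labels):
    """Dice similarity (in %) for each foreground label 1..num_labels-1."""
    scores = {}
    for k in range(1, num_labels):
        p = pred == k
        t = truth == k
        denom = p.sum() + t.sum()
        # A label that is absent from both maps has an undefined score.
        scores[k] = 100.0 * 2.0 * np.logical_and(p, t).sum() / denom if denom else float("nan")
    return scores


truth = np.array([1, 1, 2, 2, 0, 0])
pred = np.array([1, 1, 2, 0, 0, 0])
scores = dice_per_label(pred, truth, num_labels=3)
```

In this toy case label 1 is segmented perfectly (100%), while label 2 has one of its two voxels missed, giving 2·1/(1+2) ≈ 66.7%.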

Table 2 compares our results to recent literature and also displays the result using an unseen testing dataset from a different hospital consisting of 129 cases from a distinct research study. Furthermore, we test our model on a public data set of 20 contrast-enhanced CT scans (we utilize the 20 training cases of the VISCERAL data set, http://www.visceral.eu/benchmarks/anatomy3-open, as our test set).

Table 1: Comparison of different levels of our model. End-to-end training gives a statistically significant improvement.

Level 1: Initial (low res)
Dice (%)   artery   vein    liver   spleen   stomach   gall.   pancreas   Avg.
Avg        75.4     64.0    95.4    94.0     93.7      80.2    79.8       83.2
Std        3.9      5.4     1.0     0.8      7.6       15.5    8.5        6.1
Min        67.4     41.3    91.5    92.6     48.4      27.3    49.7       59.7
Max        82.3     70.9    96.4    95.8     96.5      93.5    90.6       89.4

Level 2: Auto-context
Dice (%)   artery   vein    liver   spleen   stomach   gall.   pancreas   Avg.
Avg        82.5     76.8    96.7    96.6     95.9      84.4*   83.4       88.1
Std        4.1      6.4     1.0     0.7      8.0       14.0    8.4        6.1
Min        73.3     46.3    92.9    94.4     48.1      28.0    53.9       62.4
Max        90.0     83.5    97.9    98.0     98.7      96.0    93.4       93.9

End-to-End: Auto-context (high-res)
Dice (%)   artery   vein    liver   spleen   stomach   gall.   pancreas   Avg.
Avg        83.0*    79.4*   96.9*   97.2*    96.2*     83.6    86.7*      89.0*
Std        4.4      6.7     1.0     1.0      5.9       17.1    7.4        6.2
Min        73.2     50.2    93.5    94.9     61.4      29.7    60.0       66.1
Max        91.0     87.7    98.3    98.7     98.7      96.4    95.2       95.1

Level 2: No auto-context (high-res)
Dice (%)   artery   vein    liver   spleen   stomach   gall.   pancreas   Avg.
Avg        69.9     72.8    86.7    90.9     3.8       73.4    77.0       67.8
Std        6.2      7.0     6.4     5.3      1.3       22.5    10.8       8.5
Min        59.5     47.1    69.9    75.7     0.7       7.8     36.1       42.4
Max        82.1     82.9    95.7    97.0     7.4       95.9    90.9       78.8

* Best average performance per column.

Table 2: We compare our model trained in an end-to-end fashion to recent work on multi-organ segmentation. [9] uses a 2D FCN approach with a majority voting scheme, while [7] employs 3D FCN architectures. Furthermore, we list our performance on an unseen testing dataset from a different hospital and on the public Visceral dataset without any re-training and compare it to the current challenge leaderboard (LB) best performance for each organ. Note that this table is not comprehensive and direct comparison to the literature is always difficult due to the different datasets and evaluation schemes involved.

Dice (%)             train/test   artery   vein    liver   spleen   stomach   gall.   pancreas   Avg.
Ours (end-to-end)    340/37       83.0     79.4    96.9    97.2     96.2      83.6    86.7       89.0
Unseen test          none/129     -        -       95.3    93.6     -         80.8    75.7       86.3
Gibson et al. [7]    72 (8-CV)    -        -       92      -        83        -       66         80.3
Zhou et al. [9]†     228/12       73.8     22.4    93.7    86.8     62.4      59.6    56.1       65.0
Hu et al. [6]        140 (CV)     -        -       96.0    94.2     -         -       -          95.1
Visceral (LB)        20/10        -        -       95.0    91.1     -         70.6    58.5       78.8
Visceral (ours)‡     none/20      -        -       94.0    87.2     -         68.2    61.9       77.8

† Dice score estimated from Intersection over Union (Jaccard index).
‡ At the time of writing, the testing evaluation servers of the challenge were no longer available for submitting results.
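
Some entries in Table 2 are Dice scores estimated from the Intersection over Union (Jaccard index). For a single pair of segmentations the two measures are related by D = 2J / (1 + J). The sketch below shows that conversion; whether the cited values were converted per case or from averaged scores is not stated here, so the function is only an illustration of the relation.

```python
def dice_from_jaccard(jaccard):
    """Convert an IoU (Jaccard index) in [0, 1] to the equivalent Dice score."""
    return 2.0 * jaccard / (1.0 + jaccard)


half_overlap = dice_from_jaccard(0.5)  # equals 2/3
```

Because the mapping is non-linear, converting an averaged Jaccard index only approximates the average Dice score, which is one reason such comparisons should be read with care.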


Figure 17: Axial CT images and 3D surface rendering of predictions from two multi-scale levels in comparison with ground truth annotations. Panels: (a) ground truth (axial), (b) first level (upsampled), (c) second level (auto-context), (d) ground truth (3D), (e) first level (upsampled), (f) second level (auto-context). In particular, the vessels are segmented more completely and in greater detail in the second level, which utilizes auto-context information in its prediction.

4 Discussion & Conclusion

The multi-scale auto-context approach presented in this paper provides a simple yet effective method for employing 3D FCNs in medical-imaging settings. No post-processing was applied to any of the network outputs. The performance improvement of our approach holds for all organs tested (apart from the gallbladder, where the differences are not significant). Note that we used different datasets (from different hospitals and scanners) for separate testing. This experiment illustrates our method's generalizability and robustness to differences in image quality and populations. Running the algorithms at a quarter to half of the original resolution improved performance and efficiency in this application. While this method could be extended to a multi-scale pyramid with the original resolution as the final level, we found that the added computational burden did not add significantly to the segmentation performance. The main improvement comes from utilizing a very coarse image (downsampled by a factor of four) in an effective manner. In this work, we utilized a 3D U-Net-like model for each level of the image pyramid. However, the proposed auto-context approach should in principle also work well for other 3D CNN/FCN architectures and for 2D and 3D image modalities.

In conclusion, we showed that an auto-context approach can result in improved semantic segmentation results for 3D FCNs based on the 3D U-Net architecture. While the low-resolution part of the model is able to benefit from a larger context in the input image, the higher resolution auto-context part of the model can segment the image with greater detail, resulting in better overall dense predictions. Training both levels end-to-end resulted in improved performance.


This work was supported by MEXT KAKENHI (26108006, 17H00867, 17K20099) and the JSPS International Bilateral Collaboration Grant.


References
  • [1] Milletari, F., Navab, N., Ahmadi, S.A.: V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 3D Vision (3DV), IEEE (2016) 565–571
  • [2] Çiçek, Ö., Abdulkadir, A., Lienkamp, S.S., Brox, T., Ronneberger, O.: 3D U-Net: Learning dense volumetric segmentation from sparse annotation. In: MICCAI, Springer (2016) 424–432
  • [3] Christ, P.F., Elshaer, M.E.A., Ettlinger, F., Tatavarty, S., Bickel, M., Bilic, P., Rempfler, M., Armbruster, M., Hofmann, F., D’Anastasi, M., Sommer, W.H., Ahmadi, S.A., Menze, B.H.: Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields. In: MICCAI, Springer (2016) 415–423
  • [4] Roth, H.R., Lu, L., Farag, A., Sohn, A., Summers, R.M.: Spatial aggregation of holistically-nested networks for automated pancreas segmentation. In: MICCAI, Springer (2016) 451–459
  • [5] Zhou, Y., Xie, L., Shen, W., Wang, Y., Fishman, E.K., Yuille, A.L.: A fixed-point model for pancreas segmentation in abdominal CT scans. In: MICCAI, Springer (2017) 693–701
  • [6] Hu, P., Wu, F., Peng, J., Bao, Y., Chen, F., Kong, D.: Automatic abdominal multi-organ segmentation using deep convolutional neural network and time-implicit level sets. International Journal of Computer Assisted Radiology and Surgery 12(3) (2017) 399–411
  • [7] Gibson, E., Giganti, F., Hu, Y., Bonmati, E., Bandula, S., Gurusamy, K., Davidson, B.R., Pereira, S.P., Clarkson, M.J., Barratt, D.C.: Towards image-guided pancreas and biliary endoscopy: Automatic multi-organ segmentation on abdominal CT with dense dilated networks. In: MICCAI, Springer (2017) 728–736
  • [8] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: IEEE CVPR (2015) 3431–3440
  • [9] Zhou, X., Takayama, R., Wang, S., Hara, T., Fujita, H.: Deep learning of the sectional appearances of 3D CT images for anatomical structure segmentation based on an FCN voting method. Medical Physics (2017)
  • [10] Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.M.: Pyramid methods in image processing. RCA Engineer 29(6) (1984) 33–41
  • [11] Tu, Z., Bai, X.: Auto-context and its application to high-level vision tasks and 3D brain image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 32(10) (2010) 1744–1757
  • [12] Chen, L.C., Yang, Y., Wang, J., Xu, W., Yuille, A.L.: Attention to scale: Scale-aware semantic image segmentation. In: IEEE CVPR (2016) 3640–3649
  • [13] Salehi, S.S.M., Erdogmus, D., Gholipour, A.: Auto-context convolutional neural network (Auto-Net) for brain extraction in magnetic resonance imaging. IEEE Transactions on Medical Imaging 36(11) (2017) 2319–2330
  • [14] Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI, Springer (2015) 234–241
  • [15] Lee, C.Y., Xie, S., Gallagher, P., Zhang, Z., Tu, Z.: Deeply-supervised nets. In: Artificial Intelligence and Statistics (2015) 562–570