3D RoI-aware U-Net for Accurate and Efficient Colorectal Tumor Segmentation

Yi-Jie Huang, Qi Dou, Zi-Xian Wang, Li-Zhi Liu, Ying Jin, Chao-Feng Li, Lisheng Wang, Hao Chen, Rui-Hua Xu
Institute of Image Processing and Pattern Recognition, Department of Automation, Shanghai Jiao Tong University, China
Imsight Medical Technology Co. Ltd., China
Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong
Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China;
Collaborative Innovation Center for Cancer Medicine, Guangzhou, China

Objective: Segmentation of colorectal cancerous regions from Magnetic Resonance (MR) images is a crucial procedure for radiotherapy, which requires accurate delineation of tumor boundaries. This work aims to address this important yet challenging task in an accurate and efficient manner. Methods: We propose a novel multi-task framework, referred to as 3D RoI-aware U-Net (3D RU-Net), for RoI localization and intra-RoI segmentation, where the two tasks share one backbone network. With the region proposals from the localization branch, we crop multi-level feature maps from the backbone network to form a U-Net-like intra-RoI segmentation branch. To train the model effectively, we propose a novel Dice-based hybrid loss to tackle class imbalance under the multi-task setting. Furthermore, we design a multi-resolution model ensemble strategy to improve the discrimination capability of the framework. Results: Our method has been validated on 64 cancerous cases with four-fold cross-validation, outperforming state-of-the-art methods by a significant margin in terms of both accuracy and speed. Conclusion: Experimental results demonstrate that the proposed method enables accurate and fast whole volume RoI localization and intra-RoI segmentation. Significance: This paper proposes a general 3D segmentation framework which rapidly locates the RoI in large volumetric images and accurately segments the in-region targets. The method has great potential to be extended to other small 3D object segmentation tasks in medical images.

1 Introduction

Figure 1: Typical examples of MR slices with colorectal cancer. The cancer regions are delineated with red lines and zoomed in for clearer illustration.

Colorectal cancer is the second leading cause of cancer-related mortality in the United States [29]. In the current clinical routine of radiotherapy, colorectal cancer regions are manually delineated from volumetric images acquired by magnetic resonance (MR) imaging, which is considered the optimal imaging modality because it provides rich soft-tissue context. However, this procedure is laborious, subjective and time-consuming, and therefore suffers from limited reproducibility. Automatic colorectal cancer segmentation methods are thus in high demand in clinical practice for radiotherapy. The task is nevertheless very challenging due to the low foreground-to-background ratio, the low contrast and inconsistent appearance of cancerous regions, and hard mimics in the complex peritumoral areas. As shown in Fig. 1, it is difficult to distinguish abnormal tissue from the normal peritumoral tissue.

Automatic segmentation of volumetric MR images has been widely studied in the literature. Some initial works are based on super-voxel clustering [21, 16]. While intensity- and connectivity-based methods serve as good baselines, deep learning based methods have since raised the state of the art. A series of FCN backbone models [4, 26, 5, 22, 34, 2, 9] have been proposed to effectively segment medical images. The success of state-of-the-art FCNs emphasizes that it is essential to fuse features from different levels (e.g., using skip connections) to recover fine-grained details that are lost in the downsampling process.

In practice, we define whole volume segmentation as the problem of automatically processing whole volumetric images rather than manually selected RoI patches. Being more automatic simplifies the workflow, excludes human subjectivity and enables fast processing of large data sets. Given the aforementioned backbone networks and their limitations, related works tackle this task with three categories of methods: part based fully convolutional networks (FCNs), discrete RoI localization and segmentation based models, and unified RoI localization-segmentation based models.

As naive practices, part based FCNs perform simple iterative inference: the images are divided into 2D slices [26, 32], 2.5D slices [27, 28] or small 3D patches [34] to be processed. Except when segmenting non-object targets such as vessels, this design mainly results from the fact that 3D FCNs extract large feature tensors that occupy excessive GPU memory, so whole volume end-to-end segmentation is hardly applicable. The drawbacks of this practice are apparent: it is slow, unspecific and uses context incompletely. Because they search for the target patch by patch, these methods spend the majority of computing resources outside regions of interest (RoIs) and produce many false positives. Additionally, 2D and 2.5D FCNs lack the necessary 3D context, while 3D patch based FCNs cannot see the complete global context; such incomplete context utilization can degrade the performance.

More recently, discrete RoI localization and segmentation models have been widely applied to eliminate redundant computation and reduce false positives. By definition, these methods combine separate RoI localization and segmentation modules to perform intra-RoI segmentation. Traditionally, RoIs are localized using prior knowledge such as multi-atlas registration [25, 18], whose performance is limited by inconsistent intensity distributions, different fields of view (FoVs) and relatively slow registration. Learning based RoI localization decouples RoI localization from prior knowledge [10, 6, 24, 19]. Some of these works [10] extract region proposals using external modules such as Multiscale Combinatorial Grouping (MCG) [1], then classify and segment them. Later works adopt light CNN models such as 2D CNNs for RoI localization [19, 31] and then use 3D FCNs for segmentation. Compared to part based methods, these works tackle the task more gracefully and efficiently. However, they still suffer from speed issues, since they either rely on time-consuming external region proposal modules or repeatedly extract low-level features that could have been shared across the two stages.

As a promising development, unified RoI localization-segmentation models such as Multi-task Network Cascades (MNC) [7] and Mask R-CNN [11] further eliminate redundant feature extraction and achieve better speed and accuracy by sharing a backbone network across the region proposal network (RPN) for RoI detection, the RoIAlign module for RoI extraction, the intra-RoI CNN for classification and the intra-RoI FCN for segmentation. The state-of-the-art Mask R-CNN employs the Feature Pyramid Network (FPN) [20] as the backbone feature extractor, which is similar to U-Net [26], to extract region proposals by fusing different levels of feature maps and thereby preserve details better. We seek to design a unified one-step whole volume segmentation framework like Mask R-CNN, but directly extending it to a volumetric formulation for 3D medical tasks encounters several issues. Firstly, as in the case of 3D U-Net, extending FPNs, which have a symmetric encoder-decoder construction, to 3D takes excessive GPU memory compared to pure encoders and reduces the applicable volume size. Secondly, the RPN fits and regresses pre-set bounding boxes of different scales and XY ratios, named anchors, to ground truth bounding boxes; under direct extension it would have anchors of more ratios, namely XYZ ratios, to fit fewer 3D medical objects, which can result in severe overfitting. Finally, the RoIAlign module performs bilinear interpolation to resample the feature tensors within bounding boxes into dimension-fixed bins (e.g., 14×14). Such a design introduces shape distortion, scale normalization and detail loss, which may degrade the performance.

Apart from the way whole volume predictions are generated, recent works propose several strategies to further boost the performance of volumetric image segmentation. Firstly, V-Net [22] adopts a parameter-free Dice coefficient loss to handle the class-imbalance issue. Secondly, Deep Contour-aware Networks (DCAN) [3] employ a contour-aware loss function for better discrimination between boundaries and the background. In addition, Orchestral Fully Convolutional Networks (OFCNs) [33] and Hybrid Loss guided Fully Convolutional Networks (HL-FCNs) [15] adopt model ensembles for better robustness.

An initial work on automatically segmenting colorectal cancer regions was published at ISBI [15]. Going one step further, and following the insights listed above, in this paper we propose a novel unified RoI localization-segmentation framework, named 3D RoI-aware U-Net (3D RU-Net), to segment cancerous tissues from whole volume MR images in a one-step manner. To further boost the performance, we design a hybrid loss function that helps the network both handle small objects in big volumes and focus on accurately recognizing ambiguous borders in local RoIs, and we additionally adopt a multi-resolution ensemble strategy for better robustness. Experiments conducted on 64 acquired scans demonstrate the efficacy of our method, and ablation studies validate the contribution of each component of our framework.

Figure 2: The illustration of 3D RU-Net. The network consists of a whole volume contractive path and an intra-RoI expansive path. Bounding boxes are predicted and passed to preceding feature maps by upscaling, then the feature tensors are cropped and memory-efficient multilevel feature fusion for intra-RoI segmentation is performed in the expansive path.

Our main contributions are summarized as follows:

  1. We extend the unified RoI localization-segmentation framework to a 3D formulation, and apply a multilevel feature fusion mechanism to the segmentation branch. This configuration enables faster, memory-efficient and detail-preserving one-step whole volume segmentation compared to part based and discrete RoI localization-segmentation based counterparts.

  2. For automatic class rebalancing and better boundary discrimination, we propose a Dice formulated contour-aware multi-task learning strategy to further improve the accuracy. Additionally, the accelerated framework encourages us to employ a multi-resolution model ensemble strategy to suppress false positives and refine boundary details at an acceptable speed cost.

  3. Extensive experiments on the acquired dataset demonstrate the efficacy of our proposed framework. Furthermore, our method is inherently general and can be applied to other similar applications.

The remainder of this paper is organized as follows. We describe our method in Section 2 and report the experimental results in Section 3. Section 4 further discusses some insights as well as issues of the proposed method. The conclusions are drawn in Section 5.

2 Methodology

The proposed 3D RU-Net framework is illustrated in Fig. 2. We input whole image volumes and crop RoIs from multi-scale feature maps to formulate an intra-RoI U-Net expansive path for high-resolution cancerous tissue segmentation.

2.1 Construction of 3D RU-Net

In this section, we form a unified RoI localization-segmentation framework, with reused features and shared weights over different tasks, which essentially borrows the spirit from Mask R-CNN[11] with ResNet-FPN as the backbone feature extractor but modified to solve our underlying 3D medical image segmentation problem.

To address the issues discussed in section 1, we propose a framework to effectively localize and segment colorectal cancer in the following aspects:

2.1.1 Backbone Network

Due to the limited GPU memory of commonly used devices and the dramatically increased number of parameters of 3D convolution kernels, it is essential to design the 3D backbone feature extractor carefully to avoid GPU memory overflow and overfitting. Instead of constructing a 3D version of an encoder-decoder architecture like 3D U-Net or 3D FPN, or directly extending popular backbones [30, 13, 14] to 3D, we adopt a variation of the 3D U-Net contractive path built from ResBlocks [13], called the Whole Volume Contractive Path, to process whole image volumes without dividing them into multiple parts. Since it has fewer parameters and produces smaller feature tensors while providing the discrimination capability needed, it is better suited to our task.

2.1.2 RoI Localization

Since the cancerous regions have inconsistent 3D scales, shapes and XYZ ratios, and since we have only 64 samples to train, validate and test on, we avoid fitting anchors defined by a large number of combinations of scales and XYZ ratios to ground truth bounding boxes, which would also degrade voxel-wise labels to object-wise labels. Instead, we perform low-resolution whole volume segmentation on the terminal feature map of the backbone network, trained with the Dice loss introduced in subsection 2.2 to tackle the extremely imbalanced foreground-to-background ratio. We then perform connectivity analysis to compute the desired bounding boxes. To compensate for bounding boxes that may be undersized due to the coarseness of this step, the computed boxes are in practice either enlarged relative to their original size or replaced by an over-designed cube of fixed size along the Z, Y and X axes.
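
The localization step above can be sketched as follows. This is a minimal illustration, assuming a binarized low-resolution prediction, 6-connectivity for the connectivity analysis, and a hypothetical enlargement `factor`; none of these choices are taken from the paper, and the function names are illustrative only.

```python
from collections import deque

import numpy as np


def component_boxes(mask):
    """Return one (z0, y0, x0, z1, y1, x1) box per 6-connected component.

    The upper corner is exclusive, so mask[z0:z1, y0:y1, x0:x1] holds the
    whole component. 6-connectivity is an assumption of this sketch.
    """
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros(mask.shape, dtype=bool)
    neighbours = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    boxes = []
    for seed in zip(*np.nonzero(mask)):
        seed = tuple(int(v) for v in seed)
        if visited[seed]:
            continue
        visited[seed] = True
        queue = deque([seed])
        lo, hi = list(seed), list(seed)
        while queue:  # breadth-first flood fill of one component
            z, y, x = queue.popleft()
            for dz, dy, dx in neighbours:
                n = (z + dz, y + dy, x + dx)
                inside = all(0 <= n[i] < mask.shape[i] for i in range(3))
                if inside and mask[n] and not visited[n]:
                    visited[n] = True
                    queue.append(n)
                    lo = [min(a, b) for a, b in zip(lo, n)]
                    hi = [max(a, b) for a, b in zip(hi, n)]
        boxes.append(tuple(lo) + tuple(h + 1 for h in hi))
    return boxes


def extend_box(box, factor, shape):
    """Enlarge a box about its center by `factor` (hypothetical), clipped to the volume."""
    lo = np.array(box[:3], dtype=float)
    hi = np.array(box[3:], dtype=float)
    center = (lo + hi) / 2.0
    half = (hi - lo) * factor / 2.0
    new_lo = np.maximum(np.floor(center - half), 0).astype(int)
    new_hi = np.minimum(np.ceil(center + half), shape).astype(int)
    return tuple(int(v) for v in new_lo) + tuple(int(v) for v in new_hi)
```

In use, a sigmoid output `prob` of the localization top would first be binarized (for example `prob > 0.5`, again an assumed threshold) before calling `component_boxes`.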

2.1.3 RoI Cropping Layer

Rather than resampling targets within bounding boxes of different shapes into pre-defined fixed-size bins (e.g., 14×14), the RoI Cropping Layer extracts the feature tensors within bounding boxes by directly cropping them without resampling. This design keeps the shape ratios and scales of the bounding boxes unchanged, avoiding the performance degradation caused by detail loss and semantic shift. Additionally, the RoI Cropping Layer not only crops the feature tensors within bounding boxes from the terminal layer but also extends the computed bounding boxes to the preceding feature scales. The feature tensors generated by the ResBlocks at each scale of the contractive path are illustrated in Fig. 2. Given the center coordinates and size of an RoI predicted on the coarsest feature tensor, the cropped RoI window on each feature tensor is centered on those coordinates and has that size, both scaled up by the cumulative pooling rate between that tensor's scale and the coarsest one.
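
A minimal sketch of this cropping scheme is given below, assuming channels-last feature tensors of shape (Z, Y, X, C) and boxes stored as corner coordinates rather than center and size (the two are equivalent under scaling). The pooling rates in the usage note are illustrative values, not figures quoted from the text.

```python
def scale_box(box, factor):
    """Map a (z0, y0, x0, z1, y1, x1) box from a coarse grid to a finer grid."""
    fz, fy, fx = factor
    z0, y0, x0, z1, y1, x1 = box
    return (z0 * fz, y0 * fy, x0 * fx, z1 * fz, y1 * fy, x1 * fx)


def crop(feature, box):
    """Crop a (Z, Y, X, C) feature tensor by plain slicing, without resampling."""
    z0, y0, x0, z1, y1, x1 = box
    return feature[z0:z1, y0:y1, x0:x1]


def roi_crop_pyramid(features, pool_rates, box_coarse):
    """Crop every scale with the box predicted on the coarsest scale.

    features:   list of feature tensors, finest first, coarsest last.
    pool_rates: (z, y, x) pooling rate between consecutive scales,
                so len(pool_rates) == len(features) - 1.
    box_coarse: box expressed on the grid of the coarsest feature tensor.
    """
    crops = []
    for level, feature in enumerate(features):
        factor = [1, 1, 1]
        for rate in pool_rates[level:]:  # cumulative pooling down to the coarsest scale
            factor = [f * r for f, r in zip(factor, rate)]
        crops.append(crop(feature, scale_box(box_coarse, factor)))
    return crops
```

For example, with three NumPy feature tensors of shapes (8, 32, 32, 2), (8, 16, 16, 4) and (4, 8, 8, 8), pooling rates `[(1, 2, 2), (2, 2, 2)]` and a coarse box `(1, 2, 2, 3, 4, 4)`, the crops have spatial sizes 4×8×8, 4×4×4 and 2×2×2, so each level covers the same physical region at its own resolution.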

2.1.4 Intra-RoI Segmentation

Given the cropped feature tensors extracted from multiple preceding feature tensors, we construct the segmentation branch, named the U-Net-like Intra-RoI Expansive Path, by applying the multilevel feature fusion mechanism. The construction of the Intra-RoI Expansive Path is largely symmetrical to the Whole Volume Contractive Path; the beneficial difference lies in the much smaller size of the expansive path's feature tensors. Besides, since no shape distortion or scale normalization is involved, this module directly and losslessly restores the original size of the RoI. The same set of weights is used to iteratively process different RoIs if multiple RoIs are localized.

2.2 Dice-based Multi-task Hybrid Loss Function

We observe that cancerous regions with low contrast, ambiguous borders and unbalanced distribution are hard to learn even with the multilevel feature fusion mechanism. We therefore propose a Dice-based multi-task hybrid loss function to improve the performance.

2.2.1 Dice Loss Formulation

Inspired by the success of [22], we apply the Dice loss function to formulate the optimization objective, since it serves as an effective hyper-parameter-free class balancer that helps the network learn objects of small size and weak saliency. The Dice loss is defined as:


where the sums are computed over the voxels of the predicted volume and the ground truth volume, and the smoothness term avoids division by 0. In the optimization stage, the Dice loss is minimized by gradient descent using the following derivative:
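
As a minimal sketch, assuming the common V-Net-style form 1 − (2Σpg + ε)/(Σp² + Σg² + ε), the loss and its analytic derivative with respect to each predicted voxel can be written as below. The exact expression, sign convention and value of the smoothness term used by the authors may differ; `eps=1.0` is a placeholder.

```python
import numpy as np


def dice_loss(p, g, eps=1.0):
    """Soft Dice loss 1 - D with a squared-sum denominator (assumed form)."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    num = 2.0 * np.sum(p * g) + eps
    den = np.sum(p * p) + np.sum(g * g) + eps
    return 1.0 - num / den


def dice_loss_grad(p, g, eps=1.0):
    """Analytic gradient of dice_loss with respect to every predicted voxel."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    num = 2.0 * np.sum(p * g) + eps
    den = np.sum(p * p) + np.sum(g * g) + eps
    # Quotient rule on D = num / den, then negate because the loss is 1 - D.
    return -(2.0 * g * den - 2.0 * p * num) / den ** 2
```

Because numerator and denominator are sums over all voxels, a small foreground is not swamped by the background, which is the class-balancing behaviour described above. The gradient can be checked against finite differences of `dice_loss`.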


2.2.2 Dice Loss for Global Localization

To tackle the class imbalance issue of the global RoI localization task, we employ the aforementioned Dice loss:


where the two arguments denote the predictions of the localization top and the down-sampled annotations, respectively.

2.2.3 Dice-based Contour-aware Loss for Local Segmentation

Figure 3: 3D RU-Net1, 3D RU-Net2 and 3D RU-Net3 of identical architecture are trained with datasets of different resolution rates, namely HighRes set, MidRes set and LowRes set, and their predictions are first resolution-restored and averaged.

Compared to the localization task, the intra-RoI segmentation branch needs multiple constraints to obtain boundary-sensitive segmentation results. In semantic segmentation, the ambiguous borders are the most difficult part to learn, yet they usually receive insufficient attention during training. Borrowing the insight of previous work on adding an auxiliary contour-aware side task [3], we formulate the side task with the Dice loss to help it cope with the extreme sparsity of contour labels in 3D space. In practice, we add an extra Softmax branch at the output of the segmentation branch to predict the contour voxels, trained in parallel with the region segmentation task. Taking the side task into account, the loss function of the segmentation branch is defined as the weighted sum of the two losses:


where the auxiliary task weight is chosen so that the region segmentation task dominates while the auxiliary task still takes effect.

Finally, the overall loss function is:


where the last term is a weight decay term, scaled by a balance coefficient, over the parameters of the whole network.
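
The composition of the overall objective can be sketched as follows. This is an illustration only: the Dice form is the assumed V-Net-style one, and `contour_weight` and `weight_decay` are placeholder values, since the actual coefficients are not given in the text above.

```python
import numpy as np


def dice_loss(p, g, eps=1.0):
    """Soft Dice loss (assumed V-Net-style form with squared-sum denominator)."""
    p = np.asarray(p, dtype=np.float64)
    g = np.asarray(g, dtype=np.float64)
    num = 2.0 * np.sum(p * g) + eps
    den = np.sum(p * p) + np.sum(g * g) + eps
    return 1.0 - num / den


def hybrid_loss(loc_pred, loc_gt, region_pred, region_gt, contour_pred, contour_gt,
                weights, contour_weight=0.5, weight_decay=1e-4):
    """Localization loss + (region loss + weighted contour loss) + L2 weight decay.

    contour_weight and weight_decay are hypothetical placeholders.
    """
    loss_loc = dice_loss(loc_pred, loc_gt)  # global RoI localization
    loss_seg = dice_loss(region_pred, region_gt) \
        + contour_weight * dice_loss(contour_pred, contour_gt)  # intra-RoI tasks
    l2 = sum(float(np.sum(np.square(w))) for w in weights)  # weight decay term
    return loss_loc + loss_seg + weight_decay * l2
```

Keeping `contour_weight` below 1 reflects the stated intent that the region segmentation task dominates while the contour task still contributes.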

Part Name         Layer Name          Input Layer(s)          Kernel   Out Channel(s)
Contractive Path  ResBlock1           Image                            48
                  MaxPooling1         ResBlock1                        48
                  ResBlock2           MaxPooling1                      96
                  MaxPooling2         ResBlock2                        96
Bottleneck        ResBlock3           MaxPooling2                      192
                  LocTop (sigmoid)    ResBlock3                        1
RoI               RoICropping1        Bbox1, ResBlock1        -        48
                  RoICropping2        Bbox2, ResBlock2        -        96
                  RoICropping3        Bbox3, ResBlock3        -        192
Expansive Path    UpConv1             RoICropping3                     96
                  Add1                RoICropping2, UpConv1   -        96
                  ResBlock4           Add1                             96
                  UpConv2             ResBlock4                        48
                  Add2                RoICropping1, UpConv2   -        48
                  ResBlock5           Add2                             48
                  SegTop1 (sigmoid)   ResBlock5                        1
                  SegTop2 (sigmoid)   ResBlock5                        1
Table 1: Parameters and connectivity of the network.

Figure 4: Examples of: (a) Original Images (b) Normalized Images. The intensity of homogeneous tissues from images acquired under different imaging configurations are normalized to identical ranges.

2.3 Multi-resolution Model Ensemble

Model ensembling is considered an effective way to further boost performance and is widely employed in practice, at the cost of additional computation.

Encouraged by the dramatically accelerated framework, in this paper we propose to employ a multi-resolution model ensemble, that is, models of identical structure trained on three datasets of different resolution rather than models of different structural designs. In detail, we resample the acquired MR images, whose ZYX spacings vary across scans, into three datasets of stepped spacing configurations: the HighRes set, the MidRes set and the LowRes set, which are fed to 3D RU-Net1, 3D RU-Net2 and 3D RU-Net3, respectively. In the inference stage, as shown in Fig. 3, the outputs of the three networks are averaged to generate the final prediction.
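
The inference-time averaging can be sketched as below. Nearest-neighbour resampling is used here only to keep the example dependency-free; the interpolation actually used to restore resolution is not stated above, and the function names are illustrative.

```python
import numpy as np


def resize_nearest(volume, out_shape):
    """Resample a 3D volume to out_shape by nearest-neighbour index mapping."""
    volume = np.asarray(volume)
    # Output index i along an axis reads input index floor(i * s / n).
    idx = [np.arange(n) * s // n for n, s in zip(out_shape, volume.shape)]
    return volume[np.ix_(*idx)]


def ensemble_average(predictions, out_shape):
    """Restore every prediction to a common grid, then average voxel-wise."""
    restored = [resize_nearest(p, out_shape).astype(np.float64) for p in predictions]
    return np.mean(restored, axis=0)
```

Here `predictions` would hold the probability maps produced by the HighRes, MidRes and LowRes models, and `out_shape` the grid on which the final mask is reported.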

3 Experiments

3.1 Dataset and Preprocessing

3.1.1 Dataset

The dataset contains a total of 64 T2-weighted MR images of the pelvic cavity. Target areas were labeled voxel-wise by experienced radiologists, and contour labels were automatically generated from the region labels using erosion and subtraction operations.
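
The erosion-and-subtraction step can be sketched as follows, assuming a single erosion with a 6-neighbourhood structuring element; the structuring element and number of iterations actually used are not specified above.

```python
import numpy as np


def erode(mask):
    """One binary erosion step with a 6-neighbourhood (assumed structuring element)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    out = mask.copy()
    for axis in range(3):
        for shift in (-1, 1):
            # Neighbour along this axis; voxels outside the volume count as background.
            neighbour = np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
            out &= neighbour
    return out


def contour_from_region(mask):
    """Contour label = region label minus its erosion."""
    mask = np.asarray(mask, dtype=bool)
    return mask & ~erode(mask)
```

For a solid 3×3×3 cube, the erosion keeps only the central voxel, so the contour label consists of the remaining 26 voxels.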

3.1.2 Preprocessing

Spacing normalization is conducted according to the criterion described in subsection 2.3. To normalize the intensities of input images acquired under different imaging configurations and fields of view, we perform intra-body intensity normalization to exclude the effect of inconsistent body-to-background ratios. Using Otsu thresholding [23], connectivity analysis and a closing operation, body masks are extracted as foreground and all other voxels are set as background. The mean intensity and standard deviation are computed within the body mask according to the following formulas:


where the sums run over the intensities of the voxels inside the body mask and are normalized by the count of mask voxels. Then the image is normalized according to the following criterion:


A few examples of the comparison between original images and intensity-normalized images are illustrated in Fig. 4.
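
A minimal sketch of intra-body normalization is given below, assuming a plain z-score with the mean and standard deviation taken inside the body mask. The exact normalization criterion may differ, and the body mask is taken as given here rather than derived by Otsu thresholding.

```python
import numpy as np


def normalize_in_body(image, body_mask):
    """Z-score the image using statistics computed only inside the body mask."""
    image = np.asarray(image, dtype=np.float64)
    body_mask = np.asarray(body_mask, dtype=bool)
    values = image[body_mask]
    mean = values.mean()
    std = max(values.std(), 1e-8)  # guard against a constant-intensity mask
    return (image - mean) / std
```

Because the statistics ignore background voxels, the same body yields the same normalized intensities whether it fills the field of view or is surrounded by a large empty background, which is the motivation stated above.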

Before feeding the images to the network, we crop the input images according to minimum bounding boxes of the body masks to further reduce the GPU memory footprint. Additionally, in the training stage, we perform on-the-fly data augmentation when feeding training samples. Applied operations include scaling, flipping, intensity jittering, and translation.

3.2 Implementation Details

Our implementation is publicly available at https://github.com/huangyjhust/3D-RU-Net.

3.2.1 Hyper-Parameters

The network's detailed connectivity and kernel configuration are listed in Table 1. Specifically, to fit the anisotropic spacing of the acquired dataset, which has a larger spacing along the Z axis, flat (anisotropic) kernels, pooling rates and up-sampling rates are employed by the input and output blocks: ResBlock1, MaxPooling1, UpConv2 and ResBlock5. For direct comparison with related methods, we assign an over-designed fixed window to the RoI Cropping Layer.

3.2.2 Training Process

The backbone network is initialized using the MSRA criterion [12], then pre-trained using the patch-wise HL-FCN from our previous work [15]. We use the Adam optimizer [17]. The weights of the convolution kernels are penalized with an L2 norm for better generalization. We then iteratively train the RoI localization branch and the segmentation branch. In each iteration, we first train the RoI localization branch, named LocTop, and predict the bounding boxes. Next, we train the segmentation branches, named SegTop1 and SegTop2, using the predicted bounding boxes. Four-fold cross-validation is conducted on the 64 scans.

3.3 Evaluation Metrics

3.3.1 Dice Similarity Coefficient (DSC)

The Dice similarity coefficient (DSC) measures a general overlap rate that weights recall and precision equally, so that both missed voxels and false positives are penalized. DSC is defined as:


where the metric takes values in [0,1] and a better prediction produces a score closer to 1.0. Since the network is trained towards this metric, DSC alone is not sufficient to evaluate the performance.

3.3.2 Voxel-wise Recall Rate

We also employ voxel-wise recall rate to evaluate the recall capability of different methods.
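
For reference, the two overlap metrics can be computed on binary masks as sketched below, using the standard definitions DSC = 2|P∩G|/(|P|+|G|) and recall = |P∩G|/|G|. Returning 1.0 when both masks are empty is a convention chosen for this sketch.

```python
import numpy as np


def dsc(pred, gt):
    """Dice similarity coefficient between two binary masks."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    total = int(pred.sum()) + int(gt.sum())
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumed convention)
    return 2.0 * int((pred & gt).sum()) / total


def recall(pred, gt):
    """Fraction of ground truth voxels that are recovered by the prediction."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    n_gt = int(gt.sum())
    if n_gt == 0:
        return 1.0  # nothing to recall (assumed convention)
    return int((pred & gt).sum()) / n_gt
```

An over-segmented prediction illustrates why both are reported: with `pred = [1, 1, 1, 0]` and `gt = [1, 1, 0, 0]`, recall is 1.0 while DSC drops to 0.8 because of the false positive.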


3.3.3 Average Symmetric Surface Distance (ASD)

We define the shortest distance of an arbitrary voxel of one volume’s surface to another volume’s surface as:


where the minimum is taken over the voxels of the extracted surface of the other volume, and the distance between two surface voxels is the Euclidean distance. The evaluation value is then defined as:


Specifically, this metric is sensitive to failures such as debris outliers predicted far away from the colon region, or the complete failure to recall an object. The long distance makes up for the small size of the debris and produces a large error penalty. If a failed segmentation has a recall rate of 0, its surface distance is set to 50 mm, which is large enough to act as a strong penalty.
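
A brute-force sketch of this metric follows, assuming the usual symmetric average (all surface-to-surface distances in both directions, divided by the total number of surface voxels) and surfaces extracted by a 6-neighbourhood erosion. Only the 50 mm penalty for zero recall is taken from the text; the rest is an assumed formulation suitable for small volumes.

```python
import numpy as np


def surface(mask):
    """Surface voxels = mask minus its 6-neighbourhood erosion (assumed extraction)."""
    mask = np.asarray(mask, dtype=bool)
    padded = np.pad(mask, 1, mode="constant", constant_values=False)
    eroded = mask.copy()
    for axis in range(3):
        for shift in (-1, 1):
            eroded &= np.roll(padded, shift, axis=axis)[1:-1, 1:-1, 1:-1]
    return mask & ~eroded


def asd(pred, gt, spacing=(1.0, 1.0, 1.0), miss_penalty=50.0):
    """Average symmetric surface distance in mm; miss_penalty if recall is 0."""
    pred = np.asarray(pred, dtype=bool)
    gt = np.asarray(gt, dtype=bool)
    if not np.any(pred & gt):  # zero recall: fixed penalty
        return miss_penalty
    sp = np.argwhere(surface(pred)) * np.asarray(spacing, dtype=np.float64)
    sg = np.argwhere(surface(gt)) * np.asarray(spacing, dtype=np.float64)
    # Pairwise Euclidean distances between the two surfaces (O(N*M) memory).
    d = np.linalg.norm(sp[:, None, :] - sg[None, :, :], axis=-1)
    return float((d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(sp) + len(sg)))
```

The `spacing` argument converts voxel indices to millimetres along Z, Y and X, which matters for anisotropic volumes such as the ones used here.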

3.3.4 Average Inference Time

We include the average inference time to evaluate speed in the inference stage. Since this metric is determined by the size of the input volume, the standard deviation is not evaluated. All tested methods are run on a workstation with 2x Xeon E5 CPU (8C16T) @ 2.4 GHz, 128 GB RAM and an NVIDIA Titan Xp GPU with 12 GB of GPU memory. The code is implemented in Keras with a TensorFlow backend.

3.3.5 Typical GPU Memory Footprint

With this metric, we describe the GPU memory efficiency of the proposed method by tracking the total GPU memory footprint given an input volume of typical size.

3.4 Results

The statistical results are listed in Table 2. A comparison of the masks predicted by different methods is illustrated in Fig. 5(a); eight volume predictions are illustrated in Fig. 5(b).

Firstly, we compare our proposed method to a series of part based models, namely HL-FCNs [15], V-Net [22], their multi-resolution ensemble counterparts and 3D U-Net [5], aiming to show the effectiveness of the proposed training strategies and the speed advantage of the one-step RoI localization-segmentation pipeline. These part based networks acquire patches with a stride that gives 50% window overlap. They have the same depth and kernel configuration as the proposed method and are trained on the datasets of stepped resolutions with different loss functions, namely the Dice-based Hybrid Loss, the Dice Loss and the Cross-entropy Loss, respectively. As a result, the proposed method outperforms all part based methods, especially in ASD and speed.

Next, we compare the proposed method to a discrete RoI localization-segmentation based method, aiming to emphasize the performance and speed benefit of the proposed method. In detail, the cascaded model referred to as 3D Cascaded Models1 in Table 2 is a module-detached version of 3D RU-Net1: it consists of an independent RoI localization module identical to 3D RU-Net's Whole Volume Contractive Path and a full 3D U-Net fed with patches to perform intra-RoI segmentation. Compared to 3D Cascaded Models1, the segmentation branch of the proposed method receives low-level features that are trained on global images and pre-extracted by the Whole Volume Contractive Path. This helps the network reject false positives better and achieve better speed.

Figure 5: Illustration of typical examples for the segmentation results: (a) RoI segmentation results of the compared methods (b) whole volume segmentation results with Dice evaluation indicated.
(a) (I) Cancerous region, (II) Expert delineation, (III) Proposed method (predicted regions), (IV) Proposed method (predicted contours), (V) V-Net [22] (Ensemble), (VI) 3D U-Net [5], (VII) 3D Cascaded Models, (VIII) 3D Mask R-CNN [11]
(b) (I) Selected 2D slices, (II) 3D segmentation masks. Green indicates true positives; red indicates false positives; blue indicates false negatives.
Method                            DSC            Recall         ASD [mm]       Time [s]
3D RU-Net (Multi-Res Ensemble)    0.735±0.147    0.748±0.185    2.63±3.16      0.57
3D RU-Net3 (LowRes)               0.707±0.146    0.740±0.184    3.63±4.72      0.12(0.08+0.041)
3D RU-Net2 (MidRes)               0.712±0.144    0.735±0.183    4.60±7.14      0.16(0.12+0.041)
3D RU-Net1 (HighRes)              0.685±0.171    0.713±0.221    5.16±6.33      0.29(0.25+0.041)
HL-FCN [15] (Multi-Res Ensemble)  0.721±0.139    0.722±0.172    3.83±4.95      18.11
HL-FCN3 [15] (LowRes)             0.699±0.125    0.702±0.149    3.90±4.43      2.25
HL-FCN2 [15] (MidRes)             0.700±0.145    0.721±0.173    5.48±7.06      5.60
HL-FCN1 [15] (HighRes)            0.677±0.184    0.692±0.213    10.24±14.59    10.26
V-Net [22] (Multi-Res Ensemble)   0.699±0.137    0.724±0.18     4.18±5.89      18.11
V-Net3 [22] (LowRes)              0.685±0.138    0.685±0.195    4.19±5.75      2.25
V-Net2 [22] (MidRes)              0.673±0.153    0.702±0.177    5.70±7.31      5.60
V-Net1 [22] (HighRes)             0.660±0.182    0.709±0.22     10.32±12.11    10.26
3D U-Net1 [5] (HighRes)           0.617±0.192    0.573±0.239    4.26±4.35      10.26
3D Cascaded Models1 (HighRes)     0.667±0.193    0.725±0.250    5.11±8.87      0.35(0.25+0.091)
3D Mask R-CNN1 [11] (HighRes)     0.564±0.190    0.585±0.256    7.93±10.33     0.55
Table 2: Comparison of colorectal cancer segmentation results using different methods.
Part Name         Layer Name      Size   GPU Memory Footprint   Part GPU Memory Footprint
Contractive Path  ResBlock1              3796.88 MBytes         5827.17 MBytes
                  MaxPooling1            105.47 MBytes
                  ResBlock2              1898.45 MBytes
                  MaxPooling2            26.37 MBytes
Bottleneck        ResBlock3              474.60 MBytes          474.87 MBytes
                  RPN (sigmoid)          0.27 MBytes
RoI               RoICropping1           40.50 MBytes           65.82 MBytes
                  RoICropping2           20.25 MBytes
                  RoICropping3           5.06 MBytes
Expansive Path    UpConv1                20.25 MBytes           669.93 MBytes
                  Add1                   20.25 MBytes
                  ResBlock4              182.25 MBytes
                  UpConv2                40.50 MBytes
                  Add2                   40.50 MBytes
                  ResBlock5              364.50 MBytes
                  Seg1 (sigmoid)         0.84 MBytes
                  Seg2 (sigmoid)         0.84 MBytes
Table 3: GPU memory footprint tracking for a typical input volume.

Additionally, we further validate our method's efficacy on 3D medical images by comparing it to a 3D variation of Mask R-CNN [11]. Because a direct 3D extension of ResNet-FPN cannot take whole volume images as input within the available GPU memory, we instead employ 3D RU-Net's Whole Volume Contractive Path as the backbone feature extractor. The experiment shows that the 3D Mask R-CNN suffers severely from overfitting in bounding box learning; the performance of its segmentation branch is therefore affected both by the detail loss caused by the absence of an FPN-based backbone network and by failures of bounding box detection and regression.

Finally, it is important to point out that the speed and performance gains are enabled by the memory efficiency of the proposed method, which removes the sliding-and-stitching workflow that a conventional 3D U-Net requires and enables one-step whole volume inference. We track the memory footprint in an environment where in-place computing is deactivated, so that a ResBlock has nine tensor nodes. The input is a typical T2 volume of a 3D pelvic image whose size has been reduced by body cropping. Given this volume as input, the GPU memory footprint details are listed in Table 3. By constructing the intra-RoI expansive path, a typical GPU can assign about 90% of its memory to the contractive path to detect RoIs and spend only about 10% on intra-RoI segmentation, while conventional encoder-decoder networks spend roughly 50% of GPU memory on each path. Therefore, although model ensembling is often considered computationally expensive, the proposed method obtains its performance gain at an acceptable cost.
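
The 90%/10% split can be checked directly from the part totals in Table 3. The short calculation below groups the Bottleneck with the Contractive Path as the whole-volume side and the RoI cropping with the Expansive Path as the intra-RoI side; this grouping is an assumption made for the check.

```python
def memory_shares(parts):
    """Return (whole-volume share, intra-RoI share) of the total footprint."""
    whole_volume = parts["Contractive Path"] + parts["Bottleneck"]
    intra_roi = parts["RoI"] + parts["Expansive Path"]
    total = whole_volume + intra_roi
    return whole_volume / total, intra_roi / total


# Part totals in MBytes, copied from Table 3.
table3_parts = {
    "Contractive Path": 5827.17,
    "Bottleneck": 474.87,
    "RoI": 65.82,
    "Expansive Path": 669.93,
}

whole_share, roi_share = memory_shares(table3_parts)
print(f"whole volume: {whole_share:.1%}, intra-RoI: {roi_share:.1%}")
```

With these figures, the whole-volume side accounts for about 89.5% of the 7037.79 MBytes total and the intra-RoI side for about 10.5%, consistent with the approximate 90%/10% split quoted above.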

4 Discussion

In this paper, we aim to segment colorectal cancerous tissues accurately and quickly. We combine the whole volume RoI localization model and the intra-RoI segmentation model into a unified, weight-sharing, feature-reusing and jointly trained model: 3D RoI-aware U-Net (3D RU-Net).

We notice a recent trend in which researchers first detect and then segment medical objects. However, they usually use independent models to achieve this goal, forgoing the potential benefit of feature sharing and joint training. Many of them do not even use full 3D context in the detection stage, which passes more candidates to the subsequent branches for further discrimination and requires more processing time. To refine this workflow, we share features across the two tasks: the pre-extracted low-level features give the intra-RoI segmentation branch a better understanding of the whole image, which helps it reject false positives, and save over 50% of the time for each RoI segmentation.

Compared to the successful and general Mask R-CNN for natural object instance segmentation, the advantage of the proposed framework mainly lies in its looser assumption about objectness, its lower data requirement and its avoidance of the unnecessary shape distortion and scale normalization introduced by RoIAlign's bin fitting. Firstly, although object detection frameworks are used effectively in medical cases, they perform best where targets satisfy a strong objectness assumption, for instance in lung nodule detection [8]. In cases where lesions have statistically inconsistent shapes and training samples are few, learning bounding box detection and regression is more difficult and prone to overfitting. In such cases, degrading voxel-wise labels to object-wise labels can be sub-optimal and unnecessary. Secondly, the background context is sensitive to scale and shape ratio. It serves as an important clue for false positive exclusion in medical cases, so warping it into dimension-fixed cubes using the RoIAlign module can degrade performance. Finally, an FPN/U-Net-like multilevel feature fusion mechanism must be implemented in a GPU-memory-efficient way, or it can be inapplicable. This insight is general enough to be adopted by other unified detection and segmentation frameworks.

Although our method achieved competitive results, it has some limitations. Firstly, according to Fig. 5(b), the model is often uncertain about the slice on which a tumor starts or ends, which significantly affects the score. In fact, the choice of starting and ending slice indices can be observer-dependent due to the weak contrast at the border of cancerous tissues and the low resolution along the Z axis. Secondly, there are cases where the model fails to detect objects whose appearance is not seen in other samples, so the segmentation branch also responds with incorrect masks. In these cases the standard deviation of ASD increases significantly because missing objects are strongly penalized; see 3D RU-Net2 and 3D Cascaded Models in Table 2. Here the model ensemble strategy serves as a strong rectifier. To address this issue, better normalization methods need to be explored, and more training samples would be beneficial as well.
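The sensitivity to the starting and ending slice can be illustrated with a toy calculation. The sketch below uses the standard Dice coefficient on a synthetic block-shaped object; the object size and slice counts are made-up values for illustration and are not taken from our data.

```python
def dice(a, b):
    """Dice coefficient between two sets of voxel coordinates."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def block(z_start, z_stop, size_xy=20):
    """Synthetic object: a size_xy x size_xy square on slices [z_start, z_stop)."""
    return {(z, y, x)
            for z in range(z_start, z_stop)
            for y in range(size_xy)
            for x in range(size_xy)}


truth = block(10, 15)       # hypothetical tumor spanning 5 slices
late_start = block(11, 15)  # in-plane mask perfect, but starts one slice late
both_ends = block(11, 14)   # one slice missed at each end

print(round(dice(truth, late_start), 3))  # 0.889
print(round(dice(truth, both_ends), 3))   # 0.75
```

With only a few slices per object, a one-slice disagreement at one end already costs more than 0.1 Dice in this toy setting, even when the in-plane delineation is perfect.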

5 Conclusion

In this paper, we propose a unified RoI localization and segmentation framework for fully automatic, one-step, whole-volume colorectal cancer segmentation, referred to as 3D RoI-aware U-Net (3D RU-Net). We emphasize the importance and effectiveness of shared feature extraction across the localization and segmentation branches, the Dice-based hybrid loss function, and the multi-resolution model ensemble. Experimental results demonstrated clear superiority in both accuracy and speed over part-based methods, methods based on separate RoI localization and segmentation, and a direct 3D extension of Mask R-CNN. In principle, the proposed framework is scalable enough to be adapted to other medical image segmentation tasks.


  • [1] Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 328–335, 2014.
  • [2] H. Chen, Q. Dou, L. Yu, J. Qin, and P. A. Heng. Voxresnet: Deep voxelwise residual networks for brain segmentation from 3d mr images. Neuroimage, 2017.
  • [3] H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P. A. Heng. Dcan: Deep contour-aware networks for object instance segmentation from histology images. Medical Image Analysis, 36:135–146, 2017.
  • [4] Hao Chen, Qi Dou, Xi Wang, Jing Qin, Jack C. Y. Cheng, and Pheng Ann Heng. 3d fully convolutional networks for intervertebral disc localization and segmentation. In International Conference on Medical Imaging and Virtual Reality, pages 375–382, 2016.
  • [5] Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In MICCAI, pages 424–432. Springer, 2016.
  • [6] Jifeng Dai, Kaiming He, and Jian Sun. Convolutional feature masking for joint object and stuff segmentation. In Computer Vision and Pattern Recognition, pages 3992–4000, 2015.
  • [7] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
  • [8] Jia Ding, Aoxue Li, Zhiqiang Hu, and Liwei Wang. Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 559–567. Springer, 2017.
  • [9] Qi Dou, Lequan Yu, Hao Chen, Yueming Jin, Xin Yang, Jing Qin, and Pheng-Ann Heng. 3d deeply supervised network for automated segmentation of volumetric medical images. Medical image analysis, 41:40–54, 2017.
  • [10] Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition, pages 447–456, 2015.
  • [11] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
  • [12] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
  • [14] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
  • [15] Yi-Jie Huang, Qi Dou, Zi-Xian Wang, Li-Zhi Liu, Li-Sheng Wang, Hao Chen, Pheng-Ann Heng, and Rui-Hua Xu. Hl-fcn: Hybrid loss guided fcn for colorectal cancer segmentation. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 195–198. IEEE, 2018.
  • [16] Benjamin Irving, Amalia Cifor, Bartłomiej W Papież, Jamie Franklin, Ewan M Anderson, Michael Brady, and Julia A Schnabel. Automated colorectal tumour segmentation in dce-mri using supervoxel neighbourhood contrast characteristics. In MICCAI, pages 609–616. Springer, 2014.
  • [17] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. Computer Science, 2014.
  • [18] S Klein, Ua Van-Der-Heide, Im Lips, M Van-Vulpen, M Staring, and Jp Pluim. Automatic segmentation of the prostate in 3d mr images by atlas matching using localized mutual information. Medical Physics, 35(4):1407–1417, 2008.
  • [19] Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng Ann Heng. H-denseunet: Hybrid densely connected unet for liver and liver tumor segmentation from ct volumes. arXiv preprint arXiv:1709.07330, 2017.
  • [20] Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
  • [21] Dwarikanath Mahapatra, Peter J Schuffler, Jeroen AW Tielbeek, Jesica C Makanyanga, Jaap Stoker, Stuart A Taylor, Franciscus M Vos, and Joachim M Buhmann. Automatic detection and segmentation of crohn’s disease tissues from abdominal mri. IEEE Trans. on Med. Imaging, 32(12):2332–2347, 2013.
  • [22] Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 565–571. IEEE, 2016.
  • [23] N Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
  • [24] Pedro H. O. Pinheiro, Ronan Collobert, and Piotr Dollar. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.
  • [25] Torsten Rohlfing, Daniel B Russakoff, and Jr Maurer, Calvin R. Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation. IEEE Transactions on Medical Imaging, 23(8):983, 2004.
  • [26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
  • [27] Holger R. Roth, Le Lu, Amal Farag, Hoo Chang Shin, Jiamin Liu, Evrim B. Turkbey, and Ronald M. Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 556–564, 2015.
  • [28] Holger R. Roth, Le Lu, Ari Seff, Kevin M. Cherry, Joanne Hoffman, Shijun Wang, Jiamin Liu, Evrim Turkbey, and Ronald M. Summers. A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations. Med Image Comput Comput Assist Interv, 17(1):520–527, 2014.
  • [29] Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer statistics, 2017. CA: A Cancer Journal for Clinicians, 67(1):7–30, 2017.
  • [30] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  • [31] Min Tang, Zichen Zhang, Dana Cobzas, Martin Jagersand, and Jacob L Jaremko. Segmentation-by-detection: A cascade network for volumetric medical image segmentation. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1356–1359. IEEE, 2018.
  • [32] Jiazhou Wang, Jiayu Lu, Gan Qin, Lijun Shen, Yiqun Sun, Hongmei Ying, Zhen Zhang, and Weigang Hu. A deep learning based auto segmentation of rectal tumors in mr images. Medical physics, 2018.
  • [33] Botian Xu, Yaqiong Chai, Cristina M Galarza, Chau Q Vu, Benita Tamrazi, Bilwaj Gaonkar, Luke Macyszyn, Thomas D Coates, Natasha Lepore, and John C Wood. Orchestral fully convolutional networks for small lesion segmentation in brain mri. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 889–892. IEEE, 2018.
  • [34] Lequan Yu, Xin Yang, Hao Chen, Jing Qin, and Pheng-Ann Heng. Volumetric convnets with mixed residual connections for automated prostate segmentation from 3d mr images. In AAAI, pages 66–72, 2017.