3D RoI-aware U-Net for Accurate and Efficient Colorectal Tumor Segmentation
Objective: Segmentation of colorectal cancerous regions from Magnetic Resonance (MR) images is a crucial procedure for radiotherapy, which requires accurate delineation of tumor boundaries. This work aims to address this important yet challenging task in an accurate and efficient manner. Methods: We propose a novel multi-task framework, referred to as 3D RoI-aware U-Net (3D RU-Net), for RoI localization and intra-RoI segmentation, where the two tasks share one backbone network. With the region proposals from the localization branch, we crop multi-level feature maps from the backbone network to form a U-Net-like intra-RoI segmentation branch. To train the model effectively, we propose a novel Dice-based hybrid loss to tackle the class-imbalance issue under the multi-task setting. Furthermore, we design a multi-resolution model ensemble strategy to improve the discrimination capability of the framework. Results: Our method has been validated on 64 cancerous cases with four-fold cross-validation, outperforming state-of-the-art methods by a significant margin in terms of both accuracy and speed. Conclusion: Experimental results demonstrate that the proposed method enables accurate and fast whole-volume RoI localization and intra-RoI segmentation. Significance: This paper proposes a general 3D segmentation framework which rapidly locates the RoI region in large volumetric images and accurately segments the in-region targets. The method has great potential to be extended to other small 3D object segmentation tasks in medical images.
Colorectal cancer is the second leading cause of cancer-related mortality in the United States. In the current clinical routine of radiotherapy, colorectal cancer regions are manually delineated from volumetric images acquired by magnetic resonance (MR) imaging, which is considered the optimal imaging modality as it provides rich soft-tissue context. However, this procedure is laborious, subjective and time-consuming, and thus suffers from limited reproducibility. Therefore, automatic colorectal cancer segmentation methods are in high demand in clinical radiotherapy practice. This task is, however, very challenging due to the low foreground-to-background ratio, the low contrast and inconsistent appearance of cancerous regions, and hard mimics from the complex peritumoral areas. As shown in Fig. 1, it is challenging to distinguish abnormality from normal peritumoral tissue.
Automatic segmentation of volumetric MR images has been widely studied in the literature. Some initial works are based on super-voxel clustering [21, 16]. While intensity- and connectivity-based methods serve as good baselines, deep learning based methods have since raised the state of the art to a higher level. Initially, a series of FCN backbone models [4, 26, 5, 22, 34, 2, 9] were proposed to effectively segment medical images. The success of state-of-the-art FCNs shows that it is essential to fuse features from different levels (i.e., via skip connections) to recover fine-grained details that are lost in the downsampling process.
Next, in practice, we define whole-volume segmentation as the problem of automatically processing entire volumetric images rather than manually selected RoI patches. Being fully automatic simplifies the workflow, excludes human subjectivity and enables fast processing of large data sets. Due to various limitations of the aforementioned backbone networks, related works tackle this task with three categories of methods: part-based fully convolutional networks (FCNs), discrete RoI localization-and-segmentation models, and unified RoI localization-segmentation models.
As a naive practice, part-based FCNs perform iterative inference: the images are divided into 2D slices [26, 32], 2.5D slices [27, 28] or small 3D patches to be processed. Except for segmenting non-object targets such as vessels, such designs mainly result from the fact that 3D FCNs produce large feature tensors that occupy excessive GPU memory, so whole-volume end-to-end segmentation is hardly applicable. The drawbacks of this practice are apparent: slow, unspecific and incomplete context utilization. As a consequence of searching for the target patch by patch, these methods expend the majority of computing resources outside the regions of interest (RoIs) and produce many false positives. Additionally, 2D or 2.5D FCNs lack necessary 3D context, while 3D patch-based FCNs cannot see the complete global context; such incomplete context utilization degrades performance.
In more recent trends, discrete RoI localization-and-segmentation models are widely applied to eliminate redundant computation and reduce false positives. By definition, these methods combine separate RoI localization and segmentation modules to perform intra-RoI segmentation. Traditionally, RoIs are localized using prior knowledge such as multi-atlas registration [25, 18], whose performance is limited by inconsistent intensity distributions, different fields of view (FoVs) and relatively slow registration speed. Learning-based RoI localization decouples RoI localization from prior knowledge [10, 6, 24, 19]. Some related practices extract region proposals using external modules such as Multiscale Combinatorial Grouping (MCG), then classify and segment them; later works adopt light CNN models such as 2D CNNs for RoI localization [19, 31] and then use 3D FCNs for segmentation. Compared to part-based methods, these works tackle the task more elegantly and efficiently. However, speed issues remain, since they either rely on time-consuming external region proposal modules or repeatedly extract low-level features that could have been shared across the two stages.
As a promising development, unified RoI localization-segmentation models such as Multi-task Network Cascades (MNC) and Mask R-CNN further eliminate redundant feature extraction and achieve better speed and accuracy by sharing a backbone network across the region proposal network (RPN) for RoI detection, the RoIAlign module for RoI extraction, the intra-RoI CNN for classification and the intra-RoI FCN for segmentation. The state-of-the-art Mask R-CNN employs the Feature Pyramid Network (FPN) backbone feature extractor, which, similar to the U-Net, extracts region proposals by fusing different levels of feature maps to better preserve detail. We seek to design a unified one-step whole-volume segmentation framework like Mask R-CNN, but directly extending it to a volumetric formulation for 3D medical tasks raises several issues. Firstly, as in the case of 3D U-Net, extending the FPN's symmetric encoder-decoder construction to 3D takes excessive GPU memory compared to pure encoders and limits the applicable volume size. Additionally, the RPN fits and regresses pre-set bounding boxes of different scales and XY ratios, named anchors, to ground-truth bounding boxes; a direct 3D extension would require anchors of many more ratios, namely XYZ ratios, to fit far fewer 3D medical objects, which can result in severe overfitting. Finally, the RoIAlign module performs bilinear interpolation to resample feature tensors within bounding boxes into bins of fixed dimensions (e.g., 14×14). Such a design introduces shape distortion, scale normalization and detail loss, which may degrade performance.
Apart from the way whole-volume predictions are generated, recent works propose strategies to further boost the performance of volumetric image segmentation. Firstly, V-Net adopts a parameter-free Dice coefficient loss to harness the class-imbalance issue. Secondly, Deep Contour-aware Networks (DCAN) employ a contour-aware loss function for better discrimination between boundaries and the background. In addition, Orchestral Fully Convolutional Networks (OFCNs) and Hybrid Loss guided Fully Convolutional Networks (HL-FCNs) adopt model ensembles for better robustness.
An initial work to automatically segment colorectal cancer regions was published at ISBI. Going one step further, and following the insights listed above, in this paper we propose a novel unified RoI localization-segmentation framework, named 3D RoI-aware U-Net (3D RU-Net), to segment cancerous tissues from whole-volume MR images in a one-step manner. To further boost performance, we design a hybrid loss function that helps the network both handle small objects in large volumes and focus on accurately recognizing ambiguous borders in local RoIs, and we additionally adopt a multi-resolution ensemble strategy for better robustness. Experiments conducted on 64 acquired scans demonstrate the efficacy of our method, and ablation studies validate the contribution of each component of our framework.
Our main contributions are summarized as follows:
We extend the unified RoI localization-segmentation framework to a 3D formulation and adopt a multilevel feature fusion mechanism in the segmentation branch. This configuration enables faster, memory-efficient, detail-preserving one-step whole-volume segmentation compared to part-based and discrete RoI localization-segmentation counterparts.
For automatic class rebalancing and better boundary discrimination, we propose a Dice-formulated contour-aware multi-task learning strategy to further improve accuracy. Additionally, the accelerated framework allows us to employ a multi-resolution model ensemble strategy to suppress false positives and refine boundary details at an acceptable speed cost.
Extensive experiments on the acquired dataset prove the efficacy of our proposed framework. Furthermore, our method is inherently general and can be applied to other similar applications.
The remainder of this paper is organized as follows. We describe our method in Section II and report the experimental results in Section III. Section IV further discusses some insights as well as issues of the proposed method. The conclusions are drawn in Section V.
The proposed 3D RU-Net framework is illustrated in Fig. 2. We input whole image volumes and crop RoIs from multi-scale feature maps to formulate an intra-RoI U-Net expansive path for high-resolution cancerous tissue segmentation.
2.1 Construction of 3D RU-Net
In this section, we form a unified RoI localization-segmentation framework with reused features and shared weights across tasks, which borrows the spirit of Mask R-CNN with a ResNet-FPN backbone feature extractor but is modified to solve our underlying 3D medical image segmentation problem.
To address the issues discussed in section 1, we propose a framework to effectively localize and segment colorectal cancer in the following aspects:
2.1.1 Backbone Network
Due to the limited GPU memory of commonly used devices and the dramatically increased number of parameters of 3D convolution kernels, it is essential to carefully design the 3D backbone feature extractor to avoid GPU memory overflow and overfitting. Instead of constructing a 3D version of an encoder-decoder architecture like 3D U-Net or 3D FPN, or directly extending popular backbones [30, 13, 14] to 3D, we adopt a ResBlock-based variant of the 3D U-Net contractive path, called the Whole Volume Contractive Path, to process whole image volumes without dividing them into multiple parts. Since it has fewer parameters and produces smaller feature tensors while providing the needed discrimination capability, it is better suited to our underlying task.
2.1.2 RoI Localization
Since the cancerous regions have inconsistent 3D scales, shapes and XYZ ratios, and since we have only 64 samples to train, validate and test on, we avoid fitting anchors defined by a large number of combinations of scales and XYZ ratios to ground-truth object bounding boxes, which would also degrade voxel-wise labels to object-wise labels. Instead, we perform low-resolution whole-volume segmentation based on the terminal feature map of the backbone network, trained with a Dice loss to tackle the extremely imbalanced foreground-to-background ratio, which will be introduced in subsection 2.2. Then we perform connectivity analysis to compute the desired bounding boxes. To make up for potential bounding-box undersize due to the coarseness of this step, the computed bounding boxes are extended to a multiple of their original size or to an over-designed cube of fixed size ( ) voxels along the Z, Y and X axes.
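The localization step (threshold the low-resolution probability map, run connectivity analysis, then enlarge each bounding box) can be sketched as follows. The threshold, margin factor and fixed window size are illustrative assumptions, since the paper's exact values are elided in the text:

```python
import numpy as np
from scipy import ndimage

def localize_rois(prob_map, threshold=0.5, margin=1.5, max_size=(6, 8, 8)):
    """Derive RoI bounding boxes from a low-resolution probability map.

    Each connected foreground component yields one box, which is then
    enlarged to `margin` times its extent or to the fixed window `max_size`
    per axis, whichever is larger, and clipped to the volume bounds.
    """
    mask = prob_map > threshold
    labeled, _ = ndimage.label(mask)                 # connectivity analysis
    boxes = []
    for region in ndimage.find_objects(labeled):     # one slice tuple per component
        box = []
        for sl, cap, dim in zip(region, max_size, mask.shape):
            center = (sl.start + sl.stop) / 2.0
            half = max((sl.stop - sl.start) * margin, cap) / 2.0
            lo = int(max(0, center - half))
            hi = int(min(dim, center + half))
            box.append((lo, hi))
        boxes.append(box)
    return boxes
```

The returned boxes live in the coordinates of the terminal (coarsest) feature map; they are scaled up before cropping the finer feature maps.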
2.1.3 RoI Cropping Layer
Rather than resampling targets within bounding boxes of different shapes into pre-defined size-fixed bins (e.g., 14×14), the RoI Cropping Layer extracts feature tensors within bounding boxes by directly cropping them without resampling. This design keeps the bounding boxes' shape ratios and scales unchanged, avoiding the detail loss and semantic shift that can degrade performance. Additionally, the RoI Cropping Layer not only crops the feature tensors within bounding boxes from the terminal layer but also propagates the computed bounding boxes to the preceding feature scales. We denote the feature tensors generated by the ResBlocks at the three scales of the contractive path as f1, f2 and f3, as illustrated in Fig. 2. Assume that the center coordinates and size of an RoI predicted in f3 are c and s, respectively; then, with a pooling rate of 2 between scales, the cropped RoI windows are: a window centered on 4c of size 4s for f1, a window centered on 2c of size 2s for f2, and a window centered on c of size s for f3.
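A minimal sketch of this scale-propagated cropping, assuming isotropic pooling by a factor of 2 between scales (the actual network also pools anisotropically along Z) and hypothetical feature maps ordered fine-to-coarse:

```python
import numpy as np

def crop_multiscale(features, center, size):
    """Crop the same RoI from feature maps of successive scales.

    `features` is a list [f1, f2, f3] ordered fine-to-coarse, where each
    level halves the spatial resolution of the previous one (a simplifying
    assumption). `center` and `size` are given in the coordinates of the
    coarsest map. No resampling is performed, so shape ratios and scales
    stay unchanged.
    """
    crops = []
    n = len(features)
    for level, f in enumerate(features):
        scale = 2 ** (n - 1 - level)   # f1 has 4x the resolution of f3, etc.
        c = [ci * scale for ci in center]
        s = [si * scale for si in size]
        sl = tuple(slice(max(0, int(ci - si / 2)),
                         min(dim, int(ci + si / 2)))
                   for ci, si, dim in zip(c, s, f.shape))
        crops.append(f[sl])
    return crops
```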
2.1.4 Intra-RoI Segmentation
Given the feature tensors cropped from multiple preceding scales, we construct the segmentation branch, named the U-Net-like Intra-RoI Expansive Path, by applying the successful multilevel feature fusion mechanism. The construction of the Intra-RoI Expansive Path is roughly symmetric to the Whole Volume Contractive Path, with the beneficial difference that the expansive path's feature tensors are much smaller. Besides, since no shape distortion or scale normalization is involved, this module directly and losslessly restores the original size of the RoI region. If multiple RoIs are localized, the same set of weights is used to process each RoI in turn.
2.2 Dice-based Multi-task Hybrid Loss Function
It is observed that cancerous regions, with their low contrast, ambiguous borders and imbalanced distribution, are hard to learn even with the successful multilevel feature fusion mechanism. Thus we propose a Dice-based multi-task hybrid loss function to improve performance.
2.2.1 Dice Loss Formulation
Inspired by the success of V-Net, we apply the Dice loss function to formulate the optimization objective, since it serves as an effective hyper-parameter-free class balancer that helps the network learn objects of small size and weak saliency. The Dice loss is defined as:

$$L_{Dice} = 1 - \frac{2\sum_{i} p_i g_i + \epsilon}{\sum_{i} p_i + \sum_{i} g_i + \epsilon}$$

where the sums are computed over the voxels $p_i$ of the predicted volume $P$ and the voxels $g_i$ of the ground truth volume $G$, and $\epsilon$ is a smoothness term that avoids division by 0. In the optimization stage, the Dice loss is minimized by gradient descent using the following derivative:

$$\frac{\partial L_{Dice}}{\partial p_j} = -\frac{2 g_j\left(\sum_{i} p_i + \sum_{i} g_i + \epsilon\right) - \left(2\sum_{i} p_i g_i + \epsilon\right)}{\left(\sum_{i} p_i + \sum_{i} g_i + \epsilon\right)^2}$$
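As a minimal NumPy sketch, the Dice loss and its analytic gradient can be written as follows; the smoothness value eps=1.0 is an illustrative assumption:

```python
import numpy as np

def dice_loss(p, g, eps=1.0):
    """Dice loss L = 1 - (2*sum(p*g) + eps) / (sum(p) + sum(g) + eps)."""
    inter = 2.0 * np.sum(p * g) + eps
    union = np.sum(p) + np.sum(g) + eps
    return 1.0 - inter / union

def dice_loss_grad(p, g, eps=1.0):
    """Analytic gradient dL/dp_j, as used by gradient descent."""
    inter = 2.0 * np.sum(p * g) + eps
    union = np.sum(p) + np.sum(g) + eps
    return -(2.0 * g * union - inter) / union ** 2
```

A quick finite-difference check confirms the gradient matches the quotient-rule derivative of the loss.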
2.2.2 Dice Loss for Global Localization
To tackle the class-imbalance issue of the global RoI localization task, we employ the aforementioned Dice loss:

$$L_{Loc} = L_{Dice}\left(P_{Loc}, G_{Down}\right)$$

where $P_{Loc}$ and $G_{Down}$ denote the predictions of the localization top and the down-sampled annotations, respectively.
2.2.3 Dice-based Contour-aware Loss for Local Segmentation
Compared to the localization task, the intra-RoI segmentation branch needs multiple constraints to acquire boundary-sensitive segmentation results. In semantic segmentation practice, ambiguous borders are the most difficult regions to learn, yet they typically receive insufficient attention. Borrowing the insight of previous explorations that add an auxiliary contour-aware side task, we further formulate the side task with a Dice loss to tackle the extreme sparsity of contour labels in 3D space. In practice, we add an extra Softmax branch at the output terminal of the segmentation branch to predict the contour voxels, trained in parallel with the region segmentation task. Taking the side task into account, the loss function of the segmentation branch is the following weighted sum:

$$L_{Seg} = L_{Dice}^{Region} + \lambda L_{Dice}^{Contour}$$

where $\lambda < 1$ denotes the auxiliary task weight, ensuring that the region segmentation task dominates while the side task still takes effect.
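A tiny sketch of how the weighted terms combine, assuming a hypothetical auxiliary weight of 0.5 (the paper's exact value is elided in the text) and leaving the L2 weight-decay term to the optimizer:

```python
def segmentation_loss(dice_region, dice_contour, aux_weight=0.5):
    # Hybrid segmentation loss: region Dice term + lambda * contour Dice
    # term; aux_weight < 1 keeps the region task dominant.
    return dice_region + aux_weight * dice_contour

def total_loss(dice_loc, dice_region, dice_contour, aux_weight=0.5):
    # Overall multi-task objective: localization Dice plus the hybrid
    # segmentation loss (weight decay omitted here; it is typically
    # applied by the optimizer as an L2 penalty on the kernels).
    return dice_loc + segmentation_loss(dice_region, dice_contour, aux_weight)
```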
Finally, the overall loss function is:

$$L = L_{Loc} + L_{Seg} + \beta\lVert W\rVert^{2}$$

where $\beta$ balances the weight-decay term and $W$ denotes the parameters of the whole network.
2.3 Multi-resolution Model Ensemble
Model ensembling is considered an effective practice for further boosting performance and is widely employed in practical cases, at the cost of extra computation.
Encouraged by the dramatically accelerated framework, in this paper we employ a multi-resolution model ensemble, i.e., models of identical structure trained on three datasets of different resolutions, instead of models of different structural designs. In detail, we resample the acquired MR images, whose ZYX spacings range from mm to mm, into three datasets of stepped spacing configurations: 4.0×1.0×1.0 mm for the HighRes set, 4.0×1.5×1.5 mm for the MidRes set, and 4.0×2.0×2.0 mm for the LowRes set, feeding 3D RU-Net1, 3D RU-Net2 and 3D RU-Net3, respectively. In the inference stage, as shown in Fig. 3, the three networks' outputs are averaged to generate the final prediction.
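The averaging step can be sketched as follows. Since the three models predict on grids of different spacings, each probability map is first resampled to a common grid (linear interpolation is an assumption; the paper does not spell out the resampling order):

```python
import numpy as np
from scipy import ndimage

def ensemble_predict(prob_maps, target_shape):
    """Average probability maps from models trained at different
    resolutions, after resampling each one to `target_shape`."""
    resampled = []
    for p in prob_maps:
        zoom = [t / s for t, s in zip(target_shape, p.shape)]
        resampled.append(ndimage.zoom(p, zoom, order=1))  # linear interp.
    return np.mean(resampled, axis=0)
```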
3.1 Dataset and Preprocessing
The dataset contains a total of 64 T2-weighted MR images of the pelvic cavity. Target areas were labeled voxel-wise by experienced radiologists, and contour labels were automatically generated from the region labels using erosion and subtraction operations.
Spacing normalization is conducted according to the criterion described in subsection 2.3. To normalize the intensities of input images acquired under different imaging configurations and fields of view, we perform intra-body intensity normalization to exclude the effect of inconsistent body-to-background ratios. Using OTSU thresholding, connectivity analysis and a closing operation, body masks are extracted as foreground, with all other voxels set as background. The mean intensity and standard deviation are computed within the body mask according to the following formulas:
$$\mu = \frac{1}{N}\sum_{i=1}^{N} v_i, \qquad \sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(v_i - \mu\right)^2}$$

where $v_i$ denotes the intensity of the $i$-th voxel inside the body mask and $N$ denotes the count of mask voxels. Then the image is normalized according to the following criterion:

$$I' = \frac{I - \mu}{\sigma}$$
A few examples of the comparison between original images and intensity-normalized images are illustrated in Fig. 4.
Before feeding the images to the network, we crop the input images according to minimum bounding boxes of the body masks to further reduce the GPU memory footprint. Additionally, in the training stage, we perform on-the-fly data augmentation when feeding training samples. Applied operations include scaling, flipping, intensity jittering, and translation.
3.2 Implementation Details
Our implementation is publicly available at https://github.com/huangyjhust/3D-RU-Net.
The network's detailed connectivity and kernel configuration are illustrated in Table 2.2.3. Specifically, to fit the anisotropic spacing of the acquired dataset, which has larger spacing along the Z axis, flat kernels of , a pooling rate of and an up-sampling rate of are employed by the input and output blocks, ResBlock1, MaxPooling1, UpConv2 and ResBlock5. For direct comparison with related methods, we assign an over-designed fixed window to the RoI Cropping Layer: .
3.2.2 Training Process
The backbone network is initialized using the MSRA criterion and then pre-trained using our previous work's patch-wise HL-FCN. We use the Adam optimizer with a learning rate of . The weights of the convolution kernels are penalized with the L2 norm for better generalization capability. Then, we iteratively train the RoI localization branch and the segmentation branch. In each iteration, we first train the RoI localization branch, named LocTop, and predict the bounding boxes. Next, we train the segmentation branches, named SegTop1 and SegTop2, using the predicted bounding boxes. Four-fold cross-validation is conducted on the 64 scans.
3.3 Evaluation Metrics
3.3.1 Dice Similarity Coefficient (DSC)
The Dice similarity coefficient (DSC) measures the general overlap rate, assigning equal significance to precision and recall. DSC is defined as:

$$DSC(P, G) = \frac{2\left|P \cap G\right|}{\left|P\right| + \left|G\right|}$$

where the metric is scored in [0, 1] and better predictions yield scores closer to 1.0. Since the network is trained towards this metric, DSC alone is not sufficient to evaluate performance.
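For binary masks, the metric reduces to a few lines; returning 1.0 for two empty masks is a common convention, not something the paper specifies:

```python
import numpy as np

def dsc(pred, gt):
    """Dice similarity coefficient 2|P & G| / (|P| + |G|) for binary masks."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    denom = pred.sum() + gt.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, gt).sum() / denom
```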
3.3.2 Voxel-wise Recall Rate
We also employ voxel-wise recall rate to evaluate the recall capability of different methods.
3.3.3 Average Symmetric Surface Distance (ASD)
We define the shortest distance from an arbitrary voxel $v$ on one volume's surface to the surface $S(B)$ of another volume as:

$$d\left(v, S(B)\right) = \min_{u \in S(B)} \lVert v - u \rVert_2$$

where $\lVert \cdot \rVert_2$ denotes the Euclidean distance. Then the evaluation value is defined as:

$$ASD(A, B) = \frac{1}{\left|S(A)\right| + \left|S(B)\right|}\left(\sum_{a \in S(A)} d\left(a, S(B)\right) + \sum_{b \in S(B)} d\left(b, S(A)\right)\right)$$
Specifically, this metric is sensitive to failures such as debris outliers predicted far from the colon region or complete failure to recall an object: the long distance compensates for the small size of the debris and produces a large error penalty. If a segmentation has zero recall, its surface distance is set to 50 mm, which is large enough to serve as a strong penalty.
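A brute-force sketch of this metric, extracting surfaces by morphological erosion; voxel spacing is ignored here for brevity (the paper reports ASD in mm), and the fixed failure penalty mirrors the 50 mm rule above:

```python
import numpy as np
from scipy import ndimage

def surface(mask):
    """Surface voxels: the mask minus its binary erosion."""
    return mask & ~ndimage.binary_erosion(mask)

def asd(pred, gt, failure_penalty=50.0):
    """Average symmetric surface distance in voxel units.

    Returns the fixed penalty when either mask is empty, mirroring the
    zero-recall failure rule described in the text.
    """
    if not pred.any() or not gt.any():
        return failure_penalty
    sp = np.argwhere(surface(pred)).astype(float)
    sg = np.argwhere(surface(gt)).astype(float)
    # Pairwise Euclidean distances between the two surface point sets.
    d = np.linalg.norm(sp[:, None, :] - sg[None, :, :], axis=-1)
    return (d.min(axis=1).sum() + d.min(axis=0).sum()) / (len(sp) + len(sg))
```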
3.3.4 Average Inference Time
We include average inference time to evaluate speed in the inference stage. Since this metric is determined by the size of the input volume, its standard deviation is not reported. The tested methods are all run on a workstation platform with two Xeon E5 CPUs (8C16T) @ 2.4 GHz, 128 GB RAM and an NVIDIA Titan Xp GPU with 12 GB of GPU memory. The code is implemented with Keras on a TensorFlow backend.
3.3.5 Typical GPU Memory Footprint
By analyzing this metric, we describe the GPU memory efficiency of the proposed method by tracking the total GPU memory footprint given an input volume of typical size voxels.
Firstly, we compare our proposed method to a series of part-based models: HL-FCNs, V-Net, their multi-resolution ensemble counterparts, and 3D U-Net, aiming to show the effectiveness of the proposed training strategies and the speed superiority of the one-step RoI localization-segmentation pipeline. These part-based networks acquire patches at a stride of 50% window overlap. They have identical depth and kernel configurations to the proposed method and are trained with different loss functions (namely Dice-based hybrid loss, Dice loss and cross-entropy loss) on the datasets of stepped resolutions. As a result, the proposed method outperforms all part-based methods, especially in ASD and speed.
Next, we compare the proposed method to a discrete RoI localization-and-segmentation method, aiming to emphasize the performance and speed benefits of the proposed method. In detail, the cascaded model referred to as 3D Cascaded Models1 in Table 2 is a modules-detached version of 3D RU-Net1: it consists of an independent RoI localization module identical to 3D RU-Net's Whole Volume Contractive Path and a full 3D U-Net fed with patches to perform intra-RoI segmentation. Compared to 3D Cascaded Models1, the proposed method's segmentation branch receives low-level features trained on global images and pre-extracted by the Whole Volume Contractive Path. This helps the network reject false positives and achieve better speed.
| Method | DSC | Voxel-wise Recall | ASD (mm) | Inference Time (s) |
|---|---|---|---|---|
| 3D RU-Net (Multi-Res Ensemble) | 0.735±0.147 | 0.748±0.185 | 2.63±3.16 | 0.57 |
| 3D Cascaded Models1 (HighRes) | 0.667±0.193 | 0.725±0.250 | 5.11±8.87 | 0.35 (0.25+0.091) |
| 3D Mask R-CNN1 (HighRes) | 0.564±0.190 | 0.585±0.256 | 7.93±10.33 | 0.55 |
| Part Name | Layer Name | Size | GPU Memory Footprint | Part GPU Memory Footprint |
|---|---|---|---|---|
| Contractive Path | ResBlock1 | | 3796.88 MBytes | 5827.17 MBytes |
| RoI | RoICropping1 | | 40.50 MBytes | 65.82 MBytes |
| Expansive Path | UpConv1 | | 20.25 MBytes | 669.93 MBytes |
Additionally, we further validate our method's efficacy on 3D medical images by comparing it to a 3D variant of Mask R-CNN. Because directly extending ResNet-FPN to 3D cannot take whole-volume images as input under the GPU memory limitation, we instead employ 3D RU-Net's Whole Volume Contractive Path as its backbone feature extractor. The experiment shows that 3D Mask R-CNN suffers severely from overfitting in bounding-box learning; the performance of its segmentation branch is therefore affected both by the detail loss caused by the absence of an FPN-based backbone and by failures of bounding-box detection and regression.
Finally, it is important to point out that the speed and performance gains are enabled by the memory efficiency of the proposed method, which eliminates the conventional 3D U-Net's need for a sliding-and-stitching workflow and enables one-step whole-volume inference. Here we track the memory footprint in an environment where in-place computing is deactivated, so that a ResBlock has nine tensor nodes. A typical T2 volume of a 3D pelvic image is of size . After body cropping, the size typically drops to . Given this volume as input, the GPU memory footprint details are listed in Table 3. By constructing the intra-RoI expansive path, a typical GPU can assign 90% of its memory to the contractive path to detect RoIs and spend only 10% on intra-RoI segmentation, whereas conventional encoder-decoder networks spend 50% of GPU memory on each path. Therefore, although model ensembling is often considered computationally expensive, the proposed method obtains the ensemble performance gain at an acceptable cost.
In this paper, we aim to segment colorectal cancerous tissues accurately and quickly. We combine the whole-volume RoI localization model and the intra-RoI segmentation model into a unified, weight-sharing, feature-reusing and jointly trained model: 3D RoI-aware U-Net (3D RU-Net).
We notice a recent trend in which researchers seek to first detect and then segment medical objects. However, they usually use independent models to achieve this goal, disregarding the potential benefits of feature sharing and joint training; many of them do not even use full 3D context in the detection stage, which passes more candidates to posterior branches for further discrimination and requires more processing time. In our refined workflow, the pre-extracted low-level features provide the intra-RoI segmentation branch with a better understanding of the whole image, helping it discriminate background from false positives, and save over 50% of the time for each RoI segmentation.
Compared to the successful and general Mask R-CNN for natural object instance segmentation, the advantages of the proposed framework lie in its loose assumptions about objectness, its lower training-data requirements, and its avoidance of the unnecessary shape distortion and scale normalization introduced by RoIAlign's bin fitting. Firstly, although object detection frameworks are effectively used in medical cases, they perform best where targets have strong objectness, for instance in lung nodule detection. Where lesions have statistically inconsistent shapes and few training samples are available, learning bounding-box detection and regression is more difficult and prone to overfitting; in such cases, degrading voxel-wise labels to object-wise labels can be suboptimal and unnecessary. Secondly, the background contexts are scale- and shape-ratio sensitive. They serve as important clues for false-positive exclusion in medical cases, so warping them into dimension-fixed cubes with the RoIAlign module can degrade performance. Finally, an FPN/U-Net-like multilevel feature fusion mechanism must be implemented in a GPU-memory-efficient way, or it becomes inapplicable. This insight is general enough to be adopted by other unified detection-and-segmentation frameworks.
Although our method achieves competitive results, there are some limitations. Firstly, as shown in Fig. 5(b), the model is often confused about which slice a tumor starts or ends on, which significantly affects the score. In fact, the decision about the starting and ending slice indices can be observer-dependent due to the weak contrast at the border of cancerous tissues and the low resolution along the Z axis. Secondly, there are cases where we fail to detect objects whose appearance is unseen in other samples, so the segmentation branch also responds with incorrect masks. In these cases the standard deviation of ASD increases significantly due to the strong penalty for missed objects; see 3D RU-Net2 and 3D Cascaded Models in Table 2. Here, the model ensemble strategy serves as a strong rectifier. To fix this issue, better normalization methods need to be explored, and more training samples would also be beneficial.
In this paper, we propose a unified RoI localization-segmentation framework for fully automatic, one-step whole-volume colorectal cancer segmentation, referred to as 3D RoI-aware U-Net (3D RU-Net). We emphasize the importance and effectiveness of shared feature extraction across the localization and segmentation branches, the Dice-based hybrid loss function, and the multi-resolution model ensemble. Experimental results demonstrate impressive superiority in terms of accuracy and speed over part-based methods, discrete RoI localization-and-segmentation methods, and a direct 3D extension of Mask R-CNN. In principle, the proposed framework is general enough to be adapted to other medical image segmentation tasks.
-  Pablo Arbeláez, Jordi Pont-Tuset, Jonathan T Barron, Ferran Marques, and Jitendra Malik. Multiscale combinatorial grouping. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 328–335, 2014.
-  H. Chen, Q. Dou, L. Yu, J. Qin, and P. A. Heng. Voxresnet: Deep voxelwise residual networks for brain segmentation from 3d mr images. Neuroimage, 2017.
-  H. Chen, X. Qi, L. Yu, Q. Dou, J. Qin, and P. A. Heng. Dcan: Deep contour-aware networks for object instance segmentation from histology images. Medical Image Analysis, 36:135–146, 2017.
-  Hao Chen, Qi Dou, Xi Wang, Jing Qin, Jack C. Y. Cheng, and Pheng Ann Heng. 3d fully convolutional networks for intervertebral disc localization and segmentation. In International Conference on Medical Imaging and Virtual Reality, pages 375–382, 2016.
-  Özgün Çiçek, Ahmed Abdulkadir, Soeren S Lienkamp, Thomas Brox, and Olaf Ronneberger. 3d u-net: learning dense volumetric segmentation from sparse annotation. In MICCAI, pages 424–432. Springer, 2016.
-  Jifeng Dai, Kaiming He, and Jian Sun. Convolutional feature masking for joint object and stuff segmentation. In Computer Vision and Pattern Recognition, pages 3992–4000, 2015.
-  Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In Computer Vision and Pattern Recognition, pages 3150–3158, 2016.
-  Jia Ding, Aoxue Li, Zhiqiang Hu, and Liwei Wang. Accurate pulmonary nodule detection in computed tomography images using deep convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 559–567. Springer, 2017.
-  Qi Dou, Lequan Yu, Hao Chen, Yueming Jin, Xin Yang, Jing Qin, and Pheng-Ann Heng. 3d deeply supervised network for automated segmentation of volumetric medical images. Medical image analysis, 41:40–54, 2017.
-  Bharath Hariharan, Pablo Arbeláez, Ross Girshick, and Jitendra Malik. Hypercolumns for object segmentation and fine-grained localization. In Computer Vision and Pattern Recognition, pages 447–456, 2015.
-  Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In IEEE International Conference on Computer Vision, pages 2980–2988, 2017.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
-  Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, volume 1, page 3, 2017.
-  Yi-Jie Huang, Qi Dou, Zi-Xian Wang, Li-Zhi Liu, Li-Sheng Wang, Hao Chen, Pheng-Ann Heng, and Rui-Hua Xu. Hl-fcn: Hybrid loss guided fcn for colorectal cancer segmentation. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 195–198. IEEE, 2018.
-  Benjamin Irving, Amalia Cifor, Bartłomiej W Papież, Jamie Franklin, Ewan M Anderson, Michael Brady, and Julia A Schnabel. Automated colorectal tumour segmentation in dce-mri using supervoxel neighbourhood contrast characteristics. In MICCAI, pages 609–616. Springer, 2014.
-  Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
-  S. Klein, U. A. van der Heide, I. M. Lips, M. van Vulpen, M. Staring, and J. P. W. Pluim. Automatic segmentation of the prostate in 3d mr images by atlas matching using localized mutual information. Medical Physics, 35(4):1407–1417, 2008.
-  Xiaomeng Li, Hao Chen, Xiaojuan Qi, Qi Dou, Chi-Wing Fu, and Pheng Ann Heng. H-denseunet: Hybrid densely connected unet for liver and liver tumor segmentation from ct volumes. arXiv preprint arXiv:1709.07330, 2017.
-  Tsung-Yi Lin, Piotr Dollár, Ross Girshick, Kaiming He, Bharath Hariharan, and Serge Belongie. Feature pyramid networks for object detection. In CVPR, volume 1, page 4, 2017.
-  Dwarikanath Mahapatra, Peter J. Schüffler, Jeroen A. W. Tielbeek, Jesica C. Makanyanga, Jaap Stoker, Stuart A. Taylor, Franciscus M. Vos, and Joachim M. Buhmann. Automatic detection and segmentation of crohn’s disease tissues from abdominal mri. IEEE Trans. on Med. Imaging, 32(12):2332–2347, 2013.
-  Fausto Milletari, Nassir Navab, and Seyed-Ahmad Ahmadi. V-net: Fully convolutional neural networks for volumetric medical image segmentation. In 3D Vision (3DV), 2016 Fourth International Conference on, pages 565–571. IEEE, 2016.
-  Nobuyuki Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
-  Pedro H. O. Pinheiro, Ronan Collobert, and Piotr Dollár. Learning to segment object candidates. In Advances in Neural Information Processing Systems, 2015.
-  Torsten Rohlfing, Daniel B. Russakoff, and Calvin R. Maurer Jr. Performance-based classifier combination in atlas-based image segmentation using expectation-maximization parameter estimation. IEEE Transactions on Medical Imaging, 23(8):983, 2004.
-  Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In MICCAI, pages 234–241. Springer, 2015.
-  Holger R. Roth, Le Lu, Amal Farag, Hoo Chang Shin, Jiamin Liu, Evrim B. Turkbey, and Ronald M. Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 556–564, 2015.
-  Holger R. Roth, Le Lu, Ari Seff, Kevin M. Cherry, Joanne Hoffman, Shijun Wang, Jiamin Liu, Evrim Turkbey, and Ronald M. Summers. A new 2.5d representation for lymph node detection using random sets of deep convolutional neural network observations. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 520–527. Springer, 2014.
-  Rebecca L. Siegel, Kimberly D. Miller, and Ahmedin Jemal. Cancer statistics, 2017. CA: A Cancer Journal for Clinicians, 67(1):7–30, 2017.
-  Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
-  Min Tang, Zichen Zhang, Dana Cobzas, Martin Jagersand, and Jacob L Jaremko. Segmentation-by-detection: A cascade network for volumetric medical image segmentation. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 1356–1359. IEEE, 2018.
-  Jiazhou Wang, Jiayu Lu, Gan Qin, Lijun Shen, Yiqun Sun, Hongmei Ying, Zhen Zhang, and Weigang Hu. A deep learning based auto segmentation of rectal tumors in mr images. Medical Physics, 2018.
-  Botian Xu, Yaqiong Chai, Cristina M Galarza, Chau Q Vu, Benita Tamrazi, Bilwaj Gaonkar, Luke Macyszyn, Thomas D Coates, Natasha Lepore, and John C Wood. Orchestral fully convolutional networks for small lesion segmentation in brain mri. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pages 889–892. IEEE, 2018.
-  Lequan Yu, Xin Yang, Hao Chen, Jing Qin, and Pheng-Ann Heng. Volumetric convnets with mixed residual connections for automated prostate segmentation from 3d mr images. In AAAI, pages 66–72, 2017.