Evaluation of Multi-Slice Inputs to Convolutional Neural Networks for Medical Image Segmentation
When using Convolutional Neural Networks (CNNs) for segmentation of organs and lesions in medical images, the conventional approach is to work with inputs and outputs either as single slices (2D) or whole volumes (3D). One common alternative, in this study denoted as pseudo-3D, is to use a stack of adjacent slices as input and produce a prediction for at least the central slice. This approach gives the network the possibility to capture 3D spatial information, at only a minor additional computational cost.
In this study, we systematically evaluate the segmentation performance and computational costs of this pseudo-3D approach as a function of the number of input slices, and compare the results to conventional end-to-end 2D and 3D CNNs. The standard pseudo-3D method regards the neighboring slices as multiple input image channels. We additionally evaluate a simple approach where the input stack is a volumetric input that is repeatedly convolved in 3D to obtain a 2D feature map. This 2D map is in turn fed into a standard 2D network. We conducted experiments using two different CNN backbone architectures and on five diverse data sets covering different anatomical regions, imaging modalities, and segmentation tasks.
We found that while both pseudo-3D methods can process a large number of slices at once and still be computationally much more efficient than fully 3D CNNs, a significant improvement over a regular 2D CNN was only observed for one of the five data sets. An analysis of the structural properties of the segmentation masks revealed no relation between these properties and the segmentation performance with respect to the number of input slices.
The conclusion is therefore that in the general case, multi-slice inputs appear to not significantly improve segmentation results over using 2D or 3D CNNs.
Medical Image Segmentation, Convolutional Neural Network, Multi-Slice, Deep Learning
Segmentation of organs and pathologies is a common activity for radiologists and routine work for radiation oncologists. Nowadays, manual annotation of such regions of interest is aided by various software toolkits for image enhancement, automated contouring, and structure analysis in all fields of image-guided radiotherapy [32, 4, 9]. In recent years, \glsdl has emerged as a very powerful concept in the field of medical image analysis. The ability to train complex neural networks by example to independently perform a vast spectrum of annotation tasks has proven to be a promising way to produce segmentations of organs and lesions with expert-level accuracy [38, 24].
For both organ segmentation and lesion segmentation, the most common \glsdl model is the \glscnn. Whereas the classic approach to segmenting 3D medical volumes with \glsplcnn consists of training on and predicting the individual 2D slices independently, interest has shifted in recent years towards full 3D convolutions in volumetric neural networks [37, 24, 28, 6, 11]. Volumetric convolution kernels have the advantage of taking inter-slice context into account, thus preserving more of the spatial information than is possible when using 2D convolutions within slices. However, volumetric operations require far more computational resources. For medical imaging applications, the lack of sufficient \glsgpu memory to fit entire volumes at once requires, in almost all cases, a patch-based approach, reduced input sizes, and/or small batch sizes, and therefore longer training times.
I-A Related Work
In terms of fully 3D, end-to-end networks, studies often attempt to compensate for the small patch size that can maximally fit into the \glsgpu memory at once by creating more efficient architectures or by utilizing post-processing methods. The original U-Net by Ronneberger et al. , an architecture that was, and still is, a popular and powerful network for semantic medical image segmentation, was first extended to a 3D variant by Çiçek et al. . The 3D U-Net was used by Vu et al. [41, 42] in a cascaded approach where a first, coarse prediction was used to generate a candidate region in which a second, finer-grained prediction was performed; this proved to be an effective way of reducing the amount of input data for the final prediction. V-Net by Milletari et al.  extended the network of  by adding residual connections to the 3D U-Net.
Li et al.  reduced the computational cost of a fully 3D \glscnn by replacing the deconvolution steps in the upsampling phase with dilated convolutions to preserve the spatial resolution of the feature maps. VoxResNet  is a very deep residual network that was trained on small 3D patches. The resulting output probability map was combined with the original multimodal volumes in a second VoxResNet to obtain a more accurate output. A related approach from Yu et al.  extended this architecture by implementing long residual connections between residual blocks, in addition to the short connections within the residual blocks. The same group proposed a densely connected architecture called DenseVoxNet , in which each layer has access to the feature maps of all its preceding layers, decreasing the number of parameters and possibly avoiding the learning of redundant feature maps.
Lu et al.  used a graph cut model to refine the output of their coarse 3D \glscnn. A 3D network composed of two separate convolutional pathways, at low and high resolution, was introduced by Kamnitsas et al. ; the resulting segmentation was, in turn, post-processed by a Conditional Random Field. A variant of this multi-scale feature extraction during convolution was used by Lian et al. , who applied this procedure in the encoding phase of their U-Net-like 3D \glscnn. Ren et al.  exploited the small size of regions of interest in the head and neck area (i.e. the optic nerves and chiasm) to build an interleaved combination of small-input, shallow \glsplcnn trained at different scales and in different regions. Feng et al.  used a two-step procedure: a first 3D U-Net was used to localize thoracic organs in a substantially downsampled volume and to crop to a bounding box around each organ; then, individual 3D U-Nets were trained to segment each organ inside its subvolume at the original resolution. Another example of 3D convolutions applied only to a small region of interest is the work of Anirudh et al. , who randomly sampled subvolumes in lung images for which the centroid pixel intensity was above a certain threshold, to classify each subvolume as containing a lung nodule or not.
While these studies have shown that 3D \glsplcnn can be worth the effort, alternative approaches have been investigated that involve volumetric context to improve segmentation while avoiding 3D convolutions altogether. One of the more common methods, usually called 2.5D, is to combine tri-planar 2D \glsplcnn operating on intersecting orthogonal patches [31, 35, 8, 44, 26, 29, 20, 14]. This can be a computationally efficient way of incorporating more 3D spatial information, and these studies all present promising results. However, this method is limited in the amount of volumetric information it can encompass at once.
We therefore investigate a method that uses a volumetric input but is still largely 2D-based, with a minimal amount of 3D operations. Instead of taking a single 2D slice as input and outputting the 2D segmentation of that slice, one can also incorporate neighboring slices to provide 3D context and thereby enhance segmentation performance. A common approach is to include the neighboring slices of a central slice as multiple input image channels. Novikov et al.  included the preceding and succeeding axial slices for vertebrae and liver segmentation. Such a three-slice input was also used by Kitrungrotsakul et al.  for the detection of mitotic cells in 4D data (spatial + temporal). Theirs was a cascaded approach in which a first detection step with a three-slice input produced results for these three slices; a second step then reduced the number of false positives by including, for each slice, the time frames before and after it. In a deep \glscnn for liver segmentation, Han  used five neighboring slices. Ghavami et al.  compared incorporating three, five, and seven slices for prostate segmentation from ultrasound images. While their method produced promising segmentation results, no significant difference was found between these three input sizes. In a recent paper, Ganaye et al.  employed a seven-slice input producing an output for the three central slices, which the authors refer to as 2.5D. This model was used to evaluate a loss function that penalized anatomically unrealistic transitions between adjacent slices. The authors did not report a significant improvement of the 2.5D model over the baseline 2D model, but the 2.5D model did outperform it in terms of the Hausdorff distance when the non-adjacency loss was employed.
In this paper, we systematically investigate the use of multiple adjacent slices as input to predict the segmentation of the central slice in that subset, evaluated on medical image segmentation tasks. We will henceforth refer to any method based on this principle as pseudo-3D. We compare the segmentation performance over a range of multi-slice input sizes, $n$, to conventional end-to-end 2D and fully 3D input-output \glsplcnn. We employ the common approach from the literature where each neighboring slice is added as a separate channel of the input, and we will refer to this method as the channel-based method. Further, we introduce a second pseudo-3D method that appears not to have been proposed in the literature before. This pseudo-3D method consists of two main components: a transition block that transforms an $n$-slice input into a single-slice (i.e. 2D) feature map by using 3D convolutions, followed by a standard 2D convolutional network, such as the U-Net  or the SegNet , that produces the final segmentation labels. This method shall be referred to as the proposed method.
The main contributions of our work are:
We systematically compare the segmentation performance of 2D, pseudo-3D (with varying input size, $n$), and 3D approaches.
We introduce a novel pseudo-3D method, using a transition block that transforms a multi-slice subvolume into a 2D feature map that can be processed by a 2D network. This method is compared to the channel-based pseudo-3D method.
We compare the computational efficiency of fully 2D and 3D \glsplcnn to the pseudo-3D methods in terms of graphical memory use, number of model parameters, \glsflops, training time, and prediction time.
We conduct all experiments on a diverse range of data sets, covering a broad range of data set sizes, imaging modalities, segmentation tasks, and body regions.
II Proposed Method
The underlying concept of the pseudo-3D methods is similar to that of standard slice-by-slice predictions using 2D \glsplcnn, but the input is now a subvolume with an odd number of slices, $n$, extracted from the whole volume with a total of $D$ slices. The output of the model is compared to the ground truth of the central slice. If $n = 1$, the method is equivalent to a 2D \glscnn. A fully 3D \glscnn would correspond to $n = D$ for both input and output, where all operations in the network are in 3D and the output volume is compared to the ground truth of the whole volume. See Figure 1 for an illustration of the proposed method. In this study, the number of slices in the input subvolume ranged from $n = 3$ to $n = 13$. In order to isolate the contribution of using multi-slice inputs, this work did not include multi-slice outputs, where the multiple outputs for each slice are usually aggregated using e.g. means or medians.
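The extraction of such an $n$-slice subvolume around a central slice can be sketched in NumPy as follows. How the volume borders are handled is not specified in this excerpt; edge replication is used here as an assumption, and the function and variable names are illustrative only.

```python
import numpy as np

def extract_subvolume(volume, center, n):
    """Extract an n-slice subvolume (n odd) centered on slice `center`.

    volume: array of shape (W, H, D), with slices indexed along the last axis.
    Border slices are edge-replicated (an assumption; other padding schemes,
    e.g. reflection, would work equally well).
    """
    assert n % 2 == 1, "n must be odd so that a unique central slice exists"
    half = n // 2
    padded = np.pad(volume, ((0, 0), (0, 0), (half, half)), mode="edge")
    return padded[:, :, center:center + n]  # shape (W, H, n)

volume = np.arange(2 * 2 * 5).reshape(2, 2, 5)
sub = extract_subvolume(volume, 0, 3)
assert sub.shape == (2, 2, 3)
# The ground truth compared against is the mask of the central slice only,
# i.e. the slice at index n // 2 of the subvolume:
assert np.array_equal(sub[:, :, 1], volume[:, :, 0])
```

The training target for this input would then be the 2D mask of slice `center` alone.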
Let the input subvolume be of width $W$, height $H$, depth $n$, and have $C$ channels. A common way of utilizing the depth information when training with respect to the central slice is as follows: group the channel and depth dimensions together as one, and consider the input to be of shape $(W, H, n \cdot C)$, i.e. with $n \cdot C$ channels. By incorporating the slices in the channel dimension, the multi-slice input can be processed by a regular 2D network. As was mentioned in Section I-B, this method is denoted here the channel-based method.
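The channel-based grouping is a single reshape. A minimal NumPy sketch, assuming a channels-last layout and hypothetical sizes (the actual dimensions are data-set dependent):

```python
import numpy as np

# Hypothetical sizes: W = H = 64, n = 7 slices, C = 4 image modalities.
W, H, n, C = 64, 64, 7, 4
subvolume = np.zeros((W, H, n, C), dtype=np.float32)

# Channel-based method: fold the n slices into the channel dimension so the
# stack can be consumed by an ordinary 2D network with n * C input channels.
channel_input = subvolume.reshape(W, H, n * C)
assert channel_input.shape == (64, 64, 28)
```

The 2D network then simply has its first convolution configured for $n \cdot C$ input channels; no other architectural change is needed.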
The channel-based method is compared to a novel pseudo-3D approach denoted the proposed method. Consider the input to be of shape $(W, H, n, C)$. This is fed through a transition block with $\lfloor n/2 \rfloor$ layers (where $\lfloor \cdot \rfloor$ is the floor function). In each layer, a 3D convolution with a kernel of size $3 \times 3 \times 3$ is applied to the volume, after it has been padded in the width and height dimensions, but not in the depth dimension. Thus, after each layer in the transition block, the depth of the image is reduced by two slices, while the width and height stay the same. After the final convolution, the depth dimension is removed. Hence, the shapes change as
$$(W, H, n, C) \rightarrow (W, H, n - 2, C_1) \rightarrow \cdots \rightarrow (W, H, 1, C_{\lfloor n/2 \rfloor}) \rightarrow (W, H, C_{\lfloor n/2 \rfloor}),$$
where $C_i$ denotes the number of feature maps after the $i$-th layer.
In both the proposed method and the channel-based method, the output layer of the network produces the segmentation mask, with an output shape of $(W, H)$. Hence, the network produces a single segmentation slice, corresponding to the central slice of the input subvolume. See Figure 1 for an illustration of this.
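As a shape-level sanity check, the transition block can be sketched in plain NumPy. This is a simplification with a single feature channel and random (not learned) kernels; a real implementation would use trained multi-channel 3D convolutions, but the depth arithmetic is the same:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def transition_layer(x, kernel):
    """One 3x3x3 convolution, zero-padded in W and H but *valid* in depth.

    x: (W, H, d) single-channel feature map; kernel: (3, 3, 3).
    Returns a feature map of shape (W, H, d - 2).
    """
    xp = np.pad(x, ((1, 1), (1, 1), (0, 0)))      # pad width and height only
    windows = sliding_window_view(xp, (3, 3, 3))  # (W, H, d - 2, 3, 3, 3)
    return np.einsum("whdijk,ijk->whd", windows, kernel)

def transition_block(x, rng):
    """Apply floor(n/2) layers, reducing depth n -> 1, then squeeze to 2D."""
    n = x.shape[-1]
    for _ in range(n // 2):
        x = transition_layer(x, rng.standard_normal((3, 3, 3)))
    return x[:, :, 0]                             # (W, H) 2D feature map

rng = np.random.default_rng(0)
out = transition_block(rng.standard_normal((32, 32, 7)), rng)
assert out.shape == (32, 32)  # n = 7 needs 3 layers: depth 7 -> 5 -> 3 -> 1
```

The resulting 2D feature map is what gets fed into the U-Net or SegNet backbone.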
The network architectures evaluated in this work were the U-Net  and the SegNet , two popular variants of encoder-decoder architectures that have been successful in semantic medical image segmentation. An illustration of both pseudo-3D methods, with U-Net as the main network architecture, is given in Figure 2. Another illustration of the networks with the SegNet backbone can be seen in Figure 11 in the Supplementary Material.
We evaluate the two pseudo-3D methods for $n \in \{3, 5, 7, 9, 11, 13\}$ and compare them to the corresponding conventional end-to-end 2D and 3D networks, all with the U-Net or SegNet architectures. This yields a total of 28 different experiments for each data set (six input sizes for the two pseudo-3D methods, plus the 2D and 3D methods, all with two network architectures). Apart from the segmentation performance, the computational cost is also evaluated across experiments in terms of the number of network parameters, the maximum required amount of \glsgpu memory, the number of \glsflops, the training time per epoch, and the prediction time per sample.
We here present the data sets that the experiments were conducted on, as well as the accompanying information and parameters used in the experiments.
To test the generalization capabilities of the methods, we ran experiments on five different data sets, covering a variety of modalities, data set sizes, segmentation tasks, and body areas. Three of the data sets are publicly available, as they were part of segmentation challenges. In addition, we used two in-house data sets collected at the University Hospital of Umeå, Umeå, Sweden.
An in-house data set containing \glsct images of the pelvic region from patients who underwent radiotherapy for prostate cancer at the University Hospital of Umeå, Umeå, Sweden. We denote this data set \glspros. The delineated structures include the prostate (in most cases annotated as the clinical or gross target volume) and some organs at risk, among them the bladder and rectum. The individual structure masks were merged into a single multilabel truth image, with distinct pixel values for the prostate, the bladder, and the rectum (see Figure 3). Patients without the complete set of structures were excluded from the final data set.
An in-house data set containing \glsct images of the head and neck region. This data set comprises the patients from the University Hospital of Umeå, Umeå, Sweden, who participated in the ARTSCAN study . We denote this data set \glshene. For each \glsct image, manual annotations of the target volumes and various organs at risk were provided. The organ structures included with this data were the bilateral submandibular glands, the bilateral parotid glands, the larynx, and the medulla oblongata (see Figure 4). Faulty \glsct volumes, where the slice spacing changed within a volume, were removed, and patients in whom not all of the six aforementioned structures were present were excluded from the final data set.
The \glsbrats [27, 3] was part of the MICCAI 2019 conference. It contains multimodal pre-operative \glsmri data of patients with pathologically confirmed \glshgg or \glslgg from 19 different institutes. For each patient, \glst1, \glst1c, \glst2, and \glsflair scans were available, acquired with different protocols and various scanners.
Manual segmentations were carried out by one to four raters and approved by neuroradiologists. The necrotic and non-enhancing tumor core, the peritumoral edema, and the contrast-enhancing tumor were assigned distinct labels (see Figure 5). The images were co-registered to the same anatomical template, interpolated to a uniform voxel size, and skull-stripped.
The data set for the \glskits challenge , part of the MICCAI 2019 conference, contains preoperative \glsct data from randomly selected kidney cancer patients who underwent radical nephrectomy at the University of Minnesota Medical Center between 2010 and 2018. Under supervision, medical students annotated the contours of the whole kidney, including any tumors and cysts (label 1), and the contours of only the tumor component, excluding all kidney tissue (label 2) (see Figure 6). Afterward, voxels below a certain radiodensity threshold were excluded from the kidney contours, as they were most likely perinephric fat.
The \glsibsr data set  is a publicly available data set of \glst1 \glsmri volumes, and is commonly used as a standard data set for tissue quantification and segmentation evaluation. Whole-brain segmentations of cerebrospinal fluid (CSF), gray matter, and white matter were included, each with its own label (see Figure 7).
Due to the diverse range of data sets, it must be ensured that the training data is as similar as possible across experiments in order to achieve a fair comparison.
|original voxel size (in mm)|1.0 × 1.0 × 1.0|–|1.0 × 1.0 × 1.0|–|–|
|preprocessed voxel size (in mm)|1.0 × 1.0 × 1.0|2.3 × 2.3 × 2.3|1.0 × 1.0 × 1.0|1.3 × 1.0 × 5.8|2.7 × 2.7 × 3.9|
Magnetic Resonance Image Preprocessing
The \glsbrats and \glsibsr data sets were N4ITK bias field corrected  and normalized to zero mean and unit variance. The \glsbrats volumes were cropped around the center to a smaller resolution, to increase the processing speed. This last step was skipped for the \glsibsr data set because of its much smaller number of data samples.
Computed Tomography Image Preprocessing
In the \glspros, \glshene, and \glskits data sets, all images had the same in-plane (i.e. sagittal-coronal) resolution and a varying slice count. The voxel size also varied between patients, so a preprocessing pipeline (see Figure 8) was set up to transform these three data sets to a uniform resolution and voxel size.
First, the data were resampled to an equal voxel size within the same set. The volumes were then zero-padded to the size of the single largest volume from that set after resampling. In order to increase the processing speed and lower the memory consumption, the \glspros, \glskits, and \glshene volumes were thereafter downsampled. An example of this pipeline is shown in Figure 8.
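The pad-and-downsample steps can be sketched in NumPy. The resampling here is a simple integer-factor block average as a stand-in; the paper's actual interpolation method and target sizes are not given in this excerpt, and all sizes below are hypothetical:

```python
import numpy as np

def pad_to_shape(vol, target):
    """Zero-pad a volume symmetrically up to the `target` shape."""
    pads = []
    for s, t in zip(vol.shape, target):
        total = t - s
        pads.append((total // 2, total - total // 2))
    return np.pad(vol, pads)

def downsample(vol, factor=2):
    """Downsample by block-averaging (a simple stand-in for proper
    resampling); assumes each dimension is divisible by `factor`."""
    w, h, d = (s // factor for s in vol.shape)
    return vol.reshape(w, factor, h, factor, d, factor).mean(axis=(1, 3, 5))

# Hypothetical sizes: pad a 60x60x30 volume to the largest volume in the
# set (64x64x32), then halve the resolution.
vol = np.ones((60, 60, 30))
padded = pad_to_shape(vol, (64, 64, 32))
small = downsample(padded, 2)
assert padded.shape == (64, 64, 32) and small.shape == (32, 32, 16)
```

In practice, resampling to a common voxel size would use a proper interpolation routine, but the bookkeeping of shapes is as above.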
As a final step, the images were normalized by clipping each case to a fixed intensity range, then shifting and scaling the values to a standard range.
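Such a clip-and-rescale normalization is a one-liner. The clipping range used in the paper is not stated in this excerpt; the [-1000, 1000] HU window below is a hypothetical placeholder:

```python
import numpy as np

def normalize_ct(volume, lo=-1000.0, hi=1000.0):
    """Clip CT intensities to [lo, hi] and rescale linearly to [0, 1].

    The actual clipping range used in the paper is not given in this
    excerpt; [-1000, 1000] HU is a hypothetical placeholder.
    """
    clipped = np.clip(volume, lo, hi)
    return (clipped - lo) / (hi - lo)

x = np.array([-2000.0, -1000.0, 0.0, 1000.0, 3000.0])
assert np.allclose(normalize_ct(x), [0.0, 0.0, 0.5, 1.0, 1.0])
```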
III-C Training Details
Our method was implemented in Keras 2.2.4.
For the 3D experiments, the \glsbrats data set was the only data set where the whole volumes could be fed into the network at once, because of constraints on the \glsgpu memory. For the other data sets, we resorted to a patch-based approach, using the largest patch size possible on our available hardware.
In all experiments, we employed the Adam optimizer  with a fixed initial learning rate. If the validation loss did not improve after a certain number of epochs, a patience callback dropped the learning rate by a constant factor, and an early stopping callback terminated the experiment. Because of the differences in data set sizes, these callbacks had to be determined from initial exploratory experiments for each separate data set, to ensure that the experiments ran neither too long nor too short. The patience callbacks were set to five epochs for the \glsbrats, \glskits, and \glspros experiments, six epochs for the \glshene data set, and ten epochs for the \glsibsr data set. The early stopping callbacks, as well as the maximum number of epochs an experiment could run for regardless of any changes in the validation loss, were likewise set per data set. Batch normalization and norm regularization were applied to all convolutional layers, both in the transition block and in the main network. The \glsrelu function was used as the intermediate activation function, and the softmax function as the activation function of the final layer. Each data set was split into a training set and a test set, with the training set, in turn, being split into separate training and validation parts.
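The patience and early-stopping logic can be sketched framework-agnostically as follows. This is a simplified simulation of the mechanism described above (Keras's own `ReduceLROnPlateau` and `EarlyStopping` callbacks differ in some details, e.g. counter resets after a reduction), and all numeric defaults are illustrative, not the paper's settings:

```python
def train_with_callbacks(val_losses, lr=1e-3, drop_factor=0.5,
                         lr_patience=5, stop_patience=15, max_epochs=100):
    """Simulate learning-rate-drop and early-stopping callbacks.

    `val_losses` stands in for the validation loss observed each epoch.
    Returns (epochs_run, final_lr).
    """
    best = float("inf")
    since_best = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best % lr_patience == 0:   # plateau: drop learning rate
                lr *= drop_factor
            if since_best >= stop_patience:     # long plateau: stop early
                return epoch, lr
    return min(len(val_losses), max_epochs), lr

# A loss curve that improves for 3 epochs and then stalls completely:
epochs, final_lr = train_with_callbacks([1.0, 0.8, 0.7] + [0.7] * 40)
assert epochs == 3 + 15             # terminated by the early-stopping patience
assert final_lr == 1e-3 * 0.5 ** 3  # LR dropped at plateau epochs 5, 10, 15
```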
As loss function, we employed a combination of the \glsdsc and the \glsce. The \glsdsc is typically defined as
$$\mathrm{DSC}(X, Y) = \frac{2 |X \cap Y|}{|X| + |Y|},$$
with $X$ the output segmentation and $Y$ its ground truth. However, a differentiable version of Equation 1, the so-called soft \glsdsc, was used. The soft \glsdsc is defined as
$$\mathrm{DSC}_{\mathrm{soft}} = \frac{1}{L} \sum_{l=1}^{L} \frac{2 \sum_{i} p_{l,i} g_{l,i} + \varepsilon}{\sum_{i} p_{l,i} + \sum_{i} g_{l,i} + \varepsilon},$$
where, for each label $l$, the $p_{l,i}$ are the outputs of the network and the $g_{l,i}$ are a one-hot encoding of the ground truth segmentation map. The $\varepsilon$ is a small constant added to avoid division by zero.
The \glsdsc is a good objective for segmentation, as it directly represents the degree of overlap between structures. However, for unbalanced data sets with small structures, where the vast majority of pixels are background, it may converge to poorly generalizing local minima, since misclassifying only a few pixels can lead to large deviations in the \glsdsc. A common way [36, 43] to resolve this is to combine the \glsdsc loss with the \glsce loss, defined as
$$\mathrm{CE} = -\sum_{l=1}^{L} \sum_{i} g_{l,i} \log p_{l,i},$$
and we did this as well. Hence, the final loss function was
$$\mathcal{L} = \mathrm{CE} - \mathrm{DSC}_{\mathrm{soft}}.$$
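A minimal NumPy sketch of these loss terms, assuming per-label probabilities `p` and one-hot ground truth `g` flattened over pixels. The exact weighting of the two terms in the paper is not given in this excerpt; an unweighted combination is used here as an assumption:

```python
import numpy as np

def soft_dice(p, g, eps=1e-5):
    """Soft Dice averaged over labels; p, g: arrays of shape (L, N_pixels),
    p holding predicted probabilities and g the one-hot ground truth."""
    num = 2.0 * (p * g).sum(axis=1) + eps
    den = p.sum(axis=1) + g.sum(axis=1) + eps
    return (num / den).mean()

def cross_entropy(p, g, eps=1e-12):
    """Categorical cross-entropy averaged over pixels."""
    return -(g * np.log(p + eps)).sum(axis=0).mean()

def combined_loss(p, g):
    """CE minus soft Dice, so minimizing the loss maximizes the overlap."""
    return cross_entropy(p, g) - soft_dice(p, g)

# A perfect prediction on a tiny 2-label, 4-pixel example scores a soft
# Dice of ~1 and a lower combined loss than a maximally uncertain one:
g = np.array([[1.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])
assert abs(soft_dice(g, g) - 1.0) < 1e-4
assert combined_loss(g, g) < combined_loss(np.full_like(g, 0.5), g)
```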
In order to artificially increase the data set size and to diversify the data, we employed several common methods for on-the-fly data augmentation: flipping along the horizontal axis, small random rotations, shearing, zooming, and adding small elastic deformations as described in . The data augmentation implementation we used was based on . The images in the \glskits data are asymmetric along one axis because of the liver; therefore, flipping was not applied to that data set, as it would result in anatomically unrealistic images (see Table I).
For evaluation of the segmentation performance, we employed the conventional \glsdsc, as defined in Equation 1. In order to ensure a fair comparison and to investigate the variability of the results within experiments, we used five-fold cross-validation in each experiment (except for the \glspros experiments). Due to its much larger size, the experiments on the \glspros data set were run only once.
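The five-fold protocol amounts to partitioning the patients into five disjoint test folds. A minimal sketch in NumPy; the paper's actual fold assignment and seeding are not specified in this excerpt, so the shuffling below is an assumption:

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Yield (train_idx, test_idx) pairs for 5-fold cross-validation.

    Each sample appears in the test fold exactly once; the random
    permutation and seed are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_samples)
    folds = np.array_split(idx, 5)
    for i in range(5):
        test = folds[i]
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train, test

splits = list(five_fold_splits(23))
assert len(splits) == 5
for train, test in splits:
    assert len(train) + len(test) == 23       # every sample is used
    assert set(train).isdisjoint(test)        # folds never leak into training
```

In each fold, the training indices would be split further into training and validation parts, as described above.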
To compare the computational cost of our proposed models to the corresponding 2D and 3D \glscnn models, we extracted the number of trainable parameters, the maximum amount of \glsgpu memory used, the number of \glsflops, training time per epoch, and prediction time per sample.
|Model|#slices|#params|memory|\acrshortflops|t per epoch|p per sample|
|3D|128|1 461k|16 335 MB|7.306M|370 s|2.36 s|
The segmentation performances, in terms of the \glsdsc, of all models are illustrated in Figure 9. For each data set, the mean \glsdsc scores are plotted (with point-wise standard deviation bars) as a function of the input size, and are given for the 2D, pseudo-3D with $n = 3$ to $n = 13$, and 3D models, for both the U-Net and SegNet backbones. These results are also tabulated in Table III, along with summaries of the experiment setups per data set.
Randomly selected example segmentations are illustrated in Figure 10. For each data set, a prediction from the 2D models, the pseudo-3D models with the proposed method, and the 3D models are given, along with their respective ground truths. The \glsbrats, \glskits, and \glspros segmentations are cropped for ease of viewing. We chose to omit examples for the channel-based pseudo-3D models because of their high level of similarity to the proposed method. Segmentations with the channel-based method, along with additional example segmentations, can be found in Figures 14-16 in Section F of the Supplementary Material.
The computational costs of the models used for the \glsbrats experiments are presented in Table II. The number of model parameters, the graphical memory use, and the \glsflops depend only on the model type, and the corresponding columns in Table II are therefore equal for all other data sets. The same variables are shown for the other data sets in Tables IX-XII in Section B of the Supplementary Material, where the only differences are in the training and inference times; these two parameters scale with the data set size.
This study evaluated the inclusion of neighboring spatial context in the inputs of \glsplcnn for medical image segmentation. Such pseudo-3D methods, with a multi-slice input and a single-slice output, are commonly implemented by regarding the adjacent slices as additional channels of the central slice. Apart from this approach, we also proposed an alternative pseudo-3D method, based on multiple preliminary 3D convolution steps before processing by a 2D \glscnn. Across five different data sets and using U-Net and SegNet \glscnn backbones, we compared both of these pseudo-3D methods, for input sizes up to $n = 13$, to end-to-end 2D and 3D \glsplcnn with single-slice and whole-volume inputs and outputs, respectively. Additionally, we evaluated a number of computational parameters to get a sense of each model's hardware requirements and load.
V-A Computational costs
As seen in Table II, the computational costs are in line with what would be expected. The transition block adds a relatively small number of extra parameters on top of the main 2D network, and the required amount of \glsgpu memory and the number of \glsflops scale accordingly with $n$. Since the input is of the same size as for the channel-based method, the training times per epoch are largely similar. One advantage of the fully 3D \glsplcnn demonstrated in these results is that the prediction time is significantly shorter, because a sample can be processed all at once instead of slice by slice.
The high computational cost of end-to-end 3D convolutions is also demonstrated in Table II. The memory footprint is many times larger than that of the 2D U-Net; over 16 GB is required to train on the complete volumes, which is at or above the limit of most modern commodity GPUs. Both pseudo-3D methods use less than 5 % of the \glsgpu memory consumed by the end-to-end 3D network, even at $n = 13$. It can thus be concluded that both pseudo-3D methods are computationally very efficient ways of including more inter-slice information, with the proposed method being slightly more expensive in terms of \glsgpu memory consumption than the channel-based method.
V-B Quantitative analysis
As can be seen in Figure 9, overall, all experiments managed to produce acceptable segmentation results, even for data sets with complex structures, such as the \glsbrats images, or with organs that can be hard to visually distinguish, such as in the \glshene set. One consistent observation across the data sets is that the U-Net backbone outperforms the SegNet in nearly every case. Regarding the behavior as a function of the input size $n$, the results in Figure 9 are inconclusive for almost all data sets.
In the plots of the \glsbrats, \glskits, \glsibsr, and \glshene data in Figure 9, there does not seem to be any additional benefit from adding more slices as input over an end-to-end 2D approach. There appear to be some exceptions, like the surge at one particular $n$ in the \glshene results, but in these four data sets the variance is either too high or the rate of increase too low to draw any strong conclusions. For these cases, it is doubtful whether the accompanying downsides, e.g. the increased training time, are worth the at most marginal improvements in segmentation performance. Likewise, there seems to be no significant difference between our proposed method and the channel-based method in these four data sets.
The only data set in this study where the \glsdsc does seem to improve significantly with $n$ is the \glspros. As more slices are included in the input volume, the segmentation performance approaches that of a fully 3D network, and the proposed method outperforms the channel-based method by an increasing margin. While the overall improvement when going from 2D to pseudo-3D is arguably small, we can regard the \glspros case as a demonstration that pseudo-3D models can improve the segmentation performance over 2D methods.
Fully 3D \glsplcnn seem to produce equal or worse results than their 2D and pseudo-3D counterparts in most cases. Again, the only exception is in the \glspros results, and in this case only when the U-Net is used as the backbone network (see the respective plot in Figure 9). This could be explained by the much higher number of parameters of 3D \glsplcnn, which makes them prone to overfitting. The large number of data samples in the \glspros set, combined with the skip-connections that differentiate the U-Net from the SegNet, might have been enough to overcome this problem in this specific case.
There does not seem to be a straightforward explanation as to why the \glspros data set is an exception compared to the other data sets. In an attempt to connect the \glsdsc behaviour with respect to the number of input slices to differences in data set properties, a feature-based regression analysis was performed. We computed features of the ground truth masks that describe each mask's structural properties: structure depth (i.e. the average number of consecutive slices a structure is present in), structure size relative to the total volume, and average structural inter-slice spatial displacement. The extracted feature values for all data sets and their respective structures can be found in Tables IV–VIII of the Supplementary Material. We found no significant agreement between models that could connect any of these data set features to the \glsdsc with respect to the number of input slices. For more details about the feature extraction and regression analysis, see Sections A.1 and A.2 of the Supplementary Material.
Another distinction between the \glspros set and the others included in this study is its much larger number of samples. As mentioned above, this could have been a contributing factor to the higher performance of the \glspros 3D U-Net compared to the other data sets' 3D \glscnn results. This property was also hypothesized to influence the relation between the number of input slices and the \glsdsc, and therefore the following analysis was performed: the experiments were repeated, but now training on five distinct subsets of samples from the \glspros data set. The average scores obtained from the five distinct subsets can be found in Figure 12 in the Supplementary Material, where we see a similar behavior as in Figure 9. Hence, we rule out the data set size as the main cause as well.
We therefore conclude that pseudo-3D methods have the potential to increase segmentation performance, but in the general case will not yield better results than conventional end-to-end 2D and 3D \glsplcnn.
Further analysis to explain the behaviour on the \glspros data set could be performed. The regression-based feature analysis was somewhat rudimentary, and could likely be extended with e.g. more sophisticated models and more data.
Another possible follow-up study would be to investigate whether it is the multi-slice output (i.e. producing segmentations for all input slices) of pseudo-3D methods that improves the results in other studies. While this was out of the scope of this work, aggregating multiple outputs may be the main reason why pseudo-3D methods sometimes improve segmentation performance. Given our conclusion that multi-slice inputs do not seem to improve the results on their own, the added benefit might only come from aggregating multiple outputs. In that case, using something like Bayesian dropout could prove just as beneficial.
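As a sketch of such output aggregation (hypothetical, not part of our experiments): assuming a pseudo-3D model that outputs probabilities for all slices in its input stack, sliding the stack over the volume and averaging the overlapping per-slice predictions could look like this:

```python
import numpy as np

def aggregate_multislice(predict_stack, volume, d):
    """Slide a d-slice window over the volume, collect the per-slice
    probabilities that a multi-output pseudo-3D model would produce, and
    average the overlapping predictions for each slice.

    predict_stack: callable mapping a (d, H, W) stack to (d, H, W) probabilities.
    """
    n, h, w = volume.shape
    acc = np.zeros((n, h, w))
    counts = np.zeros(n)
    for start in range(n - d + 1):
        probs = predict_stack(volume[start:start + d])  # shape (d, h, w)
        acc[start:start + d] += probs
        counts[start:start + d] += 1
    return acc / counts[:, None, None]  # interior slices average up to d predictions
```

Any trained model can be plugged in as `predict_stack`; comparing this aggregate against the central-slice-only prediction would isolate the effect of the multi-slice output.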
V-C Qualitative analysis
It is important to emphasize that the images in Figure 10 are randomly selected single slices from thousands of samples and are therefore presented purely for illustrative purposes, and might not always be a representation of the overall segmentation performance of a particular data set. However, some remarks can be made that can be related to the quantitative results in Figure 9. The relatively large variance in segmentation performance between experiments of the \glsbrats data are demonstrated in Figure 10; as seen, the predictions can differ quite drastically within the same model and with varying . This reflects the \glspldsc of the \glsbrats set presented in Figure 9.
It also appears that the U-Net is better at capturing fine structural details, while the SegNet segmentations seems to be coarser and simpler. This becomes particularly noticeable in data sets with complex structures, such as the gray matter-white matter border in the \glsibsr images (Figure 10). This in turn results in an overall large difference in mean \glsdsc between U-Net and SegNet. When the ground truth structures are more coarsely shaped, such as in the \glshene set, the SegNet can keep up much better with the U-Net performance.
V-D Effect of the Loss Function
In an earlier stage of this project, we employed a different experimental setup with a pure \glsdsc loss function. However, these initial experiments showed this loss not to be sufficient for all data sets. In particular, the \glskits and \glshene data sets yielded unacceptably unstable results: even with exactly equal hyperparameters, training could either result in fairly accurate segmentations or complete failure. Investigation of the \glspldsc of individual structures demonstrated that in these failed experiments, multiple structures did not improve beyond a \glsdsc on the order of 0.1. After adapting the loss function to also include the \glsce term (see Equation 4), the results improved substantially for all data sets. Performance details for each run using the pure \glsdsc and the final loss function can be seen in Figure 13 and Table XIII in Section E of the Supplementary Material.
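A common way to combine the two terms is sketched below in NumPy; this is illustrative and not necessarily identical to Equation 4 (the weighting and multi-class handling are simplified to the binary case):

```python
import numpy as np

def dice_ce_loss(probs, target, eps=1e-7):
    """Soft Dice loss plus binary cross-entropy (unweighted sum)."""
    probs = probs.ravel().astype(float)
    target = target.ravel().astype(float)
    intersection = (probs * target).sum()
    soft_dice = (2.0 * intersection + eps) / (probs.sum() + target.sum() + eps)
    probs = np.clip(probs, eps, 1.0 - eps)  # guard the logarithms
    cross_entropy = -(target * np.log(probs)
                      + (1.0 - target) * np.log(1.0 - probs)).mean()
    return (1.0 - soft_dice) + cross_entropy
```

The cross-entropy term provides informative per-voxel gradients even when a structure's overlap is near zero, which is exactly the regime where a pure Dice loss can stall around a \glsdsc of 0.1.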
This study systematically evaluated pseudo-3D \glsplcnn, where a stack of adjacent slices is used as input for a prediction on the central slice. The hypothesis underlying this approach is that the added neighboring spatial information improves segmentation performance, at only a small additional computational cost compared to an end-to-end 2D \glscnn. However, whether or not this is actually a sensible approach had not previously been evaluated in the literature.
Aside from the conventional method, where the multiple slices are input as multiple channels, we introduced a novel pseudo-3D method where a subvolume is repeatedly convolved in 3D to obtain a final 2D feature map. This 2D feature map is then fed into a standard 2D network.
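One way to realize this depth reduction, assuming 3x3x3 kernels and no padding along the slice dimension (the kernel size is our assumption for illustration), is to apply valid 3D convolutions until the depth reaches one. The following sketch counts how many such convolutions are needed:

```python
def convs_to_collapse(d, k=3):
    """Number of 3D convolutions, applied without padding along the slice
    dimension, needed to reduce a stack of d slices to a single 2D map.
    Each such convolution shrinks the depth by k - 1."""
    n = 0
    while d > 1:
        d -= k - 1
        n += 1
    if d != 1:
        raise ValueError("depth does not land exactly on 1 for this d and k")
    return n

print(convs_to_collapse(7))  # 3 convolutions: depth 7 -> 5 -> 3 -> 1
```

This also shows why an odd number of input slices is the natural choice for this construction: with k = 3, an even depth never collapses exactly to a single slice.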
We investigated the segmentation performance in terms of the \glsdsc and the computational cost for a large range of input sizes, for the U-Net and SegNet backbone architectures, and for five diverse data sets covering different anatomical regions, imaging modalities, and segmentation tasks. While pseudo-3D networks can take a large number of input slices and still be computationally far less costly than fully 3D \glsplcnn, a significant improvement from using multiple input slices was only observed for one of the five data sets. We also observed no significant improvement of 3D network performance over 2D networks, regardless of data set size.
Because of the ambiguity in the underlying cause of the behavior on the \glspros data set compared to the results on the other data sets, we conclude that, in the general case, pseudo-3D approaches appear not to significantly improve segmentation results over 2D methods.
This research was conducted using the resources of the High Performance Computing Center North (HPC2N) at Umeå University, Umeå, Sweden. We are grateful for the financial support obtained from the Cancer Research Fund in Northern Sweden, Karin and Krister Olsson, Umeå University, The Västerbotten regional county, and Vinnova, the Swedish innovation agency.
Mean \glsdsc (standard deviation in parentheses) for each data set, U-Net backbone. Rows for the proposed and channel-based methods are listed in order of increasing number of input slices.

| Model | BraTS19 | KiTS19 | IBSR18 | U-HAND | U-PRO |
|---|---|---|---|---|---|
| 2D | 0.768 (0.018) | 0.766 (0.016) | 0.898 (0.005) | 0.656 (0.012) | 0.807 (0.005) |
| proposed | 0.765 (0.017) | 0.775 (0.014) | 0.907 (0.005) | 0.646 (0.008) | 0.811 (0.005) |
| proposed | 0.766 (0.018) | 0.789 (0.014) | 0.909 (0.004) | 0.654 (0.005) | 0.817 (0.004) |
| proposed | 0.769 (0.013) | 0.797 (0.017) | 0.911 (0.003) | 0.645 (0.012) | 0.818 (0.004) |
| proposed | 0.771 (0.022) | 0.790 (0.017) | 0.912 (0.003) | 0.642 (0.007) | 0.819 (0.004) |
| proposed | 0.766 (0.018) | 0.792 (0.023) | 0.913 (0.003) | 0.654 (0.008) | 0.827 (0.004) |
| proposed | 0.772 (0.018) | 0.796 (0.026) | 0.916 (0.003) | 0.690 (0.004) | 0.831 (0.004) |
| channel-based | 0.770 (0.014) | 0.789 (0.013) | 0.904 (0.003) | 0.665 (0.007) | 0.809 (0.005) |
| channel-based | 0.767 (0.017) | 0.804 (0.009) | 0.904 (0.003) | 0.663 (0.008) | 0.810 (0.004) |
| channel-based | 0.765 (0.019) | 0.800 (0.015) | 0.896 (0.005) | 0.648 (0.009) | 0.813 (0.004) |
| channel-based | 0.763 (0.016) | 0.787 (0.013) | 0.902 (0.004) | 0.659 (0.011) | 0.809 (0.005) |
| channel-based | 0.764 (0.016) | 0.777 (0.017) | 0.902 (0.007) | 0.663 (0.003) | 0.809 (0.005) |
| channel-based | 0.769 (0.016) | 0.772 (0.015) | 0.905 (0.008) | 0.674 (0.006) | 0.814 (0.005) |
| 3D | 0.769 (0.016) | 0.763 (0.014) | 0.924 (0.002) | 0.635 (0.009) | 0.841 (0.004) |
Mean \glsdsc (standard deviation in parentheses) for each data set, SegNet backbone. Rows for the proposed and channel-based methods are listed in order of increasing number of input slices.

| Model | BraTS19 | KiTS19 | IBSR18 | U-HAND | U-PRO |
|---|---|---|---|---|---|
| 2D | 0.744 (0.021) | 0.755 (0.020) | 0.782 (0.011) | 0.643 (0.006) | 0.763 (0.005) |
| proposed | 0.746 (0.017) | 0.766 (0.017) | 0.786 (0.007) | 0.625 (0.011) | 0.773 (0.005) |
| proposed | 0.756 (0.017) | 0.767 (0.018) | 0.789 (0.006) | 0.635 (0.005) | 0.772 (0.005) |
| proposed | 0.752 (0.018) | 0.767 (0.015) | 0.790 (0.008) | 0.627 (0.004) | 0.774 (0.005) |
| proposed | 0.760 (0.013) | 0.760 (0.018) | 0.792 (0.013) | 0.623 (0.012) | 0.784 (0.005) |
| proposed | 0.751 (0.016) | 0.761 (0.015) | 0.796 (0.008) | 0.620 (0.011) | 0.791 (0.005) |
| proposed | 0.763 (0.016) | 0.778 (0.014) | 0.798 (0.010) | 0.662 (0.011) | 0.802 (0.005) |
| channel-based | 0.752 (0.019) | 0.767 (0.016) | 0.784 (0.006) | 0.638 (0.010) | 0.772 (0.005) |
| channel-based | 0.755 (0.018) | 0.769 (0.015) | 0.766 (0.004) | 0.629 (0.010) | 0.771 (0.005) |
| channel-based | 0.753 (0.016) | 0.755 (0.016) | 0.766 (0.010) | 0.639 (0.007) | 0.770 (0.005) |
| channel-based | 0.748 (0.019) | 0.737 (0.014) | 0.749 (0.008) | 0.628 (0.006) | 0.767 (0.005) |
| channel-based | 0.747 (0.015) | 0.735 (0.008) | 0.742 (0.011) | 0.627 (0.005) | 0.762 (0.005) |
| channel-based | 0.754 (0.018) | 0.741 (0.012) | 0.762 (0.009) | 0.649 (0.009) | 0.775 (0.005) |
| 3D | 0.726 (0.013) | 0.735 (0.012) | 0.744 (0.012) | 0.599 (0.005) | 0.771 (0.005) |
VI-A Structure Analysis
We selected three data set features that describe each data set's structural properties: structure depth, structure size relative to the total volume, and average structural inter-slice spatial displacement (see Tables IV–VIII). These structural properties are computed as follows.
The structure depth of class $c$ is computed as
\[
\bar{\delta}_c = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{R_{c,p}} \sum_{r=1}^{R_{c,p}} \delta_{c,p,r},
\]
where $p$ denotes the patient, $r$ represents an unconnected region of class $c$ in patient $p$, and $\delta_{c,p,r}$ denotes the number of consecutive slices (in the axial dimension) of region $r$. Here, $P$ and $R_{c,p}$ are the number of patients and the number of unconnected regions of class $c$ in patient $p$, respectively.
The structure size relative to the total volume of class $c$ is defined as
\[
s_c = \frac{1}{P} \sum_{p=1}^{P} \frac{n_{c,p}}{H W D},
\]
where $n_{c,p}$ denotes the total number of voxels labeled as class $c$ in patient $p$, and $H$, $W$, and $D$ are the height, width, and depth of the input volume, respectively.
To compute the structure spatial displacement, we first compute the center of mass, $\mathbf{m}_{c,p,k}$, of class $c$ of patient $p$ at slice $k$ (in the axial dimension) as
\[
\mathbf{m}_{c,p,k} = \frac{1}{n_{c,p,k}} \sum_{i=1}^{H} \sum_{j=1}^{W} v_{c,p,k}(i, j) \begin{bmatrix} i \\ j \end{bmatrix},
\]
where $v_{c,p,k}(i, j)$ is the value of a voxel at coordinates $(i, j)$ in class $c$ in patient $p$ and slice $k$, and where $n_{c,p,k}$ denotes the total number of voxels labeled as class $c$ in patient $p$ in slice $k$.
With these, the structure spatial displacement of class $c$ is computed as
\[
m_c = \frac{1}{P} \sum_{p=1}^{P} \frac{1}{D - 1} \sum_{k=2}^{D} \left\lVert \mathbf{m}_{c,p,k} - \mathbf{m}_{c,p,k-1} \right\rVert_2,
\]
where $\lVert \cdot \rVert_2$ denotes the Euclidean distance between the two coordinate points $\mathbf{m}_{c,p,k}$ and $\mathbf{m}_{c,p,k-1}$.
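The three features can be sketched in code as follows (a simplified NumPy version of our own: each class is treated as a single region rather than split into unconnected components, and the displacement is taken between consecutive non-empty slices):

```python
import numpy as np

def structure_features(mask):
    """Structure depth, relative size, and inter-slice displacement for a
    binary class mask of shape (slices, H, W)."""
    present = mask.any(axis=(1, 2))

    # Depth: average length of runs of consecutive slices containing the class.
    runs, run = [], 0
    for p in present:
        if p:
            run += 1
        elif run:
            runs.append(run)
            run = 0
    if run:
        runs.append(run)
    depth = float(np.mean(runs)) if runs else 0.0

    # Relative size: labeled voxels over total voxels.
    rel_size = mask.sum() / mask.size

    # Displacement: Euclidean distance between centers of mass of
    # consecutive slices that contain the class.
    centers = [np.argwhere(s).mean(axis=0) for s in mask if s.any()]
    disps = [np.linalg.norm(a - b) for a, b in zip(centers[1:], centers[:-1])]
    displacement = float(np.mean(disps)) if disps else 0.0

    return depth, rel_size, displacement
```

Averaging these per-patient values over the cohort yields the per-class features described above.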
To find an explanation for why the \acrshortpros data set is an exception compared to the other data sets, we attempted to connect the \acrshortdsc behaviour with respect to the number of input slices (the number of slices extracted from the whole volume and used as the subvolume input) to differences in data set properties: structure depth, structure size relative to the total volume, average structural inter-slice spatial displacement, and the number of training samples.
We aggregated these structure properties to obtain the minimum, mean, and maximum of the structure depth, relative structure size, and spatial displacement over all classes in each data set, such that for each data set we obtained a fixed set of nine features. Overall, we formed a regression task and computed regression models over all models (U-Net and SegNet), architectures (2D, 3D, proposed, and channel-based), input sizes (the number of slices extracted from the whole volume), and data sets (\glsbrats, \glskits, \glsibsr, \glshene, and \glspros), with the following input features:
- minimum of the structure depth over the classes,
- average of the structure depth over the classes,
- maximum of the structure depth over the classes,
- minimum of the structure size relative to the total volume over the classes,
- mean of the structure size relative to the total volume over the classes,
- maximum of the structure size relative to the total volume over the classes,
- minimum of the spatial displacement over the classes,
- average of the spatial displacement over the classes,
- maximum of the spatial displacement over the classes.
We used the bootstrap to compute the mean regression coefficient vectors and the corresponding confidence intervals for several regularised linear regression models. In the regression analysis, we used Ridge regression, the Lasso, the Elastic Net, and Bayesian ARD regression. The analysis was performed using scikit-learn 0.22.
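A minimal sketch of this bootstrap procedure, shown for Ridge regression only (the number of rounds, regularization strength, and percentile interval below are illustrative, not the values used in the analysis):

```python
import numpy as np
from sklearn.linear_model import Ridge

def bootstrap_ridge(X, y, n_rounds=1000, alpha=1.0, seed=0):
    """Bootstrap the coefficient vector of a Ridge regression: resample the
    rows with replacement, refit, and report the mean coefficients together
    with a 95 % percentile confidence interval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    coefs = np.empty((n_rounds, X.shape[1]))
    for i in range(n_rounds):
        idx = rng.integers(0, n, size=n)  # sample rows with replacement
        coefs[i] = Ridge(alpha=alpha).fit(X[idx], y[idx]).coef_
    lower, upper = np.percentile(coefs, [2.5, 97.5], axis=0)
    return coefs.mean(axis=0), (lower, upper)
```

The same loop applies unchanged to Lasso, Elastic Net, and Bayesian ARD regression by swapping the estimator; a feature whose interval excludes zero would indicate a consistent relation to the \glsdsc.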
VI-B Supplementary Computational Results
To compare the computational cost of our proposed models to the corresponding 2D and 3D \glscnn models, we extracted the number of trainable parameters, the maximum amount of \glsgpu memory used, the number of \glsflops, training time per epoch, and prediction time per sample.
The computational costs of the models used for the \glsbrats experiments are presented in Table II of the main paper. The number of model parameters, the \glsgpu memory use, and the number of \glsflops depend only on the model type, and are therefore equal for all other data sets. The same variables are shown for the other data sets in Tables IX–XII, where the only differences are in the training and inference times due to the different numbers of samples; these two quantities scale with the data set size.
VI-C SegNet Pseudo-3D Architecture
Figure 2 in the main paper shows both pseudo-3D methods with a U-Net backbone. Here, Figure 11 shows the same methods but with a SegNet backbone.
VI-D \acrshortpros Subset Experiment Results
A distinction between the \glspros set and the others included in this study is its much larger number of samples. This property was hypothesized to influence the relation between the number of input slices and the \glsdsc, and therefore the following analysis was performed: the experiments were repeated, but now training on five distinct subsets of samples from the \glspros data set. The average scores obtained from the five distinct subsets can be found here in Figure 12, where we see a similar behavior as in Figure 9 in the main paper. Hence, we rule out the data set size as the main cause of the \glspros performance behaviour.
VI-E Supplementary Quantitative Results
In an earlier stage of this project, we employed a different experimental setup with a pure \glsdsc loss function. However, these initial experiments showed this loss not to be sufficient for all data sets. In particular, the \glskits and \glshene data sets yielded unacceptably unstable results: even with exactly equal hyperparameters, training could either result in fairly accurate segmentations or complete failure. Investigation of the \glspldsc of individual structures demonstrated that in these failed experiments, multiple structures did not improve beyond a \glsdsc on the order of 0.1. After adapting the loss function to also include the \glsce term, the results improved substantially for all data sets. Performance details for each run using the pure \glsdsc and the final loss function can be seen in Figure 13 and Table XIII.
VI-F Supplementary Qualitative Results
Example segmentations are illustrated in Figures 14–16. It is important to emphasize that the images are randomly selected single slices from thousands of samples; they are presented purely for illustrative purposes and might not always be representative of the overall segmentation performance on a particular data set.
- (2016) Lung nodule detection using 3d convolutional neural networks trained on weakly labeled data. In Medical Imaging 2016: Computer-Aided Diagnosis, Vol. 9785, pp. 978532. Cited by: §I-A.
- (2017) SegNet: a deep convolutional encoder-decoder architecture for image segmentation. IEEE transactions on pattern analysis and machine intelligence 39 (12), pp. 2481–2495. Cited by: §I-B, §II, TABLE III.
- (2017) Advancing the cancer genome atlas glioma MRI collections with expert segmentation labels and radiomic features. Scientific data 4, pp. 170117. Cited by: §III-A3.
- (2013) A survey of mri-based medical image analysis for brain tumor studies. Physics in Medicine & Biology 58 (13), pp. R97. Cited by: §I.
- (2018) VoxResNet: deep voxelwise residual networks for brain segmentation from 3d mr images. NeuroImage 170, pp. 446–455. Cited by: §I-A.
- (2016) 3D u-net: learning dense volumetric segmentation from sparse annotation. In International conference on medical image computing and computer-assisted intervention, pp. 424–432. Cited by: §I-A, §I.
- (1997) Brainweb: online interface to a 3D MRI simulated brain database. In NeuroImage, Cited by: §III-A5.
- (2015) Deep neural networks for anatomical brain segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 20–28. Cited by: §I-A.
- (2016) Interactive contour delineation of organs at risk in radiotherapy: clinical evaluation on nsclc patients. Medical physics 43 (5), pp. 2569–2580. Cited by: §I.
- (2017) TensorLayer: A Versatile Library for Efficient Deep Learning Development. ACM Multimedia. Cited by: §III-C2.
- (2017) 3D deeply supervised network for automated segmentation of volumetric medical images. Medical image analysis 41, pp. 40–54. Cited by: §I.
- (2019) Deep convolutional neural network for segmentation of thoracic organs-at-risk using cropped 3d images. Medical physics. Cited by: §I-A.
- (2019) Removing segmentation inconsistencies with semi-supervised non-adjacency constraint. Medical Image Analysis 58, pp. 101551. Cited by: §I-A.
- (2019) 2.5D cnn model for detecting lung disease using weak supervision. In Medical Imaging 2019: Computer-Aided Diagnosis, Vol. 10950, pp. 109503O. Cited by: §I-A.
- (2018) Integration of spatial information in convolutional neural networks for automatic segmentation of intraoperative transrectal ultrasound images. Journal of Medical Imaging 6 (1), pp. 011003. Cited by: §I-A.
- (2017) Automatic liver lesion segmentation using a deep convolutional neural network method. arXiv preprint arXiv:1704.07239. Cited by: §I-A.
- (2019) The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445. Cited by: §III-A4.
- (2017) Efficient multi-scale 3d cnn with fully connected crf for accurate brain lesion segmentation. Medical image analysis 36, pp. 61–78. Cited by: §I-A.
- (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §III-C1.
- (2019) VesselNet: a deep convolutional neural network with multi pathways for robust hepatic vessel segmentation. Computerized Medical Imaging and Graphics 75, pp. 74–83. Cited by: §I-A.
- (2019) A cascade of cnn and lstm network with 3d anchors for mitotic cell detection in 4d microscopic image. In ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1239–1243. Cited by: §I-A.
- (2017) On the compactness, efficiency, and representation of 3d convolutional networks: brain parcellation as a pretext task. In International Conference on Information Processing in Medical Imaging, pp. 348–360. Cited by: §I-A.
- (2018) Multi-channel multi-scale fully convolutional network for 3d perivascular spaces segmentation in 7t mr images. Medical image analysis 46, pp. 106–117. Cited by: §I-A.
- (2017) A survey on deep learning in medical image analysis. Medical image analysis 42, pp. 60–88. Cited by: §I, §I.
- (2017) Automatic 3d liver location and segmentation via convolutional neural network and graph cut. International Journal of Computer Assisted Radiology and Surgery 12 (2), pp. 171–182. Cited by: §I-A.
- (2015) An ensemble of 2d convolutional neural networks for tumor segmentation. In Scandinavian Conference on Image Analysis, pp. 201–211. Cited by: §I-A.
- (2014) The multimodal brain tumor image segmentation benchmark (BRATS). IEEE transactions on medical imaging 34 (10), pp. 1993–2024. Cited by: §III-A3.
- (2016) V-net: fully convolutional neural networks for volumetric medical image segmentation. In 2016 Fourth International Conference on 3D Vision (3DV), pp. 565–571. Cited by: §I-A, §I.
- (2019) 3D convolutional neural networks for tumor segmentation using long-range 2d context. Computerized Medical Imaging and Graphics 73, pp. 60–72. Cited by: §I-A.
- (2018) Deep sequential segmentation of organs in volumetric medical scans. IEEE transactions on medical imaging 38 (5), pp. 1207–1215. Cited by: §I-A.
- (2013) Deep feature learning for knee cartilage segmentation using a triplanar convolutional neural network. In International conference on medical image computing and computer-assisted intervention, pp. 246–253. Cited by: §I-A.
- (2007) A review of computer-aided diagnosis of breast cancer: toward the detection of subtle signs. Journal of the Franklin Institute 344 (3-4), pp. 312–348. Cited by: §I.
- (2018) Interleaved 3d-cnns for joint segmentation of small-volume structures in head and neck ct images. Medical Physics 45 (5), pp. 2063–2075. Cited by: §I-A.
- (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical image computing and computer-assisted intervention, pp. 234–241. Cited by: §I-A, §I-B, §II, TABLE III.
- (2014) A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In International conference on medical image computing and computer-assisted intervention, pp. 520–527. Cited by: §I-A.
- (2017) ReLayNet: retinal layer and fluid segmentation of macular optical coherence tomography using fully convolutional networks. Biomed. Opt. Express 8 (8), pp. 3627–3642. Cited by: §III-C1.
- (2019) Deep learning in medical imaging and radiation therapy. Medical physics 46 (1), pp. e1–e36. Cited by: §I.
- (2017) Deep learning in medical image analysis. Annual review of biomedical engineering 19, pp. 221–248. Cited by: §I.
- (2003) Best practices for convolutional neural networks applied to visual document analysis.. In Icdar, Vol. 3. Cited by: §III-C2.
- (2010) N4ITK: improved n3 bias correction. IEEE transactions on medical imaging 29 (6), pp. 1310. Cited by: §III-B1.
- (2019) End-to-End Cascaded U-Nets with a Localization Network for Kidney Tumor Segmentation. arXiv preprint arXiv:1910.07521. Cited by: §I-A.
- (2019) TuNet: End-to-end Hierarchical Brain Tumor Segmentation using Cascaded Networks. arXiv preprint arXiv:1910.05338. Cited by: §I-A.
- (2018) 3D segmentation with exponential logarithmic loss for highly unbalanced object sizes. In Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, A. F. Frangi, J. A. Schnabel, C. Davatzikos, C. Alberola-López and G. Fichtinger (Eds.), Cham, pp. 612–619. Cited by: §III-C1.
- (2015) Automated anatomical landmark detection on distal femur surface using convolutional neural network. In 2015 IEEE 12th international symposium on biomedical imaging (ISBI), pp. 17–21. Cited by: §I-A.
- (2017) Automatic 3d cardiovascular mr segmentation with densely-connected volumetric convnets. In Medical Image Computing and Computer-Assisted Intervention - MICCAI 2017, M. Descoteaux, L. Maier-Hein, A. Franz, P. Jannin, D. L. Collins and S. Duchesne (Eds.), Cham, pp. 287–295. Cited by: §I-A.
- (2017) Volumetric convnets with mixed residual connections for automated prostate segmentation from 3d mr images. In Thirty-first AAAI conference on artificial intelligence, Cited by: §I-A.
- (2011) Two-year results from a swedish study on conventional versus accelerated radiotherapy in head and neck squamous cell carcinoma–the artscan study. Radiotherapy and Oncology 100 (1), pp. 41–48. Cited by: §III-A2.