A framework for CT image segmentation inspired by the clinical environment
Computed tomography (CT) data poses many challenges to medical image segmentation based on convolutional neural networks (CNNs). The main issues arise during feature extraction, due to the large dynamic range of intensities and the varying number of recorded slices of CT volumes. In this paper we address these issues with a framework that combines domain-specific data pre-processing and augmentation with state-of-the-art CNN architectures. The focus is not limited to score optimization, but extends to stabilizing the achieved prediction performance, since this is a mandatory requirement for use in automated and semi-automated workflows in the clinical environment.
The framework is validated in the context of an architecture comparison, to show that the effects of our framework are independent of the CNN architecture. This comparison includes a modified U-Net and a modified Mixed-Scale Dense Network (MS-D Net), contrasting dilated convolutions for parallel multi-scale processing with the U-Net approach based on traditional scaling operations. Finally, in order to combine the superior recognition performance of 2D-CNN models with the more comprehensive spatial information of 3D-CNN models, we propose an ensemble model.
The framework performs successfully when tested on a range of tasks such as liver and kidney segmentation, without significant differences in prediction performance on strongly differing volume sizes and varying slice thickness.
Keywords: Medical image segmentation · Computed Tomography (CT) · Kidney tumor segmentation · Liver segmentation
Spatial characteristics of tumors like size, shape, location or growth pattern are central clinical features. Changes in these characteristics are important indicators for disease progression and treatment effects. Automated, quantitative assessment of these characteristics and their changes from radiological images would yield an efficient and objective tool for radiologists to monitor the disease course. Thus, a reliable and accurate automated segmentation method is desirable to extract spatial tumor and organ characteristics from computed tomography (CT) volumes.
In recent years, convolutional neural networks (CNNs) became the state-of-the-art method for image segmentation, as well as for many other tasks in computer vision, such as image classification, object detection and object tracking. The applications of CNNs are diverse, but the general data handling is often very similar, since the feature extraction is performed internally by the CNN itself. Improvements in the application of CNNs often address the neural network architecture, the training algorithm or the use case [21, 4], while the data is typically handled the same way as grayscale or RGB images, but with additional dimensions.
However, this approach neglects prior information about the specific physical processes by which these images or volumes are acquired and by which the image contrast is determined, possibly leading to an inaccurate or suboptimal image analysis. For instance, while most image formats map pixels on relative scales of a few hundred values, voxels in CT volumes are mapped on the Hounsfield scale , a quantitative mapping of radiodensity calibrated such that the value for air is -1000 Hounsfield Units (HU) and that for water is 0 HU, with values in the human body reaching up to about 2000 HU (cortical bone). Therefore, in contrast to most standard images where pixel intensities themselves might not be meaningful, the actual grey values of CT volumes carry semantic information related to the nature of CT scanning , and special consideration is required to leverage it.
This also means that CT data typically contains a range of values that are not necessarily relevant for a particular diagnostic question [6, 9]. Thus, when radiologists inspect CT volumes for diagnosis, they typically rely on windowing, i.e. restricting the range of displayed grey values, to focus the image information on the relevant value range. CNN-based image segmentation frameworks rarely include such potentially essential steps from the expert workflow, assuming that the data only has to be normalized and that the network then learns autonomously to focus on the relevant image regions.
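The radiological windowing described above amounts to a simple linear mapping followed by clipping. The sketch below illustrates it; the window center/width values in the usage example are common abdominal soft-tissue defaults, not settings from this work:

```python
import numpy as np

def apply_window(hu, center, width):
    """Radiological grey-value windowing: map HU values inside
    [center - width/2, center + width/2] linearly to [0, 1] and clip.
    `center`/`width` depend on the diagnostic question (e.g. roughly
    center=40 HU, width=400 HU for abdominal soft tissue)."""
    lo = center - width / 2.0
    hi = center + width / 2.0
    return np.clip((hu - lo) / (hi - lo), 0.0, 1.0)
```

For example, with a (40, 400) window, air at -1000 HU saturates to 0 while dense bone saturates to 1, concentrating the dynamic range on soft tissue.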
In this paper, we present a framework for CNN-based image segmentation of CT data, which addresses the challenges of a clinically meaningful CT volume processing. The proposed framework is inspired by insights into both the data acquisition process and the diagnostic process performed by the radiologist, addressing in particular the spatial information of CT volumes and the use of the HU scale.
The focus is not on the optimization of the loss function on the whole dataset, but rather on obtaining a robust segmentation quality, independent of the differences in size and shape of the input volumes. For this reason, we also consider the standard deviation of the Dice score for evaluation. If a segmentation model is used in an automated or semi-automated process in which the result of the segmentation is not directly inspected, particularly severe segmentation errors pose a problem, because the user tends to rely on the segmentation model and only analyzes the final result of the process. Therefore, our goal is to specifically address the demands on algorithms for CT processing in the clinical environment, where algorithms are required to process each volume consistently and without strong differences in output quality.
We evaluated the framework with a mixed-scale dense convolutional neural network (MS-D Net)  with dilated convolutions and the nnU-Net , a modified U-Net  with traditional scaling operations. For each architecture, both a 2D-CNN and a 3D-CNN implementation are considered. Finally, we present an ensemble CNN that combines the longitudinal information leveraged by 3D-CNNs with the proportionally higher value of each segmented voxel in the 2D-CNN training process, which should yield more accurate results from a theoretical point of view.
For testing the robustness of the trained models, the folds are mixed so that the test set always contains cases that are independent of the training data. This simulates worst-case scenarios for the application in the clinical environment. In order to make the results comprehensible and reproducible, we use open datasets for training and evaluation: the CNN models for kidney tumor segmentation are trained and validated on the dataset of the 2019 Kidney Tumor Segmentation Challenge , and the liver segmentation models are trained and validated on the dataset of the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge .
In the authors’ perspective, the rise of deep learning methods in medical image processing seems to have split the community into two factions: those who use such methods and those who do not, or who even consider CNNs a temporary hype. With this work we want to show that clinically applicable CNN-based frameworks require the cooperation of different areas of expertise, and we hope to motivate a reconciliation of the two factions.
In the following, we describe the data preprocessing and augmentation in section 2.1, the network architectures in section 2.2 and the training procedure in section 2.3. The preprocessing includes volume shape reduction and grey-value windowing. The proposed augmentation addresses the scarcity of data, with the aim of providing additional samples for the training procedure. For the CNN architectures we consider two models: one with dilated convolutions (MS-D) and one with traditional scaling operations (U-Net). We further explain the construction of the stacked CNN model. Subsequently, in section 2.3 the training procedure for the two considered architectures is described.
2.1 Preprocessing and Augmentation
In order to ensure an adequate data quality in the training process for each model, we adapt the data preprocessing and augmentation for CT data. The following description of preprocessing is adapted to the dataset of the KiTS Kidney Tumor Segmentation Challenge  and the dataset of the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge , but can be applied to any other CT dataset with minor changes.
2.1.1 Image Preprocessing
The image normalization is adapted from  to make it more general and to enable a more realistic normalization for real-life applications.
To reduce the complexity and optimize the dynamic range, we apply a windowing to each volume by clipping the voxels' grey values to the (0.6, 0.99) percentile range. This represents the range that a radiologist would use for decision-making, but is slightly larger, since no dynamic correction based on the respective CT volume is feasible. For other segmentation problems, this percentile range must be adjusted depending on the relevant body parts (examples are shown in Figure 1). The windowing is followed by a z-score normalization based on the mean and the standard deviation, which are calculated from the foreground of a random sample of the dataset; the mean and standard deviation of the full dataset would not reflect the clinical environment.
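A minimal sketch of the described percentile windowing followed by z-score normalization. In the paper the statistics come from the foreground of a random dataset sample; here, as a simplifying assumption, they default to the windowed volume itself when not supplied:

```python
import numpy as np

def preprocess_volume(volume, low_q=0.6, high_q=0.99,
                      sample_mean=None, sample_std=None):
    """Clip a CT volume to the (low_q, high_q) percentile window,
    then z-score normalize.

    `sample_mean`/`sample_std` stand in for statistics computed from
    the foreground of a random sample of the dataset (hypothetical
    parameter names, not from the paper)."""
    lo = np.quantile(volume, low_q)
    hi = np.quantile(volume, high_q)
    windowed = np.clip(volume, lo, hi)          # percentile windowing
    if sample_mean is None:
        sample_mean = windowed.mean()
    if sample_std is None:
        sample_std = windowed.std()
    return (windowed - sample_mean) / sample_std  # z-score normalization
```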
In order to save costs and time in CT volume acquisition, a CT volume is usually targeted to the region of interest (Figure 2). However, the region of interest is typically defined liberally, so that no area of potential interest is missed, which would otherwise force a repetition of the scan. This requirement results in different CT acquisition policies and consequently in CT volumes with varying numbers of slices. This poses a challenge to the application of CNNs, typically tackled by standardizing the number of slices. We decided to reduce each volume to a size of 16 slices, which eliminates the need to upsample volumes that contain only a few slices. The selection of the slices is random and can be repeated several times per volume, enabling a simultaneous augmentation effect. Background slices are excluded during the training phase, since they are also ignored in the test phase. A higher number of slices did not lead to beneficial effects in our experiments, which is consistent with the observation that most CNNs only use a small semantic context for decision finding [11, 19].
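The random slice selection could look like the following sketch; identifying background slices by an empty label mask, and sampling with replacement when too few foreground slices exist, are our assumptions:

```python
import numpy as np

def sample_slices(volume, labels, n_slices=16, rng=None):
    """Randomly draw `n_slices` non-background slices from a CT volume.

    A slice counts as background when its label mask is empty; the
    sampling can be repeated per volume for an additional augmentation
    effect. Assumes `volume`/`labels` are (slices, H, W) arrays."""
    rng = rng if rng is not None else np.random.default_rng()
    # indices of slices whose label mask contains any foreground voxel
    foreground = np.flatnonzero(
        labels.reshape(labels.shape[0], -1).any(axis=1))
    # sample with replacement when fewer foreground slices than requested
    replace = len(foreground) < n_slices
    idx = np.sort(rng.choice(foreground, size=n_slices, replace=replace))
    return volume[idx], labels[idx]
```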
In order to save memory, the dimension of each slice was downsampled from a size of 512 x 512 voxels to a size of 128 x 128 voxels. In our experiments larger slice sizes did not result in any advantages in terms of segmentation performance and only resulted in higher computing costs.
2.1.2 Image Augmentation
In addition to the slice-number standardization mentioned above, we also used image noising with a normally distributed noise map, slice skipping, slice interpolation and a range shift to address potential variations in the CT acquisition process (Figure 2). We further rotated the images to simulate the variability in patient positioning, which cannot be entirely excluded despite fixation.
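The listed augmentations, apart from rotation (which would typically use a standard image-rotation routine), can be sketched as follows; all magnitudes and probabilities here are illustrative assumptions, not the paper's settings:

```python
import numpy as np

def augment_volume(volume, rng=None, shift_prob=0.2):
    """CT-specific augmentation sketch: normally distributed noise map,
    slice skipping, slice interpolation and an occasional grey-value
    range shift. Operates on a normalized (slices, H, W) volume."""
    rng = rng if rng is not None else np.random.default_rng()
    vol = volume + rng.normal(0.0, 0.05, size=volume.shape)  # noise map
    i = int(rng.integers(1, vol.shape[0] - 1))
    vol = np.delete(vol, i, axis=0)                          # slice skipping
    j = int(rng.integers(1, vol.shape[0] - 1))
    vol[j] = 0.5 * (vol[j - 1] + vol[j + 1])                 # slice interpolation
    if rng.random() < shift_prob:
        vol = vol + rng.uniform(-0.1, 0.1)                   # range shift
    return vol
```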
The first architecture we consider is a modified U-Net architecture called nnU-Net . This architecture is the native U-Net architecture  with instance normalization  instead of batch normalization  and LeakyReLUs of slope 1e-2  instead of ReLUs. The second architecture we take into account is a modified mixed-scale dense convolutional neural network (MS-D Net) , which is modified in the same way as the U-Net in order to exclude effects of the applied activation function. The model comparison is intended to demonstrate the independence of our framework from the feature extraction approach implemented in the CNN. Therefore, we compare traditional scaling operations with dilated convolutions as part of the evaluation of the framework.
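The two shared modifications can be sketched in a few lines of numpy; the channels-last layout and the epsilon value are our assumptions:

```python
import numpy as np

def instance_norm(x, eps=1e-5):
    """Instance normalization: statistics per sample and per channel,
    computed over the spatial axes only (unlike batch normalization,
    which pools over the batch). Assumes x has shape (batch, H, W, C)."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    var = x.var(axis=(1, 2), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def leaky_relu(x, alpha=1e-2):
    """LeakyReLU with the slope 1e-2 used in the modified architectures:
    negative inputs are scaled by alpha instead of being zeroed."""
    return np.where(x >= 0, x, alpha * x)
```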
In clinical diagnosis, the localization of tumors and relevant adjacent structures is performed not only by examining the respective slice, but also the previous and subsequent slices for increased spatial information. Based on this consideration, a 3D CNN might seem the preferable choice to avoid losses of spatial information. However, previous work has shown that 3D segmentation methods perform worse than 2D alternatives on anisotropic data [1, 13]. Another reason why medical image segmentation with 3D CNNs often proves challenging is the variability in the number of slices per volume, which depends mainly on factors like the body region under investigation, the diagnostic question, the variability in subjects’ size and acquisition protocol considerations (typically a trade-off between minimizing scanning time and radiation exposure while maximizing data quality).
CNN ensembles have shown superior performance for several detection tasks [8, 15, 25]. Based on these achievements, we combine different models into a single, stacked CNN model to leverage the individual strengths of each architecture for the described task. In the case of the kidney tumor segmentation, the stacked CNN consists of a set of 3D MS-D Nets, which are trained to detect the kidney without a distinction between healthy kidney tissue and tumor tissue, and several 2D nnU-Nets, which are trained to segment all three classes (healthy tissue, tumor and background). For the liver segmentation, both models are trained to detect the liver, without the need to handle a class distinction.
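A minimal sketch of how such a stacking could fuse the two model families for the kidney-tumor case; the array layout and the simple masking rule are our illustrative assumptions, not the paper's exact stacking procedure:

```python
import numpy as np

def stack_predictions(organ_mask_3d, class_probs_2d):
    """Combine a 3D organ mask with 2D class probabilities (sketch).

    `organ_mask_3d`: binary kidney-vs-background mask from the 3D
    MS-D Nets, shape (slices, H, W). `class_probs_2d`: per-slice
    softmax output of the 2D nnU-Nets, shape (slices, H, W, 3) for
    background / healthy tissue / tumor. Outside the 3D organ mask
    everything is forced to background; inside, the 2D class decision
    is kept."""
    labels = class_probs_2d.argmax(axis=-1)
    return np.where(organ_mask_3d.astype(bool), labels, 0)
```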
All described networks were trained independently and from scratch. The training procedure is shown in Algorithm 1. The procedure is implemented in Python with TensorFlow 1.14 and performed on an IBM Power System Accelerated Compute Server (AC922) with two NVIDIA Tesla V100 GPUs, which allowed us to parallelize the experiments; the considered approach also runs on a system with an NVIDIA GTX 1080.
In each epoch, the volumes of a randomly selected batch are pre-processed and augmented, whereby the slice augmentation effect mentioned above is also achieved through the normalization steps (line 12). For the 2D modified U-Net and the 2D modified MS-D Net a batch size of 28 is used, while the 3D modified U-Net and the 3D modified MS-D Net are fitted on single volumes. For the 3D segmentation, 80 percent of the training batches are augmented; for the training of the 2D models, 90 percent of the data are augmented, while the range shift is only applied to 20 percent of the data.
For the weight update of the model function, which is a modified U-Net or a modified MS-D Net, we used the ADAM optimization  with the proposed parameter configuration and optimized the loss function (line 13 in Algorithm 1), which is a weighted combination of the Tanimoto loss and the categorical cross-entropy . The weighting of the two loss terms is configured identically for both architectures; both weights were determined experimentally. The Tanimoto loss is implemented as shown in equation 1, comparing the predicted voxel-wise annotations to the ground-truth voxel-wise annotations, and includes a smooth factor. Similar to the well-known Dice score, the Tanimoto coefficient treats each class independently; the Tanimoto loss is therefore particularly suitable for problems with a high class imbalance. However, this leads to a class-wise maximum error if a class does not occur in the sample, which is attenuated by the smooth factor. A more detailed discussion is given in .
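A common formulation of the smoothed Tanimoto loss, consistent with the description above, is the following; the exact symbols and smooth-factor value of equation 1 may differ in detail:

```latex
% Per-class smoothed Tanimoto loss over predicted probabilities p_i
% and one-hot ground truth g_i, with smooth factor s:
\mathcal{L}_{T}(p, g) \;=\; 1 \;-\;
\frac{\sum_{i} p_i\, g_i + s}
     {\sum_{i} p_i^{2} + \sum_{i} g_i^{2} - \sum_{i} p_i\, g_i + s}
```

For a class absent from both prediction and ground truth, all sums vanish and the ratio reduces to s/s = 1, so the smooth factor prevents the class-wise maximum error mentioned above.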
For the validation of our framework, we compared our augmentation to the multidimensional (2D and 3D) image augmentation for TensorFlow developed by Google DeepMind . Since the normalization and especially the windowing of the CT volume have a strong influence on the cropping and the selection of the slices, we do not conduct a comparative experiment for the preprocessing: such an experiment would not isolate the influence of the data preprocessing on the performance of the neural network from the influence of the data used in training and evaluation. Both architectures are trained for each data dimensionality and evaluated in a 5-fold cross-validation. However, the data were sorted by the number of slices, so the models are validated on CT volumes that do not occur in a similar form in the training dataset. The predictions are evaluated volume-wise using the Dice score as shown in equation 2, with the same notation as in equation 1. Table 1 shows the results averaged over all volumes and all cross-validation folds for the kidney tumor segmentation, and Table 2 shows the results for the liver segmentation.
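The volume-wise Dice score of equation 2 can be sketched as follows for the binary case; the small `eps` guard against empty masks is our addition:

```python
import numpy as np

def dice_score(pred, truth, eps=1e-8):
    """Volume-wise Dice score 2|P ∩ G| / (|P| + |G|) for binary masks;
    `eps` avoids division by zero when both masks are empty."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum() + eps)
```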
The results show that the models trained with CT-specific image augmentation do not show significant differences in prediction performance, but a significantly lower standard deviation due to more stable predictions. Additionally, the results suggest that the use of 3D spatial information does not necessarily lead to better segmentation performance, confirming the related work on the worse segmentation performance of 3D segmentation methods on anisotropic data.
However, the results for the 3D MS-D Net show fewer background errors in the case of kidney tumor segmentation, which is reflected by the relatively high total Dice score combined with lower class-wise scores. This means that for this architecture the whole object (kidney and tumor) is detected well, but the class distinction works comparatively poorly.
In the case of liver segmentation, we assume that the MS-D Net generally produces more segmentation errors, but in a different form than the segmentation errors of the other approaches. In particular, slices with only small or rare expressions of the region of interest (shown in Figure 3) pose a challenge to 2D CNNs. Since the errors of the MS-D Net differ from the segmentation errors of the nnU-Net in both cases, merging them into a stacked CNN leads to a better Dice score, as it balances the strengths and weaknesses of the different models.
Based on these experiments, we constructed the stacked CNN. It consists of a set of 3D MS-D Nets and a set of 2D nnU-Nets, which are trained with CT-specific image augmentation. Each set contains the top-5 models, selected based on the validation score.
[Table 1: Kidney tumor segmentation — Dice scores for nnU-Net and MS-D Net, each trained with multidimensional image augmentation (MIA) and with CT-specific image augmentation (CTIA), in 2D and 3D variants.]
This approach yields (i) more stable predictions due to the combinations of different model outputs per architecture and (ii) superior results for the prediction of all classes due to the combination of two architectures with different individual strengths.
[Table 2: Liver segmentation — Dice scores for nnU-Net and MS-D Net, each trained with multidimensional image augmentation (MIA) and with CT-specific image augmentation (CTIA), in 2D and 3D variants.]
In this work, we propose a machine learning framework for medical image segmentation addressing the specific demands of CT images for clinical applications, with respect to preprocessing and data augmentation. We systematically evaluated this framework for two different state-of-the-art CNN architectures and for different input dimensionalities.
In these experiments, we showed that 3D spatial information does not necessarily lead to positive effects with respect to segmentation performance for fine image structures, which is in line with previous findings [1, 13]. On the other hand, the 3D MS-D Net showed better segmentation results for the background class.
The results suggest that an ensemble-based approach is an effective way to combine the complementary strengths of the different combinations of network model and input dimensionality. We showed that a stacked CNN model indeed outperformed all other approaches considered in this work, as it could leverage the benefits of both architectures by learning a combination of a top-5 selection from each model. We have also shown that both CNN architectures exhibit similar performance differences for different input data, once more highlighting the crucial role of data preparation and of the proposed framework.
Our work addresses central methodological challenges in automated segmentation of CT volumes for medical use, where accurate and reliable organ and tumor segmentation is of utmost importance. Existing clinical nephrometry scores have a poor predictive power  and massively reduce the underlying information contained in CT volumes. The improved parameterization of kidney tumors through a more efficient, objective and reliable segmentation, should allow for better clinical evaluation, better prediction of clinical outcomes, and ultimately to a better treatment of the underlying pathology.
-  Baumgartner, C.; Koch, L.; Pollefeys, M.; Konukoglu, E.: An Exploration of 2D and 3D Deep Learning Techniques for Cardiac MR Image Segmentation. 2017. arXiv:1709.04496 [cs.CV].
-  Broder, J.: Chapter 9 - Imaging of Nontraumatic Abdominal Conditions, in Diagnostic Imaging for the Emergency Physician. p. 445-577. Elsevier. 2011.
-  Brenner, D. J.; Hall, E. J.: Computed Tomography — An Increasing Source of Radiation Exposure. New England Journal of Medicine, 357(22), 2277–2284. doi:10.1056/nejmra072149. 2007.
-  Chlebus, G.; Schenk, A.; Moltz, J.; van Ginneken, B.; Hahn, H.: Automatic liver tumor segmentation in CT with fully convolutional neural networks and object-based postprocessing. Nature. Scientific Reports 8. Article number: 15497. 2018.
-  Cicek, O.; Abdulkadir, A.; Lienkamp, S.; Brox, T.; Ronneberger, O.: 3D U-Net: Learning Dense Volumetric Segmentation from Sparse Annotation. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, LNCS, Vol. 9901: 424–432. 2016.
-  Costelloe, C. M.; Chuang, H.; Chasen, B.; Pan, T.; Fox, P.; Bassett, R.; Madewell, J.: Bone Windows for Distinguishing Malignant from Benign Primary Bone Tumors on FDG PET/CT. J Cancer, 4(7):524-530. doi:10.7150/jca.6259. 2013.
-  DeepMind Health Research Team: Multidimensional (2D and 3D) Image Augmentation for TensorFlow. 2018. https://github.com/deepmind/multidim-image-augmentation/blob/master/doc/index.md. 2019-04-30.
-  Dolz, J.; Desrosiers, C.; Wang, L.; Yuan, J.; Shen, D; Ayed, I.: Deep CNN ensembles and suggestive annotations for infant brain MRI segmentation. 2017. arXiv:1712.05319 [cs.CV].
-  Harris, K.M.; Adams, H.; Lloyd, D.C.F.; Harvey, D.J.: The effect on apparent size of simulated pulmonary nodules of using three standard CT window settings. Clinical Radiology. Volume 47. Issue 4. p. 241-244. 1993.
-  Heller, N.; Sathianathen, N.; Kalapara, A.; Walczak, E.; Moore, K.; Heather Kaluzniak, H.; Rosenberg, J.; Blake, P.; Rengel, Z.; Oestreich, M.; Dean, J.; Tradewell, M.; Shah, A.; Tejpaul, R.; Edgerton, Z.; Peterson, M.; Raza, S.; Regmi, S.; Papanikolopoulos, N. ; Weight, C.: The KiTS19 Challenge Data: 300 Kidney Tumor Cases with Clinical Context, CT Semantic Segmentations, and Surgical Outcomes. 2019. arXiv:1904.00445 [q-bio.QM].
-  Hu, J.; Shen, L.; Albanie, S.; Sun, G.; Wu, E.: Squeeze-and-Excitation Networks. 2017. arXiv:1709.01507 [cs.CV].
-  Ioffe, S.; Szegedy, C.: Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. 2015. arXiv:1502.03167 [cs.LG].
-  Isensee, F.; Jaeger, P.; Full, P.; Wolf, I.; Engelhardt, S.; Maier-Hein, K.: Automatic Cardiac Disease Assessment on cine-MRI via Time-Series Segmentation and Domain Specific Features. 2017. arXiv:1707.00587 [cs.CV].
-  Isensee, F.; Petersen, J.; Klein, A.; Zimmerer, D.; Jaeger, P.; Kohl, S.; Wasserthal, J.; Koehler, G.; Norajitra T.; Wirkert, S.; Maier-Hein, K.: nnU-Net: Self-adapting Framework for U-Net-Based Medical Image Segmentation. 2018. arXiv:1809.10486 [cs.CV].
-  Kamnitsas, K.; Bai, W.; Ferrante, E.; McDonagh, S.; Sinclair, M.; Pawlowski, N.; Rajchl, M.; Lee, M.; Kainz, B.; Rueckert, D.; Glocker, B.: Ensembles of Multiple Models and Architectures for Robust Brain Tumour Segmentation. In: Brainlesion: Glioma, Multiple Sclerosis, Stroke and Traumatic Brain Injuries. 2018. Third International Workshop. BrainLes 2017. MICCAI 2017.
-  Kayalibay, B.; Jensen, G.; van der Smagt, P.: CNN-based Segmentation of Medical Imaging Data. 2017. arXiv:1701.03056 [cs.CV].
-  Kingma, D.; Ba, J.: Adam: A Method for Stochastic Optimization. 2014. arXiv:1412.6980 [cs.LG].
-  Krizhevsky, A.; Sutskever, I.; Hinton, G.: ImageNet Classification with Deep Convolutional Neural Networks. Advances in Neural Information Processing Systems 25 (NIPS 2012). p. 1097-1105. 2012.
-  LaLonde, R.; Bagci, U.: Capsules for Object Segmentation. 2018. arXiv:1804.04241 [stat.ML].
-  Maas, A.; Hannun, A.; Ng, A.: Rectifier nonlinearities improve neural network acoustic models. International Conference on Machine Learning (ICML). 2013.
-  Minnema, J.; van Eijnatten, M.; Kouw, W.; Diblen, F.; Mendrik, A.; Wolff, J.: CT image segmentation of bone for medical additive manufacturing using a convolutional neural network. Computers in Biology and Medicine, Volume 103, p. 130-139. 2018.
-  Moen, E.; Bannon, D.; Kudo, T.; Graf, W.; Covert, M.; Van Valen, D.: Deep learning for cellular image analysis. Nature Methods. https://doi.org/10.1038/s41592-019-0403-1. 2019.
-  Pelt, D.; Sethian, J.: A mixed-scale dense convolutional neural network for image analysis. PNAS, 115(2):254-259. https://doi.org/10.1073/pnas.1715832114. 2018.
-  Selver, A.; Uenal, G.; Dicle, O.; Gezer, S.; Baris, M.; Aslan, S.; Candemir, C.; Kavur, A.E.; Kazaz, E.: CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation. The IEEE International Symposium on Biomedical Imaging (ISBI). https://chaos.grand-challenge.org/. 2019.
-  Teramoto, A.; Fujita, H.; Yamamuro, O.; Tamaki, T.: Automated detection of pulmonary nodules in PET/CT images: Ensemble false-positive reduction using a convolutional neural network technique. Medical Physics, p. 2821-2827. 2016.
-  Ulyanov, D.; Vedaldi, A.; Lempitsky, V.: Instance Normalization: The Missing Ingredient for Fast Stylization. 2016. arXiv:1607.08022 [cs.CV].
-  Voulodimos, A.; Doulamis, N.; Doulamis, A.; Protopapadakis, E.: Deep Learning for Computer Vision: A Brief Review. Computational Intelligence and Neuroscience, Volume 2018, Article ID 7068349. https://doi.org/10.1155/2018/7068349. 2018.