A framework for CT image segmentation inspired by the clinical environment

Marie Kloenne⋆, Sebastian Niehaus⋆, Leonie Lampe, Alberto Merola, Janis Reinelt, Nico Scherf

AICURA medical, Bessemerstrasse 22, 12103 Berlin, Germany (firstname.lastname@aicura-medical.com)
Technische Fakultät, Universität Bielefeld, Universitätsstrasse 25, 33615 Bielefeld, Germany
Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstrasse 1a, 04103 Leipzig, Germany
Institute for Medical Informatics and Biometry, Carl Gustav Carus Faculty of Medicine, Technische Universität Dresden, Fetscherstrasse 74, 01307 Dresden, Germany

Computed tomography (CT) data poses many challenges to medical image segmentation based on convolutional neural networks (CNNs). The main issues arise during feature extraction, due to the large dynamic range of intensities and the varying number of recorded slices of CT volumes. In this paper we address these issues with a framework that combines domain-specific data pre-processing and augmentation with state-of-the-art CNN architectures. The focus is not only on score optimization but also on stabilizing the prediction performance, since stable predictions are a mandatory requirement for use in automated and semi-automated workflows in the clinical environment.
The framework is validated as part of an architecture comparison, to show that its effects are independent of the CNN architecture. This comparison includes a modified U-Net and a modified Mixed-Scale Dense Network (MS-D Net), so that dilated convolutions for parallel multi-scale processing are compared with the U-Net approach based on traditional scaling operations. Finally, in order to combine the superior recognition performance of 2D-CNN models with the more comprehensive spatial information of 3D-CNN models, we propose an ensemble model.
The framework performs successfully when tested on a range of tasks such as liver and kidney segmentation, without significant differences in prediction performance on strongly differing volume sizes and varying slice thickness.

Keywords: Medical image segmentation, Computed Tomography (CT), Kidney tumor segmentation, Liver segmentation
⋆ These authors contributed equally to this paper.

1 Introduction

Spatial characteristics of tumors like size, shape, location or growth pattern are central clinical features. Changes in these characteristics are important indicators for disease progression and treatment effects. Automated, quantitative assessment of these characteristics and their changes from radiological images would yield an efficient and objective tool for radiologists to monitor the disease course. Thus, a reliable and accurate automated segmentation method is desirable to extract spatial tumor and organ characteristics from computed tomography (CT) volumes.
In recent years, convolutional neural networks (CNNs) [18] have become the state-of-the-art method for image segmentation, as well as for many other tasks in computer vision [27], such as image classification, object detection and object tracking [22]. The applications of CNNs are diverse, but the general data handling is often very similar, since the feature extraction is performed internally by the CNN itself. Improvements in the application of CNNs often address the neural network architecture, the training algorithm or the use case [21, 4], while the data is typically handled in the same way as grayscale or RGB images, only with additional dimensions.

However, this approach neglects prior information about the specific physical processes by which these images or volumes are acquired and by which the image contrast is determined, possibly leading to an inaccurate or suboptimal image analysis. For instance, while most image formats map pixels on relative scales of a few hundred values, voxels in CT volumes are mapped on the Hounsfield scale [2], a quantitative mapping of radiodensity calibrated such that the value for air is -1000 Hounsfield Units (HU) and that for water is 0 HU, with values in the human body reaching up to about 2000 HU (cortical bone). Therefore, in contrast to most standard images where pixel intensities themselves might not be meaningful, the actual grey values of CT volumes carry semantic information related to the nature of CT scanning [3], and special consideration is required to leverage it.

This also means that CT data typically contains a range of values that are not necessarily relevant for a particular diagnostic question [6, 9]. Thus, when radiologists inspect CT volumes for diagnosis, they typically rely on windowing, i.e. restricting the range of displayed grey values, to focus the image information to relevant values. CNN-based image segmentation frameworks rarely include such potentially essential steps from the expert workflow, assuming that the data only has to be normalized and then the network learns autonomously to focus on the relevant image regions.

In this paper, we present a framework for CNN-based image segmentation in CT data, which addresses the challenges of clinically meaningful CT volume processing. The proposed framework is inspired by insights into both the data acquisition process and the diagnostic process performed by the radiologist, addressing in particular the spatial information of CT volumes and the use of the HU scale.
The focus is not on optimizing the loss function over the whole dataset, but rather on obtaining a robust segmentation quality that is independent of differences in the size and shape of the input volumes. For this reason, we also consider the standard deviation of the Dice score for evaluation. If a segmentation model is used in an automated or semi-automated process in which the segmentation result is not inspected directly, particularly strong segmentation errors pose a problem, because the user tends to rely on the segmentation model and only analyzes the final result of the process. Therefore, our goal is to specifically address the demands on algorithms for CT processing in the clinical environment, where algorithms are required to process each volume consistently and without strong differences in output quality.

We evaluated the framework with a mixed-scale dense convolutional neural network (MS-D Net) [23], which uses dilated convolutions, and with the nnU-Net [14], a modified U-Net [5] that uses traditional scaling operations. For each architecture, both a 2D-CNN and a 3D-CNN implementation are considered. Finally, we present an ensemble CNN that combines the longitudinal information leveraged by 3D-CNNs with the proportionally higher value of each segmented voxel in the training process of 2D-CNNs, which should, from a theoretical point of view, yield more accurate results.

For testing the robustness of the trained models, the folds are mixed so that there are always cases in the test set that are independent from the training data. This simulates worst case scenarios for the application in the clinical environment. In order to make the results comprehensible and reproducible, we use open datasets for training and evaluation: the CNN-models for kidney tumor segmentation are trained and validated on the dataset of the 2019 Kidney Tumor Segmentation Challenge [10] and the liver segmentation models are trained and validated on the dataset of the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge [24].

From the authors’ perspective, the rise of deep learning methods in medical image processing seems to have split the community into two factions: those who use such methods, and those who do not or who even consider CNNs a temporary hype. With this work we want to show that clinically applicable CNN-based frameworks require the cooperation of different fields of expertise, and to encourage a reconciliation of the two factions.

2 Method

In the following, we describe the data preprocessing and augmentation in section 2.1, the network architectures in section 2.2 and the training procedure in section 2.3. The preprocessing includes volume shape reduction and grey-value windowing. The proposed augmentation addresses the scarcity of data, with the aim of providing additional samples for the training procedure. For the CNN architectures we consider two models: one with dilated convolutions (MS-D) and one with traditional scaling operations (U-Net). We further explain the construction of the stacked CNN model. Subsequently, in section 2.3 the training procedure for the two considered architectures is described.

2.1 Preprocessing and Augmentation

In order to ensure an adequate data quality in the training process for each model, we adapt the data preprocessing and augmentation for CT data. The following description of preprocessing is adapted to the dataset of the KiTS Kidney Tumor Segmentation Challenge [10] and the dataset of the CHAOS - Combined (CT-MR) Healthy Abdominal Organ Segmentation Challenge [24], but can be applied to any other CT dataset with minor changes.

2.1.1 Image Preprocessing

The image normalization is adapted from [14] to make it more general and enable a more realistic normalization for real life applications.

To reduce the complexity and optimize the dynamic range, we apply a windowing to each volume by clipping the voxels' grey-value range to the (0.6, 0.99) percentile range. This corresponds to the range that a radiologist would use for decision-making, but is slightly larger, since no dynamic correction based on the respective CT volume is feasible. For other segmentation problems, this percentile range must be adjusted depending on the relevant body parts (examples are shown in Figure 1). The windowing is followed by a z-score normalization based on the mean and the standard deviation, which are calculated from the foreground of a random sample of the dataset. We do not use the full dataset for this, because its mean and standard deviation would not reflect the clinical environment.
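The windowing and normalization steps can be illustrated with a short NumPy sketch. This is a minimal sketch under stated assumptions, not the implementation used in this work: the function names are hypothetical, the (0.6, 0.99) range is interpreted as the 0.6 and 0.99 quantiles of each volume, and the foreground statistics are computed after windowing.

```python
import numpy as np

def window_volume(volume, lower_q=0.6, upper_q=0.99):
    """Clip grey values to the [lower_q, upper_q] quantile range of the volume.
    Interpreting (0.6, 0.99) as quantiles is an assumption of this sketch."""
    lo, hi = np.quantile(volume, [lower_q, upper_q])
    return np.clip(volume, lo, hi)

def foreground_stats(volumes, masks):
    """Mean and standard deviation over the foreground voxels of a data sample."""
    values = np.concatenate([v[m > 0] for v, m in zip(volumes, masks)])
    return values.mean(), values.std()

def zscore(volume, mean, std, eps=1e-8):
    """Z-score normalization with externally supplied statistics."""
    return (volume - mean) / (std + eps)

# Synthetic stand-in for a CT volume (slices, height, width) on the HU scale.
rng = np.random.default_rng(0)
volume = rng.integers(-1000, 2000, size=(20, 64, 64)).astype(np.float64)
mask = rng.random(volume.shape) > 0.9  # hypothetical foreground annotation

windowed = window_volume(volume)
mean, std = foreground_stats([windowed], [mask])
normalized = zscore(windowed, mean, std)
```

In practice the statistics would be estimated once from a random sample of training volumes and then reused for every new volume, which mirrors the situation at inference time in a clinical setting.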

In order to save costs and time in CT volume acquisition, a CT volume is usually targeted to the region of interest (Figure 2). However, the region of interest is typically defined liberally, so that no area of potential interest is missed, which would require the scan to be repeated. This requirement results in different CT acquisition policies and consequently in CT volumes with a varying number of slices. This poses a challenge to the application of CNNs, which is typically tackled by standardizing the number of slices. We decided to reduce each volume to a size of 16 slices. This eliminates the need to upsample volumes that contain only a few slices. The selection of the slices is random and can be repeated several times per volume, which yields an augmentation effect at the same time. Background slices are excluded during the training phase, since they are also ignored in the test phase. A higher number of slices did not lead to beneficial effects in our experiments, which is consistent with the observation that most CNNs only use a small semantic context for decision finding [11, 19].
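The slice reduction described above might look as follows. This is a sketch only: the function name is hypothetical, and whether the selected slices are contiguous, kept in anatomical order, or drawn with replacement for very short volumes is not stated in the text, so the choices below are assumptions.

```python
import numpy as np

def sample_slices(volume, labels, n_slices=16, rng=None):
    """Draw n_slices foreground slices at random and keep them in slice order.

    volume, labels: arrays of shape (slices, height, width).
    Assumption: slices are drawn without replacement unless the volume has
    fewer than n_slices foreground slices.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Indices of slices that contain at least one annotated voxel.
    foreground = np.flatnonzero(labels.reshape(labels.shape[0], -1).any(axis=1))
    if foreground.size == 0:
        raise ValueError("volume contains no foreground slice")
    replace = foreground.size < n_slices
    chosen = np.sort(rng.choice(foreground, size=n_slices, replace=replace))
    return volume[chosen], labels[chosen]

# Synthetic example: 40 slices, of which slices 10..35 contain foreground.
rng = np.random.default_rng(0)
volume = rng.normal(size=(40, 8, 8))
labels = np.zeros((40, 8, 8), dtype=np.uint8)
labels[10:36, 2:5, 2:5] = 1

sub_volume, sub_labels = sample_slices(volume, labels, rng=rng)
```

Calling the function repeatedly on the same volume returns different subsets, which is the augmentation effect mentioned in the text.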

Figure 1: Three examples for the use case oriented windowing (First row: Bone oriented windowing, Second row: Organ oriented windowing, Third row: Lung oriented windowing). The organ oriented windowing is applied in this work, while the other two examples would be used for the analysis of abnormalities in lung or bony structures in CT.

In order to save memory, the dimension of each slice was downsampled from a size of 512 x 512 voxels to a size of 128 x 128 voxels. In our experiments larger slice sizes did not result in any advantages in terms of segmentation performance and only resulted in higher computing costs.
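The in-plane downsampling from 512 x 512 to 128 x 128 corresponds to a factor of 4 per axis. The interpolation method is not specified in the text; the sketch below assumes block averaging for the image and strided sampling for the label map, and the function names are hypothetical.

```python
import numpy as np

def downsample_slices(volume, factor=4):
    """Average non-overlapping factor x factor blocks in every slice."""
    s, h, w = volume.shape
    if h % factor or w % factor:
        raise ValueError("slice size must be divisible by the factor")
    blocks = volume.reshape(s, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))

def downsample_labels(labels, factor=4):
    """Keep every factor-th voxel so that class indices stay integer-valued."""
    return labels[:, ::factor, ::factor]

rng = np.random.default_rng(0)
volume = rng.random((2, 512, 512))
labels = (volume > 0.5).astype(np.uint8)

small_volume = downsample_slices(volume)
small_labels = downsample_labels(labels)
```

Averaging is avoided for the labels because it would produce fractional class values.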

2.1.2 Image Augmentation

In addition to the standardization of the number of slices mentioned above, we also used image noising with a normally distributed noise map, slice skipping, slice interpolation and a range shift to address potential variation in the CT acquisition process (Figure 2). We further rotated the images to simulate the variability in patient positioning, which cannot be fully excluded despite fixation.
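A few of these augmentations can be sketched in NumPy. The function names and all parameter values (noise level, shift range, skipping step) are hypothetical, since the text does not state them. Rotation by arbitrary angles is left out because it requires an interpolation routine such as `scipy.ndimage.rotate`.

```python
import numpy as np

def add_gaussian_noise(volume, sigma, rng):
    """Add a normally distributed noise map with standard deviation sigma."""
    return volume + rng.normal(0.0, sigma, size=volume.shape)

def range_shift(volume, max_shift, rng):
    """Shift all grey values by one random offset in [-max_shift, max_shift]."""
    return volume + rng.uniform(-max_shift, max_shift)

def skip_slices(volume, step=2):
    """Keep every step-th slice, imitating a larger slice spacing."""
    return volume[::step]

def interpolate_slices(volume):
    """Insert the mean of each pair of neighbouring slices between them,
    imitating a smaller slice spacing."""
    out_shape = (2 * volume.shape[0] - 1,) + volume.shape[1:]
    out = np.empty(out_shape, dtype=np.float64)
    out[0::2] = volume
    out[1::2] = 0.5 * (volume[:-1] + volume[1:])
    return out

rng = np.random.default_rng(0)
volume = rng.random((5, 4, 4))

noisy = add_gaussian_noise(volume, sigma=0.05, rng=rng)
shifted = range_shift(volume, max_shift=0.1, rng=rng)
dense = interpolate_slices(volume)
sparse = skip_slices(dense, step=2)
```

In this sketch, skipping every second slice of an interpolated volume returns the original slices, which makes the two operations easy to check against each other.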

2.2 Architecture

The first architecture we consider is a modified U-Net architecture called nnU-Net [14]. This architecture is the native U-Net architecture [5] with instance normalization [26] instead of batch normalization [12] and LeakyReLUs of slope 1e-2 [20] instead of ReLUs. The second architecture we take into account is a mixed-scale dense convolutional neural network (MS-D Net) [23], which is modified in the same way as the U-Net. The MS-D Net is modified so that the choice of activation function does not affect the comparison. The model comparison is intended to demonstrate that our framework is independent of the feature extraction approach implemented in the CNN. Therefore, we compare traditional scaling operations with dilated convolutions as part of the evaluation of the framework.
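The building blocks named above can be written down in a few lines of NumPy for illustration. This is a sketch of the operations themselves, not of the networks: the function names are hypothetical, instance normalization is shown without learnable scale and shift, and the dilated convolution is a single-channel "valid" cross-correlation.

```python
import numpy as np

def instance_norm(features, eps=1e-5):
    """Normalize each channel of one sample separately.
    features: array of shape (channels, height, width)."""
    mean = features.mean(axis=(1, 2), keepdims=True)
    var = features.var(axis=(1, 2), keepdims=True)
    return (features - mean) / np.sqrt(var + eps)

def leaky_relu(x, slope=1e-2):
    """LeakyReLU with the slope of 1e-2 stated in the text."""
    return np.where(x >= 0, x, slope * x)

def dilated_conv2d(image, kernel, dilation=1):
    """Single-channel 'valid' cross-correlation with a dilated kernel.
    A larger dilation widens the receptive field without any downscaling."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - (kh - 1) * dilation
    out_w = image.shape[1] - (kw - 1) * dilation
    out = np.zeros((out_h, out_w))
    for i in range(kh):
        for j in range(kw):
            rows = slice(i * dilation, i * dilation + out_h)
            cols = slice(j * dilation, j * dilation + out_w)
            out += kernel[i, j] * image[rows, cols]
    return out

rng = np.random.default_rng(0)
features = 5.0 * rng.normal(size=(3, 8, 8)) + 2.0
normalized = instance_norm(features)
activated = leaky_relu(np.array([-2.0, 3.0]))

image = np.ones((9, 9))
kernel = np.ones((3, 3))
dense_out = dilated_conv2d(image, kernel, dilation=1)    # covers 3 x 3 pixels
dilated_out = dilated_conv2d(image, kernel, dilation=2)  # covers 5 x 5 pixels
```

With a dilation of 2, the same nine weights cover a 5 x 5 neighbourhood, which is how an MS-D Net gathers multi-scale context at full resolution, whereas a U-Net obtains it through pooling and upsampling.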

Figure 2: CT scanning configuration, which poses challenges to the application of CNNs. The upper representation shows the varying slice thickness, which means that the same region of interest can be mapped to a different number of slices. The lower representation shows the varying size of volumes depending on the chosen region of interest.

In clinical diagnoses, the localization of tumors and relevant adjacent structures is performed not only by examining the respective slice, but also previous and subsequent slices for increased spatial information. Based on this consideration a 3D CNN might seem a preferable choice in order to avoid spatial information losses. However, previous work has shown that 3D segmentation methods perform worse than 2D alternatives with anisotropic data [1, 13]. Another reason why medical image segmentation with 3D CNNs often proves challenging is the variability in number of slices per volume. The number of slices per volume varies mainly depending on factors like body region under investigation, diagnostic question, variability in subjects’ size and acquisition protocol considerations (typically a trade-off between minimizing scanning time and exposure to radiation while maximising data quality).

CNN ensembles have shown superior performance for several detection tasks [8, 15, 25]. Based on these achievements, we combine different models into a single, stacked CNN model in order to leverage the different strengths of each architecture for the described task. In the case of kidney tumor segmentation, the stacked CNN consists of a set of 3D MS-D Nets, which are trained to detect the kidney without distinguishing between healthy kidney tissue and tumor tissue, and several 2D nnU-Nets, which are trained to segment all three classes (healthy tissue, tumor and background). For the liver segmentation, both models are trained to detect the liver, without the need to handle a class distinction.

2.3 Training

All described networks were trained independently and from scratch. The training procedure is shown in Algorithm 1. The procedure is implemented in Python with TensorFlow 1.14 and was performed on an IBM Power System Accelerated Compute Server (AC922) with two NVIDIA Tesla V100 GPUs, which allowed us to parallelize the experiments. The approach also runs on a system with an NVIDIA GTX 1080.

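1: Initialize network with random weights
2: Initialize validation data
3: Initialize batch size
4: Assume standard deviation
5: Select windowing percentile
6: repeat
7:     repeat
8:         Select random volume
9:         Windowing(volume, percentile)
10:        Normalization(volume, standard deviation)
11:        Augmentation of volume
12:        Downsampling and slice reduction of volume
13:        Add volume to batch
14:    until Number of volumes in batch = batch size
15:    prediction = network(batch; weights)
16:    loss = weighted Tanimoto loss(prediction, ground truth) + weighted categorical cross-entropy(prediction, ground truth)
17:    weights = ADAM(loss, weights)
18:    validation score = Validate(network, validation data)
19: until Convergence of validation score
Algorithm 1 Training procedure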

In each epoch the volumes of a randomly selected batch are pre-processed and augmented, whereby the slice augmentation effect mentioned above is also achieved by the slice reduction step (line 12). For the 2D modified U-Net and the 2D modified MS-D Net a batch size of 28 is used, while the 3D modified U-Net and the 3D modified MS-D Net are fitted on single volumes. For 3D segmentation, we augmented 80 percent of the training batches, and for the training of the 2D models 90 percent of the data are augmented, while the range shift is only applied to 20 percent of the data.

For the weight update of the network, which is either a modified U-Net or a modified MS-D Net, we used ADAM optimization [17] with the proposed parameter configuration and optimized the loss function (line 16 in Algorithm 1), which is a weighted combination of the Tanimoto loss and the categorical cross-entropy. The same weighting of the two loss terms is used for both architectures, and both weights were determined experimentally. The Tanimoto loss (equation 1) is computed between the set of predicted voxel-wise annotations and the set of ground-truth voxel-wise annotations, and includes a smooth factor. Similar to the well-known Dice score, the Tanimoto coefficient treats each class independently; therefore, the Tanimoto loss is particularly suitable for problems with a high class imbalance. However, this leads to a class-wise maximum error if a class does not occur in the sample, which is attenuated by the smooth factor. A more detailed discussion is given in [16].
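For illustration, a combined loss of this kind can be sketched in NumPy. This is a sketch under assumptions, not the loss used in this work: it uses one common form of the Tanimoto coefficient, (sum(p*g) + s) / (sum(p^2) + sum(g^2) - sum(p*g) + s), averaged over classes, and the function names, the smooth factor `smooth` and the weights `w_tanimoto` and `w_ce` are placeholders rather than the experimentally determined values.

```python
import numpy as np

def tanimoto_loss(pred, truth, smooth=1e-5):
    """1 - mean class-wise Tanimoto coefficient.
    pred: class probabilities, truth: one-hot labels, both (voxels, classes)."""
    inter = (pred * truth).sum(axis=0)
    denom = (pred ** 2).sum(axis=0) + (truth ** 2).sum(axis=0) - inter
    return 1.0 - np.mean((inter + smooth) / (denom + smooth))

def categorical_crossentropy(pred, truth, eps=1e-7):
    """Mean voxel-wise categorical cross-entropy."""
    log_p = np.log(np.clip(pred, eps, 1.0))
    return -np.mean(np.sum(truth * log_p, axis=1))

def combined_loss(pred, truth, w_tanimoto=1.0, w_ce=1.0, smooth=1e-5):
    """Weighted sum of both terms; the weights here are placeholders."""
    return (w_tanimoto * tanimoto_loss(pred, truth, smooth)
            + w_ce * categorical_crossentropy(pred, truth))

# Four voxels, three classes; the third class does not occur in this sample.
truth = np.eye(3)[[0, 0, 1, 1]]
perfect = truth.copy()
uniform = np.full_like(truth, 1.0 / 3.0)

loss_perfect = combined_loss(perfect, truth)
loss_uniform = combined_loss(uniform, truth)
```

In the example, the third class is absent from both prediction and ground truth. Without the smooth factor its coefficient would be 0/0; with it, the coefficient becomes 1 and the absent class does not contribute an error.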


3 Evaluation

To validate our framework, we compared its augmentation with the multidimensional (2D and 3D) image augmentation for TensorFlow developed by Google DeepMind [7]. We do not conduct a separate comparative experiment for the pre-processing: since the normalization, and especially the windowing, of the CT volume has a strong influence on the cropping and the selection of the slices, such an experiment would show not only the raw influence of the data pre-processing on the performance of the neural network, but also the influence of the data used in training and evaluation. Both architectures are trained with each data dimensionality and are evaluated in a 5-fold cross-validation. The data were sorted according to the number of slices, so that the models are validated on CT volumes that do not occur in a similar form in the training dataset. The predictions are evaluated volume-wise using the Dice score (equation 2), with the same notation as in equation 1. Table 1 shows the results for the kidney tumor segmentation, averaged over all volumes and all cross-validation folds, and Table 2 shows the results for liver segmentation.
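The volume-wise evaluation can be sketched as follows, using the standard definition of the Dice score, 2|P ∩ G| / (|P| + |G|), for a predicted set P and a ground-truth set G. The function name and the convention of returning 1.0 when both sets are empty are assumptions of this sketch.

```python
import numpy as np

def dice_score(pred, truth):
    """Dice score between two binary masks of the same shape."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    total = pred.sum() + truth.sum()
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (assumption)
    return 2.0 * np.logical_and(pred, truth).sum() / total

# Toy "volumes": one partially correct and one perfect prediction.
predictions = [np.array([1, 1, 0, 0]), np.array([0, 1, 1, 0])]
ground_truths = [np.array([1, 0, 0, 0]), np.array([0, 1, 1, 0])]

scores = np.array([dice_score(p, g) for p, g in zip(predictions, ground_truths)])
mean_dice, std_dice = scores.mean(), scores.std()
```

Reporting the standard deviation alongside the mean is what allows the stability of the predictions across volumes to be compared, not only their average quality.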

Figure 3: Examples of challenging 2D segmentation cases for liver segmentation (top) and kidney tumor segmentation (bottom).

The results show that the models trained with CT-specific image augmentation do not differ significantly in prediction performance, but exhibit a significantly lower standard deviation, i.e. more stable predictions. Additionally, the results suggest that the use of 3D spatial information does not necessarily lead to a better segmentation performance, and they confirm the related work regarding the worse segmentation performance of 3D segmentation methods on anisotropic data.
However, the results for the 3D MS-D Net show fewer background errors in the case of kidney tumor segmentation, which is reflected by a relatively high total Dice score combined with lower class-wise scores. This means that for this architecture the whole object (kidney and tumor) is detected well, but the class distinction works comparatively poorly.

In the case of liver segmentation, we assume that the MS-D Net generally produces more segmentation errors, but in a different form than the segmentation errors of the other approaches. In particular, slices with only small or rare expressions of the region of interest (shown in Figure 3) pose a challenge to 2D CNNs. Since the errors of the MS-D Net are different from the segmentation errors of the nnU-Net in both cases, merging them into a stacked CNN leads to a better Dice score, as it balances the strengths and weaknesses of the different models.

Based on these experiments, we constructed the stacked CNN. The stacked CNN consists of a set of 3D MS-D Nets and a set of 2D nnU-Nets, which are trained with CT-specific image augmentation. Each set contains the top-5 models. The models are selected based on the validation score.
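How the outputs of the two model sets are merged is not detailed here. As a purely illustrative stand-in, the sketch below averages the foreground probabilities of the 3D models to obtain an organ mask, averages the class probabilities of the 2D models, and keeps the 2D class labels only inside the mask. The function name, the array layouts and the threshold are hypothetical, and a learned combination could differ substantially.

```python
import numpy as np

def stack_predictions(organ_probs_3d, class_probs_2d, threshold=0.5):
    """Merge two model sets by averaging (an assumption, not the paper's method).

    organ_probs_3d: (models, slices, height, width) organ-vs-background probabilities.
    class_probs_2d: (models, slices, height, width, classes) class probabilities,
                    with class 0 as background.
    """
    organ_mask = organ_probs_3d.mean(axis=0) >= threshold
    labels = class_probs_2d.mean(axis=0).argmax(axis=-1)
    # Outside the organ mask, everything is set to background.
    return np.where(organ_mask, labels, 0)

# Five models per set; the 3D models see the organ in the left half only.
organ = np.full((5, 2, 4, 4), 0.1)
organ[..., :2] = 0.9
classes = np.zeros((5, 2, 4, 4, 3))
classes[..., 2] = 1.0  # the 2D models predict class 2 everywhere

merged = stack_predictions(organ, classes)
```

The example shows the intended division of labour: the 3D models suppress background errors, while the 2D models decide the class inside the detected organ.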

Kidney Tumor Total
nnU-Net + MIA 2D
nnU-Net + CTIA 2D
nnU-Net + MIA 3D
nnU-Net + CTIA 3D
MS-D Net + MIA 2D
MS-D Net + CTIA 2D
MS-D Net + MIA 3D
MS-D Net + CTIA 3D
Stacked CNN
Table 1: Results for the kidney tumor segmentation: Dice scores (mean ± std.) for each segmentation class and in total, for the different architectures and input dimensionalities (2D and 3D). Each approach is validated with multidimensional (2D and 3D) image augmentation (MIA) and with CT-specific image augmentation (CTIA).

This approach yields (i) more stable predictions due to the combinations of different model outputs per architecture and (ii) superior results for the prediction of all classes due to the combination of two architectures with different individual strengths.

nnU-Net + MIA 2D
nnU-Net + CTIA 2D
nnU-Net + MIA 3D
nnU-Net + CTIA 3D
MS-D Net + MIA 2D
MS-D Net + CTIA 2D
MS-D Net + MIA 3D
MS-D Net + CTIA 3D
Stacked CNN
Table 2: Results for liver segmentation: Total Dice score (mean ± std.) for the different architectures and input dimensionalities (2D, 2D Multi-Channel (M.-C.) and 3D). As in the kidney tumor segmentation experiment, each approach is validated with multidimensional (2D and 3D) image augmentation (MIA) and with CT-specific image augmentation (CTIA).

4 Conclusion

In this work, we propose a machine learning framework for medical image segmentation addressing the specific demands of CT images for clinical applications, with respect to preprocessing and data augmentation. We systematically evaluated this framework for two different state-of-the-art CNN architectures and for different input dimensionalities.

In these experiments, we showed that 3D spatial information does not necessarily lead to positive effects with respect to segmentation performance for fine image structures, which is in line with previous findings [1, 13]. On the other hand, the 3D MS-D Net showed better segmentation results for the background class.

The results suggest that an ensemble-based approach is an effective way to combine the complementary strengths of the different combinations of network model and input dimension. We showed that a stacked CNN model indeed outperformed all other approaches considered in this work, as it could leverage the benefits of both architectures by learning a combination of a selection of the top models from each architecture. We have also shown that both CNN architectures exhibit similar performance differences for different input data, once more highlighting the crucial role of data preparation and of the proposed framework.

Our work addresses central methodological challenges in automated segmentation of CT volumes for medical use, where accurate and reliable organ and tumor segmentation is of utmost importance. Existing clinical nephrometry scores have a poor predictive power [10] and massively reduce the underlying information contained in CT volumes. The improved parameterization of kidney tumors through a more efficient, objective and reliable segmentation should allow for better clinical evaluation, better prediction of clinical outcomes, and ultimately a better treatment of the underlying pathology.

