Technical Considerations for Semantic Segmentation in MRI using Convolutional Neural Networks

Arjun D. Desai
Department of Radiology
Stanford University
Garry E. Gold
Departments of Radiology, Bioengineering, and Orthopedic Surgery
Stanford University
Brian A. Hargreaves
Departments of Radiology, Electrical Engineering, and Bioengineering
Stanford University
Akshay S. Chaudhari
Department of Radiology
Stanford University

High-fidelity semantic segmentation of magnetic resonance volumes is critical for estimating tissue morphometry and relaxation parameters in both clinical and research applications. While manual segmentation is accepted as the gold-standard, recent advances in deep learning and convolutional neural networks (CNNs) have shown promise for efficient automatic segmentation of soft tissues. However, due to the stochastic nature of deep learning and the multitude of hyperparameters in training networks, predicting network behavior is challenging. In this paper, we quantify the impact of three factors associated with CNN segmentation performance: network architecture, training loss functions, and training data characteristics. We evaluate the impact of these variations on the segmentation of femoral cartilage and propose potential modifications to CNN architectures and training protocols to train these models with confidence.


Submitted to Magnetic Resonance in Medicine

Preprint. Work in progress.

1 Introduction

Magnetic resonance imaging (MRI) provides high spatial resolution and exquisite soft tissue contrast, leading to its pervasive use for visualization of tissue anatomy. Using MR images for quantitative analysis of tissue-specific information is critical for numerous diagnostic and prognostic protocols. The gold-standard for high fidelity region annotation is manual tissue segmentation, which can be time-consuming and prone to inter-reader variations [1, 2]. Thus, there has always been great interest in developing fully-automated tissue segmentation techniques that are robust to small image variations [3, 4].

A common application is the segmentation of articular cartilage for studying changes associated with the onset and development of osteoarthritis (OA) [5, 6]. Recent advances in MRI have focused on developing non-invasive morphological and compositional biomarkers for tracking the onset and progression of OA. There is promising evidence suggesting that changes in cartilage morphology and composition may serve as early-OA biomarkers [7, 8]. Despite this potential, accurate measurement of cartilage morphology entails tedious manual segmentation of the fine structure of cartilage across hundreds of MRI slices and patients [9]. Though automatic segmentation of femoral cartilage is of great interest, the tissue’s thin morphology and low contrast with surrounding tissue structures make automatic segmentation challenging.

Traditional automatic segmentation approaches have utilized 3D statistical shape modeling or multi-atlas segmentations modulated by anisotropic regularizers [10, 11]. However, such techniques are highly sensitive to deformations in knee shape, which can be caused by variations in patient knee size and in the incidence and progression of pathology [12]. Advances in deep learning and convolutional neural networks (CNNs) have shown great potential for enhancing the accuracy of cartilage segmentation [13, 14]. However, due to the stochastic nature of deep learning and the multitude of training parameters (hyperparameters) that can be fine-tuned for any given problem, developing analytic estimations of network behavior is challenging [15, 16]. As a result, practical design choices for optimizing CNN performance for segmentation in MRI, especially for femoral cartilage segmentation, have been under-explored.

Often, CNN architectures are modified in the hope of increasing overall accuracy and generalizability while minimizing inference time. In the case of the popular ImageNet challenge [17] for natural image classification, classification accuracy and generalizability have varied considerably with changes in network architecture [18, 19]. Additionally, while 2D architectures have been effective at slice-by-slice segmentation of medical images, recent works have also utilized volumetric architectures, which take 3D volumes as inputs to potentially add through-plane (depth) contextual cues to improve segmentation [20, 21]. However, the extent to which network structure and input depth impact semantic segmentation in medical imaging remains unclear.

Variations in CNN training protocol may also affect network performance. As network weights are optimized with respect to the gradient of the training loss function, the selection of loss function may dictate network accuracy. In particular for segmentation, where foreground-background class imbalance is common, loss functions, such as weighted cross-entropy and soft Dice loss, are often chosen to minimize the impact of class imbalance [22, 23]. In addition, supervised CNN training requires both large training sets and corresponding high-fidelity segmentation masks, which are difficult to produce. In cases of limited training data, data augmentation is a common practice for artificially increasing variability of training data to reduce overfitting and promote generalizability [24, 25]. Moreover, MRI volumes can be acquired with varying fields of view (FOVs), resulting in different matrix sizes. A commonly reported advantage of fully convolutional network (FCN) CNN architectures is their ability to infer on images or volumes of arbitrary matrix sizes not specifically utilized during the training process [26].

In this study, we investigate three factors associated with the performance and generalizability of segmentation CNNs: network architecture, training loss functions, and data extent. While performance is quantified by traditional segmentation accuracy metrics, we also quantify the generalizability of a network using the sensitivity of applying trained networks to segment MR images at varying FOVs. All experiments were conducted on segmentation of femoral cartilage, a challenging localization problem and a relevant target for studying OA. We seek to quantify performance variations induced by these three factors to motivate how CNN segmentation models can be built, trained, and deployed with confidence.

2 Methods

2.1 Dataset

Data for this study were acquired from the Osteoarthritis Initiative (OAI), a longitudinal study of osteoarthritis progression [27]. 3D sagittal double echo steady state (DESS) datasets along with their corresponding femoral cartilage segmentation masks were utilized in this study (relevant scan parameters: FOV=14cm, Matrix=384×307 (zero-filled to 384×384), TE/TR=5/16ms, 160 slices with a thickness of 0.7mm) [27]. This dataset consisted of 88 patients with Kellgren-Lawrence (KL) OA grades between 1 and 4 [28], measured at two time points (baseline and 1 year) for a total of 176 segmented volumes. These patients were randomly split into cohorts of 60 patients for training, 14 for validation, and 14 for testing, resulting in 120, 28, and 28 volumes used during training, validation, and testing, respectively. An approximately equal distribution of KL grades was maintained among all three groups (Supporting Table S1).

2.2 Data Pre-processing

All DESS volumes in §2.1 were downsampled by a factor of 2 in the slice dimension (to dimensions of 384×384×80) prior to network training and inference to increase SNR and reduce computational complexity, justified by previous studies reporting that approximately 1.5mm slices are adequate for cartilage morphometry [9]. Images were downsampled using sum-of-squares combinations, and the corresponding masks were downsampled by taking the union of the masks to compensate for partial volume artifacts. The volume was then cropped in-plane to 288×288 by calculating the center of mass (COM) and centering the cropped region 50 pixels in the superior (up) direction and 20 pixels in the anterior (left) direction to bias the COM to femoral cartilage and away from the tibia and posterior muscles. The volume was then cropped to remove 4 slices from both the medial and lateral ends, resulting in volumes of dimensions 288×288×72. All scans were subsequently volumetrically zero-mean whitened (μ=0, σ=1) by subtracting the image volume mean and scaling by the image volume standard deviation.
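
The pre-processing steps above can be sketched as follows. This is an illustrative NumPy re-implementation (the original pipeline was written in MATLAB); the exact COM computation and crop-offset conventions are simplified assumptions.

```python
import numpy as np

def preprocess(volume, mask):
    """Illustrative sketch of Sec. 2.2 pre-processing for a (384, 384, 160)
    DESS volume and its binary femoral cartilage mask."""
    # 1. Downsample the slice dimension 2x: sum-of-squares for images,
    #    union for masks (to compensate for partial volume artifacts).
    vol = np.sqrt(volume[..., 0::2] ** 2 + volume[..., 1::2] ** 2)   # (384, 384, 80)
    msk = np.maximum(mask[..., 0::2], mask[..., 1::2])

    # 2. Crop in-plane to 288x288 around the mask center of mass, shifted
    #    50 px superior (up) and 20 px anterior (left) toward femoral cartilage.
    rows, cols = np.nonzero(msk.any(axis=2))
    r0 = np.clip(int(rows.mean()) - 50 - 144, 0, vol.shape[0] - 288)
    c0 = np.clip(int(cols.mean()) - 20 - 144, 0, vol.shape[1] - 288)
    # Also drop 4 slices from the medial and lateral ends -> (288, 288, 72).
    vol = vol[r0:r0 + 288, c0:c0 + 288, 4:-4]
    msk = msk[r0:r0 + 288, c0:c0 + 288, 4:-4]

    # 3. Volumetric zero-mean, unit-variance whitening.
    vol = (vol - vol.mean()) / vol.std()
    return vol, msk
```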

All data pre-processing was conducted using MATLAB (MathWorks, Natick, MA). These training, validation, and testing sets were used for all experiments unless otherwise indicated.

Figure 1: Three encoder-decoder fully convolutional network architectures adopted for femoral cartilage segmentation. U-Net (A) uses skip connections to concatenate weights from the encoder to the decoder while SegNet (B) passes pooling indices to the decoder to reduce computational complexities of weight concatenation. Unlike traditional encoder-decoder architectures, DeeplabV3+ (C) uses spatial pyramid pooling and atrous (dilated) convolutions to extract latent feature vectors at multiple fields of view.

2.3 Network Architecture

In this experiment, we evaluated the sensitivity of the semantic segmentation task to different popular CNN architectures. We selected three general 2D FCN architectures for analysis: U-Net, SegNet, and DeeplabV3+ [29, 30, 31]. These FCN architectures utilize variations of the encoder-decoder model to extract features at different spatial fields of view for semantic segmentation.

The U-Net architecture implements an encoder-decoder model using max-pooling and transpose convolutions to downsample and upsample feature maps (Figure  1a). In this structure, the number of network filters increases exponentially as a function of network depth. The U-Net also relies on deep skip connections by concatenating encoder outputs to the decoding layers in order to share spatial cues between the two and to propagate the loss efficiently at different network depths [32, 33]. SegNet uses a similar encoder-decoder architecture but passes pooling indices to upsampling layers to avoid the overhead of copying encoder weights (Figure  1b). In contrast to using max-pooling to promote spatial invariance and to downsample feature maps, DeeplabV3+ implements "Xception" blocks [34] and spatial pyramid pooling with dilated convolutions to capture a larger receptive field without increasing the parameter size (Figure  1c). Instead of transposed convolutions, the DeeplabV3+ decoder uses bilinear upsampling to upsample the features to the input image size. While the U-Net and SegNet have shown promise for musculoskeletal MRI semantic segmentation [13, 14], DeeplabV3+ has been primarily utilized for natural image segmentation [35] and has seen limited use in segmentation of medical images.

As a baseline comparison, all architectures were trained for 100 epochs and subsequently fine-tuned for 100 epochs following training hyperparameters detailed in Table  1.

Architecture   BS   Initial LR   LR step-decay (DF, DP)   Optimizer   Initialization
U-Net          35   1e-2         0.8, 1                   Adam        He
SegNet         15   1e-3         N/A                      Adam        He
DeeplabV3+     12   1e-4         N/A                      Adam        He

BS, mini-batch size; LR, learning rate; DF, drop factor; DP, drop period (epochs)

Table 1: Default hyperparameters used for network training.

2.4 Volumetric Architectures

In this experiment, we trained 2.5D [36] and 3D U-Net architectures for femoral cartilage segmentation. The 2.5D network uses a stack of contiguous slices in a scan to generate a segmentation mask for the central slice (additional details are described in the supplemental information). Three 2.5D networks with inputs of thickness t=3, 5, and 7 slices were trained.

In contrast, a 3D network outputs a segmentation on all slices. As all operations are applied in 3D, the 2× max-pooling applied in the through-plane direction constrains the input to have a multiple of 2^P slices, where P refers to the number of pooling steps. To maintain an identical number of pooling steps as the 2D and 2.5D networks (P=5), the 3D U-Net was trained using 32 slices of the volume as an input (2^5=32). As a result, the scans in the training dataset described in §2.2 were also cropped by an additional 4 slices from the medial and lateral ends, resulting in volumes with 64 slices. Memory constraints of the hardware necessitated that this volume be further divided into two 3D subvolumes of size 288×288×32 and that the batch size be reduced to 1. The 2D and 2.5D networks had an exponentially increasing number of filters ranging from 32–1024, while the 3D network had filters ranging from 16–512 to accommodate the same network depth. All networks maintained a comparable number of weights (Supporting Figure S2).
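
The construction of 2.5D inputs described above can be sketched as follows; `extract_25d_stacks` is a hypothetical helper assuming slice-last volumes, not code from the original implementation.

```python
import numpy as np

def extract_25d_stacks(volume, t):
    """Build 2.5D inputs from a (H, W, D) volume: each input is a stack of
    t contiguous slices (t odd, e.g. 3, 5, 7), and the network predicts the
    segmentation mask of the center slice only."""
    half = t // 2
    stacks = []
    # Edge slices without a full t-slice neighborhood are skipped here;
    # padding strategies are an implementation choice.
    for k in range(half, volume.shape[2] - half):
        stacks.append(volume[..., k - half:k + half + 1])
    return np.stack(stacks)  # (D - t + 1, H, W, t)
```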

2.5 Loss Function

As the trainable parameters in a network update with respect to the loss function, we hypothesize that a relevant loss function is critical for any learning task. Traditionally, binary cross-entropy (BCE) has been used for binary classification tasks. However, class imbalance has been shown to limit optimal performance in cases of general cross-entropy losses [20]. We selected three additional loss functions commonly used for segmentation in cases of class imbalance for comparison: soft Dice loss [37], weighted cross-entropy (WCE), and focal loss (γ=3) [38], as described further in the supplementary information.
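
For reference, the soft Dice and focal losses can be written compactly in NumPy. These are the generic formulations (Milletari et al. for soft Dice, Lin et al. for focal loss), not necessarily the exact implementations used in this study.

```python
import numpy as np

def soft_dice_loss(y_true, y_pred, eps=1e-7):
    """Soft Dice loss: 1 - (2 * intersection) / (sum of volumes), computed
    directly on probability maps so it is differentiable."""
    inter = np.sum(y_true * y_pred)
    return 1.0 - (2.0 * inter + eps) / (np.sum(y_true) + np.sum(y_pred) + eps)

def focal_loss(y_true, y_pred, gamma=3.0, eps=1e-7):
    """Binary focal loss; gamma=3 (as in this study) down-weights the loss
    contribution of easy, confidently classified pixels."""
    y_pred = np.clip(y_pred, eps, 1 - eps)
    pt = np.where(y_true == 1, y_pred, 1 - y_pred)  # prob. of the true class
    return np.mean(-((1 - pt) ** gamma) * np.log(pt))
```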

In this experiment, four models using the U-Net architecture were trained using the four loss functions described above with the training, validation, and testing sets described in §2.1.

Figure 2: An example augmented, final 2D slice is generated from an original 2D slice by sequentially applying four feasible transformation factors: scaling, shearing, gamma, and motion. Parameters for all four factors are sampled uniformly at random.

2.6 Data Augmentation

To quantify the effect of data augmentation on model generalizability, we trained the standard U-Net architecture with and without augmented training data.

Each 2D slice and corresponding mask in the training volume were randomly augmented to add heterogeneity to the training dataset. The augmentation procedure consisted of sequential transformations of the original image and masks with: 1. zooming (between 0-10%), 2. shearing (between -15° and 15° in both the horizontal and vertical directions), 3. gamma variations (between 0.8-1.1 for simulating varying contrasts), and 4. motion blur (between 0-5 pixels in magnitude and 0° to 360° in direction). Parameters for each transformation were chosen uniformly at random within the specified ranges, with an example slice shown in Figure  2. These specific augmentation methods and magnitudes were chosen to mimic typically encountered physiological and imaging variations. No augmentations were applied to the scans in the validation and test sets.
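
The augmentation sequence can be sketched with SciPy as below. This is an illustrative re-implementation, not the original code; details such as interpolation order and the blur-kernel size are assumptions.

```python
import numpy as np
from scipy import ndimage

def augment_slice(img, rng):
    """Apply one random zoom -> shear -> gamma -> motion-blur pass to a 2D
    slice normalized to [0, 1], mimicking the procedure in Sec. 2.6."""
    # 1. Zoom 0-10%, then center-crop back to the original size.
    z = 1.0 + rng.uniform(0.0, 0.1)
    out = ndimage.zoom(img, z, order=1)
    r = (out.shape[0] - img.shape[0]) // 2
    c = (out.shape[1] - img.shape[1]) // 2
    out = out[r:r + img.shape[0], c:c + img.shape[1]]

    # 2. In-plane shear between -15 and 15 degrees via an affine transform.
    s = np.tan(np.deg2rad(rng.uniform(-15, 15)))
    out = ndimage.affine_transform(out, [[1.0, s], [0.0, 1.0]], order=1)

    # 3. Gamma variation 0.8-1.1 to simulate varying contrasts.
    out = np.clip(out, 0, 1) ** rng.uniform(0.8, 1.1)

    # 4. Motion blur: 1D averaging kernel of 0-5 px, rotated to a random angle.
    length = rng.integers(1, 6)
    kernel = np.zeros((7, 7))
    kernel[3, 3 - length // 2:3 + (length + 1) // 2] = 1
    kernel = ndimage.rotate(kernel, rng.uniform(0, 360), reshape=False, order=1)
    return ndimage.convolve(out, kernel / kernel.sum())
```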

All 2D slices were augmented fourfold, resulting in the augmented dataset consisting of 5× the data in the non-augmented training set. To account for this discrepancy while training separate networks with and without augmented data, the networks trained using augmented data were trained for 5× fewer epochs than those trained using non-augmented data (20 epochs total).

2.7 Generalizability to Multiple Fields of View

In this experiment, we compare differences in network performance on scans at different FOVs with the same underlying image resolution. In addition to the original test set (V0), cropped to 288×288×72 as described in §2.2, three new test sets were created with different degrees of cropping: V1, V2, and V3 (the last left uncropped).

As data augmentation is hypothesized to increase network generalizability, we compared the performances of the U-Net models trained using non-augmented and augmented data as specified in §2.6 among the four test sets (V0-V3).

2.8 Training Data Extent

Performance of CNNs has also been shown to be limited by the extent (amount) of data available for training [39]. To explore the relationship between the extent of training data and network accuracy, we trained each of the three base network architectures in §2.3 on varying sized subsets of the training data. The original training set consisting of 60 patients was randomly sampled (with replacement) to create 3 additional sub-training sets of 5, 15, and 30 patients with similar distributions of KL grades (Supporting Table S3). The same validation and testing sets described in §2.1 (with 14 patients, each at two time points) were used to assess the generalizability of the networks.

The network trained on the complete sample of training data (60 patients) was trained for 100 epochs. To ensure that all sub-sampled networks maintained an equal number of backpropagation steps to update filter weights, we scaled the number of epochs by the ratio of the fully sampled patient count (60) to the number of patients in the sub-training set. As a result, networks trained on 5, 15, and 30 patients were trained for 1200, 400, and 200 epochs respectively. Experiments were repeated 3 times each (with Python seeds 1, 2, and 3) to enhance reproducibility and to minimize the stochasticity of random network weight initializations.
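
The epoch-scaling rule above can be expressed directly: each epoch over a smaller cohort contains proportionally fewer mini-batches, so the number of epochs is scaled by the inverse of the subset size to equalize the total number of weight updates.

```python
def epochs_for_subset(n_patients, full_patients=60, full_epochs=100):
    """Scale training epochs so that networks trained on sub-sampled cohorts
    see the same number of backpropagation steps as the full-cohort network."""
    return full_epochs * full_patients // n_patients
```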

2.9 Network Training Hyperparameters

For all experiments, convolutional layers with rectified linear unit (ReLU) activations were initialized using the "He" initialization [40, 41]. Training was performed using the Adam optimizer with default parameters (β1=0.9, β2=0.999, ε=1e-8) with random shuffling of mini-batches using a Tensorflow backend [42, 43]. All neural network computations were performed on 1 Titan Xp graphical processing unit (GPU, NVIDIA, Santa Clara, CA) consisting of 3,840 CUDA cores and 12GB of GDDR5X RAM.

Due to the randomness of the training processes, we empirically determined a pseudo-optimal set of hyperparameters for training each network. To reduce large variances in training batch normalization layers caused by small mini-batch sizes, the largest mini-batch size that could be loaded on the Titan Xp GPU was used for each network. The initial learning rate and use of step decay were also empirically determined based on the network architecture. Table  1 details the hyperparameters used with each network architecture. Networks were trained using the soft Dice loss, unless otherwise specified.
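
The step-decay schedule from Table 1 (for the U-Net: drop factor 0.8 applied every epoch) can be written as a simple function of the epoch index; in a Keras workflow it could be attached via `tf.keras.callbacks.LearningRateScheduler`.

```python
def step_decay_lr(epoch, initial_lr=1e-2, drop_factor=0.8, drop_period=1):
    """Step-decay learning-rate schedule: multiply the initial rate by the
    drop factor once every drop_period epochs (U-Net defaults from Table 1)."""
    return initial_lr * drop_factor ** (epoch // drop_period)
```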

2.10 Quantitative Comparisons

For each experiment, the model that resulted in the best loss on the validation dataset was used for analysis on the testing dataset. During testing, output probabilities of femoral cartilage (ŷ) were thresholded at 0.5 to create binary femoral cartilage segmentations (pixels with ŷ>0.5 labeled as femoral cartilage). No additional post-processing was performed on the thresholded masks.

Segmentation accuracy was measured on the testing dataset using three metrics: Dice similarity coefficient (DSC), volumetric overlap error (VOE), and coefficient of variation (CV) [44]. High accuracy segmentation methods maximize DSC (a maximum of 100%) while minimizing VOE and CV (a minimum of 0%). The segmentation masks obtained from the OAI dataset served as ground truth. Statistical comparisons between the inference accuracy of different models were assessed using Kruskal-Wallis tests and corresponding Dunn post-hoc tests. All statistical analyses were performed using the SciPy (v1.1.0) library [45].
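
The three metrics can be computed from binary masks as below. The CV formulation shown is one common definition (standard deviation over mean of the two segmented volumes) and may differ in detail from the formulation in ref. [44].

```python
import numpy as np

def dsc(a, b):
    """Dice similarity coefficient (%) between binary masks a and b."""
    return 100.0 * 2 * np.sum(a & b) / (np.sum(a) + np.sum(b))

def voe(a, b):
    """Volumetric overlap error (%): 1 minus intersection-over-union."""
    return 100.0 * (1 - np.sum(a & b) / np.sum(a | b))

def cv(a, b):
    """Coefficient of variation (%) of the two segmented volumes
    (one common definition: std over mean of the voxel counts)."""
    va, vb = np.sum(a), np.sum(b)
    return 100.0 * np.std([va, vb]) / np.mean([va, vb])
```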

Additionally, changes in network performance in the slice (depth-wise) direction were visualized using graphs termed depth-wise region of interest distribution (dROId) plots. The normalized depth-wise field of view (dFOV) spanning the region of interest is defined as the ordered set of contiguous slices containing femoral cartilage according to ground truth manual segmentation, where the first slice in the set corresponds to the medial side (dFOV=0%) and the last slice corresponds to the lateral side (dFOV=100%). All volumes were mirrored to follow this convention.
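
The dFOV normalization can be sketched as follows, assuming slice-last binary mask volumes already ordered medial to lateral; `normalized_dfov` is an illustrative helper, not code from the study.

```python
import numpy as np

def normalized_dfov(mask_volume):
    """Map each slice containing femoral cartilage to a normalized depth-wise
    position: 0% at the most medial slice, 100% at the most lateral slice."""
    has_fc = np.flatnonzero(mask_volume.any(axis=(0, 1)))
    first, last = has_fc[0], has_fc[-1]
    return {int(k): 100.0 * (k - first) / (last - first) for k in has_fc}
```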

Table 2: A summary of network performance (mean (standard deviation)) in the base and volumetric architecture, loss function, and data augmentation experiments. Models with Dice similarity coefficient (DSC) accuracy and volumetric overlap error (VOE) significantly different (p<0.05) from the corresponding metrics of all other models in the given experiment are marked with *. Models with all metrics significantly different (p<0.01) from the corresponding metrics of all other models in the given experiment are marked with **. Best performing networks in each experiment category are bolded.

3 Results

All performance results (except data limitation) are summarized in Table  2.

Figure 3: Sample segmentations from three FCN architectures (U-Net, SegNet, DeeplabV3+) with true-positive (green), false-positive (blue), and false-negative (red) overlays. Despite differences in performance between U-Net and the other two architectures, there is minimal visual variation between network outputs. Thick, continuous cartilaginous regions (A) have considerably better performance throughout the entire region, including edge pixels. Failures (red arrows) occur in regions of thin, disjoint femoral cartilage common in edge (B) and medial-lateral transition slices (C). However, (C) shows all networks successfully handled challenging slices that include difficult-to-segment anatomy (white arrows), such as cartilage lesions, heterogeneous signal, and proximity to anatomy with similar signal (ACL, fluid).
Figure 4: Performance bar graphs and depth-wise region of interest distribution (dROId) plots for convolutional neural network models with different (A) network architectures, (B) volumetric architectures, (C) training loss functions, and (D) training data augmentations. The field of view defined by the region of interest in dROId plots is normalized (0-100%) to map performance at knee-specific anatomical locations despite variations in patient knee size.

3.1 Network Architecture Comparison

A comparison of the performance of the U-Net, SegNet, and DeeplabV3+ architectures on sample slices is shown in Figure  3. All three base architectures maintained high fidelity in segmenting slices containing thick cartilage structures (Figure  3A). However, all networks had worse performance in slices containing regions of full-thickness cartilage loss and denuded subchondral bone, edge slices, and medial-lateral transition regions (Figure  3B,C). Despite lower accuracy in these regions, these networks accurately segmented slices with heterogeneous signal caused by pathology and proximity to anatomy with similar signal (Figure  3C). Performance decreased at edge regions (dFOV~[0, 10]%, [90, 100]%) and at the medial-lateral transition region (dFOV~[55, 65]%) as seen in the dROId plot in Figure  4A. There was no significant difference in the performance of U-Net, SegNet, and DeeplabV3+ models as measured by DSC (p=0.08), VOE (p=0.08), and CV (p=0.81).

3.2 Volumetric Architectures Comparison

Results of the 2D, 2.5D, and 3D U-Net architectures showed no significant difference between the performance of the 2D U-Net and that of the three versions (t=3, 5, 7) of the 2.5D U-Net. The 2D U-Net, however, did perform significantly better than the 3D U-Net (DSC, VOE: p<0.05). There were also no significant differences (p=1.0) in performance between the 2.5D architectures using inputs of different depths (t=3, 5, 7). Decreased DSC at edge and medial-lateral transition regions was evident for all models as seen on the dROId plot (Figure  4B). In the 3D U-Net model, DSC was greater in the lateral compartment of the knee (dFOV~[60,90]%) compared to that of the medial compartment (dFOV~[15, 45]%). Among 2D and 2.5D networks, performance in the lateral and medial regions was comparable.

Figure 5: A summary of performances of networks trained on (A) binary cross-entropy (BCE), (B) soft Dice, (C) weighted cross-entropy (WCE), and (D) focal losses. The estimated median probability of errors is marked by the red star. BCE has peak errors at ŷ=0 and ŷ=1, with a relatively uniform number of errors at low confidence probabilities (ŷ≈0.5). Soft Dice loss has a clear bi-modal distribution with peaks at ŷ=0 and ŷ=1, with negligible errors at low confidence probabilities. Almost all WCE errors were false-positives (ŷ>0.5), with increasing error density at higher probabilities. While focal loss exhibits similar peaks at ŷ=0 and ŷ=1 as soft Dice and BCE, approximately 90% of the error density is centered around low confidence probabilities.

3.3 Loss Function Comparison

Performance differences between BCE, soft Dice, and focal losses were negligible, but all three losses significantly outperformed WCE (p<5e-10) across all slices (Figure  4C).

Using the WCE loss model for inference, the incidence rate of false-positives (misclassifying a background pixel as femoral cartilage) was significantly greater (p<2e-10) than the incidence rate of false-negatives (misclassifying a femoral cartilage pixel as background). Over 99% of the WCE model errors were false-positives (Figure  5C). The pixel-wise error distribution, as measured on the test set (V0), appeared correlated with the output probability of femoral cartilage (ŷ), which may be an indicator of network confidence in classifying a pixel as femoral cartilage.

For the BCE, soft Dice, and focal losses, the difference between the false-positive and false-negative rates was not significant (p>0.4). The incidence of errors was also symmetrically distributed around the threshold probability, with medians of 0.48, 0.84, and 0.51, respectively (Figure  5A,B,D). The error rate for BCE was relatively uniform across all probabilities, while the distribution of error rates for soft Dice loss was primarily bi-modal with peaks at ŷ=0 and ŷ=1. The focal loss error distribution was more densely centered around ŷ=0.5.

3.4 Data Augmentation Comparison

The use of augmented training data significantly decreased network performance (p<0.001) compared to training with the non-augmented data set (Figure  4D). The performance was also consistently lower at other regions of the knee.

Figure 6: The Dice score coefficient (DSC) accuracy on four test sets consisting of volumes at different spatial fields of view (FOVs) using the U-Nets trained on non-augmented and augmented training sets. Inference using the non-augmented U-Net is variable across different FOVs, with significantly lower accuracy in test sets cropped to different FOVs (p<0.01). While the augmented U-Net has a generally lower DSC on the test sets than the non-augmented U-Net, it performs consistently on volumes at different FOVs (p>0.99).

3.5 FOV Generalizability Comparison

Baseline U-Net network performance was variable across test sets consisting of scans at different fields of view (Figure  6). Inference on semi-cropped test sets (V1, V2) had significantly lower performance (p<0.01) than that on the original test set (V0). There was no significant difference (p=1.0) between performance on test set V0 and the non-cropped test set (V3). In contrast, there was no significant difference in performance of the augmented U-Net model across all four test sets (p>0.99).

Figure 7: Performance of U-Net, SegNet, and DeeplabV3+ (DLV3+) when trained on retrospectively subsampled training data. The plots (log-x scale) and corresponding values indicate a power-law relationship between segmentation performance, as measured by the (A) Dice score coefficient accuracy (DSC), (B) volumetric overlap error (VOE), and (C) coefficient of variation (CV), and the number of training patients for all networks. Experiments were repeated 3 times with fixed Python random seeds to ensure reproducibility of the results.

3.6 Data Extent Trend

Network performance for all three networks increased with increasing training data (Figure  7). The trend between the number of patients (N) in the training dataset and network performance followed a power-law scaling of the form a·N^b, as hypothesized previously [46], for all performance metrics (p<1e-4). The pixel-wise performance metrics, DSC and VOE, had a strong fit to the hypothesized power-law curve for all architectures. CV had a relatively weaker, but still strong, fit. Among the different architectures, there was no significant difference in the scaling coefficient (a) or exponent (b) of the curve fit measured at different seeds, and all exponents were less than 1 (b<1).
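
A power-law fit of this kind can be reproduced with SciPy's `curve_fit`; the parameter names a and b and the starting guesses are illustrative choices, not taken from the study.

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_power_law(n_patients, scores):
    """Fit score = a * N**b to (training-set size, performance) pairs and
    return the estimated coefficient a and exponent b."""
    (a, b), _ = curve_fit(
        lambda n, a, b: a * n ** b,
        np.asarray(n_patients, dtype=float),
        np.asarray(scores, dtype=float),
        p0=(scores[-1], 0.1),  # start near the largest-cohort score
        maxfev=10000,
    )
    return a, b
```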

4 Discussion

In this study, we examined how variations in FCN architecture, loss functions, and training data impacted network performance for femoral cartilage segmentation. We found no significant pixel-wise difference in the performance of U-Net, SegNet, and DeeplabV3+, three commonly used FCN frameworks for natural image segmentation. There was also no significant performance difference between the segmentations produced by 2D and 2.5D networks. We demonstrated that BCE, soft Dice, and focal losses had similar false-positive and false-negative incidence rates, while WCE biased the network toward false-positive errors. Moreover, while data augmentation reduced U-Net performance, it increased generalizability in performance among scan volumes at different fields of view. Additionally, this study verified that segmentation performance scales directly, following a power-law relationship, with increasing data size. Traditionally, training methods and architectures have been a design choice when applying CNNs for semantic segmentation. In these cases, our findings provide insight into which design choices may be most effective for knee MR image segmentation using CNNs.

4.1 Base Architecture Variations

Based on network performance metrics, newer network architectures like DeeplabV3+ showed slightly improved, though not statistically significant, segmentation accuracy compared to the traditionally used U-Net and SegNet models. The larger receptive field induced by using dilated convolutions in DeeplabV3+ may increase spatial awareness of foreground-background boundary regions.

The expressivity of a network, often used to characterize network generalizability, is defined as the degree to which the network structure facilitates learning features that are representative for the task. As expressivity increases, performance also increases. Raghu, et al. and Bengio, et al. suggest that expressivity is highly impacted by network structures such as depth, which enables hierarchical feature representations, and regularizations, which prime the network to learn representative features that are stable across different inputs [47, 48]. While DeeplabV3+ does not follow the same sequential autoencoder structure as U-Net and SegNet, it leverages dilated convolutions to extract features at various fields of view and decodes these features to create a hierarchical feature representation as expressive as those generated by the other two architectures.

Though network architecture has been closely linked with expressivity, there was no significant difference in the performance of the three network architectures, and all networks failed in similar regions of minimal, disjoint cartilage (Figure  3). The non-uniqueness of failure cases indicates that all three network models may optimize for similar deep features and, consequently, segment images in a visually comparable manner. This minimal difference in performance suggests that beyond some threshold expressivity, differences in CNN architectures may have a negligible impact on overall segmentation performance. Similar work on fully-connected neural networks (i.e., no convolutions) demonstrated that network generalization is not limited by the architecture for a wide array of tasks, given that the network is expressive enough to achieve a small training error [49]. While CNNs and fully-connected neural networks are not an exhaustive representation of all forms of neural networks, the limited effect of network structure on overall expressivity indicates that improving architectures may not be as effective in training better-performing networks.

4.2 Practical Design for Volumetric Architectures

In this study, the volumetric (2.5D/3D) networks had a negligible impact on segmentation accuracy and, in the case of the 3D network, even performed worse than traditional 2D slice-wise segmentation. The limited difference between 2.5D and 2D networks may be explained by the negligible difference in the expressivity of these networks. These networks differ only at the first convolutional layer, which takes the image/volume as the input. While 2.5D networks accept an input volume of d adjacent slices (h × w × d) and 2D networks accept a single input slice (h × w), the output of the initial convolution layer is the same size in both networks. As a result, 2.5D networks only have more parameters in the first convolution layer (Supporting Figure S2), a difference that is negligible compared to the overall size of the network and may not expressively represent the through-plane information that 2.5D networks aim to capture.
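As a rough illustration of this parameter imbalance, the extra weights a 2.5D network gains over its 2D counterpart can be counted directly. The layer sizes below (32 first-layer filters, a ~30M-parameter network) are hypothetical round numbers, not the exact architectures trained here:

```python
def conv2d_params(in_ch, out_ch, k=3):
    """Weights (k*k*in_ch*out_ch) plus one bias per output filter."""
    return k * k * in_ch * out_ch + out_ch

# Hypothetical U-Net-like first layer: 32 filters, 3x3 kernels.
p2d = conv2d_params(in_ch=1, out_ch=32)    # 2D: single-slice input
p25d = conv2d_params(in_ch=3, out_ch=32)   # 2.5D: 3-slice input stack

extra = p25d - p2d
print(extra)  # 576 additional weights, confined to the first layer

# Relative to a network with tens of millions of parameters,
# the difference is on the order of 1e-5 of the total.
total_params = 30_000_000
print(extra / total_params)
```

Because the two networks are identical after this layer, the 2.5D variant's added capacity for through-plane information is a vanishingly small fraction of the model.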

Unlike 2.5D networks, which collapse the 3D input into multiple 2D features after the first convolutional layer, 3D networks maintain the depth-wise dimension throughout the network. While this allows depth-wise features to be extracted throughout the entire network, the number of network parameters also increases, which can limit the batch size of the network. In the 3D network trained in this study, a batch size of 1 was required to fit the scan volume as an input, which may have led to less stable feature regularization. Additionally, to allow fitting the scan volume as an input, the 3D network was constrained to approximately the same number of parameters as the 2D and 2.5D networks. However, as the number of parameters per kernel increases to maintain the extra dimension, the number of filters at the initial convolutional layers had to be reduced twofold. The fewer filters at earlier stages in the network likely contributed to the lower expressivity, and consequently poorer performance, of the network. With increased computational and parallelization power, designing 3D networks with similar filter counts as 2D networks may increase network expressivity.
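To see why the extra dimension forces a filter-count reduction, compare weight counts for a single mid-network layer (the 64/128 and 32/64 channel counts are illustrative, not the exact architecture used in this study):

```python
def conv2d_weights(in_ch, out_ch, k=3):
    # 2D convolution: k x k kernels
    return k * k * in_ch * out_ch

def conv3d_weights(in_ch, out_ch, k=3):
    # 3D convolution: k x k x k kernels -- 3x the weights per in/out pair
    return k * k * k * in_ch * out_ch

# A hypothetical mid-network 2D layer mapping 64 -> 128 channels...
w2d = conv2d_weights(64, 128)   # 73,728 weights
# ...versus a 3D layer with channel counts halved (32 -> 64):
w3d = conv3d_weights(32, 64)    # 55,296 weights
print(w2d, w3d)
```

A 3D layer with the same channel counts would cost exactly 3x the weights of its 2D counterpart, so halving the filters at each level is a direct way to keep the 3D network within a comparable parameter budget.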

4.3 Selecting Loss Functions

While network architectures did not significantly impact performance, U-Net models trained using BCE, soft Dice, and focal losses performed significantly better than the model trained using WCE loss. While WCE is intended to normalize loss magnitude between imbalanced classes, the artificial weighting biases the network to over-classify the rarer class.

The degree of false-positive bias introduced into the network using WCE is likely modulated by the respective class weights. As the median frequency re-weighting method over-biases the network, traditional weighting protocols based on class incidence may not be the optimal weighting scheme. While optimal performance is traditionally measured by reducing the overall error, WCE loss weightings may be used to intentionally steer a network either towards additional false-positives or false-negatives, depending on the specific use case.
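Median frequency balancing, the re-weighting scheme referenced above, can be sketched as follows; this is a generic implementation with a toy mask, and the exact weighting used in this study may differ:

```python
import numpy as np

def median_frequency_weights(labels):
    """Weight each class by median(freq) / freq(class), so that
    rare classes (e.g. cartilage pixels) receive larger weights."""
    classes, counts = np.unique(labels, return_counts=True)
    freq = counts / counts.sum()
    weights = np.median(freq) / freq
    return dict(zip(classes.tolist(), weights.tolist()))

# Toy mask: ~2% foreground, mimicking a thin cartilage region.
mask = np.zeros(10_000, dtype=int)
mask[:200] = 1
print(median_frequency_weights(mask))
```

With ~2% foreground, the rare class receives a roughly 25x weight while the background is down-weighted to ~0.5, illustrating how this scheme can steer a network toward over-classifying the rarer class.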

Additionally, the different error distributions around the threshold probability (p_t = 0.5) indicate the potential success of each loss function (Figure 5). In a binary problem, the probability output p_i of pixel i is binarized at some threshold probability p_t, typically chosen to be the midpoint (p_t = 0.5). Let ŷ_i define the output of the binarization operation on p_i, such that ŷ_i = 1 if p_i ≥ p_t and ŷ_i = 0 otherwise. Let y_i correspond to the ground-truth class for pixel i. Therefore, pixel i is misclassified if ŷ_i ≠ y_i. If pixel i is misclassified, let δ_i = |p_i − p_t| be the minimum amount of shift required to p_i to correctly classify pixel i (i.e. to move p_i across p_t). For the loss functions used above, the energy required to shift p_i is directly proportional to δ_i. If p_i is close to the limit bounds (0 or 1), δ_i is large; but if p_i is close to the threshold probability p_t, δ_i is much smaller. Therefore, a distribution of errors that is densely centered around p_t minimizes the total shift and has the most potential for reducing the error rate with limited energy.
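This shift can be computed directly from a network's probability outputs. The sketch below (using the notation above on a toy four-pixel example) finds the misclassified pixels and the minimum shift each needs to cross the threshold:

```python
import numpy as np

def min_shift_to_correct(p, y, p_t=0.5):
    """For each misclassified pixel, the minimum change to p that
    moves it across the threshold p_t toward the correct class."""
    y_hat = (p >= p_t).astype(int)
    wrong = y_hat != y
    # A false positive must drop to just below p_t; a false negative
    # must rise to p_t. |p - p_t| is the required magnitude.
    delta = np.where(wrong, np.abs(p - p_t), 0.0)
    return delta

p = np.array([0.9, 0.55, 0.45, 0.1])   # predicted foreground probabilities
y = np.array([1,   0,    1,    1  ])   # ground-truth labels
print(min_shift_to_correct(p, y))      # deltas: 0, 0.05, 0.05, 0.4
```

The two errors sitting near the threshold need only a 0.05 shift each, while the confidently wrong pixel (p = 0.1, y = 1) needs 0.4, illustrating why errors clustered near p_t are the cheapest to correct.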

Of the four error distributions induced by the different loss functions, focal loss produces errors that are most densely centered around p_t, which may make it most amenable to future optimization. Focal loss likely achieves this distribution by weighting the BCE loss inversely to the confidence of correct classification. For example, a pixel whose probability for its correct class is close to 1 is weighted less than a pixel whose probability for its correct class is close to 0. As a result, well-classified pixels contribute little to the loss and, consequently, are not further optimized. This preserves a high error rate close to p_t = 0.5, as a network trained with focal loss is most uncertain about these examples. The symmetric distribution also suggests that correcting false-positive and false-negative errors would require an equal amount of energy.
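A minimal numpy sketch of binary focal loss illustrates this down-weighting; the focusing parameter γ = 2 is the commonly used default, not necessarily the value trained with here:

```python
import numpy as np

def focal_loss(p, y, gamma=2.0, eps=1e-7):
    """Binary focal loss: BCE scaled by (1 - p_correct)**gamma,
    so well-classified pixels contribute almost nothing."""
    p = np.clip(p, eps, 1 - eps)
    p_correct = np.where(y == 1, p, 1 - p)  # probability of the true class
    return -((1 - p_correct) ** gamma) * np.log(p_correct)

y = np.array([1, 1])
p = np.array([0.99, 0.55])  # a confident pixel vs. an uncertain one
print(focal_loss(p, y))     # the confident pixel's loss is ~1e-6
```

The confidently correct pixel is suppressed by roughly five orders of magnitude relative to the uncertain one, so gradient updates concentrate on examples near the decision threshold.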

4.4 Achieving Multi-FOV Robustness through Data Augmentation

As MR scan protocols often adjust the image FOV for different-sized patients, training an FCN that generalizes to multiple FOVs may be desired. The U-Net trained on non-augmented images did not exhibit the same performance across different FOVs. Recall that test sets V1, V2, and V3 covered a larger through-plane field of view (80 slices) than V0, whose dimensions were identical to the training volumes (72 slices). Failure cases in V1 and V2 were predominantly in the 8 slices not included in the volumetrically cropped testing/training sets. The network likely failed in these regions because the additional slices include anatomy that may not have been seen during training.

In contrast, the U-Net trained using augmented training data exhibited the same performance across all FOVs. The augmentations introduced realistic variations that could occur during imaging (motion and gamma variations) and artificial variations that change the distribution of anatomy across pixels (zooming and shearing). The latter set of artificial augmentations manipulates the FOV that the tissue of interest covers in the training image. As a result, the optimized network likely consists of a family of features that is robust to spatial FOV variations within the range of the zooming and shearing distributions used. Thus, instead of measuring the expressivity of an FCN on a single test set, we suggest that expressivity for multi-FOV applications be quantified by performance on test sets at varying FOVs.

While augmentations have been readily accepted as a method to increase network accuracy, the 2D U-Net trained with augmented data in this study had sub-optimal performance on test set V0. This likely occurred because the network trained with non-augmented data optimized features for images containing the same FOV of anatomy as the training images, whereas the augmented dataset challenges the network by varying the FOV and contrast of the information seen. The resulting minimum may not fit the non-augmented distribution as closely, and as a result, the features do not achieve as low a testing loss on test set V0. However, these features likely increased the stability of the network for inference on scans of multiple FOVs. We suggest that augmentations be meticulously curated to increase network robustness to expected image variations, especially when tissues of interest have variable sizes in potential test images.
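The FOV-style augmentations described above can be sketched in pure numpy; the zoom and gamma ranges below are hypothetical, not the study's exact settings, and shear is omitted for brevity:

```python
import numpy as np

def augment(img, rng):
    """Apply a random center zoom (nearest-neighbor resampling) and
    gamma variation to a 2D slice; a toy stand-in for a full pipeline."""
    h, w = img.shape
    zoom = rng.uniform(0.9, 1.1)
    gamma = rng.uniform(0.8, 1.2)
    # Zoom about the image center by resampling source coordinates.
    ys = np.clip(((np.arange(h) - h / 2) / zoom + h / 2).astype(int), 0, h - 1)
    xs = np.clip(((np.arange(w) - w / 2) / zoom + w / 2).astype(int), 0, w - 1)
    out = img[np.ix_(ys, xs)]
    # Gamma variation on normalized intensities.
    out = out / (out.max() + 1e-8)
    return out ** gamma

rng = np.random.default_rng(0)
slice_ = rng.random((64, 64))
aug = augment(slice_, rng)
print(aug.shape)  # (64, 64): spatial size preserved, content rescaled
```

Applying such a transform with fresh random parameters at every training iteration is what exposes the network to a distribution of tissue FOVs rather than the single FOV of the raw training volumes.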

4.5 Navigating Training with Limited Data

The performance of all three networks grew at a considerably slow rate as data size increased. This rate is primarily governed by the exponent (b) in the power-law equation. The mean exponent across three seeds was b < 0.05 for all architectures, indicating slow growth in performance as a function of data size. In recent work, Hestness et al. empirically verified that the error rate in image classification decreases following a power-law scaling with data size, regardless of architecture [50]. Like image classification, semantic segmentation also appeared to follow this trend, with minimal variation in b among architectures.

Moreover, this slow-growth power-law performance scaling allows us to empirically estimate the performance of these networks as data size increases. Based on these parameter estimates, achieving a 95% Dice accuracy for the U-Net, SegNet, and DeeplabV3+ models would require approximately 350, 440, and 300 patients, respectively. Therefore, while increasing training data does increase performance, each subsequent dataset has diminishing marginal utility. These results suggest that even with small amounts of data, a high fraction of peak performance can generally be obtained.
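This extrapolation amounts to fitting the power law error(n) = a·n^(−b) on a log-log scale and solving for n at the target error. The sketch below uses synthetic (patients, Dice) pairs with an illustrative exponent, not the fitted coefficients or data from this study:

```python
import numpy as np

# Synthetic (patients, Dice) pairs following a slow-growth power law;
# illustrative values only, not the study's measurements.
n = np.array([5.0, 10.0, 20.0, 40.0, 60.0])
dice = np.array([0.820, 0.845, 0.865, 0.880, 0.887])

# Fit error(n) = a * n**(-b) via a linear fit in log-log space:
# log(error) = log(a) - b * log(n).
slope, log_a = np.polyfit(np.log(n), np.log(1.0 - dice), 1)
a, b = np.exp(log_a), -slope

# Patients needed to reach a target Dice of 0.95 (error of 0.05):
n_target = (a / 0.05) ** (1.0 / b)
print(f"b={b:.3f}, estimated patients for 0.95 Dice: {n_target:.0f}")
```

The small exponent makes the required n explode as the target error shrinks, which is the diminishing-returns behavior described above: each added dataset buys less accuracy than the last.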

4.6 Limitations

Despite the promising empirical relationships elucidated in this work, there were limitations to this study that should be addressed in future work. Training hyperparameters for each network were empirically determined by investigating training loss curves for the initial epochs. While a robust hyperparameter search may yield a more optimal set for training, this was beyond the primary premise of this work, which aimed to explore larger tradeoffs between network architectures, loss functions, and training data. Additionally, the 3D U-Net architecture trained in the volumetric architecture experiment fixed the input depth at 32 slices, resulting in a low batch size and fewer filters at each network level. Future studies could modulate the number of input slices to increase the batch size and filter counts to optimize network performance. Moreover, all networks performed binary segmentation; as most of the loss functions extend to multi-class segmentation, it would be useful to understand the impact of the multi-class formulation on performance for each tissue.

5 Conclusion

In this study, we quantified the impact of variations in network architecture, loss functions, and training data for segmenting femoral cartilage from 3D MRI in order to investigate the tradeoffs involved in segmentation with CNNs. Variations in network architectures yielded minimal differences in overall segmentation accuracy. Additionally, loss functions dictate how the network weights are optimized and, as a result, influence how errors are distributed across probabilities. Moreover, realistic data augmentation methods can increase network generalizability at the cost of absolute network performance on any given test set. Limited amounts of training data may also not be the bottleneck in network performance.


Contract grant sponsor: National Institutes of Health (NIH); contract grant numbers NIH R01 AR063643, R01 EB002524, K24 AR062068, and P41 EB015891. Contract grant sponsor: Philips (research support). Image data was acquired from the Osteoarthritis Initiative (OAI). The OAI is a public-private partnership comprised of five contracts (N01-AR-2-2258; N01-AR-2-2259; N01-AR-2-2260; N01-AR-2-2261; N01-AR-2-2262) funded by the National Institutes of Health, a branch of the Department of Health and Human Services, and conducted by the OAI Study Investigators. Private funding partners include Merck Research Laboratories; Novartis Pharmaceuticals Corporation, GlaxoSmithKline; and Pfizer, Inc. Private sector funding for the OAI is managed by the Foundation for the National Institutes of Health. This manuscript was prepared using an OAI public use data set and does not necessarily reflect the opinions or views of the OAI investigators, the NIH, or the private funding partners.


  • [1] Hoyte L, Ye W, Brubaker L, Fielding JR, Lockhart ME, Heilbrun ME, Brown MB, Warfield SK, Pelvic Floor Disorders Network. Segmentations of MRI images of the female pelvic floor: A study of inter- and intra-reader reliability. Journal of Magnetic Resonance Imaging 2011; 33:684–691.
  • [2] Bogner W, Pinker-Domenig K, Bickel H, Chmelik M, Weber M, Helbich TH, Trattnig S, Gruber S. Readout-segmented echo-planar imaging improves the diagnostic performance of diffusion-weighted MR breast examinations at 3.0 T. Radiology 2012; 263:64–76.
  • [3] Dam EB, Lillholm M, Marques J, Nielsen M. Automatic segmentation of high- and low-field knee MRIs using knee image quantification with data from the osteoarthritis initiative. Journal of Medical Imaging 2015; 2:024001.
  • [4] Bauer S, Wiest R, Nolte LP, Reyes M. A survey of MRI-based medical image analysis for brain tumor studies. Physics in Medicine & Biology 2013; 58:R97.
  • [5] Erhart-Hledik JC, Favre J, Andriacchi TP. New insight in the relationship between regional patterns of knee cartilage thickness, osteoarthritis disease severity, and gait mechanics. Journal of Biomechanics 2015; 48:3868–3875.
  • [6] Dunn TC, Lu Y, Jin H, Ries MD, Majumdar S. T2 relaxation time of cartilage at MR imaging: comparison with severity of knee osteoarthritis. Radiology 2004; 232:592–598.
  • [7] Eckstein F, Kwoh CK, Boudreau RM, Wang Z, Hannon MJ, Cotofana S, Hudelmaier MI, Wirth W, Guermazi A, Nevitt MC et al. Quantitative MRI measures of cartilage predict knee replacement: a case–control study from the osteoarthritis initiative. Annals of the Rheumatic Diseases 2013; 72:707–714.
  • [8] Welsch GH, Mamisch TC, Domayer SE, Dorotka R, Kutscha-Lissberg F, Marlovits S, White LM, Trattnig S. Cartilage T2 assessment at 3-T MR imaging: in vivo differentiation of normal hyaline cartilage from reparative tissue after two cartilage repair procedures—initial experience. Radiology 2008; 247:154–161.
  • [9] Eckstein F. Double echo steady state magnetic resonance imaging of knee articular cartilage at 3 Tesla: a pilot study for the osteoarthritis initiative. Annals of the Rheumatic Diseases 2006; 65:433–441.
  • [10] Seim H, Kainmueller D, Lamecker H, Bindernagel M, Malinowski J, Zachow S. Model-based auto-segmentation of knee bones and cartilage in MRI data. Medical Image Analysis for the Clinic: A Grand Challenge 2010; pp. 215–223.
  • [11] Shan L, Zach C, Charles C, Niethammer M. Automatic atlas-based three-label cartilage segmentation from MR knee images. Medical Image Analysis 2014; 18:1233–1246.
  • [12] Pedoia V, Majumdar S, Link TM. Segmentation of joint and musculoskeletal tissue in the study of arthritis. Magnetic Resonance Materials in Physics, Biology and Medicine 2016; 29:207–221.
  • [13] Liu F, Zhou Z, Jang H, Samsonov A, Zhao G, Kijowski R. Deep convolutional neural network and 3D deformable approach for tissue segmentation in musculoskeletal magnetic resonance imaging. Magnetic Resonance in Medicine 2018; 79:2379–2391.
  • [14] Norman B, Pedoia V, Majumdar S. Use of 2D U-Net convolutional neural networks for automated cartilage and meniscus segmentation of knee MR imaging data to determine relaxometry and morphometry. Radiology 2018; p. 172322.
  • [15] LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521:436.
  • [16] Shwartz-Ziv R, Tishby N. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810 2017.
  • [17] Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision 2015; 115:211–252.
  • [18] Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A. Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. pp. 1–9.
  • [19] He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016. pp. 770–778.
  • [20] Kamnitsas K, Ledig C, Newcombe VF, Simpson JP, Kane AD, Menon DK, Rueckert D, Glocker B. Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation. Medical Image Analysis 2017; 36:61–78.
  • [21] Çiçek Ö, Abdulkadir A, Lienkamp SS, Brox T, Ronneberger O. 3D U-Net: learning dense volumetric segmentation from sparse annotation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2016. pp. 424–432.
  • [22] Tian Z, Liu L, Fei B. Deep convolutional neural network for prostate MR segmentation. In: Medical Imaging 2017: Image-Guided Procedures, Robotic Interventions, and Modeling, 2017. p. 101351L.
  • [23] Baumgartner CF, Koch LM, Pollefeys M, Konukoglu E. An exploration of 2D and 3D deep learning techniques for cardiac MR image segmentation. In: International Workshop on Statistical Atlases and Computational Models of the Heart, 2017. pp. 111–119.
  • [24] Perez L, Wang J. The effectiveness of data augmentation in image classification using deep learning. arXiv preprint arXiv:1712.04621 2017.
  • [25] Krizhevsky A, Sutskever I, Hinton GE. Imagenet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, 2012. pp. 1097–1105.
  • [26] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015. pp. 3431–3440.
  • [27] Peterfy C, Schneider E, Nevitt M. The osteoarthritis initiative: report on the design rationale for the magnetic resonance imaging protocol for the knee. Osteoarthritis and Cartilage 2008; 16:1433–1441.
  • [28] Kellgren J, Lawrence J. Osteo-arthrosis and disk degeneration in an urban population. Annals of the Rheumatic Diseases 1958; 17:388.
  • [29] Ronneberger O, Fischer P, Brox T. U-Net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2015. pp. 234–241.
  • [30] Badrinarayanan V, Kendall A, Cipolla R. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. arXiv preprint arXiv:1511.00561 2015.
  • [31] Chen LC, Papandreou G, Kokkinos I, Murphy K, Yuille AL. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018; 40:834–848.
  • [32] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010. pp. 249–256.
  • [33] Pascanu R, Mikolov T, Bengio Y. On the difficulty of training recurrent neural networks. In: International Conference on Machine Learning, 2013. pp. 1310–1318.
  • [34] Chollet F. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357 2017.
  • [35] Chen LC, Papandreou G, Schroff F, Adam H. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587 2017.
  • [36] Roth HR, Lu L, Seff A, Cherry KM, Hoffman J, Wang S, Liu J, Turkbey E, Summers RM. A new 2.5D representation for lymph node detection using random sets of deep convolutional neural network observations. In: International Conference on Medical Image Computing and Computer-Assisted Intervention, 2014. pp. 520–527.
  • [37] Milletari F, Navab N, Ahmadi SA. V-Net: Fully convolutional neural networks for volumetric medical image segmentation. In: 2016 Fourth International Conference on 3D Vision (3DV), 2016. pp. 565–571.
  • [38] Lin TY, Goyal P, Girshick R, He K, Dollár P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 2018.
  • [39] Rafiq M, Bugmann G, Easterbrook D. Neural network design for engineering applications. Computers & Structures 2001; 79:1541–1552.
  • [40] Nair V, Hinton GE. Rectified linear units improve restricted boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010. pp. 807–814.
  • [41] He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In: Proceedings of the IEEE International Conference on Computer Vision, 2015. pp. 1026–1034.
  • [42] Kingma DP, Ba J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 2014.
  • [43] Abadi M, Barham P, Chen J, Chen Z, Davis A, Dean J, Devin M, Ghemawat S, Irving G, Isard M et al. Tensorflow: a system for large-scale machine learning. In: OSDI, 2016. pp. 265–283.
  • [44] Taha AA, Hanbury A. Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool. BMC Medical Imaging 2015; 15:29.
  • [45] Jones E, Oliphant T, Peterson P. SciPy: open source scientific tools for Python. 2014.
  • [46] Amari S, Fujita N, Shinomoto S. Four types of learning curves. Neural Computation 1992; 4:605–618.
  • [47] Raghu M, Poole B, Kleinberg J, Ganguli S, Sohl-Dickstein J. On the expressive power of deep neural networks. arXiv preprint arXiv:1606.05336 2016.
  • [48] Bengio Y, Delalleau O. On the expressive power of deep architectures. In: International Conference on Algorithmic Learning Theory, 2011. pp. 18–36.
  • [49] Lampinen AK, Ganguli S. An analytic theory of generalization dynamics and transfer learning in deep linear networks. arXiv preprint arXiv:1809.10374 2018.
  • [50] Hestness J, Narang S, Ardalani N, Diamos G, Jun H, Kianinejad H, Patwary M, Ali M, Yang Y, Zhou Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409 2017.