FastVentricle: Cardiac Segmentation with ENet
Cardiac Magnetic Resonance (CMR) imaging is commonly used to assess cardiac structure and function. One disadvantage of CMR is that post-processing of exams is tedious. Without automation, precise assessment of cardiac function via CMR typically requires an annotator to spend tens of minutes per case manually contouring ventricular structures. Automatic contouring can lower the required time per patient by generating contour suggestions that can be lightly modified by the annotator. Fully convolutional networks (FCNs), a variant of convolutional neural networks, have been used to rapidly advance the state-of-the-art in automated segmentation, which makes FCNs a natural choice for ventricular segmentation. However, FCNs are limited by their computational cost, which increases the monetary cost and degrades the user experience of production systems. To combat this shortcoming, we have developed the FastVentricle architecture, an FCN architecture for ventricular segmentation based on the recently developed ENet architecture. FastVentricle is faster and runs with less memory than the previous state-of-the-art ventricular segmentation architecture while still maintaining excellent clinical accuracy.
Patients with known or suspected cardiovascular disease often receive a cardiac MRI to evaluate cardiac function. These scans are annotated with ventricular contours in order to calculate cardiac volumes at end systole (ES) and end diastole (ED); from the cardiac volumes, relevant diagnostic quantities such as ejection fraction and myocardial mass can be calculated. Manual contouring can take upwards of 30 minutes per case, so radiologists often use automation tools to help speed up the process.
Active contour models [kass1988] are a heuristic-based approach to segmentation that have been utilized previously for segmentation of the ventricles [choa2006, zhu2013geodesic] with optional use of a ventricle shape prior [plue2005, schwarz20073d]. However, active contour-based methods not only perform poorly on images with low contrast, they are also sensitive to initialization and hyperparameter values. We encourage the interested reader to refer to recent review papers [petitjean2011review, peng2016review] as a jumping-off point for further insight on the usage of these (and many other) non-deep learning approaches for cardiac segmentation.
Deep learning methods for segmentation have recently defined state-of-the-art with the use of fully convolutional networks (FCNs) [long2015fully]. Simple FCN architectures similar to that described by [long2015fully] have been utilized for cardiac segmentation [tran2016fully]. The general idea behind FCNs is to use a downsampling path to learn relevant features at a variety of spatial scales followed by an upsampling path to combine the features for pixelwise prediction (see Fig. 1). DeconvNet [noh2015learning] pioneered the use of a symmetric contracting-expanding architecture for more detailed segmentations, at the cost of longer training and inference time, and the need for larger computational resources. UNet [ronneberger2015u], originally developed for use in the biomedical community where there are often fewer training images and even finer resolution is required, added the use of skip connections between the contracting and expanding paths to preserve details.
A UNet variant, DeepVentricle, has previously been used for cardiac segmentation [lau2016] and has received FDA clearance for clinical usage [arterysfda]. Both UNet and DeepVentricle are FCNs with symmetric downsampling and upsampling paths, skip connections, two convolution operations before each pooling or upsampling operation, and double/half the number of filters as in the previous layer following each pool/upsample, respectively. A key difference is that UNet utilizes valid padding, resulting in segmentation maps that are smaller than the input image, whereas DeepVentricle uses same padding.
One disadvantage of fully symmetric architectures in which there is a one-to-one correspondence between downsampling and upsampling layers is that they can be slow, especially for large input images. ENet, an alternative FCN design, is an asymmetrical architecture optimized for speed [paszke2016enet]. ENet utilizes early downsampling to reduce the input size using only a few feature maps. This reduces both training and inference time, given that much of the network’s computational load takes place when the image is at full resolution, and has minimal effect on accuracy since much of the visual information at this stage is redundant. Furthermore, the ENet authors show that the primary purpose of the expanding path in FCNs is to upsample and fine-tune the details learned by the contracting path rather than to learn complicated upsampling features; hence, ENet utilizes an expanding path that is smaller than its contracting path. ENet also makes use of bottleneck modules, which are 1x1 convolutions that project the feature maps into a lower-dimensional space in which larger kernels can be applied [he2016deep]. Finally, throughout the network, ENet leverages a diversity of low-cost convolution operations. In addition to the more expensive full convolutions, ENet also uses cheaper asymmetric (5x1 and 1x5) convolutions and dilated convolutions [yu2015multi].
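To make the savings concrete, the following sketch counts convolution weights for a plain 5x5 convolution versus a bottleneck module and a bottleneck with an asymmetric 5x1 + 1x5 pair. The channel counts and projection ratio of 4 are illustrative assumptions, not the exact values used in ENet or FastVentricle.

```python
def conv_params(in_ch, out_ch, kh, kw):
    """Weights in a single 2D convolution (biases ignored)."""
    return in_ch * out_ch * kh * kw

# A plain 5x5 convolution on 128 feature maps:
full = conv_params(128, 128, 5, 5)

# Bottleneck: 1x1 projection to 32 maps (projection ratio 4),
# a 5x5 conv in the reduced space, then 1x1 expansion back to 128 maps.
bottleneck = (conv_params(128, 32, 1, 1)
              + conv_params(32, 32, 5, 5)
              + conv_params(32, 128, 1, 1))

# Replacing the 5x5 inside the bottleneck with an asymmetric 5x1 + 1x5 pair:
asymmetric = (conv_params(128, 32, 1, 1)
              + conv_params(32, 32, 5, 1)
              + conv_params(32, 32, 1, 5)
              + conv_params(32, 128, 1, 1))
```

Under these assumptions the bottleneck uses roughly 12x fewer weights than the plain convolution, and the asymmetric decomposition nearly halves the bottleneck again.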
In this paper, we present FastVentricle, an ENet variation with skip connections for segmentation of the LV endocardium (LV endo), LV epicardium (LV epi), and RV endocardium (RV endo), and compare it to DeepVentricle, the architecture previously cleared by the FDA for clinical use. We show that inference with FastVentricle requires significantly less time and memory than inference with DeepVentricle while achieving statistically indistinguishable volume accuracy.
Training Data. We use a database of 1143 short-axis cine Steady State Free Precession (SSFP) scans annotated as part of standard clinical care to train and validate our model. We split the data chronologically with 80% of studies used for training, 10% for validation, and 10% as a hold out set. We curate the hold out set to only include contours from trusted annotators, i.e. licensed radiologists or technologists. The hold out set is used for all plots and tables in this paper. Annotated contour types include LV endo, LV epi and RV endo. Scans are annotated at ED and ES. Contours were annotated with different frequencies; 96% (1097) of scans have LV endo contours, 22% (247) have LV epi contours and 85% (966) have RV endo contours.
Data Preprocessing. We normalize all MRIs such that the 1st and 99th percentile of pixel intensities of a batch of images fall at -0.5 and 0.5, i.e. their “usable range” falls between -0.5 and 0.5. We crop and resize the images such that the ventricle contours take up a larger percentage of the image; the actual crop and resize factors are hyperparameters. Cropping the image increases the fraction of the image that is taken up by the foreground (ventricle) class, making it easier to resolve fine details and helping the model converge.
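The intensity normalization described above can be sketched as follows; the function name and percentile parameters are illustrative, and the crop/resize step (whose factors are tuned hyperparameters) is omitted.

```python
import numpy as np

def normalize_mri(batch, low_pct=1, high_pct=99):
    """Linearly map the [1st, 99th] intensity percentiles of a batch
    of MR images onto [-0.5, 0.5], the "usable range"."""
    lo, hi = np.percentile(batch, [low_pct, high_pct])
    return (batch - lo) / (hi - lo) - 0.5
```

Computing the percentiles over the whole batch, rather than per image, keeps the mapping consistent across slices of the same series.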
Training. We use the Keras [chollet2015keras] deep learning package with TensorFlow [abadi2016tensorflow] as the backend to implement and train all of our models. We modify the standard per-pixel cross-entropy loss to account for missing ground truth annotations in our dataset. We discard the component of the loss that is calculated on images for which ground truth is missing; we only backpropagate the component of the loss for which ground truth is known. This allows us to train on our full training dataset, including series with missing contours. Network weights are updated per the Adam rule [kingma2014adam]. We train and test the models using NVIDIA Maxwell Titan X GPUs. We augment our data during training by flipping, shearing, shifting, zooming, and rotating the image/label pairs. To compare different models, we use relative absolute volume error (RAVE). Using a relative metric ensures that equal weight is given to pediatric and adult hearts. We focus on the volume error, as opposed to the Sørensen–Dice index or a similar overlap metric, because ventricular volumes are the clinical endpoint used to diagnose disease and determine clinical care. RAVE is defined as |V_auto - V_manual| / V_manual, where V_manual is the ground truth volume and V_auto is the volume computed from the predicted 2D segmentation masks. Volumes are calculated from segmentation masks using a frustum approximation.
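The volume reconstruction and metric can be sketched as below. The conical-frustum formula between adjacent slices is a standard choice; the paper states only that a frustum approximation is used, so this particular variant (and the function names) are assumptions.

```python
import numpy as np

def frustum_volume(slice_areas, slice_thickness):
    """Approximate a ventricular volume from per-slice cross-sectional
    areas (e.g. mask pixel count * pixel area) by summing conical
    frusta between adjacent short-axis slices:
    V = h/3 * (A1 + A2 + sqrt(A1 * A2)) per slice gap."""
    a = np.asarray(slice_areas, dtype=float)
    a1, a2 = a[:-1], a[1:]
    return float(np.sum(slice_thickness / 3.0 * (a1 + a2 + np.sqrt(a1 * a2))))

def rave(v_manual, v_auto):
    """Relative absolute volume error: |V_auto - V_manual| / V_manual."""
    return abs(v_auto - v_manual) / v_manual
```

For a stack of identical slice areas the frustum sum reduces to the simple slab (Simpson) approximation, as expected.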
Hyperparameter Search. We fine tune the DeepVentricle and FastVentricle network architectures using random hyperparameter search [bergstra2012random]. In practice, for each of the DeepVentricle and FastVentricle architectures, we: i) run models with random sets of hyperparameters for 20 epochs (i.e. 20 passes over the training set), ii) select from the resulting corpus of models the three models with the highest validation set accuracy, iii) select the final model from the three candidates based on lowest average RAVE on the validation set. The hyperparameters of the DeepVentricle architecture include the use of batch normalization, the number of convolution layers, the number of initial filters, and the number of pooling layers. The hyperparameters of the FastVentricle architecture include the kernel size for asymmetric convolutions, the number of times Section 2 of the network is repeated, the number of initial bottleneck modules, the number of initial filters, the projection ratio, and whether to use skip connections. These connections run from the contracting path (the initial block and Section 1) to the equivalent image size on the expanding path (Section 5 and Section 4, respectively). Refer to [paszke2016enet] for nomenclature of the Sections and detailed descriptions of the hyperparameters. For both architectures, the hyperparameters also include the batch size, learning rate, dropout probability, crop fraction, image size, and the strength of all data augmentation parameters. We trained 50 DeepVentricle models and 20 FastVentricle models for our hyperparameter search, with the scope of the search limited by computational limitations and time constraints. We note that skip connections are used in the 5 best FastVentricle architectures in terms of validation set accuracy, demonstrating their usefulness for this problem.
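Stage i) of the search procedure above can be sketched as follows. The search-space values are illustrative placeholders, not the ranges actually searched; the parameter names follow the paper's description.

```python
import random

# Illustrative search space; the value ranges are assumptions.
SPACE = {
    "batch_size": [4, 8, 16],
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout": [0.0, 0.1, 0.3],
    "num_init_filters": [8, 16, 32],
    "use_skip_connections": [True, False],
}

def sample_config(space, rng):
    """Draw one random hyperparameter configuration."""
    return {k: rng.choice(v) for k, v in space.items()}

def random_search(space, n_trials, evaluate, seed=0):
    """Train n_trials random configurations (here `evaluate` stands in
    for 20 epochs of training) and keep the three with the highest
    validation accuracy."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        config = sample_config(space, rng)
        trials.append((config, evaluate(config)))
    trials.sort(key=lambda t: t[1], reverse=True)
    return trials[:3]
```

The final model is then chosen from the three survivors by lowest average validation-set RAVE, as in stage iii).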
Table 1 (columns): LV Endo (ED) | LV Endo (ES) | LV Epi (ED) | LV Epi (ES) | RV Endo (ED) | RV Endo (ES)
Volume Error Analysis. Figure 2 presents box plots of the RAVE comparing DeepVentricle and FastVentricle for each combination of ventricular structure (LV endo, LV epi, RV endo) and phase (ES, ED). Within our hold out set, we find that the performance of the models is very similar across structures and phases (see Table 1 for median RAVEs as well as corresponding sample sizes). For both models, segmentations at ED tend to be better than at ES, as the chambers at ES are smaller and dark-colored papillary muscles tend to blend with the myocardium when the heart is contracted. RV endo is the most difficult of the structures due to its more complex shape. Furthermore, we find that model performance at the apex and center of the ventricle is often better than at the base, as it is generally ambiguous from just the basal slice where the valve plane (separating ventricle from atrium) is. Although trained on only ES and ED annotations, we are able to perform visually pleasing inference on all time points. Figure 3 shows examples of network predictions on different slices and time points for both DeepVentricle and FastVentricle.
We additionally make Bland-Altman plots [bland1986statistical] for the volume error for each combination of ventricular structure and phase (Figure 4). These plots are used to show the agreement between the “gold standard” method (i.e. expert manual annotations) and our automated DeepVentricle and FastVentricle methods. If the 95% limits of agreement (mean difference ± 1.96 SD) are within the range that would not make a clinical difference, the new method can be used in place of the old with no adverse effects.
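The quantities plotted in a Bland-Altman analysis can be computed as below; the function name is illustrative.

```python
import numpy as np

def bland_altman_limits(manual, automated):
    """Bias (mean difference) and 95% limits of agreement
    (bias +/- 1.96 * SD of the paired differences)."""
    d = np.asarray(automated, dtype=float) - np.asarray(manual, dtype=float)
    bias = d.mean()
    sd = d.std(ddof=1)  # sample standard deviation
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```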
Statistical Analysis. We measure the statistical significance of the difference between DeepVentricle and FastVentricle’s RAVE distributions for each combination of phase and anatomy for which we have ground truth annotations. We use the Wilcoxon-Mann-Whitney test (using the SciPy 0.17.0 implementation with default parameters: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html) to assess the null hypothesis that the distribution of DeepVentricle and FastVentricle’s RAVE are equal. Table 1 displays the results. We find that there is no statistical evidence to claim one model as the best, since the lowest measured p-value is 0.28.
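For intuition, a simplified two-sided Wilcoxon-Mann-Whitney test can be implemented directly via the normal approximation. This sketch omits the tie and continuity corrections that scipy.stats.mannwhitneyu applies, so its p-values differ slightly from SciPy's.

```python
import math

def mann_whitney_u(x, y):
    """Simplified two-sided Wilcoxon-Mann-Whitney test.
    U counts pairs (x_i, y_j) with x_i < y_j (ties count 0.5);
    the p-value uses the large-sample normal approximation."""
    n1, n2 = len(x), len(y)
    u = sum((xi < yj) + 0.5 * (xi == yj) for xi in x for yj in y)
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mu) / sigma
    # Two-sided p-value from the standard normal CDF.
    p = 2.0 * (1.0 - 0.5 * (1.0 + math.erf(abs(z) / math.sqrt(2.0))))
    return u, p
```

Two fully separated samples give a small p-value, while identical samples give p = 1, matching the null hypothesis of equal distributions.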
Computational Complexity and Inference Speed. To be clinically and commercially viable, any automated algorithm must be faster than manual annotation, and lightweight enough to deploy easily. As seen in Table 2, we find that FastVentricle is roughly 4x faster than DeepVentricle and uses roughly 6x less memory for inference. Because the model contains more layers, FastVentricle takes longer to initialize before being ready to perform inference. However, in a production setting, the model only needs to be initialized once when provisioning the server, so this additional cost is incidental.
Table 2: Computational comparison of DeepVentricle and FastVentricle.
| Metric | DeepVentricle | FastVentricle |
| Inference GPU time per sample (msec) | 31 | 7 |
| Initialization GPU time (sec) | 1.3 | 13.3 |
| Number of parameters | 19,249,059 | 755,529 |
| GPU memory required for inference (MB) | 1,800 | 270 |
| Size of the weight file (MB) | 220 | 10 |
Internal Representation. Neural networks are infamous for being black boxes, i.e., it is very difficult to “look inside” and understand why a certain prediction is being made. This is especially troublesome in the medical setting, as doctors prefer to use tools that they can understand. We follow the results of [deepdream] to visualize the features that FastVentricle is “looking for” when performing inference. Beginning with random noise as a model input and a real segmentation mask as the target, we perform backpropagation to update the pixel values in the input image such that the loss is minimized. Figure 6 shows the result of such an optimization for DeepVentricle and FastVentricle. We find that, as a doctor would, the model is confident in its predictions when the endocardium is light and the contrast with the epicardium is high. The model seems to have learned to ignore the anatomy surrounding the heart. We also note that the optimized input for DeepVentricle is less noisy than that for FastVentricle, probably because the former model is larger and utilizes skip connections at the full resolution of the input image. DeepVentricle also seems to “imagine” structures which look like papillary muscles inside the ventricles.
Performance. Though accuracy is the most important property of a model when making clinical decisions, speed of algorithm execution is also critical for maintaining positive user experience and minimizing infrastructure costs. Within the bounds of our experiments, we find no statistically significant difference between the accuracy of DeepVentricle and that of its faster cousin, FastVentricle. This suggests that FastVentricle can replace DeepVentricle in a clinical setting with no detrimental effects.
Both DeepVentricle and FastVentricle achieve comparable median RAVEs across structures and phases (see Table 1). To put this in context, [suinesiaputra2015quantification] investigated the reproducibility of LV endo volume measurements from CMR over 15 studies with 7 expert readers and found the interrater standard deviation to be, on average, 14 mL, i.e. approximately 10% of the volume being measured. Although there are occasional cases on which our algorithm performs poorly, for example the congenital case displayed in Figure 7, we auto-detect inference results with noisy contours when using DeepVentricle or FastVentricle for clinical use and opt not to display them to the user. Note that this quality detection algorithm has not been run for any plots in this paper; all presented results contain the full hold out set for completeness. For all cases, our production application allows radiologists to view and modify any erroneous contours before completing the report.
We show that a new ENet-based FCN with skip connections, FastVentricle, can be used for quick and efficient segmentation of cardiac anatomy. Trained on a sparsely annotated database, our algorithm provides LV endo, LV epi, and RV endo contours to clinicians for the purpose of calculating important diagnostic quantities such as ejection fraction and myocardial mass. FastVentricle is faster and runs with less memory than the previous state-of-the-art.
We experimented with a fully convolutional 3D network to improve the consistency of basal slice segmentations, but found that the memory requirements of this model limited the resolution of the input images such that the results did not outperform 2D segmentation. A more careful consideration of additional spatial information, perhaps with a 2.5D network incorporating long axis views, could potentially improve performance. Finally, as with any deep learning-based model, additional training cases could further improve performance.