3D Consistent Biventricular Myocardial Segmentation Using Deep Learning for Mesh Generation
Abstract
We present a novel automated method to segment the myocardium of both left and right ventricles in MRI volumes. The segmentation is consistent in 3D across the slices such that it can be directly used for mesh generation. Two specific neural networks with multiscale coarsetofine prediction structure are proposed to cope with the small training dataset and trained using an original loss function. The former segments a slice in the middle of the volume. Then the latter iteratively propagates the slice segmentations towards the base and the apex, in a spatially consistent way. We perform 5fold crossvalidation on the 15 cases from STACOM to validate the method. For training, we use real cases and their synthetic variants generated by combining motion simulation and image synthesis. Accurate and consistent testing results are obtained.
Keywords:
myocardial segmentation, deep learning, convolutional neural network, propagation, spatial consistency1 Introduction
While most research about myocardial segmentation focuses on either left ventricle (LV) [1] or right ventricle (RV) [2] segmentation on 2D slices, there is a great need for 3Dconsistent biventricular (BV) segmentation, in which LV and RV together are segmented. 3Dconsistent BV segmentation provides not only consistency, but also robustness on poorquality slices (e.g. near apex). Its output may then be used to generate complete meshes. These advantages are missing from most other methods. For example the model of [7] is not very capable of segmenting the slices near apex. The lack of publicly available automatic BV mesh generation tool and the success of deep learning on medical image analysis [10] motivate us to develop this method based on two neural networks : the former segments a single slice in the volume and the latter propagates the segmentation to the other slices.
The most competitive BV myocardial segmentation methods are the automatic ones. Shapeconstrained deformable models are applied on a dataset of 28 CT volumes in [4]. The chain takes 22s for a volume. In [9] 4chamber segmentation is performed using steerable features learned on a dataset of 457 annotated CT volumes. The speed is 4s per volume. The authors of [8] apply marginal space learning and probabilistic boostingtree on a dataset of 100 annotated MRI volumes to learn to jointly delineate LV and RV. It spends about 3s on each volume. We cannot compare directly with these methods on their reported error measures due to differences in datasets. In this paper we propose an effective pipeline based on 2 neural networks combining the assets of 2D (speed) and 3D (consistency). An original loss function is also applied for training. Our approach has the following advantages:

Unlike the abovementioned papers our networks are trained on a publicly available dataset STACOM [6].

The dataset we use of 15 annotated volumes is much smaller than the abovementioned datasets. Our method is dataefficient as its generalize even with small training sets.

Unlike the above methods, our method is modelfree. Anyone familiar with deep learning may implement our networks without difficulty.

Our method takes about 3s to segment a volume. This is the same as the fastest one in the above methods.

Compared to MRI images, CT images usually have much better resolution and stronger heart/background contrast. Working with MRI images, we actually solve a more challenging version of segmentation problems.
2 Approach
2.0.1 Data preprocessing.
The method takes shortaxisview MRI images as input. We process 3D stacks of 2D slices, cropped around the heart. As standardization the cropped volume is resampled into an isotropic volume by linear interpolation. Furthermore, we manually identify the base slice near mitral annulus. For images of reasonably good quality (e.g. STACOM) the segmentation can be initialized from any slice around the middle of the volume. So in this paper, for testing on the 3 testing cases of STACOM in each fold, the initialization slice is automatically chosen as the one in the middle of the volume.
2.0.2 Initial segmentation.
We then apply the initialization network to segment the selected slice. The output is a mask of which each pixel is a probability (0 for background and 1 for myocardium).
2.0.3 Spatial segmentation propagation.
Then we apply the spatial propagation network to propagate segmentation masks. During upward (towards base) propagation, we suppose the slices up to that of index are already segmented. Taking this slice, its predicted mask and the next 4 slices (+1 to +4) as input, the spatial propagation network predicts the next 4 segmentation masks. The iteration stops at the base slice. Similarly, we downward (towards apex) propagate the segmentation masks. Thereby we complete the segmentation.
3 Networks
3.0.1 Multiscale coarsetofine prediction
The two neural networks we use for this paper are characterized by multiscale coarsetofine predictions. As presented in Fig. 2, the main body of the initialization network is separated into 3 subnetworks with input/output sizes 32, 64 and 128 respectively. SubNet32, taking a downsampled slice of size 32 as input, outputs a predicted mask of the same size. Then SubNet64 takes a downsampled slice of size 64 and incorporates the predicted mask of size 32 to make a prediction of size 64. Similarly, SubNet128 outputs the final predicted mask of size 128. During training, 3 loss functions which compare the outputs of SubNet32, SubNet64 and SubNet128 with the ground truth masks of size 32, 64 and 128 respectively are applied. The spatial propagation network, consisting of SubNet64 and SubNet128, has analogous structure and loss functions for training.
3.0.2 Loss function.
The networks are trained by stochastic gradient descent. An original loss function is designed to overcome numerical instability and class imbalance during training. We call it stabilized and classbalanced cross entropy loss, where pixelwise losses are added to work with a total loss. For each pixel, suppose the predicted probability is p and the ground truth is g. The pixel loss is
(1) 
with
(2) 
and a, b and t are parameters such that a 0, b 0, a 2b = 1 and 0 t 1. To roughly preserve the predicted probability, a is set close to 1, b and t are set near 0. In this paper we empirically pick a = 0.999, b = 0.0005 and t = 0.02.
The purpose of applying (2) is to avoid computation of logarithm on any value too close to 0 while roughly reserving the predicted probability. Without it, poorly predicted values of p may result in extremely large loss and gradient values, which may harm numerical stability of training and even cause overflow.
On the other hand, there is a strong imbalance between myocardial and background pixel. The latter represents around 80% of the image. With common loss functions, the overall training effect of background pixels dominates the effect of the myocardial pixels. It may hinder the network performance in recognizing the myocardium. Setting the loss to 0 in (1) whenever the prediction is close enough to ground truth reduces the effect. When the predicted probabilities for background are close enough to 0, our loss function stops “pulling” them further to 0 and instead focuses on “pushing” the probabilities on myocardium to 1.
3.0.3 Convolution layer group.
The two networks mainly consist of convolution layer groups. In each group, a convolution layer is followed by a batch normalization layer and a leaky ReLU layer of negative part coefficient 0.25.
3.0.4 Data augmentation inside network.
The first layers in the two networks are noise layers for data augmentation during training to make the networks more robust. They are removed in testing. Data augmentation includes randomly rotating input slices together and adding Gaussian and pepperandsalt noise.
4 Experiments
Our experiments involved an existing dataset with MRI image volume sequences: 15 subjects from STACOM [6] (30 instants per cycle, with ground truth segmentation). After resampling to isotropic volume of voxel size 1.25mm there are about 60 slices below the base in each volume. We divide the 15 cases into 5 groups of 3 cases. In each fold of the 5fold crossvalidation, the 3 cases of one picked group are used as testing. And the training set consists of the 12 cases from the remaining 4 groups.
4.0.1 Training.
As data augmentation to generate a large database with ground truth from a small database, we combine a motion simulation method and an image synthesis algorithm to generate realistic volume sequence variants of the training cases. Infarcted mesh motion sequences were simulated according to the scheme depicted in [3]. Then the original volume sequences were warped to generate synthesized image variants using an algorithm inspired from [5]. For each of the training subjects, 31 (1 healthy and 30 infarcted) 30instant volume sequence variants were generated. In total 12*31*30=11160 volumes were used for training the spatial segmentation network in each fold of the 5fold crossvalidation. On the other hand, we observe better image/mesh coincidence around enddiastole than around endsystole in the synthesized sequences. Considering the tradeoff between robustness (diversity in training set) and accuracy (image/mesh coincidence), we decide to train the initialization network only with volumes from 10 instants around enddiastole. Hence it is trained on 12*31*10=3720 volumes. Please note the augmentations of a same case remain similar. Our synthesized database is not comparable to real ones of similar size in terms of diversity and richness.
We use only the slices below the base in synthesized volumes to train the spatial propagation network. For the initialization net we also use these slices except the top 1/6 and the bottom 1/6 of them. The potentially poor image quality of slices near base and apex may cause additional inaccuracies.
The networks are trained by stochastic gradient descent with batch size 1 and learning rate 0.0001. The initialization network is trained for 300000 iterations. It takes about 23 hours in total on GPU. The spatial propagation network is trained for 600000 iterations, which together take roughly 44 hours.
4.0.2 Testing.
The method is tested on the enddiastole (the only instant where ground truth is available) of testing cases. It takes about 3s to segment a volume using GPU. The output probabilities are binarized to obtain myocardium/background segmentation taking 0.5 as threshold. We use the Dice index to measure performance. The 3D Dice indices (considering all pixels of all slices below base) are 0.7851 for v1 and 0.7817 for v2. The predicted masks and the groundtruth (axial and coronal views) as well as the BV meshes generated directly from the predicted masks using CGAL ^{1}^{1}1http://www.cgal.org are presented in the left part of Fig. 3. In the right part of Fig. 3, 2D Dice indices for both subjects change smoothly across slices, confirming the spatial consistency of our method.
For comparison, if we use the initialization network to segment all the slices independently without propagation, the method not only loses the spatial consistency but also fails completely on the slices near apex (the middle part of Fig. 3). Our propagation method therefore appears crucial to maintain spatial consistency and reach accurate results even on the most difficult slices.
5 Conclusion and Perspectives
We demonstrate that our deeplearningbased automatic method for BV segmentation is robust, and combines the assets of 2D (speed) and 3D to provide spatially consistent meshes ready to be used for simulations. Besides, we proposed two original networks: (i)the initialization network predicts segmentation in a multiscale coarsetofine manner; (ii)the second network propagates segmentation with spatial consistency. A novel loss function is also proposed to overcome class imbalance. For training, we use image synthesis as data augmentation. Meshes of high quality are generated. In the future, we will explore the capacity of neural networks in maintaining temporal consistency of segmentation. As we only use 15 subjects in this paper, significant improvement is expected if more data are added afterwards.
References
 [1] Avendi, M.R., Kheradvar, A., Jafarkhani, H.: A combined deeplearning and deformablemodel approach to fully automatic segmentation of the left ventricle in cardiac MRI. Med Image Anal, vol. 30, 108–119 (2016)
 [2] Avendi, M.R., Kheradvar, A., Jafarkhani, H.: Automatic segmentation of the right ventricle from cardiac MRI using a learningbased approach. http://onlinelibrary.wiley.com/doi/10.1002/mrm.26631/pdf (2017)
 [3] Duchateau, N., De Craene, M., Allain, P., et al.: Infarct localization from myocardial deformation: Prediction and uncertainty quantification by regression from a lowdimensional space. IEEE Trans. Med. Imaging, 35(10), 2340–2352 (2016)
 [4] Ecabert, O., Peters, J., Schramm, H., et al.: Automatic modelbased segmentation of the heart in CT images. IEEE Trans. Med. Imaging, 27(9), 1189–1201 (2008)
 [5] Prakosa, A., Sermesant, M., Delingette, H., et al.: Generation of synthetic but visually realistic time series of cardiac images combining a biophysical model and clinical images. IEEE Trans. Med. Imaging, 32(1), 99–109 (2013)
 [6] TobonGomez, C., De Craene, M., McLeod, K., et al.: Benchmarking framework for myocardial tracking and deformation algorithms: an open access database. Med Image Anal, vol. 17, 632–48 (2013)
 [7] Tran, P.V.: A fully convolutional neural network for cardiac segmentation in shortaxis MRI. arXiv preprint arXiv:1604.00494 (2016)
 [8] Wang, Y., Georgescu, B., Chen, T., et al.: Learningbased detection and tracking in medical imaging: A probabilistic approach. LNCVB, vol. 7, 209–235 (2013)
 [9] Zheng, Y., Barbu, A., Georgescu, B., et al.: Fourchamber heart modeling and automatic segmentation for 3D cardiac CT volumes using marginal space learning and steerable features. IEEE Trans. Med. Imaging, 27, 1668–1681 (2008)
 [10] Zhou, K., Greenspan, H., Shen, D.: Deep learning for medical image analysis. https://www.amazon.co.uk/DeepLearningMedicalImageAnalysis/dp/0128104082 (2017)