# Accelerating cardiac cine MRI beyond compressed sensing using DL-ESPIRiT

Running head: DL-ESPIRiT

Address correspondence to:

Christopher M. Sandino

Department of Electrical Engineering

Stanford University, Stanford, CA, 94025, United States

sandino@stanford.edu

This work was supported by an NSF Graduate Research Fellowship, General Electric Healthcare, and Google Cloud.

Approximate word count: 229 (Abstract) 4510 (body)

Submitted to Magnetic Resonance in Medicine as a Full Paper.

## Abstract

Purpose: To propose a novel combined parallel imaging and deep learning-based reconstruction framework for robust reconstruction of highly accelerated 2D cardiac cine MRI data.

Methods: A novel neural network architecture, known as DL-ESPIRiT, is proposed to address SENSE-related FOV limitations of previously proposed deep learning-based reconstruction frameworks. Additionally, a novel convolutional neural network based on separable 3D convolutions is integrated into DL-ESPIRiT to more efficiently learn spatiotemporal priors for dynamic image reconstruction. The network is trained on fully-sampled 2D cardiac cine datasets collected from eleven healthy volunteers with IRB approval. DL-ESPIRiT is compared against a state-of-the-art parallel imaging and compressed sensing method known as $\ell_1$-ESPIRiT. The reconstruction accuracy of both methods is evaluated on retrospectively undersampled datasets (R=12) with respect to standard image quality metrics as well as automatic deep learning-based segmentations of left ventricular volumes.

Results: DL-ESPIRiT produces higher fidelity image reconstructions when compared to $\ell_1$-ESPIRiT reconstructions with respect to standard image quality metrics ($p < 0.001$). As a result of improved image quality, segmentations made from DL-ESPIRiT images are also more accurate than segmentations from $\ell_1$-ESPIRiT images. Preliminary results show that DL-ESPIRiT can be used to reconstruct rapidly acquired 2D cardiac cine data (1 slice/heartbeat) more accurately than $\ell_1$-ESPIRiT.

Conclusion: DL-ESPIRiT synergistically combines a robust parallel imaging model and a deep learning-based prior to produce high-fidelity reconstructions of highly accelerated 2D cardiac cine data acquired with reduced fields-of-view.

Keywords: cardiac cine, deep learning, compressed sensing

## 1 Purpose

Cardiac cine MRI is a widely used imaging technique for non-invasive characterization of heart morphology and function [1]. In a standard cardiac cine MRI scan, a two-dimensional (2D) steady-state gradient echo acquisition is synchronized with the cardiac cycle and typically performed over 10-15 slices covering the entire heart in a short-axis view. To minimize motion artifacts caused by respiration, the patient is asked to repeatedly hold their breath for 15-20 seconds at a time until all slices are acquired.

Such multi-breath-hold protocols pose several problems for the cardiac MRI workflow. Firstly, exams are long and complex due to the need for repeated breath-holds and intermediate resting periods for patient recovery. Secondly, long successive breath-holding is uncomfortable, and can be especially challenging in patients with impaired breath-hold capacity. Finally, the inevitable variation in the level of inspiration across each breath-hold introduces slice misalignment in the cine images. This variation is known to impact the ability to accurately estimate ventricular volumes [2], which are used to compute important functional indices such as left ventricular ejection fraction (LVEF). These issues can be mitigated by reducing the total scan time and thereby reducing the number of breath-holds. In fully-sampled imaging, the only way to achieve this is to trade off in-plane spatial resolution, number of slices, and/or temporal resolution, each of which may impact diagnostic value. Reductions in scan time can also be achieved by using more efficient non-Cartesian sampling trajectories to traverse k-space [3]. In this work, we only consider standard Cartesian sampling since non-Cartesian acquisitions remain susceptible to various image artifacts related to system imperfections and off-resonance.

Scan time can also be reduced by parallel imaging (PI) methods that leverage multi-channel receiver array information to reconstruct the image from data sampled below the Nyquist rate. For example, sensitivity encoding (SENSE) utilizes explicit knowledge of coil sensitivity profiles to localize signals in space and remove aliasing artifacts introduced by undersampling [4]. However, SENSE-based methods require accurate estimation of coil sensitivity maps from calibration data; otherwise, model errors can arise, resulting in reconstruction artifacts. For example, when the field of view (FOV) is smaller than the subject, overlapping anatomies create discontinuities in coil sensitivity maps and cause residual ghosting in the SENSE reconstructed images [5, 6]. When imaging at double oblique scan planes, such as standard cardiac cine views, the FOV must be prescribed conservatively large to avoid such errors. k-Space based PI approaches, such as GRAPPA [7] or SPIRiT [8], do not rely on explicit sensitivity maps and instead exploit coil-wise correlations in k-space to directly synthesize missing data samples. Generally, k-space based approaches are robust to anatomy overlap and can enable faster scans with a reduced FOV [9]. Either type of method can be used to accelerate scans by a factor of 2-3X without sacrificing image quality or resolution. For this reason, parallel imaging is almost always used in routine clinical scans to reduce the number of breath-holds; however, 5-6 breath-holds are still necessary for a standard 2D cardiac cine acquisition of 10-12 slices [10].

Further scan time acceleration has been achieved by exploiting prior information about the underlying signal structure in addition to parallel imaging during reconstruction. Many reconstruction methods have been developed to leverage spatiotemporal redundancy in dynamic imaging data to remove aliasing artifacts [11, 12, 13]. Other methods, such as compressed sensing, instead leverage spatiotemporal transform sparsity by iteratively solving a regularized inverse problem [14, 15, 16]. Compressed sensing (CS) methods have been instrumental in enabling rapid 2D cardiac cine acquisitions that can be completed in a single breath-hold [17, 18] or while the patient is freely breathing [16, 19]. Despite its potential to drastically simplify the cardiac MRI exam workflow, CS has not yet seen widespread clinical adoption like parallel imaging has. Transform sparsity assumptions used by CS are often simplistic and incapable of accurately modelling complex cardiac dynamics. Thus, CS requires careful hand-tuning of the relative weights assigned to data consistency and regularization terms. However, reconstruction from highly undersampled data requires strong regularization to completely remove aliasing artifacts, which leads to over-regularization that produces images with textural artifacts and spatiotemporal blurring.

More recently, deep learning-based approaches have been proposed to leverage historical exam data from multiple subjects to implicitly and automatically learn better priors for constrained image reconstruction [20, 21, 22, 23, 24]. These methods are comprised of the following steps: 1) a conventional CS algorithm is unrolled to a fixed number of iterations, 2) the prior information in each iteration is enforced by a neural network, and 3) the unrolled algorithm is trained end-to-end in a supervised fashion. For 2D cardiac cine imaging, various neural network architectures, including cascaded [25] and recurrent unrolled networks [26, 27], have been developed and trained to reconstruct up to 9X accelerated data with higher fidelity than CS. However, Refs. [25, 26, 27] did not leverage parallel imaging information which could potentially enable further scan time acceleration. Several other deep learning-based reconstruction methods that do leverage parallel imaging [21, 22] use a limited coil sensitivity model that remains susceptible to SENSE-related FOV limitations.

In this work, a combined parallel imaging and deep learning-based reconstruction framework is proposed for robust reconstruction of dynamic MRI data. The novelty of this framework is summarized as follows: 1) an extended coil sensitivity model based on ESPIRiT [28] is integrated into an existing deep learning-based reconstruction approach [24] to improve its robustness to SENSE-related FOV limitations and 2) a novel convolutional neural network (CNN) architecture based on separable 3D convolutions is developed to learn spatiotemporal priors for dynamic data more efficiently. We apply our novel DL-ESPIRiT network to 12X undersampled 2D cardiac cine data and show higher reconstruction accuracy than a combined parallel imaging and compressed sensing (PICS) algorithm with respect to standard image quality metrics. Furthermore, we show that as a result of improved image quality and sharpness, the accuracy of automatic LVEF measurements is improved over a state-of-the-art PICS reconstruction method.

## 2 Theory

### 2.1 Reconstruction Overview

The MR imaging process can be modelled as a linear system of the form:

$$y = PFSx \tag{1}$$

$$S = \begin{bmatrix} S_1 \\ \vdots \\ S_C \end{bmatrix} \tag{2}$$

where the true image ($x$) is transformed into the sampled multi-channel raw data ($y$) using a forward model comprised of the coil sensitivity operator ($S$), discrete Fourier transform ($F$), and k-space sampling operator ($P$). If all coil sensitivity maps ($S_i$, one for each of the $C$ receiver coils) can be reliably estimated from calibration data, then a SENSE [4] parallel imaging reconstruction can be directly obtained from $y$ using the least squares solution to Eq. 1.

Alternatively, ESPIRiT [28] exploits advantages of k-space based PI techniques by deriving SENSE-like sensitivity maps using an eigenvalue approach. Multiple sets of ESPIRiT maps can be derived from calibration data to flexibly represent overlapping anatomies in reduced FOV acquisitions using an augmented forward model:

$$y = \sum_{m=1}^{M} PFS^m x^m \tag{3}$$

$$S^m = \begin{bmatrix} S^m_1 \\ \vdots \\ S^m_C \end{bmatrix} \tag{4}$$

where $M$ is the number of sets of ESPIRiT maps, $S^m$ is the $m$-th set of ESPIRiT maps, and $x^m$ is the corresponding $m$-th image. The number of sets of ESPIRiT maps ($M$) is a hyperparameter that must be chosen prior to reconstruction. As suggested by Ref. [28], two sets of maps are sufficient in practice for high-fidelity reconstruction of images with anatomy overlap. For simplicity, Eq. 3 can be re-written in the same form as Eq. 1 by stacking the images $x^m$ vertically and concatenating the sensitivity map operators $S^m$ side by side:

$$y = PF\mathbf{S}\mathbf{x} \tag{5}$$

$$\mathbf{S} = \begin{bmatrix} S^1 & S^2 & \cdots & S^M \end{bmatrix} \tag{6}$$

$$\mathbf{x} = \begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^M \end{bmatrix} \tag{7}$$

into a single ESPIRiT matrix $\mathbf{S}$. Here, the $P$ and $F$ operators are repeated for each coil image. Similarly to SENSE, multiple ESPIRiT reconstructions can be obtained from $y$ using the least squares solution of Eq. 5.
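As an illustration of the augmented signal model, the following is a minimal NumPy sketch of the forward operator and its adjoint for a single time frame. The function names, array layouts, and the centered orthonormal FFT convention are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np

def fft2c(x):
    """Centered, orthonormal 2D FFT over the last two axes (assumed convention)."""
    x = np.fft.ifftshift(x, axes=(-2, -1))
    k = np.fft.fft2(x, axes=(-2, -1), norm="ortho")
    return np.fft.fftshift(k, axes=(-2, -1))

def ifft2c(k):
    """Inverse of fft2c."""
    k = np.fft.ifftshift(k, axes=(-2, -1))
    x = np.fft.ifft2(k, axes=(-2, -1), norm="ortho")
    return np.fft.fftshift(x, axes=(-2, -1))

def espirit_forward(x, maps, mask):
    """Apply P F sum_m S^m x^m.

    x:    (M, ny, nx) complex images, one per set of maps
    maps: (M, C, ny, nx) complex sensitivity maps
    mask: (ny, nx) binary sampling mask
    Returns (C, ny, nx) undersampled multi-coil k-space.
    """
    coil_images = np.sum(maps * x[:, None], axis=0)
    return mask * fft2c(coil_images)

def espirit_adjoint(y, maps, mask):
    """Apply the conjugate transpose: (C, ny, nx) k-space -> (M, ny, nx) images."""
    coil_images = ifft2c(mask * y)
    return np.sum(np.conj(maps) * coil_images[None], axis=1)
```

With a single set of maps (M = 1) this reduces to the SENSE model of Eq. 1. A dynamic reconstruction would apply the same operators frame by frame with a time-varying mask.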

For datasets acquired with an acceleration rate that is higher than what is supported by the coil hardware, the SENSE and ESPIRiT problems become ill-posed leading to noise amplification and residual aliasing artifacts in the reconstructed images. These can be suppressed by incorporating prior information about the images into the ESPIRiT reconstruction via regularization. The regularized ESPIRiT problem can be formulated as a non-linear inverse problem of the form:

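$$\hat{\mathbf{x}} = \arg\min_{\mathbf{x}} \frac{1}{2}\left\| y - PF\mathbf{S}\mathbf{x} \right\|_2^2 + \lambda R(\mathbf{x}) \tag{8}$$

where $R$ is a regularization function with associated regularization strength $\lambda$. In compressed sensing theory, the sampling operator ($P$) is designed to pseudo-randomly sample k-t space, causing aliasing artifacts in the image domain to appear incoherent and noise-like [14]. Thus, the regularization term is typically designed to suppress noise-like artifacts in the final images by promoting sparsity in some transform domain, such as in $\ell_1$-ESPIRiT. More generally, when $R$ is a proper convex function, the proximal gradient descent (PGD) method [29] can be used to iteratively solve the optimization problem in Eq. 8 by alternating between two updates. The first is a data consistency update of the form:

$$\mathbf{x}^{(k+\frac{1}{2})} = \mathbf{x}^{(k)} - t A^H\left(A\mathbf{x}^{(k)} - y\right) \tag{9}$$

where $A$ is the forward ESPIRiT signal model ($A = PF\mathbf{S}$), $A^H$ is its conjugate transpose, and $t$ is the PGD step size. This is followed by a proximal update of the form:

$$\mathbf{x}^{(k+1)} = \operatorname{prox}_{t\lambda R}\left(\mathbf{x}^{(k+\frac{1}{2})}\right) \tag{10}$$

where $\operatorname{prox}_{t\lambda R}$ is the proximal operator of the scaled regularizer $t\lambda R$, defined as:

$$\operatorname{prox}_{t\lambda R}(\mathbf{v}) = \arg\min_{\mathbf{u}} \frac{1}{2}\left\|\mathbf{u} - \mathbf{v}\right\|_2^2 + t\lambda R(\mathbf{u}) \tag{11}$$

In the case that $R$ is the $\ell_1$-norm of a unitary transform ($\Psi$) applied to $\mathbf{x}$, i.e. $R(\mathbf{x}) = \|\Psi\mathbf{x}\|_1$, the proximal operator simplifies into a soft-thresholding function in the transform domain.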
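The two alternating updates can be sketched in a few lines of NumPy. This is a toy illustration only: it assumes the sparsifying transform is the identity, takes the forward model and its adjoint as generic callables, and uses a small random matrix in place of the MRI signal model. The names `soft_threshold` and `pgd_l1` are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (works for real or complex input)."""
    mag = np.abs(v)
    scale = np.maximum(1.0 - tau / np.maximum(mag, 1e-12), 0.0)
    return scale * v

def pgd_l1(A, AH, y, x0, step, lam, n_iter=200):
    """Proximal gradient descent for 0.5*||y - A x||^2 + lam*||x||_1."""
    x = x0.copy()
    for _ in range(n_iter):
        x_half = x - step * AH(A(x) - y)         # data consistency update
        x = soft_threshold(x_half, step * lam)   # proximal update
    return x

def objective(A, y, x, lam):
    """Value of the regularized least squares objective."""
    return 0.5 * np.sum(np.abs(y - A(x)) ** 2) + lam * np.sum(np.abs(x))

# Toy problem: a sparse vector observed through a random matrix (an assumption
# standing in for the undersampled MRI forward model).
rng = np.random.default_rng(0)
mat = rng.standard_normal((40, 80)) / np.sqrt(40)
x_true = np.zeros(80)
x_true[[3, 17, 42]] = [1.0, -2.0, 1.5]
y = mat @ x_true
A = lambda x: mat @ x
AH = lambda r: mat.T @ r
step = 1.0 / np.linalg.norm(mat, 2) ** 2   # 1 / Lipschitz constant of the gradient
x_hat = pgd_l1(A, AH, y, np.zeros(80), step, lam=0.01, n_iter=500)
```

Choosing the step size as the reciprocal of the squared spectral norm of the forward operator guarantees that the objective does not increase from one iteration to the next.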

### 2.2 Data-driven Reconstruction Overview

Hand-crafted regularization functions have enabled significant acceleration of standard cardiac cine scans. However, data-driven regularization design can yield regularization functions that more accurately describe complex signal dynamics, and produce higher fidelity reconstructions as a result. In a deep-learning-based data-driven reconstruction, the regularization function is parameterized by a neural network, and automatically learned from historical exam data via some training process.

While other approaches directly learn $R$ using a field-of-experts model [21], a simpler and more straightforward approach is to learn the proximal operator of $R$ [30, 31]. This can be achieved by unrolling [32] the PGD algorithm in Eqs. 9 and 10 to a fixed number of iterations, and replacing the regularization function’s proximal operator with a neural network. This produces an unrolled PGD network of the form:

$$\mathbf{x}^{(k+\frac{1}{2})} = \mathbf{x}^{(k)} - t A^H\left(A\mathbf{x}^{(k)} - y\right) \tag{12}$$

$$\mathbf{x}^{(k+1)} = G_{\theta_k}\left(\mathbf{x}^{(k+\frac{1}{2})}\right) \tag{13}$$

where $G_{\theta_k}$ is a conditionally generative neural network whose parameters $\theta_k$ are learned uniquely for each of the $K$ iterations. The unrolled PGD network is trained end-to-end in a supervised fashion to output images which are close to fully-sampled reference images. Closeness is evaluated with respect to some dissimilarity metric $D$, for which common choices include pixel-wise $\ell_1$ and $\ell_2$ differences. A loss function is constructed from the average dissimilarity over all of the training examples:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} D\left(\hat{\mathbf{x}}_i, \mathbf{x}_i\right) \tag{14}$$

where $N$ is the number of training examples, $\hat{\mathbf{x}}_i$ is the unrolled PGD network output, and $\mathbf{x}_i$ is the fully-sampled image. The loss function is iteratively minimized over the training dataset with respect to the network parameters $\theta$ by a stochastic gradient descent (SGD) algorithm.
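The structure of the unrolled network and its training loss can be sketched as follows. This is a schematic forward pass only, under the assumption that each per-iteration network is available as a plain callable; the names `unrolled_pgd` and `average_l1_loss` are hypothetical, and no training loop or automatic differentiation is shown.

```python
import numpy as np

def unrolled_pgd(y, x0, A, AH, step, networks):
    """Forward pass of an unrolled PGD network.

    networks is a list of callables, one per unrolled iteration, each standing
    in for a learned proximal step with its own parameters.
    """
    x = x0
    for net in networks:
        x_half = x - step * AH(A(x) - y)   # data consistency update
        x = net(x_half)                    # learned proximal update
    return x

def average_l1_loss(outputs, references):
    """Average pixel-wise l1 dissimilarity over a set of training examples."""
    per_example = [np.sum(np.abs(o - r)) for o, r in zip(outputs, references)]
    return float(np.mean(per_example))
```

Because every step is differentiable with respect to the network parameters, a framework with automatic differentiation can backpropagate the loss through all unrolled iterations at once.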

## 3 Methods

### 3.1 Network Architecture

The DL-ESPIRiT network architecture directly follows the unrolled proximal gradient descent network described in the previous section. As shown in Figure 1, the DL-ESPIRiT network takes as inputs a zero-filled reconstruction of a 2D cardiac cine slice, as well as its corresponding ESPIRiT maps computed from time-averaged k-space data. The network then alternates between 3D spatiotemporal convolutional neural networks (CNNs) and data consistency steps. It is trained to output images which are close to the corresponding fully-sampled ground truth images in an $\ell_1$ pixel-wise sense:

$$L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \left\|\hat{\mathbf{x}}_i - \mathbf{x}_i\right\|_1 \tag{15}$$

Much like in the PGD algorithm, the data consistency step applies a gradient descent update in which the gradient of the data consistency term in Eq. 8 is subtracted from the output of the CNN. In this step, ESPIRiT maps are used by the signal model to project data back and forth between image and k-space domains. This step ensures that the final output of the DL-ESPIRiT network does not deviate from the acquired k-space data.

In between data consistency and CNN steps, the complex-valued data is converted into two real-valued channels by stacking the real and imaginary parts along the feature dimension. Additionally, images corresponding to each set of ESPIRiT maps are also stacked along the feature dimension. For example, when using two sets of ESPIRiT maps, a total of four image channels is passed as input to the CNN. This way, information across multiple images is shared, allowing the CNN to learn a joint regularization function across the ESPIRiT channel dimension. In contrast, $\ell_1$-ESPIRiT regularizes each channel independently since it is not obvious how to jointly regularize these images.
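A minimal sketch of this conversion is given below. The channel ordering (all real parts followed by all imaginary parts) and the channels-last layout are assumptions for illustration; the text does not specify them.

```python
import numpy as np

def complex_to_channels(x):
    """Complex (M, nt, ny, nx) images -> real (nt, ny, nx, 2*M) feature maps."""
    stacked = np.concatenate([x.real, x.imag], axis=0)   # (2*M, nt, ny, nx)
    return np.moveaxis(stacked, 0, -1)

def channels_to_complex(c):
    """Inverse of complex_to_channels."""
    c = np.moveaxis(c, -1, 0)        # (2*M, nt, ny, nx)
    m = c.shape[0] // 2
    return c[:m] + 1j * c[m:]
```

For two sets of ESPIRiT maps (M = 2), the CNN input therefore has four feature channels, matching the example in the text.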

Each neural network between data consistency steps is a fully convolutional residual network (ResNet), which is currently the state-of-the-art architecture for computer vision tasks [33]. Each ResNet is composed of 3D convolutional layers with $3 \times 3 \times 3$ kernels to leverage all spatial and temporal dimensions for de-aliasing. There are five total convolutional layers for each DL-ESPIRiT iteration, which corresponds to a spatiotemporal receptive field of size $11 \times 11 \times 11$. The first convolution of each ResNet expands the initial (2 or 4) images into 96 feature maps, which are propagated through the network until the final convolution where they are recombined into the original number of images. All convolutional layers are preceded by ReLU pre-activation layers as recommended by He et al. [34]. Furthermore, convolutions are implemented using circular padding along the phase encoding and temporal directions in order to enforce circular boundary conditions in the two dimensions [24].
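The receptive field arithmetic and the circular padding can be illustrated with a short sketch. The axis ordering (time, phase encoding, readout) and the use of zero padding along the readout direction are assumptions made for this example; the text only specifies circular padding along the phase encoding and temporal directions.

```python
import numpy as np

def receptive_field(n_layers, kernel=3):
    """Receptive field along one axis of n stacked stride-1 convolutions."""
    return 1 + n_layers * (kernel - 1)

def pad_for_conv(x, pad=1):
    """Pad a (nt, ny, nx) array before a 3x3x3 convolution.

    Time (axis 0) and phase encoding (assumed axis 1) wrap around circularly;
    readout (assumed axis 2) is zero padded.
    """
    x = np.pad(x, ((pad, pad), (pad, pad), (0, 0)), mode="wrap")
    return np.pad(x, ((0, 0), (0, 0), (pad, pad)), mode="constant")

rf = receptive_field(n_layers=5, kernel=3)   # five 3x3x3 layers per iteration
```

Each additional 3-wide layer grows the receptive field by two samples per axis, so five layers cover 11 samples in every dimension.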

Inspired by previous work in CNN architecture design for video-based action recognition tasks, we also implement the DL-ESPIRiT network using separable 3D convolutions, henceforth referred to as (2+1)D convolutions [35]. In this version of the network, all 3D convolutions are replaced with (2+1)D convolutions which have been decomposed into simpler 2D spatial and 1D temporal components (Fig. 2B). An additional ReLU activation is added in between 2D spatial and 1D temporal convolutions to enhance the network’s representational power. As shown by Figure 2C, (2+1)D convolutional kernels are simpler to learn, resulting in lower training loss and higher reconstruction accuracy [36]. To make a fairer comparison, the number of learnable parameters between 3D and (2+1)D convolutions is matched by expanding the number of 2D spatial filters to be:

$$N_{\text{mid}} = \left\lfloor \frac{\tau s^2 N_{\text{in}} N_{\text{out}}}{s^2 N_{\text{in}} + \tau N_{\text{out}}} \right\rfloor \tag{16}$$

where the spatial kernel size is $s \times s$, the temporal kernel size is $\tau$, $N_{\text{in}}$ is the number of input feature maps, $N_{\text{out}}$ is the number of feature maps output by the temporal convolution, and $N_{\text{mid}}$ is the number of features output by the intermediate spatial convolution.
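The parameter matching can be checked with simple arithmetic. The sketch below counts convolution weights only (bias terms are ignored, which is an assumption), and the function names are hypothetical.

```python
def matched_spatial_filters(s, tau, n_in, n_out):
    """Number of intermediate spatial filters that matches the 3D weight count."""
    return (tau * s * s * n_in * n_out) // (s * s * n_in + tau * n_out)

def weights_3d(s, tau, n_in, n_out):
    """Weights in one full 3D convolution with an s x s x tau kernel."""
    return tau * s * s * n_in * n_out

def weights_2plus1d(s, tau, n_in, n_mid, n_out):
    """Weights in a 2D spatial convolution followed by a 1D temporal one."""
    return s * s * n_in * n_mid + tau * n_mid * n_out

n_mid = matched_spatial_filters(s=3, tau=3, n_in=96, n_out=96)
```

For 3 x 3 spatial kernels, a temporal kernel of size 3, and 96 input and output feature maps, this gives 216 intermediate filters, and both layer types then hold 248,832 weights.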

### 3.2 Data Acquisition

Fully-sampled, multi-slice 2D cardiac cine datasets were collected from 22 healthy volunteers using an ECG-gated balanced steady-state free precession (bSSFP) sequence with IRB approval. Each volunteer was asked to repeatedly hold his or her breath for 15-20 seconds during the acquisition of each slice. Every breath-hold was followed by adequate resting time to allow the subject to recover and ensure consistent breath-holding. Data collection was performed on a mixture of 1.5T and 3.0T GE MRI scanners (GE Healthcare, Waukesha, WI) using a 32-channel cardiac coil. The relevant scan parameters are shown in Table 1. For each volunteer, data were acquired at multiple slices and at different cardiac views including standard short-axis and long-axis (2-chamber, 3-chamber and 4-chamber) views. All datasets were coil compressed [37] down to 8 virtual coils for computational speed and memory considerations.

To prove the feasibility of the proposed DL-ESPIRiT reconstruction method, a customized ECG-gated bSSFP sequence is implemented to support prospective acquisition of each slice in a single heartbeat. Data acquisition of each slice is preceded by a whole cardiac cycle of dummy pulses to establish signal steady state. The k-t view-ordering is designed using a variable-density undersampling mask similar to the masks used for training [38]. To reduce the appearance of banding artifacts, the repetition time is minimized using a partial echo acquisition. With IRB approval, fully-sampled and prospectively undersampled datasets are acquired in two separate scans from a healthy volunteer on a 1.5T scanner.

Scan Parameters | Training Data (R=1) | Prospective (R=1) | Prospective (R=12) |
---|---|---|---|
Echo Time (ms) | 1.6 | 1.6 | 1.2 (partial) |
Repetition Time (ms) | 3.7 | 3.7 | 3.3 |
Flip Angle (°) | 40-60 | 60 | 60 |
Number of Slices | 1-12 | 5 | 5 |
Slice Thickness (mm) | 8 | 8 | 8 |
Matrix Size | 200-224 x 160-180 | 200 x 170 | 200 x 170 |
Spatial Resolution (mm) | 1.8-2.0 x 1.6-1.8 | 1.8 x 1.8 | 1.8 x 1.8 |
Temporal Resolution (ms) | 40-42 | 42 | 42 |
Scan Time (s) | 15-250 | 77 | 10 |

### 3.3 Training

Datasets collected from each volunteer are divided into training, validation, and test sets (0.5, 0.05, 0.45 split). For training, each dataset is split up slice-by-slice to create 180 unique training examples. This number is further augmented using data augmentation techniques such as random flipping and circular translations along phase encoding and temporal directions. Additionally, images are randomly cropped along the readout direction to 64 points to reduce memory requirements for training. To show the network an adequate number of examples with anatomy overlap, the FOV in the phase encoding direction is retrospectively reduced by random factors between 0-15%. Variable-density undersampling masks are randomly generated with a 10-15X acceleration rate and applied on-the-fly to each example during training. Additionally, the first 20-30% of k-space readout is masked out to simulate a partial echo acquisition. Finally, to simulate different temporal resolutions, the number of cardiac phases in the training data is varied between 12-20 by retrospectively sorting the data using a nearest neighbor gating approach.
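The on-the-fly undersampling and augmentation can be sketched as below. This is a simplified illustration and not the mask design of Ref. [38]: the polynomial sampling density, the choice of flip axis, and the function names are all assumptions.

```python
import numpy as np

def variable_density_mask(ny, nt, accel, power=2.0, seed=None):
    """Random k-t mask of shape (nt, ny) with denser sampling near k-space center."""
    rng = np.random.default_rng(seed)
    ky = np.abs(np.linspace(-1.0, 1.0, ny))
    density = (1.0 - ky) ** power + 1e-3      # small floor keeps every line selectable
    density /= density.sum()
    n_lines = max(1, round(ny / accel))       # phase encodes kept per cardiac phase
    mask = np.zeros((nt, ny), dtype=bool)
    for t in range(nt):
        lines = rng.choice(ny, size=n_lines, replace=False, p=density)
        mask[t, lines] = True
    return mask

def augment(x, rng):
    """Random flip and circular shifts of a (nt, ny, nx) image series."""
    if rng.random() < 0.5:
        x = x[:, ::-1, :]                     # flip along phase encoding (assumed axis)
    x = np.roll(x, int(rng.integers(x.shape[0])), axis=0)   # circular shift in time
    x = np.roll(x, int(rng.integers(x.shape[1])), axis=1)   # circular shift in phase encoding
    return x

mask = variable_density_mask(ny=180, nt=20, accel=12, seed=0)
```

Drawing a new mask for every training example, with the acceleration rate itself sampled between 10 and 15, exposes the network to many different aliasing patterns rather than a single fixed one.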

To evaluate the impact of adding the extended coil sensitivity model, we trained two separate DL-ESPIRiT networks based on 3D convolutions with one set of maps (3DM1) and two sets of maps (3DM2). Additionally, to evaluate performance across 3D and (2+1)D networks, we compared the 3DM2 trained network against another network based on (2+1)D convolutions and two sets of maps ((2+1)DM2). All networks comprise 10 PGD iterations, one 5-layer ResNet per PGD iteration, and 96 feature maps per convolutional layer. As stated previously, the number of spatial feature maps in each (2+1)D convolution is expanded to 216 so that 3D and (2+1)D networks have the same number of learnable parameters (see Eq. 16).

Both training and inference pipelines are implemented in TensorFlow [39]. Due to memory limitations, each network is trained with a batch size of 1 using the Adam optimizer [40] with hyperparameters $\beta_1$=0.9, $\beta_2$=0.999, =10, initial learning rate of 10, and 200k training steps. A warm restart is performed after 100k training steps with a decayed learning rate of 10 [41]. 3D and (2+1)D networks were trained for a total of 178 and 317 hours respectively. To enable training a larger network with more PGD iterations, each DL-ESPIRiT network is split in half and trained across two NVIDIA Tesla V100 16GB video cards.

### 3.4 Evaluation

The performance of DL-ESPIRiT is evaluated on fully sampled data from which we can obtain ground-truth images. Specifically, ten fully-sampled short-axis cardiac cine datasets, excluded from the training, are retrospectively undersampled and reconstructed slice-by-slice using various reconstruction algorithms. We compare all 3D and (2+1)D DL-ESPIRiT approaches against a standard PICS reconstruction method known as $\ell_1$-ESPIRiT with spatial and temporal total variation (TV) regularization [28]. Regularization strengths for spatial and temporal TV priors are empirically determined to be 0.002 and 0.01 respectively, based on our pre-tuning of the two regularization parameters. The $\ell_1$-ESPIRiT problem is solved using the alternating direction method of multipliers [42] algorithm with 200 inner-loop iterations. We use the Berkeley Advanced Reconstruction Toolbox (BART, v0.4.04) implementation with GPU acceleration [43]. All reconstructions are performed on a separate computer system from the one used for training with one NVIDIA GTX 1080 Ti 12GB video card. Reconstruction times for $\ell_1$-ESPIRiT, 3D DL-ESPIRiT, and (2+1)D DL-ESPIRiT were 5.36 ± 0.05, 3.89 ± 0.04, and 4.89 ± 0.03 seconds per slice respectively.

To evaluate image quality for each reconstruction, we compute peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) [44] for all slices and all volunteers. Image quality metrics are evaluated with respect to the corresponding fully-sampled reference images. A two-tailed, paired $t$-test is conducted to determine the statistical significance of the improved image quality metrics between the different reconstruction methods. $\ell_1$-ESPIRiT and DL-ESPIRiT reconstructions are further evaluated based on the measurement accuracy of multiple cardiac functional indices including left ventricular end-diastolic volume (EDV), end-systolic volume (ESV), stroke volume (SV), and ejection fraction (EF). Indices are measured based on automatic segmentations of epicardial and endocardial left ventricular borders using a pre-trained convolutional neural network that has been FDA cleared for the assessment of heart function [45]. Measurement accuracy is evaluated with respect to automatic segmentations computed from fully-sampled images. All segmentations and volumetric analyses are performed on the Arterys platform (Arterys Inc, San Francisco, CA).
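For reference, PSNR and the derived functional indices can be computed as in the sketch below. Evaluating PSNR on magnitude images and taking the peak value from the reference image are assumptions, since the text does not state the exact convention; the stroke volume and ejection fraction formulas are the standard definitions. The function names are hypothetical.

```python
import numpy as np

def psnr(reference, reconstruction):
    """Peak signal-to-noise ratio in dB, computed on magnitude images."""
    ref = np.abs(reference)
    rec = np.abs(reconstruction)
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(ref.max() ** 2 / mse)

def lv_indices(edv_ml, esv_ml):
    """Stroke volume (mL) and ejection fraction (%) from EDV and ESV in mL."""
    sv = edv_ml - esv_ml
    ef = 100.0 * sv / edv_ml
    return sv, ef
```

Because EF depends on the ratio of SV to EDV, an overestimated ESV lowers both SV and EF, which is why temporal blurring near end-systole directly affects these indices.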

## 4 Results

As shown by Fig. 3, 3DM1 DL-ESPIRiT reconstructions show residual ghosting artifacts originating from sensitivity map errors in areas with anatomy overlap. By augmenting the signal model with two sets of ESPIRiT maps, the 3DM2 DL-ESPIRiT reconstruction is able to reconstruct overlapped components separately without ghosting artifacts.

Figure 4 shows representative $\ell_1$-ESPIRiT and DL-ESPIRiT reconstructions of data that was retrospectively undersampled by 12X. For this set of images, DL-ESPIRiT is able to capture more realistic cardiac dynamics than $\ell_1$-ESPIRiT. $\ell_1$-ESPIRiT reconstructions show significant staircasing artifacts along time, which are characteristic of total variation-based reconstructions. Furthermore, error maps in Fig. 4 show that the (2+1)DM2 network produces higher fidelity images than both $\ell_1$-ESPIRiT and 3DM2 DL-ESPIRiT reconstructions. In Fig. 5, the performance across all reconstruction methods for acceleration rates 10, 12, and 14 is compared. As the acceleration rate is increased, each method produces progressively blurrier images, except for the (2+1)DM2 DL-ESPIRiT reconstruction method, which retains sharpness of left ventricular trabeculae.

Figure 6 shows average PSNR and SSIM metrics across all slices for each test subject. (2+1)DM2 DL-ESPIRiT significantly outperforms $\ell_1$-ESPIRiT with respect to PSNR and SSIM metrics (t(9)=5.781, $p$=0.0002). For most subjects, (2+1)DM2 DL-ESPIRiT outperforms 3DM2 DL-ESPIRiT with respect to both metrics (t(9)=4.817, $p$=0.00095).

Figure 7 shows Bland-Altman plots for left ventricular EDV, ESV, SV, and EF measured from $\ell_1$-ESPIRiT and (2+1)DM2 DL-ESPIRiT reconstructed images. As a result of improved image quality and sharpness, automatic segmentations from DL-ESPIRiT images produce more accurate EDV, ESV, SV, and EF measurements than $\ell_1$-ESPIRiT with respect to measurements made from fully-sampled images.

Figure 8 shows a representative example of $\ell_1$-ESPIRiT and DL-ESPIRiT reconstructions of prospectively undersampled data acquired with an acceleration rate of 12X. Again, DL-ESPIRiT generalizes well for this prospective case, producing visually sharper images than $\ell_1$-ESPIRiT. Compared to fully-sampled images that were acquired in a separate scan, DL-ESPIRiT reconstructions faithfully depict cardiac anatomy and dynamics, except for a loss of definition in some fine structures such as small papillary muscles.

## 5 Discussion

In this work, a combined parallel imaging and deep learning reconstruction method known as DL-ESPIRiT is developed and trained to reconstruct vastly accelerated 2D cardiac cine MRI data with higher fidelity than compressed sensing. DL-ESPIRiT is able to leverage historical exam data to automatically learn a better prior than fixed hand-crafted priors in compressed sensing. We build on previous data-driven reconstruction methods by incorporating a robust parallel imaging model, known as ESPIRiT, that is capable of leveraging multi-coil information to reconstruct images without SENSE-related FOV limitations. Additionally, a novel 3D CNN architecture based on separable 3D convolutions is proposed to simplify network training and achieve higher reconstruction accuracy.

Within this framework, we propose a generalized method for reconstruction of multiple images corresponding to multiple sets of SENSE-like coil sensitivity maps derived using ESPIRiT. Compared to previously proposed frameworks that use a single set of sensitivity maps in their signal models [21, 22], DL-ESPIRiT uses multiple sets of ESPIRiT maps to enhance its robustness to errors in the map estimation process. This is especially advantageous for 2D cardiac cine imaging, which is performed in double oblique slices and thus prone to sensitivity map errors arising from anatomy overlap. Other data-driven learning methods that do not rely on sensitivity maps instead apply neural networks in k-space to directly synthesize missing data samples [46, 47]. These methods are also exempt from artifacts caused by erroneous sensitivity maps; however, they rely on local correlations in k-space to estimate missing data, which may limit the choice of sampling pattern and increase computation compared to SENSE-like reconstruction.

In this work, we compare our method against an existing PICS method known as $\ell_1$-ESPIRiT using two sets of sensitivity maps and a spatiotemporal total variation prior. As shown by reconstructions of 12X accelerated data in Figs. 4 and 5, $\ell_1$-ESPIRiT produces images with significant spatiotemporal blurring and staircasing artifacts. As a result, automatic segmentations made on $\ell_1$-ESPIRiT reconstructions produce less accurate measurements of ventricular volumes. In particular, end-systolic volumetric measurements are overestimated, which is likely caused by temporal blurring due to rapid cardiac motion near end-systole. This was also found in another study by Inoue et al., which found that 2D cardiac cine scans acquired with lower temporal resolution tend to produce overestimated ESV and underestimated LVEF [48].

On the other hand, DL-ESPIRiT produces visually sharper images and, as a result, more accurate LVEF measurements in reconstructions of retrospectively undersampled data. The same trend is observed in prospectively undersampled acquisitions as well. The DL-ESPIRiT network based on (2+1)D convolutions in particular produced the sharpest looking images with the best image quality on average. This can be attributed to multiple factors. Despite 3D and (2+1)D networks having the same number of learnable parameters, the (2+1)D network exhibits better training convergence and reconstruction accuracy due to the convolutional kernels’ simpler structure [35]. Moreover, the additional ReLU activation layer in between spatial and temporal convolutions may enhance the (2+1)D network’s representational power and therefore its ability to learn more complex priors. Although higher representational power could cause the network model to overfit the training data, this phenomenon is not observed in the monotonically decreasing validation loss curves shown in Fig. 2C.

Despite improvements in image quality over $\ell_1$-ESPIRiT, DL-ESPIRiT reconstructions are notably blurred with high acceleration for both retrospectively and prospectively undersampled data. This is especially evident in fine structures, such as papillary muscles and trabeculae, which can appear significantly blurred during systolic phases. Sharpness could be improved by using different training loss functions, such as an adversarial loss, which has been shown to produce visually sharper images than networks trained on $\ell_1$ and $\ell_2$ pixel-wise loss for MRI reconstruction problems [49]. Exploring different training loss functions for enhancing sharpness of these structures will be the subject of future work.

In previous works [21], deep learning-based reconstruction approaches were shown to be computationally much faster than compressed sensing-based ones. In this work, we found DL-ESPIRiT reconstruction times to be only slightly faster than $\ell_1$-ESPIRiT. The discrepancy between our work and previous work is due to 3D convolutions. While 3D convolutions provide a natural way of jointly exploiting spatial and temporal dimensions for reconstruction, they are computationally more complex and memory intensive when compared to 2D convolutions. (2+1)D convolutions are composed of 2D and 1D convolutions in series and, in theory, could be implemented using 2D convolutions instead of 3D convolutions as described in this work. However, 2D convolutions only accept 4D arrays in the (batch size, height, width, channels) format. Therefore, inputting dynamic data would require looping over the time dimension for the 2D spatial convolution, and looping over a spatial dimension for the 1D temporal convolution. This implementation requires significant memory overhead during training, which ultimately limits network depth and performance.

One limitation of this study is that the training data were retrospectively undersampled to numerically simulate data collected with a faster scan time. This was done in order to provide a proper baseline for comparison and evaluation of the $\ell_1$-ESPIRiT and DL-ESPIRiT reconstruction techniques. However, training on simulated data may be suboptimal since prospectively collected data may contain features which are not present in the training data. In bSSFP imaging, for example, fast transitions between phase encoding steps give rise to eddy currents that can perturb the steady-state signal and introduce image artifacts [50]. These were not observed in our preliminary data with prospective undersampling shown in Fig. 8, but may arise for datasets collected with higher acceleration rates, smaller fields-of-view, or finer in-plane spatial resolution. Different eddy current compensation strategies such as phase encode pairing [50] will be the subject of future work.

## 6 Conclusion

A novel deep learning-based reconstruction framework known as DL-ESPIRiT is proposed which improves robustness to SENSE-related FOV limitations over existing deep learning frameworks. Furthermore, a novel CNN architecture based on (2+1)D convolutions is proposed and integrated into DL-ESPIRiT to enhance spatiotemporal learning for dynamic image reconstruction. As a result of these two developments, a single breath-hold 2D cardiac cine scan is feasible, which can potentially lead to more accurate ventricular function assessment and improved patient comfort.

## Acknowledgements

The authors would like to acknowledge Haonan Wang (GE Healthcare) for his assistance in data collection, and Neerav Dixit (Stanford University) for helpful discussion.

## References

- [1] Sakuma H, Fujita N, Foo TK, Caputo GR, Nelson SJ, Hartiala J, Shimakawa A, Higgins CB. Evaluation of left ventricular volume and mass with breath-hold cine MR imaging. Radiology 1993; 188:377–380.
- [2] Greil GF, Germann S, Kozerke S, Baltes C, Tsao J, Urschitz MS, Seeger A, Tangcharoen T, Bialkowsky A, Miller S, Sieverding L. Assessment of left ventricular volumes and mass with fast 3D cine steady-state free precession k-t space broad-use linear acquisition speed-up technique (k-t BLAST). Journal of Magnetic Resonance Imaging 2008; 27:510–515.
- [3] Roifman I, Gutierrez J, Wang E, Biswas L, Sparkes J, Connelly KA, Wright GA. Evaluating a novel free-breathing accelerated cardiac MRI cine sequence in patients with cardiomyopathy. Magnetic Resonance Imaging 2019; 61:260–266.
- [4] Pruessmann KP, Weiger M, Scheidegger MB, Boesiger P. SENSE: Sensitivity encoding for fast MRI. Magnetic Resonance in Medicine 1999; 42:952–962.
- [5] Griswold MA, Kannengiesser S, Heidemann RM, Wang J, Jakob PM. Field-of-view limitations in parallel imaging. Magnetic Resonance in Medicine 2004; 52:1118–1126.
- [6] Goldfarb JW. The SENSE ghost: Field-of-view restrictions for SENSE imaging. Journal of Magnetic Resonance Imaging 2004; 20:1046–1051.
- [7] Griswold MA, Jakob PM, Heidemann RM, Nittka M, Jellus V, Wang J, Kiefer B, Haase A. Generalized Autocalibrating Partially Parallel Acquisitions (GRAPPA). Magnetic Resonance in Medicine 2002; 47:1202–1210.
- [8] Lustig M, Pauly JM. SPIRiT: Iterative self-consistent parallel imaging reconstruction from arbitrary k-space. Magnetic Resonance in Medicine 2010; 64:457–471.
- [9] Blaimer M, Breuer F, Mueller M, Heidemann RM, Griswold MA, Jakob PM. SMASH, SENSE, PILS, GRAPPA: How to choose the optimal method. Topics in Magnetic Resonance Imaging 2004; 15:223–236.
- [10] Kramer CM, Barkhausen J, Flamm SD, Kim RJ, Nagel E. Standardized cardiovascular magnetic resonance (CMR) protocols 2013 update. Journal of Cardiovascular Magnetic Resonance 2013; 15.
- [11] Kellman P, Epstein FH, McVeigh ER. Adaptive sensitivity encoding incorporating temporal filtering (TSENSE). Magnetic Resonance in Medicine 2001; 45:846–852.
- [12] Breuer FA, Kellman P, Griswold MA, Jakob PM. Dynamic autocalibrated parallel imaging using temporal GRAPPA (TGRAPPA). Magnetic Resonance in Medicine 2005; 53:981–985.
- [13] Tsao J, Boesiger P, Pruessmann KP. K-t BLAST and k-t SENSE: Dynamic MRI with high frame rate exploiting spatiotemporal correlations. Magnetic Resonance in Medicine 2003; 50:1031–1042.
- [14] Lustig M, Donoho D, Pauly JM. Sparse MRI: The application of compressed sensing for rapid MR imaging. Magnetic Resonance in Medicine 2007; 58:1182–1195.
- [15] Jung H, Sung K, Nayak KS, Kim EY, Ye JC. K-t FOCUSS: A general compressed sensing framework for high resolution dynamic MRI. Magnetic Resonance in Medicine 2009; 61:103–116.
- [16] Feng L, Srichai MB, Lim RP, Harrison A, King W, Adluru G, Dibella EV, Sodickson DK, Otazo R, Kim D. Highly accelerated real-time cardiac cine MRI using k-t SPARSE-SENSE. Magnetic Resonance in Medicine 2013; 70:64–74.
- [17] Vincenti G, Monney P, Chaptinel J, Rutz T, Coppo S, Zenge MO, Schmidt M, Nadar MS, Piccini D, Chèvre P, Stuber M, Schwitter J. Compressed sensing single-breath-hold CMR for fast quantification of LV function, volumes, and mass. JACC: Cardiovascular Imaging 2014; 7:882–892.
- [18] Kido T, Kido T, Nakamura M, Watanabe K, Schmidt M, Forman C, Mochizuki T. Compressed sensing real-time cine cardiovascular magnetic resonance: Accurate assessment of left ventricular function in a single-breath-hold. Journal of Cardiovascular Magnetic Resonance 2016; 18.
- [19] Xue H, Kellman P, Larocca G, Arai AE, Hansen MS. High spatial and temporal resolution retrospective cine cardiovascular magnetic resonance from shortened free breathing real-time acquisitions. Journal of Cardiovascular Magnetic Resonance 2013; 15:102.
- [20] Yang Y, Sun J, Li H, Xu Z. Deep ADMM-Net for Compressive Sensing MRI. In: Advances in Neural Information Processing Systems 29, Barcelona, Spain, 2016. pp. 10–18.
- [21] Hammernik K, Klatzer T, Kobler E, Recht MP, Sodickson DK, Pock T, Knoll F. Learning a variational network for reconstruction of accelerated MRI data. Magnetic Resonance in Medicine 2018; 79:3055–3071.
- [22] Cheng JY, Chen F, Alley MT, Pauly JM, Vasanawala SS. Highly Scalable Image Reconstruction using Deep Neural Networks with Bandpass Filtering. arXiv:1805.03300 [physics] 2018.
- [23] Aggarwal HK, Mani MP, Jacob M. MoDL: Model-Based Deep Learning Architecture for Inverse Problems. IEEE Transactions on Medical Imaging 2019; 38:394–405.
- [24] Cheng JY, Chen F, Sandino C, Mardani M, Pauly JM, Vasanawala SS. Compressed Sensing: From Research to Clinical Practice with Data-Driven Learning. arXiv:1903.07824 [eess.IV] 2019.
- [25] Schlemper J, Caballero J, Hajnal JV, Price A, Rueckert D. A Deep Cascade of Convolutional Neural Networks for Dynamic MR Image Reconstruction. IEEE Transactions on Medical Imaging 2018; 37:491–503.
- [26] Qin C, Schlemper J, Caballero J, Price AN, Hajnal JV, Rueckert D. Convolutional recurrent neural networks for dynamic MR image reconstruction. IEEE Transactions on Medical Imaging 2019; 38:280–290.
- [27] Qin C, Schlemper J, Duan J, Seegoolam G, Price A, Hajnal J, Rueckert D. K-t NEXT: Dynamic MR Image Reconstruction Exploiting Spatio-temporal Correlations. arXiv:1907.09425 [cs, eess] 2019.
- [28] Uecker M, Lai P, Murphy MJ, Virtue P, Elad M, Pauly JM, Vasanawala SS, Lustig M. ESPIRiT - An eigenvalue approach to autocalibrating parallel MRI: Where SENSE meets GRAPPA. Magnetic Resonance in Medicine 2014; 71:990–1001.
- [29] Parikh N, Boyd S. Proximal Algorithms. Foundations and Trends in Optimization 2014.
- [30] Diamond S, Sitzmann V, Heide F, Wetzstein G. Unrolled Optimization with Deep Priors. arXiv:1705.08041 [cs.CV] 2017.
- [31] Mardani M, Sun Q, Vasawanala S, Papyan V, Monajemi H, Pauly J, Donoho D. Neural Proximal Gradient Descent for Compressive Imaging. arXiv:1806.03963 [cs] 2018.
- [32] Gregor K, LeCun Y. Learning Fast Approximations of Sparse Coding. In: Proceedings of the 27th International Conference on Machine Learning, USA, 2010. pp. 399–406.
- [33] He K, Zhang X, Ren S, Sun J. Deep Residual Learning for Image Recognition. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, Nevada, United States, 2016.
- [34] He K, Zhang X, Ren S, Sun J. Identity Mappings in Deep Residual Networks. arXiv:1603.05027 [cs] 2016.
- [35] Tran D, Wang H, Torresani L, Ray J, LeCun Y, Paluri M. A Closer Look at Spatiotemporal Convolutions for Action Recognition. arXiv:1711.11248 [cs] 2017.
- [36] Sandino CM, Lai P, Vasanawala SS, Cheng JY. Deep reconstruction of dynamic MRI data using separable 3D convolutions. In: Proceedings of International Society of Magnetic Resonance in Medicine Workshop on Machine Learning, Part II, Washington, DC, United States, 2018.
- [37] Zhang T, Pauly JM, Vasanawala SS, Lustig M. Coil compression for accelerated imaging with Cartesian sampling. Magnetic Resonance in Medicine 2013; 69:571–582.
- [38] Lai P, Brau A. Improving cardiac cine MRI on 3T using 2D k-t accelerated auto-calibrating parallel imaging. Journal of Cardiovascular Magnetic Resonance 2014; 16:W3.
- [39] Abadi M, Agarwal A, Barham P, Brevdo E, Chen Z, Citro C, Corrado GS, Davis A, Dean J, Devin M, Ghemawat S, Goodfellow I, Harp A, Irving G, Isard M, Jia Y, Jozefowicz R, Kaiser L, Kudlur M, Levenberg J, Mane D, Monga R, Moore S, Murray D, Olah C, Schuster M, Shlens J, Steiner B, Sutskever I, Talwar K, Tucker P, Vanhoucke V, Vasudevan V, Viegas F, Vinyals O, Warden P, Wattenberg M, Wicke M, Yu Y, Zheng X. TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. arXiv:1603.04467 [cs.DC] 2016.
- [40] Kingma DP, Ba JL. Adam: A method for stochastic optimization. In: Proceedings of the 3rd International Conference on Learning Representations, San Diego, CA, United States, 2015.
- [41] Loshchilov I, Hutter F. SGDR: Stochastic Gradient Descent with Warm Restarts. arXiv:1608.03983 [cs, math] 2016.
- [42] Boyd S, Parikh N, Chu E, Peleato B, Eckstein J. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends in Machine Learning 2010; 3:1–122.
- [43] Uecker M, Ong F, Tamir JI, Bahri D, Virtue P, Cheng JY, Zhang T, Lustig M. Berkeley advanced reconstruction toolbox. In: Proceedings of 23rd Annual Meeting of the International Society of Magnetic Resonance in Medicine, Toronto, Ontario, Canada, 2015.
- [44] Wang Z, Bovik AC, Sheikh HR, Simoncelli EP. Image quality assessment: From error visibility to structural similarity. IEEE Transactions on Image Processing 2004; 13:1–13.
- [45] Lieman-Sifry J, Le M, Lau F, Sall S, Golden D. FastVentricle: Cardiac segmentation with E-Net. In: International Conference on Functional Imaging and Modeling of the Heart, Toronto, Ontario, Canada, 2017.
- [46] Han Y, Sunwoo L, Ye JC. K-Space Deep Learning for Accelerated MRI. IEEE Transactions on Medical Imaging (Early Access) 2019.
- [47] Akçakaya M, Moeller S, Weingärtner S, Ugurbil K. Scan-specific robust artificial-neural-networks for k-space interpolation (RAKI) reconstruction: Database-free deep learning for fast imaging. Magnetic Resonance in Medicine 2019; 81:439–453.
- [48] Inoue Y, Nomura Y, Nakaoka T, Watanabe M, Kiryu S, Okubo T, Ohtomo K. Effect of temporal resolution on the estimation of left ventricular function by cardiac MR imaging. Magnetic Resonance Imaging 2005; 23:641–645.
- [49] Mardani M, Gong E, Cheng JY, Vasanawala SS, Zaharchuk G, Xing L, Pauly JM. Deep generative adversarial neural networks for compressive sensing MRI. IEEE Transactions on Medical Imaging 2019; 38:167–179.
- [50] Bieri O, Markl M, Scheffler K. Analysis and compensation of eddy currents in balanced SSFP. Magnetic Resonance in Medicine 2005; 54:129–137.