Synthetic contrast enhancement in cardiac CT with Deep Learning
In Europe, 20% of CT scans cover the thoracic region. The acquired images contain information about the cardiovascular system that often remains latent due to the lack of contrast in the cardiac area. Contrast-enhanced computed tomography (CECT), on the other hand, is an imaging technique that allows easy assessment of the cardiac chamber volumes and the contrast dynamics. In this work we address the extraction and presentation of this latent information using a deep learning approach based on convolutional neural networks. Starting from relevant features extracted from the image without contrast medium, we re-map them onto features typical of CECT, to synthesize an image whose attenuation in the cardiac chambers looks as if a virtual iodinated contrast medium had been injected. The goals are to provide an estimate of the left cardiac chamber volumes and to evaluate the contrast dynamics. Our approach is based on a deconvolutional network trained on a set of 120 patients who underwent both CT acquisitions in the same arterial contrast phase and the same cardiac phase. To ensure a predicted CECT image that is reliable in terms of both values and morphology, a custom loss function is defined by combining two terms. The first term is an error function that enforces a pixel-wise correspondence, taking into account the similarity in terms of Hounsfield units between the synthesized and the real CECT images. A second term is added to enforce the definition of the chambers: a cross-entropy computed on the binarized versions of the synthesized and the real CECT images. The proposed method is finally tested on 20 subjects; the left heart chambers are evaluated with the Dice metric and the volume percentage error, while the dynamics of the X-ray attenuation are evaluated with the NMI index and the PSNR.
In Europe, 20% of CT scans cover the thoracic region [Radreport180].
Improvements in CT scanner technology now provide, during routine chest examinations, heart images that are much less degraded by cardiac motion artifacts and that allow detailed evaluation of the cardiac structures [bruzzi2006and]. The acquired images therefore contain good-quality information about the cardiovascular system, which often remains latent due to the lack of contrast in the cardiac area.
Contrast-enhanced computed tomography (CECT), on the other hand, is an imaging technique that allows easy assessment of the cardiac chamber volumes and the contrast dynamics [rizvi2015analysis]. From the clinical point of view, defining the morphology of the cardiac chambers is important for identifying patients affected by cardiopathies or valvular pathologies.
A physician who views many cardiac CECT images develops visual memories of the contrast medium distribution in the cardiac chambers, updating and enriching those memories with experience and prior knowledge. This acquired expertise allows clinicians to transfer information about the shapes and positions of the left atrium (LA) and the left ventricle (LV) onto an image where they are not visible, thanks to imagery and memory retrieval operations [van2012schema].
In computer science, the deep convolutional neural network (DCNN) architecture, inspired by the biology of the human visual system, has achieved breakthrough performance in image analysis. This has led us to develop a DCNN model able to create a contrast-enhanced image from a non-contrast-enhanced one.
In this work we demonstrate that a DCNN model can synthesize an image whose attenuation in the left cardiac chambers looks as if a virtual iodinated contrast medium had been injected. In addition, by exploiting the DCNN's capability to extract features with spatial and contrast invariance [lecun1995convolutional], we suggest that the designed model is able to mimic the human visual memory system, outperforming an expert radiologist in the volumetric assessment of the left heart chambers.
2 Materials and Methods
The study includes 150 ECG-triggered CT scans acquired during a standard cardiac CT session, with a tube voltage of 120 kVp and a modulated tube current (50-350 mA). The images are reconstructed with a dimension of pixels, an in-plane resolution ranging from 0.3 mm to 0.5 mm, and a fixed slice thickness of 3.0 mm. The CECT scans are obtained with a tube voltage ranging from 80 to 120 kVp and a modulated tube current (50-500 mA), and are then reconstructed with the same dimensions and in-plane resolution as the corresponding basal CT.
Both acquisitions are performed at the same telediastolic cardiac phase, i.e. 75% of the R-R interval. All images are reconstructed using an iterative filter for ionizing dose reduction. For each patient, 50 ml of contrast medium is injected intravenously at a flow rate of 4.5 ml/s. All volumes are acquired in the arterial contrast phase, characterized by a higher quantity of contrast medium in the left part of the heart than in the right one, which allows a better visualization of the left cardiac chambers and coronary vessels. For each CECT scan, a manual segmentation of the left atrium (LA) and left ventricle (LV) is also provided by expert radiologists. This information is used, together with the CECT images, as ground truth for the network.
Before feeding the data into the model, some preprocessing operations are required.
To cope with misalignments between the images caused by respiratory motion, a rigid registration of the CECT scans is performed as the first processing step, using the CT images as references and mutual information as the similarity measure to optimize.
While the two acquisitions share the same isotropic in-plane resolution, they have different slice thicknesses. For this reason we also reslice the registered CECT scans, bringing the two acquisitions to a common axial resolution of 3.0 mm.
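The reslicing step can be illustrated with a minimal NumPy sketch. It assumes uniformly spaced slices and linear interpolation along the axial direction; the text does not state which interpolator was actually used, and the function name `reslice_axial` is hypothetical.

```python
import numpy as np

def reslice_axial(volume, src_thickness_mm, dst_thickness_mm=3.0):
    """Linearly interpolate a (slices, rows, cols) volume to a new slice spacing.

    Assumption: slices are uniformly spaced and linear interpolation is
    acceptable; the source does not specify the interpolation scheme.
    """
    volume = np.asarray(volume, dtype=np.float64)
    n_src = volume.shape[0]
    # Physical z position (mm) of each source slice.
    z_src = np.arange(n_src) * src_thickness_mm
    # Target positions cover the same physical extent at the new spacing.
    z_dst = np.arange(0.0, z_src[-1] + 1e-9, dst_thickness_mm)
    # Fractional source index of every target slice.
    idx = z_dst / src_thickness_mm
    lo = np.clip(np.floor(idx).astype(int), 0, n_src - 1)
    hi = np.clip(lo + 1, 0, n_src - 1)
    w = (idx - lo)[:, None, None]
    # Weighted average of the two neighbouring source slices.
    return (1.0 - w) * volume[lo] + w * volume[hi]
```

In-plane pixels are left untouched, which matches the statement that the two acquisitions already share the same in-plane resolution.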
Finally, since our aim is to produce a synthetic contrast map capable of characterizing the left heart chambers, we do not use the entire CT volume: for each subject we consider only the axial slices where the LA and the LV are present and have been manually segmented on the CECT scans; the remaining slices are discarded.
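As an illustration, the slice selection could be sketched as follows. The function name and the `require_both` switch are hypothetical: the text does not say whether a slice is kept when only one of the two chambers is visible, so both readings are exposed, with "either chamber" as the assumed default.

```python
import numpy as np

def select_chamber_slices(ct, cect, la_mask, lv_mask, require_both=False):
    """Keep only the axial slices that contain segmented left-heart chambers.

    All inputs are (slices, rows, cols) arrays defined on the same grid.
    Assumption: a slice is kept if the LA or the LV is labelled on it;
    set require_both=True to keep only slices showing both chambers.
    """
    ct, cect = np.asarray(ct), np.asarray(cect)
    la = np.asarray(la_mask, dtype=bool)
    lv = np.asarray(lv_mask, dtype=bool)
    has_la = la.any(axis=(1, 2))
    has_lv = lv.any(axis=(1, 2))
    keep = (has_la & has_lv) if require_both else (has_la | has_lv)
    # Return the retained slices together with the merged chamber mask.
    return ct[keep], cect[keep], (la | lv)[keep]
```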
2.3 Learning phase
2.3.1 Model architecture:
The adopted architecture is a fully convolutional network inspired by the well-known Deconvolutional Network [noh2015learning]. Figure 1 illustrates the complete architecture in detail: an encoding path acts as a feature extractor on the input image, while the decoding path progressively reconstructs the contrast-enhanced image from the previously derived features.
The encoder comprises eight convolutional layers with a square kernel of size 3. In each of these layers, batch normalization [ioffe2015batch] is applied to the convolution output, immediately followed by a ReLU activation. No max-pooling is used; instead, downsampling is performed with a stride of 2 pixels in all the even-indexed convolutional layers. An intermediate convolutional layer separates the encoder from the decoder. The expanding module is built symmetrically from eight convolutional layers, where an up-sampling by a factor of 2 is performed, this time on the odd-indexed layers, using transposed convolutions with a kernel size, stride 2 and no zero-padding.
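To make the resolution changes along the two paths concrete, the following sketch traces the spatial size of the feature maps. Several details are assumptions not stated in the text: layers are indexed from 1, the encoder convolutions use "same" padding, and each stride-2 transposed convolution exactly doubles the size (which, without padding, corresponds to a kernel size equal to the stride). The function name is hypothetical.

```python
def encoder_decoder_sizes(input_size=128, n_layers=8):
    """Trace the feature-map side length through the encoder and the decoder.

    Assumptions: 1-based layer indices, 'same' padding in the encoder so that
    a stride-2 layer maps n to ceil(n / 2), and exact doubling in every
    stride-2 transposed convolution of the decoder.
    """
    sizes = [input_size]
    size = input_size
    for layer in range(1, n_layers + 1):
        if layer % 2 == 0:           # even-indexed encoder layers downsample
            size = (size + 1) // 2   # ceil(size / 2)
        sizes.append(size)
    # The intermediate convolution is assumed to keep the size unchanged.
    for layer in range(1, n_layers + 1):
        if layer % 2 == 1:           # odd-indexed decoder layers upsample
            size *= 2
        sizes.append(size)
    return sizes
```

Under these assumptions, a 128-pixel input is reduced by a factor of 16 at the bottleneck (four stride-2 layers) and restored to 128 pixels at the output.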
Finally, a skip connection with a convolutional block is included in the model. Its output is added to the last convolutional layer, where a single 3x3 kernel and a linear activation function are used to generate continuous values for each pixel instead of probability scores. This design is adopted with the precise aim of obtaining a residual representation of the input [he2016deep], propagating spatial information from the input image that could otherwise be lost in the encoding process.
2.3.2 Loss function:
To correctly map the input CT image to the corresponding CECT image, and to recreate attenuation dynamics adequate for the extraction of the left heart chambers, we propose a loss function given by the combination of two terms.
The first term is used to learn the network parameters by minimizing the difference between the synthesized image and the real CECT image. Specifically, it is a regression term implemented as the root mean square error (RMSE, Eq. 2) between the predicted CECT image and the real CECT scan, computed over the pixels.
The second term is the binary cross-entropy (BCE) computed between the manually segmented chambers and a normalized version of the generated CECT image (see Eq. 3). The predicted output is transformed into an approximately binary image using a sigmoid function modified to approximate a Heaviside step function. A steepness parameter controls how sharp the transition is and is chosen to avoid exploding-gradient effects. A shift parameter centers the sigmoid around a value suitable for discriminating the LA and the LV from the chamber walls and other structures. This threshold is taken as the mean of the minimum HU values found in the ground-truth masks of the left heart chambers. The final cost function (Eq. 1) is a linear combination of the two terms, multiplied by a binary mask of the heart so that the analysis focuses on the cardiac region, where the contrast features are mainly located. The loss parameters are all considered model hyperparameters. An L2 regularization with weight decay is also added to prevent overfitting.
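The two terms described above can be sketched as a NumPy forward computation. This is an illustration under stated assumptions, not the authors' implementation: the names `alpha`, `beta`, `k` and `tau` are hypothetical, "multiplied by a binary mask" is interpreted as restricting both terms to the pixels inside the heart mask, and the L2 weight-decay term is omitted because it depends on the network weights.

```python
import numpy as np

def soft_step(x, k, tau):
    """Sigmoid shifted to tau with steepness k; tends to a Heaviside step as k grows."""
    # tanh form of the logistic function, numerically stable for large |k * (x - tau)|.
    return 0.5 * (1.0 + np.tanh(0.5 * k * (x - tau)))

def combined_loss(pred, target, chambers, heart_mask, alpha, beta, k, tau, eps=1e-7):
    """Return (alpha * RMSE + beta * BCE, RMSE, BCE) evaluated inside the heart mask.

    pred, target : synthesized and real CECT images (same shape, HU-like values)
    chambers     : binary ground-truth mask of the LA and LV
    heart_mask   : binary mask of the cardiac region
    alpha, beta, k, tau are hypothetical names for the loss hyperparameters.
    """
    m = np.asarray(heart_mask, dtype=bool)
    p = np.asarray(pred, dtype=np.float64)[m]
    t = np.asarray(target, dtype=np.float64)[m]
    c = np.asarray(chambers, dtype=np.float64)[m]
    # Regression term: pixel-wise agreement between predicted and real CECT.
    rmse = np.sqrt(np.mean((p - t) ** 2))
    # Segmentation term: softly binarized prediction compared with the chamber mask.
    s = np.clip(soft_step(p, k, tau), eps, 1.0 - eps)
    bce = -np.mean(c * np.log(s) + (1.0 - c) * np.log(1.0 - s))
    return alpha * rmse + beta * bce, rmse, bce
```

A larger `k` brings the sigmoid closer to a hard threshold but makes its gradient vanish away from `tau` and spike near it, which is consistent with the remark that the steepness must be chosen with care.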
2.4 Evaluation metrics
The output image quality is assessed with the Normalized Mutual Information index (NMI) and the Peak Signal-to-Noise Ratio (PSNR), which quantify how closely the synthesized cardiac image reproduces the real CECT one.
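Both image-quality metrics can be sketched in a few lines of NumPy. Several NMI definitions exist and the text does not say which one is used; the sketch below assumes the ratio (H(A) + H(B)) / H(A, B), estimated from a joint histogram with a hypothetical default of 64 bins. For the PSNR, the peak value `data_range` must be supplied by the caller, since the intensity range adopted by the authors is not stated.

```python
import numpy as np

def entropy(counts):
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def nmi(a, b, bins=64):
    """Normalized mutual information, assumed here as (H(A) + H(B)) / H(A, B).

    The value is undefined for two constant images (zero joint entropy).
    """
    joint, _, _ = np.histogram2d(np.ravel(a), np.ravel(b), bins=bins)
    h_a = entropy(joint.sum(axis=1))   # marginal entropy of the first image
    h_b = entropy(joint.sum(axis=0))   # marginal entropy of the second image
    h_ab = entropy(joint.ravel())      # joint entropy
    return (h_a + h_b) / h_ab

def psnr(pred, target, data_range):
    """Peak signal-to-noise ratio in dB for a given intensity range."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(target, dtype=np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)
```

With this NMI definition, the score ranges from 1 for statistically independent images to 2 for images that are identical up to the histogram binning.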
The chamber estimation is evaluated with the Dice index: on each axial slice, we quantify the degree of overlap between the heart chambers, segmented by thresholding the synthesized CECT image, and the manual segmentation. Additionally, the agreement of the predicted measurements with the manual references is assessed with the Pearson coefficient and the Bland-Altman plot, together with the volume percentage error.
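The chamber metrics can be sketched as follows. All function names are hypothetical. The volume percentage error is assumed to be the absolute relative difference with respect to the reference volume, and the Bland-Altman limits of agreement are assumed to be the usual bias ± 1.96 standard deviations; neither convention is stated in the text.

```python
import numpy as np

def segment_by_threshold(synth_hu, heart_mask, threshold_hu):
    """Chamber mask: pixels at or above the threshold, inside the heart mask."""
    return (np.asarray(synth_hu) >= threshold_hu) & np.asarray(heart_mask, dtype=bool)

def dice(pred_mask, ref_mask):
    """Dice overlap between two binary masks (1.0 if both are empty)."""
    p = np.asarray(pred_mask, dtype=bool)
    r = np.asarray(ref_mask, dtype=bool)
    denom = p.sum() + r.sum()
    return 2.0 * np.logical_and(p, r).sum() / denom if denom else 1.0

def volume_percentage_error(pred_vol, ref_vol):
    """Absolute volume error as a percentage of the reference (assumed definition)."""
    return 100.0 * abs(pred_vol - ref_vol) / ref_vol

def pearson_r(pred, ref):
    """Pearson correlation between predicted and reference measurements."""
    return np.corrcoef(pred, ref)[0, 1]

def bland_altman(pred, ref):
    """Bias and 95% limits of agreement of the differences pred - ref."""
    diff = np.asarray(pred, dtype=np.float64) - np.asarray(ref, dtype=np.float64)
    bias, sd = diff.mean(), diff.std(ddof=1)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd
```

A chamber volume can be obtained from a mask by multiplying the number of labelled voxels by the voxel volume, so the same thresholded masks feed both the Dice index and the volume-based metrics.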
A set of qualitative comparisons is also performed to assess the new measurement method on 10 images randomly chosen from the scans in the test set. Using the chamber ground truth as reference, we compare the network results with those obtained by two expert radiologists, who were asked to draw the cardiac chambers on the image without contrast. The Dice index (see Fig. 2) and the volume percentage error are used to evaluate these comparisons. A Pearson coefficient is computed to quantify the intra- and inter-observer variability of the two operators chosen for this task, who are not among those who provided the ground truth.
All the metrics and the comparisons are evaluated inside the region of interest, delimited by the heart mask extracted for each subject.
3 Experimental settings and Results
In this work we use data from 150 patients, randomly split into 120 cases for training the model, 10 cases for validation and 20 cases for performance evaluation. For each patient we have collected a CT and a CECT scan, the segmentation of the LA and the LV, and a binary mask of the heart. All of these images are resampled from the original dimensions to 128x128 pixels, which results in a loss of spatial information but considerably reduces the processing time.
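A patient-level split of this kind can be sketched with the standard library alone. The function name and the fixed `seed` are hypothetical; the point of the sketch is that the split is made on patients rather than on slices, so that slices from the same subject never appear in more than one subset.

```python
import random

def split_patients(patient_ids, n_train=120, n_val=10, n_test=20, seed=0):
    """Randomly partition patient identifiers into train / validation / test.

    Assumption: the seed is arbitrary and only makes the split reproducible.
    """
    ids = list(patient_ids)
    if len(ids) != n_train + n_val + n_test:
        raise ValueError("subset sizes must add up to the number of patients")
    random.Random(seed).shuffle(ids)
    train = ids[:n_train]
    val = ids[n_train:n_train + n_val]
    test = ids[n_train + n_val:]
    return train, val, test
```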
We implemented the network in TensorFlow [abadi2016tensorflow] and trained the model from scratch for 800 epochs on a machine with a single NVIDIA TITAN X GPU. We used the Adam optimizer to train the network parameters, with a fixed learning rate and a batch size of 32 samples, each a 2D axial image from a CT volume. To increase the number of training cases and improve generalization in the prediction phase, we also use data augmentation, applying random rotations with a maximum angle of 25 degrees to every input-output pair provided to the network.
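The key constraint of this augmentation is that the same random angle is applied to the input and to its target. The NumPy-only sketch below illustrates it; the nearest-neighbour sampling and the fill value of -1000 (air, in HU) are assumptions, since the text specifies neither the interpolation nor the padding, and the function names are hypothetical.

```python
import numpy as np

def rotate_nearest(img, angle_deg, fill):
    """Rotate a 2D image about its centre using nearest-neighbour sampling."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    theta = np.deg2rad(angle_deg)
    yy, xx = np.mgrid[0:h, 0:w]
    # Inverse mapping: for every output pixel, locate its source position.
    ys = np.cos(theta) * (yy - cy) - np.sin(theta) * (xx - cx) + cy
    xs = np.sin(theta) * (yy - cy) + np.cos(theta) * (xx - cx) + cx
    yi, xi = np.rint(ys).astype(int), np.rint(xs).astype(int)
    inside = (yi >= 0) & (yi < h) & (xi >= 0) & (xi < w)
    # Pixels that map outside the source image receive the fill value.
    out = np.full(img.shape, fill, dtype=img.dtype)
    out[inside] = img[yi[inside], xi[inside]]
    return out

def augment_pair(ct, cect, rng, max_angle=25.0, fill=-1000):
    """Rotate a CT slice and its CECT target by one shared random angle."""
    angle = rng.uniform(-max_angle, max_angle)
    return rotate_nearest(ct, angle, fill), rotate_nearest(cect, angle, fill), angle
```

A generator such as `np.random.default_rng(0)` can be passed as `rng`. If segmentation or heart masks are used in the loss, they would need to be rotated by the same angle (with a fill value of 0) to stay aligned with the images.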
Regarding the loss hyperparameters, we set , and to 1, 0.01 and 0.001 respectively, while for we have chosen the value of 10. Finally, we found 300 HU to be an adequate lower bound for the sigmoid shift used to discriminate the LA and LV in the synthesized CECT image.
To evaluate the model performance, 78 test slices have been extracted from the 20 test volumes. With the above settings we achieve an average NMI value of and a PSNR of in the overall process of synthesis. Compared to the manually segmented LA and LV, our prediction reaches a mean Dice of , while a high agreement with the chamber volumes is reported by a Pearson coefficient of (p<0.01), the Bland-Altman plot (Fig. 3) and a volume percentage error of %.
The qualitative comparison shows that the model estimates the volume with an error of 7%, which is close to the one committed by humans, but with the advantage of producing a more reliable geometry: vs . A good intra-observer reproducibility is also observed, while a lower inter-observer reproducibility is reported.
4 Discussion and Conclusion
We propose a novel approach to synthesize cardiac CECT images from contrast-free thoracic CT scans, exploiting the ability of the DCNN to mimic the human visual system and to reproduce the imagery and memory retrieval operations.
The synthetic CECT images show good similarity and contrast dynamics compared to the gold standard (i.e. real CECT images), allowing a simple extraction of the left cardiac chambers by thresholding.
The comparison of the automatic segmentation with the manual reference annotations shows that the morphology of the synthesized chambers is independent of both the heart shape and the slice position in the CT volume (Dice = 0.88; Fig. 4).
From the qualitative analysis it is then possible to assess how the DCNN is able to overcome the inter-observer variability, outperforming human performance (intra-observer correlation, IAOC; inter-observer correlation, IROC) and offering a fast and repeatable measurement.
From the presented results we can assess the network's capability to mimic, after the training phase, the neurophysiological process of image synthesis. The DCNN properties enable the model to extract deep features and latent information that are not directly perceptible by the human eye, resulting in more accurate and reproducible measurements. Moreover, as a consequence of a on the entire test set, we can hypothesize the use of this method in clinical applications such as the identification of patients with heart or valvular diseases.
To conclude, we have proposed a DCNN approach to synthesize a contrast-enhanced image from a basal CT scan, in which the left atrium and ventricle can be segmented by thresholding. What emerges is a promising approach that can retrieve, from patients who undergo chest CT exams, otherwise hidden volumetric information about the cardiac chambers.