X-ray computed tomography (CT) is widely used in clinical practice. The involved ionizing X-ray radiation, however, could increase cancer risk. Hence, the reduction of the radiation dose has been an important topic in recent years. Few-view CT image reconstruction is one of the main ways to minimize radiation dose and potentially allow a stationary CT architecture. In this paper, we propose a deep encoder-decoder adversarial reconstruction (DEAR) network for 3D CT image reconstruction from few-view data. Since the artifacts caused by few-view reconstruction appear in 3D instead of 2D geometry, a 3D deep network has a great potential for improving the image quality in a data-driven fashion. More specifically, our proposed DEAR-3D network aims at reconstructing 3D volume directly from clinical 3D spiral cone-beam image data. DEAR is validated on a publicly available abdominal CT dataset prepared and authorized by Mayo Clinic. Compared with other 2D deep-learning methods, the proposed DEAR-3D network can utilize 3D information to produce promising reconstruction results.
Deep Encoder-decoder Adversarial Reconstruction (DEAR) Network for 3D CT from Few-view Data
Huidong Xie, Hongming Shan and Ge Wang *
* Correspondence: firstname.lastname@example.org
Copyright year: 2019
X-ray computed tomography (CT) is one of the most essential imaging modalities widely used in clinical practice brenner_computed_2007. Even though CT brings overwhelming healthcare benefits to patients, it could potentially increase the patients’ cancer risk due to the involved ionizing radiation. The data from the National Lung Screening Trial indicate that annual lung cancer screening with low-dose CT could significantly reduce lung cancer-related mortality shan_3-d_2018-1. If the effective dose of a routine CT examination is reduced to less than 1 mSv, the long-term risk of CT scanning can be negligible. In the past years, numerous deep-learning-based CT denoising methods were proposed to reduce radiation dose with excellent results chen_low-dose_2017-1; shan_competitive_2019-1; chen_low-dose_2017. In parallel, few-view CT is also being actively investigated to reduce the radiation dose, especially for breast CT glick_breast_2007-1 and C-arm CT wallace_three-dimensional_2008.
Few-view CT is a challenging problem. Due to the requirement imposed by the Nyquist sampling theorem jerri_shannon_1977, reconstructing high-quality CT images from under-sampled projection data was previously considered an unsolvable problem. With sufficient projection data, analytical methods such as filtered back-projection (FBP) wang_approximate_2007 can be used for accurate image reconstruction. However, FBP introduces severe streak artifacts when projection data are limited. Numerous iterative reconstruction algorithms were proposed to incorporate prior knowledge for suppressing image artifacts in few-view scans. Well-known methods include the algebraic reconstruction technique (ART) gordon_algebraic_1970, the simultaneous algebraic reconstruction technique (SART) andersen_simultaneous_1984, expectation maximization (EM) dempster_maximum_1977, etc. Even though these iterative methods do improve image quality, they are usually time-consuming and still not able to produce clinically acceptable results in many cases. Recently, with the assistance of graphics processing units (GPUs) and big data, deep learning has become a new frontier of tomographic imaging and gives new opportunities for few-view CT reconstruction wang_perspective_2016; wang_guest_2015.
Deep learning is now well recognized in the field of medical tomographic imaging greenspan_guest_2016. Several methods were proposed to resolve few-view CT issues in a data-driven fashion. For example, Jin et al. jin_deep_2017 proposed a U-net-based ronneberger_u-net:_2015 FBPConvNet to remove streak artifacts in the 2D image domain. Lee et al. used a similar U-net structure to eliminate artifacts in the sinogram domain lee_deep-neural-network-based_2019. Chen et al. designed a LEARN network chen_learn:_2018 to map sparse sinogram data directly to a tomographic image, which combines the convolutional neural network lecun_convolutional_1995 and a classic iterative process under a data-driven regularization. Inspired by the FBP workflow, Li et al. published their iCT-NET li_learning_2019 to perform CT reconstruction in various special cases and consistently obtain decent results. Our recently published DNA network xie_dual_2019-1 addressed the few-view CT issue by learning a network-based reconstruction algorithm from sinogram data. However, none of these methods was designed to perform 3D image reconstruction, so they are subject to a potential loss of 3D context.
In this paper, we propose a deep encoder-decoder adversarial reconstruction network (DEAR) for 3D CT from few-view data, featured by a direct mapping from a 3D input dataset to a 3D image volume. In diagnosis, radiologists need to extract 3D spatial information by scrolling through adjacent slices to form contextual clues. Therefore, it is reasonable and even necessary to use 3D convolutional layers to suppress streak artifacts as much as possible in a batch of adjacent reconstructed image slices. The main contributions of our DEAR-3D network are summarized as follows:
DEAR-3D utilizes 3D convolutional layers to extract 3D information from multiple adjacent slices in a generative adversarial network (GAN) goodfellow_generative_2014 framework. Different from reconstructing 2D images from 3D input data shan_3-d_2018-1, DEAR-3D directly reconstructs a 3D volume, with faithful texture and image details; and
An extensive comparative study was performed between DEAR-3D and various 2D counterparts to demonstrate the merits of the proposed 3D network.
The rest of this paper is organized as follows. Sec. 2 introduces the DEAR-3D model, its 2D counterparts and the GAN framework utilized in the proposed model. Sec. 3 describes our experimental design and results, in comparison with other state-of-the-art models for few-view CT. Finally, Sec. 4 presents discussions and concludes this paper.
3D CT image reconstruction can be expressed as follows:

\begin{equation}
I_{FV} = \mathcal{R}^{-1}(S), \tag{1}
\end{equation}

where $I_{FV} \in \mathbb{R}^{w \times w \times d}$ denotes a 3D image volume reconstructed from sufficient projection data, and $w$ and $d$ denote the width/height of the input images and the number of images acquired from a particular patient, respectively. $S \in \mathbb{R}^{v \times c \times r}$ denotes the corresponding interpolated 3D sinogram from a spiral cone-beam scan, where $v$, $c$, and $r$ denote the number of views, the number of detectors per row, and the number of detector rows, respectively, and $\mathcal{R}^{-1}$ is the inverse operator to reconstruct the CT image volume, such as a typical cone-beam reconstruction formula or algorithm schaller_efficient_1998; grangeat_mathematical_1991; grangeat_evaluation_1991; katsevich_improved_2004 when sufficient projection data are obtained. However, when the number of data (linear equations) is not sufficient to resolve all the unknown voxels in the few-view CT setting, streak artifacts are introduced in the reconstructed images, and reconstructing high-quality images becomes highly non-trivial. Deep learning (DL) promises to extract features in reference to extensive knowledge hidden in big data. With a large amount of training data, task-specific and robust prior knowledge can be taken advantage of in establishing a relationship between few-view data/images and the corresponding full-view images. Such a deep network can be formulated as in Eq. (2),

\begin{equation}
I_{FV} \approx G(I_{SV}), \tag{2}
\end{equation}

where $I_{FV}$ and $I_{SV}$ denote a 3D image volume reconstructed from sufficient projection data and the counterpart from insufficient few-view/sparse-view data, respectively, and $G$ denotes our DEAR-3D network, which removes the artifacts caused by the few-view problem.
2.1 Proposed Framework
The overall network architecture is shown in Fig. 1. The proposed DEAR-3D network is optimized in a Wasserstein Generative Adversarial Network (WGAN) framework arjovsky_wasserstein_2017, which is currently one of the most advanced GAN frameworks. In this study, the proposed framework consists of two components: a generator network $G$ and a discriminator network $D$. $G$ aims at directly reconstructing a 3D image volume from a batch of 3D few-view image slices. $D$ receives images from both $G$ and the ground-truth dataset, and tries to distinguish whether the input is real. Both networks are optimized during the training process. If the optimized network $D$ can hardly distinguish fake images (from $G$) from real images (from the ground-truth dataset), then the generator network $G$ has fooled the discriminator $D$ successfully. By design, the introduction of $D$ also helps to improve the texture of the reconstructed images.
Different from the vanilla generative adversarial network (GAN) goodfellow_generative_2014, WGAN replaces the logarithm term in the loss function with the Wasserstein distance, improving the training stability. In WGAN, the 1-Lipschitz constraint on the discriminator is enforced with weight clipping. However, Ref. gulrajani_improved_2017-1 pointed out that weight clipping may be problematic in WGAN, and suggested replacing it with a gradient penalty term, which is used in our proposed method. Hence, the objective function of the GAN framework is expressed as follows:

\begin{equation}
\min_{\theta_G} \max_{\theta_D} \Big\{ \mathbb{E}_{I_{FV}}\big[D(I_{FV})\big] - \mathbb{E}_{I_{SV}}\big[D(G(I_{SV}))\big] - \lambda \, \mathbb{E}_{\bar{I}}\big[(\|\nabla_{\bar{I}} D(\bar{I})\|_2 - 1)^2\big] \Big\}, \tag{3}
\end{equation}

where $I_{SV}$ and $I_{FV}$ represent the few-view/sparse-view 3D image volume and the full-view 3D image volume, respectively, $\mathbb{E}_a[b]$ denotes the expectation of $b$ as a function of $a$, $\theta_G$ and $\theta_D$ denote the trainable parameters of the networks $G$ and $D$, respectively, and $\bar{I} = \alpha \cdot G(I_{SV}) + (1 - \alpha) \cdot I_{FV}$. $\alpha$ is uniformly sampled from the interval [0,1]. In other words, $\bar{I}$ represents another batch of 3D image slices between fake and real images. Furthermore, $\nabla_{\bar{I}} D(\bar{I})$ denotes the gradient of $D$ with respect to $\bar{I}$. Lastly, $\lambda$ is a parameter used to balance the Wasserstein distance and the gradient penalty. The networks $G$ and $D$ are updated in an iterative manner as suggested by goodfellow_generative_2014; gulrajani_improved_2017-1; arjovsky_wasserstein_2017.
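The two quantities that are alternately minimized in a gradient-penalty WGAN can be illustrated with a small, framework-agnostic sketch. It is not the authors' implementation: the function names are hypothetical, volumes are represented as flat Python lists, and the critic outputs and gradient norms are assumed to have been computed elsewhere by an automatic-differentiation framework. The default penalty weight of 10 is the value reported later in this paper.

```python
import random

def mean(values):
    return sum(values) / len(values)

def interpolate(fake, real, rng):
    """Blend one fake and one real volume (flat lists) with a single alpha ~ U[0, 1]."""
    alpha = rng.uniform(0.0, 1.0)
    return [alpha * f + (1.0 - alpha) * r for f, r in zip(fake, real)]

def critic_loss(d_real, d_fake, grad_norms, lam=10.0):
    """Loss minimized by the discriminator: Wasserstein estimate plus gradient penalty.

    d_real / d_fake: critic scores for real and generated volumes.
    grad_norms: L2 norms of the critic gradient at the interpolated volumes.
    """
    penalty = mean([(g - 1.0) ** 2 for g in grad_norms])
    return mean(d_fake) - mean(d_real) + lam * penalty

def generator_adversarial_loss(d_fake):
    """Loss minimized by the generator: raise the critic score of its outputs."""
    return -mean(d_fake)
```

In a training loop, the discriminator step would minimize `critic_loss` and the generator step would minimize `generator_adversarial_loss` together with its other loss terms.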
2.2 Generator Network
The input to the generator is a batch of 3D image slices with dimensionality $N_b \times N_s \times N_w \times N_w$, where $N_b$, $N_s$, and $N_w$ denote the batch size, the number of adjacent input slices, and the width/height of each input image slice, respectively. Intuitively, $N_s$ should be equal to the total number of image slices of a particular patient, since tissues in all the different 2D planes relate to each other. However, this is not practical due to the extremely large memory cost. Hence, $N_s$ is experimentally set to 9. The structure of the generator is inspired by the U-net ronneberger_u-net:_2015, originally proposed for biological image segmentation. Since then, the U-net has been utilized for various applications in the field of medical imaging. For example, chen_low-dose_2017-1; shan_3-d_2018-1 used U-net with conveying paths for CT image denoising, xie_dual_2019-1; jin_deep_2017; lee_deep-neural-network-based_2019 applied U-net for few-view CT, and donoho_compressed_2006 for compressed sensing MRI. In DEAR, $G$ is a revised U-net with conveying paths and built in reference to DenseNet huang_densely_2016. The generator consists of four 3D convolutional layers for down-sampling and four 3D transpose convolutional layers for up-sampling a batch of 3D image slices. A common 3D kernel dimension is used for the down-sampling and up-sampling layers. In the original U-net ronneberger_u-net:_2015, a stride of 2 was used in each down-sampling or up-sampling layer to extract features at different resolutions for segmentation. However, for image reconstruction, severely down-sampling the input images may compromise performance because convolutional layers may not be able to recover the images accurately from low-dimensional feature maps. Therefore, a stride of 1 is used in all the convolutional layers of $G$, and zero-padding is not applied in the down-sampling and up-sampling layers. A rectified linear unit (ReLU) activation function is used after each 3D convolutional layer.
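To make the effect of a stride of 1 without zero-padding concrete, the following sketch tracks the size of one axis through four unpadded convolutions followed by four unpadded transpose convolutions. The kernel length of 3 and the axis lengths used in the comments are assumptions chosen for illustration only; they are not values stated in this text.

```python
def conv_out(size, kernel, stride=1):
    """Output length of an unpadded ('valid') convolution along one axis."""
    return (size - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride=1):
    """Output length of an unpadded transpose convolution along one axis."""
    return (size - 1) * stride + kernel

def encoder_decoder_sizes(size, kernel, n_layers=4):
    """Axis length after each of n_layers down-sampling and n_layers up-sampling layers."""
    sizes = [size]
    for _ in range(n_layers):
        sizes.append(conv_out(sizes[-1], kernel))
    for _ in range(n_layers):
        sizes.append(conv_transpose_out(sizes[-1], kernel))
    return sizes

# Hypothetical kernel length 3: an axis of 9 adjacent slices shrinks to 1 and is restored to 9.
slice_axis = encoder_decoder_sizes(9, 3)
# Hypothetical in-plane patch width of 64 with the same kernel length.
plane_axis = encoder_decoder_sizes(64, 3)
```

Under these assumptions each down-sampling layer removes `kernel - 1` samples per axis instead of halving the axis, so far less spatial information is discarded than with a stride of 2.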
A dense block is added after each down-sampling and up-sampling layer. Each dense block contains five 3D convolutional layers to extract 3D image features from the input feature maps. Note that zero-padding is used in all 3D convolutional layers of the dense blocks to maintain the dimensionality of the input feature maps. Inspired by ResNet he_deep_2015, shortcuts are applied to connect early and current feature maps, allowing gradients to flow directly to the current layer from the corresponding earlier layer. Different from ResNet, DenseNet further improves the information flow between layers by connecting all the earlier feature maps to the current layer. Consequently, the $\ell$-th layer receives the feature maps from all previous layers, $x_0$, $x_1$, $x_2$, \ldots, $x_{\ell-1}$, as the input:

\begin{equation}
x_\ell = H_\ell([x_0, x_1, \ldots, x_{\ell-1}]), \tag{4}
\end{equation}

where $[x_0, x_1, \ldots, x_{\ell-1}]$ represents the concatenation of all the feature maps produced by the layers $0, 1, \ldots, \ell-1$, and $H_\ell(\cdot)$ denotes the operation performed by the $\ell$-th layer, defined as a composite function of a 3D convolutional operation and a ReLU activation. All the 3D convolutional layers in the proposed dense block share the same kernel size and use a stride of 1. Note that the purpose of DEAR-3D is to learn the inverse amplitude of the artifacts in the input images, and therefore the input images are directly added to the output of the last convolutional layer, as presented in Fig. 1.
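The dense connectivity described above can be sketched in plain Python. This is a simplified illustration rather than the network itself: each feature map is a flat list of voxel values, and a voxel-wise linear combination across channels followed by ReLU stands in for the 3D convolution, so spatial filtering is deliberately omitted. The function name and the weight layout are hypothetical.

```python
def relu(value):
    return value if value > 0.0 else 0.0

def dense_block(x, layer_weights):
    """Toy dense block.

    x: list of input channels, each a flat list of voxel values.
    layer_weights[l][j][c]: weight from concatenated channel c to output channel j of layer l.
    Returns the concatenation of the input channels and every layer's output channels.
    """
    n_voxels = len(x[0])
    features = list(x)  # running concatenation of all earlier feature maps
    for weights in layer_weights:
        new_channels = []
        for w_out in weights:
            # Each layer sees every earlier feature map, not just the previous layer.
            assert len(w_out) == len(features)
            new_channels.append([
                relu(sum(w * channel[v] for w, channel in zip(w_out, features)))
                for v in range(n_voxels)
            ])
        features.extend(new_channels)
    return features

# One input channel with two voxels, two layers with one output channel each.
out = dense_block([[1.0, -2.0]], [[[1.0]], [[1.0, 1.0]]])
```

The number of input channels seen by a layer grows with depth, which is why the second layer above needs two weights although the block input has a single channel.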
2.3 Discriminator Network
The discriminator network $D$ takes input from either $G$ or the ground-truth dataset, and tries to classify whether the input images are real. In DEAR-3D, the discriminator network has 6 convolutional layers with 64, 64, 128, 128, 256, and 256 filters, followed by 2 fully-connected layers with 1024 and 1 neurons, respectively. The leaky ReLU activation function, with a slope of 0.2 in the negative part, is used after each layer. All the convolutional layers are 3D convolutional layers with zero-padding and a common kernel dimension, and the stride is set to 2 for all of them.
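As a rough check of how quickly six stride-2 layers reduce the input, the sketch below computes the feature-map shape that would reach the first fully-connected layer. It assumes the TensorFlow-style "same" padding rule, in which the output length is the ceiling of the input length divided by the stride, and it assumes that the stride of 2 applies to all three axes. The 9 x 64 x 64 input shape is hypothetical.

```python
import math

def same_conv_out(size, stride):
    """Output length of a zero-padded ('same') convolution along one axis."""
    return math.ceil(size / stride)

def discriminator_feature_shape(depth, height, width,
                                filters=(64, 64, 128, 128, 256, 256), stride=2):
    """Shape (depth, height, width, channels) after the six convolutional layers."""
    for _ in filters:
        depth = same_conv_out(depth, stride)
        height = same_conv_out(height, stride)
        width = same_conv_out(width, stride)
    return depth, height, width, filters[-1]

shape = discriminator_feature_shape(9, 64, 64)
```

Under these assumptions the convolutional part collapses each patch to a single spatial position with 256 channels before the fully-connected layers.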
2.4 Objective Functions for Generator
This subsection introduces and evaluates different objective functions used for few-view CT artifact reduction. As shown in Fig. 2, a composite objective function is used to optimize DEAR-3D.
2.4.1 MSE Loss
The mean-squared-error (MSE) chen_low-dose_2017-1; wolterink_generative_2017; wang_mean_2009 is a popular choice for denoising and artifact-removal applications wang_mean_2009. Nevertheless, it could lead to over-smoothed images zhao_loss_2017-1. Moreover, MSE is not sensitive to image texture and assumes that the background noise is white Gaussian noise independent of local image features zhou_wang_image_2004. The MSE used in the proposed method is expressed as follows:

\begin{equation}
\mathcal{L}_{mse} = \frac{1}{N_b \cdot N_s \cdot N_w^2} \sum_{i=1}^{N_b} \left\| Y_i - X_i \right\|_2^2, \tag{5}
\end{equation}

where $N_b$, $N_s$, and $N_w$ denote the batch size, the number of input slices, and the image width/height, respectively, and $Y_i$ and $X_i$ represent the ground-truth 3D image volume and the 3D image volume reconstructed by the generator $G$, respectively.
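A minimal sketch of this loss, assuming each 3D patch in the batch has been flattened into a list of voxel values, is given below. The function name is hypothetical; a real implementation would operate on tensors inside the training framework.

```python
def mse_loss(truth_batch, recon_batch):
    """Mean squared error averaged over every voxel of every patch in the batch."""
    total = 0.0
    count = 0
    for truth, recon in zip(truth_batch, recon_batch):
        for t, r in zip(truth, recon):
            total += (t - r) ** 2
            count += 1
    # count equals batch size x slices x width x height for equally sized patches
    return total / count

loss = mse_loss([[0.0, 1.0], [1.0, 1.0]], [[0.0, 0.0], [1.0, 3.0]])
```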
2.4.2 Structural Similarity Loss
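To overcome the disadvantages of the MSE loss and acquire visually superior images, the structural similarity index (SSIM) zhou_wang_image_2004 is introduced in the objective function. SSIM measures the structural similarity between two images and is evaluated within a sliding convolution window. For a ground-truth volume $Y$ and a reconstructed volume $X$, the SSIM is expressed as follows:

\begin{equation}
\text{SSIM}(Y, X) = \frac{(2\mu_Y\mu_X + C_1)(2\sigma_{YX} + C_2)}{(\mu_Y^2 + \mu_X^2 + C_1)(\sigma_Y^2 + \sigma_X^2 + C_2)}, \tag{6}
\end{equation}

where $C_1 = (K_1 \cdot R)^2$ and $C_2 = (K_2 \cdot R)^2$, with small constants $K_1$ and $K_2$, are constants used to stabilize the formula if the denominator is too small, $R$ stands for the dynamic range of voxel values, and $\mu_Y$, $\mu_X$, $\sigma_Y^2$, $\sigma_X^2$, and $\sigma_{YX}$ are the means of $Y$ and $X$, the variances of $Y$ and $X$, and the covariance between $Y$ and $X$, respectively. Since the maximum value of SSIM is 1, the structural loss used to optimize DEAR-3D is expressed as follows:

\begin{equation}
\mathcal{L}_{sl} = 1 - \text{SSIM}(Y, X). \tag{7}
\end{equation}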
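The SSIM statistics and the resulting structural loss can be sketched as follows. For brevity the sketch evaluates a single window covering the whole flattened patch instead of sliding a local window, which is a simplification of the measure described above. The constants 0.01 and 0.03 are the values commonly used with SSIM and are assumptions here, as is the dynamic range of 1 (matching images normalized to [0, 1]).

```python
def ssim_global(y, x, dynamic_range=1.0, k1=0.01, k2=0.03):
    """SSIM computed from the statistics of one window covering all voxels."""
    n = len(y)
    mu_y = sum(y) / n
    mu_x = sum(x) / n
    var_y = sum((a - mu_y) ** 2 for a in y) / n
    var_x = sum((b - mu_x) ** 2 for b in x) / n
    cov = sum((a - mu_y) * (b - mu_x) for a, b in zip(y, x)) / n
    c1 = (k1 * dynamic_range) ** 2  # stabilizes the luminance term
    c2 = (k2 * dynamic_range) ** 2  # stabilizes the contrast/structure term
    numerator = (2.0 * mu_y * mu_x + c1) * (2.0 * cov + c2)
    denominator = (mu_y ** 2 + mu_x ** 2 + c1) * (var_y + var_x + c2)
    return numerator / denominator

def structural_loss(y, x, **kwargs):
    """Structural loss: one minus SSIM, so identical volumes give a loss of zero."""
    return 1.0 - ssim_global(y, x, **kwargs)
```

Identical inputs yield an SSIM of 1 and a loss of 0, while a contrast-inverted input yields a negative SSIM and therefore a loss above 1.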
2.4.3 Adversarial Loss
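The adversarial loss used in DEAR-3D encourages the generator to produce realistic images that the discriminator network cannot distinguish from real ones. Referring to Eq. (3), the adversarial loss is expressed as follows:

\begin{equation}
\mathcal{L}_{adv} = -\mathbb{E}_{I_{SV}}\big[D(G(I_{SV}))\big], \tag{8}
\end{equation}

where $I_{SV}$ is the few-view/sparse-view 3D image volume fed to the generator $G$, and $D$ is the discriminator.
The overall objective function of $G$ is then expressed as follows:

\begin{equation}
\mathcal{L}_G = \mathcal{L}_{mse} + \lambda_{sl} \cdot \mathcal{L}_{sl} + \lambda_{adv} \cdot \mathcal{L}_{adv}, \tag{9}
\end{equation}

where $\mathcal{L}_{mse}$ and $\mathcal{L}_{sl}$ are the MSE loss and the structural loss introduced above, and $\lambda_{sl}$ and $\lambda_{adv}$ are hyper-parameters to balance the different loss functions.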
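A minimal sketch of such a weighted composite objective is given below. It assumes that the MSE term carries a unit weight and that the two hyper-parameters scale the structural and adversarial terms; the function and argument names are hypothetical. Setting a weight to zero removes the corresponding term, which is one way to express reduced variants trained without the SSIM or adversarial components.

```python
def generator_objective(mse, structural, adversarial, weight_sl, weight_adv):
    """Weighted sum of the three generator loss terms (MSE assumed to have unit weight)."""
    return mse + weight_sl * structural + weight_adv * adversarial

# Example loss values are arbitrary; only the weighting pattern matters here.
full = generator_objective(0.01, 0.2, -1.5, weight_sl=0.5, weight_adv=0.0025)
mse_and_ssim = generator_objective(0.01, 0.2, -1.5, weight_sl=0.5, weight_adv=0.0)
mse_only = generator_objective(0.01, 0.2, -1.5, weight_sl=0.0, weight_adv=0.0)
```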
2.5 Corresponding 2D networks for comparisons
To evaluate the performance of the proposed 3D network, a 2D network, denoted as DEAR-2D, is built for benchmarking. DEAR-2D uses exactly the same structure as DEAR-3D, except that all the 3D convolutional layers, in both the encoder-decoder part and the dense blocks, are replaced with 2D convolutional layers. Please note that the number of parameters of DEAR-2D is smaller than that of DEAR-3D because the dimension of the input 2D batches is significantly smaller than the dimension of the 3D batches. For a fair comparison, another 2D network, denoted as DEAR-2D-i, is built with an accordingly increased number of trainable parameters, obtained by increasing the number of filters in the 2D convolutional layers. Different from DEAR-3D, the 2D counterparts only utilize 2D convolutional layers to extract 2D feature maps from a batch of 2D input images. Therefore, the 2D counterparts aim at reconstructing 2D images instead of 3D volumes, which may lead to a loss of contextual information. In DEAR-2D, all the 2D convolutional layers in both the encoder-decoder part and the dense blocks contain 38 filters, whereas in DEAR-2D-i they contain 48 filters. Table 1 shows the numbers of parameters of the three networks.
Moreover, to demonstrate the effectiveness of different loss functions used to optimize the proposed neural network, 2D and 3D networks with different combinations of loss components are considered for comparison.
3 Experimental design and results
3.1 Dataset and Pre-processing
A clinical abdominal dataset was used to train and evaluate the proposed DEAR-3D method. The dataset was prepared and authorized by Mayo Clinic for “the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge” noauthor_low_nodate. The dataset contains a total of 5,936 abdominal CT images selected with 1 mm slice thickness. All the images were reconstructed from 2,304 projections at 100 peak kilovoltage (kVp) and were used as the ground-truth images to train the proposed method. The distance between the x-ray source and the detector array is 1085.6 millimeters, and the distance between the x-ray source and the iso-center is 595 millimeters. The pixel size is 0.664 millimeters. All the images are of size $512 \times 512$. For data pre-processing, pixel values of patient images were normalized to be between 0 and 1. Four patients (a total of 2,566 images) were used for training, and 6 patients (a total of 3,370 images) for validation and testing. Patches were cropped with stride 32 from the whole images for data augmentation, resulting in a total of 502,936 2D training patches. 2D patches were used to train the DEAR-2D and DEAR-2D-i networks. 3D patches were extracted from the pre-processed 2D patches, with a stride of 1 along the slice dimension, to train the DEAR-3D network. The optimized networks are applicable to images of any dimension since the proposed DEAR-3D network contains only convolutional layers. The fan-beam Radon transform and the fan-beam inverse Radon transform were used to simulate 75-view few-view images. The 75-view sinograms were synthesized from angles equally distributed over a full scan range.
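The patch and view sampling described above can be sketched with a few index helpers. All names are hypothetical. The reported total of 502,936 patches equals 2,566 x 196, that is, 14 x 14 positions per image; the patch width of 96 used below is a hypothetical value chosen only because it reproduces that count for an assumed image width of 512 pixels and a stride of 32. A full scan range of 360 degrees is also assumed.

```python
def patch_starts(size, patch, stride):
    """Offsets of all full patches along one image axis."""
    return list(range(0, size - patch + 1, stride))

def slice_windows(n_slices, depth, stride=1):
    """Index windows of `depth` adjacent slices, shifted by `stride` along the slice axis."""
    return [list(range(s, s + depth)) for s in range(0, n_slices - depth + 1, stride)]

def view_angles(n_views, full_range_deg=360.0):
    """Projection angles equally distributed over a full scan range."""
    return [i * full_range_deg / n_views for i in range(n_views)]

offsets = patch_starts(512, 96, 32)   # hypothetical image and patch widths
patches_per_image = len(offsets) ** 2
windows = slice_windows(11, 9)        # 11 slices give three stacks of 9 adjacent slices
angles = view_angles(75)              # angular step of 4.8 degrees
```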
3.2 Hyperparameter Selection and Network Comparison
In the experiments, all code was implemented in the TensorFlow framework abadi_tensorflow:_2016 and run on an NVIDIA Titan RTX GPU. The Adam optimization method kingma_adam:_2015 was used to optimize the trainable parameters. During the training process, a mini-batch size of 10 was selected, so that each input batch contained ten 3D patches of 9 adjacent slices. The hyperparameter $\lambda$ used to balance the Wasserstein distance and the gradient penalty was set to 10, as suggested in gulrajani_improved_2017-1. The learning rate was decreased by a factor of 2 after each epoch. The hyperparameters $\lambda_{sl}$ and $\lambda_{adv}$, which weight the SSIM loss and the adversarial loss respectively, were adjusted using the following steps. First, the proposed network was optimized using only the MSE loss. The testing results were treated as the baseline for fine-tuning the two hyperparameters. Then, the SSIM loss was added as part of the objective function. Finally, the adversarial loss was added, and the hyperparameter $\lambda_{adv}$ was fine-tuned. Through this process, $\lambda_{sl}$ and $\lambda_{adv}$ were set to 0.5 and 0.0025, respectively. Please note that $\lambda_{sl}$ and $\lambda_{adv}$ were fine-tuned for the best SSIM values in the validation set.
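The learning-rate schedule described above, in which the rate is halved after every epoch, can be written as a one-line function. The initial value used in the example is a hypothetical placeholder, not the value used in the experiments.

```python
def learning_rate(initial_lr, epoch, decay_factor=2.0):
    """Learning rate used during `epoch` (0-based) when dividing by decay_factor each epoch."""
    return initial_lr / (decay_factor ** epoch)

# Hypothetical initial learning rate, shown for the first four epochs.
schedule = [learning_rate(1e-4, epoch) for epoch in range(4)]
```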
For qualitative comparison, the proposed DEAR-3D network was compared with two deep-learning-based methods for few-view CT image reconstruction, including the FBPConvNet method (a classic U-net ronneberger_u-net:_2015 with conveying paths to solve the CT problem jin_deep_2017) and a CNN-based residual network cong_deep-learning-based_2019 (denoted as residual-CNN in this paper). The network settings were made the same as the default settings described in the original papers. The analytical FBP method was used as a baseline for comparison.
Moreover, to highlight the effectiveness of the proposed objective functions used in the 3D architecture, as shown in Table 2, 5 different networks with different combinations of objective functions were trained for comparison: (1) the DEAR-2D network with only MSE loss and without WGAN (denoted as DEAR-2D); (2) DEAR-2D with MSE and SSIM loss but without WGAN (denoted as DEAR-2D); (3) DEAR-2D-i with MSE and SSIM loss but without WGAN (denoted as DEAR-2D-i); (4) DEAR-3D with MSE and SSIM loss but without WGAN (denoted as DEAR-3D); (5) a full DEAR-3D network with WGAN (denoted as DEAR-3D). Hyperparameters for all of these 5 networks were experimentally adjusted using the steps mentioned above.
3.3 Comparison with Other Deep-learning methods
To visualize the performance of different methods, a few representative slices were selected from the testing set. Fig. 3 shows results using different methods from 75-view few-view images. Three metrics, peak signal-to-noise ratio (PSNR) korhonen_peak_2012, SSIM, and root-mean-square-error (RMSE) willmott_advantages_2005, were computed for quantitative assessment. The quantitative results are shown in Table 3. For better evaluation of the image quality, the regions-of-interest (ROIs) marked by rectangles in Fig. 3 are magnified in Fig. 4.
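For reference, RMSE and PSNR can be computed as in the sketch below. It assumes that both images are flattened lists of values normalized to [0, 1], so that the data range is 1; whether the reported metrics were computed on normalized images is not stated in this text, and the function names are hypothetical.

```python
import math

def rmse(truth, recon):
    """Root-mean-square error between two flattened images."""
    squared = sum((t - r) ** 2 for t, r in zip(truth, recon))
    return math.sqrt(squared / len(truth))

def psnr(truth, recon, data_range=1.0):
    """Peak signal-to-noise ratio in dB; infinite for identical images."""
    error = rmse(truth, recon)
    if error == 0.0:
        return math.inf
    return 20.0 * math.log10(data_range / error)
```

A lower RMSE and a higher PSNR both indicate a reconstruction closer to the ground truth; an RMSE of 0.1 on a unit data range corresponds to a PSNR of about 20 dB.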
The ground-truth images and the corresponding few-view images are presented in Fig. 3a and 3b, respectively. As shown in Fig. 3b, streak artifacts are clearly visible in the images reconstructed using the FBP method. Lesions and subtle details that are visible in the ground-truth images in Fig. 3a are hidden by few-view artifacts in Fig. 3b. The results from the 2D-based deep-learning reconstruction methods (FBPConvNet and residual-CNN) are shown in Fig. 3c and 3d as well as Fig. 4c and 4d, respectively. These 2D methods can effectively reduce artifacts, but they may miss the spatial correlation between adjacent slices, resulting in a loss of subtle but critical details. As shown in the first and second rows of Fig. 4, FBPConvNet and residual-CNN tend to distort or smooth out some subtle details in the ROIs that are visible in the full-view images reconstructed by the FBP method (indicated by the blue and orange arrows in Fig. 4). Moreover, it is observed that residual-CNN is unable to effectively remove streak artifacts in the reconstructed images, especially along the boundaries (the orange and blue arrows in Fig. 4). Our proposed method, DEAR-3D, is better than the competing methods at removing artifacts and at keeping tiny but vital details. It is also better at recovering image texture than the other methods, which may be due to the processing capability of the 3D network and the discriminative power of the WGAN framework.
3.4 Ablation Analysis
This subsection demonstrates the effectiveness of different components in the proposed DEAR-3D network. As mentioned above, 5 variants of the DEAR-3D network were trained for this purpose. The results are shown in Fig. 5, with the corresponding quantitative measurements in Table 4. The zoomed-in regions-of-interest (ROIs), which are marked by rectangles in Fig. 5, are shown in Fig. 6. As presented in Fig. 6, due to the improper 2D design of the objective function, the DEAR-2D network with only MSE loss tends to smooth out features such as the lesion, leading to an unacceptable image quality (the lesion becomes barely visible in the first row in Fig. 6). Adding SSIM as part of the objective function improved the overall image quality, but due to the lack of 3D spatial context, the 2D-based methods are unable to recover subtle details (indicated by the blue arrows in the first and third rows in Fig. 6). No significant difference is observed between the DEAR-2D and the DEAR-2D-i networks. Lastly, the combination of the 3D architecture, WGAN and the adversarial loss improved image texture and overall image quality, which is desirable in practice. In summary, it is observed that the 2D-based methods compromise some details in the reconstructed images (the blue arrows in Fig. 6), and that, by using information from adjacent slices, the DEAR-3D network performs better than the other methods at removing artifacts and keeping image texture.
4 Discussion and Conclusions
Few-view CT may be implemented as a mechanically stationary scanner in the future cramer_stationary_2018 for health-care and other utilities. Current commercial CT scanners use one or two x-ray sources mounted on a rotating gantry, and take hundreds of projections around a patient. The rotating mechanism is not only massive but also power-consuming. Hence, current commercial CT scanners are inaccessible outside hospitals and imaging centers, due to their size, weight, and cost. Designing a stationary gantry with multiple miniature x-ray sources is an interesting approach to resolve this issue cramer_stationary_2018. Unfortunately, the current technology does not allow us to assemble hundreds of miniature x-ray sources in a ring for reconstructing a high-quality CT image over an ROI of a decent aperture. Few-view CT is an attractive option. However, streak artifacts would be introduced from a few-view scan due to insufficiency of projection data. Recently, deep learning has achieved remarkable results for few-view CT, and our proposed DEAR-3D network is a step forward along this direction.
This paper has introduced a novel 3D Deep Encoder-decoder Adversarial Reconstruction Network (DEAR-3D) for directly reconstructing a 3D volume from 3D input data. Compared with 2D-based methods jin_deep_2017; lee_deep-neural-network-based_2019; chen_learn:_2018; xie_dual_2019-1; cong_deep-learning-based_2019, DEAR-3D avoids the potential loss in the 3D spatial context. Specifically, our proposed network is featured by (1) a 3D convolutional encoder-decoder network with conveying-paths; (2) the Wasserstein GAN framework for optimal parameters; and (3) the powerful DenseNet architecture for improved performance.
In conclusion, we have presented a novel 3D deep network, DEAR-3D, for solving the few-view CT problem. The proposed method outperforms the 2D deep-learning methods compared in this study and promises clinical utility in applications such as breast cone-beam CT and C-arm cone-beam CT, which are possibilities for future research. In the follow-up investigation, we plan to further improve the network and perform more experiments to optimize and validate the DEAR-3D network.
H. Xie and H. Shan initiated the project and designed the experiments. H. Xie performed the machine learning research and experiments, and wrote the paper. H. Shan and G. Wang participated in the discussions and edited the paper.