Abstract
X-ray computed tomography (CT) is widely used in clinical practice. The ionizing X-ray radiation involved, however, could increase cancer risk. Hence, reducing the radiation dose has been an important topic in recent years. Few-view CT image reconstruction is one of the main ways to minimize radiation dose and potentially allow a stationary CT architecture. In this paper, we propose a deep encoder-decoder adversarial reconstruction (DEAR) network for 3D CT image reconstruction from few-view data. Since the artifacts caused by few-view reconstruction appear in 3D instead of 2D geometry, a 3D deep network has a great potential for improving the image quality in a data-driven fashion. More specifically, our proposed DEAR-3D network aims at reconstructing a 3D volume directly from clinical 3D spiral cone-beam image data. DEAR is validated on a publicly available abdominal CT dataset prepared and authorized by Mayo Clinic. Compared with other 2D deep-learning methods, the proposed DEAR-3D network can utilize 3D information to produce promising reconstruction results.
Deep Encoder-decoder Adversarial Reconstruction (DEAR) Network for 3D CT from Few-view Data
Huidong Xie, Hongming Shan and Ge Wang *
* Correspondence: wangg6@rpi.edu
Received: date; Accepted: date; Published: date (2019)
1 Introduction
X-ray computed tomography (CT) is one of the most essential imaging modalities widely used in clinical practice [brenner_computed_2007]. Even though CT brings overwhelming healthcare benefits to patients, it could potentially increase patients' cancer risk due to the involved ionizing radiation. The data from the National Lung Screening Trial indicate that annual lung cancer screening with low-dose CT could significantly reduce lung-cancer-related mortality [shan_3d_20181]. If the effective dose of a routine CT examination is reduced to less than 1 mSv, the long-term risk of CT scanning becomes negligible. In the past years, numerous deep-learning-based CT denoising methods were proposed to reduce radiation dose with excellent results [chen_lowdose_20171; shan_competitive_20191; chen_lowdose_2017]. In parallel, few-view CT is also being actively investigated to reduce the radiation dose, especially for breast CT [glick_breast_20071] and C-arm CT [wallace_threedimensional_2008].
Few-view CT is a challenging problem. Due to the requirement imposed by the Nyquist sampling theorem [jerri_shannon_1977], reconstructing high-quality CT images from undersampled projection data was previously considered an unsolvable problem. With sufficient projection data, analytical methods such as filtered backprojection (FBP) [wang_approximate_2007] can be used for accurate image reconstruction. However, FBP introduces severe streak artifacts when projection data are limited. Numerous iterative reconstruction algorithms were proposed to incorporate prior knowledge for suppressing image artifacts in few-view scans. Well-known methods include the algebraic reconstruction technique (ART) [gordon_algebraic_1970], the simultaneous algebraic reconstruction technique (SART) [andersen_simultaneous_1984], expectation maximization (EM) [dempster_maximum_1977], etc. Even though these iterative methods do improve image quality, they are usually time-consuming and still unable to produce clinically acceptable results in many cases. Recently, with the assistance of graphics processing units (GPUs) and big data, deep learning has become a new frontier of tomographic imaging and offers new opportunities for few-view CT reconstruction [wang_perspective_2016; wang_guest_2015].
Deep learning has now been well recognized in the field of medical tomographic imaging [greenspan_guest_2016]. Several methods were proposed to resolve few-view CT issues in a data-driven fashion. For example, Jin et al. [jin_deep_2017] proposed a U-net-based [ronneberger_unet:_2015] FBPConvNet to remove streak artifacts in the 2D image domain. Lee et al. used a similar U-net structure to eliminate artifacts in the sinogram domain [lee_deepneuralnetworkbased_2019]. Chen et al. designed a LEARN network [chen_learn:_2018] to map sparse sinogram data directly to a tomographic image, which combines a convolutional neural network [lecun_convolutional_1995] and a classic iterative process under data-driven regularization. Inspired by the FBP workflow, Li et al. published their iCTNET [li_learning_2019] to perform CT reconstruction in various special cases and consistently obtain decent results. Our recently published DNA network [xie_dual_20191] addressed the few-view CT issue by learning a network-based reconstruction algorithm from sinogram data. However, none of these methods were designed to perform 3D image reconstruction, subject to a potential loss of the 3D context.
In this paper, we propose a deep encoder-decoder adversarial reconstruction network (DEAR) for 3D CT from few-view data, featured by a direct mapping from a 3D input dataset to a 3D image volume. In diagnosis, radiologists extract 3D spatial information by looping through adjacent slices to form contextual clues. Therefore, it is reasonable and even necessary to use 3D convolutional layers for maximally avoiding streak artifacts in a batch of adjacent reconstructed image slices. The main contributions of our DEAR-3D network are summarized as follows:

DEAR-3D utilizes 3D convolutional layers to extract 3D information from multiple adjacent slices in a generative adversarial network (GAN) [goodfellow_generative_2014] framework. Different from reconstructing 2D images from 3D input data [shan_3d_20181], DEAR-3D directly reconstructs a 3D volume, with faithful texture and image details; and

An extensive comparative study was performed between DEAR3D and various 2D counterparts to demonstrate the merits of the proposed 3D network.
The rest of this paper is organized as follows. Section 2 introduces the DEAR-3D model, its 2D counterparts, and the GAN framework utilized in the proposed model. Section 3 describes our experimental design and results in comparison with other state-of-the-art models for few-view CT. Finally, Section 4 discusses relevant issues and concludes the paper.
2 Methodology
3D CT image reconstruction can be expressed as follows:

$I = \mathcal{R}^{-1}(S),$  (1)

where $I \in \mathbb{R}^{W \times W \times N}$ denotes a 3D image volume reconstructed from sufficient projection data, and $W$ and $N$ denote the width/height of the input images and the number of images acquired from a particular patient, respectively. $S \in \mathbb{R}^{N_v \times N_d \times N_r}$ denotes the corresponding interpolated 3D sinogram from a spiral cone-beam scan, where $N_v$, $N_d$, and $N_r$ denote the number of views, the number of detectors per row, and the number of detector rows, respectively, and $\mathcal{R}^{-1}$ is the inverse operator to reconstruct the CT image volume, such as a typical cone-beam reconstruction formula or algorithm [schaller_efficient_1998; grangeat_mathematical_1991; grangeat_evaluation_1991; katsevich_improved_2004] when sufficient projection data are obtained. However, when the number of data (linear equations) is not sufficient to resolve all the unknown voxels in the few-view CT setting, streak artifacts will be introduced in the reconstructed images, and how to reconstruct high-quality images becomes highly nontrivial. Deep learning (DL) promises to extract features in reference to extensive knowledge hidden in big data. With a large amount of training data, task-specific and robust prior knowledge can be leveraged to establish a relationship between few-view data/images and the corresponding full-view images. Such a deep network can be formulated as in Eq. (2),
$I_{full} = \mathrm{DEAR}(I_{few}),$  (2)

where $I_{full}$ and $I_{few}$ denote a 3D image volume reconstructed from sufficient projection data and the counterpart from insufficient few-view/sparse-view data, respectively, and $\mathrm{DEAR}(\cdot)$ denotes our DEAR-3D network, which removes the artifacts caused by the few-view problem.
2.1 Proposed Framework
The overall network architecture is shown in Fig. 1. The proposed DEAR-3D network is optimized in the Wasserstein generative adversarial network (WGAN) framework [arjovsky_wasserstein_2017], which is currently one of the most advanced GAN frameworks. In this study, the proposed framework consists of two components: a generator network $G$ and a discriminator network $D$. $G$ aims at directly reconstructing a 3D image volume from a batch of 3D few-view image slices. $D$ receives images from both $G$ and the ground-truth dataset, trying to distinguish whether its input is real. Both networks optimize themselves during the training process. If an optimized $D$ can hardly distinguish fake images (from $G$) from real images (from the ground-truth dataset), then the generator fools the discriminator successfully. By design, the introduction of $D$ also helps improve the texture of the reconstructed images.
Different from the vanilla generative adversarial network (GAN) [goodfellow_generative_2014], WGAN replaces the logarithm term in the loss function with the Wasserstein distance, improving the training stability. In WGAN, the discriminator is assumed to be a 1-Lipschitz function, which is enforced through weight clipping. However, Ref. [gulrajani_improved_20171] pointed out that weight clipping may be problematic in WGAN and suggested replacing it with a gradient-penalty term, which is used in our proposed method. Hence, the objective function of the GAN framework is expressed as follows:
$\min_{\theta_G} \max_{\theta_D} \; \mathbb{E}_{I_{full}}\big[D(I_{full})\big] - \mathbb{E}_{I_{few}}\big[D(G(I_{few}))\big] - \lambda \, \mathbb{E}_{\hat{I}}\Big[\big(\lVert \nabla_{\hat{I}} D(\hat{I}) \rVert_2 - 1\big)^2\Big],$  (3)

where $I_{few}$ and $I_{full}$ represent a few-view/sparse-view 3D image volume and a full-view 3D image volume, respectively, $\mathbb{E}_{a}[b]$ denotes the expectation of $b$ as a function of $a$, $\theta_G$ and $\theta_D$ denote the trainable parameters of the networks $G$ and $D$, respectively, and $\hat{I} = \epsilon I_{full} + (1-\epsilon) G(I_{few})$, where $\epsilon$ is uniformly sampled from the interval [0,1]. In other words, $\hat{I}$ represents another batch of 3D image slices between fake and real images. Furthermore, $\nabla_{\hat{I}} D(\hat{I})$ denotes the gradient of $D(\hat{I})$ with respect to $\hat{I}$. Lastly, $\lambda$ is a parameter used to balance the Wasserstein distance and the gradient penalty. The networks $G$ and $D$ are updated in an iterative manner as suggested in [goodfellow_generative_2014; gulrajani_improved_20171; arjovsky_wasserstein_2017].
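To make the gradient-penalty term concrete, the following minimal sketch (our illustration, not the authors' code) computes the interpolate $\hat{I}$ and the penalty $\lambda(\lVert \nabla D(\hat{I}) \rVert_2 - 1)^2$ for a toy linear critic, whose gradient is available in closed form; in real training, the gradient would come from automatic differentiation in TensorFlow.

```python
import numpy as np

rng = np.random.default_rng(0)

def critic(x, w):
    # Toy linear critic D(x) = <w, x>; its gradient w.r.t. x is simply w.
    return float(np.sum(w * x))

def gradient_penalty(real, fake, w, lam=10.0):
    # I_hat = eps * I_full + (1 - eps) * G(I_few), with eps ~ U[0, 1].
    eps = rng.uniform(0.0, 1.0)
    x_hat = eps * real + (1.0 - eps) * fake  # interpolated batch (unused by the
    grad = w                                 # linear critic, whose gradient is w)
    return lam * (np.linalg.norm(grad) - 1.0) ** 2

real = rng.normal(size=(9, 64, 64))   # 9 adjacent ground-truth slices
fake = rng.normal(size=(9, 64, 64))   # 9 generator-output slices
w = np.zeros((9, 64, 64))
w[0, 0, 0] = 1.0                      # unit-norm critic weights

gp = gradient_penalty(real, fake, w)  # penalty vanishes when ||grad||_2 = 1
d_loss = critic(fake, w) - critic(real, w) + gp  # critic loss to minimize
```

Because the toy critic has exactly unit gradient norm, its penalty is zero; scaling the weights away from unit norm makes the penalty grow quadratically, which is what keeps the critic approximately 1-Lipschitz during training.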
2.2 Generator Network
The input to the generator is a batch of 3D image slices with dimensionality $N_b \times N_s \times W \times W$, where $N_b$, $N_s$, and $W$ denote the batch size, the number of adjacent input slices, and the width/height of each input image slice, respectively. Intuitively, $N_s$ should equal the total number of image slices of a particular patient, since tissues in all the different 2D planes relate to each other. However, this is not practical due to an extremely large memory cost. Hence, $N_s$ is experimentally adjusted to 9. The structure of the generator is inspired by the U-net [ronneberger_unet:_2015], originally proposed for biological image segmentation. Since then, the U-net has been utilized for various applications in the field of medical imaging. For example, [chen_lowdose_20171; shan_3d_20181] used the U-net with conveying paths for CT image denoising, [xie_dual_20191; jin_deep_2017; lee_deepneuralnetworkbased_2019] applied it to few-view CT, and [donoho_compressed_2006] to compressed-sensing MRI. In DEAR, $G$ is a revised U-net with conveying paths, built in reference to DenseNet [huang_densely_2016]. The generator consists of 4 3D convolutional layers for down-sampling and 4 3D transpose-convolutional layers for up-sampling a batch of 3D image slices. In the original U-net [ronneberger_unet:_2015], a stride of 2 was used in each down-sampling or up-sampling layer to extract features at different scales for segmentation. However, for image reconstruction, down-sampling the input images severely may compromise performance because the convolutional layers may not be able to recover the images accurately from low-dimensional feature maps. Therefore, a stride of 1 is used in all the convolutional layers of $G$, and zero-padding is not applied in the down-sampling and up-sampling layers. A rectified linear unit (ReLU) activation function is used after each 3D convolutional layer.
A dense block is added after each down-sampling and up-sampling layer. Each dense block contains 5 3D convolutional layers to extract 3D image features from the input feature maps. Note that zero-padding is used in all these 3D convolutional layers to maintain the dimensionality of the input feature maps. Inspired by ResNet [he_deep_2015], shortcuts are applied to connect early and current feature maps, allowing gradients to flow directly to the current layer from the corresponding earlier layer. Different from ResNet, DenseNet further improves the information flow between layers by connecting all the earlier feature maps to the current layer. Consequently, the $\ell^{th}$ layer receives the feature maps from all previous layers, $x_0, x_1, \ldots, x_{\ell-1}$, as the input:
$x_{\ell} = H_{\ell}\big([x_0, x_1, \ldots, x_{\ell-1}]\big),$  (4)

where $[x_0, x_1, \ldots, x_{\ell-1}]$ represents the concatenation of the feature maps produced by layers $0, \ldots, \ell-1$, and $H_{\ell}(\cdot)$ denotes the operation performed by the $\ell^{th}$ layer, defined as a composite function of a 3D convolutional operation and a ReLU activation. A stride of 1 is used in all the 3D convolutional layers of the proposed dense block so that the feature-map dimensions are preserved. Note that the purpose of DEAR-3D is to learn the inverse amplitude of the artifacts in the input images, and therefore the input images are directly added to the last convolutional layer, as presented in Fig. 1.
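The dense connectivity of Eq. (4) can be sketched as follows. This is a minimal numpy illustration of the channel bookkeeping only: `conv3d_stub` is a hypothetical stand-in for the 3D convolution plus ReLU ($H_\ell$), and the growth rate of 16 channels per layer is an assumed value for demonstration.

```python
import numpy as np

def conv3d_stub(x, out_channels):
    # Stand-in for a zero-padded 3D conv + ReLU (H_l). A simple channel-mixing
    # projection keeps spatial dimensions intact, like the real dense block.
    c = x.shape[0]
    w = np.ones((out_channels, c)) / c
    y = np.tensordot(w, x, axes=1)        # (out_channels, depth, H, W)
    return np.maximum(y, 0.0)             # ReLU

def dense_block(x0, num_layers=5, growth=16):
    feats = [x0]
    for _ in range(num_layers):
        # x_l = H_l([x_0, x_1, ..., x_{l-1}]): concatenate ALL earlier maps
        x_in = np.concatenate(feats, axis=0)
        feats.append(conv3d_stub(x_in, growth))
    return np.concatenate(feats, axis=0)

x0 = np.random.rand(32, 9, 64, 64)        # (channels, adjacent slices, H, W)
out = dense_block(x0)                     # channels grow as 32 + 5 * 16 = 112
```

The point of the sketch is that each of the 5 layers sees every earlier feature map, so the channel count grows linearly with depth while the spatial dimensions (including the 9-slice depth) stay fixed.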
2.3 Discriminator Network
The discriminator network $D$ takes its input from either $G$ or the ground-truth dataset, trying to classify whether the input images are real. In DEAR-3D, the discriminator network has 6 convolutional layers with 64, 64, 128, 128, 256, and 256 filters, followed by 2 fully-connected layers with 1024 and 1 neurons, respectively. A leaky ReLU activation function with a slope of 0.2 in the negative part is used after each layer. 3D convolutional layers with zero-padding are used for all convolutional layers, and the stride is set to 2 for all the layers.
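A quick way to sanity-check this architecture is to trace the spatial down-sampling: with stride 2 and zero-padding (we assume "same"-style padding, which halves each spatial dimension with rounding up), six layers shrink the input by a factor of $2^6$. The sketch below is our bookkeeping illustration, not the authors' code.

```python
import math

# Filter counts of the six conv layers of D, as described in the text
filters = [64, 64, 128, 128, 256, 256]

def spatial_size_after(n, num_layers=6, stride=2):
    # With "same" zero-padding (an assumption) and stride 2, each layer
    # reduces a spatial dimension to ceil(n / stride).
    for _ in range(num_layers):
        n = math.ceil(n / stride)
    return n
```

For example, a 64-voxel-wide input patch is reduced to a single spatial position after the six convolutional layers, so the first fully-connected layer sees a compact 256-channel feature vector per spatial location.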
2.4 Objective Functions for Generator
This subsection introduces and evaluates the different objective functions used for few-view CT artifact reduction. As shown in Fig. 2, a composite objective function is used to optimize DEAR-3D.
2.4.1 MSE Loss
The mean-squared error (MSE) [chen_lowdose_20171; wolterink_generative_2017; wang_mean_2009] is a popular choice for denoising and artifact-removal applications [wang_mean_2009]. Nevertheless, it could lead to over-smoothed images [zhao_loss_20171]. Moreover, MSE is not sensitive to image texture and assumes that the background noise is white Gaussian noise independent of local image features [zhou_wang_image_2004]. The MSE used in the proposed method is expressed as follows:
$\mathcal{L}_{\mathrm{MSE}} = \frac{1}{N_b N_s W^2} \big\lVert I_{full} - G(I_{few}) \big\rVert_2^2,$  (5)

where $N_b$, $N_s$, and $W$ denote the number of batches, the number of input slices, and the image width/height, respectively, and $I_{full}$ and $G(I_{few})$ represent the ground-truth 3D image volume and the 3D image volume reconstructed by $G$, respectively.
2.4.2 Structural Similarity Loss
To overcome the disadvantages of the MSE loss and obtain visually superior images, the structural similarity index (SSIM) [zhou_wang_image_2004] is introduced into the objective function. SSIM measures the structural similarity between two images over a sliding convolution window and, for the ground-truth volume $I_{full}$ and the reconstructed volume $I_G = G(I_{few})$, is expressed as follows:

$\mathrm{SSIM}(I_{full}, I_G) = \frac{(2\mu_f \mu_g + c_1)(2\sigma_{fg} + c_2)}{(\mu_f^2 + \mu_g^2 + c_1)(\sigma_f^2 + \sigma_g^2 + c_2)},$  (6)

where $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are constants used to stabilize the formula when the denominator is small, $L$ stands for the dynamic range of the voxel values, and $\mu_f$, $\mu_g$, $\sigma_f^2$, $\sigma_g^2$, and $\sigma_{fg}$ are the means of $I_{full}$ and $I_G$, the variances of $I_{full}$ and $I_G$, and the covariance between them, respectively. Since the maximum value of SSIM is 1, the structural loss used to optimize DEAR-3D is expressed as follows:

$\mathcal{L}_{\mathrm{SSIM}} = 1 - \mathrm{SSIM}\big(I_{full}, G(I_{few})\big).$  (7)
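Eq. (6) and Eq. (7) can be sketched directly in numpy. Note this simplified version computes a single global SSIM over the whole volume, whereas the paper evaluates it over a sliding convolution window; the constants $k_1 = 0.01$ and $k_2 = 0.03$ are the conventional defaults, assumed here.

```python
import numpy as np

def ssim(x, y, L=1.0, k1=0.01, k2=0.03):
    # Single-window SSIM per Eq. (6). c1 and c2 stabilize the ratio when the
    # denominator is small; L is the dynamic range (1 for [0, 1] volumes).
    c1, c2 = (k1 * L) ** 2, (k2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

def ssim_loss(full, pred):
    # Eq. (7): structural loss L_SSIM = 1 - SSIM
    return 1.0 - ssim(full, pred)

rng = np.random.default_rng(1)
vol = rng.random((9, 64, 64))                 # a [0, 1]-normalized volume
noisy = np.clip(vol + 0.05 * rng.normal(size=vol.shape), 0.0, 1.0)
```

A volume compared against itself yields SSIM = 1 (and zero loss), while any perturbation pushes the loss above zero, which is the signal the generator minimizes.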
2.4.3 Adversarial Loss
The adversarial loss used in DEAR-3D encourages the generator to produce realistic images that are indistinguishable to the discriminator network. Referring to Eq. (3), the adversarial loss is expressed as follows:

$\mathcal{L}_{adv} = -\mathbb{E}_{I_{few}}\big[D(G(I_{few}))\big].$  (8)

The overall objective function of $G$ is then expressed as follows:

$\mathcal{L}_{G} = \mathcal{L}_{\mathrm{MSE}} + \lambda_{s} \mathcal{L}_{\mathrm{SSIM}} + \lambda_{a} \mathcal{L}_{adv},$  (9)

where $\lambda_{s}$ and $\lambda_{a}$ are hyper-parameters used to balance the different loss components.
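As a sanity check on the composite objective of Eq. (9), the combiner below uses the SSIM and adversarial weights reported later in Section 3.2 (0.5 and 0.0025) as defaults; the individual loss values passed in are placeholders.

```python
def generator_objective(l_mse, l_ssim, l_adv, lam_s=0.5, lam_a=0.0025):
    # Eq. (9): L_G = L_MSE + lam_s * L_SSIM + lam_a * L_adv.
    # Default weights follow the values tuned in Section 3.2.
    return l_mse + lam_s * l_ssim + lam_a * l_adv
```

The small adversarial weight reflects the design intent: the MSE and SSIM terms anchor the reconstruction to the ground truth, while the adversarial term only nudges the texture toward realism.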
2.5 Corresponding 2D Networks for Comparison
To evaluate the performance of the proposed 3D network, a 2D network was built for benchmarking, denoted as DEAR-2D. DEAR-2D uses exactly the same structure as DEAR-3D, except that all the 3D convolutional layers in the dense blocks are replaced with 2D convolutional layers. Please note that the number of parameters of DEAR-2D is less than that of DEAR-3D because the dimensionality of the input 2D batches is significantly smaller than that of the 3D batches. For a fair comparison, another 2D network was built with an accordingly increased number of training parameters, denoted as DEAR-2D-i; its parameter count was increased by using more filters in the 2D convolutional layers. Different from DEAR-3D, the 2D counterparts only utilize 2D convolutional layers to extract 2D feature maps from a batch of 2D input images. Therefore, the 2D counterparts aim at reconstructing 2D images instead of a 3D volume, which may lead to a potential loss of contextual information. Specifically, in DEAR-2D, all the 2D convolutional layers in both the encoder-decoder part and the dense blocks contain 38 filters, whereas in DEAR-2D-i they contain 48 filters. Table 1 shows the numbers of parameters of the three networks.
Table 1. Numbers of trainable parameters.

Network:       DEAR-3D    DEAR-2D    DEAR-2D-i
# Parameters:  5,123,617  3,459,749  5,519,329
Moreover, to demonstrate the effectiveness of different loss functions used to optimize the proposed neural network, 2D and 3D networks with different combinations of loss components are considered for comparison.
3 Experimental design and results
3.1 Dataset and Preprocessing
A clinical abdominal dataset was used to train and evaluate the proposed DEAR-3D method. The dataset was prepared and authorized by Mayo Clinic for "the 2016 NIH-AAPM-Mayo Clinic Low Dose CT Grand Challenge" [noauthor_low_nodate]. The dataset contains a total of 5,936 abdominal CT images selected with 1 mm slice thickness. All the images were reconstructed from 2,304 projections at 100 peak kilovoltage (kVp) and were used as the ground-truth images to train the proposed method. The distance between the x-ray source and the detector array is 1085.6 millimeters, the distance between the x-ray source and the isocenter is 595 millimeters, and the pixel size is 0.664 millimeters. For data preprocessing, the pixel values of the patient images were normalized to be between 0 and 1. During the training process, 4 patients (a total of 2,566 images) were used for training, and 6 patients (a total of 3,370 images) for validation and testing. Patches were cropped from the whole images with a stride of 32 for data augmentation, resulting in a total of 502,936 2D training patches. The 2D patches were used to train the DEAR-2D and DEAR-2D-i networks, while 3D patches, extracted from the preprocessed 2D patches with a stride of 1 in the z-dimension, were used to train the DEAR-3D network. The optimized networks are then applicable to images of any dimension since the proposed DEAR-3D network contains only convolutional layers. The fan-beam Radon transform and the fan-beam inverse Radon transform were used to simulate 75-view images, with the 75-view sinograms synthesized from angles equally distributed over a full scan range.
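The stride-32 patch cropping described above can be sketched as follows. The 64-pixel patch size and the 512-pixel slice width are illustrative assumptions (neither value survived extraction), chosen only to make the sliding-window arithmetic concrete.

```python
import numpy as np

def extract_patches(img, patch=64, stride=32):
    # Crop overlapping patch x patch windows with the paper's stride of 32.
    # patch=64 is an assumed size for illustration.
    h, w = img.shape
    out = [img[i:i + patch, j:j + patch]
           for i in range(0, h - patch + 1, stride)
           for j in range(0, w - patch + 1, stride)]
    return np.stack(out)

slice_ = np.random.rand(512, 512)   # one normalized CT slice (assumed 512-wide)
patches = extract_patches(slice_)   # (512 - 64) / 32 + 1 = 15 positions per axis
```

Stacking such 2D patches from adjacent slices (stride 1 along z) then yields the 3D training patches for DEAR-3D.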
3.2 Hyperparameter Selection and Network Comparison
In the experiments, all codes were implemented in the TensorFlow framework [abadi_tensorflow:_2016] on an NVIDIA Titan RTX GPU. The Adam optimization method [kingma_adam:_2015] was used to optimize the trainable parameters, with its momentum hyper-parameters set as suggested in [kingma_adam:_2015]. During the training process, a mini-batch size of 10 was selected. The hyper-parameter $\lambda$ used to balance the Wasserstein distance and the gradient penalty was set to 10, as suggested in [gulrajani_improved_20171]. The learning rate was decreased by a factor of 2 after each epoch. The loss-balancing hyper-parameters $\lambda_s$ and $\lambda_a$ were adjusted using the following steps. First, the proposed network was optimized using only the MSE loss, and the testing results were treated as the baseline for fine-tuning the other two hyper-parameters. Then, the SSIM loss was added as part of the objective function. Finally, the adversarial loss was added, and $\lambda_a$ was fine-tuned. Through this process, $\lambda_s$ and $\lambda_a$ were set to 0.5 and 0.0025, respectively. Please note that $\lambda_s$ and $\lambda_a$ were fine-tuned for the best SSIM values on the validation set.
For qualitative comparison, the proposed DEAR-3D network was compared with two deep-learning-based methods for few-view CT image reconstruction: the FBPConvNet method (a classic U-net [ronneberger_unet:_2015] with conveying paths for the few-view CT problem [jin_deep_2017]) and a CNN-based residual network [cong_deeplearningbased_2019] (denoted as residual-CNN in this paper). The network settings were kept the same as the default settings described in the original papers. The analytical FBP method was used as a baseline for comparison.
Moreover, to highlight the effectiveness of the proposed objective functions in the 3D architecture, as shown in Table 2, 5 networks with different combinations of objective functions were trained for comparison: (1) the DEAR-2D network with only the MSE loss and without WGAN (denoted as DEAR-2D (MSE)); (2) DEAR-2D with the MSE and SSIM losses but without WGAN (denoted as DEAR-2D (MSE+SSIM)); (3) DEAR-2D-i with the MSE and SSIM losses and without WGAN (denoted as DEAR-2D-i (MSE+SSIM)); (4) DEAR-3D with the MSE and SSIM losses but without WGAN (denoted as DEAR-3D (MSE+SSIM)); and (5) the full DEAR-3D network with WGAN (denoted as DEAR-3D (full)). The hyper-parameters of all these 5 networks were experimentally adjusted using the steps mentioned above.
Table 2. Loss components used by each network variant (AL: adversarial loss).

Network                 MSE   SSIM   AL
DEAR-2D (MSE)           ✓
DEAR-2D (MSE+SSIM)      ✓     ✓
DEAR-2D-i (MSE+SSIM)    ✓     ✓
DEAR-3D (MSE+SSIM)      ✓     ✓
DEAR-3D (full)          ✓     ✓      ✓
3.3 Comparison with Other Deep-learning Methods
To visualize the performance of the different methods, a few representative slices were selected from the testing set. Fig. 3 shows the results obtained using the different methods on 75-view data. Three metrics, peak signal-to-noise ratio (PSNR) [korhonen_peak_2012], SSIM, and root-mean-square error (RMSE) [willmott_advantages_2005], were computed for quantitative assessment, with the results shown in Table 3. For better evaluation of the image quality, the regions of interest (ROIs) marked by rectangles in Fig. 3 are magnified in Fig. 4.
Table 3. Quantitative results (PSNR, SSIM, and RMSE) for FBP, FBPConvNet, residual-CNN, and DEAR-3D.
Table 4. Quantitative results (PSNR, SSIM, and RMSE) for FBP and the five DEAR variants defined in Table 2.
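The RMSE and PSNR metrics used in these tables follow standard definitions and can be sketched as below; `data_range = 1` matches the [0, 1] normalization used for the patient images.

```python
import numpy as np

def rmse(ref, img):
    # Root-mean-square error between a reference and a reconstruction
    return float(np.sqrt(np.mean((ref - img) ** 2)))

def psnr(ref, img, data_range=1.0):
    # PSNR = 20 * log10(data_range / RMSE); data_range = 1 for [0, 1] images
    return 20.0 * np.log10(data_range / rmse(ref, img))

ref = np.zeros((64, 64))
img = np.full((64, 64), 0.1)   # a constant 0.1 error everywhere
```

A uniform error of 0.1 on a unit dynamic range gives an RMSE of 0.1 and a PSNR of 20 dB, a handy reference point when reading the tables.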
The ground-truth images and the corresponding few-view images are presented in Figs. 3a and 3b, respectively. As shown in Fig. 3b, streak artifacts are clearly visible in the images reconstructed using the FBP method. In the ground-truth images in Fig. 3a, lesions and subtle details are visible that are hidden by the few-view artifacts in Fig. 3b. The results from the 2D deep-learning reconstruction methods (FBPConvNet and residual-CNN) are shown in Figs. 3c and 3d as well as Figs. 4c and 4d, respectively. These 2D methods can effectively reduce the artifacts, but they may miss the spatial correlation between adjacent slices, resulting in a loss of subtle but critical details. As shown in the first and second rows of Fig. 4, FBPConvNet and residual-CNN tend to distort or smooth out some subtle details in the ROIs, although these details are visible in the full-dose images reconstructed by the FBP method (indicated by the blue and orange arrows in Fig. 4). Moreover, it is observed that residual-CNN is unable to effectively remove the streak artifacts in the reconstructed images, especially along the boundaries (the orange and blue arrows in Fig. 4). Our proposed DEAR-3D method is better at removing artifacts while keeping tiny but vital details than the competing methods. The proposed DEAR-3D method also recovers image texture better than the other methods, which may be attributed to the processing capability of the 3D network and the discriminative power of the WGAN framework.
3.4 Ablation Analysis
This subsection demonstrates the effectiveness of the different components of the proposed DEAR-3D network. As mentioned above, 5 variants of the DEAR-3D network were trained for this purpose. The results are shown in Fig. 5, with the corresponding quantitative measurements in Table 4. The zoomed-in regions of interest (ROIs), marked by rectangles in Fig. 5, are shown in Fig. 6. As presented in Fig. 6, due to the improper design of the objective function, the DEAR-2D network with only the MSE loss tends to smooth out features such as the lesion, leading to an unacceptable image quality (the lesion becomes barely visible in the first row of Fig. 6). Adding SSIM as part of the objective function improved the overall image quality, but due to the lack of 3D spatial context, the 2D-based methods are unable to recover subtle details (indicated by the blue arrows in the first and third rows of Fig. 6). No significant difference is observed between the DEAR-2D and DEAR-2D-i networks. Lastly, the combination of the 3D architecture, the WGAN framework, and the adversarial loss improved the image texture and the overall image quality, which is desirable in practice. In summary, the 2D-based methods compromise some details in the reconstructed images (the blue arrows in Fig. 6), and by providing information from adjacent slices, the DEAR-3D network performs better than the other methods at removing artifacts and preserving image texture.
4 Discussion and Conclusions
Few-view CT may be implemented on a mechanically stationary scanner in the future [cramer_stationary_2018] for healthcare and other applications. Current commercial CT scanners use one or two x-ray sources mounted on a rotating gantry and take hundreds of projections around a patient. The rotating mechanism is not only massive but also power-consuming. Hence, current commercial CT scanners are inaccessible outside hospitals and imaging centers due to their size, weight, and cost. Designing a stationary gantry with multiple miniature x-ray sources is an interesting approach to resolve this issue [cramer_stationary_2018]. Unfortunately, the current technology does not allow us to assemble hundreds of miniature x-ray sources in a ring for reconstructing a high-quality CT image over an ROI of a decent aperture. Few-view CT is therefore an attractive option. However, streak artifacts would be introduced in a few-view scan due to the insufficiency of projection data. Recently, deep learning has achieved remarkable results for few-view CT, and our proposed DEAR-3D network is a step forward along this direction.
This paper has introduced a novel 3D deep encoder-decoder adversarial reconstruction network (DEAR-3D) for directly reconstructing a 3D volume from 3D input data. Compared with 2D-based methods [jin_deep_2017; lee_deepneuralnetworkbased_2019; chen_learn:_2018; xie_dual_20191; cong_deeplearningbased_2019], DEAR-3D avoids the potential loss of 3D spatial context. Specifically, our proposed network features (1) a 3D convolutional encoder-decoder network with conveying paths; (2) the Wasserstein GAN framework for parameter optimization; and (3) the powerful DenseNet architecture for improved performance.
In conclusion, we have presented a novel 3D deep network, DEAR-3D, for solving the few-view CT problem. The proposed method outperforms its 2D deep-learning counterparts and promises clinical utilities such as breast cone-beam CT and C-arm cone-beam CT in future applications. In follow-up investigations, we plan to further improve the network and perform more experiments to optimize and validate DEAR-3D.
Author contributions: H. Xie and H. Shan initiated the project and designed the experiments. H. Xie performed the machine learning research and experiments and wrote the paper. H. Shan and G. Wang participated in the discussions and edited the paper.
References