Real-Time 2D-3D Deformable Registration with Deep Learning and Application to Lung Radiotherapy Targeting

Real-Time 2D-3D Deformable Registration with Deep Learning and Application to Lung Radiotherapy Targeting

Markus D. Foote \orcidID0000-0002-5170-1937 The final authenticated publication is available online at Computing and Imaging Institute, Department of Bioengineering, University of Utah, Salt Lake City, UT, USA
   Blake E. Zimmerman\orcidID0000-0003-1769-7943 1Scientific Computing and Imaging Institute, Department of Bioengineering, University of Utah, Salt Lake City, UT, USA
   Amit Sawant 2Department of Radiation Oncology, University of Maryland, School of Medicine, Baltimore, MD, USA 2    Sarang C. Joshi 1Scientific Computing and Imaging Institute, Department of Bioengineering, University of Utah, Salt Lake City, UT, USA

Radiation therapy presents a need for dynamic tracking of a target tumor volume. Fiducial markers such as implanted gold seeds have been used to gate radiation delivery but the markers are invasive and gating significantly increases treatment time. Pretreatment acquisition of a respiratory correlated 4DCT allows for determination of accurate motion tracking which is useful in treatment planning. We design a patient-specific motion subspace and a deep convolutional neural network to recover anatomical positions from a single fluoroscopic projection in real-time. We use this deep network to approximate the nonlinear inverse of a diffeomorphic deformation composed with radiographic projection. This network recovers subspace coordinates to define the patient-specific deformation of the lungs from a baseline anatomic position. The geometric accuracy of the subspace deformations on real patient data is similar to accuracy attained by original image registration between individual respiratory-phase image volumes.

Therapy Target Tracking Lung Cancer Real-Time Computed Tomography Fluoroscopy Deep Learning Convolutional Neural Network Motion Analysis Image Registration Computational Anatomy

1 Introduction

According to the United States Centers for Disease Control and Prevention, lung cancer is the leading cause of cancer death, accounting for 27% of all cancer deaths in the United States [22]. Accurate estimation of organ movement and normal tissue deformations plays a crucial role in dose calculations and treatment decisions in radiation therapy of lung cancer [3]. State-of-the-art radiation treatment planning uses 4D (3D + time) respiratory correlated computed tomography (4D-RCCT) scans as a planning tool to deliver radiation dose to the tumor during treatment [6] and minimize dose to the healthy tissue and vital organs. Understanding respiratory motion allows for more accurate treatment planning and delivery, resulting in more targeted dose delivery to only the tumor.

While general motion patterns in radiotherapy patients are relatively well understood, cycle-to-cycle variations remain a significant challenge and may account for observed discrepancies between predictive treatment plan indicators and clinical patient outcomes, especially due to an increased irradiated volume which limits adequate dose to the target [18]. Gating methods have been developed but require implanted markers and lengthen treatment time. Real-time motion tracking using magnetic resonance imaging is only applied within the 2D imaging slices due to time constraints for imaging [4]. Accurate and fast noninvasive target tracking that accounts for real-time cycle-to-cycle variations in respiratory motion is a recognized need in radiotherapy [9, 19].

In this paper, we propose to use 4D-RCCT acquired at treatment planning time to develop a finite low dimensional patient specific space of deformations and use treatment time single angle fluoroscopy imaging in conjunction with deep convolutional neural network to recover the full deformations of the anatomy during treatment. The constant time inference capability of convolutional neural networks allows real-time recovery of the anatomic position of the tumor target and surrounding anatomy using radiographic images of the treatment area acquired using fluoroscopy. We envision our framework to be incorporated as the target position monitoring subsystem in a conformal radiotherapy system [19]. The proposed method eliminates the need for invasive fiducial markers while still producing targeted radiation delivery during variable respiration patterns. While this framework requires training of the model on a per-patient basis, only inference is required during treatment. Inference of the deep convolutional network and subsequent linear combination of patient-specific motion components can be calculated faster than real-time and drive motion-tracking of conformal radiotherapy treatment.

The proposed framework begins by creating motion subspace from CT registration among the respiratory phase images. This motion subspace is then used to generate many respiratory phase fluoroscopic images with known subspace coordinates. These labeled fluoroscopic images serve as a training dataset for a deep convolutional neural network that recovers the motion subspace coordinates. Motion coordinates recovered by the deep neural network from unseen fluoroscopy define a linear combination of deformation components that form a full 3D deformation that is represented in the 2D fluoroscopic image.

2 Related Work

2.1 Low-rank Respiratory Motion Models

The respiratory cycle exhibits low-rank behavior and hysteresis, leading to application of principal component analysis to study the deformation of lungs and other organs through time [8]. Sabouri et al.  2017 has used such a PCA approximations of the respiratory cycle to correlate the lung’s principal deformations to respiratory markers directly observable during treatment such as body surface tracking during conformal radiation therapy [17]. PCA has also been applied to MR-based image guidance in radiotherapy [4].

2.2 Rank Constrained Density Motion Estimation

Our real-time tracking procedure is based on the rank-constrained diffeomorphic density matching framework, summarized here for completeness (see [1, 2, 14, 15] for details). In general, the rank-constrained diffeomorphic density matching problem produces a registration between multiple image volumes that is explained by only a few principal components of deformation – less than the total number of volumes. The deformation set matrix rank is non-smooth and therefore unfit for optimization. Instead, the nuclear norm is used as a smooth and convex surrogate. The deformation set matrix is constructed with each vectorized deformation field as a row,


where is the inverse of the deformation from the -th image in the image series to a selected reference image. We thus define the nuclear norm for deformations between images as


where are the singular values. As there are deformations between images, giving only singular values .

As CT measures the linear attenuation coefficient which is proportional to the true density, 4DCT volumes are interpreted as densities, thus values change with compression or expansion. A density is acted upon by a diffeomorphism to compensate for changes of the density by the deformation:


where denotes the Jacobian determinant of . The Riemannian geometry of the group of diffeomorphisms with a suitable Sobolev metric is linked to the Riemannian geometry of densities with the Fisher-Rao metric [1, 7, 10]. The Fisher-Rao metric is used as the measure for similarity due to the property that it is invariant to the action of diffeomorphisms:


The linkage between a suitable Sobolev metric and the Fisher-Rao metric allows for evaluation of the distance in the space of diffeomorphisms in closed form. The Fisher-Rao metric, an incompressibility measure, and surrogate rank measure can then be used to match a set of densities by minimizing the energy functional:


where is a chosen base or reference density and are the other densities in the series. The first term here penalizes dissimilarity between the two densities. The second term penalizes deviations from a volume-preserving deformation. The penalty function acts as weighting of the volume-preserving measure. The final term penalizes the surrogate rank of the subspace in which the set of deformations exist. We use a gradient flow and and iterative soft thresholding for optimizing the above energy functional. This optimization produces a related set of rank-optimized deformations with improved geometric accuracy over other registration methods due to the increased physiologic relevance of the low-rank deformations which match well with the general forward-reverse process of the inhale-exhale cycle, along with hysteresis in other components [2].

2.3 Digitally Reconstructed Radiographs

For a given 4D-RCCT dataset, it is sufficient for training to simulate the 2D projection images at different respiratory states from the 3D volumes acquired at different states. These digitally reconstructed radiographs (DRR) are commonly used for radiation therapy planning. This allows for simulation of the 2D projections that would be acquired at the different phases of breathing throughout a state-of-the-art radiotherapy treatment [16, 19]. A point in the DRR projection image is defined by the line integral of linear attenuation coefficients (values in the CT volume) over distance from the x-ray source to the projection image [9, 20, 21]. The path can be written as where and is the piercing point, or the unique point in the projection image that is closest to the x-ray source. The DRR operation generates a 2D projection of the 3D CT volume that can be used for training in place of actual radiography recordings. These projections closely approximate the fluoroscopic images acquired during a state-of-the-art radiotherapy treatment session.

3 Methodology

3.1 Motion Subspace Learning

A 10-phase respiratory correlated 4D-RCCT of a lung cancer patient for treatment planning from the University of Maryland was provided in anonymized form. Each 3D volume in the 4D-RCCT dataset is with resolution mm. These 10 3D CT volumes, each representing a distinct respiratory phase, are used for rank constrained density registration as described by Foote et al.  [2] to produce 9 rank-optimized deformation fields that map to the full-exhale phase volume from each of the other phase volumes . The low-rank subspace of the space of diffeomorphisms in which the rank-optimized deformation components exist is determined via principal component analysis (PCA).

3.2 Training Dataset Generation

A dataset for training is generated by calculating DRR projections through a deformed reference CT volume The deformation applied to the reference volume is calculated as a linear combination of rank-optimized components. Only the first two rank-optimized components are used as they explain 99.7% of the deformation data (Table 1). The weights for this linear combination are generated as a dense grid of 300 evenly spaced points in first rank-optimized component direction and 200 evenly spaced points in the second (60000 total samples for training). The value of the maximum weight is set as 150% and 110% of the maximum magnitude of the corresponding rank-optimized component for the first and second directions, respectively. These weights serve as the target for training and are then used to calculate a linear combination of the rank-optimized component deformations. The resulting grid of deformations are calculated and applied to the full-exhale 4D-RCCT volume using PyCA [13] (Fig. 1). DRR images are calculated by ray casting and integrating along a cone-beam geometry of the fluoroscope through the deformed full-exhale 4DCT volume [16, 20, 21]. The geometry setup for the DRR generation is similar to the geometry that would be found in a current radiotherapy treatment linear accelerator’s imaging accessories, such as the Varian TrueBeam. Specifically, the x-ray source was positioned laterally from the center of the CT volume. The center of the virtual detector for the projection image was then positioned 150 cm horizontally from the source, through the CT volume, with a pixel array with spatial resolution mm. DRR images were preprocessed before training with variance equalization, normalizing the image intensities to , histogram equalization, and average downsampling.

Number of Percentage of
Included Components Dataset Explained
1 0.90115488
2 0.99739416
3 0.99865254
4 0.99973184
5 0.99989146
6 0.99996631
7 0.99999949
8 1.0
Table 1: Data explanation power of cumulative rank-optimized components of deformation from rank-constrained density deformation optimization. These values are calculated from the normalized cumulative sum of eigenvalues.
Figure 1: Dataset generation overview. The deformations from rank constrained motion estimation are decomposed using PCA. Coordinates within the PCA space are used to generate a new deformation field, which is then applied to the reference CT volume. The resulting volume is used for DRR projection.

3.3 Network Architechture

The mapping from rank-optimized component weights to a projection image is highly nonlinear as it includes both a diffeomorphic deformation and radiographic projection. We aim to approximate the inverse of this mapping with a deep convolutional neural network as a multi-output continuous regression problem. A promising network architecture for this application is DenseNet [5]. DenseNet has achieved state-of-the-art performance on several datasets while using fewer parameters and decreasing over-fitting. The direct connections between constant spatial layers allow for reusing features learned from previous layers and improve the flow of gradients throughout the network. Additionally, DenseNet is ideal for real-time applications due to reduced model complexity and depth.

We use an efficient DenseNet model in PyTorch tailored to our application so that single-channel images are used as input [12, 11]. The rank-optimized component weights used to produce the projection image are the regression points (Fig. 2). Summarized here for completeness, the DenseNet architecture convolves each input with a kernel size of 3 and filter number of before being input to the first dense block, where is the growth rate. Each subsequent dense block consists of 8 convolutional layers all with filter size of 3 and a growth rate of 4. The 4 convolutional layers within a dense block are each preceded by a batch normalization layer and a rectified linear unit (ReLU) layer. The output from a dense block enters a transition block where batch normalization and non-linear activation are followed by a convolutional feature map reduction layer (compression=0.5) and spatial reduction factor of 2 in each dimension using max-pooling. A total of 4 dense and transition blocks are used to reduce the spatial dimensions to final feature maps. The final layer of the network was a linear layer that regresses the feature maps to the rank-optimized component weights.


Dense Block 1



Dense Block 2



Dense Block 3



Dense Block 4





Figure 2: DenseNet architecture used to learn the inverse deformation-projection estimate. Each dense block represents a group of 4 feature layers and growth rate 4 with an included transition layer. A final linear layer recovers the weights for deformation components.

4 Results

For our experiments, we use a 10-phase RCCT of a lung radiotherapy patient as the source data. Rank-optimized deformations were calculated as described by Foote et al. and summarized in section 2.2 using the full-exhale (Phase 5) CT volume as a reference. Using PCA, rank-optimized components of these deformations were determined and used to generate a DRR training dataset as described in 3.2. In section 4.1 we outline the specific training procedure for the neural network. Then in section 4.2 we test the neural network against unseen data from a breathing model. Finally, section 4.3 describes generalization of the trained model on the phases of the RCCT dataset that were not used in training dataset generation.

4.1 Training and Test Performance

The network is trained for 300 epochs using a loss function with a learning rate of 0.1 and batch size of 2048. The loss independently penalizes each deformation component which avoids preferential fitting of the larger-weighted component at any point. The learning rate is decreased by a factor of 5 when the training loss plateaus within a relative threshold of for the last 20 epochs. The dataset is randomly split 80-20 between training and on-line testing to monitor for over-fitting during training. Training for 300 epochs completes in under 2 hours on a single Nvidia Quadro GV100 GPU.

4.2 Spline Model Deformation Validation

A motion-subspace-based breathing model was created using a spline interpolation of the first two rank-optimized component weights from the 9 rank-optimized deformations (Fig. 3). At any point along the model curve, the coordinate weights provide a linear combination of the rank-optimized component deformations to apply to the reference full-exhale volume to obtain a model volume for the corresponding breathing phase. DRR projections through these model volumes in the same manner as the training dataset generation produced fluoroscopic images for evaluation. As with the training images, these projection images are preprocessed with variance equalization, normalization, histogram equalization, and downsampling. Evaluation of these images through the trained network recovers the weights of the rank-optimized components. Inference on a single Nvidia Quadro GV100 GPU has a throughput of 1113 images/second with the same PyTorch implementation – significantly faster than real-time. The relation of these weights to the original spline model is shown in Figure 3.

Accuracy of the deformation recovered from the inferred weights is measured by maximum deformation distance error compared to the reconstruction of these deformations with the two rank-optimized deformation component weights directly from the model. From the 40 model points, the maximum error between the applied deformation and the recovered deformation was 1.22 mm.

Rank-Optimized Component 1 Weight

Rank-OptimizedComponent 2 Weight
Figure 3: Deformation component weights for each of the 9 original components (blue squares) are control points for the spline-based breathing model (red line). DRR images derived from the weights along this spline model are input to the network. Resulting inferred weights (black triangles) closely align with the model.

4.3 RCCT Phase Patient Data and Geometric Validation

Until this point, DRR images derived from 9 of the 10 original 4DCT volumes have not been used as input for the network for either training or previous testing. Rather than a synthetic spline model of respiratory phases, we now use the original CT volumes captured at 9 stages of the respiratory cycle. These CT volumes inherently contain deformations that have not been included in any training or evaluation to this point, as the rank-constrained motion estimation produces imperfect deformations to these CT images. No deformation is applied to these CT volumes as each volume represents this intrinsic respiratory deformation relative to the full-exhale anatomical state. Evaluation of our framework on these intrinsic deformations effectively projects the true deformation into our 2-dimensional motion subspace.

Exactly as in training dataset generation, DRR images through these (undeformed) volumes are calculated and preprocessed with variance equalization, normalization, histogram equalization, and downsampling. Evaluation of the network gives rank-optimized deformation component weights describing the intrinsic deformations that can be compared in the subspace of motion component weights with the weights directly recovered from original rank-constrained density motion estimation (Fig. 4).

These weights produced by the network are used to reconstruct a deformation field as a linear combination of the rank-optimized deformation component fields. We calculate and report the error as a difference from the original rank-constrained motion estimation deformation distance that is recovered by both our deep learning framework in Figure 5. Errors for the deformation recovered at each CT phase are also calculated against the deformations from rank-constrained motion estimation (Table 2).

Rank-Optimized Component 1 Weight

Rank-OptimizedComponent 2 Weight
Figure 4: Weights of rank-optimized deformation components recovered by network on real patient CT data (red triangles) align well with original deformation weights for rank-optimized components (black circles). The respiratory cycle proceeds clockwise around the loop.
Figure 5: Slice of error map of calculated deformation error recovered by the deep learning framework. Phase 0 (left) and 9 (right) are selected as representative examples. Both general and localized errors are similar to the axial resolution of the CT scan (3mm).
Phase Average Distance Error (mm) Maximum Distance Error (mm)
0 0.136 3.92
1 0.212 3.15
2 0.132 2.58
3 0.291 8.98
4 0.223 6.62
6 0.201 4.15
7 0.286 6.96
8 0.252 9.55
9 0.183 5.07
Table 2: Deformation distance error for each phase’s recovered deformation.

5 Discussion

In this paper, we have shown that estimation of anatomic position in the presence of breathing variability is possible with the combination of (1) rank-constrained density motion estimation for determination of motion components and (2) deep learning for subsequent identification of the weights of motion components in real-time. While this framework is extensible to dimensions higher than 2, the rank-constrained nature of the deformation fields produces accurate results using only two dimensions. This level of accuracy is a trade-off against the curse of dimensionality in both the dataset size and computational cost of training; however, using rank-constrained deformation components increases the accuracy of resulting deformations with the same number of dimensions.

The speed and accuracy attained by this framework is suitable for inclusion as a tumor position monitoring component of a conformal radiation therapy system. Determination of 3D tumor location from noninvasive 2D radiographic projections via deep learning instead of variational optimization approaches to the 2D-3D deformation determination problem provides real-time results for conformal radiation therapy for lung tumors.


This work was partially supported through research funding from the National Institute of Health (R01 CA169102 and R03 EB026132) and the Huntsman Cancer Institute. The authors are grateful for the support of NVIDIA Corporation by providing the GPU used for this research.


Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description