# Fast Predictive Simple Geodesic Regression

## Abstract

Deformable image registration and regression are important tasks in medical image analysis. However, they are computationally expensive, especially when analyzing large-scale datasets that contain thousands of images. Hence, cluster computing is typically used, making the approaches dependent on such computational infrastructure. Even larger computational resources are required as study sizes increase. This limits the use of deformable image registration and regression for clinical applications and as component algorithms for other image analysis approaches. We therefore propose using a fast predictive approach to perform image registrations. In particular, we employ these fast registration predictions to approximate a simplified geodesic regression model to capture longitudinal brain changes. The resulting method is orders of magnitude faster than the standard optimization-based regression model and hence facilitates large-scale analysis on a single graphics processing unit (GPU). We evaluate our results on 3D brain magnetic resonance images (MRI) from the ADNI datasets.

## 1 Introduction

Longitudinal image data provides us with a wealth of information to study aging processes, brain development and disease progression. Such studies, for example ADNI [?] and the Rotterdam study [?], involve analyzing thousands of images. In fact, even larger studies will be available in the near future. For example, the UK Biobank [?] targets on the order of 100,000 images once completed. With the number of images increasing, large-scale image analysis typically resorts to using compute clusters for parallel processing. While this is, in principle, a viable solution, increasingly larger compute clusters will become necessary for such studies. Alternatively, more efficient algorithms can reduce computational requirements, which then facilitates computations on individual computers or much smaller compute clusters, interactive (e.g., clinical) applications, efficient algorithm development, and use of these efficient algorithms as components in more sophisticated analysis approaches (which may use them as part of iterative processes).

Image registration is a key task in medical image analysis to study deformations between images. Building on image registration approaches, image regression models [?] have been developed to analyze deformation trends in longitudinal imaging studies. One such approach is geodesic regression (GR) [?], which (for images) builds on the large displacement diffeomorphic metric mapping model (LDDMM) [?]. In general, GR generalizes linear regression to Riemannian manifolds. When applied to longitudinal image data, it can compactly express spatial image transformations over time. However, the solution to the underlying optimization problem is computationally expensive. Hence, a simplified, approximate GR approach (SGR) has been proposed [?], which decouples the computation of the regression geodesic into pairwise image registrations. However, even such a simplified GR approach would require months of computation time on a single graphics processing unit (GPU) to process thousands of 3D image registrations for large-scale imaging studies such as ADNI [?]. The primary computational bottleneck of SGR is the numerical optimization required to compute the pairwise registrations.

Recently, efficient approaches have been proposed for deformable image registration [?]. In particular, for LDDMM, which is the basis of GR approaches for images, registrations can be dramatically sped up by either working with finite-dimensional Lie algebras [?] and frequency diffeomorphisms [?], or by fast predictive image registration (FPIR) [?]. FPIR predicts the initial conditions (specifically, the initial momentum) of LDDMM, which fully characterize the geodesic and the spatial transformation, using a learned patch-based deep regression model. Because numerical optimization of standard LDDMM registration is replaced by a single prediction step, followed by optional correction steps [?], FPIR is dramatically faster than optimization-based LDDMM without compromising registration accuracy, as measured on several registration benchmarks [?].

Besides FPIR, other predictive image registration approaches have been proposed. Dosovitskiy et al. [?] use a convolutional neural network (CNN) to directly predict optical flow. Liu et al. [?] use an encoder-decoder network to synthesize video frames. Schuster et al. [?] investigate strategies to improve optical flow prediction via a CNN. Cao et al. [?] use a sampling strategy and CNN regression to directly learn the mapping from moving and target image pairs to the final deformation field. Miao et al. [?] use CNN regression for 2D/3D rigid registration. Sokooti et al. [?] use CNNs to directly predict a 3D displacement vector field from input image pairs. An unsupervised approach to image registration was proposed by de Vos et al. [?]: here, the loss function is the image similarity measure between the images themselves, and the deformation is parameterized via a spatial transformer (which essentially amounts to a parameterized deformation model in image registration) that generates the sought-for displacement vector field. In [?], Hong et al. employ a low-dimensional, band-limited representation of velocity fields in Fourier space [?] to speed up SGR [?] for population-based image analysis.

In this work, we will build on FPIR, as it is a desirable approach for brain image registration for the following reasons: First, FPIR predicts the initial momentum of LDDMM and therefore inherits the theoretical properties of LDDMM. Consequently, FPIR results in diffeomorphic transformations, even though predictions are computed in a patch-by-patch manner; this cannot be guaranteed by most other prediction methods. Second, patch-wise prediction allows training the prediction models from a very small number of images, which contain a large number of patches. Third, by using a patch-wise approach, even high-resolution image volumes can be processed without running into memory issues on a GPU. Fourth, none of the existing predictive methods address longitudinal data. However, as both FPIR and SGR are based on LDDMM, they naturally integrate and hence result in our proposed fast predictive simple geodesic regression (FPSGR) approach.

Our contributions can be summarized as follows:

Predictive geodesic regression

We use a fast predictive registration approach for geodesic image regression. In contrast to [?], we specifically validate that our approach can indeed capture the frequently subtle deformation trends of longitudinal image data.

Large-scale dataset capability

Our predictive regression approach facilitates large-scale image regression within a short amount of time on a single GPU, instead of requiring months of computation time for standard optimization-based methods on a single computer, or on a compute cluster.

Accuracy

We assess the accuracy of FPSGR by (1) studying linear models of atrophy scores (which are derived from the nonlinear SGR model) over time, as well as (2) correlations between atrophy scores and various diagnostic groups.

Validation

We demonstrate the performance of FPSGR by analyzing images of the ADNI-1 / ADNI-2 datasets. For comparison, we also perform SGR using numerical optimization for the registrations, again on the complete ADNI-1 / ADNI-2 datasets.

This work is an extension of a recent conference paper [?]. In particular, all our experiments are now in 3D. We also added significantly more results to further explore the behavior of FPSGR in comparison to optimization-based SGR. In particular, we added (a) an example to visualize the performance of the regression models and associated quantitative comparisons (Section 4.1); (b) an analysis of local atrophy scores correlated with clinical variables (Section 4.3); (c) correlations within diagnostic groups (Section 4.3); (d) a comparison with pairwise registration (Section 4.4); and (e) experiments on extrapolation to unseen data (Sections 4.5 and 4.6).

Organization. The remainder of this article is organized as follows: Section 2 describes FPSGR, Section 3 discusses the experimental setup and the training of the prediction models. In Section 4, we present experimental results for 3D MR brain images. The paper concludes with a summary and an outlook on future work.

## 2 Fast predictive simple geodesic regression

Our fast predictive simple geodesic regression approach is a combination of two methods: First, fast predictive image registration (FPIR) and, second, integration of FPIR with simple geodesic regression (SGR). Both FPIR and SGR are based on the shooting formulation of LDDMM [?]; Fig. ? illustrates our overall approach. The individual components are described in the following.

### 2.1 LDDMM

Shooting-based LDDMM and geodesic regression minimize

$$E(I_0, m_0) = \langle m_0, K m_0 \rangle + \frac{1}{\sigma^2} \sum_{i=1}^{N} \mathrm{Sim}\left(I(t_i), Y_i\right), \quad \text{s.t.} \quad m_t + \mathrm{ad}^*_v m = 0, \quad I_t + \nabla I^\top v = 0, \quad m(0) = m_0, \quad I(0) = I_0, \quad m = L v,$$

where $I_0$ is the initial image (known for image-to-image registration and to be determined for geodesic regression), $m_0$ is the initial momentum, $K = L^{-1}$ is a smoothing operator that connects velocity, $v$, and momentum, $m$, as $v = Km$ and $m = Lv$, $\sigma > 0$ is a weight, $Y_i$ is the measured image at time $t_i$ (there will be only one such image for image-to-image registration, at $t = 1$), and $\mathrm{Sim}(A, B)$ denotes the image similarity measure between $A$ and $B$ (for example, the $L_2$ distance or the geodesic distance); $\mathrm{ad}^*_v$ is the dual of the negative Jacobi-Lie bracket of vector fields, $\mathrm{ad}_v w = -[v, w] = Dv\,w - Dw\,v$, and $D$ denotes the Jacobian. The deformation of the source image can be computed by solving $\Phi^{-1}_t + D\Phi^{-1}\, v = 0$, $\Phi^{-1}(0) = \mathrm{id}$, where $\mathrm{id}$ denotes the identity map.
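A minimal 1D sketch of the shooting equations above may help fix ideas: the smoothing operator $K$ is applied in Fourier space, the momentum is advanced with explicit Euler steps of the 1D EPDiff equation, and the image is advected semi-Lagrangially. This is a toy illustration under simplified assumptions (1D periodic domain, illustrative kernel parameters), not the 3D implementation used in the experiments.

```python
import numpy as np

n = 128
x = np.linspace(0.0, 1.0, n, endpoint=False)
k = 2.0 * np.pi * np.fft.fftfreq(n, d=1.0 / n)

def smooth(m, a=0.01, c=1.0):
    """v = K m, with K = L^{-1} applied in Fourier space (1D surrogate of L)."""
    return np.real(np.fft.ifft(np.fft.fft(m) / (a * k**2 + c)))

def dx(f):
    """Central difference on the periodic unit interval."""
    return (np.roll(f, -1) - np.roll(f, 1)) * (0.5 * n)

def shoot(I0, m0, steps=25, dt=0.02):
    """Forward integration of the 1D shooting equations:
    EPDiff  m_t + v m_x + 2 v_x m = 0   and   advection  I_t + v I_x = 0."""
    I, m = I0.copy(), m0.copy()
    for _ in range(steps):
        v = smooth(m)
        m = m - dt * (v * dx(m) + 2.0 * dx(v) * m)           # Euler step for m
        I = np.interp((x - dt * v) % 1.0, x, I, period=1.0)  # semi-Lagrangian step for I
    return I, m

I0 = np.exp(-((x - 0.5) ** 2) / 0.01)        # a bump "image"
m0 = 0.3 * np.exp(-((x - 0.5) ** 2) / 0.02)  # positive momentum pushes the bump right
I1, m1 = shoot(I0, m0)
```

Note that only the initial momentum $m_0$ is needed to generate the whole geodesic, which is exactly what FPIR predicts.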

### 2.2 FPIR

Fast predictive image registration [?] aims at predicting the initial momentum, $m_0$, between a source and a target image patch-by-patch. Specifically, we use a deep encoder-decoder network to predict the patch-wise momentum. As shown in Fig. ?, the inputs are pairs of patches taken at the same position from the source and target images, respectively. The two patches are processed by two parallel encoders, which learn features independently. The output is the predicted initial momentum in the $x$, $y$, and $z$ directions (the ground-truth momenta for training are obtained by numerical optimization on the training samples). Basically, the network is split into an encoder and a decoder part. An encoder consists of 2 blocks of three 3×3×3 convolutional layers with PReLU activations, followed by another 2×2×2 convolution+PReLU with a stride of two, serving as a “pooling” operation. The number of features in the first convolutional layer is 64 and increases to 128 in the second. In the decoder, three parallel decoders share the same input generated from the encoder. Each decoder is the inverse of the encoder except for using 3D transposed convolution layers with a stride of two to perform “unpooling”, and no non-linearity at the end. To speed up computations, we use patch pruning (i.e., for brain imaging, patches outside the brain are not predicted as the momentum is expected to be zero there) and a large pixel stride (e.g., 14) for the sliding window of the predicted patches.
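The patch-wise sliding window with stride and patch pruning can be sketched as follows. This is illustrative numpy code: the patch size of 15 is an example value (the stride of 14 is the one mentioned above), and the toy volumes are hypothetical.

```python
import numpy as np

def extract_patches(source, target, mask, patch=15, stride=14):
    """Collect source/target patch pairs on a sliding grid, skipping ("pruning")
    patches that lie entirely outside the brain mask."""
    pairs, coords = [], []
    X, Y, Z = source.shape
    for i in range(0, X - patch + 1, stride):
        for j in range(0, Y - patch + 1, stride):
            for k in range(0, Z - patch + 1, stride):
                sl = (slice(i, i + patch), slice(j, j + patch), slice(k, k + patch))
                if not mask[sl].any():  # patch pruning: momentum ~0 outside the brain
                    continue
                pairs.append(np.stack([source[sl], target[sl]]))  # 2-channel input
                coords.append((i, j, k))
    return np.array(pairs), coords

# toy volumes: a small spherical "brain" inside a 43^3 image
shape = (43, 43, 43)
grid = np.indices(shape)
mask = ((grid - 21) ** 2).sum(axis=0) < 12 ** 2
src = mask * 1.0
tgt = np.roll(src, 1, axis=0)
patches, coords = extract_patches(src, tgt, mask)
```

In the full pipeline each patch pair would be fed to the encoder-decoder network and the predicted momentum patches would be reassembled on the same grid.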

### 2.3 Correction network

We follow [?] and use a two-step approach to improve overall prediction accuracy. An additional correction step, i.e., a correction network, corrects the prediction of the initial prediction network. Figure 1 illustrates this two-step approach graphically. The correction network has the same structure as the prediction network. Only the inputs and outputs differ. For the prediction network, the inputs are the original moving image and the original target image; output is the predicted initial momentum. For the correction network, the inputs are the original moving image and the warped target image; the output is the momentum difference.

### 2.4 SGR

Determining the initial image, $I_0$, and the initial momentum, $m_0$, of the energy in Sec. 2.1 is computationally costly. However, in simple geodesic regression, the initial image is fixed to the first image $Y_0$ of a subject’s longitudinal image set (left-most part of Fig. ?). Furthermore, the similarity measure is chosen as the geodesic distance between images and approximated so that the geodesic regression problem can be solved by computing pair-wise image registrations with respect to the first image. Specifically, we define the quadratic distance between two images $A$ and $B$ as

$$d^2(A, B) = \langle m, K m \rangle,$$

where $m$ is the initial momentum of the unit-time geodesic connecting $A$ and $B$.

Assume we have an image $I$ at time $t$ as well as two images $A$ and $B$. Further, assume that the spatial transformation $\Phi_A$ maps $I$ to $A$ and $\Phi_B$ maps $I$ to $B$. Then $A = I \circ \Phi_A^{-1}$ and $B = I \circ \Phi_B^{-1}$. Furthermore, assume that $\Phi$ maps $A$ to $B$, i.e., $B = A \circ \Phi^{-1}$. Then $\Phi^{-1} = \Phi_A \circ \Phi_B^{-1}$. Assuming that the geodesic between $I$ and $A$ is parameterized by the initial velocity $v_A$ and between $I$ and $B$ by the initial velocity $v_B$, and that we travel between $I$ and $A$ in time $t_A$ (and similarly for $B$), we can rewrite the map between $A$ and $B$ based on the exponential map as

$$\Phi^{-1} = \mathrm{Exp}(t_A v_A) \circ \mathrm{Exp}(t_B v_B)^{-1} = \mathrm{Exp}(t_A v_A) \circ \mathrm{Exp}(-t_B v_B),$$

which can be approximated to first order as

$$\Phi^{-1} \approx \mathrm{Exp}(t_A v_A - t_B v_B).$$

Hence, the squared geodesic distance between the two images can be approximated as

$$d^2(A, B) \approx \langle t_A m_A - t_B m_B, K (t_A m_A - t_B m_B) \rangle,$$

where $m_A = L v_A$ and $m_B = L v_B$. Hence, the energy of Sec. 2.1 becomes

$$E(m_0) = \langle m_0, K m_0 \rangle + \frac{1}{\sigma^2} \sum_{i=1}^{N} \langle t_i m_0 - t_i m_i, K (t_i m_0 - t_i m_i) \rangle,$$

where $m_0$ is the sought-for initial momentum of the regression geodesic and $\{m_i\}$ are the initial momenta corresponding to the geodesics connecting $Y_0$ (the starting image of the geodesic) and the measurements $Y_i$ in time $t_i$. Differentiating this energy w.r.t. $m_0$ results in

$$\nabla_{m_0} E = 2 K m_0 + \frac{2}{\sigma^2} \sum_{i=1}^{N} t_i K (t_i m_0 - t_i m_i) = 0.$$

Thus,

$$m_0 = \frac{\sum_{i=1}^{N} t_i^2 m_i}{\sigma^2 + \sum_{i=1}^{N} t_i^2}.$$

In practice, $\sigma^2$ is very small and can thus be omitted. Furthermore, $m_i$ is obtained by either registering $Y_0$ to $Y_i$ in unit time or, as in our FPSGR approach, by predicting the momenta via FPIR, denoted as $\hat m_i$. As the first-order distance approximation above was derived assuming that images are transformed into each other in time $t_i$ instead of unit time, the obtained unit-time predicted momenta correspond in fact to the approximation $\hat m_i \approx t_i m_i$. Finally, we obtain the approximated optimal $m_0$ of the energy functional, for a fixed $I_0 = Y_0$, as

$$m_0 \approx \frac{\sum_{i=1}^{N} t_i \hat m_i}{\sum_{i=1}^{N} t_i^2}.$$
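As an illustration, the resulting closed-form SGR estimate (with the small $\sigma^2$ term omitted) is simply a weighted average of the predicted unit-time momenta, analogous to linear regression through the origin. A minimal numpy sketch, with illustrative array names:

```python
import numpy as np

def sgr_momentum(times, mhat):
    """Closed-form SGR estimate: m0 ~ (sum_i t_i * mhat_i) / (sum_i t_i^2),
    where mhat_i are the predicted unit-time momenta and the small sigma^2
    term is omitted, as in the text."""
    times = np.asarray(times, dtype=float)
    num = sum(t * m for t, m in zip(times, mhat))
    return num / np.sum(times ** 2)

# toy check: if every follow-up momentum is exactly t_i * m_true,
# the regression recovers m_true
m_true = np.array([0.2, -0.1, 0.05])
ts = [0.5, 1.0, 2.0]
mhat = [t * m_true for t in ts]
m0 = sgr_momentum(ts, mhat)  # equals m_true
```

This is why SGR is cheap once the pairwise momenta are available: the regression geodesic costs a single weighted average plus one final shooting step.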

## 3 Setup / Training

All our experiments use 3D images from the ADNI dataset, which consists of 6471 3D MR brain images. In particular, ADNI-1 contains 3479 images from 833 subjects and ADNI-2 contains 2992 images from 823 subjects. Images belong to various diagnostic categories, which we will discuss later.

We perform the following two types of studies:

Registration

We assess our hypothesis that training FPIR on longitudinal data for longitudinal registrations is preferred over training using cross-sectional data. Vice versa, training FPIR on cross-sectional data for cross-sectional registrations is preferred over training using longitudinal data. Comparisons are with respect to registration results obtained by numerical optimization (i.e., LDDMM).

Regression

As for regression, we compare linear models fitted to atrophy scores over time, where scores are either obtained from FPSGR or optimization-based SGR. Additionally, we study correlations between atrophy scores and diagnostic groups. Our hypothesis is that FPSGR is accurate enough to achieve comparable performance to optimization-based SGR, at much lower computational cost, in both situations.

### 3.1 Training of the prediction models

We use a randomly selected set of 120 patients’ MRI images from ADNI for training the prediction models and to test the performance of FPIR. We use all of the ADNI data for our regression experiments.
Training for registration. We randomly selected 120 subjects from ADNI-1 and registered their baseline images to their 24-month follow-up images. We used the first 100 subjects for training and the remaining 20 subjects for testing. For longitudinal training, we registered the baseline image of a subject to the subject’s 24-month image. For cross-sectional training, we registered a subject’s baseline image to another subject’s 24-month image. To assess the performance of prediction models trained on these two types of paired data, we (1) perform the same type of registrations on the held-out 20 subjects and (2) compare the 2-norm of the deformation error computed from the output of the prediction models with respect to the result obtained by numerical optimization of LDDMM (which serves as the “ground-truth”). Table 1 shows the results, which confirm our hypothesis that training the prediction model with longitudinal registration cases is preferred for longitudinal registration over training with cross-sectional data. The deformation error is very small for longitudinal training / testing, which provides strong evidence that the predictive method exhibits performance comparable to the (costly) optimization-based LDDMM. Another interpretation of these results is that it is beneficial to train a prediction model with deformations that are to be expected, i.e., relatively small deformations for longitudinal registrations and larger deformations for cross-sectional registrations. As we are interested in longitudinal registrations for the ADNI data, we only train our 3D models using longitudinal registrations in the following.
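The 2-norm deformation-error comparison used above can be computed as follows; this is a sketch, and `phi_pred` / `phi_opt` are hypothetical names for the predicted and optimization-based deformation maps.

```python
import numpy as np

def deformation_error(phi_pred, phi_opt, mask=None):
    """Mean voxel-wise 2-norm of the difference between a predicted and an
    optimization-based deformation map, both of shape (3, X, Y, Z)."""
    err = np.linalg.norm(phi_pred - phi_opt, axis=0)  # per-voxel 2-norm
    if mask is not None:
        err = err[mask]
    return err.mean()

# toy example: a constant 1-voxel offset in x gives a mean error of 1
phi_opt = np.zeros((3, 4, 4, 4))
phi_pred = phi_opt.copy()
phi_pred[0] += 1.0
err = deformation_error(phi_pred, phi_opt)  # -> 1.0
```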

Training for regression. The ADNI-1 dataset contains 228 normal controls, 257 subjects with mild cognitive impairment (MCI), 149 with late mild cognitive impairment (LMCI), as well as 199 subjects suffering from Alzheimer’s disease (AD). We randomly picked roughly 1/6 of the patients from each diagnostic category, i.e., 38 normal controls, 43 MCI, 25 LMCI, and 33 AD subjects; this results in 139 subjects overall for training in ADNI-1. The baseline image of each subject was registered to all later time-points within the same subject. To maintain the diagnostic ratio, we picked (out of all registrations) 45 registrations from the normal group, 50 from the MCI group, 30 from the LMCI group, and 40 from the AD group, resulting in 165 longitudinal registration cases for training.

The same strategy was applied to ADNI-2. In detail, ADNI-2 contains 200 normal controls, 111 subjects with significant memory complaint (SMC), 182 subjects with early mild cognitive impairment (EMCI), 175 with late mild cognitive impairment (LMCI), and 155 subjects with Alzheimer’s disease (AD). We picked 150 subjects and 140 longitudinal registrations, consisting of 35 registrations from the control group, 20 registrations from the SMC group, 30 registrations from the EMCI group, 30 registrations from the LMCI group, and 25 registrations from the AD group. Note that there are fewer registrations than subjects (140 vs. 150) in this setup, as our priority is to maintain the overall diagnostic ratio.

For both ADNI-1 and ADNI-2, the remaining 5/6 of the data is used for testing. We trained four prediction models and their four corresponding correction models, i.e., eight trained networks in total, listed in Table ?. We also note that the training sets within ADNI-1 and ADNI-2, respectively, were not overlapping.

### 3.2 Parameter selection

We use the regularization kernel $K = L^{-1}$, where

$$L = -a \nabla^2 - b\, \nabla (\nabla \cdot) + c,$$

with fixed positive parameters $a$, $b$, and $c$. The weight $\sigma$ of the similarity term in the energy of Sec. 2.1 is likewise fixed. We train our network (using ADAM [?]) over 10 epochs with a fixed learning rate.

### 3.3 Efficiency

Once trained, the prediction models allow fast computation of registrations. We use a TITAN X (Pascal) GPU and PyTorch for our implementation of FPIR. For the 3D ADNI-1 MR images, FPSGR took about one day to predict 2646 pairwise registrations (i.e., about 25 s per prediction) and to compute the regression result. Optimization-based LDDMM would require 40 days of runtime. The runtime of FPIR on ADNI-2 is identical to ADNI-1, as the images have the same spatial dimension.
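The reported runtimes can be sanity-checked with simple arithmetic (numbers taken from the text):

```python
# 2646 pairwise predictions at ~25 s each, vs. 40 days for optimization-based LDDMM
pred_hours = 2646 * 25 / 3600.0     # ~18.4 h, i.e., about one day
speedup = (40 * 24.0) / pred_hours  # ~52x, consistent with "more than 40x"
```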

Compared to the state-of-the-art fast geodesic regression model [?], FPSGR is also at least twice as fast. The model in [?] achieves its reported speed-up compared with SGR [?] for the same setting (parallel computing with the same number of cores). In our case, we achieve a more than 40 times speed-up compared with SGR for the same setting (a single GPU).

## 4 Experimental results for 3D ADNI data

For our experiments, we created 10 different (dataset, registration approach) combinations, each combination specifically designed to assess certain properties of our proposed strategy. These combinations are described next.

1. All subjects from the ADNI-1 dataset in combination with optimization-based LDDMM.

2. Two subgroups of ADNI-1 (i.e., different training data portions) in combination with FPSGR without a correction network.

3. The same two subgroups as in 2), but in combination with FPSGR with a correction network.

4. The same five groups of 1)-3), but for ADNI-2.

Our general hypothesis is that the prediction models (for ADNI-1/2) show similar performance to optimization-based LDDMM and that using the correction network for the predictions improves results. To assess differences, we compare the resulting deformations. Specifically, for every deformation produced by the different approaches, we compute its Jacobian determinant (JD). The JDs are then warped to a common coordinate system for the entire ADNI dataset using existing non-linear deformations from [?]. Each such spatially normalized JD is then averaged within a region where the rate of atrophy is significantly associated with Alzheimer’s disease (AD), i.e., within a statistical region of interest (stat-ROI) (see Figure 2). Specifically, we quantify atrophy as

$$a = 1 - \frac{1}{|\Omega|} \sum_{x \in \Omega} \det\!\left(D \Phi^{-1}(x)\right),$$

where $\det$ denotes the determinant and $|\Omega|$ the cardinality/size of a set; $\Omega$ is an area in the temporal lobes which was determined in prior studies [?] to be significantly associated with accelerated atrophy in Alzheimer’s disease. The resulting scalar value is an estimate of the relative volume change experienced by that region between the baseline and a follow-up image. Hence, its sign is positive when the region has lost volume over time and negative if the region has gained volume over time.
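Computing such an atrophy score from a spatially normalized Jacobian-determinant map and the stat-ROI mask is then straightforward. A sketch, assuming the score is one minus the mean Jacobian determinant inside the stat-ROI (consistent with the sign convention described above); array names are illustrative.

```python
import numpy as np

def atrophy_score(jac_det, roi_mask):
    """Atrophy = 1 - mean Jacobian determinant inside the stat-ROI:
    positive when the region lost volume, negative when it gained volume."""
    return 1.0 - jac_det[roi_mask].mean()

# toy example: uniform 10% volume loss inside the ROI
jd = np.ones((8, 8, 8))
roi = np.zeros_like(jd, dtype=bool)
roi[2:6, 2:6, 2:6] = True
jd[roi] = 0.9
score = atrophy_score(jd, roi)  # -> 0.1, i.e., 10% volume loss
```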

We limited our experiments to the applications in [?], wherein nonlinear registration/regression is used to quantify atrophy within regions known to be associated to varying degrees with AD, mild cognitive impairment (MCI, including LMCI), and normal ageing (NC: normal control) in an elderly population. These are the diagnostic groups for ADNI-1. For ADNI-2, there are also three diagnostic categories: normal ageing (including SMC), mild cognitive impairment (including EMCI and LMCI), and AD.

Specifically, we investigate the following six questions:

• Can the prediction models for regression qualitatively capture similar trends to the regression model obtained by numerical optimization? (Section 4.1)

• Are atrophy measurements derived from FPSGR biased to overestimate or underestimate volume changes? (Section 4.2)

• Are FPSGR atrophy measurements consistent with those derived from deformations via numerical optimization (LDDMM), which produced the training dataset? (Section 4.3)

• Are regression results more stable and hence capture trends better than pairwise registrations? (Section 4.4)

• Is the predictive power of the regression models strong enough to forecast deformations for unseen future timepoints? (Section 4.5)

• Do the prediction results capture expected trends in deformation? (Section 4.6)

If these experiments resolve favorably, then the substantially improved computational efficiency of FPSGR justifies its use for large-scale imaging studies. Tables 2 and 3 show the distributions of the prediction cases per time-point and the diagnostic groups in ADNI-1 and ADNI-2, respectively.

### 4.1 Regression results

Table 1 indicates that FPIR can predict deformation fields similar to the ones obtained using optimization-based LDDMM, even for the subtle changes seen in longitudinal imaging data. However, it remains to be seen how a predictive model performs for image regression. Figure 3 shows an exemplary regression result. In this specific case, large changes can be observed around the ventricles. To illustrate differences between the methods, Figure 3 shows regression results based on optimization-based LDDMM, for FPSGR without a correction network, and for FPSGR with a correction network. All three methods successfully capture the expanding ventricles and generally capture the image changes. Both FPSGR methods show results that are highly similar to SGR using optimization-based LDDMM. Hence, FPSGR is useful for longitudinal image regression. To further quantify the regression accuracy, we compute the overlay error between the measured images and the images on the geodesic as

$$e(t_i) = \frac{1}{|B|} \sum_{x \in B} \left| I(t_i)(x) - Y_i(x) \right|,$$

where $B$ is the brain area, $I(t_i)$ is the regressed image at time $t_i$, and $Y_i$ is the measured image at time $t_i$. Table 4 shows the overlay error for the population of 100 subjects which includes all diagnostic groups in ADNI-1. Both FPSGR methods obtain results comparable with optimization-based LDDMM. This justifies the use of the proposed methods. The correction network generally increases the prediction accuracy over using the prediction network only.
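A possible implementation of the overlay error, assuming it is the mean absolute intensity difference over the brain area $B$ (array names are illustrative):

```python
import numpy as np

def overlay_error(regressed, measured, brain_mask):
    """Mean absolute intensity difference between the regressed image I(t_i)
    and the measured image Y_i, restricted to the brain area B."""
    diff = np.abs(regressed - measured)
    return diff[brain_mask].mean()

# toy example: one erroneous voxel out of a 9-voxel "brain"
Y = np.zeros((5, 5)); Y[1:4, 1:4] = 1.0
I = Y.copy(); I[2, 2] = 0.8   # small regression error at one voxel
B = Y > 0
e = overlay_error(I, Y, B)    # -> 0.2 / 9
```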

### 4.2 Bias

Estimates of atrophy are susceptible to bias [?]. To quantitatively assess this potential bias, we separately considered different diagnostic groups. Specifically, we considered six diagnostic change groups in our experiments: (1) NC for all time points (NC-NC), (2) starting with NC and changing to MCI or AD at a later time point (NC-MCI), (3) MCI for all time points (MCI-MCI), (4) starting with MCI and reverting to NC at later time points (MCI-NC), (5) starting with MCI and changing to AD at later time points (MCI-AD), and (6) AD for all time points (AD-AD). In particular, we follow [?] and fit a straight line (i.e., linear regression) through all atrophy measurements over time, conditioned on each diagnostic change category. The intercept term is an estimate of the atrophy one would measure when registering two scans acquired on the same day; hence it should be near zero and its 95% confidence interval should contain zero. Quantitatively, Table 5 lists the slopes, intercepts, and 95% confidence intervals for all ten groups of ADNI-1 and ADNI-2, respectively. LDDMM-1 and LDDMM-2 denote the optimization-based results split into the same testing groups used for Pred-1 and Pred-2 to allow for a direct comparison. All of the results show intercepts that are near zero relative to the range of changes observed, and all prediction intercept confidence intervals contain zero. For all diagnostic change groups, the prediction and prediction+correction models exhibit more stable results than the optimization-based LDDMM method, as indicated by the tighter confidence intervals. Furthermore, all slopes are positive, indicating average volume loss over time. This is consistent with expectations for an aging and neuro-degenerative population. The slopes capture increasing atrophy with disease severity. In ADNI-1/ADNI-2, we expect slopes to increase with disease severity, and all six experimental groups (i.e., LDDMM-1, Pred-1, Pred+Corr-1, LDDMM-2, Pred-2, and Pred+Corr-2) are generally consistent with this expectation.
Exceptions occur in ADNI-2 for the NC-MCI and MCI-NC cases. As the number of subjects involved is relatively small, i.e., fewer than 20, compared with the other cases (roughly 100), one may speculate that this observation is caused by the limited number of data points for NC-MCI and MCI-NC, as shown in the #data column of Table 5. However, the behavior within each starting diagnostic category is consistent, i.e., within the NC group and within the MCI group, progression to a more severe diagnosis is associated with a larger slope. Hence, all six groups’ slope results in ADNI-1/ADNI-2 are generally consistent with our expectation (and also consistent with the results in [?]). The slope estimated from the prediction+correction results is larger than the slope estimated from the prediction model results and closer to the slope obtained from the optimization-based LDDMM results. This indicates that the correction network can improve prediction accuracy. Figure 5 shows linear regression results for the estimated atrophy scores in ADNI-1/2 for the Pred+Corr-1 model. Both the data points themselves (i.e., the atrophy scores) and kernel density estimates for the linear trends for each subject are shown. These results are consistent with the results of Table 5 discussed above. We conclude that (1) neither LDDMM optimization nor FPSGR produced deformations with a significant bias to overestimate or underestimate volume change; (2) a linear model of atrophy scores generated by FPSGR can capture intrinsic volume change (i.e., slope) among different diagnostic change groups. Note that our LDDMM optimization results and the prediction results show the same trends. Further, they are directly comparable as the results are based on the same test images (also for the atrophy measurements).
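The linear bias check described above (slope, intercept, and a 95% confidence interval for the intercept) can be sketched as follows. This assumes a standard OLS fit with a normal-approximation interval; the exact CI construction used in the study is not specified here, and all data below are synthetic.

```python
import numpy as np

def fit_with_intercept_ci(t, a):
    """OLS fit a ~ slope*t + intercept, with a normal-approximation
    95% confidence interval for the intercept (bias check: should contain 0)."""
    t, a = np.asarray(t, float), np.asarray(a, float)
    X = np.column_stack([t, np.ones_like(t)])
    coef = np.linalg.lstsq(X, a, rcond=None)[0]
    slope, intercept = coef
    resid = a - X @ coef
    s2 = resid @ resid / (len(t) - 2)         # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)         # parameter covariance
    half = 1.96 * np.sqrt(cov[1, 1])          # normal approximation, 95%
    return slope, intercept, (intercept - half, intercept + half)

# synthetic atrophy scores: 2% loss per year, zero intercept, small noise
rng = np.random.default_rng(0)
t = np.tile([0.5, 1.0, 1.5, 2.0], 25)         # follow-up times in years
a = 0.02 * t + rng.normal(0, 0.002, t.size)
slope, intercept, ci = fit_with_intercept_ci(t, a)
```

For unbiased measurements the estimated intercept should be close to zero and its confidence interval should contain zero, as discussed above.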

### 4.3 Atrophy

Atrophy estimates have also been shown to correlate with clinical variables [?]. To quantify this effect, we computed the Spearman rank-order correlation between our atrophy estimates and the diagnostic groups (NC = 0, MCI = 1, AD = 2), and also between our atrophy estimates and the scores of the mini-mental state exam (MMSE). We applied the Benjamini-Hochberg procedure [?] to all the correlation results in this paper to reduce the false discovery rate for multiple comparisons. The overall false discovery rate was set to 0.01, which resulted in an effective significance level of 0.0093. Detailed results can be found in Table 6 and Figure 6, respectively. In detail, for ADNI-1/2, we randomly selected 200 cases from each diagnostic category at each month and calculated the Spearman rank-order correlation. Figure 6 shows the results for 50 repetitions. We observe comparable median correlations for all four prediction models, for both MMSE and diagnostic category, and slightly stronger median correlations for all four prediction+correction models. Previous studies reported similarly strong Pearson correlations between comparable atrophy estimates and clinical variables for 100 subjects [?]. Our two optimization-based LDDMM results achieve median correlations very similar to those of the prediction+correction models. In general, the prediction+correction FPSGR models outperform the models using only the prediction network. Further, using the correction network, FPSGR achieved comparable and sometimes even slightly better performance compared to the optimization-based LDDMM SGR method; see Table 6 for additional quantitative results. Specifically, FPSGR using the prediction+correction network performs best in 8 out of 18 comparisons for MMSE and in 12 out of 20 comparisons for diagnostic group.
In the cases where FPSGR with the prediction+correction network did not perform best, its difference to the best method was generally very small. In general, FPSGR using the correction network performs better than FPSGR without the correction network. To check for statistical differences in the performance of FPSGR, we use a paired t-test. Table 7 shows the resulting p-values for the three methods: optimization-based SGR (i.e., LDDMM), FPSGR without correction network (i.e., Pred), and FPSGR with correction network (i.e., Pred+Corr). For correlations with both MMSE and diagnostic category (DX), FPSGR with the correction network shows significantly better performance than LDDMM and FPSGR without the correction network, which justifies the use of the FPSGR method. In summary, FPSGR captures correlations between atrophy and clinical measures well.
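The two statistical tools used above, the Spearman rank-order correlation and the Benjamini-Hochberg procedure, can be implemented from scratch; a self-contained numpy sketch of the standard definitions:

```python
import numpy as np

def rankdata(x):
    """Average ranks (ties share their mean rank), 1-based."""
    x = np.asarray(x, float)
    order = np.argsort(x, kind="stable")
    ranks = np.empty(len(x))
    i = 0
    while i < len(x):
        j = i
        while j + 1 < len(x) and x[order[j + 1]] == x[order[i]]:
            j += 1
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0  # average rank of the tie run
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman rank-order correlation = Pearson correlation of the ranks."""
    return np.corrcoef(rankdata(x), rankdata(y))[0, 1]

def benjamini_hochberg(pvals, alpha=0.01):
    """Boolean rejection mask controlling the false discovery rate at alpha."""
    p = np.asarray(pvals, float)
    order = np.argsort(p)
    thresh = alpha * np.arange(1, len(p) + 1) / len(p)
    below = p[order] <= thresh
    reject = np.zeros(len(p), dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()  # largest rank passing its threshold
        reject[order[:k + 1]] = True
    return reject
```

In the experiments above, the correlations would be computed between atrophy scores and MMSE (or DX), and the resulting p-values passed through `benjamini_hochberg` with an overall FDR of 0.01.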

To further explore the correlations of atrophy with MMSE scores, we visualize them separated by diagnostic group where the diagnosis did not change (i.e., NC-NC, MCI-MCI, AD-AD) in Figure 7. For the ADNI-1 dataset, we observe (as expected) very low correlations for the normal diagnostic group (with no clear trend), and much stronger correlations for the MCI and AD groups. MCI and AD also exhibit increasingly stronger correlations with time. In the case of ADNI-2, the MCI group shows modest correlations, which remain consistent across time. Correlations are relatively low for the normal groups. The AD groups show increasingly strong correlations over time. In contrast to ADNI-1, ADNI-2 focuses mainly on earlier stages of the diagnostic groups [?]. Hence, the deformations in ADNI-2 are generally smaller than in ADNI-1. This may explain why the NC and MCI diagnostic groups show consistent correlation values over time (instead of stronger correlations, as for AD in ADNI-2 or the MCI and AD groups in ADNI-1).

To address the question of how stat-ROI-specific measures behave over time, we explore how atrophy locally (i.e., voxel-by-voxel) correlates with MMSE. The local atrophy at a voxel $x$ is defined via the Jacobian determinant of the deformation $\Phi$ obtained from the regression geodesic, i.e.,

$$a(x) = 1 - \det\big(D\Phi(x)\big).$$

I.e., each voxel in a stat-ROI has an associated atrophy score. Figure 8 shows kernel density estimates of the highest 10% of local correlations in a violin plot. For the ADNI-1 MCI and AD groups, a clear shift toward stronger correlations can be observed over time, consistent with the boxplots of Figure 7. This indicates the progression of the disease. Correlations for the normal groups in ADNI-1/2 are mostly centered around a modest value (as expected). In ADNI-2, only the AD diagnostic group shows a shift toward stronger correlations over time; all other diagnostic groups show a relatively consistent distribution over time. This is also consistent with Figure 7.
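Selecting the highest 10% of local correlations can be sketched as follows; the voxel-wise correlation values here are synthetic placeholders for the per-voxel Spearman correlations with MMSE.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical absolute voxel-wise correlations with MMSE inside a
# stat-ROI (5000 voxels).
local_corr = np.abs(rng.normal(0.2, 0.1, size=5000))

# Keep the highest 10% of local correlations, as used for the
# violin plots (kernel density estimates of this subset).
cutoff = np.quantile(local_corr, 0.9)
top10 = local_corr[local_corr >= cutoff]

print(top10.size, round(cutoff, 3))
```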

### 4.4Justification of SGR

For simple geodesic regression to be a useful model, it should outperform pairwise image registration. The main conceptual difference is that the regression model recovers an average trend based on multiple image time-points, i.e., the resulting regression geodesic is a compromise between all the measurements. In contrast, for pairwise image registration (which can be seen as a trivial case of geodesic regression with only two images), the deformation will in general be able to match the target image well. However, just as in linear regression, this may accentuate the effects of noise. In both setups, images can be interpolated or extrapolated based on the estimated geodesic.
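The analogy to linear regression can be made concrete with a 1D toy example (purely synthetic numbers): a least-squares fit over all time-points estimates the underlying trend with lower variance than a "pairwise" estimate obtained from the first and last measurement only.

```python
import numpy as np

rng = np.random.default_rng(2)

t = np.array([0.0, 6.0, 12.0, 18.0, 24.0])  # acquisition times in months
true_slope = -0.5                           # hypothetical atrophy rate
trials = 5000
reg_err, pair_err = [], []
for _ in range(trials):
    y = true_slope * t + rng.normal(0, 1.0, t.size)  # noisy measurements
    # Regression over all time-points (analogue of SGR over a time-series):
    slope_reg = np.polyfit(t, y, 1)[0]
    # Pairwise estimate from the first and last measurement only
    # (analogue of a single pairwise registration):
    slope_pair = (y[-1] - y[0]) / (t[-1] - t[0])
    reg_err.append((slope_reg - true_slope) ** 2)
    pair_err.append((slope_pair - true_slope) ** 2)

# The regression estimate is less sensitive to noise on average.
print(np.mean(reg_err) < np.mean(pair_err))
```

This mirrors the closed-form variances: the OLS slope variance is σ²/Σ(tᵢ-t̄)², which for these time-points is smaller than the two-point variance 2σ²/(t₄-t₀)².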

Tables 8 and 9 justify the use of SGR. Specifically, Table 8 shows linear regression results of atrophy measures over time as obtained via SGR (i.e., using an SGR fit over all time-points followed by atrophy computations based on the deformations of the regression geodesic) compared with atrophy measures obtained by pairwise registration. For both the ADNI-1 and the ADNI-2 datasets, SGR outperforms the pairwise registration approach in two respects: (1) the estimated intercept of SGR is generally closer to zero than for the pairwise method, and its 95% confidence interval is narrower; (2) 11 out of 24 of the 95% confidence intervals of the pairwise method show a significant bias toward either overestimating or underestimating volume change, while none of the SGR results show such a bias. Table 9 compares the correlations between atrophy and clinical measures (MMSE and diagnostic category) for SGR and the pairwise approach. SGR performs better than the pairwise approach in 13 out of 18 cases for MMSE and in 15 out of 20 cases for the diagnostic category. Furthermore, when the pairwise method is better than SGR, the difference is much smaller than the differences observed in the cases where SGR is better than the pairwise method. Also note that the pairwise method shows better performance in later months than in earlier months. This could, for example, be because the deformations are larger for later time-points and hence the registration result becomes more stable, or because SGR is also heavily influenced by the last time-point. To assess these observations statistically, we used a Shapiro-Wilk normality test and a Wilcoxon signed-rank test. From Table ? we see that we can reject the null hypothesis of normality; hence, a paired t-test is not appropriate. As an alternative, we conducted a Wilcoxon signed-rank test to compare the SGR prediction model and the pairwise prediction model. Table ? shows that the SGR prediction model is statistically significantly better than the pairwise prediction model. Based on the above points, we conclude that SGR is more stable over time than the pairwise method and in general also results in stronger correlations.
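The testing procedure (normality check first, then a non-parametric paired test) can be sketched with SciPy; the paired correlation samples below are synthetic stand-ins for the per-repetition correlation values of the two models.

```python
import numpy as np
from scipy.stats import shapiro, wilcoxon

rng = np.random.default_rng(3)

# Hypothetical paired correlation values over 50 repetitions:
# the pairwise model is constructed to be slightly weaker than SGR.
sgr = rng.normal(0.55, 0.05, 50)
pairwise = sgr - np.abs(rng.normal(0.05, 0.02, 50))

diff = sgr - pairwise

# Shapiro-Wilk: test normality of the paired differences. If rejected,
# a paired t-test is not appropriate.
_, p_normal = shapiro(diff)

# Wilcoxon signed-rank test (one-sided): is SGR better than pairwise?
_, p_wilcoxon = wilcoxon(sgr, pairwise, alternative="greater")

print(p_wilcoxon < 0.05)
```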

### 4.5Forecasting

Another interesting question for SGR, and for geodesic regression in general, is the suitability of the model for the data. To address this question, we evaluate whether SGR can forecast unseen future time-points. Specifically, we consider this question in two different scenarios:

• Extrapolate-clinical: Can we extrapolate the SGR results into the future (to time-points that do not exist in the ADNI image dataset, but do exist for the clinical data) while still obtaining strong correlations?

• Extrapolate-image: How well can correlations between atrophy and clinical measures be predicted for a time-point when we do or do not use image data at that very time-point? We artificially leave out image measurements so that we can compare the prediction results to the results obtained when the image measurement is available.

For both scenarios we use two different forecasting approaches. In the first approach (Forecast), we simply compute the SGR result from the available image time-points and then extrapolate along the resulting regression geodesic to the desired time-point in the future. In the second approach (Replace), we artificially impute the missing image time-points by replacing them with the image at the closest measured time-point. For example, if we have images at 6, 12, and 18 months, but we want to forecast at 24 months, we use the 18-month image as the imputed 24-month image and then perform SGR on the 6-, 12-, 18-, and imputed 24-month images. We then obtain the deformation at 24 months from the SGR result.
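In a 1D toy setting (synthetic values, with linear regression standing in for geodesic regression), the two strategies can be sketched as follows:

```python
import numpy as np

rng = np.random.default_rng(4)

months = np.array([6.0, 12.0, 18.0])                # observed time-points
y = -0.4 * months + rng.normal(0, 0.5, months.size) # noisy atrophy proxy

target = 24.0  # future time-point to forecast

# Forecast: fit on the available time-points only, then extrapolate
# the fitted trend to the target time-point.
slope, intercept = np.polyfit(months, y, 1)
forecast = slope * target + intercept

# Replace: impute the missing 24-month value with the closest measured
# one (here the 18-month value), then refit including the imputed point.
months_r = np.append(months, target)
y_r = np.append(y, y[-1])
slope_r, intercept_r = np.polyfit(months_r, y_r, 1)
replace = slope_r * target + intercept_r

print(round(forecast, 2), round(replace, 2))
```

Note how the imputed point flattens the fitted trend in the Replace variant, since the last measured value is carried forward unchanged.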

Table 10 shows correlations between atrophy and the clinical measures for the Forecast results at 60, 72, and 84 months. The resulting correlations of atrophy with diagnostic category are all above 0.3 (or below -0.3). Furthermore, the Forecast correlations show a downward trend with respect to time, which means that predictions for the "far-away" future are not as accurate as for the "near" future. On the other hand, SGR using the 6-month to 48-month time-points results in correlations around -0.5 for MMSE and 0.5 for DX on average. Hence, the correlation with the diagnostic category is consistent with that at 60 months. In other words, using 6-month to 48-month data, our prediction model can predict accurately up to 60 months. Our prediction+correction network performs as well as, and sometimes even slightly better than, SGR using optimization-based LDDMM. Figure 10 shows that these forecasting results capture the trends of the changes in the temporal lobes near the hippocampus and of the changes in the ventricles.

Table 11 and Figure 9 show Forecast and Replace results for correlations between atrophy and clinical measures in comparison to using all images. Specifically, for the Forecast and Replace results we did not use the available images at 36 and 48 months, so that we could compare against the results obtained when using these images. If FPSGR is a good model, it should result in correlations as close as possible to those obtained using all images. The Forecast correlations are only slightly weaker (0.02 to 0.05 lower) than the original correlations using all images, illustrating that FPSGR can approximately forecast future changes.

The overall correlations in Table 11 show that the Replace group performs better than the Forecast group. We are also particularly interested in the prediction of MCI converters, namely, MCI to NC, MCI to MCI, and MCI to AD. The boxplots in Figure 9 show the correlations for such predictions. The Replace group in Figure 9 shows slightly worse correlation performance than the Forecast group for ADNI-1 Pred-1 and consistent performance for ADNI-1 Pred-2. Hence, SGR on longitudinal image data can achieve good forecasting results for MCI converters.

Thus, both the Extrapolate-clinical and the Extrapolate-image experiments justify the use of FPSGR for predicting near-future longitudinal trends, especially for MCI converters.

### 4.6Jacobian Determinant (JD)

The average JD images qualitatively agree with prior results [?]: the severity of volume change increases with the severity of diagnosis and with time. Change is most substantial in the temporal lobes near the hippocampus (see Figure 10). In Figure 10, 6 to 48 months are existing data points; 60 to 84 months are forecast results. Blue indicates volume loss; red indicates expansion. The results are consistent with expectations: volume loss in the temporal lobes increases with time and severity of diagnosis, and volume expansion around the ventricles / cerebrospinal fluid increases with time and severity of diagnosis. The forecast results capture visually sensible volume loss and expansion over time, qualitatively illustrating the performance of our method.
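As a sketch, a voxel-wise JD map can be computed from a deformation field by finite differences; the `jacobian_determinant` helper below is an illustrative implementation (unit grid spacing assumed), not the code used in the paper.

```python
import numpy as np

def jacobian_determinant(phi):
    """Voxel-wise Jacobian determinant of a 3D deformation field.

    phi: array of shape (3, X, Y, Z), where phi[i] holds the i-th
    coordinate of the deformed position at each voxel.
    """
    # grads[i][j] = d phi_i / d x_j, each of shape (X, Y, Z),
    # computed with central differences (one-sided at the boundary).
    grads = [np.gradient(phi[i]) for i in range(3)]
    J = np.empty(phi.shape[1:] + (3, 3))
    for i in range(3):
        for j in range(3):
            J[..., i, j] = grads[i][j]
    return np.linalg.det(J)

# Sanity check on an 8^3 grid: the identity deformation has JD = 1
# everywhere (no volume change anywhere).
grid = np.stack(np.meshgrid(*[np.arange(8.0)] * 3, indexing="ij"))
jd = jacobian_determinant(grid)
print(np.allclose(jd, 1.0))
```

A JD below 1 indicates local volume loss (blue in Figure 10) and a JD above 1 indicates local expansion (red); e.g., a uniform scaling of the grid by a factor s yields a JD of s³ at every voxel.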

## 5Conclusion and future work

In this work, we proposed a fast approach for geodesic regression (FPSGR) to study longitudinal image data. FPSGR incorporates the recently proposed FPIR [?] into the SGR [?] framework, thus leading to a computationally efficient solution to geodesic regression. Since FPSGR replaces the computationally intensive intermediate step of computing pairwise initial momenta via a deep-learning prediction method, it is orders of magnitude faster than existing approaches [?], without compromising accuracy. Consequently, FPSGR facilitates the analysis of large-scale imaging studies. Experiments on the ADNI-1/ADNI-2 datasets demonstrate that FPSGR captures expected atrophy trends of normal aging, MCI and AD. It further (1) exhibits negligible bias towards volume changes within stat-ROIs, (2) shows high correlations with clinical variables (MMSE and diagnosis) and (3) produces consistent forecasting results on unseen data.

In future work, it will be interesting to explore FPSGR for the task of classifying stable mild cognitive impairment (sMCI) versus progressive mild cognitive impairment (pMCI). Currently, FPSGR only shows modest accuracy for distinguishing these types within MCI. Extending our approach to time-warped geodesic regression models [?] might improve the accuracy in this context. Furthermore, end-to-end prediction of averaged initial momenta would be an interesting future direction, as this would allow learning representations that characterize the geodesic path among multiple time-series images, rather than only averages of momenta for image pairs as in FPIR [?].

## Acknowledgements

Research reported in this publication was supported by the National Institutes of Health (NIH) and the National Science Foundation (NSF) under award numbers NIH R01AR072013, NSF ECCS-1148870, and EECS-1711776. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH or the NSF. We also thank Nvidia for the donation of a TitanX GPU.

Data collection and sharing for this project was funded by the Alzheimer’s Disease Neuroimaging Initiative (ADNI) (National Institutes of Health Grant U01 AG024904) and DOD ADNI (Department of Defense award number W81XWH-12-2-0012). ADNI is funded by the National Institute on Aging, the National Institute of Biomedical Imaging and Bioengineering, and through generous contributions from the following: AbbVie, Alzheimer’s Association; Alzheimer’s Drug Discovery Foundation; Araclon Biotech; BioClinica, Inc.; Biogen; Bristol-Myers Squibb Company; CereSpir, Inc.; Cogstate; Eisai Inc.; Elan Pharmaceuticals, Inc.; Eli Lilly and Company; EuroImmun; F. Hoffmann-La Roche Ltd and its affiliated company Genentech, Inc.; Fujirebio; GE Healthcare; IXICO Ltd.; Janssen Alzheimer Immunotherapy Research & Development, LLC.; Johnson & Johnson Pharmaceutical Research & Development LLC.; Lumosity; Lundbeck; Merck & Co., Inc.; Meso Scale Diagnostics, LLC.; NeuroRx Research; Neurotrack Technologies; Novartis Pharmaceuticals Corporation; Pfizer Inc.; Piramal Imaging; Servier; Takeda Pharmaceutical Company; and Transition Therapeutics. The Canadian Institutes of Health Research is providing funds to support ADNI clinical sites in Canada. Private sector contributions are facilitated by the Foundation for the National Institutes of Health (www.fnih.org). The grantee organization is the Northern California Institute for Research and Education, and the study is coordinated by the Alzheimer’s Therapeutic Research Institute at the University of Southern California. ADNI data are disseminated by the Laboratory for Neuro Imaging at the University of Southern California.

## References

### Footnotes

1. Data used in the preparation of this article were obtained from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer’s disease (AD).
2. LDDMM results are generated using a vector momentum formulation: https://bitbucket.org/scicompanat/vectormomentum
3. http://pytorch.org
4. Here, we used 300 fixed iterations for each registration; 300 iterations ensured that almost all registrations converged. Note that the optimization-based LDDMM also uses a GPU implementation.
5. We combine MCI and LMCI mainly (a) because the diagnostic changes available on the IDA website (https://ida.loni.usc.edu/login.jsp) only provide these three diagnostic groups, and (b) to be consistent with the experiments conducted by Hua et al. [?], where only Normal, MCI, and AD were used as labels to classify ADNI-1. Hereafter, in all discussions of ADNI-1, MCI is a combination of the MCI and LMCI groups of ADNI-1.
6. Similar to ADNI-1, a detailed diagnosis for ADNI-2 is only available for the baseline images; MR images at later time points are only labeled as NC, MCI, and AD. Thus, we combine SMC and NC, as well as EMCI and LMCI to be consistent with the diagnostic changes in the ADNI Diagnosis Summary available on the IDA website. Hereafter, in all discussions of ADNI-2, NC includes NC and SMC and MCI includes EMCI and LMCI.
7. In ADNI-1/ADNI-2, there are two patients who show a reversion from AD to MCI. We omitted these cases in our experiment because the number of such cases is too small.
8. We used Spearman rank-order correlation instead of Pearson correlation, because the diagnostic groups imply an ordering only.
9. For ADNI-1 at 48 months, the number was 60 because there was not enough data; the ADNI-2 36-month time-point was omitted due to a lack of data.