# VoxelMorph: A Learning Framework for Deformable Medical Image Registration

###### Abstract

We present VoxelMorph, a fast learning-based framework for deformable, pairwise medical image registration. Traditional registration methods optimize an objective function for each pair of images, which can be time-consuming for large datasets or rich deformation models. In contrast to this approach, and building on recent learning-based methods, we formulate registration as a function that maps an input image pair to a deformation field that aligns these images. We parameterize the function via a convolutional neural network (CNN), and optimize the parameters of the neural network on a set of images. Given a new pair of scans, VoxelMorph rapidly computes a deformation field by directly evaluating the function. In this work, we explore two different training strategies. In the first (unsupervised) setting, we train the model to maximize standard image matching objective functions that are based on the image intensities. In the second setting, we leverage auxiliary segmentations available in the training data. We demonstrate that the unsupervised model’s accuracy is comparable to state-of-the-art methods, while operating orders of magnitude faster. We also show that VoxelMorph trained with auxiliary data improves registration accuracy at test time, and evaluate the effect of training set size on registration. Our method promises to speed up medical image analysis and processing pipelines, while facilitating novel directions in learning-based registration and its applications. Our code is freely available at http://voxelmorph.csail.mit.edu.

## I Introduction

Deformable registration is a fundamental task in a variety of medical imaging studies, and has been a topic of active research for decades. In deformable registration, a dense, non-linear correspondence is established between a pair of images, such as 3D magnetic resonance (MR) brain scans. Traditional registration methods solve an optimization problem for each volume pair by aligning voxels with similar appearance while enforcing constraints on the registration mapping. Unfortunately, solving a pairwise optimization can be computationally intensive, and therefore slow in practice. For example, state-of-the-art algorithms running on the CPU can require tens of minutes to hours to register a pair of scans with high accuracy [28, 44, 6]. Recent GPU implementations have reduced this runtime to just minutes, but require a GPU for each registration [54].

We present a novel registration method that learns a parametrized registration *function* from a collection of volumes. We implement the function using a convolutional neural network (CNN), that takes two -D input volumes and outputs a mapping of all voxels of one volume to another volume. The parameters of the network, i.e. the convolutional kernel weights, can be optimized using only a training set of volumes from the dataset of interest.
The procedure learns a common representation that enables alignment of a new pair of volumes from the same distribution. In essence, we replace a costly optimization solved for each test image pair with one global function optimization during a training phase. Registration of a new test scan pair is achieved by simply evaluating the learned function on the given volumes, resulting in rapid registration, even on a CPU. We implement our method as a general purpose framework, VoxelMorph, available at http://voxelmorph.csail.mit.edu^{1}^{1}1We implement VoxelMorph as a flexible framework that includes the methods proposed in this manuscript, as well as extensions that are beyond the scope of this work [20].

In the learning-based framework of VoxelMorph, we are free to adopt any differentiable objective function, and in this paper we present two possible choices. The first approach, which we refer to as unsupervised^{2}^{2}2 We use the term unsupervised to underscore the fact that VoxelMorph is a learning method (with images as input and deformations as output) that requires no deformation fields during training. Alternatively, such methods have also been termed self-supervised, to highlight the lack of supervision, or end-to-end, to highlight that no external computation is necessary as part of a pipeline (such as computing ’true’ deformation fields)., uses only the input volume pair and the registration field computed by the model. Similar to traditional image registration algorithms, this loss function quantifies the dissimilarity between the intensities of the two images and the spatial regularity of the deformation. The second approach also leverages anatomical segmentations available at training time for a subset of the data, to learn network parameters.

Throughout this study, we use the example of registering 3D MR brain scans. However, our method is broadly applicable to other registration tasks, both within and beyond the medical imaging domain. We evaluate our work on a multi-study dataset of over 3,500 scans containing images of healthy and diseased brains from a variety of age groups. Our unsupervised model achieves comparable accuracy to state-of-the-art registration, while taking orders-of-magnitude less time. Registration with VoxelMorph requires less than a minute using a CPU and under a second on a GPU, in contrast to the state-of-the-art baselines which take tens of minutes to over two hours on a CPU.

This paper extends a preliminary version of the work presented at the 2018 International Conference on Computer Vision and Pattern Recognition [9]. We build on that work by expanding analyses, and introducing an auxiliary learning model that can use anatomical segmentations during training to improve registration on new test image pairs for which segmentation maps are not available. We focus on providing a thorough analysis of the behavior of the VoxelMorph algorithm using two loss functions and a variety of settings, as follows. We test the unsupervised approach on more datasets and both atlas-based and subject-to-subject registration. We then explore cases where different types and numbers of anatomical region segmentations are available during training as auxiliary information, and evaluate the effect on registration of test data where segmentations are not available. We present an empirical analysis quantifying the effect of training set size on accuracy, and show how instance-specific optimization can improve results. Finally, we perform sensitivity analyses with respect to the hyperparameter choices, and discuss an interpretation of our model as amortized optimization.

The paper is organized as follows. Section 2 introduces medical image registration and Section 3 describes related work. Section 4 presents our methods. Section 5 presents experimental results on MRI data. We discuss insights of the results and conclude in Section 6.

## Ii Background

In the traditional volume registration formulation, one (moving or source) volume is warped to align with a second (fixed or target) volume. Fig. 1 shows sample 2D coronal slices taken from 3D MRI volumes, with boundaries of several anatomical structures outlined. There is significant variability across subjects, caused by natural anatomical brain variations and differences in health state. Deformable registration enables comparison of structures between scans. Such analyses are useful for understanding variability across populations or the evolution of brain anatomy over time for individuals with disease. Deformable registration strategies often involve two steps: an initial affine transformation for global alignment, followed by a much slower deformable transformation with more degrees of freedom. We concentrate on the latter step, in which we compute a dense, nonlinear correspondence for all voxels.

Most existing deformable registration algorithms iteratively optimize a transformation based on an energy function [65]. Let and denote the fixed and moving images, respectively, and let be the registration field that maps coordinates of to coordinates of . The optimization problem can be written as:

(1) | ||||

(2) |

where represents warped by , function measures image similarity between its two inputs, imposes regularization, and is the regularization trade-off parameter.

There are several common formulations for , and . Often, is characterized by a displacement vector field specifying the vector offset from to for each voxel: , where is the identity transform [8]. Diffeomorphic transforms model through the integral of a velocity vector field, preserving topology and maintaining invertibility on the transformation [10]. Common metrics used for include intensity mean squared error, mutual information [72], and cross-correlation [5]. The latter two are particularly useful when volumes have varying intensity distributions and contrasts. enforces a spatially smooth deformation, often modeled as a function of the spatial gradients of .

Traditional algorithms optimize (1) for each volume pair. This is expensive when registering many volumes, for example as part of population-wide analyses. In contrast, we assume that a field can be computed by a parameterized function of the data. We optimize the function parameters by minimizing the expected energy of the form of (1) over a dataset of volume pairs. Essentially, we replace pair-specific optimization of the deformation field by global optimization of the shared parameters, which in other domains has been referred to as amortization [42, 64, 76, 18]. Once the global function is estimated, a field can be produced by evaluating the function on a given volume pair. In this paper, we use a displacement-based vector field representation, and focus on various aspects of the learning framework and its advantages. However, we recently demonstrated that velocity-based representations are also possible in a VoxelMorph-like framework, also included in our codebase [20].

## Iii Related Work

### Iii-a Medical Image Registration (Non-learning-based)

There is extensive work in 3D medical image registration [3, 5, 8, 10, 21, 30, 69, 75, 77]. Several studies optimize within the space of displacement vector fields. These include elastic-type models [8, 22, 62], statistical parametric mapping [4], free-form deformations with b-splines [61], discrete methods [21, 30] and Demons [57, 69]. Diffeomorphic transforms, which are topology-preserving, have shown remarkable success in various computational anatomy studies. Popular formulations include Large Diffeomorphic Distance Metric Mapping (LDDMM) [10, 14, 15, 32, 41, 52, 55, 77], DARTEL [3], diffeomorphic demons [71], and standard symmetric normalization (SyN) [5]. All of these non-learning-based approaches optimize an energy function for each image pair, resulting in slow registration. Recent GPU-based algorithms build on these concepts to reduce algorithm runtime to several minutes, but require a GPU to be available for each registration [54, 53].

### Iii-B Medical Image Registration (Learning-based)

There are several recent papers proposing neural networks to learn a function for medical image registration. Most of these rely on ground truth warp fields [13, 45, 59, 63, 74], which are either obtained by simulating deformations and deformed images, or running classical registration methods on pairs of scans. Some also use image similarity to help guide the registration [13]. While supervised methods present a promising direction, ground truth warp fields derived via conventional registration tools as ground truth can be cumbersome to acquire and can restrict the type of deformations that are learned. In contrast, VoxelMorph is unsupervised, and is also capable of leveraging auxiliary information such as segmentations during training if those are available.

Two recent papers [24, 46], were the first to present unsupervised learning based image registration methods. Both propose a neural network consisting of a CNN and spatial transformation function [39] that warps images to one another. However, these two initial methods are only demonstrated on limited subsets of volumes, such as 3D subregions [46] or 2D slices [24], and support only small transformations [24].

A recent method has proposed a segmentation driven cost function to be used in registering different imaging modalities – T2w MRI and 3D ultrasound – within the same subject [36, 35]. The authors demonstrate that a loss functions based solely on segmentation maps can lead to an accurate within-subject cross-modality registration network. Parallel to this work, in one of our experiments, we demonstrate the use of segmentation maps during training in subject-to-atlas registration. We provide an analysis of the effect of different anatomical label availability on overall registration quality, and evaluate how a combination of segmentation and image based losses behaves in various scenarios. We find that a segmentation-based loss can be helpful, for example if the input segment labels are the same as those we evaluate on (consistent with [36], and [35]). We also show that the image-based and smoothness losses are still necessary, especially when we evaluate registration accuracy on labels not observed during training, and to encourage deformation regularity.

### Iii-C 2D Image Alignment

Optical flow estimation is a related registration problem for 2D images. Optical flow algorithms return a dense displacement vector field depicting small displacements between a pair of 2D images. Traditional optical flow approaches typically solve an optimization problem similar to (1) using variational methods [11, 34, 67]. Extensions that better handle large displacements or dramatic changes in appearance include feature-based matching [12, 47] and nearest neighbor fields [16].

In recent years, several learning-based approaches to optical flow estimation using neural networks have been proposed [2, 27, 37, 40, 58, 70]. These algorithms take a pair of images as input, and use a convolutional neural network to learn image features that capture the concept of optical flow from data. Several of these works require supervision in the form of ground truth flow fields [27, 37, 58, 70], while we build on a few that use an unsupervised objective [2, 40]. The spatial transform layer enables neural networks to perform both global parametric 2D image alignment [39] and dense spatial transformations [40, 56, 78] without requiring supervised labels. An alternative approach to dense estimation is to use CNNs to match image patches [7, 29, 68, 73]. These methods require exhaustive matching of patches, resulting in slow runtime.

We build on these ideas and extend the spatial transformer to achieve n-D volume registration, and further show how leveraging image segmentations during training can improve registration accuracy at test time.

## Iv Method

Let be two image volumes defined over an -D spatial domain . For the rest of this paper, we focus on the case but our method and implementation are dimension independent. For simplicity we assume that and contain single-channel, grayscale data. We also assume that and are affinely aligned as a preprocessing step, so that the only source of misalignment between the volumes is nonlinear. Many packages are available for rapid affine alignment.

We model a function using a convolutional neural network (CNN), where are network parameters, the kernels of the convolutional layers. The displacement field between and is in practice stored in a -dimensional image. That is, for each voxel , is a displacement such that and correspond to similar anatomical locations, where the map is formed using an identity transform and .

Fig. 2 presents an overview of our method. The network takes and as input, and computes using a set of parameters . We warp to using a spatial transformation function, enabling evaluation of the similarity of and . Given unseen images and during test time, we obtain a registration field by evaluating .

We use (single-element) stochastic gradient descent to find optimal parameters by minimizing an expected loss function using a training dataset. We propose two unsupervised loss functions in this work. The first captures image similarity and field smoothness, while the second also leverages anatomical segmentations. We describe our CNN architecture and the two loss functions in detail in the next sections.

### Iv-a VoxelMorph CNN Architecture

In this section we describe the particular architecture used in our experiments, but emphasize that a wide range of architectures may work similarly well and that the exact architecture is not our focus. The parametrization of is based on a convolutional neural network architecture similar to UNet [38, 60], which consists of encoder and decoder sections with skip connections.

Fig. 3 depicts the network used in VoxelMorph, which takes a single input formed by concatenating and into a 2-channel 3D image. In our experiments, the input is of size , but the framework is not limited by a particular size. We apply 3D convolutions in both the encoder and decoder stages using a kernel size of , and a stride of 2. Each convolution is followed by a LeakyReLU layer with parameter . The convolutional layers capture hierarchical features of the input image pair, used to estimate . In the encoder, we use strided convolutions to reduce the spatial dimensions in half at each layer. Successive layers of the encoder therefore operate over coarser representations of the input, similar to the image pyramid used in traditional image registration work.

In the decoding stage, we alternate between upsampling, convolutions and concatenating skip connections that propagate features learned during the encoding stages directly to layers generating the registration. Successive layers of the decoder operate on finer spatial scales, enabling precise anatomical alignment. The receptive fields of the convolutional kernels of the smallest layer should be at least as large as the maximum expected displacement between corresponding voxels in and . In our architecture, the smallest layer applies convolutions over a volume of the size of the input images.

### Iv-B Spatial Transformation Function

The proposed method learns optimal parameter values in part by minimizing differences between and . In order to use standard gradient-based methods, we construct a differentiable operation based on spatial transformer networks [39] to compute .

For each voxel , we compute a (subpixel) voxel location in . Because image values are only defined at integer locations, we linearly interpolate the values at the eight neighboring voxels:

(3) |

where are the voxel neighbors of , and iterates over dimensions of . Because we can compute gradients or subgradients,^{3}^{3}3The absolute value is implemented with a subgradient of 0 at 0. we can backpropagate errors during optimization.

### Iv-C Loss Functions

In this section, we propose two loss functions: an unsupervised loss that evaluates the model using only the input volumes and generated registration field, and an auxiliary loss that also leverages anatomical segmentations at training time.

#### Iv-C1 Unsupervised Loss Function

The unsupervised loss consists of two components: that penalizes differences in appearance, and that penalizes local spatial variations in :

(4) |

where is a regularization parameter. We experimented with two often-used functions for . The first is the mean squared voxelwise difference, applicable when and have similar image intensity distributions and local contrast:

(5) |

The second is the local cross-correlation of and , which is more robust to intensity variations found across scans and datasets [5]. Let and denote local mean intensity images: , where iterates over a volume around , with in our experiments. The local cross-correlation of and is written as:

(6) |

A higher CC indicates a better alignment, yielding the loss function: .

Minimizing will encourage to approximate , but may generate a non-smooth that is not physically realistic. We encourage a smooth displacement field using a diffusion regularizer on the spatial gradients of displacement :

(7) |

and approximate spatial gradients using differences between neighboring voxels. Specifically, for , we approximate , and use similar approximations for and .

#### Iv-C2 Auxiliary Data Loss Function

Here, we describe how VoxelMorph can leverage auxiliary information available during training but not during testing. Anatomical segmentation maps are sometimes available during training, and can be annotated by human experts or automated algorithms. A segmentation map assigns each voxel to an anatomical structure. If a registration field represents accurate anatomical correspondences, the regions in and corresponding to the same anatomical structure should overlap well.

Let be the voxels of structure for and , respectively. We quantify the volume overlap for structure using the Dice score [26]:

(8) |

A Dice score of indicates that the anatomy matches perfectly, and a score of 0 indicates that there is no overlap. We define the segmentation loss over all structures as:

(9) |

alone does not encourage smoothness and agreement of image appearance, which are essential to good registration. We therefore combine with (4) to obtain the objective:

(10) |

where is a regularization parameter.

In our experiments, which use affinely aligned images, we demonstrate that loss (10) can lead to significant improvements. In general, and depending on the task, this loss can also be computed in a multiscale fashion as introduced in [36], depending on quality of the initial alignment.

Since anatomical labels are categorical, a naive implementation of linear interpolation to compute is inappropriate, and a direct implementation of (8) might not be amenable to auto-differentiation frameworks. We design and to be image volumes with channels, where each channel is a binary mask specifying the spatial domain of a particular structure. We compute by spatially transforming each channel of using linear interpolation. We then compute the numerator and denominator of (8) by multiplying and adding and , respectively.

### Iv-D Amortized Optimization Interpretation

Our method substitutes the pair-specific optimization over the deformation field with a global optimization of function parameters for function . This process is sometimes referred to as amortized optimization [50]. Because the function is tasked with estimating registration between any two images, the fact that parameters are shared globally acts as a natural regularization. We demonstrate this aspect in Section V-C (Regularization Analysis). In addition, the quality and generalizability of the deformations outputted by the function will depend on the data it is trained on. Indeed, the resulting deformation can be interpreted as simply an approximation or initialization to the optimal deformation , and the resulting difference is sometimes referred to as the amortization gap [18, 50]. If desired, this initial deformation field could be improved using any instance-specific optimization. In our experiments, we accomplish this by treating the resulting displacement as model parameters, and fine-tuning the deformation for each particular scan independently using gradient descent. Essentially, this implements an auto-differentiation version of conventional registration, using VoxelMorph output as initialization. However, most often we find that the initial deformation, the VoxelMorph output, is already as accurate as state of the art results. We explore these aspects in experiments presented in Section V-D.

## V Experiments

We demonstrate our method on the task of brain MRI registration. We first (Section V-B) present a series of atlas-based registration experiments, in which we compute a registration field between an atlas, or reference volume, and each volume in our dataset. Atlas-based registration is a common formulation in population analysis, where inter-subject registration is a core problem. The atlas represents a reference, or average volume, and is usually constructed by jointly and repeatedly aligning a dataset of brain MR volumes and averaging them together [23]. We use an atlas computed using an external dataset [28, 66]. Each input volume pair consists of the atlas (image ) and a volume from the dataset (image ). Fig. 4 shows example image pairs using the same fixed atlas for all examples. In a second experiment (Section V-C), we perform hyper-parameter sensitivity analysis. In a third experiment (Section V-D), we study the effect of training set size on registration, and demonstrate instance-specific optimization. In the fourth experiment (Section V-E) we present results on a dataset that contains manual segmentations. In the next experiment (Section V-F), we train VoxelMorph using random pairs of training subjects as input, and test registration between pairs of unseen test subjects. Finally (Section V-G), we present an empirical analysis of registration with auxiliary segmentation data. All figures that depict brains in this paper show 2D slices, but all registration is done in 3D.

### V-a Experimental Setup

Method | Dice | GPU sec | CPU sec | ||
---|---|---|---|---|---|

Affine only | 0.584 (0.157) | 0 | 0 | 0 | 0 |

ANTs SyN (CC) | 0.749 (0.136) | - | 9059 (2023) | 9662 (6258) | 0.140 (0.091) |

NiftyReg (CC) | 0.755 (0.143) | - | 2347 (202) | 41251 (14336) | 0.600 (0.208) |

VoxelMorph (CC) | 0.753 (0.145) | 0.45 (0.01) | 57 (1) | 19077 (5928) | 0.366 (0.114) |

VoxelMorph (MSE) | 0.752 (0.140) | 0.45 (0.01) | 57 (1) | 9606 (4516) | 0.184 (0.087) |

#### V-A1 Dataset

We use a large-scale, multi-site, multi-study dataset of 3731 T1–weighted brain MRI scans from eight publicly available datasets: OASIS [48], ABIDE [25], ADHD200 [51], MCIC [31], PPMI [49], HABS [19], Harvard GSP [33], and the FreeSurfer Buckner40 [28]. Acquisition details, subject age ranges and health conditions are different for each dataset. All scans were resampled to a grid with mm isotropic voxels. We carry out standard pre‐processing steps, including affine spatial normalization and brain extraction for each scan using FreeSurfer [28], and crop the resulting images to . All MRIs were anatomically segmented with FreeSurfer, and we applied quality control using visual inspection to catch gross errors in segmentation results and affine alignment. We include all anatomical structures that are at least voxels in volume for all test subjects, resulting in 30 structures. We use the resulting segmentation maps in evaluating our registration as described below. We split our dataset into 3231, 250, and 250 volumes for train, validation, and test sets respectively, although we highlight that we do not use any supervised information at any stage. In addition, the Buckner40 dataset is only used for testing, using manual segmentations.

#### V-A2 Evaluation Metrics

Obtaining dense *ground truth* registration for these data is not well-defined since many registration fields can yield similar looking warped images. We first evaluate our method using volume overlap of anatomical segmentations. If a registration field represents accurate correspondences, the regions in and corresponding to the same anatomical structure should overlap well (see Fig. 4 for examples). We quantify the volume overlap between structures using the Dice score (8). We also evaluate the regularity of the deformation fields. Specifically, the Jacobian matrix captures the local properties of around voxel . We count all non-background voxels for which , where the deformation is not diffeomorphic [3].

#### V-A3 Baseline Methods

We use Symmetric Normalization (SyN) [5], the top-performing registration algorithm in a comparative study [44] as a first baseline. We use the SyN implementation in the publicly available Advanced Normalization Tools (ANTs) software package [6], with a cross-correlation similarity measure. Throughout our work with medical images, we found the default ANTs smoothness parameters to be sub-optimal for applying ANTs to our data. We obtained improved parameters using a wide parameter sweep across multiple datasets, and use those in these experiments. Specifically, we use SyN step size of 0.25, Gaussian parameters (9, 0.2), at three scales with at most 201 iterations each. We also use the NiftyReg package, as a second baseline. Unfortunately, a GPU implementation is not currently available, and instead we build a multi-threaded CPU version^{4}^{4}4We used the latest source code, updated March, 2018 (tree [4e4525]).. We searched through various parameter settings to obtain improved parameters, and use the CC cost function, grid spacing of 5, and 500 iterations.

#### V-A4 VoxelMorph Implementation

We implemented our method using Keras [17] with a Tensorflow backend [1]. We extended the 2D linear interpolation spatial transformer layer to -D, and here use . We use the ADAM optimizer [43] with a learning rate of . While our implementation allows for mini-batch stochastic gradient descent, in our experiments each training batch consists of one pair of volumes. Our implementation includes a default of 150,000 iterations. Our code and model parameters are available online at http://voxelmorph.csail.mit.edu.

### V-B Atlas-based Registration

In this experiment, we train VoxelMorph for atlas-based registration. We train separate VoxelMorph networks with different regularization parameters. We then select the network that optimizes Dice score on our validation set, and report results on our test set.

Table I presents average Dice scores computed for all subjects and structures for baselines of only global affine alignment, ANTs, and NiftyReg, as well as VoxelMorph with different losses. VoxelMorph variants perform comparably to ANTs and NiftyReg in terms of Dice^{5}^{5}5Both VoxelMorph variants are different from ANTs with paired t-test p-values of and and with slightly higher Dice values. There is no difference between VoxelMorph (CC) and NiftyReg (p-value of ), and no significant difference between VoxelMorph (CC) and VoxelMorph (MSE) (p-value of ), and are significantly better than affine alignment. Example visual results of the warped images from our algorithms are shown in Figs. 4 and 6. VoxelMorph is able to handle significant shape changes for various structures.

Fig. 5 presents the Dice scores for each structure as a boxplot. For ease of visualization, we average Dice scores of the same structures from the two hemispheres into one score, e.g., the left and right hippocampi scores are averaged. The VoxelMorph models achieve comparable Dice measures to ANTs and NiftyReg for all structures, performing slightly better on some structures such as the lateral ventricles, and worse on others such as the hippocampi.

Table I includes a count of voxels for which the Jacobian determinant is non-positive. We find that all methods result in deformations with small islands of such voxels, but are diffeomorphic at the vast majority of voxels (99.4% - 99.9%). Figs. 6 and Fig. 11 in the supplemental material illustrate several example VoxelMorph deformation fields. VoxelMorph has no explicit constraint for diffeomorphic deformations, but in this setting the smoothness loss leads to generally smooth and well-behaved results. ANTs and NiftyReg include implementations that can enforce or strongly encourage diffeomorphic deformations, but during our parameter search these negatively affected runtime or results. In this work, we ran the baseline implementations with configurations that yielded the best Dice scores, which also turned out to produce good deformation regularity.

#### V-B1 Runtime

Table I presents runtime results using an Intel Xeon (E5-2680) CPU, and a NVIDIA TitanX GPU. We report the elapsed time for computations following the affine alignment preprocessing step, which all of the presented methods share, and requires just a few minutes even on a CPU. ANTs requires two or more hours on the CPU, while NiftyReg requires roughly 39 minutes for the given setting. ANTs runtimes vary widely, as its convergence depends on the difficulty of the alignment task. Registering two images with VoxelMorph is, on average, times faster on the CPU compared to ANTs, and 40 times faster than NiftyReg. When using the GPU, VoxelMorph computes a registration in under a second. To our knowledge, there is no publicly available ANTs implementation for GPUs. It is likely that the SyN algorithm would benefit from a GPU implementation, but the main advantage of VoxelMorph comes from not requiring an optimization on each test pair, as can be seen in the CPU comparison. Unfortunately, the NiftyReg GPU version is unavailable in the current source code on all available repository history.

### V-C Regularization Analysis

Fig. 7 shows average Dice scores for the validation set for different values of the smoothness regularization parameter . The results vary smoothly over a large range of values, illustrating that our model is robust to choice of . Interestingly, even setting , which enforces no explicit regularization on registration, results in a significant improvement over affine registration. This is likely because the optimal network parameters need to register all pairs in the training set well, yielding an implicit dataset regularization for the function .

### V-D Training Set Size and Instance-Specific Optimization

We evaluate the effect of training set size on accuracy, and the relationship between amortized and instance-specific optimization. Because MSE and CC performed similarly for atlas-based registration, in this section we use MSE. We train VoxelMorph on subsets of different sizes from our training dataset, and report Dice scores on: (1) the training subset, (2) the held out test set, and (3) the test set when each deformation is further individually optimized for each test image pair. We perform (3) by “fine-tuning” the displacements obtained from VoxelMorph using gradient descent for 100 iterations on each test pair, which took seconds on the GPU or seconds on a single-threaded CPU.

Fig. 8 presents our results. A small training set size of 10 scans results in slightly lower train and test Dice scores compared to larger training set sizes. However, there is no significant difference in Dice scores when training with 100 scans or the full dataset. Further optimizing the VoxelMorph parameters on each test image pair results in better test Dice scores regardless of training set size, comparable to the state-of-the-art.

Method | Dice |
---|---|

Affine only | 0.608 (0.175) |

ANTs SyN (CC) | 0.776 (0.130) |

NiftyReg (CC) | 0.776 (0.132) |

VoxelMorph (MSE) | 0.766 (0.133) |

VoxelMorph (MSE) inst. | 0.776 (0.132) |

VoxelMorph (CC) | 0.774 (0.133) |

VoxelMorph (CC) inst. | 0.786 (0.132) |

### V-E Manual Anatomical Delineations

Since manual segmentations are not available for most datasets, the availability of FreeSurfer segmentations enabled the broad range of experiments above. In this experiment, we use VoxelMorph models already trained in Section V-B to test registration on the (unseen) Buckner40 dataset containing scans. This dataset contains expert manual delineations of the same anatomical structures used in previous experiments, which we use here for evaluation. We also compute VoxelMorph with instance-specific optimization, as described in Section V-D. The Dice score results, shown in Table II, show that VoxelMorph using cross-correlation loss behaves comparably to ANTs and NiftyReg using the same cost function, consistent with the first experiment where we evaluated on FreeSurfer segmentations. VoxelMorph with instance-specific optimization further improves the results, similar to the previous experiment. On this dataset, results using VoxelMorph with MSE loss obtain slightly lower scores, but are improved by the instance-specific optimization procedure to be comparable to ANTs and NiftyReg.

### V-F Subject-to-Subject Registration

In this experiment, we train VoxelMorph for subject-to-subject registration. Since there is more variability in each registration, we double the number of features for each network layer. We also compute VoxelMorph with instance-specific optimization, as described in Section V-D. Table III presents average test Dice scores on 250 randomly selected test pairs for registration. Consistent with literature, we find that the normalized cross correlation loss leads to more robust results compared to using the MSE loss. VoxelMorph (with doubled feature counts) Dice scores are comparable with ANTs and slightly below NiftyReg, while results from VoxelMorph with instance-specific optimization are comparable to both baselines.

Method | Dice |
---|---|

Affine only | 0.579 (0.173) |

ANTs SyN (CC) | 0.761 (0.117) |

NiftyReg (CC) | 0.772 (0.117) |

VoxelMorph (MSE) | 0.727 (0.146) |

VoxelMorph x2 (MSE) | 0.750 (0.058) |

VoxelMorph x2 (MSE) inst. | 0.764 (0.048) |

VoxelMorph (CC) | 0.737 (0.139) |

VoxelMorph x2 (CC) | 0.763 (0.049) |

VoxelMorph x2 (CC) inst. | 0.772 (0.119) |

### V-G Registration with Auxiliary Data

In this section, we evaluate VoxelMorph when using segmentation maps during training with loss function (10). Because MSE and CC performed similarly for atlas-based registration, in this section we use MSE with . We present an evaluation of our model in two practical scenarios: (1) when subsets of anatomical structure labels are available during training, and (2) when coarse segmentations labels are available during training. We use the same train/validation/test split as the previous experiments.

#### V-G1 Training with a subset of anatomical labels

In many practical settings, it may be infeasible to obtain training segmentations for all structures. We therefore first consider the case where segmentations are available for only a subset of the 30 structures. We refer to structures present in segmentations as *observed*, and the rest as *unobserved*. We considered three scenarios, when: one, 15 (half), and 30 (all) structure segmentations are observed. The first two experiments essentially simulate different amounts of partially observed segmentations. For each experiment, we train separate models on different subsets of observed structures, as follows. For single structure segmentations, we manually selected four important structures for four folds (one for each fold) of the experiment: hippocampi, cerebral cortex, cerebral white matter, and ventricles. For the second experiment, we randomly selected 15 of the 30 structures, with a different selection for each of five folds. For each fold and each subset of observed labels, we use the segmentation maps at training, and show results on test pairs where segmentation maps are not used.

Fig. 9a-c shows Dice scores for both the observed and unobserved labels when sweeping in (10), the auxiliary regularization trade-off parameter. We train our models with FreeSurfer annotations, and show results on both the general test set using FreeSurfer annotations (top) and the Buckner40 test set with manual annotations (bottom). The extreme values (or ) and serve as theoretical extremes, with corresponding to unsupervised VoxelMorph, and corresponding to VoxelMorph trained *only* with auxiliary labels, without the smoothness and image matching objective terms.

In general, VoxelMorph with auxiliary data significantly outperforms (largest p-value among the four settings) unsupervised VoxelMorph (equivalent to or ) and ANTs on observed structures in terms of Dice score. Dice score on observed labels generally increases with an increase in .

Setting | |||||
---|---|---|---|---|---|

one (count) | 9606 (4471) | 10435 (4543) | 22998 (3171) | 121546 (12203) | 685811 (6878) |

one (%) | 0.18 (0.09) | 0.20 (0.09) | 0.44 (0.06) | 2.33 (0.23) | 13.14 (0.13) |

half (count) | 9606 (4471) | 9470 (4008) | 17886 (4919) | 86319 (13851) | 516384 (7210) |

half (%) | 0.18 (0.09) | 0.18 (0.08) | 0.34 (0.09) | 1.65 (0.27) | 9.90 (0.14) |

all (count) | 9606 (4471) | 10824 (5029) | 19226 (4471) | 102295 (14366) | 528552 (8720) |

all (%) | 0.18 (0.09) | 0.21 (0.10) | 0.37 (0.09) | 1.96 (0.28) | 10.13 (0.17) |

coarse (count) | 9606 (4471) | 9343 (4117) | 15190 (4416) | 76677 (11612) | 564493 (7379) |

coarse (%) | 0.18 (0.09) | 0.18 (0.08) | 0.29 (0.08) | 1.47 (0.22) | 10.82 (0.14) |

Interestingly, VoxelMorph (trained with auxiliary data) yields improved Dice scores for unobserved structures compared to the unsupervised variant for a range of values (see Fig. 9a-b), even though these segmentations were not explicitly observed during training. When all structures that we use during evaluation are observed during training, we find good Dice results at higher values (Fig 9c.). Registration accuracy for unobserved structures starts declining when is large, in the range . This can be interpreted as the range where the model starts to over-fit to the observed structures - that is, it continues to improve the Dice score for observed structures while harming the registration accuracy for the other structures (Fig. 9c)

#### V-G2 Training with coarse labels

We consider the scenario where only coarse labels are available, such as when all the white matter is segmented as one structure. This situation enables evaluation of how the auxiliary data affects anatomical registration at finer scales, within the coarsely delineated structures. To achieve this, we merge the 30 structures into four broad groups: white matter, gray matter, cerebral spinal fluid (CSF) and the brain stem, and evaluate the accuracy of the registration on the original structures.

Fig. 9d (top) presents mean Dice scores over the original 30 structures with varying . With of 0.01, we obtain an average Dice score of on FreeSurfer segmentations. This is roughly a Dice point improvement over VoxelMorph without auxiliary information (p-value ).

#### V-G3 Regularity of Deformations

We also evaluate the regularity of the deformation fields both visually and by computing the number of voxels for which the determinant of the Jacobian is non-positive. Table IV provides the quantitative regularity measure for all values, showing that VoxelMorph deformation regularity degrades slowly as a function of (shown on a log scale), with roughly 0.2% of the voxels exhibiting folding at the lowest parameter value, and at most 2.3% when . Deformations from models that don’t encourage smoothness, at the extreme value of , exhibit – folding voxels. A lower value such as therefore provides a good compromise of high Dice scores for all structures while avoiding highly irregular deformation fields, and avoiding overfitting as described above. Fig 10 shows examples of deformation fields for and , and we include more figures in the supplemental material for each experimental setting.

#### V-G4 Testing on Manual Segmentation Maps

We also test these models on the manual segmentations in the Buckner40 dataset used above, resulting in Fig. 9 (bottom). We observe a behavior consistent with the conclusions above, with smaller Dice score improvements, possibly due to the higher baseline Dice scores achieved on the Buckner40 data.

## Vi Discussion and Conclusion

VoxelMorph with unsupervised loss performs comparably to the state-of-the-art ANTs and NiftyReg software in terms of Dice score, while reducing the computation time from hours to minutes on a CPU and under a second on a GPU. VoxelMorph is flexible and handles both partially observed or coarsely delineated auxiliary information during training, which can lead to improvements in Dice score while still preserving the runtime improvement.

VoxelMorph performs amortized optimization, learning global function parameters that are optimal for an entire training dataset. As Fig. 8 shows, the dataset need not be large: with only 100 training images, VoxelMorph leads to state-of-the-art registration quality scores for both training and test sets. Instance-specific optimization further improves VoxelMorph performance by one Dice point. This is a small increase, illustrating that amortized optimization can lead to nearly optimal registration.

We performed a thorough set of experiments demonstrating that, for a reasonable choice of , the availability of anatomical segmentations during training significantly improves test registration performance with VoxelMorph (in terms of Dice score) while providing smooth deformations (e.g. for , less than folding voxels). The performance gain varies based on the quality and number of anatomical segmentations available. Given a single labeled anatomical structure during training, the accuracy of registration of test subjects for that label increases, without negatively impacting other anatomy. If half or all of the labels are observed, or even a coarse segmentation is provided at training, registration accuracy improves for all labels during test. While we experimented with one type of auxiliary data in this study, VoxelMorph can leverage other auxiliary data, such as different modalities or anatomical keypoints. Increasing also increases the number of voxels exhibiting a folding of the registration field. This effect may be alleviated by using a diffeomorphic deformation representation for VoxelMorph, as introduced in recent work [20].

VoxelMorph is a general learning model, and is not limited to a particular image type or anatomy – it may be useful in other medical image registration applications such as cardiac MR scans or lung CT images. With an appropriate loss function such as mutual information, the model can also perform multimodal registration. VoxelMorph promises to significantly speed up medical image analysis and processing pipelines, while opening novel directions in learning-based registration.

## References

- [1] (2016) Tensorflow: large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467. Cited by: §V-A4.
- [2] (2016) Unsupervised convolutional neural networks for motion estimation. In Image Processing (ICIP), 2016 IEEE International Conference on, pp. 1629–1633. Cited by: §III-C.
- [3] (2007) A fast diffeomorphic image registration algorithm. Neuroimage 38 (1), pp. 95–113. Cited by: §III-A, §V-A2.
- [4] (2000) Voxel-based morphometry-the methods. Neuroimage 11, pp. 805–821. Cited by: §III-A.
- [5] (2008) Symmetric diffeomorphic image registration with cross-correlation: evaluating automated labeling of elderly and neurodegenerative brain. Medical image analysis 12 (1), pp. 26–41. Cited by: §II, §III-A, §IV-C1, §V-A3.
- [6] (2011) A reproducible evaluation of ants similarity metric performance in brain image registration. Neuroimage 54 (3), pp. 2033–2044. Cited by: §I, §V-A3.
- [7] (2017) CNN-based patch matching for optical flow with thresholded hinge embedding loss. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 7. Cited by: §III-C.
- [8] (1989) Multiresolution elastic matching. Computer Vision, Graphics, and Image Processing 46, pp. 1–21. Cited by: §II, §III-A.
- [9] (2018) An unsupervised learning model for deformable medical image registration. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9252–9260. Cited by: §I.
- [10] (2005) Computing large deformation metric mappings via geodesic flows of diffeomorphisms. Int. J. Comput. Vision 61, pp. 139–157. Cited by: §II, §III-A.
- [11] (2004) High accuracy optical flow estimation based on a theory for warping. European Conference on Computer Vision (ECCV), pp. 25–36. Cited by: §III-C.
- [12] (2011) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Trans. Pattern Anal. Mach. Intell. 33 (3), pp. 500–513. Cited by: §III-C.
- [13] (2017) Deformable image registration based on similarity-steered cnn regression. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 300–308. Cited by: §III-B.
- [14] (2005) Large deformation diffeomorphic metric mapping of vector fields. IEEE transactions on medical imaging 24 (9), pp. 1216–1230. Cited by: §III-A.
- [15] (2009) Multi-contrast large deformation diffeomorphic metric mapping for diffusion tensor imaging. Neuroimage 47 (2), pp. 618–627. Cited by: §III-A.
- [16] (2013) Large displacement optical flow from nearest neighbor fields. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2443–2450. Cited by: §III-C.
- [17] (2015) Keras. GitHub. Note: https://github.com/fchollet/keras Cited by: §V-A4.
- [18] (2018) Inference suboptimality in variational autoencoders. arXiv preprint arXiv:1801.03558. Cited by: §II, §IV-D.
- [19] (2015) Harvard aging brain study: dataset and accessibility. NeuroImage. Cited by: §V-A1.
- [20] (2018) Unsupervised learning for fast probabilistic diffeomorphic registration. arXiv preprint arXiv:1805.04605. Cited by: §II, §VI, footnote 1.
- [21] (2016) Patch-based discrete registration of clinical brain images. In International Workshop on Patch-based Techniques in Medical Imaging, pp. 60–67. Cited by: §III-A.
- [22] (1997) Spatial transformation and registration of brain images using elastically deformable models. Computer Vision and Image Understanding 66 (2), pp. 207–222. Cited by: §III-A.
- [23] (2004) Multi-subject registration for unbiased statistical atlas construction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 655–662. Cited by: §V.
- [24] (2017) End-to-end unsupervised deformable image registration with a convolutional neural network. In Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pp. 204–212. Cited by: §III-B.
- [25] (2014) The autism brain imaging data exchange: towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular psychiatry 19 (6), pp. 659–667. Cited by: §V-A1.
- [26] (1945) Measures of the amount of ecologic association between species. Ecology 26 (3), pp. 297–302. Cited by: §IV-C2.
- [27] (2015) Flownet: learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766. Cited by: §III-C.
- [28] (2012) FreeSurfer. Neuroimage 62 (2), pp. 774–781. Cited by: §I, §V-A1, §V.
- [29] (2016) Patchbatch: a batch augmented loss for optical flow. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4236–4245. Cited by: §III-C.
- [30] (2008) Dense image registration through mrfs and efficient linear programming. Medical image analysis 12 (6), pp. 731–741. Cited by: §III-A.
- [31] (2013) The mcic collection: a shared repository of multi-modal, multi-site brain image data from a clinical investigation of schizophrenia. Neuroinformatics 11 (3), pp. 367–388. Cited by: §V-A1.
- [32] (2009) Registration of anatomical images using paths of diffeomorphisms parameterized with stationary vector field flows. International Journal of Computer Vision 85 (3), pp. 291–306. Cited by: §III-A.
- [33] (2015) Brain genomics superstruct project initial data release with structural, functional, and behavioral measures. Scientific data 2. Cited by: §V-A1.
- [34] (1980) Determining optical flow. Cited by: §III-C.
- [35] (2018) Label-driven weakly-supervised learning for multimodal deformable image registration. In Biomedical Imaging (ISBI 2018), 2018 IEEE 15th International Symposium on, pp. 1070–1074. Cited by: §III-B.
- [36] (2018) Weakly-supervised convolutional neural networks for multimodal image registration. Medical image analysis 49, pp. 1–13. Cited by: §III-B, §IV-C2.
- [37] (2017) Flownet 2.0: evolution of optical flow estimation with deep networks. In IEEE conference on computer vision and pattern recognition (CVPR), Vol. 2, pp. 6. Cited by: §III-C.
- [38] (2017) Image-to-image translation with conditional adversarial networks. arXiv preprint. Cited by: §IV-A.
- [39] (2015) Spatial transformer networks. In Advances in neural information processing systems, pp. 2017–2025. Cited by: §III-B, §III-C, §IV-B.
- [40] (2016) Back to basics: unsupervised learning of optical flow via brightness constancy and motion smoothness. In European Conference on Computer Vision, pp. 3–10. Cited by: §III-C.
- [41] (2000) Landmark matching via large deformation diffeomorphisms. IEEE transactions on image processing 9 (8), pp. 1357–1370. Cited by: §III-A.
- [42] (2018) Semi-amortized variational autoencoders. preprint arXiv:1802.02550. Cited by: §II.
- [43] (2014) ADAM: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §V-A4.
- [44] (2009) Evaluation of 14 nonlinear deformation algorithms applied to human brain mri registration. Neuroimage 46(3), pp. 786–802. Cited by: §I, §V-A3.
- [45] (2017) Robust non-rigid registration through agent-based action learning. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 344–352. Cited by: §III-B.
- [46] (2017) Non-rigid image registration using fully convolutional networks with deep self-supervision. preprint arXiv:1709.00799. Cited by: §III-B.
- [47] (2011) SIFT flow: dense correspondence across scenes and its applications. IEEE Trans. Pattern Anal. Mach. Intell. 33 (5), pp. 978–994. Cited by: §III-C.
- [48] (2007) Open access series of imaging studies (oasis): cross-sectional mri data in young, middle aged, nondemented, and demented older adults. Journal of cognitive neuroscience 19 (9), pp. 1498–1507. Cited by: §V-A1.
- [49] (2011) The parkinson progression marker initiative (ppmi). Progress in neurobiology 95 (4), pp. 629–635. Cited by: §V-A1.
- [50] (2018) Iterative amortized inference. arXiv preprint arXiv:1807.09356. Cited by: §IV-D.
- [51] (2012) The ADHD-200 consortium: a model to advance the translational potential of neuroimaging in clinical neuroscience. Frontiers in systems neuroscience 6, pp. 62. Cited by: §V-A1.
- [52] (2005) Increasing the power of functional maps of the medial temporal lobe by using large deformation diffeomorphic metric mapping. Proceedings of the National Academy of Sciences 102 (27), pp. 9685–9690. Cited by: §III-A.
- [53] (2014) Global image registration using a symmetric block-matching approach. Journal of Medical Imaging 1 (2), pp. 024003. Cited by: §III-A.
- [54] (2010) Fast free-form deformation using graphics processing units. Computer methods and programs in biomedicine 98 (3), pp. 278–284. Cited by: §I, §III-A.
- [55] (2009) Atlas-based whole brain white matter analysis using large deformation diffeomorphic metric mapping: application to normal elderly and alzheimer’s disease participants. Neuroimage 46 (2), pp. 486–499. Cited by: §III-A.
- [56] (2017) Transformation-grounded image generation network for novel 3D view synthesis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 702–711. Cited by: §III-C.
- [57] (1999) Understanding the ”demon’s algorithm”: 3d non-rigid registration by gradient descent. pp. 597–605. Cited by: §III-A.
- [58] (2017) Optical flow estimation using a spatial pyramid network. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Vol. 2, pp. 2. Cited by: §III-C.
- [59] (2017) SVF-net: learning deformable image registration using shape matching. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 266–274. Cited by: §III-B.
- [60] (2015) U-net: convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241. Cited by: §IV-A.
- [61] (1999) Nonrigid registration using free-form deformation: application to breast mr images. IEEE Transactions on Medical Imaging 18 (8), pp. 712–721. Cited by: §III-A.
- [62] (2002) Hammer: hierarchical attribute matching mechanism for elastic registration. IEEE Transactions on Medical Imaging 21 (11), pp. 1421–1439. Cited by: §III-A.
- [63] (2017) Nonrigid image registration using multi-scale 3d convolutional neural networks. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 232–239. Cited by: §III-B.
- [64] (2016) Amortised map inference for image super-resolution. arXiv preprint arXiv:1610.04490. Cited by: §II.
- [65] (2013) Deformable medical image registration: a survey. IEEE transactions on medical imaging 32 (7), pp. 1153–1190. Cited by: §II.
- [66] (2013) Quantification and analysis of large multimodal clinical image studies: application to stroke. In International Workshop on Multimodal Brain Image Analysis, pp. 18–30. Cited by: §V.
- [67] (2010) Secrets of optical flow estimation and their principles. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2432–2439. Cited by: §III-C.
- [68] (2016) Fully-trainable deep matching. arXiv preprint arXiv:1609.03532. Cited by: §III-C.
- [69] (1998) Image matching as a diffusion process: an analogy with maxwell’s demons. Medical Image Analysis 2 (3), pp. 243–260. Cited by: §III-A.
- [70] (2016) Deep end2end voxel2voxel prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 17–24. Cited by: §III-C.
- [71] (2009) Diffeomorphic demons: efficient non-parametric image registration. NeuroImage 45 (1), pp. S61–S72. Cited by: §III-A.
- [72] (1997) Alignment by maximization of mutual information. International journal of computer vision 24 (2), pp. 137–154. Cited by: §II.
- [73] (2013) DeepFlow: large displacement optical flow with deep matching. In IEEE International Conference on Computer Vision (ICCV), pp. 1385–1392. Cited by: §III-C.
- [74] (2017) Quicksilver: fast predictive image registration–a deep learning approach. NeuroImage 158, pp. 378–396. Cited by: §III-B.
- [75] (2010) Learning task-optimal registration cost functions for localizing cytoarchitecture and function in the cerebral cortex. IEEE transactions on medical imaging 29 (7), pp. 1424–1441. Cited by: §III-A.
- [76] (2017) Advances in variational inference. arXiv preprint arXiv:1711.05597. Cited by: §II.
- [77] (2017) Frequency diffeomorphisms for efficient image registration. In International conference on information processing in medical imaging, pp. 559–570. Cited by: §III-A.
- [78] (2016) View synthesis by appearance flow. European Conference on Computer Vision (ECCV), pp. 286–301. Cited by: §III-C.

SUPPLEMENTARY MATERIAL