Datadriven reconstruction of Gravitationally Lensed Galaxies using
Recurrent Inference Machines
Abstract
We present a machine learning method for the reconstruction of the undistorted images of background sources in strongly lensed systems. This method treats the source as a pixelated image and utilizes the Recurrent Inference Machine (RIM) to iteratively reconstruct the background source given a lens model. Our architecture learns to minimize the likelihood of the model parameters (source pixels) given the data using the physical forward model (ray tracing simulations) while implicitly learning the prior of the source structure from the training data. This results in better performance compared to linear inversion methods, where the prior information is limited to the 2point covariance of the source pixels approximated with a Gaussian form, and often specified in a relatively arbitrary manner. We combine our source reconstruction network with a convolutional neural network that predicts the parameters of the mass distribution in the lensing galaxies directly from telescope images, allowing a fully automated reconstruction of the background source images and the foreground mass distribution.
,
1. Introduction
Gravitational lensing is a powerful probe for studying many different subjects in astrophysics and cosmology, including the mass distributions in galaxies (e.g. Treu & Koopmans, 2004; Vegetti et al., 2012; Hezaveh et al., 2016) and galaxy clusters (e.g. Hoekstra et al., 2013; Natarajan et al., 2017, and references therein), the internal structures and star formation properties of galaxies at high redshift (e.g. Marrone et al., 2018; Spilker et al., 2018), and the expansion rate of the universe (e.g. Refsdal, 1964; Blandford & Narayan, 1992; Suyu et al., 2014, and references therein). An integral part of all these studies is lens modeling, through which models of both the matter distribution in the lens and the morphology of the background source are obtained. This is traditionally done using maximum likelihood (or maximum aposteriori) methods, a procedure through which the values of the parameters describing these models are optimized by maximizing their posterior given the data.
Modeling the light distribution in the background sources requires an appropriate choice of parametrization. A common choice is to assume that the source takes on a simple parametric form (e.g., a Gaussian or Sersic profile Bussmann et al., 2013; Hezaveh et al., 2013; Spilker et al., 2016). The small number of parameters in these models are then explored using nonlinear samplers such as Markov chain Monte Carlo (MCMC) samplers and are typically well constrained by the large volume of observational data. However, for observations of complex background sources with high angular resolution and high signaltonoise ratio (SNR), these simple parametric profiles are often found to be inadequate. Adding additional parametric source components is in general difficult since as the dimensionality of the parameter space is increased, exploring this multidimensional and multimodal space using nonlinear optimizers becomes cost prohibitive (although see e.g., Brewer et al., 2011).
In Warren & Dye (2003), a method to model the background source as a pixelated image was developed, allowing a linear inversion to reconstruct the most probable image of the background source given a lens model and a particular form and strength of regularization (which becomes essential to avoid overfitting the data). This method decomposes the modeling procedure into a nonlinear exploration of the lens parameters and a linear reconstruction of the source pixels at each step. Suyu et al. (2006) formulated this method in a Bayesian framework to objectively determine the regularization strength given a fixed lens model by maximizing the Bayesian evidence.
To allow for a linear inversion, these methods assume a quadratic logprior for the values of the source pixels. The three most commonly used forms of covariance matrices correspond to brightness, gradient, and curvature priors. These respectively enforce that either the pixel values, the gradients, or curvatures between adjacent pixels are drawn from a Gaussian distribution with mean 0 and variance given by a regularization constant. In other words, only a twopoint prior is usually imposed on sources, ignoring any knowledge of their higher order statistics. This can result in known issues, for example, allowing negative pixels or leakage of noise in the background source reconstructions and has been known to have the potential to bias the inferred parameters of the foreground structures (e.g., Nightingale et al., 2018). The choice of pixel shape and size in the source reconstruction has also been shown to have the potential to cause systematic issues.
Moreover, the significant computational cost of performing the many matrix inversions required to find the mostprobable parameters of the model in a complex, multidimensional parameter space makes these modeling frameworks slow and computationally expensive. Because the computational scaling of matrix inversions grows steeply with image size (roughly ), this expense will grow rapidly as the resolution and size of images continue to increase. For optical data the matrices are often sparse, allowing faster computations with sparse linear algebra libraries. However, for interferometric data, where the matrices are dense, these operations can be extremely costly.
In recent years, a number of complex lens modeling tools have been developed to mitigate the abovementioned issues (e.g., Nightingale et al., 2018). However, they typically include ad hoc procedures, for example in constructing the pixelization of the background sources or the iterative procedure through which the priors are determined. Because of their computational cost, testing for and characterizing their systematic errors is difficult and correcting them is in general nontrivial. These issues have inspired the exploration of alternative forms of parameterization for the background sources, using different basis functions, such as shapelets (Birrer et al., 2015; Birrer & Amara, 2018).
In this paper, motivated by the recent advances in deep learning, we investigate the use of a machine learning approach to reconstruct the images of the background sources of gravitational lenses. This is primarily driven by the fact that deep learning methods can learn complex priors from training examples (e.g., Ulyanov et al., 2018), circumventing the need to explicitly specify a regularization for the background sources. We explore the use of the recurrent inference machine (RIM; Putzky & Welling, 2017), an architecture combining convolutional neural networks (CNNs, LeCun et al., 1989) with recurrent neural networks (RNNs) to solve linear and nonlinear problems iteratively.
CNNs are a class of machine learning methods that are particularly suited to image processing applications. In general, a CNN processes input images through a number of layers to produce outputs of interest. At each layer, the image is convolved with a number of filters and acted on by a nonlinear activation function. The resulting feature maps are then fed to the following layer as an input. After a number of these layers, the output of the last layer is interpreted as the output of the network. The values of the convolutional filters, also called network weights, are learned through a process known as training, where pairs of correct inputoutput examples are shown to the network. The values of the network weights are determined by optimizing a cost function, reducing the difference between the truth and the network’s predictions. Given enough training examples, these networks can make accurate predictions on previously unseen examples using these learned parameters.
CNNs have recently seen a wide increase in use for astrophysical applications, including within the field of gravitational lensing, where CNNs have been used for performing both the tasks of lens finding (e.g. Lanusse et al., 2017; Jacobs et al., 2017; Petrillo et al., 2018; Pourrahmani et al., 2018; Schaefer et al., 2018), and lens modeling (Hezaveh et al., 2017; Perreault Levasseur et al., 2017; Morningstar et al., 2018), accelerating the inference of lens parameters by many orders of magnitude.
Unlike CNNs, which construct a mapping directly between a single input and output in a forward manner, RNNs operate on sequences of inputs and outputs. More specifically, they process each input through a group of layers, referred to as a cell, to both produce an output and update a hidden memory state. The next input in the series, which could sometimes be the previous output, is then processed through the same cell, which produces a new output and again updates the hidden state. This procedure can then be repeated for sequences of desired length, and through connections with the hidden state allows the network to exhibit dynamic temporal behavior. Therefore, much of the current use of RNNs is dedicated to the analysis of time series data (e.g. Naul et al., 2018; Charnock & Moss, 2017). However, because they are Turingcomplete (Siegelmann & Sontag, 1991, 1995) RNNs can be used to simulate algorithmic structures that perform any sequential process.
Numerical optimization of a function can be framed as such a sequential process: starting from an initial guess, a series of steps in the parameter space of a model are taken to arrive at a final point that optimizes the value of the target function. Recently, RNNbased architectures have been used in this way to learn the process of training of other machine learning models (i.e. metalearning, Andrychowicz et al., 2016). These approaches could equally be used to learn the process of optimization of other blackbox functions. The RIM (Putzky & Welling, 2017) is an implementation of such an architecture. In Morningstar et al. (2018) we used the RIM to deconvolve dirty interferometric images prior to lensing analysis. In this work, we explore the use of this network to reconstruct the images of background galaxies from their lensed noisy data.
2. methods
In this section, we detail the network architecture used and describe the training set and training strategy.
2.1. The Recurrent Inference Machine
The RIM framework is designed to solve linear problems of the form
(1) 
for , where the vector contains observed measurements, is a vector of parameters that we would like to infer, A is a known matrix mapping the parameters to the model, and is a vector of noise. In the case of interest here, represents the pixel values of the true, unlensed background source image, is a vector of the pixel values of the observed lensed image, A encodes the effects of lensing and observational blurring, and is a vector of additive Gaussian noise. In this model, the likelihood of given is proportional to
(2) 
where the matrix is the noise covariance matrix. Figure 2 shows a diagram of the structure of the analysis for source reconstruction.
The RIM solves equation 1 recursively. At a given time step , the RIM takes as input its current estimate of the true source image, , along with the gradient of the likelihood with respect to the current estimate, evaluated at , that is, . It then passes this pair through the network and outputs an update to the estimate , which is added to the current estimate to produce the new estimate,
(3) 
The network iteratively updates its estimate of the underlying true image, using the likelihood gradient and its current prediction to guide its trajectory, in a fashion analogous to Newton’s method of optimization.
Figure 2 shows a more detailed diagram of the structure of the RIM. The network cell consists of five layers. The first is an embedding layer that spatially downsamples the image (using a nonunity stride), but upsamples the number of features by producing more images in the channel dimension. The second layer is the main RNN cell, which is a convolutional Gated Recurrent Unit (GRU). The GRU shields its hidden memory state using two gates, one of which determines what to remove from the memory state (the reset gate) and the other of which determines what to add to the memory state (the update gate). The output of the GRU is passed to a spatially upsampling layer using a transpose convolution with the same stride as the embedding layer. The output of the spatial upsampling layer is passed to a second convolutional GRU. The fifth layer is the output layer, which downsamples along the channel dimension. For all layers, we use relatively large filters because the images are fairly smooth on small scales. The activation functions for the embedding, GRU, and spatial upsampling layers are all chosen to be hyperbolic tangent functions. The GRU additionally uses sigmoid activations on its gates. The output layer has a linear activation. We also choose to use an output nonlinearity when generating the prediction for the source. For this, we chose to use the sigmoid function, which has the advantage of enforcing that no output pixels have a value that is less than zero, meaning that the source can not have negative flux. We also found that it exhibits better performance compared to a rectified linear unit nonlinearity, which would impose the same physical requirements, but which does not have a restricted upper bound.
2.2. The Forward Model
The RIM takes the gradient of equation 2, that is, the likelihood of the trial source image given the observed lensed image , as an input. To compute this likelihood, a forward, or physical, model is required. The lens equation relates the position in the image plane to its corresponding position in the source plane
(4) 
where is the position in the source plane and is the position in the image plane. The reduced deflection angle is determined by the mass distribution in the lensing structures. Using the lens equation, and assuming a given interpolation scheme (typically bilinear), it is possible to construct a mapping between the source and image planes, which can be specified in matrix form. This matrix is known as the lensing distortion matrix L (see Treu & Koopmans, 2004, for how to calculate this matrix). Using this matrix, the forward model can be constructed as follows
(5) 
where is the (noiseless) observed lensed image, and is the true background source. The operator B is the blurring operator, which adds the effect of smearing to the image due to the point spread function (PSF). We can then write the loglikelihood () of the predicted source given an observation as
(6) 
The gradient of the loglikelihood, which is used in practice by the network, is then given by
(7) 
The full input to the RIM at each time step is then obtained by stacking this with the current prediction of the source, , in the channel dimension.
2.3. The Training Set
The network was trained on 200 000 simulated strong lensing images. The procedure used to generate this training data is described in detail in Hezaveh et al. (2017), but we will reiterate the major points here. Lens models were simulated using the Singular Isothermal Ellipsoid (SIE; Kormann et al., 1994) plus external shear. The SIE and shear model is composed of seven parameters that fully describe the deflections: the Einstein Radius , the and components of the ellipticity , the position of the center of mass of the lens (,), and the and components of the shear . Lens models were simulated using randomly drawn values for these parameters.
The background sources were taken to be images of galaxies from the GalaxyZoo and GREAT03 datasets. The positions and sizes of these sources were randomly chosen, such that the minimum flux magnification found in the training set was 3. We also imposed that the flux of the lensed sources does not fall off of the image. The scaled images of the source were then interpolated onto a uniform grid of size centered at the origin, so that a fixed pixel size could be used without scaling or shifting the source grid. Both the lensed images and the background source images contain 192 pixels on each side. The lensed images use a pixel size of , and the source images use a pixel size of approximately .
Observational effects were then added randomly to the images at training time, in such a way that the network never saw the same training image twice. These effects include blurring with a PSF, approximated as a twodimensional Gaussian function with a width randomly chosen between 0 (no PSF smearing) and , and addition of noise, drawn from a Gaussian distribution with an RMS given by a uniform distribution between 0.1% to 10% of the peak surface brightness, resulting in an SNR of between 10 and 1000 (more heavily weighted toward low SNR). To compute the model likelihood, both the true PSF and noise covariance matrix are provided to the forward model.
2.4. Training
To train the network, we use the Adam optimizer (Kingma & Ba, 2014) with a learning rate of . This learning rate is decreased exponentially with a decay rate of 0.96 and a decay timescale of 5000 training steps. We also employ gradient clipping in the RIM to avoid exploding gradients. We optimize the rootmeansquared error summed over all pixels as a cost function.
In order to compute the lensing distortion matrix in the forward model, the parameters of the deflection must be provided to the network. During training, we gave the network these parameters since they are known from the simulations. However, when applying this technique to data this may not be possible, as the parameters of the lensing may not be known a priori. Therefore, we also experimented with using lens model parameters that were predicted by a feedforward convolutional neural network as described in Perreault Levasseur et al. (2017). This network produces an estimate of the mean and standard deviation of the marginalized network posterior of the lens model parameters. By using this to predict the lens model for the RIM, we are able to reconstruct the source without specifying any inputs other than the observed image, noise covariance matrix, and PSF.
3. Results and Discussion
Once the network is trained, we test its performance on a simulated test set. For this, we use a validation set of 2000 images with background sources and lens models previously unseen by the network. These images are each given a randomly generated PSF and noise realization drawn from the same distributions as the ones used to produce the training set.
We first examine the performance of the RIM in the absence of errors in the lens model. This should provide an assessment of the peak performance of the network when the assumed distortion matrix is the correct lens mapping. Several examples of this are shown in Figure 3. We find that the reconstructions are visually representative of the true sources and that the RMS error of the network prediction is less than 0.5% of the peak surface brightness of the source.
As a separate metric, we compute the flux of the reconstructed sources (via a sum over all pixels) and compare this to the known flux of the ground truth images. We find that the predicted flux of the reconstructed source (determined by summing over all source plane pixels) has a median absolute error of 3%. The bias in recovered flux is 0.07%, substantially lower than the random scatter. The predicted flux is shown against the true flux in Figure 4.
Finally, as an additional metric of accuracy, we computed the coherence spectrum of our predicted and true images. The coherence spectrum is defined as
(8) 
where is the wavenumber, is the cross correlation of the images, and , are the autocorrelation functions of the two images. The coherence spectrum is used to compare the similarity between two regularly sampled signals as a function of frequency. A coherence of implies perfect correlation between the two images, and a coherence of implies no correlation. We plot the coherence spectrum as a function of frequency, averaged over 2000 images in our test set in Figure 5. We find that our model exhibits perfect coherence for large spatial frequencies, which declines as the spatial frequency increases. For spatial frequencies of roughly , our coherence is similar to that of two uncorrelated white noise realizations. We therefore deem our model effective up to spatial frequencies of . This scale is comparable to the size of fluctuations seen in the top right panel of Figure 6.
We also perform a linear pixelated source inversion on a similar source grid ( pixels), assuming a gradient source covariance matrix. We determine the strength of the prior following the procedure described in Suyu et al. (2006). In Figure 5 we additionally show the coherence spectrum of the reconstructed sources obtained in this way and the ground truth, averaged over all images in the test set. We note that on all scales, the RIM has a higher coherence with the ground truth than the linear inversion. Figure 6 compares the true source morphology with both the reconstructed source obtained trough semilinear inversion and the ones given by the RIM. We note that, in linear inversions, even for the optimal strength of the regularization the resulting reconstructions contain high levels of noise, which is leaked into the source pixels with intensities as high as several percent of the peak source brightness. In addition, some pixels have negative values. This makes tasks such as estimating the intrinsic flux of the sources or analyzing their properties more difficult. The RIM reconstructions, however, succeed at suppressing the leakage of noise into the source pixels where no flux is present. We attribute this to the network learning higher order priors from the training images, which enforce zero flux over such areas.
We evaluate the logprior of the linear model for the two reconstructions (from linear inversion and RIM) and the true background source. We find that the true source has a negative logprior value 85 higher than the source found through linear inversion. It is not surprising that the latter has a low value, since the process of optimizing the posterior explicitly optimizes the prior. However, the fact that the true source is about 13 worse than the mostprobable semilinear source suggest that the prior, even with the optimal strength, is inconsistent with the statistics of the true source. Interestingly, the source reconstructed using the RIM gives a prior value only 0.2 away from the true source, which suggest that it has the same statistics under this prior.
In the above discussion, it was assumed that the true parameters of the mass distribution in the lensing structures were known (these parameters were used in the forward model to predict the model observations given a proposed model of the background source, as given by equation 6). A question of interest is if the reconstruction of the background source behaves well when a slightly incorrect lensing distortion is used. To explore this, we use the network from Perreault Levasseur et al. (2017) to predict the lens model, and use these parameters to evaluate the forward model likelihood. Because the CNN introduces errors in the lens model, this can demonstrate the robustness of the RIM against imperfect foreground distortion parameters. We show three example reconstructions in Figure 7. In roughly 60% of cases, the estimated lens model is accurate enough that the performance of the RIM is unaffected by the CNN. In the remaining cases, we find a modest decrease in performance. The median error in flux, for example, climbs to 9% (see Figure 4) and the RMS pixel error climbs from 0.5% to 6%. We also find that there are some (albeit rare) cases where the RIM returns morphologies that are inconsistent with the ground truth. This occurs when the lens model parameters given by the CNN are significantly discrepant from the truth. In these cases, the failure is easily diagnosed by examining the model residuals, which are strongly inconsistent with noise, or by calculating the loglikelihood (see e.g., Figure 7).
We then use a downhill optimizer to reoptimize the parameters of the foreground mass distribution (to correct the errors produced by the feedforward CNN), by minimizing the likelihood, at every step, calculating the likelihood with a background source produced by the RIM. This, in essence, becomes a traditional lens parameter optimization procedure but for the fact that the linear inversion of the source is replaced by the RIM reconstruction. We find that our optimizer converges to the true values within tens to a few hundred steps, even in cases where the initial lens model parameters were significantly discrepant from the truth (Figure 8). The fact that the optimizer converges implies that the source reconstruction network is able to generalize and produce results that preserve the shape of the gradient of the likelihood in neighbourhoods around the true foreground parameters.
This also hints at the possibility for easy automation of the lens modeling process, at least assuming a known form of parameterized model for the lensing distortion. A simple example workflow for an automated lens modeling system is described in algorithm 1. While it is a fairly unsophisticated and simple process, we found that this workflow was sufficient to achieve convergence in all of our tests. Future works will examine generalizing this example workflow to accommodate more complex lens models, including those with multiple lens mass components.
We then explore if it is possible to sample the posterior of the lens model parameters with an MCMC sampler, while performing the background source reconstructions with the RIM. We use the affineinvariant algorithm of Goodman & Weare (2010), as implemented in ForemanMackey et al. (2013) to sample the posterior of the lens parameters. Figure 9 shows the corner plot of the samples. We find that the sampled parameters are well constrained and relatively Gaussian, and that the procedure can identify known degeneracies in the lens model (e.g., the degeneracy between ellipticity and external shear), again suggesting that the network can generalize to examples beyond (but close to) its training data, where the lens models are slightly incorrect. We also show the constraints on the lens model obtained using semilinear inversion to reconstruct the source. While both achieve results consistent with the ground truth, the RIM achieves higher precision. We attribute this to the more constraining prior learned by the network.
In this work, we only obtained a point estimate for the pixelated morphologies of background sources without obtaining their uncertainties. The uncertainties of the predictions of neural network is an active area of research. Previously in Perreault Levasseur et al. (2017) we demonstrated the use of variational inference in obtaining approximate uncertainties for these predictions. In the future we plan to extend this method to the predictions of the RIM to obtain an estimate of the uncertainties of the predictions.
Machine learning systems are often unable to deal with generalization to different tasks. In particular, feedforward CNNs perform substantially worse when the test data distribution is different from the training data distribution. Because we use images of particular subsets of galaxies from the GalaxyZoo and GREAT3 challenge data, it is possible that the galaxies that make up the background sources of real gravitational lenses possess different morphologies than the galaxies used in our training set. However, we speculate that the differences between the training and test distributions matter less for the RIM than they do for a typical feedforward CNN, because it learns a procedure to optimize the likelihood given a set of observed data. Therefore, the RIM may still perform adequately, even if its test data are significantly different than the training data. To test this hypothesis, we examined the performance of the network on lensed images, where the background source was an example far from the training data. For this, we chose to use an image of text. Figure 10 shows the results of this experiment. The top left panel shows the data (lensed noisy image of a text). The right and middle lower panels show the reconstructions using linear inversion and RIM respectively. Remarkably, we found that RIM performed reasonably well even though the local and global structures of the text image are very different that the unlensed galaxies used for training. The reconstruction identifies the positions of letters, but due to smearing by the PSF, it appears to be unable to produce a legible image. We have confirmed this result using images of handwritten numbers from the MNIST dataset (LeCun et al., 1998). Even in a regime outside of its training set, the RIM appears to perform well compared to a linear inversion.
Since the only input of the network is the gradient of the loglikelihood with respect to the source pixels, one can easily generalize the application of this method to interferometric data by simply modifying the forward model to predict the model visibilities and use them to calculate the likelihood in the space. This is essentially equivalent to replacing matrix B in equation 5 with a matrix performing the Fourier transform. However, since only a forward prediction is required (i.e. the prediction of visibilities given a lensed source images is needed), this could be done using a fast Fourier transform (FFT). Given the extreme computation of cost of linear source inversions for large visibility sets, this could result in many orders of magnitude improvement in speed and computational cost.
4. Conclusions
We present a method that uses recurrent convolutional neural networks to recover the morphology of the background sources of gravitational lenses from telescope data. From our tests of this method we draw the following conclusions:

We found that the reconstructed sources predicted by neural network exhibit better fidelity to the ground truth images (measured using both meansquared error and the coherence spectrum) than linear inversion methods.

We showed that neural networks are able to reconstruct the image of the background source using only the image, the noise covariance matrix, and the point spread function; information easily discerned from the data itself. To perform these source reconstructions, we use a convolutional neural network to predict the parameterized form of the mass distribution in the lensing structures from the observed images.

We observed that small errors in the lens model can occasionally cause errors in the source reconstruction when predicting the lens model using a CNN. The errors on both the lens model and the source morphology can be corrected by modeling the image in a maximum likelihood fashion, using the RIM as a constraint on the source morphology.

We used our network as a source reconstruction module to sample the parameters of the mass distribution in the lensing structure using an MCMC procedure. The estimated parameters and uncertainties exhibit fidelity to the ground truth. We also sampled this posterior using a semilinear modeling method. The joint probability density of the parameters obtained from these two procedures exhibit similar degeneracies, however the RIM resulted in higher precision compared to semilinear models.

We tested the performance of our network on examples outside the training data, by both providing the network incorrect lens models and also requiring the networks to perform reconstructions of images of text. We found that the networks performed well in both instances, showing that they can potentially be robust against possible discrepancies between training and test data sets.

Our current work does not provide an estimate of the uncertainty of the reconstructions. In future works, we plan to investigate methods for obtaining the uncertainties of these reconstructions in a manner similar to the work presented in Perreault Levasseur et al. (2017).
References
 Andrychowicz et al. (2016) Andrychowicz, M., Denil, M., Gomez, S., et al. 2016, in Advances in Neural Information Processing Systems, 3981
 Birrer & Amara (2018) Birrer, S., & Amara, A. 2018, ArXiv eprints, arXiv:1803.09746
 Birrer et al. (2015) Birrer, S., Amara, A., & Refregier, A. 2015, ApJ, 813, 102
 Blandford & Narayan (1992) Blandford, R. D., & Narayan, R. 1992, ARA&A, 30, 311
 Brewer et al. (2011) Brewer, B. J., Lewis, G. F., Belokurov, V., et al. 2011, MNRAS, 412, 2521
 Bussmann et al. (2013) Bussmann, R. S., PérezFournon, I., Amber, S., et al. 2013, ApJ, 779, 25
 Charnock & Moss (2017) Charnock, T., & Moss, A. 2017, ApJ, 837, L28
 ForemanMackey et al. (2013) ForemanMackey, D., Hogg, D. W., Lang, D., & Goodman, J. 2013, PASP, 125, 306
 Goodman & Weare (2010) Goodman, J., & Weare, J. 2010, Communications in Applied Mathematics and Computational Science, 5, 65
 Hezaveh et al. (2017) Hezaveh, Y. D., Levasseur, L. P., & Marshall, P. J. 2017, Nature, 548, 555
 Hezaveh et al. (2013) Hezaveh, Y. D., Marrone, D. P., Fassnacht, C. D., et al. 2013, ApJ, 767, 132
 Hezaveh et al. (2016) Hezaveh, Y. D., Dalal, N., Marrone, D. P., et al. 2016, ApJ, 823, 37
 Hoekstra et al. (2013) Hoekstra, H., Bartelmann, M., Dahle, H., et al. 2013, Space Sci. Rev., 177, 75
 Jacobs et al. (2017) Jacobs, C., Glazebrook, K., Collett, T., More, A., & McCarthy, C. 2017, MNRAS, 471, 167
 Kingma & Ba (2014) Kingma, D. P., & Ba, J. 2014, CoRR, abs/1412.6980, arXiv:1412.6980
 Kormann et al. (1994) Kormann, R., Schneider, P., & Bartelmann, M. 1994, A&A, 284, 285
 Lanusse et al. (2017) Lanusse, F., Ma, Q., Li, N., et al. 2017, Monthly Notices of the Royal Astronomical Society, 473, 3895
 LeCun et al. (1989) LeCun, Y., Boser, B., Denker, J. S., et al. 1989, Neural computation, 1, 541
 LeCun et al. (1998) LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. 1998, Proceedings of the IEEE, 86, 2278
 Marrone et al. (2018) Marrone, D. P., Spilker, J. S., Hayward, C. C., et al. 2018, Nature, 553, 51
 Morningstar et al. (2018) Morningstar, W. R., Hezaveh, Y. D., Perreault Levasseur, L., et al. 2018, ArXiv eprints, arXiv:1808.00011 [astroph.IM]
 Natarajan et al. (2017) Natarajan, P., Chadayammuri, U., Jauzac, M., et al. 2017, MNRAS, 468, 1962
 Naul et al. (2018) Naul, B., Bloom, J. S., Pérez, F., & van der Walt, S. 2018, Nature Astronomy, 2, 151
 Nightingale et al. (2018) Nightingale, J., Dye, S., & Massey, R. J. 2018, Monthly Notices of the Royal Astronomical Society, 478, 4738
 Perreault Levasseur et al. (2017) Perreault Levasseur, L., Hezaveh, Y. D., & Wechsler, R. H. 2017, ApJ, 850, L7
 Petrillo et al. (2018) Petrillo, C. E., Tortora, C., Chatterjee, S., et al. 2018, MNRAS, arXiv:1807.04764
 Pourrahmani et al. (2018) Pourrahmani, M., Nayyeri, H., & Cooray, A. 2018, ApJ, 856, 68
 Putzky & Welling (2017) Putzky, P., & Welling, M. 2017, arXiv preprint arXiv:1706.04008
 Refsdal (1964) Refsdal, S. 1964, MNRAS, 128, 307
 Schaefer et al. (2018) Schaefer, C., Geiger, M., Kuntzer, T., & Kneib, J.P. 2018, A&A, 611, A2
 Siegelmann & Sontag (1991) Siegelmann, H. T., & Sontag, E. D. 1991, Applied Mathematics Letters, 4, 77
 Siegelmann & Sontag (1995) —. 1995, Journal of computer and system sciences, 50, 132
 Spilker et al. (2016) Spilker, J., Marrone, D., Aravena, M., et al. 2016, ArXiv eprints, arXiv:1604.05723
 Spilker et al. (2018) Spilker, J. S., Aravena, M., Béthermin, M., et al. 2018, Science, 361, 1016
 Suyu et al. (2006) Suyu, S. H., Marshall, P. J., Hobson, M. P., & Blandford, R. D. 2006, MNRAS, 371, 983
 Suyu et al. (2014) Suyu, S. H., Treu, T., Hilbert, S., et al. 2014, ApJ, 788, L35
 Treu & Koopmans (2004) Treu, T., & Koopmans, L. V. E. 2004, ApJ, 611, 739
 Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., & Lempitsky, V. 2018, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Vol. 18
 Vegetti et al. (2012) Vegetti, S., Lagattuta, D. J., McKean, J. P., et al. 2012, Nature, 481, 341
 Warren & Dye (2003) Warren, S. J., & Dye, S. 2003, ApJ, 590, 673