A unified framework for 21cm tomography sample generation and parameter inference with Progressively Growing GANs
Abstract
Creating a database of 21cm brightness temperature signals from the Epoch of Reionisation (EoR) for an array of reionisation histories is a complex and computationally expensive task, given the range of astrophysical processes involved and the possibly highdimensional parameter space that is to be probed. We utilise a specific type of neural network, a Progressively Growing Generative Adversarial Network (PGGAN), to produce realistic tomography images of the 21cm brightness temperature during the EoR, covering a continuous threedimensional parameter space that models varying Xray emissivity, Lyman band emissivity, and ratio between hard and soft Xrays. The GPUtrained network generates new samples at a resolution of in a second (on a laptop CPU), and the resulting global 21cm signal, power spectrum, and pixel distribution function agree well with those of the training data, taken from the 21SSD catalogue (Semelin et al., 2017). Finally, we showcase how a trained PGGAN can be leveraged for the converse task of inferring parameters from 21cm tomography samples via Approximate Bayesian Computation.
keywords:
dark ages, reionization, first stars – methods: numerical – methods: statistical1 Introduction
The 21cm spectral line holds a great potential for probing the Dark Ages, the formation of the first generations of stars, and the Epoch of Reionisation (EoR; see Furlanetto et al., 2006; Morales & Wyithe, 2010; Pritchard & Loeb, 2012, for reviews of 21cm cosmology). It is produced by a spinflip transition of H gas in the intergalactic medium (IGM) due to the absorption of Cosmic Microwave Background (CMB) photons. This signal from the early Universe, which is received on Earth highly redshifted in the radio frequencies, is a dormant treasure trove that is expected to reveal a plethora of information on inflationary and dark sector physics (Furlanetto et al., 2019a), the matter power density after recombination (Loeb & Zaldarriaga, 2004), and the reionisation history of the Universe (Furlanetto et al., 2019b).
Since the signal is concealed by foreground sources that can be 5 magnitudes brighter (Pober et al., 2013; Dillon et al., 2014), its detection is extremely challenging, and it is only recently that the first tentative absorption profile from the Dark Ages has been measured in the global (skyaveraged) signal by EDGES (Experiment to Detect the Global EoR Signature, Bowman et al., 2018). Currently operating interferometers such as LOFAR (Low Frequency Array, van Haarlem et al., 2013), GMRT (Giant Metrewave Radio Telescope, Paciga et al., 2013), MWA (Murchison Widefield Array, Tingay et al., 2013), and PAPER (Precision Array for Probing the Epoch of Reionization, Parsons et al., 2010) are not sensitive enough to map the spatial fluctuations of the 21cm signal and aim instead at characterising the signal by means of statistics such as the power spectrum (Mellema et al., 2015), for which several upper limits at different redshifts have been derived (e.g. Beardsley et al., 2016; Patil et al., 2017; Barry et al., 2019; Eastwood et al., 2019; Gehlot et al., 2019; Kolopanis et al., 2019; Li et al., 2019). However, future radio interferometer experiments such as the SKA (Square Kilometer Array, Mellema et al., 2013) and HERA (Hydrogen Epoch of Reionization Array, DeBoer et al., 2017) will usher in an era of 21cm imaging.
With these powerful telescopes coming online within the next few years, accurate and fast emulators are needed that generate templates for a wide range of astrophysical parameters and reionisation models. The EDGES signal shows a trough with an amplitude more than twice as large as can be accommodated by the most extreme predictions (Cohen et al., 2017), which, if confirmed, could hint at an excess radio background (e.g. Feng & Holder 2018; Fialkov & Barkana 2019) from a so far unknown population of highredshift radio sources such as black holes (EwallWice et al., 2018) or supernovae (Jana et al. 2019, but see Sharma 2018 for constraints on the cooling time when resorting to astrophysical explanations), or at exotic physics such as scattering between dark matter and baryons (Barkana, 2018; Hirano & Bromm, 2018; Slatyer & Wu, 2018). Whether or not this signal will be substantiated by other instruments, predicting signals for all physically plausible models and conducting parameter studies is vital in order to be prepared for upcoming observations. However, generating mock samples is a challenging task: the 21cm signal depends both on cosmology and on astrophysical processes, which are not well understood to date. Moreover, running computationally expensive largescale cosmological simulations with radiative transfer requires both big volumes as well as high resolutions in order to resolve the ionising sources, which often makes the exploration of a large parameter space infeasible. Approximate methods for the computation of the 21cm signal that do not invoke 3D radiative simulations have been developed, e.g. 21cmFast (Mesinger et al., 2011) and SimFast21 (Santos et al., 2010), whereas methods such as Bears (Thomas et al., 2009) postprocess darkmatteronly simulations to generate 21cm maps.
Machine learning techniques lend themselves to vastly speeding up the creation of mock samples as well as carrying out fast and accurate parameter inference. In the following, we will briefly review the related literature: Kern et al. (2017) present an emulator for the 21cm power spectrum based on Gaussian processes and use it to forecast the constraining power of HERA. A neuralnetworkbased power spectrum emulator is introduced in Schmit & Pritchard (2018), which can be harnessed for accelerating the evaluation of the likelihood function during Bayesian parameter inference; this present paper shares the spirit of their approach in that the output of a generative model is used for parameter inference. Chardin et al. (2019) predict the reionisation times in threedimensional cubes using convolutional neural networks (CNNs). Cohen et al. (2019) generate smooth global 21cm signals with a neural network that is trained to predict the strength of the principal components depending on seven astrophysical parameters. A complementary approach to the work herein, which is based upon generative adversarial networks (GANs) as well, has been presented by ZamudioFernandez et al. (2019). In that paper, a GAN is trained on 3D cubes from the IllustrisTNG simulation (Pillepich et al., 2018) and learns to generate new realisations of the H distribution. In terms of statistics, the quality of GANgenerated samples exceeds that of Halo Occupation Distributions. In contrast to modelling the sources of 21cm emission, our method addresses the prediction of the signal as received on Earth.
While the aforementioned works consider the task of artificial data creation, the reverse problem of machinelearningaided parameter inference has also been studied in the recent years. Pioneering the use of neural networks for the analysis of the 21cm signal, Shimabukuro & Semelin (2017) train a neural network on 21cm power spectra to infer astrophysical parameters. A comprehensive comparison of different machine learning methods for parameter estimation from the 21cm power spectrum is carried out in Doussot et al. (2019). To date, the majority of machine learning contributions to 21cm cosmology has focused on the power spectrum, which is arguably the simplest and most powerful statistics when it comes to constraining EoR models. However, unlike the CMB, the 21cm signal is highly nonGaussian (e.g. Majumdar et al. 2018), and valuable additional information is encoded in the full tomographic signal. In this spirit, Gillet et al. (2019) consider the recovery of astrophysical parameters by means of a neural network that is trained directly on 21cm images. La Plante & Ntampaka (2019) determine the EoR duration from 2D images in the plane of the sky using a CNN within an error of per cent, assuming a wellconstrained midpoint of the EoR. A CNN is also employed by Hassan et al. (2019), who infer cosmological and astrophysical parameters from 21cm mock maps with SKAlike noise. A neural network classifier that discerns galaxydominated and Active Galactic Nuclei (AGN)dominated models from images in the presence of realistic noise is presented in Hassan et al. (2017).
In this work, we employ a neural network, to be more specific a Progressively Growing Generative Adversarial Network (PGGAN) (that we name 21cmGAN), for the creation of twodimensional tomographic 21cm signals, conditioned on a threedimensional parameter vector.
Our implementation of 21cmGAN in Tensorflow (Abadi et al., 2016), as well as the trained neural network are publicly available
This paper is structured as follows: in Section 2, we give a bird’s eye view of the evolution of the 21cm signal between Recombination and Reionisation. In Section 3, we introduce the public data set from Semelin et al. (2017, henceforth 21SSD), which we use as training data for our neural network. Then, we turn towards generative models in Section 4 and summarise the major concepts behind (PG)GANs. In Section 5, we present our results and assess the quality of the generated samples by analysing the global 21cm signal, the power spectrum at fixed redshift, and the pixel distribution function (PDF). In Section 6, we show how parameter inference can be done using the trained PGGAN. To this end, we consider two methods that use different parts of the neural network (generator and critic). We conclude this paper in Section 7.
2 21cm tomography
2.1 21cm brightness temperature
The differential brightness temperature of the 21cm line against the CMB is commonly written as (e.g. Furlanetto et al. 2006)
(1)  
where is the optical depth, is the spin temperature of the 21cm transition, is the background temperature of the CMB, is the local H fraction, is the Hubble parameter as a function of redshift , is the local fractional overdensity, and denote the fractional energy content of baryonic matter and total matter, respectively, and is the dimensionless Hubble constant defined by . Moreover, is the velocity gradient along the line of sight, which entails redshift space distortions (Barkana & Loeb, 2005). The locality of the noncosmological quantities in the above equation is what causes the spatial variations of .
The spin temperature is related to the ratio of number densities in each of the two hyperfine states (1s singlet) and (1s triplet) and is given as
(2) 
where . As derived by Field (1958), the spin temperature can be written as a weighted sum of the CMB temperature , the kinetic temperature , and the colour temperature :
(3) 
Note that an alternative convention expresses the inverse spin temperature as the weighted sum of inverse temperatures. The colour temperature is tightly coupled to the kinetic temperature, i.e. (Field, 1959).
The interplay between mechanisms that drive towards or , respectively, is expressed in the equation above by the coupling coefficients (accounting for atomic collisions) and (accounting for the Wouthuysen–Field effect, Wouthuysen 1952; Field 1958). This decomposition is motivated by the following considerations (see e.g. Furlanetto et al. 2006; Pritchard & Loeb 2012; Mesinger et al. 2011 for more detailed explanations): In the postrecombination Universe between , the IGM is collisionally coupled to the CMB via Compton scattering between CMB photons and leftover free electrons, leading to and therefore . Afterwards, the IGM cools adiabatically according to , faster than the CMB whose temperature scales as . Hence , and the 21cm signal can be detected for the first time in absorption. As the IGM gradually disperses due to the expansion of the Universe, collisional cooling becomes inefficient, and the spin temperature approaches again the CMB temperature, reaching equality at where .
The next transition is brought about by the onset of the first astrophysical sources, which start coupling again to . This coupling arises from the Wouthuysen–Field effect due to the background Lyman radiation. Since , the 21cm signal is once more in absorption, marking the advent of the Cosmic Dawn. Spatial fluctuations of the 21cm signal emerge from inhomogenities of the Lyman background, as well as from density variations. The Lyman coupling gradually intensifies until it eventually saturates. From then on, the 21cm signal is no longer susceptible to Lyman fluctuations. Xray heating from astrophysical sources leads to growing temperature differences between hot, overdense regions and the cold bulk of the gas, and the 21cm signal from heated regions can be seen for the first time in emission. Gas temperature fluctuations now contribute to the brightness temperature fluctuations.
As the Xray heating continues and becomes dominant over cosmic expansion in the entire IGM, surpasses and keeps rising, eventually giving . The 21cm line is globally in emission, and the contribution of the spin temperature in Equation (1) becomes negligible. As reionisation proceeds and an increasing number of regions become ionised, fluctuations are sourced by spatially varying ionisation fractions. With the end of reionisation, the 21cm signal essentially disappears (see e.g. Wyithe & Loeb 2008 for the postreionisation signal from damped Lyman systems). Note however that the timing, duration, and order of the transitions involving astrophysical sources are uncertain and modeldependent (Pritchard & Furlanetto, 2007).
3 The 21SSD data set
We use the publicly available data set from 21SSD that contains lightcones of the 21cm brightness temperature during the EoR in a redshift range of to for a threedimensional parameter space. In what follows, we briefly summarise the fixed and varying parameters of the simulations, but the interested reader is referred to said paper for a detailed discussion.
3.1 Overview of the simulations
The lightcones have been produced from fully coupled radiative hydrodynamic simulations in a box of side length with particles, run with the parallel Tree+SPH code Licorice (Semelin & Combes, 2002; Semelin et al., 2007; Baek et al., 2009, 2010; Vonlanthen et al., 2011; Semelin, 2016). MonteCarlo raytracing on an adaptive mesh is used for the computation of Xray and UV radiative transfer, and both hard and soft Xray emission is enabled. The only feedback from star formation is photoheating. The Lyman flux is calculated in a postprocessing step on a fixed grid consisting of cells, emitting photons once every . Planck Collaboration (2016) cosmology is assumed, and the simulations start at .
Several astrophysical parameters have been fixed such that the simulation results comply with observational constraints (see Bouwens et al. 2015a; Bouwens et al. 2015b and references therein): a Kennicutt–Schmidt law with exponent 1 and models the star formation in regions with comoving overdensity , i.e. , and a Salpeter initial mass function (IMF) is adopted for the star population, truncated below and above , respectively. The escape fraction of ionising UV photons from unresolved structures is set to .
3.2 The threedimensional parameter space
The 21SSD data set explores a threedimensional parameter space, where the varying parameters are the Xray emissivity , the hardtosoft ratio of the Xray emission , and Lyman band emissivity .

: After the mean kinetic temperature of the neutral gas has decreased monotonously since recombination, Xray heating triggers a turnaround and eventually causes . Assuming that the local proportionality between Xray luminosity and star formation rate (SFR) holds for high redshifts, one can parametrise (Furlanetto et al., 2006)
(4) with the unknown renormalisation parameter accounting for the extrapolation to high redshifts.

: In the simulations, hard Xrays from Xray binaries receive a special treatment due to their large mean free path, which renders the tracking of all hard Xray photons over the course of the entire simulation unfeasible (see 21SSD for details). The description of their spectral properties follows Fragos et al. (2013). In contrast, AGNs produce soft Xrays between and , with a spectral index of . Soft Xrays are believed to carry most of the energy (Furlanetto et al., 2006), but their shorter mean free paths leads to a more localised energy deposition which generates spatial fluctuations. The fraction of hard Xrays is defined as
(5) where and are the Xray emissivities of Xray binaries and AGNs, respectively.

: In the 21SSD data set, the efficiency of the Lyman band emission is described by one free parameter . The energy emission from a stellar population within a frequency band is given by
(6) where is stellar mass, , , is the IMF, is the lifetime of a star as a function of its mass, is the energy emission per time and per frequency from a star of mass at frequency . The parameter is then defined as the proportionality factor between the energy as produced in the simulation and the theoretical energy emission in the relevant frequency band, that is
(7) with the Lyman frequency and Lyman limit frequency . Higher Lyman band emission efficiencies produce more prominent troughs and peaks in the differential brightness temperature.
3.3 Data creation for the PGGAN training
The 21SSD data set contains simulations for each of the possible combinations of the following parameter values:
thus 45 simulations in total. For each simulation, three highresolution lightcones of the differential brightness temperature are available, corresponding to observers aligned with the three coordinate axes, giving a total of 135 lightcones. The lineofsight dimension of 21cm images is special as it represents measurements at different frequencies, from which the associated redshifts and hence comoving distances can be inferred to reconstruct a 3D distribution. The frequency resolution is typically chosen such that the resulting spatial lineofsight resolution matches the angular one. This is also the case for the 21SSD lightcones whose cells have an equal that leads to almost cubic cells (of slightly varying comoving thickness). Although brightness temperature lightcones at ( pixels, with the lineofsight dimension being the longest) and ( pixels) resolution are provided in the public data set for convenience, we start from the highresolution lightcones ( pixels) in order to obtain more slices as a large data set size is crucial for deep learning. We cut slices from each lightcone by fixing pixels in the first and second dimension, respectively, yielding in total 2D tomography samples ( samples for each set of parameters) at a resolution of . The shorter (vertical) dimension of the 2D samples is perpendicular to the line of sight and covers a comoving length of , whereas the longer (horizontal) dimension is aligned with the line of sight and comprises a redshift range of to (from left to right). If 21cmGAN is to be applied to smaller data sets, one might resort to data augmentation methods such as mirroring. We subsequently downscale the samples by averaging over adjacent pixels to obtain a resolution of pixels. From these samples, we create versions from resolution to , corresponding to each stage of the PGGAN training (see Subsection 4.2).
4 PGGANs
The two most prominent representatives within the class of generative models are Variational AutoEncoders (VAEs, Kingma & Welling, 2013) and Generative Adversarial Networks (GANs, Goodfellow et al., 2014). While GANs are notoriously hard to train, they are more flexible and generally produce images of higher quality (e.g. Arora & Zhang, 2017, however, see Razavi et al. 2019 for recent progress on VAEs), which is why we choose them for this work.
4.1 GANs
The setting of a standard GAN in our context is the following: two players, the generator and the discriminator , compete in a minmaxgame against each other. Let be the probability distribution function of the differential brightness temperature images (where and stand for the height and width of the images)
(8) 
where the function is defined as
(9)  
While the mathematical theory of such problems is well established (e.g. Nash 1950), the practical training of GANs is intricate and often unstable (Arjovsky & Bottou, 2017). A problem commonly encountered is that of socalled mode collapse, which describes the situation when the generator output is but a small subset of the real data distribution . Whereas the loss is based upon the Jensen–Shannon divergence, different loss functions have been proposed that rely on other distance measures such as divergences (Nowozin et al., 2016). An interesting approach was introduced by Arjovsky et al. (2017), who advocate the use of the Wasserstein distance (also known as Earth Mover’s distance) that provides meaningful gradients even when and are far apart from each other. The Wasserstein distance between two probability distributions and is defined as
(10) 
where denotes the set of all joint distributions with marginals and , respectively, and . In the case of , the Wasserstein distance gives the minimum amount of work needed for moving the “probability mass” piled up according to to the distribution , where work is defined as the product of probability mass and distance by which the mass is moved. An equivalent characterisation with more practical relevance is obtained using the Kantorovich–Rubenstein duality, yielding
(11) 
where the supremum is taken over all 1Lipschitz continuous functions. While this Lipschitz constraint is achieved by clipping the weights of the neural network in Arjovsky et al. (2017), Gulrajani et al. (2017) propose to make use of the fact that a differentiable function is 1Lipschitz if and only if its gradients are bounded by 1 everywhere. This motivates replacing the weight clipping by a penalty term, giving rise to WGANGPs (Wasserstein GANs with Gradient Penalty). The appropriate loss function for our saddle point problem reads as
(12)  
Here, with a random number , and the factor determines the strength of the gradient penalty. Using the gradient penalty in lieu of the weight clipping has been shown to improve the image quality and convergence. In the context of WGANs, the discriminator is referred to as the critic since it no longer measures the probability of but now produces a score that gauges how real seems to the critic (in particular, the output of the critic is not confined to ). The aim of the critic is to maximise the difference between scores for real and fake samples.
4.2 Progressive growing
In this work, we present the emulation of 21cm tomography samples consisting of pixels. The corresponding angular resolution is (for , which is within the typical range of the SKA. As an example, the Baseline Design of SKA1Low (Dewdney et al., 2013) will have a projected noise of at a frequency of (corresponding to redshift ) on scales of (Mellema et al., 2015). With increasing resolution (and hence number of pixels per image), training GANs becomes increasingly time and memory consuming. Furthermore, the more pixels the discriminator has at its disposal, the easier it becomes to discern real and fake images, while the task of the generator to generate high quality output consisting of many pixels becomes harder. This can lead to an imbalance and cause the GAN training to fail. Recently, much work has been focused on methods for the generation of highresolution images with GANs, and applying techniques such as an intermediate latent space in the generator (Karras et al., 2018), selfattention layers (Zhang et al., 2018), spectral normalisation (Miyato et al., 2018), or a truncation trick (Brock et al., 2018) has led to remarkable results. A major breakthrough is the work by Karras et al. (2017, hereafter PGGAN), who progressively increase the resolution and number of layers during training, obtaining highquality images. Although our resolution of pixels is rather modest, we use a Progressively Growing GAN (PGGAN), mainly because of two reasons: firstly, we observe a lower tendency towards mode collapse when building up our GAN progressively and secondly, our neural network architecture can easily be extended to generate higher resolution tomography samples that might be needed for future interferometry experiments. Of course, creating 3D samples at the full resolution of the 21SSD catalogue of pixels with GANs is still a long way off.
During the training, new layers are gradually added to both the generator and the critic. This happens in a smooth manner, allowing the GAN to slowly transition to a higher resolution. Once part of the GAN, the weights of each layer remain trainable until the end of the training. While the GAN is trained on lower resolution images, downscaled images are shown to the critic. For more details on the progressive growing of the GAN, we refer the interested reader to PGGAN, whose approach we largely follow.
4.3 Network architecture
Our network architecture borrows from PGGAN, which is why we mainly describe the differences in what follows. The workflow of our PGGAN is schematically depicted in Figure 1, and the details are listed in Appendix A.
Our generator network is a CNN whose layers successively construct the image up to the desired resolution. We take the latent vector to be of length , where we fix the first three components to be the entries of and append independent and identically distributed drawings from a standard normal distribution, the latter forming the vector . Thus, we condition the generator by manually encoding information in the otherwise random latent vector. Whereas benchmark data sets for GANs are typically composed of square images, our images have an aspect ratio of . For this reason, we choose as the base resolution of our GAN. First, the latent vector is mapped to the base resolution via a fully connected layer. Then, convolutional layers and upsampling layers follow, which gradually increase the spatial resolution until the desired resolution at the respective stage of the training is reached.
For the critic, we use a structure that is essentially symmetric with respect to the generator architecture. Since the input for the critic is an image , we cannot simply append the threedimensional parameter vector to the input as we do for the generator. Instead, we concatenate the three parameters as constant additional image channels. We leave the implementation of more sophisticated conditioning methods for the critic as proposed by Miyato & Koyama (2018) for future work. While the critic determines autonomously which features are relevant for assessing the image quality, there are two major requirements for the generated samples that we point out: firstly, the global brightness temperature, that is the vertical average in the 2D case considered herein, should agree with the samples from the 21SSD catalogue. Secondly, the fluctuations of the mock images should exhibit the same statistics as their 21SSD counterparts. We encourage the critic to evaluate these properties by appending the vertically averaged brightness temperature and the difference towards it, i.e. , as additional image channels. Altogether, the input for the critic hence takes the shape (1 image channel, 2 extra channels, and 3 parameter channels).
4.4 Training
We train our PGGAN on 4 Nvidia Tesla Pascal P100 GPUs on the supercomputer Raijin, which is located in Canberra and is part of the Australian National Computational Infrastructure (NCI). We use a batch size of 8 per GPU and execute 40000 iterations for each growth stage of the PGGAN. The training alternates between transitional stages during which the next higher resolution (with twice as many pixels in each dimension) is faded in, and stabilising stages, where the PGGAN is trained at the respective resolution. Starting at a resolution of , this means that the training consists of stabilising stages, separated by transitional stages, to arrive at resolution . This amounts to iterations in total, requiring of wall time. We take the WGANGP loss function as defined in Equation (12) with , plus a small additional penalty term in order to keep the critic scores close to zero and prevent them from diverging. The minimisation is done using an Adam optimiser (Kingma & Ba, 2014).
4.5 Practical usage
We briefly elaborate on how 21cmGAN can be utilised in practice for the generation of 21cm tomography samples and for aiding parameter inference. The first and simplest option is to use the trained PGGAN presented in this work, which readily produces tomography samples at a resolution of . If required, instrumentspecific noise and foregrounds can be added to the generated images. Another possibility is to retrain (a modified version of) 21cmGAN on a different set of tomography images. First, a parameter space needs to be defined that describes the processes impacting the 21cm signal. In this work, Xray emissivity, ratio of hard to soft Xrays, and Lyman efficiency are considered; however, alternative parameter spaces are spanned for instance by the ionising efficiency of highredshift galaxies (which itself depends on the escape fraction of ionising UV photons, the fraction of galactic gas in stars, the number of times a hydrogen atom recombines, and the number of ionising photons that baryons in stars produce), the mean free path of ionising photons in ionised regions, and the minimum virial temperature for star formation in haloes (Greig & Mesinger, 2015). Furthermore, different models for star formation and supernova feedback can be considered (Mesinger et al., 2016). Once the parameter space is set, a suitable sampling strategy is required, e.g. gridbased sampling, Latin Hypercube Sampling, or an adaptive metricbased approach as proposed by Eames et al. (2019). Then, 21cm samples are generated for the chosen parameter sets, either using a hydrodynamic simulation with radiative transfer or an approximate method. The produced samples (or 2D slices thereof) serve as training data for 21cmGAN, which learns the dependence of the 21cm tomography samples on the individual parameters. The trained neural network can create output samples for arbitrary points in parameter space (evidently, caution is needed when the neural network is evaluated on parameters that lie outside the range in parameter space on which the neural network was trained, or if the training data stems from an extremely coarse sampling of the parameter space).
The trained neural network represents a sampler from the probability distribution function , which, in contrast to , can be evaluated on a continuous subset in parameter space. This sampler can be utilised for backward modelling, that is parameter inference from a given tomography sample, as well: since the approximate likelihood function is analytically intractable, methods that rely on an expression for the likelihood such as MCMC cannot be applied, however, likelihoodfree methods such as Approximate Bayesian Computation (ABC) provide a suitable framework not only for finding the maximumlikelihood estimate for , but also for obtaining error estimates. This will be demonstrated in Section 6.
5 Results
5.1 Qualitative comparison
We start with a visual comparison between the samples generated by 21cmGAN and those from the 21SSD catalogue. Figure 2 shows random samples for different parameter sets, sorted by the strength of the emission in decreasing order (cf. Fig. 7 in 21SSD for samples at the full resolution of the catalogue). Evidently, the PGGAN has learned to map the input parameters to the correct redshifts and amplitudes of the absorption and emission regions. While the 21cm line is in emission almost everywhere at for , the emission region for consists of many jagged patches, which is well reproduced by the PGGAN. Also, the isolated bubbles in absorption at are captured correctly. The spatial distribution of emission and absorption regions is diverse, and we do not observe any indication of mode collapse. Altogether, the visual impression of the GANgenerated tomography samples is satisfactory, and it is difficult to distinguish them from their 21SSD counterparts by eye. More random samples for selected parameter vectors in Appendix B demonstrate the diversity of the PGGAN output and their resemblance to the 21SSD slices.
5.2 The global 21cm signal
Figure 3 shows the global 21cm signal as a function of redshift (solid lines, first legend column for the 21SSD samples, second column for the PGGAN samples), computed as the average over all pixel values of 6144 samples at each redshift. The 1region is shaded for PGGAN samples and delimited by dotted lines for the 21SSD samples. As already seen above, the global signal is highly sensitive to . Additionally, higher Lyman emissivities cause earlier and more prominent peaks and troughs. The parameter has an effect on the average kinetic temperature of the H gas (see Fig. 5 in 21SSD). We consider the same parameter sets as in Figure 2. The global signal of the samples from the PGGAN agrees well with that from the 21SSD catalogue for all parameter sets, and the maximum difference amounts to less than (and much less for models with high Xray emissivity). Also, the spread in the brightness temperature at each redshift is well reproduced, with low Xray models exhibiting a larger spatial variability than high Xray models. The fluctuations of the PGGAN brightness temperature follow those of the 21SSD samples, which suggests that extending the set of training images such that the resulting global signal is smoother would lead to a smoother global signal of the 21SSD samples as well.
5.3 Power spectrum
A standard statistic for measuring the strength of fluctuations is the power spectrum. Figure 4 shows the 1D power spectrum for two models bracketing the investigated range of and one model with an intermediate value of . The model with has the highest power on all scales at redshifts and . At , the intermediate model has the lowest power, while the power increases by . In contrast, the power of the model with decreases on large scales as the 21cm line changes from absorption (at ) to emission (at ). The GANgenerated samples possess a similar power spectrum to the 21SSD samples at both redshifts and capture the effects of the different parameters, although the spectra are a bit noisier on small scales.
5.4 NonGaussianities
Due to the strong nonGaussianity of the 21cm signal, the power spectrum alone is not sufficient for a complete description of the fluctuations. In the context of our PGGAN tomography emulator, it is therefore important that not only the power spectra match, but also the pixel distribution functions (PDFs) at each redshift. The PDF gives the distribution of the occurring brightness temperatures at each redshift. Figure LABEL:fig:PDF shows the PDF for the same models and redshifts as in Figure 4. The PDF of the model with the lowest Xray emissivity has the widest temperature spread, which is clearly nonGaussian but negatively skewed at . In contrast, the model with has a narrow distribution that is positively skewed at . The PDFs from the GANgenerated samples closely resemble those from their 21SSD brethren; in particular, the width, height, and skewness are in excellent agreement.
Alternative ways of characterising the nonGaussian structure of the 21cm fluctuations are higher order statistics such as the bispectrum or topological measures, for instance Minkowski functionals, which are not in the scope of this work.
5.5 Interpolating in parameter space
Keeping the 509 randomly drawn components of the noise vector fixed, one can investigate how a particular sample changes as either of the reionisation parameters is varied. Figure 6 shows an exemplary interpolation in parameter space of a random sample. The three blocks of four samples correspond to varying , , and (from top to bottom), while the remaining two other parameters are kept fixed. The PGGAN was only trained on some of the depicted parameter vectors and has learned a continuous mapping from the parameters to the samples, rather than just memorising samples that it has seen during the training. The random noise governs the localisation of the stochastic brightness temperature fluctuations.
6 GANaided inference using Approximate Bayesian Computation
The trained PGGAN has learned a probabilistic mapping from the parameter space to realistic 2D tomographic samples. We now present two approaches for harnessing the PGGAN for the inverse problem, namely for inferring parameters from a tomographic sample. First, we briefly discuss uncertainty in the context of parameter inference from 21cm images.
6.1 Sources of uncertainty
In uncertainty quantification, one commonly distinguishes between two different types of uncertainty: aleatoric (statistical) uncertainty from the data and epistemic (systematic) uncertainty from the modelling. For the application of estimating astrophysical parameters from spatially resolved 21cm measurements from the EoR, aleatoric uncertainty comes from thermal noise and from sample variance. The latter is due to the individuality of the spatial fluctuations in each finite volume under consideration and is also present for the 3D 21cm lightcones in the 21SSD catalogue, each of which possesses slightly different statistics such as the 2D power spectrum. By going from 3D to 2D tomographic samples, the sample variance increases significantly since the information content per sample is reduced. When being trained, the GAN learns to generate samples whose particular spatial fluctuations are determined by the random noise vector , in such a way that the sampletosample variance agrees with that of the training data. Conversely, accurate inference of parameters from tomographic samples is limited by the sample variance. This being said, even for upcoming observations with the SKA, thermal noise will be a greater source of stochasticity as compared to sample variance (Koopmans et al., 2015). None the less, we focus in this section on aleatoric uncertainty stemming from sample variance as this is the stochasticity that the GAN has learned to generate in the training process. The extension to additional uncertainty due to thermal noise is discussed in Subsection 6.5.
Since aleatoric uncertainty is inherent to the tomography samples, it does not decrease as the training of the GAN proceeds. This is in contrast to the epistemic uncertainty, which describes the ignorance about the correct model that explains the data. For neural networks, it is determined by the spread of the posterior distributions of the neural network weights. In the course of the GAN training, the epistemic uncertainty should gradually decline as the distribution produced by the GAN approaches the true distribution, i.e. . The estimation of the epistemic uncertainty is out of the scope of this work, and we simply take the credible regions obtained for the parameters at face value.
6.2 ABC rejection sampling
A popular class of techniques that allow parameter inference given a forward simulator (the PGGAN in our case) is known as Approximate Bayesian Computation (ABC). Unlike MCMCtype methods, ABC is a likelihoodfree way of inverse modelling, meaning that an analytic expression of is not required. For simplicity, we use ABC rejection sampling, which is the easiest representative of ABC methods, but more sophisticated schemes such as ABC Sequential Monte Carlo (ABCSMC) or Bayesian Optimization for LikelihoodFree Inference (BOLFI, Gutmann & Corander, 2015) can be applied without difficulties.
Let be a tomographic sample belonging to a true unknown parameter vector . Furthermore, let be the prior distribution of the parameters, and let be a discrepancy measure between the observed sample and a generated sample . The underlying idea of ABC rejection sampling is simple: generate samples for random parameter vectors and compare them to . If a generated sample is similar to , i.e. is sufficiently small, the parameters should be similar to .
Generatorbased approach
With the generator network from our trained PGGAN as a “black box” simulator, random samples drawn from can be readily generated. Algorithm 1 lists schematically the steps needed to obtain an approximation of the posterior distribution . Due to the randomness of the sample generation, the discrepancy is not a deterministic function of the parameter vector (assuming fixed ), but rather a random variable depending on the noise vector . A new random noise sample should be drawn for each parameter vector in order to avoid a bias due to the distinct features emanating from a fixed noise sample. Possible choices for are for instance a suitable norm between the power spectra or the PDFs. As , one expects the distribution defined by the accepted samples to gradually shift from the prior distribution to the posterior distribution . Note that the approximation made for the posterior distribution is twofold, namely , and the latter is approximated by a finite number of drawings of , which makes a careful sanity check of the obtained results mandatory.
Criticbased approach
For the generatorbased approach as described above, a suitable statistic needs to be defined based on which the proximity between a generated sample and is measured. Furthermore, drawing the same multiple times generally leads to different owing to the randomness of the generated images. Another approach for parameter inference in the ABC framework relies on the trained critic network of the PGGAN: recalling that the critic assesses the quality of each generated sample given an associated parameter vector, one expects the critic to assign a low score to a real image paired with a wrong parameter vector. This motivates the use of the critic score as a discrepancy measure for the ABC rejection sampling, as summarised in Algorithm 2. Now, is shown to the critic together with parameter vectors . Those that receive a high critic score are accepted, where . Thus, there is no need to generate new samples and to manually define a statistic for the calculation of . Moreover, depends deterministically on
6.3 Inference example
We repeat the training of the PGGAN from Section 5, but this time excluding all samples corresponding to the parameters from the training set (for all three values of since the effect of is less dominant than that of the other two parameters) in order to assess the ability of the PGGAN to interpolate to new points in parameter space, which is crucial for parameter inference.
As a test case, we take the sample shown in Figure 7, which arises from a moderate reionisation history expressed by the parameter vector . Since it is difficult to determine appropriate values for and a priori, we draw samples for each method and determine and a posteriori such that the number of accepted samples amounts to 512, which is roughly per cent. We checked that the resulting posterior distributions are sufficiently robust with respect to the accepted quantile, and accepting e.g. per cent instead of per cent gives similar results.
In this example, we take noninformative priors over the parameter range spanned by the 21SSD catalogue. These are given by a uniform prior for and loguniform priors for the scale variables and , i.e. , , and .
For the generatorbased approach, we take the discrepancy to be the Wasserstein distance between the brightness temperature distributions at each scale factor , integrated over the range of scale factors (or equivalently over the range of received frequencies). The scale factor is a proxy for the horizontal position within the tomography samples (recall that the cells in the samples have constant implying constant ). The motivation behind this choice of is the following: for each scale factor, similar parameter vectors are expected to give similar PDFs. Binning the values of at each scale factor, one could proceed by defining bintobin discrepancy measures such as the absolute value of the difference between the PDFs in each temperature bin, summed over all bins. However, given that only pixel values of are available at each scale factor for our 2D tomographic samples, very few values would be assigned to each temperature bin, and bintobin measures would depend sensitively on the chosen temperature bin size. Contrarily, the Wasserstein distance as a crossbin dissimilarity measure is much less sensitive to the chosen bin size: the amount of work required to move “probability mass” to an adjacent bin is small, and so is hence the resulting Wasserstein distance. To be more specific, the Wasserstein distance in the case of two Dirac delta distributions located at and collapses to and simply measures the difference between the two temperatures. We take the Wasserstein distance and the norm over the scale factor for consistency with the loss function of the GAN. The reason for evaluating the Wasserstein distance between the PDFs at each scale factor instead of the PDFs of all pixel values in the image is that two samples that have a very different evolution of the brightness temperature over time should be assigned a high discrepancy, even if the distribution of all pixel values taken in aggregate is similar. In fact, binning is not needed at all for the calculation of the Wasserstein distance, which can be done in 1D by means of a discrete form of the equivalent characterisation (Vallender, 1974):
(13) 
where and are the cumulative distribution functions of and , respectively. Thus, we take
(14) 
where , , and and are the PDFs of and at scale factor , respectively. The integrand is nonnegative since the Wasserstein distance defines a metric. Numerically, we compute the Wasserstein distance between the brightness temperature distributions at each image column, and sum up the results over the columns.
6.4 Results
Figure 8 shows the resulting estimates of the posterior distribution for the two different approaches. We stress again that for this PGGAN, all samples corresponding to were excluded from the training set, not only the particular sample . This is because in a realistic scenario, it is unlikely that the underlying parameter vector of an observed tomography sample lies exactly on the parameter grid of the training set.
Both approaches perform similarly well, although the generatorbased approach gives a somewhat more accurate estimate of the Lyman band emissivity . Given that the quantiles of the marginal prior distributions are and for and , respectively, the spread of the posterior distributions for and is small (see the titles of the marginal distributions in Figure 8), and the ABC inference has improved the bounds by a factor of a few. Determining the Xray hardtosoft ratio is harder due to its smaller imprint on the tomography samples as compared to the other two parameters (see Figure 6). The posterior distributions for both approaches peak close to the correct value , although the improvement with respect to the prior quantiles is small. This is in line with 21SSD who find that is difficult to constrain with the power spectrum and the PDF. The observed covariance between and is expected since increasing the Xray emissivity and decreasing the Xray hardness both results in a higher mean kinetic temperature of the gas.
For the generatorbased approach, our choice of is only one out of many wellmotivated discrepancy measures. We remark that replacing the integrand in Equation (14) with the absolute difference between the mean brightness temperatures at scale factor barely changes the results. In order words, simply comparing the means for each scale factor instead of the full temperature distributions does not result in a loss in discriminatory power. However, statistics that capture more information such as the one proposed herein might be useful for future highresolution 21cm imaging.
6.5 Inference in the presence of noise
Whereas we have assumed that the tomography sample with unknown parameters has the same quality as the training data in the inference above, real 21cm samples will be plagued by thermal noise and foregrounds that need to be removed. Therefore, it is vital to discuss how the above approaches carry over to a more realistic setting. For the generatorbased approach, accounting for noise is relatively straightforward: given an instrumentspecific noise model, random noise realisations can be added to the GANgenerated samples before calculating . We repeat the generatorbased parameter inference for the same sample, subject to Gaussian random noise with a redshiftdependent variance as it is expected from the SKA (depicted in Figure 7). We choose the same observational parameters as in 21SSD. For simplicity, we calculate the noise root mean square integrated over all wave numbers and subsequently draw random noise realisations in real space with SKAlike variance at each redshift (taking into account that the angular resolution for the fixedsize pixels varies with redshift), neglecting spatial correlations that would arise when carefully sampling the noise in the UVplane and applying an inverse Fourier transform back to real space.
Interestingly, the generatorbased parameter inference barely deteriorates in the presence of noise: the resulting marginalised equitailed credibility regions are for , for , and for . This shows that astrophysical knowledge can be extracted in realistic situations where the measurements are impacted by both sample variance and thermal noise. We emphasise again that even for upcoming 21cm surveys, thermal noise will be a greater source of uncertainty than sample variance, and one would expect the credibility regions to be set mainly by the noise. In such situations, the contribution of the epistemic uncertainty merits further investigation, which we will carry out in future work.
On the other hand, dealing with noise in the criticbased approach is more intricate: without introducing noise to the PGGAN, it is not clear whether the critic, which has been trained on noiseless samples, will output high scores for a noisy image together with the correct parameter vector. A possible solution could be to add realistic noise to the generated samples during training before showing them to the critic. While it is a common technique to add noise to the discriminator input (or several layers of the discriminator) in GANs (Salimans et al., 2016) in order to make it harder to distinguish fake from real images, this often comes at the cost of reduced image quality. Another approach for both generatorbased and criticbased inference could be adding noise to the training samples and training the PGGAN to produce noisy samples. Our proposed methods exploit the fact that the generator and critic networks are trained anyway, but of course, it is also possible to employ an additional (possibly Bayesian) neural network for the task of parameter inference, as done in several references in the introduction of this work.
7 Conclusions
We have presented a framework for the fast creation of realistic 21cm tomography samples by means of a PGGAN. During training, the PGGAN learns to correctly account for Xray emissivity , Xray hardtosoft ratio , and Lyman band emissivity , and to generate matching 21cm tomographic samples at gradually increasing resolution. Once trained, the PGGAN produces 2D samples at an SKAlike resolution of ( pixels) in roughly a second on a laptop CPU. In comparison, the production of the 21SSD catalogue (which of course contains snapshots and lightcones at much higher resolution) took CPU hours. The PGGANgenerated samples are diverse and accurately reproduce the global 21cm signal, the power spectrum, and the pixel distribution function of the training data. We have shown how both the generator network and the critic network can be harnessed for the task of parameter inference in the context of ABC rejection sampling. The reionisation parameters of an exemplary tomography sample are accurately recovered, and tight constraints are obtained for the parameters and – despite the fact that no samples for the correct parameter vector were contained in the training set. For the generatorbased approach, this is even the case in the presence of simulated thermal noise with SKAlike variance, which shows the applicability of our approach in realistic scenarios. In the dawning era of highredshift 21cm imaging, deep learning techniques will provide valuable tools for forward and backward modelling. In this paper we have demonstrated how (PG)GANs can be utilised for both tasks.
Acknowledgements
The authors thank Benoît Semelin for his help with the 21SSD data set and with the SKA noise computation, and for making the data publicly available. The authors also thank the anonymous referee for their feedback that improved the quality of this work. The authors acknowledge the National Computational Infrastructure (NCI), which is supported by the Australian Government, for providing services and computational resources on the supercomputers Raijin and Gadi that have contributed to the research results reported within this paper. This research made use of the Argus Virtual Research Desktop environment funded by the University of Sydney. F. L. is supported by the University of Sydney International Scholarship (USydIS).
Appendix A Implementation details of the neural network
Table 1 lists the layers of the neural network at the final resolution. We found it beneficial in our experiments to replace pixel normalisation by instance normalisation (Ulyanov et al., 2016), except for directly after the input layer of the generator. In contrast to PGGAN, we process the generator input with a fully connected layer instead of a convolutional layer. Besides, we noticed that the PGGAN becomes prone to mode collapse if the number of channels is reduced as the PGGAN grows deeper, which is why we keep the number of channels constant at which is feasible at the resolution considered herein – however, for the creation of higher resolution samples, a bottleneck architecture might be attractive. Different from the common practice for GANs, downsampling in PGGAN is not achieved by strided convolutions, but rather by average pooling, which we follow. The upsampling operation is a nearest neighbour interpolation. Recall that the channels of the critic input correspond to , the global signal at each redshift , , and the three model parameters. As suggested by PGGAN, we employ an equalised learning rate and a minibatch standard deviation layer with group size 4. The decay rates of the moments for the Adam optimiser are taken to be and .
It proved advantageous to take the logarithm of the two scale variables, i.e. Xray and Lyman band emissivity and , respectively, and to normalise all components of before feeding them to the neural network via the mapping for , where and are the mean and standard deviation of each parameter computed over the set of training images.
For the training, we normalise the brightness temperatures via the mapping
(15) 
The inverse hyperbolic sine term leads to a steeper slope and hence to an increased sensitivity to small perturbations around . The temperature range is monotonically mapped to , and since we do not apply an activation function after the last convolutional layer of the generator that would confine the generator output to a certain interval, the few pixels outside this range do not require any special treatment. We calculate the inverse transformation of the PGGAN output back to temperature space numerically.
The total number of trainable parameters for the fully grown PGGAN amounts to and for the generator and the critic, respectively.
Operation  Output shape () 

GENERATOR:  
Parameters and noise  512 1 1 
IN LReLU FC PN  512 1 8 
IN LReLU Conv  512 1 8 
Upsampling  512 2 16 
IN LReLU Conv  512 2 16 
IN LReLU Conv  512 2 16 
Upsampling  512 4 32 
IN LReLU Conv  512 4 32 
IN LReLU Conv  512 4 32 
Upsampling  512 8 64 
IN LReLU Conv  512 8 64 
IN LReLU Conv  512 8 64 
Upsampling  512 16 128 
IN LReLU Conv  512 16 128 
IN LReLU Conv  512 16 128 
Upsampling  512 32 256 
IN LReLU Conv  512 32 256 
IN LReLU Conv  512 32 256 
Conv  1 32 256 
CRITIC:  
Input tensor  6 32 256 
LReLU Conv  512 32 256 
LReLU Conv  512 32 256 
LReLU Conv  512 32 256 
Downsampling  512 16 128 
LReLU Conv  512 16 128 
LReLU Conv  512 16 128 
Downsampling  512 8 64 
LReLU Conv  512 8 64 
LReLU Conv  512 8 64 
Downsampling  512 4 32 
LReLU Conv  512 8 64 
LReLU Conv  512 8 64 
Downsampling  512 4 32 
LReLU Conv  512 4 32 
LReLU Conv  512 4 32 
Downsampling  512 2 16 
LReLU Conv  512 2 16 
LReLU Conv  512 2 16 
Downsampling  512 1 8 
Minibatch STD  513 1 8 
LReLU Conv  512 1 5 
LReLU Conv  512 1 1 
FC  1 1 1 
Appendix B Samples for selected parameter vectors
Footnotes
 pubyear: 2019
 pagerange: A unified framework for 21cm tomography sample generation and parameter inference with Progressively Growing GANs–B
 https://github.com/FloList/21cmGAN
 The pixel values can also be scaled to other intervals, e.g. instead of .
 The critic score becomes stochastic if the critic contains layers that introduce randomness at evaluation time, such as dropout layers.
References
 Abadi M., et al., 2016, preprint (arXiv:1603.04467)
 Arjovsky M., Bottou L., 2017, preprint (arXiv:1701.04862)
 Arjovsky M., Chintala S., Bottou L., 2017, International Conference on Machine Learning, pp 214–223
 Arora S., Zhang Y., 2017, preprint (arXiv:1706.08224)
 Baek S., Di Matteo P., Semelin B., Combes F., Revaz Y., 2009, A&A, 495, 389
 Baek S., Semelin B., Di Matteo P., Revaz Y., Combes F., 2010, A&A, 523, A4
 Barkana R., 2018, Nature, 555, 71
 Barkana R., Loeb A., 2005, ApJ, 624, L65
 Barry N., et al., 2019, preprint (arXiv:1909.00561)
 Beardsley A. P., et al., 2016, ApJ, 833, 102
 Bouwens R. J., et al., 2015a, ApJ, 803, 34
 Bouwens R. J., Illingworth G. D., Oesch P. A., Caruana J., Holwerda B., Smit R., Wilkins S., 2015b, ApJ, 811, 140
 Bowman J. D., Rogers A. E. E., Monsalve R. A., Mozdzen T. J., Mahesh N., 2018, Nature, 555, 67
 Brock A., Donahue J., Simonyan K., 2018, preprint (arXiv:1809.11096)
 Chardin J., Uhlrich G., Aubert D., Deparis N., Gillet N., Ocvirk P., Lewis J., 2019, preprint (arXiv:1905.06958)
 Cohen A., Fialkov A., Barkana R., Lotem M., 2017, MNRAS, 472, 1915
 Cohen A., Fialkov A., Barkana R., Monsalve R., 2019, preprint (arXiv:1910.06274)
 DeBoer D. R., et al., 2017, Publ. Astron. Soc. Pacific, 129, 045001
 Dean T. A., Singh S. S., Jasra A., Peters G. W., 2014, Scandinavian Journal of Statistics, 41, 970
 Dewdney P. E., Turner W., Millenaar R., Mccool R., Lazio J., Cornwell T. J., 2013, SKA Technical Document
 Dillon J. S., et al., 2014, Phys. Rev. D, 89, 023002
 Doussot A., Eames E., Semelin B., 2019, MNRAS, 490, 371
 Eames E., Doussot A., Semelin B., 2019, MNRAS, 3664, 3655
 Eastwood M. W., et al., 2019, The Astronomical Journal, 158, 84
 EwallWice A., Chang T.C., Lazio J., Doré O., Seiffert M., Monsalve R. A., 2018, ApJ, 868, 63
 Feng C., Holder G., 2018, ApJ, 858, L17
 Fialkov A., Barkana R., 2019, MNRAS, 486, 1763
 Field G. B., 1958, Proceedings of the IRE, 46, 240
 Field G. B., 1959, ApJ, 129, 536
 Fragos T., et al., 2013, ApJ, 764, 41
 Furlanetto S. R., Peng Oh S., Briggs F. H., 2006, Phys. Rep., 433, 181
 Furlanetto S., et al., 2019a, preprint (arXiv:1903.06212)
 Furlanetto S., et al., 2019b, preprint (arXiv:1903.06204)
 Gehlot B. K., et al., 2019, MNRAS, 488, 4271
 Gillet N., Mesinger A., Greig B., Liu A., Ucci G., 2019, MNRAS, 293, 282
 Goodfellow I. J., PougetAbadie J., Mirza M., Xu B., WardeFarley D., Ozair S., Courville A., Bengio Y., 2014, Advances in neural information processing systems, 27, 2672
 Greig B., Mesinger A., 2015, MNRAS, 449, 4246
 Gulrajani I., Ahmed F., Arjovsky M., Dumoulin V., Courville A., 2017, in Advances in Neural Information Processing Systems. pp 5768–5778 (arXiv:1704.00028)
 Gutmann M. U., Corander J., 2015, Journal of Machine Learning Research, 17, 1
 Hassan S., Liu A., Kohn S., Aguirre J. E., Plante P. L., Lidz A., 2017, Proceedings of the International Astronomical Union, 12, 47
 Hassan S., Andrianomena S., Doughty C., 2019, preprint (arXiv:1907.07787)
 Hirano S., Bromm V., 2018, MNRAS, 480, L85
 Jana R., Nath B. B., Biermann P. L., 2019, MNRAS, 483, 5329
 Karras T., Aila T., Laine S., Lehtinen J., 2017, preprint (arXiv:1710.10196)
 Karras T., Laine S., Aila T., 2018, preprint (arXiv:1812.04948)
 Kern N. S., Liu A., Parsons A. R., Mesinger A., Greig B., 2017, ApJ, 848, 23
 Kingma D. P., Ba J. L., 2014, preprint (arXiv:1412.6980)
 Kingma D. P., Welling M., 2013, preprint (arXiv:1312.6114)
 Kolopanis M., et al., 2019, ApJ, 883, 133
 Koopmans L., et al., 2015, in Proceedings of Advancing Astrophysics with the Square Kilometre Array â PoS(AASKA14). Sissa Medialab, Trieste, Italy, p. 001 (arXiv:1505.07568), doi:10.22323/1.215.0001, https://pos.sissa.it/215/001
 La Plante P., Ntampaka M., 2019, ApJ, 880, 110
 Li W., et al., 2019, preprint (arXiv:1911.10216)
 Loeb A., Zaldarriaga M., 2004, Phys. Rev. Lett., 92, 211301
 Majumdar S., Pritchard J. R., Mondal R., Watkinson C. A., Bharadwaj S., Mellema G., 2018, MNRAS, 476, 4007
 Mellema G., et al., 2013, Experimental Astronomy, 36, 235
 Mellema G., Koopmans L., Shukla H., Datta K. K., Mesinger A., Majumdar S., 2015, in Proceedings of Advancing Astrophysics with the Square Kilometre Array â PoS(AASKA14). Sissa Medialab, Trieste, Italy, p. 010 (arXiv:1501.04203), doi:10.22323/1.215.0010, https://pos.sissa.it/215/010
 Mesinger A., Furlanetto S., Cen R., 2011, MNRAS, 411, 955
 Mesinger A., Greig B., Sobacchi E., 2016, MNRAS, 459, 2342
 Mirza M., Osindero S., 2014, preprint (arXiv:1411.1784)
 Miyato T., Koyama M., 2018, preprint (arXiv:1802.05637)
 Miyato T., Kataoka T., Koyama M., Yoshida Y., 2018, preprint (arXiv:1802.05957)
 Morales M. F., Wyithe J. S. B., 2010, Annu. Rev. Astron. Astrophys., 48, 127
 Nash J. F., 1950, Proc. Natl. Acad. Sci., 36, 48
 Nowozin S., Cseke B., Tomioka R., 2016, in Advances in Neural Information Processing Systems. pp 271–279 (arXiv:1606.00709)
 Paciga G., et al., 2013, MNRAS, 433, 639
 Parsons A. R., et al., 2010, The Astronomical Journal, 139, 1468
 Patil A. H., et al., 2017, ApJ, 838, 65
 Pillepich A., et al., 2018, MNRAS, 473, 4077
 Planck Collaboration 2016, A&A, 594, A13
 Pober J. C., et al., 2013, ApJ, 768, L36
 Pritchard J. R., Furlanetto S. R., 2007, MNRAS, 376, 1680
 Pritchard J. R., Loeb A., 2012, Reports on Progress in Physics, 75, 086901
 Razavi A., van den Oord A., Vinyals O., 2019, preprint (arXiv:1906.00446)
 Salimans T., Goodfellow I., Zaremba W., Cheung V., Radford A., Chen X., 2016, Advances in Neural Information Processing Systems, pp 2234–2242
 Santos M. G., Ferramacho L., Silva M. B., Amblard A., Cooray A., 2010, MNRAS, 406, 2421
 Schmit C. J., Pritchard J. R., 2018, MNRAS, 475, 1213
 Semelin B., 2016, MNRAS, 455, 962
 Semelin B., Combes F., 2002, A&A, 388, 826
 Semelin B., Combes F., Baek S., 2007, A&A, 474, 365
 Semelin B., Eames E., Bolgar F., Caillat M., 2017, MNRAS, 472, 4508
 Sharma P., 2018, MNRAS, 481, L6
 Shimabukuro H., Semelin B., 2017, MNRAS, 468, 3869
 Slatyer T. R., Wu C.l., 2018, Phys. Rev. D, 98, 023013
 Thomas R. M., et al., 2009, MNRAS, 393, 32
 Tingay S. J., et al., 2013, Publications of the Astronomical Society of Australia, 30, e007
 Ulyanov D., Vedaldi A., Lempitsky V., 2016, preprint (arXiv:1607.08022)
 Vallender S. S., 1974, Theory of Probability & Its Applications, 18, 784
 Vonlanthen P., Semelin B., Baek S., Revaz Y., 2011, A&A, 532, A97
 Wouthuysen S. A., 1952, The Astronomical Journal, 57, 31
 Wyithe J. S. B., Loeb A., 2008, MNRAS, 383, 606
 ZamudioFernandez J., Okan A., VillaescusaNavarro F., Bilaloglu S., Cengiz A. D., He S., Levasseur L. P., Ho S., 2019, preprint (arXiv:1904.12846)
 Zhang H., Goodfellow I., Metaxas D., Odena A., 2018, preprint (arXiv:1805.08318)
 van Haarlem M. P., et al., 2013, A&A, 556, A2