# Learning Bayes’ theorem with a neural network for gravitational-wave inference

###### Abstract

We wish to achieve the Holy Grail of Bayesian inference with deep-learning techniques: training a neural network to instantly produce the posterior for the parameters , given the data . In the setting of gravitational-wave astronomy, we have access to a generative model for signals in noisy data (i.e., we can instantiate the prior and likelihood ), but are unable to economically compute the posterior for even a single realization of . Here we demonstrate how a network may be taught to estimate regardless, by simply showing it numerous realizations of .

## Introduction.

In the Bayesian analysis of signals immersed in noise Gregory (2005), we seek a representation for the posterior probability of one or more parameters that govern the shape of the signals. Unless the parameter-to-signal map (the *forward model*) is very simple, the analysis (or *inverse solution*) comes at significant computational cost, as it requires the stochastic exploration of the likelihood surface at a large number of locations in parameter space. Such is the case, for instance, of parameter estimation for gravitational-wave sources such as the compact binaries detected by LIGO–Virgo LIGO Scientific Collaboration and Virgo Collaboration and others (2018); Abbott et al. (2019); here each likelihood evaluation requires that we generate the gravitational waveform corresponding to a set of source parameters, and compute its noise-weighted correlation with detector data Creighton and Anderson (2011). Waveform generation is usually the costlier operation, so gravitational-wave analysts often utilize faster, less accurate waveform models Hannam et al. (2014); Pan et al. (2014), or accelerated *surrogates* of slower, more accurate models Blackman et al. (2015).

Extending the analysis from the data we have to the data we might measure (i.e., characterizing the parameter-estimation prospects of future experiments) compounds the expense, since we need to explore posteriors for many noise realizations, and across the domain of possible source parameters. For concreteness, we price the evaluation of a single Bayesian posterior at times the cost of generating a waveform, and the characterization of parameter-estimation prospects at times the cost of a posterior. With current computational resources, this means that (for instance) accurate component-mass estimates only become available hours or days after the detection of a binary black-hole coalescence Usman et al. (2016); Messick et al. (2017), while any extensive study of parameter-estimation prospects must rely on less reliable techniques such as the Fisher-matrix approximation Vallisneri (2008).

In this Letter, we show how one- or two-dimensional marginalized Bayesian posteriors may be produced using *deep neural networks* Goodfellow et al. (2016) trained on large ensembles of signal + noise data streams. (Specifically, we adopt *multilayer perceptrons* Haykin (1999), although other architectures are likely to be viable.) The network for each source parameter or parameter pair takes as input a noisy signal, and instantly outputs an approximate posterior, represented either as a histogram or parametrically (e.g., as a Gaussian mixture). We dub such networks Percival:^{1}^{1}1After the original achiever of the Grail de Troyes et al. (1999). Posterior Estimation Results Computed Instantaneously Via Artificial Learning. Crucially, the *loss function* used in the training of these networks does not require that we actually compute the likelihood or posterior for each signal, but only that we provide the true source parameters. As long as the training set reflects the assumed prior distribution of the parameters and sampling distribution of the noise, the resulting posteriors approximate the correct Bayesian distribution—and indeed they achieve it asymptotically for very large training sets and networks.

In the gravitational-wave context, the training of Percival typically requires waveform evaluations (plus the computation of network weight gradients), which is times the cost of stochastically exploring the posterior for a single noisy signal. However, the costly training is performed *offline*; afterwards, the networks can perform inference on multiple signals with negligible execution times. The prompt generation of posterior parameter distributions for binary coalescence alerts Abbott and others (2019) could be a worthy use of this speed (once suitable networks are trained, which we do not attempt here). Another potential application is the generation of effective proposal kernels to facilitate more detailed posterior analyses with Markov chain Monte Carlo methods Christensen and Meyer (1998). Last, very rapid inference could also prove useful in the characterization of parameter-estimation prospects for next-generation detectors such as LISA Danzmann and others (2017).

The still somewhat hefty cost of training Percival can be offset significantly by pairing it with its forward counterpart Roman Chua et al. (2019), which is essentially a neural network that has been fitted to the relevant waveform model in a reduced-order representation Field et al. (2011). Roman (Reduced-Order Modeling with Artificial Neurons) takes as input a set of source parameters, and outputs in milliseconds the corresponding signal at high accuracy; it provides a conceptually cleaner alternative to the combined analysis framework comprising surrogate waveforms Field et al. (2014) and the likelihood compression technique known as *reduced-order quadrature* Cañizares et al. (2013). Being a neural network, Roman has additional features such as analytic waveform derivatives and—more pertinently for this work—the generation of waveforms in large batches at next to no marginal cost. The latter allows us to fully exploit highly parallel GPU architectures to build training sets for Percival that are effectively infinite in size, which in turn grants immunity to the perennial deep-learning problem of overfitting.

Deep-learning techniques have gained popularity in the gravitational-wave community over the past two years, with the majority of efforts focused on applying *convolutional neural networks* LeCun and Bengio (1998) to the *classification* task of signal detection, specifically for transient signals from compact binaries in ground-based detector data Gebhard et al. (2017); George and Huerta (2018, 2018); Gabbard et al. (2018); Fan et al. (2019); Nakano et al. (2019); Rebei et al. (2019); Shen et al. (2019); Gebhard et al. (2019); Krastev (2019). They are also being investigated as detection tools for persistent signals from asymmetric neutron stars Mytidis et al. (2019); Dreissigacker et al. (2019); Morawski et al. (2019); Miller et al. (2019). While the application of neural networks to the *regression* task of source parameter estimation is addressed in some of these papers George and Huerta (2018); Fan et al. (2019); Nakano et al. (2019); George and Huerta (2018), there it is restricted to the recovery of pointwise estimates, and with a frequentist characterization of errors based only on the test set. However, near the completion of this manuscript, we learned of related work by Gabbard et al. Gabbard et al. (2019), where a conditional variational autoencoder Tonolini et al. (2019) is trained on parameter–noisy signal pairs to output samples from the Bayesian posterior. We expect that comparison between our method and theirs will offer useful insight.

## Training neural networks to produce posteriors.

We now describe our scheme to perform Bayesian posterior estimation using neural networks. A *perceptron classifier* is a network that takes a data vector as input and outputs the estimated probabilities that the input belongs to each member of a universal set of disjoint classes . The *Bayesian optimal discriminant* is the classifier that returns the posterior probabilities , where is the likelihood of occurring in , and is the prior probability of itself. It is a well-established result in the machine-learning literature that perceptron classifiers can approximate Bayesian optimal discriminants Ruck et al. (1990); Wan (1990) when they are trained on a population of inputs distributed according to . This is achieved by minimizing (over the network weights) the loss function , where is the indicator function of , and is some vector distance on the space (e.g., the squared norm, or the discrete Kullback–Leibler divergence Pollard (2002)).

Note that the above training procedure does not require computation of the actual posteriors , but only the ability to randomly draw classes from the prior and data vectors from the conditional sampling distribution (i.e., we need a *generative model* of the data). To move from classification to the computation of posterior densities for continuous parameters, we may simply define classes based on a binning of the parameter domain of interest; in other words, the network will then output histograms of the target posterior. This coarse graining of parameter space highlights the relationship between classification and regression, but is not actually necessary for our scheme, since the network can instead output a parametric posterior representation (such as a Gaussian mixture) to be fed into an analogous loss function. Although histograms rapidly become impractical for marginalized posteriors in dimensions, the parametric approach extends more readily to the estimation of higher-dimensional posteriors.

Let us now explicitly derive the loss function used in our scheme. Consider a data model described by the continuous parameters , with denoting the parameters for which we seek a posterior. We seek to minimize the statistical distance between the (marginalized) true posterior and the network-estimated posterior , integrated over the distribution of the data . If is the Kullback–Leibler divergence, we have the integrated distance

(1) |

Using Bayes’ theorem on and dropping the term containing (which is constant with respect to the network weights), we obtain the loss function

(2) |

which can be approximated (modulo normalization) as a discrete sum over a notional training batch :

(3) |

where, crucially, each is first drawn from the prior before is drawn from the conditional .

The summand in Eq. (3) is precisely the -dependent term in the Kullback–Leibler divergence between the Dirac delta function and . Likewise, if the network-estimated posterior is represented as a histogram , Eq. (3) simplifies to

(4) |

where is now the indicator function of the -th histogram bin. Eq. (4) is familiar to machine-learning practitioners, and is more commonly known as the cross-entropy loss for classification problems Goodfellow et al. (2016). Finally, note that a derivation similar to the one above can also be given for the squared distance , resulting in an alternative (but less tractable) loss

(5) |

## Leveraging reduced waveform representations.

In gravitational-wave astronomy, the data is usually a time or frequency series of strain measured by the detector. For the transient signals observed by ground-based detectors, is typically a vector of length ; this rises to for persistent LIGO–Virgo signals, or the mHz-band signals sought by future space-based detectors. Once the presence of a signal is established, Bayesian inference proceeds via the canonical likelihood

(6) |

where is a noise-weighted inner product that incorporates a power spectral density model for the detector noise (assumed to be Gaussian and additive).

As mentioned earlier, the generation of the waveform template and the evaluation of the inner product are both computationally expensive, due to the complexity of relativistic modeling and the large dimension of the inner-product space. In the *reduced-order modeling* framework Field et al. (2011), we mitigate this cost by constructing a reduced basis for the (far more compact) template manifold embedded in the inner-product space, as well as a fast interpolant for the template model in this reduced representation. The neural network Roman Chua et al. (2019) implements this interpolation, allowing Eq. (6) to be cast in the reduced but statistically equivalent form

(7) |

where the -vectors and are the projections of and , respectively, onto the reduced basis; furthermore, for the true source parameters and the (projected) noise realization .

In this Letter, we give a demonstration of Percival by using the four-parameter, Roman network described in Chua et al. (2019), which embodies the family of 2.5PN TaylorF2 waveforms Arun et al. (2009) emitted by inspiraling black-hole binaries with aligned spins and component masses . The Roman templates are normalized to unit signal-to-noise ratio . Here, we vary signal amplitudes by setting ; our model parameters are then , where the chirp mass and symmetric mass ratio are an alternative parametrization of the space. Large batches of Roman signals (with noise added as described after Eq. (7)) can be generated on a 2014 Tesla K80 GPU in .

## Example results.

The marginalized posteriors on the full domain of the Roman model are nontrivial: they are highly multimodal at sub-threshold signal-to-noise ratios , and this persists even for due to correlations in the space (although is then better constrained). The parameter space is also quite large from a Fisher-information perspective, resulting in posteriors that are highly localized and hence difficult to resolve. We train a number of one- and two-dimensional Percival networks with different priors but, for the above reasons, find the most success when the priors are confined to smaller subspaces.

Our first example is a network that estimates the joint posterior of the parameters , for signals distributed according to the uniform prior

(8) |

where the prior intervals are given by , and . This is effectively a non-spinning three-parameter submodel. Our Percival network takes as input signal + noise coefficients , processes them through eight hidden layers of width 1024 (with the *leaky ReLU* activation function Maas et al. (2013) applied to each), and outputs (with linear activation) a vector of five quantities that specify a single bivariate Gaussian. Training is performed using batch gradient descent with *Adam optimization* Kingma and Ba (2014) and a manually decayed learning rate. The network is fed examples before the loss levels off; to eliminate overfitting, we generate a new batch of signals at every training epoch.

In Fig. 1, we compare the Percival-estimated posteriors for five test signals against the corresponding true posteriors, which are sampled through a brute-force approach that exploits the batch-generation capability of Roman. The trained network appears to localize the posteriors well, and to capture their near-Gaussian correlation structure with reasonable accuracy. This statement is made more quantitative in Figs 2 and 3, where the performance of the network is assessed on larger test sets. Fig. 2 depicts the accuracy of the network-estimated posterior covariance matrix , for 1000 test signals distributed according to . Each point on the plot corresponds to a single matrix, with color describing the accuracy of the associated ellipse -volume :

(9) |

where is obtained from the sample covariance matrix of the true posterior. The angle of the segment through each point describes the accuracy of the ellipse orientation: , where is the minimal angle between the principal eigenvectors of the estimated and sample covariance matrices.

For any given test signal, two overlapping factors seem to reduce network performance: (i) a smaller chirp mass, since the information content (variability) of the signal is larger in this regime, and (ii) vicinity to the boundary. This latter effect is somewhat expected, since the posterior is closer to a truncated Gaussian in the case of a near-boundary signal, and the Gaussian defined by the sample covariance matrix actually tends to be a poorer fit than the (truncated) network estimate.

The distribution of network prediction errors for a test set of 5000 signals is shown in Fig. 3; the error in the mean value of each parameter is quoted relative to the respective prior interval length , while the error in each standard deviation is taken to be the quantity (such that a value of corresponds to an overestimate of by a factor of ten). The network recovers the mean values at around accuracy, but can overestimate the standard deviations by up to a factor of three.

In Fig. 3, we also show the error distribution of and for a second Percival network that estimates the one-dimensional posterior of , for signals distributed according to the uniform prior

(10) |

where , and the other intervals are given as before. This network is identical to the first, except that it outputs 18 quantities specifying a three-component bivariate Gaussian mixture (five for each component, plus three weights). Even though the family of posteriors over the full five-dimensional prior space is larger and more complex, the second network yields comparable results to the first, but does have some difficulty in resolving the multimodality that arises in certain posteriors.

To show the viability of the histogram representation, we present results for a network trained on weak signals (), which exhibit posteriors with complex features. The parameter of interest is , and the prior is

(11) |

where are given as before. While such a prior is not very relevant for gravitational-wave astronomy (since inference is unlikely to be conducted on sub-threshold sources), the Percival network is nevertheless able to estimate the highly nontrivial posteriors, due to the greater degrees of freedom in the posterior representation and the smaller prior space. Posteriors for three representative test signals are displayed in Fig. 4.

## Discussion.

In this Letter, we describe a proof-of-principle demonstration (Percival) of instantaneous Bayesian posterior approximation in complex inverse problems such as gravitational-wave parameter estimation, using straightforward perceptron networks trained on large signal + noise sets drawn from the assumed parameter priors and noise sampling distribution. The computational cost of parameter-space exploration is shifted from the analysis of each observed data set to the offline network-training stage, making this scheme useful for low-latency parameter estimation, or for the science-payoff characterization of future experiments. Notably, the loss function does not require likelihood evaluations, but only the true parameter values of the training examples. Thus, our scheme can be used whenever the likelihood is expensive or unknown, but forward modeling is efficient and we have access to many samples of noise.

In our examples, we leverage the hyper-efficient Roman forward modeling Chua et al. (2019), which allows us to train networks over effectively infinite training sets. We find that relatively modest network architectures are able to approximate the quasi-Gaussian posteriors obtained for stronger signals, as well as the multimodal posteriors that occur at very low signal-to-noise ratios. We fully expect the accuracy of approximation to improve with network capacity and training iterations. Larger and deeper networks should also learn posteriors across broader regions of parameter space, although it may prove expedient to train separate networks for different regions.

The real-world analysis of gravitational-wave signals involves a number of complications that are not represented in our demonstration, such as multidetector data sets, the presence of *extrinsic* parameters that describe the relative spacetime location of source and detector, and variations (or estimation error) in the noise spectral density. These are beyond the scope of this Letter, but they can be handled by a combination of strategies: simply adding extra parameters to the model, designing networks with symmetries that make them insensitive to new degrees of freedom, and transforming (e.g., time-shifting, or whitening) the input data. Last, since convolutional neural networks have been successfully trained to recognize and decode gravitational waveforms represented as strain time series, it should be possible to combine these with a posterior-generating stage analogous to Percival, providing an end-to-end mapping from detector data to Bayesian posteriors.

## Acknowledgements.

We are grateful to Chad Galley for sharing his expertise in reduced-order modeling, and to Chris Messenger for correspondence on his related work. We also thank Natalia Korsakova and Michael Katz for helpful conversations, and acknowledge feedback from fellow participants in the 2018 LISA workshop at the Keck Institute for Space Studies. This work was supported by the Jet Propulsion Laboratory (JPL) Research and Technology Development program, and was carried out at JPL, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. © 2019 California Institute of Technology. U.S. Government sponsorship acknowledged.

## References

- Low-latency gravitational-wave alerts for multimessenger astronomy during the second advanced LIGO and virgo observing run. The Astrophysical Journal 875 (2), pp. 161. External Links: Document, Link Cited by: Introduction..
- Binary black hole population properties inferred from the first and second observing runs of advanced ligo and advanced virgo. arXiv:1811.12940. Cited by: Introduction..
- Higher-order spin effects in the amplitude and phase of gravitational waveforms emitted by inspiraling compact binaries: ready-to-use gravitational waveforms. Phys. Rev. D 79, pp. 104023. External Links: Document, Link Cited by: Leveraging reduced waveform representations..
- Fast and Accurate Prediction of Numerical Relativity Waveforms from Binary Black Hole Coalescences Using Surrogate Models. Phys. Rev. Lett. 115, pp. 121102. External Links: Document, Link Cited by: Introduction..
- Gravitational wave parameter estimation with compressed likelihood evaluations. Phys. Rev. D 87, pp. 124005. External Links: Document, Link Cited by: Introduction..
- Markov chain monte carlo methods for bayesian gravitational radiation data analysis. Phys. Rev. D 58, pp. 082001. External Links: Document, Link Cited by: Introduction..
- Reduced-order modeling with artificial neurons for gravitational-wave inference. Phys. Rev. Lett. 122, pp. 211101. External Links: Document, Link Cited by: Introduction., Leveraging reduced waveform representations., Leveraging reduced waveform representations., Discussion..
- Gravitational-wave physics and astronomy: an introduction to theory, experiment and data analysis. Wiley Series in Cosmology, Wiley. External Links: ISBN 9783527408863, LCCN 2012554560, Link Cited by: Introduction..
- Laser Interferometer Space Antenna. ArXiv e-prints. External Links: 1702.00786 Cited by: Introduction..
- Perceval: the story of the grail. Yale University Press. External Links: ISBN 9780300075854, Link Cited by: footnote 1.
- Deep-learning continuous gravitational waves. Phys. Rev. D 100, pp. 044009. External Links: Document, Link Cited by: Introduction..
- Applying deep neural networks to the detection and space parameter estimation of compact binary coalescence with a network of gravitational wave detectors. Science China Physics, Mechanics & Astronomy 62 (6), pp. 969512. External Links: ISSN 1869-1927, Document, Link Cited by: Introduction..
- Reduced Basis Catalogs for Gravitational Wave Templates. Phys. Rev. Lett. 106, pp. 221102. External Links: Document, Link Cited by: Introduction., Leveraging reduced waveform representations..
- Fast Prediction and Evaluation of Gravitational Waveforms Using Surrogate Models. Phys. Rev. X 4, pp. 031006. External Links: Document, Link Cited by: Introduction..
- Matching Matched Filtering with Deep Networks for Gravitational-Wave Astronomy. Phys. Rev. Lett. 120, pp. 141103. External Links: Document, Link Cited by: Introduction..
- Estimating bayesian parameter estimation using conditional variational autoencoders for gravitational-wave astronomy. arXiv preprint. Cited by: Introduction..
- ConvWave: Searching for Gravitational Waves with Fully Convolutional Neural Nets. In Workshop on Deep Learning for Physical Sciences (DLPS) at the 31st Conference on Neural Information Processing Systems (NIPS), External Links: Link Cited by: Introduction..
- Convolutional neural networks: a magic bullet for gravitational-wave detection?. arXiv:1904.08693. Cited by: Introduction..
- Deep neural networks to enable real-time multimessenger astrophysics. Phys. Rev. D 97, pp. 044039. External Links: Document, Link Cited by: Introduction..
- Deep Learning for real-time gravitational wave detection and parameter estimation: Results with Advanced LIGO data. Physics Letters B 778, pp. 64 – 70. External Links: ISSN 0370-2693, Document, Link Cited by: Introduction..
- Deep learning. MIT Press. Cited by: Introduction., Training neural networks to produce posteriors..
- Bayesian logical data analysis for the physical sciences: a comparative approach with mathematica® support. Cambridge University Press. External Links: ISBN 9781139444286, Link Cited by: Introduction..
- Simple model of complete precessing black-hole-binary gravitational waveforms. Phys. Rev. Lett. 113, pp. 151101. External Links: Document, Link Cited by: Introduction..
- Neural networks: a comprehensive foundation. Prentice Hall. External Links: ISBN 9780132733502, LCCN gb98059240, Link Cited by: Introduction..
- Adam: A Method for Stochastic Optimization. ArXiv e-prints. External Links: 1412.6980 Cited by: Example results..
- Real-time detection of gravitational waves from binary neutron stars using artificial neural networks. arXiv:1908.03151. Cited by: Introduction..
- Convolutional networks for images, speech, and time series. In The Handbook of Brain Theory and Neural Networks, M. A. Arbib (Ed.), pp. 255–258. External Links: ISBN 0-262-51102-9, Link Cited by: Introduction..
- GWTC-1: a gravitational-wave transient catalog of compact binary mergers observed by ligo and virgo during the first and second observing runs. arXiv:1811.12907. Cited by: Introduction..
- Rectifier Nonlinearities Improve Neural Network Acoustic Models. In Proceedings of the 30th International Conference on Machine Learning, Cited by: Example results..
- Analysis framework for the prompt discovery of compact binary mergers in gravitational-wave data. Phys. Rev. D 95, pp. 042001. External Links: Document, Link Cited by: Introduction..
- How effective is machine learning to detect long transient gravitational waves from neutron stars in a real search?. arXiv e-prints, pp. arXiv:1909.02262. External Links: 1909.02262 Cited by: Introduction..
- Deep learning classification of the continuous gravitational-wave signal candidates from the time-domain f-statistic search. arXiv:1907.06917. Cited by: Introduction..
- Sensitivity study using machine learning algorithms on simulated -mode gravitational wave signals from newborn neutron stars. Phys. Rev. D 99, pp. 024024. External Links: Document, Link Cited by: Introduction..
- Comparison of various methods to extract ringdown frequency from gravitational wave data. Phys. Rev. D 99, pp. 124032. External Links: Document, Link Cited by: Introduction..
- Inspiral-merger-ringdown waveforms of spinning, precessing black-hole binaries in the effective-one-body formalism. Phys. Rev. D 89, pp. 084006. External Links: Document, Link Cited by: Introduction..
- A user’s guide to measure theoretic probability. Cambridge Series in Statistical and Probabilistic Mathematics, Cambridge University Press. External Links: ISBN 9780521002899, LCCN 2001035270, Link Cited by: Training neural networks to produce posteriors..
- Fusing numerical relativity and deep learning to detect higher-order multipole waveforms from eccentric binary black hole mergers. Phys. Rev. D 100, pp. 044025. External Links: Document, Link Cited by: Introduction..
- The multilayer perceptron as an approximation to a bayes optimal discriminant function. IEEE Transactions on Neural Networks 1 (4), pp. 296–298. Cited by: Training neural networks to produce posteriors..
- Deep learning at scale for gravitational wave parameter estimation of binary black hole mergers. arXiv:1903.01998. Cited by: Introduction..
- Variational inference for computational imaging inverse problems. arXiv preprint arXiv:1904.06264. Cited by: Introduction..
- The PyCBC search for gravitational waves from compact binary coalescence. Classical and Quantum Gravity 33 (21), pp. 215004. External Links: Document, Link Cited by: Introduction..
- Use and abuse of the Fisher information matrix in the assessment of gravitational-wave parameter-estimation prospects. Phys. Rev. D 77, pp. 042001. External Links: Document, Link Cited by: Introduction..
- Neural network classification: a bayesian interpretation. IEEE Transactions on Neural Networks 1 (4), pp. 303–305. Cited by: Training neural networks to produce posteriors..