How to estimate the 3D power spectrum of the Lyman- forest
We derive and numerically implement an algorithm for estimating the 3D power spectrum of the Lyman- (Ly) forest flux fluctuations. The algorithm exploits the unique geometry of Ly forest data to efficiently measure the cross-spectrum between lines of sight as a function of parallel wavenumber, transverse separation and redshift. We start by approximating the global covariance matrix as block-diagonal, where only pixels from the same spectrum are correlated. We then compute the eigenvectors of the derivative of the signal covariance with respect to cross-spectrum parameters, and project the inverse-covariance-weighted spectra onto them. This acts much like a radial Fourier transform over redshift windows. The resulting cross-spectrum inference is then converted into our final product, an approximation of the likelihood for the 3D power spectrum expressed as second order Taylor expansion around a fiducial model. We demonstrate the accuracy and scalability of the algorithm and comment on possible extensions. Our algorithm will allow efficient analysis of the upcoming Dark Energy Spectroscopic Instrument dataset.
Inference of the power spectra of cosmic fields is one of the main tools of observational cosmology. Theory behind estimation of power spectra for Cosmic Microwave Background and galaxy surveys (and other point tracers) has been well developed over the past thirty years. It is not difficult to write a likelihood for the power spectrum given measured fields assuming these to be Gaussian. From this one can derive the standard “optimal quadratic estimator”222Strictly speaking, the OQE is only approximately quadratic and optimal, even for Gaussian fields, although in the large data set limit the distinction is not important. The OQE equation as usually written appears at first glance to be quadratic in the data, but it only achieves a maximum likelihood solution after iteration, which makes the final answer non-quadratic in the input data (it is easy to see this by noting that the Fisher matrix, i.e., 2nd derivative of the likelihood, appears in the standard formula, while the maximum likelihood solution can only require the 1st derivative). (OQE) 1998PhRvD..57.2117B (); 1998ApJ…503..492S (); 2000ApJ…533…19B (), but it is numerically too expensive to evaluate for large surveys using brute-force linear algebra. Therefore, a number of good approximations have been developed 1994ApJ…426…23F (); 1998ApJ…499..555T (); 2006PhRvD..74l3507T ().
In the field of Ly forest, the situation is much less developed. 2006ApJS..163…80M () used an OQE to measure the Ly forest one-dimensional (1D) power spectrum, describing the correlations between pixels in a given line of sight, and a similar algorithm was more recently used in 2013A&A…559A..85P (). However, as the density of lines of sight increases in present and future Ly forest surveys, it becomes more important to measure the full three-dimensional (3D) correlations, including correlations between pixels from different lines of sight. Recent analyses of 3D correlations, focused on measuring the large scale feature of Baryon Acoustic Oscillation (BAO), have relied on measuring the correlation function using simple pixel-product algorithms 2011JCAP…09..001S (); 2013A&A…552A..96B (); 2015A&A…574A..59D (); 2017A&A…603A..12B () (although see 2013JCAP…04..026S () for an example of using OQE-like approach to estimate the correlation function). The main goal of this paper is to present a likelihood based algorithm that will allow us to measure the 3D power spectrum of the Ly forest down to small scales, and including all the information in traditional 1D analyses.
In principle the power spectrum and correlation function are simply linear transforms of one another, potentially containing the same information. In spite of this, in practice they are often thought of as living in disconnected worlds, i.e., estimated using much different algorithms and not compared directly. Each has potential advantages for isolating different kinds of systematics into a limited range of their coordinates (e.g., the power spectrum is good for isolating slowly varying effects to small wavenumbers, while the correlation function is good for isolating effects at specific separations). Different tricks and approximations are generally used to make estimation algorithms tractable for large data sets (including, most importantly, estimation of error bars). The power spectrum has the advantage of generally less correlated errors and diagonalization of the theory in the linear regime limit. The primary point of this paper is to provide a practical algorithm for the 3D Ly forest power spectrum measurement, in order to exploit its advantages, but an implicit point is that the different statistics do not need to live in isolated worlds. We show in passing how a measurement of the power spectrum can always be converted into a correlation function measurement and vice versa, so that ultimately one should be able to exploit the advantages of both within the same basic analysis (our conversion of the cross-spectrum to power spectrum is an explicit example of this).
Methods developed for point tracers do not adapt directly to the Ly forest problem, because of the unique window function of the Ly forest. We measure the flux fields in individual lines of sight, each of which covers a certain redshift range, which do not overlap perfectly and which generally offer relatively high resolution in the radial direction but only sparse sampling in the transverse direction.
This work is clearly timely given the forthcoming large increases in the density of lines of sight in upcoming Ly forest surveys such as Dark Energy Spectroscopic Instrument (DESI, 2013arXiv1308.0847L ()). The analysis of these datasets will present numerous challenges beyond power spectrum estimation in terms a plethora of observational and instrumental systematics. In the current work we sweep these under the rug and assume ideal measurements. However, ultimately one of the purposes of our general framework is to allow understanding and propagation of these effects.
We expect that applying Fast Fourier Transform (FFT) algorithm to each spectrum will lead to algorithmic improvements, since each line of sight’s correlations are approximately stationary. This also performs “scale sorting”, which isolates large-scale radial modes that correlate across large transverse separations from small scale modes which have significant correlations across considerably smaller transverse separations. This line of thought naturally brings cross-spectrum, , between lines of sight as a function of angular separation as a natural intermediate product,333In this paper we follow the jargon of 1999ApJ…511L…5H () in calling the radial Fourier transform of the 3D Ly forest flux correlation function at non-zero transverse separation the “cross-spectrum.” This is not to be confused with the three-dimensional cross-power spectrum between multiple tracers, e.g. the 3D Fourier transform of the cross-correlation function between quasar positions and Ly forest. from which the 3D power spectrum can be later inferred. In this paper we bring this rough idea to completion, by presenting a detailed algorithm and its numerical implementation. We begin with the likelihood function and then develop the numerically efficient and mathematically rigorous derivation of expressions for cross-spectrum with well understood approximations and then demonstrate how this can be transformed into 3D power spectrum. The final product will be a likelihood for 3D power spectrum as functions of redshift and of radial and transverse wavenumbers.
We start in section II by presenting the basic likelihood analysis formalism we follow in this paper, and the specific case of a clustering analysis of the Ly forest is discussed in section III. The numerical implementation is detailed in section IV. In section V we revisit 1D analyses in this context, and we show examples of analyses on mock datasets. Measurements of the cross-power spectrum on mock datasets are presented in VI, and in section VII we describe how these can be translated into measurements of the 3D power spectrum. We conclude with a discussion in section VIII.
Ii Likelihood approximation vs. standard power spectrum “estimators”
While our algorithm should obtain results in the end essentially equivalent to the standard OQE, i.e., it could be understood simply as a computing optimization, we think it is useful to take a step back and shed some of the historical baggage that comes along with the words in OQE. In cosmological applications, the OQE has been introduced first into analysis of the CMB and galaxy survey data 1997PhRvD..55.5895T (); 1998PhRvD..57.2117B (); 1998ApJ…499..555T (); 2000ApJ…533…19B (). There are two canonical derivations that appear in these papers. In the first one, a general quadratic estimator is written, the weights are then chosen to ensure inverse variance weighting and unbiased estimates and then Cramer-Rao inequality is used to demonstrate the estimator to be optimal (however, strictly speaking this demonstration requires knowing the correct answer in advance to use for the weighting). In the second derivation, the likelihood is written and then the estimator is presented as a Newton-Raphson step towards maximum likelihood with second derivative replaced by Fisher matrix for numerical efficiency. These are both valid approaches, but the OQE process is commonly viewed as simply an estimator that transforms data into summary statistics. In this paper, we want to connect the same fundamental mathematics to a more Bayesian thinking about the likelihood. In particular, we believe that understanding the measurement process explicitly as a likelihood expansion leads to a streamlined understanding of how we can convert between parameter bases, e.g., between cross-spectrum, correlation function, and power spectrum parameters.
Standard Bayesian data analysis says that for any data set and parameters we should compute the posterior probability for the parameters given the data, . Generally we do this using Bayes theorem, where is a prior on the parameters. In multi-dimensional parameter spaces it is frequently not possible to fully tabulate , let alone obtain an analytic form, so we generally need to look for some limited approximate representation of it, sufficient for the use we intend. One such representation is a set of samples of the parameter vector – instances of distributed consistent with , e.g., from MCMC. These are convenient if one wants to integrate over , e.g., to produce means or variances or confidence intervals for .
An alternative is to approximate by a Taylor series around , i.e., , where . If we truncate this series at quadratic order this is equivalent to the ubiquitous assumption that the errors are Gaussian. This assumption is exact if likelihood has a Gaussian shape and the theory is linear in parameters, but it is in general a good approximation due to central limit theorem or if theory is effectively linear over the range of high-likelihood region. In the power spectrum measurement problem it is wrong in sample-variance dominated measurement when only few modes are present (e.g. the low- CMB power spectrum). A key point here, especially relevant to band power parameters, is if we have this Taylor series expansion around parameters close enough to the values where we want to use the likelihood, we are done, i.e., there is no fundamental need to estimate “best” values for the parameters, as this is irrelevant to further use of the likelihood function. In fact, for band power parameters where we expect the true power to be smooth, it generally makes more sense to expand around our best global estimate of parameter values, e.g., their values in our favorite cosmological model constrained by all available datasets. Moreover, for the same reason it is better not to perform “iterative” expansion around maximum likelihood, because the covariance matrix given by an expansion around the best guess of true cosmological parameter is going to be more accurate than that from an expansion around a model that is the best fit to only one of the datasets (for a concrete example, see 2004MNRAS.348..885E ()). This does not preclude identifying deviations from the best-guess model, as the likelihood expansion still implies some maximum likelihood parameter values – they just are not necessarily . At most, iteration should take place at the most global possible level, e.g., possibly adjusting if the results of all-data cosmological parameter estimation give a change in the best model.444In contrast, the most standard “estimate the parameter values and covariance matrix” form of data analysis is to find maximum likelihood values of parameters, where , and expand around this point to find the covariance matrix.
Given and , finding the constraints on a different set of parameters, centered on the same fiducial model and assuming a Taylor series for is still sufficient, is generally as simple as applying the derivative chain rule, e.g., , where is the new set of parameters.555As we discuss below, minor subtleties do arise in the definition of for band power measurements.
To make this more concrete for our problem: while the goal of a power spectrum measurement is often presented as being to produce a set of estimated values for band parameters and their covariance matrix (e.g., the estimates of 2006ApJS..163…80M () and 2013A&A…559A..85P ()), we consider our task to be to determine the coefficients and .
We describe our data as , where is the mean, is the response matrix, is the cosmological fluctuation of interest and is the instrumental noise. We assume a Gaussian likelihood for the data given a model,
where with the covariance of (the signal we want to measure) and the (diagonal) noise matrix. In the next section we will discuss in more detail what and are in the context of Ly forest analyses, but what matters here is that they do not depend on the parameters we are trying to estimate.
As usual in power spectrum estimation, we assume that is linear in some parameters. The derivatives we need are now quite simple:
where is the covariance matrix evaluated using , and is the derivative of the covariance matrix with respect to the parameter . These Taylor series coefficients imply some mean (maximum likelihood) values for , and a covariance matrix around that maximum, but those quantities are superfluous – we are generally going to use our power spectrum results to constrain some more fundamental parameters and for that we just need .
The second derivatives have two terms: the first involves the measured data , and the second is a trace of a matrix product. The expectation value of the first term is proportional to the second term if is the correct covariance matrix for the data, so we approximate the second derivative by its expected value under this assumption, as is usually done in deriving the OQE:
where is the Fisher matrix. In general we do not need to make this approximation and it may not actually be helpful, but we do not investigate it further in this paper (see 2015JCAP…01..022M () for a discussion about this approximation).
The traditional OQE is derived as a Newton-Raphson (NR) step toward the maximum likelihood value of parameters, including this approximation for the 2nd derivative matrix, i.e., the standard OQE equation is simply a re-shuffling of terms in:
where is the Fisher matrix and the first derivatives of the log-likelihood.
Our bottom line point of this section is that in this paper we do not consider this NR step to be a fundamental part of our analysis. Our purpose is to estimate the likelihood derivatives, and we only compute things like the implied central value when necessary for diagnostics like plotting.
In this paper we often ignore the priors on the parameters, since they only play two very minor roles: i) we use them to compute implied central values and errorbars when making plots; ii) in section VII we marginalize over some of the band powers, and we need to take into account the priors on the parameters that we marginalize out (see appendix A on marginalization). Whenever they are required, we use Gaussian priors around zero, with a width set to a thousand times the fiducial power evaluated at the center of the band.666We have tested that the exact widths of the priors do not qualitatively change the results of this paper.
Iii Clustering analyses of the Ly forest
In this section we will introduce basic notation in analyses of the Ly forest, present our data model and discuss the parameterization of the power spectrum.
iii.1 Data vector
In a clustering analysis of the Ly forest the data set consists of a set of optical spectra of high redshift quasars, identified by an angular position and a redshift . In the analysis of a real survey we would need to discuss instrumental specifications and details of the data reduction pipeline: sky subtraction, spectral calibration, co-addition of individual exposures, estimates of the noise variance, etc. In this study we will not address these issues in detail, although we will discuss how they would affect our measurements in section VIII.
We will use a simplified data model to describe the measured flux in a pixel of observed wavelength in the spectrum of quasar as:
where is the quasar continuum, is the transmitted flux fraction, the optical depth and the instrumental noise in the pixel. is the convolution kernel for pixel , including both spectral resolution and the pixelization (we will discuss this in detail soon).
In cosmological analyses of the Ly forest we are interested in measuring the statistics of the fluctuations around the mean transmitted flux fraction (often referred as the mean flux), , giving, in our simplified model,
where we have assumed that can be approximated as constant over the width of the smoothing kernel, and defined .
The covariance of the observed data vector (defined in Eq. 7) is given by:
where is the (diagonal) noise matrix and is the 3D correlation function of the field we are interested in, and is the angular separation between the spectra of pixels and . Defining , we can exactly equivalently describe the 3D correlations as a function of angular separation, velocity separation and redshift, , with . Note that this is equivalent to defining and then . These quantities should be understood as just different ways of packaging the two observable wavelengths – we do not rely on them having any more fundamental physical meaning.
In the presence of redshift evolution, the power spectrum is not well-defined as a variance of Fourier modes. We can, however, still model the data as a Gaussian random field, understood to mean a field with Gaussian likelihood function specified by a correlation function, which does not require the field to be stationary. We define the 3D power spectrum to be the Fourier transform of this correlation function:
where has units of inverse radian, and units of . We expect that on scales smaller than the scale of the evolution, the power spectrum defined this way will closely follow intuition based on the ideal stationary case, but in any case it is well-defined without any assumption about that. Note that it is not completely trivial here that the Fourier transform is defined to be with respect to observable coordinates, rather than, e.g., an approximation of comoving coordinates using a fiducial model. One should not think, however, that the BAO feature is directly smeared by this definition – a delta function at some separation in the correlation function in comoving coordinates remains a delta function in observable coordinates.777The key is to remember that by definition we must consider pairs of pixels at fixed – there is no averaging together here of, e.g., purely transverse pixel pairs at different redshifts where fixed angular separation means different comoving separation.
Given a comoving coordinate power spectrum we can always compute this observable coordinate power spectrum by transforming to a correlation function, introducing the observable coordinates, and transforming back. Note, however, that on the BAO scale the effect of ignoring the non-linearity of the coordinate transform is tiny. E.g., in a typical model the radial comoving separation between and is 68.358 Mpc/h, while Mpc/h, a % difference. If points at the same two redshifts are separated by 0.02 radians, the true comoving transverse separation is 78.418 Mpc/h, while 0.02 times the angular diameter distance at is 78.425 Mpc/h, % different. I.e., at foreseeable levels of precision we can consider the relation between power spectra to be a simple linear change of units, except possibly on much larger scales. The key is that we keep separation calculations symmetric around the central redshift, so that differences are quadratic in the scale changes (strictly speaking we measure distances symmetrically in , but it doesn’t make any difference to this discussion). We will discuss coordinates further in Section VIII.1.
From the same correlation function we can also define the cross-spectrum,
that quantifies the correlation of lines of sight modes as a function of angular separation.
We can now rewrite the elements of the covariance matrix in equation 8 as a function of the cross-spectrum:
where is the Fourier transform of the convolution kernel for pixel , i.e.,
where is the Gaussian spectral resolution and is the pixel width, both in velocity units.888Physically, the resolution should be modeled as a continuous convolution of the continuous field, only subsequently discretely sampled onto pixels, but currently spectra are usually presented with a single resolution value per pixel, which is generally slowly varying from pixel to pixel, so we will not make this distinction here
We parameterize the power spectrum with a grid of band power parameters:
where is the fiducial power spectrum (around which the likelihood is expanded as explained in Section II) and is a trilinear interpolation kernel. Note that the fiducial power spectrum is evaluated at the full resolution, independent of individual band-power bins. This decreases the impact of features smaller than the bin size, which can still be represented perfectly if they are consistent with the fiducial model.
We also consider an alternative parameterization in which we introduce a constant at each redshift that allows us to define a 3D magnitude of the wavevector,
and cosine of the angle between the wavevector and the line of sight,
is necessary of course because observable and do not have the same units. We can then parameterize by simple interpolation over and – we will call this
We will discuss the relation between these parameterizations further below – the advantage of parameterization is that if we choose to be the Alcock-Paczynski factor 1979Natur.281..358A () at each redshift for something close to the true model, then will be the cosine of the comoving angle corresponding to observable . The BAO feature will be approximately independent of which should allow relatively coarse interpolation in , and the linear redshift space boost of power will be approximately independent of , producing aesthetically pleasing plots.
As discussed in Section IV, we will start our analysis by estimating the likelihood as a function of cross-spectrum parameters, defined as:
where is a bilinear interpolation kernel and is a top-hat kernel in the transverse direction. We chose a nearest-grid-point (NGP) interpolation in the transverse direction because, in combination with the numerical approximation described in IV.2, will reduce considerably the number of elements of the Fisher matrix that we need to compute. Using NGP in , on the other hand, would not reduce it further. This will be discussed again in IV.2.
We will discuss after explaining our detailed numerical algorithm why we do not from the start define bands in approximate comoving coordinates using a fiducial cosmology. Suffice it to say at this point that this is not necessary as long as we plan to make measurements in sufficiently narrow redshift bins. As long as we use a good interpolation scheme, i.e., one that converges to an exact representation of the function in the limit of fine point-spacing, using different coordinates to represent the same function can only improve the efficiency of the representation, i.e., allow us to represent the function accurately with fewer interpolation points. We will always test explicitly that results of interest have converged with respect to band spacing.
The covariance matrix is given by , with the derivative with respect to one of these band power parameters being:
and we have all the pieces required to compute the derivatives of the (log-) likelihood described in section II.
Note that each matrix , corresponding to a cross-spectrum parameter with angular separation , will have a very sparse structure, since only the sub-matrices corresponding to pairs of spectra separated by will be non-zero. This would not be the case if we wanted to directly estimate the three-dimensional power, where a given pair of pixels would contribute to all band power parameters.
iii.3.1 Inverting functional parameterizations
We have introduced several parameterization of continuous functions of the form , where we use here to represent all continuous variables. As usual, we describe the likelihood to be a function of data given a set of parameters , . Theories generally predict , so to use this kind of likelihood function we need to know how to compute . We intend to work mostly close to the limit of high resolution parameterization, i.e., spacing between points smaller than any structure in , where it should be fairly obvious that we can simply take . However, to approach this limit robustly it is useful to think a little more carefully about this problem. A formal solution is , viewing as an matrix, however, this matrix is not generally invertible. A natural thing to do is use the pseudo-inverse of , equivalent to finding to give the least square deviation from the target . Assuming equal weight for all , i.e., using uniform discretization to make into a finite dimensional matrix, the solution is:
with . We see that is the response of the parameter to a delta function impulse in away from . In the simple case of top-hat bands which do not overlap, this reduces to simply averaging over the band, but for other cases like linear interpolation it is not quite so simple. E.g., for two parameters at and determining the interval by linear interpolation, the effective weight on for the first () parameter is , i.e., it peaks at and goes linearly through zero at to become negative at higher . Intuitively, if the power is high at , the 2nd parameter will be larger, requiring a lower value of the first parameter to describe the intermediate range. When more bins are considered, is a band matrix with just one additional diagonal, but is not, coupling elements that are more than one bin apart. In less trivial setups we can always evaluate the formula numerically, but we see also that in the limit of narrow , relative to the structure in , we are justified using .
Another approach to this problem is to compute the expected response of estimated parameter values to changes in the true model in much narrower bands 1999PhRvD..60j3516K (); 2012ApJ…761…13S (), which amounts to going back to the data to re-compute the Fisher matrix with one replaced with derivatives with respect to the much finer bands. While it is possible that this approach could save some computation over simply making the original measurement bands themselves narrower, in this paper our plan is to always make sure that our end results have converged with respect to the width of the measurement bins, i.e., we will make our measurement bands narrow enough that the detailed structure within them is not important. A natural question at this point is: what is the relationship between this data-derived method and the mathematical pseudo-inverse above? The key is that the discretization required for the pseudo-inverse, assumed to be uniform in in Eq. (19), is ambiguous as a matter of pure math. Another way to look at the pseudo-inverse is that if was measured using many finely and uniformly spaced points with equal errorbars, then Eq. (19) would give the best fit for the coarsely interpolated parameterization. The data-derived method determines a more correctly representative, in general non-uniform, weighting within the bins.
iii.4 Connection to the traditional 1D power spectrum
In the limit of zero angular separation, the cross-spectrum becomes the traditional 1D power spectrum:
It is very common in 1D analyses to ignore the correlation between the fluctuations in neighboring spectra, resulting in a block-diagonal covariance matrix and in a likelihood that is factorizable. The derivatives of the log-likelihood are then additive, and we recover the method to estimate the 1D power spectrum presented in 2006ApJS..163…80M () and later used in 2013A&A…559A..85P ().999Note, however, that the redshift interpolation of the power is better defined in our analysis. For instance, 2013A&A…559A..85P () split each spectrum into two or three segments, and assumed that each segment contributed to a single redshift bin.
Iv Numerical implementation
A naive implementation of the algorithm implied by sections II and III is computationally intractable, since it involves thousands of linear algebra operations with products of extremely large matrices, of size set by the number of pixels in the survey ( for relevant surveys).
In this section we will discuss some approximations and numerical implementations that make the problem tractable, and that allow us to analyze the BOSS data set typically using a few thousand CPU hours depending on the number of bins used.
We will start in IV.1 by choosing a parameterization of the likelihood in which a quasar pair contributes only to a small fraction of parameters. In IV.2 we present a particular fiducial clustering model that allows us to have a block-diagonal covariance matrix, reducing drastically the size of the linear algebra operations required. In IV.3 we show that the information in the response matrices can be described by a very small number of eigenmodes. Finally, in IV.4 we further speed up the algorithm by internally using Fast Fourier Transforms (FFT) to compute some of the matrix multiplications.
We aim to estimate first the cross-spectrum, because using it each pair of quasars contributes only to a single parameterized band in the transverse coordinate (angular separation), instead of contributing to all parameters if we went straight to a 3D power spectrum. Estimating from the raw data is the computationally demanding step. After that, the data set is massively compressed and we have a lot of options to transform to other parameterizations, which we will discuss below. To be clear: it is the sparsity of the quasars on the sky in real data sets, relative to the scales of interest, that motivates approaching them pair-wise, and therefore using the cross-spectrum. This is analogous to the efficiency of estimating a 3D correlation function pair-wise for a sparse set of point objects (which could similarly be converted to a power spectrum by the methods below). If we had much denser sampling (or wanted to efficiently push the measurement to much larger scales) we would be motivated to use a method that, e.g., compressed spectra in an angular pixel into some aggregate measure of the flux in that pixel, transformed in the transverse directions, and then measured the 3D power spectrum directly. To cover both regimes we could imagine something akin to the PM approach to N-body simulations (e.g., 2013MNRAS.436..540H (); 2007arXiv0711.4655S (); 2003NewA….8..581P () – generally, any of the ways similar problems have been solved for N-body simulations could be useful).
iv.2 Zero cross-correlation fiducial model gives block-diagonal
In order to measure 3D correlations we need to consider the full likelihood, and we need to choose a fiducial power in order to specify the global weighting matrix . The ideal choice would be to use our global best guess of the true power spectrum, a CDM model with linear bias and redshift space distortions and appropriate non-linear correction on smaller scales 2003ApJ…585…34M (); 2015JCAP…12..017A (); 2017A&A…603A..12B (). If we choose a different model, our Taylor series generally will not be as accurate near the true model as discussed in Section II.
In this work we take the fiducial cross-spectrum to be
i.e. it is vanishing everywhere except at zero angular separation. This makes the weighting matrix block-diagonal giving a huge speed up. Because variance of the field is dominated by very small scale modes, or equivalently, because the is very rapidly falling with (as shown in Figure 6 below), this is a reasonably good approximation to the truth. This is further supported by the study presented in Section VII below, and in particular in Figure 8, where we show that any possible underestimation of the measured uncertainties caused by this approximation is smaller than 6-8%.
We use subscripts and to identify blocks corresponding to a particular quasar. We can now describe the (log-) likelihood derivatives as sum over blocks:
where we have also defined and is the sub-matrix of corresponding to the cross-correlation of pixels in quasars and . From equation 22 above it is clear that the second term of the first derivative, the one with a trace, will only be non-zero for derivatives with respect to parameters. With our top-hat parameterization in the direction, the Fisher matrix is non-zero only for parameters corresponding to the same value of .
iv.3 Eigenmode decomposition of matrices
We started with a problem that was computationally impossible, and thanks to the choice of parameters ( instead of ) and the choice of weighting matrices (block diagonal ) we have now a problem that is just extremely computationally intensive.
Let us consider a BOSS like survey, with an area of 10 000 square degrees and a density of lines of sight of roughly 15 per square degree. If we are interested in measuring 3D correlation up to 3 degrees (roughly ), we will need to correlate each spectrum with an average of 400 neighbors, for a total of roughly 30 million pairs of spectra (not counting the same pair twice). If we are measuring a total of 1000 cross-power spectrum parameters (as discussed below we typically use several thousand parameters), the Fisher matrix (or second derivative) will have a million elements , and even if we consider only the most relevant covariances we still need to compute several tens of thousand elements. As shown in equation 23, for each element we need to compute the trace of 30 million products of four matrices, of size equal to the number of pixels in a spectrum, roughly 500.
The algorithm described above would work on a small data set, with limited number of parameters, but it quickly becomes too slow when applied to any relevant survey. In this section we will introduce numerical implementations that will speed up dramatically the algorithm without adding further approximations.
To motivate the math below, let us consider first the simplified example of a measurement on a very large periodic box, with parallel lines of sight and a static universe, and ignoring pixelization or resolution effects. In this case we could define the following modes for each line of sight:
and it is easy to show that these modes would be uncorrelated for different values of , and that their correlations would be given by the cross-power spectrum:
where is the Dirac delta function. In this space, both the covariance matrix and its derivatives would be diagonal. Redshift evolution and the finite length of the lines of sight in a realistic survey complicate this picture, but as we show below most of the information in the matrices is still captured in a small number of modes.
So we would like to project each spectrum onto some kind of Fourier-like modes, but while the parameter response matrices are localized in redshift, Fourier modes are not. For this reason we could hope to do better by using Fourier transforms over a set of ad hoc redshift windows. What we do will be qualitatively similar to this, but with the modes and envelopes more carefully derived. Our first thought to derive modes might be to use Karhunen-Loeve (KL) eigenmodes 1997ApJ…480…22T (), which are generally eigenmodes of , where is a Cholesky decomposition. As we have set things up, each quasar pair would have its own set of matrices, covering a different set of wavelengths, and with different pixelization and resolution. We would need to compute the eigenmode decomposition of all of them, which is easier than computing KL modes for the fully coupled data set, but would still make the algorithm intractable.
For this reason, we will first extend a little bit our data model. Let us define a common pixel grid covering the whole wavelength range of the survey.101010In particular we use a grid of 2450 cells, of same width as in BOSS coadded spectra , covering the redshift range . We can map all our spectra into that grid, giving effectively infinite noise to cells that are not covered by a given spectrum (or masked by the pipeline). If we discretize the integral over wavelength in equation 7 we can rewrite the data model as:
with , and the covariance matrix is:
where is now the signal covariance without pixel or resolution effects.111111To prevent numerical artifacts when evaluating Equation 29 we compute with a small Gaussian smoothing of , and correct for it by using as the pixel resolution when computing the resolution matrices . We have checked that the results do not depend on the exact value of used.
The derivatives of the covariance matrix for a pair of spectra are now:
and similar to equation 18 we can write them as:
Equation 29 implies that: i) the derivative matrix will be zero unless and ii) if non-zero, the matrix will only depend on (,) and will be the same for all quasar pairs. Therefore, if we choose to project onto eigenmodes of , we will only need to decompose as many matrices as parameters in our analysis. In Figure 1 we show a couple of matrices that might appear in a typical analysis.
Now we simply exactly diagonalize , i.e., find where is diagonal. This is motivated by the fact that in the limit of infinitely narrow -bins and infinitely large redshift bin, these signal matrices would have exactly two non-zero eigenvalues.121212To see this, note that a single Fourier mode of any phase will have correlation function and that , where is the cell width of the grid. In this case the pixel covariance can be written as . A matrix with finite redshift and -bins is thus expected to have a few dominant eigenvectors.
In Figure 2 we present the distribution of eigenvalues for the matrices presented in figure 1, sorted by absolute value. In both cases the number of relevant eigenvalues is very small (6 or 7 out of 2450). The dotted lines show the of the maximum eigenvalue, and we generally neglect those eigenmode with an eigenvalue smaller (in absolute value) than that. This means that not only the matrices presented above are diagonal, but only a handful of elements are considered non-zero.
In Figure 3 we show the first eigenvector for the same matrices . The left panel corresponds to a parameter with wavenumber times larger than the wavenumber in the right panel. They are both localized in a region of the grid corresponding to the redshift of the parameter in the derivative, as shown by the vertical dotted lines.
We will compute the eigendecomposition only once per parameter, and use the decomposition for all quasars pairs. With this diagonalization we can now look again at the derivatives. The first term in the first derivatives, i.e., the one involving the data vector, will now be:
where is the same for all quasars pairs that contribute to this particular angular separation bin, is now formally defined over all wavelengths but it is approximately limited to the range actually covered by the spectrum, and we have defined . For each parameter, is akin to the collection of Fourier modes in the range corresponding to the parameter, for a transform windowed to the range covered by the parameter.
Similarly we can expand the second term of the first derivative:
and the second derivative (or Fisher matrix):
where we have defined and . We will have to compute the objects and only once for each spectrum, and once we have them the cross-correlations will be very fast.
Note that since we only consider eigenmodes (see Figure 2), the diagonal matrices have only non-zero elements. This means that we only need to compute a very small block of elements of the matrices.
iv.4 FFT of eigenvectors
The eigenmode decomposition described above moved the computational challenge from the spectra cross-correlations to the computation of per-spectrum objects: and . In particular the slowest part is to compute the rotated inverse covariances:
for which naively we need to compute a product of three matrices of size equal to the number of cells in our radial grid, 2450. For each quasar , naively we need to compute this product for each pair of matrices (,), specified by their values of (, ).131313Remember that in Equation 29 we introduced as matrices that do not depend on . In the analysis presented in Section VI, this would imply over matrices per quasar. On the other hand, we know that in the idealized high symmetry case this rotation amounts to FFTs along each row and column which can be done orders of magnitude faster.
Fortunately, matrices are only used to compute the elements of the Fisher matrix , as shown in Equation 33. Parameters that have very different values of either or will be very weakly correlated, and therefore we decide to compute only a subset of the elements of the Fisher matrix. In particular, we ignore the correlation of parameter pairs with either or , where is the geometric mean and is the speed of light, with the exception of neighboring parameters that are always included.141414The last term was added to allow correlations of the lowest values of over all redshift ranges. In our default analysis, this reduces the number of matrices we need to compute by an order of magnitude.
The eigenvectors shown in figure 3 have clear oscillatory features, what suggests that they would be very sparse in Fourier space. For this reason we introduce discrete Fourier transform and inverse Fourier transform matrices pairs in the computation of the per-quasar objects:
where we have defined and . Note that while is a 2D (inverse) Fourier transform, is a 1D Fourier transform, with each column of the matrix containing the 1D Fourier transform of each eigenvector.
The modes for some eigenvectors with negative eigenvalues are not as localized as those shown in figure 4, and for this reason we apply a cut in the tails of the Fourier modes, and we only keep of its norm.
In order to speed up the computation of equation 35, we use another approximation. In the limit of infinitely long windows, the matrices would be diagonal. We checked that including off-diagonal correlations between modes separated by less than (corresponding to 15 off-diagonal elements in our default setting) was enough to recover unbiased results in the analysis on mock data sets presented in Section V.
iv.5 Summary of numerical optimizations
To summarize, we are trying to evaluate equations 2 and 3. The brute force evaluation would require us to deal with matrices of size where is the number of quasars and is the number of pixels in a spectrum. We do the following approximations.
Expand around the model with zero cross-power, in effect reducing the covariance matrix to dense matrices of size . All relevant terms can now be calculated by manipulating (millions of) matrices.
Diagonalizing the matrices of derivatives of covariance matrix with respect to clustering parameters. This allows us to project relevant eigenvectors separately for each spectrum, speeding up consequent operations.
Computing the projections of the weighted data vector and more importantly inverse covariance matrix ( and ) by rotating to Fourier space where the signal eigenvectors become very sparse, and fairly diagonal, then rotating back.
There are several parameters that control the accuracy of the latter two approximations, which we summarize in Table 1.
|0.02||Keep eigenmodes whose absolute eigenvalue is larger that of maximum eigenvalue.|
|0.999||Cut tails of eigenvector FFTs so that it contains of its norm.|
|Ignore correlations in between modes separated by more than (in ).|
|0.175||Ignore elements of Fisher matrix () with .|
|0.0005||Ignore elements of Fisher matrix () with (in ).|
V Analysis of the 1D power spectrum on mock data
We will begin our numerical exploration by using mock datasets to compare measurements of the 1D power spectrum implemented using brute force real space matrix multiplication (as done in, e.g., 2006ApJS..163…80M ()) with our numerical implementation presented in IV. The main motivation here is to establish basic sanity of our numerical work and to test the validity of approximations that we describe in Section IV. Measurements of 1D power spectrum alone are sufficiently fast that we can afford to switch these approximations off and see the impact that they have.
v.1 Mock dataset for measurements
We simulate data mimicking the quasar spectra from the twelfth data release (DR12, 2015ApJS..219…12A ()) of the Sloan Digital Sky Survey (SDSS-III, 2011AJ….142…72E ()), containing mostly spectra from the Baryon Oscillation Spectroscopic Survey (BOSS, 2013AJ….145…10D ()). We generate mock spectra for 181,506 quasar spectra with , using quasar continua matching the amplitude of each observed quasar, and adding Gaussian noise to the spectra using the noise variance estimated by the SDSS pipeline.
The mock spectra have a mean transmitted flux fraction that approximately matches the one measured in the data, and Gaussian fluctuations around this mean are generated using an analytical expression for from 2013A&A…559A..85P (), corrected to be flat at very low-k. A continuum for each spectrum is generated using a mean restframe shape for all quasars, redshifted and normalized accordingly to roughly match the observed magnitude. Finally, the spectra are convolved using the mean resolution within each forest, pixelized using the pixel width of the SDSS coadded spectra, , and Gaussian noise is added using the variance provided by the SDSS pipeline.
v.2 Band power measurement
We parameterize following equation 17, for the special case of , and we use the same that was used to generate the mock spectra. 151515We have tested that we recover the right power spectrum when starting with a fiducial not far from the truth. We will use a grid of band power parameters with z-bins and k-bins, for a total of 147 parameters. The first z-bin is centered at , and the other bins are separated by . The last z-bin is centered at . The first k-bin is centered at , and the following 5 bins are linearly spaced with a separation of . The last 15 k-bins increase logarithmically with , with the last bin centered at .
We present the measured power in figure 5, where we plot the measured power divided by the input power used to generate the mock spectra. The definition of “measured power” in all of our plots is the maximum likelihood implied by our Taylor expansion of the likelihood, equivalent to the result of a single NR step given by Equation 5. The errorbars are the square root of the diagonal elements of the inverse of the Fisher matrix. The dotted lines show the measurement using brute force matrix multiplication, while the dashed lines shows the measurement when using the -oriented implementation described in section IV. We see that both methods agree well with each other and the expectation for these mocks, with the biggest deviations maybe not surprisingly appearing at the extremes of and .
v.3 Goodness of fit -
In the analysis above we computed the Taylor expansion of the log-likelihood around the true power spectrum . In this case, the expected value of the first derivatives is zero, and their covariance equals the Fisher matrix, . The standard can then be computed directly from the derivatives using and its expected value is equal to the number of band power parameters.
In table 2 we show the performance of the different algorithms, and the sensitivity to the different approximations discussed in Section IV. When using all the (147) bins in the measurement, we measure a value of for the exact matrix multiplication analysis (probability of 30.7%), and a value of for our new default implementation (probability of 4.7%). is the limit typically used for data of this resolution, since on large scales the suppression of power caused by the spectral resolution is very large and difficult to correct for accurately. If we marginalize over the last bin at , the agreement with the new implementation is considerably improved to for 140 degrees of freedom (d.o.f), and the probability is of 16.9%. In Appendix A we describe how we marginalize the likelihood over unwanted parameters.
In table 2 we also show the effect of the different numerical settings on the values of as a function of .
|exact matrix ops||155.1||0.307||139.0||0.508||125.3||0.670|
One can see that we are not very sensitive to the details of the numerical implementation, although does seem to make a non-negligible difference. Our algorithm is not optimized for , so we take these results as good enough and move on to our real goal of measuring 3D cross-correlations.
Vi Analysis of the cross-power on mock data
After having established validity of the method on 1D power spectrum measurement, we proceed to cross-power spectra.
vi.1 Mock dataset for measurements
We will present the measurement on 40 realizations of the BOSS DR11 mocks, that have been widely used in past BOSS Ly BAO analyses 2015A&A…574A..59D (); 2017A&A…603A..12B (). These mocks were generated for the eleventh data release (DR11) of BOSS, for a total of 134 386 quasars. Since BOSS did not generate mock spectra for the final data release (DR12), and since the difference in area is roughly 10%, we will use these mocks as realistic simulations of the BOSS survey.
The algorithm to generate the mock Ly forest skewers is described in detail in 2012JCAP…01..001F (). In short, a correlated and anisotropic Gaussian field is computed along all lines of sight, with an input power spectrum such that after applying a lognormal transformation to the Gaussian field to obtain an optical depth, the resulting field has the desired Ly forest power spectrum from 2003ApJ…585…34M (). The important thing to note is that while these mocks have been extensively used in the past BOSS paper, they are known to contain imperfections, especially on small scales. Some of these imperfections are related to redshift interpolation applied when making these mocks and would have been missed for analysis that use a single bin across the survey.
vi.2 Band power measurements
We will use the same grid of redshift () and line of sight wavenumbers () bins that were used in measuring in section V, and we will use a total of angular separation bins: the first bin will be at zero separation (containing 1D information), followed by five bins linearly spaced by an angular separation of radians, or roughly transverse ; the last 21 bins are distributed logarithmically with a fractional increase of , with a last bin centered at radians, or roughly .
In Figure 6 we show the cross-power spectrum measured from 40 mock realizations of the BOSS survey. The measured power is as usual plotted at the maximum likelihood point implied by our quadratic expansion of , i.e., the result of a single NR step away from (Equation 5). The errorbars are the square root of the diagonal elements of the inverse of the Fisher matrix. We show a representative sample of angular separations, and the four redshift bins that are better measured in the analysis. The solid lines show the expected measurement given the input theory that went into generating the mock data, described in more detail in section VII.2. While we plot results for diagnostic purposes, we do not intend to be an end-stage result of our algorithm – it is an intermediate data compression step in the process of producing .
Vii From to
In the previous sections we have presented an algorithm to obtain a second order Taylor expansion of the log-likelihood function around a fiducial model spectrum, describing the cross-spectrum of the Ly forest .
For a sufficiently fine binning, this presents a complete summary statistics describing 2-pt correlations, and we could decide to stop here. We could compute theoretical predictions for the cross-spectrum, and do parameter estimation (BAO scale, linear power…) directly in this space. However, since the models are usually defined in Fourier space, it is useful to translate the measurement into a truly three-dimensional power spectrum . In principle we could convert the measurement into any other parameterization, including, for example, the correlation function.
We would like to compute the Taylor expansion of the (log-) likelihood with respect to a new set of parameters . Using the derivative chain rule we can compute the new derivatives (,) as a function of the old ones (,):