Efficient Two-Point Correlations

A Computationally Efficient Approach for Calculating Galaxy Two-Point Correlations

Regina Demina, Sanha Cheong, Segev BenZvi, Otto Hindrichs
Department of Physics and Astronomy, University of Rochester, 500 Joseph C. Wilson Boulevard, Rochester, NY 14627, USA
E-mail: regina@pas.rochester.edu; scheong@u.rochester.edu; sybenzvi@pas.rochester.edu; otto.heinz.hindrichs@cern.ch
Accepted XXX. Received YYY; in original form ZZZ

We developed a modification to the calculation of the two-point correlation function commonly used in the analysis of large-scale structure in cosmology. An estimator of the two-point correlation function is constructed by contrasting the observed distribution of galaxies with that of a uniformly populated random catalog. The assumption that the distribution of random galaxies in redshift is independent of angular position allows us to replace pairwise combinatorics with fast integration over probability maps. The new method significantly reduces the computation time while simultaneously increasing the precision of the calculation. It also makes it possible to introduce the cosmological parameters only at the last and least computationally expensive stage, which is helpful when exploring various choices for these parameters.

large-scale structure of Universe – distance scale – dark energy – surveys – galaxies: statistics – methods: data analysis

1 Introduction

The two-point spatial correlation function (2pcf) is a key tool used in astrophysics for the study of large-scale structure (LSS). Given an object such as a galaxy, the number of other galaxies within a volume element dV at a distance r from the first galaxy is

dN = n̄ [1 + ξ(r)] dV,  (1)

where n̄ is the mean number density and ξ(r) is the two-point correlation function characterizing the deviation from a uniform distribution in the separation between galaxies (Peebles, 1973).

The correlation function ξ(r) and its Fourier transform, the galaxy power spectrum P(k), have been used to describe the distribution of matter in galaxy surveys (Yu & Peebles, 1969; Peebles & Hauser, 1974; Groth & Peebles, 1977; Feldman et al., 1994). During the past decade ξ(r) has become a popular tool for the reconstruction of the clustering signal known as baryon acoustic oscillations, or BAO (Eisenstein et al., 2005). The BAO signal is the signature of the density differences that arose in the early universe before the thermal decoupling of photons and baryons (Sunyaev & Zeldovich, 1970; Peebles & Yu, 1970). It is detectable today as a characteristic peak in the galaxy spatial correlation function at roughly 100 h⁻¹ Mpc, where h defines the Hubble parameter H₀ = 100 h km s⁻¹ Mpc⁻¹.

Given a spectroscopic survey containing the 3D coordinates of each galaxy, there exist several possible ways to estimate ξ(r). The most popular estimator is due to Landy & Szalay (1993), which is constructed by combining pairs of galaxies from a catalog of observed objects ("data") and a randomly generated catalog with galaxies distributed uniformly over the fiducial volume of the survey but using the same selection function as the data. The Landy-Szalay (LS) estimator is

ξ_LS(s) = [DD(s) − 2 DR(s) + RR(s)] / RR(s),  (2)

where DD, DR, and RR are the normalized distributions of the pairwise combinations of galaxies from the data and random catalogs (plus cross terms) at a given distance s from each other.

There exist many 2pcf estimators other than the LS estimator, each with different advantages and limitations. However, all commonly used estimators are functions of the DD, DR, and/or RR distributions (see Hamilton (1993); Kerscher et al. (2000); Vargas-Magana et al. (2013)) and therefore depend on the random catalog. To limit statistical fluctuations in RR, it is typical to generate random catalogs with one to two orders of magnitude more galaxies than the survey under investigation. Unfortunately, the "brute-force" computation of RR, in which all possible pair combinations are counted, is an O(N_R²) calculation, where N_R is the number of galaxies in the random catalog; this results in a trade-off between statistical uncertainties and computation time. This trade-off has consequences beyond reducing uncertainties. For example, researchers often simulate a variety of cosmological parameters when studying the LSS, but the computational overhead required to calculate RR may limit the number and type of analyses that can be carried out. The overhead also increases the effort needed to compute the covariance of the 2pcf as well as the effect of different sources of systematic uncertainty.

In this paper, we suggest a method which substantially reduces the time needed to compute RR and DR and hence is applicable to any estimator (the code can be downloaded from http://www.pas.rochester.edu/~regina/LaSSPIA.html). Moreover, the computationally intensive part of the calculation is independent of the choice of cosmological parameters, allowing a larger parameter space to be explored. The method is based on the assumption that the probability for a galaxy to be observed at a particular location in the random catalog can be factorized into separate angular and redshift components. This assumption is frequently made in the analysis of large-scale structure, and it allows us to replace pairwise combinatorics in the calculations of RR and DR with fast integration over probability maps. We find that the estimation of RR is significantly faster than the brute-force calculation. In practical terms, this means that an estimate of the 2pcf which is typically carried out on large computing clusters can be performed on a modern notebook computer.

The paper is structured as follows. In Section 2 we describe the mathematical justification of the method. In Section 3 we describe in detail the resulting algorithm. The performance of the algorithm is studied in Section 4 using mock catalogs and spectroscopic data from the CMASS portion of the Sloan Digital Sky Survey (SDSS) DR9 catalog (Ross et al., 2012; Sanchez et al., 2012; Anderson et al., 2013; Percival et al., 2014). We then conclude in Section 5.

2 Mathematical proof

2.1 Random-random distribution

The 3D position of any galaxy is described by its right ascension α, declination δ, and its redshift z. Based on this information the cosmological distances are calculated given a set of cosmological parameters: Ω_m, the present-day relative matter density of the universe; Ω_k, the measure of the curvature of space; and Ω_Λ, the relative density due to the cosmological constant. The comoving radial distance is calculated from the observed redshift z as

D_C = D_H ∫₀^z dz′ / E(z′),  (3)
where

D_H = c / H₀  (4)

is the Hubble distance, c is the speed of light, and E(z) is calculated as

E(z) = √( Ω_m (1+z)³ + Ω_k (1+z)² + Ω_Λ ).  (5)
The transverse distance is calculated as

D_M = (D_H / √Ω_k) sinh( √Ω_k · D_C / D_H ).  (6)
As suggested by current cosmological constraints (see e.g. Aubourg et al. (2015); Ade et al. (2016)), Ω_k is small and thus eq. 6 can be approximated by

D_M ≈ D_C ( 1 + Ω_k D_C² / (6 D_H²) ).  (7)
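The distance relations above can be sketched numerically. The following Python snippet is an illustration only (the authors' implementation is in C++; the function names here are ours): it evaluates D_C by direct midpoint integration of 1/E(z) and applies the small-Ω_k expansion of eq. 7 for the transverse distance.

```python
import math

C_KM_S = 299792.458  # speed of light [km/s]

def E(z, om, ok, ol):
    """Dimensionless Hubble rate E(z) = H(z)/H0 (eq. 5)."""
    return math.sqrt(om * (1 + z) ** 3 + ok * (1 + z) ** 2 + ol)

def comoving_distance(z, om=0.274, ok=0.0, ol=0.726, h=1.0, steps=10000):
    """Comoving radial distance D_C in h^-1 Mpc (H0 = 100 h km/s/Mpc),
    computed with a simple midpoint rule on 1/E(z') (eq. 3)."""
    d_h = C_KM_S / (100.0 * h)  # Hubble distance D_H = c/H0 (eq. 4)
    dz = z / steps
    total = sum(1.0 / E((i + 0.5) * dz, om, ok, ol) for i in range(steps))
    return d_h * total * dz

def transverse_distance(z, om=0.274, ok=0.0, ol=0.726, h=1.0):
    """Transverse comoving distance via the small-Omega_k expansion (eq. 7)."""
    d_h = C_KM_S / (100.0 * h)
    d_c = comoving_distance(z, om, ok, ol, h)
    return d_c * (1.0 + ok * d_c ** 2 / (6.0 * d_h ** 2))
```

For a flat Ω_m = 0.274 cosmology this gives D_C ≈ 1.5 × 10³ h⁻¹ Mpc at the CMASS effective redshift z ≈ 0.57, and D_M coincides with D_C when Ω_k = 0.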
The angular separation θ between points 1 and 2, given their right ascensions α₁, α₂ and declinations δ₁, δ₂, is calculated as

cos θ = sin δ₁ sin δ₂ + cos δ₁ cos δ₂ cos(α₁ − α₂).  (8)
The distance s between these two points (illustrated in Fig. 1), given their comoving distances r₁ and r₂, is approximated by

s² = s_⊥² + s_∥²,  (9)

where s_⊥ and s_∥, the distances transverse and parallel to the line of sight (LOS) respectively, are defined as

s_⊥ = (r₁ + r₂) θ / 2,  (10)

s_∥ = |r₁ − r₂|.  (11)

Figure 1: Calculation of the distance s between points 1 and 2 given their comoving distances from the Earth r₁ and r₂ and their angular separation θ.
Figure 2: Distribution of galaxies over redshift z and comoving distance D_C, calculated using the cosmological parameters defined in Section 4, in four widely separated regions of the Northern sky based on the SDSS-III DR9 random catalog. Regions are defined by the following selection criteria: region 1 (, ), region 2 (, ), region 3 (, ), region 4 (, ).
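The angular-separation and pair-distance relations of eqs. 8-11 can be sketched as follows (Python, illustrative only; the small-angle forms s_∥ = |r₁ − r₂| and s_⊥ = (r₁ + r₂)θ/2 are our reading of the approximation, not the authors' code):

```python
import math

def angular_separation(ra1, dec1, ra2, dec2):
    """Opening angle between two sky positions, all angles in radians (eq. 8)."""
    cos_t = (math.sin(dec1) * math.sin(dec2)
             + math.cos(dec1) * math.cos(dec2) * math.cos(ra1 - ra2))
    return math.acos(max(-1.0, min(1.0, cos_t)))  # clamp against rounding

def pair_distance(r1, r2, theta):
    """Separation s and its LOS decomposition (small-angle approximation)."""
    s_par = abs(r1 - r2)              # parallel to the LOS (eq. 11)
    s_perp = (r1 + r2) * theta / 2.0  # transverse to the LOS (eq. 10)
    return math.hypot(s_par, s_perp), s_par, s_perp
```

For two galaxies along the same LOS (θ = 0) the separation reduces to the radial difference; for equal radii it reduces to the transverse arc.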

Since random catalogs are typically generated by uniformly populating the fiducial volume of the survey, the galaxy distribution over z, and thus over D_C, is factorizable from the angular distribution. In other words, any angular region of the sky has the same distribution of galaxies in z (see Ross et al. (2012) and Fig. 2). This means that the expected count of random galaxies, ⟨n_R(α, δ, z)⟩, can be factorized into the product of the expected count at a given angular position ⟨n_R(α, δ)⟩ and a redshift probability density function (PDF) P(z):

⟨n_R(α, δ, z)⟩ = ⟨n_R(α, δ)⟩ P(z).  (12)
RR evaluated this way has a smaller statistical uncertainty, since the precision of ⟨n_R(α, δ)⟩ depends on the number of points in the 2D angular cell and the precision of P(z) depends on the number of points in the 1D range in z. In contrast, the statistical precision without the factorizability assumption is determined by the number of points in a 3D cell in (α, δ, z). We will show in Section 4 that the uncertainties in RR evaluated using the suggested method are indeed smaller.

The expected count of random-random galaxy pairs separated by a distance s can be expressed as

RR(s) = (1/2) ∫ dΩ₁ dz₁ dΩ₂ dz₂ ⟨n_R(Ω₁)⟩ P(z₁) ⟨n_R(Ω₂)⟩ P(z₂) δ(s − s(z₁, z₂, θ₁₂)),  (13)

where Ω represents a solid angle (dΩ = cos δ dα dδ) and s(z₁, z₂, θ₁₂) is calculated according to eq. 9. The integral is taken over the entire fiducial volume of the random catalog, and a factor of 1/2 is introduced to account for double counting of the random-random pairs. The Dirac δ-function is introduced to ensure that the distance between two galaxies is equal to the distance of interest s. To isolate the angular variables, we rewrite the δ-function in eq. 13 as an integral of the product of two δ-functions:

δ(s − s(z₁, z₂, θ₁₂)) = ∫ dθ δ(s − s(z₁, z₂, θ)) δ(θ − θ₁₂).  (14)

Note that the second δ-function is independent of the radial positions of the galaxies, i.e. of their redshifts. Thus, RR(s) can be rewritten as

RR(s) = (1/2) ∫ dθ dz₁ dz₂ f(θ) P(z₁) P(z₂) δ(s − s(z₁, z₂, θ)),  (15)

where

f(θ) = ∫ dΩ₁ dΩ₂ ⟨n_R(Ω₁)⟩ ⟨n_R(Ω₂)⟩ δ(θ − θ₁₂)  (16)

is the count of galaxy pairs whose angular separation is θ.

The count of random galaxy pairs RR(s) is thus constructed in two steps:

  1. histogramming, where we construct the distribution f(θ) over angular separation using the count of random galaxies ⟨n_R(α, δ)⟩ according to eq. 16,

  2. integration, where we convolve the angular distribution f(θ) with the redshift PDFs P(z₁) and P(z₂) to obtain the distribution RR(s) of random pairs over s.

Note that only the second step depends on the choice of cosmological parameters.
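The two-step RR calculation can be sketched schematically as follows (Python; a toy one-dimensional "sky" and hypothetical helper names for illustration, not the authors' C++ code). Step 1 is cosmology-independent; the cosmology enters step 2 only through the redshift-to-distance mapping.

```python
import math
from collections import defaultdict

def histogram_f_theta(ang, n, theta_bin):
    """Step 1: pair counts of angular cells, binned in opening angle.
    ang[k] is the position of cell k on a toy 1D sky, n[k] its galaxy count."""
    f = defaultdict(float)
    for i in range(len(ang)):
        for j in range(i, len(ang)):
            theta = abs(ang[i] - ang[j])
            # within one cell: n(n-1)/2 unordered pairs; across cells: n_i * n_j
            w = n[i] * (n[i] - 1) / 2.0 if i == j else n[i] * n[j]
            f[int(theta / theta_bin)] += w
    return f

def integrate_rr(f, theta_bin, zc, p_z, r_of_z, s_bin):
    """Step 2: convolve f(theta) with P(z1) P(z2); cosmology enters via r_of_z."""
    rr = defaultdict(float)
    for tb, ft in f.items():
        theta = (tb + 0.5) * theta_bin
        for a, z1 in enumerate(zc):
            for b, z2 in enumerate(zc):
                r1, r2 = r_of_z(z1), r_of_z(z2)
                # small-angle distance, eqs. 9-11
                s = math.hypot(abs(r1 - r2), (r1 + r2) * theta / 2.0)
                rr[int(s / s_bin)] += ft * p_z[a] * p_z[b]
    return rr
```

A useful sanity check is conservation: with P(z) normalized to unity, the total weight in RR equals the total pair count in f(θ) regardless of binning.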

2.2 Data-random and data-data distributions

Unlike in the random catalog, in the data catalog the distribution of galaxies over z cannot be factorized from their angular distribution. Thus, the count distribution g(θ, z_d) over the opening angle θ between galaxies in the data and random catalogs is also a function of z_d, the redshift of the data galaxy:

g(θ, z_d) = ∫ dΩ_d dΩ_r n_D(Ω_d, z_d) ⟨n_R(Ω_r)⟩ δ(θ − θ_dr),  (17)

where n_D(Ω_d, z_d) is the observed number density of data galaxies. The DR(s) distribution can then be calculated as

DR(s) = ∫ dθ dz_d dz_r g(θ, z_d) P(z_r) δ(s − s(z_d, z_r, θ)).  (18)
At the histogramming step, we construct the distribution g(θ, z_d) according to eq. 17, keeping track of the redshift z_d of the data galaxies. At the integration step g(θ, z_d) is convolved with the redshift PDF P(z_r) to obtain the distribution DR(s) of data-random pairs over s according to eq. 18.

Finally, the data-data count DD(s) is estimated using a brute-force iteration over all possible data galaxy pairs. For uniformity, the calculation is also broken down into histogramming and integration steps. At the histogramming step the count distribution over the opening angle θ is constructed from the pair count of data galaxies, keeping track of the redshifts of both galaxies in the pair.

At the integration step, DD(s) is evaluated from this distribution by converting the redshifts into distances and calculating the distance s between the two galaxies. This breakdown does not offer any time savings in the calculation of DD(s), but it allows the cosmological parameters to be introduced only at the second step. This way the computationally intensive first step can be performed once, and then the computationally fast second step is repeated for each set of cosmological parameters.
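The value of this split can be sketched as follows (Python, illustrative; hypothetical names, a toy 1D sky): the O(N_D²) pair histogram in (θ, z₁, z₂) is built once, and only the fast integration pass is re-run per cosmology.

```python
import math
from collections import defaultdict

def histogram_dd(gals, theta_bin, z_bin):
    """One-time O(N_D^2) pass. gals = [(ang, z, w), ...] on a toy 1D sky;
    pairs are binned in (theta, z1, z2), cosmology-independent."""
    h = defaultdict(float)
    for i in range(len(gals)):
        for j in range(i + 1, len(gals)):
            a1, z1, w1 = gals[i]
            a2, z2, w2 = gals[j]
            key = (int(abs(a1 - a2) / theta_bin), int(z1 / z_bin), int(z2 / z_bin))
            h[key] += w1 * w2
    return h

def integrate_dd(h, theta_bin, z_bin, r_of_z, s_bin):
    """Fast per-cosmology pass: only here do distances (and parameters) enter."""
    dd = defaultdict(float)
    for (tb, zb1, zb2), w in h.items():
        theta = (tb + 0.5) * theta_bin
        r1 = r_of_z((zb1 + 0.5) * z_bin)
        r2 = r_of_z((zb2 + 0.5) * z_bin)
        s = math.hypot(abs(r1 - r2), (r1 + r2) * theta / 2.0)  # eqs. 9-11
        dd[int(s / s_bin)] += w
    return dd
```

Calling integrate_dd twice with different redshift-distance mappings reuses the same histogram, which is the point of the breakdown.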

2.3 Generalization of the 2pcf to the anisotropic case

The discussion up to this point has dealt with only a spherically symmetric 2pcf. However, the algorithm is easily generalized to study anisotropy in the 2pcf (Davis & Peebles, 1983).

The histogramming step is identical to the isotropic case. At the integration step, we compute RR, DR and DD in two dimensions (s_⊥, s_∥), where the distances transverse (s_⊥) and parallel (s_∥) to the LOS are computed according to eqs. 10 and 11 respectively.

The BAO signal is expected to manifest itself as an ellipse in this 2D histogram. In the isotropic case the ellipse is reduced to a circle.

3 Description of the algorithm

3.1 Weights

To the data we apply weights according to the prescription of Ross et al. (2012) to deal with the issues of close-pair corrections (w_cp), redshift-failure corrections (w_noz), systematic targeting effects (w_sys), and shot noise and cosmic variance (w_FKP) (Feldman et al., 1994), such that the total weight of a given data galaxy is

w = w_FKP w_sys (w_cp + w_noz − 1).

Each galaxy in the random catalog has a z-dependent weight w(z), defined in the same way as in the data catalog. The algorithm can be generalized to the case where the random weights also depend on angular position and are factorizable into angular and redshift weights: w(α, δ, z) = w(α, δ) w(z).

3.2 Binning

RR and DR are calculated using finely binned probability densities in (α, δ) and z, defined from the existing random catalog or based on the completeness map and the radial selection function. The choice of the bin sizes is important and is determined by the final bin size Δs desired in s. In practice, this means that the bin sizes in α, δ and θ must be smaller than the angle subtended by Δs at the outermost radius of the data set, at least by a factor of two (Δθ ≤ Δs / (2 r_max)). The bin size in z should be chosen such that the corresponding Δr is smaller than Δs by the same factor. We note that the binning is not equal-area, but for fine enough bins this does not affect the result. Some assumption about the cosmological parameters must be made for the calculation of r_max and Δr, so to be on the conservative side one should use the largest r_max for the set of cosmological parameters under evaluation, which typically means using the smallest value of Ω_m. Using this algorithm with steps in s smaller than 1 h⁻¹ Mpc is possible but impractical, as it results in an angular map that is too finely segmented and thus very large and noisy.
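The binning constraints can be checked with a back-of-the-envelope helper (Python, illustrative; names and the D_H value are ours). The angular bin must subtend less than half the final s bin at the outermost radius, and the redshift bin must keep Δr ≈ D_H Δz / E(z) below the same limit.

```python
def max_angular_bin(s_bin, r_max, safety=2.0):
    """Largest allowed bin in alpha, delta and theta [rad]:
    the arc r_max * dtheta must stay below s_bin / safety."""
    return s_bin / (safety * r_max)

def max_z_bin(s_bin, e_z, d_h=2997.92, safety=2.0):
    """Largest allowed redshift bin at a point where E(z) = e_z:
    dr ~ D_H dz / E(z) must stay below s_bin / safety (D_H in h^-1 Mpc)."""
    return s_bin * e_z / (safety * d_h)
```

For example, a 5 h⁻¹ Mpc final bin at r_max ≈ 2500 h⁻¹ Mpc allows angular bins of about 1 mrad.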

3.3 Limits

In the calculation presented in this paper, we only consider galaxy pairs separated by a distance less than a certain maximum distance scale of interest s_max. For this, the 2D angular space is divided into regions whose angular size corresponds to s_max at the smallest radial distance of the survey r_min for the set of cosmological parameters under consideration (typically corresponding to the largest value of Ω_m), taking into account the maximum declination of the survey. The algorithm proceeds to calculate RR, DR and DD only within one region and its neighbors.

3.4 Algorithm

3.4.1 Probability Density Functions

Once the binning and limits are chosen, the algorithm proceeds as follows. First, using the random catalog, we produce the count distribution ⟨n_R(α, δ)⟩ and the PDF P(z). For each galaxy, the corresponding bin of P(z) is incremented by its weight w(z), and the corresponding bin of ⟨n_R(α, δ)⟩ is incremented by 1. If the catalog contains an angular-dependent weight w(α, δ), then ⟨n_R(α, δ)⟩ is instead incremented by that amount.

Figure 3: Top: distribution of the DR9 random catalog in (α, δ) in the northern sky. Middle: the DR9 random catalog in the southern sky. Bottom: a zoomed-in view of the northern sky. The positions of galaxies observed in the DR9 survey are plotted as white stars.

This part of the algorithm is linear in the number of galaxies in the random catalog and hence is very fast.

Examples of binned angular histograms are shown in Fig. 3. It is clear from the zoomed-in view that the statistical fluctuations, determined by the finite size of the random catalog, are significant. However, it is important to note that these fluctuations are much smaller than the fluctuations in 3D cells with full coordinates (α, δ, z), effectively used in the brute-force approach or in any other approach that relies on a 3D number density distribution. In this paper, we generate the 2D angular and 1D redshift probability maps based on an existing random catalog (published along with the observational data), but ideally these maps should be generated directly from the tile completeness and radial selection function. This approach would create smoother probability maps and minimize the shot noise associated with the use of random catalogs.

3.4.2 Histogramming

We compute the random-random 1D binned distribution f(θ) using ⟨n_R(α, δ)⟩. To compute the binned f(θ), we consider all non-repeating pairs of angular cells i and j. The angle θ_ij between the two cells is calculated from their angular positions using eq. 8 and the corresponding bin is incremented:

f(θ_ij) += c_ij n_i n_j,

where the factor c_ij is introduced such that c_ij = 1/2 when the two cells are identical and c_ij = 1 otherwise.

Then, we calculate the data-random 2D binned distribution g(θ, z_d) using galaxies from the data catalog and ⟨n_R(α, δ)⟩. Here, we consider every pair of a data galaxy d and an angular cell j. The angle θ_dj between the data galaxy and the angular cell is calculated from their angular positions using eq. 8 and the corresponding 2D bin is incremented:

g(θ_dj, z_d) += w_d n_j,

where w_d is the weight of the data galaxy.

We compute the data-data 3D binned distribution over (θ, z₁, z₂) by looping over all non-repeating pairs of data galaxies. The corresponding bin is incremented by the product w₁ w₂ of the weights of the individual galaxies in the pair.
3.4.3 Integration

Next, we perform an integration over the redshifts z₁ and z₂ and produce the distributions RR(s), DR(s) and DD(s). This is achieved in three nested loops iterating over θ, z₁ and z₂. Given these three variables and a choice of cosmological parameters, we compute the distance separation s using eq. 9 and increment the corresponding bin of the final distributions.

For the computation of the anisotropic 2pcf the distances s_⊥ and s_∥ are computed according to eq. 10 and eq. 11 respectively.

3.4.4 Normalization

The histograms are normalized in the following way. For the unweighted calculation (all galaxies have weight w = 1), the normalization constant of RR is simply N_R(N_R − 1)/2, where N_R is the number of galaxies in the random catalog. However, for the weighted calculation, the normalization constant is

Σ_{i=1}^{N_R} Σ_{j=i+1}^{N_R} w_i w_j,

where w_i is the weight of the i-th galaxy in the random catalog. In this specific form, the calculation is O(N_R²) and requires a double loop over indices. However, it can be rewritten as

(1/2) [ ( Σ_{i=1}^{N_R} w_i )² − Σ_{i=1}^{N_R} w_i² ],

which is now an O(N_R) calculation. The same approach can be applied when normalizing the DD histogram.
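The linear-time rewrite of the pairwise-weight sum is easy to verify directly (Python, illustrative):

```python
def pair_weight_norm(weights):
    """O(N) evaluation of sum_{i<j} w_i w_j using
    the identity sum_{i<j} w_i w_j = ((sum w)^2 - sum w^2) / 2."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return (s1 * s1 - s2) / 2.0
```

The identity follows from expanding (Σ w_i)², which counts every ordered pair once plus the diagonal terms w_i².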

For the unweighted calculation, the normalization constant of DR is simply N_D N_R, where N_D is the total number of galaxies in the data catalog. For the weighted calculation, the normalization constant is

( Σ_{i=1}^{N_D} w_i ) ( Σ_{j=1}^{N_R} w_j ),

where w_i is the weight of the i-th galaxy in the data catalog and w_j is the weight of the j-th galaxy in the random catalog. This normalization constant only requires an O(N_D + N_R) computation and hence can be computed quickly even in the weighted calculation. Finally, the LS estimator is calculated according to eq. 2. All other estimators can be calculated just as easily from the DD, DR and RR distributions computed above.

4 Performance of the algorithm

4.1 Algorithm settings and random catalog generation

The performance of the new algorithm is evaluated by calculating the runtime and the uncertainty of the LS estimator of the 2pcf. Though the method described in this paper can be applied under any cosmology, for concreteness we assume a ΛCDM+GR flat cosmology with parameters consistent with those used in the analysis of the SDSS-III DR9 data set (Anderson et al., 2013), i.e. Ω_m = 0.274 and Ω_Λ = 0.726. In the distance calculation we set H₀ = 100 h km s⁻¹ Mpc⁻¹ and define the distances in units of h⁻¹ Mpc.

Usually, the calculations of RR and DR are the most computationally expensive, since the size of the random catalog is typically much larger than the size of the data catalog. Therefore, we report the results for RR and DR for different sizes of the random catalog, while keeping the data catalog unchanged.

To generate random catalogs of any desired size, we begin with the northern sky of the existing SDSS-III DR9 random catalog, which contains 3.5M galaxies, and extract the distributions ⟨n_R(α, δ)⟩ and P(z). We then randomly generate new galaxies according to the product of these distributions. In each angular cell we generate a number of galaxies distributed according to a Poisson distribution whose mean is set by ⟨n_R(α, δ)⟩ scaled to the desired catalog size. Each of these galaxies is assigned a redshift z distributed according to P(z). Weights are a function of the redshift z. While the new catalogs may amplify existing statistical fluctuations in the DR9 random catalog, our interest is in testing the runtime of the algorithms and estimating the variation from the mean in RR as a function of random catalog size. We find that the bias in the 2pcf, defined as the deviation from the true mean (which is zero), is significantly smaller than the statistical uncertainties (Fig. 8). The bias is produced by shot noise in the random catalog, so it can be minimized if ⟨n_R(α, δ)⟩ and P(z) are generated from the tile completeness and radial selection function.
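The catalog-generation procedure can be sketched as follows (Python, illustrative; function names are ours, and the Poisson sampler is Knuth's textbook algorithm rather than anything from the paper): draw a Poisson count per angular cell with a mean scaled to the desired size, then assign each galaxy a redshift by inverse sampling of the binned P(z).

```python
import math
import random

def poisson(lam, rng):
    """Knuth's Poisson sampler; adequate for modest per-cell means."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def generate_catalog(cell_means, z_centers, p_z, scale=1.0, rng=None):
    """Generate (cell, z) pairs: Poisson count per angular cell,
    redshift drawn from the binned PDF p_z over z_centers."""
    rng = rng or random.Random(42)
    cum, acc = [], 0.0
    for p in p_z:  # cumulative P(z) for inverse sampling
        acc += p
        cum.append(acc)
    galaxies = []
    for cell, mean in enumerate(cell_means):
        for _ in range(poisson(mean * scale, rng)):
            u = rng.random() * cum[-1]
            k = next(i for i, c in enumerate(cum) if c >= u)
            galaxies.append((cell, z_centers[k]))
    return galaxies
```

Scaling `scale` up or down produces larger or smaller catalogs with the same underlying angular and redshift distributions.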

4.2 Timing study

We compare the timing of the algorithm presented here against brute-force pair counting and the parallelized algorithm CUTE (Alonso, 2012). In all cases, we report runtime in CPU hours rather than elapsed wall-clock time. As in the fast algorithm, the brute-force pair count and the computation with CUTE are performed using a grid scheme with the same maximum distance, so the three sets of results include the same statistics.

In Fig. 4, we show the runtime of the algorithms measured for the pure algorithmic parts of the executable, based on a C++ implementation running on modern CPUs, for random catalog sizes up to 50M galaxies. For this study we chose the following algorithm settings. The bin size in the right ascension α, declination δ and angular separation θ is 1 mrad, which corresponds to a transverse separation of 2.55 h⁻¹ Mpc at the outermost radius. The binning in z corresponds to a separation of 1 h⁻¹ Mpc along the LOS. As expected, the runtime of our realization of the brute-force calculation, as well as that of CUTE, is proportional to N_R² for RR and to N_D N_R for DR. In contrast, the runtime of the fast algorithm plateaus at a constant value because it depends on the fiducial volume and the number of angular and radial bins rather than on the size N_R of the random catalog. The maximum runtime of the fast algorithm is reached when each bin in the f(θ) distribution is populated and hence is used in the calculation of RR.

Figure 4: The total runtime (in CPU hours) of the fast, brute-force and CUTE calculations of as a function of the size of the random catalog.

The fits of the runtime dependence on the step in s, shown in Fig. 5, and on the maximum distance s_max, shown in Fig. 6, include terms linear and quadratic in the corresponding number of bins. The numbers of bins in α, δ and θ scale inversely with the bin size in the final histogram and linearly with the distance scale of interest s_max. The total number of 2D angular cells is the product of the numbers of bins in α and δ, and the calculation time scales linearly with this total number of cells. Other fast computational techniques based on the number density rather than the actual permutation counting rely on the data distribution in 3D cells (Moore et al., 2001) and thus have stronger scaling with the number of bins.

Figure 5: The total runtime (in CPU hours) of the fast calculation of the pair-wise combinations as a function of the step in . In this study  Mpc. The binning in is . The angular binning is .
Figure 6: The total runtime (in CPU hours) of the fast calculation of the pair-wise combinations as a function of the maximum distance . The binning in is  Mpc. The angular binning is 0.56 mrad.

4.3 Algorithm precision

To check the precision of the fast algorithm we generate 20 realizations of random catalogs with 3.5M galaxies each, using the procedure described above. We estimate the root mean square (RMS) of RR using the 20 random catalogs and present it in Fig. 7. Note that though the bin size used in the computation is small, as specified above, the result is plotted with a much larger bin size, simply for visual purposes. In both the brute-force and fast algorithms, the uncertainties in RR decrease as the size of the random catalog increases, but the uncertainties in the results based on the fast algorithm are smaller than those in the respective brute-force results, in agreement with the qualitative argument presented in Section 2.

Figure 7: Uncertainties in RR as a function of s and of the size of the random catalog. The uncertainties are calculated as the RMS of RR obtained with 20 catalogs generated for each of three catalog sizes, for both methods of computing RR.

To check the fast algorithm for bias, we compute ξ_LS for 20 mock catalogs representing the data, each with 200k galaxies Poisson-distributed within the survey volume. These mock catalogs are produced using the same procedure used to generate the random catalogs. We find no significant bias; that is, the mean of ξ_LS is centered at zero, as expected for a truly random distribution, and its RMS obeys a Poisson distribution (Fig. 8). Note that the Poisson error is dominated by DD, since RR and DR are much larger and hence contribute smaller relative errors.

Figure 8: Mean and RMS uncertainty in ξ_LS for 20 uniformly-generated mock data sets with 200k galaxies per data set.

4.4 Performance on data with BAO signal

To demonstrate the performance of the algorithm in the anisotropic case we present the result on a mock dataset with a strong BAO-like signal embedded (Fig. 9). The signal was generated by adding spatially correlated galaxy pairs on top of a uniform background, with the pair separation distance distributed according to a Gaussian. To avoid biasing the distribution of galaxies in the mock sample, after the signal is added some of the galaxies are removed to preserve the original distribution in z.

Finally, the algorithm is also applied to the SDSS-III DR9 BOSS data catalog. The results obtained from the fast algorithm and the brute-force algorithm and their difference are presented in Fig. 10. The BAO peak is clearly visible in both, and the two distributions are consistent.

Figure 9: The anisotropic 2pcf based on a mock data set with an embedded BAO-like signal. The signal is clearly seen as a circle in the 2D (s_⊥, s_∥) distribution.
Figure 10: Top: ξ_LS for the SDSS-III DR9 BOSS data catalog using the fast algorithm suggested in this paper (blue) compared to the brute-force calculation (red). The error bars represent Poisson uncertainties. Bottom: the difference between the fast and brute-force estimates of ξ_LS.

5 Conclusion

We have presented a new computational method for the galaxy 2pcf that replaces the summation over all possible galaxy pairs with a numerical integration over probability maps. The method provides a significant reduction in the calculation time and improves the precision of the calculation. Moreover, the computationally intensive histogramming part of the calculation is independent of the choice of cosmological parameters. The output of the histogramming stage is used at the fast integration stage, where the cosmological parameters need to be defined in order to compute the cosmological distances. The integration stage can be repeated for a different set of parameters without redoing the histogramming stage, allowing for a fast probe of a larger parameter space.

In the future, the method could be used for a fast evaluation of the galaxy correlations in large spectroscopic surveys. In this case, the generation of large-size random catalogs can be replaced by weighted probability maps determined by observational conditions.


We thank L. Samushia, E. Blackman and B. Betchart for useful discussions. The authors acknowledge support from the Department of Energy under grant DE-SC0008475.

