IftUam/csic14128
HALOGEN: A tool for fast generation of mock halo catalogues
Abstract
We present a simple method of generating approximate synthetic halo catalogues: halogen. This method uses a combination of order Lagrangian Perturbation Theory (2LPT) in order to generate the largescale matter distribution, analytical mass functions to generate halo masses, and a singleparameter stochastic model for halo bias to position haloes. halogen represents a simplification of similar recently published methods.
Our method is constrained to recover the 2point function at intermediate () scales, which we show is successful to within per cent. Larger scales () are reproduced to within per cent. We compare several other statistics (e.g. power spectrum, point distribution function, redshift space distortions) with results from body simulations to determine the validity of our method for different purposes. One of the benefits of halogen is its flexibility, and we demonstrate this by showing how it can be adapted to varying cosmologies and simulation specifications.
A driving motivation for the development of such approximate schemes is the need to compute covariance matrices and study the systematic errors for large galaxy surveys, which requires thousands of simulated realisations. We discuss the applicability of our method in this context, and conclude that it is well suited to mass production of appropriate halo catalogues.
The code is publicly available at https://github.com/savila/halogen
keywords:
largescale structure of Universe – methods: body simulations – galaxies: distances and redshifts – galaxies: haloes –1 Introduction
We have entered an observational era where it is customary for redshift surveys to map millions of galaxies in the sky with the volumes of these surveys exceeding Gpc scales. Recent and upcoming galaxy survey projects include PAU (Castander et al., 2012), BOSS (Dawson et al., 2013), DES (Frieman & Dark Energy Survey Collaboration, 2013), DESi (Levi et al., 2013), Euclid (Laureijs et al., 2011), etc. The interpretation of such surveys demands a new generation of theory tools in order to better understand and interpret the large amounts of data. One important component is the need for accurate simulations of the expected results, to which the observations should be compared. However, models of largescalestructure and the clustering of (dark matter) haloes forming in it are inherently nonlinear, and require the production of simulations based on body calculations. Such simulations are extremely costly, and consequently very few realisations can be run for a given application. However, investigating the effects of systematic errors, cosmic variance, and their interplay require many hundreds of realisations of a single simulation (e.g. BOSS survey used 600 (Manera et al., 2013)). These are necessary to compute covariance matrices which characterise the resultant uncertainty on the final parameters.
To mitigate this situation, many have now turned to approximate schemes in order to calculate the required realisations of the simulations. Early such work used the socalled lognormal realisations (Coles & Jones, 1991), which placed particles randomly according to a lognormal distribution, given the true power spectrum. While this is indeed efficient, and reproduces 2point statistics faithfully, its lack of physical motivation for the particle placement results in poor higherorder statistics, such as the 3point function or CountsinCells moments. Improved methods developed in the past decade include PThalos (Manera et al., 2013; Scoccimarro & Sheth, 2002), pinocchio (Monaco et al., 2002; Monaco et al., 2013), patchy (Kitaura et al., 2014), cola (Tassev et al., 2013), qpm (White et al., 2014), EZmocks (Chuang et al., 2015), etc. For a comparison of these methods (incl. halogen presented here) we refer the reader to Chuang et al. (2014).
One may segregate these methods into two classes – predictivetype methods which are required to ‘find’ haloes in a given density field (eg. pinocchio, cola and PThalos), and statisticaltype methods which merely stochastically sample a density field to locate haloes (eg. patchy, qpm and EZmocks). The former have the advantage of being predictive, and often not requiring an body reference simulation for calibration, while the latter have the advantage of computational speed and resources, as the number of particles used is reduced.
We present a new (statisticaltype) approximate scheme, called halogen whose prime objective is to generate halo catalogues with the correct 2point clustering and massdependent bias using a simple and rapid approach.
We note that statisticaltype methods tend to follow a standard pattern of four steps:

Produce a density field.

Sample halo masses.

Sample particles as halos with some bias.

Assign halo velocities.
In this paper we seek to abstract this pattern, providing a framework in which each step is highly modular. Whilst modular, halogen implements default behaviour with very simple (and rapid) components – using the popular order Lagrangian Perturbation Theory (2LPT) as the gravity solver, theoretical mass functions, a singleparameter bias prescription (as opposed to 2 or more parameters for other statisticaltype methods) and a direct linear transformation of the velocities. As such, halogen can be rapidly calibrated, and easily extended. In addition, we introduce physically motivated constraints for halo exclusion and mass conservation, which tie the individual steps together.
In this paper we will compare the results from halogen to a pair of reference body simulations to be presented in §2. We introduce the general ideas of the method in §3, leaving a more detailed explanation of the spatial placement of haloes – which we consider the essence of halogen – for §4. §5 demonstrates the effects of each parameter of halogen and how to optimise them. We conclude with some applications and results in §6.
2 The Reference Simulations
To tune halogen to a specific cosmology, we require an body simulation. In order to show the adaptability of halogen to varying setups, we have not limited ourselves to a single simulation but used two with differing box size, mass resolution, and cosmology. Further, the reference halo catalogues have been obtained by applying two different halo finding techniques, and have different number density. We summarise the characteristics of both reference catalogues in Table 1 and describe them below.
Goliat Simulation
This simulation was run with the gadget2 code (Springel, 2005) from initial conditions generated by 2LPTic^{1}^{1}1http://cosmo.nyu.edu/roman/2LPT at . It uses dark matter particles in a box with side length Mpc. The cosmological parameters used in this simulation are , , , , , yielding a mass resolution of . The halo catalogue was obtained from a snapshot and has been generated with the halo finder ahf (Knollmann & Knebe, 2009), a sphericaloverdensity (SO) algorithm. Though ahf identifies subhaloes, they have been discarded for the present analysis as these scales are too small for 2LPT to resolve. There is a possibility of phenomenologically adding substructure in a later step using a Halo Occupation Distribution (HOD) prescription (Skibba & Sheth, 2009), but we leave that to a future study. In this catalogue we use a halo reference density of
halogen requires an input density field obtained from 2LPT (see §3.1). For this purpose, we run a 2LPTic snapshot at with the same inital condition phases as those used in goliat.
Big MultiDark Simulation^{2}^{2}2http://www.cosmosim.org
BigMultiDark described in Klypin et al. (2014), employs the cosmology from the Planck CMB mission (Planck Collaboration, 2014), which for some parameters represents a significant change with respect to the goliat simulation: , , , , , . The halo catalogue is extracted with a FriendsofFriends (fof) (Davis et al., 1985) algorithm (which intrinsically neglects substructure) at , and we choose a reference halo number density of .
Compared to goliat, is both larger (Mpc) and more resolved ( particles of mass ). It was run with Lgadget2 from initial conditions based on the Zel’dovich Approximation (ZA) at . Given the large scales that it explores while resolving large numbers of haloes, it is well suited to probing the Baryon Acoustic Oscillation (BAO) peak.
For the input of halogen we run 2LPTic to with the same initial condition phases as BigMultiDark. The cosmology and used are the same, but with a lower resolution of .
Name  Finder  IC  

goliat  ,  0  0.044  0.27  0.69  0.7  0.8  0.96  AHF  2LPT  32  
BigMultiDark  ,  0.56  0.048  0.31  0.73  0.68  0.82  0.96  FOF  ZA  100 
3 Method Outline
In this section we briefly outline our method, leaving a more detailed presentation of the actual modus operandi of halogen for §4. The general algorithm consists of four (major) steps:

generate a dark matter density field,

draw halo masses by sampling a halo mass function,

populate the volume with haloes in the box, and

assign velocities to the haloes.
We aim to decouple each of these steps from the others as far as possible so that different algorithms may be used at each point. The first two steps are relatively trivial, as they use predeveloped prescriptions from the literature, and we discuss these, and basic outlines of the last two steps, in this section.
3.1 Density Field
The basic scaffolding of halogen is an appropriate dark matter density field realised at the desired redshift, sampled by particles. For simplicity we choose to use order Perturbation Theory (2LPT) (Moutarde et al., 1991; Bouchet et al., 1995) to produce this field, which can be obtained with the public code 2LPTic.
We show in Figure 1 the density distribution of an body simulation (top panel) and a 2LPT representation (bottom panel) at . Notably, the 2LPT distribution appears to be blurred in comparison to the body simulation. This is due to the fact that 2LPTic – as the name suggests – was originally designed only to generate initial conditions (Scoccimarro, 1998), since even 2order perturbation theory breaks down at low redshift when overdensities become highly nonlinear. The smallscale difference in Figure 1 can be explained by shell crossing, an effect in which particles following their 2LPT trajectories cross paths and continue rather than gravitationally attracting each other in a fully nonlinear manner (Neyrinck, 2013; Sahni & Shandarin, 1996). In order to compensate for shellcrossing, Manera et al. (2013) advocates the use of a smoothing kernel over the input power spectrum. We tested the effect of this smoothing in halogen but did not find any improvement in the final catalogue.
Nevertheless, 2LPT provides a suitable approximation of the large scale distribution of matter, where perturbations have not yet entered into the highly nonlinear regime and this is sufficient for halogen. Note that halogen is in principle agnostic about the method in which this density field snapshot is produced. Other methods, for instance the “QuickPM” cf. the QPM method described by White et al. 2014, COLA (Tassev et al., 2013) or 3LPT could equally be employed by the user. A different choice of density field will yield somewhat different results, especially at smaller scales. As long as the chosen method reconstructs large scales correctly, the remaining steps of halogen should be unmodified.
Despite this, we have by default incorporated 2LPTic as part of the halogen code (which bypasses the costly I/O of writing the snapshot to disk), but also allow the user to provide an arbitrary snapshot with a distribution of particles in a cosmological volume. Our choice for 2LPT was mainly driven by its low computational cost and success in the distribution of matter at large scales. We use this approach for all results in this paper.
3.2 The Mass Function
The halo mass function (HMF) measures the number density of haloes above a given mass scale. It is required to generate massconditional clustering, which in turn is a prerequisite for extension to HODbased galaxy mock generation.
We produce a sampled mass function by the standard inverseCDF method, utilising an arbitrary input HMF.
The most accurate HMF for a given cosmology, over a range of suitable scales, may be obtained from an body simulation via a halofinding algorithm –although there are notable variations depending on the technique (Knebe et al., 2011). Since we require a full body simulation for the tuning of halogen, it would be perfectly acceptable to use this simulation to generate the HMF. However, in the hope of future improvements, we wish to avoid using the full simulation as far as possible. Fortunately, there is a wealth of literature concerning accurate predictions of the HMF for widely varying cosmologies and redshifts using Extended PressSchechter theory (Press & Schechter, 1974; Bond et al., 1991).
The mass function may be calculated by any means, so long as a discretised function of is provided. For simplicity, we decided to use the online halo mass function calculator HMFcalc^{3}^{3}3http://hmf.icrar.org (Murray et al., 2013) for obtaining the halo mass distribution in this paper.
3.3 Spatial Placement of Haloes
The crucial step in the generation of approximate halo catalogues is the commissioning of halo positions. In keeping with the philosophy of modularity, the haloplacement step is decoupled from the rest. Any routine which takes a vector of halo masses and an array of dark matter particle positions and returns a subset of those positions as the halo locations is acceptable. However, we consider this step to be at the heart of the halogen method, as it is responsible of generating the correct massdependent clustering.
To achieve an efficient placement that reconstitutes the target twopoint statistics, we recognise the validity of the clustering on large scales from the broadbrush 2LPT field. We place haloes on 2LPT field particles, essentially using the estimated density field as scaffolding on which to build an approximate halo field. We will follow a series of steps in the construction of the method of spatial placement to be presented in §4 below.
3.4 Assignment of Velocities
The most obvious way to assign velocities to each halo would be to use the velocity of the particle on which it is centred. However, haloes are viralised systems whose velocities tend to be lower than that of their constituent particles. This is potentially mitigated by using the average velocity of all particles within a defined radius of the artificially placed halo. However, this is not robust as there are often very few particles inside the halo radius. Additionally, the 2LPT particle velocities will differ from their body counterparts due to shellcrossing, especially on the small scales associated with haloes.
Thus, we prefer to take a phenomenological approach, and assume that a simple mapping via a factor can be applied to the collection of halo velocities to recover the results of the body distribution
(1) 
This factor could a priori depend on the velocity (i.e. a nonlinear mapping) and the mass of the halo . However, we will show in §5.2 that a linear mapping is sufficient and present a way to compute .
4 Halogen
Though halogen is a fourstage proccess, the most crucial aspect is the assignment of halo positions, which this section describes in some detail. The general concept is to specify a sample of particles from an underlying density field as haloes.
The motivating philosophy of halogen is to start from the simplest idea and improve if necessary. In this vein, we present here successive stages of evolution of the halogen method, which we hope will show satisfactorily that the method as it stands is optimal. Figure 2 will serve as the showcase for the various stages of halogen. In it we present the 2point correlation function (2PCF) for each stage of development to verify that the method approaches the goliat reference catalogue as new characteristics are added.
Note that the 2PCF is computed with the publicly available parallel code CUTE^{4}^{4}4http://members.ift.uamcsic.es/dmonge/CUTE.html (Alonso, 2012). In the fitting routine that is included in the halogen package and described in §5.1 we also use the same code.
4.1 Random particles
We start with the simplest approach: using random particles from the 2LPT snapshot as the sites for haloes. We expect to recover the largescale shape of the 2PCF in this way, as this is encoded in the 2LPT density field which we trace.
However, it is clear from Figure 2 that this method (’random noexc’) consistently underestimates the 2PCF over all scales except , where it should sharply drop to 1, but rather remains positive.
The consistent underestimate is a realisation of an inaccurate linear bias, , defined as the scaling factor between the 2point function of the haloes and the underlying matter density field:
(2) 
We begin to address this in §4.3.
The smallscale clustering can be explained by the fact that particles can be arbitrarily close, whereas distinct haloes – recall that subhaloes have been removed – have a welldefined minimum separation (otherwise they merge). The turnover in the simulation based 2PCF occurs around the mean halo radius scale.
4.2 Random particles (with exclusion)
The simplest improvement to the random case is to eliminate the artificial smallscale correlations. Though the primary application of halogen will be for large scales, a simple improvement at small scales is useful.
As we have noted, the artificial clustering at small scales arises from the fact that particles can be arbitrarily close, whereas simulated haloes have a minimum separation. The radius of a halo is a rather subjective quantity, and its definition is modified in various applications and halofinders. However, we may parametrise this by
(3) 
where is the overdensity of the halo with respect to the critical density of the Universe. For the work presented here we used .
Using this scale, we introduce exclusion, a modifiable option which controls the degree to which haloes can overlap, which we set to mimic the halo finder’s specification. For example, in this work we use both ahf and fof (see Knebe et al., 2013, for a comparison and an introduction to all relevant halo finding techniques). For the latter we do not allow any overlap whereas for the former halogen’s halo centres are not allowed to lie inside another halo’s radius.
The effect of exclusion is presented in Figure 2 (’random exc’). As expected, scales of Mpc show a turnover while larger scales are unaffected. We note that the turnover is at smaller scales for halogen than for ahf. This is to be expected, as it is unlikely to find two ahf haloes separated by a distance slightly exceeding , due to reasons akin to the FOF overlinking problem. In such cases, there is an increased likelihood of the two halos being subsumed into one, or one becoming a subhalo of the other. It is conceivable that one could empirically model these effects by tuning the value of by some factor which captures this suppressed probability. However, as we are more interested in large scales and these considerations touch upon the subtleties of halo definition, we consider these exclusion criteria sufficient for present purposes. We will use this form of exclusion (in an appropriate form) for all following work.
4.3 Ranked approach
We return now to the problem of underestimation of the correlations, which we noted was due to an incorrect realisation of the linear halo bias. In effect, a random choice of particle position correponds to sampling the matter power spectrum uniformly, and therefore . However, halo bias is generally greater than unity (especially for higher mass halo samples) (Tinker et al., 2005).
Increasing the bias corresponds to sampling higherdensity regions. The simplest way to achieve this is to rankorder the density of regions in the particle distribution, and assign haloes to these regions based on their mass.
To calculate densities from the particle distribution, we simply create a uniform grid with cellsize , and obtain the density in each cell using a NearestGridPoint (NGP) assignment scheme (Hockney & Eastwood, 1988). We consider specification of the optimal in §5.3. The cells are ordered by density, and the haloes by mass, and each halo is assigned to its corresponding cell (a random particle is chosen within the cell).
Using in this case, we obtain the results shown in Figure 2 labelled ’ranked exc’. The resulting 2PCF is now overestimated. This is not surprising, since even if we expect haloes to form in dense environments, the bias is not completely deterministic: in reality the most massive halo does not need to reside in the densest place.
The effect of introducing a scale length, , is also clearly seen in this result. There is a turnover in the 2PCF below , which corresponds to a significant reduction of bias on these scales since a random particle is chosen within the cell.
4.4 approach
We find that selecting completely random particles yields too low a bias, whereas the ranked approach is highly biased. We require an intermediate solution, which has higher probability of selecting dense areas than the random approach, and lower probability than the ranked approach.
The probability that a cell is chosen is a function of its density,
(4) 
In the completely random case, we have . In principle we can tailor so that the probability of selecting a cell reproduces the appropriate bias. We choose to constrain to have a powerlaw form, i.e.
(5) 
When , we recover the random approach, and as we obtain the ranked approach.
In Figure 2 we show results for , demonstrating the effectiveness of our model for tuning the normalisation (i.e. bias) of the 2PCF. The curve closely matches the 2PCF of the ahf catalogue, at least at scales larger than the applied cell size .
The exact value of for a particular application may be determined by a leastsquares fit, which we describe in more detail in §5.1 (note that here the choice of was not formally fit).
In corollary with this prescription, we also introduce a means to roughly ensure mass conservation in cells: once a halo is placed, if the total halo mass in the cell exceeds the original mass, the cell is eliminated from future selections. However, we do not update the value of the probability after every halo placement because it is computationally very expensive () and we have checked that doing so has a negligible effect on output statistics.
We note that a similar method was employed in QPM (White et al., 2014). In fact, the physically meaningful quantiy is – the distribution of halo density (i.e. the fraction of haloes in cells with density ). This can be written as
(6) 
where specifies the relative probability of choosing a cell given its density (in our case, ), and is the intrinsic distribution of cell densities given the cell size and cosmology (heavily related to the cosmological parameter ). QPM specifies the target distribution directly, as a Gaussian. In halogen we instead specify , which is more closely tied to our algorithm. In principle one can convert from QPMlike methods to halogen with Eq. (6).
4.5 approach
The approach as it stands reproduces the 2PCF accurately down to the scale of . If the 2PCF of a sample of given number density is all that is required for a specific application, then this will do well.
However, if we were to select a subsample of the most massive haloes of our catalogues and recompute the 2PCF, the bias would be incorrect, since more massive haloes are more biased (Tinker et al., 2005). For a truly representative catalogue, in which the haloes are conditionally placed based on their mass, the bias model is required to be massdependent. Failing this, there is no physical meaning attached to the assignment of masses in the second step (§3.2).
Massdependent halo bias is also crucial for implementing HOD models on the catalogue, for use in galaxy survey statistics, as the number of galaxies associated with a halo depends on its mass.
We incorporate this massdependence into the parameter, so that we finally have
(7) 
with an increasing function.
In practice, we use discrete mass bins, and for each bin , with masses , we use a different . We describe how we obtain the bestfit to this massdependent using the fiducial halo catalogue from the simulation in §5.1.
Using just five mass bins, we illustrate this approach in Figure 2, labelled “ exc” (magenta line) using the bestfit values for . We list in Table 2 the mass thresholds, applied values, and corresponding number densities of all haloes with . Note that though the probability is not recomputed after placing a halo, it is recomputed with updated and when changing mass bins.
Though the approach does not improve the 2PCF with respect to the approach in Figure 2, it has the clear advantage of reproducing a mass dependent clustering, which as we noted is essential for further HOD analyses, and useful for being able to use any massrange in the same realisation.
bin  []  [(Mpc]  

0  3.54  
1  2.26  
2  1.77  
3  1.48  
4  1.41 
4.6 Summary
In conclusion, halogen constitutes a method for generating a halo catalogue which exhibits correct 2point clustering statistics, while not only positioning the haloes correctly, but also imbuing them with physically meaningful masses. The method can be summarised as follows.
The particles generated by 2LPT (§3.1) are covered by a grid of cell size , the halo masses generated from the halo mass function (§3.2) are ordered by mass, and starting from the most massive halo they are placed by

selecting a cell with probability ,

randomly selecting a particle within the cell and using its coordinates as the halo position,

ensuring that the halo does not overlap (following an exclusion criterion) with any previously placed halo in any cell, and rechoosing a different random particle in that case,^{5}^{5}5If, after several iterations all the particles found were excluded, rechoose cell (to avoid infinite loops).

subtracting the halo’s mass from the selected cell, : if the cell is removed from selection.
Note that the physically motivated nature of the process suggests that higherorder statistics may also be recovered with some success.
5 Parameter study
We have mentioned several parameters of the halogen method, and these are of particular importance in producing accurate realisations. In this section we will discuss each parameter, its effects and how to optimise for it if possible.
There are three parameters in halogen (with other options and parameters being expressly determined by the required output, such as the size of the simulation box ): the two physical parameters of the model, – controlling the linear bias – and – controlling the velocity bias – and the one parameter of the algorithm, . These are summarised in Table 3, and detailed in the following subsections.
In the previous Section we used goliat as a reference. We now turn to BigMultiDark and its fof catalogue: this simulation has a larger volume, allowing us to probe BAO scales. The increased volume also reduces cosmic variance on intermediate scales. halogen primarily aims at reproducing clustering statistics for even larger volumes, hence it is beneficial to assess the performance of halogen and its parameters in this regime. Furthermore, this demonstrates independence from the underlying simulation and halo finding technique.
Parameter  Motivation  Value 

linear bias  fit to bias  
velocity bias  
algorithm  

5.1 Fitting
The value of is crucial to the performance of halogen, as it constitutes the only physical parameter controlling the bias. The halogen package contains a standalone routine which determines a bestfit for , which can then be passed to halogen to generate any number of realisations. We describe this routine here, and illustrate it with application to BigMultiDark. The fitting of is based on the standard minimisation technique. However, a few details are worth mentioning.
Massdependence.
We perform the fit in sharpedged mass bins to determine a massdependent , i.e. for each bin we fit a for the mass range . There are two conceivable ways of doing this – differentially or cumulatively. We have experimented with both and find that the cumulative procedure has better performance. That is, we fit the first mass bin, and then the first and second together (keeping the best value of for the first bin), and so on. This has the advantage of being able to properly correct for deviations in previous bins, which is particularly important since the first bins to be fit are the high masses, for which fewer haloes exist. Misestimation of here is more likely, but is compensated for when fitting to lower mass bins by including the highmass estimates in the fit.
HALOGEN variance.
The halo placement in halogen is probabilistic, even given a constant underlying density field. Using different random seeds can slightly affect the final placement, and thus the clustering statistics (the extent of this is dependent on the volume, and ). We term this “halogen variance”, and note that it is not to be confused with cosmic variance. Cosmic variance is introduced by modifying the the random seed of 2LPTic, which in effect results in a different realisation of the universe^{6}^{6}6Cosmic variance – strictly speaking – requires the study of the same volume, but in a different place in the universe. This approach is more appropriately called ’sampling variance’ yet nevertheless the generally accepted technique for generating covariance matrices.. During the fit each mass bin is realised several times (ten in the case of BigMultiDark) with halogen to average out the effects of halogen variance, and also provide an error (computed as the standard deviation) to use in the definition of .
minimisation.
The fit is performed by minimising :
(8) 
where and are the 2PCFs of halogen and the reference catalogue, respectively. We note that minimising this statistic is susceptible to systematic errors in halogen in bins where the stochastic error () is much smaller than the systematic error (). This is especially likely when the region of the fit approaches . To test whether the region is stable, we may choose a distance estimator to be minimized that treats all scales with the same weight, eg. . We have tried with both quantities in our fitted range, and the results are left unchanged, indicating that the range of the fit is stable.
We use a grid of to cover the expected result for each mass bin. We use a cubic spline interpolation over to locate a precise minimum for the bestfit .
Fitting Range.
We restrict the range of the fit to scales in which the shape of is flat. This corresponds to midrange scales of , which avoids smallscale effects of halogen, and largescale cosmic variance.
Number of mass bins.
The number of bins to use in this procedure will depend on the needs of the user, and the size and resolution of the reference simulation. It determines the reliability of the massdependent clustering. For BigMultiDark we distribute the haloes into 8 roughly equinumbered bins with the mass thresholds as shown in Table 4. In that table we also show the bestfit , and the equivalent number density for each mass threshold.
The 2PCFs for our 8 values of are shown in Figure 3, where we compare the results from halogen against the BigMultiDark reference catalogue. The range used during the fitting procedure and for the minimization is indicated by the vertical lines.
We note that the choice of finely controls the bias. This is demonstrated in Figure 4, in which we show the resultant for the entire grid of for this fit (lefthand side). There is a per cent deviation in over the grid range ( between consecutive lines). On the righthand side of the figure, we show the of each of those curves and the cubic spline fit interpolation used to find the minimum, which corresponds to the bestfit value shown in Table 4.
bin  []  [(Mpc]  

0  4.80  0.564  
1  2.79  0.672  
2  2.28  0.715  
3  2.00  0.743  
4  1.90  0.754  
5  1.84  0.760  
6  1.73  0.771  
7  1.73  0.771 
5.2 Velocity factor
In §3.4 we outlined a method of converting the velocity of 2LPTic particles (designated as halo sites), , to the velocity of a halogen halo, . We stated that the transformation was linear in , and thus we can write
(9) 
where we have retained a massdependence in the conversion factor. This section will explore the means to calculate this factor.
We begin by justifying our choice of a linear function. Figure 5 shows the onecomponent velocity distribution of BigMultiDark and the particles selected by halogen. Both curves are welldescribed by a Gaussian with , where the standard deviation of the body haloes is reduced compared to that of , i.e. . This confirms our claim in §3.4 that the particle velocities are larger than the halo velocities, and also shows that a simple linear transformation suffices to map the distribution of .
This simple characterisation leads to a transformation of , which is verified by the blue dotted line where this remapping has been applied.
We expect that the velocity bias (Colín et al., 2000) will be dependent on massscale in general. We can easily incorporate this into our fitting routing by calculating
(10) 
for each interval of mass while performing the fit for . These results are also listed in Table 4. There is a noticeable decrease in towards higher mass haloes. We will see in §6.4 below how this affects the modelling of Redshift Space Distortions.
We finally note that there may be other more complex models of velocity bias accounting for the physics of low scales and adjusting other statistics beyond the overall velocity distribution. However, the model presented here is very simple and capable of reproducing the halo velocity distribution with a great accuracy.
5.3 Cell size:
We have previously mentioned the cellsize which is introduced to halogen to provide a simple local density via the NGP scheme (Hockney & Eastwood, 1988). We have also noted that it defines a lowerlimit of reliability of the resultant 2PCF. In this section we explore this parameter further, describing its effects and how to optimise for it.
In Figure 6 we show the 2PCF of the BigMultiDark catalogue against halogen results for several values of . We note two effects,

determines the minimum scale at which the 2PCF is reliable and

controls the broadening of the Baryonic Acoustic Oscillations (BAO) peak.
The first effect is clearly noticeable in the lefthand panel where the halogen 2PCF detaches from the BigMultiDark curve at . This is expected, since particles are chosen at random inside the cell, reducing the bias at these scales.
The second effect is more noticeable in the righthand panel. As is decreased, the broadening and dampening (best seen in the lower panel as the difference between the artificial peak at and trough at ) is decreased. The reason for this is that we introduce an uncertainty (on a scale ) in the position of the haloes that propagates to an uncertainty in the determination of . In effect, the density field has been filtered by a quasitophat function (Hockney & Eastwood, 1988), which has the known effect of peakbroadening.
Clearly, should be set as small as possible to mitigate these effects. However, a limit is enforced by the meaninterparticleseparation, , of the input density field. We cannot hope to reliably probe scales smaller than , and even just above this scale we run into the problem of having poor statistics within cells. We recommend using a value of (ensuring particles per cell on average), and in this work we take as the reference.
We comment here that the choice of affects the optimal relation. This is unfortunate, because it would be useful to be able to perform the fit for using a lower resolution (since this is the bottleneck). The mechanism by which this effect occurs is known, and we hope to be able to correct for it in the future.
Let us illustrate the mechanism with an example: suppose we take a cell with cellsize and density from a volume . For the same distribution, we could also use , which forms 8 subcells with densities . For the same , the probability of choosing the cell in case I is
(11) 
whereas in case II we have
(12) 
and clearly these are not in general equivalent if . We expect the difference in the distributions to be dependent on , the two cellsizes and their ratio and the cosmology, via the mass variance . In future studies we hope to be able to quantify this relationship to enable faster fitting.
Figure 7 shows the effect of changing on the bestfit and we notice two characteristics. Firstly, is an increasing function for all , as expected since is increasing. Secondly, low masses are less sensitive to , which we expect mathematically from Eqs.11 and 12 with an increasing (the greater is, the greater the differences expected).
In Figure 6 we have refit the relation for each value of , ensuring proper comparison between curves. Furthermore, we run 5 realisations of each and display the average, to reduce the effects of halogen variance.
6 Results and Applications
While previous sections were dedicated to the dessign and optimisation of halogen, we have now defined the final method and fixed the optimal parameters. In this section we discuss the performance of halogen in more detail, both in the clustering statistics so far analysed, and in other statistics that halogen is not constrained to match. We begin by demonstrating the power of halogen for massproduction of halo catalogues for use in deriving covariance matrices to measure cosmic variance, which we envision as the primary application of the halogen machinery.
6.1 Mass production of halo catalogues
The driving motivation of developing fast methods for synthetic halo catalogues is to accurately produce robust covariance matrices for large galaxy survey statistics. Though halogen requires a full body simulation to calibrate its two parameter sets, once these parameters have been established, we are free to run as many realisations (with different phases for the initial conditions) of the the halo catalogue (using the same cosmological parameters, volume, mass resolution etc.) as we like. This process is expected to purely simulate the effects of cosmic variance, and thus is extremely valuable for deriving the covariance matrices.
In order to verify that the variance seen in the resulting data traces the expected cosmic variance, we complemented the generation of the halogen catalogues with several corresponding body simulations. Due to the computational time constraints, we were only able to run five simulations, which were based on goliat, and in which only the seed for the random Initial Condition (IC) phases was changed. The initial conditions for these runs were generated with 2LPTic at redshift (for the body) and (for halogen), using the same seed for each pair. The body particle distributions were evolved to using gadget2 (and subsequently analysed with ahf).
In Figure 8 we present the 2PCF of those 5 pairs of catalogues (with halogen as lines, and ahf as points). The halogen lines are the average of 5 realisations of halogen placement (maintaining the same phases) and the error bars show the halogen variance. Given that the goliat box size is rather small (Gpc), scales are dominated by cosmic variance effects. This makes it easy to identify the signature of each set of initial conditions. Though the realisations are significantly different, we note that the halogen catalogue follows the body result, and maintains the correct normalisation at intermediate scales (). We stress that the fitting procedure has only been performed once; all five cases used fixed parameters. The similarity of the goodness of fit in each case (as compared to that directly fitted to) demonstrates that the fitted is universal with respect to input seed. We note also that the halogen variance is significantly subdominant to the cosmic variance.
To better appreciate the dominance of the cosmic variance in a more applicable scenario, we return to the BigMultiDark simulation. This has a reduced cosmic variance due to the larger volume, but has the disadvantage that we cannot run several body simulations of this magnitude. The blue line of Figure 9 shows how the 2PCF of a singlerun halogen (neither halogen nor cosmic variance has been averaged out) compares to the reference BigMultiDark catalogue when they have the same initial condition phases. We further show the halogen variance () and cosmic variance (). The former has been computed as usual: running 5 realisations of halogen on the same 2LPT snapshot. For the latter we run five 2LPTic snapshots with different IC seeds. In order to avoid mixing and for each of them we first averaged out halogen variance by running 5 realisations of halogen and is computed as the dispersion of the five resulting (free) lines. We find for all scales that the halogen variance is dominated by the cosmic variance, .
6.2 Probability Distribution Function
A simple but powerful statistic for point particles is the Probability Distribution Function (PDF), which is the distribution of particles per cell on a given scale. Though simple, it contains interesting information as it contains contributions from the entire hierarchy of point functions (Peebles, 1980; Fry, 1985; Saslaw, 2000).
Covering the BigMultiDark simulation with meshes of various (regular) sizes, we show in Figure 10 a histogram of the number of haloes per cell for both the halogen and BigMultiDark catalogues; the cell size ranges from to Mpc. We find good agreement, especially at lower numbers of cell, where the contribution of nonlinear scales is reduced. We note that the mesh used to calculate the PDF is not to be confused with the grid used by halogen for the NGP density assignment.
6.3 Power Spectrum
halogen has been designed to recover the 2PCF of a provided halo catalogue. As the power spectrum is its Fourier Transform, it theoretically contains the same information. However, this information is distributed differently in the two functions and there is mode coupling when transforming from one to another: an error at a given scale in one of the magnitudes can propagate to an error at all scales in the other. So we expect to witness different strengths and weaknesses in .
In Figure 11 we compare the power spectrum of the BigMultiDark fof catalogue to the corresponding halogen realisation. We find agreement to 5% across the scales MpcMpc, but note that smaller scales Mpc () are underestimated. This underestimation arises from the smallest scales of the 2PCF, , which integrate through higher scales in .
6.4 Correlation Function in Redshift Space
Observed galaxies are not directly located in 3D space, but in 2Dangular coordinates with redshift converted to a polar distance. However, such distances are modified by galaxies’ peculiar velocities – velocity components that are not due to the Hubble expansion. These modifications are encoded as Redshift Space Distortions (RSD), and we can begin to account for them by assigning correct velocities to haloes.
Using the halo velocities, we can mimic this effect when calculating the 2PCF. We show the results of such an analysis in Figure 12, in which the monopole of the 2PCF in redshift space is compared for the halogen and BigMultiDark catalogues. To show the effect of our velocity transformation, we also include the 2PCF of the ’selected particles’ in which the velocities were not transformed. The normalisation and shape are significantly improved by the simple linear transformation (Eq. (9)), and we find agreement to below 5% per cent at intermediate scales.
7 Conclusions
We have presented a new method called halogen for the construction of approximate halo catalogues. It consists of 4 major steps:

create a distribution of particles in a cosmological volume using order Lagrangian Perturbation Theory and distribute them in a grid of cell size

sample a theoretical halo mass function with a list of halo masses and order them in descending mass.

place the haloes at the position of particles with a probability dependent on the cell density and halo mass . We select random particles within cells, respecting the exclusion criterion and conserving mass in cells (cf. §4).

assign the velocity of the selected particle to the halo through a factor
We noted the modularity of these steps and acknowledged alternatives for each of them. The 2LPT in step (i) provides us with the correct large scale clustering at a low computational cost, while step (ii) reconstructs the halo mass function. The heart of halogen is step (iii) where the mass dependent bias is modelled through the parameter that stochastically places more massive haloes in overdensities, recovering the correct 2point correlation function as a function of mass. We also preclude haloes from overlapping to match the smallscale behaviour of the 2point clustering. In the last step (iv), we remap particle velocities in order to obtain the correct halo velocity distribution.
We studied how the parameters of the method – , and – can be optimised and summarised the results in Table 3. Though halogen needs a reference halo catalogue from an NBody simulation to obtain and , once they have been optimised for a given setup, halogen can be used to generate a multitude of halo catalogues, allowing the quantification of cosmic variance.
The halo mass function is recovered by construction –with some negligible sampling noise– to the theoretical value. The 2point function at intermediate scales (, where the bias is controlled by ) can be obtained in a BigMultiDarklike simulation at the level and to the level at BAO scales () (Figure 9). In redshift space, the error at intermediate scales rises to and remains at at large scales (Figure 12). The clustering has a massdependence, for which the accuracy is controlled by the number of bins in the fit (Figure 3). The power spectrum can be recovered at the level in the range of scales (Figure 11). The halo PDF is accurately reproduced at low cell, but overpredicts the high/cell tail where the contributions of nonlinearities are higher (Figure 10).
We remark upon the adaptability of halogen to different setups: goliat and BigMultiDark have different characteristics (see Table 1) and halogen can be used for both with little recalibration effort. This indicates that halogen is not only capable of running on one specific boxsize, redshift or cosmology, which makes it a powerful tool for exploring the statistics of varying cosmologies etc.
We have also verified that changing the initial phases in 2LPTic for halogen leads to changes in the correlation function (due to cosmic variance) that follow the body simulation. This implies that doing so will yield robust estimates of cosmic variance, over potentially hundreds to thousands of realisations.
We have demonstrated that halogen is a powerful tool for modelling statistics of halo catalogues, and the effects cosmic variance on them. The most immediate application of halogen is the generation of the many catalogues required to study the control of systematics and for computing covariance matrices for large galaxy surveys (e.g. DES, DESi, Euclid). However, it can conceivably be used for other applications involving the study of cosmic variance.
Future work will involve improvements to the method, for instance by exploring subcell adjustments (i.e. alternatives to the random choice inside cells) or by changing one of the 4 stages of halogen (e.g. what happens if we use 3LPT?, what is the best function for in Eq. (4)?). Furthermore, in the present study we have neglected substructure and refered to a possible extension using HOD models. We anticipate a fully integrated HOD layer to the method in future releases, which will enable a more direct comparison to observed data.
Acknowledgements
SA and JGB acknowledge financial support from the Spanish MINECO under grant FPA201239684C0302 and ConsoliderIngenio “Physics of the Accelerating Universe (PAU)” (CSD200700060). They also acknowledge the support from the Spanish MINECO’s “Centro de Excelencia Severo Ochoa” Programme under Grant No. SEV20120249.
SA is also supported by a PhD FPIfellowship from the Universidad Autónoma de Madrid. He also thanks the ”Estancias Breves” program from the UAM and the UWA Reasearch Collaboration Award 2014 that supported his stay in ICRAR, where this project was born. He further thanks David Alonso for his advices at different stages of the project.
AK is supported by the Ministerio de Economía y Competitividad (MINECO) in Spain through grant AYA201231101 as well as the ConsoliderIngenio 2010 Programme of the Spanish Ministerio de Ciencia e Innovación (MICINN) under grant MultiDark CSD200900064. He also acknowledges support from the Australian Research Council (ARC) grants DP130100117 and DP140100198. He further thanks Dinosaur Jr. for the bug.
Part of this research was undertaken as part of the Survey Simulation Pipeline (SSimPL; ssimpluniverse.tk). The Centre for AllSky Astrophysics (CAASTRO) is an Australian Research Council Centre of Excellence, funded by grant CE11E0090.
The work was supported by iVEC through the use of advanced computing resources located at iVEC@Murdoch.
The MultiDark Database and the web application providing online access to it were constructed as part of the activities of the German Astrophysical Virtual Observatory as result of a collaboration between the LeibnizInstitute for Astrophysics Potsdam (AIP) and the Spanish MultiDark Consolider Project CSD200900064. The BigMD simulation suite have been performed in the Supermuc supercomputer at LRZ using time granted. The simulation and its FOF halo catalogue has been kindly made available to us courtesy Stefan Gottlöber, Anatoly Klypin, Francisco Prada, and Gustavo Yepes before its public release. We also acknowledge PRACE for awarding us access to resource Curie supercomputer based in France (project PA2259). Some computation were performed on HYDRA, the HPCcluster of the IFTUAM/CSIC.
This research has made use of NASA’s Astrophysics Data System (ADS) and the arXiv preprint server.
References
 Alonso (2012) Alonso D., 2012, ArXiv eprints 1210.1833
 Bond et al. (1991) Bond J. R., Cole S., Efstathiou G., Kaiser N., 1991, ApJ, 379, 440
 Bouchet et al. (1995) Bouchet F. R., Colombi S., Hivon E., Juszkiewicz R., 1995, A&A, 296, 575
 Castander et al. (2012) Castander F. J., Ballester O., Bauer A., CardielSas L., Carretero J., Casas R., Castilla J., Crocce M., Delfino M., Eriksen M., et al. 2012, in Society of PhotoOptical Instrumentation Engineers (SPIE) Conference Series Vol. 8446 of Society of PhotoOptical Instrumentation Engineers (SPIE) Conference Series, The PAU camera and the PAU survey at the William Herschel Telescope. p. 6
 Chuang et al. (2015) Chuang C.H., Kitaura F.S., Prada F., Zhao C., Yepes G., 2015, MNRAS, 446, 2621
 Chuang et al. (2014) Chuang C.H., Zhao C., Prada F., Munari E., Avila S., Izard A., Kitaura F.S., Manera M., Monaco P., Murray S., Knebe A., Scoccola C. G., Yepes G., GarciaBellido J., Marin F. A., Muller V., Skibba R., Crocce M., Fosalba P., Gottlober S., Klypin A. A., Power C., Tao C., Turchaninov V., 2014, ArXiv eprints 1412.5228
 Coles & Jones (1991) Coles P., Jones B., 1991, MNRAS, 248, 1
 Colín et al. (2000) Colín P., Klypin A. A., Kravtsov A. V., 2000, ApJ, 539, 561
 Davis et al. (1985) Davis M., Efstathiou G., Frenk C. S., White S. D. M., 1985, ApJ, 292, 371
 Dawson et al. (2013) Dawson K. S., Schlegel D. J., Ahn C. P., Anderson S. F., Aubourg É., Bailey S., Barkhouser R. H., Bautista J. E., Beifiori A., et al. 2013, AJ, 145, 10
 Frieman & Dark Energy Survey Collaboration (2013) Frieman J., Dark Energy Survey Collaboration 2013, in American Astronomical Society Meeting Abstracts 221 Vol. 221 of American Astronomical Society Meeting Abstracts, The Dark Energy Survey: Overview. p. 335.01
 Fry (1985) Fry J. N., 1985, ApJ, 289, 10
 Hockney & Eastwood (1988) Hockney R. W., Eastwood J. W., 1988, Computer simulation using particles
 Jing (2005) Jing Y. P., 2005, ApJ, 620, 559
 Kitaura et al. (2014) Kitaura F.S., Yepes G., Prada F., 2014, MNRAS, 439, L21
 Klypin et al. (2014) Klypin A., Yepes G., Gottlober S., Prada F., Hess S., 2014, ArXiv eprints 1411.4001
 Knebe et al. (2011) Knebe A., Knollmann S. R., Muldrew S. I., Pearce F. R., et al. 2011, MNRAS, 415, 2293
 Knebe et al. (2013) Knebe A., Pearce F. R., Lux H., Ascasibar Y., Behroozi P., Casado J., Moran C. C., Diemand J., et al. 2013, MNRAS, 435, 1618
 Knollmann & Knebe (2009) Knollmann S. R., Knebe A., 2009, ApJS, 182, 608
 Laureijs et al. (2011) Laureijs R., Amiaux J., Arduini S., Auguères J. ., Brinchmann J., Cole R., Cropper M., Dabin C., Duvet L., Ealet A., et al. 2011, ArXiv eprints 1110.3193
 Levi et al. (2013) Levi M., Bebek C., Beers T., Blum R., Cahn R., Eisenstein D., Flaugher B., Honscheid K., Kron R., Lahav O., McDonald P., Roe N., Schlegel D., representing the DESI collaboration 2013, ArXiv eprints 1308.0847
 Manera et al. (2013) Manera M., Scoccimarro R., Percival W. J. e. a., 2013, MNRAS, 428, 1036
 Monaco et al. (2013) Monaco P., Sefusatti E., Borgani S., Crocce M., Fosalba P., Sheth R. K., Theuns T., 2013, MNRAS, 433, 2389
 Monaco et al. (2002) Monaco P., Theuns T., Taffoni G., 2002, MNRAS, 331, 587
 Moutarde et al. (1991) Moutarde F., Alimi J.M., Bouchet F. R., Pellat R., Ramani A., 1991, ApJ, 382, 377
 Murray et al. (2013) Murray S. G., Power C., Robotham A. S. G., 2013, Astronomy and Computing, 3, 23
 Neyrinck (2013) Neyrinck M. C., 2013, MNRAS, 428, 141
 Peebles (1980) Peebles P. J. E., 1980, in Ehlers J., Perry J. J., Walker M., eds, Ninth Texas Symposium on Relativistic Astrophysics Vol. 336 of Annals of the New York Academy of Sciences, Statistics of the distribution of galaxies. pp 161–171
 Planck Collaboration (2014) Planck Collaboration 2014, A&A, 571, A16
 Press & Schechter (1974) Press W. H., Schechter P., 1974, ApJ, 187, 425
 Sahni & Shandarin (1996) Sahni V., Shandarin S., 1996, MNRAS, 282, 641
 Saslaw (2000) Saslaw W. C., 2000, The Distribution of the Galaxies
 Scoccimarro (1998) Scoccimarro R., 1998, MNRAS, 299, 1097
 Scoccimarro & Sheth (2002) Scoccimarro R., Sheth R. K., 2002, MNRAS, 329, 629
 Skibba & Sheth (2009) Skibba R. A., Sheth R. K., 2009, MNRAS, 392, 1080
 Springel (2005) Springel V., 2005, MNRAS, 364, 1105
 Tassev et al. (2013) Tassev S., Zaldarriaga M., Eisenstein D. J., 2013, Journal of Cosmology and Astroparticle Physics, 6, 36
 Tinker et al. (2008) Tinker J., Kravtsov A. V., Klypin A., Abazajian K., Warren M., Yepes G., Gottlöber S., Holz D. E., 2008, ApJ, 688, 709
 Tinker et al. (2005) Tinker J. L., Weinberg D. H., Zheng Z., Zehavi I., 2005, ApJ, 631, 41
 Watson et al. (2013) Watson W. A., Iliev I. T., D’Aloisio A., Knebe A., Shapiro P. R., Yepes G., 2013, MNRAS, 433, 1230
 White et al. (2014) White M., Tinker J. L., McBride C. K., 2014, MNRAS, 437, 2594