Forward Modeling of Spectroscopic Galaxy Surveys: Application to SDSS


Galaxy spectra are essential to probe the spatial distribution of galaxies in our Universe. To better interpret current and future spectroscopic galaxy redshift surveys, it is important to be able to simulate these data sets. We describe Uspec, a forward modeling tool to generate galaxy spectra taking into account intrinsic galaxy properties as well as instrumental responses of a given telescope. The model for the intrinsic properties of the galaxy population was developed in an earlier work for broad-band imaging surveys [1]. We apply Uspec to the SDSS/CMASS sample of Luminous Red Galaxies (LRGs). We construct selection cuts that match those used to build this LRG sample, which we then apply to data and simulations in the same way. The resulting real and simulated average spectra show a very good agreement overall, with the simulated one showing a slightly bluer galaxy population. For a quantitative comparison, we perform Principal Component Analysis (PCA) of the sets of spectra. By comparing the PCs constructed from simulations and data, we find very good agreement for the first four components, and moderate for the fifth. The distributions of the eigencoefficients also show an appreciable overlap. We are therefore able to properly simulate the LRG sample taking into account the SDSS/BOSS instrumental responses. The small residual differences between the two samples can be ascribed to the intrinsic properties of the simulated galaxy population, which can be reduced by adjusting the model parameters in the future. This provides good prospects for the forward modeling of upcoming large spectroscopic surveys.

Martina Fagioli (a), Julian Riebartsch (a), Andrina Nicola (a), Jörg Herbel (a), Adam Amara (a), Alexandre Refregier (a), Chihway Chang (b), and Laurenz Gamper (a, c)

(a) Institute for Particle Physics and Astrophysics, ETH Zürich, 8093 Zürich, Switzerland
(b) Kavli Institute for Cosmological Physics, University of Chicago, Chicago, IL 60637, USA
(c) uSystems, Technoparkstrasse 2, 8406 Winterthur, Switzerland

Keywords: Spectroscopic surveys, spectra simulations, principal components analysis

1 Introduction

Although the cosmological principle states that the universe is homogeneous and isotropic when studied at sufficiently large scales, observations tell us that at smaller scales galaxies are not randomly and evenly distributed in space. Galaxies clump into clusters and leave voids, large and nearly empty regions of the universe, while also forming more complicated structures such as filaments and sheets. This large-scale structure depends both on the cosmology which describes the universe, and on galaxy properties. Three-dimensional maps that take into account the angular positions in the sky and the redshifts of galaxies are therefore a very powerful cosmological probe [2, 3, 4, 5]. This, combined with measurements of intrinsic galaxy properties such as colors, luminosities, morphologies, spectral types, or stellar masses, can also provide clues about galaxy formation and evolution [6, 7, 8].

A large sample of tracers of such large-scale structures is needed to extract information about the galaxy clustering pattern and its relation with galaxy properties [9, 10, 11, 12]. Ongoing and upcoming wide-field galaxy surveys such as the Dark Energy Survey (DES), the Kilo-Degree Survey (KiDS) and the survey of the Large Synoptic Survey Telescope (LSST) provide a wealth of photometric data. However, the errors associated with photometric redshift measurements make spectroscopic redshift surveys also necessary. Galaxy redshift surveys such as the Deep Extragalactic Evolutionary Probe (DEEP2) [13], the VIMOS VLT Deep Survey (VVDS) [14] and the Baryon Oscillation Spectroscopic Survey (BOSS) [15] within the Sloan Digital Sky Survey (SDSS) III [16] already provide measurements of the galaxy clustering power spectrum.

In light of upcoming large spectroscopic redshift surveys, such as the extended BOSS (eBOSS) [17, 18], the Dark Energy Spectroscopic Instrument (DESI) [19, 20], the Wide Field Infrared Survey Telescope (WFIRST), and ESA’s Euclid satellite [21], it is necessary to be able to interpret the increasingly precise and numerous data from such cosmological surveys and to forecast the science performance of the experiments, for example by understanding redshift fitting routines or the reliability of photometric redshift estimates. Simulations will play a key role in this scenario. Spectroscopic surveys such as SDSS/BOSS, the 6dF Galaxy Survey (6dFGS), the Big Baryon Oscillation Spectroscopic Survey (BigBOSS) and the 4m Multi-Object Spectroscopic Telescope (4MOST) [22] have developed simulation tools to prepare their observing strategies and improve their data reduction performance (see e.g., [23, 24, 25, 26]). [27] developed the SPectrOscopic KEn Simulation (SPOKES), an end-to-end simulation facility for spectroscopic cosmological surveys.

In this paper, we describe Uspec and its performance in simulating realistic galaxy spectra. Uspec takes as inputs a galaxy model, described in detail in [1], and the instrumental setup of a given telescope, and outputs redshifted, noisy galaxy spectra. This allows us to forward model a spectroscopic galaxy sample, and to compare it to an existing one after having performed the same cuts on both. The forward modeling approach is becoming a widely used technique [28, 1, 29], and consists of generating observable quantities from an astrophysical model containing, for instance, the evolution of the luminosity functions of red and blue galaxies with cosmic time (see e.g. [1]). A similar approach has also been applied recently to image simulations with narrow band photometric data from the Physics of the Accelerating Universe Survey (PAUS), as shown in Tortorelli et al. (in prep.).

Here, we choose to simulate a sample of Luminous Red Galaxies (LRGs) from the SDSS surveys. LRGs are a widely used tracer of large-scale structure [30, 31, 32, 33], and their spectra can be well approximated by a linear combination of templates and coefficients [34, 35, 36]. We compare our simulated sample to the real one from SDSS, after applying the appropriate cuts to both. The resulting stacked galaxy spectrum is compared to a stacked galaxy spectrum coming from the SDSS/BOSS survey. We also perform a principal component analysis which shows the agreement between the simulations and the data, indicating that the method is able to reproduce the variety of properties of the LRGs we aimed to study.

The paper is structured as follows. In Section 2, we describe the model from which the basic ingredients to simulate galaxies are drawn. In Section 3, we describe the data set and the reasons behind the cuts applied to the data. In Section 4, we describe Uspec and the instrumental setup added to the model galaxies in order to generate realistic spectra. In Section 5, we compare our simulated data to real galaxy spectra, both through comparing stacked real and simulated spectra and through a Principal Component Analysis of the two populations. In Section 6, we present our conclusions.

Throughout this work, we use a standard $\Lambda$CDM cosmology with $\Omega_{\rm m} = 0.3$, $\Omega_{\Lambda} = 0.7$ and $H_0$ expressed in km s$^{-1}$ Mpc$^{-1}$.

2 Galaxy population model

To simulate galaxy spectra, we need basic galaxy properties as inputs. The model from which such properties are drawn is fully described in [1], and the model parameter values that we use in this study are given in Tortorelli et al. (in prep.). In this section, we review the aspects of the model necessary for simulating galaxy spectra given an input galaxy population. For a more complete description of the model we refer the reader to [1].

2.1 Galaxy luminosity functions

The galaxies are drawn from galaxy luminosity functions (see e.g. [37, 38]). We define the galaxy luminosity function $\phi$ as the number of galaxies $N$ per unit comoving volume $V$ and absolute magnitude $M$:

\[ \phi(z, M) = \frac{\mathrm{d}N}{\mathrm{d}M \, \mathrm{d}V}, \]

where $z$ denotes redshift. The functional form of $\phi$ is taken to be a Schechter function [39]. The galaxies are drawn from separate and evolving luminosity functions for blue and red galaxies. The distinction between red and blue galaxies is made through their Specific Star Formation Rates (SSFRs). The redshifts and absolute magnitudes are obtained by sampling from the corresponding luminosity function.
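As a rough illustration of the sampling step, the sketch below draws absolute magnitudes from a single, non-evolving Schechter function by inverting a tabulated cumulative distribution. The function names, the magnitude range and the parameter values (`phi_star`, `M_star`, `alpha`) are placeholders chosen for illustration; they are not the values of the model in [1], which uses separate and redshift-evolving luminosity functions for red and blue galaxies.

```python
import numpy as np

def schechter_mag(M, phi_star, M_star, alpha):
    """Schechter luminosity function per unit absolute magnitude."""
    x = 10.0 ** (0.4 * (M_star - M))  # luminosity in units of L*
    return 0.4 * np.log(10.0) * phi_star * x ** (alpha + 1.0) * np.exp(-x)

def sample_magnitudes(n, phi_star, M_star, alpha,
                      M_min=-25.0, M_max=-17.0, rng=None):
    """Draw n absolute magnitudes in [M_min, M_max] by inverse-CDF sampling."""
    rng = np.random.default_rng() if rng is None else rng
    grid = np.linspace(M_min, M_max, 2001)
    pdf = schechter_mag(grid, phi_star, M_star, alpha)
    cdf = np.cumsum(pdf)
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])  # normalise to [0, 1]
    # Map uniform random numbers through the inverse of the tabulated CDF.
    return np.interp(rng.random(n), cdf, grid)

# Hypothetical parameter values, for illustration only.
mags = sample_magnitudes(10_000, phi_star=1e-3, M_star=-20.5, alpha=-1.1,
                         rng=np.random.default_rng(0))
```

In a redshift-dependent version, the same sampling would be repeated in thin redshift slices with the Schechter parameters evaluated at each slice.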

2.2 Spectral energy distributions

The next step is to model the Spectral Energy Distributions (SEDs). We model the SEDs of galaxies as linear combinations of templates $T_i(\lambda)$ weighted by coefficients $c_i$, where $T_i$ are suitably chosen templates and $c_i$ come from a Dirichlet distribution [40] of order five, as described in [1]. This model was constructed using the NYU Value-Added Galaxy Catalog (NYU-VAGC) based on SDSS, with galaxies mostly at low redshift [41]. The templates used are those presented in [42], which are based on the Bruzual & Charlot stellar evolution synthesis models [43]. For the coefficients $c_i$, different Dirichlet distributions are used for blue and red galaxies and the parameters describing these distributions are chosen to be redshift dependent, such that statistically different coefficients are assigned to galaxies at different redshifts. The coefficients $c_i$, together with redshifts and magnitudes, are stored and given as inputs for the spectra simulations.
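The template-weighting step can be sketched as follows. This is a minimal example under stated assumptions: the array `templates` stands in for the five templates of [42] (here filled with random numbers), and the Dirichlet concentration parameters `alpha_red` are hypothetical, not the redshift-dependent values used in [1].

```python
import numpy as np

def draw_sed(templates, alpha, rng=None):
    """Draw Dirichlet coefficients and return (coefficients, rest-frame SED).

    templates : array of shape (n_templates, n_wavelengths)
    alpha     : Dirichlet concentration parameters, one per template
    """
    rng = np.random.default_rng() if rng is None else rng
    coeffs = rng.dirichlet(alpha)   # non-negative, sums to one
    sed = coeffs @ templates        # linear combination of the templates
    return coeffs, sed

# Toy stand-ins for the five templates (assumption, not the real ones).
rng = np.random.default_rng(1)
templates = rng.random((5, 500))
alpha_red = np.array([2.0, 1.0, 0.5, 0.5, 4.0])  # hypothetical values
coeffs, sed = draw_sed(templates, alpha_red, rng=rng)
```

Using separate `alpha` vectors for red and blue galaxies, and letting them vary with redshift, would reproduce the structure described above.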

2.3 Catalog generator

The basic galaxy properties described above are generated and stored as described in [1]. In [1], the galaxy catalogs were used in order to simulate astronomical images with the Ultra Fast Image Generator (Ufig) [44, 45, 46, 47, 1, 29]. Galaxy catalogs can be generated for different surveys by modifying the filters used to compute the output magnitudes, and the extinction caused by Milky Way dust can be added. This is the first time these catalogs are used to generate galaxy spectra.


As a starting point, we choose a red galaxy sample that is widely used for large-scale structure studies and whose properties are expected to be easier to model. We thus select LRGs from Data Release 13 (DR13) of the SDSS/BOSS survey [48]. The imaging and spectroscopic data of this survey were obtained at the 2.5m telescope of the Apache Point Observatory (APO) in Sunspot, New Mexico [49], using a wide field mosaic CCD camera [50] and a twin multi-object fiber spectrograph, respectively. The BOSS survey (Baryon Oscillation Spectroscopic Survey) [15] within the SDSS III [16] uses an upgrade of the SDSS spectrograph. We describe its relevant features in the sections below.

3.1 SDSS photometry

We make use of the petroMag magnitudes of the SDSS catalog [51]. These magnitudes are computed using a modified form of the Petrosian [52] system, as described in [53] and [54]. Petrosian magnitudes measure the flux within an aperture that contains a constant fraction of the total light of the objects. Furthermore, these magnitudes are model-independent. They are therefore the most suitable choice to describe the photometry of bright galaxies with high signal-to-noise ratio. In the following analysis, we do not consider the u-band, as the u-band magnitude measurements have large photometric errors for SDSS red galaxies (see e.g. [42]). Note that we do not correct the magnitudes used here for the reddening caused by Galactic dust, but we include this effect in our modeling of the LRGs.

3.2 SDSS spectroscopy

The spectra used in this analysis are taken from the BOSS survey. Here we highlight the relevant features of the spectra we employ; for a full description of the characteristics of these data we refer the reader to [55].

In the BOSS survey, the number of fibers per spectrograph has been increased to 500 (from the previous 320 of SDSS) so that, with the two spectrographs, a total of 1000 objects is observed per exposure. Of these fibers, 895 are dedicated to science targets, 100 to sky and standard stars, and 5 to repeated targets. The BOSS fiber diameter is $2''$. This is the most suitable choice for high redshift galaxies (as in this case, up to ) in order to maximize the signal-to-noise while keeping the sky background contamination low. We use spectroscopic redshifts from [56]. The wavelength range extends from 3,650 to 10,400 Å. However, part of the blue wavelengths (Å) are excluded from the analysis below due to fringing. The same applies to the reddest wavelengths of the spectrum (Å) due to the prominent sky background residuals. For a more detailed description of noise effects, such as read-out and shot noise, and effects of flux loss such as atmospheric transmission, see Section 4.1. For a more detailed description of the sky background and resolving power effects, see Section 4.2.

3.3 Sample selection

Here, we describe how we selected the final sample of the galaxies that we analyzed. We emulate the cuts of the SDSS/BOSS CMASS sample [60], which aims at selecting a stellar-mass limited sample of galaxies. Color cuts are applied in the $g-r$ and $r-i$ plane in order to isolate high redshift galaxies, in the approximate redshift range . The sample is aimed at including red galaxies only; however, the color cuts explicitly applied in SDSS-I/II Cut-II [61] and 2SLAQ [62] to select red galaxies are not applied in CMASS. This is the reason why galaxies with signs of gas emission (like e.g., and emission lines) are included in our final sample. However, galaxies with visible emission lines only account for about 4% of the total CMASS sample (see e.g., [63] and [64]).

First, we exclude galaxies with or band magnitudes above the SDSS magnitude limit in those bands, namely:

Cuts on the band magnitude are already included in the final CMASS color/magnitude cuts applied below. No cuts are applied in the band (see Section 3.1).

In order to emulate the CMASS sample, the following cuts in magnitude, color and redshift are applied:

where $d_\perp = (r - i) - (g - r)/8$, which is the distance perpendicular to the locus of the galaxy colors in the $g-r$ vs. $r-i$ color plane. This ensures the exclusion of low redshift galaxies from the sample. The cuts in the $i$ band define the faint and bright limits. $i_{\rm fib2}$ is the measurement of the flux contained within the aperture of a spectroscopic fiber in the $i$ band ($2''$ in the case of the BOSS spectrograph).

A sharp cut in redshift () is also introduced by us in order to ensure a closer match between the redshift distribution of simulated and real galaxies (see Section 3.4.2 below). This cut is introduced to focus on the performances of the spectra simulations. In our future work, we will rely only on photometric quantities, as spectroscopic properties such as redshifts need to be measured on spectra themselves, which are the data products we seek to simulate.
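To make the structure of such a selection concrete, the sketch below applies CMASS-like cuts to arrays of magnitudes. The numerical thresholds are those of the published BOSS CMASS target selection as we recall them, and it is an assumption that they correspond to the cuts used in this section; the limits on the individual band magnitudes and the redshift range are left as optional arguments because their values are not given here. The function name and signature are hypothetical.

```python
import numpy as np

def cmass_like_mask(g, r, i, i_fib2, z=None, z_range=None):
    """Boolean mask for a CMASS-like selection (thresholds are assumptions)."""
    # Distance perpendicular to the galaxy locus in the g-r vs. r-i plane.
    d_perp = (r - i) - (g - r) / 8.0
    mask = (
        (i > 17.5) & (i < 19.9)                  # bright and faint limits
        & (r - i < 2.0)                          # removes extreme colors
        & (d_perp > 0.55)                        # excludes low-z galaxies
        & (i_fib2 < 21.5)                        # enough flux in the fiber
        & (i < 19.86 + 1.6 * (d_perp - 0.8))     # sliding magnitude cut
    )
    if z is not None and z_range is not None:
        # Optional sharp redshift cut; the range is supplied by the caller.
        mask &= (z > z_range[0]) & (z < z_range[1])
    return mask

g = np.array([21.5, 19.0])
r = np.array([19.9, 18.0])
i = np.array([19.0, 17.6])
i_fib2 = np.array([20.5, 19.0])
selected = cmass_like_mask(g, r, i, i_fib2)
```

Applying one function to both the observed and the simulated catalogs is what guarantees that the two samples are selected in the same way.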

3.4 Catalogs comparison

In this section, we present the comparison between the simulated and real galaxy catalogs. The simulated properties are derived from the model described in Section 2 and are given as input in order to simulate galaxy spectra. Here we compare those simulated input properties and the data. In Section 3.3 we listed the cuts in magnitude and redshift spaces applied to SDSS galaxies to mimic the CMASS sample. The same cuts have been applied to the simulated galaxies with the appropriate modifications (see below).

3.4.1 Magnitudes

Figure 1: SDSS (green) and simulated (red) magnitude distributions. The simulated magnitudes have been derived using SDSS filters and the errors added considering the SDSS magnitude error distributions.

The magnitudes employed here for the simulated galaxies have been computed using SDSS filters. These magnitudes are noise free. To add realistic uncertainties to our magnitudes, we take those from the real SDSS magnitudes we employ. We find a correlation between SDSS magnitudes and their associated uncertainties, which we model with a linear relationship. As expected, fainter magnitudes have larger uncertainties. In each magnitude bin, we construct a Gaussian centered on the fitted value of the uncertainty, with a standard deviation given by the scatter around the fitted values. We randomly draw uncertainties from these distributions. These uncertainties are then added to our noise-free simulated magnitudes.
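The error model described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: it uses one global scatter around the linear fit instead of a scatter per magnitude bin, and it assumes that "adding" the uncertainty means perturbing each magnitude by a Gaussian deviate with the drawn standard deviation. All names are hypothetical.

```python
import numpy as np

def fit_error_model(mag_obs, err_obs):
    """Fit err = slope * mag + intercept and measure the scatter around it."""
    slope, intercept = np.polyfit(mag_obs, err_obs, 1)
    scatter = np.std(err_obs - (slope * mag_obs + intercept))
    return slope, intercept, scatter

def add_mag_noise(mag_sim, slope, intercept, scatter, rng=None):
    """Assign an uncertainty to each simulated magnitude and perturb it."""
    rng = np.random.default_rng() if rng is None else rng
    mag_sim = np.asarray(mag_sim, dtype=float)
    # Uncertainty drawn from a Gaussian centred on the fitted relation.
    sigma = rng.normal(slope * mag_sim + intercept, scatter)
    sigma = np.clip(sigma, 0.0, None)  # uncertainties cannot be negative
    # Assumption: the noisy magnitude is a Gaussian draw of width sigma.
    noisy = mag_sim + rng.normal(0.0, 1.0, size=mag_sim.shape) * sigma
    return noisy, sigma

# Toy "observed" catalog standing in for the SDSS magnitudes and errors.
rng = np.random.default_rng(1)
mag_obs = rng.uniform(18.0, 20.0, 5000)
err_obs = 0.05 * (mag_obs - 17.0) + rng.normal(0.0, 0.005, 5000)
slope, intercept, scatter = fit_error_model(mag_obs, err_obs)
noisy, sigma = add_mag_noise(rng.uniform(18.0, 20.0, 1000),
                             slope, intercept, scatter, rng=rng)
```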

The data magnitudes we employ are affected by reddening caused by dust. Therefore, we also add this effect to the simulated magnitudes using reddening maps from [59]. With regard to the cut presented in Section 3.3 on the fiber2Mag magnitude, the fiber2Mag is assumed to be the same as the $i$ band magnitude for the simulated galaxies.

Figure 1 shows the comparison between real (green) and simulated (red) magnitudes in four ($g$, $r$, $i$, $z$) of the five SDSS bands. The $u$ band, although present in the simulated galaxy catalog, has been excluded from the analysis for the reasons described in Section 3.1. It can be seen from Figure 1 that the distributions of the real and simulated galaxy population magnitudes are similarly centered and occupy the same region in the magnitude space. However, the band magnitude distributions show a small disagreement between the two galaxy populations, with the simulated galaxies being on average fainter with respect to the SDSS ones.

3.4.2 Redshifts

Redshifts are also assigned to simulated galaxies during the galaxy catalog generation. Figure 2 shows the comparison between the real (green) and the simulated (red) spectroscopic redshift distributions. The simulated redshift distribution is shifted towards lower redshifts, with a median redshift of 0.54 for SDSS galaxies and a median redshift of 0.50 for simulated galaxies (i.e., a difference of 0.04 in median redshift). In future work, the parameters of the input model can be adjusted to improve the match of the two distributions.

In this first analysis of Uspec, we force a match between the redshift distributions of the real and the simulated samples. The matching is done such that in every bin of redshift there is the same number of objects for both the real and the simulated galaxies, randomly chosen from a parent sample. However, it is worth noting that the difference in median redshift is related to the parameters controlling the redshift evolution of the luminosity functions. For a more detailed discussion on this aspect, see Appendix A. After this matching, the total number of galaxies to be analyzed for both the real and the simulated sample is 1617 objects.
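One way to implement such a bin-by-bin matching is sketched below. In each redshift bin, the same number of objects is kept from both samples, namely the smaller of the two counts, chosen at random. The bin edges and the toy redshift distributions are illustrative assumptions, and the function name is hypothetical.

```python
import numpy as np

def match_redshift_distributions(z_data, z_sim, bin_edges, rng=None):
    """Return index arrays selecting equal numbers of objects per z bin."""
    rng = np.random.default_rng() if rng is None else rng
    bin_data = np.digitize(z_data, bin_edges)
    bin_sim = np.digitize(z_sim, bin_edges)
    idx_data, idx_sim = [], []
    for b in range(1, len(bin_edges)):
        in_data = np.flatnonzero(bin_data == b)
        in_sim = np.flatnonzero(bin_sim == b)
        n = min(in_data.size, in_sim.size)
        # Random subsample of size n from each parent sample in this bin.
        idx_data.append(rng.permutation(in_data)[:n])
        idx_sim.append(rng.permutation(in_sim)[:n])
    return np.concatenate(idx_data), np.concatenate(idx_sim)

# Toy distributions with medians close to those quoted in the text.
rng = np.random.default_rng(2)
z_data = rng.normal(0.54, 0.05, 3000)
z_sim = rng.normal(0.50, 0.05, 3000)
edges = np.linspace(0.4, 0.7, 13)  # hypothetical binning
i_data, i_sim = match_redshift_distributions(z_data, z_sim, edges, rng=rng)
```

Objects outside the range covered by `bin_edges` are dropped from both samples, so the edges also act as a sharp redshift cut.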

Figure 2: Spectroscopic redshift distributions for SDSS (green) and simulations (red). The vertical lines show the medians of the two distributions.

4 Spectra simulations: Uspec

Uspec simulates galaxy spectra of an experiment given its location, instrument and a cosmological model. All the steps taken in order to construct galaxy spectra are illustrated in the flowchart in Figure 3.

First, the spectrum of a galaxy is constructed as a linear combination

\[ f(\lambda) = \sum_{i=1}^{5} c_i \, T_i(\lambda), \]

where $T_i(\lambda)$ are the 5 kcorrect templates from [42], and the coefficients $c_i$ are described in Section 2.2. This produces a noise-free, rest-frame galaxy model spectrum. The spectrum is then shifted to the observed frame given the redshift $z$ provided by the simulated galaxy catalog. The dimming effect due to the shift in wavelengths, i.e., the fact that the flux enclosed in a wavelength bin $\Delta\lambda$ must be assigned to a broader wavelength bin $\Delta\lambda \, (1+z)$, and the dimming due to the luminosity distance of the sources, are applied [65].
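The two dimming effects can be illustrated with the standard relation $f_\lambda(\lambda_{\rm obs}) = L_\lambda(\lambda_{\rm rest}) / [4\pi d_L^2 (1+z)]$ with $\lambda_{\rm obs} = \lambda_{\rm rest}(1+z)$. The sketch below is a minimal version under stated assumptions: a flat $\Lambda$CDM cosmology, a placeholder Hubble constant of 70 km s$^{-1}$ Mpc$^{-1}$ (not taken from this paper), and hypothetical function names; units of the input luminosity density are left to the caller.

```python
import numpy as np

C_KM_S = 299792.458  # speed of light in km/s

def luminosity_distance_mpc(z, h0=70.0, omega_m=0.3, omega_l=0.7,
                            n_steps=2048):
    """Luminosity distance in Mpc for a flat LCDM cosmology (h0 assumed)."""
    zz = np.linspace(0.0, z, n_steps)
    inv_e = 1.0 / np.sqrt(omega_m * (1.0 + zz) ** 3 + omega_l)
    # Trapezoidal integration of 1/E(z) gives the comoving distance.
    integral = np.sum(0.5 * (inv_e[1:] + inv_e[:-1]) * np.diff(zz))
    return (1.0 + z) * C_KM_S / h0 * integral

def to_observed_frame(wave_rest, lum_rest, z, d_l):
    """Redshift a rest-frame spectrum and apply both dimming factors."""
    wave_obs = wave_rest * (1.0 + z)
    # 1/(1+z): the flux of a bin is spread over a broader wavelength bin.
    # 1/(4 pi d_l^2): dimming due to the luminosity distance.
    flux_obs = lum_rest / (4.0 * np.pi * d_l ** 2 * (1.0 + z))
    return wave_obs, flux_obs

wave_rest = np.linspace(2000.0, 7000.0, 501)
lum_rest = np.ones_like(wave_rest)
d_l = luminosity_distance_mpc(0.5)
wave_obs, flux_obs = to_observed_frame(wave_rest, lum_rest, 0.5, d_l)
```

With this convention the wavelength-integrated flux equals the integrated luminosity divided by $4\pi d_L^2$, as it should.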

4.1 Instrumental response

Noise is added to the constructed galaxy model spectra. The instrumental effects which are included into the simulated SDSS-like spectra are listed below:

  • Read-out noise: A read-out noise of /pixel is assumed (see as a reference [55] for read-out noise in the BOSS spectrograph). The read-out noise is computed per unit wavelength, taking into account the instrumental resolution, and a random normal realization of it is added to the model galaxy and sky spectra in photons (see the flowchart in Figure 3).

  • Shot noise: The shot noise is the Poisson random realization of the model galaxy or sky spectrum. Specifically, a Poisson realization of the sum $F_{\rm galaxy} + F_{\rm sky}$, where $F$ stands in this case for flux expressed in photons, is created. A different random Poisson realization of $F_{\rm sky}$ is then subtracted, in order to simulate a realistic sky background subtraction, as sky from a different fiber in the same plate is subtracted from the galaxy in the SDSS survey. We do not account for differences in the sky background given by different locations in the SDSS plate, which we assume to be negligible. For a detailed description of the sky model, see Section 4.2.

  • Transmission curve: The transmission loss due to atmosphere and instrument is taken into account. We have used the ‘spthroughput’ routine of the ‘idlspec2D’ spectroscopic reduction pipeline built by Princeton University, together with flux calibration files, to create the throughput of the instrument. The galaxy and sky model spectra are first multiplied by the atmospheric transmission. The final, sky-subtracted spectrum is then divided by the atmospheric transmission so as to mimic the steps of the data reduction (see flowchart in Figure 3).

galaxy model = Σ (templates × coefficients) / wavelength (4)

sky model = continuum + emission lines + pollution (4.2)

galaxy model × atmospheric transmission (4.1)

sky model × atmospheric transmission (4.2)

total spectrum = poisson(galaxy model[photons] + sky model[photons]) + read-out noise (4.2)

sky-subtracted spectrum = total spectrum[photons] − poisson(sky model[photons]) (4.2)

sky-subtracted spectrum / atmospheric transmission (4.2)

observed spectrum (4.3)
Figure 3: Flowchart illustrating the construction of the final galaxy model spectra. All instrumental effects (read-out and shot noise, and effects of atmospheric transmission) are indicated in red. In light blue, all steps relative to the construction and use of the sky model spectrum are shown. All steps involving the use of the galaxy model only are shown in green. Poisson indicates a random poisson realization of the flux given in photons (shot noise).
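The noise steps of the flowchart can be condensed into a few lines. The sketch below follows the order of operations given there: apply the transmission, draw a Poisson realization of galaxy plus sky, add Gaussian read-out noise, subtract an independent Poisson realization of the sky, and divide by the transmission. The input arrays, the read-out noise value and the function name are placeholders; the conversion of read-out noise from pixels to wavelength bins is not modeled.

```python
import numpy as np

def simulate_observed_spectrum(galaxy_phot, sky_phot, transmission,
                               read_noise, rng=None):
    """Noisy, sky-subtracted spectrum from noise-free photon spectra."""
    rng = np.random.default_rng() if rng is None else rng
    gal = galaxy_phot * transmission   # photons reaching the detector
    sky = sky_phot * transmission
    # Shot noise on galaxy + sky, plus Gaussian read-out noise.
    total = rng.poisson(gal + sky) + rng.normal(0.0, read_noise,
                                                size=gal.shape)
    # Independent sky realization, mimicking a sky fiber on the same plate.
    sky_subtracted = total - rng.poisson(sky)
    # Undo the transmission, as done in the data reduction.
    return sky_subtracted / transmission

n_pix = 4000
rng = np.random.default_rng(3)
observed = simulate_observed_spectrum(
    galaxy_phot=np.full(n_pix, 200.0),   # toy flat galaxy spectrum
    sky_phot=np.full(n_pix, 500.0),      # toy flat sky spectrum
    transmission=np.full(n_pix, 0.5),    # hypothetical throughput
    read_noise=2.0,                      # hypothetical value
    rng=rng,
)
```

Because the subtracted sky is a separate realization, the sky contributes twice to the variance of the final spectrum while cancelling on average.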

4.2 Sky model

Figure 4: Sky model spectrum. The first four panels show the contributions to the continuum (from scattered starlight, scattered moonlight in dark time, and zodiacal light) and the emission lines coming from light pollution, which are not visible in Paranal spectra. The model sky emission lines come from the UVES Atlas of Paranal sky lines. The total (noise-free) model sky is shown in red. The comparison with a random (noisy) sky taken from the SDSS sample is shown in the second last panel. The residuals between the two are shown in the last panel. The bump around 6000Å is due to light pollution from Na and Ne lamps [66], which is not present in Paranal sky spectra.

The emission lines of the night sky are important contaminants for astronomical observations. It is therefore important to properly model the sky spectrum in order to estimate its effective impact on the noise associated to the galaxies simulated by Uspec.

The night sky spectrum is recorded in every single astronomical observation in SDSS. Its variability makes it difficult to simulate the precise intensity of the features of the night sky at a given moment of the night and position in the sky. The central wavelengths of the emission lines, however, are constant in time and space, and can therefore be easily modeled. For the purpose of forward modeling the sky, a sky spectrum at a given moment of time or position in space is not needed. The Poisson realization of a night sky background which includes the common sky lines and continuum is sufficient for the purpose of mimicking a realistic sky subtraction. Below we list the input parameters for our simulated sky model; in Figure 4 we show our (noise-free) model sky spectrum, its individual components and the comparison with a real (noisy) random BOSS sky spectrum.

  • Line Spread Function (LSF): We compute the LSF (i.e., the broadening due to the instrumental resolving power as a function of wavelength ) on randomly observed sky spectra in the BOSS survey. We fit Gaussians to a series of evenly distributed sky lines along the wavelength direction. We then derive a linear relation between the FWHM of those lines and their central wavelengths. The relation we derive is in agreement with the resolving power provided by BOSS ( in the blue range, in the red range [55]). We use this relation to determine the width of the sky lines in our model spectrum.

  • Emission Lines:

    • UVES Atlas of Paranal sky lines: The central wavelengths and intensities of the sky lines are taken from [67]. The atlas of lines in the optical and near-IR wavelength range has been acquired by UVES, the echelle spectrograph at the 8.2-m UT2 telescope of the Very Large Telescope (VLT). While the absolute intensities of sky line emission depend on the time and location of the observations, the UVES line intensities are used here as a reference in order to construct a realistic sky model spectrum.

    • Light pollution emission lines: [68] list emission lines tracing light pollution, such as HgI 5461, 5770, 5791 Å and components of the NaI 5890, 5896 Å lines indicative of both high-pressure and low-pressure Sodium lamps. These lines are not included in the UVES atlas of sky lines, as at a dark site like Paranal there is no trace of such elements in the atmosphere. For this reason, we fit Gaussians to these sky lines from a real SDSS spectrum, deriving their peak intensities. We then construct a mock spectrum using these sky lines only, with the LSF described above and the intensities as described here, and add it to the total spectrum.

  • Continuum: The continuum for our model sky has been computed with the SkyCal Web Application from the ESO Sky Calculator [69, 70]. This includes:

    • Scattered Moonlight: Scattered moonlight has a stronger effect on the continuum if the observing date is close to full Moon. This is not the case for the observations of CMASS, which have been taken during dark time. However, as the blue wavelengths in particular are affected by it, it cannot be neglected.

    • Zodiacal Light: Zodiacal light comes from interplanetary dust grains scattering sunlight. Here we choose values for ecliptic latitude and heliocentric ecliptic longitude for targets at the zenith, with airmass = 1. A strong continuum coming from zodiacal light would be expected for low absolute values of such coordinates [69].

    • Scattered Starlight: Starlight is scattered in the atmosphere. The distribution of stars reaches a peak towards the centre of the Milky Way. Therefore, the scattering model required for this kind of distribution is that for extended sources. This component of the continuum is minor compared to the other two main components mentioned above. As a consequence, computing a mean continuum spectrum is sufficient for an exposure time calculator application, such as the one used here [69].

A demonstration of the impact of the sky model on the data analysis is given in Appendix B.
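The emission-line part of the sky model described in the list above can be sketched as a sum of Gaussians whose width follows a linear FWHM-wavelength relation. The coefficients of that relation, the single example line and the function name are placeholders, not the fitted BOSS values; line intensities are treated here as integrated fluxes, which is an assumption. The continuum from the sky calculator would simply be added to the returned array.

```python
import numpy as np

# Conversion between the FWHM and the standard deviation of a Gaussian.
FWHM_TO_SIGMA = 1.0 / (2.0 * np.sqrt(2.0 * np.log(2.0)))

def sky_line_spectrum(wave, centers, intensities, fwhm_slope, fwhm_intercept):
    """Sum of Gaussian sky lines broadened by a linear LSF model.

    The FWHM at each line centre is fwhm_slope * centre + fwhm_intercept.
    """
    spectrum = np.zeros_like(wave, dtype=float)
    for lam0, amp in zip(centers, intensities):
        sigma = (fwhm_slope * lam0 + fwhm_intercept) * FWHM_TO_SIGMA
        profile = np.exp(-0.5 * ((wave - lam0) / sigma) ** 2)
        profile /= sigma * np.sqrt(2.0 * np.pi)  # unit-area Gaussian
        spectrum += amp * profile
    return spectrum

wave = np.arange(5500.0, 5700.0, 0.1)
lines = sky_line_spectrum(wave, centers=[5577.3], intensities=[100.0],
                          fwhm_slope=1e-4, fwhm_intercept=2.0)
```

A Poisson realization of the line spectrum plus continuum then provides the noisy sky used in the subtraction step.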

4.3 Construction of the final Uspec spectrum

As described in the flowchart in Figure 3 and in the previous sections, a model sky-subtracted galaxy spectrum is built starting from a linear combination of 5 templates and coefficients assigned to the galaxies, as described in Section 2. The simulated spectrum includes read-out and shot noise and the effects of the atmospheric transmission. At this point, the magnitude in the band is computed and compared to the input band. A warning is generated in case the difference between the two exceeds . This happens for of the generated galaxy spectra, which are however included in the final sample. The final fluxes, input magnitudes, redshifts and output band magnitudes are stored in order to be compared with those from real data.

5 Results

5.1 Stacked spectra comparison

Figure 5: Average stacked LRG spectra for both SDSS (green) and Uspec (red) samples. Vertical dashed lines indicate the position of strong features, such as [O II] and [O III] emission lines and the Balmer absorption lines.

In order to compare real and simulated galaxy spectra, we compute the average stacked galaxy spectra for both samples. A total of 1617 galaxy spectra is used for both the SDSS and the Uspec samples. As in [57], the SDSS spectra are corrected for Galactic extinction following the extinction curve for diffuse gas from [58] with , and using the Galactic values from the maps of [59]. This is needed as simulated galaxy spectra do not include Galactic dust. We do not correct for any internal dust extinction, since passive galaxies are expected to have negligible intrinsic dust. Afterwards, the spectra are shifted to the rest frame and normalized by the mean flux at Å wavelengths. In this region, the spectra of red galaxies are flat and no strong features are expected. The spectra are then interpolated onto a 1 Å linearly spaced wavelength grid. The normalization and interpolation steps are applied to both SDSS and Uspec spectra. Figure 5 shows the comparison between the two spectra. The average spectra for both galaxy samples are clearly those of red galaxies with an old stellar population, revealing features such as the Ca II H & K lines, the G-band at 4300 Å, the Balmer absorption lines and the break at 4000 Å. The Mg lines are also clearly visible in both spectra.
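The stacking procedure can be sketched as follows: shift each spectrum to the rest frame, normalize by the mean flux in a featureless window, interpolate onto a common 1 Å grid and average. The normalization window passed as `norm_window` is a placeholder, since the exact wavelength range is not reproduced here, and the function name is hypothetical.

```python
import numpy as np

def stack_spectra(waves_obs, fluxes, redshifts, grid, norm_window):
    """Average rest-frame spectrum on a common wavelength grid."""
    lo, hi = norm_window
    resampled = []
    for wave, flux, z in zip(waves_obs, fluxes, redshifts):
        wave_rest = wave / (1.0 + z)                 # shift to rest frame
        in_window = (wave_rest >= lo) & (wave_rest <= hi)
        norm_flux = flux / np.mean(flux[in_window])  # normalize
        # Pixels outside the observed range are flagged with NaN.
        resampled.append(np.interp(grid, wave_rest, norm_flux,
                                   left=np.nan, right=np.nan))
    # Average over spectra, ignoring pixels a spectrum does not cover.
    return np.nanmean(np.array(resampled), axis=0)

wave = np.linspace(3650.0, 10400.0, 2000)
grid = np.arange(3000.0, 5000.0, 1.0)   # 1 Å linearly spaced grid
stack = stack_spectra([wave, wave],
                      [np.full(2000, 2.0), np.full(2000, 5.0)],
                      redshifts=[0.5, 0.6], grid=grid,
                      norm_window=(4100.0, 4700.0))  # hypothetical window
```

Running the same function on the SDSS and Uspec spectra keeps the normalization and interpolation steps identical for both samples.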

Overall, the agreement between the two average spectra is good. Nonetheless, it is visible that the Uspec average spectrum shows a somewhat flatter shape and a stronger emission in [O II] and [O III], an indication of an overall bluer population than the SDSS sample. This difference in the overall population can be explained by looking at the difference in the color-color space between the two populations. Figure 6 shows the comparison between SDSS and Uspec galaxies in the vs. color-color space. The Figure shows that the centroids of the two distributions have a small offset. This is due to the differences in magnitude which are also visible in Figure 1, which become especially relevant for the band. Figure 6 shows that the real data galaxy population has a tail in the distribution towards redder colors in both the and the planes. This explains the difference in the overall shape of the stacked spectra and the stronger emission in [O II] for the simulated galaxies with respect to the SDSS ones, and also shows the ability of [1] and Uspec to generate realistic spectra given some input properties. This difference in the spectra can be used as a diagnostic to fix the differences in the input galaxy properties in [1]. This can be achieved in future work with an ABC (Approximate Bayesian Computation) optimization of the model parameters, as discussed in Section 6. These differences, however, highlight the need of further quantifying the agreement of the two populations studied, which we now turn to.

Figure 6: The vs. color-color diagram, comparing SDSS and simulated galaxies.

5.2 Principal Component Analysis (PCA)

As discussed above, we need to quantify the differences between the two populations of real and simulated galaxies. Galaxy spectra contain a large amount of information, as each galaxy spectrum is described by 3469 data points. A useful approach to this problem is therefore to reduce its dimensionality.

The Karhunen-Loéve transform, also commonly called Principal Component Analysis (PCA), is a technique which is widely used to reduce the dimensionality of big data sets [71]. Its application in astronomy has been exploited in details (see e.g., [72, 73, 74]). Applying PCA to spectroscopy basically consists in representing the spectra as a lower dimensional set of eigenspectra [75]. The eigenspectra are obtained by finding a matrix such that


where is the diagonal matrix containing the eigenvalues of the correlation matrix constructed from the spectra. No weights are included in this analysis. We solve this problem through Singular Value Decomposition (SVD). In both real and simulated spectra, we mask the regions where strong sky lines are expected, namely at 5578.5 Å, 5894.6 Å, 6301.7 Å, 6364.5 Å, 7246 Å, with a FWHM of 15 Å for each sky line. See Appendix B for a discussion on how PCA performs without masking those lines. The first 200 Å in the blue and about 2500 Å in the infrared regions of the spectra have also been excluded from the analysis as they are severely dominated by residual sky features. The same analysis is applied to both the SDSS spectral sample and the Uspec simulated spectra. Each galaxy spectrum can be constructed as follows:


where a (b) are the expansion coefficients, or eigencoefficients (see Section 5.2.2 below) and () are the eigenspectra for the data (simulations).
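The decomposition described above can be sketched with NumPy as follows. This is only an illustrative sketch, not the Uspec implementation: the function names, the unit-norm scaling of each spectrum, and the choice not to subtract the mean spectrum are assumptions made for the example.

```python
import numpy as np


def pca_eigenspectra(spectra, n_components=5, keep=None):
    """Leading eigenspectra of a set of spectra, obtained via SVD.

    spectra : (n_galaxies, n_pixels) array on a common observed-frame grid.
    keep    : optional boolean (n_pixels,) array, False for masked pixels.
    """
    X = np.asarray(spectra, dtype=float)
    if keep is not None:
        X = X[:, keep]  # drop masked pixels (e.g. strong sky lines)
    # Assumption: every spectrum is scaled to unit norm, no weights applied.
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # Rows of vt are orthonormal eigenvectors of X^T X (no mean subtraction,
    # which is an assumption about the convention); s**2 are its eigenvalues.
    _, s, vt = np.linalg.svd(X, full_matrices=False)
    return vt[:n_components], s[:n_components] ** 2, X


def reconstruct(X, eigenspectra):
    """Approximate each normalized spectrum as a sum of eigenspectra."""
    coeffs = X @ eigenspectra.T  # expansion coefficients
    return coeffs @ eigenspectra, coeffs
```

Truncating to five components mirrors the choice made in the text. The reconstruction is exact only if the spectra lie in the span of the retained eigenspectra.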

The comparison of the first five principal components for both samples is shown in Figure 7. The grey areas show the masked regions where strong sky lines are expected. Overplotted are the and bands from SDSS. The PCA conducted independently for the data and the simulations shows good agreement in the first four components. The choice of using five components is motivated by the initial construction of the Uspec spectra, which are built from the five templates (Section 4). As can be seen in the Figure, the first two components capture most of the physical information coming from the spectra. Even if individual features are not visible, as the spectra are analyzed in the observed frame, the overall shape of red galaxy spectra is clearly visible in the first two PCA components. The three higher PCA components all show similar patterns, for both the SDSS and the Uspec sample. In those, the characteristic bump at Å, which is the position of the 4000 Å break shifted by the median redshifts of the two populations, is clearly visible, together with other redshifted features such as [O II] emission, the G-band and the H absorption, all broadened due to the variety of the redshifts of the two samples.

Figure 7: Principal components comparison between SDSS and Uspec galaxies. The first component captures most of the signal coming from the spectra. The grey areas show the masked regions where strong sky lines are expected. The SDSS filter curves for are overplotted.

5.2.1 Mixing matrix

By definition, solving the eigenvalue problem of Equation 3 means that the basis sets and are two orthonormal bases. This means that we can define a mixing matrix M such that


so that, if , reduces to


In other words, in the ideal case, i.e., if real and simulated data were described by the same basis set, the mixing matrix would be the identity matrix. Figure 8 shows a graphical representation of the mixing matrix between the real and the simulated spectra presented above. In the Figure, the numbers in the boxes show the scalar products between the different components. As expected, the first components show better agreement than higher order components, with a significant drop for the fifth component. The off-diagonal elements of the matrix are significantly smaller than the diagonal ones up to the fourth component. A distance metric can be defined in order to assess how far the mixing matrix is from being diagonal. We choose to use the ratio between the product of the diagonal elements of matrix M and its determinant, . For a diagonal matrix, this ratio equals one, since the determinant of a diagonal matrix is the product of its diagonal elements. In our case


which shows how similar the two basis sets independently used to describe real and simulated spectra are. In Appendix A, we describe how the mixing matrix changes when we do not match the redshift distributions of the two populations. In particular, it is worth noting how varies for a difference in the median redshift between the two populations of , going from 0.872 to 0.827. In Appendix B, we show how changes when the strong sky emission lines, which then dominate all principal components except for the first one, are not masked. The effect of sky lines should be taken into account as they might strongly influence the distance metric introduced here.
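As an illustration of the mixing matrix and the distance metric described in this section, a minimal NumPy sketch is given below. The function names are hypothetical, and the eigenspectra are assumed to be stored as rows of two arrays defined on the same wavelength grid.

```python
import numpy as np


def mixing_matrix(e_data, e_sim):
    """M[i, j] is the scalar product of the i-th data eigenspectrum
    with the j-th simulated eigenspectrum (rows are eigenspectra)."""
    return np.asarray(e_data, dtype=float) @ np.asarray(e_sim, dtype=float).T


def diagonality_ratio(M):
    """Product of the diagonal elements of M divided by its determinant.

    Equals 1 for a diagonal matrix with non-zero diagonal entries.
    """
    M = np.asarray(M, dtype=float)
    return np.prod(np.diag(M)) / np.linalg.det(M)
```

One convenient property of this ratio is that it does not depend on the arbitrary sign of each eigenspectrum returned by an SVD: flipping the sign of one eigenspectrum changes the sign of both the corresponding diagonal element and the determinant, so the ratio is unchanged.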

Figure 8: Graphical representation of the mixing matrix M between the principal components of real and simulated spectra.

5.2.2 Eigencoefficients

We can determine the relative contribution of each eigenspectrum to the observed spectrum by calculating the respective eigencoefficients, i.e., and from Equations 4 and 5. Those are simply the scalar products of the eigenspectra with the normalized spectra. Figures 9 and 10 show the distributions of the five eigencoefficients. Figure 9 shows the real spectra from SDSS (green) and the simulated Uspec spectra (red) projected onto SDSS principal components (i.e., onto the basis set from Equation 4). The coefficient distributions for the real and the simulated spectra overlap. Also, all the coefficients are correlated with each other. However, the Uspec eigencoefficients occupy a larger region of parameter space in all five coefficients. This is even more evident in Figure 10, where the spectra are projected onto Uspec principal components (i.e., onto the basis set from Equation 5). The coefficient distributions still overlap and correlations between different coefficients are visible. However, the regions occupied by Uspec spectra are clearly larger than the SDSS ones. This is due to the higher signal-to-noise ratio of the Uspec simulated spectra, which can also be seen in Figure 7. This higher signal-to-noise ratio makes the eigencoefficients sensitive to a wider variety of properties in the simulated galaxy population. This effect can be accounted for by matching the noise properties of the SDSS spectra. Furthermore, the differences in the eigencoefficients can also be used as distance measures to check the correctness of the inputs of the simulations. This is particularly evident when comparing the projected coefficients described here with those shown in Appendix A. In an upcoming publication, we will use the eigencoefficients defined here to better constrain the redshift distribution presented in [1], in addition to the distance measures already outlined in [1].
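The projection of both samples onto a common basis can be sketched as follows. This is a hedged example: the function names are hypothetical, each spectrum is assumed to be scaled to unit norm before projection, and the per-coefficient standard deviation is used here only as one simple way to compare the widths of the two distributions, not as the measure adopted in the paper.

```python
import numpy as np


def eigencoefficients(spectra, eigenspectra):
    """Scalar products of unit-normalized spectra with a set of eigenspectra.

    spectra      : (n_galaxies, n_pixels) array.
    eigenspectra : (n_components, n_pixels) array with orthonormal rows.
    """
    X = np.asarray(spectra, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)  # assumed normalization
    return X @ np.asarray(eigenspectra, dtype=float).T


def coefficient_spread(coeffs):
    """Standard deviation of each eigencoefficient across the sample."""
    return np.std(np.asarray(coeffs, dtype=float), axis=0)
```

Calling `eigencoefficients` on the real and on the simulated spectra with the same `eigenspectra` array gives the two sets of points compared in each panel; swapping the basis gives the complementary projection.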

Figure 9: Eigencoefficients resulting from projecting real spectra from SDSS (green) and simulated Uspec spectra (red) onto SDSS principal components.
Figure 10: Eigencoefficients resulting from projecting real spectra from SDSS (green) and simulated Uspec spectra (red) onto Uspec principal components.

6 Conclusions

In this paper we describe Uspec, a tool to simulate galaxy spectra for cosmological surveys. Uspec builds galaxy spectra starting from a linear combination of templates and coefficients. The coefficients, together with magnitudes and spectroscopic redshifts, are drawn from luminosity functions which evolve with redshift.

To compare our simulations to real data, we considered LRG samples using redshift and color cuts as in the CMASS sample from SDSS/BOSS. We applied these cuts to both real and simulated galaxies. We then modeled the noise and instrumental properties and included them in the simulated spectra. In particular, we modeled the read-out noise, the shot noise and the instrumental and atmospheric throughput. We also constructed and included in Uspec a night-sky background model spectrum with the same characteristics as those observed at the APO in New Mexico. The LSF for the line broadening was derived from sky spectra observed in the BOSS survey.

We compared the average spectra of the real and simulated galaxy populations, finding good agreement. A small residual difference between the average spectra can be seen and ascribed to the intrinsic properties of the simulated galaxy population. This can be reduced by adjusting the model parameters in the future.

To further quantify the level of agreement between the real and the simulated galaxy samples, we also performed a PCA. We find remarkably good agreement between the two populations for the first four principal components, and moderate agreement for the fifth component. The comparison of the eigencoefficients of the two galaxy populations also shows that we are able to reproduce the variety of properties of LRGs in the SDSS survey. The distributions of the eigencoefficients obtained by projecting real SDSS spectra and simulated Uspec spectra overlap, both when projecting onto SDSS principal components and when projecting onto Uspec principal components. However, the Uspec galaxy distributions are somewhat broader than those of the data. This can be explained by the higher signal-to-noise ratio of the simulated galaxies, which results in a wider variety of projected coefficients.

We define a distance measure as the ratio between the product of the diagonal elements of the mixing matrix M of the PCAs and its determinant. The mixing matrix is expected to be the identity matrix if real and simulated data have the same principal components. We find a value of 0.872. In the course of our analysis, we match the redshift distribution of real and simulated data. It is worth noting however that, if we have a difference in the median redshifts between the two galaxy samples of , then the value of decreases to 0.827. Also, the eigencoefficient distributions change when introducing such a difference. This is an interesting result as it indicates that the distance measure , as well as the eigencoefficient distributions, are sensitive to the parameters that control the redshift evolution of the input luminosity function. For a detailed discussion on this aspect, see Appendix A.

The results presented here are promising and offer good prospects for applying our method to large upcoming spectroscopic surveys such as DESI. In future work, we plan to incorporate Uspec into a full ABC (Approximate Bayesian Computation) framework, to better match the intrinsic and noise properties of our simulated galaxies to those of real data. Furthermore, we will seek to simulate the population of blue star-forming galaxies, to also test the population of emitters, which offers new and different insights into the study of the clustering of different galaxy populations.


MF would like to thank Luca Tortorelli for useful discussions on galaxy properties. This research made use of IPython, NumPy, SciPy, and Matplotlib. We acknowledge support by SNF grant .


Appendix A Sensitivity analysis for the redshift distribution

This paper is aimed at presenting Uspec and its capability to simulate realistic galaxy spectra. The PCA quantifies the differences between the real and simulated spectra. Here we describe how the PCA can also be used as an additional constraint on the input redshift distribution of [1].

Figures 11 and 12 show the PCA results when keeping all the galaxies that pass the selection criteria described in Section 3.3, i.e., not matching the redshift distributions. This brings the total number of galaxies analyzed to 2126. The left panel of Figure 11 shows the comparison between the five principal components of real and simulated spectra. The first two components are only sensitive to the overall shape of the spectra, and appear to be almost unchanged with respect to those in Figure 7. However, the position of the redshifted 4000 Å break is visibly changed for the Uspec galaxies. This is especially evident in the first component. The effect of the redshift difference (as reported in the main text in Section 3.4.2, ) is more evident in the higher order components, where the spectral features for the Uspec galaxies are shifted towards bluer wavelengths, where the median of the Uspec is centered. The difference is reflected in the mixing matrix M, shown in the right panel of Figure 11. If we evaluate the distance metric based on the mixing matrix, we find:


From the comparison with the evaluation of the mixing matrix in the main text (), where the of real and simulated data are matched, it is clear how this can be used as a distance metric to constrain the input .

Also, Figure 12 shows the eigencoefficients projected both onto SDSS principal components (left panel) and onto Uspec principal components (right panel). The distributions of higher order Uspec coefficients are clearly different from those of SDSS, and also from those shown in the main text (Figures 9 and 10). This is an indication of how the coefficient distributions can be used as a distance metric in refining the input of simulated galaxies.

(a) Mixing matrix for unmatched redshift distributions (i.e., keeping all the galaxies as shown in Figure 2). Following formulas in Section 5.2.1, we find:
Figure 11: PCA comparison and mixing matrix for unmatched redshift distributions. The positions of broad features appear shifted when a difference in the median redshift distributions is introduced. This effect is especially evident in the last three principal components.
Figure 12: Eigencoefficients projected onto SDSS principal components (left), and onto Uspec principal components (right). The eigencoefficients can be used as distance measures to constrain the redshift distribution .

Appendix B Impact of the sky model

As discussed in Section 4.2, the background sky is an important source of systematics in any astronomical observation. Since the strength of sky lines cannot be predicted, they can also have important effects on our PCA. However, the positions of the strong sky lines are easily predictable, as are their widths given the instrumental resolution R. Throughout our analysis, we masked the regions where the strongest sky lines at 5578.5 Å, 5894.6 Å, 6301.7 Å, 6364.5 Å, 7246 Å are expected, with a FWHM of 15 Å for each sky line. Figure 13 shows what the PCA looks like when these sky lines are not masked. The redshift distributions are here matched. The first components appear mostly unchanged, except for the presence of the [O I] sky line at 5578.5 Å. This sky line completely dominates the higher order principal components. This is also reflected in the mixing matrix shown in the right panel of Figure 13. In this case, . As sky lines are not redshifted, they are not washed out like intrinsic galaxy emission lines when performing the PCA in the observed frame. This shows the impact of strong sky emission lines on the real spectra and the importance of properly masking them.
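A mask of the kind described above can be built with a few lines of NumPy. The line centers are those quoted in the text; the function name and the interpretation of the 15 Å FWHM as the full width of the excluded window around each line are assumptions made for this sketch.

```python
import numpy as np

# Observed-frame centers of the strong sky lines quoted in the text, in Å.
SKY_LINES_ANGSTROM = (5578.5, 5894.6, 6301.7, 6364.5, 7246.0)


def sky_line_mask(wavelength, lines=SKY_LINES_ANGSTROM, width=15.0):
    """Boolean array that is True for pixels kept in the analysis.

    Assumption: a window of total width `width` (in Å), centered on each
    sky line, is excluded.
    """
    wavelength = np.asarray(wavelength, dtype=float)
    keep = np.ones(wavelength.shape, dtype=bool)
    for center in lines:
        # Pixels closer than half the window width to a line are dropped.
        keep &= np.abs(wavelength - center) > width / 2.0
    return keep
```

Because sky lines sit at fixed observed wavelengths, the same mask can be applied to every real and simulated spectrum before the decomposition.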

(a) Mixing matrix when strong sky lines are not masked. The redshift distributions are here matched. Following formulas in Section 5.2.1, we find:
Figure 13: PCA comparison and mixing matrix when strong sky lines are not masked. While the first principal component is not sensitive to sky lines, the other principal components appear to be completely dominated by those for the SDSS population.