Accurate parameter estimation for star formation history in galaxies using SDSS spectra
To further our knowledge of the complex physical process of galaxy formation, it is essential that we characterize the formation and evolution of large databases of galaxies. The spectral synthesis STARLIGHT code of Cid Fernandes et al. (2004) was designed for this purpose. Results of STARLIGHT are highly dependent on the choice of input basis of simple stellar population (SSP) spectra. Speed of the code, which uses random walks through the parameter space, scales as the square of the number of basis spectra, making it computationally necessary to choose a small number of SSPs that are coarsely sampled in age and metallicity. In this paper, we develop methods based on diffusion map (Lafon & Lee, 2006) that, for the first time, choose appropriate bases of prototype SSP spectra from a large set of SSP spectra designed to approximate the continuous grid of age and metallicity of SSPs of which galaxies are truly composed. We show that our techniques achieve better accuracy of physical parameter estimation for simulated galaxies. Specifically, we show that our methods significantly decrease the age-metallicity degeneracy that is common in galaxy population synthesis methods. We analyze a sample of 3046 galaxies in SDSS DR6 and compare the parameter estimates obtained from different basis choices.
keywords:methods: data analysis – methods: statistical – methods: numerical – galaxies: evolution – galaxies: formation – galaxies: stellar content
Determining the physical parameters of large samples of galaxies is crucial to constrain our knowledge of the complicated physical processes that govern the formation and evolution of these systems. Ultimately, we seek methods that quickly and effectively map observed data from galaxies to sets of parameters that describe the star formation and chemical histories of these galaxies. Due to the complexity of the mechanisms of galaxy evolution, it is critical that we adopt a model that uses few assumptions about how galaxies evolve. Moreover, full utilization of the available theoretical stellar models, avoiding ad hoc simplifications, is critical to obtaining accurate parameter estimates.
There has been a large amount of work dedicated to estimating the physical parameters of databases of galaxies using different types of data. For example, spectral indices (Worthey 1994; Kauffmann et al. 2003; Tremonti et al. 2004; Gallazzi et al. 2005; Chen et al. 2009), emission features (Mas-Hesse & Kunth 1991; Leitherer et al. 1995), and full high-resolution, broad-band spectra (Reichardt, Jimenez & Heavens 2001; Glazebrook et al. 2003; Panter, Heavens & Jimenez 2003; Cid Fernandes et al. 2004, 2005; Mathis, Charlot & Brinshmann 2006; Ocvirk et al. 2006; Asari et al. 2007) have all been utilized to estimate the star formation histories (SFHs) and metallicities of galaxies. Recently, due to the abundance of high-quality, homogeneous databases of galaxy spectra such as the Sloan Digital Sky Survey (SDSS, York et al. 2000; Strauss et al. 2002), which provides high-resolution spectra for hundreds of thousands of galaxies, the approach of full high-resolution spectral fitting is now possible to infer the properties of large populations of galaxies. Recent work has shown promise in the application of these methods to SDSS galaxy spectra (see Panter et al. 2007 and Asari et al. 2007 for reviews of results achieved by two such fitting methods). However, large-scale analyses, while possible, are not necessarily accurate, and can be computationally challenging. There is much room for improvement in accuracy of SFH parameter estimation. It is critical that we understand the shortcomings of the current models and improve their accuracy. At the same time, to achieve our goal of obtaining accurate SFH estimates for a large sample of galaxies, we cannot sacrifice the computational efficiency of current methods.
A common technique in the literature, called empirical population synthesis, is to model galaxies as mixtures of simple stellar populations (SSPs) with known physical parameters. Recent studies that have used this method are, e.g. Bica (1988), Pelat (1997), Cid Fernandes et al. (2001), and Moultaka et al. (2004). Under this model, galaxy data are treated as linear combinations of the observed properties of SSPs. Historically, SSP parameters and observables were derived from observations of well-understood stellar systems. More recent studies have instead used model-produced SSPs, such as those from the evolutionary population synthesis models of Bruzual & Charlot (2003). The STARLIGHT spectral fitting code, introduced by Cid Fernandes et al. (2004), fits observed spectra with linear combinations of SSPs from the models of Bruzual & Charlot (2003). In Cid Fernandes et al. (2005), it was shown that SFH parameters of simulated galaxy spectra could be recovered by STARLIGHT in the absence of noise. However, their simulated spectra were generated and fit with the same basis of 45 SSPs, rendering the results difficult to generalize to the expected performance on real galaxies, which can be composed of SSPs on infinitely fine grids of age and metallicity.
Results of the STARLIGHT fitting code are highly dependent on the choice of basis of SSP spectra. The database of SSPs available to us could theoretically be infinitely large, encompassing all combinations of age, metallicity, initial mass function, evolutionary prescription, etc. Including too many SSPs in our basis causes the code to be prohibitively slow and the solution to be degenerate due to the inclusion of SSPs with almost identical spectra. If, however, we simply hand-select a few SSPs on a coarse grid of age and metallicity with which to model our data, then we effectively ignore large subsets of our SSP parameter space, leading to suboptimal estimates for fits to databases of galaxies. Using a hand-selected basis of SSPs may also lead to the inclusion of multiple SSPs whose spectra are essentially identical, which can lead to degeneracies.
There is much room for improvement in the accuracy of SFH estimates by exploring numerical methods of selecting small bases of SSPs. In this paper, we propose a method of choosing a small set of prototype spectra from a large database of SSP spectra based on their observable quantities. Our method attempts to capture much of the variability of a large set of SSPs in a few numerically chosen prototype SSPs. The method is based on the diffusion map of Coifman & Lafon (2006) and Lafon & Lee (2006), a non-linear technique that seeks a simple, natural representation of data that are complex and high-dimensional, such as high-resolution astronomical spectra.
We utilize the diffusion -means method to numerically find a basis of prototype SSPs with which we can fit galaxies to accurately estimate their SFHs. In a future study, we will consider the problem of computational speed in methods such as STARLIGHT, which takes minutes to fit each SDSS galaxy spectrum for a basis of size 150, and show how to quickly extend the SFH estimates obtained by using the methods in this paper to a large database of galaxies by taking advantage of the geometry of the manifold on which the data lie.
The applicability of the methods introduced here extend beyond the problem of SFH estimation using high-resolution spectra. Any astronomical task where observed data (i.e. spectra, photometric data, spectral indices, etc.) are fit as linear combinations of observable data from simpler systems can benefit from intelligent numerical selection of prototypes.
The outline of the paper is as follows. In §2 we describe the STARLIGHT spectral synthesis code, discuss the drawbacks of SSP bases employed in the literature, and introduce the diffusion -means method of prototype selection. We directly compare the basis of prototypes selected by our method to those used in the literature and those found using other methods. In §3 we fit several sets of realistic simulated galaxy spectra using the STARLIGHT code with different bases to directly compare the ability of each basis to accurately estimate galaxy parameters. In §4 we fit a set of SDSS galaxy spectra and analyse physical parameter estimates obtained by different bases. We conclude with some remarks and discussion of future directions in §5.
2.1 Spectral synthesis
Model and STARLIGHT fitting algorithm
We adopt the model of galaxy spectra as a mixture of simple systems introduced in Cid Fernandes et al. (2004):
where is the model flux in wavelength bin . The component is the th basis spectrum normalized at wavelength . The basis of SSP spectra is chosen before performing the analysis. The scalar is the proportion of flux contribution of the th component at , where ; the vector with components is the population vector of a galaxy. These flux fractions can be converted to mass fractions, , using the light-to-mass ratios of each basis spectrum at . To describe the SFH of a galaxy, we will use time-binned versions of and , denoted and , respectively. Using binned population vectors lets us avoid degeneracies that can occur in the estimation of the original population vectors.
Other components of the model in (1) are the scaling parameter , the reddening term, , and the Gaussian convolution, . The reddening term describes distortion in the observed spectrum due to foreground dust, and is modeled by the extinction law of Cardelli, Clayton & Mathis (1989) with . The Gaussian convolution, , accounts for movement of stars within the observed galaxy with respect to our line-of-sight, and is parametrized by a central velocity and dispersion .
To fit individual galaxies, we use the STARLIGHT fitting routine introduced by Cid Fernandes et al. (2005). The code uses a Metropolis algorithm plus simulated annealing to find the minimum of
where is the observed flux in the wavelength bin , is the model flux in (1), and is the inverse of the noise in the measurement . The summation is over the wavelength bins in the observed spectrum. The minimization of (2) is performed over parameters: , and , where is the number of basis spectra in . The speed of the algorithm scales as . For the analysis of, e.g. galaxy spectra in the SDSS database, it is crucial to pick a basis with a small number of spectra.
In, for example, Cid Fernandes et al. (2004) and Cid Fernandes et al. (2005), the STARLIGHT model (1) and fitting algorithm are tested using simulations. The simulated spectra are generated from the model in (1) using SSPs from Tremonti (2003) and SSPs from Bruzual & Charlot (2003), respectively. These authors fit the simulations using the same basis of SSPs that was used to generate the simulations. From their analyses, they conclude that in the absence of noise, the algorithm accurately recovers the input parameters. In the presence of noise, the population vector is not recovered, and it is advised that a time-binned description of the SFH be adopted. Although their use of the same basis for both generating and fitting simulated spectra is an appropriate test that the algorithm works, it is not a fair assessment of the expected performance of the methods for a database of real galaxy spectra. In §2.1.2 we discuss our concerns with their simulation assessments. In §2.2 we present a new suite of methods that we test with several sets of simulated galaxy spectra. These simulations, which are described in §3, are designed to be more representative of a large database of galaxy spectra, such as SDSS.
Stellar population models
In reality, the population of all observed galaxies is not constrained to be mixtures of a small number of SSPs on a discrete grid of age and metallicity. A more physically accurate representation of galaxies is as mixtures of an infinitely large basis of SSPs on a continuous grid of age and metallicity and, depending on the complexity of the underlying physics, a possibly infinite grid of prescriptions for initial mass function and evolutionary track. In our simulations in §3 we simulate galaxies from a large database of SSPs on a fine grid of age and metallicity to determine whether we can accurately describe their SFHs, metallicities, and kinematic parameters using an appropriately chosen, computationally tractable basis of or 150 spectra.
We adopt a SSP database containing 1278 spectra. This set of spectra is used to both choose an appropriate basis (§2.2) and generate simulated spectra (§3). The database is meant to represent the large population of SSPs from which observed galaxies can be composed. We use the spectra from the models of Bruzual & Charlot (2003), computed for a Chabrier (2003) IMF, ‘Padova 1994’ evolutionary tracks (Alongi et al. 1993; Bressan et al. 1993; Fagotto et al. 1994a, b; Girardi et al. 1996), and STELIB library (Le Borgne et al. 2003). Theoretically, we could also vary the IMF and evolutionary tracks across spectra and use the same methods developed in §2.2, but for simplicity decide to fix them in this study. The SSPs in our database are generated on a grid of 213 approximately evenly log-spaced time bins from 0 to 18 Gyrs and 6 different metallicities: = 0.0001, 0.0004, 0.004, 0.008, 0.02 and 0.05, where is the fraction of the mass of a star composed of metals (). This grid is designed to approximate a continuous variation across age and . The coarseness in the current grid is due to limitations in the current stellar population models. Again, a finer grid of age and can be implemented with the techniques that we develop in §2.2.
2.2 Choosing an appropriate basis of prototype spectra
In previous studies using the STARLIGHT code, researchers hand-selected sets of SSP spectra to use in the model fitting. According to Cid Fernandes et al. (2004), “the elements of the base should span the range of spectral properties observed in the sample galaxies and provide enough resolution in age and metallicity to address the desired scientific questions.” In this same paper, they claim that their basis of SSPs adequately recovers parameters of age and metallicity in their simulations. However, as discussed in §2.1, these simulations are unrealistic since they employ the same basis to both generate and fit the simulated spectra.
In this section we present methods to numerically determine a basis of a small number of prototype spectra, , from a large set of SSP spectra. The resultant basis of prototype spectra will be designed to capture a large proportion of the variation of the set of SSPs, which is chosen to span the range of observed spectral properties of a data set of galaxies at a high resolution in age and metallicity, such as the database of 1278 SSPs described in §2.1.2. The numerically chosen basis of prototype spectra, , should recover physical parameters better than any hand-chosen basis of the same size because hand-selected bases ignore subsets of the parameter space and may cause the solution to be degenerate by including multiple SSPs with nearly identical spectra. In §3 we show that the bases computed from our numerical methods outperform hand-selected bases used in the literature in parameter estimation for simulated galaxies.
Diffusion map is a popular new technique for data parametrization and dimensionality reduction that has been developed by Coifman & Lafon (2006) and Lafon & Lee (2006), and was recently used by Richards et al. (2009) to analyse SDSS spectra and Freeman et al. (2009) to estimate photometric redshifts. As shown in these latter works, diffusion map has the powerful ability to uncover simple structure in complicated, high-dimensional data. These analyses demonstrate that this simple structure can be used to make precise statistical inferences about the data.
In the problem at hand, we begin with a set of high-resolution broad-band SSP spectra , where can be very large and each has flux measurements in wavelength bins. From this set, our goal is to derive a small basis of prototype spectra, , each of length , that captures most of the variability in . In what follows, we briefly review the basics of the diffusion map technique and then show how it can be used to find an appropriate basis .
The main idea of the diffusion map is that it finds a parametrization of a data set that preserves the connectivity of data points in a way that is dependent only on the relationship of each datum to points in a local neighborhood. By preserving the local interactions in a data set, diffusion map is able to learn, e.g. a natural parametrization of the of the spiral in Fig. 1 of Richards et al. (2009) where other methods, such as principal components analysis, fail.
Starting with our set of SSP spectra, , in units of /Å per initial , we normalize the spectra at to ensure that we are not confused by absolute scale differences between individual spectra. We retain the scaling factors for later use to convert back to the original units. Next, we select a discrepancy measure between SSPs, i.e.
which is the Euclidean distance between two SSP spectra. The specific choice of is often not crucial to the diffusion map; any reasonable measure of discrepancy will return similar results. It is important to note here the flexibility of the diffusion map method. For each problem, we can define a different discrepancy measure based on the type of data available to us and the types of information we want to incorporate. For example, we could define our measure by weighting more heavily those regions around absorption lines which are highly correlated with stellar age and metallicity. Contrast this flexibility with the rigidity of principal components analysis, which is only based on the correlation structure of the data.
From our discrepancy measure in (3), we construct a weighted graph on our data set , where the nodes are the individual SSP spectra and the weights on the edges of the graph reflect the amount of similarity between pairwise spectra. We define the weights as
where should be small enough such that unless and are similar, but large enough so that the entire graph is connected (i.e. every SSP has at least one non-zero weight to another SSP). In practice, we choose that produces a basis that achieves the best fits to galaxy spectra (see §3). By normalizing the rows of the weight matrix, , to sum to unity, we can define a Markov random walk on the graph where the probability of going from to in one step is . We store all of these one-step probabilities in a matrix .
By the theory of Markov chains, the -step transition probabilities are defined by . In the previous analyses in Richards et al. (2009) and Freeman et al. (2009), the authors fixed . In these previous papers, the choice of was unimportant because the task was linear regression on the diffusion coordinates, in which the multipliers of the diffusion map were absorbed into the linear regression coefficients . In the present work, our task is to determine a set of prototype spectra. The tool we use is -means clustering in diffusion space, where results are dependent upon intra-point Euclidean distances of the diffusion map, which in turn are dependent on .
In this paper we introduce a diffusion operator that considers all scales simultaneously. We define the multi-scale diffusion map as
where and are the right eigenvectors and eigenvalues of , respectively, in a bi-orthogonal decomposition. Euclidean distance in the -dimensional space described by equation (5) approximates the multi-scale diffusion distance, a distance measure that utilizes the geometry of the data set by simultaneously considering all possible paths between any two data points at all scales , in the Markov random walk constructed above. An advantage to using the multi-scale diffusion map, besides elimination of the tuning parameter , is that it gives a more robust description of the structure of the data by considering the propagation of local interactions of data points through all scales.
We also have three other tuning parameters: , (dimensionality of the diffusion map), and (number of prototypes) over which to optimize the fits to galaxy spectra. For simplicity, we fix by choosing the dimensionality of the diffusion map to coincide with a 95% drop-off in the eigenvalue multipliers in the multi-scale diffusion map. This is reasonable because most of the information about the structure of the data will be in the few dimensions with large eigenvalue multipliers. Simulations show that this choice of produces results comparable to optimizing over a large grid of . For the parameter , fits should improve as the basis gets larger. However, we want to keep small both for computational reasons and to avoid the occurrence of nearly identical spectra in our basis, which can cause degeneracies in our fits. In practice, we choose =45 or 150 to directly compare our results with the bases used by Cid Fernandes et al. (2005) (CF05) and Asari et al. (2007) (Asa07). This leaves us with only one tuning parameter, , over which to optimize the diffusion -means basis in fits to galaxy spectra.
Finally, we determine a set of prototype spectra by performing -means clustering in the -dimensional diffusion map representation (5) of the SSP spectra, . The -means algorithm is a standard machine learning method used to cluster data points into groups. It works by minimizing the total intra-cluster variance,
where is the set of points in the th cluster and is the -dimensional geometric centroid of the set . As described in §3.2 of Lafon & Lee (2006), for diffusion maps
where is the trivial left eigenvector of . The -means algorithm begins with an initial partition of the data and alternately computes cluster centroids and reallocates points to the cluster with the nearest centroid. The algorithm stops when no points are allocated differently in two consecutive iterations. The final centroids define the prototypes to be used in our basis.
In Fig. 1, we plot the mapping of 1278 SSP spectra into the dimensional diffusion space described by (5). The large black dots denote the -means centroids for . Individual SSPs are coloured by cluster membership. The prototypes capture the variability of the SSP spectra along a low-dimensional manifold in diffusion space. Note that the density of prototypes varies along different parts of the manifold. This is due both to the local complexity of the manifold and the sampling of the original base of SSPs. This latter effect, that we tend to obtain more prototypes in areas where there are more SSPs in our basis, can easily be corrected for by using a weighted -means method, described in §2.2.3, that accounts for varying density of SSPs along the manifold.
The last step of the diffusion -means algorithm is to determine the protospectra and their properties. First, the normalization constant for the th prototype spectrum is
and the th protospectrum is
Similarly, the age, metallicity, and “stellar-mass fraction”—the percent of the initial stellar mass still in stars—of prototype are naturally defined as
where , and are the age, metallicity and stellar-mass fraction of the th SSP, respectively. Equations (10)-(12) are natural definitions of the prototype parameters in the sense that the fit of this basis to a galaxy would yield exactly the same parameter estimates as the equivalent mixture of the original set of SSPs, .
In Fig. 2 we plot the 45 SSP spectra in CF05 and the 45 prototype spectra found by diffusion -means. All spectra are normalized to 1 at =4020Å and are coloured by log age. The diffusion -means prototypes spread themselves evenly over the range of spectral profiles, capturing a gradual trend from young to old spectra. On the other hand, the CF05 basis includes many similar spectra from younger populations and sparsely covers the range of spectral profiles of older populations.
See Algorithm 1 for a summary of the diffusion -means algorithm. Sample Matlab and R code is available on the web at http://www.stat.cmu.edu/~annlee/software.htm. The diffusionMap R package is available at http://cran.r-project.org/.
Other -means algorithms
In §3, we compare the fits obtained by the diffusion -means basis to those found by two other numerical methods: principal components (PC) -means and standard -means. Each of these two methods is similar to diffusion -means, but both have critical drawbacks.
PC -means works similarly to diffusion -means (Algorithm 1) except that -means is performed on the projection of the normalized SSP spectra into principal components (PC) space, not diffusion space. For an example of the application of principal components analysis to SSP spectra see Ronen, Aragón-Salamanca & Lahav (1999). The main drawback to PC -means is its assumption that the SSP spectra lie on a linear subspace of the original dimensional space. If the SSPs actually lie on a non-linear manifold, then the prototypes may poorly capture the intrinsic variation of the original SSPs because the non-linear structure will have been inappropriately collapsed on to a linear space by the principal components projection.
Standard -means is also similar to Algorithm 1, except that -means is performed in the original dimensional space; i.e., no reduction in dimensionality is done before running -means. There are two obvious drawbacks to this procedure. First, the algorithm generally is slow because distance computations are cumbersome in high dimensions and -means usually takes more iterations to converge. For comparison, the dimensionality of the spaces used by diffusion and PC -means are each , a factor of 100 smaller than . Second, and more importantly, appropriate prototype spectra are difficult to find by standard -means because Euclidean distances, used by -means to define clusters, are only physically meaningful over short distances. Diffusion -means avoids this problem by clustering in diffusion space, in which Euclidean distance approximates diffusion distance, a measure that has physical meaning on all scales. The result is that standard -means inappropriately relates SSPs that are not physically similar.
In Fig. 3, we plot versus for =150 prototypes in Asa07 and diffusion, standard, and PC -means. Notably, diffusion -means finds a much higher density of prototypes with high or high , reflecting the complicated manner in which SSPs with those properties vary with respect to and . At the other extreme, the Asa07 prototypes reside on a regular grid, and thus include many prototype spectra that are essentially identical and also exclude prototypes that have unique spectral properties. The standard and PC -means prototypes also estimate more high and high prototypes than a regular grid.
Incorporating other information
The methods introduced above use only the observable properties of SSPs to choose representative prototypes. It may be the case that we want to incorporate other a priori information that we have about the SSPs and their relationship with the galaxies we are fitting. For instance, we might know that a SSP with a particular age and metallicity is generally found in the types of galaxies we are trying to fit, and hence will want to include in our basis a prototype with characteristics closely matching those of this SSP. This information can easily be incorporated with the framework introduced above by defining an a priori weight, for each SSP in our database, where higher weights signify more importance of the SSP.
By modifying the definition of the diffusion map geometric centroid (7) to be
and altering the -means algorithm to minimize
instead of in (6), we choose a basis that reflects both the numerical observable properties of the SSPs and our a priori knowledge.
This weighted -means algorithm can be used to correct for the tendency of the prototype SSPs to depend on the specific set of SSPs, , chosen. If, for example, we define as the inverse of the local density of SSP , then the prototype SSPs in Fig. 1 would more uniformly cover the manifold of SSPs in diffusion space, and not tend to more heavily sample denser regions of the SSP manifold. In this manner, the prototype spectra will only depend on the spectral properties, and not the particular sampling of the set of SSPs.
We use simulations to assess the performance of the diffusion -means method for selection of a basis of prototype SSP spectra compared to bases derived by other -means methods and the bases of used by CF05 and used by Asa07. We analyse the performance of these methods on physically-motivated simulated spectra from two different prescriptions. All simulated galaxy spectra are generated as mixtures of a fine grid of 1278 SSP spectra from the model of Bruzual & Charlot (2003), as described in §2.1.2. We construct the simulated spectra with a fine grid of SSPs because we expect real galaxies to consist of sub-populations of stars on a continuous grid of age and . This approach differs significantly from the simulation prescriptions of, for example, Cid Fernandes et al. (2004) and Cid Fernandes et al. (2005) as discussed in §2.1.2.
Like Cid Fernandes et al. (2005), we simulate spectra over the range 3650-8000 Å. We bin all spectra to one measurement every 1 Å. For each simulated galaxy, we independently sample the reddening parameter, , and velocity dispersion, , from a realistic distribution based on fits to a large sample of SDSS galaxies by the STARLIGHT code that was performed by the SEAGal (Semi Empirical Analysis of Galaxies) Group and is available on the web at www.starlight.ufsc.br. We add Gaussian noise following the normalized mean SDSS error spectrum with specified S/N at Å. We mask out any spectral bins around emission lines, as done in Cid Fernandes et al. (2005).
To choose the optimal value of for the diffusion -means basis, we fit simulations to each basis from a grid of values and find the basis for which the fits achieve the smallest median value, (2). Although we choose the optimal by minimizing , our ultimate goal is accurate estimation of the SFH and physical parameters for each galaxy. However, we find that bases that achieve small values tend to have small errors in their physical parameter estimates.
To compare results across different bases, we compare the errors in physical parameter and SFH estimation. Our ultimate goal is to find the basis that most accurately describes the star formation and metallicity history of each galaxy. In this study, our main parameters of interest are
, and . To test the ability of different bases to estimate these parameters, we compute the mean square error (MSE) of the estimates across the simulated galaxies. To characterize the ability to recover SFH, we compute condensed population vectors by binning both and into 5 logarithmically spaced time bins and then comparing these true condensed vectors, and to the estimated condensed vectors, and . Other techniques have proposed different schemes for compressing the SFH information retrieved by spectral fitting (see Asari et al. 2007 for an overview of the different compression methods employed in current spectral synthesis codes). Our choice of 5 age bins is an attempt to avoid the degeneracies that are likely if the population vector were gridded into very fine age bins while still retaining a sufficient number of age bins to detect significant recent starburst events. Admittedly, the choice of 5 age bins is ad hoc. However, because we are only interested in comparing the ability of different bases of prototype spectra to recover SFH, any reasonable parametrization of the SFH will allow us to perform this analysis.
3.1 Exponentially decreasing SFH
In the first set of simulations, we choose a SFH with exponentially decaying star formation rate (SFR): SFR , where is the age of the SSP in Gyr and we use SSPs up to a maximum of 14 Gyrs. Like Tojeiro et al. (2007), we set to have a SFH that is neither dominated by recent star formation nor too similar to a single old burst galaxy. In our simulations, we consider specific SFR (SSFR) as defined in Asari et al. (2007) by normalizing all simulated spectra to have the same mass. This is reasonable because in practice we are interested in how the SFH varies across many galaxies, and therefore wish to remove the absolute mass-scale dependence of each object’s SFR. We have SSPs from 196 time bins where each bin’s metallicity is randomly chosen from 6 different values. We test our methods on 50 simulated spectra. Across different spectra the SFH is the same, while and the random noise all vary. Each spectrum has S/N=10 at .
The following bases of prototype spectra are used to fit each simulated spectrum: 1) the CF05 basis, 2) diffusion -means, 3) PC -means, and 4) standard -means. Each basis contains prototype spectra. For diffusion -means, we perform fits over a grid of values of to study how parameter estimates vary as a function of .
In Fig. 4 we show the error in spectral fits using diffusion -means for different values of . Also plotted is the error in the standard -means (sK) and PC -means (PC) bases. All errors are plotted as the ratio of the statistic to the from the CF05 basis fit. In the vast majority of the simulated spectra, all three of the techniques considered produce better spectral fits than the CF05 basis. There are no significant differences between spectral fit errors for the diffusion -means bases as varies. The minimum median for diffusion -means is achieved at .
We plot the MSE of our estimates for , and in Fig. 5. For each parameter, the CF05 basis predictions are the worst in terms of MSE. The performance of the diffusion -means basis is not very dependent on the particular choice of . This is a nice property, as any reasonable choice of will give us good results, and we need not spend computing time searching for an optimal value of . Diffusion -means bases tend to find the best estimates of , , and , and performs competitively with the other -means bases for estimation.
We compare the condensed SFH estimates across bases. Plots of the recovered and vectors in Fig. 6 show that the CF05 basis tends to significantly underestimate the contributions of the oldest population while overestimating the contribution of the second oldest population. All three -means bases find estimates that are closer to the true , but tend to underestimate the contribution of the oldest component. All three -means bases find very accurate estimates of . The bias in estimation occurs for at least two reasons. First, since the simulations are generated from SSPs on a fine age grid, it is intrinsically difficult to estimate all of the contributions of the subpopulations. Second, due to the exponential form of the SFR, older populations are the most important in the SFH, making it easy to underestimate their importance and almost impossible to overestimate their contribution.
For this simple prescription of galaxy evolution, bases constructed using -means methods find significantly better estimates of physical parameters and SFH than the CF05 basis. Next, we test our methods on a slightly more complicated set of simulations and also compare bases of size 45 and 150 to characterize the advantage, if any, of using a larger basis and whether this advantage is worth the extra expenditure in computation time.
3.2 Exponentially decreasing SFH with random starbursts
In our second set of simulations, we use a prescription similar to Chen et al. (2009) to create a set of more realistic spectra on which to test our methods. The added SFH complexities over the simulations in §3.1 are:
The time when a galaxy begins star formation is distributed uniformly between 0 and 5.7 Gyr after the Big Bang, where the Universe is assumed to be 13.7 Gyr old.
We again have SFR , but now allow to vary between galaxies. For each galaxy we draw from a uniform distribution between 0.7 and 1 .
We allow for starbursts with equal probability at all times. The probability a starburst begins at time is normalized so that the probability of no starbursts in the life of the galaxy is 33%. The length of each burst is distributed uniformly between 0.03 and 0.3 Gyr and the fraction of total stellar mass formed in the burst in the past 0.5 Gyr is distributed logarithmically between 0 and 0.1. The SFR of each starburst is constant throughout the length of the burst.
As above, each galaxy spectrum is generated as a mixture of SSPs of up to 196 time bins, with a randomly drawn metallicity in each bin. Allowing for random starbursts over the exponential SFH increases the importance of younger populations in the simulated galaxies. We test our methods on 500 simulated spectra, 250 with S/N=10 and 250 with S/N=20 at . Across each spectrum, SFH, and the random noise all vary. We fit bases of size 45 and 150.
For this set of simulations, we only look at results for the diffusion -means that achieves the lowest fit error for each ; the optimal values are =250 for the basis and =500 for the basis. As before, results are not very sensitive to the specific choice of . Note that the =250 basis of size 45 also achieved good fits to the simpler simulations in §3.1. In Fig. 7 we plot the MSE of the physical parameter estimates for each of the 4 bases for =45 and 150. The diffusion -means bases perform significantly better than all other bases for mean age and estimation and achieve similar results as the other -means bases for and estimation. The -means bases heavily outperform the hand-chosen bases (CF05 & Asa07) for each parameter. The same trends are seen for error in estimation of time-binned star formation histories (Fig. 8). The error estimates for all bases, split into S/N=10 and S/N=20 galaxy spectra, are in Table 1.
An interesting trend appears in Figs. 7-8. For the hand-chosen bases, we see considerable reduction in the estimation error of every parameter in going from the =45 basis of CF05 to the =150 basis of Asa07. However, in the -means bases, and particularly for the diffusion -means basis, there is no substantial lowering of the MSE of any parameter in increasing from =45 to =150. This tells us that even with as few as 45 prototypes, diffusion -means provides us a basis that can successfully fit these particular simulations. Moving to =150 prototypes allows us no significant gains in parameter estimation while costing us 10 times the computational time. Note, however, that although a basis of size 45 may be sufficient for this set of simulations, a set of real galaxy spectra may be more complicated and require a larger basis of prototypes to achieve good fits.
A common difficulty in population synthesis studies is the so-called age-metallicity degeneracy, where young, metal-rich SSPs are confused with old, metal-poor SSPs. Using a regular grid of SSP prototypes in population synthesis can exacerbate this problem because several SSPs with nearly identical spectra (but different ages and metallicities) will likely be included, inducing degeneracy in the spectral model (1), and causing the estimated SFH parameters for each galaxy to be highly variable. A diffusion -means basis, on the other hand, picks SSP prototypes whose spectra more evenly span the range of spectral characteristics, minimizing the inclusion of very similar spectra and thus decreasing the degeneracy effects in the minimization of (2). In Fig. 9 we see that the age-Z degeneracy is much smaller in diffusion -means bases, quantified by the slope of the linear trend in versus . Some degeneracy in age-metallicity still exists for the diffusion -means bases, but the magnitude of the slope of the trend is half.
One may note that both and are significantly biased for all 4 bases in Fig. 9. This probably occurs due to the simulation prescription, which includes mostly old SSPs from a large, fine grid of age and Z. One may be inclined to use these estimated biases to correct the estimates of real galaxy spectra. However, the amount of bias is dependent on the particular prescription of SFH in the simulations. Since there is no general model that describes galaxy SFH for a wide range of galaxies, there is no straightforward way to estimate the bias, if any, that will occur.
4 Application to SDSS spectra
In this section, we apply the fitting techniques to a sample of SDSS spectra to compare the results using different bases. We compare the physical parameters estimated from each of three different bases: Asa07, diffusion map with =45, and diffusion map with =150. We also compare parameter estimates to those estimated with the MPA/JHU data base. Our goal in this section is to study how the choice of basis affects the final parameter estimates for a large set of real galaxy spectra.
4.1 Data preparation
Our data sample consists of spectra from ten arbitrarily chosen spectroscopic plates of SDSS DR6 (02660274 inclusive, and 0286; Adelman-McCarthy et al. 2008) that are classified as galaxy spectra. The goal is to test our methods on a sample that is large enough that we include galaxies with a broad range of star formation and metallicity histories but not so large that analysis is computationally cumbersome. In a future work, we will present results for the more than 700k galaxy spectra in SDSS DR7.
We remove spectra from this sample by applying three cuts. The first, motivated by aperture considerations, is to analyse only spectra whose SDSS redshift estimates 0.05. Second, we determine the proportion of the 3850 wavelength bins that are flagged as bad. If this proportion exceeds 10%, we remove the spectrum from the sample; if not, we retain the spectrum for further analysis. Third, we select galaxy spectra with a SDSS redshift confidence level 0.35. These cuts leave us with a sample of 3046 galaxy spectra.
We bring all spectra to rest frame using the SDSS redshift estimates. In the fitting routines, we exclude any spectral bins with SDSS flag 2. This eliminates any sky residuals, bad pixels, and artefacts from the data reduction. We additionally mask the pixels in the regions within and around emission lines as in Cid Fernandes et al. (2005). Results of the STARLIGHT spectral fits give a mean of 1.03 for each of the three bases used, close to a value of 1 that would be expected if wavelength bin fluxes were independently distributed Gaussian random variables. Note that this value varies considerably from the mean value of 0.78 found in the analysis of 50362 SDSS spectra by Cid Fernandes et al. (2005). This difference can be attributed to our careful handling of spectral bin error estimates in our rebinning of the spectra to a dispersion of 1 measurement per Å.
4.2 Parameter estimates
The STARLIGHT fitting routine is used to fit all 3046 SDSS spectra with each of the three bases. Each basis yields parameter estimates that are substantially different. In Fig. 10 we plot each galaxy’s versus estimate for each basis. There are several notable differences in the plane for each basis. First, both diffusion map bases produce estimates that are tightly centered around an increasing linear trend while the Asa07 estimates are diffusely spread around such a trend. Moreover, the direction of discrepancy in the Asa07 estimates from the linear trend corresponds exactly with the direction of degeneracy between old, metal-poor and young, metal-rich galaxies. This suggests that the observed variability along this direction is not due to the physics of these galaxies, but rather is caused by confusion stemming from the choice of basis.
Secondly, the slopes of the estimated trends produced by the diffusion map bases (0.146 for =45 and 0.180 for =150) are considerably smaller than the slope of the trend in the Asa07 estimates (0.250). These estimates have implications for the physical models that govern galaxy evolution. Finally, the =45 diffusion map basis estimates no young, metal-poor galaxies, showing a sharp decline from densely-populated regions of the parameter space to vacant regions. This suggests that this small number of prototypes is not sufficient to cover the parameter space; particularly, young, metal-poor SSPs have been neglected in the basis.
Next we compare kinematic parameter estimation. Fig. 11 shows that the Asa07 basis tends to consistently estimate slightly larger values of than either diffusion map basis. On average, Asa07 estimates of are 0.045 larger than diffusion map =45 estimates and 0.024 larger than diffusion map =150 estimates. Estimates of are consistent between all three bases.
Finally, we compare age-binned SFH estimates found by each basis. Like Cid Fernandes et al. (2005), we condense the SFH to three parameters: old, ( yr), intermediate, ( yr), and young, ( yr). Plots of diffusion map estimates of each component versus Asa07 estimates are in Fig. 12. The estimated SFHs show a high degree of correspondence between the bases. On average, each diffusion map basis obtains slightly higher estimates of and slightly lower estimates of both and .
4.3 Comparisons to the MPA/JHU data base
We compare the SDSS parameters estimated with the
three different choices of STARLIGHT basis with parameters
obtained using measured emission line and continua values from the
MPA/JHU data base. The MPA/JHU collaboration has made their SDSS
spectral measurements publicly available on the MPA SDSS
We use the Balmer emission-line ratio to estimate the nebular extinction, , of each galaxy. Following Osterbrock (1989), we assume case B recombination, a density of 100 cm and a temperature of 10 K to compute according to (17)-(18):
where we have used the same constants used in Chen et al. (2009).
Fig. 13 shows plots of the values of computed from the MPA/JHU data base versus the values of the stellar extinction, , computed by each of the three different bases in the STARLIGHT code. We have only plotted the 1755 SDSS galaxies in our sample whose and emission line measurements both have signal-to-noise ratio . We find that and are loosely linearly related for each basis. The estimated Spearman correlation is 0.51 for the Asa07 basis and 0.52 for each diffusion map basis. The slope of the linear trend is slightly higher for the Asa07 basis (1.21) than for the diffusion map bases (1.16 for each basis). All three bases obtain stellar extinction estimates that are consistent with the nebular extinction values , and there are no significant differences between the strength of the relationships between the estimates of each basis and the nebular extinction estimates.
As in Chen et al. (2009), we compare our spectral synthesis estimates to two age indicators from the MPA/JHU database: (Balogh et al. 1999) and (Worthey & Ottaviani 1997). The value describes the break in the 4000 Å region of the galaxy spectrum; higher values generally correspond to increasing ages of the galaxy’s stellar subpopulations. Larger strength of the absorption line, , indicates higher star formation rates in the past 0.1-1 Gyr of a galaxy’s lifetime. Hence, we expect our estimates of to correlate positively with and negatively with .
In Fig. 14, we see that the estimates show the expected relationships with both and for each of the three bases. Correlations are strongly positive for the trend and strongly negative for the trend. We have plotted the 3031 galaxies in our sample that have reliable and measurements in the MPA/JHU catalogue. There is no significant difference in each relation between the bases.
Lastly, we compare our STARLIGHT estimates of with a measure of nebular metallicity based on the ratio of emission lines [OIII] and [NII], provided by MPA/JHU. Particularly, the quantity measures the oxygen abundance of a galaxy, and is defined as
where ON (Pettini & Pagel 2004). All emission line fluxes are first corrected for extinction using the estimates found in STARLIGHT fits. We would expect to be positively related to , as the former measures stellar metallicity and the latter measures nebular oxygen abundance.
In Fig. 15 we see that all three bases find estimates that correlate positively with , and that the =150 diffusion map basis estimates correlate more strongly with , with Spearman correlation coefficient of 0.21, compared to the Asa07 basis estimates, which have a correlation of 0.12 with the estimates. Furthermore, the best-fit linear trend of the versus diffusion map =150 estimates has a significantly higher slope, 0.128 than the linear trend of vs. Asa07 basis, 0.056. This result suggests that the diffusion map basis obtains more reliable metallicity estimates than the Asa07 basis.
We have presented a method to choose an appropriate basis of SSP prototypes to be used in population synthesis models of observed galaxy spectra. Particularly, we have used the STARLIGHT fitting routines of Cid Fernandes et al. (2005) to study the ability of both literature and diffusion -means bases to model both simulated and SDSS galaxy spectra. On simulated galaxies, our diffusion -means basis outperforms hand-chosen literature bases as well as bases constructed from standard -means or PC -means. For real SDSS galaxy spectra, we show that the diffusion -means basis finds parameter estimates that differ substantially from the estimates retrieved by the basis of Asari et al. (2007). The diffusion -means estimates seem to alleviate the age-metallicity degeneracy that has been detrimental to previous population synthesis studies of large data bases of galaxies.
At this point, more discussion of the age-metallicity degeneracy is warranted. Degeneracy arises because several SSPs have very similar spectral characteristics, but different ages and metallicities. When degenerate SSPs are included, estimates can be highly variable. This is shown in the first panel in Fig. 10 and in the STARLIGHT code itself, which in the presence of degeneracy tends to converge to different solutions due to the random walk taken through the parameter space by the Metropolis algorithm. The diffusion -means algorithm, on the other hand, includes only the mean spectrum and mean parameters of each set of degenerate spectra. This choice minimizes the statistical risk of the final solution by decreasing the variance and bias of the estimates (our measure of statistical risk is the MSE, which is the sum of the variance and squared bias of an estimate), as seen in simulations (Figs. 5-9). Our SFH estimates for SDSS spectra are significantly less variable in the direction of the age-Z degeneracy, as seen in Fig. 10, while the relative amount of bias is impossible to quantify without knowledge of the true SFH of each galaxy.
In the face of degeneracies and with no other outside information available, our proposed method (of picking the mean of several degenerate solutions) is the best one can do in terms of minimizing the statistical risk of the estimates. However, with more information about what features in the SSP spectra are most strongly related to their ages and metallicities, we could potentially improve our estimates by incorporating such information in the construction of the similarity measure in (3). The weight matrix produced by an expert-produced similarity measure may produce a more complete coverage of the parameter space in our SSP prototypes and help to remove degeneracies in the parameter estimates. Further study is required in this direction.
The ultimate goal of this project is to be able to quickly and accurately estimate physical property and SFH parameters for a large data base of galaxies. As an extension of this work, we plan to study how to quickly and effectively extend estimates found for a small set of galaxies using the computationally intensive STARLIGHT routine to a large database of galaxies. To achieve this goal, we must exploit the low-dimensional structure of these data. This approach was taken by Richards et al. (2009) in galaxy redshift estimation. These methods must be further developed and refined to reliably estimate SFH parameters.
The methods presented in this paper are applicable in a wide range of problems in astrophysical data analysis. Diffusion map is a powerful tool to uncover simple structure in complicated, high-dimensional data, whether they be spectra, photometric colours, two-dimensional images, data cubes, etc. Specifically, diffusion map is both more flexible and more widely applicable than PCA, which relies on the linear correlation structure of the data and assumes that the data reside on a low-dimensional, linear manifold. The diffusion -means method can be used in a variety of problems often encountered in astrophysics where a large set of observed or model data are described with a few prototype examples.
Future work will attempt to more concretely define criteria for the performance of algorithms that define prototypes from data points and to theoretically justify the use of diffusion -means. We plan to explore more deeply the relationship between the -means methods introduced here and to compare the methods introduced in this paper to other methods of basis selection, such as Basis Pursuit and Matching Pursuit. We also will investigate how meaningful standard errors can be attached to our parameter estimates using any choice of basis.
We thank the reviewer for their helpful comments.
This work was supported by NSF grants CCF-0625879 and DMS-0707059 and ONR grant N00014-08-1-0673.
Funding for the SDSS and SDSS-II has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation, the U.S. Department of Energy, the National Aeronautics and Space Administration, the Japanese Monbukagakusho, the Max Planck Society, and the Higher Education Funding Council for England. The SDSS Web site is http://www.sdss.org/. The SDSS is managed by the Astrophysical Research Consortium for the Participating Institutions. The Participating Institutions are the American Museum of Natural History, Astrophysical Institute Potsdam, University of Basel, Cambridge University, Case Western Reserve University, University of Chicago, Drexel University, Fermilab, the Institute for Advanced Study, the Japan Participation Group, Johns Hopkins University, the Joint Institute for Nuclear Astrophysics, the Kavli Institute for Particle Astrophysics and Cosmology, the Korean Scientist Group, the Chinese Academy of Sciences (LAMOST), Los Alamos National Laboratory, the Max-Planck-Institute for Astronomy (MPIA), the Max-Planck-Institute for Astrophysics (MPA), New Mexico State University, Ohio State University, University of Pittsburgh, University of Portsmouth, Princeton University, the United States Naval Observatory, and the University of Washington.
The STARLIGHT project is supported by the Brazilian agencies CNPq, CAPES and FAPESP and by the France-Brazil CAPES/Cofecub program.
- pagerange: Accurate parameter estimation for star formation history in galaxies using SDSS spectra–Accurate parameter estimation for star formation history in galaxies using SDSS spectra
- pubyear: 2009
- Adelman-McCarthy, J. K., et al. 2008, ApJS, 175, 297
- Alongi, M., Bertelli, G., Bressan, A., Chiosi, C., Fagotto, F., Greggio, L., and Nasi, E., 1993, A&AS, 97, 851
- Asari, N. V., Cid Fernandes, R., Stasińska, G., Torres-Papaqui, J. P., Mateus, A., Sodré, L., Schoenell, W., and Gomes, J. M., 2007, MNRAS, 381, 263
- Balogh, M. L., Morris, S. L., Yee, H. K. C., Carlberg, R. G. and Ellingson, E., 1999, ApJ, 527, 54
- Bica, E., 1988, A&A, 195, 76
- Bressan, A., Fagotto, F., Bertelli, G., Chiosi, C., 1993, A&AS, 100, 647
- Brinchmann, J., Charlot, S., White, S. D. M., Tremonti, C., Kauffmann, G., Heckman, T., and Brinkmann, J., 2004, MNRAS, 351, 1151
- Bruzual, G., and Charlot, S., 2003, MNRAS, 344, 1000
- Cardelli, J. A., Clayton, G. C., and Mathis, J. S., 1989, ApJ, 345, 245
- Chabrier, G., 2003, PASP, 115, 763
- Chen, X. Y., Liang, Y. C., Hammer, F., Zhao, Y. H. and Zhong, G. H., 2009, A&A, 495, 457
- Cid Fernandes, R., Sodre, L., Schmitt, H. R., Leao, J. R. S., 2001, MNRAS, 325, 60
- Cid Fernandes, R., Gu, Q., Melnick, J., Terlevich, E., Terlevich, R., Kunth, D., Rodrigues Lacerda, R., and Joguet, B., 2004, MNRAS, 355, 273
- Cid Fernandes, R., Mateus, A., Sodré, L., Stasińska, G., and Gomes, J. M., 2005, MNRAS, 358, 363
- Coifman, R. R., & Lafon, S., 2006, Appl. Comput. Harmon. Anal., 21, 5
- Fagotto, F., Bressan, A., Bertelli, G., Chiosi, C., 1994a, A&AS, 104, 365
- Fagotto, F., Bressan, A., Bertelli, G., Chiosi, C., 1994b, A&AS, 105, 29
- Freeman, P. E., Newman, J. A., Lee, A. B., Richards, J. W., and Schafer, C. M., in preparation.
- Gallazzi, A., Charlot, S., Brinchmann, J., White, S. D. M., and Tremonti, C. A., 2005, MNRAS, 362, 41
- Girardi, L., Bressan, A., Chiosi, C., Bertelli, G., and Nasi, E., 1996, A&AS, 117, 113
- Glazebrook, K., et al., 2003, ApJ, 587, 55
- Kauffmann, G. et al., 2003, MNRAS, 341, 33
- Lafon, S., & Lee, A., 2006, IEEE Trans. Pattern Anal. and Mach. Intel., 28, 1393
- Le Borgne, J.-F., et al., 2003, A&A, 402, 433
- Leitherer, C., Robert, C., Heckman, T. M., 1995, ApJS, 99, 173
- Mas-Hesse, J. M.& Kunth, D., 1991, A&AS, 88, 399
- Mathis, H., Charlot, S., and Brinshmann, J., 2006, MNRAS, 365, 385
- Moultaka, J., Boisson, C., Joly, M., Pelat, D., 2004, A&A, 420, 459
- Ocvirk, P., Pichon, C., Lancon, A., and Thiebaut, E., 2006, MNRAS, 365, 46
- Osterbrock, D. E., 1989, Astrophysics of Gaseous Nebulae and Active Galactic Nuclei (Mill Valley: University Science Books)
- Panter, B., Heavens, A. F., and Jimenez, R., 2003, MNRAS, 343, 1145
- Panter, B., Jimenez, R., Heavens, A. F. and Charlot, S., 2007, MNRAS, 378, 1550
- Pelat, D., 1997, MNRAS, 284, 365
- Pettini, M. and Pagel, B. E. J., 2004, MNRAS, 348, L59
- Reichardt, C., Jimenez, R., and Heavens, A. F., 2001, MNRAS, 327, 849
- Richards, J. W., Freeman, P. E., Lee, A. B., Schafer, C. M., 2009, ApJ, 691, 32
- Ronen, S., Aragón-Salamanca, A., and Lahav, O., 1999, MNRAS, 303, 284
- Strauss, M. A. et al., 2002, AJ, 124, 1810
- Tremonti, C., 2003, PhD thesis, Johns Hopkins Univ.
- Tremonti, C. A. et al., 2004, ApJ, 613, 898
- Tojeiro, R., Heavens, A. F., Jimenez, R., and Panter, B., 2007, MNRAS, 381, 1252
- Worthey, G., 1994, ApJS, 94, 687
- Worthey, G. and Ottaviani, D. L., ApJS, 111, 377
- York, D. G. et al., 2000, AJ, 120, 1579