A catalogue of photometric redshifts for the SDSS-DR9 galaxies
Key words: techniques: photometric - galaxies: distances and redshifts - galaxies: photometry - methods: data analysis - catalogs
Context: Accurate photometric redshifts for large samples of galaxies are among the main products of modern multiband digital surveys. Over the last decade, the Sloan Digital Sky Survey (SDSS) has become a benchmark against which the various methods are tested.
Aims: We present an application of a new method to the estimation of photometric redshifts for the galaxies in the SDSS Data Release 9 (SDSS-DR9). Photometric redshifts for more than million galaxies were produced and made available at the URL: http://dame.dsf.unina.it/catalog/DR9PHOTOZ/.
Methods: The MLPQNA (Multi Layer Perceptron with Quasi Newton Algorithm) model provided within the framework of the DAMEWARE (DAta Mining and Exploration Web Application REsource) is an interpolative method derived from machine learning models.
Results: The obtained redshifts have an overall uncertainty of with a very small average bias of , and a fraction of catastrophic outliers () of . This result is slightly better than what was already available in the literature, also in terms of the smaller fraction of catastrophic outliers.
1 Introduction
In the last few years, photometric redshifts (photo-z) for large samples of normal or active galaxies have become crucial for a variety of cosmological applications (Scranton et al. 2005; Myers et al. 2006; Hennawi et al. 2006; Giannantonio et al. 2008), and many different methods for their evaluation have been presented and extensively discussed in the literature (cf. Hildebrandt et al. 2010). The problem of deriving accurate photometric redshifts has become even more pressing because of the huge amount of data produced by most ongoing and planned photometric surveys (cf. Pan-STARRS: Kaiser 2004; KiDS: http://www.astro-wise.org/projects/KIDS/; Euclid: Laureijs et al. 2011) aimed at exploiting weak lensing to probe the dark components of the universe.
Without entering into details, which can be found elsewhere, it is worth recalling that, broadly speaking, all photo-z methods are based on the interpolation of some a priori knowledge represented by sets of templates, and differ in one or both of the following aspects: (i) the way in which the a priori Knowledge Base (KB) is constructed (higher accuracy spectroscopic redshifts or, rather, empirically or theoretically derived Spectral Energy Distributions or SEDs), and (ii) the interpolation/fitting algorithm employed.
In all methods, the main source of uncertainty lies in the fact that the function mapping the color space into the spectroscopic redshift space is just an oversimplified approximation of the complex and otherwise unknown relation existing between colors and redshift (as an example, see Csabai et al. 2003). Among the various interpolative methods, we quote just a few: (i) polynomial fitting (Connolly et al., 1995); (ii) nearest neighbors (Csabai et al., 2003); (iii) neural networks (D’Abrusco et al. 2007; Yéche et al. 2010 and references therein); (iv) support vector machines (Wadadekar, 2005); (v) regression trees (Carliles et al., 2010); (vi) Gaussian processes (Way & Srivastava, 2006; Bonfield et al., 2010), and (vii) diffusion maps (Freeman et al., 2009).
In this paper we focus on the application of the MLPQNA (Multi Layer Perceptron with Quasi Newton Algorithm) method to the galaxies contained in the SDSS Data Release 9 (DR9; Paris et al. 2012). The method has already been described in detail elsewhere (Brescia et al., 2012, 2013), and we refer interested readers to these papers for all the mathematical and technical details. We note that in the framework of the PHAT1 contest (Hildebrandt et al., 2010), which blindly compared most existing methods for photo-z evaluation, the MLPQNA method proved to be among the two best empirical methods to date (Cavuoti et al., 2012), in spite of the very limited knowledge base available for the contest ( objects only).
MLPQNA is just one among the many data mining methods publicly available under the DAta Mining & Exploration Web Application REsource infrastructure (DAMEWARE; Brescia et al. 2014).
2 The Data
The Sloan Digital Sky Survey (SDSS; York et al. 2000) is the forerunner of modern wide-field surveys. It combines multi-band photometry and fiber-based spectroscopy, thus providing both photometric data for a very large number of objects and spectroscopic information for a smaller but still significant subsample of the same population. Hence it provides all the information needed to constrain the fit of an interpolating function mapping the photometric features into the spectroscopic redshift space. This is the main reason why most, if not all, photometric redshift methods have been tested on the various data releases of the SDSS which, over the years, has become a benchmark data set against which old and new methods are tested.
To form our Knowledge Base (KB), we extracted from the spectroscopic subsample of the SDSS-DR9 all objects with specClass galaxy, together with their photometry. In particular, we used the () magnitudes and the related colors, rejecting all objects with missing or undetected information in any of the SDSS photometric bands.
The cuts in magnitude were obtained by considering the limits within which the photometric parameter space sampled by the spectroscopic objects is significantly covered. Within these limits, during the training phase the neural algorithm is exposed, in every region of the cleaned parameter space, to a number of examples sufficiently large to allow learning. Obviously, the less populated a region of the parameter space is, the less accurate the final result is expected to be. An additional implication is that the less populated a region is, the less likely the model is to correctly learn the rule for peculiar or rare objects. The resulting psf magnitude limits are listed in Table 1, while Fig. 1 shows the psf magnitude distributions in the knowledge base. As also described in Oyaizu et al. (2008), we trained our model on the spectroscopic sample up to the magnitude limit of . By considering a photometric limit of , the resulting fainter limit in the training set covers the complete photometric region of interest without introducing boundary effects for photometric redshifts of galaxies having magnitudes near the limit. Taking all this into account, the complete spectroscopic KB consisted of objects.
|Band||lower limit||upper limit|
3 Experiments and discussion
In supervised machine learning methods it is common practice to use the available KB to build at least three disjoint subsets for every experiment: one (training set) for training purposes, i.e. to train the method in order to acquire the hidden correlation among the input features which is needed to perform the regression; the second one (validation set) to check the training, in particular against a loss of generalization capabilities (a phenomenon also known as overfitting); and the third one (test set) to evaluate the overall performance of the model (Brescia et al., 2013).
In this work, the validation was performed during training, by applying the standard leave-one-out k-fold cross validation mechanism (Geisser, 1975). We would like to stress that none of the objects included in the training (and validation) sample was included in the test sample and only the test data were used to generate the statistics. In other words, the test was blind, i.e. based only on objects never submitted to the network.
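As an illustration of the cross-validation mechanism mentioned above, the following minimal sketch generates the training/validation index pairs of a k-fold scheme. The function name `k_fold_indices` and the interleaved fold assignment are assumptions made for this example only; the value of k used in the experiments is not stated here, and leave-one-out corresponds to the limiting case in which k equals the number of objects.

```python
def k_fold_indices(n_objects, k):
    """Yield (training, validation) index lists for k-fold cross validation.

    Hypothetical helper: objects are assigned to folds in an interleaved way.
    Setting k == n_objects gives the leave-one-out limiting case.
    """
    if not 2 <= k <= n_objects:
        raise ValueError("k must satisfy 2 <= k <= n_objects")
    # Fold number `start` holds the indices start, start + k, start + 2k, ...
    folds = [list(range(start, n_objects, k)) for start in range(k)]
    for held_out in range(k):
        validation = folds[held_out]
        training = [idx
                    for fold_id, fold in enumerate(folds)
                    if fold_id != held_out
                    for idx in fold]
        yield training, validation
```

Each object is used exactly once for validation and k - 1 times for training, while the test set stays outside this loop altogether.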
We decided to populate the training and the test set with % and % of the objects in the KB, namely with and objects, respectively. This decision, which might seem somewhat anomalous since it is common practice for machine learning methods to operate with data sets of reversed proportion, was dictated by the large number of examples present in the knowledge base and by the specificity of the MLPQNA method, which can overfit the data (with a loss of generalization capability) when exposed to a very large number of examples. The histogram in Fig. 2 shows the distribution of the objects in the KB as a function of zspec in both training and test sets.
In order to ensure that the KB provided a proper coverage of the parameter space, the data were split into the two data sets by random extraction. In other words, we randomly shuffled and split the original data set, repeated the extraction sequence several times, and evaluated the average of the outputs. This mechanism prevents possible biases induced by fluctuations in the coverage of the parameter space, namely small differences in the redshift distribution of the training and test samples used in the experiments.
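The shuffle-and-split procedure described above can be sketched as follows. This is only an illustrative sketch: the function names, the seed, and the `score_fn` callback (standing for one complete train-and-test experiment) are hypothetical, and the training fraction is left as a parameter.

```python
import random


def repeated_random_splits(n_objects, train_fraction, n_repeats, seed=0):
    """Yield (train, test) index lists from independent random shuffles."""
    rng = random.Random(seed)
    n_train = round(train_fraction * n_objects)
    indices = list(range(n_objects))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        # The two subsets are disjoint by construction.
        yield shuffled[:n_train], shuffled[n_train:]


def averaged_score(score_fn, n_objects, train_fraction, n_repeats, seed=0):
    """Average a user-supplied score over all the random extractions.

    score_fn(train, test) is a hypothetical callback that trains on the
    first index list and returns a statistic computed on the second one.
    """
    scores = [score_fn(train, test)
              for train, test in repeated_random_splits(
                  n_objects, train_fraction, n_repeats, seed)]
    return sum(scores) / len(scores)
```

Averaging over several extractions damps the effect of any single unlucky split on the quoted statistics.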
Once the data sets were produced, we checked which types of flux combinations were more effective, in terms of magnitudes or related colors. We therefore performed and compared two experiments with two different sets of features: (i) MAG, using the five SDSS magnitudes, and (ii) MIXED, replacing the magnitudes with the derived colors and leaving only the as pivot magnitude. The best combination turned out to be the MIXED type. From the data mining point of view this is rather surprising, since the amount of information should not change by applying linear combinations between features. From the physical point of view, however, the better performance of the MIXED experiment can be understood by noticing that, even though colors are derived as a subtraction of magnitudes, the information content is quite different, since an ordering relationship is implicitly assumed, thus increasing the amount of information in the final output (i.e., flux gradients instead of fluxes). The additional pivot magnitude used in the experiment serves to remove the degeneracy in the luminosity class for a specific galaxy type.
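The two feature sets can be illustrated with the short sketch below. Two choices in it are assumptions made for the example and not statements about the actual experiments: the colors are built from adjacent bands (u-g, g-r, r-i, i-z), and the pivot band defaults to r.

```python
BANDS = ("u", "g", "r", "i", "z")


def mag_features(mags):
    """MAG feature set: the five magnitudes, in band order."""
    return [mags[band] for band in BANDS]


def mixed_features(mags, pivot="r"):
    """MIXED feature set: four colors plus one pivot magnitude.

    Assumptions of this sketch: colors are differences of adjacent
    bands, and the pivot band is r unless specified otherwise.
    """
    colors = [mags[blue] - mags[red]
              for blue, red in zip(BANDS[:-1], BANDS[1:])]
    return colors + [mags[pivot]]
```

Both feature vectors have five components, so the comparison between the two experiments involves networks with the same number of inputs.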
Individual experiments as well as their comparison with results provided by others, were evaluated in a consistent and objective manner using a homogeneous and standard set of statistical indicators:
the bias, defined as the mean value of the residuals ;
the standard deviation () of the residuals;
the normalized median absolute deviation or of the residuals, defined as ;
all the above quantities calculated also on the normalized residuals, i.e. .
Furthermore, as an overall estimate of the accuracy of the final results, one can follow the prescription in CLSI (2006) and derive the overall uncertainty (OU), defined as .
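The bias, standard deviation, and NMAD listed above can be computed as in the following sketch. The conventions adopted here are assumptions of the example: the residual is taken as zspec - zphot, the normalized residual divides it by (1 + zspec), the population form of the standard deviation is used, and the NMAD is taken as 1.48 times the median of the absolute residuals. The overall uncertainty is not included.

```python
import statistics


def residuals(z_spec, z_phot, normalized=False):
    """Residuals zspec - zphot, optionally divided by (1 + zspec).

    The sign and the normalization are conventions assumed for this sketch.
    """
    out = []
    for zs, zp in zip(z_spec, z_phot):
        dz = zs - zp
        out.append(dz / (1.0 + zs) if normalized else dz)
    return out


def summary(dz):
    """Bias, standard deviation, and NMAD of a list of residuals."""
    return {
        "bias": statistics.mean(dz),
        "sigma": statistics.pstdev(dz),  # population form, assumed here
        "nmad": 1.48 * statistics.median([abs(d) for d in dz]),
    }
```

Calling `summary(residuals(z_spec, z_phot, normalized=True))` gives the same three indicators for the normalized residuals.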
Results are given in Table 2, where we also compare them with those obtained by Laurino et al. (2011), who applied to the SDSS Data Release 7 objects a machine learning model with a slightly more complex architecture, named WGE (Weak Gated Experts), which, to the best of our knowledge, has achieved the highest accuracy so far.
|Laurino et al.||0.015||0.015||0.016||0.021||0.014||0.013||0.013||0.019|
In the second half of the table we give the fraction of outliers, i.e. the fraction of objects for which the photometric redshift estimate deviates more than in absolute value, or deviating more than , or from the spectroscopic value.
Summarizing, MLPQNA achieves the very small bias of , and a normalized standard deviation of . Moreover, our method leads to a very small fraction of outliers, i.e. less than 0.04% and using the and the criteria, respectively.
In Fig. 3, we plot the photometric redshift estimates versus the spectroscopic redshift values for all objects in the test set. After the rejection of catastrophic outliers, as defined by the , we obtain a of , which is larger than . This is exactly what has to be expected according to Mobasher et al. (2007). In fact, in the case where photo-z are empirical, it is always useful to analyze the direct correlation between the and the standard deviation calculated on data which are not catastrophic outliers. In these cases, a correct photo-z prediction occurs whenever the quantity is lower than the for the cleaned sample.
It should be noted that for empirical methods the overestimates the theoretical Gaussian , mainly because of catastrophic outliers and the intrinsic training error.
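The outlier fraction and the standard deviation of the cleaned sample discussed above can be sketched as follows. The threshold is left as a free parameter of this example; it can be either a fixed value or a multiple of a previously computed standard deviation, and the function names are hypothetical.

```python
import statistics


def outlier_fraction(dz, threshold):
    """Fraction of residuals whose absolute value exceeds the threshold."""
    return sum(1 for d in dz if abs(d) > threshold) / len(dz)


def sigma_clean(dz, threshold):
    """Standard deviation of the residuals after rejecting the outliers."""
    kept = [d for d in dz if abs(d) <= threshold]
    return statistics.pstdev(kept)
```

Comparing `sigma_clean` with the NMAD of the same residuals gives the kind of check described in the text for the cleaned sample.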
In order to better characterize the performance of the experiment, we also computed the statistics on subsets of the test data binned according to either redshift or magnitude range.
As far as redshift is concerned, as shown in Table 2, we built the first subset by using objects in the redshift range from up to the redshift which includes of the objects in the test set; the second one using the range from to (corresponding to an additional 25% of the sample); the third one using the range to (corresponding to an additional 15%) and, finally, the fourth one including all remaining objects (redshift ). We also derived statistics in the redshift range corresponding to the range covered in Laurino et al. (2011), in order to allow a fair comparison.
The behaviour of the residuals as a function of the magnitude in the SDSS r band was instead studied in the three bins listed in Table 3.
|mag bin (r)||test objects||skewness||kurtosis|
Using this information, we assigned a photo-z quality flag (from as best quality to as worst quality) to all objects in all mag bins, using both the mag completeness limit and the trend as criteria. Results are summarized in Table 3.
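The binned analysis described above amounts to grouping the residuals by redshift or magnitude before computing the statistics. A minimal sketch is given below; the function name and the bin edges passed to it are placeholders chosen for the example and do not correspond to the bins of the tables.

```python
import bisect
import statistics


def binned_sigma(values, dz, edges):
    """Group residuals by the bin of `values` and summarize each bin.

    `edges` is a sorted list of inner bin boundaries (placeholders here).
    Returns one (number of objects, standard deviation) pair per bin;
    the standard deviation is None for an empty bin.
    """
    groups = [[] for _ in range(len(edges) + 1)]
    for value, residual in zip(values, dz):
        groups[bisect.bisect_right(edges, value)].append(residual)
    return [(len(g), statistics.pstdev(g) if g else None) for g in groups]
```

The same function can be applied with magnitudes instead of redshifts as the binning variable.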
As expected, the error remains acceptable also slightly outside the magnitude completeness limit (). In this region, however, the number of training points is rather small and, for the reasons stated above, the predicted redshifts need to be taken with some caution. Given the selection criteria applied to select the targets for the spectroscopic survey, it is very likely that not all galaxy types are present in the knowledge base, and that the much wider population of objects with photometric observations only is not well represented in the training set.
It should however be stressed that in the already mentioned PHAT1 contest, MLPQNA obtained very good results using a KB of size (i.e. objects) similar to that used for training in the last magnitude bin.
In order to better characterize the distribution of the residuals in terms of Gaussianity, we fitted a Gaussian to the residuals in the various quality bins, obtaining the kurtosis and skewness listed in Table 3. The distributions of residuals appear to be quite symmetric, even though slightly leptokurtic.
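For reference, skewness and kurtosis can be estimated directly from the sample moments of the residuals, as in the sketch below. This is a swapped-in, simpler technique with respect to the Gaussian fit mentioned in the text, and the function returns the excess kurtosis, so that a Gaussian gives zero and a leptokurtic distribution gives a positive value.

```python
def skewness_and_excess_kurtosis(values):
    """Moment-based skewness and excess kurtosis of a sample.

    Not the Gaussian-fit procedure of the text: this sketch uses the
    biased (population) central moments of the sample directly.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skewness, excess_kurtosis
```

A skewness close to zero together with a positive excess kurtosis corresponds to the symmetric but slightly leptokurtic behaviour described above.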
4 The photometric Catalogue
To produce the final catalogue, we downloaded from the SDSS DR9 data server (http://skyserver.sdss3.org/CasJobs/) all objects falling within the declination range and detected in all SDSS bands.
We underline that all empirical photo-z methods suffer from a poor capability to extrapolate outside the range of distributions imposed by the training. In other words, outside the limits of magnitudes and zspec used in the training set, these methods do not ensure optimal performance. Hence, to remain on the safe side, we selected the objects in the final catalogue according to the same limits applied to the training and test samples.
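The selection described above can be sketched as a simple box cut in magnitude space. The function name and the limits used in the usage example are hypothetical placeholders and are not the values adopted for the catalogue.

```python
def within_training_limits(mags, limits):
    """True if every magnitude lies inside the training range of its band.

    `limits` maps a band name to a (lower, upper) pair; the values are
    supplied by the caller and are placeholders in this sketch.
    """
    return all(lower <= mags[band] <= upper
               for band, (lower, upper) in limits.items())


# Usage with made-up limits, for illustration only.
example_limits = {"g": (15.0, 24.0), "r": (14.0, 23.0)}
inside = within_training_limits({"g": 20.1, "r": 19.4}, example_limits)
outside = within_training_limits({"g": 20.1, "r": 23.7}, example_limits)
```

Objects failing the test fall in a region where the network would have to extrapolate, so they are excluded rather than given an unreliable estimate.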
Furthermore, the SDSS DR9 hosts objects which are spectroscopically recognized as galaxies, but whose photometric class is different. In most cases such objects are photometrically classified as stars. From the spectral point of view, a zspec value is assigned to most of these objects, although they are lost from any photometric search based on galaxy type. Hence, for completeness, we added such objects to the photo-z photometric catalogue by retrieving them through a special SQL query (see Appendix C).
For convenience, the whole catalogue was split into files, containing a total of objects with the estimated photo-z. Among them, the file with suffix specialObjects includes the photo-z for special objects with a mismatch between photometric and spectroscopic class assignment. The other files forming the final catalogue correspond to different declination ranges, and each is structured in columns containing:
column 1: the SDSS-DR9 object identification;
columns 2 and 3: right ascension and declination;
columns 4-8: the u, g, r, i, and z PSF magnitudes;
columns 9-13: the error for all magnitudes;
columns 14-18: the extinction for each magnitude;
columns 19-22: the colors derived from type magnitudes;
column 23: the estimated photo-z;
column 24: quality flag of the photo-z obtained from the information gathered during the analysis of the test set. stands for the best photo-z accuracy, for photo-z with lower accuracy, and for the photo-z related to objects outside the completeness limit.
The photometric catalogue is publicly available for download at the URL: http://dame.dsf.unina.it/catalog/DR9PHOTOZ/.
5 Conclusions
The MLPQNA neural network was applied to the SDSS-DR9 photometric galaxy data, using a knowledge base derived from the SDSS-DR9 spectroscopic subsample.
After a set of experiments, the best results were obtained with a network with two hidden layers, using a combination of the SDSS colors (obtained from the SDSS ) plus the pivot magnitude in the band. This gives a normalized overall uncertainty of with a very small average bias of , a low , and a low fraction of outliers ( at and at ). After the rejection of catastrophic outliers, the residual uncertainty is .
The trained network was then used to process the galaxies in the SDSS-DR9 matching the above outlined selection criteria and to produce the complete photometric catalogue. This catalog consists of photo-z estimates for more than million SDSS-DR9 galaxies and is available at the URL: http://dame.dsf.unina.it/catalog/DR9PHOTOZ/.
Acknowledgements. The authors would like to thank the anonymous referee for extremely valuable comments and suggestions. Part of this work was supported by the PRIN-MIUR 2011, Cosmology with the Euclid space mission, and by the Project F.A.R.O., call by the University Federico II of Naples. One of us (GL) wishes to thank G.S. Djorgovski and the Department of Astronomy and Astrophysics at Caltech for support and hospitality.
References
- Bonfield et al. (2010) Bonfield, D.G., Sun, Y., Davey, N., Jarvis, M.J., Abdalla, F.B., Banerji, M., & Adams, R.G., 2010. MNRAS, 405, 987.
- Brescia et al. (2012) Brescia, M., Cavuoti, S., Paolillo, M., Longo, G., Puzia, T., 2012. MNRAS, 421, 1155.
- Brescia et al. (2013) Brescia, M., Cavuoti, S., D’Abrusco, R., Longo, G., & Mercurio, A., 2013. ApJ, 772, 140.
- Brescia et al. (2014) Brescia, M., Cavuoti, S., Longo, G., et al., 2014. DAMEWARE: A web cyberinfrastructure for astrophysical data mining. To appear in PASP (accepted for publication), arXiv:1406.3538.
- Carliles et al. (2010) Carliles, S., Budavári, T., Heinis, S., Priebe, C., & Szalay, A. S. 2010. ApJ, 712, 511.
- Cavuoti et al. (2012) Cavuoti, S., Brescia, M., Longo, G., & Mercurio, A., 2012. A&A, 546, 13.
- Connolly et al. (1995) Connolly, A.J., Csabai, I., Szalay, A.S., Koo, D.C., Kron, R.G., & Munn, J.A., 1995. AJ, 110, 2655.
- CLSI (2006) Clinical and Laboratory Standards Institute, 2006. User Verification of performance for precision and trueness. Vol. 25, n. 17.
- Csabai et al. (2003) Csabai, I., Budavári, T., Connolly, A.J., Szalay, A.S., et al., 2003. AJ, 125, 580.
- D’Abrusco et al. (2007) D’Abrusco, R., Longo, G., & Walton, N., 2007. ApJ, 663, 752.
- Freeman et al. (2009) Freeman, P.E., Newman, J.A., Lee, A.B., Richards, J.W., & Schafer, C.M., 2009. MNRAS, 398, 2012.
- Giannantonio et al. (2008) Giannantonio, T., Scranton, R., Crittenden, R.G., Nichol, R.C., Boughn, S.P., Myers, D., & Richards, G.T., 2008. Phys. Rev. D, 77, 123520.
- Hennawi et al. (2006) Hennawi, J.F., Strauss, M.A., Oguri, M., Inada, N., Richards, G.T., et al., 2006. AJ, 131, 1.
- Hildebrandt et al. (2010) Hildebrandt, H., Arnouts, S., Capak, P., Wolf, C. et al., 2010. A&A, 523, 31.
- Kaiser (2004) Kaiser, N., 2004. Pan-STARRS: a wide-field optical survey telescope array. Proc. of SPIE, 5489, 11.
- Geisser (1975) Geisser, S., 1975. Journal of the American Statistical Association, 70 (350), 320-328.
- Laureijs et al. (2011) Laureijs, R., et al., 2011. Euclid Definition Study Report, ESA/SRE(2011)12, Issue 1.1, arXiv:1110.3193.
- Laurino et al. (2011) Laurino, O., D’Abrusco, R., Longo, G., & Riccio G., 2011. MNRAS, 418, 2165.
- Myers et al. (2006) Myers, A.D., Brunner, R.J., Richards, G.T., Nichol, R.C., Schneider, D.P., Vanden Berk, D.E., Scranton, R., Gray, A.G., & Brinkmann, J., 2006. ApJ, 638, 622.
- Mobasher et al. (2007) Mobasher, B., Capak, P., Scoville, N. Z., Dahlen, T., Salvato, M., et al., 2007. ApJS, 172, 117.
- Oyaizu et al. (2008) Oyaizu, H., et al., 2008. ApJ, 674, 768.
- Paris et al. (2012) Paris, I., et al., 2012. A&A, 548, 66.
- Scranton et al. (2005) Scranton, R., Ménard, B., Richards, G.T., Nichol, R.C., Myers, A.D., et al., 2005. ApJ, 633, 589.
- Yéche et al. (2010) Yéche, C., Petitjean, P., Rich, J., Aubourg, E., Busca, N., et al., 2010. A&A, 523, 14.
- York et al. (2000) York, D.G., Adelman, J., Anderson, Jr., J.E., et al. 2000. AJ, 120, 1579.
- Wadadekar (2005) Wadadekar, Y., 2005. PASP, 117, 79.
- Way & Srivastava (2006) Way, M.J., & Srivastava, A.N., 2006. ApJ, 647, 102.
Appendix A Spectroscopic Query
The following SQL code was used to obtain the spectroscopic Knowledge Base used to train and test the model.
SELECT
    p.objid, s.specObjID, p.ra, p.dec,
    p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z,
    p.psfmagerr_u, p.psfmagerr_g, p.psfmagerr_r, p.psfmagerr_i, p.psfmagerr_z,
    p.fiberMag_u, p.fiberMag_g, p.fiberMag_r, p.fiberMag_i, p.fiberMag_z,
    p.fibermagerr_u, p.fibermagerr_g, p.fibermagerr_r, p.fibermagerr_i, p.fibermagerr_z,
    p.petroMag_u, p.petroMag_g, p.petroMag_r, p.petroMag_i, p.petroMag_z,
    p.petromagerr_u, p.petromagerr_g, p.petromagerr_r, p.petromagerr_i, p.petromagerr_z,
    p.modelMag_u, p.modelMag_g, p.modelMag_r, p.modelMag_i, p.modelMag_z,
    p.modelmagerr_u, p.modelmagerr_g, p.modelmagerr_r, p.modelmagerr_i, p.modelmagerr_z,
    p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z,
    s.z as zspec, s.zErr as zspec_err, s.zWarning, s.class, s.subclass, s.primTarget
INTO mydb.galaxies_spec
FROM PhotoObjAll as p, SpecObj as s
WHERE s.class = 'GALAXY'
    AND s.zWarning = 0
    AND p.mode = 1
    AND p.SpecObjID = s.SpecObjID
    AND dbo.fPhotoFlags('PEAKCENTER') != 0
    AND dbo.fPhotoFlags('NOTCHECKED') != 0
    AND dbo.fPhotoFlags('DEBLEND_NOPEAK') != 0
    AND dbo.fPhotoFlags('PSF_FLUX_INTERP') != 0
    AND dbo.fPhotoFlags('BAD_COUNTS_ERROR') != 0
    AND dbo.fPhotoFlags('INTERP_CENTER') != 0
Appendix B Photometric Query
The photometric data for the catalogue with the estimated photo-z were retrieved from the SDSS DR9 service by applying the following SQL query. As an example, the code reported here refers to the DEC range between 60 and 65 deg.
SELECT
    p.objid, p.ra, p.dec,
    p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z,
    p.psfmagerr_u, p.psfmagerr_g, p.psfmagerr_r, p.psfmagerr_i, p.psfmagerr_z,
    p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z
INTO mydb.p60_p65
FROM Galaxy as p
WHERE p.mode = 1
    AND dbo.fPhotoFlags('PEAKCENTER') != 0
    AND dbo.fPhotoFlags('NOTCHECKED') != 0
    AND dbo.fPhotoFlags('DEBLEND_NOPEAK') != 0
    AND dbo.fPhotoFlags('PSF_FLUX_INTERP') != 0
    AND dbo.fPhotoFlags('BAD_COUNTS_ERROR') != 0
    AND dbo.fPhotoFlags('INTERP_CENTER') != 0
    AND p.dec >= 60
    AND p.dec < 65
Appendix C Special Query
The SQL code below was used for the query needed to integrate the photo-z catalogue with objects spectroscopically recognized as galaxies but photometrically assigned to different classes within the SDSS DR9.
SELECT
    p.objid, p.ra, p.dec,
    p.psfMag_u, p.psfMag_g, p.psfMag_r, p.psfMag_i, p.psfMag_z,
    p.psfmagerr_u, p.psfmagerr_g, p.psfmagerr_r, p.psfmagerr_i, p.psfmagerr_z,
    p.extinction_u, p.extinction_g, p.extinction_r, p.extinction_i, p.extinction_z
INTO mydb.photoerror
FROM PhotoObjAll as p, SpecObj as s
WHERE s.class = 'GALAXY'
    AND p.type != 3
    AND p.mode = 1
    AND dbo.fPhotoFlags('PEAKCENTER') != 0
    AND dbo.fPhotoFlags('NOTCHECKED') != 0
    AND dbo.fPhotoFlags('DEBLEND_NOPEAK') != 0
    AND dbo.fPhotoFlags('PSF_FLUX_INTERP') != 0
    AND dbo.fPhotoFlags('BAD_COUNTS_ERROR') != 0
    AND dbo.fPhotoFlags('INTERP_CENTER') != 0
    AND p.SpecObjID = s.SpecObjID