Precision photometric redshift calibration for galaxy-galaxy weak lensing†
† Based in part on observations undertaken at the European Southern Observatory (ESO) Very Large Telescope (VLT) under Large Program 175.A-0839.
Abstract
Accurate photometric redshifts are among the key requirements for precision weak lensing measurements. Both the large size of the Sloan Digital Sky Survey (SDSS) and the existence of large spectroscopic redshift samples that are flux-limited beyond its depth have made it the optimal data source for developing methods to properly calibrate photometric redshifts for lensing. Here, we focus on galaxy-galaxy lensing in a survey with spectroscopic lens redshifts, as in the SDSS. We develop statistics that quantify the effect of source redshift errors on the lensing calibration and on the weighting scheme, and show how they can be used in the presence of redshift failure and sampling variance. We then demonstrate their use with source galaxies with spectroscopy from DEEP2 and zCOSMOS, evaluating several public photometric redshift algorithms, in two cases including a full $p(z)$ for each object, and find lensing calibration biases as low as % (due to fortuitous cancellation of two types of bias) or as high as 20% for methods in active use (despite the small mean photo-z bias of these algorithms). Our work demonstrates that lensing-specific statistics must be used to reliably calibrate the lensing signal, due to asymmetric effects of (frequently non-Gaussian) photo-z errors. We also demonstrate that large-scale structure (LSS) can strongly impact the photo-z calibration and its error estimation, due to a correlation between the LSS and the photo-z errors, and argue that at least two independent degree-scale spectroscopic samples are needed to suppress its effects. Given the size of our spectroscopic sample, we can reduce the galaxy-galaxy lensing calibration error well below current SDSS statistical errors.
keywords:
gravitational lensing – galaxies: distances and redshifts
1 Introduction
Galaxy-galaxy lensing is the deflection of light from distant source galaxies due to the matter in more nearby lens galaxies. In the weak regime, gravitational lensing induces 0.1–10% level tangential shear distortions of the shapes of background galaxies around foreground galaxies, allowing direct measurement of the galaxy-matter correlation function around galaxies. Due to the very small signal, typical measurements involve stacking thousands of lens galaxies to get an averaged lensing signal.
Since the initial detections of galaxy-galaxy (g-g) lensing (Tyson et al., 1984; Brainerd et al., 1996; Hudson et al., 1998; Fischer et al., 2000; Smith et al., 2001; McKay et al., 2001), it has been used to address a wide variety of astrophysical questions using data from numerous sources. These applications include (but are not limited to) determining the relation between stellar mass, luminosity, and halo mass to constrain models of galaxy formation (Hoekstra et al., 2005; Heymans et al., 2006a; Mandelbaum et al., 2006c); understanding the relation between halo mass from lensing and bias from galaxy clustering to constrain cosmological parameters (Sheldon et al., 2004; Seljak et al., 2005); measuring galaxy density profiles (Hoekstra et al., 2004; Mandelbaum et al., 2006b); and understanding the extent of tidal stripping of the matter profiles of cluster satellite galaxies (Natarajan et al., 2002; Limousin et al., 2007). In the future, galaxy-galaxy lensing will be used for geometrical tests that constrain the scale factor and curvature of the Universe (Jain & Taylor, 2003; Bernstein & Jain, 2004; Bernstein, 2006). As data continue to pour in, and future surveys are planned with even greater statistical power, the time has come to place galaxy-galaxy lensing on a firmer foundation by addressing systematics to greater precision.
The g-g lensing signal calibration depends on several systematics, including the calibration of the shear (Heymans et al., 2006b; Massey et al., 2007) and theoretical uncertainties such as galaxy intrinsic alignments (Agustsson & Brainerd, 2006; Altay et al., 2006; Heymans et al., 2006c; Mandelbaum et al., 2006b; Faltenbacher et al., 2007), both areas in which there is significant ongoing work. Here, we focus on the proper calibration of the source redshift distribution for galaxy-galaxy lensing in the case where all lens redshifts are known. The SDSS has the rather unique capability of offering spectroscopic redshifts for all lenses, which both removes any calibration bias due to error in lens redshift estimation, and also allows us to compute the signal as a function of physical transverse (instead of angular) separation from the lenses, simplifying theoretical interpretation. While several theoretical studies have estimated the effects of photo-z errors for shear-shear autocorrelations (Huterer et al., 2006; Ma et al., 2006; Abdalla et al., 2007; Bernstein & Ma, 2007), we present the first such analysis for galaxy-galaxy lensing, in which we not only offer statistics to use to evaluate the calibration bias, but also carry out an analysis with attention to practical issues such as sampling variance in the calibration sample. This work will therefore enable future g-g lensing analyses with other datasets to address other scientific questions, and reveal potential issues with spectroscopic calibration of photo-z’s that are more general than just g-g lensing. We also address the extension of these techniques to galaxy-galaxy lensing without lens redshifts, and to cosmic shear, in Appendix A.
Currently, there are two methods used for source redshift determination in g-g lensing. The first is the use of an average redshift distribution for the sources. The primary difficulty with this method is finding a sample of galaxies with spectroscopy that has the same selection criteria as the source galaxies. Weak lensing requires well-determined shapes for each source, so a lensing source catalog is not purely flux-limited, and literature estimates of $dN/dz$ for flux-limited samples may not be appropriate (we show in this paper that for SDSS, the lensing-selected sample is at a higher mean redshift than the corresponding flux-limited sample at fixed magnitude). The solution is to find a spectroscopic sample that overlaps the source sample and is at least as deep, using it to determine the redshift distribution using only lensing-selected galaxies in the spectroscopic sample. For deeper lensing surveys, no such spectroscopic sample exists. In other cases, it exists but may be quite small, with large uncertainty in $dN/dz$ due to Poisson error and, more significantly, large-scale structure. The second difficulty is that without individual redshift estimates for each source, there is no way to remove sources that are physically associated with lenses from the source sample, which can lead to dilution of the lensing signal by non-lensed galaxies (a systematic that is easily controlled) and, more significantly, signal suppression due to intrinsic alignments [which cannot yet be easily controlled (Agustsson & Brainerd, 2006; Mandelbaum et al., 2006b), and which can cause contamination larger than the size of the statistical errors for small transverse separations].
The second method is to use broadband photometry to measure photometric redshifts (photo-z’s) for each source galaxy. Photo-z estimation exploits the fact that even with broad passbands, we can still learn enough about the spectral energy distribution to estimate the redshift. While photo-z estimation that yields accurate values over a wide range of redshifts for all galaxy types is difficult, there have been several recent successes in this field (Feldmann et al., 2006; Ilbert et al., 2006). To fully constrain the calibration of the g-g lensing signal, we must understand the full photo-z error distribution as a function of many parameters, particularly those relevant to galaxy-galaxy lensing, such as brightness, colour, environment, and of course redshift. Since the photo-z error distributions will depend on a complex interplay between the widths and shapes of the filter functions, the set of filters used in the photo-z estimates, the photometry error distributions, and the spectral energy distributions of the galaxies themselves, the photo-z error distributions will not be symmetric or Gaussian in general, even if the photometric errors in flux are Gaussian (the magnitude errors are not in any case, and some photo-z methods use magnitudes instead of fluxes). To be accurate, this photo-z error distribution must be determined with a sample of galaxies with the same selection criteria (depth, colour, etc.) as the source sample. This is quite important because, as the photometry gets noisier, the photo-z error distribution can not only broaden, but also develop asymmetry, tails, and other non-Gaussian properties.
So, as for methods that use a statistical source redshift distribution, we once again must find a large spectroscopic sample with the same selection criteria as our source catalog. (Some photo-z methods also require a training sample with the same selection criteria as the source sample.) The completeness and rate of spectroscopic redshift failure are both potentially important, particularly if the spectroscopic redshift failures all lie in a specific region of redshift or colour space. If a photo-z method has a significant failure fraction, then we may be forced to eliminate a large fraction of the source sample, thus increasing statistical error significantly. Three major advantages of photo-z’s for lensing are that they (1) allow us to eliminate some fraction of the physically associated lens-source pairs, thus reducing the effects of intrinsic alignments, (2) allow us to optimally weight each galaxy by the expected signal, and (3) allow us to reduce, if not eliminate, “sources” that are in the foreground from the sample entirely (a special case of optimal weighting).
We present a method to obtain robust, percent-level calibration of the g-g lensing signal using a sample of several thousand spectroscopic redshifts selected from the source sample (i.e., with the same selection criteria). The sources of spectroscopy we use to demonstrate this method are the DEEP2 and zCOSMOS surveys (described in section 2). The use of two surveys in two areas of the sky carried out with two different telescopes is important, because (a) they do not have the same patterns of redshift failure, and (b) the large-scale structure in the two surveys is not correlated with each other, so effects of sampling (cosmic) variance are reduced for the combined sample. In addition, we use space-based data for the full COSMOS sample to quantify the efficacy of our star/galaxy separation scheme.
We then use this method to analyze the redshift-related calibration bias of the lensing signal in previous g-g lensing analyses that used our SDSS source catalog (Hirata et al., 2004; Mandelbaum et al., 2005; Seljak et al., 2005; Mandelbaum et al., 2006c, b, a; Mandelbaum & Seljak, 2007). Our calibration bias analysis is quite important, as our statistical error for some applications has dropped below 5%, making our systematics requirements more stringent.
More importantly, we take a broad view, testing not just the redshift determination methods that we have used in the past, but also several new ones that have been developed in the past few years, in order to determine which ones are most useful for lensing. In the process, we determine which common photo-z failure modes and error distributions are most problematic for g-g lensing. The results of our analysis will be useful not only for SDSS g-g lensing: the method we present is also generally useful for future weak lensing analyses (and generalizable to scenarios without spectroscopy for lenses and to shear-shear autocorrelations), particularly as larger, deeper spectroscopic datasets are becoming available.
In section 2, we describe the lensing source catalog and the spectroscopic redshift samples. Section 3 includes a description of the source redshift determination algorithms that we will test in this work. In section 4, we describe our method for determining the source redshift-related calibration bias, including handling complexities such as large-scale structure. We present the results of our analysis in section 5, and discuss the implications of these results in section 6.
When computing angular diameter distances, we assume a flat cosmology with and .
2 Data
2.1 SDSS
The data used for the lensing source catalog are obtained from the SDSS (York et al., 2000), an ongoing survey to image roughly steradians of the sky, and follow up approximately one million of the detected objects spectroscopically (Eisenstein et al., 2001; Richards et al., 2002; Strauss et al., 2002). The imaging is carried out by drift-scanning the sky in photometric conditions (Hogg et al., 2001; Ivezić et al., 2004), in five bands ($ugriz$) (Fukugita et al., 1996; Smith et al., 2002) using a specially designed wide-field camera (Gunn et al., 1998). These imaging data are used to create the source catalog that we use in this paper. In addition, objects are targeted for spectroscopy using these data (Blanton et al., 2003b) and are observed with a 640-fiber spectrograph on the same telescope (Gunn et al., 2006). All of these data are processed by completely automated pipelines that detect and measure photometric properties of objects, and astrometrically calibrate the data (Lupton et al., 2001; Pier et al., 2003; Tucker et al., 2006). The SDSS is well underway, and has had seven major data releases (Stoughton et al., 2002; Abazajian et al., 2003, 2004, 2005; Finkbeiner et al., 2004; Adelman-McCarthy et al., 2006, 2007a, 2007b).
The source sample we describe was originally presented in Mandelbaum et al. (2005), hereinafter M05. It includes over 30 million galaxies from the SDSS imaging data with band model magnitude brighter than 21.8. Shape measurements are obtained using the REGLENS pipeline, including PSF correction done via re-Gaussianization (Hirata & Seljak, 2003) and with selection criteria designed to avoid various shear calibration biases. A full description of this pipeline can be found in M05.
2.2 DEEP2
The DEEP2 Galaxy Redshift Survey (Davis et al., 2003; Madgwick et al., 2003; Coil et al., 2004; Davis et al., 2005) consists of spectroscopic observation of four fields using the DEep Imaging Multi-Object Spectrograph (DEIMOS, Faber et al. 2003) on the Keck Telescope. This paper uses data from field , the Extended Groth Strip (EGS), centered at RA , Dec. (J2000) and with dimensions (Davis et al., 2007). Galaxies brighter than were observed in all four DEEP2 fields, but in the other three fields besides EGS, two colour cuts were made to exclude galaxies with redshifts below . The DEEP2 EGS sample, in contrast, includes objects of all colours with , although colour-selected objects with receive slightly lower selection weight. This is the sample from which a bright subset, , was extracted for this paper. The selection probabilities for all objects are well known, allowing us to account for this de-weighting directly, though this has little impact for this study, since only a small fraction of galaxies with useful SDSS shape measurements are fainter than , and they have little statistical weight due to their larger shape measurement errors. Due to saturation of the CFHT detectors used for target selection, no galaxies brighter than were targeted; these galaxies constitute a very small fraction of our source sample.
For this paper, we use all EGS data collected through the spring of 2005, a parent catalog of more than 13 000 spectra (Davis et al., 2007). The DEEP2 EGS objects with (the limit of our source catalog) that failed to yield redshifts in initial DEEP2 analyses were re-examined in detail; after this effort, the net redshift success rate (defined as DEEP2 quality 3 or 4) was 96%, significantly higher than for the full EGS sample. The positions of the DEEP2 EGS matches in our source catalog are shown in the right panel of Fig. 1. There are SDSS galaxies in this region with matches in DEEP2 at . Roughly 65% of those pass the lensing selection, leaving us with a sample of .
2.3 zCOSMOS
The other redshift survey used for this work is zCOSMOS (Lilly et al., 2007), which uses the Visible Multi-Object Spectrograph (VIMOS, LeFevre et al. 2003) on the 8-m European Southern Observatory’s Very Large Telescope (ESO VLT) to obtain spectra for galaxies in the COSMOS field, which is 1.7 deg$^2$ centered at RA , Dec. . We use data from the zCOSMOS-bright survey, which is purely flux-limited to , well beyond the flux limit of our source catalog, and currently contains galaxies (Lilly et al., in prep.). Observations began in 2005 and will take at least three years to complete.
One important benefit of the zCOSMOS data is that due to its location in the Cosmological Evolution Survey (COSMOS) field (Capak et al., 2007; Scoville et al., 2007b, a; Taniguchi et al., 2007), there is very deep broadband observing data from a variety of telescopes in addition to a single passband observation from the Advanced Camera for Surveys (ACS) on the Hubble Space Telescope (HST). This photometry has been used to generate extremely high-quality photometric redshifts using the Zurich Extragalactic Bayesian Redshift Analyzer (ZEBRA, Feldmann et al. 2006), which will be described further in Section 3, and several other photo-z codes (Mobasher et al., 2007). Using data with , , , , , , , and photometry, the photometric redshift accuracy for the bright, selected sample is remarkable, . This accuracy is achieved using % of the zCOSMOS sample as a training set. In cases of spectroscopic redshift failure, these nearly noiseless photo-z’s can be used instead. We will demonstrate explicitly that the effect on the estimated lensing redshift calibration bias of using their photo-z’s for redshift failures is within the statistical error. Consequently, the nominal 8% spectroscopic redshift failure rate for zCOSMOS galaxies in our source catalog is effectively zero for our purposes.
The HST imaging in the full COSMOS field was also used for another test because it enables star/galaxy separation to be performed more accurately than in SDSS. Consequently, we use the full COSMOS galaxy sample to match against our source catalog and identify the stellar contamination fraction to high accuracy.
The positions of the zCOSMOS matches in our source catalog are shown in the left panel of Fig. 1. We have spectra in an area covering square degrees, 88% of the eventual area of the zCOSMOS survey. The sampling is denser in some regions than in others (and will eventually be filled out evenly in the full area). In this region, there are SDSS galaxies with ; roughly 65% pass our lensing selection cuts, leaving us with matches in the source catalog.
3 Redshift determination algorithms
Here we describe the source redshift determination algorithms in more detail. We begin with those used in our current lensing source catalog, for which we want to assess calibration biases in past works, then describe methods that have more recently become available.
3.1 Previous methods
In our catalog, which was created in 2004, we used three approaches to source redshift determination, all described in detail in M05. For the sources, we used photometric redshifts from kphotoz v3_2 (Blanton et al., 2003a) and their error distributions determined using a sample of galaxies in the DEEP2 EGS. We also required to avoid contamination from physically associated lens-source pairs. For the sources, we used a source redshift distribution from DEEP2 EGS (from fitting to redshifts), which means that we lack individual redshift estimates for each source. The sample of redshifts used for this early work with the EGS was a factor of smaller than the EGS sample used for this work, or a factor of smaller than the combined EGS and zCOSMOS sample used here. For the high-redshift LRG source sample (see selection criteria in M05), we used well-calibrated photometric redshifts and their error distributions determined using data from the 2dF-SDSS LRG and Quasar Survey (2SLAQ), as presented in Padmanabhan et al. (2005).
3.2 New options
There are several relatively new photo-z options for SDSS data, all of which have relatively low failure rates of %. The first is available in the SDSS DR5 (data release 5) skyserver “Photoz” table (Budavári et al., 2000; Csabai et al., 2003). The photo-z’s for this template method are determined by fitting observed galaxy colours to empirical templates from Coleman et al. (1980) extended using spectral synthesis models. There is an additional step (not used for all template methods) in which the templates are iteratively adjusted using a training sample. We have performed our tests on both the DR5 and DR6 template photo-z’s, and found no significant differences in performance between the two.
The second new option is available in the SDSS DR6 skyserver in the “Photoz2” table. These photo-z’s were computed using a neural net (NN) algorithm similar to that of Collister & Lahav (2004) trained using a training set from many data sources combined: SDSS spectroscopic samples, 2SLAQ, CFRS, CNOC2, DEEP, DEEP2, and GOODS-N. A more complete description of both NN photo-z’s in the DR6 database can be found in Oyaizu et al. (2007): the “CC2” photo-z’s use colours and concentrations, while the “D1” photo-z’s use magnitudes and concentrations. In the text, we will describe any difference between the DR5 and DR6 results; Oyaizu et al. (2007) recommend against using the DR5 photo-z’s for science applications now that the improved DR6 versions exist.
The third new option we test is the ZEBRA (Feldmann et al., 2006) algorithm, which has already been successfully used with much deeper imaging data in the COSMOS field. This method involves template-fitting, but also takes a flux-limited sample of galaxies (without spectroscopic redshifts) from the data source for which we want photo-z’s. These data are used to create a Bayesian modification of the likelihoods based on the redshift distribution for the full sample (Brodwin et al., 2006) and on its template distribution. In practice, this prior helps avoid scatter to low redshifts. A key question we will address is how this algorithm behaves with the significantly noisier SDSS photometry. To avoid confusion, we will refer to the high-quality ZEBRA photo-z’s derived using the deep photometry in the COSMOS field as “ZEBRA” photo-z’s, and the ZEBRA photo-z’s using the much shallower SDSS photometry as “ZEBRA/SDSS” photo-z’s.
To be specific about the training method, to get the ZEBRA/SDSS photo-z’s, half of a flux-limited sample of SDSS galaxies with zCOSMOS redshifts is used for template optimization. This part of the analysis includes fixing the redshifts of those galaxies to the spectroscopic redshift, finding the best-fitting template, and optimizing it as described in Feldmann et al. (2006). Then, a sample of SDSS galaxies (flux-limited to ) without spectra is used to iteratively compute the template-redshift prior.
3.3 Effects of photo-z error for lensing
Finally, we clarify the effects of photo-z error on the lensing calibration:
• A positive photo-z bias, defined as a positive mean difference between the photo-z and the true redshift, will lower the signal (because the critical surface density, defined below in Eq. 2, will be underestimated).
• A negative photo-z bias will raise the signal.
• Photo-z scatter will usually lower the signal due to the shape of the critical surface density near . This effect can be very significant for sources at redshifts below , where is the size of the scatter.
The last point is very important for a shallow survey like SDSS when the lens redshift is above , because of the large number of sources within a few of the lens redshift. For a deeper survey such as the Canada-France-Hawaii Telescope Legacy Survey (CFHTLS), with lenses and sources separated by on average this effect may in fact be negligible. The effects of photo-z bias are important not just in the mean, but as a function of redshift. If low redshift sources have nonzero photo-z bias, and high redshift sources have nonzero photo-z bias in the opposite direction, so that the mean photo-z bias for the full sample is zero, the effect of the opposing photo-z biases on lensing calibration will not, in general, cancel out, since the effect on lensing calibration tends to be more significant for the sources that are closer to the lenses.
Catastrophic photo-z errors are those that are well beyond the typical scatter, typically occurring due to some systematic error, colour-redshift degeneracy, or other problem (and by definition, these photo-z’s are not flagged as problematic by the algorithm, so they can only be identified using a spectroscopic sample with similar selection to the target sample). The catastrophic error rate may be important, depending on the type of catastrophic error. For example, sending a few percent of the sources to will not lead to calibration bias; it will simply lead to that fraction of the sources not being included because they have , causing a percent-level increase in the final error. In short, it is clear that the three metrics often used to quantify the accuracy of photo-z methods (the mean bias, scatter, and catastrophic failure rate) are not sufficient to quantify the efficacy of a photo-z method for lensing. In this paper, we will introduce a metric that is optimized towards understanding the effects of photo-z’s on galaxy-galaxy lensing calibration, and present results for the photo-z mean bias, scatter, and catastrophic failure rate only as a means of understanding the results for our lensing-optimized metric. For other science applications, the optimal metric may be quite different from what we present here.
4 Methodology
4.1 Theory
Galaxy-galaxy lensing measures the tangential shear distortions in the shapes of background galaxies induced by the mass distribution around foreground galaxies (for a review, see Bartelmann & Schneider 2001). The result is a measurement of the shear-galaxy cross-correlation as a function of relative foreground-background separation on the sky. We will assume that the redshift of the foreground galaxy is known, so we express the relative separation in terms of transverse comoving scale $R$. One can relate the shear distortion $\gamma_t$ to $\Delta\Sigma(R) = \bar{\Sigma}(<R) - \Sigma(R)$, where $\Sigma(R)$ is the surface mass density at the transverse separation $R$ and $\bar{\Sigma}(<R)$ its mean within $R$, via
$$\gamma_t = \frac{\Delta\Sigma}{\Sigma_c}. \tag{1}$$
Here we use the critical mass surface density,
$$\Sigma_c = \frac{c^2}{4\pi G}\,\frac{D_S}{D_L\, D_{LS}\,(1+z_L)^2}, \tag{2}$$
where $D_L$ and $D_S$ are angular diameter distances to the lens and source, $D_{LS}$ is the angular diameter distance between the lens and source, and the factor of $(1+z_L)^{-2}$ arises due to our use of comoving coordinates. For a given lens redshift, $\Sigma_c^{-1}$ rises from zero at $z_s = z_L$ to an asymptotic value at $z_s \gg z_L$; that asymptotic value is an increasing function of lens redshift.
In this work, we focus on calibration bias in $\Delta\Sigma$ due to bias in $\Sigma_c$ arising from source redshift uncertainty.
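As a concrete illustration of Eq. (2), the following minimal sketch evaluates $\Sigma_c^{-1}$ in comoving units for a flat cosmology. The parameter values `omega_m=0.3` and `h=0.7`, the function names, and the choice of Simpson integration are illustrative assumptions made here, not values or code taken from the analysis.

```python
import math

C_KM_S = 299792.458   # speed of light [km/s]
G_PC = 4.30091e-3     # Newton's constant [pc (km/s)^2 / Msun]

def comoving_distance(z, omega_m=0.3, h=0.7, n_steps=512):
    """Comoving distance [Mpc] in flat LCDM via Simpson's rule (n_steps even).
    omega_m and h are placeholder values, not those of the paper."""
    if z <= 0.0:
        return 0.0
    def inv_e(x):
        return 1.0 / math.sqrt(omega_m * (1.0 + x) ** 3 + 1.0 - omega_m)
    dz = z / n_steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * inv_e(i * dz)
    return (C_KM_S / (100.0 * h)) * total * dz / 3.0

def inv_sigma_crit(z_l, z_s, omega_m=0.3, h=0.7):
    """Sigma_c^{-1} in comoving units [pc^2 / Msun].
    Returns zero for sources at or in front of the lens."""
    if z_s <= z_l:
        return 0.0
    chi_l = comoving_distance(z_l, omega_m, h)
    chi_s = comoving_distance(z_s, omega_m, h)
    d_l = chi_l / (1.0 + z_l)               # angular diameter distances [Mpc]
    d_s = chi_s / (1.0 + z_s)
    d_ls = (chi_s - chi_l) / (1.0 + z_s)    # valid only for a flat universe
    prefactor = 4.0 * math.pi * G_PC / C_KM_S ** 2   # 4 pi G / c^2 [pc / Msun]
    # 1e6 converts the distance combination from Mpc to pc
    return prefactor * (d_l * d_ls / d_s) * 1.0e6 * (1.0 + z_l) ** 2
```

Evaluating `inv_sigma_crit` at fixed `z_l` for increasing `z_s` reproduces the behaviour described above: it is zero up to the lens redshift and then rises monotonically.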
4.2 Redshift calibration bias determination
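Here, we present a method for testing the accuracy of source redshift determination that is optimized towards g-g lensing. Formally, we wish to calculate the differential surface density $\Delta\Sigma$ using our estimator $\widetilde{\Delta\Sigma}$, which is defined as a weighted sum over lens-source pairs $j$,
$$\widetilde{\Delta\Sigma} = \frac{\sum_j \tilde{w}_j\, \tilde{\gamma}_t^{(j)}\, \tilde{\Sigma}_{c,j}}{\sum_j \tilde{w}_j}. \tag{3}$$
To isolate the dependence of calibration on redshift-related quantities, we will assume that the estimated tangential shear, $\tilde{\gamma}_t$, is unbiased. $\tilde{\Sigma}_{c,j}$ (derived from our source redshift estimator) is the critical surface density estimated for a given lens-source pair $j$. The weights for each lens-source pair are determined using redshift information as well:
$$\tilde{w}_j = \frac{1}{\tilde{\Sigma}_{c,j}^{2}\,\left(e_{\rm rms}^2 + \sigma_{e,j}^2\right)}, \tag{4}$$
where $e_{\rm rms}$ is the rms ellipticity per component for the source sample (shape noise), and $\sigma_{e,j}$ is the ellipticity measurement error per component.
We want to relate our estimated $\widetilde{\Delta\Sigma}$ to the true $\Delta\Sigma$. To do so, we use the relation between the measured shear and $\Delta\Sigma$, Eq. (1). Putting equation 1 into equation 3 (assuming $\tilde{\gamma}_t = \gamma_t$), we define the redshift calibration bias $b_z$ via
$$b_z + 1 = \frac{\widetilde{\Delta\Sigma}}{\Delta\Sigma} = \frac{\sum_j \tilde{w}_j\, \Sigma_{c,j}^{-1}\, \tilde{\Sigma}_{c,j}}{\sum_j \tilde{w}_j}, \tag{5}$$
a weighted sum of the ratio of the estimated to the true critical surface density.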
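The weighted sum defining the redshift calibration bias can be sketched for a single lens redshift as follows. This is only an illustration: it assumes the weights scale as $\tilde{\Sigma}_c^{-2}/(e_{\rm rms}^2+\sigma_e^2)$, uses a closed-form Einstein-de Sitter distance purely to keep the example short (at fixed lens redshift only the ratio of comoving distances matters), and the default `e_rms=0.36` and all function names are hypothetical.

```python
import math

def chi_eds(z):
    """Comoving distance in units of 2c/H0 for an Einstein-de Sitter universe
    (a simplifying assumption for this sketch only)."""
    return 1.0 - 1.0 / math.sqrt(1.0 + z)

def lens_efficiency(z_l, z_s):
    """Quantity proportional to Sigma_c^{-1} at fixed z_l; zero if z_s <= z_l."""
    if z_s <= z_l:
        return 0.0
    return 1.0 - chi_eds(z_l) / chi_eds(z_s)

def calibration_bias(z_l, z_spec, z_phot, sigma_e, e_rms=0.36):
    """Return b_z for one lens redshift from matched spectroscopic and
    photometric source redshifts. e_rms is a placeholder value."""
    num = 0.0
    den = 0.0
    for zs, zp, se in zip(z_spec, z_phot, sigma_e):
        eff_est = lens_efficiency(z_l, zp)    # from the photo-z (estimated)
        if eff_est == 0.0:
            continue                          # photo-z at/below lens: zero weight
        eff_true = lens_efficiency(z_l, zs)   # from the spectroscopic redshift
        shape_var = e_rms ** 2 + se ** 2
        # weight * (estimated Sigma_c / true Sigma_c) = eff_est * eff_true / shape_var
        num += eff_est * eff_true / shape_var
        den += eff_est ** 2 / shape_var       # the weight itself
    if den == 0.0:
        raise ValueError("no sources with photo-z above the lens redshift")
    return num / den - 1.0
```

With perfect photo-z's the function returns zero; shifting all photo-z's upward gives a negative bias (a lowered signal), and a true foreground galaxy scattered behind the lens contributes weight but no signal.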
This expression must be computed as a function of lens redshift. In the limit that the sources are at much higher redshift than the lenses, $\Sigma_c$ does not depend as strongly on the source redshift, so $b_z$ (for a given photometric redshift bias) will be smaller than if the lens redshift is just below the source redshift. For a lens sample with redshift distribution $p(z_l)$, the average calibration bias can be computed as a weighted average over the redshift distribution,
$$\langle b_z \rangle = \frac{\int dz_l\; p(z_l)\, w_l(z_l)\, b_z(z_l)}{\int dz_l\; p(z_l)\, w_l(z_l)}, \tag{6}$$
where the redshift-dependent lens weight $w_l(z_l)$ is defined as the total weight derived from all sources that contribute to the lensing signal for a given lens redshift, .
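The average over the lens redshift distribution can be approximated on a grid of lens redshifts. The sketch below uses the trapezoidal rule; the function name and the tabulated inputs (lens redshift distribution `p_l`, total lens weight `w_l`, and bias `b_z` on the same grid) are hypothetical inputs chosen for illustration.

```python
def average_bias(z_l, p_l, w_l, b_z):
    """Trapezoidal estimate of the lens-averaged calibration bias:
    integral of p * w * b_z divided by integral of p * w over lens redshift."""
    num = 0.0
    den = 0.0
    for k in range(len(z_l) - 1):
        dz = z_l[k + 1] - z_l[k]
        f0 = p_l[k] * w_l[k]
        f1 = p_l[k + 1] * w_l[k + 1]
        num += 0.5 * dz * (f0 * b_z[k] + f1 * b_z[k + 1])
        den += 0.5 * dz * (f0 + f1)
    return num / den
```

Because the result is a weighted mean, it always lies between the smallest and largest tabulated bias, and it is pulled towards the lens redshifts that carry the most weight.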
In the ideal case, we would do this calculation with a large, complete spectroscopic sample drawn at random from our source sample, sparsely sampled on the sky and therefore lacking features in the redshift distribution due to large-scale structure. We can then find $b_z$ on a grid of lens redshifts by forming the sums in equation 5 using all sources with spectra. Finally, we can use the total weight as a function of lens redshift and the lens redshift distribution to estimate the average redshift bias of the lensing signal.
To get the errors on the bias in this simple scenario, we can simply bootstrap resample our sample of source galaxies with spectroscopy. For a sample of $N$ galaxies, bootstrap resampling requires us to make many “new” galaxy samples consisting of $N$ galaxies drawn from the original sample with replacement. Assuming that the observed galaxy redshifts accurately reflect the underlying redshift distribution, and the redshifts are uncorrelated, the mean best-fit redshift distribution will reflect the true one, and the errors in the redshift calibration bias can be determined from the variance of the calibration biases for each bootstrap resampled dataset. Since the bootstrap depends on the assumption that the objects we are bootstrapping are independent, this method only gives proper errors in the case where LSS is unimportant.
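The galaxy-level bootstrap described above can be sketched generically as follows. The function name, the number of resamples, and the fixed seed are illustrative choices; `statistic` stands for any function of the resampled galaxy list, such as a calibration bias computed from (spectroscopic redshift, photo-z, shape error) tuples.

```python
import math
import random

def bootstrap_error(sample, statistic, n_boot=500, seed=0):
    """Standard deviation of `statistic` over n_boot resamples, each drawn
    with replacement and with the same size as the original sample.
    Only meaningful if the elements of `sample` are independent."""
    rng = random.Random(seed)
    n = len(sample)
    values = []
    for _ in range(n_boot):
        resample = [sample[rng.randrange(n)] for _ in range(n)]
        values.append(statistic(resample))
    mean = sum(values) / n_boot
    var = sum((v - mean) ** 2 for v in values) / (n_boot - 1)
    return math.sqrt(var)
```

For a simple statistic such as the sample mean, the result should approach the familiar standard error, which gives a quick sanity check before applying the routine to a lensing statistic.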
In general, there are several problems that mean we are no longer dealing with the ideal case. The first problem is sampling variance, since most redshift surveys are completed in a well-defined, small region of the sky. The second is the fact that most redshift surveys suffer from some incompleteness, and that incompleteness may be a function of apparent magnitude or colour, which means that the loss of those redshifts can make the spectroscopic sample no longer comparable to the full source sample. We attempt to ameliorate these problems by using two sources of spectroscopy on different areas of the sky and with different spectrographs and analysis pipelines, so that the LSS and incompleteness tendencies in each sample are different. Below, we address these deviations from the ideal case in more detail.
4.3 Effects of sampling variance
Large-scale structure can be problematic when using surveys on small regions of the sky to determine bias in the lensing signal due to photometric redshift error. The LSS may emphasize particular regions of the source redshift distribution that have unusual features in the photometric redshift errors. To avoid this problem, we would like to fit for a redshift distribution in a way that accounts properly for uncertainties due to sampling variance. There are many approaches to this problem in the literature, such as that demonstrated in Brodwin et al. (2006).
The simplest way around our aforementioned problem, that LSS causes the redshifts to be correlated so that the assumption behind the bootstrap is violated, is to bootstrap the bins in the redshift histogram instead. In the limit that the bins are significantly wider than the typical sample correlation length, the correlations within the bins will be far more important than the correlations between adjacent bins. Thus, the requirement that the bootstrapped data points be independent is much closer to being fulfilled. Here, we will use redshift bins with size , where each bin is considered as a pair of points $(z_i, N_i)$. In a given bootstrapped histogram, some redshift bins will be included multiple times, others not at all, but each time a given bin is used, it has the same number of galaxies as in the real data. While this method is simplistic, it has the advantage of not requiring us to understand the details of the sample selection, since the lensing selection is a very nontrivial cut to understand and simulate. The resulting errors on the best-fit $dN/dz$ from this bootstrap will include the effects of both Poisson error (which is non-negligible given the size of the samples used) and large-scale structure. The errors are valid assuming that there are no correlations between the Mpc-wide bins. We discuss this assumption, which depends not just on straightforward integration of the matter power spectrum but also redshift-space distortions, galaxy bias, and magnification bias, further in section 5.7.
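The bin-level bootstrap can be sketched in a few lines: each histogram bin is treated as one data point, a (bin centre, galaxy count) pair, and whole bins are drawn with replacement. The function name and the example histogram values are hypothetical.

```python
import random

def bootstrap_histogram(bins, rng):
    """Draw len(bins) histogram bins with replacement.
    `bins` is a list of (z_center, n_galaxies) pairs; a bin drawn several
    times keeps its original galaxy count each time it appears."""
    n_bins = len(bins)
    return [bins[rng.randrange(n_bins)] for _ in range(n_bins)]

# Hypothetical histogram: four bins with their galaxy counts
example_bins = [(0.05, 10), (0.15, 40), (0.25, 25), (0.35, 5)]
resampled = bootstrap_histogram(example_bins, random.Random(1))
```

Each resampled histogram would then be fitted with the redshift distribution model, so the spread of the fitted parameters reflects both Poisson noise and bin-to-bin structure.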
For each bootstrapped histogram with bins centered at containing galaxies each, we minimize the function
(7) 
via summation over redshift bins . is the number of galaxies predicted to lie in bin given the model for , i.e.
(8) 
For each bootstrapped histogram, we also impose a normalization condition on the fit that (the total number of galaxies in the spectroscopic sample). In the case of Poisson error, the natural choice for is . However, in the presence of LSS, which contributes significantly to the variance in each bin, the distribution of values in each bin is, in fact, unknown, so the optimal weighting scheme is unclear. Consequently, we use the simplest possible weighting scheme, for all . We have, however, confirmed that if we do use , then the changes in the best-fit redshift distribution parameters, and the implied changes in redshift calibration bias, are well below the level.
Our two-parameter model for the redshift distribution is
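As a rough sketch of the fitting step, the quantity minimized can be pictured as a weighted sum of squared differences between observed and predicted bin counts, with the prediction rescaled to the total sample size. The exact form of Eq. (7) is not reproduced here, so the weighted-squared-residual form below is an assumption, and both function names are hypothetical.

```python
def fit_objective(observed, predicted, weights=None):
    """Weighted sum of squared residuals over redshift bins.

    ASSUMPTION: a squared-residual form is used for illustration.
    With weights=None every bin gets unit weight, the simplest scheme.
    """
    if weights is None:
        weights = [1.0] * len(observed)
    return sum(w * (o - p) ** 2 for w, o, p in zip(weights, observed, predicted))


def normalize_to_total(predicted, n_total):
    """Rescale predicted bin counts so that they sum to n_total."""
    scale = n_total / sum(predicted)
    return [p * scale for p in predicted]


# Hypothetical counts, for illustration only.
observed = [40, 120, 150, 90, 30]
model = normalize_to_total([1.0, 3.0, 3.5, 2.0, 0.8], sum(observed))
value = fit_objective(observed, model)
```

Passing an explicit `weights` list allows a Poisson-motivated weighting to be compared against the unit-weight choice.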
(9) 
which has mean redshift
(10) 
This choice is based purely on the empirical observation that it describes the shape of the redshift distribution better than the many other functional forms that we tried; the addition of extra parameters did not significantly improve the best-fit . In particular, allowing the power law inside the exponent to vary from (a common choice) did not lead to any significant change in the best-fit redshift distribution below , where the vast majority of the galaxies are located. The changes above that redshift are marginally statistically significant, but there are so few sources above that redshift that our final results for the redshift bias do not change within the statistical error.
We will present best-fit redshift distributions for zCOSMOS and DEEP2 EGS separately to demonstrate that the results are consistent within the errors. We then use both samples combined to create an overall redshift distribution.
This distribution is crucial to our scheme to avoid sampling variance effects in the determination of the redshift calibration bias. To counterbalance regions of source redshift space that are over- or under-represented in our spectroscopic sample due to LSS fluctuations, we incorporate an additional weight into the calculation of the redshift bias in Eq. (5). For a galaxy in redshift bin in our histogram, the LSS weight () is the ratio of the number of galaxies predicted to lie in bin from our best-fit redshift distribution, to the number actually found in that bin (). Thus, those regions in redshift space with too many/few galaxies due to LSS or Poisson fluctuations will be down/up-weighted appropriately. We can then get errors on the average redshift bias by using the best-fit redshift histogram for each bootstrap-resampled histogram to derive the LSS weights. This procedure incorporates uncertainty in the source redshift distribution appropriately, since we never need to bootstrap the galaxies themselves.
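The LSS weight defined above is a per-bin ratio of predicted to observed counts, which can be sketched as below. The function name is hypothetical, and the handling of empty bins (weight set to zero, since no galaxy carries it) is an assumption made for the illustration.

```python
def lss_weights(predicted_counts, observed_counts):
    """Per-bin LSS weight: predicted count divided by observed count.

    Bins with too many galaxies relative to the smooth fit get weight < 1,
    and bins with too few get weight > 1. Empty bins contain no galaxies to
    weight, so their weight is set to 0.0 (an illustrative convention).
    """
    weights = []
    for n_model, n_obs in zip(predicted_counts, observed_counts):
        weights.append(n_model / n_obs if n_obs > 0 else 0.0)
    return weights


# Hypothetical counts: the first bin is over-dense, the second under-dense.
weights = lss_weights([10.0, 20.0, 5.0], [20, 10, 0])
```

Every galaxy in a given bin receives that bin's weight when the redshift bias is averaged.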
In an analysis containing many patches of sky, the size of the errors can be verified by comparing the redshift bias computed in each patch of sky. Unfortunately, with only two patches of sky, this method is not an option for this work.
4.4 Redshift incompleteness and failures
For precision results, we require a high redshift completeness and quality. There are several tests that we can carry out to ensure that the sample is of high quality. We consider the redshift failures separately for the DEEP2 and zCOSMOS samples. In both cases, we will determine the magnitude and colour distribution of the failures relative to the full sample, to see if a particular region of redshift space is causing the problems.
For zCOSMOS, there are high-quality photoz’s derived from very deep photometry, which we can use in the case of spectroscopic redshift failure. To control for any effect on the computed redshift calibration bias, we also check the results using the zCOSMOS photoz’s for a larger portion of the full sample, to ensure that noise in these photoz’s has a negligible effect on the results.
For DEEP2 EGS, we lack redshift estimates for the failures. To place a very conservative bound on the effect of failures on the estimated calibration bias, we estimate the redshift bias with all the failures forced to , and then to . For both surveys, we will compare the ranges of colours and redshifts spanned by the successes and failures, to ensure that our procedures for handling redshift failure are justified.
The next issue is the quality of the non-failed redshifts, which in DEEP2 are assessed by visual inspection and repeat observations, and in zCOSMOS using the photoz’s as well. For DEEP2, we have used only and redshifts, which are 96% of our sample, and are estimated to be % and % reliable. For zCOSMOS, the reliabilities for and objects (92% of our sample) are %. For this survey we also use , those with slightly lower quality in principle but with extremely good matches between the spectroscopic and photometric redshift, and (single-line redshifts with good matches between the spectroscopic and photometric redshifts, which in this apparent magnitude and redshift range are usually from H), both of which are also % reliable as determined from repeat observations.
In the DEEP2 EGS, there are also minor selection effects to control for. The first effect is the fact that no galaxies brighter than were targeted. Galaxies brighter than that limit constitute only 4% of the source sample, but we nonetheless include tests of the effect this has on the result.
The other selection effect in DEEP2 EGS occurs at magnitudes fainter than , where objects are given slightly lower selection weights than higherz galaxies. While the fraction of source galaxies fainter than this magnitude is only %, we use their selection probabilities to properly compensate for this effect. To be explicit, the total weight for each source is thus a product of lensing weight , the LSS weight , and (or 1 for the zCOSMOS galaxies).
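A minimal sketch of how the three factors combine per source is given below. The third factor is taken here to be the inverse of the DEEP2 selection probability, which is an assumption made for the illustration (the text states only that the selection probabilities are used to compensate). The function and argument names are hypothetical.

```python
def total_weight(lensing_weight, lss_weight, selection_prob=None):
    """Combine per-source weights into one multiplicative weight.

    ASSUMPTION: targeting is compensated by the inverse selection
    probability. Pass selection_prob=None for sources with no targeting
    correction (factor of 1, as for the zCOSMOS galaxies).
    """
    selection_factor = 1.0 if selection_prob is None else 1.0 / selection_prob
    return lensing_weight * lss_weight * selection_factor


# Hypothetical values, for illustration only.
w_zcosmos = total_weight(2.0, 0.5)
w_deep2_faint = total_weight(2.0, 0.5, selection_prob=0.5)
```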
Finally, we clarify our statement that our method requires the spectroscopic sample used to evaluate photoz’s to be comparable to the source sample. As demonstrated above, it is possible to use weights to account for well-defined targeting priorities that might make the spectroscopic sample slightly non-representative of the source catalog. Thus, our statement that we require the spectroscopic sample to be comparable to the source sample is really a statement that it must contain all galaxy types (spectral types, magnitudes, etc.) in the source sample with representation levels that are sufficient to overcome the noise. If some reweighting is necessary to account for under- or over-representation of a given population, then for our purposes, this is sufficient to fulfill our requirements. Thus, one could not use a spectroscopic sample with a strict cutoff two magnitudes brighter than the flux limit of the source catalog. One could use a spectroscopic sample that has a lower redshift success rate for fainter galaxies, as long as that lower success rate is due to statistical error, so that the failures have the same redshift distribution as the successes, rather than some systematic error (e.g. inability to determine redshifts for any object of a particular spectral type above some cutoff redshift). Reweighting schemes to account for different fractions of various galaxy populations in the training and photometric samples are being successfully used by the SDSS neural net photoz group to predict redshift distributions and photoz error distributions in the photometric samples (Lima, Cunha, Oyaizu, Lin, Frieman, 2007, in prep.).
4.5 Direct use of photoz’s
Here, we explain our use of photoz’s directly for estimation. One might argue that since we have a spectroscopic sample, we should estimate using a deconvolved photoz error distribution. However, in this paper we test the use of photoz’s directly, for several reasons.
First, as we have argued previously, a key advantage of using photoz’s is that we can eliminate intrinsically-aligned sources. Once we start eliminating sources from the sample on the basis of detailed cuts on photoz, colour, or apparent magnitude, we would have to re-estimate the photoz error distribution for the sample that passes these cuts and redo the deconvolution procedure. This is computationally expensive and potentially difficult to do robustly if the cuts leave too few spectroscopic galaxies to sample the photoz error distribution properly, so that it is poorly determined. We would therefore like to find a photoz method that can lead to accurate lensing calibration on its own.
There is, in principle, one simple option that might improve the lensing calibration and that can be done without full deconvolution: we can correct each photoz for the mean photoz bias. To be accurate, this should be done as a function of galaxy colour and magnitude. We will test the results of doing so for one of the photoz methods when we present the results of our analysis.
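The correction described above can be sketched as a lookup table of the mean photoz bias in bins of colour and magnitude, built from the spectroscopic calibration sample and then subtracted from each photoz. The binning scheme, function names, and example values are hypothetical; this is an illustration under those assumptions rather than the procedure actually applied.

```python
from collections import defaultdict


def bin_index(value, edges):
    """Index of the bin containing value; out-of-range values are clamped."""
    k = 0
    while k < len(edges) - 2 and value >= edges[k + 1]:
        k += 1
    return k


def mean_bias_table(calib, colour_edges, mag_edges):
    """Mean (z_phot - z_spec) per (colour bin, magnitude bin).

    calib is an iterable of (colour, magnitude, z_phot, z_spec) tuples.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for colour, mag, z_phot, z_spec in calib:
        key = (bin_index(colour, colour_edges), bin_index(mag, mag_edges))
        sums[key] += z_phot - z_spec
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def corrected_photoz(colour, mag, z_phot, table, colour_edges, mag_edges):
    """Subtract the tabulated mean bias; unpopulated cells are left as-is."""
    key = (bin_index(colour, colour_edges), bin_index(mag, mag_edges))
    return z_phot - table.get(key, 0.0)


# Hypothetical calibration sample and bin edges, for illustration only.
colour_edges = [0.0, 1.0, 2.0]
mag_edges = [18.0, 20.0, 22.0]
calib = [(0.5, 19.0, 0.30, 0.25), (0.6, 19.5, 0.40, 0.35)]
table = mean_bias_table(calib, colour_edges, mag_edges)
z_new = corrected_photoz(0.5, 19.2, 0.50, table, colour_edges, mag_edges)
```

Note that this removes only the mean offset in each cell; the scatter about that mean is untouched.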
The final reason to use photoz’s directly is that this is the approach taken in many lensing papers to date, and we would like to test the accuracy of what is currently done in the field to see what improvements need to be made. In section 5.9, we will consider using a full as a new alternative approach to using the photoz alone.
5 Results: application to SDSS lensing
5.1 Matching results
There are 1013 and 1825 galaxies in our source catalog with spectra from DEEP2 EGS and zCOSMOS, respectively (including redshift failures). We now characterize these matches relative to the entire source catalog and to each other.
Figure 2 shows the redshift histograms for matches between the source catalog and the zCOSMOS and DEEP2 samples. The zCOSMOS histogram is shown both with and without precision photometric redshifts for the redshift failures, whereas for DEEP2, the failures (4%) were excluded entirely. As shown, there is significant large-scale structure in the redshift histograms, but it is not correlated between the two samples. Visually, the redshift histogram for DEEP2 appears to be at slightly higher redshift on average. We assess the statistical significance of any differences below.
Figure 3 shows the distribution of apparent band magnitude for the zCOSMOS and DEEP2 matches relative to that of the entire source catalog, . The apparent magnitude histogram for zCOSMOS is quite similar to that for the full source catalog (within the noise), and the failures are predominantly at the faint end. The apparent magnitude histogram for DEEP2 shows the deficit at (4% of the sample) due to targeting constraints.
Of the matches, of those in zCOSMOS (8%) and of those in DEEP2 (4%) are redshift failures (where failures are defined as having redshift success rates below 99%). In Fig. 4, we show the distributions of various quantities for the zCOSMOS and DEEP2 failures as compared with the full sample. Fig. 3 shows the relation of the failures to the general sample as a function of apparent magnitude; the top part of Fig. 4 shows that the colour distribution for the failures is similar to the colour distribution for the successes. We thus have no reason to believe the failures lie in a particular region of redshift space. The DEEP2 failures lie in the colour locus, just like the majority of the successes in this bright subsample of the EGS data. (This is not true for deeper redshift samples, such as the other DEEP2 fields, where failures typically occur for blue, galaxies. The flux and apparent size cuts imposed on our sample essentially remove any such galaxies.) Inspection of the DEEP2 spectra suggests that the redshift distribution is similar to that for the successes, with failures due to bad astrometry, a bad column running through the spectrum, or similar failures that do not correlate with redshift. We also show the zCOSMOS photoz error distribution as a function of redshift in the bottom of Fig. 4 for spectroscopic redshift successes. The photoz errors for this sample are indeed as small as, or even smaller than, those presented elsewhere for these photoz’s (Feldmann et al., 2006). We may view this error as a “systematic floor” to the error, with the increase in error for the ZEBRA/SDSS photoz’s being ascribed to the much noisier photometry. We will see that this statistical error dominates the error budget.
Next, we present redshift distributions for each survey separately, with two purposes: (1) to demonstrate that they are consistent with being drawn from the same underlying redshift distribution, and (2) to determine the weights to compensate for sampling variance as described in section 4.3.
Fig. 5 shows the observed and best-fit redshift histograms for zCOSMOS, DEEP2, and both surveys combined. Table 1 shows the corresponding best-fit parameters from Eq. (9). The weighting to account for the DEEP2 selection at causes a negligible change in the results. By bootstrapping the redshift histogram as described in section 4.3, we have determined the median predicted number of galaxies in each bin, and the 68% confidence limits on that number, as shown on the plot. Because we have imposed a normalization condition on the fit, the error bars are correlated between various parts of the histogram. We can see from the plot and Table 1 that while the DEEP2 sample is at slightly higher redshift on average, the redshift distributions from zCOSMOS and DEEP2 are consistent with each other within the (Poisson plus LSS) errors. While it is difficult to compare the curves for , where the number of galaxies has declined sharply, we can compare the total fraction of the sample with to show that they are consistent: for DEEP2 EGS, this fraction lies between at the 68% CL; for zCOSMOS, between . These limits were determined using the fraction above for the best-fit for 200 bootstrap-resampled redshift histograms, and therefore include both Poisson error and sampling variance. It is clear that any discrepancy between the best-fit zCOSMOS and DEEP2 redshift histograms with respect to the fraction of the sample above is not significant at the 68% CL.
As shown in the lower left panel of Fig. 5, there is no systematic tendency for the observed and best-fit for the full sample to deviate from each other, only Poisson and LSS fluctuations, so the form we have chosen for is acceptable. (The fluctuations are quite large for because the best-fit drops below , so discreteness will cause the ratio of to be either zero or some large number.)
It is important to note that this plot is the unweighted redshift distribution; inclusion of the lensing weights in Eq. (4) will change the effective source redshift distribution.
Sample  

zCOSMOS  
DEEP2 EGS  
Both 
5.2 Photoz error distributions
As a way of understanding the trends in our lensing-optimized photoz error statistic , we first examine the photoz error distribution as a function of redshift. Figure 6 shows the photoz error as a function of the (true) redshift for the lensing-selected galaxies from zCOSMOS and DEEP2 for the photoz algorithms tested in this work. The galaxies are divided by apparent magnitude into three samples with , , and , and we show the 68% CL errors determined in bins of size for each apparent magnitude bin. For all methods, the error distributions tend to be highly non-Gaussian, often skewed and with significant tails. While the requirement that makes skewness inevitable at low even for a well-behaved photoz estimator, the effect persists to such high redshift for all methods that this constraint is clearly not the cause. Thus, the 68% confidence limits as a function of redshift are more useful than a calculation of the average photoz bias and scatter. Nonetheless, we do tabulate the mean bias and the overall scatter in Table 2 for each method, for the full sample and the subset (to facilitate comparison between kphotoz, used only for , and the other methods).
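For a skewed, non-Gaussian error distribution, 68% confidence limits in a redshift bin can be read directly from the sorted errors rather than from a mean and standard deviation. The sketch below takes the central interval between the 16th and 84th percentiles; that choice of a central interval, the interpolation rule, and the function names are assumptions made for the illustration.

```python
def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("need at least one value")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1.0 - frac) + sorted_vals[hi] * frac


def central_68(errors):
    """Lower and upper limits enclosing the central 68% of the errors.

    ASSUMPTION: the interval is taken between the 16th and 84th percentiles.
    """
    ordered = sorted(errors)
    return percentile(ordered, 16.0), percentile(ordered, 84.0)


# Hypothetical photoz errors (z_phot - z_spec) in one redshift bin.
errors = [-0.30, -0.05, -0.02, 0.00, 0.01, 0.03, 0.04, 0.06, 0.10]
lower, upper = central_68(errors)
```

Unlike a symmetric scatter, the two limits need not be equidistant from zero, which is what makes them informative for the skewed distributions discussed above.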
Method  Mean bias  Scatter 

kphotoz  ()  () 
Template  ()  () 
NN/CC2  ()  () 
NN/D1  ()  () 
ZEBRA/SDSS  ()  () 
For the kphotoz method, there is a clear tendency to fail towards very low redshift, as demonstrated by the peak in for . For lensing, such failures will be flagged as being below the lens redshift for nearly all relevant lens redshifts, thus excluding them from the source sample. Consequently, the only effect of this failure mode is to reduce the number of available sources, not to bias the weak lensing results. However, it is apparent that this method is as noisy for as the other photoz algorithms are for , and that the photoz error tends to be positive for and negative above that.
For the template-based database photoz’s, there is an even stronger failure mode towards than for kphotoz (because the template method goes fainter than the kphotoz sample). This failure mode contributes to the significantly negative 68% CL limits on the photoz error, since the points suggest that ignoring these failures leads to a more symmetric error distribution. We must quantify the effect this has in reducing the total weight; even if the bias in the lensing signal due to the strong failure mode is small, the increased statistical error due to loss of sources may be problematic. This failure mode is the cause of the large mean photoz bias in Table 2.
For the neural network algorithm, the plot shows the CC2 (colour- and concentration-based) photoz’s, but the trends are qualitatively similar for the D1 (magnitude- and concentration-based) photoz’s. There are entries for both versions in Table 2. As shown, the method has a reasonably small overall scatter and no major failure modes. We caution the reader that the same is not true for the NN photoz’s in the DR5 database, for which there is a significant scatter to redshifts that more than doubles the number of sources estimated to be in this redshift range. The scatter is also larger for the DR5 NN photoz’s. In both the DR5 and the DR6 versions, there is a tendency towards positive photoz bias at low to intermediate redshifts () that may bias the lensing signal low.
Finally, the ZEBRA/SDSS method also lacks a major catastrophic failure mode and has reasonably small overall photoz bias. The redshift histograms derived from the spectroscopic and photometric redshifts agree remarkably well. As for the NN/CC2 photoz’s, there is a trend towards positive photoz error at low redshift and negative error at high redshift. Because of the overall lower number of sources above , and the decreased dependence of on source redshift at higher redshift, we have no reason to believe that the effects of the different direction of the calibration biases in the lensing signal will cancel out. We can also conclude, in comparison with the ZEBRA photoz errors in the lower panel of Fig. 4 (using the far deeper COSMOS photometry) for the same exact set of sources, that for the redshifts and magnitudes dominated by this source sample, statistical error due to noisy SDSS photometry dominates over systematic error in this photoz method.
5.3 Redshift bias
In Figure 7, we show the lensing calibration bias for different source redshift determination methods, using the full lensing-selected spectroscopic redshift sample. The bottom panel shows the total lensing weight ascribed to the source sample for that lens redshift, determined via summation over the lensing weights described in Section 4.4. Note that the and LRG samples use photoz’s with the requirement that , to reduce contamination by physically-associated sources (for consistency with our previous analyses). However, for the new photoz methods, we have not imposed any such condition (we will revisit this choice later).
As shown, the sample with photoz’s from kphotoz has a significant negative calibration bias that increases with lens redshift to % at . As for all methods, the bias worsens with lens redshift because, for a given source with some photoz error, a higher lens redshift leads to a higher relative error in . The sample (using from DEEP2 EGS) has a small positive bias that increases to 10% at . We assess the significance of these biases for our previous work in section 5.4. The results for the LRG source sample confirm our assertion in previous works that for , this sample is essentially free of redshift bias.
The lack of significant redshift calibration bias for the template photoz code for can be explained by the trends in Fig. 6: the calibration bias due to the slight negative photoz bias balances out the calibration bias due to photoz scatter. Even at higher redshift, the redshift calibration bias, while nonzero, is less significant than for the other photoz methods. The neural net and ZEBRA/SDSS photoz’s, however, have significant negative bias (% to %, at ), presumably because of the aforementioned tendency to positive photoz bias for . This difference between the three methods is also the reason why the latter two methods have high total weight for the range of lens redshift considered here, whereas the template photoz code has lower weight (a) because of its scatter to low photoz (which eliminates possible sources from the sample) and (b) because it does not tend to scatter sources to higher photoz, which increases the weight artificially at the expense of biasing the signal. We emphasize that this higher weight for the two photoz methods does not mean that the error on is lower with these methods, because it may be due purely to the overestimate of . In section 5.8, we will address the effect of using photoz’s on the statistical error in .
Given that kphotoz has a similarly sized photoz error ( only) as the other photoz methods for the full source sample (all magnitudes), it is important to understand why the lensing calibration bias is so much worse for this method. The reason this occurs is that the sample is at lower mean redshift. Since those sources are closer on average to the lens redshift, the same size photoz error translates to a larger error in .
To understand the results, we consider fixed lens redshift of , and show the redshift bias as a function of true source redshift for each method in Fig. 8 (again, with lensing weight as a function of source redshift as in Section 4.4). Clearly, all source redshift bins with must give , because the sources are not lensed. Above , the calibration bias is no longer identically zero, but may be significantly negative due to scatter in the estimates of source redshift (near , the derivative is large so photoz errors are very important). As the source redshift increases, the same photoz error becomes less important because that derivative decreases, so the calibration bias approaches zero. The other important quantity to consider is the weight in each source redshift bin; if those source redshift bins with significant bias are given little weight, then the bias does not matter. If there is no weight for that means that none of the galaxies with true have had photoz misestimated to be above that. This plot makes it clear that part of the reason for the significant bias for the NN, kphotoz and ZEBRA/SDSS photoz’s is that they give too much weight to . This is less of a problem for the template photoz’s, so the calibration bias for this method is much less.
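The statement that a given photoz error matters most for sources just behind the lens can be made explicit with a short worked derivation. The sketch below assumes a flat universe and the standard form of the critical surface density written in terms of comoving distances; the symbols $\chi_L$ and $\chi_S$ (comoving distances to lens and source) are introduced here for illustration, and any convention-dependent factors of $(1+z_L)$ are dropped because they do not depend on the source redshift.

```latex
% Assumption: flat universe; \chi_L, \chi_S are comoving distances to the
% lens and to the source. At fixed lens redshift, for \chi_S > \chi_L:
\Sigma_c^{-1} \;\propto\; 1 - \frac{\chi_L}{\chi_S}
% Sensitivity of the inverse critical density to the source distance:
\frac{\partial \Sigma_c^{-1}}{\partial \chi_S} \;\propto\; \frac{\chi_L}{\chi_S^{2}}
% Fractional sensitivity:
\frac{\partial \ln \Sigma_c^{-1}}{\partial \chi_S}
  \;=\; \frac{\chi_L}{\chi_S\,(\chi_S - \chi_L)}
```

The fractional sensitivity diverges as $\chi_S \to \chi_L$ and falls off as $\chi_L/\chi_S^{2}$ for $\chi_S \gg \chi_L$, which is the behaviour described above: the same redshift error produces a large calibration error for sources just behind the lens and a small one for distant sources.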
Finally, we show the resulting mean calibration bias when these results are averaged over a lens redshift distribution using Eq. (6). Errors are determined using the prescription in section 4.3. The lens redshift distributions that we consider are as follows: “sm1”–“sm7” are the redshift distributions for the seven stellar mass bins from Mandelbaum et al. (2006c); “LRG” is the redshift distribution for the spectroscopic LRGs, a volume-limited sample, used for lensing in Mandelbaum et al. (2006b); and “maxBCG” is the redshift distribution of the SDSS maxBCG clusters (Koester et al., 2007b; Koester et al., 2007a). These nine lens redshift distributions are plotted in Fig. 9. The stellar mass subsamples correspond roughly to luminosity samples with band luminosities of , , , , , , and . The LRGs are red galaxies with typical luminosities of a few , and the maxBCG clusters are clusters selected from imaging data with masses .
The average redshift calibration biases (defined in Eq. 6) for the redshift determination methods given in Fig. 7 for these nine lens redshift distributions are shown in Table 3. As shown, for the stellar mass subsamples, the bias gets more significant at higher stellar mass because of the higher mean redshift. The maxBCG sample gives similar bias to sm7 because of the similar redshift range, and the LRG sample gives the worst bias because it has the highest mean redshift. The only method for which the trend is different is the template photoz code, for which the trend of changes sign with redshift due to the different trends of photoz error with redshift.
LRG  template  NN/CC2  NN/D1  ZEBRA/SDSS  

sm1  
sm2  
sm3  
sm4  
sm5  
sm6  
sm7  
LRG  
maxBCG 
As shown, the NN/D1 photoz’s give nominally worse calibration bias than the NN/CC2 photoz’s for lower-redshift lens samples, and the reverse is true at higher redshift. This trend is consistent with the difference between the two methods in Fig. 7. We also performed the analysis with the DR5 NN photoz’s, and found the lensing calibration bias for these lens redshift distributions to be similar to the NN/CC2 calibration biases, well within the errors. This result suggests that the failure mode to in the DR5 version was not a significant source of lensing calibration bias, and that the overall positive photoz bias (present in all NN photoz’s tested in this paper) is the main cause.
Finally, we consider what happens if we correct for the mean photoz bias when estimating for each source. For the template photoz’s, this correction causes the mean calibration bias for sm7 to go from to . This result may be puzzling until we consider the effects of photoz bias and scatter separately (section 3.3). We know that photoz scatter causes a negative calibration bias, and that a negative photoz bias, such as this method has, causes a positive calibration bias. When we did not correct for the mean photoz bias, these two effects apparently cancelled out. This cancellation is a non-trivial result that depends on our sample selection. With a different cut on apparent magnitude, for example, it is not clear that the effects would balance as precisely. Now that we have corrected for the effects of mean photoz bias, we are left with the suppression of the lensing signal due to the photoz scatter. For the NN/CC2 and NN/D1 photoz’s, the correction for the mean photoz bias decreases the calibration bias from and to and , respectively, for sm7 (since the positive photoz bias and the scatter change the lensing calibration in the same direction). For ZEBRA/SDSS, the photoz bias was slightly negative, so correcting for it worsens the lensing calibration bias as for the template photoz’s, but only slightly: from to for sm7.
From these results, we can conclude that once the effects of the mean photoz bias are removed, the effects on the lensing calibration due to scatter in the photoz’s are the smallest for the SDSS NN/D1 photoz’s, followed by SDSS NN/CC2, ZEBRA/SDSS, and finally are the largest for the template photoz’s. This trend is consistent with the trends in Table 2 for the photoz scatter. We therefore have two possible procedures for handling calibration bias in the lensing signal: (1) to correct for the mean photoz bias before computing the lensing signal, and apply a correction to the lensing signal afterwards to account for residual calibration bias due to photoz scatter; or (2) to apply a correction to the lensing signal due to the combined effects of photoz bias and scatter at once. In either case, we must depend on the fact that our calibration subsample has the same sample properties as the full source catalog, so that corrections derived using this subsample will apply to the full catalog.
5.4 Implications for previous work
Here we determine the implications of Table 3 for previous work with this lensing source catalog.
First, we consider the results for Mandelbaum et al. (2006c), in which we divided the sample into stellar mass and luminosity subsamples with the seven redshift distributions sm1–sm7 shown in Fig. 9. For that work, the signal presented was an average over the signal using the and source sample with weighting. To determine the average bias on this signal, we use our bootstrap-resampled and , averaging the bias as a function of redshift for each resampling using the weights for these two samples, and then find the average over all the resampled datasets. The average biases for sm1–sm7 are shown in Table 4.
Lens sample  

sm1  
sm2  
sm3  
sm4  
sm5  
sm6  
sm7  
LRG 
We also consider the spectroscopic LRG lens redshift distribution, which was used for lensing in Mandelbaum et al. (2006b) and Mandelbaum & Seljak (2007). In that case, we detected a % suppression of the lensing signal for the source sample relative to the and LRG source samples. Table 3 makes it clear that this suppression was, in fact, real. To account for this suppression, we had multiplied the signal and its error by a factor of . This is equivalent to multiplying by when computing both the weights () and the lensing signal. We thus incorporate this factor into the computation of the bias in Eq. (5) before taking the weighted average with the sample. The average bias once the correction factor is incorporated is shown in Table 4. Because of this suppression of the weight in the sample due to the calibration factor, and because of its already low weight relative to for (see Fig. 7), the uncertainty on the calibration bias is actually dominated by the larger sample uncertainty, which is why it is larger than one might naively expect from combining the results in Table 3 for and . It is clear that this way of combining the signal for and is nonoptimal from the perspective of constraining calibration bias.
No results are shown for the maxBCG lensing sample because none of the previous works using this source catalog have used it.
It is clear from this table that there was statistically significant redshift calibration bias in previous works using this source catalog. However, the absolute value of the error is below the statistical error on the lensing signal in those works, and is smaller than the generous % () systematic error that was used for those science results. We conclude that there is no cause for concern in using results in our previous work with this catalog without applying a correction.
5.5 Systematics: targeting and redshift failure
In the previous sections, all quoted calibration errors were statistical. Here, we consider the size of systematic errors.
First, we include the DEEP2 redshift failures in the sample, once putting them all at and then all at (with an LSS weight of ). We have already shown in section 5.1 that the failures have a similar SDSS magnitude and colour distribution to the remainder of the sample. This statement is also true in the DEEP2 photometry, placing these galaxies without spectroscopic redshifts in the colour locus (like those with successful redshift determination). Consequently, placing them all at and gives extremely conservative bounds on the systematic error due to these redshift failures. Table 5 shows the new and the change in compared to table 3 for all methods of source redshift determination, including the combined and method used in our previous work (Sec. 5.4), for four lens redshift distributions: sm1, sm4, sm7, and LRG, which are at progressively higher redshifts.
LRG  template  NN/CC2  ZEBRA/SDSS  Previous work  

Fail to  
sm1  
sm4  
sm7  
LRG  
Fail to  
sm1  
sm4  
sm7  
LRG 
As shown in Table 5, these extreme assumptions change our estimated calibration bias at the level, in most cases . If we consider that the real effect is likely many factors smaller than this (since the failures roughly follow the magnitude and colour distribution of the successes, and therefore likely the redshift distribution), this systematic is far below our uncertainty on the calibration bias, from which we can conclude that systematic effects due to the excluded DEEP2 redshift failures are negligible.
We next consider the effects of using the zCOSMOS photoz for their redshift failures. As shown in Fig. 4, the failures have colours and magnitudes similar to those of the successes, so we do not anticipate that they will have a significantly different photoz error distribution from the successes shown at the bottom of that figure. To test the effect of using ZEBRA photoz’s for this 8% of the sample, we randomly select another 8% of the sample that are redshift successes and replace their spectroscopic redshifts with the photoz’s. We then compare the resulting calibration biases to the original ones. These results (shown in Table 6) indicate that for all methods of source redshift distribution determination and lens redshift distributions, the use of zCOSMOS photoz’s for the 8% of the zCOSMOS sample that lacks redshifts changes the results by well below the statistical error. We conclude that the systematic error in our results due to redshift failures in either survey is unimportant, with the caveat that if the redshift failures are a systematically different population from the successes, this test would not uncover any resulting systematic error (however, we have no evidence that this is the case).
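The replacement test described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function name `swap_in_photoz`, its arguments, and the use of NumPy are assumptions made for the example.

```python
import numpy as np

def swap_in_photoz(z_spec, z_phot, is_success, frac=0.08, seed=0):
    """Replace the spectroscopic redshift by the photoz for a random
    subset of redshift successes.

    The subset size is `frac` times the full sample size, mimicking the
    fraction of the sample that lacks a spectroscopic redshift.
    Returns the modified redshift array and the chosen indices.
    """
    rng = np.random.default_rng(seed)
    z_spec = np.asarray(z_spec, dtype=float)
    z_phot = np.asarray(z_phot, dtype=float)
    success_idx = np.flatnonzero(is_success)
    n_swap = int(round(frac * z_spec.size))
    # Draw without replacement so that exactly n_swap objects are altered.
    chosen = rng.choice(success_idx, size=n_swap, replace=False)
    z_test = z_spec.copy()
    z_test[chosen] = z_phot[chosen]
    return z_test, chosen
```

The calibration bias would then be recomputed with `z_test` in place of the spectroscopic redshifts and compared with the original value.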
LRG  template  NN/CC2  ZEBRA/SDSS  Previous work  

sm1  
sm4  
sm7  
LRG 
One final systematic is that in DEEP2 EGS, roughly 4% of our source catalog at bright magnitudes () was not targeted. We must assess whether properly including these galaxies would significantly change the results. However, the small photoz error for bright objects and the low mean redshift make this unlikely. In the SDSS, only a subset of these galaxies have spectroscopy: those with (flux-limited) and fainter ones that are very red. Since including these SDSS spectroscopic redshifts will create a sample with strange selection (lacking blue galaxies at ), we instead take the spectroscopic galaxies from zCOSMOS at , choose a random subset to account for the smaller size of the DEEP2 sample, and add the resulting galaxies to the DEEP2 sample. We then refit the redshift histogram for DEEP2, getting new redshift distribution parameters , , and . We see that the change in mean source redshift is well within the errors in Table 1. When computing the mean redshift bias using this augmented sample, we find that the changes are even smaller than those shown in Table 5. This is not surprising, because in that table we have taken redshift failures and put them at very extreme redshifts, whereas here we have added a comparable number of redshifts but with very good photoz’s.
5.6 Agreement between the two surveys
As an additional systematics test, we compare the results when doing the full analysis separately for each survey. In this case, we use LSS weights derived using the redshift histograms for each survey separately instead of using the combined histogram. In Table 7, we show the results for each survey separately, with the bottom section showing the statistical significance of the difference.
LRG  template  NN/CC2  ZEBRA/SDSS  Previous work  
zCOSMOS  
sm1  
sm4  
sm7  
LRG  
DEEP2 EGS  
sm1  
sm4  
sm7  
LRG  
Statistical significance of difference (in units of )  
sm1  
sm4  
sm7  
LRG 
The results in this table show apparently significant discrepancies between the results with zCOSMOS and with DEEP2 separately. The fact that the statistical significance of the difference is for the last four columns, which use the full catalog, but for the first column (which uses only) and for the second column (which uses only) suggests that we should focus on the sample to find the source of the discrepancy. We must understand this discrepancy in order to assess whether our results are biased or our errorbars are significantly underestimated on the final, combined analysis.
In Fig. 10 we show plots for that will shed light on this discrepancy. The upper left plot shows for for both surveys. As shown, the best-fit histograms are very similar, but the LSS fluctuations are more pronounced than for the full sample. The lower left panel shows the ratio of the best-fit number predicted in zCOSMOS to the number in DEEP2 (normalized to the same total numbers of galaxies), with the 68% confidence region shown with dashed lines. This confidence region, including both Poisson and sampling variance error, was determined as follows: for each survey, 200 bootstrap-resampled redshift histograms were created and used to fit for the . We then pair up the 200 best-fit from zCOSMOS and from DEEP2 EGS, and determine the ratio of these values for each pair. The ratios are ranked, and the middle 68% are chosen to determine the 68% confidence region. It is reassuring that for all redshifts, this confidence region includes a ratio of . It is apparent that the scarcity of redshifts at causes the errorbars on the ratio to become extremely large (well off the limits of the plot).
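The ranking procedure for the confidence region can be sketched as below. This is an illustrative sketch, assuming the bootstrap best-fit distributions have already been evaluated on a common redshift grid and normalized to the same total; the function name and array layout are hypothetical.

```python
import numpy as np

def ratio_confidence_region(fits_a, fits_b, level=0.68):
    """Central confidence interval of the ratio fits_a / fits_b.

    fits_a, fits_b : arrays of shape (n_boot, n_z) holding the bootstrap
        best-fit redshift distributions of the two surveys.
    Returns (lower, upper), each of shape (n_z,).
    """
    fits_a = np.asarray(fits_a, dtype=float)
    fits_b = np.asarray(fits_b, dtype=float)
    # Pair the i-th bootstrap fit of one survey with the i-th of the other.
    ratios = fits_a / fits_b
    tail = 100.0 * (1.0 - level) / 2.0
    # Ranking the ratios and keeping the middle 68% is a percentile cut.
    lower = np.percentile(ratios, tail, axis=0)
    upper = np.percentile(ratios, 100.0 - tail, axis=0)
    return lower, upper
```

Because the two fields are independent, the pairing of bootstrap realizations is arbitrary; any fixed pairing gives a statistically equivalent interval.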
The top right panel in Fig. 10 shows for several lens redshifts. As shown, these results are very similar for the two surveys. The bottom right plot shows the fractional weight for each lens redshift and survey. In principle, the LSS weighting was designed to ensure that these curves would not have structure due to LSS fluctuations in number density as a function of redshift. We can see (particularly for ) that the curves for each survey are quite different and have significant LSS fluctuations, so we must understand why this is the case. We have ascertained that if we use from DEEP2 with the weight from zCOSMOS, we recover the same as when we use and from zCOSMOS, implying that the weight differences cause the discrepancy in .
To solve this problem, we consider only sources with . As shown with arrows, for , the weight in this bin is a factor of higher in zCOSMOS than in DEEP2. We have confirmed that this bin alone is a significant reason why the calibration bias is on average more negative for zCOSMOS than for DEEP2. There are and galaxies at in this bin in zCOSMOS and DEEP2 respectively. Using the LSS weights derived for each survey separately, we weight zCOSMOS and DEEP2 by factors of and , giving weighted numbers of galaxies of and . Thus, the weighted ratio , where the expected value is given the total number of galaxies in each survey. This ratio of therefore represents a % enhancement of zCOSMOS relative to DEEP2, due to the fact that the LSS weights were derived using all galaxies in each survey, not just those at that we use here. While we can therefore conclude that LSS weighting may need to be done as a function of apparent magnitude, this % enhancement in source number does not account for a factor of enhancement in the weights.
Figure 11 shows the photoz distribution for kphotoz for the sources in this narrow redshift slice in each survey. It is important to note that our past analyses have required . The photoz distributions for the zCOSMOS and DEEP2 galaxies in this redshift slice are quite different, with the DEEP2 distribution being skewed to lower photoz, and the zCOSMOS one to higher photoz. Consequently, forty of the zCOSMOS galaxies pass this photoz cut (23%), as compared with two of the DEEP2 galaxies (7%). In terms of raw numbers, this gives an additional factor of enhancement of the weight in zCOSMOS on top of the previous factor of . Thus, the two factors together give nearly the factor of four enhancement in weight that we noticed in Fig. 10 as the source of the discrepancy.
Having accounted for the source of the problem, we must understand why the photoz distributions look so different for the two surveys. The bottom panel of Fig. 11 gives colour-magnitude information for these , galaxies in the two surveys. As shown, the DEEP2 galaxies are both fainter and bluer on average than those in zCOSMOS at this redshift. This is consistent with the fact that the redshift histograms show a local underdensity in DEEP2 and a significant overdensity in zCOSMOS at this redshift. We have found that for this photoz method, the photoz’s are biased low for blue galaxies, but not red galaxies. Hence, the different photoz distributions in the top panel of Fig. 11 reflect the different mixes of spectral types and different detections of the galaxies in the two surveys at this source redshift, rather than some more ominous effect such as differences in photometric calibration across the SDSS survey area.
We have confirmed that similar effects are at play in other parts of the source redshift distribution (e.g. ) that show significant differences in weight between the two surveys in Fig. 10. In short, the cause of the different redshift biases in the two surveys is the interplay between largescale structure and photoz errors, where LSS emphasizes certain spectral types that have different photoz error properties. (Explicit demonstration of how this effect can come about will be shown in Section 5.11, where we show photoz error distributions for ZEBRA/SDSS as a function of colour and magnitude.) Even in the absence of our cut, the mean estimated would have been much higher in zCOSMOS than in DEEP2, giving the same sign of the discrepancy between the surveys as we have now (except in that case, both and would be different, not just ). This interplay between photoz’s and LSS is a problem when trying to estimate the bias due to redshift calibration with a reasonably small subsample of redshifts () on a small area of the sky. It is also avoidable in principle, if we use our sample with spectroscopic redshifts to derive photoz error distributions as a function of colour and magnitude, which may be used to obtain accurate for each object.
To confirm these findings, we have boxcar-smoothed the weights shown in Fig. 10 with smoothing lengths of , , and for , , and (larger smoothing lengths chosen for higher because the LSS fluctuations in are more significant there). The resulting weight functions are reasonably smooth, as shown in Fig. 12, but include some apparent mean offset in the redshift distributions for the two surveys. We find that the discrepancy between for the two surveys is 5%, 15%, and 50% smaller for , , and respectively than when using the unsmoothed . Most of the change arises from the DEEP2 mean calibration bias going to lower (more negative) values, with the zCOSMOS mean calibration bias changing only slightly. The apparent discrepancy in Table 7 for LRG lenses is thus reduced due to this smoothing to a discrepancy, with the remaining discrepancy presumably due to the offset in the weight histograms shown in Fig. 12.
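A boxcar smoothing of a weight function sampled on a uniform redshift grid can be sketched as follows. This is a minimal sketch and not the analysis code: the function name, the edge-padding choice, and the rounding of the window to an odd number of bins are assumptions made for illustration.

```python
import numpy as np

def boxcar_smooth(weights, dz, smoothing_length):
    """Smooth `weights`, sampled on a uniform grid of spacing `dz`,
    with a top-hat window of width `smoothing_length` in redshift."""
    weights = np.asarray(weights, dtype=float)
    n = max(1, int(round(smoothing_length / dz)))
    if n % 2 == 0:
        n += 1  # an odd window keeps the boxcar centred on each bin
    kernel = np.ones(n) / n
    # Repeat the edge values so the ends are not dragged toward zero.
    padded = np.pad(weights, n // 2, mode="edge")
    return np.convolve(padded, kernel, mode="valid")
```

The output has the same length as the input, and away from the edges the smoothing conserves the total weight, so only the fine-scale structure in the weight function is removed.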
We now ask if the LSS fluctuations are the cause of the discrepancy with the other photoz methods. As we will show later for ZEBRA/SDSS and have confirmed for the template and neural net photoz algorithms (but do not show here), it is a general tendency of these photoz algorithms to underestimate the photoz’s for blue galaxies, and slightly overestimate them for red galaxies. Consequently the same effect occurs when the mixes of spectral types are different in the two surveys, even when we are using another photoz algorithm, and this is evident in for each survey. We therefore estimate, using the same method of boxcar smoothing the weight as a function of redshift for each survey, that the discrepancies for these methods are really .
We now address another unusual feature of the calibration uncertainties in Table 7: the uncertainties are actually smaller for DEEP2 than for zCOSMOS (only slightly larger than for the combined sample), despite the fact that sampling variance is % larger for DEEP2 EGS than for zCOSMOS. This result is also due to the LSS fluctuations in the weights for both surveys. The DEEP2 mean calibration bias was, as we saw previously, significantly affected by this problem, which is also responsible for making the errorbars artificially small (since our method of getting the errors does not allow to vary as much as it should in reality). So, our worst-case and calibration differences for LRG lenses (with kphotoz and with the other photoz methods, respectively) are actually much less significant than these numbers suggest, and are therefore not a problem.
We must ask whether this effect means that our mean results are biased or our errorbars are too optimistic when using the combined sample of galaxies for the two surveys. However, we are fortunate to be able to combine large samples at completely different points on the sky. The total (sample variance plus Poisson) errors when using two uncorrelated fields with and galaxies are smaller than if we simply had a single field on the sky with galaxies (which would be correlated with each other).
A comparison of Fig. 10 with Fig. 8 can help us answer this question. In Fig. 10, it is clear that the weight as a function of source redshift for is not smooth at all due to LSS-photoz error correlation in each survey. The fluctuations are at times % off from the value one might expect if the curve were smooth. However, in Fig. 8, these curves for the combined sample are significantly smoother, with fluctuations that are at most % for the LRG sources (the smallest and most highly clustered sample) and even less for the other samples, %. We thus conclude that the effect is reduced by a factor of , and is therefore negligible for the combined sample. To verify this conclusion, we have performed the same boxcar smoothing of the weight functions in Fig. 8 with the same smoothing lengths as for the two survey subsamples, and found that the resulting redshift calibration biases for the combined sample changed by % for sm1–sm5, % for sm6, sm7, LRGs, and maxBCG lenses. These changes are well within the errors on the calibration bias for these lens samples.
Finally, we notice in the top panel of Fig. 11 that our naive requirement that has required us to ignore a significant majority of the galaxies in this redshift slice, all of which are actually lensed. Since and the sources are all at true redshifts , we could conceivably use them all for lensing; using the subset at eliminates a large fraction of these sources. We return to this point in sections 5.8 and 5.10.
5.7 Size of errorbars on calibration bias
While we have previously asserted (section 4.3) that correlations between the bins in the redshift histograms should be negligible, we now present tests of this assertion, which (if violated) could cause the errorbars to be underestimated. One reason why they might be violated is the existence of a supercluster that happens to lie partially within two histogram bins instead of entirely within one. While such a large LSS fluctuation is unlikely in an area of such small comoving volume, we nonetheless present tests of this possibility.
As an example of a candidate supercluster, we find a large overdensity with in zCOSMOS. By plotting the detailed redshift distribution in this region, we see that there are, in fact, large overdensities with line of sight separations of Mpc between them. Clusters separated by such a large distance are unlikely to be correlated: the correlation function for dark matter at this separation is , so the clusters would need to have a bias of for the correlation probability to become appreciable relative to a random distribution. There should be fewer than one cluster with such a high bias in an observable universe. While magnification bias may increase the probability by a factor of a few (Hui et al., 2007), it does so by invoking the cross-correlation between mass and galaxies, so one loses one power of the bias, which therefore cannot bring the correlations to a level comparable to unity. These galaxy bias and magnification bias effects are difficult to simulate realistically, so we cannot turn to simulations to solve this problem.
To test the effects on the errorbars of the best-fit redshift distribution and on the final calibration bias, we redo the analysis using bins of size , which will then include these structures all in one bin. We find that for zCOSMOS, this procedure increases the errors on the final results by %, whereas the size of the errors for DEEP2 and the combined sample (DEEP2 plus zCOSMOS) are essentially unaffected.
As an additional test, we shift the original histogram bins by in redshift, so that all three structures fall into the bin from . We find that while the best-fit redshift histogram is unaffected, the errors on it are significantly increased (by nearly a factor of in the bins near this LSS fluctuation, and a smaller factor further away from it). To understand why it has such a large effect, we consider that it adds an additional number of galaxies to the histogram in that one bin. The penalty on the fit (Eq. 7) is therefore . When we consider splitting the fluctuation equally into two bins (as we had effectively been doing before), the excess number of galaxies in each bin is , leading to a penalty of , half as much as if the entire overdensity is in one bin. The effect when fitting to the shifted histogram using both surveys together is nearly the same as when fitting zCOSMOS alone, whereas the errors for DEEP2 alone are unaffected (because our contrived bin-shifting did not correlate with any LSS fluctuations in DEEP2).
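The factor of two in the penalty can be made explicit with a short worked example. The symbols are assumptions introduced here for illustration: $\delta N$ denotes the excess number of galaxies in the overdensity and $\sigma^2$ the per-bin variance, and the fit penalty is taken to be quadratic in the per-bin residual with the same variance in each bin.

```latex
% Assumed symbols: \delta N = excess galaxy count from the overdensity,
% \sigma^2 = per-bin variance, taken to be equal in neighbouring bins.
\begin{align*}
\text{entire excess in one bin:}\quad
  & \frac{(\delta N)^2}{\sigma^2}, \\
\text{excess split equally over two bins:}\quad
  & 2 \times \frac{(\delta N/2)^2}{\sigma^2}
    = \frac{(\delta N)^2}{2\sigma^2}.
\end{align*}
```

Under these assumptions the split configuration incurs half the penalty, which is why the fit tolerates a divided overdensity more easily and returns smaller errors than when the same excess sits in a single bin.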
Given that these structures are likely to be uncorrelated, our bin-shifting that treated them as correlated leads to overestimated errors. On the other hand, our default binning puts one of them into one histogram bin and leaves the other two together; we may therefore suppose that our errors for zCOSMOS and the combined sample are, in fact, slightly overestimated (since we effectively treated two of the structures as correlated). It is clear that the limited number of independent patches makes the error estimate from the bootstrap noisy, and while our final results may be treated as having conservative errorbars, we cannot exclude the possibility that they may be a factor of two larger. However, this finding that the zCOSMOS errorbars may be overestimated may also explain the fact that in the previous section, we found the calibration of the lensing signal in DEEP2 to be constrained more tightly than in zCOSMOS despite the fact that DEEP2 is smaller.
Finally, we note that bootstrapping data points times will in general lead to statistical uncertainty in the determined errors at the level. For the case where we bootstrap a redshift histogram with bins to get the bestfit redshift distribution, and use those results to get errors on the lensing signal calibration uncertainty, the errors are therefore reliable at the % level. This uncertainty is due to noise, rather than violation of the bootstrap assumptions as in the rest of this section.
5.8 Purity and completeness
Here we address questions of purity and completeness of the source sample for each photoz method. We define purity as the fraction of the total estimated lensing weight that is attributed to sources with spectroscopic redshift above the lens redshift (i.e., that are truly lensed). Low purity would be associated with a strong negative calibration bias. Completeness can be defined by constructing the analogues of the lensing weights in Eq. (4), but using the true rather than the estimated one. We then define a “true” for each object, and find the fraction of the total summed “true” weights that is actually used by lensed sources defined using any given photoz method. Low completeness can occur because photoz’s are scattered low, so that we assume they are below the lens redshift.
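The two definitions can be sketched in code as follows. This is an illustrative sketch: the function names are hypothetical, the per-object weights are taken as precomputed inputs rather than derived from Eq. (4), and a source is counted as used whenever its photoz exceeds the lens redshift, which is a simplifying assumption about the selection.

```python
import numpy as np

def purity(w_est, z_spec, z_lens):
    """Fraction of the total estimated lensing weight carried by sources
    whose spectroscopic redshift is above the lens redshift.

    w_est : weights computed from the photoz (zero for sources that the
        photoz places at or below the lens redshift).
    """
    w_est = np.asarray(w_est, dtype=float)
    truly_lensed = np.asarray(z_spec, dtype=float) > z_lens
    return w_est[truly_lensed].sum() / w_est.sum()

def completeness(w_true, z_phot, z_lens):
    """Fraction of the total "true" weight that is actually used.

    w_true : weights computed from the spectroscopic redshift (zero for
        sources truly at or below the lens redshift).
    """
    w_true = np.asarray(w_true, dtype=float)
    used = np.asarray(z_phot, dtype=float) > z_lens
    return w_true[used].sum() / w_true.sum()
```

In this form, photoz’s scattered high lower the purity, while photoz’s scattered low lower the completeness.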
These two issues, purity and completeness, are two of the three factors that determine the statistical error on the lensing signal for a given photoz method as compared with the statistical error in the optimal case where all lens and source redshifts are known. The final factor is how much a photoz method causes the weighting scheme to deviate from optimal weighting. We would like to estimate the total increase in the error on the lensing signal due to all three factors combined.
To do so, we consider the lensing signal estimator in the optimal case where all lens and source redshifts are known. In that case, we have a shear , a critical surface density , and weights . (These weights are analogous to those defined in Eq. 4, where comes from shape noise and measurement error added in quadrature.) In this ideal case, the lensing signal is
(11) 
and its variance is
(12) 
In reality, we have an estimated critical surface density , an estimated weight , and a calibration bias defined via Eq. (5). We can relate it to the true lensing signal
(13) 
so its variance is
(14) 
We then rearrange the definition of as follows:
(15) 
Inserting this form for into equation 14, we find that
(16) 
Comparing equations 12 and 16, we find that
(17) 
This ratio has the form of a correlation coefficient between the square roots of the real and ideal weights for each lens-source pair, and therefore is constrained to lie between 0 and 1 (not between -1 and 1 as for correlation coefficients in general, since the weights are non-negative). It is only equal to one in the case where the estimated weight is strictly proportional to the ideal weight . This is as it should be: the measured (“real”) variance of the lensing signal using a given photoz method is always greater than or equal to the ideal variance. This expression encodes all three possible ways the real measurement can be degraded relative to the ideal one: via loss of lensed sources, inclusion of sources that are not lensed, and non-optimal weighting. This statistic is therefore another lensing-optimized metric that can be used to classify photoz algorithms for gg lensing purposes.
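The chain of steps from the ideal estimator to the variance ratio can be sketched as follows. The notation is an assumption of this sketch: $\gamma_j$ is the shear of source $j$ with noise variance $\sigma_j^2$, $\Sigma_{c,j}$ and $\tilde\Sigma_{c,j}$ are the true and estimated critical surface densities, $b_z$ is the calibration bias, and the weights are taken to have the inverse-variance form $w_j = 1/(\Sigma_{c,j}^2\sigma_j^2)$ and $\tilde w_j = 1/(\tilde\Sigma_{c,j}^2\sigma_j^2)$.

```latex
% Assumed notation; the inverse-variance form of the weights is an assumption.
\begin{align*}
\Delta\Sigma^{\rm (ideal)} &= \frac{\sum_j w_j \gamma_j \Sigma_{c,j}}{\sum_j w_j},
&
{\rm Var}\left[\Delta\Sigma^{\rm (ideal)}\right]
  &= \frac{\sum_j w_j^2 \Sigma_{c,j}^2 \sigma_j^2}{\left(\sum_j w_j\right)^2}
   = \frac{1}{\sum_j w_j}, \\
\Delta\Sigma^{\rm (real)} &= \frac{1}{1+b_z}\,
  \frac{\sum_j \tilde w_j \gamma_j \tilde\Sigma_{c,j}}{\sum_j \tilde w_j},
&
{\rm Var}\left[\Delta\Sigma^{\rm (real)}\right]
  &= \frac{1}{(1+b_z)^2 \sum_j \tilde w_j}, \\
1+b_z &= \frac{\sum_j \tilde w_j\, \tilde\Sigma_{c,j}\, \Sigma_{c,j}^{-1}}{\sum_j \tilde w_j}
       = \frac{\sum_j \sqrt{w_j \tilde w_j}}{\sum_j \tilde w_j},
&
{\rm Var}\left[\Delta\Sigma^{\rm (real)}\right]
  &= \frac{\sum_j \tilde w_j}{\left(\sum_j \sqrt{w_j \tilde w_j}\right)^2}, \\
\frac{{\rm Var}\left[\Delta\Sigma^{\rm (ideal)}\right]}
     {{\rm Var}\left[\Delta\Sigma^{\rm (real)}\right]}
  &= \frac{\left(\sum_j \sqrt{w_j \tilde w_j}\right)^2}
          {\left(\sum_j w_j\right)\left(\sum_j \tilde w_j\right)} \le 1 .
\end{align*}
```

With these assumed weights, $\sqrt{w_j \tilde w_j} = \tilde w_j \tilde\Sigma_{c,j}/\Sigma_{c,j}$, which gives the second form of $1+b_z$, and the final inequality is the Cauchy-Schwarz inequality applied to $\sqrt{w_j}$ and $\sqrt{\tilde w_j}$, with equality only when $\tilde w_j \propto w_j$.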
Fig. 13 shows the purities (bottom left), completenesses (top left), the variance ratio (top right), and the implied change in variance due to nonoptimal weighting (bottom right) as a function of lens redshift for each method.
We first consider the completeness as a function of lens redshift in the top left panel of Fig. 13. The results for kphotoz verify our previous findings that the combination of a broad photoz error distribution with our requirement that causes us to lose a significant fraction of the available lensing weight. The results for the LRG source sample verify our previous assertions that the photoz’s for these sources are able to correctly put them all at high redshift, so that we lose essentially none of them. The template photoz completeness is % on average, which is not surprising given the significant failure mode to that causes us to lose some sources. The neural net photoz’s (CC2 and D1) give the highest completeness of all the photoz methods considered here (except the highly specialized LRG source sample), in part due to the positive mean photoz error.
In the lower left panel of Fig. 13, we see the purity as a function of lens redshift. The swiftly declining purity above for kphotoz is the main cause of the large negative calibration bias for this method for higher redshift lens samples, and is a result of large photoz error coupled with a lower mean redshift for than the full samples used for the other photoz methods. The LRG source sample purity is uniformly high, dropping from at to a minimum of at . This result attests to the efficiency of the colour cuts in selecting only high-redshift sources, and the small size of the photoz error distribution. Of the other photoz methods, the template photoz has the highest purity; the tendency towards a positive photoz error seen previously for the NN and ZEBRA/SDSS photoz’s causes a decline in purity with redshift (though it is also the cause of their relatively high completeness) just as it causes a negative calibration bias in the lensing signal.
The upper right panel of Fig. 13 shows the variance in the ideal case relative to the true variance that results from using a given photoz method. For kphotoz, this number drops as low as for , implying that the errors are a factor of larger when using this photoz method than in the ideal case. ZEBRA/SDSS and the template photoz’s give similar results for this parameter, from at to at , implying errors ranging from to times the ideal. The NN photoz’s give slightly better results than that, as does using a redshift distribution for galaxies. The high-redshift LRGs naturally give nearly the same errors in reality as in the ideal case, because the sources are at redshifts significantly higher than the lenses, so any photoz errors cannot cause a significant deviation from optimal weighting.
Finally, the lower right panel shows the estimated change in variance due to non-optimal weighting, obtained by taking the variance ratio and dividing out the effects of impurity and incompleteness. The results suggest that for all source samples except the high-redshift LRGs, the non-optimal weighting has a similar effect on the errors independent of photoz method, increasing them by % at worst for this range of lens redshifts.
5.9 Using distributions
Here we consider the possibility of using a full redshift probability distribution, , for each object, with two different sources of this distribution. The first is the posterior from the ZEBRA/SDSS method. For this method, is determined by marginalizing over templates using the joint redshift-template prior and the likelihood from the fit :
(18) 
The second is a distribution determined using some of the machinery described in Oyaizu et al. (2007) but independently of the photoz determination in that paper. The photoz-independent estimate of (Cunha et al. 2007, in prep.) is calculated as follows: the training set, comprising 639 915 spectroscopic objects from a variety of surveys, is reweighted using the procedures in Oyaizu et al. (2007) and Lima et al. (2007, in prep.) to match the joint, 5-dimensional probability distribution of the source catalog for which we would like to obtain photoz’s. The five parameters used to create this distribution are , , , colours and the band apparent magnitude. The redshift distribution of the weighted training set provides an estimate of the true underlying distribution of the photometric sample. The estimate of for each galaxy in the photometric sample is given by the weighted distribution of the 100 nearest training set neighbors in colour/magnitude space (the same four colours and band magnitude mentioned above). Finally, to reduce the effects of Poisson noise, large-scale structure, and magnitude errors in the training sample, we adopt a “moving window” smoothing technique. We calculate in 140 bins in the redshift range with a constant bin width of . The derived in this way will be referred to as the NN , where NN in this context refers to “nearest neighbor” rather than “neural net.”
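The nearest-neighbor step can be sketched as follows. This is a minimal illustration rather than the method of the cited works: the function name and arguments are hypothetical, an unscaled Euclidean distance in colour/magnitude space is assumed, and the “moving window” smoothing is omitted.

```python
import numpy as np

def nn_pz(target, train_features, train_z, train_weights, z_edges, k=100):
    """Estimate a redshift distribution for one galaxy from the weighted
    redshifts of its k nearest training-set neighbours.

    target : feature vector (colours and magnitude) of the galaxy.
    train_features : array of shape (n_train, n_features).
    train_z, train_weights : training redshifts and their weights.
    z_edges : edges of the redshift bins.
    Returns a density normalized to integrate to unity over the bins.
    """
    target = np.asarray(target, dtype=float)
    train_features = np.asarray(train_features, dtype=float)
    # Squared Euclidean distance (an assumed metric).
    d2 = np.sum((train_features - target) ** 2, axis=1)
    k = min(k, d2.size)
    nearest = np.argpartition(d2, k - 1)[:k]
    hist, _ = np.histogram(
        np.asarray(train_z, dtype=float)[nearest],
        bins=z_edges,
        weights=np.asarray(train_weights, dtype=float)[nearest],
    )
    total = hist.sum()
    if total == 0:
        raise ValueError("no neighbour weight falls inside the redshift bins")
    return hist / (total * np.diff(z_edges))
```

Using the reweighted training-set weights in the histogram is what ties the per-object estimate to the colour and magnitude distribution of the photometric sample rather than to that of the training set.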
In this section, we recompute and for various lens redshift distributions, but instead of using the photoz to get , we integrate over the full (normalized to integrate to unity):
(19) 
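Averaging a lensing quantity over a full redshift distribution can be sketched as below. This is a hedged sketch only: the function name is hypothetical, the quantity being averaged is left as a caller-supplied function `kernel(z)`, and taking it to be the inverse critical surface density at fixed lens redshift is an assumption of the example.

```python
import numpy as np

def average_over_pz(z_grid, pz, kernel):
    """Average kernel(z) over a tabulated p(z).

    z_grid : increasing array of source redshifts.
    pz : p(z) tabulated on z_grid (need not be normalized).
    kernel : callable returning the quantity to average at each z,
        for example an inverse critical surface density that is zero
        for sources at or below the lens redshift.
    """
    z_grid = np.asarray(z_grid, dtype=float)
    pz = np.asarray(pz, dtype=float)
    dz = np.diff(z_grid)

    def trapezoid(y):
        # Trapezoidal rule written out explicitly.
        return np.sum(0.5 * (y[1:] + y[:-1]) * dz)

    norm = trapezoid(pz)  # renormalize p(z) to integrate to unity
    return trapezoid(pz * kernel(z_grid)) / norm
```

Because the average runs over the whole distribution, a source whose photoz lies below the lens redshift can still contribute through the part of its p(z) that lies above it.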
We then compare the results using the two estimates of to the results using the photoz alone. Figure 14 shows the calibration bias as a function of using the photozs directly (as in Fig. 7) and the full estimates of . In Table 8, we show the calibration bias averaged over various lens redshift distributions (as in Table 3) using the full . As s