# Mean-flux Regulated PCA Continuum Fitting of SDSS Ly Forest Spectra

## Abstract

Continuum fitting is an important aspect of Ly forest science, since errors in the estimated optical depths scale with the fractional continuum error. However, traditional methods of estimating continua in noisy and moderate-resolution spectra ( and , e.g. SDSS) such as power-law extrapolation or dividing bythe mean spectrum, achieve no better than RMS accuracy. To improve on this, we introduce mean-flux regulated/principal component analysis (MF-PCA) continuum fitting. In this technique, PCA fitting is carried out redwards of the quasar Ly line in order to provide a prediction for the shape of the Ly forest continuum. The slope and amplitude of this continuum prediction is then corrected using external constraints for the Ly forest mean-flux. From tests on mock spectra, we find that MF-PCA reduces the errors to 8% RMS in spectra, and RMS in spectra with . The residual Fourier power in the continuum is decreased by a factor of a few in comparison with dividing by the mean continuum, enabling Ly flux power spectrum measurements to be extended to larger scales. Using this new technique, we make available continuum fits for 12,069 Ly forest spectra from SDSS DR7 for use by the community. This technique is also applicable to future releases of the ongoing BOSS survey, which is obtaining spectra for Ly forest spectra at low signal-to-noise ().

###### Subject headings:

intergalactic medium — quasars: emission lines — quasars: absorption lines — methods: data analysis^{4}

## 1. Introduction

Over the past 2 decades, the Lyman- (Ly) forest absorption observed in the spectrum of high-redshift quasars has been an important probe of large-scale structure and the inter-galactic medium (IGM) at . The fundamental quantity of interest of the Ly forest is its local optical depth to absorption, , at position . This is not a directly observed quantity: it is derived from the flux transmission , which requires knowledge of the intrinsic quasar continuum in order to be extracted from the observed flux, , where is the observed flux and is the restframe wavelength of the quasar at redshift . The error in the measured optical depth, , scales roughly with the fractional continuum error, , which means that accurate continuum fitting is important in the optically thin Ly forest, where . Therefore, accurate estimates of the underlying quasar continuum are crucial in order to take full advantage of modern Ly forest data sets.

While virtually all aspects of Ly forest science are dependent on the continuum determination, some are more sensitive than others. For example, the Ly flux probability distribution, which is used to constrain the IGM temperature-density relation (TDR), is notoriously sensitive to continuum errors: Lee (2011) have recently found that a 2% systematic error in the continuum estimation can double the errors in the TDR even with high signal-to-noise data. On the other hand, measurements of the 1-dimensional flux power spectrum and other 2-point statistics (e.g. threshold clustering functions, Lee & Spergel 2011) are affected by the Fourier power introduced by intrinsic quasar emission lines in the Ly forest region. For example, the unaccounted continuum variance in the quasar continuum has limited the measurement of the 1-dimensional Ly forest flux power spectrum (McDonald et al. 2006) to comoving scales of at . Ongoing attempts to measure the baryon acoustic oscillation (BAO) feature in the Ly forest using transverse correlations across lines-of-sight are less sensitive to continuum errors (McDonald & Eisenstein 2007); however, large continuum errors will still degrade the significance of the BAO signal measured in this fashion.

Unfortunately, accurate quasar continuum fitting is a non-trivial problem. At redshifts () in which the Ly forest becomes accessible to ground-based optical telescopes, the high absorber line-density makes it challenging to identify the intrinsic quasar continuum. With high-resolution and high signal-to-noise () quasar spectra obtained from large telescopes, continua are usually fitted using some form of spline-fitting to the observed transmission peaks of the forest — if one believes that the transmission peaks truly reach the quasar continuum at a given redshift (but see Faucher-Giguère et al. 2008; Lee 2011). These direct-fitting methods cannot, in general, be applied to large data sets such as the Sloan Digital Sky Survey, as the modest resolution and low make it impossible to directly fit the Ly forest except for the very highest subsamples — and even then steps need to be taken to account for the degradation of transmission peaks from the lower resolution (see, e.g., Dall’Aglio et al. 2009). Moreover, direct fitting techniques usually require significant human intervention and are often very time-consuming, precluding their application to the Ly forest sightlines in SDSS.

Noisy Ly forest data usually require some form of extrapolation from the
relatively unabsorbed spectrum bluewards^{5}

There are two major issues with power-law estimation of Ly forest continua. Firstly, there is a break in the underlying quasar power-law at . This was first identified by Zheng et al. (1997) from a study of low-redshift () quasars observed in the ultra-violet, in which the Ly forest continuum could be clearly identified due to the low Ly line density at those epochs. A subsequent study by Telfer et al. (2002) found mean power-law indices of and redwards and bluewards of , respectively. This implies that a naïve power-law extrapolation from would underestimate the true Ly forest continuum by . Furthermore, Telfer et al. (2002) found a large scatter in and from the individual quasars in their sample, with no correlation between the two; this increases the error from power-law extrapolation in individual Ly forest spectra. While Desjacques et al. (2007) and Pâris et al. (2011) have discussed this EUV-NUV power-law break in the context of Ly forest continuum estimation, it is often ignored in Ly forest analyses.

Secondly, the Ly forest ‘continuum’ (usually defined around ) includes weak emission lines such as Fe ii 1071 and Fe ii/Fe iii 1123, although the exact identifications vary from author to author. These emission lines can cause deviations of up to from a flat continuum. It is possible to take these features into account on average: for example, Bernardi et al. (2003) modeled them as two Gaussian functions superposed on top of an underlying power-law. However, there is a great diversity in the shape and equivalent width of these weak emission lines (Suzuki 2006). Therefore, the use of an average continuum shape would not account for variations of up to 10% within individual quasars due the presence of these emission lines.

One possible avenue for improved quasar continuum fits is Principal Component Analysis (PCA). Suzuki et al. (2005) explored this using a sample of 50 low-redshift quasars observed in the UV by the Hubble Space Telescope (HST), in which the continuum can be clearly identified. They concluded that while PCA fits to the red-side () of individual spectrum gave a good prediction of the Ly continuum shape (i.e. the weak emission lines), the overall continuum amplitude had errors. This is presumably due to the EUV-NUV power-law break. Pâris et al. (2011) recently carried out a similar analysis on a high- ( per pixel) subsample of the Sloan Digital Sky Survey (SDSS) quasar sample. They found a better prediction accuracy of , possibly due to their larger spectral baseline ( as opposed to in the earlier work).

However, the standard PCA formalism does not take pixel noise into account, whereas the majority of quasars observed in SDSS have low signal-to-noise () Francis et al. (1992) have argued that PCA fitting errors scale directly with the noise level, which implies that, e.g., one can expect no better than continuum accuracy in a spectrum using PCA, even without taking the power-law break into account.

For all the reasons outlined above, more accurate continuum-fitting methods are sorely needed to take full advantage of Ly forest data from SDSS and future spectroscopic surveys. In this paper, we will explore a refinement of the PCA technique which we term ‘mean-flux regulated PCA’ (MF-PCA). Briefly, we carry out least-squares-fitting of PCA templates to the unabsorbed quasar spectrum redwards () of the Ly line in order to obtain a prediction for the continuum shape, and then use the expected mean flux, , to constrain the amplitude of the fitted continuum. Tytler et al. (2004) have shown that the dispersion in expected from a segment of Ly forest at is . When averaged across an entire Ly forest sightline (which spans for a quasar at ), one expects the continuum amplitude to be predicted to .

This paper is organized as follows: § 2 describes the publicly available
SDSS quasar sample which will be the initial subject of our new technique. § 3
elucidates the MF-PCA technique, which is then tested on mock spectra in
§ 4. We will then discuss the results of our continuum fitting and
future improvements.
The continuum-fits have been made publicly available and can be downloaded via
anonymous
FTP^{6}

## 2. Data

The MF-PCA technique which we develop in this paper is optimized towards large sets of noisy Ly forest data spectra. We will apply this technique to the Sloan Digital Sky Survey data, which comprises Ly forest sightlines at moderate resolution () and modest signal-to-noise ( per pixel).

This section provides an overview of the SDSS Ly forest data sample, and also the two sets of quasar templates which we will use to fit this data set.

### 2.1. SDSS DR7 Ly Forest Sample

In this paper, we carry out continuum-fitting for publicly-available spectra from the final SDSS Data Release 7 (DR7) quasar catalog (Schneider et al. 2010), which is comprised of 105,783 spectroscopically confirmed quasars observed from the 2.5m SDSS Telescope in Apache Point, NM. The spectra cover the observed wavelength range with a spectral resolution of .

From this overall catalog, we select a subsample suitable for Ly forest studies. First, we require that some portion of the quasar Ly forest region, , be within the observed wavelength range. Since the extreme blue end (near ) of the SDSS spectra are known to suffer from spectrophotometric problems, we use as the lower wavelength limit. This sets a minimum quasar redshift of . For the quasars that satisfy this redshift criterion, we excise the portions of the spectra that lie below . In addition, broad absorption line (BAL) quasars have continua which are difficult to characterize (although see Allen et al. 2011, for a method to recover quasar continua from BALs), therefore we discard quasars flagged as BALs in the Shen et al. (2011) value-added quasar catalog.

There are 13,133 quasars in the DR7 quasar catalog which satisfy the above criteria. We make further quality cuts by discarding 962 spectra which have bitmasks set (this signifies issues with the fiber; see Stoughton et al. 2002, for further details on the SDSS bitmask system), and 2 spectra where the signal-to-noise was too low to normalize the spectra, leading to negative normalizations. This leaves us with a sample of 12,069 spectra, to which we will apply the MF-PCA technique. Within individual spectra, we mask pixels which have either zero inverse-variance or the maskbits set. This avoids the use of problematic pixels, such as rejected extractions, bright sky-lines or bad flats.

The median signal-to-noise in the sample is per SDSS pixel within the Ly forest,
and per pixel in the wavelength region.
The redshift and signal-to-noise^{7}

In addition, we need to deal with Damped Ly Absorbers (DLAs) within the spectra. These are absorbing systems with neutral hydrogen column densities of which result in complete absorption over large portions () of affected sight-lines. Since the MF-PCA technique (§ 3) fits the amplitude of the quasar continuum based on the mean-flux of the low column-density Ly forest, the excess absorption of a DLA within a sightline would bias the continuum estimate.

To correct for this, we use a catalog of 1427 DLAs identified in the SDSS DR7 spectra by Noterdaeme et al. (2009). First, we mask the wavelength region corresponding to the equivalent width of each DLA (Draine 2011):

(1) |

where is the rest-frame wavelength of the hydrogen Ly transition, is the electron charge, is the electron mass, is the speed of light, is the H i column density of the DLA, is the Ly oscillator strength, and is the sum of the Einstein coefficients for the transition.

However, the damping wings of each DLA extend beyond the equivalent width, providing a small but non-negligible excess absorption to the pixels close to the DLA. We correct for this by multiplying each pixel in the spectrum with , where

(2) |

and is the wavelength separation in the DLA restframe.

### 2.2. Quasar Templates

In order to predict the shape of the Ly forest continuum, we need a set of template spectra with clearly identified continua at wavelengths . For this purpose, we will use two different sets of quasar templates, derived from quasars observed in the Hubble Space Telescope (HST), and SDSS itself.

Suzuki et al. (2005) derived PCA templates from 50 quasars that had been observed by the Far Object Spectrograph (FOS) on the Hubble Space Telescope in the ultraviolet. At the low-redshifts () of these quasars, the line-density of the Ly forest is sufficiently small that the quasar continuum could be clearly identified. This enabled the creation of templates in the range , covering Ly to C iv 1549.

Pâris et al. (2011) recently carried out a similar study, applying the techniques in Suzuki et al. (2005) to a subsample of 78 SDSS DR7 quasars. These quasars were selected to have full coverage of the Ly forest and relatively high signal-to-noise (). The transmission peaks in the Ly forest were hand-fitted with a low-order spline function to provide a continuum estimate. PCA templates were then derived in the spectral range , which included the C iii 1906 line. While this process might give a biased continuum level due to the low resolution and of the templates, it should provide a good description of the relative shape of the quasar continuum which is required for MF-PCA — the mean-flux regulation process is designed to correct for uncertainties in the overall continuum level arising from pure PCA fitting.

We do not expect the redshift differences between the template and the DR7 quasars to be a significant issue, even for the Suzuki et al. (2005) quasars (). This is because various studies (e.g. vanden Berk et al. 2004; Fan 2006) have suggested that there is little redshift evolution of quasar spectra. However, the shape of quasar spectra is known to have a significant luminosity dependence, such as the well-known Baldwin effect (Baldwin et al. 1978), which is the anti-correlation between the strength of the C iv 1549 emission line and the quasar luminosity. It is therefore reasonable to supposed that there would be a significant difference in the spectral shapes represented by two templates and the overall SDSS sample. This is because the Suzuki et al. (2005) sample is comprised of relatively low-luminosity, nearby quasars as opposed to the Pâris et al. (2011) quasars, which were selected to have high and are therefore a luminous subsample of the SDSS quasars.

In order to compare the relative luminosities, we calculate , the intrinsic monochromatic luminosity near , for the quasars in our SDSS DR7 sample as well as the two template samples. We assume a standard CDM cosmology with , , and . The respective distributions of is shown in Figure 2. The SDSS DR7 quasars have a typical luminosity of , while the Suzuki et al. (2005) and Pâris et al. (2011) quasars are about 0.5 dex fainter and brighter, respectively. However, the combined luminosity distributions of the Suzuki et al. (2005) and Pâris et al. (2011) template quasars significantly overlap the full range of SDSS DR7 quasars, which justitifes the use of both templates in this paper.

## 3. Method

The mean-flux regulated PCA (MF-PCA) fitting method described in this paper is essentially a two step process: (1) fitting of the red side () of the individual quasar spectra using PCA templates, to predict the shape of the weak emission lines in the Ly forest continuum. This is followed by (2) constraining the amplitude of the predicted Ly forest continuum to be consistent with existing measurements of the mean-flux evolution of the Ly forest, .

### 3.1. Least-squares PCA Fitting

The basic concept of principal component analysis (PCA) is that a normalized quasar spectrum, , can be represented as

(3) |

where is the mean quasar spectrum, is the th principal component or ‘eigenspectrum’, and are the weights for an individual quasar. The formalism for deriving the eigenspectra and weights is described in Suzuki et al. (2005) and Pâris et al. (2011) .

The standard PCA formalism for deriving the weights, , does not take into account spectral noise, which renders it unsuitable for noisy SDSS spectra (see Figure 1b). Instead, we first carry out a least-squares fit to the red-side of each spectrum using the full eigenspectra as a basis. Due to the correlation between the weak emission lines within the Ly forest and in (Suzuki et al. 2005), we expect this to provide a reasonable prediction for the shape of the continuum.

As described in § 2.2, we have two separate sets of PCA eigenspectra from Suzuki et al. (2005) and Pâris et al. (2011). In principle, one could combine the two sets of quasar templates to generate one set of eigenspectra which would encompass the diversity of both template samples. However, the template spectra from Pâris et al. (2011) are not available to us at time of writing, therefore we will carry out our fitting procedure separately for the two sets of PCA eigenspectra. Suzuki et al. (2005) had found that out of their 10 principal component eigenspectra, only the first 8 components appeared to describe physical features in the spectra, while the 9th and 10th components seemed to describe mostly noise. Therefore, we will use only 8 components from each set of eigenspectra for our fits. In addition, for the sake of consistency we limit ourselves to the rest-wavelength range of each eigenspectrum even though the Pâris et al. (2011) eigenspectra extend up to .

However, we have found that fitting the SDSS spectra with just the PCA weights was insufficient to account for the large diversity of the sample. Therefore, we introduce 2 additional fit parameters: a power-law component, , and redshift-correction factor, . The power-law component, , is necessary due to the large range of slopes found in the SDSS quasars. Even though the 3rd through 5th principal components in the HST eigenspectra include the spectral slope, they also describe some emission-line features — the introduction of as a free parameter allows an additional degree of freedom and enables a better fit to the emission lines and slope simultaneously. Due to this degeneracy between the slopes within the eigenspectra and , we do not interpret the latter as the slope of the underlying quasar power-law continuum. The power-law parameter also helps account for low-order spectro-photometric errors as well as dust extinction in the spectrum.

The redshift-correction factor, , translates the spectrum along the wavelength axis with respect to the rest wavelength given by the pipeline redshift, , to a best-fitting rest wavelength, . It is required as the SDSS pipeline redshifts are not completely accurate (see, e.g., Hewett & Wild 2010). However, due to the asymmetry and velocity shifting of quasar emission lines at different redshifts, we do not necessarily interpret as a true redshift correction — it is merely an ad-hoc parameter to obtain the best-possible fit to the spectrum. The full list of free parameters for our continuum-fitting procedure shown in Table 1 ( and are free parameters for the mean-flux regulation step, described in § 3.2).

Fit Parameter | Description |
---|---|

Flux normalization, evaluated at | |

Redshift correction factor | |

Power-law exponent | |

PCA coefficients | |

Linear mean-flux regulation coefficient | |

Quadratic mean-flux regulation coefficient |

We are now in a position to carry out the fitting procedure. First, the quasar spectrum is shifted to the quasar restframe using the pipeline redshift, and normalized at . We then use the least-squares fitting routine MPFIT (Markwardt 2009) to find the best-fitting set of parameters, , given the spectrum and its noise. The initial fits from this procedure is represented by the black dashed lines in the examples shown in Figure 3.

However, while the intrinsic quasar spectrum is generally well-defined redwards of Ly in the SDSS spectra, in many cases intervening metal absorption lines can be seen in the spectrum in the wavelength. To prevent these absorption features from biasing the PCA fitting, we carry out a simple iterative procedure to mask these absorption lines: using the continuum, , obtained from the initial least-squares fit, we mask pixels in which , where and are the observed spectrum and pipeline noise, respectively. We then make a new PCA fit and repeat this process until the fit converges. In Figure 3, the final PCA fits, , are shown as black solid lines while the masked pixels are denoted by crosses.

From Figure 3, we see that the least-squares PCA fitting procedure generally works well, even with noisy () spectra. The fit to the Ly emission line is sometimes imperfect, but unsurprising considering that the fitted range () only takes partial account of the line. Comparing the initial (dashed-line) and final (solid-line) fits in Figure 3, we see that the red-side metal absorption lines usually have little effect on the fits, but in certain cases (e.g. Figure 3(b) and (c)) absorption line-masking noticeably improves the fit.

We use the absolute flux error to quantify the goodness of the PCA fits redwards of Ly:

(4) |

where is the fitted continuum and is the observed spectrum smoothed by a 15-pixel boxcar, to avoid biasing in noisy spectra. and represents the range over which we calculate .

The distribution of in the fitted data is shown in Figure 5, which plots against the red-side signal-to-noise per pixel, , for PCA fits to a subset of the SDSS spectra as well as the mock spectra described in § 4.

Figure 5 provides a useful diagnostic for the quality of the PCA fits on the SDSS spectra. We expect the fits to the mock spectra (green crosses) to represent the case in which the PCA eigenspectra describe the spectra nearly perfectly (see § 4), therefore they typically have smaller values of than the real spectra. Clearly, the SDSS spectra with large are most likely bad fits, but note that the presence of metal absorption lines and other artifacts in the real data can bias to larger values even for good fits (we did not mask any lines when calculating , as our metal-masking algorithm is rudimentary and sometimes masks legitimate pixels).

In practice, we carry out the PCA fitting procedure using the two different PCA templates described in § 2.2, then for each SDSS spectrum we select the fit which gives the lower value of . Fits with values under the 95th percentile of those from the mock spectra (red line in Figure 5 are then automatically considered good fits, while the rest are visually inspected and flagged for goodness-of-fit on the red-side of the spectrum — approximately 90% of the spectra were adequately fit. The spectra which are not well-fitted by our procedure consist mostly of objects which have strong absorption systems at , such as metal absorption from DLAs and weak BAL quasars. An example of this is shown in Fig. 4a. There are also quasars with unusual spectral shapes which are not represented in the template spectra described in § 2.2, such as quasars with weak emission lines (Fig. 4b).

The spectra now have been had PCA fits carried out on them redwards of Ly, but the predicted continuum, , extends bluewards of Ly (). For the objects which are well-fitted on the red-side of the spectrum, we expect the predicted continua to provide a reasonable prediction for the shape of the Ly continuum bluewards of Ly, but the overall amplitude is uncertain due to the EUV-NUV power-law break described in the Introduction. We now turn to the next fitting step, mean-flux regulation, to constrain the continuum amplitude.

### 3.2. Mean-Flux Regulation

In the least-squares PCA fitting step described in the previous section, we have hitherto used no information bluewards of the quasar Ly emission line due to the absorption from the Ly forest. However, since the absorption redshift at any point in the Ly forest is known (), the average absorption averaged over each sightline can be used to constrain the predicted continuum. In this section, we will describe the use of the mean-flux evolution of the Ly forest, , to regulate the amplitude and slope of the predicted PCA continuum. We refer to this as the ‘mean-flux regulation’ step.

Using the PCA continuum fitted to the observed spectrum, we first extract the Ly forest transmission in the range . The extracted Ly forest is then divided into bins, and the mean-flux, , is evaluated for each bin, where are the central rest wavelengths of each bin.

We now introduce a quadratic fitting function bluewards of a pivot point, , to obtain the mean-flux regulated continuum:

(5) | |||||

where and are free parameters for the fit, while . Note that for the lower redshifts () in which only a limited portion of the Ly forest is accessible, we use only the linear parameter, , in order to avoid over-fitting.

We again use least-squares-fitting to find the values of and which provide the best fit between the extracted mean-flux and the external mean-flux constraint . In this mean-flux regulation step, the parameters and fitted to are kept fixed. For the mean-flux constraint, we use a double power-law fitted from the mean-flux measurement of (Pâris et al. 2011, kindly provided by Isâbelle Paris):

(6) |

where .

In principle, the errors in the continuum fit should now be at the level of a few percent, arising from some combination of the large-scale variance in the Ly forest and errors in the fitting. In the next section, we will use mock spectra to quantify the level of continuum errors in the MF-PCA technique.

## 4. Tests on Mock Spectra

The errors in the MF-PCA continuum fitting technique can be tested by carrying the above procedure on noisy mock spectra, and comparing the fitted continuum with the ‘true’ continuum which is known by construction. This testing process is also useful to check our algorithm for bugs and efficiency. In this section, we will describe the process of generating realistic mock spectra, and the quantitative results of the MF-PCA technique. We will also make a comparison between MF-PCA and the common technique of using the mean quasar continuum as the Ly forest continum.

### 4.1. Generating Mock Spectra

The first step is to create synthetic quasar spectra from PCA eigenspectra, by making Gaussian realizations of the PCA weights, , (see Equation 3) in the manner described in Suzuki (2006). Note that this is an approximation, as the distribution of the weights may not be fully Gaussian, but it does generate realistic-looking quasar spectra. In principle, the PCA eigenspectra used to generate the mock spectra and those used in the fitting procedure should be separate but drawn from the same distribution. While we do have two sets of PCA eigenspectra (§ 2.2), they represent quasars with different luminosities. Hence, we do not expect to be able use eigenspectra from Pâris et al. (2011) to fit mock spectra generated from the Suzuki et al. (2005) eigenspectra, and vice versa. However, since we only use 8 principal components in our PCA fitting step (§ 3.1), we can generate our mock spectra using 10 principal components in order to increase the uncertainty in the fitting. Nevertheless, the tests described in this section will primarily apply to the limit in which the PCA eigenspectra are a good representation of the fitted quasars, which we have argued (§ 2.2) is a reasonable assumption.

Therefore, we generate mock quasar spectra in the spectral range ,
using 10 principal components from the Suzuki et al. (2005) eigenspectra.
Next, we need to introduce Ly forest absorption to the mock spectra.
For this, we use the publicly available Roadrunner Ly forest
simulations^{8}

For a given mock quasar at redshift , we select the simulation box with the closest redshift, . Using Equation 6, we then re-normalize the mean-flux of the box to , i.e. using the absorber redshift corresponding to the in the quasar spectrum. We choose to normalize the mean-flux across the entire box rather than in individual spectra in order to preserve the variance across different lines-of-sight, which is a source of error in the MF-PCA continuum fitting. A random line-of-sight is selected from the set of skewers, and the transmitted flux in each pixel, , is rescaled to , where is the absorber redshift corresponding to the pixel. This introduces redshift evolution of the mean flux, , within the individual sightlines which had hitherto had a fixed value of .

The simulated Ly forest absorption is added to the mock quasar spectrum in , and smoothed to the approximate SDSS resolution, . Gaussian noise is then added to the mock spectrum using the noise array of a randomly-chosen SDSS quasar spectrum with the same and . The mock spectra are then run through the MF-PCA fitting process described above to obtain continuum fits.

### 4.2. MF-PCA Continuum-Fitting on Mock Spectra

In Figure 6, we show several examples of the mock spectra and the fitted MF-PCA continua. The first thing to note is that mock spectra look realistic. They look similar to the real spectra shown in Figure 3, apart from the lack of metal absorption redwards of the Ly emission line.

As described in § 3.1, we use the mock spectra as a benchmark for the PCA fit quality on the red-side () of the spectra — we automatically accept all fits with absolute flux error, , less than the 95th percentile of the distribution measured from the mocks (Figure 5). For the fits with larger that require visual inspection, we use the mocks as a visual guide for what constitutes a good fit.

It is clear from Figure 6 that the mean-flux regulated continuum, , is a corrected version of the PCA fit, . In several cases, the initial PCA continuum,, appeared unphysical (e.g. dipping below the peaks of the forest). These were rectified by the mean-flux regulated fit, .

We can place this on a more quantitative footing by comparing the fitted continua, , to the ‘true’ continua, , which is known by construction in the mock spectra. We define the continuum fitting residual,

(7) |

We can then carry out MF-PCA fits on large numbers of mock spectra to obtain statistics on the continuum fitting errors as a function of and quasar redshift. In Figure 7 we show the residuals from fitting 1000 mock spectra in two bins of signal-to-noise, , and quasar redshift, . The orange lines show the residuals as a function of wavelength, , binned into restframe bins from a subset of 100 mocks. The dashed lines show the dispersion of the residuals estimated from bootstrap resampling at each wavelength bin, while the dotted lines show the 10th and 90th percentile at each wavelength. Figure 7(a) represents one of the worst case scenarios — the signal-to-noise () is low at , making it difficult to obtain a good fit for the continuum shape. At the same time, the redshift is sufficiently low that the Ly forest occupies the blue end of the SDSS spectrographs where the signal-to-noise deteriorates rapidly with decreasing wavelength. This causes a large scatter in the flux of Ly forest even when averaged over large segments, introducing more errors to the mean flux regulation process.

At moderate signal-to-noise (, Figure 7(b)), the situation is significantly better. The dispersion of the fit residuals are well under 10%, and in the central portion of the fitted region. Many of the residuals are flat to within a few percent across the Ly forest region, indicating that the PCA-fitting has successfully accounted for the shape of the Ly continuum. About one-tenth of the fitted continua are badly fit, with badly-predicted continuum shapes and/or residuals greater than 10%.

We also calculate the mean bias, , of the residuals in Figure 7. Averaged over spectra, the MF-PCA technique have a low bias of , although this not include any systematics errors in the measurement used to constrain the overall continuum levels.

To quantify the overall fit quality on each mock spectrum, we use the RMS of the continuum residuals evaluated over :

(8) |

Figure 8 shows the median RMS error from runs of 1000 mocks as a function of redshift, for 4 different bins. At lower redshifts, the RMS error is relatively high for the low- spectra because the mean flux regulation is affected by the increased noise levels near the blue-end of the SDSS spectra at . As the observed Ly forest region clears the blue end of the spectra, the RMS error decreases to a minimum at . It then rises with redshift at due to the increasing variance in the Ly forest, which adds to the error in the mean-flux regulation. At fixed redshift, the median RMS decreases with as might be expected. Below , it drops below RMS for moderate signal-to-noise () and asymptotes to RMS for the spectra.

To estimate the contribution from various sources of error in the MF-PCA fitting, we also carried out the mean-flux regulation directly on the simulated Ly forest skewers without introducing a quasar continuum, or adding pixel noise. In other words, we directly fit the function Eq. 5 to the mean flux evaluated over 3 bins in each skewer. The RMS error in the continuum from this estimate is shown as the dashed line in Fig. 8. For quasars, the RMS error contribution from Ly forest variance is about 1.5% ; this increases to 4% at . This suggests that even in the limit of high-, errors in the continuum shape from PCA fitting contributes to the overall RMS continuum error.

### 4.3. Power-law+Mean Continuum Fitting

To place the above results in context, we carry out another set of continuum fits on the mock spectra, using the continuum model

(9) |

in which the mean spectrum, , is multiplied with a power-law, , and is the flux normalization at . Both and are determined separately for each quasar, with the power-law fitted to the regions near and . This model is highly similar to that implemented in Slosar et al. (2011).

In Figure 9, we show the continuum residuals from , as a function of rest-wavelength in the Ly forest region. Comparing this plot with Figure 7(b), which show the MF-PCA fitting residuals from the same bin, it is clear that MF-PCA dramatically reduces the range of fitting errors by a factor of 3. Indeed, the residuals are significantly larger than even the worst-case scenario for MF-PCA continua, represented in Figure 7(a). The significant bias (black line) of the residuals from is puzzling at first glance, as one would expect to recover the mean spectrum (and hence no bias) when averaging over large numbers of spectra. We suspect this bias most likely due to an asymmetry in the distribution of power-law spectral indices in quasars (see Desjacques et al. 2007), which we have not accounted for in our mock spectra. However, this does not affect the scatter and shape of the continua, which is the present quantity of interest. The median RMS error from the continua shown in Figure 9 is , which is worse than any of the bins evaluated in Figure 8.

### 4.4. Residual Continuum Power

The RMS continuum fitting error is not the most important quantity for studies of the 1-dimensional Ly forest flux power spectrum,

(10) |

where , and is the Fourier wavenumber. Rather, it is the Fourier power from the continuum errors which is the troublesome systematic. For example, in their measurement of from SDSS data, McDonald et al. (2006) were limited to scales of , corresponding to comoving distances of or in the quasar restframe wavelength. This was due to the increasing influence of continuum power at large scales. It is therefore pertinent to investigate the amount of residual Fourier power introduced by the various continuum fitting methods.

In Figure 10, we show the mean power spectrum of the continuum residuals from 1000 mock spectra, , for the power-law+mean continuum method, , and MF-PCA continuum fitting, , shown for two signal-to-noise bins. All the mock spectra had quasar redshifts in the range . All the residual power-spectra have the same overall shape, with a bump at corresponding to the weak emission lines in the intrinsic quasar spectrum at scales of . This is unsurprising, since imperfections in the continuum fitting should give similarly-shaped residuals. At fixed , the amplitudes follow the same pattern we had already discussed above: the residual power from MF-PCA fitting is significantly lower than that from the power-law+mean continuum fits. Even with noisy spectra, continuum fitting reduces the residual power by compared to fitted to higher- spectra.

In their measurement of from SDSS data, McDonald et al. (2006) had used a mean continuum shape for all their spectra, which is similar to except that they did not fit for the individual power-law indices. Using this method, McDonald et al. (2006) found that residual continuum power started becoming problematic at scales greater than , indicated by the solid red vertical line in Figure 10. We thus estimate the residual power in at this scale at which continuum power interferes with measurements. We can then look for the points in the residual power spectra with the same limiting power (red arrows in Figure 10). Note that this assumes that the Ly forest power is constant whereas the Ly forest power decreases with scale, but the change is gradual. For our rough estimates, we can approximate it as constant over small logarithmic intervals.

The corresponding limiting values of (red dotted lines) are significantly smaller than for . This suggests that MF-PCA continuum fitting could allow the Ly forest flux power spectrum to be measured at larger scales than previously possible. For spectra, the lower -limit is now . This corresponds to a doubling of the accessible comoving scales: at compared to in the McDonald et al. (2006) study, where these distances are calculated assuming a standard flat CDM cosmology with , and with a cosmological constant. Even for noisy () spectra, the accessible scale has been increased significantly to .

The new continuum-limited scales approach the baryon acoustic oscillation (BAO) scale at moderate signal-to-noise (). In Fig. 10, the black vertical dot-dot-dot-dashed line indicates the wavenumber of first BAO peak, calculated from the prescription in Eisenstein & Hu (1998). Even though future BAO measurements in the Ly forest are expected to be carried out in 3-dimensions (McDonald & Eisenstein 2007), the increase in accessible modes along the lines-of-sight will improve the robustness and precision of the measurements.

## 5. Results and Conclusion

### 5.1. Public Release of Continua

We have carried out mean-flux regulated PCA (MF-PCA) continuum fitting on 12,069 quasar spectra from the
SDSS DR7 catalog. The continua in the spectra range
have been made publicly available, and can be downloaded via anonymous
FTP^{9}

Since we expect the MF-PCA technique to provide a good continuum fit only when there is a good PCA fit redwards of the quasar Ly line, the fitted spectra have been visually inspected to verify the fit quality in the wavelength region. Approximately of the spectra had reasonable PCA fits, and these have been flagged as such in the publicly available continua, although we recommend that users of the continua should make their own cuts on the fit quality. Approximately 30% of the spectra were better fitted by the low-redshift Suzuki et al. (2005) templates, while the rest were better fitted by the Pâris et al. (2011) templates. This is qualitatively as expected, since there is greater overlap by the Pâris et al. (2011) templates in the luminosity distribution of the SDSS quasars.

From the mock spectra analysis of the fit quality in § 4, we have estimated the continuum fitting error at each pixel within the Ly forest, as a function of quasar redshift and spectral signal-to-noise. However, the errors have significant covariances, therefore it would be too unwieldy to provide the full error estimates for each spectrum although we can provide them upon request.

It is worth noting that although we have made a choice on the mean-flux of the Ly forest in our fits (Eq 6), it is straightforward for users to rescale each fitted continuum to their favorite .

### 5.2. Empirical Tests of Fit Quality

While we have studied the performance of MF-PCA continuum fitting on mock spectra in § 4, it is difficult to empirically constrain the quality of the fits. One possibility is to compare a small subset of the data with high-resolution, high- spectra of the same objects. However, even with high-resolution spectra, it is questionable whether there are sufficient transmission peaks in the forest to adequately constrain the continuum shape; Faucher-Giguère et al. (2008) have shown that accurate fitting of the quasar continuum is difficult beyond . Furthermore, most high-resolution spectra are obtained from echelle spectrographs with uncertain spectrophotometry, so it would be tricky to directly compare quasar sightlines which have been observed in both SDSS and high-resolution echelle spectrographs.

However, it is possible to get a sense of the efficacy of our MF-PCA continua by stacking large numbers of spectra. This cancels out the Ly forest power from individual sightlines, and allows the underlying continuum shape to be seen, albeit lowered due to the mean Ly absorption.

Recall that the PCA coefficients, , parametrize the shape of the quasar spectra. Therefore, if our continuum-fitting technique works, quasars with similar measured from should have similar-looking continua within the Ly forest region. Thanks to the large number of spectra in our SDSS sample, it is possible to stack spectra with similar values of to recover the collective shape of their Ly forest continua.

We can select subsamples of quasars based on their values of and , where and are the standard deviations of and , respectively, in the low-redshift HST eigenspectra (Suzuki 2006). These two principal components account for approximately 80% of the total variance in the low-redshift quasar templates. We limit ourselves to spectra with , and which have been visually inspected to be decent fits redwards of . We also select quasars with in order to ensure reasonably complete coverage of the Ly forest. Within a subsample, each spectrum is first normalized near and rebinned into a common wavelength grid with bins before being stacked. The same procedure is carried out on the MF-PCA continuum fitted to each spectrum, to obtain a mean MF-PCA continuum for the subsample.

In Fig. 11 we show 4 subsamples from our SDSS sample with different with respect to the low-redshift Suzuki et al. (2005) eigenspectra. Redwards of , we see that the least-squared PCA procedure generally does a good job of fitting the emission lines, although there are inaccuracies in fitting N v 1240 and Si ii 1306. Bluewards of , the stacked spectrum appears to have a similar shape to the fitted continua, although the overall flux level is depressed due to the mean Ly absorption.

We can make a more direct comparison between the stacked spectra and the fitted Ly forest continua by correcting each observed Ly forest pixel by its mean flux (using Equation 6) before stacking. The mean-flux corrected Ly forest is shown as the disembodied red line in Fig. 11. It is gratifying to see that the stacked MF-PCA continua generally agrees well with the stacked Ly forest spectra. Our technique can clearly account for the diversity in quasar continua: spectra with clear emission-line features (Fig 11a) and those with smooth continua (Fig 11b) are well differentiated. Because we have corrected each Ly forest pixel by the same mean-flux (Eq. 6) which we have used to carry out the MF-PCA fits, we expect the amplitude of the corrected Ly forest stacks, in , to match those of the stacked MF-PCA continua, but the tilt and shape of the continuum bears testament to the success of the technique. In addition, since the MF-PCA continua shown in Fig 11 were a subset which used the Suzuki et al. (2005) quasar templates, this suggests that it was appropriate to use low-redshift templates to fit some of the SDSS spectra.

### 5.3. Conclusions

We have introduced mean-flux regulated PCA (MF-PCA) continuum fitting, a new technique for predicting the Ly forest continuum in low- spectra. In tests on mock spectra, we have found that MF-PCA can predict the continuum at the RMS level in SDSS spectra with at , and RMS in spectra. This is a significant improvement over the RMS continuum errors previously achievable in low- spectra. We are making available MF-PCA continuum fits for 12,069 Ly forest spectra from the SDSS DR7 quasar catalog. The MF-PCA technique also significantly reduces the Fourier power from continuum-fitting residuals by a factor of a few in comparison with dividing by a mean continuum. This will allow a concomitant increase in the accessible scales for Ly forest flux power spectrum measurements.

This improved continuum-fitting accuracy will significantly increase the value of low- Ly forest data. For example, the ongoing Baryon Oscillations Spectroscopic Survey (BOSS) will obtain Ly forest spectra from quasars at , with the aim of measuring the baryon acoustic oscillation feature in the Ly forest absorption across different quasar sightlines. The typical signal-to-noise () of BOSS spectra will be even lower than that of SDSS (), therefore we expect the MF-PCA technique to contribute significantly to the utility of the BOSS data.

There are several ways in which the current work could be improved. The Suzuki et al. (2005) PCA templates with which we have used to fit some of the spectra were derived from a low-redshift quasar sample which may not be a perfect descriptor of the SDSS data (although the test in § 5 shows that it does a reasonable job). In addition, while the Pâris et al. (2011) templates were indeed obtained from SDSS quasars, they used a high-luminosity subset which are not representative of the full SDSS luminosity distribution. Furthermore, the hand-fitting technique which they had used to obtain continua from these spectra cannot be used for the lower-luminosity (and hence lower-) quasars.

However, with a large data set such as SDSS or BOSS, it is possible to regard the Ly forest absorption within individual spectra as a noise term which cancels out with sufficiently large numbers of template spectra. This will allow new eigenspectra to be generated from the data itself, although each individual spectrum will need to be corrected by its mean flux before being included in the eigenspectrum solution. In the near future, we will work on this technique to generate new eigenspectra from the BOSS data.

The other issue with the MF-PCA fitting is that it requires an assumed mean-flux, for the Ly forest. This is not ideal, as the evolution of the mean-flux is an important observable of the Ly forest. This could in principle be overcome by solving simultaneously for the mean-flux of the Ly forest and the continuum-fitting parameters for the individual spectra, using maximum-likelihood techniques. This would allow large Ly forest data sets to be continuum-fitted and studied in a fully self-consistent fashion.

.

### Footnotes

- affiliation: Department of Astrophysical Sciences, Princeton University, Princeton, New Jersey 08544, USA
- affiliation: E.O. Lawrence Berkeley National Lab, 1 Cyclotron Rd., Berkeley, CA, 94720, USA
- affiliation: Department of Astrophysical Sciences, Princeton University, Princeton, New Jersey 08544, USA
- slugcomment: Submitted to AJ
- In this paper, we use the terms ‘blue’ and ‘red’ relative to the quasar Ly emission line unless otherwise noted
- ftp.astro.princeton.edu/lee/continua/
- Henceforth, all signal-to-noise values quoted in this paper are the median values per pixel, in the range unless indicated otherwise
- http://mwhite.berkeley.edu/BOSS/LyA/RoadRunner/
- ftp.astro.princeton.edu/lee/continua/

### References

- Allen, J. T., Hewett, P. C., Maddox, N., Richards, G. T., & Belokurov, V. 2011, MNRAS, 410, 860
- Baldwin, J. A., Burke, W. L., Gaskell, C. M., & Wampler, E. J. 1978, Nature, 273, 431
- Bernardi, M., Sheth, R. K., SubbaRao, M., Richards, G. T., Burles, S., Connolly, A. J., Frieman, J., Nichol, R., et al. 2003, AJ, 125, 32
- Dall’Aglio, A., Wisotzki, L., & Worseck, G. 2009, ArXiv e-prints
- Desjacques, V., Nusser, A., & Sheth, R. K. 2007, MNRAS, 374, 206
- Draine, B. T. 2011, Physics of the Interstellar and Intergalactic Medium, ed. Draine, B. T. (Princeton University Press)
- Eisenstein, D. J. & Hu, W. 1998, ApJ, 496, 605
- Fan, X. 2006, New Astronomy Reviews, 50, 665
- Faucher-Giguère, C., Prochaska, J. X., Lidz, A., Hernquist, L., & Zaldarriaga, M. 2008, ApJ, 681, 831
- Francis, P. J., Hewett, P. C., Foltz, C. B., & Chaffee, F. H. 1992, ApJ, 398, 476
- Hewett, P. C. & Wild, V. 2010, MNRAS, 405, 2302
- Lee, K.-G. 2011, ArXiv e-prints
- Lee, K.-G. & Spergel, D. N. 2011, ApJ, 734, 21
- Markwardt, C. B. 2009, in Astronomical Society of the Pacific Conference Series, Vol. 411, Astronomical Data Analysis Software and Systems XVIII, ed. D. A. Bohlender, D. Durand, & P. Dowler, 251–+
- McDonald, P. & Eisenstein, D. J. 2007, Phys. Rev. D, 76, 063009
- McDonald, P., Seljak, U., Burles, S., Schlegel, D. J., Weinberg, D. H., Cen, R., Shih, D., Schaye, J., et al. 2006, ApJS, 163, 80
- Noterdaeme, P., Petitjean, P., Ledoux, C., & Srianand, R. 2009, A&A, 505, 1087
- Pâris, I., Petitjean, P., Rollinde, E., Aubourg, E., Busca, N., Charlassier, R., Delubac, T., Hamilton, J.-C., et al. 2011, A&A, 530, A50+
- Schneider, D. P., Richards, G. T., Hall, P. B., Strauss, M. A., Anderson, S. F., Boroson, T. A., Ross, N. P., Shen, Y., et al. 2010, AJ, 139, 2360
- Shen, Y., Richards, G. T., Strauss, M. A., Hall, P. B., Schneider, D. P., Snedden, S., Bizyaev, D., Brewington, H., et al. 2011, ApJS, 194, 45
- Slosar, A., Font-Ribera, A., Pieri, M. M., Rich, J., Le Goff, J.-M., Aubourg, É., Brinkmann, J., Busca, N., et al. 2011, ArXiv e-prints
- Stoughton, C., Lupton, R. H., Bernardi, M., Blanton, M. R., Burles, S., Castander, F. J., Connolly, A. J., Eisenstein, D. J., et al. 2002, AJ, 123, 485
- Suzuki, N. 2006, ApJS, 163, 110
- Suzuki, N., Tytler, D., Kirkman, D., O’Meara, J. M., & Lubin, D. 2005, ApJ, 618, 592
- Telfer, R. C., Zheng, W., Kriss, G. A., & Davidsen, A. F. 2002, ApJ, 565, 773
- Tytler, D., Kirkman, D., O’Meara, J. M., Suzuki, N., Orin, A., Lubin, D., Paschos, P., Jena, T., Lin, W., Norman, M. L., & Meiksin, A. 2004, ApJ, 617, 1
- vanden Berk, D., Yip, C., Connolly, A., Jester, S., & Stoughton, C. 2004, in Astronomical Society of the Pacific Conference Series, Vol. 311, AGN Physics with the Sloan Digital Sky Survey, ed. G. T. Richards & P. B. Hall, 21–+
- Vanden Berk, D. E., Richards, G. T., Bauer, A., Strauss, M. A., Schneider, D. P., Heckman, T. M., York, D. G., Hall, P. B., et al. 2001, AJ, 122, 549
- White, M., Pope, A., Carlson, J., Heitmann, K., Habib, S., Fasel, P., Daniel, D., & Lukic, Z. 2010, ApJ, 713, 383
- Zheng, W., Kriss, G. A., Telfer, R. C., Grimes, J. P., & Davidsen, A. F. 1997, ApJ, 475, 469