QSO characterization within the Gaia mission

Determination of astrophysical parameters of quasars within the Gaia mission

L. Delchambre
Institut d’Astrophysique et de Géophysique, Université de Liège, Allée du 6 Août 17, B-4000 Sart Tilman (Liège), Belgique
E-mail: ldelchambre@ulg.ac.be
Accepted ???. Received ???; in original form ???

We describe methods designed to determine the astrophysical parameters of quasars based on spectra coming from the red and blue spectrophotometers of the Gaia satellite. These methods principally rely on two already published algorithms that are the weighted principal component analysis and the weighted phase correlation. The presented approach benefits from a fast implementation; an intuitive interpretation as well as strong diagnostic tools on the potential errors that may arise during predictions. The production of a semi-empirical library of spectra as they will be observed by Gaia is also covered and subsequently used for validation purpose. We detail the pre-processing that is necessary in order for these spectra to be fully exploitable by our algorithms along with the procedures that are used in order to predict the redshifts of the quasars; their continuum slopes; the total equivalent width of their emission lines and whether these are broad absorption line (BAL) quasars or not. Performances of these procedures were assessed in comparison with the Extremely Randomized Trees learning method and were proven to provide better results on the redshift predictions and on the ratio of correctly classified observations though the probability of detection of BAL quasars remains restricted by the low resolution of these spectra as well as by their limited signal-to-noise ratio. Finally, the triggering of some warning flags allows us to obtain an extremely pure subset of redshift predictions where approximately of the observations come along with absolute errors that are below .

methods: data analysis – quasars: general.
pagerange: Determination of astrophysical parameters of quasars within the Gaia missionDetermination of astrophysical parameters of quasars within the Gaia missionpubyear: 2017

1 Introduction

Gaia is one of the cornerstone space missions of the Horizon 2000+ science program of the ESA that aims to bring a consensus on the history and evolution of our Galaxy through the survey of a billion celestial objects (Perryman et al., 2001). This objective being achieved by capturing a ‘snapshot’ of the present structure; dynamic and composition of the Milky Way by means of precise astrometric and photometric measurements of all the observed objects as well as by the determination of the distances; proper motions; radial velocities and chemical compositions of a subset of these objects (Gaia Collaboration et al., 2016).

The on-board instrumentation is principally composed of two m telescopes pointing in directions separated by a basic angle of , the light acquisition being then carried out by slowly rotating the satellite on its spin axis while reading each CCD column at the same rate as the objects cross the focal plane (i.e. in the so-called Time Delay Integration mode, hereafter TDI mode). The high astrometric precision of Gaia then coming from: (i) its lack of atmospheric perturbations (ii) its large focal length of m (iii) the combination of the beams of light coming from both telescopes onto a single focal plane composed of a patchwork of CCDs that allow to relate the positions of the objects coming from the two fields of view with an extremely precise angular resolution (iv) its scanning law that enables to maximize the number of observed objects as well as the number of positional relations arising from the previous point (Lindegren et al., 2012).

In addition, a high-resolution () spectrometer, called Radial Velocity Spectrometer (RVS), centred around the triplet (- nm) will allow to determine the radial velocities of some of the most luminous stars ( mag), while two low-resolution spectrophotometers, namely the Blue Photometer (BP) observing in the range - nm () and the Red Photometer (RP) observing in the range - nm (), will allow to classify and characterize the objects having mag. The interested reader is invited to read de Bruijne (2012) and Gaia Collaboration et al. (2016) for a more complete description of the Gaia spacecraft and of its payload.

The previously described instrumentation, combined with the fact that Gaia is a full-sky survey where each object will be observed times on average, is a unique opportunity in order to achieve some additional objectives. A non-exhaustive list of such applications are: a finer calibration of the whole cosmological distance ladder (i.e. through parallaxes, Cepheids & RR Lyrae stars, type-Ia supernovae, …) (de Bruijne, 2012); a better understanding of the stellar physics and evolution through the refinement of the Hertzsprung-Russell diagram (Jordi et al., 2008); the discovery of thousands high-mass exoplanets (Perryman et al., 2014) as well as new probes regarding fundamental physics (Mignard, 2009).

Amongst the most peculiar objects that Gaia will observe, stand quasars–also termed quasi-stellar objects (QSOs) for historical reasons. Quasars are active galactic nuclei originating from the matter accretion that was occurring in the vicinity of supermassive black holes being at cosmological distances. Due to their high luminosity () and their large redshift (), these objects play a key-role in fixing the celestial reference frame used by Gaia, but they also have their own intrinsic interest in various cosmological applications like in the evolution scenarios of the galaxies (Hamann & Ferland, 1999); as discriminants over the various universe model and their parametrization (López-Corredoira et al., 2016); as tracers of the large-scale distribution of Baryonic matter at high redshift (Yahata et al., 2005) or as a means to independently constrain the Hubble constant if the latter are gravitationally lensed (Schneider & Sluse, 2013).

The identification and characterization of the quasars that Gaia is expected to observe takes place in the framework of the Data Processing and Analysis Consortium (DPAC) which is responsible for the treatment of the Gaia data in a broad sense, that is: from data calibration and simulation to catalogue publication through intermediate photometric/astrometric/spectroscopic processing; variability analysis and astrophysical parameters (APs) determination. The DPAC is an academic consortium composed of nine Coordination Units (CUs), each being in charge of a specific part of the data processing (O’Mullane et al., 2007). One of these, the CU8 ‘Astrophysical Parameters’, is dedicated to the classification of the objects observed by Gaia and to the subsequent determination of their APs (Bailer-Jones et al., 2013).

The present paper describes the algorithms that are to be implemented within the CU8 Quasar Classifier (QSOC) software module in order to determine the APs of the objects classified as QSOs by the CU8 Discrete Source Classifier (DSC) module while relying on their low-resolution BP/RP spectra. The collected APs aiming to be published within the upcoming Gaia data release 3 catalogue. The covered APs encompass: the redshift; the QSO type (i.e. type I/II QSO or Broad Absorption Line QSO, hereafter BAL QSO); the slope of the QSO continuum and the total equivalent width of the emission lines.

In the following, section 2 explains the conventions we used along this paper. Section 3 makes a brief review of the methods that were specifically developed in the field of this study. We present the production of a semi-empirical library of BP/RP spectra of QSOs used in order to train/test our models in section 4. The AP determination procedures are covered within section 5 while their performances are assessed in section 6. Some discussion on the latter takes place in section 7. Finally, we conclude in section 8.

2 Conventions

This paper uses the following notations: vectors are in bold italic, ; being the element of the vector . Matrices are in uppercase boldface or are explicitly stated; i.e. from which element at row (variable) , column (observation) will be denoted by .

Flux will here denote the spectral power received per unit area (derived unit of W m) while flux density will represent the received flux per wavelength unit (derived units of W m). If not stated otherwise, input spectral energy distributions (SED) will be considered to be expressed in terms of flux density while BP/RP spectra will be expressed in terms of flux by convention.

3 Methods

One of the main characteristics of the Gaia data processing is the large amount of information that has to be handled (e.g. about gigabytes of compressed scientific data are received from the satellite each day) and the consequent need for fast and reliable algorithms in order to reduce those data. These last requirements led the CU8 scientists to use techniques coming nearly exclusively from the field of the supervised learning methods whose underlying principle is to guess the APs of each observed object, which are unknowns, based on the interpolation of the APs of some similar template objects (Bailer-Jones et al., 2013).

These methods have been proven to be fairly fast and reliable but often consist in black-box algorithms having no physical significance and having only basic diagnostic tools in order to identify the potential problems that may occur during the APs retrieval. This last point is particularly crucial in the case of medium-to-low quality observations, like BP/RP spectra, or in the case where the problem is itself prone to error, like the existing degeneracy in the redshift determination of low signal-to-noise ratio (SNR) QSOs (Delchambre, 2016).

With these constraints and shortcomings in mind, we have developed two complementary algorithms that are specifically designed to gather the quasar APs within the Gaia mission based on the object BP/RP spectra while providing a clear diagnostic tool and ensuring an execution time that is limited to floating point operations (i.e. conventionally considered as ‘fast’ algorithms).

3.1 Weighted principal component analysis

Principal component analysis (PCA) aims to extract a set of templates – the principal components – from a set of observations while retaining most of its variance (Pearson, 1901). Mathematically, it is equivalent to find a decomposition of the covariance matrix associated with the input data set,


that is such that is orthogonal and diagonal and for which


The first columns of being then the searched principal components. A decomposition such as the one of equation 1 is straightforwardly given by the singular value decomposition (SVD) of the covariance matrix (Press et al., 2002).

Consider now the building of a set of rest-frame quasar templates based on a spectral library having a finite precision on the fluxes and a limited wavelength coverage. From the definition of the redshift, we will have that the observed wavelength, , can be related to the rest-frame wavelength, , through


and as a consequence, we will have that every quasar within the input library will cover a specific rest-frame wavelength range that depends on its redshift. Furthermore, the measurement of the quasar fluxes often comes along with an estimation of their associated uncertainties originating, for example, from the Poisson nature of the photons counting; from the CCD readout noise; from the sky background subtraction or from spectra edge effects. These uncertainties being not taken into account within the classical PCA implementation.

In Delchambre (2015), we solved the previously mentioned issues by considering the use of a weighted covariance matrix inside equation 1. For this purpose, we defined the weighted covariance of two discrete variables, and having weights respectively given by and and weighted mean values given by and as


The suggested implementation relies on two spectral decomposition methods, namely the power iteration method and the Rayleigh quotient iteration, that allow to gain flexibility; numerical stability as well as lower execution times111Under the condition that the number of observations within the input data set is much larger than the number of variables when compared to alternative weighted PCA methods (Bailey, 2012; Tsalmantza & Hogg, 2012).

3.2 Weighted phase correlation

The redshift has a particular importance over the whole quasar APs because any error committed on the latter would make the other APs diverge. Its is then critical to have the most precise estimation of it along with a strong diagnostic tool in order to flag the insecure predictions. A technique fulfilling these requirements stands in the phase correlation algorithm (Glazebrook et al., 1998) whose goal is to find the phase at which an input signal and a set of templates match at best in a chi-squared sense.

For reasons already enumerated within section 3.1, we will consider here a weighted version of the previously mentioned algorithm. We are then searching for the shift at which an input spectrum, , associated with a weight vector, , and a set of templates, , matches at best in a weighted chi-squared sense. Mathematically, it is equivalent to find the shift, , for which


is minimal given that are the linear coefficients minimizing equation 5 for a specific shift.

In Delchambre (2016), we showed that the latter equation can be re-written as


where is the so-called cross-correlation function (CCF) at shift which can be evaluated for all in floating point operations, being the number of samples we used. Given that the first term of equation 6 is independent of the shift, we will simply have that the minimum of equation 5 will be associated with the maximum of the CCF.

Practically, both and must be sampled on a uniform logarithmic wavelength scale in order for the redshift to turn into a simple linear shift (i.e. ) and must be extended and zero-padded such as to deal with the periodic nature of the phase correlation algorithm. A sub-sampling precision on the shift can then be gained by fitting a quadratic curve in the vicinity of the optimum of the CCF while the curvature of this quadratic curve will be used as an approximation of the uncertainty associated with the found shift.

The described weighted phase correlation algorithm relies on the assumption that the most probable redshift is associated with the maximal peak of the CCF, which is not always verified in the case of QSOs. The reason for this is twofold:

  1. The highest peak of the CCF may not always lead to a physical solution like the omission of some characteristic emission lines (e.g. Ly; \ionMgii279; or H nm) or the fit of a ‘negative’ emission line coming from the presence of matter being in the line-of-sight towards the observed QSO. The origin of this issue mainly stands in the imperfections of the templates we used as well as in the assumption we made that quasar spectra can be modelled as a linear combination of these templates.

  2. In the case of low-SNR spectra, it may also happen that the signal of some emission lines starts to be flooded within the noise such that these will not be recognized as a genuine signal but rather as variance coming from noise. As a result, ambiguities can emerge within the CCF (i.e. multiple equivalent maxima) and hence within the redshift determination.

Emission Rest-frame wavelength Relative intensity
line(s) nm compared to Ly
Table 1: Rest-frame wavelengths and relative intensities of the emission lines used in the computation of . Listed values come from the PCA mean spectra described in section 5.

In order to identify these sources of errors, we defined two complementary score measures associated with each redshift candidate: (i) , defined as the ratio of the value of the CCF evaluated at to the maximum of the CCF and (ii) defined as


where is the mean value of the continuum-subtracted emission line standing at rest-frame wavelength if we consider the observed spectrum to be at redshift ; is the associated uncertainty and is the theoretical intensity of the emission line standing at normalized such that all the covered emission lines intensities sum up to one. Equation 7 can then be seen as the weighted geometric mean of a set of normal cumulative distribution functions of mean zero and standard deviations evaluated in . Table 1 summarizes the various emission lines and theoretical intensities we used in the context of the present study.

We can already notice that these two score measures can easily highlight the sources of errors that may occur within the CCF peak selection, namely the choice of an unphysical solution and the choice of an ambiguous candidate respectively through a low and through a low absolute difference between the chosen candidate’s and the one from another candidate. These will constitute in fine strong diagnostic tools regarding our implementation.

4 Semi-empirical BP/RP spectral library building

Similarly to supervised learning methods, the undertaken approach is based on the availability of a learning library of BP/RP spectra for which the various APs are known. Such a library being non-existent at the present time, we had to convert an already released spectral library of QSOs into BP/RP spectra according to the most up-to-date instrument model. We focused, for this purpose, on the twelfth data release of the Sloan Digital Sky Survey quasar catalogue (Pâris et al., 2017, hereafter DR12Q). The choice of this specific catalogue comes from: (i) the fact that each spectrum in it was visually inspected yielding extremely secure APs (ii) the large number of QSOs it contains (amongst which BAL QSOs) (iii) its medium resolution of (iv) its spectral coverage which is comparable to the one of the Gaia BP/RP spectrophotometers ( nm).

This spectral library will then have to be extended such as to match the wavelength range covered by the Gaia BP/RP spectra and be subsequently convolved with the instrumental response of the BP/RP spectrophotometers such as to provide the final library.

4.1 Spectra extrapolation

Besides the fact that the DR12Q spectra have a narrower wavelength coverage ( nm) when compared to the BP/RP spectra ( nm), we also have to mention that the regions where and where nm often tend to be unreliable because of spectrograph edge effects while some other inner regions might be discarded because of bad CCD columns, cosmic rays, significant scattered light or sky background subtraction problems, for example (Bolton et al., 2012; Dawson et al., 2013). In order to solve these shortcomings, we have extracted a set of rest-frame PCA templates out of the DR12Q library that were later fitted to each individual spectrum as a mean to extrapolate them. Note that since the DR12Q spectra are already sampled on a logarithmic wavelength scale (at a sampling rate of ), no re-sampling will be needed before extrapolation.

Figure 1: Results of the pre-processing and extrapolation procedures. The input spectrum is decomposed into a continuum part, a noiseless spectrum and a set of deviant points (night sky emission lines and spectrograph edge effects being here depicted as shaded regions). This decomposition allows us to compute the SNR associated with each spectrum and to set those having a SNR on a common logarithmic rest-frame wavelength scale such as to extract PCA templates out of them. The fit of these templates to each input spectrum provides the final extrapolation. Note that the illustrated input spectrum has some corrupted samples in the region encompassing the \ionCiv emission line which are successfully recovered through this extrapolation procedure. Similarly, the unobserved Ly emission line seems to be consistently reproduced as well.

Raw spectra are not readily exploitable, rather they have first to be pre-processed such as to get rid of contaminating signals and to have some insights about their usability. For this purpose, we used a procedure that is identical to the one described in Delchambre (2016, section 5.1). We will hence concentrate here on the results of this procedure rather than on the underlying implementation details. We will then get, for each spectrum: (i) the set of deviant points coming from a clipping algorithm applied to the high frequency components of the spectrum as well as from the removal of the night sky emission lines and spectrograph edge effects (ii) an empirical estimation of the QSO continuum coming from the low frequency components of a multi-resolution analysis of the spectrum (iii) a smoothed version of the provided spectrum, that we will consider here as being noiseless (iv) an evaluation of its SNR coming from the ratio of the variance that is present within the noiseless and continuum-subtracted spectrum over the variance that can be attributed to noise (i.e. raw fluxes from which we subtracted the deviant points; the QSO continuum and the noiseless spectrum). Figure 1 illustrates the results of the previously described procedure and provides a self-explanatory example of the necessity we have to pre-process our input spectra.

Figure 2: Mean observations (black lines) and first two principal components coming from the PCA decompositions of the DR12Q emission lines and continuum spectra regarding type I/II QSOs (up) and BAL QSOs (down). Notice how the entire emission lines are modelled by the type I/II QSOs templates while the BAL QSOs templates rather focus on the \ionOvi; Ly; \ionSiiv and \ionCiv emission lines where most of the BAL characteristics are found.

Spectra having a SNR larger than one are then normalized such as to have a weighted norm of one and are subsequently set on a common logarithmic rest-frame wavelength scale. These will constitute the input dataset upon which we will extract our PCA templates. We choose to consider the retrieval of the BAL QSO templates separately from the type I/II QSOs as a way to ensure that the characteristic features of the BAL QSOs are correctly reproduced within our extrapolated spectra. Doing otherwise would have required a much larger number of PCA components to be fitted in order to accurately model those features (the latter being not seen in the vast majority of QSOs, they would have been omitted from the dominant PCA components). Also, we required the continuum to rely on an empirical basis such as to restrain at most any unphysical behaviour within our extrapolation. We will consequently subtract each empirical continuum from each input spectrum as a way to extract the PCA components from both these subtracted spectra as well as from the continua themselves. By way of comparison, continuum templates are frequently taken as being a combination of power-law and exponential functions (Claeskens et al., 2006) that often succeed in reproducing the observed spectrum but that tend to diverge over the unobserved wavelengths. Consequently, four sets of PCA templates were built based upon the algorithm described in section 3.1: one set of templates for the type I/II QSO emissions lines; one similar set associated with the BAL QSOs and two corresponding sets of continuum templates. Figure 2 provides the mean observation and two first principal components for these four sets of templates. For the sake of completeness, let us also mention that, during PCA retrieval, weights were taken as the inverse standard deviation on the fluxes regarding the emission line PCAs while these were simply set to one if the continuum we computed was associated with some observed fluxes and zero otherwise.

Finally, we used emission line templates (the mean observation and the first PCA components) for both the fit of the type I/II and BAL observations along with continuum templates for the type I/II QSOs and continuum templates for the BAL QSOs. These fits ultimately provide the extrapolated spectra, as illustrated in figure 1. The number of templates we used allows us to explain % of the weighted variance222Weighted variance naturally results from equation 4 in the case where while having expected values, in the case of the weighted variance of the input data set or equal to the optimal linear combination of the templates in the case of the explained weighted variance. that is present within the type I/II emission lines and % of their continuum variance. These become respectively % and % in the case of BAL QSOs. Even if apparently low, these ratios practically reflect the fact that the spectra we used come along with noise that will not be grabbed by the dominant PCA components. Consequently, the produced extrapolation will be considered here as being noiseless.

4.2 BP/RP instrumental convolution

The optical system of the Gaia BP/RP spectrophotometers consists for each in six mirrors, a dispersing prism and a set of dedicated CCDs (either blue or red-enhanced) that can be modelled as


where is the dispersed flux in the one-dimensional co-moving333sliding at the same rate as the TDI mode drift each CCD column focal plane position ; is the input SED at the observed wavelength ; is the global instrumental response at and where is the monochromatic line spread function (LSF) standing at and being evaluated at , being the co-moving focal plane position associated with . In more details, the global instrumental response encompasses: the mirrors reflectivity, the attenuation that is due to particulate and molecular contamination, the attenuation coming from the mirror roughness, the prism transmissivity and the CCD quantum efficiency at the observed wavelength . Note that in the following, we will consider a mean instrumental model averaged over each field-of-view and over each row of CCD within the focal plane given that our goal is to simulate end-of-mission (combined) spectra that will consist in an aggregation of individual (epoch) observations.

The BP/RP sampled fluxes, , will then be given by


where is the oversampling we choose to use. This oversampling arise from the higher SNR that is gained by the combination of the epoch spectra which, in turn, provide the opportunity to reach a higher sampling rate when compared to the initial pixels provided by the BP/RP acquisition window (e.g. through flux interpolation). A common consensus within the CU8 is to consider an oversampling of , which results in samples for each of the BP and RP spectra. This convention will be adopted here.

The instrument model described so far is not able to deal with the various sources of noise that will contaminate our actual observations. Instead, the spectra produced through equations 8 and 9 will consist in the approximated noise-free counterparts of what Gaia will observe. Extending the aperture photometry approach developed in Jordi et al. (2010), we can still have an estimation of the noise variance that is associated with each sampled flux, , as


where is an overall mission safety margin designed to take into account the potentially unknown sources of errors (, by convention); is the variance of the noise associated with if the latter was coming from a single epoch observation; is the number of epoch observations used to compute the combined spectra and is the uncertainty arising from the flux internal calibration. The scaling of the single epoch variance, , reflects the assumption we made that each flux within the combined spectra comes from the mean value of a set of epoch fluxes. In extreme cases, for example, we will have that each combined flux is averaged over the whole epoch observations (i.e. in the case of ) while in the case of , each flux within the combined spectra can be seen as gathered directly from the epoch spectra. Next, the variance coming from the uncertainties in the flux internal calibration, , is taken to be equal to the inner product of the fluxes that are present within the pixels surrounding each sample with a linear function that is inversely proportional to the global instrumental response, , evaluated in those pixels. Its objective being to take into account the fact that the precision on the flux calibration will principally depend on the instrumental response in the vicinity of the pixel of interest. Because of the intricacy that is inherent to the modelling of these calibration errors, the latter were voluntarily tuned such as to stand within a moderate range of values.

We can then decompose the epoch variance, into variance coming from photon and CCD noise, , and variance coming from the uncertainties in the background estimation, , as


where is the effective CCD exposure time, the latter being dependent on the magnitude of the objects based on the activation of bypasses within some specific CCD columns which aim to prevent luminous objects from saturating (de Bruijne, 2012). Both terms within equation 11 can then be simply extended as


where is the total CCD detection noise including, amongst other, the CCD readout noise and the CCD dark noise; is the background flux we subtracted from our observation and that we will consider here as a constant based on a typical sky-background surface brightness and where is the number of pixels we used in order to estimate , that is taken here as being equal to the width of the BP/RP acquisition window (i.e. pixels).

Figure 3: Example of produced BP/RP spectrum and associated flux uncertainties for magnitudes of , and . Spectra normalization allows to fairly compare the noise levels that are present amongst the various provided magnitudes. We can straightly figure out the effect of the global instrumental response and of the LSF convolution applied to the extrapolated spectrum as respectively producing characteristic bell shapes as well as very broadened emission lines within the resulting BP/RP spectra. The noticed non-uniform spectral resolution arises from the varying wavelength dispersion function of each of the BP/RP spectrophotometers.

Being now able to model the entire instrumental response along with the associated uncertainties, we first choose to normalize our input spectra to magnitudes where we expect QSOs to be observed, that is at . This normalization allows us to study the behaviour of the implemented methods under an increasing level of noise. BP/RP spectra were then produced based on the most up-to-date instrument model coming from tests carried out by EADS Astrium (later renamed Airbus Defence and Space) during the commissioning phase of the satellite. We have to note that this instrument model still has to be updated in order to match the actual operational condition of the satellite even if the latter is not expected to vary too much from the model we used. Noisy spectra can then be obtained by adding the appropriate random Gaussian noise (i.e. having a variance of ) to each of the (noise-free) spectral fluxes, . An example of produced BP/RP spectra is illustrated in figure 3.

5 Astrophysical parameter determination

Figure 4: Illustrative synthesized spectrum coming from the linear combination of the interpolated and flat-fielded BP/RP spectra of magnitude . The observed BAL features allows to highlight the capability of pre-processing we used in order to recover the general shape of the extrapolated spectrum. Flux uncertainties were also pre-processed accordingly, the BP/RP transition region being unequivocally recognized as a bump around nm. We have to note that the noiseless synthesized spectrum comes from a similar pre-processing applied to a noise-free version of the BP/RP spectra.

The bell shape of the BP/RP spectra prevents us from using both algorithms described in section 3. This is even more damageable given the fact that these are not sampled on a logarithmic wavelength scale and that the wavelength coverage of each pixel is not uniform. In order to tackle these problems, a resampling of the spectra fluxes and uncertainties was first performed using cubic spline interpolation (Press et al., 2002). The uniform logarithmic sampling we used, , ensuring a sampling on the redshift that is better than , which is comparable to human expertise, while producing a reasonable amount of sampled points in the final templates (assuming that ). Note that such a logarithmic interpolation will obviously introduce covariances between the resulting samples. Nevertheless, given that the specific resampling we used stands in a wavelength range where the corresponding logarithmic function is approximately linear and that both the number of samples within the BP/RP spectra and within the synthesized spectra are of the same order of magnitude, we will have that these covariances will be restricted to close neighbouring samples while having moderate magnitudes. These will consequently have a limited impact on the resulting predictions and will hence be ignored in the following. The division of these interpolated spectra by a flat BP/RP spectrum (i.e. coming from a flat input SED) concurrently fixes the bell-shape issue, that is mostly due to the global instrumental response, as well as the problem of the non-uniform wavelength coverage of the pixels, that is due to the inconstant wavelength dispersion function of the BP/RP spectrophotometers. Accordingly, these flat-fielded spectra can then be considered as being approximately proportional to the convolution of the input SED by the LSF over a linear wavelength scale, plus noise. Now, we will have that the resulting spectra will be disjoint although they are overlapping, which would yield to a tremendous loss of efficiency if these were to be considered individually. A more interesting solution stands in the linear combination of the flat-fielded BP/RP spectra according to a given weighting scheme such as to produce a single synthesized spectrum. In more details, if we consider and as being the interpolated fluxes of the BP and RP spectra; , , as their associated uncertainties; , , as their corresponding flat BP/RP fluxes and , as the weighting coefficient used to join these spectra, then we have that the synthesized fluxes, , can be represented as


and their associated uncertainties, , as


In the context of the present study, the weighting coefficients we selected are given by




is the hyperbolic tangent transition function from to . The specific weighting used in equation 16 ensures a smooth transition between the flat-fielded BP and RP spectra while keeping most of their significant regions. A continuum spectrum was then gathered and subsequently subtracted from each synthesized spectrum using a procedure similar to the one described in section 4.1. An illustrative synthesized spectrum produced through equations 14 and 15 is shown in figure 4.

Having now exploitable input spectra, we decided to split our input spectral library into two parts, the first part being used as a ‘learning set’ (LS1) in order to produce the PCA templates upon which we will base the analysis of the second part, the ‘test set’ (TS1). Conversely, the second part will be subsequently used as a learning set (LS2) for the analysis of the first part (TS2). This two-fold cross validation procedure will finally provide the APs for the whole set of observations as if these were gathered based upon a totally independent dataset. The splitting criterion we used relies on the uniform selection of half the input library sorted according to the QSOs redshifts such as to ensure an even repartition of the latter amongst these two data sets.

Figure 5: Mean observations (black lines) and first two principal components coming from the PCA decompositions of BP/RP synthesized spectra out of LS1 and LS2 regarding type I/II QSOs (up) and BAL QSOs (down). Both learning sets show extremely good agreement for both QSO types, the noticed discrepancies in the BAL components at wavelength longer than the H emission line being explained by the small number of BAL QSOs having ( of the learning sets). We can further notice that most of the BAL features standing between the \ionOvi and \ionCiv emission lines were retained from the DR12Q library (see figure 2) while the removal of the fine QSO structures by the BP/RP instrumental convolution now allows the first BAL templates to model the entire emission lines.

Two sets of rest-frame PCA templates were then produced for each of the learning set according to the QSO type (type I/II or BAL). These were based on the noise-free; continuum-subtracted and synthesized BP/RP spectra having both un-normalized magnitudes and SNR within the extrapolation procedure. From the noiseless nature of these input spectra, we had to select custom weights associated with the synthesized fluxes, , as


where the first two terms practically reflect the limited confidence we set on the spectra edges due, for example, to the potential inaccuracies in the spectra extrapolation or to low fluxes within the flat BP/RP spectra leading to numerical instabilities and where the last term stands for the uncertainties introduced in the BP/RP spectra combination. Figure 5 provides the mean observation and first two principal components of the synthesized spectra of the type I/II and BAL QSOs for both learning sets.

It is worth to mention that, at first glance, it might seems misleading from the point of view of the validation process to retrieve PCA components from synthesized spectra which are themselves based on the linear combination of templates. Nevertheless, let us first remind that it is one of our assumptions that any (noiseless) DR12Q quasar spectrum can be fairly represented as the linear combination of a sufficient number of such templates. Secondly, we have to note that BP/RP spectra come from the instrumental convolution of the extrapolated spectra in the observed wavelengths. The latter being then set on rest frame, we will have that the resulting PCA components will have to reflect the averaged convolution applied over the whole observed wavelengths. Finally, this convolution will have the effect of smoothing the high-frequency components from the extrapolated spectra. These being concurrently the main source of unexplained variance within the DR12Q templates, we expect the produced library to be consistent regarding an hypothetical real noise-free BP/RP spectral library.

Warning flag Value Condition(s) for rising
Z_AMBIGUOUS 1 More than one peak have both
Z_NOTOPTIMAL 8 We did not choose the optimal
peak (i.e. )
Table 2: Warning flags used in the redshift selection procedure. These can be combined through bitwise OR operator.

The extracted PCA components were then used in order to produce their CCF against the noisy synthesized BP/RP spectra of magnitude through the algorithm described in section 3.2. The redshift identification being based on the CCF peak having a corresponding redshift in the range ; a and a minimal scaled distance from the ideal point as given by


The selected redshift was then flagged for potential inaccuracies in the peak selection according to the values provided within Table 2. Constants used in the peak selection procedure as well as within Table 2 are purely empirical and based on a visual inspection procedure.

The optimal number of PCA components to use was chosen as a trade-off between the ratio of explained variance; the ability of the templates to model BAL QSOs and the potential overfit of the observations coming from the use of a too large number of templates. This overfitting being characterized by frequent ambiguities in the corresponding CCFs that eventually results in a large number of erroneous redshift predictions (though these will have a non-zero warning flag). Tests performed on each learning set show that the use of PCA components is a satisfactory compromise between these constraints that ultimately leads to a ratio of explained variance of (LS1) and (LS2) regarding the type I/II QSOs and of (LS1) and (LS2) regarding the BAL QSOs.

The BAL QSO identification is based on the comparison of the value of the CCF peak we selected using type I/II templates, , against the value of the CCF peak selected from BAL templates, , through


where is the redshift selected from type I/II templates and is the redshift selected from BAL templates. Though straight classification between these two types of QSOs is sometimes practical, it is however commonly motivated by the specific needs of the end-users. As an example, studying the physics of BAL QSOs will require an extremely pure subset of observations (e.g. with ) while a re-observation survey can easily deals with a ‘hint’ on the BAL nature of the observed QSOs (e.g. with ). Still, the frequent discrepancies observed between the redshift predicted based upon these two kinds of templates enforce us to use such a classification. Accordingly we will consider, in the following, as the default reshift whenever and while keeping as a discriminant value for further application-specific classification. The effect of the thresholding of this discriminant value on the resulting ratio of correctly/incorrectly classified observations will be deferred to section 6.2.

Figure 6: Histograms of the differences between the continuum slopes, , and total emission line equivalent widths, , predicted based upon DR12Q spectra and synthesized spectra regarding various normalization magnitudes.

The slope of the QSO continuum corresponds to the spectral index, , as defined by


or more compactly expressed in terms of frequency, , as . This index is obtained from the fit of a power law function to the observations over wavelength regions that are commonly devoid from emission/absorption features, that is: ; ; ; and nm. The exact procedure employs a -sigma clipping algorithm (with , ) such as to underweigh iron emission blends as well as other fortuitous absorption/emission structures by a factor . This procedure was applied to both the input DR12Q spectra and to the synthesized BP/RP spectra as a way to fairly compare the resulting predictions while discarding any bias that can be due to the differences in the used algorithms. Because of their high numerical complexities, non-linear optimization algorithms were not used for the least-squares solution of equation 21. Rather, each power-law function was fitted through a linear regression of the wavelengths against the fluxes by taking the logarithm of both sides of the latter equation. Although this choice seems to be harmless from the point of view of the DR12Q spectra, synthesized spectra will have to cope with the large amount of discarded samples coming from the frequent negative fluxes encountered within the spectra edges. These discarded samples leading to a large bias towards positive fluxes (see figure 4, for example), we consequently decided to reject samples standing outside the observed region nm for these specific spectra.

Finally, the total equivalent width of the emission lines can be represented as


where is the continuum slope we fitted based upon equation 21 and are the emission lines fluxes, the latter being set to if belongs to an emission line region and to zero otherwise. The identification of these emission line regions is based on parts of the spectrum where smoothed fluxes coming from a points wide Savitsky-Golay filtering (Press et al., 2002) stands higher than the continuum. In agreement with what was previously done, equation 22 was integrated over the interval nm regarding DR12Q spectra and over the interval nm regarding the synthesized spectra. The rest-frame total equivalent width being hence straightly given by .

Mean Mean
DR12Q spectra
Synthesized spectra
Table 3: Mean predicted continuum slope, , and total emission line equivalent width, , based on DR12Q spectra and synthesized spectra with various magnitude .
Figure 7: Mean spectra coming from the aggregation of DR12Q spectra normalized such a to have a weighted norm equal to one and from synthesized spectra being normalized in the same way. Flux densities were normalized between and nm. Shaded regions correspond to wavelength ranges that were used in order to fit the QSO continuum slopes through equation 21. One can notice the higher flux densities of the synthesized mean spectrum within the continuum regions and nm when compared to the DR12Q mean spectrum while the lower flux encountered within regions and nm often yield a red bias during the continuum fitting procedure.

The results of these continuum slope and total emission line equivalent width determination procedures are summarized within Table 3 and figure 6 regarding DR12Q spectra and synthesized spectra of various normalizing magnitudes. We can easily see that the continuum slopes predicted from synthesized spectra tend to be bluer than those of the DR12Q spectra. This bias comes from the spread of the \ionSiiv and \ionCiv emission lines over the continuum region nm and, to a lesser extent, from a similar spread of the \ionCiv and \ionCiii] emission lines over the continuum region nm as depicted within figure 7. The noticed relation between the flattening of the continuum slope and the normalizing magnitude comes from the increasing number of negative samples that are rejected within the continuum regions nm and nm where faint fluxes are usually found and which ultimately tend to artificially redden those regions. Also, we can observe that the total equivalent widths of the emission lines predicted based upon synthesized spectra are underestimated compared to the ones predicted using DR12Q spectra. The reason for this similarly stands within the globally overestimated continuum flux as well as from the reddening of the spectrum at wavelengths longer than nm according to the magnitude. Note that the miss of some narrow emission lines because of the LSF convolution also tends to lessen the predicted . The reader should hence pay a careful attention to these systematic effects once using these measurements.

Given these shortcomings, one might rightfully wonder whether the use of non-linear optimization algorithms worths to be envisaged in order to predict the continuum slope of the QSOs at the expense of a ten times longer execution time. Doing so will provide us with a mean value of the continuum slopes of for the DR12Q spectra and a correlation factor of if these are compared to the results of our approach. In the case of synthesized spectra, these numbers become respectively , , , and for the mean continuum slopes of magnitudes with associated correlation factors of for and of , for and , respectively. The observed flattening of the predicted continuum slopes at ironically comes from the non-rejection of the negative fluxes from the red part of the spectra which tends to give larger weights to these regions (i.e. the fraction of red fluxes being then more significant). While this effect will have a negligible (but still noticeable) impact on the predictions because of the sufficient SNR of the red part of the spectra at these magnitudes (e.g. the fit of the red part of the spectra providing a good approximation of the continuum slopes at these magnitudes), it will have a deleterious impact on fainter magnitudes where the red part of the spectra is often better approximated by a flat curve. This effect gets further amplified through the subsequent rejection of the blue fluxes by the –sigma clipping algorithm. Let us still mention that this non-linear approach remains the most rigorous in a statistical point of view though the strong similarities noticed in both approaches and their common difficulties in predicting the continuum slopes of faint sources do not justify its use regarding its larger time consumption.

6 Performance comparison

The performances of our approach were assessed in comparison with the Extremely Randomized Trees learning method (Geurts et al., 2006, hereafter ERT). While classical tree-based learning methods usually try to find, at each node, a splitting criterion (i.e. an attribute and a threshold within this attribute) that is such that the learning set of observations associated with this node is split at best with respect to a given score measure (e.g. variance reduction in regression problem or information gain in classification problem), the ERT instead picks up random attributes as well as a random threshold associated with each of these attributes in order to select the one maximizing the provided score measure. This procedure is then recursively repeated until the number of learning set observations in all leaf nodes falls under a given limit, . The averaged prediction of a set of trees then allows to subsequently lessen the variance of the model (i.e. the sensitivity of each individual tree to the used learning set). The choice of this specific method mainly comes from both its fast learning phase as well as from its high performances regarding other competing methods like Artificial Neural Networks or Support Vector Machine while having only a few numbers of parameters to tune. Let us also note that this method is the one that is presently in use within the QSOC software module in order to predict most of the QSO APs (Bailer-Jones et al., 2013).

First of all, let us mention that the QSO continuum slope and that the total equivalent width of the emission lines will not be considered within this performance comparison because these can be straightly predicted based on observable quantities. Regarding the adjustment of the parameters of the ERT models, tests have shown that the prediction of the QSO redshift and type are rather insensitive to the and parameters if these stand within reasonable ranges of values. Consequently and according to Geurts et al. (2006), the default values of , and , were accepted respectively for the redshift regression problem and BAL classification problem, being here the number of points contained within our BP/RP spectra (i.e. if ). The number of trees to build, , should be ideally as large as possible. Nevertheless, based on time and memory constraints we have, the latter was set to .

In the following, the ERT models will be built based upon the noisy learning sets LS1 and LS2 where the observations having a SNR are selected and normalized such as to have a unit norm for the whole set of magnitudes. Their predictions being then gathered from the associated test sets of corresponding magnitude within TS1 and TS2. We have to note that because of selection effects and observational bias within the DR12Q catalogue, neither LS1 nor LS2 will follow a realistic distribution of the redshift (Pâris et al., 2017). Similarly, these will neither contain a genuine fraction of BAL QSOs (Reichard et al., 2003; Knigge et al., 2008; Gibson et al., 2009). Consequently, the ERT models that will be built based on these learning sets will be particularly suited for the prediction of the QSO redshifts and type that are the most frequently encountered within LS1 and LS2. In that sense, these will constitute data-oriented models whose predictions on TS1 and TS2 will be optimistic when compared to those based on real observations. Finally, we can further note that the weighted phase correlation algorithm (hereafter WPC) is not sensitive to this data unbalancing and that the associated results will remain valid irrespectively of the actual APs distribution we will encounter.

Figure 8: (top) Distribution of the predicted redshifts coming from ERT models, , and WPC model, , for quasars with magnitudes equal to , , , and with respect to the DR12Q redshift, . (middle) Histogram of the absolute error, , between DR12Q redshifts and redshift predicted based upon ERT models, WPC models and WPC models while having an empty warning flag (see table 2). (bottom) Ratio of observations having an absolute error on the redshift that is lower than some threshold limit as given by .

6.1 Redshift determination

The distribution of the predicted redshifts against DR12Q redshifts is given within the upper part of figure 8 for the case of the ERT predictions as well as for the case of the WPC predictions regarding the various normalizing magnitudes. We can already notice a trend of the ERT predictions standing at to be driven towards where stand most of our learning set observations. This effect is particularly noticeable at where stands our second most numerous source of QSOs and it further tends to strengthen along with an increasing magnitude (though these misclassified observations will typically have strong associated uncertainties). To a lesser extent, we may also note an opposite trend where the observations standing between and tend to be underestimated. These effects potentially reflect the inability of our models to fully grab the information that is present within our learning sets and/or the incompleteness of the latter. The fact that very high redshift objects () get correctly predicted presumably comes from the entrance of the Ly forest within the observed wavelength range where the extremely faint fluxes found therein allow to unequivocally characterize these observations.

Although roughly performed here, the analysis of the results coming from machine learning methods often suffer from a lack of physical significance and interpretation that mostly arises because of their underlying complexity. Furthermore, these methods strongly depend on the completeness of the learning set we used in order to build them. In an illustrative purpose, let us consider that a given QSO spectrum is getting a correct redshift prediction from such a model, suppose now that a similar spectrum has a slightly higher redshift which results in a mean shift by a few pixels in the observed spectrum, then nothing ensures us that this shifted spectrum will get a correct prediction from the previous model since this ultimately depends on whether or not a somewhat similar spectrum was encountered within the used learning set. According to this, a learning method dedicated to the redshift determination of QSOs should ideally be based on a learning set of observations covering the vast majority of QSO shapes and characteristics over the entire range of redshift we are looking for. With these arguments in mind, we may still suppose that the ERT models we used (and in a broader sense, any model based on machine learning method) are not the best suited in predicting the redshift of QSOs.

Regarding the redshift distribution from WPC (see figure 8, up), we can readily notice a tighter dispersion of the errors when compared to the ERT predictions with median absolute errors of , , , and for the WPC predictions of magnitudes , respectively, and corresponding ERT median absolute errors of , , , and (see figure 8, middle). Similarly, , , , and of the observations have a catastrophic prediction on their redshift (i.e. ) within the WPC for the same set of magnitudes while the corresponding ratios of ERT observations become respectively , , , and (see figure 8, down).

Ratio of observations triggering warning flags
Triggered magnitude
warning flag
Ratio of observations having
Triggered magnitude
warning flag
Table 4: Ratio of observations triggering warning flags and associated ratio of observations having while triggering these warning flags.

Most of the WPC errors come from mismatches between emission lines. These mainly consist in the confusions of H with \ionCiii]; \ionMgii with Ly; \ionMgii with \ionCiv; \ionMgii with \ionCiii] and \ionCiv with Ly. We have to note that these mismatches do not constitute by themselves real cases of degeneracy but rather arise because of the effect of noise on the emission lines identification as we will soon see. Still, this effect is already unambiguously depicted in figure 8 (top), where the number of observations suffering from such an emission line mismatch problem tends to increase along with an increasing magnitude. In the same figure, we may also note that low SNR spectra () tend to produce constant predictions at , and . These correspond to the fit of deviant fluxes from spectra edges by the H, \ionCiv and Ly emission lines, respectively. Though unavoidable, most of these errors will come along with a non-empty warning flag (see Table 4) which offers the possibility to discard these insecure predictions. By doing so, we rejected , , , and of the total number of observations regarding magnitudes , respectively. This leads to corresponding median absolute errors of , , , and and associated ratios of catastrophic redshift predictions of , , , and for the same set of magnitudes.

From our previous discussion, we can notice that the performances we gained were achieved at the expense of a very high rejection rate of the observations having a non-empty warning flag. The distribution of these warning flags amongst the observations is given in Table 4 along with their associated ratio of catastrophic redshift prediction once triggered. We first have to note that Z_LOWZSCORE is the most frequently triggered warning flag amongst these observations and is hence the one that contributes at most in their removal. The reason for this stands in the fact that the measure is primarily designed such as to be sensitive to the presence of all the emission lines that are theoretically covered at a given redshift estimate. The miss of one such a line can be attributed either to the wrong redshift estimate we made or to its misidentification owing to its noise or its strong damping by the instrumental convolution, though the right redshift was selected. This misleading distinction is clearly depicted in Table 4 where solely of the observations having and Z_LOWZSCORE flag set comes along with , thus arguing for the frequent misidentification of some emission lines while at this ratio becomes , hence consisting in a larger fraction of effective redshift confusions. Secondly, the Z_AMBIGUOUS flag is frequently set because of the intrinsic degeneracy existing in the prediction of the redshift of quasars albeit we can assess from Table 4 that in of the cases the right redshift is selected amongst these ambiguous solutions at while this ratio drops to at . For completeness, we have to mention that of successful identifications is still better than the ratio that would be obtained from a random selection of the solution given that the observations having Z_AMBIGUOUS flag set often consist in more than two ambiguous solutions (see figure 8, for example). We can further note that, once an ambiguity is detected (i.e. Z_AMBIGUOUS warning flag triggered), the optimal peak of the CCF is commonly selected as the most probable redshift estimate as these do not additionally trigger a Z_NOTOPTIMAL warning flag. Now, if a sub-optimal peak of the CCF is selected, then of the observations comes along with at while this ratio becomes at . Given that the latter observations must have a that is greater or equal to the one associated with the optimal peak of the CCF in order to be selected, these eventually reveal the effective degeneracy that exists in the redshift prediction of quasars once based on low SNR spectra. Finally, the Z_LOWCHI2R warning flag is rarely triggered given the strong constraints we set on it (see equation 19). This decision is further supported by the fact that the associated ratio of catastrophic redshift prediction stands to be the highest amongst the whole set of warning flags.

Figure 9: Ratio of observations having and greater or equal to the provided values (i.e. completeness) and ratio of observation having amongst the latter (i.e. impurity).

As pointed out within Section 5, the thresholds that were used in order to trigger the Z_LOWZSCORE and Z_LOWCHI2R warning flags are somewhat arbitrary and other values might be better suited regarding the specific needs of the end user. This is particularly true given that these were shown to have a strong impact on the trade-off between the completeness and the impurity of our predictions as we have just seen. These ratios of completeness and impurity being given in figure 9 for varying thresholds on and . Note that we do not considered the Z_AMBIGUOUS and Z_NOTOPTIMAL warning flags in this analysis given that automatically implies that both these flags are set. Also remind that we required such as to limit the number of ambiguous solutions that are potentially associated with each observation.

6.2 BAL binary classification

Figure 10: ROC curves of BAL binary classification regarding the WPC and ERT models for quasars with magnitudes equal to , , , and . Shaded areas correspond to regions were the accuracy stands higher or equal to the one coming from a constant type I/II classifier. The points of the ROC curves associated with the optimal accuracies stand at the highest distances from the border of these regions while being contained in it.

The data unbalancing has a particularly insidious impact on the analysis of the results coming from the binary classification of BAL QSOs. Indeed, based on the fact that solely of the DR12Q observations are BAL QSOs, a model that will systemically classify the observations as type I/II QSO would then provide a satisfactory ratio of correctly classified observations (i.e. an accuracy) of while no BAL QSO will be identified. Consequently, this ratio will not constitute an objective analysis tool if considered alone. We will hence use two additional and complementary statistical measures that were specifically designed for the analysis of the performance of binary classifiers. First, the true positive rate (hereafter TPR) will here denotes the fraction of BAL QSOs that are correctly identified by a given model. It constitutes an estimator of the probability of detection of the BAL QSOs by this model. Secondly, the false positive rate (hereafter FPR) will denote the fraction of type I/II QSOs that are wrongly classified as BAL QSOs. A perfect binary classifier should hence have TPR along with FPR . Note that both these statistical measures can be adjusted by varying the user-defined threshold that was set either on , for the case of the WPC, or on the number of trees that voted for the BAL class regarding the ERT. By doing so and reporting the corresponding TPR against FPR we obtain the so-called Receiver Operating Characteristics (ROC) curve as depicted within figure 10 for the case of the WPC and ERT models for quasars with magnitudes . These curves allow to straightly compare the performances of these two competing models while depending neither on the data unbalancing, nor on the specific thresholds we used. The area under the ROC curve being then often taken as a fair indicator of their global performances.

Now, like many data reduction pipelines, our primary objective will be to optimize the accuracy of our model with respect to the fraction of BAL QSOs that will be encountered amongst the real observations. We will then have to take into account the potential unbalancing that will be present within the Gaia observations. However, because of the uncertainties surrounding the selection effect from DSC as well as the observational bias, this unbalancing is not known a priori. Consequently, we decided to consider a fraction of BAL QSOs, , equal to the one that is present within the DR12Q catalogue (i.e. ). The presented accuracies should hence be updated once a realistic ratio will be available though the general conclusions drawn out of these are not supposed to change (assuming that remains small). We can then easily figure out that the regions of the ROC curves where the accuracy is constant correspond to lines whose equations are given by


Our goal will then be to find the point(s) of the ROC curve that intersect such a line while maximizing . Note that the trivial case where corresponds to the accuracy that would be obtained by a constant type I/II classifier which is thereby always achievable. Stated otherwise, the point(s) of the ROC curve having an optimal associated accuracy correspond(s) to the one (or those) whose distance to the line of constant type I/II accuracy is the greatest while being on its left side.

magnitude Threshold TPR FPR Accuracy
magnitude Threshold TPR FPR Accuracy
Table 5: Optimal accuracies of the ERT and WPC models for quasars with magnitudes of , , , and along with their associated thresholds, TPR and FPR.
Figure 11: Comparison of the Balnicity index of the \ionCiv trough with respect to the TPR of the WPC models regarding BAL QSOs with various magnitudes.

Figure 10 focuses on the regions of the ROC curves where the accuracy stands higher than the one of a constant type I/II classifier. From our previous discussion, we can readily see that the WPC models have overall better accuracies when compared to the ERT models for the whole set of magnitudes. The best achievable accuracies being summarized in Table 5 along with their associated thresholds, FPR and TPR for magnitudes of , , , and . The extremely low TPR found therein can be explained by both the relatively low SNR of the synthesized spectra as well as by the removal of most of the narrow absorption features by the LSF convolution and/or by the under-sampling of these spectra. The effect of noise can be readily recognized based on figure 10 where the ROC curves tend to match the ones that would be obtained from a random classifier (i.e. a diagonal line) as we increase the magnitude. This translates as a drop of the point of optimal accuracy along the ROC curves which consists in both a lower TPR and a compensating lower FPR (see Table 5). In extreme cases, BAL QSOs become barely identifiable with a probability of detection within the WPC of for magnitude of and of for the case of . Figure 11 compares the TPR of the WPC with the Balnicity index of the \ionCiv trough (Weymann et al., 1991, hereafter BI) for the various normalizing magnitudes. This BI can be seen as a modified equivalent width of the BAL absorption occurring in the blue part of the \ionCiv emission line. We can notice a strong dependence of the TPR according to BI, which reflects the difficulty in identifying BAL QSOs having narrow absorption features. We can finally notice that if one can afford to have a high FPR, then the ERT provides a better TPR than the WPC. This would be the case, for example, if we would like to filter the Gaia catalogue by keeping most of the the BAL QSOs while rejecting a still significant number of type I/II QSOs.

7 Discussion

Although already fully operational, the presented software module may still experience some minor improvements that will be summarized in the remainder. First, we did not consider any extinction by the interstellar medium. The associated correction relies on the availability of a wavelength-dependent extinction law such as the one of Fitzpatrick (1999) as well as on a map of galactic extinction like the one that will be produced by the Total Galactic Extinction software module from CU8 (Bailer-Jones et al., 2013). The total equivalent width of the emission lines as well as the continuum slope from equation 21 might benefit from this correction. Nevertheless, due to the fact that the continuum slopes we subtracted from our synthesized spectra are purely empirical (see section 5), these will also contain most of the encountered extinction. Accordingly, this correction is not expected to bring any major improvement on the prediction of the redshift of QSOs nor on the subsequent calculation of the BAL discriminant value, . Furthermore, based on the fact that most of the DR12Q spectra stand at relatively high galactic latitude (i.e. ) where the extinction is weak, the spectra we used in this study were not much affected by this extinction. A more challenging objective would be to enable the prediction of this extinction based on the BP/RP spectra of quasars. This problem is currently being investigated but seems to be hardly attainable because of the degeneracy existing between the extinction curve and the intrinsic continuum slope of the QSOs. Secondly, the computation of a value from the optimal point of the CCF (see equation 6) can straightly allow to send feedback about the potential misclassification of the quasars we received from the DSC module.

8 Conclusion

We have described in the present work the processing of the BP/RP spectra coming from the Gaia satellite in order to determine the astrophysical parameters of quasars within the QSOC module of the CU8 coordination unit from the DPAC. These astrophysical parameters encompass: the redshift of the QSOs, their continuum slopes, the total equivalent width of their emission lines and whether or not these are broad absorption lines (BAL) QSOs. We have highlighted the necessity to have fast and reliable algorithms such as to deal with the huge amount of spectra that Gaia will provide as well as with their limited signal-to-noise ratio and resolution. We have introduced two already developed algorithms, namely the weighted principal component analysis and the weighted phase correlation, that were specifically designed in order to fulfil both these mentioned objectives and whose combination allows to securely predict both the redshift of the QSOs and to set a discriminant on their type. We have presented the construction of a semi-empirical library of BP/RP spectra based on the Gaia instrumental convolution of the observations coming from the Sloan Digital Sky Survey which were extrapolated in order to cover the wavelength range of the BP and RP spectra. We saw the pre-processing that is required in order for these BP/RP spectra to be fully exploitable by our algorithms as well as the methods we used for predicting the various astrophysical parameters. Some systematic bias were noticed within the prediction of the continuum slopes and of the total equivalent width of the emission lines. These bias can be mostly explained by both the spread of the \ionSiiv, \ionCiv and \ionCiii] emission lines over the continuum regions situated between nm and nm as well as by the rejection of the negative fluxes that are usually found within the red part of the pre-processed spectra.

A comparison with the currently used machine learning method showed that our approach is the one of predilection for the determination of the redshift of the quasars while benefiting from a straight physical significance as well as from strong diagnostic tools on the potential errors that may arise during predictions. Cross validation tests showed that , , , and of the observations come along with an absolute error on the predicted redshift that is lower than for the case of quasars with magnitudes equal to . These ratios become respectively , , , and once the insecure predictions are discarded based on the triggering of some warning flags. We explored the repartition of these warning flags amongst the observations and studied the effect of setting customized warning thresholds on the trade-off between the completeness and the impurity of our predictions. Our methods were proved to yield the best ratio of correctly classified observations regarding the identification of BAL QSOs assuming that these will be observed much less frequently than the type I/II QSOs. Machine learning methods may still provide a better probability of detection of these BAL QSOs at the expense of much higher contamination rates. Finally, we have that , , , and of the observations were correctly classified by our methods regarding quasars with magnitudes of , , , and , respectively.


The author acknowledges support from the ESA PRODEX Programme ‘Gaia-DPAC QSOs’ and from the Belgian Federal Science Policy Office.

Funding for SDSS-III has been provided by the Alfred P. Sloan Foundation, the Participating Institutions, the National Science Foundation and the U.S. Department of Energy Office of Science. The SDSS-III web site is http://www.sdss3.org/.

SDSS-III is managed by the Astrophysical Research Consortium for the Participating Institutions of the SDSS-III Collaboration including the University of Arizona, the Brazilian Participation Group, Brookhaven National Laboratory, Carnegie Mellon University, University of Florida, the French Participation Group, the German Participation Group, Harvard University, the Instituto de Astrofisica de Canarias, the Michigan State/Notre Dame/JINA Participation Group, Johns Hopkins University, Lawrence Berkeley National Laboratory, Max Planck Institute for Astrophysics, Max Planck Institute for Extraterrestrial Physics, New Mexico State University, New York University, Ohio State University, Pennsylvania State University, University of Portsmouth, Princeton University, the Spanish Participation Group, University of Tokyo, University of Utah, Vanderbilt University, University of Virginia, University of Washington and Yale University.


  • Bailer-Jones et al. (2013) Bailer-Jones C. A. L., et al., 2013, A&A, 559, A74
  • Bailey (2012) Bailey S., 2012, PASP, 124, 1015
  • Bolton et al. (2012) Bolton A. S., et al., 2012, AJ, 144, 144
  • Claeskens et al. (2006) Claeskens J.-F., Smette A., Vandenbulcke L., Surdej J., 2006, MNRAS, 367, 879
  • Dawson et al. (2013) Dawson K. S., et al., 2013, AJ, 145, 10
  • Delchambre (2015) Delchambre L., 2015, MNRAS, 446, 3545
  • Delchambre (2016) Delchambre L., 2016, MNRAS, 460, 2811
  • Fitzpatrick (1999) Fitzpatrick E. L., 1999, PASP, 111, 63
  • Gaia Collaboration et al. (2016) Gaia Collaboration et al., 2016, A&A, 595, A1
  • Geurts et al. (2006) Geurts P., Ernst D., Wehenkel L., 2006, Machine Learning, 63, 3
  • Gibson et al. (2009) Gibson R. R., et al., 2009, ApJ, 692, 758
  • Glazebrook et al. (1998) Glazebrook K., Offer A. R., Deeley K., 1998, ApJ, 492, 98
  • Hamann & Ferland (1999) Hamann F., Ferland G., 1999, ARA&A, 37, 487
  • Jordi et al. (2008) Jordi C., Fabricius C., Carrasco J. M., Figueras F., Masana E., Voss H., Luri X., 2008, in Jin W. J., Platais I., Perryman M. A. C., eds, IAU Symposium Vol. 248, A Giant Step: from Milli- to Micro-arcsecond Astrometry. pp 500–501, doi:10.1017/S1743921308019947
  • Jordi et al. (2010) Jordi C., et al., 2010, A&A, 523, A48
  • Knigge et al. (2008) Knigge C., Scaringi S., Goad M. R., Cottis C. E., 2008, MNRAS, 386, 1426
  • Lindegren et al. (2012) Lindegren L., Lammers U., Hobbs D., O’Mullane W., Bastian U., Hernández J., 2012, A&A, 538, A78
  • López-Corredoira et al. (2016) López-Corredoira M., Melia F., Lusso E., Risaliti G., 2016, International Journal of Modern Physics D, 25, 1650060
  • Mignard (2009) Mignard F., 2009, in IAU Symposium #261, American Astronomical Society. p. 891
  • O’Mullane et al. (2007) O’Mullane W., et al., 2007, in Shaw R. A., Hill F., Bell D. J., eds, Astronomical Society of the Pacific Conference Series Vol. 376, Astronomical Data Analysis Software and Systems XVI. p. 99 (arXiv:astro-ph/0611885)
  • Pâris et al. (2017) Pâris I., et al., 2017, A&A, 597, A79
  • Pearson (1901) Pearson K., 1901, Phi. Mag., 2, 559
  • Perryman et al. (2001) Perryman M. A. C., et al., 2001, A&A, 369, 339
  • Perryman et al. (2014) Perryman M., Hartman J., Bakos G. Á., Lindegren L., 2014, ApJ, 797, 14
  • Press et al. (2002) Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 2002, Numerical recipes in C++ : the art of scientific computing
  • Reichard et al. (2003) Reichard T. A., et al., 2003, AJ, 126, 2594
  • Schneider & Sluse (2013) Schneider P., Sluse D., 2013, A&A, 559, A37
  • Tsalmantza & Hogg (2012) Tsalmantza P., Hogg D. W., 2012, ApJ, 753, 122
  • Weymann et al. (1991) Weymann R. J., Morris S. L., Foltz C. B., Hewett P. C., 1991, ApJ, 373, 23
  • Yahata et al. (2005) Yahata K., et al., 2005, PASJ, 57, 529
  • de Bruijne (2012) de Bruijne J. H. J., 2012, Ap&SS, 341, 31
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description