COSMOGRAIL: the COSmological MOnitoring of
GRAvItational Lenses
Key Words.:
methods: data analysis – gravitational lensing: strong – cosmological parametersCOSMOGRAIL is a longterm photometric monitoring of gravitationally lensed quasars aimed at implementing Refsdal’s timedelay method to measure cosmological parameters, in particular . Given the long and well sampled light curves of strongly lensed quasars, timedelay measurements require numerical techniques whose quality must be assessed. To this end, and also in view of future monitoring programs or surveys such as the LSST, a blind signal processing competition named Time Delay Challenge 1 (TDC1) was held in 2014. The aim of the present paper, which is based on the simulated light curves from the TDC1, is double. First, we test the performance of the timedelay measurement techniques currently used in COSMOGRAIL. Second, we analyse the quantity and quality of the harvest of time delays obtained from the TDC1 simulations. To achieve these goals, we first discover time delays through a careful inspection of the light curves via a dedicated visual interface. Our measurement algorithms can then be applied to the data in an automated way. We show that our techniques have no significant biases, and yield adequate uncertainty estimates resulting in reduced values between 0.5 and 1.0. We provide estimates for the number and precision of timedelay measurements that can be expected from future timedelay monitoring campaigns as a function of the photometric signaltonoise ratio and of the true time delay. We make our blind measurements on the TDC1 data publicly available.
1 Introduction
The methods used to constrain current cosmological models all benefit from independent measurements of the local value of the Hubble parameter, H (see e.g. Fig. 48 of Weinberg et al. 2013). One way of achieving a measurement of H is based on time delays in strong gravitational lens systems. The method, first suggested by Refsdal (1964), proposes measuring the differences in the travel time of photons coming from multiple images of a distant source, such as a quasar. This time delay, , is connected to the overall matter distribution responsible for the lensing effect, and to the timedelay distance to the lens, i.e. , with some sensitivity to curvature and dark energy as well (e.g., Suyu et al. 2014).
Exploiting this relationship to constrain and cosmology in general requires both an accurate mass model for the lens and accurate time delay measurements (see e.g., Suyu et al. 2012; Linder 2011; Moustakas et al. 2009). Modelling the lens mass in an unbiased way is difficult and prone to degeneracies known as the masssheet degeneracy (e.g., Schneider & Sluse 2013) and, more generally, the source plane transformation described in Schneider & Sluse (2014). The effect of lens model degeneracies can be mitigated by combining astrometric information from high resolution imaging, measurements of the lens dynamics, priors on the mass density profile of the lens, and an analysis of structures along the line of sight (e.g., Suyu et al. 2014; Greene et al. 2013; Fadely et al. 2010; Treu & Koopmans 2002; Falco et al. 1997; Keeton & Kochanek 1997). Integral field spectroscopy coupled with the adaptive optics that is becoming available on the VLT and at the Keck observatory will be essential in this respect. One of the key ingredient for the method to work at all is, however, the quality of the timedelay measurements, which is the focus of the present work.
In practice, measuring time delays is achievable if the lensed source is photometrically variable. Gravitationally lensed quasars are ideal targets: the quasars can show variability accessible by moderately sized groundbased optical telescopes on timescales of a few days, while the time delays resulting from galaxyscale lenses are typically of the order of weeks to months (see, e.g., Oguri & Marshall 2010). The intrinsic light curve of the quasar is seen shifted by the relative time delays in the different lensed images. However, this simple situation is often contaminated: microlensing by stars in the lensing galaxy introduces extrinsic variability in the individual light curves with an amplitude sometimes comparable with that of the intrinsic variability of the quasar. To yield competitive cosmological constraints, reliable timedelay measurements with percentlevel precision are needed (Treu 2010). An efficient implementation of these measurements has long been hampered by how difficult it is to obtain photometric data for periods of many years at a desirable cadence, which must be close to 1 epoch per few days (Eigenbrod et al. 2005).
COSMOGRAIL is a monitoring program targeting more than 30 strongly lensed quasars using meterclass telescopes, with a cadence of 3 days for the most interesting systems. Recent results include light curves and timedelay measurements that are accurate to within a few percent points in HE 04351223 (Courbin et al. 2011), SDSS J10015027 (Rathna Kumar et al. 2013) and in RX J11311231 (Tewes et al. 2013b). To measure these time delays, we developed and implemented several algorithms in the form of a COSMOGRAIL curveshifting toolbox named PyCS, described in Tewes et al. (2013a).
In the fall of 2013, a first blind timedelay measurement competition named Time Delay Challenge 1 (TDC1) was proposed to the community by Dobler et al. (2015). The two main goals of this open challenge were (1) to homogeneously assess the quality of timedelay measurement algorithms on a common set of realistic synthetic light curves, and (2) to obtain some quantitative informations on the impact of observing strategy (cadence, season length, campaign length) on timedelay measurements. We took part in this challenge and submitted timedelay estimates using the team name PyCS. Liao et al. (2015) give a summary of the results from all TDC1 participating teams, as well as some general conclusions. The present paper is complementary to Liao et al. (2015). It focuses on the PyCS methods that we also apply to real light curves, and hence assesses the quality and reliability of the COSMOGRAIL timedelay measurements.
To evaluate our existing techniques with the thousands of light curves of TDC1 under conditions similar to the analysis of COSMOGRAIL data, we separated the problem of timedelay measurement of a light curve pair into two successive stages.
 Stage I:

we first attempt to discover a plausible time delay, without trying to measuring it precisely. We also evaluated how confident we were that the proposed approximate solution is close to the true time delay and not a catastrophic failure. Owing to the limited length of the monitoring seasons, the randomness of quasar variability, noise and microlensing, this was not possible for every light curve pair of TDC1 or a real monitoring campaign. We note that in the case of TDC1 we had no prior information on the time delay to look for, as we had no knowledge of the mass distribution in the lensing galaxy. Only the light curves themselves were used.
 Stage II:

for those systems for which Stage I was successful, we then focused on accurately estimating the time delay and associated uncertainty with the PyCS techniques, constraining the algorithms to a delay range around the solution from Stage I. As the PyCS Stage II methods did not rely on a physical model of the light curves, they would not be able to deal adequately with comparing odds among very different solutions.
This twostage structure is of general interest beyond PyCS, as the stages concern discernible aspects of the timedelay measurement problem. Stage I deals with the quantity and the purity of timedelay measurements, while Stage II deals with their actual accuracy.
The paper is structured as follows. Section 2 summarizes the data from the Time Delay Challenge 1 and the metrics used to evaluate techniques. In Sect. 3, we present the approaches we took to address Stage I, while Sect. 4 presents the Stage II algorithms. In Sect. 5 we discuss the results, expanding on the analysis of Liao et al. (2015), and we conclude in Sect. 6.
2 Time Delay Challenge 1
The mock data used in this work are synthetic quasar light curves made publicly available in the context of the the Time Delay Challenge 1 (TDC1) proposed by Dobler et al. (2015). These data mimic the photometric variations seen in real gravitationally lensed quasars, with different time sampling, number of seasons, and season length. The curves are generated with simple yet plausible noise properties, and include microlensing variability. The dataset is split into five “rungs” or stages that simulate different observational strategies, each rung consisting of 1024 pairs of light curves. The rungs randomly mix microlensing, noise properties, and variability but differ in time sampling, number of seasons, and season length. These differences are listed in Table 1 of Liao et al. (2015, hereafter TDC1 paper).
Participants to the TDC1 were asked to provide their best point estimate and associated 1 uncertainty for as many pairs of curves as possible. The submitted entries to the challenge were then compared to the true input delays, and evaluated using simple metrics probing different properties of the estimates. The details of how the simulations were set up, as well as a discussion of these metrics are given in Dobler et al. (2015). Results obtained by the blind submissions of the different participating teams are summarized in the TDC1 paper, including our submissions denoted “PyCS”. For completeness, we briefly summarize the four main metrics:

The fraction of submitted time delay estimates,
(1) where is the number of measured time delays and is the total number of curves.

The mean between the measured time delay and the true value weighted using the estimated uncertainties ,
(2) 
The mean “claimed” precision of the timedelay measurements, computed from the estimated uncertainties ,
(3) 
The mean accuracy of the timedelay measurements,
(4)
To analyse the results in greater detail, we also introduce two modified metrics. First, a “blind” precision,
(5) 
where we replace in Eq. 3 the true value of by its estimation . This metric can be computed without knowing of the true time delays; its summation terms are useful, for instance to sort light curve pairs of a real monitoring survey. Second, we introduce a variant of the accuracy
(6) 
where we replace the in the denominator of Eq. 4 by its absolute value. While is sensitive to a bias on the amplitude of (i.e., over or underestimation of delays), responds to signed biases.
Statistical uncertainties on these metrics are computed following
(7) 
where is the sample standard deviation of the summation terms of , , and .
3 Stage I: discovering time delays
To apply any of the PyCS timedelay measurement algorithms (Tewes et al. 2013a) to a pair of light curves, a prior estimate of the time delay is required. Depending on the considered light curves, identifying this delay from the data might be difficult or even impossible. In the following, we describe two approaches to discover rough timedelay estimates (Stage I). Both methods rely solely on the light curves without considering the configuration of the lens system. The first method is based on a visual inspection of the light curves and is the method we used to blindly prepare submissions for the TDC1 (Liao et al. 2015). We developed the second method after the TDC1 results were released. We use the data from the challenge to evaluate the relative merits and drawbacks of each method.
3.1 D3cs: D3 visual curve shifting
This method is based on visual inspection of the light curves, in the spirit of citizen science projects (e.g., see the review by Marshall et al. 2014). To ease this process, we developed a dedicated browserbased visualization interface, using the D3.js JavaScript library^{1}^{1}1DataDriven Documents, http://www.d3js.org/ by Bostock et al. (2011). We have made this interface public^{2}^{2}2http://www.astro.unibonn.de/~mtewes/d3cs/tdc1/ (See “Read me first” for help).
The main motivations behind this timecostly yet simple approach were (1) to obtain rough initial estimates for the time delays and their associated uncertainties, and (2) to estimate how confident one can be that the timedelay estimations are not catastrophic outliers. Clearly, visual curveshifting allows for more freedom than any automated procedure. It also permits dealing in a more flexible way with unusual behaviour of light curves, such as highly variable signaltonoise from one season to the next, extreme microlensing, or even when the time delays are comparable in length to the visibility period of the object.
Our interface allows users to interactively shift the light curves in time, magnitude, and flux, and to zoom in on sections of the data. It permits the visual estimation of the time delay and of an associated uncertainty. Importantly, the interface also asks to pick one of four choices of confidence category for the proposed solution:

secure: if a catastrophic error can be excluded with a very high confidence^{3}^{3}3This same category was named doubtless in Liao et al. (2015);

plausible: if the solution yields a good fit and no other solutions are seen;

multimodal: if the proposed solution is only one among two or more possible solutions that are equally satisfactory;

uninformative: if the data does not allow the estimate of any time delay.
Unlike massive crowdsourcing programs (e.g. Galaxy Zoo; Lintott et al. 2008), only four scientists participated in the visual inspection of TDC1 and each pair of curves was measured by at least two independent scientists. The behaviour of different users in terms of time spent per timedelay measurement span a broad range. Fast users spend on average 25 seconds per object, while slower users spend more than 60 seconds per object. This includes the time taken to measure a delay, to estimate a 1 uncertainty, and to allocate one of the four confidence levels described above.
To obtain a single Stage I estimation for each light curve pair, we reduce the results obtained by all four scientists in a very conservative way. We systematically downgrade to multimodal the confidence category of pairs with conflicting timedelay estimates.
We define samples combining the four confidence categories as follows: sec stands for secure only, secpla stands for secure + plausible, and secplamul for secure + plausible + multimodal. The combination of all estimations is labelled all.
Figure 1 shows the distribution of the error on the timedelay estimation versus the true time delay for the secure and plausible D3CS subsamples. Table 1 summarizes the results of the D3CS classification and displays the fraction of catastrophic outliers , i.e. timedelay estimations more than 20 days away from the truth. Notably, the secure sample contains 1623 timedelay estimates free of any catastrophic outliers.
Rung 0  Estimates  

Secure  548 (53.5%)  0% 
Plausible  291 (28.4%)  2.1% 
Multimodal  60 (5.9%)  30.0% 
Uninformative  125 (12.2%)  – 
Rung 1  Estimates  

Secure  288 (28.1%)  0% 
Plausible  383 (37.4%)  1.3% 
Multimodal  127 (12.4%)  17.3% 
Uninformative  226 (22.1%)  – 
Rung 2  Estimates  

Secure  223 (21.8%)  0% 
Plausible  406 (39.6%)  1.2% 
Multimodal  168 (16.4%)  27.4% 
Uninformative  227 (22.2%)  – 
Rung 3  Estimates  

Secure  329 (32.1%)  0% 
Plausible  324 (31.7%)  4.9% 
Multimodal  161 (15.7%)  18.6% 
Uninformative  210 (20.5%)  – 
Rung 4  Estimates  

Secure  235 (23.0%)  0% 
Plausible  430 (42.0%)  3.5% 
Multimodal  108 (10.5%)  26.9% 
Uninformative  251 (24.5%)  – 
All Rungs  Estimates  

Secure  1623 (31.7%)  0% 
Plausible  1834 (35.8%)  2.6% 
Multimodal  624 (12.2%)  23.2% 
Uninformative  1039 (20.3%)  – 
Through this simple experiment, we have demonstrated that such an approach is easily manageable for about 5000 light curves. In total the four scientists involved in the visual estimation of the time delays spent 150 hours measuring the 5120 delays. We note that 30 of the time delays were measured by three or more users.
3.2 Attempt to design an automated Stage I procedure
Visual inspection of the light curves is a timeconsuming process that cannot be repeated many times. Designing an automated method whose efficiency approaches that of D3CS is therefore complementary and would help to minimize the time spent on the visual inspection. We developed such a method after the end of TDC1. The concept of the method is to estimate a time delay by fitting a spline on one of the two light curves, and computing the residual signal of the second light curve relative to the spline after applying time and magnitude shifts. The details of the method are described in Appendix A; the present section evaluates its performance and compares this estimation to the visual timedelay values.
We characterize the efficiency of the automated Stage I procedure by comparing its fraction of catastrophic outliers with that of D3CS. We define catastrophic outliers as timedelay estimations deviating from the truth by more than 20 days, i.e. with . The timedelay and confidence estimation evaluated by the automated procedure are reduced to two numbers: the depth of the absolute minimum and the interseason variations of the microlensing . Figure 2 shows the evolution of the fraction of catastrophic outliers as a function of the cumulative number of timedelay estimations, sorted by increasing . The larger is, the more confident the automated procedure in the timedelay estimation is. We study the impact of the automated procedure parameters and by introducing three subsamples of automated estimations:

the Crudeall subsample contains all the estimations;

the Crude1min subsample contains only the estimations for which the procedure finds a unique local minimum with a depth 1;

the Crude1.5 subsample contains only the estimations with a magnitude shift deviation at the location of the absolute minimum .
Figure 2 shows the fraction of outliers in the three subsamples and compares them to the value obtained visually with D3CS, which are shown as four diamondshaped symbols corresponding to the combination of the four confidence categories of D3CS described in Sect. 3.1. We note that the uninformative D3CS estimations are systematically considered as catastrophic outliers here.
The selection criteria applied to the Crude1min and Crude1.5 subsamples are not able to decrease the fraction of outliers. This highlights how difficult it is to find efficient selection criteria for the automated procedure parameters, although no exhaustive exploration of the parameters space has been conducted. As expected, all the D3CS subsamples contain fewer outliers than the corresponding automated procedure subsamples. However, the efficiency of the latter are of the same order as the other automated methods presented in the TDC1 paper, which have % when half of the TDC1 data, i.e. 2500 light curve pairs, are measured.
In conclusion, although the automated procedure presented here is less complete and reliable than D3CS, it yields candidates that can be evaluated by eye in a second phase. Such a combined approach would benefit both from the speed of the automated method and from the flexibility of the human eye estimate when dealing with a broad variety of properties in the light curves. We note, however, that in the rest of the paper, we only use the results obtained via D3CS as our Stage I estimates.
4 Stage II: measuring time delays
Using the Stage I results as initial estimates, we proceed in this section by running our PyCS timedelay measurement techniques on the simulated TDC1 light curves. In Tewes et al. (2013a), three different algorithms were proposed: the simultaneous fit of the light curves using freeknot splines, the regression difference technique, and an approach based on a dispersion measurement of which the freeknot splines and the regression difference technique yielded the most accurate and precise results when applied to simulated COSMOGRAIL data (in Courbin et al. 2011; Tewes et al. 2013b; Eulaers et al. 2013; Rathna Kumar et al. 2013). To analyse the TDC1 simulations, we have therefore focused on adapting only these two most promising methods for an automated use.
We note again that our Stage II methods cannot be asked to judge the plausibility of a proposed delay. This step belongs to the Stage I method, i.e. to the visual inspection with D3CS to prevent or at least reduce catastrophic outliers. In practice, despite a correct Stage I estimate, any automated Stage II method may fail to converge, or it may yield a timedelay measurement that is incompatible with the initial approximation. To prevent these failures from contaminating our measurements we systematically discard any Stage II result that does not lie within 1.5 D3CS uncertainty estimate of the D3CS point estimate. This threshold acknowledges that the uncertainty estimates obtained from D3CS are typically overestimated by a factor of 2 to 3, which has been confirmed by Liao et al. (2015). We note that this rejection affects less than 1% of the light curve pairs and has no significant influence on the metric.
4.1 Freeknot spline technique
In the freeknot spline technique (spl), each light curve in a pair is modelled as the sum of a spline representing the intrinsic variability of the quasar, common to both images of the pair, and an independent spline representing the extrinsic variability due to microlensing. The intrinsic spline has a higher density of knots and is therefore more flexible accomodating the quasar variability, which is assumed to be faster than the microlensing variability. During the iterative fitting process, the light curves are shifted in time so as to optimize the between the data and the model light curve. To analyse a TDC1 light curve pair, we repeat this fit 20 times, starting form different initial conditions covering the Stage I uncertainty. This tests the robustness of the optimization procedure. The best model fit is then used to generate 40 simulated noisy light curves with a range of true time delays around the bestfit solution and using the temporal sampling of the original light curves. By blindly rerunning the spline fit on these simulated data, and comparing the resulting delays with the true input time delays, the delay measurement uncertainty is estimated.
We simplified and automated the spl algorithm for TDC1 with respect to the description of the freeknot spline method given in (Tewes et al. 2013a) and its applications to real COSMOGRAIL data. The main adaptations are the following:

The temporal density of spline knots controlling the flexibility of the intrinsic spline was computed from the signaltonoise ratios measured on the two light curves, using an empirical calibration. The signaltonoise ratios were obtained from a structure function, by comparing the typical amplitude of the light curve variability observed on a timescale of 50 to 75 days with the scatter between temporally adjacent observing epochs. For the microlensing spline, the knot density was fixed to be the same for all TDC1 pairs.

When generating our mock light curves, we did not inject any fast microlensing signal to mimic correlated noise. Only plain white noise was added to the generative model.

We did not analyse the timedelay measurement errors on the simulated curves as a function of true time delay. Instead, only the RMS error of these timedelay measurements was used as our total uncertainty estimate.

Finally, we did not manually finetune any parameters or correct for problematic model fits.
As a result, the entire spl analysis took about 5 CPUminutes per TDC1 pair.
4.2 Regression difference with splines
Our second Stage II method, sdi (for spline difference), is based on the regression difference technique of Tewes et al. (2013a). To speed up the analysis, we replace the Gaussian process regressions by spline fits. In summary, the method independently fits a different spline to each of the two light curves, and then minimizes the variability of the difference between these two splines by shifting them in time with respect to each other. The advantage of this approach is that it does not require an explicit microlensing model. To estimate the uncertainty, the sdi method is run on the same simulated light curves provided by the spl technique. The sdi method has an even lower computational cost than spl.
5 Results on the Time Delay Challenge 1 (TDC1) data
In this section, we separately analyse results from the Stage I and Stage II measurement processes as obtained on the simulated light curves of TDC1. General results from Liao et al. (2015) regarding submissions prepared with our methods include the following:

The Stage II methods spl and sdi show no significant deviations of the accuracy from zero, and can thus be considered as inherently unbiased, given the statistical limits due to the finite challenge size.

The claimed precisions of spl and sdi are very good, with values of the order of and .

Based on results from the method simple powerlaw models for the dependence of , , and on monitoring cadence, season length, and campaign length were adjusted. These relations attempt to capture general trends regarding the behaviour of all methods used in the context of TDC1, including our spl technique.
In the present paper, we focus on aspects that are complementary to the discussion of Liao et al. (2015).
5.1 Efficiency of timedelay discovery (Stage I)
We start by analysing the fraction of light curve pairs for which a time delay can be discovered with visual inspection, as a function of time delay and image brightness. This analysis relates only to the first stage of the timedelay measurement process.
Aside from the time delay and the quasar image brightness, the question of discoverability depends on observational conditions (e.g. monitoring cadence and duration) and on astrophysical characteristics (e.g. amount of quasar variability and microlensing perturbations). In the following, we select a given observing strategy and average over the astrophysical parameters of the TDC1 simulations, which follow clearly motivated distributions (Dobler et al. 2015). A large sample of light curve pairs with almost identical observing conditions can be obtained by merging rungs 2 and 3. These rungs share the fiducial threeday monitoring cadence for five seasons of four months each. The differing cadence dispersion of days for rung 2 and days for rung 3 (Table 1 of Liao et al. 2015) do not have a significant impact on the discoverability of time delays.
In practice, time delays can be measured accurately in pairs of light curves if the quality of both light curves is sufficient. In the following we consider as a relevant observable the photometric precision achieved in the fainter image of a pair. This is computed for each pair of light curves, as the median of the photometric error bars across the epochs of the TDC1 simulations. This is made legitimate by their overall effectiveness in representing the amplitude of the simulated noise, except for very few “evil” epochs of some systems (see Section 2.5 of Liao et al. 2015). However, when analysing real light curves, using the photometric scatter between the points might be a better choice than using potentially misestimated photometric error bars.
Figure 3 presents the distribution of the fraction of light curve pairs for which time delays could be identified via a meticulous D3CS visual inspection for two different monitoring strategies. In the left panel, only time delays categorized as secure through the visual inspection are considered as discovered. This is very conservative because in a real survey, simple lens models will help to identify the correct time delay for almost all of the plausible systems as well. For this reason we also consider the combined secpla sample, shown in the central panel.
Some of the cases categorized as multimodal could certainly also be resolved using simple lens model considerations, but in practice the vast majority of these light curve pairs contain too few clear common features to estimate a reliable time delay, even if an approximate value would be known from the modelling. We therefore consider the discoverability of the secpla selection shown in the central panel of Fig. 3 as roughly representative of the fraction of potentially helpful delays that can be reliably measured from a real monitoring campaign or survey. It can be seen as an approximate lower limit for the fraction of time delays that can be observationally constrained in the absence of prior from a lens model, provided the properties of the microlensing signal are similar to those of the simulated light curves used here. Finally, the right panel shows the evolution of this discoverability if the same monitoring strategy is carried out for five more seasons, i.e., for a total of ten years.
We observe that after five years of monitoring, light curve pairs with a photometric precision in the fainter image better than mag and a true time delay shorter than days (2/3 of the season length) are very likely to yield a timedelay measurement. Pursuing the monitoring for five more years significantly increases the average chances that longer delays up to of the season length become measurable.
5.2 Precision of timedelay measurement (Stage II)
We now turn to the analysis of the timedelay measurements (Stage II) for all systems where the time delay is successfully discovered (Stage I).
Figure 4 summarizes the results of the spl and sdi methods in terms of the metrics (accuracy), (claimed precision), and , as defined in Sect.2. The figure follows the same conventions as Table 4 of Liao et al. (2015), but also includes measurements obtained on the secpla samples of each rung. As expected, the results for these secpla samples are more scattered than for the sec samples. The reason for these significant offsets in and with respect to the sec results is the stronger impact of outliers on the nonrobust metrics.
To characterize the achievable precision of the Stage II measurements without being influenced by catastrophic outliers but still benefiting from a large number of light curve pairs, we now focus on the secpla sample from which we remove systems with estimated delays that are off by more than 20 days. This rejects about 1% of the secpla sample. We also note that catastrophic outliers are errors of the Stage I estimate, not Stage II.
Figure 5 presents metrics related to the Stage II timedelay measurement precision as a function of the photometric quality of the fainter quasar light curve and the time delay. In each tile the top left panel shows the average claimed precision for the spl technique, for a fiveyear monitoring with fourmonth seasons and a cadence of three days. We find that the cadence dispersion plays no significant role in this analysis, and we therefore merge rungs 2 and 3 to obtain this larger sample.
In contrast, each tile of the top right panel shows the root mean square (RMS) of the relative error of the timedelay estimates . The observed structure is inevitably noisier because this RMS is computed from the actual point estimates, while the precision is based on the corresponding claimed uncertainty estimates. We observe both a qualitative and a quantitative similarity between these plots, suggesting that the timedelay uncertainty estimates, (Eq. 2), from the PyCS techniques correctly capture trends related to the photometric quality and to the amount of temporal overlap in the light curves.
In the lower right panel of Fig. 5, the map of metrics for the spl technique shows no strong evolution across the wellsampled regions of the parameter space. It does however indicate that the uncertainty estimates from spl are too conservative by a small factor of . This is particularly visible for the high quality light curves with small time delays, i.e. with large temporal overlaps. Finally, the bottom left panel shows the average metric computed using only the best half of light curve pairs in each tile, where the quality of a system is judged via the blind relative precision (see Eq. 5). This operation, aimed at increasing the precision, divides by two the usable fraction of systems as given in Fig. 3. We consider such a selection in more detail in Sect. 5.3.
We observe in Fig. 5 that the best average relative precision in timedelay measurements seems to be achieved for time delays in the range 4080 days for this particular monitoring strategy. This corresponds to about half of the season length, and results from the tradeoff between absolute delay length and amount of overlap in the light curves.
Given the observed general aptitude of our timedelay uncertainty estimates, and thus , to describe the actual point estimate errors committed by the spl technique, and the excellent competitiveness of spl compared to other time delay measurement techniques (see, e.g., Fig. 13 of Liao et al. 2015), we see the left panels of Fig. 5 as roughly representative of the precision that can be achieved by a stateoftheart timedelay measurement method.
5.3 Effect on the overall metrics of selecting the best systems
In Fig. 6 we show the evolution of the overall average metrics as a function of the fraction of estimations for the splsec and sdisec samples. The five columns represent the five rungs and the estimations are sorted according to their blind precision , i.e the precision estimate from the TDC1 data prior to the unblinding of the true values. The noncumulative median value of the true delays (bottom row) is computed on consecutive bins of ten estimations. The shaded regions represent the statistical uncertainties of the metrics defined in Eq.7. These uncertainties are too small in the top row to be distinguished from the curves.
In the top row, increases monotonically with . This is expected since the estimations are sorted according to and since the D3CS sec sample is free of any outliers. The metrics , , and , respectively in the second, third and fourth rows stabilize around a mean value once a significant fraction of estimations is considered. The variations around the mean such as the jump in at in rung 2 are due to noncatastrophic outliers, i.e. time delays that deviate significantly from the truth but by less than 20 days. These outliers are the result of a nonoptimal convergence of the Stage II methods for the curves with the lowest signaltonoise.
The high end of and are subject to strong variations in all rungs. These variations occur for small absolute values of the true time delay . Similarly, the high end of increases strongly. A small error on the timedelay estimation particularly affects , , and if the true time delay is small.
This correlation between the loss in precision and accuracy means that for the corresponding light curves, our algorithms estimate the time delays less accurately, but do provide larger error bars. We observe that the remains constant as increases. In conclusion, sorting the measurements in and rejecting a small fraction of the least precise estimations allows an optimal accuracy to be maintained without affecting the .
Figure 7 shows the evolution of the TDC1 metrics with the fraction of estimations, , sorted by increasing order of , for the splsec and splsecpla sample. The few catastrophic outliers result in striking discontinuities in the curves. Quantifying the accuracy and precision of Stage II methods is different from avoiding such catastrophic outliers, and to address the former question, we also display in Fig. 7 a new subsample, splsecplacut, where the 29 timedelay estimates with an absolute timedelay error larger than 20 days are a posteriori removed. Similarly, the impact of outliers can be reduced by considering the median of the individual metrics instead of their mean. This is not surprising, but nevertheless it reflects the need either to use metrics that can cope with outliers or, as in our Stage I approach, to make sure that no outliers contaminate the timedelay samples used for subsequent cosmological application.
6 Summary and conclusions
In this work, we used the simulated light curves of the Time Delay Challenge 1 (TDC1) proposed by Dobler et al. (2015) to test the performance of the PyCS numerical techniques currently in use in COSMOGRAIL to measure time delays. These methods are fully datadriven, in the sense that they do not attempt to include any physical model for microlensing or for the intrinsic variations of quasars. This choice was deliberate and considers an empirical representation of the data that minimizes the risk of bias due to choosing the wrong physical model. The price to pay is that error bars on the measurements must be estimated from mocks and that we cannot use prior knowledge from external observations of quasars in a direct formal way. Using the same simulated light curves, we also assessed the quantity and quality of the timedelay measurements from future monitoring campaigns or surveys. We have made public our six main TDC1 submissions, obtained using the D3CS, spl, and sdi methods for the high purity secure and the less conservative plausible samples. These data are available on the COSMOGRAIL website^{5}^{5}5http://www.cosmograil.org/tdc1. Our results can be summarized as follows:

The visual estimation of time delays (Stage I) is extremely efficient in spotting catastrophic outliers and in providing useful timedelay estimates to be refined with subsequent numerical techniques (Stage II).

We attempted to build a simple automated timedelay estimation procedure that we can apply to the TDC1 data. While useful, this automated procedure does not achieve as good purity in the timedelay sample as the visual estimation. We note that we did not use this automated procedure for any of our submissions to the TDC1.

We provide a general analysis of the achievability of timedelay measurements as a function of the photometric precision of the light curves. In particular we show that almost all time delays shorter than twothirds of the season length can be measured in five years of monitoring with fourmonth seasons and realistic photometric quality.

Our Stage II methods spl and sdi can be considered unbiased given the statistical limits due to the finite challenge size. The values are close to unity. These good results emphasize the reliability of COSMOGRAIL timedelay measurements in general.

We quantify the average precision on the timedelay measurements as a function of photometric quality of the light curves. We find that the best average precision seems to be obtained for pairs whose time delay is approximately half of the monitoring season length.

We show that we can reliably evaluate the individual precision of our timedelay estimates. This may enable us, for any sample of light curves, to identify which are the most valuable objects to be followed up for cosmological purposes. We note, however, that any selection on the time delays in a quasar sample may also result in biases on the cosmological inference.
The above is true for the specific light curves of TDC1. These curves have been generated with simple models for quasar variations and microlensing and they do not include all potential nuisances of astrophysical, atmospheric, or instrumental origin. In addition, the PyCS techniques currently used in COSMOGRAIL do not attempt to account for these effects.
Acknowledgements.
The authors would like to thank the organizers of the Time Delay Challenge, as well as Peter Schneider for useful discussions and the anonymous referee for his/her useful comments. This work is supported by the Swiss National Science Foundation (SNSF). Malte Tewes acknowledges support by a fellowship of the Alexander von Humboldt Foundation and the DFG grant Hi 1495/21. Dominique Sluse acknowledges support from a Back to Belgium grant from the Belgian Federal Science Policy (BELSPO).References
 Bostock et al. (2011) Bostock, M., Ogievetsky, V., & Heer, J. 2011, IEEE Trans. Visualization & Comp. Graphics (Proc. InfoVis)
 Courbin et al. (2011) Courbin, F., Chantry, V., Revaz, Y., et al. 2011, A&A, 536, A53
 Dobler et al. (2015) Dobler, G., Fassnacht, C. D., Treu, T., et al. 2015, ApJ, 799, 168
 Eigenbrod et al. (2005) Eigenbrod, A., Courbin, F., Vuissoz, C., et al. 2005, A&A, 436, 25
 Eulaers et al. (2013) Eulaers, E., Tewes, M., Magain, P., et al. 2013, A&A, 553, A121
 Fadely et al. (2010) Fadely, R., Keeton, C. R., Nakajima, R., & Bernstein, G. M. 2010, ApJ, 711, 246
 Falco et al. (1997) Falco, E. E., Shapiro, I. I., Moustakas, L. A., & Davis, M. 1997, ApJ, 484, 70
 Greene et al. (2013) Greene, Z. S., Suyu, S. H., Treu, T., et al. 2013, ApJ, 768, 39
 Keeton & Kochanek (1997) Keeton, C. R. & Kochanek, C. S. 1997, ApJ, 487, 42
 Liao et al. (2015) Liao, K., Treu, T., Marshall, P., et al. 2015, ApJ, 800, 11
 Linder (2011) Linder, E. V. 2011, Phys. Rev. D, 84, 123529
 Lintott et al. (2008) Lintott, C. J., Schawinski, K., Slosar, A., et al. 2008, MNRAS, 389, 1179
 Marshall et al. (2014) Marshall, P. J., Lintott, C. J., & Fletcher, L. N. 2014, ArXiv1409.4291
 Molinari et al. (2004) Molinari, N., Durand, J., & Sabatier, R. 2004, Computational Statistics & Data Analysis, 45, 159
 Moustakas et al. (2009) Moustakas, L. A., Abazajian, K., Benson, A., et al. 2009, astro2010: The Astronomy and Astrophysics Decadal Survey, arXiv:0902.3219
 Oguri & Marshall (2010) Oguri, M. & Marshall, P. J. 2010, MNRAS, 405, 2579
 Rathna Kumar et al. (2013) Rathna Kumar, S., Tewes, M., Stalin, C. S., et al. 2013, A&A, 557, A44
 Refsdal (1964) Refsdal, S. 1964, MNRAS, 128, 307
 Schneider & Sluse (2013) Schneider, P. & Sluse, D. 2013, A&A, 559, A37
 Schneider & Sluse (2014) Schneider, P. & Sluse, D. 2014, A&A, 564, A103
 Suyu et al. (2012) Suyu, S. H., Treu, T., Blandford, R. D., et al. 2012, arXiv:1202.4459
 Suyu et al. (2014) Suyu, S. H., Treu, T., Hilbert, S., et al. 2014, ApJ, 788, L35
 Tewes et al. (2013a) Tewes, M., Courbin, F., & Meylan, G. 2013a, A&A, 553, A120
 Tewes et al. (2013b) Tewes, M., Courbin, F., Meylan, G., et al. 2013b, A&A, 556, A22
 Treu (2010) Treu, T. 2010, ARA&A, 48, 87
 Treu & Koopmans (2002) Treu, T. & Koopmans, L. V. E. 2002, MNRAS, 337, L6
 Weinberg et al. (2013) Weinberg, D. H., Mortonson, M. J., Eisenstein, D. J., et al. 2013, Phys. Rep, 530, 87
Appendix A Automated Stage I procedure
If more time had been devoted to the visual inspection, we expect that more correctly estimated plausible timedelay estimations would have been classified as secure. After the TDC1, we developed an automated Stage I procedure. Its goal is to speed up and possibly improve the quality of the D3CS output by providing a range of reasonable initial time delays and associated confidence estimation. The following section describes technically how the timedelay and confidence estimation for this automated procedure are computed.
a.1 Timedelay estimation
For each pair of curves, a boundedoptimalknot spline (Molinari et al. 2004; Tewes et al. 2013a) is fitted to one of the two light curves. The second light curve is then shifted by an amount in time and in magnitude. Thus, for a given observing epoch , the value of the shifted light curve can be written as . For every value of the time and magnitude shifts, we select all the points in the second light curve that overlap in time with the spline. For these points, we compute the residuals relative to the spline, i.e. the difference in magnitude between the points and the spline. The residual for point is
(8) 
We then compute the average absolute residual for every time and magnitude shift, i.e.
(9) 
The possible presence of microlensing, assumed to be constant over an observing season, is handled in a very simple way. For each time shift , we apply independent magnitude shifts to each season . We define the residual curve as the sum of the smallest average absolute residuals for each season . The index runs from 1 to and denotes the different timedelay values we want to test. This translates into
(10) 
We define as a local minimum in if
(11) 
where we keep the absolute minimum in as the final timedelay estimation. The index running from 1 to 10 spans a range of 20 days around each tested value . Figure 8 shows a typical residual curve with the absolute minimum indicated as a coloured diamond and the true time delay indicated as a vertical dashed grey line.
a.2 Confidence estimation
For each pair of curves, we compute three parameters related to the shape of the residual curve that can be used to estimate the quality of the timedelay estimations:

The number of local minima in .

The depth of each minimum and the absolute (i.e. the deepest) minimum ,
(12) where is the standard deviation between the elements in . Examples of values for are indicated in Fig.8 below the minima for time shifts days and days.

The total magnitude shift and the microlensing variability , where we use the per season magnitude shifts minimizing the average absolute residual at each time shift ,
(13) where is the standard deviation between the elements in . In other words the quantity is the difference between the sum of the magnitude shifts applied to each season at a given time shift and the minimum of this sum on all time shifts. This quantity follows the colour code in the sidebar of Fig.8 and is equivalent to the seasontoseason microlensing variations minimizing the residuals for a given time shift. The lower this quantity is, the smaller the impact of microlensing is.