The DOHA algorithm: a new recipe for cotrending large-scale transiting exoplanet survey light curves.
Abstract
We present DOHA, a new algorithm for cotrending photometric light curves obtained by transiting exoplanet surveys. The algorithm employs a novel approach to the traditional “differential photometry” technique, by selecting the most suitable comparison stars for each target light curve, using a two-step correlation search. Extensive tests on real data reveal that DOHA corrects both intra-night variations and long-term systematics affecting the data. Statistical studies conducted on a sample of 9 500 light curves from the Qatar Exoplanet Survey reveal that DOHA-corrected light curves show an RMS improvement of a factor of ∼2, compared to the raw light curves. In addition, we show that the transit detection probability in our sample can increase considerably, even up to a factor of 7, after applying DOHA.
keywords:
Extrasolar planets – transits – survey – algorithm.
1 Introduction
In the last decade, a significant portion of the hunt for transiting extrasolar planets has been conducted by various ground-based, large-scale surveys, such as SuperWASP (Pollacco et al., 2006), HATNet (Bakos et al., 2004), TrES (Alonso et al., 2004) and QES (Alsubai et al., 2013). A common, defining characteristic of these surveys is that they were designed to cover as large a field of view as possible.
Data obtained by these surveys tend to suffer from a more or less common problem: the presence of unwanted flux variations that can either mask or mimic real (astrophysical) variations. A significant part of these variations is introduced by fixed, ordered trends in the data, collectively referred to as “systematics”. The list of systematics is rather long and includes, among others, variations due to airmass and seeing, colour-dependent extinction and object merging. The imprint of systematics on the data can be viewed as components leading to common-mode behaviour among the light curves of observed stars.
In addition, unwanted flux variations can also be introduced by random events. By definition these are not systematic; in other words, they are events that do not have a distinct, common-mode imprint on the data (see e.g. Pinheiro da Silva et al., 2008).
As the photometric accuracy required for ground-based exoplanet detection is of the order of 1% or better, it became readily apparent that all these variations, with amplitudes that often exceed a few percent and “signatures” that can easily mimic a transit event, can severely reduce the transit detection probability and, therefore, need to be accounted for and corrected. This has led to detrending algorithms such as TFA (Kovács et al., 2005) and SysRem (Tamuz et al., 2005). Recently, similar work has been done for space missions such as CoRoT (Mislis et al., 2010; Ofir et al., 2010) and Kepler (Still et al., 2012).
While they differ in their implementation, the core idea of these algorithms remains the same: they try to identify and correct systematic patterns, by exploiting their common-mode behaviour. A crucial factor in this exercise is the actual commonness of the patterns, in the (statistical) sense of what percentage of stars are affected by them, or in other words, how representative the patterns are of the entire sample. An additional consideration is the quantitative contribution of each pattern to the overall variations and whether specific patterns can be viewed as driving the variations. We maintain the distinction between common and uncommon dominant patterns throughout the manuscript.
In this paper we present DOHA, an algorithm conceived to correct for both systematic variations (regardless of commonness) and assorted data irregularities. The structure of the paper is as follows: in Section 2 we briefly describe the data set used in testing the algorithm; in Section 3 we present the algorithm itself, while Section 4 contains the results after applying DOHA to our sample light curves. Section 5 shows a test for signal detection efficiency and Section 6 summarises our work.
2 The sample data
For testing our algorithm we used data from the Qatar Exoplanet Survey (QES). The QES uses six 4k×4k FLI ProLine PL6801 cameras, equipped with 4×400 mm, 1×200 mm and 1×135 mm lenses, mosaiced to image an field on the sky in the magnitude range of . For our purposes, we selected data from a single field (RA=350, DEC=300’), obtained with one of the cameras equipped with a 400 mm lens (FOV=). The data were collected over a three-month period (end Mar–Jun 2010) and consist of 9 500 stars, with an average of 1 300 data points each and an exposure time of 60 sec. The data were reduced using the QES pipeline as described in Alsubai et al. (2013).
3 The algorithm
Let us assume that our data set consists of light curves $L_k$, with $N$ data points each, and that the total number of light curves is $K$. We wish to correct the light curve of the $k$th target star, $T$. DOHA achieves correction using a two-step correlation search approach.
In the first step, the algorithm calculates the correlation coefficient $r$ between the target star $T$ and a potential comparison star $C$. Only stars with are considered as potential comparison stars, with $\mathrm{RMS}_C$ and $\mathrm{RMS}_T$ the RMS values of the raw comparison and target light curves respectively. The correlation coefficient is calculated as
$r = \frac{1}{N_{\rm c}} \sum_{n=1}^{N_{\rm c}} \frac{(T_n - \bar{T})(C_n - \bar{C})}{\sigma_T \, \sigma_C}$ (1)
where $T_n$ and $C_n$ are the $n$th data points of the target and potential comparison star light curves respectively; $\bar{T}$ and $\bar{C}$ are the mean flux values of the light curves; $\sigma_T$ and $\sigma_C$ are the standard deviations of the target and comparison light curves respectively; and $N_{\rm c}$ is the number of data points common to both light curves. We note that $r$ is calculated on the common set of points of $T$ and $C$; missing points are not substituted.
At the end of this first step, values of the correlation coefficient $r$ have been calculated in total. Subsequently, the mean and standard deviation $\sigma$ of all $r$ values are derived, and those stars that have corresponding $r$ values more than $2\sigma$ above the mean are selected. In this fashion, we create a “family” of definite comparison stars, of size $F$, with light curves $C_f$, $f = 1, \dots, F$. Henceforth, we will denote this as Family Group Light curves (FGL).
We should note that the $2\sigma$ cut-off limit is not cast in stone. It is a “middle-ground” balance between selecting an adequate number of stars for the FGL on the one hand, and selecting only those stars that are strongly correlated with the target on the other. The limit can be adjusted to better suit the given data set, depending e.g. on the number of stars with sufficient data points and on the severity of the systematic and non-systematic trends.
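The first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: light curves are taken to be NumPy arrays on a common time grid with missing points stored as NaN, the function names `pearson_common` and `select_family` are hypothetical, and the RMS pre-selection of potential comparison stars is omitted.

```python
import numpy as np

def pearson_common(target, comp):
    """Pearson correlation using only points present in both light curves."""
    ok = np.isfinite(target) & np.isfinite(comp)   # common set; gaps are not filled
    if ok.sum() < 3:
        return np.nan
    t, c = target[ok], comp[ok]
    if t.std() == 0 or c.std() == 0:
        return np.nan
    return np.mean((t - t.mean()) * (c - c.mean())) / (t.std() * c.std())

def select_family(target, candidates, n_sigma=2.0):
    """Return indices of candidates with r > mean(r) + n_sigma * std(r), and all r."""
    r = np.array([pearson_common(target, c) for c in candidates])
    cut = np.nanmean(r) + n_sigma * np.nanstd(r)
    return np.flatnonzero(r > cut), r

# Synthetic demonstration: 2 of 40 candidates share the target's trend.
rng = np.random.default_rng(0)
trend = np.sin(np.linspace(0, 12, 200))
target = trend + rng.normal(0, 0.1, 200)
target[::17] = np.nan                               # a few missing points
candidates = rng.normal(0, 0.1, (40, 200))
candidates[:2] += trend
family, r = select_family(target, candidates)
print(family)
```

Lowering `n_sigma` enlarges the family at the cost of admitting more weakly correlated stars, which is the trade-off described above.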
In the second step, the algorithm splits the light curve of the target star into its individual-night segments. Working in each segment separately, the algorithm recalculates correlation coefficients, but this time, only the light curves from the FGL are taken into consideration. We use single-night segments to better account for airmass and colour-extinction variations.
As before, the mean and standard deviation $\sigma$ of the values of the correlation coefficient $r$ are calculated and those stars with $r$ more than $1.5\sigma$ above the mean are selected to create the New Family Group Light curves (NFGL). Assuming that the NFGL is of size $M$, we will denote its light curves as $C_m$, $m = 1, \dots, M$. We reiterate that the NFGL is created for each individual night segment; there are as many NFGLs as there are individual nights in the data. Also note that the constituent light curves of one NFGL are not necessarily the same as those of another.
Again, the $1.5\sigma$ cut-off is a reasonable “default” value, adjusted to account for the much smaller-sized FGL (compared to the original number of light curves), following the considerations described previously.
Still working on an individual night basis (inb), from the corresponding NFGL, we create a “master” comparison light curve $C_{\rm master}$, which is the mean of all the light curves in the given NFGL,
$C_{\rm master} = \frac{1}{M} \sum_{m=1}^{M} C_m$ (2)
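The per-night selection and the master comparison curve can be sketched as follows. This is an illustrative sketch only: it assumes timestamps in days with each night falling within one integer day number, a 2-D NumPy array `family` holding the FGL light curves as rows, and a hypothetical fallback (use the single best-correlated star) when no FGL member passes the cut on a given night, a case the text does not address.

```python
import numpy as np

def corr_common(a, b):
    """Pearson correlation on points present in both arrays."""
    ok = np.isfinite(a) & np.isfinite(b)
    if ok.sum() < 3 or a[ok].std() == 0 or b[ok].std() == 0:
        return np.nan
    return np.corrcoef(a[ok], b[ok])[0, 1]

def nightly_masters(time_days, target, family, n_sigma=1.5):
    """Return {night: (point indices, master comparison curve)} for one target."""
    nights = np.floor(time_days).astype(int)        # assumed night labelling
    out = {}
    for night in np.unique(nights):
        idx = np.flatnonzero(nights == night)
        r = np.array([corr_common(target[idx], f[idx]) for f in family])
        if np.all(np.isnan(r)):
            continue
        keep = np.flatnonzero(r > np.nanmean(r) + n_sigma * np.nanstd(r))
        if keep.size == 0:                           # assumed fallback, not from the text
            keep = np.array([np.nanargmax(r)])
        # Master curve: point-by-point mean of the selected light curves
        out[int(night)] = (idx, np.nanmean(family[keep][:, idx], axis=0))
    return out
```

Because the selection is repeated for every night, the stars entering the master curve may differ from one night to the next.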
Once the master comparison curve is calculated, correction of the target light curve (for the given individual night) is achieved via a double-iterative, “global” RMS minimisation technique, as follows:

We construct an array of scaling correction factors , of size (), with and arbitrary step; for our tests, we chose a step of 0.01 (in which case, I = 99); and

We define RMS, the RMS of the raw target light curve segment, , as the reference, starting point

An I-steps iteration over all begins

Subsequently, a steps iteration begins, . For the given , at the th step, a temporary corrected target light curve is constructed by
(3) 
The RMS of this light curve, , is calculated and compared to that of the previous step,

Iterations over halt when

The I-steps iteration continues with , and the entire process is repeated.
The corrected target light curve segment is chosen to be the one with the minimum RMS, that is . The final DOHA-corrected light curve of the target star is simply the concatenation of all corrected target light curve segments .
Finally, we note that, as with the cut-offs applied in the creation of the FGL and the NFGL, the array step for the values can be adjusted to better suit a given data set.
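The double iteration can be illustrated with the sketch below. Several points are assumptions rather than the authors' prescription: the exact form of Equation 3 is not reproduced here, so the update simply subtracts a scaled, mean-subtracted master curve at every inner step; the scaling factors are assumed to run from 0.01 to 0.99; the inner loop is assumed to stop as soon as the RMS no longer decreases; and the names `minimise_rms` and `max_iter` are hypothetical.

```python
import numpy as np

def rms(x):
    """RMS scatter about the mean, ignoring missing points."""
    return np.sqrt(np.nanmean((x - np.nanmean(x)) ** 2))

def minimise_rms(target, master, step=0.01, max_iter=1000):
    """Search over scaling factors; return the lowest-RMS corrected segment."""
    shape = master - np.nanmean(master)             # assumed: correct the shape only
    factors = step * np.arange(1, int(round(1.0 / step)))   # 99 values for step 0.01
    best, best_rms = target.copy(), rms(target)     # raw segment is the reference
    for a in factors:                               # outer iteration over factors
        current, current_rms = target.copy(), rms(target)
        for _ in range(max_iter):                   # inner iteration for this factor
            trial = current - a * shape             # assumed stand-in for Eq. 3
            trial_rms = rms(trial)
            if trial_rms >= current_rms:            # assumed halting condition
                break
            current, current_rms = trial, trial_rms
        if current_rms < best_rms:                  # keep the global minimum
            best, best_rms = current, current_rms
    return best, best_rms
```

A coarser `step` shortens the search but limits how closely the accumulated correction can match the amplitude of the trend in the target.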
For additional clarity, we provide representative illustrations of the algorithm’s description, as presented above, in Figures 1 and 2.
In the top panel of Figure 1 we plot a (randomly selected) raw target light curve (green curve), and the light curve of the highest correlated comparison star from the FGL (red curve). To better highlight variations around the respective mean values (indicated as solid lines), light curves are plotted as consecutive points and not according to their timestamps, which span a range of three months. In the four middle panels, we show four different individual night segments and we plot the corresponding points (again in green), along with the master comparison light curve (in red) for that particular night. Finally, in the bottom panel we plot again the raw target light curve (green) as well as the final, DOHA-corrected light curve (blue).
Note that DOHA does not apply explicit outlier rejection. Most of the outlying points in the raw light curve (Fig. 1, bottom panel, top curve) were actually “brought in line” by the algorithm’s correction.
Figure 2 shows the values of the correlation coefficients, calculated through Equation 1, for a random target star. The black, solid line lies $2\sigma$ above the mean. All the stars above this line (green points) are used to form the FGL.
Note that our data are magnitude-sorted (brightest to faintest). The target star used in Fig. 2 is an 11.3 mag star. Given the RMS criterion for potential comparison stars and the fact that bright stars are more likely to have high-correlation comparisons (this is explained in detail in the next Section), it is not surprising that most of the FGL members clump on the left-hand side of Figure 2, i.e. have a small star index.
4 Results
4.1 Statistical tests
For each of the 9 500 stars in our field, we constructed the corresponding FGL, noted the on-chip position of the target and of the highest correlated star in the FGL and calculated their distance in pixels. In Figure 3 we plot the target magnitude versus the distance to the highest correlated comparison star. The distance is given in pixels, with 1 pixel corresponding to . Figure 3 does not show any distinct pattern between the two plotted quantities. Effectively, the highest correlated star for any given target can be located anywhere on the chip.
Additionally, we noted the actual correlation coefficient of the highest correlated comparison for a given target and in Figure 4 we plot these values against target magnitude. Despite the scatter, a linear trend is clearly visible in Figure 4, showing that for bright stars, it is much more likely to find a comparison star with a high correlation coefficient. This could be an indication that red noise dominates the systematics, and bright stars are more susceptible to red noise, as they tend to occupy a larger area (more pixels) on the CCD.
Finally, we combine Figures 3 and 4 and plot the distance of the highest correlated comparison star versus the corresponding correlation coefficient in Figure 5. As with Fig. 3, there is again no distinct pattern between the two plotted quantities. The value of maximum correlation is independent of the distance between target and comparison, even for high correlation values (>0.9).
4.2 RMS diagram
In order to better assess the performance of DOHA, we constructed the RMS diagrams of both the raw and the DOHA-corrected light curves for all stars in our sample. The resulting diagrams are shown in the left panel of Figure 6. The dashed black line indicates the theoretical noise floor curve.
To appreciate the light curve improvement, we can define the relative RMS improvement as . This is plotted in the middle panel of Figure 6 against target magnitude. In general, brighter stars show larger improvement, a result of the fact that for brighter stars it is easier to find comparisons with very high correlation coefficients (see again Fig. 4). We should also note that very large relative improvement factors (>0.8) should be interpreted with some caution. As mentioned before (see also Figs. 1 and 8), DOHA performs very well with outliers; part of the improvement can be attributed exactly to outlying points being brought to the correct level. For this reason, in the right-hand panel of Figure 6 we again plot the relative improvement, but this time using the median absolute deviation (MAD), that is .
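The difference between the RMS-based and MAD-based measures can be illustrated with a short sketch. The definition used here, (raw − corrected) / raw applied to either statistic, is an assumption consistent with improvement values below 1, and the function names are hypothetical. The toy data consist of a clean light curve and a copy with a handful of outliers added.

```python
import numpy as np

def rms(x):
    """Standard deviation about the mean, ignoring missing points."""
    return np.nanstd(x)

def mad(x):
    """Median absolute deviation about the median."""
    return np.nanmedian(np.abs(x - np.nanmedian(x)))

def relative_improvement(raw, corrected, stat):
    """Assumed definition: fractional reduction of the scatter statistic."""
    return (stat(raw) - stat(corrected)) / stat(raw)

corrected = np.tile([-1.0, 0.0, 1.0], 100)   # 300 well-behaved points
raw = corrected.copy()
raw[::50] += 20.0                            # six strong outliers

print(relative_improvement(raw, corrected, rms))
print(relative_improvement(raw, corrected, mad))
```

In this toy case the RMS-based improvement is large while the MAD-based one is zero, since the only difference between the two curves is the outliers. This is why the MAD panel is the more conservative indicator of how much the bulk of the points has improved.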
Using statistics from the Kepler mission, Howard et al. (2012) estimate that there are 0.0066 transiting hot Jupiters per star, with orbital periods up to 10 days. The deepest, ground-observed, transiting exoplanet so far is HATS-6b, with (Hartman et al., 2015). From Fig. 6, we calculate that in our raw sample, of the light curves have the required accuracy to detect such a transit. This percentage increases to after applying DOHA, i.e. there is a factor of 2 improvement.
4.3 Comparison with SysRem
As a further performance test, we subjected our sample to detrending using the SysRem algorithm, and compared the outcome with the DOHA results from the previous section. Figure 7 shows the resulting RMS diagrams in the left panel, and the relative RMS and MAD improvements in the middle and right panel respectively. There is a rather small percentage (2% of the total sample) of stars where SysRem yields a better RMS. Also, the same considerations about the treatment of outliers, mentioned previously, apply here.
The main difference between SysRem and DOHA is the assumption on the nature of the patterns affecting the data. The implicit assumption of SysRem is that the common-mode patterns are dominant and can be expressed as linearly varying components, calculated from the entire sample and, furthermore, that these calculated components are representative of the entire sample and can therefore be used to correct it. On the other hand, DOHA makes no assumption on the nature or the dominance of the patterns (common or uncommon, as described in the introduction) and, moreover, DOHA tries to find the representative components for each star individually, without being based on whole-sample statistics. To illustrate the point of uncommon dominant patterns, we refer the reader to Fig. 2 again, where it can be seen that (a) the majority of stars actually show very little correlation with the target, (b) practically a quarter of the stars show, in fact, negative correlation and (c) there are indeed stars which show high correlation.
4.4 Individual light curves
4.4.1 General examples
To illustrate the performance of DOHA more accurately, we present here three individual light curves selected from our sample. These light curves are representative examples of the patterns affecting our data. Table 1 gathers some basic information on these three stars. The reported periods come after running the “Box Least Squares” (BLS) algorithm of Kovács et al. (2002).

3UC171131243 (Fig. 8, top panels): this is a constant star, but systematics introduce non-real variability, which is picked up by the BLS search. DOHA not only corrects the systematics creating the variability, but also corrects almost all the outliers between phase .

3UC175129698 (Fig. 8, middle panels): this is a typical short-period variable star (). Note that the amplitude of the variability remains unaffected after applying DOHA. Furthermore, similar to the previous example, the algorithm manages to correct almost all the outlying points.

3UC177129206 (Fig. 8, bottom panels): the light curve of this system seems to contain a “transit-like” signature at phase . The corrected light curve is much cleaner, without outliers, but, most importantly, the “transit-like” signal disappears. This is a case where DOHA successfully corrects a false-positive identification. The fact that this star is, indeed, constant is supported by radial velocity measurements which show no RV variations, to a level of , corresponding to a mass smaller than .
UNSO4 ID  RA  Dec  Mag  BLS period [d] 

3UC171131243  13:57:16.09  04:31:51.10  13.4  1.966354 
3UC175129698  13:52:05.31  02:45:32.70  13.0  0.391308 
3UC177129206  13:54:36.36  01:42:11.80  12.8  1.596665 
4.4.2 Transiting light curves
A further test was to assess the performance of DOHA on known transiting exoplanets that have been observed with the QES. We have selected data that actually contain three known planets in the same field (a field different from the one described in Sec. 2): WASP-1b (Collier Cameron et al., 2007) with period days and magnitude ; HAT-P-19b (Hartman et al., 2011) with period days and magnitude ; and KELT-1b (Siverd et al., 2012) with period days and magnitude . KELT-1b is a very massive object (27 $M_{\rm J}$), but because it is orbiting a mid-F type star, the depth of the transit is very small (0.6%), making it an ideal target for our test.
We first detrended the raw data using SysRem and subsequently ran BLS on the resulting light curves. BLS successfully detected WASP-1b and HAT-P-19b, but failed to detect KELT-1b. We then repeated the process, only this time we corrected the raw light curves using DOHA. In this case, all three planets were successfully detected by the BLS, with the correct parameters for orbital period and transit depth. Figures 9–11 show the SysRem (left panels) versus DOHA (right panels) phase-folded and binned light curves of the three planets.
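For readers unfamiliar with the phase-folded, binned representation used in these figures, the operation can be sketched as below. This is a generic illustration, not the plotting code behind the figures; the function name `phase_fold_bin`, the reference epoch `t0` and the number of bins are assumptions.

```python
import numpy as np

def phase_fold_bin(time, flux, period, t0=0.0, n_bins=50):
    """Fold on `period` and average the flux in equal-width phase bins."""
    phase = ((time - t0) / period) % 1.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Bin index of every point; the clip guards against rounding at phase = 1
    which = np.minimum(np.digitize(phase, edges) - 1, n_bins - 1)
    sums = np.bincount(which, weights=flux, minlength=n_bins)
    counts = np.bincount(which, minlength=n_bins)
    with np.errstate(invalid="ignore", divide="ignore"):
        means = sums / counts                     # NaN where a bin is empty
    centres = 0.5 * (edges[:-1] + edges[1:])
    return centres, means
```

Binning averages down the point-to-point scatter, which is what makes a shallow event such as the 0.6% transit of KELT-1b visible in a folded light curve.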
5 Signal detection algorithms
As a final test, we investigated the effect of DOHA on the performance of signal detection algorithms. For comparative purposes, we also ran the same tests using the SysRem-detrended light curves, as presented in Sec. 4.3.
The tests were conducted as follows: we injected simulated transit signals, generated using the Pál (2008) model, in all the 9374 raw light curves of our sample (Sec. 2). For all transits, the stellar and planetary parameters were kept fixed to = 1.0 , = 1.0 and = 1.0 , while the orbital period was randomly chosen from a uniform distribution, with . This combination of stellar and planetary parameters was chosen to ensure a large number of detections for statistical purposes.
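The injection step can be illustrated with a deliberately simplified sketch. The test described above uses the Pál (2008) transit model; the code below swaps in a box-shaped transit with no limb darkening, purely to show the mechanics. It also assumes light curves in normalised flux, and the depth, duration, epoch and period range are placeholder values, not those of the actual test.

```python
import numpy as np

def inject_box_transit(time, flux, period, depth, duration, t0):
    """Multiply in-transit points by (1 - depth); box shape, no limb darkening."""
    # Time elapsed since the most recent ingress, in [0, period)
    since_ingress = (time - t0 + 0.5 * duration) % period
    in_transit = since_ingress < duration
    out = np.array(flux, dtype=float)
    out[in_transit] *= 1.0 - depth
    return out

# Placeholder example: period drawn from a uniform distribution (range assumed)
rng = np.random.default_rng(3)
time = np.linspace(0.0, 10.0, 1001)
period = rng.uniform(1.0, 5.0)
flux = inject_box_transit(time, np.ones_like(time), period,
                          depth=0.01, duration=0.1, t0=0.5)
```

Injecting into the raw light curves, before any correction, means the detrending step is tested on whether it preserves the signal as well as on whether it removes the systematics.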
Subsequently, the raw light curves (now including the transit signals) were subjected to correction using both SysRem and DOHA. In each corrected set, we searched for transits using two separate signal detection algorithms: BLS and SiDRA (Mislis et al., 2016). We note that the test was not designed to compare BLS with SiDRA, only to assess how the probability of detecting a transit, using each detection algorithm, changes after applying DOHA.
The combination of SysRem+BLS yielded 149 transits (1.6% of the total), whereas DOHA+BLS successfully identified 1226 transits (13.1% of the total). To have a clearer view, we divided our sample in 0.5-mag-wide bins and in Figure 12 we plot the BLS detection efficiency in each magnitude bin. If we now restrict the considered magnitude range to (the working magnitude range of the QES survey), then BLS correctly identifies 6.2% of the transits, using SysRem; and 58% of the transits, using DOHA.
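The quoted percentages follow directly from the 9374 injected light curves, and the per-bin efficiency is a simple ratio of recovered to injected transits. The sketch below shows both; the function name `efficiency_by_magnitude` and the toy magnitudes are illustrative assumptions, not survey data.

```python
import numpy as np

def efficiency_by_magnitude(mags, recovered, width=0.5):
    """Fraction of injected transits recovered in each magnitude bin."""
    mags = np.asarray(mags, dtype=float)
    recovered = np.asarray(recovered, dtype=bool)
    start = np.floor(mags.min() / width) * width          # left edge of first bin
    which = np.floor((mags - start) / width).astype(int)
    total = np.bincount(which)
    found = np.bincount(which, weights=recovered.astype(float),
                        minlength=total.size)
    with np.errstate(invalid="ignore", divide="ignore"):
        fraction = found / total                          # NaN for empty bins
    return start + width * np.arange(total.size), fraction

n_injected = 9374
sysrem_bls = round(100 * 149 / n_injected, 1)    # percentage for SysRem+BLS
doha_bls = round(100 * 1226 / n_injected, 1)     # percentage for DOHA+BLS
print(sysrem_bls, doha_bls)

left_edges, frac = efficiency_by_magnitude([10.1, 10.2, 10.6, 10.9],
                                           [True, False, True, True])
```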
For the test with SiDRA (an entropy-based, random forest classification algorithm that does not yield physical parameters, such as the orbital period), we imposed a strict 70% confidence cut-off (see Mislis et al., 2016, for details). At this level, SiDRA classified 505 systems as definite planets using the SysRem light curves (5.4% of the total); and 938 systems (10.0% of the total) using the DOHA-corrected light curves. If we again restrict the magnitude range, as before, then SiDRA returns 7% of the total number of planets, using SysRem; and 20% using DOHA.
As a by-product, using the DOHA light curves, SiDRA correctly identified three (already known) variables in the field (2 RR Lyr and 1 W UMa) which had been missed in a variable search using SysRem-detrended light curves.
It is evident that DOHA significantly increases the chances of finding transiting planets, regardless of the signal detection algorithm employed.
6 Conclusions
In this paper we have presented DOHA, a new algorithm for correcting light curves obtained with large-scale, ground-based photometric surveys, with an emphasis on those of transiting exoplanets. Adopting the reasoning of a comment made by the referee during the review process, we denote DOHA as a cotrending, rather than a detrending algorithm.
DOHA is based on the standard differential photometry technique of correcting a target light curve using a master comparison light curve constructed from suitable, individual comparison stars. The success of DOHA lies in its ability to optimise the way in which suitable comparison stars are selected for each target separately. DOHA looks for and corrects common-mode patterns shared by the target and a “base” of comparison stars (constituting a small subset of all stars in the field), which is built after a two-step correlation search; the first accounting for long-term trends, the second for intra-night variations. In short, DOHA exploits the defining characteristic of systematics, that is their manifestation as common-mode behaviour of the data, without making any assumptions on their nature and prevalence. As such, DOHA is able to correct data trends and patterns regardless of their commonness and/or individual contribution to the variations in the sample. Our algorithm can either be used as standalone on raw light curves, or as a complement to detrending algorithms, correcting for residual uncommon patterns.
To test and assess the performance of DOHA, we have used 9 500 light curves from the QES transiting survey. The results show that DOHA is able to improve the light curve RMS by a factor of 2, doubling the probability of detecting a transit signal. Results also indicate that DOHA is particularly efficient on bright stars.
Finally, by adding simulated transits in all of our sample light curves, we showed that, using DOHA combined with two separate signal detection algorithms, the number of successful detections can increase considerably.
Acknowledgments
We would like to thank the anonymous referee for a prompt and useful report. This publication was made possible by NPRP grant X0191006 from the Qatar National Research Fund (a member of Qatar Foundation). The statements made herein are solely the responsibility of the author.
References
 Alonso et al. (2004) Alonso R. et al., 2004, ApJ, 613, L153
 Alsubai et al. (2013) Alsubai K. et al. 2013, Acta Astron., 63, 465
 Bakos et al. (2004) Bakos G., Noyes R. W., Kovács G., Stanek K. Z., Sasselov D. D. and Domsa I., 2004, PASP, 116, 266
 Collier Cameron et al. (2007) Collier Cameron A., Bouchy F., Hébrard G., et al. 2007, MNRAS, 375, 951
 Hartman et al. (2011) Hartman J. D., Bakos G., Sato B., et al. 2011, ApJ, 726, 52
 Hartman et al. (2015) Hartman J. D., Bayliss D., Brahm R., et al. 2015, AJ, 149, 166
 Howard et al. (2012) Howard A., Marcy G., Bryson S., et al. 2012, ApJS, 201, 15
 Kovács et al. (2002) Kovács G., Zucker S. & Mazeh T., 2002, A&A, 391, 369
 Kovács et al. (2005) Kovács G., Bakos G. and Noyes R., 2005, MNRAS, 356, 557
 Mislis et al. (2010) Mislis D., Schmitt J. H. M. M., Carone L., et al. 2010, A&A, 522, A86
 Mislis et al. (2016) Mislis D., Bachelet E., Alsubai K. A., et al. 2016, MNRAS, 455, 626
 Ofir et al. (2010) Ofir A., Alonso R., Bonomo A., et al. 2010, MNRAS, 404, L99
 Pál (2008) Pál A., 2008, MNRAS, 390, 281
 Pinheiro da Silva et al. (2008) Pinheiro da Silva L., Rolland G., Lapeyrere V. et al. 2008, MNRAS, 384, 1337
 Pollacco et al. (2006) Pollacco D. L. et al., 2006, PASP, 118, 1407
 Siverd et al. (2012) Siverd R., Beatty T., Pepper J., et al. 2012, ApJ, 761, 123
 Still et al. (2012) Still M. & Barclay T., 2012, Astrophysics Source Code Library, record ascl:1208.004
 Tamuz et al. (2005) Tamuz, O., Mazeh, T., Zucker, S., 2005, MNRAS, 356, 1466