A comparative study of four significance measures for periodicity detection in astronomical surveys
Abstract
We study the problem of periodicity detection in massive data sets of photometric or radial velocity time series, as presented by ESA’s Gaia mission. Periodicity detection hinges on the estimation of the false alarm probability (FAP) of the extremum of the periodogram of the time series. We consider the problem of its estimation with two main issues in mind. First, for a given number of observations and signaltonoise ratio, the rate of correct periodicity detections should be constant for all realized cadences of observations regardless of the observational time patterns, in order to avoid sky biases that are difficult to assess. Second, the computational loads should be kept feasible even for millions of time series. Using the Gaia case, we compare the method [Paltani, 2004, SchwarzenbergCzerny, 2012], the Baluev method [Baluev, 2008] and the GEV method [Süveges, 2014], as well as a method for the direct estimation of a threshold. Three methods involve some unknown parameters, which are obtained by fitting a regressiontype predictive model using easily obtainable covariates derived from observational time series. We conclude that the GEV and the Baluev methods both provide good solutions to the issues posed by a largescale processing. The first of these yields the best scientific quality at the price of some moderately costly preprocessing. When this preprocessing is impossible for some reason (e.g. the computational costs are prohibitive or good regression models cannot be constructed), the Baluev method provides a computationally inexpensive alternative with slight biases in regions where time samplings exhibit strong aliases.
keywords:
methods:data analysis – methods:statistical – stars:variables:general.1 Introduction
Identification and analysis of variable objects has been and will remain an important product of many small or largescale astronomical surveys. Periodic sources among them are singularly important for many special fields of astrophysics. Examples include the Cepheids, which form the basis of the cosmic distance ladder, and therefore are fundamental to cosmology; RR Lyrae, which trace ancient structures around the Milky Way and thus relate to the evolution of our Galaxy; multiperiodic stars, whose asteroseismology provides insight into the structure and evolution of stars; or eclipsing binaries, which can offer the possibility of determining the mass and radius of their component stars, and thus also provide strong constraints on stellar physics and the coevolution of stars.
Specific research fields require the selection of objects of specific types, and the databases in which they must be sought have now reached the terabyte scale. Whereas the Hipparcos mission [ESA, 1997] twenty years ago provided photometry and astrometry for about a hundred thousand stars, the Gaia mission, launched in December 2013, will furnish a similar catalog of approximately 1 billion astronomical objects by the end of its 5 year lifetime, with a precision of roughly 100 times that of Hipparcos. Other surveys also will produce databases of comparable or larger size. To facilitate efficient searches on such volumes, the catalogs should contain additional derived information about the sources. Specifically, for studies relying on variable stars, the variability type is of primary importance, however other attributes of the source such as the mean absolute magnitude, period, amplitude, harmonic amplitudes and relative phases can all help the researcher to direct better his/her search for a sample.
A crucial step to obtain this information is the discovery and correct identification of the period of the variable objects. A search for periodicity is performed on the time series of either photometric or radial velocity measurements of candidate objects, in order to produce a periodogram. The found period corresponds to the extremum of the periodogram. A decision should then be made as to whether this period is significant or not. Depending on the decision, the process then can continue with the characterization of the source as periodic and the production of the derived information for the catalog. When the decision step fails for a source, this information will obviously be erroneous. If these failures are systematic, and depend on some unrecognized factors, studies using such samples may be affected by serious unidentified and unknown biases.
Unfortunately, period detection is one of the procedures which is most at risk from such biases, because quasiperiodicities and sparse sampling in the time cadence of the observations affect the statistical characteristics of the periodogram. Their most important effects are the strong longterm dependencies appearing in the periodograms (called “aliases” in the rest of this paper), the loss of an orthogonal frequency system (that is, loss of orthogonality of the Fourier frequencies), and the degeneracy caused by computing an oversampled periodogram often at hundreds of thousands of test frequencies, based often on only a few dozen observations. Nevertheless, whether a found period belongs to a real periodic signal or is just due to random fluctuations must be assessed in a strictly formalised statistical way [a concise and clear paper is SchwarzenbergCzerny, 1998]. In the presence of these strong longrange dependencies, the most commonly applied statistical tests [Lomb, 1976, Scargle, 1982, Horne & Baliunas, 1986, Frescura et al., 2008, SchwarzenbergCzerny, 2012] lack solid theoretical support, and can yield incorrect estimates in the absence of clear recipes by which to tune them [Horne & Baliunas, 1986, Frescura et al., 2008, SchwarzenbergCzerny, 2012]. Thus, they can present largely uninvestigated biases. When applied en mass to time series from a skyscanning survey, which have coordinatedependent observational cadences with different regularities from point to point, these biases will add an unknown, coordinatedependent element to other, better investigated biases, such as that due to the number of observations [for e.g. the Gaia survey, see Eyer & Mignard, 2005].
In addition to these biases, the computational greediness of most procedures makes the situation even more difficult for survey data. The above methods usually need many noise simulations for each time sampling pattern as well as the computation of the periodogram for each simulation, in order to reproduce the distribution of the periodogram peak in the absence of periodicity. This is not feasible during the data processing of a largescale survey producing millions of time series.
To help solve these issues, we collected three propositions from the literature how to perform a significance test on periodograms: that of Paltani [2004] and SchwarzenbergCzerny [2012] ( method), that of Baluev [2008] (Baluev method) and that of Süveges [2014] (GEV method), and we added a fourth, ad hoc one, which consists of the direct determination of a critical level of periodogram peaks separating significant periodicities from nonsignificant ones at a given confidence level (quantile method). Three of these models, the , the GEV and the quantile methods, depend on unknown parameters, which differ from one sky location to another, and must be estimated for each of the candidate variable light curves (several tens of millions for Gaia).
In our study, instead of individually estimating these parameters for each time series, we investigated how they depend on some quantities that can be easily and quickly computed for every time series, such as the number of observations, the variance of times or spectral window features. We constructed regressiontype models linking these covariates to the parameters of the FAP methods. As a result, the costly simulationbased individual estimation of the parameters can be replaced with an estimation based on only the calculation of the above quantities and predicting the parameters from the previously estimated regression model with excellent results.
In order to achieve this, the crucial condition is the existence of such a regression model. Although theory gives indications as to what covariates the parameters of the FAP methods may depend on, at present there is no derivation of specific formulae or relationships. For the Gaia case, where the observational times consist of relatively irregularly spaced clusters of quasiregular sequences of observations, some quite clearcut relationships were found empirically. Since for many scanning surveys, their location on Earth or in a space orbit and/or their rotation determines some typical repeating observational cadences, and hence some characteristic spectral window patterns, in general it appears worthwhile to investigate these possibilities for the detection of periodic variability in other surveys too.
The performance of the procedures was assessed using two fundamental statistical paradigms. First, on simulated noise sequences, we checked the false alarm rate of the methods and the quality of their approximation to the true distribution of the maximum of the periodograms. Second, on weak noisy signals, we checked the ability of the methods to find their periodicity as significant. This way, we characterize the methods in terms of their statistical size and power. We conclude that at least two of these methods, the Baluev and the GEV ones, provide good approximations to the pvalue of the periodicity in the interesting low value range.
In Section 2 we give an overview of the problem. First we summarize the statistical principles applied in the detection of periodicities, and discuss the factors that can influence the crucial ingredient in the methods, the distribution of the extrema of the periodograms of white noise. We demonstrate these effects on a simple model using Gaialike simulations. Section 3 presents the four candidate methods and the regression models to estimate their parameters. Section 4 details their application to the Gaia survey, and summarises the expected performances of the four methods using noise and signal simulations. Finally, Section 5 provides a summary table of the crucial advantages and drawbacks of the methods, and discusses the possible choices for large surveys.
2 Periodicity detection
2.1 Principles of testing
Suppose is a photometric or radial velocity time series observed at epochs . The sequence of times may be anything from almost completely irregular to almost completely regular. The goal is to assess whether the time series contains a periodic signal or not. To this end, we compute a periodogram, consisting of some appropriately defined goodnessoffit measures of some periodic models at a large set of candidate frequencies, using one or more of the various methods in the literature, such as the Deeming method [Deeming, 1975]; PDMJurkevich [Jurkevich, 1971, Stellingwerf, 1978, Dupuy & Hoffman, 1985]; String Length [Lafler & Kinman, 1965, Burke et al., 1970, Renson, 1978, Dworetsky, 1983, Clarke, 2002]; SuperSmoother [Friedman, 1984, Reimann, 1994]; CLEAN [Foster, 1995]; Keplerian periodograms [Cumming, 2004]; LombScargle or generalized LombScargle method (Lomb 1976, FerrazMello 1981, Scargle 1982, Zechmeister & Kürster 2009); FastChi2 [Palmer, 2009]; conditional entropy method [Graham et al., 2013]; FAMOUS (F. Mignard, available with documentation at the website ftp://ftp.obsnice.fr/pub/mignard/Famous). The most probable frequency of a potential periodic signal is indicated by the extremum of the periodogram, which can be a maximum or minimum depending on the specific period search algorithm.
Strict statistical hypothesis testing contrasts the zero hypothesis of no periodicity to the alternative of a periodic component of any frequency in the time series. It consists of computing the probability that a time series with no periodicity produces a maximum higher than or equal to the observed maximum (or a minimum lower than or equal to the observed minimum). This probability is called the False Alarm Probability (FAP). Denoting in general the distribution of under with , FAP for maxima and for minima. If we find a periodogram extremum which is less likely than a prespecified confidence limit , then we can state that the hypothesis of no periodicity can be rejected at the confidence level . This case of a FAP will be termed a detection. Based on inspection of frequency search results for weak noisy signals, if the significant maximum/minimum of the periodogram is within of the correct frequency, then we will speak about a correct detection, otherwise an incorrect or false detection (we did not use the theoretically based formula for the calculation of the errors, since the inspection showed a significantly enlarged distribution of absolute differences between true and found frequencies with respect to what is expected from the formula).
To compute the FAP, we need to know the zero distribution of under , that is, the distribution of the periodogram maximum in the absence of periodicity. This is determined by several factors.

The periodogram type defines the distribution of any single periodogram value (in statistical terminology, the marginal distribution or margin), and has a determining role in shaping . For example, for the generalized least squares (GLS) periodogram as defined in Zechmeister & Kürster [2009], , and is approximately a beta distribution, so also must have a finite tail with endpoint at 1. The LombScargle periodogram with the original normalisation [Lomb, 1976, Scargle, 1982] has an exponential marginal distribution, with an exponentially decaying tail, and thus must have a tail smoothly decreasing towards infinity.

The “no periodicity” assumption is usually not sufficient to constrain , we must make further assumptions about the character of the time series under . Some options are: the time series is white noise with a specific distribution; the time series is white noise, with an unspecified distribution; the time series has some specified temporal structure such as a CARMA process or a random walk. The derivation of is very hard under most of these assumptions, so there is practically no known for the periodograms listed above under any assumption other than white noise, and even under the assumption of white noise, wellbehaved approximate distributions were published only quite recently and only for certain types of periodograms.

An irregular character of the time sampling entails the loss of any orthogonal frequency system, so in theory, no principles relying on independence could be applied without an extra step of orthogonalisation. Quasiregularities in the time sampling, e.g. daily and yearly cadences in groundbased observations or the 6hour spin rate of Gaia, can induce aliasing (strictly speaking, leakage, but we will use the commonly accepted term in astronomy), which leads to a complex pattern of dependence across frequencies in the periodogram. Moreover, dependence is also introduced into the periodogram by the simple fact that based on observations, we usually compute periodogram values. According to mathematical statistics, extrema of dependent sets do not behave the same way as those of independent sets [see e.g. Leadbetter et al., 1983, Beirlant et al., 2004, de Haan & Ferreira, 2006], not even when their marginal distribution is the same, so this dependence must have an effect on . This is clearly demonstrated by the simulations of Cuypers [2012], using Gaussian noise sequences with the same time span and the same number of observations, but with different temporal sampling patterns.
Theory and simulations thus both suggest to take dependence into account when trying to derive the distribution of periodogram maxima. However, there is no general mathematical derivation pointing to explicit formulae with which this could be accomplished. The parameters of the distribution of maxima of dependent or independent variables are in practice not derived from theory, but estimated casewise for every time series. The reason is that dependence itself in the set of random variables can take infinitely many forms, and for any reallife data set, usually little is known about it a priori.
However, for the frequency analysis of time series, we have an aid in revealing dependence structures in the periodograms: the spectral window of the time cadence yields a picture of the autocorrelations in the periodogram. Though correlations are in general not sufficient measures of dependence (apart from the case of a multivariate Gaussian), a nonzero correlation is an indicator of dependence. Moreover, from a practical point of view, the spectral window is quite easily available. As a substitute for theorybased relations between the distribution of periodogram maxima and the dependence in the periodogram, we may look for relationships linking the parameters of the distribution to numerical features of the spectral windows.
Gaia’s complex motion, consisting of a 6hour rotation and a slow precession of its axis, induces a large variety of spectral windows showing more or less prominent peaks corresponding to its spin rate and the integer multiples thereof [see e.g. Eyer & Mignard, 2005]. Two examples in Figure 1 illustrate the range of possibilities. The spatial variations over the sky, together with those of the number of observations , can be appreciated in Figure 2, which shows maps of and the strength of a typical Gaia alias, at 12 . Their spatial inhomogeneity implies that we can expect also the distribution of maxima to vary over the sky. We show on an example in the next section that this is indeed so.
2.2 Illustration
We illustrate the importance of these issues and their possible effects on the detection of periodicities over the sky by showing the discrepancies resulting from an independencebased simple FAP model in the Gaia case. The model approximates the distribution of the periodogram maxima by
(1) 
where is the “equivalent number of independent frequencies” [Scargle, 1982, Horne & Baliunas, 1986, SchwarzenbergCzerny, 1998], which should optimally be estimated for each time sampling from a high number of noise simulations, and is the marginal distribution of the periodogram.
We simulated 1500 Gaussian white noise sequences using Gaialike time samplings at each of 900 sky positions along an ecliptic meridian from the ecliptic plane to the pole (the long red line in Figure 2)^{1}^{1}1See the link Gaia Observation Forecast Tool at http://www.cosmos.esa.int/web/gaia/gaiadata for an online tool to predict Gaia observations for arbitrary sky positions.. A generalised least squares (GLS) periodogram [Zechmeister & Kürster, 2009] was calculated for each sequence, its maximum retained, and the FAP based on equation (1) computed. For the purposes of the illustration, and since many simulations per time series are in any case unfeasible in large surveys, we replaced by its upper limit , the number of Fourier frequencies falling in the scanned frequency range. The marginal distribution of the periodogram values is approximately a beta distribution for the GLS periodogram, so is taken to be the incomplete beta function with parameters 1 and [SchwarzenbergCzerny, 1998]. At each location, we obtained thus 1500 FAP values for white noise sequences.
The upper panel of Figure 3 shows 300 of the 900 histograms of FAPs, lined up sidebyside versus ecliptic latitude. Good approximations to the true distribution of the maxima should give a flat aspect of the surface traced by the histograms: the FAP values for noise based on a good approximate should follow a uniform distribution, that is, a flat histogram when binned, and moreover, they should do so at every location, independently of latitude or any other parameter. The lower two panels show only the last bin of these histograms, that is, the fraction of FAPs below the significance levels (black) and (red). This is the proportion of false detections (noise sequences for which a periodicity was detected). In the case of a wellbehaved FAP, this should be close to by definition of the FAP. From the plots, we can draw several important conclusions.

The false detection fractions are in general much above , which means that this FAP approximation is too permissive. The best should indeed be lower than the number of Fourier frequencies .

The middle panel of Figure 3 shows a strong spatial inhomogeneity: some regions, most importantly the ecliptic pole (), have a much lower average false alarm rate than others, say, a region at .

The false alarm rates depend only very weakly (if at all) on , as it can be seen in the bottom panel of Figure 3, so this spatial variation is not governed by variations in . Around , the significant fraction varies in a much wider interval (between ) than at other values; the lowest false alarm rates here are due to the ecliptic poles, where the numbers of observations are about 80. As the ecliptic pole is the region where aliasing is the strongest, indicating the strongest dependences in periodograms, this hints at the significant effect of dependences in the distribution of periodogram maxima.
A few important practical consequences can immediately be seen. First, two variable objects with the same light curve characteristics and with the same number of observations and signaltonoise ratio can have different detection probabilities, if located in different sky regions. Second, inhomogeneities are present on very short angular distance scales on the sky, because the shapes of the histograms are sharply fluctuating on very short distance scales. We need to account for these spatial inhomogeneities on the sky. The plots in this illustrative example and theory both suggest that this should be pursued through a modelling of the distribution of the maxima with the help of and variables characterizing the dependence in the periodogram. In the next section, we present several different approximations to , and describe the strategy to model its spatial variations.
3 Methodology
3.1 The four approximations to the FAP
There are now several propositions in the literature to obtain a reliable FAP for periodicity detection. We will investigate here three of them: the method [Paltani, 2004, SchwarzenbergCzerny, 2012], based on the formula ; the Baluev method [Baluev, 2008], based on highlevel excursions of stochastic processes; and the GEV method [Süveges, 2014], based on univariate extremevalue theory of random variables. To these, we add a fourth, ad hoc alternative: we directly estimate some desired limit level (quantile) that separates periodogram extrema that are significant at the required confidence level from nonsignificant ones. We applied the GLS method as described in Zechmeister & Kürster [2009] to search for periods. In this normalization, the marginal distribution of the periodogram is Beta [SchwarzenbergCzerny, 1998], and the extremum indicating the most likely frequency of a potential signal is its maximum.
The first three alternatives all attempt to give an approximation for the distribution of the maximum of the periodogram of a white noise process as the null distribution. The fourth one, the quantile method produces only a decision whether or not the found periodicity is significant, without giving its probability under the noise assumption. The four alternatives are the following.

This method approximates the true distribution of the periodogram maxima by that of the maxima of an equivalent system of independent frequencies. These frequencies usually cannot be identified with any subset of real test frequencies, as the periodogram values are more or less dependent over the whole spectrum. The approximate null distribution is
where is the marginal distribution of the periodogram [Beta for the GLS used here; SchwarzenbergCzerny, 1998], and , the only unknown parameter of the distribution, is the number of equivalent independent frequencies. is estimated using simulations; the Paltani–SchwarzenbergCzerny proposition is to compute periodograms of a sufficiently high number of noise sequences under the same time sampling, extract their maxima, and then use the median of these maxima to give an estimate for as
After having estimated for a specific time sampling, the pvalue of an observed periodogram maximum can be given by
(2) 
Baluev method [Baluev, 2008]
This method is based on the theory of the extremes of continuousparameter stochastic processes with beta, FisherSnedecor or chisquared margins. We use here a variant with beta margins. An upper bound to the FAP is given using approximations to the right tail of the exact distribution as
(3) where is the gamma function, is the variance of the observation times, and is the uppermost test frequency of the periodogram. The derivation assumes a low level of aliasing and spectral leakage, and is best at the lowest FAP values. pvalues higher than 0.05 should be considered only as rough bounds on the actual pvalue, but for , the method provides good approximations to the actual pvalues when aliases are weak. The above formula does not contain any unknown parameters, only simple quantities like the number of time series points or the variance of the observing times, which makes its implementation and application very fast and straightforward.

GEV method [Süveges, 2014]
A simpler look at the periodogram considers it as a set of discrete random variables with a particularly strong interdependence. According to univariate extremevalue theory, the maximum of a set of random variables follows the generalized extremevalue distribution (GEV):
where the distribution function is defined only on such that (for the GLS periodogram the endpoint of the distribution must be 1, that is, ). This limiting distribution plays a similar general role for maxima as the Gaussian distribution for sums and averages, and this remains so for certain types of dependence. In practice, the GEV approximation works well for astronomical periodograms when its parameters are estimated individually for every time sampling, even though mathematical theory has yet to prove the dependence in astronomical periodograms to fall under the general validity condition of extremevalue limits.
The two free GEV parameters and must be estimated for all individual time series. Once the estimates , and have been obtained, the pvalue of an observed periodogram peak can be computed as
(5) and can be compared to the prespecified level .

Quantile method
This procedure consists of estimating directly the level , which is exceeded by a maximum produced by a pure noise sequence with the prespecified probability . That is, if is the maximum of the periodogram of a white noise sequence. Figure 3 shows that this critical level (the quantile) depends on the location, and a direct estimation of needs to be done locationwise. Once is estimated, the question “Is this periodicity significant or not at the level ?” is equivalent to the question whether the computed is larger or smaller than . Thus, this method does not return any pvalue, only gives a yes/no answer to the question of significance, and there is no quantification how unlikely is to come from noise. There are no other parameters to estimate than itself.
3.2 Parameter estimation
Two of the approximations for , namely the and the GEV, and the quantile methods all contain one or more quantities that should be estimated: and for the GEV, for the method, and the quantile itself in the quantile method. In principle, the best way would be to estimate them individually for each observed sequence, by simulating a large number of noise sequences with the same marginal distribution as the observations and with the same time sampling pattern. But as this involves the computation of a lot of periodograms (of the order of 510 in the case of the GEV method, and hundreds or thousands for the others), this casewise estimation cannot be applied to all objects of a large survey.
The first substitute for the casewise estimation may be interpolation, if the characteristic cadence patterns on the sky are known in advance for the survey. We can then simulate a high number of noise sequences on a sufficiently fine grid on the sky with the local predicted time sampling, estimate the local parameters of the chosen method, and interpolate them to obtain the parameters at any other point. However, Figure 3 shows extremely sharp variations as a function of ecliptic latitude, which in fact would be even more violent if all 900 histograms could have been plotted. There is also at least one important characteristic of the time series that cannot be expected to be smooth, the number of observations , which is inherently discrete. The distribution of the maxima is expected to depend on it. It follows that simple interpolation in coordinates might not yield an adequate model.
Nevertheless, for surveys with a fixed prescribed scanning law like Gaia, these relationships provide a way to avoid costly casebycase estimation during the mission at the cost of some preparatory work. Suppose that we have a representative sample of sky locations . We can obtain the observing times in advance for each, and use a high number of randomly generated white noise sequences at these points to infer the required parameter (denoted by , the index indicating that the parameter is locationdependent). For all , we can also compute a set of good candidate explanatory variables , which are preferably fast and straightforward to compute; they can be sky coordinates, the number of observations, the variance of the observation times, or any spectral window feature such as its maximum value in some restricted range of test frequencies, the value at a frequency corresponding to the dominant cadence of the survey, or sums of its highest peaks. Then if a relationship can be established using these simulations, the casewise estimation during data processing can be replaced by a much cheaper computation: for a new time series during mission, say at location , we compute the necessary features for the given time series, and then use the estimated relation to infer the parameter . After obtaining , the decision about the significance of an observed periodogram maximum can be found from equations (2), (5) or by a simple comparison to the obtained quantile.
4 Application to the Gaia survey
4.1 Simulations
We simulated 1500 constant photometric time series and 1500 sinusoidally varying photometric time series using AGISLab [Holl et al., 2012], both with Gaialike noise [Jordi et al., 2010], at each of 3889 sky positions, with the local Gaia time sampling, in the following manner:

We selected several different sets of locations:

900 locations evenly distributed along an ecliptic meridian between the points and ;

on a rectangular grid cutting into the most densely sampled ecliptic latitude^{2}^{2}2The inclination of the spin axis of the satellite was fixed differently for the mission. As a consequence, the most densely sampled latitude is now at . ;

714 uniformly randomly distributed points over the sky;

a region including part of the Large Magellanic Cloud near the south ecliptic pole, where the Gaia scanning law induces strong aliasing;

randomly scattered points over a quarter of the sky with sparse sampling (); and

randomly scattered points over a quarter of the sky that had both high aliasing and a low number of points.
The selected 3889 sky positions are shown in Figure 2 as red dots.


At each location, we simulated 1500 independent white noise sequences and 1500 sinusoidal signals with signaltonoise ratio SNR and 1, with uniformly distributed random frequencies in , with Gaialike error distribution, and sampled with the local Gaia nominal scanning law.

The full periodogram was computed for every simulated time series (white noise and signals alike), and the maximum of each periodogram extracted.

We characterised each location by the vector of explanatory variables , as described in Section 3.2. The vector contained the number of observations at , the variance of the observing times, the highest spectral window values in small intervals around 4, 8, , 68 (characteristic Gaia alias frequencies), , and various other sums of subsets of these peaks, in an attempt to find the best approximation to an unknown theoretical link between the distribution of the periodogram maxima and the longrange dependence in the periodogram.

At each location, we performed the casewise estimation of the parameters of the , GEV and quantile methods, using the 1500 noise simulations.
We divided the 3889 locations into three parts. The points of the line (region (i) above, see Figure 2) and a band of the denselysampled rectangle (region (ii) above), in total 1329 points, were selected for a training set. The casewise estimated parameters at the training set locations were used to fit several alternative models for each of the relationships ; these alternative models are described in more details in Section 4.2. All fitted models were then applied to 1280 locations randomly chosen from the remaining locations (the rest of region (ii) and (iii)(vi)), and the best one for each of and was chosen according to their performance in reproducing the casewise estimated parameter values. Finally, the selected models were applied to the 1280 sky positions not used so far, in order to compare their performances on an independent test set.
Since there is no theory predicting the precise form of the functions , we wanted to avoid unnecessary extrapolation as much as possible, so we selected the subsets above such that training, selection and test sets all had sufficient coverage of the whole space of . For instance, the number of observations were in , and for the training, model selection and test set respectively, while the sum was within , and in the three sets.
4.2 Model fitting
Figure 4 gives a glimpse into the dependence of the parameters of the FAP methods on two of the possible variables, and the sum . For , and , there is a wellconstrained monotone relationship between the number of observations in the time series and the parameter in question; the link between and is rather diffuse, as the dependence on is at least partly already accounted for by . The relationships for the GEV parameters are not linear, despite the deceptive impression. Moreover, the relationships for all parameters seem to vary somewhat as a function of , indicated on the plots by colour. For , this variation is at least as important as the variation with ; for the GEV and the quantile methods, it seems only secondary.
In the absence of a theory for the relationship between , spectral window summaries and the parameters , , and , several options were tried, ranging from parametric to global nonparametric models: (a) we classified the spectral windows obtained for 7000 locations randomly scattered on the sky, then fitted separate linear models in each class depending only on ; (b) cutpoints in were used to divide the time series into groups according to the level of aliasing, then separate linear or nonlinear models were fitted within the groups; (c) we applied nonparametric random forest regression [Breiman, 2001] using all the possible covariates we defined for all locations; (d) we fitted 2D thin plate splines using coupled with a spectral windowrelated covariate; (e) smoothing splines as a function of , which can be taken as the univariate reduction of the thin plate splines, without a dependencerelated covariate.
These models were fitted using the individually estimated , , , and at the training set locations. We needed to select both the best type of models and the best set of explanatory variables for all the corresponding links and , which represents a very broad range of possible models. The random forest regression has some especially useful features: it is able to fit models with a high number of covariates without getting unstable, and it yields a measure of importance about all the covariates, which greatly helps variable selection. Using the few most important variables according to random forest, we fitted the models (b), (c) and (d), together with (a) and (e), and measured the goodness of the models by their predictive mean squared error on the model selection set. Plots of the fitted and against the individually estimated parameters were also inspected in order to avoid to select lowscatter, but biased estimators.
The choice of model variables reflects the lack of theoretical background: it is made purely on the grounds of predictive power and parsimony. Nevertheless, when was among the best candidate variables, we preferred it to other, more particular choices such as a spectral window value at a specific frequency, since is a good general summary of the strength of the dependence between two distant frequencies.
For the three method that have parameters to estimate, the best models found are the following.

method
In agreement with the impressions from the rightmost panel of Figure 4, we selected a relatively complex model with four variables, , and two spectral windowrelated quantities: the sum of spectral window values at nine prespecified characteristic frequencies, , and the sum of the highest three spectral window peaks within the range . There were many nearly equivalent choices, with only marginally worse performance. The final model is taken to be this fourvariate random forest model . Figure 5 shows a projection on the 2dimensional subspace spanned by and , with the value of colourcoded.

GEV method
For both parameters of the GEV method, the best model in terms of predictive power turned out to be the thin plate spline with covariates and . An overview of the fit is given in Figure 6, where the values of the individually estimated parameters are colourcoded over the plane . The fitted thinplate spline model is superposed as contour lines. Their inclination confirms that to know the parameters of the GEV, the sole knowledge of would not be sufficient: to estimate and , we also need the value of . Indeed, if we use the smooth spline model without , the fraction of significant pvalues on noise sequences becomes increasingly downward biased as aliases grow, and in the highly aliased region around the ecliptic pole, this bias shows a pattern similar to the Baluev method (bottom third panel of Figure 9). The fractions become nearindependent of region as shown in the bottom second panel of Figure 9, once is included in the fit.

Quantile method
For the two quantiles, we selected the same covariates as for the GEV ( and ) among several nearly equivalent candidates. The fitted models are presented in Figure 7. In this case, the contour lines of the fit seem to be almost parallel to the axis, implying that we should not find strong effect of aliasing on the results. Despite this, the pvalues computed on noise sequences show a bias similar to that of the Baluev method or the smooth splinemodelled GEV: in highaliased regions, their fraction drops below the nominal levels, though the effect is weaker than for those two models. Consequently, we decided to keep as a model variable.
4.3 Quality assessment of the FAP methods
We compared the best models selected in Section 4.2 through their statistical size and power, using the noise simulations for the first and noisy signal simulations for the second at the test locations.
4.3.1 Statistical size
The statistical size is the probability of a Type I error: that we get a significant test statistic value, although the zero hypothesis is true. In other words, we make a false discovery of a periodicity in noise. The desired value for a test is always fixed in advance, as the fraction of false detections we allow when applying our tests to white noise sequences. However, it is necessary to make sure that whatever we may wish to fix in the future, the true proportion of the false discoveries among noise sequences is indeed close to . As was already discussed in Section 2.1, this hinges on a sufficiently good knowledge of the distribution of the test statistic under the zero hypothesis. If this is unsatisfactory, the pvalues computed for any observed periodogram peak will be false, and we either find a lot of contaminating constant stars in the periodic sample, or we lose a higher than expected truly periodic sources in the low SNR regime. The first criterion of FAP methods is therefore the quality of their approximation to the true zero distribution of the periodogram peaks.
To check this, we used the periodogram maxima of the noise sequences simulated at each of the 1280 test set locations. For all of them, we computed the pvalues according to each of the models selected in Section 4.2. The general quality of the approximate distributions was visualised by taking the histograms of these pvalues, and comparing them to the uniform distribution. Figure 8 depicts these histograms at 50 randomly selected test locations.
The overall quality of the approximate distribution is the best for the GEV method, though there are some systematic discrepancies for both very low and very high pvalues. The distortion at high pvalues, which do not cause any contamination in a pvalueselected periodic object sample, is not interesting from the practical point of view. At the low pvalue end, we find slightly fewer values than expected, which means that the approximate distribution somewhat overestimates the pvalues when they are low; this suggests that the FAP based on the GEV will be slightly conservative, tending to give a bit fewer detections than the true distribution would do.
The other two methods have more serious deformations as compared to the uniform distribution. However, the Baluev approximation is not intended to be a true pvalue, but an upper bound on its true unknown value. Its discrepancies are concentrated at the uninteresting high pvalues, and, agreeing with the findings in Baluev [2008], at low pvalues it might be used as an approximate pvalue. Its performance there is similar to that of the GEV method, somewhat even more conservative.
The approximation has a systematic curvature throughout the interval , and underestimates pvalues, yielding about 1.5 times more false positives than . The quantile method does not give pvalues, so there is no corresponding histogram in Figure 8.
As there seem to be some, at least slight, discrepancies in the approximate distributions by all methods, it is important to assess how these depend on the features of the time samplings, how they are distributed over the sky, and whether they are capable of causing systematic biases in the detection probability of lowSNR periodic objectsas a function of sky position. For all sky positions, we computed the number of pvalues falling below 0.05 (significant at the level ) and those below 0.01 (significant at the level ). Figure 9 shows these fractions as functions of the number of observations and of the absolute value of the ecliptic latitude. As the 0.95 and the 0.99 quantiles were modelled, the quantile method too could be checked by these means.
One of the covariates in the models is , the number of observations in the time series. If the models found in Section 4.2 are sufficiently good, we do not expect much residual variation of the fraction of significant pvalues with respect to . Indeed, this is confirmed in the top row of Figure 9, with only slight discrepancies: the GEV method seems to be more conservative than average for low , whereas the Baluev one is more conservative for high . Another set of exceptionally low false alarm rates is present at for the Baluev method. For both methods, the effect is weaker for lower . The quality of the modelling depends also on the number of sparsely sampled time series used, so some improvement for the GEV can be expected if we include more locations with sparse sampling into the training set (our present choice overrepresented the densely sampled time series). Unfortunately, no such improvement can be expected for the Baluev method, where the dependence on is fixed, and there is no free parameter to adapt. The higher false alarm rate of the method, indicated by Figure 8, is more or less homogeneous over ; it is not obvious whether the impression of higher rates at low is significant or not. The most homogeneous and bestperforming method with the least bias is the quantile method – perhaps not surprisingly, as it is a direct model for the limit between significant and nonsignificant, and not a model for the whole distribution of periodogram maxima.
Dependence on sky location was not explicitly included in the models, so to check homogeneity with respect to celestial coordinates is important. Due to a rough rotational selfsimilarity of the patterns around the ecliptic axis and the similarity of the two hemispheres (see Figure 2, dependencies on the absolute value of the ecliptic latitude are sufficient for the most important effects. The bottom row of Figure 9 shows this dependence. Again, the quantile method shows the least systematic variation, followed by the GEV. The method has a weak smooth variation with a minimum false alarm rate around ecliptic latitudes 60. The Baluev method shows a decrease of false alarm rates in the pole regions, due to the fact that the Baluev approximation is built on the assumption of weak aliasing. In the pole regions the time samplings are close to periodic; the spectral windows of these cadences have high peaks at several multiples of the Gaia spin frequency up to high frequencies, as Figure 1 shows. The lower false alarm rates around , seen in the upper row of plots in Figure 9, are due to these locations. Since the formula for the Baluev FAP does not depend on anything else than , further modelling cannot correct for this bias.
The spatial patterns of the same false alarm rates in the region of dense sampling (region (ii) in the list of Section 4.1) are presented in Figure 10, for . The fraction of false alarms is shown by the colour, white denoting fractions equal to the required level , red indicating too many false alarms, blue too few of them. A homogeneous white colour on the map implies a goodquality method. A map of the number of observations at the locations is given on the left, in order to compare the patterns of false alarm fractions with those of . The maps confirm the main conclusions from the previous plots, including the (formerly uncertain) impression that the Baluev FAP is increasingly conservative with increasing : the bluest regions coincide with the most densely sampled regions. For the GEV method, there is some tendency of the bluest regions to roughly follow the regions with sparse time sampling, again corroborating the impression of Figure 9, though the effect is very weak. The quantile method and the methods show the weakest systematic sky effects, either correlating with , or independent of it.
The spatial distributions of the false alarm rates in region (iv), at the ecliptic latitude of the Large Magellanic Cloud, are shown in Figure 11. The variations in in this region are less important than in region (ii), but the alias strengths are increasing steadily between ecliptic latitudes 80 and 90, as the leftmost panel shows. Neither the GEV nor the method has discernible correlation with the alias strength, which indicates that most of the variation is accounted for by including dependencerelated variables in the fit. The false alarm rates of the Baluev method decrease with increasing alias strengths, as expected from Figure 9. Surprisingly, the quantile method shows a similar effect, though this was not obvious from Figure 9, and the fit for the quantiles also contain a dependencerelated covariate. However, the effect is weaker than for the Baluev method, as can be seen from the generally whiter shades in the panel showing the quantile method results than those in the panel corresponding to the Baluev method.
In summary, the distributional quality seems to be generally best for the GEV method. The Baluev method provides a good approximate distribution in the regime of very low pvalues, despite strong overall distortions. Both are slightly conservative. The quality of approximation is the worst for the method, as it has a general tendency to overestimate the significances (underestimate the pvalues). Sky systematics are the strongest for the Baluev method, and relatively weak or very weak for the other three.
4.3.2 Statistical power
Statistical power measures the capacity of a method to reject the zero hypothesis when it is not true. This quantity is in general hard to compute, most importantly because the alternative hypothesis is usually a composite hypothesis comprising a range of possible alternatives. Simulations for specific cases can be done, in the hope that they yield insight into what can be expected under more involved realistic situations. We used our signal simulations (described in Section 4.1) at the 1280 test locations. Figure 12 shows a general overview of the results for signals with SNR = 1 and using . The left panel shows the fraction of all detections, both correct and incorrect, by each of the methods, as a function of the number of observations in the time series. In agreement with the results on the false alarm rate given in the previous section, the method produces the highest fraction of detection (gray circles), while the Baluev method yields the lowest (black squares).
The middle panel of Figure 12 presents the fraction of correct detections among all the signal sequences. This is, as expected, the highest for the method, and the lowest for the Baluev method, which implies that a sample selected by the method will contain the highest number of correct detections. However, these correct detections come at the cost of an even higher number of false detections, as the third panel shows: the ratio between correct and incorrect detections is the lowest for the method, and highest for the Baluev method. The difference between the ratios obtained by the four FAP methods is very small for small and nearaverage , where the detection score is anyway very low for this SNR, but increases at high where there is a more substantial chance to discover the signal in the noise.
Spatial distributions of the detection rate and the correct/incorrect detection ratio in the densely sampled and the highly aliased polar regions are also shown in Figures 13 and 14, respectively. For the rectangle around , where the variations in the number of observations are more important than the variations in the level of aliasing, Figure 13 suggests that the main driving force behind the variations in both the number of detections and the correct/incorrect detection ratio is the number of observations, as the patterns in the four right side panels follow the patterns of the map of in the top left panel, rather than the patterns of in the bottom left panel. Apart from this, the plots confirm both conclusions drawn from Figure 12 as to the slightly higher number of detections and the slightly worse correct/incorrect ratio of the method as compared to the other three methods. Those perform quite similarly, only with tiny differences, which are more visible in Figure 12.
Similar conclusions can be drawn from Figure 14 showing the polar region, though with minor differences. With the high characteristic of this region, the average detection rate is barely above the respective false alarm rates of the methods ( was used), and it is found to be quite homogeneous within the region. The ratio between correct and incorrect detections does seem to vary with and , although it is generally very low, at most 20% of the detections are correct. The changes with are obvious for all methods. The variations with are much less evident; nevertheless, the average ratio seems to decline from to the south ecliptic pole, apparently following an average increase of , while the average is fairly homogeneous throughout the region.
In summary, whereas the method gives the highest absolute number of detections and of correct detections due to its generally permissive behaviour, its ratio of correct and incorrect detections is the least favourable among the four methods. The Baluev method provides the best correct/incorrect ratio, but in general, there is a small ( few percent) loss in the number of sources identified with a correct significant frequency compared to the method. The GEV and the quantile methods perform very similarly to the Baluev one, with a slightly worse correct/incorrect ratio.
5 Discussion
We studied the problem of periodicity detection in large surveys, using the Gaia case to demonstrate the issues and to assess the proposed solutions. Three FAP methods were implemented and tested from the literature: the proposition of Paltani [2004] and SchwarzenbergCzerny [2012], that of Baluev [2008] and that of Süveges [2014], which were all applied to the periodograms from the generalized least squares period search method, one of the most reliable methods used nowadays in astronomy. We tested also a direct estimation of a critical value between significant and nonsignificant periodicities corresponding to a fixed confidence level (the quantile).
The quantile method, the method of Paltani–SchwarzenbergCzerny and the GEV method of Süveges involve parameters which depend on the time sampling, and therefore, for the Gaia survey, vary over the sky. Thus, these parameters must be estimated individually for each sky position, or equivalently, for each time sampling. As these functions are strongly varying over short distances, we related the variations in the parameters to the number of observations and the sum of the most characteristic peaks in the local spectral window instead of sky coordinates. We tested several regressiontype procedures to obtain these relations, and selected the bestperforming procedure for the parameters of each of the three FAP methods.
The features and typical performances of the FAP methods, each using the parameters estimated from the bestperforming procedure, are summarized in Table 1. The differences between the models are in majority small, apart from one: the generally too permissive behaviour of the method.
The Baluev method provides good FAP estimations in the interesting low pvalue regime with some systematics only at the most highly aliased or most densely sampled time series, while not requiring any preprocessing. Thus, it is the least costly but reliable choice for period searches performed on large databases, and can be performed also when a sufficiently good model for the parameters of the other three methods cannot be constructed.
In terms of unbiasedness and sky systematics, the GEV and the quantile methods are the best. However, the quantile method delivers only a decision “significant/nonsignificant” referring to a single confidence level fixed in advance, whereas the GEV method produces a pvalue which can be compared to any desired , and is the most reliable among the three distributional methods. Both the GEV and the quantile methods require a preliminary construction of models for their parameters. Once a model with good predictive qualities is found, the GEV parameters or the quantile can be computed for all time series at small cost. If such a model cannot be found for the typical observational cadences of a database of time series, the parameters must be estimated individually; in that case, the quantile method can require a very high number of full periodogram calculations depending on the desired , whereas the GEV method needs the time equivalent to several () periodogram computations [Süveges, 2014].
The method, in general, underestimates the pvalues, and thus, overestimates the significances of the periodogram peaks. However, this effect is quite homogeneous over the sky, and there are only weak systematic locationdependent effects. Like the GEV and the quantile methods, it needs a preliminary model fitting for , and while the models for the first two contain variables that are at least heuristically arguable ( determines the tail of the marginal distribution of the periodogram, while characterizes the average strength of dependence in the periodogram), the variables necessary for modelling seem to be more ad hoc.
The number of detected weak signals is the highest for the method, similarly to the number of weak signals detected with correct frequency, though, due to its too permissive behaviour, the correct detections are hidden in a larger sample of detections. The rate of correct detections among all detections is the best for the Baluev method, and the worst for the method.
The above methods differ from each other in their generalizability to other period search methods and for other surveys than Gaia. The Baluev method is given originally for variants of the least squares method, with three different normalizations [Baluev, 2008]. Any periodogram derived from other period search methods should be checked if they have margins compatible with one of these normalizations. The method can be used in every case when the marginal distribution of the periodogram is known. The GEV method can be used in principle for any periodogram type which indicates the best frequency with a maximum; periodograms which indicate it by their minima can be dealt with by transforming them to be maxima, for example multiplying the periodograms by . Another advantage is that it is unnecessary to know the marginal distribution of the periodogram. The predictive regression models must then obviously be set up using the maxima of that specific kind of periodograms.
As for the extendability of the model building to surveys with different typical time samplings, unfortunately there are no theoretical indications for the direct estimation of the parameters of any of the methods based on the full joint probability distribution of the periodogram. However, there are suggestions from theory that the two most influential factors must be the tail of (hence the dependence on for the GLS in this study), and something that characterizes the strength of dependence in the periodogram (hence our choice of spectral window features). Thus, a visualization of individually estimated parameters at a relatively small random set of time samplings of the survey against and spectral window features can be informative whether such modelling is promising enough.
Our findings, in summary, favour two methods of FAP calculations. One is the GEV method, which yields an overall good performance with weak systematic inhomogeneities over the sky, and can be combined with most period search methods, but needs the setting up of a model for its parameters in a preprocessing step. The other is the Baluev method, which shows some systematic bias over the sky according to the varying strength of aliasing, and can be used only for methods with specific margins, but does not require any preprocessing, and can be computed during processing practically instantaneously. Both methods thus seem to make good progress towards the solution of the question of False Alarm Probabilities in the special case of large surveys.
Acknowledgments
References
 Baluev [2008] Baluev R. V. 2008, MNRAS, 385, 1279
 Beirlant et al. [2004] Beirlant J., Goegebeur Y., Segers J., Teugels J., 2004, Statistics of Extremes. John Wiley.
 Breiman [2001] Breiman L. 2001, Machine Learning, 45, 5
 Burke et al. [1970] Burke E. W., Rolland W. W., Boy W. R. 1970, J. Royal Astr. Soc. Canada, 64, 353
 Clarke [2002] Clarke D. 2002, A&A, 386, 763
 Cumming [2004] Cumming A. 2004, MNRAS, 354, 1165
 Cuypers [2012] Cuypers J. 2012, Proceedings of the 285th IAU Symposium, 299
 Deeming [1975] Deeming T. J. 1975, Ap&SS, 36, 137
 de Haan & Ferreira [2006] de Haan L., Ferreira A., 2006, Extreme Value Theory: An Introduction. Springer Science+Business Media.
 Dupuy & Hoffman [1985] Dupuy D. L., Hoffman G. A. 1985, International AmateurProfessional Photoelectric Photometry Communications, 20, 1
 Dworetsky [1983] Dworetsky M. M. 1983, MNRAS, 203, 919
 Eyer & Mignard [2005] Eyer L., Mignard F. 2005, MNRAS, 361, 1136
 FerrazMello [1981] FerrazMello S. 1981, AJ, 86, 619
 Frescura et al. [2008] Frescura F. A. M., Engelbrecht C. A., Frank B. S. 2008, MNRAS, 388, 1693
 Foster [1995] Foster G. 1995, AJ, 109, 1889
 Friedman [1984] Friedman J. H. 1984, LCS Technical Report 5.
 Graham et al. [2013] Graham M. J., Drake A. J., Djorgovski S. G., Mahabal A. A., Donalek C. 2013, MNRAS, 434, 2629
 ESA [1997] European Space Agency, 1997, The Hipparcos and Tycho Catalogues, ESA SP1200
 Holl et al. [2012] Holl B., Lindegren L., Hobbs D., 2012, A&A, 543, A15
 Horne & Baliunas [1986] Horne J. H., Baliunas S. L. 1986, ApJ, 302, 757
 Jordi et al. [2010] Jordi C., Gebran M., Carrasco J. M., de Bruijne J., Voss H., Fabricius C., Knude J., Vallenari A. et al., 2010, A&A, 523, A48
 Jurkevich [1971] Jurkevich I. 1971, ApSS, 13, 154
 Lafler & Kinman [1965] Lafler J., Kinman T. D., 1965, ApJS, 11, 216
 Leadbetter et al. [1983] Leadbetter M. R., Lindgren G., Rootzén H., 1983, Extremes and Related Properties of Random Sequences and Processes. SpringerVerlag, New York.
 Lomb [1976] Lomb N. R. 1976, Ap&SS, 39, 447
 LSST Science Collaborations et al. [2009] LSST Science Collaborations et al., 2009, arXiv eprint 0912.0201
 Palmer [2009] Palmer D. M., 2009, ApJ, 695, 496
 Paltani [2004] Paltani S., 2004, A&A, 420, 789
 Reimann [1994] Reimann J. D., 1994, PhD thesis, Univ. California, Berkeley
 Renson [1978] Renson P. 1978, A&A, 63, 125
 Resnick [1987] Resnick S., 1987, Extreme Values, Regular Variation and Point Processes. SpringerVerlag, New York.
 Scargle [1982] Scargle J. D. 1982, ApJ, 263, 835
 SchwarzenbergCzerny [1998] SchwarzenbergCzerny A. 1998, MNRAS, 301, 831
 SchwarzenbergCzerny [2012] SchwarzenbergCzerny A. 2012, Proceedings of the 285th IAU Symposium, 81
 Smith [1985] Smith R. L. 1985, Biometrika, 72, 67
 Stellingwerf [1978] Stellingwerf R. F. 1978, ApJ, 224, 953
 Süveges [2014] Süveges M., 2014, MNRAS, 440, 2099
 Zechmeister & Kürster [2009] Zechmeister M. & Kürster M., 2009, A&A, 496, 577