Random threshold for linear model selection, revisited.
Merlin Keller Marc Lavielle
EDF R&D, Chatou, France
INIRIA Saclay, France and University of Paris Sud, Orsay, France
Contents
Abstract
In [9], a random thresholding method is introduced to select the significant, or non null, mean terms among a collection of independent random variables, and applied to the problem of recovering the significant coefficients in non ordered model selection. We introduce a simple modification which removes the dependency of the proposed estimator on a window parameter while maintaining its asymptotic properties. A simulation study suggests that both procedures compare favorably to standard thresholding approaches, such as multiple testing or modelbased clustering, in terms of the binary classification risk. An application of the method to the problem of activation detection on functional magnetic resonance imaging (fMRI) data is discussed.
1 Introduction
In [9], the following model is considered:
(1) 
where are unknown constants, some of which are zero, and are independent, identically distributed (iid) zeromean random variables, with known cumulative distribution function (cdf)
Within this model, the problem of selecting the significant coefficients based on the observations is studied. Such a problem arises in many different application areas, such as genomics [7], or neuroimaging [6], to cite just a few.
Many methods have been proposed to perform this task. Multiple testing procedures for instance (see [7] for an overview of existing methods), have been developed to control a certain type I error rate, such as the familywise error rate (FWER), or the false discovery rate (FDR) [2] at a userfixed level. It can be argued however that the choice of a level, which ultimately defines the subset of selected coefficients, is arbitrary, as their is no safe guideline to what an ‘optimal’ level of false detections should be.
An alternative, that allows to control both type I and type II error rates, consists in fitting a mixture model to the data, with one class for the null (zeromean) data, and one, or more, for the nonnull data. A detection threshold can then be derived, which minimizes a certain classification risk, such as the binary risk, associated to the 0 1 loss function, resulting in a ‘naive Bayes’ classifier [10]. The main difficulty of this approach lies in the choice of a distribution for the nonnull data, which may influence significantly the resulting classifier. Many authors have proposed to deal with this issue through model selection techniques (see [8, 11, 5] for instance), however it remains an openended problem.
In view of these difficulties, the random threshold (RT) approach introduced in [9] appears as a promising candidate, since it does not require the specification of a typeI error level, nor of a model for the nonzero mean observations. The principle of RT lies in estimating the number of significant coefficients, based on a random centering of the partial sums of the ordered observations. Because it relies on as little assumptions as possible, we expect RT to be more robust than the abovementioned approaches.
However, to date very little is known concerning the classification performances of RT procedures; [9] essentially gives a minimal separation speed between null and nonnull data for the method to attain perfect classification asymptotically. Furthermore, the algorithm described therein still depends on a window parameter, which may have some influence in presence of noisy data.
This article describes a simple modification of the RT procedure, which removes its dependency on the window parameter, while maintaining its asymptotic properties. We then study the classification performances of both techniques using numerical experiments, in comparison to the abovementioned standard approaches.
The rest of this article is organized as follows. In Section 2, the original RT method is reviewed. The variable window extension is introduced in Section 3. In Section 4 the results of numerical experiments are presented, which show the good properties of RT in terms of classification. An application to fMRI data analysis is discussed in Section 5, and we conclude in Section 6 by considering the perspectives open by this work.
2 Original random thresholding procedure
2.1 Testing the presence of significant coefficients
We start by recalling how the presence of nonzero means is tested, that is, how the null hypothesis is tested in [9]. This is done by comparing the cumulative sums of the ordered observations to their conditional expectations under according to the following steps:

Order the observations

For , let

Let and

Define the test statistic . The null hypothesis is rejected if , with such that
Note that the cumulative sums are not computed directly from the ordered observations, but from the transforms which, under are an ordered series of random variables. The conditional expectations can then be computed using the following result (see [9] for details);
Theorem 1 (Conditional expectations of ordered exponential variables)
Under the are an ordered series of random variables. It follows that:
Furthermore, for it is shown in [9] using MonteCarlo simulations that:
(2) 
which provides an approximate calibration for the above test. It is illustrated in Figure 1, on a dataset of observations simulated according to model (1), with noise terms sampled from the distribution.
Under the global null that is, when (Figure 1, top), the cumulative sums are seen to follow closely their conditional expectations Consequently, the resulting test statistic value does not exceeds the critical value given by (2).
In contrast, when we add nonzero means, all taken equal to to the null data (Figure 1, bottom), the s become substantially larger than their expected values under resulting in a gap between the corresponding curves. Note that this gap is most significant around because, for all is the sum of the largest observations (after transformation), containing mostly nonzero means for Consequently, the ensuing test statistic is far above the critical value
2.2 Selecting the significant coefficients
Upon rejection of the null hypothesis, the following task consists in selecting the significant coefficients. The procedure for doing so can be interpreted in a datadependent ‘multiple hypothesis testing’ setting, as described hereafter. Consider the null hypothesis as defined in Section 2.1, and the set of alternative hypotheses:

for any and
In other terms, corresponds to the hypothesis that the largest observations only have nonzero means. Even though in reallife datasets null and nonnull data are never perfectly separated, in general one cannot expect more than to discriminate between such hypotheses in nonordered model selection. Note that this is equivalent to choosing a certain detection threshold to separate null from nonnull data.
Denote the expectation under (instead of ). The RT procedure first computes the ’s using the same steps 1. and 2. as in Section 2.1, then adds the following steps:

Let be some positive integer. For and compute:

Let
As in the preceeding section, for each is an ordered series of variables under so can easily be computed using Theorem 1 (see [9] for further details). Heuristically, the partial cumulative sums are compared to their conditional expected values under the hypotheses for The number of significant coefficients is estimated as the index corresponding to the minimal gap between and as evaluated by
This is illustrated in Figure 2, using a a dataset simulated exactly as in Section 2.1, that is, with significant coefficients. Informally, the procedure uses a sliding window with width and compares the cumulative sums within this window to their conditional expectations under the hypothesis that the window contains the largest null terms. For the window contains in fact mainly significant terms, so that the are well above their expected values, yielding a normalized gap of For the window indeed contains mostly the largest null terms, so and are of the same order, yielding a much smaller gap Finally, for the window contains null terms, but not the largest, so the cumulative sums become lower than their expected values. Consequently, the gap increases Figure 3 shows the complete sequence of values, with a clear minimum at close to the true number of significant coefficients.
3 Extensions and asymptotic properties
3.1 Unknown distribution extension
The RT method recalled in the previous Sections may be difficult to apply to reallife problems, where the noise distribution is in general unknown. In [9], an extension is proposed to the case where is a parametric distribution with unknown. Quite naturally, this consists in estimating under each hypothesis from the null data then using this estimate to derive the transforms for More precisely, having chosen some positive integer the extension consists in performing for the following steps:

let be an estimate of

for let

for compute:


Compute
Finally, the estimated number of components is given as before by
This simple extension is much more computation intensive than the original procedure, since the ’s for must be recomputed for each instead of once and for all. It is illustrated in Figure 4, using the same simulated dataset as in the previous Sections. Here the null distribution is defined as the Gaussian and the unknown variance is estimated by the usual mean squares: The estimated number of significant coefficient, is still close to its true value, and so is the corresponding standard error estimate
3.2 Varying window extension
As we have seen previously, the RT procedure depends on a parameter which can be interpreted as a window width, since is a function of must be smaller than the number of null coefficients, but at the same time not too small, or would become unstable. Hence choosing an appropriate value for may be a hindrance in practice, especially since we want the RT method to be adaptive and depend as little as possible on any form of tuning.
This issue can be avoided by redefining as a function of thus replacing the fixed width by a varying width which requires no prior tuning. We define the following procedure, starting with the same steps 1. and 2. as in Section 2.1, and adding the following steps:

Let be a lower bound on the number of null coefficients. For and compute:
(3) 
Let
In other terms, would be strictly equal to the test statistic defined in Section 2.1, if the sequence where replaced by the subsequence i.e., the set of null terms under
Notice that is independent from as long as reaches its global minimum on The varying window extension presented here can of course be combined with the unknown distribution extension in Section 3.1.
Figure 5 illustrates the varying window extension on the same simulated dataset as previously. Intuitively, the ‘sliding’ window is replaced by a ‘shrinking’ window, which at first encloses all observations, than progressively reduces in width as the largest observations are left out. Otherwise the same observations as in the fixedwidth case hold: when the cumulative sums are larger than their conditional expectations resulting in a large gap (). This gap is considerably reduced when as the s and the s become of the same order (), then the gap increases again for as the cumulative sums become smaller than their expected values (). Though not shown here, the sequence attains its minimum in as in the fixedwindow case.
3.3 Asymptotic properties
The estimator of the number of significant coefficients presented in Section 2.2 is consistent. This is the main result in [9], and it can be extended to the varying window setting. We start by recalling the following asymptotic framework:

There exists and a subset of with and such that if For all other index,

For any for a certain sequence with (see [9] for further details).

such that
We then have the following result:
Theorem 2 (Consistency of the random threshold)
Let stand for the estimator defined in Section 2.2. Under assumptions AF1, AF2, AF3, and appropriate VonMises type conditions on the cdf of the errors (see [9] for further details), is consistent in the sense that:
(4) 
for any positive decreasing sequence such that
This result can be refined by deriving an upper bound, which we do not detail here, on the convergence rate of the probability in Equation (4), for a particular choice of sequence Consistency also holds in the unknown distribution case, under a different set of assumptions which we do not recall here, and under the varying window extension, as shown in Appendix A.
This theorem is interesting in that it gives a convergence rate for provided that a minimal signaltonoise (SNR) ratio is attained, represented by a lower bound on the absolute values of the nonzero means (assumption AF2). Note that, in order for the random threshold (or any other threshold for that matter) to asymptotically separate perfectly null from nonnull data, this SNR must necessarily become arbitrarily large as the sample size increases.
However, this theorem provides no clue to what happens when the SNR remains bounded, as we expect to be the case in reallife applications. In the remainder of this paper, our goal is to explore the behaviour of the RT approach in such cases.
4 Simulation study
In order to assess empirically the classification properties of RT, we designed several numerical experiments. Our goal was to compare the binary classification risk of the RT procedure (with both fixed and varying window width), to those obtained by modelbased clustering and FDR control techniques. Specifically, we used a mixturemodel, estimated via an expectationmaximisation (EM) algorithm [4] to approximate the risk minimizing detection threshold, and the BenjaminiHochberg (BH) procedure [2] to derive a threshold controlling the FDR at a certain level. We dismissed FWER control techniques as they essentially yield constant thresholds at a given level, and are therefore of little interest when compared to adaptive approaches.
We considered two cases, depending on whether the null distribution was considered as known or not. Note that the BH procedure is based on the pvalues hence it requires that be known, whereas this same distribution can be estimated using the EM algorithm. So in order to compare methods on a fair basis, we compared RT to the BH procedure when the null distribution was known, and to mixture model fit otherwise.
4.1 Results with known null distribution
We chose to simulate the directly in the case of a known null distribution. Datasets of observations where generated, containing each significant terms. These were sampled from the Gamma distribution where is a scale parameter, and the remaining null terms were sampled from the distribution.
Table 1 shows the average classification risks obtained by the different methods over simulated datasets and for different choices of the Gamma distribution parameters. More precisely, we chose to compute the ratio of each attained risk to the lowest achievable (oracle) risk, which makes more sense since perfect classification is in general unattainable.
The binary risk and oracle threshold can be computed as follows. Consider a given dataset where is a binary variable, equal to if is a null term (sampled from the distribution), and if is a nonnull term (sampled from the Gamma distribution). Then the overall classification error associated to a given detection threshold is given by:
(5) 
that is, the sum of type I (false detections) and type II (non detections) errors. The oracle threshold is then chosen to minimize this classification error:
1.0  2.0  3.0  1.0  2.0  3.0  1.0  2.0  3.0  
1.88  1.85  1.36  1.6  1.08  1.12  1.31  1.06  1.65  
2.4  1.66  1.14  1.59  1.04  1.61  1.19  1.33  3.00  
2.7  1.37  1.09  1.42  1.21  2.91  1.07  2.01  6.02  
FDR  FDR  FDR 
1.0  2.0  3.0  1.0  2.0  3.0  
1.31  1.15  1.11  1.24  1.13  1.10  
1.30  1.14  1.14  1.25  1.12  1.14  
1.27  1.13  1.16  1.23  1.12  1.17  
fix. RT  var. RT 
It can be seen that both RT approaches perform in general better than FDR control through the BH procedure, with a slight advantage to the varying window extension. Most importantly, the classification risks they attain is never more than and times the oracle risk for the fixed and varying window version, respectively. In contrast to these near optimal performances, whatever the chosen level of FDR control, the BH procedure always performs poorly for at least one model, with an average classification risk that rises as high as times the optimal one in the worst case.
These results suggest that the RT approach, due to its adaptive nature, is indeed more stable than error rate control techniques, that depend on the choice of a false detection level, as we had anticipated. Moreover, the excellent performance of the RT methods, which attained near optimal risks on the studied datasets, is very encouraging for this approach.
4.2 Results with unknown null distribution
To illustrate the unknown distribution case, we simulated observations among which where sampled from the distribution with and represented the significant terms, and where sampled from the distribution and represented the null terms. We used less observations than in the known distribution case because the methods used in the present case are much more computerintensive.
We implemented an EM algorithm to estimate a twoclass Gaussian mixture model (GMM) from the data, with one zeromean class to model the null data. As is often the case with iterative algorithms, providing initial values for the model parameters was the main problem we encountered. We found an efficient strategy for doing so, taking advantage of the fact that the negative data contained mostly null terms, and could provide a good initial guess for the null distribution variance and mixture weight. Details of the algorithm are given in Appendix B.
Table 2, top shows the average ratios of the classification risks obtained by the different methods with respect to the oracle risk, over simulated datasets and for different choices of the Gaussian distribution parameters for the significant terms.
1.0  2.0  3.0  1.0  2.0  3.0  1.0  2.0  3.0  
1.03  1.03  1.08  1.03  1.06  1.02  1.03  1.06  1.03  
1.06  1.03  1.04  1.32  1.13  1.05  1.30  1.12  1.05  
1.11  1.06  1.04  1.60  1.19  1.08  1.55  1.18  1.08  
GMM fit  fix. RT  var. RT 
4.01  2.03  1.89 
GMM fit  fix. RT  var. RT 
All methods performed satisfyingly, yielding close to optimal risks, with modelbased clustering performing slightly better than the RT methods. This comes as a little surprise, since in this case the former approach had several advantages: it was based on a parametric model for the significant terms that was precisely the one used to simulate the data, and it explicitly minimized the binary classification risk. In contrast, the RT approach requires no parametric assumption on the distribution of significant terms, and does not explicitly minimize a classification risk, but nevertheless gave good results.
Furthermore, performances of the modelbased method can deteriorate when it is based on the wrong assumptions. To illustrate this, we simulated observations among which where sampled from the distribution, from the distribution, and the remaining from the distribution and represented the null data. Consequently, the significant terms had a bimodal distribution. Most of these terms where next to the null mode, and a small number where next to a more distant mode.
This way, we hoped to trick the mixture model, which assumed a unimodal distribution for the significant terms, in detecting only the distant mode, while merging the other one with the null distribution. This is exactly what happened, as can be seen in Table 2, bottom: the mixturemodel fit performs significantly worse than the RT approach in this case, the later maintaining a reasonable, though also degraded, classification risk.
Of course it can be argued that such a dataset does not represent a realistic situation; our point here is simply to illustrate the increased robustness of RT due to the fact that it requires no assumptions other than a noise model. It can also be discussed that an alternative to the simplistic twoclass GMM used here would be to allow a variable number of classes, combined with a model selection framework [8, 11, 5]. However, implementing such complex strategies would be nontrivial, especially concerning the algorithm’s initialization. This last issue could be addressed for instance by using stochastic extensions of the EM, such as the stochastic averaging EM (SAEM) [3], in order to reduce dependence on initial values. In contrast to such sophisticated strategies, the simplicity of the RT approach, which requires minimal implementation and virtually no tuning, appears as a key advantage in practice, especially in view of the good performances suggested by this study.
5 Application to fMRI data analysis
We now apply the random threshold approach to functional magnetic resonance imaging (fMRI) data analysis. fMRI is a modality of in vivo brain imaging that allows to measure the variations of cerebral blood oxygen levels induced by the neural activity of a subject lying inside a MRI scanner and submitted to a series of stimuli. A sequence of threedimensional (3D) images of the brain is thus acquired, measuring over time a vascular effect of neural activity known as the blood oxygenation level dependent (BOLD) effect. From the time series recorded in each voxel, and the occurrence times for each stimulus, one may compute an estimate of the BOLD effect of the subject in response to any given stimulus, and more generally to any difference or combination of stimuli (contrast) [6, 16].
Thus, the fMRI data for one subject generally consists in a spatial map of scores where is the number of voxel in the search volume (which can be as high as ), and the estimated BOLD effect at voxel This map of measures of cerebral activity, also termed activation map, is plagued by several sources of uncertainty: the natural variability of brain activity, and the estimation noise induced by the MRI scanner. Thus, model (1) provides a good representation of the activation map with significant terms corresponding to brain regions involved in the task under study.
In a typical fMRI study however, not one but several subjects are recruited from a population of interest, and scanned while submitted to the same series of stimuli. Activation maps associated with a given contrast are obtained for each subject, as described above, and used as input data for inference at the betweensubject level, where the goal is to evidence a general brain activity pattern. Mass univariate, or voxelbased, detection [6] is to date the most widely used approach to address this question. It starts with normalizing individual images onto a common brain template using nonrigid image registration. Next, a statistic is computed in each voxel to locally assess mean group effects.
In both singlesubject and multisubject fMRI data analysis, the problem of activation detection can be formulated statistically as that of detecting nonzero means among a collection of observations. The most common approach consists in thresholding a statistical map of brain activity [6]. Multipletesting techniques are widely used [12, 13], as well as mixturemodels. The GammaGaussian mixture model (GGM) is most often used in this context [1]. It uses a Gaussian distribution for null, or inactivated, data, a Gamma distribution for activated data, and a negative Gamma distribution for deactivated data.
As these methods suffer from certain limitations, as discussed in the previous Sections, the RT appears as an appealing alternative in this context. Hence we decided to compare the regions detected by the different approaches, to see if RT succeeded in recovering regions known to be involved in certain wellstudied cognitive tasks.
5.1 Data acquisition and preprocessings
We used a real fMRI dataset from the Localizer database [14], involving a cohort of 38 righthanded subjects, and acquired as follows. The participants were presented with a series of stimuli or were engaged in tasks such as passive viewing of horizontal or vertical checkerboards, left or right click after audio or video instruction, computation (subtraction) after video or audio instruction, sentence listening and reading. Events occurred randomly in time (mean inter stimulus interval: 3s), with ten occurrences per event type, and ten event types in total.
Functional images were acquired on a General Electric Signa 1.5T scanner using an Echo Planar Imaging sequence . Each volume consisted of mmthick axial contiguous slices. A session comprised scans. Anatomical T1 weighted images were acquired on the same scanner, with a spatial resolution of mm. Finally, the cognitive performance of the subjects was checked using a battery of syntactic and computation tasks.
Singlesubject analyses were conducted using SPM5 (http://www.fil.ion.ucl.ac.uk). Data were submitted successively to motion correction, slice timing and normalization to the MNI template. For each subject, BOLD contrast images were obtained from a fixedeffect analysis on all sessions. Group analyses were restricted to the intersection of all subjects’ wholebrain masks, comprising voxels.
We considered the score maps computed for different contrasts of experimental conditions. These were first converted to score maps, to obtain approximatively Gaussian statistics in inactivated voxels. Using these maps as input data, we then compared the detection thresholds obtained by GammaGaussian mixture modeling (GGM), fixedwindow random thresholding and the varyingwindow extension, also using the unknown variance extension in both cases (see Section 3.1). For simplicity, we only present here the results obtained for a fixed window equal to
5.2 Individual subject activation map
Our first illustration concerns the activation map of a single subject, for the “sentence–checkerboard” contrast. This contrast subtracts the effect of viewing horizontal and vertical checkerboards from that of reading video instructions, thus allowing to detect brain regions specifically implicated in the reading task.
Figure 6, left, shows an axial slice from the score map before thresholding. Activations are clearly seen in Wernicke’s and Broca’s areas (right and upper right), which are known to be involved in language processing (see [15], for instance). The detection threshold found by GGM fit for the score map () is much lower than those found by the random threshold procedure, both with a varying window () and a fixed window ().
The random thresholds with fixed and variable windows yield very similar activation maps in this case, which seem to capture the activated regions seen in the raw map. In contrast, the much lower threshold found by mixture modeling detects several smaller clusters, some of which may be false positives.
5.3 Group activation map
In this second example, we consider a group activation map, specifically a map of statistics computed from the individual contrast maps of 15 subjects, thus enabling to infer regions of positive mean effects in the parent population. Our choice of limiting the number of subjects, rather than using the whole cohort, was driven by the fact that many fMRI studies are conducted on groups of less then 20 subjects.
We report results for the “calculation–sentences” contrast, which subtracts activations due to reading or hearing instructions from the overall activations detected during the mental calculation tasks. This contrast may thus reveal regions that are specifically involved in the processing of numbers.
Figure 6, left, shows an axial slice from the activation map before thresholding, with clear activations in the bilateral anterior cingulate (upper middle), bilateral parietal (lower left and right) and right precentral (upper right) regions, all known to be involved in number processing [14].
Though sorted in the same order as previously, the varying window random threshold () is now roughly at equal distances from the threshold found by GGM modeling () and the fixed window random threshold ().
The three methods detected activations in the regions described above, though the fixed window random threshold seemed to miss some activations, and the GGM approach further detected smaller clusters, some of which may be false positives.
Of course one cannot conclude from these examples only that RT is ‘better’ at detecting activations than GGM fit. However, the varying window extension successfully detected regions known to be involved in the two cognitive tasks considered here, while avoiding isolated peaks in other regions, which may be part of the background noise. These results suggest that the RT succeeded in capturing only the active regions, while the GGM approach seemed to detect spurious activations.
6 Conclusion
In this paper, we have introduced a simple modification to the random threshold (RT) procedure proposed in [9], to obtain an entirely unsupervised procedure for recovering non null mean terms from a collection of independent random observations, based solely on a parametric model of the null terms. Our modification, which requires no prior tuning, conserves the consistency properties of the original procedure.
We have implemented all the different versions of the random threshold method in a Python package. This was integrated to the Neuroimaging in Python (NIPY) opensource library, freely downloadable from http://nipy.sourceforge.net.
We validated this approach through extensive numerical experiments, and showed that both the original procedure, based on a fixed windowwidth, and our extension, which uses a variable window, compare favorably to both multiple testing procedures, and modelbased clustering, in terms of the binary classification risk, with a slight advantage to our varying window extension. On the vast majority of simulated datasets, the risks achieved where close to the lowest achievable (oracle) risk, whereas each of the other approaches behaved poorly in at least one case.
Thus RT appears as a very promising method for nonordered model selection whenever no parametric assumptions are available concerning the data distribution. Such methods are needed in many application domains, as we have illustrated in the case of activation detection for fMRI data analysis.
The good classification performances of RT evidenced empirically in our simulations suggest that a promising direction for future research would be to study its properties in the mixturemodel setting, and especially its largesample behaviour. An interesting question to answer would be whether the random threshold converges to a certain limit when the SNR remains constant, and if so, how does this limit compares to the oracle threshold.
References
 [1] C. Beckmann, M. Woolrich, and S. Smith. Gaussian / gamma mixture modelling of ica/glm spatial maps. In Ninth Int. Conf. on Functional Mapping of the Human Brain., 2003.
 [2] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, 57(1):289–300, 1995.
 [3] B. Delyon, M. Lavielle, and E. Moulines. Convergence of a stochastic approximation version of the em algortihm. The Annals of Statistics, 27(1):94–128, 1999.
 [4] A.P. Dempster, A.P. Laird, and D.B. Rubin. Maximum likelihood from incomplete data via the em algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39:1–38, 1977.
 [5] Bradley Efron, Trevor Hastie, Lain Johnstone, and Robert Tibshirani. Least angle regression. Annals of Statistics, 32:407–499, 2004.
 [6] K. J. Friston. Human Brain Function, chapter 2, pages 25–42. Academic Press, 1997.
 [7] Y. Ge, S. Dudoit, and T.P. Speed. Resamplingbased multiple testing for microarray data analysis. Technical report, Department of Statistics, University of California, Berkeley 2. Division of Biostatistics, University of California, Berkeley 3., 2003.
 [8] T. Hastie, R. Tibshirani, and J. Friedman, editors. The Elements of Statistical Learning. Springer Series in Statistics, 2001.
 [9] M. Lavielle and C. Ludeña. Random threshold for linear model selection. ESAIM: Probability and Statistics, 2007.
 [10] David J. C. MacKay. Information theory, inference, and learning algorithms. Cambridge University Press, 1st edition, June 2003.
 [11] S. Massart. Concentration Inequalities and Model Selection. Lecture Notes in Mathematics. Springer, 2003.
 [12] T.E. Nichols and S. Hayasaka. Controlling the Familywise Error Rate in Functional Neuroimaging: A Comparative Review. Statistical Methods in Medical Research, 12(5):419–446, 2003.
 [13] M. Perone Pacifico, C. Genovese, I. Verdinelli, and L. Wasserman. False discovery control for random fields. Journal of the American Statistical Association, 99(468):1002–1014, 2004.
 [14] P. Pinel, B. Thirion, S. MÃ©riaux, A. Jobert, J. Serres, D. Le Bihan, J.B. Poline, and S. Dehaene. Fast reproducible identification and largescale databasing of individual functional cognitive networks. BMC Neurosci, 8(1):91, Oct 2007.
 [15] Cathy J. Price. The anatomy of language: contributions from functional neuroimaging. Journal of Anatomy, 197(3):335–359, October 2000.
 [16] K.J. Worsley, C.H. Liao, J. Aston, V. Petre, G.H. Duncan, F. Morales, and A.C. Evans. A general statistical analysis for fMRI data. 15(1):1–15, January 2002.
Appendix A Proof of the Theorem 2 for the varying window extension
Following [9], we first recall some notations. Set for and for notice that is a sample from the distribution Let and be the sequences and in decreasing order. Let be the subset of where and
A first lemma in [9] shows that i.e., the collections and are stochastically in order with high probability. The proof can then be restricted to
Also, let Equation (4) can be shown separately for and Since the two cases are treated similarly, we will restrict ourselves here to the case On
Thus is decomposed into a random part and a deterministic part Over is a function of Before going further, we now recall the following result:
Let be an ordered sequence of independent random variables. For let Introduce for the random process Then it is shown in [9] that as a process indexed on converges in distribution to a certain zero mean Gaussian process
To use this result, let and for and for in [AF3]. Then as a process indexed by converges in distribution to the zeromean Gaussian process:
similarly, converges in distribution to another zeromean Gaussian process, and so does their sum,
On the other hand,
(6) 
so that there exists a constant which depends on in [AF3], such that for all we have Finally we use the following inequality:
From Equation (6), hence it follows that:
On the other hand,
so that we have:
where is a constant which depends on in [AF3]. This last probability vanishes as goes to infinity, due to the weak convergence of
Appendix B Details of the EM algorithm for the twoclass GMM with a zeromean class
We consider the following model:
(7) 
where and represents the proportion of data in class so that the vector of model parameters is:
Having initialized to the EM algorithm alternates the following steps:

Estep. Compute the conditional law of the indicator variable at step that is, the Bernoulli defined by:
(8) 
Mstep. Update the estimate of model parameters by maximizing the conditional expectation of the complete loglikelihood, yielding:
the expectation being taken with respect to the conditional distribution of the indicator variables computed in the previous step. This yields:
Note that, throughout the iterations,
Initialization.
An initial guess for is provided by the negative data, which consists mainly of null data:
Then, we use a kernel estimate of the data density:
(10) 
for a symmetric, positive kernel and a bandwidth In pratice, we used the Gaussian kernel and
By identifying the null mode of the data density kernel estimate to the null component of the mixture model, we then obtained an initial guess for the mixture weights:
Finally, the conditional law of the indicator variables where approached by:
These initial guesses are used to derive initial model parameter values via the Mstep described above, for