# Statistical inference for nested data using sufficient summary statistics: a tutorial

###### Abstract

Data with hierarchical structure arise in many fields. Estimating global effect sizes from nested data, and testing effects against global null hypotheses, is, however, more challenging than in the traditional setting of independent and identically distributed data. In this paper, we review statistical approaches to deal with nested data following either a fixed-effect or a random-effects model. We focus on methods that are easy to implement, such as group-level t-tests and Stouffer’s method. The properties of these approaches are discussed within the context of neuroimaging applications, quantitatively assessed on simulated data, and demonstrated on real human neurophysiological data from a simulated-driving experiment. With what we call the inverse-variance-weighted sufficient-summary-statistic approach, we highlight a particularly compelling technique that combines desirable statistical properties with computational simplicity, and we provide step-by-step instructions to apply it to a number of popular measures of effect size.

###### keywords:

hierarchical inference, non-i.i.d. data, effect size, fixed effect, random effects, hypothesis testing, statistical power, summary statistic, inverse-variance-weighting, t-test, correlation, area under the curve, Stouffer’s method, neuroimaging, electroencephalography

## 1 Introduction

In many applications of statistics, data possess a nested (hierarchical) structure. Such data arise naturally due to the grouping of subjects (e.g., patients in different medical centers, students taught by different teachers, data acquired in different scientific studies) or due to repeated measurements on the same subject. In those situations, the assumption of identically distributed observations is not appropriate. In multi-center clinical trials, for example, subtle differences may exist in each of the centers’ processes as well as in the genetic or cultural traits of the studied populations.

In psychological or neuroimaging studies, multiple data points are typically acquired for the same subject throughout the course of an experiment. While samples acquired from the same subject may be expected to be identically distributed, differences must be assumed across subjects. In specific applications such as functional magnetic resonance imaging (fMRI), different distributions are typically even assumed for individual experimental sessions conducted with the same subject, giving rise to a three-level nesting hierarchy. Here we consider the setting of standard neuroimaging studies, and distinguish between a subject (lower) and a group (higher) level.

From a statistical point of view, the question arises how to obtain precise (e.g., best unbiased) estimators for group-level effect sizes as well as powerful statistical tests for nested data. A flexible statistical framework for such data consists of combining the data of all subjects in one single linear model, referred to as a nested linear model, hierarchical linear model, multi-level model or linear mixed model. These models are extensively covered in, e.g., Quené and Van den Bergh (2004); Woltman et al. (2012); Hox et al. (2010). Estimating hierarchical linear models typically requires advanced statistical techniques that can be hard to implement and difficult to optimize. In some applications such as fMRI, for example, analyzing the combined data of all subjects within a single model poses a serious computational challenge. In other applications such as meta-analyses, this may not even be possible, as only summary statistics but no individual samples are available from each study.

We here consider the problem of estimating group-level effect sizes and assessing their statistical significance. We point out that this problem can be formulated in a compellingly simple framework, where group-level inference is conducted using the results of separate subject-level analyses only. The resulting statistical methods are simple to implement, computationally efficient, and the underlying model can be made equivalent to a nested linear model in many cases Beckmann et al. (2003). Such approaches are standard in meta-analysis Borenstein et al. (2009); Card (2011), and known as ‘summary statistic’ approaches in the fMRI literature Holmes and Friston (1998); Beckmann et al. (2003); Monti (2011).

Within this framework, this paper reviews ways to estimate group-level effect sizes and to assess their statistical significance. We first provide a reference for a number of popular parametric and non-parametric effect size measures (Section 2.2). We then discuss the need to choose an appropriate group-level model, as between-subject variances typically have to be taken into account through a ‘random effects’ model as opposed to a ‘fixed effect’ model (Section 2.3). We also demonstrate why the simple approach of ignoring the group structure by pooling the data of all subjects is invalid (Section 2.4). We then outline the popular approach of computing effect sizes on the subject level and treating these effect sizes as samples in a group-level test. This ‘naive-summary-statistic’ approach is, however, suboptimal in terms of statistical power (Section 2.5). With what we call the inverse-variance-weighted sufficient-summary-statistic approach, we discuss a technique that overcomes these limitations, leading to unbiased minimum-variance group-level effect size estimates and a powerful statistical test of group-level null hypotheses. In a tutorial style, we outline the steps that are required to employ this approach (Section 2.6). Lastly, we discuss the advantages and drawbacks of Stouffer’s method of combining subject-level p-values (Section 2.7).

Using synthetic data representing a two-sample separation problem, we empirically assess the sensitivity and specificity of the reviewed approaches (Section 3). The properties of the various approaches are further highlighted in an application to human neurophysiological data acquired during simulated emergency braking in a driving simulator (Section 4). The paper ends with a discussion of nested linear models, of multivariate extensions, of non-parametric (bootstrapping and surrogate data) approaches, and a note on prominent applications of nested statistical models in neuroimaging (Section 5).

## 2 Theory

### 2.1 Statistical terminology

An effect size is any quantitative measure that reflects the magnitude of some phenomenon of interest. In the following, we denote a population effect size by $\theta$ and an estimator for it by $\hat{\theta}$. An estimator $\hat{\theta}$ for $\theta$ is unbiased if its expected value is $\theta$.

A statistical test is a procedure to decide, based on an observed sample from a population, which of two complementary hypotheses about a population parameter is true. In this paper, we want to make inference about the presence or absence of an effect. The null hypothesis is that no effect is present. The zero effect is denoted by $\theta_0$. The null hypothesis is denoted by $\mathcal{H}_0: \theta = \theta_0$. The alternative hypothesis that an effect is present is denoted by $\mathcal{H}_1$. A one-tailed alternative hypothesis assumes that either $\theta > \theta_0$ or $\theta < \theta_0$, while a two-tailed alternative hypothesis assumes that $\theta \neq \theta_0$.

The p-value denotes the probability of observing an effect at least as strong as the observed effect under the assumption of the null hypothesis. To calculate the p-value, a test statistic needs to be derived from the observed effect size, whose distribution under the null hypothesis is known or can be reasonably well approximated. Denoting the test statistic by $T$, its cumulative distribution function under $\mathcal{H}_0$ by $F_0$, and its observed value in a given sample by $t$, the p-values for a one-tailed alternative hypothesis are given by

$$p = P(T \geq t \mid \mathcal{H}_0) = 1 - F_0(t) \quad \text{for } \mathcal{H}_1: \theta > \theta_0 \tag{1}$$

$$p = P(T \leq t \mid \mathcal{H}_0) = F_0(t) \quad \text{for } \mathcal{H}_1: \theta < \theta_0 \tag{2}$$

where $P$ denotes probability. The p-value for a two-tailed alternative hypothesis is given by

$$p = 2 \min\{F_0(t),\, 1 - F_0(t)\} \tag{3}$$

The null hypothesis is rejected (and the alternative hypothesis accepted) if the p-value falls below an alpha-level $\alpha$. The most commonly used alpha-levels are $\alpha = 0.05$ and $\alpha = 0.01$. If the null hypothesis is rejected, we speak of a statistically significant effect. The value of the test statistic that is required for a significant effect is called the critical value.

The power or sensitivity of a statistical test is the fraction of actually present effects that are statistically significant according to the test. The specificity of a test is the fraction of non-existent effects for which $\mathcal{H}_0$ is accepted. Conversely, the false positive rate is the fraction of non-existent effects that are statistically significant.

### 2.2 Common effect size measures

Before introducing the nested data setting, we review a number of popular effect size measures. For each measure, we also present an analytic expression of its variance, which is a prerequisite for assessing its statistical significance. We will later need the variance for performing statistical inference in the nested setting, too.

#### 2.2.1 Mean of a sample

A common measure of effect size is the mean of a sample. Consider a neuroimaging experiment, in which the participant is repeatedly exposed to the same stimulus. A common question to ask is whether this stimulus evokes a brain response that is significantly different from a baseline value. Assume that we observe independent samples $x_1, \dots, x_N$. The sample mean is denoted by $\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$, and the unbiased sample variance is given by $s^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_n - \bar{x})^2$. The variance of $\bar{x}$ is given by

$$\widehat{\mathrm{Var}}(\bar{x}) = \frac{s^2}{N} \tag{4}$$

Assuming independent and identically distributed (i.i.d.) samples, which are either normally (Gaussian) distributed or sufficiently numerous, the null hypothesis $\mathcal{H}_0: \mu = 0$ for the population mean $\mu$ can be tested using that the statistic

$$t = \frac{\bar{x}}{\sqrt{s^2 / N}} \tag{5}$$

is approximately Student-t-distributed with $N - 1$ degrees of freedom. This is the one-sample t-test.

A similar effect size is the mean difference of two paired samples $(x_n, y_n)$, $n = 1, \dots, N$. Here, the $y_n$ could, for example, represent baseline activity that is measured in each repetition right before the presentation of the experimental stimulus. A natural null hypothesis is that the mean difference is zero, i.e., $\mathcal{H}_0: \mu_{x-y} = 0$. This hypothesis can be tested with a paired t-test, which replaces $x_n$ by $x_n - y_n$ in Eqs. (4) and (5).

Note that, if $x_n$ or $x_n - y_n$ cannot be assumed to be normally distributed, a more appropriate test is the non-parametric Wilcoxon signed-rank test Wilcoxon (1945).
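To make the computation concrete, the one-sample t-test of Eqs. (4) and (5) can be computed by hand and cross-checked against a library routine. This is a minimal sketch on simulated data; the sample size, seed, and true mean are illustrative choices of ours, not values from the text.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.normal(loc=0.5, scale=1.0, size=30)   # N = 30 samples with true mean 0.5

# Manual computation: t = x_bar / sqrt(s^2 / N), with N - 1 degrees of freedom
N = len(x)
t_manual = x.mean() / np.sqrt(x.var(ddof=1) / N)
p_manual = 2 * stats.t.sf(abs(t_manual), df=N - 1)   # two-tailed p-value

# Cross-check against scipy's implementation of the same test
t_scipy, p_scipy = stats.ttest_1samp(x, popmean=0.0)
```

The paired t-test follows by applying the same code to the differences `x - y`.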

#### 2.2.2 Difference between class-conditional means

A slightly different treatment is required for the difference between the means of two unpaired samples. Consider an experiment with two conditions $A$ and $B$. In neuroimaging studies, these could differ in the type of stimulus presented. We observe $N_A$ samples $x^{(A)}_1, \dots, x^{(A)}_{N_A}$ of brain activity within condition $A$, and $N_B$ samples $x^{(B)}_1, \dots, x^{(B)}_{N_B}$ within condition $B$. The sample means are denoted by $\bar{x}^{(A)}$ and $\bar{x}^{(B)}$, and their difference is given by

$$d = \bar{x}^{(A)} - \bar{x}^{(B)} \tag{6}$$

The variance of $d$ is estimated as

$$\widehat{\mathrm{Var}}(d) = \frac{s_A^2}{N_A} + \frac{s_B^2}{N_B} \tag{7}$$

where $s_A^2$ and $s_B^2$ are the unbiased sample variances of $x^{(A)}$ and $x^{(B)}$. The null hypothesis of equal means is given by $\mathcal{H}_0: \mu_A = \mu_B$. Under the assumption of either normally distributed $x^{(A)}$ and $x^{(B)}$, or sufficiently large samples, the null hypothesis can be tested with Welch’s two-sample t-test. It computes the test statistic

$$t = \frac{d}{\sqrt{s_A^2 / N_A + s_B^2 / N_B}} \tag{8}$$

which is approximately Student-t-distributed. The degrees of freedom can be approximated using the Welch-Satterthwaite equation Welch (1947). Note that assuming equal variances of $x^{(A)}$ and $x^{(B)}$ leads to the better-known Student’s t-test, which is, however, less recommendable than Welch’s t-test Ruxton (2006).
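Eqs. (6)–(8), together with the Welch-Satterthwaite degrees of freedom, can be sketched as follows. The data and seed are illustrative assumptions of ours; the result is cross-checked against scipy's Welch test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
xa = rng.normal(1.0, 1.0, size=40)   # condition A: N_A = 40
xb = rng.normal(0.0, 2.0, size=25)   # condition B: N_B = 25, unequal variance

va, vb = xa.var(ddof=1) / len(xa), xb.var(ddof=1) / len(xb)
d = xa.mean() - xb.mean()            # mean difference, Eq. (6)
var_d = va + vb                      # estimated variance of d, Eq. (7)
t = d / np.sqrt(var_d)               # Welch's t statistic, Eq. (8)

# Welch-Satterthwaite approximation of the degrees of freedom
df = var_d**2 / (va**2 / (len(xa) - 1) + vb**2 / (len(xb) - 1))
p = 2 * stats.t.sf(abs(t), df=df)

# Cross-check against scipy (equal_var=False selects Welch's test)
t_ref, p_ref = stats.ttest_ind(xa, xb, equal_var=False)
```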

#### 2.2.3 Area under the ROC curve

In many cases, one may be interested in quantifying the predictive accuracy of a binary classifier to separate experimental condition $A$ from condition $B$. The receiver operating characteristic (ROC) is a plot that visualizes the performance of such a binary classification system. It is obtained by plotting the true positive rate (TPR) against the false positive rate (FPR) when varying the threshold that divides the predicted conditions into $A$ and $B$. Assume without loss of generality that condition $A$ is associated with a positive label, indicating that the detection of instances of that condition is of particular interest, while $B$ is associated with a negative label. The TPR is defined as the fraction of correctly classified positive samples among all positive samples, while the FPR denotes the fraction of negative samples that are incorrectly classified as positive.

A common way to reduce the ROC curve to a single quantity is to calculate the area beneath it Fawcett (2006). The resulting statistic is called the area under the curve (AUC), and is equivalent to the probability that a classifier will correctly rank a randomly chosen pair of samples $(x, y)$, where $x$ is a sample from $A$ and $y$ is a sample from $B$ Hanley and McNeil (1982). The AUC is also equivalent (see Hanley and McNeil, 1982; Mason and Graham, 2002) to the popular Mann-Whitney U Mann and Whitney (1947) and Wilcoxon rank-sum Wilcoxon (1945) statistics, which provide a non-parametric test for differences in the central tendencies of two unpaired samples. It is therefore an appropriate alternative to the two-sample t-test discussed in Section 2.2.2, if the data follow non-Gaussian distributions.

Assuming, without loss of generality, that higher values are indicative of class $A$, the AUC is given as

$$\mathrm{AUC} = P(x > y) \tag{9}$$

Perfect class separability is denoted by $\mathrm{AUC} = 1$ and, with inverted labels, $\mathrm{AUC} = 0$, while chance-level class separability is attained at $\mathrm{AUC} = 1/2$. Thus, a common null hypothesis is $\mathcal{H}_0: \mathrm{AUC} = 1/2$.

Assume we have $N_A$ samples from condition $A$ and $N_B$ samples from condition $B$. To compute the test statistics, all observations from both conditions are pooled and ranked, beginning with rank one for the smallest value. Denoting by $R_i$ the rank of $x_i$ (the $i$-th sample from condition $A$), the Wilcoxon rank-sum statistic for class $A$ is defined as

$$W = \sum_{i=1}^{N_A} R_i \tag{10}$$

while the Mann-Whitney U statistic is given by

$$U = W - \frac{N_A (N_A + 1)}{2} \tag{11}$$

Finally, the AUC statistic is given by

$$\mathrm{AUC} = \frac{U}{N_A N_B} \tag{12}$$

The exact distributions of $W$, $U$ and $\mathrm{AUC}$ under the null hypothesis can be derived from combinatorial considerations Mann and Whitney (1947); Mason and Graham (2002), and critical values for rejecting the null hypothesis can be calculated using recursion formulae Shorack et al. (1966). However, these distributions are approximately normal for samples of moderate size. The mean and variance of Mann-Whitney’s $U$ are given by

$$E_0[U] = \frac{N_A N_B}{2}, \qquad \mathrm{Var}_0(U) = \frac{N_A N_B (N_A + N_B + 1)}{12} \tag{13}$$

where $E_0$ and $\mathrm{Var}_0$ denote expected value and variance under the null hypothesis Mason and Graham (2002). From Eq. (12), the mean and variance of the AUC statistic follow as

$$E_0[\mathrm{AUC}] = \frac{1}{2}, \qquad \mathrm{Var}_0(\mathrm{AUC}) = \frac{N_A + N_B + 1}{12\, N_A N_B} \tag{14}$$

Note that this null distribution does not depend on the distribution of the data, and is only based on the assumptions of i.i.d. samples, equal variances of both classes, and that observations are ordinal (that is, it is possible to rank any two observations).

If the null hypothesis is violated (e.g., $\mathrm{AUC} > 1/2$), the variances of $W$, $U$, and $\mathrm{AUC}$ become data-dependent. The variance for general $\mathrm{AUC}$ can be approximated as Hanley and McNeil (1982); Greiner et al. (2000)

$$\widehat{\mathrm{Var}}(\mathrm{AUC}) = \frac{\mathrm{AUC}(1 - \mathrm{AUC}) + (N_A - 1)(Q_1 - \mathrm{AUC}^2) + (N_B - 1)(Q_2 - \mathrm{AUC}^2)}{N_A N_B} \tag{15}$$

where $Q_1 = \mathrm{AUC} / (2 - \mathrm{AUC})$ and $Q_2 = 2\,\mathrm{AUC}^2 / (1 + \mathrm{AUC})$. The variances of $W$ and $U$ follow accordingly. A statistical test for the null hypothesis $\mathcal{H}_0: \mathrm{AUC} = 1/2$ can be devised using that

$$z = \frac{\mathrm{AUC} - 1/2}{\sqrt{\widehat{\mathrm{Var}}(\mathrm{AUC})}} \tag{16}$$

is approximately standard normally distributed for large sample sizes (and analogously for $W$ and $U$).
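The chain from pooled ranks to the AUC z-test (Eqs. (10)–(16)) can be sketched as follows. The simulated data and seed are our illustrative choices; the Hanley-McNeil variance is the approximation of Eq. (15).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
xa = rng.normal(1.0, 1.0, size=60)   # condition A (higher values expected)
xb = rng.normal(0.0, 1.0, size=50)   # condition B

na, nb = len(xa), len(xb)
ranks = stats.rankdata(np.concatenate([xa, xb]))  # pooled ranks, smallest = 1
W = ranks[:na].sum()                 # Wilcoxon rank-sum statistic, Eq. (10)
U = W - na * (na + 1) / 2            # Mann-Whitney U statistic, Eq. (11)
auc = U / (na * nb)                  # AUC statistic, Eq. (12)

# Hanley-McNeil variance approximation for general AUC, Eq. (15)
q1 = auc / (2 - auc)
q2 = 2 * auc**2 / (1 + auc)
var_auc = (auc * (1 - auc) + (na - 1) * (q1 - auc**2)
           + (nb - 1) * (q2 - auc**2)) / (na * nb)

z = (auc - 0.5) / np.sqrt(var_auc)   # z statistic, Eq. (16)
p = 2 * stats.norm.sf(abs(z))        # two-tailed p-value
```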

#### 2.2.4 Pearson correlation coefficient

The Pearson product-moment correlation coefficient is used when one is interested in the linear dependence of a pair of random variables $(x, y)$. Suppose that for each subject, we have $N$ i.i.d. pairs of observations $(x_n, y_n)$ with sample means $\bar{x}$ and $\bar{y}$. In a neuroimaging context, these pairs could reflect neural activity in two different anatomical structures, or concurrently-acquired neural activity and behavioral (e.g., response time relative to a stimulus) data. The sample Pearson product-moment correlation coefficient is given by

$$r = \frac{\sum_{n=1}^{N} (x_n - \bar{x})(y_n - \bar{y})}{\sqrt{\sum_{n=1}^{N} (x_n - \bar{x})^2} \sqrt{\sum_{n=1}^{N} (y_n - \bar{y})^2}} \tag{17}$$

where $r = 1$ denotes perfect correlation, and $r = -1$ denotes perfect anti-correlation. The null hypothesis of no correlation is given by $\mathcal{H}_0: \rho = 0$, where $\rho$ denotes the population correlation. Assessing the statistical significance of Pearson correlations can be done using the Fisher z-transformation Fisher (1915), defined as

$$z_r = \mathrm{artanh}(r) = \frac{1}{2} \ln \frac{1 + r}{1 - r} \tag{18}$$

If $(x, y)$ has a bivariate normal distribution, then $z_r$ is approximately normally distributed with mean $\mathrm{artanh}(\rho)$ and variance

$$\mathrm{Var}(z_r) = \frac{1}{N - 3} \tag{19}$$

Therefore, under $\mathcal{H}_0: \rho = 0$, the test statistic

$$z = z_r \sqrt{N - 3} \tag{20}$$

is approximately standard normally distributed.

The Fisher transformation is also used when averaging correlations, where the standard approach is to Fisher-transform each individual correlation before computing the average. The reason behind this step is that the distribution of the sample correlation is skewed, whereas the Fisher-transformed sample correlation is approximately normally distributed and thus symmetric (cf. Silver and Dunlap (1987)). Results can be transformed back into a valid Pearson correlation using the inverse transformation

$$r = \tanh(z_r) = \frac{e^{2 z_r} - 1}{e^{2 z_r} + 1} \tag{21}$$

The same back-transformation can be applied to map confidence intervals derived for $z_r$ into the Pearson correlation domain.
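The Fisher z-test of Eqs. (17)–(21) reduces to a few lines; `np.arctanh` and `np.tanh` implement the transformation and its inverse. The simulated pair of variables and the seed are illustrative assumptions of ours.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n = 100
x = rng.normal(size=n)
y = x + 0.5 * rng.normal(size=n)      # strongly correlated pair

r = np.corrcoef(x, y)[0, 1]           # sample Pearson correlation, Eq. (17)
zr = np.arctanh(r)                    # Fisher z-transformation, Eq. (18)
z = zr * np.sqrt(n - 3)               # test statistic, Eqs. (19)-(20)
p = 2 * stats.norm.sf(abs(z))         # two-tailed p-value under rho = 0
r_back = np.tanh(zr)                  # inverse transformation, Eq. (21)
```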

Pearson correlation can also be used to derive the coefficient of determination $R^2$, which indicates the proportion of the variance in the dependent variable that is predictable from the independent variable in a linear regression. If an intercept term is included in the regression, the coefficient of determination is given as the square of the Pearson product-moment correlation between the two variables. Another strongly related quantity is the point-biserial correlation coefficient, which is used when one variable is dichotomous, i.e., indicates membership in one of two experimental conditions. Pearson correlation is mathematically equivalent to point-biserial correlation if one assigns two distinct numerical values to the dichotomous variable.

#### 2.2.5 Linear regression coefficients

A multiple linear regression model has the form

$$y_n = \beta_0 + \sum_{m=1}^{M} x_{n,m} \beta_m + \epsilon_n \tag{22}$$

where $y_n$ is the $n$-th sample of the dependent variable, $x_{n,1}, \dots, x_{n,M}$ are independent variables (or, factors), $\beta_1, \dots, \beta_M$ are the corresponding regression coefficients, $\beta_0$ is an intercept parameter, and $\epsilon_n$ is zero-mean, uncorrelated noise. In a neuroimaging context, the samples $y_n$ could represent a neural feature such as the activity of a particular brain location measured at various times $n$, while the $x_{n,m}$ could represent multiple factors thought to collectively explain the variability of $y_n$, such as the type of experimental stimulus or behavioral variables. In some fields, such a model is called a neural encoding model. It is also conceivable to have the reverse situation, in which the $x_{n,m}$ represent multiple neural features, while the dependent variable $y_n$ is of non-neural origin. This situation would be called neural decoding.

The independent variables could be either categorical (i.e., multiple binary variables coding for different experimental factors) or continuous. The specific case in which all independent variables are categorical is called analysis of variance (ANOVA). Linear models therefore encompass a relatively broad class of effect size measures, including differences between class-conditional means and linear correlations Poline and Brett (2012).

The most common way to estimate the regression coefficients is ordinary least-squares (OLS) regression. The resulting estimate is also the maximum-likelihood estimate under the assumption of Gaussian-distributed noise. Using the vector/matrix notations $y = (y_1, \dots, y_N)^\top$, $\beta = (\beta_0, \beta_1, \dots, \beta_M)^\top$, $\epsilon = (\epsilon_1, \dots, \epsilon_N)^\top$, and $X \in \mathbb{R}^{N \times (M+1)}$ whose $n$-th row is $(1, x_{n,1}, \dots, x_{n,M})$, Eq. (22) can be rewritten as $y = X \beta + \epsilon$. The OLS estimate is then given by

$$\hat{\beta} = (X^\top X)^{-1} X^\top y \tag{23}$$

The estimated coefficients $\hat{\beta}_m$ can be treated as effect sizes measuring how much of the measured data is explained by the individual factors $x_{\cdot,m}$. The null hypothesis of factor $m$ having no explanatory power is $\mathcal{H}_0: \beta_m = 0$. The estimated variance of $\hat{\beta}_m$ is

$$\widehat{\mathrm{Var}}(\hat{\beta}_m) = \hat{\sigma}^2 \left[ (X^\top X)^{-1} \right]_{mm} \tag{24}$$

where $\hat{\sigma}^2 = \frac{1}{N - M - 1} \sum_{n=1}^{N} \left( y_n - (X \hat{\beta})_n \right)^2$ is an unbiased estimator of the noise variance. A statistical test for the null hypothesis can be devised using that

$$t = \frac{\hat{\beta}_m}{\sqrt{\widehat{\mathrm{Var}}(\hat{\beta}_m)}} \tag{25}$$

is t-distributed with $N - M - 1$ degrees of freedom. A similar procedure can be devised for regularized variants such as ridge regression Hoerl and Kennard (1970).
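Eqs. (23)–(25) translate directly into matrix operations. The design, coefficients, and seed below are our illustrative assumptions; the OLS solution is cross-checked against `np.linalg.lstsq`.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n, m = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])  # intercept + M = 3 factors
beta_true = np.array([0.5, 1.0, 0.0, -0.5])
y = X @ beta_true + rng.normal(scale=1.0, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)        # OLS estimate, Eq. (23)
resid = y - X @ beta_hat
sigma2 = resid @ resid / (n - X.shape[1])           # unbiased noise variance estimate
var_beta = sigma2 * np.linalg.inv(X.T @ X).diagonal()  # coefficient variances, Eq. (24)

t = beta_hat / np.sqrt(var_beta)                    # t statistics, Eq. (25)
p = 2 * stats.t.sf(np.abs(t), df=n - X.shape[1])    # two-tailed p-values, N - M - 1 dof
```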

### 2.3 Nested statistical inference

In the following, our goal is to combine the data of several subjects to estimate a population effect and to assess its statistical significance. We denote the number of subjects by $S$. The observed effect sizes of the individual subjects are denoted by $\hat{\theta}_i$ ($i = 1, \dots, S$). Other quantities related to subject $i$ are also indexed by the subscript $i$, while the same quantities without subject index denote corresponding group-level statistics.

Two different models may be formulated for the overall population effect.

##### Fixed-effect (FE) model

In the fixed-effect (FE) model, we assume that there is one (fixed) effect size $\theta$ that underlies each subject, that is

$$\theta_i = \theta \quad \text{for all } i \tag{26}$$

The observed effect can therefore be modeled as

$$\hat{\theta}_i = \theta + \epsilon_i \tag{27}$$

where $\epsilon_i$ denotes the deviation of the subject’s observed effect from the true effect $\theta$. We assume that the noise terms $\epsilon_i$ are independent, zero-mean random variables with subject-specific variance $\sigma_i^2$.

The null hypothesis tested by a fixed-effect model is that no effect is present in any of the subjects. Thus, $\mathcal{H}_0: \theta_i = \theta_0$ for all $i$, where $\theta_0$ denotes the zero effect.

##### Random-effects (RE) model

In the random-effects (RE) model, the true effect sizes $\theta_i$ are allowed to vary over subjects. They are assumed to follow a common distribution of effects with mean $\theta$. The observed effect is modeled as

$$\hat{\theta}_i = \theta_i + \epsilon_i \tag{28}$$

$$\theta_i = \theta + \eta_i \tag{29}$$

where $\epsilon_i$ denotes the deviation of the subject’s observed effect from the true subject-specific effect $\theta_i$, and where $\eta_i$ denotes the deviation of the true subject-specific effect from the population effect $\theta$. $\epsilon_i$ and $\eta_i$ are assumed to be zero-mean, independent quantities. The subject-specific variance of $\epsilon_i$ is $\sigma_i^2$, while the variance of $\eta_i$ is the between-subject variance $\sigma_{\mathrm{b}}^2$. For $\sigma_{\mathrm{b}}^2 = 0$, we recover the fixed-effect model.

The null hypothesis being tested is that the population effect is zero ($\mathcal{H}_0: \theta = \theta_0$), while each individual subject-specific effect $\theta_i$ may still be non-zero.

Besides testing different null hypotheses, fixed-effect and random-effects models assume different variances of the observed effect sizes. In the fixed-effect model, all observed variability is assumed to be within-subject variability

$$\mathrm{Var}(\hat{\theta}_i) = \sigma_i^2 \tag{30}$$

The random-effects model additionally accounts for variability between subjects

$$\mathrm{Var}(\hat{\theta}_i) = \sigma_i^2 + \sigma_{\mathrm{b}}^2 \tag{31}$$

If the data follow a random-effects model, neglecting $\sigma_{\mathrm{b}}^2$ in a fixed-effect analysis leads to an underestimation of the variance. This has negative consequences if we attempt to make inference on the mean population effect $\theta$ relying only on a fixed-effect analysis: we may arrive at spurious results, as the underestimated variance leads to a decreased specificity of the statistical test Hunter and Schmidt (2000); Field (2003); Schmidt et al. (2009). On the other hand, there is little disadvantage to using a random-effects analysis, even when the data follow a fixed-effect model. As the assumption of a fixed population effect is unrealistic in most practical cases, it is often recommended to carry out a random-effects analysis per default Field (2003); Penny and Holmes (2007); Card (2011); Monti (2011).

### 2.4 Data pooling

The most naive approach to conduct group-level inference would be to pool the samples of all subjects, and thus to disregard the nested structure of the data. In electroencephalography (EEG) studies, this approach is sometimes pursued when computing ‘grand-average’ (group-level) waveforms of event-related potentials (ERP) that are elicited by the brain in response to a stimulus.

Pooling the samples of all subjects may violate the assumption of identically distributed data underlying many statistical tests. Depending on the type of analysis, this may result in an over- or underestimation of the effect size, an over- or underestimation of the effect variance, and ultimately in low sensitivity or low specificity of the resulting statistical test.

The following two examples illustrate the problem. In both cases, two variables, $x$ and $y$, are modeled for a group of subjects. For each subject and variable, samples were independently drawn from Gaussian distributions whose means contain a subject-specific offset; the offsets themselves were independently drawn from another Gaussian with substantial variance. In a practical example (e.g., neuroimaging), these offsets may indicate individual activation baselines, which are usually not of interest. Given the generated sample, a difference in the means of $x$ and $y$ is correctly identified for each subject by Welch’s two-sample t-test. Because of the substantial between-subject variance, this difference is, however, not significant in the pooled data of all subjects. See Figure 1 (A) for a graphical depiction.

A Pearson correlation analysis of the same data correctly finds no significant linear dependence between $x$ and $y$ within each subject. However, the presence of subject-specific offsets causes a strong correlation of $x$ and $y$ across the pooled data of all subjects (see Figure 1 (B) for a depiction). In many practical cases, this correlation will not be of interest and must be considered spurious.
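The spurious pooled correlation is easy to reproduce. The sketch below uses our own illustrative parameters (10 subjects, 50 samples each, offset standard deviation 5), not the exact values of the simulation described in the text: within each subject, $x$ and $y$ are independent, yet the shared subject-specific offsets induce a strong correlation in the pooled data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
n_subjects, n_samples = 10, 50

subject_rs, x_all, y_all = [], [], []
for _ in range(n_subjects):
    offset = rng.normal(scale=5.0)            # subject-specific baseline, shared by x and y
    x = offset + rng.normal(size=n_samples)   # x and y are independent within a subject
    y = offset + rng.normal(size=n_samples)
    subject_rs.append(stats.pearsonr(x, y)[0])
    x_all.append(x)
    y_all.append(y)

# Pooling ignores the nesting: the shared offsets induce a spurious correlation
r_pooled, p_pooled = stats.pearsonr(np.concatenate(x_all), np.concatenate(y_all))
```

Subject-level correlations scatter around zero, while the pooled correlation is large and highly significant.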

These examples motivate the use of hierarchical approaches for testing data with nested structure, which we introduce below.

### 2.5 Naive summary-statistic approach

The simplest variant of the summary-statistic approach ignores the subject-specific variances $\sigma_i^2$, treating subject-level effect sizes as group-level observations. In this approach, which is somewhat popular in the neuroimaging literature Holmes and Friston (1998); Penny and Holmes (2007), the null hypothesis $\mathcal{H}_0: \theta = \theta_0$ is tested based on the subject-level effect sizes $\hat{\theta}_i$, which are considered i.i.d. Denoting their mean by $\bar{\theta} = \frac{1}{S} \sum_{i=1}^{S} \hat{\theta}_i$, the variance of the mean effect is estimated as

$$\widehat{\mathrm{Var}}(\bar{\theta}) = \frac{1}{S (S - 1)} \sum_{i=1}^{S} \left( \hat{\theta}_i - \bar{\theta} \right)^2 \tag{32}$$

which is an unbiased estimate of $\mathrm{Var}(\bar{\theta})$ even if variances vary across subjects Mumford and Nichols (2009). If the $\hat{\theta}_i$ are normally distributed (for example, because they represent the means of normally distributed or sufficiently many quantities), the test statistic

$$t = \frac{\bar{\theta} - \theta_0}{\sqrt{\widehat{\mathrm{Var}}(\bar{\theta})}} \tag{33}$$

is t-distributed with $S - 1$ degrees of freedom. This is the standard one-sample t-test applied to the individual effect sizes $\hat{\theta}_i$.

The naive summary-statistic approach is valid both under the fixed-effect and the random-effects model Mumford and Nichols (2009). Its statistical power is, however, limited due to two factors. First, it assigns equal importance to each subject. This is sub-optimal if subject-level variances vary across subjects (for example, because of different amounts of recorded data). In this case, a weighting scheme taking into account subject-level variances is optimal (see Section 2.6.2). Second, the approach does not make use of all the available data, as only the group-level data are used to estimate the variance through Eq. (32). However, even if subject-level variances are constant across subjects, it is beneficial to make use of their estimates (see Section 2.6.1).

Both issues are addressed by the sufficient-summary-statistic approach described in the next section. An empirical comparison of the statistical power of both approaches on simulated data is provided in Section 3.
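The naive approach of Eqs. (32) and (33) amounts to a one-sample t-test on the subject-level effects. A minimal sketch with hypothetical subject-level effect sizes (the number of subjects and the effect distribution are our illustrative assumptions):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
n_subjects = 12
# Hypothetical subject-level effect sizes (e.g., mean differences), one per subject
theta_hat = rng.normal(loc=0.4, scale=0.3, size=n_subjects)

# Naive summary-statistic approach: treat the effects as i.i.d. group-level samples
theta_bar = theta_hat.mean()
var_bar = theta_hat.var(ddof=1) / n_subjects           # Eq. (32)
t = theta_bar / np.sqrt(var_bar)                       # Eq. (33), theta_0 = 0
p = 2 * stats.t.sf(abs(t), df=n_subjects - 1)

# Equivalent to the standard one-sample t-test on the effect sizes
t_ref, p_ref = stats.ttest_1samp(theta_hat, 0.0)
```

Note that the subject-level variance estimates are never used, which is exactly the limitation discussed above.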

### 2.6 Sufficient-summary-statistic approach

If estimates of the variances of the subject-level effect sizes can be obtained, this gives rise to a more powerful summary-statistic approach compared to the naive approach outlined above. Assuming that the subject-level effect size estimates $\hat{\theta}_i$ are unbiased, any convex combination

$$\hat{\theta} = \frac{\sum_{i=1}^{S} w_i \hat{\theta}_i}{\sum_{i=1}^{S} w_i} \tag{34}$$

of the $\hat{\theta}_i$ with non-negative weights $w_i$ is also unbiased (has expectation $\theta$), as the denominator of Eq. (34) ensures that the weights sum to one. Importantly, with the exception of the coefficient of determination discussed in Section 2.2.4, all effect size measures discussed in this paper are unbiased estimators of the corresponding population effects. The variance of $\hat{\theta}$ defined in Eq. (34) is given by

$$\mathrm{Var}(\hat{\theta}) = \frac{\sum_{i=1}^{S} w_i^2\, \mathrm{Var}(\hat{\theta}_i)}{\left( \sum_{i=1}^{S} w_i \right)^2} \tag{35}$$

If the $\hat{\theta}_i$ are normally distributed (for example, because they represent the means of normally distributed or sufficiently many quantities), the weighted mean $\hat{\theta}$ is also normally distributed. According to the central limit theorem, this is also approximately the case if the $\hat{\theta}_i$ are not normally distributed but the number of subjects is large. In both cases, we can test the null hypothesis using that the test statistic

$$z = \frac{\hat{\theta} - \theta_0}{\sqrt{\mathrm{Var}(\hat{\theta})}} \tag{36}$$

is standard normally distributed.

The variances $\mathrm{Var}(\hat{\theta}_i)$ typically need to be estimated, as the exact population values are unknown. As any such estimate integrates information from all samples of all subjects, it can be considered a fairly accurate estimate, justifying the use of a z-test even when we replace $\mathrm{Var}(\hat{\theta})$ by its estimate in Eq. (36) Borenstein et al. (2009); Card (2011). Sometimes, however, the more conservative t-distribution with $S - 1$ degrees of freedom is assumed for $z$ Thirion et al. (2007); Jackson et al. (2010).

#### 2.6.1 Equal-variance-weighting

The z-test introduced in Eq. (36) is valid regardless of the choice of the non-negative weights $w_i$. One popular choice is to assign equal weights

$$w_i = \frac{1}{S} \tag{37}$$

to all subjects, such that $\hat{\theta}$ becomes the arithmetic mean of the $\hat{\theta}_i$. This procedure is similar to the naive summary-statistic approach introduced in Section 2.5 in that both approaches assign equal importance to each subject-level effect size. However, it differs in the way the variance is estimated, and in terms of the distribution that is assumed for the test statistic. For the naive summary-statistic approach, variances are estimated through Eq. (32) using the data points on the group level only. The equal-variance-weighting approach instead uses the subject-level variances; that is, following Eq. (35), $\mathrm{Var}(\hat{\theta}) = \frac{1}{S^2} \sum_{i=1}^{S} \mathrm{Var}(\hat{\theta}_i)$.

If the individual variance estimates are unbiased, both methods yield an unbiased estimate of the variance $\mathrm{Var}(\hat{\theta})$. But the variance of this variance estimate is typically smaller for the equal-variance-weighting approach, because it makes use of all the available data. This more accurate estimate means that the test statistic is approximately normally distributed rather than t-distributed with $S - 1$ degrees of freedom. This translates into a power gain, as illustrated in the simulation presented in Section 3. However, estimating the between-subject variance for a random-effects model is not straightforward, and may also introduce biases and variability (see Section 2.6.3).

#### 2.6.2 Inverse-variance-weighting

Interestingly, the choice of equal weights is suboptimal in terms of obtaining a group-level effect size estimate with minimal variance. It is generally desirable to minimize the variance of the weighted average, as unbiased estimators with smaller variance achieve a lower mean squared error (MSE) and lead to more powerful statistical tests. The minimum-variance estimate is obtained by weighting each subject-level effect size in proportion to the inverse of its variance, using weights

$$w_i = \frac{1}{\mathrm{Var}(\hat{\theta}_i)} \tag{38}$$

This result is consistent with the intuition that less precise $\hat{\theta}_i$ should have a lower impact on the overall estimate than those that are estimated with high confidence. Inserting Eq. (38) into Eq. (35), we obtain the optimal value

$$\mathrm{Var}(\hat{\theta}) = \frac{1}{\sum_{i=1}^{S} 1 / \mathrm{Var}(\hat{\theta}_i)} \tag{39}$$

#### 2.6.3 Estimation of between-subject variance

To perform inverse-variance-weighting under the random-effects model, the between-subject variance $\sigma_{\mathrm{b}}^2$ needs to be estimated in order to obtain the total subject-wise variance $\sigma_i^2 + \sigma_{\mathrm{b}}^2$. Several iterative and non-iterative methods have been proposed Worsley et al. (2002); Guolo and Varin (2015); Veroniki et al. (2016). A popular, relatively simple approach is the non-iterative procedure proposed by DerSimonian and Laird DerSimonian and Laird (1986). For given estimates $\hat{\sigma}_i^2$ of the within-subject variances (which can be obtained using the procedures discussed in Section 2.2), and for the fixed-effect quantities

$$w_i^{\mathrm{FE}} = \frac{1}{\hat{\sigma}_i^2}, \qquad \hat{\theta}^{\mathrm{FE}} = \frac{\sum_{i=1}^{S} w_i^{\mathrm{FE}} \hat{\theta}_i}{\sum_{i=1}^{S} w_i^{\mathrm{FE}}}, \qquad Q = \sum_{i=1}^{S} w_i^{\mathrm{FE}} \left( \hat{\theta}_i - \hat{\theta}^{\mathrm{FE}} \right)^2 \tag{40}$$

the between-subject variance according to DerSimonian and Laird (1986) is estimated as

$$\hat{\sigma}_{\mathrm{b}}^2 = \max\left( 0,\; \frac{Q - (S - 1)}{\sum_{i=1}^{S} w_i^{\mathrm{FE}} - \sum_{i=1}^{S} (w_i^{\mathrm{FE}})^2 / \sum_{i=1}^{S} w_i^{\mathrm{FE}}} \right) \tag{41}$$

The truncation of the estimated variance at zero introduces a positive bias, that is, $\sigma_{\mathrm{b}}^2$ is over-estimated Rukhin (2013). Moreover, the estimate may be quite variable for small sample sizes. As the inverse-variance-weighting approach does not take this variation into account, the resulting p-values can become too small when the number of subjects is small Brockwell and Gordon (2001); Guolo and Varin (2015). Nevertheless, the DerSimonian and Laird approach is acceptable for a moderate to large number of subjects Jackson et al. (2010); Guolo and Varin (2015), and is the default approach in many software routines in the meta-analysis community Veroniki et al. (2016).

After $\hat{\sigma}_{\mathrm{b}}^2$ has been calculated, the random-effects quantities are finally computed as

$$w_i^{\mathrm{RE}} = \frac{1}{\hat{\sigma}_i^2 + \hat{\sigma}_{\mathrm{b}}^2}, \qquad \hat{\theta}^{\mathrm{RE}} = \frac{\sum_{i=1}^{S} w_i^{\mathrm{RE}} \hat{\theta}_i}{\sum_{i=1}^{S} w_i^{\mathrm{RE}}} \tag{42}$$
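The DerSimonian-Laird estimator of Eqs. (40)–(42) is a few lines of arithmetic. The function and the example values below are illustrative; note the truncation at zero in the last line.

```python
import numpy as np

def dersimonian_laird(theta_hat, var_within):
    """Non-iterative DerSimonian-Laird estimate of the between-subject variance."""
    w = 1.0 / np.asarray(var_within)               # fixed-effect weights, Eq. (40)
    theta_fe = np.sum(w * theta_hat) / np.sum(w)   # fixed-effect estimate
    q = np.sum(w * (theta_hat - theta_fe) ** 2)    # heterogeneity statistic Q
    s = len(theta_hat)
    denom = np.sum(w) - np.sum(w**2) / np.sum(w)
    return max(0.0, (q - (s - 1)) / denom)         # truncation at zero, Eq. (41)

# Hypothetical subject-level effects with equal within-subject variances
theta_hat = np.array([0.2, 0.9, 0.4, 1.1, 0.1])
var_within = np.array([0.05, 0.05, 0.05, 0.05, 0.05])
sigma_b2 = dersimonian_laird(theta_hat, var_within)

# Random-effects weights combining both variance components, Eq. (42)
w_re = 1.0 / (var_within + sigma_b2)
```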

#### 2.6.4 Algorithm

The sufficient-summary-statistic approach using inverse-variance-weighting is summarized in Algorithm 1. First, the subject-level effect sizes $\hat{\theta}_i$ and the within-subject variances $\hat{\sigma}_i^2$ are estimated based on the available subject-wise data samples. Second, the between-subject variance $\hat{\sigma}_{\mathrm{b}}^2$ is estimated as outlined in Section 2.6.3 (unless a fixed-effect model can reasonably be assumed). Third, the estimated population effect is calculated as the average of the subjects’ effects, weighted by the inverse of their estimated variances as outlined in Section 2.6.2. The variance of a subject’s estimated effect around the population effect is the sum of the within-subject measurement error variance and the between-subject variance (cf. Eq. (31)). Finally, the estimated mean effect is subjected to a z-test as introduced in Eq. (36).

Different effect sizes and their corresponding variances have been discussed in Section 2.2. With the exception of the Pearson correlation coefficient, these measures can be directly subjected to the inverse-variance-weighting approach. That is, the effect size and its variance for the mean difference are given in Eqs. (6) and (7), for the AUC in Eqs. (12) and (15), and for linear regression coefficients in Eqs. (23) and (24). As discussed in Section 2.2.4, it is, however, beneficial to transform correlation coefficients into approximately normally distributed quantities with known variance prior to averaging across subjects. We can proceed with the application of the inverse-variance-weighting approach just as outlined before, treating the Fisher transforms given in Eq. (18) rather than the raw correlations as effect sizes. The resulting population effect can be transformed back into a valid Pearson correlation using the inverse transformation described in Eq. (21).
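As a sketch of this workflow for correlations (the function name `combine_correlations` is our own, the Fisher-transform variance $1/(n-3)$ is the standard large-sample result, and a fixed-effect model is assumed for brevity):

```python
import math

def combine_correlations(r_values, n_values):
    """Combine subject-level Pearson correlations: Fisher z-transform,
    inverse-variance-weighted average (variance of z is 1/(n-3)),
    and back-transformation to the correlation scale."""
    z = [math.atanh(r) for r in r_values]         # Fisher transform
    w = [n - 3.0 for n in n_values]               # inverse variances
    z_bar = sum(wi * zi for wi, zi in zip(w, z)) / sum(w)
    return math.tanh(z_bar)                       # back-transform
```

Averaging on the z-scale and back-transforming guarantees that the combined estimate is a valid correlation in $[-1, 1]$.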

### 2.7 Stouffer’s method of combining p-values

A general approach for combining the results of multiple statistical tests is Stouffer’s method Stouffer et al. (1949); Whitlock (2005). For a set of $S$ independent tests of null hypotheses $H_0^{(1)}, \ldots, H_0^{(S)}$, Stouffer’s method aims to determine whether all individual null hypotheses are jointly to be accepted or rejected or, in other words, whether the global null hypothesis $H_0^{\cap} = \bigcap_{s=1}^{S} H_0^{(s)}$ is true. In general, the individual hypotheses may not necessarily refer to the same effect size or even effect size measure, and the p-values for each individual hypothesis may be derived using different test procedures, including non-parametric, bootstrap- or permutation-based tests. In the present context of nested data, Stouffer’s method can be used to test group-level null hypotheses in the fixed-effect setting, i.e., the absence of an effect in all subjects of the studied population.

Denote with $H_0^{(s)}$ the null hypothesis that there is no effect in subject $s$, and with $p_s$ the one-tailed p-value of an appropriate statistical test for $H_0^{(s)}$. If the null hypothesis is true, $p_s$ is uniformly distributed between $0$ and $1$ (see Murdoch et al. (2008) for an illustration). Therefore, the one-tailed p-values can be converted into standard normally distributed z-scores using the transformation

$$z_s = \Phi^{-1}(1 - p_s), \qquad (43)$$

where $\Phi^{-1}$ denotes the inverse of the standard normal cumulative distribution function. For Gaussian-distributed subject-level effect sizes $\hat{y}_s$ with known variance $\sigma_s^2$, this step can be carried out more directly using

$$z_s = \frac{\hat{y}_s}{\sigma_s}. \qquad (44)$$

The cumulative test statistic

$$Z = \frac{1}{\sqrt{S}} \sum_{s=1}^{S} z_s \qquad (45)$$

follows the standard normal distribution under the global null hypothesis, which can be used to derive a p-value for the group-level $H_0^{\cap}$.

Notice that Stouffer’s method as outlined above is applied to one-tailed p-values. However, testing for the presence of an effect often requires a two-tailed test. In this case, it is important to take the direction of the effect in different subjects into account. We cannot simply combine two-tailed tests: a positive effect in one subject and a negative effect in another subject would be counted as evidence for an overall effect, even though they cancel each other out. However, the direction of the effect can be determined post-hoc. To this end, one-tailed p-values for the same direction are calculated for each subject and combined as outlined in Eqs. (43) and (45) into a group-level one-tailed p-value $p^{(1)}$. The group-level two-tailed p-value is then obtained as $p^{(2)} = 2 \min(p^{(1)}, 1 - p^{(1)})$ (see Eqs. (1)-(3)) Whitlock (2005).
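The whole procedure fits in a few lines of stdlib Python. This is a minimal sketch (the function name `stouffer` is hypothetical), assuming the one-tailed subject-level p-values all refer to the same effect direction.

```python
from statistics import NormalDist

def stouffer(one_tailed_pvalues):
    """Stouffer's method: combine one-tailed subject-level p-values into
    group-level one- and two-tailed p-values (cf. Eqs. (43) and (45))."""
    nd = NormalDist()
    z = [nd.inv_cdf(1.0 - p) for p in one_tailed_pvalues]  # p to z-score
    Z = sum(z) / len(z) ** 0.5        # cumulative test statistic
    p_one = 1.0 - nd.cdf(Z)           # group-level one-tailed p-value
    p_two = 2.0 * min(p_one, 1.0 - p_one)
    return p_one, p_two
```

Individually uninformative p-values of 0.5 map to z-scores of zero and leave the global null untouched, whereas several moderately small one-tailed p-values accumulate into strong group-level evidence.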

## 3 Simulations

In the following, we present a set of simulations in which we compare the statistical approaches reviewed above to test for a difference between two class-conditional means in artificial data. We consider two conditions $A$ and $B$ with true means $\mu_A$ and $\mu_B$ and class-conditional mean difference $\delta = \mu_A - \mu_B$. We want to test the null hypothesis $H_0: \delta = 0$ or, equivalently, $\mu_A = \mu_B$. The following scenarios are investigated: 1) the data are generated either within a fixed-effect or a random-effects model; 2) the data are generated from either a Gaussian or a non-Gaussian distribution. In each scenario, we compare the methods’ abilities to reject the null hypothesis when we vary the true class-conditional mean difference $\delta$.

Data for $S = 5$ or $S = 20$ subjects were generated as follows. First, subject-specific class-conditional mean differences were sampled according to

$$\delta_s \sim \mathcal{N}(\delta, \tau^2),$$

where $\tau^2$ is the between-subject variance. For the fixed-effect model, we set $\tau^2 = 0$, while for the random-effects model, $\tau^2$ was set to a positive value.

We then sampled $N_{A,s}$ data points for condition $A$ and $N_{B,s}$ data points for condition $B$ from Gaussian distributions with variance $\sigma_s^2$ and class-conditional means $\mu_{A,s} = \mu_{B,s} + \delta_s$ and $\mu_{B,s}$, respectively. A separate set of samples was drawn from non-Gaussian F(2,5)-distributions adjusted to have the same class-conditional means and variance. The numbers of data points, $N_{A,s}$ and $N_{B,s}$, the class-conditional means, $\mu_{A,s}$ and $\mu_{B,s}$, and the variance, $\sigma_s^2$, were randomly drawn for each subject such that $\sigma_s^2$ is uniformly distributed between 0.5 and 2, $N_{A,s}$ and $N_{B,s}$ are uniformly distributed between 50 and 80, and the true mean of class $B$, $\mu_{B,s}$, is uniformly distributed between $-3$ and $3$. In each scenario, the true class-conditional mean difference, $\delta$, was varied across a range of values.
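The per-subject sampling scheme described above can be sketched as follows for the Gaussian case. This is an illustrative reconstruction, not the authors' simulation code; the function name `simulate_subject` is our own, and the assumption $\mu_{A,s} = \mu_{B,s} + \delta_s$ follows the definition of $\delta$.

```python
import random

def simulate_subject(delta, tau2, rng):
    """Draw one subject's two-condition Gaussian samples under the
    simulation model of Section 3 (parameter ranges as in the text)."""
    delta_s = rng.gauss(delta, tau2 ** 0.5)    # subject-specific effect
    sigma = rng.uniform(0.5, 2.0) ** 0.5       # variance uniform in [0.5, 2]
    n_a, n_b = rng.randint(50, 80), rng.randint(50, 80)
    mu_b = rng.uniform(-3.0, 3.0)              # class-B mean in [-3, 3]
    x_a = [rng.gauss(mu_b + delta_s, sigma) for _ in range(n_a)]
    x_b = [rng.gauss(mu_b, sigma) for _ in range(n_b)]
    return x_a, x_b
```

Setting `tau2=0.0` generates fixed-effect data with an identical mean difference in every subject.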

All experiments were repeated 1000 times with different samples drawn from the same distributions. We report the rejection rate, which is the fraction of the test runs in which the null hypothesis was rejected. When the null hypothesis is true ($\delta = 0$), the rejection rate is identical to the type I error (false positive) rate of the statistical tests under study. In the converse case, in which the null hypothesis is false ($\delta \neq 0$), the rejection rate corresponds to the power of the test. All statistical tests were performed at significance level $\alpha$. An ideal test would thus obtain a rejection rate of $\alpha$ when the null hypothesis is true ($\delta = 0$), and a rejection rate of 1 otherwise. The higher the rejection rate in the presence of an effect ($\delta \neq 0$), the higher is the power of a test. However, if the null hypothesis is true, a rejection rate greater than $\alpha$ indicates the occurrence of spurious findings beyond the acceptable $\alpha$-level.

### 3.1 Simulation 1: Fixed effect vs. random effects

Figure 2 depicts the results achieved by the tested statistical procedures in the fixed-effect (top row) and random-effects (bottom row) scenarios for Gaussian-distributed data, using data from $S = 5$ and $S = 20$ subjects. The ‘pooling’ approach consists of pooling the samples of all subjects and performing one two-sample t-test (cf. Section 2.4). ‘Naive (paired t-test)’ refers to the naive summary-statistic approach, in which each subject’s mean difference is treated as an observation for a group-level paired t-test (cf. Section 2.5). Four variants of the sufficient-summary-statistic approach are considered (cf. Section 2.6). These variants differ in assuming either random effects (RE) or one fixed effect (FE), and in using either the inverse-variance-weighting scheme (Eq. (38)) or equal weights (Eq. (37)). ‘Stouffer’, finally, refers to using Stouffer’s method to combine p-values obtained from subject-level two-sample t-tests (cf. Section 2.7). Note that all group-level tests are carried out two-tailed.

In line with our previous considerations, data pooling yielded very low power in the presence of an effect both under the fixed-effect and random-effects models. The highest power is achieved in both cases by the inverse-variance-weighted sufficient-summary-statistic approach, followed by Stouffer’s method, the sufficient-summary-statistic approach using equal weights, and the paired t-test.

Considerable differences are observed between the fixed-effect and random-effects settings. For data following the fixed-effect model, the fixed-effect variants of the sufficient-summary-statistic approach display only a negligible advantage over their random-effects counterparts, indicating that the latter succeed in estimating the between-subject variance to be zero. Moreover, in the case of equal class means ($\delta = 0$), all approaches achieve a false positive rate close to the expected value of $\alpha$.

The situation is different for data following a random-effects model. Here, the fixed-effect variants of the sufficient-summary-statistic approach as well as Stouffer’s method and the pooling approach display false positive rates that are between two and five times higher (up to 26%) than would be acceptable under the null hypothesis. This problem is substantially alleviated by the random-effects variants of the sufficient-summary-statistic approach. Nevertheless, when data are only available from $S = 5$ subjects, the null hypothesis is still rejected too often by the inverse-variance-weighted variant. This is due to the variability of the estimate of the between-subject variance (cf. Section 2.6.3). When $S = 20$ subjects are available, the expected false positive rate of $\alpha$ is achieved.

The naive summary-statistic approach (paired t-test of subject-wise means) achieves the expected false positive rate of $\alpha$ regardless of the number of subjects, and therefore represents a valid statistical test also in the random-effects setting.

### 3.2 Simulation 2: Gaussian vs. non-Gaussian

Figure 3 depicts the results of parametric and non-parametric statistical tests for simulated non-Gaussian-distributed data following either the fixed-effect model (left panel) or the random-effects model (middle panel). For comparison, the results obtained on Gaussian-distributed data following a random-effects model are displayed in the right panel. Four different statistical tests are compared: 1) the random-effects inverse-variance-weighted sufficient-summary-statistic approach for the difference between class-conditional means; 2) the same test for the area under the non-parametric receiver operating characteristic curve (AUC); 3) the naive summary-statistic approach in the form of a paired t-test between subject-wise means; and 4) its non-parametric equivalent, the Wilcoxon signed rank test. Note that for the naive summary-statistic approaches, the mean differences of each subject are treated as observations for a group-level paired t-test or Wilcoxon signed rank test, respectively.

The figure shows that, as for Gaussian-distributed data, the inverse-variance-weighted sufficient-summary-statistic approach achieves considerably higher statistical power than the corresponding naive summary-statistic approaches. Furthermore, non-parametric approaches achieve a higher power for non-Gaussian-distributed data than their parametric equivalents assuming Gaussian-distributed data. This difference is particularly pronounced for the better performing inverse-variance-weighted sufficient-summary-statistic approaches. The difference for the naive summary approaches is much smaller, because subject-level averages tend to be more Gaussian according to the central limit theorem. In contrast, parametric approaches have only a very minor advantage over non-parametric ones for Gaussian-distributed data. Note further that, when the Gaussianity assumption of the parametric approaches is violated, spurious results cannot, in theory, be ruled out. However, such effects are very small here.

## 4 Analysis of emergency-braking-related brain activity

We analyzed neuro- and myo-electrical activity of human participants during a simulated driving experiment. During the experiment, participants had the task to closely follow a computer-controlled lead vehicle. This lead vehicle would occasionally slow down abruptly, in which case the participant had to perform an emergency braking maneuver. The full study is described in (Haufe et al., 2011). Brain signals were acquired using 64 EEG electrodes (referenced to an electrode on the nose), while we here only report on the central EEG electrode Cz. Muscular activation of the lower right leg was acquired from two electromyographic (EMG) electrodes using a dipolar derivation. EEG and EMG data were recorded from 18 participants in three blocks of 45 minutes each. On average, clean data from 200 emergency situations were obtained from each participant (min: 123, max: 233). After filtering and sub-sampling to 100 Hz, the data were aligned relative to the onset of the braking of the lead vehicle as indicated by its brake light. For each time point relative to this stimulus, EEG and EMG measurements were contrasted with a sample of identical size that had been obtained from normal driving periods of each participant.

Figure 4 (top left) shows the deviation of EEG and EMG signals in emergency braking situations from signals obtained during normal driving periods as a function of time after stimulus. For each participant, a two-sample t-test was conducted (Eq. (6)). Assuming a random-effects model, the within-subject (i.e., within-participant) variance was estimated using Eq. (7), while the between-subject variance was estimated using Eq. (41). Results are presented in terms of the absolute value of the group-level z-score, which was computed using inverse-variance-weighting as described in Algorithm 1. It is apparent that the brain activity measured by EEG exhibits a significant amount of emergency-braking-related information at an earlier point in time than the activity measured at the right leg muscle, but is surpassed in terms of class separability by the EMG later on. This result reflects the decision-making process that is taking place in the brain prior to the execution of the physical braking movement.

The top right panel of Figure 4 depicts the same EEG time course in comparison to the curve obtained under the fixed-effect model. Compared to the RE model, the FE model leads to an inflation of z-scores starting 300 ms post-stimulus. Note that this is consistent with the result of Cochran’s Q-test for effect size heterogeneity Cochran (1954), which indicates non-zero between-subject variability after 200 ms post-stimulus, but not before.

The bottom left panel of Figure 4 depicts the difference between the inverse-variance-weighted sufficient-summary-statistic approach and the naive summary-statistic approach implemented as a paired t-test for differences in the subject-wise means of the two conditions. As expected, the inverse-variance-weighted approach achieves a higher power than the naive approach by taking the subject-level variances into account.

Finally, the bottom right panel of Figure 4 depicts the difference between subject-level two-sample t-tests and non-parametric AUC tests according to Eqs. (12) and (15). Again, no substantial difference is found between the two, indicating that the raw samples are approximately normally distributed, justifying the use of a parametric t-test.

## 5 Discussion

In this paper we have provided a review of existing methods to assess the statistical significance of group-level effect sizes in data with nested structure. We demonstrated that simply pooling the data of all subjects is not a valid approach. In contrast, the naive summary-statistic approach of performing a paired t-test on subject-level effect sizes is valid. However, it has suboptimal statistical power. With the inverse-variance-weighted sufficient-summary-statistic approach and Stouffer’s method, we discussed two general strategies that combine the simplicity and low complexity of ‘naive’ approaches with higher statistical power by using prior knowledge about the distributions and variances of the subject-level effect sizes. The benefit of these two strategies over the ‘naive’ approaches was demonstrated in a set of simulations.

The simulations as well as the presented real-data analysis also highlighted the necessity to account for between-subject variance through a random-effects analysis. A failure to do so results in underestimated p-values and in apparently significant, but spurious, effects. Stouffer’s method is a fixed-effect analysis, and thus provides a valid group-level test only if the assumption of zero between-subject variance can be theoretically justified. In most practical cases, this is not the case Holmes and Friston (1998); Field (2003); Stephan et al. (2009); Schmidt et al. (2009); Allefeld et al. (2016). We thus recommend the random-effects inverse-variance-weighting approach, provided that the number of subjects is not too small and the subject-wise variances can be reliably estimated.

Importantly, while we here only considered data with two nesting levels, both Stouffer’s method and the inverse-variance-weighted approach naturally extend to hierarchies with arbitrary numbers of levels. For example, p-values derived from individual subjects of a study, e.g., using Stouffer’s method, can again be combined at a higher level to test for consistent effects across multiple studies. In a similar way, group-level effects with variances derived from subject-level samples through Eq. (36) can be further combined into a higher-level average with known variance.

### 5.1 Alternative definitions of fixed and random effects

The notions of ‘fixed’ and ‘random’ effects are used differently in different branches of statistics. See, for example, Gelman (2005) for a discussion of five different definitions of ‘fixed’ and ‘random’ effects in the statistical literature. In ANOVA, the factor levels of a ‘random effect’ are assumed to be randomly selected from a population, while the factor levels of a ‘fixed effect’ are chosen by the experimenter. In contrast to the definition of a ‘fixed effect’ used here (Eq. (27)), the effect sizes of a ‘fixed effect’ factor in ANOVA are allowed to differ across subjects.

Here we define a fixed effect (FE) to be constant across subjects, while a random effect (RE) is allowed to vary across subjects. The fundamental model underlying RE analysis is given by Eqs. (28) and (29), while the FE model is defined in Eq. (27). These definitions are used in the meta-analysis literature Field (2003); Borenstein et al. (2009); Card (2011), which contains most statistical discussion of between-subject variance estimators DerSimonian and Laird (1986); Brockwell and Gordon (2001); Schmidt et al. (2009); Rukhin (2013).

In parts of the neuroimaging literature, a different interpretation of the fixed-effect model is predominant Penny and Holmes (2007); Monti (2011). Here,

$$\hat{y}_s = y_s + \varepsilon_s, \qquad (46)$$

where $\varepsilon_s$ denotes the deviation of the subject’s observed effect $\hat{y}_s$ from the subject-specific true effect $y_s$, which is not modeled as a random variable. In this view, the subjects are not randomly drawn from a population, but are ‘fixed’. There is no overall population effect, and the implicit null hypothesis behind the model is $\frac{1}{S} \sum_{s=1}^{S} y_s = 0$. This yields an alternative interpretation of the same analysis: a fixed-effect analysis allows one to draw valid inference on the mean effect – but only for the specific mean of the observed subjects. Such an analysis would correspond to a case study, but a generalization to the population from which the subjects were drawn is not possible Penny and Holmes (2007). In contrast, the fixed-effect model Eq. (27) we assume throughout this paper allows such a generalization – but the assumption of a constant effect across subjects has to be theoretically justified.

### 5.2 Nested multiple linear models

Another approach to handle nested data are nested linear models (also called hierarchical linear models, multi-level models, or mixed linear models). These models extend the multiple linear regression model discussed in Section 2.2.5 to deal with nested data. Following Hox et al. (2010), this is done by introducing subject-specific regression coefficients $\boldsymbol{\beta}_s$. The model for the $n$-th sample of subject $s$ then reads

$$y_s^{(n)} = \left( \mathbf{x}_s^{(n)} \right)^\top \boldsymbol{\beta}_s + \varepsilon_s^{(n)}. \qquad (47)$$

The subject-specific coefficients are further expressed as

$$\boldsymbol{\beta}_s = \boldsymbol{\beta}_0 + \Gamma \mathbf{z}_s + \mathbf{u}_s, \qquad (48)$$

where $\boldsymbol{\beta}_0$ is an intercept term, $\mathbf{z}_s$ is a vector of known subject-resolved independent variables, $\Gamma$ holds the corresponding coefficients modeling the influence of these variables on $\boldsymbol{\beta}_s$, and $\mathbf{u}_s$ is group-level zero-mean noise. In this complete form, all coefficients are subject-specific. We therefore speak of a random-effects nested linear model. It is also conceivable that only some of the coefficients are subject-specific, while others are shared between subjects. For example, in some applications it may be reasonable to model subject-specific intercepts, but identical effects of the independent variables for all subjects. A resulting model would be called a mixed-effects nested linear model.

Nested linear models are very general and allow for more complex statistical analyses than the procedures for estimating and testing group-level effects discussed here. On the downside, the estimation of nested linear models is difficult because no closed-form solution exists in the likely case that the variances of the subject- and group-level noise terms are unknown. Fitting a nested linear model using iterative methods is time-consuming when the number of subjects and/or samples per subject is large, as all data of all subjects enter the same model. This is especially problematic when the number of models to be fitted is large, as, for example, in a mass-univariate fMRI context, where an individual model needs to be fitted for tens of thousands of brain voxels.

When only the group-level effect is of interest, the presented inverse-variance-weighting approach is the more practical and computationally favorable alternative. In this approach, regression coefficients are estimated at the subject level, which bears the advantage that the global optimum for each subject can be found analytically in a computationally efficient manner. As the individual coefficient estimates are normally distributed with variance given in Eq. (24), they can then be combined using the inverse-variance-weighting scheme. This approach is mathematically equivalent to a nested-linear-model analysis when the covariances are known Beckmann et al. (2003). For these reasons, we here refrained from a deeper discussion of nested linear models. The interested reader is referred to, for example, Quené and Van den Bergh (2004); Woltman et al. (2012); Hox et al. (2010).

### 5.3 Resampling and surrogate-data approaches

While the variances of the effect size measures discussed here can be derived analytically, this may not be the case in general. However, given sufficient data, the variance of an observed effect can always be estimated through resampling procedures such as the bootstrap or the jackknife Efron and Efron (1982). Assuming an approximately normal distribution for the estimated effect, the inverse-variance-weighting approach can hence be applied.
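A bootstrap variance estimate might be sketched as follows. The function name `bootstrap_variance` and the replicate count are our own choices; any effect-size estimator that maps a sample to a scalar can be plugged in.

```python
import random
from statistics import mean, variance

def bootstrap_variance(sample, effect_fn, n_boot=1000, seed=0):
    """Nonparametric bootstrap estimate of the variance of an effect-size
    estimate: resample with replacement, re-estimate the effect on each
    replicate, and take the empirical variance of the replicates."""
    rng = random.Random(seed)
    n = len(sample)
    replicates = [
        effect_fn([sample[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    ]
    return variance(replicates)
```

For the sample mean, the bootstrap estimate should approximate the familiar analytical value, i.e., the sample variance divided by the number of observations.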

For some types of data, such as time series, the subject-level i.i.d. assumption underlying most statistical procedures discussed here is violated. For such dependent samples, the variance of an observed effect – be it analytically derived or obtained through a resampling procedure under the i.i.d. assumption – is underestimated. This problem can be addressed through sophisticated resampling techniques that accommodate the dependency structure of the data. A detailed description of these techniques can be found, for example, in Kreiss and Paparoditis (2011); Lahiri (2013).

For some types of analysis questions, it is not straightforward to determine the expected value of the effect size under the null hypothesis. A potential remedy to this problem is the method of surrogate data, which was originally introduced in the context of identifying nonlinearity in time series Theiler and Prichard (1997) and is increasingly often applied to functional neuroimaging data in order to test for brain interactions (see Haufe and Ewald (2016) for a discussion). Surrogate data are artificial data that are generated by manipulating the original data in such a way that all crucial properties (including the dependency structure of the samples) are maintained, except for the effect under investigation. As such, surrogate data can provide an empirical distribution of the effect size under the null hypothesis. This may be used to derive subject-level p-values, which can be subjected to Stouffer’s method to test for population effects under the fixed-effect model.

### 5.4 Multivariate statistics

In the present paper we assumed that only a single effect is measured for each subject. However, massively multivariate data are common in many domains. This is particularly so in neuroimaging, where brain activity is typically measured at hundreds or even thousands of locations in parallel. When (group) statistical inference is performed jointly for multiple measurement channels, the resulting group-level p-values must be corrected for multiple comparisons using, e.g., the false discovery rate (Benjamini and Hochberg, 1995).

Another way to perform inference for multivariate data is to use inherently multivariate effect size measures such as canonical correlations, coefficients of multivariate linear models, or, more generally, univariate effect size measures calculated on optimal linear combinations of the measurement channels (e.g., Dähne et al. (2015); Haufe et al. (2014)). However, most multivariate statistics involve some sort of model fitting. If the number of data channels is high compared to the number of samples, overfitting may occur and may bias the expected value of the effect under the null hypothesis. One way to avoid this bias is to split the data into training and test parts, where the training set is used to fit the parameters of the multivariate model, while the actual statistical test is carried out on the test data using the predetermined model parameters Lemm et al. (2011).

### 5.5 Application in neuroimaging

While the methods summarized in this manuscript are applicable in any context in which nested data occur, the examples provided here are all drawn from neuroimaging. In parts of the neuroimaging literature, the use of suboptimal inference procedures such as the naive summary-statistic approach or data pooling is still widespread Mumford and Nichols (2009); Pernet et al. (2011). Given the low signal-to-noise ratios and small sample sizes that are typical for neuroimaging studies, the loss of statistical power incurred by using these procedures is unfortunate. Nested linear models are frequently used in the analysis of fMRI data, as implementations are provided by the major software packages. They are, however, less common outside the fMRI context. The sufficient-summary-statistic approach and Stouffer’s method have occasionally been employed in neuroimaging contexts Nolte et al. (2008); Schubert et al. (2009); Haufe et al. (2011); Winkler et al. (2015). The theoretical grounds on which they are applied have, however, not always been laid out in the same detail as here.

A distinction is made in the neuroimaging literature between so-called ‘activation-like’ and ‘information-like’ effect size measures. Allefeld et al. argue that measures that quantify the presence of an effect without a notion of directionality (that is, are ‘information-like’) cannot be subjected to a subsequent random-effects group-level analysis, because their domain is bounded from below by what would be expected under the null hypothesis of no effect Allefeld et al. (2016). Their argument refers in particular to the practice of plugging single-subject classification accuracies into a group-level paired t-test. Because the true single-subject classification accuracies can never be below chance level, the group-level null hypothesis being tested is the fixed-effect hypothesis of no effect in any subject. For the current investigation, this issue is, however, of minor importance, as, except for the coefficient of determination, all effect size measures discussed here are directional and therefore ‘activation-like’.

## 6 Conclusion

In this paper, we have reviewed practical approaches to conduct statistical inference on group-level effects in nested data settings, and have demonstrated their properties on simulated and real data. With the inverse-variance-weighted sufficient-summary-statistic approach, we highlighted an approach that combines simplicity with excellent statistical properties. We have furthermore provided a practical guideline for using this approach in conjunction with some of the most popular measures of statistical effects.

## Acknowledgements

SH was supported by a Marie Curie International Outgoing Fellowship (grant No. PIOF-GA- 2013-625991) within the 7th European Community Framework Programme.

## References

- Allefeld et al. (2016) Allefeld, C., Görgen, K., Haynes, J.-D., 2016. Valid population inference for information-based imaging: from the second-level t-test to prevalence inference. NeuroImage 141, 378–392.
- Beckmann et al. (2003) Beckmann, C. F., Jenkinson, M., Smith, S. M., 2003. General multilevel linear modeling for group analysis in fMRI. Neuroimage 20 (2), 1052–1063.
- Benjamini and Hochberg (1995) Benjamini, Y., Hochberg, Y., 1995. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society. Series B (Methodological), 289–300.
- Borenstein et al. (2009) Borenstein, M., Hedges, L. V., Higgins, J., Rothstein, H. R., 2009. Introduction to Meta-Analysis. Wiley Online Library.
- Brockwell and Gordon (2001) Brockwell, S. E., Gordon, I. R., 2001. A comparison of statistical methods for meta-analysis. Statistics in medicine 20 (6), 825–840.
- Card (2011) Card, N. A., 2011. Applied meta-analysis for social science research. Guilford Press.
- Cochran (1954) Cochran, W. G., 1954. The combination of estimates from different experiments. Biometrics 10 (1), 101–129.
- Dähne et al. (2015) Dähne, S., Bießman, F., Samek, W., Haufe, S., Goltz, D., Gundlach, C., Villringer, A., Fazli, S., Müller, K.-R., 2015. Multivariate machine learning methods for fusing functional multimodal neuroimaging data. Proceedings of the IEEE 103 (9), 1507–1530.
- DerSimonian and Laird (1986) DerSimonian, R., Laird, N., 1986. Meta-analysis in clinical trials. Controlled clinical trials 7 (3), 177–188.
- Efron and Efron (1982) Efron, B., Efron, B., 1982. The jackknife, the bootstrap and other resampling plans. Vol. 38. SIAM.
- Fawcett (2006) Fawcett, T., 2006. An introduction to ROC analysis. Pattern recognition letters 27 (8), 861–874.
- Field (2003) Field, A. P., 2003. The problems in using fixed-effects models of meta-analysis on real-world data. Understanding Statistics: Statistical Issues in Psychology, Education, and the Social Sciences 2 (2), 105–124.
- Fisher (1915) Fisher, R. A., 1915. Frequency distribution of the values of the correlation coefficient in samples from an indefinitely large population. Biometrika, 507–521.
- Gelman (2005) Gelman, A., 2005. Analysis of variance - why it is more important than ever. The Annals of Statistics 33 (1), 1–53.
- Greiner et al. (2000) Greiner, M., Pfeiffer, D., Smith, R., 2000. Principles and practical application of the receiver-operating characteristic analysis for diagnostic tests. Preventive veterinary medicine 45 (1), 23–41.
- Guolo and Varin (2015) Guolo, A., Varin, C., 2015. Random-effects meta-analysis: the number of studies matters. Statistical methods in medical research, 0962280215583568.
- Hanley and McNeil (1982) Hanley, J. A., McNeil, B. J., 1982. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143 (1), 29–36.
- Haufe and Ewald (2016) Haufe, S., Ewald, A., 2016. A simulation framework for benchmarking EEG-based brain connectivity estimation methodologies. Brain topography, 1–18.
- Haufe et al. (2014) Haufe, S., Meinecke, F., Görgen, K., Dähne, S., Haynes, J.-D., Blankertz, B., Bießmann, F., 2014. On the interpretation of weight vectors of linear models in multivariate neuroimaging. NeuroImage 87, 96–110.
- Haufe et al. (2011) Haufe, S., Treder, M. S., Gugler, M. F., Sagebaum, M., Curio, G., Blankertz, B., 2011. EEG potentials predict upcoming emergency brakings during simulated driving. Journal of neural engineering 8 (5), 056001.
- Hoerl and Kennard (1970) Hoerl, A. E., Kennard, R. W., 1970. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics 12 (1), 55–67.
- Holmes and Friston (1998) Holmes, A., Friston, K., 1998. Generalisability, random effects and population inference. Neuroimage 7, S754.
- Hox et al. (2010) Hox, J. J., Moerbeek, M., van de Schoot, R., 2010. Multilevel analysis: Techniques and applications. Routledge.
- Hunter and Schmidt (2000) Hunter, J. E., Schmidt, F. L., 2000. Fixed effects vs. random effects meta-analysis models: implications for cumulative research knowledge. International Journal of Selection and Assessment 8 (4), 275–292.
- Jackson et al. (2010) Jackson, D., Bowden, J., Baker, R., 2010. How does the DerSimonian and Laird procedure for random effects meta-analysis compare with its more efficient but harder to compute counterparts? Journal of Statistical Planning and Inference 140 (4), 961–970.
- Kreiss and Paparoditis (2011) Kreiss, J.-P., Paparoditis, E., 2011. Bootstrap methods for dependent data: A review. Journal of the Korean Statistical Society 40 (4), 357–378.
- Lahiri (2013) Lahiri, S. N., 2013. Resampling methods for dependent data. Springer.
- Lemm et al. (2011) Lemm, S., Blankertz, B., Dickhaus, T., Müller, K.-R., 2011. Introduction to machine learning for brain imaging. NeuroImage 56 (2), 387–399.
- Mann and Whitney (1947) Mann, H. B., Whitney, D. R., 1947. On a test of whether one of two random variables is stochastically larger than the other. The Annals of Mathematical Statistics 18 (1), 50–60.
- Mason and Graham (2002) Mason, S. J., Graham, N., 2002. Areas beneath the relative operating characteristics (ROC) and relative operating levels (ROL) curves: Statistical significance and interpretation. Quarterly Journal of the Royal Meteorological Society 128 (584), 2145–2166.
- Monti (2011) Monti, M. M., 2011. Statistical analysis of fMRI time-series: a critical review of the GLM approach. Frontiers in Human Neuroscience 5, 28.
- Mumford and Nichols (2009) Mumford, J. A., Nichols, T., 2009. Simple group fMRI modeling and inference. NeuroImage 47 (4), 1469–1475.
- Murdoch et al. (2008) Murdoch, D. J., Tsai, Y.-L., Adcock, J., 2008. P-values are random variables. The American Statistician 62 (3), 242–245.
- Nolte et al. (2008) Nolte, G., Ziehe, A., Nikulin, V. V., Schlögl, A., Krämer, N., Brismar, T., Müller, K.-R., 2008. Robustly estimating the flow direction of information in complex physical systems. Physical Review Letters 100 (23), 234101.
- Penny and Holmes (2007) Penny, W., Holmes, A., 2007. Random effects analysis. In: Friston, K., Ashburner, J., Kiebel, S., Nichols, T., Penny, W. (Eds.), Statistical Parametric Mapping. Academic Press, London, Ch. 12, pp. 156–165.
- Pernet et al. (2011) Pernet, C. R., Chauveau, N., Gaspar, C., Rousselet, G. A., 2011. LIMO EEG: a toolbox for hierarchical linear modeling of electroencephalographic data. Computational Intelligence and Neuroscience 2011, 3.
- Poline and Brett (2012) Poline, J.-B., Brett, M., 2012. The general linear model and fMRI: does love last forever? NeuroImage 62 (2), 871–880.
- Quené and Van den Bergh (2004) Quené, H., Van den Bergh, H., 2004. On multi-level modeling of data from repeated measures designs: A tutorial. Speech Communication 43 (1), 103–121.
- Rukhin (2013) Rukhin, A. L., 2013. Estimating heterogeneity variance in meta-analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75 (3), 451–469.
- Ruxton (2006) Ruxton, G. D., 2006. The unequal variance t-test is an underused alternative to Student’s t-test and the Mann–Whitney U test. Behavioral Ecology 17 (4), 688–690.
- Schmidt et al. (2009) Schmidt, F. L., Oh, I.-S., Hayes, T. L., 2009. Fixed- versus random-effects models in meta-analysis: Model properties and an empirical comparison of differences in results. British Journal of Mathematical and Statistical Psychology 62 (1), 97–128.
- Schubert et al. (2009) Schubert, R., Haufe, S., Blankenburg, F., Villringer, A., Curio, G., 2009. Now you’ll feel it, now you won’t: EEG rhythms predict the effectiveness of perceptual masking. Journal of Cognitive Neuroscience 21 (12), 2407–2419.
- Shorack et al. (1966) Shorack, R. A., 1966. Recursive generation of the distribution of the Mann-Whitney-Wilcoxon U-statistic under generalized Lehmann alternatives. The Annals of Mathematical Statistics 37 (1), 284–286.
- Silver and Dunlap (1987) Silver, N. C., Dunlap, W. P., 1987. Averaging correlation coefficients: should Fisher’s z transformation be used? Journal of Applied Psychology 72 (1), 146–148.
- Stephan et al. (2009) Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J., Friston, K. J., 2009. Bayesian model selection for group studies. NeuroImage 46 (4), 1004–1017.
- Stouffer et al. (1949) Stouffer, S., Suchman, E., DeVinney, L., Star, S., Williams, R. J., 1949. The American Soldier, Vol. 1: Adjustment during Army Life. Princeton University Press, Princeton.
- Theiler and Prichard (1997) Theiler, J., Prichard, D., 1997. Using ‘surrogate surrogate data’ to calibrate the actual rate of false positives in tests for nonlinearity in time series. Fields Institute Communications 11, 99–113.
- Thirion et al. (2007) Thirion, B., Pinel, P., Mériaux, S., Roche, A., Dehaene, S., Poline, J.-B., 2007. Analysis of a large fMRI cohort: Statistical and methodological issues for group analyses. NeuroImage 35 (1), 105–120.
- Veroniki et al. (2016) Veroniki, A. A., Jackson, D., Viechtbauer, W., Bender, R., Bowden, J., Knapp, G., Kuss, O., Higgins, J., Langan, D., Salanti, G., 2016. Methods to estimate the between-study variance and its uncertainty in meta-analysis. Research Synthesis Methods 7 (1), 55–79.
- Welch (1947) Welch, B. L., 1947. The generalization of ‘Student’s’ problem when several different population variances are involved. Biometrika 34 (1/2), 28–35.
- Whitlock (2005) Whitlock, M. C., 2005. Combining probability from independent tests: the weighted Z-method is superior to Fisher’s approach. Journal of Evolutionary Biology 18 (5), 1368–1373.
- Wilcoxon (1945) Wilcoxon, F., 1945. Individual comparisons by ranking methods. Biometrics Bulletin 1 (6), 80–83.
- Winkler et al. (2015) Winkler, I., Haufe, S., Porbadnigk, A. K., Müller, K.-R., Dähne, S., 2015. Identifying Granger causal relationships between neural power dynamics and variables of interest. NeuroImage 111, 489–504.
- Woltman et al. (2012) Woltman, H., Feldstain, A., MacKay, J. C., Rocchi, M., 2012. An introduction to hierarchical linear modeling. Tutorials in Quantitative Methods for Psychology 8 (1), 52–69.
- Worsley et al. (2002) Worsley, K. J., Liao, C., Aston, J., Petre, V., Duncan, G., Morales, F., Evans, A., 2002. A general statistical analysis for fMRI data. NeuroImage 15 (1), 1–15.