A Hitchhiker’s Guide to Statistical Comparisons of Reinforcement Learning Algorithms

Cédric Colas
INRIA - Flowers Team
Bordeaux, France
Olivier Sigaud
Sorbonne University - ISIR
Paris, France
Pierre-Yves Oudeyer
INRIA - Flowers team
Bordeaux, France
cedric.colas@inria.fr
Abstract

Consistently checking the statistical significance of experimental results is the first mandatory step towards reproducible science. This paper presents a hitchhiker’s guide to rigorous comparisons of reinforcement learning algorithms. After introducing the concepts of statistical testing, we review the relevant statistical tests and compare them empirically in terms of false positive rate and statistical power as a function of the sample size (number of seeds) and effect size. We further investigate the robustness of these tests to violations of the most common hypotheses (normal distributions, same distributions, equal variances). Besides simulations, we compare empirical distributions obtained by running Soft Actor-Critic and Twin-Delayed Deep Deterministic Policy Gradient on Half-Cheetah. We conclude by providing guidelines and code to perform rigorous comparisons of RL algorithm performances.

1 Introduction

Reproducibility in Machine Learning, and in Reinforcement Learning (RL) in particular, has become a serious issue in recent years. As pointed out in islam2017reproducibility and henderson2017deep, reproducing the results of an RL paper can turn out to be much more complicated than expected. In a thorough investigation, henderson2017deep showed this can be caused by differences in codebases, hyperparameters (e.g. size of the network, activation functions) or the number of random seeds used by the original study. henderson2017deep states the obvious: the claim that an algorithm performs better than another should be supported by evidence, which requires the use of statistical tests. Building on these observations, this paper presents a hitchhiker’s guide for statistical comparisons of RL algorithms. The performance measures of RL algorithms have specific characteristics (realizations are independent of each other, they are not paired between algorithms, etc.). This paper reviews statistical tests relevant in that context and compares them in terms of false positive rate and statistical power. Besides simulations, it compares empirical distributions obtained by running Soft Actor-Critic (sac) (haarnoja2018soft) and Twin-Delayed ddpg (td3) (fujimoto2018addressing) on Half-Cheetah (brockman2016openai). We finally provide guidelines to perform robust difference testing in the context of RL. A repository containing the raw results and the code to reproduce all experiments is available at https://github.com/ccolas/rl_stats.

2 Comparing RL Algorithms: Problem Definition

2.1 Model

Figure 1: Two normal distributions representing the performances of two algorithms. Dashed lines: performance measures (realizations). Plain lines: empirical means of the two samples.

In this paper, we consider the problem of conducting meaningful comparisons of Algorithm 1 and Algorithm 2. Because the seed of the random generator is different for each run (and yes, it should be: the random seed is not a hyperparameter), two runs of the same algorithm yield different measures of performance. The performance of an algorithm can therefore be modeled as a random variable X, characterized by a distribution. Measuring the performance at the end of a particular run is equivalent to measuring a realization of that random variable. Repeating this N times, we obtain a sample of size N.

To compare RL algorithms on the basis of their performances, we focus on the comparison of their central tendencies μ1 and μ2: the means or the medians of the associated random variables X1 and X2. (Because of space constraints, we do not investigate other possible criteria for comparing RL algorithms, e.g. lower variance, high minimal performance, area under the learning curve, etc.) Unfortunately, μ1 and μ2 cannot be known exactly. Given a sample of Xi, we can estimate μi by the empirical mean x̄i (resp. the empirical median). However, comparing central performances does not simply boil down to comparing their estimates. As an illustration, Figure 1 shows two normal distributions describing the performance distributions X1 and X2 of two algorithms, from which two samples are collected. In this example, the ordering of the empirical means disagrees with the ordering of the true means. The rest of this text uses central performance to refer to either the mean or the median of the performance distribution; it is noted μ, while its empirical estimate is noted x̄. The distinction is made where necessary.

2.2 A Few Definitions

Statistical difference testing.

Statistical difference testing offers a principled way to compare the central performances of two algorithms. It defines two hypotheses: 1) the null hypothesis H0: Δμ = μ1 − μ2 = 0, and 2) the alternative hypothesis H1: Δμ ≠ 0. When performing a test, one initially assumes the null hypothesis to be true. Having observed the samples, statistical tests then estimate the probability of observing two samples whose empirical central difference Δx̄ is at least as extreme as the observed one under H0 (i.e. given Δμ = 0). This probability is called the p-value. If the p-value is very low, the test rejects H0 and concludes that a true underlying difference (Δμ ≠ 0) is likely. When the p-value is high, the test does not have enough evidence to conclude. This could be due to the lack of a true difference, or to the lack of statistical power (too few measurements given how noisy they are). The significance level α (usually α = 0.05) draws the line between rejection and conservation of H0: if the p-value is below α, H0 is rejected.

                 H0 true           H1 true
Predicted H0     True negative     False negative
Predicted H1     False positive    True positive
Table 1: Hypothesis testing.
Statistical errors.

Note that a p-value of 0.05 still results in 1 chance out of 20 of claiming a difference that does not exist. This is called a type-I error, or false positive. The false positive rate is usually noted α, just like the significance level. Indeed, statistical tests with significance level α are supposed to enforce a false positive rate of α. Further experiments demonstrate that this is not always the case, which is why we note the empirical false positive rate α*. False negatives occur when the statistical test fails to recognize a true difference in the central performances. This depends on the size of the underlying difference: the larger the difference, the lower the risk of a false negative. The false negative rate is noted β.

Trade-off between false positive and statistical power.

Ideally, we would like to set α as low as possible, to ensure the lowest possible false positive rate α*. However, decreasing α makes the statistical test more conservative: the test requires ever bigger empirical differences to reject H0, which decreases the probability of true positives. This probability of true positives is called the statistical power of a test. It is the probability of rejecting H0 when H1 holds. It is directly impacted by the effect size: the larger the effect size, the easier it is to detect (larger statistical power). It is also a direct function of the sample size: larger samples bring more evidence to support the rejection of H0. Generally, the sample size is chosen so as to obtain a theoretical statistical power of 0.8. Different tests have different statistical powers depending on the assumptions they make, whether those assumptions are met, how the p-value is derived, etc.

Parametric vs. non-parametric.

Parametric tests usually compare the means of two distributions by making assumptions on the distributions of the two algorithms’ performances. Non-parametric tests, on the other hand, usually compare the medians and do not require assumptions on the type of distributions. Non-parametric tests are often recommended when one wants to compare medians rather than means, when the data is skewed, or when the sample size is small. Section 4.2 shows that these recommendations are not always justified.

Test statistic.

Statistical tests usually rely on a test statistic, a numerical quantity computed from the samples that summarizes the data. In the t-test for instance, with two samples of equal size N, the statistic is computed as t = Δx̄ / (σ_pool · √(2/N)), where σ_pool is the pooled standard deviation, σ_pool = √((σ1² + σ2²)/2). Under the t-test assumptions, this statistic follows a Student’s t distribution. The p-value, i.e. the probability of obtaining a difference at least as large as the observed one under H0, is then computed as the area under the Student density where the statistic is more extreme than the observed value of t.
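To make this concrete, here is a minimal Python sketch (our own helper, not part of the paper’s code) that computes the pooled standard deviation, the t statistic and the two-sided p-value by hand, and checks the result against the Scipy implementation:

import numpy as np
from scipy import stats

def t_test_by_hand(x1, x2):
    # Two-sample t-test with pooled variance, assuming equal sample sizes.
    x1, x2 = np.asarray(x1), np.asarray(x2)
    n = x1.size
    sigma_pool = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    t = (x1.mean() - x2.mean()) / (sigma_pool * np.sqrt(2 / n))
    p_value = 2 * stats.t.sf(np.abs(t), 2 * n - 2)  # area in both tails
    return t, p_value

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, size=20)  # performances of Algorithm 1
x2 = rng.normal(0.5, 1.0, size=20)  # performances of Algorithm 2
print(t_test_by_hand(x1, x2))
print(stats.ttest_ind(x1, x2, equal_var=True))  # same statistic and p-value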

Relative effect size.

The relative effect size ε is the absolute effect size Δμ scaled by the pooled standard deviation σ_pool, such that ε = Δμ / σ_pool. The relative effect size is independent of the spread of the considered distributions.
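For instance, a relative effect size can be estimated from two samples as follows (a small sketch; the function name is ours):

import numpy as np

def relative_effect_size(x1, x2):
    # epsilon: absolute mean difference scaled by the pooled standard deviation
    x1, x2 = np.asarray(x1), np.asarray(x2)
    sigma_pool = np.sqrt((x1.var(ddof=1) + x2.var(ddof=1)) / 2)
    return np.abs(x1.mean() - x2.mean()) / sigma_pool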

3 Statistical Tests for RL

3.1 Assumptions in the Context of RL

Each test makes some assumptions (e.g. about the nature of the performance distributions, their variances, the sample sizes, etc.). In the context of RL, some assumptions are reasonable while others are not. It is reasonable to assume that RL performances are measured at random and independently from one another. The samples are not paired, and here we assume they have the same size (this assumption could be relaxed, as none of the tests requires it). Other common assumptions are more debatable:

  • Normal distributions of performances: this might not be the case (skewed distributions, bimodal distributions, truncated distributions).

  • Continuous performances: the support of the performance distribution might be bounded: e.g. in the Fetch environments of Gym (brockman2016openai), the performance is a success rate in [0, 1].

  • Known standard deviations: this is not the case in RL.

  • Equal standard deviations: this is often not the case (see (henderson2017deep)).

3.2 Relevant Statistical Tests

This section briefly presents various statistical tests relevant to the comparison of RL performances. It focuses on the underlying assumptions (kanji2006100) and provides the corresponding implementation from the Python Scipy library when available. Further details can be found in any statistical textbook. Contrary to henderson2017deep, we do not recommend using the Kolmogorov-Smirnov test as it tests for the equality of the two distributions and does not test for a difference in their central tendencies (smirnov1948table).

T-test.

This parametric test compares the means of two distributions and assumes the two distributions have equal variances (student1908probable). If the variance is known, a more powerful test is available: the Z-test for two population means. The test is accurate when the two distributions are normal; otherwise it only gives an approximate guide. Implementation: scipy.stats.ttest_ind(x1, x2, equal_var=True).

Welch’s t-test.

It is a t-test where the assumption of equal variances is relaxed (welch1947generalization). Implementation: scipy.stats.ttest_ind(x1, x2, equal_var=False).

Wilcoxon Mann-Whitney rank sum test.

This non-parametric test compares the medians of two distributions. It does not make assumptions about the type of distributions, but assumes they are continuous and have the same shape and spread (wilcoxon1945individual). Implementation: scipy.stats.mannwhitneyu(x1, x2, alternative='two-sided').

Ranked t-test.

This non-parametric test compares medians: all realizations are ranked together before being fed to a traditional t-test. conover1981rank shows that the resulting statistic is a monotonically increasing function of the statistic computed by the Wilcoxon Mann-Whitney test, making the two tests very similar. Implemented in our code.

Bootstrap confidence interval test.

In the bootstrap test, the sample is considered to be an approximation of the original distribution (efron1994introduction). Given two observed samples of size N, we obtain two bootstrap samples of size N by sampling with replacement in each observed sample respectively, and compute the difference in empirical means. This procedure is repeated a large number of times. The interval between the α/2 and 1 − α/2 percentiles of the resulting differences is considered to be the (1 − α) × 100% confidence interval around the true mean difference Δμ. If it does not include 0, the test rejects the null hypothesis with confidence level α. This test does not require any assumptions on the performance distributions, but we will see that it requires large sample sizes. Implementation: https://github.com/facebookincubator/bootstrapped.
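Below is a minimal sketch of this procedure for the difference of means (our own implementation, distinct from the library linked above; the number of bootstrap samples is an arbitrary default):

import numpy as np

def bootstrap_ci_test(x1, x2, alpha=0.05, n_boot=10000, seed=0):
    # Reject H0 if the bootstrap confidence interval of the mean difference excludes 0.
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1), np.asarray(x2)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        b1 = rng.choice(x1, size=x1.size, replace=True)  # resample with replacement
        b2 = rng.choice(x2, size=x2.size, replace=True)
        diffs[b] = b1.mean() - b2.mean()
    low, high = np.percentile(diffs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return not (low <= 0.0 <= high)  # True means H0 is rejected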

Permutation test.

Under the null hypothesis, the realizations of both samples come from distributions with the same mean. The empirical mean difference Δx̄ should therefore not be affected (on average) by relabelling the realizations. The permutation test permutes the realization labels and computes the resulting mean difference. This procedure is repeated a large number of times. H0 is rejected if the proportion of permuted differences that are at least as extreme as the observed difference is lower than α. Implemented in our code.
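A minimal sketch of a two-sided permutation test on the difference of means (function name and defaults are ours):

import numpy as np

def permutation_test(x1, x2, alpha=0.05, n_perm=10000, seed=0):
    # Two-sided permutation test on the difference of empirical means.
    rng = np.random.default_rng(seed)
    x1, x2 = np.asarray(x1), np.asarray(x2)
    observed = np.abs(x1.mean() - x2.mean())
    pooled = np.concatenate([x1, x2])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # relabel the realizations
        diff = np.abs(pooled[:x1.size].mean() - pooled[x1.size:].mean())
        count += diff >= observed
    p_value = count / n_perm
    return p_value < alpha  # True means H0 is rejected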

4 Empirical Comparisons of Statistical Tests

This section compares the above statistical tests in terms of their false positive rates and statistical powers. The false positive rate estimates the probability of claiming that two algorithms perform differently when H0 holds. It directly impacts the reproducibility of a piece of research and should be as low as possible. The statistical power is the true positive rate and refers to the probability of finding evidence for an existing effect. The following study is an extension of the one performed in de2013using. We conduct experiments using models of RL distributions (analytic distributions) and true empirical RL distributions, collected by running a large number of trials of both sac (haarnoja2018soft) and td3 (fujimoto2018addressing) on Half-Cheetah-v2 (brockman2016openai) for 2M timesteps, using the Spinning Up implementation from OpenAI: https://github.com/openai/spinningup.

4.1 Methods

Figure 2: Candidate distributions to represent algorithm performances.
Investigating the case of non-normal distributions.

Several candidate distributions are selected to model RL performance distributions (Figure 2): a standard normal distribution, a log-normal distribution, and a bimodal distribution that is an even mixture of two normal distributions. All these distributions are tuned to share the same central tendency and the same standard deviation. In addition, we use two empirical distributions collected from sac and td3.

Investigating the case of unequal standard deviations.

To investigate the effect of unequal standard deviations, we tune the distribution parameters so that the standard deviation of Algorithm 2 is double that of Algorithm 1. We also compare sac and td3, which have different standard deviations.

Measuring false positive rates.

To measure false positive rates α*, we enforce H0 by aligning the central performances of the two distributions (the median for the Mann-Whitney test and the ranked t-test, the mean for the others). Given one test, two distributions and a sample size, we draw a sample from each distribution and compare them using the test with significance level α = 0.05. We repeat this procedure a large number of times and estimate α* as the proportion of rejections. The standard error of this estimate is √(α*(1 − α*)/n_repet), where n_repet is the number of repetitions; it is smaller than the widths of the lines on the reported figures. This procedure is repeated for every test, every combination of distributions and several sample sizes (see the pseudo-code in the supplementary material).

Measuring true positive rates (statistical power).

Here, we enforce the alternative hypothesis H1 by sampling x1 from a given distribution centered on a given value (mean or median depending on the test), and x2 from a distribution whose mean (resp. median) is shifted by an effect size ε. Given one test, two distributions (the second being shifted) and a sample size, we repeat the procedure above and obtain an estimate of the true positive rate. Tables reporting the statistical powers for various effect sizes, sample sizes, tests and assumptions are available in the supplementary results.
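The two procedures only differ by the shift applied to the second distribution. As an illustration, here is a minimal sketch using normal distributions and Welch’s t-test (names and defaults are ours; the paper’s own loop is given as Algorithm 1 in the supplementary material):

import numpy as np
from scipy import stats

def rejection_rate(sample_size, effect_size=0.0, alpha=0.05, n_repet=10000, seed=0):
    # Proportion of rejections over repeated comparisons.
    # effect_size = 0 estimates the false positive rate; effect_size > 0 the statistical power.
    rng = np.random.default_rng(seed)
    rejections = 0
    for _ in range(n_repet):
        x1 = rng.normal(0.0, 1.0, size=sample_size)
        x2 = rng.normal(effect_size, 1.0, size=sample_size)  # second distribution shifted
        _, p = stats.ttest_ind(x1, x2, equal_var=False)  # Welch's t-test
        rejections += p < alpha
    return rejections / n_repet

print(rejection_rate(sample_size=10))                   # should be close to alpha
print(rejection_rate(sample_size=10, effect_size=1.0))  # statistical power for this effect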

4.2 Results: Comparison of False Positive Rates

Same distributions, equal standard deviations.

Figures 3(a) and 3(b) show the false positive rates as a function of the sample size (number of seeds) for the various tests, when the samples are drawn from (a): the same standard normal distribution (ideal situation, all assumptions met), and (b): the same bimodal distribution. Given the number of repetitions used to estimate α*, we can directly compare the mean estimates (the lines) to the significance level α, the standard errors being smaller than the widths of these lines (we reproduced all the results twice, hardly seeing any difference in the figures). α* is very large when using the bootstrap test, unless large sample sizes are used. With small sample sizes, the permutation test and the ranked t-test also show large α*. Results using two log-normal distributions show similar behaviors and can be found in the supplementary results.

Figure 3: False positive rates for same distributions, equal standard deviations. Both samples are drawn from the same distribution. (a): A standard normal distribution. (b): A bimodal distribution.
Same distributions, unequal standard deviations.

Here, we sample x1 from a distribution, and x2 from the same type of distribution with a doubled standard deviation. Comparing two normal distributions with different standard deviations does not differ much from the case with equal standard deviations. Figure 4(a) (bimodal distributions) shows that the Mann-Whitney test and the ranked t-test (median tests) constantly overestimate α*, no matter the sample size. For log-normal distributions on the other hand (Figure 4(b)), the false positive rate of these tests respects the confidence level once the sample size is large enough. However, the other tests then tend to show large α*, even for large sample sizes.

Figure 4: False positive rates for same distributions, different standard deviations. x1 and x2 are drawn from the same type of distribution, centered on the same value (mean or median), the standard deviation of the second being double that of the first. (a): Two bimodal distributions. (b): Two log-normal distributions.
Different distributions, equal standard deviations.

Now we compare samples coming from different distributions with equal standard deviations. Comparing normal and bimodal distributions of equal standard deviation does not impact the false positive rate curves much (similar to Figure 3(a)). However, Figures 5(a) and 5(b) show that when one of the two distributions is skewed (log-normal), the Mann-Whitney test and the ranked t-test demonstrate very high false positive rates, a phenomenon that gets worse with larger sample sizes. Section 4.5 discusses why this might be the case.

Figure 5: False positive rates for different distributions, equal standard deviations. x1 and x2 are drawn from two different distributions, centered on the same value (mean or median), with equal standard deviations. (a): Normal and log-normal distributions. (b): Bimodal and log-normal distributions.
Different distributions, unequal standard deviations.

We now combine different distributions and different standard deviations. As before, comparing a skewed distribution (log-normal) and a symmetric one leads to high false positive rates for the Mann-Whitney test and the ranked t-test (Figures 6(a) and 6(b)). Comparing a normal distribution and a skewed log-normal with a higher standard deviation leads to high false positive rates for all the other tests as well, even with large sample sizes (Figure 6(a)).

Figure 6: False positive rates for different distributions, different standard deviations. x1 and x2 are drawn from two different distributions, centered on the same value (mean or median), the standard deviation of the second being double that of the first. (a): Normal and log-normal distributions. (b): Bimodal and log-normal distributions.

4.3 Results: Comparison of Statistical Powers

All tests show similar estimates of statistical power. The number of samples needed to detect an effect with high probability grows quickly as the relative effect size decreases: detecting a small effect requires far more seeds than detecting a medium or large one. Tables reporting the statistical power for all conditions, tests, sample sizes and relative effect sizes are provided in the supplementary results.

4.4 Results: Comparison of Real RL Distributions: SAC and TD3

Figure 7: False positive rates when comparing sac and td3. x1 is drawn from the sac performances, x2 from the td3 performances. Both are centered on the same value (mean or median).

Finally, we compare two empirical distributions obtained by running two RL algorithms (sac, td3) a large number of times each on Half-Cheetah. We observe a small increase in false positive rates when using the ranked t-test (Figure 7). The relative effect size estimated from the empirical distributions is in favor of sac, whether the median or the mean is used. For such relative effect sizes, the sample sizes required to achieve a statistical power of 0.8 are larger for tests comparing the median than for tests comparing the mean (see the full table in the supplementary results). With a small sample size, the Welch’s t-test would require a substantially larger effect size to be detected with the same probability.

4.5 Discussion of Empirical Results

No matter the distributions.

From the above results, it seems clear that the bootstrap test and the permutation test should never be used with small sample sizes, the bootstrap test being the most demanding in this respect. The bootstrap test, in particular, uses the sample as an estimate of the true performance distribution. A small sample is a very noisy estimate, which leads to very high false positive rates. The ranked t-test shows a false positive rate of 0 and a statistical power of 0 when N = 2, in all conditions. As noted in (de2013using), comparing two samples of size 2 can result in only four possible p-values (only a few possible orderings of the ranks), none of which falls below 0.05. Such quantization issues make this test unreliable for small sample sizes; see (de2013using) for further comments and references on this issue.

When distributions do not meet assumptions.

In addition to the behaviors reported above, Section 4.2 shows that non-parametric tests (Mann-Whitney and ranked t-test) can demonstrate very high false positive rates when comparing a symmetric distribution with a skewed one (log-normal). This effect gets worse with the sample size. When the sample size increases, the number of realizations drawn in the skewed tail of the log-normal increases. All these realizations are ranked above any realization from the other distribution. Therefore, the larger the sample size, the more realizations are ranked in favor of the log-normal, which biases the statistical test. This problem does not occur when two log-normal distributions are compared to one another. Comparing a skewed distribution to a symmetric one violates the Mann-Whitney assumption that the two distributions have the same shape and spread. The false positive rates of the Mann-Whitney and ranked t-test are also above the confidence level whenever a bimodal distribution is compared to another distribution. The traditional recommendation to use non-parametric tests when the distributions are not normal thus fails when the two distributions are different.

Most robust tests.

The t-test and the Welch’s t-test were found to be more robust than the others to violations of their assumptions. However, their false positive rate α* was found to be slightly above the required significance level when at least one of the two distributions is skewed (log-normal), no matter the sample size, and when one of the two distributions is bimodal, for small sample sizes. Welch’s α* is always slightly lower than the t-test’s.

Statistical power.

Except for the anomalies mentioned above for small sample sizes, due to over-confident tests like the bootstrap or permutation tests, statistical powers stay qualitatively stable no matter the distributions compared or the test used.

5 Guidelines for Comparison of RL Algorithm Performances

Measuring the performance of RL Algorithms.

Before using any statistical test, one must obtain measures of performance. RL algorithms should ideally be evaluated offline. The performance of an algorithm after a given number of training steps is measured as the average return over several evaluation episodes conducted independently from training, usually using a deterministic version of the current policy. Evaluating agents by averaging performances over several training episodes results in a much less interpretable performance measure and should be stated clearly. The collection of performance measures over training forms a learning curve.
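As an illustration, an offline evaluation loop in the classic Gym API could look like the sketch below (policy and env are placeholders, and the deterministic flag is an assumption about the policy interface):

import numpy as np

def evaluate(policy, env, n_eval_episodes=10):
    # Average undiscounted return over evaluation episodes run independently from training.
    returns = []
    for _ in range(n_eval_episodes):
        obs, done, episode_return = env.reset(), False, 0.0
        while not done:
            action = policy(obs, deterministic=True)  # deterministic version of the policy
            obs, reward, done, _ = env.step(action)   # classic 4-tuple Gym step
            episode_return += reward
        returns.append(episode_return)
    return np.mean(returns)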

Representing learning curves.

After obtaining a learning curve for each run, the curves can be rendered on a plot. At each evaluation, one can represent either the empirical mean or the empirical median. Whereas the empirical median directly represents the center of the collected sample, the empirical mean models the sample as coming from a Gaussian distribution, and under this assumption is the maximum likelihood estimate of that center. Error bars should also be added to this plot. The standard deviation (SD) represents the variability of the performances, but is only representative when the values are approximately normally distributed. When the distribution is not normal, one should prefer to represent interpercentile ranges. If the sample size is small, the most informative solution is to represent all learning curves in addition to the mean or median. When performances are normally distributed, the standard error of the mean (SE) or confidence intervals can be used to represent the uncertainty on the mean estimate.
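A possible plotting routine is sketched below, assuming a matrix of learning curves of shape (n_seeds, n_evals); the 25th-75th interpercentile range is an arbitrary choice of ours:

import numpy as np
import matplotlib.pyplot as plt

def plot_learning_curves(steps, curves, use_median=False):
    # curves: array of shape (n_seeds, n_evals)
    curves = np.asarray(curves)
    if use_median:
        center = np.median(curves, axis=0)
        low, high = np.percentile(curves, [25, 75], axis=0)  # interpercentile range
        label = 'median (25th-75th percentiles)'
    else:
        center = curves.mean(axis=0)
        sem = curves.std(axis=0, ddof=1) / np.sqrt(curves.shape[0])  # standard error of the mean
        low, high = center - sem, center + sem
        label = 'mean ± SE'
    plt.plot(steps, center, label=label)
    plt.fill_between(steps, low, high, alpha=0.3)
    plt.xlabel('training steps')
    plt.ylabel('performance')
    plt.legend()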

Robust comparisons. Which test, which sample sizes?

The results in Section 4.2 advocate for the use of Welch’s t-test, which shows a lower false positive rate than the other tests and similar statistical power. However, the false positive rate often remains above the significance level when the distributions are not normal. When in doubt, we recommend using a lower significance level α to make sure that the empirical false positive rate α* stays below the desired threshold. The number of random seeds needed to achieve a given statistical power depends on the expected relative effect size: the smaller the effect, the more seeds are required. In the real case comparing the sac and td3 algorithms, a moderate number of seeds was enough to detect the relatively strong effect when comparing the means, and more seeds were needed when comparing the medians. Small sample sizes would require much larger effects to be detected.
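In practice, the recommended comparison of final performances could look like the following sketch, where perf_1 and perf_2 hold one final performance per random seed:

from scipy import stats

def compare_algorithms(perf_1, perf_2, alpha=0.05):
    # Welch's t-test on the final performances of two algorithms.
    t, p = stats.ttest_ind(perf_1, perf_2, equal_var=False)
    if p < alpha:
        print('significant difference in mean performance (t = %.2f, p = %.3f)' % (t, p))
    else:
        print('no evidence of a difference: lack of effect, or not enough seeds')
    return p < alpha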

A word on multiple comparisons.

When performing multiple comparisons (e.g. between different pairs of algorithms evaluated in the same setting), the probability of having at least one false positive increases (roughly linearly) with the number of comparisons K. This probability is called the Family-Wise Error Rate (FWER). To keep it under control, one must apply corrections. The Bonferroni correction, for instance, divides the significance level by the number of comparisons: α_Bonf = α / K (bonferroni1936teoria). This ensures that the FWER stays below the initial α. Using this correction makes each individual test more conservative and decreases its statistical power.
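A Bonferroni correction boils down to dividing the significance level by the number of comparisons, as in the short sketch below:

def bonferroni_corrected_alpha(alpha, n_comparisons):
    # Per-comparison significance level ensuring FWER <= alpha.
    return alpha / n_comparisons

# e.g. 5 pairwise comparisons at a family-wise level of 0.05
corrected_alpha = bonferroni_corrected_alpha(0.05, 5)  # each test then uses alpha = 0.01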

Comparing full learning curves.

Instead of only comparing the final performances of the two algorithms at the end of training, we can compare performances along the learning process. This consists of performing a statistical comparison at every evaluation step. It might reveal differences in convergence speed and can provide more robust comparisons. Further discussion of how this relates to the problem of multiple comparisons is given in the supplementary material.

6 Conclusion

In conclusion, this paper advocates for the use of Welch’s t-test with a low significance level to keep the false positive rate below the desired threshold. The sample size must be selected carefully depending on the expected relative effect size. It also warns against the use of other, less reliable tests: the bootstrap test and the permutation test for small sample sizes, and the Mann-Whitney and ranked t-test unless their assumptions are carefully checked. Using the t-test or Welch’s t-test with very small sample sizes usually leads to high false positive rates and would require very large relative effect sizes to show good statistical power. Larger sample sizes are generally required to meet the usual statistical power requirement for moderate relative effect sizes.

Acknowledgments

This research is financially supported by the French Ministère des Armées - Direction Générale de l’Armement.

References

7 Supplementary Methods

7.1 Pseudo-code

Algorithm 1 gives the pseudo-code of the experiment. The whole code can be found at https://github.com/ccolas/rl_stats. distributions refers to a list of pairs of distributions. When comparing tests in an equal-distribution setting, each pair contains twice the same type of distribution. When comparing in an unequal-variance setting, the standard deviation of the second distribution is doubled. The number of repetitions nb_repets is set in the code. The rejection variable refers to the rejection of the null hypothesis. The false positive rates can be found in results_array[:, :, 0, :] when there is no shift between the distributions (null effect size), while the statistical powers are found in results_array[:, :, 1:, :].

Input: distributions, tests, nb_repets, effect_sizes, sample_sizes, alpha
Initialize: results_array of size (nb_distrib, nb_tests, nb_effects, nb_sample_sizes)
for i_d, distrib in distributions do
    for i_t, test in tests do
        for i_e, effect_size in effect_sizes do
            for i_ss, N in sample_sizes do
                rejection_list = []
                for i_r = 1 to nb_repets do
                    distrib[1].shift(effect_size)
                    sample1 = distrib[0].sample(N)
                    sample2 = distrib[1].sample(N)
                    rejection_list.append(test.test(sample1, sample2, alpha))
                results_array[i_d, i_t, i_e, i_ss] = mean(rejection_list)
Algorithm 1: Comparison of statistical tests

7.2 Correcting for Multiple Comparison when Comparing Learning Curves

The correction to apply when comparing two learning curves depends 1) on the number of comparisons, and 2) on the criterion used to conclude whether an algorithm is better than the other. The criterion used to draw a conclusion must be decided before running any test. An example can be: if, when comparing the last n performance measures of the two algorithms, more than a pre-defined number of comparisons show a significant difference, then Algorithm 1 is declared better than Algorithm 2. In that case, the number of comparisons performed is n, and the criterion is the threshold on the number of significant comparisons. We want to constrain the probability (FWER) that this criterion is met by pure chance to a confidence level α. To satisfy this constraint, the significance level used for each individual comparison must be corrected so that the resulting FWER stays below α.

8 Supplementary Results

8.1 Comparing same distributions with equal standard deviations.

Figure 8: False positive rates for same distributions, equal variances. Both samples are drawn from the same distribution. (a): A bimodal distribution. (b): A skewed log-normal distribution.
Small relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.048    0.024   0.000        0.000       0.298   0.300
3     0.072    0.046   0.000        0.128       0.229   0.122
5     0.106    0.089   0.065        0.114       0.206   0.105
10    0.179    0.186   0.167        0.184       0.256   0.182
20    0.336    0.340   0.321        0.332       0.378   0.341
30    0.480    0.478   0.458        0.449       0.513   0.477
40    0.604    0.592   0.567        0.576       0.611   0.588
50    0.691    0.693   0.678        0.680       0.717   0.693
100   0.943    0.940   0.929        0.932       0.947   0.940

Medium relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.094    0.045   0.000        0.000       0.456   0.461
3     0.155    0.115   0.000        0.258       0.411   0.251
5     0.284    0.269   0.205        0.289       0.461   0.295
10    0.560    0.553   0.506        0.550       0.646   0.556
20    0.870    0.862   0.857        0.850       0.894   0.869
30    0.970    0.966   0.957        0.960       0.974   0.969

Large relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.217    0.108   0.000        0.000       0.773   0.787
3     0.473    0.370   0.000        0.626       0.801   0.593
5     0.788    0.771   0.675        0.780       0.914   0.788
10    0.987    0.988   0.979        0.984       0.993   0.990

Table 2: Statistical power when comparing samples from two normal distributions with equal standard deviations. Each result is the proportion of true positives over all repetitions. In bold are results satisfying a true positive rate above the target statistical power.
Small relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.064    0.035   0.000        0.000       0.291   0.291
3     0.061    0.041   0.000        0.122       0.202   0.119
5     0.091    0.084   0.075        0.119       0.193   0.092
10    0.168    0.168   0.179        0.198       0.243   0.174
20    0.325    0.326   0.362        0.363       0.367   0.317
30    0.460    0.469   0.505        0.509       0.503   0.456
40    0.592    0.582   0.632        0.639       0.604   0.591
50    0.694    0.685   0.739        0.733       0.710   0.683
100   0.939    0.937   0.954        0.957       0.939   0.938

Medium relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.102    0.052   0.000        0.000       0.431   0.430
3     0.140    0.086   0.000        0.220       0.373   0.196
5     0.258    0.232   0.178        0.267       0.434   0.242
10    0.539    0.539   0.467        0.510       0.633   0.532
20    0.868    0.870   0.807        0.804       0.887   0.869
30    0.969    0.970   0.928        0.937       0.973   0.971

Large relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.198    0.103   0.000        0.000       0.723   0.725
3     0.388    0.296   0.000        0.547       0.792   0.514
5     0.786    0.776   0.619        0.735       0.912   0.757
10    0.994    0.994   0.966        0.973       0.996   0.994

Table 3: Statistical power when comparing samples from two bimodal distributions with equal standard deviations. The first is centered on a given value (mean or median depending on the test), the other is shifted by the relative effect size. Both have the same standard deviation. Each result is the proportion of true positives over all repetitions. In bold are results satisfying a true positive rate above the target statistical power.
Small relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.067    0.032   0.000        0.000       0.388   0.387
3     0.099    0.057   0.000        0.195       0.288   0.189
5     0.154    0.121   0.129        0.198       0.265   0.183
10    0.247    0.247   0.329        0.369       0.317   0.273
20    0.404    0.401   0.628        0.632       0.432   0.424
30    0.533    0.539   0.802        0.804       0.560   0.536
40    0.649    0.635   0.897        0.900       0.659   0.641
50    0.724    0.719   0.955        0.960       0.746   0.726
100   0.938    0.935   1.000        1.000       0.945   0.937

Medium relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.147    0.070   0.000        0.000       0.609   0.603
3     0.262    0.193   0.000        0.428       0.542   0.412
5     0.431    0.397   0.379        0.458       0.584   0.475
10    0.657    0.649   0.768        0.796       0.726   0.671
20    0.876    0.864   0.979        0.978       0.902   0.876
30    0.953    0.954   0.999        0.998       0.964   0.954

Large relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.357    0.191   0.000        0.000       0.858   0.860
3     0.642    0.534   0.000        0.769       0.858   0.744
5     0.838    0.812   0.738        0.801       0.916   0.843
10    0.960    0.959   0.985        0.988       0.979   0.964

Table 4: Statistical power when comparing samples from two log-normal distributions with equal standard deviations. The first is centered on a given value (mean or median depending on the test), the other is shifted by the relative effect size. Both have the same standard deviation. Each result is the proportion of true positives over all repetitions. In bold are results satisfying a true positive rate above the target statistical power.

8.2 Comparing same distributions with different standard deviations.

Small relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.062    0.030   0.000        0.000       0.310   0.314
3     0.084    0.058   0.000        0.152       0.244   0.138
5     0.110    0.097   0.079        0.115       0.217   0.120
10    0.191    0.180   0.174        0.195       0.261   0.189
20    0.342    0.333   0.329        0.315       0.385   0.330
30    0.476    0.477   0.454        0.466       0.509   0.479
40    0.602    0.585   0.574        0.573       0.622   0.598
50    0.690    0.696   0.664        0.674       0.713   0.693
100   0.937    0.941   0.923        0.924       0.945   0.938

Medium relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.112    0.053   0.000        0.000       0.484   0.476
3     0.171    0.137   0.000        0.284       0.428   0.274
5     0.303    0.253   0.228        0.280       0.466   0.313
10    0.572    0.531   0.513        0.542       0.651   0.572
20    0.864    0.861   0.837        0.839       0.891   0.858
30    0.968    0.961   0.949        0.955       0.971   0.965

Large relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.248    0.130   0.000        0.000       0.783   0.787
3     0.479    0.371   0.000        0.655       0.808   0.621
5     0.785    0.734   0.676        0.756       0.915   0.800
10    0.988    0.983   0.973        0.979       0.994   0.987

Table 5: Statistical power when comparing samples from two log-normal distributions with different standard deviations. The first is centered on a given value (mean or median depending on the test), the other is shifted by the relative effect size and its standard deviation is doubled. Each result is the proportion of true positives over all repetitions. In bold are results satisfying a true positive rate above the target statistical power.
Small relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.115    0.067   0.000        0.000       0.283   0.280
3     0.117    0.088   0.000        0.132       0.206   0.126
5     0.092    0.080   0.044        0.071       0.205   0.088
10    0.168    0.159   0.094        0.112       0.241   0.163
20    0.312    0.310   0.174        0.169       0.370   0.318
30    0.472    0.440   0.218        0.225       0.496   0.455
40    0.587    0.587   0.266        0.277       0.615   0.590
50    0.685    0.690   0.331        0.322       0.708   0.688
100   0.943    0.941   0.551        0.548       0.941   0.943

Medium relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.164    0.096   0.000        0.000       0.374   0.380
3     0.137    0.112   0.000        0.182       0.371   0.175
5     0.229    0.188   0.121        0.197       0.411   0.212
10    0.528    0.507   0.325        0.366       0.621   0.509
20    0.871    0.857   0.631        0.624       0.896   0.872
30    0.972    0.974   0.797        0.805       0.974   0.971
40    0.995    0.993   0.897        0.896       0.995   0.995
50    0.999    0.999   0.954        0.954       0.999   0.999

Large relative effect size:
N     t-test   Welch   Mann-Whit.   r. t-test   boot.   permut.
2     0.267    0.162   0.000        0.000       0.768   0.764
3     0.319    0.189   0.000        0.600       0.770   0.571
5     0.787    0.722   0.690        0.800       0.923   0.816
10    0.997    0.997   0.981        0.987       0.999   0.998

Table 6: Statistical power when comparing samples from two bimodal distributions with different standard deviations. The first is centered on a given value (mean or median depending on the test), the other is shifted by the relative effect size and its standard deviation is doubled. Each result is the proportion of true positives over all repetitions. In bold are results satisfying a true positive rate above the target statistical power.
Small relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.057   0.030   0.000       0.000      0.408   0.408
3    0.056   0.038   0.000       0.299      0.242   0.157
5    0.092   0.063   0.246       0.342      0.199   0.145
10   0.185   0.168   0.588       0.612      0.233   0.249
20   0.394   0.379   0.898       0.901      0.389   0.445
30   0.571   0.560   0.980       0.983      0.549   0.607
40   0.693   0.693   0.997       0.997      0.674   0.732
50   0.802   0.800   0.999       0.999      0.781   0.816
100  0.980   0.978   1.000       1.000      0.974   0.983
Medium relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.147   0.072   0.000       0.000      0.680   0.685
3    0.264   0.181   0.000       0.631      0.581   0.474
5    0.464   0.401   0.604       0.707      0.637   0.567
10   0.749   0.728   0.956       0.966      0.787   0.801
20   0.951   0.945   1.000       1.000      0.951   0.964
Large relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.419   0.230   0.000       0.000      0.924   0.930
3    0.722   0.596   0.000       0.911      0.921   0.842
5    0.904   0.868   0.910       0.935      0.959   0.944
10   0.989   0.986   0.999       1.000      0.993   0.994
Table 7: Statistical power when comparing samples from two log-normal distributions with different standard deviations. The first is centered at ( mean or median, depending on the test); the other is shifted from it by the relative effect size (). Their standard deviations differ ( and ). Each result is the proportion of true positives over repetitions. In bold are results with a true positive rate above .
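The log-normal samples used in Table 7 (and in Tables 8 and 9) must have a prescribed mean and standard deviation. Below is a minimal sketch of one way to obtain them, assuming a standard log-normal draw that is rescaled and recentered; the function name sample_lognormal and the shape parameter are illustrative, not necessarily the paper's exact parameterization.

import numpy as np

def sample_lognormal(n, mean, std, rng, shape=1.0):
    """Draw a log-normal sample, then rescale and recenter it so that its
    theoretical mean and standard deviation equal `mean` and `std`."""
    raw = rng.lognormal(mean=0.0, sigma=shape, size=n)
    raw_mean = np.exp(shape ** 2 / 2)                                 # E[raw]
    raw_std = np.sqrt((np.exp(shape ** 2) - 1) * np.exp(shape ** 2))  # Std[raw]
    return (raw - raw_mean) / raw_std * std + mean

rng = np.random.default_rng(0)
x1 = sample_lognormal(20, mean=0.0, std=1.0, rng=rng)  # reference sample
x2 = sample_lognormal(20, mean=0.5, std=1.0, rng=rng)  # shifted by eps = 0.5 (assumed)

Such a sampler can be passed directly to the estimated_power sketch given after Table 6.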

8.3 Comparing different distributions with equal standard deviations.

Small relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.040   0.016   0.000       0.000      0.270   0.267
3    0.047   0.032   0.000       0.181      0.187   0.099
5    0.076   0.062   0.125       0.189      0.164   0.088
10   0.155   0.145   0.330       0.357      0.211   0.169
20   0.315   0.320   0.628       0.621      0.352   0.335
30   0.483   0.484   0.797       0.792      0.505   0.494
40   0.611   0.615   0.894       0.903      0.620   0.619
50   0.729   0.725   0.951       0.953      0.731   0.725
100  0.958   0.959   0.999       0.999      0.959   0.960
Medium relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.078   0.042   0.000       0.000      0.477   0.472
3    0.137   0.098   0.000       0.390      0.411   0.263
5    0.277   0.251   0.368       0.480      0.457   0.307
10   0.600   0.584   0.785       0.805      0.666   0.613
20   0.912   0.910   0.981       0.979      0.924   0.919
Large relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.230   0.113   0.000       0.000      0.856   0.841
3    0.513   0.392   0.000       0.838      0.864   0.709
5    0.863   0.815   0.872       0.927      0.946   0.880
10   0.997   0.997   0.999       0.999      0.999   0.999
Table 8: Statistical power when comparing samples from a normal distribution and a log-normal distribution with equal standard deviations. The first is centered at ( mean or median, depending on the test); the other is shifted from it by the relative effect size (). Each result is the proportion of true positives over repetitions. In bold are results with a true positive rate above .
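The six columns of these tables correspond to six difference tests. The sketch below shows, for one pair of samples, how each can be run with standard SciPy/NumPy calls; the ranked t-test, bootstrap and permutation variants shown here (two-sided, difference of means, 10,000 resamples) are assumed implementations and may differ in detail from those used to produce the tables.

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 10)                         # sample from a normal
x2 = rng.lognormal(0.0, 1.0, 10) - np.exp(0.5) + 0.5  # shifted log-normal (assumed parameters)

# t-test and Welch's t-test (unequal variances)
_, p_ttest = stats.ttest_ind(x1, x2, equal_var=True)
_, p_welch = stats.ttest_ind(x1, x2, equal_var=False)

# Mann-Whitney U test
_, p_mw = stats.mannwhitneyu(x1, x2, alternative='two-sided')

# ranked t-test: Welch's t-test applied to the ranks of the pooled data
ranks = stats.rankdata(np.concatenate([x1, x2]))
_, p_ranked = stats.ttest_ind(ranks[:len(x1)], ranks[len(x1):], equal_var=False)

# bootstrap test: reject if the 95% bootstrap confidence interval of
# mean(x1) - mean(x2) excludes 0
boot_diffs = np.array([np.mean(rng.choice(x1, len(x1))) - np.mean(rng.choice(x2, len(x2)))
                       for _ in range(10_000)])
low, high = np.percentile(boot_diffs, [2.5, 97.5])
reject_boot = not (low <= 0.0 <= high)

# permutation test on the difference of means
observed = np.mean(x1) - np.mean(x2)
pooled = np.concatenate([x1, x2])
perm_diffs = np.array([np.mean(p[:len(x1)]) - np.mean(p[len(x1):])
                       for p in (rng.permutation(pooled) for _ in range(10_000))])
p_perm = np.mean(np.abs(perm_diffs) >= abs(observed))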
Small relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.100   0.055   0.000       0.000      0.337   0.330
3    0.101   0.078   0.000       0.102      0.276   0.143
5    0.149   0.121   0.045       0.063      0.262   0.137
10   0.239   0.228   0.074       0.084      0.318   0.238
20   0.390   0.395   0.125       0.128      0.447   0.381
30   0.502   0.510   0.167       0.172      0.561   0.509
40   0.611   0.603   0.201       0.213      0.641   0.604
50   0.691   0.686   0.250       0.242      0.725   0.693
100  0.920   0.917   0.427       0.429      0.929   0.922
Medium relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.161   0.085   0.000       0.000      0.513   0.507
3    0.202   0.136   0.000       0.221      0.484   0.294
5    0.358   0.316   0.161       0.225      0.534   0.351
10   0.601   0.606   0.374       0.413      0.694   0.600
20   0.839   0.843   0.699       0.707      0.881   0.846
30   0.939   0.940   0.865       0.869      0.954   0.940
40   0.978   0.980   0.947       0.950      0.983   0.980
Large relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.275   0.158   0.000       0.000      0.808   0.809
3    0.539   0.390   0.000       0.585      0.819   0.647
5    0.804   0.781   0.613       0.719      0.898   0.792
10   0.955   0.953   0.956       0.962      0.976   0.956
Table 9: Statistical power when comparing samples from a log-normal distribution and a bimodal distribution with equal standard deviations. The first is centered at ( mean or median, depending on the test); the other is shifted from it by the relative effect size (). Each result is the proportion of true positives over repetitions. In bold are results with a true positive rate above .
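Tables 6 and 9 also involve a bimodal distribution. Below is a minimal sketch of one way to draw such a sample with a prescribed mean and standard deviation, assuming an equal-weight mixture of two Gaussians; the function name sample_bimodal and the separation parameter are illustrative assumptions, not the paper's exact construction.

import numpy as np

def sample_bimodal(n, mean, std, rng, separation=0.9):
    """Equal-weight mixture of two Gaussians placed symmetrically around `mean`.
    `separation` in (0, 1) sets the share of the variance that comes from the
    gap between the two modes rather than from the within-mode spread."""
    gap = std * np.sqrt(separation)           # distance from `mean` to each mode
    within = std * np.sqrt(1.0 - separation)  # within-mode standard deviation
    modes = rng.choice([-gap, gap], size=n) + mean
    return modes + rng.normal(0.0, within, size=n)

rng = np.random.default_rng(0)
x = sample_bimodal(100_000, mean=0.0, std=1.0, rng=rng)
print(x.mean(), x.std())  # both close to the requested 0.0 and 1.0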
Small relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.061   0.031   0.000       0.000      0.287   0.282
3    0.070   0.051   0.000       0.112      0.217   0.102
5    0.104   0.089   0.061       0.101      0.200   0.093
10   0.175   0.173   0.141       0.160      0.246   0.173
20   0.334   0.339   0.281       0.283      0.384   0.326
30   0.466   0.472   0.410       0.411      0.515   0.474
40   0.596   0.590   0.513       0.519      0.607   0.582
50   0.695   0.683   0.614       0.611      0.709   0.693
100  0.938   0.938   0.887       0.887      0.940   0.937
Medium relative effect size:
N    t-test  Welch   Mann-Whit.  r. t-test  boot.   permut.
2    0.107   0.051   0.000       0.000      0.430   0.425
3    0.145   0.099   0.000       0.230      0.406   0.212
5    0.261   0.244