Equitability, interval estimation, and statistical power
Abstract
As data sets grow in dimensionality, nonparametric measures of dependence have seen increasing use in data exploration due to their ability to identify nontrivial relationships of all kinds. One common use of these tools is to test a null hypothesis of statistical independence on all variable pairs in a data set. However, because this approach attempts to identify any nontrivial relationship no matter how weak, it is prone to identifying so many relationships — even after correction for multiple hypothesis testing — that meaningful followup of each one is impossible. What is needed is a way of identifying a smaller set of “strongest” relationships of all kinds that merit detailed further analysis.
Here we formally present and characterize equitability, a property of measures of dependence that aims to overcome this challenge. Notionally, an equitable statistic is a statistic that, given some measure of noise, assigns similar scores to equally noisy relationships of different types (e.g., linear, exponential, etc.) [1]. We begin by formalizing this idea via a new object called the interpretable interval, which functions as an interval estimate of the amount of noise in a relationship of unknown type. We define an equitable statistic as one with small interpretable intervals.
We then draw on the equivalence of interval estimation and hypothesis testing to show that under moderate assumptions an equitable statistic is one that yields well-powered tests for distinguishing not only between trivial and nontrivial relationships of all kinds but also between nontrivial relationships of different strengths, regardless of relationship type. This means that equitability allows us to specify a threshold relationship strength below which we are uninterested, and to search a data set for relationships of all kinds with strength greater than that threshold. Thus, equitability can be thought of as a strengthening of power against independence that enables fruitful analysis of data sets with a small number of strong, interesting relationships and a large number of weaker, less interesting ones. We conclude with a demonstration of how our two equivalent characterizations of equitability can be used to evaluate the equitability of a statistic in practice.
1 Introduction
Suppose we have a data set that we would like to explore to find pairwise associations of interest. A commonly taken approach that makes minimal assumptions about the structure in the data is to compute a measure of dependence, i.e., a statistic whose population value is nonzero exactly in cases of statistical dependence, on many candidate pairs of variables. The score of each variable pair can be evaluated against a null hypothesis of statistical independence, and variable pairs with significant scores can be kept for followup [2, 3]. When faced with this task, there is a wealth of measures of dependence from which to choose, each with a different set of properties [4, 5, 6, 7, 8, 9, 10, 11, 12, 13].
While this approach works well in some settings, it is unsuitable in many others due to the size of modern data sets. In particular, as data sets grow in dimensionality, the above approach often results in lists of significant relationships that are too large to allow for meaningful followup of every identified relationship. For example, in the gene expression data set analyzed in [14], several measures of dependence reliably identified thousands of significant relationships amounting to between and percent of the variable pairs in the data set. Given the extensive manual effort that is usually necessary to better understand each of these “hits”, further characterizing all of them is impractical.
A tempting way to deal with this challenge is to rank all the variable pairs in a data set according to the test statistic used (or according to p-value) and to examine only a small number of pairs with the most extreme values. However, this is a poor idea because, while a measure of dependence guarantees nonzero scores to dependent variable pairs, the magnitude of these nonzero scores can depend heavily on the type of dependence in question, thereby skewing the top of the list toward certain types of relationships over others. For example, if some measure of dependence systematically assigns higher scores to, say, linear relationships than to sinusoidal relationships, then using that measure to rank variable pairs in a large data set could cause noisy linear relationships in the data set to crowd out strong sinusoidal relationships from the top of the list. The natural result would be that the human examining the top-ranked relationships would never see the sinusoidal relationships, and they would not be discovered.
The consistency guarantee of measures of dependence is therefore not strong enough to solve the data exploration problem posed here. What is needed is a way not just to identify as many relationships of different kinds as possible in a data set, but also to identify a small number of strongest relationships of different kinds.
Here we formally present and characterize equitability, a framework for meeting this goal. In previous work, equitability was informally introduced as follows: an equitable measure of dependence is one that, given some measure of noise, assigns similar scores to equally noisy relationships, regardless of relationship type [1]. In this paper, we formalize this notion in the language of estimation theory and tie it to the theory of hypothesis testing.
Specifically, we define an object called the interpretable interval that functions as an interval estimate of the strength of a relationship of unknown type. That is, given a set of standard relationships on which we have defined a measure of relationship strength, the interpretable interval is a range of values that act as good estimates of the true relationship strength of a distribution, assuming it belongs to that set. In the same way that a good estimator has narrow confidence intervals, an equitable statistic is one that has narrow interpretable intervals. As we explain, this property can be viewed as a natural generalization of one of the “fundamental properties” described by Rényi in his framework for measures of dependence [15].
We then draw a connection between equitability and statistical power using the equivalence between interval estimation and hypothesis testing. This connection shows that whereas typical measures of dependence are analyzed in terms of power to distinguish nontrivial associations from statistical independence, under moderate assumptions an equitable statistic is one that can distinguish finely between relationships of two different strengths that may both be nontrivial, regardless of the types of the two relationships in question. This result gives us a new way to understand equitability as a natural strengthening of the requirement of power against independence in which we ask that our statistic be useful not just for detecting deviations of different types from independence but also for distinguishing strong relationships from weak relationships regardless of relationship type.
Finally, motivated by the connection between equitability and power, we define a new property, detection threshold, which, at some fixed sample size, is the minimal relationship strength $x$ such that a statistic’s corresponding independence test has a certain minimal power on relationships of all kinds with strength at least $x$. We show that low detection threshold is strictly weaker than high equitability in that high equitability implies it but the converse does not hold. Therefore, when equitability is too much to ask, low detection threshold on a broad set of relationships with respect to an interesting measure of relationship strength may be a reasonable surrogate goal.
Throughout this paper, we give concrete examples of how our formalism relates to the analysis of equitability in practice. Indeed, the purpose of the theoretical framework provided here is to allow for such practical analyses, and so we close with a demonstration of an empirical analysis of the equitability of several popular measures of dependence.
This paper is accompanied by two companion papers. The first [4] introduces two new statistics that aim for good equitability on functional relationships and good power against statistical independence, respectively. The second [16] conducts a comprehensive empirical analysis of the equitability and power against independence of both of these new methods as well as several other leading measures of dependence.
The results we present here, in addition to contributing to a better understanding of equitability, also provide an organizing framework in which to consolidate some of the recent discussion around equitability. For instance, our formalization of equitability is sufficiently general to accommodate several of the variants that have arisen in the literature. This allows us to precisely discuss the definition given by Kinney and Atwal [17, 18] of what, in our theoretical framework, corresponds to perfect equitability. In particular, our framework allows us to explain the limitations of an impossibility result presented by Kinney and Atwal about perfect equitability. Additionally, our framework and the connection it provides to statistical power allow us to crystallize and address the concerns about the power against independence of equitable methods raised by Simon and Tibshirani [19]. (However, empirical questions concerning the performance of the maximal information coefficient and related statistics are deferred to the companion papers [16, 4].)
We conclude with a discussion of what situations benefit from using equitability as a desideratum for data analysis. It is our hope that the theoretical results in this paper will provide a foundation for further work not only on equitability and methods for achieving equitability, but also on other possible expansions of our goals for measures of dependence in the setting of data exploration or other related settings.
2 Equitability
Equitability has been described informally by the authors as the ability of a statistic to “give similar scores to equally noisy relationships of different types” [1]. Though useful, this informal definition is imprecise in that it does not specify what is meant by “noisy” or “similar”, and does not specify for which relationships the stated property should hold. In this section we provide the formalism necessary to discuss equitability more rigorously.
To do this, we fix a statistic $\hat{\varphi}$ (presumed to be a measure of dependence), a measure of relationship strength $\Phi$ called the property of interest, and a set $\mathcal{Q}$ of standard relationships on which $\Phi$ is defined. The idea is that $\mathcal{Q}$ contains relationships of many different types, and for any distribution $\mathcal{Z} \in \mathcal{Q}$, $\Phi(\mathcal{Z})$ is the way we would ideally quantify the strength of $\mathcal{Z}$ if we had knowledge of the distribution $\mathcal{Z}$. Our goal is then, given a sample $Z$ of size $n$ from $\mathcal{Z}$, to use $\hat{\varphi}(Z)$ to draw inferences about $\Phi(\mathcal{Z})$.
Our general approach is to construct a set of intervals, the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, by inverting a certain set of hypothesis tests. We show that these intervals can be used to turn $\hat{\varphi}(Z)$ into an interval estimate of $\Phi(\mathcal{Z})$, and we call the statistic $\hat{\varphi}$ equitable if its interpretable intervals are small, i.e., if it yields narrow interval estimates of $\Phi(\mathcal{Z})$.
After constructing the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, we demonstrate how our vocabulary can be used to define a few different concrete instantiations of the concept of equitability. We do this by using our framework to state several of the notions of and results about equitability that have appeared in the literature, and discussing the relationships among them. Following this, we provide a short schematic illustration of how the definitions we provide would be used to quantitatively evaluate the equitability of a statistic in practice, and a discussion of how equitability is related to measurement of effect size more generally.
In what follows, we keep our exposition generic in order to accommodate variations, both existing and potential, on the concepts defined here. However, as a motivating example, we often return to the setting of [1], in which the statistic $\hat{\varphi}$ is one like the maximal information coefficient (MIC), the set of standard relationships $\mathcal{Q}$ is a set of noisy functional relationships, and the property of interest $\Phi$ is the coefficient of determination ($R^2$) with respect to the generating function. In this setting, the equitability of $\hat{\varphi}$ corresponds to its utility for constructing narrow interval estimates of the $R^2$ of a relationship that is in $\mathcal{Q}$ but whose specific functional form is unknown.
2.1 Interpretable intervals
Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, let $\mathcal{Q}$ be a set of distributions, and let $\Phi : \mathcal{Q} \to [0,1]$ be some measure of relationship strength. As mentioned previously, we refer to $\mathcal{Q}$ as the set of standard relationships and to $\Phi$ as the property of interest. To construct the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$, we must first ask how much $\hat{\varphi}$ can vary when evaluated on a sample from some $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) = x$. The definition below gives us a way to measure this. (In this definition and in definitions in the rest of this paper, we implicitly assume a fixed sample size of $n$.)
Definition 2.1 (Reliability of a statistic).
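Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and let $x, \alpha \in [0,1]$. The $\alpha$-reliable interval of $\hat{\varphi}$ at $x$, denoted by $R_\alpha^{\hat{\varphi}}(x)$, is the smallest closed interval $A$ with the property that, for all $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) = x$, we have
$$P\left(\hat{\varphi}(Z) < \min A\right) < \frac{\alpha}{2} \quad \text{and} \quad P\left(\hat{\varphi}(Z) > \max A\right) < \frac{\alpha}{2},$$
where $Z$ is a sample of size $n$ from $\mathcal{Z}$.
The statistic $\hat{\varphi}$ is $1/d$-reliable with respect to $\Phi$ on $\mathcal{Q}$ at $x$ with probability $1 - \alpha$ if and only if the diameter of $R_\alpha^{\hat{\varphi}}(x)$ is at most $d$.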
See Figure 1a for an illustration. The $\alpha$-reliable interval at $x$ is an acceptance region of a size-$\alpha$ test of the null hypothesis $H_0 : \Phi(\mathcal{Z}) = x$, where $\Phi$ is the property of interest. If there is only one $\mathcal{Z}$ in the set of standard relationships $\mathcal{Q}$ satisfying $\Phi(\mathcal{Z}) = x$, this amounts to a central interval of the sampling distribution of the statistic $\hat{\varphi}$ on $\mathcal{Z}$. If there is more than one such $\mathcal{Z}$, the reliable interval expands to include the relevant central intervals of the sampling distributions of $\hat{\varphi}$ on all the distributions in question. For example, when $\mathcal{Q}$ is a set of noisy functional relationships with several different function types and $\Phi$ is $R^2$, the reliable interval at $x$ is the smallest interval $A$ such that for any functional relationship $\mathcal{Z} \in \mathcal{Q}$ with $R^2(\mathcal{Z}) = x$, $\hat{\varphi}(Z)$ falls in $A$ with high probability over the sample $Z$ of size $n$ from $\mathcal{Z}$.
Because the reliable interval can be viewed as the acceptance region of a level-$\alpha$ test of $H_0 : \Phi(\mathcal{Z}) = x$, the equivalence between hypothesis tests and confidence intervals yields interval estimates of $\Phi$ in terms of $\hat{\varphi}$. These intervals are the interpretable intervals, defined below.
Definition 2.2 (Interpretability of a statistic).
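Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and let $y, \alpha \in [0,1]$. The $\alpha$-interpretable interval of $\hat{\varphi}$ at $y$, denoted by $I_\alpha^{\hat{\varphi}}(y)$, is the smallest closed interval containing the set
$$\left\{ x \in [0,1] : y \in R_\alpha^{\hat{\varphi}}(x) \right\}.$$
The statistic $\hat{\varphi}$ is $1/d$-interpretable with respect to $\Phi$ on $\mathcal{Q}$ at $y$ with confidence $1 - \alpha$ if and only if the diameter of $I_\alpha^{\hat{\varphi}}(y)$ is at most $d$.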
See Figure 1a for an illustration. The correspondence between hypothesis tests and interval estimates [20] gives us the following guarantee about the coverage probability of the interpretable interval, whose proof we omit.
Proposition 2.3.
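Let $\hat{\varphi}$ be a statistic taking values in $[0,1]$, and let $\alpha \in [0,1]$. For all $x \in [0,1]$ and for all $\mathcal{Z} \in \mathcal{Q}$ with $\Phi(\mathcal{Z}) = x$,
$$P\left(x \in I_\alpha^{\hat{\varphi}}\left(\hat{\varphi}(Z)\right)\right) \geq 1 - \alpha,$$
where $Z$ is a sample of size $n$ from $\mathcal{Z}$.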
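The definitions just presented have natural nonstochastic counterparts in the large-sample limit that we summarize below.
Definition 2.4 (Reliability and interpretability in the large-sample limit).
Let $\varphi : \mathcal{Q} \to [0,1]$ be a function of distributions. For $x \in [0,1]$, the smallest closed interval containing the set $\{\varphi(\mathcal{Z}) : \mathcal{Z} \in \mathcal{Q},\ \Phi(\mathcal{Z}) = x\}$ is called the reliable interval of $\varphi$ at $x$ and is denoted by $R^{\varphi}(x)$. For $y \in [0,1]$, the smallest closed interval containing the set $\{\Phi(\mathcal{Z}) : \mathcal{Z} \in \mathcal{Q},\ \varphi(\mathcal{Z}) = y\}$ is called the interpretable interval of $\varphi$ at $y$ and is denoted by $I^{\varphi}(y)$.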
See Figure 1b for an illustration.
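To make the large-sample definitions concrete, the following is a minimal sketch on a toy set of standard relationships. The two "relationship types" and their population curves are illustrative assumptions, not objects from this paper: one type has a population statistic equal to the relationship strength $x$, the other has population statistic $x^2$. The interpretable interval is computed by collecting the strengths whose reliable interval contains the observed value; in this toy example, with continuous curves, that yields the same interval as the definition above.

```python
# Toy large-sample example. Both curves are illustrative assumptions:
# they map a relationship strength x in [0, 1] to a population statistic.
CURVES = {
    "type_a": lambda x: x,        # statistic equals the strength
    "type_b": lambda x: x ** 2,   # statistic under-reports the strength
}

GRID = [i / 100 for i in range(101)]  # candidate strengths x in [0, 1]


def reliable_interval(x):
    """Smallest closed interval containing the statistic's values at strength x."""
    values = [curve(x) for curve in CURVES.values()]
    return min(values), max(values)


def interpretable_interval(y, tol=1e-9):
    """Smallest closed interval containing every grid strength x whose
    reliable interval contains the statistic value y."""
    hits = []
    for x in GRID:
        low, high = reliable_interval(x)
        if low - tol <= y <= high + tol:
            hits.append(x)
    return (min(hits), max(hits)) if hits else None


if __name__ == "__main__":
    print(reliable_interval(0.5))
    print(interpretable_interval(0.25))
```

In this toy setting a statistic value of 0.25 is compatible with any strength between 0.25 and 0.5, so the statistic is far from perfectly interpretable even with unlimited data.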
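[Figure 1: (a) reliable and interpretable intervals of a statistic at a finite sample size; (b) their counterparts in the large-sample limit.]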
2.2 Defining equitability
Proposition 2.3 implies that if the interpretable intervals of the statistic $\hat{\varphi}$ with respect to the property of interest $\Phi$ are small then $\hat{\varphi}$ will give good interval estimates of $\Phi$. There are many ways to summarize whether the interpretable intervals of $\hat{\varphi}$ are small; we focus here on two simple ones.
Definition 2.5.
The worst-case reliability (resp. interpretability) of $\hat{\varphi}$ is $1/d$ if it is $1/d$-reliable (resp. $1/d$-interpretable) at all $x$ (resp. $y$) $\in [0,1]$. $\hat{\varphi}$ is said to be worst-case $1/d$-reliable (resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.
The average-case reliability (resp. interpretability) of $\hat{\varphi}$ is $1/d$ if its reliability (resp. interpretability), averaged over all $x$ (resp. $y$) $\in [0,1]$, is at least $1/d$. $\hat{\varphi}$ is said to be average-case $1/d$-reliable (resp. $1/d$-interpretable) with probability (resp. confidence) $1 - \alpha$.
(One could imagine more fine-grained ways to summarize reliability/interpretability according to, for example, some prior over the distributions in $\mathcal{Q}$ that reflects a belief about the importance or prevalence of various types of relationships; for simplicity, we do not pursue this here.)
With this vocabulary, we can now define equitability: average/worst-case equitability is simply average/worst-case interpretability with respect to some $\Phi$ that reflects relationship strength. In this paper, we distinguish between interpretability in general and equitability specifically by using “interpretability” in general statements and “equitability” in contexts in which $\Phi$ is specifically considered as a measure of relationship strength. Also, we often use “interpretability” and “equitability” with no qualifier to mean worst-case interpretability/equitability.
The corresponding definitions of average/worst-case interpretability/reliability can be made for a population quantity $\varphi$ in the large-sample limit as well. In that setting, it is possible that all the interpretable intervals of $\varphi$ with respect to $\Phi$ have size 0; that is, the value of $\varphi(\mathcal{Z})$ uniquely determines the value of $\Phi(\mathcal{Z})$. In this case, the worst-case reliability/interpretability of $\varphi$ is infinite ($\infty$), and $\varphi$ is said to be perfectly reliable/interpretable, or perfectly equitable depending on context.
Before continuing, let us build intuition by giving two examples of statistics that are perfectly interpretable in the large-sample limit. First, the mutual information [21, 22] is perfectly interpretable with respect to the squared correlation $\rho^2$ on the set of bivariate normal random variables. This is because for bivariate normals we have that $I(X, Y) = -\frac{1}{2} \log(1 - \rho^2)$ [23]. Additionally, Theorem 6 of [24] shows that for bivariate normals distance correlation is a deterministic function of $\rho$ as well. Therefore, distance correlation is also perfectly interpretable and perfectly reliable with respect to $\rho^2$ on the set of bivariate normals.
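As a small illustration of what perfect interpretability means in the bivariate normal case, the sketch below evaluates the standard closed form for the mutual information of a bivariate normal (in nats) and inverts it. The function names are hypothetical and chosen only for this example. Because the map from $\rho^2$ to mutual information is strictly increasing, each value of the mutual information corresponds to exactly one $\rho^2$, so every large-sample interpretable interval has size 0.

```python
import math


def gaussian_mutual_information(rho):
    """Mutual information (in nats) of a bivariate normal with correlation rho."""
    return -0.5 * math.log(1.0 - rho ** 2)


def squared_correlation_from_mi(mi):
    """Invert the map above: recover rho^2 from the mutual information."""
    return 1.0 - math.exp(-2.0 * mi)


if __name__ == "__main__":
    for rho in (0.1, 0.5, 0.9):
        mi = gaussian_mutual_information(rho)
        print(rho, mi, squared_correlation_from_mi(mi))
```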
The perfect interpretability with respect to the correlation on bivariate normals exhibited in both of these examples is in fact equivalent to one of the “fundamental properties” introduced by Rényi in his framework for thinking about ideal properties of measures of dependence [15]. This property contains a compromise: it guarantees interpretability that on the one hand is perfect, but on the other hand applies only on a relatively small set of standard relationships. One goal of equitability is to give us the tools to relax the “perfect” requirement in exchange for the ability to make the set of standard relationships much larger, e.g., a set of noisy functional relationships. Thus, equitability can be viewed as a generalization of Rényi’s requirement that allows for a tradeoff between the precision with which our statistic tells us about the property of interest and the set on which it does so.
2.3 Examples of and results about equitability
We now give examples, using the vocabulary developed here, of some concrete instantiations of and results about equitability. Our focus here is on functional relationships, as defined below.
Definition 2.6.
A random variable distributed over $\mathbb{R}^2$ is called a noisy functional relationship if and only if it can be written in the form $(X + \varepsilon, f(X) + \varepsilon')$ where $f : [0,1] \to \mathbb{R}$, $X$ is a random variable distributed over $[0,1]$, and $\varepsilon$ and $\varepsilon'$ are (possibly trivial) random variables. We denote the set of all noisy functional relationships by $\mathcal{F}$.
2.3.1 Equitability on functional relationships with respect to $R^2$
We can now state one specific type of equitability on functional relationships: equitability with respect to $R^2$.
Definition 2.7 (Equitability on functional relationships with respect to $R^2$).
Let $\mathcal{Q}$ be a set of noisy functional relationships. A measure of dependence is equitable on $\mathcal{Q}$ with respect to $R^2$ if it is interpretable with respect to $R^2$ on $\mathcal{Q}$.
We observe that this definition still depends on the set in question. The general approach taken in the literature thus far has been to fix some set of functions that on the one hand is large enough to be representative of relationships encountered in real data sets, but on the other hand is small enough to enable empirical analysis, and to make equitability a realistic goal.
As important as the choice of functions to include in is the choice of marginal distributions and noise model, both of which are left unspecified in our definition of noisy functional relationships. In past work, we have examined several possibilities. The simplest is , with varying, and . Slightly more complex noise models include having and i.i.d. Gaussians, or having be Gaussian and . More complex marginal distributions include having be distributed in a way that depends on the graph of , or having it be nonstochastic [1, 16]. Given that we often lack a neat description of the noise in real data sets, we would ideally like a statistic to be highly equitable on as many different such models as possible.
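The following sketch shows one way to sample from a noisy functional relationship at a chosen noise level. It is an illustration under stated assumptions rather than the procedure used in [1] or [16]: the first coordinate is uniform on $[0,1]$, independent zero-mean Gaussian noise is added to the second coordinate only, and the noise standard deviation is chosen so that the population $R^2$ with respect to the generating function equals a target value. Under these assumptions $R^2 = \mathrm{Var}(f(X)) / (\mathrm{Var}(f(X)) + \sigma^2)$, which gives the formula in `noise_sd_for_r2`. All function names are hypothetical.

```python
import math
import random


def noise_sd_for_r2(signal_variance, r2):
    """Noise standard deviation that yields population R^2 = r2 when
    independent zero-mean noise is added to a signal of the given variance.
    Requires 0 < r2 <= 1."""
    return math.sqrt(signal_variance * (1.0 - r2) / r2)


def signal_variance(f, grid=10_000):
    """Approximate Var(f(X)) for X uniform on [0, 1] using a midpoint grid."""
    values = [f((i + 0.5) / grid) for i in range(grid)]
    mean = sum(values) / grid
    return sum((v - mean) ** 2 for v in values) / grid


def sample_relationship(f, n, r2, rng):
    """Draw n points (x, f(x) + noise) with x uniform on [0, 1]."""
    sd = noise_sd_for_r2(signal_variance(f), r2)
    xs = [rng.random() for _ in range(n)]
    return [(x, f(x) + rng.gauss(0.0, sd)) for x in xs]


if __name__ == "__main__":
    rng = random.Random(0)
    points = sample_relationship(lambda x: math.sin(4 * math.pi * x), 5, 0.8, rng)
    print(points)
```

Other noise models, such as noise in both coordinates, would need a different calibration of the noise level.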
We can also easily imagine models besides the ones described above: for instance, we might define the two noise terms to be non-Gaussian, we might allow them to depend on each other, or we might allow their variance to depend on the noiseless $x$-value. The importance of such modifications depends on the context, but our formalism is designed to be flexible enough to handle general models that include such variations.
2.3.2 A setting in which perfect equitability is impossible
One version of equitability on functional relationships for which perfect equitability has been shown to be impossible was introduced by Kinney and Atwal [17]. This version of equitability uses as standard relationships the set
$$\mathcal{Q}_K = \left\{ (X, f(X) + \eta) : \eta \perp X \mid f(X) \right\},$$
with $\eta$ representing a random variable that is conditionally independent of $X$ given $f(X)$. This model describes functional relationships with noise in the second coordinate only, where that noise can depend arbitrarily on the value of $f(X)$ but must be otherwise independent of $X$.
Kinney and Atwal prove that no nontrivial measure of dependence can be perfectly worst-case interpretable with respect to $R^2$ on the set $\mathcal{Q}_K$. However, we note here that this result, while interesting, has two serious limitations. The first limitation, pointed out by Murrell et al. in the technical comment [25], is that $\mathcal{Q}_K$ is extremely large: in particular, the fact that the noise term can depend arbitrarily on the value of $f(X)$ leads to identifiability issues in which one noiseless relationship can be obtained as a noisy version of another. The more permissive (i.e., large) a model is, the easier it is to prove an impossibility result for it. Since $\mathcal{Q}_K$ is not contained in the other major models considered in, e.g., [1] and [16], it follows that this impossibility result does not imply impossibility for any of those models.
The second limitation of Kinney and Atwal’s result is that it only addresses perfect equitability rather than the more general, approximate notion with which we are primarily concerned. While a statistic that is perfectly equitable with respect to $R^2$ may indeed be difficult or even impossible to achieve for many large models, including some of the models in [1] and [16], such impossibility would make approximate equitability no less desirable a property. The question thus remains how equitable various measures are, both provably and empirically. To borrow an analogy from computer science, the fact that a problem is proven to be NP-complete does not mean that we do not want efficient algorithms for the problem; we simply may have to settle for approximate solutions. Similarly, there is merit in searching for measures of dependence that appear to be highly equitable with respect to $R^2$ in practice.
(Footnote 8: As a matter of record, we wish to clarify a confusion in Kinney and Atwal’s work. They write “The key claim made by Reshef et al. in arguing for the use of MIC as a dependence measure has two parts. First, MIC is said to satisfy not just the heuristic notion of equitability, but also the mathematical criterion of equitability…”, with the latter term referring to what we here define as perfect equitability [17]. However, such a claim was never made in our previous work [1]. Rather, that paper [1] informally defined equitability as an approximate notion and compared the equitability of MIC, mutual information estimation, and other schemes empirically, concluding not that MIC is perfectly equitable but rather that it is the most equitable statistic available in a variety of settings. One method can be more equitable than another, even if neither method is perfectly equitable.)
For more on this discussion, see the technical comment [18].
2.4 Quantifying equitability via interpretable intervals
Let us give a simple demonstration of how the formalism above can be used to empirically quantify equitability with respect to $R^2$ on a specific set of noisy functional relationships. We take as our statistic the sample correlation. Since this statistic is meant to detect linear dependencies, we do not expect it to be equitable on a broad class of relationships. In fact it is not even a measure of dependence, since its population value can be zero for relationships with nontrivial dependence. However, we analyze it here as an instructional example since it is widely used and gives intuitive scores. We analyze the equitability of other statistics in Section 5.
Figure 2a shows an analysis of the equitability with respect to of at a sample size of on the set
where is a set of 16 functions analyzed in [16]. (See Appendix A.)
To evaluate the equitability of in this context, we generate, for each function and for 41 noise levels chosen for each function to correspond to values uniformly spaced in , independent samples of size from the relationship . We then evaluate on each sample to estimate the 5th and 95th percentiles of the sampling distribution of on . By taking, for each , the maximal 95th percentile value and the minimal 5th percentile value across all , we obtain estimates of the reliable interval at each noise level. From the reliable intervals we can then construct interpretable intervals, and the equitability of is the reciprocal of the length of the largest interpretable interval.
As expected, the interpretable intervals at many values of the sample correlation are large. This is because our set of functions contains many nonlinear functions, and so a given value of the sample correlation can be assigned to relationships of different types with very different $R^2$ values. This is shown by the pairs of thumbnails in the figure, each of which depicts two relationships with the same sample correlation but different values of $R^2$. Thus, the sample correlation has poor equitability with respect to $R^2$ on this set of relationships. In contrast, Figure 2b depicts the way this analysis would look if the statistic were perfectly equitable: all the interpretable intervals would have size 0.
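The simulation procedure described in this subsection can be sketched in a few lines. The sketch below is a simplified illustration, not the analysis behind Figure 2: it uses three stand-in functions rather than the 16 functions of [16], a coarse grid of $R^2$ levels, small sample sizes and trial counts, uniform $x$-values with Gaussian noise in the second coordinate, and the squared sample correlation so that the statistic lies in $[0,1]$. All of these choices, and all names in the code, are assumptions made for the example.

```python
import math
import random

# Hypothetical stand-ins for the function set; not the paper's 16 functions.
FUNCTIONS = {
    "linear": lambda x: x,
    "parabola": lambda x: 4.0 * (x - 0.5) ** 2,
    "sine": lambda x: math.sin(4.0 * math.pi * x),
}


def pearson_sq(points):
    """Squared sample correlation of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x, _ in points)
    syy = sum((y - my) ** 2 for _, y in points)
    return sxy * sxy / (sxx * syy)


def signal_variance(f, grid=2000):
    """Approximate Var(f(X)) for X uniform on [0, 1] using a midpoint grid."""
    values = [f((i + 0.5) / grid) for i in range(grid)]
    mean = sum(values) / grid
    return sum((v - mean) ** 2 for v in values) / grid


def reliable_intervals(r2_levels, n=100, trials=200, seed=0):
    """Map each R^2 level (0 < R^2 <= 1) to the pair (lowest 5th percentile,
    highest 95th percentile) of the statistic across all function types."""
    rng = random.Random(seed)
    variances = {name: signal_variance(f) for name, f in FUNCTIONS.items()}
    intervals = {}
    for r2 in r2_levels:
        lows, highs = [], []
        for name, f in FUNCTIONS.items():
            sd = math.sqrt(variances[name] * (1.0 - r2) / r2)
            scores = []
            for _ in range(trials):
                xs = [rng.random() for _ in range(n)]
                scores.append(pearson_sq([(x, f(x) + rng.gauss(0.0, sd)) for x in xs]))
            scores.sort()
            lows.append(scores[int(0.05 * trials)])       # approximate 5th percentile
            highs.append(scores[int(0.95 * trials) - 1])  # approximate 95th percentile
        intervals[r2] = (min(lows), max(highs))
    return intervals


def interpretable_interval(y, intervals):
    """Range of simulated R^2 levels whose reliable interval contains y."""
    hits = [r2 for r2, (low, high) in intervals.items() if low <= y <= high]
    return (min(hits), max(hits)) if hits else None


if __name__ == "__main__":
    levels = [0.1 * k for k in range(1, 11)]
    rel = reliable_intervals(levels)
    for y in (0.1, 0.5, 0.9):
        print(y, interpretable_interval(y, rel))
```

The worst-case equitability estimate is then the reciprocal of the widest interval returned by `interpretable_interval` over the observed range of the statistic, subject to the resolution of the $R^2$ grid.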
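[Figure 2: (a) equitability of the sample correlation with respect to $R^2$ on the set of noisy functional relationships analyzed here; (b) the same analysis as it would look for a perfectly equitable statistic.]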
2.5 Discussion
In this section we formalized the notion of equitability via the concepts of reliability and interpretability. Given a statistic $\hat{\varphi}$ and a measure of relationship strength $\Phi$ defined on some set $\mathcal{Q}$ of standard relationships, we constructed a set of intervals called the interpretable intervals of $\hat{\varphi}$ with respect to $\Phi$. We constructed the interpretable intervals so they yield interval estimates of $\Phi$, and we then defined the (worst-case) equitability of $\hat{\varphi}$ to be the inverse of the size of the largest interpretable interval.
Strictly speaking, equitability simply requires that a natural set of confidence intervals obtained from analyzing the statistic as an estimator of the property of interest $\Phi$ be small. However, there is a subtlety here: since in our setting the set of standard relationships $\mathcal{Q}$ typically contains several different relationship types, there are usually multiple relationships in $\mathcal{Q}$ with a given value of $\Phi$. This is different from the conventional framework of estimation of a parameter, in which we assume that there is exactly one distribution with any given value of that parameter, and we must account for this difference in our definitions.
When the set of standard relationships $\mathcal{Q}$ is so small that this subtlety does not arise, equitability becomes a less rich property. To see this, notice that if there is only one relationship in $\mathcal{Q}$ for every value of the property of interest $\Phi$, then asymptotic monotonicity of the statistic with respect to $\Phi$ is sufficient for perfect equitability in the large-sample limit. In this scenario, the main obstacle to the equitability of the statistic is finite-sample effects, as with parameter estimation. For example, on the set of bivariate Gaussians, many measures of dependence are asymptotically perfectly equitable with respect to the correlation.
However, this differs from the motivating data exploration scenario we consider, in which the set of standard relationships $\mathcal{Q}$ contains many different relationship types and there are multiple different relationships corresponding to a given value of the property of interest $\Phi$. Here, equitability can be hindered either by finite-sample effects, or by the differences in the asymptotic behavior of the statistic on different relationship types in $\mathcal{Q}$. This is illustrated in Figure 3.
Regardless of the size of the set of standard relationships $\mathcal{Q}$ though, equitability is fundamentally meant for a situation in which we cannot simply estimate the property of interest $\Phi$ directly. (In fact, if the statistic is a consistent estimator of $\Phi$ on $\mathcal{Q}$, it is trivially perfectly equitable in the large-sample limit.) This is because in data exploration we typically require that the statistic be a measure of dependence in order to obtain a minimal robustness guarantee, and this requirement makes it very difficult to make the statistic a consistent estimator of $\Phi$ on a large set $\mathcal{Q}$. For instance, suppose $\mathcal{Q}$ is a set of noisy functional relationships and $\Phi = R^2$. Here, computing the sample $R^2$ relative to a nonparametric estimate of the generating function will be asymptotically perfectly equitable. However, this approach is undesirable for data exploration because of its lack of robustness, as exemplified by the fact that it would assign a score of zero to, e.g., a circular relationship. Therefore, we are left with the problem of finding the next-best thing: a measure of dependence whose values have a clear, if approximate, interpretation in terms of $\Phi$. Equitability supplies us with a way of talking about how well a statistic does in this regard.
We close this section with the observation that, though we largely focused here on setting the set of standard relationships $\mathcal{Q}$ to be some set of noisy functional relationships, the appropriate definitions of $\mathcal{Q}$ and of the property of interest $\Phi$ may change from application to application. For instance, instead of functional relationships one may be interested in relationships supported on one-manifolds, with added noise. Or perhaps instead of $R^2$ one may decide to focus on the mutual information between the sampled $y$-values and the corresponding de-noised $y$-values [17], or on the fraction of deterministic signal in a mixture [26]. In each case the overarching goal should be to have $\mathcal{Q}$ be as large as possible without making it impossible to define an interesting $\Phi$ or making it impossible to find a measure of dependence that achieves good equitability on $\mathcal{Q}$ with respect to this $\Phi$. Finding such families and properties is an important avenue of future work.
3 Equitability and statistical power
In the previous section we defined equitability in terms of interval estimation, and observed that the interpretable intervals of a statistic $\hat{\varphi}$ with respect to a property of interest $\Phi$ yield interval estimates of $\Phi$ on a set of distributions $\mathcal{Q}$. Given our construction of interpretable intervals via inversion of a set of hypothesis tests, it becomes natural to ask whether there is any connection between equitability and the power of those tests with respect to specific alternatives.
In this section we answer this question by showing that equitability can be equivalently formulated in terms of power with respect to a family of null hypotheses corresponding to different relationship strengths. This result recasts equitability as a strengthening of power against statistical independence on $\mathcal{Q}$ and gives a second formal definition of equitability that is easily quantifiable using standard power analysis.
Henceforth, we fix the statistic $\hat{\varphi}$ and then use $R_\alpha(x)$ to denote the $\alpha$-reliable interval of $\hat{\varphi}$ at $x$ and $I_\alpha(y)$ to denote the $\alpha$-interpretable interval of $\hat{\varphi}$ at $y$.
3.1 Intuition
Before stating and proving the relationship between equitability and power, let us first build some intuition for why it should hold. We begin by recalling that the reliable interval is an acceptance region of a two-sided level test of . Since the interval estimates obtained by inverting this test are the interpretable intervals of , it makes sense to ask whether there is any property of these hypothesis tests that improves as the interpretability of the statistic increases. To see why the relevant property is power, let us consider the following illustrative question: what is the minimal such that a right-tailed level test of will have power at least on ? (We consider a one-sided test here, and henceforth in this section, because in practice when corresponds to relationship strength, we are interested in rejecting a null hypothesis representing weaker relationships, and in such a situation it is more common to perform a one-sided test. Nevertheless, results similar to those shown in this section can be derived for two-sided tests as well.) As shown graphically in Figure 4, the answer can be stated in terms of the reliable and interpretable intervals of .
Specifically, if is the maximal element of , then the minimal value of at which a right-tailed test based on will achieve power is , i.e., the maximal element of the interpretable interval at . So if the statistic is highly interpretable at , then we will be able to achieve high power against very small departures from the null hypothesis of independence. That is, good interpretability on implies good power against independence on . It turns out that this reasoning holds in general and in both directions, as we establish below.
3.2 Definitions
To be able to state our main result, we need to formally describe how equitability would be formulated in terms of power. This requires two definitions. The first is a definition of a power function that parametrizes the space of possible alternative hypotheses specifically by the property of interest. The second is a definition of a property of this power function called its uncertain set. It will turn out later that uncertain sets are interpretable intervals and vice versa.
As before, let be a statistic, let be a set of standard relationships, and let be a property of interest defined on . Given a set of right-tailed tests based on the same test statistic, we refer to the one with the smallest critical value as the most permissive test.
Definition 3.1.
Fix , and let be the most permissive level right-tailed test based on of the (possibly composite) null hypothesis . For , define
where is a sample of size from . That is, is the power of with respect to the composite alternative hypothesis .
We call the function the level power function associated to at with respect to .
Note that in the above definition our null and alternative hypotheses may be composite since they are based on and not on a complete parametrization of . That is, can be one of several distributions with or respectively.
Under the assumption that if and only if represents statistical independence, the power function gives the power of optimal level right-tailed tests based on at distinguishing various nonzero values of from statistical independence across the different relationship types in . One way to view the main result of this section is that the set of power functions at values of besides 0 contains much more information than just the power of right-tailed tests based on against the null hypothesis of , and that this information can be equivalently viewed in terms of interpretable intervals. Specifically, we can recover the interpretability of at every by considering its power functions at values of beyond 0.
Let us now define the precise aspect of the power functions associated to that will allow us to do this.
Definition 3.2.
The uncertain set of a power function is the set .
The main result of this section will be that uncertain sets are interpretable intervals and vice versa.
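For concreteness, Definitions 3.1 and 3.2 can be written out as in the sketch below. The notation is an assumption made for this sketch: $\hat{\varphi}$ for the statistic, $\Phi$ for the property of interest, $\mathcal{Q}$ for the set of standard relationships, $T_{x_0}^{\alpha}$ for the most permissive level-$\alpha$ right-tailed test at null value $x_0$, $K_{x_0}^{\alpha}$ for its power function, and $1-\beta$ for the target power. Taking the infimum over the composite alternative, and restricting the uncertain set to $x \ge x_0$, is the reading suggested by the way the proof of Proposition 3.6 uses these objects.

```latex
% Sketch under assumed notation; requires amsmath for \substack and \text.
\[
  K_{x_0}^{\alpha}(x)
  \;=\;
  \inf_{\substack{\mathcal{Z} \in \mathcal{Q} \\ \Phi(\mathcal{Z}) = x}}
  \mathbf{P}\bigl( T_{x_0}^{\alpha} \text{ rejects on } D \bigr),
  \qquad D \text{ a sample of size } n \text{ from } \mathcal{Z}.
\]
\[
  U \;=\; \bigl\{\, x \ge x_0 \;:\; K_{x_0}^{\alpha}(x) < 1 - \beta \,\bigr\}.
\]
```

Under this reading, the uncertain set collects the alternative strengths that the test at $x_0$ cannot yet reliably separate from the null, which is why its closure can coincide with an interpretable interval.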
3.3 Preliminary lemmas
Our proof of the alternate characterization of equitability in terms of power requires two short lemmas. The first shows a connection between the maximum element of a reliable interval and the minimal element of an interpretable interval, namely that these two operations are inverses of each other.
Lemma 3.3.
Given a statistic , a property of interest , and some , define and . If is strictly increasing, then and are inverses of each other.
Proof.
Let . We know that , for if it were greater than then we would have that , which would imply that , contradicting the definition of . On the other hand, we cannot have , because this would imply that there is some such that , meaning that , which contradicts the fact that is strictly increasing. ∎
The second lemma gives the connection between reliable intervals and hypothesis testing that we will exploit in our proof.
Lemma 3.4.
Fix a statistic , a property of interest , and some . The most permissive level right-tailed test based on of the null hypothesis has critical value .
Proof.
We seek the smallest critical value that yields a level test. This would be the supremum, over all with , of the value of the sampling distribution of when applied to . By definition this is . ∎
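The construction in Lemma 3.4 can be approximated by simulation. The sketch below is illustrative only and rests on several assumptions that are not part of the text: the statistic is the squared Pearson correlation, relationships are noisy functions of a uniform variable with Gaussian noise, the property of interest is the fraction of variance explained by the generating function (passed as `r2`), and the names `sample_statistic` and `critical_value` are hypothetical. For a composite null, it takes the largest upper quantile across relationship types.

```python
import numpy as np


def sample_statistic(f, r2, n, rng):
    """Draw n points from y = f(x) + Gaussian noise; return squared Pearson r.

    Assumption: the noise level is set from the sample variance of f(x) so
    that Var(f(x)) / Var(y) is approximately r2. r2 <= 0 means pure noise.
    """
    x = rng.uniform(0.0, 1.0, size=n)
    signal = f(x)
    if r2 <= 0.0:
        y = rng.normal(size=n)  # statistical independence
    else:
        sigma = np.sqrt(np.var(signal) * (1.0 / r2 - 1.0))
        y = signal + rng.normal(scale=sigma, size=n)
    return np.corrcoef(x, y)[0, 1] ** 2


def critical_value(functions, r2_null, n, alpha, n_sims, rng):
    """Approximate the critical value of the most permissive right-tailed test.

    For each relationship type in the composite null, estimate the
    (1 - alpha) quantile of the statistic, then take the largest one.
    """
    quantiles = []
    for f in functions:
        stats = [sample_statistic(f, r2_null, n, rng) for _ in range(n_sims)]
        quantiles.append(float(np.quantile(stats, 1.0 - alpha)))
    return max(quantiles), quantiles
```

Calling `critical_value([lambda x: x, lambda x: (x - 0.5) ** 2], 0.5, 100, 0.05, 200, np.random.default_rng(0))` returns the overall critical value together with the per-type quantiles; the relationship type with the largest quantile is the one that determines the test.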
3.4 Proving the main result: equitability in terms of statistical power
We are now ready to prove our main result, which is the following equivalent characterization of equitability in terms of statistical power.
Theorem 3.5.
Fix a set , a function , and . Let be a statistic with the property that is a strictly increasing function of . Then for all , the following are equivalent.

is worst-case interpretable with respect to with confidence .

For every satisfying , there exists a level right-tailed test based on that can distinguish between and with power at least .
Theorem 3.5 can be seen to follow from the proposition below.
Proposition 3.6.
Fix and , and suppose is a statistic with the property that is a strictly increasing function of . Then for , the interval equals the closure of the uncertain set of for . Equivalently, for , the closure of the uncertain set of equals for .
An illustration of this proposition and its proof is shown in Figure 5.
Proof.
The equivalence of the two statements follows from Lemma 3.3, which states that if and only if . We therefore prove only the first statement, namely that is the uncertain set of for .
Let be the uncertain set of . We prove the claim by showing first that , and then that .
To see that , we simply observe that because , we have , which means that is nonempty, and so by construction its infimum is , which we have assumed equals .
Let us now show that : by the definition of the interpretable interval, we can find arbitrarily close to from below such that . But this means that there exists some with such that if is a sample of size from then
i.e.,
But since as we already noted , Lemma 3.4 tells us that it is the critical value of the most permissive level right-tailed test of . Therefore, , meaning that .
It remains only to show that . To do so, we note that for all . This implies that either or . However, since and is an increasing function, no can have . Thus the only option remaining is that . This means that if is a sample of size from any with , then
i.e.,
As above, this implies that , which means that , as desired. ∎
3.5 Quantifying equitability via statistical power
Theorem 3.5 gives us an alternative to measuring equitability via lengths of interpretable intervals. Instead, for every and for every , we can use many samples of size to estimate the power of right-tailed tests based on at distinguishing from . This process is illustrated schematically in Figure 6. In that figure, good equitability corresponds to high power on pairs even when is small.
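The simulation procedure just described can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for Figure 6: the statistic (`squared_pearson`), the uniform-x and Gaussian-noise model, the use of the fraction of variance explained as the relationship strength, and all function names are choices made for the example. Power against a composite alternative is taken as the worst case over relationship types.

```python
import numpy as np


def squared_pearson(x, y):
    """Example statistic (an assumption; any statistic of (x, y) works)."""
    return np.corrcoef(x, y)[0, 1] ** 2


def draw_sample(f, r2, n, rng):
    """Noisy functional relationship whose explained variance is roughly r2."""
    x = rng.uniform(0.0, 1.0, size=n)
    signal = f(x)
    if r2 <= 0.0:
        return x, rng.normal(size=n)
    sigma = np.sqrt(np.var(signal) * (1.0 / r2 - 1.0))
    return x, signal + rng.normal(scale=sigma, size=n)


def simulate(stat, f, r2, n, n_sims, rng):
    """Simulated sampling distribution of the statistic for one relationship."""
    return np.array([stat(*draw_sample(f, r2, n, rng)) for _ in range(n_sims)])


def estimate_power(stat, functions, r2_null, r2_alt, n, alpha, n_sims, rng):
    """Worst-case power of a right-tailed test of r2_null against r2_alt."""
    # Critical value: largest (1 - alpha) quantile over the composite null.
    crit = max(
        np.quantile(simulate(stat, f, r2_null, n, n_sims, rng), 1.0 - alpha)
        for f in functions
    )
    # Power: smallest rejection rate over the composite alternative.
    powers = [
        np.mean(simulate(stat, f, r2_alt, n, n_sims, rng) > crit)
        for f in functions
    ]
    return float(min(powers))
```

With only a linear relationship in the set, squared Pearson correlation separates a weak null from a strong alternative easily. Once a parabola is added, its worst-case power collapses, which is the kind of failure of equitability that this power-based view is meant to expose.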
3.6 Discussion
In this section, we gave a characterization of equitability in terms of statistical power with respect to a family of null hypotheses corresponding to different relationship strengths. (See Theorem 3.5.) This characterization shows what the concept of equitability/interpretability is fundamentally about: being able to distinguish not just signal () from no signal () but also stronger signal () from weaker signal (), and being able to do so across relationships of different types. This indeed makes sense when a data set contains an overwhelming number of heterogeneous relationships that exhibit, say, and that we would like to ignore because they are not as interesting as the small number of relationships with, say, .
Let us now explore how the power requirement into which equitability translates differs from the conventional lens through which measures of dependence are analyzed. We do so by returning once more to the case in which is a set of noisy functional relationships and the property of interest is . In this setting, the conventional way to assess a measure of dependence would be through analysis of its power with respect to a null hypothesis of independence and with a simple alternative hypothesis. Such an analysis would consider, say, right-tailed tests based on the statistic and evaluate their power at rejecting the null hypothesis of , i.e. statistical independence, first on linear relationships with varying noise levels, then separately on exponential relationships with varying noise levels, and so on.
In contrast, our result shows that for to be equitable, it must yield right-tailed tests with high power at distinguishing null hypotheses of the form from alternative hypotheses of the form for any . This is more stringent than the conventional analysis described above for the following three reasons.

Instead of just one null hypothesis (i.e., ), there are many possible values of corresponding to different values.

Each of the new null hypotheses can be composite since can contain relationships of many different types (e.g. noisy linear, noisy sinusoidal, and noisy parabolic). Whereas for many measures of dependence all of these relationships may have reduced to a single null hypothesis of statistical independence in the case of , they yield composite null hypotheses once we allow to be nonzero.

The alternative hypotheses here are also composite, since each one similarly consists of several different relationship types with the same . Whereas conventional analysis of power against independence considers only one alternative at a time, here we require that tests simultaneously have good power on sets of alternatives with the same .
This understanding of equitability is both good news and bad news. On the one hand, it provides us with a concrete sense of the relationship of equitability to power against independence, which has been the more traditional way of evaluating measures of dependence. In so doing, it also makes clear the motivation behind equitability and the cases in which it is useful. On the other hand, however, the understanding that equitability corresponds to power against a much larger set of null hypotheses suggests, via “no free lunch”-type considerations, that if we want to achieve higher power against this larger set of null hypotheses, we may need to give up some power against independence. And indeed, in [16] we demonstrate empirically that such a tradeoff does seem to exist for several measures of dependence.
However, there are situations in which it may be desirable to give up some power against independence in exchange for a degree of equitability. For instance, recall the analysis [14] of the gene expression data set discussed earlier in this paper. In that analysis, not only did several measures of dependence each detect thousands of significant relationships after correction for multiple hypothesis testing, but there was also an overlap of over among the relationships detected by the five best-performing methods. In data exploration scenarios such as this one, in which existing measures of dependence reliably identify so many relationships, additional gains in power against independence appear to be less of a priority than deciding how to choose among the large number of relationships already detected.
4 Equitability implies low detection threshold
The primary motivation given for equitability is that often data sets contain so many relationships that we are not interested in all deviations from independence but rather only in the strongest few relationships. However, there are also many data sets in which, due to low sample size, multiple-testing considerations, or relative lack of structure in the data, very few relationships pass significance. Alternatively, there are also settings in which equitability is too ambitious even at large sample sizes. In such settings, we may indeed be interested in simply detecting deviations from independence rather than ranking them by strength.
In this situation, there is still cause for concern about the effect on our results of our choice of test statistic . For instance, it is easy to imagine that, despite asymptotic guarantees, an independence test will suffer from low power even on strong relationships of a certain type at a finite sample size because the test statistic systematically assigns lower scores to relationships of that type. To avoid this, we might want a guarantee that, at a sample size of , the test has a given amount of power in detecting relationships whose strength as measured by is above a certain threshold, across a broad range of relationship types. This would ensure that, even if we cannot rank relationships by strength, we at least will not miss important relationships as a result of the statistic we use.
In this section we show a straightforward connection between equitability as defined above and this desideratum, which we call low detection threshold. In particular, we show via the alternate characterization of equitability proven in the previous section that low detection threshold is a straightforward consequence of high equitability. Since the converse does not hold, low detection threshold may be a reasonable criterion to use in situations in which equitability is too much to ask.
Given a set of standard relationships, and a property of interest , we define low detection threshold as follows.
Definition 4.1.
A statistic has a detection threshold of at level with respect to on if there exists a level right-tailed test based on of the null hypothesis whose power on at a sample size of is at least for all with .
The connection between equitability and low detection threshold is then a straightforward corollary of Theorem 3.5.
Corollary 4.2.
Fix some , let be worst-case interpretable with respect to on with confidence , and assume that is a strictly increasing function. Then has a detection threshold of at level with respect to on .
Assume that has the property that it is zero precisely in cases of statistical independence. Then the above corollary says that equitability and interpretability — to the extent they can be achieved — make strong guarantees about power against independence on . On the other hand, it is easy to see that low detection threshold need not imply equitability. Therefore, minimal power against independence is a strictly weaker criterion than equitability.
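A detection threshold in the sense of Definition 4.1 can be estimated on a grid of relationship strengths. The sketch below is one possible reading, under assumptions that are not part of the text: squared Pearson correlation as the statistic, noisy functions of a uniform variable with Gaussian noise, the fraction of variance explained as the relationship strength, and a hypothetical function name `detection_threshold`. It reports the smallest grid point from which the worst-case power over relationship types stays at or above the target.

```python
import numpy as np


def sample_statistic(f, r2, n, rng):
    """Squared Pearson r on n points from y = f(x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    signal = f(x)
    if r2 <= 0.0:
        y = rng.normal(size=n)  # independence: y ignores x entirely
    else:
        sigma = np.sqrt(np.var(signal) * (1.0 / r2 - 1.0))
        y = signal + rng.normal(scale=sigma, size=n)
    return np.corrcoef(x, y)[0, 1] ** 2


def detection_threshold(functions, r2_grid, n, alpha, beta, n_sims, rng):
    """Return (threshold, worst_power); threshold is None if none is found."""
    # Under independence the relationship type is irrelevant in this noise
    # model, so a single null simulation gives the critical value.
    null_stats = [sample_statistic(functions[0], 0.0, n, rng) for _ in range(n_sims)]
    crit = np.quantile(null_stats, 1.0 - alpha)

    worst_power = []
    for r2 in r2_grid:
        powers = []
        for f in functions:
            stats = np.array([sample_statistic(f, r2, n, rng) for _ in range(n_sims)])
            powers.append(np.mean(stats > crit))
        worst_power.append(min(powers))
    worst_power = np.array(worst_power)

    # Walk down from the strongest relationships; stop at the first failure.
    threshold = None
    for r2, power in zip(reversed(list(r2_grid)), worst_power[::-1]):
        if power < 1.0 - beta:
            break
        threshold = float(r2)
    return threshold, worst_power
```

The grid makes this an approximation: the true threshold may lie anywhere between the returned grid point and the one below it, and a return value of `None` only means that no threshold was found within the grid.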
The connection between equitability and detection threshold with respect to is important because there exist situations in which equitability may be difficult to achieve but in which we still want some sort of guarantee about the robustness of our power against independence to changes in relationship type. This general theme of not missing relationships because of their type is the intuitive heart of equitability, and the above corollary shows how this conception might be utilized in other ways.
Another way that low detection threshold arises naturally is if we prefilter our data set using some independence test before conducting a more fine-grained analysis with a second statistic. In that case, low detection threshold ensures that we will not “throw out” important relationships prematurely just because of their relationship type. In our companion paper [16], we propose precisely such a scheme, and we analyze the detection threshold of the preliminary test in question to argue that the scheme will perform well.
5 Quantifying equitability in practice
Having defined equitability and seen how it can be interpreted in terms of power, we now consider the equitability on a set of noisy functional relationships of some commonly used methods: the maximal information coefficient as estimated by [4], distance correlation [5, 24, 27], and mutual information [21, 22] as estimated using the Kraskov estimator [6].
In this analysis, we use as our property of interest, as our sample size, and
where and are i.i.d., is the set of functions in Appendix A, and is the set of x-values that result in the points being equally spaced along the graph of .
The results of the analysis are shown in Figure 7. The figure visualizes the analysis via both interpretable intervals and statistical power. By Theorem 3.5, these two viewpoints are equivalent, and they are both shown here in order to help the reader build intuition for this equivalence. For instance, the worst-case interpretability of here is , because the widest interpretable interval is of size . And indeed, yields right-tailed tests with power at distinguishing any null hypothesis of the form from any alternative hypothesis of the form provided .
As the figure demonstrates, the equitability of achieved by on this is the highest among the methods examined. In contrast, the equitabilities with respect to of distance correlation and mutual information estimation on this are and , respectively. For a more extensive analysis that varies the sample size as well as noise model and marginal distributions, and compares many more methods, see [16].
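The interval-based half of such an analysis can be sketched in a few lines. The code below is an illustration under assumptions, not the analysis behind Figure 7: it uses squared Pearson correlation rather than the statistics compared above, draws x uniformly instead of spacing points along the graph, and uses the fraction of variance explained as the relationship strength. It also assumes that the reliable interval at a given strength spans from the smallest lower quantile to the largest upper quantile across relationship types, and that the interpretable interval at a statistic value is the smallest interval containing every strength whose reliable interval covers that value. All function names are hypothetical.

```python
import numpy as np


def sample_statistic(f, r2, n, rng):
    """Squared Pearson r on n points from y = f(x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n)
    signal = f(x)
    if r2 <= 0.0:
        y = rng.normal(size=n)
    else:
        sigma = np.sqrt(np.var(signal) * (1.0 / r2 - 1.0))
        y = signal + rng.normal(scale=sigma, size=n)
    return np.corrcoef(x, y)[0, 1] ** 2


def reliable_intervals(functions, r2_grid, n, alpha, n_sims, rng):
    """Lower and upper ends of the reliable interval at each grid strength."""
    lo, hi = [], []
    for r2 in r2_grid:
        q_lo, q_hi = [], []
        for f in functions:
            stats = [sample_statistic(f, r2, n, rng) for _ in range(n_sims)]
            q_lo.append(np.quantile(stats, alpha / 2.0))
            q_hi.append(np.quantile(stats, 1.0 - alpha / 2.0))
        lo.append(min(q_lo))
        hi.append(max(q_hi))
    return np.array(lo), np.array(hi)


def interpretable_interval(y, r2_grid, lo, hi):
    """Range of grid strengths whose reliable interval contains y."""
    inside = (lo <= y) & (y <= hi)
    if not inside.any():
        return None
    xs = np.asarray(r2_grid)[inside]
    return float(xs.min()), float(xs.max())


def widest_interval(r2_grid, lo, hi, y_grid):
    """Largest interpretable-interval width over the statistic values tried."""
    widths = []
    for y in y_grid:
        interval = interpretable_interval(y, r2_grid, lo, hi)
        if interval is not None:
            widths.append(interval[1] - interval[0])
    return max(widths) if widths else None
```

The reciprocal of the widest width gives a grid-limited estimate of worst-case interpretability. Because strengths are only checked on the grid, widths are resolved no more finely than the grid spacing.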
6 Conclusion
Informally, given some measure of relationship strength, the equitability of a measure of dependence with respect to is the degree to which allows us to draw inferences about relationship strength across a broad set of relationship types. We give here a conceptual framework to motivate equitability and then discuss the contributions of this work.
6.0.1 The motivation for equitability
There are two different ways to motivate equitability. The first is to begin with a measure of dependence and to observe that, though will asymptotically allow us to detect all deviations from independence in a data set, it need not tell us anything about the strength of those relationships. Since it often happens that we detect many more relationships than can be realistically followed up, it would be desirable to have tell us something not just about the presence or absence of a relationship, but also about relationship strength as defined by on at least a partial set of “standard relationships” .
The second way is to suppose that is a consistent estimator of on and to ask “what is the minimal requirement we can add to ensure that is robust to detecting relationships outside of ?” Perhaps the weakest stipulation we can impose is that the population value of our statistic be nonzero in cases of nontrivial dependence of any sort. That is, we want to be a measure of dependence as well.
Both of these scenarios would be resolved by a measure of dependence that is also a consistent estimator of . However, in many interesting cases there is no known statistic satisfying both properties: for instance, if is a set of noisy functional relationships and is , then on the one hand computing the sample with respect to a nonparametric estimate of the generating function will be a consistent estimator of , but will give a score of 0 to a circle. And on the other hand, no measure of dependence is known also to be a consistent estimator of on noisy functional relationships.
This naturally leads us to wonder whether, despite the difficulty of simultaneously estimating consistently and retaining the properties of a measure of dependence, we can at least seek an approximate version of this ideal. Doing so, however, calls for a weaker criterion than consistent estimation. This is what leads us to equitability. Equitability allows us to seek statistics that have the robustness of measures of dependence but that also, via their relationship to a property of interest , give values that have a clear, if approximate, interpretation and can therefore be used to rank relationships.
6.0.2 Contributions of this work
In this paper, we formalized and developed the theory of equitability in three ways. We first defined the equitability of a statistic on with respect to as the extent to which gives us good interval estimates of on . Our definition rests on an object called the interpretable interval, which has coverage guarantees with respect to . We define to be equitable if all of its interpretable intervals are small.
Second, we showed that this formalization of equitability can be equivalently stated in terms of power against a specific set of null hypotheses corresponding to different relationship strengths. That is, while measures of dependence have conventionally been judged by their power at distinguishing nontrivial signal from statistical independence, equitability is equivalent to the stronger property of being able to distinguish different degrees of possibly nontrivial signal strength from each other.
Third, we defined a concept called low detection threshold, which stipulates that, at a fixed sample size, a statistic yield independence tests with a guaranteed minimal power to detect relationships whose strength passes a certain threshold, across a range of relationship types. We showed that low detection threshold is a straightforward consequence of equitability. Since the converse does not hold, low detection threshold is a natural weaker criterion that one could aim for when equitability proves difficult to achieve.
Our formalization and its results serve three primary purposes. The first is to provide a framework for rigorous discussion and exploration of equitability and related concepts. The second is to situate equitability in the context of interval estimation and hypothesis testing and to clarify its relationship to central concepts in those areas such as confidence and statistical power. The third is to show that equitability and the language developed around it can help us to both formulate and achieve other useful desiderata for measures of dependence.
These connections provide a framework for thinking about the utility of both current and future measures of dependence for exploratory data analysis. Power against independence, the lens through which measures of dependence are currently evaluated, is appropriate in many settings in which very few significant relationships are expected, or in which we want to know whether one specific relationship is nontrivial or not. However, in situations in which most measures of dependence already identify a large number of relationships, a rigorous theory of equitability will allow us to begin to assess when we can glean more information from a given measure of dependence than just the binary result of an independence test.
Of course, there is much left to understand about equitability. For instance, to what extent is it achievable for different properties of interest? What are natural and useful properties of interest for sets besides noisy functional relationships? For common statistics such as MIC [1] or [4], can we obtain a theoretical characterization of the sets for which good equitability with respect to is achieved? Are there systematic ways of obtaining equitable behavior via a learning framework as was done for causation in [28]? These questions all deserve attention.
Equitability as framed here is certainly not the only goal toward which we should strive in developing new measures of dependence. As data sets not only grow in size but also become more varied, there will undoubtedly develop new and interesting use cases for measures of dependence, each with its own way of assessing success. Whichever modes of assessment are used, it is important that we formulate and explore concepts that move beyond power against independence, at least in the bivariate setting. Equitability provides one approach to coping with the changing nature of data exploration, but more generally, we can and should ask more of measures of dependence.
7 Acknowledgments
The authors would like to acknowledge R Adams, E Airoldi, H Finucane, A Gelman, M Gorfine, R Heller, J Huggins, J Mueller, and R Tibshirani for constructive conversations and useful feedback.
References
 [1] D. N. Reshef, Y. A. Reshef, H. K. Finucane, S. R. Grossman, G. McVean, P. J. Turnbaugh, E. S. Lander, M. Mitzenmacher, and P. C. Sabeti, “Detecting novel associations in large data sets,” Science, vol. 334, no. 6062, pp. 1518–1524, 2011.
 [2] J. D. Storey and R. Tibshirani, “Statistical significance for genomewide studies,” Proceedings of the National Academy of Sciences, vol. 100, no. 16, pp. 9440–9445, 2003.
 [3] V. Emilsson, G. Thorleifsson, B. Zhang, A. S. Leonardson, F. Zink, J. Zhu, S. Carlson, A. Helgason, G. B. Walters, S. Gunnarsdottir, et al., “Genetics of gene expression and its effect on disease,” Nature, vol. 452, no. 7186, pp. 423–428, 2008.
 [4] Y. A. Reshef, D. N. Reshef, H. K. Finucane, P. C. Sabeti, and M. Mitzenmacher, “Measuring dependence powerfully and equitably,” arXiv preprint arXiv:1505.02213, 2015.
 [5] G. J. Székely, M. L. Rizzo, N. K. Bakirov, et al., “Measuring and testing dependence by correlation of distances,” The Annals of Statistics, vol. 35, no. 6, pp. 2769–2794, 2007.
 [6] A. Kraskov, H. Stögbauer, and P. Grassberger, “Estimating mutual information,” Physical Review E, vol. 69, 2004.
 [7] L. Breiman and J. H. Friedman, “Estimating optimal transformations for multiple regression and correlation,” Journal of the American Statistical Association, vol. 80, no. 391, pp. 580–598, 1985.
 [8] W. Hoeffding, “A nonparametric test of independence,” The Annals of Mathematical Statistics, pp. 546–557, 1948.
 [9] R. Heller, Y. Heller, and M. Gorfine, “A consistent multivariate test of association based on ranks of distances,” Biometrika, vol. 100, no. 2, pp. 503–510, 2013.
 [10] B. Jiang, C. Ye, and J. S. Liu, “Nonparametric k-sample tests via dynamic slicing,” Journal of the American Statistical Association, no. just-accepted, pp. 00–00, 2014.
 [11] A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf, “Measuring statistical dependence with Hilbert-Schmidt norms,” in Algorithmic Learning Theory, pp. 63–77, Springer, 2005.
 [12] A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola, “A kernel two-sample test,” The Journal of Machine Learning Research, vol. 13, no. 1, pp. 723–773, 2012.
 [13] D. Lopez-Paz, P. Hennig, and B. Schölkopf, “The randomized dependence coefficient,” in Advances in Neural Information Processing Systems, pp. 1–9, 2013.
 [14] R. Heller, Y. Heller, S. Kaufman, B. Brill, and M. Gorfine, “Consistent distribution-free K-sample and independence tests for univariate random variables,” arXiv preprint arXiv:1410.6758, 2014.
 [15] A. Rényi, “On measures of dependence,” Acta Mathematica Hungarica, vol. 10, no. 3, pp. 441–451, 1959.
 [16] D. N. Reshef, Y. A. Reshef, P. C. Sabeti, and M. Mitzenmacher, “An empirical study of leading measures of dependence,” arXiv preprint arXiv:1505.02214, 2015.
 [17] J. B. Kinney and G. S. Atwal, “Equitability, mutual information, and the maximal information coefficient,” Proceedings of the National Academy of Sciences, 2014.
 [18] D. N. Reshef, Y. A. Reshef, M. Mitzenmacher, and P. C. Sabeti, “Cleaning up the record on the maximal information coefficient and equitability,” Proceedings of the National Academy of Sciences, 2014.
 [19] N. Simon and R. Tibshirani, “Comment on “Detecting novel associations in large data sets”,” Unpublished (available at http://wwwstat.stanford.edu/tibs/reshef/comment.pdf on 11 Nov. 2012), 2012.
 [20] G. Casella and R. L. Berger, Statistical inference, vol. 2. Duxbury Pacific Grove, CA, 2002.
 [21] T. Cover and J. Thomas, Elements of Information Theory. New York: John Wiley & Sons, Inc, 2006.
 [22] I. Csiszár, “Axiomatic characterizations of information measures,” Entropy, vol. 10, no. 3, pp. 261–273, 2008.
 [23] E. Linfoot, “An informational measure of correlation,” Information and Control, vol. 1, no. 1, pp. 85–89, 1957.
 [24] G. Szekely and M. Rizzo, “Brownian distance covariance,” The Annals of Applied Statistics, vol. 3, no. 4, pp. 1236–1265, 2009.
 [25] B. Murrell, D. Murrell, and H. Murrell, “R2-equitability is satisfiable,” Proceedings of the National Academy of Sciences, 2014.
 [26] A. A. Ding and Y. Li, “Copula correlation: An equitable dependence measure and extension of pearson’s correlation,” arXiv preprint arXiv:1312.7214, 2013.
 [27] X. Huo and G. J. Szekely, “Fast computing for distance covariance,” arXiv preprint arXiv:1410.1503, 2014.
 [28] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin, “Towards a learning theory of causation,” in International Conference on Machine Learning (ICML), 2015.
Appendix A Details of analyses
A.1 Functions analysed in Figures 2 and 7
A.2 Parameters used in Figure 7
In the analysis of the equitability of , distance correlation, and mutual information, the following parameter choices were made: for , and were used; for distance correlation no parameter is required; and for mutual information estimation via the Kraskov estimator, was used. The parameters chosen were the ones that maximize overall equitability in the detailed analyses performed in [16]. For mutual information, the choice of (out of the parameters tested: ) also maximizes equitability on the specific set that is analyzed in Figure 7.