
Theoretical Foundations of Equitability
and the Maximal Information Coefficient¹

¹This manuscript is subsumed by [1] and [2]. Please cite those papers instead.

Yakir A. Reshef School of Engineering and Applied Sciences, Harvard University. yakir@seas.harvard.edu.    David N. Reshef Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology. dnreshef@mit.edu.    Pardis C. Sabeti Department of Organismic and Evolutionary Biology, Harvard University. psabeti@oeb.harvard.edu.    Michael Mitzenmacher School of Engineering and Applied Sciences, Harvard University. michaelm@eecs.harvard.edu.
Abstract

The maximal information coefficient (MIC) is a tool for finding the strongest pairwise relationships in a data set with many variables [3]. MIC is useful because it gives similar scores to equally noisy relationships of different types. This property, called equitability, is important for analyzing high-dimensional data sets.

Here we formalize the theory behind both equitability and MIC in the language of estimation theory. This formalization has a number of advantages. First, it allows us to show that equitability is a generalization of power against statistical independence. Second, it allows us to compute and discuss the population value of MIC, which we call MIC_*. In doing so we generalize and strengthen the mathematical results proven in [3] and clarify the relationship between MIC and mutual information. Introducing MIC_* also enables us to reason about the properties of MIC more abstractly: for instance, we show that MIC_* is continuous and that there is a sense in which it is a canonical “smoothing” of mutual information. We also prove an alternate, equivalent characterization of MIC_* that we use to state new estimators of it as well as an algorithm for explicitly computing it when the joint probability density function of a pair of random variables is known. Our hope is that this paper provides a richer theoretical foundation for MIC and equitability going forward.

This paper will be accompanied by a forthcoming companion paper that performs extensive empirical analysis and comparison to other methods and discusses the practical aspects of both equitability and the use of MIC and its related statistics.

1 Introduction

Suppose we have a data set with hundreds or thousands of variables, and we wish to find the strongest pairwise associations in that data set. The number of pairs of variables will be in the hundreds of thousands, or even millions, and so manually examining each pairwise scatter plot is out of the question. In such a context, one commonly taken approach is to compute some statistic on each of these pairs of variables, to rank the variable pairs from highest- to lowest-scoring, and then to examine only the top of the resulting list.

The results of this approach depend heavily on the chosen statistic. In particular, suppose the statistic is a measure of dependence, meaning that its population value is non-zero exactly in cases of statistical dependence. Even with such a guarantee, the magnitude of this non-zero score may depend heavily on the type of dependence in question, thereby skewing the top of the list toward certain types of relationships over others. For instance, a statistic may give non-zero scores to both linear and sinusoidal relationships; however, if the scores of the linear relationships are systematically higher, then using this statistic to rank variable pairs in a large data set will cause the many linear relationships in the data set to crowd out any potential sinusoidal relationships from the top of the list. This means that the human examining the top of the list will effectively never see the sinusoidal relationships.

This shortcoming is not as concerning in a hypothesis testing framework: if what we sought were a comprehensive list of all the non-trivial associations in a data set, then all we would care about would be that the sinusoidal relationships are detected with sufficient power so that we could reject the null hypothesis. Many excellent methods exist that allow one to test for independence in this way in various settings [4, 5, 6, 7, 8, 9, 10]. However, often in data exploration the goal is to identify a relatively small set of the strongest associations in a data set, as opposed to finding as many non-trivial associations as possible, which are often too many to sift through. What is needed then is a measure of dependence whose values, in addition to allowing us to identify significant relationships (i.e. reject a null hypothesis of independence), also allow us to measure the strength of relationships (i.e. estimate an effect size).

With the goal of addressing this need, we introduced in [3] a notion called equitability: an equitable measure of dependence is one that assigns similar scores to relationships with equal noise levels, regardless of relationship type. This notion is notably imprecise – it does not specify, for example, which relationship types are covered nor what is meant by “noise” or “similar”. However, we noted that in the case of functional relationships, one reasonable definition of equitability might be that the value of the statistic reflect the coefficient of determination (R²) of the data with respect to the regression function, with as weak a dependence as possible on the function in question. Additionally, though characterizing noise in the case of superpositions of several functional relationships is difficult, it seems reasonable to require that the statistic give a perfect score when the functional relationships being superimposed are noiseless. We then introduced a statistic, the maximal information coefficient (MIC), that behaves more equitably on functional relationships than the state of the art and also has the desired behavior on superpositions of functional relationships, given sufficient sample size.

Although MIC has enjoyed widespread use in a variety of disciplines [11, 12, 13, 14, 15, 16, 17, 18, 19, 20], the original paper on equitability and MIC has generated much discussion, both published and otherwise, including some concerns and confusions. Perhaps the most frequent concern that we have heard is the desire for a richer and more formal theoretical framework for equitability, and this is the main issue we address in this paper. In particular, we provide a formal definition for equitability that is sufficiently general to allow us to state in one unified language several of the variants of the original concept that have arisen. We use this language to discuss the result of Kinney and Atwal about the impossibility of perfect equitability in some settings [21], explaining its limitations based on the permissive underlying noise model that it requires, and on the strong assumption of perfect equitability that it makes. (See Section 2.3.2.) We also use the formal definition of equitability to clarify the relationship between equitability and statistical power by proving an equivalent characterization of equitability in terms of power against a certain set of null hypotheses. Specifically, we show that whereas typical measures of dependence are analyzed in terms of the power of their corresponding tests to distinguish statistical independence from non-trivial associations, an equitable statistic is one that yields tests that can distinguish finely between relationships of different strengths that may both be non-trivial. We then explain how this relates to raised concerns about the power of MIC [22]. (See Section 3.)

Following our treatment of equitability, we show that MIC can be viewed as a consistent estimator of a population quantity, which we call MIC_*, and give a closed-form expression for it. This has several benefits. First, the consistency of MIC as an estimator of MIC_*, together with properties of MIC_* that are easy to prove given the closed-form expression, trivially subsumes and generalizes many of the theorems proven in [3] about the properties of MIC. Second, it clarifies that the parameter choices in MIC are not fundamental to the definition of the estimand (MIC_*) but rather simply control the bias and variance of the estimator (MIC). Third, separating finite-sample effects from properties of the estimand allows us to rigorously discuss the theoretical relationship between MIC_* and mutual information. And finally, since power is a property of the test corresponding to a statistic at finite sample sizes and not of the population value of the statistic, this re-orientation allows us to ask whether there exist estimators of MIC_* other than MIC that retain the relative equitability of MIC but also result in better power against statistical independence. It turns out that there do, as we shall soon discuss.

Having a closed-form expression for the population value of MIC (i.e. MIC_*) also allows us to reason about it more abstractly, and this is the goal to which we devote the remainder of the paper. We first show that, considered as a function of probability density functions, MIC_* is continuous. This further clarifies the relationship with mutual information by allowing us to view MIC_* as a “minimally smoothed” version of mutual information that is uniformly continuous. (In contrast, mutual information alone is not continuous in this sense.)

Our theory also yields an equivalent characterization of MIC_* that allows us to develop better estimators of it. The expression for MIC_* given at the beginning of this paper, which is analogous to the expression for MIC, defines it as the supremum of a matrix called the characteristic matrix. We show here that MIC_* can instead be viewed as the supremum of only the boundary of that matrix. This is theoretically interesting, but it is also practically important, because computing elements of this boundary is easier than computing elements of the original matrix. In particular, our equivalent characterization of MIC_* leads to the following advances.

  • A new consistent estimator of MIC_*, which we call MIC_e, and an exact, efficient algorithm for computing it. (The algorithm introduced in [3] for computing MIC is only a heuristic.)

  • An approximation algorithm for computing the MIC_* of a given probability density function. Previously only heuristic algorithms were known [3, 23]. Having such an algorithm enables the evaluation of different estimators of MIC_* as well as the evaluation of properties of MIC_* in the infinite-data limit.

  • A second estimator of MIC_*, which proceeds by using a consistent density estimator to estimate the probability density function that gave rise to the observed samples, and then applying the above algorithm to compute the true MIC_* of the resulting probability density function. This approach may prove more accurate in cases where we can encode into our density estimator some prior knowledge about the types of distributions we expect.

This paper will be accompanied by a forthcoming companion paper that performs extensive empirical analysis of several methods, including comparisons of MIC and related statistics. Among other things, that paper shows that the first of the two estimators introduced here, MIC_e, yields a significant improvement in terms of equitability, power, bias/variance properties, and runtime over the original statistic introduced in [3]. The companion paper also compares both estimators as well as the original statistic from [3] to existing methods along these same criteria and discusses when equitability is a useful desideratum for data exploration in practice. Thus, questions concerning the performance of MIC (as well as other estimators of MIC_*) compared to other methods at finite sample sizes are deferred to the companion paper, while this paper focuses on theoretical issues. Our hope is that these two papers together will lay a richer theoretical foundation on which others can build to improve our knowledge of equitability and MIC_*, and to advance our understanding of when equitability and estimation of MIC_* are useful in practice.

2 Equitability

Equitability has been described informally as the ability of a statistic to “give similar scores to equally noisy relationships of different types” [3]. Here we provide the formalism necessary to discuss this notion more rigorously and define equitability using the language of estimation theory. Although what we are ultimately interested in is the equitability of a statistic, we first define equitability and discuss variations on the definition in the setting of random variables. Only then do we adapt our definitions to incorporate the uncertainty that comes with working with finite samples rather than random variables.

2.1 Overview

Before we formally define equitability in full generality, we first give a semi-formal overview of how we will do so, as well as a brief discussion of the benefits of our approach.

In [3], we asked to what extent evaluating a statistic like MIC on a sample from a noisy functional relationship with joint distribution Z tells us about the coefficient of determination (R²) of that relationship.² However, this setup can be generalized as follows. We have a statistic φ̂ (e.g. MIC) that detects any deviation from statistical independence, a set F of distinguished standard relationships on which we are able to define what we mean by noise (e.g. noisy functional relationships), and a property of interest Φ that quantifies the noise in those relationships (e.g. R²). We now ask: to what extent will evaluating φ̂ on a sample from some joint distribution Z ∈ F tell us about Φ(Z)?

²Recall that for a pair of jointly distributed random variables (X, Y) with regression function f, R² is the squared Pearson correlation coefficient between Y and f(X).
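To make this running example concrete, the following Python sketch estimates R² for a noisy functional relationship as the squared Pearson correlation between the observed y-values and the regression function applied to the x-values; the sine model, noise level, and sample size are hypothetical choices for illustration only.

    import numpy as np

    def r_squared(x, y, f):
        # R^2 of a noisy functional relationship with regression function f:
        # the squared Pearson correlation between Y and f(X).
        return np.corrcoef(y, f(x))[0, 1] ** 2

    # Hypothetical example: a noisy sine relationship with uniform X.
    rng = np.random.default_rng(0)
    x = rng.uniform(0, 1, size=10_000)
    f = lambda t: np.sin(4 * np.pi * t)
    y = f(x) + rng.normal(scale=0.3, size=x.size)
    print(r_squared(x, y, f))  # about var(f(X)) / (var(f(X)) + 0.3**2)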

How will we quantify this? Let us step back for a moment and suppose that finite-sample effects are not an issue: we will consider the population value φ of our statistic and discuss its desired behavior on distributions Z ∈ F. In this setting, we will want φ to have the following two properties.

  1. It is a measure of dependence. That is, φ(Z) = 0 if and only if Z exhibits statistical independence.

  2. For every fixed number y there exists a small interval A_y such that φ(Z) = y implies Φ(Z) ∈ A_y. In other words, the set Φ({Z ∈ F : φ(Z) = y}) is small.

Assuming that φ satisfies the first criterion, we will define its equitability on F with respect to Φ as the extent to which it satisfies the second. A stronger version of this definition, which we will call perfect equitability, adds the requirement that the interval A_y be of size 0, i.e. that Φ(Z) be exactly recoverable from φ(Z) regardless of the particular identity of Z. Note, however, that this is strictly a special case of our definition, and in general when we discuss equitability we are explicitly acknowledging the fact that this may not be the case.

The notion of equitability we just described for φ will then have a natural extension to a statistic φ̂. The extension will proceed in the same way that one might define a confidence interval of an estimator: for a fixed number y, instead of considering the distributions Z for which φ(Z) = y exactly, we will consider the distributions Z for which y is a likely result of computing φ̂(Z_n), where Z_n is a sample of size n from Z.

In Section 3 we will use this formalization of equitability to give an alternate, equivalent definition in terms of statistical power against a certain set of null hypotheses.

Though [3] focused primarily on various models of noisy functional relationships with R² as the property of interest, the appropriate definitions of F and Φ may change from application to application. For instance, as we have noted previously, when F is the set of all superpositions of noiseless functional relationships and the property of interest is identically 1, then the population MIC (i.e. MIC_*) is perfectly equitable. More generally, instead of functional relationships one may be interested in relationships supported on one-manifolds, with added noise. Or perhaps instead of R² one may decide to focus simply on the magnitude of the added noise, or on the mutual information between the sampled y-values and the corresponding de-noised y-values. In each case the overarching goal should be to have F be as large as possible without making it impossible to define an interesting Φ or making it impossible to find a measure of dependence that achieves good equitability on F with respect to this Φ. For this reason, we keep our exposition on equitability generic, and use noisy functional relationships and R² only as a motivating example.

Keeping our exposition generic also allows us to address variations on the concept of equitability that have been introduced by others. For example, we are able to state in a formal, unified language the relationship of the work of Kinney and Atwal [21] to our previous work on MIC. In particular, we explain why their negative result about the impossibility of achieving perfect equitability is of limited scope, due both to its focus on perfect equitability and to the permissive noise model that it requires. (See Section 2.3.2 for this discussion.)

As a matter of record, we wish to clarify at this point that the key motivation given for Kinney and Atwal’s work, namely that our original paper [3] stated that MIC was perfectly equitable, is incorrect. Specifically, they write “The key claim made by Reshef et al. in arguing for the use of MIC as a dependence measure has two parts. First, MIC is said to satisfy not just the heuristic notion of equitability, but also the mathematical criterion of R²-equitability…”, with the latter term referring to perfect equitability [21]. However, such a claim was never made in [3]. Rather, that paper informally defined equitability as an approximate notion and compared the equitability of MIC, mutual information estimation, and other schemes empirically, concluding that MIC is the most equitable statistic in a variety of settings. In other words, one method can be more equitable than another, even if neither method is perfectly equitable. We intend for the formal definitions we present in this section to lead to a clearer picture of the relationships among these concepts and among the results published about them.

2.2 Preliminaries: interpretability and reliability

Let Q be the set of joint distributions of pairs of real-valued random variables, and let φ be a real-valued function on Q such that, for Z ∈ Q describing a pair of jointly distributed random variables (X, Y), φ(Z) = 0 if and only if X and Y are statistically independent. Such a map is called a measure of dependence.

Now let F be some subset of Q indexed by a parameter θ, and let Φ be some property that is defined on F but may not be defined on all of Q. We ask the following question: to what extent does knowing φ(Z) for some Z ∈ F tell us about the value of Φ(Z)? We will refer to the members of F as standard relationships, and to Φ as the property of interest.

Conventionally, noisy functional relationships have been used as standard relationships, and the corresponding property of interest has been R² with respect to the regression function. However, as noted above we might imagine different scenarios. For this reason, we will make our exposition as generic as possible and refer back to the setting of noisy functional relationships as a motivating example.

Figure 1: A schematic illustration of reliable and interpretable intervals. In both figure parts, F is a union of three different models corresponding to the three different colors. (a) The relationship between Φ and φ on distributions in F in the infinite-data limit. The indicated vertical interval is the reliable interval R(x), and the indicated horizontal interval is the interpretable interval I(y). (b) The relationship between some estimator φ̂ of φ and Φ on F at a finite sample size. The colored dashed lines indicate the α/2 and 1 − α/2 percentiles of the sampling distribution of φ̂ for each model, at various values of Φ. The indicated vertical interval is the reliable interval of φ̂ at x, and the indicated horizontal interval is the interpretable interval of φ̂ at y.

Regardless of our choice of φ, F, and Φ, there are two straightforward ways to measure how similar φ is to Φ on F. The first such way is to restrict our attention only to distributions Z ∈ F with Φ(Z) = x, and then to ask how much φ can vary subject to that constraint.

Definition 2.1.

Let φ be a measure of dependence, and let x ∈ [0, 1]. The smallest closed interval containing the set φ({Z ∈ F : Φ(Z) = x}) is called the reliable interval of φ at x and is denoted by R(x). φ is an ε-reliable proxy for Φ on F at x if and only if the diameter of R(x) is at most ε.

Equivalently, φ is an ε-reliable proxy for Φ on F at x if and only if there exists an interval A of size ε such that Φ(Z) = x implies that φ(Z) ∈ A. In other words, if we restrict our attention to distributions Z ∈ F such that Φ(Z) = x, we are guaranteed that φ applied to those distributions will produce values that are close to each other. (See Figure 1a for an illustration.) In the context of noisy functional relationships and R², this corresponds to saying that relationships with the same R² will not score too differently.

The second way of measuring how closely φ matches Φ on F is to ask how much Φ can vary when we consider only distributions Z ∈ F with φ(Z) = y.

Definition 2.2.

Let φ be a measure of dependence, and let y ∈ [0, 1]. The smallest closed interval containing the set Φ({Z ∈ F : φ(Z) = y}) is called the interpretable interval of φ at y and is denoted by I(y). φ is an ε-interpretable proxy for Φ on F at y if and only if the diameter of I(y) is at most ε.

Equivalently, φ is an ε-interpretable proxy for Φ on F at y if and only if there exists an interval A of size ε such that φ(Z) = y implies that Φ(Z) ∈ A for all Z ∈ F. In other words, if all we know about a distribution Z ∈ F is that φ(Z) = y, then we are able to guess the value of Φ(Z) fairly accurately. (See Figure 1a for an illustration.) In the context of noisy functional relationships and R², this corresponds to the fact that evaluating φ on a relationship will give us good upper and lower bounds on the noise level of that relationship as measured by R².
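Because F is typically infinite, these intervals are objects of analysis rather than direct computation, but a small sketch over a finite stand-in for F may help fix ideas; here models, phi, and Phi are placeholder names for whatever family, measure of dependence, and property of interest one is studying.

    def reliable_interval(models, phi, Phi, x, tol=1e-3):
        # Smallest interval containing {phi(Z) : Z in models, Phi(Z) = x},
        # with `models` a finite stand-in for the (usually infinite) family F.
        vals = [phi(Z) for Z in models if abs(Phi(Z) - x) < tol]
        return (min(vals), max(vals)) if vals else None

    def interpretable_interval(models, phi, Phi, y, tol=1e-3):
        # Smallest interval containing {Phi(Z) : Z in models, phi(Z) = y}.
        vals = [Phi(Z) for Z in models if abs(phi(Z) - y) < tol]
        return (min(vals), max(vals)) if vals else None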

When F and Φ are clear from context we will omit them and describe φ simply as ε-reliable (resp. ε-interpretable) at x (resp. y).

Once we have specified what we mean by “reliable” and “interpretable”, it is straightforward to define “reliability” and “interpretability”.

Definition 2.3.

The reliability (resp. interpretability) of φ at x (resp. y) is 1/d, where d is the diameter of R(x) (resp. I(y)). If d = 0, the reliability (resp. interpretability) of φ is defined to be ∞ and φ is called perfectly reliable (resp. perfectly interpretable).

We will occasionally refer to the more general notions of reliability/interpretability as “approximate” to distinguish them from the perfect case.

One can imagine many different ways to quantify the overall interpretability and reliability of a measure of dependence. For instance, we have the following.

Definition 2.4.

A measure of dependence is worst-case ε-reliable (resp. ε-interpretable) if it is ε-reliable (resp. ε-interpretable) at all x (resp. y).

A measure of dependence is average-case ε-reliable (resp. ε-interpretable) if its reliability (resp. interpretability), averaged over all x (resp. y), is at least 1/ε.

More generally, one could imagine defining a prior over all the distributions in F to reflect one’s belief about the importance of various types of relationships in the world, and then using that prior to measure overall reliability and interpretability. We do not pursue this here; instead, we focus only on worst-case reliability and interpretability.

Let us give two simple examples of the use of this new terminology. First, the Linfoot correlation coefficient [24], defined as 1 − e^{−2I(X;Y)} where I is mutual information, is a worst-case perfectly interpretable and perfectly reliable proxy for R², the squared Pearson correlation coefficient ρ², on the set F of bivariate normal random variables. Additionally, Theorem 6 of [4] implies that distance correlation is a perfectly interpretable and perfectly reliable proxy for R² on the same set F. In the first example, the given measure of dependence simply equals R² when it is restricted to F, which is why the reliability and interpretability are perfect. In the second example, the distance correlation does not equal R² but rather is a deterministic function of it, which is sufficient.
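A quick check of the first example: for a bivariate normal Z with correlation ρ, the mutual information has a standard closed form, and substituting it into the definition above gives

    I(X; Y) = −(1/2) log(1 − ρ²),   so   1 − e^{−2I(X;Y)} = 1 − (1 − ρ²) = ρ²,

i.e., the Linfoot coefficient coincides with R² everywhere on F, so each reliable and interpretable interval is a single point.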

2.3 Defining equitability

2.3.1 Equitability in the sense of [3]

As we have suggested above, in the language of reliability and interpretability, the informal notion of equitability described in [3] amounts to a requirement that a measure of dependence be a highly interpretable proxy for some property of interest Φ that is suitably defined to reflect relationship strength, over as large a model F as possible.

We have discussed the fact that the particular choices of F and Φ may vary from problem to problem, as might the way in which equitability is measured (average-case versus worst-case). Let us now define the models considered in [3]. We begin by stating precisely what we mean by the term “noisy functional relationship”.

Definition 2.5.

A random variable Z distributed over ℝ² is called a noisy functional relationship if and only if it can be written in the form Z = (X + ε, f(X) + ε′), where f : [0, 1] → ℝ, X is a random variable distributed over [0, 1], and ε and ε′ are random variables. We denote the set of all noisy functional relationships by F.

As we will soon discuss, there are varying views about whether constraints should be placed on ε and ε′, ranging from setting them to be Gaussians independent of each other and of X, all the way to allowing them to be arbitrary random variables that are not necessarily independent of X. For this reason, we do not place any constraints on them in the above definition.

With the concept of noisy functional relationships defined, equitability on a set of functional relationships simply amounts to the use of R² as the property of interest.

Definition 2.6 (Equitability on functional relationships in the sense of Reshef et al.).

Let F be a set of noisy functional relationships. A measure of dependence φ is worst-case (resp. average-case) ε-equitable on F if it is a worst-case (resp. average-case) ε-interpretable proxy for R² on F.

In this paper, we will often use “equitability” with no qualifier to mean worst-case equitability.

Given a set F of functions from [0, 1] to ℝ, [3] defined a few different subsets of the set of noisy functional relationships. The simplest, which we denote here by F_U, consists of relationships of the form (X, f(X) + ε) for f ∈ F, where the letter U in F_U indicates that X is uniform³ over [0, 1], and ε is uniform over an interval centered at 0 and independent of X. Of course, one can add noise in the first coordinate as well, producing relationships of the form (X + ε′, f(X) + ε), where ε′ is defined analogously to ε. In both of the above cases, we can also modify X such that, rather than being uniformly distributed over [0, 1], it is distributed in such a way that (X, f(X)) is uniformly distributed over the graph of f. This gives the last two models used in [3].

³In [3], X was not actually random. Instead, values of X were chosen in [0, 1] to produce evenly spaced x-values. However, for theoretical clarity we opt here to treat X as a random variable.
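For concreteness, a minimal Python sampler for the first two of these models might look as follows; the symmetric interval [−a, a] for the noise and the parameter names are assumptions made for this sketch, since the definitions above do not pin down the interval.

    import numpy as np

    def sample_noisy_functional(f, n, a, b=0.0, rng=None):
        # X uniform on [0, 1]; Y = f(X) + eps with eps uniform on [-a, a]
        # and independent of X. Setting b > 0 adds analogous independent
        # uniform noise to the first coordinate, as in the second model.
        rng = rng or np.random.default_rng()
        x = rng.uniform(0, 1, size=n)
        eps_y = rng.uniform(-a, a, size=n)
        eps_x = rng.uniform(-b, b, size=n) if b > 0 else np.zeros(n)
        return x + eps_x, f(x) + eps_y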

The reason that [3] defined four different models was simple: since it is often difficult to say exactly which model (if any) is actually followed by real data, we would ideally like to see good equitability on as many different such models as possible. Given the lack of a neat description of how real data behave, we aim for robustness.

Nevertheless, each of these models is somewhat narrow, and we can easily imagine others: for instance, we might define ε and ε′ to be Gaussian, we might allow them to depend on each other, or we might consider adding noise only to the first coordinate. Each of these modifications deserves attention.

Remark 2.7.

In the remainder of this paper, we will use the terms “equitability” and “interpretability” differently, but the difference is merely notional and not formal: equitability is the type of interpretability that we get when our goal is that φ reflect the strength of our relationships.

2.3.2 Kinney and Atwal’s impossibility result

Now that we have a sufficiently general language in which to discuss equitability, let us turn to the recent impossibility result of Kinney and Atwal [21]. Kinney and Atwal write the following.

[W]e prove that the definition of equitability proposed by Reshef et al. (ed: [3]) is, in fact, impossible for any (nontrivial) dependence measure to satisfy.

However, this result actually has two severe limitations to its scope. To understand these issues, let us state the result in the language developed above: it amounts to showing that no non-trivial measure of dependence can be perfectly equitable (i.e. a perfect worst-case interpretable proxy for R²) on the model consisting of relationships of the form (X, f(X) + η),

with η representing a random variable that is conditionally independent of X given f(X). This model describes functional relationships with noise in the second coordinate only, where that noise can depend arbitrarily on the value of f(X) (i.e. it can be heteroscedastic) but must be otherwise independent of X.

The first limitation of this result is that the argument depends crucially on the fact that the noise term η can depend arbitrarily on the value of f(X). In particular, its mean need not be 0 but rather may change depending on f(X). As pointed out in [25], selecting such a large model leads to identifiability issues, such as allowing one to obtain a given relationship as a “noisy” version of a very different one. The more permissive (i.e. larger) a model is, the easier it is to prove an impossibility result for it, and this model is indeed quite large: in particular, it is not contained in any of the models defined above, which would be necessary in order for impossibility on it to translate into impossibility for one of those other models. Thus, Kinney and Atwal’s result does not apply to the models defined in [3].

The second limitation of Kinney and Atwal’s result is that it only addresses perfect equitability rather than the more general, approximate notion with which we are primarily concerned. As we discussed in Section 2.1, the claim that the definition of equitability given in [3] was one of perfect equitability rather than approximate equitability is incorrect. More generally, however, though a perfectly equitable proxy for R² may indeed be difficult or even impossible to achieve for many large models, including some of the models defined above, such impossibility would make approximate equitability no less desirable a property. The question thus remains how equitable various measures are, both provably and empirically. To borrow an analogy from computer science, the fact that a problem is proven to be NP-complete does not mean that we do not want efficient algorithms for the problem; we simply may have to settle for heuristic solutions, or solutions with some provable approximation guarantees. Similarly, there is merit in searching for measures of dependence that appear to be sufficiently equitable proxies for R² in practice.

For more on this discussion, see the technical comment [26] published by the authors of this paper about [21].

2.4 Equitability of a statistic

Until now we have only discussed the properties of a measure of dependence considered as a function of random variables. However, it is trivial to define a perfectly reliable and interpretable proxy for any Φ on any F: simply define φ to equal Φ on F and to equal an arbitrary measure of dependence outside of F. Of course, this is not the point. Rather, the idea is to define a function φ that is amenable to efficient estimation, and to use the notions of interpretability and reliability defined above in order to separate the loss in performance that a given estimator of φ incurs from finite-sample effects from the loss in performance caused by the choice of the estimand φ itself.

Figure 2: The analogy between interpretable intervals and confidence intervals. The left-hand column depicts a scenario in which φ̂ is estimating a parameter θ. As sample size increases, the width of the confidence intervals of φ̂ will tend to zero because each value of θ corresponds to exactly one population value of φ. The right-hand column depicts a scenario in which φ̂ is being used as an estimate of Φ, but Φ does not completely determine the population value of φ: the red, blue, and green curves represent distinct sets of distributions in F whose members can have identical values of Φ. For instance, they might correspond to different function types. This is the setting in which we are operating, and the intervals plotted on the right are called interpretable intervals. Interpretable intervals can be large either because of finite-sample effects (as in the conventional estimation case) or because of the lack of interpretability of the population value of the statistic (shown in the bottom-right picture).

However, to reason about this distinction, we do need a way to directly evaluate the reliability and interpretability of a statistic at a given sample size. To do so, we will adapt our above definitions from the infinite-data limit by analogy with the theory of estimation and confidence intervals. Specifically, in estimation theory, confidence intervals can be defined in terms of the sets of likely values of a statistic at each value of the parameter. In the same way, we will define a reliable interval to be a set of likely values of φ̂ given a certain value of Φ, and then define the interpretable interval in terms of the values of Φ whose reliable intervals contain a given value of φ̂. This analogy is depicted in Figure 2 and Table 1.

Remark 2.8.

The analogy between an equitable statistic and an estimator with small confidence intervals can be made even more explicit as follows: ordinarily, the best way to obtain information about Φ would be to estimate it directly. However, if we do so we are not guaranteed that the statistic we use will detect deviations from statistical independence when used on distributions not in F. Thus, our problem is akin to that of seeking the best possible estimator of Φ on F subject to the constraint that the population value of the statistic equal 0 if and only if the distribution in question exhibits statistical independence. The difference is that we only care about the confidence intervals of the estimator and not about its bias, since we are principally interested in ranking relationships according to Φ rather than recovering the exact value of Φ.

        Estimating θ (confidence)  |  “Estimating” Φ (interpretability)
Model:  One value of φ for each value of θ  |  Multiple values of φ for each value of Φ
Error:  Confidence intervals wide due to finite-sample effects  |  Interpretable intervals wide due to finite-sample effects, as well as the infinite-sample relationship between φ and Φ
Tests:  Small confidence intervals give power at rejecting incorrect values of θ  |  Small interpretable intervals give power at rejecting incorrect values of Φ
Table 1: The analogy between confidence intervals in the setting of estimating a parameter θ that completely parametrizes a model, and interpretable intervals when viewed as confidence intervals for φ̂ as an “estimate” of Φ.

We first define the reliability of a statistic. Previously, reliability meant that if we know Φ(Z) then we can place φ(Z) in a small interval. To obtain the analogous definition for a statistic, we simply relax the requirement that φ(Z) be in a small interval to the requirement that φ̂(Z_n) be in a small interval with high probability when Z_n is a sample from Z. This is equivalent to simply considering φ̂ as an estimator of Φ rather than of φ and requiring that its sampling distribution have its probability mass concentrated in a small area.

Definition 2.9.

Let φ̂ be a statistic, and let x ∈ [0, 1]. The (1 − α)-reliable interval⁴ of φ̂ at x, denoted by R^{1−α}(x), is the smallest closed interval with the property that, for all Z ∈ F with Φ(Z) = x,

    Pr(φ̂(Z_n) < inf R^{1−α}(x)) ≤ α/2

and

    Pr(φ̂(Z_n) > sup R^{1−α}(x)) ≤ α/2,

where Z_n is a sample of size n from Z.

The statistic φ̂ is an ε-reliable proxy for Φ on F at x with probability 1 − α if and only if the diameter of R^{1−α}(x) is at most ε.

⁴This is simply the union of the central (1 − α) intervals of the sampling distribution of φ̂ taken over all distributions Z ∈ F with Φ(Z) = x.

(See Figure 1b for an illustration.) Looking once more at the example of noisy functional relationships with R² as the property of interest, this corresponds to the requirement that there exist a small interval A such that, for any functional relationship Z with an R² of x, φ̂(Z_n) falls within A with high probability when Z_n is a sample from Z.
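Such an interval can be approximated by Monte Carlo simulation, as in the following hedged sketch: each element of samplers_at_x is a hypothetical sampler for one model Z with Φ(Z) = x, and the estimated (1 − α)-reliable interval is the union of the simulated central intervals.

    import numpy as np

    def reliable_interval_mc(samplers_at_x, stat, n, alpha=0.05,
                             reps=500, rng=None):
        # Union of the central (1 - alpha) intervals of the sampling
        # distribution of `stat` at sample size n, over every supplied
        # model with Phi(Z) = x. Each sampler returns (x, y) arrays.
        rng = rng or np.random.default_rng()
        lo, hi = np.inf, -np.inf
        for sample in samplers_at_x:
            vals = [stat(*sample(n, rng)) for _ in range(reps)]
            l, h = np.percentile(vals, [100 * alpha / 2, 100 * (1 - alpha / 2)])
            lo, hi = min(lo, l), max(hi, h)
        return lo, hi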

Once reliability is suitably defined, the definition of interpretability is simple to translate into one for a statistic. Here we again make our definition by considering φ̂ as an estimator of Φ and looking at its confidence intervals. The key point is that while a confidence interval of a consistent estimator becomes large only due to finite-sample effects, the so-called interpretable interval can become large either because of finite-sample effects or because the function φ to which φ̂ converges is itself not very interpretable.

Definition 2.10.

Let φ̂ be a statistic, and let y ∈ [0, 1]. The (1 − α)-interpretable interval of φ̂ at y, denoted by I^{1−α}(y), is the smallest closed interval containing the set

    {x : y ∈ R^{1−α}(x)}.

The statistic φ̂ is an ε-interpretable proxy for Φ on F at y with confidence 1 − α if and only if the diameter of I^{1−α}(y) is at most ε.

(See Figure 1b for an illustration.)

Remark 2.11.

Note that our definitions do not require that φ̂ converge to Φ in any sense; we are not trying to construct a measure of dependence that also estimates Φ exactly. Rather, we are willing to tolerate some discrepancy between φ and Φ in order to preserve the fact that φ̂ acts as a measure of dependence when applied to samples from distributions not in F. This is the essential compromise behind the idea of equitability. Why is it worthwhile to make? Because on the one hand, if we are interested in ranking relationships, then having only a measure of dependence with no guarantees about how noise affects its score will not do; but on the other hand, we want a statistic that is robust enough that we will not completely miss relationships that do not fall in this set.

Analogous definitions can be made for average-case and worst-case reliability/equitability, and for equitability on functional relationships.

2.5 Discussion

As the definitions given above imply, an equitable statistic is different from other measures of dependence in that its main intended use is not testing for independence, but rather measurement of effect size. The idea is to have a statistic that has the robustness of a measure of dependence but that also, via its relationship to Φ, gives values that have a clear, if approximate, interpretation and can therefore be used to rank relationships.

There is a tension inherent in the concept of equitability that arises from the attempt to reconcile the robustness of a measure of dependence with the utility of a measure of effect size. This tension leads to two important concessions to pragmatism.

  1. The set F is not the set Q of all distributions but rather some strict subset of it.

  2. Despite the fact that we evaluate φ̂ as an estimator of Φ, we have not required that φ̂ converge to Φ in any sense, and we explicitly allow for the possibility that it may not. Rather, we are willing to tolerate some discrepancy between the population value of φ̂ and Φ in order to preserve the fact that φ̂ acts as a measure of dependence when applied to samples from distributions not in F.

The first of these compromises necessitates the second. For if we could set F to be the set of all distributions and still define a property of interest Φ that captured what we mean by a “strong” relationship, then we truly would simply seek an estimator for Φ and be done. Unfortunately we cannot do this; the concepts of “noise” and what it means to be a “strong” relationship can become elusive when we enlarge F too much. However, this does not mean that we should give up on seeking a statistic that performs reasonably at ranking relationships. Therefore, while we define exactly what we would like to estimate (i.e., Φ) wherever we can (i.e., on some F), we still demand that our statistic act as a measure of dependence on relationships not in F. This second requirement may hurt our ability to estimate Φ, but when we are exploring data sets with real relationships whose form we cannot fully anticipate or model, the robustness it gives can be worth the price of relaxing the requirement that φ̂ converge to Φ to a requirement that it merely approximate Φ. This is our second compromise.

In this section, we largely focused on setting F to be some subset of the set of noisy functional relationships, as this has been the subject of most of the empirical work on the equitability of MIC and other measures of dependence. However, it is important to keep in mind that F should ideally be larger than this. For instance, as we discussed previously, in [3] the equitability of MIC is discussed not just in the case of noisy functional relationships but also in the case of superpositions of functional relationships.

As the compromises discussed above make clear, equitability sits in between the traditional hypothesis-testing paradigm of measures of dependence on the one hand and the paradigm of measuring effect size on the other. However, equitability can actually be framed entirely in terms of hypothesis tests. This is the topic of our next section.

3 Equitability as a generalization of power against independence

Having defined equitability in terms of estimation theory, we will now show that we can equivalently think of it in terms of power against a certain family of null hypotheses. This result re-casts equitability as a generalization of power against statistical independence and gives a second formal definition of equitability that is easily quantifiable using traditional power analysis.

3.1 Overview

Our proof is based on the idea behind a standard construction of confidence intervals via inversion of statistical tests. In particular, equitability of a statistic φ̂ with respect to a property of interest Φ on a model F will be shown to be equivalent to power against the collection of null hypotheses of the form Φ(Z) = x₀ corresponding to different values of x₀. Thus, if Φ is such that Φ(Z) = 0 if and only if Z exhibits statistical independence, then equitability with respect to Φ is a strictly stronger requirement than power against statistical independence.

As a concrete example, let us again return to the case in which F is a set of noisy functional relationships and the property of interest is R². Here, a conventional power analysis would consider, say, the right-tailed test based on the statistic φ̂ and evaluate its type II error at rejecting the null hypothesis of R² = 0, i.e. statistical independence. In contrast, we will show that for φ̂ to be equitable, it must yield right-tailed tests with high power against null hypotheses of the form R² = x₀ for any x₀. This is difficult: each of these new null hypotheses can be composite since F can contain relationships of many different types (e.g. a noisy linear relationship, a noisy sinusoidal relationship, and a noisy parabolic relationship). Whereas all of these relationships may have reduced to a single null hypothesis of statistical independence in the case of x₀ = 0, they yield composite null hypotheses once we allow x₀ to be non-zero.

3.2 Definitions and proof of the result

As before, let Q be the set of distributions of pairs of jointly distributed random variables, and let φ be a measure of dependence estimated by some statistic φ̂. Let F ⊆ Q be some model of interest and let Φ be a property of interest.

Now, given some x₀ ∈ [0, 1] and t ∈ ℝ, let T(x₀, t) be the right-tailed test based on φ̂ with critical value t, null hypothesis Φ(Z) = x₀, and alternative hypothesis Φ(Z) > x₀. The set {T(x₀, t) : t ∈ ℝ} is the set of possible right-tailed tests based on φ̂ that are available to us for distinguishing Φ(Z) = x₀ from Φ(Z) > x₀. We will distinguish one of these tests in particular, namely the optimal one subject to a constraint on type I error: let T_α(x₀) be the test T(x₀, t) with t chosen to be as small as possible subject to the constraint that the type I error of the resulting test be at most α. We are now ready to define the measure of power that we will use to show the equivalence with equitability.

Definition 3.1.

Fix α ∈ (0, 1). For any given x ≥ x₀, let K_{x₀}(x) be the power of T_α(x₀) against alternatives Z ∈ F with Φ(Z) = x. We call the function K_{x₀} the power function associated to φ̂ at x₀ with significance α with respect to F and Φ.

When Φ(Z) = 0 if and only if Z represents statistical independence, the power function at x₀ = 0 gives the power of right-tailed tests based on φ̂ at distinguishing statistical independence from various non-zero values of Φ with significance α. For instance, if F is the set of bivariate normal distributions and Φ is the ordinary correlation ρ, then K₀(x) simply gives us the power of the right-tailed test based on φ̂ at distinguishing the alternative hypothesis of ρ = x from the null hypothesis of ρ = 0. As an additional example, in the cases discussed above where F is some set of functional relationships and Φ is R², the power function associated to φ̂ at 0 equals the power of the right-tailed test based on φ̂ that distinguishes the alternative hypothesis of R² = x from the null hypothesis of R² = 0, i.e., independence, with type I error α.
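The construction can likewise be sketched by simulation. In the following hypothetical Python fragment, the null samplers stand in for the (possibly composite) set {Z : Φ(Z) = x₀} and the alternative samplers for {Z : Φ(Z) = x}; the critical value is taken as the worst case over the null, and the reported power as the worst case over the alternative.

    import numpy as np

    def right_tailed_power(null_samplers, alt_samplers, stat, n,
                           alpha=0.05, reps=500, rng=None):
        # Critical value: largest (1 - alpha)-quantile of `stat` over the
        # composite null; power: worst case over the composite alternative.
        rng = rng or np.random.default_rng()
        def draws(sampler):
            return np.array([stat(*sampler(n, rng)) for _ in range(reps)])
        crit = max(np.quantile(draws(s), 1 - alpha) for s in null_samplers)
        return min(float(np.mean(draws(s) > crit)) for s in alt_samplers)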

Nevertheless, as we observe here, the set of power functions at values of x₀ besides 0 contains much more information than just the power of right-tailed tests based on φ̂ against the null hypothesis of statistical independence. We can recover the interpretability of a statistic at every y by considering its power functions at values of x₀ beyond 0. This is the main result of this section. It is analogous to the standard relationship between the size of the confidence intervals of an estimator and the power of their corresponding right-tailed tests.

Remark 3.2.

In this setup our null and alternative hypotheses, since they are based on Φ and not on a parametrization of F that uniquely specifies distributions, may be composite: Z can be one of several distributions with Φ(Z) = x₀ or Φ(Z) = x respectively. This composite nature of our null hypotheses is bound up in the reason we need interpretability and reliability in the first place: if the set F were so small that each value of Φ defined only one distribution, then we would likely not be in a setting where we needed an agnostic approach to detecting strong relationships. We could just estimate Φ directly.

Before we prove the main result of this section, the connection between power and interpretability, we must first define what aspect of power will be reflected in the interpretability of φ̂.

Definition 3.3.

The uncertain set of a power function K_{x₀} is the set {x ≥ x₀ : K_{x₀}(x) < 1 − α}.

We will now prove the main proposition of this section, which is essentially that uncertain sets are interpretable intervals and vice versa. In what follows, since our statistic φ̂ is fixed, we use R(x) to denote its (1 − α)-reliable interval at x, and I(y) to denote its (1 − α)-interpretable interval at y. We also use the function diam to denote the diameter of a subset of ℝ.

Proposition 3.4.

Fix α ∈ (0, 1) and a sample size n, and suppose φ̂ is a statistic with the property that sup R(x) is a continuous, increasing function of x. The following two statements hold.

  1. If diam I(y) ≤ ε for all y, then the uncertain set of K_{x₀} has diameter at most ε for every x₀.

  2. If the uncertain set of K_{x₀} has diameter at most ε for every x₀, then diam I(y) ≤ ε for all y.

An illustration of this proposition and its proof is shown in Figure 3.

Figure 3: The relationship between equitability and power against independence, as in Proposition 3.4. The top plot is the same as the one in Figure 1b, with the indicated interval denoting the interpretable interval I(y). The bottom plot is a plot of the power function K_{x₀}, with the y-axis indicating statistical power. The key to the proof of the proposition is to notice that the width of the interpretable interval describes the distance from x₀ to the point at which the power function becomes large, and this is exactly the width of the uncertain set of the power function. (Notice that because the null and alternative hypotheses are composite, K_{x₀}(x₀) need not equal α; in general it may be lower.)
Proof.

Let T denote the statistical test corresponding to K_{x₀}. We first determine the critical value of T. By definition, it is the smallest critical value that gives a type I error of at most α. In other words, it is the supremum, over all Z with Φ(Z) = x₀, of the relevant upper percentile of the sampling distribution of φ̂ when applied to a sample from Z. But this is simply sup R(x₀).

We now prove the proposition by proving each of the two statements separately.

Proof of the first statement: Let be the uncertain set of . Since , we know that , and so . It therefore suffices to show that .

We first show that : since , we know that we can find arbitrarily close to from below such that . But this means that there exists some with such that if is a sample of size from then

i.e.,

and so .

We next show that . To do so, we will need the following fact: since is continuous, the set is closed and because it’s bounded this means that is actually a member of . In other words, . It is easy to similarly show, using the continuity and invertibility of , that in fact .

To show that , we now observe that since , we know that for all . This is either because or because . However, since and is an increasing function, no can have . Thus the only option remaining is that , which gives that if is a sample of size from any with , we will have

But, as we’ve shown, the critical value of the test in question is , which equals . We therefore have that

which implies that is not contained in , as desired.

Proof of the second statement: We again let denote the uncertain set of . What are the infimum and supremum of ? To answer this, we note once again that implies that and moreover, since , we also have that .

To prove our claim, we will establish that and that . The fact that follows easily from and the fact that is an increasing function. It is therefore left only to show the latter claim.

To establish that , let us first show that . We know that since , we can find arbitrarily close to from below such that . Again, since the critical value of the test in question is , this means that there exists some with such that if is a sample of size from then

i.e.,

and so . This means that .

To show that , we observe that for we must have . Since the critical value of the test is , this implies that if is a sample of size from any with , then

i.e.,

In other words, for any , as desired. ∎

3.3 Discussion

What does the above result tell us about equitability? The first consequence of it is the following formal definition of equitability/interpretability in terms of statistical power, which we present without proof.

Theorem 3.5.

Fix a set F ⊆ Q and a function Φ on F. Let φ̂ be a statistic with the property that sup R(x) is a continuous increasing function of x, and fix some ε > 0 and some α ∈ (0, 1). Then the following are equivalent:

  1. φ̂ is a worst-case ε-interpretable proxy for Φ with confidence 1 − α.

  2. For every x satisfying 0 ≤ x ≤ 1 − ε, there exists a right-tailed test based on φ̂ that can distinguish between the null hypothesis Φ(Z) = x and the alternative hypothesis Φ(Z) ≥ x + ε with type I error at most α and power at least 1 − α.

This definition shows what the concept of equitability/interpretability is fundamentally about: being able to distinguish not just signal (Φ(Z) > 0) from no signal (Φ(Z) = 0) but also stronger signal (Φ(Z) ≥ x + ε) from weaker signal (Φ(Z) = x). This is the essence of the difference between equitability/interpretability and power against statistical independence.

The definition also shows that equitability and interpretability — to the extent they can be achieved — subsume power against independence. To see this, suppose again that Φ(Z) = 0 exactly when Z exhibits statistical independence. By setting x = 0 in the definition, we obtain the following corollary.

Corollary 3.6.

Fix a set F ⊆ Q, a function Φ such that Φ(Z) = 0 iff Z exhibits statistical independence, and some α ∈ (0, 1). Let φ̂ be a worst-case ε-interpretable proxy for Φ with confidence 1 − α, and assume that sup R(x) is a continuous increasing function. The power of the right-tailed test based on φ̂ at distinguishing Φ(Z) ≥ ε from statistical independence with type I error at most α is at least 1 − α.

In other words, equitability/interpretability implies power against independence. However, equitability/interpretability is actually a stronger requirement: as the theorem shows, to be interpretable a statistic must yield right-tailed tests that are well-powered not only to detect deviations from independence (Φ(Z) = 0) but also deviations from any fixed level of “noisiness” (Φ(Z) = x₀ > 0). This indeed makes sense when a data set contains an overwhelming number of relationships that exhibit, say, some modest value of Φ and that we would like to ignore because they are not as interesting as the small number of relationships with larger values of Φ.

It is our hope that by formalizing the relationship between equitability and power against independence, our equivalence result will clarify the differences between these two properties, thereby addressing some of the concerns raised about the power of MIC against statistical independence ([22] and [27]). We of course do agree that power against independence is a very important goal that is often the right one, and if all other things are equal more power is certainly always better. To this end, we have worked to greatly enhance MIC’s power, both through better choice of parameters and through use of the estimators introduced later in this paper, to the point where it is often competitive with the state of the art. (The results of this work are forthcoming in the companion paper.) However, we also think that limiting one’s analysis of MIC to power against statistical independence alone is not the right way to think about its utility.

For example, in [22], Simon and Tibshirani write “The ‘equitability’ property of MIC is not very useful, if it has low power”. However, as the result described above shows, the question is “power against what?”. If one is interested only in power against statistical independence (e.g. R² = 0 in the setting of functional relationships), then choosing a statistic based solely on this property is the correct way to proceed. However, when the relationships in a data set that exhibit non-trivial statistical dependence number in the hundreds of thousands, it often becomes necessary to be more stringent in deciding which of them to manually examine. As our result in this section shows, this can be thought of as defining one’s null hypothesis to be Φ(Z) = x₀ for some x₀ > 0. In such a case, the statistic is not being used to identify any instance of dependence, but rather to identify any instance of dependence of a certain minimal strength. In other words, when used on relationships in F, an equitable statistic is a measure of effect size rather than a statistical test, and as with other measures of effect size, analyzing its power against only one null hypothesis (that of statistical independence alone) is therefore inappropriate.

Of course, when the relationships being sought in a dataset are expected to be very noisy, the above paradigm does not make sense and it is quite reasonable to ignore equitability and seek a statistic that maximizes power specifically against statistical independence. This issue, along with a broader discussion of when equitability is an appropriate desideratum, is discussed in more detail in the upcoming companion paper. From a theoretical standpoint, our result here simply formalizes the notion that these concepts, while distinct, are related, and shows that the former — to the extent that it can be achieved — implies the latter.

4 MIC and the MINE statistics as consistent estimators

MIC is defined as the maximal element of a matrix called the characteristic matrix. However, both of these quantities are defined in [3] as statistics rather than as properties of distributions that can then be estimated from samples. Here we define the quantities that these two statistics turn out to estimate, and we prove that they do so. Thinking about these statistics as consistent estimators and then analyzing their behavior in the infinite-data limit subsumes and strengthens several previous results about MIC, gives a better interpretation of the parameters in the definition of MIC, clarifies the relationship of MIC to other measures of dependence (especially mutual information), and allows us to introduce new estimators with improved performance.

In this section, we focus on introducing the population value of MIC (which we will call MIC_*) and proving that MIC is a consistent estimator of it, and then give a discussion of some immediate consequences of this approach. Subsequent sections of the paper are devoted to analyzing MIC_* and stating new estimators of it.

4.1 Definitions

We begin by defining the characteristic matrix as a property of the distribution of two jointly distributed random variables rather than as a statistic. In the sequel we will use G(k, l) to denote, for positive integers k and l, the set of all k-by-l grids (possibly with empty rows/columns).

Definition 4.1.

Let (X, Y) be jointly distributed random variables. For a grid G, let (X, Y)|_G = (col_G(X), row_G(Y)), where col_G(X) is the column of G containing X and row_G(Y) is analogously defined. Let

    I*((X, Y), k, l) = sup_{G ∈ G(k, l)} I((X, Y)|_G),

where I(A; B) represents the mutual information of A and B. The population characteristic matrix of (X, Y), denoted by M(X, Y), is defined by

    M(X, Y)_{k, l} = I*((X, Y), k, l) / log min{k, l}

for k, l > 1.

Note that in the above definition, I refers to mutual information (see, e.g., [28] and [29]), not to an interpretable interval as in the previous sections.
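For intuition, the following Python sketch evaluates the quantity being supremized for one candidate grid, using the empirical distribution of a sample in place of the distribution of (X, Y); a full entry of the characteristic matrix would additionally require maximizing over all k-by-l grids.

    import numpy as np

    def grid_mutual_information(x, y, x_edges, y_edges):
        # Mutual information I((X, Y)|_G) of the empirical distribution of
        # a sample, discretized by the grid G with the given edges.
        counts, _, _ = np.histogram2d(x, y, bins=[x_edges, y_edges])
        p = counts / counts.sum()
        px = p.sum(axis=1, keepdims=True)  # column (x-bin) marginals
        py = p.sum(axis=0, keepdims=True)  # row (y-bin) marginals
        nz = p > 0
        return float(np.sum(p[nz] * np.log(p[nz] / (px @ py)[nz])))

    def normalized_grid_score(x, y, x_edges, y_edges):
        # One candidate value for the (k, l) entry of the characteristic
        # matrix: the grid's mutual information divided by log min{k, l}.
        k, l = len(x_edges) - 1, len(y_edges) - 1
        return grid_mutual_information(x, y, x_edges, y_edges) / np.log(min(k, l))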

The characteristic matrix is so named because in [3] it was hypothesized that this matrix has a characteristic shape for different relationship types, such that different properties of this matrix may correspond to different properties of relationships. One such property was the maximal value of the matrix. This is called the maximal information coefficient (MIC), and is defined below.

Definition 4.2.

Let (X, Y) be jointly distributed random variables. The population maximal information coefficient (MIC_*) of (X, Y) is defined by

    MIC_*(X, Y) = sup_{k, l > 1} M(X, Y)_{k, l}.

We now define the corresponding statistics introduced in [3].

Remark 4.3.

In the rest of this paper, we will sometimes have a sample D from the distribution of (X, Y) rather than the distribution itself. We will abuse notation by using D to refer both to the set of points that is the sample, and to the uniform distribution over those points. In the latter case, it will then make sense to talk about M(D), as we are about to do below.

Definition 4.4.

Let D be a set of ordered pairs. Regarding D as the uniform distribution over its points (Remark 4.3), we define the sample characteristic matrix of D to be M(D).

Definition 4.5.

Let D be a set of n ordered pairs, and let B : ℕ → ℕ. We define

    MIC_B(D) = max_{k·l ≤ B(n)} M(D)_{k, l}.

In [3], other characteristic matrix properties were introduced as well (e.g. maximum asymmetry score [MAS], maximum edge value [MEV], etc.). These can be analogously presented as functions of random variables together with a corresponding statistic for each property.
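A drastically simplified estimate of the statistic in Definition 4.5 can be sketched by reusing grid_mutual_information from the fragment above: the statistic in [3] maximizes over a much richer family of grids via a heuristic dynamic-programming search, while this sketch restricts attention to equipartition grids (and so will generally underestimate the true maximum); the default B(n) = n^0.6 follows the parameter choice suggested in [3].

    import numpy as np

    def mic_sketch(x, y, B=None):
        # Maximize the normalized grid mutual information over
        # equipartition grids with k * l <= B(n). Assumes continuous data
        # (no ties) and the helper grid_mutual_information defined above.
        n = len(x)
        B = int(n ** 0.6) if B is None else B
        best = 0.0
        for k in range(2, B // 2 + 1):
            for l in range(2, B // k + 1):
                xe = np.quantile(x, np.linspace(0, 1, k + 1))
                ye = np.quantile(y, np.linspace(0, 1, l + 1))
                xe[0], xe[-1] = xe[0] - 1e-9, xe[-1] + 1e-9
                ye[0], ye[-1] = ye[0] - 1e-9, ye[-1] + 1e-9
                score = (grid_mutual_information(x, y, xe, ye)
                         / np.log(min(k, l)))
                best = max(best, score)
        return best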

4.2 The main consistency result

We now show that the statistic MIC_B defined above is in fact a consistent estimator of MIC_*. This is a consequence of the following more general result, which will be the main theorem of this section. In the theorem statement below, we let ℓ^∞ be the space of infinite matrices equipped with the supremum norm, and we let π_k denote the projection that restricts an infinite matrix to its entries with both indices at most k.

Theorem.

Let f : ℓ^∞ → ℝ be uniformly continuous, and assume that f ∘ π_k → f pointwise as k → ∞. Then for every random variable Z = (X, Y), the statistic obtained by applying f to the sample characteristic matrix of a sample of size n (restricted to entries with k·l ≤ B(n)) is a consistent estimator of f(M(Z)), provided B(n) = ω(1) and B(n) = O(n^{1−ε}) for some ε > 0.

Since the supremum of a matrix is uniformly continuous as a function on ℓ^∞ and can be realized as the limit of maxima of larger and larger segments of the matrix, this theorem gives us the following corollary.

Corollary 4.6.

MIC_B is a consistent estimator of MIC_* provided B(n) = ω(1) and B(n) = O(n^{1−ε}) for some ε > 0.

It is easily verified that analogous corollaries also hold for the statistics MAS_B and MEV_B (referred to in [3] simply as MAS and MEV). Interestingly, it is unclear a priori whether such a result holds for MCN, since that statistic is not a uniformly continuous function of the sample characteristic matrix.

Before we prove this theorem, we will first give some intuition for why it should hold, and also for why it is non-trivial to prove. We then present the general strategy for the proof before giving the proof itself.

4.2.1 Intuition

Fix a random variable Z = (X, Y) and let D_n be a sample of size n from its distribution. It is known that, for a fixed grid G, I(D_n|_G) is a consistent estimator of I(Z|_G) [30]. We might therefore expect I*(D_n, k, l) to be a consistent estimator of I*(Z, k, l) as well. And if it is, then we might expect the maximum of the sample characteristic matrix (which just consists of normalized I* terms) to be a consistent estimator of the supremum of the true characteristic matrix.

These intuitions turn out to be true, but there are two reasons they are non-trivial to prove. First, consistency for I*(D_n, k, l) does not follow from abstract considerations, since the maximum of an infinite set of estimators is not necessarily a consistent estimator of the supremum of the estimands.⁵ Second, consistency of I*(D_n, k, l) alone does not suffice to show that the maximum of the sample characteristic matrix converges to MIC_*. In particular, if B(n) grows too quickly and the convergence of I*(D_n, k, l) to I*(Z, k, l) is slow, inflated values of MIC can result. To see this, notice that if B(n) is large enough that some admissible grid isolates every sample point in its own cell, then MIC_B(D_n) = 1 always, even though each individual entry of the sample characteristic matrix converges to its true value eventually.

⁵If the set of estimators is finite, then a union bound shows that the vector of estimators converges in probability to the vector of estimands with respect to the supremum metric, and the continuous mapping theorem then gives the desired result. However, if the set of estimators is infinite, the union bound cannot be employed; indeed, one can construct an infinite family of estimators, each individually consistent, whose supremum is always infinite.

The technical heart of the proof is overcoming these obstacles by using the dependences between the quantities I(D_n|_G) for different grids G to not only show the consistency of I*(D_n, k, l) but also to quantify how quickly it actually converges to I*(Z, k, l).

4.2.2 Proof strategy

We will prove the theorem by a sequence of lemmas that build on each other to bound the bias of MIC_B. The general strategy is to capture the dependencies between different k-by-l grids by considering a “master grid” Γ that contains many more than k·l cells. Given this master grid, we first bound the difference between I(D_n|_G) and I(Z|_G) only for sub-grids G of Γ. The bound will be in terms of the difference between D_n|_Γ and Z|_Γ. We then show that this bound can be extended without too much loss to all k-by-l grids. This will give us what we seek, because then the differences between I(D_n|_G) and I(Z|_G) will be uniformly bounded for all grids in terms of the same random variable, namely the discrepancy between D_n|_Γ and Z|_Γ. Once this is done, standard arguments will give us the consistency we seek.

4.2.3 The proof

The proof of this result will sometimes require technical facts about entropy and mutual information that are self-contained and unrelated to the central idea behind our argument. These lemmas are consolidated in Appendix 10.

We begin by using one of these technical lemmas to prove a bound on the difference between I(D_n|_G) and I(Z|_G) that is uniform over all grids G that are sub-grids of a much denser grid Γ. The common structure imposed by Γ will allow us to capture the dependence between the quantities I(D_n|_G) for different grids G.

Lemma 4.7.

Let W and W′ be random variables distributed over the cells of a grid Γ, and let p and p′ be their respective distributions. Define

Let G be a sub-grid of Γ with k·l cells. Then, for every ε > 0 there exists some δ > 0 such that

when for all .

Proof.

Let W_G and W′_G be the random variables induced by W and W′ respectively on the cells of G. Using the identity I(A; B) = H(A) + H(B) − H(A, B), we write

where and denote the marginal distributions on the columns of and and denote the marginal distributions on the rows. We can bound each of the above terms using a Taylor expansion argument given in Lemma A.1, whose proof is found in the appendix. Doing so gives

where

and is defined analogously.

To obtain the result, we observe that

since , and the analogous bound holds for . ∎

We now extend Lemma 4.7 to all grids with k·l cells rather than just those that are sub-grids of the master grid Γ. It is useful at this point to recall that, given a distribution Z, an equipartition of Z is a grid G such that all the rows of G have the same probability mass under Z, and all the columns do as well.

Lemma 4.8.

Let W and W′ be random variables distributed over ℝ², and let Γ be a grid. Define p and p′ on W and W′ as in Lemma 4.7. Let G be any k-by-l grid, and let m (resp. m′) represent the total probability mass of W (resp. W′) falling in cells of G that are not contained in individual cells of Γ. We have that

provided that m and m′ are bounded away from 1.

Proof.

In the proof below, we use the convention that for any two grids and and any distribution , the expression denotes . In addition, we refer to any horizontal or vertical line in G that is not in Γ as a dissonant line of G.

Consider the grid G′ obtained by adding to G the two lines in Γ that surround each dissonant line of G and then removing all the dissonant lines of G. This grid is clearly a sub-grid of Γ. And in Lemma A.4, whose proof we defer to the appendix, we do some careful accounting to show that G′ has the property that