Copula Index for Detecting Dependence and Monotonicity between Stochastic Signals

# Copula Index for Detecting Dependence and Monotonicity between Stochastic Signals

\nameKiran Karra \emailkiran.karra@vt.edu
Virginia Tech Research Center
Arlington, VA 22033, USA \AND\nameLamine Mili \emaillmili@vt.edu
Virginia Tech
Falls Church, VA 22043, USA
###### Abstract

This paper introduces a nonparametric copula-based approach for detecting the strength and uniquely, the monotonicity structure of linear and nonlinear statistical dependence between pairs of random variables or stochastic signals, termed CIM. CIM satisfies the data processing inequality and is, consequently, a self-equitable metric. Simulation results using synthetic datasets reveal that the CIM compares favorably to other state-of-the-art measures of association that satisfy the data processing inequality, including the estimators of mutual information based on k-nearest neighbors, k-NN, adaptive partitioning, AP, and the von-Mises expansion, vME. Additionally, CIM performs similarly to other state-of-the-art statistical dependency metrics, including the Maximal Information Coefficient (MIC), Randomized Dependency Coefficient (RDC), distance correlation (dCor), copula correlation (Ccor), and the Copula Statistic (CoS) in both statistical power and sample size requirements. Simulations using real world data highlight the importance of understanding the monotonicity structure of the dependence.

Copula Index for Detecting Dependence and Monotonicity between Stochastic Signals Kiran Karra kiran.karra@vt.edu
Bradley Department of Electrical and Computer Engineering
Virginia Tech Research Center
Arlington, VA 22033, USA
Lamine Mili lmili@vt.edu
Bradley Department of Electrical and Computer Engineering
Virginia Tech
Falls Church, VA 22043, USA

Keywords: copula, statistical dependency, monotonic, equitability, discrete

## 1 Introduction

A fundamental problem in statistics and machine learning involves understanding the organization of large datasets and the dependencies between features in them. An unsupervised approach to this problem entails modeling the features within these datasets as random variables and discovering the dependencies and conditional dependencies between them. The most familiar measure of statistical dependence, the correlation coefficient, however only measures linear dependence (Pearson, 1895). Several indices of nonlinear dependence for continuous random variables have been introduced in the literature, including MIC (Reshef et al., 2011), the RDC (Lopez-Paz et al., 2013), the dCor (Székely et al., 2007), the Ccor (Chang et al., 2016), and CoS (Ben Hassine et al., 2016). MIC can process continuous and discrete random variables, but it does so by discretizing continuous random variables. Although heuristics can be applied, finding the optimal discretization algorithm is an NP-complete problem dependent upon the end goals of the data analysis task (García et al., 2013). An additional limitation of these dependence measures is their intrinsic inability to provide a complete insight into the structure of the dependency. By structure, we mean whether the dependency is monotonic or nonmonotonic. One way may be to use the difference between an index of linear dependence such as the correlation coefficient and a nonlinear index of dependence such as MIC can be used as a measure of nonlinearity (Reshef et al., 2011). However, this difference cannot fully characterize whether the nonlinearity is monotonic or nonmonotonic. Additionally, this methodology is prone to error, as the measure of nonlinearity must scale identically to the measure of linearity, in order to make sense of any functional operations between these two measures. We show in Section 4.2 that the characterization of the monotonicity structure has important consequences for accurate modeling and analysis of stochastic data.

In this paper, we introduce a new index of nonlinear dependence, termed Copula Index for detecting dependence and Monoticity, or CIM for short. This index is based on copulas and the rank statistic Kendall’s (Kendall, 1938), that naturally handles continuous, discrete, and hybrid random variables or stochastic signals. In this paper, we define hybrid random variables to be pairs of random variables where one is continuous and the other is discrete. CIM identifies nonlinear associations between random variables by transforming realizations of random variables into their pseudo-observations by applying the probability integral transform, scanning them for regions of monotonicity in the dependence structure, and computing a weighted average of association within each of these identified regions. The identified regions of monotonicity, a unique byproduct of the CIM algorithm, provide insight into the monotonicity (or lack thereof) between the random variables or stochastic signals. The CIM algorithm’s ability to identify nonlinear associations as well as the monotonicity characteristics of the dependency for all types of random variables make it a powerful tool for exploratory data analysis.

At a high level, CIM overcomes two limitations of Kendall’s , namely: 1) its inability to handle hybrid random variables and 2) detection of nonmonotonic dependencies. We overcome the first limitation by proposing a new extension to Kendall’s , which augments its current modifications to account for discrete random variables with a new correction factor for hybrid random variables. We overcome the second limitation by developing a new algorithm to detect regions of monotonicity and processing each region separately.

This paper is organized as follows. Section 2 introduces copulas, rank statistics, and our proposed extension to Kendall’s . Section 3 then introduces the CIM algorithm. Here, important properties of copulas are proved, which theoretically ground CIM as an index of dependence. Additionally, properties of the CIM index are discussed. Next, section 4 provides simulations to exercise the developed metric against other state-of-the-art dependence metrics, including MIC, RDC, dCor, Ccor, and CoS and measures of information including kNN, vME, and AP using synthetic data and conducts real-world data experiments with various datasets from the fields of computational biology, climate science, and finance. Concluding remarks are then provided in Section 5.

## 2 Copulas, Concordance, and Rank Statistics

In this section, we provide a brief overview of copulas, concordance, and rank statistics. We focus on Kendall’s as it provides the basis for CIM, and propose a new extension of Kendall’s to account for hybrid random variables. Properties of this extension, denoted , are then highlighted and discussed.

### 2.1 Introduction to Copulas and Concordance for Continuous Random Variables

Copulas are multivariate joint probability distribution functions for which the marginal distributions are uniform (Nelsen, 2006). The existence of a copula associated with a collection of random variables, , following a joint probability distribution function, , and marginals is ensured by Sklar’s theorem, which states that

 H(x1,…,xn)=C(FX1(x1),…,FXn(xn)). (1)

This theorem guarantees the unicity of the copula for continuous random variables and it unveils its major property, which is its ability to capture the unique dependency structure between any continuous marginal distributions . Thus, the copula can be used to define a measure of dependence between continuous random variables.

A popular measure of dependence that is based on the copula is concordance. Concordance is a form of dependence that measures the degree to which two random variables are associated with each other. More precisely, points in , and , are concordant if and discordant if (Nelsen, 2006). This can be probabilistically represented by the concordance function, , defined as

 Q=P[(X1−X2)(Y1−Y2)>0]−P[(X1−X2)(Y1−Y2)<0] (2)

where and are independent vectors of continuous random variables with distribution functions and having common margins of (of and ) and (of and ). Nelsen (2006) then shows that the concordance function can be written in terms of the copulas of and , and respectively, rather than the joint distribution function as

 Q(C1,C2)=4∫∫I2C2(u,v)dC1(u,v)−1 (3)

Note that (3) is the general definition of the concordance between two random vectors and with identical marginals but different dependency structures. To compute the concordance between two random variables and , the concordance function in (3) reduces to , where is the copula of the joint distribution .

Many metrics of association are based on the concept of concordance, with the two most popular being Kendall’s (Kendall, 1938) and Spearman’s (Spearman, 1904). Kendall’s is defined in terms of the concordance function as

 τ=Q(C,C), (4)

where is the copula of the joint distribution , and interpreted as the scaled difference in the probability between concordance and discordance. It can be estimated by

 ^τ=\# concordant pairs - \# discordant pairs(n2). (5)

Concordance-based measures of association such as Kendall’s are ideal for detecting linear and nonlinear monotonic dependencies because they are rank statistics. These measures have the desirable properties of being margin independent and invariant to strictly monotonic transforms of the data (Scarsini, 1984; Nelsen, 2006).

Although these statistics work well for measuring monotonic association between continuous random variables, adjustments need to be made to account for discrete and hybrid random variables. It turns out that in that case, Sklar’s theorem does not guarantee the unicity of the copula when the marginal distribution functions are non-continuous. If some or all of the marginal distribution functions are associated with discrete random variables, then many copulas satisfy (1) due to ties in the data. Consequently, the measure of concordance becomes margin-dependent (i.e, cannot be expressed solely in terms the joint distribution’s copula as in (3)) and in many cases cannot reach or in scenarios of perfect comonotonicity and countermonotonicity, respectively (Genest and Nešlehová, 2007).

### 2.2 Extension of Kendall’s τ for Hybrid Random Variables

Several proposals for adjusting Kendall’s for ties have been made, including (Kendall, 1945), (Vandenhende and Lambert, 2003), and (Nešlehová, 2007). The common theme among these proposals is that they use different rescaling factors to account for ties in the data. However, even with rescaling, perfect monotone dependence does not always imply , and is not interpretable as a scaled difference between the probabilities of concordance and discordance (Genest and Nešlehová, 2007). Nešlehová (2007) overcomes both of these limitations and defines the non-continuous version of Kendall’s , denoted by as:

 τN=4∫CSXYdCSXY−1√(1−E[ΔFX(X)])(1−E[ΔFY(Y)]), (6)

where is the standard extension copula (Schweizer and Sklar, 1974) given by

 CSXY(u,v) =(1−λ1)(1−λ2)CXY(a1,a2)+(1−λ1)λ2CXY(a1,b2) (7) +λ1(1−λ2)CXY(b1,a2)+λ1λ2CXY(b1,b2),

with

 λi={ui−aibi−ai,if ai

where and denote the least and the greatest element in the closure of the range of and such that , and . Nešlehová (2007) then defines an estimator of the non-continuous version of as:

 ^τN=\# concordant pairs - \# discordant pairs√(n2)−u√(n2)−v (8)

where , , is the number of distinct values observed in and is the number of distinct values observed in , is the number of times the distinct element occurred in the dimension, is the number of times the distinct element occurred in the dimension. From (8), it can be seen that performs well in the discrete case by subtracting the number of ties for each variable and independently from the denominator. The subtraction allows the value of to achieve plus or minus one in the comonotonic and countermonotonic cases, respectively, because ties do not get accounted for in the numerator. The accounting of ties is due to the strict inequalities used for concordance and discordance in (2). In the continuous case, there are no ties and defined in (6) reduces to the original Kendall’s defined in (4).

Although the measure defined by (6) is valid for continuous, discrete, or hybrid random variables, the estimator in (8) does not achieve a value of plus or minus one in the perfectly comonotonic and countermonotonic cases, respectively, for hybrid random variables. In order to extend to that case, we propose to use the maximum number of ties as a correction factor. This is because in that case, the numerator of does not count the increasing continuous variables as concordant. Fig. 1 illustrates this counting in an example, and shows why fails to achieve plus or minus one in the hybrid random variable case for perfectly comonotonic/countermonotonic random variables respectively. In it, the pairs of samples along the continuous dimension within a discrete value ( for and ) are not counted as comonotonic. To overcome this drawback, our proposed extension to is defined as

 ^τKL=⎧⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎨⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪ ⎪⎩\# concordant pairs - \# discordant % pairs(n2)continuous\# concordant pairs - \# discordant pairs√(n2)−u√(n2)−vdiscrete\# concordant pairs - \# discordant pairs√(n2)−t√(n2)−thybrid (9)

where , and where and are the same as in , and , denotes the number of overlapping points in the continuous dimension and between different discrete values in the discrete dimension, and denotes the number of unique elements in the discrete dimension. is zero for perfectly monotonic hybrid random variables, but takes nonzero values for copula-based dependencies; it helps to reduce the bias of when hybrid random variable samples are drawn from a copula dependency. The expression of given by (9) can be interpreted as the scaled difference between and , which yields under , by definition of concordance. Thus, regardless of the scaling factor, will be zero when .

Figs. 2 (a),(b),(c), and (d) show the bias and variance between the estimated value of and the value of that generates the corresponding copula. Here, samples of and are drawn from a Gaussian distribution and from a uniform discrete distribution, respectively. All the samples are generated using four possible copulas averaged over 300 Monte-Carlo simulations; and are generated by drawing pseudo-observations from the copula model. This follows the methodology described by Madsen and Birkes (2013) for simulating dependent discrete data. Figs. 2 (e) and (f) display the bias and variance of in the comonotonic and countermonotonic cases. The results show that achieves in the comonotonic case for all continuous, all discrete, and hybrid scenarios, and in the countermonotonic case for all continuous, all discrete, and hybrid scenarios. It also achieves low bias and variance among all proposed modifications to estimators of for hybrid random variables with copula-based dependencies. The null distribution of under independence, that is, , is depicted in Fig. 3. It shows that is Gaussian with a mean of approximately zero and a decreasing variance as the sample size increases for continuous, discrete, and hybrid random variables.

These properties suggest that can additionally be useful for estimating parametric copulas when building hybrid copula Bayesian Networks (HCBN) (Karra and Mili, 2016), rather than using empirical copulas, which are computationally intractable for large dimensional networks. Briefly, in HCBN, a family of nodes is modeled with a copula; the difficulty that HCBN attempts to address is the scenario where some families contain nodes which are both continuous and discrete. In the context of , when building Hybrid Copula Bayesian networks, families of this nature can be estimated either with parametric copula families, or empirical copulas. Although the latter fit the data with less error, more computational resources are required in order to process inference queries with empirical copulas, rather than parametric copulas with closed form expressions such as the Gaussian or Archimedean families. Thus, for large networks with many nodes, it is preferable to use the parametric copula families; however, using parametric copulas with discrete data is prone to biased estimates (Genest and Nešlehová, 2007). The properties of shown above suggest that can be used to estimate a parametric copula’s parameter, , via the one-to-one relationship between and that exists for popular copula families, with low bias.

Another example of why the correction factor detailed in (9) is important is displayed in Fig. 4. In it, a step function dependency with various levels of discretization is displayed, and it is seen that with small levels of discretization, and misrepresents the dependence strength between and , while the correction factor in compensates and achieves the true dependence value of in all situations.

## 3 Copula Index for Detecting Dependence and Monotonicity between Stochastic Signals

In this section, we describe how is used to detect nonlinear monotonic and nonmonotonic statistical dependencies, with a new index termed CIM. The theoretical foundations of this methodology are developed. We then describe the properties of CIM and propose an algorithm to estimate it.

### 3.1 Theoretical Foundations of Cim

CIM detects statistical dependencies by leveraging concepts from concordance, defined above in (3). However, measures of association directly based on concordance do not perform well for measuring nonmonotonic dependencies. This is because two random variables can be perfectly dependent, while having the probability of concordance, , equal to the probability of discordance, , yielding a null concordance function . One example of such a dependency is , with . Thus, in order to use concordance as a measure of nonmonotonic dependence, one must consider regions of concordance and discordance separately; this provides the basis of CIM. Specifically, CIM identifies regions of concordance and discordance, and computes a weighted average of for each of these regions.

To develop the CIM, we begin by proving that a set of observations drawn from any mapping can be grouped into concordant and discordant subsets of pseudo-observations which are piecewise linear functions of each other. Let is the pseudo-observation for the dimensional data point and denote the range-space of , where and are random variables, to be the subset of which encompasses every pair of values that the bivariate random variable can take on. Based on these definitions, we can state the following theorem:

###### Theorem 1

Suppose and are random variables where and are associated through the mapping defined by , where is monotone over each of the regions that form a partition of the range-space of . Define the random variables and . Then, is a piecewise linear function of .

See Appendix A for the proof.

Theorem 1 shows that if two random variables are dependent in a deterministic sense, their Cumulative Distribution Functions (CDFs) are piecewise linear functions of each other. This implies that the pseudo-observations of realizations of these dependent random variables can be grouped into regions of concordance and discordance. Furthermore, in each region, the dependent variable’s pseudo-observations are linear functions of the independent ones, contained in the unit square . Using this as a basis, CIM detects dependencies by identifying regions of concordance and discordance after transforming the original data and into pseudo-observations, and respectively.

As displayed in Fig. 4(a), by definition of concordance, in the independence scenario, no regions of concordance or discordance exist. Similarly, as depicted in Fig. 4(b), for monotonic dependencies only one region, , exists. Finally, for nonmonotonic dependencies, many regions may exist. As an example, Fig. 4(c) displays the pseudo-observations of sinusoidal functional dependence. Here, it is easy to see that and are regions of concordance, and is a region of discordance.

The foregoing examples motivate the following definition of the CIM:

 CIM=∑i(wi|τiN|), (10)

where is the absolute value of (6) for the region and is the ratio of the area of region to . From (10) and the properties of , CIM reduces to for monotonic continuous random variables, and zero for independent random variables. It should be noted that (10) defines the CIM metric, but an algorithm is required in order to identify each region for which is computed. In Section 3.3, we propose an algorithm to identify these regions.

### 3.2 Properties of Cim

In this section, we describe the properties of CIM. We begin by discussing Rényi’s seven properties of dependence measures, and show that CIM satisfies six of them. We then discuss the metric’s characteristics and properties under noise, and show through simulations and the definition of concordance, that any dependency metric based on concordance (including CIM) does not follow Reshef’s definition of equitability, known as -equitability. We then show that CIM does, however, satisfy the Data Processing Inequality (DPI), which implies that it satisfies self-equitability. We now discuss the implications of this property for the CIM metric.

#### 3.2.1 Dependence Metric Properties

Alfréd Rényi defined seven desirable properties of a measure of dependence, , between two random variables and (Rényi, 1959):

1. is defined for any pair of non-constant random variables and .

2. = .

3. .

4. iff .

5. For bijective Borel-measurable functions, , : .

6. if for Borel-measurable functions or , or .

7. If , then, , where is the correlation coefficient.

CIM satisfies the first property because it operates on copula transformed data (pseudo-observations, which exist for any random variable) rather than the raw data. Because of the following identities: , , and , the value of CIM given by (10) takes values between 0 and 1, and thus the third property is satisfied. In the independence case, because there are no regions of concordance or discordance, (10) reduces to . From Scarsini (1984), any measure of concordance is when and are independent; because CIM reduces to , which is an absolute value of the concordance measure for independent random variables, the fourth property is satisfied. The fifth property is satisfied because Kendall’s is invariant to increasing or decreasing transforms (see Nelsen, 2006, Theorem 5.1.8), so the convex sum of Kendall’s must also be invariant to increasing or decreasing transforms. Note that we cannot guarantee that be invariant to bijective transforms because the rescaling depends upon the marginal distributions. Thus, the fifth property is only valid for CIM for continuous random variables. The second and sixth properties are satisfied by virtue of Theorem 1. Finally, although the seventh property is not directly satisfied by CIM, if , the CIM metric is the absolute value of Kendall’s for a Gaussian copula and can be converted to the correlation coefficient , with the relation in order to satisfy the seventh property. This is because the Gaussian copula captures monotonic linear dependence, and hence there is only one region. Thus, the summation in (10) collapses to , which for continuous random variables is .

#### 3.2.2 Equitability and Noise Properties

Equitability is a measure of performance of a statistic under noise. Notionally, an equitable statistic assigns similar scores to equally noisy relationships of different types (Reshef et al., 2011). Kinney and Atwal (2014) formalize this concept as -equitability. A dependence measure is equitable if and only if, when evaluated on a joint probability distribution , that corresponds to a noisy functional relationship between two real random variables and , the relation given by

 D[X;Y]=g(R2([f(X);Y])) (11)

holds true, where is a function that does not depend on , denotes the squared Pearson correlation measure, and is the function defining the noisy functional relationship , for some random variable .

We begin by showing that Kendall’s , the rank correlation metric which CIM is dependent upon, does not satisfy equitability. The estimator of Kendall’s is defined in (5), and for a given set of points in the space drawn from random variables and , indexed by a subscript as , computed using

 ^τ=∑ni∑njsgn(xi−xy)sgn(yi−yj)n(n−1)/2          ∀i≠j, (12)

where is the sign function. Equation (12) reveals the lack of equitability. This can be explained as follows. The distance information between points and is not used by the function in (12), and thus, if two different processes generate data where the distances between points vary, then random perturbations to that data (noise) will affect these processes differently from a data ordering perspective, resulting in differing estimates of Kendall’s .

More concretely, consider the points , and . It is more likely that the lexicographical ordering between and is reversed after the addition of noise, than the ordering between and . Fig. 6 shows this phenomenon via simulations. Fig. 6 (a) depicts the noiseless relationship for a uniformly distributed with the association is plotted, while Fig. 6 (b) displays the and , where . As for Fig. 6 (c), it depicts the noiseless relationship for a uniformly distributed with the association , and Fig. 6 (d) displays and , where . Figs. 6 (b) and (d) reveal how the ordering of the data, which directly corresponds to the pseudo-observations, changes when the distance between points is not accounted for. We infer from the plots that the exponential relationship, which generates points that are spaced further apart, is differently affected by noise than the linear relationship. Numerically, Kendall’s in Fig. 6 (b) is estimated to , while in Fig. 6 (d), it is estimated to , which again shows that these two processes are differently affected by noise, and thus makes an unequitable metric as defined by Reshef et al. (2015).

Additionally, the equitability curves, which show the relationship between and for different relationships, for the two association patterns and are displayed in Fig. 7. The worst interpretable interval, which can be informally defined as the range of values corresponding to any one value of the statistic is represented by the red hashed line. Fig. 7 depicts a large interval, which is indicative of the lack of -equitability of this estimator.

As discussed above, the definition of concordance inherently does not lend itself to the equitability, which implies that CIM is not an -equitable statistic. Although equitability is a desirable property, Reshef et al. (2015) show that there exists a direct trade-off between statistical power and equitability. We empirically validate this trade-off in Section 4.1.

#### 3.2.3 Self Equitability and the Data Processing Inequality

Although the -equitability is difficult to achieve, related properties such as the DPI and self equitability are important desirable properties of a dependence metric. In this section, we prove that and CIM both satisfy the DPI, and are thus both self-equitable for continuous random variables. We show that the scaling factors proposed in (8) and (9) to account for discrete and continuous random variables, unfortunately, does not satisfy the DPI. We then propose a solution to allow CIM to satisfy DPI, even in the discrete and hybrid scenarios.

The DPI is a concept that stems from information theory. It states that if random variables , , and form a Markov chain, denoted by , then , where is the mutual information between and defined as , is the joint distribution of and , and and are the marginal distributions of and , respectively (Cover and Thomas, 2006). Intuitively, it states that information is never gained when being transmitted through a noisy channel (Kinney and Atwal, 2014). As an analog to the information theoretic definition of the DPI, Kinney and Atwal (2014) define a dependence metric to satisfy the DPI if and only if , whenever the random variables , , and form a Markov chain, . Here, we prove that CIM, as defined by (10), satisfies the DPI for continuous random variables, and later on we show that a modified version of CIM, termed satisfies the DPI for discrete and hybrid random variables. More precisely, we have

###### Theorem 2

If the continuous random variables , , and form a Markov chain , then .

See Appendix B for the proof.

An immediate implication of CIM satisfying the DPI is that it can be used for network modeling and information flow of data through Markov chains. One approach is to use the Algorithm for Reconstruction of Accurate Cellular Networks (ARACNe) procedure (Margolin et al., 2006); here, triplets of random variables for which all three CIM values are greater than a baseline threshold are examined, and from the DPI, it can be inferred that no edge exists between the two random variables with the smallest value of CIM (Margolin et al., 2006). Fig. 8 displays a Markov chain of four random variables, . In that figure, the network connections between and and and are represented by red hashed lines (i.e. non-existent connections). This is because from Theorem 2, we infer that and . Similar reasoning is applied to the connection between and .

Another approach to network modeling using CIM is to perform repeated Max-Relevance Min-Redundancy (MRMR) feature selection (Peng et al., 2005) for each variable in the dataset and construct a Maximum Relevancy Network (MRNET) (Meyer et al., 2007). In principle, for a random variable , MRMR works by ranking a set of predictor variables according to the difference between the mutual information (MI) of with (the relevance) and the average MI with the selected variables in (the redundancy). By choosing the variable that maximizes this difference, a network can be constructed in which direct interactions between variables imply edges. By virtue of Theorem 2, CIM and average CIM can be substituted for MI and average MI to apply CIM to MRNET reconstruction. As for Fig. 8, using the MRNET algorithm and Theorem 2, we can readily say that

 CIM(X1,X2)− 0.5[CIM(X1,X3)+CIM(X1,X4)]≥ CIM(X1,X3)−0.5[CIM(X1,X2)+CIM(X1,X4)],

and

 CIM(X1,X2)− 0.5[CIM(X1,X3)+CIM(X1,X4)]≥ CIM(X1,X4)−0.5[CIM(X1,X2)+CIM(X1,X3)],

yielding the connection between and in Fig. 8. Similar reasoning can be applied to the other network connections. Simulation results discussed in Section 4.1 motivate the use of CIM as a substitute for the MI. In it, we compare the statistical power of CIM to various estimators of the Mutual Information (MI) including: 1) k-nearest neighbors (k-NN) estimation (Kraskov et al., 2004), 2) adaptive partitioning (AP) MI estimation (Darbellay and Vajda, 1999), and 3) MI estimation via von Mises expansion (vME) (Kandasamy et al., 2015), and show that CIM is more powerful. This suggests that CIM is indeed a viable alternative for use in the estimation of Markov networks from datasets.

Another implication of CIM satisfying DPI is that it is a self-equitable statistic. A dependence measure is said to be self-equitable if and only if it is symmetric, that is, (), and satisfies , whenever is a deterministic function, and are variables of any type, and form a Markov chain (Kinney and Atwal, 2014). Self equitability implies that is invariant under arbitrary invertible transformations of or (Kinney and Atwal, 2014), which is in-fact a stronger condition than the property defined by Rényi in his proposed list of desirable properties for dependence metrics (Rényi, 1959).

We now address the case where , , and are discrete or hybrid random variables as related to the DPI. Unfortunately, the scaling factors proposed in (8) and (9) to account for discrete and continuous random variables cannot be guaranteed to satisfy the DPI. Indeed, for a Markov Chain , the scaling factor for is . However, the relationship between and is not obvious from because the scaling factors are dependent on the marginal distributions, which are related to, but not derivable from, knowledge of the joint distribution’s copula. Therefore, we cannot infer that . In order to enable CIM to satisfy these properties, we propose to remove the scaling factors defined in (9) and to compute of the standard extension copula given by (7), for both discrete and hybrid random variables (i.e. the numerator in (8) and (9)). More specifically, let us define

 τS(X,Y)=4∫CSXYdCSXY−1,

where is the standard extension copula expressed by (7). From Denuit and Lambert (2005) and Nešlehová (2007), we know that follows the concordance ordering. Combining this property with Theorem 2, it is straightforward to show that does indeed satisfy the DPI, and thus defined as

 CIMS=Σi(wi|τiS|), (13)

also satisfies the DPI, where is defined as before in (10). The consequences of not using the scaling factor are that does not reach plus or minus one for either the perfectly comonotonic, or countermonotonic discrete, or hybrid cases. However, from the perspective of ranking dependencies, as long as concordance order is followed, the absolute value of is irrelevant; only the relative values of and are pertinent. It should be noted that the effect of the scaling factors is decreased in two scenarios: 1) the support of the discrete random variables is large, and 2) the probability mass is more evenly distributed among the support of the discrete random variable (Genest and Nešlehová, 2007). In these scenarios, it is safe to use CIM as defined in (10) when constructing networks using the DPI principle, as the effect of the scaling factor is negligible in the presence of noisy data. However, in scenarios where the support of the discrete random variables are small, or the discrete distribution is skewed, it is advisable to use (13) when comparing the relative strengths of dependencies for tasks such as building Markov Networks using the ARACNe or MRNET algorithms described above.

### 3.3 Algorithms

In this section, we propose an algorithm to estimate the CIM metric. From (10), the CIM estimator can be defined as

 ˆCIM=∑i(wi|^τiKL|), (14)

where is the absolute value of (9) for the region and is the ratio of the number of samples in to the total number of samples being considered. It is seen that performs a weighted average of , computed for each region of concordance and discordance within . Thus, the primary goal of the algorithm is to identify these regions of concordance and discordance within . To this end, the proposed algorithm performs repeated computations of on the pseudo-observations of increasing subsets of the data in the unit-square. Large changes in values of between consecutive subsets of pseudo-observations indicate that a boundary between a concordant and discordant region has been identified. When all the region boundaries have been identified, the CIM value is estimated by applying (14).

More specifically, the first step in approximating the CIM statistic is to transform the data by applying the probability integral transform, via the empirical cumulative distribution function, to both dimensions of the data independently, generating the pseudo-observations. Next, the unit square is scanned to identify regions of concordance and discordance. The output of this step for independent, linear, and sinusoidal association patterns is shown in Figs. 5 (a), (b), and (c) respectively. The scanning is performed by creating an empty region in the unit square, increasing the size of the region by a defined amount, , in the dimension, and estimating the metric over all points contained within that region, denoted by . Every is compared to the previous value, denoted as . If

 ^τKL<^τ′KL−4σ^τKL, (15)

where is the standard deviation of the , then a new region boundary is declared. Stated differently, if the current value has decreased by more than four standard deviations from its previously estimated value for the given noise level, then the algorithm declares this a new boundary between monotonic regions.

Fig. 11 pictorially depicts these steps. In Fig. 11 (a), that has been decided by the algorithm to contain points of concordance, noted by . Additionally, the green region in Fig. 11 (a) shows the region under consideration by the algorithm, which is an increment of the region by . and are compared according to the criterion given above. In Fig. 11 (a), the criterion in (15) yields the decision that the points in the green region belong to the same region as . In Fig. 11 (b), the same criterion in (15) yields the decision that the points in the green region belong to a new region, , as depicted in Fig. 4(c).

The rationale behind this approach can be understood by considering each as a random variable, with a given mean and standard deviation. For a linear relationship and continuous random variables, a non-parametric view of the distribution of is shown via box and whisker plots in Fig. 9. In it, it is seen that the variance of is not only dependent upon the number of samples, but also on the noise level (swept on the axis). The standard deviation of the estimate across all noise levels for a linear relationship can be modeled as . This simple model’s fit is displayed in Fig. 10, which shows the experimental and predicted value of the standard deviation of for various noise levels and sample sizes for a linear relationship. Because each region of monotonicity manifests as a linear relationship after the data is transformed with the probability integral transform, by virtue of Theorem 1, the results from the linear relationship apply to each region of monotonicity independently.

For continuous random variables, because reduces to , the standard deviation is analytically characterized as , where is the number of points for which the estimate of is being computed. For discrete and hybrid random variables, the standard deviation is empirically characterized by using the sample mean in Fig. 3. Because the distribution of is shown to converge to the Normal distribution, four standard deviations from this value represents a confidence interval associated with a 99.9% probability that the newly added points being tested do not belong to the same region of monotonicity. Thus, the algorithm above can be viewed as a hypothesis testing problem, where the null hypothesis that the next set of points belongs to the same region is rejected at a level of . By a similar argument, because the standard deviation of the estimate approaches in the limit as , the boundaries of the regions will be perfectly detected in the asymptotic case of an infinite sample size.

Algorithm 1 details a pseudocode description of the proposed algorithm to estimate CIM for a given set of configuration parameters. In order to maximize the power of the CIM estimator against the null hypothesis that , the scanning process described above, and detailed in function scan-unit-sq of Algorithm 1, is conducted for multiple values of and for both orientations of the unit-square (, and ). Line 2 of Algorithm 1 shows that the values of which are tried are a geometrically decreasing series of values in the range of and , where the hyperparameter minimum scanning increment is denoted by . The choice of using a geometric series to test a series of values between the bounds is arbitrary, and the Algorithm sensitivity testing in Section 3.3.1 leads us to believe that the specific pattern chosen will not have a large effect on the estimated value.

The scanning and orientation of the unit square, which maximizes the dependence metric, is the approximate value of CIM, as shown in function compute-cim. Note that different scanning and orientation values are independent of each other, so the outer two for loops in compute-cim can be parallelized.

The minimum scanning increment (width of the green region in Fig. 11 (a)), , is the only hyperparameter for the proposed algorithm. The value of used in all the simulations, except the sensitivity study, is . The denominator of this value bounds the size and frequency of changes to the monotonicity that the algorithm can detect. By choosing , it is found that all reasonable dependence structures can be captured and identified. The experiments conducted in Section 4.1 and 4.2 corroborate this choice.

#### 3.3.1 Algorithm Performance

In this section we investigate the performance of Algorithm 1 using various synthetic datasets. We show that the proposed algorithm is robust to the input hyperparameter. We also investigate the convergence properties and speed of convergence of as estimated by Algorithm 1. Because the algorithm performance depends heavily on how well it detects the regions of concordance and discordance, we begin by characterizing the region detection performance.

To test the region detection performance, we simulate noisy nonmonotonic relationships of the form:

 Y=4(X−r)2+N(0,σ2), (16)

where . By varying and the number of samples, , that are drawn from , nonmonotonic relationships of this form comprehensively test the algorithm’s ability to detect regions for all types of association. This is becuase directly modulates the angle between the two piecewise linear functions at a region boundary, and the number of samples and the noise level test the performance of the decision criterion specified previously in (15) as a function of the number of samples. After generating data according to (16) for various values of , , and , Algorithm 1 was run on the data and the boundary of the detected region was recorded, for 500 Monte-Carlo simulations. A nonparametric distribution of the detected regions by Algorithm 1 for different values of and are displayed in Fig. 12. It is seen that on average, the algorithm correctly identifies the correct region boundary. In the scenario with no noise, the variance of the algorithm’s detected region boundary is small, regardless of the sample size. For larger levels of noise, the variance decreases with the sample size, as expected.

Our goal now is to investigate the sensitivity of Algorithm 1 to the hyperparameter , for the values . For various dependency types, we compute the maximum deviation of the CIM value over 500 Monte-Carlo simulations for sample sizes () ranging from 100 to 1000. Fig. 13 shows the maximum deviation of the estimated CIM value for each value of noise over sample sizes ranging from 100 to 1000 for eight different association patterns. The results show that the algorithm is robust to different values of the hyperparameter because the maximum deviation of over the noise range, sample sizes, and dependencies tested is no greater than . In the sensitivity study, we excluded values of when testing dependencies for which large values of would have masked the dependence structure. For example, in the high-frequency sinusoidal dependency, the monotonicity structure changes at region boundaries with multiples of . To test the robustness of the algorithm for an value of greater than does not make sense, so these values were excluded. Similar exclusions were applied to both the cubic and low-frequency sinusoidal dependence structures in the sensitivity tests.

Finally, we show that Algorithm 1 converges to the true CIM value through simulation. The results of the algorithm’s convergence performance are shown in Fig. 14 below. The subtitles for each subplot indicate the number of samples required such that the error between and CIM over all computed noise levels is less than 0.01 over 500 Monte-Carlo simulations. It can be seen that for dependencies with small numbers of regions of monotonicity, Algorithm 1 converges very quickly to the true value over all noise levels. For dependencies with large numbers of regions of monotonicity, such as the high frequency sinusoidal relationship depicted in the fifth subplot, a larger number of samples is required in order to ensure convergence. This can be explained from Fig. 9, which shows that the variance of the increases as the number of samples decreases. Thus, with a smaller number of samples in a dependency structure, the increased variance leads Algorithm 1 to make incorrect decisions regarding the region boundaries. As the number of samples increases, the detection performance increases.

#### 3.3.2 Sampling Distribution of Cim

Simulations were also conducted in order to characterize the null distribution of the CIM approximation algorithm provided in Algorithm 1. It was found experimentally that the the algorithm provided produces a statistic which can be approximated by the Beta distribution, as displayed in Fig. 15. Figs. 15 (c) and (d) both show that as the sample size increases, the shape parameter remains relatively constant while the shape parameter increases linearly as a function of . This roughly corresponds to a distribution converging to a delta function centered at zero. This is a desirable property because it implies that the CIM approximation algorithm yields a value close to for data drawn from independent random variables with a decreasing variance as the number of samples used to compute CIM increases.

It is interesting to note that the Beta distribution is intimately connected to rank statistics. More precisely, one can show that if , , are independent random variables, each following , then the distribution of the order statistic, follows (Ahsanullah et al., 2013).

#### 3.3.3 Computational Complexity

In this section we describe the computational complexity of computing the CIM algorithm above. We propose a new algorithm to compute to achieve a computational complexity of for estimating CIM for continuous and discrete random variables, and for hybrid random variables.

The core of Algorithm 1 consists of repeated computations of . If one were to näively compute this, by recomputing the number of concordant and discordant pairs every time a new region was tested, the operations required to compute the number of concordant and discordant samples would exponentially increase. Instead, we describe an algorithm to compute efficiently while accumulating new samples into the batch of data for which the value of is desired (i.e. when expanding the region by ). The essence of the algorithm is that it pre-sorts the data in the direction being scanned, so that the number of concordant and discordant samples do not need to be recomputed in every iteration of the scanning process. Instead, the sorted data allows us to store in memory the number of concordant and discordant samples, and update this value every time a new sample is added to the batch of samples being processed. Additionally, during the sorting process, the algorithm converts floating point data to integer data by storing the statistical ranks of the data rather than the data itself, allowing for potentially efficient FPGA based implementations. The efficient algorithm to compute for continuous and discrete data, given a new sample, is described in the consume function of Algorithm 2. If samples are to be processed, then the consume function is called times. For clarity of exposition, the remaining helper functions are not presented; however, their operation is only to initialize the variables.

The consume function has a computational complexity of , due to lines 9 and 10 in Algorithm 2, which require computation over a vector of data. Because the consume function is called times in order to process all the samples, this algorithm has a computational complexity of . It should be noted that lines 9 and 10 in Algorithm 2 are vectorizable operations, and the initial presorting is an operation. For hybrid data, additional calculations are required in the consume function in order to count the number of overlapping samples between discrete outcomes in the continuous domain, as described in (9). This requires an additional operations, bringing the computational complexity to process hybrid random variables to . For clarity, the pseudocode to compute the overlapping points is not shown in Algorithm 2, but a reference implementation to compute is provided.