Exploring and measuring non-linear correlations: Copulas, Lightspeed Transportation and Clustering
We propose a methodology to explore and measure the pairwise correlations that exist between variables in a dataset. The methodology leverages copulas for encoding dependence between two variables, state-of-the-art optimal transport for providing a relevant geometry to the copulas, and clustering for summarizing the main dependence patterns found between the variables. Some of the cluster centers can be used to parameterize a novel dependence coefficient which can target or forget specific dependence patterns. Finally, we illustrate and benchmark the methodology on several datasets. Code and numerical experiments are available online for reproducible research.
Pearson’s correlation coefficient, which estimates linear dependence between two variables, is still the mainstream tool for measuring variable correlations in science and engineering. However, its shortcomings are well documented in the statistics literature: it is not robust to outliers; it is not invariant to monotone transformations of the variables; it can take the value 0 even when the variables are strongly dependent; and it is only relevant when the variables are jointly normally distributed. A large but under-exploited literature in statistics and machine learning has expanded recently to alleviate these issues [reshef2011detecting, szekely2009brownian, sejdinovic2013equivalence]. An idea underlying many of these dependence coefficients is to compute a distance $D(P(X,Y), P(X)P(Y))$ between the joint distribution $P(X,Y)$ of the variables $X, Y$ and the product of marginal distributions $P(X)P(Y)$, which encodes independence. For example, choosing $D = \mathrm{KL}$ (Kullback-Leibler divergence), we end up with the Mutual Information (MI) measure, well known in information theory. Thus, one can detect all the dependences between $X$ and $Y$, since the distance will be greater than 0 as soon as $P(X,Y)$ differs from $P(X)P(Y)$. The focus of the dependence literature has then shifted toward the new concept of “equitability” [kinney2014equitability]: How can one quantify the strength of a statistical association between two variables without bias for relationships of a specific form? Many researchers now aim at designing measures and proving that they are indeed equitable [reshef2013equitability, ding2013copula, chang2016robust]. This is not what we look for in this article. On the contrary, we want to target specific dependence patterns and ignore others. We want to target the dependences which are relevant to a given problem, and forget those which are outside the scope of the problem at hand or, even worse, which may be spurious associations (pure chance or artifacts in the data).
The latter will be detected by an equitable dependence measure since they are deviations from independence, and will be given as much weight as the interesting ones. Rather than relying on the biases that several coefficients have for specific dependences, we propose a dependence coefficient that can be parameterized by a set of target-dependences and a set of forget-dependences. Sets of target and forget dependences can be built using expert hypotheses, or by leveraging the centers of clusters resulting from an exploratory clustering of the pairwise dependences. To achieve this goal, we will leverage three tools: copulas, optimal transportation, and clustering. Whereas clustering, the task of grouping a set of objects in such a way that objects in the same group (also called cluster) are more similar to each other than to those in different groups, is common knowledge in the machine learning community, copulas and optimal transportation are not yet mainstream tools. Copulas have recently gained attention in machine learning [elidan2013copulas], and several copula-based dependence measures have been proposed for improving feature selection methods [ghahramani2012copula, lopez2013randomized, chang2016robust]. Optimal transport may be more familiar to computer scientists working in computer vision since it is the underlying theory of the Earth Mover’s Distance [rubner2000earth]. Until very recently, optimal transportation distances between distributions were not deemed relevant for machine learning applications, since the best known computational cost was super-cubic in the number of bins used for discretizing the distribution supports, a number which itself grows exponentially with the dimension. A mere distance evaluation could take several seconds! In this article, we leverage recent computational breakthroughs detailed in [cuturi2013sinkhorn] which make their use practical in machine learning.
Background on Copulas and Optimal Transport
Copulas are functions that couple multivariate distribution functions to their one-dimensional marginal distribution functions [nelsen2013introduction]. In this article, we will only consider bivariate copulas, but most of the results and the methodology presented hold in the multivariate setting, at a computational cost which is, for now, prohibitive.
Theorem 1 (Sklar’s Theorem [sklar1959fonctions])
For any random vector $X = (X_i, X_j)$ having continuous marginal cumulative distribution functions $P_i, P_j$ respectively, its joint cumulative distribution $P$ is uniquely expressed as $P(X_i, X_j) = C(P_i(X_i), P_j(X_j))$, where $C$, the bivariate distribution of the uniform marginals $U_i := P_i(X_i)$ and $U_j := P_j(X_j)$, is known as the copula of $X$.
Copulas are central for studying the dependence between random variables: their uniform marginals jointly encode all the dependence. They allow one to study scale-free measures of dependence and are invariant to strictly increasing transformations of the variables. Some copulas play a major role in the measure of dependence, namely $W$ and $M$, the Fréchet-Hoeffding copula bounds, and the independence copula $\Pi(u_i, u_j) = u_i u_j$ (depicted in Figure 1).
Definition 1 (Fréchet-Hoeffding copula bounds)
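For any copula $C : [0,1]^2 \to [0,1]$ and any $(u_i, u_j) \in [0,1]^2$ the following bounds hold:
$$W(u_i, u_j) \leq C(u_i, u_j) \leq M(u_i, u_j),$$
where $W(u_i, u_j) = \max\{u_i + u_j - 1, 0\}$ is the copula for countermonotonic random variables and $M(u_i, u_j) = \min\{u_i, u_j\}$ is the copula for comonotonic random variables.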
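As a quick numerical illustration of these bounds, the sketch below checks on a regular grid that the independence copula $u_i u_j$ lies between $\max\{u_i + u_j - 1, 0\}$ and $\min\{u_i, u_j\}$. The function names, the grid size and the tolerance are illustrative choices, not part of the original article.

```python
def W(u, v):
    """Lower Frechet-Hoeffding bound (countermonotonic copula)."""
    return max(u + v - 1.0, 0.0)


def M(u, v):
    """Upper Frechet-Hoeffding bound (comonotonic copula)."""
    return min(u, v)


def Pi(u, v):
    """Independence copula."""
    return u * v


def bounds_hold(copula, n_grid=50, tol=1e-12):
    """Check W <= copula <= M on a regular grid of [0, 1]^2.

    The tolerance only absorbs floating-point rounding.
    """
    grid = [k / n_grid for k in range(n_grid + 1)]
    return all(
        W(u, v) - tol <= copula(u, v) <= M(u, v) + tol
        for u in grid
        for v in grid
    )


print(bounds_hold(Pi))
```

A function that is not a copula, such as `lambda u, v: u + v`, fails the same check.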
Many correlation coefficients can actually be expressed as a distance between the data copula and one of these reference copulas. For example, the Spearman (rank) correlation $\rho_S$, which is usually understood as $\rho_S(X_i, X_j) = \rho(P_i(X_i), P_j(X_j))$, i.e. the linear dependence of the probability integral transformed variables (rank-transformed data), can also be viewed as an average distance between the copula $C$ of $(X_i, X_j)$ and the independence copula $\Pi$: $\rho_S(X_i, X_j) = 12 \iint_{[0,1]^2} (C(u_i, u_j) - u_i u_j) \, du_i \, du_j$ [nelsen2013introduction]. Moreover, since $|u_i - u_j| / \sqrt{2}$ is the distance from the point $(u_i, u_j)$ to the diagonal (the support of the positive dependence copula $M$), one can rewrite $\rho_S(X_i, X_j) = 1 - 6 \iint_{[0,1]^2} (u_i - u_j)^2 \, dC(u_i, u_j)$ [liebscher2014copula]. Thus, Spearman correlation can also be viewed as measuring a deviation from the monotonically increasing dependence to the data copula using a quadratic distance. We will leverage this idea to propose our dependence-parameterized dependence coefficient.
Notice that when working with empirical data, we do not know a priori the margins $P_i$ for applying the probability integral transform $U_i := P_i(X_i)$. Deheuvels in [deheuvels1979fonction] has introduced a practical estimator for the uniform margins and the underlying copula, the empirical copula transform.
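The two readings of the Spearman correlation can be compared numerically. The sketch below computes it once as the Pearson correlation of normalized ranks and once as the sample version of $1 - 6 \iint (u_i - u_j)^2 \, dC(u_i, u_j)$. The simulated data, the seed and the helper names are assumptions made for illustration; the two values only agree up to a finite-sample term of order $1/T^2$.

```python
import math
import random


def normalized_ranks(x):
    """Return R_t / T, where R_t in 1..T is the rank of x[t] (assumes no ties)."""
    order = sorted(range(len(x)), key=lambda t: x[t])
    u = [0.0] * len(x)
    for rank, t in enumerate(order, start=1):
        u[t] = rank / len(x)
    return u


def pearson(a, b):
    """Plain Pearson linear correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((s - mean_a) * (t - mean_b) for s, t in zip(a, b))
    var_a = sum((s - mean_a) ** 2 for s in a)
    var_b = sum((t - mean_b) ** 2 for t in b)
    return cov / math.sqrt(var_a * var_b)


random.seed(0)
T = 2000
x = [random.gauss(0.0, 1.0) for _ in range(T)]
y = [xi + random.gauss(0.0, 1.0) for xi in x]
u, v = normalized_ranks(x), normalized_ranks(y)

# Reading 1: linear correlation of the rank-transformed data.
rho_ranks = pearson(u, v)
# Reading 2: one minus six times the mean squared deviation from the diagonal.
rho_copula = 1.0 - 6.0 * sum((a - b) ** 2 for a, b in zip(u, v)) / T
print(rho_ranks, rho_copula)
```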
Definition 2 (Empirical Copula Transform)
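Let $(X_i^t, X_j^t)$, $t = 1, \ldots, T$, be $T$ observations from a random vector $(X_i, X_j)$ with continuous margins. Since one cannot directly obtain the corresponding copula observations $(U_i^t, U_j^t) = (P_i(X_i^t), P_j(X_j^t))$, where $t = 1, \ldots, T$, without knowing a priori $(P_i, P_j)$, one can instead estimate the empirical margins $P_i^T(x) = \frac{1}{T} \sum_{t=1}^{T} \mathbf{1}(X_i^t \leq x)$, to obtain the $T$ empirical observations $(\tilde{U}_i^t, \tilde{U}_j^t) = (P_i^T(X_i^t), P_j^T(X_j^t))$. Equivalently, since $\tilde{U}_i^t = R_i^t / T$, $R_i^t$ being the rank of observation $X_i^t$, the empirical copula transform can be considered as the normalized rank transform.
Notice that the empirical copula transform is fast to compute (sorting arrays of length $T$ can be done in $O(T \log T)$), is consistent, and converges fast to the underlying copula [deheuvels1981asymptotic], [ghahramani2012copula].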
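The normalized rank transform can be sketched in a few lines. The example below is a minimal illustration, assuming no ties in the data; the function name and the simulated sample are hypothetical. It also shows that a strictly increasing transformation of the variable (here the exponential) leaves the transformed observations unchanged.

```python
import math
import random


def empirical_copula_transform(x):
    """Map observations to normalized ranks R_t / T (assumes no ties).

    The cost is dominated by the sort, i.e. O(T log T).
    """
    order = sorted(range(len(x)), key=lambda t: x[t])
    u = [0.0] * len(x)
    for rank, t in enumerate(order, start=1):
        u[t] = rank / len(x)
    return u


random.seed(1)
x = [random.gauss(0.0, 1.0) for _ in range(1000)]

u = empirical_copula_transform(x)
# A strictly increasing map preserves the ordering, hence the ranks.
u_exp = empirical_copula_transform([math.exp(xi) for xi in x])
print(u == u_exp)
```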
As motivated in the introduction, we want to compare and summarize the pairwise empirical dependence structure (empirical bivariate copulas) of many variables. This raises the following questions: How can we compare two such copulas? What is a relevant representative of a set of empirical copulas? Which geometries are relevant for clustering these empirical distributions, and which are not?
In [Mart1606:Optimal], the authors illustrate in a parametric setting using Gaussian copulas that common divergences (such as Kullback-Leibler, Jeffreys, Hellinger, Bhattacharyya) are not relevant for clustering these distributions, especially when dependence is high. These information divergences are only defined for absolutely continuous measures, whereas some copulas have no density (e.g. the one for positive dependence). In practice, when working with frequency histograms, it gets worse: one has to pre-process the empirical measures with a kernel density estimator before computing these divergences. On the contrary, optimal transport distances are well-defined for both discrete (e.g. empirical) and continuous measures.
The idea of optimal transport is intuitive. It was first formulated by Gaspard Monge in 1781 [monge1781memoire] as a problem to efficiently level the ground: Given that work is measured by the distance multiplied by the amount of dirt displaced, what is the minimum amount of work required to level the ground? Optimal transport plans and distances give the answer to this problem.
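In practice, empirical distributions can be represented by histograms. We follow notations from [cuturi2013sinkhorn]. Let $r, c$ be two histograms in the probability simplex $\Sigma_m = \{x \in \mathbb{R}_+^m : x^\top 1_m = 1\}$. Let $U(r, c) = \{P \in \mathbb{R}_+^{m \times m} \mid P 1_m = r, \; P^\top 1_m = c\}$ be the transportation polytope of $r$ and $c$, that is the set containing all possible transport plans between $r$ and $c$.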
Definition 3 (Optimal Transport)
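Given an $m \times m$ cost matrix $M$, the cost of mapping $r$ to $c$ using a transportation matrix $P$ can be quantified as $\langle P, M \rangle_F$, where $\langle \cdot, \cdot \rangle_F$ is the Frobenius dot-product. The optimal transport between $r$ and $c$ given transportation cost $M$ is thus:
$$d_M(r, c) := \min_{P \in U(r, c)} \langle P, M \rangle_F.$$
Whenever $M$ belongs to the cone of distance matrices, the optimum of the transportation problem $d_M(r, c)$ is itself a distance.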
Lightspeed transportation. Optimal transport distances suffer from a computational burden scaling in $O(m^3 \log m)$, which has prevented their widespread use in machine learning: a mere distance computation between two high-dimensional histograms can take several seconds. In [cuturi2013sinkhorn], Cuturi provides a solution to this problem: he restricts the polytope $U(r, c)$ of all possible transport plans between $r$ and $c$ to a Kullback-Leibler ball $U_\alpha(r, c) \subset U(r, c)$, where $U_\alpha(r, c) = \{P \in U(r, c) \mid \mathrm{KL}(P \,\|\, r c^\top) \leq \alpha\}$. He then shows that this amounts to performing an entropic regularization (recently generalized to many more regularizers in [muzellec2016tsallis, dessein2016regularized]) of the optimal transportation problem, whose solution is smoother and less deterministic. The regularized optimal transportation problem is now strictly convex, and can be solved efficiently using the Sinkhorn-Knopp iterative algorithm, which exhibits linear convergence. Its solution is the Sinkhorn distance [cuturi2013sinkhorn]:
$$d_{M,\alpha}(r, c) := \min_{P \in U_\alpha(r, c)} \langle P, M \rangle_F,$$
and its dual $d_M^\lambda(r, c)$:
$$d_M^\lambda(r, c) := \langle P^\lambda, M \rangle_F,$$
where $P^\lambda = \arg\min_{P \in U(r, c)} \langle P, M \rangle_F - \frac{1}{\lambda} h(P)$, and $h$ is the entropy function.
In the following, we will leverage the dual-Sinkhorn distances for comparing, clustering and computing the cluster centers [DBLP:conf/icml/CuturiD14] of a set of copulas at full speed.
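The Sinkhorn-Knopp iterations can be sketched as follows: the regularized plan has the form $\mathrm{diag}(u) \, K \, \mathrm{diag}(v)$ with $K = e^{-\lambda M}$ taken elementwise, and the scaling vectors $u, v$ are updated alternately to match the two marginals. This is a minimal NumPy sketch under assumed settings (a fixed number of iterations, a small one-dimensional grid, $\lambda = 10$), not the implementation used in the article.

```python
import numpy as np


def sinkhorn_distance(r, c, M, lam=10.0, n_iter=1000):
    """Return <P_lam, M>_F and the regularized plan P_lam.

    r, c: histograms (1-D arrays summing to 1); M: cost matrix.
    lam and n_iter are illustrative defaults.
    """
    K = np.exp(-lam * M)                 # elementwise kernel
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)                # rescale to match column marginals
        u = r / (K @ v)                  # rescale to match row marginals
    P = u[:, None] * K * v[None, :]      # P = diag(u) K diag(v)
    return float(np.sum(P * M)), P


# Two histograms on a regular grid of [0, 1] with cost |x - y|.
m = 5
x = np.arange(m) / (m - 1)
M = np.abs(x[:, None] - x[None, :])
r = np.array([0.5, 0.2, 0.1, 0.1, 0.1])
c = np.array([0.1, 0.1, 0.1, 0.2, 0.5])

d, P = sinkhorn_distance(r, c, M)
print(d)
```

Because the regularized plan is feasible for the unregularized problem, the value returned is never below the exact optimal transport cost, and it approaches that cost as `lam` grows (at the price of slower convergence and possible numerical underflow in `K`).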
A methodology to explore and measure non-linear correlations
We propose an approach to explore and measure non-linear correlations between variables in a dataset. These variables can be, for instance, time series or features. The methodology presented (which is summarized in Figure 2) is twofold, and consists of: (i) an exploratory part of the pairwise dependence between variables, (ii) the parameterization and use of a novel dependence coefficient.
Using transportation of copulas as a measure of correlations
In this section, we leverage and extend the idea presented in our short introduction to copulas: correlation coefficients can be viewed as a distance between the data-copula and the Fréchet-Hoeffding bounds or the independence copula. The distance involved is usually a Minkowski ($\ell_p$) distance. In the following, we will:
replace the $\ell_p$ distance by an optimal transport distance between measures,
parameterize a dependence coefficient with copulas other than the Fréchet-Hoeffding bounds or the independence one.
Using the optimal transport distance between copulas, we now propose a dependence coefficient which is parameterized by two sets of copulas: target copulas and forget copulas.
Definition 4 (Target/Forget Dependence Coefficient)
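Let $\{C_l^-\}_l$ be the set of forget-dependence copulas.
Let $\{C_k^+\}_k$ be the set of target-dependence copulas.
Let $C$ be the copula of $(X_i, X_j)$. Let $d_M$ be an optimal transport distance parameterized by a ground metric $M$.
We define the Target/Forget Dependence Coefficient as:
$$\mathrm{TFDC}(X_i, X_j; \{C_k^+\}_k, \{C_l^-\}_l) := \frac{\min_l d_M(C_l^-, C)}{\min_l d_M(C_l^-, C) + \min_k d_M(C, C_k^+)} \in [0, 1].$$
Using this definition, we obtain: $\mathrm{TFDC}(X_i, X_j; \{C_k^+\}_k, \{C_l^-\}_l) = 0 \Leftrightarrow C \in \{C_l^-\}_l$, and $\mathrm{TFDC}(X_i, X_j; \{C_k^+\}_k, \{C_l^-\}_l) = 1 \Leftrightarrow C \in \{C_k^+\}_k$.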
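A possible end-to-end sketch of such a coefficient is given below. It assumes the coefficient is the ratio of the distance to the nearest forget copula over the sum of that distance and the distance to the nearest target copula, that copulas are represented as histograms on a regular grid of the unit square, and that the transport distance is approximated by Sinkhorn iterations with a Euclidean ground metric between bin centers. The bin count, the value of `lam`, the simulated data and all function names are illustrative assumptions. Because of the entropic regularization, the distance between a histogram and itself is small but not zero, so the extreme values 0 and 1 are only approached.

```python
import numpy as np


def empirical_copula_histogram(x, y, n_bins=8):
    """2-D histogram of the normalized ranks of (x, y)."""
    T = len(x)
    u = (np.argsort(np.argsort(x)) + 1) / T
    v = (np.argsort(np.argsort(y)) + 1) / T
    hist, _, _ = np.histogram2d(u, v, bins=n_bins, range=[[0, 1], [0, 1]])
    return hist / hist.sum()


def ground_metric(n_bins=8):
    """Euclidean distances between bin centers of the unit square."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    uu, vv = np.meshgrid(centers, centers, indexing="ij")
    pts = np.column_stack([uu.ravel(), vv.ravel()])
    return np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))


def sinkhorn_distance(r, c, M, lam=20.0, n_iter=1000):
    """Entropy-regularized transport cost between histograms r and c."""
    K = np.exp(-lam * M)
    u = np.ones_like(r)
    for _ in range(n_iter):
        v = c / (K.T @ u)
        u = r / (K @ v)
    return float(np.sum(u[:, None] * K * v[None, :] * M))


def tfdc(copula, targets, forgets, M, lam=20.0):
    """Distance to nearest forget copula, normalized by forget + target distances."""
    c = copula.ravel()
    d_forget = min(sinkhorn_distance(f.ravel(), c, M, lam) for f in forgets)
    d_target = min(sinkhorn_distance(c, t.ravel(), M, lam) for t in targets)
    return d_forget / (d_forget + d_target)


n = 8
M = ground_metric(n)
independence = np.full((n, n), 1.0 / n**2)      # forget: independence
comonotonic = np.eye(n) / n                     # target: upper bound
countermonotonic = np.fliplr(np.eye(n)) / n     # target: lower bound
targets, forgets = [comonotonic, countermonotonic], [independence]

rng = np.random.default_rng(0)
x = rng.normal(size=2000)
dependent = empirical_copula_histogram(x, np.exp(x), n)
independent = empirical_copula_histogram(x, rng.normal(size=2000), n)

tfdc_dep = tfdc(dependent, targets, forgets, M)
tfdc_ind = tfdc(independent, targets, forgets, M)
print(tfdc_dep, tfdc_ind)
```

With this parameterization the coefficient should be close to 1 for the monotone pair and markedly lower for the independent pair.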
Example. A standard correlation coefficient can be obtained by setting the forget-dependence set to the independence copula, and the target-dependence set to the Fréchet-Hoeffding bounds. How does it compare to the Spearman correlation? In Figure 3, we display how the two coefficients behave on a simple numerical experiment in which the two variables are built from a common uniform variable and independent noises. Notice that for some settings of this experiment, the Spearman coefficient takes a negative value. We may thus prefer the monotonically increasing behaviour of the TFDC to that of the Spearman coefficient.
How to choose, design and build targets?
We now propose two alternatives for choosing, designing and building the target and forget copulas: an exploratory data-driven approach and a hypothesis-testing approach.
Data-driven: Clustering of copulas
Assume we have $N$ variables $X_1, \ldots, X_N$, and $T$ observations for each of them. First, we compute $\binom{N}{2}$ empirical copulas which represent the dependence structure between all the couples $(X_i, X_j)$. Then, we summarize all these distributions using a center-based clustering algorithm, and extract the cluster centers using a fast computation of Wasserstein barycenters [DBLP:conf/icml/CuturiD14]. A given center represents the mean dependence between the couples inside the corresponding cluster. Figures 4 and 5 illustrate why a Wasserstein barycenter, i.e. the minimizer $\mu^\star$ of $\frac{1}{n} \sum_{s=1}^{n} W_2^2(\mu, \nu_s)$ [agueh2011barycenters], where $\{\nu_1, \ldots, \nu_n\}$ is a set of $n$ measures (here, bivariate empirical copulas), is more relevant to our needs: we benefit from robustness against small deformations of the dependence patterns.
Example. In Table 1, we display some interesting dependence patterns which can be found in UCI datasets http://archive.ics.uci.edu/ml/. In this case, variables are the features. Some associations are easy to explain (e.g. the top left copula, representing the relation between radius and area of roughly round cells in the Breast Cancer Wisconsin (Diagnostic) Data Set) whereas others are less so (e.g. the third copula from the left in the top row, which represents the relation between the perimeter and the fractal dimension of the cells).
An equitable copula-based dependence measure such as the one described in [ghahramani2012copula] may detect them well, but will also detect the spurious ones which are due to artifacts in the data (or pure chance). With our approach, one can spot the spurious ones and add them to the set of forget-dependence copulas. For these reasons, we think that this approach could improve the correlation-based feature selection approaches [hall2000correlation, yu2003feature], which rely on the hypothesis that good feature subsets contain features highly correlated with the class, yet uncorrelated with each other [hall2000correlation].
Table 1: Empirical copulas of feature pairs found in UCI datasets (recoverable panel label: Breast Cancer (wdbc)).
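To illustrate the robustness to small deformations, the sketch below computes an entropy-regularized Wasserstein barycenter of two one-dimensional histograms and compares it with their bin-wise average. It uses iterative Bregman projections (Benamou et al., 2015), which is a swapped-in technique chosen for brevity and not the algorithm of [DBLP:conf/icml/CuturiD14]; the one-dimensional setting, the squared-distance cost, `lam` and the bump shapes are all illustrative assumptions. The same routine applies to flattened bivariate copula histograms once a ground cost between bins is chosen.

```python
import numpy as np


def entropic_barycenter(hists, M, lam=50.0, n_iter=300, weights=None):
    """Entropy-regularized barycenter via iterative Bregman projections.

    hists: array of shape (n, m), one histogram per row; M: (m, m) cost matrix.
    """
    hists = np.asarray(hists, dtype=float)
    n, m = hists.shape
    w = np.full(n, 1.0 / n) if weights is None else np.asarray(weights, dtype=float)
    K = np.exp(-lam * M)
    u = np.ones((n, m))
    for _ in range(n_iter):
        v = hists / (u @ K)             # row s holds p_s / (K^T u_s)
        Kv = v @ K.T                    # row s holds K v_s
        a = np.exp(w @ np.log(Kv))      # weighted geometric mean over s
        u = a / Kv                      # enforce the common marginal a
    return a / a.sum()


m = 50
x = (np.arange(m) + 0.5) / m
M = (x[:, None] - x[None, :]) ** 2      # squared-distance ground cost


def bump(center, width=0.05):
    """Strictly positive, normalized Gaussian-shaped histogram."""
    p = np.exp(-0.5 * ((x - center) / width) ** 2) + 1e-12
    return p / p.sum()


p1, p2 = bump(0.2), bump(0.8)
bary = entropic_barycenter([p1, p2], M)
mean_hist = 0.5 * (p1 + p2)             # bin-wise (Euclidean) average

print(x[np.argmax(bary)], x[np.argmax(mean_hist)])
```

The bin-wise average keeps two separate modes, whereas the transport barycenter places a single mode between them (slightly blurred by the regularization).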
Targets as hypotheses from an expert
One can specify dependence hypotheses, generate the corresponding copulas, then measure and rank correlations with respect to them. For example, one can answer questions such as: Which are the pairs of assets that are usually positively correlated for small variations but uncorrelated otherwise? In [durante2009rectangular], the authors present a method for constructing bivariate copulas by changing the values that a given copula assumes on some subrectangles of the unit square. They discuss some applications of their methodology, including the construction of copulas with different tail dependencies. Building target and forget copulas is another one. In the Experiments section, we illustrate its use to answer the previous question and other dependence queries.
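As an illustration of turning a hypothesis into a target, the sketch below hand-builds a histogram copula that is comonotonic on the central ranks (small variations) and independent elsewhere, while keeping uniform marginals. This is a simple hypothetical construction written for this example, not the rectangular patchwork method of [durante2009rectangular]; the bin count and the bounds of the central block are arbitrary.

```python
import numpy as np


def center_comonotone_target(n_bins=10, lo=3, hi=7):
    """Histogram copula: comonotonic for rank bins in [lo, hi), independent elsewhere.

    Rows and columns each sum to 1 / n_bins, so both marginals stay uniform.
    """
    C = np.zeros((n_bins, n_bins))
    inside = np.arange(lo, hi)
    outside = np.setdiff1d(np.arange(n_bins), inside)
    # Central block: all the mass of each row sits on the diagonal.
    C[inside, inside] = 1.0 / n_bins
    # Remaining rows: mass spread uniformly over the remaining columns.
    C[np.ix_(outside, outside)] = 1.0 / (n_bins * len(outside))
    return C


target = center_comonotone_target()
print(target.sum(axis=0))
print(target.sum(axis=1))
```

Such a histogram can then be passed as a target to a transport-based dependence coefficient, alongside forget copulas such as the independence one.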
Exploration of financial correlations
We illustrate the first part of the methodology with three different datasets of financial time series. These time series consist of the daily returns of stocks (40 stocks from the CAC 40 index, comprising the highest French market capitalizations), credit default swaps (75 CDS from the iTraxx Crossover index, comprising the most liquid sub-investment grade European entities) and foreign exchange rates (80 FX rates of major world currencies) between January 2006 and August 2016. We display some of the clustering centroids obtained for each asset class on the top row, and below we display their corresponding Gaussian copulas parameterized by the estimated linear correlations. Notice the strong difference between the empirical copulas and the Gaussian ones, which are still widely used in financial engineering due to their convenience. Notice also the difference between asset classes: though the estimated correlations are comparable for the leftmost copulas, their peculiarities differ markedly.
Stocks
Centroids’ main feature: More mass in the bottom-left corner, i.e. lower tail dependence. Stock prices tend to plummet together.
Credit default swaps
Centroids’ main feature: More mass in the top-right corner, i.e. upper tail dependence. Insurance cost against entities’ default tends to soar in stressed markets.
FX rates
Centroids’ main feature: Empirical copulas show that the dependences between FX rates are varied. For example, rates may exhibit either strong dependence or independence while being anti-correlated during extreme events.
Answering dependence queries
Inspired by the previous exploration results, we may want to answer questions such as: (A) Which pair of assets with a given correlation has the copula nearest to the Gaussian one? Though such questions can be answered by computing a likelihood for each pair, our methodology stands out for dealing with non-parametric dependence patterns, and thus for questions such as: (B) Which pairs of assets are both positively and negatively correlated? (C) Which assets undergo extreme variations while the variations of others are relatively small, and conversely? (D) Which pairs of assets are positively correlated for small variations but uncorrelated otherwise?
Considering a cross-asset dataset which comprises the SBF 120 components (an index including the CAC 40 and 80 other highly capitalized French entities), the 500 most liquid CDS worldwide, and 80 FX rates, we display in Figure 6 the empirical copulas (alongside their respective targets) which best answer questions A, B, C and D.
Power of TFDC
In this experiment, we compare the empirical power of TFDC to well-known dependence coefficients such as Pearson linear correlation (cor), distance correlation (dCor) [szekely2009brownian], maximal information coefficient (MIC) [reshef2011detecting], alternating conditional expectations (ACE) [breiman1985estimating], maximum mean discrepancy (MMD) [gretton2012kernel], copula maximum mean discrepancy (CMMD) [ghahramani2012copula], and randomized dependence coefficient (RDC) [lopez2013randomized]. The statistical power of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true. In the case of dependence coefficients, we consider (H0): $X$ and $Y$ are independent; (H1): $X$ and $Y$ are dependent. Following the numerical experiment described in [simon2014comment, lopez2013randomized], we estimate the power of the aforementioned dependence measures on simulated pairs of variables with different relationships (considered in [reshef2011detecting, simon2014comment, lopez2013randomized]) and varying levels of added noise. By design, TFDC aims at detecting the simulated dependence relationships. Thus, this dependence measure is expected to have a much higher power than coefficients such as MIC since, according to Simon and Tibshirani in [simon2014comment], coefficients “which strive to have high power against all alternatives can have low power in many important situations.” TFDC only targets the specific important situations. Results are displayed in Figure 7.
Risk managers know how dangerous it can be to rely solely on a correlation coefficient to measure dependence. That is why we have proposed a novel approach to explore, summarize and measure the pairwise correlations which exist between variables in a dataset. We have also pointed out through the UCI-datasets example that non-trivial dependence patterns can easily be found between the feature variables. Using these patterns as targets when performing correlation-based feature selection may improve results. This idea still needs to be empirically verified. The experiments show the benefits of the proposed method: it highlights the various dependence patterns that can be found between financial time series, which strongly depart from the Gaussian copula widely used in financial engineering. Though answering dependence queries as briefly outlined is still an art, we plan to develop a rich language so that a user can formulate complex questions about dependence, which will be automatically translated into copulas in order to let the methodology provide accurate answers to these questions.
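The power estimation protocol can be sketched as a small Monte Carlo loop: the null distribution of a statistic is simulated on independent pairs, a rejection threshold is set at the chosen level, and the power is the fraction of dependent pairs whose statistic exceeds it. The sketch below uses the absolute Pearson correlation as a stand-in statistic; any other coefficient with the same `(x, y) -> float` interface could be plugged in. The sample size, number of simulations, noise levels and relationships are illustrative assumptions and do not reproduce the settings of the cited experiments.

```python
import numpy as np


def abs_pearson(x, y):
    """Stand-in dependence statistic: absolute linear correlation."""
    return abs(np.corrcoef(x, y)[0, 1])


def estimate_power(statistic, relationship, noise, n=100, n_sim=200,
                   alpha=0.05, seed=0):
    """Monte Carlo estimate of the power of `statistic` at level alpha."""
    rng = np.random.default_rng(seed)
    null_stats, alt_stats = [], []
    for _ in range(n_sim):
        x = rng.uniform(size=n)
        y = relationship(x) + noise * rng.normal(size=n)
        alt_stats.append(statistic(x, y))          # dependent pair (H1)
        x_indep = rng.uniform(size=n)              # fresh x, independent of y (H0)
        null_stats.append(statistic(x_indep, y))
    threshold = np.quantile(null_stats, 1.0 - alpha)
    return float(np.mean(np.array(alt_stats) > threshold))


power_linear = estimate_power(abs_pearson, lambda x: x, noise=0.5)
power_parabola = estimate_power(abs_pearson, lambda x: 4 * (x - 0.5) ** 2, noise=0.5)
print(power_linear, power_parabola)
```

Linear correlation should have high power on the linear relationship and low power on the symmetric parabola, which is the kind of contrast such power curves are designed to expose.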