
An Adaptive Test of Independence with Analytic Kernel Embeddings

Abstract

A new computationally efficient dependence measure, and an adaptive statistical test of independence, are proposed. The dependence measure is the difference between analytic embeddings of the joint distribution and the product of the marginals, evaluated at a finite set of locations (features). These features are chosen so as to maximize a lower bound on the test power, resulting in a test that is data-efficient, and that runs in linear time (with respect to the sample size n). The optimized features can be interpreted as evidence to reject the null hypothesis, indicating regions in the joint domain where the joint distribution and the product of the marginals differ most. Consistency of the independence test is established, for an appropriate choice of features. In real-world benchmarks, independence tests using the optimized features perform comparably to the state-of-the-art quadratic-time HSIC test, and outperform competing O(n) and O(n log n) tests.



1 Introduction

We consider the design of adaptive, nonparametric statistical tests of dependence: that is, tests of whether a joint distribution P_xy factorizes into the product of its marginals P_x P_y. While classical tests of dependence, such as Pearson's correlation and Kendall's tau, are able to detect monotonic relations between univariate variables, more modern tests can address complex interactions, for instance changes in the variance of Y with the value of X. Key to many recent tests is to examine covariance or correlation between data features. These interactions become significantly harder to detect, and the features more difficult to design, when the data reside in high dimensions.

A basic nonlinear dependence measure is the Hilbert-Schmidt Independence Criterion (HSIC), which is the Hilbert-Schmidt norm of the covariance operator between feature mappings of the random variables (Gretton et al., 2005, 2008). Each of the random variables X and Y is mapped to a respective reproducing kernel Hilbert space F_k and F_l. For sufficiently rich mappings, the covariance operator norm is zero if and only if the variables are independent. A second basic nonlinear dependence measure is the smoothed difference between the characteristic function of the joint distribution, and that of the product of the marginals. When a particular smoothing function is used, the statistic corresponds to the covariance between distances of X and Y variable pairs (Feuerverger, 1993; Székely et al., 2007; Székely and Rizzo, 2009), yielding a simple test statistic. It has been shown by Sejdinovic et al. (2013) that the distance covariance (and its generalization to semi-metrics) is an instance of HSIC for an appropriate choice of kernels. A disadvantage of these feature covariance statistics, however, is that they require quadratic time to compute (except in the special case of the distance covariance with univariate real-valued variables, where Huo and Székely (2014) achieve an O(n log n) cost). Moreover, the feature covariance statistics have intractable null distributions, and either a permutation approach or the solution of an expensive eigenvalue problem (e.g. Zhang et al., 2011) is required for consistent estimation of the quantiles. Several approaches were proposed by Zhang et al. (2016) to obtain faster tests along the lines of HSIC. These include computing HSIC on finite-dimensional feature mappings chosen as random Fourier features (RFFs) (Rahimi and Recht, 2008), a block-averaged statistic, and a Nyström approximation to the statistic. Key to each of these approaches is a more efficient computation of the statistic and its threshold under the null distribution: for RFFs, the null distribution is a finite weighted sum of chi-squared variables; for the block-averaged statistic, the null distribution is asymptotically normal; for Nyström, either a permutation approach is employed, or the spectrum of the Nyström approximation to the kernel matrix is used in approximating the null distribution. Each of these methods costs significantly less than the O(n^2) cost of the full HSIC (the cost is linear in n, but also depends quadratically on the number of features retained). A potential disadvantage of the Nyström and Fourier approaches is that the features are not optimized to maximize test power, but are chosen randomly. The block statistic performs worse than both, due to the large variance of the statistic under the null (which can be mitigated by observing more data).

In addition to feature covariances, correlation measures have also been developed in infinite-dimensional feature spaces: in particular, Bach and Jordan (2002); Fukumizu et al. (2008) proposed statistics on the correlation operator in a reproducing kernel Hilbert space. While convergence has been established for certain of these statistics, their computational cost is high at O(n^3), and test thresholds have relied on permutation. A number of much faster approaches to testing based on feature correlations have been proposed, however. For instance, Dauxois and Nkiet (1998) compute statistics of the correlation between finite sets of basis functions, chosen for instance to be step functions or low-order B-splines. The cost of this approach is linear in n for a fixed number of basis functions. This idea was extended by Lopez-Paz et al. (2013), who computed the canonical correlation between finite sets of basis functions chosen as random Fourier features; in addition, they performed a copula transform on the inputs, with a total cost of O(n log n). Finally, space partitioning approaches have also been proposed, based on statistics such as the KL divergence; however, these apply only to univariate variables (Heller et al., 2016), or to multivariate variables of low dimension (Gretton and Györfi, 2010) (that said, these tests have other advantages of theoretical interest, notably distribution-independent test thresholds).

The approach we take is most closely related to HSIC on a finite set of features. Our simplest test statistic, the Finite Set Independence Criterion (FSIC), is an average of covariances of analytic functions (i.e., features) defined on each of X and Y. A normalized version of the statistic (NFSIC) yields a distribution-independent asymptotic test threshold. We show that our test is consistent, despite a finite number of analytic features being used, via a generalization of arguments in Chwialkowski et al. (2015). As in recent work on two-sample testing by Jitkrittum et al. (2016), our test is adaptive in the sense that we choose our features on a held-out validation set to optimize a lower bound on the test power. The design of features for independence testing turns out to be quite different to the case of two-sample testing, however: the task is to find correlated feature pairs on the respective marginal domains, rather than attempting to find a single, high-dimensional feature representation on the entire joint domain (as we would need to do if we were comparing the joint distribution against a second, arbitrary joint distribution, rather than testing a specific property of P_xy). We demonstrate the performance of our tests on several challenging artificial and real-world datasets, including detection of dependence between music and its year of appearance, and between videos and captions. In these experiments, we outperform competing linear-time and O(n log n)-time tests.

2 Independence Criteria and Statistical Tests

We introduce two test statistics: first, the Finite Set Independence Criterion (FSIC), which builds on the principle that dependence can be measured in terms of the covariance between data features. Next, we propose a normalized version of this statistic (NFSIC), with a simpler asymptotic distribution under the null hypothesis. We show how to select features for the latter statistic to maximize a lower bound on the power of its corresponding statistical test.

2.1 The Finite Set Independence Criterion

We begin by introducing the Hilbert-Schmidt Independence Criterion (HSIC) as proposed in Gretton et al. (2005), since our unnormalized statistic is built along similar lines. Consider two random variables X ∈ 𝒳 and Y ∈ 𝒴. Denote by P_xy the joint distribution of X and Y; P_x and P_y are the marginal distributions of X and Y. Let ⊗ denote the tensor product, such that (a ⊗ b)c = a⟨b, c⟩. Assume that k: 𝒳 × 𝒳 → R and l: 𝒴 × 𝒴 → R are positive definite kernels associated with reproducing kernel Hilbert spaces (RKHS) F_k and F_l, respectively. Let ‖·‖_HS be the norm on the space of Hilbert-Schmidt operators. Then, the HSIC between X and Y is defined as

$$\mathrm{HSIC}(X, Y) := \left\| \mu_{xy} - \mu_x \otimes \mu_y \right\|_{\mathrm{HS}}^2 = \mathbb{E}_{xy}\mathbb{E}_{x'y'}\, k(x, x')\, l(y, y') + \mathbb{E}_{x}\mathbb{E}_{x'} k(x, x')\, \mathbb{E}_{y}\mathbb{E}_{y'} l(y, y') - 2\, \mathbb{E}_{x'y'}\!\left[ \mathbb{E}_{x} k(x, x')\, \mathbb{E}_{y} l(y, y') \right], \qquad (1)$$

where μ_xy := E_xy[k(·, x) ⊗ l(·, y)], μ_x := E_x[k(·, x)], μ_y := E_y[l(·, y)], and (x', y') is an independent copy of (x, y). The mean embedding μ_xy of P_xy belongs to the space of Hilbert-Schmidt operators from F_l to F_k, and μ_x and μ_y are the marginal mean embeddings (Smola et al., 2007). Gretton et al. (2005, Theorem 4) show that if the kernels k and l are universal (Steinwart and Christmann, 2008) on compact domains 𝒳 and 𝒴, then HSIC(X, Y) = 0 if and only if X and Y are independent. Alternatively, Gretton (2015) shows that it is sufficient for each of k and l to be characteristic to their respective domains (meaning that distribution embeddings are injective in each marginal domain: see Sriperumbudur et al. (2010)). Given a joint sample Z_n := {(x_i, y_i)}_{i=1}^n drawn i.i.d. from P_xy, an empirical estimator of HSIC can be computed in O(n^2) time by replacing the population expectations in (1) with their corresponding empirical expectations based on Z_n.
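As a point of reference before introducing FSIC, the following is a minimal sketch of the standard biased, quadratic-time plug-in estimator of (1), assuming Gaussian kernels; it is not part of the proposed method, and the gaussian_kernel helper and the fixed bandwidths are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel(A, B, sigma):
    """Gram matrix of exp(-||a - b||^2 / (2 sigma^2)) between rows of A and B."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-sq / (2.0 * sigma**2))

def hsic_biased(X, Y, sigma_x=1.0, sigma_y=1.0):
    """Biased O(n^2) HSIC estimate: (1/n^2) tr(K H L H), with H = I - (1/n) 11^T."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma_x)
    L = gaussian_kernel(Y, Y, sigma_y)
    H = np.eye(n) - np.ones((n, n)) / n
    return np.trace(K @ H @ L @ H) / n**2
```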

We now propose our new linear-time dependence measure, the Finite Set Independence Criterion (FSIC). Let 𝒳 ⊆ R^{d_x} and 𝒴 ⊆ R^{d_y} be open sets. Given J test locations {(v_i, w_i)}_{i=1}^J ⊂ 𝒳 × 𝒴, define the empirical measure ν := (1/J) Σ_{i=1}^J δ_{(v_i, w_i)}, where δ_{(v, w)} denotes the Dirac measure centered on (v, w), and the locations are realizations from an absolutely continuous distribution (wrt the Lebesgue measure). Write μ_xy(v, w) := E_xy[k(x, v) l(y, w)], μ_x(v) := E_x[k(x, v)], and μ_y(w) := E_y[l(y, w)] for the joint and marginal embeddings evaluated at a location. The idea is to see μ_xy and the product μ_x μ_y as smooth functions, and to consider a distance between them weighted by ν, i.e., evaluated only at the test locations, instead of a Hilbert-Schmidt distance as in HSIC (Gretton et al., 2005). FSIC is defined as

$$\mathrm{FSIC}^2(X, Y) := \frac{1}{J} \sum_{i=1}^{J} u^2(v_i, w_i), \qquad (2)$$

and u(v, w) := μ_xy(v, w) − μ_x(v) μ_y(w).

Our first result in Proposition 2 states that FSIC almost surely defines a dependence measure for the random variables X and Y, provided that the product kernel on the joint space is characteristic and analytic (see Definition 1).

Definition 1 (Analytic kernels (Chwialkowski et al., 2015)).

Let 𝒳 be an open set in R^d. A positive definite kernel k: 𝒳 × 𝒳 → R is said to be analytic on its domain if, for all x ∈ 𝒳, the function v ↦ k(x, v) is analytic on 𝒳.

Assumption A.

The kernels k and l are bounded by B_k and B_l respectively, and the product kernel g((x, y), (x', y')) := k(x, x') l(y, y') is characteristic (Sriperumbudur et al., 2010, Definition 6), and analytic (Definition 1), on 𝒳 × 𝒴.

Proposition 2 (FSIC is a dependence measure).

Assume that

  1. Assumption A holds.

  2. The test locations are drawn from an absolutely continuous distribution.

Then, almost surely, FSIC²(X, Y) = 0 if and only if X and Y are independent.

Proof.

Since the product kernel g is characteristic, the map taking a probability distribution P on 𝒳 × 𝒴 to its mean embedding E_{(x,y)~P}[g((x, y), ·)] is injective (Sriperumbudur et al., 2010, Section 3). Since g is analytic, by Lemma 8 (Appendix), μ_xy and μ_x μ_y are analytic functions. Thus, Lemma 9 (Appendix, applied to the difference u = μ_xy − μ_x μ_y) guarantees that, almost surely, FSIC²(X, Y) = 0 if and only if X and Y are independent. ∎

FSIC uses the empirical measure ν as a proxy for a measure with full support on 𝒳 × 𝒴, and the finite average in (2) as a proxy for an integrated squared distance between μ_xy and μ_x μ_y. Proposition 2 suggests that, to detect the dependence between X and Y, it is sufficient to evaluate the difference u = μ_xy − μ_x μ_y of the population joint embedding and the embedding of the product of the marginal distributions at a finite number of locations (defined by {(v_i, w_i)}_{i=1}^J). A brief explanation of this property is as follows. If P_xy = P_x P_y, then u is the zero function, and FSIC²(X, Y) = 0 for any choice of locations. If P_xy ≠ P_x P_y, then u is not the zero function, since the mean embedding map is injective (this requires the product kernel to be characteristic). Using the same argument as in Chwialkowski et al. (2015), since μ_xy and μ_x μ_y are analytic, u is also analytic, and its set of roots {(v, w) : u(v, w) = 0} has Lebesgue measure zero. Thus, it is sufficient to draw {(v_i, w_i)}_{i=1}^J from an absolutely continuous distribution, as we are then guaranteed that u(v_i, w_i) ≠ 0 for all i, giving FSIC²(X, Y) > 0.

For FSIC to be a dependence measure, the product kernel is required to be characteristic and analytic. We next show in Proposition 3 that Gaussian kernels k and l yield such a product kernel.

Proposition 3 (A product of Gaussian kernels is characteristic and analytic).

Let k and l be Gaussian kernels on 𝒳 ⊆ R^{d_x} and 𝒴 ⊆ R^{d_y} respectively, i.e., k(x, x') = exp(−(x − x')ᵀ A (x − x')) and l(y, y') = exp(−(y − y')ᵀ B (y − y')), for positive definite matrices A and B. Then, the product kernel g = k · l is characteristic and analytic on 𝒳 × 𝒴.

Proof (sketch).

The main idea is to use the fact that a Gaussian kernel is analytic, and that a product of Gaussian kernels is a Gaussian kernel on the pair of variables. See the full proof in Appendix D. ∎

Plug-in Estimator We now give an empirical estimator of FSIC². Assume that we observe a joint sample Z_n := {(x_i, y_i)}_{i=1}^n drawn i.i.d. from P_xy. Unbiased estimators of μ_xy(v, w) and μ_x(v) μ_y(w) are (1/n) Σ_{i=1}^n k(x_i, v) l(y_i, w) and (1/(n(n−1))) Σ_{i=1}^n Σ_{j≠i} k(x_i, v) l(y_j, w), respectively. A straightforward empirical estimator of FSIC²(X, Y) is then given by

$$\widehat{\mathrm{FSIC}}^2(\mathsf{Z}_n) := \frac{1}{J} \sum_{i=1}^{J} \hat{u}^2(v_i, w_i), \qquad (3)$$
$$\hat{u}(v, w) := \frac{1}{n} \sum_{i=1}^{n} k(x_i, v)\, l(y_i, w) - \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i} k(x_i, v)\, l(y_j, w) = \frac{2}{n(n-1)} \sum_{i < j} h_{(v, w)}\big((x_i, y_i), (x_j, y_j)\big), \qquad (4)$$

where h_{(v, w)}((x, y), (x', y')) := (1/2) (k(x, v) − k(x', v)) (l(y, w) − l(y', w)). For conciseness, we define û := (û(v_1, w_1), ..., û(v_J, w_J))ᵀ ∈ R^J, so that the estimator in (3) can be written as (1/J) ûᵀ û.

The statistic (3) can be efficiently computed in O(J n (d_x + d_y)) time, assuming that the runtime complexity of evaluating k(·, v) is O(d_x) and that of l(·, w) is O(d_y). The unbiasedness of the estimators of μ_xy(v, w) and μ_x(v) μ_y(w) is necessary for û(v, w) to be a U-statistic. This fact, and the rewriting of û(v, w) in terms of h_{(v, w)} in (4), will be exploited when the asymptotic distribution of û is derived (Proposition 4).
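The O(n) evaluation of û(v, w) follows from expanding the double sum in (4) into three single sums. Below is a minimal NumPy sketch of this computation and of the plug-in statistic (3); it assumes Gaussian kernels with user-supplied widths and is meant as an illustration rather than the reference implementation in the linked repository.

```python
import numpy as np

def k_gauss(X, v, sigma):
    """Vector of Gaussian kernel evaluations k(x_i, v) for all rows x_i of X."""
    return np.exp(-np.sum((X - v)**2, axis=1) / (2.0 * sigma**2))

def u_hat(X, Y, v, w, sigma_x, sigma_y):
    """Unbiased O(n) estimate of u(v, w) = mu_xy(v, w) - mu_x(v) mu_y(w); see (4)."""
    kx = k_gauss(X, v, sigma_x)
    ly = k_gauss(Y, w, sigma_y)
    n = kx.shape[0]
    joint = np.mean(kx * ly)
    # sum_{i != j} k(x_i, v) l(y_j, w) computed from O(n) partial sums
    cross = (np.sum(kx) * np.sum(ly) - np.sum(kx * ly)) / (n * (n - 1))
    return joint - cross

def fsic2_hat(X, Y, V, W, sigma_x, sigma_y):
    """Plug-in FSIC^2 estimate (3): average of u_hat^2 over the J test locations."""
    u = np.array([u_hat(X, Y, v, w, sigma_x, sigma_y) for v, w in zip(V, W)])
    return np.mean(u**2), u
```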

Since FSIC² satisfies FSIC²(X, Y) = 0 if and only if X and Y are independent (almost surely over the locations), in principle its empirical estimator can be used as a test statistic for an independence test of a null hypothesis H0: "X and Y are independent" against an alternative H1: "X and Y are dependent". The null distribution (i.e., the distribution of the test statistic assuming that H0 is true) is challenging to obtain, however, and depends on the unknown P_xy. This prompts us to consider a normalized version of FSIC² whose asymptotic null distribution has a convenient form. We first derive the asymptotic distribution of û in Proposition 4, which we use to derive the normalized test statistic in Theorem 2.2. As a shorthand, we write u := (u(v_1, w_1), ..., u(v_J, w_J))ᵀ for the population counterpart of û.

Proposition 4 (Asymptotic distribution of ).

Define k̃_i(x) := k(x, v_i) − μ_x(v_i), l̃_i(y) := l(y, w_i) − μ_y(w_i), and Σ_{ij} := Cov_{(x,y)~P_xy}[k̃_i(x) l̃_i(y), k̃_j(x) l̃_j(y)] for i, j ∈ {1, ..., J}. Then, under both H0 and H1, for any fixed locations {v_i}_{i=1}^J and {w_i}_{i=1}^J,

$$\sqrt{n}\,\big(\hat{u}(v_i, w_i) - u(v_i, w_i)\big) \xrightarrow{d} \mathcal{N}(0, \Sigma_{ii}) \quad \text{as } n \to \infty,$$

where u(v, w) is defined following (2), and û(v, w) is defined in (4). Second, if we collect the coordinates into the vectors û and u, then √n (û − u) converges in distribution to N(0, Σ) as n → ∞, where N(0, Σ) denotes the J-dimensional zero-mean Gaussian distribution with covariance matrix Σ.

Proof.

We first note that, for a fixed (v, w), û(v, w) is a one-sample second-order U-statistic (Serfling, 2009, Section 5.1.3) with U-statistic kernel h_{(v, w)}, where E[h_{(v, w)}((x, y), (x', y'))] = u(v, w). Thus, by Kowalski and Tu (2008, Section 5.1, Theorem 1), it follows directly that √n (û(v, w) − u(v, w)) is asymptotically normal, with variance four times the variance of the conditional expectation of h_{(v, w)} given one argument. Substituting h_{(v, w)} with its definition yields the first claim, where we note that E_{(x', y')}[h_{(v, w)}((x, y), (x', y'))] = (1/2)[(k(x, v) − μ_x(v))(l(y, w) − μ_y(w)) + u(v, w)].

For the second claim, since û is a multivariate one-sample U-statistic, by Lehmann (1999, Theorem 6.1.6) and Kowalski and Tu (2008, Section 5.1, Theorem 1), it follows that √n (û − u) converges in distribution to N(0, Σ) as n → ∞, where Σ is as defined in the statement. ∎

Recall from Proposition 2 that u = 0 (and hence FSIC² = 0) holds almost surely under H0. The asymptotic normality in the second claim of Proposition 4 then implies that n·FSIĈ²(Z_n) = (n/J)‖û‖² converges in distribution to a sum of dependent, weighted χ² random variables. The dependence comes from the fact that the coordinates of û all depend on the same sample Z_n. This null distribution is not analytically tractable, and a large number of simulations is required to compute the rejection threshold for a given significance level α.
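To make the previous point concrete, here is a small simulation sketch of that intractable null distribution. Under H0, √n û converges to N(0, Σ), so n·FSIĈ² = (n/J)‖û‖² behaves asymptotically like (1/J) Σ_i λ_i Z_i², with λ_i the eigenvalues of Σ and Z_i independent standard normals; the sketch assumes a plug-in estimate Sigma_hat of Σ is available.

```python
import numpy as np

def simulate_fsic_null(Sigma_hat, n_sim=10000, seed=0):
    """Draw n_sim samples from the asymptotic null distribution of n * FSIC^2_hat.

    The limit is (1/J) * sum_i lambda_i * Z_i^2, where lambda_i are the
    eigenvalues of Sigma and the Z_i are independent standard normals.
    """
    rng = np.random.default_rng(seed)
    lambdas = np.linalg.eigvalsh(Sigma_hat)        # eigenvalues of the covariance
    J = lambdas.shape[0]
    Z2 = rng.standard_normal((n_sim, J))**2        # independent chi-square(1) draws
    return (Z2 * lambdas[None, :]).sum(axis=1) / J

# rejection threshold at level alpha, e.g.:
# sims = simulate_fsic_null(Sigma_hat); thresh = np.quantile(sims, 1 - alpha)
```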

2.2 Normalized FSIC and Adaptive Test

For the purpose of an independence test, we will consider a normalized variant of FSIĈ², which we call NFSIC and denote by λ̂_n, whose tractable asymptotic null distribution is χ²(J), the chi-squared distribution with J degrees of freedom. We then show that the independence test defined by λ̂_n is consistent. These results are given in Theorem 2.2.

Theorem 2.2 (Independence test using λ̂_n is consistent).

Let Σ̂ be a consistent estimate of Σ based on the joint sample Z_n. The NFSIC statistic is defined as λ̂_n := n ûᵀ (Σ̂ + γ_n I)⁻¹ û, where γ_n > 0 is a regularization parameter. Assume that

  1. Assumption A holds.

  2. Σ is almost surely invertible with respect to the test locations {(v_i, w_i)}_{i=1}^J drawn from an absolutely continuous distribution.

  3. The regularization parameter satisfies γ_n → 0 as n → ∞.

Then, for any kernels k, l and locations {(v_i, w_i)}_{i=1}^J satisfying the assumptions,

  1. Under H0, λ̂_n converges in distribution to χ²(J) as n → ∞.

  2. Under H1, for any fixed threshold r, the probability that λ̂_n > r converges to 1, almost surely over the draw of the test locations. That is, the independence test based on λ̂_n is consistent.

Proof (sketch) .

Under H0, λ̂_n asymptotically follows χ²(J) because √n û is asymptotically normally distributed (see Proposition 4) and Σ̂ is consistent. Claim 2 builds on the result in Proposition 2 stating that u ≠ 0 under H1; it follows using the convergence of Σ̂ to Σ. The full proof can be found in Appendix E. ∎

Theorem 2.2 states that if H1 holds, the statistic can be arbitrarily large as n increases, allowing H0 to be rejected for any fixed threshold. Asymptotically, the test threshold is given by the (1 − α)-quantile of χ²(J) and is independent of n. The assumption on the consistency of Σ̂ is required to obtain the asymptotic chi-squared distribution. The regularization parameter γ_n ensures that (Σ̂ + γ_n I)⁻¹ can be stably computed. In practice, γ_n requires no tuning, and can be set to a very small constant.

The next proposition states that the computational complexity of the estimator λ̂_n is linear in both the input dimension and the sample size, and that it can be expressed in terms of kernel matrices between the sample and the test locations.

Proposition 5 (An empirical estimator of ).

Let K ∈ R^{J×n} with K_{ij} := k(v_i, x_j), and L ∈ R^{J×n} with L_{ij} := l(w_i, y_j). Denote by ∘ the element-wise (Hadamard) matrix product, and by 1_n the n-dimensional vector of ones. Then,

  1. û = (1/(n−1)) [(K ∘ L) 1_n − (1/n) (K 1_n) ∘ (L 1_n)].

  2. A consistent estimator for Σ is Σ̂ := (1/n) Γ Γᵀ, where Γ := M − (1/n)(M 1_n) 1_nᵀ, M := K̃ ∘ L̃, K̃ := K − (1/n)(K 1_n) 1_nᵀ, and L̃ := L − (1/n)(L 1_n) 1_nᵀ.

Assume that the complexity of each kernel evaluation is linear in the input dimension. Then the test statistic λ̂_n can be computed in O(J³ + J² n + (d_x + d_y) J n) time.

Proof (sketch).

Claim 1, for û, is straightforward by expanding the sums in (4). The expression for Σ̂ in claim 2 follows directly from the asymptotic covariance expression in Proposition 4. The consistency of Σ̂ can be obtained by noting that a finite-sample bound on ‖Σ̂ − Σ‖_F decreases as n increases. This is implicitly shown in Appendix F.2.2 and its following sections. ∎

Although the dependency of the estimator λ̂_n on J is cubic, we empirically observe that only a small value of J is required (see Section 3). The number of test locations J relates to the number of regions in 𝒳 × 𝒴 where P_xy and P_x P_y differ (see Figure 1). In particular, J need not increase with n for test consistency.
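Given û and Σ̂ from Proposition 5, the test itself takes a few lines. The sketch below is a simplified illustration (not the reference implementation): it assumes u_hat_vec and Sigma_hat have already been computed, and uses the asymptotic χ²(J) threshold from Theorem 2.2.

```python
import numpy as np
from scipy import stats

def nfsic_statistic(u_hat_vec, Sigma_hat, n, gamma=1e-8):
    """NFSIC statistic: lambda_hat_n = n * u_hat^T (Sigma_hat + gamma I)^{-1} u_hat."""
    J = u_hat_vec.shape[0]
    S = Sigma_hat + gamma * np.eye(J)
    return float(n * u_hat_vec @ np.linalg.solve(S, u_hat_vec))

def nfsic_test(u_hat_vec, Sigma_hat, n, alpha=0.05, gamma=1e-8):
    """Reject H0 when the statistic exceeds the (1 - alpha)-quantile of chi^2(J)."""
    stat = nfsic_statistic(u_hat_vec, Sigma_hat, n, gamma)
    thresh = stats.chi2.ppf(1 - alpha, df=u_hat_vec.shape[0])
    return stat, thresh, stat > thresh
```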

Our final theoretical result gives a lower bound on the test power of λ̂_n, i.e., the probability of correctly rejecting H0. We will use this lower bound as the objective function to determine the test locations {(v_i, w_i)}_{i=1}^J and the kernel parameters. Let ‖·‖_F be the Frobenius norm.

Theorem 5 (A lower bound on the test power). Let λ_n := n uᵀ (Σ + γ_n I)⁻¹ u denote the population counterpart of the test statistic λ̂_n. Let K be a kernel class for k, L be a kernel class for l, and V be a collection in which each element is a set of J locations. Assume that

  1. There exist finite B_k and B_l such that sup_{k ∈ K} sup_{x, v} k(x, v) ≤ B_k and sup_{l ∈ L} sup_{y, w} l(y, w) ≤ B_l.

  2. A technical boundedness condition on K, L, and V holds, guaranteeing that the constants in the bound below are finite (see Appendix F for the precise statement).

Then, for any k ∈ K, l ∈ L, V ∈ V, and α ∈ (0, 1), the test power satisfies P_{H1}(λ̂_n ≥ r) ≥ L(λ_n), where r is the rejection threshold and L(λ_n) is an explicit lower bound (stated in full in Appendix F) depending on λ_n, n, J, the bounds B_k and B_l, the regularization parameter γ_n, and a constant depending only on B_k and B_l. Moreover, for sufficiently large fixed n, L(λ_n) is increasing in λ_n.

We provide the proof in Appendix F. To put Theorem 5 into perspective, let σ_x and σ_y be the parameters of the kernels k and l, respectively. We denote by θ := {σ_x, σ_y, {(v_i, w_i)}_{i=1}^J} the collection of all tuning parameters of the test. Assume that K and L are Gaussian kernel classes parameterized by σ_x and σ_y. Then, in Theorem 5, B_k = B_l = 1. The second assumption is a technical condition to guarantee that the test power lower bound is finite for every θ defined by the feasible sets K, L, and V; if the kernel parameters and the test locations are restricted to compact sets, the condition holds. In practice, these conditions do not necessarily create restrictions, as they almost always hold implicitly. We show in Appendix C that the objective function used to choose θ will discourage any two locations from being in the same neighborhood.

Parameter Tuning The test power lower bound in Theorem 5 is a function of λ_n, which is the population counterpart of the test statistic λ̂_n. As with FSIC, it can be shown that λ_n = 0 if and only if X and Y are independent (from Proposition 2). If X and Y are dependent, then λ_n > 0. According to Theorem 5, for a sufficiently large n, the test power lower bound is increasing in λ_n. One can therefore think of λ_n (a function of θ) as representing how easily the test rejects H0 given a problem P_xy. The higher λ_n, the greater the lower bound on the test power, and thus the more likely it is that the test will reject H0 when it is false.

In light of this reasoning, we propose setting θ to θ* := arg max_θ λ̂_n. That this procedure is also valid under H0 can be seen as follows. Under H0, λ_n = 0 for all θ, so the optimized θ* will be arbitrary. Since the asymptotic null distribution χ²(J) in Theorem 2.2 holds for any θ independent of the test sample, it does not change by using θ*. In practice, λ_n is a population quantity which is unknown. We propose dividing the sample into two disjoint sets: a training set and a test set. The training set is used to optimize for θ*, and the test set is used for the actual independence test with the optimized θ*. The splitting guarantees the independence of θ* and the test sample, which is an assumption of Theorem 2.2.
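A simplified sketch of this protocol is given below: the data are split in half, and the kernel widths are chosen by a grid search maximizing the NFSIC objective on the training half (the actual experiments use gradient ascent over all parameters, including the locations; nfsic_objective is an assumed helper that computes λ̂_n for given widths and locations).

```python
import numpy as np

def split_train_test(X, Y, train_frac=0.5, seed=0):
    """Disjoint split: parameters are tuned on (Xtr, Ytr), the test is run on (Xte, Yte)."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(X.shape[0])
    ntr = int(train_frac * X.shape[0])
    tr, te = perm[:ntr], perm[ntr:]
    return X[tr], Y[tr], X[te], Y[te]

def tune_widths(Xtr, Ytr, V, W, width_grid, nfsic_objective):
    """Grid search over Gaussian widths maximizing lambda_hat_n on the training split."""
    best_val, best_widths = -np.inf, None
    for sx in width_grid:
        for sy in width_grid:
            val = nfsic_objective(Xtr, Ytr, V, W, sx, sy)
            if val > best_val:
                best_val, best_widths = val, (sx, sy)
    return best_widths
```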

To better understand the objective, we visualize its components as a function of a single test location on a simple toy problem in which X and Y are univariate and dependent. As we consider only one location (v, w), i.e., J = 1, û(v, w) is a scalar, and the statistic can be written as λ̂_n = n û²(v, w)/(Σ̂ + γ_n), where Σ̂ is the (scalar) variance estimate. These components are shown in Figure 1, where we use Gaussian kernels for both k and l, and the horizontal and vertical axes correspond to v and w, respectively.

Figure 1: Illustration of the quantities entering λ̂_n as functions of a single test location (v, w). Panel (c) shows the variance, and panel (d) the statistic λ̂_n (the parameter tuning objective).

Intuitively, û(v, w) captures the difference between the joint distribution and the product of the marginals, as a function of the test location (v, w). Squaring it and dividing by the variance shown in Figure 1(c) gives the statistic (also the parameter tuning objective) shown in Figure 1(d). The latter figure suggests that the parameter tuning objective function can be non-convex. However, we note that the non-convexity arises because there are multiple ways to detect the difference between the joint distribution and the product of the marginals. In this case, the lower-left and upper-right regions equally indicate the largest difference.

3 Experiments

Figure 2: (a) Runtime in the SG problem. (b)-(d) Probability of rejecting H0 as the problem parameters vary in the SG, Sin, and GSign problems, with the sample size n fixed.

In this section, we empirically study the performance of the proposed method on both toy (Section 3.1) and real-life problems (Section 3.2). Our interest is in the performance of linear-time tests on challenging problems which require a large sample size to be able to accurately reveal the dependence. All the code is available at https://github.com/wittawatj/fsic-test.

We compare the proposed NFSIC with optimization (NFSIC-opt) to five multivariate nonparametric tests. The test without optimization (NFSIC-med) acts as a baseline, allowing the effect of parameter optimization to be clearly seen. For pedagogical reasons, we consider the original HSIC test of Gretton et al. (2005), denoted by QHSIC, which is a quadratic-time test. Nyström HSIC (NyHSIC) uses a Nyström approximation to the kernel matrices of X and Y when computing the HSIC statistic. FHSIC is another variant of HSIC in which a random Fourier feature approximation (Rahimi and Recht, 2008) to the kernel is used. NyHSIC and FHSIC are studied in Zhang et al. (2016) and can be computed in time linear in n, with a quadratic dependency on the number of inducing points in NyHSIC, and on the number of random features in FHSIC. Finally, the Randomized Dependence Coefficient (RDC) proposed in Lopez-Paz et al. (2013) is also considered. The RDC can be seen as the primal form (with random Fourier features) of the kernel canonical correlation analysis of Bach and Jordan (2002) on copula-transformed data. We consider RDC a linear-time test even though preprocessing by an empirical copula transform costs O(n log n).

We use Gaussian kernels for both k and l in all the methods. Except for NFSIC-opt, all tests use the full sample to conduct the independence test, with the Gaussian widths σ_x and σ_y set according to the widely used median heuristic, i.e., σ_x is set to the median of the pairwise distances among {x_i}_{i=1}^n, and σ_y is set in the same way using {y_i}_{i=1}^n. The locations for NFSIC-med are randomly drawn from the standard multivariate normal distribution in each trial. For a sample of size n, NFSIC-opt uses half the sample for parameter tuning, and the other disjoint half for the test. We permute the sample 300 times in RDC and QHSIC to simulate from the null distribution and compute the test threshold. The null distributions for FHSIC and NyHSIC are given by a finite sum of weighted χ² random variables, given in Eq. 8 of Zhang et al. (2016). Unless stated otherwise, we set the test threshold of the two NFSIC tests to be the (1 − α)-quantile of the asymptotic null distribution χ²(J). To provide a fair comparison, we set J = 10, use 10 inducing points in NyHSIC, and 10 random Fourier features in FHSIC and RDC.
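A minimal sketch of one common form of the median heuristic used above follows; the optional subsampling cap is an implementation convenience, not part of the description in the text.

```python
import numpy as np

def median_heuristic(X, max_points=1000, seed=0):
    """Gaussian width set to the median of pairwise Euclidean distances of (a subsample of) X."""
    rng = np.random.default_rng(seed)
    sub = X[rng.permutation(X.shape[0])[:max_points]]
    sq = np.sum(sub**2, 1)[:, None] + np.sum(sub**2, 1)[None, :] - 2.0 * sub @ sub.T
    dists = np.sqrt(np.maximum(sq, 0.0))
    return float(np.median(dists[np.triu_indices_from(dists, k=1)]))
```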

Optimization of NFSIC-opt The parameters of NFSIC-opt are the Gaussian widths σ_x, σ_y and the J locations {(v_i, w_i)}_{i=1}^J. We treat all the parameters as one long vector and use gradient ascent to maximize λ̂_n on the training set. We observe that initializing the locations by randomly picking J points from the training sample yields good performance. The regularization parameter γ_n in NFSIC is fixed to a small value, and is not optimized. It is worth emphasizing that the complexity of the optimization procedure is still linear in n.

Since FHSIC, NyHSIC and RDC rely on a finite-dimensional kernel approximation, these tests are consistent only if both n and the number of features go to infinity. By contrast, the proposed NFSIC requires only n to go to infinity to achieve consistency, i.e., J can be fixed. We refer the reader to Appendix C for a brief investigation of the test power vs. increasing J. The test power does not necessarily increase monotonically with J.

3.1 Toy Problems

Figure 3: (a) Runtime in the SG problem. (b)-(d) Probability of rejecting H0 as the sample size n increases in the SG, Sin, and GSign problems (with the problem parameters fixed).

We consider three toy problems: Same Gaussian (SG), Sinusoid (Sin), and Gaussian Sign (GSign).

1. Same Gaussian (SG). The two variables are independently drawn from standard multivariate normal distributions, i.e., X ~ N(0, I_{d_x}) and Y ~ N(0, I_{d_y}), where I_d is the d × d identity matrix. This problem represents a case in which H0 holds.

2. Sinusoid (Sin). Let p_xy be the probability density of P_xy. In the Sinusoid problem, the dependency of X and Y is characterized by p_xy(x, y) ∝ 1 + sin(ωx) sin(ωy), where the domains of X and Y are (−π, π), and ω is the frequency of the sinusoid. As the frequency ω increases, the drawn sample becomes more similar to a sample drawn from the uniform distribution on (−π, π)². That is, the higher ω, the harder it is to detect the dependency between X and Y. This problem was studied in Sejdinovic et al. (2013). Plots of the density for a few values of ω are shown in Figures 6 and 7 in the appendix. The main characteristic of interest in this problem is the local change in the density function.

3. Gaussian Sign (GSign). In this problem, Y = |Z| ∏_{i=1}^{d_x} sgn(X_i), where X ~ N(0, I_{d_x}), sgn(·) is the sign function, and Z ~ N(0, 1) serves as a source of noise. The full interaction among the coordinates of X is what makes the problem challenging. That is, Y is dependent on X, yet it is independent of any proper subset of {X_1, ..., X_{d_x}}. Thus, simultaneous consideration of all the coordinates of X is required to successfully detect the dependency. (A sampling sketch for the Sin and GSign problems is given after this list.)
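The following minimal sampling sketch, assuming the formulations above, generates data for the Sin problem (by rejection sampling from the stated density) and the GSign problem; the parameter defaults are illustrative.

```python
import numpy as np

def sample_sin(n, omega=1.0, seed=0):
    """Rejection sampler for Sin: p(x, y) proportional to 1 + sin(omega x) sin(omega y) on (-pi, pi)^2."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    while len(xs) < n:
        x, y = rng.uniform(-np.pi, np.pi, size=2)
        # accept with probability (1 + sin(omega x) sin(omega y)) / 2
        if rng.uniform(0.0, 2.0) <= 1.0 + np.sin(omega * x) * np.sin(omega * y):
            xs.append(x)
            ys.append(y)
    return np.array(xs)[:, None], np.array(ys)[:, None]

def sample_gsign(n, dx=4, seed=0):
    """GSign: Y = |Z| * prod_i sign(X_i), with X ~ N(0, I_dx) and Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, dx))
    Z = rng.standard_normal(n)
    Y = np.abs(Z) * np.prod(np.sign(X), axis=1)
    return X, Y[:, None]
```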

We fix the sample size n and vary the problem parameters. Each problem is repeated for 300 trials, and the sample is redrawn each time. The significance level is set to 0.05. The results are shown in Figure 2. It can be seen that in the SG problem (Figure 2b), where H0 holds, all the tests achieve roughly correct type-I errors at the 0.05 level. In the Sin problem, NFSIC-opt achieves the highest test power for all considered frequencies ω, highlighting its strength in detecting local changes in the joint density. The performance of NFSIC-med is significantly lower than that of NFSIC-opt. This phenomenon clearly emphasizes the importance of the optimization in placing the locations at the relevant regions in 𝒳 × 𝒴. RDC has remarkably high performance in both Sin and GSign (Figure 2c, d) despite no parameter tuning. Interestingly, both NFSIC-opt and RDC outperform the quadratic-time QHSIC in these two problems. The ability of NFSIC-opt to simultaneously consider interacting features is indicated by its superior test power in GSign, especially in the more challenging higher-dimensional settings. An average trial runtime for each test in the SG problem is shown in Figure 2a. We observe that the runtime does not increase with dimension, as the complexity of all the tests is linear in the dimension of the input. All the tests are implemented in Python using a common SciPy stack.

To investigate the sample efficiency of all the tests, we fix the problem parameters of SG, Sin, and GSign, and increase n. Figure 3 shows the results. The quadratic dependency on n in QHSIC makes it infeasible, in terms of both memory and runtime, to consider n larger than 6000 (Figure 3a). In contrast, although not the most time-efficient, NFSIC-opt has the highest sample efficiency for GSign, and for Sin in the low-sample regime, significantly outperforming QHSIC. Despite the small additional overhead from the optimization, we are still able to conduct an accurate test at large sample sizes in a matter of seconds. We observe in Figure 3b that the two NFSIC variants have correct type-I errors across all sample sizes, indicating that the asymptotic null distribution holds approximately by the time n reaches 1000. We recall from Theorem 2.2 that the NFSIC test with random test locations will asymptotically reject H0 if it is false. A demonstration of this property is given in Figure 3c, where the test power of NFSIC-med eventually reaches 1 as n increases.

3.2 Real Problems

We now examine the performance of our proposed test on real problems.

Million Song Data (MSD) We consider a subset of the Million Song Data (Bertin-Mahieux et al., 2011), in which each of the 515,345 songs is represented by 90 features, of which 12 features are the timbre average (over all segments) of the song, and 78 features are timbre covariances. Most of the songs are western commercial tracks from 1922 to 2011. The goal is to detect the dependency between each song (X) and its year of release (Y). We repeat each setting for 300 trials, where the full sample is randomly subsampled to n points in each trial. Other settings are the same as in the toy problems. To make sure that the type-I error is correct, we use the permutation approach in the NFSIC tests to compute the threshold. Figure 4(a) shows the test powers as n increases from 500 to 2000. To simulate the case where H0 holds in the problem, we permute the sample to break the dependency between X and Y. The results are shown in the appendix.
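A generic sketch of the permutation procedure mentioned above is shown below; here statistic stands for any test statistic (e.g., the NFSIC statistic computed with already-chosen parameters), and the number of permutations and the seed are illustrative.

```python
import numpy as np

def permutation_threshold(X, Y, statistic, n_perms=300, alpha=0.05, seed=0):
    """Estimate the rejection threshold by permuting Y to break any dependence with X."""
    rng = np.random.default_rng(seed)
    null_stats = []
    for _ in range(n_perms):
        Yp = Y[rng.permutation(Y.shape[0])]
        null_stats.append(statistic(X, Yp))
    return float(np.quantile(null_stats, 1.0 - alpha))

# reject H0 if statistic(X, Y) > permutation_threshold(X, Y, statistic)
```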

Figure 4: Probability of rejecting H0 as n increases in the two real problems: (a) the MSD problem; (b) the Videos & Captions problem.

Evidently, NFSIC-opt has the highest test power among all the linear-time tests for all the sample sizes considered. Its test power is second only to that of QHSIC. We recall that NFSIC-opt uses half of the sample for parameter tuning. Thus, at n = 500, the actual sample used for testing has size 250, which is relatively small. The fact that there is a vast power gain from 0.4 (NFSIC-med) to 0.8 (NFSIC-opt) at this sample size suggests that the optimization procedure can perform well even at low sample sizes.

Videos and Captions Our last problem is based on the VideoStory46K dataset (Habibian et al., 2014). The dataset contains 45,826 YouTube videos of an average length of roughly one minute, and their corresponding text captions uploaded by the users. Each video is represented as a high-dimensional Fisher vector encoding of motion boundary histogram (MBH) descriptors of Wang and Schmid (2013). Each caption is represented as a bag of words, with each feature being the frequency of one word. After keeping only words which occur in at least six video captions, we obtain the final vocabulary of caption features. We examine the test powers as n increases. The results are given in Figure 4(b). The problem is sufficiently challenging that all the linear-time tests achieve low power at the smallest sample sizes considered. QHSIC performs exceptionally well on this problem, achieving maximum power throughout. NFSIC-opt has the highest sample efficiency among the linear-time tests, showing that the optimization procedure is also practical in a high-dimensional setting.

Acknowledgement

We thank the Gatsby Charitable Foundation for the financial support. The major part of this work was carried out while Zoltán Szabó was a research associate at the Gatsby Computational Neuroscience Unit, University College London.