
# Exponentially Consistent Kernel Two-Sample Tests

Shengyu Zhu
Huawei Noah's Ark Lab
Hong Kong, China
szhu05@syr.edu

Biao Chen
Syracuse University
Syracuse, NY, USA
bichen@syr.edu

Zhitang Chen
Huawei Noah's Ark Lab
Hong Kong, China
chenzhitang2@huawei.com
###### Abstract

Given two sets of independent samples from unknown distributions $P$ and $Q$, a two-sample test decides whether to reject the null hypothesis that $P=Q$. Recent attention has focused on kernel two-sample tests, as the test statistics are easy to compute, converge fast, and have low bias in their finite sample estimates. However, an exact characterization of the asymptotic performance of such tests is still lacking, in particular of the rate at which the type-II error probability decays to zero in the large sample limit. In this work, we establish that a class of kernel two-sample tests is exponentially consistent on a Polish, locally compact Hausdorff sample space, e.g., $\mathbb{R}^d$. The obtained exponential decay rate is further shown to be optimal among all two-sample tests satisfying the level constraint, and is independent of the particular kernel provided that it is bounded continuous and characteristic. Our results yield new insights into related issues such as the fair alternative for testing and kernel selection strategies. Finally, as an application, we show that a kernel based test achieves the optimal detection for off-line change detection in the nonparametric setting.

Preprint. Work in progress.

## 1 Introduction

Given two sets of i.i.d. samples, the two-sample problem decides whether or not to accept the null hypothesis that the generating distributions are the same, without imposing any parametric assumptions. This is important to a variety of applications, including data integration in bioinformatics Borgwardt et al. (2006), statistical model criticism Kim et al. (2016); Lloyd and Ghahramani (2015), and training deep generative models Dziugaite et al. (2015); Li et al. (2017, 2015b); Sutherland et al. (2017). Typical two-sample tests are constructed based on some distance measures between distributions, such as classical Kolmogorov-Smirnov distance Friedman and Rafsky (1979), Kullback-Leibler divergence (KLD) Bu et al. (2016); Nguyen et al. (2010), and maximum mean discrepancy (MMD), a reproducing kernel Hilbert space norm of the difference between kernel mean embeddings of distributions Gretton et al. (2012a, b); Muandet et al. (2017); Simon-Gabriel and Schölkopf (2016); Zaremba et al. (2013). Notably, kernel based test statistics possess several key advantages such as computational efficiency and fast convergence, thereby attracting much attention recently.

A hypothesis test is usually evaluated by characterizing its type-II error probability subject to a level constraint on the type-I error probability. In this respect, existing kernel two-sample tests have been shown to be consistent, in the sense that the type-II error probability decreases to zero as the sample sizes scale to infinity. While consistency is a desirable property, quantifying how fast the error probability decays is even more so, as it provides a natural metric for comparing test performance. However, an exact characterization of the decay rate is still elusive, even for some well-known kernel two-sample tests. For example, with $n$ samples in both sets, the decay rate of the biased quadratic-time test in Gretton et al. (2012a) has only been lower bounded, based on a large deviation bound on the test statistic. That bound has been observed to be loose in general, indicating that the resulting decay rate is loose too. Other works such as Gretton et al. (2012b); Sutherland et al. (2017); Zaremba et al. (2013) have established the limiting distributions of the test statistics, but they also do not give a tight decay rate. Clearly, no statistical optimality can be claimed if the characterization itself is loose.

More recently, in the context of goodness-of-fit testing, Zhu et al. (2018) showed that quadratic-time kernel two-sample tests have a type-II error probability vanishing exponentially fast at a rate determined by the KLD between the two generating distributions. A strong condition for this result is that the sample sizes need to scale in different orders. Their approach, however, is not readily applicable when the sample sizes increase in the same order, e.g., when the two sets have an equal number of samples. This is because existing Sanov's theorems only hold for a sample sequence originating from one given distribution, whereas the acceptance region defined by the kernel two-sample test involves two sample sequences from different distributions. As such, the key appears to be an extended version of Sanov's theorem that handles two distributions; this is not straightforward, as existing tools, e.g., the Cramér theorem Dembo and Zeitouni (2009) used to prove Sanov's theorem, can only deal with a single distribution.

The first goal of this paper is to seek an exact statistical characterization of a widely used kernel two-sample test. We establish an extended version of Sanov's theorem w.r.t. the topology induced by a pairwise weak convergence of probability measures. Our proof is inspired by Csiszár (2006), which proved the original Sanov's theorem for one sample sequence in the $\tau$-topology. Building on the idea of Zhu et al. (2018), we then show that the biased quadratic-time kernel two-sample test in Gretton et al. (2012a) is exponentially consistent when the sample sizes scale in the same order. The obtained exponential decay rate depends only on the generating distributions and the sample sizes under the alternative hypothesis, and is further shown to be optimal among all two-sample tests satisfying the level constraint. A notable implication is that kernels affect only the sub-exponential term in the type-II error probability, provided that they are bounded continuous and characteristic. We also note that the extended Sanov's theorem may be of independent interest for other large deviation applications.

Our second goal is to derive an optimality criterion for nonparametric two-sample tests, as well as a way of finding more tests achieving this optimality. Towards this goal, we characterize the maximum exponential decay rate of any two-sample test under a given level constraint. Furthermore, we derive a sufficient condition for the type-II error probability to decay at least exponentially fast with the maximum exponential rate (possibly violating the level constraint). These results provide new insights into related issues such as the fair alternative for testing and kernel selection strategies, which are elaborated in Sections 3.4 and 5. As an application, we apply our results to the off-line change detection problem and show that a kernel based test achieves the optimal detection in terms of the exponential decay rate of the type-II error probability. To the best of our knowledge, this is the first time a test has been shown to be optimal for detecting the presence of a change in the nonparametric setting.

In Section 2, we briefly review the MMD and the two-sample testing. In Section 3, we present our main results on the exact and optimal exponential decay rate for a class of kernel two-sample tests, followed by discussions on related issues. We apply our results to off-line change detection in Section 4 and conduct synthetic experiments in Section 5. Section 6 concludes the paper.

## 2 Maximum mean discrepancy, two-sample testing, and test threshold

We briefly review the MMD and its weak metrizable property. We then describe the two-sample problem as statistical hypothesis testing and choose a suitable threshold for the level constraint.

#### Maximum mean discrepancy

Let $\mathcal{F}$ be a reproducing kernel Hilbert space (RKHS) defined on a topological space $\mathcal{X}$ with reproducing kernel $k$. Let $x$ be an $\mathcal{X}$-valued random variable with probability measure $P$, and $\mathbb{E}_x f(x)$ the expectation of $f(x)$ for a function $f\in\mathcal{F}$. Assume that $k$ is bounded continuous. Then for every Borel probability measure $P$ defined on $\mathcal{X}$, there exists a unique element $\mu_k(P)\in\mathcal{F}$ such that $\mathbb{E}_x f(x)=\langle f,\mu_k(P)\rangle_{\mathcal{F}}$ for all $f\in\mathcal{F}$ Berlinet and Thomas-Agnan (2011). The MMD between two Borel probability measures $P$ and $Q$ is the RKHS distance between $\mu_k(P)$ and $\mu_k(Q)$, which can be expressed as

$$d_k(P,Q)=\|\mu_k(P)-\mu_k(Q)\|_{\mathcal{F}}=\left(\mathbb{E}_{x,x'}k(x,x')+\mathbb{E}_{y,y'}k(y,y')-2\,\mathbb{E}_{x,y}k(x,y)\right)^{1/2},$$

where $x,x'\sim P$ i.i.d. and $y,y'\sim Q$ i.i.d. Gretton et al. (2012a). If the kernel $k$ is characteristic, then $d_k(P,Q)=0$ if and only if $P=Q$ Sriperumbudur et al. (2010). This property enables the MMD to distinguish different distributions.

We next present a weak metrizable property of $d_k$, which will be used to establish our main results in Section 3. Let $\mathcal{P}(\mathcal{X})$ denote the set of all Borel probability measures defined on $\mathcal{X}$. For a sequence of probability measures $P_n\in\mathcal{P}(\mathcal{X})$, we say $P_n\to P$ weakly if and only if $\mathbb{E}_{P_n}f\to\mathbb{E}_P f$ for every bounded continuous function $f$.

###### Theorem 1 (Simon-Gabriel and Schölkopf (2016); Sriperumbudur (2016)).

The MMD $d_k$ metrizes the weak convergence on $\mathcal{P}(\mathcal{X})$ if the following two conditions hold: (A1) the sample space $\mathcal{X}$ is Polish, locally compact and Hausdorff; (A2) the kernel $k$ is bounded continuous and characteristic.

We note that the weak metrizable property is also favored for training deep generative models Arjovsky et al. (2017); Li et al. (2017). An example of a Polish, locally compact Hausdorff space is $\mathbb{R}^d$, and the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$ satisfies the conditions of A2.

#### Two-sample testing based on the MMD

Let $x^n=\{x_1,\ldots,x_n\}$ and $y^m=\{y_1,\ldots,y_m\}$ be independent samples, with $x_i\sim P$ i.i.d. and $y_j\sim Q$ i.i.d., where $P$ and $Q$ are unknown. The two-sample testing is to decide between $H_0{:}\,P=Q$ and $H_1{:}\,P\neq Q$. Let $\hat P_n$ and $\hat Q_m$ be the respective empirical measures of $x^n$ and $y^m$, that is, $\hat P_n=\frac{1}{n}\sum_{i=1}^n\delta_{x_i}$ and $\hat Q_m=\frac{1}{m}\sum_{j=1}^m\delta_{y_j}$ with $\delta_x$ being the Dirac measure at $x$. Then the squared MMD can be estimated by

$$d_k^2(\hat P_n,\hat Q_m)=\frac{1}{n^2}\sum_{i=1}^n\sum_{j=1}^n k(x_i,x_j)+\frac{1}{m^2}\sum_{i=1}^m\sum_{j=1}^m k(y_i,y_j)-\frac{2}{nm}\sum_{i=1}^n\sum_{j=1}^m k(x_i,y_j),$$

which is a biased statistic originally proposed in Gretton et al. (2012a). A hypothesis test for the two-sample testing can then be constructed by comparing this statistic with a threshold $\gamma_{n,m}$: if $d_k(\hat P_n,\hat Q_m)\le\gamma_{n,m}$, then the test accepts the null hypothesis $H_0$. The acceptance region is hence defined as $\mathcal{A}(n,m)=\{(x^n,y^m):d_k(\hat P_n,\hat Q_m)\le\gamma_{n,m}\}$. There are two types of errors: a type-I error is made if $H_0$ is rejected despite being true, and a type-II error occurs when $H_0$ is accepted under $H_1$. The type-I and type-II error probabilities are given by

$$\alpha_{n,m}=P_{x^ny^m}\big((x^n,y^m)\notin\mathcal{A}(n,m)\big)\ \text{under}\ H_0{:}\,P=Q,\qquad \beta_{n,m}=P_{x^ny^m}\big((x^n,y^m)\in\mathcal{A}(n,m)\big)\ \text{under}\ H_1{:}\,P\neq Q,$$

respectively. Bear in mind that $\alpha_{n,m}$ and $\beta_{n,m}$ are computed w.r.t. the true yet unknown distributions.
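To make the estimator concrete, the following is a minimal sketch of the biased quadratic-time statistic with a Gaussian kernel; the kernel choice, distributions, and all parameter values are illustrative, not taken from the paper:

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    # k(x, y) = exp(-||x - y||^2 / (2 sigma^2)); bounded continuous, K = 1
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_biased(x, y, sigma=1.0):
    # biased quadratic-time estimate of the squared MMD between the
    # empirical measures of x (n samples) and y (m samples)
    n, m = len(x), len(y)
    return (gaussian_gram(x, x, sigma).sum() / n**2
            + gaussian_gram(y, y, sigma).sum() / m**2
            - 2.0 * gaussian_gram(x, y, sigma).sum() / (n * m))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(200, 2))    # samples from P
y = rng.normal(1.0, 1.0, size=(200, 2))    # samples from Q (mean-shifted)
y0 = rng.normal(0.0, 1.0, size=(200, 2))   # fresh samples from P
stat_h1 = mmd2_biased(x, y)    # bounded away from zero when P != Q
stat_h0 = mmd2_biased(x, y0)   # small under P = Q (O(1/n) bias of the biased statistic)
```

Note that the biased statistic is always nonnegative, being a squared RKHS norm of the difference of empirical embeddings.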

With a carefully chosen threshold, the above kernel test has been shown to be consistent in Gretton et al. (2012a). That is, $\beta_{n,m}\to0$ as $n,m\to\infty$, while $\alpha_{n,m}\le\alpha$ with $\alpha\in(0,1)$ being set in advance. In this paper, we study the exponential decay rate of $\beta_{n,m}$ in the large sample limit, subject to the same level constraint. Specifically, we aim to characterize

$$\liminf_{n,m\to\infty}-\frac{1}{n+m}\log\beta_{n,m},\quad\text{subject to}\quad\alpha_{n,m}\le\alpha.$$

The above limit is also called the type-II error exponent in information theory Cover and Thomas (2006). If the limit is positive, then the test is said to be exponentially consistent.

#### A suitable threshold

We directly use a result from Gretton et al. (2012a) in order to pick a proper threshold for the level constraint $\alpha_{n,m}\le\alpha$. Such tests are referred to as level $\alpha$ tests in statistics Casella and Berger (2002).

###### Lemma 1 ((Gretton et al., 2012a, Theorem 7)).

Let $x^n$, $y^m$, $\hat P_n$, $\hat Q_m$, $d_k$, $\mathcal{A}(n,m)$ be defined as in Section 2, and assume A2 with $K$ being a positive upper bound on the kernel $k$. Then under the null hypothesis $H_0{:}\,P=Q$,

$$P_{x^ny^m}\Big(d_k(\hat P_n,\hat Q_m)>2(K/m)^{1/2}+2(K/n)^{1/2}+\epsilon\Big)\le2\exp\left(-\frac{\epsilon^2mn}{2K(m+n)}\right).$$

Therefore, for a given $\alpha\in(0,1)$, choosing

$$\gamma_{n,m}=\left((K/m)^{1/2}+(K/n)^{1/2}\right)\left(2+\sqrt{2\log(2\alpha^{-1})}\right),\quad (1)$$

the kernel test has its type-I error probability $\alpha_{n,m}\le\alpha$, and hence is a level $\alpha$ test.
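As a sketch, the distribution-free threshold of Eq. (1) can be computed directly; for the Gaussian kernel one may take $K=1$. The helper functions and all parameter values below are illustrative:

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_biased(x, y, sigma=1.0):
    # biased MMD estimate d_k(hat P_n, hat Q_m)
    n, m = len(x), len(y)
    v = (gaussian_gram(x, x, sigma).sum() / n**2
         + gaussian_gram(y, y, sigma).sum() / m**2
         - 2.0 * gaussian_gram(x, y, sigma).sum() / (n * m))
    return np.sqrt(max(v, 0.0))

def ldb_threshold(n, m, alpha, K=1.0):
    # gamma_{n,m} from Eq. (1); K is an upper bound on the kernel
    return (np.sqrt(K / m) + np.sqrt(K / n)) * (2.0 + np.sqrt(2.0 * np.log(2.0 / alpha)))

def accept_h0(x, y, alpha=0.05, sigma=1.0):
    # level-alpha test: accept H0 iff d_k(hat P_n, hat Q_m) <= gamma_{n,m}
    return mmd_biased(x, y, sigma) <= ldb_threshold(len(x), len(y), alpha)

rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, size=(500, 1))
y_same = rng.normal(0.0, 1.0, size=(500, 1))   # same distribution as x
y_far = rng.normal(5.0, 1.0, size=(500, 1))    # well-separated alternative
```

The threshold shrinks as $O(n^{-1/2}+m^{-1/2})$, which is what drives the exponential consistency in Section 3.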

## 3 Main results

In this section, we present our main results on the type-II error exponent of a class of kernel two-sample tests. The first and the most important step is to establish an extended Sanov’s theorem that works with two sample sequences.

### 3.1 Extended Sanov’s theorem

We define a pairwise weak convergence: we say $(P_n,Q_n)\to(P,Q)$ weakly if and only if both $P_n\to P$ and $Q_n\to Q$ weakly. We consider $\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{X})$ endowed with the topology induced by this pairwise weak convergence. It can be verified that this topology is equivalent to the product topology on $\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{X})$, where each $\mathcal{P}(\mathcal{X})$ is endowed with the topology of weak convergence. An extended version of Sanov's theorem is given below.

###### Theorem 2 (Extended Sanov’s Theorem).

Let $\mathcal{X}$ be a Polish space, $x^n=\{x_1,\ldots,x_n\}$ i.i.d. $\sim P$, and $y^m=\{y_1,\ldots,y_m\}$ i.i.d. $\sim Q$. Assume $n/(n+m)\to c\in(0,1)$. Then for a set $\Gamma\subseteq\mathcal{P}(\mathcal{X})\times\mathcal{P}(\mathcal{X})$, it holds that

$$\limsup_{n,m\to\infty}-\frac{1}{n+m}\log P_{x^ny^m}\big((\hat P_n,\hat Q_m)\in\Gamma\big)\le\inf_{(R,S)\in\operatorname{int}\Gamma}cD(R\|P)+(1-c)D(S\|Q),$$
$$\liminf_{n,m\to\infty}-\frac{1}{n+m}\log P_{x^ny^m}\big((\hat P_n,\hat Q_m)\in\Gamma\big)\ge\inf_{(R,S)\in\operatorname{cl}\Gamma}cD(R\|P)+(1-c)D(S\|Q),$$

where $\operatorname{int}\Gamma$ and $\operatorname{cl}\Gamma$ denote the interior and closure of $\Gamma$ w.r.t. the pairwise weak convergence, respectively.

We prove the above result in a finite sample space and then extend it to a general Polish space, with two simple combinatorial lemmas as prerequisites. See details in Appendix A.
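For intuition, the theorem can be checked numerically in the simplest finite sample space: for Bernoulli $P$ and $Q$, the distribution of $(\hat P_n,\hat Q_m)$ is a product of binomials, so $P_{x^ny^m}((\hat P_n,\hat Q_m)\in\Gamma)$ can be computed exactly and compared with the predicted exponent. The set $\Gamma$ and all parameter values below are illustrative choices, not from the paper:

```python
import numpy as np
from scipy.stats import binom

p, q = 0.2, 0.8   # Bernoulli parameters of P and Q
n = m = 400       # so c = n / (n + m) = 1/2
eps = 0.1         # Gamma = {(R, S): |r - s| <= eps}, with r, s the masses on symbol 1

# exact probability that the empirical pair lands in Gamma
k = np.arange(n + 1)
l = np.arange(m + 1)
pk = binom.pmf(k, n, p)
ql = binom.pmf(l, m, q)
inside = (np.abs(k[:, None] / n - l[None, :] / m) <= eps).astype(float)
prob = pk @ inside @ ql
rate_emp = -np.log(prob) / (n + m)

# predicted exponent: inf over Gamma of c D(r||p) + (1 - c) D(s||q)
def kl_bern(r, a):
    r = np.clip(r, 1e-12, 1 - 1e-12)
    return r * np.log(r / a) + (1 - r) * np.log((1 - r) / (1 - a))

grid = np.linspace(0.0, 1.0, 1001)
rr, ss = np.meshgrid(grid, grid, indexing="ij")
obj = 0.5 * kl_bern(rr, p) + 0.5 * kl_bern(ss, q)
rate_thy = obj[np.abs(rr - ss) <= eps].min()
```

For this $\Gamma$ the interior and closure infima coincide by continuity, and the finite-$n$ rate matches the predicted exponent up to a sub-exponential correction.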

### 3.2 Exact exponent of type-II error probability

With the extended Sanov’s theorem and a vanishing threshold given in Eq. (1), we are ready to establish the exponential decay of the type-II error probability. Our result follows.

###### Theorem 3.

Assume A1, A2, and $n/(n+m)\to c\in(0,1)$. Under the alternative hypothesis $H_1{:}\,P\neq Q$, also assume that

$$0<D^*<\infty,\quad\text{where}\quad D^*:=\inf_{R\in\mathcal{P}(\mathcal{X})}\,cD(R\|P)+(1-c)D(R\|Q).$$

Given $\alpha\in(0,1)$, the kernel test with $\gamma_{n,m}$ in Eq. (1) is an exponentially consistent level $\alpha$ test:

$$\alpha_{n,m}\le\alpha,\quad\text{and}\quad\liminf_{n,m\to\infty}-\frac{1}{n+m}\log\beta_{n,m}=D^*.$$
###### Proof.

We use the fact that testing if $P=Q$ is equivalent to testing if $d_k(P,Q)=0$. Since the threshold $\gamma_{n,m}\to0$ as $n,m\to\infty$, it is eventually smaller than any fixed $\epsilon>0$, and hence $\mathcal{A}(n,m)\subseteq\{(x^n,y^m):d_k(\hat P_n,\hat Q_m)\le\epsilon\}$ for large enough $n,m$. By the extended Sanov's theorem, the type-II error probability decays at least exponentially fast if $D^*>0$, which can be satisfied by picking $R=S$ under $H_1$ and using the weak convergence property of the MMD (cf. Theorem 1). We then show that the exponential decay rate is both lower bounded and upper bounded by $D^*$, based on the lower semi-continuity of the KLD Van Erven and Harremos (2014) and Stein's lemma Dembo and Zeitouni (2009), respectively. Details can be found in Appendix B. ∎

Therefore, when $0<D^*<\infty$, the type-II error probability vanishes as $e^{-(n+m)(D^*-\delta)}$, where $\delta>0$ is fixed and can be arbitrarily small. The result also shows that kernels only affect the sub-exponential term in the type-II error probability, provided that they meet the conditions of A2.
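For a finite sample space, the exponent $D^*=\inf_R\,cD(R\|P)+(1-c)D(R\|Q)$ stated above admits a closed form: a Lagrangian calculation gives the minimizer $R^*\propto P^cQ^{1-c}$ and $D^*=-\log\sum_x P(x)^cQ(x)^{1-c}$, a Chernoff/Rényi-type quantity. A small numerical sketch (the distributions are illustrative):

```python
import numpy as np

def kl(r, p):
    # KL divergence between distributions on a finite alphabet
    return float(np.sum(r * np.log(r / p)))

def d_star_closed(P, Q, c):
    # D* = -log sum_x P(x)^c Q(x)^(1-c); minimizer R* is the normalized
    # geometric mixture P^c Q^(1-c)
    return float(-np.log(np.sum(P**c * Q**(1 - c))))

P = np.array([0.7, 0.2, 0.1])
Q = np.array([0.1, 0.3, 0.6])
c = 0.5

R_star = P**c * Q**(1 - c)
R_star /= R_star.sum()
closed = d_star_closed(P, Q, c)
direct = c * kl(R_star, P) + (1 - c) * kl(R_star, Q)

# sanity check: any other R gives a larger objective
rng = np.random.default_rng(2)
others = rng.dirichlet(np.ones(3), size=1000)
vals = [c * kl(r, P) + (1 - c) * kl(r, Q) for r in others]
```

For $c=1/2$ this reduces to (twice) the Bhattacharyya distance, consistent with $D^*>0$ whenever $P\neq Q$.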

Not covered in Theorem 3 is the case where $n$ and $m$ scale in different orders, i.e., where $c=0$ or $c=1$. Without loss of generality, we may consider only $c=1$, with $n/m\to\infty$. If $D(P\|Q)<\infty$ under the alternative hypothesis, then (Zhu et al., 2018, Theorem 4) indicates that

$$\liminf_{n,m\to\infty}-\frac{1}{m}\log\beta_{n,m}=D(P\|Q),\quad (2)$$

which leads to a degenerate result on the error exponent w.r.t. the total sample size $n+m$:

$$\liminf_{n,m\to\infty}-\frac{1}{n+m}\log\beta_{n,m}=\liminf_{n,m\to\infty}\frac{1}{1+\frac{n}{m}}\left(-\frac{1}{m}\log\beta_{n,m}\right)=0.$$

Notice that, with $c=1$ (and $D(P\|Q)<\infty$) we have $D^*=0$. Then Theorem 3 still holds if we remove the assumption $D^*>0$. However, the error exponent being $0$ also includes the case where $\beta_{n,m}$ is bounded away from $0$. The more insightful perspective is to look at Eq. (2), and the test is then said to be exponentially consistent w.r.t. the sample size $m$.

### 3.3 Optimal exponent and more exponentially consistent two-sample tests

We can identify other two-sample tests that are at least exponentially consistent based on the above results. In particular, the lower bounds still hold if another test has a smaller type-II error probability, or if $\mathcal{A}'(n,m)\subseteq\mathcal{A}(n,m)$ under $H_1$, where $\mathcal{A}'(n,m)$ is the acceptance region defined by the other test. A special case is considered in the following theorem, which follows directly from Theorem 3 and Eq. (2).

###### Theorem 4.

Let $x^n$, $y^m$, $\hat P_n$, $\hat Q_m$, $d_k$, $P$, $Q$, and $D^*$ be defined as in Theorem 3. Assume A1 and A2. Let $\mathcal{A}'(n,m)$ be the acceptance region of another two-sample test and $\beta'_{n,m}$ the type-II error probability. If $\mathcal{A}'(n,m)\subseteq\{(x^n,y^m):d_k(\hat P_n,\hat Q_m)\le\gamma'_{n,m}\}$, where $\gamma'_{n,m}\to0$ as $n,m\to\infty$, then

$$\liminf_{n,m\to\infty}-\frac{1}{n+m}\log\beta'_{n,m}\ge D^*,$$

when $n/(n+m)\to c\in(0,1)$ and $0<D^*<\infty$; and

$$\liminf_{n,m\to\infty}-\frac{1}{m}\log\beta'_{n,m}\ge D(P\|Q),$$

when $n/m\to\infty$ and $D(P\|Q)<\infty$.

The above theorem characterizes only the type-II error exponent. A suitable threshold is needed to guarantee that the test is level $\alpha$ for practical use. Our next result provides an upper bound on the optimal type-II error exponent of any (asymptotically) level $\alpha$ test.

###### Theorem 5.

Let $x^n$, $y^m$, $P$, $Q$, and $D^*$ be defined as in Theorem 4. For a test which is (asymptotically) level $\alpha$, its type-II error probability $\beta'_{n,m}$ satisfies

$$\liminf_{n,m\to\infty}-\frac{1}{n+m}\log\beta'_{n,m}\le D^*,$$

if $n/(n+m)\to c\in(0,1)$ and $0<D^*<\infty$; and

$$\liminf_{n,m\to\infty}-\frac{1}{m}\log\beta'_{n,m}\le D(P\|Q),$$

if $n/m\to\infty$ and $D(P\|Q)<\infty$.

###### Proof.

Let $R$ be such that $cD(R\|P)+(1-c)D(R\|Q)\le D^*+\delta$, where $\delta>0$ is fixed and can be arbitrary. Here the Radon-Nikodym derivatives of $R$ w.r.t. $P$ and $Q$ exist by the finiteness of the KLDs. Considering the acceptance region of the given test under the pair $(R,R)$, a change-of-measure argument yields an upper bound of $D^*+\delta$ on its type-II error exponent. Since $\delta$ can be arbitrarily small, $D^*$ is an upper bound on the type-II error exponent. When $n/m\to\infty$, we can set $c=1$ and apply the above argument; alternatively, we may compare the test with the optimal goodness-of-fit test in Zhu et al. (2018) and use Stein's lemma Dembo and Zeitouni (2009) to establish the upper bound. See Appendix C for details. ∎

This theorem shows that the kernel test is an asymptotically optimal level $\alpha$ two-sample test, when the type-II error exponent is chosen as the asymptotic performance metric. Moreover, Theorems 4 and 5 together provide a way of finding more asymptotically optimal two-sample tests:

• An unbiased estimator of the squared MMD is also proposed in Gretton et al. (2012a), together with a threshold making the resulting test a level $\alpha$ test. As the difference between the biased and unbiased statistics is finitely bounded in terms of $K$ and the sample sizes, the acceptance region of the unbiased test is a subset of a region of the form in Theorem 4. Then its type-II error probability vanishes exponentially at a rate of $D^*$.

• It is also possible to consider a family of kernels for the test statistic Fukumizu et al. (2009); Sriperumbudur (2016). For a given family $\mathcal{K}$, the test statistic is $\sup_{k\in\mathcal{K}}d_k(\hat P_n,\hat Q_m)$, which also metrizes weak convergence under suitable conditions, e.g., when $\mathcal{K}$ consists of finitely many Gaussian kernels (Sriperumbudur, 2016, Theorem 3.2). If $K$ remains an upper bound for all $k\in\mathcal{K}$, then comparing this statistic with $\gamma_{n,m}$ in Eq. (1) results in an asymptotically optimal level $\alpha$ test.

### 3.4 Discussions

#### Fair alternative

In Ramdas et al. (2015), a notion of fair alternative is proposed for two-sample testing as dimension increases: fix the KLD between $P$ and $Q$ under the alternative hypothesis for all dimensions. This idea is guided by the fact that the KLD is a fundamental information-theoretic quantity determining the hardness of hypothesis testing problems. This approach, however, does not take into account the impact of sample sizes. In light of our results, perhaps a better choice is to fix $D^*$ in Theorem 3 when the sample sizes grow in the same order. In practice, $D^*$ may be hard to compute, so fixing an upper bound on it, and hence the KLD, is reasonable.
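To illustrate fixing the KLD across dimensions: for isotropic Gaussians $\mathcal{N}(\mu_1,I)$ and $\mathcal{N}(\mu_2,I)$ the KLD is $\|\mu_1-\mu_2\|^2/2$, so a per-coordinate mean shift of $\sqrt{2\,\mathrm{KL}/d}$ keeps the divergence constant as the dimension $d$ grows. A sketch, with an arbitrary target value:

```python
import numpy as np

def kl_isotropic_gauss(mu1, mu2):
    # KL( N(mu1, I) || N(mu2, I) ) = ||mu1 - mu2||^2 / 2
    diff = np.asarray(mu1, dtype=float) - np.asarray(mu2, dtype=float)
    return 0.5 * float(diff @ diff)

target_kl = 0.5   # fixed hardness of the alternative, for all dimensions
shifts = {}
for d in (1, 10, 100):
    delta = np.sqrt(2.0 * target_kl / d)   # per-coordinate mean shift
    shifts[d] = delta
    mu = np.full(d, delta)
    # the total divergence stays at target_kl even though the
    # per-coordinate shift shrinks with dimension
    assert np.isclose(kl_isotropic_gauss(np.zeros(d), mu), target_kl)
```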

#### Kernel choice

The main results indicate that the type-II error exponent is independent of kernels as long as they are bounded continuous and characteristic. We remark that this does not contradict previous studies on kernel choice, as the sub-exponential term can dominate in the finite sample regime. In light of the exponential consistency, interesting connections arise with a kernel selection strategy where part of the samples are used as training data to choose a kernel and the remaining samples are used with the selected kernel to compute the test statistic Gretton et al. (2012b); Sutherland et al. (2017). On the one hand, the training set should not be too small, so that there are enough data for selecting a good kernel. On the other hand, if the number of samples is large enough and the exponential decay term becomes dominant, directly using all the samples may be good enough to attain a low type-II error probability, provided that the kernel is not too poor. This point will be further illustrated by experiments in Section 5.

#### Threshold choice

As also discussed in Zhu et al. (2018), the distribution-free threshold $\gamma_{n,m}$ in Eq. (1) is loose in general Gretton et al. (2012a). In practice, the threshold can be computed based on some estimate of the null distribution from the given samples, such as a bootstrap procedure or using the eigenspectrum of the Gram matrix on the aggregate sample Gretton et al. (2009, 2012a). While these approaches can meet the level constraint in the large sample limit, they bring additional randomness to the threshold, and further to the type-II error probability. Similar to Zhu et al. (2018), we can take the minimum of such a threshold and the distribution-free one to achieve the optimal type-II error exponent, while the type-I error constraint holds in the asymptotic sense, i.e., $\limsup_{n,m\to\infty}\alpha_{n,m}\le\alpha$.

#### Other discrepancy measures

Other distance measures between distributions may also metrize the weak convergence on $\mathcal{P}(\mathcal{X})$, such as the Lévy-Prokhorov metric, the bounded Lipschitz metric, and the Wasserstein distance. If we directly compute such a distance between the empirical measures and compare it with a decreasing threshold, the obtained test would have the same optimal type-II error exponent as in Theorem 4. However, unlike Lemma 1 for the MMD based statistic, there does not exist a uniform or distribution-free threshold such that the level constraint is satisfied for all sample sizes. Similar to the kernel Stein discrepancy based goodness-of-fit test in Zhu et al. (2018), a possible remedy is to relax the level constraint to an asymptotic one, but a uniform characterization of the decay rate of the estimated distance is still required. We will not expand in this direction, because computing such distance measures from samples is generally more costly than the MMD based statistics.
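For one-dimensional samples, for example, the Wasserstein distance between empirical measures is easy to compute and behaves qualitatively like the MMD statistic: small under $H_0$ and bounded away from zero under $H_1$. A sketch with illustrative distributions and sizes; a threshold would still need calibration, e.g., by permutation, since no distribution-free bound like Lemma 1 is available:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 1000)
y_same = rng.normal(0.0, 1.0, 1000)   # same distribution as x
y_diff = rng.normal(1.0, 1.0, 1000)   # mean-shifted alternative

# 1-Wasserstein distance between the empirical measures
w_h0 = wasserstein_distance(x, y_same)   # shrinks toward 0 as n grows
w_h1 = wasserstein_distance(x, y_diff)   # concentrates near the mean shift
```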

## 4 Application to off-line change detection

In this section, we apply our results to the off-line change detection problem.

Let $z^n=\{z_1,\ldots,z_n\}$ be an independent sequence of observations. Assume that there is at most one change-point at index $t$, which, if it exists, indicates that $z_1,\ldots,z_t\sim P$ and $z_{t+1},\ldots,z_n\sim Q$ with $P\neq Q$. The off-line change-point analysis consists of two steps: 1) detect if there is a change-point in the sample sequence; 2) estimate the index $t$ if such a change-point exists. Notice that a method may readily extend to multiple change-point and on-line settings through sliding windows running along the sequence, as in Desobry et al. (2005); Harchaoui et al. (2009); Li et al. (2015a).

The first step in the change-point analysis is usually formulated as a hypothesis testing problem:

$$H_0{:}\ z_i\sim P,\ i=1,\ldots,n;\qquad H_1{:}\ \text{there exists }1\le t<n\text{ such that }z_1,\ldots,z_t\sim P\text{ and }z_{t+1},\ldots,z_n\sim Q,\ P\neq Q.$$

Let $\hat P_i$ and $\hat Q_{n-i}$ denote the empirical measures of the subsequences $\{z_1,\ldots,z_i\}$ and $\{z_{i+1},\ldots,z_n\}$, respectively. Then an MMD based test can be directly constructed using the maximum partition strategy:

$$\text{decide }H_0,\quad\text{if}\ \max_{a_n\le i\le b_n}d_k(\hat P_i,\hat Q_{n-i})\le\gamma_n,$$

where the maximum is searched in an interval $[a_n,b_n]$ with $1\le a_n<b_n<n$. If the test favors $H_1$, we can proceed to estimate the change-point index by $\hat t=\arg\max_{a_n\le i\le b_n}d_k(\hat P_i,\hat Q_{n-i})$. Here we characterize the performance of detecting the presence of a change for this test, using Theorems 3 and 5. We remark that the assumptions on the search interval and on the change-point index in the following theorem are standard practice in this setting Basseville and Nikiforov (1993); Desobry et al. (2005); Harchaoui et al. (2009); James et al. (1987); Li et al. (2015a).
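The maximum partition scan can be sketched as follows; the kernel, bandwidth, search interval, and data are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_biased(x, y, sigma=1.0):
    n, m = len(x), len(y)
    v = (gaussian_gram(x, x, sigma).sum() / n**2
         + gaussian_gram(y, y, sigma).sum() / m**2
         - 2.0 * gaussian_gram(x, y, sigma).sum() / (n * m))
    return np.sqrt(max(v, 0.0))

def scan_change_point(z, a_n, b_n, sigma=1.0):
    # max-partition statistic max_{a_n <= i <= b_n} d_k(hat P_i, hat Q_{n-i}),
    # and the maximizing index as the change-point estimate
    stats = [mmd_biased(z[:i], z[i:], sigma) for i in range(a_n, b_n + 1)]
    i_best = int(np.argmax(stats))
    return max(stats), a_n + i_best

rng = np.random.default_rng(4)
z = np.vstack([rng.normal(0.0, 1.0, size=(150, 1)),
               rng.normal(2.0, 1.0, size=(100, 1))])   # change at t = 150
stat, t_hat = scan_change_point(z, 25, 225)
```

The detection step compares `stat` with a threshold $\gamma_n$; the estimation step returns `t_hat`.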

###### Theorem 6.

Let $a_n/n$ and $b_n/n$ converge as $n\to\infty$. Under the alternative hypothesis $H_1$, assume that the change-point index satisfies $a_n\le t\le b_n$, and that $0<D^*<\infty$, where $D^*$ is defined in Theorem 3. Further assume that the kernel $k$ satisfies A2, with $K$ being an upper bound. Given $\alpha\in(0,1)$, set the threshold $\gamma_n$ accordingly (see the proof). Then the test is level $\alpha$ and also achieves the optimal type-II error exponent, that is,

$$\alpha_n\le\alpha,\quad\text{and}\quad\liminf_{n\to\infty}-\frac{1}{n}\log\beta_n=D^*,$$

where $\alpha_n$ and $\beta_n$ are the type-I and type-II error probabilities, respectively.

###### Proof.

Since $\alpha_n$ is at most the sum of the error probabilities over the searched indices, it suffices to make each summand small enough under the null hypothesis $H_0$. This can be verified using Lemma 1 with the choice of $\gamma_n$ in the above theorem. To see the optimal type-II error exponent, consider a simpler problem where the possible change-point $t$ is known, i.e., a two-sample problem between $\{z_1,\ldots,z_t\}$ and $\{z_{t+1},\ldots,z_n\}$. Since $t/n$ converges as $n\to\infty$, applying Theorems 3 and 5 establishes the optimal type-II error exponent. ∎

## 5 Experiments

This section presents empirical results to validate our previous findings. We begin with a toy example to demonstrate the exponential consistency, and then consider how kernel choice and sample sizes affect the type-II error probability. We set equal sample sizes, i.e., $n=m$, and pick a fixed significance level $\alpha$ in all experiments.

#### Exponential consistency

While there have been various experiments on the type-II error probability, its exponential decay behavior has been scarcely reported. To this end, we perform a simple experiment and display the type-II error probability on a logarithmic scale. Let $P$ and $Q$ be Gaussian distributions that differ in their means and share the identity covariance matrix $I$. We use the biased test statistic with the Gaussian kernel $k(x,y)=\exp(-\|x-y\|^2/(2\sigma^2))$. A fixed choice of $\sigma$ and the median heuristic are employed for the kernel bandwidth. We also consider two threshold choices: one from the large deviation bound (LDB), given in Eq. (1), and the other from a bootstrap method in Gretton et al. (2012a). We repeat the experiment over a number of trials and report the result in Figure 1.

We observe that all the type-II error probabilities exhibit an exponential decay as the sample number increases. The LDB threshold is quite conservative, and the corresponding error probability starts decaying only with many more samples. Although the main theorems in Section 3 do not cover the median bandwidth, the figure shows that it also leads to an exponential decay of the type-II error probability. This might be because the median distance lies within a small neighborhood of some fixed bandwidth in this experiment, hence behaving similarly.
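The decay behavior can also be probed without the exact setup above: the following Monte Carlo sketch estimates the type-II error of the biased statistic against a fixed small threshold $\epsilon$ (all distributions and parameters are illustrative; Theorem 3 predicts a log-linear decay in the sample size whenever $\epsilon<d_k(P,Q)$):

```python
import numpy as np

def gaussian_gram(a, b, sigma=1.0):
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2 * a @ b.T
    return np.exp(-sq / (2 * sigma**2))

def mmd_biased(x, y, sigma=1.0):
    n, m = len(x), len(y)
    v = (gaussian_gram(x, x, sigma).sum() / n**2
         + gaussian_gram(y, y, sigma).sum() / m**2
         - 2.0 * gaussian_gram(x, y, sigma).sum() / (n * m))
    return np.sqrt(max(v, 0.0))

rng = np.random.default_rng(5)
eps = 0.25       # fixed threshold, below the population MMD for this pair
trials = 300
betas = []
for n in (10, 20, 40, 80, 160):
    errors = 0
    for _ in range(trials):
        x = rng.normal(0.0, 1.0, size=(n, 1))   # P = N(0, 1)
        y = rng.normal(1.0, 1.0, size=(n, 1))   # Q = N(1, 1)
        errors += mmd_biased(x, y) <= eps       # type-II error: accept H0
    betas.append(errors / trials)
```

Plotting `betas` against `n` on a log scale exhibits the roughly linear trend that exponential consistency predicts.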

#### Kernel choice vs. Sample size

Following the discussions in Section 3.4, we investigate how kernel choice and sample number affect the test performance. We consider Gaussian kernels, which are determined by their bandwidths. Sutherland et al. (2017) use part of the samples as training data to select the bandwidth, which we call the trained bandwidth. The estimated MMD is then computed using the trained bandwidth and the remaining samples.

For the first experiment, we take a setting similar to Sutherland et al. (2017): $P$ is a grid of 2D standard normals with some spacing between the centers; $Q$ is laid out identically, but with correlated coordinates. We generate an equal number of samples from each distribution and use two splitting ratios for computing the trained bandwidth; the remaining samples in each case are used to calculate the test statistic. For each case, we report in Figure 2 the type-II error probabilities over different bandwidths, averaged over a number of trials. The unbiased test statistic is used, and the test threshold is obtained using bootstrap with permutations. We also mark the trained bandwidths corresponding to the respective sample sizes in the figure (red star marker).

Figure 2 verifies that the trained bandwidth is close to the optimal one in terms of the type-II error probability. Moreover, it indicates that a large range of bandwidths leads to lower or comparable error probabilities if we directly use all the samples for testing. As the sample number increases, the exponential decay term in the type-II error probability becomes dominant and the effect of kernel choice diminishes. However, since the desired range of bandwidths is not known in advance, an interesting question is when we should split data for kernel selection and what a proper splitting ratio is.

In the second experiment, we directly use the setup in Liu et al. (2016): samples from $P$ are drawn from the model therein, and samples from $Q$ are generated by adding standard Gaussian noise (a perturbation) to those from $P$. We consider two splitting ratios of the entire samples used as training data and compute the test statistic based on the remaining samples. For comparison, kernel tests with fixed bandwidths are also evaluated, which estimate the MMD based on all the samples. All the test thresholds are computed using bootstrap. We report the type-II error probabilities in Figure 2. It shows that the more samples we use to compute the test statistic, the lower the type-II error probability we get; in other words, kernel choice is less important than the sample size in this setting. This point is further illustrated in Figure 2, where we show the type-II error probabilities of two sample sizes over different kernel bandwidths. The kernel selection strategy in Sutherland et al. (2017) does not perform well in this experiment, which also motivates future studies on when to use such a strategy.

## 6 Conclusion

In this paper, a class of kernel two-sample tests is shown to be exponentially consistent and to attain the optimal type-II error exponent, provided that the kernels are bounded continuous and characteristic. A notable implication is that kernels affect only the sub-exponential term in the type-II error probability. We apply our results to off-line change detection and show that a kernel based test achieves the optimal detection in the nonparametric setting. Finally, we empirically investigate how kernel choice and sample size affect the test performance.

## References

• Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
• Basseville and Nikiforov (1993) M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs: Prentice Hall, 1993.
• Berlinet and Thomas-Agnan (2011) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
• Borgwardt et al. (2006) K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, July 2006.
• Bu et al. (2016) Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of KL divergence: Optimal minimax rate. arXiv preprint arXiv:1607.02653, 2016.
• Casella and Berger (2002) G. Casella and R. Berger. Statistical Inference. Duxbury Thomson Learning, 2002.
• Cover and Thomas (2006) T. M. Cover and J. A. Thomas. Elements of Information Theory. New York: Wiley, 2nd edition, 2006.
• Csiszár (2006) I. Csiszár. A simple proof of Sanov’s theorem. Bulletin of the Brazilian Mathematical Society, 37(4):453–459, 2006.
• Dembo and Zeitouni (2009) A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. New York: Springer, 2009.
• Desobry et al. (2005) F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8):2961–2974, 2005.
• Dziugaite et al. (2015) G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015.
• Friedman and Rafsky (1979) J. H. Friedman and L. C. Rafsky. Multivariate generalizations of the Wald-Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697–717, 1979.
• Fukumizu et al. (2009) K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, 2009.
• Gretton et al. (2009) A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems, 2009.
• Gretton et al. (2012a) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012a.
• Gretton et al. (2012b) A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, 2012b.
• Harchaoui et al. (2009) Z. Harchaoui, E. Moulines, and F. R. Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems, 2009.
• James et al. (1987) B. James, K. L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74(1):71–83, 1987.
• Kim et al. (2016) B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, 2016.
• Li et al. (2017) C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
• Li et al. (2015a) S. Li, Y. Xie, H. Dai, and L. Song. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015a.
• Li et al. (2015b) Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, 2015b.
• Liu et al. (2016) Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, 2016.
• Lloyd and Ghahramani (2015) J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems, 2015.
• Muandet et al. (2017) K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1-2):1–141, 2017.
• Nguyen et al. (2010) X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
• Ramdas et al. (2015) A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. A. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, 2015.
• Simon-Gabriel and Schölkopf (2016) C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint arXiv:1604.05251, 2016.
• Sriperumbudur (2016) B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 2016.
• Sriperumbudur et al. (2010) B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
• Sutherland et al. (2017) D. Sutherland, H. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In International Conference on Learning Representations, 2017.
• Szabó et al. (2015) Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Two-stage sampled learning theory on distributions. In Artificial Intelligence and Statistics, 2015.
• Szabó et al. (2016) Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
• Van Erven and Harremos (2014) T. Van Erven and P. Harremos. Rényi divergence and Kullback-Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
• Zaremba et al. (2013) W. Zaremba, A. Gretton, and M. Blaschko. B-test: A non-parametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, 2013.
• Zhu et al. (2018) S. Zhu, B. Chen, P. Yang, and Z. Chen. Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of fit. arXiv preprint arXiv:1802.07581, 2018.

## Appendix A Proof of the extended Sanov’s theorem

We first prove the result for a finite sample space and then extend it to a general Polish space. The prerequisites are two combinatorial lemmas that are standard tools in information theory.

For a positive integer $n$, let $\mathcal{P}_n(t)$ denote the set of probability distributions on the finite alphabet $\{a_1,\ldots,a_t\}$ that are of the form $(k_1/n,\ldots,k_t/n)$ with non-negative integers $k_1,\ldots,k_t$, i.e., the possible types of $n$ samples. Stated below are the two lemmas.

###### Lemma 2 ((Cover and Thomas, 2006, Theorem 11.1.1)).

The number of types satisfies $|\mathcal{P}_n(t)| \le (n+1)^t$.

###### Lemma 3 ((Cover and Thomas, 2006, Theorem 11.1.4)).

Assume $x^n = (x_1,\ldots,x_n)$ i.i.d. $\sim Q$, where $Q$ is a distribution defined on $\{a_1,\ldots,a_t\}$. For any $P \in \mathcal{P}_n(t)$, the probability that the empirical distribution $\hat{P}_n$ of $x^n$ equals $P$ satisfies

$$(n+1)^{-t}\, e^{-nD(P\|Q)} \;\le\; P_{x^n}\big(\hat{P}_n = P\big) \;\le\; e^{-nD(P\|Q)}.$$
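As a numerical sanity check (not part of the paper), the following sketch verifies the bounds of Lemma 3 by exhaustively enumerating all types on a small hypothetical alphabet; the alphabet size, sample size, and distribution are illustrative choices.

```python
import math
from itertools import product

def kld(p, q):
    """KL divergence D(P||Q) in nats for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def type_prob(counts, q):
    """Exact probability that n i.i.d. draws from q have empirical counts `counts`."""
    n = sum(counts)
    coef = math.factorial(n)
    for k in counts:
        coef //= math.factorial(k)  # multinomial coefficient (exact integer)
    prob = float(coef)
    for k, qi in zip(counts, q):
        prob *= qi ** k
    return prob

n, t = 12, 3                 # illustrative sample size and alphabet size
q = [0.5, 0.3, 0.2]          # illustrative generating distribution Q
for counts in product(range(n + 1), repeat=t):
    if sum(counts) != n:
        continue
    p_hat = [k / n for k in counts]
    d = kld(p_hat, q)
    pr = type_prob(counts, q)
    # Lemma 3: (n+1)^{-t} e^{-nD} <= P(hat P_n = P) <= e^{-nD}
    # (tiny multiplicative tolerance guards the floating-point equality cases)
    assert (n + 1) ** (-t) * math.exp(-n * d) * (1 - 1e-9) <= pr <= math.exp(-n * d) * (1 + 1e-9)
print("Lemma 3 bounds hold for every type with n =", n)
```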

### A.1 Finite sample space

#### Upper bound

Let $t$ denote the cardinality of the sample space. Without loss of generality, assume that $\inf_{(R,S)\in\operatorname{int}\Gamma} cD(R\|P)+(1-c)D(S\|Q) < \infty$. Hence, the open set $\operatorname{int}\Gamma$ is non-empty. Fix any $(R,S)\in\operatorname{int}\Gamma$. As the types are dense in the probability simplex, we can find $N$ and $M$ such that there exists $(P'_n,Q'_m)\in\operatorname{int}\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)$ for all $n\ge N$ and $m\ge M$, and that $(P'_n,Q'_m)\to(R,S)$ as $n,m\to\infty$. Then we have, with $x^n$ i.i.d. $\sim P$ and $y^m$ i.i.d. $\sim Q$,

$$\begin{aligned}
P_{x^n y^m}\big((\hat{P}_n,\hat{Q}_m)\in\Gamma\big)
&= \sum_{(R',S')\in\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n y^m}\big(\hat{P}_n=R',\hat{Q}_m=S'\big)\\
&\ge \sum_{(R',S')\in\operatorname{int}\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n y^m}\big(\hat{P}_n=R',\hat{Q}_m=S'\big)\\
&\ge P_{x^n y^m}\big(\hat{P}_n=P'_n,\hat{Q}_m=Q'_m\big)\\
&= P_{x^n}\big(\hat{P}_n=P'_n\big)\,P_{y^m}\big(\hat{Q}_m=Q'_m\big)\\
&\ge (n+1)^{-t}(m+1)^{-t}\, e^{-nD(P'_n\|P)}\, e^{-mD(Q'_m\|Q)},
\end{aligned}$$

where the last inequality is from Lemma 3. It follows that

$$\begin{aligned}
\limsup_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n y^m}\big((\hat{P}_n,\hat{Q}_m)\in\Gamma\big)
&\le \lim_{n,m\to\infty}\frac{1}{n+m}\Big(t\log\big((n+1)(m+1)\big)+nD(P'_n\|P)+mD(Q'_m\|Q)\Big)\\
&= \lim_{n,m\to\infty}\frac{1}{n+m}\big(nD(P'_n\|P)+mD(Q'_m\|Q)\big)\\
&= cD(R\|P)+(1-c)D(S\|Q),
\end{aligned}$$

where the last step uses $n/(n+m)\to c$ and the continuity of the KLD on a finite alphabet. Since $(R,S)\in\operatorname{int}\Gamma$ is arbitrary, taking the infimum over $\operatorname{int}\Gamma$ gives

$$\limsup_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n y^m}\big((\hat{P}_n,\hat{Q}_m)\in\Gamma\big)\le \inf_{(R,S)\in\operatorname{int}\Gamma} cD(R\|P)+(1-c)D(S\|Q).$$

#### Lower bound

$$\begin{aligned}
P_{x^n y^m}\big((\hat{P}_n,\hat{Q}_m)\in\Gamma\big)
&= \sum_{(R,S)\in\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} P_{x^n}\big(\hat{P}_n=R\big)\,P_{y^m}\big(\hat{Q}_m=S\big)\\
&\overset{(a)}{\le} \sum_{(R,S)\in\Gamma\cap\mathcal{P}_n(t)\times\mathcal{P}_m(t)} e^{-nD(R\|P)}\,e^{-mD(S\|Q)}\\
&\overset{(b)}{\le} (n+1)^t(m+1)^t \sup_{(R,S)\in\Gamma} e^{-nD(R\|P)}\,e^{-mD(S\|Q)},
\end{aligned} \tag{3}$$

where (a) and (b) are due to Lemma 3 and Lemma 2, respectively. This gives

$$\liminf_{n,m\to\infty} -\frac{1}{n+m}\log P_{x^n y^m}\big((\hat{P}_n,\hat{Q}_m)\in\Gamma\big)\ge \inf_{(R,S)\in\Gamma} cD(R\|P)+(1-c)D(S\|Q),$$

and hence the lower bound by noting that $\inf_{(R,S)\in\Gamma} cD(R\|P)+(1-c)D(S\|Q)\ge\inf_{(R,S)\in\bar{\Gamma}} cD(R\|P)+(1-c)D(S\|Q)$. Indeed, when the right-hand side is finite, the infimum over $\Gamma$ equals the infimum over $\bar{\Gamma}$ as a result of the continuity of the KLD for finite alphabets.
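As a quick numerical sanity check (not part of the paper), the sketch below verifies the two-sample sandwich implied by Lemmas 2 and 3 on a binary alphabet ($t=2$): the exact probability of a hypothetical event $\Gamma$, defined here by a gap between empirical means, is bracketed by polynomial factors times $e^{-(n+m)\inf_\Gamma[cD(R\|P)+(1-c)D(S\|Q)]}$. The distributions, sample sizes, and threshold are illustrative choices.

```python
import math

def binary_kld(p, q):
    """KL divergence D(Ber(p) || Ber(q)) in nats."""
    d = 0.0
    if p > 0:
        d += p * math.log(p / q)
    if p < 1:
        d += (1 - p) * math.log((1 - p) / (1 - q))
    return d

def binom_pmf(n, k, q):
    return math.comb(n, k) * q ** k * (1 - q) ** (n - k)

# Hypothetical event Gamma = {(R, S): |mean(R) - mean(S)| >= delta},
# with both samples drawn from P = Q = Ber(0.5).
p_true, q_true, delta = 0.5, 0.5, 0.3
n, m = 40, 60
c = n / (n + m)

# Exact probability of Gamma and the rate inf over type pairs in Gamma.
prob = 0.0
best = float("inf")
for k in range(n + 1):
    for l in range(m + 1):
        if abs(k / n - l / m) >= delta:
            prob += binom_pmf(n, k, p_true) * binom_pmf(m, l, q_true)
            rate = c * binary_kld(k / n, p_true) + (1 - c) * binary_kld(l / m, q_true)
            best = min(best, rate)

# Sandwich from Lemmas 2 and 3 with t = 2:
upper = (n + 1) ** 2 * (m + 1) ** 2 * math.exp(-(n + m) * best)
lower = (n + 1) ** (-2) * (m + 1) ** (-2) * math.exp(-(n + m) * best)
assert lower <= prob <= upper
print("exact prob:", prob, "rate lower bound:", best)
```

The polynomial prefactors are loose for moderate $n,m$ but vanish on the exponential scale, which is exactly why the rate $\inf_\Gamma cD(R\|P)+(1-c)D(S\|Q)$ emerges in the limit.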

### A.2 Polish sample space

We consider the general case with $\mathcal{X}$ being a Polish space. Now $\mathcal{P}$ is the space of probability measures on $\mathcal{X}$, endowed with the topology of weak convergence. To proceed, we introduce another topology on $\mathcal{P}$ and an equivalent definition of the KLD.

$\tau$-topology: denote by $\Pi$ the set of all partitions of $\mathcal{X}$ into a finite number of measurable sets $A_1,\ldots,A_t$. For $P\in\mathcal{P}$, $A=\{A_1,\ldots,A_t\}\in\Pi$, and $\zeta>0$, denote

$$U(P,A,\zeta)=\big\{P'\in\mathcal{P}:\ |P'(A_i)-P(A_i)|<\zeta,\ i=1,\ldots,t\big\}. \tag{4}$$

The $\tau$-topology on $\mathcal{P}$ is the coarsest topology in which the mappings $P\mapsto P(A)$ are continuous for every measurable set $A\subset\mathcal{X}$. A base for this topology is the collection of the sets in (4). We write $\mathcal{P}_\tau$ when we refer to $\mathcal{P}$ endowed with this $\tau$-topology, and write the interior and closure of a set as $\operatorname{int}_\tau\Gamma$ and $\operatorname{cl}_\tau\Gamma$, respectively. We remark that the $\tau$-topology is stronger than the weak topology: any set that is open in $\mathcal{P}$ w.r.t. the weak topology is also open in $\mathcal{P}_\tau$ (see more details in Csiszár (2006); Dembo and Zeitouni (2009)). The product topology on $\mathcal{P}_\tau\times\mathcal{P}_\tau$ is determined by the base whose elements are of the form

$$U(P,A_1,\zeta_1)\times U(Q,A_2,\zeta_2),$$

for $P,Q\in\mathcal{P}$, $A_1,A_2\in\Pi$, and $\zeta_1,\zeta_2>0$. We still use $\operatorname{int}_\tau\Gamma$ and $\operatorname{cl}_\tau\Gamma$ to denote the interior and closure of a set $\Gamma\subset\mathcal{P}_\tau\times\mathcal{P}_\tau$. As there always exists $A\in\Pi$ that refines both $A_1$ and $A_2$, any element from the base has an open subset

$$\tilde{U}(P,Q,A,\zeta):=U(P,A,\zeta)\times U(Q,A,\zeta)\subset\mathcal{P}_\tau\times\mathcal{P}_\tau,$$

for some $A\in\Pi$ and $\zeta>0$.
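To make the two ingredients above concrete, the following sketch (not from the paper; the partitions and cell probabilities are hypothetical examples) checks membership in a base set $U(P,A,\zeta)$ and builds the common refinement of two interval partitions of $[0,1)$:

```python
def in_neighborhood(p_cells, p_prime_cells, zeta):
    """P' lies in U(P, A, zeta) iff |P'(A_i) - P(A_i)| < zeta for every cell A_i of A."""
    return all(abs(a - b) < zeta for a, b in zip(p_cells, p_prime_cells))

def common_refinement(cuts1, cuts2):
    """Interval partition of [0, 1) refining the two partitions given by their cut points."""
    cuts = sorted(set(cuts1) | set(cuts2))
    return list(zip(cuts[:-1], cuts[1:]))

# A refines both A1 = {[0, .5), [.5, 1)} and A2 = {[0, .3), [.3, 1)}.
A = common_refinement([0.0, 0.5, 1.0], [0.0, 0.3, 1.0])
assert A == [(0.0, 0.3), (0.3, 0.5), (0.5, 1.0)]

# Membership in U(P, A, zeta) given cell probabilities under the refined partition.
P_cells = [0.3, 0.2, 0.5]
P_prime = [0.32, 0.18, 0.50]
assert in_neighborhood(P_cells, P_prime, 0.05)
```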

Another definition of the KLD: the following equivalent variational definition of the KLD will also be used:

$$D(P\|Q)=\sup_{A\in\Pi}\sum_{i=1}^{t} P(A_i)\log\frac{P(A_i)}{Q(A_i)}=\sup_{A\in\Pi} D(P_A\|Q_A),$$

with the conventions $0\log\frac{0}{q}=0$ and $p\log\frac{p}{0}=\infty$ if $p>0$. Here $P_A$ denotes the discrete probability measure obtained from the probability measure $P$ and the partition $A\in\Pi$. It is not hard to verify that for $R,S\in\mathcal{P}$,

$$cD(R\|P)+(1-c)D(S\|Q)=c\sup_{A_1\in\Pi} D(R_{A_1}\|P_{A_1})+(1-c)\sup_{A_2\in\Pi} D(S_{A_2}\|Q_{A_2})$$