Exponentially Consistent Kernel Two-Sample Tests
Abstract
Given two sets of independent samples from unknown distributions, a two-sample test decides whether to reject the null hypothesis that the two distributions are identical. Recent attention has focused on kernel two-sample tests, as the test statistics are easy to compute, converge fast, and have low bias in their finite sample estimates. However, an exact characterization of the asymptotic performance of such tests is still lacking, in particular of the rate at which the type-II error probability decays to zero in the large sample limit. In this work, we establish that a class of kernel two-sample tests is exponentially consistent on a Polish, locally compact Hausdorff sample space, e.g., $\mathbb{R}^d$. The obtained exponential decay rate is further shown to be optimal among all two-sample tests satisfying the level constraint, and is independent of particular kernels provided that they are bounded continuous and characteristic. Our results yield new insights into related issues such as the fair alternative for testing and the kernel selection strategy. Finally, as an application, we show that a kernel based test achieves optimal detection for offline change detection in the nonparametric setting.
Shengyu Zhu Huawei Noah’s Ark Lab Hong Kong, China szhu05@syr.edu Biao Chen Syracuse University Syracuse, NY, USA bichen@syr.edu Zhitang Chen Huawei Noah’s Ark Lab Hong Kong, China chenzhitang2@huawei.com
Preprint. Work in progress.
1 Introduction
Given two sets of i.i.d. samples, the two-sample problem decides whether or not to accept the null hypothesis that the generating distributions are the same, without imposing any parametric assumptions. This is important to a variety of applications, including data integration in bioinformatics Borgwardt et al. (2006), statistical model criticism Kim et al. (2016); Lloyd and Ghahramani (2015), and training deep generative models Dziugaite et al. (2015); Li et al. (2017, 2015b); Sutherland et al. (2017). Typical two-sample tests are constructed based on some distance measure between distributions, such as the classical Kolmogorov–Smirnov distance Friedman and Rafsky (1979), the Kullback–Leibler divergence (KLD) Bu et al. (2016); Nguyen et al. (2010), and the maximum mean discrepancy (MMD), a reproducing kernel Hilbert space norm of the difference between kernel mean embeddings of distributions Gretton et al. (2012a, b); Muandet et al. (2017); Simon-Gabriel and Schölkopf (2016); Zaremba et al. (2013). Notably, kernel based test statistics possess several key advantages such as computational efficiency and fast convergence, thereby attracting much attention recently.
A hypothesis test is usually evaluated by characterizing its type-II error probability subject to a level constraint on the type-I error probability. In this respect, existing kernel two-sample tests have been shown to be consistent, in the sense that the type-II error probability decreases to zero as the sample sizes scale to infinity. While consistency is a desired property, quantifying how fast the error probability decays is even more desirable, as it provides a natural metric for comparing test performance. However, an exact characterization of the decay rate is still elusive, even for some well-known kernel two-sample tests. For example, assuming an equal number of samples in both sets, the decay rate of the biased quadratic-time test in Gretton et al. (2012a) has only been lower bounded via a large deviation bound on the test statistic. The large deviation bound has been observed to be loose in general, indicating that the resulting decay rate is loose too. Other works such as Gretton et al. (2012b); Sutherland et al. (2017); Zaremba et al. (2013) have established the limiting distributions of the test statistics, but they also do not give a tight decay rate. Clearly, no statistical optimality can be claimed if the characterization itself is loose.
More recently, in the context of goodness-of-fit testing, Zhu et al. (2018) showed that quadratic-time kernel two-sample tests have a type-II error probability vanishing exponentially fast at a rate determined by the KLD between the two generating distributions. A strong condition for this result is that the sample sizes need to scale in different orders. Their approach, however, is not readily applicable when the sample sizes increase in the same order, e.g., when the two sets have an equal number of samples. This is because existing Sanov's theorems only hold for a sample sequence originating from one given distribution, whereas the acceptance region defined by the kernel two-sample test involves two sample sequences from different distributions. As such, the key seems to be an extended version of Sanov's theorem that handles two distributions; this is not apparent, as existing tools, e.g., Cramér's theorem Dembo and Zeitouni (2009) that is used for proving Sanov's theorem, can only deal with a single distribution.
The first goal of this paper is to seek an exact statistical characterization for a widely used kernel two-sample test. We establish an extended version of Sanov's theorem w.r.t. the topology induced by a pairwise weak convergence of probability measures. Our proof is inspired by Csiszár (2006), which proved the original Sanov's theorem of one sample sequence in the $\tau$-topology. Based on the idea of Zhu et al. (2018), we then show that the biased quadratic-time kernel two-sample test in Gretton et al. (2012a) is exponentially consistent when the sample sizes scale in the same order. The obtained exponential decay rate depends only on the generating distributions and the sample sizes under the alternative hypothesis, and is further shown to be optimal among all two-sample tests satisfying the level constraint. A notable implication is that kernels affect only the subexponential term in the type-II error probability, provided that they are bounded continuous and characteristic. We also comment that the extended Sanov's theorem may be of independent interest and may be applied to other large deviation applications.
Our second goal is to derive an optimality criterion for nonparametric two-sample tests, as well as a way of finding more tests achieving this optimality. Towards this goal, we characterize the maximum exponential decay rate for any two-sample test under a given level constraint. Furthermore, a sufficient condition is derived for the type-II error probability to decay at least exponentially fast with the maximum exponential rate (possibly violating the level constraint). These results provide new insights into related issues such as the fair alternative for testing and the kernel selection strategy, which are elaborated in Sections 3.4 and 5. As an application, we apply our results to the offline change detection problem and show that a kernel based test achieves optimal detection in terms of the exponential decay rate of the type-II error probability. To the best of our knowledge, this is the first time that a test is shown to be optimal for detecting the presence of a change in the nonparametric setting.
In Section 2, we briefly review the MMD and two-sample testing. In Section 3, we present our main results on the exact and optimal exponential decay rate for a class of kernel two-sample tests, followed by discussions of related issues. We apply our results to offline change detection in Section 4 and conduct synthetic experiments in Section 5. Section 6 concludes the paper.
2 Maximum mean discrepancy, two-sample testing, and test threshold
We briefly review the MMD and its weak metrizable property. We then describe the two-sample problem as statistical hypothesis testing and choose a suitable threshold for the level constraint.
Maximum mean discrepancy
Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) defined on a topological space $\mathcal{X}$ with reproducing kernel $k$. Let $X$ be an $\mathcal{X}$-valued random variable with probability measure $P$, and $\mathbb{E}_P f$ the expectation of $f(X)$ for a function $f \in \mathcal{H}$. Assume that $k$ is bounded continuous. Then for every Borel probability measure $P$ defined on $\mathcal{X}$, there exists a unique element $\mu_P \in \mathcal{H}$ such that $\mathbb{E}_P f = \langle f, \mu_P \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ Berlinet and Thomas-Agnan (2011). The MMD between two Borel probability measures $P$ and $Q$ is the RKHS distance between $\mu_P$ and $\mu_Q$, which can be expressed as
$$\mathrm{MMD}^2(P, Q) = \mathbb{E}\,k(X, X') + \mathbb{E}\,k(Y, Y') - 2\,\mathbb{E}\,k(X, Y),$$
where $X, X' \overset{\text{i.i.d.}}{\sim} P$ and $Y, Y' \overset{\text{i.i.d.}}{\sim} Q$ Gretton et al. (2012a). If the kernel is characteristic, then $\mathrm{MMD}(P, Q) = 0$ if and only if $P = Q$ Sriperumbudur et al. (2010). This property enables the MMD to distinguish different distributions.
We next present a weak metrizable property of the MMD, which will be used to establish our main results in Section 3. Let $\mathcal{P}(\mathcal{X})$ denote the set of all Borel probability measures defined on $\mathcal{X}$. For a sequence of probability measures $P_n \in \mathcal{P}(\mathcal{X})$, we say that $P_n \to P$ weakly if and only if $\int f\,dP_n \to \int f\,dP$ for every bounded continuous function $f$.
Two-sample testing based on the MMD
Let $X_1, \dots, X_m$ and $Y_1, \dots, Y_n$ be independent samples, with $X_i \overset{\text{i.i.d.}}{\sim} P$ and $Y_j \overset{\text{i.i.d.}}{\sim} Q$, where $P$ and $Q$ are unknown. Two-sample testing is to decide between the null hypothesis $H_0\colon P = Q$ and the alternative $H_1\colon P \neq Q$. Let $\hat{P}_m$ and $\hat{Q}_n$ be the respective empirical measures of the two samples, that is, $\hat{P}_m = \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ and $\hat{Q}_n = \frac{1}{n}\sum_{j=1}^n \delta_{Y_j}$, with $\delta_x$ being the Dirac measure at $x$. Then the squared MMD can be estimated by
$$\mathrm{MMD}^2(\hat{P}_m, \hat{Q}_n) = \frac{1}{m^2}\sum_{i, i'} k(X_i, X_{i'}) + \frac{1}{n^2}\sum_{j, j'} k(Y_j, Y_{j'}) - \frac{2}{mn}\sum_{i, j} k(X_i, Y_j),$$
which is the biased statistic originally proposed in Gretton et al. (2012a). A hypothesis test for the two-sample problem can then be constructed by comparing this statistic with a threshold: if the statistic does not exceed the threshold, the test accepts the null hypothesis $H_0$. The acceptance region is hence defined as the set of sample pairs whose statistic falls below the threshold. There are two types of errors: a type-I error is made if $H_0$ is rejected despite being true, and a type-II error occurs when $H_0$ is accepted under $H_1$. The type-I and type-II error probabilities are the probabilities of these two events, respectively. Bear in mind that both error probabilities are computed w.r.t. the true yet unknown distributions.
With a carefully chosen threshold, the above kernel test has been shown to be consistent in Gretton et al. (2012a). That is, the type-II error probability vanishes as the sample sizes grow, while the type-I error probability stays below a level $\alpha$ set in advance. In this paper, we study the exponential decay rate of the type-II error probability in the large sample limit, subject to the same level constraint. Specifically, we aim to characterize the limit of the normalized negative logarithm of the type-II error probability.
This limit is also called the type-II error exponent in information theory Cover and Thomas (2006). If the limit is positive, then the test is said to be exponentially consistent.
A suitable threshold
We directly use a result from Gretton et al. (2012a) in order to pick a proper threshold for the level constraint $\alpha$. Such tests are referred to as level-$\alpha$ tests in statistics Casella and Berger (2002).
Lemma 1 ((Gretton et al., 2012a, Theorem 7)).
Therefore, for a given level $\alpha$, choosing the threshold
$$\gamma_m = \sqrt{2K/m}\,\bigl(1 + \sqrt{2\log \alpha^{-1}}\bigr), \tag{1}$$
where $K$ is an upper bound on the kernel and $m$ is the number of samples per set, the kernel test has its type-I error probability bounded by $\alpha$, and hence is a level-$\alpha$ test.
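The distribution-free threshold can be computed as below. This is a sketch of the bound in Gretton et al. (2012a, Theorem 7) for equal sample sizes and a kernel bounded by $K$ ($K = 1$ for a Gaussian kernel); the function name is an illustrative assumption.

```python
import math

def ldb_threshold(m, alpha, K=1.0):
    """Distribution-free (large deviation bound) threshold for the biased MMD
    statistic with m samples per set and a kernel bounded by K. Under the null,
    the statistic exceeds this value with probability at most alpha."""
    return math.sqrt(2.0 * K / m) * (1.0 + math.sqrt(2.0 * math.log(1.0 / alpha)))
```

Note that the threshold vanishes as $m \to \infty$ and grows only logarithmically as $\alpha$ shrinks.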
3 Main results
In this section, we present our main results on the type-II error exponent of a class of kernel two-sample tests. The first and most important step is to establish an extended Sanov's theorem that works with two sample sequences.
3.1 Extended Sanov’s theorem
We define a pairwise weak convergence: we say $(P_n, Q_n) \to (P, Q)$ weakly if and only if both $P_n \to P$ and $Q_n \to Q$ weakly. We consider $\mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$ endowed with the topology induced by this pairwise weak convergence. It can be verified that this topology is equivalent to the product topology on $\mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$, where each $\mathcal{P}(\mathcal{X})$ is endowed with the topology of weak convergence. An extended version of Sanov's theorem is given below.
Theorem 2 (Extended Sanov’s Theorem).
Let $\mathcal{X}$ be a Polish space, $X_1, \dots, X_m$ i.i.d. $P$, and $Y_1, \dots, Y_n$ i.i.d. $Q$. Assume that the sample sizes $m$ and $n$ scale in the same order. Then for a set $\Gamma \subseteq \mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$, the probability that the pair of empirical measures lies in $\Gamma$ obeys large deviation upper and lower bounds,
where the interior and closure of $\Gamma$ are taken w.r.t. the pairwise weak convergence, respectively.
We prove the above result for a finite sample space and then extend it to a general Polish space, with two simple combinatorial lemmas as prerequisites. See details in Appendix A.
3.2 Exact exponent of the type-II error probability
With the extended Sanov's theorem and the vanishing threshold given in Eq. (1), we are ready to establish the exponential decay of the type-II error probability. Our result follows.
Theorem 3.
Proof.
We use the fact that testing whether $\mathrm{MMD}(P, Q) = 0$ is equivalent to testing whether $P = Q$. Since the threshold in Eq. (1) vanishes as the sample sizes grow, it is eventually smaller than any fixed $\delta > 0$, and hence the acceptance region is eventually contained in the region where the estimated MMD is at most $\delta$. By the extended Sanov's theorem, the type-II error probability decays at least exponentially fast if the closure of the acceptance region excludes the pair of generating distributions, which can be satisfied by picking $\delta$ small enough under $H_1$ and using the weak convergence property of the MMD (cf. Theorem 1). We then show that the exponential decay rate is both lower and upper bounded by the same quantity, based on the lower semicontinuity of the KLD Van Erven and Harremos (2014) and Stein's lemma Dembo and Zeitouni (2009), respectively. Details can be found in Appendix B. ∎
Therefore, when $P \neq Q$, the type-II error probability vanishes exponentially fast, with an exponent arbitrarily close to the one in Theorem 3. The result also shows that kernels only affect the subexponential term in the type-II error probability, provided that they meet the conditions of A2.
Not covered in Theorem 3 is the case where $m$ and $n$ scale in different orders, i.e., where one sample size grows in a strictly faster order than the other. Without loss of generality, we may consider only one of the two cases. If $P \neq Q$ under the alternative hypothesis, then (Zhu et al., 2018, Theorem 4) indicates that
(2) 
which leads to a degenerate result on the error exponent w.r.t. the total sample size.
Notice that in this regime the exponent normalized by the total sample size is zero, so Theorem 3 still holds if we remove the same-order assumption. However, an error exponent of zero also includes the case where the type-II error probability is bounded away from zero. The more insightful perspective is to look at Eq. (2), under which the test is said to be exponentially consistent w.r.t. the smaller sample size.
3.3 Optimal exponent and more exponentially consistent two-sample tests
We can identify other two-sample tests that are at least exponentially consistent based on the above results. In particular, the lower bounds still hold if another test has a smaller type-II error probability, or if its acceptance region is contained in that of the kernel test under $H_1$. A special case is considered in the following theorem, which follows directly from Theorem 3 and Eq. (2).
Theorem 4.
The above theorem characterizes only the type-II error exponent. A suitable threshold is needed to guarantee that the test is level-$\alpha$ for practical use. Our next result provides an upper bound on the optimal type-II error exponent of any (asymptotically) level-$\alpha$ test.
Theorem 5.
Let the relevant quantities be defined as in Theorem 4. For a test which is (asymptotically) level-$\alpha$, its type-II error probability satisfies
if and ; and
if and .
Proof.
Let be such that for . Define , and , where is fixed and can be arbitrary. Here and are Radon–Nikodym derivatives and exist by the finiteness of . Consider the acceptance region , from which we can obtain an upper bound on the type-II error exponent of . Since can be arbitrarily small, this yields an upper bound on the type-II error exponent. When , we can set and apply the above argument; alternatively, we may compare the test with the optimal goodness-of-fit test in Zhu et al. (2018) and use Stein's lemma Dembo and Zeitouni (2009) to establish the upper bound. See Appendix C for details. ∎
This theorem shows that the kernel test is an optimal level-$\alpha$ two-sample test, by choosing the type-II error exponent as the asymptotic performance metric. Moreover, Theorems 4 and 5 together provide a way of finding more asymptotically optimal two-sample tests:

An unbiased estimator of the squared MMD, denoted by $\mathrm{MMD}_u^2$, is also proposed in Gretton et al. (2012a). With a suitable threshold, the corresponding test is a level-$\alpha$ test. As the difference between the unbiased and biased statistics is finitely bounded and vanishes with the sample sizes, the acceptance region of the unbiased test is eventually a subset of that of the biased test under $H_1$. Its type-II error probability therefore vanishes exponentially at the same optimal rate.
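The unbiased variant removes the diagonal kernel terms; a self-contained sketch is below (Gaussian kernel and names are illustrative assumptions).

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    """Gram matrix of the Gaussian kernel with the given bandwidth."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_unbiased(X, Y, bandwidth=1.0):
    """Unbiased quadratic-time estimate of the squared MMD: the diagonal kernel
    terms are excluded, so the estimate can be slightly negative under the null."""
    m, n = len(X), len(Y)
    Kxx = gaussian_gram(X, X, bandwidth)
    Kyy = gaussian_gram(Y, Y, bandwidth)
    Kxy = gaussian_gram(X, Y, bandwidth)
    return ((Kxx.sum() - np.trace(Kxx)) / (m * (m - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (n * (n - 1))
            - 2.0 * Kxy.sum() / (m * n))
```

Under the null the estimate fluctuates around zero, while under a fixed alternative it concentrates around the positive population value, matching the consistency discussion above.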

It is also possible to consider a family of kernels for the test statistic Fukumizu et al. (2009); Sriperumbudur (2016). For a given family $\mathcal{K}$, the test statistic is the supremum of the MMD over $\mathcal{K}$, which also metrizes weak convergence under suitable conditions, e.g., when $\mathcal{K}$ consists of finitely many Gaussian kernels (Sriperumbudur, 2016, Theorem 3.2). If $K$ remains an upper bound on every kernel in the family, then comparing the statistic with the threshold in Eq. (1) results in an asymptotically optimal level-$\alpha$ test.
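For a finite family of Gaussian kernels, the sup-statistic can be sketched as follows; the bandwidth grid and function names are illustrative assumptions. Every Gaussian kernel is bounded by $K = 1$, so the family shares one bound.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_biased(X, Y, bandwidth):
    """Biased MMD estimate (square root of the biased squared-MMD statistic)."""
    m, n = len(X), len(Y)
    mmd2 = (gaussian_gram(X, X, bandwidth).sum() / m**2
            + gaussian_gram(Y, Y, bandwidth).sum() / n**2
            - 2.0 * gaussian_gram(X, Y, bandwidth).sum() / (m * n))
    return np.sqrt(max(mmd2, 0.0))

def sup_mmd_biased(X, Y, bandwidths=(0.25, 0.5, 1.0, 2.0, 4.0)):
    """Supremum of the biased MMD estimate over a finite Gaussian kernel family."""
    return max(mmd_biased(X, Y, bw) for bw in bandwidths)
```

By construction the sup-statistic dominates every single-kernel statistic in the family, so it inherits their detection power at the cost of a slightly larger null value.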
3.4 Discussions
Fair alternative
In Ramdas et al. (2015), a notion of fair alternative is proposed for two-sample testing as dimension increases, which is to fix the KLD between the two distributions under the alternative hypothesis for all dimensions. This idea is guided by the fact that the KLD is a fundamental information-theoretic quantity determining the hardness of hypothesis testing problems. This approach, however, does not take into account the impact of sample sizes. In light of our results, perhaps a better choice is to fix the error exponent in Theorem 3 when the sample sizes grow in the same order. In practice, this exponent may be hard to compute, so fixing its upper bound, and hence the KLD, is reasonable.
Kernel choice
The main results indicate that the type-II error exponent is independent of kernels as long as they are bounded continuous and characteristic. We remark that this does not contradict previous studies on kernel choice, as the subexponential term can dominate in the finite sample regime. In light of the exponential consistency, this raises interesting connections with a kernel selection strategy, where part of the samples are used as training data to choose a kernel and the remaining samples are used with the selected kernel to compute the test statistic Gretton et al. (2012b); Sutherland et al. (2017). On the one hand, the sample size should not be too small, so that there are enough data for training. On the other hand, if the number of samples is large enough and the exponential decay term becomes dominant, directly using all the samples may be good enough to obtain a low type-II error probability, provided that the kernel is not too poor. This point will be further illustrated by experiments in Section 5.
Threshold choice
As also discussed in Zhu et al. (2018), the distribution-free threshold in Eq. (1) is loose in general Gretton et al. (2012a). In practice, the threshold can be computed based on some estimate of the null distribution from the given samples, such as a bootstrap procedure or the eigenspectrum of the Gram matrix on the aggregate sample Gretton et al. (2009, 2012a). While these approaches can meet the level constraint in the large sample limit, they bring additional randomness to the threshold and further to the type-II error probability. Similar to Zhu et al. (2018), we can take the minimum of such a threshold and the distribution-free one to achieve the optimal type-II error exponent, while the type-I error constraint holds in the asymptotic sense.
Other discrepancy measures
Other distance measures between distributions may also metrize the weak convergence on $\mathcal{P}(\mathcal{X})$, such as the Lévy–Prokhorov metric, the bounded Lipschitz metric, and the Wasserstein distance. If we directly compute such a distance between the empirical measures and compare it with a decreasing threshold, the obtained test would have the same optimal type-II error exponent as in Theorem 4. However, unlike Lemma 1 for the MMD based statistic, there does not exist a uniform or distribution-free threshold such that the level constraint is satisfied for all sample sizes. Similar to the kernel Stein discrepancy based goodness-of-fit test in Zhu et al. (2018), a possible remedy is to relax the level constraint to an asymptotic one, but a uniform characterization of the decay rate of the estimated distance is still required. We do not expand in this direction, because computing such distance measures from samples is generally more costly than the MMD based statistics.
4 Application to offline change detection
In this section, we apply our results to the offline change detection problem.
Let $Z_1, \dots, Z_n$ be an independent sequence of observations. Assume that there is at most one change-point at index $t$, which, if it exists, indicates that $Z_1, \dots, Z_t \overset{\text{i.i.d.}}{\sim} P$ and $Z_{t+1}, \dots, Z_n \overset{\text{i.i.d.}}{\sim} Q$ with $P \neq Q$. Offline change-point analysis consists of two steps: 1) detect if there is a change-point in the sample sequence; 2) estimate the index $t$ if such a change-point exists. Notice that a method may readily extend to multiple change-point and online settings through sliding windows running along the sequence, as in Desobry et al. (2005); Harchaoui et al. (2009); Li et al. (2015a).
The first step in the change-point analysis is usually formulated as a hypothesis testing problem: the null hypothesis that no change-point exists against the alternative that a change-point exists at some index $t$.
Let $\hat{P}_t$ and $\hat{Q}_t$ denote the empirical measures of the segments $Z_1, \dots, Z_t$ and $Z_{t+1}, \dots, Z_n$, respectively. Then an MMD based test can be directly constructed using the maximum partition strategy: compute $\mathrm{MMD}(\hat{P}_t, \hat{Q}_t)$ at each candidate index and compare the maximum with a threshold,
where the maximum is searched over an interval of candidate indices bounded away from the two endpoints of the sequence. If the test favors the alternative, we can proceed to estimate the change-point index by the maximizing index. Here we characterize the performance of detecting the presence of a change for this test, using Theorems 3 and 5. We remark that the assumptions on the search interval and on the change-point index in the following theorem are standard practice in this setting Basseville and Nikiforov (1993); Desobry et al. (2005); Harchaoui et al. (2009); James et al. (1987); Li et al. (2015a).
Theorem 6.
Let and as . Under the alternative hypothesis, assume that the change-point index satisfies , and that where is defined in Theorem 3. Further assume that the kernel satisfies A2, with being an upper bound. Given , set and . Then the test is level-$\alpha$ and also achieves the optimal type-II error exponent, that is,
where the quantities above denote the type-I and type-II error probabilities, respectively.
Proof.
Since the type-I error probability is bounded by the sum over candidate indices, it suffices to make each summand small enough under the null hypothesis. This can be verified using Lemma 1 with the choice of threshold in the above theorem. To see the optimal type-II error exponent, consider a simpler problem where the possible change-point is known, i.e., a two-sample problem between the pre-change and post-change segments. Since the segment sizes scale in the same order as $n \to \infty$, applying Theorems 3 and 5 establishes the optimal type-II error exponent. ∎
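The maximum partition scan above can be sketched as follows; the Gaussian kernel, the bandwidth, and all names are illustrative assumptions.

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd_biased(X, Y, bandwidth=1.0):
    """Biased MMD estimate between two segments."""
    m, n = len(X), len(Y)
    mmd2 = (gaussian_gram(X, X, bandwidth).sum() / m**2
            + gaussian_gram(Y, Y, bandwidth).sum() / n**2
            - 2.0 * gaussian_gram(X, Y, bandwidth).sum() / (m * n))
    return np.sqrt(max(mmd2, 0.0))

def scan_changepoint(Z, t_min, t_max, bandwidth=1.0):
    """Maximum partition strategy: split the sequence at every candidate index,
    compute the biased MMD between the two segments, and return the maximum
    statistic together with the maximizing index (the change-point estimate)."""
    stats = {t: mmd_biased(Z[:t], Z[t:], bandwidth)
             for t in range(t_min, t_max + 1)}
    t_hat = max(stats, key=stats.get)
    return stats[t_hat], t_hat
```

Comparing the returned maximum with a threshold gives the detection step; the maximizer gives the change-point estimate.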
5 Experiments
This section presents empirical results to validate our previous findings. We begin with a toy example to demonstrate the exponential consistency, and then consider how kernel choice and sample size affect the type-II error probability. We set equal sample sizes, i.e., $m = n$, and fix the significance level in all experiments.
Exponential consistency
While there have been various experiments on the type-II error probability, its exponential decay behavior has been scarcely reported. To this end, we perform a simple experiment and display the type-II error probability on a logarithmic scale. Let $P$ and $Q$ be two Gaussian distributions with different means and identity covariance matrix. We use the biased test statistic with a Gaussian kernel. A fixed bandwidth and the median heuristic are employed for the kernel bandwidth. We also consider two threshold choices: one from the Large Deviation Bound (LDB) given in Eq. (1), and the other from a bootstrap method in Gretton et al. (2012a). We repeat the experiment over many trials and report the result in Figure 1.
We observe that all the type-II error probabilities exhibit an exponential decay as the sample size increases. The LDB threshold is quite conservative, and the corresponding error probability only starts decaying with many more samples. Although the main theorems in Section 3 do not cover the median bandwidth, the figure shows that it also leads to an exponential decay of the type-II error probability. This might be because the median distance lies within a small neighborhood of some fixed bandwidth in this experiment, hence behaving similarly.
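A permutation-style estimate of the null quantile is a common stand-in for the bootstrap threshold used above; a minimal sketch, with all names and parameters as illustrative assumptions:

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(X, Y, bandwidth=1.0):
    m, n = len(X), len(Y)
    return (gaussian_gram(X, X, bandwidth).sum() / m**2
            + gaussian_gram(Y, Y, bandwidth).sum() / n**2
            - 2.0 * gaussian_gram(X, Y, bandwidth).sum() / (m * n))

def permutation_threshold(X, Y, alpha=0.05, n_perm=200, bandwidth=1.0, seed=0):
    """Estimate the (1 - alpha) null quantile of the biased MMD^2 statistic by
    repeatedly re-partitioning the pooled sample at random."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    m = len(X)
    stats = []
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        stats.append(mmd2_biased(pooled[idx[:m]], pooled[idx[m:]], bandwidth))
    return float(np.quantile(stats, 1.0 - alpha))
```

Because the permuted splits mimic the null, this threshold is typically far less conservative than the distribution-free one, consistent with the observation above.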
Kernel choice vs. Sample size
Following the discussions in Section 3.4, we investigate how kernel choice and sample size affect the test performance. We consider Gaussian kernels, which are determined by their bandwidths. Sutherland et al. (2017) use part of the samples as training data to select the bandwidth, which we call the trained bandwidth. The estimated MMD is then computed using the trained bandwidth and the remaining samples.
For the first experiment, we take a setting similar to Sutherland et al. (2017): $P$ is a grid of 2D standard normals with a fixed spacing between the centers; $Q$ is laid out identically, but with a nonzero covariance between the coordinates. We generate the same number of samples from each distribution and pick two splitting ratios for computing the trained bandwidth; the remaining samples are used to calculate the test statistic. For each case, we report in Figure 2 the type-II error probabilities over different bandwidths, averaged over many trials. The unbiased test statistic is used and the test threshold is obtained using bootstrap with permutations. We also mark the trained bandwidths corresponding to the respective sample sizes in the figure (red star marker).
Figure 2 verifies that the trained bandwidth is close to the optimal one in terms of the type-II error probability. Moreover, it indicates that a large range of bandwidths leads to lower or comparable error probabilities if we directly use all the samples for testing. As the sample size increases, the exponential decay term in the type-II error probability becomes dominant and the effect of kernel choice diminishes. However, since the desired range of bandwidths is not known in advance, an interesting question is when we should split data for kernel selection and what a proper splitting ratio is.
In the second experiment, we directly use the setup in Liu et al. (2016): we draw one set of samples from a fixed distribution and then generate the other set by adding standard Gaussian noise (a perturbation) to them. We consider several splitting ratios of the entire samples used as training data and compute the statistic based on the rest. For comparison, kernel tests with fixed bandwidths are also evaluated, which estimate the MMD based on all the samples. All the test thresholds are computed using bootstrap. We repeat many trials and report the type-II error probabilities in Figure 2. The results show that the more samples we use to compute the test statistic, the lower the type-II error probability; in other words, kernel choice is less important than sample size in this setting. This point is further illustrated in Figure 2, where we show the type-II error probabilities of two sample sizes over different kernel bandwidths. The kernel selection strategy in Sutherland et al. (2017) does not perform well in this experiment, which also motivates future studies on when to use such a kernel selection strategy.
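A minimal sketch of the splitting strategy follows, using the largest training-split MMD estimate as a simplified selection criterion (the actual criterion in Sutherland et al. (2017) normalizes by a variance estimate; the bandwidth grid and names are assumptions).

```python
import numpy as np

def gaussian_gram(A, B, bandwidth):
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * bandwidth**2))

def mmd2_biased(X, Y, bandwidth):
    m, n = len(X), len(Y)
    return (gaussian_gram(X, X, bandwidth).sum() / m**2
            + gaussian_gram(Y, Y, bandwidth).sum() / n**2
            - 2.0 * gaussian_gram(X, Y, bandwidth).sum() / (m * n))

def split_and_select(X, Y, bandwidths, split_ratio=0.5):
    """Use the first split_ratio fraction of each sample to pick a bandwidth
    (here: the one maximizing the training-split MMD^2 estimate), and return it
    together with the held-out test statistic computed from the remaining samples."""
    mt, nt = int(len(X) * split_ratio), int(len(Y) * split_ratio)
    best_bw = max(bandwidths, key=lambda bw: mmd2_biased(X[:mt], Y[:nt], bw))
    return best_bw, mmd2_biased(X[mt:], Y[nt:], best_bw)
```

The trade-off discussed above is visible here: a larger split ratio gives a better-chosen bandwidth but leaves fewer samples for the held-out statistic, which carries the exponential decay.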
6 Conclusion
In this paper, a class of kernel two-sample tests is shown to be exponentially consistent and to attain the optimal type-II error exponent, provided that the kernels are bounded continuous and characteristic. A notable implication is that kernels affect only the subexponential term in the type-II error probability. We apply our results to offline change detection and show that a kernel based test achieves optimal detection in the nonparametric setting. Finally, we empirically investigate how kernel choice and sample size affect the test performance.
References
 Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Basseville and Nikiforov (1993) M. Basseville and I. V. Nikiforov. Detection of Abrupt Changes: Theory and Application. Englewood Cliffs: Prentice Hall, 1993.
 Berlinet and Thomas-Agnan (2011) A. Berlinet and C. Thomas-Agnan. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
 Borgwardt et al. (2006) K. M. Borgwardt, A. Gretton, M. J. Rasch, H.-P. Kriegel, B. Schölkopf, and A. J. Smola. Integrating structured biological data by kernel maximum mean discrepancy. Bioinformatics, 22(14):e49–e57, July 2006.
 Bu et al. (2016) Y. Bu, S. Zou, Y. Liang, and V. V. Veeravalli. Estimation of KL divergence: Optimal minimax rate. arXiv preprint arXiv:1607.02653, 2016.
 Casella and Berger (2002) G. Casella and R. Berger. Statistical Inference. Duxbury Thomson Learning, 2002.
 Cover and Thomas (2006) T. M. Cover and J. A. Thomas. Elements of Information Theory. New York: Wiley, 2nd edition, 2006.
 Csiszár (2006) I. Csiszár. A simple proof of Sanov’s theorem. Bulletin of the Brazilian Mathematical Society, 37(4):453–459, 2006.
 Dembo and Zeitouni (2009) A. Dembo and O. Zeitouni. Large Deviations Techniques and Applications. New York: Springer, 2009.
 Desobry et al. (2005) F. Desobry, M. Davy, and C. Doncarli. An online kernel change detection algorithm. IEEE Transactions on Signal Processing, 53(8):2961–2974, 2005.
 Dziugaite et al. (2015) G. K. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015.
 Friedman and Rafsky (1979) J. H. Friedman and L. C. Rafsky. Multivariate generalizations of the Wald–Wolfowitz and Smirnov two-sample tests. The Annals of Statistics, pages 697–717, 1979.
 Fukumizu et al. (2009) K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. K. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. In Advances in Neural Information Processing Systems, 2009.
 Gretton et al. (2009) A. Gretton, K. Fukumizu, Z. Harchaoui, and B. Sriperumbudur. A fast, consistent kernel two-sample test. In Advances in Neural Information Processing Systems, 2009.
 Gretton et al. (2012a) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012a.
 Gretton et al. (2012b) A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. K. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, 2012b.
 Harchaoui et al. (2009) Z. Harchaoui, E. Moulines, and F. R. Bach. Kernel change-point analysis. In Advances in Neural Information Processing Systems, 2009.
 James et al. (1987) B. James, K. L. James, and D. Siegmund. Tests for a change-point. Biometrika, 74(1):71–83, 1987.
 Kim et al. (2016) B. Kim, R. Khanna, and O. O. Koyejo. Examples are not enough, learn to criticize! Criticism for interpretability. In Advances in Neural Information Processing Systems, 2016.
 Li et al. (2017) C.-L. Li, W.-C. Chang, Y. Cheng, Y. Yang, and B. Póczos. MMD GAN: Towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems, 2017.
 Li et al. (2015a) S. Li, Y. Xie, H. Dai, and L. Song. M-statistic for kernel change-point detection. In Advances in Neural Information Processing Systems, pages 3366–3374, 2015a.
 Li et al. (2015b) Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In International Conference on Machine Learning, 2015b.
 Liu et al. (2016) Q. Liu, J. Lee, and M. Jordan. A kernelized Stein discrepancy for goodnessoffit tests. In International Conference on Machine Learning, 2016.
 Lloyd and Ghahramani (2015) J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. In Advances in Neural Information Processing Systems, 2015.
 Muandet et al. (2017) K. Muandet, K. Fukumizu, B. Sriperumbudur, and B. Schölkopf. Kernel mean embedding of distributions: A review and beyond. Foundations and Trends in Machine Learning, 10(1–2):1–141, 2017.
 Nguyen et al. (2010) X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, 2010.
 Ramdas et al. (2015) A. Ramdas, S. J. Reddi, B. Póczos, A. Singh, and L. A. Wasserman. On the decreasing power of kernel and distance based nonparametric hypothesis tests in high dimensions. In AAAI, 2015.
 Simon-Gabriel and Schölkopf (2016) C.-J. Simon-Gabriel and B. Schölkopf. Kernel distribution embeddings: Universal kernels, characteristic kernels and kernel metrics on distributions. arXiv preprint arXiv:1604.05251, 2016.
 Sriperumbudur (2016) B. Sriperumbudur. On the optimal estimation of probability measures in weak and strong topologies. Bernoulli, 22(3):1839–1893, 08 2016.
 Sriperumbudur et al. (2010) B. K. Sriperumbudur, A. Gretton, K. Fukumizu, B. Schölkopf, and G. R. Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517–1561, 2010.
 Sutherland et al. (2017) D. Sutherland, H. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017.
 Szabó et al. (2015) Z. Szabó, A. Gretton, B. Póczos, and B. Sriperumbudur. Twostage sampled learning theory on distributions. In Artificial Intelligence and Statistics, 2015.
 Szabó et al. (2016) Z. Szabó, B. K. Sriperumbudur, B. Póczos, and A. Gretton. Learning theory for distribution regression. The Journal of Machine Learning Research, 17(1):5272–5311, 2016.
 Van Erven and Harremos (2014) T. Van Erven and P. Harremos. Rényi divergence and Kullback–Leibler divergence. IEEE Transactions on Information Theory, 60(7):3797–3820, 2014.
 Zaremba et al. (2013) W. Zaremba, A. Gretton, and M. Blaschko. B-test: A nonparametric, low variance kernel two-sample test. In Advances in Neural Information Processing Systems, 2013.
 Zhu et al. (2018) S. Zhu, B. Chen, P. Yang, and Z. Chen. Universal hypothesis testing with kernels: Asymptotically optimal tests for goodness of fit. arXiv preprint arXiv:1802.07581, 2018.
Appendix
Appendix A Proof of the extended Sanov’s theorem
We first prove the result for a finite sample space and then extend it to a general Polish space. The prerequisites are two combinatorial lemmas that are standard tools in information theory.
For a positive integer $n$, let $\mathcal{P}_n$ denote the set of probability distributions on the finite sample space whose probability masses are integer multiples of $1/n$ (the set of types). Stated below are the two lemmas.
Lemma 2 ((Cover and Thomas, 2006, Theorem 11.1.1)).
Lemma 3 ((Cover and Thomas, 2006, Theorem 11.1.4)).
Assume $X_1, \dots, X_n$ i.i.d. $P$, where $P$ is a distribution defined on the finite sample space. For any distribution of the above form, the probability that the empirical distribution of $X_1, \dots, X_n$ equals it satisfies
A.1 Finite sample space
Upper bound
Let denote the cardinality of . Without loss of generality, assume that . Hence, the open set is nonempty. As , we can find and such that there exists for all and , and that as . Then we have, with and ,
where the last inequality is from Lemma 3. It follows that
Lower bound
A.2 Polish sample space
We consider the general case where $\mathcal{X}$ is a Polish space. Now $\mathcal{P}(\mathcal{X})$ is the space of probability measures on $\mathcal{X}$, endowed with the topology of weak convergence. To proceed, we introduce another topology on $\mathcal{P}(\mathcal{X})$ and an equivalent definition of the KLD.
The $\tau$-topology: denote by $\Pi$ the set of all partitions of $\mathcal{X}$ into a finite number of measurable sets. For a partition in $\Pi$, a probability measure, and a radius $\varepsilon > 0$, denote
(4) 
The $\tau$-topology on $\mathcal{P}(\mathcal{X})$ is the coarsest topology in which the mappings $\mu \mapsto \mu(A)$ are continuous for every measurable set $A$. A base for this topology is the collection of the sets in (4). We write the interior and closure of a set w.r.t. this topology with a subscript $\tau$. We remark that the $\tau$-topology is stronger than the weak topology: any set open w.r.t. the weak topology is also open w.r.t. the $\tau$-topology (see more details in Csiszár (2006); Dembo and Zeitouni (2009)). The product $\tau$-topology on $\mathcal{P}(\mathcal{X}) \times \mathcal{P}(\mathcal{X})$ is determined by a base of sets of the form
for the respective partitions, measures, and radii. We still use the same notation to denote the interior and closure of a set. As there always exists a partition that refines both given partitions, any element of the base contains an open subset
for some .
Another definition of the KLD: we will also use the equivalent partition-based definition
$$D(\mu \,\|\, \nu) = \sup_{\mathcal{A}} \sum_{A \in \mathcal{A}} \mu(A) \log \frac{\mu(A)}{\nu(A)},$$
where the supremum is over all partitions of $\mathcal{X}$ into finitely many measurable sets, with the conventions $0 \log 0 = 0$ and $\mu(A) \log \frac{\mu(A)}{\nu(A)} = \infty$ if $\mu(A) > 0$ and $\nu(A) = 0$. The inner sum is the KLD between the discrete probability measures obtained from $\mu$, $\nu$ and the partition $\mathcal{A}$. It is not hard to verify that