Interpretable Distribution Features with Maximum Testing Power
Abstract
Two semimetrics on probability distributions are proposed, given as the sum of differences of expectations of analytic functions evaluated at spatial or frequency locations (i.e., features). The features are chosen so as to maximize the distinguishability of the distributions, by optimizing a lower bound on test power for a statistical test using these features. The result is a parsimonious and interpretable indication of how and where two distributions differ locally. We show that the empirical estimate of the test power criterion converges with increasing sample size, ensuring the quality of the returned features. In real-world benchmarks on high-dimensional text and image data, linear-time tests using the proposed semimetrics achieve comparable performance to the state-of-the-art quadratic-time maximum mean discrepancy test, while returning human-interpretable features that explain the test results.
1 Introduction
We address the problem of discovering features of distinct probability distributions, with which they can most easily be distinguished. The distributions may be in high dimensions, can differ in non-trivial ways (i.e., not simply in their means), and are observed only through i.i.d. samples. One application for such divergence measures is to model criticism, where samples from a trained model are compared with a validation sample: in the univariate case, through the KL divergence (Carota et al., 1996), or in the multivariate case, by use of the maximum mean discrepancy (MMD) (Lloyd and Ghahramani, 2015). An alternative, interpretable analysis of a multivariate difference in distributions may be obtained by projecting onto a discriminative direction, such that the Wasserstein distance on this projection is maximized (Mueller and Jaakkola, 2015). Note that both recent works require low dimensionality, either explicitly (in the case of Lloyd and Ghahramani, the function becomes difficult to plot in more than two dimensions), or implicitly in the case of Mueller and Jaakkola, in that a large difference in distributions must occur in projection along a particular one-dimensional axis. Distances between distributions in high dimensions may be more subtle, however, and it is of interest to find interpretable, distinguishing features of these distributions.
In the present paper, we take a hypothesis testing approach to discovering features which best distinguish two multivariate probability measures $P$ and $Q$, as observed by samples $X := \{x_i\}_{i=1}^n$ drawn independently and identically (i.i.d.) from $P$, and $Y := \{y_i\}_{i=1}^n$ drawn i.i.d. from $Q$. Nonparametric two-sample tests based on RKHS distances (Eric et al., 2008; Fromont et al., 2012; Gretton et al., 2012a) or energy distances (Székely and Rizzo, 2004; Baringhaus and Franz, 2004) have as their test statistic an integral probability metric, the Maximum Mean Discrepancy (Gretton et al., 2012a; Sejdinovic et al., 2013). For this metric, a smooth witness function is computed, such that its amplitude is largest where the probability mass differs most (e.g. Gretton et al., 2012a, Figure 1). Lloyd and Ghahramani (2015) used this witness function to compare the model output of the Automated Statistician (Lloyd et al., 2014) with a reference sample, yielding a visual indication of where the model fails. In high dimensions, however, the witness function cannot be plotted, and is less helpful. Furthermore, the witness function does not give an easily interpretable result for distributions with local differences in their characteristic functions. A more subtle shortcoming is that it does not provide a direct indication of the distribution features which, when compared, would maximize test power; rather, it is the witness function norm, and (broadly speaking) its variance under the null, that determine test power.
Our approach builds on the analytic representations of probability distributions of Chwialkowski et al. (2015), where differences in expectations of analytic functions at particular spatial or frequency locations are used to construct a two-sample test statistic, which can be computed in linear time. Despite the differences in these analytic functions being evaluated at random locations, the analytic tests have greater power than linear-time tests based on subsampled estimates of the MMD (Gretton et al., 2012b; Zaremba et al., 2013). Our first theoretical contribution, in Sec. 3, is to derive a lower bound on the test power, which can be maximized over the choice of test locations. We propose two novel tests, both of which significantly outperform the random feature choice of Chwialkowski et al. (2015). The Mean Embeddings (ME) test evaluates the difference of mean embeddings at locations chosen to maximize the test power lower bound (i.e., spatial features); unlike the maxima of the MMD witness function, these features are directly chosen to maximize the distinguishability of the distributions, and take variance into account. The Smooth Characteristic Function (SCF) test uses as its statistic the difference of the two smoothed empirical characteristic functions, evaluated at points in the frequency domain so as to maximize the same criterion (i.e., frequency features). Optimization of the mean embedding kernels/frequency smoothing functions themselves is achieved on a held-out data set with the same consistent objective.
As our second theoretical contribution in Sec. 3, we prove that the empirical estimate of the test power criterion asymptotically converges to its population quantity uniformly over the class of Gaussian kernels. Two important consequences follow: first, in testing, we obtain a more powerful test with fewer features. Second, we obtain a parsimonious and interpretable set of features that best distinguish the probability distributions. In Sec. 4, we provide experiments demonstrating that the proposed linear-time tests greatly outperform all previous linear-time tests, and achieve performance that compares to or exceeds that of the more expensive quadratic-time MMD test (Gretton et al., 2012a). Moreover, the new tests discover features of text data (NIPS proceedings) and image data (distinct facial expressions) which have a clear human interpretation, thus validating our feature elicitation procedure in these challenging high-dimensional testing scenarios.
2 ME and SCF tests
In this section, we review the ME and SCF tests (Chwialkowski et al., 2015) for two-sample testing. In Sec. 3, we will extend these approaches to learn features that optimize the power of these tests. Given two samples $X := \{x_i\}_{i=1}^n$ and $Y := \{y_i\}_{i=1}^n$ independently and identically distributed (i.i.d.) according to $P$ and $Q$, respectively, the goal of a two-sample test is to decide whether $P$ is different from $Q$ on the basis of the samples. The task is formulated as a statistical hypothesis test proposing a null hypothesis $H_0 : P = Q$ (samples are drawn from the same distribution) against an alternative hypothesis $H_1 : P \neq Q$ (the sample generating distributions are different). A test calculates a test statistic $\hat{\lambda}_n$ from $X$ and $Y$, and rejects $H_0$ if $\hat{\lambda}_n$ exceeds a predetermined test threshold (critical value). The threshold is given by the $(1-\alpha)$-quantile of the (asymptotic) distribution of $\hat{\lambda}_n$ under $H_0$, i.e., the null distribution, where $\alpha$ is the significance level of the test.
ME test The ME test uses as its test statistic $\hat{\lambda}_n := n \bar{\mathbf{z}}_n^\top \mathbf{S}_n^{-1} \bar{\mathbf{z}}_n$, a form of Hotelling's T-squared statistic, where $\bar{\mathbf{z}}_n := \frac{1}{n} \sum_{i=1}^n \mathbf{z}_i$, $\mathbf{S}_n := \frac{1}{n-1} \sum_{i=1}^n (\mathbf{z}_i - \bar{\mathbf{z}}_n)(\mathbf{z}_i - \bar{\mathbf{z}}_n)^\top$, and $\mathbf{z}_i := (k(x_i, v_j) - k(y_i, v_j))_{j=1}^J \in \mathbb{R}^J$. The statistic depends on a positive definite kernel $k : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ (with $\mathcal{X} \subseteq \mathbb{R}^d$), and a set of $J$ test locations $\mathcal{V} := \{v_j\}_{j=1}^J \subset \mathbb{R}^d$. Under $H_0$, $\hat{\lambda}_n$ asymptotically follows $\chi^2(J)$, a chi-squared distribution with $J$ degrees of freedom. The ME test rejects $H_0$ if $\hat{\lambda}_n > T_\alpha$, where the test threshold $T_\alpha$ is given by the $(1-\alpha)$-quantile of the asymptotic null distribution $\chi^2(J)$. Although the distribution of $\hat{\lambda}_n$ under $H_1$ was not derived, Chwialkowski et al. (2015) showed that if $k$ is analytic, integrable and characteristic (in the sense of Sriperumbudur et al. (2011)), under $H_1$, $\hat{\lambda}_n$ can be arbitrarily large as $n \to \infty$, allowing the test to correctly reject $H_0$.
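As a concrete illustration (our own sketch, not the authors' released implementation; all function names are ours), the ME statistic with a Gaussian kernel can be written in a few lines of NumPy:

```python
import numpy as np

def gauss_kernel(X, V, sigma2):
    """Gaussian kernel k(x, v) = exp(-||x - v||^2 / (2 * sigma2)),
    evaluated between every row of X (n, d) and every row of V (J, d)."""
    sq = np.sum(X**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * X @ V.T
    return np.exp(-sq / (2.0 * sigma2))

def me_statistic(X, Y, V, sigma2):
    """ME statistic: n * zbar^T S^{-1} zbar, where
    z_i = (k(x_i, v_j) - k(y_i, v_j)) for j = 1..J."""
    n = X.shape[0]
    Z = gauss_kernel(X, V, sigma2) - gauss_kernel(Y, V, sigma2)  # (n, J)
    zbar = Z.mean(axis=0)
    S = np.atleast_2d(np.cov(Z.T))  # sample covariance, (J, J)
    return float(n * zbar @ np.linalg.solve(S, zbar))
```

Under $H_0$ the value is approximately a draw from $\chi^2(J)$, so the test compares it against a chi-squared quantile; in practice a small ridge is added to the covariance for numerical stability, as discussed below.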
One can intuitively think of the ME test statistic as a squared normalized (by the inverse covariance $\mathbf{S}_n^{-1}$) distance of the mean embeddings (Smola et al., 2007) of the empirical measures $P_n := \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ and $Q_n := \frac{1}{n} \sum_{i=1}^n \delta_{y_i}$, where $\delta_x$ is the Dirac measure concentrated at $x$. The unnormalized counterpart (i.e., without $\mathbf{S}_n^{-1}$) was shown by Chwialkowski et al. (2015) to be a metric on the space of probability measures for any $\mathcal{V}$. Both variants behave similarly for two-sample testing, with the normalized version being a semimetric having a more computationally tractable null distribution, i.e., $\chi^2(J)$.
SCF test The SCF test uses a statistic of the same form as the ME test statistic, with modified features $\mathbf{z}_i := \big(\hat{l}(x_i) \sin(x_i^\top v_j) - \hat{l}(y_i) \sin(y_i^\top v_j),\; \hat{l}(x_i) \cos(x_i^\top v_j) - \hat{l}(y_i) \cos(y_i^\top v_j)\big)_{j=1}^J \in \mathbb{R}^{2J}$, where $\hat{l}$ is the Fourier transform of $l$, and $l$ is an analytic translation-invariant kernel, i.e., $k(x, y) = l(x - y)$ defines a positive definite kernel for $x$ and $y$. In contrast to the ME test, which defines its statistic in terms of spatial locations, the locations $\{v_j\}_{j=1}^J$ in the SCF test are in the frequency domain. As a brief description, let $\varphi_P(v) := \mathbb{E}_{x \sim P}[\exp(i v^\top x)]$ be the characteristic function of $P$. Define a smooth characteristic function as $\phi_P(v) := \int \varphi_P(t)\, l(v - t)\, \mathrm{d}t$ (Chwialkowski et al., 2015, Definition 2). Then, similar to the ME test, the statistic defined by the SCF test can be seen as a normalized (by $\mathbf{S}_n^{-1}$) version of the distance between the empirical $\phi_P$ and $\phi_Q$, evaluated at the locations $\{v_j\}_{j=1}^J$. The SCF test statistic has asymptotic distribution $\chi^2(2J)$ under $H_0$. We will use $D$ to refer to the degrees of freedom of the chi-squared distribution, i.e., $D = J$ for the ME test, and $D = 2J$ for the SCF test.
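To make the frequency features concrete, the following sketch (our own illustration; a Gaussian smoothing function $l(x) = \exp(-\|x\|^2/(2\sigma^2))$ is assumed for simplicity, and all names are ours) builds the $2J$-dimensional sine/cosine features and the corresponding Hotelling-type statistic:

```python
import numpy as np

def scf_features(X, V, sigma2):
    """For each row x and frequency location v_j, the 2J features
    [l(x) sin(x^T v_j), l(x) cos(x^T v_j)], with l(x) = exp(-||x||^2/(2 sigma2))."""
    l = np.exp(-np.sum(X**2, axis=1) / (2.0 * sigma2))[:, None]  # (n, 1)
    P = X @ V.T                                                  # (n, J) projections
    return np.hstack([l * np.sin(P), l * np.cos(P)])             # (n, 2J)

def scf_statistic(X, Y, V, sigma2):
    """Hotelling-type statistic on feature differences; approx. chi2(2J) under H0."""
    n = X.shape[0]
    Z = scf_features(X, V, sigma2) - scf_features(Y, V, sigma2)
    zbar = Z.mean(axis=0)
    S = np.atleast_2d(np.cov(Z.T))
    return float(n * zbar @ np.linalg.solve(S, zbar))
```

Note that each trailing dimension doubles relative to the ME test, which is why the SCF null distribution has $2J$ degrees of freedom.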
In this work, we modify the statistic with a regularization parameter $\gamma_n > 0$, giving $\hat{\lambda}_n := n \bar{\mathbf{z}}_n^\top (\mathbf{S}_n + \gamma_n \mathbf{I})^{-1} \bar{\mathbf{z}}_n$, for stability of the matrix inverse. By the multivariate Slutsky theorem, under $H_0$, $\hat{\lambda}_n$ still asymptotically follows $\chi^2(D)$ provided that $\gamma_n \to 0$ as $n \to \infty$.
3 Lower bound on test power, consistency of empirical power statistic
This section contains our main results. We propose to optimize the test locations and kernel parameters (jointly referred to as $\theta$) by maximizing a lower bound on the test power in Proposition 3. This criterion offers a simple objective function for fast parameter tuning. The bound may be of independent interest for other Hotelling's T-squared statistics, since apart from the Gaussian case (e.g. Bilodeau and Brenner, 2008, Ch. 8), the characterization of such statistics under the alternative distribution is challenging. The optimization procedure is given in Sec. 4. We use $\lambda_n$ as a shorthand for $\lambda_n(\theta)$, and let $\|\cdot\|_F$ be the Frobenius norm.
Proposition 3 (Lower bound on ME test power).
Let $\mathcal{K}$ be a uniformly bounded (i.e., there exists $B > 0$ such that $\sup_{k \in \mathcal{K}} \sup_{x, x'} |k(x, x')| \leq B$) family of measurable kernels. Let $\mathcal{V}$ be a collection in which each element is a set of $J$ test locations. Assume that $\tilde{c} := \sup_{\mathcal{V}, k} \|\mathbf{\Sigma}^{-1}\|_F < \infty$. For large $n$, the test power $\mathbb{P}(\hat{\lambda}_n \geq T_\alpha)$ of the ME test satisfies $\mathbb{P}(\hat{\lambda}_n \geq T_\alpha) \geq L(\lambda_n)$, where $L(\lambda_n)$ is an explicit function of $\lambda_n$ whose constants are positive and depend only on $B$, $\tilde{c}$ and $J$. The parameter $\lambda_n := n \boldsymbol{\mu}^\top \mathbf{\Sigma}^{-1} \boldsymbol{\mu}$ is the population counterpart of $\hat{\lambda}_n := n \bar{\mathbf{z}}_n^\top \mathbf{S}_n^{-1} \bar{\mathbf{z}}_n$, where $\boldsymbol{\mu} := \mathbb{E}[\mathbf{z}_1]$ and $\mathbf{\Sigma} := \mathbb{E}[(\mathbf{z}_1 - \boldsymbol{\mu})(\mathbf{z}_1 - \boldsymbol{\mu})^\top]$. For large $n$, $L(\lambda_n)$ is increasing in $\lambda_n$.
Proof (sketch).
The idea is to construct a bound for $|\hat{\lambda}_n - \lambda_n|$, which involves bounding $\|\bar{\mathbf{z}}_n - \boldsymbol{\mu}\|$ and $\|\mathbf{S}_n - \mathbf{\Sigma}\|_F$ separately using Hoeffding's inequality. The result follows after a reparameterization of the bound on the test power. See Sec. F for details. ∎
Proposition 3 suggests that for large $n$ it is sufficient to maximize $\lambda_n$ to maximize a lower bound on the ME test power. The same conclusion holds for the SCF test (result omitted due to space constraints). Assume that $k$ is characteristic (Sriperumbudur et al., 2011). It can be shown that $\lambda_n = 0$ if and only if $P = Q$, i.e., $\lambda_n$ is a semimetric for $P$ and $Q$. In this sense, one can see $\lambda_n$ as encoding the ease of rejecting $H_0$: the higher $\lambda_n$, the easier for the test to correctly reject $H_0$ when $H_1$ holds. This observation justifies the use of $\lambda_n$ as a maximization objective for parameter tuning.
Contributions The statistic for both the ME and SCF tests depends on a set of test locations $\mathcal{V}$ and a kernel parameter $\sigma$. We propose to set $\theta := \{\mathcal{V}, \sigma\} = \arg\max_\theta \lambda_n$. The optimization of $\theta$ brings two benefits: first, it significantly increases the probability of rejecting $H_0$ when $H_1$ holds; second, the learned test locations act as discriminative features allowing an interpretation of how the two distributions differ. We note that optimizing parameters by maximizing a test power proxy (Gretton et al., 2012b) is valid under both $H_0$ and $H_1$ as long as the data used for parameter tuning and for testing are disjoint. If $H_0$ holds, then the optimized $\theta$ is arbitrary. Since the test statistic asymptotically follows $\chi^2(D)$ for any $\theta$, the optimization does not change the null distribution. Also, the rejection threshold $T_\alpha$ depends only on $\alpha$ and is independent of $\theta$.
To avoid creating a dependency between $\theta$ and the data used for testing (which would affect the null distribution), we split the data into two disjoint sets: a training set $\mathcal{D}^{tr} := \{X^{tr}, Y^{tr}\}$ and a test set $\mathcal{D}^{te} := \{X^{te}, Y^{te}\}$, such that $X^{tr} \cap X^{te} = \emptyset$ and $Y^{tr} \cap Y^{te} = \emptyset$. In practice, since the population mean $\boldsymbol{\mu}$ and covariance $\mathbf{\Sigma}$ defining $\lambda_n$ are unknown, we use $\hat{\lambda}_n^{tr}$ in place of $\lambda_n$, where $\hat{\lambda}_n^{tr}$ is the test statistic computed on the training set $\mathcal{D}^{tr}$. For simplicity, we assume that each of $\mathcal{D}^{tr}$ and $\mathcal{D}^{te}$ has half of the samples. We perform an optimization of $\theta$ with a gradient ascent algorithm on $\hat{\lambda}_n^{tr}$. The actual two-sample test is performed using the test statistic $\hat{\lambda}_n^{te}$ computed on $\mathcal{D}^{te}$. The full procedure from tuning the parameters to the actual two-sample test is summarized in Sec. A.
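The split logic can be sketched as follows. For brevity, this illustration (our own; all names are assumptions) tunes only the test locations by selecting the best of a few random candidate sets on the training split, rather than running gradient ascent as the paper does, and then tests on the disjoint held-out split:

```python
import numpy as np
from scipy.stats import chi2

def gauss_kernel(X, V, sigma2):
    # Gaussian kernel between rows of X (n, d) and V (J, d).
    sq = np.sum(X**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * X @ V.T
    return np.exp(-sq / (2.0 * sigma2))

def me_stat(X, Y, V, sigma2):
    # ME statistic n * zbar^T S^{-1} zbar.
    n = X.shape[0]
    Z = gauss_kernel(X, V, sigma2) - gauss_kernel(Y, V, sigma2)
    zbar = Z.mean(axis=0)
    S = np.atleast_2d(np.cov(Z.T))
    return float(n * zbar @ np.linalg.solve(S, zbar))

def two_sample_test(Xtr, Ytr, Xte, Yte, candidate_Vs, sigma2, alpha=0.01):
    """Pick the candidate location set maximizing the statistic on the
    training split, then test on the disjoint held-out split."""
    best_V = max(candidate_Vs, key=lambda V: me_stat(Xtr, Ytr, V, sigma2))
    lam = me_stat(Xte, Yte, best_V, sigma2)
    threshold = chi2.ppf(1.0 - alpha, df=best_V.shape[0])  # chi2(J) quantile
    return lam, threshold, lam > threshold
```

Because the tuned locations never see the test split, the $\chi^2(J)$ null distribution remains valid for the final decision.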
Since we use an empirical estimate $\hat{\lambda}_n^{tr}$ in place of $\lambda_n$ for parameter optimization, we give a finite-sample bound in Theorem 3 guaranteeing the convergence of $\hat{\lambda}_n / n$ to $\lambda_n / n$ as $n$ increases, uniformly over all kernels in a family $\mathcal{K}$ of uniformly bounded kernels and all test locations in an appropriate class. Kernel classes satisfying the conditions of Theorem 3 include the widely used isotropic Gaussian kernel class $\mathcal{K}_g := \{k_\sigma : k_\sigma(x, y) = \exp(-\|x - y\|^2/(2\sigma^2)),\ \sigma > 0\}$, and the more general full Gaussian kernel class (see Lemma 3 and Lemma 4).
Theorem 3 (Consistency of $\hat{\lambda}_n$ in the ME test).
Let be a measurable set, and be a collection in which each element is a set of test locations. All suprema over and are to be understood as and respectively. For a class of kernels on , define
(1)  
(2) 
Assume that (1) is a uniformly bounded (by ) family of measurable kernels, (2) , and (3) is VCsubgraph with VCindex , and is continuous . Let and . Let s be the universal constants associated to s according to Theorem 2.6.7 in van der Vaart and Wellner (2000). Then for any with probability at least ,
for and .
Proof (sketch).
The idea is to bound the difference between $\hat{\lambda}_n/n$ and $\lambda_n/n$ with an expression involving two suprema of empirical processes, which can in turn be bounded by the Rademacher complexities of their respective function classes. Finally, the Rademacher complexities can be upper bounded using the Dudley entropy bound and the VC-subgraph properties of the function classes. Proof details are given in Sec. D. ∎
Theorem 3 implies a concrete rate of convergence once the regularization parameter $\gamma_n$ is set appropriately. Both Proposition 3 and Theorem 3 require $\tilde{c} := \sup_{\mathcal{V}, k} \|\mathbf{\Sigma}^{-1}\|_F < \infty$ as a precondition. To guarantee this, a concrete construction of $\mathcal{K}$ is the isotropic Gaussian kernel class, where the bandwidth $\sigma$ is constrained to be in a compact set, and the test locations are likewise restricted to a compact set. Then, for any non-degenerate $\mathbf{\Sigma}$, we have $\tilde{c} < \infty$, since the map from the parameters to $\|\mathbf{\Sigma}^{-1}\|_F$ is continuous, and thus attains its supremum over compact sets.
4 Experiments
Data  $P$  $Q$
SG  $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$  $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$
GMD  $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$  $\mathcal{N}((1, 0, \ldots, 0)^\top, \mathbf{I}_d)$
GVD  $\mathcal{N}(\mathbf{0}, \mathbf{I}_d)$  $\mathcal{N}(\mathbf{0}, \mathrm{diag}(2, 1, \ldots, 1))$
Blobs  Gaussian mixtures in $\mathbb{R}^2$ as studied in Chwialkowski et al. (2015); Gretton et al. (2012b).
In this section, we demonstrate the effectiveness of the proposed methods on both toy and real problems. We consider the isotropic Gaussian kernel class in all kernel-based tests. We study seven two-sample test algorithms. Denote by MEfull and SCFfull the ME and SCF tests whose test locations and Gaussian width $\sigma$ are fully optimized using gradient ascent on a separate training sample of the same size as the test set. MEgrid and SCFgrid are as in Chwialkowski et al. (2015), where only the Gaussian width is optimized by a grid search, and the test locations are randomly drawn. The remaining tests are the quadratic-time MMD test (MMDquad), the linear-time MMD test (MMDlin), and Hotelling's T-test (Ttest).
In all the following experiments, each problem is repeated for 500 trials. For toy problems, new samples are generated from the specified distributions in each trial. For real problems, samples are partitioned randomly into training and test sets in each trial. In all of the simulations, we report an empirical estimate of $\mathbb{P}(\hat{\lambda}_n^{te} \geq T_\alpha)$, which is the proportion of trials in which the statistic exceeds the threshold $T_\alpha$. This quantity is an estimate of the type-I error under $H_0$, and corresponds to test power when $H_1$ is true. We set $\alpha = 0.01$ in all the experiments. All the code and preprocessed data are available at https://github.com/wittawatj/interpretable-test.
Optimization The parameter tuning objective $\hat{\lambda}_n^{tr}$ is a function of $\theta$, consisting of one real-valued Gaussian width and $J$ test locations, each of $d$ dimensions. The parameters can thus be regarded as a $(Jd + 1)$-dimensional Euclidean vector. We take the derivative of $\hat{\lambda}_n^{tr}$ with respect to $\theta$, and use gradient ascent to maximize it. $J$ is prespecified and fixed. For the ME test, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from $P$ and $Q$; this ensures that the initial locations are well supported by the data. For the SCF test, initialization using the standard normal distribution is found to be sufficient. The regularization parameter $\gamma_n$ is not optimized; we set it to be as small as possible while being large enough to ensure that $(\mathbf{S}_n + \gamma_n \mathbf{I})^{-1}$ can be stably computed. We emphasize that both the optimization and the testing are linear in $n$, including the optimization cost per gradient ascent iteration. Runtimes of all methods are reported in Sec. C in the appendix.
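A minimal sketch of the location optimization (our own illustration; the paper differentiates $\hat{\lambda}_n^{tr}$ analytically, whereas this sketch uses finite-difference gradients with a backtracking step for brevity, and all names are assumptions):

```python
import numpy as np

def gauss_kernel(X, V, sigma2):
    sq = np.sum(X**2, 1)[:, None] + np.sum(V**2, 1)[None, :] - 2.0 * X @ V.T
    return np.exp(-sq / (2.0 * sigma2))

def me_stat(X, Y, V, sigma2):
    n = X.shape[0]
    Z = gauss_kernel(X, V, sigma2) - gauss_kernel(Y, V, sigma2)
    zbar = Z.mean(axis=0)
    S = np.atleast_2d(np.cov(Z.T))
    return float(n * zbar @ np.linalg.solve(S, zbar))

def optimize_locations(X, Y, V0, sigma2, step=0.5, iters=30, eps=1e-4):
    """Hill-climb the test locations: estimate the gradient of the statistic
    by finite differences, take a normalized step, and backtrack on failure."""
    V = V0.copy()
    cur = me_stat(X, Y, V, sigma2)
    for _ in range(iters):
        G = np.zeros_like(V)
        for idx in np.ndindex(*V.shape):
            Vp = V.copy()
            Vp[idx] += eps
            G[idx] = (me_stat(X, Y, Vp, sigma2) - cur) / eps
        V_new = V + step * G / (np.linalg.norm(G) + 1e-12)
        new = me_stat(X, Y, V_new, sigma2)
        if new > cur:
            V, cur = V_new, new   # accept the uphill step
        else:
            step *= 0.5           # overshot; shrink the step
    return V, cur
```

The accept/backtrack rule guarantees that the objective never decreases over the iterations, mirroring the ascent behavior of the analytic-gradient procedure.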
1. Informative features: simple demonstration We begin with a demonstration that the proxy $\hat{\lambda}_n^{tr}$ for the test power is informative for revealing the difference of the two samples in the ME test. We consider the Gaussian Mean Difference (GMD) problem (see Table 1), where both $P$ and $Q$ are two-dimensional normal distributions differing in their means. We use $J = 2$ test locations $v_1$ and $v_2$, where $v_1$ is fixed to the location indicated by the black triangle in Fig. 1. The contour plot shows $\hat{\lambda}_n^{tr}$ as a function of $v_2$.
Fig. 1 (top) suggests that $\hat{\lambda}_n^{tr}$ is maximized when $v_2$ is placed in either of the two regions that capture the difference of the two samples, i.e., the regions in which the probability masses of $P$ and $Q$ have less overlap. In Fig. 1 (bottom), we consider placing $v_1$ in one of the two key regions. In this case, the contour plot shows that $v_2$ should be placed in the other region to maximize $\hat{\lambda}_n^{tr}$, implying that placing multiple test locations in the same neighborhood will not increase the discriminability. The two modes on the left and right suggest two ways to place the test location in a region that reveals the difference. The non-convexity of $\hat{\lambda}_n^{tr}$ is an indication of many informative ways to detect differences of $P$ and $Q$, rather than a drawback; a convex objective would not capture this multimodality.
2. Test power vs. sample size
We now demonstrate the rate of increase of test power with sample size. When the null hypothesis holds, the type-I error should stay at the specified level $\alpha$. We consider the following four toy problems: Same Gaussian (SG), Gaussian mean difference (GMD), Gaussian variance difference (GVD), and Blobs. The specifications of $P$ and $Q$ are summarized in Table 1. In the Blobs problem, $P$ and $Q$ are defined as mixtures of Gaussian distributions arranged on a grid in $\mathbb{R}^2$. This problem is challenging as the difference of $P$ and $Q$ is encoded at a much smaller length scale than the global structure (Gretton et al., 2012b). Specifically, the eigenvalue ratio of the covariance of each Gaussian component differs between $P$ and $Q$. The number of test locations $J$ is held fixed in this experiment.
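A Blobs-style generator can be sketched as follows (our own illustration; the grid size, spacing and covariance stretch below are assumed values, not the exact settings of the paper):

```python
import numpy as np

def sample_blobs(n, grid=4, spacing=10.0, stretch=2.0, rng=None):
    """Mixtures of 2-D Gaussians on a grid x grid lattice. Under P each blob
    is isotropic; under Q each blob is stretched along the first axis by
    `stretch` (a variance ratio), so P and Q share the same global structure
    and differ only at a small length scale."""
    rng = np.random.default_rng(rng)
    centers = spacing * np.array([[i, j] for i in range(grid)
                                  for j in range(grid)], dtype=float)
    # P: pick a random blob center, add isotropic unit-variance noise.
    X = centers[rng.integers(0, len(centers), n)] + rng.normal(0.0, 1.0, (n, 2))
    # Q: same centers, but the noise is stretched along the first axis.
    noise = rng.normal(0.0, 1.0, (n, 2)) * np.array([np.sqrt(stretch), 1.0])
    Y = centers[rng.integers(0, len(centers), n)] + noise
    return X, Y
```

Because the blob spacing dominates the overall variance, a kernel width matched to the global scale cannot see the per-blob anisotropy, which is what makes this problem hard for tests without tuned features.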
The results are shown in Fig. d, where the type-I error (for the SG problem) and the test power (for the GMD, GVD and Blobs problems) are plotted against the test sample size. A number of observations are worth noting. In the SG problem, the type-I error roughly stays at the specified level $\alpha$: the rate of rejection of $H_0$ when it is true matches the design level.
GMD with 100 dimensions turns out to be an easy problem for all the tests except MMDlin. In the GVD and Blobs cases, MEfull and SCFfull achieve substantially higher test power than MEgrid and SCFgrid, respectively, suggesting a clear advantage from optimizing the test locations. Remarkably, MEfull consistently outperforms the quadratic-time MMD across all test sample sizes in the GVD case. When the difference of $P$ and $Q$ is subtle, as in the Blobs problem, MEgrid, which uses randomly drawn test locations, can perform poorly (see Fig. d), since it is unlikely that randomly drawn locations will be placed in the key regions that reveal the difference. In this case, optimization of the test locations can considerably boost the test power (see MEfull in Fig. d). Note also that the SCF variants perform significantly better than the ME variants on the Blobs problem, as the difference in $P$ and $Q$ is localized in the frequency domain; MEfull and MEgrid would require many more test locations in the spatial domain to match the test powers of the SCF variants. For the same reason, SCFfull does much better than the quadratic-time MMD across most sample sizes, as the latter represents a weighted distance between characteristic functions integrated across the entire frequency domain (Sriperumbudur et al., 2010, Corollary 4).
3. Test power vs. dimension
We next investigate how the dimension $d$ of the problem affects the type-I errors and test powers of the ME and SCF tests. We consider the same artificial problems: SG, GMD and GVD. This time, we fix the test sample size to 10000 and vary the dimension. The results are shown in Fig. c. Due to the large dimensions and sample size, it is computationally infeasible to run MMDquad.
We observe that all the tests except the Ttest can maintain the type-I error at roughly the specified significance level as the dimension increases. The type-I performance of the Ttest is incorrect at large $d$ because of the difficulty in accurately estimating the covariance matrix in high dimensions. It is interesting to note the high performance of MEfull in the GMD problem in Fig. b. MEfull achieves the maximum test power of 1.0 throughout and matches the power of the Ttest, in spite of being nonparametric and making no assumption on $P$ and $Q$ (the Ttest is further advantaged by its excessive type-I error). However, this is true only with optimization of the test locations. This is reflected in the test power of MEgrid in Fig. b, which drops monotonically as the dimension increases, highlighting the importance of test location optimization. The performance of MMDlin degrades quickly with increasing dimension, as expected from Ramdas et al. (2015).
Problem  $n^{te}$  MEfull  MEgrid  SCFfull  SCFgrid  MMDquad  MMDlin
BayesBayes  215  .012  .018  .012  .004  .022  .008 
BayesDeep  216  .954  .034  .688  .180  .906  .262 
BayesLearn  138  .990  .774  .836  .534  1.00  .238 
BayesNeuro  394  1.00  .300  .828  .500  .952  .972 
LearnDeep  149  .956  .052  .656  .138  .876  .500 
LearnNeuro  146  .960  .572  .590  .360  1.00  .538 
4. Distinguishing articles from two categories We now turn to performance on real data. We first consider the problem of distinguishing two categories of publications at the conference on Neural Information Processing Systems (NIPS). Out of 5903 papers published in NIPS from 1988 to 2015, we manually select disjoint subsets related to Bayesian inference (Bayes), neuroscience (Neuro), deep learning (Deep), and statistical learning theory (Learn) (see Sec. B). Each paper is represented as a bag of words using TF-IDF (Manning et al., 2008) features. We perform stemming, remove all stop words, and retain only nouns. A further filtering by document frequency (DF), keeping words with $5 \leq \mathrm{DF} \leq 2000$, yields approximately 5000 words, from which 2000 words (i.e., dimensions) are randomly selected. See Sec. B for more details on the preprocessing. For the ME and SCF tests, we use only one test location, i.e., we set $J = 1$. We perform 1000 permutations to approximate the null distribution of MMDquad in this and the following experiments.
Type-I errors and test powers are summarized in Table 2. The first column indicates the categories of the papers in the two samples. In the BayesBayes problem, papers on Bayesian inference are randomly partitioned into two samples in each trial. This task represents a case in which $H_0$ holds. Among all the linear-time tests, we observe that MEfull has the highest test power in all the tasks, attaining a maximum test power of 1.0 in the BayesNeuro problem. This high performance assures that although different test locations may be selected in different trials, these locations are each informative. It is interesting to observe that MEfull has performance close to or better than MMDquad, which requires $O(n^2)$ runtime. Besides the clear advantages of interpretability and linear runtime of the proposed tests, these results suggest that evaluating the differences in expectations of analytic functions at particular locations can yield an equally powerful test at a much lower cost, as opposed to computing the RKHS norm of the witness function as done in MMD. Unlike in the Blobs problem, however, Fourier features are less powerful than spatial features in this setting.
We further investigate the interpretability of the ME test by the following procedure. Let $v^{(t)} \in \mathbb{R}^{2000}$ be the learned test location ($J = 1$) in trial $t$. Let $\eta_j^{(t)}$ be an indicator variable taking value 1 if coordinate $j$ of $v^{(t)}$ is among the top five largest of all its coordinates, and 0 otherwise. Define $\eta_j := \sum_t \eta_j^{(t)}$ as a proxy indicating the significance of word $j$, i.e., $\eta_j$ is high if word $j$ is frequently among the top five largest coordinates across trials. The top seven words as sorted in descending order by $\eta_j$ in the BayesNeuro problem are spike, markov, cortex, dropout, recurr, iii, gibb, showing that the learned test locations are highly interpretable. Indeed, “markov” and “gibb” (i.e., stemmed from Gibbs) are discriminative terms in the Bayesian inference category, and “spike” and “cortex” are key terms in neuroscience. We give full lists of discriminative terms learned in all the problems in Sec. B.1. To show that not all of the randomly selected 2000 terms are informative, we modify the definition of $\eta_j$ to consider the least important words (i.e., $\eta_j$ is high if word $j$ is frequently among the five smallest coordinates); we instead obtain circumfer, bra, dominiqu, rhino, mitra, kid, impostor, which are not discriminative.
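The word-significance proxy can be sketched as a simple counting procedure (our own illustration of the scheme described above; the function name is an assumption):

```python
import numpy as np

def top_word_counts(locations, top=5):
    """locations: (trials, n_words) array of learned test locations, one row
    per trial. Returns eta, where eta[j] counts how often coordinate j is
    among the `top` largest-magnitude coordinates of a trial's location."""
    eta = np.zeros(locations.shape[1], dtype=int)
    for v in np.abs(np.asarray(locations, dtype=float)):
        eta[np.argsort(v)[-top:]] += 1  # indices of the `top` largest entries
    return eta
```

Sorting `eta` in descending order and reading off the corresponding vocabulary entries then yields the discriminative word lists.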
5. Distinguishing positive and negative emotions
In the final experiment, we study how well the ME and SCF tests can distinguish two samples of photos of people showing positive and negative facial expressions. Our emphasis is on the discriminative features of the faces identified by the ME test, showing how the two groups differ. For this purpose, we use the Karolinska Directed Emotional Faces (KDEF) dataset (Lundqvist et al., 1998), containing 5040 aligned face images of 70 amateur actors, 35 females and 35 males. We use only photos showing front views of the faces. In the dataset, each actor displays seven expressions: happy (HA), neutral (NE), surprised (SU), sad (SA), afraid (AF), angry (AN), and disgusted (DI). We assign the HA, NE, and SU faces to the positive emotion group (i.e., samples from $P$), and the AF, AN and DI faces to the negative emotion group (samples from $Q$). We denote this problem as “+ vs. −”. Examples of six facial expressions from one actor are shown in Fig. 4. Photos of the SA group are unused to keep the sizes of the two samples the same. Each image is cropped to exclude the background, resized to a lower resolution, and converted to grayscale.
We run the tests 500 times with the same settings used previously, i.e., Gaussian kernels and $J = 1$. The type-I errors and test powers are shown in Table 3. In the table, “± vs. ±” is a problem in which all faces expressing the six emotions are randomly split into two samples of equal sizes, i.e., $H_0$ is true. Both MEfull and SCFfull achieve high test powers while maintaining the correct type-I errors.
Problem  $n^{te}$  MEfull  MEgrid  SCFfull  SCFgrid  MMDquad  MMDlin
± vs. ±  201  .010  .012  .014  .002  .018  .008
+ vs. −  201  .998  .656  1.00  .750  1.00  .578
As a way to interpret how positive and negative emotions differ, we take an average across trials of the learned test locations of MEfull in the “+ vs. −” problem. This average is shown in Fig. g. We see that the test locations faithfully capture the difference of positive and negative emotions by giving more weight to the regions of the nose, upper lip, and nasolabial folds (smile lines), confirming the interpretability of the test in a high-dimensional setting.
Acknowledgement
We thank the Gatsby Charitable Foundation for the financial support.
Interpretable Distribution Features with Maximum Testing Power
Supplementary Material
Appendix A Algorithm
The full algorithm for the proposed tests from parameter tuning to the actual twosample testing is given in Algorithm 1.
Appendix B Experiments on NIPS text collection
The full procedure for processing the NIPS text collection is summarized as follows.

Download all 5903 papers from 1988 to 2015 from https://papers.nips.cc/ as PDF files.

Convert each PDF file to text with pdftotext.
Remove all stop words. We use the list of stop words from http://www.ranks.nl/stopwords.

Keep only nouns. We use the list of nouns available in WordNet 3.0.
Keep only words consisting solely of English letters, i.e., containing no punctuation or digits. Also, word length must be between 3 and 20 characters (inclusive).

Keep only words which occur in at least 5 documents, and in no more than 2000 documents.

Convert all characters to lowercase. Stem all words with the SnowballStemmer in NLTK (Bird et al., 2009). For example, “recognize” and “recognizer” become “recogn” after stemming.

Categorize papers into disjoint collections. A paper is treated as belonging to a category if its title has at least one word from the list of keywords for that category. Papers that match the keywords of more than one category are not considered. The lists of keywords are as follows.

Bayesian inference (Bayes): graphical model, bayesian, inference, mcmc, monte carlo, posterior, prior, variational, markov, latent, probabilistic, exponential family.

Deep learning (Deep): deep, drop out, autoencod, convolutional, neural net, belief net, boltzmann.

Learning theory (Learn): learning theory, consistency, theoretical guarantee, complexity, pacbayes, paclearning, generalization, uniform converg, bound, deviation, inequality, risk min, minimax, structural risk, VC, rademacher, asymptotic.

Neuroscience (Neuro): motor control, neural, neuron, spiking, spike, cortex, plasticity, neural decod, neural encod, brain imag, biolog, perception, cognitive, emotion, synap, neural population, cortical, firing rate, firingrate, sensor.


Randomly select 2000 words from the remaining words.

Treat each paper as a bag of words and construct a feature vector with TF-IDF (Manning et al., 2008).
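A minimal TF-IDF construction consistent with the above can be sketched as follows (our own sketch; this is one common TF-IDF variant, not necessarily the exact weighting used in the paper):

```python
import numpy as np

def tfidf(docs):
    """docs: list of token lists. Returns (vocab, M) where M[d, w] is
    tf(w, d) * log(N / df(w)), with N documents and df(w) the number of
    documents containing word w."""
    vocab = sorted({w for doc in docs for w in doc})
    index = {w: i for i, w in enumerate(vocab)}
    N = len(docs)
    tf = np.zeros((N, len(vocab)))
    for i, doc in enumerate(docs):
        for w in doc:
            tf[i, index[w]] += 1.0  # raw term frequency
    df = (tf > 0).sum(axis=0)       # document frequency of each word
    return vocab, tf * np.log(N / df)
```

Note that a word occurring in every document gets weight zero under this scheme, which is why the DF-based filtering step above already removes the most and least frequent words.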
B.1 Discriminative terms identified by the ME test
In this section, we provide full lists of discriminative terms following the procedure described in Sec. 4. The top ten words in each problem are as follows.

BayesBayes: collabor, traffic, bay, permut, net, central, occlus, mask, draw, joint.

BayesDeep: infer, bay, mont, adaptor, motif, haplotyp, ecg, covari, boltzmann, classifi.

BayesLearn: infer, markov, graphic, segment, bandit, boundari, favor, carlo, prioriti, prop.

BayesNeuro: spike, markov, cortex, dropout, recurr, iii, gibb, basin, circuit, subsystem.

LearnDeep: deep, forward, delay, subgroup, bandit, recept, invari, overlap, inequ, pia.

LearnNeuro: polici, interconnect, hardwar, decay, histolog, edg, period, basin, inject, human.
Appendix C Runtimes
In this section, we provide the runtimes of all the experiments. The runtimes of the “Test power vs. sample size” experiment are shown in Fig. d. The runtimes of the “Test power vs. dimension” experiment are shown in Fig. c. Tables 4 and 5 give the runtimes of the two real-data experiments.
Problem  $n^{te}$  MEfull  MEgrid  SCFfull  SCFgrid  MMDquad  MMDlin
BayesBayes  215  126.7  116  34.67  1.855  13.66  .6112 
BayesDeep  216  118.3  111.7  36.41  1.933  13.59  .5105 
BayesLearn  138  94.59  89.16  23.69  1.036  2.152  .36 
BayesNeuro  394  142.5  130.3  69.19  3.533  32.71  .8643 
LearnDeep  149  105  99.59  24.99  1.253  2.417  .4744 
LearnNeuro  146  101.2  93.53  25.29  1.178  2.351  .3658 
Problem  $n^{te}$  MEfull  MEgrid  SCFfull  SCFgrid  MMDquad  MMDlin
± vs. ±  201  87.7  83.4  10.5  1.45  9.93  0.464
+ vs. −  201  85.0  80.6  11.7  1.42  10.4  0.482
In the cases where $n$ is large (Fig. d), MMDquad has the largest runtime due to its quadratic dependency on the sample size. In the extreme case where the test sample size is 10000 (Fig. c), it is computationally infeasible to run MMDquad. We observe that the proposed MEfull and SCFfull have a slight overhead from the parameter optimization. However, since the optimization procedure is also linear in $n$, we are able to conduct an accurate test in less than 10 minutes even when the test sample size is 10000 (see Fig. a, b). We note that the actual tests (after optimization) for all ME and SCF variants take less than one second in all cases. In MEfull, we initialize the test locations with realizations from two multivariate normal distributions fitted to samples from $P$ and $Q$. When $d$ is large, this heuristic can be expensive. An alternative initialization scheme for the locations is to randomly select points from the two samples.
Appendix D Proof of Theorem 3
Recall Theorem 3.
A proof is given as follows.
D.1 Notations
Let $\langle \mathbf{A}, \mathbf{B} \rangle_F := \mathrm{tr}(\mathbf{A}^\top \mathbf{B})$ be the Frobenius inner product, and $\|\mathbf{A}\|_F := \sqrt{\langle \mathbf{A}, \mathbf{A} \rangle_F}$. $\mathbf{A} \succeq 0$ means that $\mathbf{A}$ is symmetric, positive semidefinite. For $n \in \mathbb{N}$, $[n] := \{1, \ldots, n\}$. $(a; b)$ is the concatenation of the vectors $a$ and $b$. $\mathbb{R}_+$ is the set of positive reals. $f \circ g$ is the composition of functions $f$ and $g$. Let $(\mathcal{M}, \rho)$ denote a general metric space below. In measurability requirements, metric spaces are meant to be endowed with their Borel $\sigma$-algebras.
Let $\mathcal{C}$ be a collection of subsets of a set $\mathcal{X}$. $\mathcal{C}$ is said to shatter an $n$-point set $S_n := \{x_1, \ldots, x_n\} \subseteq \mathcal{X}$ if, for any $A \subseteq S_n$, there exists $C \in \mathcal{C}$ such that $A = C \cap S_n$; in other words, an arbitrary subset of $S_n$ can be cut out by an element of $\mathcal{C}$. The VC index of $\mathcal{C}$ is the smallest $i$ for which no set of size $i$ is shattered: $VC(\mathcal{C}) := \inf\{i : \text{no set of size } i \text{ is shattered by } \mathcal{C}\}$.
A collection $\mathcal{C}$ of measurable sets is called a VC-class if its index $VC(\mathcal{C})$ is finite. The subgraph of a real-valued function $f : \mathcal{X} \to \mathbb{R}$ is $\{(x, t) \in \mathcal{X} \times \mathbb{R} : t < f(x)\}$. A collection $\mathcal{F}$ of measurable functions is called a VC-subgraph class, or shortly VC, if the collection of all subgraphs of functions in $\mathcal{F}$ is a VC-class of sets; its index is defined as $VC(\mathcal{F}) := VC(\{\text{subgraphs of } f : f \in \mathcal{F}\})$.
Let be the set of measurable functions. Given an i.i.d. (independent identically distributed) sample from (), let and let denote the empirical measure. (), . Define , where is a probability distribution on . Let .
The diameter of a class $\mathcal{F}$ with respect to a metric $d$ is $\mathrm{diam}(\mathcal{F}) := \sup_{f, f' \in \mathcal{F}} d(f, f')$; its covering number $N(\epsilon, \mathcal{F}, d)$ is the size of the smallest $\epsilon$-net of $\mathcal{F}$, i.e., the smallest set $\{f_1, \ldots, f_m\} \subseteq \mathcal{F}$ such that $\mathcal{F} \subseteq \cup_{j=1}^m B(f_j, \epsilon)$, where $B(f, \epsilon)$ is the ball with center $f$ and radius $\epsilon$. $P^n$ is the $n$-fold product measure. For sets $A$ and $B$, $A \times B$ is their Cartesian product. For a function class $\mathcal{F}$ and a sample $x_1, \ldots, x_n$, the empirical Rademacher average is $\hat{R}_n(\mathcal{F}) := \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i=1}^n \epsilon_i f(x_i)$, where the $\epsilon_i$ are i.i.d. samples from a Rademacher random variable [$\mathbb{P}(\epsilon_i = 1) = \mathbb{P}(\epsilon_i = -1) = 1/2$]. Let $(\Theta, \rho)$ be a metric space; a collection of functions $\{f_\theta : \theta \in \Theta\}$ is called a separable Carathéodory family if $\Theta$ is separable and $\theta \mapsto f_\theta(x)$ is continuous for all $x$. $\mathrm{span}(\cdot)$ denotes the linear hull of its arguments. $\Gamma(\cdot)$ denotes the Gamma function.
d.2 Bound in terms of and
For brevity, we will interchangeably use for and for . and will be used mainly when the dependency of needs to be emphasized. We start with and upper bound the argument of as
For , we have