Revisiting Classifier Two-Sample Tests
Abstract
The goal of two-sample tests is to assess whether two samples, denoted S_P and S_Q, are drawn from the same distribution. Perhaps intriguingly, one relatively unexplored method to build two-sample tests is the use of binary classifiers. In particular, construct a dataset by pairing the examples in S_P with a positive label, and the examples in S_Q with a negative label. If the null hypothesis "P = Q" is true, then the classification accuracy of a binary classifier on a held-out subset of this dataset should remain near chance level. As we will show, such Classifier Two-Sample Tests (C2ST) learn a suitable representation of the data on the fly, return test statistics in interpretable units, have a simple null distribution, and their predictive uncertainty allows us to interpret where P and Q differ.
The goal of this paper is to establish the properties, performance, and uses of C2ST. First, we analyze their main theoretical properties. Second, we compare their performance against a variety of state-of-the-art alternatives. Third, we propose their use to evaluate the sample quality of generative models with intractable likelihoods, such as Generative Adversarial Networks (GANs). Fourth, we showcase the novel application of GANs together with C2ST for causal discovery.
David Lopez-Paz, Maxime Oquab
Facebook AI Research, WILLOW project team, Inria / ENS / CNRS 
dlp@fb.com, maxime.oquab@inria.fr 
1 Introduction
One of the most fundamental problems in statistics is to assess whether two samples, S_P and S_Q, are drawn from the same probability distribution. To this end, two-sample tests (Lehmann & Romano, 2006) summarize the differences between the two samples into a real-valued test statistic, and then use the value of such statistic to accept (for clarity, we abuse statistical language and write "accept" to mean "fail to reject") or reject the null hypothesis "P = Q". The development of powerful two-sample tests is instrumental in a myriad of applications, including the evaluation and comparison of generative models. Over the last century, statisticians have nurtured a wide variety of two-sample tests. However, most of these tests are only applicable to one-dimensional examples, require the prescription of a fixed representation of the data, return test statistics in units that are difficult to interpret, or do not explain how the two samples under comparison differ.
Intriguingly, there exists a relatively unexplored strategy to build two-sample tests that overcome the aforementioned issues: training a binary classifier to distinguish between the examples in S_P and the examples in S_Q. Intuitively, if P = Q, the test accuracy of such a binary classifier should remain near chance level. Otherwise, if P ≠ Q and the binary classifier is able to unveil some of the distributional differences between S_P and S_Q, its test accuracy should depart from chance level. As we will show, such Classifier Two-Sample Tests (C2ST) learn a suitable representation of the data on the fly, return test statistics in interpretable units, have simple asymptotic distributions, and their learned features and predictive uncertainty provide interpretation on how P and Q differ. In this way, this work brings together the communities of statistical testing and representation learning.
The goal of this paper is to establish the theoretical properties and evaluate the practical uses of C2ST. To this end, our contributions are:

We review the basics of two-sample tests in Section 2, as well as their common applications to measure statistical dependence and evaluate generative models.

We analyze the attractive properties of C2ST (Section 3), including their exact asymptotic distributions, testing power, and interpretability.

We evaluate C2ST on a wide variety of synthetic and real data (Section 4), and compare their performance against multiple state-of-the-art alternatives. Furthermore, we provide examples to illustrate how C2ST can interpret the differences between pairs of samples.

As a novel application of the synergy between C2ST and GANs, Section 6 proposes the use of these methods for causal discovery.
2 Two-Sample Testing
The goal of two-sample tests is to assess whether two samples, denoted by S_P and S_Q, are drawn from the same distribution (Lehmann & Romano, 2006). More specifically, two-sample tests either accept or reject the null hypothesis, often denoted by H0, which stands for "P = Q". When rejecting H0, we say that the two-sample test favors the alternative hypothesis, often denoted by H1, which stands for "P ≠ Q". To accept or reject H0, two-sample tests summarize the differences between the two samples (sets of independently and identically distributed examples):
S_P = {x_1, ..., x_n} ~ P^n(x),  S_Q = {y_1, ..., y_m} ~ Q^m(y),   (1)
into a statistic t̂ ∈ R. Without loss of generality, we assume that the two-sample test returns a small statistic when the null hypothesis "P = Q" is true, and a large statistic otherwise. Then, for a sufficiently small statistic, the two-sample test will accept H0. Conversely, for a sufficiently large statistic, the two-sample test will reject H0 in favor of H1.
More formally, the statistician performs a two-sample test in four steps. First, decide a significance level α ∈ (0, 1), which is an input to the two-sample test. Second, compute the two-sample test statistic t̂. Third, compute the p-value p̂ = P(T ≥ t̂ | H0), the probability of the two-sample test returning a statistic as large as t̂ when H0 is true. Fourth, reject H0 if p̂ < α, and accept it otherwise.
Inevitably, two-sample tests can fail in two different ways. First, to make a type-I error is to reject the null hypothesis when it is true (a "false positive"). By the definition of p-value, the probability of making a type-I error is upper-bounded by the significance level α. Second, to make a type-II error is to accept the null hypothesis when it is false (a "false negative"). We denote the probability of making a type-II error by β, and refer to the quantity π = 1 − β as the power of a test. Usually, the statistician uses domain-specific knowledge to evaluate the consequences of a type-I error, and thus prescribes an appropriate significance level α. Within the prescribed significance level α, the statistician prefers the two-sample test with maximum power π.
Among others, two-sample tests serve two other uses. First, two-sample tests can measure statistical dependence (Gretton et al., 2012a). In particular, testing the independence null hypothesis "the random variables x and y are independent" amounts to testing the two-sample null hypothesis "P(x, y) = P(x)P(y)". In practice, the two-sample test would compare the sample {(x_i, y_i)} to the sample {(x_i, y_σ(i))}, where σ is a random permutation of the set of indices {1, ..., n}. This approach is consistent when considering all possible random permutations. However, since independence testing is a particular case of two-sample testing, specialized independence tests may exhibit higher power for this task (Gretton et al., 2005).
Second, two-sample tests can evaluate the sample quality of generative models with intractable likelihoods, but tractable sampling procedures. Intuitively, a generative model produces good samples if these are indistinguishable from the real data that they model. Thus, the two-sample test statistic between a sample of real data and a sample synthesized by the generative model measures the fidelity of the generated samples. Uses of two-sample tests to evaluate the sample quality of generative models include the pioneering work of Box (1980), the use of the Maximum Mean Discrepancy (MMD) criterion (Bengio et al., 2013; Dziugaite et al., 2015; Lloyd & Ghahramani, 2015; Bounliphone et al., 2015; Sutherland et al., 2016), and the connections to density-ratio estimation (Kanamori et al., 2010; Wornowizki & Fried, 2016; Menon & Ong, 2016; Mohamed & Lakshminarayanan, 2016).
Over the last century, statisticians have nurtured a wide variety of two-sample tests. Classical two-sample tests include the t-test (Student, 1908), which tests for the difference in means of two samples; the Wilcoxon-Mann-Whitney test (Wilcoxon, 1945; Mann & Whitney, 1947), which tests for the difference in rank means of two samples; and the Kolmogorov-Smirnov tests (Kolmogorov, 1933; Smirnov, 1939) and their variants (Kuiper, 1962), which test for the difference in the empirical cumulative distributions of two samples. However, these classical tests are only efficient when applied to one-dimensional data. Recently, the use of kernel methods (Smola & Schölkopf, 1998) enabled the development of two-sample tests applicable to multidimensional data. Examples of these tests include the MMD test (Gretton et al., 2012a), which looks for differences in the empirical kernel mean embeddings of two samples, and the Mean Embedding (ME) test (Chwialkowski et al., 2015; Jitkrittum et al., 2016), which looks for differences in the empirical kernel mean embeddings of two samples at optimized locations. However, kernel two-sample tests require the prescription of a manually-engineered representation of the data under study, and return values in units that are difficult to interpret. Finally, only the ME test provides a mechanism to interpret how P and Q differ.
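Several of these classical one-dimensional tests are available off the shelf; for instance, using SciPy (the mean-shift data below is our own toy example, not one from the paper):

```python
import numpy as np
from scipy import stats

rng = np.random.RandomState(0)
x = rng.randn(1000)          # sample from N(0, 1)
y = rng.randn(1000) + 0.3    # sample from N(0.3, 1): a mean shift

t_p = stats.ttest_ind(x, y).pvalue      # t-test: difference in means
w_p = stats.mannwhitneyu(x, y).pvalue   # Wilcoxon-Mann-Whitney: rank means
ks_p = stats.ks_2samp(x, y).pvalue      # Kolmogorov-Smirnov: empirical CDFs
```

All three tests reject a mean shift of this size; their relative power differs once the two distributions share means but differ elsewhere, as in the Gaussian-versus-Student experiments of Section 4.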
Next, we discuss a simple but relatively unexplored strategy to build twosample tests that overcome these issues: the use of binary classifiers.
3 Classifier Two-Sample Tests (C2ST)
Without loss of generality, we assume access to the two samples S_P and S_Q defined in (1), where x_i, y_j ∈ X for all i and j, and, for simplicity, n = m. To test whether the null hypothesis "P = Q" is true, we proceed in five steps. First, construct the dataset D = {(x_i, 0)}_{i=1}^n ∪ {(y_i, 1)}_{i=1}^n = {(z_i, l_i)}_{i=1}^{2n}.
Second, shuffle D at random, and split it into the disjoint training and testing subsets D_tr and D_te, where D = D_tr ∪ D_te and n_te = |D_te|. Third, train a binary classifier f : X → [0, 1] on D_tr; in the following, we assume that f(z_i) is an estimate of the conditional probability distribution p(l_i = 1 | z_i). Fourth, return the classification accuracy on D_te:
t̂ = (1/n_te) Σ_{(z_i, l_i) ∈ D_te} I[ I(f(z_i) > 1/2) = l_i ],   (2)
as our C2ST statistic, where I is the indicator function. The intuition here is that if P = Q, the test accuracy (2) should remain near chance level. In opposition, if P ≠ Q and the binary classifier unveils distributional differences between the two samples, the test classification accuracy (2) should be greater than chance level. Fifth, to accept or reject the null hypothesis, compute a p-value using the null distribution of the C2ST, as discussed next.
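The five-step procedure above can be sketched in a few lines; a minimal sketch (not the authors' exact implementation) using scikit-learn's KNeighborsClassifier as the binary classifier, and the Gaussian approximation to the null distribution (derived in the next subsection) to compute the p-value:

```python
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KNeighborsClassifier

def c2st(sample_p, sample_q, classifier=None, seed=0):
    """Classifier two-sample test: returns (test accuracy, one-sided p-value)."""
    rng = np.random.RandomState(seed)
    z = np.vstack([sample_p, sample_q])                  # pooled examples
    l = np.concatenate([np.zeros(len(sample_p)),         # label 0 for S_P
                        np.ones(len(sample_q))])         # label 1 for S_Q
    idx = rng.permutation(len(z))                        # shuffle at random
    z, l = z[idx], l[idx]
    n_tr = len(z) // 2                                   # 50/50 train/test split
    clf = classifier or KNeighborsClassifier()
    clf.fit(z[:n_tr], l[:n_tr])
    acc = clf.score(z[n_tr:], l[n_tr:])                  # the statistic (2)
    n_te = len(z) - n_tr
    # under H0, the accuracy is approximately N(1/2, 1/(4 n_te))
    p_value = norm.sf(acc, loc=0.5, scale=np.sqrt(0.25 / n_te))
    return acc, p_value

rng = np.random.RandomState(1)
acc_same, p_same = c2st(rng.randn(500, 2), rng.randn(500, 2))        # P = Q
acc_diff, p_diff = c2st(rng.randn(500, 2), rng.randn(500, 2) + 1.0)  # P != Q
```

With identically distributed samples the accuracy stays near 1/2 and the p-value is large; under the mean shift the accuracy departs from chance and the p-value collapses.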
3.1 Null and Alternative Distributions
Each term I[I(f(z_i) > 1/2) = l_i] appearing in (2) is an independent Bernoulli(p_i) random variable, where p_i is the probability of correctly classifying the i-th example in D_te.
First, under the null hypothesis P = Q, the samples S_P and S_Q follow the same distribution, leading to an impossible binary classification problem. In that case, p_i = 1/2 for all i, and n_te t̂ follows a Binomial(n_te, 1/2) distribution. Therefore, for large n_te, we can use the central limit theorem to approximate the null distribution of (2) by N(1/2, 1/(4 n_te)).
Second, under the alternative hypothesis P ≠ Q, the statistic n_te t̂ follows a Poisson Binomial distribution, since the constituent Bernoulli random variables may not be identically distributed. In the following, we approximate such Poisson Binomial distribution by the Binomial(n_te, p̄) distribution, where p̄ = (1/n_te) Σ_i p_i (Ehm, 1991). Therefore, we can use the central limit theorem to approximate the alternative distribution of (2) by N(p̄, p̄(1 − p̄)/n_te).
3.2 Testing power
To analyze the power (the probability of correctly rejecting a false null hypothesis) of C2ST, we assume that our classifier has an expected (unknown) accuracy of 1/2 under the null hypothesis "P = Q", and an expected accuracy of 1/2 + ε under the alternative hypothesis "P ≠ Q", where ε > 0 is the effect size distinguishing P from Q. Let Φ be the cdf of a standard Normal distribution, n_te the number of samples available for testing, and α the significance level. Then,
Theorem 1.
Given the conditions described in the previous paragraph, the approximate power of the statistic (2) is Φ( (ε √n_te − Φ^{-1}(1 − α)/2) / √((1/2 + ε)(1/2 − ε)) ).
See Appendix B for a proof. The power bound in Theorem 1 has an optimal order of magnitude for multidimensional problems (Bai & Saranadasa, 1996; Gretton et al., 2012a; Reddi et al., 2015). These are problems where the data dimensionality and the effect size ε stay fixed as n_te grows, so the power bounds do not depend on the dimensionality.
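Numerically, the power expression behaves as expected; a small sketch evaluating it under the Gaussian approximations of Section 3.1 (the N(1/2, 1/(4 n_te)) null and the N(1/2 + ε, v/n_te) alternative with v = (1/2 + ε)(1/2 − ε)):

```python
from math import sqrt
from scipy.stats import norm

def c2st_power(epsilon, n_te, alpha=0.05):
    """Approximate probability that a C2ST with test accuracy 1/2 + epsilon
    rejects H0 at significance level alpha, given n_te testing examples."""
    # rejection threshold set by the N(1/2, 1/(4 n_te)) null distribution
    threshold = 0.5 + norm.ppf(1 - alpha) * sqrt(0.25 / n_te)
    # probability of exceeding it under the Gaussian alternative
    v = (0.5 + epsilon) * (0.5 - epsilon)
    return norm.sf(threshold, loc=0.5 + epsilon, scale=sqrt(v / n_te))
```

At ε = 0 the power reduces to the significance level α, and it increases monotonically with both the effect size and the test-set size.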
Remark 1.
We leave for future work the study of quadratic-time C2ST with optimal power in high-dimensional problems (Ramdas et al., 2015). These are problems where the data dimensionality grows with the sample size, and the power bounds depend on the dimensionality. One possible line of research in this direction is to investigate the power and asymptotic distributions of quadratic-time C2ST statistics defined over all pairs of examples, where the classifier predicts whether the two examples in each pair come from the same sample.
Theorem 1 also illustrates that maximizing the power of a C2ST is a trade-off between two competing objectives: choosing a classifier that maximizes the test accuracy, and maximizing the size of the test set. This relates to the well-known bias-variance trade-off in machine learning. Indeed, simple classifiers will miss more nonlinear patterns in the data (leading to smaller test accuracy), but call for less training data (leading to larger test sets). On the other hand, flexible classifiers will miss fewer nonlinear patterns in the data (leading to higher test accuracy), but call for more training data (leading to smaller test sets). Formally, the relationship between the test accuracy, the sample size, and the flexibility of a classifier depends on capacity measures such as the VC-dimension (Vapnik, 1998). Note that there is no restriction against performing model selection (such as cross-validation) on D_tr.
Remark 2.
We have focused on test statistics (2) built on top of the zero-one loss. These statistics give rise to Bernoulli random variables, which can exhibit high variance. However, our arguments are readily extended to real-valued binary classification losses. Then, the variance of such real-valued losses would describe the norm of the decision function of the classifier two-sample test, appear in the power expression from Theorem 1, and serve as a hyperparameter to maximize power, as in (Gretton et al., 2012b, Section 3). (For a related discussion on this issue, we recommend the insightful comment by Arthur Gretton and Wittawat Jitkrittum, available at https://openreview.net/forum?id=SJkXfE5xx.)
3.3 Interpretability
There are three ways to interpret the result of a C2ST. First, recall that the classifier predictions f(z_i) are estimates of the conditional probabilities p(l_i = 1 | z_i) for each of the examples z_i in the test set. Inspecting these probabilities together with the true labels determines which examples were correctly or wrongly labeled by the classifier, with the least or the most confidence. Therefore, the values f(z_i) explain where the two distributions differ. Second, C2ST inherit the interpretability of their classifiers to explain which features are most important to distinguish the two distributions, in the same way as the ME test (Jitkrittum et al., 2016). Examples of interpretable features include the filters of the first layer of a neural network, the feature importances of a random forest, the weights of a generalized linear model, and so on. Third, C2ST return statistics in interpretable units: these relate to the percentage of examples correctly distinguished between the two distributions. These interpretable numbers can complement the use of p-values.
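A minimal sketch of the first two interpretations (the data, variable names, and the logistic-regression classifier are illustrative choices of ours, not prescribed by the test):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
sample_p = rng.randn(200, 2)                  # S_P
sample_q = rng.randn(200, 2) + [2.0, 0.0]     # S_Q, shifted along the first axis

z = np.vstack([sample_p, sample_q])
l = np.concatenate([np.zeros(200), np.ones(200)])
clf = LogisticRegression().fit(z, l)          # fit on all data, for illustration

proba = clf.predict_proba(z)[:, 1]            # estimates of p(l = 1 | z)
most_q = z[np.argsort(proba)[-3:]]            # examples most confidently from Q
ambiguous = z[np.argsort(np.abs(proba - 0.5))[:3]]  # where P and Q overlap

weights = clf.coef_[0]                        # the classifier's learned feature
```

Here the weight on the first coordinate dominates, pointing at the shifted feature, and the examples most confidently attributed to Q lie far along that axis.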
3.4 Prior Uses
The reduction of two-sample testing to binary classification was introduced in (Friedman, 2003), studied within the context of information theory in (Reid & Williamson, 2011), discussed in (Fukumizu et al., 2009; Gretton et al., 2012a), and analyzed (for the case of linear discriminant analysis) in (Ramdas et al., 2016). The use of binary classifiers for two-sample testing is increasingly common in neuroscience: see (Pereira et al., 2009; Olivetti et al., 2012) and the references therein. Implicitly, binary classifiers also perform two-sample tests in algorithms that discriminate data from noise, such as unsupervised-as-supervised learning (Friedman et al., 2001), noise contrastive estimation (Gutmann & Hyvärinen, 2012), negative sampling (Mikolov et al., 2013), and GANs (Goodfellow et al., 2014).
4 Experiments on Two-Sample Testing
We study two variants of classifier-based two-sample tests (C2ST): one based on neural networks (C2ST-NN), and one based on nearest neighbours (C2ST-KNN). C2ST-NN has one hidden layer of 20 ReLU neurons, and trains for a fixed number of epochs using the Adam optimizer (Kingma & Ba, 2015). C2ST-KNN uses k-nearest neighbours for classification. Throughout our experiments, we did not observe a significant improvement in performance when increasing the flexibility of these classifiers (for example, by increasing the number of hidden neurons or decreasing the number of nearest neighbours). When analyzing one-dimensional data, we compare the performance of C2ST-NN and C2ST-KNN against the Wilcoxon-Mann-Whitney test (Wilcoxon, 1945; Mann & Whitney, 1947), the Kolmogorov-Smirnov test (Kolmogorov, 1933; Smirnov, 1939), and the Kuiper test (Kuiper, 1962). In all cases, we also compare the performance of C2ST-NN and C2ST-KNN against the linear-time estimate of the Maximum Mean Discrepancy (MMD) criterion (Gretton et al., 2012a), the ME test (Jitkrittum et al., 2016), and the SCF test (Jitkrittum et al., 2016). We use the same significance level α across all experiments and tests, unless stated otherwise. We use Gaussian approximations to compute the null distributions of C2ST-NN and C2ST-KNN. We use the implementations of the MMD, ME, and SCF tests kindly provided by Jitkrittum et al. (2016), the SciPy implementations of the Kolmogorov-Smirnov and Wilcoxon tests, and the implementation from https://github.com/aarchiba/kuiper of the Kuiper test. The implementation of our experiments is available at https://github.com/lopezpaz/classifier_tests.
4.1 Experiments on Two-Sample Testing
Control of type-I errors
We start by evaluating the correctness of all the considered two-sample tests by examining whether the prescribed significance level α upper-bounds their type-I error. To do so, we draw two samples from the same distribution, and run each two-sample test on them. In this setup, a type-I error would be to reject the true null hypothesis. Figure 1(a) shows that the type-I error of all tests is upper-bounded by the prescribed significance level, for all sample sizes and random repetitions considered. Thus, all tests control their type-I error as expected, up to random variations due to finite experiments.
Gaussian versus Student's t
We consider distinguishing between samples drawn from a Normal distribution and samples drawn from a Student's t-distribution with ν degrees of freedom. We shift and scale both samples to exhibit zero mean and unit variance. Since the Student's t-distribution approaches the Normal distribution as ν increases, a two-sample test must focus on the peaks of the distributions to distinguish one from the other. Figure 1(b,c) shows the percentage of type-II errors made by all tests as we vary ν and the sample size separately, over repeated random trials; when one of the two quantities varies, the other stays fixed. The Wilcoxon-Mann-Whitney test exhibits the worst performance in this experiment, as expected (since the rank means of the Gaussian and Student's t distributions coincide). The best performing method is the one-dimensional Kuiper test, followed closely by the multidimensional tests C2ST-NN and ME.
Independence testing on sinusoids
For completeness, we showcase the use of two-sample tests to measure statistical dependence. This can be done, as described in Section 2, by performing a two-sample test between the observed data {(x_i, y_i)} and {(x_i, y_σ(i))}, where σ is a random permutation. Since the two samples under comparison are bivariate, only the C2ST-NN, C2ST-KNN, MMD, and ME tests compete in this task. We draw data according to a generative model where y is a noisy sinusoidal function of x: the x_i are iid draws from a standard Normal random variable, and each y_i is the cosine of a frequency-scaled x_i plus independent additive Gaussian noise. Thus, the statistical dependence between x and y weakens as we increase the frequency of the sinusoid, or increase the variance of the additive noise. Figure 1(d,e,f) shows the percentage of type-II errors made by C2ST-NN, C2ST-KNN, MMD, and ME as we vary the sample size, the frequency, and the noise variance separately over repeated trials, keeping the other two quantities fixed. Figure 1(d,e,f) reveals that among all tests, C2ST-NN is the most efficient in terms of sample size, C2ST-KNN is the most robust with respect to high-frequency variations, and C2ST-NN and ME are the most robust with respect to additive noise.
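The permutation construction above is easy to reproduce; a sketch, with a KNN classifier and the Gaussian null standing in for the full C2ST-KNN pipeline:

```python
import numpy as np
from scipy.stats import norm
from sklearn.neighbors import KNeighborsClassifier

def c2st_pvalue(a, b, seed=0):
    """One-sided p-value of an accuracy-based two-sample test between a and b."""
    rng = np.random.RandomState(seed)
    z = np.vstack([a, b])
    l = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    idx = rng.permutation(len(z))
    z, l = z[idx], l[idx]
    n_tr = len(z) // 2
    acc = KNeighborsClassifier().fit(z[:n_tr], l[:n_tr]).score(z[n_tr:], l[n_tr:])
    return norm.sf(acc, loc=0.5, scale=np.sqrt(0.25 / (len(z) - n_tr)))

def independence_pvalue(x, y, seed=0):
    """Tests 'x independent of y' by comparing {(x_i, y_i)} to {(x_i, y_sigma(i))}."""
    rng = np.random.RandomState(seed)
    paired = np.column_stack([x, y])
    permuted = np.column_stack([x, rng.permutation(y)])  # destroys any dependence
    return c2st_pvalue(paired, permuted, seed)

rng = np.random.RandomState(0)
x = rng.randn(1000)
dependent = np.cos(3.0 * x) + 0.1 * rng.randn(1000)   # sinusoidal dependence
independent = rng.randn(1000)
```

The classifier easily separates the paired sinusoidal data from its permuted copy, while the independent data remains indistinguishable from its permutation.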
Distinguishing between NIPS articles
We consider the problem of distinguishing between some of the categories of the 5903 articles published in the Neural Information Processing Systems (NIPS) conference from 1988 to 2015, as discussed in Jitkrittum et al. (2016). We consider articles on Bayesian inference (Bayes), neuroscience (Neuro), deep learning (Deep), and statistical learning theory (Learn). Table 1 shows the type-I errors (Bayes-Bayes row) and powers (rest of rows) for the tests reported in (Jitkrittum et al., 2016), together with C2ST-NN, at a significance level α, when averaged over repeated trials. In these experiments, C2ST-NN achieves maximum power, while upper-bounding its type-I error by α.
Distinguishing between facial expressions
Finally, we apply C2ST-NN to the problem of distinguishing between positive (happy, neutral, surprised) and negative (afraid, angry, disgusted) facial expressions from the Karolinska Directed Emotional Faces dataset, as discussed in (Jitkrittum et al., 2016). See the fourth plot of Figure 2, first two rows, for one example of each of these six emotions. Table 2 shows the type-I errors (first row) and the powers (second row) for the tests reported in (Jitkrittum et al., 2016), together with C2ST-NN, at significance level α, averaged over repeated trials. C2ST-NN achieves a near-optimal power, only marginally behind the perfect results of SCF-full and MMD-quad.
Problem  n_te  ME-full  ME-grid  SCF-full  SCF-grid  MMD-quad  MMD-lin  C2ST-NN

Bayes-Bayes  215  .012  .018  .012  .004  .022  .008  .002
Bayes-Deep  216  .954  .034  .688  .180  .906  .262  1.00
Bayes-Learn  138  .990  .774  .836  .534  1.00  .238  1.00
Bayes-Neuro  394  1.00  .300  .828  .500  .952  .972  1.00
Learn-Deep  149  .956  .052  .656  .138  .876  .500  1.00
Learn-Neuro  146  .960  .572  .590  .360  1.00  .538  1.00
Problem  n_te  ME-full  ME-grid  SCF-full  SCF-grid  MMD-quad  MMD-lin  C2ST-NN

± vs. ±  201  .010  .012  .014  .002  .018  .008  .002
± vs. ∓  201  .998  .656  1.00  .750  1.00  .578  .997
5 Experiments on Generative Adversarial Network Evaluation
random sample (images omitted)  MMD  C2ST-KNN  C2ST-NN

0.158  0.830  0.999
0.154  0.994  1.000
0.048  0.962  1.000
0.012  0.798  0.964
0.024  0.748  0.949
0.019  0.670  0.983
0.152  0.940  1.000
0.222  0.978  1.000
0.715  1.000  1.000
0.015  0.817  0.987
0.020  0.784  0.950
0.024  0.697  0.971
Since effective generative models produce examples barely distinguishable from real data, two-sample tests arise as a natural alternative to evaluate generative models. Particularly, our interest is to evaluate the sample quality of generative models with intractable likelihoods, such as GANs (Goodfellow et al., 2014). GANs implement the adversarial game
min_g max_d E_{x ~ P(x)} [log d(x)] + E_{z ~ P(z)} [log(1 − d(g(z)))],   (3)
where d(x) depicts the probability of the example x following the data distribution P(x) versus being synthesized by the generator, according to a trainable discriminator function d. In the adversarial game, the generator g plays to fool the discriminator by transforming noise vectors z ~ P(z) into real-looking examples g(z). On the opposite side, the discriminator plays to distinguish between real examples x and synthesized examples g(z). To approximate the solution to (3), alternate the optimization of the two losses (Goodfellow et al., 2014) given by
L(d) = −E_{x ~ P(x)} [log d(x)] − E_{z ~ P(z)} [log(1 − d(g(z)))],  L(g) = −E_{z ~ P(z)} [log d(g(z))].   (4)
Under the formalization (4), the adversarial game reduces to the sequential minimization of L(d) and L(g), and reveals the true goal of the discriminator: to be the C2ST that best distinguishes data examples x ~ P(x) from synthesized examples x̃ ~ Q(x̃), where Q is the probability distribution induced by sampling z ~ P(z) and computing x̃ = g(z). The formalization (4) also unveils the existence of an arbitrary binary classification loss function (see Remark 2), which in turn decides the divergence minimized between the real and fake data distributions (Nowozin et al., 2016).
Unfortunately, the evaluation of the log-likelihood of a GAN is intractable. Therefore, we employ a two-sample test to evaluate the quality of the fake examples. In simple terms, evaluating a GAN in this manner amounts to withholding some real data from the training process, and using it later in a two-sample test against the same amount of synthesized data. When the two-sample test is a binary classifier (as discussed in Section 3), this procedure simply trains a fresh discriminator on a fresh set of data. Since we train and test this fresh discriminator on held-out examples, it may differ from the discriminator trained along with the GAN. In particular, the discriminator trained along with the GAN may have overfitted to particular artifacts produced by the generator, thus becoming a poor C2ST.
We evaluate the use of two-sample tests for model selection in GANs. To this end, we train a number of DCGANs (Radford et al., 2016) on the bedroom class of LSUN (Yu et al., 2015) and the Labeled Faces in the Wild (LFW) dataset (Huang et al., 2007). We reused the Torch7 code of Radford et al. (2016) to train a set of DCGANs, where the generator and discriminator networks are convolutional neural networks (LeCun et al., 1998) with varying numbers of filters per layer. We evaluate each DCGAN on held-out examples using the fastest multidimensional two-sample tests: MMD, C2ST-NN, and C2ST-KNN.
Our first experiments revealed an interesting result. When performing two-sample tests directly on pixels, all tests obtain near-perfect test accuracy when distinguishing between real and synthesized (fake) examples. Such near-perfect accuracy happens consistently across DCGANs, regardless of the visual quality of their examples. This is because, albeit visually appealing, the fake examples contain checkerboard-like artifacts that are sufficient for the tests to consistently differentiate between real and fake examples. Odena et al. (2016) discovered this phenomenon concurrently with us.
In a second series of experiments, we featurize all images (both real and fake) using a deep convolutional ResNet (He et al., 2015) pretrained on ImageNet, a large dataset of natural images (Russakovsky et al., 2015). In particular, we use the resnet-34 model from Gross & Wilber (2016). Reusing a model pretrained on natural images ensures that the test distinguishes between real and fake examples based only on natural image statistics, such as Gabor filters, edge detectors, and so on. Such a strategy is similar to perceptual losses (Johnson et al., 2016) and inception scores (Salimans et al., 2016). In short, in order to evaluate how natural the images synthesized by a DCGAN look, one must employ a "natural discriminator". Table 3 shows three GANs producing poor samples and three GANs producing good samples for the LSUN and LFW datasets, according to the MMD, C2ST-KNN, and C2ST-NN tests on top of ResNet features. See Appendix A for the full list of results. Although it is challenging to provide an objective evaluation of our results, we believe that the rankings provided by two-sample tests could serve for efficient early stopping and model selection.
Remark 3 (How good is my GAN? Is it overfitting?).
Evaluating generative models is a delicate issue (Theis et al., 2016), but two-sample tests may offer some guidance. In particular, good (non-overfitting) generative models should produce similar two-sample test statistics when comparing their generated samples to both the train-set and the test-set samples. (As discussed with Arthur Gretton, if the generative model memorizes the train-set samples, a sufficiently large set of generated samples would reveal such memorization to the two-sample test. This is because some unique samples would appear multiple times in the set of generated samples, but not in the test set of samples.) As a general recipe, prefer generative models that achieve the same and small two-sample test statistic when comparing their generated samples to both the train-set and test-set samples.
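The recipe reads as follows in code; a toy sketch of ours where samples from two hypothetical models (one matching the data distribution, one not) are compared against train-set and held-out test-set samples with a fresh KNN-based C2ST:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def c2st_stat(a, b, seed=0):
    """Test accuracy of a fresh classifier distinguishing samples a and b."""
    rng = np.random.RandomState(seed)
    z = np.vstack([a, b])
    l = np.concatenate([np.zeros(len(a)), np.ones(len(b))])
    idx = rng.permutation(len(z))
    z, l = z[idx], l[idx]
    n_tr = len(z) // 2
    return KNeighborsClassifier().fit(z[:n_tr], l[:n_tr]).score(z[n_tr:], l[n_tr:])

rng = np.random.RandomState(0)
train_real = rng.randn(500, 2)         # real data seen by the model
test_real = rng.randn(500, 2)          # held-out real data

good_samples = rng.randn(500, 2)       # model matching the data distribution
bad_samples = 2.0 * rng.randn(500, 2)  # model with the wrong scale

good = c2st_stat(good_samples, train_real), c2st_stat(good_samples, test_real)
bad = c2st_stat(bad_samples, train_real), c2st_stat(bad_samples, test_real)
```

The preferable model attains statistics that are both small (near chance level) and similar between the train-set and test-set comparisons.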
5.1 Experiments on Interpretability

We illustrate the interpretability power of C2ST. First, the predictive uncertainty of a C2ST sheds light on where the two samples under consideration agree or differ. In the first plot of Figure 2, a C2ST-NN separates two bivariate Gaussian distributions with different means. When performing this separation, the C2ST-NN provides an explicit decision boundary that illustrates where the two distributions separate from each other. In the second plot of Figure 2, a C2ST-NN separates a Gaussian distribution from a Student's t-distribution with ν degrees of freedom, after scaling both to zero mean and unit variance. The plot reveals that the peaks of the distributions are their most differentiating feature. Finally, the third plot of Figure 2 displays, for the LFW and LSUN datasets, five examples classified as real with high uncertainty (first row, better looking examples), and five examples classified as fake with high certainty (second row, worse looking examples).
Second, the features learnt by the classifier of a C2ST are also a mechanism to understand the differences between the two samples under study. The fourth plot of Figure 2 shows six examples from the Karolinska Directed Emotional Faces dataset, analyzed in Section 4.1. In that same figure, we arrange the weights of the first linear layer of the C2ST-NN into the feature most activated at positive examples (bottom left, positive facial expressions), the feature most activated at negative examples (bottom middle, negative facial expressions), and the "discriminative feature", obtained by subtracting these two features (bottom right). The discriminative feature of the C2ST-NN agrees with the one found by (Jitkrittum et al., 2016): positive and negative facial expressions are best distinguished at the eyebrows, smile lines, and lips. A similar analysis to that of Jitkrittum et al. (2016) on the C2ST-NN features in the NIPS article classification problem (Section 4.1) reveals that the features most activated for the "statistical learning theory" category are those associated to the words inequ, tight, power, sign, hypothesi, norm, hilbert. The features most activated for the "Bayesian inference" category are those associated to the words infer, markov, graphic, conjug, carlo, automat, laplac.
6 Experiments on Conditional GANs for Causal Discovery
In causal discovery, we study the causal structure underlying a set of d random variables X = (X_1, ..., X_d). In particular, we assume that the random variables X share a causal structure described by a collection of Structural Equations, or SEs (Pearl, 2009). More specifically, we assume that each random variable X_i takes values as described by the SE X_i = f_i(Pa(X_i, G), N_i), for all i = 1, ..., d. In the previous, G is a Directed Acyclic Graph (DAG) with vertices associated to each of the random variables X_1, ..., X_d. Also in the same equation, Pa(X_i, G) denotes the set of random variables which are parents of X_i in the graph G, and N_i is an independent noise random variable that follows the probability distribution P(N_i). Then, we say that X_i causes X_j if X_i is a parent of X_j in G, since a change in X_i will cause a change in X_j, as described by the j-th SE.
The goal of causal discovery is to infer the causal graph G given a sample from P(X). For the sake of simplicity, we focus on the discovery of causal relations between two random variables, denoted by X and Y. That is, given the sample D = {(x_i, y_i)}_{i=1}^n, our goal is to conclude whether "X causes Y", or "Y causes X". We call this problem cause-effect discovery (Mooij et al., 2016). In the case where X causes Y, we can write the cause-effect relationship as:
x_i ~ P(X),  n_i ~ P(N),  y_i = f(x_i, n_i).   (5)
The current state-of-the-art in cause-effect discovery is the family of Additive Noise Models, or ANM (Mooij et al., 2016). These methods assume that the SE (5) allows the expression y = f(x) + n, and exploit the independence assumption between the cause random variable X and the noise random variable N to analyze the distribution of nonlinear regression residuals, in both causal directions.
Unfortunately, assuming independent additive noise is often too simplistic (for instance, the noise could be heteroskedastic or multiplicative). For this reason, we propose to use Conditional Generative Adversarial Networks, or CGANs (Mirza & Osindero, 2014), to address the problem of cause-effect discovery. Our motivation is the striking resemblance between the generator of a CGAN and the SE (5): the cause x is the conditioning variable input to the generator, the noise n is the noise variable input to the generator, and the effect y is the variable synthesized by the generator. Furthermore, CGANs respect the independence between the cause and the noise by construction, since the noise is sampled independently from all other variables. This way, CGANs bypass the additive noise assumption naturally, and allow arbitrary interactions between the cause variable and the noise variable.
To implement our cause-effect inference algorithm in practice, recall that training a CGAN with generator g from x to y minimizes, in alternation, the two following objectives: the discriminator loss

  −E_{(x,y)}[log d(y | x)] − E_{x,z}[log(1 − d(g(z | x) | x))]  with respect to d,

and the generator loss

  E_{x,z}[log(1 − d(g(z | x) | x))]  with respect to g.
Our recipe for cause-effect discovery is to learn two CGANs: one with a generator g_y from x to y, to synthesize the dataset D_{x→y} = {(x_j, g_y(z_j | x_j))}_{j=1}^n, and one with a generator g_x from y to x, to synthesize the dataset D_{y→x} = {(g_x(z_j | y_j), y_j)}_{j=1}^n. Then, we prefer the causal direction x → y if the two-sample test statistic between the real sample D and D_{x→y} is smaller than the one between D and D_{y→x}. Thus, our method is Occam’s razor at play: declare the simplest direction (in terms of conditional generative modeling) as the true causal direction.
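The two-sample statistic in this recipe is a C2ST. The following is a minimal numpy sketch of ours (not the paper's code) using a 1-nearest-neighbor classifier as the test statistic; the paper's experiments also use KNN and neural-network classifiers. Held-out accuracy near 0.5 supports the null hypothesis that the two samples come from the same distribution:

```python
import numpy as np

def c2st_knn(P, Q, seed=0):
    """Classifier two-sample test statistic: held-out accuracy of a
    1-nearest-neighbor classifier trained to tell sample P from sample Q."""
    rng = np.random.default_rng(seed)
    X = np.vstack([P, Q])
    y = np.r_[np.zeros(len(P)), np.ones(len(Q))]   # label P as 0, Q as 1
    idx = rng.permutation(len(X))
    X, y = X[idx], y[idx]
    half = len(X) // 2
    Xtr, ytr, Xte, yte = X[:half], y[:half], X[half:], y[half:]
    # classify each held-out point by the label of its nearest training point
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1)
    pred = ytr[d.argmin(axis=1)]
    return (pred == yte).mean()

rng = np.random.default_rng(0)
P = rng.normal(size=(400, 2))
Q_same = rng.normal(size=(400, 2))          # same distribution as P
Q_diff = rng.normal(size=(400, 2)) + 3.0    # mean-shifted distribution
t_null = c2st_knn(P, Q_same)   # should be near chance level (0.5)
t_alt = c2st_knn(P, Q_diff)    # should be well above chance level
```

In the cause-effect recipe, one would compute c2st_knn(D, D_{x→y}) and c2st_knn(D, D_{y→x}) and prefer the direction with the smaller statistic.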
[Table 4: cause-effect discovery accuracy of ANM-HSIC, IGCI, RCC, CGAN-C2ST (with KNN, NN, and MMD two-sample statistics), and the Ensemble-CGAN-C2ST; the best reported accuracy is 82%.]
Table 4 summarizes the performance of this procedure when applied to the Tübingen cause-effect pairs dataset, version August 2016 (Mooij et al., 2016). RCC is the Randomized Causation Coefficient of Lopez-Paz et al. (2015). The Ensemble-CGAN-C2ST trains 100 CGANs, and decides the causal direction by comparing the top generator obtained in each causal direction, as measured by C2ST-KNN. The need for ensembling is a reminder of the unstable behaviour of generative adversarial training, but it also highlights the promise of such models for causal discovery.
7 Conclusion
Our take-home message is that modern binary classifiers can be easily turned into powerful two-sample tests. We have shown that these classifier two-sample tests set a new state-of-the-art in performance, and enjoy unique attractive properties: they are easy to implement, learn a representation of the data on the fly, have simple asymptotic distributions, and allow different ways to interpret how the two samples under study differ. Looking into the future, the use of binary classifiers as two-sample tests provides a flexible and scalable approach for the evaluation and comparison of generative models (such as GANs), and opens the door to novel applications of these methods, such as causal discovery.
References
 Bai & Saranadasa (1996) Z. Bai and H. Saranadasa. Effect of high dimension: by an example of a two sample problem. Statistica Sinica, 1996.
 Bengio et al. (2013) Y. Bengio, L. Yao, and K. Cho. Bounding the test log-likelihood of generative models. arXiv, 2013.
 Bounliphone et al. (2015) W. Bounliphone, E. Belilovsky, M. B. Blaschko, I. Antonoglou, and A. Gretton. A test of relative similarity for model selection in generative models. arXiv, 2015.
 Box (1980) G. E. P. Box. Sampling and Bayes’ inference in scientific modelling and robustness. Journal of the Royal Statistical Society, 1980.
 Chwialkowski et al. (2015) K. P. Chwialkowski, A. Ramdas, D. Sejdinovic, and A. Gretton. Fast two-sample testing with analytic representations of probability measures. NIPS, 2015.
 Dziugaite et al. (2015) K. G. Dziugaite, D. M. Roy, and Z. Ghahramani. Training generative neural networks via Maximum Mean Discrepancy optimization. UAI, 2015.
 Ehm (1991) W. Ehm. Binomial approximation to the Poisson binomial distribution. Statistics & Probability Letters, 1991.
 Friedman et al. (2001) J. Friedman, T. Hastie, and R. Tibshirani. The elements of statistical learning. Springer, 2001.
 Friedman (2003) J. H. Friedman. On multivariate goodness of fit and two sample testing. eConf, 2003.
 Fukumizu et al. (2009) K. Fukumizu, A. Gretton, G. R. Lanckriet, B. Schölkopf, and B. Sriperumbudur. Kernel choice and classifiability for RKHS embeddings of probability distributions. NIPS, 2009.
 Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. NIPS, 2014.
 Gretton et al. (2005) A. Gretton, O. Bousquet, A. Smola, and B. Schölkopf. Measuring statistical dependence with Hilbert-Schmidt norms. ALT, 2005.
 Gretton et al. (2012a) A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. J. Smola. A kernel two-sample test. JMLR, 2012a.
 Gretton et al. (2012b) A. Gretton, D. Sejdinovic, H. Strathmann, S. Balakrishnan, M. Pontil, K. Fukumizu, and B. Sriperumbudur. Optimal kernel choice for large-scale two-sample tests. NIPS, 2012b.
 Gross & Wilber (2016) S. Gross and M. Wilber. Training and investigating residual nets, 2016. URL http://torch.ch/blog/2016/02/04/resnets.html.
 Gutmann & Hyvärinen (2012) M. U. Gutmann and A. Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. JMLR, 2012.
 He et al. (2015) K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2015.
 Huang et al. (2007) G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, University of Massachusetts, Amherst, 2007.
 Jitkrittum et al. (2016) W. Jitkrittum, Z. Szabo, K. Chwialkowski, and A. Gretton. Interpretable Distribution Features with Maximum Testing Power. NIPS, 2016.
 Johnson et al. (2016) J. Johnson, A. Alahi, and L. Fei-Fei. Perceptual Losses for Real-Time Style Transfer and Super-Resolution. ECCV, 2016.
 Kanamori et al. (2010) T. Kanamori, T. Suzuki, and M. Sugiyama. f-divergence estimation and two-sample homogeneity test under semiparametric density-ratio models. arXiv, 2010.
 Kingma & Ba (2015) D. Kingma and J. Ba. Adam: A method for stochastic optimization. ICLR, 2015.
 Kolmogorov (1933) A. N. Kolmogorov. Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, 1933.
 Kuiper (1962) N. H. Kuiper. Tests concerning random points on a circle. Nederl. Akad. Wetensch. Proc., 63, 1962.
 LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradientbased learning applied to document recognition. Proceedings of the IEEE, 1998.
 Lehmann & Romano (2006) E. L. Lehmann and J. P. Romano. Testing statistical hypotheses. Springer, 2006.
 Lloyd & Ghahramani (2015) J. R. Lloyd and Z. Ghahramani. Statistical model criticism using kernel two sample tests. NIPS, 2015.
 Lopez-Paz et al. (2015) D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. ICML, pp. 1452–1461, 2015.
 Mann & Whitney (1947) H. B. Mann and D. R. Whitney. On a test of whether one of two random variables is stochastically larger than the other. The annals of mathematical statistics, 1947.
 Menon & Ong (2016) A. K. Menon and C. S. Ong. Linking losses for density ratio and class-probability estimation. ICML, 2016.
 Mikolov et al. (2013) T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. NIPS, 2013.
 Mirza & Osindero (2014) M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv, 2014.
 Mohamed & Lakshminarayanan (2016) S. Mohamed and B. Lakshminarayanan. Learning in Implicit Generative Models. arXiv, 2016.
 Mooij et al. (2016) J. M. Mooij, J. Peters, D. Janzing, J. Zscheischler, and B. Schölkopf. Distinguishing cause from effect using observational data: methods and benchmarks. JMLR, 2016.
 Nowozin et al. (2016) S. Nowozin, B. Cseke, and R. Tomioka. fGAN: Training generative neural samplers using variational divergence minimization. NIPS, 2016.
 Odena et al. (2016) A. Odena, V. Dumoulin, and C. Olah. Deconvolution and checkerboard artifacts. http://distill.pub/2016/deconvcheckerboard/, 2016.
 Olivetti et al. (2012) E. Olivetti, S. Greiner, and P. Avesani. Induction in neuroscience with classification: issues and solutions. In Machine Learning and Interpretation in Neuroimaging. 2012.
 Pearl (2009) J. Pearl. Causality. Cambridge University Press, 2009.
 Pereira et al. (2009) F. Pereira, T. Mitchell, and M. Botvinick. Machine learning classifiers and fMRI: a tutorial overview. Neuroimage, 2009.
 Radford et al. (2016) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. ICLR, 2016.
 Ramdas et al. (2015) A. Ramdas, S. J. Reddi, B. Poczos, A. Singh, and L. Wasserman. Adaptivity and Computation-Statistics Tradeoffs for Kernel and Distance based High Dimensional Two Sample Testing. arXiv, 2015.
 Ramdas et al. (2016) A. Ramdas, A. Singh, and L. Wasserman. Classification accuracy as a proxy for two sample testing. arXiv, 2016.
 Reddi et al. (2015) S. J. Reddi, A. Ramdas, B. Póczos, A. Singh, and L. A. Wasserman. On the high dimensional power of a linear-time two sample test under mean-shift alternatives. AISTATS, 2015.
 Reid & Williamson (2011) M. D. Reid and R. C. Williamson. Information, divergence and risk for binary experiments. JMLR, 2011.
 Russakovsky et al. (2015) O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet large scale visual recognition challenge. IJCV, 2015.
 Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. NIPS, 2016.
 Smirnov (1939) N. V. Smirnov. On the estimation of the discrepancy between empirical curves of distribution for two independent samples. Bull. Math. Univ. Moscou, 1939.
 Smola & Schölkopf (1998) A. J. Smola and B. Schölkopf. Learning with kernels. Citeseer, 1998.
 Student (1908) Student. The probable error of a mean. Biometrika, 1908.
 Sutherland et al. (2016) D. J. Sutherland, H.-Y. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton. Generative Models and Model Criticism via Optimized Maximum Mean Discrepancy. arXiv, 2016.
 Theis et al. (2016) L. Theis, A. van den Oord, and M. Bethge. A note on the evaluation of generative models. ICLR, 2016.
 Vapnik (1998) V. Vapnik. Statistical learning theory. Wiley New York, 1998.
 Wilcoxon (1945) F. Wilcoxon. Individual comparisons by ranking methods. Biometrics bulletin, 1945.
 Wornowizki & Fried (2016) M. Wornowizki and R. Fried. Two-sample homogeneity tests based on divergence measures. Computational Statistics, 2016.
 Yu et al. (2015) F. Yu, A. Seff, Y. Zhang, S. Song, T. Funkhouser, and J. Xiao. LSUN: Construction of a Large-scale Image Dataset using Deep Learning with Humans in the Loop. arXiv, 2015.
Appendix A Results on Evaluation of Generative Adversarial Networks
gf  df  ep  MMD  KNN  NN
(the “random sample” column of generated images is omitted)
32  32  1  0.154  0.994  1.000  
32  32  10  0.024  0.831  0.996  
32  32  50  0.026  0.758  0.983  
32  32  100  0.014  0.797  0.974  
32  32  200  0.012  0.798  0.964  
32  64  1  0.330  0.984  1.000  
32  64  10  0.035  0.897  0.997  
32  64  50  0.020  0.804  0.989  
32  64  100  0.032  0.936  0.998  
32  64  200  0.048  0.962  1.000  
32  96  1  0.915  0.997  1.000  
32  96  10  0.927  0.991  1.000  
32  96  50  0.924  0.991  1.000  
32  96  100  0.928  0.991  1.000  
32  96  200  0.928  0.991  1.000  
64  32  1  0.389  0.987  1.000  
64  32  10  0.023  0.842  0.979  
64  32  50  0.018  0.788  0.977  
64  32  100  0.017  0.753  0.959  
64  32  200  0.018  0.736  0.963  
64  64  1  0.313  0.964  1.000  
64  64  10  0.021  0.825  0.988  
64  64  50  0.014  0.864  0.978  
64  64  100  0.019  0.685  0.978  
64  64  200  0.021  0.775  0.980  
64  96  1  0.891  0.996  1.000  
64  96  10  0.158  0.830  0.999  
64  96  50  0.015  0.801  0.980  
64  96  100  0.016  0.866  0.976  
64  96  200  0.020  0.755  0.983  
96  32  1  0.356  0.986  1.000  
96  32  10  0.022  0.770  0.991  
96  32  50  0.024  0.748  0.949  
96  32  100  0.022  0.745  0.965  
96  32  200  0.024  0.689  0.981  
96  64  1  0.287  0.978  1.000  
96  64  10  0.012  0.825  0.966  
96  64  50  0.017  0.812  0.962  
96  64  100  0.019  0.670  0.983  
96  64  200  0.020  0.711  0.972  
96  96  1  0.672  0.999  1.000  
96  96  10  0.671  0.999  1.000  
96  96  50  0.829  0.999  1.000  
96  96  100  0.668  0.999  1.000  
96  96  200  0.849  0.999  1.000 
gf  df  ep  MMD  KNN  NN
(the “random sample” column of generated images is omitted)
32  32  1  0.806  1.000  1.000  
32  32  10  0.152  0.940  1.000  
32  32  50  0.042  0.788  0.993  
32  32  100  0.029  0.808  0.982  
32  32  200  0.022  0.776  0.970  
32  64  1  0.994  1.000  1.000  
32  64  10  0.989  1.000  1.000  
32  64  50  0.050  0.808  0.985  
32  64  100  0.036  0.766  0.972  
32  64  200  0.015  0.817  0.987  
32  96  1  0.995  1.000  1.000  
32  96  10  0.992  1.000  1.000  
32  96  50  0.995  1.000  1.000  
32  96  100  0.053  0.778  0.987  
32  96  200  0.037  0.779  0.995  
64  32  1  1.041  1.000  1.000  
64  32  10  0.086  0.971  1.000  
64  32  50  0.043  0.756  0.988  
64  32  100  0.018  0.746  0.973  
64  32  200  0.025  0.757  0.972  
64  64  1  0.836  1.000  1.000  
64  64  10  0.103  0.910  0.998  
64  64  50  0.018  0.712  0.973  
64  64  100  0.020  0.784  0.950  
64  64  200  0.022  0.719  0.974  
64  96  1  1.003  1.000  1.000  
64  96  10  1.015  1.000  1.000  
64  96  50  1.002  1.000  1.000  
64  96  100  1.063  1.000  1.000  
64  96  200  1.061  1.000  1.000  
96  32  1  1.022  1.000  1.000  
96  32  10  0.222  0.978  1.000  
96  32  50  0.026  0.734  0.965  
96  32  100  0.016  0.735  0.964  
96  32  200  0.021  0.780  0.973  
96  64  1  0.715  1.000  1.000  
96  64  10  0.042  0.904  0.999  
96  64  50  0.024  0.697  0.971  
96  64  100  0.028  0.744  0.983  
96  64  200  0.020  0.697  0.976  
96  96  1  0.969  1.000  1.000  
96  96  10  0.920  1.000  1.000  
96  96  50  0.926  1.000  1.000  
96  96  100  0.920  1.000  1.000  
96  96  200  0.923  1.000  1.000 
Appendix B Proof of Theorem 1
Our statistic t̂ follows the distribution N(1/2, 1/(4N)) under the null hypothesis, and N(t, t(1−t)/N) under the alternative hypothesis, where N is the number of held-out test examples and t is the true classification accuracy. Furthermore, at a significance level α, the threshold of our statistic is 1/2 + z_{1−α}/(2√N), where z_{1−α} is the (1−α) quantile of the standard normal distribution; under this threshold we would accept the null hypothesis. Then, the probability of making a type-II error is

  β(t, N, α) = Φ( (1/2 + z_{1−α}/(2√N) − t) / √(t(1−t)/N) ).

Therefore, the power of the test is

  1 − β(t, N, α) = Φ( (t − 1/2 − z_{1−α}/(2√N)) · √(N / (t(1−t))) ),

which concludes the proof.
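This power can be evaluated numerically with the Python standard library. The sketch below is our own illustration, assuming the Gaussian approximations N(1/2, 1/(4N)) for the accuracy under the null and N(t, t(1−t)/N) under the alternative:

```python
from statistics import NormalDist

def c2st_power(t, N, alpha=0.05):
    """Power of the accuracy-based two-sample test at significance alpha.

    t:  classification accuracy under the alternative (t > 1/2)
    N:  number of held-out test examples
    Under the null, the accuracy is approximately N(1/2, 1/(4N)); we
    reject when it exceeds 1/2 + z_{1-alpha} / (2 sqrt(N)).
    """
    z = NormalDist().inv_cdf(1 - alpha)
    threshold = 0.5 + z / (2 * N ** 0.5)
    sigma_alt = (t * (1 - t) / N) ** 0.5
    # power = P(accuracy > threshold | alternative)
    return 1 - NormalDist(mu=t, sigma=sigma_alt).cdf(threshold)

p_small = c2st_power(0.6, 100)
p_large = c2st_power(0.6, 1000)
```

As expected, the power equals alpha when t = 1/2 (no distinguishable difference) and grows towards 1 as the number of held-out examples N increases.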
Appendix C Acknowledgements
We are thankful to L. Bottou, B. Graham, D. Kiela, M. RojasCarulla, I. Tolstikhin, and M. Tygert for their help in improving the quality of this manuscript.