Kernelized Complete Conditional Stein Discrepancy
Abstract
Much of machine learning relies on comparing distributions with discrepancy measures. Stein’s method creates discrepancy measures between two distributions that require only the unnormalized density of one and samples from the other. Stein discrepancies can be combined with kernels to define the kernelized Stein discrepancies (ksds). While kernels make Stein discrepancies tractable, they pose several challenges in high dimensions. We introduce kernelized complete conditional Stein discrepancies (kccsds). Complete conditionals turn a multivariate distribution into multiple univariate distributions. We prove that kccsds detect convergence and nonconvergence, and that they upper bound ksds. We empirically show that kccsds detect nonconvergence where ksds fail. Our experiments illustrate the difference between kccsds and ksds when comparing high-dimensional distributions and performing variational inference.
Raghav Singhal New York University New York, USA rs4070@nyu.edu Saad Lahlou New York University New York, USA msl596@nyu.edu Rajesh Ranganath New York University New York, USA rajeshr@cims.nyu.edu
Preprint. Work in progress.
1 Introduction
Discrepancy measures that compare a distribution $p$, known up to normalization, with a distribution $q$, known via samples from it, can be used for finding good variational approximations (Ranganath et al., 2016), checking the quality of mcmc samplers (Gorham and Mackey, 2015, 2017), or goodness-of-fit testing (Liu et al., 2016). There are two difficulties with using traditional discrepancies like Wasserstein metrics or total variation distance for these tasks. First, $p$ can be hard to sample, and second, computing these discrepancies requires an expensive maximization. These challenges lead to the following desiderata for a discrepancy (Gorham and Mackey, 2015).

Tractable: uses samples from $q$, evaluations of (unnormalized) $p$, and has a closed form.

Detect Convergence: If $q_n \Rightarrow p$, then $\mathcal{D}(q_n, p) \to 0$.

Detect Non-Convergence: If $\mathcal{D}(q_n, p) \to 0$, then $q_n \Rightarrow p$.
These desiderata ensure that the discrepancy is nonzero when $q$ does not equal $p$ and that it can be easily computed. To meet these desiderata, Chwialkowski et al. (2016); Oates et al. (2017); Gorham and Mackey (2017) developed kernelized Stein discrepancies (ksds). ksds measure the expectation under $q$ of functions that have expectation zero under $p$. These functions are constructed by applying Stein’s operator to a reproducing kernel Hilbert space.
In high dimensions, most kernels evaluated on a pair of points are near zero. Thus, ksds in high dimensions can be near zero, making it difficult to detect differences between high-dimensional distributions. We develop kernelized complete conditional Stein discrepancies (kccsds). These discrepancies use complete conditionals: the distribution of one variable given the rest. The complete conditionals are univariate. Rather than using multivariate kernels, kccsds use multiple univariate kernels, making it easier to compare distributions in high dimensions.
A given Stein discrepancy relies on a supremum over a class of test functions called the Stein set. The kccsds differ from ksds in that kccsds’ Stein set consists of univariate functions rather than multivariate functions. An immediate question is whether a Stein discrepancy with only univariate functions detects nonconvergence. We prove under technical conditions that (1) kccsds detect convergence and nonconvergence, and (2) kccsds are larger than ksds for the same choice of kernel.
Figure 1 compares ksds and kccsds with a different kernel on each panel. The figure compares two Gaussian distributions, $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(\mu, I_d)$, where only one coordinate of $\mu$ is nonzero. We then increase the dimension of the distribution and compare the kccsd and the ksd with both the imq and rbf kernels. We see that kccsds retain their ability to distinguish distributions in high dimensions. We show that kccsds can be used for variational inference, and we empirically compare distributions in cases where ksds provably fail.
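The shrinkage that drives this phenomenon is easy to reproduce. The sketch below (illustrative, not the paper's code; the bandwidth and sample counts are arbitrary choices) shows that the average rbf kernel value between independent standard-normal draws collapses as the dimension grows, since the expected squared distance between draws is $2d$:

```python
import numpy as np

def rbf(x, y, bandwidth=1.0):
    """RBF kernel k(x, y) = exp(-||x - y||^2 / (2 h^2))."""
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * bandwidth ** 2))

rng = np.random.default_rng(0)
# Average kernel value between independent standard-normal draws
# shrinks rapidly with dimension, since E||x - y||^2 = 2d.
vals = {}
for d in (1, 10, 50):
    samples = [rbf(rng.standard_normal(d), rng.standard_normal(d))
               for _ in range(200)]
    vals[d] = float(np.mean(samples))

assert vals[1] > vals[10] > vals[50]
```

For a standard Gaussian pair the expected kernel value is $3^{-d/2}$ with this bandwidth, so it is already about $0.004$ at $d = 10$.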
2 Stein Discrepancies
Stein’s method provides recipes for constructing expectation-zero functions of distributions known up to normalization. For a distribution $p$ with a Lipschitz score function $\nabla_x \log p$, we can create a Stein operator, $\mathcal{T}_p$, that acts on a test function $f$,
(1) $\mathbb{E}_{p}\left[(\mathcal{T}_p f)(X)\right] = 0,$
where $f$ is a smooth, 1-Lipschitz function. This relation, called Stein’s identity, can be used to construct a discrepancy when $p$ is known only up to a normalization constant (Gorham and Mackey, 2015). Let $\mathcal{F}$ be the Stein set, consisting of smooth Lipschitz functions satisfying a Neumann-type boundary condition; the Stein discrepancy is
$\mathcal{S}(q, \mathcal{T}_p, \mathcal{F}) = \sup_{f \in \mathcal{F}} \left|\mathbb{E}_{q}\left[(\mathcal{T}_p f)(X)\right]\right|.$
Stein discrepancies can be computationally burdensome as the supremum lacks a closed form.
Kernelized Stein Discrepancies.
To make the Stein discrepancy simpler to compute, Chwialkowski et al. (2016); Oates et al. (2017); Gorham and Mackey (2017) combined the theory of reproducing kernels (Steinwart and Christmann, 2008) with the Stein discrepancy to introduce kernelized Stein discrepancies (ksds). ksds restrict the Stein set to a reproducing kernel Hilbert space. Let $k$ be the kernel of a reproducing kernel Hilbert space (rkhs) $\mathcal{K}$. The rkhs consists of functions with finite norm. Functions in the rkhs satisfy the reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{K}}$ for all $f \in \mathcal{K}$ and for all $x$. ksds are defined using a Stein set $\mathcal{G}$: the set of vector-valued functions $g = (g_1, \dots, g_d)$ such that $g_j \in \mathcal{K}$ for all $j$, and $\|g\| \le 1$. ksds have a closed form.
Proposition 1
(Gorham and Mackey, 2017) Suppose $k \in C^{(1,1)}$. Then for all $x, y$ define the Stein kernel,
(2) $k_p(x, y) = \left(\mathcal{T}_p^{x} \mathcal{T}_p^{y}\right) k(x, y),$
where, for instance, if $\mathcal{T}_p$ is the Langevin-Stein operator, then for any $x, y$,
$k_p(x, y) = \nabla_x \log p(x)^\top \nabla_y \log p(y)\, k(x, y) + \nabla_x \log p(x)^\top \nabla_y k(x, y) + \nabla_y \log p(y)^\top \nabla_x k(x, y) + \operatorname{trace}\!\left(\nabla_x \nabla_y k(x, y)\right).$
Now, if $q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, then the kernelized Stein discrepancy has a closed form, $\mathrm{ksd}(q_n, p) = \frac{1}{n} \sqrt{\sum_{i, j = 1}^{n} k_p(x_i, x_j)}$.
This theorem shows that when the Stein set is chosen via an rkhs, the Stein discrepancy can be computed in closed form. When the distribution $p$ lies in the class of distantly dissipative distributions^{1}, ksds provably detect convergence and nonconvergence for kernels like the radial basis function (rbf) or the inverse multiquadratic (imq) (Gorham and Mackey, 2017). In dimension $d \ge 3$, the ksd with thin-tailed kernels like the rbf does not detect nonconvergence, but the ksd with the imq kernel, $k(x, y) = (c^2 + \|x - y\|_2^2)^{\beta}$ with $\beta \in (-1, 0)$, does. However, all of these kernels shrink as the dimension grows, which means their associated ksds become less sensitive in higher dimensions (see Figure 1).

^{1}A distribution is distantly dissipative if $\kappa_0 := \liminf_{r \to \infty} \kappa(r) > 0$, where $\kappa(r) = \inf\left\{ -2\, \frac{\langle \nabla \log p(x) - \nabla \log p(y),\, x - y \rangle}{\|x - y\|_2^2} : \|x - y\|_2 = r \right\}$. Common examples include finite Gaussian mixtures with the same variance and strongly log-concave distributions.
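The closed form of Proposition 1 is simple to estimate from samples. The following is a minimal numpy sketch, assuming the rbf base kernel and a `score` function returning $\nabla_x \log p(x)$ (both names and parameter values are our illustrative choices, not the paper's code):

```python
import numpy as np

def stein_kernel_rbf(x, y, score, h=1.0):
    """Langevin-Stein kernel k_p(x, y) built from an RBF base kernel.

    score(x) must return the target's score, grad_x log p(x)."""
    diff = x - y
    k = np.exp(-diff @ diff / (2 * h**2))
    gx = -diff / h**2 * k                            # grad_x k(x, y)
    gy = diff / h**2 * k                             # grad_y k(x, y)
    tr = (len(x) / h**2 - diff @ diff / h**4) * k    # trace(grad_x grad_y k)
    sx, sy = score(x), score(y)
    return sx @ sy * k + sx @ gy + sy @ gx + tr

def ksd(samples, score, h=1.0):
    """V-statistic estimate of the KSD over a sample matrix."""
    n = len(samples)
    total = sum(stein_kernel_rbf(samples[i], samples[j], score, h)
                for i in range(n) for j in range(n))
    return np.sqrt(max(total, 0.0)) / n

rng = np.random.default_rng(1)
score = lambda x: -x                          # score of a standard Gaussian
on_target = ksd(rng.standard_normal((100, 2)), score)
off_target = ksd(rng.standard_normal((100, 2)) + 3.0, score)
assert off_target > on_target                 # discrepancy grows off target
```

The quadratic number of kernel evaluations here is the cost that Huggins and Mackey (2018) address with random features.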
3 Complete Conditional Stein Discrepancy
Complete conditionals are univariate conditional distributions, $p(x_j \mid x_{-j})$, where $x_{-j}$ denotes all coordinates of $x$ except the $j$th. Complete conditional distributions are the basis for many inference procedures including the Gibbs sampler (Geman and Geman, 1984) and coordinate ascent variational inference (Ghahramani and Beal, 2001).
We construct ccsds and their kernelized versions, kccsds, and show that kccsds satisfy the desiderata. For a broad family of kernels, we show that kccsds upper bound traditional ksds. We focus on the Langevin-Stein operator (Mira et al., 2013; Gorham and Mackey, 2015; Oates et al., 2017),
(3) $(\mathcal{T}_p f)(x) = \nabla_x \log p(x)^\top f(x) + \nabla_x \cdot f(x).$
The analysis done here can be applied to other operators based on the gradient of the log probability.
Definition.
Using complete conditionals, we define a new operator that can be used to compare distributions in arbitrary dimensions. For any $f$ with univariate component functions, $f = (f_1, \dots, f_d)$ with $f_j : \mathbb{R} \to \mathbb{R}$, we can apply the complete conditional factorization,
$(\mathcal{T}_p f)(x) = \sum_{j=1}^{d} (\mathcal{T}_p^{j} f_j)(x),$
where $(\mathcal{T}_p^{j} f_j)(x) = \partial_{x_j} \log p(x_j \mid x_{-j})\, f_j(x_j) + \partial_{x_j} f_j(x_j)$. Note that although the test functions $f_j$ are univariate, the Stein operator applied to each component function, $\mathcal{T}_p^{j} f_j$, is a scalar-valued function of multiple variables, since $\partial_{x_j} \log p(x_j \mid x_{-j}) = \partial_{x_j} \log p(x)$ depends on all of $x$.
This factorization yields the same operator as the Langevin-Stein operator in Equation 3. A two-variable example for ccsds is in Appendix A. The key difference is that the Stein set for the ccsd consists of univariate component functions. Formally, we define the function space $\mathcal{F}_{cc}$ of functions $f = (f_1, \dots, f_d)$ with smooth, Lipschitz, univariate components. Then the complete conditional Stein discrepancy is defined as the Stein discrepancy restricted to this set,
$\mathrm{ccsd}(q, p) = \sup_{f \in \mathcal{F}_{cc}} \left|\mathbb{E}_{q}\left[(\mathcal{T}_p f)(X)\right]\right|.$
ccsds do not require computing the complete conditionals of $p$ or $q$. Like the original Stein discrepancy, the supremum in ccsds can be hard to compute. Instead, we introduce their kernelized form, the kernelized complete conditional Stein discrepancy (kccsd).
4 Kernelized Complete Conditional Stein Discrepancy
Similar to the construction of ksds from the Stein discrepancy, we meld the theory of reproducing kernels with complete conditional Stein discrepancies to obtain kccsds. In this section we show that kccsds satisfy all three desiderata: a closed and tractable form, detection of convergence, and detection of nonconvergence. We also show kccsds upper bound ksds, and the difference between the two increases as the dimension of the distribution increases.
kccsds admit a closed form.
Let $k$ be a reproducing kernel for the reproducing kernel Hilbert space $\mathcal{K}$; then the Stein set for kccsds consists of functions $f = (f_1, \dots, f_d)$ where each $f_j \in \mathcal{K}$ is a univariate function, and $\|f\| \le 1$. Note that it is possible for the kernel to change with each dimension, but for simplicity we focus on a single kernel for all dimensions and drop the index on the kernel. We now show that kccsds can be computed in closed form.
Theorem 1
Let $k \in C^{(1,1)}$. Then for all $x, y$ define the complete conditional Stein kernel, $k_p^{cc}$, as follows:
$k_p^{cc}(x, y) = \sum_{j=1}^{d} \Big[ \partial_{x_j} \log p(x)\, \partial_{y_j} \log p(y)\, k(x_j, y_j) + \partial_{x_j} \log p(x)\, \partial_{y_j} k(x_j, y_j) + \partial_{y_j} \log p(y)\, \partial_{x_j} k(x_j, y_j) + \partial_{x_j} \partial_{y_j} k(x_j, y_j) \Big].$
Now, if $q_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}$, then $\mathrm{kccsd}(q_n, p) = \frac{1}{n} \sqrt{\sum_{i, i' = 1}^{n} k_p^{cc}(x_i, x_{i'})}$.
The proof is in Appendix D. Note that the closed form for kccsds is the same as for ksds, but the kernels are now univariate rather than multivariate.
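The closed form of Theorem 1 can be sketched in a few lines of numpy. As before, `score` (returning $\nabla_x \log p(x)$), the univariate rbf base kernel, and all parameter values are our illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def cc_stein_kernel(x, y, score, h=1.0):
    """Complete conditional Stein kernel: a sum of d univariate
    Langevin-Stein kernels, one per coordinate.

    score(x) must return the joint score, grad_x log p(x)."""
    sx, sy = score(x), score(y)
    total = 0.0
    for j in range(len(x)):
        d = x[j] - y[j]
        k = np.exp(-d * d / (2 * h**2))
        dkx = -d / h**2 * k                        # d k / d x_j
        dky = d / h**2 * k                         # d k / d y_j
        dkxy = (1.0 / h**2 - d * d / h**4) * k     # d^2 k / d x_j d y_j
        total += sx[j] * sy[j] * k + sx[j] * dky + sy[j] * dkx + dkxy
    return total

def kccsd(samples, score, h=1.0):
    """V-statistic estimate of the kccsd over a sample matrix."""
    n = len(samples)
    total = sum(cc_stein_kernel(samples[i], samples[j], score, h)
                for i in range(n) for j in range(n))
    return np.sqrt(max(total, 0.0)) / n

rng = np.random.default_rng(2)
score = lambda x: -x                       # standard Gaussian target
x_on = rng.standard_normal((80, 5))
x_off = rng.standard_normal((80, 5)) + 2.0
assert kccsd(x_off, score) > kccsd(x_on, score)
```

Because each coordinate contributes a kernel evaluated only on scalars $x_j, y_j$, the kernel values do not shrink with dimension the way a multivariate kernel's do.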
kccsds detect convergence.
kccsds can be upper bounded by the Wasserstein distance $W_1$. This shows that if $W_1(q_n, p) \to 0$ as $n \to \infty$, then kccsds go to zero, satisfying desideratum 2.
Proposition 2
Suppose $k \in C_b^{(2,2)}$ and $\nabla \log p$ is Lipschitz with $\mathbb{E}_p\left[\|\nabla \log p(X)\|_2^2\right] < \infty$. If $W_1(q_n, p) \to 0$, then $\mathrm{kccsd}(q_n, p) \to 0$.
The proof follows from Gorham and Mackey (2017) and is in Appendix E; the proposition applies to kernels like the rbf, imq, and Matérn kernels.
kccsd detects nonconvergence.
In this section, we show that kccsds detect nonconvergence by showing that when the kccsd converges to zero, the Fisher divergence converges to zero. The Fisher divergence measures the error between the score functions of two distributions. It is defined as
$F(q, p) = \mathbb{E}_{q}\left[\left\|\nabla \log q(X) - \nabla \log p(X)\right\|_2^2\right].$
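A quick way to build intuition for this divergence is to estimate it by Monte Carlo. The sketch below (our own example, not from the paper) uses two Gaussians with identical covariance, where the score difference is the constant mean shift, so the divergence equals the squared mean distance:

```python
import numpy as np

def fisher_divergence(samples, score_q, score_p):
    """Monte Carlo estimate of E_q ||grad log q(X) - grad log p(X)||^2."""
    diffs = np.array([score_q(x) - score_p(x) for x in samples])
    return float(np.mean(np.sum(diffs**2, axis=1)))

rng = np.random.default_rng(3)
mu = np.array([2.0, 0.0])
samples = rng.standard_normal((500, 2)) + mu   # draws from q = N(mu, I)
f = fisher_divergence(samples,
                      lambda x: -(x - mu),     # score of q
                      lambda x: -x)            # score of p = N(0, I)
# For Gaussians with identical covariance the divergence is ||mu||^2 = 4.
assert abs(f - 4.0) < 1e-9
```

Note that the estimate requires the score of $q$, which mirrors the role the score of $q_n$ plays in Theorem 2 below.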
The following lemma shows that if the score functions of $q_n$ and $p$ are Lipschitz, then when the Fisher divergence between the two distributions goes to zero, the two distributions are equal in distribution.
Lemma 1
Suppose the score functions $\nabla \log q_n$ and $\nabla \log p$ are Lipschitz with a uniform Lipschitz constant. If $F(q_n, p) \to 0$, then $q_n \Rightarrow p$.
We use this lemma to show kccsd going to zero implies equality in distribution.
Theorem 2
Suppose $k$ is integrally strictly positive definite, the score functions $\nabla \log q_n$ and $\nabla \log p$ are Lipschitz with a uniform Lipschitz constant, and the score functions have bounded second moments for all $n$. If $\mathrm{kccsd}(q_n, p) \to 0$, then $q_n \Rightarrow p$.
The proof is in Appendix F. Unlike a full theory of weak convergence, the proof for Theorem 2 requires the score function of $q_n$. We empirically show that the kccsd detects nonconvergence for distributions that do not have score functions, even where the ksd fails to detect nonconvergence.
For $d \ge 3$, Gorham and Mackey (2017) show that ksds fail to detect nonconvergence for commonly used kernels like the rbf. When kernels decay faster than the score function grows, ksds ignore the tails. This problem gets worse in higher dimensions for the rbf and Matérn kernels. For instance, if $x, y \sim \mathcal{N}(0, I_d)$ then $\mathbb{E}\left[\|x - y\|_2^2\right] = 2d$, which causes the rbf kernel to decay rapidly in high dimensions, leading to a low discrepancy value even if the distributions are different.
In Figure 2 we compare a nontight sequence to a Gaussian target, following Gorham and Mackey (2017). For each $n$, let $q_n$ be the empirical distribution over $n$ points whose locations spread apart as $n$ grows. For a kernel like the rbf, this causes the kernel evaluations to decay as we increase the sample size. This sequence $q_n$ does not have a score function. Unlike the ksd, Figure 2 shows that the kccsd with both the rbf and imq kernels is able to detect nonconvergence.
Even when ksds detect convergence in high dimensions, they can be too small to be of practical use, making them poor assessments of sample quality. Figure 1 depicts this problem for two Gaussian distributions: $p$ is the standard Gaussian and $q$ is a Gaussian with the mean of one dimension set to 5. The plots show how ksds decrease with increasing dimension. After dimension 10, the ksd becomes very small for the rbf kernel, and even with the imq kernel, which detects nonconvergence, the ksd still shrinks.
kccsds upper bound ksds.
In this section we show that kccsds are upper bounds on ksds. The difference between the discrepancies grows as the dimensionality increases.
Suppose that the ksd and the kccsd have the same type of kernel with the same kernel parameters. We show that the kccsd is an upper bound of the ksd, given that the kernel satisfies the following conditions:

for all and for all .

Define the univariate kernel difference as , where we fix . Then is an integrally strictly positive definite kernel.
In Appendix G, we show that both the rbf and the imq kernels satisfy these conditions. The proofs follow from Schoenberg’s connection between completely monotone and positive definite functions (Schoenberg, 1938; Fasshauer, 2003).
Theorem 3
Suppose $k$ satisfies conditions C1 and C2, and the score functions of $p$ and $q$ are Lipschitz. Then the ksd and the kccsd satisfy the relation
(4) $\mathrm{ksd}(q, p) \le \mathrm{kccsd}(q, p).$
The proof is in Appendix G. The diagram below shows the relations between the discrepancies. By construction the Stein set for ccsd is a subset of the Stein set for the Stein discrepancy. This means the Stein discrepancy dominates the ccsd. Kernelization shrinks Stein sets, so kernelized variants are dominated by their corresponding nonkernelized variants. Theorem 3 shows that kccsds dominate ksds.
5 Experiments
We developed the kernelized complete conditional Stein discrepancies. Here we empirically study their use for performing sample quality checks and variational inference. We detail the variational inference algorithms in Appendix B. We study two kernels: imq and rbf. For the imq kernel, $k(x, y) = (c^2 + \|x - y\|_2^2)^{\beta}$, we fix the parameters $c$ and $\beta$, and for the rbf kernel we choose the bandwidth.
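For concreteness, the two kernel families can be sketched as below. The default parameter values are illustrative only, since the paper's exact settings are not reproduced here:

```python
import numpy as np

def imq(x, y, c=1.0, beta=-0.5):
    """Inverse multiquadric kernel k(x, y) = (c^2 + ||x - y||^2)^beta.

    c and beta are illustrative defaults, not the paper's settings;
    beta must lie in (-1, 0) for the kernel to detect nonconvergence."""
    r2 = np.sum((x - y) ** 2)
    return (c**2 + r2) ** beta

def rbf(x, y, h=1.0):
    """RBF kernel with bandwidth h; in practice h is often tuned,
    e.g. with a median heuristic."""
    return np.exp(-np.sum((x - y) ** 2) / (2 * h**2))

x, y = np.zeros(3), np.ones(3)
assert np.isclose(imq(x, y), (1.0 + 3.0) ** -0.5)   # heavy tails
assert np.isclose(rbf(x, y), np.exp(-1.5))          # thin tails
```

The key contrast is tail behavior: the imq kernel decays polynomially in the distance while the rbf decays exponentially, which is why the imq-based ksd retains power that the rbf-based ksd loses.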
5.1 Distribution Tests
Figure 1 shows the effect of increasing dimension when comparing two Gaussian distributions that differ in only one coordinate of the mean. ksds decrease as the dimension increases, unlike kccsds. Figure 2 with the rbf kernel shows empirically that the kccsd detects nonconvergence where the ksd does not. Despite kccsds upper bounding ksds, Figure 3 shows that both the kccsd and the ksd converge to zero at a similar rate when $q = p$ is a mixture of Gaussians where each component has a different nondiagonal covariance matrix.
In Appendix C we conduct more tests to study the rate of convergence to zero when both distributions are the same. We also compare two Gaussian distributions with increasing distance between their means; there we see that the kccsd is more sensitive to changes than the ksd in high dimensions.
5.2 Sample Quality Checks
Here we show that kccsd can be used for sample quality checks.
Selecting Sampler Hyperparameters.
Stochastic gradient Langevin dynamics (sgld) is a biased mcmc sampler based on adding noise to the standard stochastic gradient optimization method (Welling and Teh, 2011). Since this method makes use of subsampling, it has allowed mcmc to scale to large datasets and large models. In this experiment we do posterior inference for a two-layer neural network, with a sigmoid activation function, for a regression task. We use the yacht hydrodynamics dataset (Gerritsma et al., 1981) from the UCI dataset repository.
Since biased methods trade asymptotic exactness for sampling efficiency, standard mcmc diagnostics are not applicable, as they do not account for asymptotic bias. We use the kccsd to assess sample quality from biased mcmc samplers. Selecting the step size $\epsilon$ is important to ensure the samples are approximately from the posterior (Welling and Teh, 2011). When $\epsilon$ is too small, sgld does not explore the space enough and there is high autocorrelation between the samples. However, when $\epsilon$ is too big, sgld has higher bias and is unstable.
For each step size $\epsilon$ we generate 5 independent chains with minibatch size 32. Each chain consists of 10,000 samples after a burn-in phase of 50,000 samples. We compare the kccsd to the inverse effective sample size. Effective sample size relies on asymptotic exactness of the samples, which is violated by stochastic gradient Langevin dynamics. Figure 4 compares these two metrics: the step size with the lowest kccsd value differs from the step size that maximizes the effective sample size.
5.3 Stein Variational Gradient Descent
We compare svgd to the complete conditional Stein variational gradient descent (ccsvgd) algorithm by training a Bayesian neural network and learning a multivariate Gaussian. We provide details for ccsvgd in Appendix B.
Learning Multivariate Gaussians.
We compare the performance of ccsvgd and svgd using the rbf kernel on learning a multivariate Gaussian target. We train both methods to learn a Gaussian target with diagonal covariance and with nondiagonal covariance. We use particles initialized from a standard Gaussian and run both methods for a fixed number of iterations. We use gradient descent with a decreasing step size. Figure 5 displays the ksd between the target and the learnt distribution as the dimension increases. ccsvgd has a lower ksd value; it learns a better approximation.
Bayesian Neural Network.
Table 1: Average test performance of svgd, ccsvgd, and map.

Mean Test RMSE
Dataset      svgd   ccsvgd  map
Boston       2.565  2.404   2.567
Yacht        0.749  0.611   0.749
Protein      4.578  4.740   4.578
Concrete     5.297  5.609   5.297
Real Estate  6.608  6.566   7.208

Mean Test Log-Likelihood
Dataset      svgd    ccsvgd  map
Boston       -2.391  -2.369  -2.392
Yacht        -1.404  -1.167  -1.404
Protein      -2.944  -2.968  -2.944
Concrete     -3.052  -3.148  -3.052
Real Estate  -3.378  -3.378  -3.476
We compare ccsvgd and svgd on Bayesian neural networks. We use a similar setting as Liu and Wang (2016): a neural network with one hidden layer, a rectifier activation, and 50 hidden units. We split each dataset into a training set and a test set. The results are averaged over 5 trials. The minibatch size is 100 and the number of particles is the same for both methods; we use the rbf kernel.
Table 1 shows that ccsvgd performs better than svgd and map on 3 out of 5 experiments. svgd and map yield almost identical results. In these experiments, more accurate Bayesian inference seems to provide little advantage as MAP performs well.
6 Discussion
We developed kernelized complete conditional Stein discrepancies. We show that kccsds are an upper bound on ksds and can be tractably computed on samples given an unnormalized differentiable distribution. They lead to better sample quality measures and variational inference algorithms. The ksd with the rbf kernel is not able to detect nonconvergence with nontight sequences. However, we observe that the kccsd with the rbf kernel not only detects nonconvergence but also has a higher discrepancy than the ksd with the imq. A proof that the kccsd does or does not detect nonconvergence for nontight sequences with rbf kernels is a promising theoretical avenue of research. Empirically, when distributions match, the kccsd and the ksd converge to zero at the same rate. In Theorem 3 we show that the kccsd upper bounds the ksd; this means that the kccsd could provide a more powerful goodness-of-fit test (Liu et al., 2016). Like ksds, kccsds suffer from a computational cost that grows quadratically with the number of samples. To address this, random feature Stein discrepancies (Huggins and Mackey, 2018) were developed. Applications of that method using the kccsd are a promising avenue for research.
Acknowledgments
We would like to thank Jaan Altosaar, Mark Goldstein, Xintian Han, Aahlad Manas Puli, and Bharat Srikishan for their helpful feedback and comments.
References
 Chwialkowski et al. (2016) Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In International Conference on Machine Learning, pages 2606–2615.
 Fasshauer (2003) Fasshauer, G. (2003). Positive definite and completely monotone functions. http://www.math.iit.edu/~fass/603_ch2.pdf.
 Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741.
 Gerritsma et al. (1981) Gerritsma, J., Onnink, R., and Versluis, A. (1981). Geometry, resistance and stability of the Delft systematic yacht hull series. International Shipbuilding Progress, 28(328):276–297.
 Ghahramani and Beal (2001) Ghahramani, Z. and Beal, M. J. (2001). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems, pages 507–513.
 Gorham and Mackey (2015) Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, pages 226–234.
 Gorham and Mackey (2017) Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. arXiv preprint arXiv:1703.01717.
 Huggins and Mackey (2018) Huggins, J. and Mackey, L. (2018). Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1903–1913.
 Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284.
 Liu and Wang (2016) Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386.
 Mira et al. (2013) Mira, A., Solgi, R., and Imparato, D. (2013). Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662.
 Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718.
 Ranganath (2018) Ranganath, R. (2018). Black Box Variational Inference: Scalable, Generic Bayesian Computation and its Applications. PhD thesis, Princeton University.
 Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.
 Ranganath et al. (2016) Ranganath, R., Tran, D., Altosaar, J., and Blei, D. (2016). Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504.
 Schoenberg (1938) Schoenberg, I. J. (1938). Metric spaces and completely monotone functions. Annals of Mathematics, pages 811–841.
 Steinwart and Christmann (2008) Steinwart, I. and Christmann, A. (2008). Support vector machines. Springer Science & Business Media.
 Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.
Appendix A Construction of ccsds
Consider distributions $p$ and $q$ over $(x_1, x_2)$ and suppose we want to compare their complete conditionals, the univariate distributions $p(x_1 \mid x_2)$ and $q(x_1 \mid x_2)$. Using a similar analysis as Ranganath (2018), we use the univariate Langevin-Stein operator to compare these complete conditionals. Let $f_1$ be a univariate function. The Langevin-Stein operator applied to $f_1$ yields
$(\mathcal{T}_p^{1} f_1)(x) = \partial_{x_1} \log p(x_1 \mid x_2)\, f_1(x_1) + \partial_{x_1} f_1(x_1) = \partial_{x_1} \log p(x_1, x_2)\, f_1(x_1) + \partial_{x_1} f_1(x_1).$
The equality uses the fact that the score function of the conditional distribution is the score function of the joint, $\partial_{x_1} \log p(x_1 \mid x_2) = \partial_{x_1} \log p(x_1, x_2)$. Note that although the test function $f_1$ is univariate, the operator applied to $f_1$ is a scalar-valued function of multiple variables, $x = (x_1, x_2)$.
If $f_1$ satisfies the boundary conditions, Stein’s identity applies and
$\mathbb{E}_{p(x_1 \mid x_2)}\left[(\mathcal{T}_p^{1} f_1)(X)\right] = 0.$
Now, two distributions match only if their complete conditionals match. This means we can combine the complete conditional Stein operators to compare multivariate distributions. For any $f = (f_1, f_2)$ with univariate component functions, we can compare the distributions as follows:
$\mathbb{E}_{q}\left[(\mathcal{T}_p^{1} f_1)(X) + (\mathcal{T}_p^{2} f_2)(X)\right],$
where we use iterated expectation over the conditionals. Observe that the discrepancy can be computed without the use of complete conditionals of $p$ or $q$. Hence, the complete conditional factorization suggests the use of test functions with univariate component functions.
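The identity between the conditional score and the joint score can be checked numerically. The following sketch (our own example) uses a correlated two-dimensional Gaussian, comparing the closed-form conditional score against a finite difference of the joint log density:

```python
import numpy as np

# Verify that the conditional score equals the joint score in the shared
# coordinate, using a correlated 2-d Gaussian p = N(0, Sigma).
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
Sigma_inv = np.linalg.inv(Sigma)

def log_joint(x):
    """log p(x1, x2) up to an additive constant."""
    return -0.5 * x @ Sigma_inv @ x

def cond_score_x1(x):
    """d/dx1 log p(x1 | x2) for the Gaussian conditional."""
    mu = Sigma[0, 1] / Sigma[1, 1] * x[1]                 # conditional mean
    var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]    # conditional variance
    return -(x[0] - mu) / var

x = np.array([0.7, -1.3])
h = 1e-5
e1 = np.array([1.0, 0.0])
# Central difference of the joint log density in x1; the log p(x2)
# term has no x1 dependence, so this equals d/dx1 log p(x1 | x2).
joint_score_x1 = (log_joint(x + h * e1) - log_joint(x - h * e1)) / (2 * h)
assert np.isclose(cond_score_x1(x), joint_score_x1, atol=1e-6)
```

This is the property that lets ccsds avoid ever computing the complete conditionals explicitly.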
Appendix B Variational Inference Using Stein Discrepancies
Variational inference casts Bayesian inference as an optimization problem, typically formulated as minimizing the kl divergence between the posterior and the variational family. Operator variational inference (opvi) (Ranganath et al., 2016) uses Stein discrepancies as objectives for variational inference. Stein variational gradient descent (svgd) (Liu and Wang, 2016) uses Stein’s method to iteratively transform a set of particles to match the posterior. We describe how to use kccsds in opvi and svgd, yielding black box variational inference algorithms (Ranganath et al., 2014).
b.1 Operator Variational Inference
opvi suggests the use of a neural network to learn the optimal test function. This increases the difficulty of optimization. We introduce the use of kccsds and ksds as objective functions in operator variational inference, which removes the need to estimate an optimal test function.
Given a parametric variational family $q_\lambda$ and a model $p$, kernelized opvi solves the following optimization problem:
(5) $\lambda^{*} = \operatorname{arg\,min}_{\lambda}\; \mathrm{kccsd}(q_\lambda, p)^2.$
Unbiased estimation of the objective only requires evaluation of the model score function and samples from $q_\lambda$. As the only requirement for the variational approximation is sampling (and differentiability for gradients), this allows flexibility to choose variational families where a tractable density is not available. Such distributions are called variational programs and were studied in Ranganath et al. (2016).
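To make the objective concrete, the sketch below estimates a kernelized Stein objective for a one-dimensional variational family $q_\lambda = \mathcal{N}(\lambda, 1)$ via reparameterized samples. The kernel, target, and sample sizes are our illustrative choices, not the paper's experimental setup:

```python
import numpy as np

def stein_kernel_1d(a, b, score, h=1.0):
    """Univariate Langevin-Stein kernel with an RBF base kernel."""
    d = a - b
    k = np.exp(-d * d / (2 * h**2))
    return (score(a) * score(b) * k
            + score(a) * d / h**2 * k       # score(a) * dk/db
            - score(b) * d / h**2 * k       # score(b) * dk/da
            + (1 / h**2 - d * d / h**4) * k)

def objective(lam, score, rng, n=200):
    """Squared Stein discrepancy between q = N(lam, 1), represented by
    reparameterized samples, and the target with the given score."""
    xs = lam + rng.standard_normal(n)       # reparameterized draws from q
    return np.mean([stein_kernel_1d(a, b, score)
                    for a in xs for b in xs])

rng = np.random.default_rng(5)
score = lambda x: -(x - 3.0)                # target N(3, 1)
# The objective is smaller when the variational mean matches the target.
assert objective(3.0, score, rng) < objective(0.0, score, rng)
```

A gradient-based optimizer would then differentiate this estimate with respect to $\lambda$ through the reparameterization.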
b.2 Stein Variational Gradient Descent
svgd, a particle-based variational inference algorithm, is based on creating a set of distributions obtained by taking smooth and invertible transformations, $T$, of a reference distribution $q_0$. The resulting distribution $q_{[T]}$ is defined as
$q_{[T]}(z) = q_0\!\left(T^{-1}(z)\right) \left|\det \nabla_z T^{-1}(z)\right|,$
where $T^{-1}$ and $\nabla_z T^{-1}$ denote the inverse and the Jacobian matrix of the inverse.
Now, suppose we choose the family of transformations to be small perturbations of the identity map of the form $T(x) = x + \epsilon\, \phi(x)$, where $\phi$ is a smooth function belonging to a suitable function family. The Stein operator is equivalent to the derivative of the kl divergence (Liu and Wang, 2016). We note that the Stein operators in the kl divergence derivative are built from the matrix Stein operator. However, the derivative uses the trace of the matrix Stein operator, which is equal to the Stein operator we study.
Theorem 4
(Liu and Wang, 2016) Let $T(x) = x + \epsilon\, \phi(x)$ and let $q_{[T]}$ denote the distribution of $T(X)$ when $X \sim q$. Then
(6) $\nabla_\epsilon\, \mathrm{KL}\!\left(q_{[T]} \,\|\, p\right)\big|_{\epsilon = 0} = -\mathbb{E}_{X \sim q}\left[\operatorname{trace}\!\left((\mathcal{T}_p \phi)(X)\right)\right].$
The following lemma identifies the maximal perturbation direction that gives the steepest decrease in the kl divergence:
Lemma 2
(Liu and Wang, 2016) Assume the conditions in Theorem 4 and consider all perturbation directions in the unit ball of the ksd Stein set. The direction of steepest descent that maximizes the negative gradient in Equation 6 is given by
(7) $\phi^{*}(\cdot) = \mathbb{E}_{X \sim q}\left[k(X, \cdot)\, \nabla_X \log p(X) + \nabla_X k(X, \cdot)\right],$
which implies that $\nabla_\epsilon\, \mathrm{KL}\!\left(q_{[T]} \,\|\, p\right)\big|_{\epsilon = 0} = -\,\mathrm{ksd}(q, p)^2$.
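The optimal direction in Lemma 2 yields the standard svgd particle update. The following is a minimal numpy sketch with an rbf kernel, assuming `score(x)` returns $\nabla_x \log p(x)$; the step size, bandwidth, and particle counts are illustrative choices:

```python
import numpy as np

def svgd_step(particles, score, h=1.0, eps=0.1):
    """One SVGD update along the steepest-descent direction.

    score(x) returns grad_x log p(x); eps is the step size."""
    n = len(particles)
    new = np.empty_like(particles)
    for i in range(n):
        phi = np.zeros_like(particles[i])
        for j in range(n):
            diff = particles[j] - particles[i]
            k = np.exp(-diff @ diff / (2 * h**2))
            # The k * score term pulls particles toward high density;
            # the grad-k term repels nearby particles from each other.
            phi += k * score(particles[j]) - diff / h**2 * k
        new[i] = particles[i] + eps * phi / n
    return new

rng = np.random.default_rng(4)
score = lambda x: -(x - 5.0)          # target N(5, I) in 2 dimensions
parts = rng.standard_normal((30, 2))
for _ in range(200):
    parts = svgd_step(parts, score, eps=0.2)
# The particle mean should move close to the target mean of 5.
assert np.all(np.abs(parts.mean(axis=0) - 5.0) < 1.0)
```

The repulsion term is what keeps the particles from collapsing to the mode, so svgd approximates the full posterior rather than just the map point.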
Here, we propose the use of the kccsd Stein set as the perturbation family for svgd rather than the ksd Stein set. The optimal perturbation for svgd using the kccsd Stein set, $\phi^{cc} = (\phi^{cc}_1, \dots, \phi^{cc}_d)$, is defined componentwise as
$\phi^{cc}_j(\cdot) = \mathbb{E}_{X \sim q}\left[k(X_j, \cdot)\, \partial_{X_j} \log p(X) + \partial_{X_j} k(X_j, \cdot)\right],$
where $j$ denotes the $j$th dimension.
Here, we state a similar lemma to Lemma 2 to show that if the perturbation functions belong to the kccsd Stein set, then there is a closed form optimal perturbation function.
Lemma 3
(ccsvgd) Assume the conditions in Theorem 4 and consider all perturbation directions in the unit ball of the kccsd Stein set. The direction of steepest descent that maximizes the negative gradient in Equation 6 is given by
(8) $\phi^{cc,*}_j(\cdot) = \mathbb{E}_{X \sim q}\left[k(X_j, \cdot)\, \partial_{X_j} \log p(X) + \partial_{X_j} k(X_j, \cdot)\right],$
which implies that $\nabla_\epsilon\, \mathrm{KL}\!\left(q_{[T]} \,\|\, p\right)\big|_{\epsilon = 0} = -\,\mathrm{kccsd}(q, p)^2$.
We refer to svgd using kccsd updates as complete conditional Stein variational gradient descent (ccsvgd).
Theorem 3 shows that $\mathrm{kccsd}(q, p) \ge \mathrm{ksd}(q, p)$. This implies that the perturbation function provided by the kccsd decreases the kl divergence more than the perturbation function provided by the ksd, since the corresponding steepest-descent decreases are $-\,\mathrm{kccsd}(q, p)^2$ and $-\,\mathrm{ksd}(q, p)^2$, which shows that the kccsd perturbation function points in a steeper direction of descent than the ksd perturbation function. Note that this is only a locally optimal step. Given particles $\{x_i\}_{i=1}^{n}$, we calculate the svgd update for a particle $x_i$ as
$x_i \leftarrow x_i + \frac{\epsilon}{n} \sum_{j=1}^{n}\left[k(x_j, x_i)\, \nabla_{x_j} \log p(x_j) + \nabla_{x_j} k(x_j, x_i)\right].$
We can see that if the particles are far apart, then $k(x_j, x_i)$ gets small for $j \ne i$. This means that svgd reduces to performing map updates for these particles, $x_i \leftarrow x_i + \frac{\epsilon}{n} \nabla_{x_i} \log p(x_i)$. We demonstrate this phenomenon with the two-layer Bayesian neural network by calculating the Frobenius norm between the matrix of map updates and the svgd updates.
Bayesian Neural Network.
Consider a Bayesian neural network with one hidden layer, 50 hidden units, and a ReLU activation; we use the Boston housing dataset for our experiments.
Here we compare svgd and ccsvgd to map. As shown in Figure 6, svgd reduces to map in higher dimensions, unlike ccsvgd, whose update difference grows with dimension. We use a standard Gaussian to initialize the weight matrices; this causes the particles to be far apart and reduces the svgd updates to map updates, with no interaction between particles absent tricks like the median heuristic.
b.3 Operator Variational Inference
In this section, we use kccsd as an objective for variational inference.
Bayesian Linear Regression.
Consider the Bayesian regression problem with weights $w$ and data $\{(x_i, y_i)\}$. We model the data as Gaussian around $w^\top x_i$ with a normal prior on $w$. We perform posterior approximation using a Gaussian variational family, where the variational parameters are the mean and scale.
We run opvi with the kccsd and the ksd using the imq kernel for different dimensions, with a fixed number of data points, and use latent samples to calculate the Stein discrepancies with the AdaGrad optimizer. The dataset was generated by randomly picking $w$. We observed that opvi with kernelized discrepancies requires manual tuning of the optimizer. In Table 2 we list the norm between the learned posterior mean and the true mean after 20 iterations.
Dimension  ksd mse  kccsd mse 

0.34  0.05  
0.38  0.19  
4.39  1.26  
6.31  3.25 
Appendix C Distribution Tests
Effect of Diverging Means.
In this experiment, we compare $p = \mathcal{N}(0, I_d)$ and $q = \mathcal{N}(\mu, I_d)$. We increase the mean of $q$ in one coordinate and observe the effect on both discrepancies. Figure 7 shows that kccsds with the rbf and imq kernels have higher discrepancies than ksds. And even though the ksd with the rbf kernel does detect the growing difference here, we observe that as the mean increases, the ksd with the rbf increases at a much slower rate than the kccsd.
On Target Sequence.
In Theorem 3 we prove that for kernels like the imq and rbf, the kccsd is an upper bound on the ksd. We verified this in the previous experiments. In this experiment, however, we compare the discrepancy values when both distributions are equal, $q = p$.
Here, we compare the rate at which both discrepancies converge to zero when (a) $p$ is a multivariate Gaussian with diagonal covariance (Figure 9) and (b) $p$ is a multivariate Gaussian with a random nondiagonal covariance matrix (Figure 9). The kccsd converges to zero at a similar rate as the ksd as the number of samples increases.
Appendix D Closed form
In this section we prove Theorem 1. Let $\Phi_j$ be the one-dimensional canonical feature map for the $j$th coordinate. Now, following Gorham and Mackey (2017) and using Corollary 4.36 from Steinwart and Christmann (2008), we show that for all $x$
(9) 
where .
Then using Lemma 4.34 of Steinwart and Christmann (2008), gives for all .
(10) 
Now, assuming that Equation 10 is integrable with respect to $q$, the feature map is Bochner integrable for each dimension. In other words, we can interchange the expectation and the inner product. This implies that
(11) 
where .
Now, using Equation 9 and Equation 11 and applying the Fenchel-Young inequality for the dual norm twice, we show that
(12) 
Recall that the Stein set is defined as a product function space of univariate rkhss, limited to the unit ball. We focus on a single norm throughout; however, the argument generalizes to any norm.
Appendix E Detecting Convergence
We introduce some notation needed for the proof; throughout, the matrix norm is the operator norm.
We prove Proposition 2 by using an important lemma from Gorham and Mackey (2017), which shows that:
Lemma 4
(Gorham and Mackey, 2017) Let and , and if is Lipschitz and , then for any with , we have the following upper bound
(13) 
where and are constants which depend on and .
Gorham and Mackey (2017) also show that this lemma applies to the kccsd, since we obtain the same bound as the univariate ksd. We repeat their argument below.
First, we define the multi-index differential operator. For a differentiable function, we have
Now, for any multi-index of bounded order, we can bound the derivative of each component function by Cauchy-Schwarz and Steinwart and Christmann (2008) as
and since