Kernelized Complete Conditional Stein Discrepancy


Raghav Singhal
New York University
New York, USA
rs4070@nyu.edu
&Saad Lahlou
New York University
New York, USA
msl596@nyu.edu
&Rajesh Ranganath
New York University
New York, USA
rajeshr@cims.nyu.edu
Abstract

Much of machine learning relies on comparing distributions with discrepancy measures. Stein’s method creates discrepancy measures between two distributions that require only the unnormalized density of one and samples from the other. Stein discrepancies can be combined with kernels to define the kernelized Stein discrepancies (ksds). While kernels make Stein discrepancies tractable, they pose several challenges in high dimensions. We introduce kernelized complete conditional Stein discrepancies (kcc-sds). Complete conditionals turn a multivariate distribution into multiple univariate distributions. We prove that kcc-sds detect convergence and non-convergence, and that they upper-bound ksds. We empirically show that kcc-sds detect non-convergence where ksds fail. Our experiments illustrate the difference between kcc-sds and ksds when comparing high-dimensional distributions and performing variational inference.

 

Preprint. Work in progress.

1 Introduction

Discrepancy measures that compare a distribution $p$, known up to normalization, with a distribution $q$, known only through samples from it, can be used for finding good variational approximations (Ranganath et al., 2016), checking the quality of mcmc samplers (Gorham and Mackey, 2015, 2017), or goodness-of-fit testing (Liu et al., 2016). There are two difficulties with using traditional discrepancies like Wasserstein metrics or total variation distance for these tasks. First, $p$ can be hard to sample, and second, computing these discrepancies requires an expensive maximization. These challenges lead to the following desiderata for a discrepancy (Gorham and Mackey, 2015).

  1. Tractable: the discrepancy uses samples from $q$, evaluations of the (unnormalized) density $p$, and has a closed form.

  2. Detects convergence: if $q_n$ converges to $p$ in distribution, then the discrepancy between $q_n$ and $p$ converges to zero.

  3. Detects non-convergence: if the discrepancy between $q_n$ and $p$ converges to zero, then $q_n$ converges to $p$ in distribution.

These desiderata ensure that the discrepancy is non-zero when $q$ does not equal $p$ and that it can be easily computed. To meet these desiderata, Chwialkowski et al. (2016); Oates et al. (2017); Gorham and Mackey (2017) developed kernelized Stein discrepancies (ksds). ksds measure the expectation under $q$ of functions that have expectation zero under $p$. These functions are constructed by applying Stein’s operator to a reproducing kernel Hilbert space.

In high dimensions most kernels evaluated on a pair of points are near zero. Thus, ksds in high dimensions can be near zero, making detecting differences between high dimensional distributions difficult. We develop kernelized complete conditional Stein discrepancies (kcc-sds). These discrepancies use complete conditionals: the distribution of one variable given the rest. The complete conditionals are univariate. Rather than using multivariate kernels, kcc-sds use multiple univariate kernels, making it easier to compare distributions in high dimensions.
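As a quick numerical illustration of this effect (not taken from the paper), the sketch below evaluates an rbf kernel with a fixed bandwidth on pairs of independent standard Gaussian draws as the dimension grows; the bandwidth of one is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 1.0  # assumed fixed bandwidth

for d in [1, 10, 100, 1000]:
    x, y = rng.standard_normal(d), rng.standard_normal(d)   # two draws in d dimensions
    rbf = np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))   # rbf kernel value
    print(d, rbf)
# E||x - y||^2 = 2d, so without rescaling the bandwidth the kernel value
# collapses toward zero as the dimension increases.
```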

A given Stein discrepancy relies on a supremum over a class of test functions called the Stein set. The kcc-sds differ from ksds in that kcc-sds’ Stein set consists of univariate functions rather than multivariate functions. An immediate question is whether a Stein discrepancy with only univariate functions detects non-convergence. We prove under technical conditions that (1) kcc-sds detect convergence and non-convergence, and (2) kcc-sds are larger than ksds for the same choice of kernel.

Figure 1 compares ksds and kcc-sds with a different kernel on each panel. The figure compares two Gaussian distributions, a standard Gaussian $p$ and a Gaussian $q$ whose mean is non-zero in only one coordinate. We then increase the dimension of the distributions and compare the kcc-sd and the ksd with both the imq and rbf kernels. We see that kcc-sds retain their ability to distinguish the distributions in high dimensions. We also show that kcc-sds can be used for variational inference and empirically compare distributions in cases where ksds provably fail.

Figure 1: kcc-sds are more sensitive to differences than ksds in high dimensions. The figure compares a standard Gaussian $p$ and a Gaussian $q$ whose mean is non-zero in a single coordinate and zero in the rest, and uses 1000 samples to compute the kcc-sd and the ksd with the rbf and imq kernels. kcc-sds retain a better ability to tell $p$ and $q$ apart as the dimension increases.

2 Stein Discrepancies

Stein’s method provides recipes for constructing expectation-zero functions of distributions known up to normalization. For a distribution $p$ with a Lipschitz score function $\nabla_x \log p(x)$, we can create a Stein operator, $\mathcal{T}_p$, that acts on a test function $g$,

$$\mathbb{E}_{X \sim p}\big[(\mathcal{T}_p g)(X)\big] = 0, \qquad (1)$$

where $g$ is a smooth, 1-Lipschitz function. This relation, called Stein’s identity, can be used to construct a discrepancy when $p$ is known only up to a normalization constant (Gorham and Mackey, 2015). Let $\mathcal{G}$ be the Stein set, consisting of smooth Lipschitz functions satisfying a Neumann-type boundary condition; the Stein discrepancy is then $\sup_{g \in \mathcal{G}} \big|\mathbb{E}_{X \sim q}\big[(\mathcal{T}_p g)(X)\big]\big|$.
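As a sanity check on Stein's identity (an illustration, not part of the paper), the following sketch verifies by Monte Carlo that the Langevin-Stein operator applied to a bounded test function has mean zero under a standard Gaussian; the test function tanh is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)   # samples from p = N(0, 1)

score = -x                           # d/dx log p(x) for the standard Gaussian
g = np.tanh(x)                       # a smooth, bounded test function
g_prime = 1.0 - np.tanh(x) ** 2

stein_values = score * g + g_prime   # Langevin-Stein operator applied to g
print(stein_values.mean())           # close to 0, as Stein's identity predicts
```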

Stein discrepancies can be computationally burdensome as the supremum lacks a closed form.

Kernelized Stein Discrepancies.

To make the Stein discrepancy simpler to compute, Chwialkowski et al. (2016); Oates et al. (2017); Gorham and Mackey (2017) combined the theory of reproducing kernels (Steinwart and Christmann, 2008) with the Stein discrepancy to introduce kernelized Stein discrepancies (ksds). ksds restrict the Stein set to a reproducing kernel Hilbert space. Let $k$ be the kernel of a reproducing kernel Hilbert space (rkhs) $\mathcal{H}$. The rkhs consists of functions with finite $\mathcal{H}$-norm. Functions in the rkhs satisfy the reproducing property: $f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}}$ for all $f \in \mathcal{H}$ and for all $x$. ksds are defined using a Stein set $\mathcal{G}_k$: the set of vector-valued functions $g = (g_1, \dots, g_d)$ such that $g_j \in \mathcal{H}$ for all $j$ and the vector of norms $(\|g_1\|_{\mathcal{H}}, \dots, \|g_d\|_{\mathcal{H}})$ lies in the unit ball. ksds have a closed form.

Proposition 1

(Gorham and Mackey, 2017) Suppose the kernel $k$ is continuously differentiable in each argument. For all $x, y$ define the function

$$k_p(x, y) = \big\langle \mathcal{T}_p\, k(x, \cdot),\; \mathcal{T}_p\, k(y, \cdot) \big\rangle, \qquad (2)$$

where, for instance, if $\mathcal{T}_p$ is the Langevin-Stein operator, then for any $x, y$

$$k_p(x, y) = \nabla_x \log p(x)^\top \nabla_y \log p(y)\, k(x, y) + \nabla_x \log p(x)^\top \nabla_y k(x, y) + \nabla_y \log p(y)^\top \nabla_x k(x, y) + \nabla_x \cdot \nabla_y k(x, y).$$

Now, if $\mathbb{E}_{X, X' \sim q}\big[k_p(X, X')\big] < \infty$, then the kernelized Stein discrepancy has a closed form, $\mathrm{ksd}(q, p)^2 = \mathbb{E}_{X, X' \sim q}\big[k_p(X, X')\big]$, where $X$ and $X'$ are independent samples from $q$.

This proposition shows that when the Stein set is chosen via an rkhs, the Stein discrepancy can be computed in closed form. When the distribution $p$ lies in the class of distantly dissipative distributions¹, ksds provably detect convergence for kernels like the radial basis function (rbf) or the inverse multi-quadric (imq), and the imq kernel with a suitable exponent additionally detects non-convergence (Gorham and Mackey, 2017). In three or more dimensions, the ksd with thin-tailed kernels like the rbf does not detect non-convergence, but the ksd with the imq kernel does. However, all of these kernels shrink as the dimension grows, which means their associated ksds become less sensitive in higher dimensions (see Figure 1).

¹A distribution $p$ is distantly dissipative if $\liminf_{r \to \infty} \kappa(r) > 0$, where $\kappa(r) = \inf\big\{ -2\,\langle \nabla \log p(x) - \nabla \log p(y),\, x - y \rangle / \|x - y\|_2^2 : \|x - y\|_2 = r \big\}$. Common examples include finite Gaussian mixtures with the same variance and strongly log-concave distributions.
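The closed form is straightforward to estimate from samples. The following numpy sketch computes a V-statistic estimate of the squared ksd with an rbf base kernel and an analytic score function; the bandwidth and the choice of a V-statistic are illustrative assumptions rather than the paper's exact setup.

```python
import numpy as np

def ksd_rbf(samples, score, sigma=1.0):
    """V-statistic estimate of the squared KSD with an RBF base kernel.

    samples : (n, d) array of draws from q
    score   : callable returning grad_x log p at the samples, shape (n, d)
    sigma   : assumed bandwidth of the base kernel
    """
    x = np.asarray(samples)
    n, d = x.shape
    s = score(x)                                     # score of the target p
    diff = x[:, None, :] - x[None, :, :]             # pairwise x_i - x_j, shape (n, n, d)
    sqdist = np.sum(diff ** 2, axis=-1)
    k = np.exp(-sqdist / (2 * sigma ** 2))           # base kernel k(x_i, x_j)

    term1 = (s @ s.T) * k                                          # s(x)^T s(y) k(x, y)
    term2 = np.einsum('id,ijd->ij', s, diff) * k / sigma ** 2      # s(x)^T grad_y k(x, y)
    term3 = -np.einsum('jd,ijd->ij', s, diff) * k / sigma ** 2     # grad_x k(x, y)^T s(y)
    term4 = (d / sigma ** 2 - sqdist / sigma ** 4) * k             # trace of grad_x grad_y k
    return np.mean(term1 + term2 + term3 + term4)                  # squared KSD estimate

# example: draws from N(1, I) against a standard Gaussian target
rng = np.random.default_rng(0)
print(ksd_rbf(rng.standard_normal((500, 5)) + 1.0, lambda x: -x))
```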

3 Complete Conditional Stein Discrepancy

Complete conditionals are univariate conditional distributions, $p(x_i \mid x_{-i})$, where $x_{-i}$ denotes all coordinates of $x$ except the $i$-th. Complete conditional distributions are the basis for many inference procedures including the Gibbs sampler (Geman and Geman, 1984) and coordinate ascent variational inference (Ghahramani and Beal, 2001).

We construct cc-sds and their kernelized versions, kcc-sds, and show that kcc-sds satisfy the desiderata. For a broad family of kernels, we show that kcc-sds upper bound traditional ksds. We focus on the Langevin-Stein operator (Mira et al., 2013; Gorham and Mackey, 2015; Oates et al., 2017),

$$(\mathcal{T}_p g)(x) = \nabla_x \log p(x)^\top g(x) + \nabla_x \cdot g(x). \qquad (3)$$

The analysis done here can be applied to other operators based on the gradient of the log probability.

Definition.

Using complete conditionals, we define a new operator that can be used to compare distributions in arbitrary dimensions. For any $f = (f_1, \dots, f_d)$ with univariate component functions $f_i(x_i)$, we can apply the complete conditional factorization,

$$(\mathcal{T}_p f)(x) = \sum_{i=1}^{d} \big( \partial_{x_i} \log p(x_i \mid x_{-i})\, f_i(x_i) + \partial_{x_i} f_i(x_i) \big),$$

where $\partial_{x_i} \log p(x_i \mid x_{-i}) = \partial_{x_i} \log p(x)$. Note that although the test functions are univariate, the Stein operator applied to each component function, $\partial_{x_i} \log p(x)\, f_i(x_i) + \partial_{x_i} f_i(x_i)$, is a scalar-valued function of multiple variables, $x$.

This factorization yields the same operator as the Langevin-Stein operator in Equation 3. A two-variable example for cc-sds is in Appendix A. The key difference is that the Stein set for the cc-sd consists of univariate component functions. Formally, we define the function space $\mathcal{F}_{cc}$ of vector-valued functions whose $i$-th component is a smooth univariate function of $x_i$ alone. Then the complete conditional Stein discrepancy is defined as the Stein discrepancy restricted to this function set consisting of univariate component functions,

$$\mathrm{cc\text{-}sd}(q, p) = \sup_{f \in \mathcal{F}_{cc}} \big| \mathbb{E}_{X \sim q}\big[(\mathcal{T}_p f)(X)\big] \big|.$$

cc-sds do not require the complete conditionals of $p$ or $q$. Like the original Stein discrepancy, the supremum in cc-sds can be hard to compute. Instead, we introduce their kernelized form, the kernelized complete conditional Stein discrepancy (kcc-sd).

4 Kernelized Complete Conditional Stein Discrepancy

Similar to the construction of ksds from the Stein discrepancy, we meld the theory of reproducing kernels with complete conditional Stein discrepancies to obtain kcc-sds. In this section we show that kcc-sds satisfy all three desiderata: a closed and tractable form, detection of convergence, and detection of non-convergence. We also show kcc-sds upper bound ksds, and the difference between the two increases as the dimension of the distribution increases.

kcc-sds admit a closed form.

Let $k$ be a reproducing kernel for a univariate reproducing kernel Hilbert space $\mathcal{H}$. The Stein set for kcc-sds consists of functions $f = (f_1, \dots, f_d)$ where each $f_i$ is a univariate function in $\mathcal{H}$ and the vector of norms $(\|f_1\|_{\mathcal{H}}, \dots, \|f_d\|_{\mathcal{H}})$ lies in the unit ball. Note that it is possible for the kernel to change with each dimension, but for simplicity we focus on a single kernel for all dimensions and drop the index on the kernel. We now show that kcc-sds can be computed in closed form.

Theorem 1

Let $k$ be a univariate reproducing kernel that is continuously differentiable in each argument. For all $x, y$ and each coordinate $i$, define the complete conditional Stein kernel, $k_p^{(i)}$, as follows:

$$k_p^{(i)}(x, y) = \partial_{x_i} \log p(x)\, \partial_{y_i} \log p(y)\, k(x_i, y_i) + \partial_{x_i} \log p(x)\, \partial_{y_i} k(x_i, y_i) + \partial_{y_i} \log p(y)\, \partial_{x_i} k(x_i, y_i) + \partial_{x_i} \partial_{y_i} k(x_i, y_i).$$

Now, if $\mathbb{E}_{X, X' \sim q}\big[\sum_{i=1}^{d} k_p^{(i)}(X, X')\big] < \infty$, then $\mathrm{kcc\text{-}sd}(q, p)^2 = \sum_{i=1}^{d} \mathbb{E}_{X, X' \sim q}\big[k_p^{(i)}(X, X')\big]$, where $X$ and $X'$ are independent samples from $q$.

The proof is in Appendix D. Note that the closed form for kcc-sds is the same as ksds but the kernels are now univariate rather than multivariate.
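To make the closed form concrete, here is a companion numpy sketch that sums per-coordinate univariate rbf Stein kernels over the samples; the bandwidth and the way the coordinates are combined are assumptions made for illustration and may differ from the exact norm convention used in Appendix D.

```python
import numpy as np

def kcc_sd_sq_rbf(samples, score, sigma=1.0):
    """Squared KCC-SD estimate built from univariate RBF Stein kernels."""
    x = np.asarray(samples)                          # (n, d) samples from q
    n, d = x.shape
    s = score(x)                                     # (n, d) coordinatewise scores of p
    total = 0.0
    for l in range(d):                               # one univariate Stein kernel per coordinate
        diff = x[:, None, l] - x[None, :, l]         # pairwise x_il - x_jl
        k = np.exp(-diff ** 2 / (2 * sigma ** 2))    # univariate base kernel
        sl = s[:, l]
        term1 = np.outer(sl, sl) * k
        term2 = sl[:, None] * diff * k / sigma ** 2                  # s_l(x) d/dy k
        term3 = -sl[None, :] * diff * k / sigma ** 2                 # d/dx k * s_l(y)
        term4 = (1.0 / sigma ** 2 - diff ** 2 / sigma ** 4) * k      # d/dx d/dy k
        total += np.mean(term1 + term2 + term3 + term4)
    return total
```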

kcc-sds detect convergence.

kcc-sds can be upper bounded by the Wasserstein distance. This shows that if $q_n$ converges to $p$ in Wasserstein distance, then the kcc-sd goes to zero, satisfying desideratum 2.

Proposition 2

Suppose the kernel $k$ has bounded, continuous derivatives up to second order in each argument and $\nabla \log p$ is Lipschitz with $\mathbb{E}_p\big[\|\nabla \log p(X)\|_2^2\big] < \infty$. If the Wasserstein distance between $q_n$ and $p$ goes to zero, then $\mathrm{kcc\text{-}sd}(q_n, p) \to 0$.

The proof follows from Gorham and Mackey (2017) and is in Appendix E; the proposition applies to kernels like the rbf, imq, and Matérn kernels.

kcc-sd detects non-convergence.

In this section, we show that kcc-sds detect non-convergence by showing that when the kcc-sd converges to zero, the Fisher divergence converges to zero. The Fisher divergence measures the error between the score functions of two distributions. It is defined as $F(q, p) = \mathbb{E}_{X \sim q}\big[\|\nabla_x \log q(X) - \nabla_x \log p(X)\|_2^2\big]$.
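For instance, for $q = \mathcal{N}(\mu_q, \Sigma)$ and $p = \mathcal{N}(\mu_p, \Sigma)$, the score difference is the constant $\Sigma^{-1}(\mu_q - \mu_p)$, so the Fisher divergence equals $\|\Sigma^{-1}(\mu_q - \mu_p)\|_2^2$ and vanishes exactly when the means agree.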

The following lemma shows that if the score functions are Lipschitz and the densities are strictly positive for all $x$, then the Fisher divergence between two distributions going to zero implies that the distributions converge to each other in distribution.

Lemma 1

Suppose $\nabla \log q_n$ and $\nabla \log p$ are Lipschitz, and $q_n$ and $p$ are strictly positive. If $F(q_n, p) \to 0$, then $q_n$ converges to $p$ in distribution.

We use this lemma to show that the kcc-sd going to zero implies convergence in distribution.

Theorem 2

Suppose the kernel $k$ is integrally strictly positive definite, $\nabla \log q_n$ and $\nabla \log p$ are Lipschitz, and $q_n$ and $p$ are strictly positive for all $x$. If $\mathrm{kcc\text{-}sd}(q_n, p) \to 0$, then $q_n$ converges to $p$ in distribution.

The proof is in Appendix F. Unlike a full theory of weak convergence, the proof of Theorem 2 requires the score function of $q_n$. We empirically show that the kcc-sd detects non-convergence for distributions that do not have score functions, even where the ksd fails to detect non-convergence.

For dimensions three and higher, Gorham and Mackey (2017) show that ksds fail to detect non-convergence for commonly used kernels like the rbf. When kernels decay faster than the score function grows, ksds ignore the tails. This problem gets worse in higher dimensions for the rbf and Matérn kernels. For instance, if each coordinate of $x - y$ is of order one, then $\|x - y\|_2^2$ grows linearly with the dimension, which causes the rbf kernel to decay exponentially in high dimensions, leading to a low discrepancy value even if the distributions are different.

In Figure 2 we compare a non-tight sequence to a Gaussian target, following Gorham and Mackey (2017). For each $n$, let $q_n$ be the empirical distribution over $n$ points whose spread grows with $n$. For a kernel like the rbf, this causes the kernel values to decay as we increase the sample size. This sequence of distributions does not have a score function. Unlike the ksd, Figure 2 shows that the kcc-sd with both the rbf and imq kernels is able to detect non-convergence.

Figure 2: kcc-sd can detect non-convergence for non-tight sequences. Here we compute the kcc-sd and ksd using the rbf and imq kernels with a fixed dimension and an increasing number of samples, which causes the samples from $q_n$ to become more spread out.

Even when ksds detect non-convergence in high dimensions, they can be too small to be of practical use, making them poor assessments of sample quality. Figure 1 depicts this problem for two Gaussian distributions: $p$ is the standard Gaussian and $q$ is a Gaussian with the mean of one coordinate set to 5. The plots show how ksds decrease with increasing dimension. After dimension 10, the ksd becomes very small for the rbf kernel, and even with the imq kernel, which detects non-convergence, the ksd still becomes smaller.

kcc-sds upper bound ksds.

In this section we show that kcc-sds are upper bounds on ksds. The difference between the discrepancies grows as the dimensionality increases.

Suppose that the ksd and the kcc-sd use the same type of kernel with the same kernel parameters. We show that the kcc-sd is an upper bound on the ksd, given that the kernel satisfies the following conditions:

  1. The multivariate kernel is dominated by the univariate kernel on each coordinate, $k_d(x, y) \le k_1(x_i, y_i)$, for all $x, y$ and for all coordinates $i$.

  2. Define the univariate kernel difference as $k_1(x_i, y_i) - k_d(x, y)$, where we fix the remaining coordinates. Then this difference is an integrally strictly positive definite kernel.

In Appendix G, we show that both the rbf and the imq kernels satisfy these conditions. The proofs follow from Schoenberg's connection between completely monotone and positive definite functions (Schoenberg, 1938; Fasshauer, 2003).

Theorem 3

Suppose the kernel satisfies conditions C1 and C2, and the score functions of $q$ and $p$ are Lipschitz with finite second moments. Then the ksd and the kcc-sd satisfy the relation

$$\mathrm{ksd}(q, p) \le \mathrm{kcc\text{-}sd}(q, p). \qquad (4)$$

The proof is in Appendix G. The relations between the discrepancies are as follows. By construction, the Stein set for the cc-sd is a subset of the Stein set for the Stein discrepancy, so the Stein discrepancy dominates the cc-sd. Kernelization shrinks Stein sets, so kernelized variants are dominated by their corresponding non-kernelized variants. Theorem 3 shows that kcc-sds dominate ksds.

5 Experiments

We developed the kernelized complete conditional Stein discrepancies. Here we empirically study their use for performing sample quality checks and variational inference. We detail the variational inference algorithms in Appendix B. We study two kernels: imq and rbf. For the imq kernel we fix the offset and exponent parameters, and for the rbf kernel we choose the bandwidth.
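For reference, the two kernel families have the standard forms below; the offset, exponent, and bandwidth values in this sketch are placeholders rather than the settings used in the experiments.

```python
import numpy as np

def imq_kernel(x, y, c=1.0, beta=-0.5):
    """Inverse multi-quadric kernel (c^2 + ||x - y||^2)^beta with beta in (-1, 0)."""
    return (c ** 2 + np.sum((np.asarray(x) - np.asarray(y)) ** 2)) ** beta

def rbf_kernel(x, y, sigma=1.0):
    """Radial basis function kernel exp(-||x - y||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2 * sigma ** 2))
```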

5.1 Distribution Tests

Figure 1 shows the effect of increasing dimension when comparing two Gaussian distributions that differ in only one coordinate of the mean: ksds decrease as the dimension increases, unlike kcc-sds. Figure 2 with the rbf kernel shows empirically that the kcc-sd detects non-convergence where the ksd does not. Although kcc-sds upper bound ksds, Figure 3 shows that both the kcc-sd and the ksd converge to zero at a similar rate when the two distributions are equal, here a mixture of Gaussians where each component has a different non-diagonal covariance matrix.

In Appendix C we conduct more tests to study the rate of convergence to zero when both distributions are the same. We also compare two Gaussian distributions with increasing distance between their means; there we see that the kcc-sd is more sensitive to changes than the ksd in high dimensions.

Figure 3: kcc-sds converge to zero with i.i.d. samples from the target. Here we compute both discrepancies, using the rbf and imq kernels, as the number of samples from the target increases.

5.2 Sample Quality Checks

Here we show that kcc-sd can be used for sample quality checks.

Selecting Sampler Hyperparameters.

Stochastic gradient Langevin dynamics (sgld) is a biased mcmc sampler based on adding noise to the standard stochastic gradient optimization method (Welling and Teh, 2011). Since this method makes use of subsampling, it has allowed mcmc to scale to large datasets and large models. In this experiment we do posterior inference for a two-layer neural network, with a sigmoid activation function, for a regression task. We used the yacht hydrodynamics dataset (Gerritsma et al., 1981) from the UCI dataset repository.

Since biased methods trade asymptotic exactness for sampling efficiency, standard mcmc diagnostics are not applicable, as they do not account for asymptotic bias. We use the kcc-sd to assess sample quality from biased mcmc samplers. Selecting the stepsize $\epsilon$ is an important task to ensure the samples are approximately from the posterior (Welling and Teh, 2011). When $\epsilon$ is too small, sgld does not explore the space enough and there is high autocorrelation between the samples. However, when $\epsilon$ is too big, sgld has higher bias and is unstable.
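For context, a single sgld update has the following form (Welling and Teh, 2011); the sketch below is a generic implementation, and the function names and arguments are placeholders rather than the code used in this experiment.

```python
import numpy as np

def sgld_step(theta, grad_log_prior, grad_log_lik_minibatch, n_data, batch_size, eps, rng):
    """One stochastic gradient Langevin dynamics update with stepsize eps.

    The minibatch gradient of the log likelihood is rescaled by n_data / batch_size
    to give an unbiased estimate of the full-data gradient.
    """
    grad = grad_log_prior(theta) + (n_data / batch_size) * grad_log_lik_minibatch(theta)
    noise = rng.standard_normal(theta.shape) * np.sqrt(eps)
    return theta + 0.5 * eps * grad + noise
```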

For each stepsize we generate 5 independent chains with minibatch size 32. Each chain consists of 10,000 samples after a burn-in phase of 50,000 samples. We compare the kcc-sd to the effective sample size. Effective sample size relies on asymptotic exactness of the samples, which is violated by stochastic gradient Langevin dynamics. Figure 4 compares these two metrics: the stepsize with the lowest kcc-sd value differs from the stepsize that maximizes the effective sample size.

(a) kcc-sd
(b) Effective Sample Size
Figure 4: kcc-sd is a provable method to assess sample quality and does not assume asymptotic exactness of the samples, unlike standard methods such as effective sample size; here $\epsilon$ is the stepsize for sgld. We use the rbf kernel to compute the kcc-sd value.

5.3 Stein Variational Gradient Descent

We compare svgd to the complete conditional Stein variational gradient descent (cc-svgd) algorithm by training a Bayesian neural network and learning a multivariate Gaussian. We provide details for cc-svgd in Appendix B.

Learning Multivariate Gaussians.

We compare the performance of cc-svgd and svgd using the rbf kernel on learning a multivariate Gaussian target. We train both methods to learn Gaussian targets with diagonal and with non-diagonal covariance. We use 100 particles initialized from a standard Gaussian and run both methods for 1000 iterations using gradient descent with a decreasing stepsize. Figure 5 displays the ksd between the target and the learned distribution as the dimension increases. cc-svgd has a lower ksd value, so it learns a better approximation.

(a) Correlated Multivariate Gaussian
(b) Independent Multivariate Gaussian
Figure 5: (a) Average ksd with the rbf kernel over 5 trials of cc-svgd and svgd approximations with a correlated multivariate Gaussian target. We use 100 particles and run for 1000 iterations using gradient descent. (b) Average ksd with the rbf kernel over 5 trials of cc-svgd and svgd with a multivariate Gaussian target with diagonal covariance. We use 100 particles and run for 1000 iterations using gradient descent.

Bayesian Neural Network.

Mean Test rmse
Dataset       svgd    cc-svgd  map
Boston        2.565   2.404    2.567
Yacht         0.749   0.611    0.749
Protein       4.578   4.740    4.578
Concrete      5.297   5.609    5.297
Real Estate   6.608   6.566    7.208

Mean Test Log-Likelihood
Dataset       svgd    cc-svgd  map
Boston        -2.391  -2.369   -2.392
Yacht         -1.404  -1.167   -1.404
Protein       -2.944  -2.968   -2.944
Concrete      -3.052  -3.148   -3.052
Real Estate   -3.378  -3.378   -3.476

Table 1: Benchmarks on Bayesian neural networks. We show that in 3 out of the 5 tasks, cc-svgd performs better than svgd and map with the same network and hyperparameters.

We compare cc-svgd and svgd on Bayesian neural networks. We use a setting similar to Liu and Wang (2016): a neural network with one hidden layer and a rectifier activation. We use most of each dataset as a training set and the rest as the test set. The results are averaged over 5 trials. The minibatch size is 100, the number of particles is the same for both methods, and we use the rbf kernel.

Table 1 shows that cc-svgd performs better than svgd and map on 3 out of 5 experiments. svgd and map yield almost identical results. In these experiments, more accurate Bayesian inference seems to provide little advantage, as map performs well.

6 Discussion

We developed kernelized complete conditional Stein discrepancies. We show that kcc-sds are an upper bound on ksds and can be tractably computed from samples given an unnormalized differentiable distribution. They lead to better sample quality measures and variational inference algorithms. The ksd with the rbf kernel is not able to detect non-convergence with non-tight sequences. However, we observe that the kcc-sd with the rbf kernel not only detects non-convergence but also has a higher discrepancy than the ksd with the imq. A proof that the kcc-sd does or does not detect non-convergence for non-tight sequences with the rbf kernel is a promising theoretical avenue of research. Empirically, when distributions match, the kcc-sd and ksd converge to zero at the same rate. In Theorem 3, we show that the kcc-sd upper bounds the ksd, which means the kcc-sd could provide a more powerful goodness-of-fit test (Liu et al., 2016). Like ksds, kcc-sds suffer from a computational cost that grows quadratically with the number of samples. To address this, random feature Stein discrepancies (Huggins and Mackey, 2018) were developed; applying that approach to kcc-sds is a promising avenue for research.

Acknowledgments

We would like to thank Jaan Altosaar, Mark Goldstein, Xintian Han, Aahlad Manas Puli, and Bharat Srikishan for their helpful feedback and comments.

References

  • Chwialkowski et al. (2016) Chwialkowski, K., Strathmann, H., and Gretton, A. (2016). A kernel test of goodness of fit. In International Conference on Machine Learning.
  • Fasshauer (2003) Fasshauer, G. (2003). Positive definite and completely monotone functions. http://www.math.iit.edu/~fass/603_ch2.pdf.
  • Geman and Geman (1984) Geman, S. and Geman, D. (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741.
  • Gerritsma et al. (1981) Gerritsma, J., Onnink, R., and Versluis, A. (1981). Geometry, resistance and stability of the Delft systematic yacht hull series. International Shipbuilding Progress, 28(328):276–297.
  • Ghahramani and Beal (2001) Ghahramani, Z. and Beal, M. J. (2001). Propagation algorithms for variational Bayesian learning. In Advances in Neural Information Processing Systems, pages 507–513.
  • Gorham and Mackey (2015) Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein’s method. In Advances in Neural Information Processing Systems, pages 226–234.
  • Gorham and Mackey (2017) Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. arXiv preprint arXiv:1703.01717.
  • Huggins and Mackey (2018) Huggins, J. and Mackey, L. (2018). Random feature Stein discrepancies. In Advances in Neural Information Processing Systems, pages 1903–1913.
  • Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. (2016). A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pages 276–284.
  • Liu and Wang (2016) Liu, Q. and Wang, D. (2016). Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems, pages 2378–2386.
  • Mira et al. (2013) Mira, A., Solgi, R., and Imparato, D. (2013). Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662.
  • Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718.
  • Ranganath (2018) Ranganath, R. (2018). Black Box Variational Inference: Scalable, Generic Bayesian Computation and its Applications. PhD thesis, Princeton University.
  • Ranganath et al. (2014) Ranganath, R., Gerrish, S., and Blei, D. (2014). Black box variational inference. In Artificial Intelligence and Statistics, pages 814–822.
  • Ranganath et al. (2016) Ranganath, R., Tran, D., Altosaar, J., and Blei, D. (2016). Operator variational inference. In Advances in Neural Information Processing Systems, pages 496–504.
  • Schoenberg (1938) Schoenberg, I. J. (1938). Metric spaces and completely monotone functions. Annals of Mathematics, pages 811–841.
  • Steinwart and Christmann (2008) Steinwart, I. and Christmann, A. (2008). Support Vector Machines. Springer Science & Business Media.
  • Welling and Teh (2011) Welling, M. and Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688.

Appendix A Construction of cc-sds

Consider two distributions $p(x_1, x_2)$ and $q(x_1, x_2)$, and suppose we want to compare their complete conditionals, the univariate distributions $p(x_1 \mid x_2)$ and $q(x_1 \mid x_2)$. Using a similar analysis as Ranganath (2018), we use the univariate Langevin-Stein operator to compare these complete conditionals. Let $f_1(x_1)$ be a univariate function. The Langevin-Stein operator applied to $f_1$ yields

$$\big(\mathcal{T}_{p(x_1 \mid x_2)} f_1\big)(x_1, x_2) = \partial_{x_1} \log p(x_1 \mid x_2)\, f_1(x_1) + \partial_{x_1} f_1(x_1) = \partial_{x_1} \log p(x_1, x_2)\, f_1(x_1) + \partial_{x_1} f_1(x_1).$$

The equality uses the fact that the score function of the conditional distribution is the score function of the joint, $\partial_{x_1} \log p(x_1 \mid x_2) = \partial_{x_1} \log p(x_1, x_2)$. Note that although the test function $f_1$ is univariate, the operator applied to $f_1$ is a scalar-valued function of multiple variables, $(x_1, x_2)$.

If $f_1$ satisfies the boundary conditions for all inputs $x_2$, Stein’s identity applies and

$$\mathbb{E}_{x_1 \sim p(x_1 \mid x_2)}\big[\partial_{x_1} \log p(x_1, x_2)\, f_1(x_1) + \partial_{x_1} f_1(x_1)\big] = 0 \quad \text{for every } x_2.$$

Now, two distributions match only if their complete conditionals match. This means we can combine the complete conditional Stein operators to compare multivariate distributions. For any $f = (f_1(x_1), f_2(x_2))$, we can compare the distributions as follows:

$$\mathbb{E}_{X \sim q}\Big[\sum_{i=1}^{2} \big(\partial_{x_i} \log p(X_i \mid X_{-i})\, f_i(X_i) + \partial_{x_i} f_i(X_i)\big)\Big] = \mathbb{E}_{X \sim q}\Big[\sum_{i=1}^{2} \big(\partial_{x_i} \log p(X)\, f_i(X_i) + \partial_{x_i} f_i(X_i)\big)\Big],$$

where we use the fact that $\partial_{x_i} \log p(x_i \mid x_{-i}) = \partial_{x_i} \log p(x)$. Observe that the discrepancy can be computed without the use of complete conditionals of $p$ or $q$. Hence, the complete conditional factorization suggests the use of test functions with univariate component functions.

Appendix B Variational Inference Using Stein Discrepancies

Variational inference casts Bayesian inference as an optimization problem. This is typically formulated as minimizing the kl divergence between the posterior and a member of the variational family. Operator variational inference (opvi) (Ranganath et al., 2016) uses Stein discrepancies as objectives for variational inference. Stein variational gradient descent (svgd) (Liu and Wang, 2016) uses Stein’s method to iteratively transform a set of particles to match the posterior. We describe how to use kcc-sds in opvi and svgd, yielding black box variational inference algorithms (Ranganath et al., 2014).

B.1 Operator Variational Inference

opvi suggests the use of a neural network to learn the optimal test function, which increases the difficulty of optimization. We introduce the use of kcc-sds and ksds as objective functions in operator variational inference. This removes the need to estimate an optimal test function.

Given a parametric variational family $\{q_\lambda\}$ and a data model $p(x, z)$, kernelized opvi solves the following optimization problem:

$$\lambda^* = \arg\min_{\lambda}\ \mathrm{kcc\text{-}sd}\big(q_\lambda(z),\, p(z \mid x)\big)^2. \qquad (5)$$

Unbiased estimation of this objective only requires the evaluation of the model score function and samples from $q_\lambda$. As the only requirement for the variational approximation is sampling (and differentiability for gradients), this allows flexibility to choose variational families where a tractable density is not available. Such distributions are called variational programs and were studied in Ranganath et al. (2016).
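As a sketch of how this objective can be estimated with a reparameterized variational family, the snippet below draws samples from a mean-field Gaussian and evaluates a squared-discrepancy estimator on them; the variational family, the argument names, and the estimator passed in (for instance the kcc_sd_sq_rbf sketch from Section 4) are illustrative assumptions, and gradients with respect to the variational parameters would be taken with an automatic differentiation framework.

```python
import numpy as np

def opvi_objective(mu, log_sigma, score_p, discrepancy_sq, n_samples, rng):
    """Monte Carlo estimate of a squared kernelized Stein objective for
    a mean-field Gaussian variational family q_{mu, sigma} (illustrative)."""
    eps = rng.standard_normal((n_samples, mu.shape[0]))
    z = mu + np.exp(log_sigma) * eps        # reparameterized samples from q
    return discrepancy_sq(z, score_p)       # e.g. a squared kcc-sd estimator
```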

B.2 Stein Variational Gradient Descent

svgd, a particle-based variational inference algorithm, is based on creating a set of distributions obtained by applying smooth, invertible transformations $T$ to a reference distribution $q_0$. The resulting density $q_{[T]}$ of $T(z_0)$, with $z_0 \sim q_0$, is defined as

$$q_{[T]}(z) = q_0\big(T^{-1}(z)\big)\, \big|\det \nabla_z T^{-1}(z)\big|,$$

where $T^{-1}$ and $\nabla_z T^{-1}$ denote the inverse and the Jacobian matrix of the inverse.

Now, suppose we choose the family of transformations to be small perturbations of the identity map of the form $T(z) = z + \epsilon\, \phi(z)$, where $\phi$ is a smooth function belonging to a suitable function family. The Stein operator is equivalent to the derivative of the kl divergence (Liu and Wang, 2016). We note that the Stein operators in the kl divergence derivative are built from the matrix Stein operator; however, the derivative uses the trace of the matrix Stein operator, which is equal to the Stein operator we study.

Theorem 4

(Liu and Wang, 2016) Let $T(z) = z + \epsilon\, \phi(z)$ and let $q_{[T]}$ be the density of $T(z)$ where $z \sim q$. Then

$$\nabla_\epsilon\, \mathrm{KL}\big(q_{[T]} \,\|\, p\big)\big|_{\epsilon = 0} = -\,\mathbb{E}_{z \sim q}\big[\mathrm{trace}\big(\mathcal{A}_p \phi(z)\big)\big], \qquad (6)$$

where $\mathcal{A}_p \phi(z) = \nabla_z \log p(z)\, \phi(z)^\top + \nabla_z \phi(z)$ is the matrix Stein operator.

The following lemma identifies the maximal perturbation direction that gives the steepest decrease in the kl divergence:

Lemma 2

(Liu and Wang, 2016) Assume the conditions in Theorem 4 and consider all the perturbation directions in the unit ball of the product rkhs used by the ksd. The direction of steepest descent that maximizes the negative gradient in Equation 6 is given by

$$\phi^*(\cdot) \propto \mathbb{E}_{z \sim q}\big[k(z, \cdot)\, \nabla_z \log p(z) + \nabla_z k(z, \cdot)\big], \qquad (7)$$

which implies that the steepest decrease of the kl divergence in this ball equals the ksd between $q$ and $p$.

Here, we propose the use of the kcc-sd Stein set as the perturbation family for svgd rather than the ksd Stein set. The optimal function for svgd using the kcc-sd Stein set has components defined as

$$\phi_i^*(\cdot) \propto \mathbb{E}_{z \sim q}\big[k(z_i, \cdot)\, \partial_{z_i} \log p(z) + \partial_{z_i} k(z_i, \cdot)\big],$$

where $z_i$ denotes the $i$-th dimension of $z$ and $k$ is the univariate kernel.

Here, we state a similar lemma to Lemma 2 to show that if the perturbation functions belong to the kcc-sd Stein set, then there is a closed form optimal perturbation function.

Lemma 3

(cc-svgd) Assume the conditions in Theorem 4 and consider all the perturbation directions in the unit ball of the kcc-sd Stein set. The direction of steepest descent that maximizes the negative gradient in Equation 6 is given componentwise by

$$\phi_i^*(\cdot) \propto \mathbb{E}_{z \sim q}\big[k(z_i, \cdot)\, \partial_{z_i} \log p(z) + \partial_{z_i} k(z_i, \cdot)\big], \qquad (8)$$

which implies that the steepest decrease of the kl divergence in this ball equals the kcc-sd between $q$ and $p$.

We refer to svgd using kcc-sd updates as complete conditional Stein variational gradient descent (cc-svgd).

Theorem 3 shows that $\mathrm{ksd}(q, p) \le \mathrm{kcc\text{-}sd}(q, p)$. This implies that the perturbation function provided by the kcc-sd decreases the kl divergence more than the perturbation function provided by the ksd: by Lemmas 2 and 3, the steepest local decrease of the kl divergence available in the kcc-sd Stein set equals the kcc-sd, which is at least the ksd, the steepest decrease available in the ksd Stein set. This shows that the kcc-sd perturbation function points in a steeper direction of descent than the ksd perturbation function. Note that this is only a locally optimal step. Given particles $\{z_j\}_{j=1}^{n}$, we calculate the svgd update for a particle $z_i$ as

$$z_i \leftarrow z_i + \frac{\epsilon}{n} \sum_{j=1}^{n} \big[k(z_j, z_i)\, \nabla_{z_j} \log p(z_j) + \nabla_{z_j} k(z_j, z_i)\big].$$

We can see that if the particles are far apart, then $k(z_j, z_i)$ is small for $j \neq i$. This means that svgd reduces to performing map updates for these particles, since the update for particle $z_i$ is approximately proportional to $\nabla_{z_i} \log p(z_i)$. We demonstrate this phenomenon with the two-layer Bayesian neural network by calculating the Frobenius norm between the matrix of map updates and the svgd updates.
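The collapse to map updates is easy to see numerically. The sketch below, which is illustrative rather than the paper's code, implements the standard multivariate-rbf svgd direction and checks that for two far-apart particles under a standard Gaussian target it is just a scaled gradient step.

```python
import numpy as np

def svgd_phi(x, score, sigma=1.0):
    """Standard SVGD direction with a multivariate RBF kernel (illustrative)."""
    n, d = x.shape
    diff = x[:, None, :] - x[None, :, :]                    # diff[j, i] = x_j - x_i
    k = np.exp(-np.sum(diff ** 2, axis=-1) / (2 * sigma ** 2))
    drive = k.T @ score(x) / n                              # sum_j k(x_j, x_i) grad log p(x_j)
    repulse = -(diff * k[:, :, None]).sum(axis=0) / (sigma ** 2 * n)  # sum_j grad_{x_j} k(x_j, x_i)
    return drive + repulse

score = lambda x: -x                                        # standard Gaussian target
x = np.array([[10.0] * 50, [-10.0] * 50])                   # two far-apart particles
print(np.allclose(svgd_phi(x, score), score(x) / 2))        # True: a scaled MAP (gradient) step
```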

Bayesian Neural Network.

Consider a Bayesian neural network with one hidden layer of 50 hidden units and a ReLU activation; we use the Boston housing dataset for our experiments.

Here we compare svgd and cc-svgd to map. As shown in Figure 6, svgd reduces to map in higher dimensions, unlike cc-svgd, whose difference from the map update grows as the dimension increases. We use a standard Gaussian to initialize the weight matrices; this causes the particles to be far apart and thus reduces the svgd update to map updates, with no interaction between particles unless tricks like the median heuristic are used.

Figure 6: svgd updates reduce to map updates in high dimensions. We compare the svgd update and the cc-svgd update to the map update on the Boston housing dataset. We increase the dimension of the hidden layer and show that the difference from the map update decreases for svgd and increases for cc-svgd as the dimension increases. We use twenty particles and plot the difference between the updates after 10 iterations.
  Input: model $p$, univariate kernel $k$, stepsize $\epsilon$; initialize particles $\{z_j^0\}_{j=1}^{n}$ from a reference distribution $q_0$
  while not converged do
      $z_{i,l}^{t+1} \leftarrow z_{i,l}^{t} + \frac{\epsilon}{n} \sum_{j=1}^{n} \big[k(z_{j,l}^{t}, z_{i,l}^{t})\, \partial_{z_{j,l}} \log p(z_j^{t}) + \partial_{z_{j,l}} k(z_{j,l}^{t}, z_{i,l}^{t})\big]$ for each particle $i$ and coordinate $l$
  end while
Algorithm 1 Complete Conditional Stein Variational Gradient Descent
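For concreteness, here is a numpy sketch of one cc-svgd step with univariate rbf kernels per coordinate; the bandwidth, the plain gradient step, and the function names are illustrative assumptions rather than the paper's implementation.

```python
import numpy as np

def cc_svgd_step(particles, score, stepsize, sigma=1.0):
    """One CC-SVGD update built from univariate RBF kernels, one per coordinate.

    particles : (n, d) array of particles
    score     : callable returning grad_z log p at the particles, shape (n, d)
    """
    z = np.asarray(particles)
    n, d = z.shape
    s = score(z)
    phi = np.zeros_like(z)
    for l in range(d):
        diff = z[:, None, l] - z[None, :, l]          # diff[j, i] = z_jl - z_il
        k = np.exp(-diff ** 2 / (2 * sigma ** 2))     # univariate kernel k(z_jl, z_il)
        drive = k.T @ s[:, l] / n                     # sum_j k(z_jl, z_il) d log p(z_j)/d z_jl
        repulse = (-diff / sigma ** 2 * k).sum(axis=0) / n   # sum_j d k(z_jl, z_il)/d z_jl
        phi[:, l] = drive + repulse
    return z + stepsize * phi
```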

B.3 Operator Variational Inference

In this section, we use kcc-sd as an objective for variational inference.

Bayesian Linear Regression.

Consider a Bayesian linear regression problem with responses $y$ and covariates $x$. We model the data as a linear function of the covariates with Gaussian noise and place a normal prior on the weights. We perform posterior approximation using a Gaussian variational family whose variational parameters are its mean and scale.

We run opvi with the kcc-sd and the ksd using the imq kernel for different dimensions, with a fixed number of data points and a fixed number of latent samples used to calculate the Stein discrepancies, optimizing with AdaGrad. The dataset was generated by randomly picking the true weights. We observed that opvi with kernelized discrepancies requires manual tuning of the optimizer. In Table 2 we list the norm of the difference between the learned posterior mean and the true mean after 20 iterations.

Dimension    ksd mse    kcc-sd mse
             0.34       0.05
             0.38       0.19
             4.39       1.26
             6.31       3.25
Table 2: Operator variational inference: norm between the learned posterior mean and the true mean for different dimensions.

Appendix C Distribution Tests

Effect of Diverging Means.

In this experiment, we compare a standard Gaussian target with a Gaussian whose mean we increase in one coordinate, and see the effect on both discrepancies. Figure 7 shows that kcc-sds with the rbf and imq kernels have higher discrepancies than ksds. Even though the ksd with the rbf kernel does respond to this diverging sequence, we observe that as the mean difference increases, the ksd with the rbf kernel increases at a much slower rate than the kcc-sd.

On Target Sequence.

In Theorem 3 we prove that for kernels like the imq and rbf, the kcc-sd is an upper bound on the ksd. We verified this in the previous experiments. However, in this experiment we compare the discrepancy values when both distributions are equal.

Here, we compare the rate at which both discrepancies converge to zero when the target is a multivariate Gaussian with diagonal covariance (Figure 8) and when it is a multivariate Gaussian with a random non-diagonal covariance matrix (Figure 9). The kcc-sd converges to zero at a similar rate as the ksd as the number of samples increases.

Figure 7: kcc-sds detect non-convergence for diffusive sequences. Here we compute the kcc-sd and ksd between a standard Gaussian target and a Gaussian whose mean differs in the first coordinate. Both discrepancies are computed with a fixed number of samples and a fixed dimension. We increase the mean in the first coordinate. The kcc-sd with both kernels increases at a faster rate than the ksd with both kernels as the difference in mean increases.
(a) rbf
(b) imq
Figure 8: kcc-sds converge to zero with i.i.d. samples from the target. Here we compute the kcc-sd and ksd for a multivariate Gaussian target with diagonal covariance as the number of samples increases. We can observe that both the ksd and the kcc-sd converge to zero at the same rate.
(a) rbf
(b) imq
Figure 9: kcc-sds converge to zero with i.i.d. samples from the target. Here we compute the kcc-sd and ksd for a multivariate Gaussian target whose covariance matrix was generated randomly to have non-zero off-diagonal entries.

Appendix D Closed form

In this section we prove Theorem 1. Let $\Phi$ be the one-dimensional canonical feature map, defined by $\Phi(x_i) = k(x_i, \cdot)$. Following Gorham and Mackey (2017) and using Corollary 4.36 from Steinwart and Christmann (2008), we show that for all $x$

(9)

where .

Then, using Lemma 4.34 of Steinwart and Christmann (2008), we obtain, for all inputs,

(10)

Now, assuming that the bound in Equation 10 is integrable with respect to $q$, the corresponding feature map is Bochner $q$-integrable for each coordinate. In other words, we can interchange the expectation and the inner product. This implies that

(11)

where .

Now, using Equation 9 and Equation 11 and applying the Fenchel-Young inequality for the dual norm twice, we show that

(12)

Recall that the Stein set is defined as a product function space of univariate rkhss, limited to the unit ball. We focus on one norm throughout; however, the argument can be generalized to any norm.

Appendix E Detecting Convergence

We introduce some notation we will need for the proof; in particular, the matrix norm used below is the operator norm.

We prove Proposition 2 by using an important lemma from Gorham and Mackey (2017):

Lemma 4

(Gorham and Mackey, 2017) Suppose the score function of $p$ is Lipschitz with finite expected squared norm under $p$. Then, for any coupling of $q_n$ and $p$, we have the following upper bound

(13)

where the constants depend on the kernel and on the score function of $p$.

This lemma also applies to the kcc-sd, since we obtain the same bound as for the univariate ksd (Gorham and Mackey, 2017). We repeat their argument below.

First, we define the multi-index differential operator. For a differentiable function, we have

Now, for any multi-index of bounded order, we can bound the derivative of each component function using the Cauchy-Schwarz inequality and Steinwart and Christmann (2008) as

and since