Pathwise Derivatives Beyond the Reparameterization Trick

Martin Jankowiak    Fritz Obermeyer
Abstract

We observe that gradients computed via the reparameterization trick are in direct correspondence with solutions of the transport equation in the formalism of optimal transport. We use this perspective to compute (approximate) pathwise gradients for probability distributions not directly amenable to the reparameterization trick: Gamma, Beta, and Dirichlet. We further observe that when the reparameterization trick is applied to the Cholesky-factorized multivariate Normal distribution, the resulting gradients are suboptimal in the sense of optimal transport. We derive the optimal gradients and show that they have reduced variance in a Gaussian Process regression task. We demonstrate with a variety of synthetic experiments and stochastic variational inference tasks that our pathwise gradients are competitive with other methods.

1 Introduction

Maximizing objective functions via gradient methods is ubiquitous in machine learning. When the objective function is defined as an expectation of a (differentiable) test function $f(z)$ w.r.t. a probability distribution $q_\theta(z)$,

$\mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[ f(z) \right]$    (1)

computing exact gradients w.r.t. the parameters $\theta$ is often infeasible, so optimization methods must instead make do with stochastic gradient estimates. If the gradient estimator is unbiased, then stochastic gradient descent with an appropriately chosen sequence of step sizes can be shown to have nice convergence properties (Robbins & Monro, 1951). If, however, the gradient estimator exhibits large variance, stochastic optimization algorithms may be impractically slow. Thus it is of general interest to develop gradient estimators with reduced variance.

We revisit the class of gradient estimators popularized in (Kingma & Welling, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014), which go under the name of the pathwise derivative or the reparameterization trick. While this class of gradient estimators is not applicable to all choices of probability distribution $q_\theta(z)$, empirically it has been shown to yield suitably low variance in many cases of practical interest and thus has seen wide use. We show that the pathwise derivative in the literature is in fact a particular instance of a continuous family of gradient estimators. Drawing a connection to tangent fields in the formalism of optimal transport (see (Villani, 2003; Ambrosio et al., 2008) for a review), we show that one can define a unique pathwise gradient that is optimal in the sense of optimal transport. For the purposes of this paper, we will refer to these optimal gradients as OMT (optimal mass transport) gradients.

The resulting geometric picture is particularly intriguing in the case of multivariate distributions, where each choice of gradient estimator specifies a velocity field on the sample space. To make this picture more concrete, in Figure 1 we show the velocity fields that correspond to two different gradient estimators for the off-diagonal element of the Cholesky factor parameterizing a bivariate Normal distribution. We note that the velocity field that corresponds to the reparameterization trick has a large rotational component that makes it suboptimal in the sense of optimal transport. In Sec. 7 we show that this suboptimality can result in reduced performance when fitting a Gaussian Process to data.

The rest of this paper is organized as follows. In Sec. 2 we provide a brief overview of stochastic gradient variational inference. In Sec. 3 we show how to compute pathwise gradients for univariate distributions. In Sec. 4 we expand our discussion of pathwise gradients to the case of multivariate distributions, introduce the connection to the transport equation, and provide an analytic formula for the OMT gradient in the case of the multivariate Normal. In Sec. 5 we discuss how we can compute high precision approximate pathwise gradients for the Gamma, Beta, and Dirichlet distributions. In Sec. 6 we place our work in the context of related research. In Sec. 7 we demonstrate the performance of our gradient estimators with a variety of synthetic experiments and experiments on real world datasets. Finally, in Sec. 8 we conclude with a discussion of directions for future work.

Figure 1: Velocity fields for a bivariate Normal distribution parameterized by a Cholesky factor $L$. The gradient is w.r.t. the off-diagonal element of $L$. On the left we depict the velocity field corresponding to the reparameterization trick and on the right we depict the velocity field that is optimal in the sense of optimal transport. The solid black circle denotes the 1-$\sigma$ covariance ellipse, with the gray ellipses denoting displaced covariance ellipses that result from small increases in the off-diagonal element. Note that the ellipses evolve the same way under both velocity fields, but individual particles flow differently to effect the same global displacement of mass.

2 Stochastic Gradient Variational Inference

One area where stochastic gradient estimators play a particularly central role is stochastic variational inference (Hoffman et al., 2013). This is especially the case for black-box methods (Wingate & Weber, 2013; Ranganath et al., 2014), where conjugacy and other simplifying structural assumptions are unavailable, with the consequence that Monte Carlo estimators become necessary. For concreteness, we will refer to this class of methods as Stochastic Gradient Variational Inference (SGVI). In this section we give a brief overview of this line of research, as it serves as the motivating use case for our work. Furthermore, in Sec. 7 SGVI will serve as the main testbed for our proposed methods.

Let $p(x, z)$ define a joint probability distribution over observed data $x$ and latent random variables $z$. One of the main tasks in Bayesian inference is to compute the posterior distribution $p(z|x)$. For many models of interest, this is an intractably hard problem and so approximate methods become necessary. Variational inference recasts Bayesian inference as an optimization problem. Specifically we define a variational family of distributions $q_\theta(z)$ parameterized by $\theta$ and seek a value of $\theta$ that minimizes the KL divergence between $q_\theta(z)$ and the (unknown) posterior $p(z|x)$. This is equivalent to maximizing the ELBO (Jordan et al., 1999), defined as

$\mathrm{ELBO} = \mathbb{E}_{q_\theta(z)}\left[ \log p(x, z) - \log q_\theta(z) \right]$    (2)

For general choices of the model $p(x, z)$ and variational family $q_\theta(z)$, this expectation—much less its gradients—cannot be computed analytically. In these circumstances a natural approach is to build a Monte Carlo estimator of the ELBO and its gradient w.r.t. $\theta$. The properties of the chosen gradient estimator—especially its bias and variance—play a critical role in determining the viability of the resulting stochastic optimization. Next, we review two commonly used gradient estimators; we leave a brief discussion of more elaborate variants to Sec. 6.

2.1 Score Function Estimator

The score function estimator, also referred to as the log-derivative trick or REINFORCE (Glynn, 1990; Williams, 1992), provides a simple and broadly applicable recipe for estimating ELBO gradients (Paisley et al., 2012). The score function estimator expresses the gradient as an expectation with respect to $q_\theta(z)$, with the simplest variant given by

$\nabla_\theta\, \mathrm{ELBO} = \mathbb{E}_{q_\theta(z)}\left[\, f(z)\, \nabla_\theta \log q_\theta(z) \,\right]$    (3)

where $f(z) = \log p(x, z) - \log q_\theta(z)$. Monte Carlo estimates of Eqn. 3 can be formed by drawing samples from $q_\theta(z)$ and computing the term in the square brackets. Although the score function estimator is very general (e.g. it applies to discrete random variables), it typically suffers from high variance; this can be mitigated with the use of variance reduction techniques such as Rao-Blackwellization (Casella & Robert, 1996) and control variates (Ross, 2006).
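
To make Eqn. 3 concrete, the following is a minimal sketch of a surrogate-loss implementation in PyTorch; the diagonal-Normal variational family and the placeholder log_joint function (which evaluates $\log p(x, z)$) are illustrative assumptions rather than part of the original presentation.

    import torch
    from torch.distributions import Normal

    def score_function_surrogate(loc, log_scale, log_joint, num_samples=1):
        # Illustrative diagonal-Normal q_theta; log_joint(z) evaluates log p(x, z).
        q = Normal(loc, log_scale.exp())
        z = q.sample((num_samples,))           # samples carry no pathwise gradient
        log_q = q.log_prob(z).sum(-1)
        f = log_joint(z) - log_q               # f(z) = log p(x, z) - log q_theta(z)
        # Detach f so that only grad log q_theta(z) is differentiated (Eqn. 3).
        return (f.detach() * log_q).mean()

Calling .backward() on the returned surrogate produces an unbiased (if typically high variance) estimate of the ELBO gradient in loc.grad and log_scale.grad.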

2.2 Pathwise Gradient Estimator

The pathwise gradient estimator, a.k.a. the reparameterization trick (RT), is not as broadly applicable as the score function estimator, but it generally exhibits lower variance (Price, 1958; Salimans et al., 2013; Kingma & Welling, 2013; Glasserman, 2013; Rezende et al., 2014; Titsias & Lázaro-Gredilla, 2014). It is applicable to continuous random variables whose probability density $q_\theta(z)$ can be reparameterized such that expectations can be rewritten as

$\mathbb{E}_{q_\theta(z)}\left[ f(z) \right] = \mathbb{E}_{q_0(\epsilon)}\left[ f(\mathcal{T}_\theta(\epsilon)) \right]$    (4)

where $q_0(\epsilon)$ is a fixed distribution with no dependence on $\theta$ and $\mathcal{T}_\theta(\epsilon)$ is a differentiable $\theta$-dependent transformation. Since the expectation is now taken w.r.t. a distribution with no $\theta$ dependence, gradients w.r.t. $\theta$ can be computed by pushing $\nabla_\theta$ through the expectation. This reparameterization can be done for a number of distributions, including for example the Normal distribution. Unfortunately the reparameterization trick is non-trivial to apply to a number of commonly used distributions, e.g. the Gamma and Beta distributions, since the required shape transformations inevitably involve special functions.
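
For comparison with the score function sketch above, here is a minimal sketch of the reparameterization trick for the same illustrative diagonal-Normal family (again with a placeholder log_joint); rsample draws $z = \mu + \sigma\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, so the samples themselves are differentiable w.r.t. the variational parameters.

    import torch
    from torch.distributions import Normal

    def reparameterized_elbo(loc, log_scale, log_joint, num_samples=1):
        q = Normal(loc, log_scale.exp())
        z = q.rsample((num_samples,))          # differentiable samples: z = loc + scale * eps
        log_q = q.log_prob(z).sum(-1)
        return (log_joint(z) - log_q).mean()   # backward() yields the pathwise (RT) gradient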

3 Univariate Pathwise Gradients

Consider an objective function given as the expectation of a test function $f(z)$ with respect to a distribution $q_\theta(z)$, where $z$ is a continuous one-dimensional random variable:

$\mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[ f(z) \right]$    (5)

Here $q_\theta(z)$ and $f(z)$ are parameterized by $\theta$, and we would like to compute (stochastic) gradients of $\mathcal{L}$ w.r.t. $\theta$, where $\theta$ is a scalar component of the full set of parameters:

$\nabla_\theta \mathcal{L} = \nabla_\theta\, \mathbb{E}_{q_\theta(z)}\left[ f(z) \right]$    (6)

Crucially we would like to avoid the log-derivative trick, which yields a gradient estimator that tends to have high variance. Doing so will be easy if we can rewrite the expectation in terms of a fixed distribution that does not depend on $\theta$. A natural choice is the standard uniform distribution $\mathcal{U}(0, 1)$,

$\mathcal{L} = \mathbb{E}_{u \sim \mathcal{U}(0,1)}\left[ f\!\left( F^{-1}(u|\theta) \right) \right]$    (7)

where the transformation $F^{-1}(u|\theta)$ is the inverse CDF of $q_\theta(z)$. As desired, all dependence on $\theta$ is now inside the expectation. Unfortunately, for many continuous univariate distributions of interest (e.g. the Gamma and Beta distributions) the transformation $F^{-1}(u|\theta)$ (as well as its derivative w.r.t. $\theta$) does not admit a simple analytic expression.

Fortunately, by making use of implicit differentiation we can compute the gradient in Eqn. 6 without explicitly introducing $F^{-1}$. To complete the derivation, define $u$ by

$u = F(z|\theta)$    (8)

and differentiate both sides of Eqn. 8 w.r.t. $\theta$, making use of the fact that $u$ does not depend on $\theta$, to obtain

$0 = \frac{\partial F(z|\theta)}{\partial \theta} + q_\theta(z)\, \frac{\mathrm{d}z}{\mathrm{d}\theta}$    (9)

This then yields our master formula for the univariate case

$\frac{\mathrm{d}z}{\mathrm{d}\theta} = -\,\frac{\partial F(z|\theta)/\partial\theta}{q_\theta(z)}$    (10)

where the corresponding gradient estimator is given by

$\nabla_\theta \mathcal{L} \approx \frac{\mathrm{d}f}{\mathrm{d}z}\, \frac{\mathrm{d}z}{\mathrm{d}\theta}, \qquad z \sim q_\theta(z)$    (11)

While this derivation is elementary, it helps to clarify things: the key ingredient needed to compute pathwise gradients as in Eqn. 6 is the ability to compute (or approximate) the derivative of the CDF w.r.t. the parameter, i.e. $\partial F(z|\theta)/\partial\theta$. In the supplementary materials we verify that Eqn. 11 results in correct gradients.
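
A quick way to see the master formula in action is to apply it to a distribution where everything is analytic. The sketch below is an illustrative assumption (the choice of the exponential distribution and the helper name pathwise_dz_dtheta are ours); it confirms that Eqn. 10 reproduces the derivative one would obtain from the explicit reparameterization $z = -\log(u)/\lambda$.

    import torch

    def pathwise_dz_dtheta(dF_dtheta, pdf, z):
        # Master formula (Eqn. 10): dz/dtheta = -(dF/dtheta) / q_theta(z)
        return -dF_dtheta(z) / pdf(z)

    # Exponential distribution: F(z) = 1 - exp(-lam * z), q(z) = lam * exp(-lam * z),
    # so Eqn. 10 gives dz/dlam = -z / lam, matching the explicit reparameterization.
    lam = torch.tensor(2.0)
    z = torch.distributions.Exponential(lam).sample((5,))
    dF_dlam = lambda z: z * torch.exp(-lam * z)
    pdf = lambda z: lam * torch.exp(-lam * z)
    print(pathwise_dz_dtheta(dF_dlam, pdf, z), -z / lam)   # the two agree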

It is worth emphasizing how this approach differs from a closely related alternative. Suppose we construct a (differentiable) approximation of the inverse CDF, for example by training a neural network. We can then push samples $u \sim \mathcal{U}(0,1)$ through the approximate inverse CDF and obtain approximate samples from $q_\theta(z)$ as well as approximate derivatives $\mathrm{d}z/\mathrm{d}\theta$ via the chain rule; in this case, there will be a mismatch between the probability assigned to the samples and the actual distribution over $z$. By contrast, if we use the construction of Eqn. 10, our samples will still be exact (or rather their exactness will be determined by the quality of our sampler for $q_\theta(z)$, which is fully decoupled from how we compute derivatives), and the fidelity of our approximation of the derivatives of the CDF will only affect the accuracy of our approximation for $\mathrm{d}z/\mathrm{d}\theta$.

4 Multivariate Pathwise Gradients

In the previous section we focused on continuous univariate distributions. Pathwise gradients can also be constructed for continuous multivariate distributions, although the analysis is in general expected to be much more complicated than in the univariate case—directly analogous to the difference between ordinary and partial differential equations. Before constructing estimators for particular distributions, we introduce the connection to the transport equation.

4.1 The Transport Equation

Consider a multivariate distribution $q_\theta(z)$ in $D$ dimensions and consider differentiating $\mathbb{E}_{q_\theta(z)}[f(z)]$ with respect to a scalar parameter $\theta$. (Here, without loss of generality, we assume that $f$ has no dependence on $\theta$, since computing $\mathbb{E}_{q_\theta(z)}[\nabla_\theta f]$ presents no difficulty; the difficulty stems from the dependence on $\theta$ in $q_\theta(z)$.) As we vary $\theta$ we move along a curve in the space of distributions over the sample space. Alternatively, we can think of each distribution as a cloud of particles; as we vary $\theta$ infinitesimally each particle undergoes an infinitesimal displacement. Any set of displacements that ensures that the displaced particles are distributed according to the displaced distribution is allowed. This intuitive picture can be formalized with the transport (a.k.a. continuity) equation; we refer the reader to (Villani, 2003) and (Ambrosio et al., 2008) for details:

$\frac{\partial q_\theta(z)}{\partial \theta} + \nabla_z \cdot \left( q_\theta(z)\, v^\theta(z) \right) = 0$    (12)

Here the velocity field $v^\theta(z)$ is a vector field defined on the sample space that displaces samples (i.e. particles) as we vary $\theta$ infinitesimally. Note that there is a velocity field for each scalar component of $\theta$. This equation is readily interpreted in the language of fluid dynamics. In order for the total probability to be conserved, the term $\partial q_\theta(z)/\partial\theta$—which is the rate of change of the number of particles in the infinitesimal volume element at $z$—has to be counterbalanced by the in/out-flow of particles, as given by the divergence term.

4.2 Gradient Estimator

Given a solution $v^\theta$ of the transport equation Eqn. 12, we can form the gradient estimator

$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[ \nabla_z f(z) \cdot v^\theta(z) \right]$    (13)

which generalizes Eqn. 11 to the multivariate case. That this is an unbiased gradient estimator follows directly from the divergence theorem (see the supplementary materials).
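
In an automatic differentiation framework, Eqn. 13 can be turned into a single expression. The sketch below is a minimal illustration under our own naming assumptions (f and v_theta are placeholders for a test function and a velocity field for one scalar parameter); it forms the Monte Carlo estimate given a batch of samples from $q_\theta(z)$.

    import torch

    def pathwise_grad_estimate(f, v_theta, z):
        # Estimate of d/dtheta E[f(z)] via Eqn. 13: E_q[ grad_z f(z) . v_theta(z) ],
        # averaged over a batch of samples z ~ q_theta(z) (shape: batch x dim).
        z = z.detach().requires_grad_(True)
        grad_f, = torch.autograd.grad(f(z).sum(), z)
        return (grad_f * v_theta(z)).sum(-1).mean()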

4.3 Tangent Fields

In general Eqn. 12 admits an infinite dimensional space of solutions. In the context of our derivation of Eqn. 10, we might loosely say that different solutions of Eqn. 12 correspond to different ways of specifying quantiles of $q_\theta(z)$. To determine a unique solution—the tangent field from the theory of optimal transport (we refer the reader to Ch. 8 of (Ambrosio et al., 2008) for details)—we require that

$\frac{\partial v^\theta_i(z)}{\partial z_j} = \frac{\partial v^\theta_j(z)}{\partial z_i} \qquad \text{for all } i, j$    (14)

In this case it can be shown that $v^\theta$ minimizes the total kinetic energy, which is given by (note that the univariate solution, Eqn. 10, is automatically the OMT solution):

$\mathbb{E}_{q_\theta(z)}\left[\, \lVert v^\theta(z) \rVert^2 \,\right]$    (15)

4.4 Gradient Variance

The kinetic energy term in Eqn. 15 might lead one to hope that the OMT velocity field provides gradients that minimize gradient variance. Unfortunately, the situation is more complicated. Denoting the (mean) gradient by $g$, the total gradient variance is given by

$\mathrm{Var} = \mathbb{E}_{q_\theta(z)}\left[ \left( \nabla_z f(z) \cdot v^\theta(z) \right)^2 \right] - g^2$    (16)

Since $g$ is the same for all unbiased gradient estimators, the gradient estimator that minimizes the total variance is the one that minimizes the first term in Eqn. 16. For test functions that are approximately flat and isotropic over the bulk of the support of $q_\theta(z)$, the first term in Eqn. 16 is approximately proportional to the kinetic energy. In this case the OMT gradient estimator will be (nearly) optimal. Note that the kinetic energy weighs contributions from different components of $v^\theta$ equally, whereas the first term in Eqn. 16 scales different components of $v^\theta$ with the corresponding components of $\nabla_z f$. Thus we can think of the OMT gradient estimator as a good choice for generic choices of $f$ that are relatively flat and isotropic (or, alternatively, for choices of $f$ where we have little a priori knowledge about the detailed structure of $f$). So for any particular choice of $f$ there will generally be some gradient estimator that has lower variance than the OMT gradient estimator. Still, for many choices of $f$ we expect the OMT gradient estimator to have lower variance than the RT gradient estimator, since the latter has no particular optimality guarantees (at least not in any coordinate system that we expect to be well adapted to $f$).

4.5 The Multivariate Normal

In the case of a (zero mean) multivariate Normal distribution parameterized by a Cholesky factor $L$ via $z = L\epsilon$, where $\epsilon$ is white noise, the reparameterization trick yields the following velocity field for the parameter $L_{ij}$ (note that the reparameterization trick already yields the OMT gradient for the location parameter):

$v^{\mathrm{RT}}_{L_{ij}}(z) = e_i\, (L^{-1} z)_j$    (17)

Here $e_i$ denotes the $i$-th standard basis vector. Note that Eqn. 17 is just a particular instance of the solution to the transport equation that is implicitly provided by the reparameterization trick, namely

$v^{\theta}_{\mathrm{RT}}(z) = \frac{\partial \mathcal{T}_\theta(\epsilon)}{\partial \theta} \bigg|_{\epsilon = \mathcal{T}_\theta^{-1}(z)}$    (18)

In the supplementary materials we verify that Eqn. 17 satisfies the transport equation Eqn. 12. However, it is evidently not optimal in the sense of optimal transport, since its Jacobian $\partial v_k/\partial z_l$ is not symmetric in $k$ and $l$. In fact the tangent field takes the form

(19)

where the velocity field is determined by a symmetric matrix whose precise form we give in the supplementary materials. We note that computing gradients with Eqn. 19 costs $\mathcal{O}(D^3)$, since it involves a singular value decomposition of the covariance matrix. In Sec. 7 we show that the resulting gradient estimator can lead to reduced variance.
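
The OMT estimator for the multivariate Normal is exposed in Pyro via the implementation referenced in the supplementary materials; a minimal usage sketch follows, assuming the class name OMTMultivariateNormal from that file and a simple quadratic test function chosen purely for illustration.

    import torch
    from pyro.distributions import OMTMultivariateNormal

    D = 5
    loc = torch.zeros(D, requires_grad=True)
    L = torch.eye(D, requires_grad=True)                # Cholesky factor of the covariance

    z = OMTMultivariateNormal(loc, L).rsample((10,))    # samples carry pathwise gradients
    loss = z.pow(2).sum(-1).mean()                      # illustrative test function
    loss.backward()                                     # L.grad uses the OMT velocity field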

5 Numerical Recipes

In this section we show how Eqn. 10 can be used to obtain pathwise gradients in practice. In many cases of interest we will need to derive approximations to $\partial F(z|\theta)/\partial\theta$ that balance the need for high accuracy (thus yielding gradient estimates with negligible bias) with the need for computational efficiency. In particular we will derive accurate approximations to Eqn. 10 for the Gamma, Beta, and Dirichlet distributions. These approximations will involve three basic components:

  1. Elementary Taylor expansions

  2. The Lugannani-Rice saddlepoint expansion (Lugannani & Rice, 1980; Butler, 2007)

  3. Rational polynomial approximations in regions of parameter/sample space that are analytically intractable

5.1 Gamma

The CDF of the Gamma distribution involves the (lower) incomplete gamma function: $F(z|\alpha, \beta) = \frac{1}{\Gamma(\alpha)}\gamma(\alpha, \beta z)$. Unfortunately the incomplete gamma function does not admit simple analytic expressions for derivatives w.r.t. its first argument, and so we must resort to numerical approximations. Since the rate $\beta$ enters only through a rescaling of $z$, it is sufficient to consider $\partial F(z|\alpha)/\partial\alpha$ for the standard Gamma distribution with $\beta = 1$.

5.1.1 Taylor Expansion for Small $z$

To give a flavor for the kinds of approximations we use, consider how we can approximate $\partial F(z|\alpha)/\partial\alpha$ in the limit of small $z$. We simply do a Taylor series in powers of $z$:

In practice we use 6 terms in this expansion, which is accurate for sufficiently small $z$. Details for the remaining approximations can be found in the supplementary materials.
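
As a reference against which any such approximation can be checked, $\partial F/\partial\alpha$ can be estimated by central finite differences of the regularized incomplete gamma function and plugged into Eqn. 10. The SciPy sketch below (our own helper, not part of the paper's recipe) does exactly this; it is far too slow and imprecise for use inside an optimizer, but it is useful for validation.

    from scipy.special import gammainc
    from scipy.stats import gamma

    def dz_dalpha_reference(z, alpha, eps=1e-5):
        # Eqn. 10 for the standard Gamma(alpha, 1) distribution, with a central
        # finite difference standing in for dF/dalpha.
        dF_dalpha = (gammainc(alpha + eps, z) - gammainc(alpha - eps, z)) / (2.0 * eps)
        return -dF_dalpha / gamma.pdf(z, alpha)

    z = gamma.rvs(2.5, size=5)
    print(dz_dalpha_reference(z, 2.5))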

5.2 Beta

The CDF of the Beta distribution is the (regularized) incomplete beta function; just like in the case of the Gamma distribution, its derivatives w.r.t. the shape parameters do not admit simple analytic expressions. We describe the numerical approximations we use in the supplementary materials.

5.3 Dirichlet

Let $x \sim \mathrm{Dirichlet}(\alpha)$ with $K$ components. Noting that the components $x_i$ are constrained to lie within the unit simplex, we proceed by representing $x$ in terms of mutually independent Beta variates (Wilks, 1962):

Without loss of generality, we will compute the derivative w.r.t. a single concentration parameter $\alpha_j$. Crucially, the only dependence on $\alpha_j$ in the representation above is through a single Beta variate. We find:

(20)

Note that Eqn. 20 implies that $\sum_i \mathrm{d}x_i/\mathrm{d}\alpha_j = 0$, as it must because of the simplex constraint. Since we have already developed an approximation for the derivative of the Beta CDF, Eqn. 20 provides a complete recipe for pathwise Dirichlet gradients. Note that although we have used a stick-breaking construction to derive Eqn. 20, this in no way dictates the sampling scheme we use when generating $x$. In the supplementary materials we verify that Eqn. 20 satisfies the transport equation.
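
The simplex constraint gives a convenient end-to-end check of any implementation of these gradients. The sketch below relies on the fact that the PyTorch Dirichlet distribution supports rsample with pathwise gradients (as noted in Sec. 5.4); the near-zero output reflects $\sum_i \mathrm{d}x_i/\mathrm{d}\alpha_j = 0$, up to the accuracy of the underlying numerical approximation.

    import torch
    from torch.distributions import Dirichlet

    alpha = torch.tensor([0.5, 1.5, 3.0], requires_grad=True)
    x = Dirichlet(alpha).rsample()     # pathwise-differentiable Dirichlet sample
    x.sum().backward()                 # d(sum_i x_i)/d alpha_j = sum_i dx_i/dalpha_j
    print(alpha.grad)                  # approximately zero for every component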

5.4 Implementation

It is worth emphasizing that pathwise gradient estimators of the form in Eqn. 13 have the advantage of being 'plug-and-play.' We simply plug an approximate or exact velocity field into our favorite automatic differentiation engine so that samples $z$ (and thus any downstream objective) are differentiable w.r.t. $\theta$; our approximations for pathwise gradients for the Gamma, Beta, and Dirichlet distributions are available in the 0.4 release of PyTorch (Paszke et al., 2017). There is no need to construct a surrogate objective function to form the gradient estimator.
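
Concretely, plug-and-play means that the estimator looks no different from ordinary backpropagation. In the sketch below the quadratic test function is an arbitrary illustrative choice; the gradients that land in the parameter tensors are Monte Carlo pathwise gradient estimates.

    import torch
    from torch.distributions import Beta

    conc1 = torch.tensor(2.0, requires_grad=True)
    conc0 = torch.tensor(3.0, requires_grad=True)

    z = Beta(conc1, conc0).rsample((100,))   # samples differentiable w.r.t. (conc1, conc0)
    loss = (z - 0.3).pow(2).mean()           # any differentiable test function
    loss.backward()                          # conc1.grad, conc0.grad hold pathwise estimates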

6 Related Work

A number of lines of research bear upon our work. There is a large body of work on constructing gradient estimators with reduced variance, much of which can be understood in terms of control variates (Ross, 2006): for example, (Mnih & Gregor, 2014) construct neural baselines for score-function gradients; (Schulman et al., 2015) discuss gradient estimators for stochastic computation graphs and their Rao-Blackwellization; and (Tucker et al., 2017; Grathwohl et al., 2017) construct adaptive control variates for discrete random variables. Another example of this line of work is reference (Miller et al., 2017), where the authors construct control variates that are applicable when $q_\theta(z)$ is a diagonal Normal distribution. While our OMT gradient for the multivariate Normal distribution, Eqn. 19, can also be understood in the language of control variates (see Sec. 8 and the supplementary materials for a brief discussion), (Miller et al., 2017) relies on Taylor expansions of the test function $f$. In addition, note that in their approach variance reduction for gradients w.r.t. the scale parameter necessitates a multi-sample estimator (at least for high-dimensional models where computing the diagonal of the Hessian is prohibitively expensive).

In (Graves, 2016), the author derives the formula in Eqn. 10 and uses it to construct gradient estimators for mixture distributions. Unfortunately, the resulting gradient estimator is expensive, relying on a recursive computation that scales with the dimension of the sample space.

Another line of work constructs partially reparameterized gradient estimators for cases where the reparameterization trick is difficult to apply. The generalized reparameterization gradient (G-Rep) (Ruiz et al., 2016) uses standardization via sufficient statistics to obtain a transformation that minimizes the dependence of the transformed random variable on $\theta$. This results in a partially reparameterized gradient estimator that also includes a score function-like term, i.e. a term in the gradient estimator that is proportional to the test function $f$. In rsvi (Naesseth et al., 2017) the authors consider gradient estimators in the case that $q_\theta(z)$ can be sampled from efficiently via rejection sampling. This results in a gradient estimator with the same generic structure as G-Rep, although in the case of rsvi the score function-like term can often be dropped in practice at the cost of a small bias (with the benefit of reduced variance). Besides the fact that this gradient estimator is not fully pathwise, one key difference with our approach is that for many distributions of interest (e.g. the Beta and Dirichlet distributions), rejection sampling introduces auxiliary random variables, which results in additional stochasticity and thus higher variance (cf. Figure 2). In contrast our pathwise gradients for the Beta and Dirichlet distributions are deterministic for a given sample and parameter. Finally, (Knowles, 2015) uses (somewhat imprecise) approximations to the inverse CDF to derive gradient estimators for Gamma random variables.

As the final version of this manuscript was being prepared, we became aware of (Figurnov et al., 2018), which has some overlap with this work. In particular, (Figurnov et al., 2018) derives Eqn. 10 and an interesting generalization to the multivariate case. This allows the authors to construct pathwise derivatives for the Gamma, Beta, and Dirichlet distributions. For the latter two distributions, however, the derivatives include additional stochasticity that our pathwise derivatives avoid. Also, the authors do not draw the connection to the transport equation and optimal transport or consider the multivariate Normal distribution in any detail.

Figure 2: Derivatives for samples $z \sim \mathrm{Beta}(\alpha, \beta)$. We compare the OMT gradient to the gradient that is obtained when Beta samples are represented as the ratio of two Gamma variates (each with its own pathwise derivative). The OMT derivative has a deterministic value for each sample $z$, whereas the Gamma representation induces a higher variance stochastic derivative due to the presence of an auxiliary random variable.

7 Experiments

All experiments in this section use single-sample gradient estimators.

7.1 Synthetic Experiments

In this section we validate our pathwise gradients for the Beta, Dirichlet, and multivariate Normal distributions. Where appropriate we compare to the RT gradient, the score function gradient, or rsvi.

7.1.1 Beta Distribution

In Fig. 3 we compare the performance of our OMT gradient for Beta random variables to the rsvi gradient estimator. We use a test function for which we can compute the gradient exactly. We see that the OMT gradient performs favorably over the entire range of the shape parameter that defines the Beta distribution used to compute the expectation. For smaller values of this parameter, where the test function exhibits larger curvature, the variance of the estimator is noticeably reduced. Notice that one reason for the reduced variance of the OMT estimator as compared to the rsvi estimator is the presence of an auxiliary random variable in the latter case (cf. Figure 2).

Figure 3: We compare the OMT gradient to the rsvi gradient for Beta random variables, using a test function for which the exact gradient is available (see Sec. 7.1.1). In the bottom panel we depict finite-sample bias for 25 million samples (this also includes effects from finite numerical precision).

7.1.2 Dirichlet Distribution

In Fig. 4 we compare the variance of our pathwise gradient for the Dirichlet distribution to the rsvi gradient estimator. We compute stochastic gradients of the ELBO for a Multinomial-Dirichlet model initialized at the exact posterior (where the exact gradient is zero). The Dirichlet distribution has 1995 components, and the single data point is a bag of words from a natural language document. We see that the pathwise gradient performs favorably over the entire range of the model hyperparameter considered. Note that as we increase the shape augmentation setting used by rsvi, the rsvi variance approaches that of the pathwise gradient. (As discussed in Sec. 6, the variance of the rsvi gradient estimator can also be reduced by dropping the score function-like term, at the cost of some bias.)

Figure 4: Gradient variance for the ELBO of a conjugate Multinomial-Dirichlet model. We compare the pathwise gradient to rsvi for different shape augmentation (boost) settings. See Sec. 7.1.2 for details.

7.1.3 Multivariate Normal

In Fig. 5 we use synthetic test functions to illustrate the amount of variance reduction that can be achieved with the OMT gradient estimator for the multivariate Normal distribution; the results are qualitatively similar for different dimensions.

Figure 5: We compare the OMT gradient estimator for the multivariate Normal distribution to the RT estimator for three test functions. The horizontal axis controls the magnitude of the off-diagonal elements of the Cholesky factor $L$. The vertical axis depicts the ratio of the mean variance of the OMT estimator to that of the RT estimator for the off-diagonal elements of $L$.

7.2 Real World Datasets

In this section we investigate the performance of our gradient estimators for the Gamma, Beta, and multivariate Normal distributions in two variational inference tasks on real world datasets. Note that we include an additional experiment for the multivariate Normal distribution in the supplementary materials (see the bivariate Normal experiment in Sec. A.4.1). All the experiments in this section were implemented in the Pyro (http://pyro.ai) probabilistic programming language.

7.2.1 Sparse Gamma DEF

The Sparse Gamma DEF (Ranganath et al., 2015) is a probabilistic model with multiple layers of local latent random variables and global random weights that mimics the architecture of a deep neural network. The local latent variables are indexed by the data point, the layer, and the latent component, while the global weights connect adjacent layers. We consider Poisson-distributed observations for each dimension of each data point. Concretely, the model is specified as follows (note that this experiment closely follows the setup in (Ruiz et al., 2016) and (Naesseth et al., 2017)):

We use three layers of latent factors per data point. We consider two model variants that differ in the prior placed on the weights. In the first variant we place Gamma priors over the weights; in the second variant we place Beta prime priors over the weights with the same means and variances as in the first variant (like the Gamma distribution, the Beta prime distribution has support on the positive real line). The dataset we consider is the Olivetti faces dataset (http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html), which consists of grayscale images of human faces. In Fig. 6 we depict how the training set ELBO increases during the course of optimization. We find that on this task the performance of the OMT gradient estimator is nearly identical to rsvi (note that we do not compare to any alternative estimators such as G-Rep, since (Naesseth et al., 2017) shows that rsvi has superior performance on this task). Figure 6 suggests that gradient variance is not the limiting factor for this particular task and dataset.

Figure 6: ELBO during training for two variants of the Sparse Gamma DEF, one with and one without Beta random variables. We compare the OMT gradient to rsvi. At each iteration we depict a multi-sample estimate of the ELBO.

7.2.2 Gaussian Process Regression

In this section we investigate the performance of our OMT gradient for the multivariate Normal distribution, Eqn. 19, in the context of a Gaussian Process regression task. We model the Mauna Loa CO2 data from (Keeling & Whorf, 2004) considered in (Rasmussen, 2004). We use a structured kernel that accommodates a long term linear trend as well as a periodic component. We fit the GP using a single-sample Monte Carlo ELBO gradient estimator and all data points. The variational family is a multivariate Normal distribution with a Cholesky parameterization for the covariance matrix. Progress on the ELBO during the course of training is depicted in Fig. 7. We can see that the OMT gradient estimator has superior sample efficiency due to its lower variance. By iteration 270 the OMT gradient estimator has attained the same ELBO that the RT estimator attains at iteration 500. Since each iteration of the OMT estimator is a constant factor slower than the corresponding RT iteration, the superior sample efficiency of the OMT estimator is largely canceled when judged by wall clock time. Nevertheless, the lower variance of the OMT estimator results in a higher ELBO than that obtained by the RT estimator.

Figure 7: ELBO during training for the Gaussian Process regression task in Sec. 7.2.2. At each iteration we depict a multi-sample estimate of the ELBO. We compare the OMT gradient estimator to the RT estimator.

8 Discussion and Future Work

We have seen that optimal transport offers a fruitful perspective on pathwise gradients. On the one hand it has helped us formulate pathwise gradients in situations where this was assumed to be impractical. On the other hand it has focused our attention on a particular notion of optimality, which led us to develop a new gradient estimator for the multivariate Normal distribution. A better understanding of this notion of optimality and, more broadly, a better understanding of when pathwise gradients are preferable over score function gradients (or vice versa) would be useful in guiding the practical application of these methods.

Since each solution of the transport equation Eqn. 12 yields an unbiased gradient estimator, the difference between any two such estimators can be thought of as a control variate. In the case of the multivariate Normal distribution, where computing the OMT gradient has a cost that scales as $\mathcal{O}(D^3)$, an attractive alternative to using the OMT velocity field is to adaptively choose a velocity field during the course of optimization, in direct analogy to adaptive control variate techniques. In future work we will explore this approach in detail; it promises lower variance than the OMT estimator at reduced computational cost.

The geometric picture from optimal transport—and thus the potential for non-trivial derivative applications—is especially rich for multivariate distributions. Here we have explored the multivariate Normal and Dirichlet distributions in some detail, but this just scratches the surface of multivariate distributions. It would be of general interest to develop pathwise gradients for a broader class of multivariate distributions, including for example mixture distributions. Rich distributions with low variance gradient estimators are of special interest in the context of SGVI, where the need to approximate complex posteriors demands rich families of distributions that lend themselves to stochastic optimization. In future work we intend to explore this connection further.

Acknowledgements

We thank Peter Dayan and Zoubin Ghahramani for feedback on a draft manuscript and other colleagues at Uber AI Labs—especially Noah Goodman and Theofanis Karaletsos— for stimulating conversations during the course of this work. We also thank Christian Naesseth for clarifying details of the experimental setup for the deep exponential family experiment in (Naesseth et al., 2017).

References

  • Ambrosio et al. (2008) Ambrosio, Luigi, Gigli, Nicola, and Savaré, Giuseppe. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • Butler (2007) Butler, Ronald W. Saddlepoint approximations with applications, volume 22. Cambridge University Press, 2007.
  • Casella & Robert (1996) Casella, George and Robert, Christian P. Rao-blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
  • Duchi et al. (2011) Duchi, John, Hazan, Elad, and Singer, Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
  • Efron & Morris (1975) Efron, Bradley and Morris, Carl. Data analysis using stein’s estimator and its generalizations. Journal of the American Statistical Association, 70(350):311–319, 1975.
  • Figurnov et al. (2018) Figurnov, Michael, Mohamed, Shakir, and Mnih, Andriy. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.
  • Glasserman (2013) Glasserman, Paul. Monte Carlo methods in financial engineering, volume 53. Springer Science & Business Media, 2013.
  • Glynn (1990) Glynn, Peter W. Likelihood ratio gradient estimation for stochastic systems. Communications of the ACM, 33(10):75–84, 1990.
  • Grathwohl et al. (2017) Grathwohl, Will, Choi, Dami, Wu, Yuhuai, Roeder, Geoff, and Duvenaud, David. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123, 2017.
  • Graves (2016) Graves, Alex. Stochastic backpropagation through mixture density distributions. arXiv preprint arXiv:1607.05690, 2016.
  • Hoffman et al. (2013) Hoffman, Matthew D, Blei, David M, Wang, Chong, and Paisley, John. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
  • Jordan et al. (1999) Jordan, Michael I, Ghahramani, Zoubin, Jaakkola, Tommi S, and Saul, Lawrence K. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Keeling & Whorf (2004) Keeling, Charles David and Whorf, Timothy P. Atmospheric co2 concentrations derived from flask air samples at sites in the sio network. Trends: a compendium of data on Global Change, 2004.
  • Kingma & Ba (2014) Kingma, Diederik P and Ba, Jimmy. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Kingma & Welling (2013) Kingma, Diederik P and Welling, Max. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • Knowles (2015) Knowles, David A. Stochastic gradient variational bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.
  • Kucukelbir et al. (2016) Kucukelbir, Alp, Tran, Dustin, Ranganath, Rajesh, Gelman, Andrew, and Blei, David M. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.
  • Lugannani & Rice (1980) Lugannani, Robert and Rice, Stephen. Saddle point approximation for the distribution of the sum of independent random variables. Advances in applied probability, 12(2):475–490, 1980.
  • Miller et al. (2017) Miller, Andrew, Foti, Nick, D’Amour, Alexander, and Adams, Ryan P. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems, pp. 3711–3721, 2017.
  • Mnih & Gregor (2014) Mnih, Andriy and Gregor, Karol. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.
  • Naesseth et al. (2017) Naesseth, Christian, Ruiz, Francisco, Linderman, Scott, and Blei, David. Reparameterization gradients through acceptance-rejection sampling algorithms. In Artificial Intelligence and Statistics, pp. 489–498, 2017.
  • Paisley et al. (2012) Paisley, John, Blei, David, and Jordan, Michael. Variational bayesian inference with stochastic search. arXiv preprint arXiv:1206.6430, 2012.
  • Paszke et al. (2017) Paszke, Adam, Gross, Sam, Chintala, Soumith, Chanan, Gregory, Yang, Edward, DeVito, Zachary, Lin, Zeming, Desmaison, Alban, Antiga, Luca, and Lerer, Adam. Automatic differentiation in pytorch. 2017.
  • Price (1958) Price, Robert. A useful theorem for nonlinear devices having gaussian inputs. IRE Transactions on Information Theory, 4(2):69–72, 1958.
  • Ranganath et al. (2014) Ranganath, Rajesh, Gerrish, Sean, and Blei, David. Black box variational inference. In Artificial Intelligence and Statistics, pp. 814–822, 2014.
  • Ranganath et al. (2015) Ranganath, Rajesh, Tang, Linpeng, Charlin, Laurent, and Blei, David. Deep exponential families. In Artificial Intelligence and Statistics, pp. 762–771, 2015.
  • Rasmussen (2004) Rasmussen, Carl Edward. Gaussian processes in machine learning. In Advanced lectures on machine learning, pp. 63–71. Springer, 2004.
  • Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
  • Robbins & Monro (1951) Robbins, Herbert and Monro, Sutton. A stochastic approximation method. The annals of mathematical statistics, pp. 400–407, 1951.
  • Ross (2006) Ross, Sheldon M. Simulation. Academic Press, San Diego, 2006.
  • Ruiz et al. (2016) Ruiz, Francisco R, AUEB, Michalis Titsias RC, and Blei, David. The generalized reparameterization gradient. In Advances in Neural Information Processing Systems, pp. 460–468, 2016.
  • Salimans et al. (2013) Salimans, Tim, Knowles, David A, et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
  • Schulman et al. (2015) Schulman, John, Heess, Nicolas, Weber, Theophane, and Abbeel, Pieter. Gradient estimation using stochastic computation graphs. In Advances in Neural Information Processing Systems, pp. 3528–3536, 2015.
  • Stan Manual (2017) Stan Manual. Stan modeling language users guide and reference manual, version 2.17.0. http://mc-stan.org/users/documentation/case-studies/pool-binary-trials.html, 2017.
  • Tieleman & Hinton (2012) Tieleman, T. and Hinton, G. Lecture 6.5—RmsProp: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.
  • Titsias & Lázaro-Gredilla (2014) Titsias, Michalis and Lázaro-Gredilla, Miguel. Doubly stochastic variational bayes for non-conjugate inference. In International Conference on Machine Learning, pp. 1971–1979, 2014.
  • Tucker et al. (2017) Tucker, George, Mnih, Andriy, Maddison, Chris J, Lawson, John, and Sohl-Dickstein, Jascha. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2624–2633, 2017.
  • Villani (2003) Villani, Cédric. Topics in optimal transportation. Number 58. American Mathematical Soc., 2003.
  • Wilks (1962) Wilks, S.S. Mathematical Statistics. John Wiley and Sons Inc., 1962.
  • Williams (1992) Williams, Ronald J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Wingate & Weber (2013) Wingate, David and Weber, Theophane. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.

Appendix A Supplementary Materials

A.1 The Univariate Case

For completeness we show explicitly that the formula

$\frac{\mathrm{d}z}{\mathrm{d}\theta} = -\,\frac{\partial F(z|\theta)/\partial\theta}{q_\theta(z)}$    (21)

yields the correct gradient. Without loss of generality we assume that $f$ has no explicit dependence on $\theta$. Substituting Eqn. 21 for $\mathrm{d}z/\mathrm{d}\theta$ we have

$\mathbb{E}_{q_\theta(z)}\!\left[ \frac{\mathrm{d}f}{\mathrm{d}z} \frac{\mathrm{d}z}{\mathrm{d}\theta} \right] = -\int \mathrm{d}z\, \frac{\mathrm{d}f}{\mathrm{d}z}\, \frac{\partial F(z|\theta)}{\partial\theta} = -\int \mathrm{d}z\, \frac{\mathrm{d}f}{\mathrm{d}z} \int_{-\infty}^{z} \mathrm{d}z'\, \frac{\partial q_\theta(z')}{\partial\theta}$
$\qquad = -\int \mathrm{d}z'\, \frac{\partial q_\theta(z')}{\partial\theta} \int_{z'}^{\infty} \mathrm{d}z\, \frac{\mathrm{d}f}{\mathrm{d}z}$
$\qquad = \int \mathrm{d}z'\, \frac{\partial q_\theta(z')}{\partial\theta}\, f(z') = \nabla_\theta\, \mathbb{E}_{q_\theta(z)}\left[ f(z) \right]$    (22)

In the second line we changed the order of integration and in the third we appealed to the fundamental theorem of calculus, assuming that $f$ is sufficiently regular that we can drop the boundary term at infinity.

Note that Eqn. 21 is the unique solution to the one-dimensional version of the transport equation that satisfies the boundary condition that $q_\theta(z)\,\mathrm{d}z/\mathrm{d}\theta$ vanishes as $z \to -\infty$:

$\frac{\partial q_\theta(z)}{\partial\theta} + \frac{\partial}{\partial z}\!\left( q_\theta(z)\, \frac{\mathrm{d}z}{\mathrm{d}\theta} \right) = 0$    (23)

Figure 8: We illustrate how the pathwise derivative is obtained from the CDF in the univariate case. The black curves depict the CDF of the Gamma distribution with the shape parameter varying between 1.4 and 2.0. The red line corresponds to a fixed quantile $u$. As we vary the shape parameter, the point $z$ where the CDF intersects the red line varies. The rate of this variation is precisely the derivative $\mathrm{d}z/\mathrm{d}\alpha$.

A.1.1 Example: Truncated Unit Normal

We consider an illustrative case where Eqn. 21 can be computed in closed form. For simplicity we consider the unit Normal distribution truncated to an interval whose right endpoint is the only free parameter. (As one would expect, Eqn. 21 yields the standard reparameterized gradient in the case of a non-truncated Normal distribution. Also note that the truncated unit normal is amenable to the reparameterization trick provided that one can compute the inverse error function.) A simple computation yields

(24)

First, notice that at the right endpoint of the interval the derivative equals 1, which is what we would expect, since the quantile $u = 1$ is mapped to the rightmost edge of the interval. Similarly the derivative approaches 0 at the opposite end of the interval. In between, the derivative interpolates smoothly between 0 and 1. This makes sense: for a fixed value of the quantile, the further we are in the tail of the distribution, the larger the effect on $z$ of nudging the endpoint to the right, while the effect is correspondingly smaller for $z$ in the bulk of the distribution.

A.1.2 Example: Univariate Mixture Distributions

Consider a mixture of univariate distributions:

$q_\theta(z) = \sum_{k=1}^{K} \pi_k\, q_k(z|\theta_k)$    (25)

If we have analytic control over the individual CDFs (or know how to approximate them and their derivatives w.r.t. the parameters) then we can immediately appeal to Eqn. 21. Concretely for derivatives w.r.t. the parameters of each component distribution we have:

$\frac{\mathrm{d}z}{\mathrm{d}\theta_k} = -\,\frac{\pi_k}{q_\theta(z)}\, \frac{\partial F_k(z|\theta_k)}{\partial\theta_k}$    (26)

where $F_k$ is the CDF of the $k$-th component. From this we can get, for example,

$\frac{\mathrm{d}z}{\mathrm{d}\mu_k} = \frac{\pi_k\, \mathcal{N}(z|\mu_k, \sigma_k)}{q_\theta(z)}$    (27)

for a mixture of univariate Normal distributions.

In Fig. 9 we demonstrate that the OMT gradient for a mixture of univariate Normal distributions can have much lower variance than the corresponding score function gradient. Here the mixture has two components with fixed means and scales. Note that using the reparameterization trick in this setting would be impractical.
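
Eqn. 27 is simple enough to implement directly; the sketch below is only an illustration (the helper name and the particular parameter values are our own choices) and computes the pathwise derivative of a mixture sample w.r.t. one component mean.

    import torch
    from torch.distributions import Normal

    def dz_dmu_k(z, pis, mus, sigmas, k):
        # Eqn. 27: dz/dmu_k = pi_k * N(z | mu_k, sigma_k) / q_theta(z),
        # where q_theta(z) is the mixture density.
        comp_pdfs = Normal(mus, sigmas).log_prob(z.unsqueeze(-1)).exp()
        q = (pis * comp_pdfs).sum(-1)
        return pis[k] * comp_pdfs[..., k] / q

    pis = torch.tensor([0.3, 0.7])
    mus = torch.tensor([-1.0, 2.0])
    sigmas = torch.tensor([0.5, 1.0])
    z = torch.tensor([0.1, 1.5, -0.7])
    print(dz_dmu_k(z, pis, mus, sigmas, k=0))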

Figure 9: We compare the OMT gradient to the score function gradient for a test function of $z$, where $z$ is distributed according to a two-component mixture of univariate Normals. Depicted is the variance of the gradient w.r.t. the logit that governs the mixture probability of the first component. The logit of the second component is fixed to be zero.

A.2 The Multivariate Case

Suppose we are given a velocity field $v^\theta(z)$ that satisfies the transport equation:

$\frac{\partial q_\theta(z)}{\partial\theta} + \nabla_z \cdot \left( q_\theta(z)\, v^\theta(z) \right) = 0$    (28)

Then, as discussed in the main text, we can form the gradient estimator

$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[ \nabla_z f(z) \cdot v^\theta(z) \right]$    (29)

That this gradient estimator is unbiased follows directly from the transport equation and divergence theorem:

$\nabla_\theta\, \mathbb{E}_{q_\theta(z)}\left[ f(z) \right] = \int \mathrm{d}z\, f(z)\, \frac{\partial q_\theta(z)}{\partial\theta} = -\int \mathrm{d}z\, f(z)\, \nabla_z \cdot \left( q_\theta(z)\, v^\theta(z) \right) = \int \mathrm{d}z\, q_\theta(z)\, \nabla_z f(z) \cdot v^\theta(z)$    (30)

where we appeal to the identity

$\nabla_z \cdot \left( f\, q_\theta\, v^\theta \right) = f\, \nabla_z \cdot \left( q_\theta\, v^\theta \right) + q_\theta\, v^\theta \cdot \nabla_z f$    (31)

and assume that $f$, $q_\theta$, and $v^\theta$ are sufficiently well-behaved that we can drop the surface integral. This is just the multivariate generalization of the derivation in the previous section.

A.3 Multivariate Normal

A.3.1 Whitened Coordinates

First we take a look at gradient estimators in whitened coordinates $\epsilon = L^{-1} z$. The reparameterization trick ansatz for the velocity field can be obtained by transforming the solution in Eqn. 46 (which is also given in the main text as Eqn. 17) to the new coordinates:

(32)

Note that the transport equation for the multivariate distribution can be written in the form

(33)

The homogeneous equation (i.e. the transport equation without the source term) is then given by

(34)

In these coordinates it is evident that infinitesimal rotations, i.e. vector fields of the form

$v(\epsilon) = A \epsilon, \qquad A = -A^{T}$    (35)

satisfy the homogeneous equation (these are in fact not the only solutions; in addition there are non-linear solutions), since

(36)

Finally, if we make the specific choice

(37)

we find that the resulting velocity field, which automatically satisfies the transport equation, satisfies the symmetry condition

(38)

since

(39)

which is symmetric in and . This implies that the velocity field can be specified as the gradient of a scalar field (this is generally true for the OMT solution), i.e.

(40)

for some scalar field, which (up to an unspecified additive constant) is evidently given by

(41)

Note, however, that this is not the OMT solution we care about: it minimizes a different kinetic energy functional from the one we care about (namely, it minimizes the kinetic energy functional in whitened coordinates and not in natural coordinates).

We now explicitly show that solutions of the transport equation that are modified by the addition of an infinitesimal rotation (as in Eqn. 35) still yield valid gradient estimators. Consider a test statistic that is a monomial in the components of $\epsilon$:

(42)

It is enough to show that the following expectation vanishes (note that we can thus think of this term as a control variate):

(43)

where $A$ is an antisymmetric matrix. The sum in Eqn. 43 splits up into a sum of paired terms of the form

(44)

We can easily show that each of these paired terms has zero expectation. First note that the expectation is zero if either of the two relevant exponents is even (since odd moments of the unit Normal vanish). If both exponents are odd we get (using the standard expression for the even moments of the unit Normal in terms of the double factorial)

(45)

Thus, solutions of the transport equation that are modified by the addition of an infinitesimal rotation still yield the same gradient in expectation.
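
The same fact can be checked numerically for an arbitrary smooth test function, not just monomials: by Stein's identity the expectation of $\nabla f(\epsilon) \cdot A\epsilon$ contracts an antisymmetric matrix against a symmetric Hessian and therefore vanishes. The sketch below (test function and matrix chosen arbitrarily for illustration) confirms this with Monte Carlo.

    import torch

    torch.manual_seed(0)
    D, N = 4, 500000
    A = torch.randn(D, D)
    A = A - A.t()                                     # antisymmetric matrix
    eps = torch.randn(N, D, requires_grad=True)       # whitened samples eps ~ N(0, I)
    f = eps[:, 0] * eps[:, 1].pow(2) + eps[:, 2].cos()
    grad_f, = torch.autograd.grad(f.sum(), eps)
    rotation_field = eps.detach() @ A.t()             # infinitesimal rotation v(eps) = A eps
    print((grad_f * rotation_field).sum(-1).mean())   # ~0, up to Monte Carlo error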

A.4 Natural Coordinates

We first show that the velocity field that follows from the reparameterization trick satisfies the transport equation in the (given) coordinates $z$, where we have

$v^{\mathrm{RT}}_{L_{ij}}(z) = e_i\, (L^{-1} z)_j$    (46)

We have that

(47)

and

(48)

and

Thus, the terms cancel term by term and the transport equation is satisfied.

What about the OMT gradient in the natural (given) coordinates $z$? To proceed we represent the velocity field as a linear vector field with symmetric and antisymmetric parts. Imposing the OMT condition determines the antisymmetric part. Imposing the transport equation determines the symmetric part. We find that

(49)

where the matrix appearing in Eqn. 49 is the unique symmetric matrix that satisfies the equation

where we define

(50)

To explicitly solve this equation for the symmetric matrix we use the SVD to write

(51)

where the two factors are a diagonal and an orthogonal matrix, respectively. Then we have that

(52)

where the operations appearing in Eqn. 52 are elementwise division and an outer product. Note that a naive implementation of a gradient estimator based on Eqn. 49 would explicitly construct a tensor whose size is quartic in the dimension. A more efficient implementation will instead make use of the structure as a sum of products and never explicitly construct this tensor. (Our implementation can be found here: https://github.com/uber/pyro/blob/0.2.1/pyro/distributions/omt_mvn.py)

A.4.1 Bivariate Normal Distribution

In Fig. 10 we compare the performance of our OMT gradient for a bivariate Normal distribution to the reparameterization trick gradient estimator. We use a test function for which we can compute the gradient exactly. We see that the OMT gradient estimator performs favorably over the entire range of parameters considered.

Figure 10: We compare the OMT gradient to the gradient from the reparameterization trick for a bivariate Normal distribution and a fixed test function. The Cholesky factor has fixed diagonal elements and a varying off-diagonal element, and the gradient is taken with respect to the off-diagonal element. The variance for the OMT gradient is everywhere lower than for the reparameterization trick gradient.

A.5 Gradient Variance for Linear Test Functions

We use the following example to give more intuition for when we expect OMT gradients for the multivariate Normal distribution to have lower variance than RT gradients. Let $q(z)$ be the unit Normal distribution in $D$ dimensions. Consider the test function

$f(z) = a \cdot z$    (53)

and the derivative w.r.t. the off-diagonal elements of the Cholesky factor $L$. A simple computation yields the total variance of the RT estimator:

(54)

Similarly for the OMT estimator we find

(55)

So if we draw the coefficients of the test function from a generic prior we expect the variance of the OMT estimator to be about half of that of the RT estimator. Concretely, for a suitable choice of prior the variance of the OMT estimator will be exactly half that of the RT estimator in expectation. While this computation is for a very specific case—a linear test function and a unit Normal distribution—we find that this magnitude of variance reduction is typical.

A.6 The Lugannani-Rice Approximation

Saddlepoint approximation methods take advantage of cumulant generating functions (CGFs) to construct (often very accurate) approximations to probability density functions in situations where full analytic control is intractable; we refer the reader to (Butler, 2007) for an overview. These methods are also directly applicable to CDFs, where a particularly useful approximation—often used by statisticians to estimate various tail probabilities—has been developed by Lugannani and Rice (Lugannani & Rice, 1980). This approximation—after additional differentiation w.r.t. the parameters of the distribution—forms the basis of our approximate formulas for pathwise gradients for the Gamma, Beta and Dirichlet distributions in regions of the sample space where the (marginal) density is approximately Gaussian. As we will see, these approximations attain high accuracy.

For completeness we briefly describe the Lugannani-Rice approximation. It is given by:

$F(z) \approx \Phi(\hat{w}) + \phi(\hat{w}) \left( \frac{1}{\hat{w}} - \frac{1}{\hat{u}} \right)$    (56)

where

$\hat{w} = \mathrm{sign}(\hat{s}) \sqrt{2 \left( \hat{s} z - K(\hat{s}) \right)} \qquad \hat{u} = \hat{s} \sqrt{K''(\hat{s})}$    (57)

and where $\hat{w}$ and $\hat{u}$ are functions of $z$ and the saddlepoint $\hat{s}$, with the saddlepoint defined implicitly by the equation $K'(\hat{s}) = z$. Here $K$ is the CGF of $q(z)$, $\mu = K'(0)$ is the mean of $q(z)$, and $\Phi$ and $\phi$ are the CDF and probability density of the unit normal distribution. Note that Eqn. 56 appears to have a singularity at $z = \mu$; it can be shown, however, that Eqn. 56 is in fact smooth at $z = \mu$. Nevertheless, in our numerical recipes we will need to take care to avoid numerical instabilities near $z = \mu$ that result from finite numerical precision.

A.7 Gamma Distribution

Our numerical recipe for $\partial F(z|\alpha)/\partial\alpha$ for the standard Gamma distribution with $\beta = 1$ divides $(\alpha, z)$ space into three regions. In the first region (small $z$) we use the Taylor series expansion given in the main text. In the second region we use the following set of expressions derived from the Lugannani-Rice approximation. Away from the singularity we use:

(58)

where

Near the singularity we use:

(59)

Note that Eqn. 59 is derived from Eqn. 58 by a Taylor expansion around the singularity. The threshold that switches between the two regimes is chosen to balance use of Eqn. 58 (which is more accurate) and Eqn. 59 (which is more numerically stable near the singularity). Finally, in the remaining region we use a bivariate rational polynomial approximation where