
# Pathwise Derivatives for Multivariate Distributions

Martin Jankowiak
Uber AI Labs
San Francisco, CA, USA
jankowiak@uber.com
Theofanis Karaletsos
Uber AI Labs
San Francisco, CA, USA
theofanis@uber.com
Corresponding author
###### Abstract

We exploit the link between the transport equation and derivatives of expectations to construct efficient pathwise gradient estimators for multivariate distributions. We focus on two main threads. First, we use null solutions of the transport equation to construct adaptive control variates that can be used to construct gradient estimators with reduced variance. Second, we consider the case of multivariate mixture distributions. In particular we show how to compute pathwise derivatives for mixtures of multivariate Normal distributions with arbitrary means and diagonal covariances. We demonstrate in a variety of experiments in the context of variational inference that our gradient estimators can outperform other methods, especially in high dimensions.

## 1 Introduction

Stochastic optimization is a major component of many machine learning algorithms. In some cases, for example in variational inference, the stochasticity arises from an explicit, parameterized distribution $q_\theta(z)$. In these cases the optimization problem can often be cast as maximizing an objective

$$\mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[f(z)\right] \qquad (1)$$

where $f(z)$ is a (differentiable) test function and $\theta$ is a vector of parameters. (Without loss of generality we assume that $f$ has no explicit dependence on $\theta$, since gradients of an explicitly $\theta$-dependent $f$ are readily handled by bringing $\nabla_\theta$ inside the expectation.) In order to maximize $\mathcal{L}$ we would like to follow gradients $\nabla_\theta \mathcal{L}$. If $z$ were a deterministic function of $\theta$ we would simply apply the chain rule; for any particular (scalar) component $\theta$ of the parameter vector we have

$$\nabla_\theta f(z) = \nabla_z f \cdot \frac{\partial z}{\partial \theta} \qquad (2)$$

What is the generalization of Eqn. 2 to the stochastic case? As observed in (beyond, ), we can construct an unbiased estimator for the gradient of Eqn. 1 provided we can solve the following partial differential equation (the transport a.k.a. continuity equation):

$$\frac{\partial}{\partial\theta} q_\theta + \nabla_z \cdot \left(q_\theta\, v^\theta\right) = 0 \qquad (3)$$

Here $v^\theta$ is a vector field defined on the sample space of $q_\theta(z)$. (Note that there is a distinct vector field $v^\theta$ for each scalar component $\theta$ of the parameter vector.) We can then form the following pathwise gradient estimator:

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot v^\theta\right] \qquad (4)$$

That this gradient estimator is unbiased follows directly from the divergence theorem. (Specifically we appeal to the identity $\int dz\, \nabla_z \cdot \left(f q_\theta v^\theta\right) = \oint f q_\theta v^\theta \cdot dS$ and assume that $f q_\theta v^\theta$ is sufficiently well-behaved that we can drop the surface integral.)

$$\nabla_\theta \mathcal{L} = \int \! dz\, \frac{\partial q_\theta(z)}{\partial\theta} f(z) = -\int \! dz\, \nabla_z \cdot \left(q_\theta v^\theta\right) f(z) = \int \! dz\, q_\theta(z)\, \nabla_z f \cdot v^\theta = \mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot v^\theta\right]$$

Thus Eqn. 4 is a generalization of Eqn. 2 to the stochastic case, with the velocity field $v^\theta$ playing the role of $\partial z / \partial\theta$ in the deterministic case. In contrast to the deterministic case, where $\partial z / \partial\theta$ is uniquely specified, Eqn. 3 generally admits an infinite dimensional space of solutions for $v^\theta$.
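The unbiasedness of Eqn. 4 is easy to verify numerically. The following minimal sketch (all names and the test function are illustrative) checks it for a 1D Normal $q_\theta = \mathcal{N}(\theta, 1)$, where the transport equation for the location parameter is solved by the constant velocity field $v^\theta(z) = 1$:

```python
import numpy as np

# Sketch: verify Eqn. 4 for q_theta = N(theta, 1).  For the location
# parameter the reparameterization z = theta + eps gives the constant
# velocity field v_theta(z) = 1, so grad_theta E[f] = E[f'(z)].

rng = np.random.default_rng(0)
theta = 1.5
z = theta + rng.standard_normal(200_000)

f = lambda z: z ** 2          # illustrative test function
df = lambda z: 2.0 * z        # its gradient

pathwise_grad = np.mean(df(z) * 1.0)   # Eqn. 4 with v_theta = 1
exact_grad = 2.0 * theta               # d/dtheta (theta^2 + 1)

assert abs(pathwise_grad - exact_grad) < 0.05
```

For this test function the Monte Carlo estimate agrees with the analytic gradient $2\theta$ up to sampling noise.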

Pathwise gradient estimators are particularly appealing because, empirically, they generally exhibit lower variance than alternatives. In this work we are interested in the multivariate case, where the solution space to Eqn. 3 is very rich, and where alternative gradient estimators tend to suffer from particularly large variance, especially in high dimensions. The rest of this paper is organized as follows. In Sec. 2 we give a brief overview of stochastic gradient variational inference (SGVI) and stochastic gradient estimators. In Sec. 3 we use parameterized families of solutions to Eqn. 3 to construct adaptive pathwise gradient estimators. In Sec. 4 we present solutions to Eqn. 3 for multivariate mixture distributions. In Sec. 5 we place our work in the context of recent research. In Sec. 6 we validate our proposed techniques with a variety of experiments in the context of variational inference. In Sec. 7 we conclude and discuss directions for future work.

## 2 Background

### 2.1 Stochastic Gradient Variational Inference

Given a probabilistic model $p(x, z)$ with observations $x$ and latent random variables $z$, variational inference recasts posterior inference as an optimization problem. Specifically we define a variational family of distributions $q_\theta(z)$ parameterized by $\theta$ and seek the value of $\theta$ that minimizes the KL divergence between $q_\theta(z)$ and the (unknown) posterior $p(z|x)$ (jordan1999introduction; paisley2012variational; hoffman2013stochastic; wingate2013automated; ranganath2014black). This is equivalent to maximizing the ELBO, defined as

$$\mathrm{ELBO} = \mathbb{E}_{q_\theta(z)}\left[\log p(x, z) - \log q_\theta(z)\right] \qquad (5)$$

This stochastic optimization problem will be the basic setting for most of our experiments in Sec. 6.
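As a concrete sketch of Eqn. 5, the following snippet estimates the ELBO by Monte Carlo for a toy Normal-Normal model $p(z) = \mathcal{N}(0,1)$, $p(x|z) = \mathcal{N}(z,1)$ with variational family $q = \mathcal{N}(m, s^2)$, and compares against the closed form (the model and all values are illustrative, not from the paper's experiments):

```python
import numpy as np

# Monte Carlo estimate of the ELBO (Eqn. 5) for a conjugate toy model,
# checked against the analytic ELBO.

rng = np.random.default_rng(0)
x, m, s = 1.0, 0.4, 0.7
n = 500_000
z = m + s * rng.standard_normal(n)     # samples from q = N(m, s^2)

def log_normal(v, loc, scale):
    return -0.5 * np.log(2 * np.pi * scale ** 2) - 0.5 * ((v - loc) / scale) ** 2

# E_q[log p(z) + log p(x|z) - log q(z)]
elbo_mc = np.mean(log_normal(z, 0, 1) + log_normal(x, z, 1) - log_normal(z, m, s))

elbo_exact = (-np.log(2 * np.pi)
              - 0.5 * ((x - m) ** 2 + s ** 2)
              - 0.5 * (m ** 2 + s ** 2)
              + 0.5 * np.log(2 * np.pi * s ** 2) + 0.5)
assert abs(elbo_mc - elbo_exact) < 0.02
```

In practice one uses a single sample (or a small batch) per optimization step rather than the large sample size used here for the check.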

### 2.2 Score Function Estimator

The score function estimator, also referred to as REINFORCE (glynn1990likelihood; williams1992simple; fu2006gradient), provides a simple and broadly applicable recipe for estimating stochastic gradients, with the simplest variant given by

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[f(z)\, \nabla_\theta \log q_\theta(z)\right]$$

Although the score function estimator is very general (e.g. it can be applied to discrete random variables), it typically suffers from high variance. This can be mitigated with variance reduction techniques such as Rao-Blackwellization (casella1996rao) and control variates (ross).
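A minimal sketch of the score function estimator for $q_\theta = \mathcal{N}(\theta, 1)$, where $\nabla_\theta \log q_\theta(z) = z - \theta$, together with a simple baseline as an example of a control variate (parameters and test function are illustrative):

```python
import numpy as np

# Score function estimator for q_theta = N(theta, 1):
#   grad_theta E[f] = E[f(z) * (z - theta)].
# Subtracting a baseline from f (here the sample mean of f, as a
# simple illustration) reduces the variance of the estimator.

rng = np.random.default_rng(0)
theta, n = 1.5, 500_000
z = theta + rng.standard_normal(n)
f = z ** 2                       # test function f(z) = z^2
score = z - theta                # grad_theta log N(z | theta, 1)

grad_naive = np.mean(f * score)
grad_cv = np.mean((f - f.mean()) * score)   # with a mean baseline

exact = 2.0 * theta
assert abs(grad_cv - exact) < 0.1
# the baseline leaves the mean (nearly) unchanged but lowers variance
assert np.var((f - f.mean()) * score) < np.var(f * score)
```

For this example the baseline roughly halves the estimator variance; more sophisticated baselines and Rao-Blackwellization can do considerably better.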

### 2.3 Reparameterization Trick

The reparameterization trick (RT) is not as broadly applicable as the score function estimator, but it generally exhibits lower variance (price1958useful; salimans2013fixed; kingma2013auto; glasserman2013monte; titsias2014doubly; rezende2014stochastic). It is applicable to continuous random variables whose probability density $q_\theta(z)$ can be reparameterized such that we can rewrite expectations $\mathbb{E}_{q_\theta(z)}[f(z)]$ as $\mathbb{E}_{q_0(\epsilon)}[f(T(\epsilon; \theta))]$, where $q_0(\epsilon)$ is a fixed distribution and $T(\epsilon; \theta)$ is a differentiable $\theta$-dependent transformation. Since the expectation w.r.t. $q_0(\epsilon)$ has no $\theta$ dependence, gradients w.r.t. $\theta$ can be computed by pushing $\nabla_\theta$ through the expectation. This reparameterization can be done for a number of distributions, including for example the Normal distribution. As noted in (beyond, ), in cases where the reparameterization trick is applicable, we have that

$$v^\theta = \left.\frac{\partial T(\epsilon; \theta)}{\partial\theta}\right|_{\epsilon = T^{-1}(z; \theta)} \qquad (6)$$

is a solution to the transport equation, Eqn. 3.
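For a diagonal Normal with $T(\epsilon; \mu, \sigma) = \mu + \sigma \epsilon$, Eqn. 6 gives $v^\mu = 1$ and $v^\sigma = (z - \mu)/\sigma$. The following sketch (illustrative names and values) checks both velocity fields via Eqn. 4 against analytic gradients:

```python
import numpy as np

# RT velocity fields of Eqn. 6 for a 1D Normal, z = mu + sigma * eps:
# differentiating T and substituting eps = (z - mu)/sigma gives
# v_mu = 1 and v_sigma = (z - mu)/sigma.

rng = np.random.default_rng(0)
mu, sigma, n = 0.5, 2.0, 400_000
z = mu + sigma * rng.standard_normal(n)

f = lambda z: np.sin(z)
df = lambda z: np.cos(z)

grad_mu = np.mean(df(z) * 1.0)                    # Eqn. 4 with v_mu
grad_sigma = np.mean(df(z) * (z - mu) / sigma)    # Eqn. 4 with v_sigma

# E[sin(z)] = sin(mu) exp(-sigma^2/2) for z ~ N(mu, sigma^2)
assert abs(grad_mu - np.cos(mu) * np.exp(-sigma ** 2 / 2)) < 0.05
assert abs(grad_sigma + sigma * np.sin(mu) * np.exp(-sigma ** 2 / 2)) < 0.05
```

Note that in this univariate case the RT solution is essentially unique, whereas in the multivariate case Eqn. 3 admits many other solutions.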

## 3 Adaptive Velocity Fields

Whenever we have a solution to Eqn. 3 we can form an unbiased gradient estimator via Eqn. 4. An intriguing possibility arises if we can construct a parameterized family of solutions $v^\theta_\lambda$: as we take steps in $\theta$-space we can simultaneously take steps in $\lambda$-space, thus adaptively choosing the form the gradient estimator takes. This can be understood as an instance of an adaptive control variate (ross). Indeed consider the solution $v^\theta_\lambda$ as well as a fixed reference solution $v^\theta_0$. Then we can write

$$\nabla_\theta \mathcal{L} = \mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot v^\theta_\lambda\right] = \mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot v^\theta_0\right] + \mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot \left(v^\theta_\lambda - v^\theta_0\right)\right]$$

The final expectation is identically equal to zero, so its integrand is a control variate. If we choose $\lambda$ appropriately we can reduce the variance of our gradient estimator. In order to specify a complete algorithm we need to answer two questions: i) how should we adapt $\lambda$?; and ii) what is an appropriate family of solutions $v^\theta_\lambda$? We now address each of these in turn.

### 3.1 Adapting $\lambda$

Whenever we compute a noisy gradient estimate via Eqn. 4 we can use the same sample(s) to form a noisy estimate of the gradient variance, which is given by

$$\mathrm{Var}\left(\frac{\partial\mathcal{L}}{\partial\theta}\right) = \mathbb{E}_{q_\theta(z)}\left[\left(\nabla_z f \cdot v^\theta_\lambda\right)^2\right] - \left(\mathbb{E}_{q_\theta(z)}\left[\nabla_z f \cdot v^\theta_\lambda\right]\right)^2 \qquad (7)$$

In direct analogy to the adaptation procedure used in (for example) (ruiz2016overdispersed, ), we can adapt $\lambda$ by following noisy gradient estimates of this variance. (In the general case where the parameter vector has multiple components $\theta_i$, we compute a noisy gradient estimate of the sum of the various scalar terms $\mathrm{Var}(\partial\mathcal{L}/\partial\theta_i)$.) Since the first moment $\mathbb{E}_{q_\theta(z)}[\nabla_z f \cdot v^\theta_\lambda]$ does not depend on $\lambda$, only the second moment in Eqn. 7 contributes:

$$\nabla_\lambda \mathrm{Var}\left(\frac{\partial\mathcal{L}}{\partial\theta}\right) = \mathbb{E}_{q_\theta(z)}\left[\nabla_\lambda \left(\nabla_z f \cdot v^\theta_\lambda\right)^2\right] \qquad (8)$$

In this way $\lambda$ can be adapted to reduce the gradient variance during the course of optimization.
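The adaptive control variate can be illustrated end-to-end in 2D using the rotation null solutions introduced in Sec. 3.3. In the sketch below (all names, parameters, and the test function are illustrative), $q = \mathcal{N}(\mu, LL^T)$, the reference RT solution for $\partial/\partial\mu_1$ is $v_0 = e_1$, and $v_\lambda = e_1 + \lambda\, L A L^{-1}(z - \mu)$ with antisymmetric $A$ is also a solution of Eqn. 3. For a family linear in $\lambda$ the variance-minimizing $\lambda$ is available in closed form; in general one follows noisy gradients of the variance as in Eqn. 8.

```python
import numpy as np

# Adaptive velocity field sketch: reference estimator g0 plus a
# mean-zero null-solution term gn, with lambda chosen to minimize
# variance (closed form for this linear family).

rng = np.random.default_rng(0)
mu = np.array([0.5, 1.0])
L = np.array([[1.0, 0.0], [0.8, 0.6]])
n = 200_000
eps = rng.standard_normal((n, 2))
z = mu + eps @ L.T                               # z ~ N(mu, L L^T)

grad_f = np.stack([z[:, 1], z[:, 0]], axis=1)    # f(z) = z_1 z_2

A = np.array([[0.0, 1.0], [-1.0, 0.0]])          # antisymmetric generator
null = eps @ (L @ A).T                           # L A L^{-1}(z - mu) = L A eps

g0 = grad_f[:, 0] * 1.0                          # reference: v_0 = e_1
gn = np.sum(grad_f * null, axis=1)               # null-solution contribution

lam = -np.cov(g0, gn)[0, 1] / np.var(gn)         # closed-form optimal lambda
g_avf = g0 + lam * gn

# both estimators are unbiased for d/dmu_1 E[z_1 z_2] = mu_2
assert abs(np.mean(g_avf) - mu[1]) < 0.05
assert np.var(g_avf) < np.var(g0)
```

The adapted estimator has the same mean as the reference but strictly smaller variance whenever the null term is correlated with the reference estimator.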

### 3.2 Parameterizing velocity fields

Assuming we have obtained (at least) one solution $v^\theta_0$ to the transport equation Eqn. 3, parameterizing a family of solutions is in principle easy. We simply solve the null equation, $\nabla_z \cdot (q_\theta \tilde{v}^\theta) = 0$, which is obtained from the transport equation by dropping the source term $\frac{\partial q_\theta}{\partial\theta}$. Then for a given family of solutions to the null equation, $\tilde{v}^\theta_\lambda$, we get a solution to the transport equation by simple addition, i.e. $v^\theta_\lambda = v^\theta_0 + \tilde{v}^\theta_\lambda$. The problem is that solving the null equation is almost too easy, since any divergence-free vector field immediately yields a solution:

$$\nabla_z \cdot w_\lambda = 0 \quad\text{and}\quad \tilde{v}^\theta_\lambda = \frac{w_\lambda}{q_\theta} \quad\Longrightarrow\quad \nabla_z \cdot \left(q_\theta \tilde{v}^\theta_\lambda\right) = 0 \qquad (9)$$

While parameterizing families of divergence-free vector fields is an intriguing option, it might be preferable to find simpler families of solutions, if only to limit the cost of each gradient step. To find simpler null solutions, however, we need to focus on a particular family of distributions $q_\theta(z)$.

### 3.3 Null Solutions for the Multivariate Normal Distribution

Consider the multivariate Normal distribution $\mathcal{N}(z | \mu, LL^T)$ in $D$ dimensions, parameterized via a mean $\mu$ and Cholesky factor $L$. We would like to compute derivatives w.r.t. $\mu$ and $L$. While the reparameterization trick is applicable in this case, we can potentially get lower variance gradients by using Adaptive Velocity Fields. As we show in the supplementary materials, a simple solution to the corresponding null transport equation is given by

$$\tilde{v}^{L_{ab}}_A = L\, A_{ab}\, L^{-1} (z - \mu) \qquad (10)$$

where $\{A_{ab}\}$ is an arbitrary collection of antisymmetric matrices (one for each component $L_{ab}$ of the Cholesky factor). This is just a linear vector field and so it is easy to manipulate and (relatively) cheap to compute. Geometrically, for each entry $L_{ab}$ of $L$, this null solution corresponds to an infinitesimal rotation in the whitened coordinates $\epsilon = L^{-1}(z - \mu)$. Note that this null solution is added to the reference solution for $L_{ab}$, which is not equivalent to rotating that reference solution.
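The null property of Eqn. 10 can be checked directly with finite differences: for $q = \mathcal{N}(\mu, LL^T)$ and antisymmetric $A$, the field $\tilde{v}(z) = L A L^{-1}(z - \mu)$ satisfies $\nabla_z \cdot (q \tilde{v}) = 0$ pointwise. The values of $\mu$, $L$, $A$, and the evaluation point below are illustrative:

```python
import numpy as np

# Finite-difference check that div(q v~) = 0 for the null solution of
# Eqn. 10 in D = 2.

def q(z, mu, Sigma_inv, norm):
    d = z - mu
    return norm * np.exp(-0.5 * d @ Sigma_inv @ d)

mu = np.array([0.5, -0.3])
L = np.array([[1.2, 0.0], [0.4, 0.9]])
Sigma_inv = np.linalg.inv(L @ L.T)
norm = 1.0 / (2 * np.pi * np.abs(np.linalg.det(L)))
A = np.array([[0.0, 0.7], [-0.7, 0.0]])          # antisymmetric
M = L @ A @ np.linalg.inv(L)

def qv(z):
    """The flux q(z) * v~(z) whose divergence should vanish."""
    return q(z, mu, Sigma_inv, norm) * (M @ (z - mu))

z0, h = np.array([0.3, 0.8]), 1e-5
div = sum(
    (qv(z0 + h * e)[i] - qv(z0 - h * e)[i]) / (2 * h)
    for i, e in enumerate(np.eye(2))
)
assert abs(div) < 1e-8
```

The divergence vanishes because $\mathrm{tr}(LAL^{-1}) = \mathrm{tr}(A) = 0$ and the antisymmetry of $A$ kills the remaining quadratic form in whitened coordinates.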

Since $ab$ runs over pairs of indices and each $A_{ab}$ has $D(D-1)/2$ free parameters, this space of solutions is quite large. Since we will be adapting $\lambda$ via noisy gradient estimates, and because we would like to limit the computational cost, we choose to parameterize a smaller subspace of solutions. Empirically we find that the following parameterization works well:

$$\left(A_{ab}\right)_{jk} = \sum_{\ell=1}^{M} B_{\ell a} C_{\ell b} \left(\delta_{aj}\delta_{bk} - \delta_{ak}\delta_{bj}\right) \qquad (11)$$

Here each matrix $B$ and $C$ is an arbitrary $M \times D$ matrix, and the hyperparameter $M$ allows us to trade off the flexibility of our family of null solutions against computational cost (typically $M \ll D$). We explore the performance of the resulting Adaptive Velocity Field (AVF) gradient estimator in Sec. 6.1.
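Reading Eqn. 11 literally, each generator $A_{ab}$ is antisymmetric by construction, with its magnitude controlled by the low-rank product of $B$ and $C$. A small sketch (random illustrative values; in practice $M \ll D$) confirming the antisymmetry:

```python
import numpy as np

# Build the generators of Eqn. 11 and verify they are antisymmetric.

rng = np.random.default_rng(0)
D, M = 6, 2
B = rng.standard_normal((M, D))
C = rng.standard_normal((M, D))

def A(a, b):
    """(A_ab)_{jk} = sum_l B_la C_lb (delta_aj delta_bk - delta_ak delta_bj)."""
    coeff = np.dot(B[:, a], C[:, b])
    out = np.zeros((D, D))
    out[a, b] += coeff
    out[b, a] -= coeff
    return out

for a in range(D):
    for b in range(D):
        assert np.allclose(A(a, b), -A(a, b).T)   # antisymmetric for all (a, b)
```

Antisymmetry is exactly the condition needed for Eqn. 10 to be a null solution, so any choice of $B$ and $C$ yields a valid family.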

## 4 Pathwise Derivatives for Multivariate Mixture Distributions

Consider a mixture of $K$ multivariate distributions in $D$ dimensions:

$$q_\theta(z) = \sum_{j=1}^{K} \pi_j\, q_{\theta_j}(z) \qquad (12)$$

In order to compute pathwise derivatives of $q_\theta(z)$ we need to solve the transport equation w.r.t. the mixture weights $\pi_j$ as well as the component parameters $\theta_j$. For the derivatives w.r.t. $\theta_j$ we can just repurpose the velocity fields of the individual components. That is, if $v^{\theta_j}_{\rm single}$ is a solution of the single-component transport equation, i.e. $\frac{\partial}{\partial\theta_j} q_{\theta_j} + \nabla_z \cdot (q_{\theta_j} v^{\theta_j}_{\rm single}) = 0$, then

$$v^{\theta_j} = \frac{\pi_j\, q_{\theta_j}}{q_\theta}\, v^{\theta_j}_{\rm single} \qquad (13)$$

is a solution of the multi-component transport equation; see the supplementary materials for the (very short) derivation. (We note that the form of Eqn. 13 implies that we can easily introduce adaptive velocity fields for each individual component, although the cost of doing so may be prohibitively high for a large number of components.) For example we can readily compute Eqn. 13 if each component distribution is reparameterizable. Note that a typical gradient estimator for, say, a mixture of Normal distributions first samples the discrete component $j$ and then samples $z$ conditioned on $j$. This results in a pathwise derivative that only depends on the parameters of component $j$. By contrast the pathwise gradient computed via Eqn. 13 yields a gradient for all component parameters for every sample $z$. We explore this difference experimentally in Sec. 6.2.2.
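Eqn. 13 is straightforward to check numerically. In the 1D sketch below (illustrative parameters), the single-component RT solution for a mean $\mu_1$ is $v_{\rm single} = 1$, so the mixture velocity field is the responsibility-weighted factor $\pi_1 q_1(z)/q(z)$; note that every sample contributes a gradient for $\mu_1$, whether or not it was drawn from component 1:

```python
import numpy as np

# Pathwise gradient w.r.t. mu_1 for a two-component 1D Normal mixture
# via Eqn. 13: v = pi_1 q_1(z) / q(z) * v_single, with v_single = 1.

rng = np.random.default_rng(0)
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 2.0])
sd = np.array([1.0, 0.5])

n = 400_000
k = rng.choice(2, size=n, p=pi)                  # sample the component...
z = mu[k] + sd[k] * rng.standard_normal(n)       # ...then z | k

def norm_pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

q1 = norm_pdf(z, mu[0], sd[0])
q = pi[0] * q1 + pi[1] * norm_pdf(z, mu[1], sd[1])

df = 2.0 * z                                     # f(z) = z^2
grad_mu1 = np.mean(df * pi[0] * q1 / q)          # Eqn. 4 with Eqn. 13

exact = pi[0] * 2.0 * mu[0]                      # d/dmu_1 sum_j pi_j (mu_j^2 + sd_j^2)
assert abs(grad_mu1 - exact) < 0.05
```

The same recipe applies per-dimension to diagonal-covariance components in higher dimensions.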

### 4.1 Mixture weight derivatives

Unfortunately, solving the transport equation for the mixture weights is in general much more difficult, since the desired velocity fields need to coordinate mass transport among all components. (We expect this to be particularly difficult if the component distributions belong to different families, e.g. a mixture of Normal distributions and Wishart distributions.) We now describe a formal construction for solving the transport equation for the mixture weights. For each pair of component distributions $(j, k)$, consider the transport equation

$$q_{\theta_j} - q_{\theta_k} + \nabla_z \cdot \left(q_\theta\, \tilde{v}^{jk}\right) = 0 \qquad (14)$$

Intuitively, the velocity field $\tilde{v}^{jk}$ moves mass between components $j$ and $k$. We can superimpose these pairwise solutions to form

$$v^{\ell_j} = \pi_j \sum_{k \neq j} \pi_k\, \tilde{v}^{jk} \qquad (15)$$

As we show in the supplementary materials, the velocity field $v^{\ell_j}$ yields an estimator for the derivative w.r.t. the softmax logit $\ell_j$ that corresponds to component $j$, i.e. the derivative with respect to $\ell_j$ where $\pi_j = e^{\ell_j} / \sum_k e^{\ell_k}$. Thus provided we can find solutions to Eqn. 14 for each pair of components $(j, k)$, we can form a pathwise gradient estimator for the mixture weights. We now consider two particular situations where this recipe can be carried out. We leave a discussion of other solutions, including a solution for a mixture of Normal distributions with arbitrary diagonal covariances, to the supplementary materials.
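In one dimension the pairwise construction can be carried out in closed form, which makes for a compact check of Eqns. 14-15: integrating Eqn. 14 once gives $q \tilde{v}^{jk} = F_k - F_j$ in terms of the component CDFs. The sketch below (illustrative parameters) compares the resulting logit gradient against the analytic answer $\pi_j\big(\mathbb{E}_{q_j}[f] - \sum_i \pi_i \mathbb{E}_{q_i}[f]\big)$ for $f(z) = z^2$:

```python
import numpy as np
from math import erf

# Pairwise logit velocity field in 1D: q v~jk = F_k - F_j, so
# v_logit_j = pi_j sum_{k != j} pi_k (F_k - F_j) / q   (Eqn. 15).

rng = np.random.default_rng(0)
pi = np.array([0.4, 0.6])
mu = np.array([-1.0, 1.5])
sd = np.array([0.8, 0.5])

n = 200_000
k = rng.choice(2, size=n, p=pi)
z = mu[k] + sd[k] * rng.standard_normal(n)

def pdf(z, m, s):
    return np.exp(-0.5 * ((z - m) / s) ** 2) / (s * np.sqrt(2 * np.pi))

def cdf(z, m, s):
    return np.array([0.5 * (1 + erf((x - m) / (s * np.sqrt(2)))) for x in z])

q = pi[0] * pdf(z, mu[0], sd[0]) + pi[1] * pdf(z, mu[1], sd[1])
v_logit1 = pi[0] * pi[1] * (cdf(z, mu[1], sd[1]) - cdf(z, mu[0], sd[0])) / q

grad = np.mean(2.0 * z * v_logit1)               # Eqn. 4 with f(z) = z^2

m2 = mu ** 2 + sd ** 2                           # per-component E[z^2]
exact = pi[0] * (m2[0] - np.dot(pi, m2))         # d/dlogit_1 E[z^2]
assert abs(grad - exact) < 0.05
```

Unbiasedness follows by integrating by parts: $\int f'(F_k - F_j)\,dz = \mathbb{E}_{q_j}[f] - \mathbb{E}_{q_k}[f]$ for sufficiently well-behaved $f$. In higher dimensions no such one-line integral exists, which is why the constructions of Secs. 4.2 and 4.3 are needed.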

### 4.2 Mixture of Multivariate Normals with Shared Diagonal Covariance

Here each component distribution is given by $\mathcal{N}(z | \mu_j, \mathrm{diag}(\sigma^2))$, i.e. each component has its own mean vector $\mu_j$ but all components share the same (diagonal) square root covariance $\mathrm{diag}(\sigma)$. We find that a solution to Eqn. 14 is given by

$$\tilde{v}^{jk} = \frac{\left(\Phi(\tilde{z}^{jk}_\parallel - \tilde{\mu}^{jk}_\parallel) - \Phi(\tilde{z}^{jk}_\parallel + \tilde{\mu}^{kj}_\parallel)\right)\phi\!\left(\|\tilde{z}^{jk}_\perp\|_2\right)}{(2\pi)^{D/2-1}\, q_\theta \prod_{i=1}^D \sigma_i}\, \mathrm{diag}(\sigma)\, \hat{\mu}^{jk} \quad\text{with}\quad \tilde{z} \equiv z \odot \sigma^{-1} \qquad (16)$$

where the quantities $\tilde{z}^{jk}_\parallel$, $\tilde{z}^{jk}_\perp$, $\tilde{\mu}^{jk}_\parallel$, and $\hat{\mu}^{jk}$ are defined in the supplementary materials. Here $\Phi$ is the CDF of the unit Normal distribution and $\phi$ is the corresponding probability density. Geometrically, the velocity field in Eqn. 16 moves mass along lines parallel to $\mathrm{diag}(\sigma)\, \hat{\mu}^{jk}$.

### 4.3 Zero Mean Discrete Gaussian Scale Mixture

Here each component distribution is given by $\mathcal{N}(z | 0, \lambda_j^2\, \mathrm{diag}(\sigma^2))$, where each $\lambda_j$ is a positive scalar. Defining $\tilde{z} \equiv z \odot \sigma^{-1}$ and making use of radial coordinates with $r = \|\tilde{z}\|_2$, we find that a solution of the form in Eqn. 15 reduces to

$$v^{\ell_j} = \pi_j\, \mathrm{diag}(\sigma)\left(\tilde{v}^j - \sum_k \pi_k\, \tilde{v}^k\right) \qquad (17)$$

where

$$\tilde{v}^j = \frac{\tilde{\Phi}(r/\lambda_j)}{q_\theta\, \lambda_j^{D-1} \prod_{i=1}^D \sigma_i}\, \hat{r} \quad\text{and}\quad \tilde{\Phi}(z) = \frac{z^{1-D}}{(2\pi)^{D/2}} \int_z^\infty \tilde{z}^{D-1} e^{-\tilde{z}^2/2}\, d\tilde{z} \qquad (18)$$

The 'radial CDF' $\tilde{\Phi}$ in Eqn. 18 can be computed analytically; see the supplementary materials. Note that computing Eqn. 17 is $\mathcal{O}(K)$, in contrast to Eqn. 16, which leads to an $\mathcal{O}(K^2)$ estimator for the full set of mixture weight gradients.

## 5 Related Work

After this manuscript was completed, we became aware of reference figurnov2018implicit (), which has some overlap with this work. In particular, figurnov2018implicit () describes an interesting recipe for computing pathwise gradients that can be applied to mixture distributions. A nice feature of this recipe (also true of (graves2016stochastic, )) is that it can be applied to mixtures of Normal distributions with arbitrary covariances. A disadvantage is that it can be expensive, even for a mixture of Normal distributions with diagonal covariances, whereas the gradient estimators constructed in Sec. 4 by solving the transport equation remain comparatively cheap in that setting. It is also unclear how the recipe in figurnov2018implicit () performs empirically, as the authors do not conduct any experiments for the mixture case.

## 6 Experiments

We conduct a series of experiments using synthetic and real world data to validate the performance of the gradient estimators introduced in Sec. 3 and Sec. 4. Throughout we use single-sample estimators. See the supplementary materials for details on the experimental setups. The SGVI experiments in this section were implemented in the Pyro probabilistic programming language (http://pyro.ai).

### 6.1 Adaptive Velocity Fields

#### 6.1.1 Synthetic test functions

In Fig. 1(a) we use synthetic test functions to illustrate the amount of variance reduction that can be achieved with AVF gradients for the multivariate Normal distribution, as compared to the reparameterization trick and the OMT gradient estimator from (beyond, ). The results are qualitatively similar across dimensions.

#### 6.1.2 Gaussian Process Regression

We investigate the performance of AVF gradients for the multivariate Normal distribution in the context of a Gaussian Process regression task reproduced from (beyond, ). We model the Mauna Loa CO₂ data from (keeling2004atmospheric, ) considered in (rasmussen2004gaussian, ). We fit the GP using a single-sample Monte Carlo ELBO gradient estimator and all data points. Both the OMT and AVF gradient estimators achieve higher ELBOs than the RT estimator (see Fig. 2); however, the OMT estimator does so at substantially increased computational cost (1.9x), while the AVF estimator requires only 6% (11%) more time per iteration for the smaller (larger) of the two values of $M$ we consider. By iteration 250 the AVF gradient estimator has attained the same ELBO that the RT estimator attains at iteration 500.

### 6.2 Mixture Distributions

#### 6.2.1 Synthetic test function

In Fig. 1(b) we depict the variance reduction of pathwise gradient estimators for various mixture distributions compared to the corresponding score function estimator. We find that the typical variance reduction is substantial. The results are qualitatively similar for different numbers of components $K$. See the supplementary materials for details on the different families of mixture distributions.

#### 6.2.2 Baseball Experiment

We consider a model for repeated binary trial data (baseball players at bat) using the data in (efron1975data, ) and the modeling setup in (stanmanual, ) with partial pooling. The model has two global latent variables and 18 local latent variables, so that the posterior is 20-dimensional. While mean field SGVI gives reasonable results for this model, it is not able to capture the detailed structure of the exact posterior as computed with the NUTS HMC implementation in Stan (hoffman2014no, ; carpenter2017stan, ). To go beyond mean field we consider variational distributions that are mixtures of diagonal Normal distributions (cf. the experiment in (miller2016variational, )). Adding more components gives a better approximation to the exact posterior; see Fig. 3(a). Furthermore, using pathwise derivatives in the ELBO gradient estimator leads to faster convergence; see Fig. 3(b). Here the hybrid estimator uses score function gradients for the mixture logits but implements Eqn. 13 for gradients w.r.t. the component parameters. Since there is significant overlap among the mixture components, Eqn. 13 leads to substantial variance reduction.

#### 6.2.3 Continuous state space model

We consider a non-linear continuous state space model with observations $x_t$ and latent random variables $z_t$ at each time step (both two dimensional). The likelihood is symmetric under a sign flip of the latent state, so the posterior over $z_{1:T}$ is highly multi-modal. We use SGVI to approximate the posterior. To better capture the multi-modality we use a variational family that is a product of two-component mixture distributions, one at each time step. As can be seen from Fig. 4, the variational family makes use of the mixture distributions at its disposal to model the multi-modal structure of the posterior, something that a variational family with a unimodal distribution at each time step struggles to do. In the next section we explore the use of mixture distributions in a much richer time series setup.

#### 6.2.4 Deep Markov Model

We consider a variant of the Deep Markov Model (DMM) setup introduced in (krishnan2017structured, ) on a polyphonic music dataset. We fix the form of the model and vary the variational family and the gradient estimator used. We consider a variational family which includes a mixture distribution at each time step; in particular each factor in the variational distribution is a mixture of diagonal Normals. We compare the performance of the pathwise gradient estimator introduced in Sec. 4 to two variants of the Gumbel Softmax gradient estimator (jang2016categorical, ; maddison2016concrete, ); see Table 1. Note that both Gumbel Softmax estimators are biased, while the pathwise estimator is unbiased. We find that the Gumbel Softmax estimators, which induce substantially higher gradient variance for the mixture logits as compared to the pathwise estimator, are unable to achieve competitive test ELBOs. (We also found the Gumbel Softmax estimators to be less numerically stable, which prevented us from using more aggressive KL annealing schedules; this may have contributed to the performance gap.)