GO Gradient for Expectation-Based Objectives

GO Gradient for Expectation-Based

Yulai Cong   Miaoyun Zhao   Ke Bai   Lawrence Carin
Department of Electrical and Computer Engineering, Duke University
Correspondence to: Yulai Cong <yulaicong@gmail.com>, Miaoyun Zhao <miaoyun9zhao@gmail.com>.

Within many machine learning algorithms, a fundamental problem concerns efficient calculation of an unbiased gradient wrt parameters for expectation-based objectives . Most existing methods either () suffer from high variance, seeking help from (often) complicated variance-reduction techniques; or () they only apply to reparameterizable continuous random variables and employ a reparameterization trick. To address these limitations, we propose a General and One-sample (GO) gradient that () applies to many distributions associated with non-reparameterizable continuous or discrete random variables, and () has the same low-variance as the reparameterization trick. We find that the GO gradient often works well in practice based on only one Monte Carlo sample (although one can of course use more samples if desired). Alongside the GO gradient, we develop a means of propagating the chain rule through distributions, yielding statistical back-propagation, coupling neural networks to common random variables.

GO Gradient for Expectation-Based

Yulai Congthanks: Correspondence to: Yulai Cong <yulaicong@gmail.com>, Miaoyun Zhao <miaoyun9zhao@gmail.com>.   Miaoyun Zhao   Ke Bai   Lawrence Carin
Department of Electrical and Computer Engineering, Duke University

1 Introduction

Neural networks, typically trained using back-propagation for parameter optimization, have recently demonstrated significant success across a wide range of applications. There has been interest in coupling neural networks with random variables, so as to embrace greater descriptive capacity. Recent examples of this include black-box variational inference (BBVI) (Kingma & Welling, 2014; Rezende et al., 2014; Ranganath et al., 2014; Hernández-Lobato et al., 2016; Ranganath et al., 2016b; Li & Turner, 2016; Ranganath et al., 2016a; Zhang et al., 2018) and generative adversarial networks (GANs) (Goodfellow et al., 2014; Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018). Unfortunately, efficiently backpropagating gradients through general distributions (random variables) remains a bottleneck. Most current methodology focuses on distributions with continuous random variables, for which the reparameterization trick may be readily applied (Kingma & Welling, 2014; Grathwohl et al., 2017).

As an example, the aforementioned bottleneck greatly constrains the applicability of BBVI, by limiting variational approximations to reparameterizable distributions. This limitation excludes discrete random variables and many types of continuous ones. From the perspective of GAN, the need to employ reparameterization has constrained most applications to continuous observations. There are many forms of data that are more-naturally discrete.

The fundamental problem associated with the aforementioned challenges is the need to efficiently calculate an unbiased low-variance gradient wrt parameters for an expectation objective of the form 111In this paper, we consider expectation objectives meeting basic assumptions: () is differentiable wrt ; () is differentiable for continuous ; and () for discrete . For simplicity of the main paper, those assumptions are implicitly made, as well as fundamental rules like Leibniz’s rule. . We are interested in general distributions , for which the components of may be either continuous or discrete. Typically the components of have a hierarchical structure, and a subset of the components of play a role in evaluating .

Unfortunately, classical methods for estimating gradients of wrt have limitations. The REINFORCE gradient (Williams, 1992), although generally applicable (, for continuous and discrete random variables), exhibits high variance with Monte Carlo (MC) estimation of the expectation, forcing one to apply additional variance-reduction techniques. The reparameterization trick (Rep) (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) works well, with as few as only one MC sample, but it is limited to continuous reparameterizable . Many efforts have been devoted to improving these two formulations, as detailed in Section 6. However, none of these methods is characterized by generalization (applicable to general distributions) and efficiency (working well with as few as one MC sample).

The key contributions of this work are based on the recognition that REINFORCE and Rep are seeking to solve the same objective, but in practice Rep yields lower-variance estimations, albeit for a narrower class of distributions. Recent work (Ranganath et al., 2016b) has made a connection between REINFORCE and Rep, recognizing that the former estimates a term the latter evaluates analytically. The high variance by which REINFORCE approximates this term manifests high variance in the gradient estimation. Extending these ideas, we make the following main contributions. () We propose a new General and One-sample (GO) gradient in Section 3, that principally generalizes Rep to many non-reparameterizable distributions and justifies two recent methods (Figurnov et al., 2018; Jankowiak & Obermeyer, 2018); the “One sample” motivating the name GO is meant to highlight the low variance of the proposed method, although of course one may use more than one sample if desired. () We find that the core of the GO gradient is something we term a variable-nabla, which can be interpreted as the gradient of a random variable wrt a parameter. () Utilizing variable-nablas to propagate the chain rule through distributions, we broaden the applicability of the GO gradient in Sections 4-5 and present statistical back-propagation, a statistical generalization of classic back-propagation (Rumelhart & Hinton, 1986). Through this generalization, we may couple neural networks to general random variables, and compute needed gradients with low variance.

2 Background

To motivate this paper, we begin by briefly elucidating common machine learning problems for which there is a need to efficiently estimate gradients of for functions of the form . Assume access to data samples , drawn i.i.d. from the true (and unknown) underlying distribution . We seek to learn a model to approximate . A classic approach to such learning is to maximize the expected log likelihood , perhaps with an added regularization term on . Expectation is approximated via the available data samples, as .

It is often convenient to employ a model with latent variables , , , with prior on . The integral wrt is typically intractable, motivating introduction of the approximate posterior , with parameters . The well-known evidence lower bound (ELBO) (Jordan et al., 1999; Bishop, 2006) is defined as


where is the true posterior, and represents the Kullback-Leibler divergence. Variational learning seeks .

While computation of the ELBO has been considered for many years, a problem introduced recently concerns adversarial learning of , or, more precisely, learning a model that allows one to efficiently and accurately draw samples that are similar to . With generative adversarial networks (GANs) (Goodfellow et al., 2014), one seeks to solve


where is a discriminator with parameters , quantifying the probability was drawn from , with representing the probability that it was drawn from . There have been many recent extensions of GAN (Radford et al., 2015; Zhao et al., 2016; Arjovsky et al., 2017; Li et al., 2017; Gan et al., 2017; Li et al., 2018), but the basic setup in (3) holds for most.

To optimize (1) and (3), the most challenging gradients that must be computed are of the form ; for (1) and , while for (3) and . The need to evaluate expressions like arises in many other machine learning problems, and consequently it has generated much prior attention.

Evaluation of the gradient of the expectation is simplified if the gradient can be moved inside the expectation. REINFORCE (Williams, 1992) is based on the identity


While simple in concept, this estimate is known to have high variance when the expectation is approximated (as needed in practice) by a finite number of samples.

An approach (Salimans et al., 2013; Kingma & Welling, 2014; Rezende et al., 2014) that has attracted recent attention is termed the reparameterization trick, applicable when can be reparametrized as , with , where is a differentiable transformation and is a simple distribution that may be readily sampled. To keep consistency with the literature, we call a distribution reparameterizable if and only if those conditions are satisfied222A Bernoulli random variable is identical to with . But is called non-reparameterizable because the transformation is not differentiable.. In this case we have


This gradient, termed Rep, is typically characterized by relatively low variance, when approximating with a small number of samples . This approach has been widely employed for computation of the ELBO and within GAN, but it limits one to models that satisfy the assumptions of Rep.

3 GO Gradient

The reparameterization trick (Rep) is limited to reparameterizable random variables with continuous components. There are situations for which Rep is not readily applicable, where the components of may be discrete or nonnegative Gamma distributed. We seek to gain insights from the relationship between REINFORCE and Rep, and generalize the types of random variables for which the latter approach may be effected. We term our proposed approach a General and One-sample (GO) gradient. In practice, we find that this approach works well with as few as one sample for evaluating the expectation, and it is applicable to more general settings than Rep.

Recall that Rep was first applied within the context of variational learning (Kingma & Welling, 2014), as in (1). Specifically, it was assumed , omitting explicit dependence on data , for notational convenience; is component of . In Kingma & Welling (2014) corresponded to a Gaussian distribution , with mean and variance . In the following we generalize such that it need not be Gaussian. Applying integration by parts (Ranganath et al., 2016b)


where denotes with excluded, and is the cumulative distribution function (CDF) of . The “0” term is readily proven to be zero for any , with the assumption that doesn’t tend to infinity faster than tending to zero when .

The “Key” term exactly recovers the one-dimensional Rep when reparameterization , exists (Ranganath et al., 2016b). Further, applying in (6) yields REINFORCE. Consequently, it appears that Rep yields low variance by analytically setting to zero the unnecessary but high-variance-injecting “0” term, while in contrast REINFORCE implicitly seeks to numerically implement both terms in (7).

We generalize for discrete , here assuming . It is shown in Appendix A.2 that this framework is also applicable to discrete with a finite alphabet. It may be shown (see Appendix A.2) that


where , and for all .

Theorem 1 (GO Gradient).

For expectation objectives , where satisfies () ; () the corresponding CDF is differentiable wrt parameters ; and () one can calculate , the General and One-sample (GO) gradient is defined as


where specifies with variable-nabla , , and

All proofs are provided in Appendix A, where we also list for a wide selection of possible , for both continuous and discrete . Note for the special case with continuous , GO reduces to Implicit Rep gradients (Figurnov et al., 2018) and pathwise derivatives (Jankowiak & Obermeyer, 2018); in other words, GO provides a principled explanation for their low variance, namely their foundation (implicit differentiation) originates from integration by parts. For high-dimensional discrete , calculating is computationally expensive. Fortunately, for often used in practice special properties hold that can be exploited for efficient parallel computing. Also for discrete with finite support, it is possible that one could analytically evaluate a part of expectations in (9) for lower variance, mimicking the local idea in Titsias & Lázaro-Gredilla (2015); Titsias (2015). Appendix I shows an example illustrating how to handle these two issues in practice.

4 Deep GO Gradient

The GO gradient in Theorem 1 can only handle single-layer mean-field , characterized by an independence assumption on the components of . One may enlarge the descriptive capability of by modeling it as a marginal distribution of a deep model (Ranganath et al., 2016b; Bishop, 2006). Hereafter, we focus on this situation, and begin with a 2-layer model for simple demonstration. Specifically, consider

where , is the leaf variable, and the internal variable is assumed to be continuous. Components of are assumed to be conditionally independent given , but upon marginalizing out this independence is removed.

The objective becomes and via Theorem 1

Lemma 1.

Equation (10) exactly recovers the Rep gradient in (5), if and has reparameterization , for differentiable and easily sampled .

Lemma 1 shows that Rep is a special case of our deep GO gradient in the following Theorem 2. Note neither Implicit Rep gradients nor pathwise derivatives can recover Rep in general, because a neural-network-parameterized may lead to non-trivial CDF .

For the gradient wrt , we first apply Theorem 1, yielding

For continuous internal variable one can apply Theorem 1 again, from which


Now extending the same procedure to deeper models with layers, we generalize the GO gradient in Theorem 2. Random variable is assumed to be the leaf variable of interest, and may be continuous or discrete; latent/internal random variables are assumed continuous (these generalize from above).

Theorem 2 (Deep GO Gradient).

For expectation objectives with being the marginal distribution of

where , all internal variables are continuous, the leaf variable is either continuous or discrete, , and one has access to variable-nablas and , as defined in Theorem 1, the General and One-sample (GO) gradient is defined as


where and for .

Corollary 1.

The deep GO gradient in Theorem 2 exactly recovers back-propagation (Rumelhart & Hinton, 1986) when each element distribution is specified as the Dirac delta function located at the activated value after activation function.

Figure 1 is presented for better understanding. With variable-nablas, one can readily verify that the deep GO gradient in Theorem 2 in expectation obeys the chain rule.

Assume all continuous variables , and abuse notations for better understanding. ,  

Figure 1: Relating back-propagation (Rumelhart & Hinton, 1986) with the deep GO gradient in Theorem 2. () In deterministic deep neural networks, one forward-propagates information using activation functions, like ReLU, to sequentially activate (black solid arrows), and then back-propagates gradients from Loss to each parameter via gradient-flow through (red dashed arrows). () Similarly for the deep GO gradient with 1 MC sample, one forward-propagates information to calculate the expected loss function using distributions as statistical activation functions, and then uses variable-nablas to sequentially back-propagate gradients through random variables to each , as in (12).

5 Statistical Back-Propagation and Hierarchical Variational Inference

Recall the motivating discussion in Section 2, in which we considered generative model and inference model , the former used to model synthesis of the observed data and the latter used for inference of given observed . In recent deep architectures, a hierarchical representation for cumulative latent variables has been considered (Rezende et al., 2014; Ranganath et al., 2015; Zhou et al., 2015; 2016; Ranganath et al., 2016b; Cong et al., 2017; Zhang et al., 2018). As an example, there are models with . When performing inference for such models, it is intuitive to consider first-order Markov chain structure for . The discussion in this section is most relevant for variational inference, for computation of , and consequently we specialize to that notation in the subsequent discussion (we consider representations in terms of , rather than the more general notation employed in Section 4).

Before proceeding, we seek to make clear the distinction between this section and Section 4. In the latter, only the leaf variable appears in ; see (12), with . That is because in Section 4 the underlying model is a marginal distribution of , , , which is relevant to the generators of GANs; see (3), with and . Random variables were marginalized out of to represent . As discussed in Section 4, were added there to enhance the modeling flexibility of . In this section, the deep set of random variables are inherent components of the underlying generative model for , , . Hence, all components of manifested via inference model play a role in . Besides, no specific structure is imposed on and in this section, moving beyond the aforementioned first-order Markov structure. For a practical application, one may employ domain knowledge to design suitable graphical models for and , and then use the following Theorem 3 for training.

Theorem 3 (Statistical Back-Propagation).

For expectation objectives

where denotes continuous internal variables with at least one child variable, represents continuous or discrete leaf variables with no children except , and is constructed as a hierarchical probabilistic graphical model

with each element distribution having accessible variable-nablas as defined in Theorem 1, denotes the parent variables of , the General and One-sample (GO) gradient for is defined as


where denotes the children variables of , and with ,

and is iteratively calculated as until leaf variables where

Statistical back-propagation in Theorem 3 is relevant to hierarchical variational inference (HVI) (Ranganath et al., 2016b; Hoffman & Blei, 2015; Mnih & Gregor, 2014) (see Appendix G), greatly generalizing GO gradients to the inference of directed acyclic probabilistic graphical models. In HVI variational distributions are specified as hierarchical graphical models constructed by neural networks. Using statistical back-propagation, one may rely on GO gradients to perform HVI with low variance, while greatly broadening modeling flexibility.

6 Related Work

There are many methods directed toward low-variance gradients for expectation-based objectives. Attracted by the generalization of REINFORCE, many works try to improve its performance via efficient variance-reduction techniques, like control variants (Mnih & Gregor, 2014; Titsias & Lázaro-Gredilla, 2015; Gu et al., 2015; Mnih & Rezende, 2016; Tucker et al., 2017; Grathwohl et al., 2017) or via data augmentation and permutation techniques (Yin & Zhou, 2018). Most of this research focuses on discrete random variables, likely because Rep (if it exists) works well for continuous random variables but it may not exist for discrete random variables. Other efforts are devoted to continuously relaxing discrete variables, to combine both REINFORCE and Rep for variance reduction (Jang et al., 2016; Maddison et al., 2016; Tucker et al., 2017; Grathwohl et al., 2017).

Inspired by the low variance of Rep, there are methods that try to generalize its scope. The Generalized Rep (GRep) gradient (Ruiz et al., 2016) employs an approximate reparameterization whose transformed distribution weakly depends on the parameters of interest. Rejection sampling variational inference (RSVI) (Naesseth et al., 2016) exploits highly-tuned transformations in mature rejection sampling simulation to better approximate Rep for non-reparameterizable distributions. Compared to the aforementioned methods, the proposed GO gradient, containing Rep as a special case for continuous random variables, applies to both continuous and discrete random variables with the same low-variance as the Rep gradient. Implicit Rep gradients (Figurnov et al., 2018) and pathwise derivatives (Jankowiak & Obermeyer, 2018) are recent low-variance methods that exploit the gradient of the expected function; they are special cases of GO in the single-layer continuous settings.

The idea of gradient backpropagation through random variables has been exploited before. RELAX (Grathwohl et al., 2017), employing neural-network-parametrized control variants to assist REINFORCE for that goal, has a variance potentially as low as the Rep gradient. SCG (Schulman et al., 2015) utilizes the generalizability of REINFORCE to construct widely-applicable stochastic computation graphs. However, REINFORCE is known to have high variance, especially for high-dimensional problems, where the proposed methods are preferable when applicable (Schulman et al., 2015). Stochastic back-propagation (Rezende et al., 2014; Fan et al., 2015), focusing mainly on reparameterizable Gaussian random variables and deep latent Gaussian models, exploits the product rule for an integral to derive gradient backpropagation through several continuous random variables. By comparison, the proposed statistical back-propagation based on the GO gradient is applicable to most distributions for continuous random variables. Further, it also flexibly generalizes to hierarchical probabilistic graphical models with continuous internal variables and continuous/discrete leaf ones.

7 Experiments

We examine the proposed GO gradients and statistical back-propagation with four experiments: () simple one-dimensional (gamma and negative binomial) examples are presented to verify the GO gradient in Theorem 1, corresponding to nonnegative and discrete random variables; () the discrete variational autoencoder experiment from Tucker et al. (2017) and Grathwohl et al. (2017) is reproduced to compare GO with the state-of-the-art variance-reduction methods; () a multinomial GAN, generating discrete observations, is constructed to demonstrate the deep GO gradient in Theorem 2; () hierarchical variational inference (HVI) for two deep non-conjugate Bayesian models is developed to verify statistical back-propagation in Theorem 3. Note the experiments of Figurnov et al. (2018) and Jankowiak & Obermeyer (2018) additionally support our GO in the single-layer continuous settings.

Many mature machine learning frameworks, like TensorFlow (Abadi et al., ) and PyTorch (Paszke et al., 2017), are optimized for implementation of methods like back-propagation. Fortunately, all gradient calculations in the proposed theorems obey the chain rule in expectation, enabling convenient incorporation of the proposed approaches into existing frameworks. Experiments presented below were implemented in TensorFlow or PyTorch with a Titan Xp GPU. Code for all experiments can be found at github.com/YulaiCong/GOgradient.

Notation denotes the gamma distribution with shape and rate , the negative binomial distribution with number of failures and success probability , the Bernoulli distribution with probability , the multinomial distribution with number of trials and event probabilities , the Poisson distribution with rate , and the Dirichlet distribution with concentration parameters .

7.1 Gamma and NB One-dimensional Simple Examples

Figure 2: Gamma (a-c) and NB (d) toy experimental results. (a) The gradient variance of gamma shape versus iterations, with posterior parameters . (b)-(c) The gradient variance of and ELBOs versus iterations respectively, when . (d) The gradient variance of NB versus iterations with . In each iteration, gradient variances are estimated with 20 Monte Carlo samples (each sample corresponds to one gradient estimate), among which the last one is used to update parameters.

We first consider illustrative one-dimensional “toy” problems, to examine the GO gradient for both continuous and discrete random variables. The optimization objective is expressed as

where for continuous we assume for given set , with and ; for discrete we assume for given set , with and . Stochastic gradient ascent with one-sample-estimated gradients is used to optimize the objective, which is equivalent to minimizing .

Figure 2 shows the results (see Appendix H for additional details). For the nonnegative continuous associated with the gamma distribution, we compare our GO gradient with GRep (Ruiz et al., 2016), RSVI (Naesseth et al., 2016), and their modified version using the “sticking” idea (Roeder et al., 2017), denoted as GRep-Stick and RSVI-Stick, respectively. For RSVI and RSVI-Stick, the shape augmentation parameter is set as by default. The only difference between GRep and GRep-Stick (also RSVI and RSVI-Stick) is the latter does not analytically express the entropy . Figure 2 clearly shows the utility of employing sticking to reduce variance; without it, GRep and RSVI exhibit high variance, that destabilizes the optimization for small gamma shape parameters, as shown in Figures 2 and 2. We adopt the sticking approach hereafter for all the compared methods. Among methods with sticking, GO exhibits the lowest variance in general, as shown in Figures 2 and 2. GO empirically provides more stable learning curves, as shown in Figure 2. For the discrete case corresponding to the NB distribution, GO is compared to REINFORCE (Williams, 1992). To address the concern about comparison with the same number of evaluations of the expected function333In the NB experiments, REINFORCE uses sample and evaluation of the expected function; REINFORCE2 uses sample and evaluations; and GO uses sample and evaluations., another curve of REINFORCE using samples is also added, termed REINFORCE2. It is apparent from Figure 2 that, thanks to analytically removing the “0” terms in (8), the GO gradient has much lower variance, even in this simple one-dimensional case.

7.2 Discrete Variational Autoencoder

To demonstrate the low variance of the proposed GO gradient, we consider the discrete variational autoencoder (VAE) experiment from REBAR (Tucker et al., 2017) and RELAX (Grathwohl et al., 2017), to make a direct comparison with state-of-the-art variance-reduction methods. Since the statistical back-propagation in Theorem 3 cannot handle discrete internal variables, we focus on the single-latent-layer settings (1 layer of Bernoulli random variables), i.e.,

where is the parameters of the prior , means using a neural network to project the latent binary code to the parameters of the likelihood , and is similarly defined for . The objective is given in (1). See Appendix I for more details.

Figure 3: Training curves for the discrete VAE experiment with 1-layer linear model (see Appendix I) on the stochastically binarized MNIST dataset(Salakhutdinov & Murray, 2008). All methods are run with the same learning rate for iterations. The black line represents the best training ELBO of REBAR and RELAX. ELBOs are calculated using all training/validation data.
Dataset Model Training Validation
MNIST Linear 1 layer -112.16 -112.89 -110.21 -114.85 -115.36 -114.27
Nonlinear -96.99 -95.99 -82.26 -112.96 -112.42 -111.48
Omniglot Linear 1 layer -122.19 -122.17 -119.03 -124.81 -124.95 -123.84
Nonlinear -79.51 -80.67 -54.96 -129.00 -128.99 -126.59
Table 1: Best obtained ELBOs for discrete variational autoencoders. Results of REBAR and RELAX are obtained by running the released code555github.com/duvenaud/relax from Grathwohl et al. (2017). All methods are run with the same learning rate for iterations.

Figure 3 (also Figure 9 of Appendix I) shows the training curves versus iteration and running time for the compared methods. Even without any variance-reduction techniques, GO provides better performance, faster convergence rate, and better running efficiency (about ten times faster in achieving the best training ELBO of RERAR/RELAX in this experiment). We believe GO’s better performance originates from: () its inherent low-variance nature; () GO has less parameters compared to REBAR and RELAX (no control variant is adopted for GO); () efficient batch processing methods (see Appendix I) are adopted to benefit from parallel computing. Table 5 presents the best training/validation ELBOs under various experimental settings for the compared methods. GO provides the best performance in all situations. Additional experimental results are given in Appendix I.

Many variance-reduction techniques can be used to further reduce the variance of GO, especially when complicated models are of interest. Compared to RELAX, GO cannot be directly applied when is not computable or where the interested model has discrete internal variables (like multilayer Sigmoid belief networks (Neal, 1992)). For the latter issue, we present in Appendix B.4 a procedure to assist GO (or statistical back-propagation in Theorem 3) in handling discrete internal variables.

7.3 Multinomial GAN

Table 2: Inception scores on quantized MNIST. BGAN’s results are run with the author-released code https://github.com/rdevon/BGAN. bits (states) BGAN MNGAN-GO 1 (1) 8.31 .06 9.10 .06 1 (2) 8.56 .04 8.40 .07 2 (4) 7.76 .04 9.02 .06 3 (8) 7.26 .03 9.26 .07 4 (16) 6.29 .05 9.27 .06 Figure 4: Generated samples from BGAN (top) and MNGAN-GO (bottom) trained on 4-bit quantized MNIST.

To demonstrate the deep GO gradient in Theorem 2, we adopt multinomial leaf variables and construct a new multinomial GAN (denoted as MNGAN-GO) for generating discrete observations with a finite alphabet. The corresponding generator is expressed as

For brevity, we integrate the generator’s parameters into the NN notation, and do not explicitly express them. Details for this example are provided in Appendix J.

We compare MNGAN-GO with the recently proposed boundary-seeking GAN (BGAN) (Hjelm et al., 2018) on 1-bit (1-state, Bernoulli leaf variables ), 1-bit (2-state), 2-bit (4-state), 3-bit (8-state) and 4-bit (16-state) discrete image generation tasks, using quantized MNIST datasets (LeCun et al., 1998). Table 2 presents inception scores (Salimans et al., 2016) of both methods. MNGAN-GO performs better in general. Further, with GO’s assistance, MNGAN-GO shows more potential to benefit from richer information coming from more quantized states. For demonstration, Figure 4 shows the generated samples from the 4-bit experiment, where better image quality and higher diversity are observed for the samples from MNGAN-GO.

7.4 HVI for Deep Exponential Families and Deep Latent Dirichlet Allocation

To demonstrate statistical back-propagation in Theorem 3, we design variational inference nets for two nonconjugate hierarchical Bayesian models, i.e., deep exponential families (DEF) (Ranganath et al., 2015) and deep latent Dirichlet allocation (DLDA) (Zhou et al., 2015; 2016; Cong et al., 2017).


For demonstration, we design the inference nets following the first-order Markov chain construction in Section 5, namely

Further details are provided in Appendix K. One might also wish to design inference nets that have structure beyond the above first-order Markov chain construction, as in Zhang et al. (2018); we do not consider that here, but Theorem 3 is applicable to that case.

Figure 5: ELBOs of HVI for a 128-64 DEF on MNIST. Except for different ways to calculate gradients, all other experimental settings are the same for compared methods, including the sticking idea and one-MC-sample gradient estimate. Figure 6: HVI results for a 128-64-32 DLDA on MNIST. Upperleft is the training ELBOs. The remaining subfigures are learned dictionary atoms from (top-right), (bottom-left), and (bottom-right).

HVI for a 2-layer DEF is first performed, with the ELBO curves shown in Figure 5. GO enables faster and more stable convergence. Figure 6 presents the HVI results for a 3-layer DLDA, for which stable ELBOs are again observed. More importantly, with the GO gradient, one can utilize pure gradient-based methods to efficiently train such complicated nonconjugate models for meaningful dictionaries (see Appendix K for more implementary details).

8 Conclusions

For expectation-based objectives, we propose a General and One-sample (GO) gradient that applies to continuous and discrete random variables. We further generalize the GO gradient to cases for which the underlying model is deep and has a marginal distribution corresponding to the latent variables of interest, and to cases for which the latent variables are hierarchical. The GO-gradient setup is demonstrated to yield the same low-variance estimation as the reparameterization trick, which is only applicable to reparameterizable continuous random variables. Alongside the GO gradient, we constitute a means of propagating the chain rule through distributions. Accordingly, we present statistical back-propagation, to flexibly integrate deep neural networks with general classes of random variables. s


We thank the anonymous reviewers for their useful comments. The research was supported by part by DARPA, DOE, NIH, NSF and ONR. The Titan Xp GPU used was donated by the NVIDIA Corporation. We also wish to thank Chenyang Tao, Liqun Chen, and Chunyuan Li for helpful discussions.


  • (1) M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous systems. URL https://www.tensorflow.org/. Software available from tensorflow.org.
  • Arjovsky et al. (2017) M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. In ICLR, 2017.
  • Bishop (2006) C. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • Cong et al. (2017) Y. Cong, B. Chen, H. Liu, and M. Zhou. Deep latent Dirichlet allocation with topic-layer-adaptive stochastic gradient Riemannian MCMC. In ICML, 2017.
  • Fan et al. (2015) K. Fan, Z. Wang, J. Beck, J. Kwok, and K. Heller. Fast second order stochastic backpropagation for variational inference. In NIPS, pp. 1387–1395, 2015.
  • Figurnov et al. (2018) M. Figurnov, S. Mohamed, and A. Mnih. Implicit reparameterization gradients. arXiv preprint arXiv:1805.08498, 2018.
  • Gan et al. (2017) Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. In NIPS, pp. 5253–5262, 2017.
  • Geddes et al. (1990) K. Geddes, M. Glasser, R. Moore, and T. Scott. Evaluation of classes of definite integrals involving elementary functions via differentiation of special functions. Applicable Algebra in Engineering, Communication and Computing, 1(2):149–165, 1990.
  • Goodfellow et al. (2014) I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
  • Grathwohl et al. (2017) W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud. Backpropagation through the void: Optimizing control variates for black-box gradient estimation. arXiv:1711.00123, 2017.
  • Gu et al. (2015) S. Gu, S. Levine, I. Sutskever, and A. Mnih. MuProp: Unbiased backpropagation for stochastic neural networks. arXiv:1511.05176, 2015.
  • Hernández-Lobato et al. (2016) J. M. Hernández-Lobato, Y. Li, M. Rowland, D. Hernández-Lobato, T. Bui, and R. E. Turner. Black-box -divergence minimization. In ICML, 2016.
  • Hjelm et al. (2018) R. Hjelm, A. Jacob, T. Che, A. Trischler, K. Cho, and Y. Bengio. Boundary seeking GANs. In ICLR, 2018.
  • Hoffman & Blei (2015) M. Hoffman and D. Blei. Stochastic structured variational inference. In AISTATS, pp. 361–369, 2015.
  • Jang et al. (2016) E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. arXiv:1611.01144, 2016.
  • Jankowiak & Obermeyer (2018) M. Jankowiak and F. Obermeyer. Pathwise derivatives beyond the reparameterization trick. arXiv preprint arXiv:1806.01851, 2018.
  • Jordan et al. (1999) M. Jordan, Z. Ghahramani, T. Jaakkola, and L. Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.
  • Kingma & Welling (2014) D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In ICLR, 2014.
  • LeCun et al. (1998) Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • Li et al. (2017) C. Li, H. Liu, C. Chen, Y. Pu, L. Chen, R. Henao, and L. Carin. Alice: Towards understanding adversarial learning for joint distribution matching. In NIPS, pp. 5501–5509, 2017.
  • Li et al. (2018) C. Li, J. Li, G. Wang, and L. Carin. Learning to sample with adversarially learned likelihood-ratio. 2018.
  • Li & Turner (2016) Y. Li and R. E. Turner. Rényi divergence variational inference. In NIPS, pp. 1073–1081, 2016.
  • Maddison et al. (2016) C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. arXiv:1611.00712, 2016.
  • Mnih & Gregor (2014) A. Mnih and K. Gregor. Neural variational inference and learning in belief networks. In ICML, pp. 1791–1799, 2014.
  • Mnih & Rezende (2016) A. Mnih and D. Rezende. Variational inference for monte carlo objectives. In ICML, pp. 2188–2196, 2016.
  • Naesseth et al. (2016) C. A. Naesseth, F. J. R. Ruiz, S. W. Linderman, and D. M. Blei. Rejection sampling variational inference. arXiv:1610.05683, 2016.
  • Neal (1992) R. M. Neal. Connectionist learning of belief networks. Artif. Intell., 56(1):71–113, 1992.
  • Paszke et al. (2017) A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in PyTorch. 2017.
  • Radford et al. (2015) A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • Ranganath et al. (2014) R. Ranganath, S. Gerrish, and D. M. Blei. Black box variational inference. In AISTATS, 2014.
  • Ranganath et al. (2015) R. Ranganath, L. Tang, L. Charlin, and D. M. Blei. Deep exponential families. In AISTATS, 2015.
  • Ranganath et al. (2016a) R. Ranganath, D. Tran, J. Altosaar, and D. Blei. Operator variational inference. In NIPS, pp. 496–504, 2016a.
  • Ranganath et al. (2016b) R. Ranganath, D. Tran, and D. Blei. Hierarchical variational models. In ICML, pp. 324–333, 2016b.
  • Rezende et al. (2014) D. J. Rezende, S. Mohamed, and D. Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
  • Roeder et al. (2017) G. Roeder, Y. Wu, and D. K. Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In NIPS, pp. 6928–6937, 2017.
  • Ruiz et al. (2016) F. J. R. Ruiz, M. K. Titsias, and D. Blei. The generalized reparameterization gradient. In NIPS, pp. 460–468, 2016.
  • Rumelhart & Hinton (1986) D. Rumelhart and G. Hinton. Learning representations by back-propagating errors. Nature, 323(9), 1986.
  • Salakhutdinov & Murray (2008) R. Salakhutdinov and I. Murray. On the quantitative analysis of deep belief networks. In ICML, pp. 872–879. ACM, 2008.
  • Salimans et al. (2013) T. Salimans, D. A. Knowles, et al. Fixed-form variational posterior approximation through stochastic linear regression. Bayesian Analysis, 8(4):837–882, 2013.
  • Salimans et al. (2016) T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NIPS, pp. 2234–2242, 2016.
  • Schulman et al. (2015) J. Schulman, N. Heess, T. Weber, and P. Abbeel. Gradient estimation using stochastic computation graphs. In NIPS, pp. 3528–3536, 2015.
  • Titsias (2015) M. Titsias. Local expectation gradients for doubly stochastic variational inference. arXiv preprint arXiv:1503.01494, 2015.
  • Titsias & Lázaro-Gredilla (2015) M. Titsias and M. Lázaro-Gredilla. Local expectation gradients for black box variational inference. In NIPS, pp. 2638–2646, 2015.
  • Tucker et al. (2017) G. Tucker, A. Mnih, C. Maddison, J. Lawson, and J. Sohl-Dickstein. REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. In NIPS, pp. 2624–2633, 2017.
  • Williams (1992) R. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • Yin & Zhou (2018) M. Yin and M. Zhou. ARM: Augment-REINFORCE-merge gradient for discrete latent variable models. arXiv preprint arXiv:1807.11143, 2018.
  • Zhang et al. (2018) H. Zhang, B. Chen, D. Guo, and M. Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. In ICLR, 2018.
  • Zhao et al. (2016) J. Zhao, M. Mathieu, and Y. LeCun. Energy-based generative adversarial network. In ICLR, 2016.
  • Zhou et al. (2015) M. Zhou, Y. Cong, and B. Chen. The Poisson gamma belief network. In NIPS, pp. 3025–3033, 2015.
  • Zhou et al. (2016) M. Zhou, Y. Cong, and B. Chen. Augmentable gamma belief networks. JMLR, 17(1):5656–5699, 2016.

Appendix A Proof of Theorem 1

We first prove (7) in the main manuscript, followed by its discrete counterpart, i.e., (8) in the main manuscript. Then, it is easy to verify Theorem 1.

a.1 Proof of Equation (7) in the Main Manuscript

Similar proof in one-dimension is also given in the supplemental materials of Ranganath et al. (2016b).

We want to calculate

where denotes with excluded. Without loss of generality, we assume .

Let , and we have

where is the cumulative distribution function (CDF) of . Further define , we then apply integration by parts (or partial integration) to get


With and , it’s straightforward to verify that the first term is always zero for any , thus named the “0” term.

a.2 Proof of Equation (8) in the Main Manuscript

For discrete variables , we have

where and is the size of the alphabet.

To handle the summation of products of two sequences and develop discrete counterpart of (7), we first introduce Abel transformation.
Abel transformation. Given two sequences and , with , we define and for . Accordingly, we have

Substituting , , , and into the above equation, we have

Note the first term equals zero for both finite alphabet, i.e., , and infinite alphabet, i.e., . When , we get (8) in the main manuscript.

With the above proofs for Eqs. (7) and (8), one could straightforwardly verify Theorem 1.