# A Fourier View of REINFORCE

Adeel Pervez
Faculty of Computer Science
GIK Institute, Pakistan
adeel@giki.edu.pk

Preprint. Work in progress.

###### Abstract

We show a connection between the Fourier spectrum of Boolean functions and the REINFORCE gradient estimator for binary latent variable models. We show that REINFORCE estimates (up to a factor) the degree-1 Fourier coefficients of a Boolean function. Using this connection we offer a new perspective on variance reduction in gradient estimation for latent variable models: namely, that variance reduction involves eliminating or reducing Fourier coefficients that do not have degree 1. We then use this connection to develop low-variance unbiased gradient estimators for binary latent variable models such as sigmoid belief networks. The estimator is based upon properties of the noise operator from Boolean Fourier theory and involves a sample-dependent baseline added to the REINFORCE estimator in a way that keeps the estimator unbiased. The baseline can be plugged into existing gradient estimators for further variance reduction.

## 1 Introduction

Gradient-based optimization is the workhorse of contemporary machine learning. For deterministic models, the backpropagation algorithm allows gradients to be computed with respect to the parameters of large neural network models. Recent work has sought to extend this success to models where some of the hidden variables are stochastic. The reparameterization trick [14] has enabled efficient training of models that have continuous stochastic hidden variables by allowing computation of unbiased and low-variance gradients of loss functions. Models with discrete stochastic variables, although attractive from a modeling point of view in terms of their power and interpretability, have proven difficult to optimize using gradient-based optimization. Neither backpropagation nor the reparameterization trick is directly applicable to training discrete variable models. Optimizing discrete variable models using REINFORCE and REINFORCE-like algorithms is difficult due to the high variance of these gradient estimators.

In this paper we use ideas from the theory of harmonic analysis of Boolean functions to provide another perspective on the REINFORCE [4] gradient estimator. Harmonic analysis seeks to analyze Boolean functions using Fourier expansions which express Boolean functions as linear combinations of orthonormal basis functions with coefficients called the Fourier coefficients of the Boolean function. Each Fourier coefficient corresponds to a subset of input coordinates and depends on the product probability distribution imposed on the Boolean cube. We show that REINFORCE estimates singleton Fourier coefficients of a Boolean function. We then introduce the noise operator from harmonic analysis and construct a control variate that can be plugged into existing gradient estimators to further reduce their variance.

The contribution of this paper is two-fold.

1. The first is a reinterpretation of existing variance-reducing gradient estimators in terms of the Fourier coefficients of the objective function seen as a function of the stochastic hidden variables. We show that they can be seen as eliminating or reducing terms with irrelevant Fourier coefficients from the Fourier expansion of the objective function.

2. The second contribution is an unbiased estimator for the gradient of the objective function relative to the parameters of the stochastic binary latent variables. The estimator depends on the properties of the noise operator from Boolean Fourier theory.

We begin with a review of the Fourier analysis of Boolean functions.

## 2 Background

### 2.1 Boolean Fourier Analysis

We reiterate some facts from the analysis of Boolean functions. A comprehensive introduction can be found in [2]. We work with Boolean functions $f:\{-1,1\}^n\to\mathbb{R}$. We assume a product probability distribution $p$ on the Boolean input, with $p_i$ being the probability of the $i$th coordinate being one, $\mu_i$ its mean and $\sigma_i^2$ its variance. Given Boolean functions $f$ and $g$, we define an inner product:

$$\langle f,g\rangle=\mathbb{E}_{p(x)}[f(x)g(x)].$$

We define the $p$-norm of $f$:

$$\|f\|_p=\mathbb{E}\left[|f(x)|^p\right]^{1/p},$$

with the property that $\|f\|_p\le\|f\|_q$ if $p\le q$.

#### 2.1.1 Fourier Expansion

Let $\phi_i(x_i)=\frac{x_i-\mu_i}{\sigma_i}$. For a set $S\subseteq[n]$, define $\phi_S(x)=\prod_{i\in S}\phi_i(x_i)$. The functions $\{\phi_S\}_{S\subseteq[n]}$ are an orthonormal basis for the space of Boolean functions. The expansion of $f$ in this basis

$$f(x)=\sum_{S\subseteq[n]}\hat f(S)\,\phi_S(x)$$

is known as the $p$-biased Fourier expansion of $f$, and the coefficients of this expansion, $\hat f(S)$, are the Fourier coefficients. The cardinality of $S$ is the degree or weight of the coefficient. The Fourier expansion of a Boolean function expresses the function as a multilinear polynomial in the $\phi_i$. Note that the Fourier coefficients depend on both the function and the product probability distribution imposed on the input: changing the input distribution changes the Fourier expansion of the same function.

There is also an inverse expansion for the Fourier coefficients

$$\hat f(S)=\langle f,\phi_S\rangle=\mathbb{E}[f(x)\phi_S(x)].$$

In particular, the degree-0 coefficient is the average of the function under the input distribution: $\hat f(\emptyset)=\mathbb{E}_{p(x)}[f(x)]$.
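As a concrete check of these definitions, the following Python sketch computes $p$-biased Fourier coefficients by exact enumeration on a three-bit cube. It assumes the $\pm 1$ encoding with $\mu_i=2p_i-1$ and $\sigma_i=2\sqrt{p_i(1-p_i)}$; the test function `f`, the probabilities `p` and all helper names are illustrative choices, not taken from the paper's code.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def subsets(n):
    """All subsets of {0, ..., n-1} as sorted tuples."""
    return [S for k in range(n + 1) for S in itertools.combinations(range(n), k)]

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def phi(x, S, p):
    """Basis function phi_S(x) = prod_{i in S} (x_i - mu_i) / sigma_i."""
    value = 1.0
    for i in S:
        mu = 2 * p[i] - 1                         # assumed mean of x_i in {-1, +1}
        sigma = 2 * math.sqrt(p[i] * (1 - p[i]))  # assumed standard deviation
        value *= (x[i] - mu) / sigma
    return value

def coefficient(f, S, p):
    """Fourier coefficient <f, phi_S> = E[f(x) phi_S(x)], by exact enumeration."""
    return sum(prob(x, p) * f(x) * phi(x, S, p) for x in cube(len(p)))

def f(x):
    """Arbitrary illustrative function on three bits."""
    return max(x) + 0.5 * x[0] * x[2]

p = [0.3, 0.6, 0.8]
coeffs = {S: coefficient(f, S, p) for S in subsets(len(p))}

# The expansion reproduces f pointwise, and the empty-set coefficient is E[f].
x0 = (1, -1, 1)
reconstructed = sum(c * phi(x0, S, p) for S, c in coeffs.items())
mean_f = sum(prob(x, p) * f(x) for x in cube(len(p)))
```

Changing `p` changes every entry of `coeffs` even though `f` is fixed, which is the dependence on the input distribution noted above.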

#### 2.1.2 Discrete Derivative

The $i$th discrete derivative of $f$ is the formal derivative of the $p$-biased expansion of $f$ relative to $\phi_i$. This is denoted by $D_if$ and has the Fourier expansion given by

$$D_if=\sum_{S\ni i}\hat f(S)\,\phi_{S\setminus\{i\}}(x).$$

Notice that $D_if$ does not depend on $x_i$.

#### 2.1.3 Noise Operator

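Given $x$ and $\rho\in[0,1]$, we say that $x,x'$ are $\rho$-correlated if $x'$ is generated by independently setting each $x'_i$ to $x_i$ with probability $\rho$ and to a sample from $p_i$ with probability $1-\rho$:

$$x'_i=\begin{cases}x_i & \text{with probability }\rho\\ \text{random sample from }p_i & \text{with probability }1-\rho\end{cases}$$

We also denote this by $x'\sim N_\rho(x)$. We use this to define the noise operator $T_\rho$ acting on $f$ as the expectation over $\rho$-correlated inputs as follows:

$$T_\rho(f)(x)=\mathbb{E}_{x'\sim N_\rho(x)}[f(x')].$$

The noise operator has a special action on the functions $\phi_i$: it multiplies them by $\rho$, i.e.,

$$T_\rho(\phi_i)(x)=\mathbb{E}_{x'\sim N_\rho(x)}[\phi_i(x')]=\rho\,\phi_i(x).$$

Similarly, its action on the basis function $\phi_S$ is to multiply it by $\rho^{|S|}$:

$$T_\rho(\phi_S)(x)=T_\rho\Big(\prod_{i\in S}\phi_i\Big)(x)=\prod_{i\in S}T_\rho(\phi_i)(x)=\rho^{|S|}\phi_S(x).$$

By linearity of expectation it follows that

$$T_\rho(f)(x)=\sum_{S\subseteq[n]}\hat f(S)\,T_\rho(\phi_S)(x)=\sum_{S\subseteq[n]}\rho^{|S|}\hat f(S)\,\phi_S(x).$$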
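The scaling of each coefficient by $\rho^{|S|}$ can be confirmed numerically. The sketch below evaluates $T_\rho f$ exactly by summing over every $\rho$-correlated $x'$ and compares the coefficients of $T_\rho f$ with those of $f$. The $\pm1$ encoding, the example function and the helper names are assumptions made for illustration.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def phi(x, S, p):
    """Basis function phi_S(x) = prod_{i in S} (x_i - mu_i) / sigma_i."""
    value = 1.0
    for i in S:
        mu = 2 * p[i] - 1
        sigma = 2 * math.sqrt(p[i] * (1 - p[i]))
        value *= (x[i] - mu) / sigma
    return value

def coefficient(f, S, p):
    """Fourier coefficient E[f(x) phi_S(x)], by exact enumeration."""
    return sum(prob(x, p) * f(x) * phi(x, S, p) for x in cube(len(p)))

def noise_operator(f, rho, p):
    """Return the function T_rho f, computed exactly (no sampling)."""
    n = len(p)
    def smoothed(x):
        total = 0.0
        for y in cube(n):
            weight = 1.0
            for i in range(n):
                marginal = p[i] if y[i] == 1 else 1 - p[i]
                keep = rho if y[i] == x[i] else 0.0   # coordinate copied from x
                weight *= keep + (1 - rho) * marginal  # or resampled from p_i
            total += weight * f(y)
        return total
    return smoothed

def f(x):
    """Arbitrary illustrative function on three bits."""
    return max(x) + 0.5 * x[0] * x[2]

p = [0.3, 0.6, 0.8]
rho = 0.5
Tf = noise_operator(f, rho, p)
all_subsets = [S for k in range(4) for S in itertools.combinations(range(3), k)]

# Largest deviation between hat{T_rho f}(S) and rho^{|S|} * hat{f}(S).
max_error = max(
    abs(coefficient(Tf, S, p) - rho ** len(S) * coefficient(f, S, p))
    for S in all_subsets
)
```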

## 3 Related Work

Perhaps the best known example of a gradient estimator for (but not limited to) discrete stochastic hidden variable models is the REINFORCE algorithm [4]. Also known as the likelihood ratio estimator and the score function estimator, this estimator uses the log-derivative trick, i.e., $\nabla_\theta p(x|\theta)=p(x|\theta)\nabla_\theta\log p(x|\theta)$, to convert the gradient of an expectation with respect to the parameters of the distribution over which the expectation is computed into an expectation of a term involving the gradient of a log probability. That is,

$$\frac{\partial}{\partial\theta_i}\mathbb{E}_{p(x|\theta)}[f(x)]=\mathbb{E}_{p(x|\theta)}\left[f(x)\frac{\partial}{\partial\theta_i}\log p(x|\theta)\right].$$

The score function estimator has been applied to optimize the variational lower bound on the log-likelihood with an inference network [14; 16]. In this form, however, the estimator has high variance, which leads to slow convergence. A number of subsequent methods have been proposed to deal with this high variance. For continuous stochastic variables, [14] proposes the reparameterization trick, which involves rewriting the function in a form that allows gradients to pass through.

Another method to reduce the variance of this estimator, which also works for discrete variables, is to subtract a term $c$, called a control variate, from $f$. If the control variate depends on the samples, $x$, then we must add back the analytical expectation of the control variate term under $p(x|\theta)$ to keep the estimator unbiased:

$$\frac{\partial}{\partial\theta_i}\mathbb{E}_{p(x|\theta)}[f(x)]=\mathbb{E}_x\left[(f(x)-c(x))\frac{\partial}{\partial\theta_i}\log p(x|\theta)\right]+\mathbb{E}_x\left[c(x)\frac{\partial}{\partial\theta_i}\log p(x|\theta)\right].$$

A number of control variate schemes have been proposed. NVIL [5] subtracts two baselines from the objective to reduce variance: the first is a constant baseline set to the moving average of the function and the second is an input-dependent baseline computed by a feedforward neural network. Since the baselines do not depend on the samples, the analytical expectation is identically 0. MuProp [6] uses the first-order Taylor approximation of the function as a baseline. The expansion point is usually set to the distribution mean, which allows gradients to pass through but requires a separate pass through the network. Since the baseline depends on the sample $x$, its analytical expectation must be added to keep the estimator unbiased. DARN [17] also uses the first-order Taylor expansion as a baseline but does not add the analytical expectation, making the estimator biased. Another very simple biased estimator is the straight-through estimator [19], which uses the gradient relative to the sample as that relative to the parameter. Another class of estimators employs multiple samples, as in [15] to construct tighter lower bounds on the log-likelihood and in [18] to construct per-sample baselines.
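The statement that a sample-independent baseline needs no correction term follows in one line. Writing $b$ for any quantity that does not depend on $x$ (the symbol $b$ is a label introduced here), and summing over the discrete values of $x$:

$$\mathbb{E}_{p(x|\theta)}\left[b\,\frac{\partial}{\partial\theta_i}\log p(x|\theta)\right]=b\sum_x\frac{\partial}{\partial\theta_i}p(x|\theta)=b\,\frac{\partial}{\partial\theta_i}\sum_x p(x|\theta)=b\,\frac{\partial}{\partial\theta_i}1=0.$$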

The Fourier expansion is widely used in computational learning theory with applications to learning low-degree functions [10], decision trees [11], constant-depth circuits [12], juntas [13].

## 4 REINFORCE Estimates Fourier Coefficients

The following lemma is a slight variant of the Margulis-Russo [8; 7; 2] formula. See the appendix for proof.

###### Lemma 1.

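Let $f$ be a Boolean function. Then

$$\frac{\partial}{\partial p_i}\mathbb{E}_{p_\theta(x)}[f(x)]=\frac{2}{\sigma_i}\hat f(\{i\}),$$

where $\hat f(\{i\})$ is the Fourier coefficient of $f$ under the $p$-biased expansion.

From this we see that:

\begin{align}
g^i_{\text{REINFORCE}} &=\mathbb{E}_{p(x)}\left[f(x)\frac{\partial}{\partial p_i}\log p(x)\right] \tag{1}\\
&=\frac{2}{\sigma_i}\hat f(\{i\}) \tag{2}\\
&=\mathbb{E}_{p(x)}\left[f(x)\frac{2\phi_i(x)}{\sigma_i}\right]. \tag{3}
\end{align}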
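The three expressions for the gradient can be compared numerically on a small example. The sketch below evaluates the derivative of $\mathbb{E}[f]$ with respect to $p_i$ by central finite differences, by the score-function expectation, and through the degree-1 Fourier coefficient. It assumes $x_i\in\{-1,1\}$ with $p_i=P(x_i=1)$; the function `f` and all helper names are illustrative.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def expectation(f, p):
    """Exact E[f(x)] under the product distribution."""
    return sum(prob(x, p) * f(x) for x in cube(len(p)))

def score(x, i, p):
    """d/dp_i log p(x): 1/p_i if x_i = +1, otherwise -1/(1 - p_i)."""
    return 1 / p[i] if x[i] == 1 else -1 / (1 - p[i])

def sigma(i, p):
    """Assumed standard deviation of x_i in {-1, +1}."""
    return 2 * math.sqrt(p[i] * (1 - p[i]))

def degree_one_coefficient(f, i, p):
    """hat f({i}) = E[f(x) * (x_i - mu_i) / sigma_i]."""
    mu = 2 * p[i] - 1
    return sum(prob(x, p) * f(x) * (x[i] - mu) / sigma(i, p) for x in cube(len(p)))

def finite_difference(f, i, p, h=1e-5):
    """Central difference of E[f] in p_i (exact up to rounding: E[f] is linear in p_i)."""
    up, down = list(p), list(p)
    up[i] += h
    down[i] -= h
    return (expectation(f, up) - expectation(f, down)) / (2 * h)

def f(x):
    """Arbitrary illustrative function on three bits."""
    return max(x) + 0.5 * x[0] * x[2]

p = [0.3, 0.6, 0.8]
i = 1
grad_fd = finite_difference(f, i, p)
grad_reinforce = sum(prob(x, p) * f(x) * score(x, i, p) for x in cube(len(p)))
grad_fourier = 2 / sigma(i, p) * degree_one_coefficient(f, i, p)
```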

## 5 Variance Reduction in Gradient Estimation

From the above we see that computing gradients of loss functions with respect to parameters of binary latent variables requires computing weight-1 Fourier coefficients. Since a Fourier coefficient is an expectation, Monte Carlo averaging is frequently used as an estimation technique. Naive averaging however results in an estimator that has high variance. What is the source of this variance? To see this we can use the Fourier expansion as follows. A Fourier coefficient is written

\begin{align}
\hat f(\{i\}) &=\mathbb{E}[f(x)\phi_i(x)]=\mathbb{E}\left[\Big(\sum_{S\subseteq[n]}\hat f(S)\phi_S(x)\Big)\phi_i(x)\right] \tag{4}\\
&=\hat f(\emptyset)\,\mathbb{E}[\phi_i(x)]+\hat f(\{i\})\,\mathbb{E}[\phi_i^2(x)]+\sum_{\substack{S\subseteq[n]\\ S\notin\{\emptyset,\{i\}\}}}\hat f(S)\,\mathbb{E}[\phi_S(x)\phi_i(x)]. \tag{5}
\end{align}

Since $\mathbb{E}[\phi_i(x)]=0$ and $\mathbb{E}[\phi_i^2(x)]=1$, and because of orthonormality, we end up with $\hat f(\{i\})$ in expectation. A Monte Carlo average used to estimate the expectation, however, receives a contribution to its variance from the first and third terms. This contribution is large or small according to whether the Fourier coefficients in these terms are large or small. From this perspective, to reduce the variance of a gradient estimator we should attempt to remove the terms with irrelevant Fourier coefficients or ensure that those Fourier coefficients are small in magnitude.
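To illustrate the role of the first term, the sketch below computes the exact variance of the one-sample estimate $(f(x)-b)\,\phi_i(x)$ for $b=0$ and for $b=\hat f(\emptyset)$. The example function, which has a deliberately large mean, and the helper names are illustrative assumptions; the $\pm1$ encoding is assumed as before.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def phi_i(x, i, p):
    """Degree-1 basis function (x_i - mu_i) / sigma_i."""
    mu = 2 * p[i] - 1
    sigma = 2 * math.sqrt(p[i] * (1 - p[i]))
    return (x[i] - mu) / sigma

def estimator_moments(f, baseline, i, p):
    """Exact mean and variance of the one-sample estimate (f(x) - baseline) * phi_i(x)."""
    mean = 0.0
    second = 0.0
    for x in cube(len(p)):
        value = (f(x) - baseline) * phi_i(x, i, p)
        mean += prob(x, p) * value
        second += prob(x, p) * value ** 2
    return mean, second - mean ** 2

def f(x):
    """Illustrative function with a large degree-0 coefficient."""
    return 10 + x[0] * x[1] + x[2]

p = [0.3, 0.6, 0.8]
i = 2
f_empty = sum(prob(x, p) * f(x) for x in cube(len(p)))  # hat f(emptyset) = E[f]

mean_plain, var_plain = estimator_moments(f, 0.0, i, p)
mean_centered, var_centered = estimator_moments(f, f_empty, i, p)
```

Both estimates have the same mean, $\hat f(\{i\})$, but removing the degree-0 term lowers the variance in this example.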

In the following we look at some gradient estimators from the literature with this view.

### 5.1 Straight-Through Estimator

The straight-through estimator [19] is a simple but biased estimator that computes the derivative with respect to a sample $x_i$ and uses it as an estimate of the derivative with respect to the probability parameter $p_i$:

$$g^i_{\text{ST}}=\frac{\partial f(x)}{\partial x_i}.$$

We can view this estimator as approximating the discrete derivative of $f$, or equivalently the derivative of the Fourier expansion of $f$. The discrete derivative of $f$ can be expanded as follows:

$$D_i(f)(x^{\setminus i})=\sum_{S\ni i}\hat f(S)\,\phi_{S\setminus\{i\}}(x^{\setminus i}).$$

Here $x^{\setminus i}$ denotes $x$ without the $i$th coordinate. If we were able to compute $D_if$ we would have an unbiased estimator of the gradient (up to the factor $2/\sigma_i$), since $\mathbb{E}[D_if(x^{\setminus i})]=\hat f(\{i\})$. However, since we normally have $f$ available to us as a neural network (and not as a multilinear polynomial), we approximate the discrete derivative with the derivative of the function as a neural network.
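The expectation of the discrete derivative can be read off from its expansion: every basis function with a nonempty index set has mean zero, so only the term with $S=\{i\}$ survives.

$$\mathbb{E}\left[D_if(x^{\setminus i})\right]=\sum_{S\ni i}\hat f(S)\,\mathbb{E}\left[\phi_{S\setminus\{i\}}(x^{\setminus i})\right]=\hat f(\{i\})\,\mathbb{E}[\phi_\emptyset]=\hat f(\{i\}).$$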

### 5.2 NVIL

NVIL [5] subtracts two baselines, a constant baseline and an input-dependent baseline, from the function before computing its Fourier coefficient. Since the baselines do not depend on samples from the distribution, they can be seen as subtracting the weight-0 Fourier coefficient, $\hat f(\emptyset)$, from the function. Notice that this does not affect the remaining Fourier coefficients of the resulting function.

### 5.3 MuProp

The MuProp [6] estimator builds a control variate based on the first order Taylor series expansion as follows.

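$$g^i_\mu=\left(f(x)-f(\mu)-\frac{\partial f(\mu)}{\partial\mu_i}(x_i-\mu_i)\right)\frac{\partial\log p(x)}{\partial p_i}+\frac{\partial f(\mu)}{\partial\mu_i}\frac{\partial}{\partial p_i}\mathbb{E}[x_i].$$

This can be viewed as an approximation to the following:

$$g^i_\mu=\left(f(x)-\hat f(\emptyset)-D_if(x^{\setminus i})\,\frac{x_i-\mu_i}{\sigma_i}\right)\frac{2\phi_i(x_i)}{\sigma_i}+\frac{2D_if(x^{\setminus i})}{\sigma_i}.$$

The term $D_if(x^{\setminus i})\,\phi_i(x_i)$ includes all the terms of $f$ that contain $\phi_i$, and subtracting it from $f$ removes all terms containing $\phi_i$ from $f$, reducing the variance in $g^i_\mu$. Once again, since we do not have $f$ as a multilinear polynomial, we cannot compute $D_if$; therefore we approximate it with the gradient of the function as a neural network.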
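When $f$ is small enough to expand exactly, the Fourier form of this estimator can be evaluated directly. The sketch below builds $D_if$ from the exact coefficients and checks that the estimator's expectation equals $\frac{2}{\sigma_i}\hat f(\{i\})$. This is an idealized illustration under the $\pm1$ encoding with hypothetical helper names; it is not the neural-network approximation used in practice.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def phi(x, S, p):
    """Basis function phi_S(x) = prod_{i in S} (x_i - mu_i) / sigma_i."""
    value = 1.0
    for i in S:
        mu = 2 * p[i] - 1
        sigma = 2 * math.sqrt(p[i] * (1 - p[i]))
        value *= (x[i] - mu) / sigma
    return value

def coefficient(f, S, p):
    """Fourier coefficient E[f(x) phi_S(x)], by exact enumeration."""
    return sum(prob(x, p) * f(x) * phi(x, S, p) for x in cube(len(p)))

def discrete_derivative(f, i, p):
    """D_i f: sum over S containing i of hat f(S) * phi_(S minus i)."""
    n = len(p)
    terms = [
        (tuple(j for j in S if j != i), coefficient(f, S, p))
        for k in range(1, n + 1)
        for S in itertools.combinations(range(n), k)
        if i in S
    ]
    return lambda x: sum(c * phi(x, rest, p) for rest, c in terms)

def fourier_muprop(f, i, p):
    """Idealized estimator: remove hat f(empty) and all phi_i terms, add D_i f back."""
    sigma_i = 2 * math.sqrt(p[i] * (1 - p[i]))
    f_empty = coefficient(f, (), p)
    d_i = discrete_derivative(f, i, p)
    def estimate(x):
        phi_i = phi(x, (i,), p)
        residual = f(x) - f_empty - d_i(x) * phi_i
        return residual * 2 * phi_i / sigma_i + 2 * d_i(x) / sigma_i
    return estimate

def f(x):
    """Arbitrary illustrative function on three bits."""
    return max(x) + 0.5 * x[0] * x[2]

p = [0.3, 0.6, 0.8]
i = 0
estimate = fourier_muprop(f, i, p)
mean_estimate = sum(prob(x, p) * estimate(x) for x in cube(len(p)))
target = 2 / (2 * math.sqrt(p[i] * (1 - p[i]))) * coefficient(f, (i,), p)
```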

## 6 Proposed Estimator

We develop an unbiased gradient estimator using the variance reduction properties of the noise operator. The noise operator, $T_\rho$, is a smoothing operator that, when applied to a function, decays the effect of the higher-order terms of the function. The extent of decay depends on degree: the higher the degree of a term, the greater the decay factor applied to it. The expected value of a function under the input distribution is unaffected by application of the noise operator, i.e., $\mathbb{E}[T_\rho f]=\mathbb{E}[f]$. Therefore the noise-operated version of a function is a function with the same expected value but with reduced variance.

This is intuitively clear from the Fourier expansion of the noise operator with parameter $\rho$:

$$T_\rho(f)(x)=\sum_{S\subseteq[n]}\rho^{|S|}\hat f(S)\,\phi_S(x).$$

Here, we see that the noise operator multiplies the Fourier coefficients of $f$ by a factor $\rho^{|S|}$ that decays exponentially with the degree of the term. We also see that as $\rho\to1$, $T_\rho f\to f$, and as $\rho\to0$, $T_\rho f\to\hat f(\emptyset)=\mathbb{E}[f]$. In other words, the variance of $T_\rho f$ goes to $0$ with $\rho$. For intermediate values of $\rho$, $T_\rho f$ has exponentially small higher-degree terms and lower variance than $f$. In fact, we can make the stronger statement that even higher norms of $T_\rho f$ are bounded by the second norm of $f$. This fact is expressed by saying that the noise operator is hypercontractive and is the content of Bonami's hypercontractivity theorem [9; 2]. One version of the theorem states:

###### Theorem 1 (Hypercontractivity of Noise Operator).

Given a product distribution on a finite probability space where each outcome in the constituent distributions has probability at least $\lambda$ and a function $f$ defined on such a space, for any $q>2$ and $0\le\rho\le\frac{1}{\sqrt{q-1}}\lambda^{1/2-1/q}$, we have that

$$\|T_\rho f\|_q\le\|f\|_2.$$

In particular, for appropriate $\rho$ and $q$ we have that $\|T_\rho f\|_2\le\|T_\rho f\|_q\le\|f\|_2$. Here we use the fact that for $p\le q$, $\|f\|_p\le\|f\|_q$. Squaring and subtracting the square of the expectation $\mathbb{E}[f]=\mathbb{E}[T_\rho f]$, we see that

$$\|T_\rho f\|_2^2-\mathbb{E}[f]^2\le\|f\|_2^2-\mathbb{E}[f]^2,$$

$$\sigma^2(T_\rho(f))\le\sigma^2(f),$$

showing that the variance of the noise-operated version of $f$ is no more than the variance of $f$.
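The same variance inequality can also be obtained directly from the Fourier expansion, without hypercontractivity. By orthonormality of the $\phi_S$ (Parseval's identity), for any $\rho\in[0,1]$:

$$\sigma^2(f)=\mathbb{E}[f^2]-\mathbb{E}[f]^2=\sum_{S\neq\emptyset}\hat f(S)^2,\qquad \sigma^2(T_\rho f)=\sum_{S\neq\emptyset}\rho^{2|S|}\hat f(S)^2\le\rho^2\sum_{S\neq\emptyset}\hat f(S)^2=\rho^2\,\sigma^2(f)\le\sigma^2(f).$$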

### 6.1 Constructing the Control Variate

We can use this fact to construct a control variate in a number of ways. Recall that a control variate for an unbiased estimator is essentially a Boolean function for which all degree-1 coefficients are 0. If $g$ is any Boolean function then

$$g(x)-\frac{1}{\rho}T_\rho(g)(x)$$

has all degree-1 coefficients 0 and can be used as a control variate.

If $g$ is well correlated with $f$, then subtracting the low-variance term $\frac{1}{\rho}T_\rho(g)$ from $g$ results in a function with 0 degree-1 coefficients and with higher-order terms that are still well correlated with those of $f$. Subtracting this from $f$ to give $f-\left(g-\frac{1}{\rho}T_\rho(g)\right)$ then gives a function with small higher-order terms where the degree-1 terms are unaffected. A correlated function $g$ can be learnt by using a feedforward network and minimizing the mean squared error with $f$.

Another possible way to build the control variate is to use

$$g(x)-T_\rho(g)(x)-T_{1-\rho}(g)(x).$$
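Both constructions can be checked by exact enumeration. The sketch below applies an exactly computed noise operator to an arbitrary function `g` (a stand-in, not a learned network) and evaluates the degree-1 coefficients of each candidate control variate. The $\pm1$ encoding and all helper names are assumptions made for illustration.

```python
import itertools
import math

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def degree_one_coefficient(h, i, p):
    """hat h({i}) = E[h(x) * (x_i - mu_i) / sigma_i]."""
    mu = 2 * p[i] - 1
    sigma = 2 * math.sqrt(p[i] * (1 - p[i]))
    return sum(prob(x, p) * h(x) * (x[i] - mu) / sigma for x in cube(len(p)))

def noise_operator(g, rho, p):
    """Return the function T_rho g, computed exactly (no sampling)."""
    n = len(p)
    def smoothed(x):
        total = 0.0
        for y in cube(n):
            weight = 1.0
            for i in range(n):
                marginal = p[i] if y[i] == 1 else 1 - p[i]
                weight *= (rho if y[i] == x[i] else 0.0) + (1 - rho) * marginal
            total += weight * g(y)
        return total
    return smoothed

def scaled_control_variate(g, rho, p):
    """c(x) = g(x) - T_rho(g)(x) / rho."""
    t_g = noise_operator(g, rho, p)
    return lambda x: g(x) - t_g(x) / rho

def paired_control_variate(g, rho, p):
    """c(x) = g(x) - T_rho(g)(x) - T_{1-rho}(g)(x)."""
    t_g = noise_operator(g, rho, p)
    t_g_complement = noise_operator(g, 1 - rho, p)
    return lambda x: g(x) - t_g(x) - t_g_complement(x)

def g(x):
    """Arbitrary illustrative function standing in for a learned g."""
    return max(x) + 0.5 * x[0] * x[2] + 0.1 * x[1]

p = [0.3, 0.6, 0.8]
rho = 0.3
c_scaled = scaled_control_variate(g, rho, p)
c_paired = paired_control_variate(g, rho, p)

# Degree-1 coefficients of g itself are nonzero, those of the control variates vanish.
g_degree_one = [degree_one_coefficient(g, i, p) for i in range(3)]
scaled_degree_one = [degree_one_coefficient(c_scaled, i, p) for i in range(3)]
paired_degree_one = [degree_one_coefficient(c_paired, i, p) for i in range(3)]
```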

### 6.2 Building the Estimator

We built the final estimator used in the experiments upon MuProp. To do so we construct a function that includes the MuProp terms, an input-dependent baseline and our sample-dependent baseline as follows:

$$t(x)=f(x)-b(x)-f(\mu)-\alpha f'(x)(x-\mu)-\beta\left(g(x)-\frac{T_\rho(g)(x)}{\rho}\right).$$

The gradient is formed from $t(x)$ in the same way as for MuProp, and we use the single-sample estimate of the expectation, where $T_\rho(g)(x)$ is estimated using a single $\rho$-correlated sample.
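A stripped-down version of the single-sample scheme can be sketched as follows. It keeps only the sample-dependent baseline and omits $b(x)$ and the MuProp terms, so it is a simplified illustration rather than the estimator used in the experiments. The function `g` is a hand-written stand-in for a learned network, and the $\pm1$ encoding and helper names are assumptions. The check enumerates all pairs $(x,x')$ to confirm that replacing $T_\rho(g)(x)$ by $g(x')$ for one $\rho$-correlated sample leaves the expectation equal to the exact gradient.

```python
import itertools
import math
import random

def cube(n):
    """All points of the Boolean cube {-1, +1}^n."""
    return list(itertools.product([-1, 1], repeat=n))

def prob(x, p):
    """Probability of x under the product distribution with P(x_i = +1) = p[i]."""
    return math.prod(p[i] if xi == 1 else 1 - p[i] for i, xi in enumerate(x))

def score(x, i, p):
    """d/dp_i log p(x): 1/p_i if x_i = +1, otherwise -1/(1 - p_i)."""
    return 1 / p[i] if x[i] == 1 else -1 / (1 - p[i])

def transition(x, y, rho, p):
    """P(x' = y | x) when x' is rho-correlated with x."""
    weight = 1.0
    for i in range(len(p)):
        marginal = p[i] if y[i] == 1 else 1 - p[i]
        weight *= (rho if y[i] == x[i] else 0.0) + (1 - rho) * marginal
    return weight

def sample_correlated(x, rho, p, rng):
    """Draw x' ~ N_rho(x): keep x_i with probability rho, else resample from p_i."""
    return tuple(
        xi if rng.random() < rho else (1 if rng.random() < p[i] else -1)
        for i, xi in enumerate(x)
    )

def estimate(f, g, x, x_prime, i, p, rho, beta):
    """One-sample estimate with T_rho(g)(x) replaced by g(x')."""
    t = f(x) - beta * (g(x) - g(x_prime) / rho)
    return t * score(x, i, p)

def f(x):
    """Arbitrary illustrative objective on three bits."""
    return max(x) + 0.5 * x[0] * x[2]

def g(x):
    """Hypothetical correlated function; in the paper g is a learned network."""
    return f(x) + 0.1 * x[1]

p = [0.3, 0.6, 0.8]
rho, beta, i = 0.5, 1.0, 0

exact_gradient = sum(prob(x, p) * f(x) * score(x, i, p) for x in cube(len(p)))
expected_estimate = sum(
    prob(x, p) * transition(x, y, rho, p) * estimate(f, g, x, y, i, p, rho, beta)
    for x in cube(len(p))
    for y in cube(len(p))
)

# A single stochastic draw, as it would be used during training.
rng = random.Random(0)
x = tuple(1 if rng.random() < q else -1 for q in p)
x_prime = sample_correlated(x, rho, p, rng)
one_draw = estimate(f, g, x, x_prime, i, p, rho, beta)
```

Unbiasedness holds for any `g`, `rho` in $(0,1]$ and `beta`; how much the variance drops depends on how well `g` tracks `f`, which this sketch does not measure.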

## 7 Experiments

To test the variance reduction properties of our proposed estimator, we compared the estimator against NVIL and the MuProp estimator with an input-dependent baseline. We built our implementation upon the code base made available by [1] (TensorFlow code available at: https://github.com/alpz/fourier-REINFORCE). In our experiments we used sigmoid belief nets with one, two and three layers. Each stochastic layer in the network is 200 units wide. The input-dependent baseline is a feedforward network with a single layer of 100 tanh units and the sample-dependent baseline has two 100-unit layers of either tanh or relu units. The datasets are the statically binarized MNIST digit and Omniglot character datasets. For the optimization we used SGD with a momentum of 0.9 and a minibatch size of 24. For 3-layer models on Omniglot we found SGD to be slow to converge regardless of the gradient estimator; for those we performed the optimization using Adam. The gradient variance is estimated using an exponential moving average. We use a constant value of 0.5. The experiments were run with learning rates in and with the best performing result chosen. We used a different learning rate for the baseline networks which was set to times the learning rate for the model.

### 7.1 Generative Modeling with Sigmoid Belief Nets

We train generative models using the autoencoding variational Bayes framework of [14]. The framework simplifies the training of sigmoid belief networks using an inference network to generate samples from an approximate variational posterior distribution. The inference network is parameterized as a feedforward network with a structure that is the reverse of the model. We optimize the single sample variational lower bound (ELBO) of the log-likelihood.

$$\log p(y|\theta)\ge\mathbb{E}_{q(x|y,\theta)}\left[\log p(y|x,\theta)+\log p(x|\theta)-\log q(x|y,\theta)\right].$$

Here $q(x|y,\theta)$ is the approximate variational posterior and $y$ is a data sample.

The final training ELBO results for MNIST and Omniglot after 2,000,000 steps are given in tables 1 and 2. The training ELBO for 2-layer and 3-layer models for MNIST is plotted in figure 1. The estimated log of gradient variance is plotted in figure 2. As can be seen from the last figure, we get a substantial improvement in gradient variance for the 2-layer model, and the difference becomes significant earlier in training when model depth is increased. This can also be seen from the ELBO plots, where the 3-layer ELBO diverges from the MuProp ELBO much earlier than for the 2-layer model.

#### 7.1.1 Multiple Samples

The results above were obtained using a single sample to estimate $T_\rho(g)(x)$. We also performed experiments using more samples to estimate $T_\rho(g)(x)$. Notice that this still uses only a single sample from the model. This resulted in only a very slight reduction in gradient variance in the initial phase of training. This can be seen in figure 3 for the 2-layer model on MNIST. However, since we did not optimize over hyperparameters for this setting, it might still be possible to improve these results when multiple samples are used to compute $T_\rho(g)(x)$.

## References

• [1] Tucker, George, et al. "REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models." Advances in Neural Information Processing Systems. 2017.
• [2] O’Donnell, Ryan. Analysis of boolean functions. Cambridge University Press, 2014.
• [3] Raiko, Tapani, et al. "Techniques for learning binary stochastic feedforward neural networks." arXiv preprint arXiv:1406.2989 (2014).
• [4] Williams, Ronald J. "Simple statistical gradient-following algorithms for connectionist reinforcement learning." Reinforcement Learning. Springer, Boston, MA, 1992. 5-32.
• [5] Mnih, Andriy, and Karol Gregor. "Neural Variational Inference and Learning in Belief Networks." International Conference on Machine Learning. 2014.
• [6] Gu, Shixiang, et al. "MuProp: Unbiased backpropagation for stochastic neural networks." arXiv preprint arXiv:1511.05176 (2015).
• [7] Russo, Lucio. "An approximate zero-one law." Probability Theory and Related Fields 61.1 (1982): 129-139.
• [8] Margulis, Grigorii Aleksandrovich. "Probabilistic characteristics of graphs with large connectivity." Problemy peredachi informatsii 10.2 (1974): 101-108.
• [9] Bonami, Aline. "Étude des coefficients de Fourier des fonctions de Lp (G)." Ann. Inst. Fourier (Grenoble) 20.2 (1970): 335-402.
• [10] Mansour, Yishay. "Learning Boolean functions via the Fourier transform." Theoretical advances in neural computation and learning. Springer, Boston, MA, 1994. 391-424.
• [11] Kushilevitz, Eyal, and Yishay Mansour. "Learning decision trees using the Fourier spectrum." SIAM Journal on Computing 22.6 (1993): 1331-1348.
• [12] Linial, Nathan, Yishay Mansour, and Noam Nisan. "Constant depth circuits, Fourier transform, and learnability." Journal of the ACM (JACM) 40.3 (1993): 607-620.
• [13] Mossel, Elchanan, Ryan O’Donnell, and Rocco P. Servedio. "Learning juntas." Proceedings of the thirty-fifth annual ACM symposium on Theory of computing. ACM, 2003.
• [14] Kingma, Diederik P., and Max Welling. "Auto-encoding variational bayes." arXiv preprint arXiv:1312.6114 (2013).
• [15] Burda, Yuri, Roger Grosse, and Ruslan Salakhutdinov. "Importance weighted autoencoders." arXiv preprint arXiv:1509.00519 (2015).
• [16] Rezende, Danilo Jimenez, Shakir Mohamed, and Daan Wierstra. "Stochastic backpropagation and approximate inference in deep generative models." arXiv preprint arXiv:1401.4082 (2014).
• [17] Gregor, Karol, et al. "Deep autoregressive networks." arXiv preprint arXiv:1310.8499 (2013).
• [18] Mnih, Andriy, and Danilo J. Rezende. "Variational inference for monte carlo objectives." arXiv preprint arXiv:1602.06725 (2016).
• [19] Bengio, Yoshua, Nicholas Léonard, and Aaron Courville. "Estimating or propagating gradients through stochastic neurons for conditional computation." arXiv preprint arXiv:1308.3432 (2013).

Appendix

## Appendix A Proof of Lemma 1

We follow [2, 8.4] in the following proof.

###### Proof.

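We consider $f$ as a multilinear polynomial in $x_1,\ldots,x_n$. Let $f^{(p)}$ be the $p$-biased Fourier representation of the Boolean function $f$. Then by linearity of expectation

$$\mathbb{E}[f^{(p)}(x_1,\ldots,x_n)]=f(\mu_1,\ldots,\mu_n).$$

We also have that

$$\frac{\partial}{\partial\mu_i}f(\mu)=D_{x_i}f(\mu).$$

Then

\begin{align}
D_{x_i}f(\mu) &=\mathbb{E}[D_{x_i}f^{(p)}(x_1,\ldots,x_n)]\\
&=\frac{1}{\sigma_i}\mathbb{E}[D_{\phi_i}f^{(p)}(x_1,\ldots,x_n)]\\
&=\frac{1}{\sigma_i}\hat f(\{i\}),
\end{align}

where $\hat f(\{i\})$ is the Fourier coefficient in the $p$-biased representation. Given that $\mu_i=2p_i-1$ and $\sigma_i=2\sqrt{p_i(1-p_i)}$, the result follows by the chain rule. ∎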
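Spelled out, the chain rule step reads as follows, assuming $x_i\in\{-1,1\}$ with $p_i=P(x_i=1)$ so that $\mu_i=2p_i-1$ and $\partial\mu_i/\partial p_i=2$:

$$\frac{\partial}{\partial p_i}\mathbb{E}[f(x)]=\frac{\partial\mu_i}{\partial p_i}\cdot\frac{\partial}{\partial\mu_i}f(\mu)=2\cdot\frac{1}{\sigma_i}\hat f(\{i\})=\frac{2}{\sigma_i}\hat f(\{i\}).$$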
