# Neural Control Variates for Variance Reduction

###### Abstract

In statistics and machine learning, approximation of an intractable integration is often achieved by using the unbiased Monte Carlo estimator, but the variances of the estimation are generally high in many applications. Control variates approaches are well-known to reduce the variance of the estimation. These control variates are typically constructed by employing predefined parametric functions or polynomials, determined by using those samples drawn from the relevant distributions. Instead, we propose to construct those control variates by learning neural networks to handle the cases when test functions are complex. In many applications, obtaining a large number of samples for Monte Carlo estimation is expensive, which may result in overfitting when training a neural network. We thus further propose to employ auxiliary random variables induced by the original ones to extend data samples for training the neural networks. We apply the proposed control variates with augmented variables to thermodynamic integration and reinforcement learning. Experimental results demonstrate that our method can achieve significant variance reduction compared with other alternatives.

Neural Control Variates for Variance Reduction

Zhanxing Zhu ^{†}^{†}thanks: Equal contribution.
Center for Data Science, Peking University &
Beijing Institute of Big Data Research (BIBDR)
Beijing, China
zhanxing.zhu@pku.edu.cn
Ruosi Wan ^{1}^{1}footnotemark: 1
Center for Data Science, Peking University
Beijing, China
ruoswan@pku.edu.cn
Mingjun Zhong
College of Science, University of Lincoln
Lincoln, United Kingdom
MZhong@lincoln.ac.uk

noticebox[b]Preprint. Work in progress.\end@float

## 1 Introduction

Most of modern machine learning and statistical approaches focus on modelling complex data, where manipulating high-dimensional and multi-modal probability distributions is of great importance for model inference and learning. Under this circumstance, evaluating the expectation of certain function with respect to a probability distribution is ubiquitous, where the random variable of interest is typically high-dimensional.

However, in complex models, the integration is often analytically intractable. This drives the development of sophisticated Monte Carlo methods to facilitate efficient computation robert2004monte (). The Monte Carlo method is naturally employed to approximate the expectation, i.e., where are samples drawn from the distribution . According to the central limit theorem, this estimator converges to at the rate . For high-dimensional and complex models, when is difficult to sample from neal2012bayesian () or the test function is expensive to evaluate higdon2015bayesian (), a “large-” estimation is prohibited computationally. This directly leads to a high-variance estimator. Therefore, with a limited number of samples, how to reduce the variance of Monte Carlo estimations emerges as an essential issue for its practical use.

Along this line, various variance reduction methods have been introduced in the literature of statistics and numerical analysis. One category aims to develop appropriate samplers for variance reduction, including importance sampling and its variants cornuet2012adaptive (), stratified sampling techniques rubinstein2016simulation (), multi-level Monte Carlo giles2013multilevel () and other sophisticated methods based on Markov chain Monte Carlo (MCMC) robert2004monte (). Another category of variance reduction methods is called control variates assaraf1999zero (); mira2013zero (); oates2016controlled (); oates2017control (); liu2017action (). These methods take advantage of random variables with known expectation values, which are negatively correlated with the test function under consideration. Control variates techniques can fully employ the available samples to reduce the variance, which is popular due to its efficiency and effectiveness.

However, existing control variates approaches have limitations. Firsly, most existing methods use a linear or quadratic form to represent the control function mira2013zero (); oates2016controlled (). Although these control functions have closed forms, the representation power of them is very limited particularly when the test function is complex and non-linear. Control functionals were proposed recently to tackle this problem oates2017control (). However, these estimators may significantly suffer from a curse of dimensionality oates2016convergence (). Secondly, when the available samples are scarce, optimizing the control variates only based on a small number of samples might overfit, which means that it is difficult to generalize on the samples obtained later. These restrictions limit their practical performance.

In order to overcome these issues, in this paper we propose to employ neural networks to represent the control functions, utilising the capability of a neural network to represent a complex test function. This method is named as “Neural Control Variates” (NCV). To accommodate the overfitting issue when available training sample size is small, we augment an auxiliary variable to the original one such that the extra conditional information could be used, abbreviated as NCV with augmented variables (NCVA). Our methods are particularly suitable for the cases when the sample space is high-dimensional or the samples from is hard to obtain. We demonstrate the effectiveness of our methods in thermodynamic integration and policy optimization in reinforcement learning. We show that the new control variates methods achieved the best performance comparing to the state-of-the-art methods in literatures.

## 2 Control Variates

The generic control variates aims to estimate the expectation with reduced variance. The principle behind the control variates relies on constructing an auxiliary function such that

(1) |

Thus the desired expectation can be replaced by that of the auxiliary function

(2) |

It is possible to obtain a variance-reduced Monte Carlo estimator by selecting or optimizing so that the variance . Intuitively, variance reduction can be achieved when is negatively correlated with under , since much of the randomness “cancels out” in the auxiliary function .

The selection of an appropriate form of control function is crucial for the performance of variance reduction. A tractable class of so called zero-variance control variates was proposed in mira2013zero (). Those control variates are expressed as a function of the gradient of the log-density, , i.e. the score function . Concretely, it has the following form

(3) |

where the gradient operator , the Laplace operator , and “” denotes the inner product. The function is often referred to as the trial function. The target is now to find a trail function so that and are negatively correlated. This could thus reduce the variance of the Monte Carlo estimation. As the trail function could be arbitrary under those mild conditions given in mira2013zero (), a prametric function could be used for . For example, when , which is a first degree polynomial function, the auxiliary function becomes as was proposed in mira2013zero (). The optimal choice of the parameter that minimizes the variance of is where , . Obviously, the representation power of these polynomials is limited, and therefore control functionals have been proposed recently where the trail function is stochastic. For example, the trail function could be a kernel function oates2017control (). In order for using these control variates, we firstly estimate these required parameters in the trail function by using some training samples . Then the learned control variates can be used for test samples.

However, there are several drawbacks of the current zero-variance techniques: 1) Dilemma between effectiveness and efficiency. Although increasing the order of polynomial could potentially increase the representation power and reduce more variance, the number the parameters needs to be learned would grow exponentially. As pointed out by mira2013zero (), when quadratic polynomials are used, , the number of parameters will be . Thus, finding the optimal coefficients requires dealing with which is a matrix of dimension of order . Similar issue occurs when employing the control functionals. This makes the use of high order polynomials computationally expensive when faced with high-dimensional sampling spaces. 2) Poor generalization with small sample size. With small sizes of training samples and complex , the learned control variates could potentially overfit the training samples, i.e. generalize poorly over new samples. This is because a small size of training sample might be insufficient for representing the full distributional information of .

These limitations motivate our development of neural control variates and its extension with augmented variables, which will be elaborated below.

## 3 Neural Control Variates and Its Extension

Firstly, we focus on alleviating the dilemma between effectiveness and efficiency on designing control variates in high-dimensional sample space. To this end, we propose neural control variates (NCV) where the trail function is represented by a neural network. Equipped with neural networks, their excellent capability of representing complex functions and overcoming the curse of dimensionality can be fully employed in high-dimensional scenarios lecun2015deep ().

Instead of relying on the control variates (3) introduced in mira2013zero (), we use the following Stein control variates based on Stein identity stein1972bound (); oates2017control (),

(4) |

where is the trial function. Note that in order for , we admit those assumptions made in mira2013zero (); oates2017control () which could be easily satisfied under some mild conditions. Compared with Eq (3), Stein control variates is preferred due to its computational advantages since evaluating the second order derivatives of the trial function is avoided. Note that when the trial function is a linear or quadratic polynomial, the Stein trial function is constant or linear, respectively.

We now represent the trial function by a neural network parameterized by the weights . The control function becomes . In order for variance reduction, we solve the following optimization problem

(5) |

which does not have a closed-form in general. Instead, the following optimization problem will be solved

(6) |

where are samples drawn from . Consequently, standard back-propagation techniques and stochastic gradient descent (SGD) can be adopted to obtain the optimal network parameters.

### 3.1 Neural Control Variates with Augmented Variables

When the distribution is high-dimensional and multi-modal, it would be very expensive to draw samples, which are required for training those control variates. This typically produces a rather small number of samples which is not sufficient for training control variates. When using control variates methods, the learned parameters for control variates can easily overfit over the training samples, and thus could not generalize well on new samples drawn from . To deal with this issue, we augment an extra variable to to establish a joint distribution . This neural control variates with augmented variable (NCVA) method could be helpful to remedy overfitting.

Concretely, our proposal is inspired by the following observation. Given the augmented control function satisfying , we have

(7) |

With the joint distribution , the equation above can be expanded further as follows

(8) |

Thus, the “effective control variates” for under becomes . Equipped with Stein control variates for , we have

(9) |

where the augmented score function can be written as

(10) |

The equations (8)-(10) facilitate expanding the training sample size by adding augmented variable samples from the conditional distribution . This trick could essentially help to reduce overfitting. Intuitively, the influence of augmented variable for training are delivered by the change of the score function presented in equation (10).

The selection of the conditional distribution depends on the application of interest and how well the samples approximate the true density . For simplicity, we obtain the samples of by firstly sampling , and then , where the shape parameter can be tuned by user. We use this simple scheme for all the applications in Section 4. As for how to choose the optimal form of for best generalization, we leave it as our future work.

After parameterizing the augmented trial function by a neural network , the following objective function will be minimized,

(11) |

where the “effective control variates” can be estimated with Monte Carlo samples by an online fashion. We present the training and test procedures of NCVA in Algorithm 1.

## 4 Experiments

To evaluate our proposed methods, i.e., NCV and NCVA, we apply them to a synthetic problem and two real scenarios, which are thermodynamic integration for Bayesian model evidence evaluation, and policy optimization in reinforcement learning. For comparison purposes, control functionals (CF) (oates2017control, ) and polynomial control variates (mira2013zero, ; oates2016controlled, ) are also applied to these problems. The performance of the trained control variates are measured by the variance reduction ratio on the test data set, i.e., We used fully connected neural networks to represent the trial function in all the experiments. We found that for the experiments presented in the following, a medium-sized network is empirically sufficient to achieve good performance. More details on network architectures are provided in the Appendix.

### 4.1 Synthetic Problem

To illustrate the advantage of NCV on dealing with high-dimensional problems over other methods, we consider to approximate the expectation of where which is a mixture of Gaussians, i.e., .

Figure 1 shows the variance reduction ratio on test data () with respect to varying the number of training samples and the dimensions. In both cases, we can observe that NCV outperforms linear, quadratic and control functionals control variates. Particularly, when increasing the dimensions of , NCV can still achieve much lower variance reduction ratio compared with CF.

### 4.2 Thermodynamic Integral for Bayesian Model Evidence Evaluation

In Bayesian analysis, data is assumed to have been generated under a collection of putative models, . To compare these candidate models, the Bayesian model evidence is constructed as where are the parameters associated with model . Unfortunately, for most of the models of interest, this integral is unavailable in closed form. Thus many techniques were proposed to approximate the model evidence. Thermodynamic integration (TI) frenkel2001understanding () is among the most promising approach to estimate the evidence. This approach is derived from the standard thermodynamic identity, where , is called power posterior. Note that we have dropped the model indicator for simplicity. Here is known as an inverse temperature parameter. In many cases, the posterior expectaion can’t be analytically computed, thus the Monte Carlo integration is applied. However, Monte Carlo integration often suffer large variance when sample size is not large enough.

In oates2016controlled (), the zero-variance control variates (3) were used to reduce the variance for TI, so that

(12) |

where are drawn from the posterior . In oates2016controlled (), the trial function was assumed as a linear or quadratic function, which corresponds to a constant or linear function for in Stein control variates ^{1}^{1}1We will call the trial function as the constant or linear type trial functions in the following.. These methods achieved excellent performance for simple models (oates2016controlled, ).
However, they are struggling in some scenarios, for example, a negative example which is Goodwin Oscillator is given in (oates2016controlled, ). Note that Goodwin Oscillator is a nonlinear dynamical system,

(13) |

where the form of is provided in the Appendix. Assuming within only a subset of time points , the solution of (13), i.e. , is observed under Gaussian noise , where denotes the variance of the noise. That means the observation . Then we have the likelihood The expectation of the log likelihood under the power posterior, i.e., needs to be evaluated. In oates2016controlled (), the authors demonstrated the failure of polynomial-type of control function since the log-likelihood surface is highly multi-modal and there is much weaker canonical correlation between the scores and the log posterior.

In practice, sampling from Goodwin Oscillator is difficult and computationally expensive since simulating the underlying ordinary differential equation is extremely time-consuming. This directly leads to the situation that the available training samples for control variates are not sufficient. We show in the following that the proposed NCVA can be employed to remedy this issue. To illustrate the benefits of NCVA, we compared it with other methods with various sizes of training samples and temperatures. For comparison purposes we evaluated the variance reduction ratios on both training and test sets ( samples for test). The experiment settings are the same as those in oates2016controlled ().

Figure 2 shows the experimental results when applying different types of control variates. It can be easily observed that the linear and quadratic methods could hardly reduce the variance of the Goodwin Oscillator model on test set. Both neural control variates and control functionals perform well when the training size is larger than . However, when the sample size is small, CF and NCV suffer from overfitting. Comparing to all other methods, NCVA successfully achieved the lowest variance reduction ratio when the sample size was small.

### 4.3 Variance Reduction for Policy Optimization

In this application, the proposed NCVA is employed to reduce the variance of policy gradient in reinforcement learning. Policy gradient methods have achieved great success in reinforcement learning problemsschulman2015trust (); tucker2017rebar (); kakade2002natural (); schulman2017proximal (). It assumes the action is selected from a conditional probability distribution given a state and parameterized by , typically named as policy. The goal of the policy gradient method is to obtain the optimal parameters by maximizing the expected cumulative reward,

(14) |

According to sutton1998introduction (), the gradient of can be written as

(15) |

where denotes the expected reward under the policy starting from state and action . When training a policy, a Monte Carlo estimation of the policy gradient (15) is computed using the trajectory

(16) |

where is the empirical estimate of , and the trajectory is obtained by simulating the environment under the current policy . However the Monte Carlo policy gradient estimator often suffers a large-variance issue when the number of trajectories is small, which makes the training process inefficient.

Neural Stein control variates were adopted to reduce the variance of the policy gradients estimator in liu2017sample (), which can be described as follows. Given a policy , Stein’s identity with respect to the action is

(17) |

where is the trial function of the Stein control variate. However, (17) can’t be directly applied in policy gradient form (15), since the gradient in (15) is taken with respect to the policy parameters , while the gradient in Stein identity (17) is taken with respect to the action . To connect and , the policy is reparameterized to adapt the Stein identity to the policy gradients framework, i.e. assuming the action is generated by a mapping , where is a random noise drawn from some distribution independent of and . Then we have

(18) |

Combining (18) and (15), the policy gradient with NCV could be derived:

(19) |

(a1) Halfcheetah-v1, | (a2) Halfcheetah-v1, |

(b1) Walker2d-v1, | (b2) Walker2d-v1, |

(b1) Hopper-v1, | (b2) Hopper-v1, |

Now we introduce the augmented variable method to adapt the reparametrized (18):

(20) |

where the augmented variable is generated by , the noise drawn from a pre-defined distribution independently of . Thus we have . Besides,

(21) |

Thus the augmented form of the gradient estimator could be rewritten as:

(22) |

where the control variate is decomposed as , the term is a parametric function approximation of the value function, and is an estimator of the advantage function sutton1998introduction (); tucker2017rebar ().

In liu2017action (), the authors evaluated their reparametrized Stein CV method combining proximal policy optimization (PPO, schulman2017proximal () in continuous control environments from the OpenAI Gym benchmark using the MuJoCo physics simulator. We compare the performance of their method and the augmented version on three different MuJoCo environments, Halfcheetah-v1, Walker2d-v1 and Hopper-v1.

Note that in policy gradient for RL applications, the batch size is typically around several thousands, e.g. for continuous control problems, or . This number of sample size is computationally expensive for kernel-based methods, such as control functionals, which are nearly intractable for RL problems. Therefore, we do not consider CFs in this application.

Figure 3 shows that when the sample size is relatively small (), the reparametrized Stein CVs are not stable and may meet “barrier” to train the policy and obtain higher reward efficiently. Favorably, the usage of NCVA in policy gradient is efficient and stable. For the large sample size (), though liu2017action () obtains good performance as stated, our NCVA still outperforms.

## 5 Conclusion & Future work

We have developed the zero-variance control variate methods for Monte Carlo variance reduction by incorporating neural networks into Stein control variates. The neural control variates can deal with complex test function with high-dimensional distribution. Besides, we proposed the augmented variables approach to achieve better generalization for small sample size cases. We demonstrated their effectiveness on two challenging Monte Carlo integration tasks. In the future, we will mainly focus on exploring the theoretical justification on the effectiveness of the augmented variables method. A wiser scheme for constructing the auxiliary random variable also need to be developed.

## References

- [1] Roland Assaraf and Michel Caffarel. Zero-variance principle for Monte Carlo algorithms. Physical review letters, 83(23):4682, 1999.
- [2] Jean Cornuet, JEAN-MICHEL MARIN, Antonietta Mira, and Christian P Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.
- [3] Daan Frenkel and Berend Smit. Understanding molecular simulation: from algorithms to applications, volume 1. Elsevier, 2001.
- [4] Michael B Giles. Multilevel monte carlo methods. In Monte Carlo and Quasi-Monte Carlo Methods 2012, pages 83–103. Springer, 2013.
- [5] Dave Higdon, Jordan D McDonnell, Nicolas Schunck, Jason Sarich, and Stefan M Wild. A bayesian approach for parameter estimation and prediction using a computationally intensive model. Journal of Physics G: Nuclear and Particle Physics, 42(3):034009, 2015.
- [6] Sham M Kakade. A natural policy gradient. In Advances in neural information processing systems, pages 1531–1538, 2002.
- [7] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. nature, 521(7553):436, 2015.
- [8] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Sample-efficient policy optimization with stein control variate. arXiv preprint arXiv:1710.11198, 2017.
- [9] Hao Liu, Yihao Feng, Yi Mao, Dengyong Zhou, Jian Peng, and Qiang Liu. Action-dependent control variates for policy optimization via stein identity. In ICLR, 2018.
- [10] Antonietta Mira, Reza Solgi, and Daniele Imparato. Zero variance markov chain monte carlo for bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.
- [11] Radford M Neal. Bayesian learning for neural networks, volume 118. Springer Science & Business Media, 2012.
- [12] Chris J Oates, Jon Cockayne, François-Xavier Briol, and Mark Girolami. Convergence rates for a class of estimators based on stein’s method. arXiv preprint arXiv:1603.03220, 2016.
- [13] Chris J Oates, Mark Girolami, and Nicolas Chopin. Control functionals for monte carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
- [14] Chris J Oates, Theodore Papamarkou, and Mark Girolami. The controlled thermodynamic integral for bayesian model evidence evaluation. Journal of the American Statistical Association, 111(514):634–645, 2016.
- [15] Christian P Robert. Monte carlo methods. Wiley Online Library, 2004.
- [16] Reuven Y Rubinstein and Dirk P Kroese. Simulation and the Monte Carlo method, volume 10. John Wiley & Sons, 2016.
- [17] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
- [18] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [19] Charles Stein et al. A bound for the error in the normal approximation to the distribution of a sum of dependent random variables. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 2: Probability Theory. The Regents of the University of California, 1972.
- [20] Richard S Sutton and Andrew G Barto. Introduction to reinforcement learning, volume 135. MIT press Cambridge, 1998.
- [21] Emanuel Todorov, Tom Erez, and Yuval Tassa. Mujoco: A physics engine for model-based control. In Intelligent Robots and Systems (IROS), 2012 IEEE/RSJ International Conference on, pages 5026–5033. IEEE, 2012.
- [22] George Tucker, Andriy Mnih, Chris J Maddison, John Lawson, and Jascha Sohl-Dickstein. Rebar: Low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pages 2624–2633, 2017.

## Appendix A Experimental Settings

### a.1 Synthetic Problem

The following kernel function used in [13] is employed in control functionals in our experiments

(23) |

where and which were selected by using cross-validation. The trial function in NCV is a neural network, the architecture of which has two fully connected layers. Each layer has 40 units, and the activation function is RELU. Adam was used to train the network with learning rate 0.008.

### a.2 Goodwin Oscillator

The dynamic system used to describe Goodwin Oscillator with species is defined as:

(24) |

The solution of the dynamic system depends on parameters:
. We set . There are unknown parameteres. The prior of each parameter is assumed a Gamma distribution, i.e., . We consider , , and . The observations were obtained when . We generated data using as in [14]. Given the temperature t, Hamilton Monte Carlo (HMC) sampling was used to draw samples from the power posterior . We consider 10 independent runs of population HMC. Each chain has samples. The chains were thinned for training purposes.

The control functionals control variates used the kernel with parameters (23) with , which were chosen by using cross-validation. The neural network used for the trial functions in NCV and NCVA has two fully connected layers, each of which has 40 units. Adam is used to train the network with learning rate 0.05. For NCVA, the augmented variable was generated by , where . All samples were augmented 100 times.

### a.3 Policy Gradients with Stein Control Variates

We reimplement the proxy policy optimization procedure with stein control varaites ([9]) in three mujoco tasks([21]). The neural network for the trial function in NCV and NCVA has three MLP layers, each of which has 100 units. Adam is used to train the network with learning rate 0.0005. The policy was assumed a Gaussian where and diagonal were both constructed as a MLP with three layers. In NCVA, all samples are augmened 100 times. The augmented variable is generated by , where .