Data Generation as Sequential Decision Making


Philip Bachman
McGill University, School of Computer Science
phil.bachman@gmail.com
&Doina Precup
McGill University, School of Computer Science
dprecup@cs.mcgill.ca
Abstract

We connect a broad class of generative models through their shared reliance on sequential decision making. Motivated by this view, we develop extensions to an existing model, and then explore the idea further in the context of data imputation – perhaps the simplest setting in which to investigate the relation between unconditional and conditional generative modelling. We formulate data imputation as an MDP and develop models capable of representing effective policies for it. We construct the models using neural networks and train them using a form of guided policy search [11]. Our models generate predictions through an iterative process of feedback and refinement. We show that this approach can learn effective policies for imputation problems of varying difficulty and across multiple datasets.

 


1 Introduction

Directed generative models are naturally interpreted as specifying sequential procedures for generating data. We traditionally think of this process as sampling, but one could also view it as making sequences of decisions for how to set the variables at each node in a model, conditioned on the settings of its parents, thereby generating data from the model. The large body of existing work on reinforcement learning provides powerful tools for addressing such sequential decision making problems. We encourage the use of these tools to understand and improve the extended processes currently driving advances in generative modelling. We show how sequential decision making can be applied to general prediction tasks by developing models which construct predictions by iteratively refining a working hypothesis under guidance from exogenous input and endogenous feedback.

We begin this paper by reinterpreting several recent generative models as sequential decision making processes, and then show how changes inspired by this point of view can improve the performance of the LSTM-based model introduced in [4]. Next, we explore the connections between directed generative models and reinforcement learning more fully by developing an approach to training policies for sequential data imputation. We base our approach on formulating imputation as a finite-horizon Markov Decision Process which one can also interpret as a deep, directed graphical model.

We propose two policy representations for the imputation MDP. One extends the model in [4] by inserting an explicit feedback loop into the generative process, and the other addresses the MDP more directly. We train our models/policies using techniques motivated by guided policy search [11, 12, 13, 10]. We examine their qualitative and quantitative performance across imputation problems covering a range of difficulties (i.e. different amounts of data to impute and different “missingness mechanisms”), and across multiple datasets. Given the relative paucity of existing approaches to the general imputation problem, we compare our models to each other and to two simple baselines. We also test how our policies perform when they use fewer/more steps to refine their predictions.

As imputation encompasses both classification and standard (i.e. unconditional) generative modelling, our work suggests that further study of models for the general imputation problem is worthwhile. The performance of our models suggests that sequential stochastic construction of predictions, guided by both input and feedback, should prove useful for a wide range of problems. Training these models can be challenging, but lessons from reinforcement learning may bring some relief.

2 Directed Generative Models as Sequential Decision Processes

Directed generative models have grown in popularity relative to their undirected counterparts [8, 16, 14, 5, 7, 18, 17] (etc.). Reasons include: the development of efficient methods for training them, the ease of sampling from them, and the tractability of bounds on their log-likelihoods. Growth in available computing power compounds these benefits. One can interpret the (ancestral) sampling process in a directed model as repeatedly setting subsets of the latent variables to particular values, in a sequence of decisions conditioned on preceding decisions. Each subsequent decision restricts the set of potential outcomes for the overall sequence. Intuitively, these models encode stochastic procedures for constructing plausible observations. This section formally explores this perspective.

2.1 Deep AutoRegressive Networks

The deep autoregressive networks investigated in [5] define distributions of the following form:

p(x) = \sum_{z_1, \ldots, z_T} p(x \mid z_T, \ldots, z_1) \, p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \ldots, z_1)    (1)

in which x indicates a generated observation and z_1, ..., z_T represent latent variables in the model. The distribution p(x | z_T, ..., z_1) may be factored similarly to the distributions over the latent variables. The form of p(x) in Eqn. 1 can represent arbitrary distributions over the latent variables, and the work in [5] mainly concerned approaches to parameterizing the conditionals p(z_t | z_{t-1}, ..., z_1) that restricted representational power in exchange for computational tractability. To appreciate the generality of Eqn. 1, consider using z_t that are univariate, multivariate, structured, etc. One can interpret any model based on this sequential factorization of p(x) as a non-stationary policy for selecting each action z_t in a state s_t, with each s_t determined by all z_{t'} for t' < t, and train it using some form of policy search.
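To make the policy view concrete, the sketch below samples from a model factorized as in Eqn. 1 by treating each conditional as a decision taken in the "state" formed by earlier choices. It is illustrative only: the diagonal Gaussian conditionals, affine parameterization, and dimensions are our own placeholders, not the parameterization studied in [5].

```python
import numpy as np

rng = np.random.default_rng(0)
T, z_dim, x_dim = 5, 8, 16

# Hypothetical parameters: each conditional p(z_t | z_{t-1}, ..., z_1) is a
# diagonal Gaussian whose mean is an affine function of the earlier z's.
W_z = [rng.normal(0, 0.1, size=(t * z_dim, z_dim)) for t in range(T)]
b_z = [np.zeros(z_dim) for _ in range(T)]
W_x = rng.normal(0, 0.1, size=(T * z_dim, x_dim))   # for p(x | z_T, ..., z_1)

def sample_trajectory():
    """Ancestral sampling viewed as executing a non-stationary policy: the
    'state' at step t is the tuple of earlier decisions (z_1, ..., z_{t-1})."""
    state = []
    for t in range(T):
        s_t = np.concatenate(state) if state else np.zeros(0)
        mean = s_t @ W_z[t] + b_z[t]                 # policy for step t
        z_t = mean + rng.normal(size=z_dim)          # decision (unit variance)
        state.append(z_t)
    # Terminal decision: emit an observation conditioned on the whole state.
    x = np.concatenate(state) @ W_x + 0.1 * rng.normal(size=x_dim)
    return state, x

latents, x = sample_trajectory()
print(len(latents), x.shape)                         # 5 (16,)
```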

2.2 Generalized Guided Policy Search

We adopt a broader interpretation of guided policy search than one might initially take from, e.g., [11, 12, 13, 10]. We provide a review of guided policy search in the supplementary material. Our expanded definition of guided policy search includes any optimization of the general form:

\min_{p, q} \; \mathbb{E}_{i_q \sim I_q} \, \mathbb{E}_{i_p \sim I_p} \Big[ \mathbb{E}_{\tau \sim q(\tau \mid i_q, i_p)} \big[ \ell(\tau, i_p) \big] + \lambda \, \mathrm{div}\big( p(\tau \mid i_p), \, q(\tau \mid i_q, i_p) \big) \Big]    (2)

in which p indicates the primary policy, q indicates the guide policy, I_q indicates a distribution over information available only to q, I_p indicates a distribution over information available to both p and q, ℓ(τ, i_p) computes the cost of trajectory τ in the context of i_p, and div(·, ·) measures dissimilarity between the trajectory distributions generated by p and q. As λ goes to infinity, Eqn. 2 enforces the constraint p(τ | i_p) = q(τ | i_q, i_p). Terms for controlling, e.g., the entropy of q can also be added. The power of the objective in Eq. 2 stems from two main points: the guide policy q can use information i_q that is unavailable to the primary policy p, and the primary policy need only be trained to minimize the dissimilarity term div(p(τ | i_p), q(τ | i_q, i_p)).
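As a toy illustration of Eqn. 2 (not the training setup used later in the paper), the sketch below forms a Monte-Carlo estimate of the objective for one-step Gaussian "trajectories", using an analytic KL between diagonal Gaussians as the divergence. The cost function, affine policies, and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
dim, n_trials, lam = 4, 256, 10.0

def gauss_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, exp(logvar_q)) || N(mu_p, exp(logvar_p)) ), diagonal."""
    return 0.5 * np.sum(logvar_p - logvar_q
                        + (np.exp(logvar_q) + (mu_q - mu_p) ** 2) / np.exp(logvar_p)
                        - 1.0)

def estimate_objective(primary, guide, lam=lam):
    """Monte-Carlo estimate of Eqn. 2 for one-step trajectories.
    primary(i_p) and guide(i_q, i_p) return (mean, logvar) of a Gaussian."""
    total = 0.0
    for _ in range(n_trials):
        i_q = rng.normal(size=dim)            # guide-only information
        i_p = rng.normal(size=dim)            # information shared with p
        mu_q, lv_q = guide(i_q, i_p)
        mu_p, lv_p = primary(i_p)
        tau = mu_q + np.exp(0.5 * lv_q) * rng.normal(size=dim)  # tau ~ q
        cost = np.sum((tau - i_p) ** 2)       # stand-in cost ell(tau, i_p)
        total += cost + lam * gauss_kl(mu_q, lv_q, mu_p, lv_p)
    return total / n_trials

# Hypothetical affine policies, just to make the sketch executable.
A_p, A_q = rng.normal(size=(dim, dim)), rng.normal(size=(dim, dim))
primary = lambda i_p: (i_p @ A_p, np.zeros(dim))
guide = lambda i_q, i_p: (0.5 * (i_q + i_p) @ A_q, np.zeros(dim))
print(estimate_objective(primary, guide))
```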

For example, a directed model structured as in Eqn. 1 can be interpreted as specifying a policy for a finite-horizon MDP whose terminal state distribution encodes p(x). In this MDP, the state at time t is determined by the previously selected {z_1, ..., z_{t-1}}. The policy picks an action z_t at each time t up to T, and picks an action x at time T+1. I.e., the policy can be written as p(z_t | z_{t-1}, ..., z_1) for t up to T, and as p(x | z_T, ..., z_1) for the final step. The initial state is drawn from p(z_1). Executing the policy for a single trial produces a trajectory τ = {z_1, ..., z_T, x}, and the distribution over the terminal x from these trajectories is just p(x) in the corresponding directed generative model.

The authors of [5] train deep autoregressive networks by maximizing a variational lower bound on the training set log-likelihood. To do this, they introduce a variational distribution which provides q(z_1 | x) and q(z_t | z_{t-1}, ..., z_1, x) for t in 2, ..., T, with the final step given by a Dirac-delta at x. Given these definitions, the training in [5] can be interpreted as guided policy search for the MDP described in the previous paragraph. Specifically, the variational distribution provides a guide policy over trajectories τ = {z_1, ..., z_T, x}:

q(\tau \mid x) = \delta_x(\hat{x}) \, q(z_1 \mid x) \prod_{t=2}^{T} q(z_t \mid z_{t-1}, \ldots, z_1, x)    (3)

The primary policy generates trajectories distributed according to:

p(\tau) = p(\hat{x} \mid z_T, \ldots, z_1) \, p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \ldots, z_1)    (4)

which does not depend on x. In this case, x corresponds to the guide-only information i_q in Eqn. 2. We now rewrite the variational optimization as:

\min_{p, q} \; \mathbb{E}_{x \sim \mathcal{D}} \big[ \mathrm{KL}\big( q(\tau \mid x) \,\|\, p(\tau) \big) \big]    (5)

where the cost ℓ is identically zero and D indicates the target distribution for the terminal state of the primary policy p. (We could pull the log p(x | z_T, ..., z_1) term out of the KL and put it in the cost ℓ, but we prefer the “path-wise KL” formulation for its elegance; we abuse notation slightly in writing this KL.) When expanded, the KL term in Eqn. 5 becomes:

\mathrm{KL}\big( q(\tau \mid x) \,\|\, p(\tau) \big) = \mathbb{E}_{\tau \sim q(\tau \mid x)} \left[ \log \frac{ q(z_1 \mid x) \prod_{t=2}^{T} q(z_t \mid z_{t-1}, \ldots, z_1, x) }{ p(x \mid z_T, \ldots, z_1) \, p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1}, \ldots, z_1) } \right]    (6)

Thus, the variational approach used in [5] for training directed generative models can be interpreted as a form of generalized guided policy search. As the form in Eqn. 1 can represent any finite directed generative model, the preceding derivation extends to all models we discuss in this paper. (This also includes all generative models implemented and executed on an actual computer.)

2.3 Time-reversible Stochastic Processes

One can simplify Eqn. 1 by assuming suitable forms for p(z_t | z_{t-1}, ..., z_1) and p(x | z_T, ..., z_1). E.g., the authors of [18] proposed a model in which p(z_t | z_{t-1}, ..., z_1) = p(z_t | z_{t-1}) for all t and p(z_1) was Gaussian. We can write their model as:

p(x) = \sum_{z_T} p(x \mid z_T) \, p_T(z_T), \quad \text{with} \;\; p_T(z_T) = \sum_{z_1, \ldots, z_{T-1}} p(z_1) \prod_{t=2}^{T} p(z_t \mid z_{t-1})    (7)

where p_T(z_T) indicates the terminal state distribution of the non-stationary, finite-horizon Markov process determined by p(z_1) and the transition distributions p(z_t | z_{t-1}). Note that, throughout this paper, we (ab)use sums over latent variables and trajectories which could/should be written as integrals.

The authors of [18] observed that, for any reasonably smooth target distribution p(x) and sufficiently large T, one can define a “reverse-time” stochastic process q with simple, time-invariant dynamics that transforms the target distribution into the Gaussian distribution p(z_1). This is given by:

q(\tau \mid x) = q(z_T \mid x) \prod_{t=1}^{T-1} q(z_t \mid z_{t+1})    (8)

Next, we define q(τ | x) as the distribution over trajectories generated by the reverse-time process determined by Eqn. 8. We define p(τ) as the distribution over trajectories generated by the “forward-time” process in Eqn. 7. The training in [18] is equivalent to guided policy search using guide trajectories sampled from q, i.e. it uses the objective:

\min_{p} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim q(\tau \mid x)} \big[ \log q(\tau \mid x) - \log p(\tau) \big]    (9)

which corresponds to minimizing KL(q(τ | x) || p(τ)), in expectation over x ~ D. If the log-densities in Eqn. 9 are tractable, then this minimization can be done using basic Monte-Carlo. If, as in [18], the reverse-time process q is not trained, then Eqn. 9 simplifies to minimizing E_{x ~ D} E_{τ ~ q(τ | x)}[-log p(τ)].
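The sketch below illustrates the guide-trajectory trick with a 1-D example: a fixed, untrained reverse-time process starts at x drawn from the data and repeatedly blends it toward a standard Gaussian, and the resulting trajectories provide Monte-Carlo targets for -log p(τ) under a forward-time Gaussian chain. The noise schedule and the parameterization of p are placeholders, not the ones used in [18].

```python
import numpy as np

rng = np.random.default_rng(2)
T, beta = 20, 0.15          # steps and per-step noise level (placeholder)

def reverse_time_trajectory(x):
    """Run the fixed reverse-time process: start near the data point x and
    blend toward an isotropic Gaussian, yielding a guide trajectory z_1..z_T."""
    z = [x + 0.05 * rng.normal()]                       # z_T, near the data
    for _ in range(T - 1):
        z.append(np.sqrt(1 - beta) * z[-1] + np.sqrt(beta) * rng.normal())
    return z[::-1]                                      # ordered z_1 ... z_T

def neg_log_p(traj, mus, logvars):
    """-log p(tau) for a forward-time Gaussian chain p(z_t | z_{t-1}) whose
    mean is mus[t] * z_{t-1}; p(z_1) is standard normal. (Placeholder model.)"""
    nll = 0.5 * (traj[0] ** 2 + np.log(2 * np.pi))
    for t in range(1, T):
        var = np.exp(logvars[t])
        nll += 0.5 * ((traj[t] - mus[t] * traj[t - 1]) ** 2 / var
                      + logvars[t] + np.log(2 * np.pi))
    return nll

# Data distribution D: a two-component mixture; guide trajectories end near D.
xs = np.where(rng.random(64) < 0.5, -2.0, 2.0) + 0.1 * rng.normal(size=64)
mus, logvars = np.full(T, np.sqrt(1 - beta)), np.full(T, np.log(beta))
loss = np.mean([neg_log_p(reverse_time_trajectory(x), mus, logvars) for x in xs])
print(loss)   # Monte-Carlo estimate of E_x E_{tau~q} [ -log p(tau) ]
```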

This trick for generating guide trajectories exhibiting a particular distribution over terminal states – i.e. running dynamics backwards in time starting from x ~ D – may prove useful in settings other than those considered in [18]. E.g., the LapGAN model in [2] learns to approximately invert a fixed (and information destroying) reverse-time process. The supplementary material expands on the content of this subsection, including a derivation of Eqn. 9 as a bound on E_{x ~ D}[log p(x)].

2.4 Learning Generative Stochastic Processes with LSTMs

The authors of [4] introduced a model for sequentially-deep generative processes. We interpret their model as a primary policy which generates trajectories with distribution:

p(\tau) = p(x \mid s_T) \prod_{t=1}^{T} p(z_t), \quad \text{with} \;\; s_t = f_p(s_{t-1}, z_t)    (10)

in which {z_1, ..., z_T} indicates a latent trajectory and {s_1, ..., s_T} indicates a state trajectory computed recursively from the latent trajectory using the update s_t = f_p(s_{t-1}, z_t) for t ≥ 1. The initial state s_0 is given by a trainable constant. Each state s_t represents the joint hidden/visible state of an LSTM and f_p computes a standard LSTM update. (For those unfamiliar with LSTMs, a good introduction can be found in [3]. We use LSTMs including input gates, forget gates, output gates, and peephole connections for all tests presented in this paper.) The authors of [4] defined all p(z_t) as isotropic Gaussians and defined the output distribution as p(x | c_T), where c_T = c_{T-1} + ω(s_T). Here, c_0 is a trainable constant and ω is, e.g., an affine transform of the LSTM state. Intuitively, ω(s_t) transforms s_t into a refinement of the “working hypothesis” c_{t-1}, which gets updated to c_t. The primary policy is governed by parameters which affect f_p, ω, s_0, and c_0. The supplementary material provides pseudo-code and an illustration for this model.
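A minimal numpy sketch of this open-loop generative rollout follows. It uses a simplified LSTM cell without peephole connections, random untrained weights, and a Bernoulli output; all sizes are invented for the example, so it only shows the control flow of Eqn. 10, not the trained model of [4].

```python
import numpy as np

rng = np.random.default_rng(3)
T, z_dim, h_dim, x_dim = 8, 16, 64, 784
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

# Untrained parameters: one weight matrix for the 4 LSTM gates, plus the
# "write" transform omega and trainable initial constants s_0, c_0.
W = 0.05 * rng.normal(size=(z_dim + h_dim, 4 * h_dim))
b = np.zeros(4 * h_dim)
W_write = 0.05 * rng.normal(size=(h_dim, x_dim))    # omega: state -> canvas delta
h0, m0, c0 = np.zeros(h_dim), np.zeros(h_dim), np.zeros(x_dim)

def lstm_step(h, m, z):
    """Simplified LSTM update (no peepholes): returns new hidden/cell state."""
    gates = np.concatenate([z, h]) @ W + b
    i, f, o, g = np.split(gates, 4)
    m_new = sigmoid(f) * m + sigmoid(i) * np.tanh(g)
    h_new = sigmoid(o) * np.tanh(m_new)
    return h_new, m_new

def generate():
    """Open-loop rollout: z_t ~ N(0, I), s_t = f_p(s_{t-1}, z_t),
    c_t = c_{t-1} + omega(s_t), and finally x ~ Bernoulli(sigmoid(c_T))."""
    h, m, canvas = h0, m0, c0
    for _ in range(T):
        z_t = rng.normal(size=z_dim)          # isotropic Gaussian prior p(z_t)
        h, m = lstm_step(h, m, z_t)
        canvas = canvas + h @ W_write         # additive refinement of the guess
    return rng.random(x_dim) < sigmoid(canvas)

print(generate().astype(int)[:20])
```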

To train p, the authors of [4] introduced a guide policy q with trajectory distribution:

q(\tau \mid x) = \delta_x(\hat{x}) \prod_{t=1}^{T} q(z_t \mid \tilde{s}_{t-1})    (11)

in which {\tilde{s}_1, ..., \tilde{s}_T} indicates a state trajectory computed recursively from the latent trajectory using the guide policy’s state update \tilde{s}_t = f_q(\tilde{s}_{t-1}, v_t). In this update \tilde{s}_{t-1} is the previous guide state and v_t is a deterministic function of x and the partial (primary) state trajectory {s_1, ..., s_{t-1}}, which is computed recursively from the latent trajectory using the state update s_t = f_p(s_{t-1}, z_t). The output distribution is defined as a Dirac-delta at x. (It may be useful to relax this assumption.) Each q(z_t | \tilde{s}_{t-1}) is a diagonal Gaussian distribution with means and log-variances given by an affine function of \tilde{s}_{t-1}. The guide update f_q, like f_p, computes a standard LSTM update. The guide policy is governed by parameters which affect the state updates and the step distributions q(z_t | \tilde{s}_{t-1}). The read function v_t corresponds to the “read” operation of the encoder network in [4].

Using our definitions for p and q, the training objective in [4] is given by:

\min_{p, q} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim q(\tau \mid x)} \left[ -\log p(x \mid s_T) + \sum_{t=1}^{T} \mathrm{KL}\big( q(z_t \mid \tilde{s}_{t-1}) \,\|\, p(z_t) \big) \right]    (12)

which can be written more succinctly as E_{x ~ D}[KL(q(τ | x) || p(τ))]. This objective upper-bounds E_{x ~ D}[-log p(x)], where p(x) is the marginal likelihood of the generative process in Eqn. 10.
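Given per-step Gaussian parameters from the guide and the isotropic prior, the bound in Eqn. 12 is just a reconstruction term plus a sum of per-step KLs. The schematic computation below evaluates these terms for one training example; the guide statistics and final canvas are stand-ins for what the encoder/decoder rollouts would actually produce, and the names and shapes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
T, z_dim, x_dim = 8, 16, 784

def gauss_kl_to_standard(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian step."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar)

def bernoulli_nll(x, logits):
    """-log p(x | c_T) for a Bernoulli output parameterized by logits c_T."""
    return np.sum(np.logaddexp(0.0, logits) - x * logits)

# Stand-ins for quantities the encoder/decoder would actually produce:
x = (rng.random(x_dim) < 0.5).astype(float)        # a binarized training image
guide_mu = 0.1 * rng.normal(size=(T, z_dim))       # q(z_t | s~_{t-1}) means
guide_logvar = np.full((T, z_dim), -1.0)           # q(z_t | s~_{t-1}) log-vars
final_canvas = 0.1 * rng.normal(size=x_dim)        # c_T from the decoder rollout

loss = bernoulli_nll(x, final_canvas) + sum(
    gauss_kl_to_standard(guide_mu[t], guide_logvar[t]) for t in range(T))
print(loss)   # one-sample estimate of the bound in Eqn. 12
```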

2.5 Extending the LSTM-based Generative Model

We propose changing the prior p(z_t) in Eqn. 10 to p(z_t | s_{t-1}). We define p(z_t | s_{t-1}) as a diagonal Gaussian distribution with means and log-variances given by an affine function of s_{t-1} (remember that s_t = f_p(s_{t-1}, z_t)), and we define p(z_0) as an isotropic Gaussian. We set the initial state using s_0 = g_p(z_0), where g_p is a trainable function (e.g. a neural network). Intuitively, our changes make the model more like a typical policy by conditioning its “action” z_t on its state s_{t-1}, and upgrade the model to an infinite mixture by placing a distribution over its initial state s_0. We also consider using c_t = ω(s_t), which transforms the hidden part of the LSTM state directly into an observation. This makes the LSTM state a working memory in which to construct an observation. The supplementary material provides pseudo-code and an illustration for this model.
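Relative to the rollout sketched above, only the initialization and the distribution over z_t change. The compact sketch below shows just those modified pieces, with untrained affine maps standing in for the learned functions and a plain tanh RNN replacing the LSTM to keep it short; it is a sketch of the mechanism, not the released implementation.

```python
import numpy as np

rng = np.random.default_rng(5)
T, z_dim, h_dim, x_dim = 8, 16, 64, 784

# Untrained stand-ins for the learned maps.
W_s = 0.05 * rng.normal(size=(z_dim + h_dim, h_dim))          # state update f_p
W_init = 0.05 * rng.normal(size=(z_dim, h_dim))               # g_p: z_0 -> s_0
W_prior = 0.05 * rng.normal(size=(h_dim, 2 * z_dim))          # p(z_t | s_{t-1})
W_write = 0.05 * rng.normal(size=(h_dim, x_dim))              # omega

def generate():
    # Infinite-mixture initialization: z_0 ~ N(0, I), s_0 = g_p(z_0).
    z0 = rng.normal(size=z_dim)
    s = np.tanh(z0 @ W_init)
    canvas = np.zeros(x_dim)
    for _ in range(T):
        # Closed-loop prior: the "action" z_t is conditioned on the state.
        mu, logvar = np.split(s @ W_prior, 2)
        z_t = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
        s = np.tanh(np.concatenate([z_t, s]) @ W_s)      # s_t = f_p(s_{t-1}, z_t)
        canvas = canvas + s @ W_write                     # c_t = c_{t-1} + omega(s_t)
    return canvas                                         # logits for p(x | c_T)

print(generate()[:5])
```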

We train this model by optimizing the objective:

\min_{p, q} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{\tau \sim q(\tau \mid x)} \left[ -\log p(x \mid s_T) + \mathrm{KL}\big( q(z_0 \mid x) \,\|\, p(z_0) \big) + \sum_{t=1}^{T} \mathrm{KL}\big( q(z_t \mid \tilde{s}_{t-1}) \,\|\, p(z_t \mid s_{t-1}) \big) \right]    (13)

where we now have to deal with p(z_t | s_{t-1}), p(z_0), and q(z_0 | x), which could be treated as constants in the model from [4]. We define q(z_0 | x) as a diagonal Gaussian distribution whose means and log-variances are given by a trainable function of x.

Figure 1: The left block shows the evolving working hypothesis c_t over several refinement steps, for a policy with additive updates c_t = c_{t-1} + ω(s_t). The right block is analogous, for a model using c_t = ω(s_t).

When trained for the binarized MNIST benchmark used in [4], our extended model scored a negative log-likelihood of 85.5 on the test set (data splits from: http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist). For comparison, the score reported in [4] was 87.4 (the model in [4] significantly improves its score to 80.97 when using an image-specific architecture). After fine-tuning the variational distribution (i.e. q) on the test set, our model’s score improved to 84.8, which is quite strong considering it is an upper bound. For comparison, see the best upper bound reported for this benchmark in [17], which was 85.1. When the model used the alternate c_t = ω(s_t), the raw/fine-tuned test scores were 85.9/85.3. Fig. 1 shows samples from the model. Model/test code is available at http://github.com/Philip-Bachman/Sequential-Generation.

3 Developing Models for Sequential Imputation

The goal of imputation is to estimate p(x_u | x_k), where x indicates a complete observation with known values x_k and missing values x_u. We define a mask m as a (disjoint) partition of x into {x_k, x_u}. By expanding x_u to include all of x, one recovers standard generative modelling. By shrinking x_u to include a single element of x, one recovers standard classification/regression. Given a distribution D over complete observations x and a distribution M over masks m, the objective for imputation is:

\min_{p} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{m \sim \mathcal{M}} \big[ -\log p(x_u \mid x_k) \big]    (14)

We now describe a finite-horizon MDP for which guided policy search minimizes a bound on the objective in Eqn. 14. The MDP is defined by the mask distribution M, the complete observation distribution D, and the state spaces associated with each of its T steps. Together, D and M define a joint distribution over initial states and rewards in the MDP. For the trial determined by x ~ D and m ~ M, the initial state is selected by the policy based on the known values x_k. The cost suffered by trajectory τ in the context {x, m} is given by -log p(x_u | τ, x_k), i.e. the negative log-likelihood of guessing the missing values x_u after following trajectory τ, while seeing the known values x_k.

We consider a policy p with trajectory distribution p(τ | x_k), where x_k is determined by {x, m} for the current trial and p can’t observe the missing values x_u. With these definitions, we can find an approximately optimal imputation policy by solving:

\min_{p} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{m \sim \mathcal{M}} \, \mathbb{E}_{\tau \sim p(\tau \mid x_k)} \big[ -\log p(x_u \mid \tau, x_k) \big]    (15)

I.e. the expected negative log-likelihood of making a correct imputation on any given trial. This is a valid, but loose, upper bound on the imputation objective in Eq. 14 (from Jensen’s inequality). We can tighten the bound by introducing a guide policy (i.e. a variational distribution).

As with the unconditional generative models in Sec. 2, we train p to imitate a guide policy q shaped by additional information (here it’s the missing values x_u). This q generates trajectories with distribution q(τ | x_u, x_k). Given this q and p, guided policy search solves:

\min_{p, q} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{m \sim \mathcal{M}} \Big[ \mathbb{E}_{\tau \sim q(\tau \mid x_u, x_k)} \big[ -\log p(x_u \mid \tau, x_k) \big] + \lambda \, \mathrm{div}\big( p(\tau \mid x_k), \, q(\tau \mid x_u, x_k) \big) \Big]    (16)

where, in terms of Eqn. 2, the guide-only information is the missing values x_u, the shared information is the known values x_k (together with the mask m), and the cost is the negative log-likelihood of the true missing values under the imputation produced along trajectory τ.
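A small sketch of the per-trial cost in this MDP: sample a complete observation and a mask, let the policy see only the known values, and charge it the negative log-likelihood of the missing values under its final guess. The Bernoulli output model, the mask distribution, and the trivial "guess" are placeholders for illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
x_dim = 784

def trial_cost(x, mask, final_logits):
    """Cost of one trajectory: -log p(x_u | final guess), summed over the
    missing entries only (mask == 1 marks missing values x_u)."""
    nll = np.logaddexp(0.0, final_logits) - x * final_logits   # per-pixel Bernoulli NLL
    return np.sum(nll * mask)

x = (rng.random(x_dim) < 0.5).astype(float)      # complete observation from D
mask = (rng.random(x_dim) < 0.8).astype(float)   # MCAR-80-style mask from M
x_known = x * (1.0 - mask)                        # what the policy may observe
final_logits = np.zeros(x_dim)                    # a trivial "uniform" final guess
print(trial_cost(x, mask, final_logits))          # roughly 0.8 * 784 * log 2
```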

3.1 A Direct Representation for Sequential Imputation Policies

We define an imputation trajectory as {c_1, ..., c_{T+1}}, where each partial imputation c_t is computed from a partial step trajectory {z_1, ..., z_{t-1}}. A partial imputation c_t encodes the policy’s guess for the missing values x_u immediately prior to selecting step z_t, and c_{T+1} gives the policy’s final guess. At each step of iterative refinement, the policy selects a z_t based on c_t and the known values x_k, and then updates its guesses to c_{t+1} based on z_t and c_t. By iteratively refining its guesses based on feedback from earlier guesses and the known values, the policy can construct complexly structured distributions over its final guess c_{T+1} after just a few steps. This happens naturally, without any post-hoc MRFs/CRFs (as in many approaches to structured prediction), and without sampling the values in x_u one at a time (as required by existing NADE-type models [9]). This property of our approach should prove useful for many tasks.

We consider two ways of updating the guesses, mirroring those described in Sec. 2. The first way sets c_{t+1} = c_t + ω(z_t), where ω is a trainable function. We set c_1 using a trainable bias. The second way sets c_{t+1} = ω(z_t). We indicate models using the first type of update with the suffix -add, and models using the second type of update with -jump. Our primary policy selects z_t at each step using p(z_t | c_t, x_k), which we restrict to be a diagonal Gaussian. This is a simple, stationary policy. Together, the step selector p(z_t | c_t, x_k) and the imputation constructor ω fully determine the behaviour of the primary policy. The supplementary material provides pseudo-code and an illustration for this model.
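An illustrative rollout of the direct policy with the two guess-update rules follows. Untrained affine maps stand in for the step selector and the imputation constructor, and the variable names are ours, not those of the released code; the point is only to show the -add versus -jump control flow.

```python
import numpy as np

rng = np.random.default_rng(7)
T, z_dim, x_dim = 6, 32, 784

W_policy = 0.05 * rng.normal(size=(2 * x_dim, 2 * z_dim))  # p(z_t | c_t, x_k)
W_constr = 0.05 * rng.normal(size=(z_dim, x_dim))          # imputation constructor
c_bias = np.zeros(x_dim)                                   # trainable initial guess

def impute(x_known, mask, update="add"):
    """Iteratively refine a guess for the missing values; the policy only ever
    sees its own guess c_t and the known values x_k."""
    c = c_bias.copy()
    for _ in range(T):
        inp = np.concatenate([c, x_known])
        mu, logvar = np.split(inp @ W_policy, 2)
        z_t = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
        if update == "add":                    # GPSI-add: additive refinement
            c = c + z_t @ W_constr
        else:                                  # GPSI-jump: replace the guess
            c = z_t @ W_constr
    # Keep the known values; the guess only fills in the missing entries.
    return x_known * (1 - mask) + c * mask

x = (rng.random(x_dim) < 0.5).astype(float)
mask = (rng.random(x_dim) < 0.8).astype(float)
print(impute(x * (1 - mask), mask, update="add")[:8])
```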

We construct a guide policy similarly to the primary policy. The guide policy shares the imputation constructor ω with the primary policy. The guide policy incorporates additional information: the complete observation x for which the primary policy must reconstruct some missing values. The guide policy chooses steps using q(z_t | c_t, x), which we restrict to be a diagonal Gaussian.

We train the primary/guide policy components p(z_t | c_t, x_k), q(z_t | c_t, x), and ω simultaneously on the objective:

\min_{p, q, \omega} \; \mathbb{E}_{x \sim \mathcal{D}} \, \mathbb{E}_{m \sim \mathcal{M}} \, \mathbb{E}_{\tau \sim q(\tau \mid x_u, x_k)} \left[ -\log p(x_u \mid c_{T+1}, x_k) + \sum_{t=1}^{T} \mathrm{KL}\big( q(z_t \mid c_t, x) \,\|\, p(z_t \mid c_t, x_k) \big) \right]    (17)

where the trajectories are sampled from the guide policy q. We train our models using Monte-Carlo roll-outs of q, and stochastic backpropagation as in [8, 16]. Full implementations and test code are available from http://github.com/Philip-Bachman/Sequential-Generation.
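The sketch below computes a one-sample estimate of an objective with the structure of Eqn. 17: the guide (which sees the full x) selects the steps via reparameterized samples, the primary policy's step distributions are evaluated at the same (c_t, x_k) context for the KL terms, and the final guess is scored on the missing values. Gradients would flow through the reparameterized samples in the real implementation [8, 16]; here we only compute the loss value, and all weights are untrained placeholders.

```python
import numpy as np

rng = np.random.default_rng(8)
T, z_dim, x_dim = 6, 32, 784

W_p = 0.05 * rng.normal(size=(2 * x_dim, 2 * z_dim))   # p(z_t | c_t, x_k)
W_q = 0.05 * rng.normal(size=(2 * x_dim, 2 * z_dim))   # q(z_t | c_t, x)
W_c = 0.05 * rng.normal(size=(z_dim, x_dim))           # shared constructor
c_bias = np.zeros(x_dim)

def gauss_kl(mq, lq, mp, lp):
    return 0.5 * np.sum(lp - lq + (np.exp(lq) + (mq - mp) ** 2) / np.exp(lp) - 1)

def loss(x, mask):
    x_known, c, total_kl = x * (1 - mask), c_bias.copy(), 0.0
    for _ in range(T):
        mq, lq = np.split(np.concatenate([c, x]) @ W_q, 2)        # guide step
        mp, lp = np.split(np.concatenate([c, x_known]) @ W_p, 2)  # primary step
        z = mq + np.exp(0.5 * lq) * rng.normal(size=z_dim)        # reparameterized
        total_kl += gauss_kl(mq, lq, mp, lp)
        c = c + z @ W_c                                           # "-add" update
    recon_nll = np.sum((np.logaddexp(0, c) - x * c) * mask)       # -log p(x_u | c)
    return recon_nll + total_kl

x = (rng.random(x_dim) < 0.5).astype(float)
mask = (rng.random(x_dim) < 0.8).astype(float)
print(loss(x, mask))
```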

3.2 Representing Sequential Imputation Policies using LSTMs

To make it useful for imputation, which requires conditioning on the exogenous information x_k, we modify the LSTM-based model from Sec. 2.5 to include a “read” operation in its primary policy. We incorporate a read operation by spreading the primary policy over two LSTMs, a reader and a writer, which respectively “read” and “write” an imputation trajectory {c_1, ..., c_{T+1}}. Conveniently, the guide policy for this model takes the same form as the primary policy’s reader. This model also includes an “infinite mixture” initialization step, as used in Sec. 2.5, but modified to incorporate conditioning on the known values x_k (and, for the guide policy, the complete observation x). The supplementary material provides pseudo-code and an illustration for this model.

Following the infinite mixture initialization step, a single full step of execution for the primary policy involves several substeps: it first updates the reader state using a read of the known values x_k and the current guess c_t, then selects a step z_t from p(z_t | s^r_t), then updates the writer state s^w_t using z_t, and finally updates its guesses by setting c_{t+1} = c_t + ω(s^w_t) (or c_{t+1} = ω(s^w_t)). In these updates, s^r_t and s^w_t refer to the states of the reader and writer LSTMs. The LSTM updates and the read/write operations are governed by the primary policy’s parameters.

We train the primary policy to imitate trajectories sampled from a guide policy. The guide policy shares the primary policy’s writer updates and write operation ω, but has its own reader updates and read operation. At each step, the guide policy: updates the guide reader state, then selects a step z_t from q(z_t | \tilde{s}_t), then updates the (shared) writer state, and finally updates its guesses via c_{t+1} = c_t + ω(s^w_t) (or c_{t+1} = ω(s^w_t)). As in Sec. 3.1, the guide policy’s read operation gets to see the complete observation x, while the primary policy only gets to see the known values x_k. We restrict the step distributions p(z_t | s^r_t) and q(z_t | \tilde{s}_t) to be diagonal Gaussians whose means and log-variances are affine functions of the corresponding reader states. The training objective has the same form as Eq. 17.
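A schematic of one full execution step of the reader/writer policy is given below. Plain tanh RNN cells stand in for the LSTMs, the read operation is simple concatenation, and all names and sizes are illustrative rather than taken from the released code.

```python
import numpy as np

rng = np.random.default_rng(9)
z_dim, h_dim, x_dim = 32, 64, 784

W_read = 0.05 * rng.normal(size=(2 * x_dim + h_dim, h_dim))   # reader update
W_step = 0.05 * rng.normal(size=(h_dim, 2 * z_dim))           # p(z_t | s^r_t)
W_writ = 0.05 * rng.normal(size=(z_dim + h_dim, h_dim))       # writer update
W_out = 0.05 * rng.normal(size=(h_dim, x_dim))                # write operation

def full_step(s_read, s_write, canvas, x_known):
    """Reader sees (x_k, current guess), proposes z_t; writer consumes z_t
    and refines the guess additively (the "-add" variant)."""
    read_in = np.concatenate([x_known, canvas, s_read])
    s_read = np.tanh(read_in @ W_read)                         # update reader
    mu, logvar = np.split(s_read @ W_step, 2)
    z_t = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)   # select a step
    s_write = np.tanh(np.concatenate([z_t, s_write]) @ W_writ) # update writer
    canvas = canvas + s_write @ W_out                          # refine the guess
    return s_read, s_write, canvas

x_known = np.zeros(x_dim)
state = (np.zeros(h_dim), np.zeros(h_dim), np.zeros(x_dim))
for _ in range(16):
    state = full_step(*state, x_known)
print(state[2][:5])
```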

4 Experiments

Figure 2: (a) Comparing the performance of our imputation models against several baselines, using MNIST digits. The x-axis indicates the % of pixels which were dropped completely at random, and the scores are normalized by the number of imputed pixels. (b) A closer view of results from (a), just for our models. (c) The effect of increased iterative refinement steps for our GPSI models.

We tested the performance of our sequential imputation models on three datasets: MNIST (28x28), SVHN (cropped, 32x32) [15], and TFD (48x48) [19]. We converted images to grayscale and shifted/scaled them to be in the range [0...1] prior to training/testing. We measured the imputation log-likelihood using the true missing values x_u and the models’ guesses, given by the final imputation c_{T+1}. We report negative log-likelihoods, so lower scores are better in all of our tests. We refer to variants of the model from Sec. 3.1 as GPSI-add and GPSI-jump, and to variants of the model from Sec. 3.2 as LSTM-add and LSTM-jump. Except where noted, the GPSI models used 6 refinement steps and the LSTM models used 16. (GPSI stands for “Guided Policy Search Imputer”. The tag “-add” refers to additive guess updates, and “-jump” refers to updates that fully replace the guesses.)

We tested imputation under two types of data masking: missing completely at random (MCAR) and missing at random (MAR). In MCAR, we masked pixels uniformly at random from the source images, and indicate removal of d% of the pixels by MCAR-d. In MAR, we masked square regions, with the occlusions located uniformly at random within the borders of the source image. We indicate occlusion of a d×d square by MAR-d.
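For concreteness, mask generation for the two settings might look like the following sketch (image shapes and the MAR placement convention follow the description above; the exact convention used in our experiments may differ slightly).

```python
import numpy as np

rng = np.random.default_rng(10)

def mcar_mask(height, width, drop_frac):
    """MCAR-d: drop a fraction d of the pixels uniformly at random.
    Returns a {0,1} mask where 1 marks a missing pixel."""
    return (rng.random((height, width)) < drop_frac).astype(float)

def mar_mask(height, width, square):
    """MAR-n: occlude an n x n square placed uniformly at random
    within the image borders."""
    mask = np.zeros((height, width))
    top = rng.integers(0, height - square + 1)
    left = rng.integers(0, width - square + 1)
    mask[top:top + square, left:left + square] = 1.0
    return mask

m1 = mcar_mask(28, 28, 0.80)    # e.g. MCAR-80 on MNIST
m2 = mar_mask(28, 28, 14)       # e.g. MAR-14 on MNIST
print(m1.mean(), m2.sum())      # ~0.80, 196
```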

On MNIST, we tested MCAR-d over a range of drop rates d (see Fig. 2); MCAR-100 corresponds to unconditional generation. On TFD and SVHN we tested MCAR-80. On MNIST, we tested MAR-d for d in {14, 16}. On TFD we tested MAR-25 and on SVHN we tested MAR-17. For test trials we sampled masks from the same distribution used in training, and we sampled complete observations from a held-out test set. Fig. 2 and Tab. 1 present quantitative results from these tests. Fig. 2(c) shows the behavior of our GPSI models when we allowed them fewer/more refinement steps.

             MNIST             TFD               SVHN
           MAR-14  MAR-16   MCAR-80  MAR-25   MCAR-80  MAR-17
LSTM-add     170     167      1381    1377      525      568
LSTM-jump    172     169        --      --       --       --
GPSI-add     177     175      1390    1380      531      569
GPSI-jump    183     177      1394    1384      540      572
VAE-imp      374     394      1416    1399      567      624

Table 1: Imputation performance in various settings. Details of the tests are provided in the main text. Lower scores are better. Due to time constraints, we did not test LSTM-jump on TFD or SVHN. These scores are normalized for the number of imputed pixels.

We tested our models against three baselines. The baselines were “variational auto-encoder imputation”, honest template matching, and oracular template matching. VAE imputation ran multiple steps of VAE reconstruction, with the known values held fixed and the missing values re-estimated with each reconstruction step. (We discuss some deficiencies of VAE imputation in the supplementary material.) After 16 refinement steps, we scored the VAE based on its best guesses. Honest template matching guessed the missing values based on the training image which best matched the test image’s known values. Oracular template matching was like honest template matching, but matched directly on the missing values.
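Schematically, the VAE-imputation baseline alternates reconstruction with clamping of the known values. The encoder/decoder below are untrained affine stand-ins purely to show the control flow; the actual baseline used a trained VAE.

```python
import numpy as np

rng = np.random.default_rng(11)
x_dim, z_dim, steps = 784, 64, 16

W_enc = 0.05 * rng.normal(size=(x_dim, 2 * z_dim))   # stand-in encoder
W_dec = 0.05 * rng.normal(size=(z_dim, x_dim))       # stand-in decoder
sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def vae_impute(x, mask):
    """Iterative VAE imputation: encode/decode the current guess, then
    overwrite the known entries with their true values after every step."""
    guess = x * (1 - mask) + 0.5 * mask          # initialize missing values
    for _ in range(steps):
        mu, logvar = np.split(guess @ W_enc, 2)
        z = mu + np.exp(0.5 * logvar) * rng.normal(size=z_dim)
        recon = sigmoid(z @ W_dec)               # VAE reconstruction
        guess = x * (1 - mask) + recon * mask    # clamp the known values
    return guess

x = (rng.random(x_dim) < 0.5).astype(float)
mask = (rng.random(x_dim) < 0.8).astype(float)
print(vae_impute(x, mask)[:5])
```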

Our models significantly outperformed the baselines. In general, the LSTM-based models outperformed the more direct GPSI models. We evaluated the log-likelihood of imputations produced by our models using the lower bounds provided by the variational objectives with respect to which they were trained. Evaluating the template-based imputations was straightforward. For VAE imputation, we used the expected log-likelihood of the imputations sampled from multiple runs of the 16-step imputation process. This provides a valid, but loose, lower bound on their log-likelihood.

Figure 3: This figure illustrates the policies learned by our models. (a): models trained for (MNIST, MAR-). From top to bottom the models are: GPSI-add, GPSI-jump, LSTM-add, LSTM-jump. (b): models trained for (TFD, MAR-), with models in the same order as (a) – but without LSTM-jump. (c): models trained for (SVHN, MAR-), with models arranged as for (b).

As shown in Fig. 3, the imputations produced by our models appear promising. The imputations are generally of high quality, and the models are capable of capturing strongly multi-modal reconstruction distributions (see subfigure (a)). The behavior of GPSI models changed intriguingly when we swapped the imputation constructor: using the -jump imputation constructor, the imputation policy learned by the direct model was rather inscrutable. Fig. 2(c) shows that additive guess updates extracted more value from using more refinement steps. When trained on the binarized MNIST benchmark discussed in Sec. 2.5, i.e. with binarized images and subject to MCAR-100, the LSTM-add model produced raw/fine-tuned scores of 86.2/85.7. The LSTM-jump model scored 87.1/86.3. Anecdotally, on this task, these “closed-loop” models seemed more prone to overfitting than the “open-loop” models in Sec. 2.5. The supplementary material provides further qualitative results.

5 Discussion

We presented a point of view which links methods for training directed generative models with policy search in reinforcement learning. We showed how our perspective can guide improvements to existing models. The importance of these connections will only grow as generative models rapidly increase in structural complexity and effective decision depth.

We introduced the notion of imputation as a natural generalization of standard, unconditional generative modelling. Depending on the relation between the data-to-generate and the available information, imputation spans from full unconditional generative modelling to classification/regression. We showed how to successfully train sequential imputation policies comprising millions of parameters using an approach based on guided policy search [11]. Our approach outperforms the baselines quantitatively and appears qualitatively promising. Incorporating, e.g., the local read/write mechanisms from [4] should provide further improvements.

References

  • [1] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Desjardins, J. Turian, D. Warde-Farley, and Y. Bengio. Theano: A CPU and GPU math expression compiler. In Python for Scientific Computing Conference (SciPy), 2010.
  • [2] Emily L Denton, Soumith Chintala, Arthur Szlam, and Robert Fergus. Deep generative models using a laplacian pyramid of adversarial networks. arXiv:1506.05751 [cs.CV], 2015.
  • [3] Alex Graves. Generating sequences with recurrent neural networks. arXiv:1308.0850 [cs.NE], 2013.
  • [4] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. Draw: A recurrent neural network for image generation. In International Conference on Machine Learning (ICML), 2015.
  • [5] Karol Gregor, Ivo Danihelka, Andriy Mnih, Charles Blundell, and Daan Wierstra. Deep autoregressive networks. In International Conference on Machine Learning (ICML), 2014.
  • [6] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv:1412.6980v2 [cs.LG], 2015.
  • [7] Diederik P Kingma, Danilo J Rezende, Shakir Mohamed, and Max Welling. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [8] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In International Conference on Learning Representations (ICLR), 2014.
  • [9] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In International Conference on Machine Learning (ICML), 2011.
  • [10] Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynamics. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [11] Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning (ICML), 2013.
  • [12] Sergey Levine and Vladlen Koltun. Variational policy search via trajectory optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • [13] Sergey Levine and Vladlen Koltun. Learning complex neural network policies with trajectory optimization. In International Conference on Machine Learning (ICML), 2014.
  • [14] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In International Conference on Machine Learning (ICML), 2014.
  • [15] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y Ng. Reading digits in natural images with unsupervised feature learning. NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
  • [16] Danilo Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning (ICML), 2014.
  • [17] Danilo J Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning (ICML), 2015.
  • [18] Jascha Sohl-Dickstein, Eric A. Weiss, Niru Maheswaranathan, and Surya Ganguli. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning (ICML), 2015.
  • [19] Joshua Susskind, Adam Anderson, and Geoffrey E Hinton. The toronto face database. Tech Report: University of Toronto, 2010.
  • [20] Bart van Merrienboer, Dzmitry Bahdanau, Vincent Dumoulin, Dmitriy Serdyuk, David Warde-Farley, Jan Chorowski, and Yoshua Bengio. Blocks and fuel: Frameworks for deep learning. arXiv:1506.00619 [cs.LG], 2015.


Appendix B Additional Material for Section 2

B.1 A Brief Review of Policy Search and Guided Policy Search

Policy search refers to a general class of methods for searching directly through the space of possible parameterized policies for a reinforcement learning system (in contrast to fitting a value function and determining the policy implicitly by choosing the best actions). However, policy search is subject to local optima, which can be quite bad if the policy space is very rich (e.g., policies represented by deep networks). Guided policy search methods [11, 12, 13, 10] address this problem by using either guiding samples, or a guide policy (which generates guiding samples), in order to help move the policy search away from bad local optima. We refer to “local optima” in a colloquial/practical sense. I.e. regions of policy space in which the policy is unlikely to improve via noisy local search.

The initial approach to this problem was to generate guiding samples from policies obtained through trajectory optimization using differential dynamic programming [11]. After applying importance sampling corrections, the guiding samples were then used for off-policy training of the primary policy, a standard approach in policy search. Further work has obtained samples by using a “guide policy” which typically belongs to a larger policy class than the one being searched [13, 10]. In both cases, the optimization criterion contains, in addition to the reward, a regularization term requiring trajectories from the trained policy to be close to the guide samples. Constraining divergence between the guide samples and the trajectories produced by the trained policy allows the system generating the guide samples to gradually pull the trained policy towards improved behavior.

B.2 A Path-wise Bound for Reversible Stochastic Processes

We now show that the objective in Eqn. 9 describes the divergence KL(q(τ | x) || p(τ)), and that it provides an upper bound on -E_{x ~ D}[log p(x)]. First, we define:

Next, we derive:

(18)
(19)
(20)
(21)
(22)
(23)
(24)
(25)

which provides a lower bound on log p(x) based on sample trajectories produced by the reverse-time process when it is started at x. The transition from equality to inequality is due to Jensen’s inequality. Though p(τ) and q(τ | x) may at first seem incommensurable via a KL divergence, they both represent distributions over T-step trajectories through the latent space, and thus the required divergence is well-defined. Next, by adding an expectation with respect to x ~ D, we derive a lower bound on the expected log-likelihood E_{x ~ D}[log p(x)]:

(26)
(27)
(28)