
Deep State Space Models for Unconditional Word Generation

Abstract

Autoregressive feedback is considered a necessity for successful unconditional text generation using stochastic sequence models. However, such feedback is known to introduce systematic biases into training, and it obscures a principle of generation: committing to global information and forgetting local nuances. We show that a non-autoregressive deep state space model with a clear separation of global and local uncertainty can be built from only two ingredients: an independent noise source and a deterministic transition function. Recent advances in flow-based variational inference allow training with an evidence lower bound without resorting to annealing, auxiliary losses or similar measures. The result is a highly interpretable generative model on par with a comparable autoregressive model on the task of word generation.


1 Introduction

Deep generative models of sequential data are an active field of research. Generation of text, in particular, remains a challenging and relevant area Hu2017Towards (). Recurrent neural networks (RNNs) are a common model class, and are typically trained via maximum likelihood bowmanVVDJB15 () or adversarially yuZWY16 (); fedusGD2018 (). For conditional text generation, the sequence-to-sequence architecture of sutskeverVL2014 () has proven to be an excellent starting point, leading to significant improvements across a range of tasks, including machine translation bahdanauCB14 (); vaswaniSPUJGKP17 (), text summarization rushCW15 (), sentence compression filippova2015sentence () and dialogue systems serban2016building (). Similarly, RNN language models have been used with success in speech recognition mikolov2010recurrent (); graves2014towards (). In all these tasks, generation is conditioned on information that severely narrows down the set of likely sequences. The role of the model is then largely to distribute probability mass within relatively confined sets of candidates.

Our interest is, by contrast, in unconditional or free generation of text via RNNs. We take as point of departure the shortcomings of existing model architectures and training methodologies developed for conditional tasks. These arise from the increased challenges on both accuracy and coverage. Generating grammatical and coherent text is considerably more difficult without reliance on an acoustic signal or a source sentence, which may constrain, if not determine, much of the sentence structure. Moreover, failure to sufficiently capture the variety and variability of the data may not surface in conditional tasks, yet it is a key desideratum in unconditional text generation.

The de facto standard model for text generation is based on the RNN architecture originally proposed by graves2013generating () and incorporated as a decoder network in sutskeverVL2014 (). It evolves a continuous state vector, emitting one symbol at a time, which is then fed back into the state evolution – a property that characterizes the broader class of autoregressive models. However, even in a conditional setting, these RNNs are difficult to train without substituting previously generated words with ground truth observations during training, a technique generally referred to as teacher forcing williams1989learning (). Yet this approach is known to cause biases ranzato2015sequence (); goyalLZZCB16 () that can be detrimental to test-time performance, where such nudging is not available and where state trajectories can go astray, requiring ad hoc fixes like beam search wisemanR16 () or scheduled sampling bengioVJS15 (). Nevertheless, teacher forcing has been carried over to unconditional generation bowmanVVDJB15 ().

Another drawback of autoregressive feedback graves2013generating () is the dual use of a single source of stochasticity. The probabilistic output selection has to account for the local variability of the next-token distribution. In addition, it also has to inject a sufficient amount of entropy into the evolution of the state sequence, which is otherwise deterministic. Such noise injection is known to compete with the explanatory power of autoregressive feedback mechanisms and may result in degenerate, near-deterministic models bowmanVVDJB15 (). As a consequence, a variety of papers propose deep stochastic state sequence models which combine stochastic and deterministic dependencies, e.g. chung2015recurrent (); fraccaro2016SPW (), or which make use of auxiliary latent variables goyalSCKB17 (), auxiliary losses shabanian17 (), and annealing schedules bowmanVVDJB15 (). No canonical architecture has emerged so far, and it remains unclear how the stochasticity in these models can be interpreted and measured.

In this paper, we propose a stochastic sequence model that preserves the Markov structure of standard state space models by cleanly separating the stochasticity in the state evolution, injected via a white noise process, from the randomness in the local token generation. We train our model using variational inference (VI) and build upon recent advances in normalizing flows rezendeM15 (); kingmaSW16 () to define rich enough stochastic state transition functions for both generation and inference. Our main goal is to investigate the fundamental question of how far one can push such an approach in text generation, and to more deeply understand the role of stochasticity. For that reason, we have used the most basic problem of text generation as our testbed: word morphology, i.e. the mechanisms underlying the formation of words from characters. This enables us to empirically compare our model to autoregressive RNNs on several metrics that are intractable in more complex tasks such as word sequence modeling.

2 Model

We argue that text generation is subject to two sorts of uncertainty: uncertainty about plausible long-term continuations and uncertainty about the emission of the current token. The first reflects the entropy of all things considered “natural language”, the second reflects symbolic entropy at a fixed position that arises from ambiguity, (near-)analogies, or a lack of contextual constraints. As a consequence, we cast the emission of a token as a fundamental trade-off between committing to information and forgetting it.

2.1 State space model

Let us define a state space model with transition function

$$h_t = F(h_{t-1}, \xi_t), \qquad \xi_t \sim \mathcal{N}(0, I). \qquad (1)$$

$F$ is deterministic, yet driven by a white noise process $\xi = (\xi_t)_{t \geq 1}$, and, starting from some $h_0$, it defines a homogeneous stochastic process. A local observation model $p(x_t \mid h_t)$ generates symbols $x_t$ and is typically realized by a softmax layer with symbol embeddings.

The marginal probability of a symbol sequence $x_{1:T}$ is obtained by integrating out the latent state sequence $h_{1:T}$,

$$p(x_{1:T}) = \int p(h_{1:T}) \prod_{t=1}^{T} p(x_t \mid h_t) \, dh_{1:T}. \qquad (2)$$

Here $p(h_{1:T})$ is defined implicitly by driving $F$ with the noise process $\xi$, as we will explain in more detail below.¹ In contrast to standard RNNs, we have defined $F$ so as not to include an autoregressive input such as $x_{t-1}$, making potential biases as in teacher forcing a non-issue. Furthermore, this implements our assumption about the role of entropy and information in generation. The information about the local outcome under $p(x_t \mid h_t)$ is not considered in the transition to the next state, as there is no feedback. Thus all entropy that the model encodes about possible sequence continuations must be driven by the noise process $\xi$, which cannot be ignored in a successfully trained model.
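To make the generative process concrete, the following is a minimal PyTorch sketch of Eq. (1) together with the softmax observation model. The MLP transition, its sizes, and the class and method names are illustrative assumptions, not the configuration used in the paper.

```python
import torch
import torch.nn as nn


class NoiseDrivenSSM(nn.Module):
    """Sketch of the generative model: a deterministic transition F driven by
    white noise (Eq. (1)) and a softmax observation model p(x_t | h_t)."""

    def __init__(self, state_dim, vocab_size, hidden=64):
        super().__init__()
        # F(h_{t-1}, xi_t): deterministic map of previous state and noise
        self.transition = nn.Sequential(
            nn.Linear(2 * state_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, state_dim))
        # p(x_t | h_t): softmax over the vocabulary
        self.emit = nn.Linear(state_dim, vocab_size)
        self.state_dim = state_dim

    def sample(self, batch_size, length):
        h = torch.zeros(batch_size, self.state_dim)          # h_0
        tokens = []
        for _ in range(length):
            xi = torch.randn(batch_size, self.state_dim)     # xi_t ~ N(0, I)
            h = self.transition(torch.cat([h, xi], dim=-1))  # no autoregressive feedback
            dist = torch.distributions.Categorical(logits=self.emit(h))
            tokens.append(dist.sample())
        return torch.stack(tokens, dim=1)                    # (batch, length) token ids
```

Note that the sampled token never enters the transition; all sequence-level entropy has to come from the noise $\xi_t$.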

2.2 Variational inference

Model-based variational inference (VI) allows us to approximate the marginalization in Eq. (2) by posterior expectations with regard to an inference model $q(h_{1:T} \mid x_{1:T})$. It is easy to verify that the true posterior obeys the conditional independence $h_t \perp x_{1:t-1} \mid h_{t-1}$, which informs our design of the inference model, cf. fraccaro2016SPW ():

$$q(h_{1:T} \mid x_{1:T}) = \prod_{t=1}^{T} q(h_t \mid h_{t-1}, x_{t:T}). \qquad (3)$$

This is to say, the previous state is a sufficient summary of the past. Jensen’s inequality then directly implies the evidence lower bound (ELBO)

$$\log p(x_{1:T}) \;\geq\; \mathbb{E}_{q(h_{1:T} \mid x_{1:T})}\!\left[\log \frac{p(x_{1:T}, h_{1:T})}{q(h_{1:T} \mid x_{1:T})}\right] \qquad (4)$$
$$\;=\; \sum_{t=1}^{T} \mathbb{E}_{q(h_{t-1})}\Big[\, \mathbb{E}_{q(h_t \mid h_{t-1}, x_{t:T})}\big[\log p(x_t \mid h_t)\big] \;-\; \mathrm{KL}\big(q(h_t \mid h_{t-1}, x_{t:T}) \,\big\|\, p(h_t \mid h_{t-1})\big) \Big]. \qquad (5)$$

This is a well-known form, which highlights the per-step balance between prediction quality and the discrepancy between the transition probabilities of the unconditioned generative and the data-conditioned inference models. Intuitively, the inference model breaks down the long range dependencies and provides a local training signal to the generative model for a single step transition and a single output generation.

Using VI successfully for generating symbol sequences requires parametrizing powerful yet tractable next-state transitions. As a minimum requirement, forward sampling and log-likelihood computations need to be available. Extensions of VAEs rezendeM15 (); kingmaSW16 () have shown that for non-sequential models, under certain conditions, an invertible function can shape moderately complex distributions over the noise $\xi$ into highly complex ones over the state $h$, while still providing the operations necessary for efficient VI. The authors show that a bound similar to Eq. (5) can be obtained by using the law of the unconscious statistician rezendeM15 () and a density transformation to express the discrepancy between generative and inference model in terms of $\xi$ instead of $h$: for $h = f(\xi)$ with $\xi \sim q_0$ and invertible $f$,

$$\mathrm{KL}\big(q(h)\,\|\,p(h)\big) \;=\; \mathbb{E}_{q_0(\xi)}\!\left[\log q_0(\xi) - \log\left|\det \frac{\partial f(\xi)}{\partial \xi}\right| - \log p\big(f(\xi)\big)\right]. \qquad (6)$$

This allows working with an implicit latent distribution at the price of computing the Jacobian determinant of $f$. Luckily, there are many choices of $f$ for which this can be done efficiently rezendeM15 (); dinhSB16 ().
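As a quick illustration of the density transformation underlying Eq. (6), the snippet below numerically checks the change-of-variables identity $\log q(h) = \log q_0(\xi) - \log|\det \partial f/\partial \xi|$ for an affine map; it is a generic sanity check, not code from the paper.

```python
import torch

# Numeric check: for h = f(xi) with invertible f and xi ~ q0,
# log q(h) = log q0(xi) - log |det J_f(xi)|.
torch.manual_seed(0)
d = 4
A = torch.randn(d, d) + 3 * torch.eye(d)   # an (almost surely) invertible linear map
b = torch.randn(d)

xi = torch.randn(d)                        # xi ~ q0 = N(0, I)
h = A @ xi + b                             # h = f(xi)

q0 = torch.distributions.MultivariateNormal(torch.zeros(d), torch.eye(d))
q_h = torch.distributions.MultivariateNormal(b, A @ A.T)   # exact density of h under f

_, logabsdet = torch.slogdet(A)            # log |det J_f|, constant for an affine map
print(torch.allclose(q_h.log_prob(h), q0.log_prob(xi) - logabsdet, atol=1e-4))
```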

2.3 Training through coupled transition functions

We propose to use two transition functions, $G$ and $F$, for the inference and the generative model, respectively. Using results from flow-based VAEs we derive an ELBO that reveals the intrinsic coupling of both and expresses the relation of the two as a part of the objective that is determined solely by the data. A shared transition model $G = F$ constitutes a special case.

Two-Flow ELBO

For a transition function $F$ as in Eq. (1), fix $h$ and define the restriction $F_h: \xi \mapsto F(h, \xi)$. We require that for any $h$, $F_h$ is a diffeomorphism and thus has a differentiable inverse. In fact, we will work with (possibly) different $F$ and $G$ for generation and inference, each one inducing restrictions $F_h$ and $G_h$, respectively. For better readability we will omit the conditioning state $h$ in the sequel.

By combining the per-step decomposition in (5) with the flow-based ELBO from (6), we get (implicitly setting $h_t = G(\xi_t)$, i.e. using the inference flow to produce $h_t$ from $\xi_t \sim q$):

$$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}\Big[\log p(x_t \mid h_t) \;+\; \log \frac{\big|\det J_{G}(\xi_t)\big|\, p(h_t \mid h_{t-1})}{q(\xi_t)}\Big]. \qquad (7)$$

As our generative model also uses a flow to transform a simple noise distribution into a distribution on states, it is more natural to use the (simple) density in noise space. Performing another change of variable, this time on the density of the generative model, we get

$$p(h_t \mid h_{t-1}) \;=\; p_0(\zeta_t)\,\big|\det J_{F}(\zeta_t)\big|^{-1}, \qquad \zeta_t := F^{-1}(h_t) = (F^{-1}\!\circ G)(\xi_t), \qquad (8)$$

where now $p_0$ is simply the (multivariate) standard normal density, as it does not depend on $h_{t-1}$, whereas $q$ does. We have introduced the new noise variable $\zeta_t$ to highlight the importance of the transformation $F^{-1}\!\circ G$, which is a combined flow of the forward inference flow and the inverse generative flow. Essentially, it follows the suggested $\xi$-distribution of the inference model into the latent state space and back into the noise space of the generative model with its uninformative distribution. Putting this back into Eq. (7) and exploiting the fact that the Jacobians can be combined via $J_{F^{-1}\circ G}(\xi_t) = J_{F}(\zeta_t)^{-1} J_{G}(\xi_t)$, we finally get

$$\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}\Big[\log p(x_t \mid h_t) \;+\; \log \frac{p_0(\zeta_t)\,\big|\det J_{F^{-1}\circ G}(\xi_t)\big|}{q(\xi_t)}\Big]. \qquad (9)$$

Interpretation

Naïvely employing the model-based ELBO approach, one has to learn two independently parametrized transition models, one informed about the future and one not. Matching the two then becomes an integral part of the objective. However, since the transition model encapsulates most of the model complexity, this introduces redundancy where the learning problem is most challenging. Nevertheless, the generative and the inference model do address the transition problem from very different angles. Therefore, forcing both to use the exact same transition model might limit flexibility during training and result in an inferior generative model. Our model casts $F$ and $G$ as independently parametrized functions that are coupled through the objective by treating them as proper transformations of an underlying white noise process.² Finally, note that for $F = G$ we have $\zeta_t = \xi_t$, and the numerator in the second term of (9) becomes a simple prior probability $p_0(\xi_t)$, whereas the determinant reduces to a constant.

2.4 Families of transition functions

Since the Jacobian of a composed function factorizes, a flow is often composed of a chain of individual invertible functions rezendeM15 (). We experiment with individual functions

$$f_h(\xi) = \mu(h) + L(h)\,\xi, \qquad (10)$$

where $\mu$ is a multilayer MLP and $L$ is a neural network mapping $h$ to a lower-triangular matrix with non-zero diagonal entries. Again, we use MLPs for this mapping and clip the diagonal away from $0$ by some hyperparameter $\epsilon$. The lower-triangular structure allows computing the determinant in $\mathcal{O}(D)$ and stably inverting the mapping by substitution in $\mathcal{O}(D^2)$. As a special case we also consider the setting where $L(h)$ is restricted to diagonal matrices. Finally, we experiment with a conditional variant of the Real NVP flow dinhSB16 ().

Computing $F^{-1}$ is central to our objective, and we found that, depending on the flow, parametrizing the inverse directly results in more stable and efficient training.
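A minimal sketch of one such conditional lower-triangular affine layer, with the log-determinant read off the diagonal and the inverse computed by triangular substitution. Network sizes, the clipping scheme, and all names are illustrative assumptions.

```python
import torch
import torch.nn as nn


class TrilFlow(nn.Module):
    """One conditional affine layer f_h(xi) = mu(h) + L(h) @ xi with L(h)
    lower-triangular and its diagonal clipped away from zero (sketch of Eq. (10))."""

    def __init__(self, dim, hidden=32, eps=1e-2):
        super().__init__()
        self.dim, self.eps = dim, eps
        self.mu_net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim))
        self.L_net = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh(), nn.Linear(hidden, dim * dim))

    def _params(self, h):
        mu = self.mu_net(h)
        L = torch.tril(self.L_net(h).view(-1, self.dim, self.dim))
        diag = torch.diagonal(L, dim1=-2, dim2=-1)
        # keep |L_ii| >= eps so the layer stays invertible
        clipped = torch.where(diag.abs() < self.eps, torch.full_like(diag, self.eps), diag)
        return mu, L - torch.diag_embed(diag) + torch.diag_embed(clipped)

    def forward(self, h, xi):
        mu, L = self._params(h)
        out = mu + torch.einsum('bij,bj->bi', L, xi)
        # determinant of a triangular matrix: product of the diagonal, O(D)
        logdet = torch.diagonal(L, dim1=-2, dim2=-1).abs().log().sum(-1)
        return out, logdet

    def inverse(self, h, h_next):
        mu, L = self._params(h)
        # stable inversion by forward substitution, O(D^2)
        xi = torch.linalg.solve_triangular(L, (h_next - mu).unsqueeze(-1), upper=False)
        return xi.squeeze(-1)
```

Chaining several such layers multiplies the Jacobian determinants, so their log-determinants simply add up.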

2.5 Inference network

So far we have only motivated the factorization of the inference network but treated it as a black box otherwise. We also follow existing work dinhSB16 () and choose the factors to follow a normal distribution with diagonal covariance matrix, which allows re-parametrization. We follow the idea of fraccaro2016SPW () and incorporate the variable-length sequence by conditioning on the state of an RNN running backwards in time across $x_{t:T}$. We embed the symbols in a vector space and use a GRU cell to produce a sequence of hidden states $g_T, \dots, g_1$, where $g_t$ has digested the tokens $x_{t:T}$. Together, $h_{t-1}$ and $g_t$ parametrize the mean and covariance matrix of the inference factor at time $t$.
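A sketch of what such an inference network could look like: a GRU run right-to-left summarizes $x_{t:T}$, and together with $h_{t-1}$ parametrizes a diagonal Gaussian over the noise. Layer sizes and the way the two inputs are combined are assumptions.

```python
import torch
import torch.nn as nn


class BackwardInferenceNet(nn.Module):
    """Sketch of the inference factors: a GRU run backwards over the embedded
    symbols summarizes x_{t:T}; together with h_{t-1} it parametrizes a diagonal
    Gaussian that can be reparametrized."""

    def __init__(self, vocab_size, embed_dim, state_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, state_dim, batch_first=True)
        self.head = nn.Linear(2 * state_dim, 2 * state_dim)   # -> (mean, log_std)

    def summaries(self, x):                      # x: (batch, T) token ids
        emb = self.embed(x.flip(dims=[1]))       # run right-to-left
        g, _ = self.gru(emb)
        return g.flip(dims=[1])                  # g[:, t] summarizes x_{t:T}

    def factor(self, h_prev, g_t):
        mean, log_std = self.head(torch.cat([h_prev, g_t], dim=-1)).chunk(2, dim=-1)
        return torch.distributions.Normal(mean, log_std.exp())   # reparametrizable
```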

2.6 Optimization

Except in very specific and simple cases, for instance Kalman filters, it will not be possible to efficiently compute the $q$-expectations in Eq. (5). Instead, we sample in every time-step, as is standard for sequential ELBOs fraccaroSPW2016 (); goyalSCKB17 (). The re-parametrization trick allows pushing all necessary gradients through these expectations to optimize the bound via stochastic gradient-based optimization techniques such as Adam kingmaB14 ().
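A minimal sketch of the resulting training loop, assuming a model method `elbo(batch)` that returns the bound computed from single reparametrized samples per time-step (the method name and batching are illustrative assumptions):

```python
import torch


def train(model, data_loader, epochs=10, lr=1e-3):
    """Sketch: maximize the sequential ELBO with Adam. Because all per-step
    expectations are single reparametrized samples, gradients flow through them."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:          # batch: padded token ids, (B, T)
            loss = -model.elbo(batch)      # assumed method returning a scalar ELBO
            opt.zero_grad()
            loss.backward()
            opt.step()
```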

2.7 Extension: Importance-weighted ELBO for tracking the generative model

Conceptually, there are two ways we can imagine an inference network to propose state sequences for a given sentence $x_{1:T}$. Either, as proposed above, by digesting $x_{1:T}$ right-to-left and proposing $\xi_t$. Or, by iteratively proposing a state taking into account the last state proposed and the generative deterministic mechanism $F$. The latter allows the inference network to peek at states that $F$ could generate from $h_{t-1}$ before proposing an actual target $h_t$. This allows the inference model to track a multi-modal generative transition without the need for $q$ to match its expressiveness. As a consequence, this might allow multi-modal generative models without actually employing complex multi-modal distributions in the inference model.

Our extension is built on importance-weighted auto-encoders (IWAE) burda15 (). The IWAE ELBO is derived by writing the log marginal as an MC estimate before using Jensen’s inequality. The result is an ELBO and corresponding gradients of the form³

$$\mathcal{L}_K = \mathbb{E}\Big[\log \tfrac{1}{K}\textstyle\sum_{k=1}^{K} w_k\Big], \qquad \nabla_\theta \mathcal{L}_K = \mathbb{E}\Big[\textstyle\sum_{k=1}^{K} \tilde w_k \,\nabla_\theta \log w_k\Big], \qquad w_k = \frac{p(x_{1:T},\, h^{(k)}_{1:T})}{q(h^{(k)}_{1:T} \mid x_{1:T})}, \quad \tilde w_k = \frac{w_k}{\sum_{k'} w_{k'}}. \qquad (11)$$

The authors motivate (11) as a weighting mechanism relieving the inference model from explaining the data well with every sample. We will use the symmetry of this argument to let the inference model condition on potential next states from the generative model without requiring every such state to allow $q$ to make a good proposal. In other words, the sampled outputs of $F$ become a vectorized representation of the generative transition to condition on. In our sequential model, computing the weights $w_k$ exactly is intractable, as it would require rolling out the network until time $T$. Instead, we limit the horizon to only one time-step. Furthermore, when proceeding to time-step $t+1$ we choose the new “old” hidden state $h_t$ to be carried over by sampling one of the proposed states proportionally to its weight. Algorithm 1 summarizes the steps carried out at time $t$ for a given $h_{t-1}$ (to not overload the notation, we drop the time index where it is unambiguous), and a more detailed derivation of the bound is given in Appendix A.

Simulate the generative model:   $\hat h^{(k)} = F(h_{t-1}, \zeta^{(k)})$ for $k = 1, \dots, K$, where $\zeta^{(k)} \sim \mathcal{N}(0, I)$
Instantiate the inference family:   $q^{(k)}(\xi) = q(\xi \mid h_{t-1}, \hat h^{(k)}, x_{t:T})$
Sample inference:   $\xi^{(k)} \sim q^{(k)}$ and set $h_t^{(k)} = G(h_{t-1}, \xi^{(k)})$
Compute gradients as in (11), where $w_k$ is the one-step importance weight of sample $k$ (the first factor in (22))
Sample $h_t$ from $\{h_t^{(k)}\}$ proportionally to $w_k$ for the next step.
Algorithm 1: Forward pass with importance weighting (details)
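The reweighting and resampling arithmetic behind Eq. (11) and the last two steps of Algorithm 1 can be sketched generically as follows; this is the standard importance-weighting computation on $K$ per-sample log-weights, not the paper's implementation.

```python
import math
import torch


def iwae_reweight(log_w):
    """log_w: (K,) log importance weights of the K candidate states at time t.
    Returns the bound contribution log(1/K * sum_k w_k), the normalized weights
    used to weight per-sample gradients, and the index of the state to carry over."""
    K = log_w.shape[0]
    bound = torch.logsumexp(log_w, dim=0) - math.log(K)
    w_tilde = torch.softmax(log_w, dim=0)                         # normalized weights
    carry = torch.distributions.Categorical(probs=w_tilde).sample()
    return bound, w_tilde, carry
```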

3 Related Work

Our work intersects with two lines of research: work directly addressing teacher forcing, mostly on language modelling and translation (which mostly does not use state space models), and stochastic state space models (which are typically autoregressive and do not address teacher forcing).

Early work on addressing teacher-forcing has focused on mitigating its biases by adapting the RNN training procedure to partly rely on the model’s prediction during training bengioVJS15 (); ranzatoCAZ15 (). Recently, the problem has been addressed for conditional generation within an adversarial framework goyalLZZCB16 () and in various learning to search frameworks wisemanR16 (); leblondAOL17 (). However, by design these models do not perform stochastic state transitions.

There have been proposals for hybrid architectures that augment the deterministic RNN state sequences by chains of random variables chung2015recurrent (); fraccaro2016SPW (). However, these approaches largely patch up the output feedback mechanism to allow for better modeling of local correlations, leaving the deterministic skeleton of the RNN state sequence untouched. A recent evolution of deep stochastic sequence models has developed models of ever increasing complexity, including intertwined stochastic and deterministic state sequences chung2015recurrent (); fraccaro2016SPW (), additional auxiliary latent variables goyalSCKB17 (), auxiliary losses shabanian17 (), and annealing schedules bowmanVVDJB15 (). At the same time, it often remains unclear how the stochasticity in these models can be interpreted and measured.

Closest in spirit to our transition functions is the work by Karl et al. karl16KSBvS () on generation with external control inputs. In contrast to us, they use a simple mixture of linear transition functions and work around density transformations, akin to bayer2014 (). In our unconditional regime we found that relating the stochasticity in the states explicitly to the stochasticity in the noise process is key to successful training. Finally, variational conditioning mechanisms similar in spirit to ours have seen great success in image generation gregorDGW15 ().

Among generative unconditional sequence models, GANs are as of today the most prominent architecture yuZWY16 (); Kusner16 (); fedusGD2018 (); cheLZHLSB17 (). To the best of our knowledge, our model is the first non-autoregressive model for sequence generation in a maximum likelihood framework.

4 Evaluation

Naturally, the quality of a generative model must be measured in terms of the quality of its outputs. However, we also put special emphasis on investigating whether the stochasticity inherent in our model operates as advertised.

4.1 Data inspection

Evaluating generative models of text is a field of ongoing research, and currently used methods range from naive set-based metrics to expensive human evaluation fedusGD2018 (). We argue that for morphology, and in particular for non-autoregressive models, there is an interesting middle ground: compared to the space of all sentences, the space of all words still has moderate cardinality, which allows us to estimate the data-generating distribution by unigram word frequencies. As a consequence, we can reliably approximate the cross entropy, which naturally generalizes set-based metrics to probabilistic models and addresses both over-generalization (assigning non-zero probability to non-existing words) and over-confidence (distributing high probability mass only among a few words).

This metric can be addressed by all models which operate by first stochastically generating a sequence of hidden states and then defining a distribution over the data space given the state sequence. For our model we approximate the marginal $p(x_{1:T})$ by a Monte Carlo estimate

$$p(x_{1:T}) \;\approx\; \frac{1}{S}\sum_{s=1}^{S} \prod_{t=1}^{T} p\big(x_t \mid h_t^{(s)}\big), \qquad h_{1:T}^{(s)} \sim p(h_{1:T}). \qquad (12)$$

Note that sampling $h_{1:T}^{(s)}$ boils down to sampling from independent standard normals and then applying $F$. In particular, the non-autoregressive property of our model allows us to estimate all words in some set using $S$ samples each while using only $S$ independent trajectories overall.
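A sketch of this estimator, reusing the assumed `transition`/`emit` interface from the Section 2.1 sketch: the same $S$ noise-driven trajectories are rolled out once and then reused to score every (fixed-length) word.

```python
import math
import torch


def word_log_probs(model, words, S, T):
    """Monte Carlo estimate of log p(x_{1:T}) as in Eq. (12) for a batch of words
    of equal length T (each a LongTensor of token ids). Sketch only."""
    # roll out S independent noise-driven state trajectories once
    h = torch.zeros(S, model.state_dim)
    step_log_probs = []
    for _ in range(T):
        xi = torch.randn(S, model.state_dim)
        h = model.transition(torch.cat([h, xi], dim=-1))
        step_log_probs.append(torch.log_softmax(model.emit(h), dim=-1))
    log_p = torch.stack(step_log_probs, dim=1)                  # (S, T, vocab)

    # reuse the same trajectories for every word: no feedback, so states don't depend on x
    estimates = []
    for w in words:
        idx = w.view(1, T, 1).expand(S, T, 1)
        per_traj = log_p.gather(2, idx).squeeze(-1).sum(-1)     # log prod_t p(x_t | h_t^(s))
        estimates.append(torch.logsumexp(per_traj, dim=0) - math.log(S))
    return torch.stack(estimates)                               # one log p(w) estimate per word
```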

4.2 Entropy inspection

We want to go beyond the usual evaluation of existing work on (generative) sequence models and also assess the quality of our noise model. In particular, we are interested in how much of the information contained in a state $h_t$ about the output is due to the corresponding noise vectors. For this we define two per-time-step quantities which we call marginal and expected entropy,

$$H_t^{\mathrm{marg}} = H\big(\mathbb{E}_{\xi_{1:t}}\big[\,p(x_t \mid h_t)\,\big]\big) \qquad (13)$$
$$H_t^{\mathrm{exp}} = \mathbb{E}_{\xi_{1:t}}\big[\,H\big(p(x_t \mid h_t)\big)\,\big], \qquad (14)$$

where $h_t$ is the state obtained by driving $F$ with the noise $\xi_{1:t}$. They measure the entropy of the output at step $t$ under marginalization of the noise, and the expected entropy one gets when following particular noise vectors. For a model that ignores the noise vectors $\xi_t$, the two are identical and the gap $H_t^{\mathrm{marg}} - H_t^{\mathrm{exp}}$ vanishes. On the other hand, if the gap is positive, we know how many bits of the output distribution are due to the noise $\xi$.
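A sketch of how the two quantities and their gap can be estimated by Monte Carlo, again reusing the assumed interface from the Section 2.1 sketch; entropies are converted from nats to bits.

```python
import math
import torch


def entropy_gap_per_position(model, S, T):
    """Estimate the marginal entropy (Eq. (13)), the expected entropy (Eq. (14))
    and their gap at every position by rolling out S noise trajectories. Sketch only."""
    h = torch.zeros(S, model.state_dim)
    gaps = []
    for _ in range(T):
        xi = torch.randn(S, model.state_dim)
        h = model.transition(torch.cat([h, xi], dim=-1))
        p = torch.softmax(model.emit(h), dim=-1)                          # (S, vocab)
        marginal = torch.distributions.Categorical(probs=p.mean(0)).entropy()
        expected = torch.distributions.Categorical(probs=p).entropy().mean()
        gaps.append((marginal - expected) / math.log(2.0))                # nats -> bits
    return torch.stack(gaps)                                              # gap per position
```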

5 Experiments

Dataset and baseline

For our experiments, we use the BooksCorpus kiros2015skip (); zhu2015aligning (), a freely available collection of novels comprising almost 1B tokens, of which 1.3M are unique. To filter out artefacts and some very uncommon words found in fiction, we restrict the vocabulary to words of limited length with at least 10 occurrences that contain only letters, resulting in a 143K-word vocabulary. Besides the standard 10% test-train split at the word level, we also perform a second, alternative split at the level of unique words. That means that 10 percent of the words, chosen regardless of their frequency, are unique to the test set. This is motivated by the fact that even a small test set under the former regime will contain only very few, very unlikely words unique to the test set. However, generalization to unseen words is the essence of morphology. As an additional metric for measuring generalization in this scenario, we evaluate the generated output under two Kneser-Ney-smoothed character $n$-gram models trained on the test data.

Our baseline is a GRU cell and the standard RNN training procedure with teacher forcing.⁴ Hidden state size and embedding size are identical to our model’s.

Model parametrization

For the generative flow $F$ we experiment with sequences of the flow in (10), which we denote as tril×$N$, where $N$ is the number of chained flows. Furthermore, we use the simple diagonal special case of (10), short diag, for the inference flow $G$, and a state size of 8 as our default configuration. More complex inference flows resulted in unstable training; less complex generative flows, i.e. diag, stagnated at very low performance. For the weighted version we use $K$ samples with $K \in \{2, 5, 10\}$. In addition, for the generative flow we experiment with a sequence of Real NVPs with masking dimensions [2, 3, 4, 5, 6, 7] (two internal hidden layers of size 8 each). Finally, we investigate deviating from the factorization (3) by using a bidirectional RNN conditioning on all of $x_{1:T}$ in every timestep.

5.1 Data inspection

Table 1 shows the results for the standard split. In general we observe that the set-based measures have much higher variance than cross entropy, which should not come as a surprise, as the latter effectively evaluates a single sample against the full vocabulary. Furthermore, they require manually trading off precision and coverage. We observe that two layers of the tril flow improve performance. Furthermore, importance weighting significantly improves the results across all metrics, with diminishing returns for larger $K$. Its effectiveness is also confirmed by an increase in variance across the weights during training, which can be attributed to the significance of the noise model (see 5.2 for more details). We attribute the relatively poor performance of NVP to the sequential VI setting, which deviates heavily from what it was designed for, and leave adaptations for future work.

| Model | CE (train) | CE (test) | unique | in train | entropy gap |
|---|---|---|---|---|---|
| tril | 11.88 | 11.93 | 0.11 | 0.62 | 1.07 |
| tril, K=2 | 11.96 | 12.20 | 0.11 | 0.62 | 0.94 |
| tril, K=5 | 11.49 | 11.56 | 0.14 | 0.63 | 1.30 |
| tril, K=10 | 11.54 | 11.58 | 0.13 | 0.59 | 1.24 |
| tril | 11.85 | 11.89 | 0.11 | 0.55 | 0.87 |
| tril, K=5 | 11.39 | 11.45 | 0.14 | 0.54 | 1.23 |
| tril, K=10 | 11.37 | 11.43 | 0.15 | 0.56 | 1.42 |
| tril, K=10, bidi | 11.45 | 11.41 | 0.15 | 0.56 | 1.36 |
| real-NVP-[2,3,4,5,6,7] | 11.77 | 11.81 | 0.12 | 0.61 | 0.94 |
| baseline-8d | 12.92 | 12.97 | 0.13 | 0.53 | - |
| baseline-16d | 12.55 | 12.60 | 0.14 | 0.62 | - |
| oracle | 7.0 | 7.02⁵ | 0.27 | 1.0 | - |

Table 1: Results on generation. The cross entropy (CE) is computed with respect to both training and test set. The set-based measures (unique, in train) are with respect to the training set. oracle is a model sampling from the training data.

Interestingly, the bidirectional inference model only slightly improves performance compared to the equivalently parametrized standard model, suggesting that historic information can be sufficiently stored in the states and confirming d-separation as the right principle for inference design.

The poor cross entropy achieved by the baseline can partly be explained by the fact that autoregressive RNNs are trained on conditional next-token predictions; estimating the real data-space distribution would require normalizing over all possible sequences. However, the set-based metrics clearly show that the performance gap cannot solely be attributed to this.

Table 2 shows that generalization for the alternative split is indeed harder, but the cross entropy results carry over from the standard setting. Under the $n$-gram models the two models are roughly on par, with a slight advantage for the baseline under one of the two $n$-gram orders.

| Model | CE (train) | CE (test) | $n$-gram LM | $n$-gram LM |
|---|---|---|---|---|
| tril, K=5 | 11.33 | 12.49 | 21.74 | 66.15 |
| baseline-8d | 12.81 | 13.55 | 22.54 | 64.92 |

Table 2: Results for the alternative data split: cross entropy and likelihood under the two character $n$-gram LMs.

5.2 Entropy Gap

We use Monte Carlo samples to approximate the marginal in (12). Figure 5.2 shows how the gap, along with the symbolic entropy, changes during training. Remember that in a non-autoregressive model, the latter corresponds to information that cannot be recovered in later timesteps. Over the course of training, more and more information is driven by the noise process and absorbed into states, where it can be stored.

Figure 1 shows the final average entropy gap in free mode achieved by the model. In addition, Figure 5.2 shows how the gap is distributed along the sequence length and across different words. Every boxplot represents the entropy gap observed on a successfully trained model when running in free mode. As initial tokens are more important to remember, it should not come as a surprise that the gap is largest at the first positions and decreases over time, yet with increasing variance.

[Figure: entropy gap at different word lengths; x-axis: word position, y-axis: bits.]

[Figure: entropy analysis over training time; the dashed line indicates the overall word entropy of the trained baseline. x-axis: training time, y-axis: bits.]

6 Conclusion

In this paper we have shown how a deep state space model can be defined and trained with the help of normalizing flows. The recurrent mechanism is driven purely by a simple white noise process and does not require an autoregressive conditioning on previously generated symbols. In addition, we have shown how an importance-weighted conditioning mechanism integrated into the objective allows shifting stochastic complexity from the inference to the generative model. The result is a highly flexible framework for sequence generation with an extremely simple overall architecture, a measurable notion of latent information and no need for pre-training, annealing or auxiliary losses. We believe that pushing the boundaries of non-autoregressive modeling is key to understanding stochastic text generation and can open the door to related fields such as particle filtering naesseth2017 (); maddisonLTHNMDT17 ().

Appendix A Detailed derivation of the weighted ELBO

We simplify the notation and write the distribution of the inference model over a subsequence $h_{t:T}$ as $q(h_{t:T})$ for any $t$, without making the dependency on $x_{1:T}$ explicit. Furthermore, let $h^{(1:K)}$ be short for a set of $K$ samples from the inference model. Finally, let $\theta$ summarize all parameters of both generative and inference model.

The key idea is to write the marginal $p(x_{1:T})$ as a nested expectation

(15)

and observe that we can perform an MC estimate with respect to the outer expectation only

(16)

The same argument applies to the integrand in the ELBO. Now we can repeat the IWAE argument from burda15 () for the outer expectation

(17)
(18)
(19)
(20)

where we have used the above factorization in (18), MC sampling in (19) and Jensen’s inequality in (20). Now we can identify

(22)

and use the log-derivative trick to derive gradients

(23)

Again, we have omitted carrying out the re-parametrization trick explicitly when moving the gradient into the expectation and refer to the original paper for a more rigorous version. The gradient of the logarithm decomposes into two terms,

(24)
(25)

The first is the contribution to our original ELBO, normalized by the IWAE MC weights. The second is identical to our starting point in (17), but for the remaining subsequence and conditioned on the sampled state. Iterating the above over time yields the desired bound.

To allow tractable gradient computation using the importance-weighted bound, we use two simplifications. First, we limit the computation of the weights to a finite horizon of size 1, which reduces them to only the first factor in (22). Second, we forward only a single sample to the next time-step to remain in the usual single-sample sequential ELBO regime (which is important as $h_t$ depends on $h_{t-1}$). That is, we sample the state to carry over proportionally to the weights $w_k$. A more sophisticated solution would be to incorporate techniques from particle filtering, which maintain a fixed-size sample population that is updated over time.

Footnotes

  1. For ease of exposition, we assume fixed length sequences, although in practice one works with end-of-sequence tokens and variable length sequences.
  2. Note that identifying $F^{-1}\!\circ G$ as an invertible function allows us to perform a backwards density transformation which cancels the regularizing terms. This is akin to any flow objective (e.g. see equation (15) in rezendeM15 ()), where applying the transformation additionally to the prior cancels out the Jacobian term. We can think of the model as a stochastic bottleneck with the observation model attached to the middle layer. Removing the middle layer collapses the bottleneck and prohibits learning a compression.
  3. Here we have tacitly assumed that the expectation can be rewritten using the reparametrization trick so that it can be expressed with respect to some parameter-free base distribution. See burda15 () for a detailed derivation of the gradients in (11).
  4. It should be noted that despite the greatly reduced vocabulary in character-level generation, RNN training without teacher-forcing for our data still fails miserably.
  5. Note that the training-set oracle is not optimal for the test set. The entropy of the test set is 6.80

References

  1. Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473, 2014.
  2. Yuri Burda, Roger B. Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. CoRR, abs/1509.00519, 2015.
  3. Justin Bayer and Christian Osendorfer. Learning stochastic recurrent networks. arXiv preprint arXiv:1411.7610, 2014.
  4. Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. CoRR, abs/1506.03099, 2015.
  5. Samuel R. Bowman, Luke Vilnis, Oriol Vinyals, Andrew M. Dai, Rafal Józefowicz, and Samy Bengio. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015.
  6. Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pages 2980–2988, 2015.
  7. Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. CoRR, abs/1702.07983, 2017.
  8. Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. CoRR, abs/1605.08803, 2016.
  9. Katja Filippova, Enrique Alfonseca, Carlos A Colmenares, Lukasz Kaiser, and Oriol Vinyals. Sentence compression by deletion with lstms. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 360–368, 2015.
  10. William Fedus, Ian J. Goodfellow, and Andrew M. Dai. Maskgan: Better text generation via filling in the ______. CoRR, abs/1801.07736, 2018.
  11. Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2199–2207. Curran Associates, Inc., 2016.
  12. Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 2199–2207, 2016.
  13. Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. CoRR, abs/1502.04623, 2015.
  14. Alex Graves and Navdeep Jaitly. Towards end-to-end speech recognition with recurrent neural networks. In International Conference on Machine Learning, pages 1764–1772, 2014.
  15. Anirudh Goyal, Alex Lamb, Ying Zhang, Saizheng Zhang, Aaron C. Courville, and Yoshua Bengio. Professor forcing: A new algorithm for training recurrent networks. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain, pages 4601–4609, 2016.
  16. Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.
  17. Anirudh Goyal, Alessandro Sordoni, Marc-Alexandre Côté, Nan Rosemary Ke, and Yoshua Bengio. Z-forcing: Training stochastic recurrent networks. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, 4-9 December 2017, Long Beach, CA, USA, pages 6716–6726, 2017.
  18. Z. Hu, Z. Yang, Liang X., R. Salakhutdinov, and E. R. Xing. Toward controlled generation of text. In International Conference on Machine Learning (ICML), 2017.
  19. Matt J. Kusner and José Miguel Hernández-Lobato. GANs for sequences of discrete elements with the gumbel-softmax distribution. 2016.
  20. Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
  21. Maximilian Karl, Maximilian Soelch, Justin Bayer, and Patrick van der Smagt. Deep variational bayes filters: Unsupervised learning of state space models from raw data. arXiv preprint arXiv:1605.06432, 2017.
  22. Diederik P. Kingma, Tim Salimans, and Max Welling. Improving variational inference with inverse autoregressive flow. CoRR, abs/1606.04934, 2016.
  23. Ryan Kiros, Yukun Zhu, Ruslan Salakhutdinov, Richard S Zemel, Antonio Torralba, Raquel Urtasun, and Sanja Fidler. Skip-thought vectors. arXiv preprint arXiv:1506.06726, 2015.
  24. Rémi Leblond, Jean-Baptiste Alayrac, Anton Osokin, and Simon Lacoste-Julien. SEARNN: training rnns with global-local losses. CoRR, abs/1706.04499, 2017.
  25. Tomáš Mikolov, Martin Karafiát, Lukáš Burget, Jan Černockỳ, and Sanjeev Khudanpur. Recurrent neural network based language model. In Eleventh Annual Conference of the International Speech Communication Association, 2010.
  26. Chris J. Maddison, Dieterich Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Whye Teh. Filtering variational objectives. CoRR, abs/1705.09279, 2017.
  27. Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential monte carlo. arXiv preprint arXiv:1705.11140, 2017.
  28. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. arXiv preprint arXiv:1511.06732, 2015.
  29. Marc’Aurelio Ranzato, Sumit Chopra, Michael Auli, and Wojciech Zaremba. Sequence level training with recurrent neural networks. CoRR, abs/1511.06732, 2015.
  30. Alexander M. Rush, Sumit Chopra, and Jason Weston. A neural attention model for abstractive sentence summarization. CoRR, abs/1509.00685, 2015.
  31. Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 July 2015, pages 1530–1538, 2015.
  32. Samira Shabanian, Devansh Arpit, Adam Trischler, and Yoshua Bengio. Variational bi-lstms. 2017.
  33. Iulian Vlad Serban, Alessandro Sordoni, Yoshua Bengio, Aaron C Courville, and Joelle Pineau. Building end-to-end dialogue systems using generative hierarchical neural network models. In AAAI, volume 16, pages 3776–3784, 2016.
  34. Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS’14, pages 3104–3112, Cambridge, MA, USA, 2014. MIT Press.
  35. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. CoRR, abs/1706.03762, 2017.
  36. Sam Wiseman and Alexander M. Rush. Sequence-to-sequence learning as beam-search optimization. CoRR, abs/1606.02960, 2016.
  37. Ronald J Williams and David Zipser. A learning algorithm for continually running fully recurrent neural networks. Neural computation, 1(2):270–280, 1989.
  38. Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. Seqgan: Sequence generative adversarial nets with policy gradient. CoRR, abs/1609.05473, 2016.
  39. Yukun Zhu, Ryan Kiros, Richard Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. arXiv preprint arXiv:1506.06724, 2015.