Neural Particle Smoothing for Sampling from Conditional Sequence Models

Chu-Cheng Lin    Jason Eisner
Center for Language and Speech Processing
Johns Hopkins University, Baltimore MD, 21218
{kitsing,jason}@cs.jhu.edu
Abstract

We introduce neural particle smoothing, a sequential Monte Carlo method for sampling annotations of an input string from a given probability model. In contrast to conventional particle filtering algorithms, we train a proposal distribution that looks ahead to the end of the input string by means of a right-to-left LSTM. We demonstrate that this innovation can improve the quality of the sample. To motivate our formal choices, we explain how our neural model and neural sampler can be viewed as low-dimensional but nonlinear approximations to working with HMMs over very large state spaces.


1 Introduction

Many structured prediction problems in NLP can be reduced to labeling a length-$n$ input string $\mathbf{x}$ with a length-$n$ sequence $\mathbf{y}$ of tags. In some cases, these tags are annotations such as syntactic parts of speech. In other cases, they represent actions that incrementally build an output structure: IOB tags build a chunking of the input Ramshaw and Marcus (1999), shift-reduce actions build a tree Yamada and Matsumoto (2003), and finite-state transducer arcs build an output string Pereira and Riley (1997).

One may wish to score the possible taggings using a recurrent neural network, which can learn to be sensitive to complex patterns in the training data. A globally normalized conditional probability model is particularly valuable because it quantifies uncertainty and does not suffer from label bias Lafferty et al. (2001); also, such models often arise as the predictive conditional distribution corresponding to some well-designed generative model for the domain. In the neural case, however, inference in such models becomes intractable. It is hard to know what the model actually predicts and hard to compute gradients to improve its predictions.

In such intractable settings, one generally falls back on approximate inference or sampling. In the NLP community, beam search and importance sampling are common. Unfortunately, beam search considers only the approximate-top-$k$ taggings from an exponential set Wiseman and Rush (2016), and importance sampling requires the construction of a good proposal distribution Dyer et al. (2016).

In this paper we exploit the sequential structure of the tagging problem to do sequential importance sampling, which resembles beam search in that it constructs its proposed samples incrementally—one tag at a time, taking the actual model into account at every step. This method is known as particle filtering Doucet and Johansen (2009). We extend it here to take advantage of the fact that the sampler has access to the entire input string as it constructs its tagging, which allows it to look ahead or—as we will show—to use a neural network to approximate the effect of lookahead. Our resulting method is called neural particle smoothing.

1.1 What this paper provides

For $0 \le t \le n$, let $\mathbf{x}_{1:t}$ and $\mathbf{x}_{t+1:n}$ respectively denote the prefix $x_1 \cdots x_t$ and the suffix $x_{t+1} \cdots x_n$ of the input string $\mathbf{x} = x_1 \cdots x_n$ (and similarly for the tag sequence $\mathbf{y}$).

We develop neural particle smoothing—a sequential importance sampling method which, given a string $\mathbf{x}$, draws a sample of taggings $\mathbf{y}$ from $p_\theta(\mathbf{y} \mid \mathbf{x})$. Our method works for any conditional probability model $p_\theta$ of the quite general form (footnote 1: A model may require for convenience that each input end with a special end-of-sequence symbol: that is, $x_n = \text{EOS}$.)

$p_\theta(\mathbf{y} \mid \mathbf{x}) \;\overset{\text{def}}{=}\; \dfrac{\exp G_n(\mathbf{x}, \mathbf{y})}{\sum_{\mathbf{y}'} \exp G_n(\mathbf{x}, \mathbf{y}')}$   (1)

where $G_\theta$ is an incremental stateful global scoring model that recursively defines scores $G_t$ of prefixes of $(\mathbf{x}, \mathbf{y})$ at all times $0 \le t \le n$:

$G_t \;\overset{\text{def}}{=}\; G_{t-1} + g_\theta(s_{t-1}, x_t, y_t), \qquad G_0 \overset{\text{def}}{=} 0$   (2)
$s_t \;\overset{\text{def}}{=}\; f_\theta(s_{t-1}, x_t, y_t), \qquad s_0 \text{ fixed}$   (3)

These quantities implicitly depend on $\mathbf{x}$ and $\mathbf{y}$. Here $s_t$ is the model's state after observing the pair of length-$t$ prefixes $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$. $G_t$ is the score-so-far of this prefix pair, while $H_t$ (defined in § 2) is the score-to-go. The state $s_t$ summarizes the prefix pair in the sense that the score-to-go depends only on $s_t$ and the length-$(n-t)$ suffixes $(\mathbf{x}_{t+1:n}, \mathbf{y}_{t+1:n})$. The local scoring function $g_\theta$ and state update function $f_\theta$ may be any functions parameterized by $\theta$—perhaps neural networks. We assume $n = |\mathbf{x}| = |\mathbf{y}|$ is fixed and given.

This model family is expressive enough to capture any desired $p(\mathbf{y} \mid \mathbf{x})$. Why? Take any joint distribution $p(\mathbf{x}, \mathbf{y})$ with this desired conditionalization (e.g., the true joint distribution) and factor it as

$p(\mathbf{x}, \mathbf{y}) \;=\; \prod_{t=1}^{n} p(x_t, y_t \mid \mathbf{x}_{1:t-1}, \mathbf{y}_{1:t-1}) \;=\; \prod_{t=1}^{n} p(x_t, y_t \mid s_{t-1})$   (4)

by making $s_{t-1}$ include as much information about $(\mathbf{x}_{1:t-1}, \mathbf{y}_{1:t-1})$ as needed for (4) to hold (possibly $s_{t-1} = (\mathbf{x}_{1:t-1}, \mathbf{y}_{1:t-1})$ itself). (footnote 2: Furthermore, $s_{t-1}$ could even depend on all of $\mathbf{x}$ (if $f_\theta$ does), allowing direct expression of models such as stacked BiRNNs.) Then by defining $g_\theta(s_{t-1}, x_t, y_t) \overset{\text{def}}{=} \log p(x_t, y_t \mid s_{t-1})$ as shown in (4), we get $\exp G_n(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}, \mathbf{y})$ and thus (1) holds for each $\mathbf{x}$.
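To make this interface concrete, the following minimal sketch (ours, not the authors' code) instantiates equations (1)–(3) in Python; the particular state update and local score are arbitrary illustrative placeholders, and any parameterized functions could play the roles of $f_\theta$ and $g_\theta$.

```python
# A minimal sketch of the model family in equations (1)-(3): an incremental
# stateful scorer.  The vocabulary handling, state update f_theta, and local
# score g_theta below are illustrative placeholders.
import numpy as np

class IncrementalScorer:
    def __init__(self, dim, vocab_x, vocab_y, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.1, size=(dim, dim + 2))  # state-update weights
        self.v = rng.normal(scale=0.1, size=dim + 2)         # local-score weights
        self.vocab_x, self.vocab_y = list(vocab_x), list(vocab_y)
        self.s0 = np.zeros(dim)                               # fixed initial state s_0

    def _features(self, s, x_t, y_t):
        return np.concatenate([s, [self.vocab_x.index(x_t), self.vocab_y.index(y_t)]])

    def f(self, s, x_t, y_t):
        """State update s_t = f_theta(s_{t-1}, x_t, y_t)   (equation 3)."""
        return np.tanh(self.W @ self._features(s, x_t, y_t))

    def g(self, s, x_t, y_t):
        """Local score g_t = g_theta(s_{t-1}, x_t, y_t)    (equation 2)."""
        return float(self.v @ np.tanh(self._features(s, x_t, y_t)))

    def score(self, xs, ys):
        """Global score G_n(x, y); exp(G_n) is the unnormalized weight of y in (1)."""
        s, G = self.s0, 0.0
        for x_t, y_t in zip(xs, ys):
            G += self.g(s, x_t, y_t)
            s = self.f(s, x_t, y_t)
        return G
```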

1.2 Relationship to particle filtering

Our method is spelled out in § 4 (one may look now). It is a variant of the popular particle filtering method that tracks the state of a physical system in discrete time Ristic et al. (2004). Our particular proposal distribution for $y_t$ can be found in equations 5, 6, 25 and 26. It considers not only past observations $\mathbf{x}_{1:t}$ as reflected in $s_t$, but also future observations $\mathbf{x}_{t+1:n}$, as summarized by the state $\bar{s}_t$ of a right-to-left recurrent neural network that we will train:

$\bar{H}_t \;\overset{\text{def}}{=}\; \bar{H}_{t+1} + \bar{g}_\phi(\bar{s}_{t+1}, x_{t+1})$   (5)
$\bar{s}_t \;\overset{\text{def}}{=}\; \bar{f}_\phi(\bar{s}_{t+1}, x_{t+1})$   (6)

Conditioning the distribution of $y_t$ on future observations means that we are doing “smoothing” rather than “filtering” (in signal processing terminology). Doing so can reduce the bias and variance of our sampler. It is possible so long as $\mathbf{x}$ is provided in its entirety before the sampler runs—which is often the case in NLP.

Figure 1: Sampling a single particle from a tagging model. The tags $\mathbf{y}_{1:t-1}$ (orange) have already been chosen, with a total model score of $G_{t-1}$, and now the sampler is constructing a proposal distribution (purple) from which the next tag $y_t$ will be sampled. Each candidate $y_t$ is evaluated according to its contribution to $G_t$ (namely $g_t$) and its future score $H_t$ (blue). The figure illustrates quantities used throughout the paper, beginning with exact sampling in equations 7–12. Our main idea (§ 3) is to approximate the computation of $H_t$ (a log-sum-exp over exponentially many tag sequences) when exact computation by dynamic programming is not an option. The form of our approximation uses a right-to-left recurrent neural network but is inspired by the exact dynamic programming algorithm.

1.3 Applications

Why sample from $p_\theta(\mathbf{y} \mid \mathbf{x})$ at all? Many NLP systems instead simply search for the Viterbi sequence $\mathbf{y}^*$ that maximizes $G_n(\mathbf{x}, \mathbf{y})$ and thus maximizes $p_\theta(\mathbf{y} \mid \mathbf{x})$. If the space of states $s_t$ is small, this can be done efficiently by dynamic programming Viterbi (1967); if not, then A$^*$ search may be an option (see § 2). More common is to use an approximate method: beam search, or perhaps a sequential prediction policy trained with reinforcement learning. Past work has already shown how to improve these approximate search algorithms by conditioning on the future Bahdanau et al. (2017); Wiseman and Rush (2016).

Sampling is essentially a generalization of maximization: sampling from $p_\theta(\mathbf{y} \mid \mathbf{x})^{\beta}$ approaches maximization as $\beta \to \infty$. It is a fundamental building block for other algorithms, as it can be used to take expectations over the whole space of possible $\mathbf{y}$ values. For unfamiliar readers, Appendix E reviews how sampling is crucially used in minimum-risk decoding, supervised training, unsupervised training, imputation of missing data, pipeline decoding, and inference in graphical models.

2 Exact Sequential Sampling

To develop our method, it is useful to first consider exact samplers. Exact sampling is tractable for only some of the models allowed by § 1.1. However, the form and notation of the exact algorithms in § 2 will guide our development of approximations in § 3.

An exact sequential sampler draws $y_t$ from $p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})$ for each $t = 1, \ldots, n$ in sequence. Then $\mathbf{y}$ is exactly distributed as $p_\theta(\mathbf{y} \mid \mathbf{x})$.

For each given $t$, observe that

$p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})$   (7)
$\;=\; p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x}) \,/\, p_\theta(\mathbf{y}_{1:t-1} \mid \mathbf{x})$   (8)
$\;\propto\; p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x}) \;=\; \sum_{\mathbf{y}_{t+1:n}} p_\theta(\mathbf{y}_{1:t}\,\mathbf{y}_{t+1:n} \mid \mathbf{x})$   (9)
$\;\propto\; \sum_{\mathbf{y}_{t+1:n}} \exp G_n \;=\; \exp\,(G_t + H_t), \quad \text{where } H_t \overset{\text{def}}{=} \log \sum_{\mathbf{y}_{t+1:n}} \exp\,(G_n - G_t)$   (10)
$\;=\; \exp G_{t-1} \cdot \exp\,(g_t + H_t), \quad \text{where } g_t \overset{\text{def}}{=} g_\theta(s_{t-1}, x_t, y_t) = G_t - G_{t-1}$   (11)
$\;\propto\; \exp\,(g_t + H_t)$   (12)

Thus, we can easily construct the needed distribution (7) by normalizing (12) over all possible values of $y_t$. The challenging part of (12) is to compute $H_t$: as defined in (10), $H_t$ involves a sum over exponentially many futures $\mathbf{y}_{t+1:n}$. (See Figure 1.)

We chose the symbols $G$ and $H$ in homage to the A$^*$ search algorithm Hart et al. (1968). In that algorithm (which could be used to find the Viterbi sequence), $g$ denotes the score-so-far of a partial solution $\mathbf{y}_{1:t}$, and $h$ denotes the optimal score-to-go. Thus, $g + h$ would be the score of the best sequence with prefix $\mathbf{y}_{1:t}$. Analogously, our $G_t + H_t$ is the log of the total exponentiated scores of all sequences with prefix $\mathbf{y}_{1:t}$. $G_t$ and $H_t$ might be called the logprob-so-far and logprob-to-go of $\mathbf{y}_{1:t}$.

Just as A$^*$ approximates $h$ with a “heuristic” $\hat{h}$, the next section will approximate $H_t$ using a neural estimate $\hat{H}_t$ (equations 5–6). However, the specific form of our approximation is inspired by cases where $H_t$ can be computed exactly. We consider those in the remainder of this section.
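The generic exact sampler can be sketched as follows (our illustration, not the authors' code); it assumes an oracle `logprob_to_go` that returns $H_t$ exactly from the state $s_t$ and position $t$, as the dynamic programs in §§ 2.1–2.2 provide for HMMs and OOHMMs.

```python
# A sketch of the exact sequential sampler of equations (7)-(12).  The oracle
# logprob_to_go(s_t, t, xs) must return H_t exactly; tagset, f, g follow the
# interface of section 1.1.  All names are illustrative.
import math
import random

def sample_exact(xs, tagset, s0, f, g, logprob_to_go, rng=random.Random(0)):
    """Draw y ~ p_theta(. | x) by normalizing (12) over y_t at each step."""
    s, ys = s0, []
    for t in range(1, len(xs) + 1):
        x_t = xs[t - 1]
        logits = []
        for y in tagset:
            s_next = f(s, x_t, y)                                       # candidate state s_t
            logits.append(g(s, x_t, y) + logprob_to_go(s_next, t, xs))  # g_t + H_t
        m = max(logits)
        weights = [math.exp(a - m) for a in logits]                     # normalize (7) over tags
        y_t = rng.choices(tagset, weights=weights)[0]
        ys.append(y_t)
        s = f(s, x_t, y_t)
    return ys
```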

2.1 Exact sampling from HMMs

A hidden Markov model (HMM) specifies a normalized joint distribution $p_\theta(\mathbf{x}, \mathbf{y})$ over a state (tag) sequence $\mathbf{y}$ and an observation sequence $\mathbf{x}$. (footnote 3: The HMM actually specifies a distribution over a pair of infinite sequences, but here we consider the marginal distribution over just the length-$n$ prefixes.) Thus the posterior $p_\theta(\mathbf{y} \mid \mathbf{x})$ is proportional to $p_\theta(\mathbf{x}, \mathbf{y})$, as required by equation 1.

The HMM specifically defines $p_\theta(\mathbf{x}, \mathbf{y})$ by equations 2–3 with $s_t = y_t$ and $g_\theta(s_{t-1}, x_t, y_t) = \log\big(p(y_t \mid y_{t-1})\, p(x_t \mid y_t)\big)$. (footnote 4: It takes $y_0 = \text{BOS}$, a beginning-of-sequence symbol, so $p(\cdot \mid \text{BOS})$ specifies the initial state distribution.)

In this setting, $H_t$ can be computed exactly by the backward algorithm Rabiner (1989). (Details are given in Appendix A for completeness.)

2.2 Exact sampling from OOHMMs

For sequence tagging, a weakness of (first-order) HMMs is that the model state may contain little information: only the most recent tag is remembered, so the number of possible model states is limited by the vocabulary of output tags.

We may generalize so that the data generating process is in a latent state $u_t \in \{1, \ldots, k\}$ at each time $t$, and the observed $x_t$—along with $y_t$—is generated from $u_t$. Now $k$ may be arbitrarily large. The model has the form

$p_\theta(\mathbf{x}, \mathbf{y}) \;\overset{\text{def}}{=}\; \sum_{\mathbf{u}} \prod_{t=1}^{n} p(u_t \mid u_{t-1})\; p(x_t, y_t \mid u_t)$   (13)

This is essentially a pair HMM Knudsen and Miyamoto (2003) without insertions or deletions, also known as an “$\epsilon$-free” or “same-length” probabilistic finite-state transducer. We refer to it here as an output-output HMM (OOHMM). (footnote 5: This is by analogy with the input-output HMM (IOHMM) of Bengio and Frasconi (1996), which defines $p_\theta(\mathbf{y} \mid \mathbf{x})$ directly and conditions the transition to $u_t$ on $x_t$. The OOHMM instead defines $p_\theta(\mathbf{y} \mid \mathbf{x})$ by conditionalizing (13)—which avoids the label bias problem Lafferty et al. (2001) that in the IOHMM, $y_t$ is independent of future input $\mathbf{x}_{t+1:n}$ (given the past input $\mathbf{x}_{1:t}$).)

Is this still an example of the general model architecture from § 1.1? Yes. Since $u_t$ is latent and evolves stochastically, it cannot be used as the state $s_t$ in equations 2–3 or (4). However, we can define $s_t$ to be the model's belief state after observing $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$. The belief state is the posterior probability distribution over the underlying state $u_t$ of the system. That is, $s_t$ deterministically keeps track of all possible states that the OOHMM might be in—just as the state of a determinized FSA keeps track of all possible states that the original nondeterministic FSA might be in.

We may compute the belief state in terms of a vector $\boldsymbol{\alpha}_t \in \mathbb{R}^k$ of forward probabilities that starts at the initial state distribution $\boldsymbol{\alpha}_0$,

$\alpha_t(u) \;\overset{\text{def}}{=}\; p_\theta(\mathbf{x}_{1:t}, \mathbf{y}_{1:t}, u_t = u)$   (14)

and is updated deterministically for each $t$ by the forward algorithm Rabiner (1989):

$\alpha_t(u') \;=\; \sum_{u} \alpha_{t-1}(u)\; p(u' \mid u)\; p(x_t, y_t \mid u')$   (15)

$\log \alpha_t(u)$ can be interpreted as the logprob-so-far if the system is in state $u$ after observing $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$. We may express the update rule (15) by $\boldsymbol{\alpha}_t = \boldsymbol{\alpha}_{t-1} M_t$, where the matrix $M_t$ depends on $(x_t, y_t)$, namely $M_t(u, u') \overset{\text{def}}{=} p(u' \mid u)\, p(x_t, y_t \mid u')$.

The belief state $s_t \overset{\text{def}}{=} \operatorname{normalize}(\boldsymbol{\alpha}_t)$ simply normalizes $\boldsymbol{\alpha}_t$ into a probability vector, where $\operatorname{normalize}(\mathbf{v}) \overset{\text{def}}{=} \mathbf{v} / (\mathbf{v} \cdot \mathbf{1})$ denotes the normalization operator. The state update (15) now takes the form (3) as desired, with a normalized vector-matrix product:

$s_t \;=\; f_\theta(s_{t-1}, x_t, y_t) \;\overset{\text{def}}{=}\; \operatorname{normalize}(s_{t-1} M_t)$   (16)

As in the HMM case, we define $G_t$ as the log of the generative prefix probability,

$G_t \;\overset{\text{def}}{=}\; \log p_\theta(\mathbf{x}_{1:t}, \mathbf{y}_{1:t}) \;=\; \log\,(\boldsymbol{\alpha}_t \cdot \mathbf{1})$   (17)

which has the form (2) as desired if we put

$g_\theta(s_{t-1}, x_t, y_t) \;\overset{\text{def}}{=}\; \log\,(s_{t-1} M_t \cdot \mathbf{1}) \;=\; G_t - G_{t-1}$   (18)
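The forward computation above can be sketched with plain numpy as follows (an illustration under assumed array conventions, not the paper's code): `trans[u, u2]` stands for $p(u' \mid u)$ and `emit[u2, x, y]` for $p(x, y \mid u')$.

```python
# A sketch of the OOHMM forward/belief-state updates in equations (14)-(18).
import numpy as np

def forward_step(alpha_prev, trans, emit, x_t, y_t):
    """One step of (15): alpha_t = alpha_{t-1} M_t with M_t(u,u') = p(u'|u) p(x_t,y_t|u')."""
    e = emit[:, x_t, y_t]                 # p(x_t, y_t | u'), shape (k,)
    M_t = trans * e[np.newaxis, :]        # shape (k, k)
    return alpha_prev @ M_t

def belief_state_and_score(alpha_prev, trans, emit, x_t, y_t):
    """Return (s_t, g_t) as in (16) and (18)."""
    alpha_t = forward_step(alpha_prev, trans, emit, x_t, y_t)
    g_t = np.log(alpha_t.sum()) - np.log(alpha_prev.sum())   # G_t - G_{t-1}
    s_t = alpha_t / alpha_t.sum()                            # normalize(alpha_t)
    return s_t, g_t
```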

Again, exact sampling is possible. It suffices to compute (9). For the OOHMM, this is given by

$p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x}) \;\propto\; \boldsymbol{\alpha}_t \cdot \boldsymbol{\beta}_t \;=\; \sum_{u} \alpha_t(u)\,\beta_t(u)$   (19)

where $\beta_n(u) \overset{\text{def}}{=} 1$ and the backward algorithm

$\beta_t(u) \;\overset{\text{def}}{=}\; \sum_{u'} N_{t+1}(u, u')\; \beta_{t+1}(u'), \quad \text{where } N_{t+1}(u, u') \overset{\text{def}}{=} \sum_{y} p(u' \mid u)\, p(x_{t+1}, y \mid u')$   (20)

for $0 \le t < n$ uses dynamic programming to find the total probability of all ways to generate the future observations $\mathbf{x}_{t+1:n}$. Note that $\boldsymbol{\alpha}_t$ is defined for a specific prefix $\mathbf{y}_{1:t}$ (though it sums over all $u_t$), whereas $\boldsymbol{\beta}_t$ sums over all suffixes $\mathbf{y}_{t+1:n}$ (and over all $u_{t+1:n}$), to achieve the asymmetric summation in (19).

Define $\bar{s}_t \overset{\text{def}}{=} \operatorname{normalize}(\boldsymbol{\beta}_t)$ to be a normalized version of $\boldsymbol{\beta}_t$. The recurrence (20) can clearly be expressed in the form $\bar{s}_t = \operatorname{normalize}(N_{t+1}\, \bar{s}_{t+1})$, much like (16).

2.3 The logprob-to-go for OOHMMs

Let us now work out the definition of $H_t$ for OOHMMs (cf. equation 35 in Appendix A for HMMs). We will write it in terms of $\bar{s}_t$ from § 1.2. Let us define $\bar{H}_t$ symmetrically to $G_t$ (see (17)):

$\bar{H}_t \;\overset{\text{def}}{=}\; \log\,(\boldsymbol{\beta}_t \cdot \mathbf{1})$   (21)

which has the form (5) as desired if we put

$\bar{g}_\phi(\bar{s}_{t+1}, x_{t+1}) \;\overset{\text{def}}{=}\; \log\,(\mathbf{1} \cdot N_{t+1}\, \bar{s}_{t+1}) \;=\; \bar{H}_t - \bar{H}_{t+1}$   (22)

From equations 21, 17, 19 and 10, we see

$H_t \;=\; \bar{H}_t + \log\,(s_t \cdot \bar{s}_t)$   (23)

where $\log\,(s_t \cdot \bar{s}_t)$ can be regarded as evaluating the compatibility of the state distributions $s_t$ and $\bar{s}_t$.

In short, the generic strategy (12) for exact sampling says that for an OOHMM, $y_t$ is distributed as

$p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1}) \;\propto\; \exp\big(g_\theta(s_{t-1}, x_t, y_t) + \log\,(s_t \cdot \bar{s}_t)\big)$   (24)

This is equivalent to choosing $y_t$ in proportion to (19)—but we now turn to settings where it is infeasible to compute (19) exactly. There we will use the formulation (24) but approximate $\bar{s}_t$ and the compatibility term. For completeness, we will also consider how to approximate $\bar{H}_t$, which dropped out of the above distribution (because it was the same for all choices of $y_t$) but may be useful for other algorithms (see § 4).
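Under the same assumed array conventions as the previous sketch, exact posterior sampling from an OOHMM via (24) can be illustrated as follows; tags and observations are taken to be integer indices, and the code is a sketch rather than the authors' implementation.

```python
# A sketch of exact OOHMM sampling via equation (24): precompute the
# normalized backward vectors bar_s_t, then draw each y_t in proportion to
# exp(g_t) * (s_t . bar_s_t).
import numpy as np

def backward_vectors(xs, trans, emit):
    """bar_s_t = normalize(beta_t), computed right to left as in (20)."""
    k = trans.shape[0]
    bars = [None] * (len(xs) + 1)
    beta = np.ones(k)
    bars[len(xs)] = beta / beta.sum()
    for t in range(len(xs) - 1, -1, -1):
        N = trans * emit[:, xs[t], :].sum(axis=1)[np.newaxis, :]  # sums over y_{t+1}
        beta = N @ beta
        beta = beta / beta.sum()            # renormalize for numerical stability
        bars[t] = beta
    return bars

def sample_oohmm_posterior(xs, alpha0, trans, emit, rng=np.random.default_rng(0)):
    """xs: observation indices; returns tag indices y_1..y_n drawn via (24)."""
    num_tags = emit.shape[2]
    bars = backward_vectors(xs, trans, emit)
    alpha, ys = alpha0.astype(float), []
    for t, x_t in enumerate(xs, start=1):
        weights, cands = [], []
        for y in range(num_tags):
            a = alpha @ (trans * emit[:, x_t, y][np.newaxis, :])  # candidate alpha_t
            s_t = a / a.sum()
            # unnormalized (24): exp(g_t) * (s_t . bar_s_t), with exp(g_t) prop. to a.sum()
            weights.append(a.sum() * float(s_t @ bars[t]))
            cands.append(a)
        weights = np.array(weights)
        y_t = int(rng.choice(num_tags, p=weights / weights.sum()))
        ys.append(y_t)
        alpha = cands[y_t] / cands[y_t].sum()   # keep the belief state normalized
    return ys
```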

3 Neural Modeling as Approximation

3.1 Models with large state spaces

The expressivity of an OOHMM is limited by the number of states $k$. The state $u_t$ is a bottleneck between the past $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$ and the future $(\mathbf{x}_{t+1:n}, \mathbf{y}_{t+1:n})$, in that past and future are conditionally independent given $u_t$. Thus, the mutual information between past and future is at most $\log_2 k$ bits.

In many NLP domains, however, the past seems to carry substantial information about the future. The first half of a sentence greatly reduces the uncertainty about the second half, by providing information about topics, referents, syntax, semantics, and discourse. This suggests that an accurate HMM language model would require very large $k$—as would a generative OOHMM model $p_\theta(\mathbf{x}, \mathbf{y})$ of annotated language. The situation is perhaps better for discriminative models $p_\theta(\mathbf{y} \mid \mathbf{x})$, since much of the information for predicting $\mathbf{y}_{t+1:n}$ might be available in $\mathbf{x}$. Still, it is important to let $\mathbf{y}_{1:t}$ contribute enough additional information about $\mathbf{y}_{t+1:n}$: even for short strings, making $k$ too small (giving only $\log_2 k$ bits) may harm prediction Dreyer et al. (2008).

Of course, (4) says that an OOHMM can express any joint distribution $p(\mathbf{x}, \mathbf{y})$ for which the mutual information between past and future is finite (footnote 6: This is not true for the language of balanced parentheses.), by taking $k$ large enough for $u_t$ to capture the relevant information from $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$.

So why not just take $k$ to be large—say, $k = 2^{30}$ to allow 30 bits of information? Unfortunately, evaluating $p_\theta$ then becomes very expensive—both computationally and statistically. As we have seen, if we define $s_t$ to be the belief state $\operatorname{normalize}(\boldsymbol{\alpha}_t) \in \mathbb{R}^k$, updating it at each observation (equation 3) requires multiplication by a $k \times k$ matrix $M_t$. This takes time $O(k^2)$, and requires enough data to learn $O(k^2)$ transition probabilities.

3.2 Neural approximation of the model

As a solution, we might hope that for the inputs observed in practice, the very high-dimensional belief states $\operatorname{normalize}(\boldsymbol{\alpha}_t) \in \mathbb{R}^k$ might tend to lie near a $d$-dimensional manifold where $d \ll k$. Then we could take $s_t$ to be a vector in $\mathbb{R}^d$ that compactly encodes the approximate coordinates of the belief state relative to the manifold: $s_t = e(\operatorname{normalize}(\boldsymbol{\alpha}_t))$, where $e$ is the encoder.

In this new, nonlinearly warped coordinate system, the functions of $s_{t-1}$ in (2)–(3) are no longer the simple, essentially linear functions given by (16) and (18). They become nonlinear functions operating on the manifold coordinates. ($f_\theta$ in (16) should now ensure that $s_t \approx e(\operatorname{normalize}(\boldsymbol{\alpha}_t))$, and $g_\theta$ in (18) should now estimate $G_t - G_{t-1}$.) In a sense, this is the reverse of the “kernel trick” (Boser et al., 1992) that converts a low-dimensional nonlinear function to a high-dimensional linear one.

Our hope is that $s_t \in \mathbb{R}^d$ has enough dimensions to capture the useful information from the true belief state, and that the neural networks $f_\theta$ and $g_\theta$ have enough capacity to capture most of the dynamics of equations 18 and 16. We thus proceed to fit the neural networks directly to the data, without ever knowing the true $k$ or the structure of the original operators $M_t$.

We regard this as the implicit justification for various published probabilistic sequence models that incorporate neural networks. These models usually have the form of § 1.1. Most simply, $f_\theta$ can be instantiated as one time step of an RNN Aharoni and Goldberg (2017), but it is common to use enriched versions such as deep LSTMs. It is also common to have the state $s_t$ contain not only a vector of manifold coordinates in $\mathbb{R}^d$ but also some unboundedly large representation of $(\mathbf{x}_{1:t}, \mathbf{y}_{1:t})$ (cf. equation 4), so the neural network can refer to this material with an attentional Bahdanau et al. (2015) or stack mechanism Dyer et al. (2015).

A few such papers have used globally normalized conditional models that can be viewed as approximating some OOHMM, e.g., the parsers of Dyer et al. (2016) and Andor et al. (2016). That is the case (§ 1.1) that particle smoothing aims to support. Most papers are locally normalized conditional models (e.g., Kann and Schütze, 2016; Aharoni and Goldberg, 2017); these simplify supervised training and can be viewed as approximating IOHMMs (footnote 5). For locally normalized models, $H_t = 0$ by construction, in which case particle filtering (which in effect takes $\hat{H}_t = 0$) is just as good as particle smoothing. Particle filtering is still useful for these models, but lookahead's inability to help them is an expressive limitation (known as label bias) of locally normalized models. We hope the existence of particle smoothing (which learns an estimate $\hat{H}_t$) will make it easier to adopt, train, and decode globally normalized models, as discussed in § 1.3.

3.3 Neural approximation of logprob-to-go

We can adopt the same neuralization trick to approximate the OOHMM's logprob-to-go $H_t$. We take $\bar{s}_t \in \mathbb{R}^d$ on the same theory that it is a low-dimensional reparameterization of $\operatorname{normalize}(\boldsymbol{\beta}_t)$, and define $\bar{g}_\phi$ and $\bar{f}_\phi$ in equations 5–6 to be neural networks. Finally, we must replace the compatibility term $\log\,(s_t \cdot \bar{s}_t)$ in (23) with another neural network $C_\phi$ that works on the low-dimensional approximations: (footnote 7: $\hat{H}_n = 0$ is correct according to (23). Forcing this ensures $\hat{H}_n = H_n$, so our approximation becomes exact as of $t = n$.)

$\hat{H}_t \;\overset{\text{def}}{=}\; \bar{H}_t + C_\phi(s_t, \bar{s}_t) \quad \text{(except that } \hat{H}_n \overset{\text{def}}{=} 0\text{)}$   (25)

The resulting approximation to (24) (which does not actually require $\bar{H}_t$) will be denoted $q$:

$q(y_t \mid s_{t-1}, \bar{s}_t) \;\propto\; \exp\big(g_\theta(s_{t-1}, x_t, y_t) + C_\phi(s_t, \bar{s}_t)\big)$   (26)

The neural networks $\bar{g}_\phi$, $\bar{f}_\phi$ and $C_\phi$ in the present section are all parameterized by $\phi$, and are intended to produce an estimate $\hat{H}_t$ of the logprob-to-go $H_t$—a function of $(\mathbf{x}, \mathbf{y}_{1:t})$, which sums over all possible $\mathbf{y}_{t+1:n}$.

By contrast, the OOHMM-inspired neural networks suggested in § 3.2 were used to specify an actual model of the logprob-so-far $G_t$—a function of $\mathbf{x}_{1:t}$ and $\mathbf{y}_{1:t}$—using separate parameters $\theta$.

Arguably $\phi$ has a harder modeling job than $\theta$ because it must implicitly sum over possible futures $\mathbf{y}_{t+1:n}$. We now consider how to get corrected samples from $p_\theta$ even if $\phi$ gives poor estimates of $H_t$, and then how to train $\phi$ to improve those estimates.
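For concreteness, the proposal (26) can be sketched as follows (illustrative names; `C_phi` is whatever trained network scores the compatibility of the left-to-right state with the right-to-left state):

```python
# A sketch of the neural proposal distribution (26).
import math

def proposal_distribution(s_prev, bar_s_t, x_t, tagset, f_theta, g_theta, C_phi):
    """Return q(y_t | ...) prop. to exp(g_theta(s_{t-1}, x_t, y_t) + C_phi(s_t, bar_s_t))."""
    logits = []
    for y in tagset:
        s_t = f_theta(s_prev, x_t, y)                    # candidate next state
        logits.append(g_theta(s_prev, x_t, y) + C_phi(s_t, bar_s_t))
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]             # numerically stable softmax
    Z = sum(exps)
    return [e / Z for e in exps]
```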

4 Particle Smoothing

In this paper, we assume nothing about the given model $p_\theta$ except that it is given in the form of equations 1–3 (including the parameter vector $\theta$).

Suppose we run the exact sampling strategy but approximate $p_\theta(y_t \mid \mathbf{x}, \mathbf{y}_{1:t-1})$ in (7) with a proposal distribution of the form in (25)–(26). Suppressing the subscripts $\theta$ and $\phi$ for brevity, this means we are effectively drawing not from $p(\mathbf{y} \mid \mathbf{x})$ but from

$q(\mathbf{y} \mid \mathbf{x}) \;\overset{\text{def}}{=}\; \prod_{t=1}^{n} q(y_t \mid s_{t-1}, \bar{s}_t)$   (27)

If $\hat{H}_t = H_t$ within each draw, then $q = p$.

Normalized importance sampling corrects (mostly) for the approximation by drawing many sequences $\mathbf{y}^{(1)}, \ldots, \mathbf{y}^{(M)}$ IID from (27) and assigning each a relative weight of $w^{(m)} \overset{\text{def}}{=} p(\mathbf{y}^{(m)} \mid \mathbf{x}) / q(\mathbf{y}^{(m)} \mid \mathbf{x})$. This ensemble of weighted particles yields a distribution

$\hat{p}(\mathbf{y}) \;\overset{\text{def}}{=}\; \dfrac{\sum_{m=1}^{M} w^{(m)}\, \mathbb{I}(\mathbf{y} = \mathbf{y}^{(m)})}{\sum_{m=1}^{M} w^{(m)}}$   (28)

that can be used as discussed in § 1.3. To compute $w^{(m)}$ in practice, we replace the numerator $p(\mathbf{y}^{(m)} \mid \mathbf{x})$ by the unnormalized version $\exp G_n(\mathbf{x}, \mathbf{y}^{(m)})$, which gives the same $\hat{p}$. Recall that each $G_n$ is a sum $\sum_{t=1}^{n} g_t$.

Sequential importance sampling is an equivalent implementation that makes $t$ the outer loop and $m$ the inner loop. It computes a prefix ensemble

$\mathcal{Y}_t \;\overset{\text{def}}{=}\; \big\{(\mathbf{y}_{1:t}^{(1)}, w_t^{(1)}), \ldots, (\mathbf{y}_{1:t}^{(M)}, w_t^{(M)})\big\}$   (29)

for each $t$ in sequence. Initially, $\mathbf{y}_{1:0}^{(m)} = \epsilon$ and $w_0^{(m)} = 1$ for all $m$. Then for $t = 1, \ldots, n$, we extend these particles in parallel:

$\mathbf{y}_{1:t}^{(m)} \;\overset{\text{def}}{=}\; \mathbf{y}_{1:t-1}^{(m)}\; y_t^{(m)}$   (30)
$w_t^{(m)} \;\overset{\text{def}}{=}\; w_{t-1}^{(m)} \cdot \dfrac{\exp\big(g_t^{(m)} + \hat{H}_t^{(m)} - \hat{H}_{t-1}^{(m)}\big)}{q\big(y_t^{(m)} \mid s_{t-1}^{(m)}, \bar{s}_t\big)}$   (31)

where each $y_t^{(m)}$ is drawn from (26). Each $\mathcal{Y}_t$ yields a distribution $\hat{p}_t$ over prefixes $\mathbf{y}_{1:t}$, which estimates the distribution $p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x})$. We return $\hat{p} = \hat{p}_n$. This gives the same $\hat{p}$ as in (28): the final $\mathbf{y}^{(m)}$ are the same, with the same final weights up to a constant factor (since $\hat{H}_n = 0$), where $G_n$ was now summed up as $\sum_t g_t^{(m)}$.
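The loop just described can be sketched as follows (reusing `proposal_distribution` from the sketch in § 3.3; the callable `hat_H` returns the estimated logprob-to-go, and setting it to 0 everywhere recovers particle filtering). Signatures and names are illustrative, and weights are kept in probability space only for brevity.

```python
# A sketch of the sequential importance sampler of equations (29)-(31).
import math
import random

def particle_smooth(xs, tagset, s0, f_theta, g_theta, C_phi, bar_states, hat_H,
                    M=64, rng=random.Random(0)):
    """Return M weighted particles (y, normalized weight) approximating p_theta(y | x)."""
    # each particle: (tags so far, state s_t, weight w_t, previous hat_H)
    particles = [([], s0, 1.0, hat_H(s0, 0)) for _ in range(M)]
    for t in range(1, len(xs) + 1):
        x_t = xs[t - 1]
        new_particles = []
        for ys, s, w, H_prev in particles:
            q = proposal_distribution(s, bar_states[t], x_t, tagset, f_theta, g_theta, C_phi)
            i = rng.choices(range(len(tagset)), weights=q)[0]
            y_t = tagset[i]
            s_next = f_theta(s, x_t, y_t)
            H_t = hat_H(s_next, t)
            # weight update (31): multiply by exp(g_t + hat_H_t - hat_H_{t-1}) / q(y_t)
            w = w * math.exp(g_theta(s, x_t, y_t) + H_t - H_prev) / q[i]
            new_particles.append((ys + [y_t], s_next, w, H_t))
        particles = new_particles
    total = sum(w for _, _, w, _ in particles)
    return [(ys, w / total) for ys, _, w, _ in particles]
```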

That is our basic particle smoothing strategy. If we use the naive approximation $\hat{H}_t = 0$ everywhere, it reduces to particle filtering. In either case, various well-studied improvements become available, such as various resampling schemes Douc and Cappé (2005) and the particle cascade Paige et al. (2014). (footnote 8: The particle cascade would benefit from an estimate of $H_t$, as it (like A$^*$ search) compares particles of different lengths.)

An easy improvement is multinomial resampling. After computing each $\mathcal{Y}_t$, this replaces $\mathcal{Y}_t$ with a set of $M$ new draws from $\hat{p}_t$, each of weight 1—which tends to drop low-weight particles and duplicate high-weight ones. (footnote 9: While resampling mitigates the degeneracy problem, it could also reduce the diversity of particles. In our experiments in this paper, we only do multinomial resampling when the effective sample size of $\mathcal{Y}_t$ falls below a threshold. Doucet and Johansen (2009) give a more thorough discussion of when to resample.) For this to usefully focus the ensemble on good prefixes $\mathbf{y}_{1:t}$, $\hat{p}_t$ should be a good approximation to the true marginal $p_\theta(\mathbf{y}_{1:t} \mid \mathbf{x}) \propto \exp\,(G_t + H_t)$ from (10). That is why we arranged for the weights in (31) to incorporate $\hat{H}_t$. Without $\hat{H}_t$, we would have only $w_t^{(m)} \propto \exp G_t^{(m)} / q(\mathbf{y}_{1:t}^{(m)} \mid \mathbf{x})$—which is fine for the traditional particle filtering setting, but in our setting it ignores future information in $\mathbf{x}_{t+1:n}$ (which we have assumed is available) and also favors sequences that happen to accumulate most of their global score early rather than late (which is possible when the globally normalized model (1)–(2) is not factored in the generative form (4)).
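A sketch of the effective-sample-size test and multinomial resampling step mentioned in footnote 9 (the threshold is left as a parameter, since the text above does not pin it down):

```python
# Effective sample size and multinomial resampling, as used between time steps.
import random

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2: equals M for uniform weights, 1 when one particle dominates."""
    s = sum(weights)
    return s * s / sum(w * w for w in weights)

def maybe_resample(particles, weights, threshold, rng=random.Random(0)):
    """Replace the ensemble with M unit-weight draws when the ESS falls below threshold."""
    M = len(particles)
    if effective_sample_size(weights) >= threshold:
        return particles, weights
    total = sum(weights)
    draws = rng.choices(particles, weights=[w / total for w in weights], k=M)
    return draws, [1.0] * M
```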

5 Training the Sampler Heuristic

We now consider training the parameters $\phi$ of our sampler. These parameters determine the state updates $\bar{f}_\phi$ in (6) and the compatibility function $C_\phi$ in (25). As a result, they determine the proposal distribution $q$ used in equations 31 and 27, and thus determine the stochastic choice of $\hat{p}$ that is returned by the sampler on a given input $\mathbf{x}$.

In this paper, we simply try to tune $\phi$ to yield good proposals. Specifically, we try to ensure that $q(\mathbf{y} \mid \mathbf{x})$ in equation 27 is close to $p_\theta(\mathbf{y} \mid \mathbf{x})$ from equation 1. While this may not be necessary for the sampler to perform well downstream (footnote 10: In principle, one could attempt to train $\phi$ “end-to-end” on some downstream objective by using reinforcement learning or the Gumbel-softmax trick Jang et al. (2017); Maddison et al. (2017). For example, we might try to ensure that $\hat{p}$ closely matches the model's distribution (equation 28)—the “natural” goal of sampling. This objective can tolerate inaccurate local proposal distributions in cases where the algorithm could recover from them through resampling. Looking even farther downstream, we might merely want $\hat{p}$—which is typically used to compute expectations—to provide accurate guidance to some decision or training process (see Appendix E). This might not require fully matching the model, and might even make it desirable to deviate from an inaccurate model.), it does guarantee it (assuming that the model is correct). Specifically, we seek to minimize

$\mathrm{KL}\big(p_\theta(\cdot \mid \mathbf{x}) \,\big\|\, q(\cdot \mid \mathbf{x})\big) \;+\; \mathrm{KL}\big(q(\cdot \mid \mathbf{x}) \,\big\|\, p_\theta(\cdot \mid \mathbf{x})\big)$   (32)

averaged over examples $\mathbf{x}$ drawn from a training set. (footnote 11: Training a single approximation $q$ for all $\mathbf{x}$ is known as amortized inference.) (The training set need not provide true $\mathbf{y}$'s.)

The inclusive KL divergence $\mathrm{KL}(p_\theta \,\|\, q)$ is an expectation under $p_\theta$. We estimate it by replacing $p_\theta$ with a sample $\hat{p}$, which in practice we can obtain with our sampler under the current $\phi$. (The danger, then, is that $\hat{p}$ will be biased when $\phi$ is not yet well-trained; this can be mitigated by increasing the sample size $M$ when drawing $\hat{p}$ for training purposes.)

Intuitively, this term tries to encourage $q$ in future to re-propose those values $\mathbf{y}$ that turned out to be “good” and survived into $\hat{p}$ with high weights.

The exclusive KL divergence $\mathrm{KL}(q \,\|\, p_\theta)$ is an expectation under $q$. Since we can sample from $q$ exactly, we can get an unbiased estimate of its gradient with the likelihood ratio trick (Glynn, 1990). (footnote 12: The normalizing constant of $p_\theta(\mathbf{y} \mid \mathbf{x})$ from (1) can be ignored because the gradient of a constant is 0.) (The danger is that such “REINFORCE” methods tend to suffer from very high variance.)

This term is a popular objective for variational approximation. Here, it tries to discourage $q$ from re-proposing “bad” values $\mathbf{y}$ that turned out to have low $p_\theta(\mathbf{y} \mid \mathbf{x})$ relative to their proposal probability.

Our experiments balance “recall” (inclusive) and “precision” (exclusive) by weighting the two terms of (32) equally (which Appendix F compares to using either term alone). Alas, because of our approximation to the inclusive term, neither term's gradient will “find” and directly encourage good values of $\mathbf{y}$ that have never been proposed. Appendix B gives further discussion and formulas.
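As an illustration (assuming PyTorch; not the authors' code), the two terms can be estimated as follows. `logq` must be differentiable in $\phi$; `logp_tilde` returns the unnormalized score $G_n(\mathbf{x}, \mathbf{y})$ and is treated as a constant. The exclusive term uses the score-function ("REINFORCE") surrogate, whose gradient is an unbiased estimate of the true gradient.

```python
# Sketch of Monte-Carlo surrogates for the two KL terms in (32).
import torch

def inclusive_kl_surrogate(weighted_sample, logq):
    """-(sum_m w_m log q(y_m | x)) estimates KL(p || q) up to a phi-independent constant."""
    total = sum(w for _, w in weighted_sample)
    return -sum((w / total) * logq(y) for y, w in weighted_sample)

def exclusive_kl_surrogate(samples_from_q, logq, logp_tilde):
    """Surrogate whose gradient matches that of KL(q || p); samples are drawn from q."""
    loss = 0.0
    for y in samples_from_q:
        lq = logq(y)
        reward = (lq - logp_tilde(y)).detach()   # log q - log p~, held constant
        loss = loss + reward * lq
    return loss / len(samples_from_q)
```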

6 Models for the Experiments

To evaluate our methods, we needed pre-trained models $p_\theta$. We experimented on several models. In each case, we trained a generative model $p_\theta(\mathbf{x}, \mathbf{y})$, so that we could try sampling from its posterior distribution $p_\theta(\mathbf{y} \mid \mathbf{x})$. This is a very common setting where particle smoothing should be able to help. Details for replication are given in Appendix C.

6.1 Tagging models

We can regard a tagged sentence as a string over the “pair alphabet” $\mathcal{X} \times \mathcal{Y}$. We train an RNN language model over this pair alphabet—this is a neuralized OOHMM as suggested in § 3.2:

$p_\theta(\mathbf{x}, \mathbf{y}) \;\overset{\text{def}}{=}\; \prod_{t=1}^{n} p_\theta\big((x_t, y_t) \mid (x_1, y_1), \ldots, (x_{t-1}, y_{t-1})\big)$   (33)

This model is locally normalized, so that $\log p_\theta(\mathbf{x}, \mathbf{y})$ (as well as its gradient) is straightforward to compute for a given training pair $(\mathbf{x}, \mathbf{y})$. Joint sampling from it would also be easy (§ 3.2).

However, $p_\theta(\mathbf{y} \mid \mathbf{x})$ is globally renormalized (by an unknown partition function that depends on $\mathbf{x}$, namely $Z(\mathbf{x}) = \sum_{\mathbf{y}} p_\theta(\mathbf{x}, \mathbf{y}) = p_\theta(\mathbf{x})$). Conditional sampling of $\mathbf{y}$ is therefore potentially hard. Choosing $y_t$ optimally requires knowledge of $H_t$, which depends on the future $\mathbf{x}_{t+1:n}$.

As we noted in § 1, many NLP tasks can be seen as tagging problems. In this paper we experiment with two such tasks: English stressed syllable tagging, where the stress of a syllable often depends on the number of remaining syllables (footnote 13: English, like many other languages, assigns stress from right to left Hayes (1995).), providing good reason to use the lookahead provided by particle smoothing; and Chinese NER, which is a familiar textbook application and reminds the reader that our formal setup (tagging) provides enough machinery to treat other tasks (chunking).

English stressed syllable tagging

This task tags a sequence of phonemes $\mathbf{x}$, which form a word, with their stress markings $\mathbf{y}$. Our training examples are the stressed words in the CMU pronunciation dictionary Weide (1998). We test the sampler on held-out unstressed words.

Chinese social media NER

This task does named entity recognition in Chinese, by tagging the characters of a Chinese sentence in a way that marks the named entities. We use the dataset from Peng and Dredze (2015), whose tagging scheme is a variant of the BIO scheme mentioned in § 1. We test the sampler on held-out sentences.

6.2 String source separation

This is an artificial task that provides a discrete analogue of speech source separation Zibulevsky and Pearlmutter (2001). The generative model is that $K$ strings (possibly of different lengths) are generated IID from an RNN language model, and are then combined into a single string $\mathbf{x}$ according to a random interleaving string $\mathbf{y}$. (footnote 14: We formally describe the generative process in Appendix G.) The posterior $p_\theta(\mathbf{y} \mid \mathbf{x})$ predicts the interleaving string, which suffices to reconstruct the original strings. The interleaving string is selected from the uniform distribution over all possible interleavings (given the strings' lengths). For example, with $K = 2$, a possible generative story is that we first sample two strings Foo and Bar from an RNN language model. We then draw an interleaving string 112122 from the aforementioned uniform distribution, and interleave the strings deterministically to get FoBoar.

$p_\theta(\mathbf{x}, \mathbf{y})$ is proportional to the product of the probabilities of the $K$ strings. The only parameters of $p_\theta$, then, are the parameters of the RNN language model, which we train on clean (non-interleaved) samples from a corpus. We test the sampler on random interleavings of held-out samples.

The state $s_t$ (which is provided as an input to $C_\phi$ in (25)) is the concatenation of the states of the language model as it independently generates the $K$ strings, and $g_\theta(s_{t-1}, x_t, y_t)$ is the log-probability of generating $x_t$ as the next character of the $y_t$-th string, given that string's language model state within $s_{t-1}$. As a special case, $x_n = \text{EOS}$ (see footnote 1), and the final local score is the total log-probability of termination in all $K$ language model states.
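A small sketch of this problem-specific construction (illustrative function names; `lm_step` and `lm_logprob` stand in for whatever RNN language model is used):

```python
# The source-separation state is a tuple of K language-model states; choosing
# tag y_t advances the y_t-th language model on character x_t.
def source_sep_f(state, x_t, y_t, lm_step):
    """State update: advance only the y_t-th language model state."""
    lm_states = list(state)
    lm_states[y_t] = lm_step(lm_states[y_t], x_t)
    return tuple(lm_states)

def source_sep_g(state, x_t, y_t, lm_logprob):
    """Local score: log-probability that the y_t-th source emits x_t next."""
    return lm_logprob(state[y_t], x_t)
```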

String source separation has good reason for lookahead: appending character “o” to a reconstructed string “␣gh” is only advisable if “s” and “t” are coming up soon to make “ghost.” It also illustrates a powerful application setting—posterior inference under a generative model. This task conveniently allowed us to construct the generative model from a pre-trained language model. Our constructed generative model illustrates that the state and transition function can reflect interesting problem-specific structure.

CMU Pronunciation dictionary

The CMU pronunciation dictionary (already used above) provides sequences of phonemes. Here we use words no longer than a fixed maximum number of phonemes. We interleave the (unstressed) phonemes of multiple words.

Penn Treebank

The PTB corpus Marcus et al. (1993) provides English sentences, from which we use only sentences up to a maximum length. We interleave the words of multiple sentences.

(a) tagging: stressed syllables
(b) tagging: Chinese NER
(c) source separation: PTB
(d) source separation: CMUdict
Figure 2: Offset KL divergences for the tasks in §§ 6.1 and 6.2. The logarithmic $x$-axis is the number of particles ($M$). The $y$-axis is the offset KL divergence described in § 7.1 (in bits per sequence). The smoothing samplers offer considerable speedup: for example, in Figure 2(a), the non-resampled smoothing sampler achieves comparable offset KL divergences with only a fraction as many particles as its filtering counterparts. Abbreviations in the legend: PF = particle filtering; PS = particle smoothing; BEAM = beam search; ‘:R’ suffixes indicate resampled variants. For readability, beam search results are omitted from Figure 2(d), but appear in Figure 3 of the appendices.

7 Experiments

In our experiments, we are given a pre-trained scoring model $G_\theta$, and we train the parameters $\phi$ of a particle smoothing algorithm. (footnote 15: For the details of the training procedures and the specific neural architectures in our models, see Appendices C and D.)

We now show that our proposed neural particle smoothing sampler does better than the particle filtering sampler. To define “better,” we evaluate samplers on the offset KL divergence from the true posterior.

7.1 Evaluation metrics

Given $\mathbf{x}$, the “natural” goal of conditional sampling is for the sample distribution $\hat{p}$ to approximate the true distribution $p_\theta(\cdot \mid \mathbf{x})$ from (1). We will therefore report—averaged over all held-out test examples $\mathbf{x}$—the KL divergence

$\mathrm{KL}\big(\hat{p} \,\big\|\, p_\theta\big) \;=\; \sum_{\mathbf{y}} \hat{p}(\mathbf{y}) \log \dfrac{\hat{p}(\mathbf{y})}{\tilde{p}(\mathbf{y})} \;+\; \log Z(\mathbf{x})$   (34)

where $\tilde{p}$ denotes the unnormalized distribution $\exp G_n$ given by (2), and $Z(\mathbf{x})$ denotes its normalizing constant, $\sum_{\mathbf{y}} \tilde{p}(\mathbf{y})$.

As we are unable to compute $\log Z(\mathbf{x})$ in practice, we replace it with an estimate $\log \hat{Z}(\mathbf{x})$ to obtain an offset KL divergence. This change of constant does not change the measured difference between two samplers, $\mathrm{KL}(\hat{p}_1 \,\|\, p_\theta) - \mathrm{KL}(\hat{p}_2 \,\|\, p_\theta)$. Nonetheless, we try to use a reasonable estimate so that the reported KL divergence is interpretable in an absolute sense. Specifically, we take $\hat{Z}(\mathbf{x}) \overset{\text{def}}{=} \sum_{\mathbf{y} \in \mathcal{S}(\mathbf{x})} \tilde{p}(\mathbf{y})$, where $\mathcal{S}(\mathbf{x})$ is the full set of distinct particles that we ever drew for input $\mathbf{x}$, including samples from the beam search models, while constructing the experimental results graph. (footnote 16: Thus, $\mathcal{S}(\mathbf{x})$ was collected across all samplings, iterations, and ensemble sizes $M$, in an attempt to make the summation over $\mathbf{y}$ as complete as possible. For good measure, we added some extra particles: whenever we drew particles via particle smoothing, we drew an additional batch by particle filtering and added them to $\mathcal{S}(\mathbf{x})$.) Thus, the offset KL divergence is a “best effort” lower bound on the true exclusive KL divergence $\mathrm{KL}(\hat{p} \,\|\, p_\theta)$.
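The metric can be computed as in the following sketch (ours): `logp_tilde(y)` returns $G_n(\mathbf{x}, \mathbf{y})$, `sample` is the weighted particle set defining $\hat{p}$ (with duplicate particles merged and weights summing to 1), and `pooled` is the pooled particle set $\mathcal{S}(\mathbf{x})$ used for $\hat{Z}(\mathbf{x})$.

```python
# A sketch of the offset KL divergence of equation (34) with log Z replaced
# by a logsumexp over the pooled particle set.
import math

def logsumexp(vals):
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals))

def offset_kl(sample, logp_tilde, pooled):
    log_Z_hat = logsumexp([logp_tilde(y) for y in pooled])
    return sum(w * (math.log(w) - logp_tilde(y)) for y, w in sample if w > 0) + log_Z_hat
```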

7.2 Results

In all experiments we compute the offset KL divergence for both the particle filtering samplers and the particle smoothing samplers, for varying ensemble sizes $M$. We also compare against a beam search baseline that keeps the $M$ highest-scoring particles at each step (scored by $G_t$ with no lookahead). The results are in Figures 2(a)–2(d).

Given a fixed ensemble size, we see the smoothing sampler consistently performs better than the filtering counterpart. It often achieves comparable performance at a fraction of the ensemble size.

Beam search on the other hand falls behind on three tasks: stress prediction and the two source separation tasks. It does perform better than the stochastic methods on the Chinese NER task, but only at small beam sizes. Varying the beam size barely affects performance at all, across all tasks. This suggests that beam search is unable to explore the hypothesis space well.

We experiment with resampling for both the particle filtering sampler and our smoothing sampler. In source separation and stressed syllable prediction, where the right context contains critical information about how viable a particle is, resampling helps particle filtering almost catch up to particle smoothing. Particle smoothing itself is not further improved by resampling, presumably because its effective sample size is high. The goal of resampling is to kill off low-weight particles (which were overproposed) and reallocate their resources to higher-weight ones. But with particle smoothing, there are fewer low-weight particles, so the benefit of resampling may be outweighed by its cost (namely, increased variance).

8 Related Work

Much previous work has employed sequential importance sampling for approximate inference of intractable distributions (e.g., Thrun, 2000; Andrews et al., 2017). Some of this work learns adaptive proposal distributions in this setting (e.g. Gu et al., 2015; Paige and Wood, 2016). The key difference in our work is that we consider future inputs, which is impossible in online decision settings such as robotics. Klaas et al. (2006) did do particle smoothing, like us, but they did not learn adaptive proposal distributions.

Just as we use a right-to-left RNN to guide posterior sampling of a left-to-right generative model, Krishnan et al. (2017) employed a right-to-left RNN to guide posterior marginal inference in the same sort of model. Serdyuk et al. (2018) used a right-to-left RNN to regularize training of such a model.

9 Conclusion

We have described neural particle smoothing, a sequential Monte Carlo method for approximate sampling from the posterior of incremental neural scoring models. Sequential importance sampling has arguably been underused in the natural language processing community. It is quite a plausible strategy for dealing with rich, globally normalized probability models such as neural models—particularly if a good sequential proposal distribution can be found. Our contribution is a neural proposal distribution, which goes beyond particle filtering in that it uses a right-to-left recurrent neural network to “look ahead” to future symbols of $\mathbf{x}$ when proposing each tag $y_t$. The form of our distribution is well-motivated.

There are many possible extensions to the work in this paper. For example, we can learn the generative model and proposal distribution jointly; we can also infuse them with hand-crafted structure, or use more deeply stacked architectures; and we can try training the proposal distribution end-to-end (footnote 10). Another possible extension would be to allow each step of the sampler to propose a sequence of actions, effectively making the tagset size unbounded. This extension relaxes our restriction from § 1 that $|\mathbf{y}| = |\mathbf{x}|$ and would allow us to do general sequence-to-sequence transduction.

Acknowledgements

This work has been generously supported by a Google Faculty Research Award and by Grant No. 1718846 from the National Science Foundation.

References

Appendix A The logprob-to-go for HMMs

As noted in § 2.1, the logprob-to-go $H_t$ can be computed by the backward algorithm. By the definition of $H_t$ in equation 10,

$H_t \;=\; \log \sum_{\mathbf{y}_{t+1:n}} \exp\,(G_n - G_t) \;=\; \log \sum_{\mathbf{y}_{t+1:n}} \prod_{t'=t+1}^{n} p(y_{t'} \mid y_{t'-1})\; p(x_{t'} \mid y_{t'})$   (35)
$\phantom{H_t} \;=\; \log \beta_t(y_t)$   (36)

where the vector $\boldsymbol{\beta}_t$ is defined by base case $\beta_n(y) \overset{\text{def}}{=} 1$ and for $t < n$ by the recurrence

$\beta_t(y) \;\overset{\text{def}}{=}\; \sum_{y'} p(y' \mid y)\; p(x_{t+1} \mid y')\; \beta_{t+1}(y')$   (37)

The backward algorithm (20) for OOHMMs in § 2.2 is a variant of this.
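For concreteness, the HMM backward pass can be sketched in a few lines (our illustration, with assumed array conventions `trans[y, y2]` $= p(y' \mid y)$ and `emit[y, x]` $= p(x \mid y)$):

```python
# A sketch of the backward recurrence (35)-(37) for an HMM.
import numpy as np

def hmm_backward(xs, trans, emit):
    """Return beta with beta[t, y] = total prob. of generating x_{t+1:n} from tag y at time t."""
    n, k = len(xs), trans.shape[0]
    beta = np.ones((n + 1, k))                     # base case: beta_n = 1
    for t in range(n - 1, -1, -1):                 # recurrence (37), right to left
        beta[t] = trans @ (emit[:, xs[t]] * beta[t + 1])
    return beta                                    # H_t = log beta[t, y_t]   (equation 36)
```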

Appendix B Gradients for Training the Proposal Distribution

For a given , both forms of KL divergence achieve their minimum of 0 when