Abstract

The well-known Gumbel-Max trick for sampling from a categorical distribution can be extended to sample elements without replacement. We show how to implicitly apply this 'Gumbel-Top-$k$' trick on a factorized distribution over sequences, allowing us to draw exact samples without replacement using a Stochastic Beam Search. Even for exponentially large domains, the number of model evaluations grows only linearly in $k$ and the maximum sampled sequence length. The algorithm creates a theoretical connection between sampling and (deterministic) beam search and can be used as a principled intermediate alternative. In a translation task, the proposed method compares favourably against alternatives to obtain diverse yet good quality translations. We show that sequences sampled without replacement can be used to construct low-variance estimators for expected sentence-level BLEU score and model entropy.


 

Stochastic Beams and Where to Find Them:
The Gumbel-Top-$k$ Trick for Sampling Sequences Without Replacement

 

Wouter Kool, Herke van Hoof, Max Welling


Correspondence to: Wouter Kool <w.w.m.kool@uva.nl>.

Introduction

We think the Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) is like a magic trick. It allows sampling from the categorical distribution, simply by perturbing the log-probability for each category by adding independent Gumbel distributed noise and returning the category with maximum perturbed log-probability. This trick has recently (re)gained popularity as it allows deriving reparameterizable continuous relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016). However, there is more: as was noted (in a blog post) by Vieira (2014), taking the top $k$ largest perturbations (instead of the maximum, or top 1) yields a sample of size $k$ from the categorical distribution without replacement. We call this extension of the Gumbel-Max trick the Gumbel-Top-$k$ trick.

In this paper, we consider factorized distributions over sequences, represented by (parametric) sequence models. Sequence models are widely used, e.g. in tasks such as neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015b). Many such tasks require obtaining a set of representative sequences from the model. These can be random samples, but for low-entropy distributions, a set of sequences sampled using standard sampling (with replacement) may contain many duplicates. On the other hand, a beam search can find a set of unique high-probability sequences, but these have low variability and, being deterministic, cannot be used to construct statistical estimators.

In this paper, we propose sampling without replacement as an alternative method to obtain representative sequences from a sequence model. We show that for a sequence model, we can do this by applying the Gumbel-Top-$k$ trick implicitly, without instantiating all sequences in the (typically exponentially large) domain. This procedure allows us to draw $k$ sequences using a number of model evaluations that grows only linearly in the number of samples $k$ and the maximum sampled sequence length. The algorithm uses the idea of top-down sampling (Maddison et al., 2014) and performs a beam search over stochastically perturbed log-probabilities, which is why we refer to it as Stochastic Beam Search.

Stochastic Beam Search is a novel procedure for sampling sequences that avoids duplicate samples. Unlike ordinary beam search, it has a probabilistic interpretation, so it can e.g. be used in importance sampling. As such, Stochastic Beam Search conceptually connects sampling and beam search, and combines advantages of both methods. In our experiments we give two examples of how Stochastic Beam Search can be used as a principled alternative to sampling or beam search. In these experiments, Stochastic Beam Search is used to control the diversity of translation results, as well as to construct low-variance estimators for sentence-level BLEU score and model entropy.

Preliminaries

The Categorical Distribution

A discrete random variable $I$ has a $\text{Categorical}(p_1, \ldots, p_n)$ distribution with domain $N = \{1, \ldots, n\}$ if $P(I = i) = p_i$ for $i \in N$. We refer to the categories $i \in N$ as elements in the domain and denote with $\phi_i$ the (unnormalized) log-probabilities, so $p_i \propto \exp \phi_i$ and $I \sim \text{Categorical}(p_1, \ldots, p_n)$. Therefore in general we write:

$p_i = P(I = i) = \frac{\exp \phi_i}{\sum_{j \in N} \exp \phi_j} \qquad (1)$
The Gumbel Distribution

If $U \sim \text{Uniform}(0, 1)$, then $G = \phi - \log(-\log U)$ has a Gumbel distribution with location $\phi$ and we write $G \sim \text{Gumbel}(\phi)$. From this it follows that if $G \sim \text{Gumbel}(0)$ then $G + \phi \sim \text{Gumbel}(\phi)$, so we can shift Gumbel variables.
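As a concrete illustration (a minimal NumPy sketch, not code from the paper), a Gumbel variable with location $\phi$ can be sampled via the inverse CDF, and adding a constant simply shifts the location:

import numpy as np

rng = np.random.default_rng(0)

def sample_gumbel(phi, size=None):
    # G = phi - log(-log U) with U ~ Uniform(0, 1) has a Gumbel distribution with location phi.
    u = rng.uniform(size=size)
    return phi - np.log(-np.log(u))

# Shift property: Gumbel(0) + phi is distributed as Gumbel(phi).
shifted = sample_gumbel(0.0, size=100_000) + 2.5
direct = sample_gumbel(2.5, size=100_000)
print(shifted.mean(), direct.mean())  # means agree up to Monte Carlo error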

The Gumbel-Max Trick

The Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) allows sampling from the categorical distribution (1) by independently perturbing the log-probabilities with Gumbel noise and taking the index of the largest perturbed log-probability.

Formally, let $G_i \sim \text{Gumbel}(0)$, $i \in N$, i.i.d. and let $I^* = \arg\max_{i \in N} (\phi_i + G_i)$; then $I^* \sim \text{Categorical}(p_1, \ldots, p_n)$ with $p_i \propto \exp \phi_i$. In a slight abuse of notation, we write $G_{\phi_i} = \phi_i + G_i$, so $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$, and we call $G_{\phi_i}$ the perturbed log-probability of element or category $i$ in the domain $N$.

For any subset $B \subseteq N$ it holds that (Maddison et al., 2014):

$\max_{i \in B} G_{\phi_i} \sim \text{Gumbel}\left(\log \sum_{i \in B} \exp \phi_i\right) \qquad (2)$

$\arg\max_{i \in B} G_{\phi_i} \sim \text{Categorical}\left(\frac{\exp \phi_i}{\sum_{j \in B} \exp \phi_j},\; i \in B\right) \qquad (3)$

Additionally, the max (2) and argmax (3) are independent. For details, see Maddison et al. (2014).

The Gumbel-Top-$k$ Trick

Considering the maximum as the top 1 (one), we can generalize the Gumbel-Max trick to the Gumbel-Top-$k$ trick to draw an ordered sample of size $k$ without replacement, by taking the indices of the $k$ largest perturbed log-probabilities. (By sampling without replacement from a categorical distribution, we mean sampling the first element, then renormalizing the remaining probabilities to sample the next element, etcetera. This does not mean that the inclusion probability of element $i$ is proportional to $p_i$: if we sample $k = n$ elements, all elements are included with probability 1.) Generalizing $\arg\max$, we denote with $\arg\text{top}k$ the function that takes a sequence of values and returns the indices of the $k$ largest values, in order of decreasing value.

Theorem 1.

For $k \leq n$, let $I^*_1, \ldots, I^*_k = \arg\text{top}k_{i \in N}\, G_{\phi_i}$. Then $I^*_1, \ldots, I^*_k$ is an (ordered) sample without replacement from the $\text{Categorical}(p_1, \ldots, p_n)$ distribution, i.e. for a realization $i^*_1, \ldots, i^*_k$ it holds that

$P\left(I^*_1 = i^*_1, \ldots, I^*_k = i^*_k\right) = \prod_{j=1}^{k} \frac{p_{i^*_j}}{\sum_{\ell \in N^*_j} p_\ell} \qquad (4)$

where $N^*_j = N \setminus \{i^*_1, \ldots, i^*_{j-1}\}$ is the domain (without replacement) for the $j$-th sampled element.

For the proof we refer to the Appendix. The Gumbel-Top-$k$ trick is mathematically equivalent to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), as was noted in a blog post by Vieira (2014).
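As an illustration of the trick (a minimal NumPy sketch using the notation above; not code from the paper), we perturb each log-probability with independent Gumbel noise and take the indices of the $k$ largest perturbations:

import numpy as np

def gumbel_top_k(log_probs, k, rng):
    # Ordered sample of size k without replacement from Categorical(exp(log_probs) / sum).
    perturbed = rng.gumbel(loc=log_probs)      # G_phi_i ~ Gumbel(phi_i), independently
    return np.argsort(-perturbed)[:k]          # arg top k: indices of the k largest values

rng = np.random.default_rng(0)
phi = np.log([0.5, 0.3, 0.15, 0.05])           # (normalized) log-probabilities
print(gumbel_top_k(phi, k=3, rng=rng))         # three distinct indices, in sampling order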

The Sequence Model

A sequence model is a factorized parametric distribution over sequences. The parameters $\theta$ define the conditional probability $p_\theta(y_t \mid y_{1:t-1})$ of the next token $y_t$ given the partial sequence $y_{1:t-1}$. Typically $p_\theta(y_t \mid y_{1:t-1})$ is defined as a softmax normalization of unnormalized log-probabilities $\phi$ with optional temperature $T$ (default $T = 1$):

$p_\theta(y_t \mid y_{1:t-1}) = \frac{\exp\left(\phi_{y_t} / T\right)}{\sum_{y'} \exp\left(\phi_{y'} / T\right)} \qquad (5)$

The normalization is w.r.t. a single token $y_t$, so the model is locally normalized. The total probability of a (partial) sequence follows from the chain rule of probability:

$p_\theta(y_{1:t}) = p_\theta(y_{1:t-1}) \cdot p_\theta(y_t \mid y_{1:t-1}) \qquad (6)$

$p_\theta(y) = p_\theta(y_{1:T}) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{1:t-1}) \qquad (7)$

A sequence model defines a valid probability distribution over both partial and complete sequences. When the length is irrelevant, we simply write $y$ to indicate a (partial or complete) sequence. If the model is additionally conditioned on a context $x$ (e.g., a source sentence), we write $p_\theta(y \mid x)$.
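To make the sketches in the remainder concrete, we will use the following hypothetical toy sequence model (illustrative only; the conditional distributions are arbitrary and the helper names are our own):

import numpy as np

VOCAB = [0, 1, 2]   # token ids; we assume all sequences have the same length T

def log_p_next(prefix):
    # Toy p_theta(y_t | y_{1:t-1}): a locally normalized next-token distribution
    # that (arbitrarily) depends on the last token of the prefix.
    last = prefix[-1] if prefix else 0
    phi = np.array([1.0, 0.5, -0.5]) + 0.3 * last    # unnormalized log-probabilities
    return phi - np.log(np.sum(np.exp(phi)))         # softmax normalization, Eq. (5), T = 1

def log_p_sequence(seq):
    # log p_theta(y_{1:t}) by the chain rule, Eqs. (6)-(7).
    return sum(log_p_next(seq[:t])[seq[t]] for t in range(len(seq)))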

Beam Search

A beam search is a limited-width breadth-first search. In the context of sequence models, it is often used as an approximation to finding the (single) sequence that maximizes (7), or as a way to obtain a set of high-probability sequences from the model. Starting from an empty sequence, a beam search expands at every step $t$ at most $k$ partial sequences (those with highest probability) to compute the probabilities of sequences of length $t + 1$. It terminates with a beam of $k$ complete sequences, which we assume to be of equal length (as sequences can be padded).

Figure 1: Example of the Gumbel-Top-$k$ trick on a tree. Legend: shaded nodes are ancestors of a top-$k$ leaf (required nodes); nodes with a solid border are in the top $k$ on their level (expanded nodes, on the beam); the remaining nodes are not in the top $k$ on their level (pruned nodes, off the beam). The bars next to the leaves indicate the perturbed log-probabilities $G_{\phi_i}$, while the bars next to internal nodes indicate the maximum perturbed log-probability $G_{\phi_S} = \max_{i \in S} G_{\phi_i}$ of the set of leaves $S$ in the subtree rooted at that node. Each bar is split in two to illustrate that $G_{\phi_S} = \phi_S + G_S$. Numbers in the nodes represent $p_\theta(y_{1:t})$, the probability of the partial sequence, and numbers at the edges represent the conditional probabilities $p_\theta(y_t \mid y_{1:t-1})$ of the next token. The shaded nodes are the ancestors of the top $k$ leaves with highest perturbed log-probability; these are the ones we actually need to expand. In each layer there are at most $k$ such nodes, so we are guaranteed to construct all top-$k$ leaves by expanding (at least) the top $k$ nodes, ranked on $G_{\phi_S}$, in each level (indicated by a solid border).
Stochastic Beam Search

We derive Stochastic Beam Search by starting with the explicit application of the Gumbel-Top-$k$ trick to sample $k$ sequences without replacement from a sequence model. This requires instantiating all sequences in the domain to find the $k$ largest perturbed log-probabilities. Then we transition to top-down sampling of the perturbed log-probabilities, and we use Stochastic Beam Search to instantiate (only) the sequences with the $k$ largest perturbed log-probabilities. As both methods are equivalent, Stochastic Beam Search implicitly applies the Gumbel-Top-$k$ trick and thus yields a sample of $k$ sequences without replacement.

Naive Application of the Gumbel-Top-$k$ Trick

We represent the sequence model (7) as a tree (as in Figure 1), where internal nodes at level $t$ represent partial sequences $y_{1:t}$, and leaf nodes represent completed sequences. We identify a leaf by its index $i \in N$ and write $y_i$ as the corresponding sequence, with (normalized!) log-probability $\phi_i = \log p_\theta(y_i)$. To obtain a sample of $k$ sequences from the distribution (7) without replacement, we should sample $k$ leaf nodes without replacement, for which we can naively use the Gumbel-Top-$k$ trick introduced above:

  • Compute $\phi_i = \log p_\theta(y_i)$ for all sequences $i \in N$. To reuse computations for partial sequences, the complete probability tree is instantiated, as in Figure 1.

  • Sample $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$ for $i \in N$, so $G_{\phi_i}$ can be seen as the perturbed log-probability of sequence $y_i$.

  • Let $I^*_1, \ldots, I^*_k = \arg\text{top}k_{i \in N}\, G_{\phi_i}$; then $y_{I^*_1}, \ldots, y_{I^*_k}$ is a sample of $k$ sequences from (7) without replacement.

As instantiating the complete probability tree is computationally prohibitive, we construct an equivalent process that only requires computation linear in the number of samples $k$ and the sequence length.
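For the toy model sketched above, the naive procedure can be written in a few lines (a sketch using the hypothetical helpers VOCAB and log_p_sequence; it enumerates all leaves, which is exactly what Stochastic Beam Search will avoid):

import itertools
import numpy as np

def naive_gumbel_top_k(k, T, rng):
    # Instantiate the full tree: all complete sequences of length T and their log-probabilities.
    leaves = [list(y) for y in itertools.product(VOCAB, repeat=T)]
    phi = np.array([log_p_sequence(y) for y in leaves])   # normalized log-probabilities
    perturbed = rng.gumbel(loc=phi)                        # G_phi_i ~ Gumbel(phi_i)
    top = np.argsort(-perturbed)[:k]                       # arg top k
    return [leaves[i] for i in top]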

Perturbed Log-Probabilities of Partial Sequences

For the naive implementation of the Gumbel-Top-$k$ trick, we only defined the perturbed log-probabilities $G_{\phi_i}$ for leaf nodes $i$, which correspond to complete sequences. For the Stochastic Beam Search implementation, we also define perturbed log-probabilities for internal nodes corresponding to partial sequences. We identify a node (internal or leaf) by the set $S$ of leaves in the corresponding subtree, and we write $y_S$ as the corresponding (partial or completed) sequence. Its log-probability $\phi_S = \log p_\theta(y_S)$ can be computed incrementally from the parent log-probability using (6), and since the model is locally normalized, it holds that

$\phi_S = \log p_\theta(y_S) = \log \sum_{i \in S} \exp \phi_i \qquad (8)$

Now for each node $S$, we define $G_{\phi_S}$ as the maximum of the perturbed log-probabilities of the leaves $i \in S$ in its subtree. By Equation (2), $G_{\phi_S}$ has a Gumbel distribution with location $\phi_S$ (hence its notation):

$G_{\phi_S} = \max_{i \in S} G_{\phi_i} \sim \text{Gumbel}(\phi_S) \qquad (9)$

Since $G_{\phi_S}$ is a Gumbel perturbation of the log-probability $\phi_S$ of the partial sequence $y_S$, we call it the perturbed log-probability of the partial sequence. We can also define the corresponding Gumbel noise $G_S$, which can be inferred from $G_{\phi_S}$ by the relation $G_{\phi_S} = \phi_S + G_S$.

Bottom-Up Sampling of the Perturbed Log-Probabilities

We can compute (9) recursively. Write $\text{Children}(S)$ for the set of direct children of the node $S$ (so $\text{Children}(S)$ is a partition of the set $S$). Since the maximum (9) must be attained in one of the subtrees, it holds that

$G_{\phi_S} = \max_{S' \in \text{Children}(S)} G_{\phi_{S'}} \qquad (10)$

If we want to sample $G_{\phi_S}$ for all nodes, we can use the following bottom-up sampling procedure: sample $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$ independently for the leaves and recursively compute $G_{\phi_S}$ for the internal nodes using (10). For the internal nodes this is effectively sampling from the degenerate (constant) distribution resulting from conditioning on the children.
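A quick numerical check of (9) and (10) (a sketch, assuming NumPy): the maximum of independently perturbed leaf log-probabilities has the same distribution as a single Gumbel with the internal node's location.

import numpy as np

rng = np.random.default_rng(1)
phi_leaves = np.log([0.1, 0.2, 0.3, 0.4])         # leaves of a subtree S
phi_S = np.log(np.sum(np.exp(phi_leaves)))        # location of the internal node, Eq. (8)

# Bottom-up: perturb the leaves and take the maximum, Eq. (10).
bottom_up = rng.gumbel(loc=phi_leaves, size=(100_000, 4)).max(axis=1)
# Direct: sample Gumbel(phi_S) for the internal node, Eq. (9).
direct = rng.gumbel(loc=phi_S, size=100_000)
print(bottom_up.mean(), direct.mean())            # agree up to Monte Carlo error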

Top-Down Sampling of the Perturbed Log-Probabilities

The recursive bottom-up sampling procedure can be interpreted as ancestral sampling from a tree-structured graphical model (somewhat like Figure 1) with edges directed upwards. Alternatively, we can reverse the graphical model and sample the tree top-down, starting with the root and recursively sampling the children conditionally.

Note that for the root $S = N$ (since it contains all leaves), it holds that $\phi_N = \log \sum_{i \in N} \exp \phi_i = 0$, so we can let $G_{\phi_N} \sim \text{Gumbel}(0)$. (Or we can simply set, i.e. condition on, $G_{\phi_N} = 0$. This does not affect the result by the independence of the max and the argmax.) Starting with $S = N$, we can recursively sample the children conditionally on the parent variable $G_{\phi_S}$. For the children $S' \in \text{Children}(S)$ it holds that $G_{\phi_{S'}} \sim \text{Gumbel}(\phi_{S'})$, and we sample them conditionally on (10), i.e. with their maximum equal to $G_{\phi_S}$.

Sampling a set of Gumbels conditionally on their maximum being equal to a certain value is non-trivial, but can be done by first sampling the argmax and then sampling the individual Gumbels conditionally on both the max and the argmax. Alternatively, we can let $\hat{G}_{\phi_{S'}} \sim \text{Gumbel}(\phi_{S'})$ independently, let $Z = \max_{S'} \hat{G}_{\phi_{S'}}$ and let

$\tilde{G}_{\phi_{S'}} = -\log\left(\exp\left(-G_{\phi_S}\right) - \exp(-Z) + \exp\left(-\hat{G}_{\phi_{S'}}\right)\right).$

Then $\{\tilde{G}_{\phi_{S'}}\}$ is a set of Gumbels with a maximum equal to $G_{\phi_S}$. See the Appendix for details and a numerically stable implementation.

If we recursively sample the complete tree top-down, this is equivalent to sampling the complete tree bottom-up, and as a result, for all leaves it holds that $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$, independently. The benefit of top-down sampling is that if we are interested only in obtaining the top $k$ leaves, we do not need to instantiate the complete tree.

The Stochastic Beam Search Algorithm

The key idea of Stochastic Beam Search is to apply the Gumbel-Top-$k$ trick for a sequence model without instantiating the entire tree, by using top-down sampling. With top-down sampling, to find the top $k$ leaves, at every level in the tree it suffices to expand (instantiate the subtrees of) only the $k$ nodes with highest perturbed log-probability $G_{\phi_S}$. To see this, first assume that we instantiated the complete tree using top-down sampling and consider the nodes that are ancestors of at least one of the top $k$ leaves (the shaded nodes in Figure 1). At every level $t$ of the tree, there will be at most $k$ such nodes (as each of the top $k$ leaves has only one ancestor at level $t$), and these nodes will have higher perturbed log-probabilities than the other nodes at level $t$, which do not contain a top-$k$ leaf in their subtree. This means that if we discard all but the $k$ nodes with highest perturbed log-probabilities $G_{\phi_S}$, we are guaranteed to include the ancestors of the top $k$ leaves. Formally, the $k$-th highest perturbed log-probability of the nodes at level $t$ provides a lower bound on the perturbed log-probability required for a leaf to be among the top $k$, while $G_{\phi_S}$ is an upper bound for the perturbed log-probabilities of the leaves in the subtree of $S$, so the node $S$ can be discarded or pruned if its upper bound is below the lower bound, in which case no leaf in its subtree is among the top $k$.

Thus, when we apply the top-down sampling procedure, at each level we only need to expand the $k$ nodes with the highest perturbed log-probabilities to end up with the top $k$ leaves. By the Gumbel-Top-$k$ trick the result is a sample of $k$ sequences without replacement from the sequence model. The effective procedure is a beam search over the (stochastically) perturbed log-probabilities $G_{\phi_S}$ of partial sequences $y_S$, hence the name Stochastic Beam Search. As we use $G_{\phi_S}$ to select the top $k$ partial sequences, we can also think of $G_{\phi_S}$ as the stochastic score of the partial sequence $y_S$. We formalize Stochastic Beam Search in Algorithm 1.

1:  Input: one-step probability distribution $p_\theta(y_t \mid y_{1:t-1})$, beam/sample size $k$
2:  Initialize beam empty
3:  add $(\emptyset,\; 0,\; 0)$ to beam  (empty sequence, $\phi_N = 0$, $G_{\phi_N} = 0$)
4:  for $t = 1, \ldots, T$ do
5:     Initialize expansions empty
6:     for $(y_S,\; \phi_S,\; G_{\phi_S}) \in$ beam do
7:        $Z \leftarrow -\infty$
8:        for $S' \in \text{Children}(S)$ do
9:           $\phi_{S'} \leftarrow \phi_S + \log p_\theta(y_{S'} \mid y_S)$
10:           $\hat{G}_{\phi_{S'}} \sim \text{Gumbel}(\phi_{S'})$
11:           $Z \leftarrow \max(Z, \hat{G}_{\phi_{S'}})$
12:        end for
13:        for $S' \in \text{Children}(S)$ do
14:           $\tilde{G}_{\phi_{S'}} \leftarrow -\log\left(\exp(-G_{\phi_S}) - \exp(-Z) + \exp(-\hat{G}_{\phi_{S'}})\right)$
15:           add $(y_{S'},\; \phi_{S'},\; \tilde{G}_{\phi_{S'}})$ to expansions
16:        end for
17:     end for
18:     beam $\leftarrow$ top $k$ of expansions according to $\tilde{G}_{\phi_{S'}}$
19:  end for
20:  Return beam
Algorithm 1 StochasticBeamSearch($p_\theta$, $k$)
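A compact reference sketch of Algorithm 1 for the toy model introduced earlier (hypothetical helpers VOCAB and log_p_next from that sketch; this is not the fairseq implementation used in the experiments, and it uses the direct, not the numerically stable, form of the Gumbel shift):

import numpy as np

def stochastic_beam_search(k, T, rng):
    # Each beam entry is (partial sequence y_S, log-probability phi_S, perturbed log-probability G_phi_S).
    beam = [([], 0.0, 0.0)]                             # root: empty sequence, phi_N = 0, G conditioned to 0
    for _ in range(T):
        expansions = []
        for seq, phi_s, g_s in beam:
            phi_children = phi_s + log_p_next(seq)      # log-probabilities of the children, Eq. (6)
            g_hat = rng.gumbel(loc=phi_children)        # independent Gumbel(phi_S') for each child
            Z = g_hat.max()
            # Shift the children's Gumbels so that their maximum equals g_s (Eq. (22)).
            g_children = -np.log(np.exp(-g_s) - np.exp(-Z) + np.exp(-g_hat))
            for y, phi_c, g_c in zip(VOCAB, phi_children, g_children):
                expansions.append((seq + [y], phi_c, g_c))
        expansions.sort(key=lambda e: -e[2])            # rank on perturbed log-probability
        beam = expansions[:k]                           # keep the top k (line 18 of Algorithm 1)
    return beam                                         # k sequences sampled without replacement

rng = np.random.default_rng(0)
for seq, phi, g in stochastic_beam_search(k=3, T=4, rng=rng):
    print(seq, np.exp(phi), g)

Because the shift preserves the argmax and gives the leaves the same joint distribution as in the naive sketch, taking the final top $k$ yields the same sample distribution while expanding only $k$ nodes per level.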
Stochastic Beam Search as a Randomized Beam Search

Stochastic Beam Search should not only be considered as a sampling procedure, but also as a principled way to randomize a beam search. As a naive alternative, one could run an ordinary beam search, replacing the top-$k$ operation by sampling. In this scenario, at each step of the beam search we would sample $k$ partial sequences without replacement from the partial sequence probabilities using the Gumbel-Top-$k$ trick.

However, in this naive approach, for a low-probability partial sequence to be extended to completion, it not only needs to be chosen initially, but it needs to be re-chosen, independently and again with low probability, at every subsequent step of the beam search. The result is a much lower probability of sampling this sequence than assigned by the model. Intuitively, we should somehow commit to a sampling 'decision' made at step $t$. However, a hard commitment to generate exactly one descendant for each of the $k$ partial sequences at step $t$ would prevent generating any two sequences that share an initial partial sequence.

Our Stochastic Beam Search algorithm makes a soft commitment to a partial sequence (a node in the tree) by propagating the Gumbel perturbation of its log-probability consistently down the subtree. The partial sequence will then be extended as long as its total perturbed log-probability is among the top $k$, but will fall off the beam if, despite the consistent perturbation, another sequence is more promising.

Relation to Rejection Sampling

As an alternative to Stochastic Beam Search, we can draw samples without replacement by rejecting duplicates from samples drawn with replacement. However, if the domain is large and the entropy is low (e.g. if there are only a few valid translations), then rejection sampling requires many samples and consequently many (expensive) model evaluations. Also, we have to estimate how many samples to draw (in parallel), or draw samples sequentially. Stochastic Beam Search executes in a single pass and requires computation linear in the sample size $k$ and the sequence length, which (except for the beam search overhead) equals the computational requirement for sampling with replacement.

Experiments

Diverse Beam Search
Figure 2: Minimum, mean and maximum BLEU score vs. diversity for different sample sizes $k$ (one panel per sample size). Points indicate different temperatures/diversity strengths, from 0.1 (low diversity, left in graph) to 0.8 (high diversity, right in graph).

In this experiment we compare Stochastic Beam Search as a principled (stochastic) alternative to Diverse Beam Search (Vijayakumar et al., 2018) in the context of neural machine translation, to obtain a diverse set of translations for a single source sentence $x$. Following the setup of Vijayakumar et al. (2018), we report both the diversity, measured by the fraction of unique $n$-grams in the translations, and the mean and maximum BLEU score (Papineni et al., 2002) as an indication of the quality of the sample. The maximum BLEU score corresponds to the 'oracle performance' reported by Vijayakumar et al. (2018), but we report the mean as well, since a single good translation accompanied by completely random sentences scores high on both maximum BLEU score and diversity, while being undesirable. A good method should increase diversity without sacrificing mean BLEU score.

We compare four different sentence generation methods: Beam Search (BS), Sampling, Stochastic Beam Search (SBS) (sampling without replacement) and Diverse Beam Search with $G$ groups (DBS($G$)) (Vijayakumar et al., 2018). For Sampling and Stochastic Beam Search, we control the diversity using the softmax temperature $T$ in Equation (5), where a higher $T$ results in higher diversity. Heuristically, we also vary the temperature used for computing the scores with (deterministic) Beam Search. The diversity of Diverse Beam Search is controlled by the diversity strength parameter, which we vary over the range shown in Figure 2 (0.1 to 0.8). We set the number of groups $G$ equal to the sample size $k$, which Vijayakumar et al. (2018) report as the best choice.

We modify the Beam Search implementation in fairseq (https://github.com/pytorch/fairseq) to implement Stochastic Beam Search, and use the fairseq implementations for Beam Search, Sampling and Diverse Beam Search. For theoretical correctness of Stochastic Beam Search, we disable length-normalization (Wu et al., 2016) and early stopping (and therefore also do not use these options for the other methods). We use the pretrained model from Gehring et al. (2017) and the wmt14.v2.en-fr.newstest2014 test set (https://s3.amazonaws.com/fairseq-py/data/wmt14.v2.en-fr.newstest2014.tar.bz2) consisting of 3003 sentences. We run the four methods with different sample sizes $k$ and plot the minimum, mean and maximum BLEU score among the $k$ translations (averaged over the test set) against the average $n$-gram diversity, where the $n$-gram diversity is defined as the fraction of unique $n$-grams among all $n$-grams in the $k$ translations.
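A possible implementation of this diversity measure (a sketch of our reading of the metric; the exact counting used by Vijayakumar et al. (2018) may differ in detail):

def ngram_diversity(translations, n):
    # Fraction of unique n-grams among all n-grams in the k translations
    # (each translation given as a list of tokens).
    ngrams = [tuple(y[i:i + n]) for y in translations for i in range(len(y) - n + 1)]
    return len(set(ngrams)) / max(len(ngrams), 1)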

In Figure 2, we represent the results as curves indicating the trade-off between diversity and BLEU score; the points indicate individual datapoints and the dashed lines indicate the (averaged) minimum and maximum BLEU score. For the same diversity, Stochastic Beam Search achieves a higher mean/maximum BLEU score. Conversely, at a given BLEU score, Stochastic Beam Search achieves significantly larger diversity than Diverse Beam Search. For low temperatures, the maximum BLEU score of Stochastic Beam Search is comparable to that of the deterministic Beam Search, so the increased diversity does not sacrifice the best element in the sample. Note that Sampling achieves a higher mean BLEU score at the cost of diversity, which may be because good translations are sampled repeatedly. However, the maximum BLEU score of both Sampling and Diverse Beam Search is lower than that of Beam Search and Stochastic Beam Search.

BLEU Score Estimation
Figure 3: BLEU score estimates for three sentences sampled/decoded by different estimators for different temperatures.

In our second experiment, we use sampling without replacement to estimate the expected sentence-level BLEU score for a translation $y$ given a source sentence $x$. Although we are often interested in corpus-level BLEU score, estimation of the sentence-level BLEU score is useful, for example when training with minibatches to directly optimize BLEU score (Ranzato et al., 2016).

We leave the dependence of the BLEU score on the source sentence $x$ implicit and write $f(y)$ for the BLEU score of a translation $y$. Writing the domain of $y$ (given $x$) as $Y$ (e.g. all possible translations), we want to estimate the following expectation:

$\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[f(y)\right] = \sum_{y \in Y} p_\theta(y \mid x) f(y) \qquad (11)$

Under a Monte Carlo (MC) sampling-with-replacement scheme with sample size $k$, we write $S$ for the set (formally, when sampling with replacement, $S$ is a multiset) of indices of the $k$ sampled sequences $y_i$, $i \in S$, and estimate (11) using

$\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[f(y)\right] \approx \frac{1}{k} \sum_{i \in S} f(y_i) \qquad (12)$

If the distribution has low entropy (for example if there are only a few valid translations), then MC estimation may be inefficient since repeated samples are uninformative. We can use sampling without replacement as an improvement, but we need to use importance weights to correct for the changed sampling probabilities. Using the Gumbel-Top-$k$ trick, we can implement an estimator equivalent to the estimator described by Vieira (2017), derived from priority sampling (Duffield et al., 2007):

$\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[f(y)\right] \approx \sum_{i \in S} \frac{p_\theta(y_i)}{q_{\theta, \kappa}(y_i)} f(y_i) \qquad (13)$

Here $\kappa$ is the $(k+1)$-th largest of the perturbed log-probabilities $\{G_{\phi_i}\}$, which can be considered the empirical threshold for the Gumbel-Top-$k$ trick (since $i \in S$ if $G_{\phi_i} > \kappa$), and we define $q_{\theta, a}(y_i) = P(G_{\phi_i} > a) = 1 - \exp(-\exp(\phi_i - a))$. If we were to assume a fixed threshold $a$ and a variably sized sample $S = \{i : G_{\phi_i} > a\}$, then $P(i \in S) = q_{\theta, a}(y_i)$ and $p_\theta(y_i) / q_{\theta, a}(y_i)$ is a standard importance weight. Surprisingly, using a fixed sample size $k$ (and the empirical threshold $\kappa$) also yields an unbiased estimator, and we include a proof adapted from Duffield et al. (2007) and Vieira (2017) in the Appendix. To obtain $\kappa$, we need to sacrifice one sample (for Stochastic Beam Search, we use a sample/beam size of $k + 1$, set $\kappa$ equal to the $(k+1)$-th largest perturbed log-probability and compute (13) based on the $k$ remaining samples), which slightly increases variance.

Empirically, the estimator (13) has high variance, and in practice it is preferred to normalize the importance weights by $W(S) = \sum_{i \in S} \frac{p_\theta(y_i)}{q_{\theta, \kappa}(y_i)}$ (Hesterberg, 1988):

$\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[f(y)\right] \approx \frac{1}{W(S)} \sum_{i \in S} \frac{p_\theta(y_i)}{q_{\theta, \kappa}(y_i)} f(y_i) \qquad (14)$

The estimator (14) is biased but consistent: in the limit $k = |Y|$ we sample the entire domain, so the empirical threshold becomes $\kappa = -\infty$, hence $q_{\theta, \kappa}(y_i) = 1$ and $p_\theta(y_i) / q_{\theta, \kappa}(y_i) = p_\theta(y_i)$, such that (14) is equal to (11).
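Both estimators can be computed directly from the output of a Stochastic Beam Search with beam size $k + 1$ (a sketch with hypothetical array inputs; not the code used for the experiments):

import numpy as np

def estimate_without_replacement(log_probs, perturbed, f_values, normalize=True):
    # log_probs, perturbed, f_values: the log-probabilities phi_i, perturbed log-probabilities
    # G_phi_i and function values f(y_i) of the k+1 sequences returned by Stochastic Beam Search.
    order = np.argsort(-perturbed)
    kappa = perturbed[order[-1]]              # empirical threshold: the (k+1)-th largest perturbation
    keep = order[:-1]                         # the k retained samples
    phi, f = log_probs[keep], f_values[keep]
    # q_{theta,kappa}(y_i) = P(G_phi_i > kappa) = 1 - exp(-exp(phi_i - kappa)), stable via expm1.
    q = -np.expm1(-np.exp(phi - kappa))
    w = np.exp(phi) / q                       # importance weights p_theta(y_i) / q_{theta,kappa}(y_i)
    return np.sum(w * f) / (np.sum(w) if normalize else 1.0)

With normalize=False this gives the unbiased estimator (13); with normalize=True it gives the normalized estimator (14). The same function can be reused for the entropy experiment below by passing f_values = -log_probs.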

We have to take care when computing the importance weights: depending on the entropy, the terms in the quotient can become very small, and the computation of $q_{\theta, \kappa}(y_i) = 1 - \exp(-\exp(\phi_i - \kappa))$ can suffer from catastrophic cancellation. For details, see the Appendix.

Because the model is not trained to use its own predictions as input, errors can accumulate at test time. As a result, when sampling with the default temperature $T = 1$, the expected BLEU score is very low (below 10). To improve the quality of the generated sentences we use lower temperatures and experiment with several values of $T$. We then use different methods to estimate the BLEU score:

  • Monte Carlo (MC), using Equation (12).

  • Stochastic Beam Search (SBS), where we compute estimates using the estimator in Equation (13) and the normalized variant in Equation (14).

  • Beam Search (BS), where we compute a deterministic beam (the temperature affects the scoring) and compute $\sum_{i \in S} p_\theta(y_i) f(y_i)$. This is not a statistical estimator, but a lower bound to the target (11), which serves as a validation of the implementation and gives insight into how many sequences we need at least to capture most of the mass in (11). We also compute the normalized version $\sum_{i \in S} p_\theta(y_i) f(y_i) / \sum_{i \in S} p_\theta(y_i)$, which can heuristically be considered a 'deterministic estimate'.

In Figure 3 we show the results of computing each estimate 100 times (BS only once, as it is deterministic) for three different sentences (sentences 1500, 2000 and 2500 from the WMT14 test set; sentences 0, 500 and 1000 are shorter and obtained 0 BLEU score in some cases, making them uninteresting for comparisons), for different temperatures and a range of sample sizes $k$. We report the empirical mean and lower and upper percentiles. The normalized SBS estimator indeed achieves significantly lower variance than the unnormalized version and, for part of the temperature range, it significantly reduces variance compared to MC without adding observable bias. For the remaining temperatures the results are similar, but we are less interested in those cases as the overall BLEU score is lower.

Figure 4: Entropy score estimates for three sentences sampled/decoded by different estimators for different temperatures.
Entropy Estimation

In addition to estimating the BLEU score, we use $f(y) = -\log p_\theta(y \mid x)$ such that Equation (11) becomes the model entropy (conditioned on the source sentence $x$):

$\mathbb{E}_{y \sim p_\theta(y \mid x)}\left[-\log p_\theta(y \mid x)\right] = -\sum_{y \in Y} p_\theta(y \mid x) \log p_\theta(y \mid x)$

Entropy estimation is useful in optimization settings where we want to include an entropy loss to ensure diversity. It is a different problem from BLEU score estimation, as a high BLEU score (for a good model) correlates positively with model probability, while for the entropy the rare sequences $y$ contribute the largest terms $-\log p_\theta(y \mid x)$. We use the same experimental setup as for the BLEU score and present the results in Figure 4. The results are similar to those of the BLEU score experiment: the normalized SBS estimate has significantly lower variance than MC for part of the temperature range, while for the rest the results are similar. This shows that Stochastic Beam Search can be used to construct practical statistical estimators.

Related Work

Sampling

The idea of sampling by solving optimization problems has been used for various purposes (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012; Tarlow et al., 2012; Ermon et al., 2013; Maddison et al., 2014; Chen & Ghahramani, 2016; Balog et al., 2017), but to our knowledge this idea has not been used for sampling without replacement.

Most closely related to our work is the work by Maddison et al. (2014), who note that the Gumbel-Max trick (Gumbel, 1954) can be applied implicitly and generalize it to continuous distributions, using an A* search to implicitly find the maximum of a Gumbel process. In this work, we extend the idea of top-down sampling to efficiently draw multiple samples without replacement from a factorized distribution (with a possibly exponentially large domain) by implicitly applying the Gumbel-Top-$k$ trick. This is a new and practical sampling method.

The blog post by Vieira (2014) describes the relation of the Gumbel-Top-$k$ trick (as we call it) to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), an algorithm for drawing weighted samples without replacement from a stream, which also yields a practical parallel algorithm for sampling without replacement. In that setting, there is a fixed probability $p_i$ for each element $i$, but no parametric (sequence) model. The implication that the Gumbel-Top-$k$ trick can be used for sampling without replacement is not widely known (for example, at the time of writing, the popular PyTorch (Paszke et al., 2017) library uses the, for large $k$, expensive sequential algorithm).

The Gumbel-Max trick has also been used to define relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016), which can be reparameterized for low-variance but biased gradient estimators and can also be used for training sequence models (Gu et al., 2018). We think our work is a step towards improving these methods in the context of sequence models.

Beam Search

Beam search is widely used for approximate inference in various domains such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Ranzato et al., 2016; Vaswani et al., 2017; Gehring et al., 2017), image captioning (Vinyals et al., 2015b), speech recognition (Graves et al., 2013) and other structured prediction settings (Vinyals et al., 2015a; Weiss et al., 2015). Although typically a test-time procedure, there are works that include beam search in the training loop (Daumé et al., 2009; Wiseman & Rush, 2016; Edunov et al., 2018b; Negrinho et al., 2018; Edunov et al., 2018a) for training sequence models on the sequence level (Ranzato et al., 2016; Bahdanau et al., 2017). Many variants of beam search have been developed, such as a continuous relaxation (Goyal et al., 2018), diversity encouraging variants (Li et al., 2016; Shao et al., 2017; Vijayakumar et al., 2018) or using modifications such as length-normalization (Wu et al., 2016) or simply applying noise to the output (Edunov et al., 2018a). Our Stochastic Beam Search is a principled alternative that shares some of the benefits of these heuristic variants, such as the ability to control diversity or produce randomized output.

Discussion

We introduced Stochastic Beam Search: an algorithm that is easy to implement on top of a beam search as a way to sample sequences without replacement. This algorithm relates sampling and beam search, combining advantages of these two methods. Our experiments support the idea that it can be used as a drop-in replacement in places where sampling or beam search is used. In fact, our experiments show Stochastic Beam Search can be used to yield lower-variance estimators and high-diversity samples from a neural machine translation model. In future work, we plan to leverage the probabilistic interpretation of beam search to develop new beam search related statistical learning methods.

References

  • Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
  • Bahdanau et al. (2017) Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations, 2017.
  • Balog et al. (2017) Balog, M., Tripuraneni, N., Ghahramani, Z., and Weller, A. Lost relatives of the gumbel trick. In International Conference on Machine Learning (ICML), pp. 371–379. PMLR, 2017.
  • Chen & Ghahramani (2016) Chen, Y. and Ghahramani, Z. Scalable discrete sampling as a multi-armed bandit problem. In International Conference on Machine Learning, pp. 2492–2501, 2016.
  • Daumé et al. (2009) Daumé, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine learning, 75(3):297–325, 2009.
  • Duffield et al. (2007) Duffield, N., Lund, C., and Thorup, M. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 54(6):32, 2007.
  • Edunov et al. (2018a) Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500, 2018a.
  • Edunov et al. (2018b) Edunov, S., Ott, M., Auli, M., Grangier, D., et al. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pp. 355–364, 2018b.
  • Efraimidis & Spirakis (2006) Efraimidis, P. S. and Spirakis, P. G. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
  • Ermon et al. (2013) Ermon, S., Gomes, C. P., Sabharwal, A., and Selman, B. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems, pp. 2085–2093, 2013.
  • Gehring et al. (2017) Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243–1252, 2017.
  • Goyal et al. (2018) Goyal, K., Neubig, G., Dyer, C., and Berg-Kirkpatrick, T. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Graves et al. (2013) Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing, IEEE international conference on, pp. 6645–6649, 2013.
  • Gu et al. (2018) Gu, J., Im, D. J., and Li, V. O. Neural machine translation with Gumbel-greedy decoding. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
  • Gumbel (1954) Gumbel, E. J. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
  • Hazan & Jaakkola (2012) Hazan, T. and Jaakkola, T. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning, pp. 1667–1674. Omnipress, 2012.
  • Hesterberg (1988) Hesterberg, T. C. Advances in importance sampling. PhD thesis, Stanford University, 1988.
  • Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with gumbel-softmax. In International Conference on Learning Representations, 2016.
  • Li et al. (2016) Li, J., Monroe, W., and Jurafsky, D. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016.
  • Mächler (2012) Mächler, M. Accurately computing log(1 − exp(−|a|)) assessed by the Rmpfr package, 2012. URL https://cran.r-project.org/web/packages/Rmpfr/vignettes/log1mexp-note.pdf.
  • Maddison et al. (2014) Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
  • Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2016.
  • Negrinho et al. (2018) Negrinho, R., Gormley, M., and Gordon, G. J. Learning beam search policies via imitation learning. In Advances in Neural Information Processing Systems, pp. 10673–10682, 2018.
  • Papandreou & Yuille (2011) Papandreou, G. and Yuille, A. L. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), IEEE International Conference on, pp. 193–200, 2011.
  • Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.
  • Ranzato et al. (2016) Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. In International Conference on Learning Representations, 2016.
  • Shao et al. (2017) Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2210–2219, 2017.
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
  • Tarlow et al. (2012) Tarlow, D., Adams, R., and Zemel, R. Randomized optimum models for structured prediction. In Artificial Intelligence and Statistics, pp. 1221–1229, 2012.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
  • Vieira (2014) Vieira, T. Gumbel-max trick and weighted reservoir sampling, 2014. URL https://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/.
  • Vieira (2017) Vieira, T. Estimating means in a finite universe, 2017. URL https://timvieira.github.io/blog/post/2017/07/03/estimating-means-in-a-finite-universe/.
  • Vijayakumar et al. (2018) Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D. J., and Batra, D. Diverse beam search for improved description of complex scenes. In AAAI, 2018.
  • Vinyals et al. (2015a) Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015a.
  • Vinyals et al. (2015b) Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, pp. 3156–3164, 2015b.
  • Weiss et al. (2015) Weiss, D., Alberti, C., Collins, M., and Petrov, S. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 323–333, 2015.
  • Wiseman & Rush (2016) Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1296–1306, 2016.
  • Wu et al. (2016) Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Appendix: Proof of Theorem 1
Theorem 1.

For $k \leq n$, let $I^*_1, \ldots, I^*_k = \arg\text{top}k_{i \in N}\, G_{\phi_i}$. Then $I^*_1, \ldots, I^*_k$ is an (ordered) sample without replacement from the $\text{Categorical}(p_1, \ldots, p_n)$ distribution, i.e. for a realization $i^*_1, \ldots, i^*_k$ it holds that

$P\left(I^*_1 = i^*_1, \ldots, I^*_k = i^*_k\right) = \prod_{j=1}^{k} \frac{p_{i^*_j}}{\sum_{\ell \in N^*_j} p_\ell} \qquad (15)$

where $N^*_j = N \setminus \{i^*_1, \ldots, i^*_{j-1}\}$ is the domain (without replacement) for the $j$-th sampled element.

Proof.

First note that

$P\left(I^*_k = i^*_k \mid I^*_1 = i^*_1, \ldots, I^*_{k-1} = i^*_{k-1}\right) = P\left(\arg\max_{i \in N^*_k} G_{\phi_i} = i^*_k \;\Big|\; \max_{i \in N^*_k} G_{\phi_i} < G_{\phi_{i^*_{k-1}}} < \ldots < G_{\phi_{i^*_1}}\right) \qquad (16)$

$= P\left(\arg\max_{i \in N^*_k} G_{\phi_i} = i^*_k\right) \qquad (17)$

$= \frac{p_{i^*_k}}{\sum_{\ell \in N^*_k} p_\ell} \qquad (18)$

The step from (16) to (17) follows from the independence of the max and the argmax (see the Gumbel-Max trick: the conditioning event depends on the perturbed log-probabilities in $N^*_k$ only through their maximum), and the step from (17) to (18) uses the Gumbel-Max trick. The proof follows by induction on $k$. The case $k = 1$ is the Gumbel-Max trick, while if we assume the result (15) proven for $k - 1$, then

$P\left(I^*_1 = i^*_1, \ldots, I^*_k = i^*_k\right) = P\left(I^*_1 = i^*_1, \ldots, I^*_{k-1} = i^*_{k-1}\right) \cdot P\left(I^*_k = i^*_k \mid I^*_1 = i^*_1, \ldots, I^*_{k-1} = i^*_{k-1}\right) = \prod_{j=1}^{k} \frac{p_{i^*_j}}{\sum_{\ell \in N^*_j} p_\ell} \qquad (19)$

In (19) we have used Equation (18) and Equation (15) for $k - 1$ by induction. ∎

Appendix: Sampling a Set of Gumbels with a Given Maximum

The Truncated Gumbel Distribution

A random variable $G$ has a truncated Gumbel distribution with location $\phi$ and maximum $T$ (i.e. $G \sim \text{TruncatedGumbel}(\phi, T)$) with CDF $F_{\phi, T}$ if:

$F_{\phi, T}(g) = \frac{\exp\left(-\exp\left(\phi - \min(g, T)\right)\right)}{\exp\left(-\exp\left(\phi - T\right)\right)} \qquad (20)$

The inverse CDF is:

$F^{-1}_{\phi, T}(u) = \phi - \log\left(\exp(\phi - T) - \log u\right) \qquad (21)$
Conditional Sampling Procedure

In order to sample a set of Gumbel variables $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$, $i \in S$, conditionally on their maximum being exactly $T$, we can first sample the argmax and then sample the Gumbels conditionally on both the max and the argmax:

  1. Sample the argmax: $i^* \sim \text{Categorical}\left(\frac{\exp \phi_i}{\sum_{j \in S} \exp \phi_j},\; i \in S\right)$. We do not need to condition on the max $T$, since the argmax is independent of the max (see the Gumbel-Max trick).

  2. Set $G_{\phi_{i^*}} = T$, since this follows from conditioning on the max and the argmax.

  3. Sample $G_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, T)$ for $i \neq i^*$. This works because, conditioning on the max and the argmax, the remaining variables are independent Gumbels truncated at the maximum $T$.

Equivalently, we can let $\hat{G}_{\phi_i} \sim \text{Gumbel}(\phi_i)$ independently, let $Z = \max_i \hat{G}_{\phi_i}$ and define

$\tilde{G}_{\phi_i} = F^{-1}_{\phi_i, T}\left(F_{\phi_i, Z}\left(\hat{G}_{\phi_i}\right)\right) = -\log\left(\exp(-T) - \exp(-Z) + \exp\left(-\hat{G}_{\phi_i}\right)\right) \qquad (22)$

Here we have used (20) and (21). Since the transformation (22) is monotonically increasing, it preserves the argmax, and it follows from the Gumbel-Max trick (3) that

$\arg\max_i \tilde{G}_{\phi_i} = \arg\max_i \hat{G}_{\phi_i} \sim \text{Categorical}\left(\frac{\exp \phi_i}{\sum_{j \in S} \exp \phi_j},\; i \in S\right)$

We can think of this as using the Gumbel-Max trick for step 1 (sampling the argmax) in the sampling process described above. Additionally, for $i^* = \arg\max_i \hat{G}_{\phi_i}$ (so $\hat{G}_{\phi_{i^*}} = Z$):

$\tilde{G}_{\phi_{i^*}} = -\log\left(\exp(-T) - \exp(-Z) + \exp(-Z)\right) = T$

so here we recover step 2 (setting $G_{\phi_{i^*}} = T$). For $i \neq i^*$ it holds that:

$F_{\phi_i, Z}\left(\hat{G}_{\phi_i}\right) \sim \text{Uniform}(0, 1)$, since conditionally on the max $Z$ and the argmax $i^*$, $\hat{G}_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, Z)$. This means that $\tilde{G}_{\phi_i} = F^{-1}_{\phi_i, T}\left(F_{\phi_i, Z}\left(\hat{G}_{\phi_i}\right)\right) \sim \text{TruncatedGumbel}(\phi_i, T)$, so this is equivalent to step 3 (sampling $G_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, T)$ for $i \neq i^*$).

Numerically Stable Implementation

Direct computation of (22) can be unstable as large terms need to be exponentiated. Instead, we compute:

$\tilde{G}_{\phi_i} = T - \text{log1pexp}(v_i) \qquad (23)$

$= T - \max(v_i, 0) - \text{log1pexp}\left(-|v_i|\right) \qquad (24)$

where we have defined

$v_i = T - \hat{G}_{\phi_i} + \text{log1mexp}\left(\hat{G}_{\phi_i} - Z\right).$

This is equivalent to (22), as

$\exp(-T) - \exp(-Z) + \exp\left(-\hat{G}_{\phi_i}\right) = \exp(-T)\left(1 + \exp\left(T - \hat{G}_{\phi_i} + \text{log1mexp}\left(\hat{G}_{\phi_i} - Z\right)\right)\right) = \exp(-T)\left(1 + \exp(v_i)\right).$

The step from (23) to (24) can be easily verified by considering the cases $v_i \leq 0$ and $v_i > 0$. Here $\text{log1mexp}(a) = \log(1 - \exp(a))$ (for $a \leq 0$) and $\text{log1pexp}(a) = \log(1 + \exp(a))$ can be computed accurately using $\text{log1p}(a) = \log(1 + a)$ and $\text{expm1}(a) = \exp(a) - 1$ (Mächler, 2012):

$\text{log1mexp}(a) = \begin{cases} \log(-\text{expm1}(a)) & \text{if } -\log 2 < a \leq 0 \\ \text{log1p}(-\exp(a)) & \text{if } a \leq -\log 2 \end{cases}$

and in (24) $\text{log1pexp}$ is only applied to arguments $a = -|v_i| \leq 0$, for which $\text{log1pexp}(a) = \text{log1p}(\exp(a))$ is accurate.
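A sketch of this computation (assuming NumPy; it mirrors Equations (22)–(24) and can replace the direct shift used in the Stochastic Beam Search sketch above):

import numpy as np

def log1mexp(a):
    # log(1 - exp(a)) for a <= 0, computed accurately (Machler, 2012).
    a = np.asarray(a, dtype=float)
    return np.where(a > -np.log(2), np.log(-np.expm1(a)), np.log1p(-np.exp(a)))

def shift_gumbel_maximum(g_hat, T):
    # Transform independent Gumbels g_hat so that their maximum equals T (Eq. (22)),
    # using the stable form tilde_G = T - max(v, 0) - log1p(exp(-|v|)) (Eqs. (23)-(24)).
    Z = np.max(g_hat)
    with np.errstate(divide="ignore"):        # the argmax gives v = -inf and maps exactly to T
        v = T - g_hat + log1mexp(g_hat - Z)
    return T - np.maximum(v, 0.0) - np.log1p(np.exp(-np.abs(v)))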

Numerical Stability of the Importance Weights

We have to take care when computing the importance weights: depending on the entropy, the terms in the quotient can become very small, and the computation of $q_{\theta, \kappa}(y_i) = 1 - \exp(-\exp(\phi_i - \kappa))$ can suffer from catastrophic cancellation. We can rewrite this expression using the more numerically stable implementation

$\log q_{\theta, \kappa}(y_i) = \text{log1mexp}\left(-\exp(\phi_i - \kappa)\right)$

and compute the importance weight in log-space as $\log p_\theta(y_i) - \log q_{\theta, \kappa}(y_i) = \phi_i - \text{log1mexp}\left(-\exp(\phi_i - \kappa)\right)$.