Abstract
The well-known Gumbel-Max trick for sampling from a categorical distribution can be extended to sample $k$ elements without replacement. We show how to implicitly apply this ‘Gumbel-Top-$k$’ trick on a factorized distribution over sequences, which allows drawing exact samples without replacement using a Stochastic Beam Search. Even for exponentially large domains, the number of model evaluations grows only linearly in $k$ and the maximum sampled sequence length. The algorithm creates a theoretical connection between sampling and (deterministic) beam search and can be used as a principled intermediate alternative. In a translation task, the proposed method compares favourably against alternatives to obtain diverse yet good quality translations. We show that sequences sampled without replacement can be used to construct low-variance estimators for expected sentence-level BLEU score and model entropy.
Stochastic Beams and Where to Find Them:
The Gumbel-Top-$k$ Trick for Sampling Sequences Without Replacement
Wouter Kool, Herke van Hoof, Max Welling
We think the Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) is like a magic trick. It allows sampling from the categorical distribution, simply by perturbing the log-probability for each category by adding independent Gumbel distributed noise and returning the category with maximum perturbed log-probability. This trick has recently (re)gained popularity as it allows deriving reparameterizable continuous relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016). However, there is more: as was noted (in a blog post) by Vieira (2014), taking the top $k$ largest perturbations (instead of the maximum, or top 1) yields a sample of size $k$ from the categorical distribution without replacement. We call this extension of the Gumbel-Max trick the Gumbel-Top-$k$ trick.
In this paper, we consider factorized distributions over sequences, represented by (parametric) sequence models. Sequence models are widely used, e.g. in tasks such as neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015b). Many such tasks require obtaining a set of representative sequences from the model. These can be random samples, but for low-entropy distributions, a set of sequences sampled using standard sampling (with replacement) may contain many duplicates. On the other hand, a beam search can find a set of unique high-probability sequences, but these have low variability and, being deterministic, cannot be used to construct statistical estimators.
In this paper, we propose sampling without replacement as an alternative method to obtain representative sequences from a sequence model. We show that for a sequence model, we can do this by applying the Gumbel-Top-$k$ trick implicitly, without instantiating all sequences in the (typically exponentially large) domain. This procedure allows us to draw $k$ sequences using a number of model evaluations that grows only linearly in the number of samples $k$ and the maximum sampled sequence length. The algorithm uses the idea of top-down sampling (Maddison et al., 2014) and performs a beam search over stochastically perturbed log-probabilities, which is why we refer to it as Stochastic Beam Search.
Stochastic Beam Search is a novel procedure for sampling sequences that avoids duplicate samples. Unlike ordinary beam search, it has a probabilistic interpretation. Thus, it can be used, for example, in importance sampling. As such, Stochastic Beam Search conceptually connects sampling and beam search, and combines advantages of both methods. In our experiments we give two examples of how Stochastic Beam Search can be used as a principled alternative to sampling or beam search. In these experiments, Stochastic Beam Search is used to control the diversity of translation results, as well as to construct low-variance estimators for sentence-level BLEU score and model entropy.
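A discrete random variable $I$ has a distribution $\mathrm{Categorical}(p_1, \ldots, p_n)$ with domain $N = \{1, \ldots, n\}$ if $P(I = i) = p_i$ for all $i \in N$. We refer to the categories $i$ as elements in the domain $N$ and denote with $\phi_i$, $i \in N$, the (unnormalized) log-probabilities, so $\exp \phi_i \propto p_i$ and $p_i = \exp \phi_i / \sum_{j \in N} \exp \phi_j$. Therefore in general we write:
(1) $I \sim \mathrm{Categorical}\left( \frac{\exp \phi_i}{\sum_{j \in N} \exp \phi_j},\ i \in N \right)$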
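If $U \sim \mathrm{Uniform}(0, 1)$, then $G = \phi - \log(-\log U)$ has a Gumbel distribution with location $\phi$ and we write $G \sim \mathrm{Gumbel}(\phi)$. From this it follows that $G' = G + \phi' \sim \mathrm{Gumbel}(\phi + \phi')$, so we can shift Gumbel variables.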
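The Gumbel-Max trick (Gumbel, 1954; Maddison et al., 2014) allows sampling from the categorical distribution (1) by independently perturbing the log-probabilities $\phi_i$ with Gumbel noise and finding the largest element.
Formally, let $G_i \sim \mathrm{Gumbel}(0)$, $i \in N$, i.i.d. and let $I^* = \arg\max_i \{\phi_i + G_i\}$, then $I^* \sim \mathrm{Categorical}(p_i, i \in N)$ with $p_i \propto \exp \phi_i$. In a slight abuse of notation, we write $G_{\phi_i} = \phi_i + G_i \sim \mathrm{Gumbel}(\phi_i)$ and we call $G_{\phi_i}$ the perturbed log-probability of element or category $i$ in the domain $N$.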
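For any subset $B \subseteq N$ it holds that (Maddison et al., 2014):
(2) $\max_{i \in B} G_{\phi_i} \sim \mathrm{Gumbel}\left( \log \sum_{j \in B} \exp \phi_j \right)$
(3) $\arg\max_{i \in B} G_{\phi_i} \sim \mathrm{Categorical}\left( \frac{\exp \phi_i}{\sum_{j \in B} \exp \phi_j},\ i \in B \right)$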
Additionally, the max (2) and argmax (3) are independent. For details, see Maddison et al. (2014).
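As a small illustration of the Gumbel-Max trick described above, the following Python sketch perturbs a list of (possibly unnormalized) log-probabilities with independent Gumbel noise and returns the index of the maximum. The function names `sample_gumbel` and `gumbel_max` are our own and purely illustrative; only the standard library is assumed.

```python
import math
import random


def sample_gumbel(location=0.0, rng=random):
    """Draw G ~ Gumbel(location) as location - log(-log U), U ~ Uniform(0, 1)."""
    u = rng.random()
    while u == 0.0:  # random() lies in [0, 1); redraw to avoid log(0)
        u = rng.random()
    return location - math.log(-math.log(u))


def gumbel_max(log_probs, rng=random):
    """Return one index, sampled with probability proportional to exp(log_probs[i])."""
    perturbed = [phi + sample_gumbel(rng=rng) for phi in log_probs]
    return max(range(len(perturbed)), key=perturbed.__getitem__)
```

Because only the location of the maximum matters, adding a constant to all log-probabilities leaves the sampling distribution unchanged, which is why unnormalized values can be passed directly.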
Considering the maximum as the top 1 (one), we can generalize the Gumbel-Max trick to the Gumbel-Top-$k$ trick to draw an ordered sample of size $k$ without replacement, by taking the indices of the $k$ largest perturbed log-probabilities. (Footnote 1: By sampling without replacement from a categorical distribution, we mean sampling the first element, then renormalizing the remaining probabilities to sample the next element, etcetera. This does not mean that the inclusion probability of element $i$ is proportional to $p_i$: if we sample $k = n$ elements, all elements are included with probability 1.) Generalizing $\arg\max$, we denote with $\arg\operatorname{top} k$ the function that takes a sequence of values and returns the indices of the $k$ largest values, in order of decreasing value.
Theorem 1.
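For $k \le n$, let $I^*_1, \ldots, I^*_k = \arg\operatorname{top} k\, G_{\phi_i}$. Then $I^*_1, \ldots, I^*_k$ is an (ordered) sample without replacement from the $\mathrm{Categorical}\left( \frac{\exp \phi_i}{\sum_{j \in N} \exp \phi_j},\ i \in N \right)$ distribution, i.e. for a realization $i^*_1, \ldots, i^*_k$ it holds that
(4) $P(I^*_1 = i^*_1, \ldots, I^*_k = i^*_k) = \prod_{j=1}^{k} \frac{\exp \phi_{i^*_j}}{\sum_{\ell \in N^*_j} \exp \phi_\ell}$
where $N^*_j = N \setminus \{i^*_1, \ldots, i^*_{j-1}\}$ is the domain (without replacement) for the $j$-th sampled element.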
For the proof we refer to the appendix. The Gumbel-Top-$k$ trick is mathematically equivalent to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), as was noted in a blog post by Vieira (2014).
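The Gumbel-Top-$k$ trick of Theorem 1 can be sketched in a few lines of Python. This is a minimal illustration on an explicitly enumerated domain, not the implementation used in the experiments; the name `gumbel_top_k` and its return format are our own assumptions.

```python
import math
import random


def sample_gumbel(location, rng):
    """Draw G ~ Gumbel(location) via inverse transform sampling."""
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return location - math.log(-math.log(u))


def gumbel_top_k(log_probs, k, rng=random):
    """Return (indices, perturbed values) of the k largest perturbed log-probabilities.

    The indices form an ordered sample without replacement from the
    categorical distribution with probabilities proportional to exp(log_probs).
    """
    if not 0 <= k <= len(log_probs):
        raise ValueError("k must be between 0 and the domain size")
    perturbed = [sample_gumbel(phi, rng) for phi in log_probs]
    order = sorted(range(len(perturbed)), key=perturbed.__getitem__, reverse=True)
    top = order[:k]
    return top, [perturbed[i] for i in top]
```

For probabilities (0.5, 0.3, 0.2), Equation (4) gives the ordered sample (0, 1) a probability of 0.5 · 0.3 / (0.3 + 0.2) = 0.3, which can be checked empirically by repeated calls with `k=2`.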
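A sequence model is a factorized parametric distribution over sequences. The parameters $\theta$ define the conditional probability $p_\theta(y_t | y_{1:t-1})$ of the next token $y_t$ given the partial sequence $y_{1:t-1}$. Typically $p_\theta(y_t | y_{1:t-1})$ is defined as a softmax normalization of unnormalized log-probabilities $\phi_\theta(y_t | y_{1:t-1})$ with optional temperature $T$ (default $T = 1$):
(5) $p_\theta(y_t | y_{1:t-1}) = \frac{\exp\left( \phi_\theta(y_t | y_{1:t-1}) / T \right)}{\sum_{y'} \exp\left( \phi_\theta(y' | y_{1:t-1}) / T \right)}$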
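The normalization is w.r.t. a single token, so the model is locally normalized. The total probability of a (partial) sequence $y_{1:t}$ follows from the chain rule of probability:
(6) $p_\theta(y_{1:t}) = p_\theta(y_t | y_{1:t-1}) \cdot p_\theta(y_{1:t-1})$
(7) $\phantom{p_\theta(y_{1:t})} = \prod_{t'=1}^{t} p_\theta(y_{t'} | y_{1:t'-1})$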
A sequence model defines a valid probability distribution over both partial and complete sequences. When the length is irrelevant, we simply write $y$ to indicate a (partial or complete) sequence. If the model is additionally conditioned on a context $x$ (e.g., a source sentence), we write $p_\theta(y | x)$.
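To make the local normalization and the chain rule concrete, the sketch below scores sequences under a tiny stand-in model. The function `toy_logits` is a hypothetical placeholder for the unnormalized log-probabilities of a trained model, and all names here are illustrative assumptions rather than part of any library.

```python
import math


def log_softmax(logits, temperature=1.0):
    """Locally normalized log-probabilities: softmax with temperature, in log space."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def toy_logits(prefix, vocab_size=3):
    """Hypothetical unnormalized log-probabilities of the next token given a prefix."""
    return [math.sin(1.0 + 3.0 * len(prefix) + sum(prefix) + 2.0 * token)
            for token in range(vocab_size)]


def sequence_log_prob(tokens, logits_fn=toy_logits, temperature=1.0):
    """Chain rule: sum the conditional log-probabilities of each token in turn."""
    total = 0.0
    for t, token in enumerate(tokens):
        total += log_softmax(logits_fn(tokens[:t]), temperature)[token]
    return total
```

Since every step is normalized over the next token only, the probabilities of all sequences of a fixed length sum to one, which is the property that the later derivation relies on.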
A beam search is a limited-width breadth-first search. In the context of sequence models, it is often used as an approximation to finding the (single) sequence that maximizes (7), or as a way to obtain a set of high-probability sequences from the model. Starting from an empty sequence (e.g. $t = 0$), a beam search expands at every step $t = 0, 1, 2, \ldots$ at most $k$ partial sequences (those with highest probability) to compute the probabilities of sequences with length $t + 1$. It terminates with a beam of $k$ complete sequences, which we assume to be of equal length (as sequences can be padded).
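The beam search just described can be sketched as follows for fixed-length sequences. The model `toy_logits` is again a hypothetical stand-in, and the function signature is our own choice for illustration.

```python
import heapq
import math


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def toy_logits(prefix, vocab_size=3):
    """Hypothetical unnormalized log-probabilities of the next token given a prefix."""
    return [math.sin(1.0 + 3.0 * len(prefix) + sum(prefix) + 2.0 * token)
            for token in range(vocab_size)]


def beam_search(logits_fn, k, length):
    """Return up to k (sequence, log-probability) pairs, best first."""
    beam = [((), 0.0)]  # the empty sequence has log-probability 0
    for _ in range(length):
        candidates = []
        for prefix, log_p in beam:
            # Expand every partial sequence on the beam by one token.
            for token, step_log_p in enumerate(log_softmax(logits_fn(prefix))):
                candidates.append((prefix + (token,), log_p + step_log_p))
        # Keep only the k candidates with highest log-probability.
        beam = heapq.nlargest(k, candidates, key=lambda c: c[1])
    return beam
```

With `k=1` this reduces to greedy decoding, and with `k` at least the number of sequences it enumerates the whole domain.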
We derive Stochastic Beam Search by starting with the explicit application of the Gumbel-Top-$k$ trick to sample $k$ sequences without replacement from a sequence model. This requires instantiating all sequences in the domain to find the $k$ largest perturbed log-probabilities. Then we transition to top-down sampling of the perturbed log-probabilities, and we use Stochastic Beam Search to instantiate (only) the sequences with the $k$ largest perturbed log-probabilities. As both methods are equivalent, Stochastic Beam Search implicitly applies the Gumbel-Top-$k$ trick and thus yields a sample of $k$ sequences without replacement.
We represent the sequence model (7) as a tree (as in Figure 1), where internal nodes at level $t$ represent partial sequences $y_{1:t}$, and leaf nodes represent completed sequences. We identify a leaf by its index $i \in \{1, \ldots, n\} = N$ and write $y^i$ as the corresponding sequence, with (normalized!) log-probability $\phi_i = \log p_\theta(y^i)$. To obtain a sample from the distribution (7) without replacement, we should sample from the set of leaf nodes $N$ without replacement, for which we can naively use the Gumbel-Top-$k$ trick (Theorem 1):
1. Compute $\phi_i = \log p_\theta(y^i)$ for all sequences $y^i$, $i \in N$. To reuse computations for partial sequences, the complete probability tree is instantiated, as in Figure 1.
2. Sample $G_{\phi_i} \sim \mathrm{Gumbel}(\phi_i)$, so $G_{\phi_i}$ can be seen as the perturbed log-probability of sequence $y^i$.
3. Let $i^*_1, \ldots, i^*_k = \arg\operatorname{top} k\, G_{\phi_i}$, then $y^{i^*_1}, \ldots, y^{i^*_k}$ is a sample of $k$ sequences from (7) without replacement.
As instantiating the complete probability tree is computationally prohibitive, we construct an equivalent process that only requires computation linear in the number of samples $k$ and the sequence length.
For the naive implementation of the Gumbel-Top-$k$ trick, we only defined the perturbed log-probabilities $G_{\phi_i}$ for leaf nodes $i$, which correspond to complete sequences $y^i$. For the Stochastic Beam Search implementation, we also define the perturbed log-probabilities for internal nodes corresponding to partial sequences. We identify a node (internal or leaf) by the set $S$ of leaves in the corresponding subtree, and we write $y^S$ as the corresponding (partial or completed) sequence. Its log-probability $\phi_S = \log p_\theta(y^S)$ can be computed incrementally from the parent log-probability using (6), and since the model is locally normalized, it holds that
(8) $\phi_S = \log p_\theta(y^S) = \log \sum_{i \in S} p_\theta(y^i) = \log \sum_{i \in S} \exp \phi_i$
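Now for each node $S$, we define $G_{\phi_S}$ as the maximum of the perturbed log-probabilities $G_{\phi_i}$ in the subtree leaves $S$. By Equation (2), $G_{\phi_S}$ has a Gumbel distribution with location $\phi_S$ (hence its notation $G_{\phi_S}$):
(9) $G_{\phi_S} = \max_{i \in S} G_{\phi_i} \sim \mathrm{Gumbel}(\phi_S)$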
Since $G_{\phi_S}$ is a Gumbel perturbation of the log-probability $\phi_S$, we call it the perturbed log-probability of the partial sequence $y^S$. We can define the corresponding Gumbel noise $G_S \sim \mathrm{Gumbel}(0)$, which can be inferred from $G_{\phi_S}$ by the relation $G_{\phi_S} = \phi_S + G_S$.
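We can recursively compute (9). Write $\mathrm{Children}(S)$ as the set of direct children of the node $S$ (so $\mathrm{Children}(S)$ is a partition of the set $S$). Since the maximum (9) must be attained in one of the subtrees, it holds that
(10) $G_{\phi_S} = \max_{S' \in \mathrm{Children}(S)} G_{\phi_{S'}}$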
If we want to sample $G_{\phi_S}$ for all nodes, we can use the bottom-up sampling procedure: sample $G_{\phi_i}$ for the leaves $\{i\}$, $i \in N$, and recursively compute $G_{\phi_S}$ using (10). This is effectively sampling from the degenerate (constant) distribution of $G_{\phi_S}$ resulting from conditioning on the children.
The recursive bottom-up sampling procedure can be interpreted as ancestral sampling from a tree-structured graphical model (somewhat like Figure 1) with edges directed upwards. Alternatively, we can reverse the graphical model and sample the tree top-down, starting with the root and recursively sampling the children conditionally.
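Note that for the root $N$ (since it contains all leaves $N$), it holds that $\phi_N = \log \sum_{i \in N} \exp \phi_i = 0$, so we can let $G_{\phi_N} \sim \mathrm{Gumbel}(0)$. (Footnote 2: Or we can simply set (i.e. condition on) $G_{\phi_N} = 0$. This does not affect the result, by the independence of $\max$ and $\arg\max$.) Starting with $S = N$, we can recursively sample the children conditionally on the parent variable $G_{\phi_S}$. For $S' \in \mathrm{Children}(S)$ it holds that $\phi_{S'} = \log p_\theta(y^{S'})$ and we can sample $G_{\phi_{S'}}$ conditionally on (10), i.e. with their maximum equal to $G_{\phi_S}$.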
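Sampling a set of Gumbels conditionally on their maximum being equal to a certain value is non-trivial, but can be done by first sampling the $\arg\max$ and then sampling the individual Gumbels conditionally on both the $\max$ and $\arg\max$. Alternatively, we can let $G_{\phi_{S'}} \sim \mathrm{Gumbel}(\phi_{S'})$ independently and let $Z = \max_{S'} G_{\phi_{S'}}$. Then
$\tilde{G}_{\phi_{S'}} = -\log\left( \exp(-G_{\phi_S}) - \exp(-Z) + \exp(-G_{\phi_{S'}}) \right)$
is a set of Gumbels with a maximum equal to $G_{\phi_S}$. See the appendix for details and a numerically stable implementation.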
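The conditioning step can be sketched as below. The rearrangement used here writes the result as the target maximum minus a softplus term, which is one way to avoid subtracting nearly equal exponentials; it is algebraically equal to $-\log(\exp(-G_{\phi_S}) - \exp(-Z) + \exp(-G_{\phi_{S'}}))$ but is our own sketch and not necessarily identical to the implementation in the appendix. The helper names are illustrative.

```python
import math


def log1mexp(x):
    """Compute log(1 - exp(x)) for x <= 0 without catastrophic cancellation."""
    if x == 0.0:
        return -math.inf
    if x > -math.log(2.0):
        return math.log(-math.expm1(x))
    return math.log1p(-math.exp(x))


def condition_on_max(gumbels, target_max):
    """Shift independently sampled Gumbels so that their maximum equals target_max.

    Equals -log(exp(-target_max) - exp(-Z) + exp(-g)) with Z = max(gumbels),
    rewritten as target_max - softplus(v) for numerical stability.
    """
    z = max(gumbels)
    shifted = []
    for g in gumbels:
        v = target_max - g + log1mexp(g - z)
        softplus = max(0.0, v) + math.log1p(math.exp(-abs(v)))
        shifted.append(target_max - softplus)
    return shifted
```

The element that attains the maximum is mapped exactly to `target_max`, all other elements end up strictly below it, and the ordering of the original values is preserved.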
If we recursively sample the complete tree top-down, this is equivalent to sampling the complete tree bottom-up, and as a result, for all leaves, it holds that $G_{\phi_i} \sim \mathrm{Gumbel}(\phi_i)$, independently. The benefit of using top-down sampling is that if we are interested only in obtaining the top $k$ leaves, we do not need to instantiate the complete tree.
The key idea of Stochastic Beam Search is to apply the Gumbel-Top-$k$ trick for a sequence model, without instantiating the entire tree, by using top-down sampling. With top-down sampling, to find the top $k$ leaves, at every level in the tree it suffices to expand (instantiate the subtree for) only the $k$ nodes with highest perturbed log-probability $G_{\phi_S}$. To see this, first assume that we instantiated the complete tree using top-down sampling and consider the nodes that are ancestors of at least one of the top $k$ leaves (the shaded nodes in Figure 1). At every level $t$ of the tree, there will be at most $k$ such nodes (as each of the top $k$ leaves has only one ancestor at level $t$), and these nodes will have higher perturbed log-probabilities $G_{\phi_S}$ than the other nodes at level $t$, which do not contain a top $k$ leaf in the subtree. This means that if we discard all but the $k$ nodes with highest perturbed log-probabilities $G_{\phi_S}$, we are guaranteed to include the ancestors of the top $k$ leaves. Formally, the $k$-th highest perturbed log-probability of the nodes at level $t$ provides a lower bound required to be among the top $k$ leaves, while $G_{\phi_S}$ is an upper bound for the set of leaves $S$, such that $S$ can be discarded or pruned if $G_{\phi_S}$ is lower than the lower bound, i.e. if $G_{\phi_S}$ is not among the top $k$.
Thus, when we apply the top-down sampling procedure, at each level we only need to expand the $k$ nodes with the highest perturbed log-probabilities $G_{\phi_S}$ to end up with the top $k$ leaves. By the Gumbel-Top-$k$ trick the result is a sample without replacement from the sequence model. The effective procedure is a beam search over the (stochastically) perturbed log-probabilities $G_{\phi_S}$ for partial sequences $y^S$, hence the name Stochastic Beam Search. As we use $G_{\phi_S}$ to select the top $k$ partial sequences, we can also think of $G_{\phi_S}$ as the stochastic score of the partial sequence $y^S$. We formalize Stochastic Beam Search in Algorithm 1.
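The procedure can be sketched end to end in Python as follows. This is a minimal illustration under simplifying assumptions (fixed sequence length, a tiny hypothetical model `toy_logits`, the root conditioned on $G_{\phi_N} = 0$), not the fairseq-based implementation used in the experiments; all function names are our own.

```python
import heapq
import math
import random


def log_softmax(logits):
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def toy_logits(prefix, vocab_size=3):
    """Hypothetical unnormalized log-probabilities of the next token given a prefix."""
    return [math.sin(1.0 + 3.0 * len(prefix) + sum(prefix) + 2.0 * token)
            for token in range(vocab_size)]


def sample_gumbel(location, rng):
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return location - math.log(-math.log(u))


def log1mexp(x):
    """log(1 - exp(x)) for x <= 0."""
    if x == 0.0:
        return -math.inf
    if x > -math.log(2.0):
        return math.log(-math.expm1(x))
    return math.log1p(-math.exp(x))


def condition_on_max(gumbels, target_max):
    """Shift the Gumbels so that their maximum equals target_max."""
    z = max(gumbels)
    shifted = []
    for g in gumbels:
        v = target_max - g + log1mexp(g - z)
        shifted.append(target_max - max(0.0, v) - math.log1p(math.exp(-abs(v))))
    return shifted


def stochastic_beam_search(logits_fn, k, length, rng):
    """Return k entries (sequence, log-probability, perturbed log-probability)."""
    beam = [((), 0.0, 0.0)]  # root: phi = 0, perturbed value conditioned to 0
    for _ in range(length):
        expansions = []
        for prefix, phi, g_parent in beam:
            child_phis = [phi + lp for lp in log_softmax(logits_fn(prefix))]
            # Sample children independently, then condition on the parent's value.
            child_gs = [sample_gumbel(c_phi, rng) for c_phi in child_phis]
            child_gs = condition_on_max(child_gs, g_parent)
            for token, (c_phi, c_g) in enumerate(zip(child_phis, child_gs)):
                expansions.append((prefix + (token,), c_phi, c_g))
        # Beam search step on the perturbed (stochastic) scores.
        beam = heapq.nlargest(k, expansions, key=lambda e: e[2])
    return beam
```

The returned sequences are distinct by construction, and the first entry on the beam is distributed as an ordinary sample from the model, which gives a simple empirical sanity check.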
Stochastic Beam Search should not only be considered as a sampling procedure, but also as a principled way to randomize a beam search. As a naive alternative, one could run an ordinary beam search, replacing the top-$k$ operation by sampling. In this scenario, at each step $t$ of the beam search we could sample without replacement from the partial sequence probabilities $p_\theta(y_{1:t})$ using the Gumbel-Top-$k$ trick.
However, in this naive approach, for a low-probability partial sequence to be extended to completion, it does not only need to be initially chosen, but it will need to be re-chosen, independently, again with low probability, at each step of the beam search. The result is a much lower probability to sample this sequence than assigned by the model. Intuitively, we should somehow commit to a sampling ‘decision’ made at step $t$. However, a hard commitment to generate exactly one descendant for each of the $k$ partial sequences at step $t$ would prevent generating any two sequences that share an initial partial sequence.
Our Stochastic Beam Search algorithm makes a soft commitment to a partial sequence (node in the tree) by propagating the Gumbel perturbation of the log-probability consistently down the subtree. The partial sequence will then be extended as long as its total perturbed log-probability is among the top $k$, but will fall off the beam if, despite the consistent perturbation, another sequence is more promising.
As an alternative to Stochastic Beam Search, we can draw samples without replacement by rejecting duplicates from samples drawn with replacement. However, if the domain is large and the entropy low (e.g. if there are only a few valid translations), then rejection sampling requires many samples and consequently many (expensive) model evaluations. Also, we have to estimate how many samples to draw (in parallel) or draw samples sequentially. Stochastic Beam Search executes in a single pass, and requires computation linear in the sample size $k$ and the sequence length, which (except for the beam search overhead) is equal to the computational requirement for sampling with replacement.
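For comparison, the rejection-based alternative can be sketched as below: sequences are drawn one at a time by ordinary ancestral sampling and duplicates are discarded, while the number of draws is counted. The model `toy_logits`, the `max_draws` safeguard and all function names are assumptions made for illustration.

```python
import math
import random


def softmax(logits, temperature=1.0):
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def toy_logits(prefix, vocab_size=3):
    """Hypothetical unnormalized log-probabilities of the next token given a prefix."""
    return [math.sin(1.0 + 3.0 * len(prefix) + sum(prefix) + 2.0 * token)
            for token in range(vocab_size)]


def ancestral_sample(logits_fn, length, rng, temperature=1.0):
    """Sample one sequence token by token (sampling with replacement)."""
    seq = ()
    for _ in range(length):
        probs = softmax(logits_fn(seq), temperature)
        token = rng.choices(range(len(probs)), weights=probs)[0]
        seq += (token,)
    return seq


def rejection_sample_without_replacement(logits_fn, k, length, rng,
                                         temperature=1.0, max_draws=100_000):
    """Draw sequences until k distinct ones are found; return them and the draw count."""
    unique, seen, draws = [], set(), 0
    while len(unique) < k:
        if draws >= max_draws:
            raise RuntimeError("gave up: too many duplicate draws")
        seq = ancestral_sample(logits_fn, length, rng, temperature)
        draws += 1
        if seq not in seen:  # reject duplicates
            seen.add(seq)
            unique.append(seq)
    return unique, draws
```

The draw count is a random quantity that grows as the distribution becomes more peaked (for example at low temperature), which is the cost that a single pass of Stochastic Beam Search avoids.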
In this experiment we compare Stochastic Beam Search as a principled (stochastic) alternative to Diverse Beam Search (Vijayakumar et al., 2018) in the context of neural machine translation to obtain a diverse set of translations for a single source sentence $x$. Following the setup by Vijayakumar et al. (2018), we report both diversity as measured by the fraction of unique $n$-grams in the $k$ translations as well as mean and maximum BLEU score (Papineni et al., 2002) as an indication of the quality of the sample. The maximum BLEU score corresponds to ‘oracle performance’ reported by Vijayakumar et al. (2018), but we report the mean as well, since a set consisting of a single good translation and otherwise completely random sentences scores high on both maximum BLEU score and diversity, while being undesirable. A good method should increase diversity without sacrificing mean BLEU score.
We compare four different sentence generation methods: Beam Search (BS), Sampling, Stochastic Beam Search (SBS) (sampling without replacement) and Diverse Beam Search with $G$ groups (DBS($G$)) (Vijayakumar et al., 2018). For Sampling and Stochastic Beam Search, we control the diversity using the softmax temperature $T$ in Equation (5). We vary $T$, where a higher $T$ results in higher diversity. Heuristically, we also vary $T$ for computing the scores with (deterministic) Beam Search. The diversity of Diverse Beam Search is controlled by the diversity strength parameter, which we vary over a range of values. We set the number of groups $G$ equal to the sample size $k$, which Vijayakumar et al. (2018) reported as the best choice.
We modify the Beam Search implementation in fairseq (Footnote 3: https://github.com/pytorch/fairseq) to implement Stochastic Beam Search, and use the fairseq implementations for Beam Search, Sampling and Diverse Beam Search. For theoretical correctness of Stochastic Beam Search, we disable length-normalization (Wu et al., 2016) and early stopping (and therefore also do not use these parameters for the other methods). We use the pretrained model from Gehring et al. (2017) and use the wmt14.v2.enfr.newstest2014 (Footnote 4: https://s3.amazonaws.com/fairseqpy/data/wmt14.v2.enfr.newstest2014.tar.bz2) test set consisting of 3003 sentences. We run the four methods with several sample sizes $k$ and plot the minimum, mean and maximum BLEU score among the $k$ translations (averaged over the test set) against the $n$-gram diversity averaged over several orders $n$, where $n$-gram diversity is defined as:
$d_n = \frac{\#\ \text{of unique}\ n\text{-grams in the}\ k\ \text{translations}}{\text{total}\ \#\ \text{of}\ n\text{-grams in the}\ k\ \text{translations}}$
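As a rough illustration of the diversity measure (the fraction of unique $n$-grams among all $n$-grams in a set of translations), consider the following Python sketch. Whitespace tokenization, the function names and the `orders` parameter are assumptions made for this example and may differ from the evaluation code used for the experiments.

```python
def ngram_diversity(translations, n):
    """Fraction of unique n-grams among all n-grams in the given translations."""
    ngrams = []
    for sentence in translations:
        tokens = sentence.split()  # assumption: whitespace tokenization
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def average_ngram_diversity(translations, orders):
    """Average the n-gram diversity over the given n-gram orders."""
    return sum(ngram_diversity(translations, n) for n in orders) / len(orders)
```

For the two sentences "a b c" and "a b d", four of the six unigrams and three of the four bigrams are unique, giving diversities of 4/6 and 3/4.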
In Figure 2, we represent the results as curves, indicating the trade-off between diversity and BLEU score. The points indicate data points and the dashed lines indicate the (averaged) minimum and maximum BLEU score. For the same diversity, Stochastic Beam Search achieves higher mean/maximum BLEU score. Looking at a certain BLEU score, we observe that Stochastic Beam Search achieves the same BLEU score as Diverse Beam Search with a significantly larger diversity. For low temperatures, the maximum BLEU score of Stochastic Beam Search is comparable to the deterministic Beam Search, so the increased diversity does not sacrifice the best element in the sample. Note that Sampling achieves higher mean BLEU score at the cost of diversity, which may be because good translations are sampled repeatedly. However, the maximum BLEU score of both Sampling and Diverse Beam Search is lower than with Beam Search and Stochastic Beam Search.
In our second experiment, we use sampling without replacement to evaluate the expected sentence-level BLEU score for a translation $y$ given a source sentence $x$. Although we are often interested in corpus-level BLEU score, estimation of sentence-level BLEU score is useful, for example when training using minibatches to directly optimize BLEU score (Ranzato et al., 2016).
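We leave dependence of the BLEU score on the source sentence $x$ implicit, and write $f(y) = \mathrm{BLEU}(y, x)$. Writing the domain of $y$ (given $x$) as $y^i$, $i \in N$ (e.g. all possible translations), we want to estimate the following expectation:
(11) $\mathbb{E}_{y \sim p_\theta(y|x)}\left[ f(y) \right] = \sum_{i \in N} p_\theta(y^i | x) f(y^i)$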
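Under a Monte Carlo (MC) sampling with replacement scheme with size $k$, we write $S$ as the set (Footnote 5: Formally, when sampling with replacement, $S$ is a multiset.) of indices of sampled sequences and estimate (11) using
(12) $\mathbb{E}_{y \sim p_\theta(y|x)}\left[ f(y) \right] \approx \frac{1}{k} \sum_{i \in S} f(y^i)$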
If the distribution $p_\theta$ has low entropy (for example if there are only few valid translations), then MC estimation may be inefficient since repeated samples are uninformative. We can use sampling without replacement as an improvement, but we need to use importance weights to correct for the changed sampling probabilities. Using the Gumbel-Top-$k$ trick, we can implement an estimator equivalent to the estimator described by Vieira (2017), derived from priority sampling (Duffield et al., 2007):
(13) $\mathbb{E}_{y \sim p_\theta(y|x)}\left[ f(y) \right] \approx \sum_{i \in S} \frac{p_\theta(y^i | x)}{q_{\theta,\kappa}(y^i | x)} f(y^i)$
Here $\kappa$ is the $(k+1)$-th largest element of $\{G_{\phi_i} : i \in N\}$, which can be considered the empirical threshold for the Gumbel-Top-$k$ trick (since $i \in S$ if $G_{\phi_i} > \kappa$), and we define $q_{\theta,a}(y^i | x) = P(G_{\phi_i} > a) = 1 - \exp(-\exp(\phi_i - a))$. If we would assume a fixed threshold $a$ and variably sized sample $S = \{i \in N : G_{\phi_i} > a\}$, then $q_{\theta,a}(y^i | x) = P(i \in S)$ and $p_\theta(y^i | x) / q_{\theta,a}(y^i | x)$ is a standard importance weight. Surprisingly, using a fixed sample size $k$ (and empirical threshold $\kappa$) also yields an unbiased estimator, and we include a proof adapted from Duffield et al. (2007) and Vieira (2017) in the appendix. To obtain $\kappa$, we need to sacrifice the last sample, slightly increasing variance. (Footnote 6: For Stochastic Beam Search, we use a sample/beam size $k$, set $\kappa$ equal to the $k$-th largest perturbed log-probability and compute (13) based on the remaining $k - 1$ samples. Alternatively, we could use a beam size of $k + 1$.)
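Empirically, the estimator (13) has high variance, and in practice it is preferred to normalize the importance weights by $W(S) = \sum_{i \in S} \frac{p_\theta(y^i | x)}{q_{\theta,\kappa}(y^i | x)}$ (Hesterberg, 1988):
(14) $\mathbb{E}_{y \sim p_\theta(y|x)}\left[ f(y) \right] \approx \frac{1}{W(S)} \sum_{i \in S} \frac{p_\theta(y^i | x)}{q_{\theta,\kappa}(y^i | x)} f(y^i)$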
The estimator (14) is biased but consistent: in the limit $k = n$ we sample the entire domain, so we have empirical threshold $\kappa = -\infty$ and $q_{\theta,\kappa}(y^i | x) = 1$ and $W(S) = 1$, such that (14) is equal to (11).
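The estimators (12), (13) and (14) can be illustrated on a small, explicitly enumerated domain. The sketch below assumes normalized log-probabilities (so that $p_i = \exp \phi_i$) and takes the threshold from the $(k+1)$-th largest perturbed log-probability; the function names are our own and the code is a toy illustration rather than the experimental implementation.

```python
import math
import random


def sample_gumbel(location, rng):
    u = rng.random()
    while u == 0.0:  # avoid log(0)
        u = rng.random()
    return location - math.log(-math.log(u))


def inclusion_prob(phi, kappa):
    """q = P(G_phi > kappa) = 1 - exp(-exp(phi - kappa)), computed with expm1."""
    # Clamp the exponent so that exp() cannot overflow (also covers kappa = -inf).
    return -math.expm1(-math.exp(min(phi - kappa, 700.0)))


def sbs_estimates(log_probs, f_values, k, rng):
    """Return the unnormalized (13) and normalized (14) estimates of sum_i p_i f_i."""
    n = len(log_probs)
    perturbed = [sample_gumbel(phi, rng) for phi in log_probs]
    order = sorted(range(n), key=perturbed.__getitem__, reverse=True)
    sample = order[:k]
    # Empirical threshold: the (k+1)-th largest value, or -inf if k covers the domain.
    kappa = perturbed[order[k]] if k < n else -math.inf
    weights = [math.exp(log_probs[i]) / inclusion_prob(log_probs[i], kappa)
               for i in sample]
    unnormalized = sum(w * f_values[i] for w, i in zip(weights, sample))
    return unnormalized, unnormalized / sum(weights)


def mc_estimate(log_probs, f_values, k, rng):
    """Monte Carlo estimate (12) from k samples drawn with replacement."""
    probs = [math.exp(phi) for phi in log_probs]
    sample = rng.choices(range(len(probs)), weights=probs, k=k)
    return sum(f_values[i] for i in sample) / k
```

When `k` equals the domain size, the threshold is $-\infty$, all weights reduce to $p_i$, and both estimates coincide with the exact expectation, in line with the consistency argument above.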
We have to take care when computing the importance weights: depending on the entropy, the terms in the quotient can become very small, and in our case the computation of $q_{\theta,\kappa}(y^i | x) = 1 - \exp(-\exp(\phi_i - \kappa))$ can suffer from catastrophic cancellation. For details, see the appendix.
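The cancellation issue can be seen in a short Python sketch. It contrasts the naive evaluation of $1 - \exp(-\exp(\phi - \kappa))$ with a version based on `math.expm1`, and computes the log importance weight with the usual two-branch evaluation of $\log(1 - \exp(x))$. This is one common way to stabilize the computation, shown under our own naming; it is not claimed to be the exact procedure of the appendix.

```python
import math


def naive_inclusion_prob(phi, kappa):
    """Direct evaluation; loses all precision when exp(phi - kappa) is tiny."""
    return 1.0 - math.exp(-math.exp(phi - kappa))


def stable_inclusion_prob(phi, kappa):
    """Same quantity via expm1, which stays accurate for tiny arguments."""
    return -math.expm1(-math.exp(phi - kappa))


def log_importance_weight(phi, kappa):
    """log(p / q) = phi - log q with q = 1 - exp(-exp(phi - kappa)).

    Assumes phi - kappa lies roughly within (-700, 700) so that exp() neither
    underflows to zero nor overflows.
    """
    x = -math.exp(phi - kappa)            # x < 0 and q = 1 - exp(x)
    if x > -math.log(2.0):
        log_q = math.log(-math.expm1(x))  # accurate when exp(x) is close to 1
    else:
        log_q = math.log1p(-math.exp(x))  # accurate when exp(x) is close to 0
    return phi - log_q
```

For example, with `phi = -50.0` and `kappa = 0.0` the naive version returns exactly zero, which would make the importance weight undefined, whereas the stable version returns a small positive number and a log-weight close to zero.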
Because the model is not trained to use its own predictions as input, at test time errors can accumulate. As a result, when sampling with the default temperature $T = 1$, the expected BLEU score is very low (below 10). To improve the quality of generated sentences we experiment with several lower temperatures. We then use different methods to estimate the BLEU score:
- Monte Carlo (MC), using Equation (12).
- Stochastic Beam Search (SBS), using the estimator in Equation (13) and its normalized variant in Equation (14).
- Beam Search (BS), where we compute a deterministic beam $S$ (the temperature $T$ affects the scoring) and compute $\sum_{i \in S} p_\theta(y^i | x) f(y^i)$. This is not a statistical estimator, but a lower bound to the target (11), which serves as a validation of the implementation and gives insight into how many sequences we need at least to capture most of the mass in (11). We also compute the normalized version $\frac{\sum_{i \in S} p_\theta(y^i | x) f(y^i)}{\sum_{i \in S} p_\theta(y^i | x)}$, which can heuristically be considered as a ‘deterministic estimate’.
In Figure 3 we show the results of computing each estimate 100 times (BS only once as it is deterministic) for three different sentences (Footnote 7: Sentences 1500, 2000 and 2500 from the WMT14 dataset. Sentences 0, 500 and 1000 are shorter and obtained a BLEU score of 0 in some cases, making them uninteresting for comparison.) for several temperatures and a range of sample sizes. We report the empirical mean together with a lower and an upper percentile. The normalized SBS estimator indeed achieves significantly lower variance than the unnormalized version and, for the lower temperatures, it significantly reduces variance compared to MC, without adding observable bias. For the highest temperature the results are similar, but we are less interested in this case as the overall BLEU score is lower than at lower temperatures.
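In addition to estimating the BLEU score, we use $f(y) = -\log p_\theta(y | x)$ such that Equation (11) becomes the model entropy (conditioned on the source sentence $x$):
$\mathbb{E}_{y \sim p_\theta(y|x)}\left[ -\log p_\theta(y | x) \right]$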
Entropy estimation is useful in optimization settings where we want to include an entropy loss to ensure diversity. It is a different problem than BLEU score estimation, as high BLEU score (for a good model) correlates positively with model probability, while for entropy rare $y$ contribute the largest terms $-\log p_\theta(y | x)$. We use the same experimental setup as for the BLEU score and present the results in Figure 4. The results are similar to the BLEU score experiment: the normalized SBS estimate has significantly lower variance than MC for the lower temperatures, while for the highest temperature results are similar. This shows that Stochastic Beam Search can be used to construct practical statistical estimators.
The idea of sampling by solving optimization problems has been used for various purposes (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012; Tarlow et al., 2012; Ermon et al., 2013; Maddison et al., 2014; Chen & Ghahramani, 2016; Balog et al., 2017), but to our knowledge this idea has not been used for sampling without replacement.
Most closely related to our work is the work by Maddison et al. (2014), who note that the Gumbel-Max trick (Gumbel, 1954) can be applied implicitly and generalize it to continuous distributions, using an A* search to implicitly find the maximum of a Gumbel process. In this work, we extend the idea of top-down sampling to efficiently draw multiple samples without replacement from a factorized distribution (with possibly exponentially large domain) by implicitly applying the Gumbel-Top-$k$ trick. This is a new and practical sampling method.
The blog post by Vieira (2014) describes the relation of the Gumbel-Top-$k$ trick (as we call it) to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), which is an algorithm for drawing weighted samples without replacement from a stream but also yields a practical parallel algorithm for sampling without replacement. In this setting, there is a fixed probability $p_i$ for each element $i$, but no parametric (sequence) model. The implication that the Gumbel-Top-$k$ trick can be used for sampling without replacement is not widely known. (Footnote 8: For example, at the time of writing the popular PyTorch (Paszke et al., 2017) library uses the (for large $k$) expensive sequential algorithm.)
The Gumbel-Max trick has also been used to define relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016), which can be reparameterized for low-variance but biased gradient estimators and can also be used for training sequence models (Gu et al., 2018). We think our work is a step in the direction of improving these methods in the context of sequence models.
Beam search is widely used for approximate inference in various domains such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Ranzato et al., 2016; Vaswani et al., 2017; Gehring et al., 2017), image captioning (Vinyals et al., 2015b), speech recognition (Graves et al., 2013) and other structured prediction settings (Vinyals et al., 2015a; Weiss et al., 2015). Although typically a test-time procedure, there are works that include beam search in the training loop (Daumé et al., 2009; Wiseman & Rush, 2016; Edunov et al., 2018b; Negrinho et al., 2018; Edunov et al., 2018a) for training sequence models on the sequence level (Ranzato et al., 2016; Bahdanau et al., 2017). Many variants of beam search have been developed, such as a continuous relaxation (Goyal et al., 2018), diversity-encouraging variants (Li et al., 2016; Shao et al., 2017; Vijayakumar et al., 2018) or using modifications such as length-normalization (Wu et al., 2016) or simply applying noise to the output (Edunov et al., 2018a). Our Stochastic Beam Search is a principled alternative that shares some of the benefits of these heuristic variants, such as the ability to control diversity or produce randomized output.
We introduced Stochastic Beam Search: an algorithm that is easy to implement on top of a beam search as a way to sample sequences without replacement. This algorithm relates sampling and beam search, combining advantages of these two methods. Our experiments support the idea that it can be used as a drop-in replacement in places where sampling or beam search is used. In fact, our experiments show Stochastic Beam Search can be used to yield lower-variance estimators and high-diversity samples from a neural machine translation model. In future work, we plan to leverage the probabilistic interpretation of beam search to develop new beam search related statistical learning methods.
References
 Bahdanau et al. (2015) Bahdanau, D., Cho, K., and Bengio, Y. Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations, 2015.
 Bahdanau et al. (2017) Bahdanau, D., Brakel, P., Xu, K., Goyal, A., Lowe, R., Pineau, J., Courville, A., and Bengio, Y. An actor-critic algorithm for sequence prediction. In International Conference on Learning Representations, 2017.
 Balog et al. (2017) Balog, M., Tripuraneni, N., Ghahramani, Z., and Weller, A. Lost relatives of the Gumbel trick. In International Conference on Machine Learning (ICML), pp. 371–379. PMLR, 2017.
 Chen & Ghahramani (2016) Chen, Y. and Ghahramani, Z. Scalable discrete sampling as a multi-armed bandit problem. In International Conference on Machine Learning, pp. 2492–2501, 2016.
 Daumé et al. (2009) Daumé, H., Langford, J., and Marcu, D. Search-based structured prediction. Machine learning, 75(3):297–325, 2009.
 Duffield et al. (2007) Duffield, N., Lund, C., and Thorup, M. Priority sampling for estimation of arbitrary subset sums. Journal of the ACM (JACM), 54(6):32, 2007.
 Edunov et al. (2018a) Edunov, S., Ott, M., Auli, M., and Grangier, D. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 489–500, 2018a.
 Edunov et al. (2018b) Edunov, S., Ott, M., Auli, M., Grangier, D., et al. Classical structured prediction losses for sequence to sequence learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), volume 1, pp. 355–364, 2018b.
 Efraimidis & Spirakis (2006) Efraimidis, P. S. and Spirakis, P. G. Weighted random sampling with a reservoir. Information Processing Letters, 97(5):181–185, 2006.
 Ermon et al. (2013) Ermon, S., Gomes, C. P., Sabharwal, A., and Selman, B. Embed and project: Discrete sampling with universal hashing. In Advances in Neural Information Processing Systems, pp. 2085–2093, 2013.
 Gehring et al. (2017) Gehring, J., Auli, M., Grangier, D., Yarats, D., and Dauphin, Y. N. Convolutional sequence to sequence learning. In International Conference on Machine Learning, pp. 1243–1252, 2017.
 Goyal et al. (2018) Goyal, K., Neubig, G., Dyer, C., and Berg-Kirkpatrick, T. A continuous relaxation of beam search for end-to-end training of neural sequence models. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
 Graves et al. (2013) Graves, A., Mohamed, A.-r., and Hinton, G. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing, IEEE international conference on, pp. 6645–6649, 2013.
 Gu et al. (2018) Gu, J., Im, D. J., and Li, V. O. Neural machine translation with Gumbel-greedy decoding. In Thirty-Second AAAI Conference on Artificial Intelligence (AAAI), 2018.
 Gumbel (1954) Gumbel, E. J. Statistical theory of extreme values and some practical applications: a series of lectures. Number 33. US Govt. Print. Office, 1954.
 Hazan & Jaakkola (2012) Hazan, T. and Jaakkola, T. On the partition function and random maximum a-posteriori perturbations. In Proceedings of the 29th International Conference on Machine Learning, pp. 1667–1674. Omnipress, 2012.
 Hesterberg (1988) Hesterberg, T. C. Advances in importance sampling. PhD thesis, Stanford University, 1988.
 Jang et al. (2016) Jang, E., Gu, S., and Poole, B. Categorical reparameterization with Gumbel-softmax. In International Conference on Learning Representations, 2016.
 Li et al. (2016) Li, J., Monroe, W., and Jurafsky, D. A simple, fast diverse decoding algorithm for neural generation. arXiv preprint arXiv:1611.08562, 2016.
 Mächler (2012) Mächler, M. Accurately computing $\log(1 - \exp(-|a|))$ assessed by the Rmpfr package, 2012. URL https://cran.rproject.org/web/packages/Rmpfr/vignettes/log1mexpnote.pdf.
 Maddison et al. (2014) Maddison, C. J., Tarlow, D., and Minka, T. A* sampling. In Advances in Neural Information Processing Systems, pp. 3086–3094, 2014.
 Maddison et al. (2016) Maddison, C. J., Mnih, A., and Teh, Y. W. The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations, 2016.
 Negrinho et al. (2018) Negrinho, R., Gormley, M., and Gordon, G. J. Learning beam search policies via imitation learning. In Advances in Neural Information Processing Systems, pp. 10673–10682, 2018.
 Papandreou & Yuille (2011) Papandreou, G. and Yuille, A. L. Perturb-and-map random fields: Using discrete optimization to learn and sample from energy models. In Computer Vision (ICCV), IEEE International Conference on, pp. 193–200, 2011.
 Papineni et al. (2002) Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318, 2002.
 Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.
 Ranzato et al. (2016) Ranzato, M., Chopra, S., Auli, M., and Zaremba, W. Sequence level training with recurrent neural networks. In International Conference on Learning Representations, 2016.
 Shao et al. (2017) Shao, Y., Gouws, S., Britz, D., Goldie, A., Strope, B., and Kurzweil, R. Generating high-quality and informative conversation responses with sequence-to-sequence models. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pp. 2210–2219, 2017.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112, 2014.
 Tarlow et al. (2012) Tarlow, D., Adams, R., and Zemel, R. Randomized optimum models for structured prediction. In Artificial Intelligence and Statistics, pp. 1221–1229, 2012.
 Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017.
 Vieira (2014) Vieira, T. Gumbel-max trick and weighted reservoir sampling, 2014. URL https://timvieira.github.io/blog/post/2014/08/01/gumbel-max-trick-and-weighted-reservoir-sampling/.
 Vieira (2017) Vieira, T. Estimating means in a finite universe, 2017. URL https://timvieira.github.io/blog/post/2017/07/03/estimating-means-in-a-finite-universe/.
 Vijayakumar et al. (2018) Vijayakumar, A. K., Cogswell, M., Selvaraju, R. R., Sun, Q., Lee, S., Crandall, D. J., and Batra, D. Diverse beam search for improved description of complex scenes. In AAAI, 2018.
 Vinyals et al. (2015a) Vinyals, O., Fortunato, M., and Jaitly, N. Pointer networks. In Advances in Neural Information Processing Systems, pp. 2692–2700, 2015a.
 Vinyals et al. (2015b) Vinyals, O., Toshev, A., Bengio, S., and Erhan, D. Show and tell: A neural image caption generator. In Proceedings of the IEEE Conference on Computer vision and Pattern Recognition, pp. 3156–3164, 2015b.
 Weiss et al. (2015) Weiss, D., Alberti, C., Collins, M., and Petrov, S. Structured training for neural network transition-based parsing. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), volume 1, pp. 323–333, 2015.
 Wiseman & Rush (2016) Wiseman, S. and Rush, A. M. Sequence-to-sequence learning as beam-search optimization. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1296–1306, 2016.
 Wu et al. (2016) Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
Theorem 1.
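For $k \le n$, let $I_1^*, \ldots, I_k^* = \arg\operatorname{top}k \; G_{\phi_i}$, where the top $k$ is taken over $i \in N$. Then $I_1^*, \ldots, I_k^*$ is an (ordered) sample without replacement from the $\text{Categorical}\left(\frac{\exp \phi_i}{\sum_{j \in N} \exp \phi_j},\, i \in N\right)$ distribution, i.e. for a realization $i_1^*, \ldots, i_k^*$ it holds that
$$P\left(I_1^* = i_1^*, \ldots, I_k^* = i_k^*\right) = \prod_{j=1}^{k} \frac{\exp \phi_{i_j^*}}{\sum_{\ell \in N_j^*} \exp \phi_\ell} \tag{15}$$
where $N_j^* = N \setminus \{i_1^*, \ldots, i_{j-1}^*\}$ is the domain (without replacement) for the $j$-th sampled element.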
Proof.
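First note that
\begin{align}
& P\left(I_k^* = i_k^* \mid I_1^* = i_1^*, \ldots, I_{k-1}^* = i_{k-1}^*\right) \notag \\
&= P\left(i_k^* = \arg\max_{i \in N_k^*} G_{\phi_i} \,\middle|\, \max_{i \in N_k^*} G_{\phi_i} < G_{\phi_{i_{k-1}^*}}\right) \tag{16} \\
&= P\left(i_k^* = \arg\max_{i \in N_k^*} G_{\phi_i}\right) \tag{17} \\
&= \frac{\exp \phi_{i_k^*}}{\sum_{\ell \in N_k^*} \exp \phi_\ell}. \tag{18}
\end{align}
The step from (16) to (17) follows from the independence of the $\max$ and the $\arg\max$ (see the main text), and the step from (17) to (18) uses the Gumbel-Max trick. The proof follows by induction on $k$. The case $k = 1$ is the Gumbel-Max trick, while if we assume the result (15) proven for $k - 1$, then
\begin{align}
P\left(I_1^* = i_1^*, \ldots, I_k^* = i_k^*\right)
&= P\left(I_k^* = i_k^* \mid I_1^* = i_1^*, \ldots, I_{k-1}^* = i_{k-1}^*\right) \cdot P\left(I_1^* = i_1^*, \ldots, I_{k-1}^* = i_{k-1}^*\right) \notag \\
&= \frac{\exp \phi_{i_k^*}}{\sum_{\ell \in N_k^*} \exp \phi_\ell} \cdot \prod_{j=1}^{k-1} \frac{\exp \phi_{i_j^*}}{\sum_{\ell \in N_j^*} \exp \phi_\ell} \tag{19} \\
&= \prod_{j=1}^{k} \frac{\exp \phi_{i_j^*}}{\sum_{\ell \in N_j^*} \exp \phi_\ell}. \notag
\end{align}
In (19) we have used Equation (18) and Equation (15) for $k - 1$ by induction. ∎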
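The statement of Theorem 1 can be checked empirically. The following is a minimal sketch using only the Python standard library, not the authors' implementation: the helper names (`sample_gumbel`, `gumbel_top_k`, `sequential_prob`) and the toy distribution are illustrative assumptions. It draws ordered samples by taking the top $k$ perturbed log-probabilities and compares their empirical frequencies with the product of renormalized probabilities in Equation (15).

```python
import math
import random
from collections import Counter


def sample_gumbel(phi, rng):
    """Draw G ~ Gumbel(phi) by inverse-CDF sampling: phi - log(-log U)."""
    u = rng.random()
    while u == 0.0:  # avoid log(0); rng.random() lies in [0, 1)
        u = rng.random()
    return phi - math.log(-math.log(u))


def gumbel_top_k(log_probs, k, rng):
    """Indices of the k largest perturbed log-probabilities, largest first."""
    perturbed = [sample_gumbel(phi, rng) for phi in log_probs]
    order = sorted(range(len(log_probs)), key=lambda i: perturbed[i], reverse=True)
    return tuple(order[:k])


def sequential_prob(log_probs, ordered_sample):
    """Probability of an ordered sample without replacement (Equation (15))."""
    remaining = set(range(len(log_probs)))
    prob = 1.0
    for i in ordered_sample:
        denom = sum(math.exp(log_probs[j]) for j in remaining)
        prob *= math.exp(log_probs[i]) / denom
        remaining.remove(i)  # the domain shrinks after each draw
    return prob


rng = random.Random(0)
log_probs = [math.log(p) for p in (0.5, 0.3, 0.2)]  # toy distribution (assumption)
num_draws = 100_000
counts = Counter(gumbel_top_k(log_probs, 2, rng) for _ in range(num_draws))

# Print each ordered pair with its empirical frequency and its exact probability.
for sample in sorted(counts):
    print(sample, counts[sample] / num_draws, sequential_prob(log_probs, sample))
```

With enough draws, the two printed columns should agree up to Monte Carlo error for every ordered pair.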
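A random variable $G'$ has a truncated Gumbel distribution with location $\phi$ and maximum $T$ (i.e. $G' \sim \text{TruncatedGumbel}(\phi, T)$) with CDF $F_{\phi,T}(g)$ if, for $G \sim \text{Gumbel}(\phi)$ with CDF $F_\phi(g) = \exp(-\exp(\phi - g))$:
$$F_{\phi,T}(g) = P(G' \le g) = P(G \le g \mid G \le T) = \frac{F_\phi(\min(g, T))}{F_\phi(T)} = \frac{\exp(-\exp(\phi - \min(g, T)))}{\exp(-\exp(\phi - T))} = \exp\left(\exp(\phi - T) - \exp(\phi - \min(g, T))\right). \tag{20}$$
The inverse CDF is:
$$F_{\phi,T}^{-1}(u) = \phi - \log\left(\exp(\phi - T) - \log u\right). \tag{21}$$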
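As an illustration of the truncated Gumbel CDF and its inverse, here is a minimal sketch that assumes the standard Gumbel CDF $\exp(-\exp(\phi - g))$, so that the truncated CDF is $\exp(\exp(\phi - T) - \exp(\phi - \min(g, T)))$ and its inverse is $\phi - \log(\exp(\phi - T) - \log u)$. The function names and the parameter values are hypothetical. Sampling proceeds by inverse-CDF sampling from a uniform variable.

```python
import math
import random


def truncated_gumbel_cdf(g, phi, T):
    """CDF of a Gumbel(phi) variable conditioned on being at most T."""
    return math.exp(math.exp(phi - T) - math.exp(phi - min(g, T)))


def truncated_gumbel_inv_cdf(u, phi, T):
    """Inverse of truncated_gumbel_cdf for u in (0, 1]."""
    return phi - math.log(math.exp(phi - T) - math.log(u))


def sample_truncated_gumbel(phi, T, rng):
    """Inverse-CDF sampling; 1 - random() lies in (0, 1], so log(u) is finite."""
    u = 1.0 - rng.random()
    return truncated_gumbel_inv_cdf(u, phi, T)


rng = random.Random(0)
phi, T = 0.5, 1.0  # illustrative values (assumption)
samples = [sample_truncated_gumbel(phi, T, rng) for _ in range(50_000)]

# Every sample respects the truncation, and the empirical CDF at g0
# should be close to the analytic CDF.
g0 = 0.0
empirical = sum(s <= g0 for s in samples) / len(samples)
print(max(samples) <= T + 1e-9, empirical, truncated_gumbel_cdf(g0, phi, T))
```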
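In order to sample a set of Gumbel variables $\{\tilde{G}_{\phi_i} \mid \max_i \tilde{G}_{\phi_i} = T\}$, i.e. with their maximum being exactly $T$, we can first sample the $\arg\max$, $i^*$, and then sample the Gumbels conditionally on both the $\max$ and the $\arg\max$:

1. Sample $i^* \sim \text{Categorical}\left(\frac{\exp \phi_i}{\sum_j \exp \phi_j}\right)$. We do not need to condition on $T$ since the $\arg\max$ $i^*$ is independent of the $\max$ $T$ (see the main text).

2. Set $\tilde{G}_{\phi_{i^*}} = T$, since this follows from conditioning on the $\max$ $T$ and the $\arg\max$ $i^*$.

3. Sample $\tilde{G}_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, T)$ for $i \ne i^*$. This works because, conditioning on the $\max$ $T$ and the $\arg\max$ $i^*$, it holds that:
$$P\left(\tilde{G}_{\phi_i} < g \,\middle|\, \max_i \tilde{G}_{\phi_i} = T,\ \arg\max_i \tilde{G}_{\phi_i} = i^*,\ i \ne i^*\right) = P\left(\tilde{G}_{\phi_i} < g \,\middle|\, \tilde{G}_{\phi_i} < T\right).$$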
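Equivalently, we can let $G_{\phi_i} \sim \text{Gumbel}(\phi_i)$, let $Z = \max_i G_{\phi_i}$ and define
\begin{align}
\tilde{G}_{\phi_i} = F_{\phi_i,T}^{-1}\left(F_{\phi_i,Z}(G_{\phi_i})\right)
&= \phi_i - \log\left(\exp(\phi_i - T) - \exp(\phi_i - Z) + \exp(\phi_i - G_{\phi_i})\right) \notag \\
&= -\log\left(\exp(-T) - \exp(-Z) + \exp(-G_{\phi_i})\right). \tag{22}
\end{align}
Here we have used (20) and (21). Since the transformation (22) is monotonically increasing, it preserves the $\arg\max$ and it follows from the Gumbel-Max trick (3) that
$$\arg\max_i \tilde{G}_{\phi_i} = \arg\max_i G_{\phi_i} \sim \text{Categorical}\left(\frac{\exp \phi_i}{\sum_j \exp \phi_j}\right).$$
We can think of this as using the Gumbel-Max trick for step 1 (sampling the $\arg\max$) in the sampling process described above. Additionally, for $i = \arg\max_i G_{\phi_i}$:
$$\tilde{G}_{\phi_i} = F_{\phi_i,T}^{-1}\left(F_{\phi_i,Z}(G_{\phi_i})\right) = F_{\phi_i,T}^{-1}\left(F_{\phi_i,Z}(Z)\right) = F_{\phi_i,T}^{-1}(1) = T,$$
so here we recover step 2 (setting $\tilde{G}_{\phi_{i^*}} = T$). For $i \ne \arg\max_i G_{\phi_i}$ it holds that:
\begin{align}
P\left(\tilde{G}_{\phi_i} \le g \,\middle|\, i \ne \arg\max_i G_{\phi_i}\right)
&= \mathbb{E}_Z\left[P\left(F_{\phi_i,T}^{-1}\left(F_{\phi_i,Z}(G_{\phi_i})\right) \le g \,\middle|\, Z,\ G_{\phi_i} < Z\right)\right] \notag \\
&= \mathbb{E}_Z\left[P\left(G_{\phi_i} \le F_{\phi_i,Z}^{-1}\left(F_{\phi_i,T}(g)\right) \,\middle|\, Z,\ G_{\phi_i} < Z\right)\right] \notag \\
&= \mathbb{E}_Z\left[F_{\phi_i,Z}\left(F_{\phi_i,Z}^{-1}\left(F_{\phi_i,T}(g)\right)\right)\right] = F_{\phi_i,T}(g). \notag
\end{align}
This means that $\tilde{G}_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, T)$, so this is equivalent to step 3 (sampling $\tilde{G}_{\phi_i} \sim \text{TruncatedGumbel}(\phi_i, T)$ for $i \ne i^*$).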
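The transformation (22) can be sketched in a few lines. This is an illustrative example rather than the authors' code: it assumes (22) has the closed form $-\log(\exp(-T) - \exp(-Z) + \exp(-G_{\phi_i}))$ with $Z = \max_i G_{\phi_i}$, and the helper names and toy log-probabilities are hypothetical. It uses the direct formula, which is adequate only for moderate values.

```python
import math
import random


def sample_gumbel(phi, rng):
    """Draw G ~ Gumbel(phi) as phi - log(-log U), with U in (0, 1]."""
    u = 1.0 - rng.random()
    while u == 1.0:  # -log(1) = 0 would make the outer log fail
        u = 1.0 - rng.random()
    return phi - math.log(-math.log(u))


def shift_gumbels_to_max(gumbels, T):
    """Map Gumbels with maximum Z to Gumbels with maximum T (direct formula)."""
    Z = max(gumbels)
    return [-math.log(math.exp(-T) - math.exp(-Z) + math.exp(-g)) for g in gumbels]


rng = random.Random(1)
log_probs = [math.log(p) for p in (0.5, 0.3, 0.2)]  # toy distribution (assumption)
T = 0.0  # target maximum (assumption)

gumbels = [sample_gumbel(phi, rng) for phi in log_probs]
shifted = shift_gumbels_to_max(gumbels, T)

# The largest Gumbel is mapped to T (step 2), the ordering is preserved
# (step 1), and all other values end up below T (step 3).
print(gumbels)
print(shifted)
```

Because the map is monotonically increasing, the index of the largest element is the same before and after the shift.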
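Direct computation of (22) can be unstable as large terms need to be exponentiated. Instead, we compute:
\begin{align}
v_i &= T - G_{\phi_i} + \mathrm{log1mexp}(G_{\phi_i} - Z) \tag{23} \\
\tilde{G}_{\phi_i} &= T - \max(0, v_i) - \mathrm{log1pexp}(-|v_i|) \tag{24}
\end{align}
where we have defined
$$\mathrm{log1mexp}(a) = \log(1 - \exp(a)), \quad a \le 0, \qquad \mathrm{log1pexp}(a) = \log(1 + \exp(a)).$$
This is equivalent as
\begin{align}
T - \max(0, v_i) - \log(1 + \exp(-|v_i|))
&= T - \log(1 + \exp(v_i)) \notag \\
&= T - \log\left(1 + \exp\left(T - G_{\phi_i} + \mathrm{log1mexp}(G_{\phi_i} - Z)\right)\right) \notag \\
&= T - \log\left(1 + \exp(T - G_{\phi_i})\left(1 - \exp(G_{\phi_i} - Z)\right)\right) \notag \\
&= T - \log\left(1 + \exp(T - G_{\phi_i}) - \exp(T - Z)\right) \notag \\
&= -\log\left(\exp(-T) + \exp(-G_{\phi_i}) - \exp(-Z)\right) = \tilde{G}_{\phi_i}. \notag
\end{align}
The first step can be easily verified by considering the cases $v_i < 0$ and $v_i \ge 0$. $\mathrm{log1mexp}$ and $\mathrm{log1pexp}$ can be computed accurately using $\mathrm{log1p}(a) = \log(1 + a)$ and $\mathrm{expm1}(a) = \exp(a) - 1$ (Mächler, 2012):
$$\mathrm{log1mexp}(a) = \begin{cases} \log(-\mathrm{expm1}(a)) & a > -0.693 \\ \mathrm{log1p}(-\exp(a)) & \text{otherwise} \end{cases} \qquad \mathrm{log1pexp}(a) = \begin{cases} \mathrm{log1p}(\exp(a)) & a < 18 \\ a + \exp(-a) & \text{otherwise} \end{cases}$$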
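A minimal sketch of the stable computation follows, using Python's built-in `math.log1p` and `math.expm1`. The cut-offs $-0.693$ and $18$ are assumed from Mächler (2012), and the function names `stable_shift` and `naive_shift` are hypothetical. The case $G_{\phi_i} = Z$ is handled separately in this sketch because $\mathrm{log1mexp}(0) = -\infty$, which `math.log` cannot return.

```python
import math


def log1mexp(a):
    """log(1 - exp(a)) for a < 0, switching formulas near log(1/2)."""
    if a > -0.693:
        return math.log(-math.expm1(a))
    return math.log1p(-math.exp(a))


def log1pexp(a):
    """log(1 + exp(a)), with an asymptotic form for large a."""
    if a < 18.0:
        return math.log1p(math.exp(a))
    return a + math.exp(-a)


def stable_shift(g, Z, T):
    """Equations (23) and (24) for a single Gumbel g <= Z."""
    if g == Z:
        return T  # v = -inf in (23), so (24) gives exactly T
    v = T - g + log1mexp(g - Z)
    return T - max(0.0, v) - log1pexp(-abs(v))


def naive_shift(g, Z, T):
    """Direct formula; overflows when -T, -Z or -g is large."""
    return -math.log(math.exp(-T) - math.exp(-Z) + math.exp(-g))


# Both versions agree for moderate values.
print(stable_shift(0.3, 1.2, -0.5), naive_shift(0.3, 1.2, -0.5))

# For very negative values the direct formula overflows in exp(-T),
# while the stable version still returns a finite result.
print(stable_shift(-790.0, -780.0, -800.0))
```

Only differences such as $T - G_{\phi_i}$ and $G_{\phi_i} - Z$ are ever exponentiated in the stable version, which is why it remains finite where the direct formula overflows.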
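We have to take care when computing the importance weights: depending on the entropy, the terms in the quotient $\frac{p_\theta(y^i)}{q_{\theta,\kappa}(y^i)}$ can become very small, and in our case the computation of $1 - \exp(-\exp(\phi_i - \kappa))$ can suffer from catastrophic cancellation. We can rewrite this expression using the more numerically stable implementation $\mathrm{expm1}(x) = \exp(x) - 1$ as
$$1 - \exp(-\exp(\phi_i - \kappa)) = -\mathrm{expm1}(-\exp(\phi_i - \kappa))$$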