Stochastic Beams and Where to Find Them
Abstract
The wellknown GumbelMax trick for sampling from a categorical distribution can be extended to sample elements without replacement. We show how to implicitly apply this ’GumbelTop’ trick on a factorized distribution over sequences, allowing to draw exact samples without replacement using a Stochastic Beam Search. Even for exponentially large domains, the number of model evaluations grows only linear in and the maximum sampled sequence length. The algorithm creates a theoretical connection between sampling and (deterministic) beam search and can be used as a principled intermediate alternative. In a translation task, the proposed method compares favourably against alternatives to obtain diverse yet good quality translations. We show that sequences sampled without replacement can be used to construct lowvariance estimators for expected sentencelevel BLEU score and model entropy.
1 Introduction
We think the GumbelMax trick Gumbel (1954); Maddison et al. (2014) is like a magic trick. It allows sampling from the categorical distribution, simply by perturbing the logprobability for each category by adding independent Gumbel distributed noise and returning the category with maximum perturbed logprobability. This trick has recently (re)gained popularity as it allows to derive reparameterizable continous relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016). However, there is more: as was noted (in a blog post) by Vieira (2014), taking the top largest perturbations (instead of the maximum, or top 1) yields a sample of size from the categorical distribution without replacement. We call this extension of the GumbelMax trick the GumbelTop trick.
In this paper, we consider factorized distributions over sequences, represented by (parametric) sequence models. Sequence models are widely used, e.g. in tasks such as neural machine translation (Sutskever et al., 2014; Bahdanau et al., 2015) and image captioning (Vinyals et al., 2015b). Many such tasks require obtaining a set of representative sequences from the model. These can be random samples, but for lowentropy distributions, a set of sequences sampled using standard sampling (with replacement) may contain many duplicates. On the other hand, a beam search can find a set of unique highprobability sequences, but these have low variability and, being deterministic, cannot be used to construct statistical estimators.
In this paper, we propose sampling without replacement as an alternative method to obtain representative sequences from a sequence model. We show that for a sequence model, we can do this by applying the GumbelTop trick implicitly, without instantiating all sequences in the (typically exponentially large) domain. This procedure allows us to draw sequences using a number of model evaluations that grows only linear in the number of samples and the maximum sampled sequence length. The algorithm uses the idea of topdown sampling (Maddison et al., 2014) and performs a beam search over stochastically perturbed logprobabilities, which is why we refer to it as Stochastic Beam Search.
Stochastic Beam Search is a novel procedure for sampling sequences that avoids duplicate samples. Unlike ordinary beam search, it has a probabilistic interpretation. Thus, it can e.g. be used in importance sampling. As such, Stochastic Beam Search conceptually connects sampling and beam search, and combines advantages of both methods. In Section 4 we give two examples of how Stochastic Beam Search can be used as a principled alternative to sampling or beam search. In these experiments, Stochastic Beam Search is used to control the diversity of translation results, as well as to construct low variance estimators for sentencelevel BLEU score and model entropy.
2 Preliminaries
2.1 The categorical distribution
A discrete random variable has a distribution with domain if . We refer to the categories as elements in the domain and denote with the (unnormalized) logprobabilities, so and . Therefore in general we write:
(1) 
2.2 The Gumbel distribution
If , then has a Gumbel distribution with location and we write . From this it follows that , so we can shift Gumbel variables.
2.3 The GumbelMax trick
The GumbelMax trick (Gumbel, 1954; Maddison et al., 2014) allows to sample from the categorical distribution (1) by independently perturbing the logprobabilities with Gumbel noise and finding the largest element.
Formally, let i.i.d. and let , then with . In a slight abuse of notation, we write and we call the perturbed logprobability of element or category in the domain .
2.4 The GumbelTop trick
Considering the maximum the top 1 (one), we can generalize the GumbelMax trick to the GumbelTop trick to draw an ordered sample of size without replacement
Theorem 1.
For , let . Then is an (ordered) sample without replacement from the distribution, e.g. for a realization it holds that
(4) 
where is the domain (without replacement) for the th sampled element.
2.5 Sequence models
A sequence model is a factorized parametric distribution over sequences. The parameters define the conditional probability of the next token given the partial sequence . Typically is defined as a softmax normalization of unnormalized logprobabilities with optional temperature (default ):
(5) 
The normalization is w.r.t. a single token, so the model is locally normalized. The total probability of a (partial) sequence follows from the chain rule of probability:
(6)  
(7) 
A sequence model defines a valid probability distribution over both partial and complete sequences. When the length is irrelevant, we simply write to indicate a (partial or complete) sequence. If the model is additionally conditioned on a context (e.g., a source sentence), we write .
2.6 Beam Search
A beam search is a limitedwidth breadth first search. In the context of sequence models, it is often used as an approximation to finding the (single) sequence that maximizes (7), or as a way to obtain a set of highprobability sequences from the model. Starting from an empty sequence (e.g. ), a beam search expands at every step at most partial sequences (those with highest probability) to compute the probabilities of sequences with length . It terminates with a beam of complete sequences, which we assume to be of equal length (as sequences can be padded).
3 Stochastic Beam Search
We derive Stochastic Beam Search by starting with the explicit application of the GumbelTop trick to sample sequences without replacement from a sequence model. This requires instantiating all sequences in the domain to find the largest perturbed logprobabilities. Then we transition to topdown sampling of the perturbed logprobabilities, and we use Stochastic Beam Search to instantiate (only) the sequences with the largest perturbed logprobabilities. As both methods are equivalent, Stochastic Beam Search implicitly applies the GumbelTop trick and thus yields a sample of sequences without replacement.
3.1 The GumbelTop trick on a tree
We represent the sequence model (7) as a tree (as in Figure 1), where internal nodes at level represent partial sequences , and leaf nodes represent completed sequences. We identify a leaf by its index and write as the corresponding sequence, with (normalized!) logprobability . To obtain a sample from the distribution (7) without replacement, we should sample from the set of leaf nodes without replacement, for which we can naively use the GumbelTop trick (Section 2.3):

Compute for all sequences . To reuse computations for partial sequences, the complete probability tree is instantiated, as in Figure 1.

Sample , so can be seen as the perturbed logprobability of sequence .

Let , then is a sample of sequences from (7) without replacement.
As instantiating the complete probability tree is computationally prohibitive, we construct an equivalent process that only requires computation linear in the number of samples and the sequence length.
Perturbed logprobabilities of partial sequences.
For the naive implementation of the GumbelTop trick, we only defined the perturbed logprobabilities for leaf nodes , which correspond to complete sequences . For the Stochastic Beam Search implementation, we also define the perturbed logprobabilities for internal nodes corresponding to partial sequences. We identify a node (internal or leaf) by the set of leaves in the corresponding subtree, and we write as the corresponding (partial or completed) sequence. Its logprobability can be computed incrementally from the parent logprobability using (6), and since the model is locally normalized, it holds that
(8) 
Now for each node , we define as the maximum of the perturbed logprobabilities in the subtree leaves . By Equation (2), has a Gumbel distribution with location (hence its notation ):
(9) 
Since is a Gumbel perturbation of the logprobability , we call it the perturbed logprobability of the partial sequence . We can define the corresponding Gumbel noise , which can be inferred from by the relation .
Bottomup sampling of perturbed logprobabilities.
We can recursively compute (9). Write as the as the set of direct children of the node (so is a partition of the set ). Since the maximum (9) must be attained in one of the subtrees, it holds that
(10) 
If we want to sample for all nodes, we can use the bottomup sampling procedure: sample the leaves and recursively compute using (10). This is effectively sampling from the degenerate (constant) distribution resulting from conditioning on the children.
Topdown sampling of perturbed logprobabilities.
The recursive bottomup sampling procedure can be interpreted as ancestral sampling from a treestructured graphical model (somewhat like Figure 1) with edges directed upwards. Alternatively, we can reverse the graphical model and sample the tree topdown, starting with the root and recursively sampling the children conditionally.
Note that for the root (since it contains all leaves ), it holds that , so we can let
Sampling a set of Gumbels conditionally on their maximum being equal to a certain value is nontrivial, but can be done by first sampling the and then sampling the individual Gumbels conditionally on both the and . Alternatively, we can let independently and let . Then
is a set of Gumbels with a maximum equal to . See Appendix B for details and numerically stable implementation.
If we recursively sample the complete tree topdown, this is equivalent to sampling the complete tree bottomup, and as a result, for all leaves, it holds that , independently. The benefit of using topdown sampling is that if we are interested only in obtaining the top leaves, we do not need to instantiate the complete tree.
Stochastic Beam Search
The key idea of Stochastic Beam Search is to apply the GumbelTop trick for a sequence model, without instantiating the entire tree, by using topdown sampling. With topdown sampling, to find the top leaves, at every level in the tree we can suffice with only expanding (instantiating the subtree for) the nodes with highest perturbed logprobability . To see this, first assume that we instantiated the complete tree using topdown sampling and consider the nodes that are ancestors of at least one of the top leaves (the shaded nodes in Figure 1). At every level of the tree, there will be at most such nodes (as each of the top leaves has only one ancestor at level ), and these nodes will have higher perturbed logprobabilities than the other nodes at level , which do not contain a top leaf in the subtree. This means that if we discard all but the nodes with highest logprobabilities , we are guaranteed to include the ancestors of the top leaves. Formally, the th highest logprobability of the nodes at level provides a lower bound required to be among the top leaves, while is an upper bound for the set of leaves such that it can be discarded or pruned if it is lower than the lower bound, so if is not among the top .
Thus, when we apply the topdown sampling procedure, at each level we only need to expand the nodes with the highest perturbed logprobabilities to end up with the top leaves. By the GumbelTop trick the result is a sample without replacement from the sequence model. The effective procedure is a beam search over the (stochastically) perturbed logprobabilities for partial sequences , hence the name Stochastic Beam Search. As we use to select the top partial sequences, we can also think of as the stochastic score of the partial sequence . We formalize Stochastic Beam Search in Algorithm 1.
3.2 Relation to Beam Search
Stochastic Beam Search should not only be considered as a sampling procedure, but also as a principled way to randomize a beam search. As a naive alternative, one could run an ordinary beam search, replacing the top operation by sampling. In this scenario, at each step of the beam search we could sample without replacement from the partial sequence probabilities using the GumbelTop trick.
However, in this naive approach, for a lowprobability partial sequence to be extended to completion, it does not only need to be initially chosen, but it will need to be rechosen, independently, again with low probability, at each step of the beam search. The result is a much lower probability to sample this sequence than assigned by the model. Intuitively, we should somehow commit to a sampling ‘decision’ made at step . However, a hard commitment to generate exactly one descendant for each of the partial sequences at step would prevent generating any two sequences that share an initial partial sequence.
Our Stochastic Beam Search algorithm makes a soft commitment to a partial sequence (node in the tree) by propagating the Gumbel perturbation of the logprobability consistently down the subtree. The partial sequence will then be extended as long as its total perturbed logprobability is among the top , but will fall off the beam if, despite the consistent perturbation, another sequence is more promising.
3.3 Relation to rejection sampling
As an alternative to Stochastic Beam Search, we can draw samples without replacement by rejecting duplicates from samples drawn with replacement. However, if the domain is large and the entropy low (e.g. if there are only a few valid translations), then rejection sampling requires many samples and consequently many (expensive) model evaluations. Also, we have to estimate how many samples to draw (in parallel) or draw samples sequentially. Stochastic Beam Search executes in a single pass, and requires computation linear in the sample size and the sequence length, which (except for the beam search overhead) is equal to the computational requirement for sampling with replacement.
4 Experiments
4.1 Diverse Beam Search
In this experiment we compare Stochastic Beam Search as a principled (stochastic) alternative to Diverse Beam Search (Vijayakumar et al., 2018) in the context of neural machine translation to obtain a diverse set of translations for a single source sentence . Following the setup by Vijayakumar et al. (2018) we report both diversity as measured by the fraction of unique grams in the translations as well as mean and maximum BLEU score (Papineni et al., 2002) as an indication of the quality of the sample. The maximum BLEU score corresponds to ‘oracle performance’ reported by Vijayakumar et al. (2018), but we report the mean as well since a single good translation and completely random sentences scores high on both maximum BLEU score and diversity, while being undesirable. A good method should increase diversity without sacrificing mean BLEU score.
We compare four different sentence generations methods: Beam Search (BS), Sampling, Stochastic Beam Search (SBS) (sampling without replacement) and Diverse Beam Search with groups (DBS()) (Vijayakumar et al., 2018). For Sampling and Stochastic Beam Search, we control the diversity using the softmax temperature in Equation (5). We use , where a higher results in higher diversity. Heuristically, we also vary for computing the scores with (deterministic) Beam Search. The diversity of Diverse Beam Search is controlled by the diversity strengths parameter, which we vary between . We set the number of groups equal to the sample size , which Vijayakumar et al. (2018) reported as the best choice.
We modify the Beam Search implementation in fairseq
In Figure 2, we represent the results as curves, indicating the tradeoff between diversity and BLEU score. The points indicate datapoints and the dashed lines indicate the (averaged) minimum and maximum BLEU score. For the same diversity, Stochastic Beam Search achieves higher mean/maximum BLEU score. Looking at a certain BLEU score, we observe that Stochastic Beam Search achieves the same BLEU score as Diverse Beam Search with a significantly larger diversity. For low temperatures (), the maximum BLEU score of Stochastic Beam Search is comparable to the deterministic Beam Search, so the increased diversity does not sacrifice the best element in the sample. Note that Sampling achieves higher mean BLEU score at the cost of diversity, which may be because good translations are sampled repeatedly. However, the maximum BLEU score of both Sampling and Diverse Beam Search is lower than with Beam Search and Stochastic Beam Search.
4.2 BLEU score estimation
In our second experiment, we use sampling without replacement to evaluate the expected sentence level BLEU score for a translation given a source sentence . Although we are often interested in corpus level BLEU score, estimation of sentence level BLEU score is useful, for example when training using minibatches to directly optimize BLEU score (Ranzato et al., 2016).
We leave dependence of the BLEU score on the source sentence implicit, and write . Writing the domain of (given ) as (e.g. all possible translations), we want to estimate the following expectation:
(11) 
Under a Monte Carlo (MC) sampling with replacement scheme with size , we write as the set
(12) 
If the distribution has low entropy (for example if there are only few valid translations), then MC estimation may be inefficient since repeated samples are uninformative. We can use sampling without replacement as an improvement, but we need to use importance weights to correct for the changed sampling probabilities. Using the GumbelTop trick, we can implement an estimator equivalent to the estimator described by Vieira (2017), derived from priority sampling (Duffield et al., 2007):
(13) 
Here is the th largest element of , which can be considered the empirical threshold for the GumbelTop trick (since if ), and we define .
If we would assume a fixed threshold and variably sized sample , then and is a standard importance weight.
Surprisingly, using a fixed sample size (and empirical threshold ) also yields in an unbiased estimator, and we include a proof adapted from Duffield et al. (2007) and Vieira (2017) in Appendix D. To obtain , we need to sacrifice the last sample
Empirically, the estimator (13) has high variance, and in practice it is preferred to normalize the importance weights by (Hesterberg, 1988):
(14) 
The estimator (14) is biased but consistent: in the limit we sample the entire domain, so we have empirical threshold and and , such that (14) is equal to (11).
We have to take care computing the importance weights as depending on the entropy the terms in the quotient can become very small, and in our case the computation of can suffer from catastrophic cancellation. For details, see Appendix C.
Because the model is not trained to use its own predictions as input, at test time errors can accumulate. As a result, when sampling with the default temperature , the expected BLEU score is very low (below 10). To improve quality of generated sentences we use lower temperatures and experiment with . We then use different methods to estimate the BLEU score:

Monte Carlo (MC), using Equation (12).

Beam Search (BS), where we compute a deterministic beam (the temperature affects the scoring) and compute . This is not a statistical estimator, but a lower bound to the target (11) which serves as a validation of the implementation and gives insight on how many sequences we need at least to capture most of the mass in (11). We also compute the normalized version , which can heuristically be considered as a ‘determinstic estimate’.
In Figure 3 we show the results of computing each estimate 100 times (BS only once as it is deterministic) for three different sentences
4.3 Conditional Entropy Estimation
Additionally to estimating the BLEU score we use such that Equation (11) becomes the model entropy (conditioned on the source sentence ):
Entropy estimation is useful in optimization settings where we want to include an entropy loss to ensure diversity. It is a different problem than BLEU score estimation as high BLEU score (for a good model) correlates positively with model probability, while for entropy rare contribute the largest terms . We use the same experimental setup as for the BLEU score and present the results in Figure 4. The results are similar to the BLEU score experiment: the normalized SBS estimate has significantly lower variance than MC for while for , results are similar. This shows that Stochastic Beam Search can be used to construct practical statistical estimators.
5 Related work
5.1 Sampling and the GumbelMax trick
The idea of sampling by solving optimization problems has been used for various purposes (Papandreou & Yuille, 2011; Hazan & Jaakkola, 2012; Tarlow et al., 2012; Ermon et al., 2013; Maddison et al., 2014; Chen & Ghahramani, 2016; Balog et al., 2017), but to our knowledge this idea has not been used for sampling without replacement.
Most closely related to our work is the work by Maddison et al. (2014), who note that the GumbelMax trick (Gumbel, 1954) can be applied implicitly and generalize it to continuous distributions, using an search to implicitly find the maximum of a Gumbel process. In this work, we extend the idea of topdown sampling to efficiently draw multiple samples without replacement from a factorized distribution (with possibly exponentially large domain) by implicitly applying the GumbelTop trick. This is a new and practical sampling method.
The blog post by Vieira (2014) describes the relation of the GumbelTop trick (as we call it) to Weighted Reservoir Sampling (Efraimidis & Spirakis, 2006), which is an algorithm for drawing weighted samples without replacement from a stream but also yields a practical parallel algorithm for sampling without replacement. In this setting, there is a fixed probability for each element , but no parametric (sequence) model. The implication that the GumbelTop trick can be used for sampling without replacement is not widely known
The GumbelMax trick has also been used to define relaxations of the categorical distribution (Maddison et al., 2016; Jang et al., 2016), which can be reparameterized for lowvariance but biased gradient estimators and can also be used for training sequence models (Gu et al., 2018). We think our work is a step in the direction to improve these methods in the context of sequence models.
5.2 Beam search
Beam search is widely used for approximate inference in various domains such as machine translation (Sutskever et al., 2014; Bahdanau et al., 2015; Ranzato et al., 2016; Vaswani et al., 2017; Gehring et al., 2017), image captioning (Vinyals et al., 2015b), speech recognition (Graves et al., 2013) and other structured prediction settings (Vinyals et al., 2015a; Weiss et al., 2015). Although typically a testtime procedure, there are works that include beam search in the training loop (Daumé et al., 2009; Wiseman & Rush, 2016; Edunov et al., 2018b; Negrinho et al., 2018; Edunov et al., 2018a) for training sequence models on the sequence level (Ranzato et al., 2016; Bahdanau et al., 2017). Many variants of beam search have been developed, such as a continuous relaxation (Goyal et al., 2018), diversity encouraging variants (Li et al., 2016; Shao et al., 2017; Vijayakumar et al., 2018) or using modifications such as lengthnormalization (Wu et al., 2016) or simply applying noise to the output (Edunov et al., 2018a). Our Stochastic Beam Search is a principled alternative that shares some of the benefits of these heuristic variants, such as the ability to control diversity or produce randomized output.
6 Discussion
We introduced Stochastic Beam Search: an algorithm that is easy to implement on top of a beam search as a way to sample sequences without replacement. This algorithm relates sampling and beam search, combining advantages of these two methods. Our experiments support the idea that it can be used as a dropin replacement in places where sampling or beam search is used. In fact, our experiments show Stochastic Beam Search can be used to yield lowervariance estimators and highdiversity samples from a neural machine translation model. In future work, we plan to leverage the probabilistic interpretation of beam search to develop new beam search related statistical learning methods.
Appendix A Proof of the GumbelTop trick
Theorem 1.
For , let . Then is an (ordered) sample without replacement from the distribution, e.g. for a realization it holds that
(15) 
where is the domain (without replacement) for the th sampled element.
Proof.
First note that
(16)  
(17)  
(18) 
The step from (16) to (17) follows from the independence of the and (Section 2.3) and the step from (17) to (18) uses the GumbelMax trick. The proof follows by induction on . The case is the GumbelMax trick, while if we assume the result (15) proven for , then
(19)  
In (19) we have used Equation (18) and Equation (15) for by induction. ∎
Appendix B Sampling set of Gumbels with maximum
b.1 The truncated Gumbel distribution
A random variable has a truncated Gumbel distribution with location and maximum (e.g. ) with CDF if:
(20) 
The inverse CDF is:
(21) 
b.2 Sampling set of Gumbels with maximum
In order to sample a set of Gumbel variables , e.g. with their maximum being exactly , we can first sample the , and then sample the Gumbels conditionally on both the and :

Sample . We do not need to condition on since the is independent of the (Section 2.3).

Set , since this follows from conditioning on the and .

Sample for . This works because, conditioning on the and , it holds that:
Equivalently, we can let , let and define
(22) 
Here we have used (20) and (21). Since the transformation (22) is monotonically increasing, it preserves the and it follows from the GumbelMax trick (3) that
We can think of this as using the GumbelMax trick for step 1 (sampling the argmax) in the sampling process described above. Additionally, for :
so here we recover step 2 (setting ). For it holds that:
This means that , so this is equivalent to step 3 (sampling for ).
b.3 Numeric stability of truncated Gumbel computation
Direct computation of (22) can be unstable as large terms need to be exponentiated. Instead, we compute:
(23)  
(24) 
where we have defined
This is equivalent as
The first step can be easily verified by considering the cases and . and can be computed accurately using and (Mächler, 2012):