Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference Using a Delta Posterior
Abstract
Although neural machine translation models have reached high translation quality, their autoregressive nature makes inference difficult to parallelize and leads to high translation latency. Inspired by recent refinement-based approaches, we propose a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure. In contrast to existing approaches, we use a deterministic inference algorithm to find the target sequence that maximizes the lower bound of the log-probability. During inference, the length of the translation automatically adapts itself. Our experiments show that the lower bound can be greatly increased by running the inference algorithm, resulting in significantly improved translation quality. Our proposed model closes the performance gap between non-autoregressive and autoregressive approaches on the ASPEC Ja-En dataset with 8.6x faster decoding. On the WMT’14 En-De dataset, our model narrows the gap with the autoregressive baseline to 2.0 BLEU points with a 12.5x speedup. By decoding multiple initial latent variables in parallel and rescoring with an autoregressive teacher model, the proposed model further brings the gap down to 1.0 BLEU point on the WMT’14 En-De task with a 6.8x speedup.
1 Introduction
The field of Neural Machine Translation (NMT) has seen significant improvements in recent years [2, 25, 7, 23]. Despite impressive improvements in translation accuracy, the autoregressive nature of NMT models has made it difficult to speed up decoding with parallel model architectures and hardware accelerators. This has sparked interest in non-autoregressive NMT models, which predict all target tokens in parallel. Beyond the obvious decoding efficiency, non-autoregressive text generation is appealing as it does not suffer from exposure bias and suboptimal inference.
Inspired by recent work in non-autoregressive NMT using discrete latent variables [11] and iterative refinement [16], we introduce a sequence of continuous latent variables to capture the uncertainty in the target sentence. We motivate such a latent-variable model by conjecturing that it is easier to refine low-dimensional continuous variables (we use 8-dimensional latent variables in our experiments) than to refine high-dimensional discrete variables, as done in Lee et al. [16]. Unlike Kaiser et al. [11], the posterior and the prior can be jointly trained to maximize the evidence lower bound of the log-likelihood $\log p(y \mid x)$.
In this work, we propose a deterministic iterative algorithm to refine the approximate posterior over the latent variables and obtain better target predictions. During inference, we first initialize the latent variables from the prior distribution $p(z \mid x)$ and obtain an initial guess of the target sentence from the conditional distribution $p(y \mid x, z)$. We then alternate between updating the latent variables, with the help of the approximate posterior $q(z \mid x, y)$, and updating the target tokens. We avoid stochasticity at inference time by introducing a delta posterior over the latent variables. We empirically find that this iterative algorithm significantly improves the lower bound and results in better BLEU scores. By refining the latent variables instead of the tokens, the length of the translation can adapt dynamically throughout this procedure, unlike in previous approaches where the target length was fixed throughout the refinement process. In other words, even if the initial length prediction is incorrect, it can be corrected simultaneously with the target tokens.
Our models outperform the autoregressive baseline on the ASPEC Ja-En dataset with an 8.6x decoding speedup and bring the performance gap down to 2.0 BLEU points on WMT’14 En-De with a 12.5x decoding speedup. By decoding multiple latent variables sampled from the prior and rescoring with an autoregressive teacher model, the proposed model further narrows the performance gap on the WMT’14 En-De task to 1.0 BLEU point with a 6.8x speedup. The contributions of this work can be summarized as follows:
- We propose a continuous latent-variable non-autoregressive NMT model for faster inference. The model learns the same number of latent vectors as there are input tokens; a length transformation mechanism is designed to adapt the number of latent vectors to match the target length.
- We demonstrate a principled inference method for this kind of model by introducing a deterministic inference algorithm. We show that the algorithm converges rapidly in practice and is capable of improving translation quality by around 2.0 BLEU points.
2 Background
Autoregressive NMT
In order to model the joint probability of the target tokens $y = (y_1, \dots, y_{L_y})$ given the source sentence $x$, most NMT models use an autoregressive factorization of the joint probability, which has the following form:
$$p(y \mid x) = \prod_{t=1}^{L_y} p(y_t \mid y_{<t}, x), \qquad (1)$$
where $y_{<t}$ denotes the target tokens preceding $y_t$. Here, the probability of emitting each token, $p(y_t \mid y_{<t}, x)$, is parameterized with a neural network.
To obtain a translation from this model, one could predict target tokens sequentially by greedily taking the argmax of the token prediction probabilities. The decoding process ends when the “</s>” token, which indicates the end of a sequence, is selected. In practice, however, this greedy approach yields suboptimal sentences, and beam search is often used to decode better translations by maintaining multiple hypotheses. However, decoding with a large beam size significantly decreases translation speed.
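For concreteness, the sketch below shows how such sequential greedy decoding might look in PyTorch. The `model(src, tgt_prefix)` interface returning next-token logits is an assumption made only for illustration, not the implementation used in this work.

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=256):
    """Greedy decoding for an autoregressive NMT model (illustrative sketch).

    Assumes `model(src, tgt_prefix)` returns next-token logits of shape
    [batch, vocab] for the last position of the prefix.
    """
    tgt = torch.full((src.size(0), 1), bos_id, dtype=torch.long, device=src.device)
    for _ in range(max_len):
        logits = model(src, tgt)               # [batch, vocab]
        next_tok = logits.argmax(dim=-1, keepdim=True)
        tgt = torch.cat([tgt, next_tok], dim=1)
        if (next_tok == eos_id).all():         # stop once every sentence emitted </s>
            break
    return tgt
```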
Non-Autoregressive NMT
Although autoregressive models achieve high translation quality through recent advances in NMT, their main drawback is that autoregressive modeling prevents the decoding algorithm from selecting tokens at multiple positions simultaneously. This results in inefficient use of computational resources and increased translation latency.
In contrast, non-autoregressive NMT models predict target tokens without depending on preceding tokens, as described by the following factorization:
$$p(y \mid x) = \prod_{t=1}^{L_y} p(y_t \mid x). \qquad (2)$$
As the prediction of each target token now depends only on the source sentence and its location in the sequence, the translation process can be easily parallelized. We obtain a target sequence by applying argmax to all token probabilities.
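As an illustration, the following minimal sketch decodes every position in parallel; the `model(src, tgt_len)` interface returning logits for all positions in a single forward pass is an assumption for illustration.

```python
import torch

def nonautoregressive_decode(model, src, tgt_len):
    """One-shot decoding: every position is predicted independently and in parallel.

    Assumes `model(src, tgt_len)` returns token logits of shape
    [batch, tgt_len, vocab] in a single forward pass.
    """
    logits = model(src, tgt_len)      # [batch, tgt_len, vocab]
    return logits.argmax(dim=-1)      # argmax applied to all positions at once
```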
The main challenge of non-autoregressive NMT is capturing the dependencies among target tokens. As the probability of each target token does not depend on the surrounding tokens, applying argmax at each position can easily result in an inconsistent sequence that includes duplicated or missing words. It is thus important for non-autoregressive models to apply techniques that ensure the consistency of the generated words.
3 Latent-Variable Non-Autoregressive NMT
In this work, we propose a latent-variable non-autoregressive NMT model by introducing a sequence of continuous latent variables $z = (z_1, \dots, z_{L_x})$ to model the uncertainty about the target sentence. These latent variables are constrained to have the same length as the source sequence. Instead of directly maximizing the objective function in Eq. (2), we maximize a lower bound on the marginal log-probability $\log p(y \mid x)$:
$$\log p(y \mid x) \ge \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, y) \,\|\, p_\omega(z \mid x)\big), \qquad (3)$$
where $p_\omega(z \mid x)$ is the prior, $q_\phi(z \mid x, y)$ is an approximate posterior and $p_\theta(y \mid x, z)$ is the decoder. The objective function in Eq. (3) is referred to as the evidence lower bound (ELBO). As shown in the equation, the lower bound is parameterized by three sets of parameters: $\omega$, $\phi$ and $\theta$.
Both the prior and the approximate posterior are modeled as spherical Gaussian distributions. The model can be trained end-to-end with the reparameterization trick [13].
A Modified Objective Function with Length Prediction
During training, we want the model to maximize the lower bound in Eq. (3). However, to generate a translation, the target length has to be predicted first. We let the latent variables model the target length by parameterizing the decoder as:
$$p_\theta(y \mid x, z) = \sum_{l} p_\theta(y \mid x, z, l)\, p_L(l \mid z) = p_\theta(y \mid x, z, L_y)\, p_L(L_y \mid z). \qquad (4)$$
Here $L_y$ denotes the length of $y$. The second step is valid as the probability $p_\theta(y \mid x, z, l)$ is always zero for any $l \neq L_y$. Plugging Eq. (4) into Eq. (3), with the independence assumption on both the latent variables and the target tokens, the objective takes the following form:
$$\mathbb{E}_{z \sim q_\phi(z \mid x, y)}\Big[\sum_{t=1}^{L_y} \log p_\theta(y_t \mid x, z, L_y) + \log p_L(L_y \mid z)\Big] - \sum_{i=1}^{L_x} D_{\mathrm{KL}}\big(q_\phi(z_i \mid x, y) \,\|\, p_\omega(z_i \mid x)\big). \qquad (5)$$
Model Architecture
As evident from Eq. (5), there are four parameterized components in our model: the prior $p_\omega(z \mid x)$, the approximate posterior $q_\phi(z \mid x, y)$, the decoder $p_\theta(y \mid x, z, L_y)$ and the length predictor $p_L(L_y \mid z)$. The architecture of the proposed non-autoregressive model is depicted in Fig. 1; it reuses modules of the Transformer [23] to compute the aforementioned distributions.
Main Components
To compute the prior $p_\omega(z \mid x)$, we use a multi-layer self-attention encoder which has the same structure as the Transformer encoder. In each layer, a feed-forward computation is applied after the self-attention. To obtain the probability, we apply a linear transformation to reduce the dimensionality and compute the mean and variance vectors.
For the approximate posterior $q_\phi(z \mid x, y)$, as it is a function of both the source $x$ and the target $y$, we first encode the target $y$ with a self-attention encoder. Then, the resulting vectors are fed into an attention-based decoder initialized with the source embeddings. Its architecture is similar to the Transformer decoder except that no causal mask is used. Similarly to the prior, we apply a linear layer to obtain the mean and variance vectors.
To backpropagate the loss signal of the decoder to $\phi$, we apply the reparameterization trick to sample from $q_\phi(z \mid x, y)$ with $z = \mu_\phi + \epsilon \odot \sigma_\phi$. Here, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.
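A minimal sketch of this sampling step, assuming the posterior network outputs a mean and a log-variance vector for each latent position:

```python
import torch

def reparameterized_sample(mean, log_var):
    """Sample z ~ N(mean, diag(exp(log_var))) with the reparameterization trick.

    The Gaussian noise is the only source of randomness, so gradients can flow
    through `mean` and `log_var` back into the posterior network.
    """
    eps = torch.randn_like(mean)                  # epsilon ~ N(0, I)
    return mean + eps * torch.exp(0.5 * log_var)  # z = mu + eps * sigma
```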
The decoder computes the probability of outputting the target tokens given the latent variables sampled from $q_\phi(z \mid x, y)$. The computational graph of the decoder is also similar to the Transformer decoder, but without a causal mask. To incorporate the information from the source tokens, we reuse the encoder representations created when computing the prior.
Length Prediction and Transformation
Given a latent variable $z$ sampled from the approximate posterior $q_\phi(z \mid x, y)$, we train a length prediction model $p_L(L_y \mid z)$. We train the model to predict the length difference between $L_y$ and $L_x$. In our implementation, $p_L$ is modeled as a categorical distribution that covers length differences within a fixed range. The prediction is produced by applying a softmax after a linear transformation.
As the latent variable $z$ has length $L_x$, we need to transform it into $L_y$ vectors for the decoder to predict the target tokens. We use a monotonic location-based attention for this purpose, which is illustrated in Fig. 2. Let the resulting vectors of the length transformation be $\bar{z}_1, \dots, \bar{z}_{L_y}$. We produce each vector with
$$\bar{z}_j = \sum_{i=1}^{L_x} w_{ij}\, z_i, \qquad (6)$$
$$w_{ij} = \frac{\exp(a_{ij})}{\sum_{i'=1}^{L_x} \exp(a_{i'j})}, \qquad (7)$$
$$a_{ij} = -\frac{1}{\sigma^2}\Big(i - \frac{L_x}{L_y}\, j\Big)^2, \qquad (8)$$
where each transformed vector $\bar{z}_j$ is a weighted sum of the latent variables. The weight is computed with a softmax over distance-based logits, giving higher weight to the latent variables close to the location $\frac{L_x}{L_y} j$. The scale $\sigma$ is the only trainable parameter in this monotonic attention mechanism.
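The PyTorch sketch below implements a length transformation of this form, following the reconstructed Eqs. (6)-(8); the exact scaling of the distance-based logits in the original implementation may differ, and the tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LengthTransform(nn.Module):
    """Monotonic location-based attention mapping L_x latent vectors to L_y vectors.

    Each output position j attends to source positions i with weights peaked
    around i ~ (L_x / L_y) * j; the scale sigma is the only trainable parameter.
    """

    def __init__(self):
        super().__init__()
        # Trainable scale, kept in the log domain so sigma stays positive.
        self.log_sigma = nn.Parameter(torch.zeros(1))

    def forward(self, z, tgt_len):
        # z: [batch, src_len, dim]; tgt_len: Python int L_y
        batch, src_len, _ = z.shape
        i = torch.arange(src_len, dtype=torch.float32, device=z.device)   # source positions
        j = torch.arange(tgt_len, dtype=torch.float32, device=z.device)   # target positions
        center = (src_len / tgt_len) * j                                  # expected source location per j
        logits = -((i[None, :] - center[:, None]) ** 2) / torch.exp(self.log_sigma) ** 2
        weights = torch.softmax(logits, dim=-1)                           # [tgt_len, src_len]
        return torch.einsum("ts,bsd->btd", weights, z)                    # [batch, tgt_len, dim]
```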
Training
If we train a model with the objective function in Eq. (5), the KL divergence often drops to zero from the beginning of training. This yields a degenerate model that does not use the latent variables at all, a well-known issue in variational inference called posterior collapse [4, 6, 21]. We use two techniques to address this issue. Similarly to Kingma et al. (2016), we give a budget to the KL term as
$$\mathbb{E}_{z \sim q_\phi(z \mid x, y)}\Big[\sum_{t=1}^{L_y} \log p_\theta(y_t \mid x, z, L_y) + \log p_L(L_y \mid z)\Big] - \sum_{i=1}^{L_x} \max\Big(b,\; D_{\mathrm{KL}}\big(q_\phi(z_i \mid x, y) \,\|\, p_\omega(z_i \mid x)\big)\Big), \qquad (9)$$
where $b$ is the budget of the KL divergence for each latent variable. Once the KL value drops below $b$, it will not be minimized anymore, thereby letting the optimizer focus on the reconstruction term in the original objective function. As $b$ is a critical hyperparameter, it is time-consuming to search for a good budget value. Here, we use the following annealing schedule to gradually lower the budget:
$$b_t = \begin{cases} b_0, & t \le T/2, \\[4pt] b_0 - (b_0 - b_{\mathrm{end}})\,\dfrac{2t - T}{T}, & t > T/2, \end{cases} \qquad (10)$$
where $t$ is the current training step and $T$ is the maximum number of training steps. In the first half of training, the budget remains at its initial value $b_0$. In the second half of training, we anneal the budget linearly until it reaches its final value $b_{\mathrm{end}}$.
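A small sketch of this budgeted KL term and annealing schedule is given below; the initial and final budget values are illustrative placeholders, not the values used in the paper.

```python
def kl_budget(step, max_steps, b_start=1.0, b_end=0.1):
    """Keep the budget constant for the first half of training, then anneal it
    linearly to its final value (b_start / b_end are illustrative placeholders)."""
    if step <= max_steps / 2:
        return b_start
    frac = (step - max_steps / 2) / (max_steps / 2)   # goes from 0 to 1 over the second half
    return b_start - (b_start - b_end) * frac

def budgeted_kl_term(kl_per_latent, budget):
    """Clip each latent position's KL from below at the budget, so values already
    under the budget are no longer minimized.

    kl_per_latent: tensor [batch, src_len] of per-position KL divergences.
    """
    return kl_per_latent.clamp(min=budget).sum(dim=-1)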
Similarly to previous work on non-autoregressive NMT, we apply sequence-level knowledge distillation [12], using the output of an autoregressive model as the training target for our non-autoregressive model.
4 Inference with a Delta Posterior
Once the training has converged, we use an inference algorithm to find a translation $\hat{y}$ that maximizes the lower bound in Eq. (3):
$$\hat{y} = \arg\max_{y} \Big\{ \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x, y) \,\|\, p_\omega(z \mid x)\big) \Big\}.$$
It is intractable to solve this problem exactly because computing the first expectation is intractable. At training time we avoid this issue with a reparameterization-based Monte Carlo approximation. However, it is desirable to avoid stochasticity at inference time, where our goal is to present a single most likely target sentence given a source sentence.
We tackle this problem by introducing a proxy distribution $r(z)$ defined as
$$r(z) = \begin{cases} 1, & \text{if } z = \mu, \\ 0, & \text{otherwise.} \end{cases}$$
This is a Dirac measure, and we call it a delta posterior in our work. We set this delta posterior to minimize the KL divergence against the approximate posterior $q_\phi(z \mid x, y)$, which is equivalent to
$$\mu \leftarrow \mathbb{E}_{z \sim q_\phi(z \mid x, y)}\big[z\big]. \qquad (11)$$
We then use this proxy instead of the original approximate posterior to obtain a deterministic lower bound:
$$\mathbb{E}_{z \sim r(z)}\big[\log p_\theta(y \mid x, z)\big] - D_{\mathrm{KL}}\big(r(z) \,\|\, p_\omega(z \mid x)\big) = \log p_\theta(y \mid x, \mu) - D_{\mathrm{KL}}\big(r(z) \,\|\, p_\omega(z \mid x)\big).$$
As the second term is constant with respect to $y$, maximizing this lower bound with respect to $y$ reduces to
$$\hat{y} = \arg\max_{y} \log p_\theta(y \mid x, \mu), \qquad (12)$$
which can be approximately solved by beam search when $p_\theta$ is an autoregressive sequence model. If $p_\theta$ factorizes over the sequence $y$, as in our non-autoregressive model, we can solve it exactly by taking $\hat{y}_t = \arg\max_{y_t} \log p_\theta(y_t \mid x, \mu)$ at every position $t$.
With every new estimate of $y$, the approximate posterior $q_\phi(z \mid x, y)$ changes. We thus alternate between fitting the delta posterior with Eq. (11) and finding the most likely sequence with Eq. (12).
We initialize the delta posterior using the prior distribution: $\mu \leftarrow \mathbb{E}_{z \sim p_\omega(z \mid x)}\big[z\big]$.
With this initialization, the proposed inference algorithm is fully deterministic. The complete inference algorithm for obtaining the final translation is shown in Algorithm 1.
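A sketch of this alternating procedure is shown below. The `prior`, `posterior`, `length_pred`, `length_transform`, and `decoder` interfaces, as well as the single-sentence shapes, are assumptions made for illustration; Algorithm 1 in the paper is the authoritative description.

```python
import torch

@torch.no_grad()
def delta_posterior_inference(prior, posterior, length_pred, length_transform,
                              decoder, src, num_iters=4):
    """Deterministic inference sketch (cf. Algorithm 1): alternate between decoding
    the most likely target under the current delta posterior and re-fitting the
    delta posterior mean from the approximate posterior. Assumed interfaces:
      prior(src)              -> mean of p(z | x)            [src_len, dim]
      posterior(src, y)       -> mean of q(z | x, y)         [src_len, dim]
      length_pred(mu)         -> predicted target length (int)
      length_transform(mu, l) -> l vectors for the decoder   [l, dim]
      decoder(src, z_bar)     -> token logits                [l, vocab]
    """
    mu = prior(src)                                 # initialize the delta posterior with the prior mean
    y = None
    for _ in range(num_iters):
        tgt_len = length_pred(mu)                   # predict the target length from the latents
        z_bar = length_transform(mu, tgt_len)       # map L_x latents to L_y decoder inputs
        y_new = decoder(src, z_bar).argmax(dim=-1)  # exact argmax: the decoder factorizes over positions
        if y is not None and torch.equal(y_new, y):
            break                                   # converged: the prediction no longer changes
        y = y_new
        mu = posterior(src, y)                      # re-fit the delta posterior: mu <- E_q[z]
    return y
```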
5 Related Work
This work is inspired by a recent line of work in non-autoregressive NMT. Gu et al. [9] first proposed a non-autoregressive framework by modeling word alignment as a latent variable, which has since been improved by Wang et al. [24]. Lee et al. [16] proposed a deterministic iterative refinement algorithm where a decoder is trained to refine the hypotheses. Our approach is most related to Kaiser et al. [11] and Roy et al. (2018). In both works, a discrete autoencoder is first trained on the target sentence, and an autoregressive prior is then trained to predict the discrete latent variables given the source sentence. Our work differs from them in three ways: (1) we use continuous latent variables and train the approximate posterior and the prior jointly; (2) we use a non-autoregressive prior; and (3) we propose a novel iterative inference procedure in the latent space.
Concurrently to our work, Ghazvininejad et al. [8] proposed to translate with a masked-prediction language model by iteratively replacing tokens with low confidence. Insertion-based NMT models, which insert words into the translation following a specific strategy, have also been proposed (Gu et al. 2019; Stern et al. 2019; Welleck et al. 2019). Unlike these works, our approach performs refinement in the low-dimensional latent space, rather than in the high-dimensional discrete space.
Similarly to our latent-variable model, Zhang et al. (2016) proposed a variational NMT, and Shah and Barber (2018) model the joint distribution of the source and the target. Both of them use autoregressive models. Shah and Barber (2018) designed an EM-like algorithm similar to Markov sampling [1]. In contrast, we propose a deterministic algorithm that removes any non-determinism during inference.
Table 1: Results on the ASPEC Ja-En and WMT’14 En-De datasets.

                                     ASPEC Ja-En            WMT’14 En-De
                                     BLEU(%)   speedup      BLEU(%)   speedup
  Base Transformer, beam size=3      27.1      1x           26.1      1x
  Base Transformer, beam size=1      24.6      1.1x         25.6      1.3x
  Latent-Variable NAR Model          13.3      17.0x        11.8      22.2x
    + knowledge distillation         25.2      17.0x        22.2      22.2x
    + deterministic inference        27.5      8.6x         24.1      12.5x
    + latent search                  28.3      4.8x         25.1      6.8x
6 Experimental Settings
Data and preprocessing
We evaluate our model on two machine translation datasets: ASPEC Ja-En [17] and WMT’14 En-De [3]. The ASPEC dataset contains 3M sentence pairs, and the WMT’14 dataset contains 4.5M sentence pairs.
To preprocess the ASPEC dataset, we use the Moses toolkit [14] to tokenize the English sentences and Kytea [18] for the Japanese sentences. We further apply byte-pair encoding [22] to segment the training sentences into subwords. The resulting vocabulary has 40K unique tokens on each side of the language pair. To preprocess the WMT’14 dataset, we apply sentencepiece [15] to both languages to segment the corpus into subwords and build a joint vocabulary. The final vocabulary size is 32K for each language.
Learning
To train the proposed non-autoregressive models, we adopt the same learning rate annealing schedule as the Base Transformer. The final model parameters are selected based on the validation ELBO value.
The only new hyperparameter in the proposed model is the dimensionality of each latent variable. If each latent is a high-dimensional vector, it has higher capacity, but the KL divergence in Eq. (3) becomes difficult to minimize. In practice, we found that latent dimensionalities between 4 and 32 result in similar performance, whereas significantly higher or lower values cause a performance drop. In all experiments, we set the latent dimensionality to 8. We use a hidden size of 512 and a feed-forward filter size of 2048 for all models in our experiments. We use 6 Transformer layers for the prior and the decoder, and 3 Transformer layers for the approximate posterior.
Evaluation
We evaluate tokenized BLEU for the ASPEC Ja-En dataset. For the WMT’14 En-De dataset, we use SacreBLEU [20] to evaluate the translation results. We follow Lee et al. [16] in removing repetitions from the translation results before evaluating BLEU scores.
Latent Search
To further exploit the parallel computation ability of GPUs, we sample multiple initial latent variables from the prior $p_\omega(z \mid x)$. We then perform the deterministic inference on each latent variable to obtain a list of candidate translations. However, we cannot afford to evaluate each candidate with Eq. (5), which requires importance sampling of the latent variables. Instead, we use the autoregressive baseline model to score all the candidates and pick the one with the highest log-probability. Following Parmar et al. (2018), we reduce the temperature when sampling latent variables, which results in better translation quality. To avoid stochasticity, we fix the random seed during sampling.
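The sketch below outlines this latent search procedure; the `prior_sampler`, `refine_and_decode`, and `teacher_score` interfaces are assumptions for illustration, and the fixed seed mirrors the deterministic-sampling detail above.

```python
import torch

@torch.no_grad()
def latent_search(prior_sampler, refine_and_decode, teacher_score, src, num_candidates=50):
    """Parallel latent search sketch: sample several initial latents from the prior,
    run the deterministic inference on each, then pick the candidate the
    autoregressive teacher scores highest. Assumed interfaces:
      prior_sampler(src, n)      -> n sampled latent sequences from p(z | x)
      refine_and_decode(src, z)  -> a candidate translation for one latent sample
      teacher_score(src, y)      -> log-probability of y under the teacher model
    """
    torch.manual_seed(0)                              # fix the seed to keep inference deterministic
    latents = prior_sampler(src, num_candidates)      # sampled with a lowered temperature in the paper
    candidates = [refine_and_decode(src, z) for z in latents]
    scores = torch.stack([teacher_score(src, y) for y in candidates])
    return candidates[scores.argmax().item()]
```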
7 Result and Analysis
Quantitative Analysis
Our quantitative results on both datasets are presented in Table 1. The baseline model in our experiments is a base Transformer. Our implementation of the autoregressive baseline is 1.0 BLEU points lower than the original paper [23] on the WMT’14 En-De dataset. For all models, we measure the latency of decoding each sentence on a single NVIDIA V100 GPU, averaged over all test samples.
As shown in Table 1, without knowledge distillation, we observe a significant gap in translation quality compared to the autoregressive baseline. This observation is in line with previous findings on non-autoregressive NMT [10, 16, 24]. The gap is significantly reduced by knowledge distillation, as the translation targets provided by the autoregressive model are easier to predict.
With the proposed deterministic inference algorithm, we significantly improve translation quality by 2.3 BLEU points on the ASPEC Ja-En dataset and 1.9 BLEU points on the WMT’14 En-De dataset. Here, we run the algorithm for only one step. Running more iterative steps further improves the ELBO, but the gain is not reflected in the BLEU scores. As a result, we outperform the autoregressive baseline on the ASPEC dataset with a speedup of 8.6x. For the WMT’14 dataset, although the proposed model reaches a speedup of 12.5x, a gap of 2.0 BLEU points with the autoregressive baseline remains. We conjecture that WMT’14 En-De is more difficult for our non-autoregressive model as it contains a high degree of noise [19].
By searching over multiple initial latent variables and rescoring with the teacher Transformer model, we observe a further increase in BLEU at the cost of slower translation speed. In our experiments, we sample 50 candidate latent variables and decode them in parallel; the slowdown is mainly caused by rescoring. With the help of rescoring, our final model further narrows the performance gap with the autoregressive baseline to 1.0 BLEU point with a 6.8x speedup on the WMT’14 En-De task.
Non-autoregressive NMT Models
In Table 2, we list the results on WMT’14 En-De reported by existing non-autoregressive NMT approaches. All the models use the Transformer as their autoregressive baseline. In comparison, our proposed model suffers a drop of only 1.0 BLEU points relative to its baseline, which is a relatively small gap among the existing models. Thanks to the rapid convergence of the proposed deterministic inference algorithm, our model achieves a higher speedup than other refinement-based models and provides a better speed-accuracy trade-off.
Concurrently to our work, the mask-prediction language model [8] was found to reduce the performance gap down to 0.9 BLEU points on WMT’14 En-De while still maintaining a reasonable speedup. The main difference is that we update a delta posterior over latent variables instead of over target tokens. Both Ghazvininejad et al. [8] and Wang et al. [24] (with autoregressive rescoring) decode multiple candidates in a batch and pick one final translation from them. As our proposal is orthogonal to BERT-style training [5], investigating their combination is an interesting future direction.
Table 2: Comparison with existing non-autoregressive NMT approaches on WMT’14 En-De. Values in parentheses are the gaps to the corresponding autoregressive baselines.

                                        BLEU(%)       Speedup
  Transformer [23]                      27.1          -
  Baseline [9]                          23.4          1x
  NAT (+FT +NPD S=100)                  19.1 (4.3)    2.3x
  Baseline [16]                         24.5          1x
  Adaptive NAR Model                    21.5 (3.0)    1.9x
  Baseline [11]                         23.5          1x
  LT, Improved Semhash                  19.8 (3.7)    3.8x
  Baseline [24]                         27.3          1x
  NAT-REG, no rescoring                 20.6 (6.7)    27.6x
  NAT-REG, autoregressive rescoring     24.6 (2.7)    15.1x
  Baseline [8]                          27.8          1x
  CMLM with 4 iterations                26.0 (1.8)    -
  CMLM with 10 iterations               26.9 (0.9)    23x
  Baseline (Ours)                       26.1          1x
  NAR with deterministic inference      24.1 (2.0)    12.5x
    + latent search                     25.1 (1.0)    6.8x
Analysis of Deterministic Inference
Convergence of ELBO and BLEU
In this section, we empirically show that the proposed deterministic iterative inference improves the ELBO in Eq. (3). As the ELBO is a function of the source $x$ and the target $y$, we measure the ELBO value with the new target prediction after each iteration during inference. For each instance, we sample 20 latent variables to compute the expectation in Eq. (3). The ELBO value is further averaged over data samples.
In Fig. 3, we show the ELBO value and the resulting BLEU scores for both datasets. In the initial step, the delta posterior is initialized with the prior distribution $p_\omega(z \mid x)$. We see that the ELBO value increases rapidly with the iterative inference, which means a higher lower bound on $\log p(y \mid x)$. The improvement is highly correlated with the increase in BLEU scores. For around 80% of the data samples, the algorithm converges within three steps. We observe that the BLEU scores peak after only one iterative step.
Table 3: Example translations (ASPEC Ja-En) before and after the proposed deterministic inference.

Example 1: Sequence modified without changing length
  Source:          hyouki gensuiryou hyoujun no kakuritsu wo kokoromita. (Japanese)
  Reference:       the establishment of an optical fiber attenuation standard was attempted .
  Initial Guess:   an attempt was made establish establish damping attenuation standard ...
  After Inference: an attempt was to establish the damping attenuation standard ...

Example 2: One word removed from the sequence
  Source:          ... ‘‘sen bouchou keisu no toriatsukai’’ nitsuite nobeta. (Japanese)
  Reference:       ... handling of linear expansion coefficient .
  Initial Guess:   ... ‘‘ handling of of linear expansion coefficient ’’ are described .
  After Inference: ... ‘‘ handling of linear expansion coefficient ’’ are described .

Example 3: Four words added to the sequence
  Source:          ... maikuro manipyureshon heto hatten shite kite ori ... (Japanese)
  Reference:       ... with wide application fields so that it has been developed ...
  Initial Guess:   ... micro micro manipulation and ...
  After Inference: ... and micro manipulation , and it has been developed , and ...
Tradeoff between Quality and Speed
In Fig. 4, we show the trade-off between translation quality and speed gain on the WMT’14 En-De task when considering multiple candidate latent variables in parallel. We vary the number of candidates from 10 to 100 and report BLEU scores and relative speed gains in the scatter plot. The results are divided into two groups. The first group of experiments searches over multiple latent variables and rescores with the teacher Transformer. The second group additionally applies the proposed deterministic inference before rescoring.
We observe that the proposed deterministic inference consistently improves translation quality in all settings. The BLEU score peaks at 25.2 as the number of candidates grows large. As GPUs excel at processing massive computations in parallel, translation speed degrades only by a small margin when decoding fewer than 50 candidates.
Qualitative Analysis
We present some example translations in Table 3 to demonstrate the effect of the proposed iterative inference. In the first example, the length of the target sequence does not change; only the tokens are replaced over the refinement iterations. The second and third examples show that the algorithm removes words from or inserts words into the sequence during the iterative inference by adaptively changing the target length. Such significant modifications to the predicted sequence mostly happen when translating long sentences.
For some test examples, however, we still find duplicated words in the final translation after applying the proposed deterministic inference. For these, we notice that the quality of the initial translation guess is considerably worse than average and typically contains multiple duplicated words. As the decoder is trained to reconstruct the sequence given to the approximate posterior $q_\phi$, it is not expected to drastically modify the target prediction. Thus, a high-quality initial guess is crucial for obtaining good translations.
8 Conclusion
Our work presents the first approach to use a continuous latent-variable model for non-autoregressive Neural Machine Translation. The key idea is to introduce a sequence of latent variables to capture the uncertainty in the target sentence. The number of latent vectors is always identical to the number of input tokens. A length transformation mechanism is then applied to adapt the latent vectors to match the target length. We train the proposed model by maximizing the lower bound of the log-probability $\log p(y \mid x)$.
We then introduce a deterministic inference algorithm that uses a delta posterior over the latent variables. The algorithm alternates between updating the delta posterior and the target tokens. Our experiments show that the algorithm rapidly improves the evidence lower bound of the predicted target sequence; in our experiments, the BLEU scores converge after only one iteration. Despite its effectiveness, the algorithm is easy to implement.
Our non-autoregressive NMT model closes the performance gap with the autoregressive baseline on the ASPEC Ja-En task with an 8.6x speedup, and reduces the gap on the WMT’14 En-De task to 2.0 BLEU points with a 12.5x speedup. By decoding multiple latent variables sampled from the prior, our model outperforms the baseline by 1.2 BLEU points on the ASPEC Ja-En task with a 4.8x speedup, and brings the gap on the WMT’14 En-De task down to 1.0 BLEU point with a speedup of 6.8x.
When decoding multiple latent variables, a teacher model is essential, as the latent-variable framework does not provide a way to correctly evaluate candidate translations on its own. The teacher model typically takes 15 ms to compute. Future work that enables rescoring without the help of an external model may further improve the decoding speed.
References
[1] (2017) Improving sampling from generative autoencoders with Markov chains. CoRR abs/1610.09296.
[2] (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
[3] (2014) Findings of the 2014 Workshop on Statistical Machine Translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58.
[4] (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
[5] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[6] (2018) Avoiding latent variable collapse with generative skip models. CoRR abs/1807.04863.
[7] (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122.
[8] (2019) Constant-time machine translation with conditional masked language models. CoRR abs/1904.09324.
[9] (2018) Non-autoregressive neural machine translation. In Proceedings of the International Conference on Learning Representations 2018.
[10] (2018) Non-autoregressive neural machine translation. CoRR abs/1711.02281.
[11] (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
[12] (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327.
[13] (2014) Auto-encoding variational Bayes. CoRR abs/1312.6114.
[14] (2007) Moses: open source toolkit for statistical machine translation. In ACL.
[15] (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
[16] (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182.
[17] (2016) ASPEC: Asian Scientific Paper Excerpt Corpus. In LREC.
[18] (2011) Pointwise prediction for robust, adaptable Japanese morphological analysis. In ACL, pp. 529–533.
[19] (2018) Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 3953–3962.
[20] (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
[21] (2019) Preventing posterior collapse with delta-VAEs. CoRR abs/1901.03416.
[22] (2016) Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725.
[23] (2017) Attention is all you need. In NIPS.
[24] (2019) Non-autoregressive machine translation with auxiliary regularization. CoRR abs/1902.10245.
[25] (2016) Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.