Latent-Variable Non-Autoregressive Neural Machine Translation with Deterministic Inference using a Delta Posterior
Although neural machine translation models have reached high translation quality, their autoregressive nature makes inference difficult to parallelize and leads to high translation latency. Inspired by recent refinement-based approaches, we propose a latent-variable non-autoregressive model with continuous latent variables and a deterministic inference procedure. In contrast to existing approaches, we use a deterministic inference algorithm to find the target sequence that maximizes the lowerbound to the log-probability. During inference, the length of the translation automatically adapts itself. Our experiments show that the lowerbound can be greatly increased by running the inference algorithm, resulting in significantly improved translation quality. Our proposed model closes the performance gap between non-autoregressive and autoregressive approaches on the ASPEC Ja-En dataset with 8.6x faster decoding. On the WMT’14 En-De dataset, our model narrows the gap with the autoregressive baseline to 2.0 BLEU points with a 12.5x speedup. By decoding multiple initial latent variables in parallel and rescoring with a teacher model, the proposed model further brings the gap down to 1.0 BLEU point on the WMT’14 En-De task with a 6.8x speedup.
The field of Neural Machine Translation (NMT) has seen significant improvements in recent years [2, 25, 7, 23]. Despite impressive improvements in translation accuracy, the autoregressive nature of NMT models has made it difficult to speed up decoding by utilizing parallel model architectures and hardware accelerators. This has sparked interest in non-autoregressive NMT models, which predict all target tokens in parallel. In addition to the obvious decoding efficiency, non-autoregressive text generation is appealing as it does not suffer from exposure bias and suboptimal inference.
Inspired by recent work in non-autoregressive NMT using discrete latent variables [11] and iterative refinement [16], we introduce a sequence of continuous latent variables to capture the uncertainty in the target sentence. We motivate such a latent variable model by conjecturing that it is easier to refine lower-dimensional continuous variables (we use 8-dimensional latent variables in our experiments) than to refine high-dimensional discrete variables, as done in Lee et al. (2018). Unlike Kaiser et al. (2018), the posterior and the prior can be jointly trained to maximize the evidence lowerbound of the log-likelihood $\log p(y|x)$.
In this work, we propose a deterministic iterative algorithm to refine the approximate posterior over the latent variables and obtain better target predictions. During inference, we first obtain the initial posterior from a prior distribution $p(z|x)$ and the initial guess of the target sentence from the conditional distribution $p(y|x,z)$. We then alternate between updating the approximate posterior and the target tokens with the help of an approximate posterior $q(z|x,y)$. We avoid stochasticity at inference time by introducing a delta posterior over the latent variables. We empirically find that this iterative algorithm significantly improves the lowerbound and results in better BLEU scores. By refining the latent variables instead of tokens, the length of the translation can dynamically adapt throughout this procedure, unlike previous approaches where the target length was fixed throughout the refinement process. In other words, even if the initial length prediction is incorrect, it can be corrected simultaneously with the target tokens.
Our models outperform the autoregressive baseline on the ASPEC Ja-En dataset with an 8.6x decoding speedup and bring the performance gap down to 2.0 BLEU points on WMT’14 En-De with a 12.5x decoding speedup. By decoding multiple latent variables sampled from the prior and rescoring with an autoregressive teacher model, the proposed model is able to further narrow the performance gap on the WMT’14 En-De task down to 1.0 BLEU point with a 6.8x speedup. The contributions of this work can be summarized as follows:
- We propose a continuous latent-variable non-autoregressive NMT model for faster inference. The model learns the same number of latent vectors as there are input tokens. A length transformation mechanism is designed to adapt the number of latent vectors to match the target length.
- We demonstrate a principled inference method for this kind of model by introducing a deterministic inference algorithm. We show that the algorithm converges rapidly in practice and is capable of improving the translation quality by around 2.0 BLEU points.
In order to model the joint probability of the target tokens $y_1, \ldots, y_{|y|}$ given the source sentence $x$, most NMT models use an autoregressive factorization of the joint probability, which has the following form:
$$\log p(y|x) = \sum_{i=1}^{|y|} \log p(y_i \mid y_{<i}, x), \tag{1}$$
where $y_{<i}$ denotes the target tokens preceding $y_i$. Here, the probability of emitting each token, $p(y_i \mid y_{<i}, x)$, is parameterized with a neural network.
To obtain a translation from this model, one could predict target tokens sequentially by greedily taking the argmax of the token prediction probabilities. The decoding process ends when a “</s>” token, which indicates the end of a sequence, is selected. In practice, however, this greedy approach yields suboptimal sentences, and beam search is often used to decode better translations by maintaining multiple hypotheses. However, decoding with a large beam size significantly decreases translation speed.
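The greedy procedure described above can be sketched in a few lines. This is only an illustration: `toy_model` and its lookup table are hypothetical stand-ins for a trained network, and the real model conditions on the source sentence and the full prefix rather than on the previous token alone.

```python
from typing import Callable, Dict, List

EOS = "</s>"

def greedy_decode(next_token_probs: Callable[[List[str]], Dict[str, float]],
                  max_len: int = 20) -> List[str]:
    """Pick the most probable token at each step until EOS is selected."""
    prefix: List[str] = []
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)  # greedy argmax over the vocabulary
        if token == EOS:
            break
        prefix.append(token)  # the next step depends on this choice
    return prefix

# Hypothetical toy model: next-token probabilities keyed by the previous token.
TOY_TABLE = {
    None: {"a": 0.6, "b": 0.3, EOS: 0.1},
    "a": {"b": 0.7, "a": 0.1, EOS: 0.2},
    "b": {EOS: 0.8, "a": 0.1, "b": 0.1},
}

def toy_model(prefix: List[str]) -> Dict[str, float]:
    return TOY_TABLE[prefix[-1] if prefix else None]

print(greedy_decode(toy_model))  # ['a', 'b']
```

Each loop iteration needs the token chosen in the previous one, which is why the steps cannot run in parallel.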
Although autoregressive models achieve high translation quality through recent advances in NMT, the main drawback is that autoregressive modeling prevents the decoding algorithm from selecting tokens in multiple positions simultaneously. This results in inefficient use of computational resources and increased translation latency.
In contrast, non-autoregressive NMT models predict target tokens without depending on preceding tokens, as described by the following objective:
$$\log p(y|x) = \sum_{i=1}^{|y|} \log p(y_i \mid x). \tag{2}$$
As the prediction of each target token now depends only on the source sentence and its location in the sequence, the translation process can be easily parallelized. We obtain a target sequence by applying argmax to all token probabilities.
The main challenge of non-autoregressive NMT is capturing dependencies among target tokens. As the probability of each target token does not depend on the surrounding tokens, applying argmax at each position may easily result in an inconsistent sequence that includes duplicated or missing words. It is thus important for non-autoregressive models to apply techniques that ensure the consistency of the generated words.
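A small worked example may make the inconsistency concrete. The joint distribution below is invented for illustration (three hypothetical German renderings of "thank you"); it is not taken from any dataset or model in this paper. Taking the argmax of each position's marginal independently yields a sequence that the joint distribution itself assigns zero probability.

```python
from collections import defaultdict

# Hypothetical joint distribution over two-token translations.
joint = {
    ("vielen", "dank"): 0.4,
    ("danke", "schoen"): 0.3,
    ("danke", "sehr"): 0.3,
}

def positionwise_argmax(joint_probs):
    """Choose each position's most probable token from its marginal alone."""
    length = len(next(iter(joint_probs)))
    output = []
    for i in range(length):
        marginal = defaultdict(float)
        for sequence, prob in joint_probs.items():
            marginal[sequence[i]] += prob
        output.append(max(marginal, key=marginal.get))
    return tuple(output)

prediction = positionwise_argmax(joint)
print(prediction, joint.get(prediction, 0.0))  # ('danke', 'dank') 0.0
```

The first position favours "danke" (0.6 in total) and the second favours "dank" (0.4), so the combined output mixes two different translations.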
3 Latent-Variable Non-Autoregressive NMT
In this work, we propose a latent-variable non-autoregressive NMT model by introducing a sequence of continuous latent variables $z$ to model the uncertainty about the target sentence. These latent variables are constrained to have the same length as the source sequence, that is, $|z| = |x|$. Instead of directly maximizing the objective function in Eq. (2), we maximize a lowerbound to the marginal log-probability $\log p(y|x) = \log \int p(y|z,x)\, p(z|x)\, dz$:
$$\mathcal{L}(\omega, \phi, \theta) = \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(y|x,z)\big] - \mathrm{KL}\big[q_\phi(z|x,y) \,\|\, p_\omega(z|x)\big], \tag{3}$$
where $p_\omega(z|x)$ is the prior, $q_\phi(z|x,y)$ is an approximate posterior and $p_\theta(y|x,z)$ is the decoder. The objective function in Eq. (3) is referred to as the evidence lowerbound (ELBO). As shown in the equation, the lowerbound is parameterized by three sets of parameters: $\omega$, $\phi$ and $\theta$.
Both the prior $p_\omega$ and the approximate posterior $q_\phi$ are modeled as spherical Gaussian distributions. The model can be trained end-to-end with the reparameterization trick [13].
A Modified Objective Function with Length Prediction
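During training, we want the model to maximize the lowerbound in Eq. (3). However, to generate a translation, the target length $l_y$ has to be predicted first. We let the latent variables model the target length by parameterizing the decoder as:
$$p_\theta(y|x,z) = \sum_{l} p_\theta(y, l \mid x, z) = p_\theta(y, l_y \mid x, z) = p_\theta(y \mid x, z, l_y)\, p_\theta(l_y \mid z). \tag{4}$$
Here $l_y$ denotes the length of $y$. The second step is valid as the probability $p_\theta(y, l \neq l_y \mid x, z)$ is always zero. Plugging in Eq. (4), with the independence assumption on both latent variables and target tokens, the objective has the following form:
$$\mathbb{E}_{z \sim q_\phi}\Big[\sum_{i=1}^{|y|} \log p_\theta(y_i \mid x, z, l_y) + \log p_\theta(l_y \mid z)\Big] - \sum_{k=1}^{|x|} \mathrm{KL}\big[q_\phi(z_k|x,y) \,\|\, p_\omega(z_k|x)\big]. \tag{5}$$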
As is evident from Eq. (5), there are four parameterized components in our model: the prior $p_\omega(z|x)$, the approximate posterior $q_\phi(z|x,y)$, the decoder $p_\theta(y|x,z,l_y)$ and the length predictor $p_\theta(l_y|z)$. The architecture of the proposed non-autoregressive model is depicted in Fig. 1, which reuses modules from the Transformer [23] to compute the aforementioned distributions.
To compute the prior $p_\omega(z|x)$, we use a multi-layer self-attention encoder which has the same structure as the Transformer encoder. In each layer, a feed-forward computation is applied after the self-attention. To obtain the probability, we apply a linear transformation to reduce the dimensionality and compute the mean and variance vectors.
For the approximate posterior $q_\phi(z|x,y)$, as it is a function of the source $x$ and the target $y$, we first encode $y$ with a self-attention encoder. Then, the resulting vectors are fed into an attention-based decoder initialized by $x$ embeddings. Its architecture is similar to the Transformer decoder except that no causal mask is used. Similar to the prior, we apply a linear layer to obtain the mean and variance vectors.
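To backpropagate the loss signal of the decoder to $q_\phi$, we apply the reparameterization trick to sample $z$ from $q_\phi$ with $z = \mu_q + \sigma_q \odot \epsilon$, where $\mu_q$ and $\sigma_q$ are the mean and standard deviation produced by the approximate posterior. Here, $\epsilon \sim \mathcal{N}(0, I)$ is Gaussian noise.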
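The two Gaussian ingredients of the training objective, the reparameterized sample and the KL term, can be sketched with plain Python lists. This is a minimal illustration that assumes per-dimension means and standard deviations; the function names are hypothetical and no gradient computation is shown.

```python
import math
import random

def reparameterize(mean, std, rng):
    """Draw z = mean + std * eps with eps ~ N(0, 1), independently per dimension."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]

def gaussian_kl(mean_q, std_q, mean_p, std_p):
    """KL[N(mean_q, std_q^2) || N(mean_p, std_p^2)], summed over dimensions."""
    total = 0.0
    for mq, sq, mp, sp in zip(mean_q, std_q, mean_p, std_p):
        total += math.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2 * sp ** 2) - 0.5
    return total

rng = random.Random(0)
z = reparameterize([0.0] * 8, [1.0] * 8, rng)   # one 8-dimensional latent vector
print(len(z), gaussian_kl([1.0], [1.0], [0.0], [1.0]))  # 8 0.5
```

Because the noise is drawn separately from the distribution parameters, the sample is a differentiable function of the mean and standard deviation, which is what lets the decoder loss reach the posterior network.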
The decoder computes the probability of outputting target tokens $y$ given the latent variables sampled from $q_\phi(z|x,y)$. The computational graph of the decoder is also similar to the Transformer decoder without a causal mask. To combine the information from the source tokens, we reuse the encoder vector representation created when computing the prior.
Length Prediction and Transformation
Given a latent variable $z$ sampled from the approximate posterior $q_\phi(z|x,y)$, we train a length prediction model $p_\theta(l_y|z)$. We train the model to predict the length difference between $|y|$ and $|x|$. In our implementation, $p_\theta(l_y|z)$ is modeled as a categorical distribution that covers length differences within a fixed range. The prediction is produced by applying softmax after a linear transformation.
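As a rough sketch of how a categorical prediction over length differences maps back to a target length, consider the following. The symmetric range `[-max_diff, +max_diff]`, the parameter names, and the clamp to a minimum length of 1 are assumptions made for this example; the logits would come from the linear layer mentioned above.

```python
import math

def predict_target_length(logits, src_len, max_diff):
    """Softmax over length-difference classes, then convert the argmax to a length.

    Class index c is assumed to stand for the difference (c - max_diff).
    """
    if len(logits) != 2 * max_diff + 1:
        raise ValueError("expected one logit per length difference")
    peak = max(logits)                        # subtract the max for stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    return max(1, src_len + best - max_diff), probs

length, probs = predict_target_length([0.0, 0.0, 0.0, 5.0, 0.0], src_len=4, max_diff=2)
print(length)  # 5
```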
As the latent variable $z$ has the length $|x|$, we need to transform the latent variables into $l_y$ vectors for the decoder to predict target tokens. We use a monotonic location-based attention for this purpose, which is illustrated in Fig. 2. Let the resulting vectors of length transformation be $\bar{z}_1, \ldots, \bar{z}_{l_y}$. We produce each vector with
$$\bar{z}_j = \sum_{k=1}^{|x|} w_k^j z_k, \qquad w_k^j = \frac{\exp(a_k^j)}{\sum_{k'=1}^{|x|} \exp(a_{k'}^j)}, \qquad a_k^j = -\frac{1}{2\sigma^2}\Big(k - \frac{|x|}{l_y} j\Big)^2,$$
where each transformed vector is a weighted sum of the latent variables. The weight is computed with a softmax over distance-based logits. We give higher weights to the latent variables close to the location $\frac{|x|}{l_y} j$. The scale $\sigma$ is the only trainable parameter in this monotonic attention mechanism.
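The length transformation can be sketched as follows. The logit form used here, a negative squared distance to the location `j * src_len / target_len` scaled by `1 / (2 * sigma**2)`, is one choice that matches the description of distance-based logits with a single scale parameter; treat it, and the function name, as assumptions of this example.

```python
import math

def length_transform(latents, target_len, sigma=1.0):
    """Map |x| latent vectors to target_len vectors by location-based attention."""
    src_len = len(latents)
    dim = len(latents[0])
    outputs = []
    for j in range(1, target_len + 1):
        center = src_len / target_len * j     # source location aligned with position j
        logits = [-((k - center) ** 2) / (2 * sigma ** 2)
                  for k in range(1, src_len + 1)]
        peak = max(logits)
        weights = [math.exp(a - peak) for a in logits]   # stable softmax
        norm = sum(weights)
        weights = [w / norm for w in weights]
        outputs.append([sum(w * z[d] for w, z in zip(weights, latents))
                        for d in range(dim)])
    return outputs

stretched = length_transform([[0.0], [1.0], [2.0]], target_len=5)
print(len(stretched))  # 5
```

Since the attention centers move monotonically with `j`, the output keeps the left-to-right order of the latent variables while changing their number.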
If we train a model with the objective function in Eq. (5), the KL divergence often drops to zero from the beginning. This yields a degenerate model that does not use the latent variables at all. This is a well-known issue in variational inference called posterior collapse [4, 6, 21]. We use two techniques to address this issue. Similarly to Kingma et al. (2016), we give a budget to the KL term as
$$\sum_{k=1}^{|x|} \max\big(b,\ \mathrm{KL}\big[q_\phi(z_k|x,y) \,\|\, p_\omega(z_k|x)\big]\big),$$
where $b$ is the budget of KL divergence for each latent variable. Once the KL value drops below $b$, it will not be minimized anymore, thereby letting the optimizer focus on the reconstruction term in the original objective function. As $b$ is a critical hyperparameter, it is time-consuming to search for a good budget value. Here, we use an annealing schedule to gradually lower the budget.
Let $s$ be the current step in training and $M$ the maximum step. In the first half of the training, the budget $b$ is held constant. In the second half of the training, we anneal $b$ down to its final value.
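A minimal sketch of the budgeted KL term and a two-phase budget schedule follows. The linear interpolation in the second half and the `start` and `end` values are assumptions of this example, not values taken from the text above.

```python
def budgeted_kl(kl_per_latent, budget):
    """Sum of per-latent KL values, each floored at the budget.

    A KL value already below the budget contributes a constant, so it
    produces no gradient and is no longer pushed toward zero.
    """
    return sum(max(budget, kl) for kl in kl_per_latent)

def annealed_budget(step, max_step, start, end):
    """Hold `start` for the first half, then move linearly to `end` (assumed shape)."""
    half = max_step / 2
    if step < half:
        return start
    fraction = min((step - half) / half, 1.0)
    return start + (end - start) * fraction

print(budgeted_kl([0.2, 2.0], budget=1.0))                 # 3.0
print(annealed_budget(75, 100, start=1.0, end=0.0))        # 0.5
```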
Similarly to previous work on non-autoregressive NMT, we apply sequence-level knowledge distillation [12], where we use the output from an autoregressive model as the target for our non-autoregressive model.
4 Inference with a Delta Posterior
Once the training has converged, we use an inference algorithm to find a translation $y$ that maximizes the lowerbound in Eq. (3):
$$\arg\max_{y}\ \mathbb{E}_{z \sim q_\phi}\big[\log p_\theta(y|x,z)\big] - \mathrm{KL}\big[q_\phi(z|x,y) \,\|\, p_\omega(z|x)\big].$$
It is intractable to solve this problem exactly due to the intractability of computing the first expectation. We avoid this issue during training by using a reparameterization-based Monte Carlo approximation. However, it is desirable to avoid stochasticity at inference time, where our goal is to present a single most likely target sentence given a source sentence.
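We tackle this problem by introducing a proxy distribution defined as
$$\delta(z|\mu) = \begin{cases} 1, & \text{if } z = \mu \\ 0, & \text{otherwise.} \end{cases}$$
This is a Dirac measure, and we call it a delta posterior in our work. We set this delta posterior to minimize the KL divergence against the approximate posterior $q_\phi(z|x,y)$, which is equivalent to
$$\nabla_\mu \log q_\phi(\mu|x,y) = 0 \iff \mu = \mathbb{E}_{q_\phi}[z].$$
We then use this proxy instead of the original approximate posterior to obtain a deterministic lowerbound:
$$\hat{\mathcal{L}}(\omega, \theta, \mu) = \log p_\theta(y|x, z=\mu) - \mathrm{KL}\big[\delta(z|\mu) \,\|\, p_\omega(z|x)\big].$$
As the second term is constant with respect to $y$, maximizing this lowerbound with respect to $y$ reduces to
$$\arg\max_{y}\ \log p_\theta(y|x, z=\mu),$$
which can be approximately solved by beam search when $p_\theta$ is an autoregressive sequence model. If $p_\theta$ factorizes over the sequence $y$, as in our non-autoregressive model, we can solve it exactly by
$$\hat{y}_i = \arg\max_{y_i}\ \log p_\theta(y_i|x, z=\mu).$$
We initialize the delta posterior using the prior distribution:
$$\mu = \mathbb{E}_{p_\omega(z|x)}[z].$$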
With this initialization, the proposed inference algorithm is fully deterministic. The complete inference algorithm for obtaining the final translation is shown in Algorithm 1.
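The alternating updates described in this section can be sketched as a short loop. This is an illustrative outline rather than a reproduction of Algorithm 1: `prior_mean`, `posterior_mean` and `decode` are hypothetical callables standing in for the trained prior, approximate posterior and decoder (the last of which would also cover length prediction and length transformation), and the early stop on an unchanged output is an assumption of this example.

```python
def deterministic_inference(x, prior_mean, posterior_mean, decode, max_steps=3):
    """Alternate between updating the delta posterior and the target tokens."""
    mu = prior_mean(x)                 # initialize the delta posterior at the prior mean
    y = decode(x, mu)                  # initial guess: position-wise argmax
    for _ in range(max_steps):
        mu = posterior_mean(x, y)      # move the delta posterior to the mean of q(z|x,y)
        new_y = decode(x, mu)          # re-decode from the refined latent variables
        if new_y == y:                 # stop once the output no longer changes
            break
        y = new_y
    return y

# Toy stand-ins (not trained models) so the loop can be executed end to end.
def toy_prior_mean(x):
    return [0.4 * v for v in x]

def toy_posterior_mean(x, y):
    return [0.5 * (v + t) for v, t in zip(x, y)]

def toy_decode(x, mu):
    return [int(m + 0.5) for m in mu]

print(deterministic_inference([1, 2, 3], toy_prior_mean, toy_posterior_mean, toy_decode))
# [1, 2, 3]
```

No random numbers are drawn anywhere in the loop, so repeated runs on the same input give the same output.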
5 Related Work
This work is inspired by a recent line of work in non-autoregressive NMT. Gu et al. (2018) first proposed a non-autoregressive framework by modeling word alignment as a latent variable, which has since been improved by Wang et al. (2019). Lee et al. (2018) proposed a deterministic iterative refinement algorithm where a decoder is trained to refine the hypotheses. Our approach is most related to Kaiser et al. (2018) and Roy et al. (2018). In both works, a discrete autoencoder is first trained on the target sentence, then an autoregressive prior is trained to predict the discrete latent variables given the source sentence. Our work differs from them in three ways: (1) we use continuous latent variables and train the approximate posterior and the prior jointly; (2) we use a non-autoregressive prior; and (3) we propose a novel iterative inference procedure in the latent space.
Concurrently to our work, Ghazvininejad et al. (2019) proposed to translate with a masked-prediction language model by iteratively replacing tokens with low confidence. Gu et al. (2019), Stern et al. (2019) and Welleck et al. (2019) proposed insertion-based NMT models that insert words into the translations with a specific strategy. Unlike these works, our approach performs refinements in the low-dimensional latent space, rather than in the high-dimensional discrete space.
Similarly to our latent-variable model, Zhang et al. (2016) proposed a variational NMT model, and Shah and Barber (2018) modeled the joint distribution of source and target. Both of them use autoregressive models. Shah and Barber (2018) designed an EM-like algorithm similar to Markov sampling [1]. In contrast, we propose a deterministic algorithm to remove any non-determinism during inference.
|Model||ASPEC Ja-En BLEU||ASPEC Ja-En speedup||WMT’14 En-De BLEU||WMT’14 En-De speedup|
|Base Transformer, beam size=3||27.1||1x||26.1||1x|
|Base Transformer, beam size=1||24.6||1.1x||25.6||1.3x|
|Latent-Variable NAR Model||13.3||17.0x||11.8||22.2x|
|+ knowledge distillation||25.2||17.0x||22.2||22.2x|
|+ deterministic inference||27.5||8.6x||24.1||12.5x|
|+ latent search||28.3||4.8x||25.1||6.8x|
6 Experimental Settings
Data and preprocessing
To preprocess the ASPEC dataset, we use the Moses toolkit [14] to tokenize the English sentences, and Kytea [18] for Japanese sentences. We further apply byte-pair encoding [22] to segment the training sentences into subwords. The resulting vocabulary has 40K unique tokens on each side of the language pair. To preprocess the WMT’14 dataset, we apply sentencepiece [15] to both languages to segment the corpus into subwords and build a joint vocabulary. The final vocabulary size is 32K for each language.
To train the proposed non-autoregressive models, we adopt the same learning rate annealing schedule as the Base Transformer. The final model parameters are selected based on the validation ELBO value.
The only new hyperparameter in the proposed model is the dimension of each latent variable. If each latent variable is a high-dimensional vector, it has a higher capacity, but the KL divergence in Eq. (3) becomes difficult to minimize. In practice, we found that latent dimensionality values between 4 and 32 result in similar performance. However, when the dimensionality is significantly higher or lower, we see a performance drop. In all experiments, we set the latent dimensionality to 8. We use a hidden size of 512 and a feedforward filter size of 2048 for all models in our experiments. We use 6 Transformer layers for the prior and the decoder, and 3 Transformer layers for the approximate posterior.
We evaluate tokenized BLEU for the ASPEC Ja-En dataset. For the WMT’14 En-De dataset, we use SacreBLEU [20] to evaluate the translation results. We follow Lee et al. (2018) in removing repetitions from the translation results before evaluating BLEU scores.
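One simple reading of "removing repetitions" is to collapse runs of identical consecutive tokens. The sketch below assumes that interpretation; the exact post-processing used in the experiments is not spelled out here, so treat the function as illustrative.

```python
from itertools import groupby

def remove_repetitions(tokens):
    """Collapse each run of identical consecutive tokens into a single token."""
    return [token for token, _ in groupby(tokens)]

print(remove_repetitions("the the cat sat sat sat down".split()))
# ['the', 'cat', 'sat', 'down']
```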
To further exploit the parallel computation ability of GPUs, we sample multiple initial latent variables from the prior $p_\omega(z|x)$. Then we perform the deterministic inference on each latent variable to obtain a list of candidate translations. However, we cannot afford to evaluate each candidate using Eq. (5), which requires importance sampling on $q_\phi(z|x,y)$. Instead, we use the autoregressive baseline model to score all the candidates, and pick the candidate with the highest log probability. Following Parmar et al. (2018), we reduce the temperature when sampling latent variables, resulting in better translation quality. To avoid stochasticity, we fix the random seed during sampling.
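The search over initial latent variables can be outlined as below. Several things here are assumptions made for the sketch: temperature is applied by scaling the prior standard deviation, candidates are processed in a sequential loop rather than in a GPU batch, and `toy_decode` and `toy_score` are hypothetical stand-ins for deterministic inference and teacher-model scoring.

```python
import random

def latent_search(prior_mean, prior_std, decode, score,
                  num_candidates, temperature, seed=0):
    """Sample initial latents, decode each one, and keep the best-scoring candidate."""
    rng = random.Random(seed)          # fixed seed keeps the search reproducible
    best, best_score = None, float("-inf")
    for _ in range(num_candidates):
        z = [m + temperature * s * rng.gauss(0.0, 1.0)
             for m, s in zip(prior_mean, prior_std)]
        candidate = decode(z)
        candidate_score = score(candidate)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best, best_score

# Toy stand-ins: rounding as the "decoder", closeness to a fixed target as the "teacher".
def toy_decode(z):
    return tuple(int(v + 0.5) for v in z)

def toy_score(y):
    return -sum(abs(a - b) for a, b in zip(y, (1, 2, 3)))

result = latent_search([1.2, 1.8, 2.6], [1.0, 1.0, 1.0], toy_decode, toy_score,
                       num_candidates=20, temperature=0.5)
print(result)
```

Lower temperatures keep the samples closer to the prior mean, and a temperature of zero reduces the search to decoding from the prior mean alone.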
7 Result and Analysis
Our quantitative results on both datasets are presented in Table 1. The baseline model in our experiments is a base Transformer. Our implementation of the autoregressive baseline is 1.0 BLEU point lower than the original paper [23] on the WMT’14 En-De dataset. We measure the latency of decoding each sentence on a single NVIDIA V100 GPU for all models, averaged over all test samples.
As shown in Table 1, without knowledge distillation, we observe a significant gap in translation quality compared to the autoregressive baseline. This observation is in line with previous ones on non-autoregressive NMT [10, 16, 24]. The gap is significantly reduced by using knowledge distillation, as translation targets provided by the autoregressive model are easier to predict.
With the proposed deterministic inference algorithm, we significantly improve translation quality by 2.3 BLEU points on the ASPEC Ja-En dataset and 1.9 BLEU points on the WMT’14 En-De dataset. Here, we only run the algorithm for one step. We observe gains in ELBO from running more iterative steps, which are, however, not reflected in the BLEU scores. As a result, we outperform the autoregressive baseline on the ASPEC dataset with a speedup of 8.6x. For the WMT’14 dataset, although the proposed model reaches a speedup of 12.5x, a gap of 2.0 BLEU points with the autoregressive baseline remains. We conjecture that WMT’14 En-De is more difficult for our non-autoregressive model as it contains a high degree of noise [19].
By searching over multiple initial latent variables and rescoring with the teacher Transformer model, we observe an increase of 0.8 to 1.0 BLEU points (Table 1) at the cost of slower translation speed. In our experiments, we sample 50 candidate latent variables and decode them in parallel. The slowdown is mainly caused by rescoring. With the help of rescoring, our final model further narrows the performance gap with the autoregressive baseline to 1.0 BLEU with a 6.8x speedup on the WMT’14 En-De task.
Non-autoregressive NMT Models
In Table 2, we list the results on WMT’14 En-De of existing non-autoregressive NMT approaches. All the models use the Transformer as their autoregressive baseline. In comparison, our proposed model suffers a drop of 1.0 BLEU point relative to the baseline, which is a relatively small gap among the existing models. Thanks to the rapid convergence of the proposed deterministic inference algorithm, our model achieves a higher speedup than other refinement-based models and provides a better speed-accuracy tradeoff.
Concurrently to our work, the mask-prediction language model [8] was found to reduce the performance gap down to 0.9 BLEU on WMT’14 En-De while still maintaining a reasonable speedup. The main difference is that we update a delta posterior over latent variables instead of target tokens. Both Ghazvininejad et al. (2019) and Wang et al. (2019) with autoregressive rescoring decode multiple candidates in a batch and pick one final translation from them. As our proposal is orthogonal to BERT-style training [5], investigating their combination is an interesting future direction.
|Model||BLEU (gap to autoregressive baseline)||Speedup|
|NAT (+FT +NPD S=100)||19.1 (-4.3)||2.3x|
|Adaptive NAR Model||21.5 (-3.0)||1.9x|
|LT, Improved Semhash||19.8 (-3.7)||3.8x|
|NAT-REG, no rescoring||20.6 (-6.7)||27.6x|
|NAT-REG, autoregressive rescoring||24.6 (-2.7)||15.1x|
|CMLM with 4 iterations||26.0 (-1.8)||-|
|CMLM with 10 iterations||26.9 (-0.9)||23x|
|NAR with deterministic Inference||24.1 (-2.0)||12.5x|
|+ latent search||25.1 (-1.0)||6.8x|
Analysis of Deterministic Inference
Convergences of ELBO and BLEU
In this section, we empirically show that the proposed deterministic iterative inference improves the ELBO in Eq. (3). As the ELBO is a function of $x$ and $y$, we measure the ELBO value with the new target prediction after each iteration during inference. For each instance, we sample 20 latent variables to compute the expectation in Eq. (3). The ELBO value is further averaged over data samples.
In Fig. 3, we show the ELBO value and the resulting BLEU scores for both datasets. In the initial step, the delta posterior is initialized with the prior distribution $p_\omega(z|x)$. We see that the ELBO value increases rapidly when performing the iterative inference, which means a higher lowerbound to $\log p(y|x)$. The improvement is highly correlated with increasing BLEU scores. For around 80% of the data samples, the algorithm converges within three steps. We observe that the BLEU scores peak after only one iterative step.
|Example 1: Sequence modified without changing length|
|Source||hyouki gensuiryou hyoujun no kakuritsu wo kokoromita. (Japanese)|
|Reference||the establishment of an optical fiber attenuation standard was attempted .|
|Initial Guess||an attempt was made establish establish damping attenuation standard ...|
|After Inference||an attempt was to establish the damping attenuation standard ...|
|Example 2: One word removed from the sequence|
|Source||...‘‘sen bouchou keisu no toriatsukai’’ nitsuite nobeta. (Japanese)|
|Reference||... handling of linear expansion coefficient .|
|Initial Guess||... ‘‘ handling of of linear expansion coefficient ’’ are described .|
|After Inference||... ‘‘ handling of linear expansion coefficient ’’ are described .|
|Example 3: Four words added to the sequence|
|Source||... maikuro manipyureshon heto hatten shite kite ori ...(Japanese)|
|Reference||... with wide application fields so that it has been developed ...|
|Initial Guess||... micro micro manipulation and ...|
|After Inference||... and micro manipulation , and it has been developed , and ...|
Trade-off between Quality and Speed
In Fig. 4, we show the trade-off between translation quality and the speed gain on the WMT’14 En-De task when considering multiple candidate latent variables in parallel. We vary the number of candidates from 10 to 100, and report BLEU scores and relative speed gains in the scatter plot. The results are divided into two groups. The first group of experiments searches over multiple latent variables and rescores with the teacher Transformer. The second group applies the proposed deterministic inference before rescoring.
We observe that the proposed deterministic inference consistently improves the translation quality in all settings. The BLEU score peaks at 25.2 after increasing the number of candidates to a large value. As GPUs are good at processing massive computations in parallel, the translation speed degrades only slightly when decoding fewer than 50 candidates.
We present some example translations to demonstrate the effect of the proposed iterative inference in Table 3. In the first example, the length of the target sequence does not change but only the tokens are replaced over the refinement iterations. The second and third examples show that the algorithm removes or inserts words to the sequence during the iterative inference by adaptively changing the target length. Such a significant modification to the predicted sequence mostly happens when translating long sentences.
For some test examples, however, we still find duplicated words in the final translation after applying the proposed deterministic inference. For these examples, we notice that the quality of the initial guess is considerably worse than average and that it typically contains multiple duplicated words. As the decoder is trained to reconstruct the sequence given to the approximate posterior $q_\phi(z|x,y)$, it is not expected to drastically modify the target prediction. Thus, a high-quality initial guess is crucial for obtaining good translations.
Our work presents the first approach to use a continuous latent-variable model for non-autoregressive Neural Machine Translation. The key idea is to introduce a sequence of latent variables to capture the uncertainty in the target sentence. The number of latent vectors is always identical to the number of input tokens. A length transformation mechanism is then applied to adapt the latent vectors to match the target length. We train the proposed model by maximizing the lowerbound of the log-probability $\log p(y|x)$.
We then introduce a deterministic inference algorithm that uses a delta posterior over the latent variables. The algorithm alternates between updating the delta posterior and the target tokens. Our experiments show that the algorithm is able to rapidly improve the evidence lowerbound of the predicted target sequence. In our experiments, the BLEU scores converge in only one iteration. Despite its effectiveness, the algorithm can be easily implemented.
Our non-autoregressive NMT model closes the performance gap with the autoregressive baseline on the ASPEC Ja-En task with an 8.6x speedup, and reduces the gap on the WMT’14 En-De task down to 2.0 BLEU points with a 12.5x speedup. By decoding multiple latent variables sampled from the prior, our model outperforms the baseline by 1.2 BLEU points on the ASPEC Ja-En task with a 4.8x speedup, and brings the gap on the En-De task down to 1.0 BLEU with a speedup of 6.8x.
When decoding multiple latent variables, a teacher model is essential as the latent-variable model framework does not provide a way to correctly evaluate candidate translations. The teacher model typically takes 15ms to compute. Future work that enables rescoring without the help of an external model may further improve the decoding speed.
[1] (2017) Improving sampling from generative autoencoders with Markov chains. CoRR abs/1610.09296.
[2] (2015) Neural machine translation by jointly learning to align and translate. In International Conference on Learning Representations.
[3] (2014) Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58.
[4] (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349.
[5] (2018) BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
[6] (2018) Avoiding latent variable collapse with generative skip models. CoRR abs/1807.04863.
[7] (2017) Convolutional sequence to sequence learning. CoRR abs/1705.03122.
[8] (2019) Constant-time machine translation with conditional masked language models. CoRR abs/1904.09324.
[9] (2018) Non-autoregressive neural machine translation. In Proceedings of the International Conference on Learning Representations 2018.
[10] (2018) Non-autoregressive neural machine translation. CoRR abs/1711.02281.
[11] (2018) Fast decoding in sequence models using discrete latent variables. arXiv preprint arXiv:1803.03382.
[12] (2016) Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pp. 1317–1327.
[13] (2014) Auto-encoding variational Bayes. CoRR abs/1312.6114.
[14] (2007) Moses: open source toolkit for statistical machine translation. In ACL.
[15] (2018) SentencePiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. In EMNLP.
[16] (2018) Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 1173–1182.
[17] (2016) ASPEC: Asian scientific paper excerpt corpus. In LREC.
[18] (2011) Pointwise prediction for robust, adaptable Japanese morphological analysis. In ACL, pp. 529–533.
[19] (2018) Analyzing uncertainty in neural machine translation. In Proceedings of the 35th International Conference on Machine Learning, ICML 2018, pp. 3953–3962.
[20] (2018) A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
[21] (2019) Preventing posterior collapse with delta-VAEs. CoRR abs/1901.03416.
[22] (2016) Neural machine translation of rare words with subword units. In ACL, pp. 1715–1725.
[23] (2017) Attention is all you need. In NIPS.
[24] (2019) Non-autoregressive machine translation with auxiliary regularization. CoRR abs/1902.10245.
[25] (2016) Google’s neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.