Infusing Sequential Information into Conditional Masked Translation Model with Self-Review Mechanism
Non-autoregressive models generate target words in parallel, which achieves faster decoding at the cost of translation accuracy. To remedy flawed translations produced by non-autoregressive models, a promising approach is to train a conditional masked translation model (CMTM) and refine the generated results over several iterations. Unfortunately, such an approach hardly considers the sequential dependency among target words, which inevitably degrades translation quality. Hence, instead of solely training a Transformer-based CMTM, we propose a Self-Review Mechanism to infuse sequential information into it. Concretely, we insert a left-to-right mask into the same decoder of the CMTM, and then induce it to autoregressively review whether each word generated by the CMTM should be replaced or kept. Experimental results on WMT14 EnDe and WMT16 EnRo demonstrate that our model requires dramatically less training computation than the typical CMTM, and outperforms several state-of-the-art non-autoregressive models by over 1 BLEU. Through knowledge distillation, our model even surpasses a typical left-to-right Transformer model, while significantly speeding up decoding.
Neural Machine Translation (NMT) models have achieved great success in recent years [Sutskever et al., 2014; Bahdanau et al., 2015; Cho et al., 2014; Kalchbrenner et al., 2016; Gehring et al., 2017; Vaswani et al., 2017]. Typically, NMT models use autoregressive decoders, where words are generated one by one. However, due to the left-to-right dependency, this computationally intensive decoding process cannot be easily parallelized, and therefore incurs high latency [Gu et al., 2018].
To break the bottleneck of autoregression, several non-autoregressive models have been proposed to induce the decoder to generate all target words simultaneously [Gu et al., 2018; Kaiser et al., 2018; Li et al., 2019; Ma et al., 2019]. Despite the gain in computational efficiency, these models usually suffer a loss of translation accuracy. Even worse, they decode a target in only one shot, and thus miss the chance to remedy a flawed translation. In contrast, a promising research line is to refine the generated result over several iterations [Lee et al., 2018; Ghazvininejad et al., 2019].
Along this line, Ghazvininejad et al. (2019) propose a Mask-Predict decoding strategy, which iteratively refines the generated translation given the most confident target words predicted in the previous iteration. This model is trained with the objective of conditional masked translation modeling (CMTM), predicting the masked words conditioned on the remaining observed words. However, in each training step CMTM learns from only a subset of words rather than the entire target. As a result, it must iterate more times over the training dataset to explore the contextual relationships within a sentence, and thus incurs a high total training cost [Clark et al., 2020]. Most importantly, CMTM relies heavily on the assumption of conditional independence, making it hard to capture the strong correlation between adjacent words [Gu et al., 2018]. Inevitably, this issue still degrades translation performance, for example by outputting repetitive words [Wang et al., 2019].
To address these issues, our idea is to infuse sequential information into CMTM. Accordingly, we propose a Self-Review Mechanism, a discriminative task in which the model learns to autoregressively distinguish the ground-truth target from its own non-autoregressively generated output. As shown by ArDecoder (short for Autoregressive Decoder) in Figure 1, we first switch on the autoregressive mode of the same Decoder used by CMTM by inserting a left-to-right mask. More importantly, we then require this ArDecoder to sequentially review whether each word generated by CMTM should be replaced or kept. In this way, the mechanism constrains our model to review each predicted word based only on the previous ones, which not only corrects prediction errors but also facilitates learning the conditional dependence among target words. Moreover, this mechanism also helps speed up the whole training by learning from all target words rather than a small masked subset.
We extensively validate our model on WMT14 EnDe and WMT16 EnRo. The experimental results demonstrate that our model outperforms several state-of-the-art non-autoregressive models by over 1 BLEU. Through knowledge distillation, our model even achieves performance competitive with the typical left-to-right Transformer, while significantly reducing inference time. Meanwhile, we also show that our model trains much faster than the typical CMTM.
2.1 Autoregressive NMT
Given a source sentence $X = (x_1, \dots, x_{|X|})$, an NMT model aims to generate a sentence $Y = (y_1, \dots, y_{|Y|})$ in the target language that expresses the same semantics, where $|X|$ and $|Y|$ denote the lengths of the source and target sentences, respectively. Typically, the training objective of an autoregressive NMT model is expressed as a chain of conditional probabilities in a left-to-right manner:
$$P(Y \mid X) = \prod_{t=1}^{|Y|+1} P(y_t \mid y_{<t}, X) \qquad (1)$$
where $y_0$ and $y_{|Y|+1}$ are SOS and EOS, standing for the start and end of a sentence, respectively. Usually, these probabilities are parameterized using a standard encoder-decoder architecture [Sutskever et al., 2014], where the decoder uses an autoregressive strategy to capture the left-to-right dependency among the target words.
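To make the left-to-right factorization concrete, the following minimal sketch scores a target sentence as a sum of per-step log-probabilities. The function `toy_next_word_probs` is a hypothetical stand-in for a trained decoder (it ignores the source and only looks at the last word of the prefix), and the tiny vocabulary is invented for illustration.

```python
import math

SOS, EOS = "<s>", "</s>"

def toy_next_word_probs(prefix, source):
    """Hypothetical stand-in for a trained decoder: p(next word | prefix, source)."""
    # A real model would condition on the whole prefix and the source sentence.
    table = {
        SOS: {"guten": 0.8, "tag": 0.1, EOS: 0.1},
        "guten": {"guten": 0.1, "tag": 0.8, EOS: 0.1},
        "tag": {"guten": 0.1, "tag": 0.1, EOS: 0.8},
    }
    return table[prefix[-1]]

def sequence_log_prob(target, source):
    """Sum of log p(y_t | y_<t, source), including the final EOS step."""
    prefix = [SOS]
    total = 0.0
    for word in list(target) + [EOS]:
        total += math.log(toy_next_word_probs(prefix, source)[word])
        prefix.append(word)  # each step depends on everything generated so far
    return total

log_p = sequence_log_prob(["guten", "tag"], ["good", "day"])
print(round(math.exp(log_p), 3))  # 0.8 * 0.8 * 0.8 = 0.512
```

The loop cannot be parallelized across positions because every step consumes the prefix produced by the previous ones, which is the latency bottleneck discussed above.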
2.2 Conditional Masked Translation Model
Different from the training objective in Equation 1, we adopt conditional masked translation modeling (CMTM) [Ghazvininejad et al., 2019] to optimize our proposed non-autoregressive NMT model. During training, our model aims to predict a set of masked target words $Y_{mask}$ given a source input $X$ and a set of observed target words $Y_{obs}$. Note that $Y_{obs} = Y \setminus Y_{mask}$, where $Y$ is the full target sentence. Based on the assumption that the words of $Y_{mask}$ are independent of each other, the training objective of CMTM is formulated as:
$$P(Y_{mask} \mid X, Y_{obs}) = \prod_{y_t \in Y_{mask}} P(y_t \mid X, Y_{obs}) \qquad (2)$$
where the masked words $Y_{mask}$ are randomly selected and denoted by a special token mask.
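As an illustration of this objective, the sketch below masks a random subset of target positions and computes the negative log-likelihood over the masked positions only. The helper names (`mask_target`, `cmtm_loss`) and the constant model output are assumptions made for the example, not part of the original implementation.

```python
import math
import random

MASK = "<mask>"

def mask_target(target, num_masked, rng):
    """Randomly pick `num_masked` positions and replace them with the mask token."""
    positions = sorted(rng.sample(range(len(target)), num_masked))
    observed = [MASK if i in positions else w for i, w in enumerate(target)]
    return observed, positions

def cmtm_loss(target, positions, predicted_probs):
    """Negative log-likelihood summed over masked positions only.

    predicted_probs[i] maps words to probabilities for position i. Under the
    conditional-independence assumption each masked position is scored separately.
    """
    return -sum(math.log(predicted_probs[i][target[i]]) for i in positions)

rng = random.Random(0)
target = ["the", "cat", "sat", "down"]
observed, positions = mask_target(target, num_masked=2, rng=rng)

# Hypothetical model output: probability 0.5 on the gold word at every position.
probs = [{w: 0.5} for w in target]
loss = cmtm_loss(target, positions, probs)
print(observed, positions, round(loss, 4))
```

Only two of the four positions contribute to the loss here, which is why a pure CMTM sees a fraction of the target in each training step.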
3.1 Model Architecture
Figure 2 illustrates the overall architecture of our proposed model, which is composed of three modules, an Encoder, a Decoder and an ArDecoder. Notably, ArDecoder is obtained by solely adding a left-to-right mask to Decoder, where their weights are tied. Rather than a pure CMTM, we also propose a Self-Review Mechanism to ask the ArDecoder to review the predicted target from Decoder in a left-to-right manner. In this section, we will detail each module and the Self-Review Mechanism.
Our Encoder is identical to the standard Transformer [Vaswani et al., 2017]. Built upon self attention, it encodes a source input into a series of contextual representations by:
The non-autoregressive property of our model mainly lies in our Decoder. Different from the Encoder, the Decoder has two sets of attention heads, as shown in Figure 2: the inner heads attend over the target words, and the inter heads attend over the hidden outputs of the Encoder. It is worth noting that we use a bidirectional mask, as shown in the middle of Figure 2. This mask allows the Decoder to use both left and right contexts to predict each target word, ensuring that the prediction for the $t$-th position can depend not only on the information before position $t$ but also on the information after it.
Our Decoder is optimized using the objective in Equation 2. Given a source input and part of observed target words , the Decoder is required to predict those words of . Firstly, we obtain a series of Decoder hidden outputs , by feeding Decoder with the observed target and the Encoder outputs . Mathematically, we parameterize as:
Then, we apply a linear projection on the hidden outputs , and obtain the probabilities of target words using softmax. Notably, we only focus on the probabilities of masked words during training. Therefore, the probability in Equation 2 is parameterized by:
The ArDecoder is introduced to serve as a discriminator, as in adversarial models [Goodfellow et al., 2014], and plays an important role in our Self-Review Mechanism. As shown in Figure 2, ArDecoder is obtained by adding a left-to-right mask to the Decoder. This mask prevents ArDecoder from attending to future words when reading the predicted target from the Decoder, ensuring that the prediction for the $t$-th position can rely only on the known outputs before position $t$.
Unlike the aim of Decoder, ArDecoder is asked to review the predicted sentence from Decoder, and distinguish whether each word is supposed to be replaced or not. Notably, we tie the weights of ArDecoder and Decoder, to ensure that our Decoder can take advantage of the sequential information learned from this discriminative task.
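The difference between the two attention masks can be sketched in a few lines. This is a minimal illustration, assuming the standard causal convention in which a position may also attend to itself; the paper's exact mask definition is not reproduced here, and the function names are hypothetical.

```python
import math

def bidirectional_mask(length):
    """Decoder mask: every position may attend to every other position."""
    return [[True] * length for _ in range(length)]

def left_to_right_mask(length):
    """ArDecoder mask (assumed causal form): position i attends to positions j <= i."""
    return [[j <= i for j in range(length)] for i in range(length)]

def masked_softmax(scores, allowed):
    """Softmax over one row of attention scores, ignoring disallowed positions."""
    exps = [math.exp(s) if ok else 0.0 for s, ok in zip(scores, allowed)]
    total = sum(exps)
    return [e / total for e in exps]

scores = [0.0, 0.0, 0.0]  # uniform raw scores for the query at position 1
print(masked_softmax(scores, bidirectional_mask(3)[1]))   # weight on all three positions
print(masked_softmax(scores, left_to_right_mask(3)[1]))   # [0.5, 0.5, 0.0]
```

Because only the mask changes, the same attention weights can serve both modules, which is what makes tying the Decoder and ArDecoder parameters straightforward.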
3.2 Self-Review Mechanism
As discussed previously, a CMTM alone is insufficient to capture the sequential dependency of target words, and thus inevitably produces disappointing translations. To remedy this issue, the core of our work is to better infuse sequential information into the model.
Concretely, we propose a Self-Review Mechanism for CMTM to learn strong correlations among the target words. During training, Encoder-Decoder firstly predicts a target given a gold observed target as well as an input source . Then, the ArDecoder is asked to review the predicted target , and distinguish whether a predicted word is supposed to be replaced by the ground truth as:
where is a sigmoid function. Finally, the objective of the self-reviewing becomes:
Here, we do not back-propagate the learning errors from ArDecoder to Encoder-Decoder due to the difficulty of applying adversarial learning to text [Caccia et al., 2018]. By adding this self-review objective, our model sees the entire target sentence rather than a small subset of words in each training step, and thus does not need to iterate as many times to explore the contextual relationships among the words, which helps speed up the whole training compared with a pure CMTM [Clark et al., 2020]. Besides, our work can also be regarded as multi-task learning, in which we enhance our Decoder with bidirectional contextual information as well as the left-to-right correlations of target words.
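The discriminative task can be sketched as follows. The label convention (1 for a word that differs from the ground truth and should be replaced, 0 otherwise), the averaging over positions, and the example logits are assumptions for illustration; the predicted words are treated as fixed inputs, so no gradient flows back through them.

```python
import math

def review_labels(predicted, gold):
    """1 if the predicted word differs from the ground truth (replace), else 0 (keep)."""
    return [int(p != g) for p, g in zip(predicted, gold)]

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def self_review_loss(logits, labels):
    """Mean binary cross-entropy over all target positions."""
    losses = []
    for z, y in zip(logits, labels):
        p = sigmoid(z)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1 - p)))
    return sum(losses) / len(losses)

gold = ["the", "cat", "sat", "down"]
predicted = ["the", "cat", "cat", "down"]  # hypothetical Decoder output with a repetition
labels = review_labels(predicted, gold)
print(labels)  # [0, 0, 1, 0]

# Hypothetical ArDecoder logits, one per position, computed left to right.
logits = [-2.0, -2.0, 2.0, -2.0]
print(round(self_review_loss(logits, labels), 4))
```

Every position receives a label, so the loss covers the whole target sentence rather than only the masked subset.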
3.3 Length Prediction
Typically, an autoregressive NMT model generates the target sentence word by word, and thus decides the length of the target sentence by encountering a special token EOS. However, our model adopts non-autoregressive decoding, namely, it predicts the entire target sentence in parallel. Following [Devlin et al., 2019; Ghazvininejad et al., 2019], we add a special token LEN to the beginning of the source input. In this sense, our Encoder is also required to predict the length of the target sentence given the source input, i.e., to predict one length token from a range bounded by the maximum length of target sentences in our corpus. Mathematically, we define the loss of length prediction as:
3.4 Optimization and Inference
During inference, we abandon ArDecoder and perform iterative refinement based only on Encoder-Decoder. Following Mask-Predict [Ghazvininejad et al., 2019], we generate a raw sequence starting from an entirely masked target given a new input source. Upon this raw sequence, we conduct refinement by masking out and re-predicting a subset of words whose probabilities are under a threshold. This refinement is repeated for a heuristic number of iterations. For more details, please refer to Ghazvininejad et al. (2019).
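The refinement loop can be sketched as below. This is not the authors' code: it assumes the linearly decaying re-masking schedule of Mask-Predict (re-mask the least confident words, fewer at each iteration) in place of an explicit probability threshold, and `toy_predict` is a hypothetical model that is confident only when the left neighbour is already observed.

```python
MASK = "<mask>"
GOLD = ["a", "b", "c", "d"]

def toy_predict(tokens):
    """Hypothetical model: returns a (word, probability) pair for every position."""
    out = []
    for i in range(len(tokens)):
        if i == 0 or tokens[i - 1] != MASK:
            out.append((GOLD[i], 0.9))   # confident once the left context is known
        else:
            out.append((GOLD[0], 0.3))   # otherwise repeats a word with low confidence
    return out

def mask_predict(predict, length, iterations):
    """Start from a fully masked target and iteratively re-predict weak positions."""
    tokens = [MASK] * length
    probs = [0.0] * length
    for step in range(iterations):
        for i, (word, p) in enumerate(predict(tokens)):
            if tokens[i] == MASK:        # only masked positions are updated
                tokens[i], probs[i] = word, p
        # Number of positions to re-mask decays linearly with the iteration index.
        num_mask = int(length * (iterations - step - 1) / iterations)
        if num_mask == 0:
            break
        for i in sorted(range(length), key=lambda i: probs[i])[:num_mask]:
            tokens[i] = MASK
    return tokens

print(mask_predict(toy_predict, length=4, iterations=1))  # ['a', 'a', 'a', 'a']
print(mask_predict(toy_predict, length=4, iterations=4))  # ['a', 'b', 'c', 'd']
```

With a single iteration the toy model repeats a word at every uncertain position, while four iterations recover the full sequence, mirroring the gap between one-shot and iterative decoding.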
We conduct experiments on two benchmark datasets, WMT14 EnDe (4.5M sentence pairs) and WMT16 EnRo (610k pairs). After preprocessing the two datasets following [Lee et al., 2018], we tokenize them into subword units using BPE [Sennrich et al., 2016]. We use newstest-2013 and newstest-2014 as our development and test datasets for WMT14 EnDe, and newsdev-2016 and newstest-2016 as our development and test datasets for WMT16 EnRo.
We adopt the widely-used BLEU [Papineni et al., 2002] as the evaluation metric.
We follow the base configuration of Transformer [Vaswani et al., 2017]: the model dimension is set to 512, and the dimension of the inner layers is set to 2048. The Encoder consists of a stack of 6 layers, as do the Decoder and ArDecoder. The weights of our model are all randomly initialized with a uniform distribution . Besides, we set the parameters of layer normalization as . We use the Adam optimizer [Kingma and Ba, 2015] with 98k tokens per batch. We increase the learning rate from 0 to 5e-4 within the first 10,000 warmup steps, and then gradually decay it in proportion to the inverse square root of the number of training steps. Note that the Decoder and ArDecoder share all weights except for the output layer (see Equations 5 and 6). During inference, we set the number of length candidates to 5 for non-autoregressive decoding, where the max length is defined as 10,000. The number of refinement iterations is set to 10. To compare with autoregressive models, we adopt a beam width of 5 for beam search decoding. The training speed is measured on 8 NVIDIA Tesla P100 GPUs, and the decoding speed on a single one.
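The learning-rate schedule described above can be written as a small function. This is a sketch under the assumption of a linear warmup followed by the common inverse-square-root form, scaled so that the rate is continuous at the end of warmup; the exact formula used in the experiments is not stated in the text.

```python
def learning_rate(step, peak_lr=5e-4, warmup_steps=10_000):
    """Linear warmup from 0 to peak_lr, then inverse-square-root decay."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    return peak_lr * (warmup_steps / step) ** 0.5

for step in (0, 5_000, 10_000, 40_000):
    print(step, learning_rate(step))
```

Under these assumptions the rate reaches 5e-4 at step 10,000 and has halved again by step 40,000, since the decay factor is the square root of 10,000/40,000.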
Previous work on non-autoregressive NMT models has shown that knowledge distillation can substantially improve performance [Gu et al., 2018; Lee et al., 2018; Stern et al., 2019; Zhou et al., 2020]. Commonly, a student model is trained on a distilled dataset generated by a teacher model, where the teacher usually adopts a much larger parameter configuration than its student. Different from this common setting, we investigate whether distillation is still useful when the teacher and its student share the same configuration. We train our model on distilled corpora (ENDE and ENRO), where the distilled targets are generated by a typical left-to-right Transformer with the base configuration. In the following, we identify the effect of knowledge distillation on our model.
To demonstrate the effectiveness of our work, we compare with several state-of-the-art NMT models:
Seq2Seq [Bahdanau et al., 2015]: It is an LSTM-based sequence-to-sequence model, where the decoder adopts a beam search strategy.
ConvS2S [Gehring et al., 2017]: It is a convolution-based sequence-to-sequence model, and it decodes the target words in a left-to-right manner.
Transformer [Vaswani et al., 2017]: It is a state-of-the-art autoregressive model, and it adopts beam search decoding to generate the target translation.
FTNAT [Gu et al., 2018]: It is a non-autoregressive Transformer model using fertilities, and it adopts noisy parallel decoding (NPD) to generate the target translation.
FlowSeq [Ma et al., 2019]: It is also a non-autoregressive model, which introduces a latent variable to model the generative flow. During inference, it generates a target translation using argmax decoding.
HintNAT [Li et al., 2019]: It is also a non-autoregressive model, which leverages the alignments and hidden states of a teacher autoregressive model.
IRNAT [Lee et al., 2018]: It is a non-autoregressive model trained as a conditional denoising autoencoder. During inference, it iteratively revises the generated translation. We set the number of iterations to 10.
Mask-Predict [Ghazvininejad et al., 2019]: It is a typical CMTM model. During inference, it applies Mask-Predict to the translation for 10 iterations. By comparing with it comprehensively, we aim to examine the effectiveness of our proposed Self-Review Mechanism.
|FlowSeq-large (w/ kd)||1||23.72||28.39||-||29.73||-|
4.3 Comparison Against Baselines
The experimental results are summarized in Table 1. We first examine the non-autoregressive models with different decoding strategies, i.e., one-shot decoding vs. iterative decoding. As shown in Table 1, FTNAT, HintNAT and FlowSeq achieve the lowest BLEU scores. This degradation is mainly due to the multimodality problem [Gu et al., 2018]: these models hardly consider the left-to-right dependency. Even worse, they have no chance to remedy their translations. The same holds for the first iteration of IRNAT, Mask-Predict and our model, where the results are similar to those of the one-shot decoding models. From this comparison, we conclude that iterative decoding is an effective technique for non-autoregressive NMT models.
Although IRNAT and Mask-Predict are able to turn an initially poor translation into a much better one through multiple decoding iterations, there is still a gap in translation accuracy compared with the SOTA autoregressive model, i.e., Transformer. This deficiency is attributed to the lack of a mechanism or strategy to capture the strong correlations among the target words, which is also the root cause of why non-autoregressive models struggle to generate satisfactory translations [Ren et al., 2020].
In contrast, our model, which is additionally optimized with our proposed Self-Review Mechanism, achieves a significant performance boost over these non-autoregressive models. Meanwhile, our model has a large lead in BLEU on WMT14 ENDE compared with Seq2Seq and ConvS2S, and even achieves performance comparable to Transformer. More specifically, compared with Transformer, our model (w/o kd) achieves 34.54 (+0.26 gains) and 34.36 (+0.37 gains) BLEU on WMT16 ENRO and WMT16 ROEN, respectively. With the help of knowledge distillation, our model even outperforms Transformer on all the benchmark datasets except WMT16 ENRO. More importantly, our model dramatically reduces the cost of decoding, being at least 5.16x faster than Transformer. If we sacrifice some translation accuracy by reducing the number of iterations, we can obtain even higher decoding efficiency. In brief, these comparison results validate the effectiveness of the Self-Review Mechanism.
4.4 Effect of Knowledge Distillation
The comparison results are listed in the last 6 rows of Table 1. On the large-scale dataset, i.e., WMT14 ENDE, our model with knowledge distillation gains a remarkable improvement, especially in the early iterations. Under the same configuration size, it is widely believed that the autoregressive model is better at capturing the alignment relationship between a source-target pair [Gu et al., 2018], and thus an autoregressive teacher model is able to reduce the redundant and irrelevant alignment “modes” in the raw corpus. In this way, our proposed model benefits from learning on such a distilled dataset. However, the improvement does not carry over to the small-scale dataset, i.e., WMT16 ENRO. At the end of the 10th iteration, our model even shows a decrease in BLEU on WMT16 ENRO. We conjecture that a small-scale dataset is statistically likely to contain fewer redundant “modes” than a large-scale dataset. As a result, distillation on a small-scale dataset might not be more beneficial for a student model than the raw dataset, probably no matter how large the teacher model is. Therefore, it is useful and more efficient to adopt a teacher model with the same configuration size as the student model for knowledge distillation on a large-scale dataset.
5 Ablation Study and Analysis
On top of CMTM, we additionally introduce a Self-Review Mechanism during training, whereas Mask-Predict [Ghazvininejad et al., 2019] is optimized with only the first two terms in Equation 10. During inference, we abandon ArDecoder, and our model decodes in the same way as Mask-Predict. In this section, we compare closely against Mask-Predict to validate the contribution of our proposed Self-Review Mechanism.
5.1 Training Speed
To better understand the comparison of training speed between Mask-Predict and our model, we measure the FLOPs of a single training step and of the whole training, as shown in Table 2 and Table 3, respectively. In terms of the cost of one training step, Table 2 shows that Mask-Predict is about 1.6x faster than our model, since our model has to optimize ArDecoder as well. However, this per-step result does not imply that training our model takes longer than training Mask-Predict. Instead, the results in Table 3 illustrate that our model effectively speeds up the whole training, especially on the large-scale dataset WMT14 ENDE (at least 5x faster). This discrepancy between a single step and the whole training might have several causes. We conjecture that our model is able to see the whole target sentence, since the ArDecoder needs to review each word generated by the Decoder. In contrast, Mask-Predict only learns from a subset of masked words, and thus takes many more steps to discover the semantic relationships among the words.
| Model / Dataset | WMT14 ENDE | WMT14 DEEN | WMT16 ENRO | WMT16 ROEN |
|---|---|---|---|---|
| Mask-Predict (w/ kd) | 1.12e19 (1.00x) | 6.84e18 (1.00x) | 1.33e18 (1.00x) | 1.45e18 (1.00x) |
| Ours (w/ kd) | 1.43e18 (7.83x) | 1.30e18 (5.26x) | 0.78e18 (1.71x) | 0.79e18 (1.83x) |
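The reduction factors in the table follow directly from dividing the total training FLOPs of Mask-Predict by those of our model. The short check below recomputes them from the reported values; small differences in the last digit are expected because the FLOPs themselves are rounded.

```python
# Total training FLOPs copied from the table: (Mask-Predict, Ours, reported factor).
table3 = {
    "WMT14 ENDE": (1.12e19, 1.43e18, 7.83),
    "WMT14 DEEN": (6.84e18, 1.30e18, 5.26),
    "WMT16 ENRO": (1.33e18, 0.78e18, 1.71),
    "WMT16 ROEN": (1.45e18, 0.79e18, 1.83),
}

def reduction_factor(baseline_flops, our_flops):
    """How many times fewer FLOPs our model needs than the baseline."""
    return baseline_flops / our_flops

for dataset, (baseline, ours, reported) in table3.items():
    factor = reduction_factor(baseline, ours)
    print(f"{dataset}: {factor:.2f}x (reported {reported:.2f}x)")
```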
5.2 Sentence Length
We further examine the influence of the Self-Review Mechanism on different sentence lengths, again in comparison with Mask-Predict. We conduct comparative experiments on WMT14 ENDE, and divide the reference targets into buckets by length. As shown in Figure 3, Mask-Predict performs similarly to or slightly better than our model when sentences are short. However, our model improves significantly over Mask-Predict as the sentence length increases, leading to a wide gap when sentences are quite long. This result supports the claim that our proposed Self-Review Mechanism is better at capturing the long-term dependency among the target words.
5.3 Adjacent Words
According to previous work [Wang et al., 2019], non-autoregressive models usually suffer from repetitive words at adjacent positions. To validate whether this pattern is remedied by the Self-Review Mechanism, we conduct a statistical study of repetitive words to compare Mask-Predict and our model. The results in Table 4 show that our model produces substantially fewer repetitive words than Mask-Predict. For better understanding, we visualize the cosine similarities within two targets generated by Mask-Predict and our model, respectively, given the same input source, where the similarities are measured between decoder hidden states of the last layer. From the heatmaps of the resulting cosine similarities in Figure 4, we can see that there are observably more yellow blocks in (a) than in (b), indicating that Mask-Predict produces much more similar hidden states across the positions of the generated sentence, especially along the diagonal parts in Figure 4. The results of Table 4 and Figure 4 demonstrate that our proposed Self-Review Mechanism helps the model reduce repetitive words, which further indicates that Self-Review is also an effective technique for CMTM to capture the strong correlations among the target words.
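Both analyses in this subsection are simple to reproduce in outline. The sketch below counts adjacent repeated words in a generated sentence and builds a cosine-similarity matrix over per-position hidden states; the example sentence and the two-dimensional vectors are invented for illustration and do not come from the experiments.

```python
import math

def count_adjacent_repetitions(words):
    """Number of positions whose word equals the word immediately before it."""
    return sum(1 for a, b in zip(words, words[1:]) if a == b)

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(states):
    """Pairwise cosine similarities between hidden states of all positions."""
    return [[cosine(u, v) for v in states] for u in states]

output = ["we", "went", "to", "to", "the", "the", "park"]
print(count_adjacent_repetitions(output))  # 2

# Hypothetical last-layer hidden states: positions 0 and 1 are nearly identical.
states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
for row in similarity_matrix(states):
    print([round(x, 2) for x in row])
```

High off-diagonal values next to the diagonal correspond to the bright blocks in the heatmap, where neighbouring positions share almost the same representation and tend to emit the same word.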
6 Related Work
To tackle the high latency of autoregression, many researchers have attempted to use one-shot parallel decoding for machine translation. Gu et al. (2018) first proposed a non-autoregressive model that uses fertilities as a latent variable. Later on, several works introduced different kinds of latent information to improve performance. Kaiser et al. (2018) used a sequence of discrete latent variables as the decoder inputs, Li et al. (2019) utilized hints from the hidden states and word alignments of an autoregressive model, and Ma et al. (2019) modeled a meaningful generative flow using latent variables. Although these methods are able to decode the target in one shot, they usually pay a price in translation accuracy [Ren et al., 2020]. Worse still, they never have a chance to remedy a flawed translation.
Our work resides in the research line of iterative parallel decoding. Lee et al. (2018) iteratively refined the generated outputs through a denoising autoencoder. However, the optimization is complicated, as they resort to a heuristic method of stochastic corruption of the training data. Still along this line, our work is most relevant to Ghazvininejad et al. (2019), who proposed a simple yet effective method, i.e., the Mask-Predict decoding strategy. A major difference is that Ghazvininejad et al. (2019) resort to a typical conditional masked translation model (CMTM), which relies heavily on the assumption of conditional independence. However, this assumption goes against the highly multimodal distribution of true target translations [Gu et al., 2018]. To alleviate the issue, we develop a Self-Review Mechanism to infuse sequential information into the CMTM model.
We are also inspired by the idea of augmenting the model with a discriminative task [Clark et al., 2020], in order to address the computational inefficiency of CMTM. Clark et al. (2020) introduced a discriminator (similar to our ArDecoder) that learns from all input words rather than a small masked subset. They then fine-tuned the discriminator for downstream tasks. The difference is that we discard ArDecoder and perform iterative decoding only with Encoder-Decoder. Besides, we tie the weights of ArDecoder and Decoder to ensure that our Decoder can take advantage of the sequential information learned from the discriminative task.
In this paper, we identify the drawback of CMTM that it is insufficient to capture the sequential correlations among target words. To tackle it, we propose a Self-Review Mechanism that is able to infuse sequential information into CMTM. On several benchmark datasets, we demonstrate that our approach achieves a large improvement over previous non-autoregressive models and results competitive with the state-of-the-art Transformer model. Through an ablation study, our proposed mechanism is also shown to speed up the training of a CMTM model.
- Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Massimo Caccia, Lucas Caccia, William Fedus, Hugo Larochelle, Joelle Pineau, and Laurent Charlin. 2018. Language GANs falling short. arXiv preprint arXiv:1811.02549.
- Kyunghyun Cho, Bart van Merrienboer, Çaglar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. ArXiv, abs/1406.1078.
- Kevin Clark, Minh-Thang Luong, Quoc V. Le, and Christopher D. Manning. 2020. ELECTRA: Pre-training text encoders as discriminators rather than generators. In ICLR.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv, abs/1810.04805.
- Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann Dauphin. 2017. Convolutional sequence to sequence learning. ArXiv, abs/1705.03122.
- Marjan Ghazvininejad, Omer Levy, Yinhan Liu, and Luke Zettlemoyer. 2019. Mask-predict: Parallel decoding of conditional masked language models. In EMNLP-IJCNLP.
- Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In NIPS.
- Jiatao Gu, James Bradbury, Caiming Xiong, Victor O. K. Li, and Richard Socher. 2018. Non-autoregressive neural machine translation. In ICLR.
- Nal Kalchbrenner, Lasse Espeholt, Karen Simonyan, Aäron van den Oord, Alex Graves, and Koray Kavukcuoglu. 2016. Neural machine translation in linear time. ArXiv, abs/1610.10099.
- Diederik P. Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. CoRR, abs/1412.6980.
- Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In EMNLP.
- Zhuohan Li, Zi Lin, Di He, Fei Tian, Tao Qin, Liwei Wang, and Tie-Yan Liu. 2019. Hint-based training for non-autoregressive machine translation. In EMNLP-IJCNLP.
- Xuezhe Ma, Chunting Zhou, Xian Li, Graham Neubig, and Eduard Hovy. 2019. FlowSeq: Non-autoregressive conditional sequence generation with generative flow. In EMNLP-IJCNLP.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In ACL.
- Yi Ren, Jinglin Liu, Xu Tan, Sheng Zhao, Zhou Zhao, and Tie-Yan Liu. 2020. A study of non-autoregressive model for sequence generation. In ACL.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In ACL.
- Mitchell Stern, William Chan, Jamie Ryan Kiros, and Jakob Uszkoreit. 2019. Insertion transformer: Flexible sequence generation via insertion operations. In ICML.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
- Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NIPS.
- Yiren Wang, Fei Tian, Di He, Tao Qin, ChengXiang Zhai, and Tie-Yan Liu. 2019. Non-autoregressive machine translation with auxiliary regularization. In AAAI.
- Chunting Zhou, Graham Neubig, and Jiatao Gu. 2020. Understanding knowledge distillation in non-autoregressive machine translation. In ICLR.
- Łukasz Kaiser, Aurko Roy, Ashish Vaswani, Niki Parmar, Samy Bengio, Jakob Uszkoreit, and Noam Shazeer. 2018. Fast decoding in sequence models using discrete latent variables. In ICML.