Fast Interleaved Bidirectional Sequence Generation
Independence assumptions during sequence generation can speed up inference, but parallel generation of highly inter-dependent tokens comes at a cost in quality.
Instead of assuming independence between neighbouring tokens (semi-autoregressive decoding, SA), we take inspiration from bidirectional sequence generation and introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously.
We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder by simply interleaving the two directions and adapting the word positions and self-attention masks. Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer, and on five machine translation tasks and two document summarization tasks, achieves a decoding speedup of ~2× over autoregressive decoding with comparable quality. Notably, it outperforms left-to-right SA because the independence assumptions in IBDecoder are more felicitous.
To achieve even higher speedups, we explore hybrid models where we either simultaneously predict multiple neighbouring tokens per direction, or perform multi-directional decoding by partitioning the target sequence. These methods achieve speedups of 4×–11× across different tasks at the cost of less than 1 BLEU or 0.5 ROUGE (on average).
1 Introduction

Neural sequence generation aided by encoder-decoder models Bahdanau et al. (2015); Vaswani et al. (2017) has achieved great success in recent years Bojar et al. (2018); Song et al. (2019); Raffel et al. (2019); Karita et al. (2019), but still suffers from slow inference. One crucial bottleneck lies in its generative paradigm, which factorizes the conditional probability along the target sequence $Y$ of length $T$ as follows:

$$p(Y|X) = \prod_{t=1}^{T} p\left(y_t \mid y_{<t}, X\right), \qquad (1)$$

where $X$ is the source sequence of length $S$. This factorization determines that target words can only be generated one-by-one, in a sequential and unidirectional manner, which limits the decoding efficiency.
A promising direction to break this barrier is to generate multiple target words at one decoding step to improve the parallelization of inference Gu et al. (2018); Stern et al. (2018). However, this introduces independence assumptions that hurt translation quality, since words produced in parallel are in fact likely to be inter-dependent. We hypothesize that there are groups of words that are less likely to be strongly inter-dependent than neighbouring words, which will allow for better parallelization. Inspired by bidirectional modeling Zhang et al. (2019a, 2020), we resort to an alternative probabilistic factorization:

$$p(Y|X) = \prod_{t=1}^{\lceil T/2 \rceil} p\left(y_t, y_{T+1-t} \mid y_{<t}, y_{>T+1-t}, X\right). \qquad (2)$$

Introducing an independence assumption between $y_t$ and $y_{T+1-t}$ allows for parallel word prediction from both the left-to-right and right-to-left directions. Based on this factorization, Zhou et al. (2019) propose synchronous bidirectional translation using a dedicated interactive decoder, and report improvements in translation quality over left-to-right semi-autoregressive decoding (Wang et al., 2018, SA). However, their success comes with extra computational overhead introduced by the specialized decoder. Empirically, Zhou et al. (2019) only report a decoding speedup of 1.38×, slower than SA, although the factorization halves the number of decoding steps.
We combine the strengths of bidirectional modeling and SA, and propose the interleaved bidirectional decoder (IBDecoder) for fast generation. As shown in Figure 1(a), we interleave target words from the left-to-right and right-to-left directions and separate their positions, so that any standard unidirectional decoder, such as the Transformer decoder Vaswani et al. (2017), can be reused. We reorganize the self-attention mask to enable inter- and intra-direction interaction (Figure 1(c)), following SA. Unlike SA, which pairs neighbouring tokens, IBDecoder pairs distant tokens from the two directions, which we show experimentally to be less inter-dependent, explaining its better performance. Compared to previous studies Zhang et al. (2018, 2019a, 2020); Zhou et al. (2019), our approach adds no extra model parameters and introduces little overhead at training and decoding.
IBDecoder is speedup-bounded at 2×. To push this ceiling up, we explore strategies for multi-word simultaneous generation, including multi-directional decoding (IMDecoder, Figure 1(d)) and SA (Figure 1(b)). The former extends Eq. 2 by inserting more generation directions, while the latter allows each direction to produce multiple target words Wang et al. (2018). These strategies offer a chance to aggressively improve the decoding speed, albeit at the risk of degraded performance. To encourage multi-word generation in parallel, we propose a modified beam search algorithm.
We extensively experiment on five machine translation tasks and two document summarization tasks, with an in-depth analysis studying the impact of batch size, beam size and sequence length on the decoding speed. We close our analysis by examining the capacity of our model in handling long-range dependencies. On these tasks, IBDecoder yields a ~2× speedup over the Transformer at inference, and reaches 4×–11× after pairing it with SA. Still, the overall generation quality is comparable. When we pair our method with sequence-level knowledge distillation (Kim and Rush, 2016), we outperform a Transformer baseline on 6 out of 7 tasks.
Our contributions are summarized below:

- We propose IBDecoder, following a bidirectional factorization of the conditional probability, for fast sequence generation. IBDecoder retains the training efficiency of the Transformer and is easy to implement.
- We extend IBDecoder to enable multi-word simultaneous generation by investigating its integration with IMDecoder and SA. Results show that IBDecoder + SA performs better than IMDecoder.
- We propose a modified beam search algorithm to support step-wise parallel generation.
- On several sequence generation benchmarks, IBDecoder yields a ~2× speedup over the Transformer at inference, and reaches 4×–11× after pairing it with SA, while the overall generation quality remains comparable.
2 Related Work
Efforts on fast sequence generation come along with the rapid development of encoder-decoder models Vaswani et al. (2017). A straightforward way is to reduce the amount of computation. Methods in this category range from teacher-student models Kim and Rush (2016); Hayashi et al. (2019), constrained softmax prediction Hu et al. (2015), beam search cube pruning Zhang et al. (2018b), floating-point quantization Wu et al. (2016); Bhandare et al. (2019), and model pruning See et al. (2016), to simplified decoder architectures, such as lightweight recurrent models Zhang et al. (2018a); Zhang and Sennrich (2019); Kim et al. (2019), average attention networks Zhang et al. (2018), merged attention networks Zhang et al. (2019), dynamic convolution Wu et al. (2019), and hybrid attentions Shazeer (2019); Wang et al. (2019), etc.
Nonetheless, the above methods still suffer from the inference bottleneck caused by the sequential nature of autoregressive models. Instead, Gu et al. (2018) propose non-autoregressive generation where target words are predicted independently, leading to great speedup, albeit at a high cost to generation quality. Follow-up studies often seek solutions to recover the performance Libovický and Helcl (2018); Guo et al. (2019); Shao et al. (2020); Ghazvininejad et al. (2020); Ran et al. (2020), but also reveal the trade-off between the quality and speed in terms of autoregressiveness. This motivates researchers to discover the optimal balance by resorting to semi-autoregressive modeling Wang et al. (2018); Stern et al. (2018), iterative refinement Lee et al. (2018); Stern et al. (2019); Ghazvininejad et al. (2019) or in-between Kaiser et al. (2018); Akoury et al. (2019).
We hypothesize that generation order affects the felicity of independence assumptions made in semi-autoregressive modelling. Unlike generation with flexible orders Emelianenko et al. (2019); Stern et al. (2019); Gu et al. (2019a), we employ a deterministic generation order for model simplicity and training efficiency, specifically focusing on bidirectional decoding. The study of bidirectional modeling dates back to the era of phrase-based statistical machine translation Watanabe and Sumita (2002); Finch and Sumita (2009) and recently gained popularity in neural machine translation Liu et al. (2016); Sennrich et al. (2016a); Zhang et al. (2019b, a); Zheng et al. (2019). Unfortunately, these methods either design complex neural decoders, which hurts training efficiency, and/or perform the left-to-right and right-to-left inference separately followed by rescoring, which slows down decoding. By contrast, our model speeds up inference while maintaining training speed.
Our work is closely related to SA Wang et al. (2018) and synchronous bidirectional generation Zhou et al. (2019). IBDecoder extends SA to incorporate information from different directions. In contrast to Zhou et al. (2019), we only make minimal changes to the standard Transformer decoder, which benefits efficiency during training and inference, and makes our method easy to implement. We also find improvements in both decoding speed and translation quality compared to Wang et al. (2018); Zhou et al. (2019).
3 Autoregressive Transformer
Transformer Vaswani et al. (2017), the state-of-the-art neural sequence generation model, follows the autoregressive factorization in Eq. 1. To handle the dependency of target word $y_t$ on the previous target words $y_{<t}$, Transformer relies on a masked self-attention network in the decoder:

$$\mathrm{ATT}\left(S^l\right) = \mathrm{softmax}\left(\frac{Q K^\top}{\sqrt{d}} + M\right) V, \qquad Q, K, V = S^l W_q,\; S^l W_k,\; S^l W_v,$$

where $S^l \in \mathbb{R}^{T \times d}$ is the decoder state at layer $l$, $\mathrm{softmax}(\cdot)$ denotes the softmax operation, $d$ is the model dimension and $l$ is the layer depth; $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ are trainable parameters.
The mask matrix $M \in \mathbb{R}^{T \times T}$ limits the access of attention to only the past target words. Formally, given the target sequence length $T$, this matrix can be constructed by the following masking function:

$$M_{ij} = \begin{cases} 0, & \text{if } \lfloor j/(nk) \rfloor \le \lfloor i/(nk) \rfloor \\ -\infty, & \text{otherwise} \end{cases}$$

where $0 \le i, j < T$, $n$ denotes the number of generation directions, and $k$ is the number of target words predicted per direction. By default, the Transformer decoder is unidirectional and generates words one-by-one; thus $n = 1, k = 1$. The $-\infty$ entries force the softmax to output a probability of 0, disabling invalid attention connections.
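For concreteness, the masking function can be sketched in a few lines of NumPy (a minimal sketch; the function name and formulation are ours, not from the paper):

```python
import numpy as np

def attention_mask(T, n=1, k=1):
    """Build the T x T self-attention mask: position i may attend to
    position j iff j's block of n*k jointly-generated tokens is no
    later than i's block. n=1, k=1 recovers the standard causal mask."""
    block = np.arange(T) // (n * k)          # block index of each position
    allowed = block[None, :] <= block[:, None]
    return np.where(allowed, 0.0, -np.inf)

# n=1, k=1: standard causal mask (zeros on and below the diagonal)
M = attention_mask(4)
# n=2, k=1: relaxed mask; the two tokens produced at each step see each other
M2 = attention_mask(4, n=2, k=1)
```

With `n=2, k=1`, `M2[0, 1] == 0`, i.e. the first left-to-right and first right-to-left tokens may attend to each other, while `M2[1, 3]` stays `-inf`.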
The input layer to Transformer’s decoder is the sum of the target word embedding $E(Y)$ and the word position encoding $P(\mathbf{p})$, i.e. $E(Y) + P(\mathbf{p})$, where $\mathbf{p}$ maps $Y$ to its word position sequence via a simple indexing function (Figure 1(b)):

$$\mathbf{p} = (1, 2, \ldots, T),$$

where $p_i = i$. Transformer adopts the sinusoidal positional encoding to project these indices to real-valued embeddings, and uses the last-layer decoder output to predict the respective next target word. We next explain how to accelerate generation by reordering $Y$ and adjusting $\mathbf{p}$ and $M$.
4 Interleaved Bidirectional Decoder
The structure of Transformer is highly parallelizable, but the autoregressive schema blocks this parallelization during inference. We remove this barrier by exploring the alternative probabilistic factorization in Eq. 2, which allows words to be predicted from different directions simultaneously.
We propose IBDecoder, as shown in Figure 1(a). Rather than devising dedicated decoder architectures Zhou et al. (2019); Zhang et al. (2020), we reuse the standard decoder architecture so as to largely inherit Transformer’s parallelization and avoid extra computation or parameters. To make the left-to-right and right-to-left generation collaborative, we reorganize the target sequence and the word positions as follows (purple and green rectangles in Figure 1(a)):

$$Y' = \left(y_1, y_T, y_2, y_{T-1}, \ldots\right), \qquad \mathbf{p}' = (1, -1, 2, -2, \ldots).$$

By following the generation order defined by Eq. 2, the sequence $Y'$ interleaves the left-to-right sequence $(y_1, y_2, \ldots)$ and the right-to-left sequence $(y_T, y_{T-1}, \ldots)$, converting a bidirectional generation problem into a unidirectional one. We introduce negative positions in $\mathbf{p}'$ to retain the locality bias of the sinusoidal positional encodings within each direction.
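The reordering above can be sketched as follows (an illustrative helper; the function name is ours):

```python
def interleave(y):
    """Reorder target tokens outside-in as (y1, yT, y2, yT-1, ...) and
    assign positions (1, -1, 2, -2, ...): positive offsets for the
    left-to-right direction, negative ones for right-to-left."""
    seq, pos = [], []
    left, right, step = 0, len(y) - 1, 1
    while left <= right:
        seq.append(y[left]); pos.append(step)
        if left != right:  # avoid duplicating the middle token of an odd-length sequence
            seq.append(y[right]); pos.append(-step)
        left += 1; right -= 1; step += 1
    return seq, pos

seq, pos = interleave(["y1", "y2", "y3", "y4", "y5"])
# seq == ["y1", "y5", "y2", "y4", "y3"], pos == [1, -1, 2, -2, 3]
```

Reversing the mapping at the end of decoding (take the even-indexed tokens, then the reversed odd-indexed ones) recovers the original order.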
We also adapt the self-attention mask to permit step-wise bidirectional generation: we apply the masking function above with $n = 2$ generation directions and $k = 1$. This corresponds to the relaxed causal mask of Wang et al. (2018), which ensures access to all predictions made in previous time steps.
4.1 Beyond Two-Word Generation
Eq. 2 only supports two-word generation, which implies an upper bound of 2× on the inference speedup. To improve this bound, we study two strategies for multi-word generation.
Similar to IBDecoder, IMDecoder also permutes the target sequence. It inserts multiple generation directions (i.e. it increases $n$), with each direction producing one word per step (i.e. $k = 1$). As shown in Figure 1(d), it splits the target sequence into several roughly equal segments and applies IBDecoder to each segment (thus an even $n$ is required). Formally, IMDecoder reframes the target sequence and word positions as follows: $Y'$ concatenates the $n/2$ segments, each reordered by IBDecoder, and the word position of each token decomposes into two parts: the first represents the index of the decoding step at which the word is predicted; the second denotes the generation direction the word belongs to. Specifically, we record the corresponding direction indices and add a group of trainable direction embeddings (red rectangles in Figure 1(d)) to the decoder input. IMDecoder uses the masking function above with $n > 2$ directions and $k = 1$.
Instead of partitioning the target sequence, another option is to produce multiple target words per direction at each step (i.e. to increase $k$; Wang et al., 2018). SA assumes that neighbouring words are conditionally independent, despite the fact that tokens in natural language are typically highly inter-dependent.
We combine SA with IBDecoder (Figure 1(e)) with the expectation that producing 2 neighbouring tokens independently per direction is less harmful than producing 4 neighbouring words in parallel. We reuse the sequence $Y'$ and positions $\mathbf{p}'$ for the decoder input, but enlarge the attention range in the self-attention mask to support multi-word generation (Figure 1(f)): the masking function above with $n = 2$ and $k = 2$.
To handle multiple predicted words per decoding step, we adjust the beam search algorithm as in Algorithm 1. For each partial hypothesis, we predict the words of one step in parallel. We first extract the top-scoring predictions for all positions (line 10), then prune the resulting joint search space through an outer-addition of their scores back to the beam size (line 12). The pruned scores (line 12) and the backtraced words (line 14) are then used for normal decoding. Note that each complete hypothesis requires a simple deterministic post-processing step to recover its original word order (line 28). In contrast to Zhou et al. (2019), we do not separate the left-to-right beam from the right-to-left beam.
With multiple predicted target words, determining whether a hypothesis is complete becomes challenging. We adopt a simple strategy: a hypothesis is considered complete once any word in the predictions hits the end-of-sentence symbol (“[/s]”) (line 17). We leave the study of alternatives to future work.
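For intuition, the pruning step for one partial hypothesis that predicts two positions in parallel might look like this (an illustrative sketch, not the authors' implementation; all names are ours):

```python
import numpy as np

def prune_two_positions(hyp_score, logp_a, logp_b, beam):
    """Select the best `beam` joint continuations of one hypothesis when
    two positions are predicted in parallel: take the top-`beam` words
    per position, combine their scores by outer addition, and keep the
    overall top-`beam` pairs."""
    top_a = np.argsort(-logp_a)[:beam]   # best candidate words, position a
    top_b = np.argsort(-logp_b)[:beam]   # best candidate words, position b
    # outer addition: beam x beam grid of joint log-probabilities
    joint = hyp_score + logp_a[top_a][:, None] + logp_b[top_b][None, :]
    best = np.argsort(-joint, axis=None)[:beam]
    rows, cols = np.unravel_index(best, joint.shape)
    return [(int(top_a[r]), int(top_b[c]), float(joint[r, c]))
            for r, c in zip(rows, cols)]

logp_a = np.log(np.array([0.7, 0.2, 0.1]))
logp_b = np.log(np.array([0.6, 0.3, 0.1]))
pairs = prune_two_positions(0.0, logp_a, logp_b, beam=2)
# best pair is (word 0, word 0); second best is (word 0, word 1)
```

The outer addition reduces the full vocabulary-squared search space to a beam-by-beam grid before the global top-k, which keeps the per-step cost close to standard beam search.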
5 Experiments

We test our model on machine translation (MT) and document summarization. We train MT models on five different language pairs: WMT14 English-German (En-De, Bojar et al., 2014), WMT14 English-French (En-Fr, Bojar et al., 2014), WMT16 Romanian-English (Ro-En, Bojar et al., 2016), WMT18 English-Russian (En-Ru, Bojar et al., 2018) and WAT17 Small-NMT English-Japanese (En-Ja, Nakazawa et al., 2017). Translation quality is measured by BLEU Papineni et al. (2002), and we report detokenized BLEU using the toolkit sacreBLEU Post (2018).
5.1 Results on WMT14 En-De
Table 1 compares the performance of our models on WMT14 En-De. Relaxing the autoregressiveness with IBDecoder yields slightly worse translation quality than the Transformer (-0.7 BLEU, 1⃝ vs. 2⃝, w/o KD). Unlike Zhang et al. (2020), we observe no quality improvement, but our model delivers a speedup of 1.90×–2.33× at inference, clearly surpassing the simple greedy decoding baseline (1.32×) and BIFT (0.89×) Zhang et al. (2020). The quality drop is easily recovered with knowledge distillation (+0.2 BLEU, 1⃝ vs. 2⃝, w/ KD).
Going beyond two-word generation, which strengthens the independence assumptions, greatly decreases performance (2⃝ vs. 3⃝,4⃝, w/o KD) while enlarging the speedup to 3.3×–4.5×. Compared to SA, the quality degradation with IMDecoder is larger, both w/ and w/o KD. We ascribe this to the difficulty of structure planning, as IMDecoder has to guess words in the middle of the sequence at the start of generation. We employ SA for the following experiments.
In contrast to existing work Zhang et al. (2018, 2019a, 2020); Zhou et al. (2019), our models only marginally affect training efficiency (0.98× vs. 0.61× for Zhang et al. (2020)), and require no extra linguistic information Akoury et al. (2019). Our results also suggest that the degree to which each model benefits from KD varies; follow-up studies should report performance both w/ and w/o KD.
Table 2: Ablation study on WMT14 En-De (1⃝ denotes IBDecoder).

| ID | Model | n | k | BLEU |
|----|-------|---|---|------|
| 2⃝ | 1⃝ + vanilla mask | 2 | 1 | 25.7 |
| 3⃝ | 1⃝ + vanilla positions | 2 | 1 | 25.9 |
| 4⃝ | 1⃝ + middle-to-side | 2 | 1 | 20.7 |
| 5⃝ | 1⃝ + indep. directions | 2 | 1 | 23.9 |
| 7⃝ | 1⃝ + SA | 2 | 2 | 23.0 |
We carry out an ablation study, shown in Table 2. Replacing the attention mask with the vanilla causal one (1⃝ vs. 2⃝) introduces unnecessary independence assumptions and reduces performance by 0.5 BLEU. Using vanilla positional encodings (3⃝) also reduces performance by 0.3 BLEU, indicating that we benefit from preserving the locality bias of sinusoidal encodings within each direction. Changing the generation direction from side-to-middle (1⃝) to middle-to-side (4⃝) dramatically increases the learning difficulty (-5.5 BLEU).
In IBDecoder, the two translation directions are interlinked, i.e. predictions are conditioned on the history of both directions. We can remove the cross-direction attention, forcing the model to produce the left and right halves of the sequence independently. Such independent generation performs poorly (-2.3 BLEU, 1⃝ vs. 5⃝), underscoring the importance of bidirectional context and resonating with the findings of Zhou et al. (2019).
Vanilla SA vs. IBDecoder
Our IBDecoder shares architectural properties with vanilla SA Wang et al. (2018), namely the independent generation of two tokens per time step and the adapted self-attention mask, but the two crucially differ in their generation order and independence assumptions: vanilla SA operates from left to right, while IBDecoder interleaves left-to-right and right-to-left decoding.
Our ablation results in Table 2 show that IBDecoder substantially outperforms vanilla SA (+2.1/+4.3 BLEU, 1⃝ vs. 6⃝ and 7⃝ vs. 8⃝).
To further investigate the difference in independence assumptions between the two approaches, we compare estimated point-wise mutual information (PMI) of the words being predicted independently by IBDecoder and vanilla SA.
On Teacher-Student Models
One classical approach to improving decoding efficiency is to train a small student model with KD. Results in Table 4 support this: a Transformer student produces similar performance w/ KD but runs 2.32× faster, even better than IBDecoder (1.90×). Combining the student schema with IBDecoder increases the speedup to 4.41× without hurting performance (26.6 BLEU, w/ KD). In exchange for 2.4 BLEU, we can reach 7.24× faster decoding with SA. The compatibility of our model with the teacher-student framework reflects the generality of our bidirectional modeling.
The results also demonstrate that efficiency improvements from faster autoregressive decoding, here obtained by reducing the number of decoder layers, are orthogonal to our approach and can be combined with it.
Impact of Batch and Beam Size
Figure 2 shows speedups over a standard Transformer with varying batch and beam sizes. With small batch sizes, increasing the beam size improves the speedup, while the impact becomes negative at large batch sizes. Overall, our model is consistently faster than the Transformer at inference, regardless of the batch and beam size.
Impact of Source Sentence Length
Although translation quality fluctuates over the source sentence length, Figure 3 shows that our model shares the same performance pattern with the baseline. With respect to the speedup, our model performs better when translating longer source sentences.
Results in Figure 4 show that $k$ controls the trade-off between translation quality and speedup. With larger $k$, more target tokens are predicted per decoding direction, leading to a better speedup but causing a larger performance drop, both w/ and w/o KD. Further analysis reveals that, as the dependency between predicted target words weakens, our model suffers from a more serious over-translation issue, yielding a larger OTEM Yang et al. (2018). N-gram deduplication slightly improves quality.
Table 5: Comparison to previous work on WMT14 En-De.

| Model | BLEU | Speedup |
|-------|------|---------|
| SAT Wang et al. (2018) | 26.09 | 2.07× |
| SBSG Zhou et al. (2019) | 27.22 | 1.61× |
| SynST Akoury et al. (2019) | 20.74 | 4.86× |
| Levenshtein Gu et al. (2019b) | 27.27 | 4.01× |
| CMLM Ghazvininejad et al. (2019) | 27.03 | - |
| AXE Ghazvininejad et al. (2020) | 23.53 | - |
Analysis on Long-range Dependency
We adopt the subject-verb agreement task from Lingeval97 Sennrich (2017) for analysis. We can see from the results in Figure 5 that IBDecoder performs similarly to the original Transformer for agreement over short distances, but agreement over longer distances drops on average. In contrast, models that include SA show steep drops in accuracy for short distances.
Curiously, KD seems to harm agreement scores even though it led to higher BLEU. Overall, these results suggest that BLEU does not show the full quality loss incurred by our independence assumptions. This deficiency also provides evidence for the performance drop in Figure 4.
Comparison to Previous Work
Results in Table 5 show that our model outperforms SynST Akoury et al. (2019) in quality and slightly surpasses the Levenshtein Transformer Gu et al. (2019b) in speed. In particular, our model surpasses SAT Wang et al. (2018) and SBSG Zhou et al. (2019) in terms of both quality and speed. Our model does not rely heavily on extra linguistic knowledge Akoury et al. (2019), nor does it require complex pseudo training data construction Gu et al. (2019b). Compared to these prior studies, our approach is simple but effective.
5.2 Results on Other Tasks
Table 6 shows MT results for the other translation directions, and for document summarization. Despite syntactic, morphological, transcript and sequence-length differences, our model achieves comparable generation quality and a 1.75×–11.15× speedup across tasks. With KD, our model even outperforms the Transformer baseline on 5 out of 6 tasks. In particular, our model succeeds on the CNN/Daily Mail task, which previous non-autoregressive models rarely attempt due to its lengthy target sequences, although our model suffers from the long-range dependency issue shown in Figure 5.
6 Conclusion and Future Work
We present interleaved bidirectional sequence generation to accelerate decoding by enabling generation from the left-to-right and right-to-left directions simultaneously. We combine the strengths of SBSG Zhou et al. (2019) and SA Wang et al. (2018), and propose a simple interleaved bidirectional decoder (IBDecoder) that can be easily implemented on top of a standard unidirectional decoder, like Transformer, by interleaving the target sequence and tweaking the word positions and self-attention masks. IBDecoder inherits Transformer’s training parallelization with no additional model parameters, and is extensible with SA and multi-directional decoding. We show that the independence assumptions we introduce between the two directions are less harmful to translation quality than the independence assumptions in left-to-right SA. On a series of generation tasks, we report comparable quality with significant inference speedup (4×–11×) and little training overhead. We also show that the approach is orthogonal to other speedups for autoregressive decoding, e.g. by reducing model size.
In the future, we would like to further improve multi-directional generation, and will investigate alternative ways to partition the target sequence and encode positional information. We are also interested in better measuring and reducing the quality loss resulting from long-distance dependencies. Finally, we would like to adapt our interleaving approach to other sequence-to-sequence architectures.
This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). Ivan Titov acknowledges support of the European Research Council (ERC Starting grant 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support of the Swiss National Science Foundation (MUTAMUR; no. 176727).
Appendix A Data Preprocessing and Model Settings
We use the provided pre-processed data for WAT17 En-Ja. For the other tasks, we apply byte pair encoding Sennrich et al. (2016b) with a joint vocabulary size of 32K, except for WMT18 En-Ru (48K). We experiment with Transformer Base Vaswani et al. (2017): model dimension of 512, 6 encoder and decoder layers, 8 attention heads and an FFN size of 2048. Dropout with rate 0.1 is applied to residual connections and attention weights. We employ Adam Kingma and Ba (2015) for parameter optimization with the standard learning-rate schedule and 4K warm-up steps. Gradients are estimated over roughly 25K target subwords. We average the last 5 checkpoints for evaluation, and use beam search (beam size 4, length penalty 0.6) by default for inference.
Appendix B Estimation of the PMI
To estimate the average point-wise mutual information (PMI) in Table 3, we compare IBDecoder / vanilla SA to its autoregressive counterpart in terms of test perplexity (ppl). Taking SA ($k = 2$) as an example, we have:

$$\overline{\mathrm{PMI}} \approx 2 \left( \log \mathrm{ppl}(\mathrm{SA}) - \log \mathrm{ppl}(\mathrm{Base}) \right),$$
where Base denotes the baseline Transformer. The intuition behind our estimation is that the Transformer handles neighbouring words $(y_{2t-1}, y_{2t})$ autoregressively and thus models their joint probability: $p(y_{2t-1}, y_{2t} \mid y_{<2t-1}, X) = p(y_{2t-1} \mid y_{<2t-1}, X)\, p(y_{2t} \mid y_{<2t}, X)$. Instead, vanilla SA predicts those words independently, i.e. $p(y_{2t-1} \mid y_{<2t-1}, X)\, p(y_{2t} \mid y_{<2t-1}, X)$. Comparing the perplexities of SA and the Transformer therefore gives an estimate of the average PMI of the words predicted independently. The method for IBDecoder follows the same spirit.
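Under these assumptions, the estimate reduces to simple arithmetic on the two test perplexities (a sketch of our reading of the estimation, not a quoted formula; the factor of 2 reflects that each PMI term is shared by a pair of two tokens):

```python
import math

def avg_pmi_estimate(ppl_indep, ppl_base):
    """Average PMI of an independently-predicted token pair, estimated
    from per-token test perplexities: the per-token log-likelihood gap
    log(ppl_indep) - log(ppl_base), scaled by the 2 tokens per pair."""
    return 2.0 * (math.log(ppl_indep) - math.log(ppl_base))

# A model that ignores the pairwise dependency pays a perplexity cost;
# e.g. ppl 5.5 vs. 5.0 implies an average PMI of about 0.19 nats.
pmi = avg_pmi_estimate(5.5, 5.0)
```

The larger this gap, the stronger the dependency the independence assumption discards.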
- Source code is released at https://github.com/bzhangGo/zero.
- Consider Figure 1(a). We cannot reorder position encodings along with embeddings (1,6,2,5,…) because we do not know the sentence length at test time. Simply using vanilla position encodings (1,2,3,4,…) would increase the embedding distance between positions within a direction.
- Note that with two tokens produced per time step, decoder inputs are shifted by two.
- Signature BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.4.3
- Details about PMI estimation are given in Appendix B.
- Also note the concurrent work by Kasai et al. (2020).
- We only applied deduplication for the results in Figure 4.
References

- Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1269–1281.
- Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA.
- Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532.
- Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58.
- Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 131–198.
- Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Brussels, Belgium, pp. 272–303.
- Sequence modeling with unconstrained generation order. In Advances in Neural Information Processing Systems 32, pp. 7700–7711.
- Bidirectional phrase-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1124–1132.
- Aligned cross entropy for non-autoregressive machine translation. arXiv preprint arXiv:2004.01655.
- Mask-predict: parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6112–6121.
- Non-autoregressive neural machine translation. In International Conference on Learning Representations.
- Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics 7, pp. 661–676.
- Levenshtein transformer. In Advances in Neural Information Processing Systems 32, pp. 11181–11191.
- Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3723–3730.
- Findings of the third workshop on neural generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 1–14.
- Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, pp. 1693–1701.
- Improved beam search with constrained softmax for NMT. In Proceedings of MT Summit XV, pp. 297.
- Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning, Stockholm, Sweden, pp. 2390–2399.
- A comparative study on Transformer vs. RNN in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 449–456.
- Deep encoder, shallow decoder: reevaluating the speed-quality tradeoff in machine translation.
- Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1317–1327.
- From research to production and back: ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 280–288.
- Adam: a method for stochastic optimization. In International Conference on Learning Representations.
- Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1173–1182.
- End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3016–3021.
- ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81.
- Agreement on target-bidirectional neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 411–416.
- Overview of the 4th workshop on Asian translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), Taipei, Taiwan, pp. 1–54.
- Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318.
- A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Brussels, Belgium, pp. 186–191.
- Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints.
- Learning to recover from multi-modality errors for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3059–3069.
- A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389.
- Compression of neural machine translation models via pruning. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 291–301.
- Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 371–376.
- Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Cited by: Appendix A.
- How grammatical is character-level neural machine translation? assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 376–382. External Links: Cited by: §5.1.
- Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2.
- Fast transformer decoding: one write-head is all you need. arXiv preprint arXiv:1911.02150. Cited by: §2.
- Semantic neural machine translation using amr. Transactions of the Association for Computational Linguistics 7 (), pp. 19–31. External Links: Cited by: §1.
- Insertion transformer: flexible sequence generation via insertion operations. In International Conference on Machine Learning, pp. 5976–5985. Cited by: §2, §2.
- Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 10086–10095. External Links: Cited by: §1, §2.
- Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008. External Links: Cited by: Appendix A, §1, §1, §2, §3.
- Accelerating transformer decoding via a hybrid of self-attention and recurrent neural network. arXiv preprint arXiv:1909.02279. Cited by: §2.
- Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 479–488. External Links: Cited by: §1, §1, §2, §2, §4.1, §4, §5.1, §5.1, §5.1, Table 2, Table 5, §6.
- Bidirectional decoding for statistical machine translation. In COLING 2002: The 19th International Conference on Computational Linguistics, External Links: Cited by: §2.
- Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, External Links: Cited by: §2.
- Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Cited by: §2.
- Otem&Utem: over-and under-translation evaluation metric for nmt. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 291–302. Cited by: Figure 4, §5.1.
- A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1538–1548. External Links: Cited by: §2.
- Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909. External Links: Cited by: §2.
- Simplifying neural machine translation with addition-subtraction twin-gated recurrent networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4273–4283. External Links: Cited by: §2.
- Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2278–2287. External Links: Cited by: §1, §1, §2, §5.1.
- Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1789–1798. External Links: Cited by: §2.
- Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence 281, pp. 103234. External Links: Cited by: §1, §1, §4, §5.1, §5.1.
- Speeding up neural machine translation decoding by cube pruning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4284–4294. External Links: Cited by: §2.
- Asynchronous bidirectional decoding for neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §5.1.
- Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 443–450. Cited by: §2.
- Dynamic past and future for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 931–941. External Links: Cited by: §2.
- Sequence generation: from both sides to the middle. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5471–5477. Cited by: §1, §1, §2, §4.2, §4, §5.1, §5.1, §5.1, Table 5, §6.