Fast Interleaved Bidirectional Sequence Generation


Abstract

Independence assumptions during sequence generation can speed up inference, but parallel generation of highly inter-dependent tokens comes at a cost in quality. Instead of assuming independence between neighbouring tokens (semi-autoregressive decoding, SA), we take inspiration from bidirectional sequence generation and introduce a decoder that generates target words from the left-to-right and right-to-left directions simultaneously. We show that we can easily convert a standard architecture for unidirectional decoding into a bidirectional decoder by simply interleaving the two directions and adapting the word positions and self-attention masks. Our interleaved bidirectional decoder (IBDecoder) retains the model simplicity and training efficiency of the standard Transformer, and on five machine translation tasks and two document summarization tasks, achieves a decoding speedup of 2× compared to autoregressive decoding with comparable quality. Notably, it outperforms left-to-right SA because the independence assumptions in IBDecoder are more felicitous. To achieve even higher speedups, we explore hybrid models where we either simultaneously predict multiple neighbouring tokens per direction, or perform multi-directional decoding by partitioning the target sequence. These methods achieve speedups of 4×–11× across different tasks at the cost of 1 BLEU or 0.5 ROUGE (on average).1


1 Introduction

Neural sequence generation aided by encoder-decoder models Bahdanau et al. (2015); Vaswani et al. (2017) has achieved great success in recent years Bojar et al. (2018); Song et al. (2019); Raffel et al. (2019); Karita et al. (2019), but still suffers from slow inference. One crucial bottleneck lies in its generative paradigm, which factorizes the conditional probability along the target sequence y = (y_1, ..., y_N) of length N as follows:

p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{N} p(y_t \mid \mathbf{y}_{<t}, \mathbf{x}) \quad (1)

where x is the source sequence of length M. This factorization dictates that target words can only be generated one-by-one, in a sequential and unidirectional manner, which limits the decoding efficiency.
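To make the sequential bottleneck concrete, the following minimal greedy-decoding sketch (illustrative only; model.predict_next is a placeholder for a full decoder forward pass returning the argmax of p(y_t | y_<t, x)) shows that every word must wait for all previously generated words:

```python
def greedy_decode(model, x, bos="<s>", eos="</s>", max_len=256):
    """Autoregressive decoding per Eq. 1: strictly one word per step."""
    y = [bos]
    for _ in range(max_len):
        # each prediction is conditioned on the full prefix y, so the
        # loop body cannot be parallelized across target positions
        next_word = model.predict_next(x, y)
        if next_word == eos:
            break
        y.append(next_word)
    return y[1:]  # strip the start symbol
```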

Figure 1: Overview of the interleaved bidirectional decoder (IBDecoder, panel a), the semi-autoregressive decoder (SA, panel b), their shared self-attention mask (panel c), the interleaved multi-directional decoder (IMDecoder, panel d), the bidirectional semi-autoregressive decoder (IBDecoder + SA, panel e) and the self-attention mask for panels d and e (panel f), on a target sequence y. We reorganize the target sequence (purple), the word positions (green) and the self-attention mask (circles) to reuse the standard Transformer decoder. During inference, multiple target words are generated simultaneously at each step (dashed rectangles), improving the decoding speed. In the self-attention masks (panels c and f), solid black circles indicate allowed attention positions. Red arrows indicate generation directions (d is the number of directions), and their length denotes the number of words produced per direction (k). Blue rectangles denote words generated at the first step. The direction embedding (red rectangles) reflects the direction each target word belongs to. Apart from the left-to-right generation, IBDecoder jointly models the right-to-left counterpart within a single sequence. IMDecoder extends IBDecoder by splitting the sequence into several equal segments and performing bidirectional generation on each of them, while IBDecoder + SA allows each direction to produce multiple words.

A promising direction to break this barrier is to generate multiple target words at one decoding step to improve the parallelization of inference Gu et al. (2018); Stern et al. (2018). However, this introduces independence assumptions that hurt translation quality, since words produced in parallel are in fact likely to be inter-dependent. We hypothesize that there are groups of words that are less likely to be strongly inter-dependent than neighbouring words, which will allow for better parallelization. Inspired by bidirectional modeling Zhang et al. (2019a, 2020), we resort to an alternative probabilistic factorization:

p(\mathbf{y} \mid \mathbf{x}) = \prod_{t=1}^{\lceil N/2 \rceil} p(y_t, y_{N+1-t} \mid \mathbf{y}_{<t}, \mathbf{y}_{>N+1-t}, \mathbf{x}) \quad (2)

Introducing an independence assumption between y_t and y_{N+1-t} allows for parallel word prediction from both the left-to-right and right-to-left directions. Based on this factorization, Zhou et al. (2019) propose synchronous bidirectional translation using a dedicated interactive decoder, and report quality improvements compared to left-to-right semi-autoregressive decoding (Wang et al., 2018, SA). However, their success comes with extra computational overhead introduced by the specialized decoder. Empirically, Zhou et al. (2019) only report a decoding speedup of 1.38×, slower than SA, although the factorization halves the number of decoding steps.

We combine the strengths of bidirectional modeling and SA, and propose the interleaved bidirectional decoder (IBDecoder) for fast generation. As shown in Figure 1a, we interleave target words from the left-to-right and right-to-left directions and separate their positions so that any standard unidirectional decoder, such as the Transformer decoder Vaswani et al. (2017), can be reused. We reorganize the self-attention mask to enable inter- and intra-direction interaction (Figure 1c), following SA. Unlike SA, we show through experiments that distant tokens from different directions are less inter-dependent, which makes the independence assumptions more felicitous and leads to better performance. Compared to previous studies Zhang et al. (2018, 2019a, 2020); Zhou et al. (2019), our approach adds no extra model parameters and brings little overhead at training and decoding.

IBDecoder is speedup-bounded at 2×. To push this ceiling up, we explore strategies for multi-word simultaneous generation, including multi-directional decoding (IMDecoder, Figure 1d) and SA (Figure 1b). The former extends Eq. 2 by inserting more generation directions, while the latter allows each direction to produce multiple target words Wang et al. (2018). These strategies offer a chance to aggressively improve the decoding speed, albeit at the risk of degraded performance. To encourage multi-word generation in parallel, we propose a modified beam search algorithm.

We extensively experiment on five machine translation tasks and two document summarization tasks, with an in-depth analysis studying the impact of batch size, beam size and sequence length on the decoding speed. We close our analysis by examining the capacity of our model in handling long-range dependencies. On these tasks, IBDecoder yields a 2× speedup over the Transformer at inference, and reaches 4×–11× after pairing it with SA. Still, the overall generation quality is comparable. When we pair our method with sequence-level knowledge distillation (Kim and Rush, 2016), we outperform a Transformer baseline on 6 out of 7 tasks.

Our contributions are summarized below:

  • We propose IBDecoder, following a bidirectional factorization of the conditional probability, for fast sequence generation. IBDecoder retains the training efficiency and is easy to implement.

  • We extend IBDecoder to enable multi-word simultaneous generation by investigating integration with IMDecoder and SA. Results show that IBDecoder + SA performs better than IMDecoder.

  • We propose a modified beam search algorithm to support step-wise parallel generation.

  • On several sequence generation benchmarks, IBDecoder yields a 2× speedup over the Transformer at inference, and reaches 4×–11× after pairing it with SA, while the overall generation quality remains comparable.

2 Related Work

Efforts on fast sequence generation come along with the rapid development of encoder-decoder models Vaswani et al. (2017). A straightforward way is to reduce the amount of computation. Methods in this category range from teacher-student models Kim and Rush (2016); Hayashi et al. (2019), constrained softmax prediction Hu et al. (2015), beam search cube pruning Zhang et al. (2018b), floating-point quantization Wu et al. (2016); Bhandare et al. (2019) and model pruning See et al. (2016), to simplified decoder architectures, such as lightweight recurrent models Zhang et al. (2018a); Zhang and Sennrich (2019); Kim et al. (2019), the average attention network Zhang et al. (2018), the merged attention network Zhang et al. (2019), dynamic convolution Wu et al. (2019), hybrid attentions Shazeer (2019); Wang et al. (2019), etc.

Nonetheless, the above methods still suffer from the inference bottleneck caused by the sequential nature of autoregressive models. Instead, Gu et al. (2018) propose non-autoregressive generation where target words are predicted independently, leading to great speedup, albeit at a high cost to generation quality. Follow-up studies often seek solutions to recover the performance Libovický and Helcl (2018); Guo et al. (2019); Shao et al. (2020); Ghazvininejad et al. (2020); Ran et al. (2020), but also reveal the trade-off between the quality and speed in terms of autoregressiveness. This motivates researchers to discover the optimal balance by resorting to semi-autoregressive modeling Wang et al. (2018); Stern et al. (2018), iterative refinement Lee et al. (2018); Stern et al. (2019); Ghazvininejad et al. (2019) or in-between Kaiser et al. (2018); Akoury et al. (2019).

We hypothesize that the generation order affects the felicity of the independence assumptions made in semi-autoregressive modelling. Unlike generation with flexible orders Emelianenko et al. (2019); Stern et al. (2019); Gu et al. (2019a), we employ a deterministic generation order for model simplicity and training efficiency, specifically focusing on bidirectional decoding. The study of bidirectional modeling dates back to the era of phrase-based statistical machine translation Watanabe and Sumita (2002); Finch and Sumita (2009) and recently gained popularity in neural machine translation Liu et al. (2016); Sennrich et al. (2016a); Zhang et al. (2019b, a); Zheng et al. (2019). Unfortunately, these methods either design complex neural decoders, which hurts training efficiency, and/or perform the left-to-right and right-to-left inference separately followed by rescoring, which slows down decoding. By contrast, our model speeds up inference while maintaining training speed.

Our work is closely related to SA Wang et al. (2018) and synchronous bidirectional generation Zhou et al. (2019). IBDecoder extends SA to incorporate information from different directions. In contrast to Zhou et al. (2019), we only make minimal changes to the standard Transformer decoder, which benefits efficiency during training and inference, and makes our method easy to implement. We also find improvements in both decoding speed and translation quality compared to Wang et al. (2018); Zhou et al. (2019).

3 Autoregressive Transformer

Transformer Vaswani et al. (2017), the state-of-the-art neural sequence generation model, follows the autoregressive factorization in Eq. 1. To handle the dependency of target word y_t on the previous target words y_{<t}, Transformer relies on a masked self-attention network in the decoder:

S^l = \mathrm{softmax}\left(\frac{(Y^{l-1}W_q)(Y^{l-1}W_k)^\top}{\sqrt{d_{\mathrm{model}}}} + M\right)\left(Y^{l-1}W_v\right) \quad (3)

where Y^{l-1} ∈ R^{N×d_model} is the output of the previous decoder layer, softmax denotes the softmax operation, d_model is the model dimension and l is the layer depth. W_q, W_k and W_v are trainable parameters.

The mask matrix M ∈ R^{N×N} limits the access of attention to only the past target words. Formally, given the target sequence length N, this matrix can be constructed by the following masking function:

M_{ij} = \begin{cases} 0, & \text{if } \lceil j/(d \cdot k) \rceil \le \lceil i/(d \cdot k) \rceil \\ -\infty, & \text{otherwise} \end{cases} \quad (4)

where 1 ≤ i, j ≤ N, d denotes the number of generation directions, and k is the number of target words predicted per direction. By default, the Transformer decoder is unidirectional and generates words one-by-one; thus d = 1 and k = 1, and the mask reduces to the standard causal mask (position i only attends to positions j ≤ i). The -∞ entries force the softmax to output a probability of 0, disabling invalid attention connections.
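For concreteness, here is a minimal NumPy sketch of one way to realize this masking function (our illustration, using 0-based positions and a group of d·k tokens generated per decoding step); it reduces to the standard causal mask for d = k = 1 and yields the relaxed masks used later in Eqs. 8, 11 and 12 for the other settings:

```python
import numpy as np

def attention_mask(n, d=1, k=1):
    """Mask of Eq. 4: query position i may attend to key position j iff
    j's group of d*k simultaneously generated tokens is not later than i's."""
    group = d * k
    i = np.arange(n)[:, None] // group   # group index of each query position
    j = np.arange(n)[None, :] // group   # group index of each key position
    return np.where(j <= i, 0.0, -np.inf)
```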

The input layer to Transformer’s decoder is the addition of the target word embedding E(y) and the word position encoding P(pos(y)), i.e. Y^0 = E(y) + P(pos(y)). pos(·) maps y to its word position sequence, which is a simple indexing function (Figure 1b):

\mathrm{pos}(\mathbf{y}) = \mathbf{p} = (1, 2, 3, \dots, N) \quad (5)

where p ∈ N^N. Transformer adopts the sinusoidal positional encoding to project these indices into real-valued embeddings, and uses the last-layer decoder output to predict the respective next target word. We explain next how to accelerate generation by reordering y and adjusting pos(·) and M.

4 Interleaved Bidirectional Decoder

The structure of Transformer is highly parallelizable, but the autoregressive schema blocks this parallelization during inference. We alleviate this barrier by exploring the alternative probabilistic factorization in Eq. 2, which allows words to be predicted from different directions simultaneously.

We propose IBDecoder as shown in Figure 1a. We reuse the standard decoder architecture in a bid to largely inherit Transformer’s parallelization and avoid extra computation or parameters, rather than devising dedicated decoder architectures Zhou et al. (2019); Zhang et al. (2020). To make the left-to-right and right-to-left generation collaborative, we reorganize the target sequence and the word positions as follows (purple and green rectangles in Figure 1a):

\hat{\mathbf{y}} = (y_1, y_N, y_2, y_{N-1}, y_3, y_{N-2}, \dots) \quad (6)
\hat{\mathbf{p}} = (1, -1, 2, -2, 3, -3, \dots) \quad (7)

By following the generation order defined by Eq. 2, the reordered sequence ŷ interleaves the left-to-right prefix and the right-to-left suffix, converting a bidirectional generation problem into a unidirectional one. We introduce negative positions in p̂ to retain the locality bias of the sinusoidal positional encodings within each direction.2 Compared to y and pos(y), the reorganized sequences have the same length and thus add no extra overhead.
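For illustration (our own sketch, not code from the paper), the reordering of Eqs. 6–7 can be computed in a few lines; note that sinusoidal encodings extend naturally to the negative positions, since sine and cosine accept negative arguments:

```python
def interleave(y):
    """Reorder a target sequence as in Eqs. 6-7:
    (y1, ..., yN) -> (y1, yN, y2, yN-1, ...) with positions (1, -1, 2, -2, ...)."""
    y_hat, p_hat = [], []
    left, right, step = 0, len(y) - 1, 1
    while left <= right:
        y_hat.append(y[left])
        p_hat.append(step)
        if left != right:                 # odd-length sequences have a single middle word
            y_hat.append(y[right])
            p_hat.append(-step)
        left, right, step = left + 1, right - 1, step + 1
    return y_hat, p_hat
```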

We also adapt the self-attention mask to permit step-wise bidirectional generation:

\hat{M}_{ij} = \begin{cases} 0, & \text{if } \lceil j/2 \rceil \le \lceil i/2 \rceil \\ -\infty, & \text{otherwise} \end{cases} \quad (8)

where IBDecoder has d = 2 generation directions. This corresponds to the relaxed causal mask of Wang et al. (2018), which ensures access to all predictions made in previous time steps3 and allows for interactions among the tokens produced at each time step. Although two words are predicted independently at each step, the adapted self-attention mask makes their corresponding decoding context complete; each word has full access to its decoding history, i.e. the left-to-right (y_{<t}) and right-to-left (y_{>N+1-t}) context. Apart from ŷ, p̂ and M̂, all other components of Transformer are kept intact, including the training objective.
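Putting the two sketches above together (hypothetical usage of the helper functions defined earlier, not an official API), the IBDecoder inputs and mask for a toy six-word target would be:

```python
y = ["we", "eat", "fish", "for", "dinner", "."]
y_hat, p_hat = interleave(y)
# y_hat = ['we', '.', 'eat', 'dinner', 'fish', 'for']
# p_hat = [1, -1, 2, -2, 3, -3]
mask = attention_mask(len(y_hat), d=2, k=1)   # relaxed causal mask of Eq. 8
```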

4.1 Beyond Two-Word Generation

Eq. 2 only supports two-word generation, which implies an upper bound of 2× speedup at inference. To raise this bound, we study strategies for multi-word generation, and explore two of them.

Multi-Directional Decoding

Similar to IBDecoder, IMDecoder also permutes the target sequence. It inserts multiple generation directions (i.e. increases d), with each direction producing one word per step (i.e. k = 1). As shown in Figure 1d, it splits the target sequence into several roughly equal segments and applies IBDecoder to each segment (an even d is therefore required). Formally, IMDecoder reframes the target sequence and word positions as follows:

\hat{\mathbf{y}} = \big(\hat{y}^1_1, \hat{y}^1_2, \hat{y}^2_1, \hat{y}^2_2, \dots, \hat{y}^{d/2}_1, \hat{y}^{d/2}_2,\; \hat{y}^1_3, \hat{y}^1_4, \dots\big) \quad (9)
\hat{p}^m_i = \Big(\underbrace{\lceil i/2 \rceil}_{\text{decoding step}},\; \underbrace{2(m-1) + ((i-1) \bmod 2) + 1}_{\text{generation direction}}\Big) \quad (10)

where ŷ^m_i denotes the i-th word of ŷ^m, the m-th segment of y reordered by IBDecoder (d/2 segments in total). Eq. 10 decomposes the word position into two parts: the first represents the index of the decoding step at which each word is predicted; the second denotes the generation direction the target word belongs to. Specifically, we record the corresponding direction indices and add a group of trainable direction embeddings (red rectangles in Figure 1d) to the decoder input. IMDecoder uses the following self-attention mask:

\hat{M}_{ij} = \begin{cases} 0, & \text{if } \lceil j/d \rceil \le \lceil i/d \rceil \\ -\infty, & \text{otherwise} \end{cases} \quad (11)
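The segment-wise construction can be sketched as follows (one plausible reading of Eqs. 9–10, purely illustrative; the paper's exact position layout may differ): split the target into d/2 segments, reorder each with the IBDecoder interleaving above, and emit one word per generation direction at every step.

```python
def interleave_multi(y, d):
    """Illustrative IMDecoder-style reframing: split y into d/2 segments,
    reorder each with interleave(), and group one word per direction per step."""
    assert d % 2 == 0
    m = d // 2
    size = -(-len(y) // m)                           # ceiling division
    segments = [interleave(y[i * size:(i + 1) * size])[0] for i in range(m)]
    y_hat, steps, dirs = [], [], []
    t = 0
    while any(t < len(seg) for seg in segments):
        for s, seg in enumerate(segments):
            for offset in (0, 1):                    # the segment's two directions
                if t + offset < len(seg):
                    y_hat.append(seg[t + offset])
                    steps.append(t // 2 + 1)         # decoding step index
                    dirs.append(2 * s + offset + 1)  # generation direction index
        t += 2
    return y_hat, steps, dirs
```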
1:  Input: decoder dec, beam size B, word number n, maximum length T
2:  Output: top-B finished hypotheses F
3:  ▷ initial hypothesis (n start symbols, score 0)
4:  H ← {((⟨s⟩, …, ⟨s⟩), 0)}
5:  F ← ∅; H′ ← ∅
6:  t ← 1
7:  while t ≤ T and H ≠ ∅ do
8:      for (h, s) ∈ H do
9:          ▷ W, P: top-B words and their log-probabilities for each of the n positions
10:         W, P ← top-B(dec(h))
11:         ▷ ⊕: outer addition for vectors
12:         S ← s + P_1 ⊕ P_2 ⊕ ⋯ ⊕ P_n
13:         ▷ extract words by index, |C| = B
14:         C ← backtrace(top-B(S), W)
15:         for (c, s′) ∈ C do
16:             ▷ meets the end-of-hypothesis condition
17:             if finish(c) then
18:                 add (h ∘ c, s′) to F
19:             else
20:                 add (h ∘ c, s′) to H′
21:             end if
22:         end for
23:     end for
24:     ▷ prune to keep the top-B hypotheses
25:     H ← top-B(H′); H′ ← ∅; t ← t + 1
26: end while
27: ▷ post(·): post-process hypotheses to recover the original word order
28: return post(F), sorted by score
Algorithm 1: Beam search with step-wise multi-word generation. ∘ denotes concatenation of a hypothesis with the n newly predicted words.

Semi-Autoregressive Decoding

Instead of partitioning the target sequence, another option is to produce multiple target words per direction at each step (i.e. increase k; Wang et al., 2018). SA assumes that neighbouring words are conditionally independent, despite the fact that tokens in natural language are typically highly inter-dependent.

We combine SA with IBDecoder (Figure 1e) with the expectation that producing 2 neighbouring tokens independently per direction is less harmful than producing 4 neighbouring words in parallel in a single direction. We reuse the sequence ŷ and the positions p̂ for the decoder input, but enlarge the attention range in the self-attention mask to support multi-word generation (Figure 1f):

\hat{M}_{ij} = \begin{cases} 0, & \text{if } \lceil j/(2k) \rceil \le \lceil i/(2k) \rceil \\ -\infty, & \text{otherwise} \end{cases} \quad (12)

4.2 Inference

To handle multiple predicted words per decoding step simultaneously, we adjust the beam search algorithm as in Algorithm 1. For each partial hypothesis h, we predict n = d·k words in parallel. We thus first extract the top-B scoring predictions for each of the n positions (line 10), followed by pruning the resulting search space of size B^n down to size B through an outer-addition operation (line 12). The pruned scores (line 12) and the backtraced words (line 14) are then used for normal beam-search bookkeeping. Note that each complete hypothesis requires a simple deterministic post-processing step to recover its original word order (line 28). In contrast to Zhou et al. (2019), we do not separate the left-to-right beam from the right-to-left beam.
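The pruning step of Algorithm 1 (lines 10–14) can be sketched with NumPy as follows (our illustration; log_probs stands for the decoder's log-probabilities at the n positions predicted in parallel, and the function names are not from the released code):

```python
import numpy as np

def expand_hypotheses(score, log_probs, beam_size):
    """One expansion step for a single partial hypothesis.

    score:     running log-probability of the partial hypothesis
    log_probs: array of shape (n, V), log-probabilities for the n positions
               predicted in parallel at this step
    returns:   (scores, words) of the top-`beam_size` joint continuations,
               where words has shape (beam_size, n).
    """
    n, _ = log_probs.shape
    # line 10: keep only the top-B candidate words per position
    top_idx = np.argsort(-log_probs, axis=-1)[:, :beam_size]      # (n, B)
    top_lp = np.take_along_axis(log_probs, top_idx, axis=-1)      # (n, B)
    # line 12: outer addition scores the B^n joint space
    joint = score + top_lp[0]
    for p in range(1, n):
        joint = joint[..., None] + top_lp[p]                      # (B, ..., B)
    flat = joint.reshape(-1)
    best = np.argsort(-flat)[:beam_size]
    # line 14: backtrace the chosen word ids for each kept combination
    combo = np.stack(np.unravel_index(best, (beam_size,) * n), axis=-1)  # (B, n)
    words = top_idx[np.arange(n), combo]                          # (B, n)
    return flat[best], words
```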

End-of-Hypothesis Condition

With multiple predicted target words, determining whether one hypothesis is complete or not becomes challenging. We adopt a simple strategy: one hypothesis is assumed complete once any word in the predictions hits the end-of-sentence symbol (“[/s]”) (line 17). We leave the study of alternatives for the future.
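For completeness, a minimal sketch (our illustration, ignoring end-of-sentence trimming) of the end-of-hypothesis check and of the deterministic post-processing that inverts Eq. 6:

```python
EOS = "[/s]"

def finish(step_words):
    """A hypothesis is complete once any word predicted at this step is EOS."""
    return EOS in step_words

def deinterleave(y_hat):
    """Invert Eq. 6: merge the left-to-right and right-to-left halves."""
    left = y_hat[0::2]           # y1, y2, ...
    right = y_hat[1::2]          # yN, yN-1, ...
    return left + right[::-1]
```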

ID  Model        Beam  d  k  BLEU  +KD   Latency  Speedup  Train
1   Transformer  4     1  1  26.9  27.3  387      1.00×    1.00×
                 1     1  1  26.0  26.8  294      1.32×
2   IBDecoder    4     2  1  26.2  27.1  204      1.90×    0.98×
                 1     2  1  25.0  26.8  166      2.33×
3   2 + SA       4     2  2  23.0  26.3  117      3.31×    0.98×
                 1     2  2  21.7  26.0  89       4.35×
4   IMDecoder    4     4  1  21.5  24.6  102      3.79×    0.98×
                 1     4  1  19.7  24.1  85       4.55×
Table 1: Performance on WMT14 En-De for different models with respect to beam size, generation direction number (d, Eq. 4) and number of tokens predicted per direction and step (k, Eq. 4). BLEU: detokenized BLEU for models trained from scratch; +KD: detokenized BLEU for models trained with knowledge distillation. Latency (in milliseconds) and Speedup are evaluated by decoding the test set with a batch size of 1, averaged over three runs. We report the latency and speedup of 2⃝, 3⃝ and 4⃝ trained with KD. Train compares the training speed averaged over 100 steps. Time is measured on a GeForce GTX 1080.

5 Experiments

Setup

We test our model on machine translation (MT) and document summarization. We train MT models on five different language pairs: WMT14 English-German (En-De, Bojar et al., 2014), WMT14 English-French (En-Fr, Bojar et al., 2014), WMT16 Romanian-English (Ro-En, Bojar et al., 2016), WMT18 English-Russian (En-Ru, Bojar et al., 2018) and WAT17 Small-NMT English-Japanese (En-Ja, Nakazawa et al., 2017). Translation quality is measured by BLEU Papineni et al. (2002), and we report detokenized BLEU using the toolkit sacreBLEU Post (2018)4 except for En-Ja. Following Gu et al. (2019b), we segment Japanese text with KyTea5 and compute tokenized BLEU. We train document summarization models on two benchmark datasets: the non-anonymized version of the CNN/Daily Mail dataset (CDMail, Hermann et al., 2015) and the Annotated English Gigaword (Gigaword, Rush et al., 2015). We evaluate the summarization quality using ROUGE-L Lin (2004).

We provide details of data preprocessing and model settings in Appendix A. We perform thorough analysis of our model on WMT14 En-De. We also report results improved by knowledge distillation (KD, Kim and Rush, 2016).

5.1 Results on WMT14 En-De

Table 1 compares the performance of our models on WMT14 En-De. Relaxing the autoregressiveness with IBDecoder yields slightly worse translation quality compared to Transformer (-0.7 BLEU, 1⃝→2⃝, w/o KD). Unlike Zhang et al. (2020), we observe no quality improvement, but our model delivers a speedup of 1.90×–2.33× at inference, clearly surpassing the simple greedy decoding baseline (1.32×) and BIFT (0.89×; Zhang et al., 2020). The dropped quality is easily recovered with knowledge distillation (+0.2 BLEU, 1⃝→2⃝, w/ KD).

Going beyond two-word generation, which strengthens the independence assumptions, greatly decreases performance (2⃝→3⃝,4⃝, w/o KD) while enlarging the speedup to 3.3×–4.5×. Compared to SA, the quality degradation with IMDecoder is larger, both w/ and w/o KD. We ascribe this to the difficulty of structure planning, as IMDecoder has to guess words in the middle of the sequence at the start of generation. We employ SA for the following experiments.

In contrast to existing work Zhang et al. (2018, 2019a, 2020); Zhou et al. (2019), our models only marginally affect training efficiency (0.98× vs. 0.61× for Zhang et al. (2020)), and require no extra linguistic information Akoury et al. (2019). Our results also suggest that the degree to which each model benefits from KD varies; follow-up studies should report performance both w/ and w/o KD.

ID  Model                   d  k  BLEU
1   IBDecoder               2  1  26.2
2   1⃝ + vanilla mask       2  1  25.7
3   1⃝ + vanilla positions  2  1  25.9
4   1⃝ + middle-to-side     2  1  20.7
5   1⃝ + indep. directions  2  1  23.9
6   vanilla SA              1  2  24.1
7   1⃝ + SA                 2  2  23.0
8   vanilla SA              1  4  18.7
Table 2: Ablation study on WMT14 En-De. Beam size 4. All models are trained from scratch. vanilla mask / vanilla positions: the self-attention mask (M, Eq. 4) and word positions (pos, Eq. 5) of the standard Transformer. middle-to-side: generate words from the middle of the sequence towards its two ends, the reverse of IBDecoder. indep. directions: disable cross-direction interaction. vanilla SA: predict multiple target words per step in a single direction Wang et al. (2018).
                     Left-to-Right  Bidirectional
Autoregressive       4.04           4.86
Semi-Autoregressive  6.95           4.72
Estimated PMI        0.235          -0.014
Table 3: Perplexity of autoregressive and semi-autoregressive models under different factorizations, and the estimated average point-wise mutual information (PMI) between words that are predicted independently. Measured on the WMT14 En-De test set. Left-to-Right and Bidirectional refer to the generation orders of Eq. 1 and Eq. 2, respectively; Autoregressive models predict the words of each step jointly, while Semi-autoregressive models predict them independently. The estimated PMI shows that the inter-dependence of word pairs predicted in parallel by vanilla SA is stronger than for those predicted simultaneously by IBDecoder.

Ablation Study

We carry out an ablation study as shown in Table 2. Replacing the attention mask with the vanilla one (1⃝→2⃝) introduces unnecessary independence assumptions and reduces performance by 0.5 BLEU. Using vanilla positional encodings (3⃝) also reduces performance, by 0.3 BLEU, indicating that we benefit from preserving the locality bias of the sinusoidal encodings within each direction. Changing the generation direction from side-to-middle (1⃝) to middle-to-side (4⃝) dramatically increases the learning difficulty (-5.5 BLEU).

In IBDecoder, the two translation directions are interlinked, i.e. predictions are conditioned on the history of both directions. We can remove cross-direction attention, essentially forcing the model to produce the left and right halves of the sequence independently. Such independent generation performs poorly (-2.3 BLEU, 1⃝→5⃝), which underlines the importance of using bidirectional context and resonates with the finding of Zhou et al. (2019).

Vanilla SA vs. IBDecoder

Our IBDecoder shares architectural properties with vanilla SA Wang et al. (2018), namely the independent generation of two tokens per time step and the adapted self-attention mask, but the two crucially differ in their generation order and independence assumptions: vanilla SA operates from left to right, whereas IBDecoder interleaves left-to-right and right-to-left decoding.

Our ablation results in Table 2 show that IBDecoder substantially outperforms vanilla SA (+2.1/+4.3 BLEU, 1⃝ vs. 6⃝ / 7⃝ vs. 8⃝). To further investigate the difference in independence assumptions between the two approaches, we compare the estimated point-wise mutual information (PMI) of the words that are predicted independently by IBDecoder and by vanilla SA.6 Results in Table 3 show that the PMI in IBDecoder (-0.014) is significantly smaller than that in vanilla SA (0.235), supporting our assumption that distant words are less inter-dependent on average. This also explains the smaller quality loss of IBDecoder compared to vanilla SA.

Model            Layers/d/k  BLEU  Speedup
Transformer      6/1/1       26.9  1.00×
 + student       2/1/1       26.0  2.19×
 + KD            2/1/1       26.7  2.32×
IBDecoder        6/2/1       26.2  1.90×
 + student       2/2/1       25.0  4.29×
 + KD            2/2/1       26.6  4.41×
IBDecoder + SA   6/2/2       23.0  3.31×
 + student       2/2/2       21.5  7.13×
 + KD            2/2/2       24.5  7.24×
Table 4: Detokenized BLEU and decoding speedup of student models on WMT14 En-De with reduced decoder depth (encoder depth remains constant). Configurations are given as decoder layers/d/k. Beam size 4.

On Teacher-Student Model

One classical approach to improving decoding efficiency is training a small student model w/ KD. Results in Table 4 support this: a Transformer student trained w/ KD achieves similar performance but runs 2.32× faster, even better than IBDecoder (1.90×). Combining the student schema with IBDecoder increases the speedup to 4.41× without hurting performance (26.6 BLEU, w/ KD). In exchange for 2.4 BLEU, we can reach 7.24× faster decoding with SA. The compatibility of our model with the teacher-student framework reflects the generality of our bidirectional modeling. The results also demonstrate that efficiency improvements from faster autoregressive decoding, here obtained by reducing the number of decoder layers7, and from bidirectional decoding are orthogonal.

Figure 2: Speedup against Transformer vs. batch size and beam size on WMT14 En-De. Comparison is conducted under the same batch size and beam size. IBDecoder (+SA) is trained with KD. Our model consistently accelerates decoding.

Impact of Batch and Beam Size

Figure 2 shows speedups over a standard Transformer with varying batch and beam sizes. At small batch sizes, increasing the beam size improves the relative speedup, while the effect becomes negative at large batch sizes. Overall, our model is consistently faster than Transformer at inference, regardless of the batch and beam size.

Figure 3: BLEU (solid lines, left) and speedup (dashed lines, right) as a function of source sentence length on WMT14 En-De. We sort the test set according to the source sentence length and uniformly divide it into 10 bins (274 sentences each). IBDecoder (+SA) is trained with KD. Beam size 4.

Impact of Source Sentence Length

Although translation quality fluctuates with the source sentence length, Figure 3 shows that our model shares the same performance pattern as the baseline. With respect to speedup, our model performs better when translating longer source sentences.

Figure 4: BLEU versus speedup (left) and OTEM (right) for different k on WMT14 En-De. Generation directions: d = 2. Beam size 4. OTEM: a metric measuring the degree of over-translation Yang et al. (2018). Larger k means more independence between neighbouring tokens and results in more severe over-translation. Deduplication (Dedu) improves translation quality for large k.

Effect of k

Results in Figure 4 show that k controls the trade-off between translation quality and speedup. With larger k, more target tokens are predicted per decoding direction, leading to higher speedup but causing a larger performance drop both w/ and w/o KD. Further analysis reveals that, as the dependency between predicted target words weakens, our model suffers from a more serious over-translation issue, yielding larger OTEM Yang et al. (2018). Although n-gram deduplication slightly improves quality8, it does not explain the whole performance drop, echoing Wang et al. (2018). We recommend k = 2 for a good balance. In addition, the reduction of OTEM by KD in Figure 4 partially explains its quality improvement.

Figure 5: Accuracy of different models over distances on the subject-verb agreement task in Lingeval97.
Model                              SacreBLEU  BLEU   SU
Existing work
 SAT Wang et al. (2018)            -          26.09  2.07×
 SBSG Zhou et al. (2019)           -          27.22  1.61×
 SynST Akoury et al. (2019)        -          20.74  4.86×
 Levenshtein Gu et al. (2019b)     -          27.27  4.01×
 CMLM Ghazvininejad et al. (2019)  -          27.03  -
 AXE Ghazvininejad et al. (2020)   -          23.53  -
This work
 IBDecoder                         25.0       25.73  2.48×
  w/ SA*                           22.3       22.95  4.53×
  w/ student                       25.0       25.33  4.29×
 IBDecoder (KD)                    26.8       27.50  2.33×
  w/ SA (KD)*                      26.0       26.84  4.35×
  w/ student (KD)                  26.6       27.00  4.41×
Table 5: Comparison to several recent fast sequence generation models on WMT14 En-De. KD: trained with knowledge distillation. BLEU: tokenized BLEU; SacreBLEU: detokenized BLEU (this work only). *: deduplication applied. SU: speedup.

Analysis on Long-range Dependency

We adopt the subject-verb agreement task from Lingeval97 Sennrich (2017) for analysis. We can see from the results in Figure 5 that IBDecoder performs similarly to the original Transformer for agreement over short distances, but agreement over longer distances drops on average. In contrast, models that include SA show steep drops in accuracy for short distances.

Curiously, KD seems to harm agreement scores even though it led to higher BLEU. Overall, these results suggest that BLEU does not show the full quality loss incurred by our independence assumptions. This deficiency also provides evidence for the performance drop in Figure 4.

Comparison to Previous Work

Results in Table 5 show that our model outperforms SynST Akoury et al. (2019) in quality and slightly surpasses the Levenshtein Transformer Gu et al. (2019b) in speed. In particular, our KD-trained model (27.50 tokenized BLEU, 2.33×) surpasses SAT Wang et al. (2018) (26.09, 2.07×) and SBSG Zhou et al. (2019) (27.22, 1.61×) in terms of both quality and speed. Our model does not rely on extra linguistic knowledge Akoury et al. (2019), nor does it require complex pseudo training data construction Gu et al. (2019b). Compared to these prior studies, our approach is simple but effective.

                                    Machine Translation                      Document Summarization
Model              KD      En-Fr      Ro-En      En-Ru      En-Ja       Gigaword   CDMail
Beam size 4
Quality
 Transformer       no      32.1       32.7       27.7       43.97       35.03      36.88
 IBDecoder         no      32.1       33.3       27.0       43.51       34.57      36.11
  + SA             no      30.3       31.3       25.0       41.75       33.65      35.27
 IBDecoder         yes     32.7       33.5       27.5       43.76       35.12      36.46
  + SA             yes     31.3       32.7       26.4       42.99       34.74      36.27
Latency/Speedup
 IBDecoder         yes     231/1.75×  205/1.79×  204/1.82×  157/1.86×   83/2.35×   657/3.02×
  + SA             yes     119/3.41×  109/3.37×  112/3.30×  94/3.10×    47/4.20×   303/6.55×
Beam size 1
Quality
 Transformer       no      31.6       32.3       27.8       42.95       34.88      34.51
 IBDecoder         no      31.7       32.6       26.8       43.29       34.22      36.74
  + SA             no      29.0       30.4       24.3       41.05       33.25      35.04
 IBDecoder         yes     32.2       33.2       28.2       43.79       35.18      37.03
  + SA             yes     30.7       32.4       26.5       42.70       34.63      36.39
Latency/Speedup
 Transformer       no      357/1.14×  333/1.10×  342/1.09×  260/1.12×   157/1.24×  1447/1.37×
 IBDecoder         yes     186/2.18×  154/2.37×  157/2.37×  121/2.40×   56/3.51×   312/6.36×
  + SA             yes     96/4.20×   88/4.17×   90/4.14×   67/4.34×    34/5.83×   178/11.15×
Table 6: Generation quality (BLEU for MT, ROUGE-L for summarization) and latency (ms)/speedup on different tasks, for beam sizes 4 and 1. We compare IBDecoder (+SA) with Transformer; speedups are relative to the Transformer with beam size 4.

5.2 Results on Other Tasks

Table 6 shows MT results for other translation directions, and for document summarization. Despite syntactic, morphological, script and sequence-length differences, our model achieves comparable generation quality and 1.75×–11.15× speedups across the different tasks. With KD, our model even outperforms the Transformer baseline on 5 out of 6 tasks. In particular, our model succeeds on the CDMail task, which previous non-autoregressive models rarely attempt due to its lengthy target sequences, although our model suffers from the long-range dependency issue shown in Figure 5.

6 Conclusion and Future Work

We present interleaved bidirectional sequence generation to accelerate decoding by enabling generation from the left-to-right and right-to-left directions simultaneously. We combine the strengths of SBSG Zhou et al. (2019) and SA Wang et al. (2018), and propose a simple interleaved bidirectional decoder (IBDecoder) that can be easily implemented on top of a standard unidirectional decoder, such as the Transformer, by interleaving the target sequence and adapting the word positions and self-attention masks. IBDecoder inherits Transformer’s training parallelization with no additional model parameters, and is extensible with SA and multi-directional decoding. We show that the independence assumptions we introduce between the two directions are less harmful to translation quality than the independence assumptions in left-to-right SA. On a series of generation tasks, we report comparable quality with significant inference speedups (up to 4×–11×) and little training overhead. We also show that the approach is orthogonal to other ways of speeding up autoregressive decoding, e.g. reducing model size.

In the future, we would like to further improve multi-directional generation, and will investigate alternative ways to partition the target sequence and encode positional information. We are also interested in better measuring and reducing the quality loss resulting from long-distance dependencies. Finally, we would like to adapt our interleaving approach to other sequence-to-sequence architectures.

Acknowledgments

This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (http://www.csd3.cam.ac.uk/), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk). Ivan Titov acknowledges support of the European Research Council (ERC Starting grant 678254) and the Dutch National Science Foundation (NWO VIDI 639.022.518). Rico Sennrich acknowledges support of the Swiss National Science Foundation (MUTAMUR; no. 176727).

Appendix A Data Preprocessing and Model Settings

We use the provided pre-processed data for WAT17 En-Ja. For the other tasks, we apply byte pair encoding Sennrich et al. (2016b) with a joint vocabulary size of 32K, except for WMT18 En-Ru (48K). We experiment with Transformer Base Vaswani et al. (2017): 6 encoder and decoder layers, model dimension 512, 8 attention heads and an FFN size of 2048. Dropout with rate 0.1 is applied to residual connections and attention weights. We employ Adam Kingma and Ba (2015) for parameter optimization with a scheduled learning rate (4K warm-up steps). Gradients are estimated over roughly 25K target subwords per update. We average the last 5 checkpoints for evaluation, and use beam search (beam size 4, length penalty 0.6) by default for inference.

Appendix B Estimation of the PMI

To estimate the average point-wise mutual information (PMI) in Table 3, we compare IBDecoder and vanilla SA to their autoregressive counterparts in terms of test perplexity (ppl). Taking SA (k = 2) as an example, we have:

\mathrm{PMI} \approx \log \mathrm{ppl}(\mathrm{SA}) - \log \mathrm{ppl}(\mathrm{Base}) \quad (13)

where Base denotes the baseline Transformer. The intuition behind our estimation is that the Transformer handles neighbouring words (y_t, y_{t+1}) autoregressively and thus models their joint probability: p(y_t | context) · p(y_{t+1} | y_t, context) = p(y_t, y_{t+1} | context). Instead, vanilla SA predicts those words independently, i.e. p(y_t | context) · p(y_{t+1} | context). Comparing the perplexities of SA and Transformer therefore gives an estimate of the average PMI between the independently predicted words. The estimation for IBDecoder follows the same spirit.
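As a sanity check on this reconstruction (assuming base-10 logarithms, which is what the reported values appear to use), plugging the test perplexities from Table 3 into Eq. 13 approximately reproduces the PMI estimates:

\begin{align*}
\text{left-to-right:}\quad & \log_{10} 6.95 - \log_{10} 4.04 \approx 0.842 - 0.606 = 0.236 \approx 0.235\\
\text{bidirectional:}\quad & \log_{10} 4.72 - \log_{10} 4.86 \approx 0.674 - 0.687 = -0.013 \approx -0.014
\end{align*}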

Footnotes

  1. Source code is released at https://github.com/bzhangGo/zero.
  2. Consider Figure 1a. We cannot reorder position encodings along with embeddings (1,6,2,5,…) because we do not know the sentence length at test time. Simply using vanilla position encodings (1,2,3,4,…) would increase the embedding distance between positions within a direction.
  3. Note that with two tokens produced per time step, decoder inputs are shifted by two.
  4. Signature BLEU+c.mixed+#.1+s.exp+tok.13a+v.1.4.3
  5. http://www.phontron.com/kytea/
  6. Details about PMI estimation are given in Appendix B
  7. Also note the concurrent work by Kasai et al. (2020).
  8. We only applied deduplication for the results in Figure 4.

References

  1. Syntactically supervised transformers for faster neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1269–1281. External Links: Link, Document Cited by: §2, §5.1, §5.1, Table 5.
  2. Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, External Links: Link Cited by: §1.
  3. Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532. Cited by: §2.
  4. Findings of the 2014 workshop on statistical machine translation. In Proceedings of the Ninth Workshop on Statistical Machine Translation, Baltimore, Maryland, USA, pp. 12–58. External Links: Link Cited by: §5.
  5. Findings of the 2016 conference on machine translation. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 131–198. External Links: Link, Document Cited by: §5.
  6. Findings of the 2018 conference on machine translation (WMT18). In Proceedings of the Third Conference on Machine Translation: Shared Task Papers, Belgium, Brussels, pp. 272–303. External Links: Link, Document Cited by: §1, §5.
  7. Sequence modeling with unconstrained generation order. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox and R. Garnett (Eds.), pp. 7700–7711. External Links: Link Cited by: §2.
  8. Bidirectional phrase-based statistical machine translation. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, Singapore, pp. 1124–1132. External Links: Link Cited by: §2.
  9. Aligned cross entropy for non-autoregressive machine translation. ArXiv abs/2004.01655. Cited by: §2, Table 5.
  10. Mask-predict: parallel decoding of conditional masked language models. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 6112–6121. External Links: Link, Document Cited by: §2, Table 5.
  11. Non-autoregressive neural machine translation. In International Conference on Learning Representations, External Links: Link Cited by: §1, §2.
  12. Insertion-based decoding with automatically inferred generation order. Transactions of the Association for Computational Linguistics 7 (), pp. 661–676. External Links: Document, Link, https://doi.org/10.1162/tacl_a_00292 Cited by: §2.
  13. Levenshtein transformer. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d\textquotesingleAlché-Buc, E. Fox and R. Garnett (Eds.), pp. 11181–11191. External Links: Link Cited by: §5.1, §5, Table 5.
  14. Non-autoregressive neural machine translation with enhanced decoder input. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 3723–3730. Cited by: §2.
  15. Findings of the third workshop on neural generation and translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 1–14. External Links: Link, Document Cited by: §2.
  16. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett (Eds.), pp. 1693–1701. External Links: Link Cited by: §5.
  17. Improved beam search with constrained softmax for nmt. Proceedings of MT Summit XV, pp. 297. Cited by: §2.
  18. Fast decoding in sequence models using discrete latent variables. In Proceedings of the 35th International Conference on Machine Learning, J. Dy and A. Krause (Eds.), Proceedings of Machine Learning Research, Vol. 80, Stockholmsmässan, Stockholm Sweden, pp. 2390–2399. External Links: Link Cited by: §2.
  19. A comparative study on transformer vs rnn in speech applications. In 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. , pp. 449–456. Cited by: §1.
  20. Deep encoder, shallow decoder: reevaluating the speed-quality tradeoff in machine translation. External Links: 2006.10369 Cited by: footnote 7.
  21. Sequence-level knowledge distillation. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 1317–1327. External Links: Link, Document Cited by: §1, §2, §5.
  22. From research to production and back: ludicrously fast neural machine translation. In Proceedings of the 3rd Workshop on Neural Generation and Translation, Hong Kong, pp. 280–288. External Links: Link, Document Cited by: §2.
  23. Adam: a method for stochastic optimization. In International Conference on Learning Representations, Cited by: Appendix A.
  24. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1173–1182. External Links: Link, Document Cited by: §2.
  25. End-to-end non-autoregressive neural machine translation with connectionist temporal classification. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 3016–3021. External Links: Link, Document Cited by: §2.
  26. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out, Barcelona, Spain, pp. 74–81. External Links: Link Cited by: §5.
  27. Agreement on target-bidirectional neural machine translation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 411–416. External Links: Link, Document Cited by: §2.
  28. Overview of the 4th workshop on Asian translation. In Proceedings of the 4th Workshop on Asian Translation (WAT2017), Taipei, Taiwan, pp. 1–54. External Links: Link Cited by: §5.
  29. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, pp. 311–318. External Links: Link, Document Cited by: §5.
  30. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers, Belgium, Brussels, pp. 186–191. External Links: Link Cited by: §5.
  31. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints. External Links: 1910.10683 Cited by: §1.
  32. Learning to recover from multi-modality errors for non-autoregressive neural machine translation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Online, pp. 3059–3069. External Links: Link, Document Cited by: §2.
  33. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389. External Links: Link, Document Cited by: §5.
  34. Compression of neural machine translation models via pruning. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 291–301. External Links: Link, Document Cited by: §2.
  35. Edinburgh neural machine translation systems for WMT 16. In Proceedings of the First Conference on Machine Translation: Volume 2, Shared Task Papers, Berlin, Germany, pp. 371–376. External Links: Link, Document Cited by: §2.
  36. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1715–1725. External Links: Link, Document Cited by: Appendix A.
  37. How grammatical is character-level neural machine translation? assessing MT quality with contrastive translation pairs. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, Valencia, Spain, pp. 376–382. External Links: Link Cited by: §5.1.
  38. Minimizing the bag-of-ngrams difference for non-autoregressive neural machine translation. In Proceedings of the AAAI Conference on Artificial Intelligence, Cited by: §2.
  39. Fast transformer decoding: one write-head is all you need. arXiv preprint arXiv:1911.02150. Cited by: §2.
  40. Semantic neural machine translation using amr. Transactions of the Association for Computational Linguistics 7 (), pp. 19–31. External Links: Document, Link, https://doi.org/10.1162/tacl_a_00252 Cited by: §1.
  41. Insertion transformer: flexible sequence generation via insertion operations. In International Conference on Machine Learning, pp. 5976–5985. Cited by: §2, §2.
  42. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems 31, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi and R. Garnett (Eds.), pp. 10086–10095. External Links: Link Cited by: §1, §2.
  43. Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: Appendix A, §1, §1, §2, §3.
  44. Accelerating transformer decoding via a hybrid of self-attention and recurrent neural network. arXiv preprint arXiv:1909.02279. Cited by: §2.
  45. Semi-autoregressive neural machine translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 479–488. External Links: Link, Document Cited by: §1, §1, §2, §2, §4.1, §4, §5.1, §5.1, §5.1, Table 2, Table 5, §6.
  46. Bidirectional decoding for statistical machine translation. In COLING 2002: The 19th International Conference on Computational Linguistics, External Links: Link Cited by: §2.
  47. Pay less attention with lightweight and dynamic convolutions. In International Conference on Learning Representations, External Links: Link Cited by: §2.
  48. Google’s neural machine translation system: bridging the gap between human and machine translation. CoRR abs/1609.08144. External Links: Link Cited by: §2.
  49. Otem&Utem: over-and under-translation evaluation metric for nmt. In CCF International Conference on Natural Language Processing and Chinese Computing, pp. 291–302. Cited by: Figure 4, §5.1.
  50. A lightweight recurrent network for sequence modeling. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 1538–1548. External Links: Link, Document Cited by: §2.
  51. Improving deep transformer with depth-scaled initialization and merged attention. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 898–909. External Links: Link, Document Cited by: §2.
  52. Simplifying neural machine translation with addition-subtraction twin-gated recurrent networks. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4273–4283. External Links: Link, Document Cited by: §2.
  53. Future-aware knowledge distillation for neural machine translation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 27 (12), pp. 2278–2287. External Links: Document Cited by: §1, §1, §2, §5.1.
  54. Accelerating neural transformer via an average attention network. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 1789–1798. External Links: Link, Document Cited by: §2.
  55. Synchronous bidirectional inference for neural sequence generation. Artificial Intelligence 281, pp. 103234. External Links: ISSN 0004-3702, Document, Link Cited by: §1, §1, §4, §5.1, §5.1.
  56. Speeding up neural machine translation decoding by cube pruning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 4284–4294. External Links: Link, Document Cited by: §2.
  57. Asynchronous bidirectional decoding for neural machine translation. In Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §1, §5.1.
  58. Regularizing neural machine translation by target-bidirectional agreement. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 443–450. Cited by: §2.
  59. Dynamic past and future for neural machine translation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 931–941. External Links: Link, Document Cited by: §2.
  60. Sequence generation: from both sides to the middle. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pp. 5471–5477. Cited by: §1, §1, §2, §4.2, §4, §5.1, §5.1, §5.1, Table 5, §6.