Insertion-based Decoding with Automatically Inferred Generation Order
Abstract
Conventional neural autoregressive decoding commonly assumes a left-to-right generation order. In this work, we propose a novel decoding algorithm, INDIGO, which supports flexible generation in an arbitrary order with the help of insertion operations. We use the Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a predefined generation order or an adaptive order searched based on the model's own preference. Experiments on three real-world tasks, including machine translation, word order recovery and code generation, demonstrate that our algorithm can generate sequences in an arbitrary order, while achieving competitive or even better performance than conventional left-to-right generation. Case studies show that INDIGO adopts adaptive generation orders based on input information.
1 Introduction
Neural autoregressive models have become the de facto standard in a wide range of sequence generation tasks, such as machine translation (Bahdanau et al., 2014), summarization (Rush et al., 2015) and dialogue systems (Vinyals and Le, 2015). In these studies, a sequence is almost always modeled autoregressively with a left-to-right generation order, which raises the question of whether generation in an arbitrary order is worth considering (Vinyals et al., 2015; Ford et al., 2018). Nevertheless, previous studies on generation order mostly resort to a fixed set of generation orders, showing that particular choices of ordering are helpful (Wu et al., 2018; Ford et al., 2018; Mehri and Sigal, 2018) without providing an efficient algorithm for finding an adaptive generation order, or restrict the problem scope to n-gram segment generation (Vinyals et al., 2015).
In this paper, we first propose a novel decoding algorithm, INsertion-based Decoding with Inferred Generation Order (INDIGO), which treats generation orders as latent variables that can be automatically inferred from the input information. Since absolute positions are unknown before the whole sequence has been generated, we capture generation orders with a relative representation of positions, which is equivalent to decoding with insertion operations. A demonstration is shown in Fig. 1.
We implement the algorithm with the Transformer (Vaswani et al., 2017), a state-of-the-art sequence generation model, where the generation order is directly captured as relative positions through self-attention, inspired by Shaw et al. (2018). We maximize the evidence lower bound (ELBO) of the original objective function and study two types of approximate posterior distributions over generation orders to learn a sequence generation model that uses the proposed INDIGO decoding scheme.
Experimental results on machine translation, code generation and word order recovery demonstrate that our algorithm can generate sequences in arbitrary orders, while achieving competitive or even better performance than conventional left-to-right generation. Case studies show that the proposed method adopts adaptive orders based on input information.
2 Neural Autoregressive Decoding
Let us consider the problem of generating a sequence conditioned on some input information, e.g., a source sequence $x = (x_1, \ldots, x_{T'})$. Our goal is to build a model, parameterized by $\theta$, of the conditional probability of the target sequence $y = (y_1, \ldots, y_T)$ given $x$, which is factorized as:
$p_\theta(y \mid x) = \prod_{t=0}^{T} p_\theta(y_{t+1} \mid y_{0:t}, x)$  (1)
where $y_0$ and $y_{T+1}$ are the special tokens $\langle s \rangle$ and $\langle /s \rangle$, respectively. The model sequentially predicts the conditional probability of the next token at each step $t$, which can be implemented by any function approximator such as RNNs (Bahdanau et al., 2014) or the Transformer (Vaswani et al., 2017).
Learning
Neural autoregressive models are commonly learned by maximizing the conditional likelihood $\log p_\theta(y \mid x) = \sum_{t=0}^{T} \log p_\theta(y_{t+1} \mid y_{0:t}, x)$ given a set of parallel examples.
Decoding
A common way to decode a sequence from a trained model is to exploit the autoregressive nature that allows us to predict one word at each step. Given any source $x$, we essentially follow the order of factorization to generate tokens sequentially using heuristic-based algorithms such as greedy decoding and beam search.
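For concreteness, here is a minimal sketch of greedy left-to-right decoding; `next_token_logprobs` is a hypothetical stand-in for any trained model implementing Eq. (1), not an interface from the paper.

```python
# A minimal sketch of greedy left-to-right decoding (Eq. 1).
# `next_token_logprobs(y, x)` is a hypothetical interface standing in for any
# trained autoregressive model: it returns a dict token -> log p(token | y, x).
def greedy_decode(next_token_logprobs, x, max_len=100):
    y = ["<s>"]
    for _ in range(max_len):
        logprobs = next_token_logprobs(y, x)
        y_next = max(logprobs, key=logprobs.get)  # argmax over the vocabulary
        y.append(y_next)
        if y_next == "</s>":                      # stop at the end-of-sentence token
            break
    return y
```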
3 Decoding with Latent Order
Eq. 1 explicitly assumes a left-to-right (L2R) generation order of the sequence $y$. In principle, we can factorize the sequence probability in any permutation and train a model for each permutation separately. As long as we have an infinite amount of data and proper optimization methods, all of these models would work equally well.
Vinyals et al. (2015) have shown that the generation order of a sequence matters in many real-world tasks, for instance language modeling. The L2R order is a strong inductive bias, as it is "natural" for most human beings to read and write sequences in the L2R order. However, L2R is not necessarily the optimal option for generating sequences. For instance, languages such as Arabic are written in a right-to-left (R2L) order, and translation into languages such as Japanese has been shown to benefit from an R2L generation order because of their special characteristics (Wu et al., 2018). For formal languages, it is beneficial to generate code based on abstract syntax trees (Yin and Neubig, 2017).
Therefore, a natural question arises: how can we decode a sequence in its best order?
3.1 Ordering as Latent Variables
We address this question by explicitly modeling the generation order as a latent variable during the decoding process. Similar to Vinyals et al. (2015), the target sequence $y$ generated in a particular order $\pi \in \mathcal{P}_T$ (where $\mathcal{P}_T$ denotes the set of all permutations of $(1, \ldots, T)$) is written as $y_\pi = \big((y_0, z_0), (y_1, z_1), \ldots, (y_{T+1}, z_{T+1})\big)$, where $(y_t, z_t)$ represents the $t$-th generated token and its absolute location, respectively. Under this formulation, $z_{2:T+1} = (1, 2, \ldots, T)$ for L2R and $z_{2:T+1} = (T, T-1, \ldots, 1)$ for R2L, respectively.
The conditional probability of the sequence is modeled by marginalizing out the order:
$p_\theta(y \mid x) = \sum_{\pi \in \mathcal{P}_T} p_\theta(y_\pi \mid x) = \sum_{\pi \in \mathcal{P}_T} \prod_{t=1}^{T+1} p_\theta(y_{t+1}, z_{t+1} \mid y_{0:t}, z_{0:t}, x)$  (2)
where we fix $(y_0, z_0) = (\langle s \rangle, 0)$ and $(y_1, z_1) = (\langle /s \rangle, T+1)$, and the final prediction $y_{T+2}$ is the end-of-decoding token $\langle eod \rangle$.
3.2 Relative Representation of Positions
It is fundamentally difficult to use Eq. 2 directly for decoding, because we do not know how many tokens ($T$) will be generated in total before the special end-of-decoding token is predicted. A paradox arises when we must predict both the token $y_{t+1}$ and its absolute position $z_{t+1}$ simultaneously at each step: the range of valid positions depends on the final length, which is unknown during generation.
Relative Positions
It is essential to use a position representation that does not rely on absolute locations. Thus, we replace absolute locations with relative positions for each generated word. Formally, at step $t$ we write the locations of the words in the partial sequence as $z^t = (z^t_0, \ldots, z^t_t)$. For example, the locations for the partial sequence $(\langle s \rangle, \langle /s \rangle, \text{dream}, \text{I}, \text{a}, \text{have})$ in Fig. 1 are $z^5 = (0, 5, 4, 1, 3, 2)$ at step $t = 5$.
Relational Vectors
As $z^t$ changes across steps (each insertion shifts the absolute locations of all words to its right), we use relational vectors to encode the positions, inspired by Shaw et al. (2018). The relational vector $r_t \in \{-1, 0, 1\}^{t+1}$ is a ternary vector whose $i$-th element is defined as:
$r_t[i] = \begin{cases} -1 & z_t < z_i \ (\text{the new word is on the left of } y_i) \\ 0 & i = t \\ 1 & z_t > z_i \ (\text{the new word is on the right of } y_i) \end{cases}$  (3)
where the elements of $r_t$ give the relative positions of the newly generated word with respect to all the other words in the partial sequence at step $t$. Stacking these column vectors yields a matrix $R^t \in \{-1, 0, 1\}^{(t+1)\times(t+1)}$ that records the positions of all words in the sequence. The relational matrix can always be mapped back to the absolute locations $z^t$ by:
$z^t_i = \sum_{j=0}^{t} \max\big(0, R^t_{i,j}\big)$,  (4)

i.e., the absolute location of a word equals the number of words on its left.
One of the biggest advantages of using relational vectors is that we only preserve the most straightforward positional relations (left, middle and right), which keeps the relative position between any two words unchanged once it is determined. That is to say, updating the relative positions simply amounts to extending the existing relational matrix $R^t$ with the newly predicted relational vector $r_{t+1}$:
$R^{t+1} = \begin{pmatrix} R^t & -r_{t+1} \\ r_{t+1}^\top & 0 \end{pmatrix}$  (5)
Insertion Operations
Predicting the relational vector $r_{t+1}$ is, in essence, equivalent to determining where to perform the insertion operation for the next word. In this work, insertion is performed by choosing an existing word $y_k$ (the word generated at the $k$-th step, $0 \le k \le t$) and inserting the new word to its left ($s = -1$) or right ($s = 1$). The predicted $r_{t+1}$ is then computed as:
$r_{t+1}[i] = \begin{cases} s & i = k \\ R^t_{k,i} & i \neq k \end{cases}$  (6)
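To make Eqs. (3)-(6) concrete, the following NumPy sketch (our own illustration, not the paper's code; the toy tokens mirror Fig. 1) maintains the relational matrix under insertions and recovers the absolute locations at any step:

```python
import numpy as np

def insert(R, k, s):
    """Extend the relational matrix R (Eq. 5) with the relational vector of a
    word inserted next to word k: s = +1 for right, s = -1 for left (Eq. 6)."""
    t = R.shape[0]
    r = np.array([s if i == k else R[k, i] for i in range(t)])
    R_new = np.zeros((t + 1, t + 1), dtype=int)
    R_new[:t, :t] = R
    R_new[t, :t] = r       # relation of the new word to every existing word
    R_new[:t, t] = -r      # and the existing words' relations to the new word
    return R_new

def absolute_positions(R):
    """Map relations back to absolute locations (Eq. 4): a word's location
    equals the number of words on its left."""
    return np.maximum(R, 0).sum(axis=1)

# <s> and </s> initialize the boundaries: <s> is on the left of </s>.
R = np.array([[0, -1],
              [1,  0]])
R = insert(R, k=0, s=+1)      # insert "dream" to the right of <s>
R = insert(R, k=2, s=-1)      # insert "I" to the left of "dream"
print(absolute_positions(R))  # [0 3 2 1], i.e. <s> I dream </s>
```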
3.3 Insertion-based Decoding
To conclude the discussion above, we propose a new decoding algorithm, INsertion-based Decoding with Inferred Generation Order (INDIGO). A general framework of the proposed insertion-based decoding is summarized in Algorithm 1. We use two special initial tokens, $\langle s \rangle$ and $\langle /s \rangle$, to represent the starting and ending positions that set the boundaries, and a third special token, $\langle eod \rangle$, as the end-of-decoding signal. Insertion to the left of $\langle s \rangle$ or to the right of $\langle /s \rangle$ is not allowed.
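A greedy sketch of this loop (reusing `insert` and `absolute_positions` from the previous snippet; `predict_word_and_slot` is a hypothetical model interface, not the paper's API) looks like:

```python
import numpy as np

# Skeleton of Algorithm 1 (greedy variant). `predict_word_and_slot(words, R, x)`
# is a hypothetical interface returning the next word and an insertion choice
# (k, s) as in Eq. (6); a real model masks insertions left of <s> / right of </s>.
def indigo_greedy_decode(predict_word_and_slot, x, max_len=100):
    words = ["<s>", "</s>"]
    R = np.array([[0, -1], [1, 0]])
    for _ in range(max_len):
        w, k, s = predict_word_and_slot(words, R, x)
        if w == "<eod>":                    # end-of-decoding token
            break
        R = insert(R, k, s)
        words.append(w)
    order = np.argsort(absolute_positions(R))
    return [words[i] for i in order]        # words in surface order
```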
4 Model
We implement the proposed insertion-based decoding as Transformer-INDIGO, an extension of the Transformer (Vaswani et al., 2017). To the best of our knowledge, Transformer-INDIGO is the first probabilistic model that models generation order in autoregressive decoding. The overall framework is shown in Fig. 2.
4.1 Network Design
Our model reuses most of the components of the original Transformer, except that we augment it with self-attention over relative positions, a joint word/position prediction module, and a position updating module.
Self-Attention
One of the major challenges that prevents the vanilla Transformer from generating in arbitrary orders is that the standard positional embeddings are unusable, as mentioned in Section 3.2, since decoding in an arbitrary order changes the absolute word locations at every step. Instead, we adapt the relative-position scheme of Shaw et al. (2018). While Shaw et al. (2018) set a clipping distance for relative positions, our relational vectors effectively use a clipping distance of 1, which suffices for the Transformer to capture the order information, as only the three relations (left, middle, right) are ever needed.
Each attention head in a multi-head self-attention module of Transformer-INDIGO operates on an input sequence $u = (u_0, \ldots, u_t)$ and its corresponding relative-position representations $R^t$, where $R^t_{i,j} \in \{-1, 0, 1\}$. We compute the unnormalized energy $e_{i,j}$ for attention as:
$e_{i,j} = \dfrac{(u_i W^Q)\big(u_j W^K + A_{[R_{i,j}+1]}\big)^\top}{\sqrt{d_k}}$  (7)
where $W^Q$ and $W^K$ are parameter matrices, and $A \in \mathbb{R}^{3 \times d_k}$ contains one learned vector per relation; $A_{[R_{i,j}+1]}$ is the row vector indexed by $R_{i,j}+1 \in \{0, 1, 2\}$. The indexed vector biases the input keys based on the relative location.
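A NumPy sketch of Eq. (7), under our notation above (`A` holds one learned bias vector per relation value in {-1, 0, 1}; parameter names are assumptions for illustration):

```python
import numpy as np

def relative_attention_energy(U, R, Wq, Wk, A):
    """Unnormalized attention energies with relative-position bias (Eq. 7).
    U: (n, d) inputs; R: (n, n) relations in {-1, 0, 1}; A: (3, d_k)."""
    Q, K = U @ Wq, U @ Wk                 # (n, d_k) queries and keys
    bias = A[R + 1]                       # (n, n, d_k): row A[r+1] per word pair
    d_k = Q.shape[-1]
    # e[i, j] = q_i . (k_j + a_{R_ij}) / sqrt(d_k)
    return np.einsum('id,ijd->ij', Q, K[None, :, :] + bias) / np.sqrt(d_k)

# toy check
n, d, d_k = 3, 4, 4
rng = np.random.default_rng(0)
U = rng.normal(size=(n, d))
R = np.array([[0, -1, -1], [1, 0, -1], [1, 1, 0]])   # a valid relational matrix
e = relative_attention_energy(U, R, rng.normal(size=(d, d_k)),
                              rng.normal(size=(d, d_k)), rng.normal(size=(3, d_k)))
print(e.shape)  # (3, 3)
```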
Word & Position Prediction Module
Similar to the vanilla Transformer, we take the representation $h_t$ from the last layer of the decoder to predict both the next word $y_{t+1}$ and its relational vector $r_{t+1}$ by the factorization $p_\theta(y_{t+1}, r_{t+1} \mid h_t) = p_\theta(y_{t+1} \mid h_t)\, p_\theta(r_{t+1} \mid y_{t+1}, h_t)$.
More precisely, we project each $h_t$ four times using the projection matrices $W^L$, $W^R$, $W^W$ and $W^P$. As described in Sec. 3.2, the two matrices $W^L$ and $W^R$ are used to compute the keys for pointing, considering that each word has two "keys" (left and right) for inserting the generated word. $W^P$ and $W^W$ are used to query these keys and the word embeddings, respectively. The detailed steps for word & position prediction are as follows:

Predicting the next word: $p_\theta(y_{t+1} \mid h_t) = \mathrm{softmax}\big((h_t W^W)\, E^\top\big)$, where $E$ is the word embedding matrix;

Predicting the relative position by pointing to the keys based on the generated word: $p_\theta(k, s \mid y_{t+1}, h_t) = \mathrm{softmax}\big((h_t W^P + e_{y_{t+1}})\,[K^L; K^R]^\top\big)$, where $e_{y_{t+1}}$ is the embedding of the predicted word, and $K^L = h_{0:t} W^L$ and $K^R = h_{0:t} W^R$ are the left and right keys of all existing words. The predicted $r_{t+1}$ is then computed from $(k, s)$ accordingly (Eq. (6)); a sketch of the pointing step follows below.
Position Updating
We update the relative-position matrix $R^t$ with the predicted $r_{t+1}$ from the decoder (Eq. (5)). Because updating the relative positions does not invalidate previously computed relative-position representations, generation from Transformer-INDIGO is as efficient as from the standard Transformer.
4.2 Learning
Training requires maximizing the marginalized likelihood in Eq. (2). This is, however, intractable, since we would need to enumerate all $T!$ permutations of tokens. Instead, we maximize the evidence lower bound (ELBO) of the original objective by introducing an approximate posterior distribution over generation orders, $q(\pi \mid x, y)$, which assigns probabilities to latent generation orders based on the ground-truth sequences $x$ and $y$:
$\log p_\theta(y \mid x) \;\ge\; \mathbb{E}_{\pi \sim q(\pi \mid x, y)}\big[\log p_\theta(y_\pi \mid x)\big] + \mathcal{H}(q)$  (8)
where each order $\pi$ sampled from $q(\pi \mid x, y)$ is represented as a sequence of relative positions $r_{1:T+1}$. Given a sampled order, the learning objective divides into a word prediction loss and a position prediction loss. For the latter, we compute the position probability as the sum of the probabilities of the two pointing choices that select the same slot:
$p_\theta(r_{t+1} \mid \cdot) = p_\theta(k_l, s{=}1 \mid \cdot) + p_\theta(k_r, s{=}{-}1 \mid \cdot)$  (9)
where $y_{k_l}$ and $y_{k_r}$ are the words immediately to the left and right of the insertion slot: every insertable location can be reached in two ways, from the word on its left (with $s = 1$) and from the word on its right (with $s = -1$). Here we study two types of $q(\pi \mid x, y)$:
Predefined Order
If we already possess some prior knowledge about the sequence, e.g., that the L2R order is a strong baseline in many scenarios (Cho et al., 2015), we assume a Dirac-delta distribution $q(\pi \mid x, y) = \delta(\pi = \pi^*(x, y))$, where $\pi^*$ is a predefined order given $x$ and $y$. In this paper, we study a set of predefined orders, listed in Table 2, to evaluate their effect on generation.
Searched Adaptive Order (SAO)
Another option is to choose the approximate posterior as a point estimate that maximizes $\log p_\theta(y_\pi \mid x)$. In practice, we approximate these generation orders through beam search (BS, Pal et al., 2006).
More specifically, we maintain $B$ sub-sequences with the maximum probabilities in a set $\mathcal{B}$. At each step $t$ and for every sub-sequence in $\mathcal{B}$, we evaluate the probability of every possible choice of the next word among the not-yet-generated target words, together with its corresponding position $r_{t+1}$. We calculate the cumulative likelihood of each candidate, based on which we select the top-$B$ sub-sequences as the new set $\mathcal{B}$ for the next step. In the end, we optimize our objective as an average over the searched orders:
$\mathcal{L}_{\mathrm{SAO}} = \dfrac{1}{B} \sum_{\pi \in \mathcal{B}} \log p_\theta(y_\pi \mid x)$  (10)
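A simplified sketch of this order search, assuming a hypothetical scorer `logp_word_and_slot` that returns the model log-probability of inserting a given ground-truth word at slot $(k, s)$ of a partial hypothesis (batching and Eq. (9)'s merging of equivalent slots are omitted for brevity):

```python
import itertools

def search_adaptive_orders(logp_word_and_slot, target_words, B=4):
    """Beam search over generation orders (SAO). A hypothesis is a tuple
    (cumulative log-prob, list of (word, k, s) insertion steps, words left).
    Note: two (k, s) choices can denote the same slot; Eq. (9) merges them."""
    beams = [(0.0, [], tuple(target_words))]
    for _ in range(len(target_words)):
        candidates = []
        for score, steps, remaining in beams:
            t = len(steps) + 2                      # existing words incl. <s>, </s>
            for i, w in enumerate(remaining):       # every not-yet-generated word
                for k, s in itertools.product(range(t), (-1, 1)):
                    lp = logp_word_and_slot(steps, w, k, s)
                    rest = remaining[:i] + remaining[i + 1:]
                    candidates.append((score + lp, steps + [(w, k, s)], rest))
        # keep the top-B partial orders by cumulative likelihood
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:B]
    return beams
```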
Beam Search with Noise
The goal of beam search is to approximately find the most likely generation orders under the current model, which limits learning from exploring other generation orders that may not look favourable at the moment but may ultimately prove better. Furthermore, prior research (Vijayakumar et al., 2016) pointed out that the search space of standard beam search is restricted, and proposed a more diverse beam search that partitions the beam into groups. We encourage exploration by injecting noise during beam search (Cho, 2016). In particular, we found it effective to simply keep dropout on during the search, which greatly increases the diversity of the searched orders.
Bootstrapping from a Predefined Order
During preliminary experiments, sequences returned by beam search often degenerated into always predicting common or functional words (e.g. "the", ",", etc.) as the first several tokens, leading to inferior performance. We conjecture that this is because the position prediction module learns much faster than the word prediction module, so it quickly captures spurious correlations induced by a poorly initialized model. It is essential to balance the learning processes of these modules. To do so, we bootstrap the model by first training it with a predefined order (e.g. L2R), and then continue training from the pretrained parameters with beam-searched orders.
4.3 Decoding
For decoding, we directly follow Algorithm 1 to sample or decode greedily from the proposed model. In practice, however, beam search is important for exploring the output space of neural autoregressive models. In our implementation, we perform beam search for INDIGO as a two-step search: with beam size $B$, at each step we first run beam search over the next word, and then, for the searched words, try out all possible positions and select the top-$B$ sub-sequences.
5 Experiments
We evaluate INDIGO extensively on three challenging sequence generation tasks: word order recovery, machine translation and natural-language-to-code generation (NL2Code). We compare the predefined orders (the left-to-right order by default) with the adaptive orders obtained by beam search for training Transformer-INDIGO.
5.1 Experimental Settings
Dataset
The machine translation experiments are conducted on three language pairs: WMT 16 Romanian-English (Ro-En) (http://www.statmt.org/wmt16/translation-task.html), WMT 18 English-Turkish (En-Tr) (http://www.statmt.org/wmt18/translation-task.html) and KFTT English-Japanese (En-Ja) (http://www.phontron.com/kftt/), chosen for the diversity of the grammatical structures of the target languages. We use the English side of the Ro-En dataset as the training examples for the word order recovery task. For the NL2Code task, we use the Django dataset (Oda et al., 2015) (https://github.com/odashi/ase15-django-dataset). The dataset statistics can be found in Table 1. Target sequences are much shorter for code generation (8.87 words on average) than for the machine translation datasets (roughly 26 words on average).
Preprocessing
We apply the standard Moses tokenization (https://github.com/moses-smt/mosesdecoder) and normalization to all datasets. We apply joint BPE (Sennrich et al., 2015) to the MT datasets, and use all unique words as the vocabulary for NL2Code.
Table 1: Dataset statistics.

Dataset  Train  Dev   Test  Average target length (words)
Ro-En    620k   2000  2000  26.48
En-Tr    207k   3007  3000  25.81
En-Ja    405k   1166  1160  27.51
NL2Code  16k    1000  1801  8.87
Table 2: Descriptions of the predefined orders used in this work.

Predefined Order     Description
Left-to-right (L2R)  Generate words from left to right.
Right-to-left (R2L)  Generate words from right to left.
Odd-Even (ODD)       Generate words at odd positions from left to right, then words at even positions similarly.
Syntax-tree (SYN)    Generate words in a top-down, left-to-right order based on the dependency tree.
Common-First (CF)    Generate all common words first from left to right, then generate the others.
Rare-First (RF)      Generate all rare words first from left to right, then generate the remaining.
Random (RND)         Generate words in a randomly shuffled order.
Model
We use the same model hyperparameters throughout all experiments. The source and target embedding matrices are shared, except for En-Ja, where we found that using separate embeddings significantly improves translation quality. Both the encoder and decoder use relative positions during self-attention, except in the word order recovery experiments (where no position information is used on the source side). For model simplicity, we did not introduce task-specific modules such as copying (Gu et al., 2016).
Training
When training with a predefined order, we first reorder the words of each training sequence accordingly and generate the supervision of the ground-truth position at which each word is to be inserted. We test the predefined orders listed in Table 2. SYN orders are generated with a dependency parser (https://spacy.io/usage/linguistic-features) to obtain the tree structures of the target sequences; CF & RF orders are obtained with a vocabulary cut-off chosen so that the numbers of common words and rare words are approximately the same (Ford et al., 2018). We also consider assigning a random order to each sentence as a baseline. When using L2R as the predefined order, Transformer-INDIGO is almost equivalent to the vanilla Transformer (up to a small number of additional parameters for position prediction), as the position prediction module simply learns to always insert to the left of the $\langle /s \rangle$ symbol. We also train Transformer-INDIGO using the searched adaptive order (SAO), with a fixed beam size by default.
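To illustrate the supervision construction, a sketch (a hypothetical helper following the conventions of Sec. 3, not the paper's code) that converts a target sentence and a permutation into per-step (word, k, s) insertion targets:

```python
def insertion_supervision(target, order):
    """Convert a target sentence and a generation order (a permutation of
    0-based token indices) into per-step (word, k, s) targets for Eq. (6).
    Steps 0/1 are fixed to <s>/</s> at absolute positions 0 and T+1.
    We always point right (s=+1) of the nearest generated left neighbor;
    the same slot could equally use s=-1 of the right neighbor, cf. Eq. (9)."""
    T = len(target)
    generated = [0, T + 1]            # absolute positions, indexed by step
    steps = []
    for idx in order:
        pos = idx + 1                 # account for <s> at position 0
        # the already-generated word with the largest position < pos
        k = max(range(len(generated)),
                key=lambda j: (generated[j] < pos, generated[j]))
        steps.append((target[idx], k, +1))
        generated.append(pos)
    return steps

# L2R order for "I have a dream": each word is inserted to the right of the
# most recently generated word, i.e. immediately left of </s>.
print(insertion_supervision(["I", "have", "a", "dream"], [0, 1, 2, 3]))
# [('I', 0, 1), ('have', 2, 1), ('a', 3, 1), ('dream', 4, 1)]
```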
Table 3: BLEU scores on the three translation datasets (dev / test).

Model  Ro-En  En-Tr  En-Ja

Predefined Orders
Random  19.43 / 20.20
L2R  33.29 / 31.82  15.11 / 14.47  28.11 / 30.08
R2L  32.71 / 31.62  14.54 / 14.24  27.29 / 30.42
ODD  31.02 / 30.11
SYN  30.11 / 29.43
CF   31.29 / 30.25
RF   31.34 / 30.23

Adaptive Order from Beam Search
SAO (default)  33.69 / 32.19  15.96 / 15.18  28.37 / 31.31
  w/o bootstrapping  32.86 / 31.88
  w/o bootstrapping & noise  32.64 / 31.72
SAO, smaller beam size  32.31 / 31.92
SAO, larger beam size   33.77 / 32.08
5.2 Word Order Recovery
Word order recovery takes a bag of words as input and recovers its original word order; note that the word order here is different from the generation order. We take the English side of the Ro-En corpus, remove the order information, and train our model to recover the correct order, which is challenging as the search space is factorial in the sequence length. Similar to memory networks (Weston et al., 2014), it is straightforward to accept such order-free inputs by removing both the positional encoding and the relative positional bias on the source side, relying on self-attention to recover the order relations among the bag of source words. In addition, we did not restrict the vocabulary of the source words.
As shown in Fig. 3, we compare the L2R order with the searched adaptive order (SAO) for word order recovery, varying the decoding beam size over a wide range.
It is clear that, with the more flexible generation orders found by SAO, the BLEU scores of the recovered sequences are much higher than with the L2R order. Furthermore, increasing the decoding beam size brings much larger improvements for SAO than for L2R, which suggests that INDIGO produces much more diverse predictions and has a higher chance of covering the correct outputs. In subsequent experiments, we fix the default decoding beam sizes for the predefined orders and for SAO based on these results.
5.3 Machine Translation
We evaluate the proposed INDIGO on three translation datasets with varied target languages: English (En), Turkish (Tr) and Japanese (Ja). Neural autoregressive models (e.g. the Transformer) have achieved impressive performance on machine translation, but conventional models do not take generation orders into consideration.
As shown in Table 3, we compare our algorithm trained with predefined orders as well as with searched adaptive orders (SAO) under varied setups. First, almost all predefined orders (except for the random-order baseline) perform reasonably well with Transformer-INDIGO, while the best score is always reached by the L2R order, except on the En-Ja test set, where the R2L order works slightly better. This suggests that in machine translation the monotonic orders, L2R and R2L, provide reasonable inductive biases that reflect the nature of languages. We find that INDIGO with SAO achieves competitive, and in some cases statistically significant, improvements over the L2R order. Moreover, the improvements are larger for the non-English target languages, which may indicate that for languages such as Turkish and Japanese, whose syntactic structures differ greatly from English, a more flexible generation order improves translation quality.
Furthermore, we conduct an ablation study over different settings of the searched order. Table 3 shows that both bootstrapping and searching with noise are effective: without them, learning from SAO consistently degrades on Ro-En. In addition, increasing the beam size for SAO helps decoding, though with diminishing returns.
Table 4: Results on the Django code generation task (dev / test).

Model  BLEU           Accuracy
L2R    34.22 / 31.97  16.7% / 10.3%
SAO    41.45 / 42.33  19.0% / 16.3%
5.4 Code Generation
We also evaluate the proposed INDIGO on natural-language-to-code generation with the Django dataset. As shown in Table 4, SAO works significantly better than the L2R order in terms of both BLEU and accuracy, which indicates that flexible generation orders are preferable for generating code.
5.5 Case Study
We demonstrate how INDIGO works in practice by uniformly sampling examples from the validation set of each task. As shown in Fig. 4, the proposed model generates sequences differently depending on the learned orders (either predefined or SAO). For instance, the model is able to decode as if traversing a dependency tree when the SYN order is used. We also summarize two common features of the orders inferred with SAO: (1) generation usually starts with the period; (2) decoding proceeds in chunks, where words within each chunk are generated in an L2R order.
6 Related Work
Neural autoregressive modelling has become one of the most successful approaches to generating sequences (Sutskever et al., 2011; Mikolov, 2012) and has been widely used in a range of applications, such as machine translation (Sutskever et al., 2014), dialogue response generation (Vinyals and Le, 2015), image captioning (Karpathy and Fei-Fei, 2015) and speech recognition (Chorowski et al., 2015). Another stream of work generates a sequence of tokens non-autoregressively (Gu et al., 2017; Lee et al., 2018; Oord et al., 2017): the discrete tokens are generated in parallel, while the strong dependencies among tokens are preserved through techniques such as distillation. Semi-autoregressive modelling (Stern et al., 2018; Wang et al., 2018) is a trade-off between the two, while still largely adhering to left-to-right generation. Our method is radically different from these approaches, as it supports flexible generation orders while preserving the dependencies among generated tokens.
Previous studies on the generation order of sequences mostly resort to a fixed set of orders. Wu et al. (2018) empirically show that right-to-left generation outperforms its left-to-right counterpart on a few tasks. Ford et al. (2018) devise a two-pass approach that first produces partially-filled sentence "templates" and then fills in the missing tokens. Mehri and Sigal (2018) propose a middle-out decoder that first predicts a middle word and then expands the sequence in both directions. Another line of work models the probability of a sequence as a tree or directed graph (Zhang et al., 2015; Dyer et al., 2016; Aharoni and Goldberg, 2017). In contrast, ours is the first work to support fully flexible generation orders that are not predefined.
7 Conclusion
We have presented a novel decoding algorithm, INDIGO, which supports flexible sequence generation in arbitrary orders. By extending the vanilla Transformer, our model can be trained with either a predefined generation order or an adaptive order found by beam search. In contrast to conventional neural autoregressive decoding, which generates from left to right, our model generates sequences far more flexibly. Experiments show that our method achieves competitive or even better performance than conventional left-to-right generation on three real-world tasks, including machine translation, word order recovery and code generation.
References
 Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. arXiv preprint arXiv:1704.04743.
 Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
 Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
 Cho et al. (2015) Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia, 17(11):1875–1886.
 Chorowski et al. (2015) Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In NIPS, pages 577–585.
 Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A. Smith. 2016. Recurrent neural network grammars. arXiv preprint arXiv:1602.07776.
 Ford et al. (2018) Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E. Dahl. 2018. The importance of generation order in language modeling. arXiv preprint arXiv:1808.07910.
 Gu et al. (2017) Jiatao Gu, James Bradbury, Caiming Xiong, Victor O.K. Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
 Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor O.K. Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
 Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3128–3137.
 Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
 Mehri and Sigal (2018) Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In Advances in Neural Information Processing Systems 31, pages 5523–5534. Curran Associates, Inc.
 Mikolov (2012) Tomáš Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.
 Oda et al. (2015) Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), pages 574–584, Lincoln, Nebraska, USA. IEEE Computer Society.
 Oord et al. (2017) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C. Cobo, Florian Stimberg, et al. 2017. Parallel WaveNet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
 Pal et al. (2006) Chris Pal, Charles Sutton, and Andrew McCallum. 2006. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In 2006 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), volume 5, pages V–V. IEEE.
 Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
 Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
 Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
 Stern et al. (2018) Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pages 10107–10116.
 Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E. Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017–1024.
 Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In NIPS.
 Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
 Vijayakumar et al. (2016) Ashwin K. Vijayakumar, Michael Cogswell, Ramprasath R. Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
 Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
 Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
 Wang et al. (2018) Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583.
 Weston et al. (2014) Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
 Wu et al. (2018) Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120.
 Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696.
 Zhang et al. (2015) Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060.