Insertion-based Decoding with
Automatically Inferred Generation Order
Conventional neural autoregressive decoding commonly assumes a left-to-right generation order. In this work, we propose a novel decoding algorithm -- INDIGO -- which supports flexible generation in an arbitrary order with the help of insertion operations. We use Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or an adaptive order searched based on the model’s own preference. Experiments on three real-world tasks, including machine translation, word order recovery and code generation, demonstrate that our algorithm can generate sequences in an arbitrary order, while achieving competitive or even better performance compared to the conventional left-to-right generation. Case studies show that INDIGO adopts adaptive generation orders based on input information.
Neural autoregressive models have become the de facto standard in a wide range of sequence generation tasks, such as machine translation (Bahdanau et al., 2014), summarization (Rush et al., 2015) and dialogue systems (Vinyals and Le, 2015). In these studies, a sequence is usually modeled autoregressively with the left-to-right generation order, which raises the question of whether generation in an arbitrary order is worth considering (Vinyals et al., 2015; Ford et al., 2018). Nevertheless, previous studies on generation orders mostly resort to a fixed set of generation orders, showing that particular choices of ordering are helpful (Wu et al., 2018; Ford et al., 2018; Mehri and Sigal, 2018) without providing an efficient algorithm for finding an adaptive generation order, or restrict the problem scope to n-gram segment generation (Vinyals et al., 2015).
In this paper, we first propose a novel decoding algorithm, INsertion-based Decoding with Inferred Generation Order (INDIGO), which treats generation orders as latent variables that can be automatically inferred based on input information. Given that absolute positions are unknown before the whole sequence is generated, we use a relative representation of positions to capture generation orders, which is equivalent to decoding with insertion operations. A demonstration is shown in Fig. 1.
We implement the algorithm with the Transformer (Vaswani et al., 2017) -- a state-of-the-art sequence generation model -- where the generation order is directly captured as relative positions through self-attention inspired by (Shaw et al., 2018). We maximize the evidence lower-bound (ELBO) of the original objective function and study four approximate posterior distributions of generation orders to learn a sequence generation model that uses the proposed INDIGO decoding scheme.
Experimental results on machine translation, code generation and word order recovery demonstrate that our algorithm can generate sequences with arbitrary orders, while achieving competitive or even better performance compared to the conventional left-to-right generation. Case studies show that the proposed method adopts adaptive orders based on input information.
2 Neural Autoregressive Decoding
Let us consider the problem of generating a sequence $y = (y_1, \ldots, y_T)$ conditioned on some input information, e.g., a source sequence $x = (x_1, \ldots, x_{T'})$. Our goal is to build a model parameterized by $\theta$ that models the conditional probability of $y$ given $x$, which is factorized as:
$$p_\theta(y \mid x) = \prod_{t=0}^{T} p_\theta(y_{t+1} \mid y_{0:t}, x_{1:T'}), \tag{1}$$
where $y_0$ and $y_{T+1}$ are the special tokens $\langle s \rangle$ and $\langle /s \rangle$, respectively. The model sequentially predicts the conditional probability of the next token at each step $t$, which can be implemented by any function approximator such as RNNs (Bahdanau et al., 2014) and the Transformer (Vaswani et al., 2017).
A neural autoregressive model is commonly learned by maximizing the conditional likelihood given a set of parallel examples.
A common way to decode a sequence given a trained model is to exploit its autoregressive nature, which allows us to predict one word at each step. Given any source $x$, we essentially follow the order of factorization to generate tokens sequentially using a heuristic algorithm such as greedy decoding or beam search.
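As an illustration of the decoding procedure above, the following Python sketch performs greedy left-to-right decoding. It is a minimal example under stated assumptions, not the implementation used in this paper: `next_token_probs`, `toy_model` and the bigram table `TOY_BIGRAMS` are hypothetical stand-ins for a trained model.

```python
from typing import Callable, Dict, List

BOS, EOS = "<s>", "</s>"

def greedy_decode(next_token_probs: Callable[[List[str]], Dict[str, float]],
                  max_len: int = 20) -> List[str]:
    """Left-to-right greedy decoding: take the most likely next token at each step."""
    prefix = [BOS]
    for _ in range(max_len):
        probs = next_token_probs(prefix)
        token = max(probs, key=probs.get)
        if token == EOS:          # stop once the end token is predicted
            break
        prefix.append(token)
    return prefix[1:]             # drop the start token

# Hypothetical toy "model": a bigram table that only looks at the last token.
TOY_BIGRAMS = {
    BOS: {"I": 0.9, "a": 0.1},
    "I": {"have": 0.8, EOS: 0.2},
    "have": {"a": 0.7, EOS: 0.3},
    "a": {"dream": 0.6, EOS: 0.4},
    "dream": {EOS: 1.0},
}

def toy_model(prefix: List[str]) -> Dict[str, float]:
    return TOY_BIGRAMS[prefix[-1]]

print(greedy_decode(toy_model))  # ['I', 'have', 'a', 'dream']
```

Beam search would replace the single `max` with the top-scoring set of prefixes kept at every step.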
3 Decoding with Latent Order
Eq. 1 explicitly assumes a left-to-right (L2R) generation order of the sequence $y$. In principle, we can factorize the sequence probability in any permutation and train a model for each permutation separately. Given an infinite amount of data and proper optimization methods, all these models would work equally well.
Vinyals et al. (2015) have shown that the generation order of a sequence matters in many real-world tasks, for instance language modeling. The L2R order is a strong inductive bias, as it is ‘‘natural’’ for most humans to read sequences following the L2R order. However, L2R is not necessarily the optimal option for generating sequences. For instance, languages such as Arabic follow a right-to-left (R2L) order to write a sentence; some languages such as Japanese are inferior when translated in an R2L order because of their special characteristics (Wu et al., 2018). For formal language generation such as code generation, it is beneficial to generate sequences based on abstract syntax trees (Yin and Neubig, 2017).
Therefore, a natural question arises: how can we decode a sequence in its best order?
3.1 Ordering as Latent Variables
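We address this question by explicitly modeling the generation order as latent variables during the decoding process. Similar to Vinyals et al. (2015), the target sequence $y$ generated in a particular order $\pi = (z_2, \ldots, z_T, z_{T+1}) \in \mathcal{P}_T$ is written as $y_\pi = ((y_2, z_2), \ldots, (y_{T+1}, z_{T+1}))$, where $(y_t, z_t)$ represents the $t$-th generated token and its absolute location, respectively. Here $\mathcal{P}_T$ denotes the set of all permutations of $(1, \ldots, T)$. Under this formulation, $\pi = (1, 2, \ldots, T)$ and $\pi = (T, T-1, \ldots, 1)$ for L2R and R2L, respectively.
The conditional probability of the sequence is modeled by marginalizing out the order:
$$p_\theta(y \mid x) = \sum_{\pi \in \mathcal{P}_T} p_\theta(y_\pi \mid x), \tag{2}$$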
where we fix .
3.2 Relative Representation of Positions
It is fundamentally difficult to directly use Eq. 2 for decoding, because we do not know how many tokens ($T$) in total will be generated before the special end token is predicted during generation. The paradox arises when we predict both the token and its position simultaneously at each step.
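It is essential to use a position representation which does not rely on absolute locations. Thus, we replace absolute locations with relative positions for each generated word. Formally, we model $p_\theta(y_{t+1}, r_{t+1} \mid y_{0:t}, r_{0:t}, x_{1:T'})$, where the relative positions $r_{0:t}$ represent the locations $z^t_{0:t}$ of the words in a partial sequence. For example, the locations for the sequence ($\langle s \rangle$, $\langle /s \rangle$, dream, I, a, have) are $(0, 5, 4, 1, 3, 2)$ in Fig. 1 once all four words have been generated.
As the absolute locations $z^t_i$ change across different steps, we use relational vectors to encode the positions, inspired by Shaw et al. (2018). The position of the $i$-th generated word at step $t$ is represented as a ternary vector $r^t_i \in \{-1, 0, 1\}^{t+1}$, and its $j$-th element is defined as:
$$r^t_{i,j} = \begin{cases} -1 & z^t_j > z^t_i \ \text{(left)} \\ 0 & z^t_j = z^t_i \ \text{(middle)} \\ 1 & z^t_j < z^t_i \ \text{(right)} \end{cases} \tag{3}$$
where the elements of $r^t_i$ show the relative positions with respect to all the other words in the partial sequence at step $t$. As $r^t_i$ is a column vector, we use the matrix $R^t = [r^t_0, r^t_1, \ldots, r^t_t]$ to show the positions of all the words in the sequence. The relational vector can always be mapped back to $z^t_i$ by:
$$z^t_i = \sum_{j=0}^{t} \max(0, r^t_{i,j}). \tag{4}$$
One of the biggest advantages of using relational vectors is that we only preserve the most straightforward positional relations (left, middle and right), which keeps the relative positions between any two words unchanged once they are determined. That is to say, updating the relative positions is simply extending the existing relational matrix $R^t$ with the newly predicted relational vector $r^{t+1}_{t+1}$:
$$R^{t+1} = \begin{bmatrix} R^t & r^{t+1}_{t+1,0:t} \\ -\left(r^{t+1}_{t+1,0:t}\right)^{\top} & 0 \end{bmatrix}. \tag{5}$$
In essence, predicting $r^{t+1}_{t+1}$ is equivalent to determining where to perform the insertion operation for the next word. In this work, insertion is performed by choosing an existing word $y_k$ (the word generated at the $k$-th step, $0 \le k \le t$) and inserting $y_{t+1}$ to its left ($s = -1$) or right ($s = 1$). Then the predicted $r^{t+1}_{t+1}$ is computed as:
$$r^{t+1}_{t+1,j} = \begin{cases} s & j = k \\ r^t_{k,j} & j \neq k \end{cases} \qquad \forall j \in [0, t]. \tag{6}$$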
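The bookkeeping of this section can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' code; it assumes the sign convention that an entry is -1 when word i is to the left of word j, 0 when they coincide and +1 when word i is to the right of word j, and it stores each relational vector as a row. The function names are hypothetical.

```python
import numpy as np

def relation_matrix(z):
    """R[i, j] = -1 if word i is left of word j, 0 if same word, +1 if right of it."""
    z = np.asarray(z)
    return np.sign(z[:, None] - z[None, :])

def absolute_positions(R):
    """Map relations back to absolute positions: count the words to the left."""
    return np.maximum(R, 0).sum(axis=1)

def insert(R, k, s):
    """Add a word inserted to the left (s=-1) or right (s=+1) of existing word k."""
    t = R.shape[0]
    r_new = R[k].copy()   # the new word relates to every other word as word k does
    r_new[k] = s          # except to word k itself
    out = np.zeros((t + 1, t + 1), dtype=R.dtype)
    out[:t, :t] = R       # relations that are already determined never change
    out[t, :t] = r_new
    out[:t, t] = -r_new   # relations are antisymmetric
    return out

# Generation order: <s>, </s>, dream, I, a, have  (sentence: "I have a dream")
R = relation_matrix([0, 1])   # <s> is left of </s>
R = insert(R, k=1, s=-1)      # "dream" to the left of </s>
R = insert(R, k=0, s=+1)      # "I" to the right of <s>
R = insert(R, k=2, s=-1)      # "a" to the left of "dream"
R = insert(R, k=3, s=+1)      # "have" to the right of "I"
print(absolute_positions(R))  # [0 5 4 1 3 2]
```

Each insertion only appends one row and one column, so earlier entries of the matrix are reused unchanged.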
3.3 Insertion-based Decoding
To conclude the discussion above, we propose a new decoding algorithm: INsertion-based Decoding with Inferred Generation Order (INDIGO). A general framework of the proposed insertion-based decoding is summarized in Algorithm 1. Here we use two special initial tokens, $\langle s \rangle$ and $\langle /s \rangle$, to represent the starting and ending positions and set the boundaries, and a third special token to signal the end of decoding. Also, insertions to the left of $\langle s \rangle$ and to the right of $\langle /s \rangle$ are not allowed.
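The decoding loop described above can be sketched as follows. This is a hedged illustration and not a transcription of Algorithm 1: the token spellings `<s>`, `</s>` and `<eod>`, the `policy` callable and the scripted actions are assumptions introduced for the example, with the policy standing in for a trained model that returns a word together with an anchor word and a side.

```python
from typing import Callable, List, Tuple

BOS, EOS, EOD = "<s>", "</s>", "<eod>"   # token spellings are assumptions
Action = Tuple[str, int, int]            # (word, anchor index k, side s in {-1, +1})

def insertion_decode(policy: Callable[[List[str]], Action],
                     max_steps: int = 50) -> List[str]:
    """Generate by insertion; returns the tokens in surface (left-to-right) order."""
    generated = [BOS, EOS]   # tokens in generation order
    surface = [0, 1]         # generation indices arranged in surface order
    for _ in range(max_steps):
        word, k, s = policy(generated)
        if word == EOD:      # the end-of-decoding token stops generation
            break
        if (k == 0 and s == -1) or (k == 1 and s == +1):
            raise ValueError("cannot insert left of <s> or right of </s>")
        pos = surface.index(k)
        surface.insert(pos if s == -1 else pos + 1, len(generated))
        generated.append(word)
    return [generated[i] for i in surface]

# Hypothetical scripted policy reproducing the order: dream, I, a, have.
SCRIPT = [("dream", 1, -1), ("I", 0, +1), ("a", 2, -1), ("have", 3, +1), (EOD, 0, +1)]

def scripted_policy(generated: List[str]) -> Action:
    return SCRIPT[len(generated) - 2]

print(insertion_decode(scripted_policy))
# ['<s>', 'I', 'have', 'a', 'dream', '</s>']
```

The anchor is addressed by its generation index rather than by its surface position, so repeated words in a sentence remain unambiguous.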
We implement the proposed insertion-based decoding using Transformer-INDIGO, an extension of the Transformer (Vaswani et al., 2017). To the best of our knowledge, Transformer-INDIGO is the first probabilistic model that models generation order in autoregressive decoding. The overall framework is shown in Fig. 2.
4.1 Network Design
Our model reuses most of the components of the original Transformer, except that we augment it with additional modules for self-attention with relative positions, joint word/position prediction and position updating.
One of the major challenges that prevents the vanilla Transformer from generating arbitrary orders is that the standard positional embeddings are unusable as mentioned in Section 3.2, since decoding in an arbitrary order will destroy the absolute word location representations. In contrast, we adapt Shaw et al. (2018) to relative positions in Transformer. While Shaw et al. (2018) set a clipping distance for relative positions (usually ), our relational vectors set , enabling the Transformer to capture the order information at ease as long as .
Each attention head in a multi-head self-attention module of Transformer-INDIGO operates on an input sequence, and its corresponding relative position representations , where . We compute the unnormalized energy for attention as:
where and are parameter matrices. is the row vector indexed by . The indexed vector biases all the input keys based on the relative location.
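A minimal NumPy sketch of such an energy computation is given below. It assumes a key-side relative-position bias in the style of Shaw et al. (2018) and scaling by the square root of the model dimension; the exact form, the shapes and the names `U`, `R`, `Q`, `K` and `A` are assumptions made for illustration, not a restatement of the equation used in this paper.

```python
import numpy as np

def relative_attention_energy(U, R, Q, K, A):
    """Unnormalized attention energies with a relative-position bias on the keys.

    U: (n, d) inputs; R: (n, n) relations in {-1, 0, 1};
    Q, K: (d, d) projections; A: (3, d), one bias row per relation value.
    """
    d = U.shape[1]
    queries = U @ Q                        # (n, d)
    keys = (U @ K)[None, :, :] + A[R + 1]  # (n, n, d): key j as seen from query i
    return np.einsum("id,ijd->ij", queries, keys) / np.sqrt(d)

rng = np.random.default_rng(0)
n, d = 4, 8
U = rng.normal(size=(n, d))
Q, K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
A = rng.normal(size=(3, d))
z = np.array([0, 3, 2, 1])                 # absolute positions of the n words
R = np.sign(z[:, None] - z[None, :])       # relations with entries in {-1, 0, 1}
E = relative_attention_energy(U, R, Q, K, A)
print(E.shape)  # (4, 4)
```

Because the relation takes only three values, the bias table has three rows regardless of sequence length.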
Word & Position Prediction Module
Similar to the vanilla Transformer, we take the representation from the last layer of the decoder to predict both the next word and its relational vector by the following factorization: .
More precisely, we project each four times using projection matrices, . As described in Sec. 3.2, the two matrices, and , are used to compute the keys for pointing, considering that each word has two ‘‘keys’’ (left and right) for inserting the generated word. and are used to query these keys and the word embeddings, respectively. The detailed steps for word & position prediction can be summarized as follows:
Predicting the next word:
where is the embedding matrix;
Predicting the relative position by pointing the keys based on the generated word:
where is the embedding of the predicted word, and . The predicted is computed accordingly (Eq. (6)).
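The two prediction steps can be sketched with NumPy as follows. This is a simplified greedy illustration under stated assumptions: the projection names `P_word`, `P_query`, `P_left` and `P_right` are hypothetical, the query is formed by adding the predicted word's embedding to a projection of the current state, and the first two generated tokens are taken to be the start and end boundary tokens.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def predict_word_and_position(H, W, P_word, P_query, P_left, P_right):
    """H: (n, d) decoder states of generated words; W: (V, d) word embeddings."""
    n = H.shape[0]
    h = H[-1]                                   # state used for the next prediction
    p_word = softmax((h @ P_word) @ W.T)        # distribution over the vocabulary
    word = int(np.argmax(p_word))
    # Every generated word offers two keys: insert to its left or to its right.
    keys = np.concatenate([H @ P_left, H @ P_right], axis=0)   # (2n, d)
    query = h @ P_query + W[word]               # condition on the predicted word
    scores = keys @ query
    scores[0] = -np.inf                         # never left of <s> (index 0)
    scores[n + 1] = -np.inf                     # never right of </s> (index 1)
    p_pos = softmax(scores)
    pointer = int(np.argmax(p_pos))
    k, s = pointer % n, (-1 if pointer < n else +1)
    return word, k, s, p_word, p_pos

rng = np.random.default_rng(0)
n, d, V = 3, 4, 5
H, W = rng.normal(size=(n, d)), rng.normal(size=(V, d))
P = [rng.normal(size=(d, d)) for _ in range(4)]
word, k, s, p_word, p_pos = predict_word_and_position(H, W, *P)
print(word, k, s)
```

Sampling instead of taking the arg max would only change the two `np.argmax` calls.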
As described in Sec. 3.2, we update the relative position matrix according to the relational vector predicted by the decoder. Because updating the relative positions does not change the pre-computed relative-position representations, generation from the Transformer-INDIGO is as efficient as from the standard Transformer.
Training requires maximizing the marginalized likelihood in Eq. (2). This is however intractable since we need to enumerate all $T!$ permutations of tokens. Instead, we maximize the evidence lower-bound (ELBO) of the original objective by introducing an approximate posterior distribution of generation orders $q(\pi \mid x, y)$, which provides the probabilities of latent generation orders based on the ground-truth sequences $x$ and $y$:
$$\log p_\theta(y \mid x) \ge \mathbb{E}_{\pi \sim q(\pi \mid x, y)}\left[\log p_\theta(y_\pi \mid x)\right] + \mathcal{H}(q),$$
where , sampled from , is represented as relative positions. Given a sampled order, the learning objective is divided into the translation objective (word prediction loss) and the position prediction objective, where for the latter, we compute the position probability as the sum of
where the summed terms correspond to the pointed positions that result in the same relative position, considering that each insertable location (except for ) can be reached from the two words on its left and right sides. Here we study two types of $q$:
If we already possess some prior knowledge about the sequence, e.g., the L2R order is proven to be a strong baseline in many scenarios (Cho et al., 2015), we assume a Dirac-delta distribution $q(\pi \mid x, y) = \delta(\pi = \pi^*(x, y))$, where $\pi^*(x, y)$ is a predefined order given $x$ and $y$. In this paper, we study a set of pre-defined orders for evaluating their effect on generation.
Searched Adaptive Order (SAO)
Another option is to choose the approximate posterior $q$ as the point estimate that maximizes $\log p_\theta(y_\pi \mid x)$. In practice, we approximate these generation orders through beam search (BS, Pal et al., 2006).
More specifically, we maintain $B$ sub-sequences with the maximum probabilities using a set $\mathcal{B}$. At each step $t$ and for every sub-sequence in $\mathcal{B}$, we evaluate the probabilities of every possible choice from the remaining words and its corresponding position. We calculate the cumulative likelihood for each candidate, based on which we select the top-$B$ sub-sequences as the new set $\mathcal{B}$ for the next step. In the end, we optimize our objective as an average over the searched orders:
$$\mathcal{L}_{\text{SAO}} = \frac{1}{B} \sum_{\pi \in \mathcal{B}} \log p_\theta(y_\pi \mid x).$$
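The search over generation orders can be sketched as a standard beam search over permutations of the target words. The sketch below is a simplification under stated assumptions: `step_logp` is a hypothetical stand-in for the model's log-probability of generating a given remaining word (and its position) next, and `toy_step_logp` is an invented scorer used only to make the example runnable.

```python
from typing import Callable, List, Tuple

Order = Tuple[int, ...]

def search_orders(T: int,
                  step_logp: Callable[[Order, int], float],
                  beam_size: int) -> List[Tuple[float, Order]]:
    """Beam search over the orders in which T target words are generated."""
    beam: List[Tuple[float, Order]] = [(0.0, ())]
    for _ in range(T):
        candidates = []
        for score, order in beam:
            for i in range(T):
                if i not in order:   # only words that are still remaining
                    candidates.append((score + step_logp(order, i), order + (i,)))
        candidates.sort(key=lambda c: c[0], reverse=True)
        beam = candidates[:beam_size]
    return beam

def toy_step_logp(order: Order, i: int, T: int = 4) -> float:
    """Invented scorer: start near the end, then prefer neighbouring words."""
    if not order:
        return -float(T - 1 - i)
    return -float(abs(i - order[-1]))

beam = search_orders(4, toy_step_logp, beam_size=2)
objective = sum(score for score, _ in beam) / len(beam)   # average over the beam
print(beam[0])    # (-3.0, (3, 2, 1, 0))
print(objective)  # -3.5
```

In training, each retained order would contribute its own word and position losses, and the objective averages them as in the last two lines.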
Beam Search with Noise
The goal of beam search is to approximately find the most likely generation order given the current model, which limits the exploration of other generation orders that may not be favourable currently but may ultimately be deemed better. Furthermore, prior research (Vijayakumar et al., 2016) pointed out that the search space of standard beam search is restricted, and proposed a more diverse beam search that divides the beam into groups. We encourage exploration by injecting noise during beam search (Cho, 2016). In particular, we found it effective to keep dropout on during the search, which greatly increases the diversity of the searched orders.
Bootstrapping from a Pre-defined Order
During preliminary experiments, sequences returned by beam search often degenerated by always predicting common or function words (e.g. ‘‘the’’, ‘‘,’’, etc.) as the first several tokens, leading to inferior performance. We conjecture that this is because the position prediction module learns much faster than the word prediction module, and it quickly captures spurious correlations induced by a poorly initialized model. It is essential to balance the learning process of these modules. To do so, we bootstrap the model by first training it with a pre-defined order (e.g. L2R), and then continue training from the pre-trained parameters with beam-searched orders.
As for decoding, we directly follow Algorithm 1 to sample or decode greedily from the proposed model. However, in practice beam search is important for exploring the output space of neural autoregressive models. In our implementation, we also perform beam search for INDIGO as a two-step search. Suppose the beam size is $B$: at each step, we first do beam search for word prediction, and then, with the searched words, try out all possible positions and select the top-$B$ sub-sequences.
We evaluate INDIGO extensively on three challenging sequence generation tasks: word order recovery, machine translation and natural language to code generation (NL2Code). We compare the pre-defined order (the left-to-right order by default) with the adaptive orders obtained by beam search for training the Transformer-INDIGO.
5.1 Experimental Settings
The machine translation experiments are conducted on three language pairs: WMT 16 Romanian-English (Ro-En; http://www.statmt.org/wmt16/translation-task.html), WMT 18 English-Turkish (En-Tr; http://www.statmt.org/wmt18/translation-task.html) and KFTT English-Japanese (En-Ja; http://www.phontron.com/kftt/), considering the diversity of grammatical structures of target languages. We use the English part of the Ro-En dataset as the training examples for the word order recovery task. For the NL2Code task, we use the Django dataset (Oda et al., 2015; https://github.com/odashi/ase15-django-dataset). The dataset statistics can be found in Table 1. Target sequences are much shorter for code generation than for the machine translation datasets.
We apply the standard tokenization (https://github.com/moses-smt/mosesdecoder) and normalization on all datasets. We perform joint BPE (Sennrich et al., 2015) operations for the MT datasets, and use all unique words as the vocabulary for NL2Code.
|Dataset||Train||Dev||Test||Average target length (words)|
|Pre-defined order|Description|
|Left-to-right (L2R)|Generate words from left to right.|
|Right-to-left (R2L)|Generate words from right to left.|
|Odd-Even (ODD)|Generate words at odd positions from left to right, then generate even positions similarly.|
|Syntax-tree (SYN)|Generate words with a top-down left-to-right order based on the dependency tree.|
|Common-First (CF)|Generate all common words first from left to right, and then generate the others.|
|Rare-First (RF)|Generate all rare words first from left to right, and then generate the remaining.|
|Random (RND)|Generate words in a randomly shuffled order.|
We set , , , , , and throughout all experiments. The source and target embedding matrices are shared except for En-Ja, where we found that duplicating the embeddings significantly improves the translation quality. Both the encoder and decoder use relative positions during self-attention except for the word order recovery experiments (where no position embedding is used). We did not introduce task-specific modules such as copying (Gu et al., 2016) for model simplicity.
When training with a pre-defined order, we first reorder the words of each training sequence accordingly in advance and generate another sequence which provides supervision of the ground-truth positions at which each word is to be inserted. We test the pre-defined orders listed in Table 2. SYN orders are generated using a dependency parser (https://spacy.io/usage/linguistic-features) to get the tree structures of the target sequences; CF and RF orders are obtained based on a vocabulary cutoff so that the number of common words and the number of rare words are approximately the same (Ford et al., 2018). We also consider generating with a random order associated with each sentence as a baseline. When using L2R as the pre-defined order, the Transformer-INDIGO is almost equivalent to the vanilla Transformer (enhanced with a small number of additional parameters for the position prediction), as the position prediction simply learns to predict the next position, which is to the left of the $\langle /s \rangle$ symbol. We also train the Transformer-INDIGO using the searched adaptive order (SAO) with a fixed default beam size.
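The preprocessing for pre-defined orders can be illustrated with a short Python sketch. It is an assumption-laden example rather than the released preprocessing code: `insertion_supervision`, `l2r`, `r2l` and `odd_even` are hypothetical helper names, generation indices 0 and 1 are reserved for the two boundary tokens, and each target location is reported with both of the pointers that reach it.

```python
from typing import List, Sequence, Tuple

Pointer = Tuple[int, int]   # (generation index k of the anchor, side s)

def l2r(T: int) -> List[int]:
    return list(range(T))

def r2l(T: int) -> List[int]:
    return list(range(T - 1, -1, -1))

def odd_even(T: int) -> List[int]:
    """Odd positions (1st, 3rd, ...) from left to right, then the even ones."""
    return list(range(0, T, 2)) + list(range(1, T, 2))

def insertion_supervision(words: Sequence[str],
                          order: Sequence[int]) -> List[Tuple[str, Pointer, Pointer]]:
    """For each generated word, list the two equivalent insertion pointers."""
    T = len(words)
    # Surface position -> generation index; the boundary tokens come first.
    gen_index = {0: 0, T + 1: 1}
    steps = []
    for step, i in enumerate(order, start=2):
        pos = i + 1
        left = max(p for p in gen_index if p < pos)    # nearest generated left word
        right = min(p for p in gen_index if p > pos)   # nearest generated right word
        steps.append((words[i], (gen_index[left], +1), (gen_index[right], -1)))
        gen_index[pos] = step
    return steps

words = ["I", "have", "a", "dream"]
for item in insertion_supervision(words, [3, 0, 2, 1]):
    print(item)
# ('dream', (0, 1), (1, -1))
# ('I', (0, 1), (2, -1))
# ('a', (3, 1), (2, -1))
# ('have', (3, 1), (4, -1))
```

With the L2R order, the second pointer of every step is `(1, -1)`, i.e. an insertion to the left of the end boundary token.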
|Model||Ro En||En Tr||En Ja|
|Adaptive Order from Beam-search|
5.2 Word Order Recovery
Word order recovery takes a bag of words as input and recovers its original word order. Here the word order is different from the generation order. We take the English translation of the Ro-En corpus, remove the order information and train our model to recover the correct order, which is challenging as the search space is factorial. Similar to memory networks (Weston et al., 2014), it is straightforward to accept such inputs without order information by removing both the positional encoding and the relative positional bias, and relying on self-attention to recover the order relations between the bag of source words. In addition, we did not restrict the vocabulary of the source words.
As shown in Fig. 3, we compare the L2R order with the searched adaptive order (SAO) with a beam size of for word order recovery. We vary the decoding beam-sizes from to .
It is clear that, with a more flexible generation order using SAO, the BLEU scores of the recovered sequences are much higher than the L2R order with a gain up to BLEU scores. Furthermore, increasing the decoding beam-size brings much more improvements for SAO compared to L2R, which suggests that INDIGO produces much more diverse predictions and has a higher chance to cover the correct outputs. In subsequent experiments, we set the default beam-size as and for pre-defined orders and SAO, respectively.
5.3 Machine Translation
We evaluate the proposed INDIGO on three translation datasets where the target languages (English (En), Turkish (Tr) and Japanese (Ja)) vary. Neural autoregressive models (e.g. the Transformer) have achieved impressive performance on machine translation. However, conventional models do not take generation orders into consideration.
As shown in Table 3, we compared our algorithm trained with pre-defined orders as well as the searched adaptive orders (SAO) with varied setups. First of all, it is clear that almost all pre-defined orders (except for the random order baseline) are able to perform reasonably well with Transformer-INDIGO, while the best score is always reached by the L2R order except for the test set of En-Ja where the R2L order works slightly better. It might suggest that in machine translation, the monotonic orders, L2R and R2L, introduce reasonable inductive bias that reflects the nature of languages. It is found that INDIGO using SAO can achieve competitive and even statistically significant improvements over the L2R order. Also, the improvements get larger for non-English languages which may indicate that for languages e.g. Turkish and Japanese with very different syntactic structures from English, a more flexible generation order improves translation quality.
Furthermore, we conduct ablation study over different settings for the searched order. Table 3 shows that both bootstrapping and searching with noise are effective approaches without which learning from SAO will consistently degenerate by around BLEU on Ro-En. In addition, we show that increasing the beam-size for SAO helps to improve the decoding, while the improvements decrease.
5.4 Code Generation
We also evaluate the proposed INDIGO on natural language to code generation tasks on the Django dataset. As shown in Table 4, SAO works significantly better than the L2R order on both the generation BLEU and accuracy, which indicates flexible generation orders are more preferable in generating code.
5.5 Case Study
We demonstrate how INDIGO works in practice by uniformly sampling examples from the validation set of each task. As shown in Fig 4, the proposed model generates sequences differently based on its learning orders (either pre-defined or SAO). For instance, the model is able to decode as if from a dependency tree if we use the SYN order. We also summarize several common features about the inferred orders using SAO: (1) usually starting with the period; (2) decoding as chunks where words are generated in a L2R order in each chunk.
6 Related Work
Neural autoregressive modelling has become one of the most successful approaches for generating sequences (Sutskever et al., 2011; Mikolov, 2012), which has been widely used in a range of applications, such as machine translation (Sutskever et al., 2014), dialogue response generation (Vinyals and Le, 2015), image captioning (Karpathy and Fei-Fei, 2015) and speech recognition (Chorowski et al., 2015). Another stream of work focuses on generating a sequence of tokens in a non-autoregressive fashion (Gu et al., 2017; Lee et al., 2018; Oord et al., 2017), in which the discrete tokens are generated in parallel, while the strong dependencies among tokens are preserved through methods, such as distillation. Semi-autoregressive modelling (Stern et al., 2018; Wang et al., 2018) is a trade-off of the above two approaches, while largely adhering to left-to-right generation. Our method is radically different from these approaches as we support flexible generation orders, while preserving the dependencies among generated tokens.
Previous studies on generation order of sequences mostly resort to a fixed set of generation orders. Wu et al. (2018) empirically show that right-to-left generation outperforms its left-to-right counterpart in a few tasks. Ford et al. (2018) devise a two-pass approach that produces partially-filled sentence ‘‘templates’’ and then fills in missing tokens. Mehri and Sigal (2018) propose a middle-out decoder that first predicts a middle word and then simultaneously expands the sequence in both directions. Another line of work models the probability of a sequence as a tree or directed graph (Zhang et al., 2015; Dyer et al., 2016; Aharoni and Goldberg, 2017). In contrast, ours is the first work that supports fully flexible generation orders without pre-defined orders.
We have presented a novel decoding algorithm -- INDIGO -- which supports flexible sequence generation following arbitrary orders. Our model, which enhances the vanilla Transformer, can be trained with either a pre-defined generation order or an adaptive order obtained by beam search. In contrast to conventional neural autoregressive decoding methods, which generate from left to right, our model is more flexible in how it generates sequences. Experiments show that our method achieves competitive or even better performance compared to conventional left-to-right generation on three real-world tasks, including machine translation, word order recovery and code generation.
- Aharoni and Goldberg (2017) Roee Aharoni and Yoav Goldberg. 2017. Towards string-to-tree neural machine translation. arXiv preprint arXiv:1704.04743.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Cho (2016) Kyunghyun Cho. 2016. Noisy parallel approximate decoding for conditional recurrent language model. arXiv preprint arXiv:1605.03835.
- Cho et al. (2015) Kyunghyun Cho, Aaron Courville, and Yoshua Bengio. 2015. Describing multimedia content using attention-based encoder-decoder networks. Multimedia, IEEE Transactions on, 17(11):1875--1886.
- Chorowski et al. (2015) Jan K Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. 2015. Attention-based models for speech recognition. In NIPS, pages 577--585.
- Dyer et al. (2016) Chris Dyer, Adhiguna Kuncoro, Miguel Ballesteros, and Noah A Smith. 2016. Recurrent neural network grammars. arXiv preprint arXiv:1602.07776.
- Ford et al. (2018) Nicolas Ford, Daniel Duckworth, Mohammad Norouzi, and George E Dahl. 2018. The importance of generation order in language modeling. arXiv preprint arXiv:1808.07910.
- Gu et al. (2017) Jiatao Gu, James Bradbury, Caiming Xiong, Victor OK Li, and Richard Socher. 2017. Non-autoregressive neural machine translation. arXiv preprint arXiv:1711.02281.
- Gu et al. (2016) Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
- Karpathy and Fei-Fei (2015) Andrej Karpathy and Li Fei-Fei. 2015. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3128--3137.
- Lee et al. (2018) Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. arXiv preprint arXiv:1802.06901.
- Mehri and Sigal (2018) Shikib Mehri and Leonid Sigal. 2018. Middle-out decoding. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5523--5534. Curran Associates, Inc.
- Mikolov (2012) Tomáš Mikolov. 2012. Statistical language models based on neural networks. Presentation at Google, Mountain View, 2nd April.
- Oda et al. (2015) Yusuke Oda, Hiroyuki Fudaba, Graham Neubig, Hideaki Hata, Sakriani Sakti, Tomoki Toda, and Satoshi Nakamura. 2015. Learning to generate pseudo-code from source code using statistical machine translation. In Proceedings of the 2015 30th IEEE/ACM International Conference on Automated Software Engineering (ASE), ASE ’15, pages 574--584, Lincoln, Nebraska, USA. IEEE Computer Society.
- Oord et al. (2017) Aaron van den Oord, Yazhe Li, Igor Babuschkin, Karen Simonyan, Oriol Vinyals, Koray Kavukcuoglu, George van den Driessche, Edward Lockhart, Luis C Cobo, Florian Stimberg, et al. 2017. Parallel wavenet: Fast high-fidelity speech synthesis. arXiv preprint arXiv:1711.10433.
- Pal et al. (2006) Chris Pal, Charles Sutton, and Andrew McCallum. 2006. Sparse forward-backward using minimum divergence beams for fast training of conditional random fields. In Acoustics, Speech and Signal Processing, 2006. ICASSP 2006 Proceedings. 2006 IEEE International Conference on, volume 5, pages V--V. IEEE.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- Sennrich et al. (2015) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Shaw et al. (2018) Peter Shaw, Jakob Uszkoreit, and Ashish Vaswani. 2018. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
- Stern et al. (2018) Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. 2018. Blockwise parallel decoding for deep autoregressive models. In Advances in Neural Information Processing Systems, pages 10107--10116.
- Sutskever et al. (2011) Ilya Sutskever, James Martens, and Geoffrey E Hinton. 2011. Generating text with recurrent neural networks. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 1017--1024.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. NIPS.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS).
- Vijayakumar et al. (2016) Ashwin K Vijayakumar, Michael Cogswell, Ramprasath R Selvaraju, Qing Sun, Stefan Lee, David Crandall, and Dhruv Batra. 2016. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424.
- Vinyals et al. (2015) Oriol Vinyals, Samy Bengio, and Manjunath Kudlur. 2015. Order matters: Sequence to sequence for sets. arXiv preprint arXiv:1511.06391.
- Vinyals and Le (2015) Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
- Wang et al. (2018) Chunqi Wang, Ji Zhang, and Haiqing Chen. 2018. Semi-autoregressive neural machine translation. arXiv preprint arXiv:1808.08583.
- Weston et al. (2014) Jason Weston, Sumit Chopra, and Antoine Bordes. 2014. Memory networks. arXiv preprint arXiv:1410.3916.
- Wu et al. (2018) Lijun Wu, Xu Tan, Di He, Fei Tian, Tao Qin, Jianhuang Lai, and Tie-Yan Liu. 2018. Beyond error propagation in neural machine translation: Characteristics of language also matter. arXiv preprint arXiv:1809.00120.
- Yin and Neubig (2017) Pengcheng Yin and Graham Neubig. 2017. A syntactic neural model for general-purpose code generation. arXiv preprint arXiv:1704.01696.
- Zhang et al. (2015) Xingxing Zhang, Liang Lu, and Mirella Lapata. 2015. Top-down tree long short-term memory networks. arXiv preprint arXiv:1511.00060.