Tree-Structured Neural Machine
for Linguistics-Aware Sentence Generation
Abstract
Unlike other sequential data, sentences in natural language are structured by linguistic grammars. Previous generative conversational models with chain-structured decoders ignore this structure in human language and may generate plausible responses with less satisfactory relevance and fluency. In this study, we aim to incorporate the results of linguistic analysis into the process of sentence generation for high-quality conversation generation. Specifically, we use a dependency parser to transform each response sentence into a dependency tree and construct a training corpus of sentence-tree pairs. A tree-structured decoder is developed to learn the mapping from a sentence to its tree, where different types of hidden states are used to depict the local dependencies from an internal tree node to its children. To accelerate training, we propose a tree canonicalization method, which transforms trees into equivalent ternary trees. Then, with a proposed tree-structured search method, the model generates the most probable responses in the form of dependency trees, which are finally flattened into sequences as the system output. Experimental results demonstrate that the proposed X2Tree framework outperforms baseline methods with an 11.15 percentage-point increase in acceptance ratio.
Ganbin Zhou, Ping Luo, Rongyu Cao, Yijun Xiao, Fen Lin, Bo Chen, Qing He
Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS, Beijing 100190, China. {zhouganbin, luop, heqing}@ict.ac.cn
University of Chinese Academy of Sciences, Beijing 100049, China.
Department of Computer Science, University of California Santa Barbara, Santa Barbara, CA 93106, USA.
WeChat Search Application Department, Tencent, China.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Introduction
Many natural language processing tasks can be formulated as sequence-to-sequence problems: given a sequence of tokens, the task is to generate another sequence of tokens, of equal or different length. For example, machine translation models try to find a sequence of words in the target language expressing the same meaning as a source sentence; conversational models respond to a post utterance with a semantically coherent and grammatically correct sentence. Neural models have been applied to these tasks and achieved state-of-the-art performance in recent years (?; ?; ?; ?; ?).
These neural models in essence use a chain-structured decoder to sequentially generate tokens given a context vector encoded from an input sequence. This decoding process is mostly linear, meaning that tokens are produced in the order of their appearance, and it basically considers the dependency between each word and all its preceding ones. RNN models, such as LSTM (?) and GRU (?), were developed to capture both short- and long-distance dependencies over this chain structure.
Our work improves upon these studies by incorporating the results of linguistic analysis into the decoder. Specifically, we leverage a dependency parser to transform each response sentence into a dependency tree, which contains more local dependency information. The proposed model learns to map a sentence into a canonicalized tree, which is then flattened as the final output. Consider the intermediate task for automatic conversation generation: instead of generating the response to a given input post directly, we aim to generate the dependency parse tree of the corresponding response in a top-down fashion. Additionally, a tree canonicalization method is proposed, which transforms trees with varying numbers of children into an equivalent form, namely full ternary trees, in order to accelerate training and simplify model implementation on GPUs. Then, a post-processing step converts the dependency tree into a sequence as the final response. We also theoretically prove that the ternary tree is the "best" choice with respect to model complexity.
Some models also process trees in a bottom-up fashion. Socher et al. (?) proposed a max-margin structure prediction architecture based on recursive neural networks, and demonstrated that it successfully parses sentences and understands scene images. Tai et al. (?) and Zhu et al. (?) extended the chain-structured LSTM to tree-structured LSTMs, which are shown to be more effective in representing a tree structure as a latent vector. All these models process trees bottom-up: children nodes are recursively merged into parent nodes until the root is generated.
However, bottom-up models require all the leaf nodes of the predicted tree to be given in advance. For example, to generate the constituency parse tree of a sentence (shown in Fig. 1(a)), the tokens appearing in the given sentence are used as the leaf nodes of the tree. Similarly, to parse natural scene images (?), an image is first divided into segments, each of which corresponds to one leaf node in the output tree. With these given leaves, the bottom-up process recursively builds the internal nodes until the root is reached.
Here, we argue that bottom-up generative models may not work well when the leaf nodes are not specified ahead of prediction. Consider the task in Fig. 1(b), an intermediate task for automatic conversation generation. Instead of generating the response to a given input post directly, we aim to generate the dependency parse tree of the corresponding response; a post-processing step then converts the dependency tree into a sequence as the final response. (The motivation for this solution is detailed in Section Tree Generation.) Compared to the Seq2Seq solution to conversation generation, we argue that this tree-structured modeling is more effective due to a shorter average decoding length and the extra structural information provided by the parse tree. In this task, it is clearly seen that, since the tokens of the response are not explicitly given by the input post, it may not be appropriate to generate the dependency tree from bottom to top.
Previous works on tree-structured LSTMs (?; ?; ?) show that incorporating syntactic structures into the encoder or decoder yields sentence embeddings with improved performance on tasks like sentiment analysis and semantic relatedness. In this paper, we propose to inject tree structures into the decoding process with the following motivations:
1) Dependency parsing extracts short-distance dependencies in local areas of a sentence. Utilizing these linguistic results reduces the difficulty of sequential learning and thus helps decoders generate grammatically and semantically correct utterances. Let $y$ be the response sentence to an input $x$, and $T_y$ be the dependency tree of $y$. Then, the average path length from a node in $T_y$ to its root is $O(\sqrt{|y|})$ (?), much smaller than the sentence length $|y|$. Thus, this tree transformation may alleviate the long-distance gap in sequence generation. 2) Words at higher levels of the dependency tree are usually more influential for the sentence. By generating more "important" words at earlier stages of the decoding process, we essentially free the decoder from the burden of storing important semantic information for many time steps. 3) We also believe that the process of tree-structured sentence generation is more consistent with how humans construct sentences. Although people speak a sentence in sequential order, they may keep some keywords, such as verbs and nouns, in mind before filling in more descriptive adjectives and adverbs to form the full sentence.
In this paper, we develop a tree-structured decoder in the framework of "X to tree" (X2Tree) learning, where X represents any structure (e.g. chain, tree) encoding the post as a latent vector. Since the tokens of the response are not explicitly given by the input post, it is appropriate to generate the dependency tree from top to bottom. To this end, we need to address the following challenges:
1) We need to carefully model the different dependencies between a tree node and its children. Children at different positions may have different meanings, and the generation of a child node depends not only on its parent and ancestors but also on its siblings. Thus, we need to fully consider the memory inherited from both its ancestors and its siblings (detailed in Section Generative Model for $K$-ary Full Tree).
2) A tree node may have any number of children, and it is non-trivial to determine this number automatically. Furthermore, GPU-based parallel computing is difficult when the number of children differs from node to node. We therefore need a tree canonicalization process that outputs an equivalent standard tree in which each internal node has a fixed number of children (detailed in Section Tree Canonicalization).
3) In model inference, we must develop an algorithm that searches for the most probable trees instead of sequences. Since the beam search utilized by previous studies only handles chain structures, a more general search algorithm for tree structures needs to be developed (detailed in Section Tree Generation).
With all these challenges addressed, our main contributions are twofold: 1) We propose a generative neural machine for tree structures and apply it to conversational modeling. Specifically, we introduce a tree canonicalization method to standardize the generative process and a greedy search method for tree-structure inference. 2) We empirically demonstrate that the proposed method successfully predicts the dependency trees of conversational responses to an input post. Specifically, for the task of automatic conversation, the proposed X2Tree framework achieves an 11.15 percentage-point increase in acceptance ratio.
It is also worth mentioning that we do not need a perfect dependency parser. In our task, the sequential sentence is the final output, while the dependency tree is only an intermediate result. If the parsed trees contain errors in similar patterns, the model can learn these patterns; after converting the generated tree into a sequence, the sequence may still be correct, which is also demonstrated by the experiments.
X2Tree Neural Network
In this section, we introduce the X2Tree learning framework. The training dataset is given as $D = \{(x_i, T_i)\}_{i=1}^{N}$, where $T_i$ is the tree (e.g. the dependency tree) corresponding to the response $y_i$ of the post $x_i$. Our task is to learn the mapping from $x$ to a tree structure $T$. Specifically, we adopt the encoder-decoder framework: we assume $x$ has already been encoded as a latent vector (see e.g. (?; ?)), and mostly focus on the tree-structured decoder for the generation of $T$.
As aforementioned, the developed decoder adopts a top-down generative process. The atomic step is generating the children for a given node, and it is performed on each node until no more valid nodes can be generated. Thus, the key to the decoder is modeling the parent-children dependency. Note also that the model parameters for this dependency are shared across all the atomic steps of tree generation.
We first assume the tree is a $K$-ary full tree, where every internal node has exactly $K$ children, and model this type of tree in Section Generative Model for $K$-ary Full Tree. Then, we propose a canonicalization method that transforms any tree into a $K$-ary full tree and discuss the choice of $K$ for different applications in Section Tree Canonicalization. Finally, we introduce an algorithm for tree inference in Section Tree Generation.
Generative Model for $K$-ary Full Tree
Here, we propose a generative model for the $K$-ary full tree. For simplicity, $x$ also represents the latent vector encoded from the input post. Within the probabilistic learning framework, our main task is to express the conditional probability $p(T \mid x)$ for a pair $(x, T)$. We can first reformulate $p(T \mid x)$ as:

$$p(T \mid x) = p(r \mid x)\, p(\mathcal{N} \mid r, x), \qquad (1)$$

where $r$ and $\mathcal{N}$ denote the root and the set of non-root nodes, respectively. The first term in Equ. (1) is modeled as $p(r = w \mid x) \propto \exp\big(g(w, x)\big)$, normalized over $w \in V$, where $g$ is a non-linear and potentially multi-layered function, and $V$ is the vocabulary containing all possible values of the discrete random variables.
To model $p(\mathcal{N} \mid r, x)$, we make the following conditional independence assumption:
Assumption 1.
The children of different nodes are conditionally independent given their ancestors.
With Assumption 1, $p(\mathcal{N} \mid r, x)$ is decomposed as:

$$p(\mathcal{N} \mid r, x) = \prod_{v \in \{r\} \cup \mathcal{N}} p\big(C(v) \mid v, A(v), x\big), \qquad (2)$$

where $C(v)$ denotes the set of $v$'s children, and $A(v)$ denotes all of $v$'s ancestors.
We then move on to model the conditional probability $p(C(v) \mid v, A(v), x)$. Concretely, since the child nodes of a parent usually correlate with each other, it is inappropriate to assume conditional independence among them. Thus, the probability is decomposed into the following ordered conditional probabilities:

$$p\big(C(v) \mid v, A(v), x\big) = \prod_{k=1}^{K} p\big(c_k \mid c_1, \dots, c_{k-1}, v, A(v), x\big), \qquad (3)$$

where $c_k$ denotes the $k$-th child of $v$.
Furthermore, we argue that children at different positions carry different underlying meanings. Hence, different types of hidden states are designed for the children of node $v$:

$$h_{c_k} = f_k(h_v, e_v), \quad k = 1, \dots, K, \qquad (4)$$

where $f_1, \dots, f_K$ are activation functions, which can be LSTM or other RNN cells, $e_v$ is the embedding of the word at node $v$, and $h_v$ denotes the hidden state fed to node $v$, containing the memory from $v$'s ancestors $A(v)$; $h_r = x$ for the root node. With $h_{c_k}$, we define $p(c_k \mid c_1, \dots, c_{k-1}, v, A(v), x)$ as follows:

$$p\big(c_k = w \mid c_1, \dots, c_{k-1}, v, A(v), x\big) \propto \exp\big(g(w, \tilde{h}_k)\big), \qquad (5)$$

where $\tilde{h}_k$ is the concatenation of $h_{c_k}$ and the embeddings of the elder siblings $c_1, \dots, c_{k-1}$.
The modeling of the parent-children dependency is summarized in Fig. 2. With all these components, we train the X2Tree model by maximizing the data likelihood, namely

$$\theta^{*} = \arg\max_{\theta} \sum_{(x, T) \in D} \log p(T \mid x; \theta). \qquad (6)$$
It is worth mentioning that, in order to explicitly signal the end of tree generation, we add the special token "eob" (short for "End Of Branch") to the leaf nodes as their children. Hence, all the leaf nodes of the trees in the training dataset are eob nodes.
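As a concrete illustration of the factorization in Equ. (1)-(2) with eob-closed leaves, the following minimal Python sketch scores a tree by summing one root term and one children-group term per node. The `step_logprob` scorer is a hypothetical stand-in for the neural model of Equ. (3)-(5); the representation and names are ours, not the paper's.

```python
from math import log

def tree_logprob(tree, step_logprob):
    """log p(T | x) under Assumption 1: one term for the root plus one
    ordered children-group term per node, conditioned on its ancestors.
    A node is (word, children); a closed leaf has children == (), and its
    children-group term scores the implicit "eob" group."""
    def rec(node, ancestors):
        word, kids = node
        # p(C(v) | v, A(v), x): one group term for this node's children
        lp = step_logprob([k[0] for k in kids], word, ancestors)
        return lp + sum(rec(k, ancestors + [word]) for k in kids)
    root_term = step_logprob([tree[0]], None, [])  # p(r | x)
    return root_term + rec(tree, [])
```

With a uniform toy scorer that assigns log 0.5 to every step, a three-node tree accumulates four terms (one root term plus one group term per node), which makes the factorization easy to check by hand.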
Tree Canonicalization
As aforementioned, the proposed X2Tree model requires the tree to be a $K$-ary full tree. However, in the dialogue generation task, a response sentence can be parsed into a dependency tree with any number of children at each node. During training and generation, it is difficult to determine the number of children of a word. Additionally, variable-length data is tricky for GPU acceleration. Hence, the original dependency tree is canonicalized into a $K$-ary full tree before training.
Basically, the transformed $K$-ary full tree should be equivalent to the original one. In other words, there must exist an algorithm supporting the bidirectional transformation between a tree and its $K$-ary full counterpart. Since the number of model parameters grows linearly with $K$, to reduce model complexity we want $K$ to be as small as possible. For a given tree, a simple method to transform it into a full tree is to fill all the empty positions with eob nodes. With this method, every tree node obtains $K$ children, where $K$ is the maximal number of immediate children over all tree nodes. However, when $K$ is large and the tree nodes are sparse, the redundant eob nodes significantly increase the learning complexity. Hence, ideally, before the eob-filling step we want to transform the tree into a binary or ternary tree.
Here, we mainly consider two scenarios. For an ordered tree, where an ordering is specified over the children of each node, we can transform it into a left-child right-sibling (LCRS) binary tree (?). This transformation is reversible, with a one-to-one mapping between the ordered tree and its LCRS counterpart. However, for conversational generation tasks we need to flatten the predicted tree into a sequence, and therefore need to store position information in the dependency tree. For this purpose, we first define the sequence-preserved tree (SP tree for short).
Definition 1.
An SP tree is an ordered tree where each node $v$ is tagged with an integer $m_v$, $0 \le m_v \le d_v$, where $d_v$ is the number of children of node $v$.
The in-order traversal of an SP tree corresponds to a node sequence. Node $v$'s children are divided into two parts: the left part contains the first $m_v$ children (children are ordered from left to right), while the right part contains the remaining $d_v - m_v$ children. In the in-order traversal we first visit the nodes in the left part, then the current node, and finally the right part. Fig. 3 shows three SP trees with their corresponding sequences. Obviously, the dependency tree of a sentence is an SP tree, where the number attached to each node can be obtained by checking the positions of the node and its children in the original sentence, as shown in Figure 1(b). For example, the node "says" obtains the number 1, which means one child of this node lies on its left in the original sequence. As discussed earlier, a tree canonicalization step is needed to transform the original dependency tree into a $K$-ary full tree. To preserve sequence order, we transform the dependency tree into a ternary tree. We now present the algorithm and discuss why the ternary tree is the "best" choice. Alg. 1 details this canonicalization process, and an illustration is shown in Fig. 4.
In a ternary tree, each node has three children, namely the left, middle and right child. For a node $v$ with attached number $m_v$ and children $c_1, \dots, c_{d_v}$, Alg. 1 first determines its left and middle child in the ternary tree: its left child is set to $c_1$, the first child in the original tree, and its middle child is set to $c_{m_v+1}$. Every other child $c_j$ is set as the right child of $c_{j-1}$ recursively. With this ternary tree, a simple in-order traversal in the order of left child, parent, middle child and right child restores the sequence.
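The construction just described can be sketched as follows. This is a minimal Python illustration under our reading of Alg. 1; the class and helper names are ours, and the eob-filling step that completes the full ternary tree is omitted.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SPNode:                     # a node of the sequence-preserved tree
    word: str
    m: int                        # number of children to the word's left
    children: list = field(default_factory=list)

@dataclass
class TNode:                      # a node of the canonical ternary tree
    word: str
    left: Optional["TNode"] = None
    mid: Optional["TNode"] = None
    right: Optional["TNode"] = None

def _chain(nodes):
    """Link a run of SP siblings into a right-child chain of ternary nodes."""
    head = None
    for sp in reversed(nodes):
        t = to_ternary(sp)
        t.right = head
        head = t
    return head

def to_ternary(sp):
    """Left child takes the children before the word, middle child the rest."""
    t = TNode(sp.word)
    t.left = _chain(sp.children[:sp.m])
    t.mid = _chain(sp.children[sp.m:])
    return t

def flatten(t):
    """In-order traversal (left, word, middle, right) restores the sentence."""
    if t is None:
        return []
    return flatten(t.left) + [t.word] + flatten(t.mid) + flatten(t.right)
```

For instance, the (invented) parse of "the cat sat quietly", rooted at "sat" with "cat" on its left and "quietly" on its right, round-trips back to the original word order through `to_ternary` and `flatten`.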
Next, we prove that the resulting ternary tree is equivalent to the original SP tree in the sense that they can be transformed into each other.
Theorem 1.
Given any SP tree $T$, it can be transformed into a ternary tree $T'$, and $T'$ can be transformed back into the original tree $T$.
Proof.
Using Alg. 1, we can transform $T$ into a ternary tree $T'$.
We now show how to transform $T'$ back into $T$. For each node $u$ in $T'$ that is not a right child, let $u_1$ denote the right child of $u$, $u_2$ the right child of $u_1$, and so on, until $u_k$ has no right child.
In the original tree $T$, the nodes $u, u_1, \dots, u_k$ must be siblings. For simplicity, let $p$ denote their parent.
1) If $u$ is a left child in $T'$, then $(u, u_1, \dots, u_k)$ are the first, second, …, $(k{+}1)$-th children of $p$ in the original SP tree $T$.
2) If $u$ is a middle child in $T'$, then $(u, u_1, \dots, u_k)$ are the $(m_p{+}1)$-th, $(m_p{+}2)$-th, …, $(m_p{+}k{+}1)$-th children of $p$ in the original SP tree $T$.
In this way, for each node in $T'$ we can find its original position in $T$, which reconverts $T'$ to $T$.
∎
Additionally, we prove that the ternary tree is the "best" choice with respect to model complexity. Theoretically, a dependency tree is equivalent to a $K$-ary tree whenever $K \ge 3$. Since the number of model parameters grows linearly with $K$, we prefer simpler models with smaller values of $K$. Theorem 2 formally shows that SP trees are not equivalent to binary trees; therefore, the ternary tree is the "best" choice. Thus, before training we perform a preprocessing step that converts each response into its corresponding dependency tree (an instance of an SP tree) and canonicalizes it into a ternary tree. A visualization of this canonicalization process is provided in the slides in the supplemental files.
Theorem 2.
No algorithm exists that transforms every SP tree $T$ into an LCRS tree $T'$ and reconverts $T'$ back to $T$.
Proof.
Let $S_n$, $O_n$ and $L_n$ respectively denote the sets of sequence-preserved trees, ordered trees and LCRS trees with $n$ nodes. Since ordered trees and LCRS trees are in one-to-one correspondence (?), we have $|O_n| = |L_n|$.
An ordered tree converts to an SP tree once the tag $m_v$ is defined for each node $v$, and different choices of the tags yield different SP trees. Thus, $|S_n| > |O_n|$. Now suppose an algorithm exists that transforms every SP tree $T$ into an LCRS tree $T'$ and reconverts $T'$ to $T$. Such a reversible mapping implies $|S_n| \le |L_n| = |O_n|$, which contradicts $|S_n| > |O_n|$.
∎
Note that the generated tree is a full tree (leaf nodes may appear at different depths) but not a perfect tree. For a sentence with $n$ words, the transformed $k$-ary full tree contains exactly $kn + 1$ nodes, of which the extra $(k-1)n + 1$ nodes are eob tokens. Thus, only the eob nodes induce computational waste, and to minimize this waste we want $k$ to be as small as possible. Theorems 1 and 2 tell us that $k = 3$ is the smallest value for which the transformed tree remains equivalent to the original dependency tree.
Tree Generation
With the trained model, we can infer the most probable trees for a given input $x$. In this section, we develop a greedy search algorithm for this inference task.
Beam search is traditionally adopted for sequence generation. At each step, it keeps the $K_g$ (called the global beam size) best candidates with the maximal probabilities so far, and only those candidates are expanded next. For each candidate on the beam, it grows a new node at the current end of the sequence. This process repeats until all candidates end with eob nodes.
Since a sequence is a special case of a tree, searching over tree generation raises additional challenges. First, an arbitrary tree has multiple leaves, each of which could potentially generate new children. Second, when growing new children for a leaf node, we need to generate all of its children as a whole, since they correlate with each other (as mentioned in Section Generative Model for $K$-ary Full Tree), and multiple groups of such children need to be generated as the best candidates.
We use the example in Fig. 5 to describe this tree generation method. The original tree has two leaves, the nodes "i" and "a". For each of these leaves, we can generate new children. Specifically, for node "i" it generates $K_l$ groups of children, as shown in Fig. 5(b). Since these new children are ordered, this local step of children generation is itself a task of sequence generation, so conventional beam search can be used; $K_l$ (called the local beam size) specifies the number of candidate sequences generated for each leaf. After child generation for all the leaves, we compare all the resulting candidate trees and retain only the top $K_g$ trees for the next round of generation. This process continues recursively until all the leaves in the tree are eob nodes. Note that the proposed method is a generalized beam search: beam search for sequence generation is the special case $K = 1$, since a sequence is equivalent to a 1-ary tree. The method is detailed in Algorithm 2. A visualization of this search process is provided in the slides in the supplementary files.
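A compact, self-contained sketch of this generalized beam search follows. It is our own simplification of the procedure: positions within a children group are ignored, and the per-leaf local beam is folded into a toy proposal table whose words and probabilities are entirely invented.

```python
import heapq
from math import log

# Toy stand-in for the local beam search: for each word it lists the top
# children groups (already truncated to the local beam size) with their
# log-probabilities. An empty group means the model emits "eob".
PROPOSALS = {
    "sat": [(("cat", "quietly"), log(0.6)), (("cat",), log(0.3))],
    "cat": [(("the",), log(0.7)), ((), log(0.2))],
}
DEFAULT = [((), log(0.9))]        # unknown words close their branch

def expand(tree):
    """Candidate trees obtained by growing any one unexpanded leaf of `tree`.
    A tree is (word, kids); kids is None while the leaf is unexpanded."""
    word, kids = tree
    if kids is None:
        return [((word, tuple((w, None) for w in group)), lp)
                for group, lp in PROPOSALS.get(word, DEFAULT)]
    out = []
    for i, kid in enumerate(kids):
        out += [((word, kids[:i] + (t,) + kids[i + 1:]), lp)
                for t, lp in expand(kid)]
    return out

def tree_search(root_word, k_global):
    """Generalized beam search: pool the expansions of every tree on the
    beam and keep the k_global most probable, until all branches close."""
    beam = [(0.0, (root_word, None))]
    while True:
        grown, done = [], True
        for lp, tree in beam:
            subs = expand(tree)
            if subs:
                done = False
                grown += [(lp + d, t) for t, d in subs]
            else:
                grown.append((lp, tree))     # fully expanded: carry forward
        if done:
            return max(beam, key=lambda c: c[0])[1]
        beam = heapq.nlargest(k_global, grown, key=lambda c: c[0])
```

With the toy table above, the search settles on the tree rooted at "sat" with subtrees "cat"("the") and "quietly", since that assignment maximizes the accumulated log-probability.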
Experiment Settings
Dataset Details
Our experiments focus on the dialogue generation task. 14 million post-response pairs were obtained from Tencent Weibo (http://t.qq.com/?lang=en_US). After removing spam and advertisements, the remaining pairs were split into a training set and a held-out set for model validation.
Benchmark Methods
We implemented the following four popular neural dialogue models for comparison:

Seq2Seq (?): An RNN model that uses the last hidden state of the encoder as the initial hidden state of the decoder;

EncDec (?): An RNN model that feeds the last hidden state of the encoder to every cell and softmax unit of the decoder;

ATT (?): An RNN model based on EncDec with an attention signal;

NRM(?): Neural Responding Machine with both global and local schemes.
All these models map sequences to sequences directly, and only differ in how they summarize the encoder hidden states into a latent vector. Thus, the proposed tree decoder can be applied to any of these models and potentially improve response quality from a different perspective. Here, we stress that this tree decoder can also be easily applied to the model of (?), which summarizes multiple rounds of dialogue into a latent vector. In future work, the tree decoder for multi-round dialogue will be evaluated.
Implementation Details
All sentences in the experiments are segmented by LTP. A vocabulary of the 28,000 most frequent Chinese words in the corpus, covering 97% of word occurrences, is used for training. Out-of-vocabulary words are replaced with "unk". Our implementations are based on the Theano library (?) on an NVIDIA K80 GPU. We applied a one-layer GRU (?) with 1,024-dimensional hidden states to X2Tree and all baseline models. As suggested in (?), the word embeddings for the encoders and decoders are learned separately, with dimension 128 for all models. All parameters were initialized from a uniform distribution between -0.01 and 0.01. Training uses mini-batches with ADADELTA (?) for optimization, and stops if the perplexity on the validation set increases for 4 consecutive epochs. Models with the best perplexities are selected for further evaluation. When generating responses, X2Tree uses the generalized beam search with global beam size $K_g$ and local beam size $K_l$; the other X2Seq baseline models use conventional beam search.
Evaluation Methods
Due to the highly diverse nature of dialogue, it is practically impossible to construct a data set that adequately covers all responses for each given post. Hence, we rely on human judgment in our experiments. In detail, 3 labelers were invited to evaluate the quality of responses to 300 randomly sampled posts. For each post, each model generated its top responses. For fair comparison, we create a single file in which each post is followed by its responses, shuffled so that the labelers cannot tell which model generated each response.
For each response the labelers determine the quality to be one of the following three levels:

Level 1: The response is ungrammatical.

Level 2: The response is basically grammatical but irrelevant to the input post.

Level 3: The response is grammatical and relevant to the input post. A response at this level is acceptable for a dialogue system.
From the labeling results, the average percentages of responses at each level are calculated. Additionally, labeling agreement is evaluated by Fleiss' kappa (?), a measure of inter-rater consistency. Furthermore, we also report BLEU-4 (?) scores for these 300 posts. Since some researchers indicate that BLEU may not be a good measure for dialogue evaluation (?), we treat human judgment as the primary measure in our experiments.
Experimental Results and Analysis
The experimental results are summarized in Table 1. For Seq2Seq, NRM and X2Tree, the agreement value lies in the range 0.6 to 0.8, interpreted as "substantial agreement". Meanwhile, EncDec and ATT obtain relatively higher kappa values between 0.8 and 1.0, i.e. "almost perfect agreement". Hence, we believe the labeling standard is clear, which leads to high agreement among the labelers.
Models  Level-1 %  Level-2 %  Level-3 %  Agreement  BLEU

EncDec  0.44  58.89  40.67  0.8114  8.78 
Seq2Seq  1.58  50.73  47.69  0.7834  12.45 
ATT  2.31  45.31  52.38  0.8269  13.89 
NRM  0.64  44.98  54.38  0.7809  13.73 
X2Tree  0.44  34.02  65.53  0.7733  15.87 
For the Level-3 (acceptance) ratio, X2Tree visibly outperforms the other models. The best baseline method, NRM, achieves a 54.38% Level-3 ratio, while X2Tree reaches 65.53%, an increase of 11.15 percentage points. This improvement is mainly due to fewer irrelevant (Level-2) responses being generated (34.02% vs. 44.98%), indicating that X2Tree outputs more acceptable responses.
We further notice from Table 1 that the percentage of ungrammatical (Level-1) responses from X2Tree is no greater than that of any baseline (equal to EncDec), and its BLEU score is the highest in the experiments. This shows that responses generated by the tree-structured decoder are more grammatical than those from the chain-structured decoders, and demonstrates X2Tree's robustness to parser errors. Additionally, X2Tree and EncDec achieve the best grammatical ratio (99.56%), but EncDec fails to generate relevant responses. Hence, the tree decoder improves response relevance in the experiments. We conjecture the reason is that X2Tree first generates the core verb of the response; this first-generated verb may be more relevant to the post and makes the whole response more relevant.
In summary, the experiments demonstrate that X2Tree generates more grammatical and relevant responses, and show that X2Tree is able to generate correct trees.
Ease of Learning
From Table 1, we observe that the percentage of grammatical responses from X2Tree visibly surpasses the other models. We conjecture that the tree-structured decoder is easier to learn because its hidden states need to store less information than their counterparts in a chain-structured decoder.
In detail, given a response utterance of length $n$, the hidden state at position $t$ in a chain-structured decoder needs to store the information of all $t$ previous words, so the average context size is $(n+1)/2$ (with an extra eos token). In contrast, a hidden state in a tree-structured decoder only needs to store the information of its ancestors. After transforming the response into a ternary dependency tree, the average depth of the nodes is $O(\sqrt{n})$ (?). Even in the worst case, where the depth of the ternary tree reaches $n$, the average number of ancestors per node is $(n+1)/2$, the same as the average context size in the chain-structured case. Fig. 6 shows the average number of steps the hidden states need to remember at different sequence lengths for our data set.
Overall, the hidden states of a tree-structured decoder need to store less information than those of a chain-structured decoder. This makes X2Tree potentially capable of handling more complex semantic structures in response utterances.
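The gap can be made concrete with a back-of-the-envelope comparison. This is our own illustration, not the paper's Fig. 6: the complete ternary tree is the best case, and real dependency trees fall between the two curves.

```python
def chain_context(n):
    """Average number of earlier tokens a chain decoder's hidden state
    must carry, averaged over positions 0..n-1 of a length-n response."""
    return sum(range(n)) / n                      # = (n - 1) / 2

def ternary_avg_depth(n):
    """Average ancestor count over the first n nodes of a complete
    (best-case) ternary tree, filled level by level."""
    total, count, depth, width = 0, 0, 0, 1
    while count < n:
        take = min(width, n - count)              # nodes on this level
        total += take * depth                     # each has `depth` ancestors
        count += take
        depth += 1
        width *= 3
    return total / n
```

For a 100-token response, the chain decoder carries about 50 earlier tokens on average, while the best-case ternary tree carries only a handful of ancestors, consistent with the trend the paragraph above describes.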
Related Work
Statistical Machine Translation. The neural encoder-decoder framework for generative conversation models follows the line of statistical machine translation. Sutskever et al. (?) used a multi-layered LSTM as the encoder and the decoder for machine translation. Later, Cho et al. (?) proposed the encoder-decoder framework, where the context vector is fed to every unit in the decoder. Bahdanau et al. (?) extended the encoder-decoder framework with the attention mechanism to model the alignment between source and target sequences.
Conversation models. Inspired by neural SMT, recent studies showed that these models can also be successfully applied to dialogue systems. Specifically, for short conversation, Shang et al. (?) proposed the Neural Responding Machine, which further extended the attention mechanism with both global and local schemes. Zhou et al. (?) proposed MARM to generate diverse responses upon multiple mechanisms. Most recently, some researchers have focused on multi-round conversation. Serban et al. (?) built an end-to-end dialogue system using a hierarchical neural network. Sordoni et al. (?) proposed a related model with a hierarchical recurrent encoder-decoder framework for query suggestion. Our proposed model can also be applied to these multi-round conversation models and potentially improve their performance.
Tree-Structured Neural Networks. Recently, some studies use tree-structured neural networks instead of conventional chain-structured ones to improve the quality of semantic representations. Socher et al. (?) proposed the Recursive Neural Tensor Network, in which each phrase is represented by word vectors and its parse tree, and vectors of higher-level nodes are computed from their child phrase vectors. Tai et al. (?) and Zhu et al. (?) extended the chain-structured LSTM to tree structures. All the above models use tree structures to summarize a sentence into a context vector, whereas we decode from a context vector to generate sentences in a root-to-leaf direction. Additionally, Zhang et al. (?) proposed a Tree LSTM activation function in top-down fashion. Two important points differentiate our work from theirs. First, Zhang et al. mainly estimate the generation probability of a dependency tree and apply their model to sentence completion and dependency parsing re-ranking, while X2Tree handles dialogue modeling in the encoder-decoder framework. Second, thanks to the canonicalization method, the X2Tree model processes a fixed number ($K$) of children at each step for GPU acceleration, while Zhang et al. need to process the children sequentially; the proposed tree canonicalization method thus helps reduce training time. Some works also aim at generating other structure types. Rabinovich et al. (?) proposed abstract syntax networks to transform card images of the game HearthStone into well-formed and executable outputs. Cheng et al. (?) utilized predicate-argument structures to store natural language utterances as intermediate, domain-general representations.
Conclusion and Future Work
In this study, we proposed a tree-structured decoder to improve response quality in dialogue systems. By incorporating linguistic knowledge into the modeling process, the proposed X2Tree framework outperforms baseline methods with an increase of over 11.15% in acceptance ratio for response generation. Incorporating a tree-structured encoder is a promising direction for further enhancing sentence generation quality.
Acknowledgments
This work was supported by the National Key Research and Development Program of China under Grant No. 2017YFB1002104, and the National Natural Science Foundation of China (Nos. 61473274 and 61573335).
This work was also supported by WeChat, Tencent. We thank Leyu Lin, Lixin Zhang, Cheng Niu and Xiaohu Cheng for their constructive advice. We also thank the anonymous AAAI reviewers for their helpful feedback.
References
 [2015] Bahdanau, D.; Cho, K.; and Bengio, Y. 2015. Neural Machine Translation by Jointly Learning to Align and Translate. ICLR.
 [2012] Bastien, F.; Lamblin, P.; Pascanu, R.; Bergstra, J.; Goodfellow, I. J.; Bergeron, A.; Bouchard, N.; and Bengio, Y. 2012. Theano: New Features and Speed Improvements. In NIPS Workshop.
 [2017] Cheng, J.; Reddy, S.; Saraswat, V.; and Lapata, M. 2017. Learning Structured Natural Language Representations for Semantic Parsing. In ACL.
 [2014] Cho, K.; van Merrienboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; and Bengio, Y. 2014. Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. In EMNLP.
 [2009] Cormen, T. H. 2009. Introduction to Algorithms. MIT Press.
 [1982] Flajolet, P., and Odlyzko, A. 1982. The Average Height of Binary Trees and Other Simple Trees. JCSS.
 [1971] Fleiss, J. L. 1971. Measuring Nominal Scale Agreement Among Many Raters. Psychological Bulletin.
 [1997] Hochreiter, S., and Schmidhuber, J. 1997. Long Short-Term Memory. Neural Computation.
 [2016] Liu, C.-W.; Lowe, R.; Serban, I. V.; Noseworthy, M.; Charlin, L.; and Pineau, J. 2016. How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation. In EMNLP.
 [2002] Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W. 2002. BLEU: a Method for Automatic Evaluation of Machine Translation. In ACL.
 [2017] Rabinovich, M.; Stern, M.; and Klein, D. 2017. Abstract Syntax Networks for Code Generation and Semantic Parsing. In ACL.
 [2015] Serban, I. V.; Sordoni, A.; Bengio, Y.; Courville, A.; and Pineau, J. 2015. Building End-to-End Dialogue Systems Using Generative Hierarchical Neural Network Models. In AAAI.
 [2015] Shang, L.; Lu, Z.; and Li, H. 2015. Neural Responding Machine for Short-Text Conversation. In ACL.
 [2011] Socher, R.; Lin, C. C.; Ng, A. Y.; and Manning, C. D. 2011. Parsing Natural Scenes and Natural Language with Recursive Neural Networks. In ICML.
 [2013] Socher, R.; Perelygin, A.; Wu, J. Y.; Chuang, J.; Manning, C. D.; Ng, A. Y.; and Potts, C. 2013. Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank. In ACL.
 [2015] Sordoni, A.; Bengio, Y.; Vahabi, H.; Lioma, C.; Simonsen, J. G.; and Nie, J.-Y. 2015. A Hierarchical Recurrent Encoder-Decoder For Generative Context-Aware Query Suggestion. In CIKM.
 [2014] Sutskever, I.; Vinyals, O.; and Le, Q. V. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
 [2015] Tai, K. S.; Socher, R.; and Manning, C. D. 2015. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. In ACL.
 [2015] Vinyals, O., and Le, Q. V. 2015. A Neural Conversational Model. arXiv.
 [2012] Zeiler, M. D. 2012. ADADELTA: An Adaptive Learning Rate Method. arXiv.
 [2016] Zhang, X.; Lu, L.; and Lapata, M. 2016. Top-down Tree Long Short-Term Memory Networks. In NAACL.
 [2017] Zhou, G.; Luo, P.; Cao, R.; Lin, F.; Chen, B.; and He, Q. 2017. Mechanism-Aware Neural Machine for Dialogue Response Generation. In AAAI.
 [2015] Zhu, X.; Sobhani, P.; and Guo, H. 2015. Long Short-Term Memory Over Tree Structures. In ICML.