SyntaxSQLNet: Syntax Tree Networks for Complex and Cross-Domain Text-to-SQL Task

Tao Yu  Michihiro Yasunaga  Kai Yang  Rui Zhang
Dongxu Wang  Zifan Li  Dragomir R. Radev
Department of Computer Science, Yale University
{tao.yu, michihiro.yasunaga, k.yang, r.zhang, dragomir.radev}

Most existing studies on text-to-SQL tasks do not require generating complex SQL queries with multiple clauses or sub-queries, or generalizing to new, unseen databases. In this paper we propose SyntaxSQLNet, a syntax tree network to address the complex and cross-domain text-to-SQL generation task. SyntaxSQLNet employs a SQL-specific syntax tree-based decoder with SQL generation path history and table-aware column attention encoders. We evaluate SyntaxSQLNet on the Spider text-to-SQL task, which contains databases with multiple tables and complex SQL queries with multiple SQL clauses and nested queries. We use a database split setting where databases in the test set are unseen during training. Experimental results show that SyntaxSQLNet can handle a significantly greater number of complex SQL examples than prior work, outperforming the previous state-of-the-art model by 8.3% in exact matching accuracy. We also show that SyntaxSQLNet can further improve the performance by an additional 8.1% using a cross-domain augmentation method, resulting in a 16.4% improvement in total. To our knowledge, we are the first to study this complex and cross-domain text-to-SQL task. (Footnote 1: Code available at)

1 Introduction

The text-to-SQL task is one of the most important subtasks of semantic parsing in natural language processing (NLP). It maps natural language sentences to corresponding SQL queries.

In recent years, some state-of-the-art methods with Seq2Seq encoder-decoder architectures are able to obtain more than 80% exact matching accuracy on some complex text-to-SQL benchmarks such as ATIS and GeoQuery. These models seem to have already solved most problems in this area.

Figure 1: To address the complex text-to-SQL generation task, SyntaxSQLNet employs a tree-based SQL generator. For example, our model can systematically generate a nested query as illustrated above.

However, as Finegan-Dollak et al. (2018) show, because of the problematic task definition in the traditional datasets, most of these models just learn to match semantic parsing results, rather than truly learning to understand the meaning of the inputs and to generalize to new programs and databases. More specifically, most existing complex text-to-SQL datasets have fewer than 500 SQL labels. They are expanded by paraphrasing 4-10 questions for each SQL query. Under the standard train and test split (Zettlemoyer and Collins, 2005), most queries in the test set also appear in the train set. The WikiSQL dataset recently developed by Zhong et al. (2017) is much larger and does use different databases for training and testing, but it contains only very simple SQL queries and database schemas.

To address those issues in the current semantic parsing datasets, Yu et al. (2018b) have developed Spider, a large-scale human-labeled text-to-SQL dataset consisting of 10,181 questions, 5,693 unique complex SQL queries, and 200 databases with multiple tables. They split the dataset into train/dev/test by databases, defining a new complex and cross-domain text-to-SQL task that requires models to generalize well to both new SQL queries and new databases. The task cannot be solved easily without truly understanding the semantic meaning of the input questions.

In this paper, we propose SyntaxSQLNet, a SQL specific syntax tree network to address the Spider task. Specifically, to generate complex SQL queries with multiple clauses, selections and sub-queries, we develop a syntax tree-based decoder with SQL generation path history. To make our model learn to generalize to new databases with new tables and columns, we also develop a table-aware column encoder. Our contributions are as follows:

  • We propose SQL-specific syntax tree networks for the complex and cross-domain text-to-SQL task, which are able to solve nested queries on unseen databases. We are the first to develop a methodology for this challenging semantic parsing task.

  • We introduce a SQL specific syntax tree-based decoder with SQL path history and table-aware column attention encoders. Even with no hyperparameter tuning, our model can significantly outperform the previous best models, with an 8.3% boost in exact matching accuracy. Error analysis shows that our model is able to generalize, and solve much more complex (e.g., nested) queries in new databases than prior work.

  • We also develop a cross-domain data augmentation method to generate more diverse training examples across databases, which further improves the exact matching accuracy by 8.1%. As a result, our model achieves 31.1% accuracy, a 16.4% total improvement compared with the previous best model.

2 Related Work

Semantic parsing maps natural language to formal meaning representations. There are a range of representations, such as logic forms and executable programs (Zelle and Mooney, 1996; Zettlemoyer and Collins, 2005; Wong and Mooney, 2007; Das et al., 2010; Liang et al., 2011; Banarescu et al., 2013; Artzi and Zettlemoyer, 2013; Reddy et al., 2014; Berant and Liang, 2014; Pasupat and Liang, 2015; Herzig and Berant, 2018).

As a sub-task of semantic parsing, the text-to-SQL problem has been studied for decades (Warren and Pereira, 1982; Popescu et al., 2003a, 2004; Li et al., 2006; Giordani and Moschitti, 2012; Wang et al., 2017b). The methods proposed in the database community (Li and Jagadish, 2014; Yaghmazadeh et al., 2017) tend to involve manual feature engineering and user interactions with the systems. In this work, we focus on recent neural network-based approaches (Yin et al., 2016; Zhong et al., 2017; Xu et al., 2017; Wang et al., 2017a; Iyer et al., 2017; Gur et al., 2018; Suhr et al., 2018). Dong and Lapata (2016) introduce a sequence-to-sequence (seq2seq) approach to converting text to logical forms. Most previous work focuses on a specific table schema. Zhong et al. (2017) publish the WikiSQL dataset and propose a seq2seq model with reinforcement learning to generate SQL queries. Xu et al. (2017) further improve the results on the WikiSQL task by using a SQL-sketch based approach employing a sequence-to-set model. Dong and Lapata (2018) propose a coarse-to-fine model which achieves new state-of-the-art performance on several datasets including WikiSQL. Their model first generates a sketch of the target program and then fills in the missing details in the sketch.

Our syntax tree-based decoder is related to recent work that exploits syntax information for code generation tasks (Yin and Neubig, 2017; Rabinovich et al., 2017). Yin and Neubig (2017) introduce a neural model that transduces a natural language statement into an abstract syntax tree (AST). While they formulate the generation process as a seq2seq decoding of rules and tokens, our model uses a sequence-to-set module for each grammar component, and calls them recursively to generate a SQL syntax tree. Similarly, Rabinovich et al. (2017) propose abstract syntax networks that use a collection of recursive modules for decoding. Our model differs from theirs in the following points. First, we exploit a SQL-specific grammar instead of a general AST. AST-based models have to predict many non-terminal rules before predicting the terminal tokens, involving more steps, whereas our SQL-specific grammar enables direct prediction of SQL tokens. Second, our model uses different sequence-to-set modules to avoid the “ordering issue” (Xu et al., 2017) in many code generation tasks. Third, unlike Rabinovich et al. (2017), we pass a pre-order traversal of the SQL decoding history to each module. This provides each module with important dependency information: e.g., if a SQL query has GROUP BY, it is very likely that the grouped column also appears in SELECT. Finally, instead of sharing parameters across different modules, we train each module separately, because the parameters of different modules could have different convergence times.

In addition to the distinction in model design, our work differs from theirs in the data and task definition. They aim to develop a general syntax model for code generation via abstract syntax trees. Instead, we are interested in solving the complex and cross-domain SQL query generation problem; this motivates us to take advantage of SQL-specific syntax for decoding, which guides systematic generation of complex SQL queries.

3 Problem Formulation

This work aims to tackle the complex text-to-SQL task that involves multiple tables, SQL clauses and nested queries. Further, we use separate databases for training and testing, aiming to develop models that generalize to new databases.


We use Spider (Yu et al., 2018b) as the main dataset, which contains 10,181 questions, 5,693 unique complex SQL queries, and 200 databases with multiple tables. (Footnote 2: The Spider task website is at)

Task and Challenges.
  • The dataset contains a large number of complex SQL labels, which involve more tables, SQL clauses, and nested queries than prior datasets such as WikiSQL. Existing models developed for the WikiSQL task cannot handle those complex SQL queries in the Spider dataset.

  • The dataset contains 200 databases (138 domains), and different databases are used for training and testing. Unlike most previous semantic parsing tasks (e.g., ATIS), this task requires models to generalize to new, unseen databases.

In sum, we train and test models on different complex SQL queries from different databases in this task. This aims to ensure that models can make the correct prediction only when they truly understand the meaning of the questions under the given database, rather than by mere memorization.

4 Methodology

Similar to Rabinovich et al. (2017), our model structures the decoder as a collection of recursive modules. However, as discussed in the related work section, we use a SQL-specific grammar to guide the decoding process, which allows us to take advantage of the well-defined structure of SQL queries. Also, the modules do not share any parameters, so we train each of them independently.

Figure 2: Our modules and SQL grammar used in the decoding process. A round symbol represents a SQL token, a table column, etc. A square symbol indicates a module that predicts the next SQL token from its corresponding token instances with the same color.

4.1 Module Overview

Our model decomposes the SQL decoding process into 9 modules to handle the prediction of different SQL components such as keywords, operators, and columns. We provide the overview in this section and more details in later sections.

Figure 2 illustrates our modules and SQL grammar used in decoding process. A round symbol represents a SQL token, such as SELECT, WHERE, a table column, etc. A square symbol indicates a module that predicts the next SQL token from its corresponding token instances with the same color. Specifically, we have the following modules.

  • IUEN Module, predicting INTERSECT, UNION, EXCEPT, and NONE, which determines whether the module needs to be called again to generate nested queries.

  • KW Module, predicting keywords from WHERE, GROUP BY, and ORDER BY. All queries in our dataset have SELECT.

  • COL Module, predicting table columns.

  • OP Module, for =, >, <, >=, <=, !=, LIKE, NOT IN, IN, BETWEEN.

  • AGG Module, predicting aggregators from MAX, MIN, SUM, COUNT, AVG, and NONE.

  • Root/Terminal Module, predicting the ROOT of a new subquery or terminal value. It also enables our model to generate nested queries.

  • AND/OR Module, predicting the presence of AND or OR operator between two conditions.

  • DESC/ASC/LIMIT Module, predicting the keywords associated with ORDER BY. It is invoked only when ORDER BY is predicted before.

  • HAVING Module, predicting the presence of HAVING for GROUP BY clause. It is invoked only when GROUP BY is predicted earlier.

4.2 SQL Grammar

In order to structure our decoder to generate complex queries, we consider a SQL grammar. It determines which module is invoked at each recursive decoding step. Figure 2 illustrates our SQL grammar. During the decoding process, given the current SQL token and the SQL history (the tokens we have gone over to reach the current token), we determine which module to invoke, and predict the next SQL token to generate.

For some modules, such as HAVING and OP, the type of the current token instance alone does not decide the invocation. We also check the type of the previously decoded SQL token: it must be GROUP for the HAVING module, and WHERE or HAVING for the OP module.

In the grammar, IUEN and Root/Terminal modules are able to generate ROOT, which can activate IUEN module again. In this way, our model can recursively generate nested subqueries, and can also predict two or more subqueries in queries that have EXCEPT, INTERSECT, and UNION.
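The invocation rule described in this section can be pictured as a small lookup. The sketch below is a simplified illustration and not the released implementation: the function `next_modules`, the token-type names, and the choice of exactly one module per case are assumptions made for readability (in the full grammar of Figure 2 a column may trigger more than one module).

```python
def next_modules(current_type, previous_keyword=None):
    """Return the module names to invoke for the token just popped.

    current_type:     "ROOT", "IUEN", "KW", "COL" or "OP" (hypothetical names)
    previous_keyword: the last keyword on the history path, e.g. "WHERE"
    """
    if current_type == "ROOT":
        return ["IUEN"]            # ROOT activates the IUEN module
    if current_type == "IUEN":
        return ["KW"]              # then decide which clauses are present
    if current_type == "KW":
        return ["COL"]             # every keyword needs its columns
    if current_type == "COL":
        # The same token type leads to different modules depending on history.
        if previous_keyword in ("WHERE", "HAVING"):
            return ["OP"]
        if previous_keyword == "GROUP BY":
            return ["HAVING"]
        if previous_keyword == "ORDER BY":
            return ["DESC/ASC/LIMIT"]
        return ["AGG"]             # assumption: e.g. a column under SELECT
    if current_type == "OP":
        return ["ROOT/TERMINAL"]   # nested subquery or a terminal value
    return []
```

For instance, `next_modules("COL", "WHERE")` selects the OP module while `next_modules("COL", "GROUP BY")` selects the HAVING module, which is the history-dependent behaviour the grammar requires.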

4.3 Input Encoder

The input to each module consists of three types of information: the question, the table schema, and the current SQL decoding history path. We encode the question sentence with a bi-directional LSTM (BiLSTM). We encode the table schema and the history path in the manner described below.

4.3.1 Table-Aware Column Representation

In order to generalize to new databases in testing, it is important to make our model learn to obtain necessary information from a database schema.

SQLNet (Xu et al., 2017) encodes this information by running different bi-directional LSTMs over words in each column name, whereas TypeSQL (Yu et al., 2018a) first obtains embedding for each column name by taking the average embedding of the words constituting the column name, and then runs a single biLSTM on the embeddings of all columns in a table. Yu et al. (2018b) show that the column encoding method of SQLNet outperforms that of TypeSQL in the database split setting, and the result reverses under the example split setting.

While SQLNet and TypeSQL only need the column names because the WikiSQL dataset contains only one table per question-SQL pair, Spider’s databases contain multiple tables. To address this setting, we propose to use both table and column names to construct column embeddings.

Specifically, given a database, for each column we first build a word sequence consisting of the words in its table name, the words in its column name, and the type information of the column (string or number, primary/foreign key), which serves as the initial input for the column. Next, like SQLNet, we compute the table-aware column representation of the given column as the final hidden state of a BiLSTM running on top of this sequence. This way, the encoding scheme can capture both the global (table names) and local (column names and types) information in the database schema to understand a natural language question in the context of the given database.
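As a rough illustration of the input construction only (the BiLSTM itself is omitted), the sketch below builds the word sequence for each column. The helper `column_input_tokens`, the underscore-based word splitting, and the toy schema are assumptions made for this example.

```python
def column_input_tokens(table_name, column_name, column_type):
    """Word sequence for one column: table words + column words + type tag."""
    def words(name):
        # Assumption: schema names are split into words on underscores.
        return name.lower().replace("_", " ").split()
    return words(table_name) + words(column_name) + [column_type]

# Hypothetical two-table schema in which both tables have a stadium_id column.
schema = {
    "stadium": [("stadium_id", "number"), ("name", "string")],
    "concert": [("concert_name", "string"), ("stadium_id", "number")],
}

sequences = {
    (table, column): column_input_tokens(table, column, ctype)
    for table, columns in schema.items()
    for column, ctype in columns
}
```

The two `stadium_id` columns now receive different sequences because the table words come first, so an encoder reading these sequences can tell same-named columns apart.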

We also experimented with a hierarchical table and column encoding, where we first obtain embedding for each table name and then incorporate that information into column encoding. But this encoding method did not perform as well.

4.3.2 SQL Decoding History

In addition to question and column information, we also pass the SQL query’s current decoding history as an input to each module. This enables us to use the information of previous decoding states to predict the next SQL token. For example, in Figure 1, the COL module would be more likely to predict salary in the subquery by considering the path history which contains salary for HAVING, and SELECT in the main query.

In contrast, each module in SQLNet does not consider the previously decoded SQL history. Hence, if directly applied to our recursive SQL decoding steps, each module would just predict the same output every time it is invoked. By passing the SQL history, each module is able to predict a different output according to the history every time it is called during the recursive SQL generation process. Also, the SQL history can improve the performance of each module on long and complex queries because the history helps the model capture the relations between clauses.

Predicted SQL history is used during test decoding. For training, we first traverse each node in the gold query tree in pre-order to generate gold SQL path history for each training example used in different modules.
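The construction of gold histories for training can be sketched as a pre-order walk over the query tree. This is a minimal sketch under assumptions: the `(label, children)` tuple encoding of the tree and the choice to use every token visited so far as the history are illustrative, not taken from the released code.

```python
def preorder_history(tree):
    """Return (history, token) pairs, one per node, in pre-order.

    `history` holds the tokens visited before `token`, which is what a
    module would receive as its SQL path history when predicting `token`.
    """
    pairs, history = [], []

    def visit(node):
        label, children = node
        pairs.append((list(history), label))  # snapshot before adding label
        history.append(label)
        for child in children:
            visit(child)

    visit(tree)
    return pairs

# Toy gold tree for: SELECT name ... WHERE age > <value>
gold = ("ROOT", [("NONE", [
    ("SELECT", [("name", [])]),
    ("WHERE", [("age", [(">", [])])]),
])])
pairs = preorder_history(gold)
```

Each pair is one training example: the module responsible for `token` is trained with `history` as an additional input.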

4.3.3 Attention for Input Encoding

For each module, like SQLNet (Xu et al., 2017), we apply the attention mechanism to encode the question representation. We also employ this technique on the SQL path history encoding. The specific formulas used are described in the next section.

4.4 Module Details

Similarly to SQLNet, we employ a sketch-based approach for each module. We apply the sequence-to-set prediction framework introduced by Xu et al. (2017) to avoid the ordering issue that arises in seq2seq-based models for SQL generation. For example, in Figure 1, SELECT salary, dept_name is the same as SELECT dept_name, salary. A traditional seq2seq decoder generates the columns one by one in order; hence the model could be penalized even if the prediction and the gold label are the same as sets. To avoid this problem, SQLNet predicts them together in one step so that their order does not affect the model’s training process. For instance, in Figure 1, our model invokes the COL module to predict salary and dept_name, and pushes both onto the stack at the same time.

However, SQLNet only covers pre-defined SQL sketches, and its modules do not pass information to one another. To resolve these problems, SyntaxSQLNet employs a syntax tree-based decoding method that recursively calls different modules based on a SQL grammar. Further, the history of generated SQL tokens is passed through modules, allowing SyntaxSQLNet to keep track of the recursive decoding steps.

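We first describe how to compute the conditional embedding $H_{1/2}$ of an embedding $H_1$ given another embedding $H_2$:

$$H_{1/2} = \mathrm{softmax}(H_1 W H_2^{\top})\, H_1.$$

Here $W$ is a trainable parameter. Moreover, we get a probability distribution from a given score matrix $U$ by

$$\mathcal{P}(U) = \mathrm{softmax}(V \tanh(U)),$$

where $V$ is a trainable parameter.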
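To make the two building blocks concrete, the following pure-Python sketch assumes the conditional embedding has the attention form softmax(H1 W H2^T) H1 and the distribution has the form softmax(V tanh(U)). The shape conventions (one output row per row of H2, softmax taken over the rows of H1, and U treated as a single score vector) are assumptions of this sketch.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def conditional_embedding(H1, W, H2):
    """For each row of H2, an attention-weighted sum of the rows of H1.

    H1 is n1 x d, W is d x d, H2 is n2 x d (lists of lists); result is n2 x d.
    """
    out = []
    for h2 in H2:
        Wh2 = [sum(w * x for w, x in zip(row, h2)) for row in W]
        scores = [sum(a * b for a, b in zip(h1, Wh2)) for h1 in H1]
        weights = softmax(scores)  # attention over the rows of H1
        out.append([sum(w * h1[k] for w, h1 in zip(weights, H1))
                    for k in range(len(H1[0]))])
    return out

def distribution(U, V):
    """softmax(V tanh(U)) for a score vector U (length d) and V (k x d)."""
    squashed = [math.tanh(u) for u in U]
    logits = [sum(v * s for v, s in zip(row, squashed)) for row in V]
    return softmax(logits)

identity = [[1.0, 0.0], [0.0, 1.0]]
attended = conditional_embedding(identity, identity, [[10.0, 0.0]])
uniform = distribution([0.0, 0.0], identity)
```

In the toy call, the single row of H2 aligns with the first row of H1, so almost all of the attention weight falls on that row.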

We denote the hidden states of the LSTM on question embeddings, path history, and column embeddings as $H_Q$, $H_{HS}$, and $H_{COL}$ respectively. In addition, we denote the hidden states of the LSTM on multiple keywords embeddings and keywords embeddings as $H_{MKW}$ and $H_{KW}$ respectively. Finally, we use $W$ to denote trainable parameters that are not shared between modules. The output of each module is computed as follows:

IUEN Module

In the IUEN module, since only one of the multiple keywords from {INTERSECT, UNION, EXCEPT, NONE} will be used, we compute the probabilities by

KW Module

In the KW module, we first predict the number of keywords in the SQL query and then predict the keywords from {WHERE, GROUP BY, ORDER BY}.

COL Module

Similarly, in the COL module, we first predict the number of columns in the SQL query and then predict which ones to use.
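The two-step prediction used by the KW and COL modules (first how many, then which ones) can be sketched as follows. The function `select_set` and the convention that `num_probs[i]` is the probability of choosing `i + 1` items are assumptions of this example; the probabilities would come from the trained module.

```python
def select_set(num_probs, item_scores):
    """Pick k from the count distribution, then return the k best items."""
    # Assumption: index i of num_probs stands for "choose i + 1 items".
    k = max(range(len(num_probs)), key=num_probs.__getitem__) + 1
    ranked = sorted(item_scores, key=item_scores.get, reverse=True)
    return set(ranked[:k])

chosen = select_set(
    num_probs=[0.2, 0.7, 0.1],
    item_scores={"salary": 0.9, "dept_name": 0.8, "age": 0.1},
)
```

Because the result is a set, the order in which `salary` and `dept_name` are scored plays no role, which is the point of the sequence-to-set formulation.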

OP Module

In the OP module, for each predicted column from the COL module that is in the WHERE clause, we first predict the number of operators on it and then predict which operators to use from {=, >, <, >=, <=, !=, LIKE, NOT IN, IN, BETWEEN}. We use $H_{CS}$ to denote the embedding of one of the predicted columns from the COL module.

AGG Module

In the AGG module, for each predicted column from the COL module, we first predict the number of aggregators on it and then predict which aggregators to use from {MAX, MIN, SUM, COUNT, AVG, NONE}.

Root/Terminal Module

To predict nested subqueries, we add a module to predict if there is a new “ROOT” after an operator, which allows the model to decode queries recursively. For each predicted column from the COL module that is in the WHERE clause, we first call OP module, and then predict whether the next decoding step is a “ROOT” node or a value terminal node by

AND/OR Module

When the COL module predicts more than one condition column, we predict the connective between conditions from {AND, OR} by


DESC/ASC/LIMIT Module

In this module, for each predicted column from the COL module that is in the ORDER BY clause, we predict the associated keywords from {DESC, ASC, LIMIT} by


HAVING Module

In the HAVING module, for each predicted column from the COL module that is in the GROUP BY clause, we predict whether it is in the HAVING clause by

4.5 Recursive SQL Generation

SQL generation proceeds by activating different modules recursively. As illustrated in Figure 2, we employ a stack to organize our decoding process. At each decoding step, we pop one SQL token instance from the stack, invoke a module based on the grammar to predict the next token instance, and then push the predicted instance onto the stack. The decoding process continues until the stack is empty.

More specifically, we initialize a stack with only ROOT at the first decoding step. At the next step, the stack pops ROOT. As illustrated in Figure 2, ROOT activates the IUEN module to predict whether there is an EXCEPT, INTERSECT or UNION. If so, there are two subqueries to be generated in the next step. If the model predicts NONE instead, NONE is pushed onto the stack and popped at the next step. For example, in Figure 2, the currently popped token is SELECT, which is an instance of the keyword (KW) type. It calls the COL module to predict a column name, which is then pushed onto the stack.
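The stack-driven loop can be sketched in a few lines. This is a toy illustration under assumptions: `predict` stands in for whichever trained module the grammar selects, and the lookup table `TOY` is a hypothetical replacement for real model predictions.

```python
def decode(predict):
    """Pop a token, ask a module for the next tokens, push them, repeat.

    predict(token, history) returns the list of tokens that follow `token`
    (an empty list when nothing more needs to be generated for it).
    """
    stack, history = ["ROOT"], []
    while stack:
        token = stack.pop()
        history.append(token)
        # Push in reverse so that the first predicted token is popped first.
        stack.extend(reversed(predict(token, list(history))))
    return history

# Hypothetical stand-in for trained modules: a fixed lookup table.
TOY = {
    "ROOT": ["NONE"],
    "NONE": ["SELECT", "WHERE"],
    "SELECT": ["name"],
    "WHERE": ["age"],
    "age": [">"],
    ">": ["TERMINAL"],
}

def toy_predict(token, history):
    return TOY.get(token, [])

trace = decode(toy_predict)
```

Replacing the entry for `">"` with `["ROOT"]` would restart the process for a nested subquery, which is how the recursion in the grammar arises (with a real model, the history would keep the nested ROOT from repeating the outer query).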

4.6 Data Augmentation

Even though Spider already has a significantly larger number of complex queries than existing datasets, the number of training examples for some complex SQL components is still limited. A common remedy is data augmentation, which generates more training examples automatically. Many studies (Berant and Liang, 2014; Iyer et al., 2017; Su and Yan, 2017) have shown that data augmentation can bring significant improvement in performance.

In prior work, data augmentation was typically performed within a single domain dataset. We propose a cross-domain data augmentation method to expand our training data for complex queries. Cross-domain data augmentation is more difficult than the in-domain setting because question-program pairs tend to have domain specific words and phrases.

To tackle this issue, we first create a list of universal patterns for question-SQL pairs, based on the human labeled pairs from all the different training databases in Spider. To do so, we use a script to remove (and later fill in) all the table / column names and value tokens in the labeled question-SQL pairs, and then group together the same SQL query patterns. Consequently, each SQL query pattern has a list of about 5-20 corresponding questions. In our task, we want to generate more complex training examples. Thus, we filter out simple SQL query patterns by measuring the length and the number of SQL keywords used. We obtain about 280 different complex SQL query patterns from over 4,000 SQL labels in the train set of our corpus. We then select the 50 most frequent complex SQL patterns that contain multiple SQL components and nested subqueries.
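The pattern-extraction step can be pictured with the sketch below, which masks schema names and literal values in the SQL and groups questions that share the resulting pattern. The regular expressions, the slot markers `{TABLE}`, `{COLUMN}` and `{VALUE}`, and the two toy examples are assumptions; the script described above also masks the questions, which is left out here for brevity.

```python
import re
from collections import defaultdict

def to_pattern(sql, tables, columns):
    """Mask literals first, then table and column names, in one SQL string."""
    pattern = re.sub(r"'[^']*'|\b\d+(?:\.\d+)?\b", "{VALUE}", sql)
    for slot, names in (("{TABLE}", tables), ("{COLUMN}", columns)):
        # Longest names first so a short name never clips a longer one.
        for name in sorted(names, key=len, reverse=True):
            pattern = re.sub(rf"\b{re.escape(name)}\b", slot, pattern)
    return pattern

def group_by_pattern(examples):
    """examples: iterable of (question, sql, tables, columns) tuples."""
    groups = defaultdict(list)
    for question, sql, tables, columns in examples:
        groups[to_pattern(sql, tables, columns)].append(question)
    return dict(groups)

examples = [
    ("Which stadiums hold more than 5000 people?",
     "SELECT name FROM stadium WHERE capacity > 5000",
     {"stadium"}, {"name", "capacity"}),
    ("Which films were released after 1990?",
     "SELECT title FROM film WHERE year > 1990",
     {"film"}, {"title", "year"}),
]
groups = group_by_pattern(examples)
```

Both toy queries collapse to the same pattern even though they come from different domains, which is what makes the patterns reusable across databases.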

After this, we manually edit the selected SQL patterns and their corresponding list of questions to make sure that the table/column/value slots in the questions have one-to-one correspondence to the slots in the corresponding SQL query. For each slot, we also add column type or table information. Thus, for example, columns with string type do not appear in the column slot with integer type during data augmentation (i.e., slot refilling) process. In this way, our question-SQL patterns are generated based on existing human labeled examples, which ensures that the generated training examples are natural.

Once we have the one-to-one slot mapping between questions and SQL queries, we apply a script that takes a new database schema with type information and generates new question-SQL examples by filling empty slots. Specifically, for each table in WikiSQL, we first randomly sample 10 question-SQL patterns. We then randomly sample columns from the database schema based on their type: for example, if the slot type in the pattern is “number”, we only sample from columns with “real” type in the current table. We then refill the slots in both the question and the SQL query with the selected column names. Similarly, we also refill table / value slots.
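A minimal sketch of the slot-refilling step is given below. The function `fill_pattern`, the slot syntax, and the mapping from the slot type "string" to the column type "text" are assumptions of this example (only the "number" to "real" correspondence is stated above); value slots are omitted.

```python
import random

# Assumption: pattern slot types map onto WikiSQL column types like this.
SLOT_TO_COLUMN_TYPE = {"number": "real", "string": "text"}

def fill_pattern(question, sql, slots, table, columns, rng):
    """Fill one question-SQL pattern from a single-table schema.

    slots:   {"{COL1}": "string", ...}, slot name to slot type
    columns: list of (column_name, column_type) pairs for the table
    Returns None when the table has no column of a required type.
    """
    filled = {"{TABLE}": table}
    for slot, slot_type in slots.items():
        wanted = SLOT_TO_COLUMN_TYPE[slot_type]
        candidates = [name for name, ctype in columns if ctype == wanted]
        if not candidates:
            return None
        filled[slot] = rng.choice(candidates)
    # Use the same choice in the question and the SQL to keep them aligned.
    for slot, value in filled.items():
        question = question.replace(slot, value)
        sql = sql.replace(slot, value)
    return question, sql

example = fill_pattern(
    "What is the {COL1} with the largest {COL2} in {TABLE}?",
    "SELECT {COL1} FROM {TABLE} ORDER BY {COL2} DESC LIMIT 1",
    {"{COL1}": "string", "{COL2}": "number"},
    "scores",
    [("player", "text"), ("points", "real")],
    random.Random(0),
)
```

Filling the question and the SQL from the same `filled` mapping is what preserves the one-to-one slot correspondence established during manual editing.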

By this data augmentation method, we finally obtain about 98,000 question and SQL pairs using some WikiSQL databases with one single table.

5 Experiments

5.1 Dataset

In our experiments, we use Spider (Yu et al., 2018b), a new large-scale human-annotated text-to-SQL dataset with complex SQL queries and cross-domain databases. In addition to their originally annotated data, their training split includes 752 queries and 1,659 questions from six existing datasets: Restaurants (Tang and Mooney, 2001; Popescu et al., 2003b), GeoQuery (Zelle and Mooney, 1996), Scholar (Iyer et al., 2017), Academic (Li and Jagadish, 2014), Yelp and IMDB (Yaghmazadeh et al., 2017). In total, this dataset consists of 11,840 questions, 6,445 unique complex SQL queries, and 206 databases with multiple tables. Following Yu et al. (2018b), we use 146, 20, and 40 databases for train, development, and test, respectively (randomly split). We also include the question-SQL pair examples generated by our data augmentation method in some experiments.

5.2 Metrics

We evaluate our model using SQL Component Matching and Exact Matching proposed by Yu et al. (2018b). To compute the component matching scores, Yu et al. (2018b) first decompose predicted queries into SQL clauses including SELECT, WHERE, GROUP BY, ORDER BY, and KEYWORDS separately. After that, they evaluate each predicted clause and the ground truth as bags of several sub-components, and check whether or not these two sets of components match exactly. The exact matching score is 1 if the model predicts all clauses correctly for a given example, and 0 otherwise.
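The idea behind the two metrics can be sketched as follows. This is a simplified illustration and not the official Spider evaluation script: the dictionary encoding of a query (clause name mapped to a list of sub-components) and the function names are assumptions.

```python
def clause_match(pred_items, gold_items):
    """A clause matches when its sub-components agree as sets."""
    return set(pred_items) == set(gold_items)

def exact_match(pred, gold):
    """Return 1 if every clause matches, else 0.

    pred, gold: dicts mapping a clause name to its list of sub-components.
    """
    clauses = set(pred) | set(gold)
    return int(all(clause_match(pred.get(c, []), gold.get(c, []))
                   for c in clauses))

gold = {"SELECT": ["salary", "dept_name"], "WHERE": [("age", ">")]}
reordered = {"SELECT": ["dept_name", "salary"], "WHERE": [("age", ">")]}
wrong_op = {"SELECT": ["salary", "dept_name"], "WHERE": [("age", "<")]}
```

Here `reordered` counts as an exact match because only the order of the selected columns differs, while `wrong_op` still matches on SELECT but fails on WHERE and therefore on exact matching.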

To better understand model performance on different queries, Yu et al. (2018b) divide SQL queries into 4 levels: easy, medium, hard, and extra hard. The definition of difficulty is based on the number of SQL components, selections, and conditions.

5.3 Experimental Settings

Our model is implemented in PyTorch (Paszke et al., 2017). We build each module based on the TypeSQL (Yu et al., 2018a) implementation. We use fixed, pre-trained GloVe (Pennington et al., 2014) embeddings for question, SQL history, and schema tokens. For each experiment, the dimension and dropout rate of all hidden layers are set to 120 and 0.3 respectively. We use Adam (Kingma and Ba, 2015) with the default hyperparameters for optimization, with a batch size of 64. The same loss functions as in Xu et al. (2017) are used for each module. The code is available on

Method                        Test                                          Dev
                              Easy     Medium   Hard     Extra Hard  All    All
Seq2Seq                       17.3%    1.9%     1.3%     0.5%        4.9%   3.8%
Seq2Seq+Attention             18.1%    2.6%     2.0%     1.1%        5.6%   2.7%
Seq2Seq+Copying               15.8%    3.4%     2.0%     1.1%        5.4%   4.1%
SQLNet                        35.6%    13.3%    6.6%     1.3%        14.7%  14.3%
TypeSQL                       29.9%    7.6%     4.0%     0.8%        10.5%  10.6%
SyntaxSQLNet                  61.2%    28.1%    26.9%    5.1%        31.1%  28.7%
  -augment                    50.3%    18.6%    17.7%    4.9%        23.0%  22.1%
  -table -augment             49.5%    14.3%    13.2%    2.7%        19.7%  19.7%
  -history -table -augment    28.1%    7.3%     0.0%     0.0%        9.1%   8.4%
Table 1: Accuracy of Exact Matching on SQL queries with different hardness levels.

Method                        SELECT   WHERE    GROUP BY  ORDER BY  KEYWORDS
Seq2Seq                       13.0%    1.5%     3.3%      5.3%      8.7%
Seq2Seq+Attention             13.6%    3.1%     3.6%      9.9%      9.9%
Seq2Seq+Copying               12.0%    3.1%     5.3%      5.8%      7.3%
SQLNet                        44.5%    19.8%    29.5%     48.8%     64.0%
TypeSQL                       36.4%    16.0%    17.2%     47.7%     66.2%
SyntaxSQLNet                  62.5%    34.8%    55.6%     60.9%     69.6%
  -augment                    53.9%    24.5%    44.4%     49.5%     71.3%
  -table -augment             48.9%    20.1%    36.3%     46.8%     69.7%
  -history -table -augment    26.7%    14.6%    11.8%     34.9%     64.6%
Table 2: F1 scores of Component Matching on all SQL queries on Test set.

6 Results and Discussion

Table 1 presents SyntaxSQLNet’s dev and test results compared to previous state-of-the-art models on the Spider dataset with database splitting. Our model with SQL history and data augmentation achieves 31.1% exact matching accuracy on all SQL queries in the test set, an absolute increase of more than 16% over the previous best models, SQLNet and TypeSQL.

6.1 Comparison to Existing Methods

Even though our individual modules are similar to SQLNet and TypeSQL, our syntax-aware decoder allows the modules to generate complex SQL queries in a recursive manner based on the SQL grammar. In addition, by incorporating the SQL decoding history into modules during the decoding process, SyntaxSQLNet achieves a significant gain in exact matching for queries of all hardness levels. Specifically, even without our data augmentation technique, SyntaxSQLNet outperforms the previous best, SQLNet, by 8.3% (23.0% vs. 14.7% on the test set). This result suggests that the syntax and history information is beneficial for this complex text-to-SQL task.

Moreover, the tree-based decoder enables SyntaxSQLNet to systematically generate nested queries, boosting the performance for Hard/Extra Hard. As Table 1 shows, SyntaxSQLNet achieves scores of 26.9% and 5.1% for Hard and Extra Hard, which contain nested queries, well above all previous models. The Seq2Seq models suffer from generating ungrammatical queries, yielding very low exact matching accuracy on Hard and Extra Hard SQL queries. In contrast, our model generates valid SQL queries by enforcing the syntax.

For the detailed component matching results in Table 2, our model consistently outperforms previous work by significant margins. Specifically, our model improves the F1 score for most of the SQL components by more than 10%.

6.2 Ablation Study

In order to understand the techniques that are responsible for the performance of our model, we perform an ablation study where we remove one of the proposed techniques from our model at a time. The exact match scores are shown in the same tables as other previous models.

Data Augmentation

Our model’s exact matching performance on all queries drops by 8.1% (from 31.1% to 23.0% on the test set) when the data augmentation technique is excluded. This drop is particularly large for GROUP BY and ORDER BY components (Table 2), for which the original Spider dataset has a relatively small number of training examples. Our cross-domain data augmentation technique provides significantly more examples for column prediction (especially under GROUP BY and ORDER BY clauses), which greatly benefits the overall model performance.

Column Encoding

To see how our table-aware column encoding affects the performance of our model, we also report the model’s result without using table information for our column encoding. After excluding the table embedding from column embeddings, the test performance further goes down by 3.3%. This drop is especially large for Medium/Hard SQL queries, where correct column prediction is key. Additionally, in Table 2, the model’s performance on the GROUP BY component decreases dramatically because it is hard to predict group-by columns correctly without table information (e.g., multiple different tables may have a column with the same name “id” in the database). This result shows that the table-aware encoding is important for predicting the correct columns in unseen, complex databases (with many foreign keys).

SQL Decoding History

In order to gain more insight into how our SQL decoding history addresses complex SQL, we report our model’s performance without SQL path history. As shown in Table 1, the model’s performance drops by about 10.6% on the exact matching metric when the previous decoding states are not considered at each decoding step. More importantly, its performance on hard and extra hard SQL queries decreases to 0%. This indicates that our model is able to predict nested queries thanks to the SQL decoding history.
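The ablation figures quoted in this section follow directly from the "Test / All" column of Table 1. The short computation below reproduces them; the variable names are chosen for this example only.

```python
# Exact matching accuracy on all test queries, copied from Table 1.
test_all = {
    "SyntaxSQLNet": 31.1,
    "-augment": 23.0,
    "-table -augment": 19.7,
    "-history -table -augment": 9.1,
}
sqlnet = 14.7  # previous best model on the same column

# Drop caused by removing each technique in turn.
names = list(test_all)
drops = {
    names[i + 1]: round(test_all[names[i]] - test_all[names[i + 1]], 1)
    for i in range(len(names) - 1)
}

gain_without_augmentation = round(test_all["-augment"] - sqlnet, 1)
total_gain = round(test_all["SyntaxSQLNet"] - sqlnet, 1)
```

The successive drops are 8.1 (data augmentation), 3.3 (table-aware encoding) and 10.6 (SQL history) points, and the gains over SQLNet are 8.3 points without augmentation and 16.4 points in total.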

6.3 Error Analysis and Future Work

The most common errors are from column prediction. Future work may include developing a database schema encoder that can capture relationships among columns and foreign keys in the database more effectively. Other common errors include incorrect prediction of SQL skeleton structures, aggregators and operators.

There are also a few limitations in our model. For example, SyntaxSQLNet first predicts all the column names in the SQL query, and then chooses tables to generate the FROM clause based on the selected columns. Suppose the natural language input is “return the stadium name and the number of concerts held in each stadium.” The SQL query predicted by SyntaxSQLNet is

  • SELECT count(*), name FROM stadium GROUP BY stadium_id

While the correct answer is

  • SELECT T2.name, count(*) FROM concert AS T1 JOIN stadium AS T2 ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id

Even though SyntaxSQLNet predicts all column names and keywords correctly, its deterministic FROM clause generation method fails to join the tables (“concert” and “stadium” in this case) together. One possible solution is to predict table names in the FROM clause by considering the relations among tables in the database.
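
To make the difference between the two queries concrete, the following minimal sketch runs both against an in-memory SQLite database. The table definitions and rows are invented for illustration and are not taken from Spider, and the projected column in the correct query is written as `T2.name`, which is an assumption based on the question asking for the stadium name.

```python
import sqlite3

# Hypothetical two-table schema and data, invented for this illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE stadium (stadium_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE concert (
        concert_id INTEGER PRIMARY KEY,
        stadium_id INTEGER REFERENCES stadium(stadium_id)
    );
    INSERT INTO stadium VALUES (1, 'Alpha Park'), (2, 'Beta Arena');
    INSERT INTO concert VALUES (10, 1), (11, 1), (12, 1), (13, 2);
    """
)

# Query with a single-table FROM clause, as in the predicted output.
predicted = "SELECT count(*), name FROM stadium GROUP BY stadium_id"

# Query that joins concert and stadium (T2.name is an assumed column reference).
gold = (
    "SELECT T2.name, count(*) FROM concert AS T1 JOIN stadium AS T2 "
    "ON T1.stadium_id = T2.stadium_id GROUP BY T1.stadium_id"
)

predicted_rows = conn.execute(predicted).fetchall()
gold_rows = conn.execute(gold).fetchall()

# The single-table query counts stadium rows per stadium, so every count is 1.
print(sorted(predicted_rows))
# The joined query counts concert rows per stadium, which is what was asked.
print(sorted(gold_rows))
```

On this toy data the single-table query reports a count of 1 for every stadium, because it counts rows of stadium rather than rows of concert. Only the joined query reflects the number of concerts held in each stadium.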

7 Conclusion

In this paper, we presented a syntax tree-based model to address the complex and cross-domain text-to-SQL task. Using a SQL-specific syntax decoder, together with SQL path history and table-aware column attention encoders, our model outperforms previous work by a significant margin. The ablation study demonstrates that our proposed techniques enable the model to predict nested, complex SQL queries correctly even for unseen databases.


Acknowledgments

We thank Graham Neubig, Tianze Shi, and three anonymous reviewers for their helpful feedback and discussion on this work.

