Ask the Right Questions:
Active Question Reformulation with Reinforcement Learning
We frame Question Answering as a Reinforcement Learning task, an approach that we call Active Question Answering. We propose an agent that sits between the user and a black-box question-answering system and learns to reformulate questions to elicit the best possible answers.
The agent probes the system with potentially many natural language reformulations of an initial question and aggregates the returned evidence to yield the best answer.
The reformulation system is trained end-to-end to maximize answer quality using policy gradient. We evaluate on SearchQA, a dataset of complex questions extracted from Jeopardy!. Our agent improves F1 by 11% over a state-of-the-art base model that uses the original question/answer pairs.
Web and social media have become primary sources of information. Users’ expectations and information seeking activities co-evolve with the increasing sophistication of these resources. Beyond navigation, document retrieval, and simple factual question answering, users seek direct answers to complex and compositional questions. Such search sessions may require multiple iterations, critical assessment and synthesis (Marchionini, 2006).
The productivity of natural language yields a myriad of ways to formulate a question (Chomsky, 1965). In the face of complex information needs, humans overcome uncertainty by reformulating questions, issuing multiple searches, and aggregating responses. Inspired by humans’ ability to ask the right questions, we present an agent that learns to carry out this process for the user. The agent sits between the user and a backend QA system that we refer to as the ‘environment’. The agent aims to maximize the chance of getting the correct answer by reformulating and reissuing a user’s question to the environment. The agent comes up with a single best answer by asking many pertinent questions and aggregating the returned evidence. The internals of the environment are not available to the agent, so it must learn to probe a black-box optimally using only natural language.
Our method resembles active learning (Settles, 2010). In active learning, the learning algorithm chooses which instances to send to an environment for labeling, aiming to collect the most valuable information in order to improve model quality. Similarly, our agent aims to learn how to query an environment optimally, aiming to maximize the chance of revealing the correct answer to the user. Due to this resemblance we call our approach Active Question Answering (AQA). AQA differs from standard active learning in that it searches in the space of natural language questions and selects the question that yields the most relevant response. Further, AQA aims to solve each problem instance (original question) via active reformulation, rather than selecting hard ones for labelling to improve its decision boundary.
The key component of our proposed solution, see Figure 1, is a sequence-to-sequence model that is trained using reinforcement learning (RL) with a reward based on the answer given by the QA environment. The second component of AQA combines the evidence from interacting with the environment using a convolutional neural network.
We evaluate on a dataset of complex questions taken from Jeopardy!, the SearchQA dataset (Dunn et al., 2017). These questions are hard to answer by design because they use obfuscated and convoluted language, e.g., Travel doesn’t seem to be an issue for this sorcerer & onetime surgeon; astral projection & teleportation are no prob (answer: Doctor Strange). Thus SearchQA tests the ability of AQA to reformulate questions such that the QA system has the best chance of returning the correct answer. AQA outperforms a deep network built for QA, BiDAF (Seo et al., 2017a), which has produced state-of-the-art results on multiple tasks, by 11% absolute F1, a 32% relative F1 improvement. We conclude by proposing AQA as a general framework for stateful, iterative information seeking tasks.
2 Active Question Answering
In this section we detail the components of Active Question Answering as depicted in Figure 1.
2.1 The Agent-Environment Framework
Rather than sending a user’s question to a QA system passively, the AQA system actively reformulates the question multiple times and issues the reformulations. The QA system acts as a black-box environment, to which AQA sends questions and receives answers. The environment returns one or more responses, from which the final answer is selected. AQA has no access to the internals of the environment, and the environment is not trained as a component of the AQA agent. The agent must learn to communicate optimally with the environment to maximize the probability of receiving the correct answer.
The environment can, in principle, include multiple information sources, possibly providing feedback in different modes: images, structured data from knowledge bases, unstructured text, search results, etc. Here, the information source consists of a pre-trained question answering system that accepts natural language question strings as input and returns strings as answers.
2.2 Active Question Answering Agent
The reformulator is a sequence-to-sequence model, as is popular in neural machine translation (MT) (Sutskever et al., 2014; Bahdanau et al., 2014). We build upon the public implementation of Britz et al. (2017) (https://github.com/google/seq2seq/). The model’s architecture consists of a multi-layer bidirectional LSTM, with attention-based decoding. The major departure from the standard MT setting is that our model reformulates utterances in the same language. Unlike in MT, there is little high quality training data available for monolingual paraphrasing. Effective training of highly-parametrized neural networks relies on an abundance of data, so our setting presents an additional challenge. We address this first by pre-training the model on a related task and second, by utilizing the end-to-end signals produced through the interaction with the QA environment. This is a common strategy in deep learning, for which we develop appropriate methods.
In the downward pass in Figure 1 the reformulator transforms the original question into one or many alternative questions used to probe the environment for candidate answers. The reformulator is trained end-to-end, using an answer quality metric as the objective. This sequence-level loss is non-differentiable, so the model is trained using Reinforcement Learning, detailed in Section 3. In the upward pass in Figure 1, the aggregator selects the best answer. For this we use an additional neural network. The aggregator’s task is to evaluate the candidate answers returned by the environment and select the one to return. Here, we assume that there is a single best answer, as is the case in our evaluation setting; returning multiple answers is a straightforward extension of the model. The aggregator is trained with supervised learning.
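The two passes described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `reformulate`, `environment` and `score` are hypothetical callables standing in for the sequence-to-sequence reformulator, the black-box QA system and the learned aggregator, and the toy stand-ins below exist only to make the sketch runnable.

```python
from typing import Callable, List, Tuple


def answer_question(
    question: str,
    reformulate: Callable[[str], List[str]],
    environment: Callable[[str], str],
    score: Callable[[str, str, str], float],
) -> str:
    """Downward pass: rewrite the question and probe the environment.
    Upward pass: score each (question, rewrite, answer) triple and keep one answer."""
    rewrites = reformulate(question)
    candidates: List[Tuple[str, str]] = [(q, environment(q)) for q in rewrites]
    _, best_answer = max(
        candidates, key=lambda pair: score(question, pair[0], pair[1])
    )
    return best_answer


# Toy stand-ins (hypothetical, not the models used in the paper).
def toy_reformulate(q: str) -> List[str]:
    return [q, q.lower(), "who is " + q.lower()]


def toy_environment(q: str) -> str:
    return "doctor strange" if q.startswith("who is") else "unknown"


def toy_score(original: str, rewrite: str, answer: str) -> float:
    return 0.0 if answer == "unknown" else 1.0


result = answer_question("This sorcerer", toy_reformulate, toy_environment, toy_score)
```

The agent only ever calls `environment` with a string and reads a string back, which mirrors the black-box constraint.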
2.3 Question-Answering Environment
Finally, we require an environment to interact with. For this we use a competitive neural question answering model, BiDirectional Attention Flow (BiDAF) (Seo et al., 2017a) (https://allenai.github.io/bi-att-flow/). BiDAF is an extractive QA system. It takes as input a question and a document and returns as answer a continuous span from the document. The model contains a bidirectional attention mechanism to score document snippets with respect to the question, implemented with multi-layer LSTMs and other components. The environment is opaque: the agent has no access to its internals (parameters, activations, gradients, etc.). AQA may only send questions to it, and receive answers. This scenario enables us to design a general framework that permits the use of any backend. However, it means that feedback on the quality of the question reformulations is noisy and indirect, presenting a challenge for training.
To train AQA we use a combination of reinforcement and supervised learning. We also present a strategy to overcome data paucity in monolingual paraphrasing.
3.1 Question Answering Environment
We treat BiDAF (Seo et al., 2017a) as a static black box QA system. We train the model on the training set for the QA task at hand, see Section 4.5 for details. Afterwards, BiDAF becomes part of the environment and its parameters are not updated while training the AQA agent. In principle, we could train both the agent and the environment jointly to further improve performance. However, this is not our desired task: our aim is for the agent to learn to communicate using natural language with an environment over which it has no control. This setting generalizes to interaction with arbitrary information sources.
3.2 Policy Gradient Training of the Reformulation Model
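For a given question $q_0$, we want to return the best possible answer $a^*$, maximizing a reward $a^* = \operatorname{argmax}_a R(a \mid q_0)$. Typically, $R$ is the token level F1 score on the answer. The answer $a = f(q)$ is an unknown function of a question $q$, computed by the environment. Note that the reward is computed with respect to the original question $q_0$ while the answer is produced using a question $q$. The question $q \sim \pi_\theta(\cdot \mid q_0)$ is generated according to a policy $\pi_\theta$ where $\theta$ are the policy’s parameters. The policy, in this case a sequence-to-sequence model, assigns a probability
$$\pi_\theta(q \mid q_0) = \prod_{t=1}^{T} p_\theta(w_t \mid w_1, \dots, w_{t-1}, q_0)$$
to any possible question $q = w_1, \dots, w_T$, where $T$ is the length of $q$ with tokens $w_t \in V$ from a fixed vocabulary $V$.
The goal is to maximize the expected reward of the answer returned under the policy, $\mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0)]$. We optimize the reward directly with respect to parameters of the policy using the Policy Gradient algorithm (Sutton and Barto, 1998). Since the expected reward cannot be computed in closed form, we use Monte Carlo sampling from the policy to compute an unbiased estimator,
$$\mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0)] \approx \frac{1}{N} \sum_{i=1}^{N} R(f(q_i) \mid q_0), \quad q_i \sim \pi_\theta(\cdot \mid q_0).$$
To compute gradients for training we use REINFORCE (Williams and Peng, 1991),
$$\nabla_\theta \mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0)] = \mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0)\, \nabla_\theta \log \pi_\theta(q \mid q_0)] \approx \frac{1}{N} \sum_{i=1}^{N} R(f(q_i) \mid q_0)\, \nabla_\theta \log \pi_\theta(q_i \mid q_0).$$
This estimator is often found to have high variance, leading to unstable training (Greensmith et al., 2004). We reduce the variance by subtracting a baseline reward: $B(q_0) = \mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0)]$ (Williams, 1992). This expectation is also computed by sampling from the policy given $q_0$.
We often observed collapse onto a sub-optimal deterministic policy. To address this we use entropy regularization
$$H[\pi_\theta(q \mid q_0)] = -\sum_{t=1}^{T} \sum_{w_t \in V} p_\theta(w_t \mid w_1, \dots, w_{t-1}, q_0) \log p_\theta(w_t \mid w_1, \dots, w_{t-1}, q_0).$$
The final objective is:
$$\mathbb{E}_{q \sim \pi_\theta(\cdot \mid q_0)}[R(f(q) \mid q_0) - B(q_0)] + \lambda H[\pi_\theta(q \mid q_0)],$$
where $\lambda$ is the regularization weight.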
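To make the training signal concrete, the following is a minimal sketch of one REINFORCE update with a sampled baseline and an entropy bonus. It is a toy under loud assumptions: the policy is a single softmax over three fixed candidate rewrites rather than a sequence-to-sequence model, the rewards are invented constants standing in for the F1 returned via the environment, and the names `policy_gradient_step`, `n_samples`, `lr` and `ent_weight` are illustrative only.

```python
import numpy as np


def softmax(logits):
    shifted = logits - logits.max()
    exp = np.exp(shifted)
    return exp / exp.sum()


def policy_gradient_step(logits, rewards, rng, n_samples=64, lr=0.5, ent_weight=0.01):
    """One gradient-ascent step on: expected (reward - baseline) + ent_weight * entropy."""
    probs = softmax(logits)
    samples = rng.choice(len(probs), size=n_samples, p=probs)
    sampled_rewards = rewards[samples]
    # Baseline: mean sampled reward. Estimating it from the same samples is a
    # simplification that introduces a small bias in this toy.
    baseline = sampled_rewards.mean()
    grad = np.zeros_like(logits)
    for k, r in zip(samples, sampled_rewards):
        grad_log_prob = -probs.copy()  # d log softmax(k) / d logits = onehot(k) - probs
        grad_log_prob[k] += 1.0
        grad += (r - baseline) * grad_log_prob
    grad /= n_samples
    log_probs = np.log(np.clip(probs, 1e-12, None))
    entropy = -np.sum(probs * log_probs)
    # Gradient of the softmax entropy with respect to the logits.
    grad += ent_weight * (-probs * (log_probs + entropy))
    return logits + lr * grad


rng = np.random.default_rng(0)
toy_rewards = np.array([0.1, 0.9, 0.3])  # invented rewards for three candidate rewrites
logits = np.zeros(3)
for _ in range(200):
    logits = policy_gradient_step(logits, toy_rewards, rng)
final_probs = softmax(logits)
```

In this toy the probability mass moves toward the rewrite with the highest reward, while the entropy term slows the collapse onto a deterministic policy.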
3.3 Initialization of the Reformulation Model
We pre-train the question reformulation model by building a paraphrasing Neural MT model, i.e. a model that can translate English to English. While parallel corpora are available for many language pairs, English-English corpora are scarce, so we cannot train a monolingual model directly. Instead, we first produce a multilingual translation system that translates between several languages (Johnson et al., 2016). This allows us to use available bilingual corpora. Multilingual training requires nothing more than adding two special tokens to the data indicating the source and target languages. For example, we prefix from_es to_en to the source side of a Spanish-English training instance. The encoder-decoder architecture of the translation model remains unchanged.
As Johnson et al. (2016) show, this model can be used for zero-shot translation, i.e. to translate between language pairs for which it has seen no training examples. For example after training English-Spanish, English-French, French-English, and Spanish-English the model has learned a single encoder that encodes English, Spanish, and French and a decoder for the same three languages. Thus, we can use the same model for French-Spanish, Spanish-French and also English-English translation by adding the respective tokens, e.g. from_en to_en to the source.
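The token scheme is simple enough to show directly. The sketch below assumes the `from_xx to_yy` prefix format given in the example above; the helper name `add_language_tokens` is hypothetical.

```python
def add_language_tokens(source: str, src_lang: str, tgt_lang: str) -> str:
    """Prefix the source sentence with tokens naming the source and target languages."""
    return f"from_{src_lang} to_{tgt_lang} {source}"


# A supervised Spanish-English training instance.
supervised = add_language_tokens("¿Dónde está la biblioteca?", "es", "en")

# A zero-shot English-English (paraphrasing) request uses the same mechanism.
paraphrase_input = add_language_tokens(
    "What year did Super Bowl 50 take place?", "en", "en"
)
```

Because the language pair lives entirely in the input string, the encoder-decoder needs no architectural change to serve a pair it was never trained on.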
Johnson et al. (2016) note that zero-shot translation is generally worse than bridging, an approach that uses the model twice: first, to translate into a pivot language, and then into the target language. However, the performance gap can be closed by running a few training steps for the desired language pair. Thus, we first train on multilingual data, then on a small corpus of monolingual data.
3.4 Answer Selection
With either beam search or sampling we can produce many rewrites of a single question from our reformulation system. We issue each rewrite to the QA environment, yielding a set of (query, rewrite, answer) tuples from which we need to pick the best instance. We train another neural network to pick the best answer from the candidates. While this is a ranking problem, we frame it as binary classification, distinguishing between above and below average performance. In training, we compute the F1 score of the answer for every instance. If the rewrite produces an answer with an F1 score greater than the average score of the other rewrites the instance is assigned a positive label. We ignore questions where all rewrites yield equally good/bad answers.
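The labelling rule for the classifier can be sketched as follows. This is one reading of the description above, not the authors' code: each rewrite is compared against the mean F1 of the other rewrites of the same question, and questions whose rewrites all score the same are dropped. The function name `label_rewrites` is hypothetical.

```python
from typing import List, Optional


def label_rewrites(f1_scores: List[float]) -> Optional[List[int]]:
    """Return 1 for rewrites whose F1 exceeds the mean F1 of the other rewrites,
    0 otherwise, or None when the question carries no training signal."""
    n = len(f1_scores)
    if n < 2 or len(set(f1_scores)) == 1:
        return None  # all rewrites equally good or bad: ignore this question
    total = sum(f1_scores)
    return [int(score > (total - score) / (n - 1)) for score in f1_scores]


labels = label_rewrites([0.0, 1.0, 0.5])
```

For the scores `[0.0, 1.0, 0.5]` only the second rewrite beats the average of the others, so it alone is labelled positive.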
For the classifier we evaluated FFNNs, LSTMs and CNNs and found that the performance of all systems was comparable. Since the inputs are triples of variable length sequences the latter two allow us to incorporate the tokens directly without the need for feature engineering. We choose a CNN for computational efficiency.
In particular, we use pre-trained embeddings for the tokens of query, rewrite, and answer. For each, we add a 1-D CNN followed by max-pooling. The three resulting vectors are then concatenated and passed through a feed-forward network which produces the binary output.
In our experiments we train the answer selection model separately from the reformulator; however, jointly training both models is a promising line of future work.
We experiment on a new and challenging question answering dataset, SearchQA (Dunn et al., 2017). We show that our environment, BiDAF, already shows good relative performance when run alone and improves over the published baseline. However, low absolute performance indicates that the task is challenging. The trained reformulator improves end-to-end performance using a single rewrite, i.e. without aggregation. Our end-to-end system improves over BiDAF by 11 F1 points (32% relative).
4.1 Question Answering Data
SearchQA is a recently-released dataset built starting from a set of Jeopardy! clues. Clues are obfuscated queries such as This ‘Father of Our Country’ didn’t really chop down a cherry tree. Each clue is associated with the correct answer, e.g. George Washington, and a list of snippets from Google’s top search results. SearchQA contains over 140k question/answer pairs and 6.9M snippets. We train our model on the pre-defined training split, perform model selection and tuning on the validation split and report results on the validation and test splits. The training, validation and test sets contain 99,820, 13,393 and 27,248 examples, respectively.
4.2 Sequence to Sequence Pre-training
For the pre-training of the reformulator we use the multilingual United Nations Parallel Corpus v1.0 (Ziemski et al., 2016). This dataset contains 11.4M sentences which are fully aligned across six UN languages: Arabic, English, Spanish, French, Russian, and Chinese. From all bilingual pairs we produce a multilingual training corpus of 30 language pairs. This yields 340M training examples which we use to train the zero-shot neural MT system (Johnson et al., 2016). We tokenize our data using 16k sentence pieces (https://github.com/google/sentencepiece). Following Britz et al. (2017) we use a bidirectional LSTM as encoder and a 4-layer LSTM with attention (Bahdanau et al., 2016) as decoder. The model converged after training on 400M instances using the Adam optimizer with learning rate 0.001 and batch size 128.
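The corpus sizes quoted above follow from simple counting, which the short check below reproduces. The two-letter language codes are labels chosen here for illustration; the sentence count is the 11.4M figure from the text.

```python
from itertools import permutations

# Six UN languages; each ordered (source, target) pair is one translation direction.
languages = ["ar", "en", "es", "fr", "ru", "zh"]
language_pairs = list(permutations(languages, 2))

aligned_sentences_millions = 11.4
total_examples_millions = len(language_pairs) * aligned_sentences_millions

print(len(language_pairs), round(total_examples_millions))
```

Six languages give 6 x 5 = 30 directed pairs, and 30 x 11.4M is about 342M, in line with the roughly 340M training examples reported.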
The monolingual model trained as described above has poor quality as a source of systematically-related question reformulations. For example, for the question What month, day and year did Super Bowl 50 take place?, the top rewrite is What month and year goes back to the morning and year?. To improve quality, we resume training on a monolingual dataset which is two orders of magnitude smaller than the U.N. corpus. The monolingual data is extracted from the Paralex database of question paraphrases (Fader et al., 2013) (http://knowitall.cs.washington.edu/paralex/). Unfortunately, this data contains many noisy pairs. We filter many of these pairs out by keeping only those for which the Jaccard coefficient between the sets of source and target terms is above 0.5. Further, since the number of paraphrases for each question can vary significantly, we keep at most 4 paraphrases for each question. After processing, about 1.5M pairs of the original data remain.
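The filtering step can be sketched in a few lines. The thresholds (Jaccard above 0.5, at most 4 paraphrases per question) come from the text; lowercased whitespace tokenization and keeping the first four qualifying paraphrases in input order are assumptions made here, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def jaccard(source: str, target: str) -> float:
    """Jaccard coefficient between the term sets of two questions."""
    a, b = set(source.lower().split()), set(target.lower().split())
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def filter_pairs(
    pairs: Iterable[Tuple[str, str]],
    threshold: float = 0.5,
    max_per_question: int = 4,
) -> List[Tuple[str, str]]:
    """Keep pairs with enough term overlap, capped per source question."""
    kept: List[Tuple[str, str]] = []
    counts = defaultdict(int)
    for source, target in pairs:
        if counts[source] >= max_per_question:
            continue
        if jaccard(source, target) > threshold:
            kept.append((source, target))
            counts[source] += 1
    return kept


example = filter_pairs([
    ("what year was it", "what year is it"),  # 3 shared of 5 distinct terms: kept
    ("what year was it", "who won the game"),  # no shared terms: dropped
])
```

A threshold this high keeps near-paraphrases that share most of their vocabulary, at the cost of discarding valid paraphrases that use different words.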
The refined model has visibly better quality than the zero-shot one; for the example question above it generates What year did superbowl take place?. We also tried training on the monolingual data alone. The resulting quality was in between the multilingual and refined models, consistent with the findings from Johnson et al. (2016).
4.3 RL Training of the Reformulator
After pre-training the reformulator, we switch the optimizer from Adam to SGD and train for RL steps of batch size with a low learning rate of . We use an entropy regularization weight of . For a stopping criterion, we monitor the reward from the best single rewrite, generated via greedy decoding, on the validation set. In contrast to our initial training which we ran on GPUs, this training phase is dominated by latency of the QA system and we run inference and updates on CPU and the BiDAF environment on GPU.
4.4 Training the Aggregator
For the aggregator we use supervised learning: first, we train the reformulator, then we generate rewrites for each question in the SearchQA training and validation sets. After sending these to the environment we have about 2M (question, rewrite, answer) triples to train the aggregator. We remove queries where all rewrites yield identical rewards, which removes about half of the aggregation training data.
We use pre-trained 100-dimensional embeddings (Pennington et al., 2014) for the tokens. Our CNN-based aggregator encodes the three strings into 100-dimensional vectors using a 1D CNN with kernel width 3 and output dimension 100 over the embedded tokens, followed by max-pooling. The vectors are then concatenated and passed through a feed-forward network which produces the binary output, indicating whether the triple performs below or above average, relative to the other reformulations and respective answers.
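The shape of the aggregator's forward pass can be sketched with NumPy. The embedding size, kernel width and output dimension follow the text; everything else is an assumption for illustration: random vectors stand in for the pre-trained embeddings, the weights are untrained, the hidden layer size of 64 and the ReLU are arbitrary choices, and very short strings are zero-padded to the kernel width.

```python
import numpy as np

EMB_DIM, OUT_DIM, KERNEL_WIDTH = 100, 100, 3


def encode(embedded, kernel, bias):
    """1-D convolution over token embeddings, then max-pooling over positions.

    embedded: (num_tokens, EMB_DIM); kernel: (KERNEL_WIDTH, EMB_DIM, OUT_DIM).
    """
    if embedded.shape[0] < KERNEL_WIDTH:  # zero-pad very short strings (assumption)
        pad = np.zeros((KERNEL_WIDTH - embedded.shape[0], EMB_DIM))
        embedded = np.vstack([embedded, pad])
    positions = embedded.shape[0] - KERNEL_WIDTH + 1
    windows = np.stack([embedded[i:i + KERNEL_WIDTH] for i in range(positions)])
    features = np.tensordot(windows, kernel, axes=([1, 2], [0, 1])) + bias
    return features.max(axis=0)  # shape (OUT_DIM,)


def init_params(rng, hidden=64):
    """Untrained parameters: one CNN per string plus a small feed-forward head."""
    conv = [
        (rng.normal(0, 0.01, (KERNEL_WIDTH, EMB_DIM, OUT_DIM)), np.zeros(OUT_DIM))
        for _ in range(3)
    ]
    return {
        "conv": conv,
        "w1": rng.normal(0, 0.01, (3 * OUT_DIM, hidden)),
        "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.01, hidden),
        "b2": 0.0,
    }


def score_triple(query, rewrite, answer, params):
    """Probability that the (query, rewrite, answer) triple is above average."""
    parts = [
        encode(x, k, b)
        for x, (k, b) in zip((query, rewrite, answer), params["conv"])
    ]
    hidden = np.maximum(np.concatenate(parts) @ params["w1"] + params["b1"], 0.0)
    logit = hidden @ params["w2"] + params["b2"]
    return float(1.0 / (1.0 + np.exp(-logit)))


rng = np.random.default_rng(0)
params = init_params(rng)
triple = [rng.normal(size=(n, EMB_DIM)) for n in (12, 9, 2)]  # stand-in embeddings
p_above_average = score_triple(*triple, params)
```

Max-pooling reduces each variable-length string to a fixed 100-dimensional vector, which is what lets the three encodings be concatenated into a single 300-dimensional input.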
This way, we are using the training portion of the SearchQA data thrice, first for the initial training of the BiDAF model, then for the reinforcement-learning based tuning of the reformulator, and finally for training of the aggregator. We found, however, that performance was similar on both train and development indicating that no severe overfitting occurs. We use the test set only for evaluation of the final model.
4.5 Baselines and Benchmarks
As a baseline, we repeat the results reported for a deep learning system developed for SearchQA (Dunn et al., 2017). This is a modified pointer network, called Attention Sum Reader. Dunn et al. (2017) also provide a simpler baseline that ranks unigrams from the search snippets by their TF-IDF score. This baseline is not comparable to our experiments as it can only return unigram answers.
The BiDAF environment can be used without the reformulator to answer the original question. This corresponds to the raw performance of BiDAF, and is our second baseline. We train BiDAF directly on the SearchQA training data. In the SearchQA task, the answers are augmented with several snippets (50 on average) returned by a Google Search for the question. We join snippets to form the context from which BiDAF selects answer spans. For performance reasons, we limit the context to the top 10 snippets. This corresponds to finding the answer on the first page of Google results. The results are only mildly affected by this limitation: for 10% of the questions there is no answer in this shorter context. These datapoints are all counted as losses. We trained with the Adam optimizer for 4500 steps, using learning rate 0.001 and batch size 60.
We present two benchmark performance levels. The first is human performance reported in (Dunn et al., 2017) based on a sample of the test set. The second is ‘Oracle Ranking’, which provides an upper bound on the improvement that can be made by aggregation. For this, we replace the aggregator with an oracle that picks the answer with highest F1 score from the set of those returned for the reformulations.
4.6 AQA Variants
We evaluate several variants of AQA. For each query $q$ in the evaluation we generate a list of reformulations $q_i$, for $i = 1, \dots, N$, from the AQA reformulator trained as described in Section 3. We set in these experiments.
AQA Top Hyp.
First, we omit aggregation, and just select the top hypothesis generated by the sequence model, $q_1$.
AQA Voting
In addition to the most likely answer span, BiDAF also reports a model score, which we use for a heuristic weighted voting scheme to implement a deterministic aggregator. Let $a_i$ be the answer returned by BiDAF for reformulation $q_i$ ($i = 1, \dots, N$), with associated score $s_i$. We pick the answer according to $\operatorname{argmax}_a \sum_{i=1}^{N} s_i \mathbb{1}[a_i = a]$.
AQA Max Conf.
We implement a second heuristic aggregator that selects the answer with the single highest BiDAF score across question reformulations.
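The two heuristic aggregators differ only in how they combine the environment's scores, which a short sketch makes visible. The function names and the toy answers and scores are illustrative; the inputs are assumed to be the answers returned for each reformulation together with the score the QA system reports for each.

```python
from collections import defaultdict
from typing import List


def weighted_vote(answers: List[str], scores: List[float]) -> str:
    """Sum the scores of identical answers and return the answer with the largest total."""
    totals = defaultdict(float)
    for answer, score in zip(answers, scores):
        totals[answer] += score
    return max(totals, key=totals.get)


def max_confidence(answers: List[str], scores: List[float]) -> str:
    """Return the answer attached to the single highest score."""
    best = max(range(len(answers)), key=lambda i: scores[i])
    return answers[best]


toy_answers = ["doctor strange", "merlin", "doctor strange"]
toy_scores = [0.4, 0.7, 0.4]
voted = weighted_vote(toy_answers, toy_scores)
most_confident = max_confidence(toy_answers, toy_scores)
```

On this toy input the two schemes disagree: voting prefers the answer returned twice (total 0.8), while the maximum-confidence rule prefers the single answer scored 0.7.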
Finally, we present the complete system with the learned CNN aggregation model described in Section 2.2.
|Method||Validation EM||Validation F1||Test EM||Test F1|
|Att. Sum Reader||–||24.2||–||22.8|
|AQA Top Hyp.||32.0||38.2||30.6||36.8|
|AQA Max Conf.||35.5||42.0||33.8||40.2|
Table 1 shows the results. We report exact match and F1 metrics, computed on token level between the predicted answer and the gold answer. We present results on the full validation and test sets (referred to as $n$-gram by Dunn et al. (2017)). This includes questions that have both unigram and longer answers.
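For reference, the two metrics can be sketched as below. The normalization used here (lowercasing and whitespace tokenization) is an assumption; the evaluation scripts behind the reported numbers may normalize answers differently.

```python
from collections import Counter


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the token sequences are identical after normalization, else 0.0."""
    return float(prediction.lower().split() == gold.lower().split())


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall against the gold answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


partial = token_f1("Strange", "Doctor Strange")
```

A partially correct span such as "Strange" for the gold answer "Doctor Strange" scores zero on exact match but 2/3 on F1, which is why F1 gives a smoother reward signal for training.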
SearchQA appears to be harder than other recent QA tasks such as SQuAD (Rajpurkar et al., 2016) and CNN/Daily Mail (Hermann et al., 2015), for both machines and humans. BiDAF’s performance drops by 40 F1 points on SearchQA compared to SQuAD and CNN/Daily Mail. However, BiDAF is still competitive on SearchQA, improving over the baseline Attention Sum Reader network by 13.7 F1 points.
Using the top hypothesis alone already yields an improvement of 2.2 F1 on test. This improvement without aggregation demonstrates that the reformulator is able to produce questions more easily answered by the environment. Heuristic aggregation via both Voting and Max Conf yields a further performance boost. Both heuristics draw upon the intuition that when BiDAF is confident in its answer it is more likely to be correct, and that multiple instances of the same answer provide positive evidence (for Max Conf, the max operation implicitly rewards having an answer scored with respect to multiple questions).
Finally, a trained aggregation function improves performance further, yielding an absolute increase of 11 F1 points (32% relative) over BiDAF with the original questions. In terms of exact match score this more than closes half the gap between BiDAF, a state-of-the-art QA system, and human performance.
5 Related work
Bilingual corpora and machine translation have been used to generate paraphrases by pivoting through a second language (Madnani and Dorr, 2010). Early work extracted word or phrase pairs $(e, e')$ from a phrase table such that $e$ translates into a foreign phrase $f$ and $f$ translates back to $e'$ (Bannard and Callison-Burch, 2005). Extensions extract full sentences using a parsing based MT system (Li et al., 2009; Ganitkevitch et al., 2011, 2013). Recent work uses neural translation models and multiple pivots (Mallinson et al., ). In contrast, our approach does not use pivoting and, to our knowledge, is the first direct neural paraphrasing system.
Riezler et al. (2007) propose a phrase-based paraphrasing system for queries which they use to extract synonyms of terms in a given query. In contrast to our approach, the reformulated queries are not used directly, their system uses phrase-tables extracted via pivoting and the use of MT for paraphrasing is mainly introduced to include context, i.e. the other words of the original query, when choosing synonyms.
Reinforcement learning is gaining traction in natural language understanding across many problems. For example, Narasimhan et al. (2015) use RL to learn control policies for multi-user dungeon games where the state of the game is summarized by a textual description, and Li et al. (2017) use RL for dialogue generation.
Policy gradient methods (Williams, 1992; Sutton et al., 1999) have been investigated recently for MT and other sequence-to-sequence problems. They alleviate limitations inherent to the word-level optimization of the cross-entropy loss, allowing sequence-level reward functions, like BLEU, to be used. Sequence level reward functions based on language models and reconstruction errors are used to bootstrap MT with fewer resources (Xia et al., 2016). Bahdanau et al. (2016) extend this line of work using actor-critic training for MT. RL training can also prevent exposure bias: an inconsistency between training and inference time stemming from the fact that the model never sees its own mistakes during training (Ranzato et al., 2015; Shen et al., 2016). We also use policy gradient to optimize our agent; however, we use end-to-end question answering quality as the reward.
Uses of policy gradient for QA include Liang et al. (2017), who train a semantic parser to query a knowledge base, and Seo et al. (2017b) who propose query reduction networks that transform a query to answer questions that involve multi-hop common sense reasoning. The work of Nogueira and Cho (2016) is most related to ours. Their goal is to identify a document containing an answer to a question by following links on a document graph. Evaluating on a set of questions from the game “Jeopardy!”, they learn to walk the Wikipedia graph using an RNN until they reach the predicted article/answer. In a recent follow-up Nogueira and Cho (2017) improve document retrieval with an approach inspired by relevance feedback in combination with RL. They reformulate a query by adding terms from documents retrieved from a search engine for the original query. Our work differs in that we generate full reformulations via sequence-to-sequence modeling rather than adding single terms, and we target question-answering, rather than document retrieval.
Ensemble methods often boost performance of ML systems, and QA models are no exception. This is usually achieved through randomization; predictions from several replicas of the base system, trained independently, are combined to increase robustness. AQA also combines several predictions, but differs conceptually. A classic ensemble fixes the input and perturbs the model parameters. In contrast, AQA perturbs the input, keeping the model fixed. Our approach also resembles active learning (Settles, 2010) because our agent optimizes the input to an environment from which it collects data to receive the most useful responses. AQA departs from the usual regime of active learning in the following ways.
- AQA must choose the optimal input in the space of natural language questions, whereas active learning algorithms usually select from real-valued feature vectors. The former, being discrete and highly structured, is challenging to optimize.
- AQA optimizes the inputs sent to the environment with respect to expected end-to-end performance. Active learning usually optimizes a pre-specified acquisition criterion. (Bayes optimal active learning also tries to optimize end-to-end performance; however, these techniques are often computationally prohibitive in large-scale settings (Roy and McCallum, 2001).)
- AQA seeks the best response per datapoint, i.e. conditioned on an original question. Active learning optimizes inputs to acquire generically useful labels at training time. We must ensure that the questions sent to, and answers received from, the environment are useful with respect to the original query.
Finally, Active QA is related to recent research on fact-checking: Wu et al. (2017) propose to perturb database queries in order to estimate the support of quantitative claims. In Active QA questions are perturbed with a similar purpose, although directly at the surface natural language form.
We propose a new framework to improve question answering. We call it active question answering (AQA), as it aims to improve answering by systematically perturbing input questions. We investigated a first system of this kind that has three components: a question reformulator, a black box QA system, and a candidate answer aggregator. The reformulator and aggregator form a trainable agent that seeks to elicit the best answers from the QA system. Importantly, the agent may only query the environment with natural language questions. Experimental results show that the approach is effective: we improve a sophisticated Deep QA system by 11% absolute F1, 32% relative F1, on a difficult dataset of long, semantically complex questions.
6.1 Future Work
A direct extension is to plug in multiple different environments, such as additional documents or a knowledge base. The reformulator can choose which questions to send to which environments and the aggregator must now accumulate evidence of different modalities. This setting handles multi-task scenarios naturally, where the same AQA agent is trained to interact with multiple backends to solve multiple QA tasks.
As a longer-term extension, we will investigate the sequential, iterative aspect of QA, and frame the problem as an end-to-end RL task, thus closing the loop between the reformulator and the aggregator. Figure 2 depicts the generalized AQA agent-environment framework. The agent (AQA) interacts with the environment (E) in order to answer a question ($q_0$). The environment includes a question answering system (Q&A), and emits observations and rewards. A state $s_t$ at time $t$ is the sequence of observations and actions generated starting from $q_0$, $s_t = x_0, u_0, x_1, \dots, u_{t-1}, x_t$, where $x_i$ includes the question asked ($q_i$), the corresponding answer returned by the QA system ($a_i$), and possibly additional information such as features and auxiliary tasks. The agent includes an action scoring component (U), which decides whether to submit a new question to the environment or return a final answer. Formally, $u_t \in Q \cup A$, where $Q$ is the set of all possible questions, and $A$ is the set of all possible answers. The agent relies on a question reformulation system (QR), that provides candidate follow up questions, and on an answer ranking system (AR), which scores the answers contained in $s_t$. Each answer returned is assigned a reward. The objective is to maximize the expected reward over a set of questions.
With respect to this work there are a few important extensions. The decision making component, including answer scoring and action type prediction, is jointly trained with the reformulator. This component also plays the role of a critic or Q-function. Additionally, the critic can pass internal states to the reformulator, making the reformulations stateful. Finally, this formulation allows multi-step episodes to be performed.
- Bahdanau et al. [2014] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.
- Bahdanau et al. [2016] D. Bahdanau, P. Brakel, K. Xu, A. Goyal, R. Lowe, J. Pineau, A. Courville, and Y. Bengio. An actor-critic algorithm for sequence prediction. https://arxiv.org/abs/1607.07086, 2016.
- Bannard and Callison-Burch [2005] C. Bannard and C. Callison-Burch. Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics, pages 597–604. Association for Computational Linguistics, 2005.
- Britz et al. [2017] D. Britz, A. Goldie, M.-T. Luong, and Q. Le. Massive exploration of neural machine translation architectures. https://arxiv.org/pdf/1703.03906.pdf, 2017.
- Chomsky [1965] N. Chomsky. Aspects of the Theory of Syntax. The MIT Press, 1965.
- Dunn et al. [2017] M. Dunn, L. Sagun, M. Higgins, U. Guney, V. Cirik, and K. Cho. SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. https://arxiv.org/abs/1704.05179, 2017.
- Fader et al. [2013] A. Fader, L. Zettlemoyer, and O. Etzioni. Paraphrase-Driven Learning for Open Question Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 2013.
- Ganitkevitch et al. [2011] J. Ganitkevitch, C. Callison-Burch, C. Napoles, and B. Van Durme. Learning sentential paraphrases from bilingual parallel corpora for text-to-text generation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 1168–1179. Association for Computational Linguistics, 2011.
- Ganitkevitch et al. [2013] J. Ganitkevitch, B. Van Durme, and C. Callison-Burch. PPDB: The paraphrase database. In Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 758–764, 2013.
- Greensmith et al. [2004] E. Greensmith, P. L. Bartlett, and J. Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. Journal of Machine Learning Research, 5(Nov):1471–1530, 2004.
- Hermann et al. [2015] K. M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems, pages 1693–1701, 2015.
- Johnson et al. [2016] M. Johnson, M. Schuster, Q. V. Le, M. Krikun, Y. Wu, Z. Chen, N. Thorat, F. Viégas, M. Wattenberg, G. Corrado, M. Hughes, and J. Dean. Google’s Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. https://arxiv.org/abs/1611.04558, 2016.
- Li et al. [2017] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, and D. Jurafsky. Deep reinforcement learning for dialogue generation. https://arxiv.org/pdf/1606.01541.pdf, 2017.
- Li et al. [2009] Z. Li, C. Callison-Burch, C. Dyer, J. Ganitkevitch, S. Khudanpur, L. Schwartz, W. N. Thornton, J. Weese, and O. F. Zaidan. Joshua: An open source toolkit for parsing-based machine translation. In Proceedings of the Fourth Workshop on Statistical Machine Translation, pages 135–139. Association for Computational Linguistics, 2009.
- Liang et al. [2017] C. Liang, J. Berant, Q. Le, K. D. Forbus, and N. Lao. Neural symbolic machines: Learning semantic parsers on freebase with weak supervision. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, 2017.
- Madnani and Dorr [2010] N. Madnani and B. J. Dorr. Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3):341–387, 2010.
- Mallinson et al. [2017] J. Mallinson, R. Sennrich, and M. Lapata. Paraphrasing revisited with neural machine translation. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics, 2017.
- Marchionini [2006] G. Marchionini. Exploratory search: From finding to understanding. Commun. ACM, 49(4):41–46, 2006.
- Narasimhan et al. [2015] K. Narasimhan, T. Kulkarni, and R. Barzilay. Language understanding for text-based games using deep reinforcement learning. https://arxiv.org/abs/1506.08941, 2015.
- Nogueira and Cho [2016] R. Nogueira and K. Cho. End-to-end goal-driven web navigation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.
- Nogueira and Cho [2017] R. Nogueira and K. Cho. Task-oriented query reformulation with reinforcement learning. https://arxiv.org/abs/1704.04572v1, 2017.
- Pennington et al. [2014] J. Pennington, R. Socher, and C. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
- Rajpurkar et al. [2016] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
- Ranzato et al. [2015] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. https://arxiv.org/abs/1511.06732, 2015.
- Riezler et al. [2007] S. Riezler, A. Vasserman, I. Tsochantaridis, V. Mittal, and Y. Liu. Statistical machine translation for query expansion in answer retrieval. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics (ACL 2007), Prague, Czech Republic, 2007.
- Roy and McCallum [2001] N. Roy and A. McCallum. Toward optimal active learning through monte carlo estimation of error reduction. ICML, Williamstown, pages 441–448, 2001.
- Seo et al. [2017a] M. Seo, A. Kembhavi, A. Farhadi, and H. Hajishirzi. Bidirectional Attention Flow for Machine Comprehension. In Proceedings of ICLR, 2017a.
- Seo et al. [2017b] M. Seo, S. Min, A. Farhadi, and H. Hajishirzi. Query-reduction networks for question answering. In Proceedings of ICLR, 2017b.
- Settles [2010] B. Settles. Active learning literature survey. University of Wisconsin, Madison, 52(55-66):11, 2010.
- Shen et al. [2016] S. Shen, Y. Cheng, Z. He, W. He, H. Wu, M. Sun, and Y. Liu. Minimum risk training for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1683–1692, 2016.
- Sutskever et al. [2014] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, pages 3104–3112, 2014.
- Sutton and Barto [1998] R. S. Sutton and A. G. Barto. Introduction to Reinforcement Learning. MIT Press, Cambridge, MA, USA, 1st edition, 1998. ISBN 0262193981.
- Sutton et al. [1999] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Neural Information Processing Systems, NIPS’99, pages 1057–1063, 1999.
- Williams [1992] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Mach. Learn., 8(3-4):229–256, 1992.
- Williams and Peng [1991] R. J. Williams and J. Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3):241–268, 1991.
- Wu et al. [2017] Y. Wu, P. K. Agarwal, C. Li, J. Yang, and C. Yu. Computational fact checking through query perturbations. ACM Trans. Database Syst., 42(1):4:1–4:41, 2017.
- Xia et al. [2016] Y. Xia, D. He, T. Qin, L. Wang, N. Yu, T.-Y. Liu, and W.-Y. Ma. Dual learning for machine translation. In Proceedings of the 30th International Conference on Neural Information Processing Systems, 2016.
- Ziemski et al. [2016] M. Ziemski, M. Junczys-Dowmunt, and B. Pouliquen. The United Nations Parallel Corpus v1.0. In Proceedings of Language Resources and Evaluation (LREC), 2016.