Towards Understanding and Answering Multi-Sentence Recommendation Questions on Tourism

Towards Understanding and Answering Multi-Sentence Recommendation Questions on Tourism

Danish Contractor
IIT Delhi & IBM Research AI
New Delhi, India &Barun Patra
IIT Delhi
New Delhi, India \ANDMausam
This work was carried out as part of PhD research at IIT Delhi. The author is also a regular employee at IBM Research.
   Parag Singla
IIT Delhi
New Delhi, India

We introduce the first system towards the novel task of answering complex multi-sentence recommendation questions in the tourism domain. Our solution uses a pipeline of two modules: question understanding and answering. For question understanding, we define an SQL-like query language that captures the semantic intent of a question; it supports operators like subset, negation, preference and similarity, which are often found in recommendation questions. We train and compare traditional CRFs as well as bidirectional LSTM-based models for converting a question to its semantic representation. We extend these models to a semi-supervised setting with partially labeled sequences gathered through crowdsourcing. We find that our best model performs semi-supervised training of BiDiLSTM+CRF with hand-designed features and CCM(Chang et al., 2007) constraints. Finally, in an end to end QA system, our answering component converts our question representation into queries fired on underlying knowledge sources. Our experiments on two different answer corpora demonstrate that our system can significantly outperform baselines with up to 20 pt higher accuracy and 17 pt higher recall.

Towards Understanding and Answering Multi-Sentence Recommendation Questions on Tourism

Danish Contractorthanks: This work was carried out as part of PhD research at IIT Delhi. The author is also a regular employee at IBM Research. IIT Delhi & IBM Research AI New Delhi, India                        Barun Patra IIT Delhi New Delhi, India

Mausam and Parag Singla IIT Delhi New Delhi, India {mausam,parags}

1 Introduction

We are motivated by the goal of building an information agent for tourists – one that would perform various roles of a travel agent, such as helping decide the city to visit, recommending points of interest, finding travel routes, and even creating optimized itineraries. Our paper develops a key component of such an agent – a QA system for directly answering recommendation questions. As a first step, we focus our paper on questions that are entity-seeking, i.e., expect one or more entities as answer. These include the large fraction of tourist questions that ask for hotels, restaurants, points of interest and other services that would serve a user’s specific needs the best. Figure 1 shows an example of such a question, where the user is interested in finding a hotel that satisfies some constraints and preferences; an answer to this question is thus the name of a hotel (entity).

A preliminary analysis of such questions from popular tourism forums reveals that almost all of them contain multiple sentences – they often elaborate on a user’s specific situation before asking their question. We name these MSRQs – multi-sentence recommendation questions. An answering system needs to retrieve answer entities from background knowledge sources that may have information about each candidate entity. This includes review sites like TripAdvisor,, travel guides such as WikiTravel,111 or online services like Google Places.222

Figure 1: An entity-seeking MSRQ and its corresponding RQL representation.

Understanding MSRQs raises several novel challenges. MSRQs use informal language, express a wide variety of intents and requirements in each question, and express user preferences and constraints in addition to those for the answer. The questions can be unnecessarily belabored requiring the system to reason about what is important and what is not. Moreover, the querying module needs to incorporate the various constructs found in recommendation questions.

1.1 Contributions

We present the first system for the novel task of answering entity-seeking MSRQs from background corpora in the tourism domain. We make three main technical contributions: a query language to represent MSRQs, a question understanding module to parse the question into our language, and answering systems that perform retrieval over knowledge sources to return answer entities.

Query Language: For question understanding, we define a recommendation question language (RQL) that captures the semantic intent of an MSRQ. RQL is an SQL-like language with operators chosen to cover various entity-seeking MSRQs. It expresses both attributes of the answer and the user. We construct a dataset of MSRQs and their RQL representations (9200 annotated tokens), used as training data.

Question Understanding: We take a sequence labeling approach for question understanding. The current state of the art for sequence labeling uses bidirectional LSTMs with a final CRF layer (Huang et al., 2015). However, because our dataset is relatively small, it was not clear a priori whether a purely neural solution for sequence labeling will be competitive with the more traditional feature engineering methods. In response, we perform extensive experiments and combine various approaches to construct our best model. It includes (1) neural features, (2) hand-designed features (3) constraints capturing additional domain knowledge, and (4) semi-supervised learning over crowdsourced data. Our final model runs the Constraint Driven Learning Algorithm (CoDL)(Chang et al., 2007) for semi-supervised learning over a BiDiLSTM+CRF with additional features and constraints modeled using constraint conditional modeling (CCM)(Chang et al., 2007). We evaluate the incremental value of each addition to the model.

End-to-End QA Experiments: After parsing the question into RQL, our corpus-specific query generators convert the parse into queries that can be fired on the knowledge source for generating answers. Since answering MSRQs is a novel task, there are no direct baselines. We therefore, compare our system against a QA system - WebQA (Vtyurina and Clarke, 2016) which was designed to handle multi-sentence questions; the original WebQA system returned passages as answers, but we adapt it to return entity answers from our knowledge sources for fair comparison. We use two knowledge sources: (1) an offline corpus of reviews for about half a million entities crawled from Google Places, TripAdvisor, WikiTravel and; (2) full online Google Places API. We find that our RQL-based QA dramatically outperforms baselines obtaining 20 pt accuracy, and 17 pt recall improvements.

2 Related Work

To the best of our knowledge, we are the first to explicitly understand and directly answer multi-sentence recommendation by returning entities using a background corpus. Our work is related to the research in question understanding and other forms of QA.

Question Answering Systems: There are two common approaches for QA systems – joint and pipelined, both with different advantages. The joint systems usually train an end-to-end neural architecture, with a softmax over candidate answers (or spans over a given passage) as the final layer (Iyyer et al., 2014; Rajpurkar et al., 2016). Such systems can be rapidly retrained for different domains, as they use minimal hand-constructed or domain-specific features. But, they require huge amounts of labeled QA pairs for training.

In contrast, a pipelined approach (Fader et al., 2014; Berant and Liang, 2014; Fader et al., 2013; Kwiatkowski et al., 2013; Vtyurina and Clarke, 2016; Wang and Nyberg, 2016) divides the task into two components – question processing (understanding) and querying the knowledge source. Since each of these are simpler sub-problems, such methods can be built with relatively less training data, but require more annotation efforts per domain.

It is important to note that for answering an MSRQ, the answer space can include thousands of candidate entities per question, with large unstructured review documents about each entity that help determine the best answer entity. We briefly summarize popular approaches in QA systems for easy comparison of our work with existing literature in Table 1: QA systems and can be broadly classified based on (a) type of questions they answer (b) nature of KB/Corpus used for answering (c) nature of answers returned by the answering system

Question Type Knowledge Type Answer Type Related Work
Structured (eg. DBPedia, Freebase) Entity (Lukovnikov et al., 2017; Bordes et al., 2014b, 2015)
Single Sentence Structured (Open IE style KBs) Entity (Fader et al., 2014; Berant and Liang, 2014)
Structured + Unstructured (Open IE style KBs with supporting text passages on entities ) Entity (Das et al., 2017)
Structured (Databases) Tables/ Table rows (Saha et al., 2016; Pazos R. et al., 2013)
Unstructured Text Spans (Rajpurkar et al., 2016; Trischler et al., 2016; Trivedi et al., 2017; Chen et al., 2017; Joshi et al., 2017)
Unstructured Text Passages (Vtyurina and Clarke, 2016; Wang and Nyberg, 2016, 2015)
Multiple choice answers Answers from specified choices (Guo et al., 2017; Khot et al., 2017)
Multi-sentence Unstructured Text (Answer) passages (Singh and Simperl, 2016; Romeo et al., 2016; Srba and Bielikova, 2016; Bogdanova and Foster, 2016)
Unstructured (QA pairs) Entity (Iyyer et al., 2014)
Semi-structured meta-data + Unstructured (Entity Reviews) Entity Our work
Table 1: Related work: QA

The problem of directly returning answers to questions from background knowledge sources has been studied, but primarily for single sentence factoid-like questions (Fader et al., 2014; Berant and Liang, 2014; Yin et al., 2015; Sun et al., 2015; Saha et al., 2016; Khot et al., 2017; Lukovnikov et al., 2017). Reading comprehension tasks (Rajpurkar et al., 2016; Trischler et al., 2016; Joshi et al., 2017; Trivedi et al., 2017) require answers to be generated from unstructured text also only return answers for simple single-sentence questions. Other works have considered multi-sentence questions, but in different settings, such as the specialized setting of answering multiple-choice SAT and science questions (Seo et al., 2015; Clark et al., 2016; Khot et al., 2017; Guo et al., 2017), mathematical word problems (Liang et al., 2016), and textbook questions (Sachan et al., 2016). Community QA systems (Bogdanova and Foster, 2016; Shen et al., 2015; Qiu and Huang, 2015; Tan et al., 2015) match questions with user-provided answers, instead of entities from background knowledge-source. IR-based systems (Wang and Nyberg, 2016) query the Web for open-domain questions, but return long (1000 character) passages as answers; they haven’t been tested on recommendation questions. The techniques that can handle MSRQs (Vtyurina and Clarke, 2016; Wang and Nyberg, 2016) typically perform retrieval using keywords extracted from questions; these do not understand the questions well and can’t answer many tourism questions, as our experiments show. The more traditional solutions (e.g., semantic parsing) that parse the questions deeply can process only single-sentence questions (Fader et al., 2014; Berant and Liang, 2014; Fader et al., 2013; Kwiatkowski et al., 2013).

Finally, systems such as QANTA (Iyyer et al., 2014) also answer complex multi-sentence questions but their methods can only select answers from a small list of entities and also require large amounts of training data with redundancy of QA pairs. In contrast, the subset of Google Places we experiment with has close to half a million entities. Further, in our task, the reviews about each entity are significantly longer333Reviews for each entity are contactenated to serve as background information about that entity, resulting in documents ranging in length from a few hundred to a few thousand sentences. than passages (or similar length articles) that have traditionally been used in QA tasks and it is only recently that the task of QA via neural machine comprehension of long documents has been proposed (Trivedi et al., 2017).

Semantic Representation of Questions: QA systems use a variety of different intermediate semantic representations. Most of them, including the rich body of work in NLIDB and semantic parsing, parse single sentence questions into a query based on the underlying ontology or DB schema (Pazos R. et al., 2013; Saha et al., 2016; Zettlemoyer, 2009; Liang, 2011; Trivedi et al., 2017). Open QA (Fader et al., 2014) uses an open-domain representation for factoid single-sentence QA.

Recent works build neural models that represent a question as a continuous-valued vector (Bordes et al., 2014a, b; Xu et al., 2016; Chen et al., 2016; Zhang et al., 2016). Some systems rely on IR and do not construct explicit semantic representations at all (Sun et al., 2015; Vtyurina and Clarke, 2016); instead, they rely on selecting keywords from the question for querying. They can handle multi-sentence questions, but do not understand questions deeply. To the best of our knowledge no question parser has been developed for MSRQs.

Figure 2: Schematic Representation of the system

3 System Architecture

Our QA system broadly consists of two modules (see Figure 2): question understanding, and answer generation. As motivated earlier, the modularized two-step architecture allows us to tackle different aspects of the problem independently. The semantic representation generated by the question understanding module is generic and not tied to a specific corpora or ontology. This allows the answering module to be optimized efficiently for each knowledge source, as well as allows the integration of multiple data sources, each with their own schema and strengths for answering. We first describe the details of our question representation (Section 4). Further sections describe the question understanding (Section 5) and answer generation (Section 6) modules in detail. Examples of some user questions and the answer entities returned by our system are shown in Table 8.

Q-> select x where C
C-> C | (C and C) | (C or C)
C-> L R T | L near R T
R-> = | P = | in {T} | in [T]
T-> T,T | T
T’-> ‘<phrase>’
P-> pref | similar | in | not |
P-> P P
L-> x.attr | user.attr | x.type |
    x | x.location | user.location
L’-> x.location
Figure 3: CFG rules for our query representation

4 RQL Representation

Since we build the first system of its kind, we need to balance representation expressiveness and its answerability. For our first version, we make the assumption that the MSRQ is asking only one final question, and that the expected answer is one or more entities. This precludes Boolean, comparison, ‘why’/‘how’, and multiple part questions. We now describe RQL, our language for representing such MSRQs.

We choose a relatively open question representation for RQL. It makes minimal assumptions about the answering knowledge sources and therefore, minimizes schema or ontology specific semantic vocabulary. Another advantange is that RQL with small changes could be rapidly adapted to a non-tourism domain (see Table 2 for examples of RQL queries for automobiles and electronics domains). At the same time, we note that RQL can easily be extended to include more schema-specific semantic labels, if required.

Domain Question RQL Representation
Automobiles I want to buy a sedan in diesel version and budget is USD 30,000-40,000. looking for one with basic luxury, nothing too fancy. Which one is best:? select x where x.type=”sedan” & x.attribute=”diesel version” & x.attribute=”USD 30000-40000” & x.attribute=”basic luxury” & x.attribute=”nothing too fancy
Automobiles Can anyone suggest to me a reliable brand of a tyre pressure guage and pump? An estimate of their approx costs and place of availability in Delhi would be preferrable. select x where x.type=”place of availability” & x.attribute=”tyre pressure guage and pump” & x.attribute=”reliable brand” & x.location PREF =”Delhi
Electronics My 15 year old Broksonic TV is dying, so I am needing to buy a replacement. Want an LCD TV about the same size (20”), with good picture clarity and sound quality. Must have composite (RCA) or component connectors to fit my DVD recorder, and VCR. Looking to buy from Amazon, TigerDirect, etc. Recommendations? select x where x.type=”LCD TV” & x.attribute=”20,̈ with good picture quality” & x.attribute=”composite (RCA) or component connectors” & x.attribute=”from Amazon, TigerDirect”
Electronics I’ve done a search and concluded that I can’t afford the best washing machine (Miele etc) so how about some recommendations for a good quality front loader,7.5 kg and up to .Thanks. select x where x.type=”washing machine” & x.attribute=”good quality front loader” & x.attribute=”7.5 Kg” & x.attribute=”upto $1000”,x NOT SIMILAR ”Miele”
Table 2: RQL representations in different domains

We illustrate RQL’s representation choices by means of an example (Figure 1). Here, the user is interested in finding a hotel that satisfies some constraints and preferences. The question includes some information about the user herself, which may or may not be relevant for answering.

RQL resembles an SQL-like language. Since each question has an entity (or more) as an answer, it denotes the desired answer by . Each answer will have a type (referred to by the semantic label ), e.g., ‘place to stay’. Many tourism questions may be about facilities, which may have an . To accommodate other characteristics of the answer entity, RQL defines – any phrase describing the answer that is not type or location is marked as an attribute. Users often expect personalized answers and explain their individual situation in their question. To accommodate aspects of a user that may be important while answering, RQL defines a special entity called . It maintains and for user’s features; ‘three daughters in their young 20s’ will be marked in our example.

An analysis of tourism forum questions reveals that RQL can adequately represent almost all tourism questions that satisfy our assumptions. Notice the limited semantic vocabulary for candidate answers (type, location, attribute) – this aligns with our goal of making minimal assumptions about the knowledge sources.

Operators: Another key feature of RQL is that it maximizes the coverage of common operators found in a recommendation question, so that a robust down-stream QA or IR system can meaningfully answer it. In addition to standard logical connectives like AND, OR and NOT (for example, the phrase “not very spicy” may be represented as NOT = ‘very spicy’), RQL also defines four more operators (PREF, NEAR, IN, and SIMILAR) to represent common constructs in MSRQs.

PREF expresses a preference that is not a constraint, e.g., “I would prefer to eat sushi” ( PREF = ‘sushi’). The NEAR operator is used when a user requires recommendations that are geographically close to a location specified in the question (e.g “near Salzburg” will be annotated as NEAR ‘Salzburg’).

SIMILAR is used when a user mentions similar entities. For instance, “I have been to Red Hoods and wanna visit a similar place” will be annotated as SIMILAR = ‘Red Hoods’. We note that mention of entities such as ‘Red Hoods’ can be very informative, since these typically represent siblings of the answer – instances of the type of the desired answer. We name these as sibling entities.

Finally, IN is used when a user explicitly provides a list of sibling entities among which she wants an answer. An example is “I have shortlisted Red Hoods, Cafe China and Royals. Help me”; RQL construct will be IN {‘Red Hoods’, ‘Cafe China’, ‘Royals’}. Curly brackets denote an enumerated set. RQL also uses a special square bracket symbol to denote a range. For example, “location between New York and New Jersey” will translate to IN [‘New York’, ‘New Jersey’]. These operators may also be nested, for example, “preferably not Red Hoods or Royals” will be PREF NOT IN {‘Red Hoods’, ‘Royals’}.

Formally, RQL language is defined in Figure 3. A query is generated by a set of clauses . Each clause can generate a label with operator and a terminal ‘phrase’ generated by and . A separate rule with is written for NEAR operators since they only support .

Two expert annotators with background in NLP annotate 150 user questions (9200 annotated tokens) with their RQL representations. These questions are chosen randomly from a popular tourism forum site. The annotators resolve their differences in person to produce a combined labeled set, which serves as training data for question understanding in the rest of the paper.

5 Question Understanding

Before describing implementation details of our question understanding component, we present some background on Constrained Conditional Models (CCMs)(Chang et al., 2007) and BiDiLSTM+CRF(Huang et al., 2015) as these are at the core of our question understanding component.

5.1 Background on Sequence Labeling

Constraint Conditional Models (CCMs) extend CRFs (Lafferty et al., 2001) by allowing an expert to express domain knowledge through hard or soft constraints. CCMs use an alternate learning objective expressed as the difference between the original CRF log-likelihood and a constraint violation penalty (Chang et al., 2007):

Here, is the sequence and its labeling. and are feature and weight vectors respectively. and denote the violation score and weight associated with constraint. The parameters are learned analogous to a vanilla CRF and computing parameters resorts to counting. Hard constraints have an infinite weight. Inference in CCMs is formulated as an Integer Linear Program (ILP); see Chang et al.Chang et al. (2007) for details.

Constraints driven learning (CODL) is a semi-supervised iterative weight update algorithm, where the weights at each step are computed using a combination of the models learned on the labeled and the unlabeled set (Chang et al., 2007). CoDL’s weight update equation is:

Here, denotes the iteration number. function learns the parameters of the model on examples supplied as its argument. The parameters at iteration are learned only using the labeled data. denotes the unlabeled set whose values have been filled in using parameters at iteration . controls the relative importance of the labeled and unlabeled examples.

Bidirectional LSTM based CRF is a state of the art neural approach for sequence labeling (Huang et al., 2015). The output of BiDi LSTM network at each time step feeds into a CRF layer. The consecutive outputs of the LSTM states are connected to each other in the CRF layer and consist of a state transition matrix that contain the probabilities of transitions between output labels.

It can be seen as a combination of neural feature engineering and CRF’s joint inference.

5.2 Semantic Labeling of Questions

We now discuss our approach to parse an MSRQ into its RQL representation. We could use a full-blown CFG parsing approach. But given that the context-free component of our grammar is limited (most clauses are conjunctive), we, instead, use a combination of sequence labeling and operator post-processing. Our token-level label set directly corresponds to semantic labels in RQL: {x.type, x.attribute, x.location, x.sibling, user.attribute, user.location, other}. Here, refers to sibling entities and label captures all tokens not assigned any of the semantic labels. For now, we mark operator words as and handle them separately as a post-processing step using a set of lexical rules.

Our sequence labeling task dataset is relatively small. Because neural approaches are often more effective in large data settings, we experimented with both solutions – traditional CRFs and BiDiLSTM CRF. We further improve performance using CCM constraints and crowdsourcing more data.

5.2.1 Supervised Labeling

Conditional Random Field: We first pose the sequence labeling task as a single linear chain CRF (Lafferty et al., 2001) over the MSRQ. We implement a number of features as follows. (a) Lexical features for capitalization, indicating numerals etc., token-level features based on POS and NER (b) hand-designed and specific features. These include indicators for guessing potential types, based on targets of WH (what, where, which) words and certain verb classes; dependency parse features that aid in attribute detection, e.g., for every noun and adjective, an attribute indicator feature is on if any of its ancestors is a potential as indicated by type feature; indicator features for descriptive phrases (Contractor et al., 2016), such as adjective-noun pairs. (c) For each token, we include cluster ids generated from a clustering of word2vec vectors (Mikolov et al., 2013) run over a large tourism corpus. (d) We also use the counts of a token in the entire post, as a feature for that token (Vtyurina and Clarke, 2016).

Constrained Conditional Model: Since we label multi-sentence questions, we need to capture patterns spanning across sentences. One method of doing so would be to model these patterns as features defined over non-adjacent tokens (labels). But this can make the modeling quite complex. Instead, we model them as global constraints over the set of possible labels using CCMs. We design the following constraints: (i) type constraint (hard): every question must have at least one token, and (ii) attribute constraint (soft), which penalizes absence of an label in the sequence. (iii) a soft constraint that prefers all tokens occur in the same sentence. The last constraint helps reduce erroneous labels but allows the labeler to choose -labeled tokens from multiple sentences only if it is very confident. Thus, while the first two constraints are directed towards improving recall, the last constraint helps improve precision of labels.

Figure 4: BiDi LSTM CCM for labeling question tokens.

BiDi LSTM based sequence modeling: We also experiment with neural approaches by modeling each question using a bi-directional LSTM CRF. The input states in the LSTM are modeled using a dimension word vector representation of the token. These word vector representations were pre-trained using the word2vec model(Mikolov et al., 2013) on a large collection of tourism questions.

We extend the basic BiDiLST CRF model in two ways to improve performance in our low-data setting (see Figure 4). First, we allow use of hand-crafted features by representing each unique feature set as a one-hot vector and concatenating this feature vector with the word-vector representation of each token444We also experimented with using a feature vector in the CRF layer instead of the LSTM input layer but it gave poorer results. Second, we enforce the CCM constraints during inference and the model is trained end-to-end using back-propagation. To the best of our knowledge, this combined model of BiDiLSTM CRF with CCM constraints is a novel contribution of our work.

5.2.2 Crowd-sourced Data Collection

In order to obtain a larger amount of labeled data for our task, we decided to make use of crowd-sourcing (Amazon Mechanical Turk). Since our labeling task can be fairly complex, we divide our crowd task into multiple steps. We first ask crowd to (i) filter out forum questions that are not recommendation questions. For the questions that remain, the crowd provides (ii) labels, and (iii) labels. For each step, instead of directly asking for token labels, we ask a series of indirect questions that lead us to the desired labels. For example, for , we ask workers to highlight the text that describes what the user is asking for as shown below:

  • “Which sequence of words in the question tells you what the user is asking for? Label only one sequence from a single sentence; we prefer a continuous sequence.”

  • “What is the shortest sequence of words in “A1 (answer to the question above)” that describes a category (e.g. place to stay, restaurant, show, place to eat, spot, hotel etc.)?”

Token level disagreement (%) 52.02 62.22 31.44
Table 3: Inter-worker disagreement on AMT

This alternate way of labeling by asking a series of questions is inspired by (He et al., 2015). We obtain two sets of labels (different workers) on each question. Unfortunately, despite breaking the task into simpler steps, we see disagreement among the workers on some labels (see Table 3 for token-level disagreement statistics). Some of the disagreement results from labeling errors due to complex nature of the task. In other cases, the disagreement results from their choosing one of the several possible correct answers. E.g., in the phrase “good restaurant for dinner” one worker labels ‘restaurant’, ‘good’ and ‘dinner’, while another worker simply chooses the entire phrase as . We find it difficult to relate expert-guidelines to the crowd in a succinct and clear manner.

Since most of our posts have some disagreement among the workers, how do we incorporate this data into our supervised learning setting? We devise the novel idea of only keeping the labels where the two annotators agree and disregarding the other labels, resulting in partially labeled sequences. . This is different from disregarding the entire example where there is any disagreement, which would result in loss of significant (correctly labeled) data. We use this partial supervision, along with our original training set, in a semi-supervised learning setting, to learn the parameters of our CCM model. To the best our knowledge, we are the first to exploit partial supervision from a crowd-sourcing platform in this manner.

5.2.3 Semi-Supervised Labeling

In order to use these partially labeled sequences, we adapt the original CoDL algorithm to work with partial labels. We replace the unlabeled set by the partially labeled set; inference over the set involves predicting only the missing labels. This can be still be done using an ILP based formulation. Rest of the update equation remains the same. In the case of the BiDiLSTM CRF based formulation, the modeling remains the same as described, except the function now corresponds to training the neural network via back-propagation.

Model F1 (x.type) F1 (x.attr) F1 (x.loc) F1 (aggr)
CRF (all features) 51.4 45.3 55.7 50.8
CCM 59.6 50.0 56.1 55.2
CCM (with all crowd data) 55.1 42.2 46.7 48.0
Semi-supervised CCM 58.5 50.6 60.3 56.5
BiDi LSTM CRF 53.3 47.6 52.1 51.0
BiDi LSTM CRF with all features 58.4 48.1 62.0 56.2
BiDi LSTM CCM with Features 59.4 49.8 62.3 57.2
Semi-Supervised BiDi LSTM CCM with features 62.9 50.4 61.5 58.3
Table 4: Sequence tagger F1 using CRF with all features, CCM with all features & constraints, and semi-supervised CCM over partially labeled crowd data. The second set of results mirror these numbers using a Bi-directional LSTM CRF. Results are statistically significant (paired t-test,p value0.000124).

5.3 Operator Post-processing & RQL Generation

To construct the final RQL query, we need to identify the appropriate operator for each labeled phrase. For each operator, we start with a manually curated set of seed words, and expand it using synonym and antonym counter fitted word vectors (Mrksic et al., 2016). The resulting set of trigger words flag the presence of an operator in a sentence. A set of deterministic rules estimate the scope of each operator. Rules use semantic labels from sequence labeling, along with some token-level features such as part of speech tags to identify scope. For instance, a token (or a set of continuous tokens with the same label) labeled by our sequence tagger that occur within a specified window of a trigger word for “negation”, are in its scope. A secondary set of rules are used to compose operators with each other. For instance, if a phrase that is in scope for “negation” is also in scope for a “disjunction”, then the “negation” applies to all tokens in scope of the disjunction. Further, in this case the disjunction converts to a conjunction by laws of Boolean algebra.

5.4 Evaluation

This evaluation answers the questions: (1) which model obtains the best performance for our semantic tagger, and (2) what is the incremental contribution of features, CCM constraints and crowdsourced annotation for our task. We additionally learn about effectiveness of neural models for sequence labeling in low-data setting.

Evaluation Details: We use the 150 expert-annotated tourism forum questions as our dataset and perform leave-one out cross-validation. Our current implementation trains only a subset of the labels {, , , } – these are most important for downstream QA.

We use the Mallet toolkit555 for CRF implementation and the GLPK ILP-based solver666 for CCM inference.

For expts with semi-supervised learning, we add 400 partially-annotated questions from crowdsourced workers to our training set. We retain token labels marked the same by two workers for every question, and treat the other labels as unknown. We pay a total of $1.02 per question for this annotation. We still compute a leave one out cross-validation on our original 150 expert-annotated questions (complete crowd data is included in each training fold). For CoDL learning we set to 0.9 as per original authors’ recommendations.

Sequence-tagged tokens identify phrases for each semantic label, which become units for constructing the final RQL query. So, instead of reporting metrics at the token level, we compute a more meaningful joint metric over tagged phrases. We define a matching-based metric that first matches each extracted segment with the closest one in the gold set, and then computes segment level precision using constituent tokens. Analogously, recall is computed by matching each segment in gold set with the best one in extracted set. As an example, for Figure 1, if the system extracts “convenient to the majority” and “local budget” for then our matching-metric will compute precision as 0.75 (1.0 for “convenient to the majority” and 0.5 for “local budget)” and recall as 0.45 (1.0 for “budget”, 0.0 for “best” and 0.364 for “convenient to the majority … like to see”).

Results: Table 4 reports the performance of our semantic labeler under different configurations. We find that using a CRF based system using all features gives us an aggregate F1 of . The use of our CCM constraints have a significant impact on overall performance, raising aggregate F-score by over 4 points. Table 4 shows that the CoDL training procedure for semi-supervised CCM boost the aggregate F-scores by a further 1.5 points. In order to see how useful our semi-supervised approach is, we also train a CCM based model that makes use of the complete crowdsourced dataset for training, by adding conflicting labels for a question as two independent training data points. As can be seen, without a semi-supervised approach, the noisy crowd-labeled sequences hurts the performance significantly.

We also repeated the experiments using a BiDi LSTM+CRF using only the pre-trained word-embeddings as input features. We were surprised to find that it performs at par with the vanilla CRF model that uses all our engineered features. This speaks to the effectiveness of neural models even in low data scenarios. Our further extension on adding our features to the neural model yield a 5 pt performance boost. Both CCM constraints and semi-supervision improve F-scores by one point each. Overall, our best model combines neural features, hand-designed features, CCM constraints with semi-supervised learning over partially annotated crowd data.

Effect of features: We perform an ablation study over our feature sets to learn about incremental importance of each feature (detailed results omitted due to lack of space). We find that descriptive phrases, and hand-constructed type and attribute indicators improve the performance of each label by 2-3 points. Word2vec features help type detection because labels often occur in similar contexts, leading to informative vectors for typical type words. Frequency of non stopword words in the post are an indicator of the word’s relative importance, and the feature also helps improves overall performance.

Algorithm Prec Recall F1
CRF (all features) 66.9 41.7 51.4
CCM (all features) 62.1 57.2 59.6
BiDI LSTM CRF with Features 54.1 63.6 58.4
BiDi LSTM CCM with Features 55.1 64.5 59.4
Table 5: (i) Precision and Recall of x.type with and without CCM inference.

Effect of constraints: A closer inspection of Table 4 reveals that the vanilla CRF configuration sees more benefit in using our CCM constraints as compared to the BiDiLSTM+CRF based setup (4pt vs 1pt). To understand why, we study the detailed precision-recall characteristics of individual labels; the results for are reported in Table 5. We find that the BiDiLSTM+CRF based setup has significantly higher recall than its equivalent vanilla CRF counter-part while the opposite trend is observed for precision. As a result, since two of the three constraints employed by us in CCM are oriented towards improving recall777Recall that we require at least one x.type (hard constraint) and prefer at least one x.attr (soft constraint), we find that they improve overall F1 more by finding tags that were otherwise of lower probability (i.e. improving recall). However, note that the semi-supervised CCMs have similar performance benefits in both configurations.

Error Analysis: Identifying attributes is the toughest task among all semantic labels. Attributes may be associated with entities irrelevant in answering a query. E.g., in ”Staying in a fancy house, looking for a clean beach. Ideas??”, ‘fancy’ is associated with the ‘house’, and is irrelevant to the beaches being queried for. Moreover, depending on the phrasing of a question, there is a confusion between attribute and type entities: ‘Moroccan Food’ in ”Looking for Moroccan Food” is a type; while in ”Looking for a restaurant with good Moroccan Food”, we mark it is an attribute. This confusion is also reflected in Table 3, where workers agree the least on attributes.

Operator Labeling: We also evaluate the operator labeling component from Section 5.3. We create rules for commonly occurring operators – disjunctions, negations, and preferences.888Our current system ignores IN and SIMILAR operators The default operator between semantic labels is a conjunction. Table 6 reports the accuracy of our rules as evaluated by an author. The ‘Gold’ columns denote the performance when using gold semantic label mentions. The ’System’ columns are the performance when using labels generated by our sequence tagger. We find a significant drop in detection of disjunctions and preferences. A detailed study reveals that nearly 70% of all disjunction clauses occur in the context of attribute labels and almost all preferences are expressed for attributes. Since system’s recall for is low, the pipeline under-performance is not entirely surprising.999There is a more severe drop in disjunctions because for a valid disjunction match we need at least two sets of phrases correctly identified in the same query. However, as the next section shows, the low recall for these tags is not very problematic for the end QA task, as they are relatively infrequent. Moreover, even if we miss some attributes, the other attributes can often lead us to correct answers.

Gold System
Negations 86 66 85 62
Preference 75 60 33 15
Disjunctions 86 59 50 08
Table 6: Performance of operator detection using gold sequence labels, and system generated labels

6 Answer Generation

The answer generation component serves as a translation layer from an RQL parse to the schema and querying mechanism of the knowledge source for retrieving entity answers. To demonstrate the ease of integrating different types of knowledge sources, we experiment with two different knowledge sources: (i) an Apache Lucene data store containing data from Google Places, reviews from Trip Advisor, and as well as articles from WikiTravel, resulting in data for 480,000 unique entities, (ii) the full Google Places collection queried using its Web API.

We randomly select 150 new unseen questions (different from the questions used in the previous section), from a tourism forum website and manually remove 45 of those that were not entity-seeking. For the remaining 105 questions our annotators manually look up the actual entity-answers for those questions and this forms our test set for the answering task. Baselines: As a baseline, we adapt and reimplement a recent tf-idf scheme (called WebQA) originally meant for finding appropriate Google results to answer questions posed in user forums (Vtyurina and Clarke, 2016). WebQA first shortlists a set of top 10 words in the question using tfidf computed over the set of all questions; it forms a query by selecting 3 words based on supervised training data. Since we don’t have any supervised data, we select the top 3 words in the shortlist as the query (space separated). Further, we improve this to a WebQA-Manual system in which an expert chooses 3-4 best words manually.

Lucene-based QA: We construct the background knowledge source by indexing data from four sources. Using a list of 3500 cities that have a population over 100,000: (i) we query Google Places API101010 and obtain records on 405,000 entities (hotels, attractions, restaurants, and other service providers); each record contains the Google Places type,111111 address, overall rating, map coordinates but only five user reviews. (ii) we collect hotel data from and index upto 75 reviews per hotel, and (iii) we index reviews of attractions and places to visit for each city using This results in a total of 454,000 distinct entities. (iv) To answer questions where users are asking for a city or places around a city, we add the full collection of WikiTravel121212 containing 26,000 cities along with its text. For each entity’s textual data (reviews, WikiTravel articles etc), we remove stop words, lemmatize it, and index as a record into Apache Lucene.

Data collected from and does not contain a default “type” field. We index such records by marking their Lucene types as “lodging” for hotels and “point_of_interest” for attractions, since these values are used by other records from Google Places to refer to such entities. All other entities from Google Places retain their Google place type as the Lucene type. We also associate each city entity with a special Lucene type called “city”. We also store its geographical coordinates using the Google Maps API.

Our answer generator needs to translate a question’s RQL representation into a Lucene Query. It first selects the most appropriate Lucene type for the answer by calculating the cosine similarity between RQL phrase and all Lucene types in counter-fitted word vector embedding space (Mrksic et al., 2016). It uses phrases from to query the text fields, and applies filters on location using and question meta-data. We also use the phrases to query the text fields, as sometimes descriptive phrases get included as part of the (e.g, “budget hotel”). Negations, disjunctions and conjunctions are enforced appropriately using Lucene BooleanQuery131313 For now, we ignore PREF and other non-standard operators when forming the Lucene query. For fairness, the WebQA baselines also enforce location filters using question metadata and use in-built Lucene query parsers to generate queries.

As a back-off strategy, in case the system returns no results (sometimes because of a very strict query), we relax the query by replacing conjuctions with disjunctions between tags. For the baseline systems the back-off strategy drops least important (low tf-idf weighted) terms from the query.

Data System Acc@3 MRR Recall
Lucene Default Lucene Query parser 23.7 0.16 20.9
WebQA 25 0.17 21.9
WebQA (manual) 27.7 0.23 23.8
RQL-QA 34.6 0.28 24.8
WebQA 50 0.47 4.7
Google WebQA (manual) 42.6 0.39 30
Places API RQL-QA 62.5 0.52 47.6
Table 7: QA results on two knowledge sources
No. Question Entity Type System Answer
1 My family and my brother’s family will be in Salzburg over Christmas 2015. We have arranged to do the Sleigh Ride on Christmas day but are keen to do a local style Christmas Day dinner somewhere. Any suggestions? Special Dinner place St. Peter Stiftskulinarium, Sankt-Peter-Bezirk 14, 5020 Salzburg
2 Heading to Salzburg by car on Friday September 18th with my wife and her parents (70’s) and trying to make the most of the one day. Thinking about a SOM tour, but not sure what the best tour is, not a big fan of huge groups or buses, but would sacrifice for my Mother in Law (LOL). Also thinking about Old Town or the Salzburg Fortress. Any suggestions? Where to park to have easy access as well as a great place for dinner.Thanks so much! Tour Bob’s Special Tours, Rudolfskai 38, 5020 Salzburg, Austria
3 What can you do in Helsinki on a Sunday morning? What would you recommend a tourist to do or see on a Sunday morning? I’ll be ariving at 7 in the morning, and it seems like everything’s closed on a Sunday morning- either its not open on Sundays or else it’ll open but later on in the day. Things to do / see Senate Square, 00170 Helsinki, Finland Ateneum,Kaivokatu 2, 00100 Helsinki, Finland ..
4 I am planning to visit Agra for 2 days in mid Dec with my friends.My plan is to try some street food and do some local shopping on day 1 and thus wish to stay in a good budget 3 star hotel (as I wont be spending much time in the hotel) at walking distance from such street foodlocal shopping market.Then on the 2nd day, I want to just relax and enjoy the hotel.(I have booked a premium category hotel, Radisson Blu for this day hoping for a relaxed stay)Please suggest some good hotel or market around which I should book an hotel for my first day. Hotel with location constraints Hotel Taj Plaza, Agra, Taj Mahal East Gate, Near Hotel Oberoi Amar Vilas, VIP Road, Shilpgram, Agra, Uttar Pradesh 282001, India
5 Hi there. I am going to Tallinn in a month from just one night on a Saturday. I am 28 and am going with 5 of my friends. Were should we stay so we are near the best clubs in the city? Any recomendations are appreciated!!! Thanks. Place to stay close to clubs Club Prive, Tallinn, Estonia
6 A few friends and I are coming up to Newport for a couple of nights and are looking for restaurant suggestions. We are thinking something casual for the first night. Is Flo’s any good? And then something nicer on Saturday night….preferably a restaurant with good seafood. Also, any suggestions for good breakfast? Restaurant based on cuisine The Red Parrot Restaurant, 48 Thames St, Newport, RI 02840, United States
7 Dear All forum members, I am Yash Khatri from Delhi.I am travelling to Srinagar on 13th July,2016 to 17th July,2016.I am going there for a show, and I’ll be free on 15th and 16th July, 2016. I was thinking to hire a bike at Srinagar and travel toGulmargPahalgam.Queries :1) Where can I rent a bike at Srinagar and how much will it cost me?2) What is better for a quick visit; Gulmarg or Pahalgam?Please help!Thanks Motorcycle rental Kashmir Bikers - Bike Rentals, Sheikh complex , shiraz chowk ,khanyar, Near j&k bank khanyar, Srinagar, Jammu and Kashmir 190003
8. In a couple of weeks, we will have an almost 2 hr layover in Zagreb before flying on to Dubrovnik. Any recommendations for lunch ? A location for lunch that can be visited in a 2 hour layover Hotel Dubrovnik,Gajeva ul. 1, 10000, Zagreb, Croatia
9. Hi,I am looking for a good hotel in Shillong (preferably near Police bazar) which would offer free wifi, spa and are okay with unmarried couples. My budget is 3k maximum. please suggest the best place to stay. Hotel with location and budget constraints Hotel Pegasus Crown, Ward’s Lake Road, Police Bazar, Shillong, Meghalaya 793001, India ;
10. Coming to Gent soon and we will take the trainbus from Charleroi but ideally would like a taxi back from Gent to Charleroi. Can anyone recommend a good taxi firm please? Taxi Service Taxi Didier Ghent Taxi Service, Salviastraat 17, 9040 Sint-Amandsberg Gent, Belgium
Table 8: Some sample questions from our test set and the answers returned by our system. Answers in green are identified as correct while those in red are incorrect.

In order to answer queries that have operators, a Lucene query is first generated to identify candidate locations (of type City) by using the map co-ordinates and the Lucene Geo-point distance query API141414 A second query is then generated that selects the best entities from these cities using the and the entities’ text fields. This example demonstrates how some questions may require specialized query generation and our RQL representation enables easy translations into native KB query language.

Google Places based QA: While Lucene store has a large number of entities, it often has very few reviews for many of them. This may often miss the reviews necessary to match with attributes expressed in the question. For the second setting, our answer generator translates an RQL parse directly into a Google Places API query. This makes use of all of Google Places knowledge, as well as its advanced IR capability, while offering less control on the query langugage.

We generate Google Places query via the transformation: “concat () in ”, with operators applied at appropriate labels. Here, concat lists all attributes in a space-separated fashion. This query under-exploits our rich semantic representation, but still returns more answers than Lucene due to its coverage.

In case the API returns no results, we relax our query by dropping and then re-querying Google Places. This is sometimes helpful because RQL may contain a lot of tags that overwhelm the API151515 tags with values such as “good’, “great”, “best”, “convenient” are non-informative and the Google Places API internal ranking implictly ranks for adjectives even if they aren’t specificed in the query. An explicit specification of too many such tags can cause the API to return no results.. A downside of dropping tags, however, is that any strict selectional preferences expressed by the user in the question may be lost because we don’t identify if some tags are more important than others.

Results: Table 7 reports Accuracy@3, which gives credit if any one of the top three answers is a correct answer. It also reports Mean Reciprocal Rank (MRR). Both of these measures are computed only on the subset of attempted questions (any answer returned). Recall is computed as the percentage of questions answered correctly within the top three answers over all questions. In case the user question requires more than one entity type161616A question can ask for multiple things, eg., ‘museums’ as well suggestions for “hotels”., we mark an answer correct as long as one of them is attempted and answered correctly.

For Lucene, while all systems answer nearly equal number of questions, RQL-QA has a 7-11 pt point higher accuracy. This is because of the type constraints enforced while querying. Some of its errors are due to incorrect matching of textual types (e.g., ‘things to do’) to KB types (tour agencies) due to faulty word-vector distances.

RQL-QA for Google Places answers many more questions than Lucene-based QA because the online API has high review coverage. Moreover, its query processor is likely more sophisticated than Lucene’s, for e.g., in handling types like ‘things to do’. RQL-QA has an 20 point higher accuracy with a 17 point higher recall compared to WebQA (manual), because of a more directed and effective query to Google Places API.

Error Type Error (%) Examples
Incorrect answer returned due to incorrect in RQL 16 Bad extractions in RQL results in incorrect answers.
Incorrect answer returned by knowledge source 21 criteria was not fulfilled - eg. Shop allows renting bicycles but not for tours.
Incorrect answer returned due to incomplete RQL 10 not getting extracted
Answer not returned by knowledge source 21 No apparent errors in RQL and the knowledge base would have been expected to be able to answer such a query.
Answer not returned due to RQL errors 16 RQL errors such as bad , , too many or incorrect etc.
Answer not returned due to knowledge source limitations 16 Query requesting places “around” a city, or between two cities, extracted as “day trips”, “cruises” etc. Requests for where queries were about bus services, activities and train stations.
Table 9: Classification of errors made by our answering system (using Google Places web API as knowledge source)

Table 8 shows some questions and the top-ranked answer entity returned by our system. As can be seen our system supports a variety of question intents/entities and due to our choice of an open semantic representation, we are not limited to specific entity types, entity instances, attributes or locations. For example, in the user is looking for “local dinner suggestions” on Christmas eve, and the answer entity returned by our system is to dine at the “St. Peter Stiftskulinarium” in Salzburg, while in the user is looking for recommendations for “SOM tours” (Sound of Music Tours). A quick internet search shows that our system’s answer, ’Bob’s Special Tours’, is famous for their SOM tours in that area. This question also requests for restaurant suggestions in the old town, but since we focus on returning answers for just one , this part of the question is not attempted by our system. Questions with more than one requests are fairly common and this sometimes results in confusion for our system especially if tags relate to different values. Since we do not attempt to disambiguate or link different tags to their corresponding values, this is often a source of error. Our constraint that forces all labels to come from one sentences mitigates this to some extent, but this is can still be a source of errors. is a complicated question with strict location, budget and attribute constraints and the top ranked returned entity “Hotel Pegasus Crown” fulfills the most requirements of the user171717The hotel does not offer a spa and even with manual search we could not find a better answer. is incorrect because the entity returned does not fulfil the location constraints of being close to the “bazar” while returns an incorrect entity type.

Error Analysis: As can be seen in Table 7 our best system attempts approximately 47% of the questions with an acceptable degree of accuracy for the challenging task of answering MSRQs. We conducted a detailed error study on our test set of 105 questions which is summarized in Table 9. We find that approximately 37% of questions were not answered by our system due to limitations of Google Places, i.e. either an answer was not returned for unknown reasons or the question was un-answerable with the data available in the knowledge source. Another 21% of the questions were answered incorrectly by the knowledge source, sometimes due to shallow query translation from RQL, while approximately 42% of the recall loss in the system can be traced to errors in the RQL representation.

7 Conclusion and Future Work

We have presented the novel task of answering entity-seeking multi-sentence recommendation questions in the tourism domain. As a first solution, we proposed a pipelined model consisting of two steps: (a) Question Understanding (b) Question Answering. We proposed an SQL-like query representation for capturing the semantic content of a question. We formulated the task of generating the semantic representation as a sequence labeling task and presented a CRF based model using BiDiLSTM based as well as hand-tuned features, trained in a semi-supervised setting. Our model explicitly makes use of constraints. For answering, we have proposed to construct knowledge source specific queries from our question representation, which are fired over underlying knowledge sources. We have presented an end-to-end evaluation of our system over two answer repositories, showing that our model significantly outperforms the baseline models.

We see our paper as the first attempt towards end to end QA in the challenging setting of multi-sentence questions answered directly on the basis of information in large textual corpora. It opens up several future research directions, which can be broadly divided in two categories. First, we would like to improve on the existing system in the pipelined setting. Error analysis on our test set suggests the need for a deeper IR system that parses constructs from our semantic representation to execute multiple sub-queries. Currently, between 37-58% of recall loss is due to limitations in the knowledge source and query formulation, while a sizeable 42% may be addressed by improvements to question understanding.

As a second direction, we would like to train an end to end neural system to solve our task. This would require generating a large dataset of labeled QA pairs which could perhaps be sourced semi-automatically using data available in tourism QA forums. However, answer posts in forums can often refer to multiple entities and automatically inferring the exact answer entity for the question can be challenging. Further, we would have to devise efficient techniques to deal with hundreds of thousands of potential class labels (entities). Comparing the performance of the pipelined model and the neural model, and examining if one works better than the other in specific settings would also be interesting to look at.

Exploring other question types such as suggestions for itineraries, fact-check or yes-no questions, experimenting in different domains such as consumer electronics, automobiles etc could also be an interesting direction of future work.

8 Acknowledgements

We would like to thank Poojan Mehta who designed and setup the annotation tasks on Amazon Mechanical Turk. We would also like to acknowledge the IBM Research India PhD program that enables the first author to pursue the PhD at IIT Delhi. This work is supported by Google language understanding and knowledge discovery focused research grants, a Bloomberg award, a Microsoft Azure sponsorship, and a Visvesvaraya faculty award by Govt. of India to Mausam. Parag is being supported by the DARPA Explainable Artificial Intelligence (XAI) Program under contract number N66001-17-2-4032. We thank all AMT workers who participated in our tasks.


  • (1)
  • Berant and Liang (2014) J. Berant and P. Liang. 2014. Semantic Parsing via Paraphrasing. In Association for Computational Linguistics (ACL).
  • Bogdanova and Foster (2016) Dasha Bogdanova and Jennifer Foster. 2016. This is how we do it: Answer Reranking for Open-domain How Questions with Paragraph Vectors and Minimal Feature Engineering. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016. 1290–1295.
  • Bordes et al. (2014a) Antoine Bordes, Sumit Chopra, and Jason Weston. 2014a. Question Answering with Subgraph Embeddings. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. 615–620.
  • Bordes et al. (2015) Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. arXiv preprint arXiv:1506.02075 (2015).
  • Bordes et al. (2014b) Antoine Bordes, Jason Weston, and Nicolas Usunier. 2014b. Open Question Answering with Weakly Supervised Embedding Models. In Machine Learning and Knowledge Discovery in Databases - European Conference, ECML PKDD 2014, Nancy, France, September 15-19, 2014. Proceedings, Part I. 165–180.
  • Chang et al. (2007) Mingwei Chang, Lev Ratinov, and Dan Roth. 2007. Guiding semi-supervision with constraint-driven learning. In In Proc. of the Annual Meeting of the ACL.
  • Chen et al. (2017) Danqi Chen, Adam Fisch, Jason Weston, and Antoine Bordes. 2017. Reading Wikipedia to Answer Open-Domain Questions.. In ACL (1), Regina Barzilay and Min-Yen Kan (Eds.). Association for Computational Linguistics, 1870–1879.
  • Chen et al. (2016) Long Chen, Joemon M. Jose, Haitao Yu, Fajie Yuan, and Dell Zhang. 2016. A Semantic Graph Based Topic Model for Question Retrieval in Community Question Answering. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM ’16). ACM, New York, NY, USA, 287–296.
  • Clark et al. (2016) Peter Clark, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Oyvind Tafjord, Peter Turney, and Daniel Khashabi. 2016. Combining Retrieval, Statistics, and Inference to Answer Elementary Science Questions. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI’16). AAAI Press, 2580–2586.
  • Contractor et al. (2016) Danish Contractor, Mausam, and Parag Singla. 2016. Entity-balanced Gaussian pLSA for Automated Comparison. In Proceedings of NAACL-HLT. 69–79.
  • Das et al. (2017) Rajarshi Das, Manzil Zaheer, Siva Reddy, and Andrew McCallum. 2017. Question Answering on Knowledge Bases and Text using Universal Schema and Memory Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 2: Short Papers. 358–365.
  • Fader et al. (2014) Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open Question Answering over Curated and Extracted Knowledge Bases. In Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD ’14). ACM, New York, NY, USA, 1156–1165.
  • Fader et al. (2013) Anthony Fader, Luke S. Zettlemoyer, and Oren Etzioni. 2013. Paraphrase-Driven Learning for Open Question Answering. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, ACL 2013, 4-9 August 2013, Sofia, Bulgaria, Volume 1: Long Papers. 1608–1618.
  • Guo et al. (2017) Shangmin Guo, Kang Liu, Shizhu He, Cao Liu, Jun Zhao, and Zhuoyu Wei. 2017. IJCNLP-2017 Task 5: Multi-choice Question Answering in Examinations. In IJCNLP. 34–40.
  • He et al. (2015) Luheng He, Mike Lewis, and Luke Zettlemoyer. 2015. Question-Answer Driven Semantic Role Labeling: Using Natural Language to Annotate Natural Language. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. 643–653.
  • Huang et al. (2015) Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. CoRR abs/1508.01991 (2015).
  • Iyyer et al. (2014) Mohit Iyyer, Jordan Boyd-Graber, Leonardo Claudino, Richard Socher, and Hal Daumé III. 2014. A Neural Network for Factoid Question Answering over Paragraphs. In Empirical Methods in Natural Language Processing.
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. CoRR abs/1705.03551 (2017). arXiv:1705.03551
  • Khot et al. (2017) Tushar Khot, Ashish Sabharwal, and Peter Clark. 2017. Answering Complex Questions Using Open Information Extraction. CoRR abs/1704.05572 (2017).
  • Kwiatkowski et al. (2013) Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling Semantic Parsers with On-the-Fly Ontology Matching. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. 1545–1556.
  • Lafferty et al. (2001) John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML ’01). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 282–289.
  • Liang et al. (2016) Chao-Chun Liang, Kuang-Yi Hsu, Chien-Tsung Huang, Chung-Min Li, Shen-Yu Miao, and Keh-Yih Su. 2016. A Tag-Based Statistical English Math Word Problem Solver with Understanding, Reasoning and Explanation. In IJCAI. IJCAI/AAAI Press, 4254–4255.
  • Liang (2011) Percy Shuo Liang. 2011. Learning Dependency-Based Compositional Semantics. Ph.D. Dissertation. University of California, Berkeley.
  • Lukovnikov et al. (2017) Denis Lukovnikov, Asja Fischer, Jens Lehmann, and Sören Auer. 2017. Neural Network-based Question Answering over Knowledge Graphs on Word and Character Level. In Proceedings of the 26th International Conference on World Wide Web (WWW ’17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland, 1211–1220.
  • Mikolov et al. (2013) Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient Estimation of Word Representations in Vector Space. CoRR abs/1301.3781 (2013).
  • Mrksic et al. (2016) Nikola Mrksic, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. Counter-fitting Word Vectors to Linguistic Constraints. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016. 142–148.
  • Pazos R. et al. (2013) Rodolfo A. Pazos R., Juan J. González B., Marco A. Aguirre L., José A. Martínez F., and Héctor J. Fraire H. 2013. Natural Language Interfaces to Databases: An Analysis of the State of the Art. Springer Berlin Heidelberg, Berlin, Heidelberg, 463–480.
  • Qiu and Huang (2015) Xipeng Qiu and Xuanjing Huang. 2015. Convolutional Neural Tensor Network Architecture for Community-Based Question Answering.. In IJCAI. 1305–1311.
  • Rajpurkar et al. (2016) Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250 (2016).
  • Romeo et al. (2016) Salvatore Romeo, Giovanni Da San Martino, Alberto Barrón-Cedeno, Alessandro Moschitti, Yonatan Belinkov, Wei-Ning Hsu, Yu Zhang, Mitra Mohtarami, and James Glass. 2016. Neural attention for learning to rank questions in community question answering. In Proceedings of the 26th International Conference on Computational Linguistics, Osaka, Japan.
  • Sachan et al. (2016) Mrinmaya Sachan, Kumar Dubey, and Eric P. Xing. 2016. Science Question Answering using Instructional Materials. In ACL (2). The Association for Computer Linguistics.
  • Saha et al. (2016) Diptikalyan Saha, Avrilia Floratou, Karthik Sankaranarayanan, Umar Farooq Minhas, Ashish R Mittal, and Fatma Özcan. 2016. ATHENA: an ontology-driven system for natural language querying over relational data stores. Proceedings of the VLDB Endowment 9, 12 (2016), 1209–1220.
  • Seo et al. (2015) Min Joon Seo, Hannaneh Hajishirzi, Ali Farhadi, Oren Etzioni, and Clint Malcolm. 2015. Solving Geometry Problems: Combining Text and Diagram Interpretation.. In EMNLP. The Association for Computational Linguistics, 1466–1476.
  • Shen et al. (2015) Yikang Shen, Wenge Rong, Nan Jiang, Baolin Peng, Jie Tang, and Zhang Xiong. 2015. Word Embedding based Correlation Model for Question/Answer Matching. arXiv preprint arXiv:1511.04646 (2015).
  • Singh and Simperl (2016) Priyanka Singh and Elena Simperl. 2016. Using Semantics to Search Answers for Unanswered Questions in Q&A Forums. In Proceedings of the 25th International Conference Companion on World Wide Web. International World Wide Web Conferences Steering Committee, 699–706.
  • Srba and Bielikova (2016) Ivan Srba and Maria Bielikova. 2016. A Comprehensive Survey and Classification of Approaches for Community Question Answering. ACM Trans. Web 10, 3, Article 18 (Aug. 2016), 63 pages.
  • Sun et al. (2015) Huan Sun, Hao Ma, Wen-tau Yih, Chen-Tse Tsai, Jingjing Liu, and Ming-Wei Chang. 2015. Open Domain Question Answering via Semantic Enrichment. In Proceedings of the 24th International Conference on World Wide Web (WWW ’15). ACM, New York, NY, USA, 1045–1055.
  • Tan et al. (2015) Ming Tan, Bing Xiang, and Bowen Zhou. 2015. LSTM-based Deep Learning Models for non-factoid answer selection. CoRR abs/1511.04108 (2015).
  • Trischler et al. (2016) Adam Trischler, Tong Wang, Xingdi Yuan, Justin Harris, Alessandro Sordoni, Philip Bachman, and Kaheer Suleman. 2016. NewsQA: A Machine Comprehension Dataset. arXiv preprint arXiv:1611.09830 (2016).
  • Trivedi et al. (2017) Priyansh Trivedi, Gaurav Maheshwari, Mohnish Dubey, and Jens Lehmann. 2017. A Corpus for Complex Question Answering over Knowledge Graphs. In Proceedings of 16th International Semantic Web Conference - Resources Track (ISWC’2017).
  • Vtyurina and Clarke (2016) Alexandra Vtyurina and Charles LA Clarke. 2016. Complex questions:Let me Google it for you. In Proceedings of the second Web QA Workshop WEBQA 2016.
  • Wang and Nyberg (2015) Di Wang and Eric Nyberg. 2015. CMU OAQA at TREC 2015 LiveQA: Discovering the Right Answer with Clues. In Proceedings of The Twenty-Fourth Text REtrieval Conference, TREC 2015, Gaithersburg, Maryland, USA, November 17-20, 2015.
  • Wang and Nyberg (2016) Di Wang and Eric Nyberg. 2016. MU OAQA at TREC 2016 LiveQA: An Attentional Neural Encoder-Decoder Approach for Answer Rankin. In Proceedings of The Twenty-Fifth Text REtrieval Conference, TREC 2016.
  • Xu et al. (2016) Kun Xu, Siva Reddy, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2016. Question Answering on Freebase via Relation Extraction and Textual Evidence. In Proceedings of the Association for Computational Linguistics (ACL 2016). Association for Computational Linguistics, Berlin, Germany.
  • Yin et al. (2015) Pengcheng Yin, Nan Duan, Ben Kao, Junwei Bao, and Ming Zhou. 2015. Answering Questions with Complex Semantic Constraints on Open Knowledge Bases. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management (CIKM ’15). ACM, New York, NY, USA, 1301–1310.
  • Zettlemoyer (2009) Luke Sean Zettlemoyer. 2009. Learning to map sentences to logical form. Ph.D. Dissertation. Massachusetts Institute of Technology.
  • Zhang et al. (2016) Kai Zhang, Wei Wu, Fang Wang, Ming Zhou, and Zhoujun Li. 2016. Learning Distributed Representations of Data in Community Question Answering for Question Retrieval. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM ’16). ACM, New York, NY, USA, 533–542.
Comments 0
Request Comment
You are adding the first comment!
How to quickly get a good reply:
  • Give credit where it’s due by listing out the positive aspects of a paper before getting into which changes should be made.
  • Be specific in your critique, and provide supporting evidence with appropriate references to substantiate general statements.
  • Your comment should inspire ideas to flow and help the author improves the paper.

The better we are at sharing our knowledge with each other, the faster we move forward.
The feedback must be of minimum 40 characters and the title a minimum of 5 characters
Add comment
Loading ...
This is a comment super asjknd jkasnjk adsnkj
The feedback must be of minumum 40 characters
The feedback must be of minumum 40 characters

You are asking your first question!
How to quickly get a good answer:
  • Keep your question short and to the point
  • Check for grammar or spelling errors.
  • Phrase it like a question
Test description