Look before you Hop: Conversational Question Answering over Knowledge Graphs Using Judicious Context Expansion

Philipp Christmann, MPI for Informatics, Germany, pchristm@mmci.uni-saarland.de
Rishiraj Saha Roy, MPI for Informatics, Germany, rishiraj@mpi-inf.mpg.de
Abdalghani Abujabal, Amazon Alexa, Germany, abujabaa@amazon.de
Jyotsna Singh, MPI for Informatics, Germany, jsingh@mpi-inf.mpg.de
Gerhard Weikum, MPI for Informatics, Germany, weikum@mpi-inf.mpg.de

Fact-centric information needs are rarely one-shot; users typically ask follow-up questions to explore a topic. In such a conversational setting, the user’s inputs are often incomplete, with entities or predicates left out, and ungrammatical phrases. This poses a huge challenge to question answering (QA) systems that typically rely on cues in full-fledged interrogative sentences. As a solution, we develop Convex: an unsupervised method that can answer incomplete questions over a knowledge graph (KG) by maintaining conversation context using entities and predicates seen so far and automatically inferring missing or ambiguous pieces for follow-up questions. The core of our method is a graph exploration algorithm that judiciously expands a frontier to find candidate answers for the current question. To evaluate Convex, we release ConvQuestions, a crowdsourced benchmark with distinct conversations from five different domains. We show that Convex: (i) adds conversational support to any stand-alone QA system, and (ii) outperforms state-of-the-art baselines and question completion strategies.

CCS Concepts: Information systems → Question answering

1. Introduction

1.1. Motivation

Obtaining direct answers to fact-centric questions is supported by large knowledge graphs (KGs) such as Wikidata or industrial KGs (at Google, Microsoft, Baidu, Amazon, etc.), consisting of semantically organized entities, attributes, and relations in the form of subject-predicate-object (SPO) triples. This task of question answering over KGs (KG-QA) has been intensively researched (berant2013semantic; bast2015more; unger2012template; yahya2013robust; abujabal2018never; diefenbach2019qanswer; tanon2018demoing; huang2019knowledge). However, users’ information needs are not always expressed in well-formed and self-contained questions for one-shot processing. Quite often, users issue a series of follow-up questions to explore a topic (saha2018complex; guo2018dialog), analogous to search sessions (ren2018conversational). A major challenge in such conversational QA settings is that follow-up questions are often incomplete, with entities or predicates not spelled out, and phrased ungrammatically. So a large part of the context is left unspecified, on the assumption that the system implicitly understands the user’s intent from previous interactions. Consider the following conversation as a running example, in which a user asks questions (or utterances) and the system has to generate answers:

Q: Which actor voiced the Unicorn in The Last Unicorn?
A: Mia Farrow
Q: And Alan Arkin was behind ?
A: Schmendrick
Q: Who did the score?
A: Jimmy Webb
Q: So who performed the songs?
A: America
Q: Genre of this band’s music?
A: Folk rock, Soft rock
Q: By the way, who was the director?
A: Jules Bass

Such conversations are characterized by a well-formed and complete initial question with incomplete follow-ups, an initial and often central entity of interest (“The Last Unicorn”), slight shifts in focus (the inquiry about the genre of the band America), informal styles, and a running context composed of entities and predicates in all preceding questions and answers (not just the immediately preceding ones).

Limitations of state-of-the-art KG-QA. State-of-the-art systems (huang2019knowledge; abujabal2018never; diefenbach2019qanswer; luo2018knowledge; tanon2018demoing) expect well-formed input questions (like the initial question above), complete with cue words for entities (“Unicorn”), predicates (“voiced”), and types (“actor”), and map them to corresponding KG-items. A SPARQL query (or an equivalent logical expression) is generated to retrieve answers. For example, a Wikidata query for the initial question would be: SELECT ?x WHERE {TheLastUnicorn voiceActor ?x . ?x characterRole TheUnicorn}. In our conversational setup, such methods break down because follow-up questions are incomplete and phrased in ad-hoc ways.

The alternative approach of question completion (kumar2017incomplete) aims to create syntactically correct full-fledged interrogative sentences from the user’s inputs, closing the gaps by learning from supervision pairs, while being agnostic to the underlying KG. However, this paradigm is bound to be limited and would fail for ad-hoc styles of user inputs or when training data is too sparse.

1.2. Approach and Contributions

Our proposed approach, Convex (CONVersational KG-QA with context EXpansion) overcomes these limitations, based on the following key ideas. The initial question is used to identify a small subgraph of the KG for retrieving answers, similar to what prior methods for unsupervised KG-QA use (diefenbach2019qanswer). For incomplete and ungrammatical follow-up questions, we capture context in the form of a subgraph as well, and we dynamically maintain it as the conversation proceeds. This way, relevant entities and predicates from previous turns are kept in the gradually expanding context. However, we need to be careful about growing the subgraph too much as the conversation branches and broadens in scope. As nodes in a KG have many 1-hop neighbors and a huge number of 2-hop neighbors, there is a high risk of combinatorial explosion, and a huge subgraph would no longer focus on the topic of interest. Convex copes with this critical issue by judiciously expanding the context subgraph, using a combination of look-ahead, weighting, and pruning techniques. Hence the “look before you hop” in the paper title.

Specifically, Convex works as follows. Answers to the first question are obtained by any standard KG-QA system (we use the state-of-the-art system QAnswer (diefenbach2019qanswer) and other variants in our experiments over Wikidata). Entities in the initial question , the answer , and their connections initialize a context subgraph ( for turn ) for the conversation in the KG. When a follow-up question arrives, all nodes (entities, predicates, or types) in the KG-neighborhood of are deemed as candidates that will be used to expand the current graph. Brute force addition of all neighbors to will quickly lead to an explosion in its size after a few turns (hugely exacerbated if popular entities are added, e.g. Germany and Barcelona have and neighbor entities in Wikidata). Thus, we opt for prudent expansion as follows. Each neighbor is scored based on its similarity to the question, its distance to important nodes in , the conversation turn , and KG priors. This information is stored in in respective sorted lists with these neighbors as elements.

A small number of top-scoring neighbors of in a turn, termed “frontier nodes” (), are identified by aggregating information across these queues. Next, all KG triples (SPO facts) for these frontiers only, are added to , to produce an expanded context . These are the most relevant nodes w.r.t the current question , and hence are expected to contain the answer in their close proximity. Each entity in is thus scored by its distance to each frontier node and other important nodes in , and the top-ranked entity (possibly multiple, in case of ties) is returned as . This process is then iterated for each turn in the conversation with question , producing , , , and ultimately at each step.

Benchmark. We compiled the first realistic benchmark, termed ConvQuestions, for conversational KG-QA. It contains about conversations which can be evaluated over Wikidata. They are compiled from the inputs of crowdworkers on Amazon Mechanical Turk, with conversations from five domains: “Books”, “Movies”, “Soccer”, “Music”, and “TV Series”. The questions feature a variety of complex question phenomena like comparisons, aggregations, compositionality, and temporal reasoning. Answers are grounded in Wikidata entities to enable fair comparison across diverse methods.

Contributions. The main contributions of this work are:

  • We devise Convex, an unsupervised method for addressing conversational question answering over knowledge graphs.

  • We release ConvQuestions, the first realistic benchmark to evaluate conversational KG-QA.

  • We present extensive experiments, showing how Convex adds conversational support to any stand-alone system.

  • An online demo and all code, data, and results are available at http://qa.mpi-inf.mpg.de/convex/.

2. Concepts and Notation

Figure 1. A typical conversation illustrating perfect (but simplified) context expansion and answering at every turn.

We first introduce the concepts, and the corresponding notation, needed to explain the Convex method. An example workflow instantiating these concepts is shown in Fig. 1, and Table 1 provides a ready reference.

Notation Concept
Knowledge graph, entity, predicate, class, literal
Subject, predicate, object
Nodes and edges in graph
Conversation, turn
Question and answer at turn
Initial and expanded context graphs at turn
-hop neighborhood of nodes in
Frontier nodes at turn
Entities mapped to by words in
Table 1. Notation for key concepts in Convex.

Knowledge graph. A knowledge graph, or a knowledge base, is a set of subject-predicate-object (SPO) RDF triples, each representing a real-world fact, where the subject is an entity (like The Last Unicorn), the predicate is a relation (like director), and the object is another entity, a class (like animated feature film), or a literal (like 19 November 1982). All entities, predicates, classes, and literals in the KG are canonicalized. Most modern KGs support n-ary facts like movie-cast information (with more than two entities and more than one predicate) via reification with intermediate nodes (suchanek2007yago). In Wikidata, such information is represented via optional qualifiers with the main fact (TheLastUnicorn voiceActor MiaFarrow . characterRole TheUnicorn). Compound Value Types (CVTs) were the Freebase analogue. Tapping into qualifier information is a challenge for SPARQL queries, but such information is easily accessible in a graph-based method like Convex.

Convex stores the KG as a graph, with a set of nodes and a set of edges, instead of a database-like RDF triple store. Each entity, predicate, class, and literal is assigned a unique node in the graph, and two nodes share an edge if they occupy adjacent positions (subject and predicate, or predicate and object) in some triple. While it is more standard practice to treat each predicate as an edge label, we represent every KG item as a node, because this facilitates computing standard graph measures downstream. Examples of nodes and edges are shown in the (sub-)graphs in Fig. 1. Entity and predicate nodes are in rectangles with sharp and rounded corners, respectively. For simplicity, class and literal nodes are not shown. An important thing to note is that each instance of a predicate retains an individual existence in the graph to prevent false inferences (e.g. two voice actor nodes in the figure). As a simple example, if we merged the married node from two triples about two different couples, we may accidentally infer during answering that a person from the first couple is married to a person from the second.
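The storage choice described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `build_graph` and the `predicate#index` naming of predicate-instance nodes are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_graph(triples):
    """Build an undirected adjacency map with one predicate node per triple."""
    adj = defaultdict(set)
    for i, (subj, pred, obj) in enumerate(triples):
        pred_node = f"{pred}#{i}"  # hypothetical naming: a separate node per predicate instance
        adj[subj].add(pred_node)
        adj[pred_node].add(subj)
        adj[pred_node].add(obj)
        adj[obj].add(pred_node)
    return adj

# Two unrelated couples: keeping the two "married" instances apart
# means no path links A to D through a shared predicate node.
triples = [("A", "married", "B"), ("C", "married", "D")]
graph = build_graph(triples)
```

Because each `married` instance is its own node, `A` and `D` stay disconnected, which is the false inference the merged representation would allow.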

Conversation. A conversation with turns is made up of a sequence of questions and corresponding answers , where , such that . Fig. 1 (left side) shows a typical that Convex handles, with (six turns). Usually, is well-formed, and all other are ad hoc.

Question. Each is a sequence of words , such that , where is the number of words in . During answering, each word in potentially maps to one or more items in (). However, since conversations revolve around entities of interest, we fixate on the mapped entities, and refer to them as . E.g., “Alan” in , and “score” in ; so , and .

Answer. Each answer to question is a (possibly multiple, single, or null-valued) set of entities or literals in , i.e. (questions asking for predicates or classes are usually not realistic). Each is shaded in light blue in Fig. 1.

Context subgraph. In the Convex model, every turn in is associated with a context , that is a subgraph grounded or anchored in a localized zone in . Each subgraph consists of: (i) the previous question entities in , , (ii) previous answer entities in : , (iii) intermediate nodes and edges connecting the above in . All nodes corresponding to turns , are shaded in light green.

Frontier nodes. At every turn , nodes in the -hop neighborhood of , , define something like a border to which we may need to expand for answering the next (current nodes in are subsumed in ). The number of hops is small in practice, owing to the fact that typical users do not suddenly make large topic jumps during a specific . Even then, since expanding to include every results in an exponential growth rate for its size that we wish to avoid, we first select the best (top-) nodes in . These optimal expansion points in are referred to as frontier nodes, a ranked set , and are the most relevant nodes with respect to the current question and the current context , as ranked by some frontier score (defined later). This entails that only those triples (along with qualifiers) (analogously, the resultant nodes and edges) that connect these to the are added to the context. The top- frontier node at every is shown in orange in the figure (multiple in case of ties).

Expanded context. Once all triples in corresponding to frontier nodes are added to , we obtain an expanded context graph . All nodes in are candidate answers , that are scored appropriately. Fig. 1 shows expanded contexts for every in our example conversation. Corresponding can be visualized by removing facts with the orange frontiers. Notably, .

3. The Convex algorithm

We now describe the Convex conversation handler method, that can be envisaged as a seamless plug-in enabling stand-alone KG-QA systems to answer incomplete follow up questions with possibly ungrammatical and informal formulations. Convex thus requires an underlying KG, a standard QA system that can answer well-formulated questions, and the conversational utterances as input. On receiving an input question at a given turn, our method proceeds in two stages: (i) expand the context graph, and (ii) rank the answer candidates in the expanded graph. We discuss these steps next.

3.1. Context expansion

The initial question is answered by the KG-QA system that Convex augments, which produces one or more answers. Since entities in the original question are of prime importance in a conversation, we use any off-the-shelf named entity recognition and disambiguation (NERD) system like TagMe (ferragina2010tagme) or AIDA (hoffart2011robust) to identify them. These question entities, the answers, and the KG connections between them initialize the context subgraph.

Now, when the first follow-up question arrives, we need to look for its answer(s) in the vicinity of this context. The main premise of this work is not to treat every node in the neighborhood of the initial context, or more generally of the context at any later turn, as an answer candidate. This is because, over turns, expanding the context, by any means, is inevitable: users can freely drift away and revisit the initial entities of interest over the full course of the conversation. Under this postulate, the total number of such context nodes can easily reach the order of millions, aggravated by the presence of popular entities, especially countries (UK, Russia) or cities (Munich, Barcelona), in the KG around prominent entities of discussion (Harry Potter, Cristiano Ronaldo).

The logical course of action, then, is to perform this expansion in a somewhat austere fashion, which we propose to do as follows. We wish to identify some key nodes in the -hop neighborhood of , that will prove the most worthwhile if included into (along with their connections to ) w.r.t. answering . We call these optimal expansion points frontier nodes. From here on, we outline frontier identification at a general conversation turn , where . Frontiers are marked by three signals: (i) relevance to the words in ; (ii) relevance to the current context ; and (iii) KG priors. We now explain these individual factors.

Relevance to question. The question words provide a direct clue to the relevant nodes in the neighborhood. However, there is often a vocabulary mismatch between what users specify in their questions and the KG terminology, as typical users are unaware of the KG schema. For example, let us consider “Who did the score?”. This indicates that the sought information is about the score of the movie, but unfortunately the KG does not use this term. So, we define the matching similarity score of a neighbor with a question word using the cosine similarity between word2vec (mikolov2013distributed) embeddings of the node label and the word. Stopwords like and, of, to, etc. are excluded from this similarity. For multiword phrases, we use an averaging of the word vectors (wieting2016towards). The cosine similarity is originally in [-1, 1]; it is scaled to [0, 1] using min-max normalization for comparability with the later measures. So we have:


We then take the maximum of these word-wise scores to define the matching score of a candidate frontier to the question as a whole:
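A minimal sketch of this matching score follows, under stated assumptions: the toy two-dimensional vectors stand in for word2vec embeddings, the stopword list is illustrative, and the rescaling assumes the theoretical cosine bounds of -1 and 1. All function names are hypothetical.

```python
import math

STOPWORDS = {"and", "of", "to", "the", "who", "did"}  # illustrative list only

def phrase_vector(phrase, vectors):
    """Average the vectors of the known words in a phrase (None if none is known)."""
    known = [vectors[w] for w in phrase.lower().split() if w in vectors]
    if not known:
        return None
    return [sum(dim) / len(known) for dim in zip(*known)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def match_word(label, word, vectors):
    """Cosine similarity rescaled from [-1, 1] to [0, 1]."""
    u, v = phrase_vector(label, vectors), phrase_vector(word, vectors)
    if u is None or v is None:
        return 0.0
    return (cosine(u, v) + 1.0) / 2.0

def match_question(label, question, vectors):
    """Maximum word-wise score over the non-stopword question words."""
    words = [w.strip("?.,!").lower() for w in question.split()]
    scores = [match_word(label, w, vectors) for w in words if w and w not in STOPWORDS]
    return max(scores, default=0.0)

# Toy vectors, invented for illustration (not real word2vec embeddings).
toy_vectors = {"score": [1.0, 0.0], "composer": [0.8, 0.6], "director": [0.0, 1.0]}
composer_match = match_question("composer", "Who did the score?", toy_vectors)
director_match = match_question("director", "Who did the score?", toy_vectors)
```

With these toy vectors, the node labeled `composer` scores higher than `director` for the question word “score”, mirroring the vocabulary-mismatch example in the text.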


Relevance to context. Nevertheless, such soft lexical matching with embeddings is hardly enough. Let us now consider the word “genre” in Genre of this band’s music?. Looking at the toy example in Fig. 2, we see that even with an exact match, there are five genre-s lurking in the vicinity at (there are several more in reality), where the one connected to America is the intended fit.

Figure 2. An illustration of the ambiguity in frontier node selection for a specific question word (“genre” in Genre of this band’s music?), and how effective scoring can potentially pick the best candidate in a noisy context graph .

We thus define the relevance of a node to the current context as the total graph distance (in number of hops in ) of to the nodes in . Note that we are interested in the relevance score being directly proportional to this quantity, and hence consider proximity, the reciprocal of distance (), as the measure instead. For the aggregation over nodes in , we prefer over as the latter is more sensitive to outliers.

Next, not all nodes in the context are valuable for the answering process. For anchoring a conversation in a KG, entities that have appeared in questions or as answers in earlier turns are what specifically matter. Thus, it suffices to consider only these question and answer entities for computing the above proximities. We encode this factor using an indicator function that equals one if a context node is such a question or answer entity, and zero otherwise.

Contributions of such Q/A nodes should be weighted according to the turn in which they appeared in their respective roles. This is when such nodes had the “spotlight”, in a sense; so recent turns have higher weights than older ones. In addition, since the entity in the first question may always be important as the theme of the conversation (The Last Unicorn), its turn weight is set to the maximum value instead of zero. We thus define the context proximity score for a neighbor, normalized by the number of Q/A nodes in the context, as:
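The following sketch shows one way such a turn-weighted proximity could be computed. It rests on assumptions not fixed by the text: turn weights are divided by the current turn to keep them in [0, 1], the first-question entity is stored with turn 0, and the toy graph and all names are hypothetical.

```python
from collections import deque

def hop_distances(adj, source):
    """Breadth-first search: hops from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adj.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def context_proximity(adj, candidate, qa_turns, current_turn):
    """Turn-weighted mean of 1/distance from candidate to the Q/A nodes.

    qa_turns maps each question/answer entity to the turn in which it appeared;
    turn 0 (the first question) is given the maximum weight (assumed scheme).
    """
    if not qa_turns:
        return 0.0
    dist = hop_distances(adj, candidate)
    max_turn = max(current_turn, 1)
    total = 0.0
    for node, turn in qa_turns.items():
        weight = (turn if turn > 0 else max_turn) / max_turn
        d = dist.get(node)
        if d:  # skip unreachable nodes and the candidate itself
            total += weight / d
    return total / len(qa_turns)

# Toy context: movie -p1- actor and movie -p2- genre (predicates as nodes).
adj = {
    "movie": {"p1", "p2"}, "p1": {"movie", "actor"}, "actor": {"p1"},
    "p2": {"movie", "genre"}, "genre": {"p2"},
}
qa_turns = {"movie": 0, "actor": 1}
genre_proximity = context_proximity(adj, "genre", qa_turns, current_turn=2)
```

Only the Q/A nodes contribute, which plays the role of the indicator function; intermediate nodes such as `p1` and `p2` affect distances but carry no weight themselves.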


KG priors. Finally, KG nodes have inherent salience (or prominence) values that reflect their likelihoods of being queried about in users’ questions. For example, Harry Potter has higher salience than an obscure book like Harry Black, and the same can be said of the author predicate compared to has edition for books. Ideally, such salience should be quantified using large-scale query logs from real users that commercial Web search engines possess. In the absence of such resources, we use a more intrinsic proxy for salience: the frequency of the concept in the KG. The raw frequency is normalized by the corresponding maximum value for entities, predicates, classes, and literals. Thus, we have the KG prior for a node as:
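As a small illustration of the per-type normalization, here is a sketch with invented frequency counts (they are not Wikidata statistics) and hypothetical names:

```python
def kg_priors(frequency, node_type):
    """Normalize raw KG frequencies by the maximum frequency within each node type."""
    max_by_type = {}
    for node, count in frequency.items():
        kind = node_type[node]
        max_by_type[kind] = max(max_by_type.get(kind, 0), count)
    return {
        node: (count / max_by_type[node_type[node]] if max_by_type[node_type[node]] else 0.0)
        for node, count in frequency.items()
    }

# Invented counts, for illustration only.
frequency = {"Harry Potter": 400, "Harry Black": 4, "author": 1000, "has edition": 50}
node_type = {"Harry Potter": "entity", "Harry Black": "entity",
             "author": "predicate", "has edition": "predicate"}
priors = kg_priors(frequency, node_type)
```

Normalizing within each type keeps a very frequent predicate from dwarfing every entity, so priors remain comparable across node types.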


Aggregation using Fagin’s Threshold Algorithm. We now have three independent signals from the question, context, and the KG regarding the likelihood of a node being a frontier at a given turn. We use Fagin’s Threshold Algorithm (FTA) (fagin2003optimal) to aggregate items in the three sorted lists that are created when candidates are scored by each of these signals. FTA is chosen as it is an optimal algorithm with correctness and performance guarantees for rank aggregation. In FTA, we perform sorted access in parallel to each of the three sorted lists. As each candidate frontier node is seen under sorted access, we retrieve each of its individual scores by random access. We then compute a frontier score for the node as a simple linear combination of the outputs from Eqs. 2, 3, and 4, using three frontier hyperparameters that sum to one:


In general, FTA requires a monotonic score aggregation function: the aggregate score of a node must not decrease when any of its component scores (the terms of Eq. 5, one per sorted list) increases. As nodes are accessed from the lists, we remember a node if it is one of the top-scoring nodes so far, assuming a buffer of bounded size. For each list, we track the score of the last node seen under sorted access, and define the threshold value as the aggregate of these last-seen scores. As soon as the desired number of nodes have been seen whose frontier score is at least the threshold, we stop and return them as the final frontiers.
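The threshold procedure can be sketched as follows. This is a generic illustration of the threshold algorithm with a weighted-sum aggregate, not the authors' code; it assumes that every list scores the same set of candidates, and the toy scores and weights are invented.

```python
import heapq

def threshold_algorithm(lists, weights, k):
    """Return the k best (score, node) pairs under a weighted-sum aggregate.

    lists:   one dict per signal, each mapping every candidate node to a score
             (assumption: all lists score the same set of nodes).
    weights: non-negative weights, one per list, so the aggregate is monotonic.
    """
    if not lists or k <= 0:
        return []

    def aggregate(scores):
        return sum(w * s for w, s in zip(weights, scores))

    # Sorted access: each list ordered by descending score.
    ranked = [sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
              for scores in lists]
    seen = {}  # node -> aggregate score, filled in via random access
    top = []
    for depth in range(len(ranked[0])):
        last_seen = []
        for column in ranked:
            node, score = column[depth]
            last_seen.append(score)
            if node not in seen:
                # Random access: look up this node's score in every list.
                seen[node] = aggregate([scores[node] for scores in lists])
        threshold = aggregate(last_seen)
        top = heapq.nlargest(k, ((s, n) for n, s in seen.items()))
        if len(top) == k and top[-1][0] >= threshold:
            break  # no unseen node can beat the current top-k
    return top

# Invented scores for three candidate frontier nodes.
match = {"a": 0.9, "b": 0.5, "c": 0.1}
prox = {"a": 0.2, "b": 0.8, "c": 0.3}
prior = {"a": 0.4, "b": 0.6, "c": 0.9}
frontiers = threshold_algorithm([match, prox, prior], weights=(0.6, 0.3, 0.1), k=2)
```

In this toy run the loop stops at the second sorted-access round, since by then two nodes already score at least as high as the threshold.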

Thus, at the end of this step, we have a set of frontier nodes for the current turn. If any of these frontiers are entities, they are used to populate the set of entities mapped to by the current question. We add the triples (along with qualifiers, if any) that connect the frontiers to the current context to produce the expanded graph (the expand step in Algorithm 1).

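initialize the context from the first question, its entities, and its answer(s);
while turns remain in the conversation do
       for each node in the neighborhood of the current context do
             compute its matching score [Eq. 2];
             compute its context proximity score [Eq. 3];
             compute its KG prior [Eq. 4];
             insert the scores into the three sorted lists;
       end for
       find the frontier nodes = Fagin’s-Threshold-Algorithm(sorted lists) [Eq. 5];
       assign the entities among the frontiers to the current question;
       expand the context with the KG triples of the frontiers;
       for each candidate answer in the expanded context do
             compute its answer score [Eq. 6];
       end for
       find the top-scoring candidate(s) as the answer for this turn;
end while
Algorithm 1 Convex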

3.2. Answer ranking

Our task now is to look for answers in the expanded context. Since frontiers are the most relevant nodes in it with respect to the current question, the answer is expected to appear in their close proximity.

However, labels of frontier nodes only reflect what was explicit in the question; the unspecified or implicit part of the context usually refers to a previous question or answer entity. Thus, we should consider closeness to these context entities as well. Note that just as before, frontiers and Q/A entities both come with corresponding weights: frontier scores and turn ids, respectively. Thus, while considering proximities is key here, using weighted versions is a more informed choice. We thus score every candidate node in the expanded context by its weighted proximity, using Eqs. 3 and 5, as follows (again, we invert distance to use a measure directly proportional to the candidacy of an answer node):


Contributions by proximities to frontier and Q/A nodes (each normalized appropriately) are again combined linearly with two answer hyperparameters that sum to one. Thus, the final answer score also lies in [0, 1]. Finally, the top-scoring node (possibly multiple nodes, in case of ties) is returned as the answer to the current question.
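A hedged sketch of such an answer score is given below. The hop distances are assumed to be precomputed, the hyperparameter values of 0.5 each are placeholders rather than tuned values, the distances are invented for illustration, and the function name is hypothetical.

```python
def answer_score(dist_to, frontier_scores, qa_weights, h1=0.5, h2=0.5):
    """Combine weighted proximities to frontier nodes and to Q/A nodes.

    dist_to:         hop distance from the candidate to each relevant node.
    frontier_scores: frontier node -> frontier score in [0, 1].
    qa_weights:      Q/A entity -> turn weight in [0, 1].
    h1, h2:          placeholder hyperparameters that sum to one.
    """
    def weighted_proximity(weights):
        if not weights:
            return 0.0
        total = sum(w / dist_to[n] for n, w in weights.items() if dist_to.get(n))
        return total / len(weights)

    return h1 * weighted_proximity(frontier_scores) + h2 * weighted_proximity(qa_weights)

# Invented distances for two candidates in the running example.
frontier_scores = {"director#1": 0.8}
qa_weights = {"The Last Unicorn": 1.0}
jules_bass = answer_score({"director#1": 1, "The Last Unicorn": 2},
                          frontier_scores, qa_weights)
mia_farrow = answer_score({"director#1": 3, "The Last Unicorn": 2},
                          frontier_scores, qa_weights)
```

Both candidates are equally close to the movie, so the frontier term decides: the candidate adjacent to the `director` frontier ranks first.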

The Convex method is outlined in Algorithm 1. As mentioned before, the initial answer and the initial question entities are obtained by passing the first question through a stand-alone KG-QA system and a NERD algorithm, respectively. The triple-lookup function used in the expansion step returns all KG triples that contain its arguments, and the generalized entity-extraction function returns the set of entities among its arguments. Note that this algorithm illustrates the workings of Convex in a static setting where all questions are given upfront; in a real setting, each question is issued interactively with a user in the loop.

4. The ConvQuestions Benchmark

Attribute Value
Title Generate question-answer conversations on popular entities
Description Generate conversations in different domains (books, movies, music, soccer, and TV series) on popular entities of your choice. You need to ask natural questions and provide the corresponding answers via Web search.
Total participants
Time allotted per HIT hours
Time taken per HIT hours
Payment per HIT Euros
Table 2. Basic details of the AMT HIT (five conversations).
Turn | Books | Movies | Soccer | Music | TV series
Q | When was the first book of the book series The Dwarves published? | Who played the joker in The Dark Knight? | Which European team did Diego Costa represent in the year 2018? | Led Zeppelin had how many band members? | Who is the actor of James Gordon in Gotham?
A | 2003 | Heath Ledger | Atlético Madrid | 4 | Ben McKenzie
Q | What is the name of the second book? | When did he die? | Did they win the Super Cup the previous year? | Which was released first: Houses of the Holy or Physical Graffiti? | What about Bullock?
A | The War of the Dwarves | 22 January 2008 | No | Houses of the Holy | Donal Logue
Q | Who is the author? | Batman actor? | Which club was the winner? | Is the rain song and immigrant song there? | Creator?
A | Markus Heitz | Christian Bale | Real Madrid C.F. | No | Bruno Heller
Q | In which city was he born? | Director? | Which English club did Costa play for before returning to Atlético Madrid? | Who wrote those songs? | Married to in 2017?
A | Homburg | Christopher Nolan | Chelsea F.C. | Jimmy Page | Miranda Cowley
Q | When was he born? | Sequel name? | Which stadium is this club’s home ground? | Name of his previous band? | Wedding date first wife?
A | 10 October 1971 | The Dark Knight Rises | Stamford Bridge | The Yardbirds | 19 June 1993
Table 3. Representative conversations in ConvQuestions from each domain, highlighting the stiff challenges they pose.

4.1. Benchmark creation

Limitations of current choices. Popular benchmarks for KG-QA like WebQuestions (berant2013semantic), SimpleQuestions (bordes2015large), WikiMovies (miller2016key), ComplexWebQuestions (talmor2018web), and ComQA (abujabal19comqa), are all designed for one-shot answering with well-formulated questions. The CSQA dataset (saha2018complex) takes preliminary steps towards the sequential KG-QA paradigm, but it is extremely artificial: initial and follow-up questions are generated semi-automatically via templates, and sequential utterances are only simulated by stitching questions with shared entities or relations in a thread, without a logical flow. QBLink (elgohary2018dataset), CoQA (reddy2018coqa), and ShARC (saeidi2018interpretation) are recent resources for sequential QA over text. The SQA resource (iyyer2017search), derived from WikiTableQuestions (liang2015compositional), is aimed at driving conversational QA over (relatively small) Web tables.

Conceptual challenges. In light of such limitations, we overcome several conceptual challenges to build the first realistic benchmark for conversational KG-QA, anchored in Wikidata. The key questions included, among others: Should we choose initial questions from existing benchmarks and ask humans to create only follow-ups? Should the answers already come from some KG-QA system, with users creating follow-ups after observing them? Should we allocate templates to crowdworkers to systematically generate questions that miss entities, predicates, or types? Can we interleave questions by different workers to create a large number of conversations? Can we permute the order of follow-ups to generate an even larger volume? If a question has multiple correct answers, and the next question in the benchmark involves a different answer than what the system returns at run-time, how can we evaluate such a dynamic workflow? How can we build a KG-QA resource that is faithful to the setup but is not overly limited to the information the KG contains today?

Creating ConvQuestions. With insights from a meticulous in-house pilot study with ten students over two weeks, we posed the conversation generation task on Amazon Mechanical Turk (AMT) in the most natural setup: Each crowdworker was asked to build a conversation by asking five sequential questions starting from any seed entity of their choice, as this is an intuitive mental model that humans may have when satisfying their real information needs via their search assistants. A system-in-the-loop is hardly ideal: it makes comparison across methods challenging, is limited by the shortcomings of the chosen system, and most crucially, there exist no such systems today with satisfactory performance. In a single AMT Human Intelligence Task (HIT), Turkers had to create one conversation each from five domains: “Books”, “Movies”, “Soccer”, “Music”, and “TV Series” (other potential choices were “Politics”, but we found that it quickly becomes subjective, and “Finance”, but that is best handled by relational databases and not curated KGs). Each conversation was to have five turns, including the initial question. To keep conversations as natural as possible, we neither interleaved questions from multiple Turkers, nor permuted orders of questions within a conversation. For quality, only AMT Master Workers (who have consistently high performances: see https://www.mturk.com/help#what_are_masters), were allowed to participate. We registered participants, and this resulted in initial conversations, from each domain.

Along with questions, Turkers were asked to provide textual surface forms and Wikidata links of the seed entities and the answers (via Web search), along with paraphrases of each question. The paraphrases provided us with two versions of the same question, and hence a means of augmenting the core data with several interesting variations that can simultaneously boost and test the robustness of KG-QA systems (dong2017learning). Since paraphrases of questions (at any turn) are always semantically equivalent and interchangeable, each conversation with five turns thus resulted in distinct conversations (note that this does not entail shuffling sequences of the utterances). Thereby, in total, we obtained such conversations, that we release with this paper.
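The augmentation can be sketched as below, under the reading that each of the five turns has exactly two interchangeable versions (the original and its paraphrase), which yields 2^5 = 32 order-preserving variants per conversation. The function name and the placeholder question strings are hypothetical.

```python
from itertools import product

def expand_conversation(turns):
    """turns: list of (original, paraphrase) question pairs, in conversation order.

    Returns every variant that picks one version per turn, keeping turn order.
    """
    return [list(variant) for variant in product(*turns)]

# Placeholder questions standing in for a five-turn conversation.
conversation = [(f"q{i}", f"q{i}-paraphrase") for i in range(5)]
variants = expand_conversation(conversation)
```

Each variant keeps the original turn sequence and only swaps wordings, so no utterances are shuffled.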

If the answers were dates or literals like measurable quantities with units, Turkers were asked to follow the Wikidata formats for the same. They were provided with minimal syntactic guidelines to remain natural in their questions. They were shown judiciously selected examples so as not to ask opinionated questions (like best film by this actor?), or other non-factoid questions (causal, procedural, etc.). The authors invested substantial manual effort for quality control and spam prevention, by verifying both answers of random utterances, and alignments between provided texts and Wikidata URLs. Each question was allowed to have at most three answers, but single-answer questions were encouraged to preclude the possibility of non-deterministic workflows during evaluation.

To allow for ConvQuestions being relevant for a few years into the future, we encouraged users to ask complex questions involving joins, comparisons, aggregations, temporal information needs, and so on. Given the complexity arising from incomplete cues, these additional facets pose an even greater challenge for future KG-QA systems. So as not to restrict questions to only those predicates that are present in Wikidata today, relations connecting question and answer entities are sometimes missing in the KG but can be located in sources like Wikipedia, allowing scope for both future growth of the KG, and experimentation with text plus KG combinations.

4.2. Benchmark analysis

Basic details of our AMT HIT are provided in Table 2 for reference. Question entities and expected answers had a balanced distribution among human (actors, authors, artists) and non-human types (books, movies, stadiums). Detailed distributions are omitted due to lack of space. Illustrative examples of challenging questions from ConvQuestions are in Table 3. We see manifestations of: incomplete cues (TV Series; ), ordinal questions (Books; ), comparatives (Music; ), indirections (Soccer; ), anaphora (Music; ), existentials (Soccer; ), temporal reasoning (Soccer; ), among other challenges. The average lengths of the first and follow-up questions were and words, respectively. Finally, we present the key quantifier for the difficulty in our benchmark: the average KG distance of answers from the original seed entity is , while the highest goes up to as high as five KG hops. Thus, an approach that remains fixated on a specific entity is doomed to fail: context expansion is the key to success on ConvQuestions.

5. Experimental Setup

5.1. Baselines and Metrics

Stand-alone systems. We use the state-of-the-art system QAnswer (diefenbach2019qanswer) and Platypus (tanon2018demoing) as our stand-alone KG-QA systems; they serve as baselines, and we enhance each of them with Convex. At the time of writing (May 2019), these are the only two systems that have running prototypes over Wikidata.

To make Convex a self-sufficient system, we also implement a naïve version of answering the first question as follows. Entities are detected in using the TagMe NERD system (ferragina2010tagme), and mapped to their Wikidata IDs via Wikipedia links provided by TagMe. Embeddings are obtained by averaging word2vec vectors of the non-entity words in , and their cosine similarities are computed against each of the predicates around the detected entities . Finally, the best pair is found (as a joint disambiguation), and the returned answer is the subject or object of the triple, depending on whether the triple structure is or .
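The predicate-matching step of this naive strategy can be sketched as follows. This is only an illustration under stated assumptions: the tiny two-dimensional `embeddings` dictionary stands in for real word2vec vectors, the entity ID `Q_BOOK` and all function names are hypothetical, and pairs are ranked by predicate similarity alone, which simplifies the joint disambiguation described above.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean of equally sized vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embed(words: Sequence[str], embeddings: Dict[str, Vector]) -> Optional[Vector]:
    """Average the vectors of all in-vocabulary words (None if there are none)."""
    known = [embeddings[w] for w in words if w in embeddings]
    return average(known) if known else None


def best_entity_predicate(
    non_entity_words: Sequence[str],
    candidates: Dict[str, List[str]],  # detected entity -> labels of its predicates
    embeddings: Dict[str, Vector],
) -> Optional[Tuple[str, str, float]]:
    """Return the (entity, predicate, score) triple with the highest similarity."""
    q_vec = embed(non_entity_words, embeddings)
    if q_vec is None:
        return None
    best = None
    for entity, predicates in candidates.items():
        for predicate in predicates:
            p_vec = embed(predicate.split(), embeddings)
            if p_vec is None:
                continue
            score = cosine(q_vec, p_vec)
            if best is None or score > best[2]:
                best = (entity, predicate, score)
    return best


# Toy vectors (assumption): they only mimic the geometry of real embeddings.
embeddings = {
    "wrote": [1.0, 0.0],
    "author": [0.9, 0.1],
    "publication": [0.0, 1.0],
    "date": [0.1, 1.0],
}
candidates = {"Q_BOOK": ["author", "publication date"]}  # hypothetical entity ID
best = best_entity_predicate(["who", "wrote"], candidates, embeddings)
print(best)
```

In a real setup, the vectors would come from a pre-trained word2vec model, and the candidate predicates from the KG neighborhood of each entity detected by the NERD system.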

Due to the complexity in even the first question in ConvQuestions, all of the above systems achieve a very poor performance for on the benchmark. This limits the value that Convex can help these systems achieve, as and together initialize . To decouple the effect of the original QA system, we experiment with an Oracle strategy, where we use and provided by the human annotator (Turker who created the conversation).

Conversation models. As intuitive alternative strategies to Convex for handling conversations, we explore two variants: (i) the star-join, and (ii) the chain-join models. The naming is inspired by DB terminology, where a star query looks for a join on several attributes around a single variable (of the form SELECT ?x WHERE {?x att val . ?x att val . ?x att val}), while a chain query searches for a multi-variable join via indirections (SELECT ?x WHERE {?x att ?y . ?y att ?z . ?z att val}). For conversations, this entails the following: in the star model, the entity in is always assumed to be the entity in all subsequent utterances (like The Last Unicorn). The best predicate is disambiguated via a search around such using similarities of word2vec embeddings of Wikidata phrases and non-entity words in . The corresponding missing argument from the triple is returned as the answer. In the chain model of a conversation, the previous answer is always taken as the reference entity at turn , instead of . Predicate selection and answer detection are done analogously to the star model.
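The difference between the two models comes down to which entity serves as the reference at each turn. The sketch below illustrates this under several assumptions: the KG is a small list of toy triples, the `overlap` function is a crude word-overlap stand-in for the word2vec similarity, ties are broken by triple order, and all function names are hypothetical.

```python
from typing import Callable, Iterator, List, Optional, Tuple

Triple = Tuple[str, str, str]


def neighbors(kg: List[Triple], entity: str) -> Iterator[Tuple[str, str]]:
    """Yield (predicate, other argument) for every triple that touches `entity`."""
    for s, p, o in kg:
        if s == entity:
            yield p, o
        elif o == entity:
            yield p, s


def answer_turn(kg, reference: str, utterance: str,
                similarity: Callable[[str, str], float]) -> Optional[str]:
    """Pick the best-matching predicate around `reference`; return the missing argument."""
    candidates = list(neighbors(kg, reference))
    if not candidates:
        return None
    _, answer = max(candidates, key=lambda c: similarity(utterance, c[0]))
    return answer


def run_conversation(kg, seed_entity: str, first_answer: str,
                     follow_ups: List[str], similarity, model: str) -> List[Optional[str]]:
    """Answer follow-ups with the 'star' (fixed seed) or 'chain' (previous answer) model."""
    answers = []
    previous_answer = first_answer
    for utterance in follow_ups:
        reference = seed_entity if model == "star" else previous_answer
        answer = answer_turn(kg, reference, utterance, similarity)
        answers.append(answer)
        if answer is not None:
            previous_answer = answer
    return answers


def overlap(utterance: str, predicate: str) -> float:
    """Word-overlap similarity (assumption: placeholder for embedding similarity)."""
    return len(set(utterance.lower().split()) & set(predicate.lower().split()))


# Toy triples for illustration only.
kg = [
    ("The Last Unicorn", "author", "Peter S. Beagle"),
    ("The Last Unicorn", "publication year", "1968"),
    ("Peter S. Beagle", "place of birth", "New York City"),
]
follow_ups = ["place of birth", "publication year"]

star = run_conversation(kg, "The Last Unicorn", "Peter S. Beagle", follow_ups, overlap, "star")
chain = run_conversation(kg, "The Last Unicorn", "Peter S. Beagle", follow_ups, overlap, "chain")
print(star)   # ['Peter S. Beagle', '1968']
print(chain)  # ['New York City', 'Peter S. Beagle']
```

On this toy conversation, the star model answers only the second follow-up correctly and the chain model only the first, since each commits to a single reference entity regardless of what the utterance actually refers to.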

No frontiers. We also investigated whether the idea of a frontier node in itself was necessary, by defining an alternative configuration where we optimize an answer-scoring objective directly. The same three signals of question matching, context proximity, and KG priors were aggregated (Eqs. 2, 3, and 4), and Fagin’s Threshold Algorithm was again applied for obtaining the top- list. However, these top- returned nodes are now directly the answers. The process used translates to a branch-and-bound strategy for iteratively exploring the neighborhood of the initial context (, , and their interconnections) as follows, without explicitly materializing a context subgraph. The -hop neighborhood (-hop as we now directly score for an answer, without finding a frontier first) of each node in the context at a given turn is scored on its likelihood of being an answer, in a breadth-first manner. The first computed score defines a lower bound on the node being a potential answer, which is updated as better candidates are found. If a node’s answer score is lower than the lower bound so far, it is not expanded further (its neighborhood is not explored anymore). We keep exploring the -hop neighborhood of the context iteratively until we do not find any node in better than the current best answer.
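The pruning logic of such a branch-and-bound exploration can be sketched as below. This is a simplified illustration, not the configuration used in the experiments: the graph is a plain adjacency dictionary, `score` is a hypothetical callable standing in for the aggregated answer score, context nodes themselves are not treated as answer candidates, and only the single best node is tracked instead of a top-ranked list.

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Graph = Dict[str, List[str]]


def branch_and_bound_answer(
    graph: Graph,
    context: Iterable[str],
    score: Callable[[str], float],
) -> Tuple[Optional[str], float]:
    """Breadth-first search from the context, pruning nodes below the best score so far."""
    context = list(context)
    best_node, best_score = None, float("-inf")  # best_score acts as the lower bound
    visited = set(context)
    queue = deque(context)
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor in visited:
                continue
            visited.add(neighbor)
            s = score(neighbor)
            if s < best_score:
                continue  # below the lower bound: do not expand this node further
            if s > best_score:
                best_node, best_score = neighbor, s  # tighten the bound
            queue.append(neighbor)
    return best_node, best_score


# Toy graph and scores (assumptions for illustration).
graph = {"c": ["a", "b"], "a": ["d"], "b": ["e"], "d": [], "e": []}
toy_scores = {"a": 0.5, "b": 0.2, "d": 0.7, "e": 0.9}
scored = []  # records which nodes were actually scored


def toy_score(node: str) -> float:
    scored.append(node)
    return toy_scores[node]


result = branch_and_bound_answer(graph, ["c"], toy_score)
print(result, scored)
```

In the toy run, node `b` falls below the bound set by `a`, so its neighbor `e` is never scored even though it has the highest score overall. This illustrates the trade-off of pruning: exploration stays cheap, but a good answer hidden behind a weakly scoring node can be missed.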

End-to-end neural model. We compared our results with D2A (Dialog-to-Action) (guo2018dialog), the state-of-the-art end-to-end neural model for conversational KG-QA. Since Convex is an unsupervised method that does not rely on training data, we used the D2A model pre-trained on the large CSQA benchmark (saha2018complex). D2A manages dialogue memory using a generative model based on a flexible grammar.

Question completion. An interesting question to ask at this stage is whether an attempt towards completing the follow-up utterances is worthwhile. While a direct adaptation of a method like (kumar2017incomplete) is infeasible due to absence of training pairs and the need for rewriting as opposed to plain completion, we investigate certain reasonable alternatives: (i) when is concatenated with keywords (all nouns and verbs) from ; (ii) when is concatenated with ; (iii) when is concatenated with keywords from ; and, (iv) with . These variants are then passed through the stand-alone KG-QA system. Fortunately, the state-of-the-art system QAnswer is totally syntax-agnostic, and searches the KG with all question cue words to formulate an optimal SPARQL query whose components best cover the mapped KG items. This syntax-independent approach was vital as it would be futile to massage the “completed” questions above into grammatically correct forms. Platypus, on the other hand, is totally dependent on an accurate dependency parse of the input utterance, and hence is unsuitable for plugging in these question completion strategies.

Metrics. Since most questions in ConvQuestions had exactly one or at most a few correct answers, we used three standard metrics: Precision at the top rank (P@1), Mean Reciprocal Rank (MRR), and Hit@5. The last measures the fraction of times a correct answer was retrieved within the top-5 positions.
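For concreteness, the three metrics can be computed as in the following sketch. It assumes that each system output is a ranked list of answers and that each question comes with a set of gold answers; the function names are illustrative and not taken from any released evaluation script.

```python
from typing import Dict, Hashable, List, Sequence, Set


def precision_at_1(ranked: Sequence[Hashable], gold: Set[Hashable]) -> float:
    """1.0 if the top-ranked answer is correct, else 0.0."""
    return 1.0 if ranked and ranked[0] in gold else 0.0


def reciprocal_rank(ranked: Sequence[Hashable], gold: Set[Hashable]) -> float:
    """1 / rank of the first correct answer, or 0.0 if none is returned."""
    for rank, answer in enumerate(ranked, start=1):
        if answer in gold:
            return 1.0 / rank
    return 0.0


def hit_at_k(ranked: Sequence[Hashable], gold: Set[Hashable], k: int = 5) -> float:
    """1.0 if any correct answer appears within the top-k positions, else 0.0."""
    return 1.0 if any(a in gold for a in ranked[:k]) else 0.0


def evaluate(predictions: List[Sequence[Hashable]],
             gold_answers: List[Set[Hashable]], k: int = 5) -> Dict[str, float]:
    """Average the three metrics over all questions."""
    if not predictions or len(predictions) != len(gold_answers):
        raise ValueError("predictions and gold_answers must be non-empty and aligned")
    n = len(predictions)
    pairs = list(zip(predictions, gold_answers))
    return {
        "P@1": sum(precision_at_1(r, g) for r, g in pairs) / n,
        "MRR": sum(reciprocal_rank(r, g) for r, g in pairs) / n,
        f"Hit@{k}": sum(hit_at_k(r, g, k) for r, g in pairs) / n,
    }


# Toy example: three questions with ranked answer lists and gold answer sets.
predictions = [["a", "b"], ["x", "y", "z"], []]
gold_answers = [{"a"}, {"z"}, {"q"}]
metrics = evaluate(predictions, gold_answers)
print(metrics)
```

When a system returns exactly one answer per question, the three functions agree on every question, so P@1, MRR, and Hit@5 coincide.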

5.2. Configuration

Dataset. We evaluate Convex and other baselines on ConvQuestions. A random of the conversations was held out for tuning model parameters, and the remaining was used for testing. Care was taken that this development set was generated from a separate set of seed conversations ( out of the original ) so as to preclude possibilities of “leakage” on to the test set.

Initialization. We use Wikidata (www.wikidata.org) as our underlying KG, and use the complete RDF dump in NTriples format from 15 April 2019 (http://bit.ly/2QhsSDC, TB uncompressed). Identifier triples, i.e., those whose predicates are external identifiers such as Freebase ID, IMDb ID, etc., were excluded. We indexed the dump with HDT (www.rdfhdt.org/), which enables much faster lookups. The Python library NetworkX (https://networkx.github.io/) was used for graph processing. TagMe was used for NERD, and word2vec embeddings were obtained via the gensim package. Stanford CoreNLP (manning2014stanford) was used for POS tagging to extract nouns and verbs for question completion. The ideal number of frontier nodes, , was found to be three by tuning on the dev set.

6. Results and Insights

6.1. Key findings

Domain Movies TV Series Music Books Soccer
Method P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5 P@1 MRR Hit@5
QAnswer (diefenbach2019qanswer)
QAnswer + Convex * * * * * * * * * * * * *
QAnswer + Star
QAnswer + Chain
Platypus (tanon2018demoing)
Platypus + Convex * * * * * * * * * * * *
Platypus + Star
Platypus + Chain
Naive + Convex * * * * * * * * * * * *
Naive + Star
Naive + Chain
Oracle + Convex * * * * * * * * * * *
Oracle + Star
Oracle + Chain
Oracle + No frontiers
D2A (guo2018dialog)

The highest value in a group (metric-domain-system triple) is in bold. QAnswer and Platypus return only a top-1 answer and not ranked lists, and hence have the same P@1, MRR, and Hit@5 values.

Table 4. Our main results on follow-up utterances in ConvQuestions, showing how Convex enables KG-QA for conversations, and its comparison with baselines.

Table 4 lists main results, where all configurations are run on the follow-up utterances in the ConvQuestions test ( conversations; questions). An asterisk (*) indicates statistical significance of Convex-enabled systems over the strongest baseline in the group, under the -tailed paired -test at level. We make the following key observations.

Convex enables stand-alone systems. The state-of-the-art QAnswer (diefenbach2019qanswer) scores only about (since it produces sets and not ranked lists, all metric values are identical) on its own on the incomplete utterances, which it is clearly not capable of addressing. When Convex is applied, its performance jumps significantly to ) (MRR) across the domains. We have exactly the same trends with the Platypus system. The naive strategy with direct entity and predicate linking performs hopelessly in the absence of explicit cues, but with Convex we again see noticeable improvements, brought in by a relevant context graph and its iterative expansion. In the Oracle method, is known, and hence a row by itself is not meaningful. However, contrasting Oracle+Convex with other “+Convex” methods, we see that there is significant room for improvement that can be achieved by answering correctly.

Star and chain models of conversations fall short. For every configuration, we see the across-the-board superiority of Convex-boosted methods over star- and chain-models (often over gains). This clearly indicates that while these are intuitive ways of modeling human conversation (as seen in the often respectable values that these achieve), they are insufficient and oversimplified. Evidently, real humans rather prefer the middle path: sometimes hovering around the initial entity, sometimes drifting in a chain of answers. A core component of Convex that we can attribute this pattern to, is the turn-based weighting of answer and context proximity that prefers entities in the first and the last turns. “QAnswer + Star” and “Platypus + Star” achieve the same values as they both operate around the same entity detected by TagMe.

Convex generalizes across domains. In Table 4, we also note that the performance of Convex stretches across all five domains (even though the nature of questions in each of these domains has its own peculiarities), showing the potential of our unsupervised approach in new domains with little training resources, or to deal with cold starts in enterprise applications. While we did tune hyperparameters individually for each domain, there was surprisingly little variation across them ().

Frontiers help. We applied our frontier-less approach over the oracle annotations for , and in the row marked “Oracle + No frontiers” in Table 4, we find that this results in degraded performance. We thus claim that locating frontiers is an essential step before answer detection. The primary reason behind this is that answers only have low direct matching similarity to the question, making a -stage approach worthwhile. Also, exploring a -hop neighborhood was generally found to suffice: nodes further away from the initial context rarely manage to “win”, due to the proximity score component quickly falling off as KG-hops increase.

Pre-trained models do not suffice. D2A produces a single answer for every utterance, which is why the three metrics are equal. From the D2A row in Table 4, we observe that pre-trained neural models do not work well off-the-shelf on ConvQuestions (when compared to the Convex-enabled QAnswer row, for example). This is mostly due to the restrictive patterns in the CSQA dataset, owing to its semi-synthetic mode of creation. A direct comparison, though, is not fair, as Convex is an enabler method for a stand-alone KG-QA system, while D2A is an end-to-end model. Nevertheless, the main classes of errors come from: (i) a predicate necessary in ConvQuestions that is absent in CSQA (D2A cannot answer temporal questions like In what year was Ender’s game written? as such relations are absent in CSQA); (ii) D2A cannot generate -hop KG triple patterns; (iii) D2A cannot resolve long-term co-references in questions (pronouns only come from the last turn in CSQA, but not in ConvQuestions); (iv) In CSQA, co-references are almost always indicated as “it” or “that one”. But since ConvQuestions is completely user-generated, we have more challenging cases with “this book”, “the author”, “that year”, and so on.

Method Movies TV Music Books Soccer
QAnswer + Convex * * * * *
QAnswer + keywords
QAnswer +
QAnswer + keywords
QAnswer +
Table 5. Comparison with question completion strategies (MRR). The highest value in a column is in bold.

Convex outperforms question completion methods. Comparisons with question completion methods are presented in Table 5. Clear trends show that while these strategies generally perform better than stand-alone systems (contrasting QAnswer with Table 4, for, say, Movies, we see vs. on MRR previously), use of Convex results in higher improvement ( MRR on Movies). This implies that question completion is hardly worthwhile in this setup when the KG structure already reveals a great deal about the underlying user intents left implicit in follow-up utterances.

6.2. Analysis

Turn Movies TV Music Books Soccer
Table 6. Performance of Convex over turns (MRR).

Convex maintains its performance over turns. One of the most promising results of this zoomed-in analysis is that the MRR for Convex (measured via its combination with the Oracle, to decouple the effect of the QA system) does not diminish over turns. This shows particular robustness of our graph-based method: while we may produce several wrong results during the session of the conversation, we are not bogged down by any single mistake, as the context graph retains several scored candidates within itself, guarding against “near misses”. This is in stark contrast to the chain model, where it is exclusively dependent on .

Error analysis. Convex has two main steps in its pipeline: context expansion, and answer ranking. Analogously, there are two main cases of error: when the answer is not pulled in when is expanded at the frontiers (incorrect frontier scoring), or when the answer is there in but is not ranked at the top. These numbers are shown in Table 7. We find that there is significant scope for improvement for frontier expansion, as errors lie in this bag. This calls for more informed frontier scoring than our current strategy. It is, however, heartening to see that no particular turn is singularly affected. Answer ranking can be improved with better ways of aggregating the two proximity signals. Table 8 lists anecdotal examples of success cases with Convex.

Scenario Turn 1 Turn 2 Turn 3 Turn 4
Ans. in expanded graph but not in top-
Ans. not in expanded graph
Table 7. Error analysis (percentages of total errors).
Utterance: What was the name of the director? (Movies, Turn 4)
Intent: Who was the director of the movie Breakfast at Tiffany’s?
Utterance: What about Mr Morningstar? (TV Series, Turn 2)
Intent: Which actor plays the role of Mr Morningstar in the TV series Lucifer?
Utterance: What record label put out the album? (Music, Turn 3)
Intent: What is the name of the record label of the album Cosmic Thing?
Utterance: written in country? (Books, Turn 4)
Intent: In which country was the book “The Body in the Library” by Agatha Christie written?
Utterance: Who won the World Cup that year? (Soccer, Turn 4)
Intent: Which national team won the 2010 FIFA World Cup?
Table 8. Representative examples where Oracle + Convex produced the best answer at the top-1, but neither Oracle + Star, nor Oracle + Chain could.

7. Related Work

Question answering over KGs. Starting with early approaches in 2012-’13 (unger2012template; yahya2013robust; berant2013semantic), based on parsing questions via handcoded templates and grammars, KG-QA already has a rich body of literature. While templates continued to be a strong line of work due to their focus on interpretability and generalizability (bast2015more; abujabal:17; abujabal2018never; diefenbach2019qanswer; tanon2018demoing), a parallel thread has focused on neural methods driven by performance gains (huang2019knowledge; lukovnikov2017neural; sawant2019neural). Newer trends include shifts towards more complex questions (luo2018knowledge; talmor2018web; lu2019answering), and fusion of knowledge graphs and text (sawant2019neural; sun2018open). However, none of these approaches can deal with incomplete questions in a conversational setting.

Conversational question answering. Saha et al. (saha2018complex) introduce the paradigm of sequential question answering over KGs, and create a large benchmark CSQA for the task, along with a baseline with memory networks. Guo et al. (guo2018dialog) propose D2A, an end-to-end technique for conversational KG-QA that introduces dialog memory management for inferring the logical form of current utterances. While our goal is rather to build a conversation enabler method, we still compare with, and outperform, the CSQA-trained D2A model on ConvQuestions.

Question completion approaches (kumar2017incomplete; raghu2015statistical; ren2018conversational) target this setting by attempting to create full-fledged interrogatives from partial utterances while being independent of the answering resource, but suffer in situations without training pairs and with ad hoc styles. Nevertheless, we try to compare with this line of thought, and show that such completion may not be necessary if the underlying KG can be properly exploited.

Iyyer et al. (iyyer2017search) initiate the direction of sequential QA over tables using dynamic neural semantic parsing trained via weakly supervised reward-guided search, and evaluate by decomposing a previous benchmark of complex questions (liang2015compositional) to create sequential utterances. However, such table-cell search methods cannot scale to real-world, large-scale curated KGs.

QBLink (elgohary2018dataset), CoQA (reddy2018coqa), and ShARC (saeidi2018interpretation) are recent benchmarks aimed at driving conversational QA over text, and the allied paradigm in text comprehension on interactive QA (li2017context). Hixon et al. (hixon2015learning) try to learn concept knowledge graphs from conversational dialogues over science questions, but such KGs are fundamentally different from curated ones like Wikidata with millions of facts.

8. Conclusion

Through Convex, we showed how judicious graph expansion strategies with informed look-ahead can help stand-alone KG-QA systems cope with some of the challenges posed by incomplete and ad hoc follow-up questions in fact-centric conversations. Convex is completely unsupervised, and thus can be readily applied to new domains, or be deployed in enterprise setups with little or no training data. Further, being a graph-based method, each turn and the associated expansions can be easily visualized, resulting in interpretable evidence for answer derivation: an unsolved concern for many neural methods for QA. Nevertheless, Convex is just a first step towards solving the challenge of conversational KG-QA. We believe that the ConvQuestions benchmark, reflecting real user behavior, can play a key role in driving further progress.

Acknowledgements. We sincerely thank Daya Guo (Sun Yat-sen University) for his help in executing D2A on ConvQuestions.

