Using Multi-Label Classification for Improved Question Answering

Ricardo Usbeck University of Paderborn, Germany    Michael Hoffmann Leipzig University, Germany    Michael Röder University of Paderborn, Germany    Jens Lehmann University of Bonn, Germany and Fraunhofer IAIS, Germany
email: {firstname.lastname}
   Axel-Cyrille Ngonga Ngomo University of Paderborn, Germany

A plethora of diverse approaches for question answering over RDF data have been developed in recent years. While the accuracy of these systems has increased significantly over time, most systems still focus on particular types of questions or particular challenges in question answering. What is a curse for single systems is a blessing for the combination of these systems. We show in this paper how machine learning techniques can be applied to create a more accurate question answering metasystem by reusing existing systems. In particular, we develop a multi-label classification-based metasystem for question answering over 6 existing systems using an innovative set of 14 question features. The metasystem outperforms the best single system by 14% F-measure on the recent QALD-6 benchmark. Furthermore, we analyze the influence and correlation of the underlying features on the quality of the metasystem.


1 Introduction

Recent research on question answering (QA) over Linked Data and RDF has shown significant improvements in quality and efficiency in answering even complex questions [8]. As a result of this research, a multitude of QA systems (see, among others, [1, 30, 32, 6, 22, 11, 25]) have been proposed to tackle questions from different domains and of varying complexities. These systems rely on diverse approaches ranging from the transformation of questions into triple patterns [11] to hybrid question answering over both RDF and text [30]. This has led to a tool landscape with approaches able to deal with particular aspects of questions well (e.g., [25] can deal with simple conjunctive queries) while being unable to deal with other aspects (e.g., [11] has difficulties dealing with superlative queries). In addition to monitoring the development of a large number of question answering approaches, we have also witnessed the creation of a large number of benchmarks and challenges. The latter have provided the possibility to analyse the strengths and weaknesses of many QA systems objectively (see, e.g., QALD [28, 27, 26]).

The availability of both diverse approaches (e.g., approaches with different strengths and weaknesses) and of benchmarks (that allow evaluating these strengths and weaknesses) now suggests the possibility of creating “metasystems” for answering questions over Linked Data and RDF. Such a metasystem (1) integrates several QA systems. Given a question, it is then able to (2) select the most appropriate QA system to answer the said question from the set of integrated systems. While the selection of the most appropriate QA system seems tedious, the main hypothesis of this work is that this selection can be carried out automatically using machine learning (ML) techniques.

In this work, we formulate the problem of the training of a metasystem for QA as a multi-label classification problem. Here, we are interested in the choice of the best fitting classifier and the choice of machine learning features which are most descriptive. In this paper, we present a multi-label classification-based metasystem for question answering over 6 existing systems using a novel set of 14 question features.

Our contributions are as follows: (1) We develop a set of 14 novel features for natural-language questions that is capable of characterizing the weak points of existing QA systems. (2) We analyze 6 current QA systems with respect to their performance and features to deduce future research directions and gain insights into the systems’ performances. (3) We analyze 16 classifiers to find the best performing multi-label classification system for the task at hand. (4) We implement and present a machine learning approach for combining these QA systems with 16 classifiers. This metasystem outperforms the state of the art in the QALD-6 benchmark [28]. (5) We optimize the set of features used for training the metasystem to conclude with a minimal set of meaningful question features boosting the quality of the metasystem by 4%.

More information about the approach, the source code and the underlying data can be found in our project repository.

2 Related Work

With the growing number of published QA systems, the search for a universal framework for reusing components began. One of the earliest works is openQA [14], a modular open-source framework for implementing QA systems. openQA’s main workflow consists of four stages (interpretation, retrieval, synthesis and rendering) as well as adjacent modules (context and service) written as rigid Java interfaces. The authors claim that openQA enables a conciliation of different architectures and methods. QALL-ME [5] is another open-source approach using an architecture skeleton for multilingual QA together with a domain-dependent as well as a domain-independent ontology. The underlying SOA architecture features several web services which are combined into one QA system in a predetermined way. Another system is the open-source OAQA [33]. This system aims to advance the engineering of QA systems by following architectural commitments to components for better interchangeability. Using these shared interchangeable components, OAQA is able to search for the most efficient combination of modules for the task at hand.

QANUS [16] is an open-source QA framework for the rapid development of novel QA systems as well as a baseline system for benchmarking. It was designed to have interchangeable components in a pre-seeded pipeline and comes with a set of common modules such as named entity recognition and part-of-speech tagging. Both et al. [3] described a first semantic approach towards coupling components via RDF to tailor search pipelines using semantic, geospatial and full-text search modules. Here, modules add semantic information to a query until the search can be solved. QANARY [23] is the first real implementation of a semantic approach towards the generation of QA systems from components. Using the QA ontology provided by QANARY, modules, e.g., various versions of NER tools, can be exchanged to benchmark various pipelines and choose the most performant one.


However, none of these frameworks is able to combine entire QA systems. To the best of our knowledge, we present the first system that combines several QA systems based on question features and outperforms each single system’s performance.

3 Approach

In general, QA systems perform well on certain question domains like geography, physics, encyclopedic knowledge or particular knowledge source combinations [8]. Our goal is to provide a metasystem which is able to pick the most capable, specialised QA system for a particular question. We formalize our problem as follows: Let q be a question (i.e., an instance in ML terminology), and S_1, ..., S_n be an enumeration of the QA systems that underlie our metasystem. We label q by the vector (l_1, ..., l_n), with l_i = 1 iff S_i achieves an F1-score greater than 0 on q. Otherwise, we set l_i = 0. Given an unseen question q, our goal is to choose a QA system with the highest F1-score on q. The problem at hand clearly translates to a multi-label classification problem [24].

The goal of multi-label classification is as follows: Given an unseen instance q, assign one or more possible labels to q, where each label can have multiple classes. In our case, we use a Boolean set of classes to indicate whether a system is able to answer a certain question or not. Approaches to tackle multi-label classification in this form can be divided into two categories. The first one is to transform the multi-label problem into one or several single-label problems, i.e., training a separate classifier for each subproblem [24]. Depending on the algorithm, the next step can be a voting scheme or another method to combine the separate classifications [24]. In our case, most classifiers fall into this category and are explained in detail in Section 3.2. The second category contains algorithm adaptation methods, where one adapts existing machine learning algorithms to handle multi-label data directly. Examples of this method include Adaboost.MH/MR [21] or ML-kNN [34]. For an exhaustive overview of multi-label classification techniques, we refer to the survey of Tsoumakas et al. [24].

Multi-label classification can be tackled using classical ML techniques [20] provided that corresponding features are designed. We address this challenge in Section 3.1. Using these features, we train a classifier (i.e., a metasystem) to select the system(s) that is/are most likely able to answer a given question q. We interpret the output of this classifier as a ranking among the systems and query the system with the highest rank. Our overall approach is depicted in Figure 1.
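The labeling and system-selection scheme above can be sketched as follows. This is an illustrative binary-relevance sketch, not the paper's implementation: the nearest-centroid scorer stands in for the MEKA classifiers, and the feature matrix and per-system F1-scores are assumed inputs.

```python
import numpy as np

class BinaryRelevanceMetasystem:
    """Binary relevance sketch: one 'can this system answer?' model per QA
    system; a nearest-centroid scorer stands in for a real base classifier."""

    def fit(self, X, f1_scores):
        # Label l_i = 1 iff system S_i achieves an F1-score > 0 on question q.
        Y = (f1_scores > 0).astype(int)
        self.centroids = []
        for i in range(Y.shape[1]):
            pos, neg = X[Y[:, i] == 1], X[Y[:, i] == 0]
            self.centroids.append((
                pos.mean(axis=0) if len(pos) else None,
                neg.mean(axis=0) if len(neg) else None,
            ))
        return self

    def select_system(self, x):
        """Interpret per-system scores as a ranking; return the top system."""
        scores = []
        for pos, neg in self.centroids:
            if pos is None:       # system never answered during training
                scores.append(-np.inf)
            elif neg is None:     # system always answered during training
                scores.append(np.inf)
            else:                 # closer to the 'answered' centroid => higher
                scores.append(np.linalg.norm(x - neg) - np.linalg.norm(x - pos))
        return int(np.argmax(scores))
```

Any proper multi-label classifier can replace the centroid scorer; only the labeling scheme and the "rank, then query the top system" step mirror the setup described above.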

Figure 1: Overview of the metasystem combining several QA systems. The training and test data are the feature vectors extracted from QALD-6 questions.

3.1 Features

The 14 features developed herein are based on recent surveys [8] as well as on an analysis of the results of previous question answering challenges [25, 30]. These features can be summarized into eight groups. We explain each feature using the following running example: "Which New York Knicks players from outside the USA are born after Robin Lopez?".

  1. Question Type: This feature has four dimensions, i.e., List, Boolean, Resource and Number, and determines the type of the answer set. For our running example, the feature would take the value List.

  2. Query Resource Type: With its seven dimensions, this feature categorizes each entry in the answer set into one of the following classes: Misc., Date, Boolean, Number, Person, Organization, and Location. This feature would be set to Person for our running example.

  3. Wh-type: Although simple, this feature is highly effective in determining a spectrum of capabilities of a QA system, e.g., whether the said system is able to construct SPARQL ASK queries. Using the first two tokens of an input question, this feature's dimensions are Who, Which, How, In which, What, When, Where as well as Ask. Note that the Ask dimension summarizes different questions that demand the generation of SPARQL ASK queries [29] as well as questions starting with "Give me" or "Show me". Our running example would be assigned the value Which for this dimension.

  4. #Token: The number of tokens is calculated based on already identified entities and noun phrases and ignores punctuation. For example, our tokenized running example would be [Which] [New York Knicks] [players] [from] [outside] [the] [USA] [are] [born] [after] [Robin Lopez] and would result in a numerical value of 11.

  5. Comparative: This feature describes whether a question uses a comparative adjective, e.g., higher, or comparative words such as than, after, before. For our running example, this boolean feature is true.

  6. Superlative: Like the Comparative feature, the Superlative feature indicates the use of a superlative, like highest or best, and is false for the example question.

  7. Entity types: This group includes seven boolean features: Person, Money, Location, Percent, Organization, Date and Misc. Each feature describes whether an entity of the respective type exists within the question. Our running example would have Organization (New York Knicks), Location (USA) and Person (Robin Lopez) set to true, while the remaining features would be set to false.

While these features are clearly handcrafted, we show in Section 5 their ability to effectively characterize QA systems according to their capabilities as well as to accurately choose the correct system to answer a certain question. Note that our metasystem is flexible, so that each feature, as well as the set of features itself, can be extended to adapt to new QA benchmarks or systems.
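A few of the listed features can be sketched with simple heuristics. The keyword lists below are illustrative assumptions, not the paper's exact implementation, which relies on Stanford CoreNLP:

```python
# Illustrative keyword lists; the actual system uses POS tags and parsing.
COMPARATIVE_WORDS = {"than", "after", "before"}
SUPERLATIVE_WORDS = {"best", "worst", "most", "least"}
# "in which" must be checked before "which".
WH_PREFIXES = ("in which", "who", "which", "how", "what", "when", "where")

def wh_type(question):
    """Wh-type feature from the first tokens; 'Ask' covers boolean questions
    as well as 'Give me'/'Show me' imperatives."""
    q = question.lower().strip()
    for wh in WH_PREFIXES:
        if q.startswith(wh):
            return wh.capitalize()
    return "Ask"

def is_comparative(tokens):
    """True if a comparative adjective (-er) or comparative word occurs."""
    return any(t in COMPARATIVE_WORDS or t.endswith("er") for t in tokens)

def is_superlative(tokens):
    """True if a superlative (-est) or superlative word occurs."""
    return any(t in SUPERLATIVE_WORDS or t.endswith("est") for t in tokens)
```

On the running example, these heuristics yield Wh-type = Which, Comparative = true (triggered by "after") and Superlative = false; the #Token feature is simply the length of the entity-aware token list.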

3.2 Classifiers

To compute a metasystem model we evaluated 16 multi-label classifiers from the MEKA framework [20]. In the following, we give a short overview of the classifiers we used:

  • Label Combination (LC): This method treats each possible combination of labels as a class and uses a multi-class classifier for classification.

  • Ranking + Threshold (RT): Each example is copied once for each label and is assigned one label. On this augmented data, a multi-class classifier is trained. To make predictions a sample is mapped to a ranking of possible labels and gets assigned all labels above a threshold.

  • Classifier Chains (CC): Read et al. [19] introduced this classification method, which uses binary classifiers for each label ordered in a chain, such that the classifier for the i-th label is conditioned on the previous classification results. Since the performance depends on the order of the labels, there are extensions [18, 4] to find the best-suited chain. We include the PMCC, MCC, CC, CT, BR and BRq classifiers in this family.

  • Random disjoint k-labelsets (RAkELd): In 2011, Tsoumakas et al. [24] introduced RAkELd which randomly partitions the labelset into disjoint labelsets with at most k labels. For each disjoint labelset a classifier is trained, using the Label Combination (LC) method. To classify a new instance, the results of all classifiers are gathered. RAkELo is an extension that also incorporates overlapping k-labelsets. HASEL partitions according to the hierarchy defined by the dataset.

  • Conditional Dependency Networks (CDN): Guo et al. [7] present CDN. This classifier models the probabilities for each label as a densely connected conditional dependency network of binary classifiers. The network is trained using logistic regression. The prediction for an instance is obtained by using Gibbs sampling on this network to obtain the approximate joint distribution and using MAP inference on this approximation. Another member of this family is the CDT classifier.

  • Pruned Sets (PS): This transformation method was likewise introduced by Read et al. [17] and transforms the multi-label classification problem into a multiclass classification problem, just like LC, but additionally prunes infrequent label combinations to combat overfitting. PSt introduces a threshold into the classification.
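The pruned-sets transformation behind PS can be sketched as follows. This is an illustration, not MEKA's implementation; the 1-nearest-neighbour lookup is a hypothetical stand-in for the multi-class base classifier:

```python
from collections import Counter
import numpy as np

def pruned_sets_fit(X, Y, min_count=2):
    """Label powerset with pruning: treat each label combination as one class,
    drop combinations seen fewer than min_count times, and keep the surviving
    training points for a 1-NN lookup (stand-in for a real base classifier)."""
    combos = [tuple(row) for row in Y]
    keep = {c for c, n in Counter(combos).items() if n >= min_count}
    return [(x, c) for x, c in zip(np.asarray(X, dtype=float), combos)
            if c in keep]

def pruned_sets_predict(data, x):
    """Predict the label combination of the nearest surviving training point."""
    x = np.asarray(x, dtype=float)
    _, combo = min(data, key=lambda xc: np.linalg.norm(xc[0] - x))
    return list(combo)
```

The pruning step is what distinguishes PS from plain label combination (LC): rare label combinations never become classes, so the multi-class problem stays small and less prone to overfitting.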

4 Implementation details

All implementation details (including ML, feature extraction and evaluation) can be found in our open-source repository. To calculate each feature, an in-depth analysis of the input question using part-of-speech tags, dependency parse trees, string matching, and entity recognition and disambiguation is required. We rely on the Stanford CoreNLP library [13] in our current implementation. The classification algorithms we rely on are implemented in MEKA [20].

Given that our evaluation is to be carried out on QALD-6, we introduce the participating systems of the QALD-6/Task-1 challenge [28]. Table 1 shows the involved systems and possible reasons for exclusion from our metasystem. Most systems do not provide a webservice; for a list of QA systems with available web services, please refer to our project repository. We successfully contacted all authors and asked for permission to use their challenge entries to test our approach (systems below the midrule in Table 1).

Engine         | Reference | Webservice? | Exclusion
CANaLI         | [15]      | yes         | M
PersianQA      | -         | no          | L
UTQA (English) | [31]      | no          |
KWGAnswer      | -         | no          |
NbFramework    | -         | no          |
SemGraphQA     | [2]       | no          |
UIQA WME       | -         | no          | (M)
Table 1: Systems that participated in the 6th QALD challenge. Note that having a publication is optional for QALD. Exclusion indicates the reason for excluding a system from our dataset: L means the system is not available for English, M indicates that human interaction is needed.

Four out of seven systems (i.e., KWGAnswer, NbFramework, PersianQA and UIQA) have no attached publication at the time of writing. Thus, we are unable to describe their inner mechanisms. Note that the UIQA system participated in QALD-6 both as an automatic system and as a human-supported system. We use UIQA without manual entries (WME) for our evaluation.

SemGraphQA [2] is an unsupervised, graph-based approach which limits itself to questions requiring only DBpedia [10] types. First, the approach tries to match RDF resources to parts of the natural-language question and builds a syntactic parse tree via dependency parsing. Second, the resulting structure is transformed into various possible semantic interpretations, i.e., resolving ambiguities indirectly.

UTQA [31] is a crosslingual QA system based on a language-specific chunker for porting, a maximum entropy model and an answer type prediction. The grounded entities found, the predicted answer type, and a semantic similarity measure are then used to find matching neighbouring entities.

For further information about QA systems and the state of the art, please refer to [8, 12, 9].

5 Evaluation

The purpose of this evaluation was four-fold. First, we aimed to analyze the correlation between certain features and the QALD-6 submission data to point out current weak points as well as future research directions for each QA system. Second, we analyzed the set of available multi-label classifiers and performed tests to choose the best performing classifier on our data. Third, we studied the performance of our novel metasystem for question answering. Finally, we analyzed the features required to optimize the metasystem as well as their influence.

5.1 Dataset

The evaluation of this approach is based on the 6th edition of the Question Answering over Linked Data challenge (QALD-6) [28]. The dataset contains 100 test questions and the answers of the participating systems for task 1, multilingual question answering.

5.2 Feature Association with System Performance

First, to assess the descriptiveness of our features, we calculated Cramér's V coefficient for each feature and a system's ability to answer a question. To this end, the ability to answer was divided into the two classes "can answer" (F1-score > 0) and "cannot answer" (F1-score = 0). Cramér's V is based on the chi-squared statistic and is defined as follows:

V = sqrt( χ² / (n · min(I−1, J−1)) ),

with χ² as defined below, and I and J being the number of rows and columns of the contingency matrix of our experiment. To define χ², fix some feature and let O_ij be the observed count of the event (i, j), with i ∈ {"can answer", "cannot answer"} and j the j-th state of the feature. Based on this contingency matrix, let n = Σ_ij O_ij be the number of observations and E_ij = (Σ_j O_ij)(Σ_i O_ij)/n the expected count under independence; then one defines

χ² = Σ_ij (O_ij − E_ij)² / E_ij.

Cramér's V estimates the association of the features based on the observed contingency matrix, with V = 0 implying statistical independence and V = 1 implying that both variables are completely dependent.
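Cramér's V can be computed directly from the contingency table of observed counts; the following numpy sketch mirrors the definitions above and is an illustration, not the paper's code:

```python
import numpy as np

def cramers_v(table):
    """Cramér's V from an I x J contingency matrix of observed counts O_ij."""
    O = np.asarray(table, dtype=float)
    n = O.sum()
    row = O.sum(axis=1, keepdims=True)   # row marginals, shape (I, 1)
    col = O.sum(axis=0, keepdims=True)   # column marginals, shape (1, J)
    E = row @ col / n                    # expected counts E_ij under independence
    chi2 = ((O - E) ** 2 / E).sum()      # chi-squared statistic
    I, J = O.shape
    return float(np.sqrt(chi2 / (n * min(I - 1, J - 1))))
```

A perfectly dependent 2x2 table yields V = 1, while a table whose rows are proportional to its column marginals yields V = 0, matching the interpretation given above.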

Figure 2: Cramér's V coefficient of the features and the system performances.

Figure 2 shows that across all QA systems the features Query Resource Type, Question Word and Number of Tokens demonstrate the closest association with a system's ability to answer. Furthermore, there seems to be a large association between the performance of NbFramework and Location (see for example questions 12, 20, 23 and 44 in Table 2). The same effect can be observed for UTQA (English) and Superlative, see questions 17, 18 and 77. We investigate this effect further in Section 5.4.

5.3 Choosing a Classifier


Second, we determined the best classifier on all features. To this end, we performed a 10-fold cross-validation for a set of classifiers C_1, ..., C_m, recording the macro F1-score on each fold. To be precise, for each classifier C_j we calculated

score_j = (1/10) Σ_{f=1}^{10} (1/|Q_f|) Σ_{q ∈ Q_f} F1(q, σ_j(q)),

where Q_f denotes the questions of the f-th fold and we defined

σ_j(q) = argmax_k r_{j,k}(q).

Here, r_{j,k}(q) refers to the rank that the j-th classifier assigns to the k-th system on question q, and F1(q, i) is the F1-score achieved by the answer set A_i(q) that system i provides on question q. We used the F1-scores according to the reported QALD-6 data for each system. Figure 3 shows the results of our experiment in a boxplot.

Figure 3: Boxplot of Cross-Validation results.

The classifiers of the Classifier Chains (CC) family achieved the best performance on this task. However, we reached higher scores for a particular classifier if we used only a subset of features, see Section 5.4.

Furthermore, we tested the performance of all above classifiers, using all features, in two setups. First, we trained the classifiers on 99 questions and used the remaining question for testing; we repeated this leave-one-out procedure for all 100 questions and calculated the average F1-score. Second, we trained and tested all classifiers on the full set of questions. The results are displayed in Table 2.

As can be seen, there is a huge difference between the two setups. This reflects the sparsity and vast diversity of the dataset, caused by the low number of available questions. With a growing number of questions, the results of the leave-one-out setup should converge towards the second setup (i.e., F1-Score Full). Surprisingly, the PSt classifier performed best in the second setup, and thus we chose it for our metasystem, as it outperforms the next best classifier by 0.04 points F-measure.

Classifier F1-Score leave-one-out F1-Score Full
RT 0.68 0.68
BRq 0.68 0.68
CC 0.68 0.68
CT 0.68 0.68
CDT 0.60 0.69
CDN 0.60 0.72
FW 0.68 0.68
HASEL 0.68 0.68
LC 0.66 0.71
MCC 0.69 0.68
PCC 0.69 0.68
PMCC 0.69 0.68
PS 0.66 0.71
PSt 0.63 0.76
RAkEL 0.55 0.72
RAkELd 0.65 0.70
Table 2: F1-Score leave-one-out: Classifier performance computed using Leave-One-Out methodology. F1-Score Full: Classifier performance tested and trained on all questions.
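The leave-one-out setup can be sketched as follows. The selector-fitting function is a hypothetical stand-in for training one of the MEKA classifiers; the metasystem is credited with the benchmark F1-score of whichever system it selects for the held-out question:

```python
import numpy as np

def leave_one_out_f1(X, f1_scores, fit_selector):
    """X: (n_questions, n_features); f1_scores: (n_questions, n_systems).
    fit_selector(X_train, f1_train) returns a function mapping a feature
    vector to the index of the selected QA system."""
    n = len(X)
    total = 0.0
    for i in range(n):
        mask = np.arange(n) != i            # train on all but question i
        select = fit_selector(X[mask], f1_scores[mask])
        total += f1_scores[i, select(X[i])]  # credit the chosen system's F1
    return total / n
```

The "Full" column in Table 2 corresponds to calling `fit_selector` once on all questions and evaluating on the same questions, which explains why it is the more optimistic of the two setups.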

5.4 Feature Influence on Performance

To probe the influence of the different features on the performance of the metasystem, we trained the PSt classifier on all questions using different combinations of features. We avoided cross-validation due to the small number of potential data points. Since displaying all results is impracticable, Table 3 shows the best performing combination alongside a sample of other combinations.

Feature Combination F1-Score
QRT 0.69
QT 0.69
QW 0.72
#T 0.68
QW, Loc 0.72
QRT, QW 0.72
QW, Loc 0.73
QRT, QW, Loc 0.75
#T, Loc, QW, QRT, Pers 0.77
#T, Loc, QW, QRT 0.78
All features 0.76
Table 3: Different feature combinations. (Question Word QW, Number of Tokens #T, Location Loc, Person Pers, Query Resource Type QRT)

0.78 is the globally highest F1-score among all combinations. Several feature combinations achieve this optimum; thus we chose to display the one requiring the fewest features, namely Number of Tokens, Location, Question Word and Query Resource Type. Note that the performance decreases by 2 percentage points when using all features; adding features beyond the optimal group seems to introduce noise. However, this is highly dependent on the particular set of questions used.
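The search over feature combinations can be sketched with a brute-force loop; the evaluation function below is an assumption standing in for training and scoring PSt on the selected feature columns:

```python
from itertools import combinations

def best_feature_subset(feature_names, evaluate):
    """Brute-force search over all non-empty feature subsets; for 14 features
    this is 2**14 - 1 = 16383 subsets, which is entirely feasible.
    evaluate(subset) returns the metasystem F1-score for that subset;
    iterating by ascending size with a strict '>' keeps the smallest optimum."""
    best_subset, best_score = None, -1.0
    for k in range(1, len(feature_names) + 1):
        for subset in combinations(feature_names, k):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score
```

Iterating subsets by ascending size mirrors the choice above of reporting the combination with the fewest features among equally scoring optima.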

5.5 Metasystem Performance

The overall goal was to develop a metasystem that performs better than the underlying systems in order to benefit from the multitude of existing QA research and development activities. As shown in Table 4, the six underlying systems achieve F-measures between 0.15 and 0.68 on QALD-6. An optimal selection algorithm (which would always choose the best performing QA system) would achieve an F-measure of 0.89. Our best performing metasystem, trained on the 100 questions alone using the PSt classifier and only four features (namely #T, Loc, QW and QRT), improves on the best single system's performance by 14.1% and reaches an F-measure of 0.78. This result supports our assumptions about the diversity of existing QA solutions and shows how good feature design allows the characterization and effective use of QA systems. However, the overall results also show clear weaknesses of existing QA solutions. In particular, questions which require solution modifiers (e.g., questions 9, 17, 28, 33, 36, 49 and 88) remain a difficult problem that needs to be tackled.

6 Conclusion and Summary

The QA metasystem we have presented is able to outperform each single QA system using a feature-selection approach combined with multi-label classification. We were able to show that an effective combination of systems, features and classifiers can improve overall performance.

However, our system is still more than 0.10 points F-measure away from an optimal system selection. This gap exists due to a lack of training data, since we had only 100 training instances, i.e., questions, available. Thus, we invite other QA system developers to implement webservices to foster more active research and increase the comparability of systems. We have actively begun research on more sophisticated benchmarking of QA systems. Furthermore, we will look deeper into the issues of overfitting classifiers and finding more influential features in the future.

Acknowledgments This work has been supported by Eurostars projects DIESEL (E!9367) and QAMEL (E!9725) as well as the European Union’s H2020 research and innovation action HOBBIT (GA 688227). We also thank Christina Unger for providing us with the underlying datasets.


  • [1] P. Baudis and J. Sedivý. Modeling of the question answering task in the YodaQA system. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, pages 222–228, 2015.
  • [2] R. Beaumont, B. Grau, and A.-L. Ligozat. SemGraphQA at QALD-5: LIMSI participation at QALD-5 at CLEF. In CLEF (Working Notes), 2015.
  • [3] A. Both, A.-C. N. Ngonga, R. Usbeck, D. Lukovnikov, C. Lemke, and M. Speicher. A service-oriented search framework for full text, geospatial and semantic search. In SEMANTiCS, 2014.
  • [4] W. Cheng, E. Hüllermeier, and K. J. Dembczynski. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 279–286, 2010.
  • [5] O. Ferrandez, C. Spurk, M. Kouylekov, I. Dornescu, S. Ferrandez, M. Negri, R. Izquierdo, D. Tomas, C. Orasan, G. Neumann, et al. The QALL-ME framework: A specifiable-domain multilingual question answering architecture. Web Semantics: Science, Services and Agents on the World Wide Web, 9(2):137–145, 2011.
  • [6] A. Freitas, J. G. Oliveira, E. Curry, S. O’Riain, and J. C. P. da Silva. Treo: combining entity-search, spreading activation and semantic relatedness for querying linked data. In 1st Workshop on Question Answering over Linked Data (QALD-1), 2011.
  • [7] Y. Guo and S. Gu. Multi-label classification using conditional dependency networks. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1300, 2011.
  • [8] K. Höffner, S. Walter, E. Marx, J. Lehmann, A.-C. Ngonga Ngomo, and R. Usbeck. Overcoming challenges of semantic question answering in the semantic web. Semantic Web Journal, 2016.
  • [9] O. Kolomiyets and M.-F. Moens. A survey on question answering technology from an information retrieval perspective. Inf. Sci., 181(24):5412–5434, Dec. 2011.
  • [10] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 2014.
  • [11] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web Journal, 3:249–265, 2012.
  • [12] V. Lopez, V. S. Uren, M. Sabou, and E. Motta. Is question answering fit for the semantic web?: A survey. Semantic Web Journal, 2(2):125–155, 2011.
  • [13] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In 52nd ACL: System Demonstrations, pages 55–60, 2014.
  • [14] E. Marx, R. Usbeck, A.-C. Ngonga Ngomo, K. Höffner, J. Lehmann, and S. Auer. Towards an open question answering architecture. In SEMANTiCS, 2014.
  • [15] G. M. Mazzeo and C. Zaniolo. Canali: A system for answering controlled natural language questions on rdf knowledge bases, 2016.
  • [16] J.-P. Ng and M.-Y. Kan. QANUS: An open-source question-answering platform. arXiv preprint arXiv:1501.00311, 2015.
  • [17] J. Read. A pruned problem transformation method for multi-label classification. In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), volume 143150, 2008.
  • [18] J. Read, L. Martino, and D. Luengo. Efficient monte carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3):1535–1546, 2014.
  • [19] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333–359, 2011.
  • [20] J. Read, P. Reutemann, B. Pfahringer, and G. Holmes. MEKA: A multi-label/multi-target extension to Weka. Journal of Machine Learning Research, 17(21):1–5, 2016.
  • [21] R. E. Schapire and Y. Singer. BoosTexter: A boosting-based system for text categorization. Machine Learning, 39(2-3):135–168, May 2000.
  • [22] S. Shekarpour, E. Marx, A.-C. Ngonga Ngomo, and S. Auer. SINA: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics, 2014.
  • [23] K. Singh, A. Both, D. Diefenbach, S. Shekarpour, D. Cherix, and C. Lange. Qanary – the fast track to creating a question answering system with linked data technology. In ESWC, 2016.
  • [24] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2011.
  • [25] C. Unger, L. Bühmann, J. Lehmann, A.-C. Ngonga Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In 21st WWW Conference, pages 639–648, 2012.
  • [26] C. Unger, C. Forascu, V. Lopez, A.-C. Ngonga Ngomo, E. Cabrio, P. Cimiano, and S. Walter. Question answering over linked data (QALD-4). In CLEF, pages 1172–1180, 2014.
  • [27] C. Unger, C. Forascu, V. Lopez, A.-C. Ngonga Ngomo, E. Cabrio, P. Cimiano, and S. Walter. Question answering over linked data (QALD-5). In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation Forum, Toulouse, France, September 8-11, 2015, 2015.
  • [28] C. Unger, A.-C. Ngonga Ngomo, and E. Cabrio. 6th open challenge on question answering over linked data (QALD-6). In The Semantic Web: ESWC 2016 Challenges, 2016.
  • [29] R. Usbeck, E. Körner, and A.-C. Ngonga Ngomo. Answering Boolean hybrid questions with HAWK. In NLIWOD Workshop at the International Semantic Web Conference (ISWC), including erratum and changes, 2015.
  • [30] R. Usbeck, A.-C. Ngonga Ngomo, L. Bühmann, and C. Unger. HAWK – hybrid question answering using linked data. In The Semantic Web. Latest Advances and New Domains, volume 9088 of Lecture Notes in Computer Science, pages 353–368. Springer International Publishing, 2015.
  • [31] A. P. B. Veyseh. Cross-lingual question answering using common semantic space. In Proceedings of TextGraphs@NAACL-HLT 2016, pages 15–19, 2016.
  • [32] K. Xu, Y. Feng, S. Huang, and D. Zhao. Question answering via phrasal semantic parsing. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, pages 414–426, 2015.
  • [33] Z. Yang, E. Garduno, Y. Fang, A. Maiberg, C. McCormack, and E. Nyberg. Building optimal information systems automatically: Configuration space exploration for biomedical information systems. In 22nd ACM CIKM, pages 1421–1430. ACM, 2013.
  • [34] M.-L. Zhang and Z.-H. Zhou. ML-kNN: A lazy learning approach to multi-label learning. Pattern Recognition, 40(7):2038–2048, 2007.
Id Question F-measure per system (8 score columns; of the original column headers only "UTQA English" is recoverable)
0 Who was the doctoral supervisor of Albert Einstein? 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
1 Did Kaurismäki ever win the Grand Prix at Cannes? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
2 Who wrote the song Hotel California? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 Who was on the Apollo 11 mission? 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
4 Which electronics companies were founded in Beijing? 1.0 0.06 1.0 0.06 0.06 1.0 1.0 1.0
5 What is in a chocolate chip cookie? 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
6 What is the atmosphere of the Moon composed of? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
7 How many movies did Park Chan-wook direct? 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
8 Who are the developers of DBpedia? 1.0 1.0 1.0 1.0 0.85 1.0 1.0 1.0
9 Which Indian company has the most employees? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
10 What is the name of the school where Obama’s wife studied? 0.0 1.0 0.0 0.0 0.0 0.66 1.0 0.66
11 Where does Piccadilly start? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
12 What is the capital of Cameroon? 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
13 When did the Boston Tea Party take place? 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
14 Who played Gus Fring in Breaking Bad? 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
15 Who wrote Harry Potter? 0.66 1.0 0.0 1.0 0.0 1.0 1.0 1.0
16 Which actors play in Big Bang Theory? 0.5 1.0 0.0 0.0 0.0 1.0 1.0 1.0
17 What is the largest country in the world? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
18 Who is the most powerful Jedi? 0.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0
19 How many goals did Pelé score? 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0
20 Who is the president of Eritrea? 1.0 1.0 0.0 0.03 1.0 0.66 1.0 0.66
21 Which computer scientist won an oscar? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
22 Who created Family Guy? 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0
23 How many people live in Poland? 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0
24 To which party does the mayor of Paris belong? 1.0 1.0 0.0 1.0 0.0 1.0 1.0 1.0
25 Who does the voice of Bart Simpson? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
26 Who composed the soundtrack for Cameron’s Titanic? 1.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0
27 When did Boris Becker end his active career? 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
28 Show me all basketball players that are higher than 2 meters. 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
29 What country is Sitecore from? 1.0 1.0 1.0 0.0 0.0 1.0 1.0 1.0
30 Which country was Bill Gates born in? 1.0 0.0 0.0 0.0 1.0 0.0 1.0 0.0
31 Who developed Slack? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
32 In which city did Nikos Kazantzakis die? 1.0 0.66 0.0 0.66 0.0 1.0 1.0 1.0
33 How many grand-children did Jacques Cousteau have? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
34 Which films did Stanley Kubrick direct? 1.0 1.0 0.96 1.0 1.0 1.0 1.0 1.0
35 Does Neymar play for Real Madrid? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
36 How many seats does the home stadium of FC Porto have? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
37 Show me all books in Asimov’s Foundation series. 0.95 0.0 0.21 0.0 0.0 1.0 1.0 1.0
38 Which movies star both Liz Taylor and Richard Burton? 0.95 1.0 0.0 0.0 0.35 0.77 1.0 0.77
39 In which city are the headquarters of the United Nations? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
40 In which city was the president of Montenegro born? 1.0 1.0 0.0 0.0 1.0 0.66 1.0 0.66
41 Which writers studied in Istanbul? 0.21 0.0 0.0 0.25 0.33 0.22 0.33 0.22
42 Who is the mayor of Paris? 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
43 What is the full name of Prince Charles? 1.0 0.0 0.0 1.0 0.0 1.0 1.0 1.0
44 What is the longest river in China? 0.0 1.0 0.0 0.0 0.02 0.0 1.0 0.0
45 Who discovered Ceres? 1.0 1.0 0.0 1.0 0.0 0.0 1.0 1.0
46 When did princess Diana die? 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0
47 What do ants eat? 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0
48 Who is the host of the BBC Wildlife Specials? 1.0 1.0 0.0 1.0 0.0 0.0 1.0 0.0
49 How many moons does Mars have? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
50 What was the first Queen album? 0.02 0.0 0.0 0.0 0.0 0.0 0.02 0.02
51 Did Elvis Presley have children? 1.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
52 Give me a list of all Canadians that reside in the U.S. 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
53 Where is Syngman Rhee buried? 1.0 1.0 0.0 0.0 1.0 1.0 1.0 1.0
54 In which countries do people speak Japanese? 1.0 0.5 0.0 0.0 0.0 1.0 1.0 1.0
55 Who is the king of the Netherlands? 0.0 0.66 0.0 0.0 0.0 1.0 1.0 1.0
56 When did the Dodo become extinct? 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
57 Show me all Czech movies. 0.89 1.0 0.0 0.0 0.0 1.0 1.0 1.0
58 Which rivers flow into the North Sea? 1.0 0.43 0.45 0.45 0.45 0.22 1.0 0.22
59 When did Operation Overlord commence? 1.0 0.66 0.0 1.0 0.0 1.0 1.0 0.66
60 Where do the Red Sox play? 1.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
61 In which time zone is Rome? 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0
62 Give me a list of all critically endangered birds. 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
63 How much did the Lego Movie cost? 0.5 0.0 0.0 0.0 0.0 1.0 1.0 0.5
64 What was the original occupation of the inventor of Lego? 0.0 0.0 0.0 0.66 1.0 1.0 1.0 1.0
65 Which countries have more than ten volcanoes? 0.0 0.87 0.0 0.0 0.0 1.0 1.0 1.0
66 Show me all U.S. states. 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
67 Who wrote the Game of Thrones theme? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
68 How many calories does a baguette have? 0.0 1.0 0.0 0.0 0.0 0.0 1.0 1.0
69 Can you cry underwater? 1.0 1.0 1.0 1.0 0.0 1.0 1.0 1.0
70 Which companies produce hovercrafts? 1.0 1.0 1.0 1.0 0.0 0.0 1.0 0.0
71 How many emperors did China have? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
72 Show me hiking trails in the Grand Canyon where there’s no danger of flash floods. 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
73 In which ancient empire could you pay with cocoa beans? 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
74 How did Michael Jackson die? 0.0 0.0 0.0 0.0 1.0 0.0 1.0 1.0
75 Which space probes were sent into orbit around the sun? 0.0 0.0 0.0 0.0 0.0 0.72 0.72 0.72
76 When was Coca Cola invented? 1.0 1.0 1.0 1.0 0.0 0.0 1.0 1.0
77 What is the biggest stadium in Spain? 0.01 0.0 0.0 0.0 0.01 1.0 1.0 1.0
78 On which day is Columbus Day? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
79 How short is the shortest active NBA player? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
80 Whom did Lance Bass marry? 1.0 1.0 1.0 0.0 1.0 0.66 1.0 0.66
81 What form of government does Russia have? 0.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
82 What movies does Jesse Eisenberg play in? 1.0 0.98 1.0 0.0 0.98 0.0 1.0 1.0
83 What color expresses loyalty? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
84 Show me all museums in London. 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
85 Give me all South American countries. 1.0 1.0 0.0 0.0 0.0 1.0 1.0 1.0
86 Who is the coach of Ankara’s ice hockey team? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
87 Who is the son of Sonny and Cher? 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0
88 What are the five boroughs of New York? 0.08 0.0 0.0 0.0 0.0 0.28 0.28 0.28
89 Show me Hemingway’s autobiography. 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
90 What kind of music did Lou Reed play? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
91 In which city does Sylvester Stallone live? 1.0 1.0 0.66 0.0 0.0 1.0 1.0 1.0
92 Who was Vincent van Gogh inspired by? 1.0 0.0 1.0 0.0 0.0 1.0 1.0 1.0
93 What are the names of the Teenage Mutant Ninja Turtles? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
94 What are the zodiac signs? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
95 What languages do they speak in Pakistan? 1.0 0.37 0.37 0.0 0.0 0.32 1.0 0.32
96 Who became president after JFK died? 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
97 In what city is the Heineken brewery? 1.0 0.0 0.0 1.0 0.0 0.0 1.0 1.0
98 What is Elon Musk famous for? 1.0 1.0 0.0 0.0 0.0 0.44 1.0 1.0
99 What is Batman’s real name? 0.0 0.0 0.0 0.0 0.0 1.0 1.0 1.0
Average F-measure 0.54 0.48 0.17 0.22 0.15 0.68 0.89 0.78
Table 4: Performance of QA systems on QALD-6.
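The last row of Table 4 reports the macro-averaged F-measure, i.e., the arithmetic mean of the 100 per-question F1 scores for each system. A minimal sketch of that computation, using toy scores rather than an actual column of the table:

```python
# Macro-averaged F-measure: the mean of per-question F1 scores of one system.
def macro_average(f1_scores):
    return sum(f1_scores) / len(f1_scores)

# toy per-question F1 scores for a single system (illustrative only)
scores = [1.0, 0.0, 1.0, 1.0]
avg = macro_average(scores)  # 0.75
```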