Using Multi-Label Classification for Improved Question Answering
A plethora of diverse approaches for question answering over RDF data have been developed in recent years. While the accuracy of these systems has increased significantly over time, most systems still focus on particular types of questions or particular challenges in question answering. What is a curse for single systems is a blessing for the combination of these systems. We show in this paper how machine learning techniques can be applied to create a more accurate question answering metasystem by reusing existing systems. In particular, we develop a multi-label classification-based metasystem for question answering over 6 existing systems using an innovative set of 14 question features. The metasystem outperforms the best single system by 14% F-measure on the recent QALD-6 benchmark. Furthermore, we analyzed the influence and correlation of the underlying features on the metasystem quality.
Recent research on question answering (QA) over Linked Data and RDF has shown significant improvements in quality and efficiency when answering even complex questions. As a result of this research, a multitude of QA systems (see, among others, [1, 30, 32, 6, 22, 11, 25]) have been proposed to tackle questions from different domains and of varying complexities. These systems rely on diverse approaches ranging from the transformation of questions into triple patterns to hybrid question answering over both RDF and text. This has led to a tool landscape with approaches able to deal with particular aspects of questions well (e.g.,  can deal with simple conjunctive queries) while being unable to deal with other aspects (e.g.,  has difficulties dealing with superlative queries). In addition to the development of a large number of question answering approaches, we have also witnessed the creation of a large number of benchmarks and challenges. The latter have made it possible to analyse the strengths and weaknesses of many QA systems objectively (see, e.g., QALD [28, 27, 26]).
The availability of both diverse approaches (i.e., approaches with different strengths and weaknesses) and of benchmarks (that allow evaluating these strengths and weaknesses) now suggests the possibility of creating “metasystems” for answering questions over Linked Data and RDF. Such a metasystem (1) integrates several QA systems. Given a question, it is then able to (2) select the most appropriate QA system to answer the said question from the set of integrated systems. While the selection of the most appropriate QA system seems tedious, the main hypothesis of this work is that this selection can be carried out automatically using machine learning (ML) techniques.
In this work, we formulate the problem of the training of a metasystem for QA as a multi-label classification problem. Here, we are interested in the choice of the best fitting classifier and the choice of machine learning features which are most descriptive. In this paper, we present a multi-label classification-based metasystem for question answering over 6 existing systems using a novel set of 14 question features.
Our contributions are as follows: (1) We develop a set of 14 novel features for natural-language questions that is capable of characterizing the weak points of existing QA systems. (2) We analyze 6 current QA systems with respect to their performance and features to deduce future research directions and gain insights into the systems’ performances. (3) We analyze 16 classifiers to find the best performing multi-label classification system for the task at hand. (4) We implement and present a machine learning approach for combining these QA systems with 16 classifiers. This metasystem outperforms the state of the art in the QALD-6 benchmark . (5) We optimize the set of features used for training the metasystem to conclude with a minimal set of meaningful question features boosting the quality of the metasystem by 4%.
More information about the approach, source code and underlying data can be found in our project repository https://github.com/AKSW/NLIWOD/tree/master/qa.ml.
2 Related Work
With the growing number of published QA systems, a search for a universal framework for reusing components began. One of the earliest works is openQA, which is a modular open-source framework for implementing QA systems. openQA’s main workflow consists of four stages (interpretation, retrieval, synthesis and rendering) as well as adjacent modules (context and service) written as rigid Java interfaces. The authors claim that openQA enables a conciliation of different architectures and methods. QALL-ME is another open-source approach using an architecture skeleton for multilingual QA, with a domain-specific as well as a domain-independent ontology. The underlying SOA architecture features several web services which are combined into one QA system in a predetermined way. Another system is the open-source OAQA. This system aims to advance the engineering of QA systems by following architectural commitments to components for better interchangeability. Using these shared interchangeable components, OAQA is able to search for the most efficient combination of modules for the task at hand.
QANUS  is a not disclosed QA framework for the rapid development of novel QA systems as well as a baseline system for benchmarking. It was designed to have interchangeable components in a pre-seeded pipeline and comes with a set of common modules such as named entity recognition and part-of-speech tagging. Both et al.  described a first semantic approach towards coupling components together via RDF to tailor search pipelines using semantic, geospatial and full text search modules. Here, modules add semantic information to a query until the search can be solved. QANARY  is the first real implementation of a semantic approach towards the generation of QA systems from components. Using the provided QA ontology from QANARY, modules can be exchanged, e.g., various versions of NER tools, to benchmark various pipelines and choose the most performant one.
However, none of these frameworks is able to combine various QA systems. To the best of our knowledge, we present the first system that combines several QA systems based on question features and outperforms each of the individual systems.
In general, QA systems perform well on certain question domains like geography, physics, encyclopedic knowledge or particular knowledge source combinations. Our goal is to provide a metasystem which is able to pick the most capable, specialised QA system for a particular question. We formalize our problem as follows: Let $q$ be a question (i.e., an instance in ML terminology), and $S_1, \dots, S_n$ be an enumeration of the QA systems that underlie our metasystem. We label $q$ by the vector $(l_1, \dots, l_n)$, with $l_i = 1$ iff $S_i$ achieves an F1-score greater than 0 on $q$. Otherwise, we set $l_i = 0$. Given an unseen $q$, our goal is to choose a QA system with the highest F1-score. The problem at hand clearly translates to a multi-label classification problem.
The goal of multi-label classification is as follows: Given an unseen instance $x$, assign one or more possible labels to $x$, where each label can have multiple classes. In our case, we use a Boolean set of classes to indicate whether a system is able to answer a certain question or not. Approaches to tackle multi-label classification in this form can be divided into two categories. The first one is to transform the multi-label problem into one or several single-label problems, i.e., training a separate classifier for each subproblem. Depending on the algorithm, the next step could be a voting scheme or other methods to combine the separate classifications. In our case, most classifiers fall into this category and are explained in detail in Section 3.2. The second category contains algorithm adaptation methods, where one adapts existing machine learning algorithms to handle multi-label data directly. Examples of this method include Adaboost.MH/MR or ML-kNN. For an exhaustive overview of the techniques of multi-label classification, we refer to the survey of Tsoumakas et al.
Multi-label classification can be tackled using classical ML techniques provided that corresponding features are designed. We address this challenge in Section 3.1. Using these features, we train a classifier (i.e., a metasystem) to select the system(s) most likely to be able to answer a given question $q$. We interpret the output of this classifier as a ranking among systems and query the system with the highest rank. Our overall approach is depicted in Figure 1.
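The labelling and selection steps described above can be illustrated with a minimal sketch. The system names, F1-scores and classifier confidences below are hypothetical placeholders and are not taken from the QALD-6 data; the tie-breaking rule is likewise an assumption.

```python
from typing import Dict, List


def label_vector(f1_by_system: Dict[str, float], systems: List[str]) -> List[int]:
    """Return one Boolean label per system: 1 iff its F1-score is greater than 0."""
    return [1 if f1_by_system.get(name, 0.0) > 0 else 0 for name in systems]


def select_system(confidences: Dict[str, float]) -> str:
    """Query the system ranked highest by the classifier.

    Ties are broken by alphabetical order of the system name (an assumption).
    """
    return max(sorted(confidences), key=lambda name: confidences[name])


# Hypothetical systems and scores, for illustration only.
systems = ["system_a", "system_b", "system_c"]
f1 = {"system_a": 0.0, "system_b": 1.0, "system_c": 0.66}
labels = label_vector(f1, systems)

# Hypothetical per-system confidences produced by a trained classifier.
confidences = {"system_a": 0.1, "system_b": 0.7, "system_c": 0.2}
chosen = select_system(confidences)
```

The label vector is what the classifier is trained on; at prediction time only the ranking is needed, so the metasystem issues a single query to the selected system.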
The 14 features developed herein are based on recent surveys as well as on an analysis of the results of previous question answering challenges [25, 30]. These features can be summarized into eight groups. We explain each feature using the following running example: “Which New York Knicks players from outside the USA are born after Robin Lopez?”.
Question Type: This feature has four dimensions, i.e., List, Boolean, Resource and Number, and determines the type of the answer set. For our running example, the feature would take the value List.
Query Resource Type: With its seven dimensions, the feature categorizes each entry in the answer set to one of the following items: Misc., Date, Boolean, Number, Person, Organization, and Location. This feature would be set to Person for our running example.
Wh-type: Although simple, this feature is highly effective in determining a spectrum of capabilities of a QA system, e.g., whether the said system is able to construct SPARQL ASK queries (see http://www.w3.org/TR/2013/REC-sparql11-query-20130321/#ask). Using the first two tokens of an input question, this feature’s dimensions are Who, Which, How, In which, What, When, Where as well as Ask. Note that the Ask dimension summarizes different questions that demand the generation of SPARQL ASK queries as well as questions starting with “Give me” or “Show me”. Our running example would be assigned the value Which for this dimension.
#Token: The number of tokens is calculated based on already identified entities and noun phrases and ignores punctuation. For example, our tokenized running example would be [Which] [New York Knicks] [players] [from] [outside] [the] [USA] [are] [born] [after] [Robin Lopez] and would result in a numerical value of 11.
Comparative: This feature describes whether a question uses a comparative adjective, e.g., higher, or comparative words such as than, after, before. For our running example, this boolean feature is true.
Superlative: Like the Comparative feature, the Superlative feature indicates the use of a superlative, like highest or best, and is false for the example question.
Entity types: This group of features includes seven boolean features: Person, Money, Location, Percent, Organization, Date and Misc. Each feature describes whether an entity of this particular type exists within the question. Our running example question would have Organization (New York Knicks), Location (USA) and Person (Robin Lopez) set to true while the remaining features would be set to false.
While these features are clearly handcrafted, we show in Section 5 that they effectively characterize the question answering systems according to their capabilities and allow the metasystem to accurately choose the correct system to answer a certain question. Note that our metasystem is flexible: each feature can be extended, as can the number of features, to adapt to new QA benchmarks or systems.
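A few of the simpler features can be approximated with plain string matching, as the following sketch shows. It is not the implementation used in this work, which relies on part-of-speech tags, dependency parses and entity recognition; the word lists are small illustrative assumptions, and the chunked input to `count_tokens` is assumed to come from an upstream entity and noun-phrase recognizer. The two example questions are taken from the QALD-6 question list reproduced at the end of this document.

```python
import string

# Illustrative word lists (assumptions); a POS tagger would be used in practice.
COMPARATIVE_WORDS = {"than", "after", "before", "higher"}
SUPERLATIVE_WORDS = {"highest", "best", "largest", "longest", "most"}
WH_WORDS = ("who", "which", "how", "what", "when", "where")


def words(question):
    """Lower-cased whitespace tokens with surrounding punctuation removed."""
    return [w.strip(string.punctuation).lower() for w in question.split()]


def wh_type(question):
    """Derive the Wh-type from the first two tokens; anything else maps to Ask."""
    tokens = words(question)
    if tokens[:2] == ["in", "which"]:
        return "In which"
    if tokens and tokens[0] in WH_WORDS:
        return tokens[0].capitalize()
    return "Ask"


def is_comparative(question):
    return any(w in COMPARATIVE_WORDS for w in words(question))


def is_superlative(question):
    return any(w in SUPERLATIVE_WORDS for w in words(question))


def count_tokens(chunks):
    """Count chunks (entities and noun phrases count once), ignoring punctuation."""
    return sum(1 for c in chunks if c.strip(string.punctuation))


q_superlative = "What is the largest country in the world?"
q_comparative = "Show me all basketball players that are higher than 2 meters."
chunks = ["What", "is", "the", "largest", "country", "in", "the", "world", "?"]
```

Mapping every question that does not start with a known question word to Ask follows the description of the Wh-type feature, which groups boolean questions with those starting with “Give me” or “Show me”.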
To compute a metasystem model we evaluated 16 multi-label classifiers from the MEKA framework . In the following, we give a short overview of the classifiers we used:
Label Combination (LC): This method treats each possible combination of labels as a class and uses a multi-class classifier for classification.
Ranking + Threshold (RT): Each example is copied once for each label and is assigned one label. On this augmented data, a multi-class classifier is trained. To make predictions a sample is mapped to a ranking of possible labels and gets assigned all labels above a threshold.
Classifier Chains (CC): Read et al.  introduced this classification method, which uses binary classifiers for each label ordered in a chain, such that the classifier for the i-th label is conditioned on the previous classification results. Since the performance depends on the order of the labels, there are extensions [18, 4] to find the best suited chain. We include the PMCC, MCC, CC, CT, BR and BRq classifiers in this section, since they belong to the same family of classifiers.
Random disjoint k-labelsets (RAkELd): In 2011, Tsoumakas et al.  introduced RAkELd which randomly partitions the labelset into disjoint labelsets with at most k labels. For each disjoint labelset a classifier is trained, using the Label Combination (LC) method. To classify a new instance, the results of all classifiers are gathered. RAkELo is an extension that also incorporates overlapping k-labelsets. HASEL partitions according to the hierarchy defined by the dataset.
Conditional Dependency Networks (CDN): Guo et al. present CDN. This classifier models the probabilities for each label as a densely connected conditional dependency network of binary classifiers. The network is trained using logistic regression. The prediction for an instance is obtained by using Gibbs sampling on this network to obtain the approximate joint distribution and using MAP inference on this approximation. Another member of this family is the CDT classifier.
Pruned Sets (PS): This transformation method was also introduced by Read et al. and transforms the multi-label classification problem into a multi-class classification problem, just like LC, but additionally prunes infrequent label combinations to combat overfitting. PSt introduces a threshold into the classification.
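The two problem transformations that are easiest to illustrate are Label Combination and Pruned Sets. The sketch below is a simplified illustration and not MEKA's implementation: `label_combination` maps each distinct label vector to a class id, and `prune_infrequent` only drops rare combinations. The `min_count` parameter and the toy label vectors are assumptions, and any further handling of pruned instances is omitted.

```python
from collections import Counter


def label_combination(label_vectors):
    """LC: treat each distinct combination of labels as one class."""
    class_ids = {}
    targets = []
    for vector in label_vectors:
        key = tuple(vector)
        if key not in class_ids:
            class_ids[key] = len(class_ids)  # ids follow order of first appearance
        targets.append(class_ids[key])
    return targets, class_ids


def prune_infrequent(label_vectors, min_count):
    """PS (simplified): keep only label combinations seen at least min_count times."""
    counts = Counter(tuple(v) for v in label_vectors)
    return [tuple(v) for v in label_vectors if counts[tuple(v)] >= min_count]


# Toy label vectors for two systems (hypothetical data).
vectors = [(1, 0), (1, 0), (0, 1), (1, 1), (1, 0)]
targets, class_ids = label_combination(vectors)
kept = prune_infrequent(vectors, min_count=2)
```

After either transformation, any multi-class classifier can be trained on the resulting class ids.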
4 Implementation details
All implementation details (including ML, feature extraction and evaluation) can be found in our open-source repository. To calculate each feature, an in-depth analysis of the input question using part-of-speech tags, dependency parse trees, string matching as well as entity recognition and disambiguation is required. We rely on the Stanford CoreNLP library in our current implementation. The classification algorithms we rely on are implemented in MEKA.
Given that our evaluation is to be carried out on QALD-6, we introduce the participating systems of the QALD-6/Task-1 challenge. Table 1 shows the involved systems and possible reasons for exclusion from our metasystem. (Most systems do not have a web service to work with. For a list of QA systems with available web services, please go to our project repository at https://github.com/AKSW/NLIWOD/blob/master/qa.systems/README.md.) We successfully contacted all authors and asked for permission to use their challenge entries to test our approach (systems below midrule in Table 1).
Four out of seven systems (i.e., KWGAnswer, NbFramework, PersianQA and UIQA) have no attached publication at the time of writing. Thus, we are unable to describe their inner mechanisms. Note that the UIQA system participated in QALD-6 as an automatic system as well as a human-supported system. We use UIQA without manual entries (WME) for our evaluation.
SemGraphQA is an unsupervised and graph-based approach which also limits itself to questions requiring only DBpedia types. First, the approach tries to match RDF resources to parts of the natural language question and builds a syntactic parse tree from dependency parsing. Second, the resulting structure is transformed into various possible semantic interpretations, i.e., resolving ambiguities indirectly.
UTQA  is a crosslingual QA system based on a language-specific chunker for porting, a maximum entropy model and an answer type prediction. Found ground entities, the predicted answer type as well as a semantic similarity are then used to find matching neighbouring entities.
The purpose of this evaluation was four-fold. First, we aimed to analyze the correlation between certain features and the QALD-6 submission data to point out current weak points as well as future research directions for each QA system. Second, we analyzed the set of available multi-label classifiers and performed tests to choose the best performing classifier on our data. Third, we studied the performance of our novel metasystem for question answering. Finally, we analyzed the features required to optimize the metasystem as well as their influence.
The evaluation of this approach is based on the 6th edition of the Question Answering over Linked Data challenge (QALD-6) . The dataset contains 100 test questions and the answers for the respective systems on task 1, multilingual question answering.
5.2 Feature Association with System Performance
First, to assess the descriptiveness of our features, we calculated Cramér’s V coefficient for each feature and a system’s ability to answer a question. To this end, the ability to answer was divided into the two classes “can answer” (F1-score > 0) and “cannot answer” (F1-score = 0). Cramér’s V is based on the chi-squared statistic and is defined as follows:
$$V = \sqrt{\frac{\chi^2}{n \cdot t}}$$
with $t = \min(I-1, J-1)$, $I$ and $J$ being the number of rows and columns of the contingency matrix of our experiment. To define $\chi^2$, fix some feature and let $n_{ij}$ be the observed count of event $(A_i, B_j)$, with $A_1$ = “can answer” and $A_2$ = “cannot answer”, and $B_j$ the $j$-th state of the feature. Based on this contingency matrix, let $n = \sum_{i,j} n_{ij}$ be the number of observations, and $E_{ij} = \frac{1}{n} \left(\sum_{k} n_{ik}\right) \left(\sum_{k} n_{kj}\right)$, then one defines
$$\chi^2 = \sum_{i=1}^{I} \sum_{j=1}^{J} \frac{(n_{ij} - E_{ij})^2}{E_{ij}}.$$
Cramér’s V estimates the association of the features based on the observed contingency matrix, $V = 0$ implying statistical independence and $V = 1$ implying that both features are linearly dependent.
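As an illustration, the standard definition of Cramér's V (the chi-squared statistic divided by the number of observations times min(I-1, J-1), under a square root) can be computed from a contingency matrix in a few lines. This is a minimal sketch using only the standard library; the two toy matrices are hypothetical and not derived from the QALD-6 data.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency matrix given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]

    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:  # skip empty rows/columns to avoid division by zero
                chi2 += (observed - expected) ** 2 / expected

    t = min(len(table) - 1, len(table[0]) - 1)
    return sqrt(chi2 / (n * t))


# Rows: "can answer" / "cannot answer"; columns: two states of a boolean feature.
perfect_association = [[10, 0], [0, 10]]
no_association = [[5, 5], [5, 5]]

v_perfect = cramers_v(perfect_association)
v_none = cramers_v(no_association)
```

The first matrix is a case where the feature fully determines answerability, the second one where the feature carries no information about it.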
Figure 2 shows that across all QA systems the features Query Resource Type, Question Word and Number of Tokens demonstrate the closest association with a system’s ability to answer. Furthermore, there seems to be a large association between the performance of NbFramework and Location (see for example questions 12, 20, 23 and 44 in Table 2). The same effect can be observed with UTQA (English) and Superlative, see questions 17, 18, 77. We investigate this effect further in Section 5.4.
5.3 Choosing a Classifier
Second, we determined the best classifier on all features. To this end, we performed a 10-fold cross-validation for a set of classifiers $C_1, \dots, C_m$, recording the macro F1-score on each fold. To be precise, for each classifier $C_j$ and each fold with test questions $Q$, we calculated the numbers
$$F_j = \frac{1}{|Q|} \sum_{q \in Q} F1\big(A_{i_j(q)}(q)\big),$$
where we defined
$$i_j(q) = \arg\max_{k} \, r_{jk}(q).$$
Here, $r_{jk}(q)$ refers to the rank that the $j$-th classifier assigns to the $k$-th system, and $A_i(q)$ is the set of answers provided by system $i$ on question $q$. We used the F1-scores according to the reported QALD-6 data for each system. Figure 3 shows the results of our experiment in a boxplot.
The classifiers of the Classifier Chains (CC) family achieved the best performance on this task. However, we reached higher scores for a particular classifier if we used only a subset of features, see Section 5.4.
Furthermore, we tested the performance of all above classifiers using all the features in two setups. First, we trained the classifiers on 99 questions and used the remaining single question for testing. We repeated this leave-one-out procedure for all questions and calculated the average F1-score. Second, we tested the performance of all above classifiers using all questions as training set. The results are displayed in Table 2.
As can be seen, there is a huge difference between the two setups. This shows the sparsity and vast diversity of the dataset that is caused by the low number of questions available. With a growing number of questions, the results of the leave-one-out setup should converge to the second setup (i.e., F1-Score Full). Surprisingly, the PSt classifier performed best on this task in the second setup and thus we chose it in our metasystem as it outperforms the best other classifier by 0.04 points F-measure.
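The leave-one-out procedure can be sketched as follows. The `train` and `predict` callables stand in for an arbitrary multi-label classifier and are assumptions, as are the toy F1 table and the deliberately trivial baseline that always picks the system with the best mean F1-score on the training questions.

```python
def leave_one_out_f1(questions, f1_table, train, predict):
    """Average F1 of the selected system when each question is held out once."""
    total = 0.0
    for held_out in questions:
        training = [q for q in questions if q != held_out]
        model = train(training, f1_table)
        chosen = predict(model, held_out)
        total += f1_table[held_out][chosen]
    return total / len(questions)


def train_best_mean(training, f1_table):
    """Toy 'classifier': remember the system with the best mean F1 on the training set."""
    systems = sorted(f1_table[training[0]])
    mean_f1 = {s: sum(f1_table[q][s] for q in training) / len(training) for s in systems}
    return max(systems, key=lambda s: mean_f1[s])  # first system wins ties


def predict_constant(model, question):
    """The toy model ignores the question and always returns the same system."""
    return model


# Hypothetical per-question F1-scores of two systems.
f1_table = {
    "q1": {"a": 1.0, "b": 0.0},
    "q2": {"a": 1.0, "b": 0.0},
    "q3": {"a": 0.0, "b": 1.0},
}
score = leave_one_out_f1(list(f1_table), f1_table, train_best_mean, predict_constant)
```

Replacing the toy baseline with a real classifier only requires supplying different `train` and `predict` functions; the evaluation loop stays the same.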
|Classifier||F1-Score leave-one-out||F1-Score Full|
5.4 Feature Influence on Performance
To probe the influence of the different features on the performance of the metasystem, we trained the PSt classifier on all questions using different combinations of features. We avoided using cross-validation due to the small number of potential data points. Since displaying all results is impracticable, Table 3 shows the best performing combination along with a sample of other combinations.
|Features||F1-score|
|QRT, QW, Loc||0.75|
|#T, Loc, QW, QRT, Pers||0.77|
|#T, Loc, QW, QRT||0.78|
0.78 is the globally highest F1-score among all combinations. Several feature combinations achieve this optimum; thus, we chose to display the one requiring the fewest features, namely Number of Tokens, Location, Question Word and Query Resource Type. Note that the performance decreases by 2 percent when all features are used. Adding other features beyond the optimal group seems to introduce noise. However, this is highly dependent on the particular set of questions that is used.
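An exhaustive search over feature combinations that prefers the smallest subset among equally good ones can be sketched as follows. The `evaluate` callback is a stand-in for training and scoring the classifier on a feature subset. In this toy setup only the three scores reported in Table 3 are real; every other combination is assigned a placeholder score of 0.0, which is an assumption made purely so that the example runs.

```python
from itertools import combinations


def best_feature_subset(features, evaluate):
    """Try every non-empty subset, smallest first; strict '>' keeps the smallest on ties."""
    best_score, best_subset = float("-inf"), ()
    for size in range(1, len(features) + 1):
        for subset in combinations(features, size):
            score = evaluate(subset)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score


# The three scores from Table 3; all other subsets get a placeholder of 0.0.
REPORTED = {
    frozenset({"QRT", "QW", "Loc"}): 0.75,
    frozenset({"#T", "Loc", "QW", "QRT", "Pers"}): 0.77,
    frozenset({"#T", "Loc", "QW", "QRT"}): 0.78,
}


def evaluate(subset):
    return REPORTED.get(frozenset(subset), 0.0)


features = ["#T", "Loc", "QW", "QRT", "Pers"]
subset, score = best_feature_subset(features, evaluate)
```

With 14 features there are 2^14 - 1 = 16383 non-empty subsets, so an exhaustive search of this kind stays feasible as long as a single training run is cheap.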
5.5 Metasystem Performance
The overall goal was to develop a metasystem that is able to perform better than the underlying systems to benefit from the multitude of existing QA research and development activities. As shown in Table 4, the six underlying systems perform with an F-measure of 0.15 to 0.68 on QALD-6. An optimal selection algorithm (which would always choose the best performing QA system) would achieve 0.89 F-measure. Our best performing metasystem, trained on the 100 questions alone using the PSt classifier and only four features (namely #T, Loc, QW and QRT), is able to improve the best single system performance by 14.1% and reaches an F-measure of 0.78. This result supports our assumptions about the diversity of existing QA solutions and shows how a good feature design allows characterization and the effective use of QA systems. The overall results, however, also show clear weaknesses of existing QA solutions. In particular, questions which require solution modifiers (e.g., 9, 88, 17, 28, 33, 36, 49) remain a difficult problem that needs to be tackled.
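The upper bound given by an optimal selection can be computed directly from per-question F1-scores: for each question, take the best score any system achieves, then average. The sketch below uses a small hypothetical table rather than the QALD-6 results.

```python
def oracle_f1(f1_table):
    """Average F1 of an optimal selector that always picks the best system per question."""
    return sum(max(scores.values()) for scores in f1_table.values()) / len(f1_table)


def single_system_f1(f1_table, system):
    """Average F1 obtained by always querying the same system."""
    return sum(scores[system] for scores in f1_table.values()) / len(f1_table)


# Hypothetical per-question F1-scores of two systems.
f1_table = {
    "q1": {"a": 1.0, "b": 0.0},
    "q2": {"a": 0.0, "b": 1.0},
    "q3": {"a": 0.5, "b": 0.0},
}
upper_bound = oracle_f1(f1_table)
best_single = max(single_system_f1(f1_table, s) for s in ("a", "b"))
```

Any metasystem that selects one system per question falls between `best_single` (if it learns nothing useful) and `upper_bound`.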
6 Conclusion and Summary
The QA metasystem we have presented is able to outperform each single QA system using a feature-selection approach combined with multi-label classification. We were able to show that an effective combination of systems, features and classifiers can improve overall performance.
However, our system is still more than 0.10 points F-measure away from an optimal system selection. This gap exists due to a lack of training data, since we had only 100 training instances (i.e., questions) available. Thus, we welcome other QA system developers to implement web services to foster more active research and increase the comparability of systems. We have actively begun research to refine the benchmarking of QA systems (see http://gerbil-qa.aksw.org/gerbil/). Furthermore, we will look deeper into the issues of overfitting classifiers and finding more influential features in the future.
Acknowledgments This work has been supported by Eurostars projects DIESEL (E!9367) and QAMEL (E!9725) as well as the European Union’s H2020 research and innovation action HOBBIT (GA 688227). We also thank Christina Unger for providing us with the underlying datasets.
- [1] P. Baudis and J. Sedivý. Modeling of the question answering task in the yodaqa system. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, pages 222–228, 2015.
- [2] R. Beaumont, B. Grau, and A.-L. Ligozat. Semgraphqa at qald-5: Limsi participation at qald-5 at clef. In CLEF (Working Notes), 2015.
- [3] A. Both, A.-C. N. Ngonga, R. Usbeck, D. Lukovnikov, C. Lemke, and M. Speicher. A service-oriented search framework for full text, geospatial and semantic search. In SEMANTiCS, 2014.
- [4] W. Cheng, E. Hüllermeier, and K. J. Dembczynski. Bayes optimal multilabel classification via probabilistic classifier chains. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 279–286, 2010.
- [5] O. Ferrandez, C. Spurk, M. Kouylekov, I. Dornescu, S. Ferrandez, M. Negri, R. Izquierdo, D. Tomas, C. Orasan, G. Neumann, et al. The qall-me framework: A specifiable-domain multilingual question answering architecture. Web semantics: Science, services and agents on the world wide web, 9(2):137–145, 2011.
- [6] A. Freitas, J. G. Oliveira, E. Curry, S. O’Riain, and J. C. P. da Silva. Treo: combining entity-search, spreading activation and semantic relatedness for querying linked data. In 1st Workshop on Question Answering over Linked Data (QALD-1), 2011.
- [7] Y. Guo and S. Gu. Multi-label classification using conditional dependency networks. In IJCAI Proceedings-International Joint Conference on Artificial Intelligence, volume 22, page 1300, 2011.
- [8] K. Höffner, S. Walter, E. Marx, J. Lehmann, A. Ngonga, and R. Usbeck. Overcoming challenges of semantic question answering in the semantic web. Semantic Web Journal, 2016.
- [9] O. Kolomiyets and M.-F. Moens. A survey on question answering technology from an information retrieval perspective. Inf. Sci., 181(24):5412–5434, Dec. 2011.
- [10] J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - a large-scale, multilingual knowledge base extracted from wikipedia. Semantic Web Journal, 2014.
- [11] V. Lopez, M. Fernández, E. Motta, and N. Stieler. PowerAqua: Supporting users in querying and exploring the Semantic Web. Semantic Web Journal, 3:249–265, 2012.
- [12] V. Lopez, V. S. Uren, M. Sabou, and E. Motta. Is question answering fit for the semantic web?: A survey. Semantic Web Journal, 2(2):125–155, 2011.
- [13] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, and D. McClosky. The Stanford CoreNLP natural language processing toolkit. In 52nd ACL: System Demonstrations, pages 55–60, 2014.
- [14] E. Marx, R. Usbeck, A.-C. N. Ngonga, K. Höffner, J. Lehmann, and S. Auer. Towards an open question answering architecture. In SEMANTiCS, 2014.
- [15] G. M. Mazzeo and C. Zaniolo. Canali: A system for answering controlled natural language questions on rdf knowledge bases, 2016.
- [16] J.-P. Ng and M.-Y. Kan. Qanus: An open-source question-answering platform. arXiv preprint arXiv:1501.00311, 2015.
- [17] J. Read. A pruned problem transformation method for multi-label classification. In Proc. 2008 New Zealand Computer Science Research Student Conference (NZCSRS 2008), volume 143150, 2008.
- [18] J. Read, L. Martino, and D. Luengo. Efficient monte carlo methods for multi-dimensional learning with classifier chains. Pattern Recognition, 47(3):1535–1546, 2014.
- [19] J. Read, B. Pfahringer, G. Holmes, and E. Frank. Classifier chains for multi-label classification. Machine learning, 85(3):333–359, 2011.
- [20] J. Read, P. Reutemann, B. Pfahringer, and G. Holmes. MEKA: A multi-label/multi-target extension to Weka. Journal of Machine Learning Research, 17(21):1–5, 2016.
- [21] R. E. Schapire and Y. Singer. Boostexter: A boosting-based system for text categorization. Mach. Learn., 39(2-3):135–168, May 2000.
- [22] S. Shekarpour, E. Marx, A.-C. N. Ngomo, and S. Auer. Sina: Semantic interpretation of user queries for question answering on interlinked data. Journal of Web Semantics, 2014.
- [23] K. Singh, A. Both, D. Diefenbach, S. Shekarpour, D. Cherix, and C. Lange. Qanary–the fast track to creating a question answering system with linked data technology. In ESWC, 2016.
- [24] G. Tsoumakas, I. Katakis, and I. Vlahavas. Random k-labelsets for multilabel classification. IEEE Transactions on Knowledge and Data Engineering, 23(7):1079–1089, 2011.
- [25] C. Unger, L. Bühmann, J. Lehmann, A. N. Ngomo, D. Gerber, and P. Cimiano. Template-based question answering over RDF data. In 21st WWW conference, pages 639–648, 2012.
- [26] C. Unger, C. Forascu, V. Lopez, A. N. Ngomo, E. Cabrio, P. Cimiano, and S. Walter. Question answering over linked data (QALD-4). In CLEF, pages 1172–1180, 2014.
- [27] C. Unger, C. Forascu, V. Lopez, A. N. Ngomo, E. Cabrio, P. Cimiano, and S. Walter. Question answering over linked data (QALD-5). In Working Notes of CLEF 2015 - Conference and Labs of the Evaluation forum, Toulouse, France, September 8-11, 2015., 2015.
- [28] C. Unger, A. Ngonga, and E. Cabrio. 6th open challenge on question answering over linked data (qald-6). In The Semantic Web: ESWC 2016 Challenges., 2016.
- [29] R. Usbeck, E. Körner, and A.-C. Ngonga Ngomo. Answering boolean hybrid questions with hawk. In NLIWOD workshop at International Semantic Web Conference (ISWC), including erratum and changes, 2015.
- [30] R. Usbeck, A.-C. Ngomo, L. Bühmann, and C. Unger. Hawk – hybrid question answering using linked data. In The Semantic Web. Latest Advances and New Domains, volume 9088 of Lecture Notes in Computer Science, pages 353–368. Springer International Publishing, 2015.
- [31] A. P. B. Veyseh. Cross-lingual question answering using common semantic space. In Proceedings of TextGraphs@NAACL-HLT 2016, pages 15–19, 2016.
- [32] K. Xu, Y. Feng, S. Huang, and D. Zhao. Question answering via phrasal semantic parsing. In Experimental IR Meets Multilinguality, Multimodality, and Interaction - 6th International Conference of the CLEF Association, CLEF 2015, Toulouse, France, September 8-11, 2015, Proceedings, pages 414–426, 2015.
- [33] Z. Yang, E. Garduno, Y. Fang, A. Maiberg, C. McCormack, and E. Nyberg. Building optimal information systems automatically: Configuration space exploration for biomedical information systems. In 22nd ACM CIKM, pages 1421–1430. ACM, 2013.
- [34] M.-L. Zhang and Z.-H. Zhou. Ml-knn: A lazy learning approach to multi-label learning. Pattern recognition, 40(7):2038–2048, 2007.
|0||Who was the doctoral supervisor of Albert Einstein?||0.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|1||Did Kaurismäki ever win the Grand Prix at Cannes?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|2||Who wrote the song Hotel California?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|3||Who was on the Apollo 11 mission?||1.0||0.0||0.0||0.0||0.0||0.0||1.0||0.0|
|4||Which electronics companies were founded in Beijing?||1.0||0.06||1.0||0.06||0.06||1.0||1.0||1.0|
|5||What is in a chocolate chip cookie?||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|6||What is the atmosphere of the Moon composed of?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|7||How many movies did Park Chan-wook direct?||1.0||1.0||0.0||1.0||0.0||0.0||1.0||1.0|
|8||Who are the developers of DBpedia?||1.0||1.0||1.0||1.0||0.85||1.0||1.0||1.0|
|9||Which Indian company has the most employees?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|10||What is the name of the school where Obama’s wife studied?||0.0||1.0||0.0||0.0||0.0||0.66||1.0||0.66|
|11||Where does Piccadilly start?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|12||What is the capital of Cameroon?||1.0||1.0||1.0||1.0||1.0||1.0||1.0||1.0|
|13||When did the Boston Tea Party take place?||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|14||Who played Gus Fring in Breaking Bad?||0.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|15||Who wrote Harry Potter?||0.66||1.0||0.0||1.0||0.0||1.0||1.0||1.0|
|16||Which actors play in Big Bang Theory?||0.5||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|17||What is the largest country in the world?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|18||Who is the most powerful Jedi?||0.0||1.0||1.0||1.0||0.0||1.0||1.0||1.0|
|19||How many goals did Pelé score?||1.0||1.0||1.0||1.0||0.0||0.0||1.0||1.0|
|20||Who is the president of Eritrea?||1.0||1.0||0.0||0.03||1.0||0.66||1.0||0.66|
|21||Which computer scientist won an oscar?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|22||Who created Family Guy?||1.0||1.0||0.0||1.0||0.0||1.0||1.0||1.0|
|23||How many people live in Poland?||1.0||1.0||0.0||0.0||0.0||0.0||1.0||0.0|
|24||To which party does the mayor of Paris belong?||1.0||1.0||0.0||1.0||0.0||1.0||1.0||1.0|
|25||Who does the voice of Bart Simpson?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|26||Who composed the soundtrack for Cameron’s Titanic?||1.0||0.0||0.0||0.0||0.0||0.0||1.0||0.0|
|27||When did Boris Becker end his active career?||0.0||1.0||0.0||0.0||0.0||0.0||1.0||1.0|
|28||Show me all basketball players that are higher than 2 meters.||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|29||What country is Sitecore from?||1.0||1.0||1.0||0.0||0.0||1.0||1.0||1.0|
|30||Which country was Bill Gates born in?||1.0||0.0||0.0||0.0||1.0||0.0||1.0||0.0|
|31||Who developed Slack?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|32||In which city did Nikos Kazantzakis die?||1.0||0.66||0.0||0.66||0.0||1.0||1.0||1.0|
|33||How many grand-children did Jacques Cousteau have?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|34||Which films did Stanley Kubrick direct?||1.0||1.0||0.96||1.0||1.0||1.0||1.0||1.0|
|35||Does Neymar play for Real Madrid?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|36||How many seats does the home stadium of FC Porto have?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|37||Show me all books in Asimov’s Foundation series.||0.95||0.0||0.21||0.0||0.0||1.0||1.0||1.0|
|38||Which movies star both Liz Taylor and Richard Burton?||0.95||1.0||0.0||0.0||0.35||0.77||1.0||0.77|
|39||In which city are the headquarters of the United Nations?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|40||In which city was the president of Montenegro born?||1.0||1.0||0.0||0.0||1.0||0.66||1.0||0.66|
|41||Which writers studied in Istanbul?||0.21||0.0||0.0||0.25||0.33||0.22||0.33||0.22|
|42||Who is the mayor of Paris?||1.0||1.0||1.0||1.0||1.0||1.0||1.0||1.0|
|43||What is the full name of Prince Charles?||1.0||0.0||0.0||1.0||0.0||1.0||1.0||1.0|
|44||What is the longest river in China?||0.0||1.0||0.0||0.0||0.02||0.0||1.0||0.0|
|45||Who discovered Ceres?||1.0||1.0||0.0||1.0||0.0||0.0||1.0||1.0|
|46||When did princess Diana die?||1.0||1.0||0.0||0.0||1.0||1.0||1.0||1.0|
|47||What do ants eat?||1.0||1.0||1.0||1.0||0.0||1.0||1.0||1.0|
|48||Who is the host of the BBC Wildlife Specials?||1.0||1.0||0.0||1.0||0.0||0.0||1.0||0.0|
|49||How many moons does Mars have?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|50||What was the first Queen album?||0.02||0.0||0.0||0.0||0.0||0.0||0.02||0.02|
|51||Did Elvis Presley have children?||1.0||0.0||0.0||0.0||0.0||0.0||1.0||1.0|
|52||Give me a list of all Canadians that reside in the U.S.||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|53||Where is Syngman Rhee buried?||1.0||1.0||0.0||0.0||1.0||1.0||1.0||1.0|
|54||In which countries do people speak Japanese?||1.0||0.5||0.0||0.0||0.0||1.0||1.0||1.0|
|55||Who is the king of the Netherlands?||0.0||0.66||0.0||0.0||0.0||1.0||1.0||1.0|
|56||When did the Dodo become extinct?||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|57||Show me all Czech movies.||0.89||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|58||Which rivers flow into the North Sea?||1.0||0.43||0.45||0.45||0.45||0.22||1.0||0.22|
|59||When did Operation Overlord commence?||1.0||0.66||0.0||1.0||0.0||1.0||1.0||0.66|
|60||Where do the Red Sox play?||1.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|61||In which time zone is Rome?||1.0||1.0||1.0||1.0||0.0||1.0||1.0||1.0|
|62||Give me a list of all critically endangered birds.||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|63||How much did the Lego Movie cost?||0.5||0.0||0.0||0.0||0.0||1.0||1.0||0.5|
|64||What was the original occupation of the inventor of Lego?||0.0||0.0||0.0||0.66||1.0||1.0||1.0||1.0|
|65||Which countries have more than ten volcanoes?||0.0||0.87||0.0||0.0||0.0||1.0||1.0||1.0|
|66||Show me all U.S. states.||0.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|67||Who wrote the Game of Thrones theme?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|68||How many calories does a baguette have?||0.0||1.0||0.0||0.0||0.0||0.0||1.0||1.0|
|69||Can you cry underwater?||1.0||1.0||1.0||1.0||0.0||1.0||1.0||1.0|
|70||Which companies produce hovercrafts?||1.0||1.0||1.0||1.0||0.0||0.0||1.0||0.0|
|71||How many emperors did China have?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|72||Show me hiking trails in the Grand Canyon where there’s no danger of flash floods.||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|73||In which ancient empire could you pay with cocoa beans?||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|74||How did Michael Jackson die?||0.0||0.0||0.0||0.0||1.0||0.0||1.0||1.0|
|75||Which space probes were sent into orbit around the sun?||0.0||0.0||0.0||0.0||0.0||0.72||0.72||0.72|
|76||When was Coca Cola invented?||1.0||1.0||1.0||1.0||0.0||0.0||1.0||1.0|
|77||What is the biggest stadium in Spain?||0.01||0.0||0.0||0.0||0.01||1.0||1.0||1.0|
|78||On which day is Columbus Day?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|79||How short is the shortest active NBA player?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|80||Whom did Lance Bass marry?||1.0||1.0||1.0||0.0||1.0||0.66||1.0||0.66|
|81||What form of government does Russia have?||0.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|82||What movies does Jesse Eisenberg play in?||1.0||0.98||1.0||0.0||0.98||0.0||1.0||1.0|
|83||What color expresses loyalty?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|84||Show me all museums in London.||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|85||Give me all South American countries.||1.0||1.0||0.0||0.0||0.0||1.0||1.0||1.0|
|86||Who is the coach of Ankara’s ice hockey team?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|87||Who is the son of Sonny and Cher?||0.0||0.0||0.0||0.0||1.0||1.0||1.0||1.0|
|88||What are the five boroughs of New York?||0.08||0.0||0.0||0.0||0.0||0.28||0.28||0.28|
|89||Show me Hemingway’s autobiography.||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|90||What kind of music did Lou Reed play?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|91||In which city does Sylvester Stallone live?||1.0||1.0||0.66||0.0||0.0||1.0||1.0||1.0|
|92||Who was Vincent van Gogh inspired by?||1.0||0.0||1.0||0.0||0.0||1.0||1.0||1.0|
|93||What are the names of the Teenage Mutant Ninja Turtles?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|94||What are the zodiac signs?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|
|95||What languages do they speak in Pakistan?||1.0||0.37||0.37||0.0||0.0||0.32||1.0||0.32|
|96||Who became president after JFK died?||0.0||0.0||0.0||0.0||0.0||0.0||0.0||0.0|
|97||In what city is the Heineken brewery?||1.0||0.0||0.0||1.0||0.0||0.0||1.0||1.0|
|98||What is Elon Musk famous for?||1.0||1.0||0.0||0.0||0.0||0.44||1.0||1.0|
|99||What is Batman’s real name?||0.0||0.0||0.0||0.0||0.0||1.0||1.0||1.0|