Translating Questions into Answers using DBPedia n-triples
In this paper we present a question answering system that uses a neural network to interpret questions learned from the DBpedia repository. We train a sequence-to-sequence neural network model with n-triples extracted from the DBpedia Infobox Properties. Since these properties do not represent natural language, we additionally use question-answer dialogues from movie subtitles. Although the automatic evaluation shows a low overlap between the generated answers and the gold standard set, a manual inspection showed promising outcomes from the experiment for further work.
Recent work on sequence-to-sequence neural networks has shown remarkable progress in many domains such as speech recognition, computer vision and language processing. These neural models can be used to map a sequence to another sequence, which has direct applications in natural language understanding (Sutskever et al., 2014). One of the major advantages of this framework is that it requires a limited amount of feature engineering while still improving state-of-the-art results. This allows researchers to work on tasks for which domain knowledge may not be available. Question answering (QA) systems can directly benefit from this novel approach, because it only requires a mapping between questions and their answers. Due to the complexity of this mapping, most of the previous research in this domain used a complex pipeline of conventional linguistically-based NLP techniques, such as parsing, part-of-speech tagging and co-reference resolution (Unger et al., 2014).
In this work, we build an open-domain QA system with sequence-to-sequence neural network models. The system was trained on question-answer pairs derived from the world knowledge represented in the DBpedia repository. Although building an open-domain system allows us to use the extensive amount of data stored in DBpedia, one of the challenges is to store this large amount of knowledge in the neural network architecture. Moreover, while these systems generally involve a smaller learning pipeline, they require a significant amount of training data.
Figure 1 illustrates how a sequence-to-sequence neural network can be trained on question-answer pairs. First, a sequence-to-sequence framework reads the source sentence, i.e. the question, using an encoder to build a dense vector, a sequence of non-zero values that represents the meaning of the question. A decoder then processes this vector to predict an answer. In this manner, these encoder-decoder models can capture long-range dependencies in languages, e.g. gender agreement or syntactic structures.
2 Related Work
Recent approaches to building QA systems are dominated by the use of neural networks. Vinyals and Le (2015) present an approach for conversational modelling, which uses a sequence-to-sequence neural model. Their model predicts the next sentence given the previous sentences for an IT helpdesk domain, as well as for an open domain trained on a subtitles dataset. For open-domain dialogue generation, Li et al. (2017) propose using adversarial training. To this end, reinforcement learning is used to train a system that produces sequences that are indistinguishable from human-generated dialogue utterances. They jointly train two systems: a generative model, which produces response sequences, and a discriminator, which distinguishes between human-generated and machine-generated dialogues. The outputs from the discriminator are then used as rewards for the generative model, guiding the system to generate dialogues that closely resemble human dialogues. Yin et al. (2016) demonstrate an end-to-end neural network model for generative QA. Their model is built on the encoder-decoder framework for sequence-to-sequence learning and is equipped with the ability to query a knowledge base, which they demonstrate on a Chinese encyclopedia web site. The authors show that the proposed model is capable of generating natural and correct answers by referring to the facts in the knowledge base. In addition, Serban et al. (2016) extend the hierarchical recurrent encoder-decoder neural network to open-domain dialogue systems and demonstrate that this model is competitive with state-of-the-art neural language models and back-off n-gram models. They illustrate limitations of similar approaches and show how the performance can be improved by bootstrapping the learning from a larger question-answer pair corpus and from pre-trained word embeddings.
Weissenborn et al. (2017) describe a heuristic that serves as a guideline for the development of two neural baseline systems for the extractive QA task. Their RNN-based system, called FastQA, demonstrates good performance for extractive question answering due to its awareness of question words while processing the context. Additionally, they introduce a composition function that goes beyond simple bag-of-words modelling. Rücklé and Gurevych (2017) demonstrate an approach to non-factoid answer generation with a separate component, based on a BiLSTM, that determines the importance of segments in the input. In contrast to other attention-based models, they determine the importance while assuming the independence of questions and candidate answers. Dhingra et al. (2017) present their Gated-Attention reader for answering cloze-style questions over documents. The reader features a novel multiplicative gating mechanism in combination with a multi-hop architecture, which is based on multiplicative interactions between the query embedding and the intermediate states of a recurrent neural network document reader. This enables the reader to build query-specific representations of tokens in the document for accurate answer selection.
The work mentioned above employs neural networks, but incorporates only a small number of question-answer pairs to build the QA systems. Work on QA that uses the DBpedia knowledge, on the other hand, relies on much more complex linguistically-based NLP techniques to generate an answer for a given question. Abbas et al. (2016) propose a general-purpose QA system using Wikipedia data as its knowledge source, which can answer wh-questions. Their QA system includes the tasks of named entity tagging, question classification, information retrieval and answer extraction. Li and Xu (2016) present a QA approach on the DBpedia dataset. With predefined templates and dependency parsing of the questions, they obtain the entity and property mention from the question. They split the QA process into two steps: first, they build an entity-centric index to search for the target entity; in the second step, they expand the property mention with WordNet and ConceptNet to match the target entity with the searched entity property. Similarly, Forner et al. (2014) present a QA system over Linked Data (DBpedia), which focuses on constructing a bridge between users and the Linked Data. Based on subject-property-object (SPO) triples, each natural language question is first transformed into a triple-based representation (query triple). Then, the corresponding resources in DBpedia, including classes, entities and properties, are mapped to the phrases in the query triples. Finally, the optimal SPARQL query is generated as the output. With this approach, their system can deal not only with single-relation questions but also with complex questions containing multiple relations.
In contrast to the approaches mentioned above, we combine neural network approaches with the large number of facts (> 20M n-triples) extracted from the DBpedia repository.
3 Experimental Setting
In this section, we give an overview of the sequence-to-sequence framework used in this experiment as well as information on the datasets used in our experiment. Furthermore, we give insights into the techniques to evaluate the correctness of the provided answers.
3.1 Sequence-to-sequence Neural Network Toolkit
OpenNMT (Klein et al., 2017) is a generic deep learning framework mainly specialised in sequence-to-sequence (seq2seq) models covering a variety of tasks such as machine translation, summarisation, speech processing and question answering. We used the default neural network training parameters, i.e. 2 hidden layers, 500 hidden LSTM units, input feeding enabled, a batch size of 64, a dropout probability of 0.3 and a dynamic learning rate decay. We trained the network for 13 epochs and report the results in Section 5.
Data Compression - Byte Pair Encoding
A common problem in training a neural network is the computational complexity, which forces the vocabulary to be limited to a specific threshold. Because of this, the neural network cannot learn expressions of rare and unknown words, e.g. domain-specific expressions. If the training method does not see a specific word or phrase multiple times during training, it will not learn the interpretation of the word. This challenge is even more evident in sequence-to-sequence models used for summarisation, question answering or machine translation. The vocabulary is therefore often limited to only 50,000 or 100,000 words (in comparison to 300,000 or more unique words in our training data sets, see Table 1). To overcome this limitation, different methods have been suggested, e.g. character-based neural models (Costa-Jussà and Fonollosa, 2016; Ling et al., 2015) or subword units, e.g. Byte Pair Encoding (BPE). The latter was successfully adapted for word segmentation specifically for the NMT scenario (Sennrich et al., 2015). BPE (Gage, 1994) is a form of data compression that iteratively replaces the most frequent pair of bytes in a sequence with a single, unused byte. Instead of merging frequent pairs of bytes as in the original algorithm, characters or character sequences are merged for the purposes of natural language generation. To achieve this, the symbol vocabulary is initialised with the character vocabulary, and each word is represented as a sequence of characters plus a special end-of-word symbol, which allows the original tokenisation to be restored after the answer has been generated for the given question. The merging process is repeated once for every new symbol that is created.
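The merge procedure described above can be sketched in a few lines of Python. This is a minimal illustration of learning BPE merge operations in the style of Sennrich et al. (2015), not the implementation used in the experiments; the function names `learn_bpe`, `merge_pair` and `segment`, as well as the end-of-word marker `</w>`, are assumptions made for this example.

```python
from collections import Counter

END = "</w>"  # end-of-word marker; the concrete symbol is an assumption


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` by one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge operations from a list of words."""
    # Start from characters: each word is a tuple of characters plus END.
    vocab = Counter(tuple(word) + (END,) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Each merge creates exactly one new symbol in the vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a (possibly unseen) word into subword units using learned merges."""
    symbols = tuple(word) + (END,)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)
```

Because every segment is a concatenation of characters from the original word, joining the segments and removing the end-of-word marker restores the original token, even for words that never occurred in the training data.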
3.2 Training Datasets for the QA system
For our QA system we used the DBpedia repository and the Subtitles corpus. Due to the nature of training a sequence-to-sequence neural model, questions and answers need to be aligned. The statistics on the used data are shown in Table 1.
In our work, we use the DBpedia repository (Lehmann et al., 2015), version 2016-04. The DBpedia project aims to extract structured knowledge from the content of Wikipedia. DBpedia allows users to semantically query relationships and properties of Wikipedia resources, including links to other related datasets.
For our experiment on training a QA system, we extracted the knowledge stored in the DBpedia Infobox Properties (see Section 4).
Since the information extracted from the DBpedia n-triples does not represent natural language, we augmented the extracted n-triples with dialogues from the OpenSubtitles dataset (Tiedemann, 2009). Since this dataset is stored in an XML structure with time codes, we only extracted sentence pairs in which the first sentence ends with a question mark and the second sentence does not. Additionally, to ensure better consistency between a question and its answer, the second sentence has to follow the first sentence by less than 20 seconds. From the 14M sentence corpus
|Question||Answer|
|Where’s Lane going?||Away.|
|Can you just let me out, man?||I tell you what.|
|You trying to get high?||No.|
|you want to become a priest?||Yeah.|
|So you believe me?||I don’t know what to believe.|
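The filtering rules for the subtitle dialogues can be illustrated with a short Python sketch. It assumes that the XML has already been parsed into a list of `(start_time_in_seconds, text)` tuples in subtitle order; this input format and the function name `extract_qa_pairs` are hypothetical and only serve the example.

```python
def extract_qa_pairs(sentences, max_gap_seconds=20.0):
    """Return (question, answer) pairs from consecutive subtitle sentences.

    `sentences` is a list of (start_time_in_seconds, text) tuples in the
    order in which they appear in the subtitle file (assumed input format).
    """
    pairs = []
    for (t1, first), (t2, second) in zip(sentences, sentences[1:]):
        first, second = first.strip(), second.strip()
        if not first.endswith("?"):
            continue  # the first sentence must be a question
        if second.endswith("?"):
            continue  # the second sentence must not be a question
        if not 0 <= t2 - t1 < max_gap_seconds:
            continue  # the answer must follow within less than 20 seconds
        pairs.append((first, second))
    return pairs
```

For instance, a question followed by an answer 1.5 seconds later is kept, whereas the same pair separated by 25 seconds, or a question followed by another question, is discarded.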
3.3 Evaluation Dataset
For the automatic evaluation of the QA system, we used 10,000 entries extracted from the DBpedia Infobox Property repository (Table 1).
3.4 Evaluation Metrics
The automatic evaluation of the proposed QA system is based on the correspondence between the generated answer and the gold standard. For the automatic evaluation we used the BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and chrF (Popović, 2015) metrics. BLEU (Bilingual Evaluation Understudy) is calculated for individual generated segments by comparing their n-grams with those of the gold standard. Considering the shortness of the questions, we report, besides the four-gram overlap (BLEU), also scores based on the unigram overlap (BLEU-1). Those scores, between 0 and 100 (perfect overlap), are then averaged over the whole evaluation dataset to reach an estimate of the overall quality of the generated answers.
METEOR (Metric for Evaluation of Translation with Explicit ORdering) is based on the harmonic mean of precision and recall, whereby recall is weighted higher than precision. Along with exact word (or phrase) matching it has additional features, i.e. stemming, paraphrasing and synonymy matching. In contrast to BLEU, the metric produces good correlation with human judgement at the sentence or segment level. Unlike BLEU, which penalises a hypothesis if words are generated with a different inflection, chrF3 is a character n-gram metric that has shown very good correlation with human judgements for morphologically rich languages.
The chrF score combines character n-gram precision and recall as
$$\mathrm{chrF}_\beta = (1 + \beta^2)\,\frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^2 \cdot \mathrm{chrP} + \mathrm{chrR}},$$
where chrP represents the percentage of character n-grams in the generated answer which have a counterpart in the gold standard answer, whereby chrR represents the percentage of character n-grams in the gold standard which are also present in the predicted answer. $\beta$ is a parameter which assigns $\beta$ times more importance to recall than to precision; chrF3 corresponds to $\beta = 3$.
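To make the unigram variant concrete, the following Python sketch computes a simplified sentence-level BLEU-1 (clipped unigram precision multiplied by the brevity penalty) and averages it over a set of answers. It is an illustration under simplifying assumptions, namely whitespace tokenisation, a single reference and no smoothing, and not the scoring script used for the reported results; the names `bleu1` and `average_bleu1` are chosen for this example.

```python
import math
from collections import Counter


def bleu1(hypothesis, reference):
    """Simplified sentence-level BLEU-1 on a 0-100 scale."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    ref_counts = Counter(ref)
    # Clipped counts: a word is credited at most as often as it occurs in the reference.
    clipped = sum(min(count, ref_counts[word]) for word, count in Counter(hyp).items())
    precision = clipped / len(hyp)
    # Brevity penalty: answers shorter than the reference are penalised.
    penalty = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return 100.0 * penalty * precision


def average_bleu1(pairs):
    """Average BLEU-1 over (generated answer, gold answer) pairs."""
    scores = [bleu1(hyp, ref) for hyp, ref in pairs]
    return sum(scores) / len(scores)
```

Under this definition, an exact match such as seoul against seoul scores 100, whereas a synonym such as lawyer against attorney scores 0, which illustrates why overlap-based scores can underestimate answer quality.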
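A compact Python sketch of a character n-gram F-score is given below. It follows the general definition of chrF (Popović, 2015), with precision and recall averaged over n-gram orders; the maximum order of 6, the treatment of whitespace as an ordinary character and the skipping of orders longer than either string are assumptions of this sketch, so its values may differ from those of the official tool.

```python
from collections import Counter


def char_ngrams(text, n):
    """Multiset of character n-grams of `text`."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chrf(hypothesis, reference, max_n=6, beta=3.0):
    """Character n-gram F-score on a 0-100 scale (beta=3 gives chrF3)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp, ref = char_ngrams(hypothesis, n), char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # assumption: skip orders longer than either string
        overlap = sum((hyp & ref).values())  # multiset intersection
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    chr_p = sum(precisions) / len(precisions)
    chr_r = sum(recalls) / len(recalls)
    if chr_p + chr_r == 0:
        return 0.0
    return 100.0 * (1 + beta ** 2) * chr_p * chr_r / (beta ** 2 * chr_p + chr_r)
```

Unlike a word-level metric, this score gives partial credit to related forms such as france and french, and with beta set to 3 an answer that covers the whole gold standard is scored higher than one that covers only part of it.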
|Question||Answer|
|name of Aristotle||Aristotle|
|birth Date of Aristotle||384|
|death Date of Aristotle||322, Euboea, Greece|
|era of Aristotle||Ancient philosophy|
|religion of Aristotle||Western philosophy|
|school Tradition of Aristotle||Peripatetic school, Aristotelianism|
|main Interests of Aristotle||Biology, Zoology, Physics, Metaphysics, Logic, Ethics, Rhetoric, Music, Poetry, Theatre, Politics, Government|
|notable Ideas of Aristotle||Golden mean, Term logic, Syllogism, Hexis, Hylomorphism, On the Soul|
|influences of Aristotle||Parmenides, Socrates, Plato, Heraclitus, Democritus|
Training a sequence-to-sequence neural model requires a dataset of questions and answers that are aligned on the sentence level. To train our QA system on generic knowledge, we extracted the required information from the DBpedia Infobox Property repository.
DBpedia n-triple extraction
We used the DBpedia Infobox Properties to extract the data required to train the sequence-to-sequence model. We treat the subject and predicate of each instance in the n-triple repository as a question, whereby the object of the n-triple represents the answer to the question. The top part of Table 3 shows the DBpedia Infobox Properties of the entry Aristotle, whereby the bottom part shows the extracted information used to train the neural network. As an example, we extract from …/resource/Aristotle (subject of the triple) and …/property/era (predicate) the question era of Aristotle, and use the object of the n-triple, i.e. Ancient philosophy, as the answer to the question.
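The conversion of a single n-triple into a question-answer pair can be sketched as follows. This is an illustrative reconstruction, not the extraction script used for the experiments: the regular expressions cover only simple N-Triples lines without escaped characters, deriving a label from the last path segment of a URL is an assumption, and the `http://example.org/...` addresses in the usage example are placeholders rather than actual DBpedia identifiers.

```python
import re
from urllib.parse import unquote

# Subject and predicate are IRIs in angle brackets; the object is an IRI or a literal.
TRIPLE = re.compile(r'^<([^>]*)>\s+<([^>]*)>\s+(.+?)\s*\.\s*$')
LITERAL = re.compile(r'^"(.*)"(?:@[A-Za-z-]+|\^\^<[^>]*>)?$')


def local_name(iri):
    """Last path segment of an IRI, with underscores turned into spaces."""
    name = re.split(r"[/#]", iri.rstrip("/"))[-1]
    return unquote(name).replace("_", " ")


def split_camel_case(name):
    """Insert a space at lower-to-upper case boundaries, e.g. birthDate -> birth Date."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)


def object_to_text(obj):
    """Turn the object of a triple into plain text."""
    if obj.startswith("<") and obj.endswith(">"):
        # Assumption: a URL-encoded object is replaced by a label derived from its IRI.
        return local_name(obj[1:-1])
    match = LITERAL.match(obj)
    return match.group(1) if match else obj


def triple_to_qa(line):
    """Map one N-Triples line to a (question, answer) pair, or None if it does not parse."""
    match = TRIPLE.match(line.strip())
    if match is None:
        return None
    subject, predicate, obj = match.groups()
    question = f"{split_camel_case(local_name(predicate))} of {local_name(subject)}"
    return question, object_to_text(obj)
```

Applied to a line such as `<http://example.org/resource/Aristotle> <http://example.org/property/era> <http://example.org/resource/Ancient_philosophy> .`, the sketch yields the question era of Aristotle and the answer Ancient philosophy.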
In this section we report the results of the automatic and a manual evaluation of the answers generated by the neural models.
5.1 Automatic Evaluation
The results of the automatic evaluation using the BLEU, METEOR and chrF metrics are shown in Table 4. We observe BLEU and METEOR scores, which evaluate the generated answers based on n-gram overlap, of 16.48 and 13.97, respectively. The BLEU-1 and chrF scores, which operate on the single-word and character level, are higher, at 56.0 and 26.84, respectively.
|METEOR score||Property||Occurrences in test set|
|91.15||next single of||19|
|87.09||basin countries of||7|
|82.00||direction a of||5|
|79.93||show name of||20|
|76.32||official name of||107|
|75.07||media type of||10|
|71.68||location country of||6|
|1.39||operating system of||3|
|Question||Gold standard||Generated answer|
|domain of acidomonas methanolica||bacteria||bacteria|
|capital of korean empire||seoul||seoul|
|profession of linda j. lezotte||attorney||lawyer|
|country of chillicothe high school||usa||united states|
|nationality of jean michel diot||french||france|
|short description of clark duke||actor||football player|
|place of birth of edith roger||vestby,||vienna,|
5.2 Manual Evaluation
Besides the automatic evaluation of answer generation over the DBpedia Infobox properties, we performed a detailed manual evaluation of the answers produced by the neural networks.
We first evaluated, based on the METEOR score, for which DBpedia properties (of the property-subject pair) the answers were generated correctly. Table 5 illustrates that answers to questions with the property next single of or basin countries of were mostly generated correctly. On the other hand, answers to questions containing the properties operating system of or intercommunality of were mostly generated incorrectly.
Finally, Table 6 presents different examples of answers based on the DBpedia Infobox Properties. Although the properties domain of and capital of appear only once in the evaluation set, they were answered correctly. Furthermore, the neural networks often provide answers that are synonyms of, or closely related to, the gold standard answers. As an example, the generated answer lawyer is a synonym of attorney, and usa can also be referred to as united states. Similarly, france is closely related to french, although it cannot be counted as a correctly generated answer to the question nationality of jean michel diot. Lastly, the proposed approach also produced various incorrect answers. As an example, clark duke is not a football player, and edith roger was born in vestby, norway and not in vienna, austria, as answered by the proposed neural network approach.
We presented our work on a QA system trained with a sequence-to-sequence neural model on DBpedia knowledge and movie dialogues. Although the automatic evaluation shows a low overlap between the generated answers and the gold standard, a manual inspection showed promising outcomes from the experiment. Due to the nature of the training dataset, short answers are preferred, since they are more likely to have a higher log-likelihood score (i.e. a lower negative log-likelihood) than longer ones. Nevertheless, we observed several correct answers, which indicates the feasibility of storing the DBpedia knowledge in neural networks. Our future work will focus on providing longer answers, as well as on answering more complex questions.
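A per-property ranking of this kind can be obtained by grouping sentence-level scores by property and averaging them. The sketch below assumes that the property of each test question is known from the underlying triple and that each answer has already been scored; the input format and the name `rank_properties` are hypothetical, and averaging per-answer scores is an assumption about how such a table could be produced.

```python
from collections import defaultdict


def rank_properties(records):
    """Rank properties by their average score.

    `records` is an iterable of (property, score) pairs, one per test question.
    Returns (average score, property, number of test questions) tuples,
    sorted from the best to the worst property.
    """
    by_property = defaultdict(list)
    for prop, score in records:
        by_property[prop].append(score)
    rows = [(sum(scores) / len(scores), prop, len(scores))
            for prop, scores in by_property.items()]
    return sorted(rows, key=lambda row: row[0], reverse=True)
```

The property is passed in explicitly rather than parsed from the question text, because a question such as place of birth of edith roger contains more than one occurrence of "of" and cannot be split unambiguously.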
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) under Grant Number SFI/12/RC/2289 (Insight).
- We transform all objects into labels if the n-triple object is encoded as a URL.
- F. Abbas, M. K. Malik, M. U. Rashid, and R. Zafar. 2016. WikiQA – a question answering system on Wikipedia using Freebase, DBpedia and Infobox. In 2016 Sixth International Conference on Innovative Computing Technology (INTECH), pages 185–193.
- Marta R. Costa-Jussà and José A. R. Fonollosa. 2016. Character-based neural machine translation. CoRR, abs/1603.00810.
- Michael Denkowski and Alon Lavie. 2014. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the EACL 2014 Workshop on Statistical Machine Translation.
- Bhuwan Dhingra, Hanxiao Liu, Zhilin Yang, William Cohen, and Ruslan Salakhutdinov. 2017. Gated-attention readers for text comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1832–1846. Association for Computational Linguistics.
- Pamela Forner, Roberto Navigli, Dan Tufis, and Nicola Ferro, editors. 2014. Working Notes for CLEF 2013 Conference, Valencia, Spain, September 23-26, 2013, volume 1179 of CEUR Workshop Proceedings. CEUR-WS.org.
- Philip Gage. 1994. A new algorithm for data compression. C Users J., 12(2):23–38.
- Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. CoRR, abs/1701.02810.
- Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N. Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick van Kleef, Sören Auer, and Christian Bizer. 2015. DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 6(2):167–195.
- H. Li and F. Xu. 2016. Question answering with DBpedia based on the dependency parser and entity-centric index. In 2016 International Conference on Computational Intelligence and Applications (ICCIA), pages 41–45.
- Jiwei Li, Will Monroe, Tianlin Shi, Sébastien Jean, Alan Ritter, and Dan Jurafsky. 2017. Adversarial learning for neural dialogue generation. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, EMNLP 2017, Copenhagen, Denmark, September 9-11, 2017, pages 2157–2169.
- Wang Ling, Isabel Trancoso, Chris Dyer, and Alan W. Black. 2015. Character-based neural machine translation. CoRR, abs/1511.04586.
- Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, ACL ’02, pages 311–318.
- Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.
- Andreas Rücklé and Iryna Gurevych. 2017. Representation learning for answer selection with LSTM-based importance weighting. In IWCS 2017 - 12th International Conference on Computational Semantics - Short papers.
- Rico Sennrich, Barry Haddow, and Alexandra Birch. 2015. Neural machine translation of rare words with subword units. CoRR, abs/1508.07909.
- Iulian V. Serban, Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau. 2016. Building end-to-end dialogue systems using generative hierarchical neural network models. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, AAAI’16, pages 3776–3783. AAAI Press.
- Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Proceedings of the 27th International Conference on Neural Information Processing Systems, NIPS’14, pages 3104–3112, Cambridge, MA, USA. MIT Press.
- Jörg Tiedemann. 2009. News from OPUS - A collection of multilingual parallel corpora with tools and interfaces. In N. Nicolov, K. Bontcheva, G. Angelova, and R. Mitkov, editors, Recent Advances in Natural Language Processing, volume V. John Benjamins, Amsterdam/Philadelphia, Borovets, Bulgaria.
- Christina Unger, André Freitas, and Philipp Cimiano. 2014. An Introduction to Question Answering over Linked Data. Springer International Publishing, Cham.
- Oriol Vinyals and Quoc V. Le. 2015. A neural conversational model. CoRR, abs/1506.05869.
- Dirk Weissenborn, Georg Wiese, and Laura Seiffe. 2017. Making neural QA as simple as possible but not simpler. In CoNLL.
- Jun Yin, Xin Jiang, Zhengdong Lu, Lifeng Shang, Hang Li, and Xiaoming Li. 2016. Neural generative question answering. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, IJCAI’16, pages 2972–2978. AAAI Press.