Commonsense Properties from Query Logs and Question Answering Forums
Commonsense knowledge about object properties, human behavior and general concepts is crucial for robust AI applications. However, automatic acquisition of this knowledge is challenging because of sparseness and bias in online sources. This paper presents Quasimodo, a methodology and tool suite for distilling commonsense properties from non-standard web sources. We devise novel ways of tapping into search-engine query logs and QA forums, and combining the resulting candidate assertions with statistical cues from encyclopedias, books and image tags in a corroboration step. Unlike prior work on commonsense knowledge bases, Quasimodo focuses on salient properties that are typically associated with certain objects or concepts. Extensive evaluations, including extrinsic use-case studies, show that Quasimodo provides better coverage than state-of-the-art baselines with comparable quality.
1.1. Motivation and Goal
Commonsense knowledge (CSK for short) is an old theme in AI, already envisioned by McCarthy in the 1960s (mccarthy1960programs) and later pursued by AI pioneers like Feigenbaum (feigenbaum1984knowledge) and Lenat (lenat1995cyc). The goal is to equip machines with knowledge of properties of everyday objects (e.g., bananas are yellow, edible and sweet), typical human behavior and emotions (e.g., children like bananas, children learn at school, death causes sadness) and general plausibility invariants (e.g., a classroom of children should also have a teacher). In recent years, research on automatic acquisition of such knowledge has been revived, driven by the pressing need for human-like AI systems with robust and explainable behavior. Important use cases of CSK include the interpretation of user intents in search-engine queries, question answering, versatile chatbots, language comprehension, visual content understanding, and more.
Examples: A keyword query such as “Jordan weather forecast” is ambiguous, but CSK should tell the search engine that this refers to the country and not to a basketball player or machine learning professor. A chatbot should know that racist jokes are considered tasteless and would offend its users; so CSK could have avoided the 2016 PR disaster of the Tay chatbot (see www.cnbc.com/2018/03/17/facebook-and-youtube-should-learn-from-microsoft-tay-racist-chatbot.html). In an image of a meeting at an IT company where one person wears a suit and another person is in jeans and t-shirt, the former is likely a manager and the latter an engineer. Last but not least, a “deep fake” video where Donald Trump rides on the back of a tiger could be easily uncovered by knowing that tigers are wild and dangerous and, if at all, only circus artists would do this.
The goal of this paper is to advance the automatic acquisition of salient commonsense properties from online content of the Internet. For knowledge representation, we focus on simple assertions in the form of subject-predicate-object (SPO) triples such as children like banana or classroom includes teacher. Complex assertions, such as Datalog clauses, and logical reasoning over these are outside our scope.
A major difficulty that prior work has struggled with is the sparseness and bias of possible input sources. Commonsense properties are so mundane that they are rarely expressed in explicit terms (e.g., countries or regions have weather, people don’t). Therefore, typical sources for information extraction like Wikipedia are fairly useless for CSK. Moreover, online contents, like social media (Twitter, Reddit, Quora etc.), fan communities (Wikia etc.) and books or movies, are often heavily biased and do not reflect typical real-life situations. For example, existing CSK repositories contain odd triples such as banana located_in monkey’s_hand, engineer has_property conservative, child make choice.
1.2. State of the Art and Limitations
Popular knowledge bases like DBpedia, Wikidata or Yago have a strong focus on encyclopedic knowledge about individual entities like (prominent) people, places etc., and do not cover commonsense properties of general concepts. The notable exception is the inclusion of SPO triples for the (sub-)type (aka. isa) predicate, for example, banana type fruit. Such triples are ample especially in Yago (derived from Wikipedia categories and imported from WordNet). Our focus is on additional properties beyond type, which are absent in all of the above knowledge bases.
The most notable projects on constructing commonsense knowledge bases are Cyc (lenat1995cyc), ConceptNet (conceptnet), WebChild (webchild) and Mosaic TupleKB (tuplekb). Each of these has specific strengths and limitations. The seminal Cyc project solely relied on human experts for codifying logical assertions, with inherent limitations in scope and scale. ConceptNet used crowdsourcing for scalability and better coverage, but is limited to only a few different predicates like has_property, located_in, used_for, capable_of, has_part and type. Moreover, the crowdsourced inputs often take noisy, verbose or uninformative forms (e.g., banana type bunch, banana type herb, banana has_property good_to_eat). WebChild tapped into book n-grams and image tags to overcome the bias in many Web sources. It has a wider variety of 20 predicates and is much larger, but contains a heavy tail of noisy and dubious triples – due to its focus on possible properties rather than typical ones (e.g., engineers are conservative, cool, qualified, hard, vital etc.). TupleKB is built by carefully generating search-engine queries on specific domains and performing various stages of information extraction and cleaning on the query results. Despite its clustering-based cleaning steps, it contains substantial noise and is limited in scope by the way the queries are formulated.
The work in this paper aims to overcome the bottlenecks of these prior projects while preserving their positive characteristics. In particular, we aim to achieve high coverage, like WebChild, with high precision (i.e., a high fraction of valid triples), like ConceptNet. In addition, we strive to acquire properties for a wide range of predicates – more diverse and refined than ConceptNet and WebChild, but without the noise that TupleKB has acquired.
1.3. Approach and Challenges
This paper puts forward Quasimodo, a framework and tool for scalable automatic acquisition of commonsense properties. (The name stands for: Query Logs and QA Forums for Salient Commonsense Definitions. Quasimodo is the main character in Victor Hugo’s novel “The Hunchback of Notre Dame” who epitomizes human preconception and also exhibits unexpected traits.) Quasimodo is designed to tap into non-standard sources where questions rather than statements provide cues about commonsense properties. This leads to noisy candidates for populating a commonsense knowledge base (CSKB). To eliminate false positives, we have devised a subsequent cleaning stage, where corroboration signals are obtained from a variety of sources and combined by learning a regression model. This way, Quasimodo reconciles wide coverage with high precision. In doing this, it focuses on salient properties which typically occur for common concepts, while eliminating possible but atypical and uninformative output. This counters the reporting bias – frequent mentioning of sensational but unusual and unrealistic properties (e.g., pink elephants in Walt Disney’s Dumbo).
The new sources that we tap into for gathering candidate assertions are search-engine query logs and question answering forums like Reddit, Quora etc. Query logs are unavailable outside industrial labs, but can be sampled by using search-engine interfaces in a creative way. To this end, Quasimodo generates queries in a judicious way and collects auto-completion suggestions. The subsequent corroboration stage harnesses statistics from search-engine answer snippets, Wikipedia editions, Google Books and image tags by means of a learned regression model. This step is geared to eliminate noisy, atypical, and uninformative properties.
A subsequent ranking step further enhances the knowledge quality in terms of typicality and saliency. Finally, to counter noisy language diversity, reduce semantic redundancy, and canonicalize the resulting commonsense triples to a large extent, Quasimodo includes a novel way of clustering the triples that result from the fusion step. This is based on a tri-factorization model for matrix decomposition.
Our approach faces two major challenges:
(i) coping with the heavy reporting bias in cues from query logs, potentially leading to atypical and odd properties, and
(ii) coping with the noise, language diversity, and semantic redundancy in the output of information extraction methods.
The paper shows how these challenges can be (largely) overcome. Experiments demonstrate the practical viability of Quasimodo and its improvements over prior works.
The paper makes the following original contributions:
(i) a complete methodology and tool for multi-source acquisition of typical and salient commonsense properties with principled methods for corroboration, ranking and refined grouping,
(ii) novel ways of tapping into non-standard input sources like query logs and QA forums,
(iii) a high-quality knowledge base of ca. 2.21 million salient properties for ca. 52,000 concepts, which will be made publicly available as a research resource (https://www.dropbox.com/sh/r1os5uoo6v2xiac/AADinRFpUYSg1kQLm63pdMnOa?dl=0),
(iv) an experimental evaluation and comparison to ConceptNet, WebChild, and TupleKB which shows major gains in coverage and quality, and
(v) experiments on extrinsic tasks like language games (Taboo word guessing) and question answering.
Our code will be made available on GitHub.
2. Related Work
Commonsense Knowledge Bases (CSKBs). The most notable projects on building large commonsense knowledge bases are the following.
The Cyc project was the first major effort towards collecting and formalizing general world knowledge (lenat1995cyc). Knowledge engineers manually compiled knowledge, in the form of grounded assertions and logical rules. Parts of Cyc were released to the public as OpenCyc in 2002, but these parts mostly focus on concept taxonomies, that is, the (sub-)type predicate.
Crowdsourcing has been used to construct ConceptNet, a triple-based semantic network of commonsense assertions about general objects (DBLP:conf/lrec/SpeerH12; conceptnet). ConceptNet contains ca. 1.3 million assertions for ca. 850,000 subjects (counting only English assertions and semantic relations, i.e., discounting relations like synonym or derivedFrom). The focus is on a small number of broad-coverage predicates, namely, type, locationOf, usedFor, capableOf, hasPart. ConceptNet is one of the highest-quality and most widely used CSK resources.
WebChild has been automatically constructed from book n-grams (and, to a smaller degree, image tags) by a pipeline of information extraction, statistical learning and constraint reasoning methods (DBLP:conf/wsdm/TandonMSW14; webchild). WebChild contains ca. 13 million assertions, and covers 20 distinct predicates such as hasSize, hasShape, physicalPartOf, memberOf, etc. It is the biggest of the publicly available commonsense knowledge bases, with the largest slice being on part-whole knowledge (DBLP:conf/aaai/TandonHURRW16). However, a large mass of WebChild’s contents is in the long tail of possible but not necessarily typical and salient properties. So it comes with a substantial amount of noise and non-salient contents.
The Mosaic project at AI2 aims to collect commonsense knowledge in various forms, from grounded triples to procedural knowledge with first-order logic. TupleKB, released as part of this ongoing project, is a collection of triples for the science domain, compiled by generating domain-specific queries and extracting assertions from the resulting web pages. A subsequent cleaning step, based on integer linear programming, clusters triples into groups. TupleKB contains ca. 280,000 triples for ca. 30,000 subjects.
Wikidata is a collaboratively built knowledge base that is mostly geared to organize encyclopedic facts about individual entities like people, places, organizations etc. (vrandevcic2014wikidata; DBLP:conf/semweb/MalyshevKGGB18). It contains more than 400 million assertions for more than 50 million items. This includes some world knowledge about general concepts, like type triples, but this coverage is very limited. For instance, Wikidata neither knows that birds can fly nor that elephants have trunks.
Use Cases of CSK. Commonsense knowledge and reasoning are instrumental in a variety of applications in natural language processing, computer vision, and AI in general. These include question answering, especially for general world comprehension (swag) and science questions (schoenick2016moving). Sometimes, these use cases also involve additional reasoning (e.g., (DBLP:conf/emnlp/TandonDGYBC18)), where CSK contributes, too. Another NLP application is dialog systems and chatbots (e.g., (young2018augmenting)), where CSK adds plausibility priors to language generation.
For visual content understanding, such as object detection or caption generation for images and videos, CSK can contribute as an informed prior about spatial co-location derived, for example, from image tags, and about human activities and associated emotions (e.g., (xu2018automatic; yatskar2016stating; DBLP:conf/wsdm/ChowdhuryTFW18)). In such settings, CSK is an additional input to supervised deep-learning methods.
Information Extraction from Query Logs. Prior works have tapped into query logs for goals like query recommendation (e.g., (cao2008context)) and extracting semantic relationships between search terms, like synonymy and hypernymy/hyponymy (e.g., (baeza-yates2007; DBLP:conf/sigmod/WuLWZ12; DBLP:conf/emnlp/Pasca13; DBLP:conf/acl/PascaD08)). The latter can be seen as gathering triples for CSK, but its sole focus is on the (sub-)type predicate – so the coverage of the predicate space is restricted to class/type taxonomies. Moreover, these projects were carried out on full query logs within industrial labs of search-engine companies. In contrast, Quasimodo addresses a much wider space of predicates and operates with an original way of sampling query-log-derived signals via auto-completion suggestions. To the best of our knowledge, no prior work has aimed to harness auto-completion for CSK acquisition (cf. (autocomplete)).
The methodologically closest work to ours is (DBLP:conf/cikm/Pasca15). Like us, that work used interrogative patterns (e.g. “Why do …”) to mine query logs – with full access to the search-engine company’s logs. Unlike us, subjects, typically classes/types such as “cars” or “actors”, were merely associated with salient phrases from the log rather than extracting complete triples. One can think of this as organizing CSK in SP pairs where P is a textual phrase that comprises both predicate and object but cannot separate these two. Moreover, (DBLP:conf/cikm/Pasca15) restricted itself to the extraction stage and used simple scoring from query frequencies, whereas we go further by leveraging multi-source signals in the corroboration stage and refining the SPO assertions into semantic groups.
3. System Overview
Quasimodo is designed to cope with the high noise and potentially strong bias in online contents. It taps into query logs via auto-completion suggestions as a non-standard input source. However, frequent queries – which are the ones that are visible through auto-completion – are often about sensational and atypical issues. Therefore, Quasimodo combines a recall-oriented candidate gathering phase with two subsequent phases for cleaning, refining, and ranking assertions. Figure 1 gives a pictorial overview of the system architecture.
Candidate Gathering. In this phase, we extract candidate triples from some of the world’s largest sources of the “wisdom of crowds”, namely, search-engine query logs and question answering forums such as Reddit or Quora. While the latter can be directly accessed via search APIs, query logs are unavailable outside of industrial labs. Therefore, we creatively probe and sample this guarded resource by means of generating queries and observing auto-completion suggestions by the search engine. The resulting suggestions are typically among the statistically frequent queries. As auto-completion works only for short inputs of a few words, we generate queries that are centered on candidate subjects, the S argument in the SPO triples that we aim to harvest. Technical details are given in Section 4.
Corroboration. This phase is precision-oriented, aiming to eliminate false positives from the candidate gathering. We consider candidates as invalid for three possible reasons: 1) they do not make sense (e.g., programmers eat python); 2) they are not typical properties for the instances of the S concept (e.g., programmers drink espresso); 3) they are not salient in the sense that they are immediately associated with the S concept by most humans (e.g., programmers visit restaurants). To statistically check to which degree these aspects are satisfied, Quasimodo harnesses corroboration signals in a multi-source scoring step. This includes standard sources like Wikipedia articles and books, which were used in prior works already, but also non-standard sources like image tags and answer snippets from search-engine queries. Technical details are given in the section on corroboration.
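To make the idea of multi-source scoring concrete, the following is a minimal sketch under stated assumptions. It assumes a logistic form for the regression model and per-source signals normalized to [0, 1]; the weights, the bias, and the names WEIGHTS, BIAS and corroboration_score are hypothetical and chosen for illustration only, not taken from the Quasimodo implementation.

```python
import math
from typing import Dict

# Hypothetical weights, one per corroboration source named in the text.
# In practice these would be learned from labeled triples, not set by hand.
WEIGHTS: Dict[str, float] = {
    "wikipedia": 1.2,
    "books": 0.8,
    "image_tags": 0.5,
    "snippets": 1.0,
}
BIAS = -1.5  # hypothetical intercept


def corroboration_score(signals: Dict[str, float]) -> float:
    """Combine per-source signals (assumed to lie in [0, 1]) into one score.

    Missing sources count as 0.0. The logistic link is an assumption;
    the text only states that a regression model is learned.
    """
    z = BIAS + sum(w * signals.get(name, 0.0) for name, w in WEIGHTS.items())
    return 1.0 / (1.0 + math.exp(-z))


# A candidate supported by every source versus one with no support at all.
strong = corroboration_score(
    {"wikipedia": 1.0, "books": 1.0, "image_tags": 1.0, "snippets": 1.0}
)
weak = corroboration_score({})
```

With these illustrative weights, a candidate corroborated by all sources scores higher than one without any support, which is the behavior the cleaning step relies on.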
Ranking. To identify typical and salient triples, we devised a probabilistic ranking model with the corroboration scores as input signal. This stage is described in the section on ranking.
Grouping. For this phase, we have devised a clustering method based on the model of tri-factorization for matrix decomposition (DBLP:conf/kdd/DingLPP06). The output consists of groups of SO pairs and P phrases linked to each other. So we semantically organize and refine both the concept arguments (S and O) in a commonsense triple and the way the predicate (P) is expressed in language. Ideally, this would canonicalize all three components, in analogy to what prior works have achieved for entity-centric encyclopedic knowledge bases (e.g., (DBLP:conf/www/SuchanekSW09; DBLP:conf/cikm/GalarragaHMS14)). However, commonsense assertions are rarely as crisp as facts about individual entities, and often carry subtle variation and linguistic diversity (e.g., live in and roam in for animals being near-synonymous but not quite the same). Our clustering method also brings out refinements of predicates. This is in contrast to prior work on CSK which has mostly restricted itself to a small number of coarse-grained predicates like partOf, usedFor, locatedAt, etc. Technical details are given in the section on grouping.
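For orientation, a non-negative matrix tri-factorization of the kind referenced above can be written in generic notation as follows. The symbols X, F, S, G and the cluster counts k and l are illustrative assumptions rather than notation from this paper: X is taken to be a non-negative m-by-n matrix whose rows correspond to SO pairs and whose columns correspond to P phrases, F assigns SO pairs to k groups, G assigns P phrases to l groups, and S links the two kinds of groups. The orthogonality constraints reflect one common formulation and may differ from the variant actually used.

```latex
\min_{F \ge 0,\; S \ge 0,\; G \ge 0}
  \; \lVert X - F S G^{\top} \rVert_F^{2}
\quad \text{s.t.} \quad
  F^{\top} F = I, \;\; G^{\top} G = I,
\qquad
  X \in \mathbb{R}_{\ge 0}^{m \times n}, \;
  F \in \mathbb{R}_{\ge 0}^{m \times k}, \;
  S \in \mathbb{R}_{\ge 0}^{k \times l}, \;
  G \in \mathbb{R}_{\ge 0}^{n \times l}.
```

Under this reading, a large entry of S at position (a, b) indicates that the a-th group of SO pairs is strongly associated with the b-th group of P phrases.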
4. Candidate Gathering
The key idea for this phase is to utilize questions as a source of human commonsense. For example, the question “Why do dogs bark?” implicitly conveys the user’s knowledge that dogs bark. Questions of this kind are posed in QA forums, such as Reddit or Quora, but their frequency and coverage in these sources alone is not sufficient for building a comprehensive knowledge base. Therefore, we additionally tap into query logs from search engines, sampled through observing auto-completion suggestions. Although most queries merely consist of a few keywords, there is a substantial fraction of user requests in interrogative form (DBLP:conf/www/WhiteRY15).
4.1. Data Sources
Quasimodo exploits two data sources: (i) QA forums, which return questions in user posts through their search APIs, and (ii) query logs from major search engines, which are sampled by generating query prefixes and observing their auto-completions.
QA forums. We use four different QA forums: Quora, Yahoo! Answers (answers.yahoo.com and webscope.sandbox.yahoo.com), Answers.com, and Reddit. The first three are online communities for general-purpose QA across many topics, and Reddit is a large discussion forum with a wide variety of topical categories.
Search engine logs. Search engine logs are rich collections of questions. While logs themselves are not available outside of industrial labs, search engines allow us to glimpse some of their underlying statistics through auto-completion suggestions. Figure 2 shows an example of this useful asset. Quasimodo utilizes Google and Bing, which typically return 5 to 10 suggestions for a given query prefix. In order to obtain more results, we recursively probe the search engine with increasingly longer prefixes that cover all letters of the alphabet, until the number of auto-completion suggestions drops below 5. For example, the query prefix “why do cats” is expanded into “why do cats a”, “why do cats b”, and so on.
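The recursive probing described above can be sketched as follows. This is an illustration under stated assumptions, not the Quasimodo code: the function probe, its parameters, the depth cap max_depth, and the stand-in fake_suggest (which serves suggestions from a small in-memory list instead of calling a search engine) are all hypothetical. No real auto-completion API is used.

```python
import string
from typing import Callable, List, Set


def probe(prefix: str,
          suggest: Callable[[str], List[str]],
          min_suggestions: int = 5,
          max_depth: int = 2,
          _sep: str = " ") -> Set[str]:
    """Collect suggestions for `prefix`, expanding it letter by letter.

    Expansion stops when fewer than `min_suggestions` come back (the
    stopping rule from the text) or when `max_depth` is exhausted
    (an extra safety cap assumed here to bound the number of requests).
    """
    suggestions = suggest(prefix)
    collected = set(suggestions)
    if len(suggestions) < min_suggestions or max_depth == 0:
        return collected
    for letter in string.ascii_lowercase:
        # The first expansion inserts a space ("why do cats a");
        # deeper expansions append letters directly ("why do cats ab").
        collected |= probe(prefix + _sep + letter, suggest,
                           min_suggestions, max_depth - 1, _sep="")
    return collected


# Hypothetical stand-in for a search engine: returns at most 5 stored
# queries that start with the given prefix.
FAKE_LOG = [
    "why do cats purr", "why do cats meow", "why do cats knead",
    "why do cats hiss", "why do cats sleep so much",
    "why do cats hate water", "why do cats bite",
]


def fake_suggest(prefix: str) -> List[str]:
    return [q for q in FAKE_LOG if q.startswith(prefix)][:5]


harvested = probe("why do cats", fake_suggest)
```

In this toy setting, the initial prefix yields only five of the seven stored queries; the letter-wise expansion recovers the remaining two, which is the purpose of the recursion.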
We intentionally restrict ourselves to query prefixes in interrogative form, as these are best suited to convey commonsense knowledge. In contrast, simple keyword queries are often auto-completed with references to prominent entities (celebrities, sports teams, product names, etc.), given the dominance of such queries in the overall Internet (e.g., the query prefix "cat" is expanded into "cat musical"). These very frequent queries are not useful for CSK acquisition.
In total, we collected 11,603,121 questions from auto-completion.
4.2. Question Patterns
We performed a quantitative analysis of frequent question words and patterns on Reddit. As a result, we decided to pursue two question words, Why and How, in combination with the verbs is, do, are, does, can, can’t, resulting in 12 patterns in total. Their relative frequency in the question set that we gathered by auto-completion is shown in the accompanying table of pattern statistics. For forums, we performed title searches centered around these patterns. For search engines, we appended subjects of interest to the patterns for query generation (e.g., “Why do cats” for cats as subject). The subjects were chosen from the common nouns extracted from WordNet (Miller:1995:WLD:219717.219748).
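The pattern construction above is simple enough to sketch directly. The function names question_patterns and query_prefixes are hypothetical. As a simplifying assumption, the sketch combines every pattern with every subject and performs no number agreement, so it also produces ungrammatical prefixes such as "why is cats"; whether and how such prefixes are filtered is not specified here.

```python
from itertools import product
from typing import List

QUESTION_WORDS = ["why", "how"]
VERBS = ["is", "do", "are", "does", "can", "can't"]


def question_patterns() -> List[str]:
    """Return all question-word/verb combinations (2 x 6 = 12 patterns)."""
    return [f"{word} {verb}" for word, verb in product(QUESTION_WORDS, VERBS)]


def query_prefixes(subject: str) -> List[str]:
    """Append a subject to every pattern to form auto-completion prefixes."""
    return [f"{pattern} {subject}" for pattern in question_patterns()]


patterns = question_patterns()
cat_prefixes = query_prefixes("cats")
```

Each resulting prefix, for instance "why do cats", would then serve as a starting point for sampling auto-completion suggestions.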