Wikidata: A New Paradigm of Human-Bot Collaboration?
Wikidata is a collaborative knowledge graph which has already drawn the attention of practitioners and researchers. It is the work of a community of volunteers, supported by policies, guidelines and automatic programs (bots) which perform a broad range of tasks, doing the lion’s share of the work on the platform. In this paper, we highlight some of the most salient aspects of human-bot collaboration in Wikidata. We argue that the combination of automated and semi-automated work produces new challenges with respect to other online collaboration platforms.
Wikidata; bots; collaborative knowledge engineering
Wikidata is a relatively young project, yet it has already drawn great attention from practitioners and researchers. It is a collaborative knowledge graph, i.e. a knowledge base that describes real-world entities and the relationships among them, organised in a graph [?]. Several features make Wikidata worthy of interest. Since its inception in it has already gathered a community of k monthly active users, who have built a graph that covers facts about around M entities. In relation to CSCW, some researchers have considered Wikidata a new type of platform, at the intersection of peer-production systems and collaborative ontology development projects [?].
The steep growth of Wikidata can be attributed in large part to the work of bots, pieces of software programmed to perform a range of tasks, including importing new data from different sources. Especially in the early years of Wikidata’s life, bots’ contributions boosted the growth of the graph by adding a large number of facts. While this has allowed Wikidata to reach a size large enough for users to build upon it and produce a usable knowledge source, it has also posed challenges regarding the quality of automated work and its effects on the community. Although some of these challenges have been touched upon by prior work (e.g. the difficulty for editors to control the quality of bot-generated data [?]), they have never been clearly identified. This is the aim of this paper. Understanding what these challenges are and addressing them is key to ensuring the quality and the future sustainability of Wikidata.
The Knowledge Graph
Items and properties are the building blocks of Wikidata’s knowledge: items represent instances of concrete or abstract entities—e.g. Lou Reed or New York—as well as classes—e.g. the class of all musicians. Properties are used to state facts about items, such as Lou Reed–place of birth–New York. Statements assert facts about items and properties. Their nucleus is a claim, a property-value pair whose value can be either an item or a literal. Claims can be enriched through qualifiers and references. Qualifiers add contextual information (e.g. specifying a limitation in the validity of a statement), whereas references link to a source. The knowledge graph is the set of all statements.
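The data model described above can be sketched as a handful of record types. This is a minimal illustration, not Wikidata's actual storage schema: the class names, the use of plain strings for references, and the example source URL are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass(frozen=True)
class Item:
    """An instance or a class, e.g. Lou Reed or the class of all musicians."""
    label: str


@dataclass(frozen=True)
class Property:
    """A relation used to state facts, e.g. place of birth."""
    label: str


# A claim's value is either another item or a literal.
Value = Union[Item, str, int, float]


@dataclass
class Claim:
    """The nucleus of a statement: a property-value pair."""
    prop: Property
    value: Value


@dataclass
class Statement:
    """A claim about a subject, optionally enriched with context and sources."""
    subject: Union[Item, Property]
    claim: Claim
    qualifiers: List[Claim] = field(default_factory=list)  # contextual information
    references: List[str] = field(default_factory=list)    # simplified: source links


lou_reed = Item("Lou Reed")
new_york = Item("New York")
place_of_birth = Property("place of birth")

statement = Statement(
    subject=lou_reed,
    claim=Claim(place_of_birth, new_york),
    references=["https://example.org/some-source"],  # hypothetical source URL
)

# The knowledge graph is the set of all statements (a list here, for simplicity).
knowledge_graph = [statement]
```

Keeping the claim separate from its qualifiers and references mirrors the distinction drawn in the text: the claim carries the fact, while the other two fields only annotate it.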
The Wikidata community has grown continuously over the whole lifespan of the project, reaching a total of more than thousand unique registered users. Editors do not need to register to contribute and can also edit anonymously. Similarly to what has been observed in other online collaboration projects (e.g. Wikipedia [?]), the distribution of edits is very skewed and a core of users carries out the bulk of the work [?]. This is facilitated by some tools that are peculiar to the project. Wikidata can be edited through several interfaces. The easiest one is the web interface. Every entity in Wikidata has a corresponding web page, which can be retrieved and edited in various ways (Figure 1). Other commonly used interfaces are semi-automated editing tools, such as QuickStatements or The Wikidata Game. These allow users to edit at a much higher rate than would be possible through the web interface (e.g. QuickStatements accepts CSV files with a list of statements to be added), or to check the quality of suggested statements. Revisions made through these tools account for around of all edits made by human editors, although they are used by less than .
Bots carry out various types of tasks, such as editing items and properties or patrolling the graph for quality control checks. They are the authors of the majority of edits on Wikidata. Although their share of the total number of edits has declined since the early years of the project, when they exceeded of all contributions [?], it remains above of all revisions (Figure 2).
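As an illustration of the measure discussed above, the following sketch computes the yearly share of revisions made by bots. The input format (a year and a bot flag per revision) and the toy records are assumptions made for the example; they are not figures from Wikidata.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

# Hypothetical revision record: (year of the edit, whether a bot made it).
Revision = Tuple[int, bool]


def bot_share_by_year(revisions: Iterable[Revision]) -> Dict[int, float]:
    """Return, for each year, the fraction of revisions authored by bots."""
    totals: Counter = Counter()
    bot_edits: Counter = Counter()
    for year, made_by_bot in revisions:
        totals[year] += 1
        if made_by_bot:
            bot_edits[year] += 1
    return {year: bot_edits[year] / totals[year] for year in sorted(totals)}


# Toy data only, chosen to make the arithmetic easy to follow.
toy_revisions = [
    (2013, True), (2013, True), (2013, True), (2013, False),
    (2017, True), (2017, False),
]
shares = bot_share_by_year(toy_revisions)
```

On real data, the bot flag would have to come from the revision history, for instance by matching the author against the list of accounts with bot rights.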
Bots are operated by registered users, according to codified norms and policies. The first step to obtain community approval to run a bot is to open a Request for permission (https://www.wikidata.org/wiki/Wikidata:Requests_for_permissions/Bot, consulted on September .), where editors must provide a detailed description of the activities that their bot is planned to carry out. After a test run (between and edits), other users can leave comments and vote for or against the activation of the bot. Once a bot has been granted community approval, its “owner” must continuously check its work and, if it becomes harmful to the graph, suspend it immediately. Other users can request that the authorisation be withdrawn by opening a new page in a dedicated section of Wikidata. Unauthorised bots exist, but an editing cap is enforced on them.
One of the first functions of Wikidata items was to act as an inter-language hub for the different language versions of Wikipedia articles [?]. Therefore, the first Wikidata bots harvested inter-language links from all Wikipedias and imported them into Wikidata, where each item was connected to the corresponding articles in several language versions of the online encyclopaedia. Beyond this first task, bot activities have focused heavily on importing new statements and enriching the knowledge graph. Almost of automated editing in Wikidata concerns the addition or modification of Item statements () or labels/descriptions/aliases () [?]. Nonetheless, bots take roles similar to those of human editors in constructing Wikidata’s knowledge. [?] have shown that bots are active in five of the six user roles they identified in Wikidata on the basis of type and scope of edits, i.e. reference editor, item creator, item editor, item expert, and property editor.
Other bot activities concern quality-control tasks. A bot called KrBot has regularly scanned the whole graph since April and reported property constraint violations.
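To make the idea of a constraint check concrete, the sketch below scans a list of statements for violations of a single-value constraint, i.e. a property that should hold at most one value per item. It is not KrBot's implementation, which is not described here; the triple representation, the function name, and the deliberately inconsistent toy data are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical simplified statement: (subject, property, value).
Triple = Tuple[str, str, str]


def find_single_value_violations(
    triples: Iterable[Triple], single_valued: Set[str]
) -> Dict[Tuple[str, str], List[str]]:
    """Report (subject, property) pairs that carry more than one distinct
    value for a property expected to be single-valued."""
    values: Dict[Tuple[str, str], Set[str]] = defaultdict(set)
    for subject, prop, value in triples:
        if prop in single_valued:
            values[(subject, prop)].add(value)
    return {key: sorted(vals) for key, vals in values.items() if len(vals) > 1}


# Toy graph with a deliberately wrong second birthplace.
toy_triples = [
    ("Lou Reed", "place of birth", "New York"),
    ("Lou Reed", "place of birth", "Boston"),
    ("Lou Reed", "occupation", "musician"),
    ("Lou Reed", "occupation", "singer"),
]
violations = find_single_value_violations(toy_triples, {"place of birth"})
```

Only the birthplace is flagged: "occupation" is not declared single-valued, so holding two values for it is acceptable. A report of this kind still needs a human editor to decide which value is correct.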
Bots play a prominent role in the production of Wikidata’s knowledge, as we have seen in the previous paragraphs. This is also true of other projects. For instance, bots are by all means part of the sociotechnical fabric of Wikipedia, providing an essential contribution to authoring and cleaning articles [?], and are crucial to responding to vandalism in a timely manner [?]. Nevertheless, we argue that Wikidata constitutes a new model of human-bot collaboration, due to its combination of automated and semi-automated work, which presents new challenges. We outline these in the following:
1. Bots generate new content on a massive scale. However, most of this content is likely never to be seen or checked by any human user. With more than M entities in the graph, large swathes of it may never be consulted by anyone. Moreover, Wikidata is often accessed through third-party applications which, under Wikidata’s CC0 licence, do not need to provide attribution. The claim that “given enough eyeballs all bugs are shallow” may not hold in this case, simply because there might not be enough eyeballs.
2. Bots import large numbers of statements from a small number of sources, leading to a lack of diversity in the knowledge they produce [?]. Combined with the fact that a restricted circle of users operates bots and that a very small core of human editors performs the majority of edits through semi-automated tools, this can be a serious threat to representing a broad range of viewpoints in Wikidata, a project that was designed with diversity in mind [?].
3. The large proportion of automated and semi-automated activity, together with the fact that Wikidata is a multilingual project, may stifle user participation on the platform. Although an in-depth study of this aspect has not been carried out yet, it must be noted that discussion pages are seldom used (only item have active talk pages). Does communication between users decrease, compared to other platforms? And if so, how does that affect the previous point, i.e. diversity?
Research has already investigated some of the effects of algorithmic work in Wikidata on data quality (e.g. [?] and [?]). However, a study of the main challenges posed by bot activity in Wikidata is still missing. This paper highlights three of them, namely quality control, lack of diversity, and threats to user participation. Further work should address them, in order to shed light on these aspects of algorithmic participation in online platforms and ensure the future sustainability of Wikidata.
- 2 R. Stuart Geiger and Aaron Halfaker. 2013. When the levee breaks: without bots, what happens to Wikipedia’s quality control processes?. In OpenSym. ACM, 6:1–6:6.
- 3 Claudia Müller-Birn, Benjamin Karran, Janette Lehmann, and Markus Luczak-Rösch. 2015. Peer-production system or collaborative ontology engineering effort: what is Wikidata?. In OpenSym. ACM, 20:1–20:10.
- 4 Sabine Niederer and José van Dijck. 2010. Wisdom of the crowd or technicity of content? Wikipedia as a sociotechnical system. New Media & Society 12, 8 (2010), 1368–1387. DOI:http://dx.doi.org/10.1177/1461444810365297
- 5 Felipe Ortega, Jesús M. González-Barahona, and Gregorio Robles. 2008. On the Inequality of Contributions to Wikipedia. In HICSS. IEEE Computer Society, 304.
- 6 Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8, 3 (2017), 489–508.
- 7 Alessandro Piscopo, Lucie-Aimée Kaffee, Chris Phethean, and Elena Simperl. 2017a. Provenance Information in a Collaborative Knowledge Graph: An Evaluation of Wikidata External References. In International Semantic Web Conference (1) (Lecture Notes in Computer Science), Vol. 10587. Springer, 542–558.
- 8 Alessandro Piscopo, Chris Phethean, and Elena Simperl. 2017b. What Makes a Good Collaborative Knowledge Graph: Group Composition and Quality in Wikidata. In SocInfo (1) (Lecture Notes in Computer Science), Vol. 10539. Springer, 305–322.
- 9 Cristina Sarasua, Alessandro Checco, Gianluca Demartini, Djellel Difallah, Michael Feldman, and Lydia Pintscher. 2018. The Evolution of Power and Standard Wikidata Editors: Comparing Editing Behavior over Time to Predict Lifespan and Volume of Edits. Journal of Computer Supported Cooperative Work (2018), 00–00.
- 10 Thomas Steiner. 2014. Bots vs. Wikipedians, Anons vs. Logged-Ins (Redux): A Global Study of Edit Activity on Wikipedia and Wikidata. In OpenSym. ACM, 25:1–25:7.
- 11 Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.