Can Who-Edits-What Predict Edit Survival?


Ali Batuhan Yardım, Bilkent University, Ankara, Turkey. Contact: batuhan.yardim@ug.bilkent.edu.tr. This work was done while the author was at EPFL.    Victor Kristof, School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland. Contact: first.last@epfl.ch.    Lucas Maystre, School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.    Matthias Grossglauser, School of Computer and Communication Sciences, EPFL, Lausanne, Switzerland.
Abstract

The Internet has enabled the emergence of massive online collaborative projects. As the number of contributors to these projects grows, it becomes increasingly important to understand and predict whether the edits that users make will eventually impact the project positively. Existing solutions either rely on a user reputation system or consist of a highly-specialized predictor tailored to a specific peer-production system. In this work, we explore a different point in the solution space, which does not involve any content-based feature of the edits. To this end, we formulate a statistical model of edit outcomes. We view each edit as a game between the editor and the component of the project. We posit that the probability of a positive outcome is a function of the editor’s skill, of the difficulty of editing the component and of a user-component interaction term. Our model is broadly applicable, as it only requires observing data about who makes an edit, what the edit affects and whether the edit survives or not. Then, we consider Wikipedia and the Linux kernel, two examples of large-scale collaborative projects, and we seek to understand whether this simple model can effectively predict edit survival: in both cases, we provide a positive answer. Our approach significantly outperforms those based solely on user reputation and bridges the gap with specialized predictors that use content-based features. Furthermore, inspecting the model parameters enables us to discover interesting structure in the data. Our method is simple to implement, computationally inexpensive, and it produces interpretable results; as such, we believe that it is a valuable tool to analyze collaborative systems.


1 Introduction

Over the last two decades, the number and scale of online collaborative projects have become truly massive, driven by better information networks and advances in collaborative software. At the time of writing, editors contribute regularly to millions of articles of the English Wikipedia [Wikipedia, 2017b], and thousands of developers have authored code for the Linux kernel [Corbet and Kroah-Hartman, 2017]. On GitHub, millions of users collaborate on millions of active software repositories [GitHub, 2017].

In order to ensure that such projects advance towards their goals, it is necessary to identify whether edits made by users result in positive contributions. As the number of users and components of the project grows, this task becomes increasingly challenging. In response, two types of solutions have been proposed. On the one hand, some authors advocate the use of user reputation systems [Resnick et al., 2000, Adler and de Alfaro, 2007]. These systems are general, their predictions are easy to interpret, and they can be made resistant to manipulations [de Alfaro and Adler, 2013]. On the other hand, a number of highly-specialized methods have been proposed to automatically predict the quality of edits in particular collaborative systems [Druck et al., 2008, Halfaker and Taraborelli, 2015]. For example, Wikipedia has developed a service which predicts whether an edit will be damaging by analyzing a large number of content-based and system-based features of the user, the article and the edit itself, such as the number of bad words introduced by the edit, the time since the user's registration and the length of the article [Halfaker and Taraborelli, 2015]. These methods can attain excellent predictive performance [Heindorf et al., 2016] and usually significantly outperform predictors based on user reputation alone [Druck et al., 2008], but they are tailored to a particular peer-production system, use domain-specific features and might be difficult to interpret.

In this work, we set out to explore another point in the solution space. We aim to keep the generality and simplicity of user reputation systems, while reaching the predictive accuracy of highly-specialized methods. We ask the question: Can one predict the outcome of contributions simply by observing who edits what and whether the edits eventually survive? We tackle this question by proposing a novel statistical model of edit outcomes. We start by formalizing the notion of a collaborative system as follows. Users can propose edits on distinct items (components of the project, such as articles on Wikipedia or a software's modules). We observe triplets $(u, i, q)$ describing a user $u$ editing an item $i$ and leading to outcome $q$; the outcome $q = 0$ represents a rejected edit, whereas $q = 1$ represents a positive, accepted edit. Given a dataset of such observations, we seek to learn a model of the probability that an edit made by user $u$ on item $i$ is accepted.

Our approach borrows from probabilistic models of pairwise comparisons, used, e.g., to rank chess players based on game outcomes [Zermelo, 1928]. These models learn a real-valued score for each player such that the difference between two players' scores is predictive of game outcomes. We take a similar perspective and view each edit in a collaborative system as a game between the user, who tries to effect change, and the item, which resists change (obviously, items do not really "resist" by themselves; instead, this notion should be taken as a proxy for the combined action of other users, e.g., project maintainers, who may accept or reject an edit depending, among others, on standards of quality). Similarly to pairwise-comparison models, our approach learns a real-valued score for each user and each item. In addition, it also learns latent features of users and items that capture interaction effects.

In contrast to quality-prediction methods specialized to a particular collaborative system, our approach is general and can be applied to any system in which users contribute by editing discrete items. It does not use any explicit content-based features: instead, it simply learns by observing triplets $(u, i, q)$. Furthermore, the resulting model parameters can be interpreted easily. They enable a principled way of

1. ranking users by the quality of their contributions,

2. ranking items by the difficulty of editing them, and

3. understanding the main dimensions of the interaction between users and items.

We apply our approach on two different peer-production systems. We start with Wikipedia and consider its Turkish and French editions. Evaluating the accuracy of predictions on an independent set of edits, we find that our model approaches the performance of the state of the art, perhaps surprisingly so given its simplicity. More interestingly, the model parameters reveal important facets of the system. For example, we characterize articles that are easy and difficult to edit, respectively, and we identify clusters of articles that share common editing patterns.

Next, we turn our attention to the Linux kernel. In this project, contributors are typically highly-skilled professionals, and the edits that they make affect subsystems that form the kernel. In this instance, our model’s predictions turn out to be more accurate than a random forest classifier trained on domain-specific features. In addition, we are again able to give an interesting qualitative description of subsystems based on their difficulty score.

In short, our paper

1. gives evidence that observing who edits what can yield valuable insights into peer-production systems, and

2. proposes a statistically-grounded and computationally-inexpensive method to do so.

The analysis of two peer-production systems with very distinct characteristics demonstrates the generality of the approach.

Organization of the paper.

We start by reviewing related literature in Section 2. Then, we present the two main contributions of this work. In Section 3, we describe a novel statistical model of edit outcomes. We explain how the model parameters can be interpreted and briefly discuss how to efficiently learn a model from data. In Sections 4 and 5, we investigate our approach in the context of Wikipedia and of the Linux kernel, respectively. In particular, we evaluate our model's predictive performance and show how it enables us to find salient features of both collaborative systems. Finally, we outline directions for future work and conclude in Section 6.

2 Related Work

With the growing size and impact of online collaborative projects, the task of assessing contribution quality has been extensively studied. We review various approaches to the problem of quantifying and predicting the quality of user contributions and contrast them to our approach.

2.1 User Reputation Systems

Reputation systems have been a long-standing topic of interest in relation to peer-production systems and, more generally, in relation to online services [Resnick et al., 2000].

In the context of Wikipedia, a seminal approach to computing edit quality and user reputation is developed by Adler and de Alfaro [2007]. The article first establishes that, in the particular collaborative setting of Wikipedia, the success of an edit should be defined based on its longevity across future revisions. Each Wikipedia article is treated as a sequence of versions and authors, and the quality of an edit is measured by how much of the introduced change is retained in future edits. When applying our methods to Wikipedia, we follow the same idea of measuring quality implicitly through the state of the article at subsequent revisions. Next, Adler and de Alfaro take advantage of edit qualities in order to assign reputation scores to users. They show that the reputation scores are predictive of future edit quality. The ideas underpinning the computation of implicit edit quality are extended and refined in subsequent papers, such as Adler et al. [2008] and de Alfaro and Adler [2013], where the authors explore ways of improving the implicit quality for better robustness and applicability. This line of work leads to the development of WikiTrust [de Alfaro et al., 2011], a browser add-on that highlights low-reputation text in Wikipedia articles.

In this work, we demonstrate that by automatically learning properties of the item that a user edits (in addition to learning properties of the user, such as a reputation score) we can substantially improve predictions of edit quality.

2.2 Specialized Classifiers

Instead of relying on a general model of user reputation, several authors have advocated the development of quality-prediction solutions tailored to a specific collaborative system. Typically, these methods consist of a machine-learned classifier trained on a large number of content-based and system-based features of the users, the items and the edits themselves. Their performance usually stems less from the particular classification model than from the particular features that they use.

Druck et al. [2008] fit a maximum entropy classifier for estimating the lifespan of a given Wikipedia edit. The paper takes a definition of edit longevity similar to that of Adler and de Alfaro [2007], arguing that, in the absence of a clear-cut definition of quality, one should use the implicit feedback from the community for the purpose of measuring quality. Given edit qualities, the authors train a multinomial logistic regression model which predicts how long the text introduced by an edit will last. Features used in the model include content-based ones, such as the number of added or deleted words, the type of change, and the capitalization and punctuation of the edit, as well as features based on the user, the time of the edit and the article. Their model significantly outperforms a baseline that only uses features of the user.

Other approaches use support vector machines [Bronner and Monz, 2012], random forests [Bronner and Monz, 2012, Javanmardi et al., 2011] or binary logistic regression [Potthast et al., 2008], with varying levels of success. In some cases, content-based features are refined using natural-language processing, leading to substantial performance improvements. While they increase performance, these refinements arguably further reduce the general applicability of the methods. For example, competitive natural-language-processing tools have yet to be developed for the Turkish language (we investigate the Turkish Wikipedia in Section 4).

Agrawal and de Alfaro [2016] take a different approach and train a recurrent neural network (RNN) to predict edit outcomes. In contrast to previous work, they treat the task of predicting edit outcomes as a sequence-learning problem. The resulting predictions achieve better precision and recall than predictions based on the user reputation scores computed using the approach of Adler and de Alfaro [2007].

In contrast to these methods, our approach explores a radically different design choice: we assume that we do not have access to any content-based or system-based feature at all. This leads to a system that is general and broadly applicable. Furthermore, the use of a black-box classifier can hinder the interpretability of predictions, whereas we propose a clean statistical model whose parameters are straightforward to interpret.

2.3 Miscellanea

Finally, we briefly mention work from a few additional areas that are influential in this paper.

Pairwise comparison models.

The approach presented in this paper draws inspiration from models of pairwise comparison, which have been extensively studied over the last century [Zermelo, 1928, Thurstone, 1927a, Bradley and Terry, 1952, Salganik and Levy, 2015]. These models tackle the problem of predicting the outcome of pairwise comparisons between objects, and they have applications ranging from psychometrics [Thurstone, 1927b] to sports ranking [Elo, 1978]. One of the most popular paradigms posits that every object $i$ has a latent "strength" parameter $\theta_i$, and that the probability of object $i$ winning against object $j$ is a function of the difference in strengths $\theta_i - \theta_j$. For example, the Bradley–Terry model [Bradley and Terry, 1952] defines

$$p(i \text{ wins against } j) = \frac{1}{1 + \exp[-(\theta_i - \theta_j)]}.$$

In our setting, we can view each edit as a comparison between a user and a page, and we have a similar interpretation for model parameters.

Collaborative filtering.

Our method also borrows from collaborative filtering techniques popular in the recommender systems community. In particular, some parts of our model are reminiscent of matrix factorization techniques [Koren et al., 2009]. These techniques automatically learn low-dimensional embeddings of users and items based on ratings, with the goal of producing better recommendations. Our work shows that these ideas can also be helpful in tackling the problem of predicting the outcomes of edits in collaborative systems.

Finding controversial items.

On Wikipedia, some articles are highly controversial and result in so-called edit wars [Sumi et al., 2011]. Sepehri Rad and Barbosa [2012] compare different algorithms for quantifying the controversy surrounding an article. Yasseri et al. [2014] look at the patterns of controversial articles across languages. Both papers develop ad-hoc measures of controversy. When applied to Wikipedia, our model is also able to identify controversial articles, as a by-product of learning a per-article parameter that is predictive of edit outcomes (c.f. Section 4).

3 Statistical Models

In this section, we describe and motivate two variants of a statistical model of edit outcomes based on who edits what. In other words, we seek a model that is predictive of the outcome of a contribution of user $u$ on item $i$. To that end, we let $p_{ui}$ denote the probability that an edit made by user $u$ on item $i$ is successful. In collaborative projects of interest, most users typically interact only with a small number of items. In order to deal with the sparsity of interactions, we postulate that the probabilities $\{p_{ui}\}$ lie on a low-dimensional manifold and propose two model variants of increasing complexity. In both cases, the parameters of the model have intuitive effects and can be interpreted easily.

Basic variant.

The first variant of our model is directly inspired by the Bradley–Terry model of pairwise comparisons [Bradley and Terry, 1952]. The probability of a positive outcome is defined as

$$p_{ui} = \sigma(s_u - d_i + b), \qquad (1)$$

where $\sigma(x) = 1 / (1 + e^{-x})$ is the logistic function, $s_u$ is the skill of user $u$, $d_i$ is the difficulty of item $i$, and $b$ is a global parameter that encodes the overall skew of the distribution of outcomes. We call this model variant interank basic. Under the pairwise-comparison viewpoint, the model predicts the outcome of a game between the item, which has inertia, and the user, who would like to effect change. Intuitively, the skill $s_u$ quantifies the ability of the user to enforce a contribution, whereas the difficulty $d_i$ quantifies how "resistant" to contributions the particular item is. An increasing item difficulty monotonically decreases the probability $p_{ui}$, whereas an increasing user skill monotonically increases it. This corresponds to an appropriate first-order approximation of the outcome probability.
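As a concrete illustration (a minimal sketch, not the reference implementation; the parameter values below are made up), the probability of Equation (1) can be computed as follows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_basic(skill, difficulty, offset):
    """Probability that an edit succeeds under interank basic (Eq. 1)."""
    return sigmoid(skill - difficulty + offset)

# Hypothetical parameters: a moderately skilled user editing a hard item.
s_u, d_i, b = 0.8, 1.5, 0.4
print(prob_basic(s_u, d_i, b))  # ~0.43: the item's difficulty outweighs the user's skill
```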

Like reputation systems [Adler and de Alfaro, 2007], interank basic learns a score for each user that is predictive of edit quality. However, unlike these systems, our model also takes into account that some items might be more challenging to edit than others. As an example, on Wikipedia, we can expect high-traffic, controversial articles to be more difficult to edit than less popular articles. Just like user skills, the article difficulty can be inferred automatically from observed outcomes.

Full variant.

While the basic variant is conceptually attractive, the first-order approximation might prove to be too simplistic in some instances. In particular, the basic variant implies that if user $u$ is more skilled than user $v$, then $p_{ui} > p_{vi}$ for all items $i$. In many collaborative systems, users tend to have their own specializations and interests, and each item in the project might require a particular mix of skills. Taking the Linux kernel as an example, an engineer specialized in file systems might be successful in editing a certain subset of software components, but might be less proficient in contributing to, say, network drivers, whereas the situation might look exactly opposite for another engineer. In order to capture the multidimensional interaction between users and items, we add a bilinear term to the probability model (1). Letting $\mathbf{x}_u, \mathbf{w}_i \in \mathbb{R}^D$ for some dimension $D$, we define

$$p_{ui} = \sigma(s_u - d_i + \mathbf{x}_u^\top \mathbf{w}_i + b). \qquad (2)$$

We call the corresponding model variant interank full. The vectors $\mathbf{x}_u$ and $\mathbf{w}_i$ can be thought of as embedding users and items as points in a latent $D$-dimensional space. Informally, $p_{ui}$ increases if the two points representing a user and an item are close to each other, and it decreases if they are far from each other (e.g., if the vectors have opposite signs). Slightly oversimplifying, the parameter $\mathbf{w}_i$ can be interpreted as describing the set of skills needed to successfully edit item $i$, whereas $\mathbf{x}_u$ describes the set of skills displayed by user $u$.

The bilinear term is reminiscent of matrix factorization approaches in recommender systems [Koren et al., 2009]; indeed, this variant can be seen as a collaborative filtering method. In true collaborative filtering fashion, our model is able to learn the latent feature vectors $\mathbf{x}_u$ and $\mathbf{w}_i$ jointly, taking into consideration all edits and without any additional content-based features.

Finally, note that the skill and difficulty parameters are retained in this variant and can still be used to explain first-order effects. The bilinear term only explains the additional effect due to the user-item interaction.
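Analogously, a hypothetical sketch of the full variant adds the dot product of Equation (2) to the linear score; the embeddings below are invented for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def prob_full(skill, difficulty, x_u, w_i, offset):
    """Probability of a successful edit under interank full (Eq. 2)."""
    return sigmoid(skill - difficulty + np.dot(x_u, w_i) + offset)

# Hypothetical D=3 embeddings: the user's skills align with what the item requires,
# so the bilinear term raises the success probability compared to the basic variant.
x_u = np.array([0.9, -0.2, 0.1])
w_i = np.array([0.8, -0.1, 0.0])
print(prob_full(0.8, 1.5, x_u, w_i, 0.4))  # ~0.61 vs. ~0.43 without the interaction term
```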

3.1 Learning the Model

From (1) and (2), it should be clear that our probabilistic model assumes no data other than the identity of the user and that of the item. This makes it generally applicable to any peer-production system in which users contribute to discrete items.

Given a dataset $\mathcal{D} = \{(u, i, q)\}$ of independent observations, we infer the parameters of the model by maximizing their likelihood under $\mathcal{D}$. That is, collecting all model parameters into a single vector $\theta$, we seek to minimize the negative log-likelihood

$$\ell(\theta) = -\sum_{(u, i, q) \in \mathcal{D}} \left[ q \log p_{ui} + (1 - q) \log(1 - p_{ui}) \right],$$

where $p_{ui}$ depends on $\theta$. In the basic variant, the negative log-likelihood is convex, and we can easily find a global minimum using standard methods from convex optimization. In the full variant, the bilinear term breaks the convexity of the objective function, and we can no longer guarantee that we find parameters that are global minimizers. In practice, we do not observe any issues, and we reliably find good model parameters on all datasets.
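The following sketch (illustrative only; the toy triplets and zero initialization are arbitrary) evaluates this negative log-likelihood for the basic variant:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_log_likelihood(skills, difficulties, offset, triplets):
    """Negative log-likelihood of (user, item, outcome) triplets under interank basic."""
    nll = 0.0
    for u, i, q in triplets:
        p = sigmoid(skills[u] - difficulties[i] + offset)
        nll -= q * np.log(p) + (1 - q) * np.log(1 - p)
    return nll

# Toy data: 2 users, 2 items, outcomes in {0, 1} (or fractional qualities in [0, 1]).
triplets = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
skills, difficulties, offset = np.zeros(2), np.zeros(2), 0.0
print(neg_log_likelihood(skills, difficulties, offset, triplets))  # ~2.77 at initialization
```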

Implementation.

We implement the model in Python using the TensorFlow library [Abadi et al., 2016]. In order to avoid overfitting the model to the training data, we add a small amount of regularization to the negative log-likelihood, which results in the objective function

$$\ell(\theta) + \lambda \lVert \theta \rVert_2^2.$$

The best value of the regularization strength $\lambda$ is found by cross-validation. We minimize the objective using stochastic gradient descent [Bishop, 2006] with small batches of data. For interank full, we set the number of latent dimensions $D$ to a small value (we observe almost no improvement when further increasing $D$).
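As a rough sketch of the optimization (plain NumPy rather than the paper's TensorFlow implementation; the learning rate, regularization strength and data are invented), note that the gradient of the per-edit cross-entropy loss with respect to the linear score is simply $p_{ui} - q$:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_epoch(skills, difficulties, offset, triplets, lr=0.1, lam=1e-3):
    """One pass of stochastic gradient descent for interank basic (batch size 1)."""
    for u, i, q in triplets:
        p = sigmoid(skills[u] - difficulties[i] + offset)
        g = p - q  # gradient of the cross-entropy loss w.r.t. the linear score
        skills[u] -= lr * (g + 2 * lam * skills[u])
        difficulties[i] -= lr * (-g + 2 * lam * difficulties[i])
        offset -= lr * g
    return skills, difficulties, offset

triplets = [(0, 0, 1), (0, 1, 0), (1, 0, 1), (1, 1, 1)]
skills, difficulties, offset = np.zeros(2), np.zeros(2), 0.0
for _ in range(100):
    skills, difficulties, offset = sgd_epoch(skills, difficulties, offset, triplets)
print(skills, difficulties, offset)  # learned toy parameters
```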

Running time.

Our largest experiment consists of learning the parameters of interank full on the entire history of the French Wikipedia (cf. Section 4), consisting of tens of millions of edits by millions of users on millions of items. In this case, our TensorFlow implementation takes a few hours to converge on standard hardware. In most other experiments, our implementation takes only a few minutes to converge. This demonstrates that our model effortlessly scales even to the largest collaborative systems.

4 Wikipedia

Wikipedia is a popular free online encyclopedia and arguably one of the most successful peer-production systems. In this section, we apply the model presented in Section 3 to the French and Turkish editions of Wikipedia.

4.1 Background & Datasets

The French Wikipedia is one of the largest Wikipedia editions. At the time of writing, it ranks in third position both in terms of number of edits and number of users. In order to obtain a complementary perspective, we also study the Turkish Wikipedia, which is roughly an order of magnitude smaller. Interestingly, both the French and the Turkish editions score very highly on Wikipedia’s depth scale, a measure of collaborative quality [Wikipedia, 2017a].

The Wikimedia Foundation periodically releases a public database dump containing the successive revisions to all articles (see https://dumps.wikimedia.org/). In this paper, we use a dump that contains data starting from the beginning of the edition up to Fall 2017.

Edition # users # articles # edits First edit Last edit
French 2001-08-04 2017-09-02 % %
Turkish 2002-12-05 2017-10-01 % %
Table 1: Summary statistics of Wikipedia datasets after preprocessing.

4.1.1 Computation of Edit Quality

On Wikipedia, any user's edit is immediately incorporated into the encyclopedia (except for a small minority of protected articles). Therefore, in order to obtain information about the quality of an edit, we have to consider the implicit signal given by subsequent edits to the same article. If the changes introduced by the edit are preserved, it signals that the edit was positive, whereas if the changes are reverted, the edit likely had a negative impact. A formalization of this idea is given by Adler and de Alfaro [2007] and Druck et al. [2008]; see also de Alfaro and Adler [2013] for a concise explanation. In this paper, we essentially follow their approach.

Consider a particular article and denote by $r_k$ its $k$-th revision (i.e., the state of the article after the $k$-th edit). Let $d(r, r')$ be the Levenshtein distance between two revisions [Kruskal, 1983]. We define the quality of edit $k$ from the perspective of the article's state after $j$ subsequent edits as

$$q_k^{(j)} = \frac{d(r_{k-1}, r_{k+j}) - d(r_k, r_{k+j})}{d(r_{k-1}, r_k)}.$$

By properties of distances, $q_k^{(j)} \in [-1, 1]$. Intuitively (and informally), the quantity $q_k^{(j)}$ captures the proportion of the work done in edit $k$ that remains in revision $k + j$. We compute the unconditional quality of the edit by averaging over multiple future revisions:

$$q_k = \frac{1}{J} \sum_{j=1}^{J} q_k^{(j)}, \qquad (3)$$

where $J$ is the minimum between the number of subsequent revisions to the article and a fixed horizon. Note that even though $q_k$ is no longer binary, our model extends to continuous-valued quality in a straightforward way. No change is needed to the discussion of Section 3.1.
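To make the quality measure concrete, the sketch below computes $q_k^{(j)}$ for three invented revision strings, using a standard dynamic-programming Levenshtein distance:

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (ca != cb))) # substitution
        prev = curr
    return prev[-1]

def edit_quality(r_before, r_after, r_future):
    """Proportion of the work done by an edit that survives in a future revision."""
    return (levenshtein(r_before, r_future) - levenshtein(r_after, r_future)) \
        / levenshtein(r_before, r_after)

# Invented example: the edit adds a sentence, and a later revision keeps all of it.
r_before = "The cat sat."
r_after = "The cat sat. It purred loudly."
r_future = "The cat sat. It purred loudly. Then it slept."
print(edit_quality(r_before, r_after, r_future))  # 1.0: the edit fully survives
```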

In practice, we observe that edit quality is bimodal and asymmetric. Most edits have a quality close to one of the two extremes of the scale, and a majority of edits are of high quality. The two rightmost columns of Table 1 quantify this for the French and Turkish Wikipedias.

4.1.2 Dataset Preprocessing

We consider all edits to pages in the main namespace (i.e., articles), including those from anonymous contributors identified by their IP address (note, however, that a large majority of edits are made by registered users in both the French and Turkish editions). Sequences of consecutive edits to an article by the same user are collapsed into a single edit in order to remove bias in the computation of edit quality [Adler and de Alfaro, 2007]. To evaluate methods in a realistic setting, we split the data chronologically, with the earlier edits forming the training set and the remaining, more recent edits forming an independent validation set (the cutoff dates are May 2nd, 2016 for the French edition and July 29th, 2016 for the Turkish edition). Note that the quality is computed based on subsequent revisions of an article: in order to guarantee that the two sets are truly independent, we make sure that we never use any revisions from the validation set to compute the quality of edits in the training set. Finally, we remove edits whose quality cannot be assessed reliably, i.e., those for which too few subsequent revisions are available to compute (3). A short summary of the data statistics after preprocessing is provided in Table 1.

4.2 Evaluation

We evaluate the performance on a binary classification task consisting of predicting whether an edit is of poor quality. To that end, we assign binary labels to all edits in the validation set: the label bad is assigned to every edit whose quality falls below a fixed threshold, and the label good is assigned to all other edits.

As discussed in Section 3, we consider two versions of our model. The first one, interank basic, simply learns scalar user skills and article difficulties. The second one, interank full, additionally includes a latent embedding of dimension $D$ for each user and article.

4.2.1 Competing Approaches

To set our results in context, we compare them to those obtained with three different baselines.

Average.

The first approach always outputs the marginal probability of a bad edit in the training set, i.e., the fraction of training edits that are labeled bad.

This is a trivial baseline, and it gives an idea of what results we should expect to achieve without any additional information about the user, the article or the edit.

User-only.

The second approach models the outcome of an edit using only the user's identity. In short, the predictor learns skills $\{s_u\}$ and a global offset $b$ such that, for each user $u$, the probability

$$p_u = \sigma(s_u + b)$$

maximizes the likelihood of that user's edits in the training set. This baseline predictor is representative of user reputation systems such as that of Adler and de Alfaro [2007].

ORES reverted.

The third approach is a state-of-the-art classifier developed by researchers at the Wikimedia Foundation and is part of Wikipedia's Objective Revision Evaluation Service [Halfaker and Taraborelli, 2015]. It uses a large number of content-based and system-based features extracted from the user, the article and the edit itself to predict whether the edit will be reverted, a target which essentially matches our operational definition of a bad edit. Features include the number of vulgar words introduced by the edit, the length of the article and of the edit, etc. This predictor is representative of specialized, domain-specific approaches to modeling edit quality.

4.2.2 Results

Table 2 presents a summary of key metrics for each method. interank full has the highest average log-likelihood of all models, meaning that its predictive probabilities are well calibrated with respect to the validation data.

Edition Model Avg. log-likelihood AUC-PR
French interank basic
interank full
Average
User-only
ORES reverted
Turkish interank basic
interank full
Average
User-only
ORES reverted
Table 2: Predictive performance on the bad edit classification task for the French and Turkish Wikipedias. The best performance is highlighted in bold.

Figure 1 presents the precision-recall curves for all methods. The analysis is qualitatively similar for both Wikipedia editions. All non-trivial predictors perform similarly in the high-recall regime, but present significant differences in the high-precision regime, on which we will focus. The ORES predictor performs the best. interank comes second, reasonably close behind ORES, and the full variant has a small edge over the basic variant. The user-only baseline is far behind. This shows that incorporating information about the article being edited is crucial to get a good performance on a large portion of the precision-recall trade-off.

Figure 1: Precision-recall curves on the bad edit classification task for the French and Turkish Wikipedias. interank full (solid red) significantly outperforms the user-only baseline (dotted green) and approaches the performance of ORES reverted (dash-dotted blue).

We also note that, in the validation set, a non-negligible fraction of edits are made by users, or on articles, that are never encountered in the training set (the numbers are similar in both editions). In these cases, interank reverts to default predictions, whereas methods such as ORES take advantage of content-based features of the edit to make an informed prediction. Given that interank does not have access to any content-based feature, we believe that its performance is in fact remarkable.

In summary, we observe that our model, which incorporates the articles' identity, is able to bridge the gap between a user-only prediction approach and a specialized predictor (ORES reverted). Furthermore, modeling the interaction between user and article (interank full) is beneficial and helps further improve predictions, particularly in the high-precision regime.

4.3 Interpretation of Model Parameters

The parameters of interank models, in addition to being predictive of edit outcomes, are also very interpretable. In the following, we demonstrate how they can surface interesting characteristics of the peer-production system.

4.3.1 Controversial Articles

Intuitively, we expect an article whose difficulty parameter $d_i$ is large to deal with topics that are potentially controversial. We focus on the French Wikipedia and explore a list of the ten most controversial articles given by Yasseri et al. [2014]. In this 2013 study, the authors identify controversial articles using an ad-hoc methodology. Table 3 presents, for each article identified by Yasseri et al., the percentile of the corresponding difficulty parameter $d_i$ learned by interank full. Our analysis takes place approximately four years later, but the model still identifies these articles as some of the most difficult ones. Interestingly, the article on Sigmund Freud, which has the lowest difficulty parameter of the list, has become a featured article since Yasseri et al.'s analysis—a distinction awarded only to the most well-written and neutral articles.

Rank Title Percentile of $d_i$
1 Ségolène Royal %
2 Unidentified flying object %
3 Jehovah’s Witnesses %
4 Jesus %
5 Sigmund Freud %
6 September 11 attacks %
7 Muhammad al-Durrah incident %
8 Islamophobia %
9 God in Christianity %
10 Nuclear power debate %
median %
Table 3: The ten most controversial articles on the French Wikipedia according to Yasseri et al. [2014]. For each article, we indicate the percentile of its corresponding difficulty parameter $d_i$.

4.3.2 Latent Factors

Next, we turn our attention to the parameters $\mathbf{w}_i$. These parameters can be thought of as an embedding of the articles in a latent space of dimension $D$. As we are learning a model that maximizes the likelihood of edit outcomes, we expect these embeddings to capture latent article features that explain edit outcomes. In order to extract the one or two directions that explain most of the variability in this latent space, we apply principal component analysis [Bishop, 2006] to the matrix $\mathbf{W}$ whose rows are the vectors $\mathbf{w}_i$.
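For instance (an illustrative sketch operating on a randomly generated stand-in for the learned embedding matrix, with made-up sizes), the principal axes can be extracted with scikit-learn:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for the learned article embeddings: one D-dimensional row per article.
rng = np.random.default_rng(0)
W = rng.normal(size=(1000, 16))  # 1000 articles, D = 16 (both numbers are made up)

pca = PCA(n_components=2)
coords = pca.fit_transform(W)  # coordinates of each article along the first two axes

# Articles with extreme coordinates along the first principal axis.
first_axis = coords[:, 0]
print("highest:", np.argsort(first_axis)[-10:])
print("lowest:", np.argsort(first_axis)[:10])
```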

In Table 4, we consider the Turkish Wikipedia and list a subset of the articles with the highest and lowest coordinates along the first principal axis of $\mathbf{W}$. We observe that this axis seems to distinguish articles about popular culture from those about "high culture" or timeless topics. This discovery supports the hypothesis that users have a propensity to successfully edit either popular-culture or high-culture articles on Wikipedia, but not both.

Direction Titles
Lowest Harry Potter’s magic list, List of programs broadcasted by Star TV, Bursaspor 2011-12 season, Kral Pop TV Top 20, Death Eater, Heroes (TV series), List of programs broadcasted by TV8, Karadayı, Show TV, List of episodes of Kurtlar Vadisi Pusu.
Highest Seven Wonders of the World, Thomas Edison, Cell, Mustafa Kemal Atatürk, Albert Einstein, Democracy, Isaac Newton, Mehmed the Conqueror, Leonardo da Vinci, Louis Pasteur.
Table 4: A selection of articles of the Turkish Wikipedia among those with the highest and lowest coordinates along the first principal axis of the matrix $\mathbf{W}$.

Finally, we consider the French Wikipedia. Once again, we apply principal component analysis to the matrix $\mathbf{W}$ and keep the first two dimensions. We select the articles with the highest and lowest coordinates along the first two principal axes (interestingly, the first dimension has a very similar interpretation to that obtained on the Turkish edition: it can also be understood as separating popular culture from high culture). A two-dimensional t-SNE plot [van der Maaten and Hinton, 2008] of the 80 articles selected using PCA is displayed in Figure 2. The plot enables us to identify meaningful clusters of related articles, such as articles about tennis players, French municipalities, historical figures, and TV or teen culture. These articles are representative of the latent dimensions that separate editors the most: a user skilled in editing pages about ancient Greek mathematicians might be less skilled in editing pages about anime, and vice versa.
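A visualization in the spirit of Figure 2 can be produced along the same lines; the sketch below again uses made-up embeddings, and the t-SNE settings are arbitrary:

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
selected = rng.normal(size=(80, 16))  # stand-in for the 80 selected article embeddings

coords_2d = TSNE(n_components=2, perplexity=15, random_state=0).fit_transform(selected)
print(coords_2d.shape)  # (80, 2): one point per article, ready to scatter-plot
```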

Figure 2: t-SNE visualization of 80 articles of the French Wikipedia with the highest and lowest coordinates along the first and second principal axes of the matrix $\mathbf{W}$.

5 Linux Kernel

In this section, we apply the interank model to the Linux kernel project, an open-source software project that relies on community contributions to evolve and improve. In contrast to Wikipedia, most contributors to the Linux kernel are highly-skilled professionals who dedicate a significant portion of their time and efforts to the project.

5.1 Background & Dataset

The Linux kernel has a fundamental impact on technology as a whole. In fact, the Linux operating system runs 90% of the cloud workload and 82% of the smartphones [Corbet and Kroah-Hartman, 2017]. To collectively improve the source code, developers submit bug fixes or new features in the form of a patch to collaborative repositories. Review and integration time depend on the project's structure, ranging from a few hours or days for the Apache Server [Rigby et al., 2008] to a couple of months for the Linux kernel [Jiang et al., 2013]. For the Linux kernel in particular, developers submit patches to subsystem mailing lists, where they undergo several rounds of reviews. If the code is approved and the suggestions have been implemented, the patch can be committed to the subsystem maintainer's software repository. Integration conflicts are spotted at this stage by other developers monitoring the maintainer's repository, and any issues must be fixed by the submitter. If the maintainer is satisfied with the patch, she commits it to Linus Torvalds' repository, and he decides whether or not to include it in the next Linux release.

5.1.1 Dataset Preprocessing

We use a dataset collected by Jiang et al. [2013] which spans Linux development activity between 2005 and 2012. It consists of patches described using features derived from the mailing lists, commits to software repositories, the developers’ activity and the content of the patches themselves. Jiang et al. scraped patches from the various mailing lists and matched them with commits in the main repository. In total, they managed to trace back 75% of the commits that appear in Linus Torvalds’ repository to a patch submitted to a mailing list. A patch is labeled as accepted if it eventually appears in a release of the Linux kernel. We remove data points with empty subsystem and developer names, as well as all subsystems with no accepted patches. Finally, we chronologically order the patches according to their mailing list submission time.

After preprocessing, the dataset contains patches proposed by developers on subsystems. Among all patches, 34.12% are integrated into a Linux kernel release. We then split the data into a training set containing the first 80% of patches and a validation set containing the remaining 20%.

5.1.2 Subsystem-Developer Correlation

Given the highly complex nature of the project, one could believe that developers tend to specialize in a few, independent subsystems. For each developer $d$, let $\mathbf{a}_d$ be the vector of binary variables indicating, for every subsystem, whether developer $d$ has an accepted patch in that subsystem. We compute the sample Pearson correlation coefficient between $\mathbf{a}_d$ and $\mathbf{a}_{d'}$ for every pair of developers. We show in Figure 3 the resulting correlation matrix between developers patching subsystems. Row $d$ corresponds to developer $d$, and we order all rows according to the subsystem each developer contributes to the most. We order the subsystems in decreasing order of the number of submitted patches, such that larger subsystems appear at the top of the matrix. The blocks on the diagonal hence correspond to subsystems, and their size roughly represents the size of their community. As shown by the blocks, developers tend to specialize in one subsystem. However, as the numerous non-zero off-diagonal entries reveal, they generally still contribute substantially to other subsystems. Finally, as highlighted by the dotted blue square, subsystems number three to six on the diagonal form a cluster. In fact, these four subsystems (include/linux, arch/x86, kernel and mm) are core subsystems of the Linux kernel.
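A sketch of this computation (on a hypothetical indicator matrix; the row and column ordering used in Figure 3 is omitted) could look as follows:

```python
import numpy as np

# Hypothetical binary matrix: entry [d, s] is 1 if developer d has an accepted
# patch in subsystem s, and 0 otherwise.
rng = np.random.default_rng(0)
accepted = (rng.random((200, 50)) < 0.2).astype(float)  # 200 developers, 50 subsystems

# Pearson correlation between developers, treating each developer as a vector
# of per-subsystem acceptance indicators (np.corrcoef correlates rows).
corr = np.corrcoef(accepted)
print(corr.shape)  # (200, 200)
```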

Figure 3: Correlation matrix between developers ordered according to the subsystem they contribute to the most. The blocks on the diagonal correspond to subsystems.

5.2 Evaluation

We consider the task of predicting whether a patch will be integrated into a release of the kernel. Similarly to Section 4, we use interank basic and interank full (with $D$ latent dimensions) to learn the developers' skills, the subsystems' difficulties, and the interaction between them.

5.2.1 Competing Approaches

Two of the baselines that we consider, average and user-only, are identical to those described in Section 4.2.1. In addition, we also compare our model to a random forest classifier trained on domain-specific features, similar to the one used by Jiang et al. [2013]. In total, this classifier has access to 21 features for each patch. Features include information about the developer's experience up to the time of submission (e.g., number of accepted commits, number of patches sent), the e-mail thread (e.g., number of developers copied on the e-mail, size of the e-mail, number of e-mails in the thread up to the patch) and the patch itself (e.g., number of lines changed, number of files changed). We optimize the hyperparameters of the random forest using a grid search. As the model has access to domain-specific features about each edit, it is representative of the class of specialized methods tailored to a particular collaborative system.
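As an illustration of this type of baseline (a generic sketch: the 21 actual features and the exact hyperparameter grid are not reproduced here, so the data and grid below are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data: one row of 21 domain-specific features per patch,
# label 1 if the patch was eventually integrated into a release.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 21))
y = (rng.random(5000) < 0.34).astype(int)

grid = {"n_estimators": [100, 300], "max_depth": [None, 10, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0), grid, cv=3)
search.fit(X, y)
print(search.best_params_)  # hyperparameters selected by cross-validated grid search
```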

5.2.2 Results

Table 5 displays the average log-likelihood and the area under the precision-recall curve (AUC-PR). interank full performs best in terms of both metrics. In terms of average log-likelihood, this means that its predictive probabilities are again well calibrated with respect to the validation data. In terms of AUC-PR, it outperforms the random forest classifier by 4.6% and the user-only baseline by 7.3%.

Model Avg. log-likelihood AUC-PR
interank basic -0.588 0.525
interank full -0.587 0.527
Average -0.640 0.338
User-only -0.600 0.491
Random forest -0.598 0.504
Table 5: Predictive performance on the accepted patch classification task for the Linux kernel. The best performance is highlighted in bold.

We show in Figure 4 the precision-recall curves. interank full and interank basic both perform better than the two baselines. Notably, they outperform the random forest in the high-precision regime, even though the random forest uses content-based features about developers, subsystems and patches. In the high-recall regime, the random forest attains a marginally better precision. The user-only baseline performs worse than all non-trivial models.

Figure 4: Precision-recall curves on the accepted patch classification task for the Linux kernel. interank (solid red) outperforms the user-only baseline (dotted green) and the random forest classifier (dash-dotted blue).

5.3 Interpretation of Model Parameters

We show in Table 6 the top-five and bottom-five subsystems according to the difficulties $d_i$ learned by interank full. We note that even though patches submitted to difficult subsystems have, in general, a low acceptance rate, interank enables a finer ranking by taking into account who is contributing to the subsystems. This effect is even more noticeable for the bottom-five subsystems.

The subsystems with the largest difficulty $d_i$ are core components, whose integrity is crucial to the system. For instance, the usr subsystem, providing code for RAM-related instructions at booting time, has barely changed in the last seven years. On the other hand, the subsystems with the smallest $d_i$ are peripheral components serving specific devices, such as digital signal processors or gaming consoles. These components can inherently tolerate a higher rate of bugs, and hence they evolve more frequently.

Difficulty Subsystem % Acc. # Patch # Dev.
+2.664 usr 1.88% 796 70
+1.327 include 7.79% 398 101
+1.038 lib 15.99% 5642 707
+1.013 drivers/clk 34.34% 495 81
+0.865 include/trace 17.73% 547 81
-1.194 drivers/addi-data 78.31% 272 8
-1.080 net/tipc 43.11% 573 44
-0.993 drivers/ps3 44.26% 61 9
-0.936 net/nfc 73.04% 204 26
-0.796 arch/mn10300 45.40% 359 63
Table 6: Top-five and bottom-five subsystems according to their difficulty .

Jiang et al. [2013] establish that a high prior subsystem churn (i.e., a high number of previous commits to a subsystem) leads to a lower acceptance rate. We approximate the number of commits to a subsystem as the number of patches submitted multiplied by the subsystem's acceptance rate. The quartile of subsystems with the lowest difficulty has a lower average churn than the quartile with the highest difficulty. We hence verify that higher churn correlates with more difficult subsystems, which corroborates the results obtained by Jiang et al.
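A sketch of this churn approximation, using the per-subsystem patch counts and acceptance rates reported in Table 6 (the quartile comparison over the full dataset is not reproduced here):

```python
import numpy as np

# Per-subsystem statistics taken from Table 6: submitted patches and acceptance
# rate for usr, include, lib, drivers/clk, and drivers/addi-data.
n_patches = np.array([796, 398, 5642, 495, 272])
acceptance_rate = np.array([0.0188, 0.0779, 0.1599, 0.3434, 0.7831])

# Approximate churn: number of accepted commits ~ submitted patches * acceptance rate.
churn = n_patches * acceptance_rate
print(np.round(churn, 1))  # approximately [15.0, 31.0, 902.2, 170.0, 213.0]
```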

As shown in Figure 4, if false negatives are not a priority, interank yields a substantially higher precision. In other words, if the task at hand requires that the patches classified as accepted are indeed the ones integrated in a future release, then interank will yield more accurate results. For instance, it could efficiently support Linus Torvalds in the development of the Linux kernel by providing him with a restricted list of patches that are likely to be integrated in the next release.

6 Conclusion

In this paper, we introduce interank, a model of edit outcomes in peer-production systems. Similarly to user reputation systems, it is simple, easy to interpret and applicable to a wide range of domains. Whereas user reputation systems are usually not competitive with specialized edit-quality predictors tailored to a particular collaborative system, interank is able to bridge the gap between the two types of approaches, and it attains a predictive performance that is competitive with the state of the art—without access to content-based features.

We demonstrate the performance of the model on two collaborative systems exhibiting different characteristics. Beyond predictive performance, we can also use the model parameters to gain insight into the system. On Wikipedia, we show that the model identifies controversial articles, and that the latent dimensions learned by our model display interesting patterns related to cultural distinctions between articles. On the Linux kernel, we show that inspecting the model parameters enables us to distinguish core subsystems (large difficulty parameters) from peripheral components (small difficulty parameters).

Future Work

In the future, we would like to investigate the idea of using the latent embeddings learned by our model in order to recommend items to edit. Ideally, we could match items that need to be edited with users that are most suitable for the task. For Wikipedia, an ad-hoc method called “SuggestBot” was proposed by Cosley et al. [2007]. We believe it would be valuable to propose a method that is applicable to collaborative systems in general.

References

  • Abadi et al. [2016] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of OSDI’16, Savannah, GA, USA, Nov. 2016.
  • Adler and de Alfaro [2007] B. T. Adler and L. de Alfaro. A content-driven reputation system for the Wikipedia. In Proceedings of WWW’07, Banff, AB, Canada, May 2007.
  • Adler et al. [2008] B. T. Adler, L. de Alfaro, I. Pye, and V. Raman. Measuring author contributions to the Wikipedia. In Proceedings of WikiSym’08, Porto, Portugal, Sept. 2008.
  • Agrawal and de Alfaro [2016] R. Agrawal and L. de Alfaro. Predicting the quality of user contributions via LSTMs. In Proceedings of OpenSym’16, Berlin, Germany, Aug. 2016.
  • Bishop [2006] C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006.
  • Bradley and Terry [1952] R. A. Bradley and M. E. Terry. Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4):324–345, 1952.
  • Bronner and Monz [2012] A. Bronner and C. Monz. User edits classification using document revision histories. In Proceedings of EACL 2012, Avignon, France, Apr. 2012.
  • Corbet and Kroah-Hartman [2017] J. Corbet and G. Kroah-Hartman. 2017 Linux kernel development report. Technical report, The Linux Foundation, 2017.
  • Cosley et al. [2007] D. Cosley, D. Frankowski, L. Terveen, and J. Riedl. SuggestBot: Using intelligent task routing to help people find work in Wikipedia. In Proceedings of IUI’07, Honolulu, HI, USA, Jan. 2007.
  • de Alfaro and Adler [2013] L. de Alfaro and B. T. Adler. Content-driven reputation for collaborative systems. In Proceedings of TGC 2013, Buenos Aires, Argentina, Aug. 2013.
  • de Alfaro et al. [2011] L. de Alfaro, A. Kulshreshtha, I. Pye, and B. T. Adler. Reputation systems for open collaboration. Communications of the ACM, 54(8):81–87, 2011.
  • Druck et al. [2008] G. Druck, G. Miklau, and A. McCallum. Learning to predict the quality of contributions to Wikipedia. In Proceedings of WikiAI 2008, Chicago, IL, USA, July 2008.
  • Elo [1978] A. Elo. The Rating Of Chess Players, Past & Present. Arco Publishing, 1978.
  • GitHub [2017] GitHub. The state of the Octoverse 2017, 2017. URL https://octoverse.github.com/. Accessed: 2017-10-27.
  • Halfaker and Taraborelli [2015] A. Halfaker and D. Taraborelli. Artificial intelligence service “ORES” gives Wikipedians X-ray specs to see through bad edits, Nov. 2015. URL https://blog.wikimedia.org/2015/11/30/artificial-intelligence-x-ray-specs/. Accessed: 2017-10-27.
  • Heindorf et al. [2016] S. Heindorf, M. Potthast, B. Stein, and G. Engels. Vandalism detection in Wikidata. In Proceedings of CIKM’16, Indianapolis, IN, USA, Oct. 2016.
  • Javanmardi et al. [2011] S. Javanmardi, D. W. McDonald, and C. V. Lopes. Vandalism detection in Wikipedia: A high-performing, feature-rich model and its reduction through lasso. In Proceedings of WikiSym’11, Mountain View, CA, USA, Oct. 2011.
  • Jiang et al. [2013] Y. Jiang, B. Adams, and D. M. German. Will my patch make it? and how fast? case study on the Linux kernel. In Proceedings of MSR 2013, San Francisco, CA, USA, May 2013.
  • Koren et al. [2009] Y. Koren, R. Bell, and C. Volinsky. Matrix factorization techniques for recommender systems. Computer, 42(8):30–37, 2009.
  • Kruskal [1983] J. B. Kruskal. An overview of sequence comparison: Time warps, string edits, and macromolecules. SIAM Review, 25(2):201–237, 1983.
  • Potthast et al. [2008] M. Potthast, B. Stein, and R. Gerling. Automatic vandalism detection in Wikipedia. In Proceedings of ECIR 2008, Glasgow, Scotland, Apr. 2008.
  • Resnick et al. [2000] P. Resnick, K. Kuwabara, R. Zeckhauser, and E. Friedman. Reputation systems. Communications of the ACM, 43(12):45–48, 2000.
  • Rigby et al. [2008] P. C. Rigby, D. M. German, and M.-A. Storey. Open source software peer review practices: A case study of the Apache server. In Proceedings of ICSE’08, Leipzig, Germany, May 2008.
  • Salganik and Levy [2015] M. J. Salganik and K. E. C. Levy. Wiki surveys: Open and quantifiable social data collection. PLoS ONE, 10(5):1–17, 2015.
  • Sepehri Rad and Barbosa [2012] H. Sepehri Rad and D. Barbosa. Identifying controversial articles in Wikipedia: A comparative study. In Proceedings of WikiSym’12, Linz, Austria, Aug. 2012.
  • Sumi et al. [2011] R. Sumi, T. Yasseri, A. Rung, A. Kornai, and J. Kertész. Edit wars in Wikipedia. In Proceedings of SocialCom 2011, Boston, MA, USA, Oct. 2011.
  • Thurstone [1927a] L. L. Thurstone. A law of comparative judgment. Psychological Review, 34(4):273–286, 1927a.
  • Thurstone [1927b] L. L. Thurstone. The method of paired comparisons for social values. The Journal of Abnormal and Social Psychology, 21(4):384–400, 1927b.
  • van der Maaten and Hinton [2008] L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
  • Wikipedia [2017a] Wikipedia. Wikipedia article depth, 2017a. URL https://meta.wikimedia.org/wiki/Wikipedia_article_depth. Accessed: 2017-10-30.
  • Wikipedia [2017b] Wikipedia. Wikipedia:Wikipedians, 2017b. URL https://en.wikipedia.org/wiki/Wikipedia:Wikipedians. Accessed: 2017-10-27.
  • Yasseri et al. [2014] T. Yasseri, A. Spoerri, M. Graham, and J. Kertész. The most controversial topics in Wikipedia: A multilingual and geographical analysis. In P. Fichman and N. Hara, editors, Global Wikipedia: International and Cross-Cultural Issues in Online Collaboration. Scarecrow Press, 2014.
  • Zermelo [1928] E. Zermelo. Die Berechnung der Turnier-Ergebnisse als ein Maximumproblem der Wahrscheinlichkeitsrechnung. Mathematische Zeitschrift, 29(1):436–460, 1928.