Origins of Algorithmic Instabilities in Crowdsourced Ranking
Abstract.
Crowdsourcing systems aggregate decisions of many people to help users quickly identify high-quality options, such as the best answers to questions or interesting news stories. A long-standing issue in crowdsourcing is how option quality and human judgement heuristics interact to affect collective outcomes, such as the perceived popularity of options. We address this issue by conducting a controlled experiment where subjects choose between two ranked options whose quality can be independently varied. We use this data to construct a model that quantifies how judgement heuristics and option quality combine when deciding between two options. The model reveals that popularity ranking can be unstable: unless the quality difference between the two options is sufficiently high, the higher-quality option is not guaranteed to be eventually ranked on top. To rectify this instability, we create an algorithm that accounts for judgement heuristics to infer the best option and rank it first. This algorithm is guaranteed to be optimal if data matches the model. When the data does not match the model, however, simulations show that in practice this algorithm performs better than or at least as well as popularity-based and recency-based ranking for any two-choice question. Our work suggests that algorithms relying on inference of mathematical models of user behavior can substantially improve outcomes in crowdsourcing systems.
1. Introduction
Crowdsourcing websites aggregate judgments in order to help users discover high-quality content. These systems typically combine choices of many people to algorithmically rank content so that better items, such as product reviews (Lim and Van Der Heide, 2015), news stories (Stoddard, 2015; Muchnik et al., 2013), or answers on question-answering (Q&A) platforms (Adamic et al., 2008; Yao et al., 2015), are easier to find.
Despite a long history of crowdsourcing (de Condorcet, 1976; Galton, 1908), some of its limitations have only recently become apparent. Salganik et al. (2006) found that aggregating the votes of many people to rank songs increases the inequality and instability of song popularity. Even when starting from the same initial conditions, the same songs could end up with vastly different rankings. Other studies have shown that algorithmic ranking can amplify the inequality of popularity (Lerman and Hogg, 2014; Keith Burghardt and Lerman, 2018) and bias collective outcomes in crowdsourcing applications (Burghardt et al., 2017; Dev et al., 2019). In addition, information about the choices of other users affects decisions in complex ways (Muchnik et al., 2013; Hogg and Lerman, 2015; Talton III et al., 2019). Unfortunately for crowdsourcing system designers, it is still not clear how these findings could help improve collective outcomes, in large part due to the difficulty of quantifying the quality of options (e.g., the best answer to a question) and its impact on individual decisions.
To better understand and improve collective outcomes that emerge from individual decisions, we break down the crowdsourcing task into its basic elements: item quality, item ranking, and social influence. We create a controlled experiment to study how these elements jointly affect individual decisions and collective outcomes. We use experimental data to construct and validate a mathematical model of human judgements, and then use it to explore algorithmic ranking and identify strategies to improve crowdsourcing performance. Our study addresses the following research questions:
RQ1: How do quality and presentation of options jointly impact individual decisions?
RQ2: When is algorithmic ranking unstable, failing to reliably identify the best option?
RQ3: How can we stabilize algorithmic ranking such that the best option is typically ranked first?
Our experiment asks subjects to choose the best answer to questions with objectively correct answers, such as the number of dots in a picture. Figure 1 illustrates one such question, where we ask users to find the area ratio between the largest and smallest shapes. (Other questions used in the experiment are shown in Appendix Fig. 8.) While these simple questions abstract away some of the complexity of the often-subjective decision making people do in crowdsourcing systems, they allow quality to be objectively measured and its effects on decisions better understood.
The experiment has three conditions shown in Fig. 1. In the first condition, we let subjects write their answers to understand how their subjective guesses deviate from the correct answers. In the remaining conditions, we ask subjects to choose the best of two randomly generated answers that are randomly ordered. In the control condition, subjects are not told how they are ordered, while in the social influence condition, the first answer is labeled “more popular”. These simple conditions allow us to disentangle the elements of crowdsourcing systems and begin quantifying how individual decisions affect collective outcomes.
To begin answering RQ1, we construct a mathematical model of the probability to choose an option as a function of its position and quality. The model requires only two parameters to measure cognitive heuristics (mental shortcuts people use to make quick and efficient judgements) and is in excellent agreement with experimental data. The first parameter reflects a user’s preference to pick the first answer (known as “position bias” (Lerman and Hogg, 2014; Keith Burghardt and Lerman, 2018; Krumme et al., 2012; Stoddard, 2015)) and the other parameter measures the rate at which answers are guessed at random. Subjects otherwise pick the answer closest to their initial (unobserved) guess. We call this model the Biased Initial Guess (BIG) model. The BIG model demonstrates that the “social influence” experiment condition enhances position bias (Keith Burghardt and Lerman, 2018); therefore, cognitive heuristics that produce position bias and social influence can be quantified with a single parameter, a substantial simplification over previous work (Krumme et al., 2012; Stoddard, 2015). Moreover, it helps explain why users often choose the worse answer when answer quality differences are small.
The BIG model not only improves our understanding of how answers are chosen, but also allows us to test different ranking policies in simulations to answer RQ2 and RQ3. Importantly, these simulations demonstrate that cognitive heuristics can make popularity-based ranking highly unstable. When the quality difference between the options is small, initially minor differences in popularity can create a cumulative advantage (van de Rijt et al., 2014), meaning the better answer does not always become the most popular. However, when the quality difference passes a critical point, popularity ranking is stable, and the better answer eventually becomes the most popular. These results may help explain an underappreciated finding of Salganik et al. (2006) that the best and worst songs tended to be correctly ranked when songs were ordered by popularity, but intermediate-quality songs landed anywhere in between.
Finally, to answer RQ3, we propose an algorithm that rectifies this instability by ordering answers based on their inferred quality, which we call RAICR: Rectifying Algorithmic Instabilities in Crowdsourced Ranking. This method is found to be stable and to consistently order the better answer first, thus making the best answers easier to find, even when they are only slightly better. RAICR ranks answers as well as, or better than, common baselines such as ordering by popularity or by recency (ranking by the last answer picked).
Our work shows that individual decisions within crowdsourcing systems are strongly affected by cognitive heuristics, which collectively create instability and poor crowd wisdom. Designers of crowdsourcing systems need to account for these biases in order to make good content easier to find. Algorithms such as RAICR, however, can correct for these biases, thereby improving the wisdom of crowds.
2. Related Literature
2.1. Crowdsourcing
Crowdsourcing has a two-century-long history demonstrating how a collective can outperform individual experts (de Condorcet, 1976; Galton, 1908; Surowiecki, 2005; Kaniovski and Zaigraev, 2011; Simoiu et al., 2019), thus creating the moniker “wisdom of crowds”. Crowds have been shown to beat sport markets (Brown and Reade, 2019; Peeters, 2018), corporate earnings forecasts (Da and Huang, 2019), and improve visual searches (Juni and Eckstein, 2017). One reason crowd wisdom works is the law of large numbers: assuming unbiased and independent guesses, the average guess should converge to the true value.
Individual decisions, including in online settings, are biased by cognitive heuristics, such as anchoring (Shokouhi et al., 2015), primacy (Mantonakis et al., 2009), prior beliefs (White, 2013) and position bias (Keith Burghardt and Lerman, 2018). These biases are not necessarily canceled out with large samples (Prelec et al., 2017; Kao et al., 2018). As a result, aggregating guesses of a crowd does not necessarily converge to the correct answer.
Guesses are usually not independent, which can sometimes improve the wisdom of crowds. Social influence models, such as the DeGroot model (Degroot, 1974), have been shown to push simulated agents to an optimal decision (Golub and Jackson, 2010; Mossel et al., 2015; Bala and Goyal, 1998; Acemoglu et al., 2011). These results have been backed up experimentally (Becker et al., 2017, 2019; Ungar et al., 2012; Tetlock et al., 2017), even when opinion polarization is included (Becker et al., 2019). One reason social influence can be beneficial is that it encourages people who are way off the mark to improve their guess (Mavrodiev et al., 2013; Becker et al., 2017; Abeliuk et al., 2017).
Often, however, social influence can reduce crowd wisdom. Corporate earnings predictions (Da and Huang, 2019), jury decisions (Kaniovski and Zaigraev, 2011; Burghardt et al., 2019), and other guesses can degrade with influence (Lorenz et al., 2011; Lorenz et al., 2015; Simoiu et al., 2019), and malevolent individuals can manipulate people to make particular collective decisions (Muchnik et al., 2013; Asch, 1951). Too much influence by a single individual can also reduce the wisdom of collective decisions (Becker et al., 2017; Acemoglu et al., 2011; Golub and Jackson, 2010), and deferring to friends can sometimes make unpopular (and potentially low-quality) ideas appear popular (Lerman and Yan, 2016).
Recent work has also demonstrated how other cognitive heuristics can affect crowd wisdom. After the landmark study by Salganik et al. (2006), some researchers found that social influence has no effect on decisions (Marzia Antenore, 2018), or that position bias, i.e., the preference to choose options listed first, largely explains biases in crowdsourcing (Lerman and Hogg, 2014; Krumme et al., 2012). Social influence instead enhances the position bias (Keith Burghardt and Lerman, 2018; Krumme et al., 2012). Burghardt et al. (2018) have begun to tease apart these effects, showing that while social influence enhances position bias, it has no marginal effect when we control for position. This is consistent with our approach, which models and rectifies both biases with a single parameter.
2.2. Algorithmic Ranking
The goal of a ranking algorithm is to make good content easier to find. Many papers have begun to address this goal (Page et al., 1999; Järvelin and Kekäläinen, 2002; Bendersky et al., 2011), which has recently been applied to crowdsourced ranking. Because of human biases, however, algorithms that naïvely use human feedback to suggest content will end up forming echo chambers (Hilbert et al., 2018; Bozdag, 2013; Hajian et al., 2016), or only recommend already-popular items (Abdollahpouri et al., 2017, 2019). This can also give some content a cumulative advantage (van de Rijt et al., 2014), even when it is of similar quality to content that remains unpopular.
To correct for algorithmic bias in this paper, we create a ranking method that follows the strategy of Watts, who says, “…we can instead measure directly how they respond to a whole range of possibilities and react accordingly” (Watts, 2012). In the present context, this strategy implies we can create better algorithms for option ranking by observing, and addressing, how people respond to social influence and position biases. We show that this strategy, as applied in RAICR, improves upon simple algorithms used in the past, which include ordering results by popularity (Keith Burghardt and Lerman, 2018) or recency (Lerman and Hogg, 2014). While some crowdsourced ranking strategies use a two-tier platform model, in which researchers rank options based on whether content is downloaded and rated (Salganik et al., 2006; Marzia Antenore, 2018; Abeliuk et al., 2017), RAICR is based on a common simpler model in which we only observe if content is chosen (Keith Burghardt and Lerman, 2018; Stoddard, 2015). This subtle difference implies many previous ranking schemes are not applicable. The present paper also complements previous work that uses features, such as answer position, to predict Q&A website quality (Shah and Pomerantz, 2010; Burghardt et al., 2017).
3. Experiment
The experiment asked subjects, hired through Amazon Mechanical Turk between August 2018 and September 2019, to answer a series of questions shown in the Appendix. The order of questions was randomized for each subject, and questions were not time-limited. Questions include specifying the ratio of the areas of two shapes or the lengths of two lines, or the number of dots in an image. We designed the experiment around quotidian tasks that do not require specific expertise but are difficult for most people. Despite their difficulty, the questions have objectively correct answers. We quantify the quality of an answer by the difference from the mean of all guesses, which better matches a lognormal distribution, as discussed later. In the Appendix, we define quality of an answer by the difference from the correct value and find results are qualitatively very similar.
Subjects were assigned to one of three conditions. In the guess condition, shown in the left panel of Fig. 1, the subjects could freely type their guesses. In the control condition, shown in the center panel of Fig. 1, subjects were told to choose the best among two options, and were not told how the two answers were ordered. Finally, in the social influence condition, shown in the right panel of Fig. 1, they were told the first among two answers was the most popular, but the layout was otherwise identical to the previous condition. We only showed ten questions to each subject to reduce the effects of performance depletion in Q&A systems (Ferrara et al., 2017). Further, to reduce bias due to other phenomena, such as the decoy effect (Huber et al., 1982), we showed subjects only two choices. Subjects in the same condition but assigned on different days have statistically similar behavior.
Approximately 1800 subjects are evenly split between the conditions (596, 586, and 587 for the guess, control, and social influence conditions, respectively). For guess values, we remove extreme outliers (guesses smaller than 1 or greater than ). All answers are supposed to be greater than 1, while values greater than may affect mean values and appear to represent throwaway answers. The number of valid participants for each question is shown in Table 1.
Table 1. Number of valid participants for each question.
Question      Q1   Q2   Q3   Q4   Q5   Q6   Q7   Q8   Q9   Q10
Participants  595  594  593  595  592  593  589  591  584  590
Mechanical Turk workers were hired if they had an approval rate of over , completed more than 1000 Human Intelligence Tasks (HITs), and never participated in any of the experiment conditions before. Each worker was paid for the guessing condition and for the other two conditions. The assignment took minutes on average for the guess condition and minutes on average for the other conditions, equivalent to an hourly wage of . The human experiment was approved by the appropriate IRB board.
4. Results
4.1. A Mathematical Model of Decisions
We use data gathered from the experiment to answer RQ1: How do quality and presentation of options jointly impact individual decisions?
Open-Ended Experiment Condition
To derive the model of how people make these decisions, we start by constructing the distribution of guesses for each question (the guess condition in the experiment). The guesses, plotted in Appendix Fig. 9 and CDF shown in Fig. 2a, are highly variable (by as many as six orders of magnitude), while the correct answers vary by three orders of magnitude. The median guess may differ from the true answer, in agreement with previous work (Kao et al., 2018), but values are typically the correct order of magnitude.
We normalize guesses by defining a new variable :
where is a guess value, and is the mean of the logarithm of guesses. Figure 2b shows that this simple normalization scheme effectively collapses answer guesses to a single distribution. We show in the Appendix that guesses are not normally distributed, but instead are better approximated as lognormal, in agreement with previous work on a different set of questions (Kao et al., 2018). The normalized guesses can be thought of as the scores in lognormal distributions. Alternative ways to center data, shown in the Appendix, produce similar results. We use these normalized guesses and distributions in the remaining two experiment conditions. Intuitively, if the mean of all guesses converges to the correct answer, can be thought of as the best answer.
Two-Choice Experiment Conditions
The latter two experiment conditions require subjects to pick the better of two answers to the question. For both the control and social influence conditions, answers are ordered vertically, with one answer above the other. There is a significant position effect when answers are ordered this way, as shown in Fig. 3. In the control condition, there is a slightly greater probability (52%) to choose the first (top) answer over the last one (p-value ). In the social influence condition, meanwhile, the probability to choose the first answer is substantially larger (59%) and statistically significantly different from the control condition (p-value ). This is in agreement with previous work showing that social influence amplifies the position effect (Keith Burghardt and Lerman, 2018).
The Biased Initial Guess Model
We now have the necessary ingredients to model how decisions to choose an answer are affected by its quality, position, and social influence. We present the Biased Initial Guess (BIG) decision model and show it is consistent with the data.
We first discuss the simplest case where a user has to choose the better of two answers , listed first, and , listed last, in the absence of cognitive biases. Figure 4 shows probability of the answer, with the normalized values of the choices and , as well as the user’s initial guess about the true answer, which we do not observe. All things equal, the user will choose the first answer if it is closer to the initial guess, i.e., if , and will otherwise choose . The probability to choose is then:
(1) 
Assuming and follow a normal distribution quantified in the guess experiment condition,
(2) 
where erf is the error function and erfc is the complementary error function. When , is not only more likely to be closer to the initial guess () than , but is also closer to zero than , and therefore the objectively better answer. On the other hand, when , is further from most guesses compared to and the objectively worse answer.
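The two equations referenced above did not survive in this text, but the verbal description largely pins down their form. As a sketch only: write $A_1$ and $A_2$ for the normalized first and last answers and $G$ for the unobserved normalized initial guess (all three symbols are assumed notation), and suppose $G$ is normally distributed with mean zero and standard deviation $\sigma$ (also an assumed symbol). The name $s(\cdot,\cdot)$ for the unbiased choice probability is borrowed from the recency formula in Section 4.3.

```latex
% Eq. (1), reconstructed: the first answer is chosen when it is closer to the guess.
s(A_1, A_2) = \Pr\bigl(|A_1 - G| < |A_2 - G|\bigr)

% Eq. (2), reconstructed under G ~ N(0, sigma^2): the guess is closer to A_1
% exactly when it falls on A_1's side of the midpoint (A_1 + A_2)/2.
s(A_1, A_2) =
\begin{cases}
\dfrac{1}{2}\left[1 + \operatorname{erf}\!\left(\dfrac{A_1 + A_2}{2\sqrt{2}\,\sigma}\right)\right], & A_1 < A_2,\\[6pt]
\dfrac{1}{2}\,\operatorname{erfc}\!\left(\dfrac{A_1 + A_2}{2\sqrt{2}\,\sigma}\right), & A_1 > A_2.
\end{cases}
```

Under this form, $s(A_1, A_2) > 1/2$ holds exactly when $|A_1| < |A_2|$, which agrees with the statement that the answer more likely to lie near the initial guess is also the one closer to zero.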
To better model decision-making, we have to account for biases, due to cognitive heuristics and algorithmic ranking, to explain why people do not always choose the best option. As shown in Fig. 3, sometimes they choose the first answer even if it is not the best answer. We quantify this position bias by assuming that with probability p participants choose the first answer regardless of its quality. This parameter should presumably be small in the control condition and large in the social influence condition. Subjects may also choose an answer regardless of its position or quality because there is no monetary incentive to choose good answers. We model this by allowing subjects to choose an answer at random with a probability r. Taking these two heuristics into account, we arrive at the BIG Model:
(3) 
The probability of choosing is simply the complement of this probability. Because is expected to be the best answer, with some simple manipulation we can infer the probability the best answer is chosen:
(4) 
A similar equation can model .
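The BIG choice probability can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the form Pr(choose first) = (1 - r)[p + (1 - p)s] + r/2 is inferred from the closed-form recency solution in Section 4.3 rather than quoted from the source, the initial guess is taken to be normal with mean zero and standard deviation `sigma`, and all function names are hypothetical.

```python
import math


def unbiased_prob(a1: float, a2: float, sigma: float = 1.0) -> float:
    """Pr(the first answer is closer to the initial guess), guess ~ N(0, sigma^2).

    Assumed form of the unbiased probability s(a1, a2).
    """
    if a1 == a2:
        return 0.5  # identical answers carry no quality signal
    z = (a1 + a2) / (2.0 * math.sqrt(2.0) * sigma)
    # The guess is closer to a1 when it falls on a1's side of the midpoint.
    return 0.5 * (1.0 + math.erf(z)) if a1 < a2 else 0.5 * math.erfc(z)


def big_choose_first(a1: float, a2: float, p: float, r: float,
                     sigma: float = 1.0) -> float:
    """Pr(choose the first-listed answer) under the assumed BIG form.

    p: probability of picking the first answer regardless of quality.
    r: probability of picking an answer uniformly at random.
    """
    s = unbiased_prob(a1, a2, sigma)
    return (1.0 - r) * (p + (1.0 - p) * s) + r / 2.0


def big_choose_best(a_best: float, a_worst: float, p: float, r: float,
                    best_first: bool, sigma: float = 1.0) -> float:
    """Pr(the better answer is chosen), given where it is listed."""
    if best_first:
        return big_choose_first(a_best, a_worst, p, r, sigma)
    # If the better answer is listed last, it is chosen whenever the
    # first-listed (worse) answer is not.
    return 1.0 - big_choose_first(a_worst, a_best, p, r, sigma)
```

With `p = 0` and `r = 0` the function reduces to the unbiased probability, and the gap between the best answer's chances when listed first versus last works out to `p * (1 - r)` under this form.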
Agreement between the model and data is shown in Fig. 5. In the control condition (Fig. 5a), the best parameters are and . We find that the log-likelihood of the model, , is not statistically different from the log-likelihood expected if the data came from the model itself: p-value . See Methods for how p-values and error bars are calculated. We also check if we need both parameters, p and r, using the likelihood ratio test and Wilks’ Theorem (Wilks, 1938). We compare the likelihood ratio of the two-parameter model to simpler models with p or r (or both) set to zero. The probability a simpler model could fit the data as well or better is . We conclude that our model describes the control condition very well.
The agreement between data and model is similarly close in the social influence condition (Fig. 5b). The position bias parameter is larger than in the control condition, in agreement with expectations. We also find , thus social influence reduces the frequency of random guesses. In both experiment conditions, surprisingly, of users choose answers for reasons besides “quality” ( and for the control and social influence conditions, respectively). Similar to the control condition, we find that the model is consistent with the data. The log-likelihood of the empirical data () is not statistically different from log-likelihood values expected if the data came from the model: p-value . The probability a simpler model (p or r set to zero) could fit the data as well or better is . In conclusion, we find the BIG model is consistent with both experiment conditions and its parameters are interpretable and meaningful. In the Appendix, we show that all these results are consistent when we look at a subset of experiment questions or center the data differently.
4.2. Algorithmic Ranking Instability
Crowdsourcing websites automatically highlight what they consider the best choices to help their users more quickly discover them. For example, Stack Exchange (like other Q&A platforms) usually ranks answers to questions by the number of votes they receive. Despite problems with popularity-based ranking identified in previous studies (Salganik et al., 2006; Lerman and Hogg, 2014), it is widely used for ranking content in crowdsourcing websites. In this section we identify an instability in popularity-based ranking: the first few votes a worse answer receives can lock it in the top position, where it acquires cumulative advantage (van de Rijt et al., 2014). This allows us to answer RQ2: When is algorithmic ranking unstable, failing to reliably identify the best option?
To demonstrate the instability, we simulate a group of agents who choose answers according to the BIG model with and . In the simulations, one answer is objectively best, e.g., exactly equal to the correct answer (), while the worst answer is larger than (results are symmetric if ). At each timestep, a new user arrives and independently chooses the first or last answer according to Eq. 3. Answers reorder depending on the ranking algorithm. Figure 6 shows our results. If the worst answer, , is not too large and has a few extra votes (a common occurrence when the worst answer is posted before the better answer), we find that popularity ordering completely breaks down—the worst answer usually becomes more popular and is ranked first. Even when both answers start with the same number of votes, the worse answer is often ranked first. Results remain stable even after 20K votes; the effect does not appear transient.
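A simulation of this kind can be sketched as follows. The code is a minimal illustration rather than the authors' simulation: it assumes the BIG form Pr(choose first) = (1 - r)[p + (1 - p)s] + r/2 (inferred from the recency solution in Section 4.3), takes the unbiased probability `s_best` of preferring the better answer as a direct input instead of computing it from answer values, and keeps the current order on ties. The parameter values in the usage note are illustrative, not the fitted ones.

```python
import random


def choose_first_prob(s_first: float, p: float, r: float) -> float:
    """Assumed BIG form: Pr(choose the first-listed answer)."""
    return (1.0 - r) * (p + (1.0 - p) * s_first) + r / 2.0


def simulate_popularity(s_best: float, p: float, r: float, n_votes: int,
                        head_start_worst: int, rng: random.Random) -> bool:
    """Run one popularity-ranked vote sequence.

    Returns True if the better answer is ranked first after n_votes.
    head_start_worst: votes the worse answer holds before the run starts.
    """
    votes_best, votes_worst = 0, head_start_worst
    # The worse answer starts on top if it has a head start; otherwise
    # the initial order is a coin flip (an assumption of this sketch).
    best_first = False if head_start_worst > 0 else rng.random() < 0.5
    for _ in range(n_votes):
        # Unbiased probability of preferring whichever answer is listed first.
        s_first = s_best if best_first else 1.0 - s_best
        picked_first = rng.random() < choose_first_prob(s_first, p, r)
        if picked_first == best_first:
            votes_best += 1   # the vote went to the better answer
        else:
            votes_worst += 1
        # Re-rank by popularity; ties keep the current order.
        if votes_best > votes_worst:
            best_first = True
        elif votes_worst > votes_best:
            best_first = False
    return best_first


def fraction_best_first(s_best: float, p: float, r: float, n_votes: int,
                        head_start_worst: int, n_runs: int, seed: int = 0) -> float:
    """Fraction of independent runs that end with the better answer on top."""
    rng = random.Random(seed)
    wins = sum(
        simulate_popularity(s_best, p, r, n_votes, head_start_worst, rng)
        for _ in range(n_runs)
    )
    return wins / n_runs
```

For example, `fraction_best_first(0.99, 0.1, 0.1, 500, 5, 200)` corresponds to a large quality gap with weak position bias, while `fraction_best_first(0.55, 0.4, 0.1, 500, 5, 200)` corresponds to similar-quality answers with strong position bias, the regime in which the text reports that the worse answer tends to stay on top.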
We can explain this result using our model. Using the complement of Eq. 3, we can define the probability the best answer is chosen, conditional on it being ranked last:
(5) 
We find that
(6) 
where is the probability of choosing the best answer when it is ranked first. If
(7) 
then subjects are more likely to choose the better answer regardless of whether it is ranked first or second; therefore, answer order is stable. When the above inequality does not hold, however, subjects are more likely to pick the first answer regardless of its quality, and the popularity ranking is unstable. The critical value of between the two regimes is when
(8) 
and is independent of . Intuitively, if the answer is exceptionally bad, it will always be less popular. This is similar to the results in Salganik’s MusicLab study (Salganik et al., 2006), where particularly good and bad songs were ranked correctly. However, if the first answer is likely to be chosen regardless of quality, the worst answer will continue accumulating votes and remain in the top position. Given , we can use Eqs. 8 and 2 to numerically solve for the critical value of . We plot the critical point as a function of and in the phase diagram in Fig. 7. We see that there is always a large part of the phase space where popularity-based ranking will be unpredictable if the answers are close in quality. Based on Eq. 8, if , there will be no case where popularity-based ranking is guaranteed to correctly rank answers. While this is an extreme case, it still points to substantial limitations of popularity-based ranking.
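The stability argument can be checked numerically. The sketch below assumes the BIG form Pr(choose first) = (1 - r)[p + (1 - p)s] + r/2, inferred from the recency solution in Section 4.3 rather than quoted from the source. Under that form the better answer, when ranked last, is chosen with probability r/2 + (1 - r)(1 - p)s, which exceeds 1/2 exactly when (1 - p)s > 1/2, so the critical value of s is 1/(2(1 - p)) and does not depend on r. Placing the best answer at zero in normalized units and taking the initial guess to be normal with standard deviation `sigma` are further assumptions, and the function names are hypothetical.

```python
from statistics import NormalDist
from typing import Optional


def prob_best_when_last(s: float, p: float, r: float) -> float:
    """Pr(the better answer is chosen while ranked last), assumed BIG form."""
    return r / 2.0 + (1.0 - r) * (1.0 - p) * s


def critical_s(p: float) -> float:
    """Unbiased probability s at which popularity ranking becomes stable."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must lie in [0, 1)")
    return 1.0 / (2.0 * (1.0 - p))


def is_stable(s: float, p: float) -> bool:
    """True if the better answer is favored even when it is ranked last."""
    return (1.0 - p) * s > 0.5


def critical_worst_answer(p: float, sigma: float = 1.0) -> Optional[float]:
    """Normalized value of the worse answer at the critical point.

    Assumes the best answer sits at 0 and the worse answer is positive, so
    s = Phi(a_worst / (2 * sigma)). Returns None when no finite answer
    reaches the critical s (position bias too strong).
    """
    s_c = critical_s(p)
    if s_c >= 1.0:
        return None
    return 2.0 * sigma * NormalDist().inv_cdf(s_c)
```

In this sketch, `critical_worst_answer(0.5)` returns `None`: once the position bias reaches one half, no quality gap satisfies the stability condition, which is in line with the extreme case discussed in the text.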
4.3. Stabilizing Algorithmic Ranking
Given the problems with popularity ranking, it is critical to create a more consistent ranking algorithm. Moreover, while we have so far explored questions with numeric answers, we want a method that works for all types of answers. Using the BIG model, we can answer RQ3: How can we stabilize algorithmic ranking such that the best option is typically ranked first?
Assume we can approximate p and r; then we can invert Eq. 3 and use votes to infer the only unknown variable . When , is the best answer, but if we incorrectly rank first, . We can therefore rank the answer in which as the best answer. This is the backbone of the RAICR algorithm. The algorithm uses maximum likelihood estimation to solve for , as shown in the Appendix. Therefore, if the data matches the BIG model with the correct p and r parameters, our method optimally infers the correct ranking by having minimal variance and no bias (Newey and McFadden, 1994). Moreover, this method only depends on the votes an answer receives rather than the type of answer, such as a numerical or textual answer. We compare quality ranking to popularity-based ranking and recency-based ranking (ranking by the last answer picked), as shown in Fig. 6. The probability that recency ranks the best answer first is given by the self-consistent equation:
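\[
\Pr(\text{Rank First} \mid \text{Recency}) = \Pr(\text{Choose } A_{\text{best}} \mid A_{\text{best}} \text{ First}) \, \Pr(\text{Rank First} \mid \text{Recency}) + \Pr(\text{Choose } A_{\text{best}} \mid A_{\text{best}} \text{ Last}) \, \bigl(1 - \Pr(\text{Rank First} \mid \text{Recency})\bigr).
\]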
This represents the limit in which answers acquire many votes. The solution to this equation is:
\[
\Pr(\text{Rank First} \mid \text{Recency}) = \frac{2\,(1-p)\,(1-r)\, s(A_{\text{best}}, A_{\text{worst}}) + r}{2 - 2p\,(1-r)}
\]
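The closed-form solution can be checked against the self-consistent equation by direct iteration. The sketch below assumes the BIG form Pr(choose first) = (1 - r)[p + (1 - p)s] + r/2, which is the form under which the two expressions agree; the function names and the parameter values used for checking are illustrative.

```python
def recency_rank_first(s: float, p: float, r: float) -> float:
    """Closed-form Pr(best answer ranked first) under recency ranking.

    s: unbiased probability of preferring the better answer.
    Undefined for p = 1 with r = 0, where the denominator vanishes.
    """
    return (2.0 * (1.0 - p) * (1.0 - r) * s + r) / (2.0 - 2.0 * p * (1.0 - r))


def recency_fixed_point(s: float, p: float, r: float, n_iter: int = 200) -> float:
    """Solve the self-consistent equation by fixed-point iteration."""
    # Pr(choose best | best listed first), assumed BIG form.
    p_first = (1.0 - r) * (p + (1.0 - p) * s) + r / 2.0
    # Pr(choose best | best listed last) is the complement of choosing the
    # first-listed worse answer, whose unbiased probability is 1 - s.
    p_last = 1.0 - ((1.0 - r) * (p + (1.0 - p) * (1.0 - s)) + r / 2.0)
    x = 0.5  # initial guess for Pr(rank first)
    for _ in range(n_iter):
        # Under recency ranking, the answer picked last is ranked first.
        x = p_first * x + p_last * (1.0 - x)
    return x
```

The iteration is a contraction with factor p(1 - r), so it converges quickly whenever that product is below one. With no position bias (p = 0) the closed form reduces to (1 - r)s + r/2, the probability of picking the better answer in any single vote.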
We find that the RAICR algorithm performs at least as well as popularity-based ranking at ranking the better answer first, and often better after 20–50 votes (Fig. 6a and Appendix Fig. 12). Moreover, RAICR always outperforms the recency-based algorithm. The benefit of the RAICR algorithm grows as we collect more votes.
For example, Fig. 6b shows that after 500 votes the advantage of RAICR is larger, and Fig. 6c shows that after 20K votes the method is nearly optimal. Popularity-based ranking performs badly for , and recency performs worst if .
While we show these results hold when we have 20 to 20,000 votes, many platforms are under-provisioned, with a large fraction of webpages receiving little attention and few votes (Baeza-Yates, 2018; Gilbert, 2013). It is the webpages that receive many votes, however, which may be the most important. Correctly ranking options in these popular pages is therefore especially critical to crowdsourcing websites. Moreover, a moderate number of votes is generally needed to make a reasonable estimate of quality, so very few methods will accurately rank unpopular pages.
One caveat of the RAICR algorithm is that we need an approximate value for two parameters: p and r. What happens if either of these parameters is far off? For example, we could assume when . Results in the Appendix, however, show that the findings are quantitatively very similar, and therefore our model is robust to assumptions about . On the other hand, what if we incorrectly estimate both p and r? We show in the Appendix that even in this worst-case scenario, our method performs slightly worse, but is comparable to or substantially better than popularity-based ranking.
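The inference step described in this section can be sketched as a one-dimensional maximum likelihood problem. The code below is not the Appendix derivation, which is not part of this text; it is a minimal illustration that assumes the BIG form Pr(choose first) = (1 - r)[p + (1 - p)s] + r/2, treats p and r as known, and uses the decision rule "rank the candidate first if its estimated s exceeds 1/2" as a reading of the paragraph above. The vote encoding and all function names are hypothetical.

```python
import math
from typing import List, Tuple

# Each vote: (candidate_was_listed_first, candidate_was_chosen).
Vote = Tuple[bool, bool]


def prob_choose_candidate(s: float, p: float, r: float, candidate_first: bool) -> float:
    """Pr(the candidate answer is chosen), assumed BIG form.

    s: unbiased probability that the candidate is preferred on quality alone.
    """
    base = r / 2.0 + (1.0 - r) * (1.0 - p) * s      # candidate listed last
    return base + (1.0 - r) * p if candidate_first else base


def log_likelihood(s: float, votes: List[Vote], p: float, r: float) -> float:
    """Log-likelihood of s; each vote is weighted by the rank at voting time."""
    eps = 1e-12  # guards against log(0) at the boundary when r = 0
    total = 0.0
    for candidate_first, chose_candidate in votes:
        q = prob_choose_candidate(s, p, r, candidate_first)
        q = min(max(q, eps), 1.0 - eps)
        total += math.log(q if chose_candidate else 1.0 - q)
    return total


def estimate_s(votes: List[Vote], p: float, r: float, tol: float = 1e-6) -> float:
    """Maximize the log-likelihood over s in [0, 1] by ternary search.

    The log-likelihood is a sum of logs of linear functions of s, hence concave.
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if log_likelihood(m1, votes, p, r) < log_likelihood(m2, votes, p, r):
            lo = m1
        else:
            hi = m2
    return 0.5 * (lo + hi)


def candidate_is_better(votes: List[Vote], p: float, r: float) -> bool:
    """Rank the candidate first if its inferred quality advantage is positive."""
    return estimate_s(votes, p, r) > 0.5
```

Because the likelihood conditions on where the candidate was listed when each vote was cast, an answer that was stuck in the last position can still be inferred to be the better one even if it collected fewer votes than its competitor.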
5. Future Work and Design Implications
In the experiment, we decomposed the crowdsourcing task to its basic components to reduce the complexity and variation inherent in real-world tasks. While this helps to disentangle effects of option quality and position without confounding factors muddying the relationships, future work is needed to verify the ecological validity of our results (Herbst and Mas, 2015). It is encouraging that the probability to choose an answer (Fig. 3b) is quantitatively similar to empirical data gathered from Stack Exchange (Keith Burghardt and Lerman, 2018), despite Mechanical Turk workers not being representative of the general population (Munger et al., 2019).
For simplicity, we only explored two-option questions; future work should aim to understand multi-option decision-making given options of variable quality. A generalization of RAICR should also address more complicated biases, such as preference for round numbers (Fitzmaurice and Pease, 1986), anchoring (Shokouhi et al., 2015; Furnham and Boo, 2011), and biases that appear in multi-option decisions, such as the decoy effect (Huber et al., 1982). Finally, while RAICR is found to be robust to moderate changes in its parameters, this algorithm and its extensions may fail to rank options properly if its parameters are far off, or if the BIG model is wrong. In the experiment, the model is backed by data, but future work needs to address whether other tasks or questions follow this model.
Our results offer implications for crowdsourcing platforms. First, designers must recognize the limits of crowdsourcing due to biases implicit in their platform. In our experiment people often upvoted options at random (up to 20% of all votes), and chose an inferior option simply because it was shown first. This creates a ranking instability when options are of similar quality. Our controlled experiment and mathematical model point to ways we can counteract this instability. Designers should similarly create platform-tailored mathematical models and controlled experiments to rigorously test how crowds can better infer the best options.
A key property of our RAICR algorithm is that it relies on accurate modeling of user decisions to counteract cognitive biases. In effect, each vote is weighted depending on the ranks of answers at the time the vote is cast. The idea is similar to one described by Abeliuk et al. (2017) that ranks items by their inferred quality in order to more robustly identify blockbuster items. Similar weighting schemes could be applied to future debiased algorithms to address the unique goals of each crowdsourcing platform.
There are also simple methods that platforms like Reddit, Facebook, and Stack Exchange can try that may greatly outperform the baselines we mention in our paper. For example, items that have not yet acquired many votes can be ranked randomly to reduce initial ranking biases. Alternatively, new posts and links could be ranked appropriately but their popularity could be hidden until they gather enough votes. Our results suggest this could reduce social influence-based position bias up until the true option quality is more obvious.
6. Conclusion
In this paper, we introduce an experiment designed to inform how cognitive biases and option quality interact to affect crowdsourced ranking. Results from this experiment help us create a novel mathematical decision model, the BIG model, that greatly improves our understanding of how people find the best answer to a question as a function of answer quality, rank, and social influence. This model is then applied to the RAICR algorithm to better rank answers. The BIG model also helped us uncover instability in popularity-based ranking. The instability depends on the quality of options: when there are large differences between option qualities, popularity converges optimally and predictably. However, when the difference between the quality of options is small, the better option may not always become the most popular. These results can help us better understand the foundational empirical results of Salganik et al. (2006), who found that popularity-based ranking correctly ranked high and low quality songs, while the ranking of intermediate quality songs was highly unstable. Although our experimental setup is undeniably simpler than real crowdsourcing websites, our results suggest that accurate models of user behavior together with mathematically principled inference can improve the efficiency of crowdsourcing.
Acknowledgements.
Our work is supported by the US Army Research Office MURI Award No. W911NF1310340 and the DARPA Award No. W911NF1710077. Data as well as code to create experiments, create simulations, and analyze data are available at https://github.com/KeithBurghardt/QualityRankCodeAndData.
Appendix A
In this section, we discuss the experiment questions, the validity of the lognormal guess distribution, alternative answer normalization schemes, and the log-likelihood estimate used to rank answers in simulations. We also discuss the robustness of the simulation results.
a.1. Experiment Details
Experiment questions are shown in Fig. 8. We see that questions cover a variety of visual problems with numerical answers. Questions include finding ratios of lines or areas, counting the number of “r”s in text, or counting dots. We make sure that guesses cannot be easily measured (e.g., lines are not straight and dots are not evenly distributed). The guesses, median guess value, and true values are shown in Fig. 9. We see that the median values are close to the true values, but there is deviation between the two in Questions Q6–10, which happen to be dot-counting questions. In any case, the lognormal fit can only be approximate, since all guesses are required to be at least 1 and, in the case of Q5–10, are integers. Nonetheless, we find the lognormal approximation useful for later calculations. We also plot how well the data fits a lognormal distribution in Fig. 10. We notice that, for most questions, the fit is reasonable or even very good, especially for questions Q6–10. An exception is Q5, where the data is highly peaked around the correct answer, 47. This is because people have the ability to count the correct answer for this question, while other answers are much more difficult to infer. That being said, Fig. 2 shows us that the normalized answers are very similar to each other; thus, despite the disagreement, results are still qualitatively similar to the lognormal distribution.
To see if the discrepancy between the guess values and true values affected our model results, we centered answers by as well as by the true values, as shown in Table 2. We see the main text results (in bold) are very similar regardless of how data is centered. Because we see a difference between guesses for dot questions (Q6–10) and ratio question (Q1–4), we separately fit these data subsets to the model. While some of the parameters differed, the qualitative results remain consistent: the model with and fit better than simpler models, and increases in the social influence condition.
Data  Centering  Pr(Matches Data)  Pr()  Pr()  Pr()  

Control All Q  Mean  0.397  0.002  0.28  0.02  0.05  0.02  
Control Q  Mean  0.547  0.023  0.54  0.05  0.10  0.04  
Control Q  Mean  0.61  0.061  0.19  0.02  0.04  0.02  
Control All Q  True  0.546  0.001  0.33  0.02  0.06  0.02  
Control Q  True  0.546  0.020  0.69  0.04  0.16  0.06  
Control Q  True  0.556  0.036  0.12  0.02  0.03  0.02  
Soc. Inf. All Q  Mean  0.472  0.08  0.02  0.21  0.01  
Soc. Inf. Q  Mean  0.498  0.29  0.06  0.36  0.04  
Soc. Inf. Q  Mean  0.52  0.06  0.02  0.15  0.02  
Soc. Inf. All Q  True  0.504  0.11  0.02  0.21  0.01  
Soc. Inf. Q  True  0.53  0.35  0.05  0.39  0.04  
Soc. Inf. Q  True  0.561  0.06  0.02  0.14  0.02 
Bold results are in the main text. represents average and represents the standard error.
a.2. Statistical Methods
Fitting the Decision Model
We fit the decision model defined in the Findings section using maximum likelihood estimation (MLE). This method, however, only provides a point estimate. In order to determine the error of the model parameters, we bootstrap the data (i.e., sample with replacement times for data of size ) and calculate the MLE values for each parameter. We repeat this step times to create a parameter distribution, and calculate the standard deviation of this distribution to find parameter error bars.
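The bootstrap step can be sketched as follows. This is a minimal illustration rather than the analysis code: `bootstrap_standard_error` and `bernoulli_mle` are hypothetical names, the number of resamples (`n_boot=1000`) is an arbitrary choice, and a simple vote-share estimator stands in for the decision-model fit.

```python
import random
import statistics

def bootstrap_standard_error(data, fit, n_boot=1000, seed=0):
    """Estimate the standard error of fit(data) by bootstrapping."""
    rng = random.Random(seed)
    n = len(data)
    estimates = []
    for _ in range(n_boot):
        # Draw n observations with replacement from the n original observations.
        resample = [data[rng.randrange(n)] for _ in range(n)]
        # Refit the estimator on the resampled data.
        estimates.append(fit(resample))
    # The spread of the refitted values serves as the parameter error bar.
    return statistics.stdev(estimates)

def bernoulli_mle(votes):
    """Stand-in estimator: the MLE of a vote probability is the sample mean."""
    return sum(votes) / len(votes)

votes = [1] * 60 + [0] * 40
se = bootstrap_standard_error(votes, bernoulli_mle)
```

For this toy data the result should be close to the analytic value sqrt(0.6 × 0.4 / 100) ≈ 0.049; for the decision model, `fit` would be replaced by a routine returning the MLE of the parameter of interest.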
Comparing Models
The decision model we fit to data has two parameters; however, simpler decision models may fit the data equally well. To check whether this is true, we compare the log-likelihood of the two-parameter decision model, $\ell_1$, to that of a simpler model, $\ell_0$. Let $n$ be the number of observations; then by Wilks’ theorem (Wilks, 1938), as $n \to \infty$, $2(\ell_1 - \ell_0)$ should follow a $\chi^2_k$ distribution if the data better matches the simpler model, where $k$ is the difference in the degrees of freedom. In our case, the simpler models have one to two fewer degrees of freedom.
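A likelihood-ratio test of this kind can be sketched with the standard library alone, because the chi-square tail probability has a closed form for one and two degrees of freedom, the only cases needed here. The function names and the example log-likelihood values are illustrative assumptions.

```python
import math

def chi2_sf(x, k):
    """Upper-tail probability of a chi-square variable with k = 1 or 2 d.o.f."""
    if x <= 0:
        return 1.0
    if k == 1:
        # For k = 1 the variable is a squared standard normal.
        return math.erfc(math.sqrt(x / 2.0))
    if k == 2:
        # For k = 2 the variable is exponential with mean 2.
        return math.exp(-x / 2.0)
    raise ValueError("only k = 1 or k = 2 is supported in this sketch")

def likelihood_ratio_test(loglik_full, loglik_simple, k):
    """Return the test statistic and p-value for nested models (Wilks' theorem)."""
    statistic = 2.0 * (loglik_full - loglik_simple)
    return statistic, chi2_sf(statistic, k)

# Hypothetical log-likelihoods: full model vs. a model with one fewer parameter.
stat, p_value = likelihood_ratio_test(-100.0, -101.9205, k=1)
```

A small p-value indicates that the simpler model is unlikely to explain the improvement in fit, which favors keeping the extra parameter.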
Agreement with Data
In order to find out whether our model fits the data well, we compare our model’s MLE loglikelihood value to the loglikelihood of data bootstrapped from the model with the same answers ( and ) and parameter values (, and ) as the empirical data. We then say the data agrees with the model if the probability the bootstrapped data fits the model worse than empirical data is greater than 0.1. In practice, we find the probability typically exceeds 0.4, thus the model is consistent with the data. To check the robustness of this agreement method, we also took the MLE of the bootstrapped data by refitting it to the model. Results are virtually identical.
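The agreement check can be sketched as a parametric bootstrap. The version below is a simplification under stated assumptions: it treats a single answer pair whose fitted choice probabilities (`q_first` when the answer is shown first, `q_last` when shown last) are already known, and all names, counts, and the number of simulated data sets are hypothetical.

```python
import math
import random

def log_lik(n_t, N_t, n_b, N_b, q_first, q_last):
    """Log-likelihood of vote counts given the two choice probabilities."""
    return (n_t * math.log(q_first) + (N_t - n_t) * math.log(1 - q_first)
            + n_b * math.log(q_last) + (N_b - n_b) * math.log(1 - q_last))

def agreement_probability(n_t, N_t, n_b, N_b, q_first, q_last,
                          n_sim=500, seed=0):
    """Fraction of model-generated data sets that fit no better than the data."""
    rng = random.Random(seed)
    observed = log_lik(n_t, N_t, n_b, N_b, q_first, q_last)
    worse = 0
    for _ in range(n_sim):
        # Simulate votes from the fitted model with the same sample sizes.
        sim_t = sum(rng.random() < q_first for _ in range(N_t))
        sim_b = sum(rng.random() < q_last for _ in range(N_b))
        if log_lik(sim_t, N_t, sim_b, N_b, q_first, q_last) <= observed:
            worse += 1
    return worse / n_sim

prob = agreement_probability(761, 1000, 491, 1000, q_first=0.761, q_last=0.491)
agrees = prob > 0.1  # threshold used in the text
```

When the observed counts are typical of the model, as in this toy input, the probability sits near one half, well above the 0.1 threshold.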
a.3. Calculating Probability Error
To calculate error bars in probabilities, we assume a uniform prior, and use the Beta distribution to create a posterior probability distribution:
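\[ \Pr(q \mid n_s, n_f) = \frac{q^{n_s} (1-q)^{n_f}}{B(n_s + 1,\, n_f + 1)}, \tag{9} \]
where $n_s$ is the number of successes, $n_f$ is the number of failures, $B$ is the Beta function, and $q$ is the estimated probability of success. This allows us to calculate error bars for $q$ even when $n_s = 0$ or $n_f = 0$. In all plots, the point estimate is the MLE: $q = n_s / (n_s + n_f)$.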
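As a rough sketch of this calculation, a uniform prior combined with binomial counts gives a Beta posterior with parameters (successes + 1, failures + 1), and a central interval can be read off a numerical grid. The function name, the grid size, and the 68% interval mass are assumptions made for illustration; the interval width used for the plots is not stated here.

```python
import math

def beta_error_bars(n_s, n_f, mass=0.68, grid=20000):
    """Point estimate and central credible interval under a uniform prior."""
    # Posterior density is proportional to x**n_s * (1 - x)**n_f on (0, 1).
    xs = [(i + 0.5) / grid for i in range(grid)]
    log_w = [n_s * math.log(x) + n_f * math.log(1 - x) for x in xs]
    peak = max(log_w)
    # Subtract the peak before exponentiating to avoid underflow.
    weights = [math.exp(v - peak) for v in log_w]
    total = sum(weights)
    lower_target = (1 - mass) / 2
    upper_target = 1 - lower_target
    lower = upper = None
    cumulative = 0.0
    for x, w in zip(xs, weights):
        cumulative += w / total
        if lower is None and cumulative >= lower_target:
            lower = x
        if upper is None and cumulative >= upper_target:
            upper = x
            break
    if upper is None:
        upper = xs[-1]
    mle = n_s / (n_s + n_f)  # point estimate: observed success fraction
    return mle, lower, upper

# Zero successes in ten trials still yields a non-degenerate interval.
mle, lower, upper = beta_error_bars(0, 10)
```

The example shows the property noted above: with zero successes the point estimate is 0, yet the interval has a nonzero upper end.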
a.4. MLE Equation
Let’s assume we can independently measure the position bias parameter, $p$, and random guess parameter, $r$ (e.g., by using results from the present experiment). Let $n_t$ ($n_b$) and $N_t$ ($N_b$) be the number of times an answer was chosen when it was ranked first (last) and the total number of votes to all answers when the answer was ranked first (last). Also, recall that if , is the better answer.
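To estimate $s$, we first define
\[ \Pr(A_{\mathrm{best}}\ \mathrm{First} \mid A_{\mathrm{best}}, A_{\mathrm{worst}}, p, r) = \frac{r}{2} + (1-r)\bigl(p + (1-p)\, s(A_{\mathrm{best}}, A_{\mathrm{worst}})\bigr), \]
\[ \Pr(A_{\mathrm{best}}\ \mathrm{Last} \mid A_{\mathrm{best}}, A_{\mathrm{worst}}, p, r) = \frac{r}{2} + (1-r)\bigl((1-p)(1 - s(A_{\mathrm{best}}, A_{\mathrm{worst}}))\bigr), \]
\[ \Pr(A_{\mathrm{worst}}\ \mathrm{First} \mid A_{\mathrm{best}}, A_{\mathrm{worst}}, p, r) = \frac{r}{2} + (1-r)\bigl((1-p)\, s(A_{\mathrm{best}}, A_{\mathrm{worst}})\bigr), \]
and
\[ \Pr(A_{\mathrm{worst}}\ \mathrm{Last} \mid A_{\mathrm{best}}, A_{\mathrm{worst}}, p, r) = \frac{r}{2} + (1-r)\bigl(p + (1-p)(1 - s(A_{\mathrm{best}}, A_{\mathrm{worst}}))\bigr), \]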
where $A_{\mathrm{best}}$ and $A_{\mathrm{worst}}$ are the respective answers, and “First” (“Last”) means that answer is ordered first (last). The variable $s$ is the only unknown. Surprisingly, we can infer $s$ without having to normalize answers, let alone know the answer distribution.
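The likelihood function of the model is:
\[ L(n_t, N_t, n_b, N_b, s, p, r) = \Pr(A_1\ \mathrm{First} \mid s, p, r)^{n_t}\, \Pr(A_1\ \mathrm{Last} \mid s, p, r)^{N_t - n_t}\, \Pr(A_2\ \mathrm{First} \mid s, p, r)^{n_b}\, \Pr(A_2\ \mathrm{Last} \mid s, p, r)^{N_b - n_b}. \]
The log-likelihood function is therefore
\[ \ell(n_t, N_t, n_b, N_b, s, p, r) = n_t \ln \Pr(A_1\ \mathrm{First} \mid s, p, r) + (N_t - n_t) \ln \Pr(A_1\ \mathrm{Last} \mid s, p, r) + n_b \ln \Pr(A_2\ \mathrm{First} \mid s, p, r) + (N_b - n_b) \ln \Pr(A_2\ \mathrm{Last} \mid s, p, r). \]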
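To find the MLE of $s$, we set the derivative of the log-likelihood to zero,
\[ \frac{\partial \ell}{\partial s} = 0, \tag{10} \]
and then solve for $s$. The MLE value of $s$ can easily be solved numerically. As long as a researcher records $n_t$, $N_t$, $n_b$, and $N_b$, we can accurately infer answer quality.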
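One way to carry out this numerical step is sketched below. It is an illustration under stated assumptions rather than the released implementation: the function names are hypothetical, the values `p=0.3` and `r=0.1` are arbitrary, and a ternary search is used in place of a root finder, which works because each choice probability is linear in the unknown quality parameter and the log-likelihood is therefore concave in it.

```python
import math

def choice_probs(s, p, r):
    """Probability that A_1 is chosen when it is shown first, and when shown last."""
    first = r / 2 + (1 - r) * (p + (1 - p) * s)
    last = r / 2 + (1 - r) * (1 - p) * s
    return first, last

def log_likelihood(s, n_t, N_t, n_b, N_b, p, r):
    """Log-likelihood of the recorded vote counts as a function of s."""
    first, last = choice_probs(s, p, r)
    return (n_t * math.log(first) + (N_t - n_t) * math.log(1 - first)
            + n_b * math.log(last) + (N_b - n_b) * math.log(1 - last))

def mle_quality(n_t, N_t, n_b, N_b, p, r, tol=1e-9):
    """Maximize the concave log-likelihood over s in (0, 1) by ternary search."""
    lo, hi = 1e-9, 1 - 1e-9
    while hi - lo > tol:
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if (log_likelihood(m1, n_t, N_t, n_b, N_b, p, r)
                < log_likelihood(m2, n_t, N_t, n_b, N_b, p, r)):
            lo = m1  # the maximum lies to the right of m1
        else:
            hi = m2  # the maximum lies to the left of m2
    return (lo + hi) / 2

# Counts constructed to match s = 0.7 exactly when p = 0.3 and r = 0.1.
s_hat = mle_quality(n_t=761, N_t=1000, n_b=491, N_b=1000, p=0.3, r=0.1)
```

The estimate recovers the value used to construct the counts. A ranking rule could then place A_1 first whenever the estimate exceeds one half, assuming that values above one half indicate that A_1 is the better answer.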
a.5. Simulation Robustness
Simulations in the main text are for the case where the quality ranking algorithm correctly assumes and . We explore what happens if one or both assumptions are wrong. For example, we show in Fig. 11 the case when quality ranking assumes when .
Comparing to Fig. 6, we see results are quantitatively very similar. This is intuitive as randomly choosing the first or last answer with equal probability should not substantially affect their relative ranking.
What if we incorrectly estimate both $p$ and $r$? For example, we assume and , but in actuality, and , , or ? As shown in Fig. 2, the correct value could vary between and for the social influence condition. Results are shown in Fig. 12. Overall, we find that quality ranking still significantly outperforms popularity-based ranking. The only exception is when answers begin with equal votes and . In this case, quality ranking is comparable after 20 votes, and slightly worse after 20K votes. A website designer could create small-scale experiments to better infer , and after applying a corrected estimate, they should expect quality ranking to again substantially outperform popularity-based ranking. Overall, even when the quality-ranking algorithm is not correctly parameterized, it still performs rather well, and does not seem to be very sensitive to the estimate of $p$ or $r$.
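For readers who want to experiment with such simulations, the toy sketch below illustrates popularity-based ranking of two options. It is not the simulation code behind the figures. It assumes that each vote follows the choice probabilities of Appendix A.4 with fixed, arbitrary parameters (`p=0.3`, `r=0.1`), that the worse option is shown first at the start, and that ties keep the current order; all names are hypothetical.

```python
import random

def prob_choose_best(best_on_top, s, p, r):
    """Chance that the next vote goes to the better option, given its position."""
    if best_on_top:
        return r / 2 + (1 - r) * (p + (1 - p) * s)
    return r / 2 + (1 - r) * (1 - p) * s

def best_ends_on_top(s, p, r, n_votes, rng):
    """Run one popularity-ranked sequence of votes; report the final order."""
    best_votes = worst_votes = 0
    best_on_top = False  # assumption: the worse option starts in first position
    for _ in range(n_votes):
        if rng.random() < prob_choose_best(best_on_top, s, p, r):
            best_votes += 1
        else:
            worst_votes += 1
        # Popularity ranking: the option with more votes is shown first.
        if best_votes != worst_votes:
            best_on_top = best_votes > worst_votes
    return best_on_top

def fraction_best_on_top(s, p=0.3, r=0.1, n_votes=200, n_runs=500, seed=0):
    """Fraction of independent runs in which the better option ends up first."""
    rng = random.Random(seed)
    wins = sum(best_ends_on_top(s, p, r, n_votes, rng) for _ in range(n_runs))
    return wins / n_runs

large_gap = fraction_best_on_top(s=0.9)   # options of clearly different quality
small_gap = fraction_best_on_top(s=0.55)  # options of similar quality
```

With a large quality gap the better option should end on top in nearly every run, while with a small gap a sizeable share of runs should lock in the worse option, which is the instability discussed in the main text.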
Footnotes
 ccs: Humancentered computing User models
 ccs: Humancentered computing Laboratory experiments
 ccs: Humancentered computing Heuristic evaluations
 ccs: Humancentered computing Empirical studies in visualization
 ccs: Humancentered computing Empirical studies in interaction design
 copyright: acmlicensed
 journal: PACMHCI
 journalyear: 2020
 journalvolume: 4
 journalnumber: CSCW2
 article: 166
 publicationmonth: 10
 price: 15.00
 doi: 10.1145/3415237
References
 Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2017. Controlling Popularity Bias in Learning-to-Rank Recommendation. In Proceedings of the Eleventh ACM Conference on Recommender Systems (Como, Italy) (RecSys ’17). Association for Computing Machinery, New York, NY, USA, 42–46. https://doi.org/10.1145/3109859.3109912
 Himan Abdollahpouri, Robin Burke, and Bamshad Mobasher. 2019. Managing Popularity Bias in Recommender Systems with Personalized Reranking. arXiv preprint: 1901.07555 (2019).
 Andrés Abeliuk, Gerardo Berbeglia, Pascal Van Hentenryck, Tad Hogg, and Kristina Lerman. 2017. Taming the Unpredictability of Cultural Markets with Social Influence. In Proceedings of the 26th International World Wide Web Conference (WWW2017). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, Switzerland.
 Daron Acemoglu, Munther A. Dahleh, Ilan Lobel, and Asuman Ozdaglar. 2011. Bayesian Learning in Social Networks. The Review of Economic Studies 78, 4 (03 2011), 1201–1236. https://doi.org/10.1093/restud/rdr004
 Lada A. Adamic, Jun Zhang, Eytan Bakshy, and Mark S. Ackerman. 2008. Knowledge Sharing and Yahoo Answers: Everyone Knows Something. In Proceedings of the 17th international conference on World Wide Web. ACM, New York, NY, 665–674.
 S. E. Asch. 1951. Effects of group pressure upon the modification and distortion of judgments. Carnegie Press, Oxford, UK. 177–190 pages.
 Ricardo Baeza-Yates. 2018. Bias on the web. Commun. ACM 61, 6 (2018), 54–61.
 Venkatesh Bala and Sanjeev Goyal. 1998. Learning from Neighbours. The Review of Economic Studies 65, 3 (1998), 595–621. http://www.jstor.org/stable/2566940
 Joshua Becker, Devon Brackbill, and Damon Centola. 2017. Network dynamics of social influence in the wisdom of crowds. Proceedings of the National Academy of Sciences 114, 26 (2017), E5070–E5076. https://doi.org/10.1073/pnas.1615978114
 Joshua Becker, Ethan Porter, and Damon Centola. 2019. The wisdom of partisan crowds. Proceedings of the National Academy of Sciences 116, 22 (2019), 10717–10722. https://doi.org/10.1073/pnas.1817195116
 Michael Bendersky, W. Bruce Croft, and Yanlei Diao. 2011. Quality-Biased Ranking of Web Documents. In Proceedings of the Fourth ACM International Conference on Web Search and Data Mining (Hong Kong, China) (WSDM ’11). Association for Computing Machinery, New York, NY, USA, 95–104. https://doi.org/10.1145/1935826.1935849
 Engin Bozdag. 2013. Bias in algorithmic filtering and personalization. Ethics and Information Technology 15, 3 (01 Sep 2013), 209–227. https://doi.org/10.1007/s1067601393216
 Alasdair Brown and J. James Reade. 2019. The wisdom of amateur crowds: Evidence from an online community of sports tipsters. European Journal of Operational Research 272, 3 (2019), 1073 – 1081. https://doi.org/10.1016/j.ejor.2018.07.015
 Keith Burghardt, Emanuel F. Alsina, Michelle Girvan, William Rand, and Kristina Lerman. 2017. The Myopia of Crowds: A Study of Collective Evaluation on Stack Exchange. PLOS ONE 12, 3 (2017), e0173610.
 Keith Burghardt, William Rand, and Michelle Girvan. 2019. Inferring models of opinion dynamics from aggregated jury data. PLoS ONE 14, 7 (2019), e0218312.
 Zhi Da and Xing Huang. 2019. Harnessing the Wisdom of Crowds. Management Science 0, 0 (2019), 1–21. https://doi.org/10.1287/mnsc.2019.3294
 Marquis de Condorcet. 1976. “Essay on the Application of Mathematics to the Theory of Decision-Making.” Reprinted in Condorcet: Selected Writings. Bobbs-Merrill, Indianapolis, Indiana.
 Morris H. Degroot. 1974. Reaching a Consensus. J. Amer. Statist. Assoc. 69, 345 (1974), 118–121. https://doi.org/10.1080/01621459.1974.10480137
 Himel Dev, Karrie Karahalios, and Hari Sundaram. 2019. Quantifying Voter Biases in Online Platforms: An Instrumental Variable Approach. Proceedings of the ACM on Human-Computer Interaction 3, CSCW (2019), 120.
 Emilio Ferrara, Nazanin Alipoufard, Keith Burghardt, Chiranth Gopal, and Kristina Lerman. 2017. Dynamics of Content Quality in Collaborative Knowledge Production. In ICWSM ’17 Proceedings of the 11th International AAAI Conference on Web and Social Media.
 Catherine Fitzmaurice and Ken Pease. 1986. The psychology of judicial sentencing. Manchester University Press, Manchester, UK.
 Adrian Furnham and Hua Chu Boo. 2011. A literature review of the anchoring effect. The Journal of Socio-Economics 40, 1 (2011), 35–42. https://doi.org/10.1016/j.socec.2010.10.008
 F. Galton. 1908. Vox Populi. Nature 75 (1908), 450–451.
 Eric Gilbert. 2013. Widespread Underprovision on Reddit. In CSCW ’13: Proceedings of the 2013 conference on Computer supported cooperative work. Association for Computing Machinery, 803–808.
 Benjamin Golub and Matthew O. Jackson. 2010. Naïve Learning in Social Networks and the Wisdom of Crowds. American Economic Journal: Microeconomics 2, 1 (2010), 112–49.
 Sara Hajian, Francesco Bonchi, and Carlos Castillo. 2016. Algorithmic Bias: From Discrimination Discovery to Fairness-aware Data Mining. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD ’16). ACM, New York, NY, USA, 2125–2126. https://doi.org/10.1145/2939672.2945386
 Daniel Herbst and Alexandre Mas. 2015. Peer effects on worker output in the laboratory generalize to the field. Science 350, 6260 (October 2015), 545–549.
 Martin Hilbert, Saifuddin Ahmed, Jaeho Cho, Billy Liu, and Jonathan Luu. 2018. Communicating with Algorithms: A Transfer Entropy Analysis of Emotions-based Escapes from Online Echo Chambers. Communication Methods and Measures 12, 4 (2018), 260–275. https://doi.org/10.1080/19312458.2018.1479843
 T. Hogg and K. Lerman. 2015. Disentangling the effects of social signals. Human Computation Journal 2, 2 (2015), 189–208.
 Joel Huber, John W. Payne, and Christopher Puto. 1982. Adding Asymmetrically Dominated Alternatives: Violations of Regularity and the Similarity Hypothesis. Journal of Consumer Research 9, 1 (06 1982), 90–98. https://doi.org/10.1086/208899
 Kalervo Järvelin and Jaana Kekäläinen. 2002. Cumulated Gain-Based Evaluation of IR Techniques. ACM Trans. Inf. Syst. 20, 4 (Oct. 2002), 422–446. https://doi.org/10.1145/582415.582418
 Mordechai Z. Juni and Miguel P. Eckstein. 2017. The wisdom of crowds for visual search. Proceedings of the National Academy of Sciences 114, 21 (2017), E4306–E4315. https://doi.org/10.1073/pnas.1610732114
 Serguei Kaniovski and Alexander Zaigraev. 2011. Optimal jury design for homogeneous juries with correlated votes. Theory Dec. 71 (2011), 439–459.
 Albert B. Kao, Andrew M. Berdahl, Andrew T. Hartnett, Matthew J. Lutz, Joseph B. Bak-Coleman, Christos C. Ioannou, Xingli Giam, and Iain D. Couzin. 2018. Counteracting estimation bias and social influence to improve the wisdom of crowds. Journal of The Royal Society Interface 15, 141 (2018), 20180130. https://doi.org/10.1098/rsif.2018.0130
 Keith Burghardt, Tad Hogg, and Kristina Lerman. 2018. Quantifying the Impact of Cognitive Biases in Crowdsourcing. In Proceedings of The 12th International AAAI Conference on Web and Social Media (ICWSM18). AAAI.
 Coco Krumme, Manuel Cebrian, Galen Pickard, and Sandy Pentland. 2012. Quantifying Social Influence in an Online Cultural Market. PLoS ONE 7, 5 (2012), e33785.
 K. Lerman and T. Hogg. 2014. Leveraging position bias to improve peer recommendation. PLOS ONE 9, 6 (2014), e98914.
 K. Lerman, X. Yan, and X.-Z. Wu. 2016. The “Majority Illusion” in Social Networks. PLoS ONE 11, 2 (2016), e0147617.
 Youngshin Lim and Brandon Van Der Heide. 2015. Evaluating the wisdom of strangers: The perceived credibility of online consumer reviews on Yelp. Journal of Computer-Mediated Communication 20, 1 (2015), 67–82.
 Jan Lorenz, Heiko Rauhut, and Bernhard Kittel. 2015. Majoritarian democracy undermines truth-finding in deliberative committees. Research & Politics 2, 2 (2015), 2053168015582287. https://doi.org/10.1177/2053168015582287
 Jan Lorenz, Heiko Rauhut, Frank Schweitzer, and Dirk Helbing. 2011. How social influence can undermine the wisdom of crowd effect. Proceedings of the National Academy of Sciences 108, 22 (2011), 9020–9025.
 A. Mantonakis, P. Rodero, I. Lesschaeve, and R. Hastie. 2009. Order in Choice: Effects of Serial Position on Preferences. Psychol. Sci. 20, 11 (2009), 1309–1312.
 Marzia Antenore, Alessandro Panconesi, and Erisa Terolli. 2018. Songs of a Future Past – An Experimental Study of Online Persuaders. In Twelfth International AAAI Conference on Web and Social Media. AAAI.
 Pavlin Mavrodiev, Claudio J. Tessone, and Frank Schweitzer. 2013. Quantifying the effects of social influence. Scientific Reports 3, 1 (2013), 1360. https://doi.org/10.1038/srep01360
 Elchanan Mossel, Allan Sly, and Omer Tamuz. 2015. Strategic Learning and the Topology of Social Networks. Econometrica 83, 5 (2015), 1755–1794. https://doi.org/10.3982/ECTA12058
 Lev Muchnik, Sinan Aral, and Sean J. Taylor. 2013. Social Influence Bias: A Randomized Experiment. Science 341 (2013), 647–651.
 K Munger, M Luca, J Nagler, and J Tucker. 2019. Age matters: Sampling strategies for studying digital media effects. (2019). https://osf.io/sq5ub/
 Whitney K. Newey and Daniel McFadden. 1994. Chapter 36: Large sample estimation and hypothesis testing. Vol. 4. Elsevier Science. 2111–2245 pages.
 Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The PageRank Citation Ranking: Bringing Order to the Web. Technical Report 1999-66. Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/ Previous number = SIDL-WP-1999-0120.
 Thomas Peeters. 2018. Testing the Wisdom of Crowds in the field: Transfermarkt valuations and international soccer results. International Journal of Forecasting 34, 1 (2018), 17 – 29. https://doi.org/10.1016/j.ijforecast.2017.08.002
 Dražen Prelec, H. Sebastian Seung, and John McCoy. 2017. A solution to the singlequestion crowd wisdom problem. Nature 541, 7638 (2017), 532–535. https://doi.org/10.1038/nature21054
 M. Salganik, P. Dodds, and D. Watts. 2006. Experimental Study of Inequality and Unpredictability in an Artificial Cultural Market. Science 311 (2006), 854–856.
 Chirag Shah and Jefferey Pomerantz. 2010. Evaluating and Predicting Answer Quality in Community QA. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval (Geneva, Switzerland) (SIGIR ’10). ACM, New York, NY, USA, 411–418. https://doi.org/10.1145/1835449.1835518
 Milad Shokouhi, Ryen White, and Emine Yilmaz. 2015. Anchoring and adjustment in relevance estimation. In Proceedings of the 38th International ACM SIGIR Conference on research and development in information retrieval. 963–966.
 Camelia Simoiu, Chiraag Sumanth, Alok Mysore, and Sharad Goel. 2019. Studying the “Wisdom of Crowds” at Scale. In The Seventh AAAI Conference on Human Computation and Crowdsourcing (HCOMP19). AAAI, 171–179.
 G. Stoddard. 2015. Popularity Dynamics and Intrinsic Quality in Reddit and Hacker News. In Proceedings of the Ninth International AAAI Conference on Web and Social Media. 416–425.
 J. Surowiecki. 2005. The wisdom of crowds. Anchor, New York.
 Jerry O Talton III, Krishna Dusad, Konstantinos Koiliaris, and Ranjitha S Kumar. 2019. How do People Sort by Ratings?. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–10.
 Philip E. Tetlock, Barbara A. Mellers, and J. Peter Scoblic. 2017. Bringing probability judgments into policy debates via forecasting tournaments. Science 355, 6324 (2017), 481–483. https://doi.org/10.1126/science.aal3147
 Lyle Ungar, Barbara Mellers, Ville Satopää, Philip Tetlock, and Jon Baron. 2012. The Good Judgment Project: A Large Scale Test of Different Methods of Combining Expert Predictions. Technical Report. 37–42 pages.
 Arnout van de Rijt, Soong Moon Kang, Michael Restivo, and Akshay Patil. 2014. Field experiments of success-breeds-success dynamics. Proceedings of the National Academy of Sciences 111, 19 (2014), 6934–6939. https://doi.org/10.1073/pnas.1316836111
 D. J. Watts. 2012. Everything Is Obvious: How Common Sense Fails Us. Random House LLC.
 Ryen White. 2013. Beliefs and biases in web search. In Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval. 3–12.
 S. S. Wilks. 1938. The Large-Sample Distribution of the Likelihood Ratio for Testing Composite Hypotheses. Ann. Math. Statist. 9, 1 (1938), 60–62.
 Y. Yao, Hanghang Tong, Tao Xie, Leman Akoglu, Feng Xu, and Jian Lu. 2015. Detecting high-quality posts in community question answering sites. Information Sciences 302 (2015), 70–82.