
Budget-Optimal Task Allocation for Reliable Crowdsourcing Systems

David R. Karger, Sewoong Oh, and Devavrat Shah
Computer Science and Artificial Intelligence Laboratory and Department of EECS, Massachusetts Institute of Technology. Email: karger@mit.edu. Department of Industrial and Enterprise Systems Engineering, University of Illinois at Urbana-Champaign. Email: swoh@illinois.edu. Laboratory for Information and Decision Systems and Department of EECS, Massachusetts Institute of Technology. Email: devavrat@mit.edu. This work was supported in part by the NSF EMT project, the AFOSR Complex Networks project, and the Army Research Office under MURI Award 58153-MA-MUR.
September 20, 2019
Abstract

Crowdsourcing systems, in which numerous tasks are electronically distributed to numerous “information piece-workers”, have emerged as an effective paradigm for human-powered solving of large scale problems in domains such as image classification, data entry, optical character recognition, recommendation, and proofreading. Because these low-paid workers can be unreliable, nearly all such systems must devise schemes to increase confidence in their answers, typically by assigning each task multiple times and combining the answers in an appropriate manner, e.g. majority voting.

In this paper, we consider a general model of such crowdsourcing tasks and pose the problem of minimizing the total price (i.e., number of task assignments) that must be paid to achieve a target overall reliability. We give a new algorithm for deciding which tasks to assign to which workers and for inferring correct answers from the workers’ answers. We show that our algorithm, inspired by belief propagation and low-rank matrix approximation, significantly outperforms majority voting and is, in fact, order-optimal, as established through comparison to an oracle that knows the reliability of every worker. Further, we compare our approach with a more general class of algorithms that can dynamically assign tasks. By adaptively deciding which questions to ask the next arriving worker, one might hope to reduce uncertainty more efficiently. We show that, perhaps surprisingly, the minimum price necessary to achieve a target reliability scales in the same manner under both adaptive and non-adaptive scenarios. Hence, our non-adaptive approach is order-optimal under both scenarios. This strongly relies on the fact that workers are fleeting and cannot be exploited. Therefore, architecturally, our results suggest that building a reliable worker-reputation system is essential to fully harnessing the potential of adaptive designs.

1 Introduction

Background. Crowdsourcing systems have emerged as an effective paradigm for human-powered problem solving and are now in widespread use for large-scale data-processing tasks such as image classification, video annotation, form data entry, optical character recognition, translation, recommendation, and proofreading. Crowdsourcing systems such as Amazon Mechanical Turk (http://www.mturk.com) establish a market where a “taskmaster” can submit batches of small tasks to be completed for a small fee by any worker choosing to pick them up. For example, a worker may be able to earn a few cents by indicating which images from a set of 30 are suitable for children (one of the benefits of crowdsourcing is its applicability to such highly subjective questions).

Because these crowdsourced tasks are tedious and the pay is low, errors are common even among workers who make an effort. At the extreme, some workers are “spammers”, submitting arbitrary answers independent of the question in order to collect their fee. Thus, all crowdsourcers need strategies to ensure the reliability of their answers. When the system allows the crowdsourcers to identify and reuse particular workers, a common approach is to manage a pool of reliable workers in an explore/exploit fashion. However, in many crowdsourcing platforms such as Amazon Mechanical Turk, the worker crowd is large, anonymous, and transient, and it is generally difficult to build up a trust relationship with particular workers. (For certain high-value tasks, crowdsourcers can use entrance exams to “prequalify” workers and block spammers, but this increases the cost of the task and still provides no guarantee that the workers will try hard after qualification.) It is also difficult to condition payment on correct answers, as the correct answer may never truly be known, and delaying payment can annoy workers and make it harder to recruit them to your task next time. Instead, most crowdsourcers resort to redundancy, giving each task to multiple workers, paying them all irrespective of their answers, and aggregating the results by some method such as majority voting.

For such systems there is a natural core optimization problem to be solved. Assuming the taskmaster wishes to achieve a certain reliability in her answers, how can she do so at minimum cost (which is equivalent to asking how she can do so while asking the fewest possible questions)?

Several characteristics of crowdsourcing systems make this problem interesting. Workers are neither persistent nor identifiable; each batch of tasks will be solved by a worker who may be completely new and whom you may never see again. Thus one cannot identify and reuse particularly reliable workers. Nonetheless, by comparing one worker’s answer to others’ on the same question, it is possible to draw conclusions about a worker’s reliability, which can be used to weight their answers to other questions in their batch. However, batches must be of manageable size, obeying limits on the number of tasks that can be given to a single worker.

Another interesting aspect of this problem is the choice of task assignments. Unlike many inference problems, which make inferences based on a fixed set of signals, our algorithm can choose which signals to measure by deciding which questions to include in which batches. In addition, there are several plausible options: for example, we might choose to ask a few “pilot questions” to each worker (just like a qualifying exam) to decide on the reliability of the worker. Another possibility is to first ask a few questions and, based on the answers, decide whether to ask more questions or not. We would like to understand the role of all such variations in the overall optimization of the budget for reliable task processing.

In the remainder of this section, we will define a formal probabilistic model that captures these aspects of the problem. We consider both a non-adaptive scenario, in which all questions are asked simultaneously and all the responses are collected simultaneously, and an adaptive scenario, in which one may adaptively choose which tasks to assign to the next arriving worker based on all the previous answers collected thus far. We provide a non-adaptive task allocation scheme and an inference algorithm based on low-rank matrix approximations and belief propagation. We will then show that our algorithm is order-optimal: for a given target error rate, it spends only a constant factor times the minimum necessary to achieve that error rate. The optimality is established through comparisons to the best possible non-adaptive task allocation scheme and an oracle estimator that can make optimal decisions based on extra information provided by an oracle. In particular, we derive a parameter $q$ that characterizes the ‘collective’ reliability of the crowd, and show that to achieve a target reliability $\epsilon$, it is both necessary and sufficient to replicate each task $\Theta\big((1/q)\log(1/\epsilon)\big)$ times. This leads to the next question of interest: by using adaptive task assignment, can we ask fewer questions and still achieve the same error rate? We, somewhat surprisingly, show that the optimal cost under this adaptive scenario scales in the same manner as under the non-adaptive scenario; asking questions adaptively does not help!

Setup. We consider the following probabilistic model for crowdsourcing. There is a set of $m$ binary tasks, which is associated with unobserved ‘correct’ solutions $t_i \in \{+1, -1\}$ for $i \in [m]$. Here and after, we use $[N]$ to denote the set of first $N$ integers. In the image categorization example stated earlier, a set of $m$ tasks corresponds to labeling $m$ images as suitable for children ($+1$) or not ($-1$). We will be interested in finding the true solutions by querying noisy workers who arrive one at a time in an on-line fashion.

An algorithmic solution to crowdsourcing consists of two components: a task allocation scheme and an inference algorithm. In the task allocation phase, queries are made sequentially according to the following rule. At the $j$-th step, the task assignment scheme chooses a subset $T_j \subseteq [m]$ of tasks to be assigned to the next arriving noisy worker. The only constraint on the choice of the batch is that its size must obey a limit on the number of tasks that can be given to a single worker. Let $r$ denote such a limit, so that all batches must satisfy $|T_j| \le r$. Then a worker arrives, whose latent reliability is parametrized by $p_j \in [0,1]$. For each assigned task $i \in T_j$, this worker gives a noisy answer $A_{ij} \in \{+1, -1\}$ such that

$A_{ij} = t_i$ with probability $p_j$, and $A_{ij} = -t_i$ with probability $1 - p_j$,

and we set $A_{ij} = 0$ if $i \notin T_j$. (Throughout this paper, we use boldface characters to denote random variables and random matrices unless it is clear from the context.) The next assignment $T_{j+1}$ can be chosen adaptively, taking into account all of the previous assignments and the answers collected thus far. This process is repeated until the task assignment scheme decides to stop, typically when the total number of queries meets a certain budget constraint. Then, in the subsequent inference phase, an inference algorithm makes a final estimate of the true answers.
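To make the model concrete, the following is a minimal simulation sketch (ours, not from the paper); the function name, the dense matrix representation, and the convention that a zero entry marks an unassigned pair are our own choices.

```python
import numpy as np

def simulate_responses(t, p, batches, rng=None):
    """Simulate worker responses under the model above.

    t       : array of shape (m,), true labels in {+1, -1}
    p       : array of shape (n,), latent worker reliabilities in [0, 1]
    batches : list of n batches; batches[j] lists the task indices assigned
              to worker j (so len(batches[j]) <= r)
    Returns an m x n matrix A with A[i, j] in {+1, -1} on assigned pairs
    and A[i, j] = 0 when task i was not assigned to worker j.
    """
    rng = np.random.default_rng() if rng is None else rng
    m, n = len(t), len(p)
    A = np.zeros((m, n), dtype=int)
    for j, batch in enumerate(batches):
        for i in batch:
            correct = rng.random() < p[j]        # worker j is correct w.p. p[j]
            A[i, j] = t[i] if correct else -t[i]
    return A

# Example: 5 tasks, 3 workers, every worker answers every task.
rng = np.random.default_rng(0)
t = rng.choice([-1, 1], size=5)
p = np.array([0.9, 0.5, 0.7])
A = simulate_responses(t, p, [list(range(5))] * 3, rng)
```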

We say a task allocation scheme is adaptive if the choice of $T_j$ depends on the answers collected in previous steps, and it is non-adaptive if it does not depend on the answers. In practice, one might prefer using a non-adaptive scheme, since assigning all the batches simultaneously and having all the batches of tasks processed in parallel reduces latency. However, by switching to an adaptive task allocation, one might be able to reduce uncertainty more efficiently. We investigate this possibility in Section 2.4, and show that the gain from adaptation is limited.

Note here that at the time of assigning tasks to the next arriving worker $j$, the algorithm is not aware of the latent reliability $p_j$ of the worker. This is consistent with how real-world crowdsourcing works, since taskmasters typically have no choice over which worker is going to pick up which batch of tasks. Further, we make the pessimistic assumption that workers are neither persistent nor identifiable; each batch of tasks will be solved by a worker who may be completely new and whom you may never see again. Thus one cannot identify and reuse particularly reliable workers. This is a different setting from adaptive games [LW89], where you have a sequence of trials and a set of predictions is made at each step by a pool of experts. In adaptive games, you can identify reliable experts from their past performance using techniques like multiplicative weights, whereas in crowdsourcing you cannot hope to exploit any particular worker.

The latent variable $p_j$ captures how some workers are more diligent or have more expertise than others, while some other workers might be trying to cheat. The random variable $A_{ij}$ is independent of any other event given $p_j$ and $t_i$. The underlying assumption here is that the error probability of a worker does not depend on the particular task and all the tasks share an equal level of difficulty. Hence, each worker’s performance is consistent across different tasks. We discuss a possible generalization of this model in Section 2.7.

We further assume that the worker reliabilities $\{p_j\}$ are independent and identically distributed random variables with a given distribution on $[0,1]$. As one example we define the spammer-hammer model, where each worker is either a ‘hammer’ with probability $q$ or a ‘spammer’ with probability $1-q$. A hammer answers all questions correctly, meaning $p_j = 1$, and a spammer gives random answers, meaning $p_j = 1/2$. It should be noted that the meaning of a ‘spammer’ might be different from its use in other literature. In this model, a spammer is a worker who gives uniformly random labels independent of the true label. In other literature in crowdsourcing, the word ‘spammer’ has been used, for instance, to refer to a worker who always gives ‘$+$’ labels [RY12]. Another example is the beta distribution with some parameters $\alpha > 0$ and $\beta > 0$, i.e. $f(p) \propto p^{\alpha-1}(1-p)^{\beta-1}$ (normalized so that it is a proper distribution) [Hol11, RYZ10b]. A distribution of $p_j$ characterizes a crowd, and the following parameter $q$ plays an important role in capturing the ‘collective quality’ of this crowd, as will be clear from our main results:

$q \;\equiv\; \mathbb{E}\big[(2p_j - 1)^2\big]\,.$

A value of $q$ close to one indicates that a large proportion of the workers are diligent, whereas $q$ close to zero indicates that there are many spammers in the crowd. The definition of $q$ is consistent with the use of $q$ in the spammer-hammer model, and in the case of the beta distribution, $q = \big(\tfrac{\alpha-\beta}{\alpha+\beta}\big)^2 + \tfrac{4\alpha\beta}{(\alpha+\beta)^2(\alpha+\beta+1)}$. We will see later that our bound on the achievable error rate depends on the distribution of the $p_j$’s only through this parameter $q$.
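As a quick illustration (our own sketch, not part of the paper), the collective quality $q = \mathbb{E}[(2p_j-1)^2]$ can be estimated by Monte Carlo for the two example priors above; the helper name and the specific parameter values are arbitrary.

```python
import numpy as np

def collective_quality(sample_p, num_samples=200_000, seed=0):
    """Monte Carlo estimate of q = E[(2p - 1)^2] for a worker-reliability prior."""
    rng = np.random.default_rng(seed)
    p = sample_p(rng, num_samples)
    return np.mean((2.0 * p - 1.0) ** 2)

# Spammer-hammer model: hammer (p = 1) with probability q0, spammer (p = 1/2) otherwise.
q0 = 0.3
spammer_hammer = lambda rng, size: np.where(rng.random(size) < q0, 1.0, 0.5)
print(collective_quality(spammer_hammer))   # close to q0: (2p-1)^2 is 1 for hammers, 0 for spammers

# Beta(alpha, beta) prior on p.
alpha, beta = 4.0, 2.0
beta_prior = lambda rng, size: rng.beta(alpha, beta, size)
print(collective_quality(beta_prior))       # matches the closed-form expression above
```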

When the crowd population is large enough that we do not need to distinguish whether the workers are ‘sampled’ with or without replacement, it is quite realistic to assume the existence of a prior distribution for $p_j$. In particular, this assumption is met if we simply randomize the order in which we upload our task batches, since this will have the effect of randomizing which workers perform which batches, yielding a distribution that meets our requirements. The model is therefore quite general. On the other hand, it is not realistic to assume that we know what the prior is. To execute our inference algorithm for a given number of iterations, we do not require any knowledge of the distribution of the reliability. However, knowledge of $q$ is necessary in order to determine how many times a task should be replicated and how many iterations we need to run to achieve a certain target reliability. We discuss a simple way to overcome this limitation in Section 2.2.

The only assumption we make about the distribution is that there is a bias towards the right answer, i.e. $\mathbb{E}[p_j] > 1/2$. Without this assumption, we could have a ‘perfect’ crowd with $q = 1$ in which everyone is adversarial, $p_j = 0$. Then, there is no way we can correct for this. Another way to justify this assumption is to define the “ground truth” of the tasks as what the majority of the crowd agrees on. We want to learn this consensus efficiently without having to query everyone in the crowd for every task. If we use this definition of the ground truth, then it naturally follows that the workers are on average more likely to be correct.

Throughout this paper, we assume that there is a fixed cost that must be paid for each response, regardless of the quality of the response, so that the total cost is proportional to the total number of queries. Given a target accuracy, and under the probabilistic crowdsourcing model described in this section, we want to design a task allocation scheme and an inference algorithm that achieve this target accuracy with minimal cost.

Possible deviations from our model. Some of the main assumptions we make on how crowdsourcing systems work are that (i) workers are neither identifiable nor reusable, (ii) every worker is paid the same amount regardless of her performance, and (iii) each worker completes only one batch and completes all the tasks in that batch. In this section, we discuss common strategies used in real crowdsourcing that might deviate from these assumptions.

First, there has been growing interest recently in developing algorithms to efficiently identify good workers, assuming that worker identities are known and workers are reusable. Imagine a crowdsourcing platform where there is a fixed pool of identifiable workers and we can assign the tasks to whichever worker we choose. In this setting, adaptive schemes can be used to significantly improve accuracy while minimizing the total number of queries. It is natural to expect that by first exploring to find better workers and then exploiting them in the following rounds, one might be able to improve performance significantly. Donmez et al. [DCS09] proposed IEThresh, which simultaneously estimates worker accuracy and actively selects a subset of workers with high accuracy. Zheng et al. [ZSD10] proposed a two-phase approach to identify good workers in the first phase and utilize the best subset of workers in the second phase. Ertekin et al. [EHR11] proposed using weighted majority voting to better estimate the true labels in CrowdSense, which is then used to identify good workers.

The power of such exploration/exploitation approaches was demonstrated in numerical experiments; however, none of these approaches have been tested on real-world crowdsourcing platforms. All the experiments are done using pre-collected datasets. Given these datasets, they simulate a labor market where they can track and reuse any workers they choose. The reason the experiments are done on such simulated labor markets, instead of on popular crowdsourcing platforms such as Amazon Mechanical Turk, is that on real-world crowdsourcing platforms it is almost impossible to track workers. Many of the popular crowdsourcing platforms are completely open labor markets where the worker crowd is large and transient. Further, oftentimes it is the workers who choose which tasks they want to work on; hence the taskmaster cannot reuse particular workers. For these reasons, we assume in this paper that the workers are fleeting and provide an algorithmic solution that works even when workers are not reusable. We show that any taskmaster who wishes to outperform our algorithm must adopt complex worker-tracking techniques. Furthermore, no worker-tracking technique has been proven to be foolproof. In particular, it is impossible to prevent a worker from starting over with a new account, and many tracking algorithms are susceptible to this attack.

Another important and closely related question that has not been formally addressed in the crowdsourcing literature is how to differentiate payment based on the inferred accuracy in order to incentivize good workers. Regardless of whether the workers are identifiable or not, when all the tasks are completed we get an estimate of the quality of the workers. It would be desirable to pay the good workers more in order to incentivize them to work on our future tasks. For example, bonuses are built into Amazon Mechanical Turk to be granted at the taskmaster’s discretion, but it has not been studied how to use bonuses optimally. This could be an interesting direction for future research.

It has been observed that increasing the cost (payment per task) on crowdsourcing platforms does not directly lead to a higher quality of the responses [MW10]. Instead, increasing the cost only leads to faster responses. Mason and Watts [MW10] attribute this counterintuitive finding to an “anchoring” effect: when the (expected) payment is higher, workers perceive the value of their work to be greater as well. Hence, they are no more motivated than workers who are paid less. However, these studies were done in isolated experiments, and the long-term effect of a taskmaster keeping a good reputation still needs to be understood. Workers on Mechanical Turk can keep track of the reputation of taskmasters using, for instance, Turkopticon (http://turkopticon.differenceengines.com), a Firefox extension that allows workers to rate taskmasters and view ratings from other workers. Another example is Turkernation (http://turkernation.com), an on-line forum where workers and taskmasters can discuss Mechanical Turk and leave feedback.

Finally, in Mechanical Turk, it is typically the workers who choose which tasks they want to work on and when they want to stop. Without any regulations, they might respond to multiple batches of your tasks or stop in the middle of a batch. It is possible to systematically prevent the same worker from coming back and repeating more than one batch of your tasks. For example, on Amazon’s Mechanical Turk, a worker cannot repeat the same task more than once. However, it is difficult to guarantee that a worker completes all the tasks in a batch she started on. In practice, there are simple ways to ensure this by, for instance, conditioning the payment on completing all the tasks in a batch.

A problem with restricting the number of tasks assigned to each worker (as we propose in Section 2.1) is that it might take a long time to have all the batches completed. Letting the workers choose how many tasks they want to complete allows a few eager workers to complete an enormous number of tasks, whereas if we restrict the number of tasks assigned to each worker, we might need to recruit more workers to complete all the tasks. This problem of tasks taking a long time to finish is not restricted to our model, but is a very common problem in open crowdsourcing platforms. Ipeirotis [Ipe10] studied the completion time of tasks on Mechanical Turk and observed that it follows a heavy-tailed distribution according to a power law; hence, some tasks take a significant amount of time to finish. A number of strategies have been proposed to complete tasks on time, including optimizing the pricing policy [FHI11], continuously posting tasks to stay on the first page [BJJ10, CHMA10], and having a large number of tasks available [CHMA10]. These strategies are effective in attracting more workers fast. In our model, however, we assume there are no restrictions on the latency and we can wait until all the batches are completed; if good strategies to reduce worker response time are available, they can be incorporated into our system design.

Prior work. Previous crowdsourcing system designs have focused on developing inference algorithms assuming that the task assignments are fixed and the workers’ responses are already given. None of the prior work on crowdsourcing provides any systematic treatment of task assignment under the crowdsourcing model considered in this paper. To the best of our knowledge, we are the first to study both aspects of crowdsourcing together and, more importantly, establish optimality.

A naive approach to solving the inference problem, which is widely used in practice, is majority voting. Majority voting simply follows what the majority of workers agree on. When we have many spammers in the crowd, majority voting is error-prone since it gives the same weight to all the responses, regardless of whether they are from a spammer or from a diligent worker. We will show in Section 2.3 that majority voting is provably sub-optimal and can be significantly improved upon.

If we know how reliable each worker is, then it is straightforward to find the maximum likelihood estimate: compute the weighted sum of the responses, weighting each worker’s response by her log-likelihood ratio. Although, in reality, we do not have this information, it is possible to learn about a worker’s reliability by comparing one worker’s answer to others’. This idea was first proposed by Dawid and Skene, who introduced an iterative algorithm based on expectation maximization (EM) [DS79]. They considered the problem of classifying patients based on labels obtained from multiple clinicians. They introduced a simple probabilistic model describing the clinicians’ responses and gave an algorithmic solution based on EM. This model, which is described in Section 2.7, is commonly used in modern crowdsourcing settings to explain how workers make mistakes in classification tasks [SPI08].
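The contrast between majority voting and reliability-weighted voting is easy to state in code. The sketch below is ours; it assumes the response matrix A from the setup (with 0 for unassigned pairs) and, for the oracle rule, access to the true reliabilities $p_j$.

```python
import numpy as np

def majority_vote(A):
    """Estimate each task by the sign of the unweighted sum of its responses."""
    s = A.sum(axis=1)
    return np.where(s >= 0, 1, -1)            # ties broken toward +1

def oracle_weighted_vote(A, p, eps=1e-9):
    """Plug-in maximum-likelihood rule when the p_j are known: weight each
    worker's responses by the log-likelihood ratio log(p_j / (1 - p_j))."""
    w = np.log((p + eps) / (1.0 - p + eps))   # log-odds weight per worker
    return np.where(A @ w >= 0, 1, -1)
```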

This heuristic algorithm iterates between the following two steps. In the M-step, the algorithm estimates the error probabilities of the workers that maximize the likelihood, using the current estimates of the answers. In the E-step, the algorithm estimates the likelihood of the answers using the current estimates of the error probabilities. More recently, a number of algorithms have followed this EM approach based on a variety of probabilistic models [SFB95, WRW09, RYZ10a]. The crowdsourcing model we consider in this paper is a special case of these models, and we discuss their relationship in Section 2.7. The EM approach has also been widely applied in classification problems, where a set of labels from low-cost noisy workers is used to find a good classifier [JG03, RYZ10a]. Given a fixed budget, there is a trade-off between acquiring a larger training dataset or acquiring a smaller dataset but with more labels per data point. Through extensive experiments, Sheng, Provost and Ipeirotis [SPI08] show that getting repeated labeling can give a considerable advantage.

Despite the popularity of EM-based algorithms, the performance of these approaches has only been evaluated empirically, and there is no analysis that gives performance guarantees. In particular, EM algorithms are highly sensitive to the initialization used, making it difficult to predict the quality of the resulting estimate. Further, the role of the task assignment is not at all understood with the EM algorithm (or, for that matter, any other algorithm). We want to address both questions of task allocation and inference together, and devise an algorithmic solution that can achieve minimum error from a fixed budget on the total number of queries. When we have a given target accuracy, such an algorithm will achieve this target accuracy with minimum cost. Further, we want to provide a strong performance guarantee for this approach and show that it is close to a fundamental limit on what the best algorithm can achieve.

Contributions. In this work, we provide the first rigorous treatment of both aspects of designing a reliable crowdsourcing system: task allocation and inference. We provide both an order-optimal task allocation scheme (based on random graphs) and an order-optimal algorithm for inference (based on low-rank approximation and belief propagation) on that task assignment. We show that our algorithm, which is non-adaptive, performs as well (for the worst-case worker distribution) as the optimal oracle estimator which can use any adaptive task allocation scheme.

Concretely, given a target probability of error $\epsilon$ and a crowd with collective quality $q$, we show that spending a budget that scales as $O\big((m/q)\log(1/\epsilon)\big)$ is sufficient to achieve probability of error less than $\epsilon$ using our approach. We give a task allocation scheme and an inference algorithm with runtime which is linear in the total number of queries (up to a logarithmic factor). Conversely, we also show that using the best adaptive task allocation scheme together with the best inference algorithm, and under the worst-case worker distribution, this scaling of the budget in terms of $q$ and $\epsilon$ is unavoidable. No algorithm can achieve error less than $\epsilon$ with a number of queries smaller than $c\,(m/q)\log(1/\epsilon)$ for some positive universal constant $c$. This establishes that our algorithm is worst-case optimal up to a constant factor in the required budget.

Our main results show that our non-adaptive algorithm is worst-case optimal and there is no significant gain in using an adaptive strategy. We attribute this limit of adaptation to the fact that, in existing platforms such as Amazon’s Mechanical Turk, the workers are fleeting and the system does not allow for exploiting good workers. Therefore, a positive message of this result is that a good ‘rating system’ for workers is essential to truly benefit from crowdsourcing platforms using adaptivity.

Another novel contribution of our work is the analysis technique. The iterative inference algorithm we introduce operates on real-valued messages whose distribution is a priori difficult to analyze. To overcome this challenge, we develop a novel technique of establishing that these messages are sub-Gaussian and compute the parameters recursively in a closed form. This allows us to prove the sharp result on the error rate. This technique could be of independent interest in analyzing a more general class of message-passing algorithms.

2 Main result

To achieve a certain reliability in our answers with minimum number of queries, we propose using random regular graphs for task allocation and introduce a novel iterative algorithm to infer the correct answers. While our approach is non-adaptive, we show that it is sufficient to achieve an order-optimal performance when compared to the best possible approach using adaptive task allocations. Precisely, we prove an upper bound on the resulting error when using our approach and a matching lower bound on the minimax error rate achieved by the best possible adaptive task allocation together with an optimal inference algorithm. This shows that our approach is minimax optimal up to a constant factor: it requires only a constant factor times the minimum necessary budget to achieve a target error rate under the worst-case worker distribution. We then present the intuitions behind our inference algorithm through connections to low-rank matrix approximations and belief propagation.

2.1 Algorithm

Task allocation. We use a non-adaptive scheme which makes all the task assignments before any worker arrives. This amounts to designing a bipartite graph with one set of nodes corresponding to the tasks and another set of nodes corresponding to the batches. An edge $(i,j)$ indicates that task $i$ is included in batch $T_j$. Once all the $T_j$’s are determined according to the graph, these batches are submitted simultaneously to the crowdsourcing platform. Each arriving worker will pick up one of the batches and complete all the tasks in that batch. We denote by $w_j$ the worker working on the $j$-th batch $T_j$.

To design the bipartite graph, the taskmaster first makes a choice of how many workers to assign to each task, the task degree $l$, and how many tasks to assign to each worker, the worker degree $r$. The task degree $l$ is typically determined by the amount of resources (e.g. money, time, etc.) one can spend on the tasks. The worker degree $r$ is typically determined by how many tasks are manageable for a single worker, depending on the application. The total number of workers that we need is then automatically determined as $n = ml/r$, since the total number of edges has to be consistent.

We will show that with such a regular graph, one can achieve a probability of error quite close to a lower bound on what any inference algorithm can achieve with any task assignment. In particular, this includes all possible graphs, which might have irregular degrees or very large worker degrees (and a small number of workers), conditioned on the total number of edges being the same. This suggests that, among other things, there is no significant gain in using an irregular graph.

We assume that the total cost that must be paid is proportional to the total number of edges and not the number of workers. If we have more budget, we can increase the task degree $l$. It is then natural to expect the probability of error to decrease, since we are collecting more responses per task. We will show that the error rate indeed decreases exponentially in $l$ as $l$ grows. On the other hand, increasing the worker degree $r$ does not incur an increase in the cost, and it is not immediately clear how it affects the performance. We will show that with a larger $r$ we can learn more about the workers, and the error rate decreases as $r$ increases. However, how much we can gain by increasing the worker degree is limited.

Given the task and worker degrees, there are multiple ways to generate a regular bipartite graph. We want to choose a graph that will minimize the probability of error. Deviating slightly from exactly regular degrees, we propose using a simple random construction known as the configuration model in the random graph literature [RU08, Bol01]. We start with $l$ half-edges for each task node and $r$ half-edges for each worker node, and pair all the half-edges according to a random permutation of $[ml]$. The resulting graph might have multi-edges, where two nodes are connected by more than one edge. However, such multi-edges are rare in the resulting random graph as long as the degrees are small compared to the number of nodes. Precisely, the number of double-edges in the graph converges in distribution to a Poisson distribution with mean $(l-1)(r-1)/2$ [Bol01, Page 59, Exercise 2.12]. The only property that we need for the main result to hold is that the resulting random graph converges locally to a random tree in probability in the large system limit. This enables us to analyze the performance of our inference algorithm and provide sharp bounds on the probability of error.
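A minimal sketch of this construction (our own code, not the authors'): create $l$ half-edges per task and $r$ per worker, shuffle the task half-edges with one random permutation, and hand each worker a consecutive block of $r$ of them; rare multi-edges are simply kept.

```python
import numpy as np

def configuration_model(m, l, r, rng=None):
    """Random (l, r)-regular bipartite task assignment via the configuration model.

    Returns a list of n = m*l/r batches; batches[j] holds the task indices
    assigned to worker j (with multiplicity if a rare multi-edge occurs).
    """
    assert (m * l) % r == 0, "m*l must be divisible by r"
    rng = np.random.default_rng() if rng is None else rng
    n = (m * l) // r
    half_edges = np.repeat(np.arange(m), l)   # l half-edges per task node
    rng.shuffle(half_edges)                   # random pairing of half-edges
    return [half_edges[j * r:(j + 1) * r].tolist() for j in range(n)]

batches = configuration_model(m=1000, l=10, r=20, rng=np.random.default_rng(1))
```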

The intuition behind why random graphs are good for our inference problem is related to the spectral gap of random matrices. In the following, we will use the (approximate) top singular vector of a weighted adjacency matrix of the random graph to find the correct labels. Since sparse random graphs are excellent expanders with large spectral gaps, this enables us to reliably separate the low-rank structure from the data matrix, which is perturbed by random noise.

Inference algorithm. We are given a task allocation graph, where we connect an edge $(i,j)$ if a task $i$ is assigned to a worker $j$. In the following, we will use indices $i \in [m]$ for the $i$-th task node and $j \in [n]$ for the $j$-th worker node. We use $\partial i$ (resp. $\partial j$) to denote the neighborhood of node $i$ (resp. $j$). Each edge $(i,j)$ on the graph has a corresponding worker response $A_{ij} \in \{+1, -1\}$.

To find the correct labels from the given responses of the workers, we introduce a novel iterative algorithm. This algorithm is inspired by the celebrated belief propagation algorithm and by low-rank matrix approximations. The connections are explained in detail in Sections 2.5 and 2.6, along with mathematical justifications.

The algorithm operates on real-valued task messages $\{x_{i\to j}\}$ and worker messages $\{y_{j\to i}\}$, one pair for each edge $(i,j)$ of the assignment graph. A task message $x_{i\to j}$ represents the log-likelihood of task $i$ being a positive task, and a worker message $y_{j\to i}$ represents how reliable worker $j$ is. We start with the worker messages initialized as independent Gaussian random variables, although the algorithm is not sensitive to a specific initialization as long as it has a strictly positive mean. We could also initialize all the messages to one, but then we would need to add extra steps in the analysis to ensure that this is not a degenerate case. At the $k$-th iteration, the messages are updated according to

$x_{i\to j}^{(k)} = \sum_{j' \in \partial i \setminus j} A_{ij'}\, y_{j'\to i}^{(k-1)}\,,$   (1)
$y_{j\to i}^{(k)} = \sum_{i' \in \partial j \setminus i} A_{i'j}\, x_{i'\to j}^{(k)}\,,$   (2)

where $\partial i$ is the neighborhood of the task node $i$ and $\partial j$ is the neighborhood of the worker node $j$. At the task update, we give more weight to the answers that came from more trustworthy workers. At the worker update, we increase our confidence in a worker if the answer she gave on another task, $A_{i'j}$, has the same sign as what we currently believe, $x_{i'\to j}^{(k)}$. Intuitively, a worker message $y_{j\to i}$ represents our belief on how ‘reliable’ the worker $j$ is. Hence, our final estimate is a weighted sum of the answers weighted by each worker’s reliability: $\hat{t}_i = \mathrm{sign}\big(\sum_{j \in \partial i} A_{ij}\, y_{j\to i}^{(k-1)}\big)$.

Iterative Algorithm
Input: $E$, $\{A_{ij}\}_{(i,j)\in E}$, number of iterations $k_{\max}$
Output: Estimate $\hat{t} \in \{\pm 1\}^m$
1: For all $(i,j) \in E$ do
       Initialize $y_{j\to i}^{(0)}$ with random $Z_{ij} \sim \mathcal{N}(1,1)$;
2: For $k = 1, \dots, k_{\max}-1$ do
       For all $(i,j) \in E$ do     $x_{i\to j}^{(k)} \leftarrow \sum_{j' \in \partial i \setminus j} A_{ij'}\, y_{j'\to i}^{(k-1)}$;
       For all $(i,j) \in E$ do     $y_{j\to i}^{(k)} \leftarrow \sum_{i' \in \partial j \setminus i} A_{i'j}\, x_{i'\to j}^{(k)}$;
3: For all $i \in [m]$ do     $x_i^{(k_{\max})} \leftarrow \sum_{j \in \partial i} A_{ij}\, y_{j\to i}^{(k_{\max}-1)}$;
4: Output estimate vector $\hat{t} = \{\mathrm{sign}(x_i^{(k_{\max})})\}_{i \in [m]}$.
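For concreteness, here is a compact NumPy sketch of the iterative algorithm, written by us directly from the update rules (1)-(2); it stores one message per edge and computes each leave-one-out sum as a per-node total minus the excluded term. The variable names are our own.

```python
import numpy as np

def iterative_inference(edges, k_max=20, seed=0):
    """Message-passing estimator of Section 2.1.

    edges : list of (i, j, A_ij) triples with A_ij in {+1, -1}
    Returns an array t_hat in {+1, -1}^m with the estimated labels.
    """
    rng = np.random.default_rng(seed)
    tasks = np.array([i for i, _, _ in edges])
    workers = np.array([j for _, j, _ in edges])
    A = np.array([a for _, _, a in edges], dtype=float)
    m, n = tasks.max() + 1, workers.max() + 1

    y = rng.normal(loc=1.0, scale=1.0, size=len(edges))     # y_{j->i}^{(0)} ~ N(1, 1)
    for _ in range(k_max):
        # Task update: x_{i->j} = sum_{j' in di \ j} A_{ij'} y_{j'->i}
        task_tot = np.bincount(tasks, weights=A * y, minlength=m)
        x = task_tot[tasks] - A * y
        # Worker update: y_{j->i} = sum_{i' in dj \ i} A_{i'j} x_{i'->j}
        worker_tot = np.bincount(workers, weights=A * x, minlength=n)
        y = worker_tot[workers] - A * x
    # Final decision: t_hat_i = sign( sum_{j in di} A_{ij} y_{j->i} )
    x_final = np.bincount(tasks, weights=A * y, minlength=m)
    return np.where(x_final >= 0, 1, -1)
```

On data generated as in the setup above, this estimator can be compared directly against plain majority voting.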

While our algorithm is inspired by the standard Belief Propagation (BP) algorithm for approximating max-marginals [Pea88, YFW03], our algorithm is original and overcomes a few limitations of the standard BP for this inference problem under the crowdsourcing model. First, the iterative algorithm does not require any knowledge of the prior distribution of $p_j$, whereas the standard BP requires it, as explained in detail in Section 2.6. Second, the iterative algorithm is provably order-optimal for this crowdsourcing problem. We use a standard technique, known as density evolution, to analyze the performance of our message-passing algorithm. Although we can write down the density evolution equations for the standard BP for crowdsourcing, it is not trivial to describe or compute the densities, analytically or numerically. It is also very simple to write down the density evolution equations (cf. (13) and (14)) for our algorithm, but it is not a priori clear how one can analyze the densities in this case either. We develop a novel technique to analyze the densities for our iterative algorithm and prove optimality. This technique could be of independent interest in analyzing a broader class of message-passing algorithms.

2.2 Performance guarantee and experimental results

We provide an upper bound on the probability of error achieved by the iterative inference algorithm and task allocation according to the configuration model. The bound decays as $e^{-c\,lq}$ with a universal constant $c$, where $l$ is the number of workers assigned to each task and $q$ is the collective quality of the crowd. Further, an algorithm-independent lower bound that we establish suggests that such a dependence of the error on $lq$ is unavoidable.

2.2.1 Bound on the average error probability

To lighten the notation, let $\hat{l} \equiv l - 1$ and $\hat{r} \equiv r - 1$, and recall that $q = \mathbb{E}[(2p_j - 1)^2]$. Using these notations, we define $\sigma_k^2$ to be the effective variance in the sub-Gaussian tail of our estimates after $k$ iterations of our inference algorithm.

With this, we can prove the following upper bound on the probability of error when we run $k$ iterations of our inference algorithm with $(l, r)$-regular task assignments on $m$ tasks using a crowd with collective quality $q$. We refer to Section 3.1 for the proof.

Theorem 2.1.

For fixed $l > 1$ and $r > 1$, assume that $m$ tasks are assigned to $n = ml/r$ workers according to a random $(l, r)$-regular graph drawn from the configuration model. If the distribution of the worker reliability satisfies $\mu \equiv \mathbb{E}[2p_j - 1] > 0$ and $q^2 \hat{l}\hat{r} > 1$, then for any true answers $t \in \{\pm 1\}^m$, the estimate after $k$ iterations of the iterative algorithm achieves

(3)

The second term, which is the probability that the resulting graph is not locally tree-like, vanishes for large $m$. Hence, the dominant term in the error bound is the first term. Further, when $q^2\hat{l}\hat{r} > 1$ as per our assumption and when we run our algorithm for a large enough number of iterations, $\sigma_k^2$ converges linearly to a finite limit.

With linear convergence of $\sigma_k^2$, we only need a small number of iterations to get close to this limit. It follows that for large enough $m$ and $k$, we can prove an upper bound that does not depend on the problem size or the number of iterations, which is stated in the following corollary.

Corollary 2.2.

Under the hypotheses of Theorem 2.1, there exist $m_0$ and $k_0$ such that

(4)

for all $m \ge m_0$ and $k \ge k_0$.

Proof.

For as per our assumption, iterations suffice to ensure that . Also, suffices to ensure that .

The required number of iterations is small (only logarithmic in the relevant parameters) and does not depend on the problem size $m$. On the other hand, the required number of tasks in our main theorem is quite large. However, numerical simulations suggest that the actual performance of our approach is not very sensitive to the number of tasks and that the bound still holds for tasks of small size as well. For example, in Figure 1 (left), we ran a numerical experiment, and the resulting error exhibits exponential decay as predicted by our theorem even for large $l$. In this case, the theoretical requirement on the number of tasks is much larger than what we used in the experiment.

Consider a set of worker distributions that have the same collective quality $q$. Distributions with the same value of $q$ can give different values of $\mu = \mathbb{E}[2p_j - 1]$, ranging from $q$ to $\sqrt{q}$. Our main result on the error rate suggests that the error does not depend on the value of $\mu$. Hence, the effective second moment $q$ is the right measure of the collective quality of the crowd, and the effective first moment $\mu$ only affects how fast the algorithm converges, since the number of iterations we need to run to guarantee the error bound depends on $\mu$.

The iterative algorithm is efficient, with run-time comparable to that of majority voting, which requires $O(ml)$ operations. Each iteration of the iterative algorithm requires $O(ml)$ operations, and we need only a logarithmic number of iterations to ensure the error bound in (4). By definition, we have $\mu^2 \le q$. The run-time is the worst when $\mu^2 \ll q$, which happens under the spammer-hammer model, and it is the smallest when $\mu^2 = q$, which happens if $p_j$ is deterministic. In any case, compared to majority voting we only need an extra logarithmic factor, which does not increase with $m$. Notice that as we increase the number of iterations, the messages converge to an eigenvector of a particular sparse linear operator. This suggests that we can alternatively compute the messages using other algorithms for computing the top singular vector of large sparse matrices that are known to converge faster (e.g., the Lanczos algorithm [Lan50]).
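For completeness, a sketch (ours) of the singular-vector variant mentioned above: take the leading left singular vector of the sparse $m \times n$ response matrix and read off its signs, resolving the global sign ambiguity by aligning with majority voting, which is justified by the assumption $\mathbb{E}[p_j] > 1/2$.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.linalg import svds

def singular_vector_estimate(edges, m, n):
    """Estimate the labels from the leading left singular vector of the response matrix."""
    rows = [i for i, _, _ in edges]
    cols = [j for _, j, _ in edges]
    vals = [float(a) for _, _, a in edges]
    A = coo_matrix((vals, (rows, cols)), shape=(m, n)).tocsr()
    u, _, _ = svds(A, k=1)                      # leading singular triplet of the sparse matrix
    t_hat = np.where(u[:, 0] >= 0, 1, -1)
    # Resolve the +/- ambiguity of the singular vector: align with majority voting.
    mv = np.where(np.asarray(A.sum(axis=1)).ravel() >= 0, 1, -1)
    if np.sum(t_hat == mv) < m / 2:
        t_hat = -t_hat
    return t_hat
```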

Next, we make a few remarks on the performance guarantee.

First, the assumption that $\mu > 0$ is necessary. If there is no assumption on $\mu$, then we cannot distinguish whether the responses came from tasks with answers $\{t_i\}$ and workers with reliabilities $\{p_j\}$ or from tasks with $\{-t_i\}$ and workers with $\{1 - p_j\}$. Statistically, both of them give the same output. The hypothesis $\mu > 0$ allows us to distinguish which of the two is the correct solution. In the case when we know that $\mu < 0$, we can use the same algorithm, changing the sign of the final output, and get the same performance guarantee.

Second, our algorithm does not require any information on the distribution of $p_j$. However, in order to generate a graph that achieves an optimal performance, we need the knowledge of $q$ for selecting the task degree $l$. Here is a simple way to overcome this limitation at the loss of only an additional constant factor, i.e. the scaling of cost per task still remains $O\big((1/q)\log(1/\epsilon)\big)$. To that end, consider an incremental design in which at step $a$ the system is designed assuming $q = 2^{-a}$, for $a \in \{1, 2, \dots\}$. At step $a$, we design two replicas of the task allocation for $q = 2^{-a}$. Now compare the estimates obtained by these two replicas for all tasks. If they agree on at least a $(1 - 2\epsilon)$ fraction of the tasks, then we stop and declare that as the final answer. Otherwise, we increase $a$ by one and repeat. Note that by our optimality result, it follows that if $2^{-a}$ is less than the actual $q$, then the iteration must stop with high probability. Therefore, the total cost paid per task is $O\big((1/q)\log(1/\epsilon)\big)$ with high probability. Thus, even the lack of knowledge of $q$ does not affect the order optimality of our algorithm.
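A sketch of this incremental design (ours, in pseudocode-style Python); `run_two_replicas` is a hypothetical subroutine that posts two independent task allocations designed for a guessed collective quality `q_guess` and returns the two resulting estimate vectors, and the agreement threshold follows the description above.

```python
import numpy as np

def estimate_without_knowing_q(run_two_replicas, eps, max_rounds=20):
    """Halve the guessed q until two independent replicas of the design agree
    on (almost) all tasks, then accept their common answer.

    run_two_replicas(q_guess) -> (t1, t2): two estimate vectors in {+1, -1}^m,
    each obtained from a fresh task allocation designed assuming q = q_guess.
    """
    q_guess = 0.5
    t1 = None
    for _ in range(max_rounds):
        t1, t2 = run_two_replicas(q_guess)
        if np.mean(t1 == t2) >= 1.0 - 2.0 * eps:   # replicas agree: stop
            return t1
        q_guess /= 2.0                             # assume a worse crowd and retry
    return t1                                      # fall back to the last estimate
```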

Further, unlike previous approaches based on Expectation Maximization (EM), the iterative algorithm is not sensitive to initialization and converges to a unique solution from a random initialization with high probability. This follows from the fact that the algorithm is essentially computing a leading eigenvector of a particular linear operator.

Finally, we observe a phase transition around $q^2\hat{l}\hat{r} = 1$. Above this phase transition, when $q^2\hat{l}\hat{r} > 1$, we will show that our algorithm is order-optimal and the probability of error is significantly smaller than that of majority voting. However, perhaps surprisingly, when we are below the threshold, when $q^2\hat{l}\hat{r} < 1$, we empirically observe that our algorithm exhibits a fundamentally different behavior (cf. Figure 1). The error we get after $k$ iterations of our algorithm increases with $k$. In this regime, we are better off stopping the algorithm after one iteration, in which case the estimate we get is essentially the same as simple majority voting, and we cannot do better than majority voting. This phase transition is universal, and we observe similar behavior with other inference algorithms, including EM approaches. We provide more discussion on the choice of $l$ and the limitations of having a small $r$ in the following section.

2.2.2 Minimax optimality of our approach

For a taskmaster, the natural core optimization problem of her concern is how to achieve a certain reliability in the answers with minimum cost. Throughout this paper, we assume that the cost is proportional to the total number of queries. In this section, we show that if a taskmaster wants to achieve a target error rate of $\epsilon$, she can do so using our approach with a budget per task scaling as $O\big((1/q)\log(1/\epsilon)\big)$ for a broad range of the worker degree $r$. Compared to the necessary condition which we provide in Section 2.3, this is within a constant factor of what is necessary using the best non-adaptive task assignment and the best inference algorithm. Further, we show in Section 2.4 that this scaling in the budget is still necessary even if we allow using the best adaptive task assignment together with the best inference algorithm. This proves that our approach is minimax optimal up to a constant factor in the budget.

Assuming for now that there are no restrictions on the worker degree $r$ and we can assign as many tasks to each worker as we want, we can get the following simplified upper bound on the error. To simplify the resulting bound, let us assume for now that $r$ is large enough; then from (4), we get the following bound:

for large enough $r$. In terms of the budget or the number of queries necessary to achieve a target accuracy, we get the following sufficient condition as a corollary.

Corollary 2.3.

Using the non-adaptive task assignment scheme introduced in Section 2.1, with a suitable choice of the degrees, and the iterative inference algorithm, it is sufficient to query $O\big((1/q)\log(1/\epsilon)\big)$ times per task to guarantee that the probability of error is at most $\epsilon$ for any $\epsilon > 0$ and for all $m$ large enough.

We provide a matching minimax necessary condition, up to a constant factor, for non-adaptive algorithms in Section 2.3. When nature can choose the worst-case worker distribution, no non-adaptive algorithm can achieve error less than $\epsilon$ with a budget per task smaller than $(c/q)\log(1/\epsilon)$ for some universal positive constant $c$. This establishes that under the non-adaptive scenario, our approach is minimax optimal up to a constant factor for large enough $m$. With our approach you only need to ask (and pay for) a constant factor more than what is necessary using the best non-adaptive task assignment scheme together with the best inference algorithm under the worst-case worker distribution.

Perhaps surprisingly, we will show in Section 2.4 that the necessary condition does not change even if we allow adaptive task assignments. No algorithm, adaptive or non-adaptive, can achieve error less than $\epsilon$ without asking $(c/q)\log(1/\epsilon)$ queries per task, for some universal positive constant $c$. Hence, our non-adaptive approach achieves the minimax optimal performance achievable by the best adaptive scheme.

In practice, we might not be allowed to have a large $r$, depending on the application. For different regimes of restrictions on the allowed worker degree $r$, we need different choices of the task degree $l$. When we have a target accuracy $\epsilon$, the following corollary establishes that we can achieve probability of error at most $\epsilon$ with an appropriate choice of $l$ for any value of $r$.

Corollary 2.4.

Using the non-adaptive task assignment scheme with any worker degree $r$ and the iterative inference algorithm introduced in Section 2.1, it is sufficient to query $O\big((1/q)(1 + 1/(qr))\log(1/\epsilon)\big)$ times per task to guarantee that the probability of error is at most $\epsilon$ for any $\epsilon > 0$ and for all $m$ large enough.

Proof.

We will show that for , the probability of error is at most . Since, for , this proves the corollary. Since from the first condition, we get that . Then, the probability of error is upper bounded by . This implies that for the probability of error is at most .

For $r \ge 1/q$, this implies that our approach requires $O\big((1/q)\log(1/\epsilon)\big)$ queries per task, and it is minimax optimal. However, for $r < 1/q$, our approach requires $O\big((1/(q^2 r))\log(1/\epsilon)\big)$ queries per task. This is due to the fact that when $r$ is small, we cannot efficiently learn the quality of the workers and need significantly more questions to achieve the accuracy we desire. Hence, in practice, we want to be able to assign more tasks to each worker when we have low-quality workers.

2.2.3 Experimental results

Figure 1 shows comparisons between the probabilities of error achieved by different inference algorithms on the same task assignment using regular bipartite random graphs. We ran iterations of EM and of our iterative algorithm, and also the spectral approach of using the leading left singular vector of the response matrix for estimation. The spectral approach, which we call Singular Vector in the figure, is explained in detail in Section 2.5. The error rates are compared with those of majority voting and the oracle estimator. The oracle estimator’s performance sets a lower bound on what any inference algorithm can achieve, since it knows all the values of the $p_j$’s. For the numerical simulation on the left-hand side, we used the spammer-hammer model for the distribution of the workers. According to our theorem, we expect a phase transition, and from the figure we observe that the iterative inference algorithm starts to perform better than majority voting once $l$ exceeds a small threshold. For the figure on the right-hand side, we used a different setting of the parameters. For fair comparison with the EM approach, we used an implementation of the EM approach in Java by Sheng et al. [SPI08], which is publicly available.

Figure 1: The iterative algorithm improves over majority voting and the EM algorithm. Using the top singular vector for inference has similar performance to our iterative approach.

We also ran two experiments with real crowd using Amazon Mechanical Turk. In our experiments, we created tasks for comparing colors; we showed three colors on each task, one on the top and two on the bottom. We asked the crowd to indicate “if the color on the top is more similar to the color on the left or on the right”.

The first experiment confirms that the ground truth for these color comparison tasks is what is expected from pairwise distances in the Lab color space. The distances in the Lab color space between a pair of colors are known to be a good measure of the perceived difference between the pair [WS67]. To check the validity of this Lab distance, we collected responses on each of the color comparison tasks. As shown in Figure 2, for all tasks, the majority of the responses were consistent with the Lab-distance-based ground truth.

Next, to test our approach, we created a set of such similarity tasks and recruited workers to answer all the questions. Once we have this data, we can subsample it to simulate what would have happened if we had collected a smaller number of responses per task. The resulting average probability of error is illustrated in Figure 3. For this crowd from Amazon Mechanical Turk, we can estimate the collective quality $q$ from the data. Based on this estimate, our theory predicts the value of $l$ at which the phase transition should happen, and in Figure 3 we indeed see that our iterative algorithm starts to perform better than majority voting around that value.

Figure 2: Experimental results on color comparison using real data from Amazon’s Mechanical Turk. The color on the left is closer to the one on the top in Lab distance for each triplet. The votes from workers are shown below each triplet.
Figure 3: The average probability of error on color comparisons using real data from Amazon’s Mechanical Turk.

2.3 Fundamental limit under the non-adaptive scenario

Under the non-adaptive scenario, we are allowed to use only non-adaptive task assignment schemes which assign all the tasks a priori and collect all the responses simultaneously. In this section, we investigate the fundamental limit on how small an error can be achieved using the best possible non-adaptive task assignment scheme together with the best possible inference algorithm. In particular, we are interested in the minimax optimality: What is the minimum error that can be achieved under the worst-case worker distribution? To this end, we analyze the performance of an oracle estimator when the workers’ latent qualities are drawn from a specific distribution and provide a lower bound on the minimax rate on the probability of error. Compared to our main result, this establishes that our approach is minimax optimal up to a constant factor.

In terms of the budget, the natural core optimization problem of our concern is how to achieve a certain reliability in our answers with minimum cost. Let us assume that the cost is proportional to the total number of queries. We show that for a given target error rate $\epsilon$, the total budget sufficient to achieve this target error rate using our algorithm is within a constant factor of what is necessary using the best non-adaptive task assignment and the best inference algorithm.

Fundamental limit. Consider a crowd characterized by a worker reliability distribution such that $\mathbb{E}[(2p_j - 1)^2] = q$. Let $\mathcal{F}_q$ be the set of all distributions on $[0, 1]$ whose collective quality is parametrized by $q$: $\mathcal{F}_q \equiv \{\, f \,:\, \mathbb{E}_f[(2p - 1)^2] = q \,\}$.

We want to prove a lower bound on the minimax rate on the probability of error, which only depends on $q$ and the total budget. Define the minimax rate as

$\min_{\tau,\, \hat{t}}\; \max_{f \in \mathcal{F}_q,\; t \in \{\pm 1\}^m}\; \frac{1}{m} \sum_{i \in [m]} \mathbb{P}\big(t_i \neq \hat{t}_i\big)\,,$

where $\hat{t}$ ranges over all estimators which are measurable functions over the responses, and $\tau$ ranges over the set of all task assignment schemes which are non-adaptive and ask $ml$ queries in total. Here the probability is taken over all realizations of the $p_j$’s, the $A_{ij}$’s, and the randomness introduced in the task assignment and the inference.

Consider any non-adaptive scheme that assigns $l_i$ workers to the $i$-th task. The only constraint is that the average number of queries per task is bounded by $l$, i.e. $(1/m)\sum_{i \in [m]} l_i \le l$. To get a lower bound on the minimum achievable error, we consider an oracle estimator that has access to all the $p_j$’s, and hence can make an optimal estimation. Further, since we are proving minimax optimality and not instance-optimality, the worst-case error rate is always lower bounded by the error rate for any particular choice of worker distribution. In particular, we prove a lower bound using the spammer-hammer model. Concretely, we assume the $p_j$’s are drawn from the spammer-hammer model with perfect hammers:

$p_j = 1/2$ with probability $1 - q$, and $p_j = 1$ with probability $q$.

Notice that the use of $q$ here is consistent with $q = \mathbb{E}[(2p_j - 1)^2]$. Under the spammer-hammer model, the oracle estimator only makes a mistake on task $i$ if the task is only assigned to spammers, in which case we flip a fair coin to achieve an error probability of one half. Formally,

$\mathbb{P}\big(t_i \neq \hat{t}_i\big) \;=\; \tfrac{1}{2}\,(1 - q)^{l_i}\,.$

By convexity and using Jensen’s inequality, the average probability of error is lower bounded by

$\frac{1}{m} \sum_{i \in [m]} \mathbb{P}\big(t_i \neq \hat{t}_i\big) \;=\; \frac{1}{2m} \sum_{i \in [m]} (1 - q)^{l_i} \;\ge\; \tfrac{1}{2}\,(1 - q)^{l}\,,$

since $x \mapsto (1-q)^x$ is convex and the average number of queries per task is at most $l$.

Since we are interested in how many more queries are necessary as the quality of the crowd deteriorates, we are going to assume $q \le 1/2$, in which case $1 - q \ge 4^{-q}$ and the above bound is at least $\tfrac{1}{2}\, 4^{-ql}$. As long as $ml$ total queries are used, this lower bound holds regardless of how the actual tasks are assigned. And since this lower bound holds for a particular choice of the worker distribution, it holds for the worst case as well. Hence, for the best task assignment scheme and the best inference algorithm, we have

$\min_{\tau,\, \hat{t}}\; \max_{f \in \mathcal{F}_q,\; t}\; \frac{1}{m} \sum_{i \in [m]} \mathbb{P}\big(t_i \neq \hat{t}_i\big) \;\ge\; \tfrac{1}{2}\, e^{-(2\ln 2)\, ql}\,.$

This lower bound on the minimax rate holds for any positive integer $m$, and regardless of the number of workers or the number of queries, $r$, assigned to each worker. In terms of the average number of queries necessary to achieve a target accuracy of $\epsilon$, this implies the following necessary condition.

Lemma 2.5.

Assuming $q \le 1/2$ and the non-adaptive scenario, if the average number of queries per task is less than $(c/q)\log(1/\epsilon)$ for a universal positive constant $c$, then no algorithm can achieve average probability of error less than $\epsilon$ for any $m$ under the worst-case worker distribution.

To prove this worst-case bound, we analyzed a specific distribution, namely the spammer-hammer model. However, the result (up to a constant factor) seems to be quite general and can also be proved using different distributions, e.g. when all workers have the same quality. The assumption on $q$ can be relaxed as much as we want, by increasing the constant in the necessary budget. Compared to the sufficient condition in Corollary 2.3, this establishes that our approach is minimax optimal up to a constant factor. With our approach you only need to ask (and pay for) a constant factor more than what is necessary for any algorithm.

Majority voting. As a comparison, we can do similar analysis for the simple majority voting and show that the performance is significantly worse than our approach. The next lemma provides a bound on the minimax rate of majority voting. A proof of this lemma is provided in Section 3.4.

Lemma 2.6.

For any , there exists a positive constant such that when , the error achieved by majority voting is at least

In terms of the number of queries necessary to achieve a target accuracy using majority voting, this implies that we need to ask at least queries per task for some universal constants and . Hence, majority voting is significantly more costly than our approach in terms of budget. Our algorithm is more efficient in terms of computational complexity as well. Simple majority voting requires operations to achieve target error rate in the worst case. From Corollary 2.2, together with and , we get that our approach requires operations in the worst case.

2.4 Fundamental limit under the adaptive scenario

In terms of the scaling of the budget necessary to achieve a target accuracy, we established that using a non-adaptive task assignment, no algorithm can do better than our approach. One might prefer a non-adaptive scheme in practice because having all the batches of tasks processed in parallel reduces the latency. This is crucial in many applications, especially in real-time applications such as searching, visual information processing, and document processing [BJJ10, BLM10, YKG10, BBMK11]. However, by switching to an adaptive task assignment, one might hope to be more efficient and still obtain a desired accuracy from fewer questions. On one hand, adaptation can help improve performance. But on the other hand, it can significantly complicate system design due to careful synchronization requirements. In this section, we want to prove an algorithm-independent upper bound on how much one can gain by using an adaptive task allocation.

When the identities of the workers are known, one might be tempted to first identify which workers are more reliable and then assign all the tasks to those workers in an explore/exploit manner. However, in typical crowdsourcing platforms such as Amazon Mechanical Turk, it is unrealistic to assume that we can identify and reuse any particular worker, since typical workers are neither persistent nor identifiable and batches are distributed through an open-call. Hence, exploiting a reliable worker is not possible. However, we can adaptively resubmit batches of tasks; we can dynamically choose which subset of tasks to assign to the next arriving worker. In particular, we can allocate tasks to the next batch based on all the information we have on all the tasks from the responses collected thus far. For example, one might hope to reduce uncertainty more efficiently by adaptively collecting more responses on those tasks that she is less certain about.

Fundamental limit. In this section, we show that, perhaps surprisingly, there is no significant gain in switching from our non-adaptive approach to an adaptive strategy when the workers are fleeting. We first prove a lower bound on the minimax error rate: the error that is achieved by the best inference algorithm using the best adaptive task allocation scheme under a worst-case worker distribution and the worst-case true answers . Let be the set of all task assignment schemes that use at most queries in total. Then, we can show the following lower bound on the minimax rate on the probability of error. A proof of this theorem is provided in Section 3.5.

Theorem 2.7.

When for any constant , there exists a positive constant such that

(5)

for all where the task assignment scheme ranges over all adaptive schemes that use at most queries and ranges over all estimators that are measurable functions over the responses.

We cannot avoid the factor of one half in the lower bound, since we can always achieve an error probability of one half without asking any queries (e.g., by guessing each answer uniformly at random). In terms of the budget required to achieve a target accuracy, the above lower bound proves that no algorithm, adaptive or non-adaptive, can achieve an error rate less than with a number of queries per task less than in the worst case over worker distributions.

Corollary 2.8.

Assuming for any constant and the adaptive scenario, there exists a positive constant such that if the average number of queries is less than , then no algorithm can achieve average probability of error less than for any under the worst-case worker distribution.

Compared to Corollary 2.3, we have matching necessary and sufficient conditions up to a constant factor. This proves that there is no significant gain in using an adaptive scheme, and that our approach achieves minimax optimality up to a constant factor with a non-adaptive scheme. This limitation of adaptation strongly relies on the fact that workers are fleeting in existing platforms and cannot be reused. Therefore, architecturally, our results suggest that building a reliable reputation system for workers would be essential to harnessing the potential of adaptive designs.

A counter example for instance-optimality. The above corollary establishes minimax-optimality: for the worst-case worker distribution, no algorithm can improve over our approach other than improving the constant factor in the necessary budget. However, this does not imply instance-optimality. In fact, there exists a family of worker distributions where all non-adaptive algorithms fail to achieve order-optimal performance whereas a trivial adaptive algorithm succeeds. Hence, for particular instances of worker distributions, there exists a gap between what can be achieved using non-adaptive algorithms and adaptive ones.

We will prove this in the case of the spammer-hammer model where each new worker is a hammer () with probability or a spammer () otherwise. We showed in Section 2.3 that no non-adaptive algorithm can achieve an error less than for any value of . In particular, this does not vanish even if we increase . We will introduce a simple adaptive algorithm and show that this algorithm achieves an error probability that goes to zero as grows.

The algorithm first groups all the tasks into disjoint sets of size each. Starting with the first group, the algorithm assigns all tasks in the group to newly arriving workers until it sees two workers who agree on all tasks. It declares those responses as its estimate for this group and moves on to the next group. This process is repeated until the allowed number of queries is reached. This estimator makes a mistake on a group if two spammers agreed on all tasks, or if the algorithm runs out of the allowed number of queries before finishing the last group. Formally, we can prove the following upper bound on the probability of error.

Lemma 2.9.

Under the spammer-hammer model, when the allowed number of queries per task is larger than , there is an adaptive task allocation scheme and an inference algorithm that achieves average probability of error at most .

Proof.

Recall that we are only allowed queries. Since we are allocating queries per worker, we can only ask at most workers. First, the probability that there is at least one pair of spammers (among all possible pairs from workers) who agree on all responses is at most . Next, given that no pair of spammers agreed on all their responses, the probability that we run out of the allowed queries is the probability that the number of hammers among the workers is strictly less than (which is the number of hammers we need in order to terminate the algorithm, conditioned on no two spammers agreeing with one another). By standard concentration results, this happens with probability at most .
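To illustrate the first step of this argument, a hedged sketch of the union bound (writing $W$ for the total number of workers queried and $t$ for the group size, both symbols introduced here for illustration, and assuming a spammer answers every task uniformly at random, as in the spammer-hammer model):

\[
\Pr\big[\text{some pair of spammers agrees on all $t$ tasks of a group}\big]
\;\le\; \binom{W}{2}\, 2^{-t}.
\]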

This proves the existence of an adaptive algorithm which achieves vanishing error probability as grows for a broad range of task degrees . Comparing the above upper bound with the known lower bound for non-adaptive schemes proves that non-adaptive algorithms cannot be instance-optimal: there is a family of distributions where adaptation can significantly improve performance. This is generally true whenever there is a strictly positive probability that a worker is a hammer ().

One might be tempted to apply the above algorithm in settings more general than the spammer-hammer model. However, this algorithm fails when there are no perfect workers in the crowd. If we apply it in such a general setting, it produces useless answers: the probability of error approaches one half as grows for any finite .
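Returning to the grouping strategy of Lemma 2.9, the following is a minimal simulation sketch (Python); the group size, hammer probability, and query budget are illustrative parameters chosen here, not values fixed by the analysis, and the sketch is only meaningful under the spammer-hammer model.

import numpy as np

def adaptive_spammer_hammer(true_answers, group_size, hammer_prob, max_queries, rng):
    # Group tasks; for each group, query fresh workers until two of them
    # agree on every task in the group, then record that common response.
    n = len(true_answers)
    estimates = np.zeros(n, dtype=int)   # tasks left unanswered count as errors
    queries_used = 0
    for start in range(0, n, group_size):
        group = np.arange(start, min(start + group_size, n))
        seen = []
        while queries_used + len(group) <= max_queries:
            if rng.random() < hammer_prob:                 # hammer: always correct
                resp = true_answers[group].copy()
            else:                                          # spammer: uniform random answers
                resp = rng.choice([-1, 1], size=len(group))
            queries_used += len(group)
            if any(np.array_equal(prev, resp) for prev in seen):
                estimates[group] = resp                    # two workers agreed on all tasks
                break
            seen.append(resp)
    return estimates

rng = np.random.default_rng(0)
s = rng.choice([-1, 1], size=1000)                         # true answers
s_hat = adaptive_spammer_hammer(s, group_size=25, hammer_prob=0.3,
                                max_queries=20000, rng=rng)
print("error rate:", np.mean(s_hat != s))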

2.5 Connections to low-rank matrix approximation

In this section, we first explain why the top singular vector of the data matrix reveals the true answers of the tasks, where is the matrix of the responses and we fill in zeros wherever no response was collected. This naturally defines a spectral algorithm for inference, which we present next. It was proven in [KOS11] that the error achieved by this spectral algorithm is upper bounded by with some constant . However, numerical experiments (cf. Figure 1) suggest that the error decays much faster, and that the gap is due to the weakness of the analysis used in [KOS11]. Inspired by this spectral approach, we introduced a novel inference algorithm that performs as well as the spectral algorithm (cf. Figure 1) and proved a much tighter upper bound on the resulting error, which scales as with some constant . Our inference algorithm is based on power iteration, a well-known algorithm for computing the top singular vector of a matrix, and Figure 1 suggests that both algorithms are equally effective, with almost identical resulting errors.

The data matrix can be viewed as a rank-one matrix perturbed by random noise. Since , the conditional expectation of this matrix is

where is the all-ones vector, the vector of correct solutions is , and the vector of worker reliabilities is . Notice that the rank of this conditional expectation matrix is one, and this matrix reveals the correct solutions exactly. We can decompose into a low-rank expectation plus a random perturbation:

where is the random perturbation with zero mean. When the spectral radius of the noise matrix is much smaller than the spectral radius of the signal, we can correctly extract most of using the leading left singular vector of .

Under the crowdsourcing model considered in this paper, an inference algorithm using the top left singular vector of was introduced and analyzed by Karger et al. [KOS11]. Let be the top left singular vector of . They proposed estimating and proved an upper bound on the probability of error that scales as . The main technique behind this result is in analyzing the spectral gap of . It is not difficult to see that the spectral radius of the conditional expectation matrix is , where the operator norm of a matrix is denoted by . Karger et al. proved that the spectral radius of the perturbation is in the order of . Hence, when , we expect a separation between the conditional expectation and the noise.

One way to compute the leading singular vector is to use power iteration: for two vectors and , starting with a randomly initialized , power iteration iteratively updates and by repeating and . It is known that normalized (and ) converges linearly to the leading left (and right) singular vector. Then we can use the sign of to estimate . Writing the update rule for each entry, we get

Notice that this power iteration update rule is almost identical to the message passing updates in (1) and (2). The task messages from task are close in value to the corresponding entry of the power iteration, and the worker messages from worker are close in value to the corresponding entry of the power iteration. Numerical simulations in Figure 1 suggest that the quality of the estimates from the two algorithms is almost identical. However, the known performance guarantee for the spectral approach is weak. We developed novel techniques to analyze our message-passing algorithm and provide an upper bound on the error that scales as . It might be possible to apply our algorithm, together with these analysis techniques, to other problems where the top singular vector of a data matrix is used for inference.
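As an illustration of the spectral approach, the following sketch (Python with NumPy) builds a synthetic zero-filled response matrix under a spammer-hammer crowd and estimates the answers from the sign of the leading left singular vector computed by power iteration; all sizes and probabilities here are illustrative assumptions rather than parameters fixed by the text.

import numpy as np

rng = np.random.default_rng(1)
n, m, l = 500, 500, 15                 # tasks, workers, queries per task (illustrative)
s = rng.choice([-1, 1], size=n)        # true answers
p = np.where(rng.random(m) < 0.3, 1.0, 0.5)   # worker reliabilities: hammer or spammer

# Zero-filled response matrix A: A[i, j] is worker j's answer to task i, 0 if not asked.
A = np.zeros((n, m))
for i in range(n):
    workers = rng.choice(m, size=l, replace=False)
    correct = rng.random(l) < p[workers]
    A[i, workers] = np.where(correct, s[i], -s[i])

# Power iteration for the leading left singular vector of A.
u = rng.standard_normal(n)
for _ in range(30):
    v = A.T @ u
    v /= np.linalg.norm(v)
    u = A @ v
    u /= np.linalg.norm(u)

# The singular vector is determined only up to a global sign; resolve it against
# the truth here purely for evaluation purposes.
s_hat = np.sign(u)
err = min(np.mean(s_hat != s), np.mean(-s_hat != s))
print("spectral estimate error rate:", err)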

2.6 Connections to belief propagation

The crowdsourcing model described in this paper can naturally be described using a graphical model. Let denote the weighted bipartite graph, where is the set of task nodes, is the set of worker nodes, is the set of edges connecting a task to a worker who is assigned that task, and is the set of weights on those edges according to the responses. Given such a graph, we want to find a set of task answers that maximize the following posterior distribution .

where, with a slight abuse of notation, we use to denote the prior probability density function of the ’s, and we use and to denote task nodes and and to denote worker nodes. For such a problem of finding the most probable realization in a graphical model, the celebrated belief propagation (BP) algorithm gives a good approximate solution. To be precise, BP is an approximation for maximizing the marginal distribution of each variable, while a similar algorithm, known as the min-sum algorithm, approximates the most probable realization. However, the two algorithms are closely related, and in this section we only present standard BP. There is a long line of literature providing theoretical and empirical evidence supporting the use of BP [Pea88, YFW03].

Under the crowdsourcing graphical model, standard BP operates on two sets of messages: the task messages and the worker messages . In our iterative algorithm the messages were real-valued scalar variables, whereas the messages in BP are probability density functions. Each task message corresponds to an edge, and each worker message also corresponds to an edge. The task node corresponds to the random variable , and the task message from task to worker , denoted by , represents our belief about the random variable . Then is a probability distribution over . Similarly, a worker node corresponds to a random variable . The worker message is a probability distribution of over . Following the standard BP framework, we iteratively update the messages according to the following rule. We start with randomly initialized messages and, at each iteration,

for all and for . The above update rule only determines the messages up to a scaling, where indicates that the left-hand side is proportional to the right-hand side. The algorithm produces the same estimates in the end regardless of the scaling. After a predefined number of iterations, we make a decision by computing the decision variable

and estimating .

In the special case of the Haldane prior, where a worker either always tells the truth or always gives the wrong answer,

the above BP updates boil down to our iterative inference algorithm. Let denote the log-likelihood of . Under the Haldane prior, is also a binary random variable. We can use to denote the log-likelihood of . After some simplification, the above BP update boils down to

This is exactly the same update rule as in our iterative inference algorithm (cf. Eqs. (1) and (2)). Thus, our algorithm is belief propagation for a very specific prior. Despite this, it is surprising that it performs near-optimally (with random regular graphs for task allocation) for all priors. This robustness property is due to the model assumed in this crowdsourcing problem and is not to be expected in general.
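Since the precise form of Eqs. (1) and (2) is not restated in this section, the following Python sketch should be read as an illustrative reconstruction of scalar message passing of this flavor: task messages aggregate weighted worker messages, worker messages aggregate weighted task messages, each time excluding the receiving edge; the initialization and normalization details are assumptions.

import numpy as np

def iterative_inference(A, n_iter=20, rng=None):
    # A is the n-by-m zero-filled response matrix with entries in {-1, 0, +1}.
    rng = rng or np.random.default_rng()
    n, m = A.shape
    mask = (A != 0).astype(float)
    # One worker message per edge (i, j), initialized around 1 (an assumption).
    Y = mask * (1.0 + rng.standard_normal((n, m)))
    for _ in range(n_iter):
        # Task message x_{i->j}: sum over workers j' != j of A[i, j'] * y_{j'->i}.
        row = np.sum(A * Y, axis=1, keepdims=True)
        X = mask * (row - A * Y)
        # Worker message y_{j->i}: sum over tasks i' != i of A[i', j] * x_{i'->j}.
        col = np.sum(A * X, axis=0, keepdims=True)
        Y = mask * (col - A * X)
        # Rescale to avoid numerical blow-up; the final signs are unaffected.
        Y /= max(np.abs(Y).max(), 1.0)
    # Decision variable per task; its sign is the estimate.
    x = np.sum(A * Y, axis=1)
    return np.sign(x)

Excluding the receiving edge (the "extrinsic" update) is what distinguishes these messages from plain power iteration on A, even though the two update rules are otherwise nearly identical.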

2.7 Discussion

In this section, we discuss several implications of our main results and possible future research directions in generalizing the model studied in this paper.

Below phase transition. We first discuss the performance guarantees in the below-threshold regime, when . As we will show, the bound in (4) always holds, even when . However, numerical experiments suggest that we should stop our algorithm at the first iteration when we are below the phase transition, as discussed in Section 2.2. We provide an upper bound on the resulting error when only one iteration of our iterative inference algorithm is used (which is equivalent to the majority voting algorithm).

Notice that the bound in (4) is only meaningful when it is less than one half. When or , the right-hand side of inequality (4) is always larger than one half. Hence the upper bound always holds, even without the assumption that , and we include that assumption in the statement of our main theorem only to emphasize the phase transition in how our algorithm behaves.

However, we can also try to obtain a tighter bound than the trivial one half implied by (4) in the below-threshold regime. Specifically, we empirically observe that the error rate increases as the number of iterations increases. Therefore, it makes sense to use a single iteration, in which case the algorithm essentially boils down to the majority rule. We can prove the following error bound, which holds generally for any regime of and any worker distribution . A proof of this statement is provided in Section 3.6.

Lemma 2.10.

For any value of , , and , and any distribution of workers , the estimates we get after the first step of our algorithm achieve

(6)

where .

Since is always between and , the scaling of the above error exponent is always worse than what we obtain after running our algorithm for a long time (cf. Theorem 2.1). This suggests that iterating our inference algorithm helps when , and especially when the gap between and is large. Under these conditions, our approach does significantly better than majority voting (cf. Figure 1). The gain from using our approach is maximized when there exist both good workers and bad workers. This is consistent with our intuition that when there is a variety of workers, our algorithm can identify the good ones and obtain better estimates.
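For reference, the single-iteration (majority voting) estimate discussed above reduces to taking the sign of each task's row sum in the zero-filled response matrix; a minimal sketch follows (the tie-breaking rule is an illustrative choice, not specified by the text).

import numpy as np

def majority_vote(A, rng=None):
    # A is the n-by-m zero-filled response matrix with entries in {-1, 0, +1}.
    rng = rng or np.random.default_rng()
    sums = A.sum(axis=1).astype(float)
    ties = (sums == 0)
    sums[ties] = rng.choice([-1.0, 1.0], size=ties.sum())   # break ties at random
    return np.sign(sums)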

Golden standard units. Next, consider the variation where we ask workers questions whose answers are already known (also known as ‘gold standard units’). We can use these to assess the quality of the workers. There are two ways we can use this information. First, we can embed ‘seed gold units’ among the standard tasks, and use these ‘seed gold units’ in turn to perform more informed inference. However, we can show that there is no gain in the scaling from using such ‘seed gold units’. The optimal lower bound of essentially utilizes the existence of an oracle that can identify the reliability of every worker exactly, i.e. the oracle has far more information than what can be gained from such embedded gold questions. Therefore, ‘seed gold units’ clearly do not help the oracle estimator, and hence the order-optimality of our approach still holds even if we include all strategies that can utilize these ‘seed gold units’. In practice, however, ‘seed gold units’ are commonly used, and they can improve the constant factor in the required budget, but not the scaling.

Alternatively, we can use ‘pilot gold units’ as qualifying or pilot questions that workers must complete in order to participate. Typically a taskmaster does not have to pay for these qualifying questions, and this provides an effective way to increase the quality of the participating workers. Our approach can benefit from such ‘pilot gold units’, which have the effect of increasing the effective collective quality of the crowd . Further, if we can ‘measure’ how the distribution of workers changes when using pilot questions, then our main result fully describes how much we can gain from such pilot questions. In any case, pilot questions only change the distribution of participating workers, and the order-optimality of our approach still holds when we compare against all schemes that use the same pilot questions.

How to optimize over multiple choices of crowds. We next consider the scenario where we have a choice over which crowdsourcing platform to use from a set of platforms with different crowds. Each crowd might have a different worker distribution and a different price. Specifically, suppose there are crowds of workers: the -th crowd has collective quality and requires a payment of to perform a task. Now our optimality result suggests that the per-task cost scales as if we only used workers of class . More generally, if we use a mix of these workers, say a fraction of workers from class , with , then the effective parameter is . Subject to this, the optimal per-task cost scales as . This immediately suggests that the optimal choice of fractions must be such that only if . That is, the optimal choice is to select workers only from the classes that have the maximal quality-per-cost ratio of over . One implication of this observation is a pricing scheme for crowdsourcing platforms: if you are managing a crowdsourcing platform with collective quality and cost , and there is another crowdsourcing platform with and , you want to choose the cost such that your quality-per-cost ratio is at least as good as that of the other crowd: .
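A small sketch of this selection rule (Python) follows; the quality and cost figures are made up for illustration, and the per-task cost formula assumes the scaling discussed above, namely proportional to (cost/quality) times the logarithm of one over the target error.

import math

# Each crowd k has a collective quality q_k and a per-answer price c_k (illustrative numbers).
crowds = {"A": {"q": 0.20, "c": 0.05},
          "B": {"q": 0.05, "c": 0.01},
          "C": {"q": 0.50, "c": 0.15}}

# Pick the crowd maximizing the quality-per-cost ratio q_k / c_k.
best = max(crowds, key=lambda k: crowds[k]["q"] / crowds[k]["c"])
q, c = crowds[best]["q"], crowds[best]["c"]

eps = 1e-3   # target per-task error rate
approx_cost_per_task = (c / q) * math.log(1.0 / eps)   # up to a universal constant
print(best, "ratio:", q / c, "approx cost per task:", round(approx_cost_per_task, 3))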

General crowdsourcing models. Finally, we consider possible generalizations of our model. The model assumed in this paper does not capture several factors: tasks with different levels of difficulty, or workers who always answer positively or negatively. In general, the response of a worker to a binary question may depend on several factors: the correct answer to the task; the difficulty of the task; the expertise or reliability of the worker; and the bias of the worker towards positive or negative answers. Let