FROG: A Fast and Reliable Crowdsourcing Framework [Technical Report]


Peng Cheng, Xiang Lian, Xun Jian, Lei Chen
 Hong Kong University of Science and Technology, Hong Kong, China
{pchengaa, xjian, leichen}@cse.ust.hk
 Kent State University, Ohio, USA
xlian@kent.edu
Abstract

Crowdsourcing, which outsources tasks to human workers, has gained much attention from both academia and industry for decades. Typical existing crowdsourcing platforms, such as CrowdFlower and Amazon Mechanical Turk (AMT), let workers autonomously select tasks to do. However, due to the unreliability of workers or the difficulty of tasks, workers may sometimes finish tasks with incorrect/incomplete answers or with significant time delays. Existing works considered improving the task accuracy through voting or learning methods, but they usually did not fully take into account reducing the latency of task completion. This is especially critical when a task requester posts a group of tasks (e.g., sentiment analysis), and can only obtain the answers of all tasks after the last task is accomplished. As a consequence, the time delay of even one task in this group could postpone the next step of the task requester's work from minutes to days, which is quite undesirable.

Motivated by the importance of both task accuracy and latency, in this paper, we propose a novel crowdsourcing framework, namely the Fast and Reliable crOwdsourcinG framework (FROG), which intelligently assigns tasks to workers, such that the latencies of tasks are reduced and the expected accuracies of tasks are met. Specifically, our FROG framework consists of two important components, the task scheduler and notification modules. For the task scheduler module, we formalize a FROG task scheduling (FROG-TS) problem, in which the server actively assigns tasks to workers with high reliability and low latency. We prove that the FROG-TS problem is NP-hard. Thus, we design two heuristic approaches, request-based and batch-based scheduling. For the notification module, we define an efficient worker notifying (EWN) problem, which only sends task invitations to those workers with high probabilities of accepting the tasks. To tackle the EWN problem, we propose a smooth kernel density estimation approach to estimate the probability that a worker accepts a task invitation. Through extensive experiments, we demonstrate the effectiveness and efficiency of our proposed FROG platform on both real and synthetic data sets.

I Introduction

Nowadays, crowdsourcing has become a very useful and practical tool to process data in many real-world applications, such as sentiment analysis [31], image labeling [40], and entity resolution [39]. Specifically, in these applications, we may encounter many tasks (e.g., identifying whether two photos show the same person), which may look very simple to humans, but are not that trivial for computers (i.e., cannot be accurately solved by algorithms). Therefore, the crowdsourcing platform is used to outsource these so-called human intelligence tasks (HITs) to human workers, which has attracted much attention from both academia [16, 28, 27] and industry [21].

Existing crowdsourcing systems (e.g., CrowdFlower [3] or Amazon Mechanical Turk (AMT) [1]) usually wait for autonomous workers to select tasks. As a result, some difficult tasks may be ignored (due to the lack of domain knowledge) and left with no workers for a long period of time (i.e., with high latency). Worse still, some high-latency (unreliable) workers may hold tasks, but not accomplish them (or finish them carelessly), which would significantly delay the completion time (or reduce the quality) of tasks. Therefore, it is rather challenging to guarantee high accuracy and low latency of tasks in the crowdsourcing system, in the presence of unreliable and high-latency workers.

Consider an example of automatic emergency response for public places of interest, which monitors incidents happening at important places (e.g., crossroads). In such applications, results with low latencies and high accuracies are desired. However, due to the limitations of current computer vision and AI technology, computers cannot do this well without help from humans. For example, a person can tell that a car may cause an accident only by understanding the road signs in the picture and the traffic regulations, which computers cannot do. Applications may embed "human power" into the system as a module, which assigns monitoring pictures to crowdsourcing workers and aggregates the answers (e.g., "Normal" or "Accident") from workers in almost real time. Thus, the latency of the crowdsourcing module will affect the overall application performance.

Example 1 (Accuracy and Latency Problems in the Crowdsourcing System). The application above automatically selects and posts 5 pictures as 5 emergency recognition tasks at different timestamps, respectively, on a crowdsourcing platform. Assume that 3 workers from the crowdsourcing system autonomously accept some or all of the 5 tasks posted by the emergency response system.

Table I shows the answers and time delays of tasks conducted by the three workers, where the last column provides the correctness of the emergency recognition answers against the ground truth. Due to the unreliability of workers and the difficulties of tasks, workers cannot always do the tasks correctly. That is, workers may be more confident in specific categories of tasks (e.g., biology, cars, electronic devices, and/or sports), but not others. For example, in Table I, one worker tags all 5 pictures (tasks), but with 3 wrong labels. Thus, in this case, it is rather challenging to guarantee the accuracy/quality of emergency recognition (task) answers, in the presence of such unreliable workers in the crowdsourcing system.

Furthermore, from Table I, all 5 tasks are completed by workers within 20 seconds, except for one task, which takes a worker 5 minutes to finish (because of its difficulty). Such a long latency is highly undesirable for the emergency response application, which needs the emergency recognition results to proceed with its next step. Therefore, with the existence of high-latency workers in the crowdsourcing system, it is also important, yet challenging, to achieve low latency of the task completion.

Worker          Task    Answer      Time Latency    Correctness
(1st worker)            Normal      8 s
                        Accident    9 s
                        Accident    12 s
(2nd worker)            Accident    15 s
                        Normal      5 min
                        Normal      10 s
                        Accident    9 s
                        Accident    14 s
(3rd worker)            Accident    8 s
                        Normal      11 s

TABLE I: Answers of Tasks from Workers.

Specifically, the FROG framework contains two important components, the task scheduler and notification modules. In the task scheduler module, our FROG framework aims to avoid the scenario where tasks are autonomously selected by unreliable workers (with high latency and low accuracy). Thus, we adopt the server-assigned-task (SAT) mode [23, 37], in which the server actively schedules tasks for workers, considering both accuracy and latency. In particular, we formalize a novel FROG task scheduling (FROG-TS) problem under the SAT mode, which finds "good" worker-and-task assignments that minimize the maximum latency over all tasks and maximize the accuracies (quality) of task results. We prove that the FROG-TS problem is NP-hard, by reducing the multiprocessor scheduling problem [19] to it. As a result, FROG-TS is not tractable. Therefore, we design two heuristic approaches, request-based and batch-based scheduling, to efficiently tackle the FROG-TS problem.

Note that existing works on reducing latency are usually designed for specific tasks (e.g., filtering or resolving entities) [32, 38], or increase prices over time to encourage workers to accept tasks [18]; thus, they cannot be directly used for general-purpose tasks under a budget constraint (i.e., the setting of our FROG framework). Some other works [20, 16] removed low-accuracy or high-latency workers, which may lead to idleness of workers and low throughput of the system. In contrast, our task scheduler module takes both factors, accuracy and latency, into account, and designs a worker-and-task assignment strategy with high accuracy, low latency, and high throughput under a budget constraint.

In existing crowdsourcing systems, workers can freely join or leave the system. However, when the system lacks active workers, there is no way to invite more offline workers to perform online tasks. To address this issue, the notification module in our FROG framework is designed to notify offline workers via invitation messages (e.g., on mobile phones). In order to avoid sending spam messages, we formalize an efficient worker notifying (EWN) problem, which only sends task invitations to those workers with high probabilities of accepting the tasks. To tackle the EWN problem, we present a novel smooth kernel density estimation approach to efficiently compute the probability that a worker accepts a task invitation.

To summarize, in this paper, we have made the following contributions.

  • We propose a new FROG framework for crowdsourcing, which consists of two important modules, the task scheduler and notification modules, in Section II.

  • We formalize and tackle a novel worker-and-task scheduling problem in crowdsourcing, namely FROG-TS, which assigns tasks to suitable workers, with high reliability and low latency in Section III.

  • We propose a smooth kernel density model to estimate the probabilities that workers can accept task invitations for the EWN problem in the notification module in Section IV.

  • We conduct extensive experiments to verify the effectiveness and efficiency of our proposed FROG framework on both real and synthetic data sets in Section V.

Section VI reviews previous works on crowdsourcing. Section VII concludes this paper.

II Problem Definition

II-A The FROG Framework

Figure 1 illustrates our fast and reliable crowdsourcing (FROG) framework, which consists of worker profile manager, public worker pool, notification module, task scheduler module, and quality controller.

Specifically, in our FROG framework, the worker profile manager keeps track of statistics for each worker in the system, including the response time (or in other words, the latency) and the accuracy of doing tasks in each category. These statistics are dynamically maintained, and can be used to guide the task scheduling process (in the task scheduler module).

Moreover, the public worker pool contains the information of online workers who are currently available for doing tasks. Different from the existing work [20] with an exclusive retainer pool, we use a shared public retainer pool, which shares workers among different tasks. This can improve the global efficiency of the platform, and benefit workers with more rewards by assigning them multiple tasks (rather than one exclusive task, as in the exclusive pool).

When the number of online workers in the public worker pool is small, the notification module will send messages to offline workers (e.g., via mobile devices), and invite them to join the platform. Since offline workers do not want to receive too many (spam) messages, in this paper, we will propose a novel smooth kernel density model to estimate the probabilities that offline workers will accept the invitations, especially when the number of historical samples is small. This way, we will only send task invitations to those offline workers with high probabilities of accepting the tasks.

Most importantly, in the FROG framework, the task scheduler module assigns tasks to suitable workers with the goals of reducing the latency and enhancing the accuracy of tasks. In this module, we formalize a novel FROG task scheduling (FROG-TS) problem, which finds good worker-and-task assignments to minimize the maximum task latency and to maximize the accuracies (quality) of task results. Due to the NP-hardness of this FROG-TS problem (proved in Section III-C), we design two heuristic approaches, the request-based and batch-based scheduling approaches.

Finally, the quality controller is in charge of the quality management during the entire process of the FROG framework. In particular, before workers are assigned to do tasks, we require each worker to register/subscribe to his/her expertise categories of tasks. To verify the correctness of subscriptions, the quality controller provides workers with qualification tests, which include several sample tasks (with ground truth already known). Then, the system will later assign them tasks in their qualified categories. Furthermore, after workers submit their answers of tasks to the system, the quality controller will check whether each task has received enough answers. If so, it will aggregate the answers of each task (e.g., via voting methods), and then return final results.

Fig. 1: An Illustration of the FROG Framework.

In this paper, we will focus on general functions of two important modules, the task scheduler and notification modules, in the FROG framework (as depicted in Figure 1), which will be formally defined in the next two subsections. In addition, we implement our framework and use WeChat [5] as its client to send tasks to workers and receive answers from workers. Note that other messaging apps such as WhatsApp [6] and Skype [4] can also be used as clients.

II-B The Task Scheduler Module

The task scheduler module focuses on finding a good worker-and-task assignment strategy with low latency (i.e., minimizing the maximum latency of tasks) and high reliability (i.e., satisfying the required quality levels of tasks).

Tasks and Workers. We first give the definitions of tasks and workers in the FROG framework. Specifically, since our framework is designed for general crowdsourcing platforms, we predefine a set, C, of task categories, where each task belongs to exactly one category in C. Here, each category can be the subject of tasks, such as cars, food, aerospace, or politics.

Definition 1.

(Tasks) Let T = {t_1, t_2, ..., t_n} be a set of tasks in the crowdsourcing platform, where each task t_i (1 ≤ i ≤ n) belongs to a task category, denoted by c_i ∈ C, and arrives at the system at the starting time s_i. Moreover, each task t_i is associated with a user-specified quality threshold q_i, which is the expected probability that the final result for task t_i is correct.

Assume that task t_i is accomplished at the completion time f_i. Then, the latency, l_i, of task t_i is given by l_i = f_i - s_i, where s_i is the starting time (defined in Definition 1). Intuitively, the smaller the latency is, the better the performance of the crowdsourcing platform is.

Definition 2.

(Workers) Let W be a set of workers. For tasks in a category c_k, each worker w_j ∈ W is associated with a category accuracy, α_j^{c_k}, for doing tasks in category c_k, and a category response time, r_j^{c_k}.

As given in Definition 2, the category accuracy α_j^{c_k} is the probability that worker w_j can correctly accomplish tasks in category c_k. Here, the response time r_j^{c_k} measures the length of the period from the timestamp that worker w_j receives a task in category c_k to the time point that he/she submits the task's answer to the server.

In order to tackle the intrinsic error rate (unreliability) of workers, many crowdsourcing systems use various voting methods (e.g., majority voting [12]) to report high-accuracy results. For the ease of presentation, in this paper, we assume that each task is a binary task with YES/NO choices. The case of tasks with more than two choices can be easily transformed to a sequence of binary tasks. Thus, our proposed techniques can be extended to tasks with more than two choices, and we would like to leave it as our future work.

In the literature, there are some existing voting methods for result aggregation in the crowdsourcing system, such as majority voting [12], weighted majority voting [16], half voting [30], and Bayesian voting [28]. In this paper, we use majority voting for the result aggregation, which has been well accepted in many crowdsourcing works [12]. Assuming that the number of workers answering a task is odd, if a majority of the workers (i.e., more than half of them) vote for the same answer (e.g., YES), we take this answer as the final result of the task.

Denote by W_i the set of workers that do task t_i, and by c_i the category that task t_i belongs to. Then, we have the expected accuracy, E(t_i), of task t_i as follows:

E(t_i) = Σ_{x = ⌊|W_i|/2⌋ + 1}^{|W_i|} Σ_{W' ⊆ W_i, |W'| = x} ( Π_{w_j ∈ W'} α_j^{c_i} · Π_{w_j ∈ W_i \ W'} (1 - α_j^{c_i}) )    (1)

where W' is a subset of W_i with exactly x elements.

Specifically, the expected task accuracy, E(t_i), calculated with Eq. (1) is the probability that more than half of the workers in W_i answer correctly. In the case of voting with multiple choices (other than 2 choices, like YES/NO), please refer to Appendix A for the equation of the expected accuracy of task t_i (with majority voting or other voting methods).
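To make Eq. (1) concrete, the following is a minimal brute-force sketch (in Python) that enumerates all majority subsets of the assigned workers; the function name and the list-of-accuracies input are our own illustration, not part of the FROG implementation.

from itertools import combinations

def expected_accuracy(accs):
    # Probability that a strict majority of independent workers answer
    # correctly, given each worker's category accuracy (cf. Eq. (1)).
    n = len(accs)
    total = 0.0
    for x in range(n // 2 + 1, n + 1):             # majority sizes
        for correct in combinations(range(n), x):  # subsets W' with |W'| = x
            p = 1.0
            for j in range(n):
                p *= accs[j] if j in correct else (1.0 - accs[j])
            total += p
    return total

# e.g., three workers with category accuracies 0.9, 0.8, 0.7
print(expected_accuracy([0.9, 0.8, 0.7]))  # ~0.902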

Symbol          Description
C               a set of task categories
T               a set of tasks
W               a set of workers
c_i             the category of task t_i
q_i             the user-specified quality threshold of task t_i
s_i             the start time of task t_i
f_i             the finish time of task t_i
α_j^{c_k}       the category accuracy of worker w_j on tasks in category c_k
r_j^{c_k}       the response time of worker w_j on tasks in category c_k
TABLE II: Symbols and descriptions.

Table II summarizes the commonly used symbols.

The FROG Task Scheduling Problem. In the task scheduler module, one important problem is how to route tasks to workers in the retainer pool with guaranteed low latency and high accuracy. Next, we formally define the problem of FROG Task Scheduling (FROG-TS) below.

Definition 3.

(FROG Task Scheduling Problem) Given a set T of crowdsourcing tasks and a set W of workers, the problem of FROG task scheduling (FROG-TS) is to assign workers w_j ∈ W to tasks t_i ∈ T, such that:

  1. the expected accuracy, E(t_i) (given in Eq. (1)), of each task t_i is not lower than its required quality threshold q_i, and

  2. the maximum latency, max_{t_i ∈ T} l_i, of tasks in T is minimized,

where l_i = f_i - s_i is the latency of task t_i, that is, the duration from the time s_i when task t_i is posted in the system to the time f_i when task t_i is completed.

We will later prove that the FROG-TS problem is NP-hard (in Section III-C), and propose two effective approaches, request-based and batch-based scheduling, to solve this problem in Section III-D.

II-C The Notification Module

The notification module is in charge of sending notifications to those offline workers with high probabilities of being available and accepting the invitations (when the retainer pool needs more workers). In general, some workers may join the retainer pool autonomously, but this cannot guarantee that the retainer pool will be filled quickly. Thus, the notification module will invite more offline workers to improve the filling speed of the retainer pool.

Specifically, in our FROG framework, the server side maintains a public worker pool to support all tasks from the requesters. When autonomous workers join the system at a low rate, the system needs to invite more workers to fill the worker pool and guarantee a high task processing speed for the platform. One straightforward method is to notify all the offline workers. However, this broadcast method may disturb workers when they are busy with other jobs (i.e., the probabilities that they accept invitations may be low). For example, assume that the system has 10,000 registered workers, and only 100 workers may potentially accept the invitations. With the broadcast method, all 10,000 workers will receive the notification message, which is inefficient and may potentially damage the user experience. A better strategy is to send notifications only to those workers who are very likely to join the worker pool. Moreover, we want to invite workers with high accuracy and low latency. Therefore, we formalize this problem as the efficient worker notifying (EWN) problem.

Definition 4.

(Efficient Worker Notifying Problem) Given a timestamp t, a set W_o of offline workers, the historical online records of each worker w_j ∈ W_o, and the number, N, of workers that we need to recruit for the public worker pool, the problem of efficient worker notifying (EWN) is to select a subset W_s ⊆ W_o of workers with high accuracies and low latencies to send invitation messages, such that:

  1. the expected number, Σ_{w_j ∈ W_s} p_j(t), of workers who accept the invitations is greater than N, and

  2. the number of workers in W_s, to whom we send notifications, is minimized,

where p_j(t) is the probability that worker w_j accepts the invitation and joins the worker pool at timestamp t.

In Definition 4, it is not trivial to estimate the probability, p_j(t), that a worker prefers to join the platform at a given timestamp, especially when we lack his/her historical records. However, if we notify too many workers, it will disturb them, and in the worst case drive them away from our platform forever. To solve the EWN problem, we propose an effective model to efficiently do the estimation in Section IV, with which we select workers with high acceptance probabilities, p_j(t), to send invitation messages, such that the worker pool can be filled quickly. Moreover, since we want to invite workers with high acceptance probabilities, low response times, and high accuracies, we define the worker dominance below to select good worker candidates.

Definition 5.

(Worker Dominance) Given two worker candidates w_x and w_y, we say worker w_x dominates worker w_y, if it holds that: (1) p_x(t) ≥ p_y(t), (2) the average accuracy of w_x on his/her subscribed categories is not lower than that of w_y, and (3) the average response time of w_x on his/her subscribed categories is not longer than that of w_y, where p_x(t) is the probability that worker w_x is available (and accepts the invitation) at timestamp t.

Then, our notification module will invite those offline workers with high ranking scores, where the ranking score of a worker is defined as the number of workers he/she dominates [41]. We will discuss the details of the ranking later in Section IV.
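As an illustration of Definition 5 and the ranking score, the following is a minimal Python sketch; the dictionary keys (prob, acc, time) and the strict-inequality tie-breaking rule are our own assumptions for illustration only.

def dominates(a, b):
    # Worker a dominates worker b (cf. Definition 5): no worse acceptance
    # probability and accuracy, no longer response time, and strictly better
    # in at least one dimension (the strictness rule is an assumption).
    no_worse = (a["prob"] >= b["prob"] and a["acc"] >= b["acc"]
                and a["time"] <= b["time"])
    strictly_better = (a["prob"] > b["prob"] or a["acc"] > b["acc"]
                       or a["time"] < b["time"])
    return no_worse and strictly_better

def ranking_scores(workers):
    # Ranking score of each worker: the number of other workers it dominates.
    return [sum(dominates(w, v) for v in workers if v is not w)
            for w in workers]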

III The Task Scheduler Module

The task scheduler module actively routes tasks to workers, such that tasks can be completed with small latencies and the quality requirement of each task is satisfied. In order to improve the throughput of the FROG platform, in this section, we will estimate the difficulties of tasks and the response times (and accuracies) of workers, based on recent answering records. In particular, we will first present effective approaches to estimate worker and task profiles, and then tackle the FROG task scheduling (FROG-TS) problem by designing two efficient heuristic-based approaches (due to its NP-hardness).

III-A Worker Profile Estimation

In this subsection, we first present the methods to estimate the category accuracy and the response time of a worker, which can be used for finding good worker-and-task assignments in the FROG-TS problem.

The Estimation of the Category Accuracy. In the FROG framework, before each worker w_j joins the system, he/she needs to subscribe to the task categories he/she would like to contribute to. Then, for each subscribed category c_k, worker w_j completes a set of qualification testing tasks by returning his/her answers. Here, the system has the ground truth of these testing tasks.

Note that, at the beginning, we do not know the difficulties of the qualification testing tasks. Therefore, we initially treat all testing tasks as having equal difficulty (i.e., 1). Next, we estimate the category accuracy, α_j^{c_k}, of worker w_j on category c_k as follows:

α_j^{c_k} = (1 / n_t) · Σ_{i=1}^{n_t} 1{a_{ij} = g_i}    (2)

where 1{·} is an indicator function (i.e., 1{a_{ij} = g_i} = 1 if worker w_j's answer a_{ij} equals the ground truth g_i of the i-th testing task, and 1{a_{ij} = g_i} = 0 otherwise), and n_t is the number of qualification testing tasks.

Intuitively, Eq. (2) calculates the percentage of testing tasks that are correctly answered by worker w_j (among all testing tasks).

In practice, the difficulties of testing tasks can be different. Intuitively, if more workers provide wrong answers for a task, then this task is more difficult; similarly, if a high-accuracy worker fails to answer a task, then this task is more likely to be difficult.

Based on the intuitions above, we can estimate the difficulty of a testing task as follows. Assume that we have a set, W_q, of workers (with their current category accuracies) who have passed the qualification test. Then, we give the definition of the difficulty, d_i, of a testing task t_i below:

d_i = ( Σ_{w_j ∈ W_q} α_j^{c_k} · 1{a_{ij} ≠ g_i} ) / ( Σ_{w_j ∈ W_q} α_j^{c_k} )    (3)

where 1{·} is an indicator function as above, and W_q is the set of workers who passed the qualification test.

In Eq. (3), the numerator computes the accuracy-weighted count of wrong answers given by the workers in W_q for the testing task t_i. The denominator is used to normalize the weighted count, such that the difficulty d_i of task t_i is within the interval [0, 1]. The higher d_i is, the more difficult t_i is.

In turn, we can treat the difficulty d_i of each testing task t_i as a weight factor, and rewrite the category accuracy, α_j^{c_k}, of worker w_j on category c_k in Eq. (2) as:

α_j^{c_k} = ( Σ_{i=1}^{n_t} d_i · 1{a_{ij} = g_i} ) / ( Σ_{i=1}^{n_t} d_i )    (4)
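A minimal Python sketch of this difficulty-weighted estimation follows; it assumes, per the description above, that wrong answers are weighted by the answering workers' current accuracies (Eq. (3)) and that correct answers are credited in proportion to task difficulty (Eq. (4)). Function and variable names are illustrative only.

def testing_task_difficulty(wrong_flags, worker_accs):
    # Eq. (3): accuracy-weighted fraction of wrong answers on one testing task.
    # wrong_flags[j] is 1 if worker j answered this task wrongly, else 0.
    num = sum(a * w for a, w in zip(worker_accs, wrong_flags))
    den = sum(worker_accs)
    return num / den if den > 0 else 0.0

def weighted_category_accuracy(correct_flags, difficulties):
    # Eq. (4): difficulty-weighted category accuracy of one worker over the
    # testing tasks; harder tasks contribute more credit when answered correctly.
    num = sum(d * c for d, c in zip(difficulties, correct_flags))
    den = sum(difficulties)
    return num / den if den > 0 else 0.0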

The Update of the Category Accuracy. The category accuracy of a worker may vary over time. For example, on one hand, the worker may achieve more and more accurate results, as he/she is more experienced in doing specific tasks. On the other hand, the worker may become less accurate, since he/she is tired after a long working day.

After worker w_j passes the qualification test of task category c_k, he/she will be assigned tasks in that category, which generates historical data on his/her performance in conducting a set, T_r, of real tasks in category c_k. We can update the category accuracy of a worker based on such historical data.

Assume that the ground truth of the real tasks in T_r is known (e.g., from the final aggregated results), and that the answers provided by worker w_j are recorded. Then, we can update the category accuracy of worker w_j on category c_k as follows:

α_j^{c_k} = ( λ · α_j^{c_k,test} + Σ_{t_i ∈ T_r} 1{a_{ij} = g_i} ) / ( λ + |T_r| )    (5)

where α_j^{c_k,test} is the category accuracy estimated from the qualification test (Eq. (4)), T_r is the set of real tasks conducted by worker w_j in category c_k, and λ is a balance parameter to combine the performance of each worker on testing tasks and real tasks.

The parameter λ is set by the system. The larger λ is, the more accurate the initial accuracy estimation is. However, a larger λ may cost more time and budget before a new worker can start working on real tasks. In practice, it is acceptable to set λ to 10.
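Under the interpretation above (λ acting as the weight, or equivalent sample size, of the qualification test, which is our reading of Eq. (5)), a minimal update sketch in Python is:

def update_category_accuracy(test_acc, real_correct_flags, lam=10):
    # Blend the qualification-test accuracy with the accuracy on real tasks;
    # lam is the balance parameter (treated here as an equivalent sample size,
    # which is an assumption about Eq. (5)).
    n_real = len(real_correct_flags)
    return (lam * test_acc + sum(real_correct_flags)) / (lam + n_real)

# e.g., test accuracy 0.8, 9 of 10 real tasks answered correctly
print(update_category_accuracy(0.8, [1] * 9 + [0]))  # 0.85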

The Estimation of the Category Response Time. In reality, since different workers may have different abilities, skills, and speeds, their response times could be different, where the response time is defined as the length of the period from the timestamp that the task is posted to the time point that the worker submits the answer of the task to the server.

Furthermore, the response time of each worker may change temporally (i.e., with temporal correlations). To estimate the response time, we utilize the latest k response records of worker w_j for answering tasks in category c_k, and apply the least-squares method [26] to predict the response time, r_j^{c_k}, of worker w_j at a future timestamp. The least-squares method minimizes the sum of the squared residuals, where the residuals are the differences between the recent historical values and the fitted values provided by the model. We use the fitted line to estimate the category response time at a future timestamp.

The value of k may affect the sensitivity and stability of the estimation of the category response time. A small k makes the estimation sensitive to recent changes in the workers' response times; however, the estimated value may vary a lot. A large k makes the estimation stable, but insensitive. In practice, we can set k to 3 to 5.
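A minimal sketch of this prediction, using NumPy's least-squares line fit, is shown below; the record format and example values are our own illustration.

import numpy as np

def predict_response_time(timestamps, response_times, future_t):
    # Fit a line to the worker's k most recent (timestamp, response time)
    # records and extrapolate it to a future timestamp.
    slope, intercept = np.polyfit(timestamps, response_times, 1)
    return slope * future_t + intercept

# e.g., the last 4 records of a worker in one category (times in seconds)
print(predict_response_time([1, 2, 3, 4], [12.0, 11.5, 11.0, 10.8], 5))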

III-B Task Profile Estimation

In this subsection, we discuss the task difficulty, which may affect the latency of accomplishing tasks.

The Task Difficulty. Some tasks in the crowdsourcing system are in fact more difficult than others. In AMT [1], autonomous workers pick tasks by themselves. As a consequence, difficult tasks may be left with no workers to conduct them. In contrast, in our FROG platform, the server can deliberately assign/push difficult tasks to reliable and low-latency workers, so as to achieve the required task quality and reduce time delays.

For a given task t_i in category c_k with m possible answer choices (m = 2 in the case of YES/NO tasks), assume that a number of workers have accepted this task. Since some workers may skip the task (without completing it), we denote by n_s the number of workers who skipped task t_i, and by A_i the set of received answers, where |A_i| = n_a. Then, we can estimate the difficulty, d_i, of task t_i as follows:

d_i = n_s / (n_s + n_a) + ( n_a / (n_s + n_a) ) · ( H(t_i) / H_max(t_i) )    (6)

Here, in Eq. (6), we have:

H(t_i) = - Σ_{l=1}^{m} p_l · ln p_l,  with  p_l = ( Σ_{w_j ∈ W_i^l} α_j^{c_k} + ε ) / ( Σ_{l'=1}^{m} ( Σ_{w_j ∈ W_i^{l'}} α_j^{c_k} + ε ) )    (7)
H_max(t_i) = ln m    (8)

where W_i^l is the set of workers who select the l-th possible choice of task t_i, ε is a small smoothing value, and n_a is the number of received answers.

Discussions on the Task Difficulty. The task difficulty in Eq. (6) is estimated based on workers' performance in doing tasks. Workers who skipped the task are treated as considering it maximally difficult (i.e., with difficulty equal to 1), whereas for those who did the task, we use the normalized entropy (or the diversity) of their answers to measure the task difficulty.

Specifically, the first term (i.e., n_s / (n_s + n_a)) in Eq. (6) indicates the percentage of workers who skipped task t_i, or in other words considered its difficulty to be 1. Intuitively, when a task is skipped by a larger percentage of workers, it is more difficult.

The second term in Eq. (6) measures the task difficulty based on the answers from the remaining workers (who did the task). Our observation is as follows. When the answers of workers are spread more evenly (i.e., more diversely), it is harder to obtain a final convincing answer for the task with high confidence. In this paper, to measure the diversity of answers from workers, we use the entropy [35], H(t_i) (as given in Eq. (7)), of the answers, weighted with respect to the accuracies of the workers. Intuitively, when a task is difficult to complete, workers will get confused and eventually select diverse answers, which leads to a high entropy value. Therefore, a larger entropy implies a higher task difficulty. To avoid the divide-by-zero error when no answers are received, we add a small value ε to each possible choice of task t_i in Eq. (7). Moreover, we also normalize this entropy in Eq. (6), that is, we divide it by the maximum possible entropy value, H_max(t_i) (as given in Eq. (8)).
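A minimal Python sketch of Eqs. (6)-(8) follows; grouping the received answers by choice and weighting each choice by the accuracies of the workers who selected it reflects our reading of Eq. (7), and the names and example are illustrative only.

import math

def task_difficulty(n_skipped, answer_groups, eps=1e-6):
    # answer_groups[l] holds the category accuracies of the workers who chose
    # the l-th option; skipped workers count as judging the task maximally hard.
    n_answers = sum(len(g) for g in answer_groups)
    n_total = n_skipped + n_answers
    if n_total == 0:
        return 0.0
    weights = [sum(g) + eps for g in answer_groups]   # accuracy mass per choice
    total_w = sum(weights)
    probs = [w / total_w for w in weights]
    entropy = -sum(p * math.log(p) for p in probs)    # cf. Eq. (7)
    max_entropy = math.log(len(answer_groups))        # cf. Eq. (8)
    normalized = entropy / max_entropy if max_entropy > 0 else 0.0
    return n_skipped / n_total + (n_answers / n_total) * normalized  # cf. Eq. (6)

# e.g., 1 worker skipped; accuracies of workers choosing "Normal" vs "Accident"
print(task_difficulty(1, [[0.9, 0.8], [0.7]]))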

III-C Hardness of the FROG-TS Problem

In our FROG-TS problem (as given in Definition 3), since we need to guarantee that the expected accuracy of each task t_i is not lower than its specified quality threshold q_i, we assume that each task needs to be answered by multiple different workers, such that the final result (via majority voting) is accurate enough. With |T| tasks and |W| workers, in the worst case, there is an exponential number of possible task-and-worker assignment strategies, which incurs a prohibitively high time complexity. Below, we prove that the FROG-TS problem is NP-hard, by reducing the multiprocessor scheduling problem (MSP) [19] to it.

Theorem III.1.

(Hardness of the FROG-TS Problem) The problem of FROG Task Scheduling (FROG-TS) is NP-hard.

Proof.

Please refer to Appendix B. ∎

The FROG-TS problem focuses on completing multiple tasks that satisfy the required quality thresholds, which requires each task to be answered by multiple workers. Thus, we cannot directly use existing approximation algorithms for the MSP problem (or its variants) to solve the FROG-TS problem. Due to the NP-hardness of our FROG-TS problem, in the next subsection, we will introduce an adaptive task routing approach with two worker-and-task scheduling algorithms, the request-based and batch-based scheduling approaches, to efficiently obtain FROG-TS answers.

III-D Adaptive Scheduling Approaches

In this subsection, we first estimate the delay probability of each task. The higher the delay probability is, the more likely the task will be delayed. Then we propose two adaptive scheduling strategies, request-based scheduling and batch-based scheduling, to iteratively assign workers to the task with the highest delay probability such that the maximum processing time of tasks is minimized.

III-D1 The Delay Probability

As mentioned in the second criterion of the FROG-TS problem (i.e., in Definition 3), we want to minimize the maximum latency of tasks in T. In order to achieve this goal, we will first calculate the delay probability, p_i, of each task t_i in T, and then assign workers to the tasks with high delay probabilities first, such that the maximum latency of tasks can be greedily minimized.

We define the delay probability of task t_i based on the logistic function [7] as follows:

p_i = 2 / ( 1 + e^{-θ · d_i · τ_i} ) - 1    (9)

where τ_i is the time lapse of task t_i (i.e., the time elapsed since it was posted), d_i is the difficulty of task t_i given by Eq. (6), and e is the natural logarithm base (i.e., e ≈ 2.718). The logistic function is widely used in a range of fields, including artificial neural networks, linguistics, and statistics. Here, we use the logistic function to capture the fact that the delay probability first increases quickly when the difficulty of a task increases a little, and then the increase slows down. Note that other methods can also be used to estimate the delay probability of tasks.

Since we use the entropy-based equation to measure the difficulties of tasks, the difficulty of a task is in the range [0, 1]. Further, θ indicates the steepness of the logistic function. Then, the delay probability is in the range [0, 1). For example, when the task difficulty is 0, the delay probability is also 0, which captures the real situation.
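The following Python sketch reflects our reading of Eq. (9), a shifted logistic curve that is 0 at zero difficulty and saturates below 1; the exact functional form and the parameter names are assumptions.

import math

def delay_probability(time_lapse, difficulty, steepness=1.0):
    # 0 when the difficulty is 0, growing quickly at first and saturating
    # below 1 as difficulty * time_lapse grows (cf. Eq. (9)).
    x = steepness * difficulty * time_lapse
    return 2.0 / (1.0 + math.exp(-x)) - 1.0

print(delay_probability(time_lapse=30.0, difficulty=0.5))  # close to 1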

III-D2 Request-based Scheduling (RBS) Approach

With the estimated delay probabilities of tasks, we propose a request-based scheduling (RBS) approach. In this approach, when a worker becomes available, he/she sends a request for the next task to the server. Then, the server calculates the delay probabilities of the on-going tasks on the platform, and greedily returns the task with the highest delay probability to the worker.

Input: A worker w requesting his/her next task and a set T of uncompleted tasks
Output: The returned task t*
1   foreach task t_i in T do
2       calculate the delay probability p_i of t_i with Eq. (9);
3   select the task t* in T with the highest delay probability;
4   if the expected accuracy of t* (Eq. (1)) reaches its quality threshold q* after assigning w then
5       remove t* from T;
6   return t*;
Algorithm 1 GreedyRequest(w, T)

The pseudo code of our request-based scheduling approach, namely GreedyRequest, is shown in Algorithm 1. It first calculates the delay probability of each uncompleted task in T (lines 1-2). Then, it selects a suitable task t* with the highest delay probability (line 3). If the expected accuracy of task t* (given in Eq. (1)) reaches its quality threshold q* after this assignment, we remove task t* from T (lines 4-5). Finally, we return/assign task t* to worker w (line 6).

The Time Complexity of RBS. We next analyze the time complexity of the request-based scheduling approach, GreedyRequest, in Algorithm 1. For each task t_i, computing its delay probability with Eq. (9) takes constant time; thus, computing the delay probabilities of all uncompleted tasks takes O(|T|) time (lines 1-2). Next, the cost of selecting the task with the highest delay probability is also O(|T|) (line 3). The cost of checking the completeness of task t* and removing it from T only involves the workers already assigned to t*. As a result, the time complexity of our request-based scheduling approach is dominated by the O(|T|) scan over the uncompleted tasks.

III-D3 Batch-based Scheduling (BBS) Approach

Although the RBS approach can easily and quickly respond to each worker's request, it in fact has little control over workers in this request-and-answer style. Next, we propose an orthogonal batch-based scheduling (BBS) approach, which assigns each worker a list of suitable tasks in a batch, where the length of the list is determined by his/her response speed.

The intuition of our BBS approach is as follows. If we can assign high-accuracy workers to difficult and urgent tasks, and low-accuracy workers to easy and less urgent tasks, then the worker labor will be used more efficiently and the throughput of the platform will increase.

Specifically, in each round, our BBS approach iteratively picks the task with the highest delay probability (among all the remaining tasks in the system), and then greedily selects a minimum set of workers to complete this task. Algorithm 2 shows the pseudo code of the BBS algorithm, namely GreedyBatch. In particular, since no worker-and-task pair is assigned at the beginning, we initialize the assignment set A as an empty set (line 1). Then, we calculate the delay probability of each unfinished task (given in Eq. (9)) (lines 2-3). Thereafter, we iteratively assign workers to the next task with the highest delay probability (lines 4-6). Next, we invoke Algorithm MinWorkerSetSelection, which selects a minimum set, ΔW_i, of workers who satisfy the required accuracy threshold of task t_i (line 7). If ΔW_i is not empty, we insert the task-and-worker pairs, ⟨t_i, w_j⟩, into set A (lines 8-10). If a worker cannot be assigned more tasks, we remove him/her from W (lines 11-12). Here, we decide whether a worker can be assigned more tasks according to his/her response times on categories, his/her assigned tasks, and the round interval of the BBS approach. That is, if the summation of the response times of the assigned tasks is larger than the round interval, then the worker cannot be assigned more tasks; otherwise, we can still assign more tasks to him/her.

Input: A set, T, of unfinished tasks and a set, W, of workers
Output: Assignment A
1   A ← ∅;
2   foreach task t_i in T do
3       calculate the delay probability p_i of t_i with Eq. (9);
4   while T ≠ ∅ and W ≠ ∅ do
5       select the task t_i with the highest delay probability;
6       remove t_i from T;
7       ΔW_i ← MinWorkerSetSelection(W, t_i, W_i);
8       if ΔW_i ≠ ∅ then
9           foreach w_j ∈ ΔW_i do
10              insert ⟨t_i, w_j⟩ into A;
11              if w_j cannot be assigned with more tasks then
12                  remove w_j from W;
13  return A;
Algorithm 2 GreedyBatch(T, W)

Minimum Worker Set Selection. In line 7 of Algorithm 2 above, we invoked the MinWorkerSetSelection algorithm, which selects a minimum set of workers satisfying the quality threshold constraint of task t_i. We discuss this algorithm in detail, and prove its correctness, below.

Before we provide the algorithm, we first present one property of the expected accuracy of a task.

Lemma III.1.

Given a set of workers, W_i, assigned to task t_i in category c_i, the expected accuracy, E(t_i), of task t_i (as defined in Eq. (1)) can be computed incrementally over the workers in W_i.

Proof.

Please refer to Appendix C. ∎

We can derive two corollaries below.

Corollary III.1.

For a task t_i in category c_i with a set of assigned workers W_i, if the category accuracy of any worker in W_i increases, the expected accuracy of task t_i will increase (until reaching 1).

Proof.

Please refer to Appendix D. ∎

Corollary III.2.

For a task t_i in category c_i with a set of assigned workers W_i, if we assign a new worker to task t_i, the expected accuracy of task t_i will increase.

Proof.

Please refer to Appendix E. ∎

With Corollaries III.1 and III.2, to increase the expected accuracy of a task, we can use workers with higher category accuracies or assign more workers to the task. When the required expected accuracy of a task is given, we can finish the task with a smaller number of high-accuracy workers. To accomplish as many tasks as possible, we aim to greedily pick the smallest number of workers to finish each task iteratively.

Input: A set, W, of available workers and a task t_i in category c_i with a set, W_i, of already assigned workers
Output: A minimum set, ΔW_i, of newly assigned workers for task t_i
1   ΔW_i ← ∅;
2   while E(t_i) < q_i and W ≠ ∅ do
3       choose a new worker w_j ∈ W with the highest category accuracy α_j^{c_i}; W.remove(w_j);
4       ΔW_i.add(w_j); W_i.add(w_j);
5   if E(t_i) ≥ q_i then
6       return ΔW_i;
7   else
8       return ∅;
Algorithm 3 MinWorkerSetSelection(W, t_i, W_i)

Algorithm 3 shows the procedure of MinWorkerSetSelection, which selects a minimum set, ΔW_i, of workers to conduct task t_i. In each iteration, we greedily select a worker (who has not been assigned to task t_i) with the highest accuracy in the category of task t_i, and assign him/her to task t_i (lines 2-4). If such a minimum worker set exists, we return the newly assigned worker set; otherwise, we return an empty set (lines 5-8). The correctness of Algorithm 3 is shown below.

Lemma III.2.

The number of workers in the set returned by Algorithm 3 is minimum, if such a set exists.

Proof.

Please refer to Appendix F. ∎
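For concreteness, here is a minimal Python sketch of the greedy selection in Algorithm 3; it reuses the expected_accuracy helper sketched in Section II-B, and the worker representation (a dict with an "acc" field) is our own illustration.

def min_worker_set_selection(available, assigned_accs, threshold):
    # Greedily add the most accurate remaining worker until the expected
    # accuracy of the task (Eq. (1)) reaches the quality threshold.
    pool = sorted(available, key=lambda w: w["acc"], reverse=True)
    newly_assigned = []
    accs = list(assigned_accs)
    while pool and expected_accuracy(accs) < threshold:
        w = pool.pop(0)
        newly_assigned.append(w)
        accs.append(w["acc"])
    return newly_assigned if expected_accuracy(accs) >= threshold else []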

The Time Complexity of BBS. To analyze the time complexity of the batch-based scheduling (BBS) approach, GreedyBatch, shown in Algorithm 2, we assume that each task needs to be answered by at most k workers. The time complexity of calculating the delay probabilities of all tasks is O(|T|) (lines 2-3). Since each iteration handles one task, there are at most |T| iterations (lines 4-12). In each iteration, selecting the task with the highest delay probability requires O(|T|) cost (line 5). The time complexity of the MinWorkerSetSelection procedure is O(k·|W|) (line 7). The time complexity of assigning the selected workers to the task is O(k) (lines 8-12). Thus, the overall time complexity of the BBS approach is O(|T|·(|T| + k·|W|)).

IV The Notification Module

In this section, we introduce the detailed model of the notification module in our FROG framework (as mentioned in Section II), which is in charge of sending invitation notifications to offline workers in order to maintain enough online workers doing tasks. Since it is not a good idea to broadcast to all offline workers, our notification module only sends notifications to those workers with high probabilities of accepting invitations.

IV-A Kernel Density Estimation for the Workers' Availability

In this subsection, we model the availability of (offline) workers from their historical records. The intuition is that, for each worker, the pattern of availability on each day is relatively similar. For example, a worker may have spare time to do tasks when he/she is on the bus to school (or the company) at about 7 am every morning. Thus, we can use their historical data about the timestamps at which they conducted tasks.

However, the number of historical records (i.e., the sample size) for each worker might be small. In order to accurately estimate the probability that a worker is available at any given timestamp, we use a non-parametric approach, called kernel density estimation (KDE) [33], based on random samples (i.e., historical timestamps at which the worker was available).

Specifically, for a worker w_j, let S_j = {o_1, o_2, ..., o_{n_j}} be the set of active records indicating that worker w_j did some tasks, where event o_x (1 ≤ x ≤ n_j) occurs at timestamp t_x. Then, we can use the following KDE estimator to compute the probability that worker w_j is available at timestamp t:

Pr(e_j(t)) = (1 / n_j) · Σ_{o_x ∈ S_j} (1 / h) · K( (t - t_x) / h )    (10)

where e_j(t) is the event that worker w_j is available and will accept the invitation at a given timestamp t, K(·) is a kernel function (here, we use the Gaussian kernel function K(u) = (1/√(2π)) · e^{-u²/2}), and h is a scalar bandwidth parameter for all events in S_j. The bandwidth of the kernel is a free parameter and exhibits a strong influence on the estimation. For simplicity, we set the bandwidth h following a rule of thumb [36] as follows:

h = 1.06 · σ · n_j^{-1/5}    (11)

where σ is the standard deviation of the samples and n_j is the sample size. The rule works well when the density is close to normal, which is however not true for estimating workers' availability probabilities at a given timestamp t. However, adapting the kernel bandwidth to each data sample may overcome this issue [11].

Inspired by this idea, for each event o_x we select its k nearest neighbors (here, we measure neighbors by time instead of distance), and calculate the adaptive bandwidth h_x of event o_x with these k samples using Eq. (11), where k is set to a fixed ratio of the sample size n_j. Afterwards, we can define the adaptive-bandwidth KDE as follows:

Pr(e_j(t)) = (1 / n_j) · Σ_{o_x ∈ S_j} (1 / h_x) · K( (t - t_x) / h_x )    (12)
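A minimal Python sketch of Eqs. (10)-(12) follows; the helper names, the Silverman-style constant, and the handling of degenerate bandwidths are our own illustration.

import math

def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def rule_of_thumb_bandwidth(samples):
    # cf. Eq. (11): bandwidth proportional to the sample standard deviation.
    n = len(samples)
    mean = sum(samples) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in samples) / n)
    return 1.06 * std * n ** (-0.2)

def adaptive_kde(t, events, k):
    # cf. Eq. (12): each historical event gets its own bandwidth, computed
    # from its k nearest events in time via Eq. (11).
    density = 0.0
    for e in events:
        neighbors = sorted(events, key=lambda x: abs(x - e))[:k]
        h = rule_of_thumb_bandwidth(neighbors)
        if h == 0:
            continue  # skip degenerate bandwidths (identical timestamps)
        density += gaussian_kernel((t - e) / h) / h
    return density / len(events)

# e.g., timestamps (hours of day) at which a worker was previously active
print(adaptive_kde(7.5, [7.0, 7.2, 7.4, 8.0, 18.5, 19.0], k=3))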

IV-B Smooth Estimator

Up to now, we have discussed the adaptive kernel density approach to estimate the probability that a worker is available, based on his/her historical records (samples). However, some workers may have just registered or may rarely accomplish tasks, such that their historical events are not available or not sufficient to make accurate estimations. This is the "cold-start" problem that often arises in recommendation systems [34].

Inspired by techniques [34] used to solve such a cold-start problem in recommendation systems and the influence among friends [14] (i.e., friends tend to have similar behavior patterns, such as the online time periods), we propose a smooth KDE model (SKDE), which combines the individual’s kernel density estimator with related scale models. That is, for each worker, we can use historical data of his/her friends to supplement/predict his/her behaviors.

Here, our FROG platform is assumed to have the access to the friendship network of each worker, according to his/her social networks (such as Facebook, Twitter, and WeChat). In our experiments of this paper, our FROG platform used data from the WeChat network.

Specifically, we define a smooth kernel density estimation model as follows:

Pr(e_j(t)) = Σ_{i=1}^{K} ω_i · f_i(t | D_i)    (13)

where the ω_i are non-negative smoothing factors with the property Σ_{i=1}^{K} ω_i = 1, D is the entire set of historical events of all the workers, and f_i(t | D_i) is the i-th scaling density estimator calculated on the subset of events D_i ⊆ D.

For a smooth KDE model with K (K ≥ 2) scaling density estimators, the first scaling density estimator can be the basic individual kernel density estimator (with D_1 being worker w_j's own records), and the K-th scaling density estimator can be the entire population density estimator (with D_K = D). Moreover, since our FROG platform can obtain the friendship network of each worker (e.g., from Facebook, Twitter, or WeChat) after he/she registers with a social media account, we can find each worker's k-step friends. This way, for the intermediate scaling density estimators, we can use different friendship scales, such as the records of the 1-step friends, 2-step friends, and so on, of worker w_j. According to the famous six degrees of separation theory [8], k is not larger than 6. However, in practice, we can only use 1-step or 2-step friends, as the intermediate scaling density estimators may involve too many workers when k is too large. Alternatively, other relationships can also be used to smooth the KDE model, such as the location information of workers. One possible variant is to classify the workers based on their locations, as workers in close locations may work or study together, such that their time schedules may be similar to each other.

To train the SKDE model, we need to set proper values for the smoothing factors ω_i. We use the latest event records as the validation data V, and the other historical records as the training data. Specifically, for each event o ∈ V, we have the estimated probability as follows:

Pr(o) = Σ_{i=1}^{K} ω_i · f_i(o | D_i)

where K is the number of scaling density estimators. Then, to tune the smoothing factors, we use Maximum Likelihood Estimation (MLE) with the log-likelihood as follows:

L(ω_1, ..., ω_K) = Σ_{o ∈ V} ln ( Σ_{i=1}^{K} ω_i · f_i(o | D_i) )    (14)

However, Eq. (14) is not trivial to maximize directly; thus, we use the EM algorithm to compute an approximate result.

We initialize the smoothing factors as ω_i = 1/K for 1 ≤ i ≤ K. Next, we repeat the Expectation step and the Maximization step until the smoothing factors converge.

Expectation Step. We introduce a latent variable z, indicating which scaling density estimator generates each validation event. Then, for each event o ∈ V, we can estimate the posterior probability of z as follows:

P(z = i | o) = ( ω_i · f_i(o | D_i) ) / ( Σ_{i'=1}^{K} ω_{i'} · f_{i'}(o | D_{i'}) )

where each density f_i(o | D_i) is calculated with Eq. (12).

Maximization Step. Based on the expectation of the latent variable z, we can calculate the next values of the smoothing factors with MLE as follows:

ω_i = (1 / |V|) · Σ_{o ∈ V} P(z = i | o)

where P(z = i | o) is computed in the Expectation step, with the densities f_i(o | D_i) calculated with Eq. (12).
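Treating the smoothing factors as mixture weights (our reading of Eqs. (13)-(14)), a minimal EM sketch in Python is as follows; the fixed iteration count and the function signature are illustrative assumptions.

def fit_smoothing_factors(estimators, validation_events, n_iters=50):
    # estimators[i] is a callable t -> density of the i-th scaling estimator;
    # the returned weights approximately maximize the log-likelihood in Eq. (14).
    m = len(estimators)
    omega = [1.0 / m] * m
    for _ in range(n_iters):
        # E-step: posterior probability that each event came from estimator i
        resp = []
        for t in validation_events:
            joint = [omega[i] * estimators[i](t) for i in range(m)]
            z = sum(joint) or 1.0
            resp.append([j / z for j in joint])
        # M-step: re-estimate each factor as the average responsibility
        omega = [sum(r[i] for r in resp) / len(validation_events)
                 for i in range(m)]
    return omega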

IV-C Processing of the Efficient Worker Notifying Problem

As given in Definition 4, our EWN problem is to select a minimum set of workers with high probabilities to accept invitations, to whom we will send notifications.

Formally, given a trained smooth KDE model and a timestamp t, assume that we want to recruit N more workers for the FROG platform. In the EWN problem (in Definition 4), the acceptance probability p_j(t) of each offline worker w_j can be estimated with Eq. (13).

Next, with Definition 5, we can sort the offline workers based on their ranking scores (i.e., the number of workers dominated by each worker) [41]. Thereafter, we notify the top workers with the highest ranking scores.

The pseudo code of selecting worker candidates is shown in Algorithm 4. We first initialize the selected worker set, W_s, as an empty set (line 1). Next, we calculate the ranking score of each worker (i.e., the number of other workers he/she dominates, according to Definition 5) (lines 2-3). Then, we iteratively pick the workers with the highest ranking scores, until enough workers have been selected or all workers have been selected (lines 4-8). Finally, we return the selected worker candidates, to whom we send invitation notifications (line 9).

Input: A set, W_o, of offline workers, the expected number, N, of accepting workers, and the current timestamp t
Output: A set, W_s, of workers to be invited
1   W_s ← ∅;
2   foreach worker w_j in W_o do
3       calculate the ranking score of w_j;
4   while Σ_{w_j ∈ W_s} p_j(t) < N and W_o ≠ ∅ do
5       select the worker w_j with the highest ranking score in W_o;
6       remove w_j from W_o;
7       add w_j to W_s;
8       update the expected number Σ_{w_j ∈ W_s} p_j(t);
9   return W_s;
Algorithm 4 WorkerNotify(W_o, N, t)

The Time Complexity. To compute the ranking scores, we need to compare every pair of workers, which takes O(|W_o|²) time. In each iteration, we select one candidate, and there are at most |W_o| iterations. Assuming that the workers are sorted by their ranking scores, lines 4-8 take O(|W_o|) time in total. Thus, the time complexity of Algorithm 4 is O(|W_o|²).

Discussions on Improving the EWN Efficiency. To improve the efficiency of calculating the ranking scores of workers, we may utilize a 3D grid index to accelerate the computation, where the three dimensions are the acceptance probability, response time, and accuracy. Each worker is in fact a point in this 3D space. If a worker dominates a grid cell g, then all workers in cell g are dominated by that worker. Similarly, if a worker is dominated by the cell g, then all the workers in g cannot be dominated by him/her. Then, we can compute lower/upper bounds of the ranking score for each worker, and utilize them to enable fast pruning [41].

V Experimental Study

V-A Experimental Methodology

Data Sets for Experiments on the Task Scheduler Module. We use both real and synthetic data to test our task scheduler module. We first conduct a set of comparison experiments on the real-world crowdsourcing platform gMission [13], where workers do tasks and are notified via WeChat [5], and evaluate our task scheduler module on 5 data sets [2]. Tasks in each data set belong to the same category. For each experiment, we use 16 tasks from each data set (category). We manually label the ground truth of tasks. To subscribe to one category, each worker is required to take a qualification test consisting of 5 testing questions. We uniformly generate the quality threshold of each task within a given range (shown in Table III). Below, we give brief descriptions of the real data sets.

1) Disaster Events Detection (DED): DED contains a set of tasks, which ask workers to determine whether a tweet describes a disaster event. For example, a task can be “Just happened a terrible car crash” and workers are required to select “Disaster Event” or “Not Disaster Event”.

2) Climate Warming Detection (CWD): CWD is to determine whether a tweet considers the existence of global warming or climate change or not. The possible answers are “Yes”, if the tweet suggests global warming is occurring, and “No” if the tweet suggests global warming is not occurring. One tweet example is “Global warming. Clearly.”, and workers are expected to answer “Yes”.

3) Body Parts Relationship Verification (BPRV): In BPRV, workers should point out if certain body parts are part of other parts. Questions were phrased like: “[Part 1] is a part of [part 2]”. For example, “Nose is a part of spine” or “Ear is a part of head.” Workers should say “Yes” or “No” for this statement.

4) Sentiment Analysis on Apple Incorporation (SAA): Workers are required to analyze the sentiment about Apple, based on tweets containing “#AAPL, @apple, etc”. In each task, workers are given a tweet about Apple, and asked whether the user is positive, negative, or neutral about Apple. We used records with positive or negative attitude about Apple, and asked workers to select “positive” or “negative” for each tweet.

For synthetic data, we simulate crowd workers based on observations from the real-platform experiments. Specifically, in the experiments on the real platform, we measure the average response time and the variance of the response time of each worker on each category, as well as his/her category accuracy. Then, to generate a worker in the synthetic data set, we first randomly select one worker from the workers in the real-platform experiments, and produce the synthetic worker's response time on each category following a Gaussian distribution N(μ, σ²), where μ and σ² are the average and variance of the response time of the selected real worker. In addition, we initialize the category accuracy of the synthetic worker as that of the selected real worker.

TABLE III: Experimental Settings.
     Parameters       Values
the number of categories 5, 10, 20, 30, 40
the number of tasks 1000, 2000, 3000, 4000, 5000
the number of workers 100, 200, 300, 400, 500
the range of quality threshold [0.8, 0.85], [0.85, 0.9], [0.9, 0.95], [0.95, 0.97]

Table III depicts the parameter settings in our experiments, where default values of parameters are in bold font. In each set of experiments, we vary one parameter, while setting other parameters to their default values. For each experiment, we report the maximum latency and the average accuracy of tasks.

Data Sets for Experiments on the Notification Module. To test the notification module in our FROG framework, we utilize the Higgs Twitter Dataset [15]. The Higgs Twitter Dataset was collected for monitoring the spreading process on Twitter before, during, and after the announcement of the discovery of a new particle with the features of the elusive Higgs boson on July 4th, 2012. The messages posted on Twitter about this discovery between July 1st and 7th, 2012 are recorded. There are 456,626 user nodes and 14,855,842 edges (friendship connections) between them. In addition, the data set contains 563,069 activities. Each activity happens between two users and can be a retweet, mention, or reply. We initialize the registered workers on our platform with the users in the Higgs Twitter Dataset (and their relationships on Twitter). Moreover, the activities in the data set are treated as the online records of workers on the platform, the rationale being that a user can make activities on Twitter only when he/she is free.

Competitors and Measures. For the task scheduler module, we conduct experiments to test our two adaptive scheduling approaches, the request-based (RBS) and batch-based scheduling (BBS) approaches. We compare them with a random method, namely RANDOM, which randomly routes tasks to workers, and a fast-worker greedy method, namely fGreedy, which greedily picks the fastest workers to finish the task with the highest delay probability. We hire 15 workers from the WeChat platform to conduct this experiment. Table IV shows the statistics of the category accuracies and category response times of the top 5 workers who conducted the most tasks. The accuracies and response times differ across workers and across categories.

For the notification module, we conduct experiments to compare our smooth KDE model with a random method, namely RandNotify, which randomly selects the same number of workers to notify as the smooth KDE approach. Moreover, we also compare our approach with a simple method, namely Nearest Worker Priority (NWP), which selects the workers with the largest number of historical records within a fixed-length period before or after the given timestamp on previous dates; the period length is set so that it is sufficient for a worker to respond to the invitation. For each predicted worker, if he/she has activities within the time period from the target timestamp to 15 minutes later, we treat it as a correct prediction. At a timestamp t, we denote by N_c the number of correct predictions, by N_p the total number of predictions, and by N_a the number of activities that really happened.

For experiments on the task scheduler module, we report the maximum latencies of tasks and the average task accuracies, for both our approaches and the competitor methods. For experiments on the notification module, we present the precision (N_c / N_p) and recall (N_c / N_a) of all tested methods. Our experiments were run on an Intel Xeon X5675 CPU @3.07 GHz with 32 GB RAM, in Java.

WorkerID Category Accuracy Category Response Time
DED CWD BPRV SAA DED CWD BPRV SAA
42 0.901 0.913 0.956 0.961 17.779 13.119 4.558 11.453
57 0.942 0.935 0.985 0.971 21.250 14.524 4.408 13.818
134 0.774 0.825 0.937 0.874 15.792 10.510 5.149 11.969
153 0.653 0.743 0.625 0.906 24.056 12.080 8.532 16.747
155 0.832 0.952 0.917 0.883 19.974 13.044 5.031 7.366
TABLE IV: Statistics of Workers.
Fig. 2: The Performance of the Task Scheduler Module on Real Data. (a) Maximum Latency; (b) Average Accuracy.

V-B Experiments on Real Data

The Performance of the Task Scheduler Module on Real Data. Figure 2 shows the results of the experiments on our real platform for the task scheduler module of our framework. For the maximum latencies shown in Figure 2(a), our two approaches maintain lower latencies than the baseline approach, RANDOM. Specifically, BBS achieves a much lower latency, which is just half of that of RANDOM. fGreedy is better than RANDOM, but still needs more time to finish tasks than our approaches. For the accuracies shown in Figure 2(b), our two approaches achieve higher accuracies than RANDOM. Moreover, the accuracy of BBS is higher than that of RBS. The reason is that BBS completes the most urgent tasks with minimum sets of workers having the highest category accuracies. In contrast, RBS is not concerned with the accuracy, and just routes available workers to the tasks with the highest delay probabilities. Thus, compared with BBS, RBS is less effective at maintaining a low latency. Since the required accuracies are enforced when assigning tasks to workers, the four algorithms achieve accuracies close to each other.

The Performance of the Notification Module on Real Data. To show the effectiveness of our smooth KDE model, we present the recall and precision of our model compared with NWP and RandNotify, varying the number of prediction samples from 5% to 10% of the entire population. As shown in Figure 3(a), our smooth KDE model achieves higher recall scores than the other two baseline methods. In addition, when we predict with more samples, the advantage of our smooth KDE model becomes more obvious w.r.t. the recall scores. The reason is that our smooth KDE model can utilize the influence of friends, which is more effective when we predict with more samples. Similarly, in Figure 3(b), the smooth KDE model obtains the highest precision scores among all tested methods.

Fig. 3: The Performance of the Notification Module on Real Data. (a) Recall; (b) Precision.

V-C Experiments on Synthetic Data

Effect of the Number, , of Tasks. Figure 4 shows the maximum latency and average accuracy of the four approaches, RBS, BBS, RANDOM, and fGreedy, varying the number, , of tasks from to , where other parameters are set to their default values. As shown in Figure 4(a), with more tasks (i.e., larger values), all four approaches incur higher maximum task latencies. This is because, when there are more tasks, each task has relatively fewer workers available to it, which prolongs the latencies of tasks.

RANDOM always has a higher latency than our RBS approach, followed by BBS. fGreedy achieves lower latency than the RBS approach, but still higher than BBS, as fGreedy is also a batch-based algorithm but greedily picks the fastest workers. Here, the maximum latency of BBS remains low, and only slightly increases with more tasks. The reason has been discussed in Section V-B.

Figure 4(b) illustrates the average accuracies of the four approaches with different values. Since BBS always chooses a minimum set of workers with the highest category accuracies, in most cases the task accuracies of BBS are higher than those of the other three approaches. fGreedy achieves slightly higher accuracy than RBS, as fGreedy selects a set of workers that meets the required accuracy threshold of the task with the highest delay probability, whereas RBS can only assign the currently available worker to a suitable task. Nonetheless, as shown in the figure, the RBS and BBS approaches achieve high task accuracies (i.e., ).

Effect of the Number, , of Workers. Figure 5 shows the experimental results, where the number, , of workers changes from 100 to 500, and other parameters are set to their default values. For the maximum latencies shown in Figure 5(a), when the number, , of workers increases, the maximum latencies of the four algorithms decrease. This is because, with more workers, each task can be assigned more workers (potentially with lower latencies). Since the quality thresholds of tasks do not change, with more available workers, the maximum latencies decrease. Similarly, BBS maintains a much lower maximum latency than the other three algorithms. We note that, when the number of workers is relatively small, RBS achieves lower latencies than fGreedy; however, when the number of workers increases to 300 or more, fGreedy performs better than RBS. The reason is that the batch-based algorithms (BBS and fGreedy) have more workers to select from for each task, so the latencies become smaller when more workers are available. For the average accuracies in Figure 5(b), our RBS and BBS algorithms achieve high average accuracies (i.e., ). The average accuracies of BBS are slightly higher than those of RBS, and those of RANDOM are the lowest.

Fig. 4: Effect of the number of tasks . (a) Maximum Latency; (b) Average Accuracy.

Effect of the Range of the Quality Threshold . Figure 6 shows the performance of the four approaches, where the range, , of quality thresholds, , increases from to , and other parameters are set to their default values. Specifically, as depicted in Figure 6(a), when the range of the quality threshold increases, the maximum latencies of the four tested algorithms also increase. The reason is that, with a higher quality threshold , each task needs more workers to be satisfied (as shown by Corollary III.2). Similarly, BBS achieves much lower maximum latencies than those of RBS, fGreedy, and RANDOM. Further, RBS is better than RANDOM but worse than fGreedy w.r.t. the maximum latency.

In Figure 6(b), when the range of increases, the average accuracies of all four algorithms also increase. This is because, when increases, each task needs more workers to satisfy its quality threshold (as shown by Corollary III.2), which raises the average accuracies of tasks. Similar to the previous results, our two approaches, BBS and RBS, achieve higher average accuracies than RANDOM, and fGreedy achieves accuracy close to that of BBS.

Due to space limitations, please refer to Appendix G for the experimental results with different numbers of categories.

In summary, our task scheduler module achieves low latencies and high accuracies on both real and synthetic datasets. In particular, our BBS approach is the best among all tested scheduling approaches. Moreover, as verified by the experiments on the tweet dataset, our smooth KDE model can accurately predict the acceptance probabilities of workers, and achieves higher precision and recall scores than the two baseline methods, RandNotify and NWP.

Fig. 5: Effect of the number of workers . (a) Maximum Latency; (b) Average Accuracy.

VI Related Work

Crowdsourcing has been well studied by different research communities (e.g., the database community), and is widely used to solve problems that are challenging for computers (algorithms) but easy for humans (e.g., sentiment analysis [31] and entity resolution [39]). In the database area, CrowdDB [17] and Qurk [29] are designed as crowdsourcing-incorporated databases; CDAS [28] and iCrowd [16] are systems proposed to achieve high-quality results with crowds; gMission [13] and MediaQ [24] are general spatial crowdsourcing systems that extend crowdsourcing to the real world, where workers need to physically move to specific locations to conduct tasks. Due to the intrinsic error rates of humans, crowdsourcing systems typically focus on achieving high-quality results with minimum costs. To guarantee the quality of the results, each task can be answered by multiple workers, and the final result is aggregated from the answers with voting [16, 12] or learning [28, 22] methods.

Due to the diversity of workers and their autonomous participation style in existing crowdsourcing markets (e.g., Amazon Mechanical Turk (AMT) [1] and CrowdFlower [3]), the quality and completion time of crowdsourcing tasks cannot always be guaranteed. For example, in AMT, the latency of finishing tasks may vary from minutes to days [17, 25]. Some difficult tasks are often ignored by workers and left uncompleted for a long time. Recently, several works [18, 20, 32, 38] focused on reducing the completion time of tasks. In [32, 38], the authors designed algorithms to reduce the latencies of tasks for specific jobs, such as rating and filtering records, and resolving entities with crowds. The proposed techniques for specific tasks, however, cannot be used for general crowdsourcing tasks, which are the target of our FROG framework.

Gao et al. [18] leveraged the pricing model from prior works, and developed algorithms to minimize the total elapsed time under a user-specified monetary constraint, or to minimize the total monetary cost under a user-specified deadline constraint. They utilized decision theory (specifically, Markov decision processes) to dynamically modify the prices of tasks. Haas et al. [20] proposed a system, called CLAMShell, to speed up crowds in order to achieve consistently low-latency data labeling. They analyzed the sources of labeling latency and, to tackle them, designed several techniques (such as straggler mitigation, which assigns delayed tasks to multiple workers, and pool maintenance) to improve the average worker speed and reduce the worker variance of the retainer pool.

Different from the existing works [18, 32, 38, 9, 20, 10], our FROG framework adopts the server-assigned-task (SAT) mode (instead of the worker-selected-task (WST) mode in prior works) for general crowdsourcing tasks (rather than specific tasks), and focuses on both reducing the latencies of all tasks and improving the accuracy of tasks under the SAT mode (instead of either latency or accuracy under the WST mode). In our FROG framework, the task scheduler module actively assigns workers to tasks with high reliability and low latency, taking into account the response times and category accuracies of workers, as well as the difficulties of tasks (not fully considered in prior works). We also design two novel scheduling approaches, request-based and batch-based scheduling. Different from prior works [16, 20], which simply filtered out workers with low accuracies, our work utilizes all available worker labor, by scheduling difficult/urgent tasks to high-accuracy/fast workers and routing easy and non-urgent tasks to low-accuracy workers.

Moreover, Bernstein et al. [9] proposed the retainer model, which hires a group of workers to wait for tasks, such that the latency of answering crowdsourcing tasks can be dramatically reduced. Bernstein et al. [10] also theoretically analyzed the optimal size of the retainer model using queueing theory for realtime crowdsourcing, where crowdsourcing tasks arrive individually. These models may either increase the system budget or encounter periods in which the online workers are simply not enough for the assignment. In contrast, with the help of smart devices, our FROG framework has the capability to invite offline workers to do tasks, which enlarges the public worker pool and enhances the throughput of the system. In particular, the notification module in FROG can contact workers who are not online via their smart devices, and intelligently sends invitation messages only to those workers with high probabilities of accepting the tasks. Therefore, with the new model and different goals in our FROG framework, we cannot directly apply the techniques in previous works to tackle our problems (e.g., FROG-TS and EWN).

Fig. 6: Effect of the specific quality value range . (a) Maximum Latency; (b) Average Accuracy.

VII Conclusion

Crowdsourcing has played an important role in many real applications that require the intelligence of human workers (and cannot be accurately accomplished by computers or algorithms), and it has attracted much attention from both academia and industry. In this paper, inspired by the accuracy and latency problems of existing crowdsourcing systems, we propose a novel fast and reliable crowdsourcing (FROG) framework, which actively assigns workers to tasks with expected high accuracy and low latency (rather than waiting for autonomous, unreliable, and high-latency workers to select tasks). We formalize the FROG task scheduling (FROG-TS) and efficient worker notifying (EWN) problems, and propose effective and efficient approaches (e.g., request-based and batch-based scheduling, and smooth KDE) to enable the FROG framework. Through extensive experiments, we demonstrate the effectiveness and efficiency of our proposed FROG framework on both real and synthetic data sets.

References

  • [1] Amazon mechanical turk [online]. Available: https://www.mturk.com/mturk/welcome.
  • [2] Crowdflower: Data for everyone library. Available: https://www.crowdflower.com/data-for-everyone/.
  • [3] Crowdflower [online]. Available: https://www.crowdflower.com.
  • [4] Skype [online]. Available: https://www.skype.com/en/.
  • [5] Wechat [online]. Available: http://www.wechat.com/en/.
  • [6] Whatsapp [online]. Available: https://web.whatsapp.com.
  • [7] A. Agresti and M. Kateri. Categorical data analysis. Springer, 2011.
  • [8] A.-L. Barabasi. Linked: How everything is connected to everything else and what it means. Plume Editors, 2002.
  • [9] M. S. Bernstein, J. Brandt, and R. C. Miller. Crowds in two seconds: Enabling realtime crowd-powered interfaces. In UIST, 2011.
  • [10] M. S. Bernstein, D. R. Karger, R. C. Miller, and J. Brandt. Analytic methods for optimizing realtime crowdsourcing. arXiv preprint arXiv:1204.2995, 2012.
  • [11] L. Breiman, W. Meisel, and E. Purcell. Variable kernel estimates of multivariate densities. Technometrics, 1977.
  • [12] C. C. Cao, J. She, Y. Tong, and L. Chen. Whom to ask?: jury selection for decision making tasks on micro-blog services. PVLDB, 2012.
  • [13] Z. Chen, R. Fu, Z. Zhao, Z. Liu, L. Xia, L. Chen, P. Cheng, C. C. Cao, Y. Tong, and C. J. Zhang. gmission: A general spatial crowdsourcing platform. PVLDB, 2014.
  • [14] E. Cho, S. A. Myers, and J. Leskovec. Friendship and mobility: user movement in location-based social networks. In SIGKDD, 2011.
  • [15] M. De Domenico, A. Lima, P. Mougel, and M. Musolesi. The anatomy of a scientific rumor. Scientific reports, 2013.
  • [16] J. Fan, G. Li, B. C. Ooi, K.-l. Tan, and J. Feng. icrowd: An adaptive crowdsourcing framework. In SIGMOD, 2015.
  • [17] M. J. Franklin, D. Kossmann, T. Kraska, S. Ramesh, and R. Xin. Crowddb: answering queries with crowdsourcing. In SIGMOD, 2011.
  • [18] Y. Gao and A. Parameswaran. Finish them!: Pricing algorithms for human computation. PVLDB, 2014.
  • [19] M. R. Garey and D. S. Johnson. Computers and intractability: A guide to the theory of NP-completeness, 1979.
  • [20] D. Haas, J. Wang, E. Wu, and M. J. Franklin. Clamshell: speeding up crowds for low-latency data labeling. PVLDB, 2015.
  • [21] P. G. Ipeirotis. Analyzing the amazon mechanical turk marketplace. XRDS: Crossroads, 2010.
  • [22] D. R. Karger, S. Oh, and D. Shah. Iterative learning for reliable crowdsourcing systems. In Advances in neural information processing systems, 2011.
  • [23] L. Kazemi and C. Shahabi. Geocrowd: enabling query answering with spatial crowdsourcing. In SIGGIS, 2012.
  • [24] S. H. Kim, Y. Lu, G. Constantinou, C. Shahabi, G. Wang, and R. Zimmermann. Mediaq: mobile multimedia management system. In MMSys, 2014.
  • [25] A. Kittur, E. H. Chi, and B. Suh. Crowdsourcing user studies with mechanical turk. In SIGCHI, 2008.
  • [26] S. J. Leon. Linear algebra with applications. Macmillan New York, 1980.
  • [27] G. Li, J. Wang, Y. Zheng, and M. Franklin. Crowdsourced data management: A survey. 2016.
  • [28] X. Liu, M. Lu, B. C. Ooi, Y. Shen, S. Wu, and M. Zhang. Cdas: a crowdsourcing data analytics system. PVLDB, 2012.
  • [29] A. Marcus, E. Wu, D. R. Karger, and S. Madden. Crowdsourced databases: Query processing with people. CIDR, 2011.
  • [30] L. Mo, R. Cheng, B. Kao, X. S. Yang, C. Ren, S. Lei, D. W. Cheung, and E. Lo. Optimizing plurality for human intelligence tasks. In CIKM, 2013.
  • [31] S. M. Mohammad and P. D. Turney. Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 2013.
  • [32] A. Parameswaran, S. Boyd, H. Garcia-Molina, A. Gupta, N. Polyzotis, and J. Widom. Optimal crowd-powered rating and filtering algorithms. PVLDB, 2014.
  • [33] M. Rosenblatt et al. Remarks on some nonparametric estimates of a density function. The Annals of Mathematical Statistics, 1956.
  • [34] A. I. Schein, A. Popescul, L. H. Ungar, and D. M. Pennock. Methods and metrics for cold-start recommendations. In SIGIR, 2002.
  • [35] C. E. Shannon. A mathematical theory of communication. ACM SIGMOBILE, 2001.
  • [36] B. W. Silverman. Density estimation for statistics and data analysis. CRC press, 1986.
  • [37] H. To, C. Shahabi, and L. Kazemi. A server-assigned spatial crowdsourcing framework. ACM TSAS, 2015.
  • [38] V. Verroios and H. Garcia-Molina. Entity resolution with crowd errors. In ICDE, 2015.
  • [39] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. Crowder: Crowdsourcing entity resolution. PVLDB, 2012.
  • [40] P. Welinder and P. Perona. Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In CVPRW, 2010.
  • [41] M. L. Yiu and N. Mamoulis. Efficient processing of top-k dominating queries on multi-dimensional data. In PVLDB, 2007.

-a Expected Accuracy of Multi-choices Task

Majority voting with multiple choices. Given a task in category and a set of workers assigned to it, when we use majority voting with choices, the expected accuracy of the task can be calculated as follows:

where is a subset of with elements.

Weighted majority voting with multiple choices. Given a task in category and a set of workers assigned to it, when we use weighted majority voting with choices, the expected accuracy of the task can be calculated as follows:

where is a subset of with elements, and is the weight of a given worker set .

Half voting with multiple choices. Half voting only returns the result selected by more than half of the workers. Given a task in category and a set of workers assigned to it, when we use half voting with choices, the expected accuracy of the task can be calculated as follows:

where is a subset of with elements. Half voting is effective when there are more than two choices; with two choices, its expected accuracy is calculated by the same equation as that of majority voting. A simulation sketch of the three voting rules is given below.
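Since the closed-form expressions above are omitted in this version of the report, the following Monte-Carlo sketch illustrates how the expected accuracy of a task under the three voting rules could be estimated empirically. The worker-independence assumption, the accuracy-based vote weights, and the random tie-breaking are our own illustrative choices, not necessarily the exact definitions used in Eq. (1).

```python
import random
from typing import List

def simulate_expected_accuracy(accuracies: List[float], n_choices: int,
                               rule: str = "majority", trials: int = 50_000,
                               seed: int = 7) -> float:
    """Monte-Carlo estimate of a task's expected accuracy under a voting rule.

    Each worker answers correctly with his/her category accuracy and, otherwise,
    picks one of the wrong choices uniformly at random (independence assumption).
    rule: 'majority' (plurality with random tie-break), 'weighted' (votes weighted
    by worker accuracy), or 'half' (answer returned only if >50% of workers agree).
    """
    rng = random.Random(seed)
    correct_runs = 0
    for _ in range(trials):
        votes = [0.0] * n_choices       # weighted vote counts
        plain = [0] * n_choices         # raw vote counts (choice 0 is correct)
        for p in accuracies:
            choice = 0 if rng.random() < p else rng.randrange(1, n_choices)
            plain[choice] += 1
            votes[choice] += p if rule == "weighted" else 1.0
        if rule == "half":
            ok = plain[0] * 2 > len(accuracies)
        else:
            best = max(votes)
            winners = [c for c, v in enumerate(votes) if v == best]
            ok = rng.choice(winners) == 0
        correct_runs += ok
    return correct_runs / trials
```

For example, simulate_expected_accuracy([0.9, 0.8, 0.7], n_choices=3, rule="majority") estimates the expected accuracy of a task answered by three workers with those category accuracies under plurality voting.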

-B Hardness of the FROG-TS Problem

Hardness of the FROG-TS Problem. The problem of FROG Task Scheduling (FROG-TS) is NP-hard.

Proof.

We prove the lemma by a reduction from the multiprocessor scheduling problem (MSP). A multiprocessor scheduling problem can be described as follows: given a set of jobs, where job has length , and a number of processors, the multiprocessor scheduling problem is to schedule all jobs in to the processors without overlapping, such that the time of finishing all the jobs is minimized.

For a given multiprocessor scheduling problem, we can transform it into an instance of the FROG-TS problem as follows: we create a set of tasks, where each task belongs to a different category and its specified accuracy is lower than the lowest category accuracy of all the workers, which means each task only needs to be answered by one worker. For the workers, all of them have the same response time for the tasks in category , so the processing time of any task is always , no matter which worker it is assigned to. A sketch of this construction is given below.
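For illustration, the following sketch (with hypothetical data structures) builds the transformed FROG-TS instance from an MSP instance: each job becomes a single-worker task in its own category, and every worker's response time for that category equals the job length.

```python
from typing import Dict, List, Tuple

def msp_to_frog_ts(job_lengths: List[float],
                   n_workers: int) -> Tuple[List[Dict], Dict[int, Dict[str, float]]]:
    """Transform an MSP instance (job lengths, number of processors) into a
    FROG-TS instance, following the reduction described above."""
    tasks = [
        # threshold 0.0 stands in for "below any worker's category accuracy",
        # so a single answer suffices for each task
        {"task_id": j, "category": f"c_{j}", "threshold": 0.0}
        for j in range(len(job_lengths))
    ]
    # the processors correspond to identical workers: every worker's response
    # time for category c_j equals the length of job j
    response_time = {w: {f"c_{j}": l for j, l in enumerate(job_lengths)}
                     for w in range(n_workers)}
    return tasks, response_time
```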

As each task only needs to be assigned to one worker, this FROG-TS problem instance is to minimize the maximum completion time of the tasks in , which is identical to minimizing the time of finishing all the jobs in the given multiprocessor scheduling problem. With this mapping, it is easy to show that the multiprocessor scheduling problem instance can be solved if and only if the transformed FROG-TS problem instance can be solved.

This way, we reduce MSP to the FROG-TS problem. Since MSP is known to be NP-hard [19], FROG-TS is also NP-hard, which completes our proof. ∎

-C Proof of Lemma III.1

Proof.

For a task in category , assume a set of workers is assigned to it. As the definition of the expected accuracy of task in Eq. (1) shows, for any subset and , when worker is not in , we can find an addend of

in Eq. (1). As Eq. (1) enumerates all the possible subsets of with more than elements, we can find a subset , which represents another addend of

in Eq. (1). Then, we have:

After combining all such pairs of addends of worker , we obtain: