Efficient and Flexible Crowdsourcing of Specialized Tasks with Precedence Constraints


Avhishek Chatterjee, Michael Borokhovich, Lav R. Varshney, and Sriram Vishwanath

A. Chatterjee and L. R. Varshney are with the Coordinated Science Laboratory, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA (email: {avhishek,varshney}@illinois.edu). M. Borokhovich is with AT&T Labs, New Jersey, USA (email: michaelbor@utexas.edu). S. Vishwanath is with the Wireless Networking and Communication Group, The University of Texas at Austin, Austin, Texas, USA (email: sriram@austin.utexas.edu). Part of the material in this paper will be presented at IEEE INFOCOM 2016, San Francisco, USA [1].

Many companies now use crowdsourcing to leverage external (as well as internal) crowds to perform specialized work, and so methods of improving efficiency are critical. Tasks in crowdsourcing systems with specialized work have multiple steps and each step requires multiple skills. Steps may have different flexibilities in terms of obtaining service from one or multiple agents, due to varying levels of dependency among parts of steps. Steps of a task may have precedence constraints among them. Moreover, there are variations in loads of different types of tasks requiring different skill-sets and availabilities of different types of agents with different skill-sets. Considering these constraints together necessitates the design of novel schemes to allocate steps to agents. In addition, large crowdsourcing systems require allocation schemes that are simple, fast, decentralized and offer customers (task requesters) the freedom to choose agents. In this work we study the performance limits of such crowdsourcing systems and propose efficient allocation schemes that provably meet the performance limits under these additional requirements. We demonstrate our algorithms on data from a crowdsourcing platform run by a non-profit company and show significant improvements over current practice.

I Introduction

The nature of knowledge work has changed to the point that nearly all large companies use crowdsourcing approaches, at least to some extent [2]. The idea is to draw on the cognitive energy of people, either within a company or outside of it [3]. A particularly notable example is the non-profit impact sourcing service provider Samasource, which relies on a marginalized population of workers to execute work, operating under the notion “give work, not aid” [4, 5].

There are multifarious crowdsourcing structures [6, 7], each requiring a different strategy for matching work to agents [8]. Contest-based platforms such as TopCoder and InnoCentive put out open calls for participation, and the best submissions win prizes [9]. Microtask platforms such as Amazon Mechanical Turk allocate simple tasks on a first-come, first-served basis to any available crowd agent. Platforms with skilled crowds and specialized work, such as oDesk (now Upwork) [7], IBM’s Application Assembly Optimization platform [10], and to a certain extent Samasource’s SamaHub platform [4], instead need efficient allocation algorithms.

In these skill-based crowdsourcing platforms, the specialized tasks have multiple steps, each requiring one or more skills. For example, software development tasks may first be planned (architecture), then developed (programming), and finally tested (testing and quality assurance), perhaps with several iterations. Even in skilled microtasking platforms like SamaHub, most jobs have more than one step. Task steps often have precedence constraints between them, implying that a particular step of a task can only be performed after another set of steps has been completed.

To serve a step requiring multiple skills, we need either a single agent that has all of the skills or a group of agents that collectively do so. Whether multiple agents can be pooled to serve a step or not depends on the flexibility of the step: if there are strong interdependencies between different parts of a step, the step may require a single agent. Notions of flexibility and precedence constraints are central to this paper.

Allocating tasks to servers is a central problem in computer science [11], communication networks [12], and operations research [13]. The skill-based crowdsourcing setting, however, poses new challenges for task allocation in terms of vector-valued service requirements, random and time-varying resource (agent) availability, large system size, a need for simple decentralized schemes requiring minimal actions from the platform provider, and the freedom of customers (task requesters) to choose agents without compromising system performance. Some of these issues have been addressed in recent work [14, 15], but that work does not address precedence constraints or step flexibility. The notion of flexibility in [15] is based on agent categories and differs from the one used here.

Task allocation with precedence constraints has been studied in theoretical computer science in the following form. Given several tasks, precedence constraints among them, and one or more machines (of the same or different speeds), allocate tasks to minimize the weighted sum of completion times or the maximum completion time [16]. In crowdsourcing, by contrast, a stream of tasks arrives over time, and so we are interested in dynamics.

Dynamic task allocation with precedence constraints has recently been studied in [17] for Bernoulli task arrivals. This is different from crowdsourcing scenarios, and the optimal scheme is required to search over the set of possible allocations, which is not suitable for crowdsourcing systems due to their inherent high-dimensionality (many types of tasks). Additional challenges in a crowdsourcing platform are: (i) random and time-varying agent availability; (ii) vector-valued service requirements; (iii) fast computation requirements for scaling; and (iv) freedom of choice for customers.

Here we address the above issues for various flexibilities of steps and agents, to characterize limits of crowd systems and develop optimal, computationally-efficient, centralized allocation schemes. Based on insights garnered, we further present fast decentralized greedy schemes with good performance guarantees. To complement our theoretical results, we also present numerical studies on real and synthetic data, drawn from Samasource’s SamaHub platform.

The remainder of the paper is organized as follows. Sec. II describes the system model for crowdsourcing platforms with different precedence and flexibility constraints. Sec. III introduces notions of optimality, and Sec. IV presents a generic characterization of the system limits and a generic centralized optimal allocation scheme. Secs. V–VII address particular systems with different flexibility constraints to yield fast decentralized schemes that meet the crowdsourcing platform requirements mentioned above. Sec. VIII presents numerical studies on real and synthetic data. Detailed proofs of theoretical results are given in Appendix A.

II System Model

There are a total of kinds of skills available in the crowdsourcing system, numbered . We define types of agents by skills, and denote the total number of types of agents by . An agent of type has skills .

Tasks posted on the platform are of types. Each type of task has one or multiple steps associated with it, denoted . A step of a job type —a -step—needs a skill-hour service vector (non-negative orthant), i.e.  hours of skill . A part of a step of type involving skill is called a -substep if , the size of this substep.

In the platform, allocations of work to available agents happen at regular time intervals, . Tasks that arrive after an epoch are considered for allocation at epoch , based on the available agents at that epoch. Tasks or parts of tasks that remain unallocated due to insufficient available skilled agents are considered again in the next epoch. We assume that for any substep , the time requirement is less than the duration between two allocation epochs.

Tasks arrive according to a stochastic process in (non-negative orthant), , where is the number of tasks of type that arrive between epochs and . The stochastic process of available agents at epoch is . We assume and are independent of each other and that each of the processes is i.i.d. for each , with bounded second moments. Let be the distribution function of , and let and be the means of the processes.

An agent is inflexible if it has pre-determined how much time to spend on each of its skills. Inflexible agents bring a vector where if and only if and an inflexible agent spends no more than time for skill . Contrariwise, flexible agents bring a total time which can be arbitrarily split across skills in .

A step is flexible if it can be served by any collection of agents pooling their service-times. All substeps of inflexible steps must be allocated to one agent. At any epoch only an integral allocation of a step is possible. Hence, in any system for a step to be allocated, all of its substeps must be allocated.

A set of flexible substeps of size with skill can be allocated to agents if the available skill-hours (determined by the availability of the agent, the system state, and whether agents are flexible or inflexible) for skill of these agents, , satisfy the following for some ,


where the capture how agents split their available skill hours across substeps.
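As a minimal sketch of this pooling condition for a single skill, assume the split variables are non-negative amounts of each agent's available hours and that nothing else restricts the split. Under that assumption a valid split exists exactly when total demand does not exceed total available skill-hours, and a greedy fill constructs one. The names `can_pool`, `sizes`, and `hours` are hypothetical and do not appear in the paper.

```python
EPS = 1e-12  # tolerance for floating-point hours


def can_pool(sizes, hours):
    """Return a split z (z[i][j] = hours agent j gives to substep i) if the
    substeps fit into the agents' pooled hours for one skill, else None.

    Assumption (not from the paper): hours can be split arbitrarily, so
    feasibility reduces to total demand <= total supply.
    """
    if sum(sizes) > sum(hours) + EPS:
        return None
    remaining = list(hours)
    z = [[0.0] * len(hours) for _ in sizes]
    j = 0  # index of the agent currently being drawn from
    for i, need in enumerate(sizes):
        while need > EPS and j < len(remaining):
            take = min(need, remaining[j])
            z[i][j] += take
            remaining[j] -= take
            need -= take
            if remaining[j] <= EPS:
                j += 1  # this agent is exhausted, move to the next one
    return z


split = can_pool([2.0, 1.0], [1.5, 1.5])
```

Here the first substep draws on both agents, which is what distinguishes flexible substeps from inflexible steps that must fit within a single agent.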

For inflexible steps, a set of steps of size (vectors) can be allocated to an agent with available skill-hours (vector) if


There may also be precedence constraints on the order in which different steps of a task of type can be served. For any task of type , this constraint is given by a directed rooted tree on nodes where a directed edge implies step of a task of type can only be served after step of the same task has been completed.
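The precedence rule can be illustrated with a small sketch that lists which steps of one task are currently eligible for allocation. It assumes the tree is stored as a map from each step to its single immediate predecessor (the root maps to `None`); the step names and the function `ready_steps` are hypothetical.

```python
def ready_steps(parent, completed):
    """Steps not yet completed whose predecessor (if any) is completed.

    parent: dict mapping step -> immediate predecessor, None for the root.
    completed: set of steps already served.
    """
    return sorted(
        step
        for step, pred in parent.items()
        if step not in completed and (pred is None or pred in completed)
    )


# Hypothetical software task: plan first, then develop and docs, then test.
parent = {"plan": None, "develop": "plan", "docs": "plan", "test": "develop"}
```

Calling `ready_steps(parent, {"plan"})` returns the two steps that become available once planning is done.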

Scalings of several crowd system parameters are as follows. Task arrival rate scales faster than number of task types , i.e. . Number of skills scales slower than , i.e. . In practice, a task requires a constant number of skills , which implies possible types of steps. Number of skills of an agent is implying , implying and for . Further, the length of tasks and availability of agents do not scale with the size of crowdsourcing systems.

Beyond these practical system scalings, we make the following mild assumptions: for all and , for some . These assumptions mean the arrival rate of every type of job and the total number of jobs requiring a particular skill scale with the system. We call these scaling patterns crowd-scaling.

III Notions of Optimality

To formally characterize the maximal supportable arrival rate of tasks, we introduce some more notation and invoke some well-accepted notions used in this regard.

For each , let the number of unfinished tasks in the system just after allocation epoch be . is the number of tasks of type arriving between epochs and . The number of tasks of type completely allocated (all steps) at epoch is . Thus evolves as:


Clearly at any epoch , since at most type tasks are available. Hence for all . Note that due to additional precedence constraints, typically .
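A minimal sketch of this backlog evolution, assuming the usual queueing recursion in which the new backlog of each task type equals the old backlog plus arrivals minus the tasks completely allocated at the next epoch. The function name and the list-based state are hypothetical.

```python
def next_backlog(backlog, arrivals, allocated):
    """One step of the per-type backlog recursion.

    Assumed form (not reproduced from the paper): new = old + arrivals - allocated,
    with allocated never exceeding the tasks actually available.
    """
    updated = []
    for q, a, d in zip(backlog, arrivals, allocated):
        if d > q + a:
            raise ValueError("cannot allocate more tasks than are available")
        updated.append(q + a - d)
    return updated


state = next_backlog([3, 0], [2, 1], [4, 0])
```

The check inside the loop mirrors the observation that at most the available tasks of a type can be allocated, so the backlog never becomes negative.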

Definition 1.

A scheme of allocation of tasks is called a policy if it allocates tasks at a time epoch based on knowledge of statistics of the processes and and their realizations up to time , but does not depend on future values.

Definition 2.

A crowd system is stable under policy if the process has a finite expectation in steady-state under that policy, i.e., , for all for any initial condition.

Definition 3.

An arrival rate is stabilizable if there exists a policy under which is stable.

Definition 4.

The capacity region of a crowd system for a given distribution of the agent-availability process is the closure of the set .

We aim to propose statistics-agnostic, computationally simple, and decentralized schemes that offer customers freedom of choice while stabilizing any arrival rate in the system’s capacity region. Going beyond stability, we often give high-probability bounds on the number of unallocated tasks.

IV Capacity and Centralized Allocation Routine

Here we present a generic characterization of the capacity region of a crowd system for all combinations of agent- and task-flexibility. We also give a generic centralized allocation routine that can be easily adapted to a particular system.

For any given set of available agents , define the number of different types of steps () that can potentially be allocated in a crowd system by . When we say is the number of steps of different types that can potentially be allocated, we consider the following scenario that satisfies the allocation constraints in Sec. II.

  1. An infinite number of steps of each type , for a are available for allocation, i.e., the limitation only comes from the available resource .

  2. Precedence constraints among the steps are already satisfied, i.e., all corresponding -steps of the available -steps have already been allocated previously. This is equivalent to an absence of precedence constraints.

  3. Integral steps must be allocated, i.e., all substeps of a step need to be allocated for allocation of the step.

  4. To allocate steps of different types to a collection of agents of type and available hours (which is a function of depending on the system), we need to satisfy either (1) or (2) depending on system type.

Let be the convex hull of the set , and define another set as follows.

Based on this we define another set . Let for any , , . Then . This set characterizes the capacity region of the crowd system.

Theorem 1.

Any arrival rate is stabilizable if for some , and no arrival rate can be stabilized if is outside the closure of the set .

Note that we ignore the precedence constraint in defining . This does not conflict with the fact that the capacity region is a subset of , but it may not be obvious that is in fact the capacity region. We show this constructively, with a scheme that respects precedence constraints and stabilizes any rate in the interior of .

IV-A Centralized Allocation

Let us develop a statistics-agnostic scheme that stabilizes any arrival rate .

Let be the number of unallocated steps just before allocation epoch . This includes steps not allocated at epoch and steps that became available for allocation between and . Thus, if for any , -steps were allocated at epoch and new -steps became available between and ,

Note that, for any and , new -steps become available only when some steps have been completed. Service times are strictly less than the duration between two allocation epochs. So, any step allocated at epoch is completed before epoch . Hence, for any and : . On the other hand, for any and , we have an external arrival between epoch and .

At any time , for a given resource availability, an allocation rule determines resources to be allocated for a certain number of -steps. We denote this by . Note that . Our goal is to design a scheme that finds a good for a given and .

Centralized Allocation

Input: and at
Output: and allocation of steps to agents
1:  Define: number of leaves in the subtree of rooted at
2:  Obtain
3:  For each allocate -steps

This allocation scheme is statistics-agnostic and explicit in terms of the system state, and by design it automatically satisfies the precedence constraints. The scheme is also generic: it can be easily adapted to different agent- and step-flexibility. Note that comes from the allocation constraints of the system. If in (4) we replace by the corresponding allocation set, the centralized algorithm becomes a generic allocation routine.
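Step 1 of the routine relies on the number of leaves in the subtree rooted at each step of the precedence tree. The sketch below computes only that graph parameter; how it enters the weights of the optimization is not reproduced here. The tree representation (a map from each step to its list of children) and the name `leaf_counts` are assumptions made for illustration.

```python
def leaf_counts(children, root):
    """Number of leaves in the subtree rooted at each node of a rooted tree.

    children: dict mapping node -> list of child nodes (leaves may be absent).
    """
    counts = {}

    def visit(node):
        kids = children.get(node, [])
        # A leaf counts itself; an internal node sums over its children.
        counts[node] = 1 if not kids else sum(visit(k) for k in kids)
        return counts[node]

    visit(root)
    return counts


# Hypothetical precedence tree: plan -> {develop -> test, docs}.
tree = {"plan": ["develop", "docs"], "develop": ["test"]}
counts = leaf_counts(tree, "plan")
```

The recursion is adequate for the shallow trees typical of multi-step tasks; a very deep tree would call for an iterative traversal.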

In fact, the generic statistics-agnostic centralized allocation routine described above is optimal, in the sense that any arrival rate that can be stabilized by any policy can also be stabilized by this scheme.

Theorem 2.

The centralized allocation routine described above stabilizes any if , the capacity region of the corresponding system for any .

Though the scheme is similar to back-pressure algorithms [18, 19, 12], unlike back-pressure it also uses a graph parameter () in computing the weights. The proof uses a Lyapunov function involving and queue-lengths.

The same results extend when the precedence constraint is a directed acyclic graph instead of a directed rooted tree. As is apparent from the proof of Theorem 1, the converse (the outer bound on capacity) does not depend on the precedence graph. On the other hand, for any precedence constraint given by a directed acyclic graph, there exists a precedence constraint given by a directed rooted tree such that satisfying the tree constraint never violates the directed acyclic graph constraint. Capacity can then be achieved by applying the above centralized algorithm to this directed rooted tree.
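One simple way to obtain such a tree, offered only as an illustration and not as the construction used in the paper, is to take any topological order of the directed acyclic graph and chain the steps along it. A chain is a degenerate rooted tree, and serving steps in chain order never violates a precedence of the original graph, although it may serialize steps more than necessary. The sketch uses the standard-library `graphlib` module (Python 3.9 or later); the function names are hypothetical.

```python
from graphlib import TopologicalSorter


def chain_from_dag(predecessors):
    """Map each step to its parent in a chain built from a topological order.

    predecessors: dict mapping step -> set of steps that must finish first.
    """
    order = list(TopologicalSorter(predecessors).static_order())
    if not order:
        return {}
    parent = {order[0]: None}
    for prev, step in zip(order, order[1:]):
        parent[step] = prev
    return parent


def respects(parent, predecessors):
    """True if every DAG predecessor of a step is an ancestor in the chain."""

    def ancestors(step):
        seen = set()
        while parent[step] is not None:
            step = parent[step]
            seen.add(step)
        return seen

    return all(set(p) <= ancestors(s) for s, p in predecessors.items())


# Hypothetical DAG: test needs both develop and docs, which both need plan.
dag = {"test": {"develop", "docs"}, "develop": {"plan"}, "docs": {"plan"}}
chain = chain_from_dag(dag)
```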

V Inflexible Agents and Flexible Steps

Here we characterize the limits of task allocation where all steps are flexible and agents are inflexible. Sec. IV presented a generic capacity characterization and algorithm; this section investigates computational aspects of the generic algorithm for this particular system and also proposes a simple decentralized scheme that works well under a broad class of assumptions.

Consider , the set of possible allocations with inflexible agents and flexible tasks for availability of agents, . Recall the allocation scenario in Sec. IV to determine a generic : A1–A3 are the same for any system flexibility, but A4 is specific. For an (I,F) system we have the following.

To allocate tasks of different types to a collection of agents of type and available hours we must satisfy (1):


Note that whenever a step is allocated, all tasks in it must be allocated simultaneously. Hence, we can only allocate tasks with when satisfying (5).

Given , the capacity region is obtained in the same way was obtained from in Sec. IV.

The generic centralized allocation routine can be similarly specialized for systems: in (4) of the routine is replaced by . The centralized scheme is computable since can be written explicitly in terms of , , and , but it cannot always be computed in polynomial time. Since any allocation in must satisfy constraint (5), optimization problem (4) can be written as:


Note that the solution to the problem does not change if we replace by , as optimal schemes never allocate resources to negative . Thus, we assume .

Note that (6) is a multi-dimensional knapsack problem in which the number of available items of a given weight and value is unbounded [20]. This problem is known to be NP-hard and to admit no fully polynomial-time approximation scheme (FPTAS) unless P = NP. A polynomial-time approximation scheme (PTAS) is known, but its complexity is exponential in the dimension. Extended linear programming (LP) relaxations have recently been proposed, but they have the same issues (see [21] and references therein).
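To make the structure of this knapsack problem concrete, the following sketch solves a tiny instance by exhaustive search. It is not the paper's scheme: the per-step values stand in for whatever weights the objective assigns, integer skill-hours are assumed, and the names `best_allocation`, `values`, `requirements`, and `capacity` are hypothetical. The search space grows exponentially with the number of step types, which is why a brute-force approach does not scale.

```python
from itertools import product


def best_allocation(values, requirements, capacity):
    """Maximize sum(values[i] * n[i]) over non-negative integer counts n,
    subject to sum_i n[i] * requirements[i][s] <= capacity[s] for each skill s.

    Step types that require no skill-hours are given count 0 to keep the
    search finite (an assumption of this sketch).
    """
    bounds = []
    for req in requirements:
        limits = [capacity[s] // r for s, r in enumerate(req) if r > 0]
        bounds.append(min(limits) if limits else 0)

    best_value, best_counts = 0, tuple(0 for _ in values)
    for counts in product(*(range(b + 1) for b in bounds)):
        used = [
            sum(n * req[s] for n, req in zip(counts, requirements))
            for s in range(len(capacity))
        ]
        if all(u <= c for u, c in zip(used, capacity)):
            value = sum(v * n for v, n in zip(values, counts))
            if value > best_value:
                best_value, best_counts = value, counts
    return best_value, best_counts


# Two step types, two skills: type 0 needs (2, 1) hours, type 1 needs (1, 1).
result = best_allocation([3, 2], [[2, 1], [1, 1]], [4, 3])
```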

We aim to find a simple and fast distributed scheme, but first propose the following LP relaxation-based, polynomial-time (in and ) scheme that gives nearly optimal centralized allocation for a large crowd system (under crowd scaling).


We cannot give performance guarantees for this scheme at each allocation epoch for arbitrary , but for a sufficiently large crowd system, this scheme stabilizes almost any arrival rate that can be stabilized.

Theorem 3.

Under crowd scaling, for any there is an such that for any system with , the LP-based scheme (7) stabilizes any arrival rate in .

V-A Decentralized Allocation

In this section we develop a simple decentralized scheme with good performance guarantees. As discussed before, often one of the main reasons for customers to go to a crowd platform is the ability to choose workers themselves. As such, we propose a simple greedy scheme that allows customers the freedom of choice with minimal intervention from platform operators. This also reduces the platform’s operational cost.

In greedy allocation, each step competes against others to find an allocation for all of its substeps. Contention can be resolved arbitrarily, e.g., randomly, by a pre-specified order, or by age.

The Prioritized Greedy algorithm below performs greedy allocation among all steps across all types of tasks that are in the same order. It starts with steps that are in the beginning of the precedence tree and once these steps find an allocation (or cannot be allocated), only then are steps lower in the corresponding precedence trees allowed to allocate themselves.


2:  for d=1:D do
4:     Greedy allocation among steps
5:  end for
Algorithm 1 Prioritized Greedy

This algorithm can be efficiently implemented on a crowdsourcing platform with minimal intervention from the platform operator. The operator need only tag unallocated steps in the system based on their depth in the rooted precedence tree and only show available workers to them after steps at lower depth have exercised their allocation choice. This may be implemented by personalizing the platform’s search results.
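A minimal sketch of the depth-ordered greedy pass described above, under simplifying assumptions that are not taken from the paper: only steps whose predecessors are already complete are passed in, the skill-hours of all inflexible agents are pooled into one total per skill (possible because steps are flexible), and contention within a depth is resolved in list order. The names `prioritized_greedy`, `steps`, and `pool` are hypothetical.

```python
def prioritized_greedy(steps, pool):
    """Allocate steps depth by depth, greedily within each depth.

    steps: list of (step_id, depth, {skill: hours needed}).
    pool:  {skill: total available hours across agents}.
    Returns the ids of allocated steps in allocation order.
    """
    remaining = dict(pool)  # work on a copy so the caller's pool is untouched
    allocated = []
    for depth in sorted({d for _, d, _ in steps}):
        for step_id, d, need in steps:
            if d != depth:
                continue
            # A step is allocated only if all of its substeps fit.
            if all(remaining.get(s, 0) >= h for s, h in need.items()):
                for s, h in need.items():
                    remaining[s] -= h
                allocated.append(step_id)
    return allocated


steps = [
    ("b2", 2, {"py": 3}),
    ("a1", 1, {"py": 2, "qa": 1}),
    ("c1", 1, {"py": 2}),
]
chosen = prioritized_greedy(steps, {"py": 5, "qa": 1})
```

In this example both depth-1 steps are served before the depth-2 step is considered, and the latter is left for the next epoch because too few hours remain.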

The algorithm is fast and has good performance guarantees under certain broadly-used assumptions on arrival and availability processes.

Definition 5.

A random variable is Gaussian-dominated if and for all , , and Poisson-dominated if for all , .

These domination definitions, commonly assumed in bandit problems [22], imply that variation around the mean is dominated in a moment generating function sense by that of a Gaussian (Poisson) random variable. Such a property is satisfied by many distributions used to model arrival processes, including in crowdsourcing systems [23].

Theorem 4.

Consider inflexible agents and flexible steps crowdsourcing systems (size ) where for any is sub-poly, i.e., , arrival and availability processes are Poisson-dominated (and/or Gaussian-dominated), and system scales as per crowd-scaling. Then, for any , s.t. , any arrival rate can be stabilized by Prioritized Greedy, and at the steady state the total number of unallocated steps in the system across all types is w.p. .

This implies Prioritized Greedy can stabilize almost any stabilizable arrival rate while ensuring the number of unallocated tasks scales more slowly than the system size.

VI Flexible Agents and Flexible Steps

Now consider systems with flexible agents and flexible steps , and characterize capacity regions. For a given availability of agents , the set of possible step allocations is . As for in Sec. IV, this satisfies A1–A3 in the allocation scenario; A4 for systems is as follows.

A certain number of steps of each type can be allocated to a set of agents of types if there exists a set of vectors in , , such that:


Based on the set of possible allocations , the capacity region can be characterized just as in Sec. IV.

Similar to Sec. V, if we replace by in the centralized allocation routine we obtain an optimal policy for the system. It is not hard to see that for the instance where each agent has exactly one skill, the problem is again a multi-dimensional knapsack problem and therefore NP-hard. We develop a computationally-efficient scheme.

If there are agents of type available, then the centralized allocation problem at time is to optimize:


This is a mixed ILP with integer variables and real variables. The complexity of this problem scales with the number of available agents in the system, . We would like to avoid such a scaling as may be much larger than and in a crowd system. Hence, we pose another optimization problem where the number of variables scales with and .

Given ,


Note that this optimization problem yields an allocation satisfying all constraints for flexible agent allocation. This is because is the fraction of time of an agent of type that has been given to skill , which can be positive only when agent of type has skill . The last inequality ensures that the skill-hour constraint per skill is satisfied. Hence, this is a feasible allocation procedure.

This is again a mixed ILP, but with variables. Note that this problem is also NP-hard, corresponding to a multi-dimensional knapsack problem if . We design a centralized scheme that allocates steps based on the following LP relaxation. Given ,


This scheme has the following performance guarantee.

Theorem 5.

Under crowd scaling, for any there is an such that for any system with , the LP-based scheme (11) stabilizes any arrival rate in .

The proof of this theorem is based on the equivalence of (9) and (10).

VI-A Decentralized Allocation

Now we develop a decentralized allocation scheme that requires minimal centralized operation, and gives customers the option to choose from a pool of multiple agents.

Initialize: , at starting time ,

1:  Update at each :
2:  Solve for
if no solution pick randomly from a simplex in .
3:  Initialize sets:
4:  For each type : put each available agent in one of w. p. (independent rolls of loaded dice)
5:  Create inflexible agents: an agent of type in has available time only for skill
6:  Run Prioritized Greedy for this (I,F) system
Algorithm 2 Prioritized Greedy with Flexibility

This algorithm is amenable to implementation on a crowdsourcing platform. Note is available from recent history. Creating the set is simple: any agent of type is randomly tagged (as per ) with a particular skill and is shown only tasks requiring that skill. Similarly, customers are shown the agent as having only that skill. The rest of the algorithm is exactly like Prioritized Greedy: we create classes of steps with priorities among them, and within each class the allocation is arbitrarily greedy.
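The random tagging in steps 4 and 5 can be sketched as follows. The split probabilities are assumed to be given per agent type (in the algorithm they come from the optimization in step 2, which is not reproduced here), and the names `tag_agents`, `agents`, and `split` are hypothetical.

```python
import random


def tag_agents(agents, split, rng=random):
    """Tag each flexible agent with exactly one of its skills.

    agents: list of (agent_id, agent_type).
    split:  {agent_type: {skill: probability}}; probabilities act as weights.
    rng:    source of randomness exposing choices(), e.g. random.Random(seed).
    Returns {agent_id: skill}; each tagged agent then behaves like an
    inflexible agent whose time is available for that skill only.
    """
    tags = {}
    for agent_id, agent_type in agents:
        skills = list(split[agent_type])
        weights = [split[agent_type][s] for s in skills]
        # Independent draw per agent, i.e. one roll of a loaded die.
        tags[agent_id] = rng.choices(skills, weights=weights, k=1)[0]
    return tags


split = {"dev": {"py": 0.7, "test": 0.3}, "qa": {"test": 1.0}}
tags = tag_agents([(1, "dev"), (2, "dev"), (3, "qa")], split, random.Random(0))
```

Passing a seeded `random.Random` instance makes a run reproducible, which is convenient when replaying an allocation epoch.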

We can guarantee Alg. 2 performance when satisfies: and .

Theorem 6.

Consider a flexible agents and flexible steps crowdsourcing system with availability processes that are Poisson (and/or Gaussian) dominated with restricted asymmetry, i.e., , being . For any , s.t. in such systems of size that follow crowd scaling any arrival rate can be stabilized by Alg. 2 and at the steady state (i.e., for any finite when ) the total number of unallocated steps in the system across all types is w.p. .

VII Flexible Agents and Inflexible Steps

Now consider the setting where agents may split their available service-time across their skills, but a step must be allocated to one agent. Multiple agents cannot pool their service time to serve a step. As before, for an agent availability vector , there is a set of possible allocations of steps (of different -types) to agents, denoted . Given and the distribution of agent availability , we can define a capacity region in the same way as is defined in Sec. V based on . Similarly, the generic centralized routine can be adapted by changing the optimization over to an optimization over while ensuring optimality of the modified scheme for system.

Allocation constraint (2) is for allocation of steps to a particular agent. For a given set of agents of different types the allocation constraint can be written based on (2). Note that for inflexible steps agents cannot pool service-times to serve a step. Consider a set of available agents , of types respectively. An allocation of steps to these agents is possible if and only if there are integers such that -steps are allocated to agent and all steps are allocated to some agent, i.e., for each there is an in an -dimensional simplex so that:

Hence, the optimization problem in the centralized allocation routine for system is an integer LP of the form:


Note that like (6), the objective can be written as where .

This problem has a special structure which leads to a computationally-efficient algorithm. Consider the following.


When operating at the optimum of (13), , and so we see that (13) and (14) have the same optimal value. Hence, we solve problem (14) instead of problem (13).

Note that as there is no constraint between and , problem (14) decomposes into optimization problems, each for an available agent. Consider the optimization problem for an agent of type .


which is again equivalent to the following problem, expressed in terms of the set of skills of type agent:


This is a one-dimensional knapsack problem, for which pseudo-polynomial-time dynamic programming (DP) algorithms exist. Since the sack size is finite (it does not scale with the system), the DP has computational complexity . This implies the centralized scheme decomposes into problems, each of which can be solved in polynomial time.
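A textbook DP for the unbounded one-dimensional knapsack problem is sketched below as an illustration of the per-agent subproblem. It assumes step sizes and the agent's total time are positive integers (for example, hours discretized to a fixed unit) and that only step types the agent is skilled for are passed in. The names `unbounded_knapsack`, `values`, `sizes`, and `capacity` are hypothetical.

```python
def unbounded_knapsack(values, sizes, capacity):
    """Best total value and per-type counts for an unbounded 1-D knapsack.

    values[i]: value of one step of type i.
    sizes[i]:  positive integer total size of one step of type i.
    capacity:  non-negative integer time available to the agent.
    Runs in O(capacity * number of types) time.
    """
    best = [0] * (capacity + 1)
    choice = [None] * (capacity + 1)  # last type added at each capacity
    for c in range(1, capacity + 1):
        best[c] = best[c - 1]  # option: leave one unit of time unused
        for i, (v, w) in enumerate(zip(values, sizes)):
            if w <= c and best[c - w] + v > best[c]:
                best[c], choice[c] = best[c - w] + v, i

    # Walk back through the choices to recover the counts.
    counts = [0] * len(values)
    c = capacity
    while c > 0:
        i = choice[c]
        if i is None:
            c -= 1
        else:
            counts[i] += 1
            c -= sizes[i]
    return best[capacity], counts


result = unbounded_knapsack([10, 7], [4, 3], 10)
```

Because the capacity does not grow with the system, the table stays small and the cost per agent is essentially linear in the number of step types.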

Thus, the centralized scheme naturally leads to a decentralized scheme where each available agent solves (16) and uses the optimal solution as its potential allocated steps. Agents may use an arbitrary contention mechanism among themselves to decide which agent allocates first. Upon resolving contention, agents pick steps greedily by solving (16). Since the decentralized scheme follows directly from the centralized one (13), performance guarantees from Thm. 2 hold.

Although this simple decentralized scheme is optimal, it does not give customers freedom of choosing agents. Thus, we propose another decentralized scheme where customers get to pick any agent from a subset of available agents.

Compute and Store: one-time

1:  for  do
2:     d
4:     while  do
6:        Pick maximal subsets from the collection , say .
8:     end while
9:  end for

Allocation at time


1:  for  do
2:     while  do
3:        Steps in allocate themselves greedily (ties are broken arbitrarily)
5:     end while
6:  end for
Algorithm 3 Restricted Greedy

Alg. 3 allows the different types of steps to pick agents greedily, but in a restricted manner. It prioritizes steps with lower depth like Prioritized Greedy. Among steps with the same priority (in terms of depth), it gives preference to steps requiring more skills to ensure an agent with multiple skills is not used unwisely for a step with lesser requirements.
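The priority order described here can be sketched as a single sort. This captures only the ordering (depth first, then number of required skills), not the subset structure that the algorithm precomputes; the tuple layout and the name `restricted_order` are assumptions.

```python
def restricted_order(steps):
    """Order in which steps get to pick agents.

    steps: list of (step_id, depth, set of required skills).
    Lower depth goes first; within a depth, steps needing more skills go
    first so multi-skilled agents are not spent on narrower steps.
    """
    ranked = sorted(steps, key=lambda s: (s[1], -len(s[2])))
    return [step_id for step_id, _, _ in ranked]


order = restricted_order(
    [("a", 2, {"py"}), ("b", 1, {"py"}), ("c", 1, {"py", "qa"})]
)
```

Python's sort is stable, so steps with equal depth and equal skill count keep their input order, which corresponds to breaking ties arbitrarily.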

Theorem 7.

Consider a flexible agent and inflexible steps crowdsourcing system where each type of agent has skills, and arrival as well as availability processes are Poisson (and/or Gaussian) dominated, is a partition of and are the same for all . For this system, for any , s.t. in such systems of size that follow crowd scaling, any arrival rate can be stabilized by Alg. 3 and at steady-state the total number of unallocated steps in the system across all types is w.p. .

For many systems the total sizes of steps are nearly identical and so the assumption on total size is not restrictive, though results can be extended to the case where the total sizes are random with the same mean. The assumption is a partition is required for proving the performance guarantee, but the algorithm (actually a simpler version) works well on simulations. The above performance guarantee can be extended for the following conditions. is a partition of for some and for any    for some , , and for any , pair, and either have no intersection or one is a subset of the other.

VIII Evaluation

Secs. III–VII characterized the limits of different types of crowdsourcing systems, proposed efficient policies for optimal centralized allocation, and designed decentralized schemes with provable bounds on backlog while giving customers freedom of choice. This section complements the theoretical results by studying real data from Samasource, a non-profit crowdsourcing company, and realistic Monte Carlo simulations. We study the performance of simplified (in implementation and computation) versions of the decentralized algorithms proposed above.

Let us first describe the evaluation of allocation on real data. The dataset contains M tasks, each of which belongs to a specific project. Some projects are regarded as real-time, which means they have higher priority. The overall number of tasks that belong to the real-time projects is about M. Each task comprises 1 or 2 steps, each of which comprises a single substep. Some tasks have strict step ordering, i.e., the previous step must be completed before the next can be scheduled. The average substep working time requirement is sec. From the data, we calculate the turn-around time (TAT) for each task, i.e., the time from the task’s arrival in the system until its last step is completed. The cdfs of TAT for all projects and for real-time projects only are given in Fig. 1.
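The TAT computation and its empirical cdf can be sketched as follows. The record layout (an arrival time plus a list of step completion times per task) is an assumption for illustration and is not the dataset's actual schema; the function names are hypothetical.

```python
def turnaround_times(tasks):
    """TAT per task: completion time of the last step minus arrival time.

    tasks: list of (arrival_time, [completion time of each step]).
    """
    return [max(done) - arrival for arrival, done in tasks]


def empirical_cdf(samples):
    """Sorted (value, fraction of samples <= value) pairs."""
    ordered = sorted(samples)
    n = len(ordered)
    return [(x, (i + 1) / n) for i, x in enumerate(ordered)]


# Three hypothetical tasks with one or two steps each (times in seconds).
tats = turnaround_times([(0, [5, 9]), (2, [4]), (3, [10, 7])])
cdf = empirical_cdf(tats)
```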

SamaHub, the platform of Samasource, considers both agents and steps to be flexible. We implement a simplified version of the relevant decentralized algorithm, called step_flex, in which steps with higher precedence are prioritized and choose agents greedily with random tie-breaking.
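
The following is a rough sketch of one allocation round in the spirit of step_flex, under simplifying assumptions that are ours: a single skill, a numeric `rank` in which a smaller value means higher precedence, and a greedy rule in which a step picks the agent with the most free time. The function and field names are hypothetical and this is not the implementation used in the experiments.

```python
import random

def step_flex_round(steps, capacity, rng):
    """Greedy round: steps pick agents in precedence order, ties broken randomly.

    steps:    list of dicts with 'id', 'rank' (smaller = higher precedence,
              an assumption) and 'work' (seconds still required).
    capacity: dict mapping agent -> seconds available in this round.
    Both inputs are updated in place; returns (step_id, agent, seconds) tuples.
    """
    allocation = []
    # Higher-precedence steps choose first; equal ranks are ordered at random.
    order = sorted(steps, key=lambda s: (s["rank"], rng.random()))
    for step in order:
        while step["work"] > 0:
            free = [a for a, c in capacity.items() if c > 0]
            if not free:
                return allocation  # no agent time left in this round
            best = max(capacity[a] for a in free)
            # Random tie-breaking among agents with the most free time.
            agent = rng.choice([a for a in free if capacity[a] == best])
            served = min(step["work"], capacity[agent])
            step["work"] -= served
            capacity[agent] -= served
            allocation.append((step["id"], agent, served))
    return allocation

# Hypothetical toy instance (seconds).
steps = [{"id": "s1", "rank": 1, "work": 50}, {"id": "s0", "rank": 0, "work": 80}]
capacity = {"a": 60, "b": 40}
alloc = step_flex_round(steps, capacity, random.Random(0))
```

Because steps are flexible, the work of `s0` is split across both agents, and `s1` receives only the time that remains.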

Fig. 1: CDF of task turn-around time (TAT) on the real dataset. Current allocation on the platform (“current”) vs. our algorithm (“step_flex”).
Fig. 2: Performance of our step_flex algorithm on real data, as a function of the number of workers. (a) Task turn-around time (TAT). (b) Average backlog (number of unallocated steps in the system).

To compare the current allocation on SamaHub with our approach, we use the real data as input to step_flex. Since we lack exact knowledge of worker availability, we make the following assumption in consultation with Samasource. The number of active workers in the system is , evenly distributed across four time zones: , where each worker works every day from am to pm. Each worker possesses the skills required for any substep in the dataset. Fig. 1 compares the cdf of TAT of our approach step_flex (simulated with the data as input) with that of the currently deployed scheme. Our algorithm substantially outperforms the current scheme: average TAT for all projects is better and more than better for real-time projects. This improvement is also influenced by our implementation, which is not restricted by the currently-practiced organizational structure.

Fig. 2 shows how step_flex performs as a function of the number of workers. As the number of workers grows, TAT decreases (see Fig. 2(a)). The benefit of adding more workers can be seen even more clearly when analyzing backlog, i.e., the average number of steps that have entered the system but are not yet scheduled (see Fig. 2(b)).
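
One way to make the backlog metric concrete is as a time average of the number of steps that have arrived but are not yet scheduled. The sketch below assumes that arrival and scheduling timestamps are available per step; the function name and the event representation are illustrative choices rather than details from the experiments.

```python
def time_average_backlog(arrivals, schedule_times, horizon):
    """Time-averaged number of arrived-but-unscheduled steps over [0, horizon]."""
    # +1 when a step enters the system, -1 when it is scheduled.
    events = [(t, +1) for t in arrivals] + [(t, -1) for t in schedule_times]
    events.sort()
    area, backlog, last = 0.0, 0, 0.0
    for t, delta in events:
        t = min(t, horizon)              # ignore anything past the horizon
        area += backlog * (t - last)     # backlog is constant between events
        backlog += delta
        last = t
    area += backlog * (horizon - last)   # tail segment up to the horizon
    return area / horizon

# Hypothetical toy trace: three arrivals, two of them scheduled.
avg = time_average_backlog([0.0, 2.0, 4.0], [1.0, 5.0], horizon=10.0)
```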

Fig. 3: Performance of our algorithms on synthetic data with short substeps ( sec), as a function of load. (a) Task turn-around time (TAT). (b) Average backlog (number of unscheduled steps in the system). (c) Worker utilization.
Fig. 4: Performance of our algorithms on synthetic data with long substeps ( sec), as a function of load. (a) Task turn-around time (TAT). (b) Average backlog (number of unscheduled steps in the system). (c) Worker utilization.

We also evaluate our algorithms on synthetic data, considering two systems: flexible agents with flexible steps, and flexible agents with inflexible steps. Algorithm step_flex is used for the first system. For the second we use step_inflex, a simplified version of the Restricted Greedy scheme in which steps with higher skill requirements are prioritized and allocation among them is greedy. We also consider a scenario in between flexible and inflexible steps, where each substep is allocated to a single agent, but different substeps of a step can be allocated to different agents. For this, we develop step_semiflex, where steps allocate themselves greedily while ensuring that each substep gets all its service from one agent. We expected step_flex to outperform step_inflex, but we found, somewhat surprisingly, that step_flex and step_semiflex perform very similarly.
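
To illustrate how the three flexibility levels differ, the sketch below checks greedily whether a single step can be served by the currently free agents under each mode. It assumes that an agent has one pool of free seconds usable on any skill the agent possesses, and it is a greedy heuristic, so it may report failure where a cleverer assignment exists. All names are hypothetical.

```python
def greedy_can_allocate(step, agents, mode):
    """Greedy feasibility check for one step.

    step:   dict mapping substep skill type -> seconds required.
    agents: list of (skills, free_seconds) pairs, skills being a set.
    mode:   'inflex' (one agent serves the whole step),
            'semiflex' (each substep served by a single agent),
            'flex' (a substep may be split across agents).
    """
    if mode == "inflex":
        total = sum(step.values())
        return any(set(step) <= skills and free >= total
                   for skills, free in agents)
    free = [f for _, f in agents]
    # Serve the largest substeps first.
    for skill, work in sorted(step.items(), key=lambda kv: -kv[1]):
        capable = [i for i, (skills, _) in enumerate(agents) if skill in skills]
        if mode == "semiflex":
            fits = [i for i in capable if free[i] >= work]
            if not fits:
                return False
            i = max(fits, key=lambda j: free[j])
            free[i] -= work
        else:  # 'flex': split the substep across capable agents
            for i in sorted(capable, key=lambda j: -free[j]):
                served = min(work, free[i])
                free[i] -= served
                work -= served
            if work > 0:
                return False
    return True

# Hypothetical agents: (skill set, free seconds).
agents = [({"A", "B"}, 50), ({"A"}, 30), ({"B"}, 30)]
two_substeps = {"A": 40, "B": 30}
one_long_substep = {"A": 60}
```

In this toy instance `two_substeps` cannot be served under 'inflex' but can under 'semiflex' and 'flex', while `one_long_substep` can be served only under 'flex', since no single agent has 60 free seconds.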

The first set of generated data consists of tasks with up to three steps each, with strict ordering. Each step comprises one to three random substeps out of five possible types. The working time requirement for each substep is uniformly distributed between and sec. Each worker in the system has daily availability from am to pm, evenly distributed across four time zones: . A worker possesses a random set of skills that enables her to work on up to three (out of five) substep types. For each of our three algorithms we compare three metrics: TAT, backlog queue, and worker utilization. The experiment simulated a single run over a timespan of days.
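
A generator for data of this shape might look as follows. The working-time bounds `lo` and `hi` are free parameters here, the type labels are placeholders, and we assume that the substep types within a step are distinct; none of these choices is taken from the experiments.

```python
import random

SUBSTEP_TYPES = ["A", "B", "C", "D", "E"]  # five placeholder substep types

def make_task(rng, lo, hi):
    """A task is an ordered list of steps; a step maps substep type -> seconds."""
    task = []
    for _ in range(rng.randint(1, 3)):                 # up to three steps
        kinds = rng.sample(SUBSTEP_TYPES, rng.randint(1, 3))  # 1 to 3 substeps
        task.append({k: rng.uniform(lo, hi) for k in kinds})
    return task

def make_worker(rng):
    """A worker's skill set covers up to three of the five substep types."""
    return set(rng.sample(SUBSTEP_TYPES, rng.randint(1, 3)))

rng = random.Random(1)
task = make_task(rng, lo=10.0, hi=20.0)   # hypothetical bounds in seconds
worker = make_worker(rng)
```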

Fig. 3 shows that algorithms step_flex and step_semiflex outperform step_inflex in both cases: workers in the system and workers. When the load on the system is tasks/hour and the number of workers is , algorithm step_inflex is substantially worse since it becomes unstable for this load. Notice that step_flex and step_semiflex perform very similarly, which can be explained by the relatively short substep working time requirement (in which case splitting becomes a rare event). Also note that the worker utilization of step_inflex is not much worse than that of the other algorithms. This can be explained by the long backlog queue of step_inflex. Though it is harder for step_inflex to find a worker capable of working on the whole step, when the backlog becomes large, the probability that a given worker will be assigned to some whole step grows.

The last set of results uses the same synthetic data as before, but the working time requirement for each substep is now uniformly distributed between and sec. Fig. 4 shows a slight advantage of step_flex over step_semiflex. Due to the longer working time requirements per substep, cases in which a substep may be split to improve allocation are more probable. In this scenario, the disadvantage of step_inflex is more obvious: for a load of tasks/hour and workers, its TAT and backlog are very large and unstable.

To summarize, our approach substantially outperforms Samasource’s current allocation scheme. While step_flex achieves the best performance in terms of TAT and backlog, step_semiflex may be a good alternative: it performs almost as well, does not require splitting substeps among different workers, and is computationally lighter.

IX Conclusion

Inspired by skilled crowdsourcing systems, we have developed new algorithms for allocating tasks to agents while handling novel system properties such as vector-valued service requirements, precedence and flexibility constraints, random and time-varying resource availability, large system size, need for simple decentralized schemes requiring minimal actions from the platform provider, and the freedom of customers to choose agents without compromising system performance. We have provided capacity regions, asymptotic performance guarantees for decentralized algorithms, and demonstration of efficacy in practical regimes, via large-scale data from a non-profit crowdsourcing company.


  • [1] A. Chatterjee, M. Borokhovich, L. R. Varshney, and S. Vishwanath, “Efficient and flexible crowdsourcing of specialized tasks with precedence constraints,” in Proc. 2016 IEEE INFOCOM, Apr. 2016 (to appear).
  • [2] A. Cuenin, “Each of the top 25 best global brands has used crowdsourcing,” Jun. 2015. [Online]. Available: http://www.crowdsourcing.org/editorial/each-of-the-top-25-best-global-brands-has-used-crowdsourcing/50145
  • [3] D. Tapscott and A. D. Williams, Wikinomics: How Mass Collaboration Changes Everything, expanded ed.   New York: Portfolio Penguin, 2006.
  • [4] F. Gino and B. R. Staats, “The microwork solution,” Harvard Bus. Rev., vol. 90, no. 12, pp. 92–96, Dec. 2012.
  • [5] A. Marcus and A. Parameswaran, “Crowdsourced data management: Industry and academic perspectives,” Foundations and Trends in Databases, vol. 6, no. 1-2, pp. 1–161, Dec. 2015.
  • [6] T. W. Malone, R. Laubacher, and C. Dellarocas, “The collective intelligence genome,” MIT Sloan Manage. Rev., vol. 51, no. 3, pp. 21–31, Spring 2010.
  • [7] K. J. Boudreau and K. R. Lakhani, “Using the crowd as an innovation partner,” Harvard Bus. Rev., vol. 91, no. 4, pp. 60–69, Apr. 2013.
  • [8] S. Dustdar and M. Gaedke, “The social routing principle,” IEEE Internet Comput., vol. 15, no. 4, pp. 80–83, July-Aug. 2011.
  • [9] D. DiPalantino and M. Vojnović, “Crowdsourcing and all-pay auctions,” in Proc. 10th ACM Conf. Electron. Commer. (EC’09), Jul. 2009, pp. 119–128.
  • [10] L. R. Varshney, S. Agarwal, Y.-M. Chee, R. R. Sindhgatta, D. V. Oppenheim, J. Lee, and K. Ratakonda, “Cognitive coordination of global service delivery,” arXiv:1406.0215v1 [cs.OH], Jun. 2014.
  • [11] J. Kleinberg and É. Tardos, Algorithm Design.   Addison-Wesley, 2005.
  • [12] R. Srikant and L. Ying, Communication Networks: An Optimization, Control and Stochastic Networks Perspective.   Cambridge University Press, 2014.
  • [13] M. L. Pinedo, Scheduling: Theory, Algorithms, and Systems.   Springer, 2012.
  • [14] G. Pang and A. L. Stolyar, “A service system with on-demand agent invitations,” Queueing Systems, Nov. 2015.
  • [15] A. Chatterjee, L. R. Varshney, and S. Vishwanath, “Work capacity of freelance markets: Fundamental limits and decentralized schemes,” in Proc. 2015 IEEE INFOCOM, Apr. 2015, pp. 1769–1777.
  • [16] F. A. Chudak and D. B. Shmoys, “Approximation algorithms for precedence-constrained scheduling problems on parallel machines that run at different speeds,” J. Algorithms, vol. 30, no. 2, pp. 323–343, Feb. 1999.
  • [17] R. Pedarsani, “Robust scheduling for queueing networks,” Ph.D. dissertation, University of California, Berkeley, Berkeley, CA, 2015.
  • [18] L. Tassiulas and A. Ephremides, “Stability properties of constrained queueing systems and scheduling policies for maximum throughput in multihop radio networks,” IEEE Trans. Autom. Control, vol. 37, no. 12, pp. 1936–1948, Dec. 1992.
  • [19] M. J. Neely, Stochastic Network Optimization with Application to Communication and Queueing Systems.   Morgan & Claypool, 2010.
  • [20] H. Kellerer, U. Pferschy, and D. Pisinger, Knapsack Problems.   Springer, 2004.
  • [21] D. A. G. Pritchard, “Linear programming tools and approximation algorithms for combinatorial optimization,” Ph.D. dissertation, University of Waterloo, 2009.
  • [22] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochastic multi-armed bandit problems,” Found. Trends Mach. Learn., vol. 5, no. 1, pp. 1–122, Dec. 2012.
  • [23] M. Vukovic and O. Stewart, “Collective intelligence applications in IT services business,” in Proc. IEEE 9th Int. Conf. Services Comput. (SCC), Jun. 2012, pp. 486–493.
  • [24] A. Chatterjee, L. R. Varshney, and S. Vishwanath, “Work capacity of freelance markets: Fundamental limits and decentralized schemes,” arXiv:1508.00023 [cs.MA], Jul. 2015.
  • [25] V. S. Borkar, Stochastic Approximation: A Dynamical Systems Viewpoint.   Cambridge University Press, 2008.

Appendix A Proofs

In this section we present proofs of the main results.

A-A Proof of Theorem 1

Here we only prove that any outside the closure of cannot be stabilized by any policy. To prove achievability, it is sufficient to show that there exists a policy that stabilizes any in the interior of . Hence, it is sufficient to prove Thm. 2, which we do later.

This proof consists of the following steps. We first compare two systems, the original system in question and another in which there is no precedence constraint among different steps of a job. We claim that on any sample path under any policy for the first system, there exists a policy in the second system so that the total number of incomplete jobs across all job types in the second is a lower bound (sample path-wise) for that in the first. Then we show that the second system cannot be stabilized for a outside the closure of and so the result follows for the first system.

Note that the claim regarding the number of incomplete jobs across all types in the second system being a lower bound on the first system follows by considering the same policy for the second system as for the first system.

To proceed, consider the second system, for which we denote the number of unallocated steps of type at epoch by . Now consider the set . We claim that this set is coordinate convex, i.e., it is a convex set and if for some then . To prove this claim, we first show that the set is coordinate convex.
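
For reference, one common way to formalize coordinate convexity of a set of nonnegative vectors is sketched below. The symbols $S$, $N$, $x$, $y$ and $\theta$ are generic placeholders introduced only for this illustration and are not the notation of the proof; the second condition is written in its downward-closed form, which is the property used in the argument that follows.

```latex
% Assumed generic notation: S is a set of nonnegative N-dimensional vectors.
S \subseteq \mathbb{R}_{+}^{N} \text{ is coordinate convex if}
\begin{align*}
&\text{(i)}\quad \theta x + (1-\theta)\, y \in S
   \quad \text{for all } x, y \in S,\ \theta \in [0,1], \\
&\text{(ii)}\quad x \in S \ \text{and}\ 0 \le y \le x \ \text{(component-wise)}
   \ \Longrightarrow\ y \in S .
\end{align*}
```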

First we prove that is a convex subset of . If , then there exist and such that

Thus for any ,

Note that is convex since it is the convex hull of ; hence , which in turn implies .

For coordinate convexity note that any is a combination of some and any is some convex combination of elements of . Also, from the allocation constraints it is apparent that if then also if . These two imply that for any , if there exists an (component-wise) and , then . Hence, is coordinate convex.

Note that . Note that if (coordinate-wise) then the same is true for and . Also, if for any , then . This proves that is a coordinate convex subset of .

As is coordinate convex, for any outside the closure of there exists s.t. .

Consider the following. Let , and . Note that as jobs in this system do not have precedence constraints, for all , .

where is the number of possible departures under the scheme if there were an infinite number of steps of each type, and is shorthand for . As is a convex function of , is a convex function of , and . Thus by Jensen’s inequality: