Regularized Minimax Conditional Entropy for Crowdsourcing
Abstract
There is a rapidly increasing interest in crowdsourcing for data labeling. By crowdsourcing, a large number of labels can often be gathered quickly and at low cost. However, the labels provided by the crowdsourcing workers are usually not of high quality. In this paper, we propose a minimax conditional entropy principle to infer ground truth from noisy crowdsourced labels. Under this principle, we derive a unique probabilistic labeling model jointly parameterized by worker ability and item difficulty. We also propose an objective measurement principle, and show that our method is the only method which satisfies this objective measurement principle. We validate our method through a variety of real crowdsourcing datasets with binary, multiclass or ordinal labels.
Keywords: crowdsourcing, human computation, minimax conditional entropy
1 Introduction
In many real-world applications, the quality of a machine learning system is governed by the number of labeled training examples, but the labor for data labeling is usually costly. There has been considerable machine learning research on learning when there are only a few labeled examples, such as semi-supervised learning and active learning. In recent years, with the emergence of crowdsourcing (or human computation) services like Amazon Mechanical Turk (https://www.mturk.com), the costs associated with collecting labeled data in many domains have dropped dramatically, enabling the collection of large amounts of labeled data at a low cost. However, the labels provided by the workers are often not of high quality, in part due to misaligned incentives and a lack of domain expertise among the workers. To overcome this quality issue, the items are generally labeled redundantly by several different workers, and the workers’ labels are then aggregated in some manner, for example, by majority voting.
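As a point of reference, majority voting can be sketched in a few lines. This is only an illustrative sketch: the function name `majority_vote`, the dictionary layout of the crowd labels, and the tie-breaking rule are assumptions made here and are not taken from the paper.

```python
from collections import Counter

def majority_vote(labels):
    """Aggregate redundant labels by majority voting.

    labels: dict mapping an item identifier to the list of labels it received.
    Ties go to the label that was seen first (an arbitrary choice for this sketch).
    """
    return {item: Counter(votes).most_common(1)[0][0]
            for item, votes in labels.items()}

# Hypothetical toy data: two items, each labeled by several workers.
crowd = {"item_a": [1, 1, 0], "item_b": [2, 0, 2, 2]}
aggregated = majority_vote(crowd)
```

Every worker contributes exactly one equally weighted vote here, which is the assumption questioned in the next paragraph.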
The assumption underlying majority voting is that all workers are equally good, so they have an equal vote. Obviously, such an assumption does not reflect the truth. It is easy to imagine that one worker is more capable than another in some labeling task. More subtly, the skill level of a worker may vary significantly from one labeling category to another. To address these issues, Dawid and Skene (1979) propose a model which assumes that each worker has a latent probabilistic confusion matrix for generating her labels. The off-diagonal elements of the matrix represent the probabilities that the worker mislabels an item from one class as another, while the diagonal elements correspond to her accuracy in each class. The true labels of the items and the confusion matrices of the workers can be jointly estimated by maximizing the likelihood of the workers’ labels.
In the Dawid-Skene method, the performance of a worker characterized by her confusion matrix stays the same across all items in the same class. That is not true in many labeling tasks, where some items are more difficult to label than others, and a worker is more likely to mislabel a difficult item than an easy one. Moreover, an item may be easily mislabeled as some class rather than others by whoever labels it. To address these issues, we develop a minimax conditional entropy principle for crowdsourcing. Under this principle, we derive a unique probabilistic model which takes both worker ability and item difficulty into account. When item difficulty is ignored, our model seamlessly reduces to the classical Dawid-Skene model. We also propose a natural objective measurement principle, and show that our method is the only method which satisfies this objective measurement principle.
This work is an extension of the earlier results presented in (Zhou et al., 2012, 2014). We organize the paper as follows. In Section 2, we propose the minimax conditional entropy principle for aggregating multiclass labels collected from a crowd and derive its dual form. In Section 3, we develop regularized minimax conditional entropy for preventing overfitting and generating probabilistic labels. In Section 4, we propose the objective measurement principle which also leads to the probabilistic model derived from the minimax conditional entropy principle. In Section 5, we extend our minimax conditional entropy method to ordinal labels, where we need to introduce a new assumption called adjacency confusability. In Section 6, we present a simple yet efficient coordinate ascent method to solve the minimax program through its dual form and also a method for model selection. Related work is discussed in Section 7. Empirical results on real crowdsourcing data with binary, multiclass or ordinal labels are reported in Section 8, and conclusions are presented in Section 9.
2 Minimax Conditional Entropy Principle
In this section, we present the minimax conditional entropy principle for aggregating crowdsourced multiclass labels in both its primal and dual forms. We also show that minimax conditional entropy is equivalent to minimizing Kullback-Leibler (KL) divergence.
2.1 Notation and Problem Setting
Assume that there are a group of workers indexed by $i$, a set of items indexed by $j$, and a number of classes indexed by $k$ or $c$. Let $x_{ij}$ be the observed label that worker $i$ assigns to item $j$, and $X_{ij}$ be the corresponding random variable. Denote by $Q(Y_j = c)$ the unobserved true probability that item $j$ belongs to class $c$. A special case is that $Q(Y_j = c) = 1$ and $Q(Y_j = k) = 0$ for any other class $k \neq c$. That is, the labels are deterministic. Denote by $P(X_{ij} = k \mid Y_j = c)$ the probability that worker $i$ labels item $j$ as class $k$ while the true label is $c$. Our goal is to estimate the unobserved true labels from the noisy workers’ labels.
2.2 Primal Form
Our approach is built upon two four-dimensional tensors with the four dimensions corresponding to workers $i$, items $j$, observed labels $k$, and true labels $c$. The first tensor is referred to as the empirical confusion tensor, of which each element is given by
$$\hat{\phi}_{ij}(c,k) = Q(Y_j = c)\, I(x_{ij} = k)$$
to represent an observed confusion from class $c$ to class $k$ by worker $i$ on item $j$. The other tensor is referred to as the expected confusion tensor, of which each element is given by
$$\phi_{ij}(c,k) = Q(Y_j = c)\, P(X_{ij} = k \mid Y_j = c)$$
to represent an expected confusion from class $c$ to class $k$ by worker $i$ on item $j$.
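The two tensors can be computed with array broadcasting. The sketch below assumes that the empirical entry is $Q(Y_j = c)$ times the indicator that worker $i$ labeled item $j$ as $k$, and that the expected entry is $Q(Y_j = c)$ times the worker's labeling probability. The array layout `[i, j, c, k]`, the use of `-1` for a missing label, and the function names are assumptions made here for illustration.

```python
import numpy as np

def empirical_confusion(Q, x, K):
    """Q: (J, K) array with Q[j, c] = Q(Y_j = c).
    x: (I, J) integer array of observed labels, -1 where worker i skipped item j.
    Returns phi_hat[i, j, c, k] = Q[j, c] * 1[x[i, j] == k]."""
    onehot = (x[:, :, None] == np.arange(K)).astype(float)   # (I, J, K); all-zero if missing
    return Q[None, :, :, None] * onehot[:, :, None, :]

def expected_confusion(Q, P, x):
    """P: (I, J, K, K) array with P[i, j, c, k] = P(X_ij = k | Y_j = c).
    Entries for unlabeled (i, j) pairs are zeroed out."""
    labeled = (x >= 0).astype(float)                          # (I, J)
    return Q[None, :, :, None] * P * labeled[:, :, None, None]

# Hypothetical toy input: 2 workers, 2 items, 2 classes.
Q = np.array([[1.0, 0.0], [0.5, 0.5]])
x = np.array([[0, 1], [1, -1]])
phi_hat = empirical_confusion(Q, x, K=2)
worker_counts = phi_hat.sum(axis=1)   # sums over items, one K x K table per worker
item_counts = phi_hat.sum(axis=0)     # sums over workers, one K x K table per item
```

Summing over the item axis or the worker axis gives the per-worker and per-item confusion counts that the constraints of the primal problem compare against their expected counterparts.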
We assume that the labels of the items are independent. Thus, the entropy of the observed workers’ labels conditioned on the true labels can be written as
$$H(X \mid Y) = -\sum_{j,c} Q(Y_j = c) \sum_{i,k} P(X_{ij} = k \mid Y_j = c) \log P(X_{ij} = k \mid Y_j = c).$$
Both the distributions $P$ and $Q$ are unknown here. To attack this problem, we first consider a simpler problem: estimate $P$ when $Q$ is given. Then, we proceed to jointly estimating $P$ and $Q$ when both are unknown.
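Given the true label distribution $Q$, we propose to estimate $P$, which generates the workers’ labels, by
$$\max_{P} \; H(X \mid Y) \tag{1}$$
subject to the worker and item constraints (Figure 1)
$$\sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = 0, \quad \forall i, k, c, \tag{2a}$$
$$\sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = 0, \quad \forall j, k, c, \tag{2b}$$
plus the probability constraints
$$\sum_{k} P(X_{ij} = k \mid Y_j = c) = 1, \quad \forall i, j, c, \tag{3a}$$
$$\sum_{c} Q(Y_j = c) = 1, \quad \forall j, \tag{3b}$$
$$Q(Y_j = c) \geq 0, \quad \forall j, c. \tag{3c}$$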
The constraints in Equation (2a) enforce the expected confusion counts in the worker dimension to match their empirical counterparts. Symmetrically, the constraints in Equation (2b) enforce the expected confusion counts in the item dimension to match their empirical counterparts. An illustration of empirical confusion tensors is shown in Figure 2.
When both the distributions $P$ and $Q$ are unknown, we propose to jointly estimate them by
$$\min_{Q} \max_{P} \; H(X \mid Y) \tag{4}$$
subject to the constraints in Equations (2) and (3). Intuitively, entropy can be understood as a measure of uncertainty. Thus, minimizing the maximum conditional entropy means that, given the true labels, the workers’ labels are the least random. Theoretically, minimizing the maximum conditional entropy can be connected to maximum likelihood. In what follows, we show how the connection is established.
2.3 Dual Form
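The Lagrangian of the maximization problem in (4) can be written as
$$L = H(X \mid Y) + L_\sigma + L_\tau + L_\lambda \tag{5}$$
with
$$L_\sigma = \sum_{i,c,k} \sigma_i(c,k) \sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right],$$
$$L_\tau = \sum_{j,c,k} \tau_j(c,k) \sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right],$$
$$L_\lambda = \sum_{i,j,c} \lambda_{ijc} \left[\sum_{k} P(X_{ij} = k \mid Y_j = c) - 1\right],$$
where $\sigma_i(c,k)$, $\tau_j(c,k)$ and $\lambda_{ijc}$ are introduced as the Lagrange multipliers. By the Karush-Kuhn-Tucker (KKT) conditions (Boyd and Vandenberghe, 2004),
$$\frac{\partial L}{\partial P(X_{ij} = k \mid Y_j = c)} = \lambda_{ijc} + Q(Y_j = c)\left[-\log P(X_{ij} = k \mid Y_j = c) - 1 + \sigma_i(c,k) + \tau_j(c,k)\right] = 0,$$
which implies
$$P(X_{ij} = k \mid Y_j = c) = \exp\left[\sigma_i(c,k) + \tau_j(c,k) + \lambda_{ijc} / Q(Y_j = c) - 1\right].$$
Combining the above equation and the probability constraints in (3a) eliminates $\lambda$ and yields
$$P(X_{ij} = k \mid Y_j = c) = \frac{1}{Z_{ij}} \exp\left[\sigma_i(c,k) + \tau_j(c,k)\right], \tag{6}$$
where $Z_{ij}$ is the normalization factor given by
$$Z_{ij} = \sum_{k} \exp\left[\sigma_i(c,k) + \tau_j(c,k)\right].$$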
Although the matrices $[\sigma_i(c,k)]$ and $[\tau_j(c,k)]$ in Equation (6) come out as the mathematical consequence of minimax conditional entropy, they can be understood intuitively. We can consider the matrix $[\sigma_i(c,k)]$ as the measure of the intrinsic ability of worker $i$. The $(c,k)$-th entry measures how likely worker $i$ is to label a randomly chosen item in class $c$ as class $k$. Similarly, we can consider the matrix $[\tau_j(c,k)]$ as the measure of the intrinsic difficulty of item $j$. The $(c,k)$-th entry measures how likely item $j$ in class $c$ is to be labeled as class $k$ by a randomly chosen worker. In the following, we refer to $[\sigma_i(c,k)]$ as worker confusion matrices and $[\tau_j(c,k)]$ as item confusion matrices.
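The labeling model can be evaluated as a row-wise softmax over the sum of a worker matrix and an item matrix. The following is a minimal sketch under the assumption that the model has the exponential form with one unnormalized log-score per (true class, observed class) pair, normalized over the observed class. The names `labeling_model`, `sigma`, and `tau` and the array layout are chosen here for illustration.

```python
import numpy as np

def labeling_model(sigma, tau):
    """sigma: (I, K, K) worker matrices, tau: (J, K, K) item matrices.
    Returns P with P[i, j, c, k] proportional to exp(sigma[i, c, k] + tau[j, c, k]),
    normalized over the observed label k."""
    logits = sigma[:, None, :, :] + tau[None, :, :, :]        # (I, J, K, K)
    logits = logits - logits.max(axis=-1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
sigma = rng.normal(size=(3, 3, 3))          # 3 hypothetical workers, 3 classes
tau_zero = np.zeros((2, 3, 3))              # 2 items with no item-specific effect
P = labeling_model(sigma, tau_zero)
```

With all item matrices set to zero, every item sees the same per-worker confusion matrix, which is the Dawid-Skene special case mentioned in the introduction.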
Substituting the labeling model in Equation (6) into the Lagrangian in Equation (5), we can obtain the dual form of the minimax problem (4) as (see Appendix A)
$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c). \tag{7}$$
The objective is linear in $Q$, so, to be optimal, the true label distribution has to be deterministic. Thus, the dual Lagrangian can be equivalently expressed as the complete log-likelihood
$$\log \prod_{j} \sum_{c} \left\{ Q(Y_j = c) \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) \right\}.$$
In Section 3, we show how to regularize the objective function in (4) to generate probabilistic labels.
2.4 Minimizing KL Divergence
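Let us extend the two distributions $P$ and $Q$ to the product space $\mathcal{X} \times \mathcal{Y}$. We extend the distribution $Q$ by defining $Q(X_{ij} = x_{ij}) = 1$, and $Q(Y)$ stays the same. We extend the distribution $P$ with $P(X, Y) = \prod_{j} P(Y_j) \prod_{i} P(X_{ij} \mid Y_j)$, where $P(X_{ij} \mid Y_j)$ is given by Equation (6), and $P(Y_j)$ is a uniform distribution over all possible classes. Then, we have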
Theorem 2.1
When the true labels are deterministic, minimizing the KL divergence from $Q$ to $P$, that is,
$$\min_{P, Q} \; D_{\mathrm{KL}}(Q \,\|\, P), \tag{8}$$
is equivalent to the minimax problem in (4).
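The proof is presented in Appendix B. A sketch of the proof is as follows. We show that
$$D_{\mathrm{KL}}(Q \,\|\, P) = -\sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) - \sum_{j,c} Q(Y_j = c) \log P(Y_j = c) - H(Y),$$
where $H(Y) = -\sum_{j,c} Q(Y_j = c) \log Q(Y_j = c)$ is the entropy of the true label distribution. By the definition of $P$, $P(Y_j = c)$ is a constant. Moreover, when the true labels are deterministic, we have
$$H(Y) = 0.$$
Minimizing the KL divergence therefore amounts to maximizing the dual objective in (7). This concludes the proof of this theorem.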
3 Regularized Minimax Conditional Entropy
In this section, we regularize our minimax conditional entropy method to address two practical issues:

Preventing overfitting. While crowdsourcing is cheap, collecting many redundant labels may be more expensive than hiring experts. Typically, the number of labels collected for each item is limited to a small number. In this case, the empirical counts in Equation (2) may not match their expected values. It is likely that they fluctuate around their expected values although these fluctuations are not large.

Generating probabilistic labels. Our minimax conditional entropy method can only generate deterministic labels (see Section 2.3). In practice, probabilistic labels are usually more useful than deterministic labels. When the estimated label distribution for an item is close to uniform over several classes, we can either ask for more labels for the item from the crowd or forward the item to an external expert.
For addressing the issue of overfitting, we formulate our observation by replacing exact matching with approximate matching while penalizing large fluctuations. For generating probabilistic labels, we consider an entropy regularization over the unknown true label distribution. This is motivated by the analysis in Section 2.4.
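Formally, we regularize our minimax conditional entropy method as follows. Let us denote the entropy of the true label distribution by
$$H(Y) = -\sum_{j,c} Q(Y_j = c) \log Q(Y_j = c).$$
To estimate the true labels, we consider
$$\min_{Q} \max_{P, \xi, \zeta} \; H(X \mid Y) - H(Y) - \frac{1}{\alpha} \Omega(\xi) - \frac{1}{\beta} \Psi(\zeta) \tag{9}$$
subject to the relaxed worker and item constraints
$$\sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = \xi_i(c,k), \quad \forall i, k, c, \tag{10a}$$
$$\sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = \zeta_j(c,k), \quad \forall j, k, c, \tag{10b}$$
plus the probability constraints in Equation (3). The regularization functions $\Omega$ and $\Psi$ are chosen as
$$\Omega(\xi) = \frac{1}{2} \sum_{i} \sum_{c,k} \left[\xi_i(c,k)\right]^2, \tag{11a}$$
$$\Psi(\zeta) = \frac{1}{2} \sum_{j} \sum_{c,k} \left[\zeta_j(c,k)\right]^2. \tag{11b}$$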
The new slack variables in Equation (10) model the possible fluctuations. Note that these slack variables are not restricted to be positive. When there are a sufficiently large number of observations, the fluctuations should be approximately normally distributed, due to the central limit theorem. This observation motivates the choice of the regularization functions in (11) to penalize large fluctuations. The entropy term in the objective function, which is introduced for generating probabilistic labels, can be regarded as penalizing a large deviation from the uniform distribution.
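Substituting the labeling model from Equation (6) into the Lagrangian of (9), we obtain the dual form (see Appendix C)
$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \alpha \Omega^*(\sigma) - \beta \Psi^*(\tau), \tag{12}$$
where
$$\Omega^*(\sigma) = \frac{1}{2} \sum_{i} \sum_{c,k} \left[\sigma_i(c,k)\right]^2, \tag{13}$$
$$\Psi^*(\tau) = \frac{1}{2} \sum_{j} \sum_{c,k} \left[\tau_j(c,k)\right]^2. \tag{14}$$
When $\alpha = 0$ and $\beta = 0$, the objective function in (12) turns out to be a lower bound of the log marginal likelihood
$$\log \prod_{j} \sum_{c} \prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c) = \sum_{j} \log \sum_{c} Q(Y_j = c) \frac{\prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c)}{Q(Y_j = c)}$$
$$\geq \sum_{j} \sum_{c} Q(Y_j = c) \log \frac{\prod_{i} P(X_{ij} = x_{ij} \mid Y_j = c)}{Q(Y_j = c)} = \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y).$$
The inequality is based on Jensen’s inequality. Maximizing the marginal likelihood is more appropriate than maximizing the complete likelihood since only the observed data matters in our inference.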
Finally, we introduce a variant of our regularized minimax conditional entropy. It is obtained by restricting the feasible region of the slack variables through
$$\xi_i(c,c) = 0, \quad \forall i, c. \tag{15}$$
By Equation (10a), this is equivalent to
$$\sum_{j} \left[\phi_{ij}(c,c) - \hat{\phi}_{ij}(c,c)\right] = 0, \quad \forall i, c.$$
That is, the empirical count of the correct answers from each worker is equal to its expectation. According to the law of large numbers, this assumption is approximately correct when a worker has a sufficiently large number of correct answers. Note that this does not mean that the percentage of the correct answers from the worker has to be large. Let $K$ denote the class size. Under the additional constraints in Equation (15), the dual problem can still be expressed by (12) except (see Appendix C)
(16) 
where
In our empirical evaluations, this variant is somewhat worse than the original version on most datasets. We include it here only for theoretical interest.
4 Objective Measurement Principle
In this section, we introduce a natural objective measurement principle, and show that the probabilistic labeling model in Equation (6) is a consequence of this principle.
Intuitively, the objective measurement principle can be described as follows:

(1) A comparison of labeling difficulty between two items should be independent of which particular workers were involved in the comparison; and it should also be independent of which other items might also be compared.

(2) Symmetrically, a comparison of labeling ability between two workers should be independent of which particular items were involved in the comparison; and it should also be independent of which other workers might also be compared.
Next we mathematically define the objective measurement principle.
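Assume that worker $i$ has labeled items $j$ and $j'$ in class $c$. Denote by $E$ the event that one of these two items is labeled as $k$ and the other is labeled as $c$. Formally,
$$E = \left\{ I(X_{ij} = k) + I(X_{ij'} = k) = 1, \; I(X_{ij} = c) + I(X_{ij'} = c) = 1 \right\}.$$
Denote by $A$ the event that item $j$ is labeled as $k$ and item $j'$ is labeled as $c$. Formally,
$$A = \left\{ X_{ij} = k, \; X_{ij'} = c \right\}.$$
It is obvious that $A \subset E$. Now we formulate the requirement (1) in the objective measurement principle as follows: $P(A \mid E)$ is independent of worker $i$. Note that
$$P(A \mid E) = \frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c) + P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}.$$
Hence, $P(A \mid E)$ is independent of worker $i$ if and only if
$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)}$$
is independent of worker $i$. In other words, given another arbitrary worker $i'$, we should have
$$\frac{P(X_{ij} = k \mid Y_j = c)\, P(X_{ij'} = c \mid Y_{j'} = c)}{P(X_{ij} = c \mid Y_j = c)\, P(X_{ij'} = k \mid Y_{j'} = c)} = \frac{P(X_{i'j} = k \mid Y_j = c)\, P(X_{i'j'} = c \mid Y_{j'} = c)}{P(X_{i'j} = c \mid Y_j = c)\, P(X_{i'j'} = k \mid Y_{j'} = c)}.$$
Without loss of generality, we choose $i' = 0$ and $j' = 0$ as the fixed references. Then,
$$\frac{P(X_{ij} = k \mid Y_j = c)}{P(X_{ij} = c \mid Y_j = c)} = \frac{P(X_{i0} = k \mid Y_0 = c)}{P(X_{i0} = c \mid Y_0 = c)} \cdot \frac{P(X_{0j} = k \mid Y_j = c)\, P(X_{00} = c \mid Y_0 = c)}{P(X_{0j} = c \mid Y_j = c)\, P(X_{00} = k \mid Y_0 = c)}.$$
By the fact that probabilities are nonnegative, we can write
$$\frac{P(X_{i0} = k \mid Y_0 = c)}{P(X_{i0} = c \mid Y_0 = c)} = \exp\left[\sigma_i(c,k)\right], \qquad \frac{P(X_{0j} = k \mid Y_j = c)\, P(X_{00} = c \mid Y_0 = c)}{P(X_{0j} = c \mid Y_j = c)\, P(X_{00} = k \mid Y_0 = c)} = \exp\left[\tau_j(c,k)\right],$$
so that $P(X_{ij} = k \mid Y_j = c)$ is proportional to $\exp[\sigma_i(c,k) + \tau_j(c,k)]$.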
The probabilistic labeling model in Equation (6) follows immediately. It is easy to verify that due to the symmetry between item difficulty and worker ability, we can instead start from formulating the requirement (2) in the objective measurement principle to achieve the same result. Hence, in this sense, the two requirements are actually redundant.
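The invariance behind requirement (1) can be checked numerically. The sketch below assumes the exponential labeling model with worker scores `sigma` and item scores `tau` (names and array layout chosen here), draws random parameters, and evaluates the conditional probability that item `j` received label `k` and item `j2` received label `c`, given that exactly one of them received each label. Under that model the worker terms cancel, so the value should be the same for every worker.

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 4, 3, 3                                   # hypothetical sizes
sigma = rng.normal(size=(I, K, K))                  # worker scores, indexed [i, c, k]
tau = rng.normal(size=(J, K, K))                    # item scores, indexed [j, c, k]

logits = sigma[:, None] + tau[None]                 # (I, J, K, K)
P = np.exp(logits)
P /= P.sum(axis=-1, keepdims=True)                  # P[i, j, c, k] = P(X_ij = k | Y_j = c)

j, j2, c, k = 0, 1, 2, 0                            # two items in true class c, wrong label k
num = P[:, j, c, k] * P[:, j2, c, c]                # item j labeled k, item j2 labeled c
den = P[:, j, c, c] * P[:, j2, c, k]                # the swapped outcome
prob_A_given_E = num / (num + den)                  # one value per worker

# Closed form that depends on the two items only.
d = tau[j, c, k] - tau[j, c, c] - tau[j2, c, k] + tau[j2, c, c]
item_only = 1.0 / (1.0 + np.exp(-d))
```

All entries of `prob_A_given_E` agree with `item_only`, so the comparison between the two items does not depend on which worker made it.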
5 Extension to Ordinal Labels
In this section, we extend the minimax conditional entropy principle from multiclass to ordinal labels. Eliciting ordinal labels is important in tasks such as judging the relative quality of web search results or consumer products. Since ordinal labels are a special case of multiclass labels, the approach that we have developed in the previous sections can be used to aggregate ordinal labels. However, we observe that, in ordinal labeling, workers usually have an error pattern different from what we observe in multiclass labeling. We summarize our observation as the adjacency confusability assumption, and formulate it by introducing a different set of constraints for workers and items.
5.1 Adjacency Confusability
In ordinal labeling, workers usually have difficulty distinguishing between two adjacent ordinal classes whereas distinguishing between two classes which are far away from each other is much easier. We refer to this observation as adjacency confusability.
To illustrate this observation, let us consider the example of screening mammograms. A mammogram is an X-ray picture used to check for breast cancer in women. Radiologists often rate mammograms on a scale such as no cancer, benign cancer, possible malignancy, or malignancy. In screening mammograms, a radiologist may rate a mammogram which indicates possible malignancy as malignancy, but it is less likely that she rates a mammogram which indicates no cancer as malignancy.
5.2 Ordinal Minimax Conditional Entropy
In what follows, we construct a different set of worker and item constraints to encode adjacency confusability. The formulation leads to an ordinal labeling model parameterized with structured confusion matrices for workers and items.
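We introduce two symbols $\Delta$ and $\nabla$ which take on arbitrary binary relations in $R = \{<, \geq\}$. Ordinal labels are represented by consecutive integers, and the minimal one is $0$. To estimate the true ordinal labels, we consider
$$\min_{Q} \max_{P} \; H(X \mid Y) \tag{17}$$
subject to the ordinal-based worker and item constraints
$$\sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = 0, \quad \forall i, s, \tag{18a}$$
$$\sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = 0, \quad \forall j, s, \tag{18b}$$
for all $\Delta, \nabla \in R$, and the probability constraints in (3). We exclude the case $s = 0$, in which the constraints trivially hold.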
Let us explain the meaning of the constraints in Equation (18). To construct ordinal-based constraints, the first issue that we have to address is how to compare the observed label and the true label in an ordinal sense. For multiclass labels, as we have seen in Section 2, the label comparison problem is trivial: we only need to check whether they are equal or not. For ordinal labels, such a problem becomes tricky. Here, we propose an indirect comparison between two ordinal labels by comparing both to a reference label $s$ which varies through all possible values in a given ordinal label set (Figure 3). Consequently, for every chosen reference label $s$, we partition the Cartesian product of the label set into four disjoint regions
$$\left\{ (c, k) \mid c \,\Delta\, s, \; k \,\nabla\, s \right\}, \quad \Delta, \nabla \in R.$$
A partition example is shown in Table 1 where the given label set is Then, Equation (18a) defines a set of constraints for the workers by summing Equation (2a) over each region. Similarly, Equation (18b) defines a set of constraints for the items by summing Equation (2b) over each region.
From the discussion above, we can see that when there are more than two ordinal classes, the constraints in Equation (18) are less restrictive than those in Equation (2). Consequently, as we see below, the labeling model resulting from Equation (18) has fewer parameters. In the case in which there are only two ordinal classes, the sets of disjoint regions degenerate to pairs and, thus, the sets of constraints in Equations (18) and (2) are identical.
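Next we explain why we construct the ordinal-based constraints in such a way. Let us write
$$\sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{j} \hat{\phi}_{ij}(c,k) = \sum_{j} Q(Y_j \,\Delta\, s)\, I(x_{ij} \,\nabla\, s).$$
For example, when $\Delta$ is $<$ and $\nabla$ is $\geq$, the above equation becomes
$$\sum_{c < s} \sum_{k \geq s} \sum_{j} \hat{\phi}_{ij}(c,k) = \sum_{j} Q(Y_j < s)\, I(x_{ij} \geq s).$$
This counts the items of which each belongs to a class less than $s$ but to which worker $i$ assigned a label larger than or equal to $s$.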
In general, for a comparison between an observed label and a reference label, there are two possible outcomes: the observed label is larger or equal to the reference label; or the observed label is smaller than the reference label. These are also the two possible outcomes for a comparison between a true label and a reference label. Putting these together, we have four possible outcomes in total. The constraints in Equation (18a) enforce expected counts of all the four kinds of outcomes in the worker dimension to match their empirical counterparts. Symmetrically, the constraints in Equation (18b) enforce expected counts of all the four kinds of outcomes in the item dimension to match their empirical counterparts.
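The four outcomes can be enumerated directly. The sketch below assumes that labels are the consecutive integers starting at 0 and that the two relations are "less than" and "greater than or equal to"; the function name `regions` and the string keys are hypothetical and chosen here for illustration.

```python
import operator
from itertools import product

# The two relations used to compare a label with the reference label.
RELATIONS = {"<": operator.lt, ">=": operator.ge}

def regions(num_classes, s):
    """Split all (true label c, observed label k) pairs into four regions,
    according to how c and k each compare with the reference label s."""
    labels = range(num_classes)
    out = {}
    for (c_name, c_rel), (k_name, k_rel) in product(RELATIONS.items(), repeat=2):
        out[(c_name, k_name)] = [(c, k) for c in labels for k in labels
                                 if c_rel(c, s) and k_rel(k, s)]
    return out

# Three ordinal classes {0, 1, 2} and reference label s = 1.
example = regions(3, 1)
```

For the reference label 0 the three regions involving "less than" are empty and the remaining region covers every pair, which is why that reference label carries no information.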



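The Lagrangian of the maximization problem in (17) can be written as
$$L = H(X \mid Y) + L_\sigma + L_\tau + L_\lambda$$
with
$$L_\sigma = \sum_{i,s} \sum_{\Delta, \nabla} \sigma_i^s(\Delta, \nabla) \sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right],$$
$$L_\tau = \sum_{j,s} \sum_{\Delta, \nabla} \tau_j^s(\Delta, \nabla) \sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right],$$
$$L_\lambda = \sum_{i,j,c} \lambda_{ijc} \left[\sum_{k} P(X_{ij} = k \mid Y_j = c) - 1\right],$$
where $\sigma_i^s(\Delta, \nabla)$, $\tau_j^s(\Delta, \nabla)$ and $\lambda_{ijc}$ are the introduced Lagrange multipliers. By a procedure similar to that in Section 2, we obtain a probabilistic ordinal labeling model
$$P(X_{ij} = k \mid Y_j = c) = \frac{1}{Z_{ij}} \exp\left[\sigma_i(c,k) + \tau_j(c,k)\right], \tag{19}$$
where
$$\sigma_i(c,k) = \sum_{s \geq 1} \sum_{\Delta, \nabla} \sigma_i^s(\Delta, \nabla)\, I(c \,\Delta\, s, \; k \,\nabla\, s), \tag{20a}$$
$$\tau_j(c,k) = \sum_{s \geq 1} \sum_{\Delta, \nabla} \tau_j^s(\Delta, \nabla)\, I(c \,\Delta\, s, \; k \,\nabla\, s). \tag{20b}$$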
The ordinal labeling model in Equation (19) is actually the same as the multiclass labeling model in Equation (6) except the worker and item confusion matrices in Equation (19) are now subtly structured through Equation (20). It is because of the structure that the ordinal labeling model has fewer parameters than the multiclass labeling model when there are more than two classes. In the case in which there are only two classes, the ordinal labeling model and the multiclass labeling model coincide as one would expect.
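The structure can be made concrete by assembling a full confusion score matrix from the per-reference-label parameters. The sketch below assumes labels 0 to K-1, reference labels 1 to K-1, and one parameter for each of the four comparison outcomes per reference label; the function name `structured_matrix` and the `(K-1, 2, 2)` parameter layout are assumptions made here.

```python
import numpy as np

def structured_matrix(params):
    """params: (K-1, 2, 2) array. params[s-1, a, b] is the parameter for reference
    label s, with a = 0 if c < s else 1 and b = 0 if k < s else 1.
    Returns the K x K matrix whose (c, k) entry sums the parameters of the
    regions that contain (c, k), one region per reference label."""
    K = params.shape[0] + 1
    M = np.zeros((K, K))
    for s in range(1, K):
        for c in range(K):
            for k in range(K):
                M[c, k] += params[s - 1, int(c >= s), int(k >= s)]
    return M

def parameter_counts(K):
    """(ordinal parameters, multiclass parameters) for one worker or item."""
    return 4 * (K - 1), K * K

demo = structured_matrix(np.ones((2, 2, 2)))   # three classes, all parameters equal to 1
```

With two classes both counts equal 4, and with five classes the structured form uses 16 parameters instead of 25, matching the statement that the ordinal model is smaller only when there are more than two classes.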
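The regularized minimax conditional entropy for ordinal labels can be written as
$$\min_{Q} \max_{P, \xi, \zeta} \; H(X \mid Y) - H(Y) - \frac{1}{\alpha} \Omega(\xi) - \frac{1}{\beta} \Psi(\zeta) \tag{21}$$
subject to the relaxed worker and item constraints
$$\sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{j} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = \xi_i^s(\Delta, \nabla), \quad \forall i, s, \tag{22a}$$
$$\sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} \sum_{i} \left[\phi_{ij}(c,k) - \hat{\phi}_{ij}(c,k)\right] = \zeta_j^s(\Delta, \nabla), \quad \forall j, s, \tag{22b}$$
for all $\Delta, \nabla \in R$, and the probability constraints in Equation (3). When we choose
$$\Omega(\xi) = \frac{1}{2} \sum_{i,s} \sum_{\Delta, \nabla} \left[\xi_i^s(\Delta, \nabla)\right]^2, \qquad \Psi(\zeta) = \frac{1}{2} \sum_{j,s} \sum_{\Delta, \nabla} \left[\zeta_j^s(\Delta, \nabla)\right]^2,$$
the dual problem becomes
$$\max_{\sigma, \tau, Q} \; \sum_{j,c} Q(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) + H(Y) - \alpha \Omega^*(\sigma) - \beta \Psi^*(\tau),$$
where
$$\Omega^*(\sigma) = \frac{1}{2} \sum_{i,s} \sum_{\Delta, \nabla} \left[\sigma_i^s(\Delta, \nabla)\right]^2, \qquad \Psi^*(\tau) = \frac{1}{2} \sum_{j,s} \sum_{\Delta, \nabla} \left[\tau_j^s(\Delta, \nabla)\right]^2.$$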
5.3 Ordinal Objective Measurement Principle
In this section, we adapt the objective measurement principle developed in Section 4 to ordinal labels.
Assume that worker has labeled items and in class For any class we define two events. The first event is
and the other event is
Note that Now we formulate the objective measurement principle as follows: is independent of worker . Assume that the labels of the items are independent. Then, can be written as
Hence, is independent of worker if and only if
is independent of worker . In other words, given another arbitrary worker we should have
To introduce adjacency confusability, we further assume that, for any two classes (or ),
Then, by a procedure similar to that in Section 4, we reach the probabilistic ordinal labeling model described by Equations (19) and (20).
6 Implementation
In this section, we present a simple yet efficient coordinate ascent method to solve the minimax program through its dual form and also a practical procedure for model selection.
6.1 Coordinate Ascent
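The dual problem of regularized minimax conditional entropy for either multiclass or ordinal labels is nonconvex. A stationary point can be obtained via coordinate ascent (Algorithm 1), which is essentially Expectation-Maximization (EM) (Dempster et al., 1977; Neal and Hinton, 1998). We first initialize the label estimate by aggregating votes as in Equation (23). Then, in each iteration step, given the current estimate of the labels, we update the estimate of the confusion matrices of the workers and items by solving the optimization problem in (24a); and, given the current estimate of the confusion matrices of the workers and items, we update the estimate of the labels through the closed-form formula in (24b), which is identical to applying Bayes’ rule with a uniform prior. The optimization problem in (24a) is strongly convex and smooth. Many algorithms can be applied here (Nesterov, 2004). In our experiments, we simply use gradient ascent. Denote by $F$ the objective function in (24a). For multiclass labels, the gradients are computed as
$$\frac{\partial F}{\partial \sigma_i(c,k)} = \sum_{j} Q(Y_j = c) \left[I(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c)\right] - \alpha\, \sigma_i(c,k),$$
$$\frac{\partial F}{\partial \tau_j(c,k)} = \sum_{i} Q(Y_j = c) \left[I(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c)\right] - \beta\, \tau_j(c,k).$$
For ordinal labels, the gradients are computed as
$$\frac{\partial F}{\partial \sigma_i^s(\Delta, \nabla)} = \sum_{j} \sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} Q(Y_j = c) \left[I(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c)\right] - \alpha\, \sigma_i^s(\Delta, \nabla),$$
$$\frac{\partial F}{\partial \tau_j^s(\Delta, \nabla)} = \sum_{i} \sum_{c \,\Delta\, s} \sum_{k \,\nabla\, s} Q(Y_j = c) \left[I(x_{ij} = k) - P(X_{ij} = k \mid Y_j = c)\right] - \beta\, \tau_j^s(\Delta, \nabla).$$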
It is worth pointing out that it is unnecessary to obtain the exact optimum at this intermediate step. We have observed that in practice, several gradient ascent steps here suffice for reaching a final good solution.
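$$Q^{0}(Y_j = c) = \frac{\sum_{i} I(x_{ij} = c)}{\sum_{k} \sum_{i} I(x_{ij} = k)}, \quad \forall j, c \tag{23}$$
$$\{\sigma^{t}, \tau^{t}\} = \arg\max_{\sigma, \tau} \; \sum_{j,c} Q^{t-1}(Y_j = c) \sum_{i} \log P(X_{ij} = x_{ij} \mid Y_j = c) - \alpha \Omega^*(\sigma) - \beta \Psi^*(\tau) \tag{24a}$$
$$Q^{t}(Y_j = c) \propto \prod_{i} P^{t}(X_{ij} = x_{ij} \mid Y_j = c), \quad \forall j, c \tag{24b}$$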
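The coordinate ascent procedure for multiclass labels can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes the exponential labeling model with squared-norm penalties on the worker and item matrices, vote fractions for the initialization in (23), a few plain gradient ascent steps for (24a), and a normalized product of label probabilities for (24b). The function names, the dense `(I, J)` label array with `-1` for missing labels, and the step size and iteration counts are all choices made here. Every item is assumed to have at least one label.

```python
import numpy as np

def softmax_model(sigma, tau):
    logits = sigma[:, None] + tau[None]                       # (I, J, K, K), indexed [i, j, c, k]
    logits = logits - logits.max(axis=-1, keepdims=True)
    P = np.exp(logits)
    return P / P.sum(axis=-1, keepdims=True)

def minimax_entropy(x, K, alpha=1.0, beta=1.0, outer=20, inner=5, lr=0.1):
    """x: (I, J) integer labels in {0..K-1}, -1 where a worker skipped an item.
    Returns the label posterior Q (J, K) and the fitted sigma, tau."""
    I, J = x.shape
    obs = (x[:, :, None] == np.arange(K)).astype(float)      # one-hot labels, zero if missing
    labeled = obs.sum(axis=-1)                                # (I, J) indicator of a label
    Q = obs.sum(axis=0)
    Q = Q / Q.sum(axis=1, keepdims=True)                      # initialization by vote fractions
    sigma = np.zeros((I, K, K))
    tau = np.zeros((J, K, K))
    for _ in range(outer):
        for _ in range(inner):                                # a few ascent steps suffice
            P = softmax_model(sigma, tau)
            # Q(Y_j = c) * (1[x_ij = k] - P(X_ij = k | Y_j = c)), zero for missing labels
            resid = Q[None, :, :, None] * (obs[:, :, None, :] - P) * labeled[:, :, None, None]
            sigma += lr * (resid.sum(axis=1) - alpha * sigma)
            tau += lr * (resid.sum(axis=0) - beta * tau)
        P = softmax_model(sigma, tau)
        loglik = np.einsum("ijck,ijk->jc", np.log(P), obs)    # log-likelihood per candidate class
        loglik -= loglik.max(axis=1, keepdims=True)
        Q = np.exp(loglik)
        Q /= Q.sum(axis=1, keepdims=True)                     # Bayes' rule with a uniform prior
    return Q, sigma, tau

# Hypothetical toy data: two consistent workers and one who always flips the label.
x = np.array([[0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1],
              [1, 1, 1, 0, 0, 0]])
Q, sigma, tau = minimax_entropy(x, K=2)
```

The inner loop deliberately stops after a handful of steps, in line with the remark that an exact optimum of the intermediate problem is unnecessary.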
6.2 Model Selection
The regularization parameters $\alpha$ and $\beta$ can be chosen as follows. If the true labels of a subset of items are known (such subsets are usually referred to as validation sets), we may choose the regularization parameters such that those known true labels can be best predicted. Otherwise, we suggest choosing the regularization parameters via $k$-fold likelihood-based cross-validation. Specifically, we first randomly partition the crowd labels into $k$ equal-size subsets, and define a finite set of possible choices for the regularization parameters. Then, for each possible choice of the regularization parameters,

Leave out one subset and use the remaining subsets to estimate the confusion matrices of the workers and items;

Plug the estimate into the probabilistic labeling model to compute the likelihood of the leftout subset;

Repeat the above two steps till each subset is left out once and only once;

Average the likelihoods that we have computed.
After going through all the possible choices for the regularization parameters, we choose the one which results in the largest average likelihood to run our algorithm over the full dataset. The cross-validation parameter $k$ is typically set to 5 or 10.
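The selection loop can be sketched independently of the model being fitted. In the sketch below, `fit` and `loglik` are hypothetical callbacks standing in for the confusion matrix estimation and the held-out likelihood computation; the triple format of the crowd labels, the use of log-likelihood rather than likelihood, and the function names are assumptions made here for illustration.

```python
import random

def kfold_average_loglik(labels, k, fit, loglik, seed=0):
    """labels: list of (worker, item, label) triples.
    fit(train) -> model; loglik(model, held_out) -> float.
    Returns the held-out log-likelihood averaged over the k folds."""
    shuffled = labels[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[f::k] for f in range(k)]                # k near-equal subsets
    scores = []
    for f in range(k):
        train = [t for g in range(k) if g != f for t in folds[g]]
        model = fit(train)                                    # estimate on the remaining subsets
        scores.append(loglik(model, folds[f]))                # score the left-out subset
    return sum(scores) / k

def select_parameter(labels, candidates, k, make_fit, loglik):
    """make_fit(p) returns a fit function that uses regularization setting p."""
    return max(candidates,
               key=lambda p: kfold_average_loglik(labels, k, make_fit(p), loglik))

# Hypothetical stand-ins, only to show the call pattern.
toy_labels = [(w, i, 0) for w in range(2) for i in range(5)]
average = kfold_average_loglik(toy_labels, 5,
                               fit=lambda train: len(train),
                               loglik=lambda model, held: float(model + len(held)))
```

The chosen setting would then be used to rerun the estimation on the full set of crowd labels.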
To simplify the model selection process, we suggest choosing
(25)  
In our experiments, we select from In our limited empirical studies, larger candidate sets for did not give more gains. Two empirical observations motivate us to consider using the square of the number of classes in Equation (25). First, the square of the number of classes has the same magnitude as the number of parameters in a confusion matrix. Second, the label noise dramatically increases when the number of classes increases, requiring a super linearly scaled regularization.
7 Related Work
In this section, we review some existing work that is closely related to ours.
Dawid-Skene Model. Let $K$ denote the number of classes. Dawid and Skene (1979) propose a generative model in which the ability of worker $i$ is characterized by a $K \times K$ probabilistic confusion matrix $[p_i(c,k)]$, in which the diagonal element $p_i(c,c)$ represents the probability that worker $i$ correctly labels an arbitrary item in class $c$, and the off-diagonal element $p_i(c,k)$ represents the probability that worker $i$ mislabels an arbitrary item in class $c$ as class $k$. Our probabilistic labeling model in Equation (6) is reduced to the Dawid-Skene model when the item difficulty terms in our model disappear, since we can then reparameterize
$$p_i(c,k) = \frac{\exp\left[\sigma_i(c,k)\right]}{\sum_{k'} \exp\left[\sigma_i(c,k')\right]}.$$
In this sense, our model generalizes the Dawid-Skene model to incorporate item difficulty. To jointly estimate the workers’ abilities and the true labels in the Dawid-Skene model, in general, the marginal likelihood is maximized using the EM algorithm.
For binary labeling tasks, the probabilistic confusion matrix in the Dawid-Skene model can be written as
$$\begin{bmatrix} p_i & 1 - p_i \\ 1 - q_i & q_i \end{bmatrix}$$
where $p_i$ is the accuracy of worker $i$ in the first class, and $q_i$ the accuracy in the second class. Usually, this special case of the Dawid-Skene model is also referred to as the two-coin model (Raykar et al., 2010; Liu et al., 2012; Chen et al., 2013). One may simplify the two-coin model by assuming $p_i = q_i$ (Ghosh et al., 2011; Karger et al., 2014; Dalvi et al., 2013). This simplification is accordingly referred to as the one-coin model.
Karger et al. (2014) propose an inference algorithm under the one-coin model, and show that their algorithm achieves the minimax rate when the accuracy of every worker is bounded away from $0$ and $1$, that is, $\epsilon \leq p_i \leq 1 - \epsilon$ with some fixed number $\epsilon > 0$. Liu et al. (2012) show that the algorithm proposed by Karger et al. (2014) is essentially a belief propagation update with the Haldane prior, which assumes that each worker is either a hammer ($p_i = 1$) or an adversary ($p_i = 0$) with equal probability.
Gao and Zhou (2013) show that under the one-coin model, the global optimum of maximum likelihood achieves the minimax rate. A projected EM algorithm is suggested and shown to achieve nearly the same rate as that of the global optimum. Zhang et al. (2014) show that the EM algorithm for the general Dawid-Skene model can achieve the minimax rate up to a logarithmic factor when it is initialized by spectral methods (Anandkumar et al., 2012) and the accuracy of every worker is bounded away from $0$ and $1$.
Raykar et al. (2010) extend the Dawid-Skene model by imposing a beta prior over the worker confusion matrices. Moreover, they jointly learn the classifier and the true labels by assuming that the true labels are generated by a logistic model. Liu et al. (2012) develop full Bayesian inference via variational methods including belief propagation and mean field.
Rasch model (Rasch, 1961, 1968). In educational tests, the Rasch model describes the response of each examinee of a given ability to each item in a test. In the model, the probability of a correct response is modeled as a logistic function of the difference between the person and item parameters, which are locations on a continuous latent trait. Person parameters represent the ability of examinees while item parameters represent the difficulty of items.
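The logistic form described above can be written as a one-line function. The names `rasch_probability`, `ability`, and `difficulty` are chosen here for illustration, and the sketch assumes the standard logistic function with unit scale.

```python
import math

def rasch_probability(ability, difficulty):
    """Probability of a correct response as a logistic function of
    the difference between the person and item parameters."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# An examinee whose ability equals the item difficulty answers correctly half the time.
even_odds = rasch_probability(0.7, 0.7)
easier = rasch_probability(0.7, -0.3)      # lower difficulty, higher success probability
```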
Let be a dichotomous random variable where