Max-MIG: an Information Theoretic Approach for Joint Learning from Crowds


Peng Cao*, Yilun Xu*
School of Electronics Engineering and Computer Science,
Peking University
{caopeng2016,xuyilun}@pku.edu.cn
&Yuqing Kong
The Center on Frontiers of Computing Studies,
Peking University
yuqing.kong@pku.edu.cn
Yizhou Wang
Nat’l Eng. Lab. for Video Technology
Computer Science Dept., Peking University
Cooperative Medianet Innovation Center
PengCheng Lab
Deepwise AI Lab
Yizhou.Wang@pku.edu.cn
*Equal contribution.
Abstract

Eliciting labels from crowds is a potential way to obtain large labeled data. Despite a variety of methods developed for learning from crowds, a key challenge remains unsolved: learning from crowds without knowing the information structure among the crowds a priori, when some people of the crowds make highly correlated mistakes and some of them label effortlessly (e.g. randomly). We propose an information theoretic approach, Max-MIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. Max-MIG simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. Furthermore, we devise an accurate data-crowds forecaster that employs both the data and the crowdsourced labels to forecast the ground truth. To the best of our knowledge, this is the first algorithm that solves the aforementioned challenge of learning from crowds. In addition to the theoretical validation, we also empirically show that our algorithm achieves new state-of-the-art results in most settings, including the real-world data, and is the first algorithm that is robust to various information structures. Code is available at https://github.com/Newbeeer/Max-MIG


1 Introduction

Lack of large labeled data is a notorious bottleneck of the data-driven machine learning paradigm. Crowdsourcing provides a potential solution to this challenge: eliciting labels from crowds. However, the elicited labels are usually very noisy, especially for difficult tasks (e.g. age estimation, medical image annotation). In the crowdsourcing-learning scenario, two problems arise:

(i) how to aggregate and infer the ground truth from the imperfect crowdsourced labels?

(ii) how to learn an accurate data classifier with the imperfect crowdsourced labels?

One conventional solution to the two problems is to aggregate the crowdsourced labels using majority vote and then learn a data classifier with the majority answer. However, this naive method yields biased results when the task is difficult and the majority of the crowd labels randomly or effortlessly assigns a particular class (say class 1).

Another typical solution is to aggregate the crowdsourced labels in a more clever way, such as spectral methods (Dalvi et al., 2013; Zhang et al., 2014), and then learn with the aggregated results. These methods avoid the above flaw of majority vote as long as the experts' labeling noises are mutually independent. However, this independence assumption often does not hold in practice, since some experts may make highly correlated mistakes (see Figure 2 for an example). Moreover, the above solutions aim to train an accurate data classifier and do not provide a method that can employ both the data and the crowdsourced labels to forecast the ground truth.

A common assumption in the learning from crowds literature is that, conditioning on the ground truth, the crowdsourced labels and the data are independent, as shown in Figure 1 (a). Under this assumption, the crowdsourced labels correlate with the data due to, and only due to, the ground truth. Thus, this assumption tells us that the ground truth is the "information intersection" between the crowdsourced labels and the data. This "information intersection" assumption does not restrict the information structure among the crowds, i.e. the assumption still holds even if some members of the crowd make highly correlated mistakes.

Figure 1: (a) The general information structure under the "information intersection" assumption. (b) Possible information structures under the "information intersection" assumption, where the crowdsourced labels are provided by several experts: (1) independent mistakes: all of the experts are correlated with the ground truth and mutually independent of each other conditioning on the ground truth; in (2) and (3) the senior experts are mutually conditionally independent, and (2) naive majority: the junior experts always label class 1 without any effort; (3) correlated mistakes: the junior experts, who were advised by the same senior expert before, make highly correlated mistakes.

We present several possible information structures under the "information intersection" assumption in Figure 1 (b). Majority vote will lead to inaccurate results in all cases if the experts have different levels of expertise, and will induce extremely biased results in case (2) when a large number of junior experts always label class 1. The approaches that require the experts to make independent mistakes will lead to biased results in case (3), when the experts make highly correlated mistakes.

In this paper, we propose an information theoretic approach, Max-MIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. To the best of our knowledge, this is the first algorithm that is both theoretically and empirically robust to the situation where some experts make highly correlated mistakes and some experts label effortlessly, without knowing the information structure among the experts. Our algorithm simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. In addition, we propose a method to learn an accurate data-crowds forecaster that can employ both the data and the crowdsourced labels.

At a high level, our algorithm trains a data classifier and a crowds aggregator simultaneously to maximize their "mutual information". This process finds the "information intersection" between the data and the crowdsourced labels, i.e. the ground truth labels. The data-crowds forecaster can be easily constructed from the trained data classifier and the trained crowds aggregator. The algorithm allows conditional dependency among the experts as long as the information intersection assumption holds.

We design the crowds aggregator as a "weighted average" of the experts. This simple "weighted average" form makes our algorithm both computationally efficient and theoretically robust to a large family of information structures (e.g. cases (1), (2), (3) in Figure 1 (b)). In particular, our algorithm works when there exists a subset of senior experts, whose identities are unknown, such that these senior experts have mutually independent labeling biases and it is sufficient to only use the seniors' information to predict the ground truth label. The other junior experts are allowed to have any dependency structure among themselves or between them and the senior experts.

Figure 2: Medical image labeling example: we want to train a data classifier to classify medical images into two classes, benign and malignant. Each image is labeled by several experts. The experts are from different hospitals, say hospitals A, B, and C. Each hospital has a senior expert with high expertise. We assume the seniors' labeling biases are mutually independent. However, two juniors who were advised by the same senior make highly correlated mistakes when labeling the images.
We assume that 5 experts are from hospital A, 50 experts are from hospital B, and 5 experts are from hospital C. If we use majority vote to aggregate the labels, the aggregated result will be biased to hospital B. If we still pretend the experts’ labeling noises are independent and apply the approaches that require independent mistakes, the aggregated result will still be biased to hospital B.

2 Related work

A series of works consider the learning from crowds problem and mix the learning process and the aggregation process together. Raykar et al. (2010) reduce the learning from crowds problem to a maximum likelihood estimation (MLE) problem, and implement an EM algorithm to jointly learn the expertise of different experts and the parameters of a logistic regression classifier. Albarqouni et al. (2016) extend this method to combine with the deep learning model. Khetan et al. (2017) also reduce the learning problem to MLE and assume that the optimal classifier gives the ground truth labels and the experts make independent mistakes conditioning on the ground truth. Unlike our method, these MLE based algorithms are not robust to correlated mistakes. Recently, Guan et al. (2017) and Rodrigues & Pereira (2017) propose methods that model multiple experts individually and explicitly in a neural network. However, their works lack theoretical guarantees and are outperformed by our method in the experiments, especially in the naive majority case. Moreover, unlike our method, their methods cannot be used to employ both the data and the crowdsourced labels to forecast the ground truth.

Several works focus on modeling the experts. Whitehill et al. (2009) model both expert competence and image difficulty, but do not consider expert bias. Welinder et al. (2010) model each expert as a multidimensional classifier in an abstract feature space and consider both the bias of the expert and the difficulty of the image. Rodrigues et al. (2014) model the crowds by a Gaussian process. Khetan & Oh (2016) and Shah et al. (2016) consider the generalized Dawid-Skene model (Dawid & Skene, 1979), which involves the task difficulty. However, these works are still not robust to correlated mistakes. We model the crowds via the original Dawid-Skene model and do not consider the task difficulty, but we believe our Max-MIG framework can be combined with any model of the experts and allow correlated mistakes.

Our method differs from the works that focus on inferring the ground truth answers from the crowds' reports and then learning the classifier with the inferred ground truth (e.g. Dawid & Skene, 1979; Zhou et al., 2012; Liu et al., 2012; Karger et al., 2014; Zhang et al., 2014; Dalvi et al., 2013; Ratner et al., 2016), since our method simultaneously infers the ground truth and learns the classifier. In addition, our method provides a data-crowds forecaster while those works do not.

Our method is also closely related to co-training. Blum & Mitchell (1998) first propose the co-training framework: simultaneously training two classifiers to aggregate two views of the data. Our method interprets joint learning from crowds as a co-training style problem. Most traditional co-training methods require weakly good classifier candidates (e.g. better than random guessing). We follow the general information theoretic framework proposed by Kong & Schoenebeck (2018), which does not have this requirement. However, Kong & Schoenebeck (2018) only provide a theoretical framework and assume an extremely high model complexity without considering the over-fitting issue, which is too strong an assumption for practice. Our work applies this framework to the learning from crowds problem and provides a proper design of the model complexity as well as experimental validation.

3 Method

In this section, we formally define the problem, introduce our method, Max-MIG, and provide a theoretical validation for our method.

Notations

For every set $S$, we use $\Delta_S$ to denote the set of all possible distributions over $S$. For every integer $M$, we use $[M]$ to denote $\{1, 2, \dots, M\}$. For every matrix $A$, we define $\log A$ as a matrix whose $(i,j)$-th entry is $\log A_{ij}$. Similarly, for every vector $v$, we define $\log v$ as a vector whose $i$-th entry is $\log v_i$.

Problem statement

There are $N$ datapoints. Each datapoint $x_i$ (e.g. the CT scan of a lung nodule) is labeled by $M$ experts with crowdsourced labels $y_i^{(1)}, \dots, y_i^{(M)}$ (e.g. $M = 5$, 5 experts' labels: {benign, malignant, benign, benign, benign}). The datapoint and the crowdsourced labels are related to a ground truth $y_i$ (e.g. the pathological truth of the lung nodule). There are $C$ possible classes.

We aim to simultaneously train a data classifier $h$ and a crowds aggregator $g$ such that $h$ predicts the ground truth based on the datapoint $x$, and $g$ aggregates the crowdsourced labels $y^{(1)}, \dots, y^{(M)}$ into a prediction for the ground truth $y$. We also want to learn a data-crowds forecaster $\varphi$ that forecasts the ground truth based on both the datapoint $x$ and the crowdsourced labels $y^{(1)}, \dots, y^{(M)}$.

3.1 Max-MIG: an information theoretic approach

Figure 3: Max-MIG overview. Step 1: finding the "information intersection" between the data and the crowdsourced labels: we train a data classifier $h$ and a crowds aggregator $g$ simultaneously to maximize their $f$-mutual information gain $MIG^f(h, g; \boldsymbol{p})$ with a hyperparameter $\boldsymbol{p}$. $h$ maps each datapoint to a forecast for the ground truth. $g$ aggregates the crowdsourced labels into a forecast by "weighted average". We tune the parameters of $h$ and $g$ simultaneously to maximize their $f$-mutual information gain. We will show the maximum is the $f$-mutual information (a natural extension of mutual information, see Appendix C) between the data and the crowdsourced labels. Step 2: aggregating the "information intersection": after we obtain the best $(\hat{h}, \hat{g})$ that maximize $MIG^f$, we use them to construct a data-crowds forecaster that forecasts the ground truth based on both the datapoint and the crowdsourced labels.
To calculate the $f$-mutual information gain, we reward the two models for the average "agreements" between their outputs for the same task, i.e. $h(x_i)$ and $g(y_i^{(1)}, \dots, y_i^{(M)})$, as shown by the black lines, and punish them for the average "agreements" between their outputs for different tasks, i.e. $h(x_i)$ and $g(y_j^{(1)}, \dots, y_j^{(M)})$ with $i \neq j$, as shown by the grey lines. Intuitively, the reward encourages the data classifier to agree with the crowds aggregator, while the punishment prevents them from naively agreeing with each other, that is, both of them mapping everything to the same forecast. The measurement of "agreement" depends on the selection of $f$. See the formal definition of $MIG^f$ in (1).

Figure 3 illustrates the overview idea of our method. Here we formally introduce the building blocks of our method.

Data classifier

The data classifier $h_\theta$ is a neural network with parameters $\theta$. Its input is a datapoint $x$ and its output is a distribution over the classes $[C]$. We denote the set of all such data classifiers by $\mathcal{H}_{NN}$.

Crowds aggregator

The crowds aggregator $g_{W,b}$ is a "weighted average" function that aggregates the crowdsourced labels, with parameters $W = \{W_m\}_{m=1}^{M}$ and $b$. Its input is the crowdsourced labels $y^{(1)}, \dots, y^{(M)}$ provided by the $M$ experts for a datapoint and its output is a distribution over $[C]$. Representing each $y^{(m)}$ as a one-hot vector $e_{y^{(m)}}$ whose $y^{(m)}$-th entry is 1,

$$ g_{W,b}(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} W_m\, e_{y^{(m)}} + b \Big), $$

where $W_m e_{y^{(m)}}$ is equivalent to picking the $y^{(m)}$-th column of the matrix $W_m$, as shown in Figure 3. We denote the set of all such crowds aggregators by $\mathcal{G}_{WA}$.
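For concreteness, a minimal PyTorch-style sketch of this weighted-average aggregator is given below; the module and variable names are our own illustration, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrowdsAggregator(nn.Module):
    """Weighted-average aggregator: g(y^(1),...,y^(M)) = softmax(sum_m W_m e_{y^(m)} + b)."""

    def __init__(self, num_experts: int, num_classes: int):
        super().__init__()
        # One C x C weight matrix per expert and a shared bias vector b.
        self.weights = nn.Parameter(torch.zeros(num_experts, num_classes, num_classes))
        self.bias = nn.Parameter(torch.zeros(num_classes))

    def forward(self, expert_labels: torch.Tensor) -> torch.Tensor:
        # expert_labels: (batch, num_experts) integer (long) labels in [0, C).
        onehot = F.one_hot(expert_labels, num_classes=self.weights.shape[-1]).float()
        # Picking the y^(m)-th column of W_m for each expert, then summing over experts.
        scores = torch.einsum('bmc,mkc->bk', onehot, self.weights) + self.bias
        return F.softmax(scores, dim=-1)
```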

Data-crowds forecaster

Given a data classifier $h$, a crowds aggregator $g$ and a distribution $\boldsymbol{p}$ over the classes, the data-crowds forecaster $\varphi_{h,g,\boldsymbol{p}}$, which forecasts the ground truth based on both the datapoint $x$ and the crowdsourced labels $y^{(1)}, \dots, y^{(M)}$, is constructed by

$$ \varphi_{h,g,\boldsymbol{p}}(x, y^{(1)}, \dots, y^{(M)}) = \mathrm{Normalize}\Big( \Big( \frac{h(x)_c \cdot g(y^{(1)}, \dots, y^{(M)})_c}{p_c} \Big)_{c \in [C]} \Big), $$

where $\mathrm{Normalize}(v) = v / \sum_{c} v_c$.
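A small sketch of how such a forecaster could be assembled from a trained classifier and aggregator (assuming the prior is stored as a probability vector `prior`; all names are hypothetical):

```python
import torch

def data_crowds_forecast(h_out: torch.Tensor, g_out: torch.Tensor,
                         prior: torch.Tensor) -> torch.Tensor:
    """Combine h(x), g(y^(1),...,y^(M)) and the prior p:
    forecast_c is proportional to h(x)_c * g(y)_c / p_c, renormalized over classes."""
    joint = h_out * g_out / prior                  # (batch, C), element-wise
    return joint / joint.sum(dim=-1, keepdim=True)
```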

-mutual information gain

$f$-mutual information gain measures the "mutual information" between two hypotheses and was proposed by Kong & Schoenebeck (2018). Given $N$ datapoints $x_1, \dots, x_N$ where each datapoint $x_i$ is labeled by crowdsourced labels $y_i^{(1)}, \dots, y_i^{(M)}$, the $f$-mutual information gain between $h$ and $g$, associated with a hyperparameter $\boldsymbol{p} \in \Delta_{[C]}$, is defined as the average "agreement" between $h$ and $g$ on the same task minus the average "agreement" between $h$ and $g$ on different tasks, that is,

$$ MIG^f(h, g; \boldsymbol{p}) = \frac{1}{N}\sum_{i=1}^{N} \partial f\Big( \sum_{c\in[C]} \frac{h(x_i)_c\, g(y_i^{(1)}, \dots, y_i^{(M)})_c}{p_c} \Big) - \frac{1}{N(N-1)} \sum_{i \neq j} f^{\star}\Big( \partial f\Big( \sum_{c\in[C]} \frac{h(x_i)_c\, g(y_j^{(1)}, \dots, y_j^{(M)})_c}{p_c} \Big) \Big), \tag{1} $$

where $f$ is a convex function satisfying $f(1) = 0$, $\partial f$ is a (sub)derivative of $f$, and $f^{\star}$ is the Fenchel dual of $f$. Table 1 can be used as a reference for $\partial f$ and $f^{\star}$.

$f$-divergence, $f(t)$, $\partial f(t)$, $f^{\star}(s)$:
KL divergence: $f(t) = t\log t$, $\partial f(t) = 1 + \log t$, $f^{\star}(s) = e^{s-1}$
Pearson $\chi^2$: $f(t) = (t-1)^2$, $\partial f(t) = 2(t-1)$, $f^{\star}(s) = s^2/4 + s$
Jensen-Shannon: $f(t) = t\log t - (t+1)\log\frac{t+1}{2}$, $\partial f(t) = \log\frac{2t}{1+t}$, $f^{\star}(s) = -\log(2 - e^{s})$

Table 1: Reference for common $f$-divergences and the corresponding $f$, $\partial f$, $f^{\star}$ building blocks. This table is derived from Nowozin et al. (2016).
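As one concrete instantiation, the following sketch computes the (negative) $f$-mutual information gain for the KL case, where $f(t) = t\log t$, $\partial f(t) = 1 + \log t$ and $f^{\star}(\partial f(t)) = t$; the function and tensor names are our own illustration.

```python
import torch

def mig_loss_kl(h_out: torch.Tensor, g_out: torch.Tensor,
                prior: torch.Tensor) -> torch.Tensor:
    """Negative f-mutual information gain for f(t) = t log t (KL), to be minimized.

    h_out, g_out: (batch, C) forecasts from the data classifier and crowds aggregator.
    prior: (C,) prior over classes. Assumes batch size > 1.
    """
    n = h_out.shape[0]
    # K[i, j] = sum_c h(x_i)_c * g(y_j)_c / p_c  ("agreement" between tasks i and j).
    k = (h_out / prior) @ g_out.t()
    same = torch.diagonal(k)
    reward = (1.0 + torch.log(same + 1e-12)).mean()       # average of ∂f(K_ii)
    punish = (k.sum() - same.sum()) / (n * (n - 1))       # average of f*(∂f(K_ij)) = K_ij, i != j
    return -(reward - punish)
```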

Since the parameters of $h$ are $\theta$ and the parameters of $g$ are $W$ and $b$, we naturally rewrite $MIG^f(h, g; \boldsymbol{p})$ as $MIG^f(\theta, W, b; \boldsymbol{p})$.

We seek $\theta, W, b, \boldsymbol{p}$ that maximize $MIG^f(\theta, W, b; \boldsymbol{p})$. Later we will show that when the prior of the ground truth is $\boldsymbol{p}^*$ (e.g. $\boldsymbol{p}^* = (0.8, 0.2)$, i.e. the ground truth is benign with probability 0.8 and malignant with probability 0.2 a priori), the best $b$ and $\boldsymbol{p}$ are $\log \boldsymbol{p}^*$ and $\boldsymbol{p}^*$ respectively. Thus, we can set $b$ as $\log \boldsymbol{p}$ and only tune $\boldsymbol{p}$. When we have side information about the prior $\boldsymbol{p}^*$, we can fix the parameter $\boldsymbol{p}$ as $\boldsymbol{p}^*$ and fix the parameter $b$ as $\log \boldsymbol{p}^*$.
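Putting the pieces together, a minimal training step might look as follows; it reuses the `CrowdsAggregator` and `mig_loss_kl` sketches above, and the backbone, learning rate, and `loader` are placeholders rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn

C, M = 2, 5                                        # number of classes and experts (illustrative)
classifier = nn.Sequential(                        # stand-in for the paper's CNN backbone
    nn.Flatten(), nn.Linear(32 * 32, C), nn.Softmax(dim=-1))
aggregator = CrowdsAggregator(num_experts=M, num_classes=C)
prior = torch.full((C,), 1.0 / C)                  # uniform prior unless side information is given
optimizer = torch.optim.Adam(
    list(classifier.parameters()) + list(aggregator.parameters()), lr=1e-4)

for images, expert_labels in loader:               # `loader` is a placeholder DataLoader
    h_out = classifier(images)                     # (batch, C) forecasts from the data
    g_out = aggregator(expert_labels)              # (batch, C) forecasts from the crowds
    loss = mig_loss_kl(h_out, g_out, prior)        # maximize MIG by minimizing its negative
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```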

3.2 Theoretical justification

This section provides a theoretical validation for Max-MIG, i.e., maximizing the $f$-mutual information gain over $\mathcal{H}_{NN} \times \mathcal{G}_{WA}$ finds the "information intersection" between the data and the crowdsourced labels. In Appendix E, we compare our method with the MLE method (Raykar et al., 2010) theoretically and show that unlike our method, MLE is not robust to the correlated mistakes case.

Recall that we assume that, conditioning on the ground truth, the data and the crowdsourced labels are independent. Thus, we can naturally define the "information intersection" as a pair of data classifier and crowds aggregator such that both fully use their inputs to forecast the ground truth. Kong & Schoenebeck (2018) show that with an infinite number of datapoints and maximization over all possible data classifiers and crowds aggregators, the "information intersection" attains the maximum of the gain, which equals the $f$-mutual information (Appendix C) between the data and the crowdsourced labels. However, in practice, with a finite number of datapoints, the data classifier space and the crowds aggregator space should be not only sufficiently rich to contain the "information intersection" but also sufficiently simple to avoid over-fitting. The experiment section will show that our chosen $\mathcal{H}_{NN}$ and $\mathcal{G}_{WA}$ are sufficiently simple to avoid over-fitting. We assume the neural network space $\mathcal{H}_{NN}$ is sufficiently rich. It remains to show that our weighted-average aggregator space $\mathcal{G}_{WA}$ is sufficiently rich to contain the Bayesian posterior crowds aggregator $g^*$ (defined below).

Model and assumptions

Each datapoint $x_i$, its crowdsourced labels $y_i^{(1)}, \dots, y_i^{(M)}$ provided by the $M$ experts, and its ground truth $y_i$ are drawn i.i.d. from random variables $(X, Y^{(1)}, \dots, Y^{(M)}, Y)$.

Assumption 3.1 (Co-training assumption).

$X$ and $(Y^{(1)}, \dots, Y^{(M)})$ are independent conditioning on $Y$.

Note that we do not assume that the experts' labels are mutually independent conditioning on $Y$. We define $\boldsymbol{p}^* \in \Delta_{[C]}$ as the prior for $Y$, i.e. $p^*_c = \Pr[Y = c]$.

Definition 3.2 (Information intersection).

We define $h^*$, $g^*$ and $\varphi^*$ such that for every $(x, y^{(1)}, \dots, y^{(M)})$ and every class $c \in [C]$,

$$ h^*(x)_c = \Pr[Y = c \mid X = x], \qquad g^*(y^{(1)}, \dots, y^{(M)})_c = \Pr[Y = c \mid Y^{(1)} = y^{(1)}, \dots, Y^{(M)} = y^{(M)}], $$
$$ \varphi^*(x, y^{(1)}, \dots, y^{(M)})_c = \Pr[Y = c \mid X = x, Y^{(1)} = y^{(1)}, \dots, Y^{(M)} = y^{(M)}]. $$

We call them the Bayesian posterior data classifier / crowds aggregator / data-crowds forecaster respectively. We call $(h^*, g^*)$ the information intersection between the data and the crowdsourced labels.

We also assume the neural network space is sufficiently rich to contain $h^*$.

Assumption 3.3 (Richness of the neural networks).

$h^* \in \mathcal{H}_{NN}$.

Theorem 3.4.

With Assumptions 3.1 and 3.3, when there exists a subset $S \subseteq [M]$ of experts such that the experts in $S$ are mutually independent conditioning on $Y$ and $\{Y^{(m)}\}_{m \in S}$ is a sufficient statistic for $Y$, i.e. $\Pr[Y \mid \{Y^{(m)}\}_{m \in S}] = \Pr[Y \mid Y^{(1)}, \dots, Y^{(M)}]$ for every realization of the crowdsourced labels, then $(h^*, g^*)$ is a maximizer of

$$ \max_{h \in \mathcal{H}_{NN},\; g \in \mathcal{G}_{WA}} MIG^f(h, g; \boldsymbol{p}^*) $$

and the maximum is the $f$-mutual information between $X$ and $(Y^{(1)}, \dots, Y^{(M)})$. Moreover, $\varphi_{h^*, g^*, \boldsymbol{p}^*}(x, y^{(1)}, \dots, y^{(M)}) = \varphi^*(x, y^{(1)}, \dots, y^{(M)})$ for every $(x, y^{(1)}, \dots, y^{(M)})$.

Our main theorem shows that if there exists a subset of senior experts such that these senior experts are mutually conditionally independent and it is sufficient to only use the information from these senior experts, then Max-MIG finds the "information intersection". Note that we do not need to know the identities of the senior experts. For the other junior experts, we allow any dependency structure among them and between them and the senior experts. Moreover, this theorem also shows that our method handles the independent mistakes case where all experts can be seen as senior experts (Proposition D.14).

To show our results, we need to show that $\mathcal{G}_{WA}$ contains $g^*$, i.e. there exist proper weights such that $g^*$ can be represented as a weighted average. In the independent mistakes case, we can construct each expert's weight using her confusion matrix; thus, in this case, each expert's weight represents her expertise. In the general case, we can construct each senior expert's weight using her confusion matrix and set the junior experts' weights to zero, as sketched below. Due to space limitations, we defer the formal proofs to Appendix D.
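In the notation of Section 3 (with confusion matrices $(C_m)_{c,c'} = \Pr[Y^{(m)} = c' \mid Y = c]$ defined in Appendix D), the construction in the independent mistakes case follows directly from Bayes' rule:

$$ \Pr[Y = c \mid y^{(1)}, \dots, y^{(M)}] \;\propto\; p^*_c \prod_{m=1}^{M} (C_m)_{c,\, y^{(m)}}, \quad\text{so}\quad g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} (\log C_m)\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big), $$

i.e. $W_m = \log C_m$ and $b = \log \boldsymbol{p}^*$. In the general case, the same weights are used for the senior experts $m \in S$ and $W_m = \mathbf{0}$ for the junior experts.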

4 Experiment

In this section, we evaluate our method on image classification tasks with both synthesized crowdsourced labels in various settings and real-world data.

Our method Max-MIG is compared with: Majority Vote, which trains the network with the majority-vote labels from all the experts; Crowd Layer, the method proposed by Rodrigues & Pereira (2017); Doctor Net, the method proposed by Guan et al. (2017); and AggNet, the method proposed by Albarqouni et al. (2016).

Image datasets

Three datasets are used in our experiments. The Dogs vs. Cats dataset (Kaggle, 2013) consists of images from two classes, dogs and cats, and is split into a training set and a test set. The CIFAR-10 dataset (Krizhevsky et al., 2014) consists of 60,000 32x32 color images from 10 classes, split into a 50,000-image training set and a 10,000-image test set. The LUNA16 dataset (Setio et al., 2016) consists of CT scans of lung nodules. We preprocess the CT scans to generate gray-scale images, which are split into a training set and a test set. LUNA16 is a highly imbalanced dataset (85%, 15%).

Synthesized crowdsourced labels in various settings

For each information structure in Figure 1, we generate two groups of crowdsourced labels for each dataset: labels provided by (H) experts with relatively high expertise; (L) experts with relatively low expertise. For each of the situations (H) and (L), all three cases have the same senior experts.

Case 4.5.

(Independent mistakes) Senior experts are mutually conditionally independent.

Case 4.6.

(Naive majority) Senior experts are mutually conditionally independent, while the other junior experts label all datapoints as the first class effortlessly.

Case 4.7.

(Correlated mistakes) Senior experts are mutually conditionally independent, and each junior expert copies one of the senior experts.

Real-world dataset

The LabelMe data (Rodrigues & Pereira, 2017; Russell et al., 2008) consists of a total of 2688 images, where 1000 of them were used to obtain labels from multiple annotators on Amazon Mechanical Turk and the remaining 1688 images were used for evaluating the different approaches. Each image was labeled by an average of 2.547 workers, with a mean accuracy of 69.2%.

Networks

We follow the four-layer network in Rodrigues & Pereira (2017) on Dogs vs. Cats and LUNA16 and use VGG-16 on CIFAR-10 as the backbone of the data classifier. For the LabelMe data, we apply the same setting as Rodrigues & Pereira (2017): we use a pre-trained VGG-16 deep neural network and apply only one FC layer (with 128 units and ReLU activations) and one output layer on top, using 50% dropout.

We defer other implementation details to appendix B.

Method Majority Vote Crowd Layer Doctor Net AggNet Max-MIG
Accuracy
Table 2: Accuracy on LabelMe (real-world crowdsourced labels)

4.1 Results

Figure 4: Results on Dogs vs. Cats, CIFAR-10, LUNA16.

We train the data classifier on the four datasets with our method and the other related methods (the reported Max-MIG results are based on the KL divergence; the results for other divergences are similar). The accuracies of the trained data classifiers on the test set are shown in Table 2 and Figure 4. We also show the accuracy of our data-crowds forecaster on the test set and compare it with AggNet (Table 3).

For the performance of the trained data classifiers, our method Max-MIG (red) outperforms all other methods in almost every experiment. For the real-world dataset, LabelMe, we achieve new state-of-the-art results. For the synthesized crowdsourced labels, the majority vote method (grey) fails in the naive majority situation. AggNet has reasonably good performance when the experts are conditionally independent, including the naive majority case since a naive expert is independent of everything, but our method outperforms it by a large margin in the correlated mistakes case. This matches the theory in Appendix E: AggNet is based on MLE, and MLE fails in the correlated mistakes case. The Doctor Net (green) and Crowd Layer (blue) methods are not robust to the naive majority case. Our data-crowds forecaster (Table 3) performs better than our data classifier, which shows that the data-crowds forecaster actually takes advantage of the additional information, the crowdsourced labels, to give a better result. Like ours, AggNet also jointly trains the classifier and the aggregator, and can be used to train a data-crowds forecaster. We compare our data-crowds forecaster with AggNet's. The results still match our theory: when there are no correlated mistakes, we outperform AggNet or have very similar performance; when there are correlated mistakes, we outperform AggNet by a large margin (e.g. +30%).

Recall that in the experiments, for each of the situations (H) and (L), all three cases have the same senior experts. Thus, the crowdsourced labels in all three cases carry the same amount of information. The results show that Max-MIG has similar performance across all three cases for each of the situations (H) and (L), which validates our theoretical result: Max-MIG finds the "information intersection" between the data and the crowdsourced labels.

5 Conclusion and discussion

We propose an information theoretic approach, Max-MIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. We provide theoretical validation for our approach and compare it experimentally with previous methods (Doctor Net (Guan et al., 2017), Crowd Layer (Rodrigues & Pereira, 2017), AggNet (Albarqouni et al., 2016)) under several different information structures. Each of the previous methods is not robust to at least one information structure, while our method is robust to all of them and outperforms the other methods in almost every experiment. To the best of our knowledge, our approach is the first algorithm that is both theoretically and empirically robust to the situation where some people make highly correlated mistakes and some people label effortlessly, without knowing the information structure among the crowds. We also test our method on real-world data and achieve a new state-of-the-art result.

Our current implementation of Max-MIG has several limitations. For example, we implement the aggregator using a simple linear model, which cannot handle the case when the senior experts are latent and cannot be linearly inferred from the junior experts. However, note that if the aggregator space is sufficiently rich, the Max-MIG approach is still able to handle any situation as long as the “information intersection” assumption holds. One potential future direction is designing more complicated but still trainable aggregator space.

Acknowledgments

We would like to express our thanks for support from the following research grants NSFC-61625201 and 61527804.

References

  • Albarqouni et al. (2016) Shadi Albarqouni, Christoph Baur, Felix Achilles, Vasileios Belagiannis, Stefanie Demirci, and Nassir Navab. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging, 35(5):1313–1321, 2016.
  • Ali & Silvey (1966) Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
  • Blum & Mitchell (1998) Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. ACM, 1998.
  • Csiszár et al. (2004) Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
  • Dalvi et al. (2013) Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd international conference on World Wide Web, pp. 285–294. ACM, 2013.
  • Dawid & Skene (1979) Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the em algorithm. Applied statistics, pp. 20–28, 1979.
  • Guan et al. (2017) Melody Y Guan, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774, 2017.
  • Kaggle (2013) Kaggle. Dogs vs. cats competition. https://www.kaggle.com/c/dogs-vs-cats, 2013.
  • Karger et al. (2014) David R Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
  • Khetan & Oh (2016) Ashish Khetan and Sewoong Oh. Achieving budget-optimality with adaptive schemes in crowdsourcing. In Advances in Neural Information Processing Systems, pp. 4844–4852, 2016.
  • Khetan et al. (2017) Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
  • Kong & Schoenebeck (2016) Y. Kong and G. Schoenebeck. An Information Theoretic Framework For Designing Information Elicitation Mechanisms That Reward Truth-telling. ArXiv e-prints, May 2016.
  • Kong & Schoenebeck (2018) Yuqing Kong and Grant Schoenebeck. Water from two rocks: Maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 177–194. ACM, 2018.
  • Krizhevsky et al. (2014) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The cifar-10 dataset. online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
  • Liu et al. (2012) Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Advances in neural information processing systems, pp. 692–700, 2012.
  • Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
  • Ratner et al. (2016) Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pp. 3567–3575, 2016.
  • Raykar et al. (2010) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
  • Rodrigues & Pereira (2017) Filipe Rodrigues and Francisco Pereira. Deep learning from crowds. arXiv preprint arXiv:1709.01779, 2017.
  • Rodrigues et al. (2014) Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Gaussian process classification and active learning with multiple annotators. In International Conference on Machine Learning, pp. 433–441, 2014.
  • Russell et al. (2008) Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. Labelme: a database and web-based tool for image annotation. International journal of computer vision, 77(1-3):157–173, 2008.
  • Setio et al. (2016) Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Geert Litjens, Paul Gerke, Colin Jacobs, Sarah J Van Riel, Mathilde Marie Winkler Wille, Matiullah Naqibullah, Clara I Sánchez, and Bram van Ginneken. Pulmonary nodule detection in ct images: false positive reduction using multi-view convolutional networks. IEEE transactions on medical imaging, 35(5):1160–1169, 2016.
  • Shah et al. (2016) Nihar B Shah, Sivaraman Balakrishnan, and Martin J Wainwright. A permutation-based model for crowd labeling: Optimal estimation and robustness. arXiv preprint arXiv:1606.09632, 2016.
  • Welinder et al. (2010) Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pp. 2424–2432, 2010.
  • Whitehill et al. (2009) Jacob Whitehill, Ting-fan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pp. 2035–2043, 2009.
  • Zhang et al. (2014) Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pp. 1260–1268, 2014.
  • Zhou et al. (2012) Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in neural information processing systems, pp. 2195–2203, 2012.

Appendix A Data-Crowds Forecaster Comparison

Dataset Method 4.1(H) 4.2(H) 4.3(H) 4.1(L) 4.2(L) 4.3(L)
Dogs vs.Cats Max-MIG (d)
Max-MIG (dc)
AggNet (d)
AggNet (dc)
CIFAR-10 Max-MIG(d)
Max-MIG(dc)
AggNet(d)
AggNet(dc)
LUNA16 Max-MIG(d)
Max-MIG(dc)
AggNet(d)
AggNet(dc)
Table 3: Data-Crowds Forecaster Comparison: Max-MIG VS AggNet

Here (dc) is shorthand for the data-crowds forecaster and (d) is shorthand for the data classifier. We report the average over five runs; the variance is quite small. Due to space limitations, we omit the variance here.

Appendix B Experiments details

B.1 Experts' expertise

For each information structure in Figure 1, we generate two groups of crowdsourced labels for each dataset: labels provided by (H) experts with relatively high expertise; (L) experts with relatively low expertise. For each of the situations (H) and (L), all three cases have the same senior experts.

Case B.8.

(Independent mistakes) Senior experts are mutually conditionally independent: (H) 5 senior experts; (L) 10 senior experts.

Dogs vs. Cats

In situation (H), some senior experts are more familiar with cats, while others make better judgments on dogs. For example, expert A is more familiar with cats; her expertise for dogs/cats is 0.6/0.8, in the sense that if the ground truth is dog/cat, she labels the image as "dog"/"cat" with probability 0.6/0.8 respectively. Similarly, the other experts' expertise is B: 0.6/0.6, C: 0.9/0.6, D: 0.7/0.7, E: 0.6/0.7.

In situation (L), all ten seniors' expertise is 0.55/0.55.

CIFAR-10

In situation (H), we generate experts who may make mistakes in distinguishing the hard pairs cat/dog, deer/horse, airplane/bird, automobile/truck, and frog/ship, but can perfectly distinguish the other, easy pairs (e.g. cat/frog), which makes sense in practice. When they cannot distinguish a pair, some of them label the pair randomly and some of them label the pair as the same class. In detail, for each hard pair, expert A labels the pair as the same class (e.g. A always labels the image as "cat" when the image contains cats or dogs), and expert B labels the pair uniformly at random (e.g. B labels the image as "cat" with probability 0.5 and "dog" with probability 0.5 when the image contains cats or dogs). Expert C is familiar with mammals, so she can distinguish cat/dog and deer/horse, while for the other hard pairs she labels each of them uniformly at random. Expert D is familiar with vehicles, so she can distinguish airplane/bird, automobile/truck and frog/ship, while for the other hard pairs she always labels each of them as the same class. Expert E does not have special expertise; for each hard pair, she labels them correctly with probability 0.6.

In situation (L), all ten senior experts label each image correctly with a fixed probability and label it as each of the other classes uniformly with the remaining probability.

LUNA16

In situation (H), some senior experts tend to label the image as "benign" while others tend to label the image as "malignant". Their expertise for benign/malignant is: A: 0.6/0.9, B: 0.7/0.7, C: 0.9/0.6, D: 0.6/0.7, E: 0.7/0.6.

In situation (L), all ten seniors' expertise is 0.6/0.6.
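For instance, the labels of a single simulated expert can be drawn row-by-row from her confusion matrix; a minimal NumPy sketch is given below (the 0.6/0.9 values are the expertise quoted above for expert A on LUNA16 in situation (H); the function name and everything else are illustrative).

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_expert(ground_truth: np.ndarray, confusion: np.ndarray) -> np.ndarray:
    """Draw one expert's label per datapoint: row `y` of `confusion` gives the label
    distribution when the ground truth is class `y` (0 = benign, 1 = malignant)."""
    return np.array([rng.choice(len(confusion), p=confusion[y]) for y in ground_truth])

# Expert A on LUNA16 in situation (H): benign/malignant expertise 0.6/0.9.
confusion_a = np.array([[0.6, 0.4],
                        [0.1, 0.9]])
labels_a = simulate_expert(np.array([0, 1, 1, 0]), confusion_a)
```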

Case B.9.

(Naive majority) Senior experts are mutually conditionally independent, while the other junior experts label all data as the first class effortlessly.

For Dogs vs. Cats, all junior experts label everything as “cat”. For CIFAR-10, all junior experts label everything as “airplane”. For LUNA16, all junior experts label everything as “benign”.

Case B.10.

(Correlated mistakes) Senior experts are mutually conditionally independent, and each junior expert copies one of the senior experts.

For Dogs vs. Cats, CIFAR-10 and LUNA16, in situation (H), two junior experts copy one senior expert's labels and three junior experts copy another senior expert's labels; in situation (L), one junior expert copies one senior expert's labels and another junior expert copies a different senior expert's labels.

B.2 Implementation details

Networks

For Dogs vs. Cats and LUNA16, we follow the four-layer network in Rodrigues & Pereira (2017) and use the Adam optimizer for both the data classifier and the crowds aggregator. For CIFAR-10, we use VGG-16 as the backbone and use the Adam optimizer with separate learning rates for the data classifier and the crowds aggregator.

For the LabelMe data, we apply the same setting as Rodrigues & Pereira (2017): we use a pre-trained VGG-16 deep neural network and apply only one FC layer (with 128 units and ReLU activations) and one output layer on top, using 50% dropout. We use the Adam optimizer for both the data classifier and the crowds aggregator.

For our method Max-MIG's crowds aggregator, for Dogs vs. Cats and LUNA16, we set the bias $b$ as $\log \boldsymbol{p}$ and only tune $\boldsymbol{p}$. For CIFAR-10 and the LabelMe data, we fix the prior distribution $\boldsymbol{p}$ to be the uniform distribution and fix the bias $b$ as $\log \boldsymbol{p}$.

Initialization

For AggNet and our method Max-MIG, we initialize the parameters using the method in Raykar et al. (2010):

$$ (\hat{C}_m)_{c,c'} = \frac{\sum_{i=1}^{N} \mu_{i,c}\, \mathbb{1}[y_i^{(m)} = c']}{\sum_{i=1}^{N} \mu_{i,c}} \tag{2} $$

where $\mathbb{1}[y_i^{(m)} = c'] = 1$ when $y_i^{(m)} = c'$ and $0$ when $y_i^{(m)} \neq c'$, and $N$ is the total number of datapoints. We average all crowdsourced labels to obtain the soft labels $\mu_{i,c} = \frac{1}{M}\sum_{m=1}^{M} \mathbb{1}[y_i^{(m)} = c]$.
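A minimal NumPy sketch of this Raykar-style initialization, assuming the soft labels are the averaged one-hot crowd labels as described above (function and variable names are our own illustration):

```python
import numpy as np

def init_confusion_matrices(expert_labels: np.ndarray, num_classes: int) -> np.ndarray:
    """expert_labels: (N, M) integer labels. Returns (M, C, C) initial confusion estimates."""
    onehot = np.eye(num_classes)[expert_labels]          # (N, M, C) one-hot crowd labels
    soft = onehot.mean(axis=1)                           # (N, C): averaged crowd labels mu_i
    conf = np.einsum('nc,nmk->mck', soft, onehot)        # soft counts of (true c, reported k)
    conf /= conf.sum(axis=2, keepdims=True) + 1e-12      # normalize each row to a distribution
    return conf
```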

For the Crowd Layer method, we initialize the weight matrices using identity matrices on Dogs vs. Cats and LUNA16, as Rodrigues & Pereira (2017) suggest. However, this initialization method leads to pretty bad results on CIFAR-10. Thus, we use (2) for Crowd Layer on CIFAR-10, which is the best practice in our experiments.

Appendix C -mutual information

C.1 $f$-divergence and Fenchel duality

$f$-divergence (Ali & Silvey, 1966; Csiszár et al., 2004)

$f$-divergence $D_f(\boldsymbol{p} \,\|\, \boldsymbol{q})$ is a non-symmetric measure of the difference between distribution $\boldsymbol{p}$ and distribution $\boldsymbol{q}$ and is defined to be

$$ D_f(\boldsymbol{p} \,\|\, \boldsymbol{q}) = \sum_{x} q(x)\, f\!\Big(\frac{p(x)}{q(x)}\Big), $$

where $f: \mathbb{R} \to \mathbb{R}$ is a convex function and $f(1) = 0$.
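For example, choosing $f(t) = t\log t$ recovers the KL divergence:

$$ D_f(\boldsymbol{p} \,\|\, \boldsymbol{q}) = \sum_{x} q(x)\, \frac{p(x)}{q(x)} \log \frac{p(x)}{q(x)} = \sum_{x} p(x) \log \frac{p(x)}{q(x)} = D_{\mathrm{KL}}(\boldsymbol{p} \,\|\, \boldsymbol{q}). $$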

C.2 $f$-mutual information

Given two random variables $X, Y$ whose realization spaces are $\Sigma_X$ and $\Sigma_Y$, let $U_{X,Y}$ and $V_{X,Y}$ be two probability measures where $U_{X,Y}$ is the joint distribution of $(X, Y)$ and $V_{X,Y}$ is the product of the marginal distributions of $X$ and $Y$. Formally, for every pair of $(x, y)$,

$$ U_{X,Y}(x, y) = \Pr[X = x, Y = y], \qquad V_{X,Y}(x, y) = \Pr[X = x]\Pr[Y = y]. $$

If $U_{X,Y}$ is very different from $V_{X,Y}$, the mutual information between $X$ and $Y$ should be high since knowing $X$ changes the belief about $Y$ a lot. If $U_{X,Y}$ equals $V_{X,Y}$, the mutual information between $X$ and $Y$ should be zero since $X$ is independent of $Y$. Intuitively, the "distance" between $U_{X,Y}$ and $V_{X,Y}$ represents the mutual information between them.

Definition C.11 (-mutual information (Kong & Schoenebeck, 2016)).

The $f$-mutual information between $X$ and $Y$ is defined as

$$ MI^f(X, Y) = D_f\big(U_{X,Y} \,\|\, V_{X,Y}\big), $$

where $D_f$ is the $f$-divergence. $f$-mutual information is always non-negative.
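In particular, with $f(t) = t\log t$ the definition recovers the usual Shannon mutual information:

$$ MI^f(X, Y) = \sum_{x,y} \Pr[X = x]\Pr[Y = y]\, f\!\Big(\frac{\Pr[X = x, Y = y]}{\Pr[X = x]\Pr[Y = y]}\Big) = \sum_{x,y} \Pr[X = x, Y = y]\, \log \frac{\Pr[X = x, Y = y]}{\Pr[X = x]\Pr[Y = y]}. $$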

Kong & Schoenebeck (2016) show that if we measure the amount of information by $f$-mutual information, any "data processing" on either of the random variables will decrease the amount of information between them. With this property, Kong & Schoenebeck (2016) propose an information theoretic mechanism design framework using $f$-mutual information. Kong & Schoenebeck (2018) reduce the co-training problem to a mechanism design problem and extend the information theoretic framework in Kong & Schoenebeck (2016) to address the co-training problem.

Appendix D Proof of Theorem 3.4

This section provides the formal proofs to our main theorem.

Definition D.12 (Confusion matrix).

For each expert $m \in [M]$, we define her confusion matrix as $C_m \in \mathbb{R}^{C \times C}$, where $(C_m)_{c,c'} = \Pr[Y^{(m)} = c' \mid Y = c]$.

We denote the set of all possible data classifiers by $\mathcal{H}_{all}$ and the set of all possible crowds aggregators by $\mathcal{G}_{all}$.

Lemma D.13.

(Kong & Schoenebeck, 2018) With Assumptions 3.1 and 3.3, $(h^*, g^*)$ is a maximizer of

$$ \max_{h \in \mathcal{H}_{NN},\; g \in \mathcal{G}_{all}} MIG^f(h, g; \boldsymbol{p}^*) $$

and the maximum is the $f$-mutual information between $X$ and $(Y^{(1)}, \dots, Y^{(M)})$, $MI^f(X, (Y^{(1)}, \dots, Y^{(M)}))$. Moreover, $\varphi_{h^*, g^*, \boldsymbol{p}^*}(x, y^{(1)}, \dots, y^{(M)}) = \varphi^*(x, y^{(1)}, \dots, y^{(M)})$ for every $(x, y^{(1)}, \dots, y^{(M)})$.

Proposition D.14.

[Independent mistakes] With Assumptions 3.1 and 3.3, if the experts are mutually independent conditioning on $Y$, then $g^* \in \mathcal{G}_{WA}$ and

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} (\log C_m)\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big) $$

for every $(y^{(1)}, \dots, y^{(M)})$.

This implies that $(h^*, g^*)$ is a maximizer of

$$ \max_{h \in \mathcal{H}_{NN},\; g \in \mathcal{G}_{WA}} MIG^f(h, g; \boldsymbol{p}^*) $$

and the maximum is the $f$-mutual information between $X$ and $(Y^{(1)}, \dots, Y^{(M)})$, $MI^f(X, (Y^{(1)}, \dots, Y^{(M)}))$. Moreover, $\varphi_{h^*, g^*, \boldsymbol{p}^*} = \varphi^*$ for every input.

Proof.

We will show that when the experts are mutually conditionally independent, then

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} (\log C_m)\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big). $$

This also implies that $g^* \in \mathcal{G}_{WA}$. Based on the result of Lemma D.13, since $\mathcal{G}_{WA} \subseteq \mathcal{G}_{all}$ contains $g^*$, we can see $(h^*, g^*)$ is a maximizer of $MIG^f$ over $\mathcal{H}_{NN} \times \mathcal{G}_{WA}$ and the maximum is the $f$-mutual information between $X$ and $(Y^{(1)}, \dots, Y^{(M)})$. Moreover, Lemma D.13 also implies that $\varphi_{h^*, g^*, \boldsymbol{p}^*} = \varphi^*$ for every input.

For every $(y^{(1)}, \dots, y^{(M)})$ and every class $c$, Bayes' rule and the conditional independence of the experts give

$$ \Pr[Y = c \mid Y^{(1)} = y^{(1)}, \dots, Y^{(M)} = y^{(M)}] \propto \Pr[Y = c] \prod_{m=1}^{M} \Pr[Y^{(m)} = y^{(m)} \mid Y = c] = p^*_c \prod_{m=1}^{M} (C_m)_{c,\, y^{(m)}}. $$

Taking logarithms,

$$ \log\Big( p^*_c \prod_{m=1}^{M} (C_m)_{c,\, y^{(m)}} \Big) = (\log \boldsymbol{p}^*)_c + \sum_{m=1}^{M} (\log C_m)_{c,\, y^{(m)}}, $$

and renormalizing over $c$ is exactly the softmax of these scores. Thus,

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} (\log C_m)\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big) \in \mathcal{G}_{WA}. $$

We restate our main theorem, Theorem 3.4, here with more details and prove it.

Theorem 3.4 (General case).

With Assumptions 3.1 and 3.3, when there exists a subset $S \subseteq [M]$ of experts such that the experts in $S$ are mutually independent conditioning on $Y$ and $\{Y^{(m)}\}_{m \in S}$ is a sufficient statistic for $Y$, i.e. $\Pr[Y \mid \{Y^{(m)}\}_{m \in S}] = \Pr[Y \mid Y^{(1)}, \dots, Y^{(M)}]$ for every realization of the crowdsourced labels, then $g^* \in \mathcal{G}_{WA}$ and

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} W^*_m\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big) $$

for every $(y^{(1)}, \dots, y^{(M)})$, where $W^*_m = \log C_m$ for every $m \in S$, and $W^*_m = \mathbf{0}$ for every $m \notin S$ (we denote the matrix whose entries are all zero by $\mathbf{0}$).

This implies that $(h^*, g^*)$ is a maximizer of

$$ \max_{h \in \mathcal{H}_{NN},\; g \in \mathcal{G}_{WA}} MIG^f(h, g; \boldsymbol{p}^*) $$

and the maximum is the $f$-mutual information between $X$ and $(Y^{(1)}, \dots, Y^{(M)})$, $MI^f(X, (Y^{(1)}, \dots, Y^{(M)}))$. Moreover, $\varphi_{h^*, g^*, \boldsymbol{p}^*} = \varphi^*$ for every input.

Proof.

As in the proof of the above proposition, we need to show that

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} W^*_m\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big). $$

This also implies that $g^* \in \mathcal{G}_{WA}$, as well as the other results of the theorem.

When $\{Y^{(m)}\}_{m \in S}$ is a sufficient statistic for $Y$, we have

$$ \Pr[Y \mid Y^{(1)} = y^{(1)}, \dots, Y^{(M)} = y^{(M)}] = \Pr[Y \mid \{Y^{(m)} = y^{(m)}\}_{m \in S}]. $$

Proposition D.14, applied to the experts in $S$, shows that

$$ \Pr[Y \mid \{Y^{(m)} = y^{(m)}\}_{m \in S}] = \mathrm{softmax}\Big( \sum_{m \in S} (\log C_m)\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big). $$

Thus, we have

$$ g^*(y^{(1)}, \dots, y^{(M)}) = \mathrm{softmax}\Big( \sum_{m=1}^{M} W^*_m\, e_{y^{(m)}} + \log \boldsymbol{p}^* \Big), $$

where $W^*_m = \log C_m$ for every $m \in S$, and $W^*_m = \mathbf{0}$ for every $m \notin S$.

Appendix E Theoretical comparisons with MLE

Raykar et al. (2010) propose a maximum likelihood estimation (MLE) based method in the learning from crowds scenario. Raykar et al. (2010) use logistic regression, and AggNet (Albarqouni et al., 2016) extends it to combine with the deep learning model. In this section, we will theoretically show that these MLE based methods can handle the independent mistakes case but cannot handle even the simplest correlated mistakes case (only one expert reports meaningful information and all other experts always report the same meaningless information), which can be handled by our method. Therefore, in addition to the experimental results, our method is also theoretically preferable to these MLE based methods. We first introduce these MLE based methods.

Let $\theta$ be the parameter that controls the distribution over $X$ and $Y$. Let $A = (A_1, \dots, A_M)$ be the parameters that control the distribution over $Y$ and $Y^{(1)}, \dots, Y^{(M)}$.

For each $i \in [N]$,

$$\begin{aligned}
\Pr[y_i^{(1)}, \dots, y_i^{(M)} \mid x_i; \theta, A]
&= \sum_{c \in [C]} \Pr[Y = c \mid x_i; \theta]\, \Pr[y_i^{(1)}, \dots, y_i^{(M)} \mid Y = c; A] && \text{(conditioning on } Y\text{, } X \text{ and the crowdsourced labels are independent)} \\
&= \sum_{c \in [C]} \Pr[Y = c \mid x_i; \theta] \prod_{m=1}^{M} \Pr[y_i^{(m)} \mid Y = c; A_m]. && \text{(experts are mutually conditionally independent)}
\end{aligned} \tag{3}$$

The MLE based method seeks $\theta$ and $A$ that maximize

$$ \sum_{i=1}^{N} \log \Pr[y_i^{(1)}, \dots, y_i^{(M)} \mid x_i; \theta, A]. $$

To theoretically compare it with our method, we use our language to reinterpret the above MLE based method.

We define $\mathcal{A}$ as the set of all $C \times C$ transition matrices with each row summing to 1.

For each expert $m$, we define $A_m \in \mathcal{A}$ as a parameter that is associated with expert $m$.

Given a set of data classifiers $\mathcal{H}$ where each $h \in \mathcal{H}$ maps a datapoint to a distribution over $[C]$, the MLE based method seeks $h \in \mathcal{H}$ and transition matrices $A_1, \dots, A_M \in \mathcal{A}$ that maximize

$$ \frac{1}{N}\sum_{i=1}^{N} \log \Big( \sum_{c \in [C]} h(x_i)_c \prod_{m=1}^{M} (A_m)_{c,\, y_i^{(m)}} \Big). $$

The expectation of the above formula is

$$ \mathbb{E}\, \log \Big( \sum_{c \in [C]} h(X)_c \prod_{m=1}^{M} (A_m)_{c,\, Y^{(m)}} \Big). $$

Note that Raykar et al. (2010) set the data classifier space as all logistic regression classifiers and Albarqouni et al. (2016) extend this space to the neural network space.
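For reference, a minimal PyTorch-style sketch of this MLE objective written as a loss (our own illustration; `confusions` is a hypothetical name holding the row-stochastic matrices $A_1, \dots, A_M$):

```python
import torch

def mle_neg_log_likelihood(h_out: torch.Tensor, expert_labels: torch.Tensor,
                           confusions: torch.Tensor) -> torch.Tensor:
    """Negative log of sum_c h(x)_c * prod_m A_m[c, y^(m)], averaged over the batch.

    h_out: (batch, C) classifier probabilities; expert_labels: (batch, M) integer labels;
    confusions: (M, C, C) row-stochastic, confusions[m, c, k] = Pr[Y^(m)=k | Y=c].
    """
    batch, num_experts = expert_labels.shape
    num_classes = h_out.shape[1]
    prod = torch.ones(batch, num_classes)
    for m in range(num_experts):
        # Accumulate prod_m A_m[c, y_i^(m)] for every datapoint i and latent class c.
        prod = prod * confusions[m][:, expert_labels[:, m]].t()
    likelihood = (h_out * prod).sum(dim=-1)
    return -torch.log(likelihood + 1e-12).mean()
```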

Proposition E.15 (MLE works for independent mistakes).

If the experts are mutually independent conditioning on $Y$, then $h^*$ and the true confusion matrices $C_1, \dots, C_M$ are a maximizer of

$$ \mathbb{E}\, \log \Big( \sum_{c \in [C]} h(X)_c \prod_{m=1}^{M} (A_m)_{c,\, Y^{(m)}} \Big). $$

Proof.

Since the expected log-likelihood can be written as

$$ \mathbb{E}_{X} \sum_{(y^{(1)}, \dots, y^{(M)})} \Pr[y^{(1)}, \dots, y^{(M)} \mid X]\, \log \Big( \sum_{c \in [C]} h(X)_c \prod_{m=1}^{M} (A_m)_{c,\, y^{(m)}} \Big), $$

the true conditional likelihood $\Pr[\,\cdot \mid X]$ can be seen as a distribution over all possible label profiles $(y^{(1)}, \dots, y^{(M)})$. Moreover, for any two distribution vectors $\boldsymbol{u}$ and $\boldsymbol{v}$, $\sum_{k} u_k \log v_k \leq \sum_{k} u_k \log u_k$, thus the expectation is maximized when the modeled likelihood $\sum_{c \in [C]} h(X)_c \prod_{m} (A_m)_{c,\, y^{(m)}}$ equals the true likelihood $\Pr[y^{(1)}, \dots, y^{(M)} \mid X]$, which is achieved by $h = h^*$ and $A_m = C_m$ for every $m$ (see equation (3)).

Thus, the MLE based method handles the independent mistakes case. However, we will construct a counter example to show that it cannot handle a simple correlated mistakes case which can be handled by our method.

Example E.16 (A simple correlated mistakes case).

We assume there are only two classes and the prior over $Y$ is uniform, that is, $\Pr[Y = 1] = \Pr[Y = 2] = 0.5$. We also assume that $Y$ is determined by $X$.

There are 101 experts. One of the experts, say the first expert, fully knows $Y$ and always reports $Y^{(1)} = Y$. The second expert knows nothing and every time flips a random unbiased coin whose randomness is independent of $Y$. She reports class 1 when she gets heads and reports class 2 otherwise. The rest of the experts copy the second expert's answer all the time, i.e. $Y^{(m)} = Y^{(2)}$ for $m = 3, 4, \dots, 101$.

Note that our method can handle this simple correlated mistakes case and will give all useless experts weight zero based on Theorem 3.4.

We define $\tilde{h}$ as a data classifier such that $\tilde{h}(x) = (0.5, 0.5)$ for every $x$. We will show this meaningless data classifier has a much higher likelihood than $h^*$, which shows that in this simple correlated mistakes case, the MLE based method will obtain meaningless results.

We define a data classifier ’s maximal expected likelihood as

Theorem E.17 (MLE fails for correlated mistakes).

In the scenario defined by Example E.16, the meaningless classifier $\tilde{h}$'s maximal expected likelihood is strictly higher than the Bayesian posterior classifier $h^*$'s maximal expected likelihood.

The above theorem implies that the MLE based method fails in Example E.16.

Proof.

For the Bayesian posterior classifier $h^*$, since $Y$ is determined by $X$ and the prior over $Y$ is uniform, $h^*(X)$ is a one-hot vector whose $Y$-th entry is 1, and everything is determined by the realizations of $Y$ and the coin $Y^{(2)}$.