MaxMIG: an Information Theoretic Approach for Joint Learning from Crowds
Abstract
Eliciting labels from crowds is a potential way to obtain large labeled data. Despite a variety of methods developed for learning from crowds, a key challenge remains unsolved: learning from crowds without knowing the information structure among the crowds a priori, when some people in the crowds make highly correlated mistakes and some of them label effortlessly (e.g. randomly). We propose an information theoretic approach, MaxMIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. MaxMIG simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. Furthermore, we devise an accurate data-crowds forecaster that employs both the data and the crowdsourced labels to forecast the ground truth. To the best of our knowledge, this is the first algorithm that solves the aforementioned challenge of learning from crowds. In addition to the theoretical validation, we also empirically show that our algorithm achieves new state-of-the-art results in most settings, including the real-world data, and is the first algorithm that is robust to various information structures. Code is available at https://github.com/Newbeeer/MaxMIG
1 Introduction
Lack of large labeled data is a notorious bottleneck of the data-driven machine learning paradigm. Crowdsourcing provides a potential solution to this challenge: eliciting labels from crowds. However, the elicited labels are usually very noisy, especially for some difficult tasks (e.g. age estimation, medical image annotation). In the crowdsourcing-learning scenario, two problems are raised:
(i) how to aggregate and infer the ground truth from the imperfect crowdsourced labels?
(ii) how to learn an accurate data classifier with the imperfect crowdsourced labels?
One conventional solution to the two problems is aggregating the crowdsourced labels using majority vote and then learning a data classifier with the majority answer. However, this naive method will cause biased results when the task is difficult and the majority of the crowds label randomly or always label a particular class (say class 1) effortlessly.
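The failure mode described above can be illustrated with a small simulation. This is only a sketch under assumed numbers that do not come from the paper (three senior experts with 90% accuracy, five effortless experts, binary labels 0/1 with the effortless experts always answering 1); the function names are hypothetical.

```python
import random
from collections import Counter

def majority_vote(labels):
    """Return the most common label; ties go to the smallest label."""
    counts = Counter(labels)
    best = max(counts.values())
    return min(label for label, n in counts.items() if n == best)

def simulate(n_items=2000, n_senior=3, n_junior=5, senior_acc=0.9, seed=0):
    """Accuracy of majority vote when juniors always answer class 1."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n_items):
        truth = rng.randint(0, 1)
        # Senior experts: correct with probability senior_acc, independently.
        seniors = [truth if rng.random() < senior_acc else 1 - truth
                   for _ in range(n_senior)]
        # Effortless experts: always label class 1 (assumed behaviour).
        juniors = [1] * n_junior
        if majority_vote(seniors + juniors) == truth:
            correct += 1
    return correct / n_items

acc_with_juniors = simulate()            # juniors outvote the seniors
acc_seniors_only = simulate(n_junior=0)  # same seniors, no juniors
```

With five effortless votes against three informed ones, the majority answer is always class 1, so accuracy collapses to roughly the frequency of class 1, while the same seniors alone are highly accurate.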
Another typical solution is to aggregate the crowdsourced labels in a more clever way, such as the spectral method (Dalvi et al., 2013; Zhang et al., 2014), and then learn with the aggregated results. This avoids the above flaw of the majority vote method, but only when the experts’ labeling noises are mutually independent. This requirement often does not hold in practice, since some experts may make highly correlated mistakes (see Figure 2 for example). Moreover, the above solutions aim to train an accurate data classifier and do not provide a method that can employ both the data and the crowdsourced labels to forecast the ground truth.
A common assumption in the learning from crowds literature is that conditioning on the ground truth, the crowdsourced labels and the data are independent, as shown in Figure 1 (a). Under this assumption, the crowdsourced labels correlate with the data due to and only due to the ground truth. Thus, this assumption tells us the ground truth is the “information intersection” between the crowdsourced labels and the data. This “information intersection” assumption does not restrict the information structure among the crowds i.e. this assumption still holds even if some people of the crowds make highly correlated mistakes.
We present several possible information structures under the “information intersection” assumption in Figure 1 (b). The majority vote will lead to inaccurate results in all cases if the experts have different levels of expertise, and will induce extremely biased results in case (2), when a large number of junior experts always label class 1. The approaches that require the experts to make independent mistakes will lead to biased results in case (3), when the experts make highly correlated mistakes.
In this paper, we propose an information theoretic approach, MaxMIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. To the best of our knowledge, this is the first algorithm that is both theoretically and empirically robust to the situation where some experts make highly correlated mistakes and some experts label effortlessly, without knowing the information structure among the experts. Our algorithm simultaneously aggregates the crowdsourced labels and learns an accurate data classifier. In addition, we propose a method to learn an accurate data-crowds forecaster that can employ both the data and the crowdsourced labels.
At a high level, our algorithm trains a data classifier and a crowds aggregator simultaneously to maximize their “mutual information”. This process will find the “information intersection” between the data and the crowdsourced labels, i.e. the ground truth labels. The data-crowds forecaster can be easily constructed from the trained data classifier and the trained crowds aggregator. This algorithm allows conditional dependency among the experts as long as the intersection assumption holds.
We design the crowds aggregator as the “weighted average” of the experts. This simple “weighted average” form allows our algorithm to be both highly efficient in computing and theoretically robust to a large family of information structures (e.g. case (1), (2), (3) in Figure 1 (b)). Particularly, our algorithm works when there exists a subset of senior experts, whose identities are unknown, such that these senior experts have mutually independent labeling biases and it is sufficient to only use the seniors’ information to predict the ground truth label. For other junior experts, they are allowed to have any dependency structure among themselves or between them and the senior experts.
2 Related work
A series of works consider the learning from crowds problem and mix the learning process and the aggregation process together. Raykar et al. (2010) reduce the learning from crowds problem to a maximum likelihood estimation (MLE) problem, and implement an EM algorithm to jointly learn the expertise of different experts and the parameters of a logistic regression classifier. Albarqouni et al. (2016) extend this method to combine with the deep learning model. Khetan et al. (2017) also reduce the learning problem to MLE and assume that the optimal classifier gives the ground truth labels and the experts make independent mistakes conditioning on the ground truth. Unlike our method, these MLE based algorithms are not robust to correlated mistakes. Recently, Guan et al. (2017) and Rodrigues & Pereira (2017) propose methods that model multiple experts individually and explicitly in a neural network. However, their works lack theoretical guarantees and are outperformed by our method in the experiments, especially in the naive majority case. Moreover, unlike our method, their methods cannot be used to employ both the data and the crowdsourced labels to forecast the ground truth.
Several works focus on modeling the experts. Whitehill et al. (2009) model both expert competence and image difficulty, but do not consider expert bias. Welinder et al. (2010) model each expert as a multidimensional classifier in an abstract feature space and consider both the bias of the expert and the difficulty of the image. Rodrigues et al. (2014) model the crowds by a Gaussian process. Khetan & Oh (2016) and Shah et al. (2016) consider the generalized Dawid-Skene model (Dawid & Skene, 1979), which involves the task difficulty. However, these works are still not robust to correlated mistakes. We model the crowds via the original Dawid-Skene model and do not consider the task difficulty, but we believe our MaxMIG framework can be combined with any model of the experts and allow correlated mistakes.
Our method differs from the works that focus on inferring ground truth answers from the crowds’ reports and then learn the classifier with the inferred ground truth (e.g. Dawid & Skene, 1979; Zhou et al., 2012; Liu et al., 2012; Karger et al., 2014; Zhang et al., 2014; Dalvi et al., 2013; Ratner et al., 2016), since our method simultaneously infers the ground truth and learns the classifier. In addition, our method provides a data-crowds forecaster while those works do not.
Our method is also closely related to co-training. Blum & Mitchell (1998) first propose the co-training framework: simultaneously training two classifiers to aggregate two views of data. Our method interprets joint learning from crowds as a co-training style problem. Most traditional co-training methods require weakly good classifier candidates (e.g. better than random guessing). We follow the general information theoretic framework proposed by Kong & Schoenebeck (2018), which does not have this requirement. However, Kong & Schoenebeck (2018) only provide a theoretical framework and assume an extremely high model complexity without considering the overfitting issue, which is too strong an assumption for practice. Our work applies this framework to the learning from crowds problem and provides a proper design for the model complexity as well as experimental validation.
3 Method
In this section, we formally define the problem, introduce our method, MaxMIG, and provide a theoretical validation for our method.
Notations
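For every set $\mathcal{A}$, we use $\Delta_{\mathcal{A}}$ to denote the set of all possible distributions over $\mathcal{A}$. For every integer $M$, we use $[M]$ to denote $\{1, 2, \dots, M\}$. For every matrix $\mathbf{A} = (A_{i,j})_{i,j} \in (\mathbb{R}^{+})^{s \times t}$, we define $\log \mathbf{A}$ as an $s \times t$ matrix such that its $(i,j)$th entry is $\log(A_{i,j})$. Similarly, for every vector $\mathbf{v} = (v_i)_i \in (\mathbb{R}^{+})^{s}$, we define $\log \mathbf{v}$ as a vector such that its $i$th entry is $\log(v_i)$.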
Problem statement
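There are $N$ datapoints. Each datapoint $x \in I$ (e.g. the CT scan of a lung nodule) is labeled by $M$ experts, $y^{[M]} := \{y^1, y^2, \dots, y^M \mid y^m \in \mathcal{C}\}$ (e.g. $\mathcal{C} = \{\text{benign}, \text{malignant}\}$, 5 experts’ labels: {benign, malignant, benign, benign, benign}). The datapoint $x$ and the crowdsourced labels $y^{[M]}$ are related to a ground truth $y \in \mathcal{C}$ (e.g. the pathological truth of the lung nodule).
We aim to simultaneously train a data classifier $h$ and a crowds aggregator $g$ such that $h: I \to \Delta_{\mathcal{C}}$ predicts the ground truth $y$ based on the datapoint $x \in I$, and $g: \mathcal{C}^M \to \Delta_{\mathcal{C}}$ aggregates the $M$ crowdsourced labels $y^{[M]}$ into a prediction for the ground truth $y$. We also want to learn a data-crowds forecaster $\zeta: I \times \mathcal{C}^M \to \Delta_{\mathcal{C}}$ that forecasts the ground truth $y$ based on both the datapoint $x \in I$ and the crowdsourced labels $y^{[M]} \in \mathcal{C}^M$.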
3.1 MaxMIG: an information theoretic approach
Figure 3 gives an overview of our method. Here we formally introduce its building blocks.
Data classifier
The data classifier $h$ is a neural network with parameters $\Theta$. Its input is a datapoint $x$ and its output is a distribution over $\mathcal{C}$. We denote the set of all such data classifiers by $H_{NN}$.
Crowds aggregator
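The crowds aggregator $g$ is a “weighted average” function to aggregate crowdsourced labels with parameters $\{\mathbf{W}^m \in \mathbb{R}^{|\mathcal{C}| \times |\mathcal{C}|}\}_{m=1}^{M}$ and $\mathbf{b}$. Its input $y^{[M]}$ is the crowdsourced labels provided by $M$ experts for a datapoint and its output is a distribution over $\mathcal{C}$. By representing each $y^m \in y^{[M]}$ as a one-hot vector $\mathbf{e}^{(y^m)} \in \{0,1\}^{|\mathcal{C}|}$ where only the $y^m$th entry of $\mathbf{e}^{(y^m)}$ is 1,
$$g(y^{[M]}; \{\mathbf{W}^m\}_{m=1}^{M}, \mathbf{b}) = \mathrm{softmax}\Big(\sum_{m=1}^{M} \mathbf{W}^m \cdot \mathbf{e}^{(y^m)} + \mathbf{b}\Big)$$
where $\mathbf{W}^m \cdot \mathbf{e}^{(y^m)}$ is equivalent to picking the $y^m$th column of matrix $\mathbf{W}^m$, as shown in Figure 3. We denote the set of all such crowds aggregators by $G_{WA}$.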
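A minimal sketch of such a “weighted average” aggregator follows. It assumes the aggregator is a softmax over the sum of the selected columns plus a bias vector, since the extracted text does not show the formula itself; the function names and the example weights are illustrative only.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(labels, weights, bias):
    """Aggregate M crowdsourced labels into a distribution over classes.

    labels:  list of M class indices, one per expert
    weights: list of M matrices (nested lists, |C| x |C|), one per expert
    bias:    list of |C| floats
    """
    n_classes = len(bias)
    logits = list(bias)
    for W, y in zip(weights, labels):
        # Multiplying W by a one-hot vector picks column y of W.
        for c in range(n_classes):
            logits[c] += W[c][y]
    return softmax(logits)

# Hypothetical example: one trusted expert and two ignored experts.
W_trusted = [[2.0, 0.0], [0.0, 2.0]]
W_ignored = [[0.0, 0.0], [0.0, 0.0]]
g = aggregate([0, 1, 1], [W_trusted, W_ignored, W_ignored], [0.0, 0.0])
```

Here the two experts with zero weight matrices contribute nothing, so the output follows the trusted expert even though it is outvoted.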
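Data-crowds forecaster
Given a data classifier $h$, a crowds aggregator $g$ and a distribution $\mathbf{p} = (p_c)_c \in \Delta_{\mathcal{C}}$ over the classes, the data-crowds forecaster $\zeta$, which forecasts the ground truth based on both the datapoint $x$ and the crowdsourced labels $y^{[M]}$, is constructed by
$$\zeta(x, y^{[M]}; h, g, \mathbf{p}) = \mathrm{Normalize}\Big(\Big(\frac{h(x)_c \cdot g(y^{[M]})_c}{p_c}\Big)_c\Big)$$
where $\mathrm{Normalize}(\mathbf{v}) := \mathbf{v} / \sum_c v_c$.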
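The construction can be sketched in a few lines, assuming the forecaster multiplies the two predictions class by class, divides by the prior, and renormalizes. This is what Bayes' rule gives when the data and the crowdsourced labels are independent conditioning on the ground truth. The numbers below are made up for illustration.

```python
def normalize(values):
    """Scale non-negative values so that they sum to one."""
    total = sum(values)
    return [v / total for v in values]

def forecast(h_x, g_y, prior):
    """Combine classifier output h_x and aggregator output g_y.

    Under conditional independence given the ground truth,
    P(c | x, y) is proportional to P(c | x) * P(c | y) / P(c).
    """
    return normalize([h * g / p for h, g, p in zip(h_x, g_y, prior)])

# Hypothetical predictions for one datapoint with prior (0.8, 0.2).
zeta = forecast([0.7, 0.3], [0.6, 0.4], [0.8, 0.2])
```

Dividing by the prior avoids counting it twice, since both the classifier output and the aggregator output already include it.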
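$f$-mutual information gain $MIG^f$
The $f$-mutual information gain $MIG^f$, proposed by Kong & Schoenebeck (2018), measures the “mutual information” between two hypotheses. Given $N$ datapoints $x_1, \dots, x_N \in I$ where each datapoint $x_i$ is labeled by $M$ crowdsourced labels $y_i^1, \dots, y_i^M \in \mathcal{C}$, the $f$-mutual information gain between $h$ and $g$, associated with a hyperparameter $\mathbf{p} = (p_c)_c \in \Delta_{\mathcal{C}}$, is defined as the average “agreement” between $h$ and $g$ for the same task minus the average “agreement” between $h$ and $g$ for different tasks, that is,
$$MIG^f(\{x_i\}, \{y_i^{[M]}\}; h, g, \mathbf{p}) = \frac{1}{N} \sum_{i} \partial f\Big(\sum_{c \in \mathcal{C}} \frac{h(x_i)_c \cdot g(y_i^{[M]})_c}{p_c}\Big) - \frac{1}{N(N-1)} \sum_{i \neq j} f^\star\Big(\partial f\Big(\sum_{c \in \mathcal{C}} \frac{h(x_i)_c \cdot g(y_j^{[M]})_c}{p_c}\Big)\Big) \quad (1)$$
where $f$ is a convex function satisfying $f(1) = 0$ and $f^\star$ is the Fenchel duality of $f$. We can use Table 1 as a reference for $\partial f(\cdot)$ and $f^\star(\partial f(\cdot))$.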
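$f$-divergence  $f(t)$  $\partial f(K)$  $f^\star(\partial f(K))$
KL divergence  $t \log t$  $1 + \log K$  $K$
Pearson $\chi^2$  $(t-1)^2$  $2(K-1)$  $K^2 - 1$
Jensen-Shannon  $-(t+1)\log\frac{t+1}{2} + t \log t$  $\log\frac{2K}{1+K}$  $-\log\frac{2}{1+K}$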

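Since the parameters of $h$ are $\Theta$ and the parameters of $g$ are $\{\mathbf{W}^m\}_{m=1}^{M}$ and $\mathbf{b}$, we naturally rewrite $MIG^f$ as $MIG^f(\{x_i\}, \{y_i^{[M]}\}; \Theta, \{\mathbf{W}^m\}_{m=1}^{M}, \mathbf{b}, \mathbf{p})$.
We seek $\{\Theta, \{\mathbf{W}^m\}_{m=1}^{M}, \mathbf{b}, \mathbf{p}\}$ that maximizes $MIG^f$. Later we will show that when the prior of the ground truth is $\mathbf{p}^*$ (e.g. $\mathbf{p}^* = (0.8, 0.2)$, i.e. the ground truth is benign with probability 0.8 and malignant with probability 0.2 a priori), the best $\mathbf{b}$ and $\mathbf{p}$ are $\log \mathbf{p}^*$ and $\mathbf{p}^*$ respectively. Thus, we can set $\mathbf{b}$ as $\log \mathbf{p}$ and only tune $\mathbf{p}$. When we have side information about the prior $\mathbf{p}^*$, we can fix the parameter $\mathbf{p}$ as $\mathbf{p}^*$ and fix the parameter $\mathbf{b}$ as $\log \mathbf{p}^*$.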
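The objective can be sketched for the KL case as follows. The sketch assumes the gain is the same-task average of $\partial f(K)$ minus the different-task average of $f^\star(\partial f(K))$, where $K = \sum_c h_c g_c / p_c$, and that for KL divergence these are $1 + \log K$ and $K$. The function names and inputs are illustrative and not taken from the released code.

```python
import math

def agreement(h_x, g_y, prior):
    """K = sum over classes of h(x)_c * g(y)_c / p_c."""
    return sum(h * g / p for h, g, p in zip(h_x, g_y, prior))

def mig_kl(h_outputs, g_outputs, prior):
    """Mutual information gain for f(t) = t log t (KL divergence).

    h_outputs[i] and g_outputs[i] are the two predictions for task i.
    """
    n = len(h_outputs)
    # Same-task term: average of 1 + log K over matched pairs.
    same_task = sum(
        1.0 + math.log(agreement(h_outputs[i], g_outputs[i], prior))
        for i in range(n)
    ) / n
    # Different-task term: average of K over all mismatched ordered pairs.
    different_task = sum(
        agreement(h_outputs[i], g_outputs[j], prior)
        for i in range(n) for j in range(n) if i != j
    ) / (n * (n - 1))
    return same_task - different_task

prior = [0.5, 0.5]
h_outputs = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]
informative = mig_kl(h_outputs, h_outputs, prior)             # g agrees with h
uninformative = mig_kl(h_outputs, [[0.5, 0.5]] * 4, prior)    # g ignores the task
```

An aggregator that ignores its input yields a gain of zero, while one that agrees with the classifier on the same task and disagrees across tasks yields a positive gain. In a gradient-based implementation one would typically minimize the negative of this quantity over a batch.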
3.2 Theoretical justification
This section provides a theoretical validation for MaxMIG, i.e., maximizing the $f$-mutual information gain over $H_{NN}$ and $G_{WA}$ finds the “information intersection” between the data and the crowdsourced labels. In Appendix E, we compare our method with the MLE method (Raykar et al., 2010) theoretically and show that, unlike our method, MLE is not robust to the correlated mistakes case.
Recall that we assume that conditioning on the ground truth, the data and the crowdsourced labels are mutually independent. Thus, we can naturally define the “information intersection” as a pair of data classifier and crowds aggregator $(h^*, g^*)$ such that they both fully use their input to forecast the ground truth. Kong & Schoenebeck (2018) show that when we have an infinite number of datapoints and maximize over all possible data classifiers and crowds aggregators, the “information intersection” maximizes $MIG^f(h, g)$, and the maximum is the $f$-mutual information (Appendix C) between the data and the crowdsourced labels. However, in practice, with a finite number of datapoints, the data classifier space and the crowds aggregator space should be not only sufficiently rich to contain the “information intersection” but also sufficiently simple to avoid overfitting. Later, the experiment section will show that our choices of $H_{NN}$ and $G_{WA}$ are sufficiently simple to avoid overfitting. We assume the neural network space is sufficiently rich. It remains to show that our weighted average aggregator space $G_{WA}$ is sufficiently rich to contain $g^*$.
Model and assumptions
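Each datapoint $x_i$ with crowdsourced labels $y_i^1, \dots, y_i^M$ provided by $M$ experts is drawn i.i.d. from random variables $X, Y^1, \dots, Y^M$.
Assumption 3.1 (Co-training assumption).
$X$ and $Y^{[M]}$ are independent conditioning on $Y$.
Note that we do not assume that the experts’ labels are conditionally mutually independent. We define $\mathbf{p}^* \in \Delta_{\mathcal{C}}$ as the prior for $Y$, i.e. $p^*_c = P(Y = c)$.
Definition 3.2 (Information intersection).
We define $h^*$, $g^*$ and $\zeta^*$ such that
$$h^*(x)_c = P(Y = c \mid X = x), \quad g^*(y^{[M]})_c = P(Y = c \mid Y^{[M]} = y^{[M]}), \quad \zeta^*(x, y^{[M]})_c = P(Y = c \mid X = x, Y^{[M]} = y^{[M]}).$$
We call them the Bayesian posterior data classifier / crowds aggregator / data-crowds forecaster respectively. We call $(h^*, g^*)$ the information intersection between the data and the crowdsourced labels.
We also assume the neural network space is sufficiently rich to contain $h^*$.
Assumption 3.3 (Richness of the neural networks).
$h^* \in H_{NN}$.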
Theorem 3.4.
Our main theorem shows that if there exists a subset of senior experts such that these senior experts are mutually conditionally independent and it is sufficient to only use the information from these senior experts, then MaxMIG finds the “information intersection”. Note that we do not need to know the identities of the senior experts. For the other junior experts, we allow any dependency structure among them and between them and the senior experts. Moreover, this theorem also shows that our method handles the independent mistakes case, where all experts can be seen as senior experts (Proposition D.14).
To show our results, we need to show that $G_{WA}$ contains $g^*$, i.e. there exist proper weights such that $g^*$ can be represented as a weighted average. In the independent mistakes case, we can construct each expert’s weight using her confusion matrix. Thus, in this case, each expert’s weight represents her expertise. In the general case, we can construct each senior expert’s weight using her confusion matrix and make the junior experts’ weights zero. Due to space limitations, we defer the formal proofs to Appendix D.
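The independent mistakes construction can be checked numerically. The sketch below assumes the weights are the element-wise logarithm of each expert's confusion matrix (rows indexed by ground truth, columns by reported label) and the bias is the logarithm of the prior; it then compares the softmax of the summed columns with the Bayesian posterior computed directly. The two confusion matrices reuse the 0.6/0.8 and 0.9/0.6 expertise values from Appendix B; the uniform prior is an assumption.

```python
import math
from itertools import product

def softmax(logits):
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bayes_posterior(labels, confusions, prior):
    """P(Y = c | labels) for conditionally independent experts."""
    joint = []
    for c, p_c in enumerate(prior):
        likelihood = p_c
        for confusion, y in zip(confusions, labels):
            likelihood *= confusion[c][y]  # P(expert says y | truth c)
        joint.append(likelihood)
    total = sum(joint)
    return [v / total for v in joint]

def weighted_average(labels, confusions, prior):
    """Softmax of summed log-confusion columns plus log-prior bias.

    Requires strictly positive confusion entries so the logs exist.
    """
    logits = [math.log(p_c) for p_c in prior]
    for confusion, y in zip(confusions, labels):
        for c in range(len(prior)):
            logits[c] += math.log(confusion[c][y])
    return softmax(logits)

prior = [0.5, 0.5]
confusions = [
    [[0.6, 0.4], [0.2, 0.8]],  # expert with expertise 0.6/0.8
    [[0.9, 0.1], [0.4, 0.6]],  # expert with expertise 0.9/0.6
]
max_gap = max(
    abs(a - b)
    for labels in product(range(2), repeat=len(confusions))
    for a, b in zip(bayes_posterior(labels, confusions, prior),
                    weighted_average(labels, confusions, prior))
)
```

The two computations agree for every label combination, which is why the weighted average form is rich enough in this case. A junior expert with an all-zero weight matrix adds the same constant to every logit and therefore leaves the output unchanged.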
4 Experiment
In this section, we evaluate our method on image classification tasks with both synthesized crowdsourced labels in a variety of settings and real-world data.
Our method MaxMIG is compared with: Majority Vote, which trains the network with the majority vote labels from all the experts; Crowd Layer, the method proposed by Rodrigues & Pereira (2017); Doctor Net, the method proposed by Guan et al. (2017); and AggNet, the method proposed by Albarqouni et al. (2016).
Image datasets
Three datasets are used in our experiments. The Dogs vs. Cats (Kaggle, 2013) dataset consists of images from classes, dogs and cats, which is split into a image training set and a image test set. The CIFAR10 (Krizhevsky et al., 2014) dataset consists of color images from classes, which is split into a image training set and a image test set. The LUNA16 (Setio et al., 2016) dataset consists of CT scans of lung nodules. We preprocessed the CT scans by generating grayscale images, which are split into a image training set and a image test set. LUNA16 is a highly imbalanced dataset (85%, 15%).
Synthesized crowdsourced labels in a variety of settings
For each information structure in Figure 1, we generate two groups of crowdsourced labels for each dataset: labels provided by (H) experts with relatively high expertise; (L) experts with relatively low expertise. Within each of the situations (H) and (L), all three cases have the same senior experts.
Case 4.1.
(Independent mistakes) senior experts are mutually conditionally independent.
Case 4.2.
(Naive majority) senior experts are mutually conditionally independent, while other junior experts label all datapoints as the first class effortlessly.
Case 4.3.
(Correlated mistakes) senior experts are mutually conditionally independent, and each junior expert copies one of the senior experts.
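The three settings can be sketched as a label generator. This is an illustration only: the confusion matrices, the number of junior experts, and the round-robin rule for which senior a junior copies are assumptions, and the function names are hypothetical.

```python
import random

def sample_label(truth, confusion, rng):
    """Draw one expert label from row `truth` of a confusion matrix."""
    r = rng.random()
    cumulative = 0.0
    for label, prob in enumerate(confusion[truth]):
        cumulative += prob
        if r < cumulative:
            return label
    return len(confusion[truth]) - 1  # guard against rounding error

def crowd_labels(truth, senior_confusions, n_junior, case, rng):
    """Return senior labels followed by junior labels for one datapoint."""
    seniors = [sample_label(truth, C, rng) for C in senior_confusions]
    if case == "independent":
        juniors = []                       # only senior experts
    elif case == "naive_majority":
        juniors = [0] * n_junior           # always the first class
    elif case == "correlated":
        # Each junior copies one senior (round-robin is an assumption).
        juniors = [seniors[k % len(seniors)] for k in range(n_junior)]
    else:
        raise ValueError(f"unknown case: {case}")
    return seniors + juniors

rng = random.Random(0)
senior_confusions = [[[0.6, 0.4], [0.2, 0.8]], [[0.9, 0.1], [0.4, 0.6]]]
plain = crowd_labels(1, senior_confusions, 3, "independent", rng)
naive = crowd_labels(1, senior_confusions, 3, "naive_majority", rng)
copied = crowd_labels(1, senior_confusions, 3, "correlated", rng)
```

In all three cases the senior labels are drawn the same way, so the junior labels add no information about the ground truth beyond what the seniors already provide.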
Realworld dataset
The LabelMe data (Rodrigues & Pereira, 2017; Russell et al., 2008) consists of a total of 2688 images, where 1000 of them were used to obtain labels from multiple annotators on Amazon Mechanical Turk and the remaining 1688 images were used for evaluating the different approaches. Each image was labeled by an average of 2.547 workers, with a mean accuracy of 69.2%.
Networks
We follow the four-layer network in Rodrigues & Pereira (2017) on Dogs vs. Cats and LUNA16 and use VGG16 on CIFAR10 for the backbone of the data classifier. For the LabelMe data, we apply the same setting as Rodrigues & Pereira (2017): we use a pretrained VGG16 deep neural network and apply only one FC layer (with 128 units and ReLU activations) and one output layer on top, using 50% dropout.
We defer other implementation details to appendix B.
Method  Majority Vote  Crowd Layer  Doctor Net  AggNet  MaxMIG 

Accuracy 
4.1 Results
We train the data classifier on the four datasets through our method and other related methods. The results of MaxMIG are based on KL divergence; the results for other divergences are similar. The accuracies of the trained data classifiers on the test set are shown in Table 2 and Figure 4. We also show the accuracy of our data-crowds forecaster on the test set and compare it with AggNet (Table 3).
For the performance of the trained data classifiers, our method MaxMIG (red) outperforms all other methods in almost every experiment. For the real-world dataset, LabelMe, we achieve new state-of-the-art results. For the synthesized crowdsourced labels, the majority vote method (grey) fails in the naive majority situation. AggNet performs reasonably well when the experts are conditionally independent, including the naive majority case, since a naive expert is independent of everything, but it is outperformed by our method by a large margin in the correlated mistakes case. This matches the theory in Appendix E: AggNet is based on MLE, and MLE fails in the correlated mistakes case. The Doctor Net (green) and the Crowd Layer (blue) methods are not robust to the naive majority case. Our data-crowds forecaster (Table 3) performs better than our data classifier, which shows that our data-crowds forecaster actually takes advantage of the additional information, the crowdsourced labels, to give a better result. Like our method, AggNet also jointly trains the classifier and the aggregator, and can be used to train a data-crowds forecaster. We compared our data-crowds forecaster with that of AggNet. The results still match our theory. When there are no correlated mistakes, we outperform AggNet or perform very similarly to it. When there are correlated mistakes, we outperform AggNet by a large margin (e.g. +30%).
Recall that in the experiments, within each of the situations (H) and (L), all three cases have the same senior experts. Thus, the crowdsourced labels of all three cases carry the same amount of information. The results show that MaxMIG performs similarly across all three cases within each of the situations (H) and (L), which validates our theoretical result: MaxMIG finds the “information intersection” between the data and the crowdsourced labels.
5 Conclusion and discussion
We propose an information theoretic approach, MaxMIG, for joint learning from crowds, with a common assumption: the crowdsourced labels and the data are independent conditioning on the ground truth. We provide theoretical validation for our approach and compare it experimentally with previous methods (Doctor Net (Guan et al., 2017), Crowd Layer (Rodrigues & Pereira, 2017), AggNet (Albarqouni et al., 2016)) under several different information structures. Each of the previous methods is not robust to at least one information structure, while our method is robust to all of them and outperforms all other methods in almost every experiment. To the best of our knowledge, our approach is the first algorithm that is both theoretically and empirically robust to the situation where some people make highly correlated mistakes and some people label effortlessly, without knowing the information structure among the crowds. We also test our method on real-world data and achieve a new state-of-the-art result.
Our current implementation of MaxMIG has several limitations. For example, we implement the aggregator using a simple linear model, which cannot handle the case when the senior experts are latent and cannot be linearly inferred from the junior experts. However, note that if the aggregator space is sufficiently rich, the MaxMIG approach is still able to handle any situation as long as the “information intersection” assumption holds. One potential future direction is designing a more complicated but still trainable aggregator space.
Acknowledgments
We would like to express our thanks for support from the following research grants NSFC61625201 and 61527804.
References
 Albarqouni et al. (2016) Shadi Albarqouni, Christoph Baur, Felix Achilles, Vasileios Belagiannis, Stefanie Demirci, and Nassir Navab. Aggnet: deep learning from crowds for mitosis detection in breast cancer histology images. IEEE transactions on medical imaging, 35(5):1313–1321, 2016.
 Ali & Silvey (1966) Syed Mumtaz Ali and Samuel D Silvey. A general class of coefficients of divergence of one distribution from another. Journal of the Royal Statistical Society. Series B (Methodological), pp. 131–142, 1966.
 Blum & Mitchell (1998) Avrim Blum and Tom Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the eleventh annual conference on Computational learning theory, pp. 92–100. ACM, 1998.
 Csiszár et al. (2004) Imre Csiszár, Paul C Shields, et al. Information theory and statistics: A tutorial. Foundations and Trends® in Communications and Information Theory, 1(4):417–528, 2004.
 Dalvi et al. (2013) Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced binary ratings. In Proceedings of the 22nd international conference on World Wide Web, pp. 285–294. ACM, 2013.
 Dawid & Skene (1979) Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied statistics, pp. 20–28, 1979.
 Guan et al. (2017) Melody Y Guan, Varun Gulshan, Andrew M Dai, and Geoffrey E Hinton. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774, 2017.
 Kaggle (2013) Kaggle. Dogs vs. cats competition. https://www.kaggle.com/c/dogsvscats, 2013.
 Karger et al. (2014) David R Karger, Sewoong Oh, and Devavrat Shah. Budget-optimal task allocation for reliable crowdsourcing systems. Operations Research, 62(1):1–24, 2014.
 Khetan & Oh (2016) Ashish Khetan and Sewoong Oh. Achieving budget-optimality with adaptive schemes in crowdsourcing. In Advances in Neural Information Processing Systems, pp. 4844–4852, 2016.
 Khetan et al. (2017) Ashish Khetan, Zachary C Lipton, and Anima Anandkumar. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577, 2017.
 Kong & Schoenebeck (2016) Y. Kong and G. Schoenebeck. An Information Theoretic Framework For Designing Information Elicitation Mechanisms That Reward Truth-telling. ArXiv e-prints, May 2016.
 Kong & Schoenebeck (2018) Yuqing Kong and Grant Schoenebeck. Water from two rocks: Maximizing the mutual information. In Proceedings of the 2018 ACM Conference on Economics and Computation, pp. 177–194. ACM, 2018.
 Krizhevsky et al. (2014) Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. Online: http://www.cs.toronto.edu/~kriz/cifar.html, 2014.
 Liu et al. (2012) Qiang Liu, Jian Peng, and Alexander T Ihler. Variational inference for crowdsourcing. In Advances in neural information processing systems, pp. 692–700, 2012.
 Nowozin et al. (2016) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
 Ratner et al. (2016) Alexander J Ratner, Christopher M De Sa, Sen Wu, Daniel Selsam, and Christopher Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pp. 3567–3575, 2016.
 Raykar et al. (2010) Vikas C Raykar, Shipeng Yu, Linda H Zhao, Gerardo Hermosillo Valadez, Charles Florin, Luca Bogoni, and Linda Moy. Learning from crowds. Journal of Machine Learning Research, 11(Apr):1297–1322, 2010.
 Rodrigues & Pereira (2017) Filipe Rodrigues and Francisco Pereira. Deep learning from crowds. arXiv preprint arXiv:1709.01779, 2017.
 Rodrigues et al. (2014) Filipe Rodrigues, Francisco Pereira, and Bernardete Ribeiro. Gaussian process classification and active learning with multiple annotators. In International Conference on Machine Learning, pp. 433–441, 2014.
 Russell et al. (2008) Bryan C Russell, Antonio Torralba, Kevin P Murphy, and William T Freeman. LabelMe: a database and web-based tool for image annotation. International journal of computer vision, 77(1-3):157–173, 2008.
 Setio et al. (2016) Arnaud Arindra Adiyoso Setio, Francesco Ciompi, Geert Litjens, Paul Gerke, Colin Jacobs, Sarah J Van Riel, Mathilde Marie Winkler Wille, Matiullah Naqibullah, Clara I Sánchez, and Bram van Ginneken. Pulmonary nodule detection in CT images: false positive reduction using multi-view convolutional networks. IEEE transactions on medical imaging, 35(5):1160–1169, 2016.
 Shah et al. (2016) Nihar B Shah, Sivaraman Balakrishnan, and Martin J Wainwright. A permutation-based model for crowd labeling: Optimal estimation and robustness. arXiv preprint arXiv:1606.09632, 2016.
 Welinder et al. (2010) Peter Welinder, Steve Branson, Pietro Perona, and Serge J Belongie. The multidimensional wisdom of crowds. In Advances in neural information processing systems, pp. 2424–2432, 2010.
 Whitehill et al. (2009) Jacob Whitehill, Tingfan Wu, Jacob Bergsma, Javier R Movellan, and Paul L Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in neural information processing systems, pp. 2035–2043, 2009.
 Zhang et al. (2014) Yuchen Zhang, Xi Chen, Denny Zhou, and Michael I Jordan. Spectral methods meet em: A provably optimal algorithm for crowdsourcing. In Advances in neural information processing systems, pp. 1260–1268, 2014.
 Zhou et al. (2012) Denny Zhou, Sumit Basu, Yi Mao, and John C Platt. Learning from the wisdom of crowds by minimax entropy. In Advances in neural information processing systems, pp. 2195–2203, 2012.
Appendix A DataCrowds Forecaster Comparison
Dataset  Method  4.1(H)  4.2(H)  4.3(H)  4.1(L)  4.2(L)  4.3(L) 

Dogs vs.Cats  MaxMIG (d)  
MaxMIG (dc)  
AggNet (d)  
AggNet (dc)  
CIFAR10  MaxMIG(d)  
MaxMIG(dc)  
AggNet(d)  
AggNet(dc)  
LUNA16  MaxMIG(d)  
MaxMIG(dc)  
AggNet(d)  
AggNet(dc) 
Here (dc) is shorthand for the data-crowds forecaster and (d) is shorthand for the data classifier. We report the average over five runs. The variance is small, and we omit it due to space limitations.
Appendix B Experiments details
B.1 Experts’ expertise
For each information structure in Figure 1, we generate two groups of crowdsourced labels for each dataset: labels provided by (H) experts with relatively high expertise; (L) experts with relatively low expertise. Within each of the situations (H) and (L), all three cases have the same senior experts.
Case B.8.
(Independent mistakes) senior experts are mutually conditionally independent. (H) five senior experts. (L) ten senior experts.
Dogs vs. Cats
In situation (H), some senior experts are more familiar with cats, while others make better judgments on dogs. For example, expert A is more familiar with cats: her expertise for dogs/cats is 0.6/0.8, in the sense that if the ground truth is dog/cat, she labels the image as “dog”/“cat” with probability 0.6/0.8 respectively. Similarly, the other experts’ expertise values are B: 0.6/0.6, C: 0.9/0.6, D: 0.7/0.7, E: 0.6/0.7.
In situation (L), all ten seniors’ expertise values are 0.55/0.55.
CIFAR10
In situation (H), we generate experts who may make mistakes in distinguishing the hard pairs: cat/dog, deer/horse, airplane/bird, automobile/truck, frog/ship, but can perfectly distinguish other easy pairs (e.g. cat/frog), which makes sense in practice. When they cannot distinguish a pair, some of them label the pair randomly and some of them label both members of the pair as the same class. In detail, for each hard pair, expert A labels both members of the pair as the same class (e.g. A always labels the image as “cat” when the image has cats or dogs), and expert B labels the pair uniformly at random (e.g. B labels the image as “cat” with probability 0.5 and “dog” with probability 0.5 when the image has cats or dogs). Expert C is familiar with mammals, so she can distinguish cat/dog and deer/horse, while for the other hard pairs, she labels each of them uniformly at random. Expert D is familiar with vehicles, so she can distinguish airplane/bird, automobile/truck and frog/ship, while for the other hard pairs, she always labels both members as the same class. Expert E does not have special expertise. For each hard pair, expert E labels them correctly with probability 0.6.
In situation (L), all ten senior experts label each image correctly with probability and label each image as other false classes uniformly with probability .
LUNA16
In situation (H), some senior experts tend to label the image as “benign” while others tend to label the image as “malignant”. Their expertise for benign/malignant are: A: 0.6/0.9, B:0.7/0.7, C:0.9/0.6, D:0.6/0.7, E:0.7/0.6.
In situation (L), all ten seniors’ expertise are 0.6/0.6.
Case B.9.
(Naive majority) senior experts are mutually conditional independent, while other junior experts label all data as the first class effortlessly. (H) , . (L) , .
For Dogs vs. Cats, all junior experts label everything as “cat”. For CIFAR10, all junior experts label everything as “airplane”. For LUNA16, all junior experts label everything as “benign”.
Case B.10.
(Correlated mistakes) senior experts are mutually conditionally independent, and each junior expert copies one of the senior experts. (H) five senior experts, five junior experts. (L) ten senior experts, two junior experts.
For Dogs vs. Cats, CIFAR10 and LUNA16, in situation (H), two junior experts copy expert ’s labels and three junior experts copy expert ’s labels; in situation (L), one junior expert copies expert ’s labels and another junior expert copies expert ’s labels.
b.2 Implementation details
Networks
For Dogs vs. Cats and LUNA16, we follow the four-layer network in Rodrigues & Pereira (2017). We use the Adam optimizer with learning rate for both the data classifier and the crowds aggregator. Batch size is set to . For CIFAR10, we use VGG16 as the backbone. We use the Adam optimizer with learning rate for the data classifier and for the crowds aggregator. Batch size is set to .
For the Labelme data, we apply the same setting as Rodrigues & Pereira (2017): we use a pretrained VGG16 deep neural network and apply only one FC layer (with 128 units and ReLU activations) and one output layer on top, using 50% dropout. We use the Adam optimizer with learning rate for both the data classifier and the crowds aggregator.
For the crowds aggregator of our method MaxMIG, on Dogs vs. Cats and LUNA16 we set the bias as and only tune . For CIFAR10 and the Labelme data, we fix the prior distribution to be the uniform distribution and fix the bias as .
Initialization
For AggNet and our method MaxMIG, we initialize the parameters using the method in Raykar et al. (2010):
(2) 
where when and when and N is the total number of datapoints. We average all crowdsourced labels to obtain .
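One plausible reading of this initialization is sketched below. It assumes that the averaged crowdsourced labels act as soft labels for each datapoint and that each expert's weight matrix is initialized to the logarithm of her soft-label-weighted confusion matrix. The function names and the smoothing constant `eps` are additions for this example and are not taken from the paper.

```python
import math


def soft_labels(crowd_labels, num_classes):
    """Q[i][c]: fraction of experts who labelled datapoint i as class c."""
    q = []
    for reports in crowd_labels:
        counts = [0.0] * num_classes
        for r in reports:
            counts[r] += 1.0
        q.append([c / len(reports) for c in counts])
    return q


def init_log_confusion(crowd_labels, expert, num_classes, eps=1e-8):
    """Log of the expert's confusion matrix estimated against the soft labels.

    eps is a smoothing constant (an assumption) that avoids log(0).
    """
    q = soft_labels(crowd_labels, num_classes)
    weights = []
    for c in range(num_classes):
        denom = sum(q_i[c] for q_i in q)
        row = []
        for c_prime in range(num_classes):
            numer = sum(q_i[c] for q_i, reports in zip(q, crowd_labels)
                        if reports[expert] == c_prime)
            row.append(math.log((numer + eps) / (denom + num_classes * eps)))
        weights.append(row)
    return weights


# Four datapoints, three experts; entry [i][m] is expert m's label for item i.
crowd_labels = [[0, 0, 1], [1, 1, 1], [0, 0, 0], [1, 1, 0]]
w0 = init_log_confusion(crowd_labels, expert=0, num_classes=2)
print([[round(math.exp(v), 3) for v in row] for row in w0])
```

Exponentiating each row recovers a probability distribution, so an expert who mostly agrees with the averaged labels starts close to the identity confusion matrix.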
For the Crowd Layer method, we initialize the weight matrices with the identity matrix on Dogs vs. Cats and LUNA16, as Rodrigues & Pereira (2017) suggest. However, this initialization leads to poor results on CIFAR10. Thus, we use (2) for Crowd Layer on CIFAR10, which performed best in our experiments.
Appendix C $f$-mutual information
c.1 $f$-divergence and Fenchel’s duality
$f$-divergence (Ali & Silvey, 1966; Csiszár et al., 2004)
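$f$-divergence $D_f:\Delta_\Sigma\times\Delta_\Sigma\to\mathbb{R}$ is a non-symmetric measure of the difference between distribution $p\in\Delta_\Sigma$ and distribution $q\in\Delta_\Sigma$ and is defined to be
$$D_f(p,q)=\sum_{\sigma\in\Sigma}p(\sigma)\,f\!\left(\frac{q(\sigma)}{p(\sigma)}\right),$$
where $f:\mathbb{R}\to\mathbb{R}$ is a convex function and $f(1)=0$.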
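A small numerical sketch may help make the definition concrete. It assumes the convention $D_f(p,q)=\sum_\sigma p(\sigma) f(q(\sigma)/p(\sigma))$ with $f$ convex and $f(1)=0$; under that convention $f(t)=-\log t$ gives the KL divergence and $f(t)=\tfrac12|t-1|$ gives the total variation distance. The function names are illustrative.

```python
import math


def f_divergence(p, q, f):
    """D_f(p, q) = sum over outcomes of p * f(q / p), skipping zero-mass outcomes of p."""
    return sum(p_s * f(q_s / p_s) for p_s, q_s in zip(p, q) if p_s > 0)


def kl_generator(t):
    """f(t) = -log t, which yields KL(p || q) under the convention above."""
    return -math.log(t)


def tv_generator(t):
    """f(t) = |t - 1| / 2, which yields the total variation distance."""
    return 0.5 * abs(t - 1.0)


p = [0.5, 0.5]
q = [0.9, 0.1]
print(f_divergence(p, q, kl_generator))  # positive, and differs from D_f(q, p)
print(f_divergence(p, q, tv_generator))  # 0.4 up to floating-point rounding
```

Both generators satisfy $f(1)=0$, so the divergence of a distribution from itself is zero.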
c.2 $f$-mutual information
Given two random variables $X, Y$ whose realization spaces are $\Sigma_X$ and $\Sigma_Y$, let $U_{X,Y}$ and $V_{X,Y}$ be two probability measures where $U_{X,Y}$ is the joint distribution of $(X,Y)$ and $V_{X,Y}$ is the product of the marginal distributions of $X$ and $Y$. Formally, for every pair $(x,y)\in\Sigma_X\times\Sigma_Y$,
$$U_{X,Y}(X=x,Y=y)=\Pr[X=x,Y=y],\qquad V_{X,Y}(X=x,Y=y)=\Pr[X=x]\Pr[Y=y].$$
If $U_{X,Y}$ is very different from $V_{X,Y}$, the mutual information between $X$ and $Y$ should be high, since knowing $X$ changes the belief for $Y$ a lot. If $U_{X,Y}$ equals $V_{X,Y}$, the mutual information between $X$ and $Y$ should be zero, since $X$ is independent of $Y$. Intuitively, the “distance” between $U_{X,Y}$ and $V_{X,Y}$ represents the mutual information between them.
Definition C.11 ($f$-mutual information (Kong & Schoenebeck, 2016)).
The $f$-mutual information between $X$ and $Y$ is defined as
$$MI^f(X;Y)=D_f(U_{X,Y},V_{X,Y}),$$
where $D_f$ is $f$-divergence. $f$-mutual information is always non-negative.
Kong & Schoenebeck (2016) show that if we measure the amount of information by $f$-mutual information, any “data processing” on either of the random variables cannot increase the amount of information shared between them. With this property, Kong & Schoenebeck (2016) propose an information theoretic mechanism design framework using $f$-mutual information. Kong & Schoenebeck (2018) reduce the co-training problem to a mechanism design problem and extend the information theoretic framework in Kong & Schoenebeck (2016) to address the co-training problem.
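The definition and the data processing property can be checked numerically on small joint distributions. The sketch below assumes the convention $D_f(p,q)=\sum_\sigma p(\sigma) f(q(\sigma)/p(\sigma))$ applied to the joint distribution and the product of marginals, and uses $f(t)=-\log t$, for which the $f$-mutual information coincides with Shannon mutual information. All names are illustrative.

```python
import math


def f_mutual_information(joint, f):
    """D_f(joint, product of marginals) for a joint given as a nested list."""
    p_x = [sum(row) for row in joint]
    p_y = [sum(col) for col in zip(*joint)]
    total = 0.0
    for x, row in enumerate(joint):
        for y, u in enumerate(row):
            if u > 0:
                total += u * f(p_x[x] * p_y[y] / u)
    return total


def kl_generator(t):
    return -math.log(t)


def process_y(joint, mapping, num_outputs):
    """Joint distribution of (X, g(Y)) for a deterministic map g given as a list."""
    out = [[0.0] * num_outputs for _ in joint]
    for x, row in enumerate(joint):
        for y, u in enumerate(row):
            out[x][mapping[y]] += u
    return out


independent = [[0.25, 0.25], [0.25, 0.25]]
correlated = [[1 / 3, 0, 0], [0, 1 / 3, 0], [0, 0, 1 / 3]]
merged = process_y(correlated, mapping=[0, 1, 1], num_outputs=2)

mi_independent = f_mutual_information(independent, kl_generator)
mi_correlated = f_mutual_information(correlated, kl_generator)
mi_merged = f_mutual_information(merged, kl_generator)
print(mi_independent, mi_correlated, mi_merged)
```

Merging two values of $Y$ is a simple form of data processing, and the resulting mutual information is no larger than the original one.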
Appendix D Proof of Theorem 3.4
This section provides the formal proof of our main theorem.
Definition D.12 (Confusion matrix).
For each expert , we define her confusion matrix as where .
We denote the set of all possible classifiers by and the set of all possible aggregators by .
Lemma D.13.
Proposition D.14.
[Independent mistakes] With assumptions 3.1, 3.3, if the experts are mutually independent conditioning on , then and
for every .
This implies that is a maximizer of
and the maximum is the mutual information between and , . Moreover, for every .
Proof.
We will show that when the experts are mutually conditionally independent, then
This also implies that . Based on the result of Lemma D.13, by assuming that , we can see is a maximizer of and the maximum is the mutual information between and . Moreover, Lemma D.13 also implies that for every .
For every , every ,
Thus,
Then,
(since , ) 
Thus,
∎
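The key identity behind the independent mistakes case can be checked numerically. The sketch below assumes that an expert's confusion matrix entry for (c, c') is the probability that she reports c' given true class c, and that the aggregator applies a softmax to the sum of log-confusion entries selected by the reports plus the log prior. This parameterization is our reading and is stated here as an assumption; the names are illustrative.

```python
import math


def bayes_posterior(prior, confusions, reports):
    """Posterior over the true class, by Bayes' rule under conditional independence."""
    unnorm = []
    for c, p_c in enumerate(prior):
        likelihood = p_c
        for matrix, report in zip(confusions, reports):
            likelihood *= matrix[c][report]
        unnorm.append(likelihood)
    z = sum(unnorm)
    return [u / z for u in unnorm]


def softmax_aggregator(prior, confusions, reports):
    """Softmax over log prior plus summed log-confusion entries."""
    logits = [math.log(p_c) + sum(math.log(matrix[c][report])
                                  for matrix, report in zip(confusions, reports))
              for c, p_c in enumerate(prior)]
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


prior = [0.3, 0.7]
confusions = [[[0.6, 0.4], [0.1, 0.9]],
              [[0.7, 0.3], [0.3, 0.7]],
              [[0.9, 0.1], [0.4, 0.6]]]
reports = [0, 1, 0]

print(bayes_posterior(prior, confusions, reports))
print(softmax_aggregator(prior, confusions, reports))
```

The two functions return the same distribution up to rounding, which is why a log-linear aggregator can represent the Bayesian posterior when experts are conditionally independent.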
We restate our main theorem, Theorem 3.4, here with more details and prove it.
Theorem 3.4 (General case).
With assumption 3.1, 3.3, when there exists a subset of experts such that the experts in are mutually independent conditioning on and is a sufficient statistic for , i.e. for every , then and
for every where for every , , for every , (we denote the matrix whose entries are all zero by ).
This implies that is a maximizer of
and the maximum is the mutual information between and , . Moreover, for every .
Proof.
As in the proof of the above proposition, we need to show that
This also implies that as well as the other results of the theorem.
Thus, we have
where for every , , for every , .
∎
Appendix E Theoretical comparisons with MLE
Raykar et al. (2010) propose a maximum likelihood estimation (MLE) based method for the learning from crowds scenario. Raykar et al. (2010) use logistic regression, and AggNet (Albarqouni et al., 2016) extends it to deep learning models. In this section, we show theoretically that these MLE based methods can handle the independent mistakes case but cannot handle even the simplest correlated mistakes case (only one expert reports meaningful information and all other experts always report the same meaningless information), which our method can handle. Therefore, in addition to the experimental results, our method is also theoretically better than these MLE based methods. We first introduce these MLE based methods.
Let be the parameter that controls the distribution over and . Let be the parameter that controls the distribution over and .
For each , ,
(3)  
(conditioning on , and are independent)
(experts are mutually conditionally independent)
The MLE based method seeks and that maximize
To theoretically compare it with our method, we use our language to reinterpret the above MLE based method.
We define as the set of all transition matrices with each row summing to 1.
For each expert , we define as a parameter that is associated with .
Given a set of data classifiers where , the MLE based method seeks and transition matrices that maximize
The expectation of the above formula is
Note that Raykar et al. (2010) set the data classifiers space as all logistic regression classifiers and Albarqouni et al. (2016) extend this space to the neural network space.
Proposition E.15 (MLE works for independent mistakes).
If the experts are mutually independent conditioning on Y, then and are a maximizer of
Proof.
Since , thus,
which means can be seen as a distribution over all possible . Moreover, for any two distribution vectors and , , thus
(see equation (3)) 
∎
Thus, the MLE based method handles the independent mistakes case. However, we will construct a counterexample to show that it cannot handle a simple correlated mistakes case that our method can handle.
Example E.16 (A simple correlated mistakes case).
We assume there are only two classes and the prior over is uniform, that is, . We also assume that .
There are 101 experts. One of them, say the first expert, fully knows and always reports . The second expert knows nothing and each time flips an unbiased coin whose randomness is independent of . She reports when she gets heads and reports otherwise. The rest of the experts copy the second expert’s answer all the time, i.e. , for .
Note that our method can handle this simple correlated mistakes case and will give all useless experts weight zero based on Theorem 3.4.
We define as a data classifier such that . We will show this meaningless data classifier has much higher likelihood than , which shows that in this simple correlated mistakes case, the MLE based method will obtain meaningless results.
We define a data classifier ’s maximal expected likelihood as
Theorem E.17 (MLE fails for correlated mistakes).
In the scenario defined by Example E.16, the meaningless classifier ’s maximal expected likelihood is at least and the Bayesian posterior classifier ’s maximal expected likelihood is .
The above theorem implies that the MLE based method fails in Example E.16.
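The gap between the two classifiers can be illustrated with a direct computation on the 101-expert example. The sketch below is built on several assumptions that are not taken from the text: classes are coded as 0/1, the meaningless classifier is modelled as a constant uniform output, the true class stands in for the input of the classifier, and the transition matrices are chosen by hand rather than optimized.

```python
import math


def log_likelihood(class_probs, matrices, reports):
    """log sum_c h_c * prod_m A^m[c][y^m], the independence-based likelihood."""
    total = 0.0
    for c, h_c in enumerate(class_probs):
        term = h_c
        for matrix, report in zip(matrices, reports):
            term *= matrix[c][report]
        total += term
    return math.log(total)


IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
UNIFORM = [[0.5, 0.5], [0.5, 0.5]]
NUM_COPIES = 100  # the coin-flipping expert plus the 99 experts who copy her


def reports_for(y, coin):
    """Expert 1 reports the truth; the other 100 experts all report the coin."""
    return [y] + [coin] * NUM_COPIES


def expected_ll(classifier, matrices):
    """Average over the four equally likely (truth, coin) outcomes."""
    values = [log_likelihood(classifier(y), matrices, reports_for(y, coin))
              for y in (0, 1) for coin in (0, 1)]
    return sum(values) / len(values)


def posterior(y):
    """One-hot output on the true class."""
    return [1.0 if c == y else 0.0 for c in (0, 1)]


def meaningless(y):
    """Constant uniform output (an assumed form of a meaningless classifier)."""
    return [0.5, 0.5]


# Posterior classifier: trust expert 1, treat the copies as pure noise.
ll_posterior = expected_ll(posterior, [IDENTITY] + [UNIFORM] * NUM_COPIES)
# Meaningless classifier: let the latent class explain the coin instead.
ll_meaningless = expected_ll(meaningless, [UNIFORM] + [IDENTITY] * NUM_COPIES)
print(ll_posterior, ll_meaningless)
```

With these hand-picked matrices the meaningless classifier reaches an expected log-likelihood of log 0.25, while the posterior classifier pays a factor of 0.5 for each of the 100 copied reports, which illustrates why the independence-based likelihood rewards explaining the copied coin.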
Proof.
For the Bayesian posterior classifier , since and , is a one-hot vector where the entry is 1, and everything is determined by the realizations of and .