Does Confidence Reporting from the Crowd Benefit Crowdsourcing Performance?
We explore the design of an effective crowdsourcing system for an $M$-ary classification task. Crowd workers complete simple binary microtasks whose results are aggregated to give the final classification decision. We consider the scenario where the workers have a reject option, so that they may skip microtasks when they are unable to or choose not to respond. Additionally, the workers report quantized confidence levels when they are able to submit definitive answers. We present an aggregation approach using a weighted majority voting rule, where each worker’s response is assigned an optimized weight to maximize the crowd’s classification performance. We obtain the counterintuitive result that the classification performance does not benefit from workers reporting quantized confidence. Therefore, the crowdsourcing system designer should employ the reject option without requiring confidence reporting.
1. Introduction
Crowdsourcing provides a new framework for utilizing distributed human wisdom to solve problems that machines cannot solve well, such as handwriting recognition, paraphrase acquisition, audio transcription, and photo tagging (Paritosh et al., 2011; Burrows et al., 2013; Fan et al., 2015). Despite the successful applications of crowdsourcing, the relatively low quality of output is a key challenge (Ipeirotis et al., 2010; Allahbakhsh et al., 2013; Mo et al., 2013).
Several methods have been proposed to address this challenge (Karger et al., 2011; Vempaty et al., 2014; Yue et al., 2014; Sanchez-Charles et al., 2014; Varshney et al., 2014; Quinn and Bederson, 2011; Zhang and van der Schaar, 2012; Hirth et al., 2013). A crowdsourcing task is decomposed into microtasks that are easy for an individual to accomplish, and these microtasks could be as simple as binary distinctions (Karger et al., 2011). A classification problem with crowdsourcing, where taxonomy and dichotomous keys are used to design binary questions, is considered in (Vempaty et al., 2014). In our research group, we employed binary questions and studied the use of error-control codes and decoding algorithms to design crowdsourcing systems for reliable classification (Varshney et al., 2014; Vempaty et al., 2014). A group control mechanism, in which the reputation of the workers is taken into consideration to partition the crowd into groups, is presented in (Quinn and Bederson, 2011; Zhang and van der Schaar, 2012). Group control and majority voting are compared in (Hirth et al., 2013), which reports that majority voting is more cost-effective on less complex tasks.
In past work on classification via crowdsourcing, crowd workers were required to provide a definitive yes/no response to binary microtasks. However, crowd workers may be unable to answer questions for a variety of reasons, such as lack of expertise. As an example, in mismatched speech transcription, i.e., transcription by workers who do not know the language, workers may not be able to perceive the phonological dimensions they are tasked to differentiate (Jyothi and Hasegawa-Johnson, 2015). In recent work, we investigated the design of the optimal aggregation rule when the workers have a reject option, so that they may skip microtasks to which they are unable to or choose not to respond (Li et al., 2017).
The possibility of using confidence scores to improve the quality of crowdsourced labels was investigated in (Kazai, 2011). An aggregation method using confidence scores to integrate labels provided by crowdsourcing workers was developed in (Oyama et al., 2013). A payment mechanism was proposed for crowdsourcing systems with a reject option and confidence score reporting (Shah and Zhou, 2014). Indeed, confidence reporting can be useful for estimating the quality of the provided responses and may yield better outcomes when the aggregation is not optimal. However, whether confidence reporting improves crowdsourcing performance under an optimal aggregation rule has not yet been investigated. As shown in this paper, when an optimal aggregation rule is developed, confidence reporting does not help to improve the performance.
In this paper, we further consider the problem investigated in (Li et al., 2017) by studying the scenario when the workers include their confidence levels in their responses. The main contribution of this paper is the counterintuitive finding that the confidence scores of the crowd do not play a role in the optimal aggregation rule. The weight assignment scheme to ensure the maximum weight for the correct class is the same as that when there is no confidence reporting. Although confidence reporting can provide useful information for estimating the quality of the crowd, the noise introduced due to categorization of confidence makes the estimation less accurate. Since the estimation result is essential for aggregation, confidence reporting may cause performance degradation.
2. Crowdsourcing Task with a Reject Option
Consider the situation where $W$ workers take part in an $M$-ary object classification task. Each worker is asked $N$ simple binary questions, termed microtasks, whose answers eventually lead to a classification decision among the $M$ classes. We assume independent microtask design and, therefore, we have $N$ independent microtasks of equal difficulty. The workers submit responses that are combined to give the final decision. The worker’s answer to a single microtask is conventionally represented by either “1” (Yes) or “0” (No) (Vempaty et al., 2014; Rocker et al., 2007). Thus, the $w$th worker’s ordered answers to all the microtasks form an $N$-bit word, which is denoted by $\mathbf{a}_w$. Let $a_w(i)$, $i \in \{1, \ldots, N\}$, represent the $i$th bit in this vector.
In our previous work (Li et al., 2017), we considered a general problem setting where the worker has a reject option of skipping the microtasks. We denote this skipped answer as $\lambda$, whereas the “1/0” (Yes/No) answers are termed definitive answers. Due to the variability of worker backgrounds, the probability of submitting definitive answers differs from worker to worker. Let $p_{w,i}$ represent the probability of the $w$th worker submitting $\lambda$ for the $i$th microtask. Similarly, let $\rho_{w,i}$ be the probability that $a_w(i)$, the $i$th answer of the $w$th worker, is correct given that a definitive answer has been submitted. Due to the variability and anonymity of workers, we study crowdsourcing performance when $p_{w,i}$ and $\rho_{w,i}$ are realizations of certain probability distributions, which are denoted by $F_P(p)$ and $F_\rho(\rho)$, respectively. The corresponding means are expressed as $m$ and $\mu$.
Let $H_0$ and $H_1$ denote the hypotheses where “0” or “1” is the true answer for a single microtask, respectively. For simplicity of performance analysis, $H_0$ and $H_1$ are assumed equiprobable for every microtask. The crowdsourcing task manager or a fusion center (FC) collects the $N$-bit answer words from the $W$ workers and performs fusion based on an aggregation rule.
We focus on finding the optimal aggregation rule. We first briefly review the results on aggregating the workers’ responses for classification from our previous work (Li et al., 2017).
• Let $D = \{e_j,\ j = 1, 2, \ldots, M\}$ be the set of all the object classes, where $e_j$ represents the $j$th class. Based on the $w$th worker’s response to the microtasks, a subset $D_w \subseteq D$ is chosen, within which the classes are associated with weight $W_w$ for aggregation. (If all the responses from the $w$th worker are definitive, $D_w$ is a singleton. Otherwise, $D_w$ contains multiple classes.) The fusion center adds up the weights for every class and chooses the one with the highest overall weight as the final decision $e_D$, which can be expressed as
$$ e_D = \arg\max_{e_j \in D} \sum_{w=1}^{W} W_w \, I_{D_w}\langle e_j \rangle, \tag{1} $$
where $I_{D_w}\langle e_j \rangle$ is an indicator function which equals 1 if $e_j \in D_w$ and 0 otherwise. To derive the optimal weight for each worker, one may look into the minimization of the misclassification probability, for which a closed-form expression cannot be derived without an explicit expression for $W_w$. Hence, it is difficult to determine the optimal weight.
• The $M$-ary classification task can also be split into $N$ binary hypothesis testing problems, by associating a classification decision with an $N$-bit word. Each worker votes “1” or “0” with the weight $W_w$ for every bit. In this case, the Chair-Varshney rule gives the optimal weight as $W_w = \log \frac{\rho_{w,i}}{1 - \rho_{w,i}}$, where $\rho_{w,i}$ is the probability that the $w$th worker’s definitive answer to the $i$th microtask is correct (Chair and Varshney, 1986). However, this requires prior knowledge of $\rho_{w,i}$ for every worker, which is not available in practice.
• We proposed a novel weighted majority voting method, which was derived by solving the following optimization problem
$$ \text{maximize} \;\; E_C[\mathbb{W}] \quad \text{subject to} \;\; E_O[\mathbb{W}] = K, \tag{2} $$
where $E_C[\mathbb{W}]$ denotes the crowd’s average weight contribution to the correct class and $E_O[\mathbb{W}]$ denotes the average weight contribution to all the possible classes, which is constrained to remain a constant $K$. Statistically, this method ensures maximum weight to the correct class and consequently maximum probability of correct classification. We showed that this method significantly outperforms the simple majority voting procedure.
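The class-based aggregation reviewed in the first bullet can be illustrated with a minimal Python sketch. It assumes that a class is represented by a tuple of bits, that a skipped microtask is encoded as `None`, and that the weights are supplied by the caller; the function names `candidate_classes` and `aggregate` are hypothetical and are not taken from the paper.

```python
from itertools import product


def candidate_classes(answers):
    """Return every bit word consistent with one worker's answers.

    `answers` is a tuple with entries 0, 1, or None (None = skipped).
    A fully definitive response yields a singleton set.
    """
    options = [(0, 1) if a is None else (a,) for a in answers]
    return set(product(*options))


def aggregate(responses, weights):
    """Add each worker's weight to all classes in the worker's candidate
    set and return the class with the highest total weight."""
    totals = {}
    for answers, weight in zip(responses, weights):
        for word in candidate_classes(answers):
            totals[word] = totals.get(word, 0.0) + weight
    return max(totals, key=totals.get)


# Three hypothetical workers answering two microtasks.
responses = [(1, None), (1, 0), (0, 0)]
weights = [1.0, 2.0, 1.5]  # illustrative weights only
decision = aggregate(responses, weights)
print(decision)
```

Here the first worker skips the second microtask, so the weight 1.0 is added to both classes that begin with bit 1. Ties are broken by insertion order in this sketch, which is a simplification.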
In this paper, we investigate the impact of confidence reporting from the crowd on system performance. The weight assignment scheme is developed by solving problem (2) as well.
3. Crowdsourcing with Confidence Reporting
We consider the case where the crowd is composed of honest workers, which means that the workers honestly observe, think, and answer the questions, give confidence levels, and skip questions that they are not confident about. We derive the optimal weight assignment for the workers and the performance of the system in a closed form. Based on these findings, we determine the potential benefits of confidence reporting in a crowdsourcing system with a reject option.
3.1. Confidence Level Reporting
In a crowdsourcing system where workers submit answers and report confidence, we define the th worker’s confidence about the answer to the th microtask as the probability of this answer being correct given that this worker gives a definitive answer, which is equal to as defined earlier. When is bounded as , , the th worker reports his/her confidence level as . Let be drawn from the distribution . Note that every worker independently gives confidence levels for different microtasks, and simply means that workers submit answers and do not report their confidence levels.
Assuming that a worker can accurately perceive the probability and honestly report the confidence level, intuitively it is expected that it will benefit the crowdsourcing fusion center as much more information about the quality of the crowd can be extracted. However, as the confidence is quantized, which helps the workers in determining the confidence levels to be reported, quantization noise is introduced in extracting the crowd quality from confidence reporting.
As an illustrative example, consider the problem of mismatched crowdsourcing for speech transcription, which has garnered interest in the signal processing community (Hasegawa-Johnson et al., 2015; Jyothi and Hasegawa-Johnson, 2015; Varshney et al., 2016; Liu et al., 2016; Chen et al., 2016; Kong et al., 2016). Suppose the four possibilities for a velar stop consonant to transcribe are k, K, g, G. The simple binary question of “whether it is aspirated or unaspirated” differentiates between K, G and k, g, whereas the binary question of “whether it is voiced or unvoiced” differentiates between g, G and k, K. The highest confidence level is set as . Now suppose the first worker is a native Italian speaker. Since Italian does not use aspiration, this worker will be unable to differentiate between k and K, or between g and G. It would be of benefit if this worker would specify the inability to perform the task through a special symbol , rather than guessing randomly, and this worker answers “Yes” with confidence level 1 to the second question. Suppose the second worker is a native Bengali speaker. Since this language makes a four-way distinction among velar stops, such a worker will probably answer both questions without a .
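The velar stop example can be made concrete with a short sketch that lists which consonants remain possible after each worker responds. The bit encoding (first bit for aspiration, second bit for voicing), the use of `None` for a skipped question, and the Bengali speaker's specific answers are assumptions made for illustration.

```python
from itertools import product

# Assumed encoding: (aspirated, voiced) -> consonant.
VELAR_STOPS = {(0, 0): "k", (1, 0): "K", (0, 1): "g", (1, 1): "G"}


def consistent_consonants(answers):
    """Return the consonants compatible with a worker's answers.

    Each answer is 0 (No), 1 (Yes), or None (question skipped).
    """
    options = [(0, 1) if a is None else (a,) for a in answers]
    return {VELAR_STOPS[word] for word in product(*options)}


italian_speaker = (None, 1)  # skips aspiration, answers "Yes" to voicing
bengali_speaker = (1, 1)     # hypothetical definitive answers to both

print(consistent_consonants(italian_speaker))
print(consistent_consonants(bengali_speaker))
```

The Italian speaker's response narrows the choice to the two voiced consonants without injecting a random guess about aspiration, while the Bengali speaker's response selects a single consonant.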
In the rest of this section, we address the problem “Does the confidence reporting help crowdsourcing system performance?” by performing analyses when workers report their confidences with their definitive answers.
3.2. Optimal Weight Assignment Scheme
We determine the optimal weight $W_w$ for the $w$th worker in this section. We restate the weight assignment problem
$$ \text{maximize} \;\; E_C[\mathbb{W}] \quad \text{subject to} \;\; E_O[\mathbb{W}] = K, \tag{3} $$
where $E_C[\mathbb{W}]$ denotes the crowd’s average weight contribution to the correct class and $E_O[\mathbb{W}]$ denotes the average weight contribution to all the possible classes, which remains a constant $K$. Statistically, we are looking for the weight assignment scheme such that the weight contribution to the correct class is maximized while the weight contribution to all the classes remains fixed, so as to maximize the probability of correct classification.
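Proposition 3.1.
To maximize the average weight assigned to the correct classification element, the weight for the $w$th worker’s answer is given by
$$ W_w = \mu^{-n}, \tag{4} $$
where $\mu$ is the mean probability that a definitive answer is correct and $n$ is the number of definitive answers that the $w$th worker submits.
Proof. See Appendix. ∎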
Remark 1.
Here the weight depends on the number of questions answered by a worker. In fact, if more questions are answered, the weight assigned to the corresponding worker’s answer is larger. This is intuitively pleasing as a high-quality worker is able to answer more questions and is assigned a higher weight. Increased weight can put more emphasis on the contribution of high-quality workers in that sense and improve overall classification performance.
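The dependence described in Remark 1 can be checked numerically. The sketch below assumes the weight has the form $\mu^{-n}$, as in the no-confidence setting of (Li et al., 2017), with $\mu$ the mean probability that a definitive answer is correct and $n$ the number of definitive answers; the function name `worker_weight` and the value $\mu = 0.8$ are illustrative choices.

```python
def worker_weight(mu, num_definitive):
    """Weight mu**(-n) for a worker who gave n definitive answers.

    Assumes 0 < mu <= 1; the form of the weight is an assumption
    stated in the accompanying text.
    """
    if not 0.0 < mu <= 1.0:
        raise ValueError("mu must lie in (0, 1]")
    if num_definitive < 0:
        raise ValueError("num_definitive must be non-negative")
    return mu ** (-num_definitive)


mu = 0.8  # illustrative crowd quality
for n in range(4):
    print(n, worker_weight(mu, n))
```

Because $\mu < 1$, each additional definitive answer multiplies the weight by $1/\mu > 1$, so a worker who answers every microtask carries the largest weight, and a worker who skips everything carries weight 1 but contributes equally to all classes.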
Remark 2.
When , associated with every worker for every microtask is reported exactly. Then the Chair-Varshney rule gives the optimal weight assignment to minimize error probability (Chair and Varshney, 1986). However, human decision makers are limited in their information processing capacity and can only carry around seven categories (Miller, 1956). Thus, the largest value of is around 7 in practice.
Remark 3.
Note that the optimal weight assignment scheme is the same as in the case where the workers do not report confidence levels, i.e., . Actually, the value of does not play any role in the weight assignment, as long as is not known exactly. Therefore, the weight assignment is universally optimal regardless of confidence reporting.
3.3. Parameter Estimation
Before the proposed aggregation rule can be used, the mean correctness probability $\mu$ has to be estimated to assign the weight for every worker’s answers. Here, we employ three approaches to estimate $\mu$. We refer to our previous work (Li et al., 2017) for training-based and majority-voting-based methods to estimate $\mu$, and give an additional method using the information extracted from the workers’ reported confidence levels.
Note that the reported confidence levels correspond to . We collect all the values of the submitted confidence levels and obtain the estimate of from them. First, the th worker’s confidence level for the th microtask is represented by . Considering the fact that if the worker submits a definitive answer, we use to approximate . Let if the th worker skips the th microtask. We obtain the estimate of by
where denotes the number of definitive answers that th worker submits.
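A confidence-based estimate of this kind can be sketched as follows. The sketch makes several assumptions that are not confirmed by the text: confidence levels are integers from 1 to `num_levels` that partition $[0, 1]$ into equal bins, each reported level is replaced by the midpoint of its bin, skipped microtasks (encoded as `None`) contribute nothing, and workers with no definitive answers are left out of the crowd average. The function name is hypothetical.

```python
def estimate_mu_from_confidence(levels, num_levels):
    """Estimate the crowd's mean correctness probability.

    `levels[w][i]` is the confidence level (1..num_levels) reported by
    worker w for microtask i, or None if the microtask was skipped.
    Each level is mapped to the midpoint of an assumed equal-width bin,
    averaged over the worker's definitive answers, then over workers.
    """
    per_worker = []
    for worker_levels in levels:
        reported = [lv for lv in worker_levels if lv is not None]
        if not reported:
            continue  # assumption: ignore workers who skipped everything
        midpoints = [(lv - 0.5) / num_levels for lv in reported]
        per_worker.append(sum(midpoints) / len(midpoints))
    if not per_worker:
        raise ValueError("no definitive answers to estimate from")
    return sum(per_worker) / len(per_worker)


# Two hypothetical workers, four confidence levels.
levels = [[4, 4, None], [4, None, None]]
print(estimate_mu_from_confidence(levels, num_levels=4))
```

Under the midpoint assumption, a crowd that reports only the top level cannot produce an estimate above the midpoint of the top bin, which illustrates how quantization limits the accuracy of this estimator for high-quality crowds.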
3.4. Performance Analysis
In this section, we characterize the performance of the proposed crowdsourcing classification framework in terms of the probability of correct classification $P_c$. Note that we have overall correct classification only when all the bits are classified correctly.
Proposition 3.2.
The probability of correct classification in the crowdsourcing system is
where with natural numbers and , and , and
The proof is similar to the proof in our previous work (Li et al., 2017) and is, therefore, omitted for brevity. ∎
4. Simulation Results
In this section, we give the simulation results for the proposed crowdsourcing system. The workers take part in a classification task of microtasks. is a uniform distribution denoted as .
First, we show the efficiency of the derived optimal weight assignment over the widely used simple majority voting method for crowdsourcing systems. The performance comparison is presented with the number of workers varying from 3 to 29. Here, we consider different qualities of the individual workers in the crowd which is represented by variable with a uniform distribution . Thus, the mean is 0.8, and we give simulation results when confidence reporting is not included and the estimation of is perfect in Fig. 1. It is observed that a larger crowd completes the classification task with higher quality. A significant performance improvement by the proposed method with a reject option compared with the simple majority voting is shown in the figure.
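A comparison of this kind can be reproduced in spirit with a small Monte Carlo sketch. All numerical choices below are assumptions for illustration: three microtasks, a skip probability drawn uniformly from $[0, 1]$, a correctness probability drawn uniformly from $[0.6, 1]$ (which has mean 0.8), a weight of the form $\mu^{-n}$ with $\mu$ known exactly, bitwise voting, and a majority-voting baseline in which a worker who would have skipped is forced to guess at random.

```python
import random


def decide(scores, rng):
    """Threshold each bit score at zero, breaking ties at random."""
    return [1 if s > 0 else 0 if s < 0 else rng.randint(0, 1) for s in scores]


def simulate(num_workers, num_bits=3, trials=2000, seed=0):
    """Return (weighted-with-reject, simple-majority) accuracy estimates."""
    rng = random.Random(seed)
    mu = 0.8  # mean of rho ~ U(0.6, 1); assumed perfectly estimated
    hits_weighted = hits_majority = 0
    for _ in range(trials):
        truth = [rng.randint(0, 1) for _ in range(num_bits)]
        score_w = [0.0] * num_bits  # weighted votes, skips allowed
        score_m = [0.0] * num_bits  # unit votes, skips become guesses
        for _ in range(num_workers):
            p_skip = rng.random()        # assumed p ~ U(0, 1)
            rho = rng.uniform(0.6, 1.0)  # assumed rho ~ U(0.6, 1)
            votes = []
            for i, bit in enumerate(truth):
                if rng.random() < p_skip:
                    votes.append(None)
                    forced = rng.randint(0, 1)  # baseline: random guess
                else:
                    forced = bit if rng.random() < rho else 1 - bit
                    votes.append(forced)
                score_m[i] += 1 if forced == 1 else -1
            n = sum(v is not None for v in votes)
            weight = mu ** (-n)
            for i, v in enumerate(votes):
                if v is not None:
                    score_w[i] += weight if v == 1 else -weight
        hits_weighted += decide(score_w, rng) == truth
        hits_majority += decide(score_m, rng) == truth
    return hits_weighted / trials, hits_majority / trials


for num_workers in (3, 15, 29):
    print(num_workers, simulate(num_workers))
```

The sketch declares a classification correct only when every bit is decided correctly. It is not the simulation setup used to produce Fig. 1, so the numbers it prints should not be compared directly with the figure.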
Since an accurate estimation of is essential for applying the optimal weight assignment scheme, we next focus on the estimation results of for the three estimation methods as discussed in the previous section. Let be a uniform distribution expressed as with , and thus we can have varying from 0.5 to 1. We consider that workers participate in the classification task with a reject option and confidence reporting.
In Fig. 2, it is observed that the training-based method has the best overall performance, which takes advantage of the gold standard questions. We can also see that the majority voting method has better performance as increases. This is because a larger means a better-quality crowd, which will lead to a more accurate result from majority voting, and consequently better estimation performance of . When confidence is considered with , we find that the overall estimation performance is not better than the other two methods because of quantization noise associated with confidence reporting in the estimation of . It is also shown that the curve saturates and yields a fixed value of when . This is because almost all the confidence levels submitted then are and the corresponding estimate result is exactly 0.875.
The estimation performance of the confidence-based method with multiple confidence levels is presented in Fig. 2. As is expected, a larger can help improve the estimation performance. However, it is seen that even though , the corresponding performance is still not as good as that of the other two methods. Although we can expect estimation performance improvement as the maximum number of confidence levels increases, is pretty much the limit in practice due to the human inability to categorize beyond 7 levels. When the confidence-based estimation method is employed, the estimate value saturates at a certain fixed value when is large. Therefore, it can be concluded that the confidence-based estimation method does not provide good results.
Even though the three methods differ in performance in the estimation of , we show in Fig. 3 the robustness of the proposed system. We observe from Fig. 2 that the majority voting based method suffers from performance degradation in the low- regime, while the confidence based one suffers in the high- regime. However, when the value of is low, the workers are making random guesses even when they believe that they are able to respond with definitive answers. When the value of is large, almost all the definitive answers submitted are correct. Therefore, in those two situations, the performance degradation in the estimation of is negligible. From Fig. 3, we see that system performance of the proposed system with estimation results from Fig. 2 is almost the same as with the other three estimation methods, which significantly outperforms the system where simple majority voting is employed without a reject option. However, if a significant performance degradation in the estimation of occurs outside the two aforementioned regimes, overall classification performance loss is expected. For example, consider the case where is 0.8 while is 0.5, and , then . However, the actual equals 0.89 when is estimated with an acceptable error.
5. Conclusion
We have studied a novel crowdsourcing framework for classification, in which an individual worker has a reject option and can skip a microtask if he/she has no definitive answer, and gives definitive answers with quantized confidence. We presented an aggregation approach using a weighted majority voting rule, where each worker’s response is assigned an optimized weight to maximize the crowd’s classification performance. However, we showed that confidence reporting by the crowd does not benefit classification performance. The system designer is therefore advised to adopt the reject option without confidence reporting from the workers, since confidence reporting does not improve classification performance and may degrade it in some cases.
Appendix
To solve problem (3), we need $E_C[\mathbb{W}]$ and $E_O[\mathbb{W}]$. First, the $w$th worker can have weight contribution to $E_C[\mathbb{W}]$ only if all his/her definitive answers are correct. Thus, we have the average weight assigned to the correct element as
where denotes and with cardinality . Given a known th worker, i.e., is known, we write
Note that , and then (8) is upper-bounded using Cauchy-Schwarz inequality as follows:
Also note that equality holds in (9) only if
where is a positive quantity independent of , which might be a function of , and
Note that , and similarly we write
The equality (12) holds only if
where is a positive constant independent of , and we conclude that is also a positive quantity independent of . Then from (11), we have . Since is the product of variables, its distribution is not known a priori. A possible solution to weight assignment is a deterministic value given by and, therefore, we can write the weight as
Then, we can express the crowd’s average weight contribution to all the classes defined in (3) as
Thus, and the weight can be obtained accordingly. Note that the weight derived above has a term that is common for every worker. Since the voting scheme is based on comparison, we can ignore this factor and have the normalized weight as .
Acknowledgements. This work was supported in part by the Army Research Office under Grant W911NF-14-1-0339 and in part by the National Science Foundation under Grant ENG-1609916.
- Allahbakhsh et al. (2013) Mohammad Allahbakhsh, Boualem Benatallah, Aleksandar Ignjatovic, Hamid Reza Motahari-Nezhad, Elisa Bertino, and Schahram Dustdar. 2013. Quality control in crowdsourcing systems: Issues and directions. IEEE Internet Comput. 2 (March 2013), 76–81.
- Burrows et al. (2013) Steven Burrows, Martin Potthast, and Benno Stein. 2013. Paraphrase acquisition via crowdsourcing and machine learning. ACM Trans. Intell. Syst. Technol. 4, 3 (July 2013), 43.
- Chair and Varshney (1986) Zeineddin Chair and Pramod K. Varshney. 1986. Optimal data fusion in multiple sensor detection systems. IEEE Trans. Aerosp. Electron. Syst. AES-22, 1 (Jan. 1986), 98–101. DOI:http://dx.doi.org/10.1109/TAES.1986.310699
- Chen et al. (2016) Wenda Chen, Mark Hasegawa-Johnson, and Nancy F. Chen. 2016. Mismatched Crowdsourcing based Language Perception for Under-resourced Languages. Procedia Computer Science 81 (2016), 23–29. DOI:http://dx.doi.org/10.1016/j.procs.2016.04.025
- Fan et al. (2015) Ju Fan, Meihui Zhang, S. Kok, Meiyu Lu, and Beng Chin Ooi. 2015. CrowdOp: Query Optimization for Declarative Crowdsourcing Systems. IEEE Trans. Knowl. Data Eng. 27, 8 (Aug. 2015), 2078–2092. DOI:http://dx.doi.org/10.1109/TKDE.2015.2407353
- Hasegawa-Johnson et al. (2015) Mark Hasegawa-Johnson, Jennifer Cole, Preethi Jyothi, and Lav R. Varshney. 2015. Models of Dataset Size, Question Design, and Cross-Language Speech Perception for Speech Crowdsourcing Applications. Laboratory Phonology 6, 3-4 (Oct. 2015), 381–431. DOI:http://dx.doi.org/10.1515/lp-2015-0012
- Hirth et al. (2013) Matthias Hirth, Tobias Hoßfeld, and Phuoc Tran-Gia. 2013. Analyzing costs and accuracy of validation mechanisms for crowdsourcing platforms. Math. Comput. Model. 57, 11 (July 2013), 2918–2932.
- Ipeirotis et al. (2010) Panagiotis G. Ipeirotis, Foster Provost, and Jing Wang. 2010. Quality Management on Amazon Mechanical Turk. In Proc. ACM SIGKDD Workshop Human Comput. (HCOMP’10). 64–67. DOI:http://dx.doi.org/10.1145/1837885.1837906
- Jyothi and Hasegawa-Johnson (2015) Preethi Jyothi and Mark Hasegawa-Johnson. 2015. Acquiring Speech Transcriptions Using Mismatched Crowdsourcing. In Proc. 29th AAAI Conf. Artificial Intelligence (AAAI’15).
- Karger et al. (2011) David R. Karger, Sewoong Oh, and Devavrat Shah. 2011. Iterative learning for reliable crowdsourcing systems. In Advances in Neural Information Processing Systems (NIPS) 24. MIT Press, Cambridge, MA, 1953–1961.
- Kazai (2011) Gabriella Kazai. 2011. In search of quality in crowdsourcing for search engine evaluation. In European Conference on Information Retrieval. Springer, 165–176.
- Kong et al. (2016) Xiang Kong, Preethi Jyothi, and Mark Hasegawa-Johnson. 2016. Performance Improvement of Probabilistic Transcriptions with Language-specific Constraints. Procedia Computer Science 81 (2016), 30–36. DOI:http://dx.doi.org/10.1016/j.procs.2016.04.026
- Li et al. (2017) Q. Li, A. Vempaty, L. R. Varshney, and P. K. Varshney. 2017. Multi-Object Classification via Crowdsourcing With a Reject Option. IEEE Trans. Signal Process. 65, 4 (Feb 2017), 1068–1081. DOI:http://dx.doi.org/10.1109/TSP.2016.2630038
- Liu et al. (2016) Chunxi Liu, Preethi Jyothi, Hao Tang, Vimal Manohar, Rose Sloan, Tyler Kekona, Mark Hasegawa-Johnson, and Sanjeev Khudanpur. 2016. Adapting ASR for Under-Resourced Languages Using Mismatched Transcriptions.
- Miller (1956) George A Miller. 1956. The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological review 63, 2 (1956), 81.
- Mo et al. (2013) Kaixiang Mo, Erheng Zhong, and Qiang Yang. 2013. Cross-task crowdsourcing. In Proc. ACM Int. Conf. Knowl Discovery Data Mining. 677–685.
- Oyama et al. (2013) Satoshi Oyama, Yukino Baba, Yuko Sakurai, and Hisashi Kashima. 2013. Accurate Integration of Crowdsourced Labels Using Workers’ Self-reported Confidence Scores. In IJCAI.
- Paritosh et al. (2011) Praveen Paritosh, Panos Ipeirotis, Matt Cooper, and Siddharth Suri. 2011. The computer is the new sewing machine: Benefits and perils of crowdsourcing. In Proc. 20th Int. Conf. World Wide Web (WWW’11). 325–326.
- Quinn and Bederson (2011) Alexander J. Quinn and Benjamin B. Bederson. 2011. Human computation: a survey and taxonomy of a growing field. In Proc. 2011 Annu. Conf. Hum. Factors Comput. Syst. (CHI 2011). 1403–1412. DOI:http://dx.doi.org/10.1145/1978942.1979148
- Rocker et al. (2007) Jana Rocker, Christopher M Yauch, Sumanth Yenduri, LA Perkins, and F Zand. 2007. Paper-based dichotomous key to computer based application for biological indentification. J. Comput. Sci. Coll. 22, 5 (May 2007), 30–38.
- Sanchez-Charles et al. (2014) D. Sanchez-Charles, J. Nin, M. Sole, and V. Muntes-Mulero. 2014. Worker ranking determination in crowdsourcing platforms using aggregation functions. In Proc. IEEE Int. Conf. Fuzzy Syst. (FUZZ 2014). 1801–1808. DOI:http://dx.doi.org/10.1109/FUZZ-IEEE.2014.6891807
- Shah and Zhou (2014) Nihar B Shah and Dengyong Zhou. 2014. Double or Nothing: Multiplicative Incentive Mechanisms for Crowdsourcing. arXiv preprint arXiv:1408.1387 (2014).
- Varshney et al. (2016) Lav R. Varshney, Preethi Jyothi, and Mark Hasegawa-Johnson. 2016. Language Coverage for Mismatched Crowdsourcing.
- Varshney et al. (2014) Lav R. Varshney, Aditya Vempaty, and Pramod K. Varshney. 2014. Assuring Privacy and Reliability in Crowdsourcing with Coding. In Proc. 2014 Inf. Theory Appl. Workshop. DOI:http://dx.doi.org/10.1109/ITA.2014.6804213
- Vempaty et al. (2014) Aditya Vempaty, Lav R. Varshney, and Pramod K. Varshney. 2014. Reliable Crowdsourcing for Multi-Class Labeling Using Coding Theory. IEEE J. Sel. Topics Signal Process. 8, 4 (Aug. 2014), 667–679. DOI:http://dx.doi.org/10.1109/JSTSP.2014.2316116
- Yue et al. (2014) Dejun Yue, Ge Yu, Derong Shen, and Xiaocong Yu. 2014. A weighted aggregation rule in crowdsourcing systems for high result accuracy. In Proc. IEEE 12th Int. Conf. Depend. Auton. Secure Comput. (DASC). 265–270.
- Zhang and van der Schaar (2012) Yu Zhang and Mihaela van der Schaar. 2012. Reputation-based incentive protocols in crowdsourcing applications. In Proc. 31st IEEE Conf. Computer Commun. (INFOCOM 2012). 2140–2148.