Evaluating Artificial Systems for Pairwise Ranking Tasks Sensitive to Individual Differences
Abstract
Owing to the advancement of deep learning, artificial systems now rival humans in several pattern recognition tasks, such as visual recognition of object categories. However, this is only the case for tasks in which correct answers exist independent of human perception. There is another type of task in which what is to be predicted is human perception itself, and in which there are often individual differences. In that case there is no longer a single “correct” answer to predict, which makes evaluation of artificial systems difficult. In this paper, focusing on pairwise ranking tasks sensitive to individual differences, we propose an evaluation method. Given a ranking result for multiple item pairs that is generated by an artificial system, our method quantifies the probability that the same ranking result will be generated by humans, and judges whether it is distinguishable from human-generated results. We introduce a probabilistic model of human ranking behavior, and present an efficient computation method for the judgment. To estimate model parameters accurately from small-size samples, we present a method that uses confidence scores given by annotators for ranking each item pair. Taking as an example a task of ranking image pairs according to material attributes of objects, we demonstrate how the proposed method works.
Xing Liu Graduate School of Information Sciences, Tohoku University ryu@vision.is.tohoku.ac.jp Takayuki Okatani Graduate School of Information Sciences, Tohoku University RIKEN Center for AIP okatani@vision.is.tohoku.ac.jp
Preprint. Work in progress.
1 Introduction
Standard recognition tasks, in which some entities (e.g., labels) are to be predicted from an input (e.g., an image etc.), can be roughly categorized into two groups:

(T1) Tasks in which what to predict is given independent of human perception.

(T2) Tasks in which what to predict is human perception itself.
For the sake of explanation, we limit our attention to visual tasks in what follows, although the discussions apply to other modalities. Then, examples of T1 are visual recognition of object categories and individual faces, and examples of T2 are prediction of aesthetic quality (Murray et al. [2012], Marchesotti et al. [2011], Kong et al. [2016], Deng et al. [2016]) and memorability (Isola et al. [2014]) of images.
In order to build a machine-learning-based system for task group T2, it is first necessary to materialize what humans perceive from images. This can be performed by, for example, asking human subjects to give a score for an image or asking them to rank multiple images. Then, we consider training artificial systems (e.g., convolutional neural networks) so that they will predict the scores or ranking results as accurately as possible.
As is well recognized, CNNs can now rival humans for several visual recognition tasks (He et al. [2016], Szegedy et al. [2015]), when they are properly trained in a supervised manner. Although it may not be widely recognized, this is only the case with task group T1, in which labels to predict are given independent of human perception and thus there should exist correct answers to predict; therefore it is straightforward to define and measure the performance of CNNs. For task group T2, however, it remains unclear whether CNNs can achieve the human level of performance, although the advancement of deep learning has arguably contributed to a significant performance boost.
This may be attributable to individual differences among annotators that often emerge in tasks of T2. When there are individual differences, there is no unique correct answer to predict, which makes it difficult to evaluate the performance of artificial systems. Figure 1 shows an example of such cases, that is, pairwise ranking of images according to material attributes of objects. While, for some image pairs and attributes, human annotators will give a unanimous ranking as depicted in Figure 1(a), for others, they will give diverging rankings. The latter can be divided into two cases, i.e., when the annotators confidently make diverging rankings, which mostly occurs for subjective cases, as in Figure 1(b), and when they are uncertain and give diverging rankings, as in Figure 1(c).
In order to build an artificial system that rivals humans for this type of tasks, we need to answer the following questions:

(Q1) How can we measure its performance or judge its equivalence to humans?

(Q2) How can we build such an artificial system?
In this paper, targeting pairwise ranking tasks, we attempt to give answers to these questions.
For Q1, we propose a method that is based on a probabilistic model of ranking results given by human subjects. Suppose that an artificial system predicts the ranking of $N$ image pairs. The proposed method quantifies how probable it is that the same ranking result is generated by human annotators for the same image pairs. It properly considers the above individual differences.
Despite its simplicity, there are a few difficulties to overcome with this approach. One is the difficulty of obtaining an accurate probabilistic model of human ranking from data with a small sample size. (Footnote 1: We have $M$ human annotators rank each of $N$ item pairs. Considering data collection cost, increasing both $M$ and $N$ is prohibitive. As $N$ needs to be large, $M$ has to be small.) To resolve this, we propose to collect confidence scores from annotators for ranking each item pair, and utilize them to accurately estimate a probabilistic model of human ranking (Sec.3). Thus, our proposal includes an annotation scheme for achieving the goal. Another difficulty is computational complexity. The presence of individual differences increases the number of ranking patterns that humans can generate for $N$ item pairs, making naive computation infeasible. We present a method that performs the necessary computation efficiently (Sec.2.3). Following the proposed data collection scheme, we have created a dataset named Material Attribute Ranking Dataset (MARD), on which we tested the proposed method. The dataset will be made publicly available.
For Q2, we argue that learning to predict the distribution of rankings given by multiple annotators performs better than previous methods. In previous studies (Parikh and Grauman [2011], Dong et al. [2014], Burges et al. [2005]), individual differences are usually ignored; the task is converted to binary classification by taking the majority if there are individual differences. Our experimental results support this argument.
2 Distinguishing Artificial Systems from Humans
2.1 Outline of the Proposed Approach
We consider a pairwise ranking task using $N$ pairs, where, for instance, two images in each pair are to be ranked. We denote a ranking result of a human subject (or an artificial system) by a sequence $\mathbf{x} = [x_1, \ldots, x_N]$, where $x_i$ is a binary variable, such that $x_i = 1$ if a subject chooses the first image, and $x_i = 0$ if the subject chooses the second image.
Considering the aforementioned individual differences, we introduce a probabilistic model for $\mathbf{x}$. Let $p(\mathbf{x})$ be the probability mass function of human-generated sequences $\mathbf{x}$. Given a sequence $\mathbf{x}'$ generated by an artificial system, we wish to use $p(\mathbf{x})$ to estimate how probable it is that a human generates $\mathbf{x}'$. If the probability is very low, it means that $\mathbf{x}'$ is distinguishable from human sequences; then, we may judge that the artificial system behaved differently from humans. If it is high, then we cannot distinguish $\mathbf{x}'$ from human sequences, implying that the artificial system behaves similarly to humans, as far as the given task/dataset is concerned.
Ideally, any $\mathbf{x}$ with $p(\mathbf{x}) > 0$ can be generated by human subjects, and thus it could be possible to use $p(\mathbf{x}') > 0$ or $p(\mathbf{x}') = 0$ for the above judgment. However, this is not appropriate. As the exact $p(\mathbf{x})$ is not available, we have to use an approximate model built upon several assumptions. Moreover, $p(\mathbf{x})$ is estimated from the data collected from human subjects, which could contain noise. It tends to have a long tail (i.e., many sequences with a very small probability). Thus, the model may be unreliable particularly for sequences $\mathbf{x}$ with low $p(\mathbf{x})$. These sequences are also considered to be minor and eccentric sequences that the majority of humans will not generate.
Therefore, we use a criterion that excludes such minor sequences with the lowest probabilities. Let $S$ denote a subset of possible sequences for the $N$ pairs. In particular, we consider a subset with the minimum cardinality that satisfies the following inequality:
$$\sum_{\mathbf{x} \in S} p(\mathbf{x}) \ge 1 - \varepsilon$$ (1)
where $\varepsilon$ is a small number. We denote it by $S_\varepsilon$. $S_\varepsilon$ indicates a set of sequences that humans are likely to generate. The complementary set $\bar{S}_\varepsilon$ contains the abovementioned minor sequences.
Given a ranking result $\mathbf{x}'$ created by an artificial system for the same $N$ pairs, we check if $\mathbf{x}'$ belongs to $S_\varepsilon$, i.e., whether $\mathbf{x}' \in S_\varepsilon$ or not. We declare $\mathbf{x}'$ to be indistinguishable from human results if $\mathbf{x}' \in S_\varepsilon$, and to be distinguishable if $\mathbf{x}' \notin S_\varepsilon$.
2.2 Model of Human Ranking
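We now describe the model $p(\mathbf{x})$ of a ranking sequence $\mathbf{x} = [x_1, \ldots, x_N]$. Assuming that ranking results of different pairs are independent of each other, we model the probability of a sequence by
$$p(\mathbf{x}) = \prod_{i=1}^{N} p(x_i)$$ (2)
We then model $p(x_i)$ using a Bernoulli distribution with a parameter $\theta_i$, that is,
$$p(x_i) = \theta_i^{x_i} (1 - \theta_i)^{1 - x_i}$$ (3)
Determination of $\theta_i$ will be explained later. Human subjects can provide many different sequences; each sequence $\mathbf{x}$ will occur with probability $p(\mathbf{x})$.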
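As a minimal sketch of the model above, the probability of a ranking sequence under independent Bernoulli pairs can be evaluated as follows. The function name `sequence_log_prob` and the example parameters are hypothetical and chosen only for illustration; the computation is done in log space, which is an implementation choice made here to avoid underflow when many pairs are involved.

```python
import math

def sequence_log_prob(x, theta):
    """Log-probability of a binary sequence under independent Bernoulli pairs.

    x[i] is 1 if the first image of pair i is chosen and 0 otherwise;
    theta[i] is the probability that a human chooses the first image.
    """
    if len(x) != len(theta):
        raise ValueError("x and theta must have the same length")
    total = 0.0
    for xi, ti in zip(x, theta):
        p = ti if xi == 1 else 1.0 - ti
        if p == 0.0:
            return float("-inf")  # the sequence is impossible under the model
        total += math.log(p)
    return total

# Illustrative parameters (not taken from the paper).
theta = [0.8, 0.6, 1.0]
prob = math.exp(sequence_log_prob([1, 0, 1], theta))  # 0.8 * 0.4 * 1.0
```

A sequence that contradicts a pair whose parameter is exactly 0 or 1 receives log-probability minus infinity, which is the sensitivity issue that Section 3 addresses.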
It should be noted that the Bradley-Terry model, a popular model of pairwise ranking, does not fit our problem. It considers a closed set of items (e.g., sports teams and scientific journals), and is mainly used for the purpose of obtaining a ranking of all the items in the set from observations of rankings of item pairs. On the other hand, in our case, we consider an open set of items. Our interest is not in the item set itself but in the evaluation of an intelligent system performing the task.
2.3 Percentile of a Sequence in Probability-Ordered List
Percentile Value
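As mentioned above, we want to check whether $\mathbf{x}' \in S_\varepsilon$ or $\mathbf{x}' \notin S_\varepsilon$, where $\mathbf{x}'$ is the sequence generated by the artificial system and $S_\varepsilon$ is the set of sequences that humans are likely to generate. To perform this, we calculate the percentile of $\mathbf{x}'$ in the list of all possible sequences $\mathbf{x}$ ordered by $p(\mathbf{x})$. It is defined and computed as follows (see Fig.2): i) sort all (i.e., $2^N$) possible binary sequences $\mathbf{x}$ in descending order of $p(\mathbf{x})$ of (2), and ii) compute the percentile of $\mathbf{x}'$ (denoted by $\alpha(\mathbf{x}')$) using the cumulative sum of probabilities from the first rank to the position where $\mathbf{x}'$ is ranked. Using the $\alpha(\mathbf{x}')$ thus computed for the target $\mathbf{x}'$, the condition $\mathbf{x}' \in S_\varepsilon$ is equivalent to $\alpha(\mathbf{x}') \le 1 - \varepsilon$ (and $\mathbf{x}' \notin S_\varepsilon$ is equivalent to $\alpha(\mathbf{x}') > 1 - \varepsilon$). It is noteworthy that $\alpha$ represents how close the sequence is to the most probable ranking result of humans, which attains the smallest value of $\alpha$.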
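The two-step definition can be illustrated with a brute-force sketch that enumerates every binary sequence, which is feasible only for a handful of pairs. The names `seq_prob`, `percentile_bruteforce`, and `is_indistinguishable`, the example parameters, and the tie handling (sequences tied in probability with the target are all counted) are assumptions made for this illustration.

```python
from itertools import product

def seq_prob(x, theta):
    """Probability of a binary sequence under independent Bernoulli pairs."""
    p = 1.0
    for xi, ti in zip(x, theta):
        p *= ti if xi == 1 else 1.0 - ti
    return p

def percentile_bruteforce(target, theta):
    """Cumulative probability of all sequences at least as probable as target."""
    p_target = seq_prob(target, theta)
    tol = 1e-12 * p_target  # guards against rounding differences among ties
    total = 0.0
    for x in product((0, 1), repeat=len(theta)):
        p = seq_prob(x, theta)
        if p >= p_target - tol:
            total += p
    return total

def is_indistinguishable(target, theta, eps):
    """Decision rule: the target is in the likely set if its percentile <= 1 - eps."""
    return percentile_bruteforce(target, theta) <= 1.0 - eps

# Illustrative parameters (not taken from the paper).
theta = [0.9, 0.8, 0.6]
alpha = percentile_bruteforce((1, 0, 1), theta)  # 0.432 + 0.288 + 0.108
```

Summing over sequences whose probability is at least that of the target gives the same value as sorting and accumulating down to the target's rank, so no explicit sort is needed in this sketch.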
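Efficient Computation of $\alpha$
When $N$ is large, it is not feasible to naively perform the above procedure, as the number $2^N$ of possible sequences explodes. It is also noted that the Bernoulli parameter $\theta_i$ may differ for each pair $i$, and thus the standard statistics of the binomial distribution cannot be computed for the whole sequence. Thus, we group the elements $x_i$ having the same $\theta_i$ into a subsequence, which is specified by an index set $I_\theta = \{ i \mid \theta_i = \theta \}$ for a constant $\theta$. Suppose that this grouping splits $\mathbf{x}$ into $K$ subsequences $\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(K)}$ with index sets $I_1, \ldots, I_K$. Using the independence of the elements, we have
$$p(\mathbf{x}) = \prod_{k=1}^{K} p(\mathbf{x}^{(k)})$$ (4)
In this grouping, we may redefine the variable $x_i$ by swapping the first and second images so that $\theta_i \ge 0.5$. This enables us to minimize the number of groups without loss of generality to improve computational efficiency as will be discussed below.
Let $n_k$ be the number of elements belonging to $\mathbf{x}^{(k)}$ (i.e., $n_k = |I_k|$), and $\bar{\theta}_k$ be their Bernoulli parameter (i.e., $\theta_i = \bar{\theta}_k$ for any $i \in I_k$). Then, the probability of $\mathbf{x}^{(k)}$ being a particular sequence is computed by
$$p(\mathbf{x}^{(k)}) = \bar{\theta}_k^{\,m_k} (1 - \bar{\theta}_k)^{n_k - m_k}$$ (5)
where $m_k$ is the number of 1’s (i.e., $m_k = \sum_{i \in I_k} x_i$) in $\mathbf{x}^{(k)}$. Note that the number of possible sequences having $m_k$ 1’s is $\binom{n_k}{m_k}$, and each of them has the same probability computed as above. For the entire sequence, the probability of $\mathbf{x}$ being a particular sequence is given by
$$p(\mathbf{x}) = \prod_{k=1}^{K} \bar{\theta}_k^{\,m_k} (1 - \bar{\theta}_k)^{n_k - m_k}$$ (6)
Using (6) along with (5), we can compute the probability that a sequence consists of subsequences $\mathbf{x}^{(k)}$ ($k = 1, \ldots, K$), each of which has $m_k$ 1’s. The number of such sequences is calculated by $\prod_{k=1}^{K} \binom{n_k}{m_k}$, and each sequence has the same probability.
Now, we consider sorting all sequences $\mathbf{x}$. We construct a sequence $\mathbf{m} = [m_1, \ldots, m_K]$ by choosing each of its elements $m_k$ (the number of 1’s in the $k$-th group) from $\{0, 1, \ldots, n_k\}$. We obtain $\prod_{k=1}^{K} \binom{n_k}{m_k}$ sequences having the same probability by employing this assignment scheme. We denote the assignment by $\mathbf{m}_j$ ($j = 1, \ldots, J$), where $J$ is the number of possible assignments, which is given by $J = \prod_{k=1}^{K} (n_k + 1)$. (Note that $J \le 2^N$.)
As it is not necessary to sort sequences having the same probability, we only need to sort blocks of sequences instead of individual sequences. For each block associated with an assignment $\mathbf{m}_j$, we use (6) for computation of the probability of each sequence belonging to this block. Let $q_j$ be this probability. Then, using $q_j$, we sort the $J$ blocks, which can be done much more efficiently than sorting $2^N$ sequences.
Finally, we compute $\alpha(\mathbf{x}')$ for a sequence $\mathbf{x}'$. In order to compute its rank, we count the number of 1’s in $\mathbf{x}'$ (more specifically, the number of 1’s in each of its $K$ subsequences), and find the block $j'$ to which $\mathbf{x}'$ belongs. Let $H$ be the index set of blocks ranked higher than $j'$, i.e., $H = \{ j \mid q_j > q_{j'} \}$, and let $w_j = \prod_{k=1}^{K} \binom{n_k}{m_{jk}}$ be the number of sequences in block $j$, where $m_{jk}$ is the $k$-th element of $\mathbf{m}_j$. Then, the cumulative probability down to block $j'$ (including it) can be computed by
$$\alpha(\mathbf{x}') = \sum_{j \in H \cup \{j'\}} w_j \, q_j$$ (7)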
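The block-based computation can be sketched as follows. The function name `percentile_grouped`, the representation of groups as `(size, parameter)` tuples, and the example values are hypothetical; the sketch also counts blocks tied in probability with the target block, which is one possible tie-handling convention rather than something the text specifies. Instead of sorting explicitly, it sums over all blocks whose per-sequence probability is at least that of the target block, which yields the same cumulative sum.

```python
from itertools import product
from math import comb

def percentile_grouped(groups, target_counts):
    """Percentile of a sequence, computed over blocks instead of sequences.

    groups: list of (n_k, theta_k), the size and shared Bernoulli
            parameter of each group of pairs.
    target_counts: number of 1's of the target sequence in each group.
    """
    def block_prob(counts):
        # Probability of any single sequence in the block.
        p = 1.0
        for (n, t), m in zip(groups, counts):
            p *= t ** m * (1.0 - t) ** (n - m)
        return p

    def block_size(counts):
        # Number of sequences that share these per-group counts.
        size = 1
        for (n, _), m in zip(groups, counts):
            size *= comb(n, m)
        return size

    q_target = block_prob(target_counts)
    tol = 1e-12 * q_target  # guards against rounding differences among ties
    alpha = 0.0
    # Enumerate every assignment of counts: prod(n_k + 1) blocks in total.
    for counts in product(*(range(n + 1) for n, _ in groups)):
        q = block_prob(counts)
        if q >= q_target - tol:
            alpha += block_size(counts) * q
    return alpha

# Three pairs with parameters 0.8, 0.8, 0.6 form two groups (illustrative values).
groups = [(2, 0.8), (1, 0.6)]
alpha = percentile_grouped(groups, (1, 1))
```

For this small example the loop visits 6 blocks instead of 8 sequences; the saving grows quickly when many pairs share the same parameter.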
3 Estimation of the Bernoulli Parameter Using Confidence Scores
3.1 Small-sample Estimate of $\theta_i$
We modeled human ranking by the Bernoulli distribution as in (3). We now consider how to estimate its parameter $\theta_i$. Suppose $M$ subjects participate in ranking the $i$-th image pair ($i = 1, \ldots, N$). Let $r_i$ be the number of subjects who chose the first image. Considering a pairwise ranking task with exclusive choice, the number of subjects who chose the second image is $M - r_i$. Then, the maximum likelihood estimate (MLE) of $\theta_i$ is immediately given by
$$\hat{\theta}_i = \frac{r_i}{M}$$ (8)
Despite its simplicity, this method could have an issue when the subjects unanimously choose the same image of an image pair, i.e., either $r_i = M$ or $r_i = 0$. In this case, the above MLE gives $\hat{\theta}_i = 1$ or $\hat{\theta}_i = 0$, which leads to $p(x_i = 0) = 0$ or $p(x_i = 1) = 0$. However, this result is quite sensitive to the (in)accuracy of $\hat{\theta}_i$ and thus the results may not be useful. If a CNN chooses the image with $p(x_i) = 0$ for only a single unanimous pair, then the probability of its whole sequence immediately vanishes irrespective of the ranking of the other pairs (i.e., the percentile of the sequence is 100%), declaring that this CNN behaves completely differently from humans. The estimate (8) could be inaccurate if $M$ is not large enough. Although this issue will be mitigated by using a large $M$, doing so will first increase the cost of data collection; it will also increase the computational complexity of the percentile (because the number of groups having an identical $\theta_i$ tends to increase).
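The sensitivity issue can be seen in a small numeric sketch. The function name `mle_theta` and the vote counts below are made up for illustration; five annotators per pair mirrors the dataset described later but the specific counts are not from the paper.

```python
def mle_theta(first_votes, num_subjects):
    """Maximum likelihood estimate: fraction of subjects choosing the first image."""
    return first_votes / num_subjects

# Hypothetical vote counts for three pairs with five annotators each;
# the first pair is unanimous.
theta = [mle_theta(r, 5) for r in (5, 3, 4)]

# A system that disagrees with the unanimous pair only.
system = [0, 1, 1]
prob = 1.0
for x, t in zip(system, theta):
    prob *= t if x == 1 else 1.0 - t
# prob is exactly 0.0: one disagreement on a unanimous pair removes all
# probability mass, regardless of how the other pairs were ranked.
```

With only five annotators, a unanimous vote is weak evidence that the true parameter is exactly 1, which is why a single such pair should not decide the outcome.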
3.2 A More Accurate Estimate Using a Confidence Score
Therefore, we instead consider collecting additional information from human subjects. For each image pair $i$, we ask them to additionally give a confidence score of their ranking. We use a score $s \in \{0, 1, 2\}$, whose values correspond to “not confident”, “somewhat confident” and “very confident”, respectively. As we ask $M$ subjects for a single image pair $i$, we introduce an index $j$ ($j = 1, \ldots, M$) to represent each subject and denote the ranking choice and score of the $j$-th subject by $x_{ij}$ and $s_{ij}$, respectively. Let $X_i = \{x_{i1}, \ldots, x_{iM}\}$ and $\mathcal{C}_i = \{s_{i1}, \ldots, s_{iM}\}$. Assuming independence of individual annotations, we have
$$p(X_i, \mathcal{C}_i) = p(X_i)\, p(\mathcal{C}_i) = \prod_{j=1}^{M} p(x_{ij}) \prod_{j=1}^{M} p(s_{ij})$$ (9)
We use this model to perform maximum likelihood estimation for the unanimously ranked pairs. Suppose that all $M$ subjects chose the first image. Then we have
$$p(X_i) = \theta_i^{M}$$ (10)
To model $p(\mathcal{C}_i)$, we consider the probability of occurrence of each score $s$; we denote it by $p_s$. Then we have
$$p(\mathcal{C}_i) = p_0^{n_0}\, p_1^{n_1}\, p_2^{n_2}$$ (11)
where $n_0$, $n_1$, and $n_2$ denote the number of subjects who chose confidence scores 0, 1, and 2, respectively; we omit the superscript in $n_0^{(i)}$, $p_0^{(i)}$, etc. for simplicity. Thus, from (9) we have
$$p(X_i, \mathcal{C}_i) = \theta_i^{M}\, p_0^{n_0}\, p_1^{n_1}\, p_2^{n_2}$$ (12)
Given the observations $X_i$ and $\mathcal{C}_i$, we want to maximize the likelihood (12) with respect to $\theta_i$ as well as the introduced unknowns $p_0$, $p_1$, and $p_2$. As they are probabilities, there are a few constraints for $p_0$, $p_1$, and $p_2$ defined as
$$p_0 + p_1 + p_2 = 1$$ (13a)
$$0 \le p_s \le 1, \quad s = 0, 1, 2$$ (13b)
We then assume a relation between the confidence scores and $\theta_i$ leading to the following equation:
$$\theta_i = 0.5\, p_0 + 0.75\, p_1 + 1.0\, p_2$$ (14)
This equation indicates that the three scores, not confident, somewhat confident, and very confident, are mapped to 0.5, 0.75, and 1.0, respectively. In other words, subjects who are not confident, irrespective of their choice, will choose the other image with probability 50% in the future; those who are very confident will always make the same choice in the future; those who have intermediate confidence will make the same choice with the intermediate probability 75% in the future.
Using the constraints (13) and (14), we can maximize the likelihood (12) with respect to the unknowns. To be specific, we eliminate, say, $\theta_i$ and $p_0$ from (12) using (13a) and (14) and maximize it for $p_1$ and $p_2$ with the inequality constraints (13b). Any numerical constrained maximization method can be used for this optimization. As a result, we have the MLE for $\theta_i$. It should be noted that we use this estimation method only when the ranking results are unanimous; we use the standard estimate (8) otherwise.
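The constrained maximization can be sketched with a plain grid search over the score probabilities. This is an illustration under stated assumptions, not the authors' implementation: it assumes the likelihood for a unanimous pair is the product of the Bernoulli parameter raised to the number of subjects and a multinomial term over the three score probabilities, with the parameter tied to the score probabilities through the 0.5, 0.75, 1.0 mapping described above. The function name `estimate_theta_unanimous`, the grid resolution `steps`, and the example counts are hypothetical.

```python
import math

def estimate_theta_unanimous(n0, n1, n2, steps=200):
    """Grid-search estimate of the Bernoulli parameter for a unanimous pair.

    n0, n1, n2: numbers of subjects reporting "not confident",
    "somewhat confident", and "very confident".
    Assumed log-likelihood: M*log(theta) + sum_s n_s*log(p_s),
    with theta = 0.5*p0 + 0.75*p1 + 1.0*p2 and p0 + p1 + p2 = 1.
    """
    m_subjects = n0 + n1 + n2

    def count_log(n, p):
        # n * log(p), with the convention 0 * log(0) = 0.
        if n == 0:
            return 0.0
        return float("-inf") if p <= 0.0 else n * math.log(p)

    best_ll, best_theta = float("-inf"), None
    for a in range(steps + 1):
        for b in range(steps + 1 - a):
            p1, p2 = a / steps, b / steps
            p0 = (steps - a - b) / steps  # enforces the sum-to-one constraint
            theta = 0.5 * p0 + 0.75 * p1 + 1.0 * p2  # always in [0.5, 1]
            ll = (m_subjects * math.log(theta)
                  + count_log(n0, p0) + count_log(n1, p1) + count_log(n2, p2))
            if ll > best_ll:
                best_ll, best_theta = ll, theta
    return best_theta

# Hypothetical counts: ten annotators agreed on the ranking with mixed confidence.
theta_hat = estimate_theta_unanimous(2, 3, 5)
```

A grid search is used only because it is easy to verify; any constrained optimizer over the two free probabilities would serve the same purpose. Under the assumed likelihood the estimate reaches 1 only when every annotator reports the highest confidence.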
Discussion on the Use of Confidence Score
As described above, we propose to use a confidence score to estimate the Bernoulli parameter of each unanimously ranked pair. An alternative use of confidence scores is to simply eliminate the ranking results with a confidence score less than 2 and go with only maximally confident ranking results. We do not adopt this approach for the following reason. We think that there are two cases for how individual differences emerge (see also Fig.1): i) annotators make selections confidently, which are nevertheless split (subjectivity); and ii) annotators make selections without confidence, which makes their decisions fluctuate and then split (uncertainty). Elimination of samples with score 0 or 1 will remove data of type (ii). This will be fine when we are only interested in data of type (i), which seems to be the case with most existing studies. On the other hand, our study also considers data of type (ii). Suppose, for instance, the case where people perceive very similar glossiness for two different objects, e.g., a plastic cup and a glass mug; uncertainty will be fairly informative for such cases. This is why we attempt to model the uncertainty instead of eliminating it. As we are interested in modeling individual differences, we do not eliminate samples with low inter-rater agreement, either.
4 Experimental Results
4.1 Material Attribute Ranking Dataset
To test our method, we have created a dataset of a pairwise ranking task. It has been shown (Fleming et al. [2013], Fleming [2014]) that humans can visually perceive fairly accurately the “material attribute” of an object surface, such as hardness, coldness, lightness etc. Motivated by this, we chose a task of ranking a pair of images according to such material attributes. We consider thirteen material attributes, namely aged, beautiful, clean, cold, fragile, glossy, hard, light, resilient, smooth, sticky, transparent, and wet.
The dataset, which we name Material Attribute Ranking Dataset (MARD), consists of 1,000 training and 300 testing samples for each of the thirteen attributes. Each sample contains the ranking results of an image pair given by five Mechanical Turkers. For test samples for which the initial five Turkers make a unanimous selection, the dataset provides additional ranking results from ten more Turkers as well as three-level scores of their confidence in their ranking.
It should be noted that MARD differs from similar existing datasets of pairwise ranking, e.g., the Emotion Dataset (You et al. [13]). The existing datasets simply discard individual differences by taking the majority of non-unanimous annotations, which is equivalent to regarding the most probable ranking result of humans as the only correct prediction. Although this greatly simplifies the problem, it makes it impossible to consider individual differences in any way.
Details of Creation of MARD
MARD uses images of the Flickr Material Database (FMD) (Sharan et al. [2014]), a benchmark dataset for classification of ten material categories. We split the 1,000 FMD images into two non-overlapping sets of 500 images (one set is used for training and the other set is used for testing). We then created 1,000 and 300 image pairs by randomly choosing images from each set, respectively. We asked five Turkers to rank each image pair in terms of each of the thirteen material attributes by showing the paired images in a row. (The same image pairs were used for all the attributes.) Specifically, we asked them to choose one of three options, the first image, the second image, and “unable to decide”, for each pair. We discarded the image pairs with three or more “unable to decide” in the training set and those with one or more “unable to decide” in the test set. For the image pairs of the test set which were ranked unanimously for an attribute by the five Turkers, we further had ten more Turkers rank the same image pair and attribute. In this second task, we removed the “unable to decide” option, and also asked the Turkers to additionally provide their confidence in their ranking by choosing one of three confidence levels, i.e., “Not confident”, “Somewhat confident”, and “Very confident”. The confidence scores are used in order to estimate the Bernoulli parameter so as to mitigate the sensitivity issue with unanimous ranking, as was discussed in Section 3.
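The filtering rules described above can be sketched as follows. The vote encoding (`"first"`, `"second"`, `"undecided"`) and the function names `keep_pair` and `needs_second_round` are hypothetical; only the thresholds come from the text.

```python
def keep_pair(votes, split):
    """Apply the discard rule to the five first-round votes of one pair.

    votes: list of "first", "second", or "undecided".
    split: "train" or "test".
    """
    undecided = votes.count("undecided")
    if split == "train":
        return undecided < 3   # discard with three or more "unable to decide"
    return undecided == 0      # test set: discard with one or more

def needs_second_round(votes):
    """Test pairs ranked unanimously are sent to ten more annotators."""
    return keep_pair(votes, "test") and len(set(votes)) == 1

example = ["first"] * 5
second_round = needs_second_round(example)
```

Pairs for which `needs_second_round` holds are the ones that receive confidence scores, so they are also the only ones to which the confidence-based estimate applies.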
4.2 Evaluation of Differently Trained CNNs
Using the MARD, we conducted experiments to test our evaluation method. To see how it evaluates different prediction methods, we consider four methods. The first three are existing methods that convert the task into binary classification by regarding the majority of human ranking results as the correct label to predict. For the fourth method, we present a method that considers the individual differences as they are.

RankCNN (Dong et al. [2014]): A CNN is trained to predict binary labels by minimizing weighted squared distances computed for each training sample to its binary label.

RankSVM (Parikh and Grauman [2011]): In their original work, a linear classifier is trained to predict binary labels by searching for maximum-margin hyperplanes in a feature space. In this study, we instead employ a hinge loss.

RankNet (Burges et al. [2005]): A neural network (a CNN in this study) is trained to predict binary labels by minimizing the cross-entropy loss.

RankDist (ours): A CNN is trained to predict the distribution of ranking of each item pair. To be specific, it is trained to predict the ML estimate (8) of the Bernoulli parameter $\theta_i$. (Recall that training samples do not provide confidence scores.) The cross-entropy loss is used.
We applied the above four methods to the training set of the MARD. For all the four methods, we use the same CNN, VGG19 (Simonyan and Zisserman [2015]). It is first pretrained on ImageNet and fine-tuned on the EFMD (Zhang et al. [2016]). Then it is trained using the MARD, where its ten lower weight layers are fixed and the subsequent layers are updated. It is used in a Siamese fashion, providing two outputs, from which each loss is computed. The losses for all the thirteen attributes are summed and minimized. Note that the four methods differ only in their employed losses.
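The loss used by RankDist can be sketched for a single pair and attribute as a cross-entropy against a soft target. How the two Siamese outputs are turned into a probability is not stated in the text; the sketch assumes a sigmoid of the score difference, and the function name `rankdist_loss` and its arguments are hypothetical.

```python
import math

def rankdist_loss(score_first, score_second, theta_target):
    """Cross-entropy between a predicted choice probability and a soft target.

    Assumption: the probability that the first image is chosen is
    sigmoid(score_first - score_second).
    theta_target: estimated fraction of annotators choosing the first image.
    """
    d = score_first - score_second
    # Numerically stable log(sigmoid(d)).
    if d >= 0:
        log_p = -math.log1p(math.exp(-d))
    else:
        log_p = d - math.log1p(math.exp(d))
    log_one_minus_p = log_p - d  # log(1 - sigmoid(d)) = log(sigmoid(d)) - d
    return -(theta_target * log_p + (1.0 - theta_target) * log_one_minus_p)

# With equal scores the predicted probability is 0.5, so the loss is log(2)
# for any target.
loss = rankdist_loss(0.0, 0.0, 0.6)
```

Replacing `theta_target` with a hard 0 or 1 label recovers the usual binary cross-entropy, which is how this sketch relates to methods trained on majority labels.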
Table 1 shows the $\alpha$ values of the four methods for the thirteen attributes that are computed using the test set of the MARD. When choosing $1 - \varepsilon$ as the threshold for $\alpha$, any ranking results with $\alpha > 1 - \varepsilon$ are declared to be distinguishable from human ranking. It is observed that the number of such attributes is 3, 2, 3, and 1, for RankCNN, RankSVM, RankNet, and RankDist, respectively. RankDist tends to provide smaller $\alpha$’s for many attributes, indicating that its ranking results are closer to the most probable human results than those of the other three. The other three methods show more or less similar behaviours.
In summary, the proposed evaluation method makes it possible to show which method provides rankings that are (in)distinguishable from human ranking for which attribute. For instance, the ranking results for the attribute hard are the most dissimilar to human ranking, implying that there is room for improvement. Our method also makes it possible to visualize the different behaviours of the four methods; in particular, RankDist, which considers individual differences, performs differently from the three existing methods that neglect individual differences.
5 Summary
In this study, we have discussed how to compare artificial systems with humans for the task of ranking a pair of items. We have proposed a method for judging if an artificial system is distinguishable from humans for ranking of $N$ item pairs. More rigorously, we check if an $N$-pair ranking result given by an artificial system is distinguishable from those given by humans. It relies on a probabilistic model of human ranking that is based on the Bernoulli distribution. We have proposed to collect confidence scores of ranking each item pair from annotators and utilize them to estimate the Bernoulli parameter accurately for each item pair. We have also shown an efficient method for the judgment that calculates and uses the percentile value $\alpha$ of the target $N$-pair ranking result. Taking annotation noise and inaccuracies of the models into account, $\alpha$ is compared with a specified threshold (e.g., 90.0%); if it is smaller than the threshold, we declare that the artificial system is indistinguishable from humans for rankings of the $N$ item pairs. The value $\alpha$ may also be used as a measure of how close the ranking result of the artificial system is to the most probable ranking result of humans.
References
 Murray et al. [2012] Naila Murray, Luca Marchesotti, and Florent Perronnin. AVA: A large-scale database for aesthetic visual analysis. In Proc. Conference on Computer Vision and Pattern Recognition, 2012.
 Marchesotti et al. [2011] Luca Marchesotti, Florent Perronnin, Diane Larlus, and Gabriela Csurka. Assessing the aesthetic quality of photographs using generic image descriptors. In Proc. International Conference on Computer Vision, 2011.
 Kong et al. [2016] Shu Kong, Xiaohui Shen, Zhe Lin, Radomir Mech, and Charless Fowlkes. Photo aesthetics ranking network with attributes and content adaptation. In Proc. European Conference on Computer Vision, 2016.
 Deng et al. [2016] Yubin Deng, Chen Change Loy, and Xiaoou Tang. Image aesthetic assessment: An experimental survey. arXiv preprint arXiv:1610.00838, 2016.
 Isola et al. [2014] Phillip Isola, Jianxiong Xiao, Devi Parikh, Antonio Torralba, and Aude Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
 He et al. [2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proc. Conference on Computer Vision and Pattern Recognition, 2016.
 Szegedy et al. [2015] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proc. Conference on Computer Vision and Pattern Recognition, 2015.
 Parikh and Grauman [2011] Devi Parikh and Kristen Grauman. Relative attributes. In Proc. International Conference on Computer Vision, 2011.
 Dong et al. [2014] Yuan Dong, Chong Huang, and Wei Liu. RankCNN: When learning to rank encounters the pseudo preference feedback. Computer Standards & Interfaces, 2014.
 Burges et al. [2005] Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. Learning to rank using gradient descent. In Proc. International Conference on Machine Learning, 2005.
 Fleming et al. [2013] Roland W. Fleming, Christiane Wiebel, and Karl Gegenfurtner. Perceptual qualities and material classes. Journal of Vision, 2013.
 Fleming [2014] Roland W. Fleming. Visual perception of materials and their properties. Vision Research, 2014.
 [13] Quanzeng You, Jiebo Luo, Hailin Jin, and Jianchao Yang. Building a large scale dataset for image emotion recognition: The fine print and the benchmark. In Proc. International Conference on Association for the Advancement of Artificial Intelligence.
 Sharan et al. [2014] Lavanya Sharan, Ruth Rosenholtz, and Edward H. Adelson. Accuracy and speed of material categorization in real-world images. Journal of Vision, 2014.
 Simonyan and Zisserman [2015] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proc. International Conference on Learning Representations, 2015.
 Zhang et al. [2016] Yan Zhang, Mete Ozay, Xing Liu, and Takayuki Okatani. Integrating deep features for material recognition. Proc. International Conference on Pattern Recognition, 2016.