Exploring Outliers in Crowdsourced Ranking for QoE


Qianqian Xu, Ming Yan, Chendi Huang,
Jiechao Xiong, Qingming Huang, Yuan Yao
State Key Laboratory of Information Security (SKLOIS), Institute of Information Engineering, CAS, Beijing 100093, China
Department of Computational Mathematics, Michigan State University, East Lansing, MI 48824, USA
BICMR-LMAM-LMEQF-LMP, School of Mathematical Sciences, Peking University, Beijing 100871, China
Tencent AI Lab, Shenzhen 518057, China
University of Chinese Academy of Sciences, Beijing 100049, China
Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, Beijing 100190, China
Key Lab of Big Data Mining and Knowledge Management, CAS, Beijing 100190, China
Department of Mathematics, Hong Kong University of Science and Technology, Hong Kong
xuqianqian@iie.ac.cn, yanm@math.msu.edu, cdhuang@pku.edu.cn, jcxiong@tencent.com, qmhuang@ucas.ac.cn, yuany@ust.hk
Abstract.

Outlier detection is a crucial part of robust evaluation for crowdsourceable assessment of Quality of Experience (QoE) and has attracted much attention in recent years. In this paper, we propose simple and fast algorithms for outlier detection and robust QoE evaluation based on the principle of nonconvex optimization. Several iterative procedures are designed with or without knowledge of the number of outliers in the samples. Theoretical analysis shows that such procedures can reach statistically good estimates under mild conditions. Finally, experimental results with simulated and real-world crowdsourcing datasets show that the proposed algorithms produce performance similar to the Huber-LASSO approach in robust ranking, yet with nearly 8 times or 90 times speed-up, without or with prior knowledge of the outlier sparsity size, respectively. Therefore the proposed methodology provides a set of helpful tools for robust QoE evaluation with crowdsourcing data.

HodgeRank; Outlier Detection; $\ell_0$-regularization; Iterative Hard Thresholding; Iterative Least Trimmed Squares; Adaptive Algorithms
Corresponding author.
journalyear: 2017; copyright: acmcopyright; conference: MM '17, October 23–27, 2017, Mountain View, CA, USA; price: 15.00; doi: 10.1145/3123266.3123267; isbn: 978-1-4503-4906-2/17/10; ccs: Information systems → Data cleaning; ccs: Information systems → Rank aggregation

1. Introduction

In recent years, the Quality of Experience (QoE) (Hossfeld12-QoE; Wu13crowd) has become a major research theme within the multimedia community. QoE measures a user’s subjective expectation, feeling, perception, and satisfaction with respect to multimedia content. Measuring and ensuring good QoE of multimedia content is highly subjective in nature.

A variety of approaches can be employed to conduct subjective tests, among which the Mean Opinion Score (MOS) (MOS) and paired comparison are the two most popular. In the MOS test, individuals are asked to specify a rating from “Bad” to “Excellent” (e.g., Bad-1, Poor-2, Fair-3, Good-4, and Excellent-5) to grade the quality of a stimulus; in the paired comparison approach, raters are only asked to make intuitive comparative judgments instead of mapping their perception onto a categorical or numerical scale. Among these approaches there are tradeoffs between the amount of information a preference label contains and the bias associated with obtaining the label. For example, while a graded relevance judgment on a five-point scale may contain more information than a binary judgment, raters may also make more errors due to the complexity of assigning finer-grained judgments. In (MM09), it is shown that MOS may suffer from three fundamental problems: (i) it is unable to concretely define the concept of scale; (ii) the interpretations of the scale differ greatly among raters; (iii) it is difficult to verify whether a rater gives false ratings either intentionally or carelessly. Therefore, the paired comparison method is currently gaining growing attention. It not only promises assessments that are easier and faster to obtain, with a less demanding task for raters, but also yields more reliable data with less personal scale bias in practice. However, a shortcoming of paired comparison is its higher sampling complexity compared with the MOS test, since the number of pairs grows quadratically with the number of items to be ranked.

To tackle the cost problem, with the growth of crowdsourcing platforms such as MTurk, InnoCentive, CrowdFlower, CrowdRank, and AllOurIdeas, researchers who wish to seek help from the Internet crowd can post their task requests on websites for QoE evaluation (MM09; tmm12; Hossfeld2014; conf2012-441; Keimel_etal_QoMEX2012_CrowdSourcing_Preprint; Wu13crowd). Methods for rating/ranking via pairwise comparisons in crowdsourced QoE evaluation must address a number of inherent difficulties, including: (i) incomplete and imbalanced data; (ii) streaming and online data; (iii) outliers. To meet the first challenge, the works in (added; MM11; tmm12) propose randomized paired comparison methods that accommodate incomplete and imbalanced data. A general framework named HodgeRank on Random Graphs (HRRG) not only deals with incomplete and imbalanced data collected from crowdsourcing studies but also derives the constraints on sampling complexity that the random selection must adhere to in crowdsourcing experiments. Furthermore, a recent extension of HRRG is introduced in (MM12; TMM13) to deal with streaming and online data in the crowdsourcing scenario, making the assessment procedure significantly faster without deteriorating the accuracy.

The third challenge of crowdsourced QoE evaluation arises because not every Internet rater is trustworthy. In other words, due to the lack of supervision when raters perform experiments in crowdsourcing, they may provide erroneous responses perfunctorily, carelessly, or dishonestly (MM09). Such random decisions are useless and may deviate significantly from other raters’ decisions. Hence outliers have to be identified and removed in order to achieve a robust QoE evaluation. Many methods have been developed for outlier detection, such as M-estimators (Huber81), Least Median of Squares (LMS) (rousseeuw1984least), S-estimators (rousseeuw1984robust), Least Trimmed Squares (LTS) (leroy1987robust), and the Thresholding based Iterative Procedure for Outlier Detection (Θ-IPOD) (SheOwe11), etc. Besides, there are also distribution-based (barnett1994outliers), depth-based (johnson1998fast), distance-based (knorr1999finding; knorr2000distance), density-based (breunig2000lof), and clustering-based (jain1999data) methods for outlier detection. The authors of (MM09) proposed the Transitivity Satisfaction Rate (TSR), which checks all the intransitive triangles (e.g., $a \succ b$, $b \succ c$, yet $c \succ a$) to identify and discard noisy data provided by unreliable raters in QoE. However, TSR can only be applied to complete and balanced paired comparison data. When the paired data is incomplete and imbalanced, i.e., having missing edges, the question of how to detect the noisy pairs remains open. The work in (MM13) attacks this problem and formulates outlier detection as a LASSO problem based on sparse approximations of the cyclic ranking projection of paired comparison data in the Hodge decomposition. Regularization paths of the LASSO problem could provide an order on the samples tending to be outliers. However, the solution of the LASSO problem is biased; moreover, solving the LASSO path is slow, and the problem has to be solved many times for model selection via cross-validation.

In this paper, we propose simple and fast algorithms based on nonconvex optimization for outlier detection and robust ranking in QoE evaluation. The contributions of this paper are as follows:

1. We propose three iterative procedures that solve nonconvex optimization problems arising in outlier detection, with or without knowledge of the number of outliers in the samples.

2. Theoretical analysis shows that such procedures can reach statistically good estimates under mild conditions.

3. Experiments with simulated and crowdsourcing real-world data show that our algorithms work effectively in practice.

2. Methodology

In this section, we propose simple iterative algorithms for outlier detection by solving nonconvex optimization problems. These algorithms are based on either prior knowledge of the number of outliers or adaptive estimation of the outlier sparsity size. Specifically, we propose iHT and iLTS for a known outlier sparsity size, and aLTS for adaptive estimation when the precise number of outliers is unknown. Despite the NP-hardness of finding global optimizers in the worst case, we show that such simple algorithms are able to reach statistically good estimates under mild conditions. Before describing the algorithms, we provide a brief introduction to robust ranking, which motivates our main development.

2.1. Robust Ranking

Assume that there are $m$ raters and $n$ items to be ranked by the raters. Let $N$ be the total number of paired comparisons (samples). Let $y^u_{ij}$ denote the degree to which rater $u$ prefers item $i$ to item $j$. Without loss of generality, we assume that $y^u_{ij} > 0$ if rater $u$ prefers item $i$ to item $j$ and $y^u_{ij} < 0$ otherwise. In addition, we assume that the paired comparison data is skew-symmetric for each $u$, i.e., $y^u_{ij} = -y^u_{ji}$. In practice, $y^u_{ij}$ can be continuous, dichotomous, or of a $k$-point Likert scale, according to the strategy used in QoE evaluation.

It is natural to assume that

(1) $y^u_{ij} = s_i - s_j + \varepsilon^u_{ij}$,

where $s \in \mathbb{R}^n$ is the true ranking score on the $n$ items and $\varepsilon^u_{ij}$ is the noise, satisfying $\varepsilon^u_{ij} = -\varepsilon^u_{ji}$. When the noise is independent and identically distributed with zero mean, the least squares (LS) problem has been used in (MM11; tmm12; MM12) to derive ranking scores in subjective multimedia assessments.
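For concreteness, the following is a minimal NumPy sketch of this least-squares scoring step (our illustration, not the authors' released code), assuming each comparison is stored as a triple (i, j, y) with y > 0 meaning item i is preferred to item j:

    import numpy as np

    def ls_ranking(comparisons, n_items):
        """Least-squares (HodgeRank) global score from pairwise comparisons.
        comparisons: list of (i, j, y) triples, y > 0 meaning i is preferred to j.
        Returns a zero-mean score vector (scores are defined only up to a constant)."""
        N = len(comparisons)
        X = np.zeros((N, n_items))          # each row is a "gradient" e_i - e_j
        y = np.zeros(N)
        for k, (i, j, yk) in enumerate(comparisons):
            X[k, i], X[k, j] = 1.0, -1.0
            y[k] = yk
        s = np.linalg.pinv(X) @ y           # minimum-norm least-squares solution
        return s - s.mean()

    # toy example: item 0 beats item 1, item 1 beats item 2, item 0 beats item 2
    print(ls_ranking([(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0)], n_items=3))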

However, not all comparisons are trustworthy and there may be sparse outliers due to different test conditions, human errors, or abnormal variations in content. Putting it in a mathematical way, here we consider

(2) $y^u_{ij} = s_i - s_j + \gamma^u_{ij} + \varepsilon^u_{ij}$,

or equivalently, in matrix-vector form,

(3) $y = Xs + \gamma + \varepsilon$,

where $\gamma$, which models the outliers, is sparse and has a much larger magnitude than $\varepsilon$, which models the Gaussian noise, and $X \in \mathbb{R}^{N \times n}$ satisfies: if $y_k$, the $k$-th entry of $y$, equals $y^u_{ij}$, then the $k$-th row of $X$ equals $(e_i - e_j)^T$, where $e_i \in \mathbb{R}^n$ has its $i$-th entry equal to $1$ and all other entries equal to $0$. Such an $X$ is often called the (generalized) “gradient operator” on the comparison graph, with $X^T X$ being the (unnormalized) graph Laplacian.

When sparse outliers exist ($\gamma^u_{ij} \neq 0$ for a small number of comparisons), the solution to the least squares problem on all the comparisons becomes unstable and may give an inaccurate estimate. If the outliers can be detected and removed, the solution to the least squares problem on the remaining pairwise comparisons is more accurate and gives a better estimate.

In (MM13), a robust regression approach called Huber-LASSO is used to detect outliers:

(4) $\min_{s,\gamma} \ \frac{1}{2}\|y - Xs - \gamma\|_2^2 + \lambda \|\gamma\|_1.$

This is a convex optimization problem, and the LASSO path could provide information on the order in which samples tend to be outliers.

However, there are two issues with this approach: 1) the Huber-LASSO estimator is always biased, even when the model is identifiable; 2) computing the Huber-LASSO path to get the top outliers is computationally expensive.

In order to remove the bias in the solution, we replace the $\ell_1$-norm of $\gamma$ in (4) with the $\ell_0$-“norm” of $\gamma$ and obtain

(5) $\min_{s,\gamma} \ \frac{1}{2}\|y - Xs - \gamma\|_2^2 + \lambda \|\gamma\|_0,$

where $\|\gamma\|_0$, the $\ell_0$-“norm” of $\gamma$, is the number of nonzero components in $\gamma$. Although this is a nonconvex optimization problem that is NP-hard in the worst case, in the sequel we shall see that, under mild conditions, even simple iterative algorithms can detect where the outliers are and lead to statistically good estimators.

2.2. iHT and iLTS with Known $K$

Input: $X$, $y$, $K$.
Initialization: $\gamma^0 = 0$.
for $t = 0, 1, 2, \ldots$ do
     Update $\gamma^{t+1}$ by $\gamma^{t+1} = \mathcal{H}_K\big((I - H)y + H\gamma^t\big)$.
     If $\gamma^{t+1} = \gamma^t$, break.
end for
return the final $\gamma$.
Algorithm 1 iterative Hard Thresholding (iHT)

First of all, Proposition 1, whose proof is provided in the supplementary material, shows that problem (5) is, in a sense, equivalent to

(6) $\min_{s,\gamma} \ \frac{1}{2}\|y - Xs - \gamma\|_2^2 \quad \mathrm{s.t.}\ \|\gamma\|_0 \leq K,$

and

(7) $\min_{s,\, w \in \{0,1\}^N} \ \frac{1}{2}\|w \odot (y - Xs)\|_2^2 \quad \mathrm{s.t.}\ \|\mathbf{1} - w\|_0 \leq K,$

where $\odot$ is the elementwise Hadamard product operator. The indices of the zero entries of $w$ indicate outliers. Problem (7) is actually the Least Trimmed Squares (LTS) in robust regression (leroy1987robust). A benefit of (7) lies in that the global ranking score does not depend on the outlier magnitude estimate, since the outliers are simply dropped.

Proposition 1.

For a given , pick any global optimal for problem (5), and let . Let

and

Then .

Hence we now turn to problems (6) and (7); both have a parameter $K$, which is considered as an upper bound on the number of outliers. Because of the $\ell_0$-“norm” constraints, finding the global optimal solution is NP-hard. We attempt to find approximate (but sufficient) solutions via the alternating minimization method.

Note that once we fix $\gamma$ in problem (6), we just need to solve an ordinary least squares problem and get a corresponding $s$ simply by

(8) $s = X^{\dagger}(y - \gamma).$

Here $X^{\dagger}$ is the Moore–Penrose pseudoinverse of the matrix $X$. And if we fix $s$, we just need to take a hard thresholding, i.e.,

(9) $\gamma = \mathcal{H}_K(y - Xs),$

where $\mathcal{H}_K$ is an operator which sets all entries to $0$ except the $K$ entries with the largest squares. For example, $\mathcal{H}_2\big((3, -1, 2, 0)^T\big) = (3, 0, 2, 0)^T$.

Plugging (8) into (9), such a procedure implies

$\gamma^{t+1} = \mathcal{H}_K\big((I - H)y + H\gamma^t\big),$

where $H = XX^{\dagger}$ is the “hat matrix”. Such a procedure is described precisely in Algorithm 1 and called iterative Hard Thresholding (iHT).
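To make the iteration concrete, here is a minimal NumPy sketch of iHT in the notation above (our illustration; the variable names and the fixed iteration cap are our own choices, and $K \geq 1$ is assumed):

    import numpy as np

    def iht(X, y, K, max_iter=100):
        """iterative Hard Thresholding sketch: alternate the LS update (8) for s
        and the keep-K-largest hard thresholding (9) for the outlier vector gamma."""
        X_pinv = np.linalg.pinv(X)
        gamma = np.zeros_like(y)
        for _ in range(max_iter):
            s = X_pinv @ (y - gamma)                 # (8): LS with current outlier estimate removed
            r = y - X @ s                            # residuals
            new_gamma = np.zeros_like(y)
            keep = np.argsort(r ** 2)[-K:]           # indices of the K largest squared residuals
            new_gamma[keep] = r[keep]                # (9): hard thresholding
            if np.array_equal(new_gamma, gamma):     # gamma unchanged: stop
                break
            gamma = new_gamma
        return s, gamma                              # nonzero entries of gamma flag suspected outliers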

For problem (7), when $w$ is fixed, update $s$ by solving a least squares problem using only the comparisons indicated by $w$, i.e.,

(10) $s \in \arg\min_{s} \ \frac{1}{2}\|w \odot (y - Xs)\|_2^2.$

When fixing $s$, updating $w$ amounts to choosing the $N - K$ entries of $y - Xs$ with the smallest squares, setting the corresponding entries of $w$ to $1$, and the others to $0$. The procedure is described precisely in Algorithm 2.

Input: $y$, $K$.
Initialization: $w^0 = \mathbf{1}$.
for $t = 0, 1, 2, \ldots$ do
     Update $s^t$ by (10) with $w = w^t$.
     Update $w^{t+1}$ by choosing the $N - K$ entries of $y - Xs^t$ with the smallest squares (if the $K$-th and $(K+1)$-th largest squares have the same value, there are multiple choices of $w^{t+1}$; in this case, randomly choose one different from all $w$’s that have appeared before, and break if all choices have already appeared), then setting the corresponding entries of $w^{t+1}$ to $1$ and the others to $0$.
     Check if the new $w^{t+1}$ is different from all $w^i$ ($i \leq t$) that have appeared before. If not, break.
end for
return the final $w$ and the corresponding $s$.
Algorithm 2 An Iterative Procedure for LTS (iLTS)
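A minimal NumPy sketch of Algorithm 2 (our illustration; the tie-breaking footnote is omitted, and revisited selections are tracked with a simple set):

    import numpy as np

    def ilts(X, y, K, max_iter=100):
        """iLTS sketch: alternate LS on the kept comparisons (w == 1) and
        re-selection of the N - K comparisons with the smallest squared residuals."""
        N = len(y)
        w = np.ones(N, dtype=bool)
        seen = {w.tobytes()}
        for _ in range(max_iter):
            s = np.linalg.pinv(X[w]) @ y[w]           # (10): LS on comparisons indicated by w
            r2 = (y - X @ s) ** 2
            new_w = np.zeros(N, dtype=bool)
            new_w[np.argsort(r2)[: N - K]] = True     # keep the N - K smallest squared residuals
            if new_w.tobytes() in seen:               # selection has appeared before: stop
                break
            seen.add(new_w.tobytes())
            w = new_w
        return s, ~w                                  # ~w marks the K suspected outliers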

2.3. Consistency of iHT and iLTS

A natural question is under what conditions these two algorithms can detect the true outlier set. The following theorems, whose proofs are given in the supplementary material, present some RIP-like sufficient conditions which can be met in outlier detection.

Theorem 2.1 (Sparsistency of iHT).

Assume that satisfies the model (3) with and . Now, for arbitrary satisfying

(11)

(here is the submatrix consist of some columns of indexed by ), in Algorithm 1 converges to the true outlier vector in the following sense

(12)

Moreover, if

(13)

then for sufficiently large , holds. If (13) holds and additionally, then for sufficiently large , holds.

Remark 1.

Condition (11) resembles the condition in (foucart2012sparse), with the measurement matrix replaced by , and the number of nonzero entries replaced by .

Remark 2.

According to the statement of the theorem above, we should choose $K$ to be at least the number of outliers. But it is unnecessary to let $K$ be exactly the unknown number of outliers, since we allow $K$ to be larger. However, $K$ usually cannot be too large, due to condition (11), which must be satisfied.

In the definition of , note that is a submatrix of , and always holds since is always positive semi-definite. If (upper bound of ) is small enough, then can be smaller than , satisfying the proposed condition. For example, if , , and each pair has exactly one comparison, then .

Theorem 2.2 (Convergence of iLTS).

Algorithm 2 converges in finite steps. Moreover, let

where is the indicator function, which equals if the event happens, and equals otherwise. Then the output with the corresponding satisfies

  1. is a coordinatewise minimum point of , namely, for any ,

  2. is a local minimum point of .

Remark 3.

There is no convergence analysis for iHT in the general case. But this theorem tells us that iLTS always converges, though the two are different iterative algorithms for two equivalent problems.

Theorem 2.3 (Sparsistency of iLTS).

Assume that satisfies the model (3) with and satisfying

Now, for arbitrary , let

(14a)
(14b)
(14c)

If

(15)

then for the corresponding with the output of Algorithm 2, holds. If (15) holds and additionally, then .

Remark 4.

In the vast majority of cases, . In fact, as long as for each , any row of is a linear combination of (which means that, removing the samples indicated by rows of does not disturb the original structure of connected components of the graph), there is a matrix such that

Thus

which implies that .

Remark 5.

According to the statement of the theorem above, we should choose $K$ to be at least the number of outliers. But it is unnecessary to let $K$ be exactly the unknown number of outliers, since we allow $K$ to be larger. However, $K$ usually cannot be too large, due to condition (15), which must be satisfied.

Remark 6.

Conditions (11) and (15) play similar roles to the Restricted Isometry Property (RIP) in compressed sensing (CanTao05).

2.4. Adaptive LTS with Unknown $K$

If the exact number of outliers is given or can be accurately estimated, Algorithm 1 or 2 can be used to detect the outliers and improve the performance of the least squares solutions. However, in practice, the exact number of outliers is generally unknown. If $K$ is underestimated, we are able to remove some outliers, but the remaining outliers will still damage the performance of the least squares solutions. On the other hand, if $K$ is overestimated, too many comparisons are removed; the remaining data are not enough for robust QoE evaluation and yield unstable solutions. Therefore, a method to estimate the number of outliers accurately is strongly desired.

We propose a method to estimate the number of outliers automatically for the dichotomous choice $y_k \in \{+1, -1\}$. In this case, a natural way is to regard as outliers those paired comparisons which disagree with the sign (or preference order) of the global ranking score differences.

As the number of outliers is unknown, we first use the least squares problem to find an estimate of $s$; then the total number of comparisons with wrong directions ($y_k$ has a different sign from $(Xs)_k$), denoted by $V$, is an overestimate of the number of outliers. We obtain an underestimate of the number of outliers by multiplying $V$ by $\beta \in (0,1)$, i.e., $K = \lceil \beta V \rceil$. We remove the $K$ comparisons that have the largest violations to the current score, because they are most likely to be outliers. The remaining comparisons are used to find a new estimate of $s$ via the least squares problem. In this way, we are able to remove some outliers and improve the estimate of $s$. With this improved estimate of $s$, we are able to remove more outliers, so we increase the underestimate geometrically by a factor $\alpha > 1$. However, this number cannot be larger than $V$, the smallest overestimate of the number of outliers, because we do not want to remove too many comparisons. Therefore the update of $K$ is just $K \leftarrow \min(\lceil \alpha K \rceil, V)$, where $\lceil x \rceil$ is the smallest integer no smaller than the positive real number $x$. Iterations go on until $K \geq V$, which gives an accurate estimate of the number of outliers. This algorithm is named aLTS, for adaptive Least Trimmed Squares, and Algorithm 3 describes such a procedure precisely.

Input: $y$, $\alpha > 1$, $\beta \in (0, 1)$.
Initialization: $w^0 = \mathbf{1}$, $K^0 = 0$, $t = 0$.
for $t = 0, 1, 2, \ldots$ do
     Update $s^t$ by (10) with $w = w^t$.
     Let $V^t$ be the total number of comparisons with wrong directions, i.e., $y_k$ has a different sign from $(Xs^t)_k$.
     If $V^t \leq K^t$, break.
     Set $K^{t+1} = \lceil \beta V^t \rceil$ if $t = 0$, and $K^{t+1} = \min(\lceil \alpha K^t \rceil, V^t)$ otherwise; update $w^t$ to get $w^{t+1}$ in the same way as in Algorithm 2, with $K$ replaced by $K^{t+1}$.
end for
return the final $s$, $w$, and the estimated outlier number $K$.
Algorithm 3 adaptive LTS (aLTS)
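A minimal NumPy sketch of aLTS for dichotomous data $y \in \{+1, -1\}^N$ (our illustration; the default values alpha = 2.0 and beta = 0.5 are placeholders, not the values fixed in the paper's experiments):

    import numpy as np

    def alts(X, y, alpha=2.0, beta=0.5, max_iter=100):
        """adaptive LTS sketch: grow an underestimate K of the outlier number until
        it meets V, the number of comparisons whose direction disagrees with X @ s."""
        N = len(y)
        K = 0
        w = np.ones(N, dtype=bool)
        for _ in range(max_iter):
            s = np.linalg.pinv(X[w]) @ y[w]                  # LS on the kept comparisons
            V = int(np.sum(np.sign(X @ s) != y))             # wrong-direction comparisons (ties count as wrong)
            if K >= V:                                       # the estimate has caught up: stop
                break
            K = int(np.ceil(beta * V)) if K == 0 else min(int(np.ceil(alpha * K)), V)
            r2 = (y - X @ s) ** 2
            w = np.zeros(N, dtype=bool)
            w[np.argsort(r2)[: N - K]] = True                # drop the K largest violations
        return s, ~w, K                                      # outlier flags and estimated outlier count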
Remark 7.

There are only two parameters to choose, $\alpha$ and $\beta$, and they are easy to set. They are chosen according to the inequalities $\alpha > 1$ and $0 < \beta < 1$ (both are fixed in our numerical experiments). $\beta$ has to be small to make sure that the first estimate of the number of outliers is an underestimate. Then the underestimate increases geometrically with rate $\alpha$, and $\alpha$ cannot be too large, because the remaining comparisons are not enough for robust QoE evaluation after too many comparisons are removed.

Remark 8.

The algorithm is able to detect most of the outliers in our experiments. However, there may be mistakes in the detection, and these mistakes happen mostly between two successive items in the estimated order. Therefore, we can add one step that compares every pair of two successive items and corrects the detection: if the majority of raters prefer one item of such a pair while its estimated score is lower, then the comparisons preferring it are removed from the outlier set and the opposite comparisons are added instead.

Algorithm 3 always stops in finitely many steps, as shown in the following lemma.

Lemma 2.4.

Algorithm 3 stops in no more than steps, where

Proof.

It follows from the fact that the sequence is non-increasing, and is a geometrically increasing sequence which is bounded by the smallest component of . Specifically, assume that steps have been taken in Algorithm 3, then has approached , and for , so

which leads to the result. ∎

Such a result only ensures that the algorithm stops, possibly with an overestimate of the number of outliers, since $V^t$ is always an overestimate of the number of outliers. The following theorem presents a stability condition under which Algorithm 3 returns the correct number of outliers.

Theorem 2.5.

Consider binary choice data with outliers

(16)

Assume that there exists an integer such that for all , least squares estimator is order-consistent to the true score , i.e., induces the same ranking order as the true score , then Algorithm 3 returns the correct number of outliers.

Proof.

As is an order-consistent solution of the ground-truth, by definition, gives the correct number of outliers, say . It actually holds for all , that . From Lemma 1, the claim follows. ∎

Remark 9.

One scenario is the generalized linear model, where the probability of preferring one item to another is $F(s^*_i - s^*_j)$ for some cumulative distribution function $F$ symmetric w.r.t. the origin (i.e., $F(x) + F(-x) = 1$). With a large enough sample, all the pairwise preferences in the minority direction can be regarded as “outliers”, and dropping such outliers will not change the order consistency of the least squares estimators.

Note that Theorem 2.5 does not require the algorithm to correctly identify the outliers, but only requires the estimator to be stable, i.e., order-consistent with the true score. In practice, this might not be satisfied easily. But, as we shall see in the next section, Algorithm 3 typically returns stable estimators that deviate only locally.

Figure 1. Precisions for simulated data via LASSO, iHT, iLTS, and aLTS over 100 repetitions; panels (a)–(c) correspond to SN = 1000, 2000, and 3000.
Figure 2. Recalls for simulated data via LASSO, iHT, iLTS, and aLTS over 100 repetitions; panels (a)–(c) correspond to SN = 1000, 2000, and 3000.

3. Experiments

A key question in the outlier detection community is how to evaluate the effectiveness of outlier detection algorithms when the ground-truth outliers are not available. In this section, we will first show the effectiveness of the proposed method on simulated data with known ground-truth outliers, followed by real-world crowdsourcing datasets without ground-truth outliers.

3.1. Simulated Data

The simulated data is constructed as follows. A random total order on the items is created as the ground-truth order. Then we add paired comparison edges randomly, with preference directions following the ground-truth order. We simulate the outliers by randomly choosing a portion of the comparison edges and reversing their preference directions. A paired comparison graph with outliers, possibly incomplete and imbalanced, is thus constructed.

Here the number of items is chosen to be consistent with the real-world datasets, and we make the following definitions for the experimental parameters. The total number of paired comparisons on the graph is SN (Sample Number), and the number of outliers is ON (Outlier Number). The outlier percentage OP is then ON/SN.
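The following sketch illustrates this construction in NumPy (our illustration; the values of n_items, SN, and OP in the call at the bottom are arbitrary demonstration values, not the paper's settings):

    import numpy as np

    def simulate_comparisons(n_items, SN, OP, seed=0):
        """Generate SN random pairwise comparisons consistent with a random
        ground-truth total order, then reverse a fraction OP of them as outliers."""
        rng = np.random.default_rng(seed)
        rank = rng.permutation(n_items)                        # rank[i]: ground-truth position of item i (0 = best)
        i = rng.integers(0, n_items, size=SN)
        j = (i + rng.integers(1, n_items, size=SN)) % n_items  # j is guaranteed to differ from i
        y = np.where(rank[i] < rank[j], 1.0, -1.0)             # +1 means item i is preferred to item j
        ON = int(round(OP * SN))                               # Outlier Number
        outliers = rng.choice(SN, size=ON, replace=False)
        y[outliers] *= -1.0                                    # reversed directions are the planted outliers
        return np.stack([i, j], axis=1), y, outliers

    pairs, y, true_outliers = simulate_comparisons(n_items=16, SN=1000, OP=0.1)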

Most outlier detection algorithms adopt a tuning parameter (say $t$) in order to select different numbers of data samples as outliers (MM13), and the number of outliers detected changes as $t$ changes. If $t$ is picked too restrictively, the algorithm will miss true outliers (false negatives). On the other hand, if the algorithm declares too many data samples as outliers, it will produce too many false positives. This tradeoff can be measured in terms of precision and recall, which are commonly used for measuring the effectiveness of outlier detection methods. Specifically, precision is defined as the percentage of reported outliers that truly turn out to be outliers, and recall is correspondingly defined as the percentage of ground-truth outliers that are reported as outliers.
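In code, these metrics reduce to simple set operations on the reported and ground-truth outlier indices (a sketch with hypothetical variable names):

    def precision_recall_f1(reported, truth):
        """Precision, recall, and F1 for outlier detection, given index
        collections of reported outliers and ground-truth outliers."""
        reported, truth = set(reported), set(truth)
        tp = len(reported & truth)                     # true positives
        precision = tp / len(reported) if reported else 0.0
        recall = tp / len(truth) if truth else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f1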

Figure 3. F1 scores for simulated data via LASSO, iHT, iLTS, and aLTS over 100 repetitions; panels (a)–(c) correspond to SN = 1000, 2000, and 3000.

We then compare LASSO, iHT, iLTS, and aLTS for outlier detection on the simulated data. For ease of comparison, LASSO, iHT, and iLTS are told in advance the exact number of outliers in the dataset, because, unlike aLTS, these three methods cannot estimate the number of outliers automatically.

The mean precisions, recalls, and F1-scores over 100 runs for these four methods under different choices of SN and OP are shown in Figures 1, 2, and 3. The F1-score is a combined measure that assesses the precision/recall tradeoff, reaching its best value at 1 and its worst at 0.

It is easy to see that the performances of LASSO, iHT, and iLTS are very similar, while aLTS produces better performance (indicated by higher precisions, recalls, and F1-scores in almost all cases). In addition, we compare the computing time required for these four methods to finish all 100 runs in Table 1. All computation is done using MATLAB R2014a, on a Mac Pro desktop PC with a 2.8 GHz Intel Core i7-4558U and 16 GB memory. It is easy to see that on the simulated dataset, the iHT, iLTS, and aLTS algorithms are much faster than LASSO, which implies their advantage in dealing with large-scale data. Specifically, iHT and iLTS are about 30–90 times faster than LASSO, and aLTS is almost 3–8 times faster than LASSO. As aLTS has no information about the number of outliers in the dataset and must estimate it automatically, its computational cost is reasonably higher than that of iHT and iLTS.

3.2. Real-world Data

Two real-world crowdsourcing datasets are adopted in this subsection. Since there is no ground truth for outliers in real-world datasets, we cannot compute precision and recall as for the simulated data to evaluate the performance of the methods. Therefore, we inspect the outliers returned by the four methods and compare them with the whole data to see whether they are reasonably good outliers or not.

The first dataset, PC-VQA, which is collected by (MM11), contains 38,400 pairwise comparisons of the LIVE dataset (LIVE) from 209 random raters. The paired comparison data in this dataset is complete and balanced. Take reference (a) in the PC-VQA dataset as an illustrative example (other reference videos exhibit similar results). The number of outliers estimated by aLTS is used for LASSO/iHT/iLTS to choose the regularization parameter and select the outliers.

(a) LASSO; (b) iHT; (c) iLTS; (d) aLTS
Table 1. Computing time for 100 runs in total on simulated data via LASSO, iHT, iLTS, and aLTS.
Table 2. Paired comparison matrices of reference (a) in the PC-VQA dataset. Red numbers are overlapping outliers obtained by LASSO, iHT, iLTS, and aLTS. Open blue circles are those obtained by LASSO/iHT/iLTS but not aLTS, while filled blue circles are those obtained by aLTS but not LASSO/iHT/iLTS.

Outliers detected by these methods are shown in the paired comparison matrix in Table 2. The paired comparison matrix is constructed as follows (Table 3 is constructed in the same way). For each video pair $(i, j)$, let $n_{ij}$ be the number of comparisons for items $i$ and $j$, among which $a_{ij}$ raters agree that the quality of item $i$ is better than that of item $j$ ($a_{ji}$ carries the opposite meaning). So $a_{ij} + a_{ji} = n_{ij}$ if no tie occurs. In the PC-VQA dataset, $n_{ij} = 32$ for any video pair $(i, j)$. The order of the video IDs in the matrix is arranged such that the global ranking score calculated by the least squares problem with all the comparisons is decreasing (from high to low). The number of outliers estimated by aLTS from this reference video is 716, so we choose the parameter for LASSO/iHT/iLTS to detect 716 outliers; the exact number of outliers returned by LASSO/iHT/iLTS is 718, which is slightly larger than 716.
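For reference, such a paired comparison matrix can be accumulated from the raw comparisons as follows (a sketch using the same (i, j, y) representation as in the earlier sketches):

    import numpy as np

    def comparison_matrix(pairs, y, n_items):
        """a[i, j] counts the comparisons in which item i is preferred to item j."""
        a = np.zeros((n_items, n_items), dtype=int)
        for (i, j), yk in zip(pairs, y):
            if yk > 0:
                a[i, j] += 1
            elif yk < 0:
                a[j, i] += 1
        return a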

The outliers detected by these methods are mainly distributed in the lower left corner of this matrix, which implies that the outliers are those comparisons with large deviations from the global ranking scores by LS. It is easy to see that outliers returned by LASSO, iHT, iLTS, and aLTS are almost the same except on one pair (ID = 3 and ID = 4). In this dataset, 15 raters agree that the quality of ID = 3 is better than that of ID = 4, while 17 raters have the opposite opinion. LASSO, iHT, and iLTS return the same results which tend to choose comparisons with large deviations from the global ranking scores as outliers. So these three treat the 17 comparisons preferring ID = 4 as outliers because ID = 3 ranks above ID = 4. However, aLTS prefers to choose the minority as outliers and treats the 15 comparisons preferring ID = 3 as outliers. Such a small difference only leads to a local order change of ID = 3 and ID = 4. Therefore the ranking algorithms are stable.

Table 3. Paired comparison matrices of reference (c) in PC-IQA dataset. Red numbers, open blue circles, and filled blue circles carry the same meanings as in Table 2.

The global ranking scores of the four algorithms, namely LASSO, iHT, iLTS, and aLTS, are shown in Table 5(a). For ease of seeing the differences in global ranking scores after outlier detection, we also report the results obtained by LS, which has been used in (MM11; tmm12) to derive ranking scores in subjective multimedia assessments. After the detected outliers are removed, the orders of some competitive videos change. LASSO, iHT, iLTS, and aLTS all indicate that ID = 12 performs better than ID = 3 and ID = 4. However, the orders of ID = 3 and ID = 4 are exchanged between aLTS and LASSO/iHT/iLTS, because they choose different preference directions as outliers.

The second dataset, PC-IQA (MM12), is incomplete and imbalanced. This dataset contains 15 reference images and 15 distorted versions of each reference image, so the total number of images is 240. These images come from two publicly available datasets: LIVE (LIVE) and IVC (IVC). In total, 186 raters, each of whom performs a varying number of comparisons via the Internet, provide 23,097 pairwise comparisons.

Table 3 and Table 5(b) show the corresponding experimental results of LASSO, iHT, iLTS, and aLTS on reference image (c) (other reference images exhibit similar results). The number of outliers estimated by aLTS is 173, so we choose the parameter of LASSO/iHT/iLTS to detect 173 outliers. The exact number of outliers returned by LASSO/iHT/iLTS is 177, which is slightly larger than 173. We can see that the difference in detection between LASSO/iHT/iLTS and aLTS happens on two pairs: 1) ID = 6 and ID = 11; 2) ID = 10 and ID = 15. As in the last experiment, these methods differ in outlier detection for highly comparable pairs. aLTS prefers to choose the minority in paired comparisons, i.e., the 5 comparisons preferring ID = 11 over ID = 6 and the 3 comparisons preferring ID = 10 over ID = 15, while LASSO/iHT/iLTS selects the comparisons with the largest deviations from the global ranking scores even when those votes are in the majority. Such a difference leads only to a local order change of the involved items.

Table 4. Comparison of different rankings: (a) reference (a) in the PC-VQA dataset; (b) reference (c) in the PC-IQA dataset. Five ranking methods are compared, with the integer representing the ranking position and the number in parentheses representing the global ranking score returned by the corresponding algorithm.

3.3. Discussion

As we have seen in the numerical experiments, LASSO, iHT, iLTS, and aLTS mostly find the same outliers; when they disagree, aLTS tends to choose the minority votes, whereas LASSO/iHT/iLTS prefer comparisons with large deviations from the global ranking scores even when those votes are in the majority. When outliers consist of minority votes, as in the simulated experiments, aLTS performs better than LASSO, iHT, and iLTS. This can also be explained by the algorithm: we choose a small underestimate of the number of outliers and increase this estimate until there are no outliers among the remaining comparisons. The growth parameter $\alpha$ is chosen to be small so that we do not overestimate the number of outliers too much.

Finally, we would like to point out that subject-based outlier detection is a straightforward extension of our proposed algorithms. From the detection results, one may evaluate the reliability of a rater based on all the comparisons from that rater and remove all the comparisons from unreliable raters.

4. Conclusions

In this paper, we proposed fast algorithms for outlier detection via nonconvex optimization and robust ranking in QoE evaluation. Specifically, for a known number of outliers $K$, the proposed iHT and iLTS provide almost the same performance as LASSO, while being up to 90 times faster. For unknown $K$, we proposed an adaptive method called aLTS, which estimates the number of outliers and detects them without any prior information about their number in the dataset; this method is nearly 3–8 times faster than LASSO. The effectiveness and efficiency of the proposed methods are demonstrated on both simulated examples and real-world applications. The small distinctions between these four methods indicate that aLTS prefers to choose minority votes as outliers, while LASSO, iHT, and iLTS select the comparisons with the largest deviations from the global ranking score as outliers even when those votes are in the majority. In both cases, the global rankings obtained are stable. In summary, we expect that the proposed outlier detection methods for QoE will be helpful tools for people in the multimedia community exploiting crowdsourceable paired comparison data for robust ranking.

References

  • [1] Methods for Subjective Determination of Transmission Quality. ITU-T Rec. P.800, 1996.
  • [2] V. Barnett and T. Lewis. Outliers in statistical data, volume 3. Wiley New York, 1994.
  • [3] M. Breunig, H. Kriegel, R. Ng, and J. Sander. LOF: identifying density-based local outliers. In ACM Conference on Management of Data, volume 29, pages 93–104, 2000.
  • [4] E. Candes and T. Tao. Decoding by linear programming. IEEE Transactions on Information Theory, 51(12):4203–4215, 2005.
  • [5] K.-T. Chen, C.-C. Wu, Y.-C. Chang, and C.-L. Lei. A crowdsourceable QoE evaluation framework for multimedia content. In ACM Multimedia, pages 491–500, 2009.
  • [6] A. Eichhorn, P. Ni, and R. Eg. Randomised pair comparison: an economic and robust method for audiovisual quality assessment. In International Workshop on Network and Operating Systems Support for Digital Audio and Video, pages 63–68, 2010.
  • [7] S. Foucart. Sparse recovery algorithms: sufficient conditions in terms of restricted isometry constants. In Approximation Theory XIII: San Antonio 2010, pages 65–77. Springer, 2012.
  • [8] B. Gardlo, M. Ries, and T. Hossfeld. Impact of screening technique on crowdsourcing QoE assessments. In Radioelektronika, 2012 22nd International Conference, pages 1–4, 2012.
  • [9] T. Hossfeld, C. Keimel, M. Hirth, B. Gardlo, J. Habigt, K. Diepold, and P. Tran-Gia. Best practices for QoE crowdtesting: QoE assessment with crowdsourcing. IEEE Transactions on Multimedia, 16(2):541–558, 2014.
  • [10] P. Huber. Robust Statistics. New York: Wiley, 1981.
  • [11] A. Jain, M. Murty, and P. Flynn. Data clustering: a review. ACM Computing Surveys, 31(3):264–323, 1999.
  • [12] T. Johnson, I. Kwok, and R. Ng. Fast computation of 2-dimensional depth contours. In ACM Knowledge Discovery and Data Mining, pages 224–228, 1998.
  • [13] C. Keimel, J. Habigt, and K. Diepold. Challenges in crowd-based video quality assessment. In International Workshop on Quality of Multimedia Experience, pages 13–18, 2012.
  • [14] E. Knorr and R. Ng. Finding intensional knowledge of distance-based outliers. In International Conference on Very Large Data Bases, volume 99, pages 211–222, 1999.
  • [15] E. Knorr, R. Ng, and V. Tucakov. Distance-based outliers: algorithms and applications. International Journal on Very Large Data Bases, 8(3-4):237–253, 2000.
  • [16] P. Le Callet and F. Autrusseau. Subjective quality assessment irccyn/ivc database, 2005. http://www.irccyn.ec-nantes.fr/ivcdb/.
  • [17] A. Leroy and P. Rousseeuw. Robust regression and outlier detection. John Wiley & Sons, 1987.
  • [18] P. Rousseeuw. Least median of squares regression. Journal of the American statistical association, 79(388):871–880, 1984.
  • [19] P. Rousseeuw and V. Yohai. Robust regression by means of S-estimators. In Robust and nonlinear time series analysis, pages 256–272. 1984.
  • [20] R. Schatz, T. Hoßfeld, L. Janowski, and S. Egger. From packets to people: Quality of experience as new measurement challenge. In Data Traffic Monitoring and Analysis: From Measurement, Classification and Anomaly Detection to Quality of Experience. Springer’s Computer Communications and Networks series, 2012.
  • [21] Y. She and A. Owen. Outlier detection using nonconvex penalized regression. Journal of the American Statistical Association, 106(494):626–639, 2011.
  • [22] H. Sheikh, Z. Wang, L. Cormack, and A. Bovik. LIVE image & video quality assessment database, 2008.
  • [23] C.-C. Wu, K.-T. Chen, Y.-C. Chang, and C.-L. Lei. Crowdsourcing multimedia QoE evaluation: A trusted framework. IEEE Transactions on Multimedia, 15(5):1121–1137, 2013.
  • [24] Q. Xu, Q. Huang, T. Jiang, B. Yan, W. Lin, and Y. Yao. HodgeRank on random graphs for subjective video quality assessment. IEEE Transactions on Multimedia, 14(3):844–857, 2012.
  • [25] Q. Xu, Q. Huang, and Y. Yao. Online crowdsourcing subjective image quality assessment. In ACM Multimedia, pages 359–368, 2012.
  • [26] Q. Xu, T. Jiang, Y. Yao, Q. Huang, B. Yan, and W. Lin. Random partial paired comparison for subjective video quality assessment via HodgeRank. In ACM Multimedia, pages 393–402, 2011.
  • [27] Q. Xu, J. Xiong, Q. Huang, and Y. Yao. Robust evaluation for quality of experience in crowdsourcing. In ACM Multimedia, pages 43–52, 2013.
  • [28] Q. Xu, J. Xiong, Q. Huang, and Y. Yao. Online HodgeRank on random graphs for crowdsourceable QoE evaluation. IEEE Transactions on Multimedia, 16(2):373–386, 2014.

Appendix A Proofs

Proof of Proposition 1.

First we prove . For any , there is such that and is optimal for problem (5). Then for any such that , since

and , we have

Hence is optimal for problem (6), i.e. .

For any , there is such that and is optimal for problem (6). Then for the pre-chosen , since

and , we sum them up to get

Note that is optimal for problem (5), hence equality must hold, i.e.

Hence is optimal for problem (5) as well as , i.e. . Thus .

Then we prove . For any , there is such that and is optimal for problem (6). Since is optimal for , it is easy to know that the index of nonzero entries of is contained in which is the index of entries of with largest squares. Let satisfying and . For any such that , let , then and hence

Hence is optimal for problem (7), i.e. .

For any , there is such that and is optimal for problem (7). For any such that , let and , then and

Hence is optimal for problem (6), i.e. . Thus . ∎

Proof of Theorem 2.1.

Note that

and , we have

Expanding the right hand side and simple calculations imply

Plug in and note that , the above right hand side becomes

Let , then and

Besides, since is positive semi-definite, we know and . Combining the above analysis, we obtain

from which we can prove (12) by induction.

Moreover, if (13) holds, according to (12) we know that for sufficiently large ,

which implies . When additionally, due to the fact that , we have . ∎

Proof of Theorem 2.2.

For any , Let

and be the th smallest (th largest) value of entries of . and are continuous functions of , thus we can find a sufficiently small such that for any satisfying ,

Now, for any given satisfying , we can find an optimal for , so that . Such a satisfies