A Comparative Evaluation of Temporal Pooling Methods for Blind Video Quality Assessment
Many objective video quality assessment (VQA) algorithms include a key step of temporal pooling of frame-level quality scores. However, less attention has been paid to studying the relative efficacies of different pooling methods for no-reference (blind) VQA. Here we conduct a large-scale comparative evaluation to assess the capabilities and limitations of multiple temporal pooling strategies for blind VQA of user-generated videos. The study yields insights and general guidance regarding the application and selection of temporal pooling models. We also propose an ensemble pooling model built on top of high-performing temporal pooling models. Our experimental results demonstrate the relative efficacies of the evaluated temporal pooling models, using several popular VQA algorithms, on two recent large-scale natural video quality databases. In addition to the new ensemble model, we provide a general recipe for applying temporal pooling of frame-based quality predictions.
1 Introduction
Video quality assessment (VQA) models have been widely studied [seshadrinathan2010study] as an increasingly important toolset used by the streaming and social media industries. While full-reference (FR) VQA research is gradually maturing and several algorithms [wang2004image, li2016toward] are quite widely deployed, recent attention has shifted more towards creating better no-reference (NR) VQA models that can be used to predict and monitor the quality of authentically distorted user-generated content (UGC) videos. UGC videos, which are typically created by amateur videographers, often suffer from unsatisfactory perceptual quality, arising from imperfect capture devices, uncertain shooting skills, a variety of possible content processes, as well as compression and streaming distortions. In this regard, predicting UGC video quality is much more challenging than assessing the quality of synthetically distorted videos in traditional video databases: UGC distortions are more diverse, complicated, and commingled, and no “pristine” reference is available.
Many researchers have studied and proposed possible solutions to the NR VQA problem [mittal2012no, saad2014blind, xu2014no, mittal2015completely, ghadiyaram2017perceptual, varga2019no, li2019quality], among which a simple but reasonably effective strategy is to compute frame-level quality scores, e.g., as generated by image quality assessment (IQA) models, then to express the evolution or relative importance over time by applying temporal pooling on the frame-level quality scores. Simple temporal average pooling is a widely used scheme to augment both FR [seshadrinathan2009motion, bampis2017speed, vu2014vis3] and NR VQA models [mittal2015completely, saad2014blind, varga2019no]. Other kinds of pooling that are used include harmonic mean [li2018vmaf], Minkowski mean [rimac2009influence, seufert2013pool], percentile pooling [moorthy2009visual, chen2016perceptual], and adaptively weighted sums [park2012video]. More sophisticated pooling strategies have considered memory effects, such as primacy, recency [bampis2017study, rimac2009influence, seufert2013pool], and hysteresis [seshadrinathan2011temporal, xu2014no, li2019quality, choi2018video]. The general applicability of these pooling models, however, has not so far been deeply validated in the general context of NR VQA models for real-world UGC videos, though a few more directed studies have been conducted [rimac2009influence, seufert2013pool]. To date, no comprehensive studies have been conducted to establish the added values of the spectrum of available VQA pooling schemes.
Here we seek to help fill this gap by conducting a systematic evaluation of popular temporal pooling algorithms, as applied to leading NR IQA models on recently developed large-scale UGC video quality databases. We assess the benefits, generalizability, and stability of these pooling mechanisms. Our aim is to identify statistically verifiable pooling approaches that can be applied on top of future state-of-the-art IQA models to produce consistently better predictions of video quality. We also propose an ensemble approach, wherein multiple pooling models are aggregated to deliver better retrospective quality prediction. Our experimental results show that the proposed ensemble pooling method is robust and ranks among the top-performing models.
2 Related Work
A variety of methods for spatial pooling of “quality-aware” features have been proposed and studied in [engelke2011visual, moorthy2009visual, wang2010information], yet less effort has been applied to the study of temporal pooling methods for NR VQA. The works most closely related to that reported here are the comparative evaluations of temporal pooling on short video clips [rimac2009influence], and on longer adaptive streaming videos [seufert2013pool]. They have collectively included various pooling methods combined with several objective frame-level quality predictors, evaluated on different subjective databases. Among the studied temporal pooling methods are: simple averaging, percentile pooling [moorthy2009visual], Minkowski pooling [rimac2009influence], harmonic mean pooling [li2018vmaf], and the more complex VQPooling scheme [park2012video], which adaptively emphasizes the worst scores along the time dimension: frame-level scores are clustered into two groups (low quality and high quality), then combined into a single score by upweighting low-quality scores. Methods like percentile pooling and VQPooling are predicated on the accepted notion that quality judgments are heavily influenced by the worst parts of a video.
Another cognitive aspect relevant to temporal visual pooling is the serial-position effect (or memory effect) hypothesis [murdock1962serial]. Primacy and recency are two common effects that have been investigated in numerous video quality of experience (QoE) studies [bampis2017study, ghadiyaram2018learning, nguyen2019modeling], but are less studied in regard to their influence on the blind quality prediction of UGC video clips. Another popular temporal memory modeling approach is hysteresis pooling [seshadrinathan2011temporal], which has been justified in several video quality modeling papers [xu2014no, li2019quality, choi2018video]. The hysteresis model assumes that while subjective judgments drop sharply in response to events of poor video quality, they recover only slowly when the video quality subsequently improves.
3 Evaluating Temporal Pooling Methods
We propose a comprehensive evaluation framework to study the influence of temporal pooling algorithms on the performance of objective video quality models. Suppose a video has $N$ frames processed by an NR IQA model that produces frame-level (time-varying) quality predictions $q_n$, $n = 1, 2, \dots, N$. The per-frame quality scores are temporally combined by a temporal pooling function $f$ to obtain a final quality prediction: $Q = f(q_1, q_2, \dots, q_N)$.
3.1 Frame Quality Prediction
Frame-level quality scores can be predicted by any NR IQA model, such as BRISQUE [mittal2012no], NIQE [mittal2012making], FRIQUEE [ghadiyaram2017perceptual], or even models implemented as deep learning networks [kim2017deep].
3.2 Temporal Pooling Models
Once frame-level quality scores are obtained, they must be summarized into a single overall video quality judgment, and a variety of ways of doing so have been proposed. A variety of human factors have been explored in this context, including visual perception [de2013model, chen2014modeling], memory effects [bampis2017study, ghadiyaram2018learning, seshadrinathan2011temporal], and video content [ghadiyaram2018learning, mirkovic2014evaluating, li2019quality]. Here we model and study a collection of factors that express aspects of temporal quality perception, as candidates for deriving final quality predictions on UGC videos. Specifically, we study the following methods, listed in approximate order of increasing complexity and abstraction:
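Arithmetic Mean: The sample mean of the $N$ frame-level scores $q_n$ is the most widely used method:
$$Q = \frac{1}{N}\sum_{n=1}^{N} q_n. \tag{1}$$
Harmonic Mean: The harmonic mean has been observed to emphasize the impact of low-quality frames [li2018vmaf]:
$$Q = \left(\frac{1}{N}\sum_{n=1}^{N} q_n^{-1}\right)^{-1}. \tag{2}$$
Geometric Mean: The third Pythagorean mean (geometric) expresses the central tendency of the quality scores by the product of their values:
$$Q = \left(\prod_{n=1}^{N} q_n\right)^{1/N}. \tag{3}$$
Minkowski Mean: The Minkowski summation [rimac2009influence, seufert2013pool] of time-varying quality, with exponent $p$, is defined as:
$$Q = \left(\frac{1}{N}\sum_{n=1}^{N} q_n^{p}\right)^{1/p}. \tag{4}$$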
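The four simple means above can be sketched with Python's standard library alone. This is a minimal illustration, not the authors' implementation: the function names and the default exponent `p=2.0` are assumptions made for the example, and the scores are assumed to be strictly positive (required by the harmonic and geometric means).

```python
import statistics


def minkowski_mean(scores, p=2.0):
    """Minkowski (power) mean of positive frame scores.

    The exponent p is a free parameter; 2.0 is only an illustrative default.
    """
    return statistics.fmean([q ** p for q in scores]) ** (1.0 / p)


def pool_simple(scores, p=2.0):
    """Return the four simple pooled scores for one video's frame scores."""
    return {
        "arithmetic": statistics.fmean(scores),
        "harmonic": statistics.harmonic_mean(scores),
        "geometric": statistics.geometric_mean(scores),  # Python 3.8+
        "minkowski": minkowski_mean(scores, p),
    }
```

For positive scores and `p > 1`, the results are ordered harmonic ≤ geometric ≤ arithmetic ≤ Minkowski, which is why the harmonic mean is the most sensitive of the four to low-quality frames.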
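Percentile: Percentile pooling is based on the observed phenomenon that perceptual quality is heavily affected by the “worst” parts of the content. Many prior works have studied and justified (or challenged) percentile pooling [moorthy2009visual, chen2016perceptual, seufert2013pool, rimac2009influence, bampis2017study]. The $k$-th percentile pooling is expressed as:
$$Q = \frac{1}{|P_{\downarrow k\%}|}\sum_{n \in P_{\downarrow k\%}} q_n, \tag{5}$$
where $P_{\downarrow k\%}$ is the set of indices of the frames whose scores $q_n$ lie in the lowest $k\%$.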
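A minimal sketch of percentile pooling follows. It assumes one common reading of the method, namely averaging the lowest `k` percent of frame scores; the function name, the default `k=10.0`, and the rule of keeping at least one frame are assumptions for illustration rather than the settings used in the study.

```python
import math


def percentile_pool(scores, k=10.0):
    """Mean of the lowest k percent of frame scores (at least one frame)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    # Number of frames that fall in the lowest k percent, rounded up.
    count = max(1, math.ceil(len(scores) * k / 100.0))
    worst = sorted(scores)[:count]
    return sum(worst) / count
```

With `k=100` the function reduces to the arithmetic mean, so `k` controls how strongly the pooled score is driven by the worst frames.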
VQPooling: VQPooling is an adaptive spatial and temporal pooling strategy proposed in [park2012video]. Here we only study the temporal pooling part, wherein the quality scores of all frames are classified into two groups composed of lower and higher quality, using $k$-means clustering. The two groups, dubbed $G_L$ and $G_H$, are then combined to obtain an overall quality prediction on the entire video sequence:
$$Q = \frac{\sum_{n \in G_L} q_n + w \cdot \sum_{n \in G_H} q_n}{|G_L| + w \cdot |G_H|}, \tag{6}$$
where $|G_L|$ and $|G_H|$ denote the cardinality of $G_L$ and $G_H$, while the weight $w$ is defined in terms of the ratio between the scores in $G_L$ and $G_H$:
$$w = \left(1 - \frac{M_L}{M_H}\right)^2, \tag{7}$$
where $M_L$ and $M_H$ are the average values of the quality scores in sets $G_L$ and $G_H$, respectively.
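The temporal part of VQPooling can be sketched as below. Several things here are assumptions for illustration: the weight is taken to be `(1 - M_L / M_H) ** 2`, with `M_L` and `M_H` the group means; the clustering is a small hand-written 1-D two-means loop initialized at the minimum and maximum score; and scores are assumed positive. The function names are hypothetical.

```python
def two_means_split(scores, iters=50):
    """Split 1-D scores into (low, high) groups with Lloyd's k-means, k=2."""
    lo_c, hi_c = min(scores), max(scores)
    if lo_c == hi_c:
        return list(scores), []  # constant signal: nothing to separate
    low, high = [], []
    for _ in range(iters):
        low = [q for q in scores if abs(q - lo_c) <= abs(q - hi_c)]
        high = [q for q in scores if abs(q - lo_c) > abs(q - hi_c)]
        new_lo, new_hi = sum(low) / len(low), sum(high) / len(high)
        if (new_lo, new_hi) == (lo_c, hi_c):
            break  # centroids stopped moving
        lo_c, hi_c = new_lo, new_hi
    return low, high


def vq_pool(scores):
    """Weighted mean that down-weights the high-quality group."""
    low, high = two_means_split(scores)
    if not high:
        return sum(low) / len(low)
    m_lo, m_hi = sum(low) / len(low), sum(high) / len(high)
    w = (1.0 - m_lo / m_hi) ** 2  # assumed weight; lies in [0, 1) here
    return (sum(low) + w * sum(high)) / (len(low) + w * len(high))
```

Because `w` is below one whenever the two groups differ, the low-quality group dominates the pooled score, and the result falls below the arithmetic mean.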
Temporal Variation: The approach of [ninassi2009considering] considers the temporal changes of spatial distortions over time and proposes short-term and long-term spatiotemporal pooling mechanisms to account for quality changes. Here we only utilize the temporal variation terms in our study:
$$Q = \frac{1}{|P_{\uparrow k\%}|}\sum_{n \in P_{\uparrow k\%}} \nabla(n), \tag{8}$$
where $\nabla(n) = |q_n - q_{n-1}|$ is the absolute gradient at time $n$, and $P_{\uparrow k\%}$ pools the largest $k\%$ of the per-frame gradients of quality values.
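A small sketch of the temporal variation term is given below. It assumes the gradient is the absolute difference between consecutive frame scores and that the largest `k` percent of those gradients are averaged; the function name and the default `k=10.0` are illustrative assumptions.

```python
import math


def temporal_variation_pool(scores, k=10.0):
    """Mean of the largest k percent of absolute frame-to-frame differences."""
    grads = [abs(b - a) for a, b in zip(scores, scores[1:])]
    if not grads:
        return 0.0  # fewer than two frames: no temporal variation
    count = max(1, math.ceil(len(grads) * k / 100.0))
    largest = sorted(grads, reverse=True)[:count]
    return sum(largest) / count
```

Note that the output measures how abruptly quality changes rather than the quality level itself, so it is not on the same scale as the mean-based pooled scores.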
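Primacy Effect: The primacy effect describes the tendency of human viewers to recall the earliest portion of a video when providing overall evaluations [murdock1962serial]. One way of capturing primacy is as an exponentially decreasing weighted sum. Define
$$Q = \sum_{n=1}^{N} w_n q_n, \tag{9}$$
$$w_n = \frac{\exp(-\alpha (n-1))}{\sum_{k=1}^{N} \exp(-\alpha (k-1))}, \quad 1 \le n \le N, \tag{10}$$
where $\alpha$ controls the rate at which the weights decay over time.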
Recency Effect: The recency effect is another well-established behavioral and memory effect, whereby, in this context, video quality is very strongly influenced by a viewer’s most recently perceived visual impression [murdock1962serial]. The recency effect can also be characterized as an exponential weighted sum (Eq. (9)), but with a different weighting:
$$w_n = \frac{\exp(-\alpha (N-n))}{\sum_{k=1}^{N} \exp(-\alpha (N-k))}, \quad 1 \le n \le N. \tag{11}$$
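Primacy and recency pooling differ only in which end of the video receives the largest weights, so both can be sketched with one function. The decay rate `alpha`, its default value, and the function name are assumptions for illustration; the weights are normalized so that a constant score sequence pools to that constant.

```python
import math


def exp_weighted_pool(scores, alpha=0.01, mode="primacy"):
    """Exponentially weighted average of frame scores.

    mode="primacy" gives the earliest frames the largest weights;
    mode="recency" gives the latest frames the largest weights.
    """
    if not scores:
        raise ValueError("scores must be non-empty")
    last = len(scores) - 1
    if mode == "primacy":
        weights = [math.exp(-alpha * n) for n in range(len(scores))]
    elif mode == "recency":
        weights = [math.exp(-alpha * (last - n)) for n in range(len(scores))]
    else:
        raise ValueError("mode must be 'primacy' or 'recency'")
    total = sum(weights)
    return sum(w * q for w, q in zip(weights, scores)) / total
```

Setting `alpha=0` makes all weights equal and recovers the arithmetic mean, which is a convenient sanity check.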
Temporal Hysteresis: This approach was inspired by the hysteresis effect observed in human judgments of time-varying video quality [seshadrinathan2011temporal], which is closely related to, but not the same as, the recency effect. The hysteresis measurement can be formulated as follows. Let $q_n$, $n = 1, 2, \dots, N$, be the time-varying frame quality scores. The memory of past quality at the $n$-th frame is expressed as the minimum quality score over the previous $\tau$ frames:
$$l_n = \min_{k \in K_{prev}} q_k, \tag{12}$$
where $K_{prev} = \{\max(1, n-\tau), \dots, n-2, n-1\}$ indexes the previous frames (with $l_1 = q_1$). The current video quality is expressed as a weighted sum of ordered [longbotham1989theory] frame-level qualities:
$$v = \mathrm{sort}(\{q_k,\ k \in K_{next}\}), \tag{13}$$
$$m_n = \sum_{j=1}^{|K_{next}|} w_j v_j, \tag{14}$$
where $K_{next} = \{n, n+1, \dots, \min(n+\tau, N)\}$ indexes the next frames, the scores are sorted in ascending order, and $w$ is the descending half of a Gaussian weighting function. Linearly combining the memory and the current quality components in (12) and (14) yields time-varying scores that capture the hysteresis effect. The pooled video quality is computed as the global temporal average of the time-varying hysteresis-transformed predictions:
$$q'_n = \alpha m_n + (1-\alpha) l_n, \tag{15}$$
$$Q = \frac{1}{N}\sum_{n=1}^{N} q'_n, \tag{16}$$
where $\alpha$ adjusts the contributions of these two elements.
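The hysteresis steps described above can be sketched as follows. This is a hedged illustration rather than the reference implementation: the window length `tau` (in frames), the mixing weight `alpha`, the Gaussian width (one third of the window size), and the choice to fall back to the current frame's score when no previous frames exist are all assumptions, as are the function names.

```python
import math


def half_gaussian_weights(size):
    """Descending half-Gaussian weights, normalized to sum to one."""
    sigma = size / 3.0  # assumed width; size is always >= 1 here
    raw = [math.exp(-0.5 * (j / sigma) ** 2) for j in range(size)]
    total = sum(raw)
    return [r / total for r in raw]


def hysteresis_pool(scores, tau=12, alpha=0.8):
    """Average of hysteresis-transformed frame scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    transformed = []
    for n in range(len(scores)):
        # Memory component: worst score over the previous tau frames.
        prev = scores[max(0, n - tau):n]
        memory = min(prev) if prev else scores[n]
        # Current component: upcoming scores sorted ascending, so the
        # largest weights land on the worst upcoming frames.
        upcoming = sorted(scores[n:n + tau + 1])
        weights = half_gaussian_weights(len(upcoming))
        current = sum(w * v for w, v in zip(weights, upcoming))
        transformed.append(alpha * current + (1.0 - alpha) * memory)
    return sum(transformed) / len(transformed)
```

In this sketch a single bad frame lowers the transformed score of every frame within `tau` frames on either side of it, which is how the slow recovery after a quality drop is expressed.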
Table 1: Performance (SRCC / PLCC) of the temporal pooling methods applied to five NR IQA models on KoNViD-1k and LIVE-VQC.
| Pooling | KoNViD-1k NIQE | KoNViD-1k BRISQUE | KoNViD-1k GM-LOG | KoNViD-1k HIGRADE | KoNViD-1k CORNIA | LIVE-VQC NIQE | LIVE-VQC BRISQUE | LIVE-VQC GM-LOG | LIVE-VQC HIGRADE | LIVE-VQC CORNIA |
|---|---|---|---|---|---|---|---|---|---|---|
| Mean | 0.552 / 0.560 | 0.673 / 0.676 | 0.662 / 0.671 | 0.690 / 0.696 | 0.749 / 0.764 | 0.600 / 0.631 | 0.597 / 0.632 | 0.575 / 0.618 | 0.532 / 0.570 | 0.694 / 0.743 |
| Median | 0.543 / 0.554 | 0.667 / 0.670 | 0.657 / 0.666 | 0.680 / 0.689 | 0.750 / 0.760 | 0.584 / 0.618 | 0.577 / 0.619 | 0.558 / 0.602 | 0.521 / 0.559 | 0.687 / 0.744 |
| Harmonic | 0.550 / 0.560 | 0.674 / 0.676 | 0.667 / 0.674 | 0.693 / 0.699 | 0.696 / 0.696 | 0.607 / 0.637 | 0.605 / 0.636 | 0.585 / 0.620 | 0.537 / 0.575 | 0.709 / 0.737 |
| Geometric | 0.551 / 0.560 | 0.676 / 0.679 | 0.666 / 0.673 | 0.692 / 0.698 | 0.747 / 0.760 | 0.604 / 0.634 | 0.600 / 0.631 | 0.578 / 0.617 | 0.537 / 0.573 | 0.698 / 0.746 |
| Minkowski | 0.552 / 0.559 | 0.672 / 0.676 | 0.661 / 0.670 | 0.689 / 0.695 | 0.736 / 0.746 | 0.597 / 0.628 | 0.596 / 0.630 | 0.574 / 0.615 | 0.538 / 0.569 | 0.688 / 0.739 |
| Percentile | 0.545 / 0.547 | 0.655 / 0.647 | 0.674 / 0.678 | 0.685 / 0.687 | 0.696 / 0.700 | 0.630 / 0.634 | 0.629 / 0.647 | 0.606 / 0.627 | 0.586 / 0.610 | 0.712 / 0.744 |
| VQPooling | 0.549 / 0.554 | 0.670 / 0.665 | 0.672 / 0.674 | 0.698 / 0.701 | 0.743 / 0.758 | 0.628 / 0.644 | 0.617 / 0.658 | 0.605 / 0.633 | 0.563 / 0.597 | 0.700 / 0.753 |
| Variation | 0.347 / 0.328 | 0.348 / 0.338 | 0.509 / 0.511 | 0.434 / 0.444 | 0.240 / 0.303 | 0.507 / 0.476 | 0.470 / 0.463 | 0.495 / 0.488 | 0.474 / 0.482 | 0.567 / 0.609 |
| Primacy | 0.541 / 0.552 | 0.668 / 0.671 | 0.647 / 0.653 | 0.684 / 0.690 | 0.726 / 0.741 | 0.601 / 0.631 | 0.573 / 0.627 | 0.575 / 0.613 | 0.535 / 0.561 | 0.684 / 0.737 |
| Recency | 0.553 / 0.558 | 0.670 / 0.667 | 0.660 / 0.667 | 0.690 / 0.694 | 0.745 / 0.754 | 0.584 / 0.615 | 0.586 / 0.626 | 0.561 / 0.599 | 0.518 / 0.555 | 0.670 / 0.729 |
| Hysteresis | 0.563 / 0.569 | 0.684 / 0.681 | 0.681 / 0.684 | 0.703 / 0.707 | 0.732 / 0.735 | 0.621 / 0.638 | 0.621 / 0.650 | 0.600 / 0.629 | 0.570 / 0.595 | 0.711 / 0.756 |
| EPooling | 0.572 / 0.579 | 0.670 / 0.679 | 0.670 / 0.676 | 0.698 / 0.704 | 0.749 / 0.762 | 0.623 / 0.645 | 0.617 / 0.646 | 0.605 / 0.623 | 0.582 / 0.601 | 0.705 / 0.743 |
3.3 Ensemble Temporal Pooling
We have just described a diverse set of temporal pooling mechanisms, each either heuristically or statistically defined, or motivated by psychovisual reasoning. As might be expected, and as we shall show, the performances of these methods differ, and also vary across datasets. Given that these methods likely capture different aspects of perceptual pooling, ensemble learning is a direct way to combine them towards creating a more reliable and generic quality predictor. We denote this ensemble-based temporal pooling as EPooling. Similar concepts of model fusion/ensemble have been successfully applied to IQA/VQA problems [bampis2018spatiotemporal, pei2015image, li2016toward].
Suppose the quality scores delivered by a set of pooling methods are denoted $Q_i$, $i = 1, 2, \dots, M$, where $M$ is the number of input model predictions. We then train an ensemble regressor to fuse the multiple predicted labels into a single final score:
$$\hat{Q} = g(\mathbf{Q}), \tag{17}$$
where $\mathbf{Q} = [Q_1, Q_2, \dots, Q_M]^T$ is the quality vector stacked from multiple singly pooled scores, and $g$ is the learned regression function that maps the proxy quality vector to a final quality prediction $\hat{Q}$. Here we empirically chose Mean, VQPooling, and Hysteresis as the three input prediction models after coarse preliminary feature analysis. Further improvements may be achieved by applying finer feature selection techniques.
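The fusion step can be illustrated with a small stacking sketch. To keep it dependency-free, an ordinary least-squares linear regressor (solved through the normal equations) is swapped in for the learned regression function; it is a stand-in for illustration only and is not the regressor used in the study. The function names and the small `ridge` term, added for numerical stability because pooled scores tend to be highly correlated, are assumptions.

```python
def fit_linear_fusion(pooled_rows, mos, ridge=1e-6):
    """Fit bias + weights mapping per-video pooled scores to MOS.

    pooled_rows: one sequence per video, e.g. (mean, vqpool, hysteresis).
    Returns [bias, w_1, ..., w_M].
    """
    rows = [[1.0] + [float(v) for v in row] for row in pooled_rows]
    dim = len(rows[0])
    # Augmented normal equations: (X^T X + ridge * I) w = X^T y.
    aug = []
    for i in range(dim):
        lhs = [sum(r[i] * r[j] for r in rows) + (ridge if i == j else 0.0)
               for j in range(dim)]
        rhs = sum(r[i] * y for r, y in zip(rows, mos))
        aug.append(lhs + [rhs])
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(dim):
        pivot = max(range(col, dim), key=lambda k: abs(aug[k][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for k in range(dim):
            if k != col:
                factor = aug[k][col]
                aug[k] = [a - factor * b for a, b in zip(aug[k], aug[col])]
    return [aug[i][dim] for i in range(dim)]


def predict_fusion(weights, pooled_row):
    """Fuse one video's pooled scores into a single prediction."""
    return weights[0] + sum(w * x for w, x in zip(weights[1:], pooled_row))
```

In use, `fit_linear_fusion` would be called on the training videos only, and `predict_fusion` on held-out videos, so that the fused score is never evaluated on data it was fitted to.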
4 Experiments
4.1 Experimental Setup
We selected five popular NR IQA models: NIQE [mittal2012making], BRISQUE [mittal2012no], GM-LOG [xue2014blind], HIGRADE [kundu2017no], and CORNIA [ye2012unsupervised], as frame-level quality predictors, and evaluated the temporal pooling methods on two recent large-scale UGC VQA databases: KoNViD-1k [hosu2017konstanz] and LIVE-VQC [sinno2018large]. For the parametric temporal pooling models (Minkowski, percentile, primacy and recency, and Temporal Hysteresis), we used the parameter values recommended in the originating works. We randomly split the evaluation dataset into training and testing portions over repeated trials, and report the overall median performance on the testing set. We conducted fewer iterations for CORNIA due to its high training complexity. Within each split iteration, EPooling requires two phases of training: first, to train the mapping from the IQA feature vector to frame-level quality predictions (meaning predicted MOS), and then, to learn the regression function that fuses the temporally pooled predictions to obtain the final quality result. Both phases are conducted on the training set. We used support vector regression (SVR) as the learning model for both training stages, employing cross-validation and grid search for SVR parameter selection. As performance metrics, we used the Spearman rank-order correlation coefficient (SRCC), calculated between the ground truth MOS and the predicted scores, to measure the prediction monotonicity of the models, and the Pearson linear correlation coefficient (PLCC), computed after logistic mapping, to measure the degree of linear correlation against MOS.
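The two correlation metrics can be sketched in plain Python as below. This is a simplified illustration: the logistic mapping applied before PLCC is omitted, ties in SRCC are handled with average ranks, and the function names are hypothetical. Both functions assume the inputs are not constant, since the correlation is undefined in that case.

```python
import math


def pearson(x, y):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

SRCC depends only on the ordering of the predictions, so any monotonic rescaling of the predicted scores leaves it unchanged, whereas PLCC is sensitive to nonlinearity, which is why a logistic mapping is commonly applied first.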
4.2 Results and Recipe
The performance results on KoNViD-1k [hosu2017konstanz] and LIVE-VQC [sinno2018large] are shown in Table 1. On KoNViD-1k, none of the sophisticated pooling algorithms were observed to significantly outperform the sample mean of temporal video quality scores. While Hysteresis pooling achieved a small average gain in SRCC/PLCC, the three classical Pythagorean means performed quite well while being simple and computationally efficient. When tested on LIVE-VQC [sinno2018large], however, we observed an average performance gain when employing perceptual importance pooling methods such as Percentile [moorthy2009visual], VQPooling [park2012video], and Hysteresis [seshadrinathan2011temporal], regardless of which NR IQA model was used. It is likely that the memory-related effects, primacy and recency, would play a more important role on longer videos (usually minutes long), as shown in [bampis2017study, ghadiyaram2018learning], but they did not contribute much on the short-duration videos (8-10 seconds) in these datasets. Our proposed ensemble method of pooling achieved consistently competitive outcomes on both datasets.
These results nevertheless reveal different trends on the two databases: KoNViD-1k yielded similar results among most of the competing pooling approaches, whereas on LIVE-VQC, Percentile, VQPooling, Hysteresis, and the ensemble enhancement, EPooling, generated the best scores. Towards understanding this, we observe that LIVE-VQC contains videos with more camera motion, hence more temporal variation, than those in KoNViD-1k. It is possible that LIVE-VQC contains a larger range of perceived time-varying quality scores, while temporal quality variations in KoNViD-1k adhere more closely to the mean quality level. Recalling the aforementioned hypothesis that perceptual quality is heavily affected by the worst portions of a video, our experimental results support this assumption. In conclusion, our suggested recipe for incorporating temporal pooling into the design of NR VQA models depends strongly on video content: for videos containing more motion or temporal quality variations, pooling strategies that more heavily weight low-quality events are recommended. For videos with little quality variation or motion, traditional statistical mean predictions may be adequate.
5 Conclusion
We conducted a benchmark study on the added value of integrating temporal pooling into blind video quality assessment for user-generated video content. We found that the efficacy of temporal pooling is content-dependent, but an ensemble approach can further improve quality prediction performance on a difficult problem that is only incompletely understood.