Are All the Frames Equally Important?

Abstract

In this work, we address the problem of measuring and predicting temporal video saliency - a metric which defines the importance of a video frame for human attention. Unlike the conventional spatial saliency which defines the location of the salient regions within a frame (as it is done for still images), temporal saliency considers importance of a frame as a whole and may not exist apart from context.
The proposed interface is an interactive cursor-based algorithm for collecting experimental data about temporal saliency. We collect the first human responses and perform their analysis. As a result, we show that qualitatively, the produced scores have very explicit meaning of the semantic changes in a frame, while quantitatively being highly correlated between all the observers.
Apart from that, we show that the proposed tool can simultaneously collect fixations similar to the ones produced by eye-tracker in a more affordable way. Further, this approach may be used for creation of first temporal saliency datasets which will allow training computational predictive algorithms. The proposed interface does not rely on any special equipment, which allows to run it remotely and cover a wide audience.

Keywords: attention; video; saliency; temporal saliency; eye-tracking

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
CHI ’20 Extended Abstracts, April 25–30, 2020, Honolulu, HI, USA.
© 2020 Copyright is held by the author/owner(s).
ACM ISBN 978-1-4503-6819-3/20/04.
http://dx.doi.org/10.1145/3334480.3382980

CCS Concepts: Information systems → Multimedia information systems; Human-centered computing → Human computer interaction (HCI); Human-centered computing → User interface toolkits

1 Introduction

It seems obvious that some fragments of a video are more important than others. Such fragments concentrate most of the viewer's attention, while others remain of no interest. Naïve examples are a culmination scene in a movie, a jump scare in a horror film, the moment of an explosion, or even a slight motion in otherwise calm footage. We denote such fragments as groups of frames with high temporal saliency. Information about temporal saliency is an essential part of video characterization and gives valuable insight into the video structure. It is directly applicable in video compression (frames which do not attract attention may be compressed more), video summarization (salient frames contain most of the perceived video content), indexing, memorability prediction, and other tasks. The reader may therefore expect a large number of algorithms and techniques aimed at measuring and predicting temporal saliency. However, this is not the case. Most, if not all, of the well-known works on video saliency address spatial saliency, i.e., the prediction of the spatial distribution of the observer's attention across a frame (much as if it were an individual image). We hypothesize that this is due to the absence of an established methodology for measuring temporal saliency experimentally, which is crucial for obtaining ground-truth data. Conventionally, saliency data are collected using eye-tracking, a technique that produces a continuous temporal signal. In other words, it does not allow differentiating between frames as a whole, because each frame produces the same kind of output: a pair of gaze-fixation coordinates at a rate defined by the hardware.
In this work, we propose a new methodology for measuring temporal video saliency experimentally – the first method of this kind, to the best of our knowledge. For this, we develop a special interface based on the mouse-contingent moving-window approach originally used for measuring saliency maps of static images. We also show that it can simultaneously gather meaningful spatial information which can serve as an approximation of gaze fixations.
During the experiment, observers are presented with repeated blurry video sequences which they can partially deblur with a mouse click (see the margin figure showing the proposed interface). Pressing the button deblurs a circular region centered at the cursor location, which approximates the confined area of focus of the human eye fovea surrounded by a blurred periphery [3]. Since the number of clicks is limited, observers are forced to spend them only on the most "interesting" frames which attract their attention. Statistical analysis of the collected clicks allows assigning a corresponding level of importance to each frame. This information can be applied directly in numerous video-processing tasks.
To summarize, unlike conventional approaches which only try to understand where the observer looks, we also study when the observer pays the most attention.

2 Related works

The straightforward method of retrieving information about attention is based on commercial eye-trackers (e.g., EyeLink, Tobii). Hardware-based eye-tracking has been widely used in various studies on human-computer interaction [6][13]. A less accurate, but much more affordable, way of measuring saliency is based on tracking the mouse cursor position, which has been shown to correlate strongly with gaze fixations [4][5][15]. The most successful algorithms of this type use a moving-window paradigm, which masks information outside of the area adjacent to the cursor and requires the user to move the cursor (followed by a window around it) to make other regions visible. Such algorithms include the Restricted Focus Viewer software by Jansen et al. [7] and the more recent SALICON [8] and BubbleView [10]. These algorithms were also used in large online crowdsourcing experiments due to the native scalability of cursor-based approaches. However, they have been studied only in the context of spatial saliency of static images. This is sufficient for still images, but for video sequences, temporal information is commonly even more important than spatial regions. Furthermore, there are no well-known experimental datasets which provide this kind of information (see footnote 1) and could be used for training computational algorithms. For example, the popular video saliency datasets Hollywood-2 [16], UCF sports [12], SAVAM [2], and DHF1K [17] only provide eye-tracking results, which are constant in the temporal domain.

Margin figure: The proposed interface. A more representative video demonstration is available online: [link].

3 Methodology

Our approach is inspired by moving-window gaze-approximation methods for still images. In the proposed setup, all video frames are blurred, and clicking the mouse deblurs a round window around the cursor. Users are shown a repeated video sequence during which they can click the mouse for short periods of time. The total number of times a frame was deblurred defines its temporal saliency score, while the cursor location at the moment the button is pressed approximates the gaze fixation location and allows detecting what caused the interest.
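To make the rendering step concrete, the following is a minimal sketch (in Python, not the authors' MATLAB/Psychtoolbox implementation) of how a displayed frame can be composed from its sharp and blurred versions; the function name and the default parameter values are assumptions borrowed from the setup described later.

```python
# Illustrative sketch of the mouse-contingent display for a single frame.
import numpy as np
from scipy.ndimage import gaussian_filter

def render_frame(frame, cursor_xy, button_pressed, window_radius=200, blur_sigma=15):
    """Return the frame to display: fully blurred, except for a circular
    window around the cursor while the mouse button is pressed."""
    # Blur each color channel (frame: H x W x 3, float in [0, 1]).
    blurred = np.stack(
        [gaussian_filter(frame[..., c], sigma=blur_sigma) for c in range(frame.shape[-1])],
        axis=-1,
    )
    if not button_pressed:
        return blurred
    h, w = frame.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    cx, cy = cursor_xy
    mask = ((xx - cx) ** 2 + (yy - cy) ** 2) <= window_radius ** 2
    # Show the sharp content inside the circular window, blurred content elsewhere.
    out = blurred.copy()
    out[mask] = frame[mask]
    return out
```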

3.1 Discretization

A short fragment of a video is more likely to attract the user's attention than a single frame, so we let users keep the mouse button pressed instead of clicking on each frame they find interesting. However, when not restricted explicitly, observers tend to keep the mouse button pressed all the time, which is natural. Thus, to obtain variation in scores, it is crucial to restrict users artificially. Our solution is simply to limit the total amount of deblurred frames (a time budget), after which clicking the mouse button stops working, and to additionally limit the amount of deblurred frames per one continuous click. The users cannot see the limits; instead, they learn them during a test trial and then follow them intuitively. For example, a 10-second video may allow up to 4 seconds of deblurred frames, but no more than 1 second at once. As a result, a user can make four long clicks of one second each or a larger number of short clicks, while we are guaranteed to have at least four discrete responses after one run.
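A minimal sketch of this click-budget logic, assuming the example numbers above (25 fps, a 4-second budget per run, 1 second per click); the class and method names are illustrative only.

```python
# Sketch: enforce the total deblur budget per run and the per-click limit.
class ClickBudget:
    def __init__(self, fps=25, total_seconds=4, per_click_seconds=1):
        self.total_left = int(fps * total_seconds)        # frames left in the run budget
        self.per_click_limit = int(fps * per_click_seconds)
        self.current_click = 0                            # frames deblurred in the ongoing click

    def on_frame(self, button_pressed):
        """Called once per displayed frame; returns True if this frame is deblurred."""
        if not button_pressed:
            self.current_click = 0
            return False
        if self.total_left <= 0 or self.current_click >= self.per_click_limit:
            return False                                  # a limit is reached, clicking stops working
        self.total_left -= 1
        self.current_click += 1
        return True
```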

3.2 Repetition

Repeating the videos allows gathering more responses from one observer and obtaining richer statistics. Moreover, if a salient event happens at the end, the observer may reach the limit before seeing it, so a second round is necessary. Also, eye motion and cognitive processing are faster than clicking the mouse, so giving the user an opportunity to predict when an event will happen helps create more accurate saliency maps with a shorter delay. However, we observed that in the majority of cases the first run is the most informative one, and the user is able to detect the most salient information without preparation. Subsequent repeats shift the user's attention to smaller details. We therefore used repetition in our experiments, but analyze different numbers of repeats in the results.

3.3 Other parameters

Other important parameters are the blur radius and the radius of the window; their choice requires a more detailed study. The task given to an observer also influences where they look [18][10], so this parameter depends on the particular context in which the experiment is performed. In our case, we are interested in plain watching of a video without a particular task, so we worked in a "free-view" setup.

Margin figure: Experimental setup (the light is off during the session).

4 Experimental setup

The experiments were performed offline using a special setup in the laboratory (see the margin figure) to ensure fully controlled conditions (in the future, we also plan to run the experiment on Amazon Mechanical Turk to gather a larger database, which would be impossible to do with an eye-tracker). The display was a 24.1″ EIZO ColorEdge CG241W, color-calibrated with an X-Rite Eye-One Pro. The distance between the display and the observer was 50 cm.
The code is written in MATLAB with Psychtoolbox-3 [11] and is publicly available (see footnote 2).
Videos with ground-truth eye-tracking data were taken from the SAVAM dataset [2] due to their high quality and diverse content. We used eight 10-second HD videos, including two test videos. The content is diverse and includes a basketball game with a scoring moment, a calm shot of leaves in the wind, marine animals underwater, a cinematic scene of a child coming home, surveillance camera footage of two men meeting, and a suffocating diver emerging from the water.
Interface parameters: radius of the circular window – 200 px; blur kernel – Gaussian with a standard deviation of 15; video duration – 10 s; limit of deblurred frames per round – 4 s (100 frames); limit of deblurred frames per click – 1 s (25 frames); number of repetitions – 5; frame rate – 25 fps; video resolution – 1280×720 px; videos are silent.
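For reference, the parameters above can be gathered into a single configuration sketch; the field names are invented for illustration, and the visual-angle values missing from the text are left unset.

```python
# Sketch of the experiment configuration; values are taken from the text above.
EXPERIMENT_CONFIG = {
    "window_radius_px": 200,
    "window_radius_deg": None,          # visual angle not recoverable from the text
    "blur_sigma": 15,
    "video_duration_s": 10,
    "deblur_budget_per_round_s": 4,     # 100 frames at 25 fps
    "deblur_limit_per_click_s": 1,      # 25 frames at 25 fps
    "num_repetitions": 5,
    "fps": 25,
    "resolution_px": (1280, 720),
    "audio": False,                      # videos are silent
}
```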
The observers were invited from among university staff and students: 30 subjects in total (15 women, 15 men), aged 21–42 (mean 25.6).

Figure 1: The produced temporal saliency graphs. Thick black line – sum of clicks over all five rounds; red line – first round only; thin black line – weighted sum over the rounds. Zoom is required.
                         Pearson Correlation Coefficient (mean)    Kolmogorov–Smirnov test (mean p-value)
"The underwater world"   0.663  0.694  0.740  0.770                0.119  0.048  0.011  0.036
"Cinematic scene"        0.615  0.711  0.803  0.789                0.164  0.107  0.033  0.067
"Leaves in the wind"     0.694  0.563  0.545  0.647                0.081  0.073  0.044  0.057
"Basketball game"        0.741  0.766  0.863  0.845                0.164  0.099  0.055  0.063
"Diver suffocating"      0.789  0.788  0.820  0.834                0.134  0.092  0.043  0.068
"Meeting of the two"     0.660  0.701  0.740  0.753                0.121  0.112  0.061  0.053
Table 1: Inter-observer consistency of the measured temporal saliency maps. Within each metric, the four columns correspond to an increasing number of rounds used for the computation.
Figure 2: Comparison of spatial saliency maps. Top row in each pair – eye-tracking results; bottom row – our results. Zoom is required.

5 Results and discussion

The proposed interface allows measuring both temporal and spatial saliency at the same time; thus, we evaluate the accuracy of both outputs.

5.1 Temporal saliency results

Since there are no ground-truth temporal saliency data, we evaluate the output of the algorithm by analyzing the produced temporal saliency "maps" and estimating inter-observer consistency. Examples of the obtained temporal saliency "maps" are illustrated in Fig. 1. A demonstration of the videos with saliency scores encoded as a color map is available online: [link]. Figure 1 shows three plots for each video, corresponding to different averaging approaches: the sum of all clicks from all five video repeats (thick black line); the sum of clicks only from the first round, without repeating (red line); and a weighted sum of clicks from the five rounds (thin black line). All scores are normalized by the maximum number of clicks a frame can have.
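A minimal sketch of these three aggregation schemes, assuming the clicks are stored as a per-observer, per-round, per-frame indicator array; the decreasing weights in the weighted variant are an assumption, as the exact weights are not reproduced here.

```python
# Sketch: compute per-frame temporal saliency scores from click indicators.
import numpy as np

def temporal_saliency(clicks, weights=None):
    """clicks: array (num_observers, num_rounds, num_frames), 1 where the frame was deblurred."""
    num_observers, num_rounds, num_frames = clicks.shape
    all_rounds = clicks.sum(axis=(0, 1))               # thick black line: all rounds
    first_round = clicks[:, 0, :].sum(axis=0)          # red line: first round only
    if weights is None:
        weights = np.linspace(1.0, 0.5, num_rounds)    # assumed decreasing per-round weights
    weighted = (clicks.sum(axis=0) * weights[:, None]).sum(axis=0)  # thin black line
    # Normalize by the maximum number of clicks a frame can receive in each scheme.
    return (all_rounds / (num_observers * num_rounds),
            first_round / num_observers,
            weighted / (num_observers * weights.sum()))
```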

Qualitative analysis shows that most of the peaks of the temporal saliency graph correspond to semantically meaningful salient events in the video. This is the main achievement of the proposed interface. It can also be seen that an intentionally chosen monotonic video without salient events ("leaves in the wind") has a relatively flat saliency graph without strongly pronounced peaks (which would likely be even flatter with a larger response sample). Apart from that, for the other videos the output of the first round (red line) is very similar to the total output of all five rounds. This means that even when observers start exploring smaller, less salient details in later rounds, they still return to the "main" events and follow a click pattern similar to that of the first round. Also, adding weights to the sum (thin black line) does not influence the results significantly, which again indicates the similarity of clicks across rounds. However, using repeated rounds does allow gathering substantially more responses, making the graph smoother and, as we show next, yielding more consistent responses between observers.
In order to estimate consistency between different groups of observers, we randomly split the observers into two groups of 15 people each. We then compute temporal saliency maps for each group independently and compare the results using the Pearson correlation coefficient between the saliency maps of the two groups, as well as the Kolmogorov-Smirnov test between the two distributions, reporting the p-value. Results are averaged over 100 random splits (the standard deviation is also reported for the PCC). Table 1 shows that the correlation between responses from different observers is very high, up to 0.86. Increasing the number of rounds considered increases the correlation of responses significantly, with maximum values achieved when all five rounds are included.
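A sketch of this split-half consistency analysis; the function and variable names are assumptions, and the per-group maps are built as plain click counts.

```python
# Sketch: split-half inter-observer consistency via Pearson correlation and a
# two-sample Kolmogorov-Smirnov test, averaged over random splits.
import numpy as np
from scipy.stats import pearsonr, ks_2samp

def split_half_consistency(clicks, num_splits=100, rng=None):
    """clicks: array (num_observers, num_rounds, num_frames) of deblur indicators."""
    rng = np.random.default_rng(0) if rng is None else rng
    num_observers = clicks.shape[0]
    pccs, pvals = [], []
    for _ in range(num_splits):
        perm = rng.permutation(num_observers)
        a, b = perm[:num_observers // 2], perm[num_observers // 2:]
        map_a = clicks[a].sum(axis=(0, 1))   # per-frame scores for group A
        map_b = clicks[b].sum(axis=(0, 1))   # per-frame scores for group B
        r, _ = pearsonr(map_a, map_b)
        pccs.append(r)
        pvals.append(ks_2samp(map_a, map_b).pvalue)
    return np.mean(pccs), np.std(pccs), np.mean(pvals)
```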

5.2 Spatial saliency results

The spatial saliency maps produced from eye-tracking data and from our interface can be compared visually in Fig. 2 (fixation points are blurred with a Gaussian with a standard deviation of 33 px). As may be seen, the results are very similar, even though we did not use any special equipment and collected the spatial data in addition to the main temporal output.
The saliency maps are evaluated quantitatively using standard saliency metrics: Area under the ROC Curve (AUC) [9][1] and Normalized Scanpath Saliency (NSS) [14]. Table 2 presents statistics of the scores computed per frame. The results include both good and poor performance and differ significantly from video to video. Additionally, the quality of the spatial saliency can be assessed visually via the rendered videos with a map overlay [link], as well as the videos showing both the eye-tracking results (blue dots) and our results (red dots) simultaneously [link].

                         AUC (mean)   NSS (mean)
"The underwater world"   0.617        0.73
"Cinematic scene"        0.712        1.59
"Leaves in the wind"     0.548        0.18
"Basketball game"        0.727        1.52
"Diver suffocating"      0.794        2.66
"Meeting of the two"     0.625        0.95
Table 2: Comparison of the measured spatial saliency maps and gaze fixations obtained using an eye-tracker.
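For reference, a minimal sketch of how the spatial evaluation above can be reproduced: cursor locations are turned into a saliency map by Gaussian-blurring the fixation counts (sigma of 33 px, as stated in the text), and NSS is the mean of the standardized map at the ground-truth fixations; AUC is omitted for brevity, and the function names are assumptions.

```python
# Sketch: build a spatial saliency map from click locations and score it with NSS.
import numpy as np
from scipy.ndimage import gaussian_filter

def saliency_map_from_points(points, height, width, sigma=33):
    """points: list of (row, col) cursor locations collected for one frame."""
    counts = np.zeros((height, width))
    for r, c in points:
        counts[int(r), int(c)] += 1
    return gaussian_filter(counts, sigma=sigma)

def nss(saliency_map, gt_fixations):
    """gt_fixations: list of (row, col) eye-tracker fixation locations."""
    s = (saliency_map - saliency_map.mean()) / (saliency_map.std() + 1e-8)
    return float(np.mean([s[int(r), int(c)] for r, c in gt_fixations]))
```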

6 Conclusions

In this work, we presented a novel mouse-contingent interface designed for measuring temporal and spatial video saliency. Temporal saliency is a novel concept which is studied disproportionately less than spatial saliency. Temporal video saliency identifies the important fragments of a video by assigning a saliency score to each frame. The analysis of the experimental study shows that the proposed interface allows accurately approximating the temporal saliency "map" and the gaze fixations of the observers at the same time.

Footnotes

  1. A comprehensive list of saliency datasets: http://saliency.mit.edu/datasets.html
  2. https://github.com/acecreamu/temporal-saliency

References

  1. A. Borji, H. R. Tavakoli, D. N. Sihite and L. Itti (2013) Analysis of scores, datasets, and models in visual saliency prediction. In IEEE ICCV, pp. 921–928. Cited by: §5.2.
  2. Y. Gitman, M. Erofeev, D. Vatolin, A. Bolshakov and A. Fedorov (2014-10) Semiautomatic Visual-Attention modeling and its application to video compression. In IEEE ICIP, Paris, France, pp. 1105–1109. Cited by: §2, §4.
  3. F. Gosselin and P. G. Schyns (2001) Bubbles: a technique to reveal the use of information in recognition tasks. Vision research 41 (17), pp. 2261–2271. Cited by: §1.
  4. Q. Guo and E. Agichtein (2010) Towards predicting web searcher gaze position from mouse movements. In CHI’10 Extended Abstracts, pp. 3601–3606. Cited by: §2.
  5. J. Huang, R. White and G. Buscher (2012) User see, user point: gaze and cursor alignment in web search. In SIGCHI, pp. 1341–1350. Cited by: §2.
  6. R. J. K. Jacob and K. S. Karn (2003) Eye tracking in human-computer interaction and usability research: ready to deliver the promises. Mind 2 (3), pp. 4. Cited by: §2.
  7. A. R. Jansen, A. F. Blackwell and K. Marriott (2003) A tool for tracking visual attention: the restricted focus viewer. Behavior research methods, instruments, & computers 35 (1), pp. 57–69. Cited by: §2.
  8. M. Jiang, S. Huang, J. Duan and Q. Zhao (2015) Salicon: saliency in context. In IEEE CVPR, pp. 1072–1080. Cited by: §2.
  9. T. Judd, K. Ehinger, F. Durand and A. Torralba (2009) Learning to predict where humans look. In IEEE ICCV, pp. 2106–2113. Cited by: §5.2.
  10. N. W. Kim, Z. Bylinskii, M. A. Borkin, K. Z. Gajos, A. Oliva, F. Durand and H. Pfister (2017) BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention. ACM TOCHI 24 (5), pp. 36. Cited by: §2, §3.3.
  11. M. Kleiner, D. Brainard, D. Pelli, A. Ingling, R. Murray and C. Broussard (2007) What’s new in psychtoolbox-3. Cited by: §4.
  12. S. Mathe and C. Sminchisescu (2015) Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. IEEE TPAMI 37 (7), pp. 1408–1424. Cited by: §2.
  13. J. Nielsen and K. Pernice (2010) Eyetracking web usability. New Riders. Cited by: §2.
  14. R. J. Peters, A. Iyer, L. Itti and C. Koch (2005) Components of bottom-up gaze allocation in natural images. Vision research 45 (18), pp. 2397–2416. Cited by: §5.2.
  15. K. Rodden, X. Fu, A. Aula and I. Spiro (2008) Eye-mouse coordination patterns on web search results pages. In CHI’08 Extended Abstracts, CHI EA ’08, New York, NY, USA, pp. 2997–3002. External Links: ISBN 978-1-60558-012-8 Cited by: §2.
  16. E. Vig, M. Dorr and D. Cox (2014-06) Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE CVPR, pp. 2798–2805. External Links: ISSN 1063-6919 Cited by: §2.
  17. W. Wang, J. Shen, F. Guo, M. Cheng and A. Borji (2018) Revisiting video saliency: a large-scale benchmark and a new model. In IEEE CVPR, pp. 4894–4903. Cited by: §2.
  18. A. L. Yarbus (1967) Eye movements and vision. Plenum, New York, NY, USA. Cited by: §3.3.