Are All the Frames Equally Important?
Abstract
In this work, we address the problem of measuring and predicting temporal video saliency, a metric which defines the importance of a video frame for human attention. Unlike conventional spatial saliency, which defines the location of salient regions within a frame (as is done for still images), temporal saliency considers the importance of a frame as a whole and may not exist apart from its context.
The proposed interface is an interactive cursor-based tool for collecting experimental data about temporal saliency. We collect the first human responses and analyze them. As a result, we show that, qualitatively, the produced scores explicitly reflect the semantic changes in a frame, while, quantitatively, they are highly correlated across all observers.
Apart from that, we show that the proposed tool can simultaneously collect fixations similar to those produced by an eye-tracker, in a more affordable way. Further, this approach may be used to create the first temporal saliency datasets, which will allow training computational predictive algorithms. The proposed interface does not rely on any special equipment, which allows it to be run remotely and to cover a wide audience.
1 Introduction
It seems obvious that some fragments of a video are more important than others. Such fragments concentrate most of the viewer's attention, while others remain of no interest. Naïve examples are a culmination scene in a movie, a jump scare in a horror film, the moment of an explosion, or even a slight motion in otherwise very calm footage. We denote such fragments as groups of frames with high temporal saliency. Information about temporal saliency is an essential part of video characterization and gives valuable insights into the video structure. Such information is directly applicable in video compression (frames which do not attract attention may be compressed more aggressively), video summarization (salient frames contain most of the perceived video content), indexing, memorability prediction, and other tasks.

The reader may therefore expect a large number of algorithms and techniques aimed at measuring and predicting temporal saliency. However, this is not the case. Most, if not all, of the well-known works on video saliency address spatial saliency, i.e., the prediction of the spatial distribution of the observer's attention across a frame (much as if it were an individual image). We hypothesize that this is due to the absence of an established methodology for measuring temporal saliency experimentally, which is crucial for obtaining ground-truth data. Conventionally, saliency data are collected using eye-tracking, a technique that produces a continuous temporal signal. In other words, it does not differentiate between frames as a whole, because each frame produces the same kind of output: a pair of gaze-fixation coordinates at a rate defined by the hardware.
In this work, we propose a new methodology for measuring temporal video saliency experimentally; to the best of our knowledge, it is the first method of this kind. For this purpose, we develop a special interface based on the mouse-contingent moving-window approach used for measuring saliency maps of static images. We also show that it can simultaneously gather meaningful spatial information, which can serve as an approximation of gaze fixations.
During the experiment, observers are presented with repeated blurry video sequences which they can partially deblur using mouse clicks (Fig. 2). Users can deblur a circular region centered at the cursor location, which approximates the confined area of focus of the human eye fovea surrounded by a blurred periphery [3]. Since the number of clicks is limited, observers are forced to spend clicks only on the most “interesting” frames which attract their attention. Statistical analysis of the collected clicks allows us to assign a corresponding level of importance to each frame. This information can be applied directly in numerous video-processing tasks.
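To make the mechanism concrete, the per-frame compositing can be sketched as follows. This is only an illustrative Python/NumPy sketch under our own assumptions (a pre-blurred copy of each color frame is blended with the sharp frame inside a circle around the cursor); the actual implementation is described in Section 4.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def blur_frame(frame, sigma=15):
    """Pre-blur a color frame once per video (Gaussian kernel; the
    standard deviation of 15 follows the parameters in Section 4)."""
    return gaussian_filter(frame.astype(float), sigma=(sigma, sigma, 0))

def composite_frame(sharp, blurred, cursor_xy, radius=200, pressed=False):
    """Frame shown to the observer: fully blurred unless the mouse button
    is pressed, in which case a circular window of the given radius
    around the cursor reveals the sharp content."""
    if not pressed:
        return blurred
    h, w = sharp.shape[:2]
    ys, xs = np.ogrid[:h, :w]
    cx, cy = cursor_xy
    inside = (xs - cx) ** 2 + (ys - cy) ** 2 <= radius ** 2
    out = blurred.copy()
    out[inside] = sharp[inside]
    return out
```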
To summarize, unlike conventional approaches, which only try to understand where the observer looks, we also study when the observer pays the most attention.
2 Related works
The straightforward method of retrieving information about attention is based on commercial eye-trackers (e.g., EyeLink, Tobii). Hardware-based eye-tracking has been used widely in various studies on human-computer interaction [6][13]. A less accurate but much more affordable way of measuring saliency is based on recording the mouse cursor position, which has been shown to correlate strongly with gaze fixations [4][5][15]. The most successful algorithms of this type utilize a moving-window paradigm, which masks information outside of the area adjacent to the cursor and requires the user to move the cursor (followed by a window around it) to make other regions visible. Such algorithms include the Restricted Focus Viewer software by Jansen et al. [7] and the more recent SALICON [8] and BubbleView [10]. These algorithms have also been used in large online crowdsourcing experiments due to the native scalability of cursor-based approaches. However, they have been studied only in the context of spatial saliency of static images. This is sufficient for still images, but for video sequences, temporal information is often even more important than the spatial regions. Furthermore, there are no well-known experimental datasets which provide this kind of information.
[Figure 2: The proposed interface. A more representative video demonstration is available online: [link].]
3 Methodology
Our approach is inspired by moving-window gaze-approximation methods for still images. In the proposed setup, all video frames are blurred. Clicking the mouse deblurs a round window around the cursor. Users are shown a repeated video sequence during which they can press the mouse button for short periods of time. The total number of times a frame was deblurred defines its temporal saliency score, while the location of the cursor when the mouse button is pressed approximates the gaze fixation location and allows detecting what caused the interest.
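Concretely, each observer contributes, for every frame, a binary indicator of whether that frame was deblurred, and counting these indicators over observers yields the per-frame score. A minimal sketch of this counting step, in our own notation rather than the released code:

```python
import numpy as np

def temporal_saliency(deblurred):
    """deblurred: boolean array of shape (n_observers, n_frames), where
    entry (i, t) is True if observer i saw frame t deblurred.  The score
    of a frame is the fraction of observers who deblurred it."""
    deblurred = np.asarray(deblurred, dtype=float)
    return deblurred.sum(axis=0) / deblurred.shape[0]
```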
3.1 Discretization
A short fragment of a video is more likely to attract the user's attention than a single frame, so we let users keep the mouse button pressed instead of clicking on each frame they find interesting. However, when not restricted explicitly, observers tend to keep the mouse button pressed all the time, which is natural. Thus, to obtain a variation of scores, it is crucial to restrict users artificially. Our solution is to limit the total amount of deblurred frames (time period), after which clicking the mouse button stops working, and to additionally limit the amount of deblurred frames per one continuous click. The users cannot see the limits; instead, they learn them during a test trial and then follow them intuitively. For example, a 10-second video may have up to 4 seconds of deblurred frames, but no more than 1 second at once. As a result, a user can make four long clicks of 1 second each or a larger number of shorter clicks, while we are guaranteed to have at least four discrete responses after one run.
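A minimal sketch of this budget logic (in Python rather than the authors' MATLAB implementation; the frame counts follow the example above at 25 fps, and all names are ours):

```python
class ClickBudget:
    """Enforces the invisible limits: a total deblur budget per run and a
    cap on the length of one continuous click (both counted in frames)."""

    def __init__(self, total_frames=100, per_click_frames=25):
        self.per_click_frames = per_click_frames
        self.total_left = total_frames       # deblurred frames left in this run
        self.click_left = per_click_frames   # frames left in the current click

    def on_button_down(self):
        # A new continuous click starts: refill the per-click allowance.
        self.click_left = self.per_click_frames

    def allow_deblur(self, button_pressed):
        # Called once per displayed frame; True means the frame may be
        # shown deblurred, False means it stays blurred.
        if not button_pressed or self.total_left <= 0 or self.click_left <= 0:
            return False
        self.total_left -= 1
        self.click_left -= 1
        return True
```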
3.2 Repetition
Repeating the video makes it possible to gather more responses from each observer and obtain richer statistics. Moreover, if a salient event happens at the end, the observer may reach the limit before seeing it, so a second round is necessary. Also, eye motion and cognitive processing are faster than clicking the mouse, so giving the user an opportunity to anticipate when an event will happen helps create more accurate saliency maps with a shorter delay. However, we observed that in the majority of cases the first run is the most informative one, and the user is able to detect the most salient information without preparation. Subsequent repeats shift the user's attention to smaller details. Eventually, we used repetition in our experiments, but we analyze different numbers of repeats in the results.
3.3 Other parameters
Other important parameters are the blur radius and the radius of the window; their selection requires a more detailed study. The task given to an observer also influences where they look [18][10], so this parameter depends on the particular context in which the experiment is performed. In our case, we are interested in basic watching of a video without a particular task, so we worked in a “free-viewing” setup.
[Figure: Experimental setup (the light is off during the session).]
4 Experimental setup
The experiments were performed offline using a special setup in the laboratory (see the experimental-setup figure) to ensure fully controlled conditions (in the future, we also plan to run the experiment on Amazon Mechanical Turk to gather a larger database, which would be impossible to do with an eye-tracker). The display was a 24.1″ EIZO ColorEdge CG241W, color-calibrated with an X-Rite Eye-One Pro. The distance between the display and the observer was 50 cm.
The code is written in MATLAB with Psychtoolbox-3 [11] and is publicly available online.
Videos with ground-truth eye-tracking data were taken from the SAVAM dataset [2] due to their high quality and diverse content. We used eight 10-second-long HD videos, including two test videos. The content is diverse and includes a basketball game with a scoring moment, a calm shot of leaves in the wind, marine animals underwater, a cinematic scene of a child coming home, surveillance camera footage of two men meeting, and a suffocating diver emerging from the water.
Interface parameters:
- radius of the circular window – 200 px ( visual angle);
- blur kernel – Gaussian with a standard deviation of 15;
- video duration – 10 s;
- limit of deblurred frames per round – 4 s (100 frames);
- limit of deblurred frames per single click – 1 s (25 frames);
- number of repetitions – 5;
- frame rate of the videos – 25 fps;
- video resolution – 1280 px × 720 px ( visual angle);
- videos are silent.
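For reference, the same parameters can be gathered into a single configuration structure (an illustrative sketch; the names are ours, and the visual-angle values are omitted because they are not given in the list above):

```python
# Interface parameters collected in one place, so the sketches in
# Section 3 can be driven by them.
PARAMS = dict(
    window_radius_px=200,      # radius of the deblurring window
    blur_sigma=15,             # std. dev. of the Gaussian blur kernel
    video_duration_s=10,
    fps=25,
    total_deblur_frames=100,   # 4 s of deblurred frames per round
    per_click_frames=25,       # 1 s per continuous click
    n_repetitions=5,
    resolution=(1280, 720),    # width x height in px
)
```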
The observers were invited from among the university staff and students: 30 subjects in total (15 women and 15 men), aged 21–42 (mean 25.6).

Table 1: Inter-observer consistency between two random groups of 15 observers: Pearson Correlation Coefficient (mean, left four columns) and Kolmogorov–Smirnov test (mean p-value, right four columns).

| Video | Pearson Correlation Coefficient (mean) | | | | Kolmogorov–Smirnov test (mean p-value) | | | |
|---|---|---|---|---|---|---|---|---|
| "The underwater world" | 0.663 | 0.694 | 0.740 | 0.770 | 0.119 | 0.048 | 0.011 | 0.036 |
| "Cinematic scene" | 0.615 | 0.711 | 0.803 | 0.789 | 0.164 | 0.107 | 0.033 | 0.067 |
| "Leaves in the wind" | 0.694 | 0.563 | 0.545 | 0.647 | 0.081 | 0.073 | 0.044 | 0.057 |
| "Basketball game" | 0.741 | 0.766 | 0.863 | 0.845 | 0.164 | 0.099 | 0.055 | 0.063 |
| "Diver suffocating" | 0.789 | 0.788 | 0.820 | 0.834 | 0.134 | 0.092 | 0.043 | 0.068 |
| "Meeting of the two" | 0.660 | 0.701 | 0.740 | 0.753 | 0.121 | 0.112 | 0.061 | 0.053 |

5 Results and discussion
The proposed interface allows measuring both temporal and spatial saliency at the same time, thus, we evaluate the accuracy of both these outputs.
5.1 Temporal saliency results
Considering that there are no ground-truth temporal saliency data, we evaluate the output of the algorithm by analyzing the produced temporal saliency “maps” and estimating inter-observer consistency. Examples of the obtained temporal saliency “maps” are illustrated in Fig. 1. A demonstration of the videos with saliency scores encoded as a color-map is available online: [link]. Figure 1 shows three plots for each video, corresponding to different averaging approaches: the sum of all clicks from all five video repeats; the sum of clicks from the first round only, without repetition; and a weighted sum of clicks over the five rounds. All scores are normalized by the maximum number of clicks a frame can receive.
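These three aggregation schemes can be written compactly as below. This is our own sketch; in particular, the exact weights of the weighted variant are not specified in the text, so the decaying weights in the usage lines are only a placeholder assumption.

```python
import numpy as np

def aggregate_rounds(clicks, weights=None):
    """clicks: 0/1 array of shape (n_observers, n_rounds, n_frames)
    indicating deblurred frames.  Returns per-frame scores normalized by
    the maximum number of clicks a frame can receive."""
    clicks = np.asarray(clicks, dtype=float)
    n_obs, n_rounds, _ = clicks.shape
    if weights is None:
        weights = np.ones(n_rounds)          # plain (unweighted) sum
    per_round = clicks.sum(axis=0)           # shape (n_rounds, n_frames)
    score = np.tensordot(weights, per_round, axes=(0, 0))
    return score / (n_obs * np.sum(weights))

# scores_all   = aggregate_rounds(clicks)                        # all five rounds
# scores_first = aggregate_rounds(clicks[:, :1])                 # first round only
# scores_wtd   = aggregate_rounds(clicks, 1 / np.arange(1, 6))   # assumed decaying weights
```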
Qualitative analysis shows that most of the peaks on the temporal saliency graph correspond to semantically meaningful salient events in the video. This is the main achievement of the proposed interface. It can also be seen that an intentionally chosen monotonic video without salient events (“leaves in the wind”) has a relatively flat saliency graph without strongly pronounced peaks (which could become even flatter with larger response statistics). Apart from that, for the other videos, the output of the first round (red line) is very similar to the total output of all five rounds. This means that even when observers start exploring smaller, less salient details in the later rounds, they still return to the “main” events and follow a click pattern similar to the first round. Also, adding weights to the sum (thin black line) does not influence the results significantly, which again indicates the similarity of clicks across rounds. However, using five rounds does allow gathering five times more responses, making the graph smoother and, as we show next, producing more consistent responses from each observer.
In order to estimate consistency between different groups of observers, we randomly split the observers into two groups of 15 people each. We then compute temporal saliency maps for each group independently and compare the results. The comparison is done using the Pearson Correlation Coefficient between the saliency maps of the two groups, as well as by performing the Kolmogorov–Smirnov test between the two distributions and reporting the p-value. Results are averaged over 100 random splits (the standard deviation is also reported for PCC). Table 1 shows that the correlation between responses from different observers is very high, reaching 0.86. Increasing the number of rounds considered increases the correlation of responses significantly, with maximum values achieved when all five rounds are included.
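The split-half procedure can be sketched as follows (our own code; in particular, feeding the two per-group curves directly to the two-sample Kolmogorov–Smirnov test is an assumption about the exact protocol):

```python
import numpy as np
from scipy.stats import pearsonr, ks_2samp

def split_half_consistency(clicks, n_splits=100, seed=None):
    """clicks: (n_observers, n_frames) per-observer click counts
    (summed over rounds).  Observers are repeatedly split into two random
    halves; a temporal saliency curve is built per half and compared."""
    clicks = np.asarray(clicks, dtype=float)
    rng = np.random.default_rng(seed)
    n_obs = clicks.shape[0]
    pccs, pvals = [], []
    for _ in range(n_splits):
        order = rng.permutation(n_obs)
        a = clicks[order[: n_obs // 2]].sum(axis=0)
        b = clicks[order[n_obs // 2:]].sum(axis=0)
        pccs.append(pearsonr(a, b)[0])       # correlation of the two curves
        pvals.append(ks_2samp(a, b)[1])      # p-value of the KS test
    return np.mean(pccs), np.std(pccs), np.mean(pvals)
```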
5.2 Spatial saliency results
The spatial saliency maps produced from eye-tracking data and from our interface can be compared visually in Fig. 2 (fixation points are blurred with a Gaussian with sigma equal to 33 px of visual angle). As may be seen, the results are very similar, even though we did not use any special equipment and collected the spatial data in addition to the main temporal output.
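A simple way to turn recorded click locations into such a dense map is to accumulate them into an empty image and blur with the quoted 33 px Gaussian; the sketch below is our own illustration, not the authors' rendering code.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_map(points, shape=(720, 1280), sigma=33):
    """Accumulate click/fixation points (x, y) into an empty map and blur
    with a Gaussian (sigma in pixels, 33 px as quoted above)."""
    m = np.zeros(shape)
    for x, y in points:
        m[int(y), int(x)] += 1
    return gaussian_filter(m, sigma)
```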
Saliency maps are evaluated quantitatively using standard saliency metrics: Area under the ROC Curve (AUC) [9][1] and Normalized Scanpath Saliency (NSS) [14]. Table 2 presents statistics of the scores computed per frame. The results demonstrate both good and poor performance and differ significantly from video to video. Additionally, the quality of the spatial saliency can be assessed visually via the rendered videos with map overlays [link], as well as the videos showing both eye-tracking fixations (blue dots) and our results (red dots) simultaneously [link].
Table 2: Spatial saliency evaluated against eye-tracking data (scores computed per frame).

| Video | AUC (mean) | NSS (mean) |
|---|---|---|
| "The underwater world" | 0.617 | 0.73 |
| "Cinematic scene" | 0.712 | 1.59 |
| "Leaves in the wind" | 0.548 | 0.18 |
| "Basketball game" | 0.727 | 1.52 |
| "Diver suffocating" | 0.794 | 2.66 |
| "Meeting of the two" | 0.625 | 0.95 |
6 Conclusions
In this work, we presented a novel mouse-contingent interface designed for measuring temporal and spatial video saliency. Temporal saliency is a novel concept which has received disproportionately little attention in comparison to spatial saliency. Temporal video saliency identifies the important fragments of a video by assigning a saliency score to each frame. The analysis of our experimental study shows that the proposed interface accurately approximates the temporal saliency “map” and the observers' gaze fixations at the same time.
Footnotes
- A comprehensive list of saliency datasets: http://saliency.mit.edu/datasets.html
- https://github.com/acecreamu/temporal-saliency
References
- [1] (2013) Analysis of scores, datasets, and models in visual saliency prediction. In IEEE ICCV, pp. 921–928.
- [2] (2014) Semiautomatic visual-attention modeling and its application to video compression. In IEEE ICIP, Paris, France, pp. 1105–1109.
- [3] (2001) Bubbles: a technique to reveal the use of information in recognition tasks. Vision Research 41 (17), pp. 2261–2271.
- [4] (2010) Towards predicting web searcher gaze position from mouse movements. In CHI '10 Extended Abstracts, pp. 3601–3606.
- [5] (2012) User see, user point: gaze and cursor alignment in web search. In SIGCHI, pp. 1341–1350.
- [6] (2003) Eye tracking in human-computer interaction and usability research: ready to deliver the promises. Mind 2 (3), pp. 4.
- [7] (2003) A tool for tracking visual attention: the Restricted Focus Viewer. Behavior Research Methods, Instruments, & Computers 35 (1), pp. 57–69.
- [8] (2015) SALICON: saliency in context. In IEEE CVPR, pp. 1072–1080.
- [9] (2009) Learning to predict where humans look. In IEEE ICCV, pp. 2106–2113.
- [10] (2017) BubbleView: an interface for crowdsourcing image importance maps and tracking visual attention. ACM TOCHI 24 (5), pp. 36.
- [11] (2007) What's new in Psychtoolbox-3.
- [12] (2015) Actions in the eye: dynamic gaze datasets and learnt saliency models for visual recognition. IEEE TPAMI 37 (7), pp. 1408–1424.
- [13] (2010) Eyetracking Web Usability. New Riders.
- [14] (2005) Components of bottom-up gaze allocation in natural images. Vision Research 45 (18), pp. 2397–2416.
- [15] (2008) Eye-mouse coordination patterns on web search results pages. In CHI '08 Extended Abstracts, New York, NY, USA, pp. 2997–3002.
- [16] (2014) Large-scale optimization of hierarchical features for saliency prediction in natural images. In IEEE CVPR, pp. 2798–2805.
- [17] (2018) Revisiting video saliency: a large-scale benchmark and a new model. In IEEE CVPR, pp. 4894–4903.
- [18] (1967) Eye Movements and Vision. Plenum, New York, NY, USA.
