CrowdFix: An Eyetracking Data-set of Human Crowd Video
Understanding human visual attention and saliency is an integral part of vision research. In this context, there is an ever-present need for fresh and diverse benchmark datasets, particularly for insight into special use cases like crowded scenes. We contribute to this end by (1) reviewing the dynamics behind saliency and crowds; (2) using eye tracking to create a dynamic human eye fixation dataset over a new set of crowd videos gathered from the Internet, annotated into three distinct density levels; and (3) evaluating state-of-the-art saliency models on our dataset to identify possible improvements for the design and creation of a more robust saliency model.
Memoona, National University of Sciences and Technology (NUST), Islamabad, Pakistan
Sobas, National University of Sciences and Technology (NUST), Islamabad, Pakistan
Anis U., National University of Sciences and Technology (NUST), Islamabad, Pakistan
Omar, National University of Sciences and Technology (NUST), Islamabad, Pakistan
Saliency studies form the intersection between natural and computer vision. A quantitative study of saliency provides structured insight into what the human mind perceives to be important in a scene. Visual attention then guides gaze to focus on and further explore that region of interest. To achieve near-human accuracy in predicting gaze locations, saliency models need to approximate gaze over a wide variety of stimuli [borji2019saliency]. We approach this problem by first discussing the static and dynamic stimuli used for modelling saliency, and then the need for specialized datasets to boost saliency modelling. Traditionally, most active research has been on images, but in recent years the use of dynamic content as the subject of saliency studies has picked up pace. The pace of this research is determined by publicly available, diverse datasets of videos covering a multitude of natural scenes. DIEM [mital2011clustering], HOLLYWOOD-2 [mathe2014actions], UCFSports [mathe2014actions], LEDOV [jiang2018deepvs] and DHF1K [wang2018revisiting] are dynamic datasets that cover a range of natural scenes. However, there is an obvious gap for specialized datasets targeting a single category of natural scenes. Our study focuses on crowded scenes because they present an interesting use case: more stimuli compete for attention in crowd scenes, and crowd activity is far more random and attention-grabbing than in typical scenes containing one or two objects of interest [yoo2016visual]. This insight proves useful for monitoring, managing and securing crowds [gupta2014design]. To date, there has been only one crowd saliency dataset, EyeCrowd, consisting of 500 natural images [jiang2014people].
Our research contributes the first saliency dataset of crowd videos, called ‘CrowdFix’, together with its corresponding saliency information, to the pool of publicly available saliency datasets. CrowdFix is a dataset of high-definition (720p) RGB videos of real-life moving crowds. Eye tracking results benefit from higher quality datasets [vigier2016new]. For this reason, we chose not to include videos from pre-existing crowd video datasets, whose quality is lower (below 720p). The dataset has been further annotated into three crowd density levels to facilitate understanding of attention modulation within each level. This also helps in producing better, more generalized saliency models, particularly deep models, by providing a finer categorization of salient images and videos [he2019understanding]. We assess the attentional impact of the different crowd levels on individuals and evaluate three state-of-the-art deep learning based saliency models on our dataset to judge how well general saliency models perform for crowd saliency prediction. This analysis serves as a baseline for the future design of a crowd saliency model.
1.1 Related Work
Gaze, a coordinated action of the eyes and head, has frequently been used as a proxy for attention in natural behaviour. For example, a human or a robot has to interact with surrounding objects and control its gaze to accomplish a task while moving through its surroundings. In this sense, gaze control involves vision, action and attention concurrently to achieve the sensory-motor coordination necessary for the desired behaviour (e.g., reaching and grasping) [borji2012state].
Our human visual system is designed to automatically filter the visual information arriving through our gaze. This filtering is done passively, based on the outcome of visual attention. Visual attention is a mechanism that mediates between competing aspects of a visual scene and assists in selecting the most important regions while diminishing the importance of others. Understanding how visual attention works is vital for determining how vision will be directed to the objects presented in front of it [jiang2014people]. Attention is an umbrella term that includes all factors influencing selection. Active selection is expected to be governed by two major channels, called bottom-up and top-down control. Bottom-up attention is spontaneous: it is fast, uncontrolled and stimulus-driven. Our attention is naturally drawn to salient regions in the visual field, and the term “salient” is used interchangeably with bottom-up attention [borji2012state]. Human visual attention is therefore expected to be drawn to the salient stimuli in the environment.
1.2 Crowds and Visual Attention
Crowds represent a unique challenge for visual attention selection. A crowd is a large group of people gathered together, with attributes such as density and movement. Crowd scenes form a distinct category of complex scenes, such as street crossings, in which several objects interact with one another and follow different movement patterns, for example walking straight and then turning left [yoo2016visual]. Crowds have an impact on the attention of an individual, and this impact can be tested by relating the physical stimuli to the contents of consciousness [mancas2010dense]. We can then correlate the different crowd levels with visual attention to learn how an individual behaves while free-viewing real-life crowds. This information is fundamental for crowd management, safety and surveillance, in handling and avoiding emergencies caused by rush and congestion [chiappino2015bio]. Analysis of complex situations such as dense crowds can benefit greatly from algorithms that encode human attention [mancas2010dense]. Knowing where humans look in a scene also serves many applications in human-computer interaction, graphics and user interface design, particularly for small displays. Additionally, knowledge of visual attention is beneficial for automatic image cropping, thumbnailing, image search, and image and video compression [judd2009learning].
1.3 Computational Models for Visual Attention
Older models integrated complicated characteristics of the Human Visual System (HVS) and reconstructed the visual input by hierarchically combining low-level features. The bottom-up mechanism is the most common feature of these models [le2006coherent]. The core cue for bottom-up attention is the rarity and distinctiveness of a feature in a given context [mancas2010attention]. Bottom-up models use a feed-forward method to process visual input: they apply sequential transformations to visual features collected over the entire visual field in order to highlight the regions that are the most attention-grabbing, significant or eye-catching, the so-called salient information [borji2012state]. However, the existing models present a reductionist view of visual attention, because fixations are influenced not only by bottom-up saliency as determined by the models but also by various top-down influences. Consequently, comparing bottom-up saliency maps to eye fixations is demanding and requires an attempt to minimize top-down effects [volokitin2016predicting]. One way is to focus on early fixations, when top-down influences have not yet come into effect, for example by using jump cuts in videos, which in our case means an MTV-style video stimulus [carmi2006role].
2 Our Contribution: The CrowdFix Dataset
The only crowd eye tracking experiments conducted so far used images. No HD (720p), FHD (1080p) or 4K crowd video dataset exists; all existing datasets have lower resolutions. A higher quality dataset leads to better eye fixation information, because HD and FHD show a greater level of detail in the video and allow for more possibilities of visual exploration [vigier2016new]. Most datasets cater exclusively to high-density or abnormal crowds, which establishes the need for diversity in crowd density levels. To the best of our knowledge, no such categorization has been performed on existing crowd video datasets.
We collected a crowd dataset consisting of videos that depict real-life scenarios. The dataset is categorized into three distinct crowd density levels: sparse, dense free-flowing and dense congested. It is built for studying the influence of crowds on attention and saliency, and therefore consists of diverse, real-life moving crowds. It has a total of 89 videos cut into 434 clips for MTV-style videos. With high resolution as the key starting point, the videos have a resolution of 1280 × 720 at 30 frames per second. None of the videos is taken from a previously existing dataset. To maintain clarity and simplicity, none of the videos is fast-forwarded or watermarked, and all are in RGB. To generate the dataset, we picked crowd videos released under Creative Commons licences that depict multiple real-life crowded scenes. Simulated crowd videos were not considered. We collected a wide variety of moving crowd scenes while assessing the varying densities of the crowds, and then finalized the selection. The categorization into crowd density levels was derived from the participants’ responses.
The major step in creating the stimulus was to maximize bottom-up attention. Since bottom-up attention is involuntary, the stimulus should change frequently and abruptly. In videos, this can be achieved with jump cuts, by joining very short videos back to back. We call each of these very short videos a ‘clip’ or snippet. Based on the research of [carmi2006role], each snippet lasts between 1 and 3 seconds; any longer clip would invoke top-down attention. To create the clips from the crowd videos, we take all the videos from each density level and randomly shuffle them, so that no sequence based on crowd density makes the stimulus predictable. The snippets are then randomly combined into two videos of approximately 10 minutes each, which we present to the participants. Table 1 shows the attributes of the real-life crowd dataset.
|Stimuli type||Outdoor daytime/nighttime human moving crowds|
|Sources||5 (Flickr, Pexels, Pixabay, Vimeo and YouTube)|
|Licence||under Creative Commons|
|Number of videos||89 video clips|
|Categories||Sparse, dense free-flowing, and dense congested|
|Videos per category||Sparse (15), dense free-flowing (41), and dense congested (33)|
|Total video frames||37,493|
|Video frame size||1280 × 720|
|Video snippets||485 (1–3s each)|
|Video clippet||Randomly selected snippets|
|Video clippet duration||10 mins|
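The shuffling and recombination of snippets described above can be sketched as follows. This is a minimal illustration and not the authors' script: the function names, the fixed seed, the round-robin split into two sequences and the example snippet list are all assumptions made for the sketch.

```python
import random

def build_mtv_sequences(snippets, n_sequences=2, seed=0):
    """Shuffle snippets and deal them into n_sequences playlists.

    `snippets` is a list of (snippet_id, density_level, duration_s) tuples.
    The seed and the round-robin split are assumptions of this sketch.
    """
    rng = random.Random(seed)   # fixed seed: every participant sees the same order
    shuffled = snippets[:]      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)       # mixes density levels so the order is unpredictable
    # Deal round-robin so the sequences receive a similar number of snippets.
    return [shuffled[i::n_sequences] for i in range(n_sequences)]

def total_duration(sequence):
    """Total playing time of a sequence in seconds."""
    return sum(duration for _, _, duration in sequence)

# Hypothetical snippets with durations in the 1 to 3 second range given in the text.
_rng = random.Random(1)
levels = ["sparse", "dense free-flowing", "dense congested"]
snippets = [(i, levels[i % 3], _rng.uniform(1.0, 3.0)) for i in range(12)]
mtv1, mtv2 = build_mtv_sequences(snippets)
```

Fixing the seed mirrors the requirement that all participants see the same sequence, while the shuffle removes any ordering by density level.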
2.0.1 Dataset Annotation
The objective of dataset annotation is to divide the dataset into distinct crowd density levels, a feature that all previously available crowd video datasets lack. The major attributes of crowds include density, orientation, time and location of the event, type of event, demographics and organization within the crowd. Hence we focus on crowd density levels, which are relevant from social, psychological and computational points of view. For moving crowds, 1 to 1.5 people per square meter is treated as sparse, 2 people per square meter as dense free-flowing, and 3 to 4 people per square meter as dense congested [crowden]. 23 annotators (5 males and 18 females, aged between 17 and 40) free-viewed the crowd videos. After viewing each video, they were given time to mark it as one of the levels defined above. Figure 1 shows the distribution of the categories chosen by the annotators, which were then assigned to the videos in the dataset.
Since each density decision was saved right after the video was shown, we can be fairly certain that the participant’s judgment was not influenced by other videos. Participants could pause for as long as they wished before moving on to the next video. Figure 2 shows sample images of the different crowd density levels; the rows represent sparse, dense free-flowing and dense congested crowds respectively.
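One plausible way to turn the annotators' choices into a single label per video is a majority vote. The text does not state the aggregation rule, so the rule, the function name and the example votes below are assumptions of this sketch.

```python
from collections import Counter

LEVELS = ("sparse", "dense free-flowing", "dense congested")

def assign_density_level(votes):
    """Return the most frequently chosen level for one video and its vote share.

    Majority voting is an assumed aggregation rule. On a tie,
    Counter.most_common keeps the level that was encountered first.
    """
    counts = Counter(votes)
    level, n = counts.most_common(1)[0]
    return level, n / len(votes)

# Hypothetical votes from 23 annotators for a single video.
votes = ["dense congested"] * 14 + ["dense free-flowing"] * 7 + ["sparse"] * 2
level, share = assign_density_level(votes)
```

The vote share gives a rough indication of how strongly the annotators agreed on a video.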
2.1 Eyetracking Data Acquisition
2.2 Eye tracking, general motivation and process
We chose eye tracking as our ground truth collection approach to obtain good data [tavakoli2017saliency]. Ground truth refers to human eye movement data obtained from real-life observers who viewed the stimulus. This works well because our stimuli are videos, where each frame changes rapidly and stays on screen for only a split second.
Videos were displayed on a 23.8” HP 24es LCD monitor (resolution 1920 × 1080), with the participant resting their face on a head and chin rest to minimize ambiguity and shakiness in eye movement tracking. The viewing distance from the screen was kept at 60 cm. 32 participants volunteered for free viewing of the videos while their gaze points were recorded. Figure 3 shows the experimental setup and the experiment being performed by a volunteer.
All participants were shown the same set of videos in the same order. Free viewing allows participants to engage in natural visual exploration, while the instructions encouraged them to pay close attention to the screen during the session and kept them naive to the objective of the experiment. No participant had seen the stimulus before. The instructions given were as follows:
You have to thoroughly watch the videos that are going to be played in front of you
Try to follow the main things in the video as some general questions can be asked at the end of the session
Make sure your eye sight is normal or corrected and you’re not wearing any polarized glasses or mascara
The Eye Tribe eye tracker is used to perform the experiment with our dataset. The company, The Eye Tribe, markets its eye tracker as “the world’s first $99 eye tracker with full SDK”. Two software components accompany the device: EyeTribe UI and EyeTribe Server. It has a sampling rate of 60 Hz and a standard precision of 0.5° to 1.0°. Eye tracking is the measurement of eye movement or activity. Near-infrared light is directed towards the pupils, causing visible reflections in the cornea (the outermost optical part of the eye), which are tracked by a camera. The result is fixation data, where a fixation is a period during which the eyes remain locked on a particular object in the visual field [dalrymple2018examination].
An eye tracker’s performance is commonly assessed by two metrics: accuracy and precision. Accuracy, or systematic error, reflects the eye tracker’s ability to measure the point of regard, and is defined as the mean difference between a test stimulus position and the measured gaze position [holmqvist2012eye]. Precision refers to the eye tracker’s ability to deliver steady measurements, and is assessed by calculating the root mean square noise [holmqvist2012eye].
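The two metrics can be illustrated with a short sketch. It assumes gaze samples already expressed in degrees of visual angle, takes accuracy as the mean Euclidean offset from the target, and takes precision as the root mean square of sample-to-sample distances. The function names and sample values are hypothetical.

```python
import math

def accuracy_deg(samples, target):
    """Systematic error: mean offset between gaze samples and the target."""
    return sum(math.dist(s, target) for s in samples) / len(samples)

def precision_rms_deg(samples):
    """Noise: root mean square of the distances between successive samples."""
    steps = [math.dist(a, b) for a, b in zip(samples, samples[1:])]
    return math.sqrt(sum(d * d for d in steps) / len(steps))

# Hypothetical gaze samples (degrees of visual angle) while fixating a target at (0, 0).
samples = [(0.5, 0.0), (0.5, 0.1), (0.6, 0.1), (0.6, 0.0)]
acc = accuracy_deg(samples, (0.0, 0.0))
prec = precision_rms_deg(samples)
```

In this example the samples sit about half a degree away from the target (poor accuracy) while moving only 0.1° between samples (good precision), which shows that the two measures are independent.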
The eye tracker is controlled by a script using the Python-based PyGaze toolbox (an alternative to the MATLAB Psychtoolbox), running on a Lenovo 320-15IKB (Intel Core i7-8550U CPU @ 1.80 GHz, 8 GB, Windows 10) with an HP 24es LED monitor (23.8 inch, 60 Hz, 1920 × 1080 pixels, 52.7 × 29.6 cm). Calibration is performed using a nine-point grid scripted in Python. Table 2 shows the properties of the Eye Tribe eye tracker used for the experiment.
|Eye tracking principle||Non-invasive, image based eye tracking|
|Sampling rate||30 Hz or 60 Hz|
|Spatial resolution||1.0 (RMS)|
|Latency||20 ms at 60 Hz|
|Calibration||9, 12 or 16 points|
|Operating range||45–75 cm|
|Tracking area||40 × 30 cm at 65 cm distance|
|Gaze tracking range||Up to 24”|
|Data output||Binocular gaze data|
To establish the aforementioned measures, gaze position is recorded during two-second periods in which the participant fixates a target stimulus. The targets appear consecutively, with an inter-trial interval of one second, at locations that differ from the calibration grid. The target grid spans 25.81 degrees of visual angle horizontally and 19.50 degrees vertically (centred on the display centre). This is done to ensure that the tracker is adequate for the experiment in terms of systematic error (spatial accuracy in degrees of visual angle), precision (temporal accuracy in degrees of visual angle), and sampling accuracy.
2.2.1 Experimental design
We use convenience sampling for the eye tracking experiment, meaning that we sought volunteers among colleagues and people around us in the university only. Data cleaning is performed by comparing the accuracy and precision calculated at calibration and at validation. A systematic error of less than 1.7° is regarded as acceptable [blignaut2014eye]. Based on the data cleaning process, the data of 6 participants was discarded, leaving 26 participants (16 females and 10 males) aged between 17 and 40 years. Since the research was conducted at the graduate level, the participants were mostly graduates, with normal or corrected vision. Since eye tracking falls under human behavioural research, we choose elements from commonly used behavioural experiments. The design of the experiment is motivated by the need to quantify the response of participants to the stimuli in an objective and reliable way. To maintain reliability, the stimulus duration and sequence are fixed for every participant. The experiment is divided into two identical blocks separated by a break of 3 to 5 minutes, with the second block starting with a re-calibration. The video sequence within each block remains the same for each participant. The MTV-style sequence in each block keeps the stimulus unpredictable and preserves objectivity. The stimuli did not overlap, as this is a mixed design: it includes longitudinal data (samples collected at a rate of 60 Hz) and cross-sectional data across several participants. The hypothesis underlying the design of the eye tracking experiment is defined by cause, effect and goal. The cause is our stimuli, that is, the crowd videos shown on a monitor. The effect is the change in the visual attention of the participant. Our goal is to analyze visual attention in crowd videos. The independent variable in our experiment is the set of crowd videos.
The well-defined crowd density levels ensure that we provide sufficiently diverse stimuli to our participants. The dependent variables are therefore the raw gaze data and the fixation data. Making the fixation data accurately represent eye fixations that rest on salient regions only is difficult, however, and requires reducing top-down influences by concentrating on the initial fixations on a stimulus [volokitin2016predicting]. One way to keep the focus on early fixations is to use jump cuts in videos, which in our case means an MTV-style video stimulus. This is in line with the understanding that salient parts of a scene consist of unexpected onsets or local singularities [le2006coherent]. Early attention is driven by initial interactions, whereas later viewing involves task, memory and other complex processes. Hence the clips are reassembled into two MTV-style videos, MTV1 and MTV2, with durations of 10:12 and 10:37 minutes respectively. These videos target bottom-up attention by reducing the time a participant has to think, which keeps the recorded data objective.
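The data cleaning step can be pictured as a simple threshold filter. The sketch below assumes that a participant is kept only when the systematic error measured at both calibration and validation is below the 1.7 threshold quoted above, taken here in degrees of visual angle. The exact rule used in the study is not spelled out, and the participant identifiers and error values are hypothetical.

```python
MAX_SYSTEMATIC_ERROR_DEG = 1.7  # acceptance threshold quoted in the text

def keep_participant(calibration_error, validation_error,
                     limit=MAX_SYSTEMATIC_ERROR_DEG):
    """Keep a participant only if both measured errors are below the limit.

    Requiring both measurements to pass is an assumption of this sketch.
    """
    return calibration_error < limit and validation_error < limit

# Hypothetical per-participant errors in degrees: (at calibration, at validation).
errors = {"p01": (0.6, 0.9), "p02": (1.1, 2.3), "p03": (0.8, 1.6)}
kept = sorted(p for p, (cal, val) in errors.items() if keep_participant(cal, val))
```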
2.3 Database Location, Structure and Accessibility
The dataset is organized into stimuli containing the video frames, fixation maps, which are binary maps pinpointing the exact locations of the fixations, and saliency maps, which are the Gaussian-blurred fixations. Following standard procedure, the sigma of the Gaussian was set equal to one degree of visual angle, which corresponds to 38 pixels in our dataset.
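The value of 38 pixels is consistent with the monitor geometry given earlier (52.7 cm and 1920 pixels wide, viewed from 60 cm), assuming it refers to display pixels. The sketch below reproduces that conversion and shows how a binary fixation map could be blurred into a saliency map. It is a pure-Python illustration on a tiny map with a reduced sigma, and the function names are hypothetical; a real pipeline would normally use an array library.

```python
import math

def pixels_per_degree(screen_width_cm, screen_width_px, distance_cm):
    """Pixels spanned by one degree of visual angle at the screen centre."""
    cm_per_degree = 2 * distance_cm * math.tan(math.radians(0.5))
    return cm_per_degree * screen_width_px / screen_width_cm

def gaussian_kernel(sigma, radius):
    """Normalised 1-D Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(grid, kernel):
    """Filter every row with the kernel; pixels outside the grid count as zero."""
    radius = len(kernel) // 2
    width = len(grid[0])
    return [[sum(kernel[k + radius] * row[x + k]
                 for k in range(-radius, radius + 1) if 0 <= x + k < width)
             for x in range(width)]
            for row in grid]

def saliency_from_fixations(fixation_map, sigma):
    """Separable Gaussian blur: rows first, then columns via transposition."""
    kernel = gaussian_kernel(sigma, radius=int(3 * sigma))
    blurred = blur_rows(fixation_map, kernel)
    transposed = [list(col) for col in zip(*blurred)]
    blurred = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*blurred)]

# Monitor geometry from the text: 52.7 cm and 1920 px wide, viewed from 60 cm.
ppd = pixels_per_degree(52.7, 1920, 60.0)

# Tiny hypothetical fixation map with one fixation; sigma reduced to 1 pixel.
fixation_map = [[0] * 9 for _ in range(9)]
fixation_map[4][4] = 1
saliency = saliency_from_fixations(fixation_map, sigma=1.0)
```

With the stated geometry, `ppd` rounds to 38, matching the sigma reported for the dataset.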
A brief description of the dataset along with the download link can be found at: https://github.com/MemoonaTahira/CrowdFix.
3 Results and Analysis
3.1 Fixation Overview
The log files from the eye tracking experiment contain each participant’s fixations across all the videos. We performed a general analysis of these log files to summarize the experiment. The total and average number of fixations were computed for both MTV1 and MTV2, along with the minimum and maximum number of fixations in each part. The values show that MTV1 has more fixations than MTV2, and therefore a higher average number of fixations. Table 3 shows the fixation data gathered from all the participants in MTV parts 1 and 2.
|Attributes||MTV 1||MTV 2|
|Median of fixations||455||443|
3.2 Gaze Data Visualization
The final analysis is done with respect to crowd density levels and their number of videos. Figure 4 shows the distribution of videos over the different crowd levels; each level has a different number of videos.
The crowd density levels were evaluated on two measures: number of fixations and fixation duration. Table 4 shows the results for the sparse, dense free-flowing and dense congested crowd levels. Of these levels, dense congested has the highest number of fixations, since there are more people to look at than in dense free-flowing and sparse scenes. Although sparse has fewer fixations, it has the longest fixation duration.
|Crowd Level||Number of Fixations||Average Fixation Duration|
|Dense free-flowing||4238||229.4085 ms|
|Dense congested||4030||228.7178 ms|
Fixation duration describes how long an individual’s fixation rested on the screen. The sparse level having the longest fixation duration shows that it holds the attention of viewers the most: with fewer people to look at, they spend longer viewing each of them than in dense free-flowing and dense congested scenes.
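The per-level summary of fixation count and average duration can be sketched as a simple grouping step. The record layout, function name and values below are hypothetical and do not reproduce the figures in Table 4.

```python
from collections import defaultdict
from statistics import mean

def summarise_by_level(fixations):
    """Map each crowd level to (number of fixations, mean duration in ms)."""
    durations = defaultdict(list)
    for level, duration_ms in fixations:
        durations[level].append(duration_ms)
    return {level: (len(d), mean(d)) for level, d in durations.items()}

# Hypothetical fixation records: (crowd level of the snippet, duration in ms).
fixations = [("sparse", 250.0), ("sparse", 270.0),
             ("dense congested", 220.0), ("dense congested", 230.0),
             ("dense congested", 240.0)]
summary = summarise_by_level(fixations)
```

Because the levels contain different numbers of videos, raw fixation counts are best read alongside the number of videos or frames in each level.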
Fixation location is another important aspect to consider when interpreting the results, as it reveals the areas of the screen where participants fixate. Figure 5 shows the fixation locations of all participants throughout the experiment for the sparse, dense free-flowing and dense congested categories, from left to right respectively. All the fixations lie close to the center and form one large, tightly packed cluster.
The fixation coordinates of all the participants were extracted and used to calculate the distance from the center. Figure 6 shows the results for the different crowd levels. A peak at the first frame can be seen, due to bottom-up saliency influences. Fixations in the sparse density level are closest to the center, as these scenes contain fewer things to look at and those tend to lie around the center of the screen. The fewer entities in the scene therefore hold attention for longer periods, consistently near the center of the screen. Fixations in the dense free-flowing and dense congested levels are more widely distributed than in sparse. Dense free-flowing shows the largest distance from the center of the three categories, both because it has the largest number of videos and because attention falls on the entities as well as on other salient regions of the scene. Dense congested shows a smaller distance than dense free-flowing but a larger one than sparse, because the scene is so congested that the viewer cannot focus on one thing and instead struggles to explore the screen before the scene changes.
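The distance-from-center measure can be sketched as below, assuming fixation coordinates in display pixels on the 1920 × 1080 monitor and grouping by frame index within a snippet. The coordinate system, the grouping and the sample points are assumptions of this sketch.

```python
import math

CENTER = (1920 / 2, 1080 / 2)  # centre of the 1920 x 1080 display, in pixels

def mean_distance_from_center(fixations_by_frame, center=CENTER):
    """For each frame index, average the distance of all fixations from the centre."""
    return {
        frame: sum(math.dist(p, center) for p in points) / len(points)
        for frame, points in fixations_by_frame.items()
    }

# Hypothetical fixations pooled over participants, keyed by frame index.
fixations_by_frame = {
    0: [(960.0, 540.0), (1260.0, 940.0)],  # second point lies 500 px from the centre
    1: [(960.0, 600.0), (900.0, 540.0)],   # both points lie 60 px from the centre
}
distances = mean_distance_from_center(fixations_by_frame)
```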
The spread of the recorded data samples was also evaluated to judge the closeness of agreement between the results. The standard deviation is the measure most commonly used for estimating variability. The fixation coordinates were again used to assess the dispersion of the data around the mean, which was then averaged across the participants for each density level. Figure 7 shows the results for the different crowd levels.
In the dispersion we again see a peak at the first frame. The spread remains consistent and around the center, showing center bias. Because the attention system relies on perceptual memory, some factors can persist continuously across frames. The small peaks represent the effects that occurred immediately after jump cuts.
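A minimal sketch of the dispersion measure follows. It assumes the population standard deviations of the x and y coordinates are combined into a single value per participant and then averaged across participants. The way the two axes are combined, the function names and the sample points are assumptions.

```python
from statistics import mean, pstdev

def dispersion(points):
    """Combine the standard deviations of x and y into one spread value."""
    xs, ys = zip(*points)
    return (pstdev(xs) ** 2 + pstdev(ys) ** 2) ** 0.5

def mean_dispersion(points_by_participant):
    """Average the per-participant dispersion across participants."""
    return mean(dispersion(p) for p in points_by_participant.values())

# Hypothetical fixation coordinates (pixels) for two participants on the same frames.
points_by_participant = {
    "p01": [(950.0, 540.0), (970.0, 540.0)],  # x spread 10, y spread 0
    "p02": [(960.0, 510.0), (960.0, 570.0)],  # x spread 0, y spread 30
}
spread = mean_dispersion(points_by_participant)
```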
3.3 Performance Evaluation over Existing Saliency Models
Deep learning models are trained by combining tasks such as feature extraction, integration and saliency value prediction in an end-to-end manner, and their performance is superior to that of classic saliency models. Keeping this in mind, we select two of the latest state-of-the-art dynamic deep learning models [borji2019saliency], chosen for their best performance on pre-existing dynamic saliency datasets: ACL (ResNet variant) [wang2018revisiting] and DeepVS [jiang2018deepvs]. The third model, SAM [cornia2018predicting], is one of the top-performing deep static saliency models.
We create a benchmark of these models over our dataset. The three models were tested on videos from each crowd category. We choose four of the most common saliency evaluation metrics, AUC-J, NSS, KLdiv and CC, to provide an easy comparison with other saliency benchmarks such as MIT@saliency [mit-saliency-benchmark] and the DHF1K video saliency leaderboard [cheng_2019]. We also provide a baseline from each model’s reported performance on its original dataset. We average our results as well, to compare them with the baseline results and evaluate the performance difference. Figure 8 shows an original image and its ground truth saliency map for the dense free-flowing crowd category. Table 5 shows the evaluation results of the DeepVS, ACL and SAM models over the different crowd density levels.
Based on the results, ACL performs best of the three models, over each of the three video categories individually and on average. However, the difference between these results and ACL’s original results is large enough to call for improvements in model parameter design and architecture, to bring saliency prediction in crowds up to par with general saliency prediction. For the other two models as well, the difference between the average results and the baseline shows that crowd videos need customized saliency prediction models to reach state-of-the-art performance.
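Two of the four metrics, NSS and CC, are simple enough to sketch in a few lines. The version below is an illustrative pure-Python implementation on tiny maps, not the benchmark code used for Table 5. It uses the population standard deviation for the z-score, which is an assumption, and it does not guard against constant maps.

```python
from statistics import mean, pstdev

def flatten(grid):
    """Turn a 2-D list into a flat list of values."""
    return [v for row in grid for v in row]

def nss(saliency, fixation_map):
    """Normalized Scanpath Saliency: mean z-scored saliency at fixated pixels."""
    s, f = flatten(saliency), flatten(fixation_map)
    mu, sd = mean(s), pstdev(s)
    return mean((v - mu) / sd for v, fixated in zip(s, f) if fixated)

def cc(saliency, ground_truth):
    """Pearson linear correlation coefficient between two saliency maps."""
    a, b = flatten(saliency), flatten(ground_truth)
    ma, mb = mean(a), mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

# Hypothetical 2 x 2 maps: one fixated pixel, ground truth equal to 2 x prediction.
predicted = [[0.1, 0.2], [0.3, 0.9]]
fixated = [[0, 0], [0, 1]]
ground_truth = [[0.2, 0.4], [0.6, 1.8]]
score_nss = nss(predicted, fixated)
score_cc = cc(predicted, ground_truth)
```

Higher NSS means the model assigns above-average saliency to fixated pixels, and CC reaches 1 when the predicted map is a positive linear rescaling of the ground truth, as in this example.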
4 Discussion and Conclusion
Crowd scenes provide a richer set of dynamics and stimuli. These can be used to test whether general saliency judgments and models hold true for crowd scenes as well, and to provide insights on how to bring about improvements.
In this work, we studied crowd characteristics and categorised crowds into different density levels. The fixation and dispersion analysis shows that attention does vary with the number of people in the crowd. As the crowd gets bigger, most of the time is spent viewing more objects in the scene rather than paying attention to any one particular object. As the number of entities decreases, salient features are more spontaneously noticeable in individual objects. As a future avenue, to bridge the gap between human performance and predicted saliency, it would be prudent to include more cognitive information about crowded stimuli in the computational models [feng2016fixation]. The importance of different features, particularly facial features, in the context of crowd videos is still an unexplored area. Evaluation metric results on general saliency datasets and on ours reflect quite a big gap in performance, so there is clear room for improving deep saliency models to work equally well, if not better, for crowds. We reiterate the need to investigate which features should be reinforced for crowd videos in the design of a model that better predicts crowd scene saliency.