Unsupervised Gaze Prediction in Egocentric Videos by Energy-based Surprise Modeling



Egocentric perception has grown rapidly with the advent of immersive computing devices. Human gaze prediction is an important problem in analyzing egocentric videos and has largely been tackled through either saliency-based modeling or highly supervised learning. In this work, we tackle the problem of jointly predicting human gaze points and temporal segmentation of egocentric videos, in an unsupervised manner without using any training data. We introduce an unsupervised computational model that draws inspiration from cognitive psychology models of human attention and event perception. We use Grenander's pattern theory formalism to represent spatial-temporal features and model surprise as a mechanism to predict gaze fixation points and temporally segment egocentric videos. Extensive evaluation on two publicly available datasets, GTEA and GTEA+, shows that the proposed model outperforms all unsupervised baselines and some supervised gaze prediction baselines. Finally, we show that the model can also temporally segment egocentric videos with a performance comparable to more complex, fully supervised deep learning baselines.


Sathyanarayanan N. Aakur, Arunkumar Bagavathi
Oklahoma State University, Stillwater, OK, USA
{saakurn,abagava}@okstate.edu

Keywords: Gaze, Egocentric Vision, Segmentation

1 Introduction

Figure 1: Overall Approach: The proposed approach consists of three essential stages: constructing the feature configuration, a local temporal configuration proposal and the final gaze prediction.

The emergence of wearable and immersive computing devices for virtual and augmented reality has enabled the acquisition of images and video from a first-person perspective. Given the recent advances in computer vision, egocentric analysis could be used to infer the surrounding scene, understand gestures, and enhance the quality of living through immersive, user-centric applications. At the core of such applications is the need to understand the user's actions and what they are attending to. More specifically, gaze prediction is an essential task in egocentric perception and refers to the process of predicting human fixation points in the scene with respect to the head-centered coordinate system. Beyond enabling more efficient video analysis, studies from psychology have shown that human gaze estimation captures human intent and enables collaboration [11]. While the tasks of activity recognition and segmentation have been explored in recent literature [16], we aim to tackle the task of unsupervised gaze prediction and temporal event segmentation in a unified framework, without the need for training data annotations.

To jointly predict gaze and perform temporal segmentation, we must first identify an underlying mechanism that connects the two tasks. Drawing inspiration from psychology [7, 8, 22], we identify that the notion of predictability and surprise is a common mechanism for both tasks. Defined broadly as the surprise-attention hypothesis, studies have found that any deviations from expectations have a strong effect on both attention processing and event perception in humans. More specifically, short-term spatial surprise, such as between two regions from subsequent frames of a video, has a high probability of human fixation and affects saccade patterns. Longer-term temporal surprise, such as between several frames across a video, leads to the perception of new events [22]. We leverage such findings and formulate a computational framework that jointly models both short-term (local) and long-term (global) surprise using Grenander’s pattern theory [5] representations.

In a significant departure from recent pattern theory formulations [1], our framework models bottom-up, feature-level spatial-temporal correlations in egocentric videos. The key intuition in our approach is the notion that human attention is highly sensitive to deviations from expectations. We represent the spatial-temporal structure of egocentric videos using Grenander's pattern theory formalism [5]. The spatial features are encoded in a local configuration, whose energy represents the expectation of the model with respect to the recent past. Configurations are aggregated across time to provide a local temporal proposal for capturing the spatial-temporal correlation of video features. An acceptor function is used to switch between saccade and fixation modes to select the final configuration proposal. Localizing the source of maximum surprise provides attention points, and monitoring global surprise allows for temporally segmenting videos.

The contributions of our approach are three-fold: (1) we introduce an unsupervised gaze prediction framework based on surprise modeling that enables gaze prediction without training and outperforms all unsupervised baselines and some supervised baselines; (2) we demonstrate that the pattern theory representation can be used to tackle different tasks such as unsupervised video event segmentation, achieving performance comparable to state-of-the-art deep learning approaches; and (3) we show that the pattern theory representation can be extended beyond semantic, symbolic reasoning mechanisms.

Related Work. Saliency-based models treat the gaze prediction problem by modeling the visual saliency of a scene and identifying areas that can attract the gaze of a person. At the core of traditional saliency prediction methods is the idea of feature integration [21], which is based on the combination of multiple levels of features. Since the seminal work of Itti et al. [14, 13], there have been many approaches to saliency prediction [3, 10, 17], including graph-based models [6], supervised CNN-based saliency [15], and video-based saliency [9, 17].

Supervised Gaze Prediction has been an increasingly popular way to tackle the problem of gaze prediction in egocentric videos. Li et al. [18] proposed a graphical model that combines egocentric cues such as camera motion, hand position, and hand motion, and modeled gaze prediction as a function of these latent variables. Zhang et al. [23] used a Generative Adversarial Network (GAN) to handle the problem of gaze anticipation. The GAN was used to generate future frames, and a 3D CNN temporal saliency model was used to predict gaze positions. Huang et al. [12] used recurrent neural networks to predict gaze positions by combining task-specific cues with bottom-up saliency.

2 Energy-based Surprise Modeling

In this section, we introduce our energy-based surprise modeling approach for gaze prediction, as illustrated in Figure 1 and described in Algorithm 1. We formulate our approach on Grenander's pattern theory formalism [5]. We first introduce the necessary background on the pattern theory representation and then present the proposed gaze prediction formulation.

Pattern Theory Representation. Following Grenander's notation [5], the basic building blocks of our representation are atomic components called generators, g_i. The collection of all possible generators in a given environment is termed the generator space, G_S. While various types of generators can exist, as explored in prior pattern theory approaches [1], we consider only one type, namely the feature generator. Feature generators are features extracted from videos and are used to estimate the gaze at each time step.

We model temporal and spatial associations among generators through bonds. Each generator has a fixed number of bonds, called the arity of the generator. These bonds are symbolic representations of the structural and contextual relationships shared between generators.

The energy of a bond quantifies the strength of the relationship expressed between two generators and is given by the function

e(β′, β″) = q · f(β′, β″)        (1)

where β′ and β″ represent the bonds from the generators g_i and g_j, respectively; f(·, ·) is the strength of the relationship expressed in the bond; and q is a constant used to scale the bond energies. In our framework, f(·, ·) is a distance metric and provides a measure of similarity between the features expressed through the respective generators.

Generators combine through their corresponding bonds to form complex structures called configurations. Each configuration has an underlying graph topology, specified by a connector graph σ ∈ Σ, where Σ is the set of all available connector graphs. σ is also called the connection type and defines the directed connections between the elements of the configuration. In our work, we restrict σ to a lattice configuration, as illustrated in Figure 1, with bonds extending spatially and aggregated temporally.

The energy E(c) of a configuration c is the sum of the bond energies (Equation 1) in the configuration:

E(c) = Σ_{(β′, β″) ∈ c} e(β′, β″)        (2)

A lower energy indicates that the generators are closely associated with each other; conversely, a higher energy indicates greater surprise faced by the framework.
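As an illustrative sketch, the bond energy (Equation 1) and configuration energy (Equation 2) can be computed as follows. The choice of cosine distance for f(·, ·), the default scaling constant q, and the 4-connected lattice neighborhood are assumptions for this example, not prescriptions from the paper:

```python
import numpy as np

def bond_energy(f_i, f_j, q=1.0):
    """Bond energy between two feature generators (Equation 1, sketch).

    f_i, f_j: feature vectors expressed by generators g_i and g_j.
    q: scaling constant for bond energies (a hypothetical default).
    Cosine distance plays the role of the similarity measure f(., .).
    """
    cos_sim = float(np.dot(f_i, f_j)) / (np.linalg.norm(f_i) * np.linalg.norm(f_j) + 1e-8)
    return q * (1.0 - cos_sim)  # low energy: closely associated generators

def configuration_energy(features, q=1.0):
    """Total energy of an n x n lattice configuration (Equation 2, sketch).

    features: (n, n, d) array; features[i, j] is the feature generator at
    grid cell (i, j). Bonds connect 4-connected spatial neighbours, and
    each bond is counted once (right and down).
    """
    n = features.shape[0]
    energy = 0.0
    for i in range(n):
        for j in range(n):
            if j + 1 < n:  # bond to the right neighbour
                energy += bond_energy(features[i, j], features[i, j + 1], q)
            if i + 1 < n:  # bond to the neighbour below
                energy += bond_energy(features[i, j], features[i + 1, j], q)
    return energy
```

A uniform configuration (all generators alike) yields near-zero energy, matching the intuition that low energy signals high predictability.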

Local Temporal Configuration Proposal. The first step in the framework is the construction of a lattice configuration called the feature configuration, c_t. The lattice configuration is a grid, with each point (generator) in the configuration representing a possible region of fixation. We construct the feature configuration at time t by dividing the input image into an n × n grid. Each generator is populated by extracting features from one of these grid cells. These features can be appearance-based, such as RGB values, motion features such as optical flow [2], or deep learning features [19]. We experiment with both appearance- and motion-based features and find that motion-based features (Section 3) provide distracting cues, especially with the significant camera motion associated with egocentric videos. Bonds are quantified across generators with spatial locality, i.e., all neighboring generators are connected through bonds.
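A minimal sketch of the feature-configuration construction, assuming mean-RGB appearance features and a hypothetical default grid size of n = 8:

```python
import numpy as np

def feature_configuration(frame, n=8):
    """Build the n x n feature configuration from a video frame (sketch).

    frame: (H, W, d) image array (e.g., d = 3 for RGB). Each grid cell's
    generator is populated with the mean feature of that region; mean
    colour and n = 8 are illustrative assumptions.
    Returns an (n, n, d) array of generator features.
    """
    H, W, d = frame.shape
    gh, gw = H // n, W // n
    feats = np.zeros((n, n, d))
    for i in range(n):
        for j in range(n):
            cell = frame[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            feats[i, j] = cell.reshape(-1, d).mean(axis=0)
    return feats
```

The same scaffolding accepts other per-cell descriptors (optical flow statistics, deep features) in place of the mean colour.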

A local proposal for time t is obtained through a temporal aggregation of the previous k configurations. The aggregation enforces temporal consistency in the prediction and selectively influences the current prediction. Bonds are established across time by computing bonds between generators that share a spatial locality. For example, a bond can be established between a generator in the feature configuration at time t and the corresponding generator in the configuration at time t − 1. The bonds are quantified using the energy function defined in Equation 1, where f(·, ·) is defined as cosine distance for appearance-based features and covariance for motion features.

The temporal consistency is enforced through an ordered weighted averaging aggregation of configurations across time. Formally, the temporal aggregation is a mapping function A that maps the configurations from the previous k time steps into a single configuration c′_t as a local proposal for time t. The function has an associated set of weights W = {w_1, …, w_k} lying in the unit interval and summing to one. A aggregates configurations across time as c′_t = ⊕_{i=1}^{k} w_i c_{t−i}, where ⊕ refers to pairwise aggregation of bonds across configurations. W is used to scale the influence of each configuration from the past on the current prediction. While W can be learned, we set it to follow an exponential decay function given by w_i = w_0 e^{−λi}, where λ is the decay factor and w_0 is the initial value. We set k to the number of frames per second, and fix λ and w_0 empirically.
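The aggregation step might be sketched as follows; the decay factor lam and initial weight w0 are assumed values, and the pairwise bond aggregation is approximated here by a weighted average of the configuration features:

```python
import numpy as np

def temporal_aggregate(past_configs, lam=0.5, w0=1.0):
    """Ordered weighted aggregation of configurations across time (sketch).

    past_configs: list of (n, n, d) feature configurations, oldest first.
    Weights follow the exponential decay w_i = w0 * exp(-lam * i), where
    i = 0 is the most recent configuration; lam and w0 are assumed values.
    Returns the local temporal proposal for the current time step.
    """
    T = len(past_configs)
    # the most recent configuration receives the largest weight
    w = np.array([w0 * np.exp(-lam * (T - 1 - t)) for t in range(T)])
    w /= w.sum()  # weights lie in the unit interval and sum to one
    return sum(wt * c for wt, c in zip(w, past_configs))
```

Because the weights are normalized, aggregating identical configurations reproduces them unchanged, which matches the intent of enforcing temporal consistency.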

Gaze Prediction by Localizing Surprise. Intuitively, each generator in the configuration corresponds to the predictability of a spatial segment of the image. Hence, the predicted gaze position corresponds to the grid cell with the most surprise, i.e., the generator with maximum energy. Algorithm 1 illustrates the process of finding the generator with maximum energy; C, I_t and G_S refer to the set of past configurations, the video frame at time t and the generator space, respectively.

Algorithm 1: Gaze Prediction using Pattern Theory
Input: past configurations C, video frame I_t, generator space G_S
1:  construct the feature configuration c_t from I_t
2:  c′_t ← A(C, c_t)                            // local temporal proposal
3:  if rand() < p_1 then
4:      c′_t ← configuration with center bias   // first acceptor function
5:  foreach generator g in c′_t do
6:      if D(g) · E(g) > E_max then             // distance-scaled energy
7:          g* ← g; E_max ← D(g) · E(g)
8:  end foreach
9:  if rand() < p_2 then
10:     g* ← a randomly chosen generator        // second acceptor function
11: return gaze point at the center of the grid cell of g*

In addition to naïve surprise modeling, we introduce additional criteria to model human gaze. First, we introduce some randomness into the inference process to model both fixation and saccade phases. We do so through two acceptor functions (Algorithm 1). The first function randomly, with probability p_1, rejects the local temporal proposal and selects a configuration with a strong center bias as the final prediction for time t. The second acceptor function randomly, with probability p_2, rejects the generator with maximum energy and chooses a different generator. Second, we scale the energy of each generator with a distance function D(·) that quantifies the distance between the previously predicted gaze point and the current generator. This scaling ensures that the model is able to fixate on a chosen target while allowing the saccade function to choose a different target if required. Once the generator g* is chosen, the final gaze point is computed as p_t = δ(g*) + (w_g/2, h_g/2), where δ(g*) is the offset of g*'s grid cell from the top-left corner of the image, and w_g and h_g are the grid cell width and height, respectively.
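Putting these pieces together, the surprise-localization step might look as follows. The acceptor probabilities p1 and p2 and the 1/(1 + distance) form of D(·) are illustrative assumptions, not the paper's exact choices:

```python
import numpy as np

def predict_gaze(energy, prev_cell, cell_w, cell_h, rng=None, p1=0.1, p2=0.05):
    """Pick the gaze point as the centre of the max-surprise grid cell (sketch).

    energy: (n, n) array of per-generator energies.
    prev_cell: (row, col) of the previously fixated cell, used by the
    distance scaling D(.).
    p1, p2: acceptor probabilities (hypothetical values) for the
    centre-bias and random-saccade acceptor functions.
    Returns the gaze point (x, y) in pixel coordinates.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    n = energy.shape[0]
    ii, jj = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    # distance scaling: favour generators near the previous fixation
    dist = np.hypot(ii - prev_cell[0], jj - prev_cell[1])
    scaled = energy / (1.0 + dist)
    if rng.random() < p1:            # acceptor 1: fall back to centre bias
        cell = (n // 2, n // 2)
    else:
        cell = np.unravel_index(np.argmax(scaled), scaled.shape)
        if rng.random() < p2:        # acceptor 2: random saccade
            cell = (int(rng.integers(n)), int(rng.integers(n)))
    # gaze point = centre of the chosen grid cell, offset from the top-left
    return (float((cell[1] + 0.5) * cell_w), float((cell[0] + 0.5) * cell_h))
```

Setting p1 = p2 = 0 makes the prediction deterministic, which is convenient for inspecting the pure surprise-localization behaviour.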

3 Experimental Evaluation

3.1 Data and Evaluation Setup

Data. We evaluate our approach on the GTEA [4] and GTEA+ [18] datasets. The two datasets consist of several video sequences of meal-preparation tasks performed by different subjects. The GTEA dataset contains short sequences of tasks performed by multiple subjects, with each sequence lasting a few minutes. The GTEA+ dataset contains longer sequences (about 10 to 15 minutes) of subjects performing meal-preparation activities. We use the official splits for both GTEA and GTEA+ as defined in prior works [4] and [18], respectively.

Evaluation Metrics. We use Average Angular Error (AAE) [20] as our primary evaluation metric, following prior efforts [13, 4, 18]. AAE is the angular distance between the predicted gaze location and the ground truth. The Area Under the Curve (AUC) metric, which is computed over saliency maps, is not directly applicable to our approach since our prediction is a gaze point rather than a saliency map.
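For reference, AAE can be computed from image-plane gaze points under a pinhole-camera assumption; the horizontal field of view used below is a placeholder value, not a parameter reported for these datasets:

```python
import numpy as np

def average_angular_error(pred, gt, img_w, img_h, fov_deg=60.0):
    """Average Angular Error between predicted and ground-truth gaze (sketch).

    pred, gt: lists of (x, y) gaze points in pixel coordinates.
    Assumes a pinhole camera with the given horizontal field of view
    (fov_deg is an assumed value) centred on the image.
    Returns the mean angular distance in degrees.
    """
    # focal length in pixels implied by the assumed horizontal FOV
    f = (img_w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)

    def ray(p):
        # unit viewing ray through pixel p in the head-centred frame
        v = np.array([p[0] - img_w / 2.0, p[1] - img_h / 2.0, f])
        return v / np.linalg.norm(v)

    angles = [np.degrees(np.arccos(np.clip(np.dot(ray(p), ray(g)), -1.0, 1.0)))
              for p, g in zip(pred, gt)]
    return float(np.mean(angles))
```

Identical predictions and ground truth yield an AAE of zero, and errors grow with the angular separation of the two viewing rays rather than their raw pixel distance.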

Baseline Approaches. We compare our approach with state-of-the-art unsupervised and supervised approaches. We consider unsupervised saliency models such as Graph-Based Visual Saliency (GBVS) [6], Attention-based Information Maximization (AIM) [3], OBDL [9] and AWS-D [17]. We also compare against supervised models (DFG [23], Yin [18], SALICON [15], and LDTAT [12]) that leverage annotations and the representation learning capabilities of deep networks.

Ablation. We perform ablations of our approach to test the effectiveness of each component. We evaluate the effect of different features by including optical flow [2] as input to the model. We remove the prior spatial constraint and term the model “Ours (Saccade)”, highlighting its tendency to saccade through the video without fixating.

3.2 Quantitative Evaluation

Supervision   Approach              GTEA   GTEA+
None          Ours                   9.2    12.3
              Ours (Optical Flow)   10.1    13.8
              Ours (Saccade)        12.5    14.6
              AIM [3]               14.2    15.0
              GBVS [6]              15.3    14.7
              OBDL [9]              15.6    19.9
              AWS [17]              17.5    14.8
              AWS-D [17]            18.2    16.0
              Itti's model [14]     18.4    19.9
              ImSig [10]            19.0    16.5
Full          LDTAT [12]             7.6     4.0
              Yin [18]               8.4     7.9
              DFG [23]              10.5     6.6
              SALICON [15]          16.5    15.6
Table 1: Evaluation (AAE, lower is better) on the GTEA and GTEA+ datasets. We outperform all unsupervised and some supervised baselines.

We present the results of our experimental evaluation in Table 1. We report the Average Angular Error (AAE) for both the GTEA and GTEA+ datasets and group the baseline approaches into two categories: no supervision and full supervision. Our approach outperforms all unsupervised methods on both datasets. It is interesting to note that our model significantly outperforms saliency-based methods, including the closely related graph-based visual saliency (GBVS) model. Also note that we significantly outperform SALICON [15], a fully supervised convolutional neural network trained on ground-truth saliency maps, on both datasets. We also outperform the GAN-based gaze prediction model DFG [23] on the GTEA dataset, where the amount of training data is limited. We also offer competitive performance to Yin et al. [18], who use auxiliary information in the form of visual cues such as hands, objects of interest, and faces. Qualitative examples are in the supplementary.1

Event Segmentation in Streaming Videos. To highlight our model's ability to understand video semantics, beyond its ability to predict gaze locations, we adapt the framework to perform event segmentation in streaming videos. Instead of localizing specific generators, we monitor the global surprise by considering the energy of the entire configuration c_t. The configuration with the highest energy within a temporal window holds the highest surprise and is selected as an event boundary. We evaluate the approach on the GTEA dataset, first aligning the prediction and ground truth using the Hungarian method and then evaluating with accuracy as the metric, following prior works [16]. Our unsupervised approach achieves temporal segmentation accuracy competitive with three state-of-the-art supervised deep learning baselines: Spatial-CNN, Bi-LSTM, and Temporal Convolutional Networks (TCN) [16]. This shows that modeling surprise using pattern theory can segment egocentric videos into constituent activities without training data.
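A sketch of the windowed boundary selection described above; the non-overlapping windows and their size are assumptions for this example:

```python
import numpy as np

def event_boundaries(config_energies, window=30):
    """Select event boundaries as windowed maxima of global surprise (sketch).

    config_energies: per-frame total configuration energies E(c_t).
    window: window size in frames (an assumed value); windows here are
    non-overlapping for simplicity.
    Returns the frame index of maximal energy within each window.
    """
    boundaries = []
    for start in range(0, len(config_energies), window):
        seg = config_energies[start:start + window]
        # the most surprising frame in the window marks an event boundary
        boundaries.append(start + int(np.argmax(seg)))
    return boundaries
```

Because it needs only the stream of configuration energies, this step reuses the gaze-prediction machinery without any additional training.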

4 Conclusion

We presented an unsupervised gaze prediction framework based on energy-based surprise modeling. The unsupervised approach helps break free of the ever-increasing demands on the quality and quantity of training data. We demonstrated how pattern theory can be used to model surprise and predict gaze locations in egocentric videos, and the same pattern theory representation forms the basis for unsupervised temporal video segmentation. Extensive experiments demonstrate the efficacy of the approach and its highly competitive performance.


  1. https://saakur.github.io/Projects/GazePrediction/


  1. S. Aakur, F. de Souza and S. Sarkar (2019) Generating open world descriptions of video using common sense knowledge in a pattern theory framework. Quarterly of Applied Mathematics 77 (2), pp. 323–356. Cited by: §1, §2.
  2. T. Brox and J. Malik (2010) Large displacement optical flow: descriptor matching in variational motion estimation. IEEE Transactions on Pattern Analysis and Machine Intelligence 33 (3), pp. 500–513. Cited by: §2, §3.1.
  3. N. Bruce and J. Tsotsos (2006) Saliency based on information maximization. In Advances in Neural Information Processing Systems, pp. 155–162. Cited by: §1, §3.1, Table 1.
  4. A. Fathi, Y. Li and J. M. Rehg (2012) Learning to recognize daily actions using gaze. In European Conference on Computer Vision, pp. 314–327. Cited by: §3.1, §3.1.
  5. U. Grenander (1996) Elements of pattern theory. JHU Press. Cited by: §1, §1, §2, §2.
  6. J. Harel, C. Koch and P. Perona (2007) Graph-based visual saliency. In Advances in Neural Information Processing Systems, pp. 545–552. Cited by: §1, §3.1, Table 1.
  7. G. Horstmann and A. Herwig (2015) Surprise attracts the eyes and binds the gaze. Psychonomic Bulletin & Review 22 (3), pp. 743–749. Cited by: §1.
  8. G. Horstmann and A. Herwig (2016) Novelty biases attention and gaze in a surprise trial. Attention, Perception, & Psychophysics 78 (1), pp. 69–77. Cited by: §1.
  9. S. Hossein Khatoonabadi, N. Vasconcelos, I. V. Bajic and Y. Shan (2015) How many bits does it take for a stimulus to be salient?. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5501–5510. Cited by: §1, §3.1, Table 1.
  10. X. Hou, J. Harel and C. Koch (2011) Image signature: highlighting sparse salient regions. IEEE Transactions on Pattern Analysis and Machine Intelligence 34 (1), pp. 194–201. Cited by: §1, Table 1.
  11. C. Huang, S. Andrist, A. Sauppé and B. Mutlu (2015) Using gaze patterns to predict task intent in collaboration. Frontiers in Psychology 6, pp. 1049. Cited by: §1.
  12. Y. Huang, M. Cai, Z. Li and Y. Sato (2018) Predicting gaze in egocentric video by learning task-dependent attention transition. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 754–769. Cited by: §1, §3.1, Table 1.
  13. L. Itti and P. F. Baldi (2006) Bayesian surprise attracts human attention. In Advances in Neural Information Processing Systems, pp. 547–554. Cited by: §1, §3.1.
  14. L. Itti and C. Koch (2000) A saliency-based search mechanism for overt and covert shifts of visual attention. Vision Research 40 (10-12), pp. 1489–1506. Cited by: §1, Table 1.
  15. M. Jiang, S. Huang, J. Duan and Q. Zhao (2015) Salicon: saliency in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1072–1080. Cited by: §1, §3.1, §3.2, Table 1.
  16. C. Lea, R. Vidal, A. Reiter and G. D. Hager (2016) Temporal convolutional networks: a unified approach to action segmentation. In European Conference on Computer Vision, pp. 47–54. Cited by: §1, §3.2.
  17. V. Leboran, A. Garcia-Diaz, X. R. Fdez-Vidal and X. M. Pardo (2016) Dynamic whitening saliency. IEEE Transactions on Pattern Analysis and Machine Intelligence 39 (5), pp. 893–907. Cited by: §1, §3.1, Table 1.
  18. Y. Li, A. Fathi and J. M. Rehg (2013) Learning to predict gaze in egocentric video. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3216–3223. Cited by: §1, §3.1, §3.1, §3.1, §3.2, Table 1.
  19. J. Redmon, S. Divvala, R. Girshick and A. Farhadi (2016) You only look once: unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 779–788. Cited by: §2.
  20. N. Riche, M. Duvinage, M. Mancas, B. Gosselin and T. Dutoit (2013) Saliency and human fixations: state-of-the-art and study of comparison metrics. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1153–1160. Cited by: §3.1.
  21. A. M. Treisman and G. Gelade (1980) A feature-integration theory of attention. Cognitive Psychology 12 (1), pp. 97–136. Cited by: §1.
  22. J. M. Zacks, B. Tversky and G. Iyer (2001) Perceiving, remembering, and communicating structure in events.. Journal of Experimental Psychology: General 130 (1), pp. 29. Cited by: §1.
  23. M. Zhang, K. Teck Ma, J. Hwee Lim, Q. Zhao and J. Feng (2017) Deep future gaze: gaze anticipation on egocentric videos using adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4372–4381. Cited by: §1, §3.1, §3.2, Table 1.