GazeMAE: General Representations of Eye Movements using a Micro-Macro Autoencoder


Eye movements are intricate and dynamic events that contain a wealth of information about the subject and the stimuli. We propose an abstract representation of eye movements that preserves the important nuances in gaze behavior while being stimuli-agnostic. We consider eye movements as raw position and velocity signals and train a separate deep temporal convolutional autoencoder on each. The autoencoders learn micro-scale and macro-scale representations that correspond to the fast and slow features of eye movements. We evaluate the joint representations with a linear classifier fitted on various classification tasks. Our work accurately discriminates between gender and age groups and outperforms previous works on biometrics and stimuli classification. Further experiments highlight the validity and generalizability of this method, bringing eye-tracking research closer to real-world applications.

I Introduction

Our eyes move in response to top-down and bottom-up factors, subconsciously influenced by a stimulus’s characteristics and our own goals [19]. Eye movements can be seen simply as a sequence of fixations and saccades: at some points we keep our eyes still to take in information, then rapidly move them to switch our point of focus. Thus, eye movements tell a lot about our perception, thought, and decision-making processes [15]. In addition, there exist less-pronounced eye movements even within a fixation, among them microsaccades, which have recently been found to have numerous links to attention, memory, and cognitive load [30, 34, 26, 23]. Overall, such findings encourage eye-tracking technology to be brought to various fields such as human-computer interaction, psychology, education, medicine, and security [9].

Bridging the gap between laboratory findings and real-world applications requires that eye movements are processed into representations or feature vectors that serve as inputs to algorithms. Common methods to do so include processing gaze into parameters [33] (e.g. fixation counts and durations), maps [28] (e.g. heat maps, saliency maps), scanpaths [1] (e.g. string sequences), and graphical models [5, 8] that consider image regions as nodes and saccades as edges.

However, these methods have two main drawbacks that inhibit them from optimally representing eye movements. First, they do not exploit the wealth of information present in eye movements. By discretizing movements into fixations and saccades, they flatten the dynamic nature of eye movements and lose the tiny but important nuances. Additionally, event detection is still an active research area and as such may be prone to inaccuracies and inconsistencies [2, 14]. Second, they are not generalizable due to the tight links of the methods to the stimuli, thereby limiting eye movement comparison to those elicited from the same image or stimuli. Scanpaths and graphs additionally have a dependence on pre-defined areas of interest (AoIs). This may be mitigated by learning AoIs in a data-driven manner, but this in turn introduces dependencies on the method and on the amount and quality of data available for each new stimulus.

Fig. 1: Raw eye movement position and velocity signals are used as input to autoencoders which learn micro-scale and macro-scale representations.

In this work, we use deep unsupervised learning to learn abstract representations of eye movements. This removes the need for extensive feature engineering, allowing us to bypass the event detection steps and learn from the full resolution of the data. We use only the position and velocity signals as input, making this method stimuli-agnostic. It can extract representations for any sample regardless of stimuli, enabling comparisons to be made. In particular, we use an autoencoder (AE) in which the encoder and decoder networks are temporal convolutional networks (TCN). Our AE architecture uses two bottlenecks, encoding information at a micro and macro scale. We train a model on position signals, and another on velocity signals. The models are evaluated on various classification tasks with a linear classifier. Characteristics such as identity, age, gender, and stimuli were predicted using AE representations. Additionally, we show that the AE can handle any input length (i.e. viewing time), generalize to an unseen data set with a lower sampling frequency, and perform comparably with a supervised version of the encoder network.

The contributions of this paper are as follows:

  1. We apply deep unsupervised learning to eye movement signals such that representations are learned without supervision or feature engineering.

  2. We learn representations for eye movements that are stimuli-agnostic.

  3. We propose a modified autoencoder with two bottlenecks that learn fast and slow features of the eye movement signal. This autoencoder also uses an interpolative decoder instead of a regular Temporal Convolutional Network or an autoregressive decoder.

  4. We show that the representations learned are meaningful. They are able to accurately classify labels, generalize to an unseen data set, and scale to long input lengths. Furthermore, similar data points exhibit clustering properties.

Note that this work is limited to eye movements gathered on static and visual stimuli, recorded with research-grade eye-trackers. Eye movements on texts, videos, or “in the wild” are beyond our scope. Source code and models are available at

II Methodology

II-A Preliminaries

Representation Learning

The goal of representation learning, also called feature learning, is to abstract information from data such that the underlying factors of variation in the data are captured [4]. This involves mapping an input to an embedding space which meaningfully describes the original data. A common use case for learning representations is to act as a preprocessing step for downstream tasks in which the representation, often notated as $z$ of a data point $x$, will be used as the input for classifiers and predictors. Representation learning methods are commonly unsupervised, i.e. no external labels about the data are required. Therefore, these can take advantage of any available data to learn more robust features.


An autoencoder (AE) is a neural network that learns a representation of input data by attempting to reconstruct a close approximation of the input. A typical AE is undercomplete, i.e. it uses a bottleneck to compress the input to a lower-dimensional space before producing an output with the same dimensions as the input.

Generally, an AE works as follows: an encoder $f$ maps the original input $x \in \mathbb{R}^{n}$ to a latent vector $z = f(x) \in \mathbb{R}^{m}$, and a decoder $g$ maps $z$ to an output $\hat{x} = g(z)$. It is trained to reconstruct $x$, i.e. $\hat{x} \approx x$.

Since $m < n$, the encoder is forced to learn only the relevant information such that the decoder is able to sufficiently reconstruct the original input. This is a simple framework to learn a representation of the data, and is commonly thought of as a non-linear version of Principal Component Analysis (PCA) [4]. Because an AE uses the input data as its target output, it is a self-supervised method for representation learning.

Temporal Convolutional Network

The temporal convolutional network (TCN) [3] is a generic convolutional neural network (CNN) architecture that has been shown to outperform Recurrent Neural Networks (RNNs) on a range of sequence modeling tasks. TCNs work in the same manner as CNNs, where each convolutional layer convolves a number of 1-dimensional kernels ($N$ filters of size $k$) across the input data to recognize sequence patterns [10]. A TCN modifies the convolution operation as follows:

  1. Dilated Convolutions, where the kernel skips values. For a learnable kernel $w$ with kernel size $k$ and dilation $d$, the output $y_t$ at a subsequence of size $k$ in an input $x$ is calculated with the following:

    $$y_t = \sum_{i=0}^{k-1} w_i \, x_{t - d \cdot i}$$

    Dilations are commonly increased exponentially across layers, e.g. $d = 2^{l}$ for layer $l$. This enables the output in layer $l$ to be calculated with a higher receptive field, i.e. from a wider input range.

  2. Causal Convolutions, where the output at time $t$ is calculated using only the values from time step $t$ and the previous time steps. This is done by padding zeroes on the left of the input. In effect, this emulates the sequential processing of RNNs.
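The two modifications can be illustrated together in a few lines. The following is a minimal pure-Python sketch, assuming the common formulation in which the output at step t is the weighted sum of x[t], x[t - d], x[t - 2d], and so on, with indices before the start of the signal treated as zeros (left zero-padding). The function name `dilated_causal_conv1d` and the toy inputs are illustrative and are not taken from the authors' code.

```python
def dilated_causal_conv1d(x, w, d=1):
    """Compute y[t] = sum_i w[i] * x[t - d*i], treating x[j] as 0 for j < 0."""
    k = len(w)
    y = []
    for t in range(len(x)):
        acc = 0.0
        for i in range(k):
            j = t - d * i
            if j >= 0:  # indices before the signal start act as left zero-padding
                acc += w[i] * x[j]
        y.append(acc)
    return y


signal = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
kernel = [1.0, 1.0, 1.0]

out_d1 = dilated_causal_conv1d(signal, kernel, d=1)  # each output sees 3 adjacent steps
out_d2 = dilated_causal_conv1d(signal, kernel, d=2)  # same kernel, spread over 5 steps
```

With d = 2 the same three-tap kernel covers a span of five input steps, which is how stacking dilated layers widens the receptive field without adding parameters.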

II-B Data Sets


The Eye Movements Verification and Identification Competition (EMVIC) 2014 [21] is a data set used as a benchmark for Biometrics, where subjects are to be identified based only on their eye movements. They collected data from 34 subjects who were shown a number of normalized face images (the eyes, nose, and mouth are in roughly the same position in the images). The viewing times spent by the subjects to look at the face images range from 891 ms to 22012 ms, and the average is 2429 ms or roughly 2.5 seconds. Eye movements were recorded using a Jazz-Novo eye tracker with a 1000 Hz sampling frequency, i.e. it records 1000 gaze points per second. 1,430 eye movement samples were collected, where the training set consists of 837 samples from 34 subjects and the test set consists of 593 from 22 subjects.


The Fixations in Faces (FIFA) data set [7] contains eye movements of 8 subjects viewing 250 images of indoor and outdoor scenes. Eye movements were recorded using an SR Research EyeLink 1000 eye-tracker with a 1000 Hz sampling frequency. The images were of 1024x768 resolution and were displayed on a screen 80 cm from the subject. This corresponds to a visual angle of 28° x 21°. We obtain 3,200 samples from this data set.


The Eye Tracking Research & Applications (ETRA) data set was used to analyze saccades and microsaccades in [30, 27] and was also used for a data mining challenge in ETRA 2019. Eight subjects participated and viewed 4 image types: blank images, natural scenes, picture puzzles, and “Where’s Waldo?” images. For the blank and natural image types, the subjects were free to view the image in any manner. Picture puzzles contain two almost-identical images, and the subjects had to spot the differences between the two. “Where’s Waldo?” images are complex scenes filled with small objects and characters, and the subjects had to find the character Waldo. Each viewing was recorded for 45 seconds.

Eye movements were recorded using an SR Research EyeLink II eye-tracker at a 500 Hz sampling frequency. The stimuli were presented such that they are within 36° x 25.2° of the subjects’ visual angle. 480 eye movement samples were obtained from this data set.

Data set  Hz    Stimuli          Tasks         Subj.  Samples  Time (s)
EMVIC     1000  face             free          34     1430     ~2.5 (ave.)
FIFA      1000  natural          free,         8      3200     2
ETRA      500   natural, puzzle  free, search  8      480      45
Total                                          50     5110
TABLE I: Summary of data sets.

II-C Data Preprocessing and Augmentation

To recap, we combine three data sets into a joint data set $D$. Each sample $s \in D$ is a vector with 2 channels (x and y) and a variable length $t$. To work across multiple data sets, we preprocess each $s$ as follows:

  • We turn blinks (negative values) to zero since not all data sets have blink data.

  • We standardize to a sampling frequency of 500 Hz. The EMVIC and FIFA data sets are downsampled from 1000 Hz to 500 Hz by dropping every other gaze point.

  • We modify the coordinates such that the origin (0, 0) is at the top-left corner of the screen. This is to ensure that the network processes eye movements in the same scale.

  • We scale the coordinates such that 1° of a subject’s visual angle corresponds to roughly 35 pixels (1 dva ≈ 35 px). For the FIFA and ETRA data sets, the scale is estimated from their given eye-tracker and experiment specifications. For EMVIC, we leave the coordinates unprocessed due to lack of details. This is done so that all movements are expressed at the same visual resolution of the subjects.
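The preprocessing steps listed above can be sketched as follows. This is a hedged illustration rather than the authors' pipeline: the function name `preprocess`, the `px_per_dva` argument, and the choice to zero each negative coordinate independently are assumptions. The origin shift is omitted because it depends on each data set's original coordinate convention, which the text does not specify.

```python
def preprocess(points, src_hz, px_per_dva=None, target_hz=500, target_px_per_dva=35.0):
    """points: list of (x, y) gaze coordinates in pixels, sampled at src_hz."""
    # Blinks are assumed to be encoded as negative coordinates; set them to zero.
    cleaned = [(max(x, 0.0), max(y, 0.0)) for x, y in points]

    # Downsample by keeping every `step`-th point (1000 Hz -> 500 Hz keeps every other point).
    if src_hz % target_hz != 0:
        raise ValueError("only integer downsampling factors are handled in this sketch")
    step = src_hz // target_hz
    downsampled = cleaned[::step]

    # Without a known pixels-per-degree value (as for EMVIC), leave coordinates unscaled.
    if px_per_dva is None:
        return downsampled

    # Rescale so that 1 degree of visual angle spans about `target_px_per_dva` pixels.
    scale = target_px_per_dva / px_per_dva
    return [(x * scale, y * scale) for x, y in downsampled]


raw = [(100.0, 50.0), (101.0, 51.0), (-1.0, -1.0), (103.0, 53.0)]
scaled = preprocess(raw, src_hz=1000, px_per_dva=70.0)  # downsampled and rescaled
unscaled = preprocess(raw, src_hz=500)                   # only blinks are zeroed
```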

The inputs to the AEs are standardized into 2-second samples $x \in \mathbb{R}^{2 \times T}$, where $T$ = 1000 = 500 Hz × 2 s. We increase our data set by taking advantage of the ambiguity of eye movements. For all 5,110 trials in the data sets, we take 2-second time windows that slide forward in time by 20% or 0.4s, which is equivalent to 200 gaze points. Each window counts as a new sample. With this, the training set size is increased to 68,178 samples.

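The windowing scheme can be sketched as below, using the stated window of 1000 gaze points (2 s at 500 Hz) and step of 200 gaze points (0.4 s). The function name `sliding_windows` is illustrative, and the handling of trials shorter than one window is an assumption, since the text does not describe it; here such trials simply yield no window.

```python
def sliding_windows(n_points, window=1000, step=200):
    """Return (start, end) index pairs of all full windows in a trial of n_points."""
    return [(s, s + window) for s in range(0, n_points - window + 1, step)]


# A 45-second ETRA trial at 500 Hz has 22,500 gaze points.
etra_windows = sliding_windows(45 * 500)
short_trial = sliding_windows(1200)
```

Under these assumptions a single 45-second trial contributes 108 overlapping samples, which shows how a few thousand trials can expand to tens of thousands of training samples.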

II-D Velocity Signals

Fig. 2: Top: a 2-second position signal at 500 Hz. Bottom: its corresponding velocity signal.

In addition to the raw eye movement data given as a sequence of positions across time (position signals), we also take the derivative, or the rate at which positions change over time (velocity signals), calculated with a simple finite difference. We separately train a position autoencoder (AE$_p$) and a velocity autoencoder (AE$_v$), as they are expected to learn different features. While position signals exhibit spatial information and visual saliency, velocity signals can reveal more behavioral information from which a subject’s thought process may be inferred. Velocity is also commonly used as a threshold for eye movement segmentation [2]. Figure 2 shows an example of a position signal and its corresponding velocity signal. Position signals are further preprocessed by clipping the coordinates to the maximum screen resolution: 1280x1024. For both signals, neither scaling nor mean normalization is done. Based on our experiments, we found that this was especially important for velocity signals.
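As a rough illustration of the two signal types, the sketch below derives a velocity signal as the first-order difference between consecutive gaze points and clips a position signal to the screen. Both function names are hypothetical. The exact differencing scheme and the lower clipping bound of 0 are assumptions, since the text only states that velocity is the derivative of position and that positions are clipped to the maximum screen resolution.

```python
def velocity(positions):
    """First-order difference of consecutive (x, y) positions, in pixels per sample."""
    return [(x1 - x0, y1 - y0) for (x0, y0), (x1, y1) in zip(positions, positions[1:])]


def clip_to_screen(positions, width=1280, height=1024):
    """Clip coordinates to the range [0, width] x [0, height]."""
    return [(min(max(x, 0), width), min(max(y, 0), height)) for x, y in positions]


gaze = [(0, 0), (3, 4), (3, 1)]
vel = velocity(gaze)                       # one element shorter than the input
clipped = clip_to_screen([(1500, -5), (640, 512)])
```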

II-E Network Architecture

In this subsection, we first describe the TCN architecture of both the encoder and decoder. Next, we describe how the micro and macro representations are learned in the bottlenecks. Lastly, we describe an interpolative decoder that fills in a destroyed signal to reconstruct or recover the original. The overall architecture of the autoencoder is visualized in Figure 3, and a summary of its main components is shown in Table II. The number of filters and layers were chosen empirically.

Component               position AE (AE$_p$)      velocity AE (AE$_v$)
Encoder TCN             128 filters x 8 layers    256 filters x 8 layers
Micro-scale Bottleneck  64-dim FC                 64-dim FC
Macro-scale Bottleneck  64-dim FC                 64-dim FC
Decoder TCN             128 filters x 4 layers;   128 filters x 8 layers
                        64 filters x 4 layers
Total Parameters        652,228                   1,964,676
TABLE II: Autoencoder specifications
Fig. 3: Architecture of the Micro-Macro Autoencoder, with each convolutional layer having a specified dilation.

Convolutional Layers

The encoder and decoder of the AE are implemented as TCNs. However, the encoder is non-causal in order to take in as much information as possible. The decoder remains causal, as this forces the encoder to learn temporal dependencies.

Convolutions have a fixed kernel size of 3 and stride 1. Zero-padding is used to maintain the same temporal dimension across all layers. All convolutions are followed by a Rectified Linear Unit (ReLU) activation function and Batch Normalization [16]. Both the encoder and decoder networks have 8 convolutional layers. These are split into 4 residual blocks [13] with 2 convolutional layers each. The layers have exponentially-increasing dilations starting at the second layer (1, 1, 2, 4, 8, 16, 32, 64), resulting in the following receptive fields: (3, 5, 9, 17, 33, 65, 129, 257). Figure 4 visualizes the growth of the receptive field across layers.
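The receptive fields quoted above follow from the kernel size and the dilations: each layer with kernel size k and dilation d extends the receptive field by (k - 1) * d. A short sketch (the function name `receptive_fields` is illustrative) reproduces the listed values:

```python
def receptive_fields(dilations, kernel_size=3):
    """Cumulative receptive field after each stacked stride-1 convolution."""
    rf = 1
    fields = []
    for d in dilations:
        rf += (kernel_size - 1) * d  # each layer adds (k - 1) * d time steps
        fields.append(rf)
    return fields


fields = receptive_fields([1, 1, 2, 4, 8, 16, 32, 64])
```

At 500 Hz, the fourth-layer field of 17 time steps spans about 34 ms, while the eighth-layer field of 257 time steps spans about half a second.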

Fig. 4: Convolutional layers of the AE$_v$ encoder. The height corresponds to the effective receptive field of each convolution operation. The width corresponds to the number of filters. The outputs at the fourth and eighth layers are used to calculate the micro-scale and macro-scale representations, respectively. Heights roughly to scale.


Our AEs have two bottlenecks, each encoding information at different scales. The first takes in the output of the fourth convolutional layer, while the second takes in that of the eighth convolutional layer. Recall that the individual values from these layers were calculated with receptive fields of 17 and 257. Therefore, the first bottleneck can be thought of as encoding micro-scale information, or the fine-grained and fast-changing eye movement patterns. The second encodes macro-scale information, or the flow and slow-changing patterns. This is partly inspired by [17].

Specifically, the representations at these bottlenecks are learned as follows: first, the convolutional layer outputs are downsampled with a Global Average Pooling (GAP) layer that compresses the time dimension (GAP: $\mathbb{R}^{C \times T} \rightarrow \mathbb{R}^{C}$, where $C$ is the number of convolution filters and $T$ is the number of time steps). Then, a fully-connected (FC) layer transforms these downsampled values into the micro-scale representation $z_1$ and the macro-scale representation $z_2$. The two representations are independent, i.e. there is no forward connection from $z_1$ to $z_2$. From initial experiments, this resulted in better performance. Each representation is a feature vector of size 64.
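A bottleneck of this kind can be sketched in plain Python: average each filter's activations over time, then apply a fully-connected layer. The function names and the tiny toy weights are illustrative only. In the paper's setting the feature map would have 128 or 256 channels and the FC layer 64 outputs.

```python
def global_average_pool(feature_map):
    """feature_map: list of C channels, each a list of T activations -> list of C means."""
    return [sum(channel) / len(channel) for channel in feature_map]


def fully_connected(vec, weights, bias):
    """weights: one row of len(vec) coefficients per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


short_map = [[1.0, 2.0, 3.0], [4.0, 4.0, 4.0]]                      # C = 2, T = 3
long_map = [[1.0, 2.0, 3.0, 2.0, 2.0], [4.0, 4.0, 4.0, 4.0, 4.0]]   # C = 2, T = 5

pooled_short = global_average_pool(short_map)
pooled_long = global_average_pool(long_map)

toy_weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical 2 -> 3 FC layer
z = fully_connected(pooled_short, toy_weights, [0.0, 0.0, 1.0])
```

Because pooling removes the time dimension, the representation has the same size for any input length, which is what later allows full-length samples to be encoded.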

Interpolative Decoder

The decoder used in this work is a modification from the vanilla AE architecture. In this model, the original signal is first destroyed by randomly dropping values and then input to the decoder. The task of the decoder remains the same: to output a reconstruction, but it can also now be described as filling in the missing values. Thus, we call it an interpolative decoder.

Intuitively, inputting a destroyed version of the original signal to the decoder may free up the encoder to capture more of the nuances in the data, instead of having to also encode the scale and trend of the signal. The micro-scale and macro-scale representations act as supplemental information and are used to condition the decoder such that it accurately outputs a reconstruction. The macro-scale representation is used as an additive bias to the first decoder layer, providing information about the general trend of the signal. The micro-scale representation is used as an additive bias to the fifth decoder layer, providing more specific information and filling in smaller patterns and sequences.

However, reconstructing the input may become a trivial task since too much information is already available to the decoder. In practice, we found that this can be mitigated with a high dropout probability. AE$_p$ uses a higher dropout probability than AE$_v$: because position signals are much less erratic, more values had to be dropped to keep the decoder from relying on the destroyed input. We use this decoder design as an alternative to the more commonly used autoregressive decoders, which output one value at a time. We found that the performance was on par while requiring less training time.
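The destruction step can be sketched as an element-wise random drop. This is only an illustration under assumptions: dropped values are replaced by zeros, each time step is dropped independently, and the probability of 0.7 used below is a placeholder rather than a value from the paper. The function name `destroy` is hypothetical.

```python
import random


def destroy(signal, p_drop, rng):
    """Replace each value by 0.0 independently with probability p_drop."""
    return [0.0 if rng.random() < p_drop else v for v in signal]


rng = random.Random(0)            # fixed seed so the example is repeatable
original = [1.0] * 1000           # stand-in for one channel of a 2-second sample
destroyed = destroy(original, p_drop=0.7, rng=rng)  # 0.7 is an illustrative value
kept = sum(1 for v in destroyed if v != 0.0)
```

The decoder would receive `destroyed` together with the encoder's representations and be trained to output `original`.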

II-F Optimization

To summarize, this work trains a position autoencoder (AE$_p$) and a velocity autoencoder (AE$_v$) to learn representations $z_p$ and $z_v$, respectively. Both are concatenations of representations at a micro-scale $z_1$ and a macro-scale $z_2$, i.e. $z = [z_1, z_2]$. The training data consists of three data sets combined into a single data set $D$. Each sample from $D$ is preprocessed into an input vector $x$. For each $x$, an AE is trained to output a reconstruction $\hat{x}$. The loss function is simply the sum of squared errors (SSE), computed as follows:

$$\mathcal{L}(x, \hat{x}) = \sum_{i} (x_i - \hat{x}_i)^2$$

The AEs are trained using the Adam optimizer [22], with a fixed learning rate of 5e-4. The total number of training samples is 68,178. The batch size for AE$_p$ and AE$_v$ is 256 and 128, respectively. The networks are implemented using PyTorch 1.3.1 [31], and trained on an NVIDIA GTX 1070 with 8GB of VRAM. Random seeds were kept consistent throughout experiments. AE$_p$ was trained for 14 epochs (1 epoch ≈ 13 mins.) and AE$_v$ was trained for 25 epochs (1 epoch ≈ 38 mins.).

II-G Evaluation

For evaluation, we input the full-length samples and use the AEs to extract representations to be used as input for classification tasks. We evaluate three types of representation: $z_p$ from AE$_p$, $z_v$ from AE$_v$, and their concatenation $z_{pv}$. The classification tasks are the following:

Classification Task  Data Set  Classes  Samples
Biometrics           EMVIC     34       837
Biometrics           all       50       5110
Stimuli (4)          ETRA      4        480
Stimuli (3)          ETRA      3        360
Age Group            FIFA      2        3200
Gender               FIFA      2        3200
TABLE III: Classification tasks used for evaluating the representations learned by the autoencoder.

  1. Biometrics on EMVIC data set. We use the official training and test set, reporting accuracies for both. Our results will be compared to the work in [29]. For a fair comparison, we mimic their setup by reporting a 4-fold Cross-Validation (CV) accuracy on the training set, and another on the test set after fitting on the whole training set.

  2. Biometrics on all data sets. We combine the three data sets and classify a total of 50 subjects, each with a varying number of samples. In contrast to Biometrics on EMVIC data set, this task is now performed on eye movements from different experiment designs (e.g. eye tracker setup, stimuli used). Therefore, this is a more difficult task and is better suited to evaluate the validity and generalizability of the method.

  3. Stimuli Classification on ETRA data set. We use the 4 image types (blank, natural, puzzle, waldo) as labels, where each type has 120 samples. This task, referred to as Stimuli (4), was also done in [11, 24]. Unfortunately, the composition of the data that we use has variations that prevent us from fairly comparing our work to theirs. Instead, we compare with another work [12], which did the same task but using only 3 labels (natural, puzzle, waldo) with 115 samples each. We use all 120 available samples, but since this is a minor variation from their setup, we still compare our accuracy with theirs. This task, Stimuli (3), is done in a leave-one-out CV (LOOCV) setup to be as similar as possible to theirs.

  4. Age Group Classification on FIFA data set. FIFA provides the subjects’ ages which range from 18-27. They are split into two groups: 18-22, and 22-27, yielding 1,600 samples per group. A number of previous works have done a similar task, but because they used different data sets, we are not able to fairly compare with their results.

  5. Gender Classification on FIFA data set. FIFA was collected from 6 males and 2 females, and we use their gender as labels for their eye movements. The resulting samples are unbalanced, with 2,400 samples for male subjects, and only 800 for females. However, no resampling technique is applied. As with age group classification, there is no previous work with which we can fairly compare.

To serve as a soft benchmark for tasks without a similar work, we also apply PCA on the position and velocity signals, each with 128 components (PCA$_{pv}$). The classifier used for all tasks is a Support Vector Machine (SVM) with a linear kernel. Grid search is conducted on the regularization parameter $C$. For all tasks, the accuracy will be reported. Multi-class classification is conducted using a One-vs-Rest (OVR) technique. Unless otherwise stated, all experiments will be conducted in a 5-fold CV setup. PCA, SVM, and CV are implemented using scikit-learn [32].

III Results and Discussions

This section details the classification results and three additional experiments to gauge the representations. For simplicity, we omit the reconstruction errors, as those are not of primary concern when evaluating representations.

III-A Classification Tasks


Classification Task       PCA$_{pv}$  $z_p$  $z_v$  $z_{pv}$  others
Biometrics (EMVIC-Train)  18.4        31.8   86.8   84.4      86.0 [29]
Biometrics (EMVIC-Test)   19.7        31.1   87.8   87.8      81.5 [29]
Biometrics (All)          24.6        29.0   79.8   78.4      -
Stimuli (4)               38.8        81.3   85.4   87.5      -
Stimuli (3)               55.8        90.3   87.2   93.9      88.0** [12]
Age Group                 62.0        61.9   77.7   77.3      -
Gender                    51.12       54.9   85.8   86.3      -
TABLE IV: Accuracies for various classification tasks. Underlined numbers are highest among AE models; bold numbers are highest among different works.
* these were mentioned in [29] but no citation was found.
** their classification used 115 samples per label, ours used 120.

The results of the classification tasks, along with chance accuracies and other works, are summarized in Table IV. First, it is clear that the velocity representations $z_v$ carry more discriminative information than the position representations $z_p$, as $z_v$ performs well on its own and the two can be supplementary, as in the case of stimuli and gender classification. The performance of $z_p$ only came close to that of $z_v$ in the stimuli classification task, which is expected since spatial information is explicitly linked to the stimuli. Next, the AE outperformed the work in [29] on the Biometrics task on the EMVIC data set. They used a statistical method to extract spatial, temporal, and static shape features, on which they fitted a logistic regression classifier. They additionally mentioned two works which achieved higher test accuracies (82.3% and 86.4%) than theirs, but those were uncited and no document describing those works has been found as of writing. Nevertheless, AE$_v$ also outperforms those two works.

On stimuli classification on the ETRA data set, our work outperformed [12]. Recall, however, that the comparison is not entirely balanced due to the different number of samples. The four other tasks have no other works to directly compare with, but we found the performance more than satisfactory. $z_v$ and $z_{pv}$ performed well on the Biometrics task on all data sets despite the fact that the eye movements were gathered from a diverse set of images. This may indicate that the speed and behavior of eye movements are sufficient identity markers, and that future eye movement-based biometric systems need not meticulously curate the stimuli used for interfaces. For age group and gender classification, note that the task is performed with only 8 subjects. In terms of the viability of eye movements for classifying a person’s demographic, these results are inconclusive. Nevertheless, the accuracies are well above chance and PCA feature extraction, encouraging further experimentation in this area.

Feature Analysis

To further inspect the importance of the representations, we take the linear SVM fitted on $z_{pv}$ (256 dimensions in total), and inspect the top 20% of features by weight. Though the linear SVM may suffer from fitting on a large number of dimensions, this presents an estimate of how useful the feature types are for various tasks.

Fig. 5: Count of features per type among the top 20% weights of the trained linear SVM.

Figure 5 shows the result. Velocity representations dominate the top features. Both the micro and macro scales of the velocity signal are useful, though the micro-scale takes a slightly larger share of the top features. Position representations are much less important, even on the stimuli classification task. Thus, a velocity autoencoder may be a less complicated but sufficient method for representing eye movements. However, this may still be explored with other classification tasks.
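The counting procedure behind this kind of analysis can be sketched as follows. Several details are assumptions made for illustration: features are ranked by the absolute value of a single weight vector, and the feature vector is laid out as four contiguous groups (position micro, position macro, velocity micro, velocity macro). The function name `top_feature_counts` and the toy weights are hypothetical.

```python
def top_feature_counts(weights, groups, fraction=0.2):
    """Count how many of the top `fraction` features (by |weight|) fall in each group."""
    n_top = int(len(weights) * fraction)
    ranked = sorted(range(len(weights)), key=lambda i: abs(weights[i]), reverse=True)
    counts = {name: 0 for name, _ in groups}
    for i in ranked[:n_top]:
        for name, (start, end) in groups:
            if start <= i < end:
                counts[name] += 1
    return counts


# Toy example: 8 weights in 4 groups of 2, keeping the top 25% (2 features).
toy_weights = [0.1, -0.2, 0.05, 0.0, -0.9, 0.3, 0.8, 0.1]
toy_groups = [
    ("position micro", (0, 2)),
    ("position macro", (2, 4)),
    ("velocity micro", (4, 6)),
    ("velocity macro", (6, 8)),
]
counts = top_feature_counts(toy_weights, toy_groups, fraction=0.25)
```

For a 256-dimensional vector, a fraction of 0.2 selects 51 features under this rounding.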

Next, we explore the representations by visualizing the embedding space. We apply t-SNE [25], a dimensionality reduction algorithm that preserves the local neighborhood structure of the points, on $z_p$, $z_v$, and $z_{pv}$, as shown in Figure 6. Consistent with the accuracies in Table IV, the representations are able to discriminate stimuli types. The visualization of $z_p$ on Biometrics shows almost no clustering, while $z_v$ exhibits some. We also plot all samples and label them according to their data sets. Clear clustering by data set can be observed, and it is made clearer when the position and velocity representations are combined, showing that these two representations can indeed be supplementary.

Fig. 6: t-SNE visualizations of learned representations. Top: ETRA samples, labels are stimuli types. Middle: EMVIC samples, labels are 10 subjects with most number of samples. Bottom: all samples, labels are the data sets.

III-B Additional Experiments


To test if the AE is generalizable and did not overfit, we use AE$_v$ to extract representations for unseen samples. The Biometrics task is performed using the data set provided in [18], herein termed MIT-LowRes. This data set contains eye movement signals from 64 subjects looking at 168 images of varying low resolutions. Only the samples obtained from viewing the highest-resolution images will be used for this experiment. This corresponds to 21 samples for each of the 64 subjects, amounting to 1,344 total samples. The data was recorded at 240 Hz. To be used for the AE$_v$ model, the signals are upsampled to 500 Hz using cubic interpolation.

We also train two more AE models. One is trained using the three original data sets but at a 250 Hz sampling frequency (AE$_v$-250), and the second is trained exclusively on MIT-LowRes at 250 Hz (AE$_v$-MLR). These models use the same architecture and specifications as AE$_v$, and we only modify the dilations so that the receptive field is approximately halved.

Classification Task      AE$_v$  AE$_v$-250  AE$_v$-MLR
Biometrics (MIT-LowRes)  23.7    21.5        18.38
TABLE V: Accuracies for a Biometrics task on MIT-LowRes, an unseen data set. For comparison, AE$_v$-MLR is a model trained exclusively on MIT-LowRes.

From Table V, we see that AE$_v$ achieved the highest accuracy of the three models. It outperformed AE$_v$-250, showing that there is indeed more meaningful information at a higher sampling frequency. However, even AE$_v$-250 outperformed AE$_v$-MLR. This shows that the AEs benefited from training on more data, and can indeed generalize to unseen samples, even if they are from another data set. Furthermore, this also shows that signals at 240 Hz can be upsampled to 500 Hz through simple cubic interpolation in order to benefit from 500 Hz models.

Input Length / Viewing Time

The use of a GAP layer enables the autoencoder to take in inputs of any length. Recall that we train the AEs on only 2-second samples, and we evaluated them with the full-length samples. In this experiment, we explicitly test for the effect of the input length or viewing time on the representations. We do this by using 1s, 2s, averaged representations of disjoint 2-second segments (2s*), and full-length inputs to AE$_v$. From Table VI, it can be seen that the AE can scale well even up to 45s without loss of performance, making it usable on eye movement samples of any length.
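The 2s* setting can be sketched as follows: split a signal into disjoint fixed-length segments, encode each one, and average the resulting vectors element-wise. Here `encode` is a stand-in for a trained encoder, `toy_encode` is a hypothetical placeholder, and dropping a trailing partial segment is an assumption, since the text does not say how remainders are handled.

```python
def average_segment_representations(signal, encode, segment_len=1000):
    """Encode disjoint segments of `signal` and average their representations."""
    starts = range(0, len(signal) - segment_len + 1, segment_len)
    reps = [encode(signal[s:s + segment_len]) for s in starts]
    if not reps:
        raise ValueError("signal is shorter than one segment")
    dim = len(reps[0])
    return [sum(rep[j] for rep in reps) / len(reps) for j in range(dim)]


def toy_encode(segment):
    """Placeholder encoder: returns the mean and maximum of the segment."""
    return [sum(segment) / len(segment), max(segment)]


signal = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
avg = average_segment_representations(signal, toy_encode, segment_len=2)
```

In the paper's setting, `segment_len` would be 1000 time steps (2 s at 500 Hz).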

Classification Task       1s    2s    2s*    full
Biometrics (EMVIC-Train)  78.9  84.2  83.35  86.8 (22s)
Biometrics (EMVIC-Test)   79.0  85.6  86.6   87.8 (22s)
Biometrics (All)          69.3  76.9  79.7   79.8 (45s)
Stimuli (4)               46.7  59.2  85.0   85.4 (45s)
Age Group                 75.1  78.2  -      -
Gender                    79.4  85.9  -      -
TABLE VI: Accuracies for classification tasks depending on the viewing time (length of input to the AE). 1s = 500 gaze points = 500 time steps.

Comparison with Supervised TCN

Classification Task       AE$_v$ (unsupervised)  TCN$_v$ (supervised)
Biometrics (EMVIC-Train)  86.8                   93.6
Biometrics (EMVIC-Test)   87.8                   95.5
Biometrics (All)          79.8                   84.5
Stimuli (4)               89.2                   90.0
Age Group                 78.0                   96.8
Gender                    87.4                   96.2
TABLE VII: Accuracies for classification tasks of AE$_v$ compared to TCN$_v$, a supervised version of the encoder network.

Finally, AE$_v$ is compared against a supervised TCN (TCN$_v$) with the same architecture as the encoder in AE$_v$. To make the network supervised, we add an FC and Softmax layer that outputs class probabilities.

For each task, we train a new TCN$_v$ for 100 epochs with early stopping. We perform 4-fold CV for Biometrics (EMVIC), and 5-fold CV on all other tasks. Table VII shows the results. The TCN$_v$ models clearly outperform AE$_v$, which is an expected result given that supervised networks tune their weights according to the task. It is, however, encouraging to find that AE$_v$ can come within 0.8 percentage points of TCN$_v$, as on Stimuli (4). AEs also have less tendency to overfit and can be reused for different scenarios.

IV Related Work

Our work aims to learn generalizable representations of eye movements through unsupervised learning. To the best of our knowledge, no prior work shares this exact goal. Related but tangential works that construct gaze embeddings include [20] and [6]. The first used eye movement parameters, grids, and heatmaps, while the second used a CNN to extract feature vectors at fixated image patches. Another related work is [11], which used a generative adversarial network (GAN) to represent scanpaths. However, it is only a small-scale experiment primarily focused on scanpath classification.

V Conclusion

In this work, we proposed an autoencoder (AE) that learns micro-scale and macro-scale representations of eye movements. We trained a position AE and a velocity AE using three different data sets, and we evaluated the representations on various classification tasks. We achieved competitive results, outperforming other works despite using an unsupervised feature extractor and fitting only a linear classifier. Further experiments showed that the proposed AE can handle any input length and generalizes to unseen samples from a different data set. Its performance was also close to that of a supervised version of the encoder CNN on some tasks. This work is therefore a positive step towards adapting eye-tracking technology to real-world tasks.


References

  1. N. C. Anderson, F. Anderson, A. Kingstone and W. F. Bischof (2014-12) A comparison of scanpath comparison methods. Behavior Research Methods 47 (4), pp. 1377–1392.
  2. R. Andersson, L. Larsson, K. Holmqvist, M. Stridh and M. Nyström (2016-05) One algorithm to rule them all? An evaluation and discussion of ten eye movement event-detection algorithms. Behavior Research Methods 49 (2), pp. 616–637.
  3. S. Bai, J. Z. Kolter and V. Koltun (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. arXiv:1803.01271.
  4. Y. Bengio, A. Courville and P. Vincent (2013-08) Representation learning: a review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (8), pp. 1798–1828.
  5. V. Cantoni, C. Galdi, M. Nappi, M. Porta and D. Riccio (2015-04) GANT: gaze analysis technique for human identification. Pattern Recognition 48 (4), pp. 1027–1038.
  6. N. Castner, T. C. Kuebler, K. Scheiter, J. Richter, T. Eder, F. Huettig, C. Keutel and E. Kasneci (2020-06) Deep semantic gaze embedding and scanpath comparison for expertise classification during OPT viewing. Symposium on Eye Tracking Research and Applications. ISBN 9781450371339.
  7. M. Cerf, J. Harel, W. Einhäuser and C. Koch (2008) Predicting human gaze using low-level saliency combined with face detection. In Advances in Neural Information Processing Systems, pp. 241–248.
  8. A. Coutrot, J. H. Hsiao and A. B. Chan (2017-04) Scanpath modeling and classification with hidden Markov models. Behavior Research Methods 50 (1), pp. 362–379.
  9. A. T. Duchowski (2017) Eye tracking methodology. Springer International Publishing.
  10. V. Dumoulin and F. Visin (2016) A guide to convolution arithmetic for deep learning. arXiv:1603.07285.
  11. W. Fuhl, E. Bozkir, B. Hosp, N. Castner, D. Geisler, T. C. Santini and E. Kasneci (2019-06) Encodji. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications.
  12. D. Geisler, N. Castner, G. Kasneci and E. Kasneci (2020-06) A MinHash approach for fast scanpath classification. In Symposium on Eye Tracking Research and Applications.
  13. K. He, X. Zhang, S. Ren and J. Sun (2016-06) Deep residual learning for image recognition. 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ISBN 9781467388511.
  14. R. S. Hessels, D. C. Niehorster, M. Nyström, R. Andersson and I. T. C. Hooge (2018-08) Is the eye-movement field confused about fixations and saccades? A survey among 124 researchers. Royal Society Open Science 5 (8), pp. 180502.
  15. S. B. Hutton (2008-12) Cognitive control of saccadic eye movements. Brain and Cognition 68 (3), pp. 327–340.
  16. S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv:1502.03167.
  17. L. A. Jäger, S. Makowski, P. Prasse, S. Liehr, M. Seidler and T. Scheffer (2020) Deep Eyedentification: biometric identification using micro-movements of the eye. Lecture Notes in Computer Science, pp. 299–314. ISBN 9783030461478, ISSN 1611-3349.
  18. T. Judd, F. Durand and A. Torralba (2010-08) Fixations on low resolution images. Journal of Vision 10 (7), pp. 142–142.
  19. P. König, N. Wilming, T. Kietzmann, J. Ossandón, S. Onat, B. Ehinger, R. Gameiro and K. Kaspar (2016-12) Eye movements as a window to cognitive processes. Journal of Eye Movement Research 9 (5).
  20. N. Karessli, Z. Akata, B. Schiele and A. Bulling (2017-07) Gaze embeddings for zero-shot image classification. 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). ISBN 9781538604571.
  21. P. Kasprowski and K. Harezlak (2014-09) The second eye movements verification and identification competition. In IEEE International Joint Conference on Biometrics.
  22. D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv:1412.6980.
  23. K. Krejtz, A. T. Duchowski, A. Niedzielska, C. Biele and I. Krejtz (2018-09) Eye tracking cognitive load using pupil diameter and microsaccades with fixed gaze. PLOS ONE 13 (9), pp. e0203629.
  24. A. Kumar, A. Tyagi, M. Burch, D. Weiskopf and K. Mueller (2019-06) Task classification model for visual fixation, exploration, and search. In Proceedings of the 11th ACM Symposium on Eye Tracking Research & Applications.
  25. L. v. d. Maaten and G. Hinton (2008) Visualizing data using t-SNE. Journal of Machine Learning Research 9 (Nov), pp. 2579–2605.
  26. S. Martinez-Conde and S. L. Macknik (2017-02) Unchanging visions: the effects and limitations of ocular stillness. Philosophical Transactions of the Royal Society B: Biological Sciences 372 (1718), pp. 20160204.
  27. M. B. McCamy, J. Otero-Millan, L. L. D. Stasi, S. L. Macknik and S. Martinez-Conde (2014-02) Highly informative natural scene regions increase microsaccade production during visual scanning. Journal of Neuroscience 34 (8), pp. 2956–2966.
  28. O. L. Meur and T. Baccino (2012-07) Methods for comparing scanpaths and saliency maps: strengths and weaknesses. Behavior Research Methods 45 (1), pp. 251–266.
  29. S. Mukhopadhyay and S. Nandi (2017-06) LPiTrack: eye movement pattern recognition algorithm and application to biometric identification. Machine Learning 107 (2), pp. 313–331.
  30. J. Otero-Millan, X. G. Troncoso, S. L. Macknik, I. Serrano-Pedraza and S. Martinez-Conde (2008-12) Saccades and microsaccades during visual fixation, exploration, and search: foundations for a common saccadic generator. Journal of Vision 8 (14), pp. 21–21.
  31. A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein and L. Antiga (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8026–8037.
  32. F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot and E. Duchesnay (2011) Scikit-learn: machine learning in Python. Journal of Machine Learning Research 12, pp. 2825–2830.
  33. I. Rigas, L. Friedman and O. Komogortsev (2018) Study of an extensive set of eye movement features: extraction methods and statistical analysis. University Library Bern.
  34. E. Siegenthaler, F. M. Costela, M. B. McCamy, L. L. D. Stasi, J. Otero-Millan, A. Sonderegger, R. Groner, S. Macknik and S. Martinez-Conde (2013-11) Task difficulty in mental arithmetic affects microsaccadic rates and magnitudes. European Journal of Neuroscience 39 (2), pp. 287–294.