Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. On four different corpora we demonstrate that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate TSM predictions into the attention layer of a network designed for a specific upstream NLP task without the need for any task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.
Neural attention mechanisms have been widely applied in natural language processing and computer vision. By mimicking human attention Sood et al. (2020), they have enabled neural networks to only focus on those aspects of their input that are important for a given task Mnih et al. (2014); Xu et al. (2015b). While neural networks are able to learn meaningful attention mechanisms using only supervision received for the target task, the addition of human gaze information has been shown to be beneficial in many cases Karessli et al. (2017); Qiao et al. (2018); Xu et al. (2015a); Yun et al. (2013). An especially interesting way of leveraging gaze information was demonstrated by works incorporating human gaze into neural attention mechanisms, for example for image and video captioning Sugano and Bulling (2016); Yu et al. (2017) or visual question answering Qiao et al. (2018).
While attention is at least as important for reading text as it is for viewing images Commodari and Guarnera (2005); Wolfe and Horowitz (2017), integration of human gaze into neural attention mechanisms for natural language processing (NLP) tasks remains under-explored. A major obstacle to studying such integration is data scarcity: Available corpora of human gaze during reading consist of too few samples to provide effective supervision for modern data-hungry architectures, and human gaze data is only available for a small number of NLP tasks. For paraphrase generation and sentence compression, which play an important role for tasks such as reading comprehension systems Gupta et al. (2018); Hermann et al. (2015); Patro et al. (2018), no human gaze data is available.
We address this data scarcity in two novel ways: First, to overcome the low number of human gaze samples for reading, we propose a novel hybrid text saliency model (TSM) in which we combine a cognitive model of reading behaviour with human gaze supervision in a single machine learning framework. More specifically, we use the E-Z Reader model of attention allocation during reading Reichle et al. (1998) to obtain a large number of synthetic training examples. We use these examples to pre-train a BiLSTM Graves and Schmidhuber (2005) network with a Transformer Vaswani et al. (2017) whose weights we subsequently refine by training on only a small amount of human gaze data. We demonstrate that our model yields predictions that are well-correlated with human gaze on out-of-domain data. Second, we propose a novel joint modelling approach of attention and comprehension that allows human gaze predictions to be flexibly adapted to different NLP tasks by integrating TSM predictions into an attention layer. By jointly training the TSM with a task-specific network, the saliency predictions are adapted to this upstream task without the need for explicit supervision using real gaze data. Using this approach, we outperform the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieve state-of-the-art performance on the Google Sentence Compression corpus. As such, our work demonstrates the significant potential of combining cognitive and data-driven models and establishes a general principle for flexible gaze integration into NLP that has the potential to also benefit tasks beyond paraphrase generation and sentence compression.
2 Related work
Our work is related to previous works on 1) NLP tasks for text comprehension, 2) human attention modelling, as well as 3) gaze integration in neural network architectures.
2.1 NLP tasks for text comprehension
Two key tasks in machine text comprehension are paraphrasing and summarization Chen et al. (2016); Hermann et al. (2015); Cho et al. (2019); Li et al. (2018); Gupta and Lehal (2010). While paraphrasing is the task of “conveying the same meaning, but with different expressions” Cho et al. (2019); Fader et al. (2013); Li et al. (2018), summarization deals with extracting or abstracting the key points of a larger input sequence Frintrop et al. (2010); Tas and Kiyani (2007); Kaushik and Lipton (2018). Though advances have helped bring machine comprehension closer to human performance, humans are still superior for most tasks Blohm et al. (2018); Xia et al. (2019); Zhang et al. (2018). While attention mechanisms can improve performance by helping models to focus on relevant parts of the input Prakash et al. (2016); Rush et al. (2015); Rocktäschel et al. (2016); Cao et al. (2016); Hasan et al. (2016); Cho et al. (2015), the benefit of explicit supervision through human attention remains under-explored.
2.2 Human attention modelling
Predicting where people look (saliency prediction) in images is a long-standing challenge in neuroscience and computer vision Borji and Itti (2012); Bylinskii et al. (2016); Kümmerer et al. (2015). In contrast to images, most attention models for eye movement behaviors during reading are cognitive process models, i.e. models that do not involve machine learning but implement cognitive theories Engbert et al. (2005); Rayner (1978); Reichle et al. (1998). Key limitations of such models are their small number of parameters and hand-crafted rules, which make them difficult to adapt to different tasks and domains, as well as the difficulty of using them as part of end-to-end trained machine learning architectures Duch et al. (2008); Kotseruba and Tsotsos (2018); Ma and Peters (2020). One of the most influential cognitive models of gaze during reading is the E-Z Reader model Reichle et al. (1998). It assumes attention shifts to be strictly serial in nature and that saccade production depends on different stages of lexical processing. The E-Z Reader model has been very successful in explaining different effects seen in attention allocation during reading Reichle et al. (2009, 2013).
In contrast, learning-based attention models for text remain under-explored. Nilsson and Nivre (2009) trained person-specific models on features including length and frequency of words to predict fixations on words and later extended their approach to also predict fixation durations Nilsson and Nivre (2010). The first work to present a person-independent model for fixation prediction on text used a linear CRF model Matthies and Søgaard (2013). A separate line of work has instead tried to incorporate assumptions about the human reading process into the model design. For example, the Neural Attention Trade-off (NEAT) language model by Hahn and Keller (2016) is trained with hard attention and assigning a cost to each fixation. Subsequent work applied the NEAT model to question answering tasks, showing task-specific effects on learned attention patterns that reflect human behavior Hahn and Keller (2016). Further works include sentence representation learning using surprisal and part of speech tags as proxies to human attention Wang et al. (2017), attention as a way to improve time complexity for NLP tasks Seo et al. (2018), and learning saliency scores by training for sentence comparison Samardzhiev et al. (2018). Our work is fundamentally different from all of these works given that we, for the first time, combine cognitive theory and data-driven approaches.
2.3 Gaze integration in neural network architectures
Integration of human gaze data into neural network architectures has been explored for a range of computer vision tasks Karessli et al. (2017); Shcherbatyi et al. (2015); Xu et al. (2015a); Yu et al. (2017); Yun et al. (2013). Sugano and Bulling (2016) used gaze as an additional input to the attention layer for image captioning, while Qiao et al. (2018) used human-like attention maps as an additional supervision for the attention layer for a visual question answering task. Most previous work in gaze-supported NLP has used gaze as an input feature, e.g. for syntactic sequence labeling Klerke and Plank (2019), classifying referential versus non-referential use of pronouns Yaneva et al. (2018), reference resolution Iida et al. (2011), keyphrase extraction Zhang and Zhang (2019), or prediction of multi-word expressions Rohanian et al. (2017). Recently, Hollenstein et al. (2019) proposed to build a lexicon of gaze features given word types, overcoming the need for gaze data at test time. Two key recent works in NLP pioneered methods for incorporating gaze data into NLP classification models, inspired by multi-task learning: Klerke et al. (2016) added a gaze prediction task to regularize their sentence compression model. While they do not integrate gaze into an attention layer, their supervision method still improved performance on this task. Barrett et al. (2018) proposed an architecture for sequence classification tasks that could alternate between supervisory signals from labeled sequences and disjoint eye tracking data. In this work, the authors do not predict gaze on the specific task corpus, but rather use eye tracking data from a different corpus and task in order to regularize the neural attention function used in their classification system.
In stark contrast, our work provides human gaze predictions over any given NLP text corpus and therefore for the first time we are able to supervise NLP attention models by integrating human gaze predictions (made over the task corpus) directly into neural attention layers.
We make two distinct contributions:
A hybrid text saliency model, as well as two attention-based models for paraphrase generation and sentence compression employed in a novel joint modelling approach.
3.1 Hybrid text saliency model
To overcome the limited amount of eye-tracking data for reading comprehension tasks, we propose a hybrid approach when training our text saliency model.
In the first stage of training, we leverage the E-Z Reader model Reichle et al. (1998) to generate a large amount of training data over the CNN and Daily Mail Reading Comprehension Corpus Hermann et al. (2015).
After training the text saliency model until convergence using this synthetic data, in a second training phase we fine-tune the network with real eye tracking data of humans reading from the Provo and Geco corpus Luke and Christianson (2018); Cop et al. (2017).
We used the most recent implementation of E-Z Reader (Version 10.2) available from the authors’ website.
The task of text saliency is to predict fixation durations for each word of an input sentence. In our text saliency model, we combine a BiLSTM network Graves and Schmidhuber (2005) with a Transformer Vaswani et al. (2017) (see Figure 1 for an overview). Each word of the input sentence is encoded using pre-trained GloVe embeddings Pennington et al. (2014). The resulting embeddings are fed into a single-layer BiLSTM network Graves and Schmidhuber (2005) that integrates information over the whole input sentence. The outputs from the BiLSTM network are fed into a Transformer network with multi-headed self-attention Vaswani et al. (2017). In contrast to Vaswani et al. (2017), we only use the encoder of the Transformer network. Furthermore, we do not provide positional encodings as input, as this information is implicitly present in the outputs produced by the BiLSTM layer. In initial experiments we found an advantage of using only four layers with four attention heads each for the Transformer network as opposed to six layers with 12 heads in the original Transformer architecture Vaswani et al. (2017). The combination of a BiLSTM network with a subsequent Transformer network allows our model to better capture the sequential context while still maintaining computational efficiency. Finally, a fully connected layer is used to obtain an attention score for each input word. We apply sigmoid nonlinearities with subsequent normalization over the input sentence to obtain a probability distribution over the sentence. As loss function we use the mean squared error.
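The architecture described above can be sketched as follows in PyTorch. This is a minimal illustration, not the authors' implementation: the class and variable names are our own, the layer sizes follow the paper (single BiLSTM layer, four Transformer encoder layers with four heads each), and pre-trained GloVe embeddings are replaced by random vectors for brevity.

```python
import torch
import torch.nn as nn

class TextSaliencySketch(nn.Module):
    """Hypothetical sketch of the hybrid TSM: BiLSTM -> Transformer encoder
    (no positional encoding) -> per-word score -> normalized distribution."""

    def __init__(self, emb_dim=300, hidden=128):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden, num_layers=1,
                              batch_first=True, bidirectional=True)
        enc_layer = nn.TransformerEncoderLayer(d_model=2 * hidden, nhead=4,
                                               batch_first=True)
        # Encoder only; word order is already encoded by the BiLSTM outputs.
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
        self.out = nn.Linear(2 * hidden, 1)

    def forward(self, embeddings):
        h, _ = self.bilstm(embeddings)                    # (B, T, 2*hidden)
        h = self.encoder(h)                               # self-attention over the sentence
        scores = torch.sigmoid(self.out(h)).squeeze(-1)   # (B, T), each in (0, 1)
        return scores / scores.sum(dim=-1, keepdim=True)  # distribution over words

model = TextSaliencySketch()
sal = model(torch.randn(2, 7, 300))   # batch of 2 seven-word "sentences"
# Training would regress `sal` against ground-truth fixation durations
# with nn.MSELoss(), as described in the text.
```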
3.2 Joint modelling for natural language processing tasks
To model the relationship between attention allocation and text comprehension, we integrate the TSM with two different attention-based NLP task networks in a joint model (see Figure 1). Specifically, we propose a modification to the Luong attention layer Luong et al. (2015), a computationally lightweight but highly effective multiplicative attention mechanism Luong et al. (2015); Britz et al. (2017). We compute attention scores as

$a_t(s) = \frac{\exp(\mathrm{score}(h_t, \bar{h}_s))}{\sum_{s'} \exp(\mathrm{score}(h_t, \bar{h}_{s'}))}$ (1)
using our task-specific modified score functions $\mathrm{score}(h_t, \bar{h}_s)$. For the tasks of paraphrase generation and sentence compression, respectively, we propose the novel score functions

$\mathrm{score}(h_t, \bar{h}_s) = h_t^{\top} W_a \left(T_s \odot \bar{h}_s\right)$ (2)

$\mathrm{score}(h_t, \bar{h}_s) = v_a^{\top} \tanh\!\left(W_a \left[h_t; T_s \odot \bar{h}_s\right]\right)$ (3)
where $h_t$ is the current decoder hidden state, $\bar{h}_s$ are the hidden states of the encoder, and $W_a$ and $v_a$ are learnable parameters of the attention mechanism. The outputs $T$ of the TSM on the input sentence are incorporated into the score function by element-wise multiplication. This way, attention scores in the upstream task network reflect word saliencies learnt from humans. In addition, the error signal from the upstream loss function can be propagated back to the TSM in order to adapt its parameters to the upstream task, thereby defining an implicit loss on $T$. This way, the attention distribution returned by the TSM is adapted to the specific upstream task, allowing us to incorporate and adapt a neural model of attention to tasks for which no human gaze data is available. Note that, as we have two different kinds of tasks, namely a generative task (paraphrase generation) and a classification task (sentence compression), we used different score functions as suggested by previous work Luong et al. (2015).
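A small NumPy sketch of this idea, assuming Luong's general (multiplicative) score variant: the TSM saliency for each source word rescales the corresponding encoder state before the score is computed, so salient words receive proportionally larger attention scores. All names and dimensions here are illustrative, not the authors' code.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def saliency_weighted_score(h_t, H_enc, T, W):
    """Luong-style general score with TSM saliencies T folded in
    by element-wise multiplication (hypothetical sketch)."""
    H_mod = H_enc * T[:, None]   # weight each encoder state by its word saliency
    return H_mod @ W @ h_t       # one scalar score per source position

rng = np.random.default_rng(0)
d, src_len = 8, 5
h_t = rng.standard_normal(d)                 # current decoder hidden state
H_enc = rng.standard_normal((src_len, d))    # encoder hidden states
T = softmax(rng.standard_normal(src_len))    # TSM saliency distribution over words
W = rng.standard_normal((d, d))              # learnable parameter W_a

attn = softmax(saliency_weighted_score(h_t, H_enc, T, W))  # attention weights
```

In the joint model, because `T` is itself the output of the TSM network, gradients from the upstream loss flow through this product back into the TSM, which is what adapts the saliency predictions to the task.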
4.1 Joint model with upstream tasks
Datasets. We used two standard benchmark corpora to evaluate each upstream NLP task.
For paraphrase generation, we used the Quora Question Pairs corpus
Paraphrase generation. Our first task was paraphrase generation where, given a source sentence, the model has to produce a different target sentence with the same meaning that may have a different length. We used a sequence-to-sequence network with word-level attention that was originally proposed for neural machine translation Bahdanau et al. (2015). The model consisted of two recurrent neural networks, an encoder and an attention decoder (see Figure 1). The encoder consisted of an embedding layer followed by a gated recurrent unit (GRU) Cho et al. (2014). The decoder produced an output sentence step by step given the hidden state of the encoder and the input sentence. At each output step, the encoded input word and the previous hidden state are used to produce attention weights using our modified Luong attention (see Equation 2). These attention weights are combined with the embedded input sentence and fed into a GRU to produce an output sentence. The loss between the predicted and ground-truth paraphrase was calculated over the entire vocabulary using cross-entropy.
Sentence compression. As a second text comprehension task, we opted for deletion-based sentence compression that aims to delete unimportant words from the input sentence Jing (2000); Knight and Marcu (2002); McDonald (2006); Clarke and Lapata (2008); Filippova et al. (2015). We incorporated the attention mechanism into the baseline architecture presented in Filippova et al. (2015). The network consisted of three stacked LSTM layers, with dropout after each LSTM layer as a regularization method. The outputs of the last LSTM layer were fed through our modified Luong attention mechanism (see Equation 3) and two fully connected layers which predicted for each word whether it should be deleted. The loss between predicted and ground truth deletion mask was calculated with cross-entropy.
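The sentence compression network can be sketched as below. This is a simplified sketch with illustrative sizes, not the paper's implementation; in particular, the modified Luong attention layer that sits between the LSTM stack and the fully connected layers in the actual model is omitted here for brevity.

```python
import torch
import torch.nn as nn

class CompressionSketch(nn.Module):
    """Hypothetical sketch of the deletion-based compressor: three stacked
    BiLSTM layers with dropout, then two fully connected layers that emit
    per-word keep/delete logits. (Attention layer omitted for brevity.)"""

    def __init__(self, emb_dim=300, hidden=64, p_drop=0.1):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, num_layers=3, dropout=p_drop,
                            batch_first=True, bidirectional=True)
        self.fc1 = nn.Linear(2 * hidden, hidden)
        self.fc2 = nn.Linear(hidden, 2)   # logits for {keep, delete}

    def forward(self, emb):
        h, _ = self.lstm(emb)                        # (B, T, 2*hidden)
        return self.fc2(torch.relu(self.fc1(h)))     # (B, T, 2)

net = CompressionSketch()
logits = net(torch.randn(1, 9, 300))                 # one nine-word sentence
# Cross-entropy against the ground-truth deletion mask, as in the text:
loss = nn.CrossEntropyLoss()(logits.view(-1, 2), torch.randint(0, 2, (9,)))
```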
Training. We used pre-trained 300-dimensional GloVe embeddings in both the TSM and the upstream task network to represent the input words Pennington et al. (2014). We trained both upstream task models using the ADAM optimizer Kingma and Ba (2015) with a learning rate of 0.0001. For paraphrase generation we used uni-directional GRUs with hidden layer size 1,024 and dropout probability of 0.2. For sentence compression we used Bi-LSTMs with hidden layer size 1,024 and dropout probability of 0.1.
Metrics. The most common metric to evaluate text generative tasks is BLEU Papineni et al. (2002), which measures the n-gram overlap between the produced and target sequence. To ensure reproducibility, we followed the standard Sacrebleu Post (2018) implementation that uses BLEU-4. For sentence compression, we followed previous works Filippova et al. (2015); Zhao et al. (2018) by reporting the F1 score as well as the compression ratio calculated as the length of the compressed sentence divided by the input sentence length measured in characters Filippova et al. (2015).
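The compression metrics above are straightforward to compute; a minimal sketch follows. The compression ratio follows the character-based definition of Filippova et al. (2015) quoted in the text, while the token-level F1 here is a simplified illustration over kept words, not necessarily the exact scoring script used in prior work.

```python
def compression_ratio(source: str, compressed: str) -> float:
    """Compressed length divided by source length, measured in characters."""
    return len(compressed) / len(source)

def kept_token_f1(pred_keep, gold_keep):
    """Simplified token-level F1 over kept words (1 = keep, 0 = delete)."""
    tp = sum(1 for p, g in zip(pred_keep, gold_keep) if p and g)
    precision = tp / max(sum(pred_keep), 1)
    recall = tp / max(sum(gold_keep), 1)
    return 2 * precision * recall / max(precision + recall, 1e-12)

ratio = compression_ratio("a short example sentence here", "a short sentence")
f1 = kept_token_f1([1, 1, 0, 1, 0], [1, 1, 0, 0, 1])
```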
Results and discussion
Results for our joint model on paraphrase generation and sentence compression in comparison to the state of the art are shown in Table 1. For paraphrase generation, our approach achieves a BLEU-4 score of 28.82 when using 100K training examples, clearly outperforming the previous state of the art for this task from Patro et al. (2018) (17.9 BLEU-4). The same holds for 50K training examples (26.24 vs. 16.5 BLEU-4). For sentence compression, our joint model reaches state-of-the-art performance with 85.0 F1 score and 0.39 compression rate compared to 85.1 F1 score and 0.39 compression rate achieved by the approach of Zhao et al. (2018).
To further analyze the impact of our joint modelling approach, we evaluated several ablated versions of our model:
No Fixations: Stand-alone upstream task network with original Luong attention (no TSM).
Random TSM Init: Random initialization of the TSM instead of training on E-Z Reader and human data. Still implicit supervision by the upstream task during joint training.
TSM Weight Swap: Exchange of the weights of the TSM model between tasks, i.e. sentence compression using the TSM weights obtained from the best-performing paraphrase generation model and vice versa.
Frozen TSM: Training of the TSM with E-Z Reader and human gaze predictions but with frozen weights in the joint training with the upstream task, i.e. no adaptation of the TSM.
As can be seen from Table 1, all ablated models obtain inferior performance to our full model on both tasks (statistically significant at the 0.05 level).
Notably, even the No Fixation model improves drastically over the Seq-to-Seq baseline for paraphrase generation, most likely due to the significant increase in network parameters.
The benefit of training the TSM with our hybrid approach before using it in the joint model is underlined by the performance difference between the Random TSM Init model (decrease in performance for both tasks) and our full model (best performance and differently adapted saliency predictions; see Table 1 and Figure 2).
Most importantly, our full model achieves higher performance than the Frozen TSM model in all evaluations (e.g. 85.0 vs. 83.9 F1 for sentence compression), indicating that our model successfully adapts the TSM predictions during joint training. This is further underlined by the inferior performance of the TSM Weight Swap model: Swapping the optimal TSM weights between different upstream tasks leads to a notable performance decrease (e.g. 85.0 vs. 83.7 F1 for sentence compression), implying that the TSM model adaptation is specific to the upstream task.
To gain qualitative insights into how our joint model training adapts TSM predictions to specific upstream tasks, we visualize the saliency predictions over time. Figure 2 shows representative samples for both tasks. In addition, we show the 2D neural attention maps of the converged models, with the input sequence on the horizontal axis and the subsequent prediction on the vertical axis, for our model (with fixations) and the No Fixation model. As can be seen, the adapted saliency predictions (left) differ significantly from each other, particularly when analyzed over training epochs, where the final epoch shows the fixation duration predictions of our converged model. In paraphrase generation (top left), the saliency predictions focus on fewer words in the sentence within 11 epochs, specifically the word “travel”, as this word is replaced in the correct paraphrase by the word “visit”. Our model correctly predicts the paraphrase, while the No Fixation model does not. For sentence compression (bottom left), the predictions continue to be spread over the whole sentence with only slight changes in the distribution over the words. This makes sense given that the task of this network is to delete as many words in the input sequence as possible while still maintaining syntactic structure and meaning. In the 2D attention maps (bottom right), the neural attention weights of the two converged models differ with respect to the allocation of probability mass: the No Fixation model densely concentrates attention on a few specific input words (horizontal axis) when predicting several words (vertical axis), whereas the attention mass of our model is more spread out.
4.2 Pre-training of the hybrid text saliency model (TSM)
Training datasets. Training the TSM consists of two stages: pre-training with synthetic data generated by E-Z Reader, and subsequent fine-tuning on human gaze data. For the first stage, we ran E-Z Reader on the CNN and Daily Mail corpus Hermann et al. (2015), consisting of 300K online news articles with on average 3.75 sentences each. As recommended in Reichle et al. (1998), we ran E-Z Reader 10 times for each sentence to ensure stability in the fixation predictions. For training, we obtained a total of 7.6M annotated sentences from Daily Mail and 3.1M from CNN; for validation, 850K sentences from Daily Mail and 350K from CNN. For the second stage, we used the two established gaze corpora Provo Luke and Christianson (2018) and Geco Cop et al. (2017). Provo contains 55 short passages extracted from different sources such as popular science magazines and fiction stories Luke and Christianson (2018). We split the data into 10K sentence pairs (a pair is one sentence as read by one participant; multiple participants read the same sentence) for training and 1K sentence pairs for validation. Geco comprises long passages from a popular novel Cop et al. (2017). We split the data into 65K sentence pairs for training and 8K for validation.
Test datasets. We evaluated our model on the validation sets of the Provo and Geco corpora, as well as on the Dundee Kennedy and Pynte (2005) and MQA-RC corpora Sood et al. (2020). The combined validation corpora of Provo and Geco contained 18K sentence pairs. Dundee consists of recordings from 10 participants reading 20 news articles while MQA-RC corpus is a 3-condition reading comprehension corpus using 32 documents from the MovieQA question answering dataset Tapaswi et al. (2016). For our evaluation we used 1K sentence pairs from the free reading condition. This dataset is substantially different from the other eye tracking corpora because its stimuli are scraped from online sources and contain noise not found in text intended for human reading.
Implementation details. We used pre-trained 300-dimensional GloVe word embeddings Pennington et al. (2014). Our network has a bidirectional LSTM, with four Transformer self-attention layers with four heads and hidden size of 128. The model objective is to predict normalized fixation durations for each word in the input sentence, resulting in saliency scores between 0 and 1. We used the ADAM optimizer Kingma and Ba (2015) with a learning rate of 0.00001, batch size of 100, and dropout of 0.5 after the embedding layer and the recurrent layer. We pre-trained our network on the synthetic training data for four epochs and then fine-tuned it on human data for 10 epochs.
Table 2 (excerpt; each cell reports MSE / JSD / Spearman’s ρ, * = significant):
|Corpus||TSM||TSM w/o pre-train||TSM w/o fine-tune|
|Provo + Geco||0.105 / 0.34 / 1.00*||0.112 / 0.36 / 0.99*||0.238 / 0.46 / 0.10|
Metrics. To evaluate the TSM model, we compute mean squared error (MSE) between the predicted and ground truth fixation durations as well as the Jensen-Shannon Divergence (JSD) Lin (1991). JSD is widely used in eye tracking research to evaluate inter-gaze agreement Mozaffari et al. (2018); Fang et al. (2009); Davies et al. (2016); Oertel and Salvi (2013) as, unlike Kullback-Leibler Divergence, JSD is symmetric. In addition we measured the word type predictability as it is a well-known predictor of fixation probabilities Hahn and Keller (2016); Nilsson and Nivre (2009). We used the Stanford tagger Toutanova et al. (2003) to predict part-of-speech (POS) tags for our corpora and compute the average fixation probability per tag, allowing us to compute the correlation between our model and ground truth using Spearman’s ρ.
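The JSD between a predicted and a ground-truth fixation distribution can be computed as below. We implement the divergence directly because `scipy.spatial.distance.jensenshannon` returns the Jensen-Shannon *distance* (the square root of the divergence); the base-2 formulation bounds the value to [0, 1].

```python
import numpy as np

def jsd(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two distributions;
    symmetric, unlike KL divergence, and bounded to [0, 1]."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    p, q = p / p.sum(), q / q.sum()      # normalize to proper distributions
    m = 0.5 * (p + q)                    # mixture distribution

    def kl(a, b):
        return np.sum(a * np.log2((a + eps) / (b + eps)))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

pred = [0.1, 0.4, 0.3, 0.2]   # predicted fixation-duration distribution
gold = [0.1, 0.4, 0.3, 0.2]   # identical ground truth -> JSD of 0
agreement = jsd(pred, gold)
```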
Results and discussion
Table 2 shows the performance of our model and ablation conditions in terms of mean squared error (MSE), Jensen-Shannon Divergence (JSD) and correlation to human ground truth. As ablation conditions we evaluate a model trained only on human data (w/o pre-train) as well as a model that is not fine-tuned on human data (w/o fine-tune) but only trained with E-Z Reader data.
Most importantly, our model is superior to, or on par with, both ablation variants across all metrics and corpora, showing the importance of both the E-Z Reader pre-training as well as the fine-tuning with human data. Pre-training with data obtained from E-Z Reader is most beneficial in the case of the small Provo corpus, where we observe a reduction from 0.44 JSD to 0.24 JSD by adding the pre-training step. For the larger corpora this difference is less pronounced but still present. It is interesting to note that TSM w/o fine-tune performs consistently the worst, indicating that training on E-Z Reader data alone is insufficient even though it provides benefits when combined with human data.
Using the correlations to human gaze over the POS distributions, we can compare our approach to Hahn and Keller (2016), who achieved a ρ of 0.85 on the Dundee corpus, compared to a ρ of 0.99 achieved by our model.
Furthermore, we observe an especially large improvement in ρ as a result of E-Z Reader pre-training on the MQA-RC dataset.
This dataset, unlike the other eye tracking corpora, is generated from stimuli which were scraped from online sources regarding movie plots, underlining the effectiveness of our approach in generalising to out-of-domain data.
In further analyses on the POS based correlations we observed that content words, such as adjectives, adverbs, nouns, and verbs, are more predictive than function words.
In this work we made two original contributions towards improving natural language processing tasks using human gaze predictions as a supervisory signal. First, we introduced a novel hybrid text saliency model that, for the first time, integrates a cognitive reading model with a data-driven approach to address the scarcity of human gaze data on text. Second, we proposed a novel joint modelling approach that allows the TSM to be flexibly adapted to different NLP tasks without the need for task-specific ground truth human gaze data. We showed that both advances result in significant performance improvements over the state of the art in paraphrase generation as well as competitive performance for sentence compression but with a much less complex model than the state of the art. We further demonstrated that this approach is effective in yielding task-specific attention predictions. Taken together, our findings not only demonstrate the feasibility and significant potential of combining cognitive and data-driven models for NLP tasks – and potentially beyond – but also how saliency predictions can be effectively integrated into the attention layer of task-specific neural network architectures to improve performance.
E. Sood was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany’s Excellence Strategy - EXC 2075 – 390740016; S. Tannert was supported by IBM Research AI through the IBM AI Horizons Network; P. Müller and A. Bulling were funded by the European Research Council (ERC; grant agreement 801708). We would like to thank the following people for their helpful insights and contributions: Sean Papay, Pavel Denisov, Prajit Dhar, Manuel Mager, and Diego Frassinelli. Additional revenues related to, but not supporting, this work: Scholarship by Google for E. Sood.
Appendix A Appendix
a.1 Sentence Compression Comparison To Previous SOTA
To gain further insight into the comparison between our model and the current state of the art in sentence compression, we show results of our method and ablations in relation to ablations of the method by Zhao et al. (2018) (see Table 3). In their work, the authors added a “syntax-based language model” to their sentence compression network with which they obtained the state-of-the-art performance of 85.1 F1 score. The authors employ a syntax-based language model which is trained to learn the syntactic dependencies between lexical items in the given input sequence. Together with this language model, they use a reinforcement learning algorithm to improve the deletions proposed by their Bi-LSTM model. Using a naive language model without syntactic features, their model obtained an F1 score of 85.0. With their stand-alone Bi-LSTM method, in which they do not employ the reinforcement language model policy, they obtain an F1 score of 84.8. In contrast, our method includes neither a reinforcement learning-based language model nor additional syntactic features. However, our method is still competitive with the state of the art (achieving an F1 score of 85.0), and arguably might benefit from additional incorporation of syntactic information in future work.
|Method||Variant||F1||Compression ratio||Parameters|
|Zhao et al. (2018)||LSTM implementation||84.8||0.40||—|
|Zhao et al. (2018)||Syntax-Based Evaluator LM||85.1||0.39||—|
|Our paper||Baseline (BiLSTM)||81.3||0.39||12M|
|Our paper||Random TSM Init||83.7||0.38||178M|
|Our paper||TSM Weight Swap||83.8||0.38||178M|
a.2 Ablation Study – Attention Maps
To shed more light onto the adapted TSM predictions for the conditions in our ablation study, we present saliency and neural attention maps for the conditions Random TSM Init and TSM Weight Swap. In Figure 4, we show that the adapted saliency predictions (blue, left) for paraphrase generation vary between the two conditions (top vs. bottom) with respect to the words which are predicted to be most salient and the temporal adaptation during training. The last epoch shows the predictions of the respective converged models. There exist notable differences in the adapted TSM predictions for the two ablations. However, we assume they do not play a role in performance between these two conditions, as the performance differences are not statistically significant. These conditions do, however, perform significantly worse than our model (see paper for results). As shown in the paper, our model allocates the most attention to the word “travel” in the example sentence. This is the word that is changed in the paraphrase output, indicating that our adapted TSM can effectively guide the paraphrase generation system. Figure 5 shows the adapted saliency predictions for the sentence compression task. The differences between both conditions are less distinct, with minimal temporal variation in the word saliency predictions. As with the paraphrase generation models, performance differences between the two ablations are not statistically significant. Compared to the saliency output for our model (shown in the paper), we observe that our model more equally allocates attention to the part of the sentence that is going to be deleted.
While the 2D neural attention maps for the example sentence in the paraphrase generation task are similar for Random TSM Init and TSM Weight Swap, they differ clearly from the corresponding neural attention map of our full model (shown in the paper). Similarly, the 2D neural attention maps for sentence compression (Figure 5, right) are rather similar for Random TSM Init and TSM Weight Swap. The corresponding neural attention map for our method, presented in the paper, is in contrast more spread out and additionally allocates more attention to the position in the input sentence from which onwards the network decides to delete words. Taken together, these results illustrate the differences in neural attention that are connected to the superior performance of our full model over the ablation conditions.
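Differences between saliency or attention maps of the kind discussed above can be quantified with the Jensen–Shannon divergence, the entropy-based divergence measure cited in §4.2 of the paper. The following stdlib-only sketch computes it between two per-token weight vectors; the function name and input format are illustrative:

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two weight vectors.

    p, q: non-negative per-token weights (e.g. saliency or attention
    scores) over the same token sequence. The result lies in [0, 1]:
    0 for identical distributions, 1 for disjoint support.
    """
    # Normalize both vectors to probability distributions.
    sum_p, sum_q = sum(p), sum(q)
    p = [x / sum_p for x in p]
    q = [x / sum_q for x in q]
    # Mixture distribution shared by both KL terms.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # KL(a || b); terms with a_i = 0 contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

Note that, unlike the divergence above, some libraries (e.g. SciPy's `jensenshannon`) return the square root of this quantity, i.e. the Jensen–Shannon distance.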
A.3 Part-of-Speech Distributions – Content vs. Function Words
In our paper, we showed that our model and humans are significantly correlated with respect to gaze durations over part-of-speech (POS) tag distributions. We use this measure because POS tags have been shown to be good predictors of fixation probabilities Hahn and Keller; Nilsson and Nivre. In Figure 6, we provide an additional analysis of this matter. We group the fixation duration predictions over content words (adjectives, adverbs, nouns, and verbs) and over function words (conjunctions, pronouns, determiners, numbers, adpositions, and particles), for both human gaze and our model predictions (normalized between 0 and 1). The figure shows that our model, similarly to humans, predicts content words to be more informative than function words.
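The content/function grouping above is straightforward to reproduce. A minimal sketch, assuming Universal-style POS tags and per-token normalized fixation durations (tag sets and function name are illustrative, not taken from the paper's code):

```python
# Coarse POS tags grouped into content vs. function words, mirroring
# the grouping in the text (tag names follow the Universal POS tagset).
CONTENT_TAGS = {"ADJ", "ADV", "NOUN", "VERB"}
FUNCTION_TAGS = {"CONJ", "PRON", "DET", "NUM", "ADP", "PRT"}

def mean_duration_by_class(tokens):
    """Average normalized fixation duration per word class.

    tokens: iterable of (pos_tag, duration) pairs, where duration is a
    normalized fixation duration in [0, 1] (human or model-predicted).
    """
    content = [d for tag, d in tokens if tag in CONTENT_TAGS]
    function = [d for tag, d in tokens if tag in FUNCTION_TAGS]

    def mean(xs):
        return sum(xs) / len(xs) if xs else 0.0

    return {"content": mean(content), "function": mean(function)}

# Example: content words receive longer fixations than function words.
print(mean_duration_by_class([("NOUN", 0.8), ("DET", 0.2), ("VERB", 0.6)]))
```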
- Code and other supporting material can be found at https://perceptualui.org/publications/sood20_neurips/
- Additional 1D and 2D maps over all conditions are available in the supplementary material.
- Detailed POS distributions are available in the supplementary material.
- Neural machine translation by jointly learning to align and translate. In Proc. International Conference on Learning Representations.
- Sequence classification with human attention. In Proc. Conference on Computational Natural Language Learning, pp. 302–312.
- Comparing attention-based convolutional and recurrent neural networks: success and limitations in machine reading comprehension. In Proc. Conference on Computational Natural Language Learning, pp. 108–118.
- State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence 35 (1), pp. 185–207.
- Massive exploration of neural machine translation architectures. arXiv preprint arXiv:1703.03906.
- Where should saliency models look next? In Proc. European Conference on Computer Vision, pp. 809–824.
- AttSum: joint learning of focusing and summarization with neural attention. In Proc. International Conference on Computational Linguistics: Technical Papers, pp. 547–556.
- A thorough examination of the CNN/Daily Mail reading comprehension task. In Proc. Annual Meeting of the Association for Computational Linguistics, pp. 2358–2367.
- Paraphrase generation for semi-supervised learning in NLU. In Proc. Workshop on Methods for Optimizing and Evaluating Neural Language Generation, pp. 45–54.
- Describing multimedia content using attention-based encoder-decoder networks. IEEE Transactions on Multimedia 17 (11), pp. 1875–1886.
- Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734.
- Global inference for sentence compression: an integer linear programming approach. Journal of Artificial Intelligence Research 31, pp. 399–429.
- Attention and reading skills. Perceptual and Motor Skills 100 (2), pp. 375–386.
- Presenting GECO: an eyetracking corpus of monolingual and bilingual sentence reading. Behavior Research Methods 49 (2), pp. 602–615.
- Exploring the relationship between eye movements and electrocardiogram interpretation accuracy. Scientific Reports 6, pp. 38227.
- Cognitive architectures: where do we go from here? In Proc. Conference on Artificial General Intelligence, pp. 122–136.
- SWIFT: a dynamical model of saccade generation during reading. Psychological Review 112 (4), pp. 777.
- Paraphrase-driven learning for open question answering. In Proc. Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1608–1618.
- Between linguistic attention and gaze fixations in multimodal conversational interfaces. In Proc. International Conference on Multimodal Interfaces, pp. 143–150.
- Sentence compression by deletion with LSTMs. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 360–368.
- Computational visual attention systems and their cognitive foundations: a survey. ACM Transactions on Applied Perception (TAP) 7 (1), pp. 6.
- Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks 18 (5–6), pp. 602–610.
- A deep generative framework for paraphrase generation. In Proc. Thirty-Second AAAI Conference on Artificial Intelligence.
- A survey of text summarization extractive techniques. Journal of Emerging Technologies in Web Intelligence 2 (3), pp. 258–268.
- Modeling human reading with neural attention. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 85–95.
- Neural clinical paraphrase generation with attention. In Proc. Clinical Natural Language Processing Workshop, pp. 42–53.
- Teaching machines to read and comprehend. In Proc. Advances in Neural Information Processing Systems, pp. 1693–1701.
- Advancing NLP with cognitive language processing signals. arXiv preprint arXiv:1904.02682.
- Multi-modal reference resolution in situated dialogue by integrating linguistic and extra-linguistic clues. In Proc. International Joint Conference on Natural Language Processing, pp. 84–92.
- Sentence reduction for automatic text summarization. In Proc. Applied Natural Language Processing Conference, pp. 310–315.
- Gaze embeddings for zero-shot image classification. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4525–4534.
- How much reading does reading comprehension require? A critical investigation of popular benchmarks. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 5010–5015.
- Parafoveal-on-foveal effects in normal reading. Vision Research 45 (2), pp. 153–168.
- Adam: a method for stochastic optimization. In Proc. International Conference on Learning Representations.
- Improving sentence compression by learning to predict gaze. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics, pp. 1528–1533.
- At a glance: the impact of gaze aggregation views on syntactic tagging. In Proc. Beyond Vision and Language: Integrating Real-World Knowledge, pp. 51–61.
- Summarization beyond sentence extraction: a probabilistic approach to sentence compression. Artificial Intelligence 139 (1), pp. 91–107.
- 40 years of cognitive architectures: core cognitive abilities and practical applications. Artificial Intelligence Review, pp. 1–78.
- Deep Gaze I: boosting saliency prediction with feature maps trained on ImageNet. In Proc. International Conference on Learning Representations, pp. 1–12.
- Paraphrase generation with deep reinforcement learning. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 3865–3878.
- Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37 (1), pp. 145–151.
- The Provo corpus: a large eye-tracking corpus with predictability norms. Behavior Research Methods 50 (2), pp. 826–833.
- Effective approaches to attention-based neural machine translation. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 1412–1421.
- A neural network walks into a lab: towards using deep nets as models for human behavior. arXiv preprint arXiv:2005.02181.
- With blinkers on: robust prediction of eye movements across readers. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 803–807.
- Discriminative sentence compression with soft syntactic evidence. In Proc. Conference of the European Chapter of the Association for Computational Linguistics.
- Recurrent models of visual attention. In Proc. Advances in Neural Information Processing Systems, pp. 2204–2212.
- Evaluating similarity measures for gaze patterns in the context of representational competence in physics education. In Proc. ACM Symposium on Eye Tracking Research & Applications, pp. 1–5.
- Learning where to look: modeling eye movements in reading. In Proc. Conference on Computational Natural Language Learning, pp. 93–101.
- Towards a data-driven model of eye movement control in reading. In Proc. Workshop on Cognitive Modeling and Computational Linguistics, pp. 63–71.
- A gaze-based method for relating group involvement to individual engagement in multimodal multiparty dialogue. In Proc. International Conference on Multimodal Interaction, pp. 99–106.
- BLEU: a method for automatic evaluation of machine translation. In Proc. Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Learning semantic sentence embeddings using sequential pair-wise discriminator. In Proc. International Conference on Computational Linguistics, pp. 2715–2729.
- GloVe: global vectors for word representation. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543.
- A call for clarity in reporting BLEU scores. In Proc. Conference on Machine Translation, pp. 186–191.
- Neural paraphrase generation with stacked residual LSTM networks. In Proc. International Conference on Computational Linguistics, pp. 2923–2934.
- Exploring human-like attention supervision in visual question answering. In Proc. Thirty-Second AAAI Conference on Artificial Intelligence.
- Eye movements in reading and information processing. Psychological Bulletin 85 (3), pp. 618.
- Using E-Z Reader to examine the concurrent development of eye-movement control and reading skill. Developmental Review 33 (2), pp. 110–149.
- Toward a model of eye movement control in reading. Psychological Review 105 (1), pp. 125.
- Using E-Z Reader to model the effects of higher level language processing on eye movements during reading. Psychonomic Bulletin & Review 16 (1), pp. 1–21.
- Reasoning about entailment with neural attention. In Proc. International Conference on Learning Representations.
- Using gaze data to predict multiword expressions. In Proc. International Conference Recent Advances in Natural Language Processing, pp. 601–609.
- A neural attention model for abstractive sentence summarization. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 379–389.
- Learning neural word salience scores. In Proc. Joint Conference on Lexical and Computational Semantics, pp. 33–42.
- Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45 (11), pp. 2673–2681.
- Neural speed reading via Skim-RNN. In Proc. International Conference on Learning Representations.
- GazeDPM: early integration of gaze information in deformable part models. arXiv preprint arXiv:1505.05753.
- Interpreting attention models with human visual attention in machine reading comprehension. In Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL).
- Seeing with humans: gaze-assisted neural image captioning. arXiv preprint arXiv:1608.05203.
- MovieQA: understanding stories in movies through question-answering. In Proc. IEEE Conference on Computer Vision and Pattern Recognition.
- A survey automatic text summarization. PressAcademia Procedia 5 (1), pp. 205–213.
- Feature-rich part-of-speech tagging with a cyclic dependency network. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics, pp. 173–180.
- Attention is all you need. In Proc. Advances in Neural Information Processing Systems, pp. 5998–6008.
- Learning sentence representation with guidance of human attention. In Proc. International Joint Conference on Artificial Intelligence, pp. 4137–4143.
- Five factors that guide attention in visual search. Nature Human Behaviour 1 (3), pp. 1–8.
- Automatic learner summary assessment for reading comprehension. In Proc. Conference of the North American Chapter of the Association for Computational Linguistics, pp. 2532–2542.
- Gaze-enabled egocentric video summarization via constrained submodular maximization. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2235–2244.
- Show, attend and tell: neural image caption generation with visual attention. In Proc. International Conference on Machine Learning, pp. 2048–2057.
- Classifying referential and non-referential it using gaze. In Proc. Conference on Empirical Methods in Natural Language Processing, pp. 4896–4901.
- Supervising neural attention models for video captioning by human gaze data. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 490–498.
- Studying relationships between human gaze, description, and computer vision. In Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 739–746.
- ReCoRD: bridging the gap between human and machine commonsense reading comprehension. arXiv preprint arXiv:1810.12885.
- Using human attention to extract keyphrase from microblog post. In Proc. Annual Meeting of the Association for Computational Linguistics, pp. 5867–5872.
- A language model based evaluator for sentence compression. In Proc. Annual Meeting of the Association for Computational Linguistics, pp. 170–175.