Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded

Ramprasaath R. Selvaraju        Stefan Lee        Yilin Shen        Hongxia Jin
Dhruv Batra          Devi Parikh
Georgia Institute of Technology, Samsung Research America, Facebook AI Research
{ramprs, steflee, dbatra, parikh}@gatech.edu
{yilin.shen, hongxia.jin}@samsung.com
Abstract

Many vision and language models suffer from poor visual grounding – often falling back on easy-to-learn language priors rather than associating language with visual concepts. In this work, we propose a generic framework which we call Human Importance-aware Network Tuning (HINT) that effectively leverages human supervision to improve visual grounding. HINT constrains deep networks to be sensitive to the same input regions as humans. Crucially, our approach optimizes the alignment between human attention maps and gradient-based network importances – ensuring that models learn not just to look at but rather rely on visual concepts that humans found relevant for a task when making predictions. We demonstrate our approach on Visual Question Answering and Image Captioning tasks, achieving state-of-the-art for the VQA-CP dataset which penalizes over-reliance on language priors.

1 Introduction

Many popular and well-performing models for multi-modal, vision-and-language tasks exhibit poor visual grounding – failing to appropriately associate words or phrases with the image regions they denote and relying instead on superficial linguistic correlations [2, 1, 37, 11, 13]. For example, a model may answer the question “What color are the bananas?” with “yellow” regardless of the ripeness evident in the image. When challenged with datasets that penalize reliance on these sorts of biases [2, 11], state-of-the-art models demonstrate significant drops in performance despite there being no change to the set of visual and linguistic concepts about which models must reason.

In addition to these diagnostic datasets, another powerful class of tools for observing this shortcoming has been gradient-based explanation techniques [28, 36, 27, 24] which allow researchers to examine which portions of the input models rely on when making decisions. Applying these techniques has shown that vision-and-language models often focus on seemingly irrelevant or contextual image regions that differ significantly from where human subjects fixate when asked to perform the same tasks [7, 25].

While somewhat dissatisfying, these findings are not wholly surprising – after all, standard practices do not provide any guidance for visual grounding. Instead, models are trained on input-output pairs and must resolve grounding from co-occurrences – a challenging task, especially in the presence of more direct and easier-to-learn correlations in language. Consider our previous example question: the words ‘color’, ‘banana’, and ‘yellow’ are given as discrete tokens that trivially match every occurrence of the underlying concepts. In contrast, actually grounding this question requires dealing with all visual variations of bananas and learning the common feature of things described as ‘yellow’. To combat this tendency, we explore how to provide grounding supervision directly.


Figure 1: Our approach, HINT, provides a general framework for aligning the explanations that models produce with the spatial input regions that humans deemed important for a task.

Towards this end, we introduce a generic, second-order approach that updates model parameters to better align gradient-based explanations with human attention maps. Our approach, which we call Human Importance-aware Network Tuning (HINT), enforces a ranking loss between human annotations of input importance and gradient-based explanations produced by a deep network – updating model parameters via a gradient-of-gradient step. Importantly, this constrains models to not only look at the correct regions but also to be sensitive to the content present there when making predictions. This forces models to base their decisions on the same regions as human respondents, providing explicit grounding supervision. While we explore applying HINT to vision-and-language problems, the approach is general and can be applied to focus model decisions on specific inputs in any context.

We apply HINT to two task domains – Visual Question Answering (VQA) [5] and image captioning [15] – and find our simple approach significantly improves visual grounding. We achieve a new state-of-the-art on the challenging VQA Under Changing Priors (VQA-CP) dataset [2] designed to penalize models with poor visual grounding. In both VQA and Image Captioning, we see significantly improved correlations between human attention and visual explanations for HINT trained models on new images, showing that these models learn generalizable notions of visual grounding. We also perform human preference studies to determine if this improved correlation results in increased notions of model reasonableness.

Many existing vision and language models contain mechanisms that attend to image regions based on language, predicting a weight for each image region. Interestingly, we find directly supervising these attention weights with human attention to be ineffective. While the models learn to attend to the correct regions during training, they fail to rely on them when making predictions – highlighting the importance of supervising gradient-based explanations in HINT.

Contributions. We summarize our contributions as:

  • We introduce Human Importance-aware Network Tuning (HINT), a general framework for constraining the sensitivity of deep networks to specific input regions, and demonstrate that it results in significantly improved visual grounding for two vision and language tasks.

  • Using HINT, we set a new state-of-the-art in the language-bias sensitive Visual Question Answering Under Changing Priors (VQA-CP) dataset [2].

2 Related Work

Model Interpretability.  There has been significant recent interest in building machine learning models that are transparent and interpretable in their decision making process. For deep networks, several works propose explanations based on internal states or structures of the network [35, 12, 38, 25]. Most related to our work is the approach of Selvaraju et al. [25] which computes neuron importance as part of a visual explanation pipeline. In this work, we enforce that these importance scores match importances provided by domain experts.

Vision and Language Tasks. Image Captioning [16] and Visual Question Answering (VQA) [5] have emerged as two of the most widely studied vision-and-language problems. Image captioning requires generating a natural language description of image contents, and VQA requires answering free-form questions about images. In both, models must learn to associate image content with complex free-form text. Consequently, attention-based models that explicitly reason about image-text correspondences have become the dominant paradigm [26, 31, 14, 33, 20, 3] for these tasks; however, there is growing evidence that even these attentional models still latch onto and leverage language biases [2, 37, 4].

Recently, Agrawal et al. [2] introduced a novel, bias-sensitive dataset split for the VQA task. This split, called VQA Under Changing Priors (VQA-CP), is constructed such that the answer distributions differ significantly between training and test. As such, models that memorize language associations in training instead of actually grounding their answers in image content will perform poorly on the test set. Likewise Lu et al. [20] introduce a robust captioning split of the COCO captioning dataset [16] in which the distribution of co-occurring objects differs significantly between training and test. We use these dataset splits to evaluate the impact of our method on visual grounding.

Debiasing Vision and Language Models. A number of recent works have aimed to reduce the effect of language bias in vision and language models.

Hendricks et al. [4] study the generation of gender-specific words in image captioning, showing that models nearly always associate male-gendered words with people performing extreme sports like snowboarding, regardless of the image content. Their Equalizer approach encourages models to adjust their confidence depending on the evidence present – confident when gender evidence is visible and unsure when it is occluded by ground-truth segmentation masks. Experiments on a set of captions containing people show this approach reduces gender bias.

For VQA, Agrawal et al. [2] developed a Grounded VQA model (GVQA) that explicitly disentangles the vision and language components – consisting of separate visual concept and answer cluster classifiers. This approach uses a question’s type (e.g. “What color …”) to determine the space of possible answers and the question target (e.g. “banana”) to detect visual attributes in the scene that are then filtered by the possible answer set. While effective, this approach requires multi-stage training and is difficult to extend to new models. In follow-up work, Ramakrishnan et al. [23] introduce an adversarial regularization technique to reduce bias in VQA models that is model agnostic – pitting the model against a question-only adversary.

Goyal et al. [11] noted inherent language biases in the VQAv1 [5] dataset that are easily exploited by deep models. They therefore collected a complementary image for each question in VQAv1 such that the answer to the question differs for the new image, creating a balanced VQA dataset that is harder for models to solve by exploiting language biases. However, this is a very expensive way to make vision and language models less biased.

In contrast, our approach directly incorporates human supervision for visual grounding – forcing models to base their decisions on the same regions as human respondents. Rather than relying on collecting expensive annotation or creating novel splits, our approach uses annotations from existing datasets to improve visual grounding. We show that this improved visual grounding results in significant improvements on bias-sensitive datasets.

Human Attention for VQA. Das et al. [7] collected human attention maps for a subset of the VQA dataset [5]. Given a question and a blurry image, humans were asked to interactively deblur regions in the image until they could confidently answer. In this work we utilize these maps, enforcing the gradient-based visual explanations of model decisions to match the human attention closely.

Supervising model attention. Liu et al. [17] and Qiao et al. [22] apply human attention supervision to the attention maps produced by models for image captioning and VQA, respectively. Model attention alone is a bottom-up computation that relies only on the image (and the question in the case of VQA). Even with appropriate model attention, the remaining network layers may still disregard the visual signal in the presence of strong dataset biases. In contrast, gradient-based explanations directly link model decisions to input regions, so aligning these importances ensures the model bases its decisions on human-attended regions.

3 Preliminaries

Recall that our approach takes a pretrained base model and tunes it through a loss imposed on the regions that the model looks at while making decisions. For completeness, this section provides an overview of the base model we use in this work; we encourage readers to consult the original papers for further details.

3.1 Bottom-up Top-down attention model

In this work we take the recent Bottom-up Top-down architecture as our base model. A number of works [32, 9, 34, 30, 18, 33, 19] use top-down attention mechanisms to support the fine-grained and multi-stage reasoning that has been shown to be important for vision and language tasks. Anderson et al. [3] propose a variant of the traditional attention mechanism: instead of attending over convolutional features, they attend at the level of objects and other salient image regions, which gives significant improvements in VQA and captioning performance.

Bottom-up Top-down Attention for VQA. As shown in the left half of Fig. 2, given an image, the Bottom-up Top-down (UpDn) attention model takes as input up to 36 image features, each encoding a salient region of the image. These regions and their features are proposals extracted from Faster R-CNN [10]. The question is encoded using a GRU [6], and a soft attention over the proposal features is computed using the question embedding. The attention-pooled visual feature is combined with the question feature using a few fully connected layers that predict the answer to the question.
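To make this description concrete, below is a minimal PyTorch-style sketch of the forward pass just described. The class name, layer sizes, and the multiplicative fusion of image and question features are illustrative assumptions, not the authors' exact implementation.

```python
# Minimal sketch of the UpDn VQA forward pass described above.
# Dimensions and module choices are illustrative assumptions.
import torch
import torch.nn as nn

class UpDnVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, v_dim=2048, q_dim=1024, h_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, q_dim, batch_first=True)              # question encoder
        self.att = nn.Sequential(nn.Linear(v_dim + q_dim, h_dim),
                                 nn.ReLU(), nn.Linear(h_dim, 1))     # top-down attention
        self.v_proj = nn.Linear(v_dim, h_dim)
        self.q_proj = nn.Linear(q_dim, h_dim)
        self.classifier = nn.Sequential(nn.Linear(h_dim, h_dim),
                                        nn.ReLU(), nn.Linear(h_dim, num_answers))

    def forward(self, v, q_tokens):
        # v: (B, 36, v_dim) proposal features; q_tokens: (B, T) word indices
        _, q = self.gru(self.embed(q_tokens))                        # (1, B, q_dim)
        q = q.squeeze(0)
        q_tiled = q.unsqueeze(1).expand(-1, v.size(1), -1)
        att = torch.softmax(self.att(torch.cat([v, q_tiled], -1)), dim=1)  # (B, 36, 1)
        v_pooled = (att * v).sum(1)                                  # attention-pooled feature
        joint = self.v_proj(v_pooled) * self.q_proj(q)               # fuse image and question
        return self.classifier(joint)                                # answer scores
```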

Bottom-up Top-down Attention for Image Captioning. The image captioning model consists of two Long Short-Term Memory (LSTM) networks – an attention LSTM and a language LSTM. The first LSTM is a top-down visual attention model whose input at each time step is the previous hidden state of the language LSTM, concatenated with the mean-pooled bottom-up proposal features (as above) and an encoding of the previously generated word. The output of the attention LSTM is used to compute a soft attention over the proposal features. The second LSTM is a language-generation LSTM that takes as input the attended features concatenated with the output of the attention LSTM, and produces a distribution over the vocabulary for the next time step.
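A single decoding step of this two-LSTM design can be sketched as below; the class name, single-step interface, and dimensions are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of one decoding step of the two-LSTM captioning model
# (attention LSTM + language LSTM); sizes are illustrative.
import torch
import torch.nn as nn

class UpDnCaptionStep(nn.Module):
    def __init__(self, vocab_size, v_dim=2048, e_dim=300, h_dim=1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, e_dim)
        self.att_lstm = nn.LSTMCell(h_dim + v_dim + e_dim, h_dim)    # attention LSTM
        self.att = nn.Sequential(nn.Linear(v_dim + h_dim, h_dim),
                                 nn.ReLU(), nn.Linear(h_dim, 1))
        self.lang_lstm = nn.LSTMCell(v_dim + h_dim, h_dim)           # language LSTM
        self.logit = nn.Linear(h_dim, vocab_size)

    def forward(self, v, prev_word, state_att, state_lang):
        # v: (B, 36, v_dim) proposals; prev_word: (B,) previous word index
        x_att = torch.cat([state_lang[0], v.mean(1), self.embed(prev_word)], -1)
        h_att, c_att = self.att_lstm(x_att, state_att)
        h_tiled = h_att.unsqueeze(1).expand(-1, v.size(1), -1)
        att = torch.softmax(self.att(torch.cat([v, h_tiled], -1)), dim=1)
        v_hat = (att * v).sum(1)                                     # attended feature
        h_lang, c_lang = self.lang_lstm(torch.cat([v_hat, h_att], -1), state_lang)
        return self.logit(h_lang), (h_att, c_att), (h_lang, c_lang)  # next-word scores
```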


Figure 2: Our Human Importance-aware Network Tuning (HINT) approach: Given an image and a question like “Did he hit the ball?”, we pass them through the Bottom-up Top-down architecture shown in the left half. For the example shown, the model incorrectly answers ‘no’. For the ground-truth answer ‘yes’, we determine the proposals important for the decision through Grad-CAM based proposal importance. We rank the proposals according to human attention and apply a ranking loss to align the network’s importances with the human importances. Tuning the model through HINT makes the model not only answer correctly, but also look at the right regions, as shown on the right.

3.2 Faithfulness of gradient-based explanations

Gradient-based explanations are more faithful to model decisions than model attention. Faithfulness refers to the ability of an explanation to reflect the underlying function learned by the model. To demonstrate faithfulness we perform occlusion studies similar to [24]: we measure the difference in the model’s score for the predicted answer when each proposal feature is masked in turn and the input is forward propagated, taking this difference as an importance score for each proposal. The rank correlation of model attention with occlusion-based importance is substantially lower than that of gradient-based importances, supporting our claim. Hence, aligning gradient-based explanations to human importance helps ensure the model is actually right for the right reasons.
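The occlusion procedure can be sketched as below, assuming a batch of one example and any UpDn-style model with the forward signature used earlier; the function name is hypothetical.

```python
# Sketch of the occlusion study used to measure faithfulness: mask each
# proposal feature in turn and record the drop in the predicted-answer score.
import torch

def occlusion_importance(model, v, q_tokens):
    # v: (1, 36, v_dim) proposal features; q_tokens: (1, T) question indices
    with torch.no_grad():
        scores = model(v, q_tokens)                    # (1, num_answers)
        pred = scores.argmax(dim=-1)                   # predicted answer index
        base = scores[0, pred]
        deltas = []
        for i in range(v.size(1)):                     # iterate over proposals
            v_masked = v.clone()
            v_masked[:, i] = 0.0                       # occlude proposal i
            drop = base - model(v_masked, q_tokens)[0, pred]
            deltas.append(drop.item())
    return deltas                                      # per-proposal importance
```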

4 Human Importance-aware Network Tuning

In this section, we describe our framework for teaching deep networks to rely on the same regions as humans, which we call Human Importance-aware Network Tuning (HINT). The premise of our work is as follows: humans tend to rely on some portions of the input more than others when making decisions, and our approach ensures that those portions of the input are relevant for the model as well. HINT computes the concepts important to the network through gradient-based explanations and tunes the network parameters so as to align them with the concepts deemed important by humans. We use the generic term ‘decision’ to refer to both the answer in the case of VQA and the words generated at each time step in the case of image captioning. While our approach is generic and can be applied to any architecture, below we describe HINT in the context of the Bottom-up Top-down model for VQA and captioning.

4.1 Human Importance

In this step we convert the expert knowledge obtained from humans into a form corresponding to the network inputs. The Bottom-up Top-down model [3] takes region proposals as input. For a given image and question (in the case of VQA), we compute an importance score for each proposal for the correct decision based on the normalized human attention map energy inside the proposal box relative to the normalized energy outside the box.

More concretely, consider an importance map $A^d \in [0,1]^{H \times W}$ that indicates the spatial regions of support for a decision $d$, with higher values in more important regions. Given a proposal region $r_i$ with area $|r_i|$, we can write the normalized importance inside and outside $r_i$ for decision $d$ as

$E_{in}(r_i, d) = \frac{1}{|r_i|} \sum_{p \in r_i} A^d_p \qquad \text{and} \qquad E_{out}(r_i, d) = \frac{1}{HW - |r_i|} \sum_{p \notin r_i} A^d_p$

respectively. We compute the overall importance score $s_i^d$ of $r_i$ for decision $d$ as the difference between these normalized energies:

$s_i^d = E_{in}(r_i, d) - E_{out}(r_i, d) \qquad (1)$
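A short sketch of this computation, following the inside-minus-outside form of Eq. (1) above; the attention map is assumed to be a 2D array in $[0,1]$ and boxes are integer (x1, y1, x2, y2) coordinates.

```python
# Sketch of Eq. (1): score each proposal by the normalized human-attention
# energy inside its box minus the normalized energy outside it.
import numpy as np

def human_importance(A, boxes):
    # A: (H, W) human attention map in [0, 1]; boxes: list of (x1, y1, x2, y2)
    H, W = A.shape
    total = A.sum()
    scores = []
    for (x1, y1, x2, y2) in boxes:
        inside = A[y1:y2, x1:x2].sum()
        area = max((y2 - y1) * (x2 - x1), 1)
        e_in = inside / area                              # normalized energy inside
        e_out = (total - inside) / max(H * W - area, 1)   # normalized energy outside
        scores.append(e_in - e_out)
    return np.array(scores)                               # one score per proposal
```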

Human attention for VQA and captioning.  For VQA, we use the human attention maps collected by Das et al. [8] for a subset of the VQA [5] dataset. These HAT maps are available for a total of 40,554 image-question pairs, corresponding to approximately 6% of the VQA dataset, which consists of a total of 658,111 image-question pairs (train + val). While human attention maps do not exist for image captioning, the COCO dataset [15] has segmentation annotations for 80 everyday object categories. We use an object-category-to-word mapping that maps categories like person to a list of potential fine-grained labels such as [“child”, “man”, “woman”, …], similar to [20]. We map a total of 830 visual words occurring in COCO captions to the 80 COCO categories, and then use the segmentation annotations for these categories as human attention for this subset of matching words.

4.2 Network Importance

We define Network Importance as the importance (weight) that a given trained network places on spatial regions of the input when making a decision. Selvaraju et al. [25] proposed an approach to compute the importance of the last convolutional layer’s neurons, which serve as the best compromise between high-level semantics and detailed spatial information. Since proposals typically cover objects and salient or semantic regions of interest while providing good spatial resolution, we naturally extend [25] to compute importance over proposals. Given a proposal $r_i$ with embedding $\mathbf{v}_i$, its importance $\alpha_i^d$ for predicting the ground-truth decision $d$ can be computed as,

$\alpha_i^d = \mathbf{1}^\top \left( \frac{\partial S_d}{\partial \mathbf{v}_i} \right) \qquad (2)$

where $S_d = \mathbf{y}_d^\top \mathbf{o}$ is the score for the ground-truth decision $d$, selected from the output scores $\mathbf{o}$ with a one-hot encoding $\mathbf{y}_d$ (the answer in VQA and the visual word in the case of captioning). Note that we compute the importance for the ground-truth decision, not the predicted one. Human attention for incorrect decisions is not available and is intuitively non-existent, as there is no evidence in the image for incorrect predictions.
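A sketch of Eq. (2) using automatic differentiation is given below; treating the input proposal features as the embeddings $\mathbf{v}_i$ is a simplifying assumption, and the function name is hypothetical.

```python
# Sketch of Eq. (2): proposal importance as the gradient of the ground-truth
# score with respect to each proposal embedding, summed over feature dims.
# create_graph=True keeps the computation graph so this importance can itself
# be optimized through (see Sec. 4.3).
import torch

def network_importance(model, v, q_tokens, gt_answer):
    v = v.clone().requires_grad_(True)                   # (B, 36, v_dim)
    scores = model(v, q_tokens)                          # (B, num_answers)
    gt_score = scores.gather(1, gt_answer.view(-1, 1)).sum()   # score of GT decision
    grads, = torch.autograd.grad(gt_score, v, create_graph=True)
    return grads.sum(dim=-1)                             # (B, 36) per-proposal importance
```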

Importance for VQA and Captioning.  For VQA, we compute the importance for the ground-truth answer. For captioning, we compute importance for each visual word in the ground-truth caption that maps to a COCO category.

4.3 Human-Network Importance Alignment

At this stage, we have two sets of importance scores – one computed from human attention and another from network importance – that we would like to align. Each set of scores is calibrated within itself; however, absolute values are not comparable between the two, since human importance lies in a bounded range while network importance is unbounded. Consequently, we focus on the relative rankings of the proposals, applying a ranking loss – specifically, a variant of the Weighted Approximate-Rank Pairwise (WARP) loss.

Ranking loss.  At a high level, our ranking loss considers all possible pairs of proposals and finds those pairs where the pairwise ranking based on network importance disagrees with the ranking from human importance. Let $\mathcal{P}$ denote the set of all such misranked pairs. For each pair in $\mathcal{P}$, the loss is incremented by the absolute difference between the network importance scores of the two proposals:

$\mathcal{L}_{rank} = \sum_{(i,j) \in \mathcal{P}} \left| \alpha_i - \alpha_j \right| \qquad (3)$

where $\alpha_i$ and $\alpha_j$ are the network importances of proposals $r_i$ and $r_j$ whose order does not align with human importance, and $(i, j) \in \mathcal{P}$ indicates that $r_i$ is more important than $r_j$ according to human importance but not according to the network.
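A vectorized sketch of this pairwise loss for a single example; the function name is hypothetical.

```python
# Sketch of Eq. (3): a pairwise ranking loss over proposals whose order under
# network importance disagrees with the order under human importance.
import torch

def hint_rank_loss(net_imp, human_imp):
    # net_imp, human_imp: (N,) importance scores for the N proposals of one example
    ni, nj = net_imp.unsqueeze(1), net_imp.unsqueeze(0)       # pairwise grids
    hi, hj = human_imp.unsqueeze(1), human_imp.unsqueeze(0)
    misranked = (hi > hj) & (ni < nj)        # human says i > j, network says i < j
    return (nj - ni)[misranked].sum()        # penalize by the size of the disagreement
```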

Importance of task loss. We observe that, in order to stabilize training, it is necessary to retain the task loss – cross-entropy in the case of VQA and negative log-likelihood in the case of image captioning. The final HINT loss thus becomes,

$\mathcal{L}_{HINT} = \mathcal{L}_{rank} + \lambda \, \mathcal{L}_{task} \qquad (4)$

The first term encourages the network to base decisions on the correct regions and the second term encourages it to actually make the right decision.

Since the network importances are gradients of the score with respect to the proposal embeddings, they are a function of all intermediate parameters of the network, from the model attention layer weights to the final fully connected layer weights. Hence, an update through an optimization algorithm (gradient descent or Adam) with the loss in (4) requires computing second-order gradients, and this update affects all parameters of the network.
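Putting the pieces together, the sketch below shows one HINT fine-tuning step using the network_importance and hint_rank_loss helpers sketched earlier; the batch-of-one assumption and the weighting argument lam are illustrative, not the authors' exact training loop.

```python
# Sketch of one HINT fine-tuning step implementing Eq. (4). Because the rank
# loss depends on gradients of the score, backpropagating through it requires
# second-order gradients, enabled by create_graph=True in network_importance.
import torch
import torch.nn.functional as F

def hint_step(model, optimizer, v, q_tokens, gt_answer, human_imp, lam=1.0):
    net_imp = network_importance(model, v, q_tokens, gt_answer)   # differentiable importances
    rank_loss = hint_rank_loss(net_imp[0], human_imp)             # assume batch of one
    task_loss = F.cross_entropy(model(v, q_tokens), gt_answer)    # standard VQA loss
    loss = rank_loss + lam * task_loss                            # Eq. (4)
    optimizer.zero_grad()
    loss.backward()     # gradient-of-gradient step; updates all network parameters
    optimizer.step()
    return loss.item()
```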

5 Experiments and Analysis

In this section we describe the experimental evaluation of our approach on VQA and Image Captioning.

VQA. For VQA, we evaluate on the VQA-CP [2] dataset split. Recall from Section 2 that VQA-CP is a restructuring of VQAv2 [11] designed such that the answer distribution in the training set differs significantly from that of the test set. For example, while the most popular answer in train for “What sport …” questions might be “tennis”, in test it might be “volleyball”. Without proper visual grounding, models trained on this dataset will generalize poorly to the test distribution. In fact, [2] and [23] report significant performance drops for state-of-the-art VQA models on this challenging, language-bias sensitive split. For this experiment, we pretrain our Bottom-up Top-down model on VQA-CP before fine-tuning with the HINT loss. We also report results on VQAv2 for completeness and to assess our approach’s effect in standard settings. Recall that our approach includes the task loss, weighted by λ, for stable training. Ablation studies varying λ and the number of proposals can be found in the appendix.

We compare our approach against strong baselines and existing approaches, specifically:

  • Base Model (UpDn) We compare to the base Bottom-up Top-down model without our HINT loss.

  • Attention Alignment (Attn. Align.) The Bottom-up Top-down model uses soft attention over object proposals – essentially predicting a set of attention scores for object proposals based on their relevancy to the question. These attention scores are much like the network importances we compute in HINT; however, they are functions only of the network prior to their prediction. We try directly applying the ranking loss from HINT between these attention weights and human importances as computed in (1).

  • Grounded VQA (GVQA). As discussed in Section 2, [2] introduced a grounded VQA model (GVQA) that explicitly disentangles vision and language components and was developed alongside the VQA-CP dataset.

  • Adversarial Regularization (AdvReg). [23] introduced an adversarial regularizer to reduce the effect of language-bias in VQA.

Image Captioning. We also apply HINT to the Bottom-up Top-down captioning model pretrained on the COCO [15] captioning dataset. We show results on two splits – the standard ‘Karpathy’ split and the robust captioning split introduced by Lu et al. [20], in which the distribution of co-occurring objects differs significantly between train and test.

In order to obtain grounding supervision for HINT, we first manually map 830 visual words from the captioning dataset to the 80 COCO categories and use the corresponding binary segmentation masks from the COCO dataset. We pretrain the captioning model and then fine-tune with HINT, applying the HINT loss only at the time steps corresponding to these visual words.
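A minimal sketch of this supervision construction is shown below; word2cat and cat_masks are assumed precomputed lookups (visual word → COCO category, category → binary mask for the image), and all names are hypothetical.

```python
# Sketch of building captioning supervision: map visual words in the caption
# to COCO categories and use their segmentation masks as "human attention"
# for those time steps only.
def caption_attention_maps(caption_tokens, word2cat, cat_masks):
    supervised = {}
    for t, word in enumerate(caption_tokens):
        cat = word2cat.get(word)            # e.g. "man" -> "person"
        if cat is not None and cat in cat_masks:
            supervised[t] = cat_masks[cat]  # binary mask acts as human attention
    return supervised                       # {time step: attention map}
```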

5.1 HINT for VQA

Table 1 shows the results for VQA-CP test and VQAv2 val. We find that our HINTed Bottom-up Top-down model significantly improves not only over its base performance (an 8 percentage point gain in overall accuracy) but also over existing approaches based on the same model (41.17 for AdvReg vs. 47.7 for HINT), setting a new state-of-the-art for this problem. Furthermore, it is less destructive to VQAv2 accuracy than previous approaches – showing that grounding does not substantially hurt performance even when strong language biases are not penalized, unlike the other bias-reducing approaches in Table 1.

It is of course important to note that our approach does leverage additional supervision in the form of human attention maps. The attention alignment baseline also leverages these maps but does not improve over the base model despite its face-value similarity to our approach. The key difference between these methods is that HINT computes importance with respect to model decisions, whereas attention alignment simply requires the model to predict human-like attention, not necessarily to rely on it when making decisions. We argue this results from gradient-based explanations being 1) more faithful to model decisions than model attention (Sec. 3.2) and 2) a function of all network parameters. This shows that attention only loosely corresponds to how the model actually arrives at its decision. Hence, aligning gradient-based explanations to human importance helps ensure the model is right for the right reasons.

Model                     VQA-CP test                           VQAv2 val
                          Overall  Yes/No  Number  Other        Overall  Yes/No  Number  Other
SAN [33]                  24.96    38.35   11.14   21.74        52.41    70.06   39.28   47.84
UpDn [3]                  39.49    45.21   11.96   42.98        62.85    80.89   42.78   54.44
GVQA [2]                  31.30    57.99   13.68   22.14        48.24    72.03   31.17   34.65
UpDn + Attn. Align        38.5     42.5    11.39   43.78        60.95    78.89   38.40   53.29
UpDn + AdvReg [23]        41.17    65.49   15.48   35.48        62.75    79.84   42.35   55.16
UpDn + HINT (ours)        47.7     70.04   10.68   46.31        62.35    80.49   41.75   54.01
Table 1: VQA results on the compositional (VQA-CP) and standard splits. Our approach (HINT) gives a significant boost of over 8% compared to the base model UpDn, whereas our Attn Align baseline performs worse than the base model. On the standard split, we observe only a slight drop in performance (0.5%) compared to the 1.9% drop for the Attn Align baseline. Results for prior methods are taken from the corresponding papers.
Figure 3: Effect of the number of human attention maps. The x-axis goes from using no HINT supervision to using all of the human attention maps during training, which amounts to approximately 6% of the VQAv2 training data.
Figure 4: Qualitative comparison of models before and after applying HINT. The left column shows the input image along with the question and the ground-truth (GT) answer from the VQA-CP val split. In the middle column, for the base model we show the explanation visualization for the GT answer along with the model’s answer. Similarly we show the explanations and predicted answer for the HINTed models in the third column. We see that the HINTed model looks at more appropriate regions and answers better. For example, for the example in the middle row of left column, the base model only looks at the boy, and after we apply HINT, it looks at both the boy and the skateboard in order to answer ‘Yes’. After applying HINT, the model also changes its answer from ‘No’ to ‘Yes’.
Figure 5: Qualitative comparison of captioning models before and after applying HINT. The left column shows the input image along with the ground-truth caption from the COCO robust split. In the middle column, for the base model we show the explanation visualization for the visual word mentioned below. Similarly we show the explanations for the HINTed models in the third column. We see that the HINTed model looks at more appropriate regions. For example in (a) note how the HINTed model correctly localizes the fork, apple and the orange accurately when generating the corresponding visual words, but the base model fails to do so. Interestingly note how the model is able to ground even the shadow of the cat in (h)!

Qualitative examples.  Fig. 4 shows qualitative examples of the effect of applying HINT to the Bottom-up Top-down VQA model. Fig. 4 (a) shows an image and the question “Is this the right sized skateboard for him?”. Not only does the base model incorrectly answer “no”, it also cannot localize the right answer – it looks at the hair for “yes”. The HINTed model answers “yes” correctly and looks at both the skateboard and the boy. For the image in Fig. 4 (b) and the question “What color are the signs?”, the base model answers “Red”, which is partially correct, but it fails to ground the answer correctly. The HINTed model not only answers “Red and White” correctly but also looks at the red stop sign and the white street sign. For the images in the last row, both the baseline and the HINTed models predict the same answer, but the HINTed models also look at the right regions – in (e) the HINTed model correctly localizes the mustache, and in (f) it looks at the court.

Varying the amount of HINT supervision: Figure 3 shows performance for different amounts of human attention maps on VQA-CP. We see a steady increase in performance as human supervision increases (the maximum available corresponds to about 6% of the training set), indicating that there is substantial scope to build models that generalize better using HINT.

5.2 HINT for Image Captioning

Our implementation of the Bottom-up Top-down captioning model in PyTorch [21] achieves a CIDEr [29] score of 1.06 on the standard split and 0.90 on the robust split. Upon applying HINT to the base model trained on the robust split, we see no change in captioning performance. For the model trained on the standard split, performance drops slightly, by 0.02 CIDEr (1.04 compared to 1.06). As we demonstrate in the following sections, the lack of improvement in score does not mean a lack of change – we find the model improves significantly at grounding.

Qualitative examples.  Fig. 5 shows qualitative examples that indicate significant improvements in the grounding performance of HINTed models. For example, Fig. 5 (a) shows how a model trained with HINT is able to simultaneously improve grounding for the three visual words present in the ground-truth caption. We find that when giraffes and zebras coexist (as in Fig. 5 (d)), the base model incorrectly focuses on the giraffe when generating ‘zebra’ and on the zebra when generating ‘giraffe’. After tuning the model using HINT, the model correctly localizes the zebra for ‘zebra’ and vice versa. We also see that HINT helps models focus on individual object occurrences rather than relying on context, as shown in Fig. 5 (c, f, g), helping them generalize better.

6 Evaluating Grounding

In Sections 5.1 and 5.2 we evaluated the effect of HINT on the task performance. In this section we evaluate the grounding ability of models tuned with HINT.

6.1 Correlation with Human Attention

In order to evaluate the grounding ability of models before and after applying HINT, we compare the network importances as in (2) for the ground-truth decision with the human attention computed as in (1) for both the base model and the model fine-tuned with HINT. We then compute the rank correlation between the network importance scores and human importance scores for images from the VQA-CP and COCO robust test splits.
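A short sketch of this evaluation, assuming net_imps and human_imps are per-example importance arrays computed as in Eqs. (2) and (1); the function name is hypothetical.

```python
# Sketch of the grounding evaluation: Spearman rank correlation between
# network importances and human importances, averaged over the test split.
from scipy.stats import spearmanr
import numpy as np

def grounding_correlation(net_imps, human_imps):
    # net_imps, human_imps: lists of per-proposal score arrays, one per example
    corrs = [spearmanr(n, h).correlation for n, h in zip(net_imps, human_imps)]
    return float(np.nanmean(corrs))        # ignore examples with undefined correlation
```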

VQA. For the model trained on VQA-CP, we find that the base model obtains a Spearman’s rank correlation of 0.007 with human attention maps [8] and the model after HINTing obtains a correlation of 0.06.


Figure 6: AMT interface for evaluating the baseline captioning model and our HINTed model. The HINTed model outperforms the baseline model in terms of human trust.

Image Captioning. For the model trained on the COCO robust split, the base model achieves a rank correlation of 0.008 with COCO segmentation maps for the visual words, and the model after HINTing achieves a correlation of 0.17.

Of course, this rank correlation measure matches the intent of the rank-based HINT loss, but this result shows that the visual grounding learned during training generalizes to new images and language contexts!

6.2 Evaluating Trust

In the previous section we evaluated whether HINTed models attend to the same regions as humans when making decisions. Having established that they do, we turn to understanding whether this improved grounding translates to increased human trust in HINTed models.

We conduct human studies to evaluate whether, based on explanations of individual predictions from two models – one with improved grounding through HINT and a base model – humans find either model more trustworthy. In order to tease apart the effect of grounding from the accuracy of the models being visualized, we only visualize decisions corresponding to the ground truth (the answer in the case of VQA and the caption in the case of image captioning).

For a given ground-truth caption, we show study participants the network importance explanation for a ground-truth visual word as well as the whole caption. Workers were then asked to rate the reasonableness of the two models relative to each other on a five-point scale: clearly more/less reasonable, slightly more/less reasonable, or equally reasonable. This interface is shown in Figure 6.

In order to eliminate any bias, the base model and the HINTed model were assigned to be ‘model1’ with approximately equal probability. Human scores are then post-hoc ‘re-aligned’ such that model1 is always the HINTed model and model2 is always the base model; a positive average score thus indicates a preference for the HINTed model, and a negative score indicates a preference for the base model.

In total, Amazon Mechanical Turk (AMT) workers participated in the study, which contained 1000 image pairs (using the interface shown in Fig. 6). In 49.9% of instances, participants preferred HINT, compared to only 33.1% for the base model. These results indicate that HINT helps models look at appropriate regions, and that this in turn makes the model more trustworthy.

7 Conclusion

We presented Human Importance-aware Network Tuning (HINT), a general framework for aligning network sensitivity with the spatial input regions that humans deem relevant to a task. We demonstrated the method’s effectiveness at improving visual grounding in vision and language tasks such as Visual Question Answering and Image Captioning. We also showed that better grounding not only improves the generalization capability of models to arbitrary test distributions, but also improves the trustworthiness of the model.

Taking a broader view, the idea of constraining network gradients to achieve desired computational properties (grounding in our case) may prove to be more widely applicable to problems outside of vision and language – enabling users to provide focused feedback to networks. We leave exploring these ideas further as future work.

8 Acknowledgements

This work was supported in part by NSF, AFRL, DARPA, Siemens, Samsung, Google, Amazon, ONR YIPs and ONR Grants N00014-16-1-{2713,2793}. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government, or any sponsor.

Appendix A Ablation

We perform ablation studies on the VQA model trained on the VQA-CP dataset. Specifically, we vary the strength of the regularization coefficient λ and the number of proposals used for HINT fine-tuning. In all our experiments we use a fixed learning rate and stop training at convergence.

a.1 Varying strength of regularization coefficient

One key component of our HINT approach is the regularizer, which enforces that the tuned weights do not change much from the pre-trained weights – avoiding arbitrary scaling of the learned weights and the bias this could introduce. To explore the effect of the regularizer, we vary its coefficient λ over a range of values. The results can be found in Table 2. We find that an intermediate setting of λ achieves the best task performance (47.7).

Method             Accuracy on VQA-CP
UpDn               39.48
HINT (λ = )        42.0
HINT (λ = )        47.7
HINT (λ = )        40.0
HINT (λ = )        40.13
Table 2: Accuracy on the VQA-CP dataset when varying the strength of the regularization coefficient λ. Higher is better.

a.2 Varying number of proposals

We vary the number of proposals on which the HINT loss (Eq. 3 of the main paper) is applied. The results can be found in Table 3. We find that using just 20 proposals already gives an accuracy of 47.15, close to the 47.7 achieved with all 36 proposals.

Method                           Accuracy on VQA-CP
UpDn                             39.48
HINT (#proposals = 5)            41.44
HINT (#proposals = 10)           43.35
HINT (#proposals = 15)           45.71
HINT (#proposals = 20)           47.15
HINT (#proposals = 25)           47.52
HINT (#proposals = 30)           47.68
HINT (#proposals = 36, all)      47.7
Table 3: Accuracy on the VQA-CP dataset when varying the number of proposals. Higher is better.
Figure 7: Qualitative comparison of models before and after applying HINT. The left column shows the input image along with the question and the ground-truth (GT) answer from the VQA-CP val split. In the middle column, for the base model we show the explanation visualization for the GT answer along with the model’s answer. Similarly we show the explanations and predicted answer for the HINTed models in the third column. We see that the HINTed model looks at more appropriate regions and answers better.

Appendix B Qualitative examples

In Fig. 7 we show examples of applying HINT to the Bottom-up Top-down VQA model. The left column shows the input image along with the question and the ground-truth (GT) answer from the VQA-CP val split. In the middle column, for the base model, we show the explanation visualization for the GT answer along with the model’s answer. Similarly, we show the explanations and predicted answers for the HINTed model in the third column. We see that the HINTed model looks at more appropriate regions than the base model.

Fig. 7 (a) shows an image and the question “Is the person screaming?”. Not only does the base model incorrectly answer “no”, it also cannot localize the right answer – it looks only at the bear for “yes”. The HINTed model answers “yes” correctly and looks at both the bear and the face of the person. For the image in Fig. 7 (b) with the question “Does the building have a clock on it?”, the base model incorrectly answers “no”, whereas the HINTed model not only answers “yes” correctly, it also localizes the clock on the building. The bottom row shows two examples where HINT helps localize the right answer, although the answers from both models (base and HINTed) are incorrect.

Figure 8: Qualitative comparison of the Bottom-up Top-down captioning model before and after applying HINT. The left column shows the input image along with the ground-truth caption from the COCO robust split. In the middle column, for the base model, we show the explanation visualization for the visual word mentioned below. Similarly, we show the explanations for the HINTed model in the third column. We see that the HINTed model looks at more appropriate regions. For example, in (k) and (l), note how the HINTed model correctly localizes ‘cat’ and ‘remote’ when generating the corresponding visual words, whereas the base model fails to do so.

In Fig. 8 we show qualitative examples of the effect of applying HINT to the Bottom-up Top-down [3] captioning model trained on the robust split of the COCO dataset. The left column shows the input image along with the ground-truth caption from the COCO robust split. In the middle column, for the base model, we show the explanation visualization for the visual word mentioned below it. Similarly, we show the explanations for the HINTed model in the third column. We see that the HINTed model looks at more appropriate regions when generating the visual word indicated below each visualization.

For example, for the input images in Fig. 8 (a) and (b), the base model places little importance on the face while generating the word ‘guy’, whereas the HINTed model correctly looks at the face of the person; when generating ‘ties’, the HINTed model looks at the whole tie region while the base model does not. Similarly, for the images in (e) and (f), the HINTed model looks more accurately at the spoons for the visual word ‘spoon’ and at the sink for the word ‘sink’.

References

  • [1] A. Agrawal, D. Batra, and D. Parikh. Analyzing the behavior of visual question answering models. In EMNLP, 2016.
  • [2] A. Agrawal, D. Batra, D. Parikh, and A. Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition, 2018.
  • [3] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, 2018.
  • [4] L. Anne Hendricks, K. Burns, K. Saenko, T. Darrell, and A. Rohrbach. Women also snowboard: Overcoming bias in captioning models. 2018.
  • [5] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual Question Answering. 2015.
  • [6] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078, 2014.
  • [7] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? 2016.
  • [8] A. Das, H. Agrawal, C. L. Zitnick, D. Parikh, and D. Batra. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [9] H. Fang, S. Gupta, F. Iandola, R. K. Srivastava, L. Deng, P. Dollár, J. Gao, X. He, M. Mitchell, J. C. Platt, et al. From Captions to Visual Concepts and Back. 2015.
  • [10] R. Girshick. Fast r-cnn. In Proceedings of the IEEE international conference on computer vision, pages 1440–1448, 2015.
  • [11] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, and D. Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In CVPR, 2017.
  • [12] Y. Goyal, A. Mohapatra, D. Parikh, and D. Batra. Interpreting visual question answering models. CoRR, abs/1608.08974, 2016.
  • [13] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. 2017.
  • [14] V. Kazemi and A. Elqursh. Show, ask, attend, and answer: A strong baseline for visual question answering. CoRR, abs/1704.03162, 2017.
  • [15] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common Objects in Context. 2014.
  • [16] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In ECCV. 2014.
  • [17] C. Liu, J. Mao, F. Sha, and A. L. Yuille. Attention correctness in neural image captioning. In AAAI, 2017.
  • [18] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 6, page 2, 2017.
  • [19] J. Lu, J. Yang, D. Batra, and D. Parikh. Hierarchical question-image co-attention for visual question answering. In NIPS, 2016.
  • [20] J. Lu, J. Yang, D. Batra, and D. Parikh. Neural baby talk. In CVPR, 2018.
  • [21] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. 2017.
  • [22] T. Qiao, J. Dong, and D. Xu. Exploring human-like attention supervision in visual question answering. In AAAI, 2018.
  • [23] S. Ramakrishnan, A. Agrawal, and S. Lee. Overcoming language priors in visual question answering with adversarial regularization. In Neural Information Processing Systems (NIPS), 2018.
  • [24] R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, D. Batra, et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, 2017.
  • [25] R. R. Selvaraju, A. Das, R. Vedantam, M. Cogswell, D. Parikh, and D. Batra. Grad-CAM: Why did you say that? Visual Explanations from Deep Networks via Gradient-based Localization. 2017.
  • [26] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [27] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all convolutional net. arXiv preprint arXiv:1412.6806, 2014.
  • [28] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pages 3319–3328, 2017.
  • [29] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. CoRR, abs/1411.5726, 2014.
  • [30] Z. Y. Y. Y. Y. Wu and R. S. W. W. Cohen. Encode, review, and decode: Reviewer module for caption generation. arXiv preprint arXiv:1605.07912, 2016.
  • [31] C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In International conference on machine learning, pages 2397–2406, 2016.
  • [32] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International conference on machine learning, pages 2048–2057, 2015.
  • [33] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [34] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4651–4659, 2016.
  • [35] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. 2014.
  • [36] J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. Top-down Neural Attention by Excitation Backprop. 2016.
  • [37] P. Zhang, Y. Goyal, D. Summers-Stay, D. Batra, and D. Parikh. Yin and Yang: Balancing and answering binary visual questions. In CVPR, 2016.
  • [38] B. Zhou, A. Khosla, À. Lapedriza, A. Oliva, and A. Torralba. Object detectors emerge in deep scene cnns. CoRR, abs/1412.6856, 2014.