Towards Transparent AI Systems: Interpreting Visual Question Answering Models
Deep neural networks have shown striking progress and obtained state-of-the-art results in many AI research fields in recent years. However, it is often unsatisfying not to know why they predict what they do. In this paper, we address the problem of interpreting Visual Question Answering (VQA) models. Specifically, we are interested in finding what parts of the input (pixels in images or words in questions) the VQA model focuses on while answering the question. To tackle this problem, we use two visualization techniques – guided backpropagation and occlusion – to find important words in the question and important regions in the image. We then present qualitative and quantitative analyses of these importance maps. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.
We are witnessing excitement in the research community and a frenzy in the media regarding advances in AI. Fueled by a combination of massive datasets and advances in deep neural networks (DNNs), the community has made remarkable progress in the past few years on a variety of ‘low-level’ AI tasks such as image classification [\citenameSzegedy et al.2015], machine translation [\citenameBrea et al.2011, \citenameSutskever et al.2014], and speech recognition [\citenameHinton et al.2012]. Neural networks are also demonstrating potential in ‘high-level’ AI tasks such as learning to play Go [\citenameSilver et al.2016], answering reading comprehension questions by understanding short stories [\citenameBordes et al.2015, \citenameWeston et al.2015], and even answering questions about images [\citenameAntol et al.2015, \citenameRen et al.2015, \citenameMalinowski et al.2015].
Unfortunately, when today’s machine perception and intelligent systems fail, they fail in a spectacularly disgraceful manner, without warning or explanation, leaving the user staring at an incoherent output, wondering why the system did what it did.
In this work, we focus on Visual Question Answering, where, given an image and a free-form natural language question about the image (e.g., “What color are the girl’s shoes?”, or “Is the boy jumping?”), the machine has to produce a natural language answer as its output (e.g. “blue”, or “yes”). Specifically, we try to interpret a recent state-of-the-art VQA model [\citenameLu et al.2015] trained on the recently released VQA dataset [\citenameAntol et al.2015]. This VQA model uses a Convolutional Neural Network (CNN) based embedding of the image and a Long Short-Term Memory (LSTM) based embedding of the question, combines these two embeddings, and uses a multi-layer perceptron as a classifier to predict a probability distribution over answers.
We are interested in the question of transparency – why does a VQA system do what it does? (See \figreffig:teaser). Specifically, what evidence in the test input (image and question) supports a particular prediction? In the context of VQA, this question can be expressed as two subproblems:
What words in the question does the model “listen to” in order to answer the question?
What pixels in the image does the model “look at” while answering the question?
In this work, we use two visualization methods to tackle the above problems. The first method (\secrefsec:guidedback) uses guided backpropagation [\citenameSpringenberg et al.2015] to analyze important words in the question and important regions in the image. In the second method (\secrefsec:occlusion), we occlude portions of the input and observe the change in the model's prediction probabilities, to compute the importance of question words and image regions. In \secrefsec:results, we present qualitative and quantitative analyses of these image/question ‘importance maps’ – question importance maps are analyzed using their Part-of-Speech (POS) tags; image importance maps are compared to ‘human attention maps’, i.e. maps showing where humans look when answering a question about the image [\citenameDas et al.2016]. We found that even without explicit attention mechanisms, VQA models may sometimes be implicitly attending to relevant regions in the image, and often to appropriate words in the question.
2 Related Work
Many gradient-based methods [\citenameZeiler and Fergus2014, \citenameSimonyan et al.2014, \citenameSpringenberg et al.2015] have been proposed in recent years in the field of computer vision to visualize deep convolutional neural networks. But most of them focus on the task of image classification on iconic images, where the main object occupies most of the image. Our work differs in two ways – 1) we also compute gradients w.r.t. the input question, and 2) we use guided backpropagation [\citenameSpringenberg et al.2015] for the task of VQA, where the model can look at different regions in the same image for different questions. To the best of our knowledge, we are the first to study this problem for VQA.
Our occlusion experiment is inspired by [\citenameZeiler and Fergus2014], who mask small regions in the image with a gray patch and observe the output of an image classification model. We evaluate whether the model looks at the same regions in the image as humans do while answering a question about the image.
A few recent works [\citenameRibeiro et al.2016, \citenameBaehrens et al.2010, \citenameLiu and Wang2012] have begun to study the task of providing interpretable post-hoc explanations for classifier predictions. Such methods typically involve fitting or training a secondary interpretable mechanism on top of the base ‘black-box’ classifier predictions. In contrast, our work directly computes importance maps from the model of interest without another layer of training (which could obfuscate the analysis).
At a high level, we view a VQA model as a learned function $f$ that takes in an input image $I$ and a question $Q$ about the image, is parameterized by parameters $\theta$, and produces an answer $\hat{A} = f(I, Q; \theta)$. In order to gauge the importance of components of $I$ and $Q$ (i.e. pixels and words), we consider the best linear approximation to $f$ around each test point $(I_0, Q_0)$:
$$f(I, Q) \approx f(I_0, Q_0) + \frac{\partial f}{\partial I}\bigg|_{(I_0, Q_0)} \cdot (I - I_0) + \frac{\partial f}{\partial Q}\bigg|_{(I_0, Q_0)} \cdot (Q - Q_0)$$
Intuitively, the two key quantities we need to compute are $\frac{\partial f}{\partial I}$ and $\frac{\partial f}{\partial Q}$, i.e. the partial derivatives of the function w.r.t. each of the inputs (image and question). These expressions superficially look similar to gradients computed in backpropagation-based training of neural networks. However, there are two key differences – (i) we compute partial derivatives of the probability of the predicted output, not the ground-truth output; and (ii) we compute partial derivatives w.r.t. the inputs (i.e. image pixel intensities and word embeddings), not the parameters.
Due to the linearization above, the elements of these partial derivatives tell us the effect of the corresponding pixels/words on the final prediction. They may be computed in the following two ways.
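As an illustration of this linearization view, the sketch below estimates per-component input importances by central finite differences on a toy softmax "model"; the weights and input vector are made up for illustration, and a real VQA model would play the role of $f$.

```python
import numpy as np

# Toy differentiable "model": probability of a fixed class under a
# linear-softmax head. A stand-in for p(answer | image, question);
# the weights below are made up for illustration.
W = np.array([[0.5, -0.2, 0.1],
              [0.3,  0.4, -0.6]])

def predicted_prob(x, cls=0):
    """Softmax probability of class `cls` for input x."""
    logits = W @ x
    e = np.exp(logits - logits.max())
    return (e / e.sum())[cls]

def importance(x, cls=0, eps=1e-5):
    """Central-difference estimate of |dp/dx_i| for each input component."""
    scores = np.zeros_like(x)
    for i in range(x.size):
        hi, lo = x.copy(), x.copy()
        hi[i] += eps
        lo[i] -= eps
        scores[i] = abs(predicted_prob(hi, cls) - predicted_prob(lo, cls)) / (2 * eps)
    return scores

x = np.array([1.0, 2.0, 0.5])
print(importance(x))  # larger value => that input component matters more
```

For a real network one would read off analytic gradients from backpropagation rather than use finite differences; the finite-difference form here only makes the "best linear approximation" interpretation concrete.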
3.1 Guided Backpropagation
Guided backpropagation [\citenameSpringenberg et al.2015] is a gradient-based visualization technique used to visualize activations of neurons in different layers in CNNs. It has been shown to perform better than its counterparts such as deconvolution [\citenameZeiler and Fergus2014] especially for visualizing higher order layers. Intuitively speaking, it is a modified version of backpropagation that restricts negative gradients from flowing backwards towards input layer, resulting in sharper image visualizations.
Specifically, guided BP is identical to classical BP except in the way the backward pass is computed through Rectified Linear Units (ReLUs). Let $x_i$ denote the input to the ReLU in layer $i$ and $y_i$ its output. Recall that a ReLU is defined as $y_i = \max(x_i, 0)$. Let $\delta_i = \partial f / \partial y_i$ denote the partial derivative w.r.t. the output of the ReLU (received as input in the backward pass). The key difference between the two backprops (BP) is:
$$\text{classical BP:} \;\; \frac{\partial f}{\partial x_i} = \delta_i \cdot \mathbb{1}[x_i > 0], \qquad \text{guided BP:} \;\; \frac{\partial f}{\partial x_i} = \delta_i \cdot \mathbb{1}[x_i > 0] \cdot \mathbb{1}[\delta_i > 0]$$
i.e., guided BP blocks negative gradients from flowing back through ReLUs. For more details, please refer to [\citenameSpringenberg et al.2015].
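The ReLU backward rules differ by a single extra mask, as the NumPy sketch below shows; the gradient and activation values are made up for illustration.

```python
import numpy as np

def relu_backward_classic(grad_out, x):
    # Classical BP: pass the gradient wherever the forward input was positive.
    return grad_out * (x > 0)

def relu_backward_guided(grad_out, x):
    # Guided BP: additionally zero out negative incoming gradients.
    return grad_out * (x > 0) * (grad_out > 0)

x = np.array([-1.0, 2.0, 3.0, 0.5])   # ReLU inputs from the forward pass
g = np.array([ 0.7, -0.3, 0.9, 0.2])  # gradients arriving from above

print(relu_backward_classic(g, x))  # the -0.3 survives (x = 2.0 > 0)
print(relu_backward_guided(g, x))   # the -0.3 is blocked as well
```

Zeroing negative gradients discards evidence *against* the prediction, which is why guided BP yields sharper, less noisy visualizations than classical BP.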
We use guided BP to compute ‘gradients’ of the probability of the predicted answer w.r.t. the inputs (image and question). Note that the language pathway in the models we study does not contain ReLUs, so these are true gradients (not just gradient-based visualizations) on the language side. We interpret the words/pixels that receive the highest-magnitude gradients as the most important to the model, since small changes in them lead to the largest changes in the model’s confidence in the predicted answer.
3.2 Discrete Derivatives
In this method, we systematically occlude subsets of the input, forward-propagate the masked input through the VQA model, and compute the change in the probability of the answer predicted with the unmasked original input. Since there are two inputs to the model, we focus on one input at a time, keeping the other fixed (mimicking partial derivatives). Specifically, to compute the importance of a question word, we mask that word by dropping it from the question, and feed the masked question with the original image into the model. The importance score of the question word is the resulting change in the probability of the originally predicted answer.
We follow the same procedure on images to compute the importance of image regions: we divide the image into a 16 x 16 grid and occlude one cell at a time with a gray patch.
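Both occlusion procedures can be sketched as follows. Here `vqa_model_prob` is a toy stand-in for a trained VQA model's answer probability (any callable with the same signature would do); the gray-patch values are the ImageNet mean RGB values used in our experiments.

```python
import numpy as np

# Toy stand-in for a trained VQA model: returns p(answer | image, question).
# This one just scores the presence of "color" and the brightness of a
# center crop; a real model would be forward-propagated here instead.
def vqa_model_prob(image, question_words, answer="blue"):
    score = 1.0 * ("color" in question_words) + image[100:150, 100:150].mean() / 255.0
    return 1.0 / (1.0 + np.exp(-score))  # squash to a probability

def word_importances(image, words):
    """Drop one word at a time; importance = drop in answer probability."""
    base = vqa_model_prob(image, words)
    return {w: base - vqa_model_prob(image, [u for u in words if u != w])
            for w in words}

def image_importances(image, words, grid=16):
    """Occlude one grid cell at a time with a mean-gray patch."""
    gray = np.array([123.68, 116.779, 103.939])  # ImageNet mean RGB
    base = vqa_model_prob(image, words)
    ch, cw = image.shape[0] // grid, image.shape[1] // grid
    imp = np.zeros((grid, grid))
    for r in range(grid):
        for c in range(grid):
            occluded = image.copy()
            occluded[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw] = gray
            imp[r, c] = base - vqa_model_prob(occluded, words)
    return imp
```

A large positive entry in either map means the model's confidence collapses when that word or cell is removed, i.e. the model was relying on it.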
More results and interactive visualizations can be found on the authors’ webpages.
While image/question importance maps on individual inputs provide crucial insight into the inner workings of a model (\eg, see \figreffig:masked_example), what do the aggregate statistics of these maps tell us about the model?
4.1 Analyzing Image Importance
[\citenameDas et al.2016] recently collected human attention annotations for (question, image) pairs from the VQA dataset [\citenameAntol et al.2015]. Given a blurry image and a question, humans were asked to deblur the regions in the image that were helpful in answering the question.
| Method | Rank-correlation with human attention |
| Guided backpropagation | 0.292 ± 0.004 |
We evaluate the quality of the image importance maps obtained from the two methods (guided backpropagation and occlusion) by comparing them to the human attention maps. The human attention dataset contains annotations for 1374 (question, image) pairs from the VQA validation set [\citenameAntol et al.2015]. Following the evaluation protocol in [\citenameDas et al.2016], we take the absolute value of the importance maps and compute their mean rank-correlation with the human attention maps. Specifically, we first scale both the image importance and human attention maps to 14 x 14, normalize them spatially, rank the pixels according to their spatial attention, and then compute the correlation between these two ranked lists. The results are shown in \tablereftab:image_maps. We find that both importance maps (occlusion and guided BP) are weakly positively correlated with human attention maps, although far from inter-human correlation. Thus, our techniques reveal an interesting finding – that even without attention mechanisms, VQA models may be implicitly attending to relevant regions in the image.
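The core of this protocol is a Spearman rank correlation between two spatial maps. The sketch below is our own minimal re-implementation (ties in the ranking are broken arbitrarily by the sort), not the exact evaluation code of [\citenameDas et al.2016], and it assumes both maps have already been resized to 14 x 14.

```python
import numpy as np

def rank(a):
    """Ranks of a flattened map (0 = smallest); ties broken arbitrarily."""
    order = a.ravel().argsort()
    r = np.empty_like(order)
    r[order] = np.arange(order.size)
    return r

def rank_correlation(importance_map, human_map):
    """Spearman correlation between an importance map and a human map.

    Absolute values are taken because the sign of a gradient is
    irrelevant to *where* the model is looking.
    """
    a = rank(np.abs(importance_map))
    b = rank(np.abs(human_map))
    # Spearman correlation = Pearson correlation of the ranks.
    return np.corrcoef(a, b)[0, 1]
```

Identical maps score 1.0 and perfectly inverted maps score -1.0; the 0.292 reported above sits between chance (0) and inter-human agreement.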
4.2 Analyzing Question Importance
Since there is no human attention dataset for questions, we instead analyze the importance maps for questions using their POS tags. Our hypothesis is that wh-words and nouns should matter most to a ‘sensible’ model’s prediction. We plot the probability of a word being the most important in a question given that it has a certain POS tag. To get reliable statistics, we picked the 15 most frequent POS tags from the VQA validation set and grouped similar tags into one category, e.g. WDT, WP, and WRB are grouped as wh-words. The histogram can be seen in \figreffig:ques_imp_hist. Indeed, wh-words are most important, followed by adjectives and nouns. Adjectives and nouns rank high because many questions ask about characteristics of objects, or about the objects themselves. This finding suggests that the language model part of the VQA model is strong and is able to learn to focus on appropriate words without any explicit attention procedure.
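The statistic behind \figreffig:ques_imp_hist can be computed in a few lines; the (tagged question, importance scores) input format below is a hypothetical stand-in for however one stores the POS-tagged questions and their word-importance scores.

```python
from collections import Counter

def pos_importance_histogram(examples):
    """P(word is the most important in its question | POS tag).

    examples: list of (tagged_words, importances), where tagged_words is
    [(word, pos_tag), ...] and importances aligns with it element-wise.
    """
    most_important = Counter()  # tag -> times a word with this tag "won"
    occurrences = Counter()     # tag -> total words carrying this tag
    for tagged, scores in examples:
        for _, tag in tagged:
            occurrences[tag] += 1
        # The "winner" is the word with the highest importance score.
        winner_tag = tagged[max(range(len(scores)), key=scores.__getitem__)][1]
        most_important[winner_tag] += 1
    return {t: most_important[t] / occurrences[t] for t in occurrences}
```

Grouping related tags (e.g. mapping WDT, WP, WRB to a single "wh-word" category) is a simple relabeling of `pos_tag` before counting.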
Note that for many occlusions, the model’s predicted answer differs from the originally predicted answer. In fact, we found that the number of times the predicted answer changes correlates with the model’s accuracy: it predicts the model's success/failure correctly 72% of the time. This suggests that features characterizing these importance maps can provide useful signals for predicting the model’s impending failures.
In this paper, we experimented with two visualization methods – guided backpropagation and occlusion – to interpret deep learning models for the task of Visual Question Answering. Although we focus on only one VQA model in this work, the methods generalize to other end-to-end VQA models. The occlusion method can even be applied to any (non-end-to-end) VQA model by treating it as a black box. We believe that these methods and results can be helpful in interpreting current VQA models and in designing the next generation of VQA models.
Acknowledgements. This work was supported in part by the following: National Science Foundation CAREER awards to DB and DP, Army Research Office YIP awards to DB and DP, ICTAS Junior Faculty awards to DB and DP, Army Research Lab grant W911NF-15-2-0080 to DP and DB, Office of Naval Research grant N00014-14-1-0679 to DB, Paul G. Allen Family Foundation Allen Distinguished Investigator award to DP, Google Faculty Research award to DP and DB, AWS in Education Research grant to DB, and NVIDIA GPU donation to DB. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the U.S. Government or any sponsor.
- Demo available here: http://cloudcv.org/vqa/ [\citenameAgrawal et al.2015]
- a gray patch of intensities (R, G, B) = (123.68, 116.779, 103.939), the mean RGB pixel values across ImageNet [\citenameDeng et al.2009], the large image dataset on which the CNN is trained.
- Question importance maps: https://mlp.ece.vt.edu/masked_ques_vis/. Image importance maps: https://mlp.ece.vt.edu/masked_image_vis/
- Harsh Agrawal, Clint Solomon Mathialagan, Yash Goyal, Neelima Chavali, Prakriti Banik, Akrit Mohapatra, Ahmed Osman, and Dhruv Batra. 2015. Cloudcv: Large-scale distributed computer vision as a cloud service. In Mobile Cloud Visual Media Computing, pages 265–290. Springer International Publishing.
- Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. 2015. VQA: Visual Question Answering. In ICCV.
- David Baehrens, Timon Schroeter, Stefan Harmeling, Motoaki Kawanabe, Katja Hansen, and Klaus-Robert Müller. 2010. How to explain individual classification decisions. J. Mach. Learn. Res., 11:1803–1831, August.
- Antoine Bordes, Nicolas Usunier, Sumit Chopra, and Jason Weston. 2015. Large-scale simple question answering with memory networks. CoRR, abs/1506.02075.
- Johanni Brea, Walter Senn, and Jean-Pascal Pfister. 2011. Sequence learning with hidden units in spiking neural networks. In NIPS.
- Abhishek Das, Harsh Agrawal, C. Lawrence Zitnick, Devi Parikh, and Dhruv Batra. 2016. Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions? In EMNLP.
- J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. 2009. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR.
- Geoffrey Hinton, Li Deng, Dong Yu, George E Dahl, Abdel-rahman Mohamed, Navdeep Jaitly, Andrew Senior, Vincent Vanhoucke, Patrick Nguyen, Tara N Sainath, et al. 2012. Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. Signal Processing Magazine, IEEE, 29(6):82–97.
- L. Liu and L. Wang. 2012. What has my classifier learned? Visualizing the classification rules of bag-of-feature model by support region detection. In CVPR.
- Jiasen Lu, Xiao Lin, Dhruv Batra, and Devi Parikh. 2015. Deeper LSTM and normalized CNN Visual Question Answering model. https://github.com/VT-vision-lab/VQA_LSTM_CNN.
- Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. 2015. Ask your neurons: A neural-based approach to answering questions about images. In ICCV.
- Mengye Ren, Ryan Kiros, and Richard Zemel. 2015. Exploring models and data for image question answering. In NIPS.
- Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why Should I Trust You?”: Explaining the Predictions of Any Classifier. In Knowledge Discovery and Data Mining (KDD).
- David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489.
- Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. In ICLR Workshop Track.
- J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. 2015. Striving for Simplicity: The All Convolutional Net. In ICLR Workshop Track.
- Ilya Sutskever, Oriol Vinyals, and Quoc Le. 2014. Sequence to Sequence Learning with Neural Networks. In NIPS.
- Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. 2015. Going deeper with convolutions. In CVPR.
- Jason Weston, Antoine Bordes, Sumit Chopra, and Tomas Mikolov. 2015. Towards AI-Complete Question Answering: A Set of Prerequisite Toy Tasks. CoRR, abs/1502.05698.
- Matthew D. Zeiler and Rob Fergus. 2014. Visualizing and Understanding Convolutional Networks. In ECCV.