Neural Network Interpretation via Fine Grained Textual Summarization


Pei Guo, Connor Anderson, Kolten Pearson, Ryan Farrell
Department of Computer Science
Brigham Young University
{peiguo, connor.anderson, farrell},

Current visualization-based network interpretation methods lack semantic-level information. In this paper, we introduce the novel task of interpreting classification models using fine-grained textual summarization: along with the label prediction, the network generates a sentence explaining its decision. Constructing a fully annotated dataset of filter–text pairs is unrealistic because of the complexity of the image-to-filter response function. We instead propose a weakly supervised learning algorithm that leverages off-the-shelf image-caption annotations. Central to our algorithm is the filter-level attribute probability density function (PDF), learned as a conditional probability through Bayesian inference with the input image and its feature map as latent variables. We show that our algorithm faithfully reflects the features learned by the model through applications such as attribute-based image retrieval and unsupervised text grounding. We further show that the textual summarization process can help in understanding network failure patterns and can provide clues for further improvements.




32nd Conference on Neural Information Processing Systems (NIPS 2018), Montréal, Canada.

1 Introduction

Given a neural network, we are interested in knowing what features it has learned for making classification decisions. Despite their tremendous success on various computer vision tasks [17, 9, 4, 12], deep neural network models [17, 28, 13, 15] are still commonly viewed as black boxes. The difficulty of understanding neural networks lies mainly in the end-to-end learning of the feature-extractor and classifier sub-networks, which often contain millions of parameters. Debugging an over-confident network, which assigns the wrong class label to an image with high probability, can be extremely difficult. This is also true when adversarial noise [10] is added to deliberately guide the network to a wrong conclusion. It is therefore desirable to have some textual output explaining which features were responsible for triggering the error, just as an intelligent compiler pinpoints a syntax error in code. Network interpretation is also crucial for tasks involving humans, like autonomous driving and medical image analysis. For all these reasons, it is important to distill the knowledge learned by deep models and represent it in an easy-to-understand way.

There are two main approaches to network interpretation in the literature: filter-level interpretation [6, 30, 18, 23, 11, 22, 21, 3, 39, 36, 29, 37] and holistic-level interpretation [27, 40, 26]. The goal of filter-level interpretation is to understand and visualize the features that specific neurons learn. While it is easy to directly visualize the first-layer convolutional filter weights to get a sense of the patterns they detect, it makes little sense to directly visualize deeper-layer filter weights, because deeper filters act as complex composite functions of lower layers' operations. Early examples of filter-level understanding include finding the maximally activating input patches [37] and visualizing guided back-propagation gradients [29]. Some works [23] synthesize visually pleasing preferred input images for each neuron through back-propagation into the image space. [22] applies a generator network to generate images conditioned on maximally activating certain last-layer neurons. Plug and Play [21] further extends [22] to a generalized adversarial learning framework for filter-guided image generation. Network dissection [3] measures the interpretability of each neuron by annotating it with predefined attributes like color, texture, and parts. [38] proposes to utilize a knowledge graph to represent the content and structure of an image.

Figure 1: Comparison of visualization-based interpretation [40] and interpretation by textual-summarization (the proposed approach). The latter has more semantic details useful for analyzing incorrect predictions.

Attempts at holistic summarization mainly focus on visualizing important image sub-regions through attentional mechanisms; examples include CAM [40] and Grad-CAM [26]. However, visualization-based methods only provide coarse-level information, and it remains hard to know intuitively what feature or pattern the network has learned to detect. More importantly, the holistic heat-map representation is sometimes insufficient to justify why the network favors certain classes over others, when the attentional maps for different classes overlap heavily. See Figure 1 for an example.

Humans, on the other hand, can justify their conclusions using natural language. For instance, a knowledgeable person looking at a photograph of a bird might say, "I think this is an Anna's Hummingbird because it has a straight bill and a rose-pink throat and crown. It's not a Broad-tailed Hummingbird because the latter lacks the red crown" [1]. This kind of textual description carries rich semantic information and is easily understandable. Natural language is a logical medium in which to ground the interpretation of deep convolutional models.

In this paper, we propose the task of summarizing the decision-making process of deep convolutional models using fine-grained textual descriptions. Along with the predicted class label, the network produces a textual description justifying its decision. See Figure 6 for examples. The proposed task essentially bridges the gap between model filters and text, as we are trying to find the best text that can describe the patterns learned by the filters during training.

Other tasks that combine text generation and visual explanation include image captioning and visual question answering (VQA). Although it sounds similar, image captioning [8, 32, 7, 16] is fundamentally different from our task. Image captioning is usually done in a fully supervised manner, with the goal of generating a caption that describes the general content of an image; our textual interpretation task aims to faithfully reflect the knowledge learned by a classification model, in an unsupervised way. Visual question answering [2, 34, 19] requires understanding an image and answering textual questions about it. Our task can be viewed as a special case of unsupervised VQA that focuses on questions such as: "Why does the model think the image belongs to class X?" Text grounding [25, 35] is a language-to-vision task that tries to locate the object in an image referred to by a given text phrase. We note that [14] defines a task similar to ours, to explain and justify a classification model. Their model is learned in a supervised manner, with explanations generated by an LSTM network that only implicitly depends on the internal feature maps; it is essentially an image captioning task that generates captions with more class-discriminative information. Our method is unsupervised and does not rely on another black-box network to generate descriptions.

Section 2 describes our method for associating text with network filters and generating holistic-level explanations. We provide details of our Bayesian approach. Section 3 introduces several applications of filter-text association. Section 4 presents our experiments and results. Section 5 provides a summary and directions for future work.

Figure 2: (A) The filter-text association is to find a mapping from the abstract functional space to the semantic text space. (B) The composite filter function can be coordinate transformed to be defined on the textual attribute space. (C) Top activation images of a filter (a yellow head detector).

2 Algorithm Details

As a fundamental step toward network interpretation, we are interested in representing network filter patterns with text phrases. Constructing a paired filter–text dataset is unrealistic, because a filter (as a composite function) is not a well-defined concept with concrete samples. Instead, we propose leveraging off-the-shelf image-captioning annotations, because they contain rich textual references to visual attributes. The intuition behind our filter–text association is simple: a filter can be represented by the images that strongly activate it, so the corresponding image attributes should have a high probability of representing the activated filter. The joint consensus of all textual attributes serves as a good indicator of the filter's pattern, provided the network is properly trained.

More formally, the composite filter function takes an image as input and produces a feature map whose strength indicates the existence of certain patterns. The filter interpretation task aims to find a mapping from the abstract functional space to the semantic text space (Figure 2). With the help of image captioning, we consider a coordinate transformation operation that transforms the input from image to image attributes. The new filter-attribute function can be approximated by the attribute probability density function (PDF), which is a key component of the proposed algorithm.

2.1 Generating the Filter Attribute Probability Density Function

We denote by $\mathcal{F} = \{f_i\}_{i=1}^{N}$ the group of model filters. In this paper, we are only interested in the final convolutional-layer filters, as they are directly related to the final classifier input. We denote by $\mathcal{I} = \{I_j\}$ the set of input images. The filter's output is naturally written as $f_i(I_j)$, which we call a feature map or filter activation. We consider models [12, 15] with a global pooling layer and a one-layer classification sub-network. The classifier sub-network produces class-label predictions with the weight matrix $W$. We attach a list of textual attributes $T(I_j)$, drawn from the set $\mathcal{T} = \{t_k\}$, to each image $I_j$.

We propose a Bayesian inference framework to learn the probability of textual phrases conditioned on filters. As we cannot directly observe ground-truth filter–text pairs, we introduce the input image $I_j$ and its feature map $f_i(I_j)$ as hidden variables. The filter attribute PDF can then be computed by marginalizing over the hidden variables:

$$p(t_k \mid f_i) = \frac{1}{Z}\, p(t_k) \sum_j p(t_k \mid I_j, f_i(I_j))\; p(I_j \mid f_i),$$
where $Z$ is a normalization factor and $p(t_k)$ is an attribute prior discussed below. $p(I_j \mid f_i)$ is the image activation probability. It measures the likelihood of image $I_j$ activating filter $f_i$, and can be expressed as:

$$p(I_j \mid f_i) = \frac{a_{ij}}{\sum_{j'} a_{ij'}},$$

where $a_{ij}$ is the global pooling layer output for filter $f_i$ on image $I_j$. The probability $p(t_k \mid I_j, f_i(I_j))$, which we call the text representation probability, measures the likelihood of a text phrase representing a feature-map-masked image. It involves grounding the sub-regions of an image to a set of noun phrases; this task is similar to text grounding, but in a reverse manner. For example, if the feature map highlights the head area of a bird, one should assign higher probability to parts like "head", "beak" or "eyes" and lower probability to others like "wings" and "feet". In our implementation the feature map information is neglected for simplicity:


$$p(t_k \mid I_j, f_i(I_j)) \approx p(t_k \mid I_j) = \begin{cases} 1/|T(I_j)| & \text{if } t_k \in T(I_j) \\ 0 & \text{otherwise.} \end{cases}$$

This naive approximation assigns equal probability to every noun phrase that appears in the caption. We show that this approximation actually works quite well, as the joint consensus over all input images highlights the true attributes and suppresses false ones.

$p(t_k)$ is the prior probability of textual attribute $t_k$. We consider the relative importance of attributes because they carry different amounts of information; for example, "small bird" carries less information than "orange beak", because the latter appears less often in the text corpus and corresponds to a more distinctive image feature. We employ the TF-IDF feature as the attribute prior.
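To make the estimation concrete, the marginalization above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: `activations` is assumed to hold one filter's global-pooled responses over the image set, `captions_phrases` the noun phrases extracted per image, and `prior` a dict standing in for the TF-IDF attribute prior.

```python
import numpy as np

def filter_attribute_pdf(activations, captions_phrases, prior):
    """Sketch of p(attribute | filter) by marginalizing over images.

    activations: (num_images,) global-pooled responses of one filter.
    captions_phrases: per-image lists of caption noun phrases.
    prior: dict phrase -> prior weight (stand-in for TF-IDF); default 1.
    """
    act = np.maximum(np.asarray(activations, dtype=float), 0.0)
    p_img = act / act.sum()          # p(image | filter): normalized activation
    pdf = {}
    for p_i, phrases in zip(p_img, captions_phrases):
        if not phrases:
            continue
        for ph in phrases:           # p(phrase | image): uniform over phrases
            pdf[ph] = pdf.get(ph, 0.0) + p_i / len(phrases) * prior.get(ph, 1.0)
    z = sum(pdf.values())            # normalize to a proper distribution
    return {ph: v / z for ph, v in pdf.items()}
```

With a uniform prior, an attribute's mass is simply the activation-weighted frequency with which it appears in the captions of the filter's strongly activating images.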

2.2 Aggregating Filter Attribute PDFs for Holistic Description

With the help of filter attribute PDFs, we can reason about what features the network has learned for image classification. This problem can be formulated as the probability of a text phrase explanation, given the image and the network-produced class label:

$$p(t_k \mid I_j, c) = \frac{1}{Z} \sum_i p(t_k \mid f_i)\, w_{ic}\, a_{ij},$$

where $p(t_k \mid I_j, c)$ is the probability that $t_k$ is the reason that the network predicts $I_j$ as class $c$, $p(t_k \mid f_i)$ is from the filter attribute PDF, $a_{ij}$ is the pooled activation of filter $f_i$ on $I_j$, and $w_{ic}$ is the weight from the classifier weight matrix connecting filter $f_i$ to class prediction $c$. We call $p(t_k \mid I_j, c)$ the image-class attribute PDF.
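This aggregation can be sketched under the same illustrative data layout (per-filter attribute dicts, a dense classifier weight matrix); the function and argument names are our own, not from the paper:

```python
import numpy as np

def image_class_attribute_pdf(filter_pdfs, activations, W, cls):
    """Sketch of p(attribute | image, class): combine filter attribute
    PDFs weighted by each filter's contribution to the class score.

    filter_pdfs: one dict per filter, phrase -> p(phrase | filter).
    activations: (num_filters,) pooled responses for this image.
    W: (num_filters, num_classes) classifier weight matrix.
    cls: index of the predicted class.
    """
    agg = {}
    for pdf, a_i, w_ic in zip(filter_pdfs, activations, W[:, cls]):
        contrib = max(a_i * w_ic, 0.0)   # filter's share of the class score
        for ph, p in pdf.items():
            agg[ph] = agg.get(ph, 0.0) + contrib * p
    z = sum(agg.values()) or 1.0
    return {ph: v / z for ph, v in agg.items()}
```

Filters that respond strongly and carry a large classifier weight for the predicted class dominate the resulting attribute distribution.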

We generate a natural sentence to describe the network decision-making process using the image-class attribute PDF. Although it’s popular to employ a recurrent model for sentence generation, our task is to faithfully reflect the internal features learned by the network and introducing another network could result in more uncertainty. We instead propose a simple template-based method, which has the following form:

"This is a {class name} because it has {attribute 1}, {attribute 2}, …, and {attribute n}."

We consider only the top 5 attributes to make the sentence shorter and more precise. Steps are taken to merge adjectives related to the same nouns. Simple rules are added to merge similar concepts together such as "beak" and "bill", or "belly" and "stomach".
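A minimal template filler along these lines (omitting the adjective- and synonym-merging rules) could be:

```python
def explain(class_name, attribute_pdf, top_k=5):
    """Fill the explanation template with the top-k most probable attributes."""
    top = sorted(attribute_pdf, key=attribute_pdf.get, reverse=True)[:top_k]
    if len(top) == 1:
        body = top[0]
    else:
        body = ", ".join(top[:-1]) + ", and " + top[-1]
    return "This is a {} because it has {}.".format(class_name, body)
```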

Another important aspect of model interpretation is comparing the reasons behind certain choices as opposed to others, i.e., why the network thinks the input belongs to class $c_1$ instead of $c_2$. We can easily summarize the relation and the difference between two predictions by comparing their image-class attribute PDFs: for example, while both birds have long beaks, class $c_1$ favors a green crown while class $c_2$ tends to have a blue crown.

Explain-Away Mechanism The filter attribute PDF obeys a multinomial distribution. A filter does not necessarily activate on only one narrow beam of features; instead, it may behave like a multi-modal Gaussian distribution that activates on several totally different features. For example, a filter may detect both "blue head" and "black head" with high probability. The interpretability of the filter suffers from this multi-modal characteristic. This is especially true for the image description task, because it becomes hard to know exactly which feature activated the filter.

However, we observe that other filters can act in a complementary way to help explain away the probability of unrelated patterns. For instance, there could be another filter that activates for "blue head" but not for "black head". If both filters activate, then "blue head" is the probable pattern; if only the first filter activates, then "black head" is more probable. The joint consensus of all the filters makes the generated explanation reasonable.

Class Level Description Given the filter attribute PDFs, we are interested in knowing which features are important for each class. This task can be formulated as:

$$p(t_k \mid c) = \frac{1}{Z} \sum_i p(t_k \mid f_i)\, w_{ic},$$

where $p(t_k \mid c)$ is the class attribute PDF. Different from the image-class attribute PDF, the class-level description weights attributes based only on the classifier weights and the filter attribute PDFs. For difficult tasks like fine-grained recognition, deep models often perform better than non-expert users. The knowledge distilled from class-level descriptions could potentially be used to teach users how to discriminate in challenging domains.

3 Applications for Textual Summarization

The filter-level and holistic-level attribute PDFs and their corresponding textual summarizations can be used in several different applications. In this section we describe applications to network debugging, unsupervised text grounding, and attribute-based image retrieval; we validate them with experiments in Section 4. Other potential applications are left to future work.

Network Debugging When the network predicts the wrong class, we would like to understand why. We generate a textual summarization to explain why the network favors the wrong prediction over the ground truth. This unveils common failure patterns of the network that are helpful for network improvement.

Unsupervised Text Grounding Given a short text phrase, we would like to know which image region it refers to. We show that the filter attribute PDF can help with unsupervised text grounding. Suppose $t_k$ is a phrase associated with image $I_j$, and $R(t_k, I_j)$ is the image region of interest. We have:

$$R(t_k, I_j) = \sum_{i=1}^{N} p(t_k \mid f_i)\, f_i(I_j),$$

where $N$ is the total number of final conv-layer filters. Intuitively, we re-weight the filter responses according to their probability of detecting regions that match $t_k$. This task is totally unsupervised, as we have no ground-truth image-region–phrase pairs to learn from.
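The re-weighting can be sketched directly on the final conv-layer feature maps; the array shapes here are assumptions for illustration:

```python
import numpy as np

def ground_phrase(feature_maps, filter_pdfs, phrase):
    """Sketch of unsupervised grounding: heatmap for a phrase as the
    sum of feature maps weighted by p(phrase | filter).

    feature_maps: (num_filters, H, W) final conv-layer responses.
    filter_pdfs: one dict per filter, phrase -> probability.
    """
    weights = np.array([pdf.get(phrase, 0.0) for pdf in filter_pdfs])
    return np.tensordot(weights, feature_maps, axes=1)  # (H, W) heatmap
```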

Attribute Based Image Retrieval We would like to be able to search a database of images using textual attribute queries and return images that match. For example, we would like to find all images of birds with a "white head" and "black throat". The image-class attribute PDF provides a simple method that ranks images by the probability of containing the desired attributes.
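One simple ranking rule, scoring each image by the product of its attribute probabilities (an independence assumption we add for illustration), is:

```python
def retrieve(image_pdfs, query_attrs, top_k=3):
    """Rank (name, attribute_pdf) pairs by the joint probability of the query.

    Treats attributes as independent, so the score is the product of
    per-attribute probabilities; absent attributes score zero.
    """
    def score(pdf):
        s = 1.0
        for attr in query_attrs:
            s *= pdf.get(attr, 0.0)
        return s
    ranked = sorted(image_pdfs, key=lambda kv: score(kv[1]), reverse=True)
    return [name for name, _ in ranked[:top_k]]
```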

Figure 3: Examples of caption annotations on CUB. The noun phrases are highlighted.
Figure 4: Top 100 most frequent noun phrases.

4 Experiments

We demonstrate the effectiveness of our algorithm on the CUB-200-2011 dataset [33], a fine-grained dataset containing 5997 training images and 5797 testing images. [24] provides additional captions for each image; some are shown in Figure 3. As our convolutional model, we use a ResNet-50 trained for classification on ImageNet [5] and fine-tuned on CUB, where it achieves 81.4% classification accuracy. We use bounding-box-cropped images to reduce background noise.

4.1 Text Preprocessing and Attribute Extraction

We first extract noun phrases from the image-captioning text, following a pipeline of word tokenization, part-of-speech tagging, and noun-phrase chunking. For simplicity, we only consider adjective-noun phrases without recursion. We end up with 9649 independent attributes; a partial distribution is shown in Figure 4. Simple rules are added to merge similar nouns like "beak" and "bill". The Term Frequency (TF) of a phrase is computed as the number of occurrences of the phrase within the same captioning file; on CUB, each image has a caption file with 5 different captions. The Inverse Document Frequency is $\mathrm{IDF}(t_k) = \log(D / D_{t_k})$, where $D$ is the total number of files and $D_{t_k}$ is the number of files containing phrase $t_k$.
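The TF-IDF prior described above can be computed per (phrase, caption file) pair; a small sketch, assuming phrases have already been extracted per file:

```python
import math

def tf_idf(phrase, file_phrases, all_files):
    """TF-IDF weight of a phrase within one image's caption file.

    file_phrases: noun phrases extracted from this file's captions.
    all_files: one such phrase list per image in the corpus.
    """
    tf = file_phrases.count(phrase)                  # occurrences in this file
    d_t = sum(1 for f in all_files if phrase in f)   # files containing phrase
    idf = math.log(len(all_files) / d_t) if d_t else 0.0
    return tf * idf
```

Common phrases like "small bird" appear in almost every file, so their IDF (and hence their prior weight) shrinks toward zero.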

Figure 5: Unnormalized filter attribute PDFs with the top activation images.

4.2 Filter-Level and Holistic-Level Interpretation

We show examples of filter attribute PDFs in Figure 5. There is a clear connection between the top activated images and the top-ranked attributes, which validates our idea of using text phrases to represent the features learned by each filter. We show examples of generated textual explanations for image classification in Figure 6; the generated explanations capture the class-discriminative information present in the images.

Figure 6: Textual Explanations Generated by our Method. Example images (1-6 first row, 7-12 second row) are shown with each image’s corresponding explanation produced by our approach.
Figure 7: Analysis of Network Failures (for Network Debugging). Each row represents a network failure – an incorrectly predicted class label. From left to right, each row shows the query image, canonical images for the ground-truth and incorrectly predicted classes, and explanations for each of these classes. The box below the first row provides background on differences between Tree Sparrows and Chipping Sparrows.

4.3 Applications of Textual Summarization

Network Debugging In Figure 7, we show three major patterns of network failure through textual summarization. In the first example, a Tree Sparrow is incorrectly recognized as a Chipping Sparrow because the network mistakenly treats "long tail" as a discriminative feature. Failing to identify effective features for discrimination is the most common source of errors across the dataset. In fine-grained classification, the main challenge is to identify discriminative features for visually similar classes; the differences are often subtle and localized to small parts.

The second example shows a Seaside Sparrow that has mistakenly been recognized as a Blue Grosbeak. From the textual explanations we ascertain that the low image quality mistakenly activates filters that correspond to blue head and blue crown. The underlying source of this error is complex – the generalization ability of the network is limited such that small perturbations in the image can result in unwanted filter responses. Such failures imply the critical importance of improving network robustness.

In the third case, the network predicts the image as a Yellow Warbler, while the ground-truth label is Yellow-bellied Flycatcher. According to a bird expert, the network got this one correct – the ground-truth label is an error. The network correctly identifies the yellow crown and yellow head, both obvious features of the Yellow Warbler. Errors like this are not surprising: according to [31], roughly 4% of the class labels in the CUB dataset are incorrect. The mistake shown in Figure 1 could also be a false negative, and it indicates that the classifier may not have learned to assign correct weights to discriminative features.

Unsupervised Text Grounding In Figure 8, we show some text referral expressions and the generated raw heatmaps indicating which part of the image each expression refers to. Each column denotes a different visual attribute, and the heatmap (shown as a transparency map) indicates the region of highest activation within the image. From these results, we can see that the proposed approach is reasonably good at highlighting regions of interest.

Attribute-based Image Retrieval In Figure 9, three examples of attribute-based image search using text-based attributes are shown. From top to bottom, the three search queries are: (1) "yellow head, yellow breast", (2) "white head, black throat" and (3) "white eyebrow". Images are ranked from high to low using the probability that the image contains the query attributes. The results are very encouraging – each image clearly contains the query attributes.

Figure 8: Examples of text grounding. Each column represents a different attribute and examples are shown as heatmaps indicating the region where the attribute is most present.
Figure 9: Attribute Based Image Retrieval. Each row shows an attribute query on the left, followed by the top-ranked results, in terms of probability that the image contains the query attributes.

5 Conclusion and Future Work

In this paper, we propose a novel network interpretation task: generating textual summarizations that justify the network's decisions. We use publicly available captioning annotations to learn filter–text relationships in an unsupervised manner. The approach builds on the intuition that filter responses are strongly correlated with specific semantic patterns. Leveraging a joint consensus of attributes across the top-activated images, we generate the filter-level attribute PDF, which further enables holistic-level explanations by combining attributes into a natural sentence. We demonstrate several applications of the proposed interpretation approach, particularly for network debugging.

Future work includes experiments on additional models and datasets. The algorithm can also be generalized to learn from weaker class-level caption annotations. Word-embedding methods such as word2vec [20] can be utilized to embed and group semantically similar words together. Keypoint-based annotations can be used to assign different weights to attributes according to the feature map. Potential applications include explaining adversarial examples and attribute-based zero-shot learning.


  • [1] All about birds.
  • [2] Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [3] David Bau, Bolei Zhou, Aditya Khosla, Aude Oliva, and Antonio Torralba. Network dissection: Quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition, 2017.
  • [4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. TPAMI, June 2016.
  • [5] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on, pages 248–255. IEEE, 2009.
  • [6] Dumitru Erhan, Yoshua Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Technical Report 1341, University of Montreal, June 2009.
  • [7] Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh K. Srivastava, Li Deng, Piotr Dollar, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John C. Platt, C. Lawrence Zitnick, and Geoffrey Zweig. From captions to visual concepts and back. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [8] Ali Farhadi, Mohsen Hejrati, Mohammad Amin Sadeghi, Peter Young, Cyrus Rashtchian, Julia Hockenmaier, and David Forsyth. Every picture tells a story: Generating sentences from images. In Computer Vision – ECCV 2010, pages 15–29, Berlin, Heidelberg, 2010. Springer Berlin Heidelberg.
  • [9] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [10] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and Harnessing Adversarial Examples. ArXiv e-prints, December 2014.
  • [11] Google. Inceptionism: Going deeper into neural networks.
  • [12] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask r-cnn. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
  • [14] Lisa Anne Hendricks, Zeynep Akata, Marcus Rohrbach, Jeff Donahue, Bernt Schiele, and Trevor Darrell. Generating visual explanations. In European Conference on Computer Vision, pages 3–19. Springer, 2016.
  • [15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [16] A. Karpathy and L. Fei-Fei. Deep Visual-Semantic Alignments for Generating Image Descriptions. ArXiv e-prints, December 2014.
  • [17] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
  • [18] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [19] Mateusz Malinowski, Marcus Rohrbach, and Mario Fritz. Ask your neurons: A neural-based approach to answering questions about images. In Proceedings of the 2015 IEEE International Conference on Computer Vision (ICCV), pages 1–9. IEEE Computer Society, 2015.
  • [20] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [21] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play generative networks: Conditional iterative generation of images in latent space. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [22] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in Neural Information Processing Systems 29. 2016.
  • [23] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [24] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
  • [25] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of Textual Phrases in Images by Reconstruction. ArXiv e-prints, November 2015.
  • [26] Ramprasaath R. Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [27] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. ArXiv e-prints, December 2013.
  • [28] Karen Simonyan and Andrew Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. In ICLR, 2015.
  • [29] J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for Simplicity: The All Convolutional Net. ArXiv e-prints, December 2014.
  • [30] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. ArXiv e-prints, December 2013.
  • [31] Grant Van Horn, Steve Branson, Ryan Farrell, Scott Haber, Jessie Barry, Panos Ipeirotis, Pietro Perona, and Serge Belongie. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 595–604, 2015.
  • [32] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and Tell: A Neural Image Caption Generator. ArXiv e-prints, November 2014.
  • [33] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The caltech-ucsd birds-200-2011 dataset. Technical Report CNS-TR-2011-001, 2011.
  • [34] Zichao Yang, Xiaodong He, Jianfeng Gao, Li Deng, and Alex Smola. Stacked attention networks for image question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 21–29, 2016.
  • [35] R. A. Yeh, J. Xiong, W.-M. Hwu, M. Do, and A. G. Schwing. Interpretable and globally optimal prediction for textual grounding using image concepts. In Proc. NIPS, 2017.
  • [36] J. Yosinski, J. Clune, A. Nguyen, T. Fuchs, and H. Lipson. Understanding Neural Networks Through Deep Visualization. ArXiv e-prints, June 2015.
  • [37] M. D Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. ECCV, November 2013.
  • [38] Q. Zhang, R. Cao, F. Shi, Y. Nian Wu, and S.-C. Zhu. Interpreting CNN Knowledge via an Explanatory Graph. AAAI, August 2017.
  • [39] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Object Detectors Emerge in Deep Scene CNNs. ICLR, December 2014.
  • [40] Bolei Zhou, Aditya Khosla, Agata Lapedriza, Aude Oliva, and Antonio Torralba. Learning deep features for discriminative localization. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.