Excitation Backprop for RNNs

Abstract

Deep models are state-of-the-art for many vision tasks including video action recognition and video captioning. Models are trained to caption or classify activity in videos, but little is known about the evidence used to make such decisions. Grounding decisions made by deep networks has been studied in spatial visual content, giving more insight into model predictions for images. However, such studies are relatively lacking for models of spatiotemporal visual content, i.e., videos. In this work, we devise a formulation that simultaneously grounds evidence in space and time, in a single pass, using top-down saliency. We visualize the spatiotemporal cues that contribute to a deep model’s classification/captioning output using the model’s internal representation. Based on these spatiotemporal cues, we are able to localize segments within a video that correspond with a specific action, or phrase from a caption, without explicitly optimizing/training for these tasks.

1 Introduction

To visualize what in a video gives rise to an output of a deep recurrent network, it is important to consider saliency in space and time, i.e., where and when. The visualization of what a deep recurrent network finds salient in an input video can enable interpretation of the model’s behavior in action classification, video captioning, and other tasks. Moreover, estimates of the model’s attention (i.e., saliency maps) can be used directly to localize a given action within a video or to localize the portions of a video that correspond to a particular concept within a caption.

Several works address visualization of model attention in Convolutional Neural Networks (CNNs) for image classification [1]. These methods produce saliency maps that visualize the importance of class-specific image regions (spatial localization). Analogous methods for Recurrent Neural Network (RNN)-based models must handle more complex recurrent, non-linear, spatiotemporal dependencies; thus, progress on RNNs has been limited [7, 13]. Karpathy et al. [7] visualize the role of Long Short Term Memory (LSTM) cells for text input, but not for visual data. Ramanishka et al. [13] map words to regions in the video captioning task by dropping out (exhaustively or by sampling) video frames and/or parts of video frames to obtain saliency maps. This can be computationally expensive and considers only frame-level saliency rather than temporal evolution.

In contrast, we propose the first one-pass formulation for visualizing spatiotemporal attention in RNNs, without selectively dropping or sampling frames or frame regions. In our proposed approach, contrastive Excitation Backprop for RNNs (cEB-R), we address how to ground¹ decisions of deep recurrent networks in space and time simultaneously, using top-down saliency. Our approach models the top-down attention mechanism of deep models to produce interpretable and useful task-relevant saliency maps. Our saliency maps are obtained implicitly, without the need to re-train models, unlike models that include explicit attention layers [26]. Our method is weakly supervised; it does not require a model trained using explicit spatial (region/bounding box) or temporal (frame) supervision.

Fig. ? gives an overview of our approach, which produces saliency maps that enable us to visualize where and when an action/caption occurs in a video. Given a trained model, we perform the standard forward pass. In the backward pass, we use cEB-R to compute and propagate winning neuron probabilities, normalized over space and time. This process yields spatiotemporal attention maps.

We evaluate our approach on two models from the literature: a CNN-LSTM trained for video action recognition, and a CNN-LSTM-LSTM (encoder-decoder) trained for video captioning. In addition, we show how the spatiotemporal saliency maps produced for these two models can be utilized for localization of segments within a video that correspond to specified activity classes or noun phrases.

In summary, our contributions are:

  • We are the first to formulate top-down saliency in deep recurrent models for space-time grounding of videos.

  • We do so using a single contrastive Excitation Backprop pass of an already trained model.

  • Although we are not directly optimizing for localization (no training is performed on spatial or temporal annotations), we show that the internal representation of the model can be utilized to perform localization.

2 Related Work

Several works in the literature give more insight into CNN model predictions, i.e., the evidence behind deep model predictions. Such approaches are mainly devised for image understanding and can identify the importance of class-specific image regions by means of saliency maps in a weakly-supervised way.

Spatial Grounding.

Guided Backpropagation [21] and Deconvolution [30] used different variants of standard backpropagation of the error and visualized salient parts at the image pixel level. In particular, starting from a high-level feature map, [30] inverted the data flow inside a CNN, from neuron activations in higher layers down to the image level. Guided Backpropagation [21] introduced an additional guidance signal to standard backpropagation that prevents the backward flow of negative gradients. Simonyan et al. [18] directly computed the gradient of the class score with respect to the image pixels to find the spatial cues that support the class prediction in a CNN. The CAM [34] algorithm removed the last fully connected layer of a CNN and exploited a weighted sum of the last convolutional feature maps to obtain class activation maps. Grad-CAM [15] combined [21] and [34] to produce high-resolution class-discriminative visualizations. Zhou et al. [33] generated class activation maps using global average pooling in fully-convolutional CNNs (networks that do not contain fully connected layers). Zhang et al. [31] generated class activation maps from any CNN architecture whose non-linearities produce non-negative activations. Oquab et al. [11] used mid-level CNN outputs on overlapping patches, requiring multiple passes through the network.

Spatiotemporal Grounding.

Weakly-supervised visual saliency is much less explored for temporal data. Karpathy et al. [7] visualized interpretable LSTM cells that keep track of long-range dependencies, such as line lengths, quotes, and brackets, in a character-based model. Ramanishka et al. [13] explored visual saliency guided by captions in an encoder-decoder model. In contrast, our approach models the top-down attention mechanism of generic CNN-RNN models to produce interpretable and useful task-relevant spatiotemporal saliency maps that can be used for action/caption localization in videos.

3 Background: Excitation Backprop

In this section, we give a brief background on Excitation Backprop (EB) [31], which was proposed for CNNs. In general, the forward activation of neuron $a_j$ in a CNN is computed by $\hat{a}_j = \phi(\sum_i w_{ij} \hat{a}_i + b_j)$, where $\hat{a}_i$ is the activation coming from a lower layer, $\phi$ is a nonlinear activation function, $w_{ij}$ is the weight from neuron $a_i$ to neuron $a_j$, and $b_j$ is the added bias at layer $j$. The EB framework makes two key assumptions about the activation $\hat{a}_j$, which are satisfied in the majority of modern CNNs due to the wide usage of the ReLU non-linearity: A1. $\hat{a}_j$ is non-negative, and A2. $\hat{a}_j$ is a response that is positively correlated with its confidence of the detection of specific visual features.

EB realizes a probabilistic Winner-Take-All formulation to efficiently compute the winning probability $P(a_i)$ of each neuron recursively from conditional winning probabilities $P(a_i \mid a_j)$, normalized so that they sum to one over the children of each parent neuron (Fig. 1). The top-down signal is a prior distribution over the output units. EB passes top-down signals through excitatory connections having non-negative weights, excluding inhibitory ones from the competition. By recursively propagating the top-down signal and preserving the sum of backpropagated probabilities layer by layer, task-specific saliency maps can be computed from any intermediate layer in a single backward pass.
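
To make the recursion concrete, the following is a minimal Python/NumPy sketch (not the authors' released code) of one EB step through a single fully connected layer; the function and variable names (eb_layer_backward, p_parent, a_child, W) are ours.

import numpy as np

def eb_layer_backward(p_parent, a_child, W, eps=1e-12):
    """Propagate winning probabilities from a parent layer to its children.

    p_parent: (J,) winning probabilities P(a_j) of the parent neurons.
    a_child:  (I,) non-negative forward activations of the child neurons (A1).
    W:        (I, J) weights; W[i, j] connects child i to parent j.
    Returns:  (I,) winning probabilities P(a_i) of the child neurons.
    """
    W_pos = np.clip(W, 0.0, None)            # keep only excitatory connections
    contrib = a_child[:, None] * W_pos       # unnormalized P(a_i | a_j)
    Z = contrib.sum(axis=0, keepdims=True)   # per-parent normalization factor
    cond = contrib / (Z + eps)               # conditional winning probabilities
    return cond @ p_parent                   # P(a_i) = sum_j P(a_i|a_j) P(a_j)

# Example: a one-hot prior over two output units propagated to three children.
p_out = np.array([1.0, 0.0])
a_hidden = np.array([0.2, 0.7, 0.1])
W = np.array([[0.5, -0.3], [1.0, 0.2], [-0.4, 0.8]])
print(eb_layer_backward(p_out, a_hidden, W))  # sums to 1 by construction

The sum of the returned probabilities equals the sum of p_parent (up to the eps term), which is what preserves the probability distribution layer by layer.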

Figure 1: In Excitation Backprop, excitation probabilities are propagated in a single backward pass through the CNN. The top-down signal is a probability distribution over the output units. The probabilities are backpropagated from every parent node to its children through its excitatory connections. The figure illustrates the contributions of a single parent neuron to the excitation probabilities computed at the next layer. Each $P(a_i)$ in the saliency map is computed over the complete parent set $\mathcal{P}_i$. Shading of nodes in the figure conveys $P(a_i)$ (darker shade = greater $P(a_i)$).

To significantly improve the discriminativeness of the generated saliency maps, [31] introduced contrastive top-down attention. The idea of the contrastive mechanism is to cancel out common winner neurons and amplify the neurons that are discriminative for the desired class. To do this, given an output unit $o_i$, a dual unit $\bar{o}_i$ is virtually generated, whose input weights are the negation of those of $o_i$. By subtracting the saliency map for $\bar{o}_i$ from the one for $o_i$, the result better highlights cues in the image that are unique to the desired class. EB and contrastive EB (cEB) were devised to find task-driven discriminative cues in images.
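
As a hedged illustration of this subtraction, the sketch below reuses eb_layer_backward from the previous example and applies it at the classification layer with the original and negated weights; in the full method the two signals are each propagated down to the desired layer before the maps are subtracted.

def contrastive_signal(prior, a_hidden, W_cls):
    """prior: (C,) distribution over output units (e.g., one-hot for one class).
    a_hidden: (H,) activations feeding the classifier.
    W_cls: (H, C) classification-layer weights."""
    p_pos = eb_layer_backward(prior, a_hidden, W_cls)    # EB from the target unit
    p_neg = eb_layer_backward(prior, a_hidden, -W_cls)   # EB from the virtual dual unit
    return p_pos - p_neg   # positive entries mark evidence unique to the class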

4 Our Framework

In this section we explain the details of our spatiotemporal grounding framework, cEB-R. As illustrated in Fig. ?, it has three main modules: RNN Backward, Temporal Normalization, and CNN Backward.

RNN Backward.

This module implements an Excitation Backprop formulation for RNNs. Recurrent models such as LSTMs are well-suited for top-down temporal saliency, as they explicitly propagate information over time. The extension of EB to recurrent networks, EB-R, is not straightforward, since EB must be applied through the unrolled time steps of the RNN, and since the original RNN formulation contains non-linearities that do not satisfy the EB assumptions A1 and A2. Greff et al. [3] conducted an analysis of variations of the standard LSTM formulation and found that different non-linearities performed similarly across a variety of tasks. This is also reflected in our experiments. Based on this, we use ReLU non-linearities and the corresponding derivatives instead of tanh non-linearities and their derivatives. This satisfies A1 and A2, while also giving similar performance on both tasks.
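
The sketch below illustrates the non-linearity substitution on a standard LSTM cell: the two tanh activations are replaced by ReLU so that hidden and cell activations stay non-negative (assumption A1). It is a simplified, assumed parameterization, not the exact cell used in the paper's models.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(x, 0.0)

def relu_lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM step with ReLU in place of tanh.
    W: (4H, D), U: (4H, H), b: (4H,) hold the stacked gate parameters [i, f, o, g]."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    c = sigmoid(f) * c_prev + sigmoid(i) * relu(g)   # ReLU instead of tanh
    h = sigmoid(o) * relu(c)                         # ReLU instead of tanh
    return h, c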

Working backwards from the RNN’s output layer, we compute the conditional winning probabilities from the set of output nodes $O$ and the set of dual output nodes $\bar{O}$:

$P(a_i \mid a_j) = Z_j \hat{a}_i w_{ij}$ if $w_{ij} \geq 0$, and $0$ otherwise, for $a_j \in O$;   (1)

$P(a_i \mid \bar{a}_j) = Z_j \hat{a}_i \bar{w}_{ij}$ if $\bar{w}_{ij} \geq 0$, and $0$ otherwise, for $\bar{a}_j \in \bar{O}$.   (2)

$Z_j$ is a normalization factor such that the conditional probabilities of all children of a parent neuron (Eqns. 1, 2) sum to 1, $W$ is the set of model weights, and $w_{ij} \in W$ is the weight between child neuron $a_i$ and parent neuron $a_j$; $\bar{W}$, with elements $\bar{w}_{ij}$, is obtained by negating the model weights at the classification layer only. $\bar{O}$ is only needed for contrastive attention.

We compute the neuron winning probabilities, starting from the prior distribution encoding a given action/caption, as follows:

$P(a_i) = \sum_{a_j \in \mathcal{P}_i} P(a_i \mid a_j) P(a_j),$   (3)

where $\mathcal{P}_i$ is the set of parent neurons of $a_i$.

Temporal Normalization.

Replacing tanh non-linearities with ReLU non-linearities to extend EB in time does not suffice for temporal saliency. EB performs normalization at every layer to maintain a probability distribution. Hence, for temporal localization, the signals backpropagated from the desired time-step of a $T$-frame clip must also be normalized in time, so that they maintain a probability distribution over the $T$ time-steps, before being further backpropagated into the CNN.
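
One plausible reading of this normalization, sketched below, is to rescale the per-time-step signals so that, taken together, they sum to one before the per-frame CNN backward pass; the function name and list-of-arrays interface are ours.

def normalize_over_time(p_per_step, eps=1e-12):
    """p_per_step: list of T arrays, the winning probabilities backpropagated
    to the CNN/RNN interface at each time-step of the clip."""
    total = sum(float(p.sum()) for p in p_per_step) + eps
    return [p / total for p in p_per_step]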

cEB-R computes the difference between the normalized saliency maps obtained by EB-R starting from the output nodes $O$ and by EB-R starting from the dual nodes $\bar{O}$, i.e., using the negated weights of the classification layer. cEB-R is more discriminative, as it grounds the evidence that is unique to a selected class/word. For example, cEB-R for Surfing will give evidence that is unique to Surfing and not common to other classes used at training time (see Fig. ? for an example).

CNN Backward. For every time step $t$ and corresponding video frame $f_t$, we use the Excitation Backprop of [31] (the recursion of Eqn. 3) through all CNN layers, where $\hat{a}_i$ is the activation obtained when frame $f_t$ is passed through the CNN. The resulting $P(a_i^t)$ at the desired CNN layer is the EB-R saliency map for $f_t$. Computationally, the complexity of EB-R is on the order of a single backward pass. Note that cEB-R performs this backward pass for both the standard and the dual top-down signals before taking their difference. The general framework, applied to both video action recognition and captioning, is summarized in Algorithm ?. Details of each task are discussed in the following two sections.
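
The sketch below summarizes how we read the full pipeline of this section (forward pass, RNN backward, temporal normalization, CNN backward per frame). The cnn/rnn objects, their excitation_backward methods, and the negate_classifier flag are assumed interfaces used only for illustration, and normalize_over_time is the helper sketched above; this is not the authors' released implementation.

def spatiotemporal_saliency(frames, prior, cnn, rnn, contrastive=True):
    """frames: list of video frames; prior: distribution over the output units."""
    # 1. Standard forward pass through the CNN and the unrolled RNN.
    feats = [cnn.forward(f) for f in frames]
    states = rnn.forward(feats)

    # 2. RNN backward: propagate winning probabilities from the prior over the
    #    output units and, for cEB-R, from the dual (negated-weight) units.
    p_steps = rnn.excitation_backward(prior, states)
    if contrastive:
        p_dual = rnn.excitation_backward(prior, states, negate_classifier=True)

    # 3. Temporal normalization across the time-steps of the clip.
    p_steps = normalize_over_time(p_steps)
    if contrastive:
        p_dual = normalize_over_time(p_dual)

    # 4. CNN backward per frame; cEB-R maps are the difference of the two
    #    propagated maps, EB-R maps use only the standard signal.
    maps = []
    for t, f in enumerate(frames):
        m = cnn.excitation_backward(p_steps[t], f)
        if contrastive:
            m = m - cnn.excitation_backward(p_dual[t], f)
        maps.append(m)
    return maps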

4.1 Grounding: Video Action Recognition

In this task, we ground the evidence of a specific action using a model trained for action recognition. The task takes as input a video sequence and the action to be localized, and outputs spatiotemporal saliency maps for this action in the video. We use the CNN-LSTM implementation of [2] with VGG-16 [19] for action grounding in video. This encodes the temporal information intrinsically present in the actions we want to localize. The CNN is truncated at the fc7 layer, such that the fc7 features of the frames feed into the recurrent unit. We use a single LSTM layer.

Performing cEB-R results in a sequence of per-frame saliency maps at conv5 (various layers perform similarly [31]). These maps are then used to perform the temporal grounding of the action. Localizing the action entails the following sequence of steps. First, the sum of every saliency map is computed, giving one value per frame. Second, we find an anchor map, the map with the highest sum. Third, we extend a window around the anchor map in both directions in a greedy manner until a saliency map with a negative sum is found. A negative sum indicates that the map is less relevant to the action under consideration. This allows us to determine the start and end points of the temporal grounding. Fig. ? depicts the cEB-R pipeline for the task of action grounding.
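
The temporal grounding steps above can be summarized in the short sketch below; the function name and list-of-maps interface are ours, and the per-frame maps are assumed to be the signed cEB-R maps, whose sums can be negative.

import numpy as np

def temporal_grounding(saliency_maps):
    """Returns (t_start, t_end), inclusive frame indices of the grounded segment."""
    sums = np.array([m.sum() for m in saliency_maps])   # one relevance score per frame
    anchor = int(np.argmax(sums))                       # map with the highest sum
    t_start = t_end = anchor
    while t_start > 0 and sums[t_start - 1] >= 0:       # grow the window to the left
        t_start -= 1
    while t_end < len(sums) - 1 and sums[t_end + 1] >= 0:  # grow the window to the right
        t_end += 1
    return t_start, t_end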

4.2 Grounding: Video Captioning

In this task, we ground evidence of word(s) using a model trained for video captioning. The task takes as input a video and the word(s) to be localized, and outputs spatiotemporal saliency maps corresponding to the query word(s). We use the captioning model of [22] to test our cEB-R approach. This model consists of a VGG-16, followed by mean pooling of the VGG fc7 features, followed by a two-layer LSTM. Fig. ? depicts cEB-R for caption grounding.

We backpropagate an indicator vector for the words to be visualized, starting at the time-steps at which they were predicted, back through time to the average pooling layer. We then distribute and backpropagate probabilities among frames, in proportion to their forward activations, through the VGG until the conv5 layer, where we obtain the corresponding saliency map. Performing cEB-R results in a sequence of saliency maps that ground the words in the video frames. Temporal localization is performed using the steps described in Section 4.1.
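
A hedged sketch of the step that pushes the signal through the encoder's mean-pooling layer is given below: the probability mass arriving at each pooled fc7 dimension is split among frames in proportion to each frame's non-negative fc7 activation in that dimension. Array names and shapes are ours.

import numpy as np

def backprop_through_mean_pool(p_pooled, fc7_per_frame, eps=1e-12):
    """p_pooled: (D,) winning probabilities at the pooled fc7 layer.
    fc7_per_frame: (T, D) non-negative fc7 activations of the T frames.
    Returns a (T, D) array of per-frame probabilities with the same total mass."""
    totals = fc7_per_frame.sum(axis=0, keepdims=True) + eps   # (1, D)
    shares = fc7_per_frame / totals                            # each column sums to ~1
    return shares * p_pooled[None, :]                          # distribute per dimension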

Figure 2: Grounding Action Recognition. The red arrows depict cEB-R for spatiotemporal grounding of the action CliffDiving. Starting from the last LSTM time-step, cEB-R backpropagates the probability distribution through time and through the CNN at every time-step. The saliency map for each time-step is used for the spatial localization. The sum of each saliency map, over time, is then used for temporal localization of the action within the video, as described in Sec. 4.1.

5 Experiments: Action Grounding

In this work we ground the decisions made by our deep models. In order to evaluate this grounding, we compare it with methods that localize actions. Although our framework is able to jointly localize actions in space and time, we report results for spatial localization and temporal localization separately due to the lack of an action dataset that has untrimmed videos with spatiotemporal bounding boxes.

5.1 Spatial Localization

In this section we evaluate how well we ground actions in space. We do this by comparing our grounding results with ground-truth bounding boxes localizing actions per-frame.

Dataset.

THUMOS14 [4] provides per-frame bounding box annotations of humans performing actions for 1815 videos of 14 classes from the UCF101 dataset [20]. UCF101 is a trimmed video dataset containing 13320 actions belonging to 101 action classes.

Baselines.

We compare our formulation against spatial top-down saliency using a CNN (treating every video frame as an independent image). We also compare against standard backpropagation (BP), and BP for RNNs (BP-R).

Models.

We use the following CNN model: the VGG-16 of Ma et al. [9], trained on UCF101 video frames and BU101 web images for action recognition, with a test accuracy of 83.5%. We use the following CNN-LSTM model: the same VGG-16, fine-tuned with a one-layer LSTM on UCF101 for action recognition, with a test accuracy of 83.3%.

Figure 3: Grounding Captioning. The red arrows depict cEB-R for spatiotemporal caption grounding. The video caption produced by the model is A man is singing on a stage. Starting from the time-step corresponding to the word singing, cEB-R backpropagates the probability distribution through the previous time-steps and through the CNN. The saliency map for each time-step is used for the spatial localization. The sum of each saliency map, over time, is then used for temporal localization of the word within the clip.
Grounding Surfing using EB-R (L) and cEB-R (R)
Grounding BasketballDunk using EB-R (L) and cEB-R (R)

Setup and Results.

We use the bounding box annotations to evaluate our spatial grounding using the pointing game introduced by Zhang et al. [31]. We locate the point having the maximum value on each top-down saliency map. Following [31], if a 15-pixel diameter circle around the located point intersects the ground-truth bounding box of the action category for a frame, we record a hit; otherwise, we record a miss. We measure the spatial action localization accuracy by $Acc = \frac{\#Hits}{\#Hits + \#Misses}$ over all the annotated frames for each action.
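
For clarity, a sketch of this hit/miss test is given below; it is our own reading of the protocol as described, and the box format (x_min, y_min, x_max, y_max) is an assumption.

import numpy as np

def pointing_game_hit(saliency_map, box, radius=7.5):
    """radius is half of the 15-pixel diameter tolerance circle."""
    y, x = np.unravel_index(np.argmax(saliency_map), saliency_map.shape)
    x_min, y_min, x_max, y_max = box
    # distance from the located point to the box (0 if the point lies inside it)
    dx = max(x_min - x, 0, x - x_max)
    dy = max(y_min - y, 0, y - y_max)
    return dx ** 2 + dy ** 2 <= radius ** 2

def pointing_game_accuracy(hits, misses):
    return hits / float(hits + misses)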

Table ? reports the results of the spatial pointing game. Extending top-down saliency in time (-R) consistently improves the accuracy of all three methods, by at least 3.5% absolute, compared to performing top-down saliency separately on every frame of the video using a CNN. EB-R has the greatest absolute improvement, 6.6%.

We note that the non-contrastive versions outperform their contrastive counterparts. This is because the contrastive versions highlight evidence that is discriminative for an action, which may not necessarily be the human performing the action. For example, for many actions in UCF101 the human may be in a standing position, in which case cEB-R will highlight cues that are discriminative and unique to this action rather than highlighting the human. These cues may belong to the context in which the activity is performed, and depend on the action classes on which the model was trained. We demonstrate this in Fig. ? for the actions Surfing and BasketballDunk.

Accuracy of the spatial pointing game conducted on 2K videos of UCF101 for spatially locating humans performing actions in videos. The results show that extending top-down saliency in time (-R) improves the accuracy compared to performing top-down saliency separately on every frame of the video using a CNN. The non-contrastive versions work better for reasons described in the text.

Method        EB     EB-R   cEB    cEB-R  BP     BP-R
Accuracy (%)  60.5   67.1   31.5   35.0   34.8   38.5

5.2 Temporal Localization

In this section we evaluate how well we ground actions in time. We do this by comparing our grounding results with ground-truth action boundaries.

Datasets.

We first use a simple and controlled setting to validate our method by creating a synthetic action detection dataset. We then present results on the THUMOS14 [4] action detection dataset. The synthetic dataset is created by concatenating two uniformly sampled UCF101 videos: a ground truth (GT) video and a random (rand) background video, such that class(GT) ≠ class(rand). The two actions are concatenated, first sequentially (rand + GT or GT + rand) in 16-frame clips, and then with the GT action inserted at a random position (rand + GT + rand) in 128-frame clips. We use all 3783 test videos provided in UCF101, each in combination with a different random background video. The THUMOS14 dataset consists of 1010 untrimmed validation videos and 1574 untrimmed test videos of 20 action classes. Among the test videos, we evaluate our framework on the 213 test videos that contain annotations, as in [24].
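
As an illustration, a hedged sketch of how such a synthetic clip could be assembled is given below; the clip and GT lengths are parameters because the experiments vary the GT length, and the exact frame-sampling details are our assumption.

import random

def make_synthetic_clip(gt_frames, rand_frames, clip_len=128, gt_len=64):
    """Returns (frames, (t_start, t_end)) with the GT segment boundaries.
    Assumes both source videos contain at least the requested number of frames."""
    gt_idx = sorted(random.sample(range(len(gt_frames)), gt_len))          # sample GT frames
    bg_idx = sorted(random.sample(range(len(rand_frames)), clip_len - gt_len))
    gt_clip = [gt_frames[i] for i in gt_idx]
    bg_clip = [rand_frames[i] for i in bg_idx]
    start = random.randint(0, clip_len - gt_len)        # rand + GT + rand layout
    frames = bg_clip[:start] + gt_clip + bg_clip[start:]
    return frames, (start, start + gt_len - 1)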

Grounding of the action TableTennisShot in the video
Grounding of the action Skiing in the video

Baselines.

For the synthetic experiment, we compare EB-R and cEB-R with a probability-based approach in which we threshold the predicted probability of the GT class at every time-step. For the detection experiment on THUMOS14, we compare our proposed method with state-of-the-art approaches.

Models.

For the synthetic dataset, we use the same CNN-LSTM model used for spatial action grounding (Sec. 5.1). For the THUMOS14 dataset, we use a CNN-LSTM model consisting of the same VGG-16 used for spatial action grounding (Sec. 5.1), fine-tuned with a one-layer LSTM on UCF101 and on trimmed sequences from the THUMOS14 background and validation sets.

Setup and Results: Synthetic Data.

First, we perform experiments on the synthetic videos composed of two sequential actions, where the boundary is the midpoint. Fig. ? presents a sample spatiotemporal localization; the heatmaps produced by cEB-R correctly ground the queried action. While Fig. ? presents a qualitative sample, Figure 4 quantitatively presents results over the entire test set. The action switches from GT to rand, or vice versa, midway. It can be seen that the sum of the saliency maps is positive and increasing as more of the GT action is observed, and negative and decreasing as more of the rand action is observed.

Next, we perform experiments in which we vary the length of the GT action to be localized inside a clip. To retain action dynamics, we sample GT and rand frames from the entire length of their corresponding videos. Table 1 presents the temporal localization results on our synthetic data. In the experimental setup with fixed action length, we assume that the label and length of the action to be localized are known. To localize, we find the window of consecutive attention maps of the desired action length with the highest sum. For the sequences with unknown action lengths, we only assume the label of the action to be localized and apply the pipeline described in Section 4.1. In the bottom half of Table 1, we only report thresholded probabilities and cEB-R results, since our localization procedure assumes negative values at action boundaries, whereas EB-R is non-negative. The grounded evidence obtained by cEB-R attains the highest detection scores for action sequences of both known and unknown lengths, at the reported IoU overlap between detections and ground truth, despite the fact that the model is not trained for localization.
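
A sketch of the fixed-length variant (top half of Table 1) is given below: when the action length is assumed known, the detection is simply the window of that many consecutive frames whose saliency-map sums are largest. The function name is ours.

import numpy as np

def localize_known_length(saliency_sums, length):
    """saliency_sums: 1-D array of per-frame saliency-map sums.
    Returns (t_start, t_end) of the best window of the given length."""
    sums = np.asarray(saliency_sums, dtype=float)
    window_scores = np.convolve(sums, np.ones(length), mode="valid")  # sliding-window sums
    t_start = int(np.argmax(window_scores))
    return t_start, t_start + length - 1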

Setup and Results: THUMOS14 Pointing Game.

We evaluate the pointing game in time on THUMOS14, a fair evaluation for weakly supervised methods. For processing, we divide a video into consecutive 128-frame clips. We perform the pointing game by pointing in time [31] to the peak of the sum of saliency maps. For each ground-truth annotation, we check whether the detected peak lies within its boundaries; if so, we count a hit, otherwise a miss. We compare this approach with the peak position of the predicted probabilities and with a random point in the clip.
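
A sketch of this temporal pointing test is given below; it is our own reading of the protocol, with ground-truth intervals assumed to be inclusive frame indices.

import numpy as np

def temporal_pointing_hit(saliency_sums, gt_intervals):
    """saliency_sums: per-frame saliency-map sums for one clip.
    gt_intervals: list of (t_start, t_end) ground-truth annotations."""
    peak = int(np.argmax(saliency_sums))
    return any(s <= peak <= e for s, e in gt_intervals)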

The results of this experiment are presented in Table ?. Pointing to a random position clearly obtains the lowest accuracy, while peak probability (65.8%) and cEB-R (65.1%) perform similarly. However, peak probability does not offer spatial localization. Peak probability uses the model prediction, while cEB-R uses the evidence behind that prediction. Moreover, we observe that peak probability and cEB-R are complementary, yielding 77.4% when combined.

Setup and Results: THUMOS14 Action Detection.

We evaluate how well our grounding performs on the more challenging task of action detection, for which it was not trained. In this experiment, we divide a video into consecutive 128-frame clips for processing. Table ? presents the temporal detection results on the THUMOS14 dataset. Differently from the pointing game experiment, here we detect the start and end of the ground-truth action. We note that although our method is not supervised for the detection task, we achieve an accuracy of 57.32% when locating a ground truth class at the reported overlap, as demonstrated in Table ?.

6 Experiments: Caption Grounding

In this section, we show how cEB-R is also applicable in the context of caption grounding. As observed by [13], there is an absence of datasets with spatiotemporal annotations of frames for captions. Therefore, they propose the following experimental setup, which we follow: qualitative results for spatiotemporal grounding on videos, and quantitative results for spatial grounding on images.

Datasets.

We use the MSR-VTT [25] dataset for video captioning and Flickr30kEntities [12] for image captioning.

Models.

We use the CNN-LSTM-LSTM video captioning model of [22], trained on MSR-VTT, to test our cEB-R approach for spatiotemporal grounding, as described in Section 4.2. We use the same video captioning model, without the average pooling layer, trained on Flickr30kEntities for image captioning. The models have METEOR scores comparable to those of the Caption-Guided Saliency work of [13], to which we compare our results: 26.5 (vs. 25.9) for video captioning and 18.0 (vs. 18.3) for image captioning.

Setup and Results.

For the MSR-VTT video dataset, we sample 26 frames per video, following [13], and perform grounding of nouns. Fig. ? presents the grounding of the words man and phone in the same video. The man is well localized only in frames where a man appears, and the phone is well localized in frames where a phone appears.

Figure 4: Sum of the saliency maps at fc7 over time (in frames) for synthetic videos that (blue) have a rand action followed by a GT action and (green) have a GT action followed by a rand action. The average and standard deviation are reported over all test videos. cEB-R provides a clear and accurate boundary between the actions at the midpoint.
Table 1: Action detection results on synthetic data, measured by mAP at a fixed IoU threshold. Top part of the table: methods assume that the length and label of the action to be detected are known. Bottom part: methods assume that the label is known but the length is unknown. cEB-R attains the best performance.
Known action length:
GT action length    Thresholded prob.    EB-R    cEB-R
11                  8.5                  11.3    15.5
41                  28.2                 38.5    53.2
65                  47.7                 56.3    73.5

Unknown action length:
GT action length    Thresholded prob.    EB-R    cEB-R
11                  3.4                  -       4.1
41                  9.5                  -       47.9
65                  35.7                 -       62.0
Pointing game in time performed on the THUMOS14 test set. The probability of an action and the evidence for the presence of the action are complementary, giving a large improvement in accuracy when combined.

Method                       Accuracy (%)
Random                       57.3
Peak probability             65.8
cEB-R                        65.1
Peak probability + cEB-R     77.4

Our weakly supervised approach compared with fully supervised approaches for action detection on THUMOS14, measured by mAP at a fixed IoU threshold. Although our model is not trained for action detection (it is trained for recognition), we achieve 57.9%, which is comparable to the state of the art when localizing a ground truth action in a video.

mAP
Karaman [6] 4.6
Wang [23] 18.2
Oneata [10] 36.6
Richard [14] 39.7
Shou [17] 47.7
Yeung [28] 48.9
Yuan [29] 51.4
Xu [24] 54.5
Zhao [32] 60.3
Kaufman [8] 61.1
Ours 57.9

We quantitatively evaluate our spatial grounding results using the pointing game on Flickr30kEntities and compare our method to the Caption-Guided Saliency work of [13], following their evaluation protocol. We use ground truth captions as input to our model in order to reproduce the same captions. Then, we use the bounding box annotations for each noun phrase in the ground truth captions and check whether the maximum point in the corresponding saliency map is inside the annotated bounding box.

Table 2 shows the results of the spatial pointing game on Flickr30kEntities. Our approach achieves performance comparable to [13]. In this experiment, we ground the ground truth captions to match the experimental setup in [13]. Although we follow their protocol for fair comparison, we note that our method better highlights evidence when using generated captions rather than ground truth captions. This is because the evidence for a ground truth noun that is not predicted may not be sufficiently activated in the forward pass. Fig. ? presents some visual examples of grounding in images using the generated captions.

Our approach has a computational advantage over [13]. In order to obtain the spatial saliency maps for a word in a video, EB-R requires one forward pass and one backward pass through the CNN-LSTM-LSTM, while [13] requires one forward pass through the CNN part but one forward pass through the LSTM-LSTM part for every location of the saliency map, versus our single backward pass. Moreover, they require additional forward LSTM passes, one per frame, to compute the temporal grounding, whereas ours is implicitly spatiotemporal.

grounding of the word man
grounding of the word phone
Table 2: Evaluation of spatial saliency on Flickr30kEntities using EB-R. Baseline random samples the maximum point uniformly and Baseline center always picks the center.

Method                     Avg (Noun Phrases)
Baseline random            0.268
Baseline center            0.492
Caption-Guided Saliency    0.501
Ours                       0.512
image caption: A man in a lab coat is working on a microscope.
image caption: A cowboy is riding a bucking horse.

7 Conclusion

In conclusion, we devise a temporal formulation, cEB-R, that enables us to visualize how recurrent networks ground their decisions in visual content. We apply it to two video understanding tasks: video action recognition and video captioning. We demonstrate how spatiotemporal top-down saliency is capable of grounding evidence on several action and captioning datasets. These datasets provide annotations for detection and/or localization, against which we have compared the evidence in our generated saliency maps. We observe the strength of cEB-R in highlighting discriminative evidence, which was particularly beneficial for temporal grounding. We also observe the strength of its non-contrastive variant, EB-R, in highlighting salient evidence, which was particularly beneficial for spatial localization of action subjects.

Acknowledgments

We thank Kate Saenko and Vasili Ramanishka for helpful discussions. This work was supported in part through NSF grants 1551572 and 1029430, an IBM PhD Fellowship, and gifts from Adobe and NVidia.

Footnotes

  1. In this work we use the terms ground and localize interchangeably.

References

  1. Look and think twice: Capturing top-down visual attention with feedback convolutional neural networks.
    C. Cao, X. Liu, Y. Yang, Y. Yu, J. Wang, Z. Wang, Y. Huang, L. Wang, C. Huang, W. Xu, et al. In Proceedings of the IEEE International Conference on Computer Vision, pages 2956–2964, 2015.
  2. Long-term recurrent convolutional networks for visual recognition and description.
    J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2625–2634, 2015.
  3. LSTM: A search space odyssey.
    K. Greff, R. K. Srivastava, J. Koutník, B. R. Steunebrink, and J. Schmidhuber. IEEE transactions on neural networks and learning systems, 2016.
  4. THUMOS challenge: Action recognition with a large number of classes.
    Y.-G. Jiang, J. Liu, A. Roshan Zamir, G. Toderici, I. Laptev, M. Shah, and R. Sukthankar. http://crcv.ucf.edu/THUMOS14/, 2014.
  5. An empirical exploration of recurrent network architectures.
    R. Jozefowicz, W. Zaremba, and I. Sutskever. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 2342–2350, 2015.
  6. Fast saliency based pooling of Fisher encoded dense trajectories.
    S. Karaman, L. Seidenari, and A. Del Bimbo. In ECCV THUMOS Workshop, volume 1, page 5, 2014.
  7. Visualizing and understanding recurrent networks.
    A. Karpathy, J. Johnson, and L. Fei-Fei. In ICLR Workshop, 2016.
  8. Temporal tessellation: A unified approach for video analysis.
    D. Kaufman, G. Levi, T. Hassner, and L. Wolf. In The IEEE International Conference on Computer Vision (ICCV), 2017.
  9. Do less and achieve more: Training cnns for action recognition utilizing action images from the web.
    S. Ma, S. A. Bargal, J. Zhang, L. Sigal, and S. Sclaroff. Pattern Recognition, 2017.
The LEAR submission at THUMOS 2014.
    D. Oneata, J. Verbeek, and C. Schmid. 2014.
  11. Is object localization for free? - weakly-supervised learning with convolutional neural networks.
    M. Oquab, L. Bottou, I. Laptev, and J. Sivic. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  12. Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models.
    B. A. Plummer, L. Wang, C. M. Cervantes, J. C. Caicedo, J. Hockenmaier, and S. Lazebnik. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 2641–2649, 2015.
  13. Top-down visual saliency guided by captions.
    V. Ramanishka, A. Das, J. Zhang, and K. Saenko. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  14. Temporal action detection using a statistical language model.
    A. Richard and J. Gall. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3131–3140, 2016.
  15. Grad-cam: Visual explanations from deep networks via gradient-based localization.
R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra. arXiv preprint arXiv:1610.02391, 2016.
  16. Cdc: Convolutional-de-convolutional networks for precise temporal action localization in untrimmed videos.
    Z. Shou, J. Chan, A. Zareian, K. Miyazawa, and S.-F. Chang. arXiv preprint arXiv:1703.01515, 2017.
  17. Temporal action localization in untrimmed videos via multi-stage cnns.
    Z. Shou, D. Wang, and S.-F. Chang. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1049–1058, 2016.
  18. Deep inside convolutional networks: Visualising image classification models and saliency maps.
    K. Simonyan, A. Vedaldi, and A. Zisserman. arXiv preprint arXiv:1312.6034, 2013.
  19. Very deep convolutional networks for large-scale image recognition.
    K. Simonyan and A. Zisserman. arXiv preprint arXiv:1409.1556, 2014.
  20. Ucf101: A dataset of 101 human actions classes from videos in the wild.
    K. Soomro, A. R. Zamir, and M. Shah. arXiv preprint arXiv:1212.0402, 2012.
  21. Striving for simplicity: The all convolutional net.
    J. T. Springenberg, A. Dosovitskiy, T. Brox, and M. A. Riedmiller. CoRR, abs/1412.6806, 2014.
  22. Translating videos to natural language using deep recurrent neural networks.
    S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. North American Chapter of the Association for Computational Linguistics – Human Language Technologies NAACL-HLT, 2015.
  23. Action recognition and detection by combining motion and appearance features.
    L. Wang, Y. Qiao, and X. Tang. THUMOS14 Action Recognition Challenge, 1(2):2, 2014.
  24. R-c3d: Region convolutional 3d network for temporal activity detection.
    H. Xu, A. Das, and K. Saenko. The IEEE International Conference on Computer Vision (ICCV), 2017.
  25. Msr-vtt: A large video description dataset for bridging video and language.
    J. Xu, T. Mei, T. Yao, and Y. Rui. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5288–5296, 2016.
  26. Show, attend and tell: Neural image caption generation with visual attention.
    K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. In International Conference on Machine Learning, pages 2048–2057, 2015.
  27. Describing videos by exploiting temporal structure.
    L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. In Proceedings of the IEEE international conference on computer vision (ICCV), pages 4507–4515, 2015.
  28. End-to-end learning of action detection from frame glimpses in videos.
    S. Yeung, O. Russakovsky, G. Mori, and L. Fei-Fei. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2678–2687, 2016.
  29. Temporal action localization with pyramid of score distribution features.
    J. Yuan, B. Ni, X. Yang, and A. A. Kassim. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3093–3102, 2016.
  30. Visualizing and understanding convolutional networks.
    M. D. Zeiler and R. Fergus. In European conference on computer vision (ECCV), pages 818–833. Springer, 2014.
  31. Top-down neural attention by excitation backprop.
    J. Zhang, Z. Lin, J. Brandt, X. Shen, and S. Sclaroff. In European Conference on Computer Vision (ECCV), pages 543–559. Springer, 2016.
  32. Temporal action detection with structured segment networks.
    Y. Zhao, Y. Xiong, L. Wang, Z. Wu, X. Tang, and D. Lin. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  33. Learning Deep Features for Discriminative Localization.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  34. Learning deep features for discriminative localization.
    B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.