Cross-Modal and Hierarchical Modeling of Video and Text

Bowen Zhang (equal contribution), Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089
Hexiang Hu, Dept. of Computer Science, U. of Southern California, Los Angeles, CA 90089
Fei Sha, Netflix, 5808 Sunset Blvd, Los Angeles, CA 90028, fsha@netflix.com (on leave from U. of Southern California)

Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (hse), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance by the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, and show their utility in zero-shot action recognition and video captioning.

Keywords: Hierarchical Sequence Embedding, Video Text Retrieval, Video Description Generation, Action Recognition, Zero-shot Transfer

1 Introduction

Recently, there has been intense interest in multi-modal learning of vision + language. A few challenging tasks have been proposed: visual semantic embedding (VSE) [19, 17, 6], image captioning [40, 44, 14, 24], and visual question answering (VQA) [2, 49, 3]. To jointly understand these two modalities of data and make inferences over them, the main intuition is that different types of data can share a common semantic representation space. Examples are embedding images and visual categories [8], embedding images and texts for VSE [19], and embedding images, questions, and answers for VQA [13]. Once embedded into this common (vector) space, similarities and distances among originally heterogeneous data can be captured by learning algorithms.

While there has been rich study on how to discover this shared semantic representation for structures such as images, noun phrases (visual object or action categories) and sentences (such as captions, questions, answers), less is known about how to achieve this for more complex structures such as videos and paragraphs of texts (we use paragraphs and documents interchangeably throughout this work). There are conceptual challenges: while complex structured data can be mapped to vector spaces (for instance, using deep architectures [21, 10]), it is not clear whether the intrinsic structures in the data's original format, after being transformed to vectorial representations, still maintain their correspondence and relevance across modalities.

Take the dense video description task as an example [20]. The task is to describe a video which is made of short, coherent and meaningful clips. (Note that those clips could overlap temporally.) Due to its narrowly focused semantic content, each clip is then describable with a sentence. The description for the whole video is then a paragraph of text with sentences linearly arranged in order. Arguably, a corresponding pair of a video and its descriptive paragraph can be embedded into a semantic space where their embeddings are close to each other, using a vanilla learning model that ignores the boundaries of clips and sentences and treats the data as a sequence of continually flowing visual frames and words. However, for such a modeling strategy, it is unclear whether and how the correspondences at the "lower level" (i.e. clips versus sentences) are useful in either deriving the embeddings or using the embeddings to perform downstream tasks such as video or text retrieval.

Figure 1: Conceptual diagram of our approach for cross-modal modeling of video and texts. The main idea is to embed both low-level (clips and sentences) and high-level (video and paragraph) in their own semantic spaces coherently. As shown in the figure, the 3 sentences (and the corresponding 3 clips) are mapped into a local embedding space where the corresponding pairs of clips and sentences are placed close to each other. As a whole, the videos and the paragraphs are mapped into a global semantic space where their embeddings are close. See Fig. 3 and texts for details.

Addressing these deficiencies, we propose a novel cross-modal learning approach to model both videos and texts jointly. The main idea is schematically illustrated in Fig. 1. Our approach is mindful of the intrinsic hierarchical structures of both videos and texts, and models them with hierarchical sequence learning models such as GRUs [5]. However, as opposed to methods which disregard low-level correspondences, we exploit them by deriving loss functions to ensure the embeddings for the clips and sentences are also in accordance in their own (shared) semantic space. Those low-level embeddings in turn strengthen the desiderata that videos and paragraphs are embedded coherently. We demonstrate the advantages of the proposed model in a range of tasks including video and text retrieval, zero-shot action recognition and video description.

The rest of the paper is organized as follows. In section 2, we discuss related work. We describe our proposed approach in section 3, followed by extensive experimental results and ablation studies in section 4. We conclude in section 5.

2 Related Work

Hierarchical Sequence Embedding Models.

Embedding images, videos, and textual data has been very popular with the rise of deep learning. The works most related to ours are [22] and [28]. The former models the paragraph using a hierarchical auto-encoder for text modeling [22], and the latter uses a hierarchical RNN for videos and a one-layer RNN for caption generation. In contrast, our work models both modalities hierarchically and learns the parameters by leveraging the correspondences across modalities. Works motivated by other application scenarios usually explore hierarchical modeling in one modality [27, 45, 47].

Cross-modal Embedding Learning.

There has been a rich history of learning embeddings for images and smaller linguistic units (such as words and noun phrases). DeViSE [8] learns to align the latent embeddings of visual data and the names of the visual object categories. ReViSE [37] uses auto-encoders to derive embeddings for images and words, which allows it to leverage unlabeled data. In contrast to previous methods, our approach models both videos and texts hierarchically, bridging the embeddings at different granularities using a discriminative loss computed on corresponding pairs (i.e. videos vs. paragraphs).

Action Recognition in Videos.

Deep learning has brought significant improvements to video understanding [33, 36, 7, 41, 46, 43] on large-scale action recognition datasets [11, 34, 16] in the past decade. Most of these methods [33, 7, 41] employ deep convolutional neural networks to learn appearance features and motion information, respectively. Building on the spatio-temporal features from these video modeling methods, we learn a video semantic embedding that matches the holistic video representation to the text representation. To evaluate the generalization of our learned video semantic representation, we evaluate the model directly on a challenging action recognition benchmark (details in Section 4.4).

3 Approach

We begin by describing the problem settings and introducing necessary notations. We then describe the standard sequential modeling technique, ignoring the hierarchical structures in the data. Finally, we describe our approach.

3.1 Settings and Notations

We are interested in modeling videos and texts that are paired in correspondence. In Section 3.4, we describe how to generalize this to the case where there is no one-to-one correspondence.

A video consists of a sequence of clips (or subshots), where each clip contains a sequence of frames. Each frame is represented by a visual feature vector. This feature vector can be derived in many ways, for instance, by feeding the frame (and its contextual frames) to a convolutional neural network and using the outputs from the penultimate layer. Likewise, we assume there is a paragraph of text describing the video. The paragraph contains one sentence for each video clip, and each sentence is a sequence of words, each represented by a word feature vector. The training data is a set of corresponding pairs of videos and text descriptions.

We compute a clip embedding from the frame features of each clip, and a sentence embedding from the word features of each sentence. From those, we derive the embeddings for the video and the paragraph, respectively.

Figure 2: Flat sequence modeling of videos and texts, ignoring the hierarchical structures in either and regarding the video (paragraph) as a sequence of frames (words).

3.2 Flat Sequence Modeling

Many sequence-to-sequence (seq2seq) methods leverage the encoder-decoder structure [35, 25] to model the process of transforming an input sequence into an output sequence. In particular, the encoder, which is composed of a layer of long short-term memory units (LSTMs) [12] or Gated Recurrent Units (GRUs) [5], transforms the input sequence into a single vector that serves as its embedding. The similarly constructed decoder takes this embedding as input and outputs another sequence.

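The original seq2seq methods do not consider the hierarchical structures in videos or texts. We refer to the resulting embeddings as flat sequence embeddings (fse): the video embedding is the encoder output on the full sequence of frame features, and the paragraph embedding is the encoder output on the full sequence of word features.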


Fig. 2 schematically illustrates this idea. We measure how well a video and a text are aligned by the cosine similarity between the video embedding and the paragraph embedding.
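As an illustration, the cosine-similarity matching score can be sketched in NumPy as follows. The function names `match` and `match_matrix` and the small `eps` guard against division by zero are hypothetical choices made for this sketch; they are not part of the method description.

```python
import numpy as np

def match(video_emb, para_emb, eps=1e-8):
    """Cosine similarity between one video embedding and one paragraph embedding."""
    v = np.asarray(video_emb, dtype=float)
    p = np.asarray(para_emb, dtype=float)
    return float(v @ p / (np.linalg.norm(v) * np.linalg.norm(p) + eps))

def match_matrix(video_embs, para_embs, eps=1e-8):
    """All-pairs cosine similarity: entry (k, l) scores video k against paragraph l."""
    V = np.asarray(video_embs, dtype=float)
    P = np.asarray(para_embs, dtype=float)
    # Normalize each row to unit length, then take inner products.
    V = V / (np.linalg.norm(V, axis=1, keepdims=True) + eps)
    P = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    return V @ P.T
```

The all-pairs form is convenient because both the margin-based losses and the retrieval metrics operate on the full matrix of scores between a set of videos and a set of paragraphs.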


Figure 3: Hierarchical cross-modal modeling of videos and texts. We differ from previous works [22, 28] in two aspects (components in red color): layer-wise reconstruction through decoders, and matching at both global and local levels. See texts for details.

3.3 Hierarchical Sequence Modeling

One drawback of flat sequential modeling is that the LSTM/GRU layer needs to have a sufficient number of units to model well the potential long-range dependency among video frames (or words). This often complicates learning as the optimization becomes difficult [29].

We leverage the hierarchical structures in those data to overcome this deficiency: a video is made of clips which are made of frames. In parallel, a paragraph of texts is made of sentences which in turn are made of words. Similar ideas have been explored in [28, 22] and other previous works. The basic idea is illustrated in Fig. 3, where we also add components in red color to highlight our extensions.

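Hierarchical Sequence Embedding. Given the hierarchical structures in Fig. 3, we compute the embeddings with forward passes through two stacked encoders per modality: a low-level encoder maps the frames of each clip (the words of each sentence) to a clip (sentence) embedding, and a high-level encoder maps the sequence of clip (sentence) embeddings to the video (paragraph) embedding.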
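The two-level forward pass can be sketched as follows. This is a minimal NumPy illustration under stated assumptions, not the implementation used in the experiments: the class `GRUEncoder`, the function `hierarchical_embed`, the random initialization, the omission of bias terms, and the use of the last hidden state as the sequence embedding are all choices made for this sketch.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Single-layer GRU (biases omitted) whose last hidden state is the embedding."""

    def __init__(self, input_dim, hidden_dim, rng):
        scale = 1.0 / np.sqrt(hidden_dim)
        shape = (hidden_dim, input_dim + hidden_dim)
        # One weight matrix per gate, acting on the concatenation [x_t, h_{t-1}].
        self.W_z = rng.uniform(-scale, scale, size=shape)  # update gate
        self.W_r = rng.uniform(-scale, scale, size=shape)  # reset gate
        self.W_h = rng.uniform(-scale, scale, size=shape)  # candidate state
        self.hidden_dim = hidden_dim

    def encode(self, sequence):
        h = np.zeros(self.hidden_dim)
        for x in sequence:
            xh = np.concatenate([x, h])
            z = sigmoid(self.W_z @ xh)
            r = sigmoid(self.W_r @ xh)
            h_tilde = np.tanh(self.W_h @ np.concatenate([x, r * h]))
            h = (1.0 - z) * h + z * h_tilde
        return h

def hierarchical_embed(units, low_encoder, high_encoder):
    """units: one array per clip (frames x dim) or per sentence (words x dim).

    Returns the low-level embeddings (one row per clip or sentence) and the
    high-level embedding of the whole video or paragraph.
    """
    low = [low_encoder.encode(u) for u in units]
    high = high_encoder.encode(low)
    return np.stack(low), high
```

The same function serves both modalities: a video is passed as a list of per-clip frame-feature arrays, and a paragraph as a list of per-sentence word-feature arrays, each with its own pair of encoders.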


Learning with Discriminative Loss. For videos and texts with strong correspondences, where clips and sentences are paired, we optimize the encoders such that videos and texts are matched. To this end, we define two loss functions, one for matching clips to sentences at the low level and one for matching videos to paragraphs at the high level.


These losses are margin-based losses [32], in which positive margins separate matched pairs from unmatched ones. They use the standard hinge function, which returns its argument when it is positive and zero otherwise.
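A margin-based matching loss of this kind can be sketched as below. The sketch assumes a bidirectional ranking form in which every unmatched pair in a batch serves as a negative and violations are summed; the function name `matching_loss` and these specific choices are assumptions made for illustration, and the same function would be applied once to clip-sentence scores and once to video-paragraph scores, each with its own margin.

```python
import numpy as np

def matching_loss(sim, margin):
    """Bidirectional hinge ranking loss over a square score matrix.

    sim[k, l] is the similarity between video (or clip) k and paragraph
    (or sentence) l; matched pairs lie on the diagonal.
    """
    sim = np.asarray(sim, dtype=float)
    pos = np.diag(sim)
    off_diag = ~np.eye(sim.shape[0], dtype=bool)
    # A wrong text l should score at least `margin` below the matched text of video k.
    cost_v2p = np.maximum(0.0, margin - pos[:, None] + sim)
    # A wrong video k should score at least `margin` below the matched video of text l.
    cost_p2v = np.maximum(0.0, margin - pos[None, :] + sim)
    return float(cost_v2p[off_diag].sum() + cost_p2v[off_diag].sum())
```

For example, with scores `[[0.9, 0.2], [0.8, 0.7]]` and a margin of 0.2, only the unmatched pair with score 0.8 violates the margin (by 0.3 against the matched score 0.7 and by 0.1 against the matched score 0.9), giving a loss of 0.4.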

Learning with Contrastive Loss. Assuming videos and texts are well clustered, we use a contrastive loss to model their clustering in their own space.


Note that the self-matching value of a video (or a paragraph) with itself is 1 by definition, since the cosine similarity of a vector with itself is 1. This loss can be computed on videos and texts alone and does not require them to be matched.
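One form such a within-modality loss could take is sketched below. It assumes a hinge on the gap between the self-matching value (which is 1) and the similarity to every other item of the same modality; the function name `clustering_loss`, the single `margin` parameter, and the summation over all ordered pairs are assumptions made for this sketch rather than a statement of the exact loss.

```python
import numpy as np

def clustering_loss(embs, margin, eps=1e-8):
    """Hinge loss pushing embeddings of different items of one modality apart.

    embs: array with one embedding per row (all videos, or all paragraphs).
    """
    X = np.asarray(embs, dtype=float)
    X = X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)
    sim = X @ X.T  # within-modality cosine similarities
    off_diag = ~np.eye(len(X), dtype=bool)
    # The self-similarity is 1, so each term is hinge(margin - 1 + sim[k, k']).
    return float(np.maximum(0.0, margin - 1.0 + sim[off_diag]).sum())
```

Because it only compares items of the same modality, this function takes a single set of embeddings and needs no pairing information.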

Learning with Unsupervised Layer-wise Reconstruction Loss. Thus far, the matching loss focuses on matching across modalities, and the clustering loss focuses on separating video (or text) data from one another so that they do not overlap. Neither, however, focuses on how well the data itself is modeled. In what follows, we propose a layer-wise reconstruction loss: when minimized, this loss ensures that the learned video/text embedding faithfully preserves information in the data.

We first introduce a set of layer-wise decoders for both videos and texts. The key idea is to pair the encoders with decoders so that each pair of functions is an auto-encoder. Specifically, the decoder is also a layer of LSTM/GRU units, generating sequences of data. Thus, at the level of the video (or paragraph), we have a decoder to generate clips (or sentences), and at the level of clips (or sentences), we have a decoder to generate frames (or words). Concretely, we would like to minimize the difference between what is generated by the decoders and what is computed by the encoders on the data.

At the high level, two decoders, one for videos and one for texts, generate sequences of clip and sentence embeddings from the video and paragraph embeddings, respectively. At the low level, two further decoders take each generated clip or sentence embedding as input and output a sequence of generated frame or word embeddings.

Using those generated embeddings, we construct a loss function characterizing how well the encoders encode the data pair.
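One way to write the decoders and the resulting loss is sketched below in LaTeX. All symbols are notation assumed for this sketch: $f$ and $g$ denote the video and paragraph embeddings, $f_i$ and $g_i$ the clip and sentence embeddings computed by the encoders, $x_{ij}$ and $w_{ij}$ the frame and word features, and hats mark decoder outputs. The squared Euclidean distance is likewise an assumption; the text only requires some measure of the difference between generated and encoded quantities.

```latex
% Assumed notation: DEC^{(2)} are high-level decoders, DEC^{(1)} low-level decoders.
\begin{align*}
\{\hat{f}_i\} &= \mathrm{DEC}^{(2)}_{v}(f), &
\{\hat{g}_i\} &= \mathrm{DEC}^{(2)}_{p}(g), \\
\{\hat{x}_{ij}\} &= \mathrm{DEC}^{(1)}_{v}(\hat{f}_i), &
\{\hat{w}_{ij}\} &= \mathrm{DEC}^{(1)}_{p}(\hat{g}_i).
\end{align*}
% Reconstruction loss for one video-paragraph pair (squared error is an assumption).
\begin{equation*}
\begin{split}
\ell_{\mathrm{reconstruct}}(v, p)
 ={}& \sum_i \Big( \|\hat{f}_i - f_i\|_2^2 + \sum_j \|\hat{x}_{ij} - x_{ij}\|_2^2 \Big) \\
    &+ \sum_i \Big( \|\hat{g}_i - g_i\|_2^2 + \sum_j \|\hat{w}_{ij} - w_{ij}\|_2^2 \Big).
\end{split}
\end{equation*}
```

The first term in each bracket is the high-level reconstruction error (clip or sentence embeddings) and the second is the low-level one (frame or word features), which matches the layer-wise structure described above.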

3.4 Final Learning Objective and Its Extensions

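The final learning objective balances all of those loss quantities: it combines a high-level loss over the video-paragraph pairs with a low-level loss over the clip-sentence pairs. Each of the two losses is composed of the corresponding matching, contrastive, and reconstruction terms, with a trade-off weight on the reconstruction term.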


In our experiments, we will study the contribution by each term.
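A sketch of how these terms might be combined is given below in LaTeX. The symbols are assumptions for illustration: $(v_k, p_k)$ is the $k$-th video-paragraph pair, $(c_{ki}, s_{ki})$ its $i$-th clip-sentence pair, and $\tau$ a weight on the reconstruction terms. The unit weights on the matching and contrastive terms are also an assumption.

```latex
% Assumed form: a sum of high-level and low-level losses over the training pairs.
\begin{align*}
\ell &= \sum_k \ell^{\mathrm{high}}(v_k, p_k)
      + \sum_k \sum_i \ell^{\mathrm{low}}(c_{ki}, s_{ki}), \\
\ell^{\mathrm{high}} &= \ell^{\mathrm{high}}_{\mathrm{match}}
      + \ell^{\mathrm{high}}_{\mathrm{cluster}}
      + \tau\, \ell^{\mathrm{high}}_{\mathrm{reconstruct}}, \\
\ell^{\mathrm{low}} &= \ell^{\mathrm{low}}_{\mathrm{match}}
      + \ell^{\mathrm{low}}_{\mathrm{cluster}}
      + \tau\, \ell^{\mathrm{low}}_{\mathrm{reconstruct}}.
\end{align*}
```

Under this form, setting $\tau = 0$ removes the decoders from training altogether, and dropping $\ell^{\mathrm{low}}$ corresponds to learning from high-level correspondences only.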

Learning under Weak Correspondences. Our idea can be also extended to the common setting where only high-level alignments are available. In fact, high-level coarse alignments of data are easier and more economical to obtain, compared to fine-grained alignments between each sub-level sentence and video clip.

Since we do not have enough information to define the low-level matching loss exactly, we resort to an approximation. We first define an averaged matching score over all pairs of clips and sentences for a pair of video and paragraph, that is, the mean of the clip-sentence matching scores taken over every clip of the video and every sentence of the paragraph. Here we relax the assumption that there is precisely the same number of sentences and clips. We use this averaged quantity to approximate the low-level matching loss.


This objective will push a clip embedding closer to the embeddings of the sentences belonging to the corresponding video (and vice versa for sentences to the corresponding video). A more refined approximation involving a soft assignment of matching can also be derived, which will be left for future work.
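The averaged matching and its use in a margin-based loss can be sketched as follows. The function names `averaged_match` and `weak_low_level_loss`, the bidirectional hinge form, and the use of all other items in the batch as negatives are assumptions made for this sketch.

```python
import numpy as np

def averaged_match(clip_embs, sent_embs, eps=1e-8):
    """Mean cosine similarity over every (clip, sentence) pair of one video and one paragraph.

    The number of clips and the number of sentences may differ.
    """
    C = np.asarray(clip_embs, dtype=float)
    S = np.asarray(sent_embs, dtype=float)
    C = C / (np.linalg.norm(C, axis=1, keepdims=True) + eps)
    S = S / (np.linalg.norm(S, axis=1, keepdims=True) + eps)
    return float((C @ S.T).mean())

def weak_low_level_loss(videos_clips, paragraphs_sents, margin):
    """Hinge ranking loss on averaged clip-sentence matching.

    videos_clips[k] holds the clip embeddings of video k, and
    paragraphs_sents[k] the sentence embeddings of its paired paragraph.
    """
    n = len(videos_clips)
    A = np.array([[averaged_match(c, s) for s in paragraphs_sents]
                  for c in videos_clips])
    pos = np.diag(A)
    off_diag = ~np.eye(n, dtype=bool)
    v2p = np.maximum(0.0, margin - pos[:, None] + A)
    p2v = np.maximum(0.0, margin - pos[None, :] + A)
    return float(v2p[off_diag].sum() + p2v[off_diag].sum())
```

Only the pairing between whole videos and whole paragraphs is used here; no clip is ever told which sentence it belongs to.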

4 Experiments

We evaluate and demonstrate the advantage of learning hierarchical cross-modal embedding with our proposed approach on several tasks: (i) large-scale video-paragraph retrieval (Section 4.2), (ii) down-stream tasks such as video captioning (Section 4.3), and (iii) action recognition (Section 4.4).

4.1 Experiment Setups

4.1.1 Datasets.

We evaluate on three large-scale video datasets:

(1) ActivityNet Dense Caption [20]. This variant of ActivityNet contains densely labeled temporal segments for 10,009 training and 4,917/4,885 (val1/val2) validation videos. Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips. In all our retrieval experiments, we follow the setting in [20] and report retrieval metrics such as recall@k (k=1,5,50) and median rank (MR). Following [20], we use ground-truth clip proposals as input for our main results. In addition, we also study our algorithm with a heuristic proposal method (see Section 4.2.4). In the main text, we report all results on validation set 1 (val1). Please refer to the Supp. Material for the results on val2. For the video captioning experiment, we follow [20] and evaluate on the validation set (val1 and val2). Instead of using an action proposal method, ground-truth video segmentation is used for training and evaluation. Performance is reported in BLEU@K, METEOR and CIDEr.

(2) DiDeMo [1]. The original goal of the DiDeMo dataset is to locate the temporal segments that correspond to unambiguous natural language descriptions in a video. We re-purpose it for the task of video and paragraph retrieval. It contains 10,464 videos, 26,892 video clips and 40,543 sentences. The training, validation and testing splits contain 8,395, 1,065 and 1,004 videos and corresponding paragraphs, respectively. Each video clip may correspond to one or many sentences. For the video and paragraph retrieval task, paragraphs are constructed by concatenating all sentences that correspond to one video. Similar to the setting in ActivityNet, we use the ground-truth clip proposals as input.

(3) ActivityNet Action Recognition [11]. We use ActivityNet V1.3 for aforementioned off-the-shelf action recognition. The dataset contains 14,950 untrimmed videos with 200 action classes, which is split into training and validation set. Training and validation set have 10,024 and 4,926 videos, respectively. Among all 200 action classes, 189 of the action classes have been covered by the vocabulary extracted from the paragraph corpus and 11 of the classes are unseen.
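The retrieval metrics used on these datasets, recall@k and median rank, can be computed from a matrix of matching scores as in the following sketch. The function name `retrieval_metrics`, the convention that the correct candidate for query q sits at index q, and the treatment of ties (a tie does not worsen the rank) are assumptions made for this illustration.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 50)):
    """Recall@k (in percent) and median rank for a square score matrix.

    sim[q, c] is the score of candidate c for query q; the correct
    candidate for query q is assumed to be at index q.
    """
    sim = np.asarray(sim, dtype=float)
    correct = np.diag(sim)
    # Rank of the correct item = 1 + number of candidates scored strictly higher.
    ranks = 1 + (sim > correct[:, None]).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["MR"] = float(np.median(ranks))
    return metrics
```

Passing the video-by-paragraph score matrix evaluates retrieval of paragraphs given videos; passing its transpose evaluates the opposite direction.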

4.1.2 Baselines and Our Methods.

We use the fse method (as described in Section 3.2) as a baseline model. It ignores the clip and sentence structures in the videos and paragraphs. We train a one-layer GRU directly on the extracted frame/word features and take its outputs as the embedding representing each modality. Results with C3D features are also included (see Table 1).

Our method has two variants, distinguished by the weight placed on the reconstruction loss (the value shown in brackets after hse). When this weight is zero, the method (hse[=0]) reduces to a stacked/hierarchical sequence model as used in [22, 28], except that those works do not consider cross-modal learning with a cross-modal matching loss, while we do. We consider this a very strong baseline. When the weight is positive, hse takes full advantage of layer-wise reconstruction with multiple decoders at different levels of the hierarchy. In our experiments, this method gives the best results.

4.1.3 Implementation Details.

Following the settings of [20], we extract the C3D features [36] pre-trained on the Sports-1M dataset [15] for raw videos in ActivityNet. PCA is then used to reduce the dimensionality of the features to 500. To verify that our model generalizes across different sets of visual features, and to leverage state-of-the-art video models, we also employ the recently proposed TSN-Inception V3 network [41], pre-trained on the Kinetics [16] dataset, to extract visual features. Similarly, we extract TSN-Inception V3 features for videos in the DiDeMo dataset. We do not fine-tune the convolutional neural network during training, to reduce the computational cost. For word embeddings, we use 300-dimensional GloVe [30] features pre-trained on 840B tokens of Common Crawl web data. In all our experiments, we use GRUs as sequence encoders. For hse, we set the weight on the reconstruction loss to 5e-4 (the hse[=5e-4] rows in our tables), chosen by tuning this hyper-parameter on the val2 set of the ActivityNet retrieval dataset. The same value is used for experiments on DiDeMo, without further tuning. (More details are in the Supp. Material.)
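The PCA step mentioned above can be sketched with a singular value decomposition. This is a generic illustration with hypothetical function names `fit_pca` and `apply_pca`, not the exact preprocessing code; in the setting described, `n_components` would be 500 and the projection would be fitted on training features only.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA on row-wise feature vectors; returns the mean and principal directions."""
    X = np.asarray(features, dtype=float)
    mean = X.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components]

def apply_pca(features, mean, components):
    """Project features onto the fitted principal directions."""
    return (np.asarray(features, dtype=float) - mean) @ components.T
```

Note that `n_components` cannot exceed the smaller of the number of samples and the original feature dimension.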

4.2 Results on Video-Paragraph Retrieval

In this section, we first compare our proposed approach to the state-of-the-art algorithms, and then perform ablation studies on variants of our method, to evaluate the proposed learning objectives.

                     Paragraph → Video                      Video → Paragraph
Method               R@1       R@5       R@50      MR       R@1       R@5       R@50      MR
C3D Feature with Dimensionality Reduction [36]
lstm-yt [38]         0.0       4.0       24.0      102.0    0.0       7.0       38.0      98.0
no context [39]      5.0       14.0      32.0      78.0     7.0       18.0      45.0      56.0
dense online [20]    10.0      32.0      60.0      36.0     17.0      34.0      70.0      33.0
dense full [20]      14.0      32.0      65.0      34.0     18.0      36.0      74.0      32.0
fse                  12.6±0.4  33.2±0.3  77.6±0.3  12.0     11.5±0.5  31.8±0.3  77.7±0.3  13.0
hse[=0]              32.8±0.3  62.3±0.4  90.5±0.1  3.0      32.0±0.6  62.5±0.5  90.5±0.3  3.0
hse[=5e-4]           32.7±0.7  63.2±0.4  90.8±0.2  3.0      32.8±0.4  63.2±0.2  91.2±0.3  3.0
Inception-V3 pre-trained on Kinetics [42]
fse                  18.2±0.2  44.8±0.4  89.1±0.3  7.0      16.7±0.8  43.1±1.1  88.4±0.3  7.3
hse[=0]              43.9±0.6  75.8±0.2  96.9±0.3  2.0      43.3±0.6  75.3±0.6  96.6±0.2  2.0
hse[=5e-4]           44.4±0.5  76.7±0.3  97.1±0.1  2.0      44.2±0.6  76.7±0.3  97.0±0.3  2.0
Table 1: Video paragraph retrieval on ActivityNet (val1). Standard deviations over 3 experiments with different random seeds are also reported.

                     Paragraph → Video                      Video → Paragraph
Method               R@1       R@5       R@50      MR       R@1       R@5       R@50      MR
s2vt [39]            11.9      33.6      76.5      13.0     13.2      33.6      76.5      15.0
fse                  13.9±0.7  36.0±0.8  78.9±1.6  11.0     13.1±0.5  33.9±0.4  78.0±0.8  12.0
hse[=0]              30.2±0.8  60.5±1.1  91.8±0.7  3.3      29.4±0.4  58.9±0.7  91.9±0.6  3.7
hse[=5e-4]           29.7±0.2  60.3±0.9  92.4±0.3  3.3      30.1±1.2  59.2±0.9  92.1±0.5  3.0
Table 2: Video paragraph retrieval on the DiDeMo dataset. The s2vt method is re-implemented for the retrieval task.

4.2.1 Main Results.

We report our results on the ActivityNet Dense Caption val1 set and the DiDeMo test set in Table 1 and Table 2, respectively. For both C3D and Inception-V3 features, our hierarchical models improve on the previous state-of-the-art result by a large margin: on paragraph-to-video Recall@1, from 14.0 for dense full [20] to 32.7 with C3D features and 44.4 with Inception-V3 features. dense full [20], which models the flat sequences of clips, outperforms our fse baseline on Recall@1, as it augments each segment embedding with a weighted aggregated context embedding. However, it fails to model the more complex temporal structures of videos and paragraphs, which leads to inferior performance compared with our hse models.

Compared to our flat baseline model, both hse[=0] and hse[=5e-4] improve performance on all retrieval metrics. This implies that hierarchical modeling can effectively capture the structural information and the relationships among clips and sentences within videos and paragraphs. Moreover, we observe that hse[=5e-4] improves over hse[=0] on most retrieval metrics on both datasets. We attribute this to our layer-wise reconstruction objectives, which appear to improve generalization.

                                           Paragraph → Video               Video → Paragraph
Dataset      Model        Low-level loss   R@1       R@5       R@50      R@1       R@5       R@50
ActivityNet  hse[=0]      ✗                41.8±0.4  74.1±0.6  96.6±0.1  40.5±0.4  73.9±0.6  96.3±0.1
                          weak             42.6±0.4  74.8±0.3  96.7±0.1  41.3±0.2  74.7±0.4  96.5±0.1
                          strong           43.9±0.6  75.8±0.2  96.9±0.3  43.3±0.6  75.3±0.6  96.6±0.2
             hse[=5e-4]   ✗                42.5±0.3  74.8±0.1  96.9±0.0  41.6±0.2  74.7±0.6  96.6±0.1
                          weak             43.0±0.6  75.2±0.4  96.9±0.1  41.5±0.1  75.2±0.6  96.8±0.2
                          strong           44.4±0.5  76.7±0.3  97.1±0.1  44.2±0.6  76.7±0.3  97.0±0.3
DiDeMo       hse[=0]      ✗                27.1±1.9  59.1±0.4  92.2±0.3  27.3±1.0  57.6±0.5  91.3±1.2
                          weak             28.0±0.8  58.9±0.5  91.4±0.6  28.3±0.3  58.5±0.6  91.2±0.3
                          strong           30.2±0.8  60.5±1.1  91.8±0.7  29.4±0.4  58.9±0.7  91.9±0.6
             hse[=5e-4]   ✗                28.1±0.8  59.5±1.1  91.7±0.7  28.2±0.8  58.1±0.5  90.9±0.5
                          weak             28.7±2.1  59.1±0.2  91.6±0.7  28.3±0.8  59.2±0.6  91.1±0.1
                          strong           29.7±0.2  60.3±0.9  92.4±0.3  30.1±1.2  59.2±0.9  92.1±0.5
Table 3: Ablation studies on the learning objectives (✗: no low-level loss; weak/strong: low-level loss with weak/strong correspondences).

4.2.2 Low-level Loss is Beneficial.

Table 1 and Table 2 show results obtained by optimizing both low-level and high-level objectives. In Table 3, we further perform ablation studies on the learning objectives. Note that rows with ✗ represent learning without the low-level loss. In all scenarios, joint learning with both low-level and high-level correspondences improves the retrieval performance.

4.2.3 Learning with Weak Correspondences at Low-level.

As mentioned in Section 3.4, our method can be extended to learn the low-level embedding with weak correspondences. We evaluate its effectiveness on both the ActivityNet and DiDeMo datasets. Performance is listed in Table 3. Note that for the rows marked "weak", no auxiliary alignments between sentences and clips are available during training.

Clearly, including the low-level loss with weak correspondences (i.e., correspondences only at the high level) yields superior performance compared to models that do not include the low-level loss at all. On several occasions, it even attains results as competitive as those obtained by including the low-level loss with strong correspondences at the clip/sentence level.

                                   Paragraph → Video   Video → Paragraph   Proposal quality
Proposal Method      # Segments    R@1     R@5         R@1     R@5         Precision   Recall
hse + ssn            -             10.4    31.9        10.8    31.7        1.5         17.1
hse + uniform        1             18.0    45.5        16.5    44.9        63.2        31.1
                     2             20.0    48.9        18.4    47.6        61.8        46.0
                     3             20.0    48.6        18.2    47.9        55.3        50.6
                     4             20.5    49.3        18.7    48.1        43.2        45.5
hse + ground truth   -             44.4    76.7        44.2    76.7        100.0       100.0
fse                  -             18.2    44.8        16.7    43.1        -           -
Table 4: Performance of using proposals instead of ground truth on the ActivityNet dataset.

4.2.4 Learning with Video Proposal Methods.

As using ground-truth temporal segments of videos is not a natural assumption, we perform experiments to validate the effectiveness of our method with proposal methods. Specifically, we experiment with two different proposal approaches: SSN [48] pre-trained on ActivityNet action proposals, and a heuristic uniform proposal. By a uniform proposal of K segments, we mean simply segmenting a video into K non-overlapping, equal-length temporal segments.
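The uniform proposal heuristic can be sketched in a few lines. The function name `uniform_proposals` and the half-open frame-index convention are assumptions made for this sketch; when the number of frames is not divisible by K, the segments differ in length by at most one frame.

```python
def uniform_proposals(num_frames, num_segments):
    """Split frame indices [0, num_frames) into K contiguous, non-overlapping segments.

    Returns a list of (start, end) pairs with `end` exclusive.
    """
    if num_segments < 1 or num_frames < num_segments:
        raise ValueError("need 1 <= num_segments <= num_frames")
    # Integer boundaries spread as evenly as possible over the video.
    bounds = [(i * num_frames) // num_segments for i in range(num_segments + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(num_segments)]
```

For instance, a 10-frame video split into 2 segments yields `[(0, 5), (5, 10)]`, and with K = 1 the whole video is treated as a single clip.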

The results are summarized in Table 4 (the precision and recall columns are the performance metrics of the proposal methods). There are two main conclusions from these results: (1) The segments of the Dense Caption dataset deviate significantly from the action proposals; therefore, a pre-trained action proposal algorithm performs poorly. (2) Even with heuristic proposal methods, the performance of hse is mostly better than (or comparable with) that of fse. We leave the identification of stronger proposal methods to future work.

4.2.5 Retrieval with Incomplete Video and Paragraph.

In this section, we investigate the correlation between the number of observed clips and sentences and the models' performance on video and paragraph retrieval. In this experiment, we gradually increase the number of clips and sentences observed by our model during testing and obtain Figure 4, on ActivityNet. When the video/paragraph contains fewer clips/sentences than the number of observations we require, we take all available clips/sentences for computing the video/paragraph embedding. (On average, there are 3.65 clips/sentences per video/paragraph.)

Figure 4: Retrieval performance improves given more observed clips/sentences.

From Figure 4, we note that increasing the number of observed clips and sentences leads to improved retrieval performance. When observing only one clip and sentence, our model already outperforms the previous state-of-the-art method as well as our baseline fse, which observes the entire sequence. When observing fewer clips and sentences than the average length, our learned model can already achieve a large share of the final performance.

4.3 Results on Video Captioning

4.3.1 Setups.

In addition to video paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow [20] and train a caption model [40] on top of the pre-trained video embeddings. Similar to [20], we concatenate the clip-level feature with the contextual video-level feature, and build a two-layer LSTM as a caption generator. We randomly initialize the word embeddings as well as the LSTM and train the model for 25 epochs with a learning rate of 0.001. We use the ground-truth proposals throughout training and evaluation, following the setting of [20, 23]. During testing, beam search is used with beam=5. Results are reported in Table 5.

Table 5: Results for video captioning on ActivityNet
Method          B@1    B@2    B@3    B@4    M      C
lstm-yt [38]    18.2   7.4    3.2    1.2    6.6    14.9
s2vt [39]       20.4   9.0    4.6    2.6    7.9    21.0
hrnn [45]       19.5   8.8    4.3    2.5    8.0    20.2
dense [20]      26.5   13.5   7.1    4.0    9.5    24.6
dvc [23]        19.6   9.9    4.6    1.6    10.3   25.2
fse             17.9   8.2    3.6    1.7    8.7    32.1
hse[=0]         19.6   9.4    4.2    2.0    9.2    39.5
hse[=5e-4]      19.8   9.4    4.3    2.1    9.2    39.8

Table 6: Results for action recognition on ActivityNet (low-level embeddings)
                Zero-Shot Transfer    Train Classifier
Method          Top-1    Top-5        Top-1    Top-5
fv-vae [31]     -        -            78.6     -
tsn [42]        -        -            88.1     -
fse             48.3     79.4         74.4     94.1
hse[=0]         50.2     84.4         74.7     94.3
hse[=5e-4]      51.4     83.8         75.3     94.3
random          0.5      2.5          0.5      2.5

4.3.2 Results.

We observe that our proposed model outperforms the baseline on most metrics. Meanwhile, hse also improves over previous approaches such as lstm-yt, s2vt, and hrnn on B@2, METEOR, and CIDEr by a margin. hse achieves results comparable with dvc on all criteria. However, both hse and hse[=0] fail to match dense [20] on the BLEU and METEOR metrics. This may be due to the fact that dense [20] carefully learns to aggregate the context information of a video clip for producing high-quality captions, whereas our embedding model, being optimized for video-paragraph retrieval, is not equipped with such a capability. However, it is worth noting that our model obtains a higher CIDEr score than all existing methods. We empirically observe that fine-tuning the pre-trained video embedding does not lead to further performance improvement.

4.4 Results on Action Recognition

To evaluate the effectiveness of our model, we take the off-the-shelf clip-level embeddings trained on video-paragraph retrieval for action recognition (on ActivityNet with non-overlapping training and validation data). We use two action recognition settings to evaluate, namely zero-shot transfer and classification.

4.4.1 Setups.

In the zero-shot setting, we directly evaluate the low-level embedding model learned for video and text retrieval, by treating the phrases of actions as sentences and using the sentence-level encoder to encode the action embeddings. We take the raw video and apply the clip-level video encoder to extract the features for retrieving actions. No re-training is performed and no model has access to the actions' data distribution. Note that although actions are not directly used as sentences during training, some are available as verbs in the vocabulary. Meanwhile, as we are using pre-trained word vectors (GloVe), transfer to unseen actions is possible. In the classification setting, we discriminatively train a simple classifier to measure the classification accuracy. Concretely, a one-hidden-layer Multi-Layer Perceptron (MLP) is trained on the clip-level embeddings. We do not fine-tune the pre-trained clip-level video embedding here.
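The zero-shot procedure amounts to nearest-neighbor retrieval of action embeddings given a clip embedding. A minimal sketch follows, assuming the embeddings have already been produced by the clip-level and sentence-level encoders; the function names `zero_shot_topk` and `topk_accuracy` are hypothetical.

```python
import numpy as np

def zero_shot_topk(clip_embs, action_embs, k=5, eps=1e-8):
    """Indices of the k action phrases most similar to each clip embedding."""
    C = np.asarray(clip_embs, dtype=float)
    A = np.asarray(action_embs, dtype=float)
    C = C / (np.linalg.norm(C, axis=1, keepdims=True) + eps)
    A = A / (np.linalg.norm(A, axis=1, keepdims=True) + eps)
    sim = C @ A.T  # cosine similarity of every clip to every action phrase
    # Sort actions by decreasing similarity and keep the first k per clip.
    return np.argsort(-sim, axis=1)[:, :k]

def topk_accuracy(topk, labels):
    """Fraction of clips whose true action index appears among its top-k predictions."""
    labels = np.asarray(labels)
    return float(np.mean((topk == labels[:, None]).any(axis=1)))
```

With 200 action classes, uniformly random predictions would give a top-1 accuracy of 1/200 = 0.5% and a top-5 accuracy of 5/200 = 2.5%, which is the "random" row of Table 6.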

4.4.2 Results.

We report results for the above two settings on the ActivityNet validation set (see Table 6). We observe that our learned low-level embeddings allow superior zero-shot transfer to action recognition, without accessing any training data. This indicates that the semantics of actions are indeed well preserved in the learned embedding models. More interestingly, we can see that both hse[=0] and hse improve the performance over fse. This shows that our hierarchical modeling of video benefits not only the high-level embedding but also the low-level embedding. A similar trend is also observed in the classification setting. Our method achieves performance comparable to state-of-the-art video modeling approaches such as fv-vae [31]. Note that tsn [42] is fully supervised and thus not directly comparable.

4.5 Qualitative Results

Figure 5: t-SNE visualization of the off-the-shelf video embeddings of hse on the ActivityNet v1.3 training and validation sets (panels: ActivityNet Training Set, ActivityNet Validation Set). Points are marked with their action classes.

We use t-SNE [26] to visualize our results in the video to paragraph and paragraph to video retrieval task. Fig. 5 shows that the proposed method can cluster the embeddings of videos with regard to their action classes. To further explain the retrieval quality, we provide qualitative visualizations in the Supp. Material.

5 Conclusion

In this paper, we propose a novel cross-modal learning approach to model videos and texts jointly, which leverages the intrinsic hierarchical structures of both videos and texts. Specifically, we consider the correspondences of videos and texts at multiple granularities, and derive loss functions to align the embeddings of the paired clips and sentences, as well as of the paired video and paragraph, in their own semantic spaces. Another important component of our model is layer-wise reconstruction, which ensures that the learned embeddings capture video (paragraph) and clips (words) at different levels. Moreover, we extend our learning objective so that it can handle a more general learning scenario where only video-paragraph correspondence exists. We demonstrate the advantage of our proposed model in a range of tasks including video and text retrieval, zero-shot action recognition and video captioning.

Acknowledgments We appreciate the feedback from the reviewers. This work is partially supported by NSF IIS-1065243, 1451412, 1513966/ 1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.


  • [1] Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV. pp. 5804–5813 (2017)
  • [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: ICCV. pp. 2425–2433 (2015)
  • [3] Chao, W.L., Hu, H., Sha, F.: Being negative but constructively: Lessons learnt from creating better visual question answering datasets. In: NAACL-HLT. pp. 431–441 (2018)
  • [4] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
  • [5] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
  • [6] Collell, G., Moens, M.F.: Is an image worth more than a thousand words? on the fine-grain semantic differences between visual and linguistic representations. In: COLING. pp. 2807–2817 (2016)
  • [7] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR. pp. 1933–1941 (2016)
  • [8] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: A deep visual-semantic embedding model. In: NIPS. pp. 2121–2129 (2013)
  • [9] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. pp. 249–256 (2010)
  • [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
  • [11] Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-scale video benchmark for human activity understanding. In: CVPR. pp. 961–970 (2015)
  • [12] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [13] Hu, H., Chao, W.L., Sha, F.: Learning answer embeddings for visual question answering. In: CVPR. pp. 5428–5436 (2018)
  • [14] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR. pp. 3128–3137 (2015)
  • [15] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR. pp. 1725–1732 (2014)
  • [16] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
  • [17] Kiela, D., Bottou, L.: Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics. In: EMNLP. pp. 36–45 (2014)
  • [18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
  • [19] Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
  • [20] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: ICCV. pp. 706–715 (2017)
  • [21] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106–1114 (2012)
  • [22] Li, J., Luong, M.T., Jurafsky, D.: A hierarchical neural autoencoder for paragraphs and documents. In: ACL. pp. 1106–1115 (2015)
  • [23] Li, Y., Yao, T., Pan, Y., Chao, H., Mei, T.: Jointly localizing and describing events for dense video captioning. In: CVPR. pp. 7492–7500 (2018)
  • [24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755 (2014)
  • [25] Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. In: EMNLP. pp. 1412–1421 (2015)
  • [26] Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. JMLR 9(Nov), 2579–2605 (2008)
  • [27] Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Hierarchical multimodal lstm for dense visual-semantic embedding. In: ICCV. pp. 1899–1907 (2017)
  • [28] Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: CVPR. pp. 1029–1038 (2016)
  • [29] Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML. pp. 1310–1318 (2013)
  • [30] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: EMNLP. pp. 1532–1543 (2014)
  • [31] Qiu, Z., Yao, T., Mei, T.: Deep quantization: Encoding convolutional activations with deep generative model. In: CVPR. pp. 4085–4094 (2017)
  • [32] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: CVPR. pp. 815–823 (2015)
  • [33] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS. pp. 568–576 (2014)
  • [34] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  • [35] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS. pp. 3104–3112 (2014)
  • [36] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV. pp. 4489–4497 (2015)
  • [37] Tsai, Y.H.H., Huang, L.K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: ICCV. pp. 3591–3600 (2017)
  • [38] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: ICCV. pp. 4534–4542 (2015)
  • [39] Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. In: NAACL-HLT. pp. 1494–1504 (2015)
  • [40] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR. pp. 3156–3164 (2015)
  • [41] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV. pp. 20–36 (2016)
  • [42] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks for action recognition in videos. arXiv preprint arXiv:1705.02953 (2017)
  • [43] Wu, C.Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krähenbühl, P.: Compressed video action recognition. In: CVPR (2018)
  • [44] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
  • [45] Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: CVPR. pp. 4584–4593 (2016)
  • [46] Zhang, B., Wang, L., Wang, Z., Qiao, Y., Wang, H.: Real-time action recognition with enhanced motion vector cnns. In: CVPR. pp. 2718–2726 (2016)
  • [47] Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: ECCV. pp. 766–782 (2016)
  • [48] Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. In: ICCV. pp. 2933–2942 (2017)
  • [49] Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question answering in images. In: CVPR. pp. 4995–5004 (2016)

Appendix A Implementation Details

A.1 Video and Text Features

A.1.1 C3D Features.

Similar to [20], we follow the standard ActivityNet setting and use the C3D [36] features from [11] for retrieval and dense captioning [20]. In all our experiments under this setting, we extract frame-wise video features using a C3D model pre-trained on the Sports-1M dataset, with a temporal stride of 16. PCA is then applied to reduce the feature dimensionality to 500.

A.1.2 TSN-Inception V3 Features.

To leverage the state of the art in video modeling, we extract more recent deep features for retrieval on ActivityNet [20] and DiDeMo [1], using the Inception V3 model pre-trained on the Kinetics [16] dataset (provided by [41]). Following their settings, we resize video frames to the resolution of . We then feed the video frames into the deep Inception V3 model and extract the output activations of the penultimate layer. Unlike [41], we do not perform any test-time data augmentation (e.g., multiple crops, color jitter, etc.). Note that no fine-tuning is performed on either ActivityNet or DiDeMo.

A.1.3 Word Features.

In the retrieval experiments, we always initialize the word embeddings with GloVe features [30] and then fine-tune them. Specifically, we use the GloVe vectors pre-trained on 840B tokens of Common Crawl web data, with a dimensionality of 300.

A.1.4 Training Details

When hierarchical embedding learning is applicable, we feed the entire video/paragraph, in its frame-wise/word-wise representation, through the low-level encoder, and then pass the resulting low-level embedding to the high-level encoder as its initial hidden state. In all our experiments, we use a GRU [4] with a hidden dimension of 1,024 as our sequence encoder and decoder. To obtain the embedding of a sequence, we take the channel-wise max over all output vectors of the GRU, as it empirically outperforms other strategies such as [38].
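The channel-wise max pooling described above can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function name `channelwise_max` and the toy 3-step, 4-channel output matrix are assumptions made for this example, standing in for real GRU outputs with 1,024 channels.

```python
from typing import List, Sequence


def channelwise_max(outputs: Sequence[Sequence[float]]) -> List[float]:
    """Pool T per-step output vectors (each of size D) into one D-dim embedding.

    `outputs` is a hypothetical stand-in for the GRU output vectors; the
    pooling takes, for each channel, the maximum value over all time steps.
    """
    if not outputs:
        raise ValueError("need at least one output vector")
    dim = len(outputs[0])
    if any(len(step) != dim for step in outputs):
        raise ValueError("all output vectors must share one dimensionality")
    # zip(*outputs) iterates over channels; max picks the largest value per channel.
    return [max(channel) for channel in zip(*outputs)]


# Toy example: 3 time steps, 4 channels (illustrative values only).
toy_outputs = [
    [0.1, -0.5, 0.3, 0.0],
    [0.4, -0.2, 0.1, 0.9],
    [-0.3, 0.6, 0.2, 0.5],
]
embedding = channelwise_max(toy_outputs)
print(embedding)  # [0.4, 0.6, 0.3, 0.9]
```

Because each channel is pooled independently, the resulting embedding has the same dimensionality as a single output vector regardless of the sequence length.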

During training, we use the Adam [18] optimizer with an initial learning rate of 0.001, and decay it by a factor of 10 every 10 epochs. We use Xavier initialization [9] for each affine layer in our model with zero mean and variance of 0.01. We set all margins in the loss function to . Each loss is normalized by its batch size. On both the ActivityNet and DiDeMo datasets, we train our embedding models for 15 epochs and collect the final results.
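As a rough sketch of the learning-rate schedule described above, the step decay can be written as a small function. It assumes that "decay by 10" means dividing the rate by 10, and that epochs are counted from 0; the function name `learning_rate` and its parameter names are hypothetical.

```python
def learning_rate(epoch: int, base_lr: float = 1e-3,
                  factor: float = 10.0, step: int = 10) -> float:
    """Step-decay schedule: divide `base_lr` by `factor` once every `step` epochs.

    Assumption: epochs are 0-indexed, so the first decay happens at epoch 10.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    return base_lr / (factor ** (epoch // step))


# Learning rate for each of the 15 training epochs.
schedule = [learning_rate(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

Under these assumptions, a 15-epoch run sees exactly one decay: the first ten epochs use the initial rate and the remaining five use one tenth of it.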

Appendix B Additional Experiments

B.1 Ablation Study on Different Learning Objectives

B.1.1 Ablation Study with Different Learning Objectives

We report ablation studies of the different losses on the ActivityNet video and paragraph retrieval task in Table 7. We use the Inception-V3 features and follow the same setting as for training hse. Each time, we remove one loss and report the resulting performance. Note that the reconstruction loss and the low-match loss are the most useful.

| Method | Paragraph → Video R@1 | Paragraph → Video R@5 | Video → Paragraph R@1 | Video → Paragraph R@5 |
|---|---|---|---|---|
| hse w/o high-cluster | 44.6 | 76.4 | 44.2 | 76.1 |
| hse w/o low-match | 40.9 | 73.6 | 39.8 | 73.6 |
| hse w/o low-cluster | 44.6 | 76.6 | 43.9 | 76.4 |
| hse w/o reconstruction | 43.9 | 75.8 | 43.3 | 75.3 |
| hse w/ all losses | 44.4 | 76.7 | 44.2 | 76.7 |

Table 7: Ablation study on the learning objectives.

B.1.2 Low-level Loss is Beneficial

As mentioned in the main text (see Table 1 and Table 2 in the main text), learning with low-level objectives is beneficial for our full model. To better understand this, we also plot the recall (in %) with regard to the rank of the video/paragraph for a query as supporting evidence. The results are shown in Fig. 6.

(a) hse[=0] (b) hse
Figure 6: Recall vs. rank curves for video-to-paragraph and paragraph-to-video retrieval with both hse[=0] and hse. All results are collected from models based on Inception V3 features on the ActivityNet validation set 1.
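To make the recall vs. rank quantity concrete, the sketch below computes recall at each rank K, together with the median rank, from a query-by-candidate similarity matrix. It is an illustration under stated assumptions, not the authors' evaluation code: the true match of query i is assumed to be candidate i, MR is assumed to denote median rank, and the function names and the toy 3×3 matrix are made up for this example.

```python
from statistics import median
from typing import List, Sequence


def ground_truth_ranks(similarity: Sequence[Sequence[float]]) -> List[int]:
    """Rank (1 = best) of the true match for each query.

    Assumption: query i matches candidate i, so the diagonal holds true pairs.
    """
    ranks = []
    for i, row in enumerate(similarity):
        # Count candidates scored strictly higher than the true match.
        better = sum(1 for score in row if score > row[i])
        ranks.append(better + 1)
    return ranks


def recall_at_k(ranks: Sequence[int], k: int) -> float:
    """Percentage of queries whose true match is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy similarity matrix (illustrative values only): 3 queries x 3 candidates.
toy_similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]
ranks = ground_truth_ranks(toy_similarity)
curve = [recall_at_k(ranks, k) for k in range(1, len(ranks) + 1)]
median_rank = median(ranks)
print(ranks, curve, median_rank)
```

Plotting `curve` against K gives a recall vs. rank curve of the kind shown in Fig. 6; R@1, R@5, and R@50 are single points on that curve.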

B.2 Ablation Study on Reconstruction Balance Term

Here we study the influence of the reconstruction loss balance term by experimenting with multiple values under a controlled setting. We conduct this study on the validation set 2 (val2) of ActivityNet, with Inception V3 visual features as input. Detailed results are shown in Table 8. We find that retrieval performance (R@1 and R@5) approaches its peak when the term is set to 0.0005. Therefore, as stated in the main text, we set it to 0.0005 in all our experiments.

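| Method | Paragraph → Video R@1 | R@5 | R@50 | MR | Video → Paragraph R@1 | R@5 | R@50 | MR |
|---|---|---|---|---|---|---|---|---|
| Inception-V3 pre-trained on Kinetics [42] | | | | | | | | |
| hse[=0.05] | 25.0 | 54.9 | 92.6 | 5.0 | 25.1 | 55.4 | 92.4 | 4.0 |
| hse[=0.005] | 32.4 | 62.2 | 93.8 | 3.0 | 32.1 | 63.0 | 93.7 | 3.0 |
| hse[=0.0005] | 33.2 | 62.9 | 93.6 | 3.0 | 32.6 | 62.8 | 93.5 | 3.0 |
| hse[=0.00005] | 33.2 | 62.9 | 93.8 | 3.0 | 32.2 | 62.5 | 93.6 | 3.0 |
| hse[=0] | 32.2 | 61.5 | 93.6 | 3.0 | 31.5 | 62.0 | 93.3 | 3.0 |

Table 8: Ablation study of the reconstruction balance term on ActivityNet (val2).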

B.3 Performance on ActivityNet Validation Set 2

As mentioned in the main paper, we report the val2 performance of fse, hse[=0], and hse in Table 9. Again, the results support our paper's claim: hse consistently improves performance over fse and hse[=0]. This shows the importance of hierarchical modeling and feature reconstruction.

| Method | Paragraph → Video R@1 | R@5 | R@50 | MR | Video → Paragraph R@1 | R@5 | R@50 | MR |
|---|---|---|---|---|---|---|---|---|
| C3D Feature with Dimensionality Reduction [36] | | | | | | | | |
| fse | 11.5 ± 0.2 | 31.0 ± 0.4 | 75.9 ± 0.2 | 14.0 | 11.0 ± 0.5 | 30.6 ± 0.3 | 75.5 ± 0.4 | 14.0 |
| hse[=0] | 23.3 ± 0.5 | 48.2 ± 0.2 | 84.5 ± 0.4 | 6.0 | 23.0 ± 0.3 | 47.9 ± 0.2 | 84.6 ± 0.2 | 6.0 |
| hse[=0.0005] | 23.9 ± 0.3 | 49.4 ± 0.3 | 85.3 ± 0.2 | 6.0 | 23.4 ± 0.5 | 49.4 ± 0.4 | 85.5 ± 0.3 | 6.0 |
| Inception-V3 pre-trained on Kinetics [42] | | | | | | | | |
| fse | 16.0 ± 0.2 | 41.8 ± 0.4 | 88.0 ± 0.5 | 8.0 | 15.1 ± 0.7 | 41.0 ± 0.4 | 87.7 ± 0.5 | 8.0 |
| hse[=0] | 32.3 ± 0.2 | 62.2 ± 0.7 | 93.5 ± 0.1 | 3.0 | 32.0 ± 1.0 | 61.9 ± 0.2 | 93.3 ± 0.1 | 3.0 |
| hse[=0.0005] | 32.9 ± 0.4 | 62.7 ± 0.2 | 93.9 ± 0.4 | 3.0 | 32.6 ± 0.1 | 63.0 ± 0.2 | 93.7 ± 0.2 | 3.0 |

Table 9: Performance of video and paragraph retrieval on ActivityNet (val2). Standard deviations over 3 randomly seeded experiments are also reported.
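For readers reproducing this kind of reporting, the sketch below aggregates a metric over several seeded runs into a mean and standard deviation. The per-seed values are made-up toy numbers, not results from this paper, and the use of the sample (n-1) standard deviation is an assumption, since the text does not state which estimator was used; the function name `summarize` is hypothetical.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize(runs: Sequence[float]) -> str:
    """Format per-seed scores as 'mean ± std', rounded to one decimal place.

    Assumption: sample standard deviation (n - 1 in the denominator).
    """
    if len(runs) < 2:
        raise ValueError("need at least two runs to estimate a deviation")
    return f"{mean(runs):.1f} ± {stdev(runs):.1f}"


# Hypothetical R@1 scores from three random seeds (toy values only).
toy_runs = [10.0, 11.0, 12.0]
print(summarize(toy_runs))  # 11.0 ± 1.0
```

With only three runs, the sample and population estimators differ noticeably, so it is worth stating which one is used when comparing against reported deviations.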

Appendix C Visualization

query video
ground truth The credits of the clip are shown. People clean their hand and use their hands to dance. The credits of the video are shown.
hse The credits of the clip are shown. People clean their hand and use their hands to dance. The credits of the video are shown.
hse[=0] The credits of the clip are shown. People clean their hand and use their hands to dance. The credits of the video are shown.
fse Two women are seen speaking to the camera and leads into turning on a faucet and running her hands underneath. The woman then scrubs soap into her hands and continues to wash them off then taking a paper down and drying her hands and sink. The other steps in to demonstrate how she washes her hands and ends by laughing to the camera.
query video
ground truth Two people dressed up in sumo wrestler suits come running into a gym and wrestle while people stand around and watch. The wrestler wearing red falls over. They continue wrestling and are having a lot of fun doing it falling down and bouncing around. One of the wrestlers wearing blue makes a shot in a basketball net. The two people continue wrestling in their sumo suits. A man comes into the shot and pushes the sumo wrestler over on top of another person not wearing a suit. There comes a final of the sumo wrestlers and a man in a white shirt is presenting and there is a referee. They start on the wrestle while people watch swinging each other around in the middle of the red mats. The red sumo wrestler falls down and the blue sumo wrestler wins. The blue sumo wrestler jumps up happy with his friends and walks out the door and the red sumo wrestler is left on the ground.
hse Two people dressed up in sumo wrestler suits come running into a gym and wrestle while people stand around and watch. The wrestler wearing red falls over. They continue wrestling and are having a lot of fun doing it falling down and bouncing around. One of the wrestlers wearing blue makes a shot in a basketball net. The two people continue wrestling in their sumo suits. A man comes into the shot and pushes the sumo wrestler over on top of another person not wearing a suit. There comes a final of the sumo wrestlers and a man in a white shirt is presenting and there is a referee. They start on the wrestle while people watch swinging each other around in the middle of the red mats. The red sumo wrestler falls down and the blue sumo wrestler wins. The blue sumo wrestler jumps up happy with his friends and walks out the door and the red sumo wrestler is left on the ground.
hse[=0] There are some girls wearing karate uniforms doing karate on a stage. There’s an orange belt and a yellow belt karate student doing some karate moves with batons in their hands. After they leave, another karate student wearing a yellow belt comes on stage to perform her karate moves with a baton. Then she leaves and another girl wearing an orange belt joins in holding two hammers to show her karate moves. She leaves and another girl wearing a yellow belt comes on stage with a hand fan and shows her karate moves. After she leaves the master comes on stage along with three other students. They take turns to smash the board held by the master. Then the master leaves and the three students demonstrate their coordinated karate moves.
fse A girl walks along the gym holding her fencing gear. She points at the camera with her sword. A coach comes to dress her and fix equipment. The gym is full of kids fencing and practicing. She starts fencing with another girl. A coach in a blue shirt gives the direction.
Figure 7: ActivityNet, video-to-paragraph retrieval. Positive qualitative examples of hse, hse[=0], and fse on the task of retrieving text given a video. Correct samples are marked in green and incorrect ones in red.
query video
ground truth A child solves a cube puzzle while holding a basketball. A hand holding an object touch the ear of the boy. The boy stand and leave, then a naked man appears holding a rod.
hse A woman tries to solves a cube puzzle. After, the woman pass the cube to a man that starts to solves the puzzle. The man solves the puzzle.
hse[=0] A child solves a cube puzzle while holding a basketball. A hand holding an object touch the ear of the boy. The boy stand and leave, then a naked man appears holding a rod.
fse A sitting man holds a pack of cigarettes and a lighter to the camera. The man puts a cigarette in his mouth and smokes the cigarette fast. The man removes the cigarette and puts it back. The man shows the camera the end of the cigarette and smokes the rest. The man removes the cigarette and shows it to the camera.
query video
ground truth A woman is seen turning on a hose to pour water in and walks over to another bucket and sits down. The woman then washes the clothes in the bucket while looking to the camera leading into her hanging up the clothes and speaking to the camera.
hse Two kids are washing their clothes in a sink. They put the clean clothes in a bucket on the floor.
hse[=0] A man is seen kneeling over a bucket dipping clothes inside and washing. Another man is seen standing in front of a hose and dipping clothes underneath as well.
fse This man is sitting in a chair outdoors in his yard and he is washing the black shirt in the bucket. There’s also 3 other buckets that have clothes in them and there are 2 other people outside. This video has no audio by the way and when the man is done, he stops to say a few words and the camera cuts off.
Figure 8: ActivityNet, video-to-paragraph retrieval. Negative qualitative examples of hse, hse[=0], and fse on the task of retrieving text given a video. Correct samples are marked in green and incorrect ones in red.
query text A man is floating in the water holding a table and a stool. The man stands on the table and sits the stool upright. The man sits on the stool as he water skis on the table. The man stands on top of the stool then stands up. The man is standing on a stool as he water skis in lake. The man does a spin while on the stool. The man jumps in the water as the boat drives on.
ground truth
query text We see the scoreboard of a racing game. The race starts ant the player is playing on jet skis. The player passes the red bridge. The player passes the cruise ship and zeppelin. The player passes the cliff with the lighthouse. The timer counts down from 10 and the race is finishes. We see the players record score. We see the ranking screen for the game, the level, and the option to change the difficulty.
ground truth
Figure 9: ActivityNet, paragraph-to-video retrieval. Positive qualitative examples of hse, hse[=0], and fse on the task of retrieving a video given text. Correct samples are marked in green and incorrect ones in red.
query text A man and a woman are dancing together. The man dips under the woman’s arm. The man has his hand on the woman’s waist. The man puts his hand on his waist. The man hits his head accidentally. A person hits the item hanging from the roof.
ground truth
query text A man throws a bowling ball. He then goes back and high fives his friends. They sit and talk around the table. A woman stands up and grabs a bowling ball. She walks up and drops it down the lane. She sits back down and looks at her phone. They continue talking around the table. She stands back up and picks up a bowling ball. She throws it down the lane again. She sits back down at the table. She throws a ball behind her while walking away. She picks up a ball and throws it with her hands over her eyes. She throws a bowling ball while talking on the phone.
ground truth
Figure 10: ActivityNet, paragraph-to-video retrieval. Negative qualitative examples of hse, hse[=0], and fse on the task of retrieving a video given text. Correct samples are marked in green and incorrect ones in red.

C.0.1 Qualitative Examples for hse, hse[=0], and fse on Retrieval Tasks.

We show qualitative examples on ActivityNet in Figures 7 to 10. To give a systematic view of the success and failure cases, we visualize positive examples of paragraph retrieval (Figure 7) and video retrieval (Figure 9), and negative examples in Figure 8 and Figure 10. We observe that in some failure cases, although hse fails to retrieve the correct text/video, it retrieves an item that is highly relevant to the query.
