Cross-Modal and Hierarchical Modeling of Video and Text
Visual data and text data are composed of information at multiple granularities. A video can describe a complex scene that is composed of multiple clips or shots, where each depicts a semantically coherent event or action. Similarly, a paragraph may contain sentences with different topics, which collectively convey a coherent message or story. In this paper, we investigate modeling techniques for such hierarchical sequential data where there are correspondences across multiple modalities. Specifically, we introduce hierarchical sequence embedding (hse), a generic model for embedding sequential data of different modalities into hierarchically semantic spaces, with either explicit or implicit correspondence information. We perform empirical studies on large-scale video and paragraph retrieval datasets and demonstrate superior performance of the proposed methods. Furthermore, we examine the effectiveness of our learned embeddings when applied to downstream tasks, and show their utility in zero-shot action recognition and video captioning.
Keywords: Hierarchical Sequence Embedding, Video Text Retrieval, Video Description Generation, Action Recognition, Zero-shot Transfer
Recently, there has been intense interest in multi-modal learning of vision + language. A few challenging tasks have been proposed: visual semantic embedding (VSE) [19, 17, 6], image captioning [40, 44, 14, 24], and visual question answering (VQA) [2, 49, 3]. To jointly understand these two modalities of data and make inferences over them, the main intuition is that different types of data can share a common semantic representation space. Examples are embedding images and visual categories, embedding images and texts for VSE, and embedding images, questions, and answers for VQA. Once embedded into this common (vector) space, similarities and distances among originally heterogeneous data can be captured by learning algorithms.
While there has been a rich body of work on how to discover this shared semantic representation for structures such as images, noun phrases (visual object or action categories) and sentences (such as captions, questions, answers), less is known about how to do so for more complex structures such as videos and paragraphs of text (we use paragraphs and documents interchangeably throughout this work). There are conceptual challenges: while complex structured data can be mapped to vector spaces (for instance, using deep architectures [21, 10]), it is not clear whether the intrinsic structures in the data’s original format, after being transformed to vectorial representations, still maintain their correspondence and relevance across modalities.
Take the dense video description task as an example. The task is to describe a video which is made of short, coherent and meaningful clips. (Note that those clips could overlap temporally.) Due to its narrowly focused semantic content, each clip is describable with a sentence. The description for the whole video is then a paragraph of text with sentences linearly arranged in order. Arguably, a corresponding pair of a video and its descriptive paragraph can be embedded into a semantic space where their embeddings are close to each other, using a vanilla learning model that ignores the boundaries of clips and sentences and treats them as sequences of continually flowing visual frames and words. However, with such a modeling strategy, it is unclear whether and how the correspondences at the “lower level” (i.e. clips versus sentences) are useful in either deriving the embeddings or using the embeddings to perform downstream tasks such as video or text retrieval.
Addressing these deficiencies, we propose a novel cross-modal learning approach to model both videos and texts jointly. The main idea is schematically illustrated in Fig. 1. Our approach is mindful of the intrinsic hierarchical structures of both videos and texts, and models them with hierarchical sequence learning models such as GRUs. However, as opposed to methods which disregard low-level correspondences, we exploit them by deriving loss functions to ensure the embeddings for the clips and sentences are also in accordance in their own (shared) semantic space. Those low-level embeddings in turn reinforce the requirement that videos and paragraphs are embedded coherently. We demonstrate the advantages of the proposed model in a range of tasks including video and text retrieval, zero-shot action recognition and video description.
2 Related Work
Hierarchical Sequence Embedding Models.
Embedding images, videos, and textual data has become very popular with the rise of deep learning. Of the two works most related to ours, one models the paragraph using a hierarchical auto-encoder for text modeling, and the other uses a hierarchical RNN for videos and a one-layer RNN for caption generation. In contrast, our work models both modalities hierarchically and learns the parameters by leveraging the correspondences across modalities. Works motivated by other application scenarios usually explore hierarchical modeling in one modality [27, 45, 47].
Cross-modal Embedding Learning.
There is a rich history of learning embeddings for images and smaller linguistic units (such as words and noun phrases). DeViSE [8] learns to align the latent embeddings of visual data and the names of visual object categories. ReViSE [37] uses auto-encoders to derive embeddings for images and words, which allows it to leverage unlabeled data. In contrast to previous methods, our approach models both videos and texts hierarchically, bridging the embeddings at different granularities using a discriminative loss computed on corresponding pairs (i.e. videos vs. paragraphs).
Action Recognition in Videos.
Deep learning has brought significant improvement to video understanding [33, 36, 7, 41, 46, 43] on large-scale action recognition datasets [11, 34, 16] in the past decade. Most of these methods [33, 7, 41] employ deep convolutional neural networks to learn appearance features and motion information, respectively. Based on the spatio-temporal features from these video modeling methods, we learn a video semantic embedding to match the holistic video representation to the text representation. To evaluate the generalization of our learned video semantic representation, we evaluate the model directly on a challenging action recognition benchmark (details in Section 4.4).
We begin by describing the problem settings and introducing necessary notations. We then describe the standard sequential modeling technique, ignoring the hierarchical structures in the data. Finally, we describe our approach.
3.1 Settings and Notations
We are interested in modeling videos and texts that are paired in correspondence. In a later section, we describe how to generalize this to the case where there is no one-to-one correspondence.
A video consists of a sequence of clips (or subshots), where each clip contains a sequence of frames. Each frame is represented by a visual feature vector. This feature vector can be derived in many ways, for instance, by feeding the frame (and its contextual frames) to a convolutional neural network and using the outputs from the penultimate layer. Likewise, we assume there is a paragraph of text describing the video. The paragraph contains a sequence of sentences, one for each video clip, and each word of a sentence is represented by a word feature vector. The training data is a set of corresponding videos and text descriptions.
We compute a clip embedding from the frame features of each clip, and a sentence embedding from the word features of each sentence. From those, we derive the embeddings for the video and the paragraph, respectively.
3.2 Flat Sequence Modeling
Many sequence-to-sequence (seq2seq) methods leverage the encoder-decoder structure [35, 25] to model the process of transforming the input sequence into the output sequence. In particular, the encoder, which is composed of a layer of long short-term memory units (LSTMs) [12] or Gated Recurrent Units (GRUs), transforms the input sequence into a single vector, the embedding. The similarly constructed decoder takes this embedding as input and outputs another sequence.
The original seq2seq methods do not consider the hierarchical structures in videos or texts. We refer to the resulting embeddings as flat sequence embeddings (fse): each encoder reads the whole sequence of frame (or word) features, ignoring clip and sentence boundaries, and outputs one vector per video (or paragraph).
Fig. 2 schematically illustrates this idea. We measure how well a video and a text are aligned by the cosine similarity between their embeddings.
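The matching score can be sketched in a few lines of NumPy. This is a minimal illustration, assuming embeddings are plain vectors of equal dimension; the function names `match` and `match_matrix` and the small constant `eps` (added to avoid division by zero) are our own choices for this sketch.

```python
import numpy as np

def match(v: np.ndarray, p: np.ndarray, eps: float = 1e-8) -> float:
    """Cosine similarity between a video embedding v and a text embedding p."""
    return float(v @ p / (np.linalg.norm(v) * np.linalg.norm(p) + eps))

def match_matrix(V: np.ndarray, P: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """All-pairs cosine similarity: row i is video i, column j is paragraph j."""
    Vn = V / (np.linalg.norm(V, axis=1, keepdims=True) + eps)
    Pn = P / (np.linalg.norm(P, axis=1, keepdims=True) + eps)
    return Vn @ Pn.T
```

Computing the full matrix at once is convenient for retrieval, where every video is scored against every paragraph.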
3.3 Hierarchical Sequence Modeling
One drawback of flat sequential modeling is that the LSTM/GRU layer needs to have a sufficient number of units to model well the potential long-range dependency among video frames (or words). This often complicates learning as the optimization becomes difficult.
We leverage the hierarchical structures in those data to overcome this deficiency: a video is made of clips which are made of frames. In parallel, a paragraph of texts is made of sentences which in turn are made of words. Similar ideas have been explored in [28, 22] and other previous works. The basic idea is illustrated in Fig. 3, where we also add components in red color to highlight our extensions.
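Hierarchical Sequence Embedding. Given the hierarchical structures in Fig. 3, we can compute the embeddings using the forward paths: the low-level encoders map the frame (word) features of each clip (sentence) to a clip (sentence) embedding, and the high-level encoders map the sequence of clip (sentence) embeddings to the video (paragraph) embedding.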
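The two-level forward path can be sketched as follows. This is a simplified illustration, not our training code: `GRUEncoder` is a hypothetical, bias-free single-layer GRU written in NumPy with random weights, the final hidden state is taken as the embedding, and all dimensions are made up for the example.

```python
import numpy as np

def _sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUEncoder:
    """Minimal single-layer GRU (no biases); the last hidden state is the embedding."""

    def __init__(self, in_dim, hid_dim, rng):
        scale = 1.0 / np.sqrt(in_dim + hid_dim)
        shape = (hid_dim, in_dim + hid_dim)
        self.Wz = rng.normal(0.0, scale, shape)  # update gate weights
        self.Wr = rng.normal(0.0, scale, shape)  # reset gate weights
        self.Wh = rng.normal(0.0, scale, shape)  # candidate state weights
        self.hid_dim = hid_dim

    def __call__(self, xs):
        h = np.zeros(self.hid_dim)
        for x in xs:
            z = _sigmoid(self.Wz @ np.concatenate([x, h]))
            r = _sigmoid(self.Wr @ np.concatenate([x, h]))
            h_new = np.tanh(self.Wh @ np.concatenate([x, r * h]))
            h = (1.0 - z) * h + z * h_new
        return h

def hierarchical_embed(units, low, high):
    """units: list of (length_i, feat_dim) arrays, one per clip or sentence."""
    low_embs = np.stack([low(u) for u in units])  # clip / sentence embeddings
    return low_embs, high(low_embs)               # and video / paragraph embedding

# Example: a video with three clips of 5, 3 and 7 frames, 16-d frame features.
rng = np.random.default_rng(0)
low, high = GRUEncoder(16, 8, rng), GRUEncoder(8, 8, rng)
video = [rng.normal(size=(n, 16)) for n in (5, 3, 7)]
clips, v = hierarchical_embed(video, low, high)
```

The same function serves the text side by passing word features grouped by sentence and a separate pair of encoders.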
Learning with Discriminative Loss. For videos and texts with strong correspondences, where clips and sentences are paired, we optimize the encoders such that videos and texts are matched. To this end, we define two loss functions, corresponding to matching at the low level and at the high level, respectively:
These losses are margin-based losses, each with a positive margin (one for the low level and one for the high level) that separates matched pairs from unmatched ones. Each term is passed through the standard hinge function, max(0, ·).
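A margin-based matching loss of this kind can be sketched as below. This is one plausible instantiation rather than a definitive statement of the objective: it assumes a square similarity matrix whose diagonal holds the matched pairs, sums the hinge terms over all other items as negatives, and ranks in both directions. The name `margin_matching_loss` and the margin values used in the usage note are illustrative.

```python
import numpy as np

def margin_matching_loss(S: np.ndarray, margin: float) -> float:
    """S[i, j] = match(video_i, text_j); matched pairs lie on the diagonal."""
    pos = np.diag(S)
    # For each video i, every wrong text j should score at least `margin` lower.
    v2t = np.maximum(0.0, margin - pos[:, None] + S)
    # For each text j, every wrong video i should score at least `margin` lower.
    t2v = np.maximum(0.0, margin - pos[None, :] + S)
    off_diag = ~np.eye(S.shape[0], dtype=bool)  # exclude the matched pairs themselves
    return float(v2t[off_diag].sum() + t2v[off_diag].sum())
```

The same function can be applied at both levels, for example with a clip-by-sentence similarity matrix and its own margin at the low level, and a video-by-paragraph matrix at the high level.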
Learning with Contrastive Loss. Assuming videos and texts are well clustered, we use the following loss to model their clustering in their own space.
Note that the self-matching value of a video (or a text) with itself is 1 by definition. This loss can be computed on videos and texts alone and does not require them to be matched.
Learning with Unsupervised Layer-wise Reconstruction Loss. Thus far, the matching loss focuses on matching across modalities, and the clustering loss focuses on separating video/text data so that they do not overlap. Neither, however, focuses on the quality of modeling the data itself. In what follows, we propose a layer-wise reconstruction loss – when minimized, this loss ensures the learned video/text embedding faithfully preserves information in the data.
We first introduce a set of layer-wise decoders for both videos and texts. The key idea is to pair the encoders with decoders so that each pair of functions is an auto-encoder. Specifically, each decoder is also a layer of LSTM/GRU units, generating sequences of data. Thus, at the level of the video (or paragraph), we have a decoder to generate clips (or sentences), and at the level of clips (or sentences), we have a decoder to generate frames (or words). Concretely, we would like to minimize the difference between what is generated by the decoders and what is computed by the encoders on the data.
There are two high-level decoders, for videos and texts respectively, which take the video (or paragraph) embedding as input and generate a sequence of clip (or sentence) embeddings.
Similarly, the low-level decoders take each generated clip and sentence embedding as input and output sequences of generated frame and word embeddings.
Using those generated embeddings, we can construct a loss function characterizing how well the encoders encode the data pair (see Eq 3.3).
3.4 Final Learning Objective and Its Extensions
The final learning objective is to balance all those loss quantities
where the high-level and low-level losses are defined as
In our experiments, we will study the contribution by each term.
Learning under Weak Correspondences. Our idea can also be extended to the common setting where only high-level alignments are available. In fact, high-level coarse alignments of data are easier and more economical to obtain than fine-grained alignments between each sentence and video clip.
Since we do not have enough information to define the low-level matching loss exactly, we resort to approximation. We first define an averaged matching over all pairs of clips and sentences for a pair of video and paragraph
where we relax the assumption that there is precisely the same number of sentences and clips. We use this averaged quantity to approximate the low-level matching loss
This objective will push a clip embedding closer to the embeddings of the sentences belonging to the corresponding video (and vice versa for sentences to the corresponding video). A more refined approximation involving a soft assignment of matching can also be derived, which will be left for future work.
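The averaged matching used under weak correspondences can be sketched as follows. The sketch assumes clip and sentence embeddings are stored as row vectors and that matching is cosine similarity; the names `averaged_match` and `weak_similarity_matrix` are hypothetical, and the number of clips need not equal the number of sentences.

```python
import numpy as np

def averaged_match(clips: np.ndarray, sents: np.ndarray, eps: float = 1e-8) -> float:
    """Mean cosine similarity over all (clip, sentence) pairs.

    clips: (num_clips, d), sents: (num_sents, d); the two counts may differ.
    """
    C = clips / (np.linalg.norm(clips, axis=1, keepdims=True) + eps)
    S = sents / (np.linalg.norm(sents, axis=1, keepdims=True) + eps)
    return float((C @ S.T).mean())

def weak_similarity_matrix(videos, paragraphs):
    """Entry [i, j] averages the matches between video i's clips and paragraph j's sentences."""
    return np.array([[averaged_match(c, s) for s in paragraphs] for c in videos])
```

When videos and paragraphs are paired by index, the resulting matrix has matched pairs on its diagonal and can be fed to a margin-based ranking loss in place of the exact clip-sentence similarities.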
We evaluate and demonstrate the advantage of learning hierarchical cross-modal embedding with our proposed approach on several tasks: (i) large-scale video-paragraph retrieval (Section 4.2), (ii) down-stream tasks such as video captioning (Section 4.3), and (iii) action recognition (Section 4.4).
4.1 Experiment Setups
We evaluate on three large-scale video datasets:
(1) ActivityNet Dense Caption [20]. This variant of ActivityNet contains densely labeled temporal segments for 10,009 training and 4,917/4,885 (val1/val2) validation videos. Each video contains multiple clips and a corresponding paragraph with sentences aligned to the clips. In all our retrieval experiments, we follow the setting of prior work and report retrieval metrics such as recall@k (k=1,5,50) and median rank (MR). Also following prior work, we use ground-truth clip proposals as input for our main results. In addition, we study our algorithm with a heuristic proposal method (see Section 4.2.4). In the main text, we report all results on validation set 1 (val1). Please refer to the Supp. Material for the results on val2. For the video captioning experiment, we follow prior work and evaluate on the validation set (val1 and val2). Instead of using an action proposal method, ground-truth video segmentation is used for training and evaluation. Performance is reported in Bleu@K, METEOR and CIDEr.
(2) DiDeMo [1]. The original goal of the DiDeMo dataset is to locate the temporal segments that correspond to unambiguous natural language descriptions in a video. We re-purpose it for the task of video and paragraph retrieval. It contains 10,464 videos, 26,892 video clips and 40,543 sentences. The training, validation and testing splits contain 8,395, 1,065 and 1,004 videos and corresponding paragraphs, respectively. Each video clip may correspond to one or many sentences. For the video and paragraph retrieval task, paragraphs are constructed by concatenating all sentences that correspond to one video. Similar to the setting in ActivityNet, we use the ground-truth clip proposals as input.
(3) ActivityNet Action Recognition [11]. We use ActivityNet V1.3 for the aforementioned off-the-shelf action recognition. The dataset contains 14,950 untrimmed videos with 200 action classes, split into a training set and a validation set with 10,024 and 4,926 videos, respectively. Among all 200 action classes, 189 are covered by the vocabulary extracted from the paragraph corpus and 11 are unseen.
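The retrieval metrics named above, recall@k and median rank, can be computed from a query-by-candidate similarity matrix. The following is a minimal sketch under the assumption that the correct candidate for query i is candidate i; the function name `retrieval_metrics` is our own, and ties between scores are broken arbitrarily by the sort.

```python
import numpy as np

def retrieval_metrics(S: np.ndarray, ks=(1, 5, 50)):
    """S[i, j] = similarity of query i to candidate j; candidate i is the correct one."""
    order = np.argsort(-S, axis=1)                      # best candidate first
    hits = order == np.arange(len(S))[:, None]          # where the correct item lands
    ranks = np.argmax(hits, axis=1) + 1                 # 1-based rank of the correct item
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["MR"] = float(np.median(ranks))
    return metrics
```

With rows as paragraphs and columns as videos, `retrieval_metrics(S)` gives paragraph-to-video retrieval, and `retrieval_metrics(S.T)` gives the video-to-paragraph direction.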
4.1.2 Baselines and Our Methods.
We use the fse method (as described in Section 3.2) as a baseline model. It ignores the clip and sentence structures in the videos and paragraphs. We train a one-layer GRU directly on the extracted frame/word features and take its outputs as the embedding representing each modality. Results with C3D features are also included (see Table 1).
Our method has two variants. When the weight on the reconstruction loss is zero, the method (hse[=0]) simplifies to a stacked/hierarchical sequence model as used in [22, 28], except that those works do not consider cross-modal learning with a cross-modal matching loss while we do. We consider this a very strong baseline. When the weight is non-zero, hse takes full advantage of layer-wise reconstruction with multiple decoders at different levels of the hierarchy. In our experiments, this method gives the best results.
4.1.3 Implementation Details.
Following the settings of prior work, we extract C3D features [36] pretrained on the Sports-1M dataset [15] for raw videos in ActivityNet. PCA is then used to reduce the dimensionality of the features to 500. To verify that our model generalizes across different visual features, and to leverage state-of-the-art video models, we also employ the recently proposed TSN-Inception V3 network pre-trained on the Kinetics [16] dataset to extract visual features. Similarly, we extract TSN-Inception V3 features for videos in the DiDeMo dataset. We do not fine-tune the convolutional neural network during training, to reduce the computational cost. For word embeddings, we use 300-dimensional GloVe [30] features pre-trained on 840B tokens of web-crawled text. In all our experiments, we use GRUs as sequence encoders. For hse, we choose the reconstruction-loss weight by tuning this hyper-parameter on the val2 set of the ActivityNet retrieval dataset. The same value is used for experiments on DiDeMo, without further tuning. (More details are in the Supp. Material.)
4.2 Results on Video-Paragraph Retrieval
In this section, we first compare our proposed approach to the state-of-the-art algorithms, and then perform ablation studies on variants of our method, to evaluate the proposed learning objectives.
|Paragraph → Video||Video → Paragraph|
|Method||R@1||R@5||R@50||MR||R@1||R@5||R@50||MR|
|C3D Feature with Dimensionality Reduction|
|no context||5.0||14.0||32.0||78.0||7.0||18.0||45.0||56.0|
|Inception-V3 pre-trained on Kinetics|
|Paragraph → Video||Video → Paragraph|
4.2.1 Main Results.
We report our results on the ActivityNet Dense Caption val1 set and the DiDeMo test set in Table 1 and Table 2, respectively. For both C3D and Inception-V3 features, our hierarchical models improve on the previous state-of-the-art result by a large margin on Recall@1. The dense full model, which models flat sequences of clips, outperforms our fse baseline because it augments each segment embedding with a weighted aggregated context embedding. However, it fails to model the more complex temporal structures of videos and paragraphs, which leads to inferior performance compared with our hse models.
Compared to our flat baseline model, both hse[=0] and hse[=5e-4] improve performance on all retrieval metrics. This implies that hierarchical modeling can effectively capture the structural information and the relationships between clips and sentences within videos and paragraphs. Moreover, we observe that hse[=5e-4] consistently improves over hse[=0] across most retrieval metrics on both datasets. We attribute this to our layer-wise reconstruction objectives, which appear to yield better generalization.
|Paragraph → Video||Video → Paragraph|
4.2.2 Low-level Loss is Beneficial.
Table 1 and Table 2 show results obtained by optimizing both low-level and high-level objectives. In Table 3, we further perform ablation studies on the learning objectives. Note that rows with ✗ represent learning without the low-level loss. In all scenarios, joint learning with both low-level and high-level correspondences improves the retrieval performance.
4.2.3 Learning with Weak Correspondences at Low-level.
As mentioned in Section 3, our method can be extended to learn the low-level embedding with weak correspondence. We evaluate its effectiveness on both the ActivityNet and DiDeMo datasets. Results are listed in Table 3. Note that for the rows marked “weak”, no auxiliary alignments between sentences and clips are available during training.
Clearly, including the low-level loss with weak correspondence (i.e., correspondence only at the high level) obtains superior performance compared to models that do not include the low-level loss at all. On several occasions, it even attains results as competitive as those obtained with the low-level loss under strong correspondences at the clip/sentence level.
|Proposal Method||# Segments||P → V R@1||P → V R@5||V → P R@1||V → P R@5||Precision||Recall|
|hse + ssn||-||10.4||31.9||10.8||31.7||1.5||17.1|
|hse + uniform||1||18.0||45.5||16.5||44.9||63.2||31.1|
|hse + ground truth||-||44.4||76.7||44.2||76.7||100.0||100.0|
4.2.4 Learning with Video Proposal Methods.
As using ground-truth temporal segments of videos is not a natural assumption, we perform experiments to validate the effectiveness of our method with proposal methods. Specifically, we experiment with two different proposal approaches: SSN pre-trained on ActivityNet action proposals, and a heuristic uniform proposal. By a uniform proposal of K segments, we mean simply segmenting a video into K non-overlapping, equal-length temporal segments.
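The uniform proposal heuristic is simple enough to state directly in code. This sketch assumes the video duration is given in seconds and returns (start, end) pairs; the function name `uniform_proposals` is illustrative.

```python
def uniform_proposals(duration: float, k: int):
    """Split [0, duration] into k equal-length, non-overlapping (start, end) segments."""
    if k < 1 or duration <= 0:
        raise ValueError("need k >= 1 and duration > 0")
    step = duration / k
    return [(i * step, (i + 1) * step) for i in range(k)]
```

For example, `uniform_proposals(12.0, 3)` yields three 4-second segments, and `k = 1` treats the whole video as a single clip.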
The results are summarized in Table 4 (the precision and recall columns are the performance metrics of the proposal methods). There are two main conclusions from these results: (1) The segments of the Dense Caption dataset deviate significantly from action proposals, so a pre-trained action proposal algorithm performs poorly. (2) Even with heuristic proposal methods, the performance of hse is mostly better than (or comparable with) that of fse. We leave identifying stronger proposal methods to future work.
4.2.5 Retrieval with Incomplete Video and Paragraph.
In this section, we investigate the correlation between the number of observed clips and sentences and the models’ video and paragraph retrieval performance. In this experiment, we gradually increase the number of clips and sentences observed by our model during testing and obtain Figure 4, on ActivityNet. When the video/paragraph contains fewer clips/sentences than the number of observations we require, we take all available clips/sentences for computing the video/paragraph embedding. (On average, there are 3.65 clips/sentences per video/paragraph.)
From Figure 4, we note that increasing the number of observed clips and sentences leads to improved retrieval performance. When observing only one clip and sentence, our model already outperforms the previous state-of-the-art method as well as our baseline fse that observes the entire sequence. When observing fewer clips and sentences than the average length, our learned model already achieves a large share of its final performance.
4.3 Results on Video Captioning
In addition to video-paragraph retrieval, we evaluate our learned embeddings for video captioning. Specifically, we follow prior work and train a caption model on top of the pre-trained video embeddings. Similar to prior work, we concatenate the clip-level feature with the contextual video-level feature, and build a two-layer LSTM as a caption generator. We randomly initialize the word embedding as well as the LSTM and train the model for 25 epochs with a learning rate of 0.001. We use the ground-truth proposals throughout training and evaluation, following the setting of [20, 23]. During testing, beam search is used with beam=5. Results are reported in Table 5.
We observe that our proposed model outperforms the baseline on most metrics. Meanwhile, hse also improves over previous approaches such as lstm-yt, s2vt, and hrnn on B@2, METEOR, and CIDEr by a margin. hse achieves results comparable with dvc on all criteria. However, both hse and hse[=0] fail to match the performance of dense. This may be due to the fact that dense carefully learns to aggregate the context information of a video clip for producing high-quality captions, while our embedding model, optimized for video-paragraph retrieval, is not equipped with such a capability. However, it is worth noting that our model obtains a higher CIDEr score than all existing methods. We empirically observe that fine-tuning the pre-trained video embedding does not lead to further performance improvement.
4.4 Results on Action Recognition
To evaluate the effectiveness of our model, we take the off-the-shelf clip-level embeddings trained on video-paragraph retrieval for action recognition (on ActivityNet with non-overlapping training and validation data). We use two action recognition settings to evaluate, namely zero-shot transfer and classification.
In the zero-shot setting, we directly evaluate our low-level embedding model learned in video and text retrieval, by treating the phrases of actions as sentences and using the sentence-level encoder to encode the action embedding. We take the raw video and apply the clip-level video encoder to extract the feature for retrieving actions. No re-training is performed and all models have no access to the actions’ data distribution. Note that although actions are not directly used as sentences during training, some are available as verbs in the vocabulary. Meanwhile, as we are using pre-trained word vectors (GloVe), transfer to unseen actions is possible. In the classification setting, we discriminatively train a simple classifier to measure the classification accuracy. Concretely, a one-hidden-layer Multi-Layer Perceptron (MLP) is trained on the clip-level embeddings. We do not fine-tune the pre-trained clip-level video embedding here.
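The zero-shot procedure amounts to nearest-neighbor retrieval of action phrases in the shared low-level space. The sketch below assumes the clip embeddings and the encoded action phrases are already available as row vectors and that cosine similarity is the matching score; `zero_shot_predict` is a hypothetical name.

```python
import numpy as np

def zero_shot_predict(clip_embs: np.ndarray, action_embs: np.ndarray,
                      eps: float = 1e-8) -> np.ndarray:
    """Assign each clip to the action whose text embedding is most similar (cosine)."""
    C = clip_embs / (np.linalg.norm(clip_embs, axis=1, keepdims=True) + eps)
    A = action_embs / (np.linalg.norm(action_embs, axis=1, keepdims=True) + eps)
    return np.argmax(C @ A.T, axis=1)  # index of the best-matching action per clip
```

No parameters are updated here, which is what makes the evaluation zero-shot: the class set enters only through the encoded action phrases.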
We report results for the above two settings on the ActivityNet validation set (see Table 6). We observe that our learned low-level embeddings allow superior zero-shot transfer to action recognition, without accessing any training data. This indicates that the semantics of actions are indeed well preserved in the learned embedding models. More interestingly, we can see that both hse[=0] and hse improve the performance over fse. This shows that our hierarchical modeling of video benefits not only the high-level embedding but also the low-level embedding. A similar trend is also observed in the classification setting. Our method achieves performance comparable to state-of-the-art video modeling approaches such as fv-vae. Note that tsn is fully supervised and thus not directly comparable.
4.5 Qualitative Results
[Figure 5 panels: ActivityNet Training Set; ActivityNet Validation Set]
We use t-SNE [26] to visualize our results in the video-to-paragraph and paragraph-to-video retrieval tasks. Fig. 5 shows that the proposed method can cluster the embeddings of videos with regard to their action classes. To further explain the retrieval quality, we provide qualitative visualizations in the Supp. Material.
In this paper, we propose a novel cross-modal learning approach to model videos and texts jointly, which leverages the intrinsic hierarchical structures of both videos and texts. Specifically, we consider the correspondences of videos and texts at multiple granularities, and derive loss functions to align the embeddings for the paired clips and sentences, as well as the paired video and paragraph, in their own semantic spaces. Another important component of our model is layer-wise reconstruction, which ensures that learned embeddings capture video (paragraph) and clips (words) at different levels. Moreover, we extend our learning objective so that it can handle a more general learning scenario where only video-paragraph correspondence exists. We demonstrate the advantage of our proposed model in a range of tasks including video and text retrieval, zero-shot action recognition and video captioning.
Acknowledgments We appreciate the feedback from the reviewers. This work is partially supported by NSF IIS-1065243, 1451412, 1513966/ 1632803/1833137, 1208500, CCF-1139148, a Google Research Award, an Alfred P. Sloan Research Fellowship, gifts from Facebook and Netflix, and ARO# W911NF-12-1-0241 and W911NF-15-1-0484.
- [1] Anne Hendricks, L., Wang, O., Shechtman, E., Sivic, J., Darrell, T., Russell, B.: Localizing moments in video with natural language. In: ICCV. pp. 5804–5813 (2017)
- [2] Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Lawrence Zitnick, C., Parikh, D.: Vqa: Visual question answering. In: ICCV. pp. 2425–2433 (2015)
- [3] Chao, W.L., Hu, H., Sha, F.: Being negative but constructively: Lessons learnt from creating better visual question answering datasets. NAACL-HLT pp. 431–441 (2018)
- [4] Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., Bengio, Y.: Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014)
- [5] Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555 (2014)
- [6] Collell, G., Moens, M.F.: Is an image worth more than a thousand words? on the fine-grain semantic differences between visual and linguistic representations. In: COLING. pp. 2807–2817 (2016)
- [7] Feichtenhofer, C., Pinz, A., Zisserman, A.: Convolutional two-stream network fusion for video action recognition. In: CVPR. pp. 1933–1941 (2016)
- [8] Frome, A., Corrado, G.S., Shlens, J., Bengio, S., Dean, J., Mikolov, T., et al.: Devise: A deep visual-semantic embedding model. In: NIPS. pp. 2121–2129 (2013)
- [9] Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: AISTATS. pp. 249–256 (2010)
- [10] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
- [11] Heilbron, F.C., Escorcia, V., Ghanem, B., Niebles, J.C.: Activitynet: A large-scale video benchmark for human activity understanding. In: CVPR. pp. 961–970 (2015)
- [12] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
- [13] Hu, H., Chao, W.L., Sha, F.: Learning answer embeddings for visual question answering. In: CVPR. pp. 5428–5436 (2018)
- [14] Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: CVPR. pp. 3128–3137 (2015)
- [15] Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., Fei-Fei, L.: Large-scale video classification with convolutional neural networks. In: CVPR. pp. 1725–1732 (2014)
- [16] Kay, W., Carreira, J., Simonyan, K., Zhang, B., Hillier, C., Vijayanarasimhan, S., Viola, F., Green, T., Back, T., Natsev, P., et al.: The kinetics human action video dataset. arXiv preprint arXiv:1705.06950 (2017)
- [17] Kiela, D., Bottou, L.: Learning Image Embeddings using Convolutional Neural Networks for Improved Multi-Modal Semantics. In: EMNLP. pp. 36–45 (2014)
- [18] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [19] Kiros, R., Salakhutdinov, R., Zemel, R.S.: Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539 (2014)
- [20] Krishna, R., Hata, K., Ren, F., Fei-Fei, L., Niebles, J.C.: Dense-captioning events in videos. In: ICCV. pp. 706–715 (2017)
- [21] Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106–1114 (2012)
- [22] Li, J., Luong, M.T., Jurafsky, D.: A hierarchical neural autoencoder for paragraphs and documents. ACL pp. 1106–1115 (2015)
- [23] Li, Y., Yao, T., Pan, Y., Chao, H., Mei, T.: Jointly localizing and describing events for dense video captioning. In: CVPR. pp. 7492–7500 (2018)
- [24] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: ECCV. pp. 740–755 (2014)
- [25] Luong, M.T., Pham, H., Manning, C.D.: Effective approaches to attention-based neural machine translation. EMNLP pp. 1412–1421 (2015)
- [26] Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. JMLR 9(Nov), 2579–2605 (2008)
- [27] Niu, Z., Zhou, M., Wang, L., Gao, X., Hua, G.: Hierarchical multimodal lstm for dense visual-semantic embedding. In: ICCV. pp. 1899–1907 (2017)
- [28] Pan, P., Xu, Z., Yang, Y., Wu, F., Zhuang, Y.: Hierarchical recurrent neural encoder for video representation with application to captioning. In: CVPR. pp. 1029–1038 (2016)
- [29] Pascanu, R., Mikolov, T., Bengio, Y.: On the difficulty of training recurrent neural networks. In: ICML. pp. 1310–1318 (2013)
- [30] Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: EMNLP. pp. 1532–1543 (2014)
- [31] Qiu, Z., Yao, T., Mei, T.: Deep quantization: Encoding convolutional activations with deep generative model. In: CVPR. pp. 4085–4094 (2017)
- [32] Schroff, F., Kalenichenko, D., Philbin, J.: Facenet: A unified embedding for face recognition and clustering. In: CVPR. pp. 815–823 (2015)
- [33] Simonyan, K., Zisserman, A.: Two-stream convolutional networks for action recognition in videos. In: NIPS. pp. 568–576 (2014)
- [34] Soomro, K., Zamir, A.R., Shah, M.: Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
- [35] Sutskever, I., Vinyals, O., Le, Q.V.: Sequence to sequence learning with neural networks. In: NIPS. pp. 3104–3112 (2014)
- [36] Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: ICCV. pp. 4489–4497 (2015)
- [37] Tsai, Y.H.H., Huang, L.K., Salakhutdinov, R.: Learning robust visual-semantic embeddings. In: ICCV. pp. 3591–3600 (2017)
- [38] Venugopalan, S., Rohrbach, M., Donahue, J., Mooney, R., Darrell, T., Saenko, K.: Sequence to sequence-video to text. In: ICCV. pp. 4534–4542 (2015)
- [39] Venugopalan, S., Xu, H., Donahue, J., Rohrbach, M., Mooney, R., Saenko, K.: Translating videos to natural language using deep recurrent neural networks. NAACL-HLT pp. 1494–1504 (2015)
- [40] Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: A neural image caption generator. In: CVPR. pp. 3156–3164 (2015)
- [41] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks: Towards good practices for deep action recognition. In: ECCV. pp. 20–36 (2016)
- [42] Wang, L., Xiong, Y., Wang, Z., Qiao, Y., Lin, D., Tang, X., Van Gool, L.: Temporal segment networks for action recognition in videos. arXiv preprint arXiv:1705.02953 (2017)
- [43] Wu, C.Y., Zaheer, M., Hu, H., Manmatha, R., Smola, A.J., Krähenbühl, P.: Compressed video action recognition. CVPR (2018)
- [44] Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: Neural image caption generation with visual attention. In: ICML. pp. 2048–2057 (2015)
-  Yu, H., Wang, J., Huang, Z., Yang, Y., Xu, W.: Video paragraph captioning using hierarchical recurrent neural networks. In: CVPR. pp. 4584–4593 (2016)
-  Zhang, B., Wang, L., Wang, Z., Qiao, Y., Wang, H.: Real-time action recognition with enhanced motion vector cnns. In: CVPR. pp. 2718–2726 (2016)
-  Zhang, K., Chao, W.L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: ECCV. pp. 766–782 (2016)
-  Zhao, Y., Xiong, Y., Wang, L., Wu, Z., Tang, X., Lin, D.: Temporal action detection with structured segment networks. ICCV pp. 2933–2942 (2017)
-  Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question answering in images. In: CVPR. pp. 4995–5004 (2016)
Appendix A Implementation Details
a.1 Video and Text Features
a.1.1 C3D Features.
Similar to , we follow the standard ActivityNet setting and use the C3D  features from  for retrieval and dense captioning . In all our experiments under this setting, we extract frame-wise video features using a C3D model pre-trained on the Sports-1M dataset, with a temporal stride of 16. PCA is then applied to reduce the feature dimensionality to 500.
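As a rough illustration of what a temporal stride of 16 means for feature extraction, the sketch below enumerates the start frames of the clips fed to the feature extractor. The function name `clip_starts` and the clip length of 16 frames are assumptions made for this example; the text above states only the stride.

```python
def clip_starts(num_frames, clip_len=16, stride=16):
    """Return the start frame of every full clip sampled at the given stride.

    clip_len=16 is an assumption for illustration; only the stride of 16
    is stated in the text.
    """
    if num_frames < clip_len:
        return []
    # The last valid start is num_frames - clip_len, hence the +1 in range().
    return list(range(0, num_frames - clip_len + 1, stride))


starts = clip_starts(100)
print(starts)  # [0, 16, 32, 48, 64, 80]
```

Under these assumptions each start index yields one feature vector, so a 100-frame video produces six vectors before PCA.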
a.1.2 TSN-Inception V3 Features.
To leverage the state of the art in video modeling, we extract more recent deep features for retrieval on ActivityNet  and DiDeMo , using the Inception V3 model pre-trained on the Kinetics  dataset (provided by ). Following their settings, we resize video frames to the resolution of . We then feed the video frames into the Inception V3 model and extract the output activations of the penultimate layer. Unlike , we do not perform any test-time data augmentation (e.g., multiple crops, color jitter). Note that no fine-tuning is performed on either ActivityNet or DiDeMo.
a.1.3 Word Features.
In the retrieval-related experiments, we always initialize the word embeddings with GloVe features  and fine-tune them. Specifically, we use the GloVe vectors pre-trained on 840B tokens of web-crawled data, with a dimensionality of 300.
a.1.4 Training Details
When learning a hierarchical embedding is applicable, we feed the entire video/paragraph, in its frame-wise/word-wise representation, through the low-level encoder, and then pass the resulting low-level embedding to the high-level encoder as its initial hidden state. In all our experiments, we use a GRU  with a hidden dimension of 1,024 as our sequence encoder and decoder. To obtain the embedding of a sequence, we take the channel-wise max over all output vectors of the GRU, as it empirically outperforms other strategies such as .
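The channel-wise max described above can be sketched in a few lines. This is a minimal illustration on plain Python lists and not the authors' implementation; the function name `channelwise_max` and the toy 3-dimensional outputs are hypothetical (the model uses 1,024-dimensional GRU outputs).

```python
def channelwise_max(outputs):
    """Pool a sequence of T output vectors (each of length D) into one
    D-dimensional embedding by taking the max of every channel over time."""
    if not outputs:
        raise ValueError("need at least one output vector")
    # zip(*outputs) groups the values of each channel across all time steps.
    return [max(channel) for channel in zip(*outputs)]


# Toy stand-in for GRU outputs: 3 time steps, 3 channels.
gru_outputs = [
    [0.1, -0.4, 0.7],
    [0.5, -0.2, 0.3],
    [0.2, 0.9, -0.1],
]
embedding = channelwise_max(gru_outputs)
print(embedding)  # [0.5, 0.9, 0.7]
```

The pooled vector has the same dimensionality as a single output vector, regardless of sequence length.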
During training, we use the Adam  optimizer with an initial learning rate of 0.001 and decay it by a factor of 10 every 10 epochs. We use Xavier initialization  for each affine layer in our model with zero mean and variance of 0.01. We set all margins in the loss functions to . Each loss is normalized by its batch size. On both the ActivityNet and DiDeMo datasets, we train our embedding models for 15 epochs and collect the final results.
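The learning-rate schedule can be written out explicitly. The sketch below assumes that "decay by 10" means dividing the rate by 10 and that epochs are counted from zero; both are interpretations, and the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, base_lr=0.001, decay_factor=10.0, decay_every=10):
    """Step decay: divide base_lr by decay_factor once per decay_every epochs.

    Assumes zero-based epoch indices.
    """
    return base_lr / (decay_factor ** (epoch // decay_every))


# 15 training epochs, as stated in the text.
schedule = [learning_rate(epoch) for epoch in range(15)]
for epoch, lr in enumerate(schedule):
    print(f"epoch {epoch:2d}: lr = {lr:.4f}")
```

Under these assumptions the first ten epochs run at 0.001 and the remaining five at 0.0001.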
Appendix B Additional Experiments
b.1 Ablation Study on Different Learning Objectives
b.1.1 Ablation Study with Different Learning Objectives
We report ablation studies of the different losses on the ActivityNet video and paragraph retrieval tasks in Table 7. We use the Inception-V3 features and follow the same setting for training hse. Each time, we remove one loss and report the performance. Note that the reconstruction loss and the low-match loss are the most useful: removing either one lowers performance in every column, with the low-match loss causing the largest drop.
| Model | Paragraph → Video R@1 | Paragraph → Video R@5 | Video → Paragraph R@1 | Video → Paragraph R@5 |
|---|---|---|---|---|
| hse w/o high-cluster | 44.6 | 76.4 | 44.2 | 76.1 |
| hse w/o low-match | 40.9 | 73.6 | 39.8 | 73.6 |
| hse w/o low-cluster | 44.6 | 76.6 | 43.9 | 76.4 |
| hse w/o reconstruction | 43.9 | 75.8 | 43.3 | 75.3 |
| hse w/ all losses | 44.4 | 76.7 | 44.2 | 76.7 |
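To make the comparison in the ablation table concrete, the following sketch computes how much each ablated model falls below the full model, averaged over the four reported columns. The numbers are copied from the table; averaging across columns is a choice made for this illustration and is not a metric used in the paper.

```python
# Scores of the full model (all losses), in table column order.
full = [44.4, 76.7, 44.2, 76.7]

# Scores after removing one loss at a time, in the same column order.
ablations = {
    "high-cluster": [44.6, 76.4, 44.2, 76.1],
    "low-match": [40.9, 73.6, 39.8, 73.6],
    "low-cluster": [44.6, 76.6, 43.9, 76.4],
    "reconstruction": [43.9, 75.8, 43.3, 75.3],
}


def mean_drop(scores, reference):
    """Average amount by which `scores` falls below `reference`."""
    return sum(r - s for s, r in zip(scores, reference)) / len(reference)


drops = {name: mean_drop(scores, full) for name, scores in ablations.items()}
# Losses whose removal hurts the most come first.
ranking = sorted(drops, key=drops.get, reverse=True)
for name in ranking:
    print(f"w/o {name}: mean drop {drops[name]:.2f}")
```

With this averaging, removing the low-match loss costs the most, followed by the reconstruction loss, while the two cluster losses change the scores only marginally.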
b.1.2 Low-level Loss is Beneficial
As mentioned in the main text (see Table 1 and Table 2 there), learning with low-level objectives is beneficial for our full model. To better understand this, we also plot the recall (in ) with regard to the rank of the video/paragraph to a query as supporting evidence. The results are shown in Fig. 6.
Figure 6: (a) hse[=0]; (b) hse.
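Recall as a function of rank, as plotted in these curves, can be computed from the rank at which each query's correct item is retrieved. The sketch below is a generic recall@K routine and not the authors' evaluation code; the function name `recall_at_k` and the example ranks are made up for illustration.

```python
def recall_at_k(ranks, k):
    """Percentage of queries whose correct item appears within the top k.

    `ranks` holds the 1-based rank of the ground-truth item for each query.
    """
    if not ranks:
        raise ValueError("need at least one query")
    hits = sum(1 for rank in ranks if rank <= k)
    return 100.0 * hits / len(ranks)


# Made-up ranks for eight queries, for illustration only.
ranks = [1, 3, 2, 10, 1, 7, 25, 4]
curve = [(k, recall_at_k(ranks, k)) for k in (1, 5, 10, 50)]
print(curve)  # [(1, 25.0), (5, 62.5), (10, 87.5), (50, 100.0)]
```

Evaluating the function over a range of k values yields one point of the recall curve per value.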
b.2 Ablation Study on Reconstruction Balance Term
Here we study the influence of the loss balance term by experimenting with multiple values of this term under a controlled setting. We choose to study this on validation set 2 (val2) of ActivityNet, with Inception V3 visual features as input. Detailed results are shown in Table 8. We find that retrieval performance (R@1 and R@5) approaches its peak when the balance term equals 0.0005. Therefore, as stated in the main text, we set it to 0.0005 in all our experiments.
Table 8: Paragraph → Video and Video → Paragraph retrieval on ActivityNet val2 (Inception-V3 pre-trained on Kinetics).
b.3 Performance on ActivityNet Validation Set 2
As mentioned in the main paper, we report the val2 performance of fse, hse[=0], and hse in Table 9. Again, the results support the paper's claim: hse consistently outperforms fse and hse[=0]. This shows the importance of hierarchical modeling and feature reconstruction.
Table 9: Paragraph → Video and Video → Paragraph retrieval on ActivityNet val2 (C3D features with dimensionality reduction; Inception-V3 pre-trained on Kinetics).
Appendix C Visualization
|query text||A man is floating in the water holding a table and a stool. The man stands on the table and sits the stool upright. The man sits on the stool as he water skis on the table. The man stands on top of the stool then stands up. The man is standing on a stool as he water skis in lake. The man does a spin while on the stool. The man jumps in the water as the boat drives on.|
|query text||We see the scoreboard of a racing game. The race starts ant the player is playing on jet skis. The player passes the red bridge. The player passes the cruise ship and zeppelin. The player passes the cliff with the lighthouse. The timer counts down from 10 and the race is finishes. We see the players record score. We see the ranking screen for the game, the level, and the option to change the difficulty.|
|query text||A man and a woman are dancing together. The man dips under the woman’s arm. The man has his hand on the woman’s waist. The man puts his hand on his waist. The man hits his head accidentally. A person hits the item hanging from the roof.|
|query text||A man throws a bowling ball. He then goes back and high fives his friends. They sit and talk around the table. A woman stands up and grabs a bowling ball. She walks up and drops it down the lane. She sits back down and looks at her phone. They continue talking around the table. She stands back up and picks up a bowling ball. She throws it down the lane again. She sits back down at the table. She throws a ball behind her while walking away. She picks up a ball and throws it with her hands over her eyes. She throws a bowling ball while talking on the phone.|
c.0.1 Qualitative Examples for hse, hse[=0], and fse on Retrieval Tasks.
We show qualitative examples on ActivityNet below. To give a systematic analysis of success and failure cases, we visualize positive examples of paragraph retrieval (Figure 7) and video retrieval (Figure 9), and negative examples in Figure 8 and Figure 10. We observe that in some failure cases, although hse fails to retrieve the correct text/video, it retrieves a highly relevant item given the query.