Learning Shared Multimodal Embeddings with Unpaired Data

AJ Piergiovanni       Michael S. Ryoo
Department of Computer Science
Indiana University Bloomington
{ajpiergi, mryoo}@indiana.edu

In this paper, we propose a method to learn a joint multimodal embedding space. We compare the effect of various constraints using paired text and video data. Additionally, we propose a method to improve the joint embedding space using an adversarial formulation with unpaired text and video data. In addition to testing on publicly available datasets, we introduce a new, large-scale text/video dataset. We experimentally confirm that learning such a shared embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.


Preprint. Work in progress.

1 Introduction

Videos contain multiple data sources, such as visual, audio or text data. Each data modality has distinct statistical properties capturing different aspects of the event. Current state-of-the-art activity recognition models (e.g., [4, 39]) only take visual data and class labels as input. A sentence, for example ‘a group of men play basketball outdoors,’ contains much more information, such as ‘outdoors’ and ‘group of men,’ than a simple activity class of ‘basketball.’ Without multimodal learning, these models are unable to benefit from the additional information.

In this multimodal setting, we have visual and textual input data, each with its own representation. Video data is represented as a sequence of images (high dimensional pixel data) while text is represented as a sequence of word embeddings (relatively low dimensional). Using a shared embedding space allows for learning the highly non-linear relationships between the modalities. Such relationships will capture similarities between concepts, such as basketball and volleyball both being sports with a ball and further generalize to concepts not seen during training.

Zero-shot learning is defined as the ability to classify an instance of an unseen class without any training instances. Standard activity classification methods are only able to classify seen examples. However, using a joint embedding space, we can take advantage of a different modality (e.g., text) to infer ideal embeddings of unseen classes. This allows categorization of a video with an unseen class within the embedding space, enabling zero-shot recognition.

Existing approaches to both zero-shot and embedding space learning require paired data examples, which can be expensive to obtain. By taking advantage of adversarial learning [7], we are able to use unpaired data (i.e., random sentences and random videos without any labels or correspondence) to further improve our learned shared embedding space.

In this paper, we design a method capable of learning joint video/language embedding spaces using both paired and unpaired data and experimentally confirm its benefit to three challenging tasks: (i) zero-shot activity recognition, (ii) unsupervised activity discovery, and (iii) unseen activity captioning. We show that the learning of a shared embedding space generalizes to unseen data.

2 Related works

Multimodal learning

Our approach falls into the category of multimodal learning. Previous approaches have used Restricted Boltzmann Machines [38] or log-bilinear models [16] to learn distributions of sentences and images. Ngiam et al. [24] designed an autoencoder that learns joint audio-video representations by concatenating and using a bottleneck layer. Frome et al. [6] describe a model that maps images and words to a shared embedding. However, these works either learn a joint embedding by concatenating the different features (e.g., [24, 38]) or require a triplet of positive and negative matches (e.g., [6]).

Text and vision

Text combined with visual data has been used for many tasks, such as image captioning [14, 12, 13] or video captioning [18, 48, 43]. Other works have explored the use of textual grounding for image/video retrieval [9, 32, 22, 11]. There have been various models proposed to learn a fixed text embedding space with mappings from video features into this embedding space [8, 26, 36, 41, 42]. However, these works all learn a single directional mapping (e.g., only mapping from image/video to text) and only learn with paired text/image samples.

Unpaired learning

Recently, there have been many works taking advantage of variational autoencoders (VAEs) [15] or generative adversarial networks (GANs) [7] to learn mappings between unpaired samples. CycleGAN [49] uses a cycle-consistency loss (i.e., the ability to go from a sample in one domain to a second domain then back to the source) to learn unpaired image translation (e.g., image to sketch). Other works learn many-to-many mappings between images [2] or use two GANs to map between domains [46]. An autoencoder with shared weights for both domains has been used to learn a latent space for image-to-image translation [21]. However, these works all focus on learning mappings between unpaired data of the same modality (e.g., image to image).

Zero-shot activity recognition

Many previous works have studied zero-shot activity recognition. Common approaches include using attributes [20, 27, 33] or word embeddings [44, 45, 25, 35, 17] or learning a similarity metric [47, 5].

Our work differs from these previous works in three key ways: (1) we learn a shared embedding space with bi-directional mappings, (2) we experimentally compare the use of the embedding space for both zero-shot recognition and unseen video captioning, and (3) we show the benefit of additional data augmentation using unpaired samples.

3 Method

To enable learning of a joint embedding space, we use a deep autoencoder architecture. Our model consists of 4 neural networks:

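Video Encoder $E_V(v) \mapsto z$    Video Decoder $G_V(z) \mapsto v$
Text Encoder $E_T(t) \mapsto z$    Text Decoder $G_T(z) \mapsto t$

where $v$ is a video and $t$ is a sentence. $z$ is the latent joint embedding that we are learning. The encoders learn a compressed representation of the video or text while the decoders are trained to reconstruct the input:

$$\mathcal{L}_{recons}(v, t) = \|v - G_V(E_V(v))\|_2 + \|t - G_T(E_T(t))\|_2 \tag{1}$$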

As both text and video data are sequences, they often have different lengths. A joint embedding space requires that the features from both modalities have the same dimensions, so given a text representation and a video representation of different lengths, we need to obtain a fixed-size representation. To learn a fixed-dimensional embedding, there are many choices for the encoder/decoder architecture, such as pooling [23], attention [29] or RNNs [18]. We choose to use temporal attention filters [29] as they learn a mapping from an input of any length to a fixed-dimensional vector and have been shown to outperform pooling and RNNs on activity recognition tasks.

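The attention filters learn 3 parameters: a center $g$, a width $\sigma$ and a stride $\delta$. The filters are created by:

$$\mu_i = g + \left(i - \tfrac{N}{2} + 0.5\right)\delta, \qquad F[i, \tau] = \frac{1}{Z_i}\exp\left(-\frac{(\tau - \mu_i)^2}{2\sigma^2}\right) \tag{2}$$

where $N$ is the number of distributions to learn, $i \in \{0, \dots, N-1\}$ indexes the filters, $\tau$ indexes the time steps of the input and $Z_i$ normalizes each filter. They are applied by matrix multiplication with the video sequence: $v' = Fv$. Additionally, we can learn a transposed version of these filters to reconstruct the input: $\hat{v} = F^\top v'$. To reconstruct the input, the decoders learn their own set of weights with the tensors transposed, resulting in the matching output size. The architecture of the encoders is shown in Fig. 1.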
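As a rough illustration of how such filters turn a sequence of any length into a fixed-size representation, the sketch below builds a small bank of filters and applies them by matrix multiplication. The Gaussian shape, the per-filter normalization, and the names `attention_filters` and `apply_filters` are assumptions made for this example; they are not taken from the authors' implementation.

```python
import math

def attention_filters(center, width, stride, num_filters, length):
    """Build a (num_filters x length) matrix of normalized Gaussian filters."""
    filters = []
    for i in range(num_filters):
        # Filter centers are spread around `center`, `stride` time steps apart.
        mu = center + (i - num_filters / 2 + 0.5) * stride
        row = [math.exp(-((t - mu) ** 2) / (2 * width ** 2)) for t in range(length)]
        total = sum(row)
        filters.append([w / total for w in row])
    return filters

def apply_filters(filters, sequence):
    """Matrix product F @ sequence: (N x T) times (T x D) gives (N x D)."""
    dim = len(sequence[0])
    return [[sum(f[t] * sequence[t][d] for t in range(len(sequence)))
             for d in range(dim)]
            for f in filters]

# Two toy sequences of different lengths map to the same N x D shape.
short_seq = [[float(t), 1.0] for t in range(8)]
long_seq = [[float(t), 1.0] for t in range(20)]
pooled_short = apply_filters(attention_filters(3.5, 1.0, 2.0, 3, 8), short_seq)
pooled_long = apply_filters(attention_filters(9.5, 2.0, 5.0, 3, 20), long_seq)
```

Both `pooled_short` and `pooled_long` have 3 rows of 2 features, even though the inputs have 8 and 20 time steps. In a trained model the center, width and stride would be learned rather than set by hand.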

Figure 1: Illustration of the encoder models used to learn a joint embedding space. Videos and texts are mapped into a low-dimensional space by applying CNNs and temporal attention. Then several fully-connected layers map into the joint embedding space. The decoders follow this same architecture with the weights transposed.

3.1 Learning a joint embedding space

Figure 2: Visualization of how we learn a shared embedding space. Circles are video data, ovals are reconstructed video. Diamonds are text data, and pentagons are reconstructed text. (a) The reconstruction (Eq. 1) and joint (Eq. 3) losses. (b) Mapping from text to video using the cross-domain (Eq. 4) loss.

To learn a joint representation space, we minimize the distance between the embeddings of a pair of text and video, $E_T(t)$ and $E_V(v)$ (shown in Fig. 2(a)):

$$\mathcal{L}_{joint}(v, t) = \|E_V(v) - E_T(t)\|_2 \tag{3}$$

This forces the joint embeddings to be similar and when combined with the reconstruction loss, ensures that the representations can still reconstruct the input.

We can further constrain the networks and learned representation by forcing a cross-domain mapping from text to video and from video to text (shown in Fig. 2(b)):

$$\mathcal{L}_{cross}(v, t) = \|G_T(E_V(v)) - t\|_2 + \|G_V(E_T(t)) - v\|_2 \tag{4}$$

Additionally, we can use a cycle loss to map from video to text and back to video. Note that while the previous losses all require paired examples, this loss does not.

$$\mathcal{L}_{cycle}(v) = \|v - G_V(E_T(G_T(E_V(v))))\|_2 \tag{5}$$

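To train the model to learn a joint embedding space, we minimize

$$\mathcal{L} = \alpha_1 \mathcal{L}_{recons} + \alpha_2 \mathcal{L}_{joint} + \alpha_3 \mathcal{L}_{cross} + \alpha_4 \mathcal{L}_{cycle} \tag{6}$$

where $\alpha_1, \dots, \alpha_4$ are hyper-parameters weighting the various loss components.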
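To make the combination of the reconstruction, joint, cross-domain and cycle terms concrete, here is a minimal sketch on toy vectors. It assumes every distance is a Euclidean norm and uses scalar-multiplication stand-ins for the four networks; the function `paired_loss`, its `weights` argument and the toy encoders and decoders are hypothetical and only illustrate how the terms fit together.

```python
import math

def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def paired_loss(v, t, enc_v, dec_v, enc_t, dec_t, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the four loss terms for one (video, text) pair."""
    z_v, z_t = enc_v(v), enc_t(t)
    recons = l2(v, dec_v(z_v)) + l2(t, dec_t(z_t))   # each modality reconstructs itself
    joint = l2(z_v, z_t)                             # paired embeddings should coincide
    cross = l2(t, dec_t(z_v)) + l2(v, dec_v(z_t))    # decode into the other modality
    cycle = l2(v, dec_v(enc_t(dec_t(z_v))))          # video -> text -> video
    a1, a2, a3, a4 = weights
    return a1 * recons + a2 * joint + a3 * cross + a4 * cycle

# Toy stand-ins for the networks: simple scalings that invert each other.
enc_v = lambda x: [0.5 * e for e in x]
dec_v = lambda z: [2.0 * e for e in z]
enc_t = lambda x: [0.25 * e for e in x]
dec_t = lambda z: [4.0 * e for e in z]

aligned = paired_loss([2.0, 4.0], [4.0, 8.0], enc_v, dec_v, enc_t, dec_t)
misaligned = paired_loss([2.0, 4.0], [4.0, 12.0], enc_v, dec_v, enc_t, dec_t)
```

With these stand-ins the aligned pair gives a loss of zero. The misaligned pair is penalized only by the joint and cross-domain terms, which are the terms that depend on the pairing.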

3.2 Semi-supervised learning with unpaired data

Figure 3: Visualization of the adversarial formulation to learn with unpaired data. We create 3 discriminators: (1) learns to discriminate between text and video examples in the latent space, (2) learns to discriminate video generated from text from real video, and (3) learns to discriminate generated text from real text.

To learn using unpaired data (i.e., unrelated text and video), we use an adversarial formulation. We treat the encoders and decoders as generator networks. We then learn an additional 3 discriminator networks which constrain the generators and embedding space and force the encoders and decoders to be consistent:

  1. $D_z$, which learns to discriminate between latent text representations and latent video representations. Conceptually, this constrains the learned embeddings to appear to be from the same distribution.

  2. $D_V$, which learns to discriminate between true video data and generated video data, $G_V(E_T(t))$.

  3. $D_T$, which learns to discriminate between true text data and generated text data, $G_T(E_V(v))$.

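Given these discriminators, we minimize the following losses:

$$\begin{aligned} \mathcal{L}_{D_z}(v, t) &= -\log D_z(E_V(v)) - \log\left(1 - D_z(E_T(t))\right) \\ \mathcal{L}_{D_V}(v, t) &= -\log D_V(v) - \log\left(1 - D_V(G_V(E_T(t)))\right) \\ \mathcal{L}_{D_T}(v, t) &= -\log D_T(t) - \log\left(1 - D_T(G_T(E_V(v)))\right) \end{aligned} \tag{7}$$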

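Using the discriminators, we can train the generators (encoders and decoders) to minimize the following loss based on unpaired data:

$$\mathcal{L}_{G}(v, t) = \log D_z(E_V(v)) + \log\left(1 - D_z(E_T(t))\right) + \log\left(1 - D_V(G_V(E_T(t)))\right) + \log\left(1 - D_T(G_T(E_V(v)))\right) \tag{8}$$

Note that in this formulation, $v$ and $t$ are not paired.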
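The opposing objectives of a discriminator and a generator can be illustrated with the standard GAN losses [7]. This is a sketch under the assumption that each discriminator outputs the probability that its input is real; the function names are hypothetical and the numbers are made up for the example.

```python
import math

def discriminator_loss(p_real, p_fake):
    """Cross-entropy loss: real samples should score near 1, generated near 0."""
    return -math.log(p_real) - math.log(1.0 - p_fake)

def generator_loss(p_fake):
    """Generator term of the minimax objective: lower when the discriminator is fooled."""
    return math.log(1.0 - p_fake)

# A discriminator that separates real from generated samples has a low loss...
confident = discriminator_loss(p_real=0.9, p_fake=0.1)
# ...while one that cannot tell them apart sits at 2 * log(2).
unsure = discriminator_loss(p_real=0.5, p_fake=0.5)

# The generator loss decreases as generated samples receive higher scores.
fooled = generator_loss(p_fake=0.9)
caught = generator_loss(p_fake=0.1)
```

The same pair of functions applies to each of the three discriminators, with `p_fake` taken from generated video, generated text, or the latent embedding of the other modality.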

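These networks are trained in an adversarial setting. For example, for the text-to-video generator (i.e., $G_V(E_T(t))$) and the video discriminator, $D_V$, we optimize the following minimax equation:

$$\min_{G_V, E_T} \max_{D_V} \; \mathbb{E}_{v}\left[\log D_V(v)\right] + \mathbb{E}_{t}\left[\log\left(1 - D_V(G_V(E_T(t)))\right)\right]$$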

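This equation is similarly applied for video-to-text. For learning the embedding space with the video and text encoders, $E_V$ and $E_T$, and the discriminator $D_z$, we optimize the following minimax equation:

$$\min_{E_V, E_T} \max_{D_z} \; \mathbb{E}_{v}\left[\log D_z(E_V(v))\right] + \mathbb{E}_{t}\left[\log\left(1 - D_z(E_T(t))\right)\right]$$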

As training GANs can be unstable, we developed an algorithm to allow for more stable training of the joint embedding space, shown in Algorithm 1. We initialize both the generator and discriminator networks by training only on paired data. After several iterations of this, we train with both unpaired and paired data. We found the initial training of the generators and discriminators was important for stability; without it, the loss often diverges and the learned embedding is not meaningful.

function Train
     for number of initialization iterations do
         Sample (v, t) from paired training data
         Update encoders/decoders based on paired data (Eq. 6)
         Update discriminators (Eq. 7)
     end for
     for number of training iterations do
         Sample (v, t) from paired and (v', t') from unpaired training data
         Update encoders/decoders based on paired data (Eq. 6)
         Update encoders/decoders based on unpaired data (Eq. 8)
         Update discriminators based on all samples (Eq. 7)
     end for
end function
Algorithm 1: Semi-supervised alignment with adversarial learning
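The two-phase schedule of Algorithm 1 can be sketched as a plain training loop. The function `train` and its callback arguments are hypothetical placeholders for the real update steps, and the sketch cycles through the data in order instead of sampling at random, purely to keep the example deterministic.

```python
def train(paired, unpaired, init_iters, train_iters,
          update_paired, update_unpaired, update_discriminators):
    """Paired-only warm-up, followed by training on paired and unpaired data."""
    # Phase 1: initialize generators and discriminators on paired data only.
    for step in range(init_iters):
        batch = paired[step % len(paired)]
        update_paired(batch)
        update_discriminators([batch])
    # Phase 2: add the unpaired adversarial update to every iteration.
    for step in range(train_iters):
        paired_batch = paired[step % len(paired)]
        unpaired_batch = unpaired[step % len(unpaired)]
        update_paired(paired_batch)
        update_unpaired(unpaired_batch)
        update_discriminators([paired_batch, unpaired_batch])

# Record the order of updates with stand-in callbacks.
log = []
train(
    paired=[("v1", "t1"), ("v2", "t2")],
    unpaired=[("v9", "t7")],
    init_iters=2,
    train_iters=1,
    update_paired=lambda batch: log.append(("paired", batch)),
    update_unpaired=lambda batch: log.append(("unpaired", batch)),
    update_discriminators=lambda batches: log.append(("disc", len(batches))),
)
```

After the call, `log` shows two warm-up iterations with paired and discriminator updates only, then one full iteration in which the discriminators see both the paired and the unpaired sample.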

4 Experiments

We compare our various approaches on different tasks: (i) zero-shot activity recognition, (ii) unsupervised activity discovery and (iii) unseen activity captioning. These tasks test various combinations of our encoders and decoders and how well the shared representation generalizes to unseen data. We experimentally confirm the benefits of our methods using multiple datasets: ActivityNet [10, 18], HMDB [19], UCF101 [37], and MLB-YouTube [30].

4.1 Implementation/training details

We implement our models in PyTorch. For the per-segment video CNN, we use I3D [4] to obtain a video representation. We use GloVe word embeddings [28] followed by 4 fully-connected layers to obtain a language representation. We set for the temporal attention filters to learn the joint embedding space. We train for 200 epochs and use stochastic gradient descent with momentum to minimize the loss function with a learning rate of 0.01. After every 30 epochs, we decay the learning rate by a factor of 10. When training in the adversarial setting (e.g., Algorithm 1), we initialize the network training for 20 epochs on paired data followed by 200 on the paired + unpaired data.
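The step decay described above works out as follows. This is a small sketch that assumes the decay takes effect at the start of epochs 30, 60, and so on; the function name and its default arguments simply restate the values given in the text.

```python
def learning_rate(epoch, base_lr=0.01, decay_every=30, factor=10.0):
    """Learning rate at a given zero-based epoch under step decay."""
    return base_lr / (factor ** (epoch // decay_every))

# Learning rate at the first epoch of each 30-epoch block in a 200-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in range(0, 200, 30)}
```

Under this assumption the rate has been divided by 10 six times by the end of a 200-epoch run, so the last block of epochs trains at roughly 1e-8.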

4.2 Zero-shot activity recognition

Zero-shot activity recognition is the problem of classifying a video that belongs to a class not seen during training. Given training videos of seen classes together with paired text descriptions, our approach learns a shared embedding that maps videos/texts from multiple seen classes. Next, only the text data of the unseen classes is provided. In the testing phase, the objective is to classify videos of unseen classes solely based on the learned embedding space and the text samples.

We use the ActivityNet captions [18] dataset to learn the joint embedding space, as this dataset has both sentence descriptions for each video and activity classes. We randomly choose a set of activity classes and withhold all videos belonging to those classes during training. For testing, we take a set of sentences for the unseen classes and map the sentences into the joint embedding space with the text encoder, $z_t = E_T(t)$. We then map the videos into the space with the video encoder, $z_v = E_V(v)$, and use nearest neighbors to match each video ($z_v$) to text ($z_t$), using the class of the sentence as the classification for the video.
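The nearest-neighbor matching step can be sketched as below. The example assumes Euclidean distance in the joint space and uses made-up 2-D embeddings; `classify_zero_shot` is a hypothetical helper, not part of the released code.

```python
import math

def classify_zero_shot(video_embedding, sentence_embeddings):
    """Return the class of the sentence embedding closest to the video embedding.

    sentence_embeddings: list of (class_name, embedding) pairs, one per
    sentence describing an unseen class.
    """
    label, _ = min(sentence_embeddings,
                   key=lambda pair: math.dist(video_embedding, pair[1]))
    return label

# Toy embeddings standing in for encoded sentences of two unseen classes.
unseen = [("basketball", [1.0, 0.0]), ("surfing", [0.0, 1.0])]
prediction = classify_zero_shot([0.9, 0.2], unseen)
```

Here `prediction` is the class of the closest sentence. No video of either class is needed to define the class embeddings, which is what makes the setting zero-shot.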

In Table 1, we compare the effect of the various loss components. For each method, we run 10 trials each with a different set of unseen activity classes and average the results. We find that previous methods of learning a fixed language embedding (e.g., [35, 44, 45]) are significantly outperformed by learning a joint representation. Further, adding the reconstruction, cross-domain, and cycle losses all improve performance.

To obtain unpaired data, we use the sentence descriptions from the Charades [34] dataset, which contains many activities in a home setting. The unpaired video data is sampled from HMDB and UCF101. We find that our approach of using adversarial learning with unpaired data (in addition to the paired training) further improves performance.

Method                            5 Unseen   10 Unseen   20 Unseen   50 Unseen
Fixed Text Representation         42.0       38.2        29.5        15.5
Joint Representation              55.2       42.5        35.2        20.1
Joint + recons                    70.3       54.5        42.3        27.2
Joint + recons + cycle            70.5       54.7        41.4        27.2
Joint + recons + cross            72.4       55.2        42.8        27.4
Joint + recons + cross + cycle    76.2       57.4        45.2        29.4
Joint + all + unpaired            82.4       60.2        46.5        30.2
Table 1: Comparison of various methods on ActivityNet for 5, 10, 20 or 50 unseen classes. These results are averaged over 10 trials where each trial has a different set of unseen activity classes.

In Table 2, we compare our approach to previous zero-shot learning methods on HMDB and UCF101. The paired training data for these models is drawn from ActivityNet with any classes belonging to HMDB or UCF101 withheld. The unpaired text data is sampled from Charades and the video data comes from either HMDB (when testing on UCF101) or UCF101 (when testing on HMDB). As HMDB and UCF101 have no text descriptions, we created a sentence description for each activity class (included in Appendix B). We find that our joint embedding space outperforms the previous approaches on these datasets and unpaired adversarial learning further improves our performance.

SJE [1]
ConSe [25]
Semantic Embedding [44]
Manifold Ridge Regression [45]
Ours (paired)
Ours (paired + unpaired)
Table 2: Results on HMDB51 and UCF101 compared to previous state-of-the-art results. We find that learning a shared embedding space is beneficial and that augmenting with unpaired data provides the best results.

4.3 Unsupervised activity discovery

Figure 4: Example video sequences from the MLB-YouTube dataset with the commentary caption. Top: Sentences that describe the occurring activities. Bottom: Sentences that do not describe the current activities.

To further evaluate our joint embedding, we conduct experiments on unseen activity discovery. For this task, we expand the MLB-YouTube dataset [30] by densely annotating the videos with captions from the commentary given by the announcers, resulting in approximately 50 hours of matching text and video. Examples of the text and video are shown in Fig. 4. The MLB-YouTube dataset is designed for fine-grained activity recognition, where the difference between activities is quite small. Additionally, these captions only roughly describe what is happening in the video, and often contain unrelated stories or commentary on a previous event, making this a challenging task. The dataset is available at https://github.com/piergiaj/mlb-youtube. To train our joint embedding space, we split each baseball video into 30 second intervals and use the corresponding text as paired data. This results in 6,089 paired training samples.

Figure 5: t-SNE mapping of (a) fixed text representation and (b) joint embedding with all paired losses for the MLB-YouTube dataset. This visually shows that the learning of a joint embedding space provides a better representation. Each color represents the activity class of the video (e.g., swing, hit, foul ball, etc.).

We evaluate our joint embedding using the segmented videos from MLB-YouTube. For each video, we compute the embedded features and apply $k$-means clustering. Each segmented video is assigned to a cluster and votes for the cluster label based on its ground truth label. We use that cluster assignment for classification. We report our findings in Table 3. As a baseline, we cluster I3D features pre-trained on Kinetics. We find that learning a joint embedding is beneficial. Further, we find that adding additional constraints leads to improved results. However, we note that augmenting with unpaired data performs worse on this task. This is likely due to the unpaired data being significantly different from the paired data in baseball videos.
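The voting step can be sketched as follows, assuming the cluster assignments have already been computed and that each cluster takes the most common ground-truth label among its members. The function name and the toy labels are hypothetical.

```python
from collections import Counter, defaultdict

def cluster_vote_accuracy(cluster_ids, labels):
    """Label each cluster by majority vote, then score every sample against it."""
    votes = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, labels):
        votes[cluster][label] += 1
    # Each cluster is named after the most frequent ground-truth label inside it.
    cluster_label = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    correct = sum(cluster_label[c] == y for c, y in zip(cluster_ids, labels))
    return correct / len(labels)

# Six toy clips in two clusters; one clip in each cluster is a minority label.
clusters = [0, 0, 0, 1, 1, 1]
truth = ["swing", "swing", "hit", "hit", "hit", "foul"]
accuracy = cluster_vote_accuracy(clusters, truth)
```

In this toy case four of the six clips agree with the label of their cluster. A better embedding produces purer clusters and therefore a higher score.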

In Table 4 we compare our various methods for unsupervised activity discovery on HMDB and UCF101. Here, we learn a joint embedding space using the ActivityNet videos and captions. We withhold any videos belonging to a class in HMDB or UCF101. Once learned, we cluster the embedded video features. On these datasets, we find that using the unpaired training further improves performance. This confirms that when the additional data is similar to the target dataset, using the adversarial learning setting further improves the embedding space.

Method                             Accuracy
Baseline I3D features              23.4
Fixed Text Representation          27.9
Joint Representation               34.5
Joint + decoder                    37.9
Joint + decoder + cycle            44.2
Joint + decoder + cross            43.7
Joint + decoder + cross + cycle    48.4
Joint + all + unpaired             39.7
Table 3: Comparison of unsupervised activity classification on MLB-YouTube segmented videos.

Method                             HMDB    UCF101
Baseline I3D features              26.8    42.8
Joint Representation               32.5    57.8
Joint + decoder                    33.4    59.1
Joint + decoder + cross + cycle    34.2    59.3
Joint + all + unpaired             34.8    59.7
Table 4: Comparison of unsupervised activity classification on HMDB and UCF101.

4.4 Unseen video captioning

Figure 6: Example captions for unseen activities. Left: Using a joint embedding space allows the model to correctly caption this video as basketball, despite never seeing an example of basketball during training. Right: An example of a caption for the unseen water-ski activity. Here the model fails to correctly caption the activity.

As our model learns a mapping from video to the embedding space and from the embedding space to text, we can apply our model to caption videos. Existing video captioning models are unable to create realistic captions for unseen activities, as without training data they do not know the words to describe the video. Given a video, $v$, we can generate a caption by mapping the video to text with the video encoder and text decoder, $G_T(E_V(v))$. For each word, we then use nearest neighbors matching with the GloVe embeddings to obtain the words to form a sentence. In Table 5, we report the commonly used METEOR [3] and CIDEr [40] scores of our various models, measured with the unseen classes from the ActivityNet dataset. We find that learning a joint representation is beneficial and using unpaired samples further improves the task. In Fig. 6, we show example captioned videos.
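The final decoding step, turning predicted word vectors into words, can be sketched as a nearest-neighbor lookup. The 2-D vocabulary below is a made-up stand-in for GloVe vectors, Euclidean distance is an assumption, and `decode_caption` is a hypothetical helper.

```python
import math

def decode_caption(predicted_vectors, vocabulary):
    """Map each predicted vector to the closest word in the vocabulary."""
    words = [min(vocabulary, key=lambda word: math.dist(vec, vocabulary[word]))
             for vec in predicted_vectors]
    return " ".join(words)

# Toy word embeddings; real GloVe vectors have far more dimensions.
toy_vocabulary = {
    "a": [0.0, 0.0],
    "man": [1.0, 0.0],
    "plays": [0.0, 1.0],
    "basketball": [1.0, 1.0],
}
# Noisy decoder outputs, one vector per word position.
decoder_output = [[0.1, -0.1], [0.9, 0.1], [0.2, 0.8], [0.9, 1.1]]
caption = decode_caption(decoder_output, toy_vocabulary)
```

Because the lookup runs over the whole vocabulary, the decoder can emit a word such as "basketball" even when no training video of that activity was seen, provided its output lands near that word's embedding.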

Method                       METEOR    CIDEr
Fixed Text Representation    3.64      8.95
Joint Representation         4.21      9.23
Joint + all paired           5.31      11.21
Joint + paired + unpaired    6.89      13.95
Table 5: Comparison of several models for unseen activity captioning using the ActivityNet dataset, using METEOR and CIDEr scores. This evaluation was done on 10 unseen classes held out during training. Higher values are better.

5 Conclusion

We proposed an approach to learn a joint video/text embedding space using various constraints. We further extended the model to be able to learn with unpaired video and text data using an adversarial formulation. We experimentally confirmed that learning such an embedding space benefits three difficult tasks: (i) zero-shot activity classification, (ii) unsupervised activity discovery, and (iii) unseen activity captioning.


References

  • [1] Z. Akata, S. Reed, D. Walter, H. Lee, and B. Schiele. Evaluation of output embeddings for fine-grained image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [2] A. Almahairi, S. Rajeswar, A. Sordoni, P. Bachman, and A. Courville. Augmented cyclegan: Learning many-to-many mappings from unpaired data. arXiv preprint arXiv:1802.10151, 2018.
  • [3] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, 2005.
  • [4] J. Carreira and A. Zisserman. Quo vadis, action recognition? a new model and the kinetics dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [5] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2005.
  • [6] A. Frome, G. S. Corrado, J. Shlens, S. Bengio, J. Dean, T. Mikolov, et al. Devise: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • [7] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [8] S. Guadarrama, N. Krishnamoorthy, G. Malkarnenkar, S. Venugopalan, R. Mooney, T. Darrell, and K. Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV). IEEE, 2013.
  • [9] S. Gupta and R. J. Mooney. Using closed captions as supervision for video activity recognition. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2010.
  • [10] F. C. Heilbron, V. Escorcia, B. Ghanem, and J. C. Niebles. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015.
  • [11] L. Hendricks, O. Wang, E. Shechtman, J. Sivic, T. Darrell, and B. Russell. Localizing moments in video with natural language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [12] J. Johnson, A. Karpathy, and L. Fei-Fei. Densecap: Fully convolutional localization networks for dense captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [13] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [14] A. Karpathy, A. Joulin, and L. F. Fei-Fei. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems (NIPS), 2014.
  • [15] D. P. Kingma and M. Welling. Auto-encoding variational bayes. International Conference on Learning Representations (ICLR), 2014.
  • [16] R. Kiros, R. Salakhutdinov, and R. Zemel. Multimodal neural language models. In International Conference on Machine Learning (ICML), 2014.
  • [17] E. Kodirov, T. Xiang, and S. Gong. Semantic autoencoder for zero-shot learning. arXiv preprint arXiv:1704.08345, 2017.
  • [18] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [19] H. Kuehne, H. Jhuang, E. Garrote, T. Poggio, and T. Serre. Hmdb: a large video database for human motion recognition. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2011.
  • [20] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2011.
  • [21] M.-Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In Advances in Neural Information Processing Systems (NIPS), 2017.
  • [22] A. Miech, J.-B. Alayrac, P. Bojanowski, I. Laptev, and J. Sivic. Learning from video and text via large-scale discriminative clustering. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [23] J. Y.-H. Ng, M. Hausknecht, S. Vijayanarasimhan, O. Vinyals, R. Monga, and G. Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [24] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng. Multimodal deep learning. In International Conference on Machine Learning (ICML), 2011.
  • [25] M. Norouzi, T. Mikolov, S. Bengio, Y. Singer, J. Shlens, A. Frome, G. S. Corrado, and J. Dean. Zero-shot learning by convex combination of semantic embeddings. arXiv preprint arXiv:1312.5650, 2013.
  • [26] M. Otani, Y. Nakashima, E. Rahtu, J. Heikkilä, and N. Yokoya. Learning joint representations of videos and sentences with web image search. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • [27] M. Palatucci, D. Pomerleau, G. E. Hinton, and T. M. Mitchell. Zero-shot learning with semantic output codes. In Advances in Neural Information Processing Systems (NIPS), 2009.
  • [28] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), 2014.
  • [29] A. Piergiovanni, C. Fan, and M. S. Ryoo. Learning latent sub-events in activity videos using temporal attention filters. In Proceedings of the American Association for Artificial Intelligence (AAAI), 2017.
  • [30] A. Piergiovanni and M. S. Ryoo. Fine-grained activity recognition in baseball videos. In CVPR Workshop on Computer Vision in Sports, 2018.
  • [31] J. Qin, L. Liu, L. Shao, F. Shen, B. Ni, J. Chen, and Y. Wang. Zero-shot action recognition with error-correcting output codes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [32] A. Rohrbach, M. Rohrbach, R. Hu, T. Darrell, and B. Schiele. Grounding of textual phrases in images by reconstruction. In Proceedings of European Conference on Computer Vision (ECCV). Springer, 2016.
  • [33] B. Romera-Paredes and P. Torr. An embarrassingly simple approach to zero-shot learning. In International Conference on Machine Learning (ICML), 2015.
  • [34] G. A. Sigurdsson, G. Varol, X. Wang, A. Farhadi, I. Laptev, and A. Gupta. Hollywood in homes: Crowdsourcing data collection for activity understanding. In Proceedings of European Conference on Computer Vision (ECCV), 2016.
  • [35] R. Socher, M. Ganjoo, C. D. Manning, and A. Ng. Zero-shot learning through cross-modal transfer. In Advances in Neural Information Processing Systems (NIPS), 2013.
  • [36] Y. C. Song, I. Naim, A. Al Mamun, K. Kulkarni, P. Singla, J. Luo, D. Gildea, and H. A. Kautz. Unsupervised alignment of actions in video with text descriptions. In IJCAI, pages 2025–2031, 2016.
  • [37] K. Soomro, A. R. Zamir, and M. Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
  • [38] N. Srivastava and R. R. Salakhutdinov. Multimodal learning with deep boltzmann machines. In Advances in Neural Information Processing Systems (NIPS), 2012.
  • [39] D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri. A closer look at spatiotemporal convolutions for action recognition. arXiv preprint arXiv:1711.11248, 2017.
  • [40] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [41] L. Wang, Y. Li, and S. Lazebnik. Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [42] Y. Xian, Z. Akata, G. Sharma, Q. Nguyen, M. Hein, and B. Schiele. Latent embeddings for zero-shot classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [43] H. Xu, B. Li, V. Ramanishka, L. Sigal, and K. Saenko. Joint event detection and description in continuous video streams. arXiv preprint arXiv:1802.10250, 2018.
  • [44] X. Xu, T. Hospedales, and S. Gong. Semantic embedding space for zero-shot action recognition. In International Conference on Image Processing (ICIP), 2015.
  • [45] X. Xu, T. Hospedales, and S. Gong. Transductive zero-shot action recognition by word-vector embedding. International Journal of Computer Vision (IJCV), 2017.
  • [46] Z. Yi, H. Zhang, P. Tan, and M. Gong. Dualgan: Unsupervised dual learning for image-to-image translation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [47] Z. Zhang and V. Saligrama. Zero-shot learning via semantic similarity embedding. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [48] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. End-to-end dense video captioning with masked transformer. arXiv preprint arXiv:1804.00819, 2018.
  • [49] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

Appendix A MLB-YouTube captioning

As a baseline for the MLB-YouTube captions dataset, we compared several different models for standard video captioning (i.e., all activity classes are seen). This task is quite challenging compared to other datasets as the announcers' commentary is not always a direct description of the current events. Often the announcers tell loosely related stories and attempt to describe events differently each time to avoid repetition. Additionally, the descriptions contain on average 150 words for each 30-second interval, while current captioning approaches are usually only trained and tested on 10-20 word sentences. These factors make the task quite challenging, and the standard evaluation metrics do not account for them. In Table 6, we report our results on this task.

Method                       Bleu   METEOR   CIDEr
Fixed Text Representation    0.12   0.04     0.12
Joint Representation         0.14   0.08     0.15
Joint + all paired           0.15   0.10     0.18
Joint + paired + unpaired    0.10   0.02     0.08
Table 6: Comparison of several models for standard, seen video captioning using the MLB-YouTube dataset, using Bleu, METEOR and CIDEr scores. Higher values are better.
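
To illustrate why loosely related commentary scores poorly under n-gram metrics, the following is a minimal sketch of clipped n-gram precision, the core ingredient of Bleu. It is not the evaluation code used for Table 6; the function names and the example sentences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision against a single reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference.
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    total = sum(cand_counts.values())
    if total == 0:
        return 0.0
    clipped = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / total


# Hypothetical sentences (assumption: not taken from the dataset).
reference = "the pitcher throws a fastball and the batter swings and misses".split()
literal = "the batter swings and misses".split()
story = "he grew up playing in a small town".split()

print(modified_precision(literal, reference, n=1))  # 1.0
print(modified_precision(story, reference, n=1))    # 0.125
```

A caption that describes the event literally overlaps heavily with the reference, while an unrelated anecdote shares almost no n-grams with it, even though both could plausibly be spoken over the same clip.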

Appendix B HMDB and UCF101 Sentences

For the HMDB and UCF101 datasets, we created sentences to describe each activity class. Our sentence descriptions are included in this appendix: the HMDB sentences are listed first, followed by the UCF101 sentences.


  1. chew: a woman is chewing on bread

  2. golf: a man swings a golf club

  3. sword exercise: a person is playing with a sword

  4. walk: a person is walking

  5. jump: a person jumps into the water

  6. pour: a man pours from a bottle

  7. laugh: a man is laughing

  8. shoot gun: a person rapidly fires a gun

  9. run: a person is running

  10. turn: a person turns around

  11. ride bike: a man is riding a bike on the street

  12. swing baseball: a boy hits a baseball

  13. draw sword: a person draws a sword

  14. sit: a person sits in a chair

  15. fencing: two men are fencing

  16. dribble: a boy dribbles a basketball

  17. stand: a person stands up

  18. pushup: a man does pushups

  19. sword: two people are fighting with swords

  20. pullup: a boy does pullups in a doorway

  21. smile: a man smiles

  22. shake hands: two people shake hands

  23. shoot ball: a person shoots a basketball

  24. kick: a person kicks another person

  25. somersault: a person does a somersault

  26. flic flac: a boy does a backflip

  27. hug: two people hug

  28. hit: a boy swings a baseball bat

  29. dive: a person jumps into a lake

  30. drink: a man drinks from a bottle

  31. punch: a woman punches a man

  32. wave: a person waves

  33. talk: a person is talking

  34. kiss: a man and woman kiss

  35. catch: a boy catches a ball

  36. smoking: a woman smokes a cigarette

  37. eat: a man eats pizza

  38. throw: a person throws a ball

  39. climb stairs: a man is running down the stairs

  40. kick ball: a person kicks a soccer ball

  41. ride horse: a girl is riding a horse

  42. fall floor: a man is pushed onto the ground

  43. brush hair: a girl is brushing her hair

  44. situp: a man does situps

  45. cartwheel: a guy runs and jumps and flips

  46. pick: a man picks a book

  47. push: a boy pushes a table

  48. climb: a man is climbing up a wall

  49. handstand: three girls do handstands

  50. clap: a woman claps her hands

  51. shoot bow: a person shoots a bow and arrow


  1. MilitaryParade: people are marching and waving a flag

  2. TrampolineJumping: kids are jumping on a trampoline

  3. PlayingDaf: a person moves a circle and hits it

  4. SalsaSpin: people are dancing and spinning

  5. CuttingInKitchen: a person is in the kitchen using a knife

  6. ApplyEyeMakeup: a woman is putting on makeup

  7. PlayingViolin: a person plays the violin

  8. YoYo: a person plays with a yoyo

  9. PlayingCello: a person is playing the cello

  10. Bowling: a person is bowling

  11. UnevenBars: a woman is spinning and flying on bars

  12. BalanceBeam: a woman is on the balance beam

  13. SkyDiving: people are falling out of the sky

  14. SumoWrestling: two fat people are wrestling

  15. PushUps: a man does pushups

  16. FloorGymnastics: a girl does gymnastics

  17. ApplyLipstick: a woman is putting on lipstick

  18. BreastStroke: a woman is swimming

  19. GolfSwing: a man swings a golf club

  20. PlayingDhol: a person hits on a drum

  21. HorseRiding: a woman rides a horse

  22. PlayingFlute: a person blows into a flute

  23. PizzaTossing: a man is making a pizza

  24. CleanAndJerk: a person is lifting weights

  25. WritingOnBoard: a person is writing on the wall

  26. CricketShot: a person hits a ball with a bat

  27. FieldHockeyPenalty: a girl in the field shoots a ball

  28. HammerThrow: a person spins and throws an object

  29. BodyWeightSquats: a man is squatting

  30. CliffDiving: a person jumps off a cliff

  31. Typing: a person is typing at a computer

  32. MoppingFloor: a man mops the floor

  33. TaiChi: people are doing tai chi

  34. PlayingPiano: a person plays piano

  35. Punch: someone punches another person

  36. Nunchucks: a person swings nun chucks

  37. RopeClimbing: a person climbs a rope

  38. Swing: a baby is swinging

  39. Knitting: a woman is knitting

  40. Rafting: people are rafting on a river

  41. PlayingGuitar: a person strums a guitar

  42. ShavingBeard: a man shaves his beard

  43. JugglingBalls: a person is juggling balls

  44. Diving: a boy dives into a pool

  45. JumpingJack: a person jumps and swings his arms

  46. VolleyBallSpiking: people hit a volleyball

  47. PoleVault: a person runs with a pole and launches into the air

  48. SkateBoarding: a man is skateboarding

  49. BoxingPunchingBag: a man is punching a bag

  50. IceDancing: people are ice skating

  51. WallPushups: a person does pushups against a wall

  52. FrisbeeCatch: a person jumps and catches a frisbee

  53. Drumming: people are drumming

  54. JumpRope: a girl is jumping rope

  55. HeadMassage: a person gets their head massaged

  56. PlayingTabla: a person plays two drums

  57. TableTennisShot: people are playing table tennis

  58. PommelHorse: a person spins around on their hands

  59. HighJump: a man jumps over a bar and lands on his back

  60. BasketballDunk: a man jumps and dunks the basketball

  61. BoxingSpeedBag: a man punches a bag in the air quickly

  62. PullUps: a person hangs on a bar and pulls up

  63. RockClimbingIndoor: a person is climbing up rocks

  64. BlowingCandles: a boy blows out candles on a cake

  65. Skiing: people are skiing on a mountain

  66. WalkingWithDog: a person walks a dog

  67. Basketball: men are playing basketball

  68. SoccerJuggling: a person is playing with a soccer ball

  69. Fencing: people are fencing

  70. Billiards: a man is playing billiards

  71. BaseballPitch: a man throws a baseball

  72. BlowDryHair: a woman is drying her hair

  73. CricketBowling: a person throws a cricket ball

  74. BandMarching: people are walking down the street playing music

  75. PlayingSitar: a person plays a funny guitar

  76. ThrowDiscus: a person spins and throws a disk

  77. StillRings: a man holds in the air on rings

  78. Lunges: a person bends to the ground with one knee

  79. Skijet: a person rides a jetski in the ocean

  80. BabyCrawling: a baby is crawling on the floor

  81. Mixing: a woman is mixing in a bowl

  82. Hammering: a person is hitting nails with a hammer

  83. Shotput: a person spins and launches a ball

  84. Archery: a man shoots a bow and arrow

  85. Surfing: a man is surfing in the ocean

  86. FrontCrawl: a person is swimming freestyle

  87. HulaHoop: a person spins a hoop around their waist

  88. JavelinThrow: a person throws a spear

  89. Rowing: people are in a canoe and rowing

  90. Kayaking: a person is kayaking on a lake

  91. ParallelBars: a man does gymnastics on the parallel bars

  92. HorseRace: horses are racing around a track

  93. HandstandWalking: a person stands on their hands and walks

  94. BrushingTeeth: a boy brushes his teeth

  95. LongJump: a person runs and jumps into a sand pit

  96. Biking: people are riding bikes

  97. HandstandPushups: a person does pushups upside down

  98. BenchPress: a man is lifting weights

  99. Haircut: a person is getting a hair cut

  100. TennisSwing: a woman hits a tennis ball
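
As a rough illustration of how such class sentences can serve as zero-shot labels, the sketch below assigns a query to the class whose sentence is closest under cosine similarity. The bag-of-words `embed_text` function is a stand-in assumption, not the learned shared embedding described in the paper, and the query would in practice come from a video encoder rather than from text. All function names are hypothetical.

```python
import math

# A few (class, sentence) pairs copied from the HMDB list above.
CLASS_SENTENCES = {
    "golf": "a man swings a golf club",
    "ride bike": "a man is riding a bike on the street",
    "ride horse": "a girl is riding a horse",
}


def embed_text(sentence, vocab):
    """Stand-in text encoder (assumption): a bag-of-words count vector."""
    words = sentence.split()
    return [float(words.count(w)) for w in vocab]


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def classify(query_embedding, class_embeddings):
    """Return the class whose sentence embedding is most similar to the query."""
    return max(class_embeddings,
               key=lambda c: cosine(query_embedding, class_embeddings[c]))


vocab = sorted({w for s in CLASS_SENTENCES.values() for w in s.split()})
class_embeddings = {c: embed_text(s, vocab) for c, s in CLASS_SENTENCES.items()}

# Hypothetical query: a text description stands in for a video embedding.
query = embed_text("a girl rides a horse", vocab)
print(classify(query, class_embeddings))  # ride horse
```

Because classes are represented only by their sentences, a new class can be added by writing one more sentence, without any labeled videos for that class.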
