Being Consistent with Humans: Diverse Video Captioning Through Latent Variable Expansion with Conditional GAN

Automatically describing video content with text is a challenging but important task, which has been attracting a lot of attention in the computer vision community. Previous works mainly strive for the accuracy of the generated sentences while ignoring sentence diversity, which is inconsistent with human behavior. In this paper, we aim to caption each video with multiple descriptions and propose a novel framework. Concretely, for a given video, the intermediate latent variables of the conventional encoder-decoder process are utilized as input to a conditional generative adversarial network (CGAN) with the purpose of generating diverse sentences. We adopt different CNNs as our generator, which produces descriptions conditioned on latent variables, and as our discriminator, which assesses the quality of the generated sentences. Simultaneously, a novel DCE metric is designed to assess the diversity of captions. We evaluate our method on benchmark datasets, where it demonstrates its ability to generate diverse descriptions and achieves superior results over other state-of-the-art methods.

Figure 1: Illustration of the proposed method. Our model consists of an encoder-decoder architecture and a conditional GAN. The former aims to learn an effective coding method for video and text, and the latter is designed to generate diverse descriptions conditioned on the latent variables extracted from the encoder-decoder model.

1 Introduction

Video captioning has recently received increased interest and become an important task in computer vision. Most research efforts aim to generate captions for videos, using either template based methods or neural network based models. In template based methods, subjects, verbs and objects are first detected and then filled into a pre-defined template to generate the corresponding sentence. However, the sentences yielded by these methods are rigid and limited in expressiveness, still far from perfect.

Deep learning has developed rapidly, with significant advances in image classification using convolutional neural networks (CNN) [27] and machine translation using recurrent neural networks (RNN) [30]. Benefiting from these achievements, neural network based approaches have become the mainstream way to generate descriptions for videos. They are mostly built upon the encoder-decoder framework and optimized in an end-to-end manner: typically, a CNN is utilized as an encoder to produce the visual representation, and an RNN serves as a decoder to generate the sequence of words.

In view of the effectiveness of the encoder-decoder framework, we follow this elegant recipe in our work. However, previous research primarily focuses on the fidelity of sentences, while another essential property, diversity, is not taken into account. More specifically, these models are mostly trained to select words with maximum probability, which results in a monotonous set of generated sentences that bear high resemblance to the training data.

Toward the goal of being consistent with human behavior, we propose a novel approach built on the encoder-decoder framework and the conditional GAN to alleviate the aforementioned limitation, as depicted in Figure 1. We refer to our proposed method as DCM (Diverse Captioning Model), which consists of two components: (1) An attention-based LSTM model trained with a cross-entropy loss to generate textual descriptions for given videos. Concretely, we employ a CNN and a bi-directional LSTM [10] to encode the video frames into vector representations. A temporal attention mechanism then makes a soft selection over them, after which a hierarchical LSTM generates the descriptions. (2) A conditional GAN, whose input is the latent variables of the attention-based model, trained to generate diverse sentences. Specifically, in the CGAN we get rid of the LSTM and adopt a fully convolutional network as the generator to produce descriptions based on the latent variables; for the discriminator, both the sentences and the video features are used as input to evaluate the quality of the generated sentences. The former component is designed to effectively model video-sentence pairs, and the latter aims at simultaneously considering both the fidelity and diversity of the generated descriptions.

To the best of our knowledge, we are the first to propose a method for generating diverse video descriptions conditioned on the latent variables of the traditional encoder-decoder process via adversarial learning. The contributions of this work can be summarized as follows: (1) An efficient encoder-decoder framework is proposed to describe video content with high accuracy. (2) Our method relies on the conditional GAN to explore the diversity of the generated sentences. (3) We propose a novel performance evaluation metric named Diverse Captioning Evaluation (DCE), which considers not only the differences between sentences but also their rationality (i.e., whether the video content is correctly described). (4) Extensive experiments and ablation studies on the benchmark MSVD and MSR-VTT datasets demonstrate the superiority of our proposed model in comparison with state-of-the-art methods.

2 Related Work

2.1 RNN for Video Captioning

As a crucial challenge for visual content understanding, captioning tasks have attracted much attention for many years. [35] transferred knowledge from image captioning models by adopting a CNN as the encoder and an LSTM as the decoder. [19] used a mean-pooling caption model with joint visual and sentence embedding. To better encode the temporal structure of video, [39] incorporated local C3D features and a global temporal attention mechanism to select the most relevant temporal segments. [18] proposed a hierarchical recurrent video encoder to exploit multiple time-scale abstractions of the temporal information. More recently, a novel encoder-decoder-reconstruction network was proposed by [36] to utilize both the forward and backward flows for video captioning. [1] embedded temporal dynamics in visual features by hierarchically applying the Short Fourier Transform to CNN features.

Multi-sentence description for videos has been explored in various works [23, 26, 40, 28, 24] recently. [23] generated multiple video descriptions by focusing on different levels of detail. [26] temporally segmented the video with action localization and then generated multiple captions for those segments. [40] proposed a hierarchical model containing a sentence generator that produces short sentences and a paragraph generator that captures the inter-sentence dependencies by taking the sentence vector as input. [24] exploited spatial region information and further explored the correspondence between sentences and region-sequences. Different from these methods, which focus on temporal segmentation or spatial region information, MS-RNN [28] modeled the uncertainty observed in the data using latent stochastic variables, and can thereby generate multiple sentences under different random factors. In this paper we also pay attention to latent variables, but we try to generate diverse captions at the video level.

2.2 GAN for Natural Language Processing

Generative adversarial networks (GAN) [9] have become increasingly popular and achieved promising results in generating both continuous and discrete data [41]. A GAN sets up a competition between a generative model and a discriminative model through a minimax game, where the generative model is encouraged to produce highly imitative data and the discriminative model learns to distinguish it from real data. To produce specific data, the conditional GAN (CGAN) was first proposed by [17] to generate MNIST digits conditioned on class labels. Since then it has been widely applied to other fields, such as image synthesis and text generation. For instance, Reed et al. [22] used a CGAN to generate images from text descriptions, and Dai et al. [4] built a conditional sequence generative adversarial net to improve image description.

Motivated by previous research, in this paper we incorporate a CGAN with an encoder-decoder model to describe video content. Moreover, we model the captioning process via policy gradient [31] to overcome the problem that gradients cannot be back-propagated directly, since the text generation procedure is non-differentiable. By feeding the latent variables of the encoder-decoder model into the CGAN, our proposed DCM is able to generate correct and varied descriptions for a given video.

3 The Proposed Method

3.1 Video Captioning with Encoder-Decoder Framework

In this paper, we devise a stacked LSTM-based model for video captioning, as shown in Figure 1. Given a video V with N frames, the extracted visual features and the embedded textual features can be represented as $V = \{v_1, \dots, v_N\}$ and $S = \{s_1, \dots, s_T\}$, where $v_i \in \mathbb{R}^{D_v}$, $s_t \in \mathbb{R}^{D_s}$, and $T$ is the sentence length. Specifically, $D_v$ and $D_s$ are the respective dimensions of the frame-level features and the vocabulary. We use a bi-directional LSTM (Bi-LSTM), which can capture both forward and backward temporal relationships, to encode the extracted visual features. The activation vectors are obtained as:

$\overrightarrow{h}_i = \overrightarrow{\mathrm{LSTM}}(v_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \overleftarrow{\mathrm{LSTM}}(v_i, \overleftarrow{h}_{i+1}), \qquad h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i] \quad (1)$
where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are the forward and backward hidden activation vectors. Inspired by the recent success of attention mechanisms in various tasks, a visual attention mechanism is incorporated following the Bi-LSTM. To avoid imposing visual attention on non-visual words [29], we append a blank feature whose values are all zeros to the encoded video features. Accordingly, the output context vector at time step t can be represented as:

$c_t = \sum_{i=1}^{N+1} \alpha_i^t h_i \quad (2)$
In (2), $h_{N+1}$ is the blank feature, and $\alpha_i^t$ is the attention weight, which can be computed as:

$e_i^t = w^\top \tanh(W_a h_{t-1}^{(2)} + U_a h_i + b_a) \quad (3)$

$\alpha_i^t = \exp(e_i^t) \big/ \textstyle\sum_{j=1}^{N+1} \exp(e_j^t) \quad (4)$
where $w$, $W_a$, $U_a$ and $b_a$ are the learned parameters, and $h_{t-1}^{(2)}$ is the hidden state of the decoder LSTM (LSTM2) at the $(t{-}1)$-th time step.
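As an illustration, the attention step above can be sketched in a few lines of NumPy; the weight names (`Wa`, `Ua`, `w`, `ba`) are placeholder stand-ins for the learned parameters, and the appended zero row plays the role of the blank feature:

```python
import numpy as np

def temporal_attention(H, h_dec, Wa, Ua, w, ba):
    """Soft attention over encoded frame features.

    H: (N, d) Bi-LSTM outputs; a zero "blank" row is appended so the
    decoder can attend to nothing when emitting non-visual words.
    h_dec: (d,) previous decoder hidden state.
    Returns the context vector and the attention weights.
    """
    H = np.vstack([H, np.zeros((1, H.shape[1]))])   # append blank feature
    e = np.tanh(h_dec @ Wa + H @ Ua + ba) @ w       # (N+1,) relevance scores
    a = np.exp(e - e.max())
    a /= a.sum()                                    # softmax attention weights
    c = a @ H                                       # context vector (d,)
    return c, a

rng = np.random.default_rng(0)
N, d = 4, 8
H = rng.standard_normal((N, d))
h_dec = rng.standard_normal(d)
Wa = rng.standard_normal((d, d))
Ua = rng.standard_normal((d, d))
w = rng.standard_normal(d)
ba = np.zeros(d)
c, a = temporal_attention(H, h_dec, Wa, Ua, w, ba)
```

The weights form a proper distribution over the N frames plus the blank slot, so attention mass assigned to the blank row effectively suppresses visual input for that word.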

As shown in Figure 1, our decoder integrates two LSTMs. The bottom LSTM layer (LSTM1) is used to efficiently encode the previous words, and the top LSTM (LSTM2) generates the next word based on the concatenation of the visual information $c_t$ and the textual information $h_t^{(1)}$ produced by LSTM1. According to the above analysis, at time step $t$, our model utilizes V and the previous words $s_{<t}$ to predict a word with the maximal probability $P(s_t \mid V, s_{<t})$, until we reach the end of the sentence. Thus, the loss function of our encoder-decoder model can be defined as:

$\mathcal{L}_{E}(\theta_E) = -\sum_{t=1}^{T} \log P(s_t \mid V, s_{<t}; \theta_E) \quad (5)$

where $\theta_E$ is the model parameter set.
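The decoder's training objective is a standard per-step negative log-likelihood; a minimal sketch, assuming the per-step word distributions have already been computed by the decoder:

```python
import numpy as np

def caption_nll(word_probs, target_ids):
    """Cross-entropy loss of the decoder: the summed negative
    log-probability of the ground-truth word at every time step.

    word_probs: (T, V) array of per-step distributions P(s_t | V, s_<t).
    target_ids: length-T list of ground-truth word indices.
    """
    return -sum(np.log(word_probs[t, w]) for t, w in enumerate(target_ids))

# toy example: T = 2 steps, vocabulary of 3 words
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
loss = caption_nll(probs, [0, 1])   # -log(0.7) - log(0.8)
```

Minimizing this quantity over the training pairs is what drives the maximum-probability decoding behavior discussed in the introduction.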

3.2 Diverse Captioning with Conditional GAN

As introduced above, we use the conditional GAN to implement diverse captioning for video. In our framework, multiple CNNs are adopted as the generator G to generate descriptions, while the discriminator D assesses the quality of the generated sentences. They are optimized in a minimax two-player game until they reach an equilibrium.

Concretely, our generator strives to produce high-quality sentences conditioned on a random vector $z$ sampled from the normal distribution $\mathcal{N}(0,1)$ and the latent variables $L$ obtained from the encoder-decoder model, where $L = \{l_1, \dots, l_T\}$ and $l_t$ consists of $c_t$ and $h_t^{(1)}$; here $h_t^{(1)}$ is calculated in the same way as in the encoder-decoder model, except that the output of the generator is used as input. At each step $t$, suppose $x_t$ is the concatenation of $z$ and $l_t$, and $X_t$ is the concatenation of $\{x_1, \dots, x_t\}$. Then, a CNN with kernel size $k$ takes $X_t$ as input and produces a context vector $g_t$ that contains the information of all the previously generated words. Afterwards, an MLP layer is utilized to encode $g_t$ and yield a conditional distribution over the vocabulary. We choose the CNN because it tends to produce less-peaky word probability distributions, giving it the potential to explore sentence diversity [2]. Overall, the random vector $z$ allows the generator to generate diverse descriptions, and the latent variables $L$ guarantee the fidelity of the generated sentences. Specifically, when the discriminator is fixed, the generator can be optimized by minimizing the following formulation:

$\mathcal{L}_{G}(\theta_G) = \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left[ \log\big(1 - D(V, G(z, L))\big) \right] \quad (6)$

here, $\theta_G$ represents the parameter set of generator G. Considering both the accuracy and diversity of the generated sentences, we balance the generator with an extra cross-entropy loss, which also prevents our conditional GAN from deviating from its correct orbit during training. Therefore, we minimize the following objective function when updating G:

$\mathcal{L}(\theta_G) = \lambda \mathcal{L}_{G}(\theta_G) + (1 - \lambda) \mathcal{L}_{E}(\theta_G) \quad (7)$

where $\lambda$ is the tradeoff parameter.
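The key structural property of the convolutional generator is that the context at step $t$ must depend only on already-generated steps. The following causal 1-D convolution illustrates that property; it is an illustrative sketch with made-up shapes, not the paper's exact generator:

```python
import numpy as np

def causal_context(X, W):
    """1-D causal convolution: the output at step t depends only on
    inputs up to step t.

    X: (T, d_in) stacked per-step input vectors (e.g., [z; l_t]).
    W: (k, d_in, d_out) convolution kernel.
    Left-pads with zeros so no position ever sees the future.
    """
    k = W.shape[0]
    Xp = np.vstack([np.zeros((k - 1, X.shape[1])), X])
    T, d_out = X.shape[0], W.shape[2]
    G = np.zeros((T, d_out))
    for t in range(T):
        window = Xp[t:t + k]                      # inputs t-k+1 .. t
        G[t] = np.einsum('kd,kdo->o', window, W)  # one context vector
    return G

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))
W = rng.standard_normal((2, 3, 4))
G1 = causal_context(X, W)
X2 = X.copy()
X2[-1] += 1.0                 # perturb only the last input step
G2 = causal_context(X2, W)
```

Perturbing the last input changes only the last context vector, which is exactly the causality a sequence generator needs.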

In the discriminator D, our goal is to judge whether a given video-text pair is matched or not and how well a sentence describes a given video. Inspired by the superior performance of convolutional neural networks in text classification, we choose a CNN as our discriminator. Suppose the embedded matrix of the generated sentence is represented as $\varepsilon \in \mathbb{R}^{d \times T}$, obtained by concatenating the word embeddings as columns, where $d$ is the dimension of the word embedding. Then, a convolutional operation with a kernel $w \in \mathbb{R}^{d \times k}$ is used to encode the sentence and produce a feature map:

$f_i = \rho\big(w * \varepsilon_{i:i+k-1} + b\big) \quad (8)$

here, $*$ is the convolution operator, $b$ is the bias term and $\rho$ is the ReLU non-linear function. A max-over-time pooling operation is then applied over the generated feature maps, $\hat{f} = \max\{f_1, \dots, f_{T-k+1}\}$. Moreover, we use different kernels to extract different features of the sentence and combine them as our final sentence representation $\tilde{f}$.
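The discriminator's sentence encoder follows the standard convolution-plus-max-over-time recipe for text; a minimal NumPy sketch (all kernel shapes and sizes are illustrative):

```python
import numpy as np

def sentence_features(E, kernels):
    """Text-CNN sentence encoding: convolution + ReLU + max-over-time
    for each kernel width, then concatenation.

    E: (T, d) word-embedding matrix of one sentence.
    kernels: list of (k, d, n_filters) weight tensors of varying width k.
    """
    feats = []
    for W in kernels:
        k, _, n = W.shape
        # slide the width-k kernel over the sentence: (T-k+1, n) responses
        conv = np.stack([np.einsum('kd,kdn->n', E[i:i + k], W)
                         for i in range(E.shape[0] - k + 1)])
        feats.append(np.maximum(conv, 0.0).max(axis=0))  # max over time
    return np.concatenate(feats)

rng = np.random.default_rng(2)
E = rng.standard_normal((6, 5))                       # 6 words, dim 5
kernels = [rng.standard_normal((2, 5, 4)),            # bigram filters
           rng.standard_normal((3, 5, 7))]            # trigram filters
rep = sentence_features(E, kernels)
```

Max-over-time pooling makes the representation length-invariant, so sentences of different lengths map to vectors of the same size before being concatenated with the video feature.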

Once the above operations are completed, we concatenate the sentence representation $\tilde{f}$ with its corresponding video feature $v$ extracted from the hidden layer of the Bi-LSTM. Then, a fully connected layer is utilized to map this concatenated feature to a low-dimensional space, and a softmax layer is incorporated to output the probability that indicates the quality of the sentence. Formally, this process can be described as:

$p = \sigma\big(W_p [\tilde{f}; v] + b_p\big) \quad (9)$
where $W_p$ and $b_p$ are parameters to be learned, and $p$ is projected onto the range of $[0, 1]$. An output close to 1 indicates a higher probability that the sentence is drawn from the real data distribution. The optimization target of D is to maximize the probability of correctly distinguishing the ground truth from the generated sentences. For G fixed, D can be optimized as:

$\max_{\theta_D} \ \mathbb{E}_{(V, S) \sim \mathcal{S}} \left[ \log D(V, S) \right] + \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left[ \log\big(1 - D(V, G(z, L))\big) \right] \quad (10)$
where $\theta_D$ is the parameter set of discriminator D, and $(V, S)$ represents a video-sentence pair in the training set $\mathcal{S}$.

Unlike a typical GAN setting in which the generator receives rewards at each intermediate step, our captioning model only receives the reward at the end (the reward signal is meaningful only for the completed sentence), which may lead to several difficulties in training, such as vanishing gradients and error propagation. To mitigate the lack of intermediate rewards, Monte Carlo rollouts [16] are employed to provide early feedback. Specifically, following [31], when there is no intermediate reward, the objective of the generator (when optimizing the adversarial loss) is to generate a sequence from the start state to maximize its expected end reward:

$J(\theta_G) = \mathbb{E}\left[ R_T \mid s_0, \theta_G \right] = \sum_{y_1 \in \mathcal{Y}} G_{\theta_G}(y_1 \mid s_0)\, Q_{D}^{G_{\theta_G}}(s_0, y_1) \quad (11)$
where $s_0$ is the start state, $R_T$ is the reward for a complete sequence, $\mathcal{Y}$ is the vocabulary, $G_{\theta_G}$ is the generator policy which influences the action that generates the next word, and $Q_{D}^{G_{\theta_G}}(s, a)$ indicates the action-value function of a sequence (i.e., the expected accumulative reward starting from state $s$, taking action $a$, and then following policy $G_{\theta_G}$). In our experiments, we use the estimated probability of being real by the discriminator as the reward. Thus, we have:

$Q_{D}^{G_{\theta_G}}(s = Y_{1:T-1},\, a = y_T) = D(V, Y_{1:T}) \quad (12)$
Model              | MSVD: Feat.  M    B@4  C    R    | MSR-VTT: Feat.  M    B@4  C    R
S2VT [34]          | V+O     29.8  -    -    -    | -        -    -    -    -
h-RNN [40]         | V+C     32.6  49.9 65.8 -    | -        -    -    -    -
HRNE [18]          | G+C     33.9  46.7 -    -    | -        -    -    -    -
aLSTMs [8]         | I       33.3  50.8 74.8 -    | I        26.1 38.0 43.2 -
hLSTMat [29]       | R       33.6  53.0 73.8 -    | R        26.3 38.3 -    -
RecNet [36]        | Iv4     34.1  52.3 80.3 69.8 | Iv4      26.6 39.1 42.7 59.3
DVC [24]           | -       -     -    -    -    | R+C+A    28.3 41.4 48.9 62.6
GRU-EVE [1]        | IRv2+C  35.0  47.9 78.1 71.5 | IRv2+C   28.4 38.3 48.1 60.7
Aalto [25]         | -       -     -    -    -    | G+C+Ca   26.9 39.8 45.7 59.8
v2t-navigator [12] | -       -     -    -    -    | V+C+A+Ca 28.2 40.8 44.8 60.9
Ours-ED (S)        | I       35.1  53.1 82.0 71.0 | I        26.8 39.1 43.8 59.1
Ours-ED (M)        | I+C     35.6  53.3 83.1 71.2 | I+C+A    28.7 43.4 47.2 61.6
DCM-Best1 (M)      | I+C     35.2  52.8 75.1 71.0 | I+C+A    34.2 43.8 47.6 65.8
Table 1: Captioning performance comparison on MSVD (left) and MSR-VTT (right) in terms of METEOR (M), BLEU-4 (B@4), CIDEr (C) and ROUGE-L (R). V, O, G, C, R, I, Iv4 and IRv2 denote VGGNet, Optical Flow, GoogLeNet, C3D, ResNet, Inception-v3, Inception-v4 and InceptionResNet-V2, respectively; Ca and A denote category and audio information. S and M denote single feature and multiple features. The symbol "-" indicates the metric is unreported.

Here, $V$ indicates the corresponding visual feature. To evaluate the action value for an intermediate state, Monte Carlo rollouts are utilized to sample the unknown last tokens. To reduce the variance, we run the rollout policy starting from the current state till the end of the sequence K times. Thus, we have:

$Q_{D}^{G_{\theta_G}}(s = Y_{1:t-1},\, a = y_t) = \begin{cases} \frac{1}{K} \sum_{k=1}^{K} D(V, Y_{1:T}^{k}), \ \ Y_{1:T}^{k} \in \mathrm{MC}^{G_{\theta_G}}(Y_{1:t}; K) & t < T \\ D(V, Y_{1:t}) & t = T \end{cases} \quad (13)$
where $Y_{1:T}^{k}$ is sampled based on the rollout policy and the current state. In summary, Algorithm 1 (see supplementary material) shows full details of the proposed DCM. Aiming at reducing the instability of the training process, we pre-train the aforementioned encoder-decoder model and discriminator to get a warm start. When pre-training our discriminator, the positive examples are from the given dataset, whereas the negative examples consist of two parts. One part is generated from our generator, and the other part is manually configured. Concretely, mismatched video-sentence pairs are utilized as one kind of negative example (modeling the inter-sentence relationship). Meanwhile, with the purpose of evaluating sentences more accurately, we exchange the word positions of the sentences in the positive examples and regard them as negative examples (modeling the intra-sentence relationship). The objective function of D for pre-training can be formalized into a cross-entropy loss as follows:

$\mathcal{L}_{D} = -\frac{1}{M} \sum_{i=1}^{M} \left[ y_i \log p_i + (1 - y_i) \log(1 - p_i) \right] \quad (14)$
where $y_i$ and $p_i$ denote the real label and the predicted value of the discriminator respectively, and $M$ is the number of examples in a batch. It is worth noting that during testing, when we use LSTM2 to generate high-precision captions, the input of LSTM1 at each step is the previously generated word of LSTM2; if we want to generate diverse descriptions, the input of LSTM1 is the previously generated word of the generator.
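The Monte Carlo rollout estimate of the action value can be sketched as follows; the toy discriminator and rollout policy below are stand-ins for the learned networks, and the visual conditioning is omitted for brevity:

```python
import random

def q_value(prefix, T, rollout_policy, discriminator, K=3, seed=0):
    """Monte Carlo estimate of the action value of a partial sentence:
    finish the prefix K times with the rollout policy and average the
    discriminator's scores; a complete sentence is scored directly."""
    if len(prefix) == T:
        return discriminator(prefix)
    rng = random.Random(seed)
    total = 0.0
    for _ in range(K):
        y = list(prefix)
        while len(y) < T:                 # complete the sentence
            y.append(rollout_policy(y, rng))
        total += discriminator(y)         # reward only at the end
    return total / K

# toy setting: binary vocabulary; the "discriminator" scores the
# fraction of 1-tokens, and the rollout policy samples uniformly
disc = lambda y: sum(y) / len(y)
policy = lambda y, rng: rng.choice([0, 1])
q_mid = q_value([1, 1], T=4, rollout_policy=policy, discriminator=disc, K=8)
q_end = q_value([1, 1, 0, 0], T=4, rollout_policy=policy, discriminator=disc)
```

Averaging over K completions reduces the variance of the intermediate reward, which is what makes the policy-gradient update usable before the sentence is finished.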

4 Experiments

4.1 Datasets

Two public benchmark datasets, MSVD [3] and MSR-VTT [38], are employed to evaluate the proposed diverse captioning model. The MSVD dataset contains 1,970 video clips collected from YouTube, covering a wide range of topics, which makes it well-suited for training and evaluating video captioning models. We adopt the same data split as [34], with 1,200 videos for training, 100 for validation and 670 for testing. As for MSR-VTT, it contains 10K video clips, and 20 human-annotated reference sentences are provided for each clip. We follow the public split: 6,513 videos for training, 497 for validation, and 2,990 for testing.

4.2 Experimental Settings

We uniformly sample 60 frames from each clip and use Inception-v3 [32] to extract frame-level features. To capture temporal information, the C3D network [13] is utilized to extract the dynamic features of the video; these are then encoded by an LSTM whose final output is concatenated with all the frame-level features. We convert all sentences to lower case, remove punctuation characters and tokenize the sentences. We retain all the words in each dataset and thus obtain a vocabulary of 13,375 words for MSVD and 29,040 words for MSR-VTT. To evaluate the performance of our model, we utilize METEOR [6], BLEU [20], CIDEr [33] and ROUGE [15], which are commonly used for performance evaluation of video captioning methods.

In our experiments, the LSTM unit size and word embedding size are empirically set to 512, and a small initial learning rate is used to avoid gradient explosion. We train our model with mini-batches of 64 using the ADAM optimizer [14], and the sentence length T is set to 25. For sentences with fewer than 25 words, we pad the remaining inputs with zeros. To regularize training and avoid overfitting, we apply dropout with a rate of 0.5 on the outputs of the LSTMs. During testing, beam search with a beam width of 5 is used to generate descriptions.

Figure 2: Examples of video captioning results. The results of MS-RNN and DVC are reported in [28] and [24], respectively.


4.3 Comparison with the State-of-the-Art

Quantitative Analysis

Table 1 compares our proposed method with several state-of-the-art models, including encoder-decoder based architectures (S2VT [34], DVC [24], GRU-EVE [1]) and attention based methods (h-RNN [40], HRNE [18], aLSTMs [8], hLSTMat [29], RecNet [36]). On the MSVD dataset, the results of Ours-ED indicate that our method outperforms previous video captioning models. In particular, compared with the best single-feature counterpart (i.e., RecNet), Ours-ED (S) achieves better performance, with gains of 1.0, 0.8, 1.7 and 0.2 on METEOR, BLEU-4, CIDEr and ROUGE respectively. Compared with the models using multiple features, Ours-ED (M) also performs best on most metrics, verifying the superiority of our proposed approach.

For the MSR-VTT dataset, we also compare our models with the top-2 results from the MSR-VTT challenge in Table 1, namely v2t-navigator [12] and Aalto [25], which are both based on features from multiple cues such as action and audio features. For a fair comparison, we integrate the audio features extracted from the pre-trained VGGish model [11] with the other video features. Note that v2t-navigator uses category information, and DVC adopts data augmentation during training to obtain better accuracy. Nevertheless, Ours-ED still achieves the best performance on the METEOR and BLEU-4 metrics, demonstrating the effectiveness of our proposed method.

To compare our DCM with single-caption models, we also report the accuracy of its best-1 caption. For each video, we calculate the METEOR scores of its generated captions and select the one with the highest score as the best-1 caption. From Table 1 we can see that our DCM-Best1 (M) outperforms most previous methods and is even superior to Ours-ED (M) on MSR-VTT. This is encouraging, since convolutional architectures are generally worse than LSTM architectures in terms of accuracy [2]; it shows the great potential of our DCM in improving sentence accuracy.

Figure 3: Examples of video captioning achieved by our model.


Model       | MSVD: DCE  DCE-Var.  Word Types  mBLEU  Self-CIDEr | MSR-VTT: DCE  DCE-Var.  Word Types  mBLEU  Self-CIDEr
MS-RNN [28] | 0.09  0.12  239   0.91  0.26 | 0.13  0.21  620   0.81  0.48
DCM         | 0.25  0.37  2210  0.64  0.79 | 0.21  0.39  6492  0.68  0.81
DCM-5T      | 0.25  0.36  2125  0.69  0.76 | 0.20  0.37  5100  0.75  0.76
Table 2: Diversity evaluation on MSVD (left) and MSR-VTT (right) using multiple metrics (mBLEU denotes mixed mBLEU).

Qualitative Analysis

In our experiments, a beam search of width 5 is used to generate descriptions, so we obtain 5 sentences for each input z. We run the testing 15 times and remove duplicate sentences. For each video, if the number of descriptions exceeds 30, we only keep the first 30 as the final descriptions. To gain an intuition of the improvement in diverse description generation by our DCM, Figure 2 presents some video examples with the descriptions from MS-RNN [28] and DVC [24] as comparisons to our system. We can see that our DCM performs better than MS-RNN and DVC in generating diverse sentences. Meanwhile, compared with these models, DCM generates more accurate and detailed descriptions across diverse video topics. For example, in the last video, DVC simply produces "a man is hitting a ping pong ball in a stadium", while our model generates sentences like "two men compete in a game of table tennis by vigorously volleying the ping pong ball over the net", which shows the superiority of our method. At the same time, our model correctly describes "two players" or "two men". Conditioned on latent variables, our DCM can generate diverse descriptions while maintaining semantic relevance. In general, compared with these two models, the sentences generated by DCM are superior in terms of both video content description and sentence style.

To better understand our proposed method, in Figure 3 we show some captioning examples generated by our model; more results are provided in the supplementary material. From the generated results we can see that both Ours-ED and DCM are able to capture the core video subjects. Interestingly, in addition to describing video content from different aspects, DCM can generate sentences in different voices for the same video. For example, for the first video, DCM generates "a man is being interviewed on a news show" and "a man is giving a news interview". The former describes the video content in the passive voice, while the latter uses the active voice, which is a notable improvement over previous works [40, 24, 28] that only produce a single voice.

4.4 Diversity Evaluation

To evaluate caption diversity quantitatively, [24] pays attention to its opposite, the similarity of captions. It uses latent semantic analysis (LSA) [5] to represent a sentence by first generating its bag-of-words representation and then mapping it to the LSA space. However, this method suffers from important limitations. On one hand, it ignores the rationality of sentences: two very different sentences can get a high diversity score under this method even though their descriptions may be wrong, whereas the generated sentences should be diverse on the premise of correctly describing the video content. On the other hand, the LSA method cannot capture polysemy and ignores the order of words in sentences, which limits its representation of sentence structure. Therefore, we propose in this subsection a diverse captioning evaluation (DCE) metric. It is designed from two perspectives: the difference between sentences and the reasonableness of sentences. For the difference between sentences, unlike [24], we use the Jaccard similarity coefficient [21], which is effective in dealing with discrete data, to model the word-level similarity, and use BERT [7], which shows superior performance on various NLP tasks and addresses the limitations of the bag-of-words model, to generate the sentence-level representation. Meanwhile, we use METEOR as an assessment of sentence reasonableness, since it has shown better correlation with human judgment than other metrics. Formally, the DCE can be calculated as:

$\mathrm{DCE} = \frac{1}{N_v} \sum_{n=1}^{N_v} \frac{\beta}{|S|} \sum_{(i,j) \in S} m_i \left( 1 - \frac{J(s_i, s_j) + \cos(b_i, b_j)}{2} \right) \quad (15)$

where $S$ is the set of sentence pairs with cardinality $|S|$ (i.e., if each video is accompanied by 5 generated sentences, $|S|$ will be 10, because each sentence is used to calculate the similarities with the others), $N_v$ is the number of videos in the dataset, $\beta$ is the adjustment coefficient (set empirically in our experiments), $m_i$ is the METEOR score of candidate sentence $s_i$, $b_i$ is the sentence vector encoded by BERT, and $J(\cdot,\cdot)$ and $\cos(\cdot,\cdot)$ denote the Jaccard similarity and cosine similarity.
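To make the computation concrete, here is a toy sketch of the per-video DCE-style term; `vectors` stands in for BERT sentence embeddings, `meteor` for precomputed METEOR scores, and the exact normalization and adjustment coefficient of the actual DCE are assumptions:

```python
import itertools
import math

def jaccard(a, b):
    """Word-level Jaccard similarity of two sentences."""
    A, B = set(a.split()), set(b.split())
    return len(A & B) / len(A | B)

def cosine(u, v):
    """Cosine similarity of two sentence vectors."""
    num = sum(x * y for x, y in zip(u, v))
    den = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return num / den

def dce_video(sentences, vectors, meteor, beta=1.0):
    """Sketch of a per-video DCE-style score: average, over sentence
    pairs, the candidate's METEOR score weighted by one minus the mean
    of word-level Jaccard and sentence-level cosine similarity."""
    pairs = list(itertools.combinations(range(len(sentences)), 2))
    score = 0.0
    for i, j in pairs:
        sim = 0.5 * (jaccard(sentences[i], sentences[j])
                     + cosine(vectors[i], vectors[j]))
        score += beta * meteor[i] * (1.0 - sim)
    return score / len(pairs)

# identical captions score zero diversity; distinct captions score higher
same = dce_video(["a man sings", "a man sings"],
                 [[1.0, 0.0], [1.0, 0.0]], [0.5, 0.5])
diff = dce_video(["a man sings", "two dogs run"],
                 [[1.0, 0.0], [0.0, 1.0]], [0.5, 0.5])
```

Note how the METEOR factor caps the score of an inaccurate caption set: dissimilar but wrong sentences contribute little, which is the behavior the LSA-based metric lacks.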

We evaluate the diversity of our generated captions on the test sets of the two benchmark datasets. In addition to DCE, we also use the following methods: (1) DCE-Variant. The proposed DCE metric combines both accuracy and diversity; to interpret the role of each part more clearly, we evaluate the captions with a variant of DCE that removes the METEOR score, so we can see how much of the score comes from sentence diversity alone. (2) Word Types. We count the number of word types in the generated sentences, which indirectly reflects whether the generated sentences are diverse; in general, more word types indicate higher diversity. (3) Mixed mBLEU. Given a set of captions, a BLEU score is computed for each caption against the rest; the mean of these BLEU scores is the mBLEU score. Following [37], we consider the mixed mBLEU score, the mean of mBLEU-n over the n-gram orders n. A lower mixed mBLEU value indicates higher diversity. (4) Self-CIDEr. To measure the diversity of image captions, Self-CIDEr [37] is designed to consider phrases and sentence structures. It shows better correlation with human judgment than the LSA-based metric; higher values indicate greater diversity.

Model DCE-Variant DCE Human Evaluation
Ours-ED 0.30 0.19 3.14
DCM 0.39 0.26 3.92
GT 0.53 1.05 5.96
Table 3: Correlation to human evaluation.
Model LSA-based Method DCE
DVC [24] 0.48 0.19
DCM 0.53 0.21
Table 4: Diversity comparison with DVC [24] on MSR-VTT.
Figure 4: Effect of the tradeoff parameter $\lambda$.
Figure 5: Effect of the rollout times $K$.

The results are shown in Table 2, from which we observe that our DCM outperforms MS-RNN on all the evaluation metrics, showing the superiority of our proposed method. It also demonstrates the effectiveness of DCE, which behaves consistently with the other metrics (e.g., Self-CIDEr). Note that MS-RNN relies on latent stochastic variables to produce different descriptions and only runs the testing 5 times; for a fair comparison, we also conduct the same number of testing runs and refer to this model as DCM-5T. Experimental results show that our DCM-5T is also better than MS-RNN. We observe that the value of our DCE is lower than DCE-Variant. This is because DCE considers sentence reasonableness, which is vital for video captioning: it penalizes inaccurate descriptions. To further verify the rationality of DCE, we also conduct a human evaluation by inviting volunteers to assess the captions obtained by Ours-ED, DCM and the ground truth on the test set of MSR-VTT. Note that Ours-ED can also generate multiple descriptions using beam search. However, with a beam width of n, Ours-ED can only generate up to n different sentences, while DCM can generate many more, even several times as many. We randomly select 100 videos and sample 5 sentences from each model per video, since Ours-ED produces at most 5 different sentences. Each video is evaluated by 5 volunteers, and the average score is reported. The given score (from 1 to 7) reflects the diversity of the set of captions: the higher the score, the better the diversity. From Table 3 we can see that the DCE evaluation is consistent with the human evaluation, showing the advantage of DCE and proving that DCM is better than Ours-ED in diversity. Besides, compared with DCE-Variant, we notice that the gap between DCM and the ground truth is larger under DCE evaluation, which indicates that DCE takes both accuracy and diversity into account. Our DCE can be further improved if a better accuracy assessment method becomes available.

Table 4 shows the diversity comparison between our DCM and our reimplementation of DVC [24] on the test set of MSR-VTT, using both the LSA-based method of [24] and DCE. We observe that our DCM outperforms DVC on both metrics (the LSA score reported in [24] is 0.501, which we also surpass). In addition, observing that some of the expressions generated by our model do not even appear in the reference sentences, we believe this reflects the potential of our model to enrich the annotated sentences in the database, which is of great significance.

4.5 Ablation Studies

(1) Effect of $\lambda$. Considering both the diversity and fidelity of the generated sentences, and to speed up convergence, we study the performance variance with different values of the tradeoff parameter $\lambda$, tuning it from 0.1 to 0.9 on the MSR-VTT dataset. The results are reported in Figure 4, from which we pick the value of $\lambda$ that achieves the highest DCE score and keep it fixed in the following experiments.

(2) Effect of Rollout Times $K$. Obviously, the larger $K$ is, the more time it costs; a suitable $K$ should improve the performance of the model without costing too much time. We increase the rollout times $K$ from 1 to 9 at intervals of 2 on MSR-VTT. Here we randomly select 10 sentences for each video and evaluate the diversity. As shown in Figure 5, we adopt the value of $K$ at which our model achieves the best performance on DCE.

5 Conclusion

In this work, a diverse captioning model DCM is proposed for video captioning, which simultaneously considers the fidelity and the diversity of descriptions. We obtain the latent variables that contain rich visual information and textual information from the well-designed encoder-decoder model, and then utilize them as input to a conditional GAN with the motivation to generate diverse sentences. Moreover, we develop a new evaluation metric named DCE to assess the diversity of a caption set. The potential of our method to generate diverse captions is demonstrated experimentally, through an elaborate experimental study involving two benchmark datasets MSVD and MSR-VTT.

Appendix A Supplementary Material

A.1 Algorithm 1

Require: encoder-decoder model E; generator G; discriminator D; training data set S
1  Initialize E, G and D with random weights;
2  Pre-train E on S by Eq. (5);
3  Generate negative samples in the three ways described in Section 3.2 (Diverse captioning with conditional GAN);
4  Pre-train D by minimizing the cross entropy;
5  repeat
6      for g-steps do
7          Generate a sequence using G;
8          for each time step t do
9              Compute the reward by Eq. (13);
10         end for
11         Update the generator parameters by Eq. (7);
12     end for
13     for d-steps do
14         Use the current G to generate negative examples and combine them with the given positive examples from S;
15         Train the discriminator D for several epochs by Eq. (10);
16     end for
17 until DCM converges;
Algorithm 1: Training process of DCM
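As a purely illustrative sketch (not the paper's implementation), the alternating adversarial loop of Algorithm 1 can be mimicked with a toy generator and discriminator. Every name and the update rule below are hypothetical simplifications: the generator emits binary tokens, the "real" data are all-ones sequences, and the policy-gradient update of Eq. (7) is replaced by a crude nudge on a single weight.

```python
import random

class ToyGenerator:
    """Toy stand-in for G: emits binary tokens with P(token = 1) = self.w."""
    def __init__(self, w=0.5):
        self.w = w

    def sample(self, length=6):
        return [1 if random.random() < self.w else 0 for _ in range(length)]

    def update(self, reward):
        # crude stand-in for the policy-gradient update of Eq. (7):
        # nudge w up when sampled sequences score above 0.5, down otherwise
        self.w = min(1.0, max(0.0, self.w + 0.1 * (reward - 0.5)))

def toy_discriminator(seq):
    """Toy stand-in for D: 'real' sequences are all ones, so the score
    is simply the fraction of 1-tokens in the sequence."""
    return sum(seq) / len(seq)

def g_steps(G, steps=50):
    """Schematic g-step loop of Algorithm 1: sample a sequence, score it
    with D, update G. The d-steps are omitted here: D is kept fixed."""
    for _ in range(steps):
        G.update(toy_discriminator(G.sample()))
    return G.w
```

Running `g_steps` pushes the generator toward sequences the discriminator scores highly, which is the essence of the g-step/d-step alternation, minus the sequence-level rollout and the real discriminator training.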

A.2 More Ablation Studies

Model | Mixed mBLEU | Self-CIDEr | DCE-Variant | DCE
Single LSTM Layer | 0.831 | 0.665 | 0.297 | 0.188
Single CNN Layer | 0.772 | 0.744 | 0.343 | 0.192
Single LSTM Layer + Single CNN Layer | 0.830 | 0.673 | 0.304 | 0.194
Single LSTM Layer + Multiple CNN Layers | 0.792 | 0.709 | 0.336 | 0.202
Multiple CNN Layers | 0.687 | 0.803 | 0.387 | 0.210
Table 5: Comparison of different architectures.
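To make the similarity-based diversity metrics concrete, a mutual-BLEU ("mBLEU") score can be sketched as follows: each caption is scored against the rest of the set, so lower values indicate a more diverse set. This unigram-only variant, without higher-order n-grams or a brevity penalty, is our own simplification, not the exact Mixed mBLEU implementation.

```python
from collections import Counter

def bleu1(cand, refs):
    """Clipped unigram precision of a candidate against a reference set
    (a deliberate simplification of BLEU: unigrams only, no brevity penalty)."""
    cand_counts = Counter(cand.split())
    max_ref = Counter()
    for ref in refs:
        for word, n in Counter(ref.split()).items():
            max_ref[word] = max(max_ref[word], n)
    clipped = sum(min(n, max_ref[word]) for word, n in cand_counts.items())
    return clipped / max(1, sum(cand_counts.values()))

def mbleu(captions):
    """Mutual BLEU over a caption set: each caption is scored against the
    others and the scores are averaged. Lower values mean a more diverse set."""
    return sum(bleu1(c, captions[:i] + captions[i + 1:])
               for i, c in enumerate(captions)) / len(captions)
```

A set of identical captions scores 1.0 (no diversity), while captions with disjoint vocabularies score 0.0.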
Figure 6: Different architectures of generator.


Figure 7: Examples of video captioning achieved by our model.


Effect of Different Architectures of Generator

To design a better diverse captioning model, we also conduct ablation studies on different generator architectures, as shown in Figure 6. The details are as follows:

(1) Single LSTM Layer. Here a single LSTM is utilized as our generator to generate diverse captions. At each time step, it directly takes the latent variable and the embedding of the previously generated word as input and produces the context vector.

(2) Single CNN Layer. Here a single CNN is utilized as our generator. At each time step, the CNN input is built by concatenating the current step inputs with the previous ones, prepended with zero vectors of matching dimension for the not-yet-generated positions. A CNN with a fixed-size kernel then takes this input and produces the context vector.

(3) Single LSTM Layer + Single CNN Layer. Here we combine an LSTM and a CNN as our generator. At each time step, the LSTM consumes the step inputs and produces an output, which is then combined with all previous outputs (left-padded with zero vectors of the same dimension as the LSTM output) to form a context sequence. A CNN with a fixed-size kernel is then utilized to encode this sequence and yields the context vector.

(4) Single LSTM Layer + Multiple CNN Layers. Here we combine an LSTM and multiple CNNs as our generator. At each step, the LSTM takes the concatenated step inputs and produces an output, which is then combined with all previous outputs to form a context sequence. A CNN whose kernel matches the sequence length is then used to encode this sequence and produce the context vector.

(5) Multiple CNN Layers. This is our adopted method introduced above in the manuscript.
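The CNN-based variants above share one mechanism: the context at step t comes from a causal (left-padded) convolution over the embeddings of the words generated so far. A minimal numpy sketch of that single step follows; the shapes and the tanh nonlinearity are our illustrative assumptions, not the paper's exact layer.

```python
import numpy as np

def cnn_context(embeddings, kernel, t):
    """Context vector at step t from a causal 1-D convolution over the
    embeddings of previously generated words. The sequence is left-padded
    with zero vectors so early steps see only padding, mirroring the
    zero-vector padding in the architectures above. kernel has shape
    (k, d): k taps over d-dimensional embeddings."""
    k, d = kernel.shape
    padded = np.vstack([np.zeros((k - 1, d)), embeddings[:t + 1]])
    window = padded[-k:]                     # the last k (padded) positions
    return np.tanh((kernel * window).sum(axis=0))
```

Because the padding is on the left, the convolution at step t can never see words that have not been generated yet, which is what makes the decoder causal.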

For each method we select the top 10 different sentences for each video and evaluate diversity using Mixed mBLEU, Self-CIDEr, DCE-Variant and DCE. The results are presented in Table 5: the multiple-CNN-layer architecture performs best on all evaluation metrics. In particular, multiple CNN layers and the single CNN layer achieve the top-2 performance on Mixed mBLEU, Self-CIDEr and DCE-Variant, which demonstrates the potential of CNN architectures for generating diverse sentences.

LSTMs are good at producing peaky word probability distributions at the output [2]. This is conducive to generating high-precision descriptions, but limits the ability to generate diverse sentences. Compared to the LSTM, the CNN tends to produce flatter word probability distributions, giving it the potential to produce more word combinations. Benefiting from the characteristics of the CGAN, the diversity of the single-CNN architecture is further improved, making it superior not only to the LSTM architecture but also to the combined LSTM-and-CNN architectures on all metrics except DCE. Note that although single LSTM layer + single CNN layer and single LSTM layer + multiple CNN layers are inferior to the single CNN layer in sentence diversity due to the LSTM's limitations, they achieve better DCE scores, since the LSTM makes the generated sentences more accurate and DCE takes accuracy into account.

In the last architecture, multiple CNNs are used for decoding, each responsible for predicting the next word from a fixed-length input. The division of labor among the CNNs is therefore very clear, and the information of all previously generated words can be exploited explicitly when generating the next word, which the LSTM and the single CNN lack. Compared to a single CNN, the multiple-CNN architecture has an advantage in generating high-precision sentences; meanwhile, it shows the potential to improve the accuracy of LSTM-based architectures, as introduced above in our manuscript. In sum, it performs well in both sentence diversity and accuracy, which is why it achieves the best results on all metrics. We therefore adopt multiple CNNs as the final generator architecture.
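The "peaky versus flat" contrast can be checked directly: the Shannon entropy of the output word distribution quantifies how many word choices remain plausible at each step. The example distributions below are invented for illustration.

```python
import math

def entropy(p):
    """Shannon entropy (in nats) of a word probability distribution."""
    return -sum(x * math.log(x) for x in p if x > 0)

# Invented example distributions over a 4-word vocabulary:
peaky = [0.90, 0.05, 0.03, 0.02]   # LSTM-like: one word dominates
flat = [0.40, 0.30, 0.20, 0.10]    # CNN-like: probability mass spread out
```

A flatter (higher-entropy) distribution leaves more words plausible at each step, which is what enables more word combinations and hence more diverse sentences.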

A.3 More Examples of Video Captioning




  1. Nayyer Aafaq, Naveed Akhtar, Wei Liu, Syed Zulqarnain Gilani, and Ajmal Mian. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning. In CVPR, June 2019.
  2. Jyoti Aneja, Aditya Deshpande, and Alexander G Schwing. Convolutional image captioning. In CVPR, pages 5561–5570, 2018.
  3. D. L. Chen and W. B. Dolan. Collecting highly parallel data for paraphrase evaluation. In ACL, pages 190–200, 2011.
  4. Bo Dai, Sanja Fidler, Raquel Urtasun, and Dahua Lin. Towards diverse and natural image descriptions via a conditional gan. In ICCV, pages 2970–2979, 2017.
  5. Scott Deerwester, Susan T Dumais, George W Furnas, Thomas K Landauer, and Richard Harshman. Indexing by latent semantic analysis. Journal of the American society for information science, 41(6):391–407, 1990.
  6. M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the ninth workshop on statistical machine translation, pages 376–380, 2014.
  7. Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  8. L. Gao, Z. Guo, H. Zhang, X. Xu, and H. T. Shen. Video captioning with attention-based lstm and semantic consistency. IEEE Transactions on Multimedia, 19(9):2045–2055, 2017.
  9. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680. MIT Press, 2014.
  10. Alex Graves and Jürgen Schmidhuber. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Networks, 18(5-6):602–610, 2005.
  11. Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al. Cnn architectures for large-scale audio classification. In ICASSP, pages 131–135. IEEE, 2017.
  12. Qin Jin, Jia Chen, Shizhe Chen, Yifan Xiong, and Alexander Hauptmann. Describing videos using multi-modal fusion. In ACM MM, pages 1087–1091. ACM, 2016.
  13. Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In CVPR, pages 1725–1732, 2014.
  14. D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  15. Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
  16. S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of spider. In ICCV, volume 3, page 3, 2017.
  17. M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
  18. P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In CVPR, pages 1029–1038, 2016.
  19. Y. Pan, T. Mei, T. Yao, H. Li, and Y. Rui. Jointly modeling embedding and translation to bridge video and language. In CVPR, pages 4594–4602, 2016.
  20. K. Papineni, S. Roukos, T. Ward, and W. Zhu. Bleu: a method for automatic evaluation of machine translation. In ACL, pages 311–318, 2002.
  21. Raimundo Real and Juan M Vargas. The probabilistic basis of jaccard’s index of similarity. Systematic biology, 45(3):380–385, 1996.
  22. S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
  23. Anna Rohrbach, Marcus Rohrbach, Wei Qiu, Annemarie Friedrich, Manfred Pinkal, and Bernt Schiele. Coherent multi-sentence video description with variable level of detail. In GCPR, pages 184–195. Springer, 2014.
  24. Zhiqiang Shen, Jianguo Li, Zhou Su, Minjun Li, Yurong Chen, Yu-Gang Jiang, and Xiangyang Xue. Weakly supervised dense video captioning. In CVPR, pages 1916–1924, 2017.
  25. Rakshith Shetty and Jorma Laaksonen. Frame-and segment-level features and candidate pool evaluation for video caption generation. In ACM MM, pages 1073–1076. ACM, 2016.
  26. Andrew Shin, Katsunori Ohnishi, and Tatsuya Harada. Beyond caption to narrative: Video captioning with multiple sentences. In ICIP, pages 3364–3368. IEEE, 2016.
  27. Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
  28. Jingkuan Song, Yuyu Guo, Lianli Gao, Xuelong Li, Alan Hanjalic, and Heng Tao Shen. From deterministic to generative: Multimodal stochastic rnns for video captioning. IEEE transactions on neural networks and learning systems, 2018.
  29. J. Song, Z. Guo, L. Gao, W. Liu, D. Zhang, and H. T. Shen. Hierarchical lstm with adjusted temporal attention for video captioning. arXiv preprint arXiv:1706.01231, 2017.
  30. I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, pages 3104–3112. MIT Press, 2014.
  31. R. S. Sutton, D. A. McAllester, S. P. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. In NIPS, pages 1057–1063. MIT Press, 2000.
  32. Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
  33. R. Vedantam, C Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In CVPR, pages 4566–4575, 2015.
  34. S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence-video to text. In ICCV, pages 4534–4542, 2015.
  35. S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko. Translating videos to natural language using deep recurrent neural networks. arXiv preprint arXiv:1412.4729, 2014.
  36. B. Wang, L. Ma, W. Zhang, and W. Liu. Reconstruction network for video captioning. In CVPR, pages 7622–7631, 2018.
  37. Qingzhong Wang and Antoni B Chan. Describing like humans: on diversity in image captioning. In CVPR, pages 4195–4203, 2019.
  38. J. Xu, T. Mei, T. Yao, and Y. Rui. Msr-vtt: A large video description dataset for bridging video and language. In CVPR, pages 5288–5296, 2016.
  39. L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In ICCV, pages 4507–4515, 2015.
  40. H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In CVPR, pages 4584–4593, 2016.
  41. L. Yu, W. Zhang, J. Wang, and Y. Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.