Adversarial Inference for Multi-Sentence Video Description


Jae Sung Park, Marcus Rohrbach, Trevor Darrell, Anna Rohrbach
University of California, Berkeley, Facebook AI Research

While significant progress has been made in the image captioning task, video description is still comparatively in its infancy, due to the complex nature of video data. Generating multi-sentence descriptions for long videos is even more challenging. Among the main issues are the fluency and coherence of the generated descriptions, and their relevance to the video. Recently, reinforcement and adversarial learning based methods have been explored to improve the image captioning models; however, both types of methods suffer from a number of issues, e.g. poor readability and high redundancy for RL and stability issues for GANs. In this work, we instead propose to apply adversarial techniques during inference, designing a discriminator which encourages better multi-sentence video description. In addition, we find that a multi-discriminator “hybrid” design, where each discriminator targets one aspect of a description, leads to the best results. Specifically, we decouple the discriminator to evaluate on three criteria: 1) visual relevance to the video, 2) language diversity and fluency, and 3) coherence across sentences. Our approach results in more accurate, diverse and coherent multi-sentence video descriptions, as shown by automatic as well as human evaluation on the popular ActivityNet Captions dataset.

1 Introduction

Figure 1: Comparison of three state-of-the-art video description approaches (Transformer [76], VideoStory [12], MoveForwardTell [66]) and our proposed Adversarial Inference. Our approach generates more interesting and accurate descriptions with less redundancy. Video from ActivityNet Captions dataset [28], three segments, from left to right; red/bold indicates content errors, blue/italic indicates repetitive patterns, underscore highlights more rare/interesting phrases.

Being able to automatically generate a natural language description for a video has fascinated researchers since the early 2000s [26]. Despite the high interest in this task and the ongoing emergence of new datasets [12, 28, 75] and approaches [66, 69, 76], it remains a highly challenging problem. Consider the outputs of three recent video description methods on an example video from the ActivityNet Captions dataset [28] in Figure 1. We notice that there are multiple issues with these descriptions, in addition to the errors with respect to the video content: there are semantic inconsistencies and a lack of diversity within sentences, as well as redundancies across sentences. There are multiple challenges on the way towards more accurate and natural video descriptions. One issue is the size of the available training data, which, despite recent progress, is limited. Besides, video representations are more complex than, e.g., image representations, and require modeling temporal structure jointly with the semantics of the content. Moreover, describing videos with multiple sentences requires correctly recognizing a sequence of events in a video, maintaining linguistic coherence, and avoiding redundancy.

Another important factor is the target metric optimized by the description models. Most works still exclusively rely on the automatic metrics, such as METEOR [30], despite the evidence that they are not consistent with human judgments [23, 56]. Furthermore, some recent works propose to explicitly optimize for the sentence metrics using reinforcement learning based methods [34, 45]. These techniques have become quite widespread, both for image and video description [1, 66]. Despite getting higher scores, reinforcement learning based methods have been shown to lead to unwanted artifacts, such as ungrammatical sentence endings [14], increased object hallucination rates [46] and lack of diverse content [35]. Overall, while informative, sentence metrics should not be the only way of evaluating the description approaches.

Some works aim to overcome this issue by using adversarial training techniques [8, 52]. While Generative Adversarial Networks (GANs) [13] have achieved impressive results for image and even video generation [20, 42, 62, 67, 77], their success in language generation has been limited [54, 71]. The main issue is the difficulty of achieving stable training due to the discreteness of the output space [3, 4]. Another reported issue is a lack of coherence, especially for long text generation [19]. Still, the underlying idea of learning to distinguish the “good” natural descriptions from the “bad” fake ones is very compelling.

Rather than doing the joint adversarial training, we explore the effectiveness of a simpler approach, Adversarial Inference for video description, which relies on a discriminator to obtain better description quality. More specifically, we are interested in the task of multi-sentence video description [47, 70], i.e. the output of our model is a paragraph that describes a video. We assume that the ground-truth temporal segments are given, i.e. we do not address the event detection task, but focus on obtaining a coherent multi-sentence description (see Figure 1).

We first design a strong baseline generator model trained with the maximum likelihood objective, which relies on a previous sentence as context, similar to [12, 66]. We introduce object-level features in the form of object detections [1] to better represent people and objects in video.

We then make the following contributions:

(1) We propose the Adversarial Inference for video description, where we progressively sample sentence candidates for each clip, and select the best ones based on a discriminator’s score. Prior work has explored sampling with log probabilities [11], while we show that a specifically trained discriminator leads to better results in terms of correctness, coherence and diversity.

(2) Specifically, we propose the “hybrid discriminator”, which combines three specialized discriminators: one measures the language characteristics of a sentence, the second assesses its relevance to a video segment, and the third measures its coherence with the previous sentence. Prior work has considered a “single discriminator” for adversarial training to capture both the linguistic characteristics and visual relevance [52, 8]. We show that our “hybrid discriminator” outperforms the “single discriminator” design.

(3) We compare our proposed approach against multiple baselines on a number of metrics, including automatic sentence scores, diversity and redundancy scores, person-specific correctness scores, and, most importantly, human judgments. We show that our Adversarial Inference approach leads to more accurate and diverse multi-sentence descriptions, outperforming GAN and RL based approaches in a human evaluation.

2 Related Work

We review the existing approaches to video description, then discuss recent captioning models based on Reinforcement and Adversarial Learning. Finally, we review recent works that propose to sample and re-score sentence descriptions, and some that aim to design alternatives to automatic evaluation metrics.

Video description.

Over the past years there has been an increased interest in the task of video description generation, notably with the broader adoption of the deep learning techniques. S2VT [57] was among the first approaches based on LSTMs [18, 10]; some of the later ones include [37, 48, 51, 68, 72, 73]. Most recently, a number of approaches to video description have been proposed, such as replacing LSTM with a Transformer Network [76], introducing a reconstruction objective [58], using bidirectional attention fusion for context modeling [60], and others [6, 12, 32].

While most works focus on “video in - one sentence out” task, some aim to generate a multi-sentence paragraph for a video. This is related to the task of dense captioning [28], where videos are annotated with multiple localized sentences but the task does not require to produce a single coherent paragraph for the video. Some of the earlier works that address multi-sentence video description include [47, 53, 70]. Recently, [69] propose a Fine-grained Video Captioning Model for generating detailed sports narratives, and [66] propose a Move Forward and Tell approach, which localizes events and progressively decides when to generate the next sentence.

Reinforcement learning for caption generation.

Most deep language generation models rely on a cross-entropy loss and are given the previous ground-truth word during training. This has been shown to lead to exposure bias [41], as at test time the models need to condition on the predicted words instead. To overcome this issue, a number of reinforcement learning based actor-critic [27] approaches have been proposed [44, 45, 74]. E.g. [34] propose a policy gradient optimization method to directly optimize for language metrics, like CIDEr [56], using Monte Carlo rollouts. [45] propose a Self-Critical Sequence Training (SCST) method based on REINFORCE [65], where instead of estimating a baseline they use the test-time inference (greedy decoding) algorithm.

Some recent works adopt similar techniques to video description. [39] extend the approach of [41] by using a mixed loss (cross-entropy as well as reinforcement learning loss) and correcting CIDEr with an entailment penalty. [64] propose a hierarchical reinforcement learning approach, where a Manager generates sub-goals, a Worker performs low-level actions, and a critic determines whether the goal is achieved. Finally, [31] propose a multitask reinforcement learning approach, built off [45], with an additional attribute prediction loss.

GANs for caption generation.

Instead of optimizing for hand-designed metrics, some recent works aim to learn what the “good” captions should be like using adversarial training. The first works to apply Generative Adversarial Networks (GANs) [13] to image captioning are [52] and [8]. [52] train a discriminator to distinguish natural human captions from fake generated captions, focusing on caption diversity and image relevance. To sample captions they rely on Gumbel-Softmax approximation [21]. [8] instead rely on policy gradient, and their discriminator focuses on caption naturalness and image relevance.

Some works have applied adversarial learning to generate paragraph descriptions for images. [33] propose a joint training approach which incorporates multi-level adversarial discriminators, one for sentence level and another for coherent topic transition at a paragraph level. [63] rely on adversarial reward learning to train a visual storytelling policy. [59] use a language and a multi-modal discriminator for their adversarial training. Their discriminator design resembles ours, only we specifically distinguish visual, language, and pairwise discriminators, while their multi-modal discriminator has a standard design used in prior work [8, 52]. Also, our visual discriminator is agnostic to language structure and only focuses on visual relevance. None of these works rely on their trained discriminators during inference.

Two recent image captioning works propose using discriminator scores instead of language metrics in SCST model [5, 35]. We implement a GAN baseline based on this idea, and compare it to our approach.

Caption sampling and re-scoring.

A few prior works explore caption sampling and re-scoring during inference [2, 17, 55]. Specifically, [17] aim to obtain more image-grounded bird explanations, while [2, 55] aim to generate discriminative captions for a given distractor image. While our approach is similar, our goal is different, as we work with video rather than images, and aim to improve multi-sentence description with respect to multiple properties.

Alternatives to automatic metrics.

Recently, there is an interest in alternative ways of measuring the description quality than e.g. [38, 30, 56]. [7] propose to train a general critic network to learn to score captions, providing various types of corrupted captions as negatives. [50] propose the use of a composite metric, a classifier trained on the automatic scores as input. Our work differs from these in that we do not aim to build a general evaluation tool. Instead, we propose to improve the video description quality with our Adversarial Inference for a given generator.

3 Generation with Adversarial Inference

Figure 2: The overview of our Adversarial Inference approach. The Generator progressively samples candidate sentences for each clip, using the previous sentence as context. The Hybrid Discriminator scores the candidate sentences, and chooses the best one based on its visual relevance, linguistic characteristics and coherence to the previous sentence, more details in Figure 3.

In this section, we present our approach to multi-sentence description generation based on our Adversarial Inference method. We first introduce our baseline generator G and then discuss our discriminator D. The task of D is to score the descriptions generated by G for a given video. This includes, among other things, measuring whether the multi-sentence descriptions (1) are correct with respect to the video, (2) are fluent within individual sentences, and (3) form a coherent story across sentences. Instead of assigning all three tasks to a single discriminator, we propose to compose D out of three separate discriminators, each focusing on one of the above tasks. We denote this design a hybrid discriminator (see Figure 3).

While prior works mostly rely on discriminators for joint adversarial training [8, 52], we argue that using them during inference is a more robust way of improving over the original generator. In our Adversarial Inference, the pre-trained generator G presents D with sentence candidates by sampling from its probability distribution. In turn, our hybrid discriminator D selects the best sentence, relying on the combination of its sub-discriminators. An overview of our approach is shown in Figure 2.

3.1 Baseline Multi-Sentence Generator G

Given L clips v_1, ..., v_L of a video V, the task of G is to generate sentences s_1, ..., s_L, where each sentence s_l matches the content of the corresponding clip v_l. As the clips belong to the same video and each sentence depends on what has been said before, our goal is not only to generate a sentence that matches its visual content, but also to make the entire set of sentences coherent and diverse, like a natural paragraph.

Our generator follows a standard LSTM decoder [10, 18] to generate individual sentences, with an encoded representation of the clip v_l as visual context. Typically, at each step t, the LSTM expects an input vector that encodes the visual features of v_l as well as the previous word w_{t-1}. To encourage coherence among consecutive sentences, we additionally append the last hidden state h^{l-1} of the previous sentence as input to the LSTM decoder [12, 66]. The final input to the LSTM decoder for clip v_l at time step t is defined as follows:

$x_t^l = [f_t^l;\; e(w_{t-1});\; h^{l-1}],$

where $f_t^l$ is the visual input at step t, $e(w_{t-1})$ is the embedding of the previously generated word, and $h^{l-1}$ is the last hidden state of the previous sentence.

We further detail how we obtain the visual input f_t^l. Unlike image captioning, which relies on static features, video description requires a dynamic multimodal fusion over different visual features, such as e.g. the stream of RGB frames and motion. In addition to video- and image-level features, we introduce object detections extracted for a subset of frames. Different features may be temporally misaligned (i.e. extracted over different sets of frames). To address this we do the following. Suppose a visual feature extracted from v_l is represented as a sequence of segments [61, 66]. Then the previous hidden state is used to predict temporal attention [68] over these segments, which results in a single feature vector per modality. We concatenate the resulting vectors from all features to form the final visual input f_t^l to the decoder.
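The temporal attention step above can be sketched in pure Python. This is a minimal illustration only: it scores each segment feature by its dot product with the decoder hidden state, whereas the actual model predicts attention with learned layers [68]; the function name and scoring are assumptions for illustration.

```python
import math

def temporal_attention(hidden, segments):
    """Toy temporal attention: score each segment feature by its dot
    product with the decoder hidden state, softmax-normalize the scores,
    and return the attention-weighted sum as a single context vector.
    (Simplified sketch; the paper's attention uses learned projections.)"""
    scores = [sum(h * f for h, f in zip(hidden, seg)) for seg in segments]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(segments[0])
    context = [sum(w * seg[d] for w, seg in zip(weights, segments))
               for d in range(dim)]
    return context, weights
```

In the generator, one such context vector would be computed per modality and the results concatenated to form the visual input to the decoder.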

We follow standard maximum likelihood estimation (MLE) training for G, where we maximize the likelihood of each word given the current LSTM hidden state h_t.

3.2 Discriminator D

The task of a discriminator is to score a sentence s with respect to a video clip v as D(s | v) ∈ [0, 1], where 1 indicates a positive match and 0 a negative match. Most prior works that perform adversarial training for image captioning [5, 8, 35, 52] rely on the following “single discriminator” design: D is trained to distinguish human ground-truth sentences (positives) from sentences generated by G as well as mismatched human sentences from a different image (negatives). The latter aim to direct the discriminator’s attention to the sentences’ visual relevance.

For a given generator G, the discriminator D is trained with the following objective:

$L_D = \frac{1}{N}\sum_{i=1}^{N} L_D^{(i)},$

where N is the number of training videos. For a video $v_i$ the respective term is defined as:

$L_D^{(i)} = -\Big(\mathbb{E}_{s \in S_i}\big[\log D(s \mid v_i)\big] + \alpha\,\mathbb{E}_{\hat{s} \sim G}\big[\log(1 - D(\hat{s} \mid v_i))\big] + \beta\,\mathbb{E}_{\bar{s} \in S_{j \neq i}}\big[\log(1 - D(\bar{s} \mid v_i))\big]\Big),$

where $S_i$ is the set of ground-truth descriptions for $v_i$, $\hat{s}$ are samples generated by G, $\bar{s}$ are human descriptions from other videos, and $\alpha, \beta$ are hyper-parameters.
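The per-video discriminator objective can be illustrated with a small numeric sketch. This is a hedged, minimal example assuming the per-sentence discriminator outputs are already computed; the function name and the plain averaging over each negative set are illustrative, not the authors' implementation.

```python
import math

def discriminator_loss(d_real, d_gen, d_mis, alpha=0.5, beta=0.5):
    """Per-video discriminator objective (sketch): maximize the log-score
    of ground-truth sentences while pushing down generated samples
    (weight alpha) and mismatched human sentences (weight beta).
    Inputs are lists of discriminator outputs in (0, 1);
    returns the loss value to be minimized."""
    mean = lambda xs: sum(xs) / len(xs)
    return -(mean([math.log(d) for d in d_real])
             + alpha * mean([math.log(1.0 - d) for d in d_gen])
             + beta * mean([math.log(1.0 - d) for d in d_mis]))
```

A discriminator that confidently scores real sentences near 1 and both kinds of fakes near 0 drives this loss toward zero.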

3.2.1 Hybrid Discriminator

With the “single discriminator” setup, the discriminator is given multiple tasks at once: detecting generated “fakes”, which requires looking for specific linguistic characteristics, such as diversity or language structure, as well as detecting mismatched “fakes”, which requires looking at sentence semantics and relating them to the visual features. Moreover, for multi-sentence description, we would also like to detect cases where a sentence is inconsistent with or redundant to a previous sentence.

To obtain these properties, we argue it is important to decouple the different tasks and allocate an individual discriminator for each one. In the following we introduce our visual, language and pairwise discriminators, which jointly constitute our hybrid discriminator (see Figure 3). We use the objective defined above for all three, however, the types of negatives vary by discriminator.

Visual Discriminator

Visual discriminator D_v determines whether a sentence s refers to concepts present in a video clip v, regardless of the fluency and grammatical structure of the sentence. Since the pre-trained generator already produces video-relevant sentences, we do not include generated samples as negatives for D_v. Instead, we use human mismatched as well as generated mismatched sentences as our two types of negatives. While randomly mismatched negatives may be easy to distinguish, hard negatives, e.g. sentences from videos with the same activity as the given video, require stronger discriminative abilities. To further improve our discriminator, we introduce such hard negatives after training for 2 epochs.

Note that if we use an LSTM to encode the sentence inputs to D_v, it may exploit language characteristics to distinguish the generated mismatched sentences, instead of looking at their semantics. To mitigate this issue, we replace the LSTM encoding with a bag-of-words (BOW) representation, where each sentence is represented as a vocabulary-sized binary vector. The BOW representation is further embedded via a linear layer to obtain the final sentence encoding u.
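A minimal sketch of the vocabulary-sized binary BOW encoding described above (before the linear embedding layer); `bow_encode` and the whitespace tokenization are assumptions for illustration.

```python
def bow_encode(sentence, vocab):
    """Vocabulary-sized binary bag-of-words vector. The encoding is
    order-free, so the discriminator cannot exploit word-order cues.
    `vocab` maps word -> index; out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for word in sentence.lower().split():
        idx = vocab.get(word)
        if idx is not None:
            vec[idx] = 1
    return vec
```

Because the vector is binary and unordered, any permutation of a sentence maps to the same encoding, which forces D_v to rely on semantics rather than fluency.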

Figure 3: An overview of our Hybrid Discriminator. We score a sentence s_l for a given video clip v_l and the previous sentence s_{l-1}.

Similar to G, D_v also considers multiple visual features, i.e. we aggregate features from different misaligned modalities (video, image, objects). We individually encode each feature using temporal attention based on the entire sentence representation u. The obtained vector representation $f_m$ for modality m is then fused with the sentence representation u using Multimodal Low-rank Bilinear pooling (MLB) [24], which is known to be effective in tasks like multi-modal retrieval and VQA. The score for visual feature $f_m$ and sentence representation u is obtained as follows:

$r_m = \sigma\big(W_r(\tanh(U f_m) \odot \tanh(V u))\big),$

where $\sigma$ is a sigmoid producing values in (0, 1), $\odot$ is the Hadamard product, and $W_r$, $U$, $V$ are linear layers. Instead of concatenating the features as done in the generator, here we determine a score between the sentence and each modality and learn to weigh these scores adaptively based on the sentence. The intuition is that some sentences are more likely to require video features (“a man is jumping”), while others may require e.g. object features (“a man is wearing a red shirt”). Following [36], we assign weights to each modality based on the sentence representation u:

$\lambda = \mathrm{softmax}(W_{\lambda} u),$

where $W_{\lambda}$ are learned parameters. Finally, the score is the sum of the modality scores weighted by $\lambda$:

$D_v(s \mid v) = \sum_m \lambda_m r_m.$
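The sentence-conditioned modality weighting can be sketched as follows, assuming the per-modality MLB scores and the sentence-derived logits have already been computed; all names here are illustrative assumptions.

```python
import math

def hybrid_visual_score(modality_scores, modality_logits):
    """Combine per-modality relevance scores (e.g. for video, image and
    object streams) with sentence-conditioned weights: softmax the
    logits predicted from the sentence representation, then take the
    weighted sum of the per-modality scores."""
    m = max(modality_logits)             # stabilize the softmax
    exps = [math.exp(l - m) for l in modality_logits]
    z = sum(exps)
    weights = [e / z for e in exps]
    return sum(w * s for w, s in zip(weights, modality_scores)), weights
```

A sentence whose logits favor the object stream will have its final score dominated by the object-level relevance, matching the intuition in the text.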

Language Discriminator

Language discriminator D_l focuses on the language structure of an individual sentence s, independent of its visual relevance. Here we want to ensure fluency as well as a diversity of sentence structure that is lacking in the generated sentences. The ActivityNet Captions dataset [28], which we experiment with, has long (over 13 words on average) and diverse descriptions with varied grammatical structures. We observe that the discriminator is able to point out an obvious mismatch based on the diversity of real vs. fake sentences, but fails to capture fluency or repeated N-grams. To address this, in addition to sentences generated by G, D_l is given negative inputs with a mixture of randomly shuffled words or repeated phrases within a sentence.

To obtain a score, we encode a sentence with a bidirectional LSTM and concatenate the two last hidden states, denoted h, followed by a fully connected layer and a sigmoid:

$D_l(s) = \sigma(W_l h).$
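The construction of the language discriminator's extra negatives (randomly shuffled words, repeated phrases) might look like this sketch; the function names and the fixed phrase length are illustrative assumptions.

```python
import random

def shuffle_negative(sentence, rng):
    """Negative with the same words but broken fluency:
    randomly permute the word order."""
    words = sentence.split()
    rng.shuffle(words)
    return " ".join(words)

def repeat_negative(sentence, phrase_len=3):
    """Negative that repeats a phrase within the sentence, mimicking the
    degenerate repetition the discriminator should learn to penalize."""
    words = sentence.split()
    phrase = words[:phrase_len]
    return " ".join(words + phrase)
```

Both transformations preserve the vocabulary of the real sentence, so the discriminator must attend to fluency and structure rather than word identity.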

Pairwise Discriminator

Pairwise discriminator D_p evaluates whether two consecutive sentences s_{l-1} and s_l are coherent yet diverse in content. Specifically, D_p scores s_l based on s_{l-1}. To ensure coherence, we include “shuffled” sentences as negatives, i.e. the order of sentences in a paragraph is randomly changed. To discourage repeated content, we also design negatives consisting of a pair of identical sentences, optionally cutting off the ending of one of them (e.g. “a person enters and takes a chair” and “a person enters”).

Similar to the above, we encode both sentences with a bidirectional LSTM and obtain $h_{l-1}$ and $h_l$. We concatenate the two vectors and compute the score as follows:

$D_p(s_l \mid s_{l-1}) = \sigma(W_p [h_{l-1}; h_l]).$
Note that the first sentence of a video description paragraph is not assigned a pairwise score, as there is no previous sentence.
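The pairwise discriminator's negatives can be sketched similarly; a minimal illustration with hypothetical helper names.

```python
import random

def shuffled_order_negative(paragraph, rng):
    """Coherence negative: randomly permute the sentence order within a
    paragraph so that consecutive pairs are no longer coherent."""
    sents = list(paragraph)   # copy so the original order is preserved
    rng.shuffle(sents)
    return sents

def identical_pair_negative(sentence, cut=None):
    """Redundancy negative: pair a sentence with a copy of itself,
    optionally truncated to `cut` words, e.g.
    ("a person enters and takes a chair", "a person enters")."""
    words = sentence.split()
    second = " ".join(words[:cut]) if cut else sentence
    return (sentence, second)
```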

3.3 Adversarial Inference

In adversarial training for caption generation, G and D are first pre-trained and then jointly updated, where the discriminator improves the generator by providing feedback on the quality of sampled sentences. To deal with the non-differentiability of discrete sampling in joint training, several solutions have been proposed, such as Reinforcement Learning with variants of policy gradient methods or the Gumbel-Softmax relaxation [5, 8, 52]. While certain improvements have been shown, as discussed in Section 1, GAN training can be very unstable.

Motivated by the difficulties of joint training, we present our Adversarial Inference method, which uses the discriminator D during inference of the generator G. We show that our approach outperforms a jointly trained GAN model, most importantly in human evaluation (see Section 4).

During inference, the generator typically uses greedy max decoding or beam search to generate a sentence based on the maximum probability of each word. One alternative is to sample sentences and re-rank them by log probability [11]. Instead, we use our Hybrid Discriminator to score the sampled sentences. Note that we generate sentences progressively, i.e. we provide the hidden state representation of the previous best sentence as context to sample the next sentence (see Figure 2). Formally, for a video clip $v_l$, a previous best sentence $s_{l-1}$, and K sentences sampled from the generator G, the scores from our hybrid discriminator are used to compare the sentences and select the best one:

$s_l = \arg\max_{j=1..K} D(s^j \mid v_l, s_{l-1}),$

where $s^j$ is the j-th sampled sentence. The final discriminator score is defined as:

$D(s \mid v, s') = \lambda_v D_v(s \mid v) + \lambda_l D_l(s) + \lambda_p D_p(s \mid s'),$

where $\lambda_v, \lambda_l, \lambda_p$ are hyper-parameters.
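Putting the pieces together, the per-clip Adversarial Inference step amounts to scoring each sampled candidate with the weighted hybrid score and keeping the argmax. A sketch with stub scorer callables standing in for the trained sub-discriminators; the default weight values mirror those reported in Section 4, but everything else here is illustrative.

```python
def select_best_sentence(candidates, clip, prev_sent,
                         d_visual, d_language, d_pairwise,
                         lv=0.8, ll=0.2, lp=1.0):
    """Adversarial Inference step (sketch): score each sampled candidate
    sentence with the weighted combination of the visual, language and
    pairwise discriminators, and return the highest-scoring one.
    The first sentence of a paragraph receives no pairwise term."""
    def hybrid(s):
        score = lv * d_visual(s, clip) + ll * d_language(s)
        if prev_sent is not None:
            score += lp * d_pairwise(prev_sent, s)
        return score
    return max(candidates, key=hybrid)
```

In the full system the winning sentence's hidden state is then fed as context when sampling candidates for the next clip.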

4 Experiments

Method | METEOR | BLEU@4 | CIDEr-D (per video) | Vocab Size | Sent Length (overall) | RE-4 (per act.) | Div-1 | Div-2 | RE-4 (per video)
MLE 16.81 9.95 20.04 1749 13.83 0.38 0.55 0.74 0.08
GAN w/o CE 16.49 9.76 20.24 2174 13.67 0.35 0.56 0.74 0.07
GAN 16.69 10.02 21.07 1930 13.60 0.36 0.56 0.74 0.07
SCST 15.80 10.82 20.89 941 12.13 0.52 0.47 0.65 0.11
MLE + BS3 16.22 10.79 21.81 1374 12.92 0.48 0.55 0.71 0.11
MLE + LP 17.51 8.70 12.23 1601 18.68 0.48 0.48 0.69 0.12
MLE + SingleDis 16.29 9.25 18.17 2291 13.98 0.37 0.59 0.75 0.07
MLE + HybridDis w/o Pair 16.60 9.56 19.39 2390 13.86 0.32 0.58 0.76 0.06
MLE + HybridDis 16.48 9.91 20.60 2346 13.38 0.32 0.59 0.77 0.06
Human - - - 8352 14.27 0.04 0.71 0.85 0.01
Table 1: Comparison to video description baselines at paragraph-level. Statistics over generated descriptions include N-gram Diversity (Div-1,2, higher better) and Repetition (RE-4, lower better) per video and per activity. For discussion see Section 4.2.

We benchmark our approach for multi-sentence video description on the ActivityNet Captions dataset [28] and compare our Adversarial Inference to GAN and other baselines, as well as to state-of-the-art models.

4.1 Experimental Setup

Dataset. The ActivityNet Captions dataset contains 10,009 videos for training and 4,917 videos for validation, with two reference descriptions for each. (The two references are not aligned to the same time intervals and may even have a different number of sentences.) Similar to prior work [76, 12], we use the validation videos with the 2nd reference for development, while the 1st reference is used for evaluation. While the original task defined on ActivityNet Captions involves both event localization and description, we run our experiments with ground-truth video intervals. Our goal is to show that our approach leads to more correct, diverse and coherent multi-sentence video descriptions than the baselines.

Visual Processing. Each video clip is encoded with 2048-dim ResNet-152 features [16] pre-trained on ImageNet [9] (denoted ResNet) and 8192-dim ResNext-101 features [15] pre-trained on the Kinetics dataset [22] (denoted R3D). We extract both ResNet and R3D features every 16 frames and use a temporal resolution of 16 frames for R3D. The features are uniformly divided into 10 segments as in [61, 66], and mean-pooled within each segment to represent the clip as 10 sequential features. We also run the Faster R-CNN detector [43] from [1], trained on Visual Genome [29], on 3 frames (at the beginning, middle and end of a clip) and detect the top 16 objects per frame. We encode the predicted object labels with a bag of words weighted by detection confidences (denoted BottomUp). Thus, the visual representation of each clip consists of 10 R3D features, 10 ResNet features, and 3 BottomUp features.
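The 10-segment mean-pooling described above can be sketched as follows, operating on plain per-frame feature vectors; `segment_features` is an illustrative name, not the authors' code.

```python
def segment_features(frames, n_segments=10):
    """Uniformly divide a sequence of per-frame feature vectors into
    n_segments chunks and mean-pool within each chunk, yielding a fixed
    number of sequential features per clip regardless of clip length."""
    n = len(frames)
    dim = len(frames[0])
    out = []
    for k in range(n_segments):
        lo = k * n // n_segments
        hi = max((k + 1) * n // n_segments, lo + 1)  # keep chunks non-empty
        chunk = frames[lo:hi]
        out.append([sum(f[d] for f in chunk) / len(chunk) for d in range(dim)])
    return out
```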

Language Processing. The sentences are “cut” at a maximum length of 30 words. The LSTM cells’ dimensionality is fixed to 512. The discriminators’ word embeddings are initialized with 300-dim Glove embeddings [40].

Training and Inference. We train the generator and discriminator with cross-entropy objectives using the ADAM optimizer [25] with a learning rate of . One batch consists of multiple clips and captions from the same video, and the batch size is fixed to 16 when training both models. The weights for all the discriminators’ negative inputs (α, β in Eq. 3) are set to 0.5. The weights for our hybrid discriminator are set as λ_v = 0.8, λ_l = 0.2, λ_p = 1.0. The sampling temperature during discriminator training is 1.0; during inference we sample sentences with temperature 0.2. When training the discriminators, the type of negative example is randomly chosen for each video, i.e. a batch consists of a combination of different negatives.

Baselines and SoTA. We compare our Adversarial Inference (denoted MLE+HybridDis) to: our baseline generator (MLE); multiple inference procedures, i.e. beam search with size 3 (MLE+BS3), sampling with log probabilities (MLE+LP), and inference with the single discriminator (MLE+SingleDis); Self-Critical Sequence Training [45], which optimizes for CIDEr (SCST); and GAN models built off [5, 35] with a single discriminator, with and without a cross-entropy (CE) loss (GAN, GAN w/o CE). (We have tried incorporating our hybrid discriminator in GAN training; however, we have not observed a large difference, likely due to the large space of training hyper-parameters, which is challenging to explore.) Finally, we also compare to the following state-of-the-art methods: Transformer [76], VideoStory [12] and MoveForwardTell [66], whose predictions we obtained from the authors.

4.2 Results

Automatic Evaluation.

Following [66], we conduct our evaluation at paragraph-level rather than at sentence-level. We compare our model to baselines in Table 1. We include standard metrics, i.e. METEOR [30], BLEU@4 [38] and CIDEr-D [56]. The best performing models in these metrics do not include our adversarial inference procedure nor the jointly trained GAN models. This is somewhat expected, as prior work shows that adversarial training does worse in these metrics than the MLE baseline [8, 52]. We note that adding a CE loss benefits GAN training, leading to more fluent descriptions (GAN w/o CE vs. GAN). We also observe that the METEOR score, popular in video description literature, is strongly correlated with sentence length.

The standard metrics alone are not sufficient to get a holistic view of the description quality, since the scores fail to capture content diversity or detect repetition of phrases and sentence structures. To see if our approach improves on these properties, we report Div-1 and Div-2 scores [52], which measure the ratio of unique N-grams (N=1,2) to the total number of words, and RE-4 [66], which captures the degree of N-gram repetition (N=4) in a description (for Div-1,2 higher is better, while for RE-4 lower is better). We compute these scores at the video (paragraph) level and report the average score over all videos (see “Per video”). We see that our Adversarial Inference leads to more diverse descriptions with less repetition than all the baselines, including GANs. Our MLE+HybridDis model outperforms MLE+SingleDis in every metric, supporting our hybrid discriminator design. We also see that including the Pairwise Discriminator further improves the metrics (MLE+HybridDis w/o Pair vs. MLE+HybridDis). Note that SCST has the lowest diversity and highest repetition among all baselines.
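The diversity and repetition statistics can be computed roughly as follows. Div-N follows the definition above (unique N-grams over total words); for RE-4 we show one plausible formulation (the fraction of N-gram occurrences that are repeats), since the exact normalization in [66] may differ.

```python
from collections import Counter

def ngrams(words, n):
    """All contiguous n-grams of a word list."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def div_n(words, n):
    """Div-N: ratio of unique N-grams to the total number of words
    (higher means more diverse)."""
    return len(set(ngrams(words, n))) / len(words)

def re_n(words, n=4):
    """RE-N: fraction of N-gram occurrences that are repeats
    (one plausible formulation; lower means less repetition)."""
    counts = Counter(ngrams(words, n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c - 1 for c in counts.values()) / total
```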

Finally, we want to capture the degree of “discriminativeness” among the descriptions of videos with similar content. ActivityNet includes 200 activity labels, and videos with the same activity have similar visual content. We thus also report RE-4 per activity by combining all sentences associated with each activity and averaging the score over all activities (see “Per act.”). Our MLE+HybridDis model gets the lowest repetition score, suggesting that it produces more video-relevant and less generic descriptions.

Method | METEOR | BLEU@4 | CIDEr-D (per video) | Vocab Size | Sent Length (overall) | RE-4 (per act.) | Div-1 | Div-2 | RE-4 (per video)
VideoStory [12] 16.26 7.66 14.53 1269 16.73 0.37 0.51 0.72 0.09
Transformer [76] 16.15 10.29 21.72 1819 12.42 0.34 0.53 0.73 0.07
MoveForwardTell [66] 14.67 10.03 19.49 1926 11.46 0.53 0.55 0.66 0.18
MLE 16.81 9.95 20.04 1749 13.83 0.38 0.55 0.74 0.08
MLE + HybridDis 16.48 9.91 20.60 2346 13.38 0.32 0.59 0.77 0.06
Table 2: Comparison to video description SoTA models at paragraph-level. Statistics over generated descriptions include N-gram Diversity (Div, higher better) and Repetition (RE, lower better) per video and per activity.
Method | Better than MLE (%) | Worse than MLE (%) | Delta
SCST 22.0 62.0 -40.0
GAN 32.5 30.0 +2.5
MLE + BS3 27.0 31.0 -4.0
MLE + LP 32.5 34.0 -1.5
MLE + SingleDis 29.0 30.0 -1.0
MLE + HybridDis w/o Pair 42.0 36.5 +5.5
MLE + HybridDis 38.0 31.5 +6.5
Table 3: Human evaluation of multi-sentence video descriptions.
Human Evaluation.

The most reliable way to evaluate description quality is with human judges. We run our evaluation on Amazon Mechanical Turk (AMT) with a set of 200 random videos. To make the task easier for humans, we compare two systems at a time, rather than judging multiple systems at once. We design a set of experiments where each system is compared to the MLE baseline. The human judges can select that one description is better than the other, or that both are similar. We ask 3 human judges to score each pair of descriptions, so that we can compute a majority vote (i.e. at least 2 out of 3 agree on a judgment), see Table 3. As we see, our proposed approach improves over all other inference procedures, as well as over GAN and SCST. We see that the GAN is rather competitive, but still overall not scored as high as our approach. Notably, SCST is scored rather low, which we attribute to its grammatical issues and high redundancy.

Figure 4 shows a few qualitative examples comparing ground truth descriptions to those generated by the following methods: MLE, SCST (with CIDEr), GAN, MLE+SingleDis (Single Disc), and our MLE+HybridDis (Ours). We highlight errors, e.g. objects not present in the video, in bold/red, and repeating phrases in italic/blue. Overall, our approach leads to more correct, more fluent, and less repetitive multi-sentence descriptions than the baselines. In (a), our prediction is preferred to all the baselines w.r.t. sentence fluency. While all models recognize the presence of a baby and a person eating an ice cream, the baselines fail to describe the scene in a coherent way, while our approach summarizes the visual information correctly. Our model also generates more diverse descriptions specific to what is happening in the video, often mentioning more interesting and informative words/phrases, such as “trimming the hedges” in (b) or “their experience” in (c). MLE and SCST mention less visually specific information and generate more generic descriptions, such as “holding a piece of wood”. In an attempt to explore diverse phrases, the single discriminator is more prone to hallucinating non-existent objects, e.g. “monkey bars” in (b). Finally, our model outperforms the baselines in terms of lower redundancy across sentences. As seen in (c), our approach delivers more diverse content for each clip, while all others over-report “speaking/talking to the camera”, a very common phrase in the dataset. Our model is not completely free of errors, e.g. hallucinating an ice cream “cone” and incorrectly mentioning “showing off her new york” compared to the ground truth; however, our captions improve over those of the baselines, as supported by our human evaluation.

Figure 4: Comparison of our approach to the baselines: MLE, SCST, GAN, MLE+SingleDis. Red/bold indicates content errors, blue/italic indicates repetitive patterns, underscore highlights more rare/interesting phrases.
Comparison to SoTA.

We compare our baseline MLE model and our full approach (MLE + HybridDis) to multiple state-of-the-art approaches using the same automatic metrics as above. As can be seen from Table 2, our models perform on par with the state of the art on standard metrics, while MLE + HybridDis wins on diversity metrics. A qualitative comparison that underlines the strengths of our approach is shown in Figure 5. While the state-of-the-art models are often able to capture the relevant visual information, they are still prone to issues like repetition, lack of diverse and precise content, as well as content errors. In particular, VideoStory and MoveForwardTell suffer from the dominant language prior and repeatedly mention “the camera”, making the stories less informative and less specific to the events in the video. Despite less repetition and high scores on language metrics, the Transformer model is prone to producing incoherent phrases, e.g. “a man is a bikini” or “putting sunscreen on the beach water”, and ungrammatical endings, e.g. the last two sentences. On the other hand, our model captures the visual content more precisely, i.e. referring to the subject as a “girl”, pointing out that the girl is “laying on a bed”, correctly recognizing “sand castles”, etc. Again, we note that there is still large room for improvement w.r.t. the human ground-truth descriptions.

Figure 5: Comparison to SoTA models: Transformer [76], VideoStory [12] and MoveForwardTell [66]. Red/bold indicates content errors, blue/italic indicates repetitive patterns, underscore highlights more rare/interesting phrases.
Method               | Exact word | Gender + plurality
VideoStory [12]      | 44.9       | 64.1
Transformer [76]     | 45.8       | 66.0
MoveForwardTell [66] | 42.6       | 64.1
MLE                  | 48.8       | 67.5
SCST                 | 44.0       | 63.3
GAN                  | 48.9       | 67.5
MLE + HybridDis      | 49.1       | 67.9
Table 4: Correctness of person-specific words, F1 score.
Person Correctness.

Many videos in the ActivityNet Captions dataset discuss people and their actions. To get additional insights into the correctness of the generated descriptions, we evaluate the correctness of “person words”. Specifically, we compare (a) the exact person words (e.g. girl, guys) or (b) only gender and plurality (e.g. female-single, male-plural) between the references and the predicted descriptions, and report the F1 score in Table 4 (this is similar to [49], who evaluate character correctness in movie descriptions). Interestingly, our MLE model already outperforms the state of the art in terms of correctness, likely due to the additional object-level features [1]. We see that SCST leads to a significant decrease in person-word correctness, while our Adversarial Inference helps to further improve it.
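A sketch of the person-word F1 evaluation; the lexicon below is a small illustrative stand-in for whatever word list the actual evaluation uses, and the multiset-overlap formulation is our assumption:

```python
from collections import Counter

# Hypothetical person-word lexicon mapping each word to (gender, plurality).
PERSON_WORDS = {
    "man": ("male", "single"), "men": ("male", "plural"),
    "woman": ("female", "single"), "women": ("female", "plural"),
    "girl": ("female", "single"), "boy": ("male", "single"),
    "guys": ("male", "plural"), "people": ("neutral", "plural"),
}

def f1_person_words(ref, pred, exact=True):
    """Multiset F1 over person words (exact=True) or over their
    (gender, plurality) classes (exact=False)."""
    def extract(text):
        words = [w for w in text.lower().split() if w in PERSON_WORDS]
        return Counter(words if exact else [PERSON_WORDS[w] for w in words])
    r, p = extract(ref), extract(pred)
    tp = sum((r & p).values())  # multiset intersection = true positives
    if tp == 0:
        return 0.0
    prec = tp / sum(p.values())
    rec = tp / sum(r.values())
    return 2 * prec * rec / (prec + rec)
```

For example, a reference “a girl is talking” against a prediction “a woman is talking” scores 0 under the exact-word criterion but 1 under gender+plurality, since both words are female-single; this is why the second column of Table 4 is uniformly higher.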

5 Conclusion

The focus of prior work on video description generation has so far been on training better generators and improving the input representation. In contrast, in this work we advocate an orthogonal direction to improve the quality of video descriptions: we propose the concept of Adversarial Inference for video description, where a trained discriminator selects the best from a set of sampled sentences. This allows the final decision about the best sample to be made a posteriori, relying on strong trained discriminators that look at the video and the generated sentences. More specifically, we introduce a hybrid discriminator which consists of three individual experts: one for language, one for relating the sentence to the video, and one pair-wise, across sentences. In our experimental study, humans prefer sentences selected by our hybrid discriminator in Adversarial Inference over the default greedy decoding. Beam search, sampling with log-probability selection, as well as previous approaches to improving the generator (SCST and GAN), are judged not as good as our sentences.
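The selection step described above can be sketched as follows; the three discriminators are stand-in callables, and the equal weighting of their scores is an illustrative assumption rather than the paper's setting:

```python
def adversarial_inference(samples, video, prev_sent, d_lang, d_vis, d_pair,
                          weights=(1.0, 1.0, 1.0)):
    """Pick the sampled sentence the hybrid discriminator scores highest.
    d_lang scores fluency/diversity of the sentence alone, d_vis its relevance
    to the video, and d_pair its coherence with the previous sentence."""
    wl, wv, wp = weights
    def score(s):
        return wl * d_lang(s) + wv * d_vis(video, s) + wp * d_pair(prev_sent, s)
    return max(samples, key=score)
```

At inference time this replaces greedy decoding: the generator proposes several samples per clip, and the highest-scoring one is kept as the next sentence of the paragraph.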

6 Acknowledgements

This work was in part supported by the US DoD, the Berkeley Artificial Intelligence Research (BAIR) Lab, and the Berkeley DeepDrive (BDD) Lab.


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [2] J. Andreas and D. Klein. Reasoning about pragmatics with neural listeners and speakers. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2016.
  • [3] M. Caccia, L. Caccia, W. Fedus, H. Larochelle, J. Pineau, and L. Charlin. Language gans falling short. arXiv preprint arXiv:1811.02549, 2018.
  • [4] T. Che, Y. Li, R. Zhang, R. D. Hjelm, W. Li, Y. Song, and Y. Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
  • [5] C. Chen, S. Mu, W. Xiao, Z. Ye, L. Wu, F. Ma, and Q. Ju. Improving image captioning with conditional generative adversarial nets. arXiv:1805.07112, 2018.
  • [6] Y. Chen, S. Wang, W. Zhang, and Q. Huang. Less is more: Picking informative frames for video captioning. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [7] Y. Cui, G. Yang, A. Veit, X. Huang, and S. Belongie. Learning to evaluate image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5804–5812, 2018.
  • [8] B. Dai, S. Fidler, R. Urtasun, and D. Lin. Towards diverse and natural image descriptions via a conditional gan. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [9] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.
  • [10] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [11] J. Donahue, L. A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2017.
  • [12] S. Gella, M. Lewis, and M. Rohrbach. A dataset for telling the stories of social media videos. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 968–974, 2018.
  • [13] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), pages 2672–2680, 2014.
  • [14] T. Guo, S. Chang, M. Yu, and K. Bai. Improving reinforcement learning based image captioning with natural language prior. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • [15] K. Hara, H. Kataoka, and Y. Satoh. Can spatiotemporal 3d cnns retrace the history of 2d cnns and imagenet? In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
  • [16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
  • [17] L. A. Hendricks, R. Hu, T. Darrell, and Z. Akata. Grounding visual explanations. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [18] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • [19] A. Holtzman, J. Buys, M. Forbes, A. Bosselut, D. Golub, and Y. Choi. Learning to write with cooperative discriminators. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
  • [20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [21] E. Jang, S. Gu, and B. Poole. Categorical reparameterization with gumbel-softmax. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [22] W. Kay, J. Carreira, K. Simonyan, B. Zhang, C. Hillier, S. Vijayanarasimhan, F. Viola, T. Green, T. Back, P. Natsev, et al. The kinetics human action video dataset. arXiv preprint arXiv:1705.06950, 2017.
  • [23] M. Kilickaya, A. Erdem, N. Ikizler-Cinbis, and E. Erdem. Re-evaluating automatic metrics for image captioning. In Proceedings of the Conference of the European Chapter of the Association for Computational Linguistics (EACL), 2016.
  • [24] J. Kim, K. W. On, W. Lim, J. Kim, J. Ha, and B. Zhang. Hadamard product for low-rank bilinear pooling. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.
  • [25] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • [26] A. Kojima, T. Tamura, and K. Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision (IJCV), 50(2):171–184, 2002.
  • [27] V. R. Konda and J. N. Tsitsiklis. Actor-critic algorithms. In Advances in Neural Information Processing Systems (NIPS), pages 1008–1014, 2000.
  • [28] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. C. Niebles. Dense-captioning events in videos. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 706–715, 2017.
  • [29] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123(1):32–73, 2017.
  • [30] M. Denkowski and A. Lavie. Meteor universal: Language specific translation evaluation for any target language. In Proceedings of the Ninth Workshop on Statistical Machine Translation (WMT), 2014.
  • [31] L. Li and B. Gong. End-to-end video captioning with multitask reinforcement learning. arXiv preprint arXiv:1803.07950, 2018.
  • [32] Y. Li, T. Yao, Y. Pan, H. Chao, and T. Mei. Jointly localizing and describing events for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7492–7500, 2018.
  • [33] X. Liang, Z. Hu, H. Zhang, C. Gan, and E. P. Xing. Recurrent topic-transition gan for visual paragraph generation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [34] S. Liu, Z. Zhu, N. Ye, S. Guadarrama, and K. Murphy. Improved image captioning via policy gradient optimization of spider. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), volume 3, page 3, 2017.
  • [35] I. Melnyk, T. Sercu, P. L. Dognin, J. Ross, and Y. Mroueh. Improved image captioning with adversarial semantic alignment. arXiv:1805.00063, 2018.
  • [36] A. Miech, I. Laptev, and J. Sivic. Learning a text-video embedding from incomplete and heterogeneous data. arXiv preprint arXiv:1804.02516, 2018.
  • [37] P. Pan, Z. Xu, Y. Yang, F. Wu, and Y. Zhuang. Hierarchical recurrent neural encoder for video representation with application to captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [38] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2002.
  • [39] R. Pasunuru and M. Bansal. Reinforced video captioning with entailment rewards. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2017.
  • [40] J. Pennington, R. Socher, and C. Manning. Glove: Global vectors for word representation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
  • [41] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba. Sequence level training with recurrent neural networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • [42] S. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. In Proceedings of the International Conference on Machine Learning (ICML), 2016.
  • [43] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NIPS), pages 91–99, 2015.
  • [44] Z. Ren, X. Wang, N. Zhang, X. Lv, and L.-J. Li. Deep reinforcement learning-based image captioning with embedding reward. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [45] S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [46] A. Rohrbach, L. A. Hendricks, K. Burns, T. Darrell, and K. Saenko. Object hallucination in image captioning. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2018.
  • [47] A. Rohrbach, M. Rohrbach, W. Qiu, A. Friedrich, M. Pinkal, and B. Schiele. Coherent multi-sentence video description with variable level of detail. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2014.
  • [48] A. Rohrbach, M. Rohrbach, and B. Schiele. The long-short story of movie description. In Proceedings of the German Conference on Pattern Recognition (GCPR), 2015.
  • [49] A. Rohrbach, M. Rohrbach, S. Tang, S. J. Oh, and B. Schiele. Generating descriptions with grounded and co-referenced people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [50] N. Sharif, L. White, M. Bennamoun, and S. A. A. Shah. Learning-based composite metrics for improved caption evaluation. In Proceedings of ACL 2018, Student Research Workshop, pages 14–20, 2018.
  • [51] R. Shetty and J. Laaksonen. Frame- and segment-level features and candidate pool evaluation for video caption generation. In Proceedings of the ACM international conference on Multimedia (MM), pages 1073–1076, 2016.
  • [52] R. Shetty, M. Rohrbach, L. A. Hendricks, M. Fritz, and B. Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.
  • [53] A. Shin, K. Ohnishi, and T. Harada. Beyond caption to narrative: Video captioning with multiple sentences. In Proceedings of the IEEE International Conference on Image Processing (ICIP), 2016.
  • [54] S. Subramanian, S. Rajeswar, F. Dutil, C. Pal, and A. Courville. Adversarial generation of natural language. In Proceedings of the 2nd Workshop on Representation Learning for NLP, pages 241–251, 2017.
  • [55] R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik. Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 3, 2017.
  • [56] R. Vedantam, C. L. Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
  • [57] S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, and K. Saenko. Sequence to sequence – video to text. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [58] B. Wang, L. Ma, W. Zhang, and W. Liu. Reconstruction network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7622–7631, 2018.
  • [59] J. Wang, J. Fu, J. Tang, Z. Li, and T. Mei. Show, reward and tell: Automatic generation of narrative paragraph from photo stream by adversarial training. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2018.
  • [60] J. Wang, W. Jiang, L. Ma, W. Liu, and Y. Xu. Bidirectional attentive fusion with context gating for dense video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 7190–7198, 2018.
  • [61] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool. Temporal segment networks: Towards good practices for deep action recognition. In European Conference on Computer Vision, pages 20–36. Springer, 2016.
  • [62] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catanzaro. Video-to-video synthesis. In Advances in Neural Information Processing Systems (NIPS), 2018.
  • [63] X. Wang, W. Chen, Y.-F. Wang, and W. Y. Wang. No metrics are perfect: Adversarial reward learning for visual storytelling. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
  • [64] X. Wang, W. Chen, J. Wu, Y.-F. Wang, and W. Y. Wang. Video captioning via hierarchical reinforcement learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4213–4222, 2018.
  • [65] R. J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
  • [66] Y. Xiong, B. Dai, and D. Lin. Move forward and tell: A progressive generator of video descriptions. In Proceedings of the European Conference on Computer Vision (ECCV), 2018.
  • [67] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li. High-resolution image inpainting using multi-scale neural patch synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [68] L. Yao, A. Torabi, K. Cho, N. Ballas, C. Pal, H. Larochelle, and A. Courville. Describing videos by exploiting temporal structure. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015.
  • [69] H. Yu, S. Cheng, B. Ni, M. Wang, J. Zhang, and X. Yang. Fine-grained video captioning for sports narrative. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6006–6015, 2018.
  • [70] H. Yu, J. Wang, Z. Huang, Y. Yang, and W. Xu. Video paragraph captioning using hierarchical recurrent neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  • [71] L. Yu, W. Zhang, J. Wang, and Y. Yu. Seqgan: Sequence generative adversarial nets with policy gradient. In AAAI, pages 2852–2858, 2017.
  • [72] Y. Yu, H. Ko, J. Choi, and G. Kim. End-to-end concept word detection for video captioning, retrieval, and question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [73] M. Zanfir, E. Marinoiu, and C. Sminchisescu. Spatio-temporal attention models for grounded video captioning. In Proceedings of the Asian Conference on Computer Vision (ACCV), 2016.
  • [74] L. Zhang, F. Sung, F. Liu, T. Xiang, S. Gong, Y. Yang, and T. M. Hospedales. Actor-critic sequence training for image captioning. In Advances in Neural Information Processing Systems (NIPS Workshops), 2017.
  • [75] L. Zhou, C. Xu, and J. J. Corso. Towards automatic learning of procedures from web instructional videos. In Proceedings of the Conference on Artificial Intelligence (AAAI), 2018.
  • [76] L. Zhou, Y. Zhou, J. J. Corso, R. Socher, and C. Xiong. End-to-end dense video captioning with masked transformer. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8739–8748, 2018.
  • [77] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2017.