Oracle Performance for Visual Captioning


Abstract

The task of associating images and videos with natural language descriptions has attracted a great amount of attention recently. State-of-the-art results on some of the standard datasets have been pushed into a regime where it has become more and more difficult to make significant improvements. Instead of proposing new models, this work investigates the performance that an oracle can obtain. In order to disentangle the contribution of the visual model from that of the language model, our oracle assumes that a high-quality visual concept extractor is available and focuses only on the language part. We demonstrate the construction of such oracles on MS-COCO, YouTube2Text and LSMDC (a combination of M-VAD and MPII-MD). Surprisingly, despite the simplicity of the model and the training procedure, we show that current state-of-the-art models fall short when compared with the learned oracle. Furthermore, this suggests that current models are unable to capture important visual concepts in captioning tasks.

Li Yao (li.yao@umontreal.ca), Université de Montréal
Nicolas Ballas (nicolas.ballas@umontreal.ca), Université de Montréal
Kyunghyun Cho (kyunghyun.cho@nyu.edu), New York University
John R. Smith (jsmith@us.ibm.com), IBM T.J. Watson Research
Yoshua Bengio (yoshua.bengio@umontreal.ca), Université de Montréal

1 Introduction

With standard datasets publicly available, such as COCO and Flickr (Lin et al., 2014; Hodosh et al., 2013; Young et al., 2014) in image captioning, and YouTube2Text, M-VAD and MPII-MD (Guadarrama et al., 2013; Torabi et al., 2015; Rohrbach et al., 2015b) in video captioning, the field has been progressing at an astonishing speed. For instance, the state-of-the-art results on COCO image captioning have improved rapidly from 0.17 to 0.31 in BLEU Kiros et al. (2014); Devlin et al. (2015b); Donahue et al. (2015); Vinyals et al. (2014); Xu et al. (2015b); Mao et al. (2015); Karpathy and Fei-Fei (2014); Bengio et al. (2015); Qi Wu et al. (2015). Similarly, the benchmark on YouTube2Text has been repeatedly pushed from 0.31 to 0.50 in BLEU score Rohrbach et al. (2013); Venugopalan et al. (2015b); Yao et al. (2015); Venugopalan et al. (2015a); Xu et al. (2015a); Rohrbach et al. (2015a); Yu et al. (2015); Ballas et al. (2016). While obtaining encouraging results, captioning approaches involve large networks, usually leveraging a convolutional network for the visual part and a recurrent network for the language side. This results in models of considerable complexity, where the contribution of the different components is not clear.

Instead of proposing better models, the main objective of this work is to develop a method that offers deeper insight into the strengths and weaknesses of popular visual captioning models. In particular, we propose a trainable oracle that disentangles the contribution of the visual model from the language model. To obtain such an oracle, we follow the assumption that the image and video captioning task may be solved in two steps Rohrbach et al. (2013); Fang et al. (2015). Consider the model $p(y \mid x)$, where $x$ refers to the usually high-dimensional visual inputs, such as representations of an image or a video, and $y$ refers to a caption, usually a sentence of natural language description. In order to work well, the model needs to form higher-level visual concepts $a$, either explicitly or implicitly, based on $x$ in the first step, denoted as $p(a \mid x)$, followed by a language model that transforms the visual concepts into a legitimate sentence, denoted by $p(y \mid a)$. Here $a$ refers to atoms that are visually perceivable from $x$.

The above assumption suggests an alternative way to build an oracle. In particular, we assume the first step is close to perfect, in the sense that the visual concepts (or hints) are observed with almost 100% accuracy, and we then train the best language model conditioned on these hints to produce captions.

Using the proposed oracle, we compare the current state-of-the-art models against it, which helps to quantify their capacity for visual modeling, a major weakness, as opposed to their strong language modeling. In addition, when applied to different datasets, the oracle offers insight into their intrinsic difficulties and blessings, a general guideline when designing new algorithms and developing new models. Finally, we also relax the assumption to investigate the case where visual concepts cannot realistically be predicted with 100% accuracy and demonstrate a quantity-accuracy trade-off in solving visual captioning tasks.

2 Related work

Visual captioning

The problem of image captioning has attracted a great amount of attention lately. Early work focused on constructing linguistic templates or syntactic trees based on a set of concepts from visual inputs Kuznetsova et al. (2012); Mitchell et al. (2012); Kulkarni et al. (2013). Another popular approach is based on caption retrieval in an embedding space, such as Kiros et al. (2014); Devlin et al. (2015b). Most recently, the use of language models conditioned on visual inputs has been widely studied, in the work of Fang et al. (2015) where a maximum entropy language model is used, and in Donahue et al. (2015); Vinyals et al. (2014); Xu et al. (2015b); Mao et al. (2015); Karpathy and Fei-Fei (2014), where recurrent neural network based models are built to generate natural language descriptions. The work of Devlin et al. (2015a) advocates combining both types of language models. Furthermore, CIDEr (Vedantam et al., 2015) was proposed as an alternative evaluation metric for image captioning and is shown to be more advantageous than BLEU and METEOR. To further improve performance, Bengio et al. (2015) suggests a simple sampling algorithm during training, which was one of the winning recipes for the MS-COCO Captioning Challenge (http://mscoco.org), and Jia et al. (2015) suggests the use of extra semantic information to guide the language generation process.

Similarly, video captioning has made substantial progress recently. Early models such as Kojima et al. (2002); Barbu et al. (2012); Rohrbach et al. (2013) tend to focus on constrained domains with a limited set of activities and objects appearing in videos. They also rely heavily on hand-crafted video features, followed by template-based or shallow statistical machine translation approaches to produce captions. Borrowing success from image captioning, recent models such as Venugopalan et al. (2015b); Donahue et al. (2015); Yao et al. (2015); Venugopalan et al. (2015a); Xu et al. (2015a); Rohrbach et al. (2015a); Yu et al. (2015) and most recently Ballas et al. (2016) have adopted a more general encoder-decoder approach with end-to-end parameter tuning. Videos are fed into a specific variant of encoding neural network to form a higher-level visual summary, followed by a caption decoder based on recurrent neural networks. Training such models is possible with the availability of three relatively large-scale datasets: one collected from YouTube by Guadarrama et al. (2013), and the other two constructed from Descriptive Video Service (DVS) on movies by Torabi et al. (2015) and Rohrbach et al. (2015b). The latter two have recently been combined as the official dataset for the Large Scale Movie Description Challenge (LSMDC, https://goo.gl/2hJ4lw).

Capturing higher-level visual concepts

The idea of using intermediate visual concepts to guide caption generation has been discussed in Qi Wu et al. (2015) in the context of image captioning and in Rohrbach et al. (2015a) for video captioning. Both works trained classifiers on a predefined set of visual concepts, extracted from captions using heuristics from linguistics and natural language processing. Our work resembles both of them in the sense that we also extract similar constituents from captions. The purpose of this study, however, is different. By assuming perfect classifiers on those visual atoms, we are able to establish performance upper bounds for a particular dataset. Note that a simple bound is suggested by Rohrbach et al. (2015a), where METEOR is measured on all the training captions against a particular test caption and the largest score is picked as the upper bound. As a comparison, our approach constructs a series of oracles that are trained to generate captions given different numbers of visual hints. Such bounds are therefore a clear indication of models' ability to capture concepts within images and videos when generating captions, unlike the bound suggested by Rohrbach et al. (2015a), which performs caption retrieval.

3 Oracle Model

The construction of the oracle is inspired by the observation that $p(y \mid x) = \sum_{a} p(y \mid a, x)\, p(a \mid x) \approx \sum_{a} p(y \mid a)\, p(a \mid x)$, where $y = (y_1, \dots, y_T)$ denotes a caption containing a sequence of $T$ words, $x$ denotes the visual inputs such as an image or a video, and $a$ denotes the visual concepts, which we call "atoms". We have explicitly factorized the captioning model into two parts: $p(y \mid a)$, which we call the conditional language model given atoms, and $p(a \mid x)$, which we call the conditional atom model given visual inputs. To establish the oracle, we assume that the atom model is given, which amounts to treating $p(a \mid x)$ as a Dirac delta function that assigns all the probability mass to the observed atoms $a^*$. In other words, $p(y \mid x) \approx p(y \mid a^*)$.

Therefore, with $a^*$ fully observed, the task of image and video captioning reduces to the task of language modeling conditioned on atoms. This is arguably a much easier task than directly modeling $p(y \mid x)$; a well-trained conditional model $p(y \mid a^*)$ can therefore be treated as a performance oracle for it. The information contained in $a^*$ directly influences the difficulty of modeling $p(y \mid a^*)$. For instance, if no atoms are available, $p(y \mid a^*)$ reduces to unconditional language modeling, which can be considered a lower bound of the oracle. By increasing the amount of information $a^*$ carries, the modeling of $p(y \mid a^*)$ becomes more and more straightforward.

3.1 Oracle Parameterization

Given a set of atoms $\mathcal{A}$ that summarizes the visual concepts appearing in the visual inputs $x$, this section describes the detailed parameterization of the model $p_{\theta}(y \mid \mathcal{A})$, with $\theta$ denoting the overall parameters. In particular, we adopt the commonly used encoder-decoder framework (Cho et al., 2014) to model this conditional based on the simple factorization $p_{\theta}(y \mid \mathcal{A}) = \prod_{t=1}^{T} p_{\theta}(y_t \mid y_{<t}, \mathcal{A})$.

Recurrent neural networks (RNNs) are natural choices when the outputs are sequences. We borrow the recent success of a variant of RNNs called long short-term memory networks (LSTMs), first introduced in Hochreiter and Schmidhuber (1997), formulated as

    $\mathbf{h}_t, \mathbf{c}_t = \mathrm{LSTM}(\mathbf{x}_t, \mathbf{h}_{t-1}, \mathbf{c}_{t-1})$,    (1)

where $\mathbf{h}_t$ and $\mathbf{c}_t$ represent the RNN state and the memory of the LSTM at timestep $t$, respectively. Combined with the atom representation $\phi(\mathcal{A})$, Eq. (1) is implemented with the input $\mathbf{x}_t = [\mathbf{E}[y_{t-1}];\, \phi(\mathcal{A})]$, where $\mathbf{E}$ denotes the word embedding matrix, as opposed to the atom embedding matrix $\mathbf{F}$ of Section 3.2, and the LSTM gate weights and biases are parameters of the LSTM. With the LSTM state $\mathbf{h}_t$, the probability of the next word in the sequence is

    $p_{\theta}(y_t \mid y_{<t}, \mathcal{A}) = \mathrm{softmax}(\mathbf{W}_o \mathbf{h}_t + \mathbf{b}_o)$,

with output parameters $\mathbf{W}_o$ and $\mathbf{b}_o$. The overall training criterion of the oracle is

    $\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}\big(y^{(n)} \mid \mathcal{A}^{(n)}\big)$,    (2)

given $N$ training pairs $(\mathcal{A}^{(n)}, y^{(n)})$, where $\theta$ represents the parameters of the LSTM and the embedding matrices.
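To make the parameterization above concrete, here is a minimal sketch of an atom-conditioned LSTM decoder in PyTorch (the original work used Theano). The layer sizes, the choice of summing atom embeddings into a single bag vector, the concatenation of that vector with the word embedding at every timestep, and all names are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class OracleDecoder(nn.Module):
    """Conditional language model p(y_t | y_<t, A): an LSTM fed the previous
    word embedding concatenated with a fixed bag-of-atoms vector."""

    def __init__(self, vocab_size, num_atoms,
                 word_dim=256, atom_dim=256, hidden_dim=512, pad_idx=0):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, word_dim, padding_idx=pad_idx)  # word embedding matrix E
        self.atom_emb = nn.Embedding(num_atoms, atom_dim)                        # atom embedding matrix F
        self.lstm = nn.LSTM(word_dim + atom_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)                             # logits for the next word

    def forward(self, captions, atom_ids):
        # captions: (batch, T) previous-word indices; atom_ids: (batch, k) atom indices
        bag = self.atom_emb(atom_ids).sum(dim=1)              # bag-of-atoms vector, (batch, atom_dim)
        words = self.word_emb(captions)                       # (batch, T, word_dim)
        bag = bag.unsqueeze(1).expand(-1, words.size(1), -1)  # repeat the atom vector at every step
        states, _ = self.lstm(torch.cat([words, bag], dim=-1))
        return self.out(states)                               # (batch, T, vocab_size)

# Training criterion of Eq. (2): maximize the log-likelihood of the ground-truth
# captions, i.e. minimize cross-entropy between predicted and true next words.
model = OracleDecoder(vocab_size=20000, num_atoms=1000)
criterion = nn.CrossEntropyLoss(ignore_index=0)  # index 0 assumed to be padding
```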

3.2 Atoms Construction

Each configuration of atoms may be associated with a different distribution $p(y \mid \mathcal{A})$, and therefore a different oracle model. We define a configuration as an orderless collection of unique atoms, that is, $\mathcal{A} = \{a_1, \dots, a_k\}$, where $k$ is the size of the bag and all items in the bag are distinct. Considering the particular problem of image and video captioning, atoms are defined as words in captions that are most related to actions, entities, and attributes of entities (Figure 1). The choice of these three language components as atoms is not arbitrary: they are arguably among the most visually perceivable components when humans describe visual content in natural language. We further verify this by conducting a human evaluation procedure to identify "visual" atoms from this set and show that a dominant majority of them indeed match human visual perception, as detailed in Section 5.1. Being able to capture these important concepts is crucial for obtaining superior performance. Therefore, comparing the performance of existing models against this oracle reveals their ability to capture atoms from visual inputs when $a$ is unknown.

A set of atoms is treated as a "bag of words". As with the use of the word embedding matrix in neural language modeling (Bengio et al., 2003), each atom $a_i$ is used to index the atom embedding matrix $\mathbf{F}$ to obtain a vector representation $\mathbf{F}[a_i]$. The representation of the entire set of atoms, $\phi(\mathcal{A})$, is then obtained by aggregating these individual atom embeddings into a single vector.

4 Contributing factors of the oracle

The formulation of Section 3 is generic, relying only on the assumption of a two-step visual captioning process and independent of the parameterization in Section 3.1. In practice, however, one needs to take into account several factors that contribute to the oracle.

Firstly, atoms, or visual concepts, may be defined as 1-gram words, 2-gram phrases, and so on. Arguably, a mixture of N-gram representations has the potential to capture more complicated correlations among visual concepts. For simplicity, this work uses only 1-gram representations, as detailed in Section 5.1. Secondly, the procedure used to extract atoms needs to be reliable, extracting mainly visual concepts and leaving out non-visual ones. To ensure this, the procedure used in this work is verified with human evaluation, as detailed in Section 5.1. Thirdly, the modeling capacity of the conditional language model has a direct influence on the obtained oracle; Section 3.1 shows one of many possible parameterizations. Lastly, the oracle may be sensitive to the training procedure and its hyper-parameters (see Section 5.2).

While it is therefore important to keep in mind that the proposed oracle is conditioned on the above factors, we show in the experimental section that, quite surprisingly, even with the simplest procedure and parameterization the oracle serves its purpose reasonably well.

5 Experiments

We demonstrate the procedure of learning the oracle on three standard visual captioning datasets. MS COCO (Lin et al., 2014) is the most commonly used benchmark dataset in image captioning. It consists of 82,783 training and 40,504 validation images, each accompanied by 5 single-sentence captions. We follow the split used in Xu et al. (2015b), where a subset of 5,000 images is used for validation and another subset of 5,000 images for testing. YouTube2Text is the most commonly used benchmark dataset in video captioning. It consists of 1,970 video clips, each accompanied by multiple captions; overall, there are 80,000 video-caption pairs. Following Yao et al. (2015), it is split into 1,200 clips for training, 100 for validation and 670 for testing. Two further video captioning datasets have recently been introduced by Torabi et al. (2015) and Rohrbach et al. (2015b). Compared with YouTube2Text, both are much larger in the number of video clips, most of which are associated with only one or two captions. They have recently been merged as the official dataset of the Large Scale Movie Description Challenge (https://goo.gl/2hJ4lw); we therefore refer to this combined dataset as LSMDC. The official splits contain 91,908 clips for training, 6,542 for validation and 10,053 for testing.

5.1 Atom extraction

Figure 1: Given ground truth captions, three categories of visual atoms (entity, action and attribute) are automatically extracted using an NLP parser. "NA" denotes the empty atom set.

Visual concepts in images or videos are summarized as atoms that are provided to the caption language model. They are split into three categories: actions, entities, and attributes. To identify these three classes, we use the Stanford natural language parser (http://goo.gl/lSvPr) to extract them automatically. After a caption is parsed, we apply simple heuristics based on the tags produced by the parser, ignoring the phrase- and sentence-level tags (complete list of tags: https://goo.gl/fU8zDd): words tagged with {"NN", "NNP", "NNPS", "NNS", "PRP"} are used as entity atoms; words tagged with {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"} as action atoms; and words tagged with {"JJ", "JJR", "JJS"} as attribute atoms. After atoms are identified, they are lemmatized with the NLTK lemmatizer (http://www.nltk.org/) to unify them into their dictionary form (available at https://goo.gl/t7vtFj). Figure 1 illustrates some results. We extracted atoms for COCO, YouTube2Text and LSMDC. This gives 14,207 entities, 4,736 actions and 8,671 attributes for COCO; 6,922 entities, 2,561 actions and 2,637 attributes for YouTube2Text; and 12,895 entities, 4,258 actions and 8,550 attributes for LSMDC. Note that although the total number of atoms in each category may be large, atom frequency varies, and the language parser does not guarantee perfect tags. Therefore, when atoms are used to train the oracle, we sort them by frequency and use the more frequent ones first, giving priority to atoms with larger coverage, as detailed in Section 5.2 below.
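To illustrate the extraction pipeline, the sketch below substitutes NLTK's off-the-shelf POS tagger for the Stanford parser used above; the tag sets mirror the heuristics just described, but the tagger choice and the example caption are assumptions for illustration only.

```python
import nltk
from nltk.stem import WordNetLemmatizer

# Requires: nltk.download('punkt'), nltk.download('averaged_perceptron_tagger'),
#           nltk.download('wordnet')
ENTITY_TAGS = {"NN", "NNP", "NNPS", "NNS", "PRP"}
ACTION_TAGS = {"VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
ATTRIBUTE_TAGS = {"JJ", "JJR", "JJS"}

lemmatizer = WordNetLemmatizer()

def extract_atoms(caption):
    """Split a caption into entity / action / attribute atoms via POS heuristics."""
    atoms = {"entity": set(), "action": set(), "attribute": set()}
    for word, tag in nltk.pos_tag(nltk.word_tokenize(caption)):
        word = word.lower()
        if tag in ENTITY_TAGS:
            atoms["entity"].add(lemmatizer.lemmatize(word, pos="n"))
        elif tag in ACTION_TAGS:
            atoms["action"].add(lemmatizer.lemmatize(word, pos="v"))
        elif tag in ATTRIBUTE_TAGS:
            atoms["attribute"].add(lemmatizer.lemmatize(word, pos="a"))
    return atoms

print(extract_atoms("A young girl is riding a brown horse on the beach"))
# e.g. entities {'girl', 'horse', 'beach'}, actions {'be', 'ride'}, attributes {'young', 'brown'}
```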

We conducted a simple human evaluation (details available at https://goo.gl/t7vtFj) to confirm that the extracted atoms are indeed predominantly visual. As it is impractical to evaluate all extracted atoms for all three datasets, we focus on the 150 most frequent atoms. This evaluation is intended to match the last column of Table 2, where current state-of-the-art models have the equivalent capacity of perfectly capturing fewer than 100 atoms from each of the three categories. Subjects were asked to cast their votes independently, and the final decision on whether an atom is visual was made by majority vote. Table 1 shows the ratio of atoms flagged as visual by this procedure.

              entities  actions  attributes
COCO              92%      85%         81%
YouTube2Text      95%      91%         72%
LSMDC             83%      87%         78%
Table 1: Human evaluation of the proportion of atoms that are voted as visual. It is clear that the extracted atoms in all three categories contain a dominant amount of visual elements, verifying the procedure described in Section 3.2. Another observation is that entities and actions tend to be more visual than attributes according to human perception.

5.2 Training

After the atoms are extracted, they are sorted according to the frequency with which they appear in the dataset, with the most frequent one leading the sorted list. Taking the first $k$ items from this list gives the top-$k$ most frequent atoms, forming a bag of atoms denoted by $\mathcal{A}_k$, where $k$ is the size of the bag. Conditioned on the atom bag, the oracle objective in Eq. (2) is maximized.
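A minimal sketch of how the frequency-sorted atom bag can be built; the function and variable names are hypothetical.

```python
from collections import Counter

def top_k_atoms(per_caption_atoms, k):
    """per_caption_atoms: list of atom sets, one per training caption.
    Returns the k most frequent atoms across the corpus."""
    counts = Counter(atom for atoms in per_caption_atoms for atom in atoms)
    return [atom for atom, _ in counts.most_common(k)]

def restrict_to_bag(caption_atoms, bag):
    """Keep only the atoms of a caption that fall inside the chosen top-k bag."""
    return sorted(set(caption_atoms) & set(bag))
```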

To form captions, we used vocabularies of size 20k, 13k and 25k for COCO, YouTube2Text and LSMDC respectively. For all three datasets, models were trained on the training set with different configurations of (1) atom embedding size, (2) word embedding size and (3) LSTM state and memory size. To avoid overfitting, we also experimented with weight decay and dropout (Hinton et al., 2012) to regularize models of different sizes. In particular, we performed random hyper-parameter search (Bergstra and Bengio, 2012) over (1), (2) and (3), over the weight decay coefficient, and over whether or not to use dropout. Optimization was performed by SGD with a minibatch size of 128 and Adadelta (Zeiler, 2012) to automatically adjust the per-parameter learning rate. Model selection was done on the standard validation set, with an early stopping patience of 2,000 (training stops if no improvement is made after 2,000 minibatch updates). We report results on the test splits.
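Since the exact search ranges are not reproduced above, the sketch below only illustrates the random-search recipe of Bergstra and Bengio (2012) with placeholder ranges; none of the specific values should be read as the ones used in the paper.

```python
import random

def sample_config():
    # All ranges below are illustrative placeholders, not the paper's actual search space.
    return {
        "atom_dim":     random.choice([128, 256, 512]),
        "word_dim":     random.choice([128, 256, 512]),
        "hidden_dim":   random.choice([256, 512, 1024]),
        "weight_decay": 10 ** random.uniform(-6, -3),
        "use_dropout":  random.choice([True, False]),
    }

# Train one model per sampled configuration and keep the best one on validation.
configs = [sample_config() for _ in range(50)]
```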

5.3 Interpretation

Figure 2: Learned oracles on COCO (left), YouTube2Text (middle) and LSMDC (right). The number of atoms is varied along the x-axis and the oracle scores on the test sets are shown on the y-axis. The first row shows the oracles on BLEU and METEOR with atoms drawn from all three categories. The second row shows the oracles when atoms are selected individually from each category. CIDEr is used for COCO and YouTube2Text as each test example is associated with multiple ground truth captions, as argued in (Vedantam et al., 2015). For LSMDC, METEOR is used, as argued by Rohrbach et al. (2015a).

All three metrics, BLEU, METEOR and CIDEr, are computed with the Microsoft COCO Evaluation Server (Chen et al., 2015). Figure 2 summarizes the learned oracles as the number of atoms $k$ increases.
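For reference, metric computation with the COCO caption evaluation toolkit looks roughly like the following; the `pycocoevalcap` package is a pip-installable port of the code behind the evaluation server, and the toy captions are made up. (The official pipeline also runs the PTB tokenizer first, which is omitted here.)

```python
from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor   # requires Java
from pycocoevalcap.cider.cider import Cider

# gts: image/video id -> list of reference captions
# res: image/video id -> list containing one generated caption
gts = {0: ["a man is riding a horse", "a person rides a brown horse"]}
res = {0: ["a man rides a horse"]}

for name, scorer in [("BLEU", Bleu(4)), ("METEOR", Meteor()), ("CIDEr", Cider())]:
    score, _ = scorer.compute_score(gts, res)
    print(name, score)   # Bleu(4) returns a list [B1, B2, B3, B4]
```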

Comparing oracle performance with existing models

We compare the current state-of-the-art models against the established oracles in Figure 2. Table 2 shows the comparison on the three datasets. With Figure 2, one can easily associate a particular performance with the equivalent number of atoms perfectly captured across the three atom categories, as illustrated in Table 2, where each reported score is shown next to the corresponding oracle score. It is somewhat surprising that state-of-the-art models reach performance equivalent to perfectly capturing only a small number of atoms in the "ENT" and "ALL" settings. This experiment highlights the shortcomings of state-of-the-art visual models: by improving them, we could close the performance gap that currently exists with respect to the oracles.

                                       B1            B4            M             C          ENT   ACT   ATT   ALL
COCO (Qi Wu et al., 2015)          0.74 / 0.80   0.31 / 0.35   0.26 / 0.30   0.94 / 1.4     200  2100  4000    50
YouTube2Text (Yu et al., 2015)     0.815 / 0.88  0.499 / 0.58  0.326 / 0.40  0.658 / 1.2     60   500  1900    20
LSMDC (Venugopalan et al., 2015a)  N/A / 0.45    N/A / 0.12    0.07 / 0.22   N/A / N/A       40    50  4000    10

Table 2: Measuring the semantic capacity of current state-of-the-art models. Using Figure 2, one can map a reported metric to the number of visual atoms captured, which establishes an equivalence between a model, the proposed oracle and the model's semantic capacity. ("ENT" for entities, "ACT" for actions, "ATT" for attributes, "ALL" for all three categories combined; "B1" for BLEU-1, "B4" for BLEU-4, "M" for METEOR, "C" for CIDEr. Note that CIDEr lies between 0 and 10 according to Vedantam et al. (2015). In each metric cell, the reported score is followed by the learned oracle score; the last four columns give the equivalent number of perfectly captured atoms.)

Quantifying the diminishing return

As the number of atoms in $\mathcal{A}_k$ increases, one would expect the oracle to improve accordingly. It is, however, not clear how fast this improvement is. In other words, the gain in performance may not be proportional to the number of atoms provided when generating captions, due to atom frequencies and language modeling. Figure 2 quantifies this effect: the oracle on all three datasets shows a significant gain at the beginning, which diminishes quickly as more and more atoms are used.

Row 2 of Figure 2 also highlights the difference among actions, entities and attributes in generating captions. For all three datasets tested, entities play a much more important role, generally even more so than action atoms. This is particularly true on LSMDC, where the gain from modeling attributes is much smaller than for the other two categories.

Although visual atoms dominate the three atom categories, as shown in Section 5.1, as their number increases, more and more non-visual atoms may be included, such as "living", "try", "find" and "free", which are relatively difficult to associate with a particular part of the visual inputs. Excluding non-visual atoms from the conditional language model would further tighten the oracle bound, as fewer hints would be provided to it. The major difficulty lies in the labor of hand-separating visual atoms from non-visual ones, which, to the best of our knowledge, is difficult to automate with heuristics.

Atom accuracy versus atom quantity

We have so far assumed that the atoms are given, or in other words, that the prediction accuracy of atoms is 100%. In reality, one would hardly expect a perfect atom classifier. There is naturally a trade-off between the number of atoms one would like to capture and their prediction accuracy. Figure 3 quantifies this trade-off on COCO and LSMDC. It also indicates the upper limit of performance given different levels of atom prediction accuracy. In particular, we replace $\mathcal{A}_k$ by a corrupted version in which a portion $e$ of the atoms are randomly selected and replaced by other randomly picked atoms not appearing in $\mathcal{A}_k$. The case of $e = 0$ corresponds to the oracles shown in Figure 2, and the larger the ratio $e$, the worse the assumed atom prediction is. The value of $e$ is shown in the legend of Figure 3. According to the figure, in order to improve the caption generation score, one has two options: either keep the number of atoms fixed while improving the atom prediction accuracy, or keep the accuracy while increasing the number of included atoms. As state-of-the-art visual models already model around 1,000 atoms, we hypothesize that there is more to gain from improving atom prediction accuracy than from increasing the number of atoms detected by those models.
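A minimal sketch of the corruption procedure described above; the function name and the uniform sampling of replacement atoms are assumptions.

```python
import random

def corrupt_atom_bag(atom_bag, atom_vocabulary, e):
    """Randomly replace a fraction e of the atoms in atom_bag with atoms
    from atom_vocabulary that do not already appear in the bag."""
    corrupted = list(atom_bag)
    n_replace = int(round(e * len(corrupted)))
    outside = [a for a in atom_vocabulary if a not in set(corrupted)]
    for i in random.sample(range(len(corrupted)), n_replace):
        corrupted[i] = random.choice(outside)
    return corrupted

# e = 0.0 reproduces the oracles of Figure 2; larger e simulates noisier atom prediction.
noisy = corrupt_atom_bag(["dog", "run", "park"],
                         ["dog", "run", "park", "cat", "ball", "red"], e=0.33)
```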

Figure 3: Learned oracles with different atom prediction accuracies (error ratio $e$, shown in red) and atom quantities (x-axis) on COCO (left) and LSMDC (right). The number of atoms is varied along the x-axis and the oracle scores on the test sets are shown on the y-axis. CIDEr is used for COCO and METEOR for LSMDC. The figure shows that one can increase the score either by improving the atom prediction accuracy with a fixed number of atoms or by increasing the number of atoms. It also shows the maximal error bearable for a given score.

Intrinsic difficulties of particular datasets

Figure 2 also reveals intrinsic properties of each dataset. In general, the bounds on YouTube2Text are much higher than on COCO, with LSMDC the lowest. For instance, from the first column of the figure, taking 10 atoms respectively, BLEU-4 is around 0.15 for COCO, 0.30 for YouTube2Text and less than 0.05 for LSMDC. With little visual information to condition upon, a strong language model is required, which makes a dramatic difference across the three datasets. Therefore the oracle, when compared across different datasets, offers an objective measure of their difficulty for the captioning task.

6 Discussion

This work formulates oracle performance for visual captioning. The oracle is constructed with the assumption of decomposing visual captioning into two consecutive steps. We have assumed the perfection of the first step where visual atoms are recognized, followed by the second step where language models conditioned on visual atoms are trained to maximize the probability of given captions. Such an empirical construction requires only automatic atom parsing and the training of conditional language models, without extra labeling or costly human evaluation.

Such an oracle enables us to gain insight into several important factors accounting for both the success and failure of current state-of-the-art models. It further reveals model-independent properties of different datasets. We furthermore relax the assumption of perfect atom prediction, which sheds light on a trade-off between atom accuracy and atom coverage, providing guidance for future research in this direction. Importantly, our experimental results suggest that more effort is required in step one, where visual inputs are converted into visual concepts (atoms).

Despite its effectiveness shown in the experiments, the empirical oracle is constructed with the simplest atom extraction procedure and model parameterization in mind, which makes it in a sense a "conservative" oracle.

Acknowledgments

The authors would like to acknowledge the support of the following agencies for research funding and computing support: IBM T.J. Watson Research, NSERC, Calcul Québec, Compute Canada, the Canada Research Chairs and CIFAR. We would also like to thank the developers of Theano (Theano Development Team, 2016) for developing such a powerful tool for scientific computing.

References

  • Ballas et al. (2016) Nicolas Ballas, Li Yao, Chris Pal, and Aaron Courville. Delving deeper into convolutional networks for learning video representations. ICLR, 2016.
  • Barbu et al. (2012) A. Barbu, A. Bridge, Z. Burchill, D. Coroian, S. Dickinson, S. Fidler, A. Michaux, S. Mussman, S. Narayanaswamy, D. Salvi, et al. Video in sentences out. UAI, 2012.
  • Bengio et al. (2015) Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099, 2015.
  • Bengio et al. (2003) Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155, 2003.
  • Bergstra and Bengio (2012) James Bergstra and Yoshua Bengio. Random search for hyper-parameter optimization. JMLR, 2012.
  • Chen et al. (2015) Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollar, and C Lawrence Zitnick. Microsoft coco captions: Data collection and evaluation server. arXiv 1504.00325, 2015.
  • Cho et al. (2014) Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. Learning phrase representations using rnn encoder-decoder for statistical machine translation. EMNLP, 2014.
  • Devlin et al. (2015a) Jacob Devlin, Hao Cheng, Hao Fang, Saurabh Gupta, Li Deng, Xiaodong He, Geoffrey Zweig, and Margaret Mitchell. Language models for image captioning: The quirks and what works. arXiv preprint arXiv:1505.01809, 2015a.
  • Devlin et al. (2015b) Jacob Devlin, Saurabh Gupta, Ross Girshick, Margaret Mitchell, and C Lawrence Zitnick. Exploring nearest neighbor approaches for image captioning. arXiv preprint arXiv:1505.04467, 2015b.
  • Donahue et al. (2015) Jeff Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. CVPR, 2015.
  • Fang et al. (2015) Hao Fang, Saurabh Gupta, Forrest Iandola, Rupesh Srivastava, Li Deng, Piotr Dollár, Jianfeng Gao, Xiaodong He, Margaret Mitchell, John Platt, et al. From captions to visual concepts and back. CVPR, 2015.
  • Guadarrama et al. (2013) Sergio Guadarrama, Niveda Krishnamoorthy, Girish Malkarnenkar, Subhashini Venugopalan, Raymond Mooney, Trevor Darrell, and Kate Saenko. Youtube2text: Recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition. In ICCV, 2013.
  • Hinton et al. (2012) Geoffrey E Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580, 2012.
  • Hochreiter and Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.
  • Hodosh et al. (2013) Micah Hodosh, Peter Young, and Julia Hockenmaier. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 2013.
  • Jia et al. (2015) Xu Jia, Efstratios Gavves, Basura Fernando, and Tinne Tuytelaars. Guiding long-short term memory for image caption generation. arXiv preprint arXiv:1509.04942, 2015.
  • Karpathy and Fei-Fei (2014) Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In CVPR, 2014.
  • Kiros et al. (2014) Ryan Kiros, Ruslan Salakhutdinov, and Richard S Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • Kojima et al. (2002) Atsuhiro Kojima, Takeshi Tamura, and Kunio Fukunaga. Natural language description of human activities from video images based on concept hierarchy of actions. IJCV, 2002.
  • Kulkarni et al. (2013) Girish Kulkarni, Visruth Premraj, Vicente Ordonez, Sagnik Dhar, Siming Li, Yejin Choi, Alexander C Berg, and Tamara L Berg. Babytalk: Understanding and generating simple image descriptions. PAMI, 2013.
  • Kuznetsova et al. (2012) Polina Kuznetsova, Vicente Ordonez, Alexander C Berg, Tamara L Berg, and Yejin Choi. Collective generation of natural image descriptions. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1, pages 359–368. Association for Computational Linguistics, 2012.
  • Lin et al. (2014) Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014, pages 740–755. Springer, 2014.
  • Mao et al. (2015) Junhua Mao, Wei Xu, Yi Yang, Jiang Wang, and Alan Yuille. Deep captioning with multimodal recurrent neural networks (m-rnn). ICLR, 2015.
  • Mitchell et al. (2012) Margaret Mitchell, Xufeng Han, Jesse Dodge, Alyssa Mensch, Amit Goyal, Alex Berg, Kota Yamaguchi, Tamara Berg, Karl Stratos, and Hal Daumé III. Midge: Generating image descriptions from computer vision detections. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, pages 747–756. Association for Computational Linguistics, 2012.
  • Qi Wu et al. (2015) Qi Wu, Chunhua Shen, Anton van den Hengel, Lingqiao Liu, and Anthony Dick. What value high level concepts in vision to language problems? arXiv 1506.01144, 2015.
  • Rohrbach et al. (2015a) Anna Rohrbach, Marcus Rohrbach, and Bernt Schiele. The long-short story of movie description. GCPR, 2015a.
  • Rohrbach et al. (2015b) Anna Rohrbach, Marcus Rohrbach, Niket Tandon, and Bernt Schiele. A dataset for movie description. CVPR, 2015b.
  • Rohrbach et al. (2013) Marcus Rohrbach, Wei Qiu, Ivan Titov, Stefan Thater, Manfred Pinkal, and Bernt Schiele. Translating video content to natural language descriptions. In ICCV, 2013.
  • Theano Development Team (2016) Theano Development Team. Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints, abs/1605.02688, May 2016. URL http://arxiv.org/abs/1605.02688.
  • Torabi et al. (2015) Atousa Torabi, Christopher Pal, Hugo Larochelle, and Aaron Courville. Using descriptive video services to create a large data source for video annotation research. arXiv: 1503.01070, 2015.
  • Vedantam et al. (2015) Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. CIDEr: Consensus-based image description evaluation. CVPR, 2015.
  • Venugopalan et al. (2015a) Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue, Raymond Mooney, Trevor Darrell, and Kate Saenko. Sequence to sequence – video to text. In ICCV, 2015a.
  • Venugopalan et al. (2015b) Subhashini Venugopalan, Huijuan Xu, Jeff Donahue, Marcus Rohrbach, Raymond Mooney, and Kate Saenko. Translating videos to natural language using deep recurrent neural networks. NAACL, 2015b.
  • Vinyals et al. (2014) Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. CVPR, 2014.
  • Xu et al. (2015a) Huijuan Xu, Subhashini Venugopalan, Vasili Ramanishka, Marcus Rohrbach, and Kate Saenko. A multi-scale multiple instance video description network. arXiv 1505.05914, 2015a.
  • Xu et al. (2015b) Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. ICML, 2015b.
  • Yao et al. (2015) Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, and Aaron Courville. Describing videos by exploiting temporal structure. In ICCV, 2015.
  • Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. TACL, 2014.
  • Yu et al. (2015) Haonan Yu, Jiang Wang, Zhiheng Huang, Yi Yang, and Wei Xu. Video paragraph captioning using hierarchical recurrent neural networks. arXiv 1510.07712, 2015.
  • Zeiler (2012) Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. Technical report, 2012.