Text2Scene: Generating Abstract Scenes from Textual Descriptions
In this paper, we propose an end-to-end model that learns to interpret natural language describing a scene to generate an abstract pictorial representation. The pictorial representations generated by our model comprise the spatial distribution and attributes of the objects in the described scene. Our model uses a sequence-to-sequence network with a double attentive mechanism and introduces a regularization strategy. These scene representations can be sampled from our model in the same way that text is sampled from language-generation models. We show that the proposed model, initially designed to handle the generation of cartoon-like pictorial representations in the Abstract Scenes Dataset, can also handle, with minimal modifications, the generation of semantic layouts corresponding to real images in the COCO dataset. Human evaluations using a visual entailment task show that pictorial representations generated with our full model can entail at least one out of three input visual descriptions 94% of the time, and at least two out of three 62% of the time for each image.
Language comprehension is a long standing goal in Artificial Intelligence. Understanding natural language can enable systems that seamlessly communicate with humans, perform actions through complex natural language commands, or answer complex questions from knowledge contained in large amounts of text. A special case is the language describing visual information such as scenes. Understanding language referring to objects in a scene might require building an intermediate representation that is also visual in nature. Moreover, visual grounding of language can directly impact tasks in robotics, computer-aided design, and image search and retrieval.
Our paper focuses on the textual description of a scene and proposes a task of abstract scene generation to demonstrate this type of language understanding. We introduce a data-driven end-to-end model to interpret important semantics in visually descriptive language in order to generate pictorial representations. We specifically focus on generating a scene representation consisting of a list of objects, along with their attributes and locations, given a set of natural language utterances describing what is true or known about the scene. Generating such a representation is challenging for three reasons. Input textual descriptions might only indirectly hint at the presence of attributes (e.g. “Mike is surprised” should change facial attributes on the generated object “Mike”). They also frequently contain complex information about relative spatial configurations (e.g. “Jenny is running towards Mike and the duck” makes the orientation of “Jenny” dependent on the positions of “Mike” and “duck”). Finally, they could hint at the presence of objects only indirectly through the mention of an activity that implies the object or group of objects (e.g. “some people” hints at the presence of several objects of type “person”). These examples are illustrated in Figure 1.
We model this text-to-scene task with a sequence-to-sequence approach where objects are placed sequentially on an initially empty canvas. More specifically, our proposed model, named TEXT2SCENE, consists of a text encoder which maps input sentences to a set of embedding representations, an object decoder which predicts the next object conditioned on the current scene state and input representations, and an attribute decoder which determines the attributes of the predicted object. Our model was originally designed and trained to handle the generation of abstract scenes as introduced in the dataset of . We further show that our model can also be trained to generate semantic object layouts corresponding to real images in the COCO dataset [16] while only requiring minimal modifications. COCO scenes do not contain as rich a set of attribute annotations for each object but generally exhibit a larger variation in spatial configurations for objects.
Another challenge in our setup is evaluating the generated output from this task. For instance, image descriptions in the Abstract Scenes dataset might not explicitly refer to all the visual elements in the corresponding “ground truth” scene (e.g. in Figure 1 “table” is in the scene but is not mentioned in the text). In this case, the generated output is still valid as long as it does not negate any of the input statements, so the corresponding scenes in the datasets can only be considered as a reference output. Therefore, we introduce a set of metrics inspired by metrics used for machine translation to quantify the alignment between semantic frames in language and corresponding visual features. Additionally, we perform human evaluations in the form of a visual entailment task where humans need to judge whether a generated scene negates any of the input language descriptions when considering them as propositional statements.
We conduct extensive quantitative and qualitative analysis on two distinct datasets (Abstract Scenes and COCO), and the results show the ability of our models to generate scenes with multiple objects and complex semantic relations. Our TEXT2SCENE model compares favorably against the parsing + conditional random field (CRF) model introduced in [32], and thus is the state-of-the-art on this benchmark. Finally, to the best of our knowledge, TEXT2SCENE is the first model proposed for this type of task that shows positive results on both abstract images and semantic object layouts for real images, thus opening the possibility for future work on transfer learning across domains.
Our main contributions can be summarized as follows:
- We propose an end-to-end trainable approach based on a sequence-to-sequence model to generate a pictorial representation of input language describing a scene.
- We show that the same model devised for the task of generating abstract scenes can also be reasonably used to generate semantic object layouts of real images.
2 Related Work
Most research on visually descriptive language has focused on the task of image captioning or mapping images to text [5, 18, 13, 10, 27, 28, 19, 2]. However, there has been less work in the opposite direction of mapping text to images, that is, using text as an input to synthesize images [9, 22, 30, 23]. Most of these recent approaches have leveraged conditional Generative Adversarial Networks (cGANs) to produce controllable image synthesis conditioned on text. While these works have managed to generate results of increasing quality, there are major challenges when attempting to synthesize images for complex scenes with multiple interacting objects. In our work, we do not attempt to produce pixel-level output but rather abstract representations that contain the semantic elements composing the scene and are devoid of specific texture-level image information. Our method can be considered as an intermediate step for the generation of more complex scenes.
Our work is also related to prior research on using abstract scenes to mirror and analyze complex situations in the real world [31, 32, 6, 26]. The most closely related work is [32], where a graphical model was introduced to generate an abstract scene from the input textual description of the scene. Unlike this previous work, our method does not use a semantic parser to obtain a set of tuples but is trained directly from input sentences in an end-to-end fashion. Moreover, we show our method compares favorably to this previous work. Our work is also related to [29], where a sequence-to-sequence model is proposed to transform an abstract scene layout to text. Our work poses the problem in the opposite direction and demonstrates good results on both cartoon-like scenes and semantic layouts corresponding to real images.
Most closely related to our approach are the recent works of [8] and [11], as these works also exploit some form of abstract scene generation. [8] targets image synthesis using conditional GANs (pixel-level output) but, unlike prior works, it generates semantic object layouts as an intermediate step. In our work, we specifically focus on layout generation as an isolated task, and our method works for both cartoon-like scenes and scenes corresponding to real images. The work of [11] performs pictorial generation from a chat log, which makes the task less ambiguous as chat logs provide more detailed and specific descriptions of the scene compared to image captions. In our work, the descriptions used are considerably more underspecified, so our proposed model has to infer its output by also leveraging commonly occurring patterns across other images. Additionally, our model is the only one targeting both the type of abstract scenes used in , and semantic object layouts corresponding to real images as in [8, 23] under a unified framework.
Our TEXT2SCENE model generates a scene on an initially empty two-dimensional canvas sequentially in three steps: (1) given the current two-dimensional canvas, the model attends to its input text to decide which object to add to the canvas next, or decides that the generation should end; (2) if the decision is to add a new object, the model zooms in on the local context of the object in the input text to determine its attributes (e.g. pose, size) and relations with its surroundings (e.g. location, interactions with other objects); (3) given the attributes extracted in the previous step, the model refers back to the canvas and grounds the extracted textual attributes into their corresponding visual representations.
In order to model this process, we adopt an encoder-decoder sequence-to-sequence approach [24]. Specifically, TEXT2SCENE consists of a text encoder which takes as input a sequence of words (section 3.1), an object decoder which sequentially predicts objects, and an attribute decoder that predicts for each object its location and set of attributes (section 3.2). Additionally, we incorporate a regularization that encourages attending to most words in the input text so that no referred objects are missing in the output scene (section 3.3). Figure 2 shows an example of a step-by-step generation of our model.
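The sequential procedure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: `generate_scene`, `toy_object_decoder` and `toy_attribute_decoder` are hypothetical names, and the toy decoders are rule-based stand-ins for the learned object and attribute decoders.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: an object is a (name, attributes) pair.
SceneObject = Tuple[str, dict]


def generate_scene(
    text: List[str],
    predict_object: Callable[[List[SceneObject], List[str]], Optional[str]],
    predict_attributes: Callable[[str, List[SceneObject], List[str]], dict],
    max_steps: int = 10,
) -> List[SceneObject]:
    """Place objects one at a time on an initially empty canvas."""
    canvas: List[SceneObject] = []
    for _ in range(max_steps):
        # Step 1: decide the next object, or stop (None plays the role of an end token).
        obj = predict_object(canvas, text)
        if obj is None:
            break
        # Steps 2-3: read the object's attributes from the text and ground them on the canvas.
        attrs = predict_attributes(obj, canvas, text)
        canvas.append((obj, attrs))
    return canvas


# Toy stand-ins for the learned decoders (assumptions, not the trained model).
def toy_object_decoder(canvas, text):
    placed = {name for name, _ in canvas}
    for word in text:
        if word in {"mike", "jenny", "duck"} and word not in placed:
            return word
    return None


def toy_attribute_decoder(obj, canvas, text):
    # Put each new object in the next free column of a made-up grid.
    return {"location": (len(canvas), 0)}


scene = generate_scene(
    "jenny is running towards mike and the duck".split(),
    toy_object_decoder,
    toy_attribute_decoder,
)
```

The loop stops either when the object decoder signals the end of generation or when `max_steps` is reached, mirroring the maximum output length used for padding in the experiments.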
3.1 Text Encoder
Our text encoder consists of a bidirectional recurrent neural network with Gated Recurrent Units (GRUs). For a given input text with $n$ words, we compute for each word $x_i$:

$$h^E_i = \mathrm{BiGRU}(x_i, h^E_{i-1}, h^E_{i+1}) \tag{1}$$

where BiGRU is a bidirectional GRU cell, $x_i$ is a word embedding corresponding to the i-th word, and $h^E_i$ is a vector encoding the current word and its context. We use $\{[h^E_i; x_i]\}$ as our encoded input text representation that will be input to our object and attribute decoders.
3.2 Object and Attribute Decoders
We formulate scene generation as a sequential process where at each step $t$, our model predicts the next object $o_t$ and its attributes (location $l_t$ and attribute values $\{R^k_t\}$) from the input text encoding $\{[h^E_i; x_i]\}$ and the current scene state $h^D_t$. For this part, we use a convolutional GRU (ConvGRU), which is similar to a regular GRU but uses convolutional operators to compute its hidden state. Our ConvGRU has two cascaded branches: one for decoding the object $o_t$ and the other for decoding its attributes.
Scene State Representation
We compute the current scene state representation by feeding the current scene canvas $B_t$ to a convolutional neural network (CNN) $\Omega$. The output of $\Omega$ is a $C \times H \times W$ feature map, where $H \times W$ is the spatial dimension and $C$ is the feature dimension. This representation provides a good idea about the current location of all objects in the scene but it might fail to capture small objects. In order to compensate for this, a one-hot indicator vector $o_{t-1}$ of the object predicted in the previous step is also used. We set the initial object $o_0$ using a special start-of-sentence token.
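The temporal states of the generated scene are modeled using a convolutional GRU (ConvGRU) as follows:

$$h^D_t = \mathrm{ConvGRU}(\Omega(B_t), h^D_{t-1}) \tag{2}$$

where $B_t$ is the current scene canvas, $\Omega$ is the scene encoder CNN, and the initial hidden state $h^D_0$ is created by spatially replicating the last hidden state in the text encoder GRU.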
Attention-based Object Decoder
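Our object decoder uses an attention-based model that outputs scores for all possible objects in a vocabulary of objects $\mathcal{V}$. It takes as input at each time step the encoded scene state $h^D_t$, the input text representation $\{[h^E_i; x_i]\}$ and the previously predicted object $o_{t-1}$. Following [28], we adopt a soft-attention mechanism on the visual input $h^D_t$, and following [17], we adopt a soft-attention mechanism on the textual input as follows:

$$u^o_t = \mathrm{AvgPooling}(\Psi^o(h^D_t)) \tag{3}$$

$$c^o_t = \Phi^o([u^o_t; o_{t-1}], \{[h^E_i; x_i]\}) \tag{4}$$

$$p(o_t) \propto \Theta^o([u^o_t; o_{t-1}; c^o_t]) \tag{5}$$

where $\Psi^o$ is a CNN with spatial attention on the input canvas state $h^D_t$. The attended features are then spatially collapsed into a vector $u^o_t$ using average pooling. $\Phi^o$ is the text-based attention module which uses $[u^o_t; o_{t-1}]$ to attend the input text represented by $\{[h^E_i; x_i]\}$, producing the context vector $c^o_t$. $\Theta^o$ is a two-layer perceptron producing the likelihood of the next object $p(o_t)$ using a softmax function.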
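A minimal sketch of soft attention over the encoded words may help make the text-based attention module concrete. The dot-product score used here is an assumption chosen for brevity (the scoring function of the actual module is not specified in this section), and `softmax` and `attend` are hypothetical helper names.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(
    query: Sequence[float], word_features: Sequence[Sequence[float]]
) -> Tuple[List[float], List[float]]:
    """Dot-product soft attention: returns (attention weights, context vector)."""
    # One score per word: how well the query matches that word's feature vector.
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in word_features]
    weights = softmax(scores)
    dim = len(word_features[0])
    # The context vector is the attention-weighted average of the word features.
    context = [
        sum(w * feat[d] for w, feat in zip(weights, word_features))
        for d in range(dim)
    ]
    return weights, context


weights, context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0], [0.0, 0.0]])
```

Because the weights come from a softmax, they sum to one over the words at every decoding step, which is the property the regularization term relies on.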
Attention-based Attribute Decoder
The attribute set corresponding to object $o_t$ can be predicted from the visual context in $h^D_t$ and the semantic context in the input text $\{[h^E_i; x_i]\}$. We use another attention module $\Phi^a$ to “zoom in” on the context of $o_t$ in the input text, extracting a new context vector $c^a_t$. For each spatial location in $h^D_t$, we train a set of classifiers, predicting both the location likelihood $p(l_t)$ as a special case of attribute, and all other attribute likelihoods $p(R^k_t)$. For the purpose of locations, the possible locations are discretized into positions in a grid to turn the prediction into a classification problem. We use a fixed resolution ($28 \times 28$) in our experiments, as will be explained in the implementation section. In general our attributes are predicted as follows:

$$c^a_t = \Phi^a(o_t, \{[h^E_i; x_i]\}) \tag{6}$$

$$u^a_t = \Psi^a([h^D_t; c^a_t]) \tag{7}$$

$$p(l_t, \{R^k_t\}) = \Theta^a([u^a_t; o_t; c^a_t]) \tag{8}$$

where $\Phi^a$ is the attention module using $o_t$ to attend the input text. $\Psi^a$ is a CNN spatially attending $h^D_t$. In order to combine $c^a_t$ with $h^D_t$, it is spatially replicated and then concatenated along the depth channel. The final likelihood map is predicted from a CNN $\Theta^a$ with softmax classifiers over each value of $l_t$ and $R^k_t$, which takes as input the concatenation along the depth channel of $u^a_t$ and a spatially replicated version of $o_t$ and $c^a_t$.
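Turning location prediction into classification over a 28 × 28 grid can be sketched as follows. The row-major cell indexing and the use of coordinates normalized to [0, 1] are assumptions made for this example; `location_to_index` and `index_to_location` are hypothetical helper names.

```python
GRID = 28  # fixed grid resolution stated in the text


def location_to_index(x: float, y: float, grid: int = GRID) -> int:
    """Map normalized coordinates in [0, 1] to a flat grid cell index (row-major)."""
    col = min(int(x * grid), grid - 1)  # clamp so that x == 1.0 stays inside the grid
    row = min(int(y * grid), grid - 1)
    return row * grid + col


def index_to_location(index: int, grid: int = GRID):
    """Return the normalized (x, y) center of a grid cell."""
    row, col = divmod(index, grid)
    return ((col + 0.5) / grid, (row + 0.5) / grid)
```

With this encoding the location classifier has 28 × 28 = 784 classes, and a predicted class is mapped back to the center of its cell when the object is drawn.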
The purpose of our regularization mechanism is to encourage the attention weights of the decoders to not miss any referred objects in the input text. Let $\alpha^o_{ti}$ and $\alpha^a_{ti}$ be the attention weights on the i-th word at step $t$ from $\Phi^o$ and $\Phi^a$ respectively. For each step $t$, $\sum_i \alpha^o_{ti} = \sum_i \alpha^a_{ti} = 1$, since these are computed using a softmax function. Then we define the following regularization loss which encourages the model to distribute the attention across all the words in the input sentence:

$$L^o_{attn} = \sum_i \Big(1 - \sum_t \alpha^o_{ti}\Big)^2 \tag{9}$$

and similarly we define a regularization $L^a_{attn}$ for the attribute attention weights.
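One way to compute a regularizer with this effect is sketched below. The squared-deviation form is an assumption borrowed from the doubly stochastic attention penalty used in image captioning [28]; `attention_regularizer` is a hypothetical name.

```python
def attention_regularizer(alpha):
    """Penalize words whose total attention over all decoding steps deviates from 1.

    alpha[t][i] is the attention weight on word i at decoding step t;
    each row is assumed to sum to 1 (softmax output).
    """
    num_words = len(alpha[0])
    loss = 0.0
    for i in range(num_words):
        total_i = sum(step[i] for step in alpha)  # attention word i received overall
        loss += (1.0 - total_i) ** 2
    return loss
```

For example, attending to a different word at each of two steps gives a loss of 0, while attending to the same word twice and ignoring the other gives a loss of 2, so ignored words are penalized.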
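The final loss function for a given example in our model with reference values for object, location, and attributes is:

$$L = -w_o \sum_t \log p(o_t) - w_l \sum_t \log p(l_t) - \sum_k w_k \sum_t \log p(R^k_t) + w^o_a L^o_{attn} + w^a_a L^a_{attn} \tag{10}$$

where the first three terms are negative log-likelihood losses corresponding to the object, location, and attribute softmax classifiers, the last two terms are the attention regularization losses, and $w_o$, $w_l$, $w_k$, $w^o_a$, and $w^a_a$ are hyperparameters.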
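The weighted combination of the negative log-likelihood terms and the attention regularizers can be sketched as below. The function names, the dictionary layout and the weight values are all hypothetical and chosen only for illustration; they are not the settings reported in the experiments.

```python
import math


def weighted_nll(probs, weight):
    """Weighted negative log-likelihood of the probabilities assigned to the reference labels."""
    return -weight * sum(math.log(p) for p in probs)


def total_loss(obj_probs, loc_probs, attr_probs, reg_obj, reg_attr, w):
    """Combine object, location and per-attribute NLL terms with two regularizers.

    attr_probs and w["attr"] are dicts keyed by attribute name.
    """
    loss = weighted_nll(obj_probs, w["obj"]) + weighted_nll(loc_probs, w["loc"])
    for name, probs in attr_probs.items():
        loss += weighted_nll(probs, w["attr"][name])
    return loss + w["reg_obj"] * reg_obj + w["reg_attr"] * reg_attr


# Hypothetical weights, for illustration only.
example_weights = {
    "obj": 8.0,
    "loc": 2.0,
    "attr": {"pose": 2.0, "expression": 2.0},
    "reg_obj": 1.0,
    "reg_attr": 1.0,
}
```

A larger weight on the object term than on each attribute term is one way to keep the object and attribute branches on a comparable scale, since the attribute branch contributes several terms per step.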
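Table 1: Architecture of the network modules. $N$ denotes the number of object categories and $|R^k|$ the discretized range of the k-th attribute.
|Module||Input Shape||Operation||Output Shape|
|Scene encoder CNN $\Omega$ (for COCO)||(83, 64, 64)||Conv: (83, 128, 7, 7), stride 2||(128, 32, 32)|
|Scene encoder CNN $\Omega$ (for COCO)||(128, 32, 32)||Residual block, 128 filters||(128, 32, 32)|
|Scene encoder CNN $\Omega$ (for COCO)||(128, 32, 32)||Residual block, 256 filters, stride 2||(256, 16, 16)|
|Scene encoder CNN $\Omega$ (for COCO)||(256, 16, 16)||Bilateral upsampling||(256, 28, 28)|
|Object visual attention $\Psi^o$||(512, 28, 28)||Conv: (512, 256, 3, 3)||(256, 28, 28)|
|Object visual attention $\Psi^o$||(256, 28, 28)||Conv: (256, 1, 3, 3)||(1, 28, 28)|
|Attribute visual attention $\Psi^a$||(1324, 28, 28)||Conv: (1324, 256, 3, 3)||(256, 28, 28)|
|Attribute visual attention $\Psi^a$||(256, 28, 28)||Conv: (256, 1, 3, 3)||(1, 28, 28)|
|Object classifier $\Theta^o$||(1324 + $N$,)||Linear: (1324 + $N$) → 512||(512,)|
|Attribute classifier $\Theta^a$||(1324 + $N$, 28, 28)||Conv: (1324 + $N$, 512, 3, 3)||(512, 28, 28)|
|Attribute classifier $\Theta^a$||(512, 28, 28)||Conv: (512, 256, 3, 3)||(256, 28, 28)|
|Attribute classifier $\Theta^a$||(256, 28, 28)||Conv: (256, 256, 3, 3)||(256, 28, 28)|
|Attribute classifier $\Theta^a$||(256, 28, 28)||Conv: (256, $1 + \sum_k |R^k|$, 3, 3)||($1 + \sum_k |R^k|$, 28, 28)|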
Our model shares several design choices for our experiments on the Abstract Scenes and COCO datasets. We use GloVe [21] as our word embeddings, which are fixed throughout our experiments. The text encoder uses a one-layer bidirectional GRU with a hidden dimension of 256. We discretize the possible values for spatial locations into the indices of a grid of size $28 \times 28$. The visual attention modules $\Psi^o$ and $\Psi^a$ are two-layer ConvNets with 256 and 1 output channels respectively. The object classifier $\Theta^o$ consists of two fully connected layers outputting likelihood scores for all object categories. The attribute classifier $\Theta^a$ consists of four convolutional layers with output channels (512, 256, 256, $1 + \sum_k |R^k|$), where $|R^k|$ denotes the discretized range of the k-th attribute. In the last layer (with a channel size of $1 + \sum_k |R^k|$), the first depth channel encodes the location likelihood of the predicted object. A 2D softmax function is applied to this channel to normalize the likelihoods over the spatial domain. The remaining channels encode the attribute likelihoods for each spatial location. A 1D softmax function is applied to the channels of each attribute along the depth dimension. Every convolutional and fully connected layer, except the last layer, is followed by a ReLU activation function.
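The two normalizations applied to the last layer can be illustrated with plain nested lists. This is a sketch under the assumption that channel 0 holds the location scores and the remaining channels are grouped per attribute in order; `normalize_outputs` is a hypothetical name.

```python
import math


def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def normalize_outputs(logits, attr_sizes):
    """logits[c][y][x]: channel 0 is location, followed by sum(attr_sizes) attribute channels."""
    h, w = len(logits[0]), len(logits[0][0])
    # 2D softmax: normalize the location channel over all spatial positions.
    flat = softmax([v for row in logits[0] for v in row])
    location = [flat[r * w:(r + 1) * w] for r in range(h)]
    # 1D softmax: for each attribute, normalize across its own channels at every position.
    attributes = []
    start = 1
    for size in attr_sizes:
        block = logits[start:start + size]
        probs = [
            [softmax([block[c][y][x] for c in range(size)]) for x in range(w)]
            for y in range(h)
        ]
        attributes.append(probs)
        start += size
    return location, attributes
```

Here `location` sums to one over the whole grid, while `attributes[k][y][x]` is a separate distribution over the values of the k-th attribute at each grid cell.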
For the Abstract Scenes dataset, the scene canvas $B_t$ is represented directly as an RGB image, and the CNN encoder $\Omega$ is a pre-trained ResNet-34 [7], with the last residual group (conv5_x) removed, followed by a bilateral upsampling operation to resize the spatial resolution to $28 \times 28$. We do not finetune $\Omega$ for the Abstract Scenes dataset. We include a BatchNorm layer between each pair of the convolutional and ReLU layers.
For the COCO dataset, the scene canvas $B_t$ is a three-dimensional tensor of shape $N \times H \times W$, where at each spatial location in $B_t$ there is an indicator vector of size $N$, which is the number of object categories. The encoder CNN $\Omega$ consists of 2 residual blocks. The first residual block has two 3x3 convolutional layers as the skip connection. The second residual block consists of one 1x1 convolutional layer with stride 2, and two 3x3 convolutional layers as the skip connection. Table 1 provides detailed information of all the network modules in our model.
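A small sketch of how such an indicator canvas might be built from bounding boxes is given below. The channel-first layout and the 64 × 64 resolution follow the (83, 64, 64) input shape in Table 1; filling each box as a solid rectangle and using normalized box coordinates are assumptions, and `build_layout_canvas` is a hypothetical name.

```python
import math


def build_layout_canvas(objects, num_categories, size=64):
    """Rasterize boxes into a binary canvas indexed as canvas[category][y][x].

    objects: list of (category_index, (x_min, y_min, x_max, y_max)),
    with box coordinates normalized to [0, 1].
    """
    canvas = [[[0] * size for _ in range(size)] for _ in range(num_categories)]
    for category, (x0, y0, x1, y1) in objects:
        rows = range(int(y0 * size), min(size, math.ceil(y1 * size)))
        cols = range(int(x0 * size), min(size, math.ceil(x1 * size)))
        for y in rows:
            for x in cols:
                # Mark this cell as covered by an object of this category.
                canvas[category][y][x] = 1
    return canvas
```

Reading the values across the category axis at a fixed (y, x) gives the indicator vector for that location; overlapping objects of different categories simply set several entries to 1.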
For optimization we use Adam  with an initial learning rate of . The learning rate is decayed by every epochs. Models are trained until validation errors stop decreasing. The model with the minimum validation error is used for evaluation.
We conduct experiments on two text-to-scene tasks: (I) generating abstract scenes of clip arts in the Abstract Scenes dataset; and (II) constructing semantic object layouts of real images in the COCO dataset.
Table 2: Quantitative evaluation on the Abstract Scenes dataset.
|Methods||U-obj Prec||U-obj Recall||B-obj Prec||B-obj Recall||Pose||Expr||U-obj Coord||B-obj Coord|
|Zitnick et al. 2013||0.722||0.655||0.280||0.265||0.407||0.370||0.449||0.416|
|TEXT2SCENE (w/o attention)||0.665||0.605||0.228||0.186||0.305||0.323||0.395||0.338|
|TEXT2SCENE (w object attention)||0.731||0.671||0.312||0.261||0.365||0.368||0.406||0.427|
|TEXT2SCENE (w both attentions)||0.749||0.685||0.327||0.272||0.408||0.374||0.402||0.467|

Table 3: Human evaluation on the Abstract Scenes dataset.
|Methods||Scores||≥1||≥2||Obj-single||Obj-pair||Location||Expression|
|Zitnick et al. 2013||0.555||0.92||0.49||0.53||0.44||0.667||0.625|
|TEXT2SCENE (w/o attention)||0.455||0.75||0.42||0.431||0.36||0.6||0.583|

Table 4: Quantitative evaluation on the COCO dataset.
|TEXT2SCENE (w/o attention)||0.733||0.648||0.378||0.318||0.577||0.249||0.221||0.221|
|TEXT2SCENE (w object attention)||0.738||0.661||0.402||0.335||0.598||0.259||0.234||0.242|
|TEXT2SCENE (w both attentions)||0.716||0.654||0.391||0.346||0.585||0.258||0.362||0.232|

Table 5: Extrinsic captioning evaluation on the COCO dataset.
|Methods||BLEU-1||BLEU-2||BLEU-3||BLEU-4||METEOR||ROUGE_L||CIDEr||SPICE|
|TEXT2SCENE (w/o attention)||0.591||0.391||0.254||0.169||0.179||0.430||0.531||0.110|
|TEXT2SCENE (w object attention)||0.591||0.391||0.256||0.171||0.179||0.430||0.524||0.110|
|TEXT2SCENE (w both attentions)||0.600||0.401||0.263||0.175||0.182||0.436||0.555||0.114|
4.1 Clip-art Generation on Abstract Scenes
We use the dataset introduced by , which contains over 1,000 sets of 10 semantically similar scenes of children playing outside. The scenes are composed of 58 clip art objects. The attributes we consider for each clip art object are its location, size, and the direction the object is facing. For the person objects, we also explicitly model the pose and expression. There are three sentences describing different aspects of a scene. The three sentences are fed separately into the text encoder. The output features are concatenated in temporal order. We convert all sentences to lowercase and discard punctuation and stop-words, which results in a vocabulary size of . We restrict the maximum length of each sequence to . The last sentence is padded with a special end-of-sentence token. After filtering empty scenes, we obtain samples. Following , we reserve 1000 samples as the test set. We also reserve samples for validation. The final train/val/test splits are //. We set the hyperparameters of the loss function in Section 3.4 to (8,2,2,2,1,1,1,1) to make the losses of the object prediction branch and attributes prediction branch comparable. Exploration of the best hyperparameters is left for future work.
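The sentence preprocessing described above can be sketched as follows. The stop-word list, the maximum length of 6 and the `<eos>` token string are placeholders invented for this example (the actual values are not given here), and `preprocess` and `encode_description` are hypothetical names.

```python
import string

# Hypothetical stop-word list and limits, for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "and"}
MAX_LEN = 6
EOS = "<eos>"


def preprocess(sentence, max_len=MAX_LEN):
    """Lowercase, strip punctuation, drop stop-words and truncate to max_len tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokens = sentence.lower().translate(table).split()
    tokens = [w for w in tokens if w not in STOP_WORDS]
    return tokens[:max_len]


def encode_description(sentences, max_len=MAX_LEN, eos=EOS):
    """Preprocess each sentence separately; only the last one gets the end token."""
    processed = [preprocess(s, max_len) for s in sentences]
    processed[-1] = processed[-1] + [eos]
    return processed


tokens = encode_description(
    ["Mike is surprised.", "Jenny is running towards Mike and the duck."]
)
```

Each sentence stays a separate token sequence, matching the description that the sentences are fed separately into the text encoder and their output features concatenated afterwards.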
Baselines and Competing Approach
We compare our full model with [32]. We reproduce the results of [32] from the released source code. We also consider variants of our full model: (1) TEXT2SCENE (w/o attention): a model without any attention module. In particular, we replace Eq. 3 with a pure average pooling operation, discard the object-decoder text context vector $c^o_t$ in Eq. 5, discard the attribute-decoder text context vector $c^a_t$ in Eq. 8 and replace the attended visual features $u^a_t$ with the scene state $h^D_t$. (2) TEXT2SCENE (w object attention): a model with attention modules for object prediction but no dedicated attentions for attribute prediction. Specifically, we replace ($u^a_t$, $c^a_t$) with ($h^D_t$, $c^o_t$) in Eq. 8. In this case, the semantic-visual alignment would be learned directly from $h^D_t$ and the context vector from the object decoder. (3) TEXT2SCENE (w both attentions): a model with attention modules for both object prediction and attribute prediction but no regularization.
4.2 Semantic Layout Generation on COCO
We also test our approach by generating semantic object layouts corresponding to real images in the COCO dataset given an input textual caption. The semantic layouts contain bounding boxes of the objects annotated in the images. COCO [16] has categories of objects in images. Each image has textual descriptions. We use the official val2017 split as our test set and use samples from the train2017 split for validation. The sentences are preprocessed as in our previous experiment. We use a vocabulary of size of by taking into account the most frequent words (those appearing at least 5 times) in the training split. We normalize the bounding boxes and discard objects with areas smaller than the size of the image. We order the objects from bottom to top as the y-coordinates typically indicate the distances between the objects and the camera. We further order the objects with the same y-coordinate based on their x-coordinates (from left to right) and categorical indices. After filtering out empty scenes, we obtain a train/val/test split of size //. The attributes we consider for COCO are locations, sizes, and aspect ratios. For the size attribute, we estimate the size range of the bounding boxes in the training split, and discretize it evenly into 17 scales. We also use 17 aspect ratio scales, which are and . We restrict the maximum length of the output object sequence to and pad the sequence with a special ending token. For sequences shorter than , we also fill them with a special padding token. We set the hyperparameters of the loss function in Section 3.4 to (5,2,2,2,1,0) to make the losses of the object prediction branch and attributes prediction branch comparable, and leave the exploration of the best hyperparameters for future work.
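The object ordering and the even discretization into 17 scales can be sketched as below. Several details are assumptions: image coordinates are taken to grow downward (so "bottom to top" means decreasing y), a single reference y-coordinate per object is used, and `order_objects` and `discretize` are hypothetical names.

```python
def order_objects(objects):
    """Sort objects bottom to top, then left to right, then by category index.

    objects: list of dicts with normalized 'x', 'y' (y grows downward) and 'category'.
    """
    return sorted(objects, key=lambda o: (-o["y"], o["x"], o["category"]))


def discretize(value, low, high, bins=17):
    """Evenly split [low, high] into `bins` scales and return the bin index of value."""
    if high <= low:
        raise ValueError("high must exceed low")
    ratio = (value - low) / (high - low)
    # Clamp so values at or beyond the range limits fall into the first or last bin.
    return max(0, min(bins - 1, int(ratio * bins)))
```

In this sketch `low` and `high` would be the size range estimated on the training split, so that every training box falls into one of the 17 bins.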
For the task of generating semantic layouts, we compare the performance of equivalent TEXT2SCENE (w/o attention), TEXT2SCENE (w object attention) and TEXT2SCENE (w both attentions) models as defined in section 4.1.
Our tasks pose new challenges on evaluating the semantic matches between the textual descriptions and the visual features of the generated scenes/layouts since there is no absolute one-to-one correspondence between text and scenes. We draw inspiration from the evaluation metrics applied in machine translation  but we aim at aligning multimodal visual-linguistic data instead. To this end, we propose to compute the following metrics: precision/recall on single objects (U-obj) and “bigram” object pairs (B-obj); classification accuracies for poses and expressions; and Euclidean distances (defined as a Gaussian function with a kernel size of 0.2) for bounding box size, aspect ratio, and coordinates of U-obj and B-obj. A “bigram” object pair is defined as a pair of objects with overlapping bounding boxes as illustrated in Figure 3.
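A rough sketch of these metrics is given below. It makes several assumptions: objects are matched as multisets of labels, boxes are given as (x_min, y_min, x_max, y_max), and the "kernel size of 0.2" is interpreted as the standard deviation of the Gaussian. All function names are hypothetical.

```python
import math
from collections import Counter


def precision_recall(predicted, reference):
    """Multiset precision and recall over object labels (or label pairs)."""
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(reference) if reference else 0.0
    return precision, recall


def boxes_overlap(a, b):
    """True if two (x_min, y_min, x_max, y_max) boxes share some area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def bigram_pairs(objects):
    """objects: list of (label, box). Returns label pairs whose boxes overlap."""
    pairs = []
    for i in range(len(objects)):
        for j in range(i + 1, len(objects)):
            if boxes_overlap(objects[i][1], objects[j][1]):
                pairs.append(tuple(sorted((objects[i][0], objects[j][0]))))
    return pairs


def gaussian_similarity(p, q, sigma=0.2):
    """Map the Euclidean distance between p and q to (0, 1] with a Gaussian kernel."""
    d2 = sum((a - b) ** 2 for a, b in zip(p, q))
    return math.exp(-d2 / (2 * sigma ** 2))
```

The B-obj precision and recall can then be obtained by calling `precision_recall` on the outputs of `bigram_pairs` for the generated and reference scenes.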
We collect human evaluations on 100 groups of images for the Abstract Scenes dataset via Amazon Mechanical Turk (AMT). The human annotators are asked to judge whether a sentence is entailed by a corresponding clip-art scene. Each scene in this dataset is associated with three sentences that are used as the statements. Each sentence-scene pair is reviewed by three turkers to determine if the entailment is true, false or uncertain. To further analyze whether the approach can capture finer-grained semantic alignments between the textual description and the generated abstract scenes, we apply the predicate-argument semantic frame analysis of [4] on the corresponding triplets obtained from the input sentences using a semantic parsing method as computed by [32]. We subdivide each sentence by the structure in the triplet as: sub-pred corresponding to sentences referring to only one object, sub-pred-obj corresponding to sentences referring to object pairs with semantic relations, pred:loc corresponding to sentences referring to locations, and pred:pa corresponding to sentences mentioning facial expressions.
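Aggregating the annotator judgments into per-scene entailment statistics might look like the sketch below. It assumes a strict majority of "true" votes is required for a sentence-scene pair to count as entailed; `majority_true` and `entailment_summary` are hypothetical names, and the votes in the example are made up.

```python
from collections import Counter


def majority_true(votes):
    """votes: labels from the annotators, each 'true', 'false' or 'uncertain'."""
    return Counter(votes)["true"] * 2 > len(votes)


def entailment_summary(scenes):
    """scenes: list of scenes, each a list of per-sentence vote lists."""
    entailed = [[majority_true(v) for v in scene] for scene in scenes]
    num_scenes = len(entailed)
    num_pairs = sum(len(s) for s in entailed)
    return {
        # Fraction of sentence-scene pairs judged as entailed.
        "score": sum(sum(s) for s in entailed) / num_pairs,
        # Fraction of scenes with at least one / at least two entailed sentences.
        "at_least_1": sum(sum(s) >= 1 for s in entailed) / num_scenes,
        "at_least_2": sum(sum(s) >= 2 for s in entailed) / num_scenes,
    }


example = entailment_summary([
    [["true", "true", "false"], ["false", "false", "true"], ["true", "true", "true"]],
    [["uncertain", "false", "true"], ["false", "false", "false"], ["true", "uncertain", "true"]],
])
```

In the made-up example, the first scene has two entailed sentences and the second has one, so every scene passes the "at least one" criterion but only half pass "at least two".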
Extrinsic Evaluation through Captioning
For the generation of scenes in the COCO dataset, we also employ caption generation as an extrinsic evaluation, similarly as in . We generate captions from the semantic layouts and compare them back to the original captions used to generate the scene. We also use as reference an approach similar to [29] where the caption is generated from the reference layout. We use commonly used metrics for captioning such as BLEU [20], METEOR , ROUGE_L [15], CIDEr [25] and SPICE [1].
Table 2 presents quantitative results. TEXT2SCENE (full) shows significant improvement over the previous work of [32] and over our variant that does not use attention on all the metrics except U-obj Coord. This metric tests for exact matching of corresponding object locations, which penalizes our method because it generates scenes that are more diverse than the reference scenes yet still maintain the semantics of the input text. Human evaluation results in Table 3 confirm this, where Scores are the percentage of sentence-scene pairs with a true entailment, and (≥1) and (≥2) denote scenes with at least one and at least two out of three sentence-scene pairs annotated as true by majority vote. TEXT2SCENE (full) outperforms the no-attention variant and the previous work under these experiments, including on statements with specific semantic information such as Obj-single, Obj-pair, and expression, and is comparable on location statements. We also perform human evaluation on the reference scenes provided in the Abstract Scenes dataset as an upper bound on this task. Results also show that it is more challenging to generate the semantically related object pairs. Overall, the results also suggest that our proposed metrics correlate with human judgment on the task.
Figure 4 shows qualitative results of our models on Abstract Scenes, in comparison with baseline approaches and reference scenes. These examples illustrate that the models are able to capture semantic nuances such as the spatial relation between two objects (e.g., the bucket and shovel are correctly placed in Jenny’s hands in the last row) and object locations (e.g., Mike is on the ground near the swing set in the last row).
Compared to our previous experiments, quantitative results on COCO in Table 4 seem to suggest that the attribute attention might be less impactful on this dataset. We conjecture that this is because the objects in the COCO images, while realistic, exhibit weaker compositionality compared to the Abstract Scenes dataset, which was built explicitly to depict complex relationships between objects. Also, the layout representation has fewer attributes (i.e. location, size and aspect ratio). These attributes have larger variances on the realistic COCO data and are usually underspecified in the input captions. While the spatial relations between objects can be learned from the object attention, the size and aspect ratio may be predicted mainly from the prior distribution of the COCO data. Overall, however, we found that semantically plausible layouts were generated. As shown in our qualitative examples in Figure 5, our model learns the presence (first row, text in purple) and count (second row, text in blue) of the objects, and their spatial relations (third and fourth rows, text in red) reasonably well. Additionally, in our extrinsic evaluation on a captioning task, shown in Table 5, we also find that our model produces reasonable results leading to a CIDEr score of , which, for reference, is considerably larger than CIDEr scores obtained from synthetically generated pixel-level images in .
This work presents an end-to-end approach for generating pictorial representations of visually descriptive language. We provide extensive quantitative and qualitative analysis of our method on two distinct datasets. The results demonstrate the ability of our models to capture fine-grained semantic matches between descriptive text and the generated scenes. We establish quantitative baselines for text-to-scene generation on the Abstract Scenes and COCO datasets.
- [1] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Spice: Semantic propositional image caption evaluation. In European Conference on Computer Vision (ECCV), pages 382--398. Springer, 2016.
- [2] P. Anderson, B. Fernando, M. Johnson, and S. Gould. Guided open vocabulary image captioning with constrained beam search. Empirical Methods in Natural Language Processing (EMNLP), 2017.
- [3] S. Banerjee and A. Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with human judgments. In Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization, pages 65--72, 2005.
- [4] X. Carreras and L. Màrquez. Introduction to the conll-2005 shared task: Semantic role labeling. In Proceedings of the ninth conference on computational natural language learning, pages 152--164. Association for Computational Linguistics, 2005.
- [5] A. Farhadi, M. Hejrati, M. A. Sadeghi, P. Young, C. Rashtchian, J. Hockenmaier, and D. Forsyth. Every picture tells a story: Generating sentences from images. In European Conference on Computer Vision (ECCV), pages 15--29. Springer, 2010.
- [6] D. F. Fouhey and C. L. Zitnick. Predicting object dynamics in scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2019--2026, 2014.
- [7] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [8] S. Hong, D. Yang, J. Choi, and H. Lee. Inferring semantic layout for hierarchical text-to-image synthesis. In Computer Vision and Pattern Recognition (CVPR), 2018.
- [9] L. Karacan, Z. Akata, A. Erdem, and E. Erdem. Learning to generate images of outdoor scenes from attributes and semantic layouts. arXiv preprint arXiv:1612.00215, 2016.
- [10] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128--3137, 2015.
- [11] J.-H. Kim, D. Parikh, D. Batra, B.-T. Zhang, and Y. Tian. Codraw: Visual dialog for collaborative drawing. arXiv preprint arXiv:1712.05558, 2017.
- [12] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. International Conference on Learning Representations (ICLR), 2015.
- [13] P. Kuznetsova, V. Ordonez, T. Berg, and Y. Choi. Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association of Computational Linguistics, 2(1):351--362, 2014.
- [14] A. Lavie and A. Agarwal. Meteor: An automatic metric for mt evaluation with high levels of correlation with human judgments. In Proceedings of the Second Workshop on Statistical Machine Translation, StatMT ’07, pages 228--231, Stroudsburg, PA, USA, 2007. Association for Computational Linguistics.
- [15] C.-Y. Lin. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out, 2004.
- [16] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. European Conference on Computer Vision (ECCV), 2014.
- [17] T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural machine translation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1412--1421. Association for Computational Linguistics, 2015.
- [18] R. Mason and E. Charniak. Nonparametric method for data-driven image captioning. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 592--598, 2014.
- [19] V. Ordonez, X. Han, P. Kuznetsova, G. Kulkarni, M. Mitchell, K. Yamaguchi, K. Stratos, A. Goyal, J. Dodge, A. Mensch, et al. Large scale retrieval and generation of image descriptions. International Journal of Computer Vision, 119(1):46--59, 2016.
- [20] K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pages 311--318. Association for Computational Linguistics, 2002.
- [21] J. Pennington, R. Socher, and C. D. Manning. Glove: Global vectors for word representation. In EMNLP, pages 1532--1543, 2014.
- [22] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In Advances in Neural Information Processing Systems, pages 217--225, 2016.
- [23] S. Sharma, D. Suhubdy, V. Michalski, S. E. Kahou, and Y. Bengio. Chatpainter: Improving text to image generation using dialogue. International Conference on Learning Representations (ICLR) Workshop, 2018.
- [24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In NIPS, 2014.
- [25] R. Vedantam, C. Lawrence Zitnick, and D. Parikh. Cider: Consensus-based image description evaluation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4566--4575, 2015.
- [26] R. Vedantam, X. Lin, T. Batra, C. Lawrence Zitnick, and D. Parikh. Learning common sense through visual abstraction. In Proceedings of the IEEE international conference on computer vision, pages 2542--2550, 2015.
- [27] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3156--3164, 2015.
- [28] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio. Show, attend and tell: Neural image caption generation with visual attention. In International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 2048--2057. PMLR, 07--09 Jul 2015.
- [29] X. Yin and V. Ordonez. Obj2text: Generating visually descriptive language from object layouts. In Empirical Methods in Natural Language Processing (EMNLP), 2017.
- [30] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. Stackgan: Text to photo-realistic image synthesis with stacked generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), pages 5907--5915, 2017.
- [31] C. L. Zitnick and D. Parikh. Bringing semantics into focus using visual abstraction. In Computer Vision and Pattern Recognition (CVPR), 2013.
- [32] C. L. Zitnick, D. Parikh, and L. Vanderwende. Learning the visual interpretation of sentences. In International Conference on Computer Vision (ICCV), 2013.