CPGAN: Full-Spectrum Content-Parsing Generative Adversarial Networks for Text-to-Image Synthesis
Typical methods for text-to-image synthesis seek to design an effective generative architecture that models the text-to-image mapping directly. This is fairly arduous because of the cross-modality translation involved in the task. In this paper we circumvent this problem by thoroughly parsing the content of both the input text and the synthesized image to model the text-to-image consistency at the semantic level. In particular, we design a memory structure to parse the textual content by exploring, during text encoding, the semantic correspondence between each word in the vocabulary and its various visual contexts across relevant images in the training data. The synthesized image, in turn, is parsed to learn its semantics in an object-aware manner. Moreover, we customize a conditional discriminator that models the fine-grained correlations between words and image sub-regions to push for cross-modality semantic alignment between the input text and the synthesized image. Our model, which we refer to as Content-Parsing Generative Adversarial Networks (CPGAN), thus performs full-spectrum content-oriented parsing at the deep semantic level. Extensive experiments on the COCO dataset demonstrate that CPGAN advances the state-of-the-art performance significantly.
1 Introduction
Text-to-image synthesis aims to generate an image according to a textual description. The synthesized image is expected to be not only photo-realistic but also consistent with the description at the semantic level. The task has various potential applications, such as artistic creation and interactive entertainment. Text-to-image synthesis is more challenging than other conditional image synthesis tasks such as label-conditioned synthesis or image-to-image translation. On one hand, the given text contains much more descriptive information than a label, which implies more conditional constraints for image synthesis. On the other hand, the task involves cross-modality translation, which is more complicated than image-to-image translation.
Most existing methods for text-to-image synthesis [3, 7, 8, 12, 21, 23, 28, 31, 32, 34, 35, 37] are built upon generative adversarial networks (GANs), whose effectiveness has been validated in various image synthesis tasks [2, 17, 33]. A pivotal example is StackGAN, which synthesizes images iteratively in a coarse-to-fine framework by employing stacked GANs. Subsequently, many follow-up works focus on refining this generative architecture, either by introducing the attention mechanism [31, 37] or by modeling an intermediate representation to smoothly bridge the input text and the generated image [8, 12]. Whilst these methods have made substantial progress, one potential limitation is that they seek to model the text-to-image mapping directly during the generative process, which is fairly arduous for such cross-modality translation. Consider the example in Figure 1: both StackGAN and AttnGAN can hardly map the word ‘sheep’ to an intact visual depiction of a sheep. It is feasible to model the text-to-image consistency more explicitly at the semantic level, which however requires a thorough understanding of both the text and image modalities. Nevertheless, these methods pay little attention to parsing content semantically for either the input text or the generated image. Recently this limitation was investigated by SD-GAN, a state-of-the-art model, which leverages a Siamese structure in the discriminator to learn semantic consistency between two textual descriptions. However, direct content-oriented parsing at the semantic level for both the input text and the generated image is not performed in depth.
In this paper we focus on thoroughly parsing the content of both the input text and the synthesized image, and thereby modeling the semantic correspondence between them. On the text side, we design a memory mechanism to parse the textual content by capturing, for each word in the vocabulary, the various visual context information across relevant images in the training data. On the image side, we propose to encode the generated image in an object-aware manner to extract the visual semantics. The obtained text embeddings and image embeddings are then used to measure the text-image consistency in the semantic space. Besides, we design a conditional discriminator to push for semantic text-image alignment by modeling the fine-grained local correlations between words and image sub-regions. The resulting model, which we refer to as Content-Parsing Generative Adversarial Networks (CPGAN), thus performs full-spectrum content parsing to better align the input text and the generated image semantically and thereby improve the performance of text-to-image synthesis. Going back to the example in Figure 1, our CPGAN successfully translates the textual description ‘a herd of sheep grazing on a green field’ to a correct visual scene, which is more realistic than the results generated by other methods.
We evaluate the performance of our CPGAN on the COCO dataset both quantitatively and qualitatively, demonstrating that CPGAN pushes forward the state-of-the-art performance by a significant step (from 35.69 to 52.73 in Inception score). Moreover, a human evaluation performed on a randomly selected subset of the COCO test set consistently shows that our model outperforms two other classical methods for text-to-image synthesis (StackGAN and AttnGAN).
2 Related Work
Text-to-Image Synthesis. Text-to-image synthesis was initially investigated based on pixelCNN [24, 25], which suffers from high computational cost during the inference phase. Meanwhile, the variational autoencoder (VAE) was applied to text-to-image synthesis, leveraging the attention mechanism to balance the attention paid to different words in the input text.
A potential drawback of VAE-based synthesis methods is that the images generated by a VAE tend to be blurry, presumably because the loss functions directly minimize the distribution difference to the ground truth. This limitation is largely mitigated by generative adversarial networks (GANs), which were promptly extended to various generative tasks in computer vision [2, 10, 17, 33, 36]. After Reed made the first attempt to apply GANs to text-to-image synthesis, many follow-up works [8, 12, 21, 31, 34, 35, 37] focused on improving the generative architecture of GANs to refine the quality of generated images. A well-known example is StackGAN [34, 35], which synthesizes images in a coarse-to-fine framework using cascaded GANs. Following StackGAN, AttnGAN introduces the attention mechanism into this framework to model the fine-grained correspondence between words of the input text and sub-regions of the generated image, which achieves substantial progress. DMGAN further refines the attention mechanism by utilizing a memory scheme. Inspired by cycleGAN, MirrorGAN develops a text-to-image-to-text cycle framework to encourage text-image consistency. Another interesting line of research attempts to learn an intermediate representation as a smooth bridge between the input text and the synthesized image [8, 12].
Whilst these methods have brought about substantial progress, they seek to model the text-to-image mapping directly during the generative process. Unlike these methods, we focus on content-oriented parsing of both text and image modalities to obtain a thorough understanding of the multimodal information involved, which is beneficial for modeling the text-to-image consistency at the deep semantic level. Recently, Siamese networks have been leveraged to explore the semantic consistency either between two textual descriptions, by SD-GAN, or between two images, by SEGAN. However, deep content parsing at the semantic level for both text and image modalities is not performed.
Memory Mechanism. Memory networks were first proposed to tackle the limited memory of recurrent networks [11, 27]. They were then extensively applied in natural language processing (NLP) [4, 5, 16, 29] and computer vision (CV) [14, 18, 20] tasks. Unlike the initial motivation of memory networks, which is to enlarge the modeling memory, we design a specific memory mechanism to build the semantic correspondence between a word and all its relevant visual features across the training data during text parsing. It should be noted that our memory mechanism is entirely different from that of DMGAN, which uses memory as an enhanced attention mechanism to attend to not only the text but also the visual feature maps generated in the previous iteration during image generation.
3 Full-Spectrum Content-Parsing Generative Adversarial Networks
The proposed Content-Parsing Generative Model for text-to-image synthesis focuses on parsing the involved multimodal information by three customized components. To be specific, the Memory-Attended Text Encoder employs the memory structure to explore the semantic correspondence between a word and its various visual contexts across training data during text encoding; the Object-Aware Image Encoder is designed to parse the generated image in the semantic level; the Fine-grained Conditional Discriminator is proposed to measure the consistency between the input text and the generated image for guiding optimization of the whole model.
We will first present the overall architecture of the proposed Full-Spectrum Content-Parsing Generative Adversarial Networks illustrated in Figure 2, which follows the coarse-to-fine generative framework, then we will elaborate on the three aforementioned components specifically designed for content parsing.
3.1 Coarse-to-fine Generative Framework
Our proposed model synthesizes the output image from the given textual description in the classical coarse-to-fine framework, which has been extensively shown to be effective in generative tasks [12, 21, 31, 32, 34, 35]. As illustrated in Figure 2, the input text is parsed by our Memory-Attended Text Encoder and the resulting text embedding is further fed into three cascaded generators to obtain coarse-to-fine synthesized images. Two different types of loss functions are employed to optimize the whole model jointly: 1) Generative Adversarial Losses to push the generated image to be realistic and meanwhile match the descriptive text by training adversarial discriminators and 2) Text-Image Semantic Consistency Loss to encourage the text-image alignment in the semantic level.
Formally, given a textual description containing words, the parsed text embeddings by the Memory-Attended Text Encoder (presented in 3.2) are denoted as:
Herein consists of embeddings of words in which denotes the embedding for the -th word. is the global embedding for the whole sentence. Three cascaded generators are then employed to sequentially synthesize coarse-to-fine images . In our implementation we apply similar structure as Attentional Generative Network in AttnGAN :
where are the generated intermediate feature maps by and is an attention model designed to attend to the word embeddings to each pixel of in -th generation stage. Note that the first-stage generator takes as input the noise vector sampled from a standard Gaussian distribution to introduce the randomness. In practice, and are modeled as convolutional neural networks (CNNs), which are elaborated in the supplementary material. Different from AttnGAN, we introduce extra residual connection from to and (via up-sampling) to ease the information propagation between different generators.
To optimize the whole model, the generative adversarial losses are utilized by training generators and the corresponding discriminators alternately. In particular, we train two discriminators for each generative stage: 1) an unconditional discriminator to push the synthesized image to be realistic and 2) a conditional discriminator to facilitate the alignment between the synthesized image and the input text. The generators are trained by minimizing the following adversarial losses:
Accordingly, the adversarial loss for the corresponding discriminators in the -th generative stage is defined as:
where is the input descriptive text and is the corresponding ground-truth image for the -th generative stage. Negative pairs are also involved to improve the training robustness. Note that we formulate the adversarial losses in the form of hinge loss rather than the negative log-likelihood because of the empirically superior performance of hinge loss [17, 33].
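As a rough illustration of the hinge formulation mentioned above, the following sketch computes per-stage discriminator and generator losses from raw discriminator scores. The function names, the equal weighting of the unconditional and conditional terms, and the way mismatched (negative) text-image pairs are pooled with the fake pairs are assumptions made for illustration; they are not taken from the paper's equations.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Standard hinge loss for a discriminator:
    mean(max(0, 1 - D(real))) + mean(max(0, 1 + D(fake)))."""
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def hinge_g_loss(fake_scores):
    """Standard hinge loss for a generator: -mean(D(fake))."""
    return -sum(fake_scores) / len(fake_scores)


def stage_d_loss(uncond_real, uncond_fake, cond_real, cond_fake, cond_mismatch):
    """Loss of the two discriminators of one generative stage.

    Assumption: scores of mismatched (real image, wrong text) pairs are
    pooled with the fake scores of the conditional discriminator, and the
    unconditional and conditional terms are weighted equally.
    """
    unconditional = hinge_d_loss(uncond_real, uncond_fake)
    conditional = hinge_d_loss(cond_real, cond_fake + cond_mismatch)
    return unconditional + conditional
```

With this form, a discriminator that scores real samples above 1 and fake samples below -1 incurs zero loss, so gradients come only from samples inside the margin.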
Modeling the unconditional discriminator with CNNs is straightforward (see the supplementary material for details); it is, however, non-trivial to design an effective conditional discriminator. For this reason, we propose the Fine-grained Conditional Discriminator in Section 3.4.
While the adversarial losses in Equation 4, 5 push for the text-image consistency in an adversarial manner by the conditional discriminator, Text-Image Semantic Consistency Loss (TISCL) is proposed to optimize the semantic consistency directly. Specifically, the synthesized image and the input text are encoded respectively, then the obtained image embedding and the text embedding are projected to the same latent space to measure their consistency. Here we adopt DAMSM  (refer to the supplementary file for details) to compute the non-matching loss between a textual description and the corresponding image :
The key difference between our TISCL and DAMSM lies in encoding mechanisms for both input text (TextEnc) and the synthesized image (ImageEnc). Our proposed Memory-Attended Text Encoder and Object-Aware Image Encoder focus on 1) distilling the underlying semantic information contained in text and image, and 2) capturing the semantic correspondence between them. We will discuss these two encoders in subsequent sections concretely.
3.2 Memory-Attended Text Encoder
The Memory-Attended Text Encoder is designed to parse the input text and learn meaningful text embeddings for downstream generators to synthesize realistic images. A potential challenge during text encoding is that a word in the vocabulary may have multiple (similar but not identical) visual contexts and correspond to more than one relevant image in the training data. Typical text encoding methods, which encode the text online during training, can only focus on the text-image correspondence of the current training pair. Our Memory-Attended Text Encoder aims to capture the full semantic correspondence between a word and the various visual contexts from all its relevant images across the training data. Thus, our model can achieve a more comprehensive understanding of each word in the vocabulary and synthesize images of higher quality with more diversity.
The memory is constructed as a mapping structure, wherein each item maps a word to its visual context representation. To learn the meaningful visual features from each relevant image for a given word, we detect salient regions in each image to the word and extract features from them. There are many ways to achieve this goal. We resort to existing models for image captioning, which is the sibling task of text-to-image synthesis, since we can readily leverage the capability of image-text modeling. In particular, we opt for the Bottom-Up and Top-Down Attention model  which extracts the salient visual features for each word in a caption at the level of objects.
Specifically, given an image-text pair , object detection is first performed on image by a pretrained Yolo-V3 to select the top-36 sub-regions (indicated by bounding boxes) w.r.t. the confidence score, and the extracted features are denoted as . Note that we replace Faster R-CNN with Yolo-V3 for object detection for computational efficiency. Then the pretrained Bottom-Up and Top-Down Attention model is employed to measure the salience of each of the 36 sub-regions for each word in the caption (text) based on the attention mechanism. In practice we only retain the visual feature of the most salient sub-region from each relevant image. Since a word may correspond to multiple relevant images, we extract salient visual features for each of the images the word is involved in. As shown in Figure 3, the visual context features in the memory for the -th word in the vocabulary are modeled as the weighted average feature:
where is the number of relevant images in the training data to the -th word; is the attention weight on -th sub-regions for the -th relevant image and is the index of the most salient sub-region of the -th relevant image. To avoid potential feature pollution, we extract features from top- most relevant images instead of all images where is a hyper-parameter tuned on the validation set.
The benefits of parsing visual features by such memory mechanism are twofold: 1) extract precise semantic features from the most salient region of relevant images for each word; 2) capture full semantic correspondence between a word to its various visual contexts.
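The memory construction described above can be sketched as follows for a single word. This is a minimal illustration, not the paper's implementation: the helper name `build_memory_entry`, the use of the attention weight on the most salient sub-region as the ranking criterion for the top-K images, and the normalization by the sum of weights are all assumptions.

```python
def build_memory_entry(relevant_images, top_k):
    """Build the visual context feature of one word.

    relevant_images: list of (attention_weights, region_features) pairs, one
        per relevant image, where attention_weights[r] is the word's attention
        on sub-region r and region_features[r] is that sub-region's feature.
    top_k: number of relevant images kept (a hyper-parameter).
    """
    candidates = []
    for weights, features in relevant_images:
        # Keep only the most salient sub-region of each relevant image.
        best = max(range(len(weights)), key=lambda r: weights[r])
        candidates.append((weights[best], features[best]))

    # Assumption: "most relevant" images are those with the largest
    # attention weight on their most salient sub-region.
    candidates.sort(key=lambda c: c[0], reverse=True)
    kept = candidates[:top_k]

    # Weighted average of the retained features (weights normalized to sum to 1).
    total = sum(w for w, _ in kept)
    dim = len(kept[0][1])
    return [sum(w * f[d] for w, f in kept) / total for d in range(dim)]
```

Because the entry depends only on the training data and pretrained models, it can be computed once per vocabulary word before training the generators.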
It is worth mentioning that both Yolo-V3 and the Bottom-Up and Top-Down Attention model are pretrained on the MSCOCO dataset, which is also used for text-to-image synthesis; hence we do not utilize extra data in our method.
Text Encoding with Memory
Apart from the learned memory, which parses the input text from visual context information, we also encode the text by directly learning a latent embedding for each word in the vocabulary to characterize the semantic distance among all words. To be specific, we aim to learn an embedding matrix consisting of -dim embeddings for in total words in the vocabulary. The learned word embedding for the -th word in the vocabulary is then fused with the learned memory by a concatenation operation:
where is a nonlinear projection function to balance the feature dimensions between and . In practice, we implement by two fully-connected layers with a LeakyReLU layer in between, as illustrated in Figure 4.
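A minimal sketch of this fusion step is given below, assuming that the nonlinear projection is applied to the memory feature before concatenation with the word embedding. The names `linear`, `leaky_relu` and `fuse`, the negative slope of 0.2, and the toy layer sizes are illustrative assumptions only.

```python
def linear(x, weight, bias):
    """Fully-connected layer: each row of `weight` produces one output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def leaky_relu(x, slope=0.2):
    """LeakyReLU activation; the slope value is an assumption."""
    return [v if v > 0 else slope * v for v in x]


def fuse(word_emb, memory_feat, fc1, fc2):
    """Concatenate a word embedding with its projected memory feature.

    fc1, fc2: (weight, bias) pairs of the two fully-connected layers.
    Assumption: the projection acts on the memory feature so that its
    dimension is balanced against the word embedding.
    """
    hidden = leaky_relu(linear(memory_feat, *fc1))
    projected = linear(hidden, *fc2)
    return word_emb + projected  # list concatenation
```

For example, a 2-dim word embedding fused with a 3-dim memory feature projected down to 2 dimensions yields a 4-dim fused representation.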
Given a textual description containing words, we employ a Bi-LSTM  structure to obtain final word embedding for each time step, which incorporates the temporal dependencies between words:
3.3 Object-Aware Image Encoder
The Object-Aware Image Encoder is proposed to parse the image synthesized by our generator at the semantic level. The obtained image-encoded features are prepared for the proposed TISCL (Equation 14) to guide the optimization of the whole model by minimizing the semantic discrepancy between the input text and the synthesized image. Thus, the quality of the parsed image features is crucial to the performance of image synthesis by our model.
Besides learning global features of the whole image, a typical way of attending to local image features is to extract features from equally-partitioned image sub-regions. We propose to parse the image at the object level to extract more physically meaningful features. In particular, we employ Yolo-V3 (pretrained on MSCOCO) to detect the salient bounding boxes with the highest object-detection confidence and learn features from them, which is exactly the same as the corresponding Yolo-V3 operations in memory construction (Section 3.2.1). Formally, we extract visual features (1024-dim) of the top 36 bounding boxes by Yolo-V3 for a given image , denoted as . Another benefit of parsing images at the object level is that it is consistent with our Memory-Attended Text Encoder, which parses text based on visual context information at the object level.
The synthesized image in the early stage of training process cannot be sufficiently meaningful for performing object (salience) detection by Yolo-V3, which would adversely affect the image encoding quality. Hence, we also incorporate local features extracted from equally-partitioned sub-regions ( in our implementation) like AttnGAN , which is denoted as . This kind of two-pronged image encoding scheme is illustrated in Figure 2.
Two kinds of extracted features and are then projected into latent spaces with the same dimension by linear transformation and concatenated together to derive the final image encoding features :
where and are transformation matrices. The obtained image encoding feature is further fed into the DAMSM in Equation 14 to compute the TISCL by measuring, via the attention mechanism, the maximal semantic consistency between each word of the input text and the different sub-regions of the image.
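The two-pronged encoding can be sketched as below. The sketch assumes that, after projection to a common dimension, the object-level and grid-level features are concatenated along the region axis, so that the result is simply a longer list of region features; the function names and the toy dimensions are hypothetical.

```python
def project(features, matrix):
    """Apply one linear map to every feature vector.

    features: list of feature vectors; matrix: list of rows, one per output dim.
    """
    return [[sum(m * v for m, v in zip(row, f)) for row in matrix] for f in features]


def encode_image(object_feats, grid_feats, w_obj, w_grid):
    """Project both feature sets into the same latent dimension and stack them.

    Assumption: concatenation is along the region axis, so the output has
    len(object_feats) + len(grid_feats) region vectors of equal dimension.
    """
    return project(object_feats, w_obj) + project(grid_feats, w_grid)
```

Keeping the grid features in the encoding gives the text-image matching loss a usable signal even when the detector finds few reliable objects in early, low-quality synthesized images.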
3.4 Fine-grained Conditional Discriminator
The conditional discriminator is utilized to distinguish whether a textual caption matches the image in a pair, and thus to push for semantic alignment between the synthesized image and the input text through the corresponding adversarial loss. A typical way of designing a conditional discriminator is to extract a feature embedding from the text and from the image respectively, and then train a discriminator directly on the aggregated features. A potential limitation of such a method is that only the global compatibility between the text and the image is considered, whereas the local correlations between a word in the text and a sub-region of the image are not explored. Nevertheless, the most salient correlations between an image and a caption are usually reflected locally. To this end, we propose the Fine-grained Conditional Discriminator, which focuses on modeling local correlations between an image and a caption to measure their compatibility more accurately.
Inspired by PatchGAN, we partition the image into patches and extract visual features for each patch. We then learn the contextual features from the text for each patch by attending to each word in the text. As illustrated in Figure 5, suppose the extracted visual features for the -th patch in the image are denoted as and the word features in the text extracted by our text encoder are denoted as . We compute the contextual features for the -th patch by attention mechanism:
where is the attention weight for the -th word in the text. The obtained contextual feature is concatenated with the visual feature as well as the sentence embedding to discriminate between real and fake.
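The patch-level attention just described can be sketched as follows. The dot-product scoring followed by a softmax over words is a common choice assumed here for illustration; the paper's exact scoring function is not reproduced, and the function names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]


def patch_context(patch_feat, word_feats):
    """Attend from one image patch to all words.

    Returns the contextual feature (attention-weighted sum of word features)
    and the attention weights. Assumes patch and word features share a dimension.
    """
    scores = [sum(p * w for p, w in zip(patch_feat, word)) for word in word_feats]
    weights = softmax(scores)
    dim = len(word_feats[0])
    context = [sum(a * word[d] for a, word in zip(weights, word_feats)) for d in range(dim)]
    return context, weights


def discriminator_input(patch_feat, word_feats, sentence_emb):
    """Concatenate patch feature, its textual context and the sentence embedding."""
    context, _ = patch_context(patch_feat, word_feats)
    return patch_feat + context + sentence_emb
```

Each patch thus receives its own real/fake decision conditioned on the words most related to it, rather than a single decision for the whole image-caption pair.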
Note that the patch size (or the value of ) should be tuned to balance between capturing fine-grained local correlations and global text-image correlations.
4 Experiments
To evaluate the performance of our proposed Full-Spectrum Content-Parsing Generative Model, we conduct experiments on COCO dataset  which is a widely used benchmark on the task of text-to-image synthesis.
Code reproducing the results of our experiments is available
4.1 Experimental Setup
Dataset. Following the official 2014 data splits, COCO dataset contains 82,783 images for training and 40,504 images for validation. Each image has 5 corresponding textual descriptions by human annotation.
Evaluation Metrics. We adopt two commonly used metrics for evaluation: Inception score and R-precision. Inception score is extensively used to evaluate the quality of synthesized images, taking into account both the authenticity and the diversity of the images. R-precision is used to measure the semantic consistency between the textual description and the synthesized image.
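For reference, both metrics can be sketched from their standard definitions. The sketch assumes that class probabilities (for Inception score) and image/text embeddings (for R-precision) have already been computed by pretrained networks; the function names and the candidate-set layout are illustrative and do not describe the exact evaluation protocol used here.

```python
import math


def inception_score(probs):
    """exp of the mean KL divergence between p(y|x) and the marginal p(y).

    probs: list of class-probability vectors, one per generated image.
    """
    n, k = len(probs), len(probs[0])
    marginal = [sum(p[c] for p in probs) / n for c in range(k)]
    kl_sum = 0.0
    for p in probs:
        kl_sum += sum(p[c] * math.log(p[c] / marginal[c]) for c in range(k) if p[c] > 0)
    return math.exp(kl_sum / n)


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def r_precision(image_emb, text_embs, relevant, r=1):
    """Fraction of the top-r retrieved texts that are relevant to the image.

    text_embs: candidate text embeddings; relevant: set of indices of the
    ground-truth descriptions among the candidates.
    """
    ranked = sorted(range(len(text_embs)),
                    key=lambda i: cosine(image_emb, text_embs[i]), reverse=True)
    return sum(1 for i in ranked[:r] if i in relevant) / r
```

A generator that produces confident and diverse class predictions obtains a high Inception score, whereas identical predictions for all images give the minimum value of 1.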
Implementation Details. Our model is designed based on AttnGAN, hence AttnGAN is an important baseline for evaluating our model. We make several minor technical improvements over AttnGAN, which yield a considerable performance gain. Specifically, we replace the binary cross-entropy function for the adversarial loss with the hinge-loss form. Besides, we adopt truncated Gaussian noise as input noise for synthesis ( in Equation 2). We observe that a larger batch size in the training process can also lead to better performance. In our implementation, we use a batch size of 72 samples instead of the 14 samples in AttnGAN. Finally, the hyper-parameters in AttnGAN are carefully tuned. We refer to the version resulting from these improvements as the improved AttnGAN.
4.2 Ablation Study
We first conduct experiments to investigate the effectiveness of each of our three proposed modules, i.e., Memory-Attended Text Encoder (MATE), Object-Aware Image Encoder (OAIE) and Fine-Grained Conditional Discriminator (FGCD). To this end, we perform ablation experiments which begin with AttnGAN and then incrementally augment the text-to-image synthesis system with the three modules. Figure 6 presents the performance of all ablation experiments measured by Inception score.
Improved AttnGAN versus original AttnGAN. Figure 6 shows that the improved AttnGAN performs much better than the original AttnGAN, which benefits from the aforementioned technical improvements. We observe that increasing the batch size (from 14 to 72) during training brings about the largest performance gain (around 7 points in Inception score). Additionally, fine-tuning the hyper-parameters (especially in Equation 10 in AttnGAN) contributes additional points of improvement to the performance.
Effect of single module. Equipping the system with each of three proposed modules individually boosts the performance substantially. Compared to AttnGAN, the performance is improved by 8.9 points, 2.5 points, and 9.8 points by Memory-Attended Text Encoder (MATE), Object-Aware Image Encoder (OAIE) and Fine-Grained Conditional Discriminator (FGCD) respectively. These improvements demonstrate the effectiveness of all three modules. Whilst sharing same generators with AttnGAN, all our three modules focus on parsing the content of the input text or the synthesized image. Therefore, it is implied that deeper semantic content parsing for the text by the memory-based mechanism helps the downstream generators to understand the input text more precisely. On the other hand, our Object-Aware Image Encoder encourages generators to generate more consistent images with the input text under the guidance of our TISCL. Besides, Fine-grained Conditional Discriminator steers the optimization of generators to achieve better alignment between the text and the image by the corresponding adversarial losses.
Effect of combined modules. We then combine every two of three modules together to further augment the text-to-image synthesis system. The experimental results in Figure 6 indicate that the performances are further enhanced compared to the results of single-module cases with the exception of MATE + OAIE. We surmise that this is because Memory-Attended Text Encoder (MATE) performs similar operations as OAIE when learning the visual context information from images in the object level for each word in the vocabulary. Nevertheless, Object-Aware Image Encoder (OAIE) still advances the performances after being mounted over the single Fine-Grained Conditional Discriminator (FGCD) or MATE + FGCD.
Employing all three modules leads to our full CPGAN model and achieves the best performance of 52.73 points in Inception score, which is better than all other single-module or double-module cases.
Qualitative evaluation. To gain more insight into effectiveness of our three modules, we visualize the synthesized images for several examples by systems equipped with different modules and the baseline AttnGAN. Figure 7 presents the qualitative comparison.
Compared to AttnGAN, the synthesized images by each of our three modules are more realistic and more consistent with the input text, which again reveals advantages of our proposed modules over AttnGAN. Benefiting from the content-oriented parsing mechanisms, our modules tend to generate more intact and realistic objects corresponding to the meaningful words in the input text, which are indicated with red or green arrows.
4.3 Comparison with State-of-the-arts
In this set of experiments, we compare our model with the state-of-the-art methods for text-to-image synthesis on COCO dataset.
Quantitative Evaluation. Table 1 reports the quantitative experimental results. Our model achieves the best performance and outperforms other methods significantly in terms of Inception score, which is owing to joint contributions from all three modules we proposed. We also provide the results of R-precision, on which our model consistently performs best among the results that are publicly available.
|Model|Inception score|R-precision|
|---|---|---|
|Reed|7.88 ± 0.07||
|StackGAN|8.45 ± 0.03||
|Infer|11.46 ± 0.09||
|SD-GAN|35.69 ± 0.50||
|MirrorGAN|26.47 ± 0.41||
|SEGAN|27.86 ± 0.31||
|DMGAN|30.49 ± 0.57|88.56%|
|AttnGAN|25.89 ± 0.47|82.98%|
|CPGAN (ours)|52.73 ± 0.61|93.59%|
Human Evaluation. As a complement to the standard evaluation metrics, we also perform a human evaluation to compare our model with two classical models: StackGAN and AttnGAN. We randomly select 50 test samples and ask 100 human subjects to compare the quality of the images synthesized by these three models and vote for the best one for each sample. Note that the results synthesized by the three models are presented to human subjects in random order for each test sample. We calculate the rank-1 ratio for each model as the comparison metric. Table 2 presents the results. On average, our model receives 63.73% of the votes, while AttnGAN receives 28.33% of the votes and StackGAN performs worst. This human evaluation result is consistent with the quantitative results in terms of Inception score in Table 1.
Qualitative Evaluation. To obtain a qualitative comparison, we visualize the images synthesized on randomly selected text samples by our model and three other classical models: StackGAN, AttnGAN and DMGAN, as shown in Figure 8. It can be observed that our model is able to generate more realistic images than the other models, for example for ‘sheep’, ‘doughnuts’ or ‘sink’. Besides, the scenes in the images generated by our model are also more consistent with the given text than those of the other models, such as ‘bench next to a patch of grass’ or ‘people standing on a beach next to the ocean’.
5 Conclusion
In this work, we have presented the Content-Parsing Generative Adversarial Networks (CPGAN) for text-to-image synthesis. The proposed CPGAN focuses on content-oriented parsing on both the input text and the synthesized image to learn the text-image consistency in the semantic level. Further, we also design a fine-grained conditional discriminator to model the local correlations between words and image sub-regions to push for the text-image alignment. Our model significantly improves the state-of-the-art performance on COCO dataset.
6.1 Details of Coarse-to-fine Generative Framework
As described in Sec 3.1 in the paper, we adopt three cascaded generators to obtain coarse-to-fine synthesized images. At each stage, the generator is adopted to generate intermediate feature maps which could be directly mapped to generated image by convolutional layers.
As shown in Figure 9 (a), the global embedding for the whole sentence concatenated with Gaussian noise is processed by , which is composed of an FC layer, a reshape layer and four cascaded upsampling layers. The obtained intermediate feature map , together with , is then fed into the subsequent generators and , each of which consists of three residual blocks and an upsampling layer. Here is the output of the attention model designed to attend to the word embeddings to each pixel of . Formally, given the input word embedding and the intermediate feature map , the is modeled as:
Herein is the shape of intermediate feature map at the stage and denotes the embedding for the -th word. The word embeddings are first projected into the common space of the intermediate features by a FC layer, , where . Suppose the -th intermediate feature in the feature map is denoted as . We compute the dynamic representation of word embeddings related to the -th intermediate feature by attention mechanism:
where is the dynamic representation of word embeddings related to the intermediate feature maps .
6.2 The Structure of Unconditional Discriminators
The unconditional discriminator in Sec 3.1 in the paper consists of five cascaded downsampling layers, a Reshape layer and a FC layer, as illustrated in Figure 10.
6.3 DAMSM Loss
We employ the DAMSM of AttnGAN to construct our TISCL loss function, which models the non-matching loss between a textual description and the corresponding synthesized image. Formally, the TISCL is computed from the final word embeddings and the sentence embedding obtained by our text encoder (Equation 8 in the paper) and the image embedding obtained by our image encoder (Equation 9 in the paper).
The image global feature is extracted from the last average pooling layer of Inception-V3.
We first reshape the image embedding into a matrix of sub-region features. The similarity matrix for all pairs of words and sub-regions is then computed from the dot products between word embeddings and sub-region features, so that each entry is the dot-product similarity between one word of the sentence and one sub-region of the image.
For each word, we calculate a dynamic representation (the region-context vector) by attending to the sub-regions of the image: it is a weighted sum of the sub-region features, with attention weights given by a softmax over the sub-regions of the word's similarities scaled by a factor. This factor determines how much attention is paid to the features of the relevant sub-regions when computing the region-context vector for a word. Finally, we define the semantic consistency between each word of the input text and the sub-regions of the image as the cosine similarity between the word embedding and its region-context vector. The image-text matching score between the entire image and the whole sentence description D is defined as a scaled log-sum-exp of these cosine similarities over the words of the sentence, where the scaling factor determines how much to magnify the importance of the most relevant word-to-region-context pair.
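The word-level matching score described above can be sketched in NumPy as follows. This is a minimal sketch that follows the DAMSM formulation of AttnGAN, including its normalization of the similarity matrix across words, a step not spelled out in this section. The function names and the default values of `gamma1` and `gamma2` are illustrative assumptions, not settings taken from this paper.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def word_level_score(words, regions, gamma1=5.0, gamma2=5.0):
    """Image-text matching score from word and sub-region features.

    words:   (D, T) word embeddings, one column per word
    regions: (D, N) sub-region features, one column per sub-region
    """
    sim = words.T @ regions                 # (T, N) dot-product similarities
    sim = softmax(sim, axis=0)              # normalize across words (AttnGAN step, assumed here)
    attn = softmax(gamma1 * sim, axis=1)    # (T, N) attention over sub-regions per word
    context = regions @ attn.T              # (D, T) region-context vector per word
    num = (context * words).sum(axis=0)
    den = np.linalg.norm(context, axis=0) * np.linalg.norm(words, axis=0) + 1e-8
    relevance = num / den                   # (T,) cosine similarity per word
    score = np.log(np.exp(gamma2 * relevance).sum()) / gamma2
    return score, relevance

rng = np.random.default_rng(0)
D, T, N = 16, 7, 10                         # hypothetical sizes
words = rng.normal(size=(D, T))
regions = rng.normal(size=(D, N))
score, relevance = word_level_score(words, regions)
```

A larger `gamma2` pushes the score toward the relevance of the single best-matching word, while a smaller value lets every word contribute.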
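In each mini-batch, the posterior probability of a sentence matching its corresponding image is obtained by a softmax, over the sentences in the batch, of the image-text matching scores scaled by a smoothing factor determined by experiments. The number of candidates in this softmax equals the batch size. The word-level loss for the positive image-sentence pairs in a mini-batch is then defined as the negative log posterior probability of the matching pairs, summed over the batch.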
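For the sentence embedding and the image global feature, we define the image-text matching score as their cosine similarity.
The sentence-level loss is modeled in the same way as the word-level loss, with this sentence-level matching score in place of the word-level one.
Finally, the DAMSM loss is defined as the sum of the word-level and sentence-level losses.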
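The batch-level losses can be sketched in NumPy as follows. This is a minimal sketch under the AttnGAN formulation of DAMSM, assuming the loss is applied in both matching directions (sentence given image and image given sentence) and summed over the batch. The function names, the row/column convention of the score matrix, and the default `gamma3` are illustrative assumptions.

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def matching_loss(scores, gamma3=10.0):
    """Negative log posterior of the matching pairs in a batch.

    scores[i, j] is the matching score between image i and sentence j;
    the diagonal holds the positive pairs.
    """
    logits = gamma3 * scores
    idx = np.arange(scores.shape[0])
    sent_given_img = log_softmax(logits, axis=1)[idx, idx]
    img_given_sent = log_softmax(logits, axis=0)[idx, idx]
    return -(sent_given_img.sum() + img_given_sent.sum())

def cosine_matrix(img_global, sent_emb):
    """Pairwise cosine similarity between (B, D) image and sentence features."""
    a = img_global / np.linalg.norm(img_global, axis=1, keepdims=True)
    b = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    return a @ b.T

def damsm_loss(word_scores, img_global, sent_emb, gamma3=10.0):
    """Sum of the word-level and sentence-level matching losses."""
    word_loss = matching_loss(word_scores, gamma3)
    sent_loss = matching_loss(cosine_matrix(img_global, sent_emb), gamma3)
    return word_loss + sent_loss

rng = np.random.default_rng(0)
B, D = 4, 16                                   # hypothetical batch and feature sizes
word_scores = rng.uniform(-1.0, 1.0, size=(B, B))
img_global = rng.normal(size=(B, D))
sent_emb = rng.normal(size=(B, D))
total = damsm_loss(word_scores, img_global, sent_emb)
```

When every pair in the batch receives the same score, each posterior is uniform and `matching_loss` equals 2·B·log(B); the loss decreases as the positive pairs on the diagonal score higher than the mismatched pairs.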
- Both authors contributed equally.
- Corresponding Author: Feng Lu (email@example.com)
- Details are provided in the supplementary file.
- (2018) Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 6077–6086. Cited by: §3.2.1.
- (2017) Large scale gan training for high fidelity natural image synthesis. Cited by: §1, §2, §4.1.
- (2017) Adversarial nets with perceptual losses for text-to-image synthesis. In 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6. Cited by: §1.
- (2017) Question answering on knowledge bases and text using universal schema and memory networks. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics(ACL), pp. 358–365. Cited by: §2.
- (2017) Memory-augmented neural machine translation. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing(EMNLP), pp. 1390–1399. Cited by: §2.
- (2014) Generative adversarial nets. In Advances in neural information processing systems(NIPS), pp. 2672–2680. Cited by: §1, §2.
- (2017) Semantic image synthesis via adversarial learning. In Proceedings of the IEEE international conference on computer vision(ICCV), pp. 5706–5714. Cited by: §1.
- (2018) Inferring semantic layout for hierarchical text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 7986–7994. Cited by: §1, §2, Table 1.
- (2015) Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991. Cited by: §3.2.2.
- (2017) Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition(CVPR), pp. 1125–1134. Cited by: §1, §2, §3.4.
- (2015) Memory networks. International Conference on Learning Representations(ICLR). Cited by: §2.
- (2019) Object-driven text-to-image synthesis via adversarial training. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 12174–12182. Cited by: §1, §2, §3.1, Table 1.
- (2014) Microsoft coco: common objects in context. In European Conference on Computer Vision(ECCV), pp. 740–755. Cited by: §3.2.1, §4.
- (2018) Visual question answering with memory-augmented networks. Proceedings of the IEEE conference on computer vision and pattern recognition(CVPR), pp. 6975–6984. Cited by: §2.
- (2016) Generating images from captions with attention. International Conference on Learning Representations(ICLR). Cited by: §2.
- (2018) Document context neural machine translation with memory networks. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(ACL), pp. 1275–1284. Cited by: §2.
- (2018) Spectral normalization for generative adversarial networks. International Conference on Learning Representations(ICLR). Cited by: §1, §2, §3.1.
- (2018) Automatic stance detection using end-to-end memory networks. arXiv preprint arXiv:1804.07581. Cited by: §2.
- (2017) Conditional image synthesis with auxiliary classifier gans. Proceedings of the 34th International Conference on Machine Learning(ICML), pp. 2642–2651. Cited by: §1.
- (2019) Memory-attended recurrent network for video captioning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 8347–8356. Cited by: §2.
- (2019) MirrorGAN: learning text-to-image generation by redescription. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 4321–4330. Cited by: §1, §2, §3.1, Table 1.
- (2018) Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767. Cited by: §3.2.1.
- (2016) Generative adversarial text to image synthesis. Proceedings of the 33rd International Conference on Machine Learning(ICML). Cited by: §1, §2, Table 1.
- (2017) Parallel multiscale autoregressive density estimation. Proceedings of the 34th International Conference on Machine Learning(ICML), pp. 2912–2921. Cited by: §2.
- (2017) Generating interpretable images with controllable structure. International Conference on Learning Representations(ICLR). Cited by: §2.
- (2016) Improved techniques for training gans. In Advances in neural information processing systems(NIPS), pp. 2234–2242. Cited by: §4.1.
- (2015) End-to-end memory networks. In Advances in neural information processing systems(NIPS), pp. 2440–2448. Cited by: §2.
- (2019) Semantics-enhanced adversarial nets for text-to-image synthesis. Proceedings of the IEEE international conference on computer vision(ICCV), pp. 10501–10510. Cited by: §1, §2, Table 1.
- (2018) Target-sensitive memory networks for aspect sentiment classification. Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics(ACL), pp. 957–967. Cited by: §2.
- (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §3.2.2.
- (2018) Attngan: fine-grained text to image generation with attentional generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 1316–1324. Cited by: §1, §2, §3.1, §3.1, §3.1, §3.3, §3.3, §4.1, §4.1, §4.2, Table 1, Table 2, §6.3.
- (2019) Semantics disentangling for text-to-image generation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 2327–2336. Cited by: §1, §2, §3.1, Table 1.
- (2019) Self-attention generative adversarial networks. Proceedings of the 36th International Conference on Machine Learning(ICML). Cited by: §1, §2, §3.1.
- (2017) Stackgan: text to photo-realistic image synthesis with stacked generative adversarial networks. In Proceedings of the IEEE International Conference on Computer Vision(ICCV), pp. 5907–5915. Cited by: §1, §2, §3.1, Table 1, Table 2.
- (2018) Stackgan++: realistic image synthesis with stacked generative adversarial networks. IEEE transactions on pattern analysis and machine intelligence(TPAMI) 41 (8), pp. 1947–1962. Cited by: §1, §2, §3.1.
- (2017) Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision(ICCV), pp. 2223–2232. Cited by: §2.
- (2019) DM-gan: dynamic memory generative adversarial networks for text-to-image synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition(CVPR), pp. 5802–5810. Cited by: §1, §2, §2, Table 1.