Realistic Image Generation using Region-phrase Attention
The Generative Adversarial Network (GAN) has recently been applied to generate synthetic images from text. Despite significant advances, most current state-of-the-art algorithms are based on regular-grid regions; when attention is used, it is mainly applied between individual regular-grid regions and a word. These approaches are sufficient to generate images that contain a single object in the foreground, such as a “bird” or “flower”. However, natural language descriptions often involve complex foreground objects, and the background may also constitute a variable portion of the generated image. Therefore, the regular-grid based image attention weights may not concentrate on the intended foreground region(s), which in turn results in an unnatural-looking image. Additionally, individual words such as “a”, “blue” and “shirt” do not necessarily provide a full visual context unless they are applied together. For this reason, we propose a novel method that introduces an additional set of attentions between true-grid regions and word phrases. The true-grid regions are derived using a set of auxiliary bounding boxes, which serve as superior indicators of where the alignment and attention with the word phrases should be drawn. Word phrases are derived by analysing Part-of-Speech (POS) tagging results. We perform experiments on this novel network architecture using the Microsoft Common Objects in Context (MSCOCO) dataset, where the model generates images conditioned on a short sentence description. Our proposed approach is capable of generating more realistic images compared with the current state-of-the-art algorithms.
Generating images from text descriptions is a challenging problem that has attracted much interest in recent years. Algorithms based on the Generative Adversarial Network (GAN) [3], specifically Deep Convolutional GAN (DCGAN) [8], have demonstrated promising results on various datasets.
Reed et al. [11] synthesized images based on text descriptions encoded by an RNN and were able to generate images on the CUB, Oxford-102 and MSCOCO datasets. The RNN was pre-trained using the deep convolutional and recurrent text encoder introduced in [10]. The Generative Adversarial What-Where Network (GAWWN) [9] was another GAN that introduced extra information, such as bounding boxes and key points, making the location of the main object in an image controllable. Following on from this work, Zhang et al. [18] proposed StackGAN, which was able to generate photo-realistic images from text descriptions through a 2-stage generation process. Xu et al. [17] further extended StackGAN by incorporating an attention mechanism to perform multi-stage image generation. This method allowed attention to be paid to relevant words when generating different regions of an image.
These works have demonstrated a significant breakthrough in synthesizing images that contain a single object, such as those in the CUB [16] and Oxford-102 [7] datasets, in which each image contains a single specific type of bird or flower. However, synthesized images that model human poses or involve multi-object interactions usually lack sufficient detail and can easily be distinguished from real images.
We believe that the in-depth connection between individual words and image sub-regions is not yet fully utilized and that model performance could be improved. For example, consider the sentence “A man swinging a baseball bat”. We would expect a man, a baseball bat and their interaction (swinging) all to be captured in the generated image, while retaining some degree of freedom in the direction of the action and the exact position of each object in the image.
In this paper, we propose several novel strategies based upon the attention mechanism introduced by the AttnGAN framework [17]. First, we adjust the Deep Attentional Multimodal Similarity Model (DAMSM) loss by utilizing true-grid features inside each bounding box in addition to phrases that consist of multiple consecutive words. This revised loss function encourages the network to learn the in-depth relationship between a sentence and its generated image, both in terms of how the whole image reflects the prescribed sentence and in terms of how specific regions of an image relate to an individual word or phrase.
We also incorporate the bounding box and phrase information in defining attention weights between image regions and each word phrase. Therefore, when generating pixels inside an object bounding box, attention is paid to phrases such as “a red shirt” or “a green apple”, instead of to each individual word separately.
The rest of this paper is organised as follows. In section 2, we review the GAN network and other prior work that served as the basis and inspiration for our work. In section 3, we introduce the assumptions and the architecture of our model. The performance is compared and discussed in section 4.
2 Background and Related Work
In this section we review previous works on text embedding, GAN network structure, Region of Interest (RoI) Pooling and image-sentence alignment that we used as the basis for our work.
2.1 Sentence Embedding
Generating images from text requires each sentence to be encoded into a fixed-length vector. Previous works such as StackGAN used text embeddings generated by a pre-trained Convolutional Recurrent Neural Network [10]. RNNs have been widely applied in modelling natural language for classification or translation purposes.
2.2 Text-to-Image with GAN
The Generative Adversarial Network involves a 2-player non-cooperative game between a generator $G$ and a discriminator $D$. The generator produces samples from a random noise vector $z$, and the discriminator differentiates between true samples and fake samples. The value function of the game is as follows [3]:
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
On the basis of DCGAN, GAN-CLS [11] generated images based on the corresponding image caption in addition to the noise vector $z$. $z$ was sampled from a Gaussian distribution $\mathcal{N}(0, 1)$, and the text description $t$ was encoded with a pre-trained text encoder $\varphi$ to be $\varphi(t)$. $\varphi(t)$ was then concatenated with $z$ and processed through a series of transposed convolution layers to generate the fake image. In the discriminator $D$, a series of convolution, batch normalization and leaky ReLU blocks was applied to discriminate between true images and fake images.
GAWWN [9] proposed an architecture for controllable text-to-image generation that adopted supplementary information such as bounding boxes or part locations of the main object in the image.
As previous works struggled to generate high-resolution images, StackGAN [18] employed a 2-stage GAN network to generate photo-realistic images from text descriptions. Its architecture consisted of 2 stages: Stage-I generated a low-resolution image (e.g. $64 \times 64$) based on the text and random noise, and Stage-II generated a higher-resolution image (e.g. $256 \times 256$) based on the text and the lower-resolution image from Stage-I.
AttnGAN [17] was an extension of StackGAN that used an attention mechanism in addition to an image-text matching score. It was able to generate images of better quality and achieve a higher Inception score.
The generation process of AttnGAN was based on a multi-stage attention mechanism. At each stage, the generated image received information from attention weights. These attention weights were calculated between image features from the last stage and text features. The attention mechanism was also used to calculate a Deep Attentional Multimodal Similarity Model (DAMSM) loss, which encouraged the correct matching between sentence-image pairs. These details are discussed in section 3.1.4.
2.3 RoI Pooling
The Region of Interest (RoI) pooling layer was first introduced in [1]. An image region with spatial size $h \times w$ was first divided into an $H \times W$ grid of sub-windows. Each sub-window was then fed through a max-pooling layer, which gave a final pooling result with spatial size $H \times W$. RoI pooling allowed each image region to be embedded into a fixed-length vector with no additional parameters or training involved.
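The pooling step described above can be illustrated with a minimal NumPy sketch. The function name `roi_max_pool`, the box convention (pixel coordinates on the feature map, exclusive upper bounds) and the way sub-window boundaries are rounded are assumptions made for this illustration; they are not taken from the original implementation.

```python
import numpy as np

def roi_max_pool(feature_map, box, out_h, out_w):
    """Max-pool the region `box` of a (C, H, W) feature map to (C, out_h, out_w).

    `box` is (x0, y0, x1, y1) in feature-map pixels, with x1 and y1 exclusive
    (an assumed convention for this sketch).
    """
    x0, y0, x1, y1 = box
    region = feature_map[:, y0:y1, x0:x1]
    channels, h, w = region.shape
    # Sub-window boundaries; together the sub-windows cover the whole region.
    ys = np.linspace(0, h, out_h + 1).astype(int)
    xs = np.linspace(0, w, out_w + 1).astype(int)
    pooled = np.empty((channels, out_h, out_w), dtype=feature_map.dtype)
    for i in range(out_h):
        for j in range(out_w):
            # Guarantee at least one pixel per sub-window for small regions.
            ya, yb = ys[i], max(ys[i + 1], ys[i] + 1)
            xa, xb = xs[j], max(xs[j + 1], xs[j] + 1)
            pooled[:, i, j] = region[:, ya:yb, xa:xb].max(axis=(1, 2))
    return pooled

# Two boxes of different sizes yield outputs of the same fixed size.
fm = np.arange(2 * 6 * 6, dtype=float).reshape(2, 6, 6)
pooled_a = roi_max_pool(fm, (0, 0, 4, 4), 2, 2)
pooled_b = roi_max_pool(fm, (1, 0, 6, 3), 2, 2)
```

Because the output size depends only on `out_h` and `out_w`, regions of any size map to vectors of the same length once flattened, and no trainable parameters are involved.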
We construct our network based on the latest architecture of AttnGAN [17]. Inspired by the visual-semantic alignment of [6], we also encourage image sub-region and word matching during training. As shown in figure 1, the proposed architecture consists of an end-to-end text encoder network and a GAN framework. The details are explained in the following sections.
3.1 Text Encoder
Current GAN based text to image generation networks typically extract a whole sentence representation and word representations using a bi-directional LSTM [18, 11, 17]. Our proposed algorithm takes advantage of previous methods while extracting additional phrase features.
We define a phrase as a combination of an article, adjective and noun. Such information can be extracted from raw sentences via part-of-speech tagging (POS-tagging). For example, the sentence “A man is riding a surfboard on a wave” is tagged as [(“A”, indefinite article), (“man”, noun), (“is”, verb), (“riding”, verb), (“a”, indefinite article), (“surfboard”, noun), (“on”, preposition), (“a”, indefinite article), (“wave”, noun)]. We then group the nearest article-adjective-noun words into a phrase, which yields “a man”, “a surfboard” and “a wave”.
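The grouping rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the input is already POS-tagged, the simplified tag names `DET`, `ADJ` and `NOUN` are hypothetical stand-ins for whatever tag set the tagger emits, and the function `group_phrases` is not from the original code.

```python
def group_phrases(tagged):
    """Group consecutive article / adjective / noun runs into phrases.

    `tagged` is a list of (word, tag) pairs using the assumed tags
    "DET", "ADJ" and "NOUN"; every other tag ends the current phrase.
    """
    phrases, current = [], []
    has_noun = False

    def flush():
        # Keep the collected words only if they contain a noun.
        nonlocal current, has_noun
        if has_noun:
            phrases.append(" ".join(current).lower())
        current, has_noun = [], False

    for word, tag in tagged:
        if tag == "DET":
            flush()                 # an article always starts a new phrase
            current = [word]
        elif tag == "ADJ":
            if has_noun:            # an adjective after a noun starts a new phrase
                flush()
            current.append(word)
        elif tag == "NOUN":
            current.append(word)
            has_noun = True
        else:
            flush()
    flush()
    return phrases

tagged_sentence = [
    ("A", "DET"), ("man", "NOUN"), ("is", "VERB"), ("riding", "VERB"),
    ("a", "DET"), ("surfboard", "NOUN"), ("on", "ADP"),
    ("a", "DET"), ("wave", "NOUN"),
]
surf_phrases = group_phrases(tagged_sentence)
apple_phrases = group_phrases([("a", "DET"), ("green", "ADJ"), ("apple", "NOUN")])
```

In practice the tags would come from an off-the-shelf POS tagger, and its tag names would need to be mapped onto the three categories used here.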
3.1.1 Sentence Encoder
3.1.2 Phrase Encoder
On top of the extracted word representations, phrase representations are extracted by applying a second LSTM as follows. Given a phrase, an LSTM is applied over the sequence of words in the phrase, and the last hidden state is used as the feature representation of that phrase.
Our phrase-based embedding clearly has an advantage over the traditional word-based mechanism, in which each word has a separate representation. For example, none of the individual words in the phrase “a green apple” portrays an overall picture of the object; all three words work together to capture its visual meaning.
3.1.3 Image Encoder
The image encoder itself comes from the pre-trained Inception-v3 network [15] and is not further fine-tuned. In our work, we apply the image encoder to extract three different image features from a single image: a true-grid region feature, a regular-grid region feature and a full image feature.
Examples of each region are shown in figure 3. A true-grid region is defined over a single object, and thus the regions differ in size. Regular-grid regions have equal sizes, and each of them may contain part of an object or several objects.
Common to all features, each image first undergoes a pre-trained Inception-v3 model . We use the ”” layer feature map as the designated layer for the regular-grid region. The full image feature is obtained from the last average pooling layer. In addition, both regular-grid region feature and full image feature are converted into vectors in the same semantic space using a Fully Connected (FC) layer. Therefore, the resulting features have the following dimensions: a regular-grid region feature where is the dimension for ”” layer feature map. The image feature is denoted as .
To obtain a true-grid region feature, we first need the location and size of each region. In several open datasets, such as MSCOCO, manually labeled bounding boxes of the object(s) within an image are readily available. Where a dataset does not provide such information, the bounding boxes can instead be obtained from off-the-shelf object detectors, such as RCNN [2]. This makes it possible to apply our algorithm to any image dataset with text annotations, including CUB and Oxford-102.
The ”” layer feature map and its bounding box information are fed through Region of Interest (RoI) pooling to generate the true-grid region feature. The extracted feature after RoI pooling has the same size even though the bounding boxes may differ in size. These features are fed through a convolution operation with a kernel of an equivalent size, resulting in a vector in the same semantic space as the text features. We denote the true-grid region feature as where is the number of bounding boxes in each image.
3.1.4 Attention Based Embedding Loss
Text embedding and the perceptron layer for image and region features are bootstrapped prior to training the GAN network.
Following , the training target is to minimize the negative log posterior probability for the correct image-sentence pair. i.e. for a batch of image-sentence pairs , the loss function is given as:
is the posterior probability for a sentence to be matched with an image .
Here gives the similarity score between the sentence and the image and is a manually defined smooth factor. The posterior probability for an image being matched to a sentence is defined in a similar way.
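As a rough illustration of this loss, the NumPy sketch below computes the negative log posterior of the matched pairs in a batch from a matrix of similarity scores. It assumes a softmax posterior scaled by a smoothing factor `gamma`, in the style of the DAMSM loss of AttnGAN; the function name `matching_loss` and the layout of `sim` are assumptions for this example rather than the original implementation.

```python
import numpy as np

def matching_loss(sim, gamma):
    """Negative log posterior of the matched image-sentence pairs.

    sim[i, j] is the similarity score between sentence i and image j,
    so matched pairs sit on the diagonal (assumed layout).
    """
    def neg_log_diag_softmax(scores):
        scores = gamma * scores
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.trace(log_prob)

    # One term normalises over images for each sentence, the other
    # normalises over sentences for each image.
    return neg_log_diag_softmax(sim) + neg_log_diag_softmax(sim.T)

loss_matched = matching_loss(np.eye(3) * 10.0, gamma=1.0)
loss_uninformative = matching_loss(np.zeros((3, 3)), gamma=5.0)
```

When the diagonal scores dominate, the loss approaches zero; when all scores are equal, each of the six posterior terms contributes log 3.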
The similarity score is defined from three perspectives. The first score uses the cosine similarity between a sentence representation and a whole image feature . The second is to utilise an attention mechanism built between the regular-grid regions and the words:
is a second smooth factor. is the cosine similarity between a word embedding and a region-context vector which is calculated as a weighted sum over regular-grid image features:
In equation 5, is the attention weight for the regular-grid towards the word. is a normalised cosine similarity between the word and the region.
The third way to define is through the attention mechanism between the true-grid regions and the phrases:
Here is the region context vector which is computed from a weighted sum over true-grid region features.
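The attention-based similarity can be sketched as follows. The same routine serves both the word and regular-grid case and the phrase and true-grid case, since only the inputs change. The exact normalisation, the log-sum-exp aggregation and the default values of `gamma1` and `gamma2` follow the AttnGAN-style formulation and are assumptions here, as are all function names.

```python
import numpy as np

def softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cosine(a, b):
    """Row-wise cosine similarity between two (N, D) arrays."""
    num = (a * b).sum(axis=1)
    den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-8
    return num / den

def attention_similarity(text, regions, gamma1=5.0, gamma2=5.0):
    """text: (T, D) word or phrase embeddings; regions: (N, D) region features."""
    t = text / (np.linalg.norm(text, axis=1, keepdims=True) + 1e-8)
    r = regions / (np.linalg.norm(regions, axis=1, keepdims=True) + 1e-8)
    s = t @ r.T                               # (T, N) cosine similarities
    s_bar = softmax(s, axis=0)                # normalise over text units
    alpha = softmax(gamma1 * s_bar, axis=1)   # attention weights over regions
    context = alpha @ regions                 # (T, D) region-context vectors
    rel = cosine(context, text)               # relevance of each word or phrase
    score = np.log(np.exp(gamma2 * rel).sum()) / gamma2
    return score, alpha

regions = np.eye(4)[:3]                       # three toy region features
aligned_text = regions[:2]                    # two words aligned with regions 0 and 1
unrelated_text = np.array([[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]])
score_aligned, alpha = attention_similarity(aligned_text, regions)
score_unrelated, _ = attention_similarity(unrelated_text, regions)
```

In this toy example the aligned words receive a higher score than words orthogonal to every region, which is the behaviour the loss is meant to reward.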
3.1.5 Text Encoder Loss Functions
Substituting the three different similarity scores into the loss defined in section 3.1.4, the three loss functions are noted as and and as shown in graph 1. Later, in section 4.2, we compare different combinations of these loss functions, where LSTM-BASIC is the baseline model using , LSTM utilizes all three loss functions as ; LSTM-PHRASE uses .
3.2 Attentional Text to Image Generation
Following AttnGAN [17], our work constructs text-to-image generation as a multi-stage process. At each generation stage, images from small to large scales are generated from the corresponding hidden representations. At the first stage, the thumbnail generation takes the sentence embedding as input and generates images with the lowest resolution. At the following stages, images with higher resolution are generated through an attention structure. Details are explained below.
3.2.1 Thumbnail Generation
The thumbnail generation is inspired by the vanilla sentence-to-image design of [9], which generates images conditioned on the sentence and on additional information including bounding boxes and keypoints. The network structure is shown in figure 4.
The generation process branches into two paths. The global path, which is not conditioned on bounding boxes, takes the conditioning factor, concatenates it with the noise vector and feeds the result through several upsampling blocks to produce a global feature tensor. The local path first spatially replicates the sentence embedding and zeros out the region outside the bounding box. The masked text tensor is then fed through upsampling blocks to produce a local feature tensor. Tensors from both paths are concatenated depth-wise and fed through another two upsampling blocks to derive the initial hidden representation.
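The masking step of the local path can be illustrated with a short NumPy sketch. The function name `masked_text_tensor`, the normalised box format (x0, y0, x1, y1) and the rounding of box edges to grid cells are assumptions made for this example.

```python
import numpy as np

def masked_text_tensor(sentence_emb, boxes, size):
    """Replicate a (D,) sentence embedding over a size x size grid and
    zero it outside the given bounding boxes.

    boxes: iterable of (x0, y0, x1, y1) with coordinates normalised
    to [0, 1] (an assumed format).
    """
    mask = np.zeros((size, size), dtype=sentence_emb.dtype)
    for x0, y0, x1, y1 in boxes:
        c0, r0 = int(np.floor(x0 * size)), int(np.floor(y0 * size))
        c1, r1 = int(np.ceil(x1 * size)), int(np.ceil(y1 * size))
        mask[r0:r1, c0:c1] = 1
    # Broadcasting replicates the embedding at every unmasked location.
    return sentence_emb[:, None, None] * mask[None, :, :]

emb = np.array([1.0, 2.0, 3.0])
local_input = masked_text_tensor(emb, [(0.0, 0.0, 0.5, 0.5)], size=4)
```

The resulting (D, size, size) tensor carries the sentence embedding inside the box and zeros elsewhere, so the subsequent upsampling blocks only see text information where an object is expected.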
3.2.2 Super-resolution I & II
Super-resolution enlarges the previously generated thumbnails through the attention mechanism. At each stage, a hidden representation is constructed from the hidden state of the previous stage. The hidden representation is later translated to an image with the image generation network in section 3.2.3.
We incorporate two sets of attention in our framework: the first is between individual words and regular-grid regions, and the second is between phrases and true-grid regions.
Given the word embeddings where for words in a sentence and phrase embeddings where for phrases in a sentence, is calculated as:
Here, is a deep neural network that constructs the hidden representation from given inputs, and are the deep neural networks that construct the word-context matrix and phrase-context matrix respectively.
The word-context matrix is constructed from word representations and regular-grid image region features from . Word embeddings are first fed through a perceptron layer to be converted into the common semantic space as image features. The regular-grid region is defined here in a similar way to section 3.1.3, except that the input feature map is not from the pre-trained Inception-v3.
Given regular-grid region feature , a word-context vector is defined as the weighted sum over word embeddings:
Here is the attention weight between the word and the regular-grid region. Suppose there are regular-grids, the final word-context matrix is then defined as the union of the value for each regular-grid region, i.e., .
This phrase-context matrix is calculated in a similar way as in equation 8, except that word embeddings are replaced with phrase features, and regular-grids are replaced with true-grid features. Here true-grid features are derived from by feeding it through the RoI pooling.
The resulting matrix contains one phrase-context vector for each object defined in the image. In order to apply such a matrix to the network, we let each pixel inside a bounding box carry the same phrase-context vector, while pixels outside all bounding boxes carry zeros. For regions where multiple bounding boxes overlap, the phrase-context vectors are averaged. The resulting phrase-context matrix has the same shape as the previously defined word-context matrix. These two context matrices are then averaged to generate the final hidden representation.
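A minimal sketch of this broadcasting step is given below. It assumes pixel-coordinate boxes with exclusive upper bounds and a channel-first layout; the function names `spread_phrase_context` and `combine_contexts` are hypothetical.

```python
import numpy as np

def spread_phrase_context(contexts, boxes, height, width):
    """Paint each phrase-context vector over its bounding box.

    contexts: (K, D), one vector per box.
    boxes: K tuples (x0, y0, x1, y1) in pixels, x1 and y1 exclusive (assumed).
    Returns a (D, height, width) map: overlapping boxes are averaged and
    pixels outside every box stay zero.
    """
    d = contexts.shape[1]
    total = np.zeros((d, height, width))
    count = np.zeros((height, width))
    for vec, (x0, y0, x1, y1) in zip(contexts, boxes):
        total[:, y0:y1, x0:x1] += vec[:, None, None]
        count[y0:y1, x0:x1] += 1
    return total / np.maximum(count, 1)

def combine_contexts(word_context, phrase_context):
    """Average the word-context and phrase-context maps of equal shape."""
    return 0.5 * (word_context + phrase_context)

contexts = np.array([[2.0, 0.0], [0.0, 4.0]])
boxes = [(0, 0, 2, 2), (1, 1, 3, 3)]
phrase_map = spread_phrase_context(contexts, boxes, 4, 4)
```

Dividing by the per-pixel box count implements the averaging over overlapping boxes, and clamping the count to at least one leaves background pixels at zero.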
3.2.3 Image Generation Network: Hidden Representation to Images
As shown in figure 1, the previous thumbnail generation and super-resolution stages do not produce images directly; instead, they produce hidden representations that are fed through an additional convolution operation with kernel size and the depth dimension to generate images.
In general, we use three types of discriminators. The first evaluates whether a given image is real or fake, the second evaluates an image-sentence pair, and the third evaluates a group consisting of an image, a sentence and bounding boxes. In addition, we incorporate the logic of the matching-aware discriminator from [11], in which the latter two discriminators are fed real, fake and false (mismatched) samples.
Therefore the value function for the generator and the discriminator at each stage is given as:
3.3 Bounding Box Prediction
As the image generation relies on bounding box information, a third network for bounding box prediction is trained based on the text embeddings from section 3.1. We define two prediction tasks in this network: the first is to predict the coordinates of the bounding boxes, and the second is to predict the number of bounding boxes. We structure both predictions as regression problems from the sentence embedding. Therefore, a given sentence embedding is fed through 2 separate multi-layer neural networks, each of which is trained with a mean squared error loss between the predicted value and the real value.
We apply several processing steps to the data in the following manner. First, the coordinates of the bounding boxes are normalized to a proportion of the full image length, so that the maximum value is 1 regardless of the size of the bounding box or the image. Second, given a predicted number of bounding boxes, the coordinates of any bounding boxes beyond the predicted number are considered “invalid” and are thus excluded when computing the loss. In addition, we define words or phrases such as “left”, “right” and “on top of” as position related words. In section 4.3, we report how performance changes when training is performed only on sentences that contain position related words.
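The two processing steps can be sketched as follows. The (x, y, w, h) pixel box format and the function names `normalise_boxes` and `masked_box_mse` are assumptions for this illustration; the original preprocessing code may differ.

```python
import numpy as np

def normalise_boxes(boxes, img_w, img_h):
    """Convert (K, 4) boxes given as (x, y, w, h) in pixels (assumed format)
    into proportions of the image width and height."""
    scale = np.array([img_w, img_h, img_w, img_h], dtype=float)
    return np.asarray(boxes, dtype=float) / scale

def masked_box_mse(pred, target, num_valid):
    """Mean squared error over the first `num_valid` boxes only;
    boxes beyond that count are treated as invalid and ignored."""
    if num_valid == 0:
        return 0.0
    diff = pred[:num_valid] - target[:num_valid]
    return float((diff ** 2).mean())

norm = normalise_boxes([[64, 32, 128, 64]], img_w=256, img_h=128)
pred = np.array([[0.0, 0.0, 0.0, 0.0], [1.0, 1.0, 1.0, 1.0]])
target = np.zeros((2, 4))
loss_one_box = masked_box_mse(pred, target, num_valid=1)
loss_two_boxes = masked_box_mse(pred, target, num_valid=2)
```

Normalising by the image size keeps every coordinate in the range 0 to 1, so images of different sizes contribute comparable losses.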
Below we demonstrate the performance of the revised text encoder and the proposed GAN network.
The dataset we used is the MSCOCO dataset, which includes various images that involve natural scenes and complex object interactions. It contains 82,783 images for training and 40,504 for validation. Each image has 5 corresponding captions. Bounding boxes are provided for objects over 80 categories.
The experimental results below are computed on random samples from the validation set. Two metrics, the Inception score and R-precision, are used for the evaluation.
4.1 Metrics: Inception score and R-precision
As it is difficult to measure the performance of image generation quantitatively, the Inception score was proposed in [12]:
$$IS = \exp\left(\mathbb{E}_{x}\, D_{KL}\big(p(y \mid x) \,\|\, p(y)\big)\right)$$
where $x$ is a generated image and $y$ is the label predicted by the Inception model [15].
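For illustration, the Inception score can be computed from a matrix of predicted class probabilities as in the sketch below, which takes the exponential of the mean KL divergence between each per-image label distribution and the marginal label distribution. The function name and the small `eps` added inside the logarithms are choices made for this example.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """probs: (N, C) class probabilities p(y|x) for N generated images,
    as predicted by a classifier such as Inception-v3."""
    probs = np.asarray(probs, dtype=float)
    marginal = probs.mean(axis=0, keepdims=True)            # p(y)
    kl = (probs * (np.log(probs + eps) - np.log(marginal + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

# Identical, uninformative predictions give the minimum score of 1.
score_uniform = inception_score(np.full((4, 3), 1.0 / 3.0))
# Confident predictions spread evenly over 3 classes give the maximum of 3.
score_diverse = inception_score(np.eye(3))
```

The score is high only when each image is classified confidently and the predicted classes are diverse across images, which is why it captures quality and diversity but not agreement with the input sentence.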
However, the Inception score only measures the quality and diversity of the generated images. It does not evaluate how accurately an image reflects the description given in a sentence. Therefore, another metric called R-precision was previously proposed [17].
In AttnGAN, authors define R-precision to be the top relevant text descriptions out of retrieved texts for an image. The candidate sentences are one relevant and randomly selected sentences. The final R-precision is an average over generated images. However, we observed through our experiments that as the number of randomly selected unmatched sentences increases, R-precision values decrease. This value is also affected by how similar candidate sentences are. Therefore, we also report a second R-precision value with one ground truth () and all the rest of the sentences ( false samples) to be the mismatching samples. We denote the first value as R-precision(100), and the second value as R-precision(30K).
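A simplified sketch of this retrieval metric is shown below for the case of a single relevant sentence per image, where the score reduces to how often the true sentence ranks first among the candidates by cosine similarity. The function name `r_precision_top1` and the array layout are assumptions for this example, and the embeddings would come from the trained image and text encoders.

```python
import numpy as np

def r_precision_top1(image_feats, true_text, distractor_text):
    """Fraction of images whose true sentence ranks first.

    image_feats: (N, D) image features.
    true_text: (N, D) embeddings of the matching sentences.
    distractor_text: (N, M, D) embeddings of M mismatched sentences per image.
    """
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)

    img, pos, neg = unit(image_feats), unit(true_text), unit(distractor_text)
    pos_sim = (img * pos).sum(axis=-1)                 # (N,)
    neg_sim = np.einsum("nd,nmd->nm", img, neg)        # (N, M)
    # The true sentence is a hit when no distractor scores higher.
    hits = (neg_sim > pos_sim[:, None]).sum(axis=1) == 0
    return float(hits.mean())

images = np.eye(2)
distractors = np.array([[[0.0, 1.0], [1.0, 1.0]],
                        [[1.0, 0.0], [1.0, 1.0]]])
score_correct = r_precision_top1(images, np.eye(2), distractors)
score_wrong = r_precision_top1(images, np.eye(2)[::-1], distractors)
```

Increasing the number of distractors `M` gives each image more chances to be outranked, which is consistent with the observation that the value decreases as more unmatched sentences are added.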
4.2 The Revised Text Encoder
Below we demonstrate the performance of the revised text encoder by comparing R-precision scores of the validation set in the text encoder training. We compare the performance of three encoders which are different combinations of encoder loss values introduced in section 3.1.4.
Table 1 shows that, when training the text encoder, introducing phrases and true-grid regions encourages finding the correct matching between real image-sentence pairs. Constructing the loss from phrases and true-grid regions while omitting words and regular-grid regions further improves the result.
4.3 Bounding Box Prediction
During both training and validation, the coordinates of each bounding box and the number of bounding boxes are predicted by separate networks, as explained in section 3.3.
In this section, we compare the performance of several alternatives for both predictions. The first comparison is between applying 4-layer and 1-layer neural networks to both tasks. The second is whether to use all sentences in the prediction tasks or only those sentences with position related words. The definition of position related words is given in section 3.3. Comparisons are made in terms of loss values on the validation set over training iterations.
Figure 5 shows that applying 4 layers to both predictions results in a higher loss than applying 1 layer. Training only on sentences with position related words improves the coordinate prediction, but it results in a much higher loss in the number prediction task. We therefore conclude that the best practice is to apply a 1-layer neural network to both predictions, and that the number of nouns should not be brought into the number prediction task.
4.4 The GAN Network
Table 2 reports the R-precision and Inception score achieved by the AttnGAN and the proposed method. The proposed method achieves higher R-precision(100) and higher R-precision(30K). This shows that the proposed method is able to generate images that match more closely with the content described in the sentence.
Table 2 also shows that the proposed method achieves a final Inception score very close to that of AttnGAN. Below we show, using real samples, three aspects in which the generation results surpass previous methods.
Firstly, as shown in figure 3, the proposed method is able to generate images that match a given sentence more closely, which is also supported by the higher R-precision rate. For example, when the generation is based on sentences that describe the number of objects explicitly, such as “three sheep” or “adult and two children”, the proposed method is able to generate the correct number of objects in most cases. In addition, the proposed method is less likely to focus only on certain keywords and omit other important information. For example, in the case where “stop sign” and “go light” are both mentioned, the generated image displays both objects.
|A picture of a stop and go light with a stop sign next to it|
|Three sheep in the process of running on grass outside|
|Adult and two children on the beach flying a kite|
Secondly, the proposed method is less likely than AttnGAN to generate very similar images for different sentences. As shown in figure 4, in cases where “surf” or “surfing” related words are mentioned, AttnGAN generates very similar, if not almost identical, images. This issue also exists for food related sentences.
|A man riding a wave on a surfboard in the ocean|
|Person on surfboard laying on board going over a small wave|
Thirdly, table 6 shows that the proposed method performs better at displaying the correct shape of objects.
|Large brown cow standing in field with small cow|
Below we give another interesting example, in which our proposed method is able to generate black and white images of better quality.
|Black and white photo of a predestrian at a suburban crosswalk|
Our work improves on the state-of-the-art attention based GAN network for text-to-image generation. Our main contribution is twofold. Firstly, we propose a new design of text encoder which extracts additional phrase embeddings. Secondly, we incorporate into the GAN network a new set of attention between true-grid regions and phrases. Experiments on the MSCOCO dataset show that our approach is capable of generating more realistic and accurate images.
- [1] R. Girshick. Fast R-CNN. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
- [2] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2014.
- [3] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672–2680. Curran Associates, Inc., 2014.
- [4] A. Graves, S. Fernández, and J. Schmidhuber. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks, pages 799–804. Springer, 2005.
- [5] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015.
- [6] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
- [7] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, December 2008.
- [8] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
- [9] S. E. Reed, Z. Akata, S. Mohan, S. Tenka, B. Schiele, and H. Lee. Learning what and where to draw. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 217–225. Curran Associates, Inc., 2016.
- [10] S. E. Reed, Z. Akata, B. Schiele, and H. Lee. Learning deep representations of fine-grained visual descriptions. CoRR, abs/1605.05395, 2016.
- [11] S. E. Reed, Z. Akata, X. Yan, L. Logeswaran, B. Schiele, and H. Lee. Generative adversarial text to image synthesis. CoRR, abs/1605.05396, 2016.
- [12] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training GANs. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 2234–2242. Curran Associates, Inc., 2016.
- [13] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681, 1997.
- [14] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. CoRR, abs/1409.3215, 2014.
- [15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. CoRR, abs/1512.00567, 2015.
- [16] P. Welinder, S. Branson, T. Mita, C. Wah, F. Schroff, S. Belongie, and P. Perona. Caltech-UCSD Birds 200. Technical Report CNS-TR-2010-001, California Institute of Technology, 2010.
- [17] T. Xu, P. Zhang, Q. Huang, H. Zhang, Z. Gan, X. Huang, and X. He. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks. CoRR, abs/1711.10485, 2017.
- [18] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.