Semi-Supervised Phrase Localization in a Bidirectional Caption-Image Retrieval Framework
We introduce a novel deep neural network architecture that links visual regions to corresponding textual segments, including phrases and words. To accomplish this task, our architecture makes use of the rich semantic information available in a joint embedding space of multi-modal data. From this joint embedding space, we extract the associative localization maps that develop naturally, without explicitly providing supervision for the localization task during training. The joint space is learned using a bidirectional ranking objective optimized with an N-Pair loss formulation. This training mechanism demonstrates that localization information is learned inherently while optimizing a bidirectional retrieval objective. The model's retrieval and localization performance is evaluated on the MSCOCO and Flickr30K Entities datasets. The architecture outperforms state-of-the-art results in the semi-supervised phrase localization setting.
Multi-modal data fusion is critical in retrieving a uniform representation of such data, thereby leading to a better understanding of the underlying semantic relationships. It is understood that data from multiple modalities sharing similar semantic context are characterized by latent relationships. The task of extracting a uniform representation of multi-modal data is driven by the hypothesis that these latent relationships can be perceived as meaningful associations when projected into a common space. For instance, prior work showed that by reasonably projecting data of different modalities into a single common space, one can reveal several implicit correlation patterns between the modalities. In real-world situations, when presented with a rich textual description of a corresponding image, we can intuitively identify complex relationships between the different visual components and can also localize the different textual entities in the visual scene. Interestingly, we learn these relationships without the level of extensive supervision available in current multi-modal datasets. Consequently, one may be tempted to explore the possibility of developing a method that learns these beneficial latent relationships naturally in a joint embedding space. We propose a method that finds accurate spatial attention maps corresponding to each constituent word or phrase in a descriptive caption associated with an image. These spatial attention maps are derived without any supervision for localizing the textual entities on the image.
We aim to extract relevant inherent phrase localization information from a model trained on a ranking objective. We have chosen the bidirectional image-caption retrieval task as a proxy for the localization objective. A deep neural network is trained end to end to minimize a loss objective for this proxy retrieval task. We hypothesize that by using such a proxy task, our framework can implicitly capture the visual associations between caption tokens and objects located at different spatial locations in the image. The model does not use any additional parameters, object detection frameworks, or region proposal networks to aid the principal localization objective; localization is a favorable outcome of training on the proxy task. The retrieval framework we use resembles some well-known retrieval architectures, but has been modified slightly to enable the extraction of the required spatial saliency maps, which represent the implicit localization tendency of a model trained on the retrieval task. We show that a machine can learn to discover the visual entities mentioned as textual segments in a corresponding descriptive caption. Our model can process both words and phrases when producing these saliency maps.
Datasets like Flickr30K Entities or Microsoft COCO have segmentation and ground-truth annotation information embedded along with the images and associated captions. This leads to the formulation of a problem where the machine can jointly learn the relationship between language information and visual data while being provided additional localization information. Several methods that depend on such datasets require elaborate and accurate annotations, which often turns out to be a laborious and time-intensive task. In addition, errors in annotations can persist during training and may lead to a fairly wrong or biased embedding space. Our method circumvents this problem by not relying on such bounding box annotations for training.
In summary, the main contributions of our paper are as follows: We propose a novel framework for localizing query phrases from a caption on its associated image. By using the bidirectional retrieval task as a proxy, our method demonstrates the utility of extracting additional knowledge from an existing network in a self-supervised fashion. We compare the performance of two ranking-based loss functions and propose the use of the N-Pair loss to optimize the retrieval and localization objectives. We also evaluate our framework's localization performance on the Flickr30K Entities dataset and achieve state-of-the-art performance.
2 Related Work
Saliency Maps: Previous work has shown that deep neural networks trained on the image classification task can be used to explore relationships between the network's activation patterns and object labels. However, training on classification tasks that consider a limited number of discrete categories can render the relationships between the objects present in the image discrete and disconnected. Saliency-based methods have used appropriately weighted activation maps derived from the output of a CNN to visualize internal representations of discrete labels on an image. More recent work has shown that it is also possible to derive pixel-wise importance for a given class using distinct entity labels. These methods, however, cannot be extended to linguistic structures that represent and convey complex relationships between the visual entities present in the image.
Phrase Localization: The problem of phrase localization has previously been addressed to good effect in a supervised fashion using large-margin objectives inspired by metric learning [14, 20], but these methods require large amounts of data while relying heavily on a well-defined set of annotated object proposals. Moreover, most such methods use a ranking function over a set of region proposals to locate the regions that best match the textual descriptor. These two steps may not be functionally well linked, as the region proposals may be based on discrete object categories and may not include regions relevant to the rich language descriptor.
A general approach in semi-supervised phrase localization is to optimize a suitably chosen proxy objective. One line of work shows that the localization objective can be optimized by reconstructing a phrase correctly after the model learns to attend to a meaningful bounding box; however, this objective can also be satisfied by learning better unimodal co-occurrence statistics for language tokens. Xiao et al. proposed that localization can be optimized by applying a discriminative loss on the whole phrase instead of the object to be localized. Ramanishka et al. use a caption generation framework to score the phrase on a set of proposal boxes and select the box with the highest probability. We argue that the precise reconstruction of a textual descriptor from an image might not correlate well with the localization objective. Several other works, such as [23, 16], average over heat-maps to get phrase-level localization output: they generate attention maps for single words and must average over them to obtain a single heatmap for a phrase. The work most related to our method uses representations of fragmented images and text to obtain a global score, hypothesizing that if matching elements are present in corresponding sentence and image fragments, a fixed non-linear function should generate a high global score. The authors optimize their objective using a Multiple Instance Learning approach, but the method suffers from the problems of using predefined region proposals. A recent work uses concept learning as a task to learn self-supervised localization patterns, but can incur reduced generalization and a limited number of learnable concepts. Deep reinforcement learning based methods have also been explored, training an agent to move and reshape a bounding box to localize the object according to referring expressions.
Such a model consists of spatial and temporal contexts that effect small changes in the predicted bounding boxes at each step, which makes it prone to failure in capturing global context.
3 Model Architecture
The architecture that we develop should be able to construct a high-level joint embedding space, where instances belonging to a corresponding pair of data from different modalities are projected close to each other. The model comprises two branches that develop separate representative vectors for the image and the caption, respectively. The model receives a batch of corresponding image-caption pairs during training. A caption is sampled randomly from the pool of available captions while constructing the training mini-batch, and the corresponding set of images is then retrieved to complete the batch. Each image in the batch is passed through the image branch of the model to obtain an activation map. Each point on this activation map encodes a certain region of the original image as a D-dimensional vector, where D is the depth of the activation block. The network encodes K such overlapping regions on the image, so the image representation can be written as V = {v_1, ..., v_K}. Each caption token is transformed into a D-dimensional vector by the caption branch of the network, by passing the GloVe embedding of that token through an LSTM. These vectors are stacked one above the other to form the caption representation C = {c_1, ..., c_T}, where T is the length of the caption, i.e., the number of caption tokens considered.
After obtaining representations for both the image and the caption, we use them directly to build a joint associative localization space. The value at each point in this space is computed as the dot product between a caption token vector c_t and a visual vector v_k. Consequently, we generate T such maps, and each map consists of K pixels. The weight on each pixel of one of these maps corresponds to the degree of association between the image region encoded by v_k and the caption token c_t. The principal objective for this network is to assign a similarity score to a given image-caption pair. To learn this assignment, the network is trained to assign a higher score to a similar pair and a lower score to a dissimilar pair. This score is computed by applying an aggregator function on the associative localization space we just built. We use a pooling-based method to retrieve a single score from each frame in the joint space. Note that each caption token has a corresponding spatial attention map as part of this joint space, which can be represented as a set of frames {M_1, ..., M_T}. We therefore max-pool across the spatial dimensions of each attention map corresponding to a single token, giving a T-dimensional vector, and then average-pool over this vector to obtain a single scalar. This max-then-average aggregation can be formulated as:

S(I, C) = (1/T) * sum_{t=1}^{T} max_{i,j} M_t(i, j)
Here, M_t(i, j) represents a point in the three-dimensional localization space, determined by the row i, column j, and depth (token) t dimensions. When using this aggregator, the model adjusts the spatial location of the maximum activated point for each textual token, which leads to much more refined adjustments as the model picks the image region that correlates most highly with the given caption token. Note that this scalar score is needed to optimize the bidirectional retrieval task, which is the primary objective of the network. The score could be computed by a number of aggregator functions, but this particular choice enables us to learn the association between image regions and caption tokens in a much more meaningful fashion.
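As a concrete sketch of this construction, the localization space and the max-then-average score can be computed as follows. The shapes and random features here are illustrative assumptions, not the trained model's outputs:

```python
import numpy as np

# Illustrative shapes: an H x W grid of image regions and T caption tokens,
# all embedded in a shared D-dimensional space.
H, W, D, T = 14, 14, 512, 5

rng = np.random.default_rng(0)
V = rng.standard_normal((H * W, D))  # one D-dim vector per image region
C = rng.standard_normal((T, D))      # one D-dim vector per caption token

# Associative localization space: T spatial attention maps, each H x W.
M = (C @ V.T).reshape(T, H, W)

# Aggregator: max-pool each token's map spatially, then average over tokens.
per_token = M.reshape(T, -1).max(axis=1)  # T-dimensional vector
score = per_token.mean()                  # scalar image-caption similarity
```

The argmax location inside each map M_t is what later serves as the localization prediction for token t.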
3.2 Loss Function
A ranking-based loss function is needed to optimize the chosen proxy task of bidirectional image-caption retrieval. For this purpose, we first explored the triplet loss, which uses a triplet consisting of the anchor example, a positive sample, and an impostor sample. The loss is designed to maximize the similarity between the anchor and the positive instance, and simultaneously minimize the similarity between the anchor and the negative sample. The overall loss comprises two components. The first is derived by fixing an image anchor and sampling a pair of captions, the positive (c+) and the negative (c-), from the batch; the impostor caption is sampled using an online hard-triplet mining strategy. The second component uses an anchor caption and a pair of images (i+, i-) to form the triplet. The overall loss is computed over the entire batch of N samples.
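A minimal sketch of the two hinge components follows; the similarity values are made up for illustration (in the model they come from the aggregator score):

```python
def triplet_hinge(s_anchor_pos, s_anchor_neg, margin=0.2):
    # Penalize the impostor pair whenever it is not at least `margin`
    # less similar than the positive pair.
    return max(0.0, margin - s_anchor_pos + s_anchor_neg)

# Image anchor with a positive caption c+ and a mined impostor caption c-.
loss_img_anchor = triplet_hinge(0.9, 0.6)
# Symmetric term: caption anchor with a positive image i+ and impostor i-.
loss_cap_anchor = triplet_hinge(0.9, 0.85)
total_loss = loss_img_anchor + loss_cap_anchor
```

Note how the first term vanishes because the positive already beats the impostor by more than the margin, while the second still contributes gradient.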
After performing several experiments, we found that the triplet loss converged slowly and the online triplet mining procedure was computationally quite expensive. This led us to explore a more generalized loss function, the N-Pair loss. Unlike the triplet loss, the N-Pair loss considers all the impostor samples present in a batch, which enhances the discriminative power of the model. During a single update of the triplet loss, the model's parameters are updated based on a single positive and impostor pair, ignoring the other negative examples in the batch. This leads to situations where the anchor sample is distant from a specific impostor class but still considerably close to other impostor samples, as shown in Figure 3. The triplet loss formulation assumes that over many iterations, enough triplets will be sampled that the final distances from all impostor samples exceed the margin α. This can produce unstable individual updates and poor convergence, as observed in our experiments. The N-Pair loss is a simple generalization of the triplet loss in which each update considers multiple impostor samples.
We see that the N-Pair loss formulation is similar to a multiclass softmax structure, with the similarity scores analogous to the class probability scores in the softmax formulation. During training, the gradient from this loss ensures that each update increases the similarity score for the matching pair while distancing all the negative samples from the anchor. This leads to the development of well-structured clusters of similar data points, as seen in Figure 3. We also note that in the case where N = 2, the N-Pair loss is equivalent to the triplet loss. To the best of our knowledge, we are the first to propose the use of the N-Pair loss in such a setting, and it enables the model to perform well on the localization objective.
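In the multiclass-softmax view described above, the N-Pair loss over a batch of N image-caption pairs can be sketched as follows (a simplified illustration; in the model, the score matrix comes from the aggregator function):

```python
import numpy as np

def n_pair_loss(scores):
    """scores: (N, N) similarity matrix for a batch of N image-caption
    pairs; scores[i, i] is the matching pair, and the off-diagonal
    entries of row i are the impostors for anchor i."""
    n = scores.shape[0]
    # Row-wise log-softmax; the "correct class" for row i is column i.
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(n), np.arange(n)].mean()

# A score matrix with a dominant diagonal yields a low loss; an
# uninformative (all-equal) matrix yields the maximum-entropy loss log(N).
loss_good = n_pair_loss(5.0 * np.eye(4))
loss_uniform = n_pair_loss(np.zeros((4, 4)))
```

Each row's gradient simultaneously pulls the positive pair together and pushes all N-1 impostors away, which is exactly the multi-impostor update discussed above.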
However, due to the larger number of objects present per category, as well as the higher number of training images in MSCOCO, we use it for training. The Flickr30K dataset is used for evaluation.
4 Experimental Setup
In this section, we discuss implementation details, the datasets used, the evaluation metrics employed for performance assessment, and the baselines.
4.1 Implementation Details
The model is trained in the PyTorch framework with a batch size of 64, optimized using SGD with momentum. For the image branch, a pretrained VGG-19 model is used to obtain the activation map. The activation map has dimensions H x W x D, the caption representation has dimensions T x D, and thus the associative localization space has dimensions T x H x W. The model is trained end to end, so the weights of both the VGG network and the LSTM network are updated. Based on some initial analysis, we found that most captions in both Flickr30K and MSCOCO were of length 20 or lower, so the pad limit was set to 20 and only captions of that length or shorter were considered. Constructing a batch suitable for the N-Pair loss can be computationally expensive. Ideally, one would sample all possible impostor samples for a given anchor; instead, we limit the number of impostor samples per anchor, as prescribed in the original paper. Even so, a naive construction would require a large number of pairs for a single training batch. We reduce this complexity with a simple workaround: the impostor samples for a given anchor are drawn from the positive samples of the other anchors in the batch, so a positive sample for one anchor becomes an impostor for another. We avoid sampling captions from the same image, to prevent multiple positive samples. This strategy lets us build a batch from N image-caption pairs rather than a quadratic number of pairs.
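The batch-construction workaround can be sketched as follows; `image_caption_pool`, a mapping from image id to its list of captions, is a hypothetical structure used purely for illustration:

```python
import random

def build_n_pair_batch(image_caption_pool, n):
    # Sample n distinct images so that no two captions in the batch come
    # from the same image; each anchor's positive caption then doubles as
    # an impostor for every other anchor, so only n pairs are needed.
    image_ids = random.sample(list(image_caption_pool), n)
    return [(img, random.choice(image_caption_pool[img]))
            for img in image_ids]
```

Because every caption in the batch belongs to a different image, the off-diagonal entries of the batch score matrix are guaranteed to be valid impostor pairs.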
Datasets: The datasets that can be considered for this task have to pair images with global text descriptions and also include region-level annotations for constituent segments in the caption. We choose the MSCOCO and Flickr30K Entities datasets for this work. Both datasets provide an excellent repository of images with a maximum of 5 corresponding captions per image. The MSCOCO dataset has no region-level annotations, and therefore cannot be used for quantitative evaluation of the localization task, but it has been used for evaluating retrieval performance. The Flickr30K Entities dataset is used to evaluate both retrieval and localization performance.
The Flickr30K Entities dataset contains 31,783 images. Each image is associated with 5 captions, with 3.52 query phrases per caption on average. Each query phrase has 2.3 words on average, and these phrases have an average noun count of 1.2. This is a highly desirable scenario when testing a network on a weakly supervised localization objective: since the attention maps we build correspond to a single region for localizing entities, it is beneficial for a query phrase to point to a single bounding box on the image. This dataset also provides multiple bounding box annotations for different description instances within an image. Other datasets that could be considered are Visual Genome and ReferIt; both have descriptions for regions that are less salient. Due to the nature of the Visual Genome crowdsourcing protocol, its object annotations have much greater redundancy. For instance, the phrases "A boy wearing a shirt" and "This is a little boy" may be associated with completely different bounding boxes, despite referring to the same person in the image. Moreover, these datasets pair specific objects with short descriptions, rather than pairing images with global descriptions that have segment-wise annotations. The Flickr30K Entities dataset is well suited for understanding the different ways by which our mind recognizes visual entities and the most salient relationships amongst them. These factors make Flickr30K Entities the best-suited dataset for our task, as the proposed method aims to find relationships that build inherently between the language and image modalities while being trained on the retrieval objective.
Evaluation Metric: To evaluate the localization maps generated by the model, one needs a suitable evaluation metric. Since the model generates localization in the form of an attention map, we use the pointing game metric. This metric measures whether the most confident point of the predicted attention map falls within the ground-truth bounding box: if the maximum attention lies inside the box, the prediction counts as a Hit; otherwise it counts as a Miss. The accuracy is given as the ratio of Hits to the total number of testing instances: Accuracy = #Hits / (#Hits + #Misses). Many previous works have used this metric, which provides a strong platform for comparison with the state of the art. A well-acknowledged problem with this metric arises when there are multiple instances of an entity in the image, so that the associated textual token has multiple bounding boxes. We checked the metric across all such bounding boxes, irrespective of the number of bounding boxes for any given textual token.
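The metric can be sketched as follows, including the multi-box handling described above; box coordinates are assumed to be (x0, y0, x1, y1) in the attention map's grid:

```python
import numpy as np

def pointing_game_hit(attention_map, boxes):
    # Hit if the most confident point of the map falls inside any of the
    # ground-truth boxes associated with the textual token.
    r, c = np.unravel_index(attention_map.argmax(), attention_map.shape)
    return any(x0 <= c <= x1 and y0 <= r <= y1 for x0, y0, x1, y1 in boxes)

def pointing_game_accuracy(maps_and_boxes):
    # Accuracy = #Hits / (#Hits + #Misses) over all test instances.
    hits = sum(pointing_game_hit(m, b) for m, b in maps_and_boxes)
    return hits / len(maps_and_boxes)
```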
Baselines: To compare the proposed method with previous results, a number of suitable baselines were considered. The first baseline chooses the mid-point of each image as the maximum of the pointing game; if the dataset is center-biased, this can yield reasonably good results. Another baseline uses a VGG-19 model pretrained on ImageNet to generate an averaged attention map for the entire image; if this baseline shows high accuracy, the dataset is biased toward captions whose phrases mostly describe the principal object of the image. The accuracy values for these baselines are provided in Table 2. A further baseline uses the proposed model without any training, to demonstrate that localization performance improves alongside retrieval performance. Apart from these baselines, our results are compared to previous weakly supervised works; the compiled results are presented in Table 2. We also evaluated localization performance using the same model trained with the triplet loss, and checked cross-dataset performance by evaluating differently trained models. Loss types and parse-mode evaluation scores are also reported in Table 2. All quantitative evaluation is based on the Flickr30K Entities dataset.
Parse Mode: For a given image-caption pair in the Flickr30K Entities dataset, spatial attention maps can be extracted for each word in the caption. However, since the dataset associates some visual objects with multi-word phrases within the caption, experiments were performed with two different parse modes. In the default setting, the spatial attention maps of each constituent word in the descriptive phrase were averaged to generate a single attention map. In the phrase-mode setting, the entire phrase is represented by a single GloVe vector obtained by averaging the GloVe vectors of its constituent words, which automatically generates a single attention map for the whole phrase. We obtain better results using the phrase parse mode.
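The phrase-mode parsing step can be sketched as follows, where `glove` is a hypothetical word-to-vector lookup standing in for the pretrained GloVe table:

```python
import numpy as np

def phrase_vector(glove, phrase):
    # Represent a multi-word phrase by the mean of its constituent GloVe
    # vectors, so the phrase yields a single attention map downstream.
    return np.mean([glove[w] for w in phrase.split()], axis=0)
```

Feeding this averaged vector through the caption branch produces one token vector, and hence one attention map, per phrase.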
5.1 Localization maps
For a given positive image-caption pair, the intermediary co-localization maps have been extracted and each mask is overlaid on top of the associated image to visualize the most salient parts of the image corresponding to the aligned word in the caption. By setting a threshold on the saliency maps, one can also generate segmentation masks that help us visualize only the most salient parts of the image for a given textual token. It is seen in Figures 1 and 5 that the model performs well and is consistent with the initial hypothesis of implicitly generating localization maps as a favorable by-product of the retrieval task. In Figure 1, we can also see the heatmap-based visualization of the localization patterns formed on the image for different caption units. Since quantitative evaluation on the MSCOCO dataset is not possible, we present some qualitative results from the MSCOCO test set in Figure 5. More qualitative results have been provided in the Supplementary Material.
Table 1: Bidirectional retrieval results (R@1 / R@5 / R@10) for caption retrieval (image query) and image retrieval (caption query).

| Model/Method Name | Flickr30K caption retrieval | Flickr30K image retrieval | MSCOCO caption retrieval | MSCOCO image retrieval |
|---|---|---|---|---|
| CCA (Fisher Vector) | 35.0 / 62.0 / 73.8 | 25.0 / 52.7 / 66.0 | 39.4 / 67.9 / 80.9 | 25.1 / 59.8 / 76.6 |
| DSPE (Fisher Vector) | 40.3 / 68.9 / 79.9 | 29.7 / 60.1 / 72.1 | 50.1 / 79.7 / 89.2 | 39.6 / 75.2 / 86.9 |
| Proposed Method (N-Pair loss) | 27.0 / 49.0 / 62.0 | 10.0 / 32.0 / 42.0 | 47.0 / 77.0 / 92.9 | 27.9 / 65.9 / 81.0 |
Table 2: Phrase localization accuracy (pointing game) on the Flickr30K Entities test set.

| Model/Method Name | Accuracy (%) |
|---|---|
| No-training baseline | 26.40 |
| Fang et al. | 29.03 |
| Zhang et al. | 42.40 |
| Ramanishka et al. | 50.10 |
| Javed et al. | 49.10 |
| Proposed Method (N-Pair loss) | 51.06 |
| Proposed Method (Triplet loss) | 14.93 |
Quantitative Evaluation: Table 2 presents results for all baselines, previous methods, and the proposed method on the Flickr30K test set. Our method clearly outperforms the other state-of-the-art methods by a fair margin. An interesting takeaway from the baseline analysis is that the Flickr30K Entities dataset is indeed slightly biased toward the center point: many phrases have bounding boxes encompassing the central point of the image. The proposed method surpasses the results of Ramanishka et al., which uses a captioning network and an attention mechanism to generate such attention maps. The model also performs fairly well on the localization objective across datasets: a model trained on MSCOCO and evaluated on Flickr30K Entities in the phrase parse mode performs almost as well as the model trained on Flickr30K Entities in the default parse mode. We also note that the N-Pair loss formulation achieves much better results than the triplet loss formulation. Results from further experiments are included in the Supplementary Material.
Comparison with VGG19 Baseline: For a fair comparison, we also compare our results qualitatively and quantitatively with the VGG19 baseline, to ensure that our model is learning an additional localization objective. The quantitative evaluation, presented in Table 2, shows that the model is indeed able to distinguish the different parts of an image containing the objects mentioned in the associated caption. Looking at the results in Figure 6, the VGG19 averaged heatmaps provide a single region of focus on the most salient portion of the image, whereas the proposed model looks at different parts of the image for distinct mentions of the various entities in the caption. This substantiates the hypothesis that a retrieval framework inherently learns to discover different visual objects in the image as it learns to associate similar text and image data.
5.2 Retrieval Scores
Retrieval scores are reported as Recall@1, Recall@5, and Recall@10 for the bidirectional retrieval tasks. The model performs reasonably well on the test fold of Flickr30K and on a separate test fold held out from the MSCOCO validation set. A qualitative evaluation, presented in Figure 4, shows how the model surfaces one of the ground-truth instances in the ranked retrieval list. Table 1 lists the recall scores for models using different score types on an image fold of size 100. Since the standard retrieval model has been slightly altered, we expected retrieval performance to be somewhat below the state of the art; however, experiments showed that as retrieval performance increased, so did localization performance. Moreover, since retrieval is a proxy task, we focused on tweaking the model for better localization performance. As expected, retrieval performance was very poor for the model trained using the triplet loss.
We propose a novel neural architecture that can find localization patterns across the text and image modalities. Across several experiments, this architecture shows improved performance on the Flickr30K Entities dataset compared to other state-of-the-art methods in self-supervised localization. The method leverages the inherent localization occurring between sentence units and the image when trained on a bidirectional retrieval objective, which is critical in ensuring that no excess parameters or weights are involved in obtaining these maps. Further experiments enabled the design of a better optimization technique using the N-Pair loss function; other metric learning objectives, such as a facility-location-based loss, could be explored to learn the semantic relationships better. This work also demonstrates that the machine can localize the visual region corresponding to a specific token in the textual caption, discovering relevant visual regions on its own without any form of supervision.
- (2019) Multimodal machine learning: a survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2), pp. 423–443.
- (2015) Look and think twice: capturing top-down visual attention with feedback convolutional neural networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2956–2964.
- (2011) Semantic combination of textual and visual information in multimedia retrieval. In Proceedings of the 1st ACM International Conference on Multimedia Retrieval, pp. 44.
- (2015) From captions to visual concepts and back. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1473–1482.
- DeViSE: a deep visual-semantic embedding model. Technical report.
- (2018) Learning unsupervised visual grounding through semantic self-supervision. arXiv preprint arXiv:1803.06506.
- (2014) Deep fragment embeddings for bidirectional image sentence mapping. In Advances in Neural Information Processing Systems, pp. 1889–1897.
- (2017) Introduction to PyTorch. In Deep Learning with Python, pp. 195–208.
- (2014) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. arXiv preprint arXiv:1411.7399.
- (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision, pp. 740–755.
- (2015) Multimodal convolutional neural networks for matching image and sentence. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2623–2631.
- (2014) Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv preprint arXiv:1412.6632.
- (2017) Deep metric learning via facility location. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5382–5390.
- (2017) Phrase localization and visual relationship detection with comprehensive image-language cues. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1928–1937.
- (2015) Flickr30k entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2641–2649.
- (2017) Top-down visual saliency guided by captions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7206–7215.
- (2016) Grounding of textual phrases in images by reconstruction. In European Conference on Computer Vision, pp. 817–834.
- (2016) Improved deep metric learning with multi-class n-pair loss objective. In Advances in Neural Information Processing Systems, pp. 1857–1865.
- (2000) Content-based query of image databases: inspirations from text retrieval. Pattern Recognition Letters 21 (13-14), pp. 1193–1198.
- (2016) Learning deep structure-preserving image-text embeddings. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5005–5013.
- (2017) An end-to-end approach to natural language object retrieval via context-aware deep reinforcement learning. arXiv preprint arXiv:1703.07579.
- (2017) Weakly-supervised visual grounding of phrases with linguistic structures. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5945–5954.
- (2018) Top-down neural attention by excitation backprop. International Journal of Computer Vision 126 (10), pp. 1084–1102.
- (2016) Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2921–2929.