Learning Object Detection from Captions via Textual Scene Attributes



Object detection is a fundamental task in computer vision, requiring large annotated datasets that are difficult to collect, as annotators need to label objects and their bounding boxes. Thus, it is a significant challenge to use cheaper forms of supervision effectively. Recent work has begun to explore image captions as a source for weak supervision, but to date, in the context of object detection, captions have only been used to infer the categories of the objects in the image. In this work, we argue that captions contain much richer information about the image, including attributes of objects and their relations. Namely, the text represents a scene of the image, as described recently in the literature. We present a method that uses the attributes in this “textual scene graph” to train object detectors. We empirically demonstrate that the resulting model achieves state-of-the-art results on several challenging object detection datasets, outperforming recent approaches.


1 Tel Aviv University, 2 Bar-Ilan University, 3 NVIDIA Research, 4 The Allen Institute for AI


Object detection is one of the key tasks in computer vision. It requires detecting the bounding boxes of objects in a given image and identifying the category of each one. While object detection models have many real-life applications, they often also serve as a component in higher-level machine-vision systems such as Image Captioning, Visual Question Answering Anderson et al. (2018), Grounding of Referring Expressions Hu et al. (2017), Scene Graph Generation Xu et al. (2017); Zellers et al. (2017), Densely Packed Scenes Detection Goldman et al. (2019), Video Understanding Zhou et al. (2019); Herzig et al. (2019); Materzynska et al. (2020), and many more.

The simplest way to train an object detection system is via supervised learning on a dataset that contains images along with annotated bounding boxes for objects and their correct visual categories. However, collecting such data is time-consuming and costly, thus limiting the size of the resulting datasets. An alternative is to use weaker forms of supervision. The most common instance of this approach is the problem of Weakly Supervised Object Detection (WSOD), where images are only annotated with the set of object labels that appear in them, but without annotated bounding boxes. Such datasets are of course easier to collect (e.g., from images collected from social media, paired with their user-provided hashtags Mahajan et al. (2018)), and thus much research has been devoted to designing methods that learn detection models in this setting. However, WSOD remains an open problem, as indicated by the large performance gap of 50% between WSOD Singh et al. (2018) and fully supervised detection approaches Ren et al. (2020) on the PASCAL VOC detection benchmark Everingham et al. (2010).

Figure 1: An illustration of our novel scene graph refinement process. The model makes use of the “black” attribute to localize the “laptop” object in the image at train time. This will result in improved object detection accuracy at test time.

An alternative, and potentially rich, source of weak supervision is image captions, namely textual descriptions of images that are fairly easy to collect from the web. The potential of captions for learning detectors was recently highlighted in Ye et al. (2019), where they improve the extraction of object pseudo-labels from the captions.

In this work, we argue that captions contain much richer information than has been used so far, because a caption tells us more about an image than just the objects it contains. It can reveal the attributes of the objects in the image (e.g., a blue hat) and their relations (e.g., dog on chair). In the machine vision literature, such a description of objects, their attributes and relations is known as a scene graph (SG) Johnson et al. (2015b), and these have become central to many machine vision tasks Johnson et al. (2015a, b); Xu et al. (2017); Liao et al. (2016); Zellers et al. (2017); Herzig et al. (2018). This suggests that captions can be used to extract part of the scene graph of the image they accompany.

Knowing the scene graph of an image provides valuable information that can be used as weak supervision. To understand why, consider an image with two fruits that are hard to identify, and the caption “a red apple next to a pear”. Since it is relatively easy to recognize a red object, we can use this knowledge to identify that the red fruit should have the “apple” label. An illustration of such possible usage of visual attributes in the object classification process, taken from the COCO Captions dataset Chen et al. (2015), is shown in Figure 1.

We propose a learning procedure that captures this intuition by extracting “textual scene graphs” from captions, and using them within a weak supervision framework to learn object detectors. Our approach uses a novel notion of an entanglement loss that weakly constrains visual objects to have certain visual attributes corresponding to those describing them in the caption. Empirical evaluation shows that our model achieves significantly superior results over baseline models across multiple datasets and benchmarks.

Our contributions are thus: (1) We introduce a novel approach that aligns the structured representation of captions and images. (2) We propose a novel architecture with an entanglement loss that uses textual SGs to enforce constraints on the visual SG prediction. (3) We demonstrate our approach and architecture on the challenging MS COCO and Pascal VOC 2007 object detection datasets, leading to state-of-the-art results and outperforming prior methods by a significant gap.

Related Work

Weakly Supervised Object Detection. WSOD is a specific task within a broader class of problems named Multiple Instance Learning (MIL). In MIL problems, instead of receiving a set of individually labeled instances, the learner receives a set of labeled bags, each containing many instances Dietterich et al. (1997). MIL is a valuable formalization for problems with a high input complexity and weak forms of supervision, such as medical image processing Quellec et al. (2017), action recognition in videos Ali and Shah (2008) and sound event detection Kumar and Raj (2016). The motivation behind the MIL formalization stems from the fact that the correct label for the input bag can be predicted from a single instance. For example, in the WSOD task, an object label can be inferred from the specific image patch in which the object appears. This is referred to as the standard multiple instance (SMI) assumption Amores (2013), i.e., every positive bag contains at least one positive instance, while in every negative bag all of the instances are negative.

Recent works such as Oquab et al. (2015); Zhou et al. (2016) use the MIL formalization and propose a new Global Max (or Average) Pooling layer to learn the activation maps for object classes. Moreover, Bilen and Vedaldi (2016) introduced Weakly-Supervised Deep Detection Networks (WSDDN) that use two different data streams for detection and classification, where the detection stream is used to weigh the classification predictions.

The works by Akbari et al. (2019); Gupta et al. (2020) aim to tackle the problem of Weakly Supervised Phrase Grounding, which also requires finding an alignment between image regions and caption words. However, as their task objective is to find relevant image regions rather than to train an object detector, the image captions are given as inputs also at test time, and the task does not aim to correctly identify all of the existing objects in an image but only those that are present in its caption. Moreover, their task setting assumes the existence of a pretrained Faster-RCNN object detector for the region proposal extraction, which is not allowed in our setting.

Lately, the novel task of learning object detection directly from image captions was introduced by Ye et al. (2019). Their work addresses the same task as ours, but whereas they focus on achieving better object pseudo-labels from the image captions, we show that using the caption data more efficiently and providing the model with better image understanding abilities is a more important direction. Furthermore, Ye et al. (2019) use additional supervision in the form of (image caption, object annotation) pairs, which are costly to collect, and here we show that our use of captions obviates the need for this additional supervision.

Models for Images and Text.

Even though the problem of modeling the relationship between images and text has attracted many research efforts throughout the years, the task of training an object detector from image captions is relatively novel. Recently, there has been a surge of works trying to build a unified general-purpose framework to tackle tasks that involve both visual and textual inputs Lu et al. (2019); Su et al. (2019); Tan and Bansal (2019); Chen et al. (2019). These works take large-scale parallel visual and textual data, pass it through an attention-based model to get contextualized representations, and apply a reconstruction loss to predict masked-out data tokens (self-supervision). The resulting models were shown to achieve state-of-the-art results for various visual-linguistic tasks via transfer learning. This is related to recent advances in self-supervision in natural language processing, which train a transferable general-purpose language model with a similar masking objective Devlin et al. (2018). However, while these works are useful for scenarios that do not require prediction of the alignment between the visual and textual data, our objective is to explicitly classify the objects in the input image. Moreover, as we aim to train an object detector, which naturally does not receive any textual input, we cannot use a model that requires both visual and textual inputs.

Figure 2: An illustration of our Textual Label Extraction (TLE) module. Given an image and its caption, we apply exact string matching to detect objects, and use a text-to-scene-graph model to generate a scene graph from the captions and aggregate object and attribute pairs.

Scene Graphs.

The machine vision community has been using scene graphs of images for representation of information in visual scenes. SGs are used in various computer vision tasks including image retrieval Johnson et al. (2015b); Schuster et al. (2015), relationship modeling Krishna et al. (2018); Raboh et al. (2020); Schroeder et al. (2019), image captioning Xu et al. (2019), image generation Herzig et al. (2020); Johnson et al. (2018), and recently even video generation Bar et al. (2020).

Textual scene graph prediction is the problem of predicting SG representations from image captions. While the problem of generating semantically meaningful directed graphs from a sentence is well known in NLP research as Dependency Parsing, in the SG context the only three meaningful word classifications are objects, attributes and relations. Early approaches to this problem use a dependency parser as a basis for the SG prediction Schuster et al. (2015); Wang et al. (2018). Recently, Andrews et al. (2019) proposed to train a transformer model on a parallel dataset of image region descriptions and scene graphs, taken from the Visual Genome dataset Krishna et al. (2017).

The SG2Det Model

We next describe our approach for using image captions to learn an object detector. We emphasize that image captions are our only source of supervision, and no manually annotated bounding boxes or even ground-truth image-level categories are used. We refer to our model as SG2Det since it uses textual scene graphs for learning object detectors.

Image captions can provide rich and informative data about visual scenes. Namely, in addition to providing the categories of objects in the image, the captions can suggest the relations between the objects, their positions within the image and even their visual attributes. In this work, we claim that by aligning the scene graph structure as extracted from the captions to the different image regions, we can provide the model with an improved understanding of the visual scenes and obtain superior object detection results.

The key element of our approach is an “entanglement loss” that aligns the visual attributes in the image with those described in the text. As an example, consider the image caption “a red stop sign is glowing against the dark sky”. Instead of extracting only the “stop sign” object pseudo-label and discarding the rest of the caption information, we propose to use the “red” attribute to enhance our supervision. Namely, instead of training the model to find the image region that is the most probable to be a “stop sign”, our training objective is now to find the one that is the most probable to be both a “stop sign” and “red”. Technically, this is achieved by multiplying object and attribute probabilities, as described later in the Attribute Entanglement Loss subsection.

Our proposed SG2Det model is, therefore, composed of the following three components:

  1. The Textual Label Extraction (TLE) module, which extracts the object pseudo-labels and the scene graph information from the text captions.

  2. The Visual Scores Extraction (VSE) module, which finds bounding box proposals and outputs logits for the categories and attributes of these boxes.

  3. The Attribute Entanglement Loss (AEL) module, which enforces agreement between the textual and visual representations.

Figure 3: An overview of our attribute entanglement loss. First, region proposals (bounding boxes) are extracted from the image, and convolutional features are calculated for each. The image captions are used to extract a list of object and attribute pairs. The scores are calculated for the object categories and attributes, and they are used within the entanglement loss that captures the product scores of the object-attribute pairs.

Textual Label Extraction

In this module, we extract the object pseudo-labels and the scene graph data from the image captions. Figure 2 provides a high-level illustration of this module. The object category labels can be extracted in several different ways (e.g., string matching, a synonym dictionary, a trained text classifier, etc.) as explored by Ye et al. (2019), but we choose simple string matching to highlight the usefulness of our novel loss. In addition to the object category labels, we also extract a textual scene graph representation for each of the captions, and use the (object, attribute) pairs aggregated from them within the entanglement loss described later. For SG extraction, we use an off-the-shelf textual scene graph parser (footnote 1) based on Schuster et al. (2015). We also experimented with scene graphs extracted by Wang et al. (2018), but found these to achieve inferior results.

We choose to split the object attributes into categories (e.g., color, shape, material, etc.) based on a categorization taken from the GQA dataset Hudson and Manning (2019). This dataset contains visual questions that are automatically generated using scene graphs. By looking at the semantic representation (logical form) of the questions, we can derive the different attribute categories that were used by the authors for dataset generation. Some of these attribute categories are general (e.g., color, size), while others are specific to certain object classes (e.g., shape, pose, sportActivity).

The advantage of this categorization is that it provides the model with additional knowledge. Unlike object labels, each region proposal can have more than one attribute. For example, a cat can be both large and brown. Thus, we cannot model attribute prediction as a single multi-class problem. By using categories, we enforce mutual exclusivity within each category (preventing the model, for example, from predicting that an object is both black and white) while allowing multiple labels across different categories.

The output of this stage for each image is as follows:

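  • A set of object categories $O$. For example, $O = \{\text{cat}, \text{dog}\}$ indicates that the text describes a cat and a dog.

  • For each $o \in O$ we have a set $A_o$ containing attribute-value pairs $(a, v)$. Thus, $A_{\text{cat}} = \{(\text{color}, \text{brown}), (\text{size}, \text{large})\}$ indicates that the cat is brown and large.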
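The output described above can be illustrated with a minimal sketch. It assumes the scene graph parser has already produced a flat list of (object, attribute) pairs; the names `CATEGORIES`, `ATTRIBUTE_TYPES`, `match_objects` and `aggregate_attributes` are hypothetical and are not taken from the paper or from the parser's API.

```python
import re
from collections import defaultdict

# Hypothetical toy vocabularies (the paper uses the COCO categories and
# attribute types derived from GQA; these short lists are for illustration).
CATEGORIES = ["cat", "dog", "stop sign", "laptop"]
ATTRIBUTE_TYPES = {
    "brown": "color", "black": "color", "red": "color",
    "large": "size", "small": "size",
}

def match_objects(captions, categories=CATEGORIES):
    """Exact string matching: a category becomes a pseudo-label if its
    name occurs as a whole word (or phrase) in any of the captions."""
    found = set()
    for caption in captions:
        text = caption.lower()
        for name in categories:
            if re.search(r"\b" + re.escape(name) + r"\b", text):
                found.add(name)
    return found

def aggregate_attributes(parsed_pairs, objects, attribute_types=ATTRIBUTE_TYPES):
    """Group (object, attribute) pairs by object, tagging each attribute
    value with its type. Pairs with unknown objects or attributes are dropped."""
    grouped = defaultdict(set)
    for obj, attr in parsed_pairs:
        if obj in objects and attr in attribute_types:
            grouped[obj].add((attribute_types[attr], attr))
    return dict(grouped)

captions = ["A large brown cat sits next to a dog.", "a cat and a dog on a sofa"]
# Stand-in for the parser output over all captions of one image.
parsed_pairs = [("cat", "large"), ("cat", "brown"), ("sofa", "soft")]

objects = match_objects(captions)
attributes = aggregate_attributes(parsed_pairs, objects)
print(objects, attributes)
```

Grouping by attribute type at this stage keeps the later per-type classification heads simple, since each head only needs the pairs of its own type.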

Visual Scores Extraction

We now consider the input image and extract bounding boxes from it. Then, fully connected (FC) classifiers are applied to these bounding boxes, resulting in model scores for object categories and attributes. Below we elaborate on this process.

First, we generate object region proposals using the Selective Search algorithm Uijlings et al. (2013), and compute a convolutional feature map for each input image by feeding it to a convolutional network pre-trained on ImageNet Deng et al. (2009). Importantly, the convolutional backbone is not trained on an object detection dataset since our objective is to learn detection from the captions only. Then, we apply a ROIAlign layer He et al. (2017) for cropping the proposals to fixed-sized convolutional feature maps. Finally, a box feature extractor is applied to extract a fixed-length descriptor for each proposal.

Denote by $C$ the number of different object classes and by $R$ the number of different regions. We now apply an FC classifier followed by a softmax layer to obtain an object score $p_{\text{obj}}(r, c)$ for each $r \in \{1, \dots, R\}$ and $c \in \{1, \dots, C+1\}$, where $C+1$ is the background class. Similarly, for each region $r$, attribute type $a$ and attribute value $v$, we obtain a score $p_{\text{att}}(r, a, v)$ via an FC prediction head for attribute $a$.

The output of this stage is as follows:

  • A score $p_{\text{obj}}(r, c)$ for each bounding box (region) $r$ and category value $c$.

  • A score $p_{\text{att}}(r, a, v)$ for each bounding box (region) $r$, attribute type $a$ and attribute value $v$. E.g., $a$ can be “shape” and $v$ can be “rectangular”.
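The scoring step above can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the region descriptors are random toy vectors, the layer sizes are arbitrary, and applying a separate softmax per attribute type is an assumption motivated by the mutual-exclusivity argument made for attribute categories. All function and variable names are hypothetical.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights, bias):
    """Fully connected layer: one output logit per row of `weights`."""
    return [sum(w * x for w, x in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def make_head(num_outputs, dim, rng):
    """Randomly initialised FC head (weights, bias)."""
    weights = [[rng.gauss(0.0, 0.1) for _ in range(dim)] for _ in range(num_outputs)]
    return weights, [0.0] * num_outputs

rng = random.Random(0)
DIM, NUM_CLASSES = 8, 3  # toy sizes, not the paper's
ATTRIBUTE_VALUES = {"color": ["black", "brown", "red"], "size": ["large", "small"]}

object_head = make_head(NUM_CLASSES + 1, DIM, rng)  # +1 for the background class
attribute_heads = {a: make_head(len(values), DIM, rng)
                   for a, values in ATTRIBUTE_VALUES.items()}

def score_region(descriptor):
    """Return object probabilities and, per attribute type, value probabilities."""
    p_obj = softmax(linear(descriptor, *object_head))
    # ASSUMPTION: one softmax per attribute type, so values of the same type
    # compete with each other while different types stay independent.
    p_att = {a: softmax(linear(descriptor, *head))
             for a, head in attribute_heads.items()}
    return p_obj, p_att

regions = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(5)]
scores = [score_region(r) for r in regions]
```

With this layout a region can be both "large" and "brown", but never both "black" and "brown", which mirrors the categorization argument in the text.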

The Attribute Entanglement Loss

The core element of our approach is a loss that enforces agreement between the textual and visual representations of the image. To achieve this, we adapt the MIL approach to also capture attribute information.

We begin by describing the standard loss used in MIL for the case where only object category information is available in both the textual and visual descriptions. Namely, assume that from the text side we only have the set of object categories $O$, and from the image side we only have the model scores $p_{\text{obj}}(r, c)$ for all categories $c$ and bounding boxes $r$. The intuition in this case is that each category in $O$ should have at least one bounding box describing it. Thus, if $o \in O$ then $p_{\text{obj}}(r, o)$ should be high for some $r$. This motivates the use of the following loss:

$$\mathcal{L}_{\text{MIL}} = -\sum_{o \in O} \max_{r} \log p_{\text{obj}}(r, o) \tag{1}$$

Figure 4: Qualitative examples of our object and attribute predictions. It can be seen that our attribute classifiers provide meaningful classification results across different categories, which validates the success of our attribute classification objective.

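However, in our case we wish to go beyond object information and use attributes. Namely, if the caption yields $\text{cat} \in O$ and $(\text{color}, \text{brown}) \in A_{\text{cat}}$, we would like some bounding box to both contain a cat and have the attribute brown. That is, there should be a box $r$ where both the object score $p_{\text{obj}}(r, \text{cat})$ and the attribute score $p_{\text{att}}(r, \text{color}, \text{brown})$ are high. The use of “and” here is important, since a violation of either of these conditions would imply that this box does not contain the brown cat. The following entanglement loss precisely captures the intuition that attributes and categories should be dependent:

$$\mathcal{L}_{\text{AEL}} = -\sum_{o \in O} \sum_{(a, v) \in A_o} \max_{r} \log \big( p_{\text{obj}}(r, o) \cdot p_{\text{att}}(r, a, v) \big) \tag{2}$$

We note that although the log in the above can be written as $\log p_{\text{obj}}(r, o) + \log p_{\text{att}}(r, a, v)$, this objective is very different from using two objectives like (1), one for attributes and one for objects. This is because the maximum over $r$ is applied to the log of the product, and thus if one of $p_{\text{obj}}(r, o)$ or $p_{\text{att}}(r, a, v)$ is very low, the bounding box $r$ will not be the maximizer.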
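The difference between the entangled objective and two separate objectives can be checked numerically. The sketch below is a toy illustration, assuming unnormalised sums over caption labels and dictionary-based scores; the function names and the two-box example are hypothetical.

```python
import math

def mil_loss(p_obj, objects):
    """Eq. (1)-style MIL loss: for each caption object, use the best-scoring box."""
    return -sum(max(math.log(box[o]) for box in p_obj) for o in objects)

def entanglement_loss(p_obj, p_att, attributes):
    """For each (object, attribute) pair, the max is taken over the log of the
    product, so a single box has to score well on both."""
    loss = 0.0
    for o, pairs in attributes.items():
        for a, v in pairs:
            loss -= max(math.log(po[o] * pa[a][v]) for po, pa in zip(p_obj, p_att))
    return loss

def separate_loss(p_obj, p_att, attributes):
    """Contrast case: independent maxima, so two different boxes can each
    satisfy one of the two conditions."""
    loss = 0.0
    for o, pairs in attributes.items():
        for a, v in pairs:
            loss -= max(math.log(po[o]) for po in p_obj)
            loss -= max(math.log(pa[a][v]) for pa in p_att)
    return loss

# Two boxes: box 0 looks like a cat but not brown, box 1 is brown but not a cat.
p_obj = [{"cat": 0.9, "dog": 0.1}, {"cat": 0.1, "dog": 0.9}]
p_att = [{"color": {"brown": 0.1, "white": 0.9}},
         {"color": {"brown": 0.9, "white": 0.1}}]
attributes = {"cat": [("color", "brown")]}

entangled = entanglement_loss(p_obj, p_att, attributes)
separate = separate_loss(p_obj, p_att, attributes)
print(f"entangled={entangled:.3f} separate={separate:.3f}")
```

In this example no box is both a cat and brown, so the entangled loss stays high, whereas the separate objectives are nearly satisfied by two different boxes.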

We conclude by emphasizing that the whole training procedure operates without any explicit supervision of bounding boxes. Despite this, we shall see that the model succeeds in learning both detection (i.e., finding the right bounding boxes) as well as classification for both object categories and attributes.

Other Losses

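In addition to the MIL loss $\mathcal{L}_{\text{MIL}}$ and the entanglement loss $\mathcal{L}_{\text{AEL}}$ described above, we use a loss that promotes high scores for categories in $O$ and low scores for categories outside $O$. This is referred to by Tang et al. (2017) as the Multiple Instance Detection (MID) loss. We now pass our region descriptors through two parallel FC layers to get two different $R \times C$-dimensional matrices. Then, we pass one of the matrices through a sigmoid function and the other through a softmax layer over the different regions, and multiply them element-wise to get a score for each (region, object label) pair. We denote these scores by $d(r, c)$ for each $r \in \{1, \dots, R\}$ and $c \in \{1, \dots, C\}$. Note that we do not add a “background” class here since the MID loss cannot propagate gradients for this class. Next, we aggregate the scores from all different boxes to a single image-level soft-binary score as follows:

$$\hat{y}_c = \sigma\Big(\sum_{r=1}^{R} d(r, c)\Big)$$

where $\sigma$ is the sigmoid function. Now, we consider the binary cross entropy between $\hat{y}_c$ and the indicator $y_c = \mathbb{1}[c \in O]$ corresponding to $O$. This results in the following loss term:

$$\mathcal{L}_{\text{MID}} = -\sum_{c=1}^{C} \big[ y_c \log \hat{y}_c + (1 - y_c) \log (1 - \hat{y}_c) \big]$$

Our overall loss is thus a weighted sum of $\mathcal{L}_{\text{MID}}$, $\mathcal{L}_{\text{MIL}}$ and $\mathcal{L}_{\text{AEL}}$.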
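The MID computation described in this subsection can be sketched as follows. Two points are assumptions rather than details confirmed by the text: the aggregation over boxes is taken to be a plain sum before the sigmoid, and the loss is summed (not averaged) over classes. The logits are toy values and all names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(values):
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def mid_scores(cls_logits, det_logits):
    """Combine two R x C logit matrices into per-(region, class) scores:
    sigmoid on one stream, softmax over regions on the other."""
    num_regions, num_classes = len(cls_logits), len(cls_logits[0])
    # det[c][r]: softmax over regions, computed separately for each class.
    det = [softmax([det_logits[r][c] for r in range(num_regions)])
           for c in range(num_classes)]
    return [[sigmoid(cls_logits[r][c]) * det[c][r] for c in range(num_classes)]
            for r in range(num_regions)]

def image_level_scores(scores):
    """ASSUMPTION: aggregate by summing over regions, then apply a sigmoid."""
    num_classes = len(scores[0])
    return [sigmoid(sum(row[c] for row in scores)) for c in range(num_classes)]

def mid_loss(image_scores, labels):
    """Binary cross entropy between image-level scores and caption indicators."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1.0 - p)
                for p, y in zip(image_scores, labels))

cls_logits = [[2.0, -2.0], [0.0, -1.0]]   # 2 regions x 2 classes (toy values)
det_logits = [[1.0, 0.0], [0.0, 0.0]]
labels = [1, 0]                           # class 0 is mentioned in the caption

scores = mid_scores(cls_logits, det_logits)
image_scores = image_level_scores(scores)
loss = mid_loss(image_scores, labels)
```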


Online Refinement

So far we assumed that for each category only one bounding box contains the object pseudo-label (namely, the one that maximizes the object score for that category). In practice, there could be other bounding boxes that highly overlap with the maximizing one, and should therefore be included in the loss. This intuition was used by the Online Instance Classifier Refinement (OICR) method introduced in Tang et al. (2017), and also in Ye et al. (2019). In order to provide a fair comparison to Ye et al. (2019) we also use OICR here.

The OICR method uses $K$ different score functions $p^{(k)}_{\text{obj}}$ for $k = 1, \dots, K$ and corresponding losses. The $k$-th loss is similar to Eq. (1) but with two differences: it uses all boxes that sufficiently overlap the maximizing box, and the $\max$ operator is applied to the scores from the previous OICR step. This provides a refinement process where each FC classifier uses the output of its predecessor as an instance-level supervision. The first score function is obtained by applying a softmax function over the scores from the MID step. Here, we extend OICR to also use attribute scores and similarly extend the loss in Eq. (2).
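The pseudo-labelling step of this refinement can be sketched as below. It is a simplified illustration: the overlap threshold of 0.5 is an assumed default (the text only says "sufficiently overlap"), boxes are `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def refine_labels(boxes, prev_scores, objects, background="background",
                  iou_threshold=0.5):
    """Instance-level pseudo-labels for the next refinement step.

    For each caption object, take the box that maximises the previous step's
    score, then give the same label to every box overlapping it enough.
    """
    labels = [background] * len(boxes)
    best_overlap = [0.0] * len(boxes)
    for o in objects:
        top = max(range(len(boxes)), key=lambda r: prev_scores[r][o])
        for r, box in enumerate(boxes):
            overlap = iou(box, boxes[top])
            # ASSUMPTION: 0.5 threshold; ties between objects go to the
            # object whose top box overlaps this box the most.
            if overlap >= iou_threshold and overlap > best_overlap[r]:
                labels[r], best_overlap[r] = o, overlap
    return labels

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
prev_scores = [{"cat": 0.8}, {"cat": 0.6}, {"cat": 0.1}]
labels = refine_labels(boxes, prev_scores, objects=["cat"])
print(labels)
```

The same selection of overlapping boxes can be reused for the attribute classifiers, by attaching the attribute labels of an object to every box that received that object's pseudo-label.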

Note that we do not apply an MID loss to the attributes as we do for objects (i.e., aggregating attribute scores from all regions, and comparing to the attribute pseudo-labels extracted from the captions) since we found this to harm the performance of our model. We hypothesize that this is because the assumption that a label is present in the caption if and only if it is present in the image is improbable for attributes, as image captions data is often sparse with attribute annotations. Thus, we do not have initial MID scores for the attributes as we do for the objects. Because of that, at the first OICR iteration we only train the attributes classifiers using the objects MID scores, and we do not apply the entanglement loss. E.g., if the image contains a “brown cat”, we use the box that maximizes the “cat” probability as a supervision for the color classifier with the label “brown”.


Experiments

In this section, we show both qualitative and quantitative results for our SG2Det model. We show that compared to prior work, the additional scene graph information we extract from the image captions is indeed helpful, and provides significantly better object detection results on all of the benchmarks we evaluate on, without using any additional training data. Our model achieves state-of-the-art results on the COCO detection test-dev set and the PASCAL VOC 2007 detection test set, when training on multiple captions datasets. Specifically, when training on COCO captions, we achieve results on the PASCAL VOC 2007 test set that are comparable to those of a WSOD model trained on COCO ground-truth labels.

Implementation Details

To the best of our knowledge, the only existing work to tackle the problem of training an object detector from image captions is Ye et al. (2019). Therefore, to ensure a fair comparison between our works, we use the same algorithm and configuration for proposal box extraction and the same convolutional backbone and feature layers, and our model is based on their official paper implementation (footnote 2). Specifically, we use the Selective Search algorithm Van de Sande et al. (2011), as implemented in the OpenCV library, to extract (at most) 500 proposals for each image.

We compute the region descriptor vectors by using the (“Conv2d1a7x7” to “Mixed4e”) layers from InceptionV2 Szegedy et al. (2016) for extracting the convolutional feature maps from the images. In addition, we use the (“Mixed5a” to “Mixed5c”) layers in the same model to extract the region descriptors after the ROIAlign He et al. (2017) operation. Finally, the convolutional backbone network was pre-trained on ImageNet Deng et al. (2009).
























Method              aero   bike   bird   boat bottle    bus    car    cat  chair    cow  table    dog  horse  mbike person  plant  sheep   sofa  train     tv   mean

Training on different datasets using ground-truth labels:
GT-Label VOC        68.7   49.7   53.3   27.6   14.1   64.3   58.1   76.0   23.6   59.8   50.7   57.4   48.1   63.0   15.5   18.4   49.7   55.0   48.4   67.8   48.5
GT-Label COCO       65.3   50.3   53.2   25.3   16.2   68.0   54.8   65.5   20.7   62.5   51.6   45.6   48.6   62.3    7.2   24.6   49.6   34.6   51.1   69.3   46.3

Training on COCO dataset using captions:
ExactMatch (EM)     62.0   45.5   52.8   31.5   14.2   66.8   50.7   34.0   12.3   53.4   51.9   57.9   45.4   59.7   11.2   13.1   57.8   46.1   50.0   53.8   43.5
EM + TextClsf       64.1   46.7   47.9   31.6   12.4   70.0   53.0   56.5   17.9   60.5   37.6   59.4   47.4   59.4   25.0    0.2   44.4   49.8   43.3   55.8   44.1
EM + SG Loss        61.9   48.3   53.5   32.3   15.8   65.7   50.3   54.2   16.0   61.1   48.9   68.0   49.1   57.1   15.5   16.0   52.7   56.1   46.0   49.2   45.9

Training on Flickr30K dataset using captions:
ExactMatch (EM)     43.5   27.3   41.7   15.2    9.0   31.9   47.7   67.2   11.0   45.6   28.9   65.6   28.9   51.8   31.3    5.6   34.7   33.7   23.6   46.0   33.1
EM + TextClsf       37.3   35.6   46.0   18.9    9.8   45.6   45.4   57.3   15.1   46.1   19.4   67.5   36.0   52.2    8.2    0.1   46.5   30.9   27.8   46.1   34.6
EM + SG Loss        50.1   45.0   46.8   12.5   10.0   40.8   44.9   61.2   15.4   42.9   43.3   69.3   32.1   43.5    3.5    3.4   37.0   42.1  27.75   48.0   36.0

Table 1: Average precision on the VOC 2007 test set (learning from ground-truth annotations, COCO and Flickr30K captions). We train the detection models using the 80 COCO objects, but evaluate on only the overlapping 20 VOC objects.

                  Avg. Precision, IoU          Avg. Precision, Area
Methods           0.5:0.95      0.5     0.75        S        M        L
GT-Label              10.6     23.4      8.7      3.2     12.1     18.1
ExactMatch (EM)        9.0     20.2      7.1      2.4     10.6     16.3
EM + TextClsf          9.1     20.0      7.4      2.1     10.5     16.5
EM + SG Loss           9.4     21.1      7.3      2.4     10.8     17.6

Table 2: COCO test-dev mean average precision results when training on COCO captions. These numbers are computed by submitting our detection results to the COCO evaluation server. The best method is shown in bold.

We use the AdaGrad optimizer with a learning rate of 0.01 and set the batch size to 2. For the training data augmentation, we randomly flip the image left to right at training time and resize each image randomly to one of the four scales . We set the number of OICR iterations to , as we found this to yield the best performance. We use non-maximum-suppression (NMS) at the post-processing stage with an intersection-over-union (IoU) threshold of 0.4. We follow Ye et al. (2019) by weighing the term by . For the term we experiment with different values in {} and find to yield the best results.
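The NMS post-processing step mentioned above can be illustrated with a short greedy sketch using the stated IoU threshold of 0.4. This is a generic single-class version written for illustration; the function names and the toy boxes are hypothetical, and in practice the step would be run separately for each object class.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_threshold=0.4):
    """Greedy non-maximum suppression.

    Visit boxes from highest to lowest score and keep a box only if it does
    not overlap any already-kept box by more than `iou_threshold`.
    Returns the indices of the kept boxes, in descending score order.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = nms(boxes, scores)
print(kept)
```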

Unlike Ye et al. (2019), we found that the performance of our SG2Det model continues to improve until 1M training steps when training on COCO Captions (17 epochs) and 300K steps for Flickr30K (19 epochs). We hypothesize that this is a result of the more complex optimization objective of our model. We pick the best model based on the validation set for each dataset. Our models are trained on a single Titan XP GPU for 7 days when training on COCO, and 2 days when training on Flickr30K.

As we experienced instability in the results when training the same model multiple times with different random seeds, all of the caption-based results we report were obtained by training each model 3 times and choosing the best run based on the validation set. Therefore, for almost all of the baseline models we report improved results over what is reported by Ye et al. (2019), and for some of the models the improvement is significant.


For training the SG2Det model, we use two popular image captions datasets: COCO Captions Chen et al. (2015) and Flickr30K Young et al. (2014). For training on the COCO Captions dataset, we use images, each paired with five different human-generated captions, which sum up to captions in total. For training on the Flickr30K dataset, we use images, each paired with five different human-generated captions, which sum up to captions in total. When evaluating on the VOC 2007 dataset, we use the train and validation sets for validation and report our results on the test set. As the object label vocabulary differs between the COCO and VOC datasets, when evaluating on VOC we use only the twenty overlapping VOC objects. We report the mAP@0.5 for each of the object labels and its mean across the labels for this dataset. When evaluating on COCO, we use the val2017 set for validation and test our model by submitting our object detection predictions to the COCO test-dev2017 evaluation server (footnote 3). We report the metrics provided by the server, where mAP@0.5:0.95 is the primary evaluation metric of the dataset.


On all different benchmarks, we report our results for the following three models:

  1. The Exact Match (EM) baseline model proposed by Ye et al. (2019). This model performs a simple string matching to extract object pseudo-labels from the captions.

  2. The EM + TextClsf model. This is the best model reported by Ye et al. (2019). This model performs object pseudo-label extraction by training a text classifier on additional data of parallel image captions and object annotation pairs, taken from the COCO detection dataset.

  3. Our novel EM + SG Loss model. Our model also performs exact string matching for object label extraction, but additionally applies our novel SG entanglement loss. Note that our model is identical to ExactMatch when weighting the SG entanglement loss by 0.


Figure 4 shows qualitative examples for predictions of our SG2Det model. For consistency, we only visualize objects with detection confidence , and attribute categories that are meaningful across different object classes. We can see that the model identifies most of the objects and their attributes correctly. This further validates our claim that our SG loss gives the model better scene understanding abilities, which in turn allow it to obtain improved object detection results. This figure also shows the quality of the attribute classifiers we obtain as a by-product of our training process.

Table 1 shows the results of our models on the VOC 2007 test set. At the top of the table, we show the results of models trained using the gold object annotations on the COCO and VOC datasets (without bounding box annotations), as reported by Ye et al. (2019). These results can be viewed as an upper bound for what is achievable by training from image captions using our method, since we use a weaker form of supervision.

In the middle part of the table, we report the VOC 2007 test results for models that were trained on COCO captions. Our novel EM + SG Loss model achieves state-of-the-art results on this dataset. It is worth noting that our model trained only on COCO Captions (45.9 mAP) achieves results comparable to the WSOD model trained on the ground-truth COCO labels (46.3 mAP). This suggests that our SG loss utilizes the image captions with close-to-maximal efficiency.

At the bottom of the table, we report the VOC 2007 test results for models trained on captions from the Flickr30K dataset. As before, our model achieves state-of-the-art results, which validates the contribution of our novel SG loss.

Table 2 shows the results of our model on the COCO test-dev dataset. The main metric for evaluation for this dataset is mAP@0.5:0.95, which is reported in the leftmost column. Our novel EM + SG Loss model achieves state-of-the-art results on this benchmark too.

To summarize, both our work and Ye et al. (2019) seek to improve over the simple ExactMatch baseline. Our method uses contextual scene understanding for this purpose, while the EM + TextClsf method focuses on obtaining better object pseudo-labels. When training on COCO Captions and evaluating on VOC, our model's gain over the baseline is almost double that of the text classifier model. When training on Flickr30K and evaluating on VOC, or training on COCO Captions and evaluating on COCO test-dev, our model's gain is about 4 times that of the text classifier, without using any additional training data. This supports our hypothesis that better scene understanding is a more critical factor for WSOD models than extraction of better pseudo-labels.


We present a novel weakly supervised object detection approach that uses only image captions as supervision. Unlike previous approaches to this problem, we make use of the rich information available in the text in the form of visual attribute descriptions. We propose a novel entanglement loss that captures the coupling between the objects and attributes.

Our evaluation on the COCO and VOC datasets demonstrates state-of-the-art results, using less supervision than the previous caption-based methods of Ye et al. (2019). Moreover, it shows the power of using grounding information from the text when analyzing images. Here our focus was on attributes only, although textual scene relations can also be explored. These are more technically challenging to handle, since the number of potential relations is quadratic in the number of region proposals. However, these can be pruned in different ways. Finally, here we used a fixed pre-trained text-to-scene-graph model. An exciting question is how to learn this jointly with the detection model. We leave these directions for future work.
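To make the quadratic growth concrete: with n region proposals there are n(n-1) ordered subject-object pairs. The sketch below counts them and shows one possible pruning strategy, keeping only pairs among the k highest-scoring proposals. The pruning rule and all function names are assumptions made for illustration and are not a method proposed in this work.

```python
from itertools import permutations


def count_candidate_relations(num_proposals):
    """Number of ordered (subject, object) pairs of distinct proposals."""
    return num_proposals * (num_proposals - 1)


def prune_pairs_top_k(scores, k):
    """Hypothetical pruning: keep ordered pairs among the k top-scoring proposals."""
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return list(permutations(sorted(top), 2))
```

For example, 500 proposals yield 249,500 candidate pairs, whereas restricting to the top 20 proposals leaves 380.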


This work was supported by the Israeli Innovation Authority MAGNETON program, and the Israel Science Foundation.


  1. https://github.com/vacancy/SceneGraphParser
  2. https://github.com/yekeren/Cap2Det
  3. https://competitions.codalab.org/competitions/20794


  1. Multi-level multimodal common semantic space for image-phrase grounding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12476–12486. Cited by: Related Work.
  2. Human action recognition in videos using kinematic features and multiple instance learning. IEEE transactions on pattern analysis and machine intelligence 32 (2), pp. 288–303. Cited by: Related Work.
  3. Multiple instance classification: review, taxonomy and comparative study. Artificial intelligence 201, pp. 81–105. Cited by: Related Work.
  4. Bottom-up and top-down attention for image captioning and visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 6077–6086. Cited by: Introduction.
  5. Scene graph parsing by attention graph. arXiv preprint arXiv:1909.06273. Cited by: Scene Graphs..
  6. Compositional video synthesis with action graphs. Cited by: Scene Graphs..
  7. Weakly supervised deep detection networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2846–2854. Cited by: Related Work.
  8. Microsoft coco captions: data collection and evaluation server. arXiv preprint arXiv:1504.00325. Cited by: Introduction, Datasets.
  9. Uniter: learning universal image-text representations. arXiv preprint arXiv:1909.11740. Cited by: Models for Images and Text..
  10. Imagenet: a large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp. 248–255. Cited by: Visual Scores Extraction, Implementation Details.
  11. Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: Models for Images and Text..
  12. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence 89 (1-2), pp. 31–71. Cited by: Related Work.
  13. The pascal visual object classes (voc) challenge. International journal of computer vision 88 (2), pp. 303–338. Cited by: Introduction.
  14. Precise detection in densely packed scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: Introduction.
  15. Contrastive learning for weakly supervised phrase grounding. arXiv preprint arXiv:2006.09920. Cited by: Related Work.
  16. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision, pp. 2961–2969. Cited by: Visual Scores Extraction, Implementation Details.
  17. Learning canonical representations for scene graph to image generation. In European Conference on Computer Vision, Cited by: Scene Graphs..
  18. Spatio-temporal action graph networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0. Cited by: Introduction.
  19. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems (NIPS), Cited by: Introduction.
  20. Modeling relationships in referential expressions with compositional modular networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1115–1124. Cited by: Introduction.
  21. Gqa: a new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6700–6709. Cited by: Textual Label Extraction.
  22. Image generation from scene graphs. arXiv preprint arXiv:1804.01622. Cited by: Scene Graphs..
  23. Image retrieval using scene graphs. In Proc. Conf. Comput. Vision Pattern Recognition, pp. 3668–3678. Cited by: Introduction.
  24. Image retrieval using scene graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3668–3678. Cited by: Introduction, Scene Graphs..
  25. Referring relationships. ECCV. Cited by: Scene Graphs..
  26. Visual genome: connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision 123 (1), pp. 32–73. Cited by: Scene Graphs..
  27. Audio event detection using weakly labeled data. In Proceedings of the 24th ACM international conference on Multimedia, pp. 1038–1047. Cited by: Related Work.
  28. On support relations and semantic scene graphs. arXiv preprint arXiv:1609.05834. Cited by: Introduction.
  29. Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23. Cited by: Models for Images and Text..
  30. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196. Cited by: Introduction.
  31. Something-else: compositional action recognition with spatial-temporal interaction networks. In proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Cited by: Introduction.
  32. Is object localization for free?-weakly-supervised learning with convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 685–694. Cited by: Related Work.
  33. Multiple-instance learning for medical image and video analysis. IEEE reviews in biomedical engineering 10, pp. 213–234. Cited by: Related Work.
  34. Differentiable scene graphs. In Winter Conf. on App. of Comput. Vision, Cited by: Scene Graphs..
  35. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10598–10607. Cited by: Introduction.
  36. Triplet-aware scene graph embeddings. In The IEEE International Conference on Computer Vision (ICCV) Workshops, Cited by: Scene Graphs..
  37. Generating semantically precise scene graphs from textual descriptions for improved image retrieval. In Proceedings of the fourth workshop on vision and language, pp. 70–80. Cited by: Scene Graphs., Scene Graphs., Textual Label Extraction.
  38. Sniper: efficient multi-scale training. In Advances in neural information processing systems, pp. 9310–9320. Cited by: Introduction.
  39. Vl-bert: pre-training of generic visual-linguistic representations. arXiv preprint arXiv:1908.08530. Cited by: Models for Images and Text..
  40. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2818–2826. Cited by: Implementation Details.
  41. Lxmert: learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490. Cited by: Models for Images and Text..
  42. Multiple instance detection network with online instance classifier refinement. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2843–2851. Cited by: Other Losses, Online Refinement.
  43. Selective search for object recognition. International journal of computer vision 104 (2), pp. 154–171. Cited by: Visual Scores Extraction.
  44. Segmentation as selective search for object recognition. In 2011 International Conference on Computer Vision, pp. 1879–1886. Cited by: Implementation Details.
  45. Scene graph parsing as dependency parsing. arXiv preprint arXiv:1803.09189. Cited by: Scene Graphs., Textual Label Extraction.
  46. Scene Graph Generation by Iterative Message Passing. In Proc. Conf. Comput. Vision Pattern Recognition, pp. 3097–3106. Cited by: Introduction, Introduction.
  47. Scene graph captioner: image captioning based on structural visual representation. Journal of Visual Communication and Image Representation, pp. 477–485. Cited by: Scene Graphs..
  48. Cap2det: learning to amplify weak caption supervision for object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 9686–9695. Cited by: Introduction, Related Work, Textual Label Extraction, Online Refinement, item 1, item 2, Implementation Details, Implementation Details, Implementation Details, Implementation Details, Results, Results, Discussion.
  49. From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics 2, pp. 67–78. Cited by: Datasets.
  50. Neural motifs: scene graph parsing with global context. arXiv preprint arXiv:1711.06640. Cited by: Introduction, Introduction.
  51. Learning deep features for discriminative localization. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2921–2929. Cited by: Related Work.
  52. Grounded video description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6578–6587. Cited by: Introduction.