ADVISE: Symbolism and External Knowledge for Decoding Advertisements

Keren Ye       Adriana Kovashka
Department of Computer Science
University of Pittsburgh
{yekeren, kovashka}

In order to convey the most content in their limited space, advertisements embed references to outside knowledge via symbolism. For example, a motorcycle stands for adventure (a positive property the ad wants associated with the product being sold), and a gun stands for danger (a negative property to dissuade viewers from undesirable behaviors). We show how to use symbolic references to better understand the meaning of an ad. We further show how anchoring ad understanding in general-purpose object recognition and image captioning can further improve results. We formulate the ad understanding task as matching the ad image to human-generated statements that describe the action that the ad prompts and the rationale it provides for this action. We greatly outperform the state of the art in this task. We also show additional applications of our learned representations for ranking the slogans of ads, and clustering ads according to their topic.

1 Introduction

Advertisements are a powerful tool for affecting human behavior. Product ads convince us to make large purchases, e.g. for cars and home appliances, or small but recurrent purchases, e.g. for laundry detergent. Public service announcements (PSAs) encourage different behaviors, e.g. combating domestic violence or driving safely. To stand out from the rest, ads have to be both eye-catching and memorable [74], while also conveying the information that the ad designer wants to impart. All of this must be done in limited space (one image) and time (however many seconds the viewer is willing to spend looking at an ad).

How can ads get the most “bang for their buck”? One technique is to make references to knowledge that viewers already have, in the form of e.g. cultural knowledge, conceptual mappings, and symbols that humans have learned [58, 39, 61, 38]. These symbolic mappings might come from literature (e.g. a snake symbolizes evil or danger), movies (e.g. motorcycles symbolize adventure or coolness), common sense (a flexed arm symbolizes strength), or even pop culture (Usain Bolt symbolizes speed).

In this paper, we describe how to use symbolic mappings to predict the messages of advertisements. On one hand, we use the symbol bounding boxes and labels from the Ads Dataset of [23] as visual anchors to ideas outside the image. On the other hand, we use knowledge sources external to the main task, such as object detection, to better relate ad images to their corresponding messages. These are both forms of using outside knowledge which boil down to learning links between objects and symbolic concepts. We use each type of knowledge in two ways, as a constraint or as an additive component for the learned image representation.

Figure 1: Our key idea: Use symbolic associations shown in yellow (a gun symbolizes danger; a motorcycle symbolizes coolness) and recognized objects shown in red, to learn an image-text space where each ad maps to the correct statement that describes the message of the ad. The symbol “cool” brings images B and C closer together in the learned space, and further from image A and its associated symbol “danger.” At test time (shown in orange), we use the learned image-text space to retrieve a matching statement for test image D. At test time, the symbol labels are not provided.

We focus on the following multiple-choice task: Given an image and several statements, the system must identify the correct statement to pair with the ad. For example, for test image D in Fig. 1, the system might predict the message is “Buy this drink because it’s exciting.” Our method learns a joint image-text embedding that associates ads with their corresponding messages. The method has three components: (1) an image embedding which takes into account individual regions in the image, (2) constraints on the learned space computed from symbol labels and object predictions, and (3) an additive expansion of the image representation using a symbol distribution.

In more detail, we first use the symbol bounding boxes from the Ads Dataset [23], without labels, to learn a region proposal network. For each image, we compute its representation as a weighted average of the representations of its important regions. It is this representation that we embed in the joint image-text space. Second, we constrain the learned space using the sparse ground-truth symbol labels from the Ads Dataset, and the predictions from a generic captioning method not based on ads [29]. Images that have similar symbol labels and similar predicted captions should project close by in the learned space. Third, we add an adaptive additive refinement to our image representation that brings the image representation closer to its corresponding statement. Both the constraints and the additive component depend on external knowledge in the form of symbols and object predictions, i.e. we show two ways to learn an embedding for ads that rely on outside knowledge. We call our method ADVISE: ADs VIsual Semantic Embedding.

We focus on public service announcements, rather than product (commercial) ads. PSAs tend to be more conceptual and challenging, often involving multiple steps of reasoning. Quantitatively, 59% of the product ads in the dataset of [23] are straightforward, i.e. would be nearly solved with traditional recognition advancements. In contrast, only 33% of PSAs use straightforward strategies, while the remaining 67% use a number of challenging non-literal approaches to convey their message. Our method outperforms several relevant baselines, including prior visual-semantic embeddings [35] and methods for understanding ads [23].

In addition to showing how to use external knowledge to solve ads, we also demonstrate how recent advances in object recognition help with the ad-understanding task. While [23] evaluates basic techniques for ad-understanding, it does not make use of recent advances in computer vision, e.g. region proposals [18, 55, 42, 15], attention [8, 73, 70, 60, 72, 54, 51, 43, 13, 79, 52], or image-text embeddings [35, 6, 7, 11, 16, 53].

Note that symbols can be culture-dependent, and the messages of ads might be interpreted differently by different viewers. We do not deal with these nuisance factors. Symbols were annotated by workers in the United States, and from the point of view of ad designers, a single message is encoded. It is this message, and this US-based symbolism, that we are interested in predicting.

To summarize, our contributions are as follows:

  • We show how to effectively use symbolism to better understand ads.

  • We show how to make use of noisy caption predictions to bridge the gap between the abstract task of predicting the message of an ad, and more accessible information such as the objects present in the image. Detected objects are mapped to symbols via a domain-specific knowledge base.

  • We improve the state of the art in understanding ads by 35% in the case of public service announcements, and 30% in the case of product ads.

  • We show that for the “abstract” PSAs, conceptual knowledge in the form of symbols helps more, while for the more “straightforward” product ads, use of general-purpose object recognition techniques is more helpful.

The remainder of the paper is organized as follows. We briefly discuss related work in Sec. 2. In Sec. 3.1, we describe the retrieval task on which we focus, and in Sec. 3.2, we describe standard triplet embedding using the Ads Dataset. In Sec. 3.3, we discuss the representation of an image as a weighted combination of region representations, weighted by their importance via an attention model. In Sec. 3.4, we describe how we use external knowledge to constrain the learned space. In Sec. 3.5, we develop an optional additive refinement of the image representation, again using external knowledge and symbols. In Sec. 4, we compare our method to state-of-the-art methods, and conduct extensive ablation studies. We conclude in Sec. 5.

2 Related Work

Advertisements and multimedia.

The most related work to ours is [23] which proposes the problem of decoding ads, formulated as answering the question “Why should I [action]?” where [action] is what the ad suggests the viewer should do, e.g. buy something or behave a certain way (e.g. help prevent domestic violence). The dataset contains 64,832 image ads. Annotations available include the topic (product or subject) of the ad, sentiments and actions the ad prompts, rationales provided for why the action should be done, symbolic mappings (signifier-signified, e.g. motorcycle-adventure), etc. Considering the media domain more broadly, [30] analyze in what light a photograph portrays a politician, and [31] analyze how the facial features of a political candidate determine the outcome of an election. Also related is work in parsing infographics, charts and comics [4, 33, 26]. In contrast to these, our interest is analyzing the implicit references ads were created to make.

Vision and language and image-text embeddings.

In recent years, there has been great interest in joint vision-language tasks, e.g. image and video captioning [67, 32, 10, 29, 2, 73, 66, 65, 77, 71, 14, 52, 59, 9, 36], visual question answering [3, 76, 44, 72, 60, 69, 63, 80, 81, 21, 68, 28, 64], and cross-domain retrieval [7, 6, 77, 40]. The latter often makes use of learned joint image-text embeddings, as we also do in this work. [35] uses a triplet loss where an image and its corresponding human-provided caption should be closer in the learned embedding space than image/caption pairs that do not match. [11] propose a bi-directional network to maximize correlation between matching images and text, akin to CCA [19]. [16] utilize the images and texts in Wikipedia articles for self-supervision. While these achieve excellent results, none of them consider images with implicit or explicit persuasive intent, as we do.

External knowledge for vision-language tasks.

We propose to use external knowledge for decoding ads, in two ways: via symbols that inherently refer to outside knowledge, and by using outside knowledge to learn to detect symbols. [69, 68, 28, 81, 64] examine the use of knowledge bases for answering visual questions. [66] use external sources to diversify their image captioning language model. [49] learn to compose object classifiers by relating similarity in semantics to visual similarity. [45, 17] use knowledge graphs or hierarchies to aid in object recognition. These works all use mappings that are objectively/scientifically grounded, i.e. lions are related to cats, lions are a type of cat, etc. In contrast, we use cultural associations that arose in the media/literature and are internalized by humans, e.g. motorcycles are associated with adventure.

Region proposals and attention.

We make use of region proposals [18, 55, 42, 15] and attention [8, 73, 70, 60, 72, 54, 51, 43, 13, 79, 52]. Region proposals focus the job of an object detector on regions likely to contain objects. Attention helps focus learning and prediction tasks on regions likely to be relevant to the task; these regions are learned using backpropagation. We show that the regions over which we compute attention must be specific to the domain of ads. Modeling of human attention [25, 5, 27, 37, 78] and memorability [24, 34] is also relevant since ads are created to draw and hold the viewer’s attention [74].

3 Approach

We learn a joint image-text embedding space where we can evaluate the similarity between ad images and ad messages. We use symbols and external knowledge to constrain and refine this space in three ways. First, rather than consider ad images as a whole, we represent an ad image as a weighted average of the representations of its regions (Sec. 3.3). Second, we enforce that images that have the same symbol labels, or the same detected objects [29], map close by in the learned space (Sec. 3.4). Finally, we propose an additive refinement of the image representation via an attention-masked symbol distribution (Sec. 3.5). In Sec. 4 we demonstrate the utility of each component.

3.1 Task and dataset

Our goal is to develop a method for understanding advertisements. Concretely, we consider the task of question-answering in the dataset of [23]. The authors formulated ad understanding as answering questions of the form “Q: Why should I [action]? A: [one-word reason].” An example question-answer pair is “Q: Why should I speak up about domestic violence? A: bad.” The one-word reason is picked from a full-sentence reason, also available in the dataset. We believe that using a single word is insufficient to capture the rhetoric of complex ads, so we slightly modify their task. Rather than a softmax over 1000 words, we ask the system to pick which statement is most appropriate for the image. We retrieve statements in the format: “I should [action] because [reason].” Using the same example, the statement would be “I should speak up about domestic violence because being quiet is as bad as committing violence yourself.” Given an image, we rank 50 statements (3 related and 47 unrelated) based on their similarity to the image in the learned feature space.
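
The retrieval step can be illustrated with a minimal sketch. It assumes that the image and the candidate statements have already been embedded in the joint space and that similarity is measured by squared Euclidean distance; the function name and the toy vectors are hypothetical and only serve as illustration.

```python
def rank_statements(image_vec, statement_vecs):
    """Return statement indices ordered from most to least similar to the image.

    Assumption: similarity is squared Euclidean distance in the joint space
    (a smaller distance means a more similar statement).
    """
    distances = [
        sum((a - b) ** 2 for a, b in zip(image_vec, stmt))
        for stmt in statement_vecs
    ]
    return sorted(range(len(statement_vecs)), key=lambda k: distances[k])


# Toy example with 2-D embeddings instead of the 200-D space.
image = [1.0, 0.0]
statements = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
ranking = rank_statements(image, statements)  # statement 1 is the closest
```

In the actual task the list would hold 50 candidate statements per image, 3 of which are correct.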

3.2 Basic image-text triplet embedding

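As a foundation for our approach, we first directly learn an embedding that optimizes for the desired task. We require that in the feature space, the distance between an image and its corresponding statement be smaller than the distance between that image and any other statement, or between other images and that statement. In other words, the loss that is being minimized is

$$L(v, t; \theta) = \sum_{i} \sum_{j \neq i} \Big( \big[ \|v_i - t_i\|_2^2 - \|v_i - t_j\|_2^2 + \beta \big]_+ + \big[ \|t_i - v_i\|_2^2 - \|t_i - v_j\|_2^2 + \beta \big]_+ \Big) \qquad (1)$$

where $v$ indicates the visual embedding we are learning and $t$ indicates the text embedding; $v_i$, $t_i$ correspond to the same ad, and $v_j$, $t_j$ correspond to a different ad. Here $[\cdot]_+ = \max(0, \cdot)$ and $\beta$ is the margin.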

In order to extract the visual embedding $v$ from image $x$, we first extract the image’s CNN feature (1536-D) from [62], then use a fully connected layer to project it to the 200-D joint embedding space. Given $w \in \mathbb{R}^{200 \times 1536}$, the parameter of the fully connected layer,

$$v = w \cdot \mathrm{CNN}(x) \qquad (2)$$

The text embedding vector $t$ is a summation of individual word embedding [47, 48, 46] vectors. Both the image and the text features are l2-normalized. We set the margin hyperparameter $\beta$ in Eq (1) to 0.4 based on the results of preliminary experiments. To make the training process converge faster, we use a twist on the hard negative mining approach of [12, 57].
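
The basic embedding objective can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code: distances are taken to be squared Euclidean, every other ad in the batch serves as a negative (the hard negative mining mentioned above is omitted), and all function names are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def embed_text(words, word_vectors):
    """Sum the word vectors of a statement, then l2-normalize the result."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, x in enumerate(word_vectors.get(word, [0.0] * dim)):
            total[k] += x
    return l2_normalize(total)


def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(images, texts, margin=0.4):
    """Bidirectional hinge loss; images[i] and texts[i] belong to the same ad."""
    loss = 0.0
    for i in range(len(images)):
        pos = sq_dist(images[i], texts[i])
        for j in range(len(images)):
            if j == i:
                continue
            # Image i should be closer to its own statement than to statement j.
            loss += max(0.0, pos - sq_dist(images[i], texts[j]) + margin)
            # Statement i should be closer to its own image than to image j.
            loss += max(0.0, pos - sq_dist(texts[i], images[j]) + margin)
    return loss
```

With perfectly matched unit vectors the loss is zero, because every negative is farther away than the positive by more than the margin.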

3.3 Embedding using symbol regions

Since ads are carefully designed, they may involve complex narratives with several distinct components, i.e. several regions in the ad might need to be interpreted individually first, before we can reason about them jointly to infer the message of the ad. Thus, we represent an image as a collection of its constituent important regions, using an attention module to learn the weights for each region.

We learn a region proposal network using the symbol bounding boxes of [23]. The idea is that ads draw the viewer’s attention in a particular way, and the symbol bounding boxes without symbol labels can be used to approximate this. This label-agnostic method is a new use of [23]’s symbolism data that has not been explored before. We use a pre-trained network [42, 22, 20] and fine-tune it using the symbol bounding box annotations. We show in Sec. 4 that this fine-tuning is crucial.

To further model the viewer’s attention, we also incorporate the bottom-up attention mechanism [64, 1], which is a weighting among region proposals.

In more detail, we extract CNN features for each detected ad region $x_l$ in the image $x$. We then use a fully connected layer to project each region-based feature to: 1) a 200-D embedding vector $v_l$ (Eq. 3), and 2) a confidence score $a_l$ saying how much the region should contribute to the final representation (Eq. 4):

$$v_l = w \cdot \mathrm{CNN}(x_l) \qquad (3)$$

$$a_l = w_a \cdot \mathrm{CNN}(x_l) \qquad (4)$$

In Eq (3) and Eq (4), $w \in \mathbb{R}^{200 \times 1536}$ and $w_a \in \mathbb{R}^{1 \times 1536}$. The final embedding vector $z$ is a weighted sum of these region-based vectors, weighted by their normalized confidence scores (Eq. 5):

$$z = \sum_{l} \mathrm{softmax}(a)_l \, v_l \qquad (5)$$

This intuitive idea was also used in [64] for visual question answering, but in our case, the attention distribution does not depend on questions. In Fig. 2, we show how we use bottom-up attention to weigh the different regions.
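
The region-based image embedding can be sketched in a few lines. This is an illustrative sketch only: it assumes the confidence scores are normalized with a softmax before the weighted sum, uses plain lists in place of learned layers, and all names are hypothetical.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def embed_image(region_features, w_embed, w_attn):
    """Combine per-region embeddings into one image embedding.

    region_features: one CNN feature vector per proposed region.
    w_embed: projection matrix (output_dim x feature_dim).
    w_attn: attention vector (feature_dim) producing one score per region.
    """
    embeddings = [matvec(w_embed, f) for f in region_features]
    scores = [sum(w * x for w, x in zip(w_attn, f)) for f in region_features]
    weights = softmax(scores)  # assumed normalization of the confidence scores
    dim = len(embeddings[0])
    image_vec = [
        sum(wt * emb[k] for wt, emb in zip(weights, embeddings))
        for k in range(dim)
    ]
    return image_vec, weights
```

When the attention vector is all zeros every region receives the same weight, so the image embedding reduces to the plain average of the region embeddings.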

Figure 2: Our image embedding model with knowledge branch. In the main branch (top), multiple image regions are proposed by the region proposal network. Attention weighing is then applied on these regions and the embedding of the image is computed as a weighted sum of the regions’ embedding vectors. The knowledge branch (bottom) predicts the existence of symbols, maps these to 200-D, and adds them to the image embedding (top).

The loss used to learn the image-text embedding is the same as in Eq. (1), but defined using the region-based image representation $z$ instead of $v$: $L(z, t; \theta)$.

We demonstrate that (1) learning a region proposal network with attention, and (2) learning from symbol bounding boxes, greatly help the statement retrieval task. In particular, statement ranking results are worse if we use a generic pre-trained region proposal network. We argue that general-purpose object detection models cannot capture nuance in ads since they ignore uncommon or abstract objects.

3.4 Constraints via symbols and captions

In Sec. 3.2, we describe how we learn a joint image-text space using a triplet loss defined over pairs of images and their corresponding reason statements. Since symbols provide additional information that humans use to decode ads, we now propose additional constraints to our triplet loss such that two images (and their statements) that were annotated with the same symbol are closer in the learned space than images (and statements) annotated with different symbols. The extra loss term we use to constrain the training process is shown in Eq (6), where $s$ is the 200-D embedding of symbol labels.

$$L_{sym}(s, z, t; \theta) = \sum_{i} \sum_{j \neq i} \Big( \big[ \|s_i - z_i\|_2^2 - \|s_i - z_j\|_2^2 + \beta \big]_+ + \big[ \|s_i - t_i\|_2^2 - \|s_i - t_j\|_2^2 + \beta \big]_+ \Big) \qquad (6)$$

Here $s_i$ is the embedding of the symbol label of ad $i$, $z$ and $t$ are the image and statement embeddings, ad $j$ carries a different symbol, $[\cdot]_+ = \max(0, \cdot)$, and $\beta$ is the margin of Eq (1).

By applying the new constraints, the model converges faster and the training process becomes more stable. At the same time, we explicitly embed symbol labels in the same feature space as images and statements. These symbol embedding vectors serve as entry points for external knowledge and are further discussed in Sec. 3.5.

Further, note that there is some regularity in terms of the objects that ads with similar rhetoric portray. For example, environment ads often feature animals, safe driving ads feature cars, beauty ads feature faces, drink ads feature bottles, etc. The Ads Dataset contains insufficient data to properly learn about object categories. Thus, to ground the embeddings that our model learns, we use DenseCap [29] to “annotate” the images with captions, then we create additional constraints out of these “annotations.” If two images/statements have similar DenseCap predicted captions, they should be closer than images/statements with different captions. The extra loss term we use to constrain the training process is shown in Eq (7), where $o$ is the embedding of the DenseCap captions.

$$L_{obj}(o, z, t; \theta) = \sum_{i} \sum_{j \neq i} \Big( \big[ \|o_i - z_i\|_2^2 - \|o_i - z_j\|_2^2 + \beta \big]_+ + \big[ \|o_i - t_i\|_2^2 - \|o_i - t_j\|_2^2 + \beta \big]_+ \Big) \qquad (7)$$

Note that the object/DenseCap embedding model does not share weights with the statements’ embedding model since the meaning of the same surface words may vary in these two different domains.

The same object can be used to make different points, e.g. faces in beauty ads vs domestic violence ads, cars in car ads vs safety ads, etc. Similarly, symbol labels in isolation do not tell the full story of the ad. Thus, we reduce the impact of the symbol-based and object-based constraints by weighting the corresponding losses by 0.1.
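
A minimal sketch of such a down-weighted constraint is given below. The exact form of the constraint is an assumption made for illustration: the embedded label acts as the anchor, ads carrying a different label act as negatives, unlabeled ads are skipped, and distances are squared Euclidean. All names are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance (an assumed choice of distance)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def constraint_loss(labels, label_vecs, images, texts, margin=0.4):
    """Hinge loss pulling images and statements toward their label embedding.

    labels[i] is the symbol (or caption word) of ad i, or None when the ad
    has no annotation; label_vecs maps each label to its embedding.
    """
    loss = 0.0
    for i, label in enumerate(labels):
        if label is None:
            continue  # symbol labels are sparse, so unlabeled ads are skipped
        anchor = label_vecs[label]
        for j, other in enumerate(labels):
            if j == i or other is None or other == label:
                continue  # only ads with a different label act as negatives
            loss += max(0.0, sq_dist(anchor, images[i]) - sq_dist(anchor, images[j]) + margin)
            loss += max(0.0, sq_dist(anchor, texts[i]) - sq_dist(anchor, texts[j]) + margin)
    return loss


def total_loss(main_loss, symbol_loss, object_loss, constraint_weight=0.1):
    """Add the two constraint losses to the main loss with a small weight."""
    return main_loss + constraint_weight * (symbol_loss + object_loss)
```

Keeping the weight small lets the image-statement triplet loss dominate, so the labels only nudge the geometry of the learned space.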

Note that [50] also propose to use labels to constrain an embedding space. However, we show in our experiments that it is not sufficient to use any type of label in the domain of interest. We experiment with another type of label from [23]’s dataset, namely the topic of the ad (e.g. what product it is selling), and show that symbols give a greater benefit.

3.5 Additive external knowledge

In this section, we describe how to make use of external knowledge via a symbol representation that is adaptively added to the image representation to compensate for inadequacies of the image embedding. This external knowledge can take the form of a mapping between physical objects and implicit concepts, or a classifier mapping pixels to concepts.

Assume we are viewing a challenging ad whose meaning is not immediately obvious. The only thing we can do is to use our human experience to find some evidence. For example, do the visual cues remind us of concepts we have seen in other ads? This is how external knowledge helps us to decode the ad. Our model is able to interpret ads in the same way: based on an external knowledge base, it infers the abstract symbols. Since the model already knows the embedding vectors of these symbols (see Sec. 3.4), it is able to reconstruct the image representation as a weighted combination of these symbols’ embedding vectors. Fig. 2 (bottom) shows the general idea of the external knowledge branch. It is worth mentioning that our model only uses external knowledge to compensate for its own lack of knowledge, and it assigns small weights to uninformative symbols.

We propose two ways to additively expand the image representation with external knowledge. Both ways are a form of knowledge base (KB) mapping physical evidence to concepts. The first is to directly train classifiers to link certain visuals to the symbolic concept. More specifically, we use the 53-way multilabel symbol classifier from [23] as 53 individual classifiers. We obtain a distribution over the 53 symbols. We learn a weight for each of the classifiers denoting the confidence of the classifier. Therefore, the learned model of the knowledge branch is an attention model weighting the 53 symbol classifiers. The attention weights are adjusted depending on whether a particular symbol is helping in the statement matching task.

The second method is to learn associations between actual objects in the image (surface words for detected objects) and abstract concepts of symbols. For example, what type of ad might I see a “car” in? What about a “rock” or “animal”? We first construct a knowledge base associating object words to symbol words. We compute the similarity in the learned text embedding space between symbol words and DenseCap words, then create a mapping rule (object implies symbol) for each symbol and its five most similar DenseCap words. This results in a matrix with 53 rows, one per symbol, and one column per word in the DenseCap vocabulary. Each row contains 5 entries of 1 denoting the mapping rules, and entries of 0 elsewhere. Examples of learned mappings are shown in Table 6.
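
The construction of the object-to-symbol knowledge base can be sketched as follows. The sketch assumes cosine similarity as the similarity measure in the text embedding space; the toy word vectors and all names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0


def build_kb(symbol_vecs, object_vecs, top_n=5):
    """Build a binary symbol-by-object matrix of mapping rules.

    symbol_vecs / object_vecs map each word to its text embedding.
    Row r has a 1 in the columns of the top_n object words most similar
    to symbol r, and 0 elsewhere.
    """
    symbols = sorted(symbol_vecs)
    objects = sorted(object_vecs)
    matrix = []
    for symbol in symbols:
        sims = [cosine(symbol_vecs[symbol], object_vecs[obj]) for obj in objects]
        order = sorted(range(len(objects)), key=lambda k: sims[k], reverse=True)
        top = set(order[:top_n])
        matrix.append([1 if k in top else 0 for k in range(len(objects))])
    return symbols, objects, matrix


# Toy example: 2 symbols, 3 object words, 2 rules per symbol.
symbol_vecs = {"comfort": [1.0, 0.0], "speed": [0.0, 1.0]}
object_vecs = {"bed": [0.9, 0.1], "car": [0.1, 0.9], "rug": [0.8, 0.3]}
symbols, objects, kb = build_kb(symbol_vecs, object_vecs, top_n=2)
```

Because the rules depend only on word embeddings, the matrix can be built once without looking at any image.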

For a given image, we use [29] to predict the 3 most probable words in the DenseCap vocabulary, and put the results in a multi-hot vector. We then multiply the knowledge-base matrix by this vector to accumulate evidence for the presence of all symbols using the detected objects. We associate a weight with each rule in the KB, which expresses the importance of the rule, i.e. to what extent the corresponding word in the DenseCap vocabulary symbolizes the symbol (e.g. to what extent “rock” is used to illustrate “natural”).

For both methods, we first use the attention weights as a mask, then project the 53-D symbol distribution into 200-D, and add it to the image embedding.
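
The additive refinement for the object-to-symbol variant can be sketched as follows. This is a simplified illustration: the rule weights and the projection would be learned in practice, the toy dimensions replace the 53-D and 200-D spaces, and all names are hypothetical.

```python
def multi_hot(words, vocabulary):
    """Encode the detected words as a 0/1 vector over the vocabulary."""
    present = set(words)
    return [1.0 if word in present else 0.0 for word in vocabulary]


def refine_embedding(image_embedding, kb_matrix, rule_weights, detected, projection):
    """Add projected symbol evidence to the image embedding.

    kb_matrix, rule_weights: symbols x vocabulary (binary rules and their weights).
    detected: multi-hot vector over the vocabulary.
    projection: embedding_dim x symbols matrix.
    """
    # Mask each rule by its weight and accumulate evidence per symbol.
    evidence = [
        sum(rule * weight * hit for rule, weight, hit in zip(rules, weights, detected))
        for rules, weights in zip(kb_matrix, rule_weights)
    ]
    # Project the symbol evidence into the joint embedding space.
    refinement = [sum(p * e for p, e in zip(row, evidence)) for row in projection]
    # Additive refinement of the image representation.
    return [v + r for v, r in zip(image_embedding, refinement)]


vocabulary = ["bed", "car", "rug"]
detected = multi_hot(["bed", "rug"], vocabulary)
kb = [[1, 0, 1], [0, 1, 1]]
weights = [[0.5, 0.0, 0.2], [0.0, 1.0, 0.1]]
projection = [[1.0, 0.0], [0.0, 2.0]]
refined = refine_embedding([0.1, 0.1], kb, weights, detected, projection)
```

A rule with a weight near zero contributes almost nothing, which is how uninformative symbols are suppressed.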

This additive branch is most helpful when the information it contains is not already contained in the main image embedding branch. We found that this happens when the discovered symbols are rare. This poses a learning challenge for our object-to-symbol mapping method. In order to learn attention weights on the full 53-row knowledge-base matrix, we must have enough data, but if we have enough data, the additional branch is not likely to be active. Breaking this dependency is the subject of our future work.

3.6 ADVISE: our final model

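Our final ADs VIsual Semantic Embedding loss combines the losses from Sec. 3.2, 3.3, 3.4, and 3.5:

$$L_{final} = L(z, t; \theta) + 0.1 \, L_{sym}(s, z, t; \theta) + 0.1 \, L_{obj}(o, z, t; \theta) \qquad (8)$$

where $L$ is the triplet loss of Eq (1) computed on the region-based image representation $z$ of Sec. 3.3 (including the additive refinement of Sec. 3.5), $t$ is the statement embedding, and $L_{sym}$ and $L_{obj}$ are the symbol and caption constraints of Eq (6) and Eq (7), weighted by 0.1 as described in Sec. 3.4.
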
4 Experimental Validation

| Method | Recall@3, PSA (higher is better) | Recall@3, Product (higher is better) | Rank, PSA (lower is better) | Rank, Product (lower is better) |
|---|---|---|---|---|
| VSE [35] | 0.313 ± 0.010 | 0.504 ± 0.003 | 10.817 ± 0.177 | 7.394 ± 0.036 |
| One-word [23] | 0.697 ± 0.017 | 0.653 ± 0.004 | 7.934 ± 0.208 | 6.336 ± 0.036 |
| VSE on Ads | 1.220 ± 0.018 | 1.511 ± 0.004 | 3.139 ± 0.095 | 2.112 ± 0.019 |
| ADVISE (Ours) | 1.507 ± 0.018 | 1.726 ± 0.004 | 2.032 ± 0.076 | 1.474 ± 0.016 |

Table 1: Our main result. We observe our method greatly outperforms three recent methods in retrieving matching statements for each ad.

Figure 3: The performance of our ADVISE method compared to the strongest baseline, VSE on Ads. In the first example, the baseline is tricked into thinking this is a beauty ad, as is the intent of the ad designer for creating a more dramatic effect. In the second example, the baseline may have gotten confused by the purple colors often used in Ben & Jerry’s ads.

We evaluate to what extent our proposed method is able to match an ad to its intended message (see Sec. 3.1). We compare our method to the following approaches from recent literature:

  • One-word, the QA method from [23] which uses symbols in a different, less effective way. The goal is to predict a one-word answer to the question “Why should the viewer [action]?”, e.g. “Why should the viewer buy this car?” An answer might be e.g. “reliable” or “fast”. The method combines three features: the VGG embedding of the image, an LSTM embedding of the question, and a distribution over 53 symbols using a symbol classifier. In order to adapt this method to our task, we take the predicted word, and use GloVe similarity to rank the statement options by their similarity to this word.

  • the Visual-Semantic Embedding (VSE) from [35], trained using Flickr30K [75] and COCO [41]. Note that more recent image-text joint embedding methods exist, but these use complex architectures [11] or are specialized to particular applications [16, 6, 7]. We focus on VSE as a simple general-purpose embedding.

  • VSE on Ads uses the same method as [35] but trains it on around 39,000 images from the Ads Dataset and more than 111,000 associated statements (training set size varies for different folds), as described in Sec. 3.2.

We compute two metrics: Recall@3, which denotes the number of true statements ranked within the Top-3, and Rank, which is the averaged ranking value of the highest-ranked true matching statement (highest rank is 0, which means rank at the first place). We expect a good model to have high Recall@3 and low Rank score.
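
The two metrics can be computed as in the following sketch. It assumes each test case provides a ranked list of statement indices and the set of indices of the true statements; the function names are hypothetical.

```python
def recall_at_k(ranking, true_ids, k=3):
    """Number of true statements among the first k entries of the ranking."""
    return sum(1 for idx in ranking[:k] if idx in true_ids)


def best_rank(ranking, true_ids):
    """0-based position of the highest-ranked true statement."""
    for position, idx in enumerate(ranking):
        if idx in true_ids:
            return position
    raise ValueError("ranking contains no true statement")


def evaluate(cases):
    """Average both metrics over (ranking, true_ids) test cases."""
    recalls = [recall_at_k(ranking, true_ids) for ranking, true_ids in cases]
    ranks = [best_rank(ranking, true_ids) for ranking, true_ids in cases]
    return sum(recalls) / len(cases), sum(ranks) / len(cases)


cases = [
    ([7, 2, 9, 4, 30], {2, 4, 30}),  # one true statement in the top 3, best rank 1
    ([4, 2, 30, 7, 9], {2, 4, 30}),  # all three in the top 3, best rank 0
]
mean_recall, mean_rank = evaluate(cases)
```

Since each image has 3 true statements, Recall@3 computed this way ranges from 0 to 3.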

We use five random splits of the dataset into train/validation/test sets, and show mean results and standard error over a total of 62,468 test cases.

We show the improvement that our method produces over state of the art methods, in Table 1. Since public service announcements (e.g. domestic violence campaigns and anti-bullying campaigns) typically use different strategies and sentiments than product ads (e.g. ads for cars and coffee), we separately show the result for PSAs and products. We observe that our method greatly outperforms the prior relevant research. PSAs in general appear harder than product ads, consistent with our argument in Sec. 1. Compared to VSE [35], our method improves Recall@3 and Rank by five times for PSAs, and three/five times for products. Compared to the strongest baseline, VSE on Ads, we improve Rank by 35% for PSAs, and 30% for product ads. Note that in Table 1 we show the better of the two alternative methods in Sec. 3.5, namely the symbol classifier. Qualitative results are shown in Fig. 3.

We also conduct ablation studies to verify the benefit of the components of our method.

  • generic region embedding using image regions from a generic region proposal network [22] trained on the COCO [41] detection dataset.

  • symbol box embedding and attention (Sec. 3.3)

  • symbol/object constraints (Sec. 3.4)

  • additive knowledge (Sec. 3.5), first predicting objects and mapping to symbols (KB objects) or directly predicting symbols via training data as a KB (KB symbols)

| Method | Rec@3 | Rank | % improvement, Rec@3 | % improvement, Rank |
|---|---|---|---|---|
| VSE on Ads | 1.220 | 3.139 | | |
| generic region | 1.384 | 2.414 | 13 | 23 |
| symbol box | 1.452 | 2.159 | 5 | 11 |
| + attention | 1.450 | 2.237 | 0 | -4 |
| + symbol/object | 1.487 | 2.128 | 3 | 5 |
| + KB objects | 1.488 | 2.102 | 0 | 1 |
| + KB symbols | 1.507 | 2.032 | 1 | 5 |

Table 2: Ablation study on PSAs. All external knowledge components (i.e. all except attention) give a boost over the naive VSE.

| Method | Rec@3 | Rank | % improvement, Rec@3 | % improvement, Rank |
|---|---|---|---|---|
| VSE on Ads | 1.511 | 2.112 | | |
| generic region | 1.668 | 1.649 | 9 | 22 |
| symbol box | 1.694 | 1.549 | 2 | 6 |
| + attention | 1.725 | 1.491 | 2 | 4 |

Table 3: Ablation study on products. General-purpose recognition approaches, e.g. regions and attention, produced the main boost. The symbol-based method components in Sec. 3.4 and Sec. 3.5 produced small or no improvements.

The results are shown in Table 2 for PSAs, and Table 3 for products. We also show percent improvement of each new component. Improvement is computed with respect to the previous row, except for KB objects and KB symbols, whose improvement is computed with respect to the third-to-last row, i.e. the method on which both KB methods are based. The largest increase in performance comes from focusing on individual regions within the image to understand the ad’s story. This makes sense because ads are carefully designed and multiple elements work together to convey the message. Qualitative examples showing the impact of regions are shown in Fig. 4. Especially for PSAs, these regions must be learned on the ads domain specifically to further greatly increase performance.
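
The percent improvements can be checked with a small helper. The sketch assumes that improvement is the relative change with respect to the reference row, with the sign flipped for Rank because lower is better; the function name is hypothetical.

```python
def pct_improvement(reference, current, higher_is_better):
    """Relative improvement over the reference row, in whole percent."""
    change = (current - reference) if higher_is_better else (reference - current)
    return round(100.0 * change / reference)


# PSA rows of Table 2: VSE on Ads -> generic region.
rec_gain = pct_improvement(1.220, 1.384, higher_is_better=True)
rank_gain = pct_improvement(3.139, 2.414, higher_is_better=False)

# Attention slightly hurts Rank on PSAs (symbol box -> + attention).
attention_rank = pct_improvement(2.159, 2.237, higher_is_better=False)

# Overall Rank improvement of ADVISE over VSE on Ads (Table 1).
psa_overall = pct_improvement(3.139, 2.032, higher_is_better=False)
product_overall = pct_improvement(2.112, 1.474, higher_is_better=False)
```

The last two values correspond to the overall Rank improvements over VSE on Ads quoted for PSAs and product ads.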

Beyond this, the story that the results tell differs between PSAs and products. Our key idea to rely on external knowledge and symbols is more helpful for the challenging, abstract PSAs that are the focus of our work. In contrast, general-purpose computer vision techniques help more for product ads that rely on straightforward strategies.

Interestingly, attention helps for product ads, but slightly hurts for PSAs. It appears PSAs tell their “story” holistically, so subselecting individual regions is detrimental.

Finally, the additive inclusion of external information helps more when we directly predict the symbols, but also when we first extract objects and map these to symbols. Note that given the plethora of object recognition resources, KB objects is much cheaper in terms of human effort, as KB symbols required over 64,000 symbol labels. In contrast, KB objects simply relies on mappings between object and symbol words, which can be obtained much more efficiently as they are not image-dependent. Thus, KB objects would likely generalize better to a new domain of ads (or ads in a different culture) where the symbol training data from the Ads Dataset is not available.

Figure 4: Visualization of region proposals for PSAs. Note how our proposals based on ads focus on relevant regions of the image, e.g. the smoke (which can often be a symbol) and the tip of the cigarette (left), the wound (middle), and the region of damage in the forest (right). The generic COCO boxes straddle the boundaries of meaningful regions in the ad.

| Method | PSA, Rec@3 | PSA, Rank | Product, Rec@3 | Product, Rank |
|---|---|---|---|---|
| Symbol labels | 2 | 4 | 0 | 1 |
| Topic labels | 1 | 4 | 0 | 0 |

Table 4: % improvement for different types of labels as constraints.

In Table 4, we show that not every type of label suffices as a constraint. In particular, even though the Ads Dataset includes 6 times more topic labels that could be used as constraints compared to symbol labels, symbol labels give a benefit that is at least as large on every metric, and larger for Recall@3 on PSAs and Rank on products. Thus, [50]’s approach is not enough; the type of labels must be carefully chosen.

| Method | Hard stmt (↓) | Slogan (↓) | Clustering (↑) |
|---|---|---|---|
| VSE [35] | 9.676 | 9.564 | 0.173 |
| One-word [23] | 8.725 | 7.365 | N/A |
| VSE on Ads | 4.642 | 3.108 | 0.293 |
| ADVISE (Ours) | 3.835 | 2.336 | 0.356 |

Table 5: Other tasks that our learned image-text embedding helps with. We show Rank for the first two tasks (lower is better), and homogeneity score [56] for the third task (higher is better). N/A indicates the method does not learn an embedding.

In Table 5, we demonstrate the versatility of our learned embedding, compared to the baselines from Table 1. None of the methods were retrained, i.e. we simply used the pre-trained embedding used for the results in Table 1. First, we again perform a statement retrieval task, but make the task harder. In particular, all statements that are to be ranked are from the same topic (i.e. all statements are about car safety or about beauty products). The second task uses creative captions that MTurk workers were asked to write for 2000 ads in [23]. We perform a retrieval task among these slogans, using an image as the query. Finally, we check how well an embedding clusters ad images with respect to a ground-truth “clustering” defined by the topics of ads. For example, if two images show faces but one is a domestic violence ad while the other is a beauty ad, would an embedding accurately place these ads in different clusters despite their visual similarity? We see that our method greatly outperforms all other embeddings.
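
For reference, the homogeneity score used for the clustering task can be computed as in the sketch below, which follows the standard definition of one minus the conditional entropy of the topics given the clusters, divided by the entropy of the topics. How the clusters themselves are obtained from the embedding is not specified here; the function name and the toy labels are hypothetical.

```python
import math
from collections import Counter


def homogeneity(topics, clusters):
    """Homogeneity of a clustering with respect to ground-truth topics.

    Returns 1.0 when every cluster contains ads of a single topic, and
    0.0 when the clusters carry no information about the topics.
    """
    n = len(topics)
    topic_counts = Counter(topics)
    cluster_counts = Counter(clusters)
    joint_counts = Counter(zip(topics, clusters))
    topic_entropy = -sum((c / n) * math.log(c / n) for c in topic_counts.values())
    if topic_entropy == 0:
        return 1.0  # a single topic is trivially homogeneous
    conditional_entropy = -sum(
        (count / n) * math.log(count / cluster_counts[cluster])
        for (_, cluster), count in joint_counts.items()
    )
    return 1.0 - conditional_entropy / topic_entropy


topics = ["beauty", "beauty", "violence", "violence"]
pure = homogeneity(topics, [0, 0, 1, 1])    # each cluster holds one topic
merged = homogeneity(topics, [0, 0, 0, 0])  # one cluster mixes both topics
```

In the face example above, an embedding that separates the beauty ad from the domestic violence ad scores higher than one that groups them by visual similarity.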

To summarize, the main takeaways from our quantitative experiments are as follows:

  • Our ADVISE method greatly exceeds the state of the art (VSE [35] and One-word [23]) for the task of retrieving statements that describe the ad’s message correctly.

  • The region embedding greatly improves upon the traditional embedding, even though the latter is trained directly for the task of interest.

  • Relying on symbol boxes always helps as symbol boxes indicate how ads draw the viewer’s attention.

  • For PSAs, symbol labels as additional constraints help further; importantly, they help not just because they are proxies for the metric learning [50], but because they capture the idea of the ads. In contrast, another type of label that also comes from the Ads Dataset but is much coarser (topics) does not help as much. Regularizing with external information, e.g. predictions about objects, also provides a benefit.

ID  Symbol                                          Statement                                         DenseCap
1   comfort                                         couch, sofa, soft, bed, comfy                     pillow, bed, blanket, couch, rug
2   speed, excitement, adventure, power, fashion    cool                                              sunglasses, sleeve, jacket, carrying, scarf
3   safety, danger, injury, speed, death            driving                                           car, windshield, van, tire, license
4   comfort, relaxation, christmas, safety, family  home                                              house, cabinet, kitchen, bush, sink
5   freedom, vacation, relaxation, sex, family      american                                          flag, persons, mustard, papers, striped
6   health, death, art, injury, violence            think                                             kites, lamp, art, bike, design
7   delicious, hot, food, strong, hunger            ketchup                                           beer, pepper, sauce, jar, juice
8   hunger, food, delicious, hot, desire            food, meal, steak, meals, roast                   meat
9   environment, nature, adventure, travel, strong  wilderness, outdoors, terrain, rugged, rover      rock
10  violence, humor, love, desire, strong           make-up, makeup, maybelline, eyeliner, covergirl  face
11  food, healthy, hunger, delicious, variety       salads, food, salad, menu, toppings               tomato
Table 6: Discovered synonym triplets between symbol words, action/reason words, and DenseCap words. The query is the single word appearing alone in a table cell.

Finally, in Table 6, we show some qualitative results demonstrating the utility of anchoring our learned space with DenseCap predictions. We consider three vocabularies: the 53 symbol words from [23], the 27,999 unique words from the action/reason question-answers, and the 823 unique words from the DenseCap annotations. In the learned space, we compute the nearest neighbors for each symbol, statement, and DenseCap word, to establish rough synonymy. This is the knowledge base used in Sec. 3.5. These discovered results can be used as a “dictionary” showing the meaning of an ad. In other words, if I see a given object, what should I predict the message of the ad is?
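The nearest-neighbor lookup behind such triplets can be sketched as follows. This is only an illustration: it assumes cosine similarity as the closeness measure, and the 2-D vectors and the helper names (`cosine`, `nearest_words`) are invented for the example rather than taken from our learned embedding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_words(query_vec, vocab, k=5):
    """Return the k words of `vocab` (word -> vector) closest to query_vec."""
    ranked = sorted(vocab, key=lambda w: cosine(query_vec, vocab[w]), reverse=True)
    return ranked[:k]


# Toy vectors, invented for illustration; real vectors would come from the
# shared embedding space in which all three vocabularies live.
symbol_vocab = {"comfort": [1.0, 0.1], "danger": [-1.0, 0.2]}
statement_vocab = {"sofa": [0.9, 0.2], "soft": [0.8, 0.0], "crash": [-0.9, 0.1]}
densecap_vocab = {"pillow": [0.95, 0.15], "windshield": [-0.8, 0.3]}

# Query with a symbol word and collect its neighbors in the other vocabularies.
query = symbol_vocab["comfort"]
triplet = (
    "comfort",
    nearest_words(query, statement_vocab, k=2),
    nearest_words(query, densecap_vocab, k=1),
)
```

Querying from a statement word or a DenseCap word works the same way, with the roles of the three vocabularies exchanged.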

We begin with an intuitive result in the first triplet (ID 1). When an ad designer wants to allude to comfort, they might use objects such as a soft bed, where “soft” is a statement word (“Why should I buy this? Because it’s soft.”) In ID 3-6, we demonstrate the evidence in terms of different symbols and different objects, for statements that are given. The first column essentially tells us the meaning of a visual, and the last column tells us how a concept is illustrated. To illustrate “coolness”, one might show sunglasses, and these might refer to adventure. If the statement contains the word “driving,” then perhaps this is a safe driving ad, where visuals in the ad allude to safety, danger, or speed, while physically the visuals contain cars, tires, and windshields. Ads about “home” show houses and kitchens, but these refer to safety, family, and comfort. Freedom, family and relaxation are “American” concepts alluded to by flags. In ID 6, we see further evidence that PSAs (on health, art and violence) are more conceptual (the viewer needs to think).

In IDs 8-10, we see the intuitive context and symbolism associated with “meat” and “rocks,” and the double role that “faces” can play (in beauty and domestic violence ads). Finally, observe the different role of “tomato” (ID 11) vs “ketchup” (ID 7): the former symbolizes health while the latter is associated with flavor and hotness.

5 Conclusion

We presented a method for matching image advertisements to statements which describe the idea of the ad. Our method uses external knowledge in the form of symbols and predicted objects in two ways: as constraints for a joint image-text embedding space, and as an additive component for the image representation. We also verify the effect of state-of-the-art computer vision techniques, in the form of region proposals and attention, for the task of automatically understanding ads. Our method outperforms existing techniques by a large margin. In the future, we will investigate further external resources for decoding ads, such as predictions about the memorability of ads or human attention over them, and use our object-symbol mappings to analyze the variability that the same object category exhibits when used for different ad topics.

References


  • [1] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, and L. Zhang. Bottom-up and top-down attention for image captioning and VQA. CoRR, abs/1707.07998, 2017.
  • [2] L. Anne Hendricks, S. Venugopalan, M. Rohrbach, R. Mooney, K. Saenko, and T. Darrell. Deep compositional captioning: Describing novel object categories without paired training data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [3] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, and D. Parikh. VQA: Visual question answering. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [4] Z. Bylinskii, S. Alsheikh, S. Madan, A. Recasens, K. Zhong, H. Pfister, F. Durand, and A. Oliva. Understanding infographics through textual and visual tag prediction. arXiv preprint arXiv:1709.09215, 2017.
  • [5] Z. Bylinskii, A. Recasens, A. Borji, A. Oliva, A. Torralba, and F. Durand. Where should saliency models look next? In European Conference on Computer Vision, pages 809–824. Springer, 2016.
  • [6] Y. Cao, M. Long, J. Wang, and S. Liu. Deep visual-semantic quantization for efficient image retrieval. In CVPR, 2017.
  • [7] K. Chen, T. Bui, C. Fang, Z. Wang, and R. Nevatia. AMC: Attention guided multi-modal correlation learning for image search. In CVPR, 2017.
  • [8] L.-C. Chen, Y. Yang, J. Wang, W. Xu, and A. L. Yuille. Attention to scale: Scale-aware semantic image segmentation. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
  • [9] T.-H. Chen, Y.-H. Liao, C.-Y. Chuang, W.-T. Hsu, J. Fu, and M. Sun. Show, adapt and tell: Adversarial training of cross-domain image captioner. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [10] J. Donahue, L. Anne Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and T. Darrell. Long-term recurrent convolutional networks for visual recognition and description. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [11] A. Eisenschtat and L. Wolf. Linking image and text with 2-way nets. In CVPR, 2017.
  • [12] F. Faghri, D. J. Fleet, J. R. Kiros, and S. Fidler. VSE++: Improved visual-semantic embeddings. arXiv preprint arXiv:1707.05612, 2017.
  • [13] J. Fu, H. Zheng, and T. Mei. Look closer to see better: recurrent attention convolutional neural network for fine-grained image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  • [14] C. Gan, Z. Gan, X. He, J. Gao, and L. Deng. StyleNet: Generating attractive visual captions with styles. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [15] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.
  • [16] L. Gomez, Y. Patel, M. Rusinol, D. Karatzas, and C. V. Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR, 2017.
  • [17] W. Goo, J. Kim, G. Kim, and S. J. Hwang. Taxonomy-regularized semantic deep convolutional neural networks. In European Conference on Computer Vision, pages 86–101. Springer, 2016.
  • [18] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [19] H. Hotelling. Relations between two sets of variates. Biometrika, 28(3/4):321–377, 1936.
  • [20] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • [21] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, and K. Saenko. Learning to reason: End-to-end module networks for visual question answering. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [22] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, and K. Murphy. Speed/accuracy trade-offs for modern convolutional object detectors. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [23] Z. Hussain, M. Zhang, X. Zhang, K. Ye, C. Thomas, Z. Agha, N. Ong, and A. Kovashka. Automatic understanding of image and video advertisements. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [24] P. Isola, J. Xiao, D. Parikh, A. Torralba, and A. Oliva. What makes a photograph memorable? IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(7):1469–1482, 2014.
  • [25] L. Itti, C. Koch, and E. Niebur. A model of saliency-based visual attention for rapid scene analysis. IEEE Transactions on pattern analysis and machine intelligence, 20(11):1254–1259, 1998.
  • [26] M. Iyyer, V. Manjunatha, A. Guha, Y. Vyas, J. Boyd-Graber, H. Daume, III, and L. S. Davis. The amazing mysteries of the gutter: Drawing inferences between panels in comic book narratives. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [27] M. Jiang, S. Huang, J. Duan, and Q. Zhao. SALICON: Saliency in context. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1072–1080, 2015.
  • [28] J. Johnson, B. Hariharan, L. van der Maaten, J. Hoffman, L. Fei-Fei, C. L. Zitnick, and R. Girshick. Inferring and executing programs for visual reasoning. In ICCV, 2017.
  • [29] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [30] J. Joo, W. Li, F. F. Steen, and S.-C. Zhu. Visual persuasion: Inferring communicative intents of images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014.
  • [31] J. Joo, F. F. Steen, and S.-C. Zhu. Automated facial trait judgment and election outcome prediction: Social dimensions of face. In Proceedings of the IEEE International Conference on Computer Vision, pages 3712–3720, 2015.
  • [32] A. Karpathy and L. Fei-Fei. Deep visual-semantic alignments for generating image descriptions. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [33] A. Kembhavi, M. Salvato, E. Kolve, M. Seo, H. Hajishirzi, and A. Farhadi. A diagram is worth a dozen images. In European Conference on Computer Vision, pages 235–251. Springer, 2016.
  • [34] A. Khosla, A. S. Raju, A. Torralba, and A. Oliva. Understanding and predicting image memorability at a large scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 2390–2398, 2015.
  • [35] R. Kiros, R. Salakhutdinov, and R. S. Zemel. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
  • [36] R. Krishna, K. Hata, F. Ren, L. Fei-Fei, and J. Carlos Niebles. Dense-captioning events in videos. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [37] M. Kümmerer, T. S. Wallis, and M. Bethge. DeepGaze II: Reading fixations from deep features trained on object recognition. arXiv preprint arXiv:1610.01563, 2016.
  • [38] J. H. Leigh and T. G. Gabel. Symbolic interactionism: its effects on consumer behaviour and implications for marketing strategy. Journal of Services Marketing, 6(3):5–16, 1992.
  • [39] S. J. Levy. Symbols for sale. Harvard business review, 37(4):117–124, 1959.
  • [40] X. Li, D. Hu, and X. Lu. Image2song: Song retrieval via bridging image content and lyric words. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [41] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: common objects in context. CoRR, abs/1405.0312, 2014.
  • [42] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In European conference on computer vision, pages 21–37. Springer, 2016.
  • [43] J. Lu, C. Xiong, D. Parikh, and R. Socher. Knowing when to look: Adaptive attention via a visual sentinel for image captioning. In CVPR, 2017.
  • [44] M. Malinowski, M. Rohrbach, and M. Fritz. Ask your neurons: A neural-based approach to answering questions about images. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [45] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [46] T. Mikolov, K. Chen, G. Corrado, and J. Dean. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013.
  • [47] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed representations of words and phrases and their compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 3111–3119. Curran Associates, Inc., 2013.
  • [48] T. Mikolov, W.-t. Yih, and G. Zweig. Linguistic regularities in continuous space word representations. In HLT-NAACL, pages 746–751, 2013.
  • [49] I. Misra, A. Gupta, and M. Hebert. From red wine to red tomato: Composition with context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [50] Y. Movshovitz-Attias, A. Toshev, T. K. Leung, S. Ioffe, and S. Singh. No fuss distance metric learning using proxies. In ICCV, 2017.
  • [51] H. Nam, J.-W. Ha, and J. Kim. Dual attention networks for multimodal reasoning and matching. In CVPR, 2017.
  • [52] M. Pedersoli, T. Lucas, C. Schmid, and J. Verbeek. Areas of attention for image captioning. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [53] B. A. Plummer, M. Brown, and S. Lazebnik. Enhancing video summarization via vision-language embedding. In CVPR, 2017.
  • [54] M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [55] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems, pages 91–99, 2015.
  • [56] A. Rosenberg and J. Hirschberg. V-measure: A conditional entropy-based external cluster evaluation measure. In EMNLP-CoNLL, volume 7, pages 410–420, 2007.
  • [57] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2015.
  • [58] L. M. Scott. Images in advertising: The need for a theory of visual rhetoric. Journal of consumer research, 21(2):252–273, 1994.
  • [59] R. Shetty, M. Rohrbach, L. Anne Hendricks, M. Fritz, and B. Schiele. Speaking the same language: Matching machine to human captions by adversarial training. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [60] K. J. Shih, S. Singh, and D. Hoiem. Where to look: Focus regions for visual question answering. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
  • [61] N. E. Spears, J. C. Mowen, and G. Chakraborty. Symbolic role of animals in print advertising: Content analysis and conceptual development. Journal of Business Research, 37(2):87–95, 1996.
  • [62] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi. Inception-v4, inception-resnet and the impact of residual connections on learning. In ICLR 2016 Workshop, 2016.
  • [63] M. Tapaswi, Y. Zhu, R. Stiefelhagen, A. Torralba, R. Urtasun, and S. Fidler. MovieQA: Understanding stories in movies through question-answering. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [64] D. Teney, P. Anderson, X. He, and A. v. d. Hengel. Tips and tricks for visual question answering: Learnings from the 2017 challenge. arXiv preprint arXiv:1708.02711, 2017.
  • [65] R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik. Context-aware captions from context-agnostic supervision. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [66] S. Venugopalan, L. Anne Hendricks, M. Rohrbach, R. Mooney, T. Darrell, and K. Saenko. Captioning images with diverse objects. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [67] O. Vinyals, A. Toshev, S. Bengio, and D. Erhan. Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015.
  • [68] P. Wang, Q. Wu, C. Shen, A. Dick, and A. van den Hengel. FVQA: Fact-based visual question answering. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
  • [69] Q. Wu, P. Wang, C. Shen, A. Dick, and A. van den Hengel. Ask me anything: Free-form visual question answering based on knowledge from external sources. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [70] H. Xu and K. Saenko. Ask, attend and answer: Exploring question-guided spatial attention for visual question answering. In European Conference on Computer Vision (ECCV). Springer, 2016.
  • [71] L. Yang, K. Tang, J. Yang, and L.-J. Li. Dense captioning with joint inference and visual context. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [72] Z. Yang, X. He, J. Gao, L. Deng, and A. Smola. Stacked attention networks for image question answering. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
  • [73] Q. You, H. Jin, Z. Wang, C. Fang, and J. Luo. Image captioning with semantic attention. In Computer Vision and Pattern Recognition (CVPR). IEEE, 2016.
  • [74] C. E. Young. The advertising research handbook. Ideas in Flight, 2005.
  • [75] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78, 2014.
  • [76] L. Yu, E. Park, A. C. Berg, and T. L. Berg. Visual madlibs: Fill in the blank description generation and question answering. In The IEEE International Conference on Computer Vision (ICCV), December 2015.
  • [77] Y. Yu, J. Choi, Y. Kim, K. Yoo, S.-H. Lee, and G. Kim. Supervising neural attention models for video captioning by human gaze data. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.
  • [78] J. Zhang and S. Sclaroff. Exploiting surroundedness for saliency detection: a boolean map approach. IEEE transactions on pattern analysis and machine intelligence, 38(5):889–902, 2016.
  • [79] H. Zheng, J. Fu, T. Mei, and J. Luo. Learning multi-attention convolutional neural network for fine-grained image recognition. In The IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [80] Y. Zhu, O. Groth, M. Bernstein, and L. Fei-Fei. Visual7W: Grounded question answering in images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [81] Y. Zhu, J. J. Lim, and L. Fei-Fei. Knowledge acquisition for visual question answering via iterative querying. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.