Neural Naturalist: Generating Fine-Grained Image Comparisons

Maxwell Forbes      Christine Kaeser-Chen      Piyush Sharma      Serge Belongie
Cornell University and Cornell Tech

We introduce the new Birds-to-Words dataset of 41k sentences describing fine-grained differences between photographs of birds. The language collected is highly detailed, while remaining understandable to the everyday observer (e.g., “heart-shaped face,” “squat body”). Paragraph-length descriptions naturally adapt to varying levels of taxonomic and visual distance—drawn from a novel stratified sampling approach—with the appropriate level of detail. We propose a new model called Neural Naturalist that uses a joint image encoding and comparative module to generate comparative language, and evaluate the results with humans who must use the descriptions to distinguish real images.

Our results indicate promising potential for neural models to explain differences in visual embedding space using natural language, as well as a concrete path for machine learning to aid citizen scientists in their effort to preserve biodiversity.

1 Introduction

Figure 1: The Birds-to-Words dataset: comparative descriptions adapt naturally to the appropriate level of detail (orange underlines). A difficult distinction (top) is given a longer and more fine-grained comparison than an easier one (bottom). Annotators organically use everyday language to refer to parts (green highlights).

Humans are adept at making fine-grained comparisons, but sometimes require aid in distinguishing visually similar classes. (This work was done during an internship at Google.) Take, for example, a citizen science effort like iNaturalist, where everyday people photograph wildlife, and the community reaches a consensus on the taxonomic label for each instance. Many species are visually similar (e.g., Figure 1, top), making them difficult for a casual observer to label correctly. This puts an undue strain on lieutenants of the citizen science community, who must curate and justify labels for a large number of instances. While everyone may be capable of making such distinctions visually, non-experts require training to know what to look for.

Field guides exist for the purpose of helping people learn how to distinguish between species. Unfortunately, field guides are costly to create because writing such a guide requires expert knowledge of class-level distinctions.

In this paper, we study the problem of explaining the differences between two images using natural language. We introduce a new dataset called Birds-to-Words of paragraph-length descriptions of the differences between pairs of bird photographs. We find several benefits from eliciting comparisons: (a) without a guide, annotators naturally break down the subject of the image (e.g., a bird) into pieces understood by the everyday observer (e.g., head, wings, legs); (b) by sampling comparisons from varying visual and taxonomic distances, the language exhibits naturally adaptive granularity of detail based on the distinctions required (e.g., “red body” vs “tiny stripe above its eye”); (c) in contrast to requiring comparisons between categories (e.g., comparing one species vs. another), non-experts can provide high-quality annotations without needing domain expertise.

We also propose the Neural Naturalist model architecture for generating comparisons given two images as input. After embedding images into a latent space with a CNN, the model combines the two image representations with a joint encoding and comparative module before passing them to a Transformer decoder. We find that introducing a comparative module—an additional Transformer encoder—over the combined latent image representations yields better generations.

Our results suggest that these classes of neural models can assist in fine-grained visual domains when humans require aid to distinguish closely related instances. Non-experts—such as amateur naturalists trying to tell apart two species—stand to benefit from comparative explanations. Our work approaches this sweet-spot of visual expertise, where any two in-domain images can be compared, and the language is detailed, adaptive to the types of differences observed, and still understandable by laypeople.

Recent work has made impressive progress on context sensitive image captioning. One direction of work uses class labels as context, with the objective of generating captions that distinguish why the image belongs to one class over others Hendricks et al. (2016); Vedantam et al. (2017). Another choice is to use a second image as context, and generate a caption that distinguishes one image from another. Previous work has studied ways to generalize single-image captions into comparative language Vedantam et al. (2017), as well as comparing two images with high pixel overlap (e.g., surveillance footage) Jhamtani and Berg-Kirkpatrick (2018). Our work complements these efforts by studying directly comparative, everyday language on image pairs with no pixel overlap.

Figure 2: Illustration of the pivot-branch stratified sampling algorithm used to construct the Birds-to-Words dataset. The algorithm harnesses visual and taxonomic distances (increasing vertically) to create a challenging task with broad coverage.

Our approach outlines a new way for models to aid humans in making visual distinctions. The Neural Naturalist model requires two instances as input; these could be, for example, a query image and an image from a candidate class. By differentiating between these two inputs, a model may help point out subtle distinctions (e.g., one animal has spots on its side), or features that indicate a good match (e.g., only a slight difference in size). These explanations can aid in understanding both differences between species, as well as variance within instances of a single species.

Dataset | Domain | Lang | Images Ctx | Images Cap | Example
CUB Captions (R, 2016) | Birds | m | 1 | 1 | “An all black bird with a very long rectrices and relatively dull bill.”
CUB-Justify (V, 2017) | Birds | s | 7 | 1 | “The bird has white orbital feathers, a black crown, and yellow tertials.”
Spot-the-Diff (J&B, 2018) | Surveillance | e | 2 | 1–2 | “Silver car is gone. Person in a white t shirt appears. 3rd person in the group is gone.”
Birds-to-Words (this work) | Birds | e | 2 | 2 | “Animal1 is gray, while animal2 is white. Animal2 has a long, yellow beak, while animal1’s beak is shorter and gray. Animal2 appears to be larger than animal1.”
Table 1: Comparison with recent fine-grained language-and-vision datasets. Lang values: s = scientific, e = everyday, m = mixed. Images Ctx = number of images shown, Images Cap = number of images described in caption. Dataset citations: R = Reed et al., V = Vedantam et al., J&B = Jhamtani and Berg-Kirkpatrick.

2 Birds-to-Words Dataset

Our goal is to collect a dataset of tuples (i1, i2, p), where i1 and i2 are images, and p is a natural language comparison between the two. Given a domain of images, this collection depends critically on the criteria we use to select image pairs.

If we sample image pairs uniformly at random, we will end up with comparisons encompassing a broad range of phenomena. For example, two images that are quite different will yield categorical comparisons (“One is a bird, one is a mushroom.”). Alternatively, if the two images are very similar, such as two angles of the same creature, comparisons between them will focus on highly detailed nuances, such as variations in pose. These phenomena support rich lines of research, such as object classification Deng et al. (2009) and pose estimation Murphy-Chutorian and Trivedi (2009).

We aim to land somewhere in the middle. We wish to consider sets of distinguishable but intimately related pairs. This sweet spot of visual similarity is akin to the genre of differences studied in fine-grained visual classification Wah et al. (2011); Krause et al. (2013a). We approach this collection with a two-phase data sampling procedure. We first select pivot images by sampling from our full domain uniformly at random. We then branch from these images into a set of secondary images that emphasizes fine-grained comparisons, but yields broad coverage over the set of sensible relations. Figure 2 provides an illustration of our sampling procedure.
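The two-phase procedure can be sketched in code. The helper names below (`species_to_images`, `visual_neighbors`, `rank_siblings`) and the `per_rank` branch count are hypothetical stand-ins for illustration, not the paper's actual implementation:

```python
import random

def sample_pairs(species_to_images, visual_neighbors, rank_siblings,
                 num_pivots, per_rank=2, seed=0):
    """Two-phase pivot-branch sampling (sketch).

    species_to_images : dict mapping species name -> list of image ids
    visual_neighbors  : callable(image_id) -> visually similar image ids,
                        nearest first
    rank_siblings     : callable(species, rank) -> image ids whose species
                        shares that taxonomic rank with the pivot's species
    """
    rng = random.Random(seed)
    # Phase 1: sample species uniformly (not by observation count) to
    # counter bias toward common species, then pick one pivot image each.
    pivot_species = rng.sample(sorted(species_to_images), num_pivots)
    pairs = []
    for sp in pivot_species:
        pivot = rng.choice(species_to_images[sp])
        # Phase 2a: branch to the nearest images in visual embedding space.
        branches = list(visual_neighbors(pivot))[:per_rank]
        # Phase 2b: branch across increasing taxonomic distances.
        for rank in ("species", "genus", "family", "order", "class"):
            pool = [i for i in rank_siblings(sp, rank) if i != pivot]
            branches += rng.sample(pool, min(per_rank, len(pool)))
        pairs += [(pivot, b) for b in branches]
    return pairs
```

Sampling species uniformly before picking pivots is what gives the stratification its coverage; branching then controls how similar each comparison partner is.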

2.1 Domain

Birds-to-Words Dataset
Image pairs 3,347
Paragraphs / pair 4.8
Paragraphs 16,067
Tokens / paragraph 32.1 mean
Sentences 40,969
Sentences / paragraph 2.6 mean
Clarity rating
Train / dev / test 80% / 10% / 10%
Figure 3: Annotation lengths for compared datasets (top), and statistics for the proposed Birds-to-Words dataset (bottom). The Birds-to-Words dataset has a large mass of long descriptions in comparison to related datasets.

We sample images from iNaturalist, a citizen science effort to collect research-grade observations of plants and animals in the wild. (Research-grade observations have met or exceeded iNaturalist’s guidelines for community consensus on the taxonomic label for a photograph.) We restrict our domain to instances labeled under the taxonomic class Aves (i.e., birds). To disambiguate, we use class to denote the taxonomic rank in scientific classification, and “class” to refer to the machine learning usage of the term as a label in classification. While a broader domain would yield some comparable instances (e.g., bird and dragonfly share some common body parts), choosing only Aves ensures that all instances will be similar enough structurally to be comparable, and avoids the gut reaction comparison pointing out the differences in animal type. This choice yields 1.7M research-grade images and corresponding taxonomic labels from iNaturalist. We then perform pivot-branch sampling on this set to choose pairs for annotation.

Figure 4: The proposed Neural Naturalist model architecture. The multiplicative joint encoding and Transformer-based comparative module yield the best comparisons between images.

2.2 Pivot Images

The Aves domain in iNaturalist contains instances of 9k distinct species, with heavy observation bias toward more common species (such as the mallard duck). We uniformly sample species from the set of 9k to help overcome this bias. In total, we select 405 species and corresponding photographs to use as pivot images.

2.3 Branching Images

We use both a visual similarity measure and the taxonomy to sample a set of comparison images branching off from each pivot image, applying a fixed branching factor to each pivot.

To capture images visually similar to each pivot, we employ a similarity function over image embeddings. We use an Inception-v4 Szegedy et al. (2017) network pretrained on ImageNet Deng et al. (2009) and then fine-tuned to perform species classification on all research-grade observations in iNaturalist. We take the embedding for each image from the last layer of the network before the final softmax. We perform a k-nearest-neighbor search by quantizing each embedding and using L2 distance Wu et al. (2017); Guo et al. (2016), selecting the closest images in embedding space.
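The ranking criterion can be illustrated with a brute-force sketch; the paper uses quantized approximate search for scale, and the `embeddings` dict and function name here are illustrative, not from the paper:

```python
def l2_nearest(query_vec, embeddings, k):
    """Return the k image ids whose embeddings are closest to query_vec
    under L2 distance. Exact brute-force version for illustration; a
    quantized approximate search does the same ranking at scale."""
    def dist2(vec):
        # Squared L2 distance preserves the L2 ranking, so we skip the sqrt.
        return sum((a - b) ** 2 for a, b in zip(query_vec, vec))
    ranked = sorted(embeddings, key=lambda image_id: dist2(embeddings[image_id]))
    return ranked[:k]
```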

We also use the iNaturalist scientific taxonomy to sample images at varying levels of taxonomic distance from the pivot. We select taxonomically branched images by sampling two images each from the same species, genus, family, order, and class as the pivot image. This yields 4,860 raw image pairs.

2.4 Language Collection

For each image pair, we elicit five natural language paragraphs describing the differences between them.

An annotator is instructed to write a paragraph (usually 2–5 sentences) comparing and contrasting the animal appearing in each image. We instruct annotators not to explicitly mention the species (e.g., “Animal 1 is a penguin”), and to instead focus on visual details (e.g., “Animal 1 has a black body and a white belly”). They are additionally instructed to avoid mentioning aspects of the background, scenery, or pose captured in the photograph (e.g., “Animal 2 is perched on a coconut”).

We discard all annotations for an image pair where either image did not receive enough positive ratings of image clarity. This yields a total of 3,347 image pairs, annotated with 16,067 paragraphs. Detailed statistics of the Birds-to-Words dataset are shown in Figure 3, and examples are provided in Figure 5. Further details of both our algorithmic approach and dataset construction are given in Appendices A and B.

3 Neural Naturalist Model

Figure 5: Samples from the dev split of the proposed Birds-to-Words dataset, along with Neural Naturalist model output (M) and one of five ground truth paragraphs (G). The second row highlights failure cases in red. The model produces coherent descriptions of variable granularity, though emphasis and assignment can be improved.

Given two images as input, our task is to generate a natural language paragraph that compares the two images.


Recent image captioning approaches Xu et al. (2015); Sharma et al. (2018) extract image features using a convolutional neural network (CNN) which serve as input to a language decoder, typically a recurrent neural network (RNN) Mikolov et al. (2010) or Transformer Vaswani et al. (2017). We extend this paradigm with a joint encoding step and comparative module to study how best to encode and transform multiple latent image embeddings. A schematic of the model is outlined in Figure 4, and its key components are described in the upcoming sections.

3.1 Image Embedding

Both input images are first processed using CNNs with shared weights. In this work, we consider ResNet He et al. (2016) and Inception Szegedy et al. (2017) architectures. In both cases, we extract the representation from the deepest layer immediately before the classification layer. This yields a dense 2D grid of local image feature vectors, shaped h × w × d. We then flatten each feature grid into an (h·w) × d matrix of local features.
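The flattening step is simple enough to state exactly; this sketch operates on nested lists rather than tensors, and the function name is our own:

```python
def flatten_grid(grid):
    """Flatten an h x w grid of d-dimensional feature vectors into an
    (h*w) x d matrix: one row of local features per spatial location."""
    return [vec for row in grid for vec in row]
```

Each row of the result corresponds to one spatial location of the CNN feature map, which is what the downstream attention layers attend over.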

3.2 Joint Encoding

We define a joint encoding of the images which contains both embedded images, a mutated combination of them, or both. As possible mutations we consider simple element-wise operations over the two embeddings. We try these encoding variants to explore whether simple mutations can effectively combine the image representations.
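A minimal sketch of such element-wise mutations follows; the particular operation set here (add, subtract, multiply) is an assumption based on the ablation discussion, and the encodings are nested lists standing in for tensors:

```python
def joint_encoding(e1, e2, mode):
    """Element-wise 'mutation' of two flattened image encodings, each an
    (h*w) x d matrix given as nested lists. The operations here stand in
    for the variants explored in the ablations."""
    ops = {
        "add": lambda a, b: a + b,
        "sub": lambda a, b: a - b,
        "mul": lambda a, b: a * b,
    }
    op = ops[mode]
    # Apply the chosen operation position-by-position across both grids.
    return [[op(a, b) for a, b in zip(r1, r2)] for r1, r2 in zip(e1, e2)]
```

Subtraction directly encodes a difference signal, while multiplication acts as a gating between the two embeddings; which works best depends on the comparative module that follows.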

3.3 Comparative Module

Given the joint encoding of the images, we would like to represent the differences in feature space in order to generate comparative descriptions. We explore two variants at this stage. The first is a direct passthrough of the joint encoding. This is analogous to “standard” CNN+LSTM architectures, which embed images and pass them directly to an LSTM for decoding. Because we try different joint encodings, a passthrough here also allows us to study their effects in isolation.

Our second variant is a multi-layer Transformer encoder. This provides additional self-attentive mutations over the latent representations. Each layer contains a multi-headed attention mechanism. The intent is that self-attention in the Transformer encoder layers will guide comparisons across the joint image encoding.

Denoting LN as Layer Norm, FF as Feed Forward, and MHA as multi-headed attention, with x_l as the output of the l-th layer of the Transformer encoder and x_0 set to the joint encoding:

    a_l = LN(x_{l-1} + MHA(x_{l-1}))
    x_l = LN(a_l + FF(a_l))

3.4 Decoder

We use a multi-layer Transformer decoder architecture to produce distributions over output tokens. The Transformer decoder is similar to an encoder, but it contains an intermediary multi-headed attention layer which has access to the encoder’s output at every time step.

The decoder takes as input the text observed during training, which is modulated with a position-based encoding and masked in the first multi-headed attention layer.

4 Experiments

                                            Dev                        Test
                                   BLEU-4  ROUGE-L  CIDEr-D   BLEU-4  ROUGE-L  CIDEr-D
Most Frequent                       0.20     0.31     0.42     0.20     0.30     0.43
Text-Only                           0.14     0.36     0.05     0.14     0.36     0.07
Nearest Neighbor                    0.18     0.40     0.15     0.14     0.36     0.06
CNN + LSTM Vinyals et al. (2015)    0.22     0.40     0.13     0.20     0.37     0.07
CNN + Attn. + LSTM Xu et al. (2015) 0.21     0.40     0.14     0.19     0.38     0.11
Neural Naturalist – Simple Joint Encoding    0.23  0.44  0.23     –     –     –
Neural Naturalist – No Comparative Module    0.09  0.27  0.09     –     –     –
Neural Naturalist – Small Decoder            0.22  0.42  0.25     –     –     –
Neural Naturalist – Full                     0.24  0.46  0.28   0.22  0.43  0.25
Human                     0.26±0.02  0.47±0.01  0.39±0.04   0.27±0.01  0.47±0.01  0.42±0.03
Table 2: Experimental results for comparative paragraph generation on the proposed dataset. For human captions, mean and standard deviation are given for a one-vs-rest scheme across twenty-five runs. We observed that CIDEr-D scores had little correlation with description quality. The Neural Naturalist model benefits from a strong joint encoding and Transformer-based comparative module, achieving the highest BLEU-4 and ROUGE-L scores.

We train the Neural Naturalist model to produce descriptions of the differences between images in the Birds-to-Words dataset. We partition the dataset into train (80%), val (10%), and test (10%) splits based on the pivot images. This ensures species are unique across the different splits.
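Splitting by pivot rather than by pair can be sketched as follows; the function name and exact shuffling scheme are our own illustration of the stated invariant (each pivot, and hence its species, appears in exactly one split):

```python
import random

def split_by_pivot(pairs, fractions=(0.8, 0.1, 0.1), seed=0):
    """Partition (pivot, branch) image pairs into train/val/test so that
    each pivot image lands in exactly one split."""
    pivots = sorted({p for p, _ in pairs})
    rng = random.Random(seed)
    rng.shuffle(pivots)
    cut1 = int(fractions[0] * len(pivots))
    cut2 = cut1 + int(fractions[1] * len(pivots))
    # Assign every pivot to one split, then route its pairs accordingly.
    assign = {piv: ("train" if i < cut1 else "val" if i < cut2 else "test")
              for i, piv in enumerate(pivots)}
    splits = {"train": [], "val": [], "test": []}
    for p, b in pairs:
        splits[assign[p]].append((p, b))
    return splits
```

Splitting by pair instead would leak pivot species between train and test, inflating evaluation scores.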

We provide model hyperparameters and optimization details in Appendix  C.

4.1 Baselines and Variants

The most frequent paragraph baseline produces only the most observed description in the training data, which is that the two animals appear to be exactly the same. Text-Only samples captions from the training data according to their empirical distribution. Nearest Neighbor embeds both images and computes the lowest total distance to a training set pair, sampling a caption from it. We include two standard neural baselines, CNN (+ Attention) + LSTM, which concatenate the image embeddings, optionally perform attention, and decode with an LSTM. The main model variants we consider are a simple joint encoding, no comparative module, a small (1-layer) decoder, and our full Neural Naturalist model. We also try several other ablations and model variants, which we describe later.

4.2 Quantitative Results

Automatic Metrics

We evaluate our model using three machine-graded text metrics: BLEU-4 Papineni et al. (2002), ROUGE-L Lin (2004), and CIDEr-D Vedantam et al. (2015). Each generated paragraph is compared to all five reference paragraphs.

For human performance, we use a one-vs-rest scheme to hold one reference paragraph out and compute its metric using the other four. We average this score across twenty-five runs over the entire split in question.
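The one-vs-rest scheme can be made concrete with a short sketch; `metric` is a stand-in for any candidate-vs-references scorer (BLEU-4, ROUGE-L, or CIDEr-D), and the function name is our own:

```python
import random
import statistics

def one_vs_rest_human(paragraph_sets, metric, runs=25, seed=0):
    """Estimate human performance: for each image pair's reference set,
    hold one paragraph out at random and score it against the remaining
    references with metric(candidate, references). Returns the mean and
    standard deviation of the per-run averages."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(runs):
        scores = []
        for refs in paragraph_sets:
            held = rng.randrange(len(refs))
            rest = refs[:held] + refs[held + 1:]
            scores.append(metric(refs[held], rest))
        run_scores.append(statistics.mean(scores))
    return statistics.mean(run_scores), statistics.pstdev(run_scores)
```

Note the human score is computed against four references while model scores use all five, which slightly disadvantages the human estimate.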

Results using these metrics are given in Table 2 for the main baselines and model variants. We observe improvements in BLEU-4 and ROUGE-L scores compared to baselines. Curiously, we observe that the CIDEr-D metric is susceptible to common patterns in the data: our model, when stopped at its highest CIDEr-D score, outputs a variant of “these animals appear exactly the same” for 95% of paragraphs, nearly mimicking the behavior of the most frequent paragraph (Freq.) baseline. The corpus-level behavior of CIDEr-D gives these outputs a higher score. We anecdotally observed that higher-quality outputs correlated with ROUGE-L score, which we verify using a human evaluation (paragraph after next).

Ablations and Model Variants

We ablate and vary each of the main model components, running the automatic metrics to study coarse changes in the model’s behavior. Results for these experiments are given in Table 3. For the joint encoding, we try combinations of four element-wise operations with and without both encoded images. To study the comparative module in greater detail, we examine its effect on the three top-performing joint encodings. After fixing the best joint encoding and comparative module, we also try variations of the decoder (Transformer depth), as well as decoding algorithms (greedy decoding, multinomial sampling, and beamsearch).

Overall, we see that the choice of joint encoding must be balanced against the choice of comparative module. More disruptive joint encodings (like element-wise multiplication) appear too destructive when passed directly to a decoder, but yield the best performance when paired with a deep comparative module. Others (like subtraction) function moderately well on their own, and are further improved when a comparative module is introduced.

Joint Encoding   Comparative Module   Decoder   BLEU-4   ROUGE-L   CIDEr-D
Beamsearch 0.23 0.44 0.23
0.23 0.45 0.27
0.24 0.43 0.28
0.23 0.43 0.24
0.24 0.46 0.28
0.22 0.44 0.22
0.22 0.42 0.25
0.21 0.42 0.22
0.22 0.43 0.23
0.21 0.43 0.20
Beamsearch 0.00 0.02 0.00
1-L Transformer 0.24 0.44 0.27
3-L Transformer 0.24 0.44 0.27
6-L Transformer 0.24 0.46 0.28
Beamsearch 0.22 0.40 0.22
1-L Transformer 0.21 0.41 0.26
3-L Transformer 0.22 0.41 0.22
6-L Transformer 0.23 0.45 0.27
Beamsearch 0.09 0.27 0.09
1-L Transformer 0.24 0.43 0.24
3-L Transformer 0.22 0.42 0.26
6-L Transformer 0.22 0.44 0.22
1-L Transformer Beamsearch 0.22 0.42 0.25
3-L Transformer 0.23 0.42 0.25
6-L Transformer 0.24 0.46 0.28
Greedy 0.21 0.44 0.18
Multinomial 0.20 0.42 0.16
Beamsearch 0.24 0.46 0.28
Table 3: Variants and ablations for the Neural Naturalist model. We find the best performing combination is an element-wise multiplication for the joint encoding, a 6-layer Transformer comparative module, a 6-layer Transformer decoder, and beamsearch for inference.
Human Evaluation

To verify our observations about model quality and automatic metrics, we also perform a human evaluation of the generated paragraphs. We sample 120 instances from the test set, taking twenty each from the six categories for choosing comparative images (visual similarity in embedding space, plus five taxonomic distances). We provide annotators with the two images in a random order, along with the output from the model at hand. Annotators must decide which image contains Animal 1, and which contains Animal 2, or they may say that there is no way to tell (e.g., for a description like “both look exactly the same”).

We collect three annotations per datum, and score a decision only if 2/3 annotators made that choice. A model receives +1 point if annotators decide correctly, 0 if they cannot decide or agree there is no way to tell, and -1 point if they decide incorrectly (label the images backwards). This scheme penalizes a model for confidently writing incorrect descriptions. The total score is then normalized to the range [-1, 1]. Note that Human uses one of the five gold paragraphs sampled at random.
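The scoring rule above can be sketched directly; the function name and the vote labels are our own encoding of the scheme described:

```python
def eval_score(decisions):
    """Score one model from annotator votes. `decisions` holds, per test
    instance, the three annotators' choices among "correct", "incorrect",
    and "undecidable". A choice counts only with 2-of-3 agreement:
    +1 for correct, -1 for incorrect, 0 otherwise; the total is then
    divided by the number of instances, giving a score in [-1, 1]."""
    total = 0
    for votes in decisions:
        if votes.count("correct") >= 2:
            total += 1
        elif votes.count("incorrect") >= 2:
            total -= 1
    return total / len(decisions)
```

Because incorrect decisions subtract points, a model that confidently swaps the two animals scores worse than one that says nothing distinguishing at all.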

Results for this experiment are shown in Table 4. In this measure, we see the frequency and text-only baselines now fall flat, as expected. The frequency baseline never receives any points, and the text-only baseline is often penalized for incorrectly guessing. Our model is successful at making distinctions between visually distinct species (Genus column and ones further right), which is near the challenge level of current fine-grained visual classification tasks. However, it struggles on the two data subsets with highest visual similarity (Visual, Species). The significant gap between all methods and human performance in these columns indicates ultra fine-grained distinctions are still possible for humans to describe, but pose a challenge for current models to capture.

4.3 Qualitative Analysis

In Figure 5, we present several examples of the model output for pairs of images in the dev set, along with one of the five reference paragraphs. Below, we split our analysis of the model into two parts: largely positive findings, and common error cases.

Positive Findings

We find that the model exhibits dynamic granularity, by which we mean that it adjusts the magnitude of the descriptions based on the scale of differences between the two animals. If two animals are quite similar, it generates fine-grained descriptions such as, “Animal 2 has a slightly more curved beak than Animal 1,” or “Animal 1 is more iridescent than Animal 2.” If instead the two animals are very different, it will generate text describing larger-scale differences, like, “Animal 1 has a much longer neck than Animal 2,” or “Animal 1 is mostly white with a black head. Animal 2 is almost completely yellow.”

We also observe that the model is able to produce coherent paragraphs of varying linguistic structure. These include a range of comparisons set up across both single and multiple sentences. For example, it generates straightforward comparisons of the form, “Animal 1 has X, while Animal 2 has Y.” But it also generates contrastive expressions with longer dependencies, such as “Animal 1 is X, Y, and Z. Animal 2 is very similar, except W.” Furthermore, the model will mix and match different comparative structures within a single paragraph.

Finally, in addition to varying linguistic structure, we find the model is able to produce coherent semantics through a series of statements. For example, consider the following full output: “Animal 1 has a very long neck compared to Animal 2. Animal 1 has shorter legs than Animal 2. Animal 1 has a black beak, Animal 2 has a brown beak. Animal 1 has a yellow belly. Animal 2 has darker wings than Animal 1.” The range of concepts in the output covers neck, legs, beak, belly, wings without repeating any topic or getting sidetracked.

Visual Species Genus Family Order Class
Freq. 0.00 0.00 0.00 0.00 0.00 0.00
Text-Only 0.00 -0.10 -0.05 0.00 0.15 -0.15
CNN + LSTM -0.15 0.20 0.15 0.50 0.40 0.15
CNN + Attn. + LSTM 0.15 0.15 0.15 -0.05 0.05 0.20
Neural Naturalist 0.10 -0.10 0.35 0.40 0.45 0.55
Human 0.55 0.55 0.85 1.00 1.00 1.00
Table 4: Human evaluation results on 120 test set samples, twenty per column. Scale: -1 (perfectly wrong) to 1 (perfectly correct). Columns are ordered left-to-right by increasing distance. Our model outperforms baselines for several distances, though highly similar comparisons still prove difficult.

Error Analysis

We also observe several patterns in the model’s shortcomings. The most prominent error case is that the model will sometimes hallucinate differences (Figure 5, bottom row). These range from pointing out significant changes that are missing (e.g., “a black head” where there is none (Fig. 5, bottom left)), to clawing at subtle distinctions where there are none (e.g., “[its] colors are brighter …and [it] is a bit bigger” (Fig. 5, bottom right)). We suspect that the model has learned some associations between common features in animals, and will sometimes favor these associations over visual evidence.

The second common error case is missing obvious distinctions. This is observed in Fig. 5 (bottom middle), where the prominent beak of Animal 1 is ignored by the model in favor of mundane details. While outlying features make for lively descriptions, we hypothesize that the model may sometimes avoid taking them into account given its per-token cross entropy learning objective.

Finally, we also observe the model sometimes swaps which features are attributed to which animal. This is partially observed in Fig. 5 (bottom left), where the “black head” actually belongs to Animal 1, not Animal 2. We suspect that mixing up references may be a trade-off for the representational power of attending over both images; there is no explicit bookkeeping mechanism to enforce which phrases refer to which feature comparisons in each image.

5 Related Work

Employing visual comparisons to elicit focused natural language observations was proposed by Maji (2012). Zou et al. (2015) studied this tactic in the context of crowdsourcing, and Su et al. (2017) performed a large scale investigation in the aircraft domain, using reference games to evoke attribute phrases. We take inspiration from these works.

Previous work has collected natural language captions of bird photographs: CUB Captions Reed et al. (2016) and CUB-Justify Vedantam et al. (2017) are both language annotations on top of the CUB-2011 dataset of bird photographs Wah et al. (2011). In addition to describing two photos instead of one, the language in our dataset is more complex by comparison, containing a diversity of comparative structures and implied semantics. We also collect our data without an anatomical guide for annotators, yielding everyday language in place of scientific terminology.

Conceptually, our paper offers a complementary approach to works that generate single-image, class or image-discriminative captions Hendricks et al. (2016); Vedantam et al. (2017). Rather than discriminative captioning, we focus on comparative language as a means for bridging the gap between varying granularities of visual diversity.

Methodologically, our work is most closely related to the Spot-the-diff dataset Jhamtani and Berg-Kirkpatrick (2018) and other recent work on change captioning Park et al. (2019); Tan and Bansal (2019). While change captioning compares two images with few changing pixels (e.g., surveillance footage), we consider image pairs with no pixel overlap, motivating our stratified sampling approach to select comparisons.

Finally, the recently released NLVR dataset Suhr et al. (2018) introduces a challenging natural language reasoning task using two images as context. Our work instead focuses on generating comparative language rather than reasoning.

6 Conclusion

We present the new Birds-to-Words dataset and Neural Naturalist model for generating comparative explanations of fine-grained visual differences. This dataset features paragraph-length, adaptively detailed descriptions written in everyday language. We hope that continued study of this area will produce models that can aid humans in critical domains like citizen science.


  • J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei (2009) ImageNet: a large-scale hierarchical image database. Cited by: §2.3, §2.
  • R. Guo, S. Kumar, K. Choromanski, and D. Simcha (2016) Quantization based fast inner product search. In Artificial Intelligence and Statistics, pp. 482–490. Cited by: §2.3.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §3.1.
  • L. A. Hendricks, Z. Akata, M. Rohrbach, J. Donahue, B. Schiele, and T. Darrell (2016) Generating visual explanations. In European Conference on Computer Vision, pp. 3–19. Cited by: §1, §5.
  • H. Jhamtani and T. Berg-Kirkpatrick (2018) Learning to describe differences between pairs of similar images. arXiv preprint arXiv:1808.10584. Cited by: §1, §5.
  • A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei (2011) Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO. Cited by: footnote 4.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013a) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: §2.
  • J. Krause, M. Stark, J. Deng, and L. Fei-Fei (2013b) 3D object representations for fine-grained categorization. In 4th International IEEE Workshop on 3D Representation and Recognition (3dRR-13), Sydney, Australia. Cited by: footnote 4.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §4.2.
  • S. Maji, E. Rahtu, J. Kannala, M. Blaschko, and A. Vedaldi (2013) Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151. Cited by: footnote 4.
  • S. Maji (2012) Discovering a lexicon of parts and attributes. In European Conference on Computer Vision, pp. 21–30. Cited by: §5.
  • T. Mikolov, M. Karafiát, L. Burget, J. Černockỳ, and S. Khudanpur (2010) Recurrent neural network based language model. In Eleventh annual conference of the international speech communication association, Cited by: §3.
  • E. Murphy-Chutorian and M. M. Trivedi (2009) Head pose estimation in computer vision: a survey. IEEE transactions on pattern analysis and machine intelligence 31 (4), pp. 607–626. Cited by: §2.
  • K. Papineni, S. Roukos, T. Ward, and W. Zhu (2002) BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting on association for computational linguistics, pp. 311–318. Cited by: §4.2.
  • D. H. Park, T. Darrell, and A. Rohrbach (2019) Robust change captioning. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4624–4633. Cited by: §5.
  • S. Reed, Z. Akata, H. Lee, and B. Schiele (2016) Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 49–58. Cited by: §5.
  • P. Sharma, N. Ding, S. Goodman, and R. Soricut (2018) Conceptual captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 2556–2565. Cited by: §3.
  • J. Su, C. Wu, H. Jiang, and S. Maji (2017) Reasoning about fine-grained attribute phrases using reference games. In International Conference on Computer Vision (ICCV), Cited by: §5.
  • A. Suhr, S. Zhou, I. Zhang, H. Bai, and Y. Artzi (2018) A corpus for reasoning about natural language grounded in photographs. CoRR abs/1811.00491. External Links: Link, 1811.00491 Cited by: §5.
  • C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §2.3, §3.1.
  • H. Tan and M. Bansal (2019) Expressing visual relationships via language. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Cited by: §5.
  • G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The iNaturalist species classification and detection dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8769–8778. Cited by: footnote 4.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §3.
  • R. Vedantam, S. Bengio, K. Murphy, D. Parikh, and G. Chechik (2017) Context-aware captions from context-agnostic supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 251–260. Cited by: §1, §5, §5.
  • R. Vedantam, C. Lawrence Zitnick, and D. Parikh (2015) CIDEr: consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575. Cited by: §4.2.
  • O. Vinyals, A. Toshev, S. Bengio, and D. Erhan (2015) Show and tell: a neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3156–3164. Cited by: Table 2.
  • C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie (2011) The Caltech-UCSD Birds-200-2011 dataset. Cited by: §2, §5, footnote 4.
  • X. Wu, R. Guo, A. T. Suresh, S. Kumar, D. N. Holtmann-Rice, D. Simcha, and F. Yu (2017) Multiscale quantization for fast similarity search. In Advances in Neural Information Processing Systems, pp. 5745–5755. Cited by: §2.3.
  • K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio (2015) Show, attend and tell: neural image caption generation with visual attention. In International conference on machine learning, pp. 2048–2057. Cited by: §3, Table 2.
  • J. Y. Zou, K. Chaudhuri, and A. T. Kalai (2015) Crowdsourcing feature discovery via adaptively chosen comparisons. arXiv preprint arXiv:1504.00064. Cited by: §5.

Appendix A Algorithmic Approach to Dataset Construction

We present here an algorithmic approach to collecting a dataset of image pairs with natural language text describing their differences. The central challenge is to balance empirical desiderata—mainly, sample coverage and model relevance—with practical constraints of data quality and cost. This algorithmic approach underpins the dataset collection we outlined in the paper body.

A.1 Goals

Our goal is to collect a dataset of tuples $(i_1, i_2, t)$, where $i_1$ and $i_2$ are images, and $t$ is a textual comparison of them. We can consider each image as drawn from some domain $\mathcal{D}$, or a completely open domain of all concepts. There are several criteria we would like to balance:

  1. Coverage A dataset should sufficiently cover $\mathcal{D}$ so that generalization across the space is possible.

  2. Relevance Given the capabilities of current models to distinguish $i_1$ and $i_2$, the comparison $t$ should provide value.

  3. Comparability Each pair $(i_1, i_2)$ must have sufficient structural similarities that a human annotator can reasonably write $t$ comparing them. Pairs that are too different will yield lengthy and uninteresting descriptions without direct contrasting statements. Pairs that are too similar for human perception may yield “I can’t see any difference.” (This hints at the same sweet spot the fine-grained visual classification (FGVC) community studies, like cars Krause et al. (2013b), aircraft Maji et al. (2013), dogs Khosla et al. (2011), and birds Wah et al. (2011); Van Horn et al. (2018).)

  4. Efficiency Image judgments and textual annotations require human labor. With a fixed budget, we would like to yield a dataset of the largest size possible.

We describe sampling algorithms for addressing these issues given the choice of a domain.

A.2 Pivot-Branch Sampling

Drawing a single image from domain $\mathcal{D}$, there is a chance that it is ill-suited for comparisons. For example, it might be out-of-focus or contain multiple instances.

If a pair of images is drawn blind, and each image has probability $p$ of being discarded, then $\frac{1}{(1-p)^2}$ times more pairs must be selected and annotated. For example, if $p = \frac{1}{3}$, then the annotation cost is scaled by 2.25. This severely impacts annotation efficiency.

To combat this, we employ a stratified sampling strategy we call pivot-branch sampling. Each image on one side of the comparison (say, $i_1$) is vetted individually, and images on the other side (say, $i_2$) are sampled to produce pairs. With $n$-times fewer images to vet, it is feasible to check each instance for usability. This lowers the annotation cost scale to $\frac{1}{1-p}$ (e.g., with $p = \frac{1}{3}$, this is 1.5).
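The cost arithmetic above can be checked with a small sketch (not from the paper): if each image is independently unusable with probability p, blind pairing inflates annotation cost by 1/(1-p)^2, while pre-vetting one side lowers it to 1/(1-p).

```python
# Expected annotation-cost multipliers under a per-image discard probability p.

def paired_cost_multiplier(p: float) -> float:
    """Both images drawn blind: a pair survives with probability (1 - p)^2."""
    return 1.0 / (1.0 - p) ** 2

def pivot_branch_cost_multiplier(p: float) -> float:
    """Pivot images are pre-vetted, so only the branch image can be discarded."""
    return 1.0 / (1.0 - p)

p = 1 / 3
print(round(paired_cost_multiplier(p), 2))        # 2.25
print(round(pivot_branch_cost_multiplier(p), 2))  # 1.5
```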

Splitting our selection from $\mathcal{D}$ into two parts allows us to define two distinct sampling strategies. The first is a pivot sampler, which selects pivot images. The second is a branch sampler, which samples images to pair with a single pivot image.

A.3 Designing the Pivot Sampler

Selecting pivot images is important because each one will contribute to $n$ image pairs in the dataset. Here we consider the case where class labels $\mathcal{Y}$ are available for each image in the domain. We propose a pivot sampler that draws images uniformly over $\mathcal{Y}$. This strategy attempts to provide coverage over $\mathcal{D}$ using class labels as a coarse measure of diversity. It accounts for category-level dataset bias (e.g., where most images belong to only a few classes). This pushes the need to address relevance and comparability to the sampling procedure for branched images.

A.4 Designing the Branch Sampler

Given each pivot image $i_1$, we will choose $n$ images from $\mathcal{D}$ for comparison. We can make use of additional structure available on $\mathcal{D}$: a function $d(i, i')$ that measures the visual similarity between any two images, and a taxonomy over the class labels $\mathcal{Y}$.

We can partition the budget $n$ into $n_v$ visually similar images sampled using $d$, and $n_t$ taxonomically related images. A simple strategy for visually similar images is to pick

$$\operatorname{argmin}_{i \in \mathcal{D} \setminus \{i_1\}} d(i_1, i)$$

$n_v$ times without replacement. This samples the $n_v$ most visually similar images to $i_1$, excluding the image itself.
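As a minimal sketch of this strategy (the interface is an assumption, not the paper's code), repeatedly taking the argmin without replacement is equivalent to keeping the images with the smallest distances:

```python
# Sample the n_v images most similar to pivot i1 under distance function d,
# excluding i1 itself. Taking the argmin n_v times without replacement is
# the same as keeping the n_v smallest distances.

def sample_visually_similar(i1, domain, d, n_v):
    candidates = [i for i in domain if i != i1]
    return sorted(candidates, key=lambda i: d(i1, i))[:n_v]

# Toy usage with scalar "images" and absolute difference as the distance.
d = lambda a, b: abs(a - b)
print(sample_visually_similar(5, [1, 4, 5, 6, 9], d, 2))  # [4, 6]
```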

To employ taxonomic information, we propose a walk over mutually exclusive subsets of $\mathcal{Y}$. We define a function $T_k(y)$ that gives the set of other taxonomic leaves that share a common ancestor exactly $k$ taxonomic levels above $y$, and no levels lower. More formally, if we use $a_k(y, y')$ to express that $y$ and $y'$ share a parent $k$ taxonomic levels above, then we can define:

$$T_k(y) = \{\, y' \in \mathcal{Y} \setminus \{y\} : a_k(y, y') \wedge \neg\, a_{k-1}(y, y') \,\}$$

The function $T_k$ partitions the taxonomy into disjoint subtrees. For example, $T_1(y)$ is the set of sibling classes to $y$ which share its direct parent; $T_2(y)$ is the set of cousin classes to $y$ which share its grandparent, but not its parent.

We can employ $T_k$ by choosing the class $y$ of our pivot image and varying $k$. As we increase $k$, we define mutually exclusive sets of classes with greater taxonomic distance from $y$.

To sample images using this scheme, we can further split our budget $n_t$ for taxonomically sampled images into per-level budgets $n_{t_k}$. Then, if we write the set of classes $Y_k = T_k(y)$, we can sample $n_{t_k}$ images from the classes in $Y_k$. One scheme is to perform round-robin sampling: rotate through each class in $Y_k$ and sample one image from each until $n_{t_k}$ images are chosen.
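The round-robin scheme can be sketched as follows (a hypothetical implementation; `images_by_class`, mapping each class to its available images, is an assumed helper):

```python
import itertools

# Round-robin sampling of n_tk images from the classes in Y_k: rotate
# through the classes, taking one image per visit, until the budget is met.

def round_robin_sample(classes, images_by_class, n_tk):
    chosen = []
    pools = {c: list(images_by_class[c]) for c in classes}
    for c in itertools.cycle(classes):
        if len(chosen) == n_tk:
            break
        if pools[c]:                      # take one image from this class
            chosen.append(pools[c].pop(0))
        elif not any(pools.values()):     # every pool exhausted early
            break
    return chosen

print(round_robin_sample(["a", "b"], {"a": ["a1", "a2"], "b": ["b1"]}, 3))
# ['a1', 'b1', 'a2']
```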

A.5 Analyzing the Branch Sampler

Given a good visual similarity function $d$, image pairs will exhibit enough similarity to satisfy the requirement that they be semantically close enough to be comparable. They may also be so visually similar that comparison is difficult. However, this aspect counter-balances with relevance: if $d(i_1, i_2)$ is small under a visual model, but the differences between the images are describable by humans, their difference description has high value because it distinguishes two points with high similarity in visual embedding space.

The use of the taxonomy complements $d$ by providing controllable coverage over $\mathcal{D}$ while maintaining relevance and comparability. Tuning the range of $k$ values used in the taxonomic splits ensures comparability is maintained: clamping $k$ below a threshold ensures images have sufficient similarity, and controlling the proportion of the budget assigned to small values of $k$ mitigates the risk of too-similar image pairs.

Similarly, we can adjust the relevance of taxonomic sampling by controlling the distribution of the per-level budgets $n_{t_k}$ with respect to the particular structure of the taxonomy. If the taxonomy is well-balanced, then fixing a constant $n_{t_k}$ across levels will draw proportionally more samples from subtrees close to $y$. This can be seen by considering that $T_k(y)$ defines exponentially larger subsets of $\mathcal{Y}$ as $k$ increases. Drawing the same number of samples from each subset biases the collection towards relevant pairs (which should be more difficult to distinguish) while maintaining sparse coverage over the entirety of $\mathcal{D}$.

Appendix B Details for Constructing Birds-to-Words Dataset

We provide here additional details for constructing the Birds-to-Words dataset. This is meant to link the high-level overview in Section 2 with the algorithmic approach presented in the previous section (Appendix A).

B.1 Clarity

To build a dataset emphasizing fine-grained comparisons between two animals, we impose stricter restrictions on the images than iNaturalist research-grade observations (photographs). An iNaturalist observation that is research-grade indicates the community has reached consensus on the animal’s species, that the photo was taken in the wild, and several other qualifications (the full research-grade specification is documented on the iNaturalist website). We include four additional criteria that we define together as clarity:

  1. Single instance: A photo must include only a single instance of the target species. Bird photography often includes flocks in trees, in the air, or on land. In addition, some birds appear in male/female pairs. For our dataset, all of those photos must be discarded.

  2. Animal: A photo must include the animal itself, rather than a record of it (e.g., tracks).

  3. Focus: A photo must be sufficiently in-focus to describe the animal in detail.

  4. Visibility: The animal in the photo must not be too obscured by the environment, and must take up enough pixels in the photo to be clearly described.

B.2 Pivot Images

To pick pivot images, we first uniformly sample from the set of 9k species in the taxonomic class Aves in iNaturalist. We consider only species with at least four recorded observations to promote the likelihood that at least one image is clear. We also perform look-ahead branch sampling to ensure that a species will yield sufficient comparisons taxonomically. For each species, we manually review four images sampled from this species to select the clearest image to use as the pivot image. If none are suitable, we move to the next species. With this manual process, we select 405 species and corresponding photographs to use as pivot images.

B.3 Branching Images

See Section 2.3 for the description of selecting visually similar branching images using a visual similarity function $d$. We highlight here the use of the taxonomy to select branching images with varying levels of taxonomic distance.

For the class $y$ corresponding to pivot image $i_1$, we split the taxonomic tree into disjoint subtrees rooted $k$ taxonomic levels above $y$. Each higher level excludes the levels beneath it. For example, at $k = 0$ we consider all images of the same species as $i_1$; at $k = 1$, we consider all images of the same genus as $i_1$, but that have a different species. We set each $n_{t_k}$ to the same fixed value across levels, for a total of 12 branching images per pivot (including the visually similar samples).

B.4 Annotations


Annotators first label whether $i_1$ and $i_2$ are clear. While we manually verified that each $i_1$ is clear, each $i_2$ must still be vetted. (Annotators would occasionally agree that a particular image was in fact unclear, upon which we removed it and all corresponding pairs from the dataset.) Starting from 405 pivot images and selecting 12 branching images for each, we annotated a total of 4,860 image pairs. After restricting images to have positive clarity judgments, we ended up with the 3,347 image pairs in our dataset, a retention rate of 68.9%.


We vet each annotator individually by manually reviewing five reference annotations from a pilot round, and perform random quality assessments during data collection. We found that manually vetting the writing quality and guideline adherence of each individual annotator was vital for ensuring high data quality.

Appendix C Model Details

For the image embedding component of our model, we use a ResNet-101 network as our CNN. We use a model pretrained on ImageNet and fix the CNN weights before starting training for our task. We also experimented with an Inception-v4 model, but found ResNet-101 to have better performance.

For both the Transformer encoder and decoder, we use stacked layers with a hidden size of 512, 8 attention heads, and dot-product self-attention. Each paragraph is clipped at 64 tokens during training (a limit chosen empirically to cover 94% of paragraphs). The text is preprocessed using standard techniques (tokenization, lowercasing), and we replace mentions referring to each image with the special tokens animal1 and animal2.
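A sketch of this preprocessing step (illustrative only; the paper does not publish its preprocessing code, and the mention lists below are assumptions):

```python
import re

# Hypothetical surface forms annotators might use to refer to each image.
MENTIONS_1 = ["animal 1", "bird 1", "the first bird"]
MENTIONS_2 = ["animal 2", "bird 2", "the second bird"]

def preprocess(paragraph: str, max_len: int = 64) -> list:
    """Lowercase, map image mentions to special tokens, tokenize, and clip."""
    text = paragraph.lower()
    for m in MENTIONS_1:
        text = text.replace(m, "animal1")
    for m in MENTIONS_2:
        text = text.replace(m, "animal2")
    # Keep special tokens whole; split remaining words and punctuation.
    tokens = re.findall(r"animal[12]|\w+|[^\w\s]", text)
    return tokens[:max_len]  # clip at 64 tokens, as during training

print(preprocess("Animal 1 has a darker beak than animal 2."))
# ['animal1', 'has', 'a', 'darker', 'beak', 'than', 'animal2', '.']
```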

For inference, we experiment with greedy decoding, multinomial sampling, and beam search. Beam search performs best, so we use it with a beam size of 5 for all reported results (except the decoding ablations, where we report each strategy).

We train with Adagrad for 700k steps using a learning rate of 0.01 and a batch size of 2048. We decay the learning rate by a factor of 0.9 after 20k steps. Gradients are clipped at a magnitude of 5.

Appendix D Image Attributions

The table below provides attributions for all photographs used in this paper.

Photograph Attribution
Fig. 1: Top and bottom left salticidude (CC BY-NC 4.0)
Fig. 1: Top right Patricia Simpson (CC BY-NC 4.0)
Fig. 1: Bottom right kalamurphyking (CC BY-NC-ND 4.0)
Fig. 2: Top left Ryan Schain
Fig. 2: Top right Anonymous eBirder
Fig. 2: Right, 2nd from top Garth McElroy/VIREO
Fig. 2: Right, 3rd from top Myron Tay
Fig. 2: Right, 4th from top Brian Kushner
Fig. 2: Bottom, left A. Shcherbakov
Fig. 2: Bottom, right prepa3tgz-11bwv518 (CC BY-NC 4.0)
Fig. 4: Top jmaley (CC0 1.0)
Fig. 4: Bottom lorospericos (CC BY-NC 4.0)
Fig. 5: Top left, left wildlife-naturalists (CC BY-NC 4.0)
Fig. 5: Top left, right Colin Barrows (CC BY-NC-SA 4.0)
Fig. 5: Top middle, left charley (CC BY-NC 4.0)
Fig. 5: Top middle, right guyincognito (CC BY-NC 4.0)
Fig. 5: Top right, left Chris van Swaay (CC BY-NC 4.0)
Fig. 5: Top right, right Jonathan Campbell (CC BY-NC 4.0)
Fig. 5: Middle left, left John Ratzlaff (CC BY-NC-ND 4.0)
Fig. 5: Middle left, right Jessica (CC BY-NC 4.0)
Fig. 5: Middle middle, left i_c_riddell (CC BY-NC 4.0)
Fig. 5: Middle middle, right Pronoy Baidya (CC BY-NC-ND 4.0)
Fig. 5: Middle right, left Nicolas Olejnik (CC BY-NC 4.0)
Fig. 5: Middle right, right Carmelo López Abad (CC BY-NC 4.0)
Fig. 5: Bottom left, left Luis Querido (CC BY-NC 4.0)
Fig. 5: Bottom left, right copper (CC BY-NC 4.0)
Fig. 5: Bottom middle, left vireolanius (CC BY-NC 4.0)
Fig. 5: Bottom middle, right Mathias D’haen (CC BY-NC 4.0)
Fig. 5: Bottom right, left tas47 (CC BY-NC 4.0)
Fig. 5: Bottom right, right Nik Borrow (CC BY-NC 4.0)